Fall Student Projects 2021: Skills Tests
---
 Task 3 -> Retrieve CellxGene Data

---
by Aashay Gondalia (aagond@iu.edu)


In this workbook, I have implemented a data-fetching function that retrieves data from the cellxgene website through its API, then downloads
and reads the datasets from all the different collections in the format required by the [google doc](https://docs.google.com/document/d/1YncjOGbgKKRJw2M0fPt5bZqdCgm9EbwyTR9DoJmZtBs/edit#). 

A table is prepared after inputting the required information from all the incoming data. 

For future work, this task can be parallelized, and RAM usage can be kept under control using lazy reads ([Dask](https://dask.org/)).


Installing required packages

---


 -> Scanpy

 Scanpy is used in this workbook to read the H5AD-format datasets.

In [1]:
!pip install scanpy



1. Importing necessary packages


In [2]:
import datetime

import requests
from requests.adapters import HTTPAdapter
import json

import numpy as np
import pandas as pd
import scanpy as sc
import os

2. Function to Fetch Collection Data from the https://cellxgene.cziscience.com/ website

In [3]:
def fetchCollectionData():
  ## HTTP Adapter Setup
  adapter = HTTPAdapter(max_retries=3)  #Hard-coded 3 Max Retries
  https = requests.Session()
  https.mount("https://", adapter)

  ## URL Elements
  CELLXGENE_PRODUCTION_ENDPOINT = 'https://api.cellxgene.cziscience.com'
  COLLECTIONS = CELLXGENE_PRODUCTION_ENDPOINT + "/dp/v1/collections/"
  DATASETS = CELLXGENE_PRODUCTION_ENDPOINT + "/dp/v1/datasets/"

  ## Fetch collection data
  r = https.get(COLLECTIONS)
  r.raise_for_status()

  collections = sorted(r.json()['collections'], key= lambda key :key['created_at'], reverse=True)
  print('Collection Fetch Complete.')
  return collections, https, CELLXGENE_PRODUCTION_ENDPOINT, COLLECTIONS, DATASETS

3. Function to filter dataset - Applied Filters : {'Disease': 'Normal', 'Species': 'Homo Sapiens'}

In [4]:
def filter_Dataset_Homo_Sapien_Normal(all_collections):
  # Collect the ids of datasets that are human and list 'normal' among their diseases.
  only_normal_homo_sapiens_ids = []
  for metadata in all_collections:
    for dataset in metadata['datasets']:
      dataset_id = dataset['id']  # avoid shadowing the built-in id()
      has_normal = any(str(disease['label']).lower() == 'normal' for disease in dataset['disease'])
      if has_normal and str(dataset['organism']['label']).lower() == 'homo sapiens':
        # Record each dataset id at most once.
        if dataset_id not in only_normal_homo_sapiens_ids:
          only_normal_homo_sapiens_ids.append(dataset_id)
  print('Dataset Filters Applied.')
  return only_normal_homo_sapiens_ids
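
The filter rule can be tried without any network access. The sketch below is only an illustration: the helper name `is_normal_human` and the toy metadata (ids, labels) are assumptions, shaped after the keys the function above reads (`id`, `organism`, `disease`).

```python
def is_normal_human(dataset):
    """Return True when the dataset is human and lists 'normal' among its diseases."""
    organism = str(dataset.get("organism", {}).get("label", "")).lower()
    labels = {str(d.get("label", "")).lower() for d in dataset.get("disease", [])}
    return organism == "homo sapiens" and "normal" in labels

# Hypothetical metadata with the same keys the filter reads; all values are made up.
toy_collections = [
    {"datasets": [
        {"id": "ds-1", "organism": {"label": "Homo sapiens"},
         "disease": [{"label": "normal"}, {"label": "COVID-19"}]},
        {"id": "ds-2", "organism": {"label": "Mus musculus"},
         "disease": [{"label": "normal"}]},
        {"id": "ds-3", "organism": {"label": "Homo sapiens"},
         "disease": [{"label": "lung adenocarcinoma"}]},
    ]},
]

kept_ids = [
    ds["id"]
    for collection in toy_collections
    for ds in collection["datasets"]
    if is_normal_human(ds)
]
print(kept_ids)
```

Note that a dataset listing both 'normal' and a disease label (like `ds-1` here) passes the filter, because the rule only asks whether 'normal' appears among the labels.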

4. Initializing the output table. As mentioned in the google document, the dataframe column names are set accordingly. 

In [5]:
def initializeTable():
  table = pd.DataFrame({
    'Organ/Tissue Type' : [], 
    'Cell Type CL ID' : [],
    'HGNC/ENSEMBL Gene IDs' : [],
    'No. of Cells of this type' : [],
    'Disease' : [],
    'Assay' : [],
    'Tissue' : [],
    'Dataset Name' : [],
    })
  print('Table Initialization Complete.')
  return table


Gene data is read from the X (expression) matrix: for each selected cell, the genes with non-zero values are collected.

In [6]:
def fetch_and_include_GENE_data(table, dataset, list_ex):
  print('Introducting GENE data into the table.')
  rows,cols = dataset.X.nonzero()
  #print('Rows - ', rows , len(rows))
  #print('Cols - ', cols, len(cols))

  diction = {}
  for i in list_ex:
    diction[i] = []

  #print(diction)

  for j in range(len(cols)):
    if (rows[j] in diction.keys()):
      diction[rows[j]].append(cols[j])

  gene_list = []
  #dataset_rows = len(diction.keys())
  for i in diction.keys():
    genes = ''
    for position in diction[i]:
      genes = genes + dataset.var_names[position] + ';'
    gene_list.append(genes)
  #print(len(gene_list))
  #print(gene_list)
  #for g in gene_list:
  #  print(g, '\n')
  for i in range(table.shape[0]):
    # Use .loc to assign in place; chained indexing (table[col][i]) can write to a copy.
    table.loc[i, 'HGNC/ENSEMBL Gene IDs'] = gene_list[i]
  print('Gene Data Succesfully entered.')
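
To make the non-zero lookup concrete, here is a minimal sketch on a toy matrix. Plain Python lists stand in for the AnnData `X` matrix, and the gene names, values, and `selected_cells` are made up for illustration; the row/column index lists play the role of what `X.nonzero()` returns.

```python
from collections import defaultdict

# Toy expression matrix: 3 cells (rows) x 4 genes (columns). Values are made up.
gene_names = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
X = [
    [0, 2, 0, 1],
    [0, 0, 0, 0],
    [5, 0, 3, 0],
]

# Stand-in for X.nonzero(): parallel lists of row and column indices.
rows, cols = [], []
for r, row in enumerate(X):
    for c, value in enumerate(row):
        if value != 0:
            rows.append(r)
            cols.append(c)

# Group the expressed gene names by cell, keeping only the cells of interest.
selected_cells = [0, 2]
wanted = set(selected_cells)
genes_by_cell = defaultdict(list)
for r, c in zip(rows, cols):
    if r in wanted:
        genes_by_cell[r].append(gene_names[c])

gene_strings = [";".join(genes_by_cell[cell]) for cell in selected_cells]
print(gene_strings)
```

Unlike the function above, `";".join(...)` leaves no trailing semicolon; whether that matters depends on how the table is consumed downstream.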

This function is used to read the downloaded data from the cellxgene website, fetch the required information and enter it into the table. 

In [7]:
def enter_Details_into_Table(download_name, Disease, Assay, Tissue, Dataset_Name):
  '''try:
    table = pd.read_csv('dataTable.csv', sep='|')
    print('Table already exists -> Imported Data')
  except:'''
  table = initializeTable()

  print('Adding data to Table....')
  dataset = sc.read_h5ad(download_name)
  print('Dataset Imported in Scanpy Successfully.')
  os.remove(download_name)
  print('Removed dataset to aid program execution.')

  #print('Dataset Reading Complete')
  # 'Organ/Tissue Type', 'Cell Type CL ID', 'HGNC/ENSEMBL Gene IDs',
  # 'Cells of this type', 'Disease', 'Assay', 'Tissue', 'Dataset Name'
  
  # Gene IDs Aggregation into a single field.
  table_cell_ids = set()
  table_row_ids = []
  cell_types = dataset.obs['cell_type_ontology_term_id']
  # Count the cells of each type once instead of once per row.
  cell_type_counts = cell_types.value_counts()
  for i in range(dataset.shape[0]):
    # .iloc makes the positional lookup explicit (obs is indexed by cell name, not position).
    cell_type = cell_types.iloc[i]
    if (cell_type not in table_cell_ids):
      table_row_ids.append(i)
      no_of_cells_of_same_type = int(cell_type_counts[cell_type])

      table.loc[len(table.index)] = [
                              dataset.obs['tissue'].iloc[i],
                              cell_type,
                              '_to_be_filled_',
                              no_of_cells_of_same_type,
                              Disease, 
                              dataset.obs['assay'].iloc[i], 
                              Tissue, 
                              Dataset_Name
                              ]

      table_cell_ids.add(cell_type)

  fetch_and_include_GENE_data(table, dataset, table_row_ids)
  print('Data successfully added to the Table.')
  print(table)

  return table
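
The row-building logic (one row per cell type, taken from the first cell of that type, plus a count) can be sketched with the standard library alone. The per-cell records below are illustrative stand-ins for `dataset.obs`; the ids and tissues are not taken from any real dataset.

```python
from collections import Counter

# Toy per-cell annotations; the CL ids and tissues are illustrative only.
cells = [
    {"cell_type": "CL:0000115", "tissue": "liver"},
    {"cell_type": "CL:0000115", "tissue": "lung"},
    {"cell_type": "CL:0000071", "tissue": "lung"},
    {"cell_type": "CL:0000115", "tissue": "liver"},
]

# Number of cells per cell type.
counts = Counter(cell["cell_type"] for cell in cells)

# Position of the first cell seen for each cell type (dicts keep insertion order).
first_seen = {}
for position, cell in enumerate(cells):
    first_seen.setdefault(cell["cell_type"], position)

# One summary row per cell type: (tissue of first cell, cell type, cell count).
summary = [
    (cells[pos]["tissue"], cell_type, counts[cell_type])
    for cell_type, pos in first_seen.items()
]
print(summary)
```

One consequence of this design, visible in the toy data, is that a cell type found in several tissues is reported under the tissue of its first cell only. The per-type counts still add up to the total number of cells, which is what the checker function below relies on.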

Checker function that compares the total cell count in the prepared table with the cell count listed on the website.

In [8]:
def check_total_cell_count(cell_count_total_website, table):
  print('Website Cell Count : ', cell_count_total_website)
  table_total = int(sum(table['No. of Cells of this type']))
  print('Total Cell Counts in the Table : ', table_total)
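  # A plain comparison cannot raise, so no try/except is needed; report mismatches too.
  if (table_total == cell_count_total_website):
    print('Cell count in the prepared table matches the cell count data on the website!!')
  else:
    print('Cell count in the prepared table does NOT match the cell count data on the website.')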



Master Function is the main executable function. It calls all the functions defined above and saves each table in the required pipe-separated values format. 



In [9]:
table_dataholder = None
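def masterFunction():
  # Without this declaration, the assignment further down would only create a local variable.
  global table_dataholder
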
  collections, https, CELLXGENE_PRODUCTION_ENDPOINT, COLLECTIONS, DATASETS = fetchCollectionData()
  all_collections = []

  ## INITIAL METADATA FETCH
  for collection in collections:
    r1 = https.get(COLLECTIONS + collection['id'], timeout=5)
    r1.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
    collection_metadata = r1.json()
    all_collections.append(collection_metadata)
  
  ## Populating only_normal_homo_sapiens_ids list with all the filtered dataset ids.
  only_normal_homo_sapiens_ids = filter_Dataset_Homo_Sapien_Normal(all_collections)
          
  
  for collection in all_collections:
    for dataset in collection['datasets']:
      try:
        cell_count_total_website = dataset['cell_count']
      except:
        cell_count_total_website = None

      for asset in dataset['dataset_assets']:

        # Using the H5AD format for lower overhead and better compatibility.
        # The RDS format caused issues: reading RDS files through the
        # Python wrapper carries a high overhead.
        if ((asset['filetype'] == 'H5AD') and (asset['dataset_id'] in only_normal_homo_sapiens_ids)):
          DATASET_REQUEST = DATASETS + asset['dataset_id']  +"/asset/"+  asset['id']
          
          r2 = https.post(DATASET_REQUEST, timeout=5)  # reuse the session (same adapter and connection pool)
          r2.raise_for_status()
          presigned_url = r2.json()['presigned_url']
          
          headers = {'range': 'bytes=0-0'}
          r3 = https.get(presigned_url, headers=headers)
          print('\nDataset -> ', dataset['name'], '\nURL -> ', presigned_url)
          
          if (r3.status_code == requests.codes.partial):
            download_name = dataset['name'] + '.h5ad'
            print('Dataset Download Started.')
            r3 = https.get(presigned_url, timeout=10)
            r3.raise_for_status()
            with open(download_name, 'wb') as f:  # close the file handle before Scanpy reads it
              f.write(r3.content)
            print('Dataset Download Complete.')
            table = enter_Details_into_Table(download_name, dataset['disease'], dataset['assay'], dataset['tissue'], dataset['name'])

            ## SAVE TABLE 
            print('Saving Table....')
            os.makedirs(collection['name'], exist_ok=True)  # several datasets can share one collection folder
            filepath = collection['name'] + '/' + dataset['name']
            table.to_csv(filepath, sep='|')
            print('Table Saved Successfully.')
            table_dataholder = table
            check_total_cell_count(cell_count_total_website, table)

            ## To effectively use the Google COLAB RAM. 
            table = None
            del table
            print('Local Copy of table removed from RAM')
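
The save step builds folder and file names directly from collection and dataset names, which may contain characters such as `/` or `:`. The sketch below shows one possible way to sanitize names and write pipe-separated output with the standard `csv` module. The helper `safe_name`, the `.psv` extension, and the sample row are assumptions made for illustration, not part of the original workflow.

```python
import csv
import os
import re
import tempfile

def safe_name(name):
    """Replace characters that are awkward in file names with underscores."""
    return re.sub(r"[^A-Za-z0-9._ -]", "_", name).strip()

# A single made-up row using a few of the table's column names.
rows = [
    {"Tissue": "lung", "Cell Type CL ID": "CL:0000115", "No. of Cells of this type": 42},
]

# Write into a temporary directory so the example leaves the working folder untouched.
out_dir = os.path.join(tempfile.mkdtemp(), safe_name("Demo/Collection"))
os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
out_path = os.path.join(out_dir, safe_name("Demo: dataset") + ".psv")

with open(out_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]), delimiter="|")
    writer.writeheader()
    writer.writerows(rows)

with open(out_path) as handle:
    written = handle.read().splitlines()
print(written)
```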



Master Function currently fetches collection data and downloads each dataset serially; these steps could be parallelized.
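
As a rough sketch of what parallelizing could look like, the example below uses a thread pool from the standard library. `process_dataset` is a hypothetical stand-in for "download one dataset and build its table"; it only sleeps to simulate network wait, and the dataset names are made up.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def process_dataset(name):
    """Hypothetical stand-in for 'download one dataset and build its table'."""
    time.sleep(0.05)  # simulates network wait; the real work would go here
    return name, len(name)

dataset_names = ["dataset-a", "dataset-bb", "dataset-ccc"]

# Threads suit this workload because most of the time is spent waiting on I/O.
with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map returns results in the same order as the inputs.
    results = list(pool.map(process_dataset, dataset_names))

print(results)
```

In the real workflow each worker would hold a full dataset in memory while it runs, so `max_workers` would need to be chosen with the available RAM in mind.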



In [None]:
masterFunction()

Collection Fetch Complete.
Dataset Filters Applied.

Dataset ->  Tabula Sapiens - Endothelial 
URL ->  https://corpora-data-prod.s3.amazonaws.com/5a11f879-d1ef-458a-910c-9b0bdfca5ebf/local.h5ad?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIATLYQ5N5XTAJYHK5X%2F20210823%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20210823T071432Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjENb%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCIQDFohp83KeOaO%2BRApWcOVfbeNumiFZW5r2McxMncVYSYAIgO1x2%2FfUu6lJcAmFc%2FjFoLOj4qqDZXF6WgGO9%2Bgwg4hgq9AMI%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARABGgwyMzE0MjY4NDY1NzUiDLdD1SDWLB1zJxRv9CrIA66gFdNCQf4Ya9IwRA3wwRrftrp1m%2Bt0jN1yAV0Dd%2F3qlW6EylGydxf5MS%2BZvzwOzMsY2Fzd6DBn8Q%2BqYX014JFOiY1XkVPrMGECN39R%2FRfLYvrJpLG%2FQkDrqw7sOPE8YHWtcY2dNfx0PPgFchyUpQ1mcTSaIqJQhlxt8mXZAz70kYDkopQwr79tTYOoOGhJ4b4PIUkVVt94Hfs2u%2BUuTYf92K1ZpoBy6Lx0t0%2F%2FMm4WC7Cjbpa4dw0FYu5i%2FwC5%2FwEWcE15ahDme%2BipkTese2m%2B1scF1Bc%2FN6DCtuSo1ha36nFtNittmGpFf

  return array(a, dtype, copy=False, order=order)


Introducting GENE data into the table.


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Gene Data Succesfully entered.
Data successfully added to the Table.
         Organ/Tissue Type  ...                  Dataset Name
0                    liver  ...  Tabula Sapiens - Endothelial
1                  trachea  ...  Tabula Sapiens - Endothelial
2   saliva-secreting gland  ...  Tabula Sapiens - Endothelial
3                   tongue  ...  Tabula Sapiens - Endothelial
4                   tongue  ...  Tabula Sapiens - Endothelial
5                   tongue  ...  Tabula Sapiens - Endothelial
6                      eye  ...  Tabula Sapiens - Endothelial
7                    heart  ...  Tabula Sapiens - Endothelial
8            muscle tissue  ...  Tabula Sapiens - Endothelial
9          large intestine  ...  Tabula Sapiens - Endothelial
10                    lung  ...  Tabula Sapiens - Endothelial
11                    lung  ...  Tabula Sapiens - Endothelial

[12 rows x 8 columns]
Saving Table....
Table Saved Successfully.
Website Cell Count :  32701
Total Cell Counts in the Table 