Fall Student Projects 2021: Skills Tests
---
 Task 3 -> Retrieve CellxGene Data

---
by Aashay Gondalia (aagond@iu.edu)


In this workbook, I implement a data-fetching function that scrapes the cellxgene website, then downloads
and reads the datasets from all collections in the format required by the [google doc](https://docs.google.com/document/d/1YncjOGbgKKRJw2M0fPt5bZqdCgm9EbwyTR9DoJmZtBs/edit#). 

A table is then built from the required fields of every downloaded dataset. 

For future work, this task can be parallelized, and RAM usage can be kept in check using lazy reads ([Dask](https://dask.org/)).


Installing required packages

---


 -> Scanpy

 Scanpy is used in this workbook to read the datasets, which are downloaded in the H5AD format.

In [1]:
!pip install scanpy

Collecting scanpy
  Downloading scanpy-1.8.1-py3-none-any.whl (2.0 MB)
Collecting sinfo
  Downloading sinfo-0.3.4.tar.gz (24 kB)
Collecting anndata>=0.7.4
  Downloading anndata-0.7.6-py3-none-any.whl (127 kB)
Collecting umap-learn>=0.3.10
  Downloading umap-learn-0.5.1.tar.gz (80 kB)
Collecting pynndescent>=0.5
  Downloading pynndescent-0.5.4.tar.gz (1.1 MB)
Collecting stdlib_list
  Downloading stdlib_list-0.8.0-py3-none-any.whl (63 kB)
Building wheels for collected packages: umap-learn, pynndescent, sinfo
  Building wheel for umap-learn (setup.py) ... done
  Created wheel for umap-learn: filename=umap_learn-0.5.1-py3-none-any.whl size=76564 sha256=b263f03881251944c552fdb82

1. Importing necessary packages


In [2]:
import datetime

import requests
from requests.adapters import HTTPAdapter
import json

import numpy as np
import pandas as pd
import scanpy as sc
import os

2. Function to Fetch Collection Data from the https://cellxgene.cziscience.com/ website

In [3]:
def fetchCollectionData():
  ## HTTP Adapter Setup
  adapter = HTTPAdapter(max_retries=3)  #Hard-coded 3 Max Retries
  https = requests.Session()
  https.mount("https://", adapter)

  ## URL Elements
  CELLXGENE_PRODUCTION_ENDPOINT = 'https://api.cellxgene.cziscience.com'
  COLLECTIONS = CELLXGENE_PRODUCTION_ENDPOINT + "/dp/v1/collections/"
  DATASETS = CELLXGENE_PRODUCTION_ENDPOINT + "/dp/v1/datasets/"

  ## Fetch collection data
  r = https.get(COLLECTIONS)
  r.raise_for_status()

  collections = sorted(r.json()['collections'], key= lambda key :key['created_at'], reverse=True)
  print('Collection Fetch Complete.')
  return collections, https, CELLXGENE_PRODUCTION_ENDPOINT, COLLECTIONS, DATASETS
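
The ordering step in `fetchCollectionData` can be tried without any network access. The sketch below is only an illustration: the payload is hypothetical (the `created_at` values are invented numeric timestamps), and `newest_first` is a helper name introduced here, not part of the workbook.

```python
# Hypothetical payload: field names follow the function above, values are made up.
sample_response = {
    "collections": [
        {"id": "a", "created_at": 1610000000.0},
        {"id": "b", "created_at": 1630000000.0},
        {"id": "c", "created_at": 1620000000.0},
    ]
}


def newest_first(collections):
    """Return the collections ordered from most to least recently created."""
    return sorted(collections, key=lambda c: c["created_at"], reverse=True)


ordered_ids = [c["id"] for c in newest_first(sample_response["collections"])]
print(ordered_ids)
```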

3. Function to filter dataset - Applied Filters : {'Disease': 'Normal', 'Species': 'Homo Sapiens'}

In [4]:
def filter_Dataset_Homo_Sapien_Normal(all_collections):
  only_normal_homo_sapiens_ids = []
  for metadata in all_collections:
    collection_cell_counter = 0
    for dataset in metadata['datasets']:
      diseases = dataset['disease']
      dataset_id = dataset['id']  # renamed to avoid shadowing the built-in id()
      for disease in diseases:
        if (str(disease['label']).lower() == 'normal' and str(dataset['organism']['label']).lower() == 'homo sapiens'):
          #Disease = disease['label']
          #Assay = dataset['assay']
          #Tissue = dataset['tissue']
          #Dataset_Name = dataset['name']
          try:
            collection_cell_counter += dataset['cell_count']
          except (KeyError, TypeError):
            # Some datasets may lack a usable cell_count; skip the tally.
            pass
          only_normal_homo_sapiens_ids.append(dataset_id)
  print('Dataset Filters Applied.')
  return only_normal_homo_sapiens_ids
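
To make the filter rule concrete, here is a minimal sketch run on a hypothetical metadata payload. The field names (`datasets`, `id`, `organism`, `disease`, `label`) mirror those used in the function above; the sample values and the helper names `is_normal_human` and `select_ids` are assumptions made for illustration. Unlike the function above, this variant records each dataset id at most once.

```python
def is_normal_human(dataset):
    """True if the dataset is Homo sapiens and lists 'normal' among its diseases."""
    organism = str(dataset["organism"]["label"]).lower()
    labels = {str(d["label"]).lower() for d in dataset["disease"]}
    return organism == "homo sapiens" and "normal" in labels


def select_ids(all_collections):
    """Collect each matching dataset id once, preserving the input order."""
    selected = []
    for collection in all_collections:
        for dataset in collection["datasets"]:
            if is_normal_human(dataset) and dataset["id"] not in selected:
                selected.append(dataset["id"])
    return selected


# Hypothetical sample data, not taken from the real API.
sample = [{"datasets": [
    {"id": "d1", "organism": {"label": "Homo sapiens"},
     "disease": [{"label": "normal"}]},
    {"id": "d2", "organism": {"label": "Mus musculus"},
     "disease": [{"label": "normal"}]},
    {"id": "d3", "organism": {"label": "Homo sapiens"},
     "disease": [{"label": "COVID-19"}, {"label": "normal"}]},
    {"id": "d4", "organism": {"label": "Homo sapiens"},
     "disease": [{"label": "COVID-19"}]},
]}]

selected_ids = select_ids(sample)
print(selected_ids)
```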

4. Initializing the output table. The dataframe column names follow those specified in the Google document. 

In [5]:
def initializeTable():
  table = pd.DataFrame({
    'Organ/Tissue Type' : [], 
    'Cell Type CL ID' : [],
    'HGNC/ENSEMBL Gene IDs' : [],
    'No. of Cells of this type' : [],
    'Disease' : [],
    'Assay' : [],
    'Tissue' : [],
    'Dataset Name' : [],
    })
  print('Table Initialization Complete.')
  return table


This function reads a dataset downloaded from the cellxgene website, extracts the required information, and enters it into the table. 

In [6]:
def enter_Details_into_Table(download_name, Disease, Assay, Tissue, Dataset_Name):
  '''try:
    table = pd.read_csv('dataTable.csv', sep='|')
    print('Table already exists -> Imported Data')
  except:'''
  table = initializeTable()

  print('Adding data to Table....')
  dataset = sc.read_h5ad(download_name)
  print('Dataset Imported in Scanpy Successfully.')
  os.remove(download_name)
  print('Removed dataset to aid program execution.')

  #print('Dataset Reading Complete')
  # 'Organ/Tissue Type', 'Cell Type CL ID', 'HGNC/ENSEMBL Gene IDs',
  # 'Cells of this type', 'Disease', 'Assay', 'Tissue', 'Dataset Name'
  
  # Gene IDs aggregated into a single field; each ID is followed by ';'.
  list_of_Genes = ''.join(gene + ';' for gene in dataset.var_names)

  initial_dataset_no_of_rows = dataset.shape[0]
  counter = 0

  # Count the cells of each type once, instead of recounting for every row.
  cell_type_counts = dataset.obs['cell_type'].value_counts()

  # Table data entry loop (.iloc makes the positional lookup explicit).
  for i in range(dataset.shape[0]):
    no_of_cells_of_same_type = int(cell_type_counts[dataset.obs['cell_type'].iloc[i]])
    table.loc[len(table.index)] = [
                              dataset.obs['tissue'].iloc[i],
                              dataset.obs['cell_type_ontology_term_id'].iloc[i],
                              list_of_Genes,
                              no_of_cells_of_same_type,
                              Disease, 
                              dataset.obs['assay'].iloc[i], 
                              Tissue, 
                              Dataset_Name
      ]
    counter += 1 
  print('Data successfully added to the Table.', '\n\t-> Added ', counter, ' rows to the table.')
  print(table)

  return table
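
The per-row count in the loop above amounts to "how many cells share this row's cell type". A minimal sketch of that idea using only the standard library is shown below. The `cells` list is a hypothetical stand-in for `dataset.obs` with illustrative values, and `build_rows` is a name introduced here.

```python
from collections import Counter

# Hypothetical per-cell annotations standing in for dataset.obs (illustrative values).
cells = [
    {"tissue": "liver", "cell_type": "endothelial cell", "cl_id": "CL:0000115"},
    {"tissue": "liver", "cell_type": "endothelial cell", "cl_id": "CL:0000115"},
    {"tissue": "vasculature", "cell_type": "pericyte", "cl_id": "CL:0000669"},
]


def build_rows(cells):
    """Return one (tissue, CL ID, cells of this type) tuple per cell."""
    # A single pass counts every cell type; each row then does a dict lookup.
    counts = Counter(cell["cell_type"] for cell in cells)
    return [
        (cell["tissue"], cell["cl_id"], counts[cell["cell_type"]])
        for cell in cells
    ]


rows = build_rows(cells)
for row in rows:
    print(row)
```

Counting once up front keeps the work proportional to the number of cells, which matters for datasets with tens of thousands of rows.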

Master Function is the main executable function. It calls all the functions defined above and saves each table in the required pipe-separated values format. 



In [7]:
table_dataholder = None
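def masterFunction():
  # Declared global so the assignment further down updates the module-level holder.
  global table_dataholder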
  collections, https, CELLXGENE_PRODUCTION_ENDPOINT, COLLECTIONS, DATASETS = fetchCollectionData()
  all_collections = []

  ## INITIAL METADATA FETCH
  for collection in collections:
    r1 = https.get(COLLECTIONS + collection['id'], timeout=5)
    collection_metadata = r1.json()
    all_collections.append(collection_metadata)
  
  ## Populating only_normal_homo_sapiens_ids list with all the filtered dataset ids.
  only_normal_homo_sapiens_ids = filter_Dataset_Homo_Sapien_Normal(all_collections)
          
  
  for collection in all_collections:
    for dataset in collection['datasets']:
      for asset in dataset['dataset_assets']:

        # Using the H5AD format for lower overhead and better compatibility.
        # Reading RDS files through the Python wrapper caused formatting
        # issues and high overhead.
        if ((asset['filetype'] == 'H5AD') and (asset['dataset_id'] in only_normal_homo_sapiens_ids)):
          DATASET_REQUEST = DATASETS + asset['dataset_id']  +"/asset/"+  asset['id']
          
          r2 = requests.post(DATASET_REQUEST)
          r2.raise_for_status()
          presigned_url = r2.json()['presigned_url']
          
          headers = {'range': 'bytes=0-0'}
          r3 = https.get(presigned_url, headers=headers)
          print('\nDataset -> ', dataset['name'], '\nURL -> ', presigned_url)
          
          if (r3.status_code == requests.codes.partial):
            download_name = dataset['name'] + '.h5ad'
            print('Dataset Download Started.')
            r3 = https.get(presigned_url, timeout=10)
            r3.raise_for_status()
            # The context manager closes the file before Scanpy reads it.
            with open(download_name, 'wb') as f:
              f.write(r3.content)
            print('Dataset Download Complete.')
            table = enter_Details_into_Table(download_name, dataset['disease'], dataset['assay'], dataset['tissue'], dataset['name'])

            ## SAVE TABLE 
            print('Saving Table....')
            # A collection can hold several datasets, so the folder may already exist.
            os.makedirs(collection['name'], exist_ok=True)
            filepath = collection['name'] + '/' + dataset['name']
            table.to_csv(filepath, sep='|')
            print('Table Saved Successfully.')
            table_dataholder = table

            ## To effectively use the Google COLAB RAM. 
            table = None
            del table
            print('Local Copy of table removed from RAM')
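
The save step builds its path directly from the collection and dataset names. The sketch below shows one possible way to make that step more robust using only the standard library. It is an assumption-laden illustration: `safe_name` and `save_pipe_separated` are hypothetical helpers, the `.csv` suffix is a choice made here, and the sample row is invented.

```python
import csv
import os
import re
import tempfile


def safe_name(name):
    """Replace characters that are awkward in file names with underscores."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")


def save_pipe_separated(rows, header, base_dir, collection_name, dataset_name):
    """Write rows to <base_dir>/<collection>/<dataset>.csv with '|' as the delimiter."""
    folder = os.path.join(base_dir, safe_name(collection_name))
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, safe_name(dataset_name) + ".csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="|")
        writer.writerow(header)
        writer.writerows(rows)
    return path


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_pipe_separated(
        [("liver", "CL:0000115", 2)],
        ("Organ/Tissue Type", "Cell Type CL ID", "No. of Cells of this type"),
        tmp, "Tabula Sapiens", "Tabula Sapiens - Endothelial",
    )
    with open(saved, encoding="utf-8") as handle:
        first_line = handle.readline().strip()
    saved_file = os.path.basename(saved)

print(saved_file)
print(first_line)
```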


Master Function currently fetches the collection data and downloads the datasets serially; both steps could be parallelized.
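
As a rough idea of what that parallelization could look like, the sketch below uses a thread pool from the standard library, which suits download-bound work. `process_dataset` is a hypothetical stand-in for "download one dataset and build its table" and only returns the length of the name; `run_parallel` and `max_workers=4` are likewise assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_dataset(name):
    """Hypothetical stand-in for downloading one dataset and building its table."""
    return len(name)


def run_parallel(dataset_names, max_workers=4):
    """Process datasets concurrently and map each name to its result or exception."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_dataset, n): n for n in dataset_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure so one bad dataset does not stop the rest.
                results[name] = exc
    return results


summary = run_parallel(["Tabula Sapiens - Endothelial", "Tabula Sapiens - Immune"])
print(summary)
```

Threads share one process, so holding several large tables in memory at once would still need to be limited, for example by keeping `max_workers` small.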



In [None]:
masterFunction()

Collection Fetch Complete.
Dataset Filters Applied.

Dataset ->  Tabula Sapiens - Endothelial 
URL ->  https://corpora-data-prod.s3.amazonaws.com/5a11f879-d1ef-458a-910c-9b0bdfca5ebf/local.h5ad?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIATLYQ5N5XVFB544VL%2F20210822%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20210822T174004Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEMn%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCIQCXNXTLtEsI0%2Btv%2Bgespit%2FxpPLYvZRTY8NPqGEzB21JAIgJOu%2FBL%2BkmnuYvyt2RQdpQRwjzUrQrQ2tV8inVuZOmGMq9AMI8v%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARABGgwyMzE0MjY4NDY1NzUiDFZcApvQUWK2oeGUqCrIA19Pp%2BbD60o9JtrOQgxMisoO8me%2BSFXOtdSZWtdCi01bZGxY9heHatuQ1MTDOmAQrVXFouG%2F2K8xqKVE0y6loxom7NeAZ49mQFwZTvD8kvfG9T9qKfJqWFow%2BINdLegJcKQ4gaRBWpgSoBeaq8lmW0GLo%2BPrCBIfaW82GgtI0Jl%2Fw1jnnGa6hlt7ANGL7IY%2Fk4g2uJQpWRVF12NbiC2D8xoOk5TNm54ytzHbmL6JszanrlndLxeJgtoKjR3PCv7cFOn1euIY1pQPN8o1NXFbvo5uN3QDhPeY388%2Bb3RVjHGhQrCaAke%2FSs9SMOS0e9S%2F7lJ

  return array(a, dtype, copy=False, order=order)


Data successfully added to the Table. 
	-> Added  32701  rows to the table.
      Organ/Tissue Type  ...                  Dataset Name
0                 liver  ...  Tabula Sapiens - Endothelial
1                 liver  ...  Tabula Sapiens - Endothelial
2                 liver  ...  Tabula Sapiens - Endothelial
3                 liver  ...  Tabula Sapiens - Endothelial
4                 liver  ...  Tabula Sapiens - Endothelial
...                 ...  ...                           ...
32696       vasculature  ...  Tabula Sapiens - Endothelial
32697       vasculature  ...  Tabula Sapiens - Endothelial
32698       vasculature  ...  Tabula Sapiens - Endothelial
32699       vasculature  ...  Tabula Sapiens - Endothelial
32700       vasculature  ...  Tabula Sapiens - Endothelial

[32701 rows x 8 columns]
Saving Table....
Table Saved Successfully.
Local Copy of table removed from RAM

Dataset ->  Tabula Sapiens - Immune 
URL ->  https://corpora-data-prod.s3.amazonaws.com/c5d88abe-f23a-45fa-a5

In [None]:
table_dataholder