<a href="https://polly.elucidata.io/manage/workspaces?action=open_polly_notebook&amp;source=github&amp;path=ElucidataInc%2Fpolly-python%2Fblob%2Fmain%2FSingleCell-polly-python.ipynb&amp;kernel=elucidata%2FSingle-cell+Downstream&amp;machine=medium" target="_parent"><img alt="Open in Polly" src="https://elucidatainc.github.io/PublicAssets/open_polly.svg"/></a>


# Welcome to Pollyglot Notebook for analysis of single cell data

This notebook helps you get started with the analysis of single-cell data on Polly.

<blockquote>When you first open the notebook, please run the code cells below.</blockquote>

For more details on how to use Notebooks on Polly, please visit [Polly Notebooks](https://docs.elucidata.io/Scaling%20compute/Polly%20Notebooks.html).

For more details on API access to your OmixAtlas, please visit [Accessing OmixAtlas using polly-python through Polly Notebooks](https://docs.elucidata.io/OmixAtlas/Polly%20Python.html).

In [None]:
# please do not modify
from IPython.display import display_html
def restartkernel() :
    display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)

## Install Polly Python

In [None]:
!sudo pip3 install polly-python --quiet # to search and download selected dataset

In [None]:
restartkernel() #Pause for a few seconds before the kernel is refreshed

In [None]:
# please do not modify
from IPython.display import HTML
HTML('''<script type="text/javascript"> Jupyter.notebook.kernel.execute("url = '" + window.location + "'", {}, {}); </script>''')

## Fetch OmixAtlas ID and Dataset ID

- **OmixAtlas ID**: Unique target repository identifier which is required for downloading datasets using **polly-python** 
- **Dataset ID**: Unique identifier for datasets on Polly which is required for downloading datasets using **polly-python** 

In [None]:
import urllib.parse as urlparse
from urllib.parse import parse_qs

parsed = urlparse.urlparse(url)
repo_vars_list = [parse_qs(parsed.query).get(query_url)[0] for query_url in ['repo_id', 'repo_name', 'dataset_id']]
repo_id=repo_vars_list[0]
repo_name=repo_vars_list[1]
dataset_id=repo_vars_list[2]
file_name = dataset_id+'.h5ad'

print(file_name)
print(repo_id)
print(repo_name)
print(dataset_id)
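
The cell above relies on the notebook URL carrying `repo_id`, `repo_name` and `dataset_id` as query parameters. Below is a minimal sketch of the same extraction on a made-up URL. The host and all three values are hypothetical, and `extract_repo_vars` is an illustrative helper, not part of polly-python. Unlike the cell above, it raises an explicit error when a parameter is missing.

```python
from urllib.parse import urlparse, parse_qs

def extract_repo_vars(notebook_url, keys=("repo_id", "repo_name", "dataset_id")):
    """Return the requested query parameters of notebook_url as a dict."""
    params = parse_qs(urlparse(notebook_url).query)
    missing = [key for key in keys if key not in params]
    if missing:
        raise KeyError(f"URL is missing query parameter(s): {', '.join(missing)}")
    # parse_qs maps each key to a list of values; keep the first one.
    return {key: params[key][0] for key in keys}

# Hypothetical URL, for illustration only.
example_url = "https://example.org/notebook?repo_id=123&repo_name=demo_atlas&dataset_id=DS0001"
repo_vars = extract_repo_vars(example_url)
example_file_name = repo_vars["dataset_id"] + ".h5ad"
print(repo_vars)
print(example_file_name)
```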

## Get Authentication Token

### Query metadata in Liver OmixAtlas

All data in Liver OmixAtlas is structured and stored in indexes that can be queried through polly-python.

Metadata fields are curated and tagged with ontologies, which simplifies finding relevant datasets  

To filter and search the metadata in any of the indexes in Liver OmixAtlas, use the following function:

**query_metadata(** *query written in SQL* **)**

The SQL queries have the following syntax:

**SELECT** *field names* **FROM** *index_name* **WHERE** *conditions*

For a list of curated fields, indexes and conditions available for querying, please visit [Data Schema](https://docs.elucidata.io/OmixAtlas/Data%20Schema.html).
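
The SELECT / FROM / WHERE pattern above can be assembled as a plain string before it is passed to `query_metadata`. The following is a minimal sketch: `build_query` is a hypothetical helper (not part of polly-python), and the index name and dataset ID are placeholders.

```python
def build_query(index_name, conditions=None, fields="*"):
    """Assemble 'SELECT fields FROM index_name [WHERE conditions]'."""
    if not isinstance(fields, str):
        # Accept a list or tuple of field names as well as a ready-made string.
        fields = ", ".join(fields)
    query = f"SELECT {fields} FROM {index_name}"
    if conditions:
        query += f" WHERE {conditions}"
    return query

# Placeholder index and dataset ID, for illustration only.
example_query = build_query("demo_atlas.datasets", "dataset_id = 'DS0001'")
print(example_query)
```

The helper only builds the string; whether a given field or index exists is determined by the Data Schema linked above.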

In [None]:
#Import packages
from polly.omixatlas import OmixAtlas
import os
import pandas as pd
from json import dumps

In [None]:
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']  # Read the authentication token from the environment
omixatlas = OmixAtlas(AUTH_TOKEN)

In [None]:
# Querying dataset
query=f"SELECT * FROM {repo_name}.datasets WHERE dataset_id = '{dataset_id}'"
results=omixatlas.query_metadata(query)
results

## Download and load the h5ad file
Single-cell datasets are stored as h5ad files. The h5ad format (`.h5ad`) is the HDF5-based file format of AnnData, which provides a scalable way of keeping track of data together with learned annotations. Please read more about the h5ad file format [here](https://anndata.readthedocs.io/en/latest/). In Python, an h5ad file can be read using [scanpy](https://scanpy.readthedocs.io/en/stable/). Note that scanpy is a Python library, so reading the file from R requires a separate package.

In [None]:
data = omixatlas.download_data(repo_id, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")
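
The cell above shells out to `wget`. Where `wget` is not available, the same step could be done with the Python standard library. This is an alternative sketch, not what the notebook does: `download_file` is a hypothetical helper, and the demonstration uses a local `file://` URL so that it runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path

def download_file(source_url, destination):
    """Stream source_url to destination and return the number of bytes written."""
    with urllib.request.urlopen(source_url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return Path(destination).stat().st_size

# Demonstration on a temporary local file, so no network access is required.
with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "source.bin"
    source.write_bytes(b"example payload")
    target = Path(tmp_dir) / "copy.bin"
    n_bytes = download_file(source.as_uri(), target)
    copied = target.read_bytes()

print(n_bytes)
```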

In [None]:
# loading the file
import scanpy as sc
data = sc.read_h5ad(file_name) 
data

In [None]:
data.obs.head()

In [None]:
data.var.head()

## QC plot

- **gene_counts**: Number of unique genes detected in each cell.
- **umi_counts**: Total number of molecules detected within a cell (correlates strongly with the number of unique genes).
- **percent_mito**: Percentage of reads that map to the mitochondrial genome.
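
To make the three metrics concrete, here is a small sketch that computes them from a toy count matrix. It assumes, as is common for human gene symbols, that mitochondrial genes are those whose names start with `MT-`, and it computes `percent_mito` as the share of a cell's counts that come from those genes. How these columns were actually derived for a given dataset on Polly is not stated here; the gene list and counts below are made up.

```python
# Toy count matrix: one row per cell, one column per gene (made-up values).
genes = ["MT-CO1", "MT-ND1", "ALB", "APOA1", "CYP3A4"]
counts = [
    [5, 5, 30, 10, 0],   # cell 0
    [0, 2, 0, 18, 0],    # cell 1
    [1, 0, 4, 3, 2],     # cell 2
]

# Assumption: mitochondrial genes are identified by the "MT-" prefix.
mito_columns = [i for i, gene in enumerate(genes) if gene.startswith("MT-")]

qc = []
for row in counts:
    umi_counts = sum(row)                               # total molecules in the cell
    gene_counts = sum(1 for value in row if value > 0)  # genes with at least one count
    mito = sum(row[i] for i in mito_columns)            # counts from mitochondrial genes
    percent_mito = 100.0 * mito / umi_counts if umi_counts else 0.0
    qc.append({"gene_counts": gene_counts, "umi_counts": umi_counts, "percent_mito": percent_mito})

for cell_index, metrics in enumerate(qc):
    print(cell_index, metrics)
```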

In [None]:
sc.pl.highest_expr_genes(data, n_top=20)

In [None]:
sc.pl.violin(data, ['gene_counts', 'umi_counts', 'percent_mito'],
             jitter=0.4, multi_panel=True)

## Samples and Clusters Distribution

In [None]:
if "kw_cell_type_singleR" in data.obs.columns:
    # Count cells per (cell type, cluster) pair, with one column per cluster
    sam_clus_dis = data.obs.groupby(['kw_cell_type_singleR', 'clusters'])["clusters"].count().unstack('clusters')
    ax = sam_clus_dis.plot(kind='bar', stacked=True, xlabel="Cell Types", ylabel='Number of Cells', figsize=(20, 8))

elif "clusters" in data.obs.columns:
    # No cell-type annotation available: plot the number of cells per cluster
    clus_dis = pd.DataFrame(data.obs.clusters.value_counts())
    ax = clus_dis.plot(kind='bar', stacked=True, xlabel='Clusters', ylabel='Number of Cells', figsize=(20, 8))
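
The stacked bar chart above is a cross-tabulation: for every cell type, the number of cells that fall into each cluster. Below is a dependency-free sketch of that counting step on made-up labels. The notebook itself does this with pandas `groupby`; the cell types and cluster names here are illustrative only.

```python
from collections import Counter, defaultdict

# Made-up per-cell annotations, standing in for two columns of data.obs.
cell_types = ["Hepatocyte", "Hepatocyte", "Kupffer cell", "Hepatocyte", "Kupffer cell"]
clusters = ["0", "1", "1", "0", "1"]

# Count cells for every (cell type, cluster) pair.
pair_counts = Counter(zip(cell_types, clusters))

# Reshape into {cell type: {cluster: count}}, i.e. one stacked bar per cell type.
table = defaultdict(dict)
for (cell_type, cluster), n_cells in sorted(pair_counts.items()):
    table[cell_type][cluster] = n_cells

for cell_type, per_cluster in table.items():
    print(cell_type, per_cluster)
```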


## tSNE plot

In [None]:
if "X_tsne" in data.obsm:
    if "kw_cell_type_singleR" in data.obs.columns:
        sc.pl.tsne(data, color = 'kw_cell_type_singleR')
    else:
        sc.pl.tsne(data, color = 'clusters')
else:
    print("tSNE Plot is not available")

## UMAP plot

In [None]:
if "X_umap" in data.obsm:
    if "kw_cell_type_singleR" in data.obs.columns:
        sc.pl.umap(data, color = 'kw_cell_type_singleR')
    else:
        sc.pl.umap(data, color = 'clusters')
else:
    print("UMAP plot is not available")