<a href="https://polly.elucidata.io/manage/workspaces?action=open_polly_notebook&source=github&path=path_place_holder&kernel=elucidata/RNA-Seq Downstream&machine=gp" target="_parent"><img src="https://elucidatainc.github.io/PublicAssets/open_polly.svg" alt="Open in Polly"/></a>


# This Pollyglot Notebook is for the analysis of transcriptomics data in the LINCS OmixAtlas

This notebook helps you get started with your analysis. You can use it as a base for any further analytical work you want to do.

<blockquote>When you first open the notebook, please run the code cells below.</blockquote>

For more details on how to use Notebooks on Polly, please visit [Polly Notebooks](https://docs.elucidata.io/Scaling%20compute/Polly%20Notebooks.html).

For more details on API access to your OmixAtlas, please visit [Accessing OmixAtlas using polly-python through Polly Notebooks](https://docs.elucidata.io/OmixAtlas/Polly%20Python.html).

In [None]:
# please do not modify
from IPython.display import display_html
def restartkernel() :
    display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)

## Install Polly Python

In [None]:
!sudo pip3 install polly-python --quiet # to search and download selected dataset

In [None]:
restartkernel() #Pause for a few seconds before the kernel is refreshed

In [None]:
# please do not modify
from IPython.display import HTML
HTML('''<script type="text/javascript"> Jupyter.notebook.kernel.execute("url = '" + window.location + "'", {}, {}); </script>''')

## Fetch OmixAtlas ID and Dataset ID

- **OmixAtlas ID**: Unique target repository identifier which is required for downloading datasets using **polly-python** 
- **Dataset ID**: Unique identifier for datasets on Polly which is required for downloading datasets using **polly-python** 

In [None]:
import urllib.parse as urlparse
from urllib.parse import parse_qs

parsed = urlparse.urlparse(url)
repo_vars_list = [parse_qs(parsed.query).get(query_url)[0] for query_url in ['repo_id', 'repo_name', 'dataset_id']]
repo_id=repo_vars_list[0]
repo_name=repo_vars_list[1]
dataset_id=repo_vars_list[2]
file_name=dataset_id +'.gct' # the downloaded file is in GCT format, so name it accordingly
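
The lookup in the cell above raises an unhelpful `TypeError` when a parameter is absent from the URL. The sketch below shows the same extraction on a made-up URL with an explicit check; `example_url` and the helper `first_value` are hypothetical names used only for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical notebook URL, for illustration only
example_url = "https://example.org/notebook?repo_id=123&repo_name=lincs&dataset_id=DS_001"

# parse_qs maps each query parameter to a list of string values
params = parse_qs(urlparse(example_url).query)


def first_value(params, key):
    """Return the first value for `key`, or fail with a readable message."""
    values = params.get(key)
    if not values:
        raise KeyError(f"'{key}' is missing from the notebook URL")
    return values[0]


repo_id = first_value(params, "repo_id")
repo_name = first_value(params, "repo_name")
dataset_id = first_value(params, "dataset_id")
print(repo_id, repo_name, dataset_id)
```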

## Get Authentication Token

### Query metadata in OmixAtlas

All data in OmixAtlas are structured and stored in indexes that can be queried through polly-python.

Metadata fields are curated and tagged with ontologies, which simplifies finding relevant datasets  

To filter and search the metadata in any of the indexes in OmixAtlas, the following function can be used:  


**query_metadata (** *query written in SQL* **)**

The SQL queries have the following syntax:

**SELECT** *field names* **FROM** *index_name* **WHERE** *conditions*

For a list of curated fields, indices and conditions available for querying, please visit [Data Schema](https://docs.elucidata.io/OmixAtlas/Data%20Schema.html)
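
As a minimal sketch of the syntax above, the helper below assembles a query string from its three parts. `build_query` is a hypothetical convenience function, not part of polly-python, and the index name `lincs.datasets` and dataset ID `DS_001` are placeholders.

```python
def build_query(index_name, fields=("*",), condition=None):
    """Assemble a SELECT ... FROM ... [WHERE ...] string (illustrative helper)."""
    query = f"SELECT {', '.join(fields)} FROM {index_name}"
    if condition:
        query += f" WHERE {condition}"
    return query


# Placeholder index name and dataset ID, for illustration only
query = build_query("lincs.datasets", condition="dataset_id = 'DS_001'")
print(query)
```

The resulting string is what would be passed to `query_metadata`.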

In [None]:
#Import packages
from polly.omixatlas import OmixAtlas
import os
import pandas as pd
from json import dumps

In [None]:
AUTH_TOKEN=(os.environ['POLLY_REFRESH_TOKEN']) # Obtain authentication tokens
omixatlas = OmixAtlas(AUTH_TOKEN)

In [None]:
# Querying dataset
query=f"SELECT * FROM {repo_name}.datasets WHERE dataset_id = '{dataset_id}'"
results=omixatlas.query_metadata(query)
results

## Download and load the .gct file
The transcriptomics dataset is stored as a GCT (.gct) file, which keeps the expression matrix together with its row and column annotations. The cell below requests a download URL for the dataset and saves the file locally; it is parsed in the next section.

In [None]:
data = omixatlas.download_data(repo_id, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")

## Work with a .gct file

A GCT file (.gct) is a tab-delimited text file that contains gene expression data. It contains both expression data and sample metadata in one file. Please read more about GCT file format [here](http://software.broadinstitute.org/software/igv/GCT). 

We store bulk transcriptomics, proteomics and metabolomics data in GCT format. A GCT file can be read in R using [cmapR](https://github.com/cmap/cmapR) and in Python using [cmapPy](https://github.com/cmap/cmapPy). Both packages are installed in this environment and can be used on the datalake files.
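
To make the file layout concrete, here is a minimal sketch that builds a tiny GCT file in memory and splits it into its three parts with pandas alone. It assumes the GCT 1.3 layout (version line, a dimensions line, then a header row, column-metadata rows and data rows); all sample IDs, gene IDs and values are synthetic. In practice cmapPy does this for you.

```python
import io

import pandas as pd

# Synthetic GCT 1.3 content: 2 genes x 3 samples, 1 row-metadata field, 2 column-metadata fields
rows = [
    ["#1.3"],
    ["2", "3", "1", "2"],
    ["id", "gene_symbol", "s1", "s2", "s3"],
    ["pert_id", "na", "BRD-1", "BRD-2", "DMSO"],
    ["cell_line", "na", "A549", "A549", "MCF7"],
    ["g1", "TP53", "1.0", "2.0", "3.0"],
    ["g2", "EGFR", "4.0", "5.0", "6.0"],
]
gct_text = "\n".join("\t".join(r) for r in rows)

# The second line holds: data rows, data columns, row-metadata fields, column-metadata fields
n_rows, n_cols, n_row_meta, n_col_meta = map(int, gct_text.splitlines()[1].split("\t"))

# Read everything below the two header lines as strings, then slice the blocks apart
table = pd.read_csv(io.StringIO(gct_text), sep="\t", skiprows=2,
                    index_col=0, dtype=str, keep_default_na=False)

col_meta = table.iloc[:n_col_meta, n_row_meta:].T            # samples x metadata fields
row_meta = table.iloc[n_col_meta:, :n_row_meta]              # genes x metadata fields
data = table.iloc[n_col_meta:, n_row_meta:].astype(float)    # genes x samples

print(data)
print(col_meta)
```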

In [None]:
import pandas as pd
import cmapPy
from cmapPy.pandasGEXpress.parse_gct import parse

gct_obj = parse(file_name) ## Parse the file to create a gct object
df_real = gct_obj.data_df ## Extract the dataframe from the gct object
col_metadata = gct_obj.col_metadata_df # Extract the column metadata from the gct object
row_metadata = gct_obj.row_metadata_df # Extract the row metadata from the gct object
df_real.head()

# Sample Distribution

In [None]:
import plotly.express as px
df = col_metadata[["pert_id","kw_curated_disease","pert_type", "pert_iname", "kw_curated_tissue", "kw_curated_cell_type",
                   "kw_curated_drug","kw_curated_genetic_mod_type","kw_curated_modified_gene", "kw_curated_cell_line" ]]
df = pd.DataFrame(df.nunique()).reset_index()
df = df.sort_values(by =[0])
df = df[(df[0]>1)]
lst = df["chd"].to_list()
print(lst)
fig = px.sunburst(col_metadata, path=lst)
fig.show()
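
The cell above keeps only metadata fields that take more than one value and orders them from fewest to most distinct values, so the sunburst goes from coarse to fine. A small sketch of that selection step on a synthetic table (the column names and values are made up):

```python
import pandas as pd

# Synthetic sample metadata, for illustration only
toy_metadata = pd.DataFrame({
    "pert_type": ["trt_cp", "trt_cp", "trt_cp", "trt_cp"],  # constant, so it is dropped
    "cell_line": ["A549", "A549", "MCF7", "MCF7"],          # 2 distinct values
    "pert_iname": ["drug_a", "drug_b", "drug_c", "drug_d"], # 4 distinct values
})

# Count distinct values per field, sort ascending, keep fields that actually vary
distinct_counts = toy_metadata.nunique().sort_values()
sunburst_path = distinct_counts[distinct_counts > 1].index.to_list()
print(sunburst_path)
```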

# PCA plot

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import plotly.express as px

list_genes = df_real.index.to_list()
transpose_data = df_real.rename(columns = col_metadata['pert_id'])
df_re = transpose_data.T.reset_index()

pca = PCA(n_components=2)
components = pca.fit_transform(df_re[list_genes])
var = pca.explained_variance_ratio_
total_var = var.sum() * 100

fig = px.scatter(components, x=0, y=1, color=df_re['cid'],title=f'Total Explained Variance: {total_var:.2f}%',
                 labels={'0': f'PC 1 ({var[0]*100:.2f}%)', '1': f'PC 2 ({var[1]*100:.2f}%)'})
fig.show()
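
For intuition about what the PCA step computes, the sketch below reproduces it on synthetic data using NumPy's SVD instead of scikit-learn. Samples are rows and genes are columns, matching the transposed matrix used above. Like scikit-learn's `PCA`, it centers each gene but does not scale it; the `StandardScaler` imported in the cell above is not applied there, so genes with large variance dominate the components.

```python
import numpy as np

# Synthetic expression matrix: 6 samples (rows) x 50 genes (columns)
rng = np.random.default_rng(0)
expression = rng.normal(size=(6, 50))

# Center each gene, then decompose; singular values come back in descending order
centered = expression - expression.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

components = (u * s)[:, :2]                  # sample coordinates on PC 1 and PC 2
explained_ratio = (s ** 2) / (s ** 2).sum()  # fraction of variance per component

print(components.shape)
print(f"PC 1: {explained_ratio[0]:.2%}, PC 2: {explained_ratio[1]:.2%}")
```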

# Heatmap

In [None]:
import seaborn as sns
import plotly.figure_factory as ff

transpose_data['var'] = transpose_data.var(axis=1)
transpose_data = transpose_data.sort_values(by =['var'], ascending = False)
transpose_data = transpose_data.drop(['var'], axis = 1)
transpose_data.columns = [f'{x}_{i}' for i, x in enumerate(transpose_data.columns, 1)]
transpose_data = transpose_data[sorted(transpose_data.columns)]
heatmap_data = transpose_data.iloc[0:20,0:len(col_metadata)]

fig = px.imshow(heatmap_data, aspect="auto")
fig.show()
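
The heatmap above shows the 20 genes with the highest variance across samples. A minimal sketch of that ranking on a synthetic matrix, keeping the top 2 genes instead of 20:

```python
import pandas as pd

# Synthetic genes x samples matrix, for illustration only
expression = pd.DataFrame(
    {"s1": [1.0, 5.0, 2.0], "s2": [1.0, 0.0, 2.5], "s3": [1.0, 10.0, 3.0]},
    index=["g_flat", "g_high", "g_mid"],
)

# Rank genes by variance across samples and keep the most variable ones
gene_variance = expression.var(axis=1).sort_values(ascending=False)
top_genes = expression.loc[gene_variance.index[:2]]
print(top_genes)
```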

## Dendrogram

#### Clustering of samples

In [None]:
fig = ff.create_dendrogram(heatmap_data.T, orientation='bottom',labels = heatmap_data.columns)
fig.update_layout(width=1200, height= len(heatmap_data)*30)
fig.show()

## Interface with Polly Workspaces
Polly Notebooks can read data from Workspaces and can write data to Workspaces through Polly CLI in a bash kernel.  

*For more details on data transfer between Notebooks and Workspaces, please visit [Accessing Workspace files in Notebook](https://docs.elucidata.io/Scaling%20compute/Polly%20Notebooks.html#accessing-workspace-files-in-notebook).*

In [None]:
## Save a file to your current workspace
# polly files copy -s <name_of_your_file> -d "polly://" -y

## List all files in your current workspace
polly files list --workspace-path "polly://" -y