<a href="https://polly.elucidata.io/manage/workspaces?action=open_polly_notebook&amp;source=github&amp;path=ElucidataInc%2Fpolly-python%2Fblob%2Fmain%2FDiscover%2Fontology_recommendation_disease_tissue.ipynb&amp;kernel=elucidata%2FPython+3&amp;machine=medium" target="_parent"><img alt="Open in Polly" src="https://elucidatainc.github.io/PublicAssets/open_polly.svg"/></a>


# Ontology recommendations for disease and tissue using polly-python

Ontology recommendation functionality for disease and tissue has been added to polly-python. Within an existing SQL query, users can now call the 'recommend' function on the disease and tissue metadata columns to get recommendations.

Usage of 'recommend' function - 

recommend(field_name, search_term, key - ['match' | 'related'])

field_name -> One of: disease, tissue, curated_disease, curated_tissue, depending on whether the V1 or V2 APIs are used.

search_term -> Disease or tissue name for which recommendations are required.

key -> Can be "match" or "related"

    match - Only terms that contain an exact match of the keyword are returned.

    related - The expanded list of terms contains the matched terms together with the synonyms and hypernyms of the keyword, as per the MeSH ontology.
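
To make the difference between the two keys concrete, here is a toy sketch in plain Python. It is not how Polly implements 'recommend': the vocabulary and the `RELATED_TERMS` mapping are small hand-written stand-ins (assumptions) for the curated terms and for MeSH synonyms and hypernyms, and the helper names `match_terms` and `related_terms` are hypothetical.

```python
# Toy illustration of the 'match' and 'related' keys; not Polly's implementation.

# Hypothetical vocabulary of curated tissue terms.
VOCABULARY = ["breast", "mammary gland", "liver", "colon"]

# Hypothetical stand-in for the MeSH synonyms/hypernyms of a keyword.
RELATED_TERMS = {"breast": ["mammary gland"]}


def match_terms(keyword, vocabulary=VOCABULARY):
    """Return the vocabulary terms that contain the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [term for term in vocabulary if needle in term.lower()]


def related_terms(keyword, vocabulary=VOCABULARY, related=RELATED_TERMS):
    """Return the matched terms plus any related terms found in the vocabulary."""
    expanded = match_terms(keyword, vocabulary)
    for term in related.get(keyword.lower(), []):
        if term in vocabulary and term not in expanded:
            expanded.append(term)
    return expanded


print(match_terms("breast"))    # ['breast']
print(related_terms("breast"))  # ['breast', 'mammary gland']
```

In this sketch 'related' always returns at least what 'match' returns, which mirrors the description above.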

## For users querying V2 infrastructure

For 'match' query in disease - 

query = """SELECT * FROM geo.datasets WHERE CONTAINS(curated_disease, recommend('curated_disease', 'breast neoplasms', 'match'))"""

For 'related' query in tissue - 

query = """SELECT * FROM geo.datasets WHERE CONTAINS(curated_tissue, recommend( 'curated_tissue', 'liver', 'related'))"""
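
The query strings above all follow the same pattern, so they can be assembled with a small helper. The following is a minimal sketch: `recommend_clause` and `ALLOWED_KEYS` are hypothetical names introduced here, not part of polly-python, and rejecting search terms that contain a single quote is a conservative assumption rather than a documented rule.

```python
# Hypothetical helper for building a CONTAINS(..., recommend(...)) clause.
ALLOWED_KEYS = {"match", "related"}


def recommend_clause(field_name, search_term, key="match"):
    """Build one CONTAINS/recommend clause in the form shown in this notebook."""
    if key not in ALLOWED_KEYS:
        raise ValueError(f"key must be one of {sorted(ALLOWED_KEYS)}, got {key!r}")
    if "'" in search_term:
        # Assumption: refuse quotes instead of guessing the escaping rules.
        raise ValueError("search_term must not contain single quotes")
    return (
        f"CONTAINS({field_name}, "
        f"recommend('{field_name}', '{search_term}', '{key}'))"
    )


clause = recommend_clause("curated_disease", "breast neoplasms", "match")
query = f"SELECT * FROM geo.datasets WHERE {clause}"
print(query)
```

The resulting string is the same as the 'match' query for disease shown above and can be passed to `query_metadata` in the same way as the hand-written queries later in this notebook.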


In [3]:
# please do not modify
from IPython.display import display_html
def restartkernel():
    display_html("<script>Jupyter.notebook.kernel.restart()</script>", raw=True)

# Import polly-python

In [4]:
!sudo pip3 install polly-python  #Restart kernel after the cell executes.

In [None]:
restartkernel() #Pause for a few seconds before the kernel is refreshed

# Import Dependencies

In [1]:
import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas

# Auth With Token on Polly

In [3]:
POLLY_REFRESH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
omixatlas = OmixAtlas(POLLY_REFRESH_TOKEN)
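
If the environment variable is missing, the cell above fails with a bare `KeyError`. A minimal sketch of a friendlier check is shown below; `read_refresh_token` is a hypothetical helper written for this notebook, not a polly-python function.

```python
import os


def read_refresh_token(env=os.environ, name="POLLY_REFRESH_TOKEN"):
    """Return the refresh token from the environment, or raise a clear error."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {name} is not set or is empty.")
    return token


# Usage in this notebook (requires polly-python and a valid token):
# omixatlas = OmixAtlas(read_refresh_token())
```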

# SQL Queries for V2 storage infrastructure

## Previous query on V2 infrastructure
Before this feature was implemented, users would query for a given tissue and disease as shown below.

This query fetches 1388 datasets for the given disease and tissue combination.

In [4]:
sql_query = """SELECT dataset_id, curated_disease, curated_tissue FROM geo.datasets WHERE 
        CONTAINS(curated_disease,'Breast Neoplasms') AND 
        CONTAINS(curated_tissue,'breast')""" 
result = omixatlas.query_metadata(sql_query)
result

Query execution succeeded (time taken: 1.96 seconds, data scanned: 0.908 MB)
Fetched 1388 rows


Unnamed: 0,dataset_id,curated_disease,curated_tissue
0,GSE9691_GPL3921,[Breast Neoplasms],[breast]
1,GSE97221_GPL10558,[Breast Neoplasms],[breast]
2,GSE97317_GPL11154,[Breast Neoplasms],[breast]
3,GSE9734_GPL4742,"[Breast Neoplasms, Carcinoma, Pancreatic Duc...","[pancreas, breast, kidney, colon]"
4,GSE97482_GPL10332,[Breast Neoplasms],[breast]
...,...,...,...
1383,GSE9483_GPL6071,"[Neoplasms, Basal Cell, Breast Cancer, Fami...",[breast]
1384,GSE95035_GPL10558,[Breast Neoplasms],[breast]
1385,GSE95087_GPL16956,[Breast Neoplasms],[breast]
1386,GSE95304_GPL11154,[Breast Neoplasms],[breast]


## New queries after implementation of ontology recommendations
With ontology recommendations, users can now query as shown below.

The query with ontology recommendations fetches 2223 datasets for the same disease and tissue combination, about 60% more than the previous query.

In [6]:
sql_query = """SELECT dataset_id, curated_disease, curated_tissue FROM geo.datasets WHERE 
        CONTAINS(curated_disease, recommend('curated_disease', 'breast neoplasms', 'related')) AND 
        CONTAINS(curated_tissue, recommend('curated_tissue', 'breast', 'related'))""" 
result = omixatlas.query_metadata(sql_query)
result

Query execution succeeded (time taken: 2.04 seconds, data scanned: 0.908 MB)
Fetched 2223 rows


Unnamed: 0,dataset_id,curated_disease,curated_tissue
0,GSE96058_GPL11154,"[Triple Negative Breast Neoplasms, Brittle co...",[breast]
1,GSE96058_GPL18573,"[Triple Negative Breast Neoplasms, Brittle co...",[breast]
2,GSE96085_GPL15084,"[Carcinoma, Neoplasms, Second Primary, Brea...",[mammary gland]
3,GSE96520_GPL4135,"[Mammary Neoplasms, Animal, Breast Neoplasms]",[mammary gland]
4,GSE96567_GPL15084,"[Breast Neoplasms, Carcinoma, Neoplasms, Se...",[mammary gland]
...,...,...,...
2218,GSE38912_GPL11154,[Breast Neoplasms],"[colon, breast]"
2219,GSE38912_GPL15433,[Breast Neoplasms],"[colon, breast]"
2220,GSE38912_GPL3921,[Breast Neoplasms],"[colon, breast]"
2221,GSE38912_GPL9052,[Breast Neoplasms],"[breast, colon]"
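
The roughly 60% increase quoted for this query follows directly from the two row counts reported in the outputs (1388 and 2223). The variable names below are chosen only for this worked calculation.

```python
# Row counts reported by the two queries in this notebook.
previous_rows = 1388      # plain CONTAINS query
recommended_rows = 2223   # query using recommend(..., 'related')

extra_rows = recommended_rows - previous_rows
percent_increase = 100 * extra_rows / previous_rows

print(extra_rows)                  # 835
print(f"{percent_increase:.1f}%")  # 60.2%
```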


## Other query examples on V2 infrastructure

In [7]:
sql_query = """SELECT * FROM geo.datasets WHERE 
                CONTAINS(curated_disease, recommend('curated_disease', 'hepatitis', 'match'))"""
result = omixatlas.query_metadata(sql_query)
result

Query execution succeeded (time taken: 2.07 seconds, data scanned: 38.845 MB)
Fetched 554 rows


Unnamed: 0,data_matrix_available,curated_organism,src_uri,total_num_samples,year,description,curated_cell_line,data_table_name,data_table_version,platform,...,abstract,version,curated_strain,bucket,curated_tissue,dataset_source,data_type,overall_design,is_current,region
0,,[Homo sapiens],polly:data://GEO_data_lake/data/Microarray/GSE...,34.0,2018.0,Role of Humoral Immunity against Hepatitis B V...,[None],geo__gse96851_gpl570,0.0,Microarray,...,,0,[None],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,Liver samples were obtained from 4 patients wi...,True,us-west-2
1,False,[Homo sapiens],polly:data://GEO_data_lake/data/GEO_metadata/G...,54.0,2017.0,A Pharmacogenomic Landscape in Human Liver Can...,"[SK-HEP-1, CLC33, CLC26, CLC17, CLC30, HL...",,,RNAseq,...,,0,[None],discover-prod-datalake-v1,[None],GEO,Transcriptomics,RNAseq for 81 liver cancer cell models was per...,True,us-west-2
2,False,[Homo sapiens],polly:data://GEO_data_lake/data/GEO_metadata/G...,3.0,2017.0,A Pharmacogenomic Landscape in Human Liver Can...,"[SK-HEP-1, CLC49, CLC26, SNU-354, Mahlavu,...",,,RNAseq,...,,0,[None],discover-prod-datalake-v1,[None],GEO,Transcriptomics,RNAseq for 81 liver cancer cell models was per...,True,us-west-2
3,False,[Homo sapiens],polly:data://GEO_data_lake/data/GEO_metadata/G...,16.0,2018.0,A Pharmacogenomic Landscape in Human Liver Can...,"[SK-HEP-1, SNU-398, JHH-4, SNU-886, CLC43,...",,,RNAseq,...,,0,[None],discover-prod-datalake-v1,[None],GEO,Transcriptomics,RNAseq for 81 liver cancer cell models was per...,True,us-west-2
4,,[Mus musculus],polly:data://GEO_data_lake/data/RNASeq/GSE9723...,10.0,2018.0,Pyroptosis by Caspase11/4-Gasdermin-D Pathway ...,[None],geo__gse97234_gpl13112,0.0,RNASeq,...,,0,[C57BL/6],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,"9 total samples = 3 AH liver, 3 ASH liver, 3 c...",True,us-west-2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
549,,[Mus musculus],polly:data://GEO_data_lake/data/Microarray/GSE...,20.0,2019.0,Analysis of differentially expressed genes in ...,[None],geo__gse138916_gpl21163,0.0,Microarray,...,,0,[None],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,snap-frozen liver samples were obtained from g...,True,us-west-2
550,,[Mus musculus],polly:data://GEO_data_lake/data/RNASeq/GSE1389...,10.0,2019.0,MBOAT7's role in the progression of Non-alcoho...,[None],geo__gse138945_gpl13112,0.0,RNASeq,...,Recent studies have identified a genetic varia...,0,[None],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,RNAseq of liver homogenates from high fat diet...,True,us-west-2
551,,[Mus musculus],polly:data://GEO_data_lake/data/RNASeq/GSE1389...,10.0,2019.0,LPI's role in the progression of Non-alcoholic...,[None],geo__gse138946_gpl13112,0.0,RNASeq,...,Recent studies have identified a genetic varia...,0,[None],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,RNAseq of Liver homogenate +/- 18:0 Lysophosph...,True,us-west-2
552,False,[Homo sapiens],polly:data://GEO_data_lake/data/GEO_metadata/G...,3.0,2019.0,Rimonabant suppresses RNA transcription of hep...,[Hep-G2],,,RNAseq,...,,0,[None],discover-prod-datalake-v1,[None],GEO,Transcriptomics,Transcriptome analysis of PHH treated with DMS...,True,us-west-2


In [19]:
sql_query = """SELECT dataset_id, curated_tissue FROM geo.datasets WHERE 
            CONTAINS(curated_tissue, recommend('curated_tissue', 'liver', 'related'))"""
result = omixatlas.query_metadata(sql_query)
result

Query execution succeeded (time taken: 3.07 seconds, data scanned: 0.574 MB)
Fetched 7216 rows


Unnamed: 0,dataset_id,curated_tissue
0,GSE9581_GPL6119,"[brain, liver, testis, heart]"
1,GSE9581_GPL6120,"[brain, liver, testis, heart]"
2,GSE9588_GPL4372,[liver]
3,GSE96059_GPL17021,[liver]
4,GSE96093_GPL17021,[liver]
...,...,...
7211,GSE75277_GPL1261,[liver]
7212,GSE75285_GPL16298,"[liver, blood, blastema]"
7213,GSE75285_GPL570,"[blood, liver, blastema]"
7214,GSE75285_GPL6801,"[blood, liver, blastema]"


In [9]:
sql_query = """SELECT dataset_id, curated_disease, curated_tissue FROM geo.datasets 
WHERE (CONTAINS(curated_disease, recommend('curated_disease', 'breast neoplasms', 'related')) OR
CONTAINS(curated_disease, recommend('curated_disease', 'pancreatic neoplasms', 'related'))) AND
(CONTAINS(curated_tissue, recommend('curated_tissue', 'breast', 'related')) OR 
CONTAINS(curated_tissue, recommend('curated_tissue', 'pancreas', 'related')))"""
result = omixatlas.query_metadata(sql_query)
result

Query execution succeeded (time taken: 2.15 seconds, data scanned: 0.908 MB)
Fetched 2565 rows


Unnamed: 0,dataset_id,curated_disease,curated_tissue
0,GSE96058_GPL11154,"[Triple Negative Breast Neoplasms, Brittle co...",[breast]
1,GSE96058_GPL18573,"[Triple Negative Breast Neoplasms, Brittle co...",[breast]
2,GSE96085_GPL15084,"[Carcinoma, Neoplasms, Second Primary, Brea...",[mammary gland]
3,GSE96520_GPL4135,"[Mammary Neoplasms, Animal, Breast Neoplasms]",[mammary gland]
4,GSE96567_GPL15084,"[Breast Neoplasms, Carcinoma, Neoplasms, Se...",[mammary gland]
...,...,...,...
2560,GSE95304_GPL11154,[Breast Neoplasms],[breast]
2561,GSE95472_GPL6244,[Triple Negative Breast Neoplasms],[breast]
2562,GSE95554_GPL17117,[Breast Neoplasms],"[breast, oil secretion]"
2563,GSE95700_GPL570,[Triple Negative Breast Neoplasms],[breast]
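
Queries that combine several diseases and tissues, like the last one, become long and easy to mistype. The sketch below generates the same OR/AND structure from two lists; `recommend_clause`, `any_of`, and `build_query` are hypothetical helper names, and the table and column names are simply the ones used in this notebook.

```python
def recommend_clause(field_name, search_term, key="related"):
    """Build one CONTAINS/recommend clause in the form used above."""
    return (
        f"CONTAINS({field_name}, "
        f"recommend('{field_name}', '{search_term}', '{key}'))"
    )


def any_of(field_name, search_terms, key="related"):
    """OR together one clause per search term, wrapped in parentheses."""
    if not search_terms:
        raise ValueError(f"at least one search term is required for {field_name}")
    clauses = [recommend_clause(field_name, term, key) for term in search_terms]
    return "(" + " OR ".join(clauses) + ")"


def build_query(diseases, tissues, key="related"):
    """Select datasets matching any listed disease AND any listed tissue."""
    where = " AND ".join(
        [
            any_of("curated_disease", diseases, key),
            any_of("curated_tissue", tissues, key),
        ]
    )
    return (
        "SELECT dataset_id, curated_disease, curated_tissue "
        "FROM geo.datasets WHERE " + where
    )


sql_query = build_query(
    diseases=["breast neoplasms", "pancreatic neoplasms"],
    tissues=["breast", "pancreas"],
)
print(sql_query)
```

Apart from line breaks, the printed string has the same clauses as the hand-written query in the last cell, so it can be passed to `omixatlas.query_metadata(sql_query)` in the same way.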
