<a href="https://polly.elucidata.io/manage/workspaces?action=open_polly_notebook&amp;source=github&amp;path=ElucidataInc%2Fpolly-python%2Fblob%2Fmain%2FCuration%2FCustom+Curation+on+Polly+Python.ipynb&amp;kernel=elucidata%2FPython+3.10&amp;machine=medium" target="_parent"><img alt="Open in Polly" src="https://elucidatainc.github.io/PublicAssets/open_polly.svg"/></a>


# Using the Curation Library on Polly Python3 to Custom-Curate Metadata

Check out [this notebook](https://github.com/ElucidataInc/polly-python/blob/main/Curation/Using%20the%20Curation%20Library%20on%20Polly-Python.ipynb) for basic usage of the curation functions.

## Install and import libraries, authenticate

In [1]:
# Install polly python
!sudo pip3 install polly-python --quiet



In [2]:
# Import libraries
from polly.auth import Polly
from polly.curation import Curation
from polly.omixatlas import OmixAtlas
import os
import pandas as pd
from json import dumps
import ipywidgets as widgets
import urllib.parse as urlparse
from urllib.parse import parse_qs
import cmapPy
from cmapPy.pandasGEXpress.parse_gct import parse

In [3]:
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN'] # Read the Polly refresh token from the environment
omixatlas = OmixAtlas(AUTH_TOKEN)
curate = Curation()

# Example 1 - GEO OA Dataset GSE193523_GPL13912

## Select and query a dataset

In [4]:
# Get url from OmixAtlas page in Polly GUI - filter the dataset of your choice and paste url here
url = 'https://polly.elucidata.io/manage/omixatlas/details?dataset_id=GSE193523_GPL13912&dataset_src=GEO&repo_id=9&type=gct&repo_name=geo'
parsed = urlparse.urlparse(url)
repo_vars_list = [parse_qs(parsed.query).get(query_url)[0] for query_url in ['repo_id', 'repo_name', 'dataset_id']]
repo_id=repo_vars_list[0]
repo_name=repo_vars_list[1]
dataset_id=repo_vars_list[2]
file_name=dataset_id +'.gct'
file_name

'GSE193523_GPL13912.gct'
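
The URL parsing in the cell above can also be wrapped in a small helper that fails loudly when a query parameter is missing. This is only a sketch: `parse_dataset_url` is a hypothetical name, not part of polly-python, and it relies only on the standard library.

```python
from urllib.parse import urlparse, parse_qs

def parse_dataset_url(url, keys=("repo_id", "repo_name", "dataset_id")):
    """Return the requested query parameters of an OmixAtlas dataset URL as a dict."""
    params = parse_qs(urlparse(url).query)
    missing = [key for key in keys if key not in params]
    if missing:
        raise ValueError(f"URL is missing query parameter(s): {missing}")
    # parse_qs returns a list of values per key; keep the first one
    return {key: params[key][0] for key in keys}

example_url = ("https://polly.elucidata.io/manage/omixatlas/details"
               "?dataset_id=GSE193523_GPL13912&dataset_src=GEO&repo_id=9&type=gct&repo_name=geo")
info = parse_dataset_url(example_url)
print(info["dataset_id"] + ".gct")
```

Returning a dict keeps each value attached to its name, so the result does not depend on the order of a list.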

In [5]:
# Querying dataset
query=f"SELECT * FROM {repo_name}.datasets WHERE dataset_id = '{dataset_id}'"
results=omixatlas.query_metadata(query)
results

Query execution succeeded (time taken: 2.50 seconds, data scanned: 4.857 MB)
Fetched 1 rows


Unnamed: 0,dataset_id,abstract,author,bucket,curated_cell_line,curated_cell_type,curated_disease,curated_drug,curated_gene,curated_organism,curated_strain,curated_tissue,data_matrix_available,data_table_name,data_table_version,data_type,dataset_source,description,drug_smiles,file_location,file_type,is_current,key,missing_samples,overall_design,package,platform,publication,region,src_repo,src_uri,subseries_ids,summary,timestamp_,total_num_samples,version,year
0,GSE193523_GPL13912,,,discover-prod-datalake-v1,[None],[fibroblast],[Neoplasms],[None],[RAP1A],[Mus musculus],[None],"[liver, heart, muscle, brain]",False,,,Transcriptomics,GEO,Aberrant expression and localization of the RA...,[],https://discover-prod-datalake-v1.s3-us-west-2...,gct,True,GEO_data_lake/data/GEO_metadata/GSE193523/GCT/...,[],,GEO_data_lake/data,Microarray,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,us-west-2,geo,polly:data://GEO_data_lake/data/GEO_metadata/G...,[],Short telomeres induce a DNA damage response (...,1667795109241,24,0,2022
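
Because the query text is assembled from values taken out of a URL, it may be worth checking those values before interpolating them. The sketch below assumes that repository names and dataset IDs contain only letters, digits and underscores (true for the IDs in this notebook, but an assumption in general); `build_dataset_query` is a hypothetical helper.

```python
import re

def build_dataset_query(repo_name, dataset_id):
    """Build the dataset-level query after a basic sanity check of both values."""
    for label, value in (("repo_name", repo_name), ("dataset_id", dataset_id)):
        # Assumption: valid values consist of letters, digits and underscores only
        if not re.fullmatch(r"[A-Za-z0-9_]+", value):
            raise ValueError(f"Unexpected characters in {label}: {value!r}")
    return f"SELECT * FROM {repo_name}.datasets WHERE dataset_id = '{dataset_id}'"

print(build_dataset_query("geo", "GSE193523_GPL13912"))
```

The returned string can be passed to omixatlas.query_metadata() exactly as in the cell above.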


## Download the dataset and load the .gct file

A GCT (.gct) file stores a data matrix together with its row and column annotations in a single tab-delimited text file. We store bulk transcriptomics, proteomics and metabolomics data in GCT format. A GCT file can be read in both R and Python, using [cmapR](https://github.com/cmap/cmapR) (for R) and [cmapPy](https://github.com/cmap/cmapPy) (for Python); cmapPy returns the data and annotations as pandas dataframes. Both packages are installed in this environment and can be used on the datalake files.

In [6]:
# Download the dataset from OmixAtlas
data = omixatlas.download_data(repo_id, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")

Downloaded data successfully
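
If wget is not available in your environment, the same download can be done from Python. The following is a minimal sketch using only the standard library; `download_file` is a hypothetical helper, and the signed URL is assumed to be the one returned by download_data() above.

```python
import shutil
import urllib.request

def download_file(url, dest_path):
    """Stream the response body at `url` into `dest_path` and return the path."""
    # urlopen raises an exception (for example urllib.error.HTTPError) if the request fails
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path

# Usage with the variables from the cell above:
# download_file(url, file_name)
```

Because a failed request raises an exception, no separate exit-status check is needed.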



In [7]:
# Load and read the GCT object using cmapPy
gct_obj = parse(file_name) # Parse the file to create a gct object
df_real = gct_obj.data_df # Extract the dataframe from the gct object
col_metadata = gct_obj.col_metadata_df # Extract the column metadata from the gct object
row_metadata = gct_obj.row_metadata_df # Extract the row metadata from the gct object
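
To see how the three dataframes relate, here is a toy sketch in plain pandas. The values are made up; the only thing taken from the GCT object is the layout, in which sample IDs label the columns of the data matrix and the rows of the column metadata.

```python
import pandas as pd

# Toy stand-ins for gct_obj.data_df and gct_obj.col_metadata_df (values are made up)
data_df = pd.DataFrame(
    {"GSM_A": [1.0, 2.0], "GSM_B": [3.0, 4.0]},
    index=pd.Index(["gene_1", "gene_2"], name="rid"),
)
col_metadata_df = pd.DataFrame(
    {"tissue": ["liver", "brain"]},
    index=pd.Index(["GSM_A", "GSM_B"], name="cid"),
)

# Sample IDs label the columns of the matrix and the rows of the column metadata
assert list(data_df.columns) == list(col_metadata_df.index)

# Pick the samples of one tissue from the metadata, then subset the matrix to them
liver_samples = col_metadata_df.index[col_metadata_df["tissue"] == "liver"]
liver_data = data_df[liver_samples]
print(liver_data)
```

The same pattern applies to the real objects: filter col_metadata, then use the resulting sample IDs to select columns of df_real.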

In [8]:
col_metadata

cid,title,status,submission_date,last_update_date,type,channel_count,source_name_ch1,organism_ch1,taxid_ch1,characteristics_ch1.0.tissue,characteristics_ch1.1.Sex,characteristics_ch1.2.age (months),characteristics_ch1.3.rap1 genotype,growth_protocol_ch1,molecule_ch1,extract_protocol_ch1,label_ch1,label_protocol_ch1,hyb_protocol,scan_protocol,data_processing,platform_id,contact_name,contact_laboratory,contact_department,contact_institute,contact_address,contact_city,contact_state,contact_zip/postal_code,contact_country,supplementary_file,series_id,data_row_count,kw_curated_gene,kw_curated_tissue,kw_curated_genetic_mod_type,kw_curated_strain,kw_curated_cell_line,kw_curated_cell_type,kw_curated_disease,kw_curated_drug,kw_curated_modified_gene,curated_cohort_id,curated_is_control,curated_cohort_name
GSM5813633,L1600_MUT_15mo_F,Public on Nov 04 2022,Jan 12 2022,Nov 05 2022,RNA,1,Liver_RAP1 MUT_15MF,Mus musculus,10090,Liver,F,15,RAP1 MUT,The Rap1 I312R knockin mouse was generated usi...,total RNA,RNA was extracted using Qiagen RNeasy Mini kit...,Cy3,Cyanine-3 (Cy3) labeled cRNA was prepared from...,A total of 600 ng Cy3-labeled cRNA was fragmen...,"Following posthybridization rinses, arrays wer...",Raw microarray data were log transformed to yi...,GPL13912,"Supriyo,,De",Computational Biology & Genomics Core,Laboratory of Genetics and Genomics,"NIA-IRP, NIH",251 Bayview Blvd,Baltimore,Maryland,21224,USA,NONE,"GSE193523,GSE193524",33912,RAP1A,liver,mutation,none,none,none,Normal,none,RAP1A,2,0,
GSM5813634,L1602_WT_15mo_F,Public on Nov 04 2022,Jan 12 2022,Nov 05 2022,RNA,1,Liver_RAP1 WT_15MF,Mus musculus,10090,Liver,F,15,RAP1 WT,The Rap1 I312R knockin mouse was generated usi...,total RNA,RNA was extracted using Qiagen RNeasy Mini kit...,Cy3,Cyanine-3 (Cy3) labeled cRNA was prepared from...,A total of 600 ng Cy3-labeled cRNA was fragmen...,"Following posthybridization rinses, arrays wer...",Raw microarray data were log transformed to yi...,GPL13912,"Supriyo,,De",Computational Biology & Genomics Core,Laboratory of Genetics and Genomics,"NIA-IRP, NIH",251 Bayview Blvd,Baltimore,Maryland,21224,USA,NONE,"GSE193523,GSE193524",33912,RAP1A,liver,wildtype,none,none,none,Normal,none,none,6,1,
GSM5813635,L1609_MUT_15mo_F,Public on Nov 04 2022,Jan 12 2022,Nov 05 2022,RNA,1,Liver_RAP1 MUT_15MF,Mus musculus,10090,Liver,F,15,RAP1 MUT,The Rap1 I312R knockin mouse was generated usi...,total RNA,RNA was extracted using Qiagen RNeasy Mini kit...,Cy3,Cyanine-3 (Cy3) labeled cRNA was prepared from...,A total of 600 ng Cy3-labeled cRNA was fragmen...,"Following posthybridization rinses, arrays wer...",Raw microarray data were log transformed to yi...,GPL13912,"Supriyo,,De",Computational Biology & Genomics Core,Laboratory of Genetics and Genomics,"NIA-IRP, NIH",251 Bayview Blvd,Baltimore,Maryland,21224,USA,NONE,"GSE193523,GSE193524",33912,RAP1A,liver,mutation,none,none,none,Normal,none,RAP1A,2,0,
GSM5813636,L1610_WT_15mo_F,Public on Nov 04 2022,Jan 12 2022,Nov 05 2022,RNA,1,Liver_RAP1 WT_15MF,Mus musculus,10090,Liver,F,15,RAP1 WT,The Rap1 I312R knockin mouse was generated usi...,total RNA,RNA was extracted using Qiagen RNeasy Mini kit...,Cy3,Cyanine-3 (Cy3) labeled cRNA was prepared from...,A total of 600 ng Cy3-labeled cRNA was fragmen...,"Following posthybridization rinses, arrays wer...",Raw microarray data were log transformed to yi...,GPL13912,"Supriyo,,De",Computational Biology & Genomics Core,Laboratory of Genetics and Genomics,"NIA-IRP, NIH",251 Bayview Blvd,Baltimore,Maryland,21224,USA,NONE,"GSE193523,GSE193524",33912,RAP1A,liver,wildtype,none,none,none,Normal,none,none,6,1,
GSM5813637,L1646_WT_14mo_F,Public on Nov 04 2022,Jan 12 2022,Nov 05 2022,RNA,1,Liver_RAP1 WT_14MF,Mus musculus,10090,Liver,F,14,RAP1 WT,The Rap1 I312R knockin mouse was generated usi...,total RNA,RNA was extracted using Qiagen RNeasy Mini kit...,Cy3,Cyanine-3 (Cy3) labeled cRNA was prepared from...,A total of 600 ng Cy3-labeled cRNA was fragmen...,"Following posthybridization rinses, arrays wer...",Raw microarray data were log transformed to yi...,GPL13912,"Supriyo,,De",Computational Biology & Genomics Core,Laboratory of Genetics and Genomics,"NIA-IRP, NIH",251 Bayview Blvd,Baltimore,Maryland,21224,USA,NONE,"GSE193523,GSE193524",33912,RAP1A,liver,wildtype,none,none,none,Normal,none,none,6,1,
GSM5813638,L1647_MUT_14mo_F,Public on Nov 04 2022,Jan 12 2022,Nov 05 2022,RNA,1,Liver_RAP1 MUT_14MF,Mus musculus,10090,Liver,F,14,RAP1 MUT,The Rap1 I312R knockin mouse was generated usi...,total RNA,RNA was extracted using Qiagen RNeasy Mini kit...,Cy3,Cyanine-3 (Cy3) labeled cRNA was prepared from...,A total of 600 ng Cy3-labeled cRNA was fragmen...,"Following posthybridization rinses, arrays wer...",Raw microarray data were log transformed to yi...,GPL13912,"Supriyo,,De",Computational Biology & Genomics Core,Laboratory of Genetics and Genomics,"NIA-IRP, NIH",251 Bayview Blvd,Baltimore,Maryland,21224,USA,NONE,"GSE193523,GSE193524",33912,RAP1A,liver,mutation,none,none,none,Normal,none,RAP1A,2,0,
GSM5813639,B1600_MUT_15mo_F,Public on Nov 04 2022,Jan 12 2022,Nov 05 2022,RNA,1,Brain_RAP1 MUT_15MF,Mus musculus,10090,Brain,F,15,RAP1 MUT,The Rap1 I312R knockin mouse was generated usi...,total RNA,RNA was extracted using Qiagen RNeasy Mini kit...,Cy3,Cyanine-3 (Cy3) labeled cRNA was prepared from...,A total of 600 ng Cy3-labeled cRNA was fragmen...,"Following posthybridization rinses, arrays wer...",Raw microarray data were log transformed to yi...,GPL13912,"Supriyo,,De",Computational Biology & Genomics Core,Laboratory of Genetics and Genomics,"NIA-IRP, NIH",251 Bayview Blvd,Baltimore,Maryland,21224,USA,NONE,"GSE193523,GSE193524",33912,RAP1A,brain,mutation,none,none,none,Normal,none,RAP1A,0,0,
GSM5813640,B1602_WT_15mo_F,Public on Nov 04 2022,Jan 12 2022,Nov 05 2022,RNA,1,Brain_RAP1 WT_15MF,Mus musculus,10090,Brain,F,15,RAP1 WT,The Rap1 I312R knockin mouse was generated usi...,total RNA,RNA was extracted using Qiagen RNeasy Mini kit...,Cy3,Cyanine-3 (Cy3) labeled cRNA was prepared from...,A total of 600 ng Cy3-labeled cRNA was fragmen...,"Following posthybridization rinses, arrays wer...",Raw microarray data were log transformed to yi...,GPL13912,"Supriyo,,De",Computational Biology & Genomics Core,Laboratory of Genetics and Genomics,"NIA-IRP, NIH",251 Bayview Blvd,Baltimore,Maryland,21224,USA,NONE,"GSE193523,GSE193524",33912,RAP1A,brain,wildtype,none,none,none,Normal,none,none,4,1,
GSM5813641,B1609_MUT_15mo_F,Public on Nov 04 2022,Jan 12 2022,Nov 05 2022,RNA,1,Brain_RAP1 MUT_15MF,Mus musculus,10090,Brain,F,15,RAP1 MUT,The Rap1 I312R knockin mouse was generated usi...,total RNA,RNA was extracted using Qiagen RNeasy Mini kit...,Cy3,Cyanine-3 (Cy3) labeled cRNA was prepared from...,A total of 600 ng Cy3-labeled cRNA was fragmen...,"Following posthybridization rinses, arrays wer...",Raw microarray data were log transformed to yi...,GPL13912,"Supriyo,,De",Computational Biology & Genomics Core,Laboratory of Genetics and Genomics,"NIA-IRP, NIH",251 Bayview Blvd,Baltimore,Maryland,21224,USA,NONE,"GSE193523,GSE193524",33912,RAP1A,brain,mutation,none,none,none,Normal,none,RAP1A,0,0,
GSM5813642,B1610_WT_15mo_F,Public on Nov 04 2022,Jan 12 2022,Nov 05 2022,RNA,1,Brain_RAP1 WT_15MF,Mus musculus,10090,Brain,F,15,RAP1 WT,The Rap1 I312R knockin mouse was generated usi...,total RNA,RNA was extracted using Qiagen RNeasy Mini kit...,Cy3,Cyanine-3 (Cy3) labeled cRNA was prepared from...,A total of 600 ng Cy3-labeled cRNA was fragmen...,"Following posthybridization rinses, arrays wer...",Raw microarray data were log transformed to yi...,GPL13912,"Supriyo,,De",Computational Biology & Genomics Core,Laboratory of Genetics and Genomics,"NIA-IRP, NIH",251 Bayview Blvd,Baltimore,Maryland,21224,USA,NONE,"GSE193523,GSE193524",33912,RAP1A,brain,wildtype,none,none,none,Normal,none,none,4,1,


## Curate metadata

In [9]:
# To demonstrate the curation functions, let's keep only a few uncurated columns and drop the curated and other irrelevant ones
metadata_example = col_metadata[['source_name_ch1','organism_ch1','growth_protocol_ch1']]
metadata_example

cid,source_name_ch1,organism_ch1,growth_protocol_ch1
GSM5813633,Liver_RAP1 MUT_15MF,Mus musculus,The Rap1 I312R knockin mouse was generated usi...
GSM5813634,Liver_RAP1 WT_15MF,Mus musculus,The Rap1 I312R knockin mouse was generated usi...
GSM5813635,Liver_RAP1 MUT_15MF,Mus musculus,The Rap1 I312R knockin mouse was generated usi...
GSM5813636,Liver_RAP1 WT_15MF,Mus musculus,The Rap1 I312R knockin mouse was generated usi...
GSM5813637,Liver_RAP1 WT_14MF,Mus musculus,The Rap1 I312R knockin mouse was generated usi...
GSM5813638,Liver_RAP1 MUT_14MF,Mus musculus,The Rap1 I312R knockin mouse was generated usi...
GSM5813639,Brain_RAP1 MUT_15MF,Mus musculus,The Rap1 I312R knockin mouse was generated usi...
GSM5813640,Brain_RAP1 WT_15MF,Mus musculus,The Rap1 I312R knockin mouse was generated usi...
GSM5813641,Brain_RAP1 MUT_15MF,Mus musculus,The Rap1 I312R knockin mouse was generated usi...
GSM5813642,Brain_RAP1 WT_15MF,Mus musculus,The Rap1 I312R knockin mouse was generated usi...


## Manually check which entities are present in the columns

If it is not clear which entities can be curated from which columns, we can run the recognise_entity() function on a sample value from each column to check.

In [10]:
text = metadata_example['source_name_ch1'].iloc[1] # Get text from any row (here, the second one, selected by position)
curate.recognise_entity(text)

[{'keyword': 'Liver_RAP1',
  'entity_type': 'tissue',
  'span_begin': 0,
  'span_end': 10,
  'score': 0.9780030846595764},
 {'keyword': 'RAP1 WT',
  'entity_type': 'gene',
  'span_begin': 6,
  'span_end': 12,
  'score': 0.9878277778625488}]

In [11]:
text = metadata_example['growth_protocol_ch1'].iloc[1] # Get text from any row (here, the second one, selected by position)
curate.recognise_entity(text)

[{'keyword': 'C57BL6',
  'entity_type': 'strain',
  'span_begin': 552,
  'span_end': 557,
  'score': 0.9993879795074463},
 {'keyword': 'Rap1',
  'entity_type': 'gene',
  'span_begin': 4,
  'span_end': 7,
  'score': 0.9979879856109619},
 {'keyword': 'mouse',
  'entity_type': 'species',
  'span_begin': 23,
  'span_end': 27,
  'score': 0.9839400053024292},
 {'keyword': 'mouse',
  'entity_type': 'species',
  'span_begin': 524,
  'span_end': 528,
  'score': 0.9794715642929077},
 {'keyword': 'mice',
  'entity_type': 'species',
  'span_begin': 559,
  'span_end': 562,
  'score': 0.9717391133308411}]

In [12]:
text = metadata_example['organism_ch1'].iloc[1] # Get text from any row (here, the second one, selected by position)
curate.recognise_entity(text)

[{'keyword': 'Mus musculus',
  'entity_type': 'tissue',
  'span_begin': 0,
  'span_end': 12,
  'score': 0.9990906119346619},
 {'keyword': 'Mus musculus',
  'entity_type': 'species',
  'span_begin': 0,
  'span_end': 11,
  'score': 0.9550890922546387}]
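
When a text yields several hits of the same entity type, it can help to keep only the best-scoring keyword per type. The sketch below works on the list-of-dicts structure printed above; the sample values are copied from the 'growth_protocol_ch1' result, and `best_keyword_per_type` is a hypothetical helper, not a polly-python function.

```python
def best_keyword_per_type(entities, min_score=0.0):
    """Return {entity_type: keyword}, keeping the highest-scoring hit for each type."""
    best = {}
    for entity in entities:
        if entity["score"] < min_score:
            continue  # Drop hits below the confidence threshold
        current = best.get(entity["entity_type"])
        if current is None or entity["score"] > current["score"]:
            best[entity["entity_type"]] = entity
    return {etype: hit["keyword"] for etype, hit in best.items()}

# Keywords, types and scores copied from the 'growth_protocol_ch1' output above
sample = [
    {"keyword": "C57BL6", "entity_type": "strain", "score": 0.9993879795074463},
    {"keyword": "Rap1", "entity_type": "gene", "score": 0.9979879856109619},
    {"keyword": "mouse", "entity_type": "species", "score": 0.9839400053024292},
    {"keyword": "mouse", "entity_type": "species", "score": 0.9794715642929077},
    {"keyword": "mice", "entity_type": "species", "score": 0.9717391133308411},
]
print(best_keyword_per_type(sample))
# {'strain': 'C57BL6', 'gene': 'Rap1', 'species': 'mouse'}
```

Raising `min_score` is one simple way to discard low-confidence hits before standardising them.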

## Standardise entities in the columns

  * In the case of 'organism_ch1', we can directly go ahead and normalize it for the 'species' entity, since the entity type is clear. In this case, however, the entity is already normalized.
  * For other columns, we can choose to normalize them after extracting the desired entity. For example, I want to extract 'tissue type' and 'gene' from 'source_name_ch1', and 'strain' from 'growth_protocol_ch1'.
  * I will add these standardised columns as curated columns in a new dataframe.

In [13]:
# Let's first test the output of the standardise_entity() function
curate.standardise_entity("Mus musculus" ,"species")

{'ontology': 'NCBI',
 'ontology_id': 'txid10090',
 'name': 'Mus musculus',
 'entity_type': 'species',
 'score': None,
 'synonym': None}

In [14]:
# Take the 'name' entry from the dictionary for 'curated_species'
metadata_curated = pd.DataFrame()
metadata_curated['curated_species'] = metadata_example['organism_ch1'].map(lambda kw: 
                                    curate.standardise_entity(kw,"species")["name"])
metadata_curated

cid,curated_species
GSM5813633,Mus musculus
GSM5813634,Mus musculus
GSM5813635,Mus musculus
GSM5813636,Mus musculus
GSM5813637,Mus musculus
GSM5813638,Mus musculus
GSM5813639,Mus musculus
GSM5813640,Mus musculus
GSM5813641,Mus musculus
GSM5813642,Mus musculus
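
Metadata columns usually repeat a handful of values across many samples, so one option is to standardise each unique value once and map the results back. This is a sketch under that assumption: `standardise_column` is a hypothetical helper, and `fake_standardise` stands in for a call such as curate.standardise_entity(kw, "species")["name"] so that the example runs without Polly access.

```python
import pandas as pd

def standardise_column(series, standardise_fn):
    """Call `standardise_fn` once per unique value and map the results back onto the series."""
    lookup = {value: standardise_fn(value) for value in series.dropna().unique()}
    return series.map(lookup)

calls = []  # Records how often the stand-in is called

def fake_standardise(keyword):
    # Stand-in for the real standardisation call; it only fixes capitalisation
    calls.append(keyword)
    return keyword.capitalize()

organisms = pd.Series(
    ["mus musculus", "mus musculus", "mus musculus", "homo sapiens"],
    index=["s1", "s2", "s3", "s4"],
)
curated = standardise_column(organisms, fake_standardise)
print(curated.tolist())
print(len(calls))  # Number of calls equals the number of unique values
```

With ten identical 'Mus musculus' rows, as in this dataset, this would mean one call instead of ten.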


In [15]:
metadata_curated['curated_tissue'] = metadata_example['source_name_ch1'].map(lambda kw: 
                                    curate.standardise_entity(kw.split("_")[0],"tissue")["name"])
metadata_curated['curated_gene'] = metadata_example['source_name_ch1'].map(lambda kw: 
                                    curate.standardise_entity(kw.split("_")[1].split(" ")[0],"gene")["name"])
metadata_curated

cid,curated_species,curated_tissue,curated_gene
GSM5813633,Mus musculus,liver,RAP1A
GSM5813634,Mus musculus,liver,RAP1A
GSM5813635,Mus musculus,liver,RAP1A
GSM5813636,Mus musculus,liver,RAP1A
GSM5813637,Mus musculus,liver,RAP1A
GSM5813638,Mus musculus,liver,RAP1A
GSM5813639,Mus musculus,brain,RAP1A
GSM5813640,Mus musculus,brain,RAP1A
GSM5813641,Mus musculus,brain,RAP1A
GSM5813642,Mus musculus,brain,RAP1A
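
The chained split() calls above depend on every value following the pattern seen in this dataset, for example 'Liver_RAP1 WT_15MF'. A small helper that names the pieces and checks the pattern makes that assumption explicit. `split_source_name` is a hypothetical helper, and the field names are my reading of the pattern rather than anything defined by GEO or Polly.

```python
def split_source_name(source_name):
    """Split a value such as 'Liver_RAP1 WT_15MF' into its labelled parts."""
    parts = source_name.split("_")
    if len(parts) != 3 or len(parts[1].split(" ")) != 2:
        raise ValueError(f"Unexpected source name format: {source_name!r}")
    gene, genotype = parts[1].split(" ")
    # 'suffix' is the trailing token (e.g. '15MF'); it is left uninterpreted here
    return {"tissue": parts[0], "gene": gene, "genotype": genotype, "suffix": parts[2]}

print(split_source_name("Liver_RAP1 WT_15MF"))
# {'tissue': 'Liver', 'gene': 'RAP1', 'genotype': 'WT', 'suffix': '15MF'}
```

The 'tissue' and 'gene' tokens are the ones passed to standardise_entity() in the cell above.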


## Annotate all entities from column(s) of interest

  * In this case, I will take my column(s) of interest and assign ontologies and entity types to them without additional manual intervention using the annotate_with_ontology() function.

In [16]:
# Let's first test the output of the annotate_with_ontology() function
text = metadata_example['growth_protocol_ch1'].iloc[1] # Get text from any row (here, the second one, selected by position)
curate.annotate_with_ontology(text)

[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species'),
 Tag(name='RAP1A', ontology_id='HGNC:9855', entity_type='gene')]

In [24]:
# Define a function that returns the curated names found in a piece of text
def annotate(string):
    res = curate.annotate_with_ontology(string)
    curatedvals = [tag[0] for tag in res] # Select "name" (the first field) from each Tag
    return curatedvals

metadata_curated_2 = metadata_example.apply(lambda kw: annotate(kw['growth_protocol_ch1']), axis=1, result_type='expand')
metadata_curated_2

cid,0,1
GSM5813633,Mus musculus,RAP1A
GSM5813634,Mus musculus,RAP1A
GSM5813635,Mus musculus,RAP1A
GSM5813636,Mus musculus,RAP1A
GSM5813637,Mus musculus,RAP1A
GSM5813638,Mus musculus,RAP1A
GSM5813639,Mus musculus,RAP1A
GSM5813640,Mus musculus,RAP1A
GSM5813641,Mus musculus,RAP1A
GSM5813642,Mus musculus,RAP1A
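
The expanded result above has positional column names (0, 1), which become ambiguous if different rows return different numbers of tags. One alternative is to name the columns after the entity type. The sketch below assumes the tags expose name, ontology_id and entity_type, as in the printed output; `Tag` here is a local stand-in rather than the library's class, and `tags_to_row` is a hypothetical helper.

```python
from collections import namedtuple
import pandas as pd

# Local stand-in with the same fields as the Tag objects printed above (an assumption)
Tag = namedtuple("Tag", ["name", "ontology_id", "entity_type"])

def tags_to_row(tags):
    """Group tag names by entity type, e.g. {'curated_species': 'Mus musculus'}."""
    row = {}
    for tag in tags:
        row.setdefault(f"curated_{tag.entity_type}", []).append(tag.name)
    # Join multiple hits of the same type into one cell
    return {column: "|".join(names) for column, names in row.items()}

# Made-up annotation results for two samples; the second has no gene tag
fake_annotations = {
    "GSM_A": [Tag("Mus musculus", "NCBI:txid10090", "species"),
              Tag("RAP1A", "HGNC:9855", "gene")],
    "GSM_B": [Tag("Mus musculus", "NCBI:txid10090", "species")],
}
curated = pd.DataFrame(
    [tags_to_row(tags) for tags in fake_annotations.values()],
    index=list(fake_annotations.keys()),
)
print(curated)
```

Rows with no tag of a given type simply get a missing value in that column.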


# Example 2 - GEO OA Dataset GSE165082_GPL11154

## Select and query a dataset

In [25]:
# Get url from OmixAtlas page in Polly GUI - filter the dataset of your choice and paste url here
url = 'https://polly.elucidata.io/manage/omixatlas/details?dataset_id=GSE165082_GPL11154&dataset_src=GEO&repo_id=9&type=gct&repo_name=geo'
parsed = urlparse.urlparse(url)
repo_vars_list = [parse_qs(parsed.query).get(query_url)[0] for query_url in ['repo_id', 'repo_name', 'dataset_id']]
repo_id=repo_vars_list[0]
repo_name=repo_vars_list[1]
dataset_id=repo_vars_list[2]
file_name=dataset_id +'.gct'
file_name

'GSE165082_GPL11154.gct'

In [28]:
# Querying dataset
query=f"SELECT * FROM {repo_name}.datasets WHERE dataset_id = '{dataset_id}' AND data_matrix_available='true' "
results=omixatlas.query_metadata(query)
results

Query execution succeeded (time taken: 2.65 seconds, data scanned: 4.858 MB)
Fetched 1 rows


Unnamed: 0,dataset_id,abstract,author,bucket,curated_cell_line,curated_cell_type,curated_disease,curated_drug,curated_gene,curated_organism,curated_strain,curated_tissue,data_matrix_available,data_table_name,data_table_version,data_type,dataset_source,description,drug_smiles,file_location,file_type,is_current,key,missing_samples,overall_design,package,platform,publication,region,src_repo,src_uri,subseries_ids,summary,timestamp_,total_num_samples,version,year
0,GSE165082_GPL11154,,,discover-prod-datalake-v1,[None],[None],[Parkinson Disease],[None],"[PTPRC, RNF5, VTRNA2-1, NFYA, IL18R1, VTRNA1-2...",[Homo sapiens],[None],[blood],True,geo__gse165082_gpl11154,0,Transcriptomics,GEO,DNA Methylation and Expression Profiles of Who...,[],https://discover-prod-datalake-v1.s3-us-west-2...,gct,True,GEO_data_lake/data/RNASeq/GSE165082/GCT/GSE165...,[],Parkinson's disease (PD) and control subjects ...,GEO_data_lake/data,RNASeq,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,us-west-2,geo,polly:data://GEO_data_lake/data/RNASeq/GSE1650...,[],Parkinson's disease (PD) is the second most ...,1667826430763,26,0,2021


## Download the dataset and load the .gct file

In [29]:
# Download the dataset from OmixAtlas
data = omixatlas.download_data(repo_id, dataset_id)
url = data.get('data').get('attributes').get('download_url')
status = os.system(f"wget -O '{file_name}' '{url}'")
if status == 0:
    print("Downloaded data successfully")
else:
    raise Exception("Download not successful")


Downloaded data successfully



In [30]:
# Load and read the GCT object using cmapPy
gct_obj = parse(file_name) # Parse the file to create a gct object
df_real = gct_obj.data_df # Extract the dataframe from the gct object
col_metadata = gct_obj.col_metadata_df # Extract the column metadata from the gct object
row_metadata = gct_obj.row_metadata_df # Extract the row metadata from the gct object

In [31]:
col_metadata

cid,title,status,submission_date,last_update_date,type,channel_count,source_name_ch1,organism_ch1,taxid_ch1,characteristics_ch1.0.tissue,characteristics_ch1.1.group,treatment_protocol_ch1,growth_protocol_ch1,molecule_ch1,extract_protocol_ch1,data_processing,platform_id,contact_name,contact_institute,contact_address,contact_city,contact_state,contact_zip.postal_code,contact_country,instrument_model,library_selection,library_source,library_strategy,relation,supplementary_file_1,series_id,data_row_count,kw_curated_disease,kw_curated_tissue,kw_curated_genetic_mod_type,kw_curated_strain,kw_curated_cell_line,kw_curated_cell_type,kw_curated_drug,kw_curated_gene,kw_curated_modified_gene,curated_cohort_id,curated_is_control,curated_cohort_name
GSM5025809,045_PD,Public on Mar 11 2021,Jan 19 2021,May 14 2021,SRA,1,blood,Homo sapiens,9606,whole blood,PD,,,total RNA,Peripheral blood was collected from patients u...,Illumina Casava1.7 software used for basecalli...,GPL11154,"QI,,WANG",ARIZONA STATE UNIVERSITY-TEMPE CAMPUS,PO Box 875001,TEMPE,AZ,85287,USA,Illumina HiSeq 2000,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,NONE,"GSE165082,GSE165083",0,Parkinson Disease,blood,wildtype,none,none,none,none,none,none,1,0,PD; whole blood
GSM5025810,012_PD,Public on Mar 11 2021,Jan 19 2021,May 14 2021,SRA,1,blood,Homo sapiens,9606,whole blood,PD,,,total RNA,Peripheral blood was collected from patients u...,Illumina Casava1.7 software used for basecalli...,GPL11154,"QI,,WANG",ARIZONA STATE UNIVERSITY-TEMPE CAMPUS,PO Box 875001,TEMPE,AZ,85287,USA,Illumina HiSeq 2000,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,NONE,"GSE165082,GSE165083",0,Parkinson Disease,blood,wildtype,none,none,none,none,none,none,1,0,PD; whole blood
GSM5025811,015_PD,Public on Mar 11 2021,Jan 19 2021,May 14 2021,SRA,1,blood,Homo sapiens,9606,whole blood,PD,,,total RNA,Peripheral blood was collected from patients u...,Illumina Casava1.7 software used for basecalli...,GPL11154,"QI,,WANG",ARIZONA STATE UNIVERSITY-TEMPE CAMPUS,PO Box 875001,TEMPE,AZ,85287,USA,Illumina HiSeq 2000,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,NONE,"GSE165082,GSE165083",0,Parkinson Disease,blood,wildtype,none,none,none,none,none,none,1,0,PD; whole blood
GSM5025812,024_PD,Public on Mar 11 2021,Jan 19 2021,May 14 2021,SRA,1,blood,Homo sapiens,9606,whole blood,PD,,,total RNA,Peripheral blood was collected from patients u...,Illumina Casava1.7 software used for basecalli...,GPL11154,"QI,,WANG",ARIZONA STATE UNIVERSITY-TEMPE CAMPUS,PO Box 875001,TEMPE,AZ,85287,USA,Illumina HiSeq 2000,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,NONE,"GSE165082,GSE165083",0,Parkinson Disease,blood,wildtype,none,none,none,none,none,none,1,0,PD; whole blood
GSM5025813,016_PD,Public on Mar 11 2021,Jan 19 2021,May 14 2021,SRA,1,blood,Homo sapiens,9606,whole blood,PD,,,total RNA,Peripheral blood was collected from patients u...,Illumina Casava1.7 software used for basecalli...,GPL11154,"QI,,WANG",ARIZONA STATE UNIVERSITY-TEMPE CAMPUS,PO Box 875001,TEMPE,AZ,85287,USA,Illumina HiSeq 2000,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,NONE,"GSE165082,GSE165083",0,Parkinson Disease,blood,wildtype,none,none,none,none,none,none,1,0,PD; whole blood
GSM5025814,050_PD,Public on Mar 11 2021,Jan 19 2021,May 14 2021,SRA,1,blood,Homo sapiens,9606,whole blood,PD,,,total RNA,Peripheral blood was collected from patients u...,Illumina Casava1.7 software used for basecalli...,GPL11154,"QI,,WANG",ARIZONA STATE UNIVERSITY-TEMPE CAMPUS,PO Box 875001,TEMPE,AZ,85287,USA,Illumina HiSeq 2000,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,NONE,"GSE165082,GSE165083",0,Parkinson Disease,blood,wildtype,none,none,none,none,none,none,1,0,PD; whole blood
GSM5025815,028_PD,Public on Mar 11 2021,Jan 19 2021,May 14 2021,SRA,1,blood,Homo sapiens,9606,whole blood,PD,,,total RNA,Peripheral blood was collected from patients u...,Illumina Casava1.7 software used for basecalli...,GPL11154,"QI,,WANG",ARIZONA STATE UNIVERSITY-TEMPE CAMPUS,PO Box 875001,TEMPE,AZ,85287,USA,Illumina HiSeq 2000,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,NONE,"GSE165082,GSE165083",0,Parkinson Disease,blood,wildtype,none,none,none,none,none,none,1,0,PD; whole blood
GSM5025816,011_PD,Public on Mar 11 2021,Jan 19 2021,May 14 2021,SRA,1,blood,Homo sapiens,9606,whole blood,PD,,,total RNA,Peripheral blood was collected from patients u...,Illumina Casava1.7 software used for basecalli...,GPL11154,"QI,,WANG",ARIZONA STATE UNIVERSITY-TEMPE CAMPUS,PO Box 875001,TEMPE,AZ,85287,USA,Illumina HiSeq 2000,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,NONE,"GSE165082,GSE165083",0,Parkinson Disease,blood,wildtype,none,none,none,none,none,none,1,0,PD; whole blood
GSM5025817,022_PD,Public on Mar 11 2021,Jan 19 2021,May 14 2021,SRA,1,blood,Homo sapiens,9606,whole blood,PD,,,total RNA,Peripheral blood was collected from patients u...,Illumina Casava1.7 software used for basecalli...,GPL11154,"QI,,WANG",ARIZONA STATE UNIVERSITY-TEMPE CAMPUS,PO Box 875001,TEMPE,AZ,85287,USA,Illumina HiSeq 2000,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,NONE,"GSE165082,GSE165083",0,Parkinson Disease,blood,wildtype,none,none,none,none,none,none,1,0,PD; whole blood
GSM5025818,051_PD,Public on Mar 11 2021,Jan 19 2021,May 14 2021,SRA,1,blood,Homo sapiens,9606,whole blood,PD,,,total RNA,Peripheral blood was collected from patients u...,Illumina Casava1.7 software used for basecalli...,GPL11154,"QI,,WANG",ARIZONA STATE UNIVERSITY-TEMPE CAMPUS,PO Box 875001,TEMPE,AZ,85287,USA,Illumina HiSeq 2000,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,NONE,"GSE165082,GSE165083",0,Parkinson Disease,blood,wildtype,none,none,none,none,none,none,1,0,PD; whole blood


## Curate metadata

In [32]:
# To demonstrate the curation functions, let's keep only one uncurated column and drop the curated and other irrelevant ones
col_metadata_example_2 = col_metadata[['characteristics_ch1.1.group']]
col_metadata_example_2

cid,characteristics_ch1.1.group
GSM5025809,PD
GSM5025810,PD
GSM5025811,PD
GSM5025812,PD
GSM5025813,PD
GSM5025814,PD
GSM5025815,PD
GSM5025816,PD
GSM5025817,PD
GSM5025818,PD


In [36]:
col_metadata_example_2 = curate.assign_control_pert_labels(col_metadata_example_2)
col_metadata_example_2

cid,characteristics_ch1.1.group,is_control,control_prob
GSM5025809,PD,False,0.09
GSM5025810,PD,False,0.09
GSM5025811,PD,False,0.09
GSM5025812,PD,False,0.09
GSM5025813,PD,False,0.09
GSM5025814,PD,False,0.09
GSM5025815,PD,False,0.09
GSM5025816,PD,False,0.09
GSM5025817,PD,False,0.09
GSM5025818,PD,False,0.09
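
Once the labels are assigned, a quick review step is to count the control and perturbation samples and to flag the rows where the model was least sure. The sketch below uses made-up values and assumes that control_prob is the estimated probability that a sample is a control, which matches the output above but is not confirmed here; `flag_uncertain` is a hypothetical helper.

```python
import pandas as pd

# Made-up labels in the same layout as the output above
labels = pd.DataFrame(
    {"is_control": [False, False, True, True],
     "control_prob": [0.09, 0.45, 0.93, 0.58]},
    index=["s1", "s2", "s3", "s4"],
)

def flag_uncertain(df, margin=0.1):
    """Return the rows whose control_prob lies within `margin` of 0.5."""
    return df[(df["control_prob"] - 0.5).abs() <= margin]

# Count how many samples received each label
print(labels["is_control"].value_counts().to_dict())
# List the samples that may deserve a manual check
print(flag_uncertain(labels).index.tolist())
```

The flagged samples are candidates for a manual check against the original sample description.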
