This notebook demonstrates gene downselection based on the CTD inference score.

We show how the downselected genes point to the most important chemicals to test for: those whose presence is highly correlated with tissue being cancerous.

In [1]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

Load the data and keep only the features we are interested in (either to use numerically or to map to integer values).

In [97]:
genes = pd.read_csv('genes.csv')
gtex_gene_exp = pd.read_csv('gtex_gene_expression.csv')
tox_chem = pd.read_csv('toxicogenomics_chemicals.csv')

In [98]:
genes = genes[["ensembl_id","hgnc_name","cytogenetic_location","gene_biotype"]]
gtex_gene_exp = gtex_gene_exp[["gene_id","chromosome","chromosome_start","chromosome_end","score","strand_type"]]
tox_chem = tox_chem[["gene_id","chemical_id","chemical_name","gene_forms"]]
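
As a side note, the load and the column selection can be combined with the `usecols` parameter of `pd.read_csv`. The sketch below uses invented CSV text in place of the real files, and `unused_column` is a made-up name, so treat it as an illustration rather than a drop-in replacement for the cells above.

```python
import io
import pandas as pd

# Invented CSV text standing in for a file such as genes.csv.
csv_text = """ensembl_id,hgnc_name,gene_biotype,unused_column
ENSG_A,gene a,protein_coding,x
ENSG_B,gene b,lincRNA,y
"""

# usecols restricts parsing to the named columns,
# so no separate selection step is needed afterwards.
genes_toy = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["ensembl_id", "hgnc_name", "gene_biotype"],
)
print(genes_toy.columns.tolist())
```

For large tables this also avoids holding the unused columns in memory.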

Join the tables on gene_id (called ensembl_id in the genes table)

In [99]:
merge1 = pd.merge(genes, gtex_gene_exp, left_on = 'ensembl_id', right_on = 'gene_id')
merge1 = merge1.drop("ensembl_id",axis=1)
merge_final = merge1.merge(tox_chem, on='gene_id')
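
Both merges above use the pandas default, an inner join, so a gene survives only if it appears in all three tables. The toy example below reproduces the same pattern on made-up data (the `ENSG_A`-style IDs and all values are invented) to show which rows drop out.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are made up for illustration.
genes_toy = pd.DataFrame({
    "ensembl_id": ["ENSG_A", "ENSG_B", "ENSG_C"],
    "hgnc_name": ["gene a", "gene b", "gene c"],
})
expr_toy = pd.DataFrame({
    "gene_id": ["ENSG_A", "ENSG_B"],
    "score": [0.9, 0.4],
})
chem_toy = pd.DataFrame({
    "gene_id": ["ENSG_A", "ENSG_A", "ENSG_C"],
    "chemical_name": ["Copper", "Iron", "Aspirin"],
})

# Same pattern as the cell above: two inner joins on the gene identifier.
step1 = pd.merge(genes_toy, expr_toy, left_on="ensembl_id", right_on="gene_id")
step1 = step1.drop("ensembl_id", axis=1)
joined = step1.merge(chem_toy, on="gene_id")

# ENSG_B has no chemical rows and ENSG_C has no expression row,
# so only ENSG_A remains, once per associated chemical.
print(joined)
```

If dropped genes matter for the analysis, passing `how="left"` or `indicator=True` to `merge` is one way to see what was lost.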

Create a map to extract the most "important" (see report) genes per organ under investigation

In [100]:
mapping = {
    "Brain": [u"ENSG00000187325",u"ENSG00000196368",u"ENSG00000037042",u"ENSG00000106299",u"ENSG00000198900",u"ENSG00000162738",u"ENSG00000135334",u"ENSG00000198914",u"ENSG00000139352",u"ENSG00000188486",u"ENSG00000264364",u"ENSG00000178988",u"ENSG00000115461",u"ENSG00000163344",u"ENSG00000124733",u"ENSG00000136938",u"ENSG00000105887",u"ENSG00000102409",u"ENSG00000165502"],
    "Breast": [u"ENSG00000124098",u"ENSG00000105887",u"ENSG00000163344",u"ENSG00000124733",u"ENSG00000136938",u"ENSG00000211663",u"ENSG00000124107",u"ENSG00000211666",u"ENSG00000244468",u"ENSG00000165502",u"ENSG00000115461"],
    "HeadAndNeck": [u"ENSG00000100985",u"ENSG00000211666",u"ENSG00000136938",u"ENSG00000143536",u"ENSG00000211663",u"ENSG00000165502",u"ENSG00000124466",u"ENSG00000124107",u"ENSG00000241794"],
    "Kidney": [u"ENSG00000124107",u"ENSG00000136938",u"ENSG00000199568",u"ENSG00000211666",u"ENSG00000211663",u"ENSG00000165507",u"ENSG00000165502",u"ENSG00000115461"],
    "Lung": [u"ENSG00000211956",u"ENSG00000165502",u"ENSG00000211663",u"ENSG00000211666",u"ENSG00000124107"],
    "Prostate":[u"ENSG00000124733",u"ENSG00000173334",u"ENSG00000136938",u"ENSG00000115461",u"ENSG00000225937",u"ENSG00000165502"],
    "Thyroid":[u"ENSG00000124107",u"ENSG00000174460",u"ENSG00000165502",u"ENSG00000115461"],
    "Uterus":[u"ENSG00000211666",u"ENSG00000165502",u"ENSG00000124107"]
}

Select organ under investigation:

In [93]:
organ_invest = "Uterus"

Get most important genes associated with that organ

In [125]:
genes_invest = mapping[organ_invest]
features_invest = merge_final[merge_final["gene_id"].isin(genes_invest)]
gene_names = features_invest["hgnc_name"]
chem_names = features_invest["chemical_name"]
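
The `isin` filter silently ignores IDs from the mapping that are absent from the joined table. A minimal sketch on invented data (the table, `OrganX`, and the `G1`-style IDs are all hypothetical) shows the selection and one way to list the IDs that found no match.

```python
import pandas as pd

# Hypothetical miniature of merge_final; IDs and names are invented.
merged_toy = pd.DataFrame({
    "gene_id": ["G1", "G1", "G2", "G3"],
    "hgnc_name": ["gene one", "gene one", "gene two", "gene three"],
    "chemical_name": ["Copper", "Iron", "Copper", "Aspirin"],
})
mapping_toy = {"OrganX": ["G1", "G2", "G9"]}  # G9 is absent from the table

# Keep only the rows whose gene is in the organ's list.
selected = merged_toy[merged_toy["gene_id"].isin(mapping_toy["OrganX"])]

# Report listed IDs that never matched a row.
found = set(selected["gene_id"])
missing = [g for g in mapping_toy["OrganX"] if g not in found]

print(selected["chemical_name"].unique())
print(missing)
```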

These genes give us clues about the most important chemicals to test for in tissue that may be cancerous.

In [132]:
unqChems = features_invest["chemical_name"].unique().size
unqGenes = len(genes_invest)
chemsOverall= merge_final["chemical_name"].unique().size
genesOverall = merge_final["gene_id"].unique().size

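# Fold-reduction factors (overall count divided by selected count), truncated to integers.
# Despite the "pcnt" prefix, these are not percentages.
pcnt_reduction_genes = int(genesOverall/unqGenes)
pcnt_reduction_chems = int(chemsOverall/unqChems)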
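
Note that `unqGenes` counts the IDs listed in the mapping, not the genes that actually matched rows in `merge_final`. The report below shows two gene names for the three-ID Uterus list, which suggests the two counts can differ. The sketch below uses invented counts and a hypothetical helper, `reduction_factor`, to show how that choice changes the reported factor.

```python
def reduction_factor(total, kept):
    """Return how many times smaller `kept` is than `total` (hypothetical helper)."""
    if kept == 0:
        raise ValueError("no items kept; reduction factor is undefined")
    return total / kept

# Invented counts for illustration only.
total_genes = 1200
listed = 3    # gene IDs in the organ's mapping list
matched = 2   # IDs from that list actually present in the joined table

factor_listed = int(reduction_factor(total_genes, listed))
factor_matched = int(reduction_factor(total_genes, matched))
print(factor_listed, factor_matched)
```

Which denominator is appropriate depends on whether the report should describe the curated list or the genes that can actually be tested with the available data.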

We generate a summary report on the gene downselection:

In [135]:
report = (
    'You are investigating {0} tissue.\n\n'
    ' The key genes to test for are: {1}\n\n'
    ' This is a reduction of factor {2} x in your genetic testing.\n\n'
    ' Chemicals that indicate a high probability of cancerous tissue are: {3}\n\n'
    ' This is a reduction of factor {4} x out of all indicator chemicals.'
).format(
    organ_invest,
    gene_names.unique(),
    pcnt_reduction_genes,
    chem_names.unique(),
    pcnt_reduction_chems,
)

In [136]:
print(report)

You are investigating Uterus tissue.

 The key genes to test for are: ['secretory leukocyte peptidase inhibitor' 'ribosomal protein L36a like']

 This is a reduction of factor 582 x in your genetic testing.

 Chemicals that indicate a high probability of cancerous tissue are: ['Asbestos, Crocidolite' 'Aspirin' 'Estradiol' 'Carmustine'
 'cobaltous chloride' 'Copper' 'Cyclosporine' 'Cisplatin' 'Calcitriol'
 'Fluticasone' 'Hydrogen Peroxide' 'Iron' 'NSC 689534' 'ormosil'
 'Particulate Matter' 'Polyethylene Glycols' 'propionaldehyde'
 'Salmeterol Xinafoate' 'Asbestos, Serpentine' 'Silicon Dioxide' 'Smoke'
 'temozolomide' 'Tetrachlorodibenzodioxin' 'zoledronic acid'
 "3-(4'-hydroxy-3'-adamantylbiphenyl-4-yl)acrylic acid" 'CD 437'
 'chloropicrin' 'ICG 001' 'K 7174' 'Sodium Selenite']

 This is a reduction of factor 113 x out of all indicator chemicals.
