### Making plots to illustrate the results of some queries with the web tool
This notebook takes a file listing genes together with their species and identifying strings. It loads the dataset used by the search tool, checks which of those genes are in the dataset, and keeps only those. It then uses the descriptions of those genes to query the tool (through the command-line script rather than the Streamlit web app; the results are identical) and records where the other genes from the list fall in the resulting gene rankings. The output of the notebook is a summary giving, for each search direction and rank bin, the mean and standard deviation across searches of the number of genes that fall in that bin.

In [1]:
import datetime
import nltk
import pandas as pd
import numpy as np
import time
import math
import sys
import gensim
import os
import random
import warnings
from collections import defaultdict
from nltk.corpus import brown
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from gensim.parsing.preprocessing import strip_non_alphanum, stem_text, preprocess_string, remove_stopwords
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer
from statsmodels.sandbox.stats.multicomp import multipletests
from itertools import product

sys.path.append("../../oats")
from oats.utils.utils import save_to_pickle, load_from_pickle, flatten, to_hms
from oats.utils.utils import function_wrapper_with_duration, remove_duplicates_retain_order
from oats.biology.dataset import Dataset
from oats.biology.groupings import Groupings
from oats.biology.relationships import ProteinInteractions, AnyInteractions
from oats.annotation.ontology import Ontology
from oats.annotation.annotation import annotate_using_noble_coder, term_enrichment
from oats.distances import pairwise as pw
from oats.nlp.vocabulary import get_overrepresented_tokens, get_vocab_from_tokens
from oats.nlp.vocabulary import reduce_vocab_connected_components, reduce_vocab_linares_pontes, token_enrichment

warnings.simplefilter('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
nltk.download('punkt', quiet=True)
nltk.download('brown', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

Warming up PyWSD (takes ~10 secs)... took 6.275835752487183 secs.


True

In [2]:
# Paths to the files that are used for this notebook.
dataset_path = "../../quoats/data/genes_texts_annots.csv"
dataset = Dataset(dataset_path, keep_ids=True)
dataset.describe()


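|   | species | unique_gene_identifiers | unique_descriptions |
|---|---------|-------------------------|---------------------|
| 0 | ath | 5850 | 3493 |
| 1 | gmx | 30 | 23 |
| 2 | mtr | 37 | 36 |
| 3 | osa | 92 | 85 |
| 4 | sly | 69 | 69 |
| 5 | zma | 1405 | 810 |
| 6 | total | 7483 | 4516 |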


### Anthocyanin biosynthesis genes

Previously, 10 of the 16 maize genes and 16 of the 18 Arabidopsis genes on this list were in the dataset. Now 13 of the 16 maize genes and 18 of the 18 Arabidopsis genes are.

In [3]:
mapping = dataset.get_species_to_name_to_ids_dictionary(include_synonyms=False, lowercase=True)
genes = pd.read_csv("anthocyanin_biosynthesis_genes.csv")
genes["id"] = genes.apply(lambda x: mapping[x["species_code"]].get(x["identifier"].strip().lower(),-1), axis=1)
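# Each matched "id" is still a list of internal IDs at this point; the first one is selected below.
# (A chained assignment such as genes[mask]["id"] = ... would only modify a temporary copy.)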
genes["in_current_dataset"] = genes["id"].map(lambda x: x!=-1)
genes

# Looking just at the ones that are in the current dataset.
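# Take an explicit copy so the assignment below does not act on a view of the original frame.
genes = genes[genes["in_current_dataset"]].copy()
genes["id"] = genes["id"].map(lambda x: x[0])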
genes

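|   | species | species_code | identifier | in_previous_dataset | id | in_current_dataset |
|---|---------|--------------|------------|---------------------|----|--------------------|
| 0 | Maize | zma | GRMZM2G422750 | True | 1216 | True |
| 2 | Maize | zma | GRMZM2G025832 | False | 3761 | True |
| 4 | Maize | zma | GRMZM2G026930 | True | 166 | True |
| 5 | Maize | zma | GRMZM2G345717 | True | 169 | True |
| 6 | Maize | zma | GRMZM2G165390 | True | 218 | True |
| 7 | Maize | zma | GRMZM2G016241 | True | 219 | True |
| 9 | Maize | zma | GRMZM5G822829 | False | 170 | True |
| 10 | Maize | zma | GRMZM2G005066 | True | 165 | True |
| 11 | Maize | zma | GRMZM2G084799 | True | 2330 | True |
| 12 | Maize | zma | GRMZM2G701063 | True | 164 | True |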
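
The lookup in this cell maps each listed gene to a list of internal IDs and then keeps the first one. The following is a minimal sketch of the same idea on toy data. It assumes a hypothetical `mapping` dictionary shaped like the one returned by `get_species_to_name_to_ids_dictionary` (species code, then lowercase identifier, then a list of IDs); the identifiers, the IDs, and the helper `first_id` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the dataset mapping: species code -> identifier -> list of internal IDs.
mapping = {"zma": {"gene_a": [101, 102]}, "ath": {"gene_b": [7]}}

genes = pd.DataFrame({
    "species_code": ["zma", "ath", "zma"],
    "identifier": [" Gene_A", "gene_b", "gene_missing"],
})

def first_id(row):
    """Return the first internal ID for a gene, or None if the gene is absent."""
    ids = mapping.get(row["species_code"], {}).get(row["identifier"].strip().lower())
    return ids[0] if ids else None

genes["id"] = genes.apply(first_id, axis=1)
genes["in_current_dataset"] = genes["id"].notna()

# Take an explicit copy so later column assignments do not act on a view of `genes`.
found = genes[genes["in_current_dataset"]].copy()
found["id"] = found["id"].astype(int)
print(found)
```

Resolving the ID in one step avoids the intermediate `-1` sentinel and the need to index into the list afterwards.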


In [4]:
# Grabbing the texts dictionary from the dataset that we can use to grab the descriptions to query.
texts = dataset.get_description_dictionary()
texts[2617]

'Yellow seed coat. Yellow seeds (lacks brown pigment in seed coat), lacks anthocyanin in leaves and stems, lacks flavonols. Colorless seed coat. Yellow seeds due to absence of brown pigment in seed coat (testa). Pollen has a characteristic lemon-yellow color. Anthocyanins absent in leaves, stems and all tissues, this is conspicuous in older plants when chlorophyll starts breaking down, and mutant plants are yellowish. Deficient in chalcone synthase activity and lacks flavonoids. Double mutant. Disruption of the synthesis of brown pigment in seed coat (testa). Yellow seeds. Altered phenylpropanoid metabolism. Disruption of the synthesis of brown pigment in seed coat (testa), yellow seeds. Yellow seeds (lacks brown pigment in seed coat), lacks anthocyanin in leaves and stems. Pale yellow seeds, no flavonoid in tapetum cells.'

In [5]:
# Prepare dictionaries to hold the resulting arrays.
resulting_bin_arrays = defaultdict(dict)
resulting_bin_arrays["zma"]["zma"] = []
resulting_bin_arrays["ath"]["ath"] = []
resulting_bin_arrays["zma"]["ath"] = []
resulting_bin_arrays["ath"]["zma"] = []
resulting_bin_arrays

defaultdict(dict,
            {'zma': {'zma': [], 'ath': []}, 'ath': {'ath': [], 'zma': []}})
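
As an alternative to declaring every direction by hand, a nested `defaultdict` can create the lists on first use. This is only a sketch of that option, not what the notebook does; the histogram values are placeholders.

```python
from collections import defaultdict

# Two-level mapping: query species -> target species -> list of per-query histograms.
resulting_bin_arrays = defaultdict(lambda: defaultdict(list))

# Lists are created on first access, so no direction has to be declared up front.
resulting_bin_arrays["zma"]["ath"].append([1, 0, 0, 0, 0, 17])
resulting_bin_arrays["zma"]["zma"].append([3, 4, 2, 1, 0, 2])

directions = sorted((s1, s2) for s1, inner in resulting_bin_arrays.items() for s2 in inner)
print(directions)
```

The trade-off is that a mistyped species code silently creates a new entry instead of raising a `KeyError`, which the explicit initialization above would catch.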

### Searching within the same species

In [6]:
# The searches within the same species.
rank_for_not_found = 100
bins =[0,11,21,31,41,51,rank_for_not_found]
bin_names = [10,20,30,40,50,rank_for_not_found]
assert len(bin_names) == len(bins)-1

ctr = 0
for gene in genes.itertuples():
    
    ctr = ctr+1
    limit = 50
    species = gene[1]
    species_code = gene[2]
    identifier = gene[3]
    gene_id = gene[5]
    text = texts[gene_id]
    
    # Because these are being passed as strings to the command line, quotes need to be removed now,
    # instead of waiting for them to be removed as a preprocessing step of the search strings in the streamlit script.
    text = text.replace("'","")
    text = text.replace('"','')
    
    path = "../plant-phenotypes-nlp/quoats/outputs_within_anthocyanin/output_{}.tsv".format(ctr)
    os.chdir('../../quoats')
    os.system("python main.py -s {} -t identifiers -q '{}:{}' -l {} -o {} -r 0.000 -a TFIDF".format(species,species,identifier,limit,path))
    time.sleep(6)
    df = pd.read_csv(path, sep='\t')
    df = df[["Rank","Internal ID"]]
    df = df.drop_duplicates()
    id_to_rank = dict(zip(df["Internal ID"].values,df["Rank"].values))
    assert rank_for_not_found > limit
    
    # For within the same species, get rid of the identical gene (always rank 1).
    ids_of_interest = [i for i in genes[genes["species"]==species]["id"].values if i != gene_id]
    ranks = [id_to_rank.get(i, rank_for_not_found) for i in ids_of_interest]
    resulting_bin_arrays[species_code][species_code].append(np.histogram(ranks, bins=bins)[0])
    
    #print(np.array( resulting_bin_arrays[species_code][species_code]))
    print(ranks)
    print("done with {} queries".format(ctr))
    os.chdir("../plant-phenotypes-nlp/quoats/")
print('done with all queries')

[100, 20, 16, 37, 36, 21, 12, 25, 34, 21, 22, 10]
done with 1 queries
[23, 12, 10, 9, 7, 13, 8, 100, 17, 13, 11, 29]
done with 2 queries
[19, 13, 17, 22, 18, 29, 20, 100, 30, 29, 27, 5]
done with 3 queries
[12, 8, 15, 13, 11, 17, 10, 100, 25, 17, 16, 100]
done with 4 queries
[25, 6, 11, 10, 3, 13, 9, 100, 43, 13, 12, 22]
done with 5 queries
[25, 6, 12, 10, 3, 13, 9, 100, 42, 13, 11, 22]
done with 6 queries
[19, 6, 27, 17, 23, 20, 16, 100, 47, 25, 8]
done with 7 queries
[11, 8, 20, 14, 16, 12, 21, 45, 27, 21, 18, 46]
done with 8 queries
[3, 100, 12, 100, 100, 100, 100, 100, 100, 100, 100, 100]
done with 9 queries
[100, 100, 8, 16, 100, 100, 18, 17, 100, 18, 6, 100]
done with 10 queries
[19, 6, 27, 17, 23, 20, 16, 100, 47, 25, 8]
done with 11 queries
[24, 8, 30, 23, 25, 13, 29, 20, 100, 19, 29, 9]
done with 12 queries
[10, 30, 12, 26, 18, 17, 14, 20, 100, 35, 14, 13]
done with 13 queries
[3, 5, 7, 6, 47, 100, 100, 100, 100, 45, 8, 14, 13, 4, 2, 15, 11]
done with 14 queries
[4, 5, 7, 6, 3
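
To make the binning concrete, the sketch below applies the same bin edges to the ranks printed for the first query. `np.histogram` treats every bin as half-open (`[left, right)`) except the last, which also includes its right edge, so rank 10 lands in the first bin, rank 11 in the second, and the not-found value 100 in the last.

```python
import numpy as np

rank_for_not_found = 100
bins = [0, 11, 21, 31, 41, 51, rank_for_not_found]

# Ranks printed for the first within-species query above.
ranks = [100, 20, 16, 37, 36, 21, 12, 25, 34, 21, 22, 10]

counts, edges = np.histogram(ranks, bins=bins)
print(counts.tolist())  # [1, 3, 4, 3, 0, 1]
```

Each histogram sums to the number of genes of interest for that query, which is a convenient check when reading the summary tables.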

### Searching across different species

In [7]:
# The searches across different species.
rank_for_not_found = 100
bins =[0,11,21,31,41,51,rank_for_not_found]
bin_names = [10,20,30,40,50,rank_for_not_found]
assert len(bin_names) == len(bins)-1

ctr = 0
for gene in genes.itertuples():
    
    ctr = ctr+1
    limit = 50
    from_species = gene[1]
    from_species_code = gene[2]
    identifier = gene[3]
    gene_id = gene[5]
    text = texts[gene_id]
    
    
    # Do the switch to make the species for the query and IDs of interest the opposite one.
    to_species = {"Arabidopsis":"Maize","Maize":"Arabidopsis"}[from_species]
    to_species_code = {"ath":"zma","zma":"ath"}[from_species_code]
    
    # Because these are being passed as strings to the command line, quotes need to be removed now,
    # instead of waiting for them to be removed as a preprocessing step of the search strings in the streamlit script.
    text = text.replace("'","")
    text = text.replace('"','')
    
    
    path = "../plant-phenotypes-nlp/quoats/outputs_across_anthocyanin/output_{}.tsv".format(ctr)
    os.chdir('../../quoats')
    os.system("python main.py -s {} -t identifiers -q '{}:{}' -l {} -o {} -r 0.000 -a TFIDF".format(to_species,from_species,identifier,limit,path))
    time.sleep(4)
    df = pd.read_csv(path, sep='\t')
    df = df[["Rank","Internal ID"]]
    df = df.drop_duplicates()
    id_to_rank = dict(zip(df["Internal ID"].values,df["Rank"].values))
    assert rank_for_not_found > limit
    
    # Across species the query gene is never among the target genes, so the i != gene_id check is only a safeguard.
    ids_of_interest = [i for i in genes[genes["species"]==to_species]["id"].values if i != gene_id]
    ranks = [id_to_rank.get(i, rank_for_not_found) for i in ids_of_interest]
    resulting_bin_arrays[from_species_code][to_species_code].append(np.histogram(ranks, bins=bins)[0])
    
    #print(np.array( resulting_bin_arrays[species_code][species_code]))
    print(ranks)
    print("done with {} queries".format(ctr))
    
    os.chdir("../plant-phenotypes-nlp/quoats/")
print('done with all queries')

[4, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
done with 1 queries
[100, 100, 100, 100, 100, 100, 17, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
done with 2 queries
[100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
done with 3 queries
[100, 100, 36, 100, 37, 100, 12, 100, 35, 100, 32, 38, 100, 100, 100, 100, 15, 100]
done with 4 queries
[100, 100, 100, 100, 100, 100, 49, 100, 100, 39, 29, 100, 100, 100, 100, 100, 100, 100]
done with 5 queries
[100, 100, 100, 100, 100, 100, 46, 100, 100, 36, 29, 100, 100, 100, 100, 100, 100, 100]
done with 6 queries
[36, 100, 100, 100, 100, 100, 23, 100, 100, 100, 100, 100, 100, 100, 100, 100, 26, 100]
done with 7 queries
[21, 100, 100, 100, 100, 100, 8, 100, 100, 45, 44, 100, 39, 100, 100, 100, 100, 100]
done with 8 queries
[100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
done with 9 queries
[100, 100, 100, 100, 100, 
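
Both search loops build a shell command string, change directories around it, and sleep afterwards. One possible alternative, sketched here under assumptions, is to pass an argument list to `subprocess.run`: no shell is involved, so quotes in arguments need no special handling, `cwd` replaces the `os.chdir` calls, and the call blocks until the script finishes. The flags are copied from the command used above; the helper `build_search_command` and the output file name are hypothetical, and the actual run is left commented out because it needs the search script.

```python
import subprocess
import sys

def build_search_command(species, query_species, identifier, limit, output_path):
    """Build the argument list for the search script (hypothetical helper).

    The flags mirror the command string used in the loops above.
    """
    return [
        sys.executable, "main.py",
        "-s", species,
        "-t", "identifiers",
        "-q", "{}:{}".format(query_species, identifier),
        "-l", str(limit),
        "-o", output_path,
        "-r", "0.000",
        "-a", "TFIDF",
    ]

cmd = build_search_command("Arabidopsis", "Maize", "GRMZM2G422750", 50, "output_1.tsv")
print(cmd)

# To run the search from the script's directory without calling os.chdir:
# subprocess.run(cmd, cwd="../../quoats", check=True)
```

With `check=True`, a failing search raises `CalledProcessError` instead of leaving a missing output file to be discovered later by `pd.read_csv`.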

In [8]:
# Create the output dataframe with the means and standard deviation for each bin and direction.
output_rows = []
for (s1,s2) in product(["ath","zma"],["ath","zma"]):
    means = np.mean(np.array(resulting_bin_arrays[s1][s2]),axis=0)
    std_devs = np.std(np.array(resulting_bin_arrays[s1][s2]),axis=0)
    for i in range(len(bin_names)):
        output_rows.append([s1, s2, bin_names[i], means[i], std_devs[i]])    
names = ["from","to","bin","mean","sd"]    
output_df = pd.DataFrame(output_rows,columns=names)
output_df

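|   | from | to | bin | mean | sd |
|---|------|----|-----|------|----|
| 0 | ath | ath | 10 | 5.5 | 2.629956 |
| 1 | ath | ath | 20 | 3.888889 | 2.330686 |
| 2 | ath | ath | 30 | 1.388889 | 1.829963 |
| 3 | ath | ath | 40 | 0.888889 | 2.051798 |
| 4 | ath | ath | 50 | 0.944444 | 1.177201 |
| 5 | ath | ath | 100 | 4.388889 | 4.296496 |
| 6 | ath | zma | 10 | 0.833333 | 1.57233 |
| 7 | ath | zma | 20 | 0.388889 | 0.825893 |
| 8 | ath | zma | 30 | 0.555556 | 1.257079 |
| 9 | ath | zma | 40 | 0.055556 | 0.229061 |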
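
The summary step stacks one histogram per query and averages down the columns. The sketch below shows this on toy histograms (not real results). Note that `np.std` computes the population standard deviation (`ddof=0`) by default; pass `ddof=1` for the sample standard deviation.

```python
import numpy as np

bin_names = [10, 20, 30, 40, 50, 100]

# Toy per-query histograms (one row per query, one column per bin); not real results.
histograms = np.array([
    [1, 3, 4, 3, 0, 1],
    [3, 4, 2, 1, 0, 2],
    [2, 2, 0, 0, 0, 8],
])

means = histograms.mean(axis=0)
sds = histograms.std(axis=0)                 # population standard deviation (ddof=0)
sample_sds = histograms.std(axis=0, ddof=1)  # sample standard deviation, if preferred

for name, m, s in zip(bin_names, means, sds):
    print("bin {:>3}: mean={:.3f} sd={:.3f}".format(name, m, s))

# Every query ranks the same number of other genes, so the bin means sum to that number.
assert np.isclose(means.sum(), histograms.sum(axis=1)[0])
```

The same check applies to the table above: the six `ath` to `ath` means sum to 17, which is the 18 Arabidopsis genes minus the query gene itself.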


In [9]:
#os.chdir('/Users/irbraun/phenologs-with-oats/quoats')
output_path = "plots/anthocyanin_plot_data.csv"
output_df.to_csv(output_path, index=False)

### Autophagy genes

In [10]:
#os.chdir('/Users/irbraun/phenologs-with-oats/quoats')
mapping = dataset.get_species_to_name_to_ids_dictionary(include_synonyms=False, lowercase=True)
genes = pd.read_csv("autophagy_core_genes.csv")
genes["id"] = genes.apply(lambda x: mapping[x["species_code"]].get(x["identifier"].strip().lower(),-1), axis=1)
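# Each matched "id" is still a list of internal IDs at this point; the first one is selected below.
# (A chained assignment such as genes[mask]["id"] = ... would only modify a temporary copy.)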
genes["in_current_dataset"] = genes["id"].map(lambda x: x!=-1)
genes

# Looking just at the ones that are in the current dataset.
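# Take an explicit copy so the assignment below does not act on a view of the original frame.
genes = genes[genes["in_current_dataset"]].copy()
genes["id"] = genes["id"].map(lambda x: x[0])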
genes

|   | species | species_code | name | identifier | id | in_current_dataset |
|---|---------|--------------|------|------------|----|--------------------|
| 1 | Arabidopsis | ath | ATG2 | AT3G19190 | 23 | True |
| 3 | Arabidopsis | ath | ATG4 | AT2G44140 | 4688 | True |
| 4 | Arabidopsis | ath | ATG5 | AT5G17290 | 462 | True |
| 5 | Arabidopsis | ath | ATG6 | AT3G61710 | 297 | True |
| 6 | Arabidopsis | ath | ATG7 | AT5G45900 | 5585 | True |
| 8 | Arabidopsis | ath | ATG9 | AT2G31260 | 533 | True |
| 9 | Arabidopsis | ath | ATG10 | AT3G07525 | 2377 | True |
| 10 | Arabidopsis | ath | ATG11 | AT4G30790 | 5272 | True |
| 11 | Arabidopsis | ath | ATG12 | AT1G54210 | 4274 | True |
| 12 | Arabidopsis | ath | ATG13 | AT3G49590 | 4968 | True |


12 of the 16 core autophagy genes on the list are present in the dataset; see the core genes file.

In [11]:
# Grabbing the texts dictionary from the dataset that we can use to grab the descriptions to query.
texts = dataset.get_description_dictionary()
texts[4688]

'Small plants and premature senescence under normal soil-grown conditions. Hypersensitivity to N and fixed-C deprivation.'

In [12]:
# Prepare dictionaries to hold the resulting arrays.
resulting_bin_arrays = defaultdict(dict)
resulting_bin_arrays["ath"]["ath"] = []
resulting_bin_arrays

defaultdict(dict, {'ath': {'ath': []}})

In [35]:
# The searches within the same species.
rank_for_not_found = 100
bins =[0,11,21,31,41,51,rank_for_not_found]
bin_names = [10,20,30,40,50,rank_for_not_found]
assert len(bin_names) == len(bins)-1
if os.getcwd() == "/work/triffid/prasanth/reorganizing-irb-scripts/quoats":
    os.chdir("../plant-phenotypes-nlp/quoats/")


ctr = 0
for gene in genes.itertuples():
    
    ctr = ctr+1
    limit = 50
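    # Use named fields here: this file has an extra "name" column, so positional access
    # with gene[3] would pass the gene name (e.g. ATG2) rather than the identifier.
    species = gene.species
    species_code = gene.species_code
    identifier = gene.identifier
    gene_id = gene.id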
    text = texts[gene_id]
    
    # Because these are being passed as strings to the command line, quotes need to be removed now,
    # instead of waiting for them to be removed as a preprocessing step of the search strings in the streamlit script.
    text = text.replace("'","")
    text = text.replace('"','')
    print(os.getcwd())
    
    
    path = "../plant-phenotypes-nlp/quoats/outputs_within_autophagy/output_{}.tsv".format(ctr)
    
    if os.getcwd() == "/work/triffid/prasanth/reorganizing-irb-scripts/quoats":
        os.chdir("../plant-phenotypes-nlp/quoats/")

    os.chdir('../../quoats')
    os.system("quoats/main.py -s {} -t identifiers -q '{}:{}' -l {} -o {} -r 0.000 -a tfidf".format(species,species,identifier,limit,path))
    time.sleep(7)
    if os.path.exists(path):
        print('I am in')
        df = pd.read_csv(path, sep='\t')
        df = df[["Rank","Internal ID"]]
        df = df.drop_duplicates()
        id_to_rank = dict(zip(df["Internal ID"].values,df["Rank"].values))
        assert rank_for_not_found > limit


        # For within the same species, get rid of the identical gene (always rank 1).
        ids_of_interest = [i for i in genes[genes["species"]==species]["id"].values if i != gene_id]
        ranks = [id_to_rank.get(i, rank_for_not_found) for i in ids_of_interest]
        resulting_bin_arrays[species_code][species_code].append(np.histogram(ranks, bins=bins)[0])
     
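        print(ranks)
        print("done with {} queries".format(ctr))

    # Return to the notebook directory whether or not an output file was found,
    # so that a missing output does not leave later iterations in the wrong directory.
    os.chdir("../plant-phenotypes-nlp/quoats/")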
        
print('done with all queries')

/work/triffid/prasanth/reorganizing-irb-scripts/plant-phenotypes-nlp/quoats
I am in
[100, 100, 100, 100, 100, 100, 100, 100, 100]
done with 1 queries
/work/triffid/prasanth/reorganizing-irb-scripts/plant-phenotypes-nlp/quoats
I am in
[100, 100, 100, 100, 100, 100, 100, 100, 100]
done with 2 queries
/work/triffid/prasanth/reorganizing-irb-scripts/plant-phenotypes-nlp/quoats
I am in
[100, 100, 100, 100, 100, 13, 100, 100, 100]
done with 3 queries
/work/triffid/prasanth/reorganizing-irb-scripts/plant-phenotypes-nlp/quoats
/work/triffid/prasanth/reorganizing-irb-scripts/quoats
I am in
[100, 4, 10, 100, 100, 9, 6, 5, 7]
done with 5 queries
/work/triffid/prasanth/reorganizing-irb-scripts/plant-phenotypes-nlp/quoats
I am in
[100, 100, 100, 100, 100, 9, 100, 100, 100]
done with 6 queries
/work/triffid/prasanth/reorganizing-irb-scripts/plant-phenotypes-nlp/quoats
I am in
[100, 100, 18, 100, 100, 100, 100, 100, 100]
done with 7 queries
/work/triffid/prasanth/reorganizing-irb-scripts/plant-phenot

In [36]:
# Create the output dataframe with the means and standard deviation for each bin and direction.
output_rows = []
s1 = "ath"
s2 = "ath"
means = np.mean(np.array(resulting_bin_arrays[s1][s2]),axis=0)
std_devs = np.std(np.array(resulting_bin_arrays[s1][s2]),axis=0)
for i in range(len(bin_names)):
    output_rows.append([s1, s2, bin_names[i], means[i], std_devs[i]])    
names = ["from","to","bin","mean","sd"]    
output_df = pd.DataFrame(output_rows,columns=names)
output_df

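|   | from | to | bin | mean | sd |
|---|------|----|-----|------|----|
| 0 | ath | ath | 10 | 0.625 | 1.57619 |
| 1 | ath | ath | 20 | 0.375 | 0.563656 |
| 2 | ath | ath | 30 | 0.083333 | 0.399653 |
| 3 | ath | ath | 40 | 0.0 | 0.0 |
| 4 | ath | ath | 50 | 0.0 | 0.0 |
| 5 | ath | ath | 100 | 7.916667 | 2.03954 |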


In [37]:
os.getcwd()

'/work/triffid/prasanth/reorganizing-irb-scripts/plant-phenotypes-nlp/quoats'

In [39]:
#os.chdir('../plant-phenotypes-nlp/quoats')
output_path = "plots/autophagy_plot_data.csv"
output_df.to_csv(output_path, index=False)
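
The saved files are in long format (one row per direction and bin). For plotting, it can help to reshape them so each direction becomes a column. The sketch below does this with pandas on the six `ath` to `ath` rows of the anthocyanin summary shown earlier, read from an in-memory string in place of the CSV file; the `direction` column and the variable names are illustrative choices, not part of the notebook.

```python
import io
import pandas as pd

# The six ath -> ath rows of the anthocyanin summary, in the layout written by to_csv.
csv_text = """from,to,bin,mean,sd
ath,ath,10,5.5,2.629956
ath,ath,20,3.888889,2.330686
ath,ath,30,1.388889,1.829963
ath,ath,40,0.888889,2.051798
ath,ath,50,0.944444,1.177201
ath,ath,100,4.388889,4.296496
"""

plot_df = pd.read_csv(io.StringIO(csv_text))
plot_df["direction"] = plot_df["from"] + "->" + plot_df["to"]

# One row per bin, one column per search direction.
wide_means = plot_df.pivot(index="bin", columns="direction", values="mean")
wide_sds = plot_df.pivot(index="bin", columns="direction", values="sd")
print(wide_means)
```

With matplotlib installed, `wide_means.plot.bar(yerr=wide_sds)` would then draw grouped bars with the standard deviations as error bars.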