# __Step 5.1: Species over time__

Goals here:
- Determine overall genus mention
- Determine genus mention over time
- Same analysis at other taxonomic levels

Considerations:
- Deal with common names
  - Common names must be those in the USDA common name database.
  - Non-specific names are not counted, even when they most likely refer to a particular species.
- Deal with synonyms
  - Both NCBI and USDA data have synonym info. Synonyms will be mapped to a specific taxonomic level.
- Deal with redundancy
  - Multiple taxonomic levels may be mentioned in a single title/abstract, e.g., Solanaceae, Solanum, tomato. Such a record is counted just once at the genus level and once at the family level.
- Missing info
  - Some species info may be mentioned only in the full text.
- The NCBI compressed taxa dump was downloaded on 11/11/2021. Note that some taxa names are found in the USDA common names and on the NCBI taxonomy website, but not in `names.dmp`.
  - Example: Achnatherum
  - One level up, the tribe Stipeae has only 25 children when there should be 34. Missing: Barkworthia, Eriocoma, Neotrinia, Oryzopsis (ricegrass), Pseudoeriocoma, Ptilagrostiella, Stipellula, Thorneochloa, x Eriosella.
  - Downloaded the taxonomy dump again from [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz) on 9/20/22.
- Also realized that `no rank` taxa must be included when parsing parent-child relations; otherwise, the lineage will be broken.
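
The redundancy consideration above can be sketched as follows. This is only an illustration under assumptions, not the code used later in this notebook: the `name_to_taxa` mapping, its entries, and `count_once_per_level` are hypothetical toy names.

```python
# Hypothetical toy mapping from a matched name to its (rank, taxon) pairs.
name_to_taxa = {
    "Solanaceae": [("family", "Solanaceae")],
    "Solanum":    [("genus", "Solanum"), ("family", "Solanaceae")],
    "tomato":     [("genus", "Solanum"), ("family", "Solanaceae")],
}

def count_once_per_level(matched_names):
    """Return {rank: set of taxa} so each taxon counts once per record."""
    per_level = {}
    for name in matched_names:
        for rank, taxon in name_to_taxa.get(name, []):
            # A set collapses repeated mentions of the same taxon
            per_level.setdefault(rank, set()).add(taxon)
    return per_level

# One record mentioning the family, the genus, and a common name
record_hits = ["Solanaceae", "Solanum", "tomato"]
levels = count_once_per_level(record_hits)
print(levels)
```

Using a set per rank means the record contributes one count to Solanum and one to Solanaceae, no matter how many of the three names appear.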

## ___Set up___

### Module import

In [2]:
import pickle, nltk, re, multiprocessing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm
from scipy.sparse import csr_matrix, lil_matrix, coo_matrix, dok_matrix
from time import time
from collections import OrderedDict

### Key variables

In [3]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "5_species_over_time/"
work_dir.mkdir(parents=True, exist_ok=True)

# species information
# NEED TO SPECIFY
dir1           = proj_dir / "1_obtaining_corpus"
#names_dmp_path = dir1 / "taxonomy/names.dmp"
#nodes_dmp_path = dir1 / "taxonomy/nodes.dmp"
usda_plant_db  = dir1 / "usda/USDA_Plants_Database.txt"

names_dmp_path = work_dir / "taxonomy/names.dmp"
nodes_dmp_path = work_dir / "taxonomy/nodes.dmp"

# plant science corpus with date and other info
dir2        = proj_dir / "2_text_classify/2_5_predict_pubmed"
corpus_file = dir2 / "corpus_plant_421658.tsv.gz"

# Embed fonts as TrueType (type 42) so text in saved PDFs stays editable
mpl.rcParams['pdf.fonttype'] = 42
plt.rcParams["font.family"] = "sans-serif"

## ___Get plant names___

In `1_obtaining_corpus`, plant names are from two sources:
- NCBI: the taxonomy database with mention of all taxa levels
  - This will also contain synonyms for different levels.
- USDA: plant common names with species information


### NCBI taxonomy

Functions modified from:
- `1_obtaining_corpus\script_get_plant_taxa.py`

#### Parse names.dmp

In [4]:
#
# For: Getting the tax_id of a target taxon (here, Viridiplantae) and
#   generating a dictionary of names.
# Parameters:
#   names_dmp_path - The Path object to the names.dmp file from NCBI taxonomy.
#   target - Target taxon name.
# Return:
#   target_id - The NCBI taxon ID for the target taxon.
#   names_dic - A dictionary with: {tax_id:{name_class:[names]}}
#
def get_name_dict(names_dmp_path, target):
  target_id = ""
  names_dic = {}
  # Use a context manager so the file handle is closed after parsing
  with open(names_dmp_path) as names_dmp:
    for L in names_dmp:
      L = L.strip().split("\t")
      tax_id = L[0]
      name   = L[2]
      name_c = L[6]
      if name == target:
        print(f"{target} tax_id:",tax_id)
        target_id = tax_id

      # Group names by tax_id, then by name class
      if tax_id not in names_dic:
        names_dic[tax_id] = {name_c:[name]}
      elif name_c not in names_dic[tax_id]:
        names_dic[tax_id][name_c] = [name]
      else:
        names_dic[tax_id][name_c].append(name)
  return target_id, names_dic

In [5]:
target_id, names_dic = get_name_dict(names_dmp_path, 'Viridiplantae')

Viridiplantae tax_id: 33090


In [6]:
names_dic['33090'], names_dic['147383']

({'authority': ['Chlorobionta Jeffrey, 1982',
   'Chloroplastida Adl et al. 2005',
   'Viridiplantae Cavalier-Smith, 1981'],
  'synonym': ['Chlorobionta', 'Chloroplastida'],
  'equivalent name': ['Chlorophyta/Embryophyta group',
   'chlorophyte/embryophyte group'],
  'blast name': ['green plants'],
  'common name': ['green plants'],
  'scientific name': ['Viridiplantae']},
 {'authority': ['Stipeae Dumort., 1824'], 'scientific name': ['Stipeae']})

#### Parse nodes.dmp

This is to get:
- Parent-child relation
- Child-parent relation
- Rank count
- Taxa_id-rank relation
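
To show why the child-parent relation needs the `no rank` taxa, here is a small sketch that walks a lineage upward. The function `get_lineage` and the toy IDs are hypothetical and only for illustration; they are not part of the parsing code below.

```python
def get_lineage(tax_id, child_parent):
    """Walk from tax_id up through its parents and return the visited IDs."""
    lineage = [tax_id]
    while tax_id in child_parent:
        tax_id = child_parent[tax_id]
        if tax_id in lineage:  # guard against a root that is its own parent
            break
        lineage.append(tax_id)
    return lineage

# Toy data (hypothetical IDs): genus g1 -> no-rank clade c1 -> family f1
with_no_rank = {"g1": "c1", "c1": "f1"}
# If rows with rank "no rank" are skipped, the c1 -> f1 link is never stored
without_no_rank = {"g1": "c1"}

print(get_lineage("g1", with_no_rank))     # reaches the family
print(get_lineage("g1", without_no_rank))  # stops at the clade
```

In the second case the walk ends at the clade, so the genus can no longer be connected to its family.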


In [7]:
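#
# For: Get the parent-child relationships from nodes.dmp file.
# Parameters: 
#   nodes_dmp_path - The Path object to the nodes.dmp file from NCBI taxonomy.
# Return: 
#   parent_child - A dictionary with {parent:[children]}
#   child_parent - A dictionary with {child:parent}
#   rank_d       - A dictionary with {rank:count}
#   taxa_rank    - A dictionary with {tax_id:rank}
#   rank_taxa    - A dictionary with {rank:[tax_ids]}
#   debug_list   - Scientific names of the children of tax_id 147383 (Stipeae)
# Note: uses the global names_dic from get_name_dict() for the debug list.
#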
def get_parent_child(nodes_dmp_path):
    nodes_dmp    = open(nodes_dmp_path)
    L            = nodes_dmp.readline()
    rank_d       = {} # {rank: count}
    taxa_rank    = {} # {taxa_id: rank}
    rank_taxa    = {} # {rank: [taxa_ids]}
    parent_child = {}
    child_parent = {}
    target_ranks = ['genus', 'family', 'order']

    debug_count  = 0
    debug_list   = []
    while L != "":
        L = L.strip().split("\t")
        tax_id = L[0]
        par_id = L[2]
        rank   = L[4]
        if rank not in rank_d:
            rank_d[rank] = 1
        else:
            rank_d[rank]+= 1
        
        # Don't want any species.
        # 9/20/22: "no rank" taxa were previously excluded as well, but that
        #   caused problems. An example is taxid=2822797, a child of 147368:
        #   excluding it led to some taxa missing. So the "no rank" exclusion
        #   was removed.
        #if rank not in ["no rank", "species"]:
        if rank != "species":
            # debug
            if par_id == '147383':
                debug_count += 1
                debug_list.append(names_dic[tax_id]['scientific name'][0])
                #print(debug_count, tax_id, names_dic[tax_id]['scientific name'])

            # populate parent_child dict
            if par_id not in parent_child:
                parent_child[par_id] = [tax_id]
            else:
                parent_child[par_id].append(tax_id)
            
            # populate child_parent dict
            if tax_id not in child_parent:
                child_parent[tax_id] = par_id
            else:
                print(f"ERR: {tax_id} with >1 parents",
                        child_parent[tax_id], par_id)
            
            # populate taxa_rank and rank_taxa dicts
            taxa_rank[tax_id] = rank
            
            if rank not in rank_taxa:
                rank_taxa[rank] = [tax_id]
            else:
                rank_taxa[rank].append(tax_id)
            
        L = nodes_dmp.readline()
        
    return parent_child, child_parent, rank_d, taxa_rank, rank_taxa, debug_list

In [8]:
parent_child, child_parent, rank_d, taxa_rank, rank_taxa, debug_list = \
                                              get_parent_child(nodes_dmp_path)

#### Get offspring of Viridiplantae

These are the names to search for, after adding the USDA names.

In [9]:
#
# For: Get the offspring of a parent.
# Parameters: 
#   p - The parent taxa ID to get children for.
#   parent_child - The dictionary returned from get_parent_child().
#   offsprings - An initially empty list to append offspring IDs.
#   debug - If nonzero, print traversal information.
# Return: 
#   offsprings - The populated offspring list.
#
def get_offsprings(p, parent_child, offsprings, debug=0):
    if debug:
        print(p)
    if p in parent_child:
        # Initialize c with an empty element for debugging purpose
        #c = [""]
        c = parent_child[p]
        if debug:
            print("",p, c)
            if p == "147383":
                print("debug parent found")

        offsprings.extend(c)
        for a_c in c:
            get_offsprings(a_c, parent_child, offsprings, debug)
    else:
        if debug:
            print(" NO CHILD")
    return offsprings

In [10]:
offsprings_33090 = get_offsprings(target_id, parent_child, [])

In [11]:
len(offsprings_33090)

25232
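
The recursive traversal works for this tree. As an alternative sketch, assuming the same `{parent: [children]}` layout, a queue-based version avoids Python's recursion limit on deeper trees. `get_offsprings_iter` and the toy dictionary are hypothetical and not used elsewhere in this notebook.

```python
from collections import deque

def get_offsprings_iter(p, parent_child):
    """Collect all descendants of p without recursion (breadth-first)."""
    offsprings = []
    queue = deque([p])
    while queue:
        node = queue.popleft()
        children = parent_child.get(node, [])
        offsprings.extend(children)
        queue.extend(children)  # visit each child's own children later
    return offsprings

# Toy tree (hypothetical IDs): A -> B, C; B -> D; D -> E
toy_parent_child = {"A": ["B", "C"], "B": ["D"], "D": ["E"]}
toy_offsprings = get_offsprings_iter("A", toy_parent_child)
print(toy_offsprings)
```

The set of IDs should be the same as with the recursive function, but the order can differ because this version goes level by level rather than depth first.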

In [12]:
# Convert taxa id into scientific names
offspring_names = []
redun = {}
for o in offsprings_33090:
    if o in names_dic:
        for nc in names_dic[o]: # for each name_class
            if nc != 'authority': 
                for name in names_dic[o][nc]:
                    if name not in redun:
                        offspring_names.append(name)
                        redun[name] = 0
                    #else:
                    #    print("Redun:", name)

# Note that this number is larger than len(offsprings_33090), which holds tax_ids.
# This is because each tax_id can have other names, like synonyms.
len(offspring_names)

26782

In [23]:
# Save as pickle
with open(work_dir / "viridiplantae_offspring_names.pickle", "wb") as f:
  pickle.dump(offspring_names, f)

### USDA names

Functions modified from:
- `1_obtaining_corpus\script_get_plant_common_names.py`

In [13]:
cnames = {} # {common_name:[scientific name, family]}

with open(usda_plant_db) as f:
    f.readline() # header, don't need it
    L = f.readline()
    while L != "":
        L = L.strip()
        # There is an empty line in the file.
        if L == "":
            break
        #print(L.split(","))
        try:
            # some names have "," in them, so split on '",' (quote + comma)
            [symbol, synonym, sname, cname, fam] = L.split("\",")
        except ValueError:
            print("ValueError:",[L])
            break
        # get rid of quotes
        
        [symbol, synonym, sname, cname, fam] = [symbol.split("\"")[1], 
                                                synonym.split("\"")[1], 
                                                sname.split("\"")[1], 
                                                cname.split("\"")[1], 
                                                fam.split("\"")[1]]
        #print([symbol, synonym, sname, cname, fam])
        # Get genus name out
        genus = sname.split(" ")[0]

        if cname != "":
            if cname not in cnames:
                cnames[cname] = [genus, fam]
            #else:
            #    print("Redun cname:", [cname], cnames[cname], [sname,fam])        
        L = f.readline()
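
As a possible alternative to splitting on `",`, the standard `csv` module handles quoted fields that contain commas. The sketch below uses hypothetical toy rows and assumed column names, not the actual USDA file, so treat it only as an illustration.

```python
import csv
import io

# Toy rows in the same quoted, comma-separated layout (hypothetical content)
toy_usda = io.StringIO(
    '"Symbol","Synonym Symbol","Scientific Name","Common Name","Family"\n'
    '"TOY1","","Solanum melongena L.","eggplant","Solanaceae"\n'
    '"TOY2","","Zea mays L., 1753","corn, maize","Poaceae"\n'
)

toy_cnames = {}  # {common_name: [genus, family]}
reader = csv.reader(toy_usda)
next(reader)  # skip the header
for symbol, synonym, sname, cname, fam in reader:
    genus = sname.split(" ")[0]
    # Keep the first entry seen for each common name
    if cname != "" and cname not in toy_cnames:
        toy_cnames[cname] = [genus, fam]

print(toy_cnames)
```

The second toy row has commas inside two quoted fields and is still split into five columns.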

In [29]:
# Save as pickle
with open(work_dir / "usda_common_names_dict.pickle", "wb") as f:
  pickle.dump(cnames, f)

### Remove USDA names not found in the NCBI list

In [14]:
names_dic['37868']

{'authority': ['Achnatherum P.Beauv., 1812'],
 'scientific name': ['Achnatherum']}

In [15]:
# Check if all USDA genus names are found in NCBI
# This helped identify issues with the parent_child script and missing data
# due to the use of an older NCBI taxa dump file. Currently, the missing ones
# are fungal and are excluded.
cnames_overlap = {}
for cname in cnames:
  genus = cnames[cname][0]
  if genus in offspring_names:
    cnames_overlap[cname] = genus

In [16]:
common_names = list(cnames_overlap.keys())

In [24]:
# Save as pickle
with open(work_dir / "usda_common_names.pickle", "wb") as f:
  pickle.dump(common_names, f)

## ___Find names in corpus___

### Read and preprocess corpus

In [17]:
corpus = pd.read_csv(corpus_file, compression='gzip', sep='\t')

In [18]:
corpus.shape

(421658, 11)

In [19]:
# Function based on Mauro Di Pietro (2020):
#  https://towardsdatascience.com/text-classification-with-no-model-training-935fe0e42180
# For the purpose here, did not do lower-casing
def utils_preprocess_text(text, lst_stopwords, flg_stemm=False, flg_lemm=True):
    '''
    Preprocess a string.
    :parameter
        :param text: string - the text to preprocess
        :param lst_stopwords: list - list of stopwords to remove
        :param flg_stemm: bool - whether stemming is to be applied
        :param flg_lemm: bool - whether lemmatisation is to be applied
    :return
        cleaned text
    '''
    ## clean: stripping, then removing punctuations.
    text = str(text).strip()
    
    # RE: replace any character that is not alphanumeric, underscore, or
    #  whitespace with ''. Biological terms have special characters, including
    #  Greek letters, dash, and ",", so a pattern that keeps them was tried
    #  (commented out below).
    #text = re.sub(r'[^\w\s(α-ωΑ-Ω)-,]', '', text)
    # Use the original method
    text = re.sub(r'[^\w\s]', '', text)

    ## Tokenize (convert from string to list)
    lst_text = text.split()    
    
    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in 
                    lst_stopwords]
                
    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm == True:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]
                
    ## Lemmatisation (convert the word into root word)
    if flg_lemm == True:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]
            
    ## back to string from list
    text = " ".join(lst_text)
    return text

In [21]:
tqdm.pandas(desc="Clean text")
lst_stopwords       = nltk.corpus.stopwords.words("english")
corpus["txt_clean"] = corpus["txt"].progress_apply(lambda x: 
                                        utils_preprocess_text(x, lst_stopwords))
corpus.sample(2)

Clean text: 100%|██████████| 421658/421658 [05:04<00:00, 1385.83it/s]


Unnamed: 0.1,Unnamed: 0,PMID,Date,Journal,Title,Abstract,QualifiedName,txt,reg_article,y_prob,y_pred,txt_clean
125390,557681,17504472,2007-05-17,The New phytologist,Small populations are mate-poor but pollinator...,"If pollinators or compatible mates are scarce,...",plants,Small populations are mate-poor but pollinator...,1,0.825434,1,Small population matepoor pollinatorrich rare ...
321669,1165942,28756542,2017-08-02,Applied biochemistry and biotechnology,Bioethanol Production from Soybean Residue via...,Bioethanol was produced using polysaccharide f...,soybean,Bioethanol Production from Soybean Residue via...,1,0.823566,1,Bioethanol Production Soybean Residue via Sepa...


In [22]:
corpus["txt_clean"].to_csv(work_dir / "txt_clean.csv")

### Find names

- How to update CSR values:
  - https://stackoverflow.com/questions/56981077/how-to-update-value-in-csr-matrix
- See the code testing section for the different functions tried.
- Running the cell below in the notebook gave a kernel error, so the code is set aside as `script_5_1a_find_names.py` and run separately.
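
The matrix is built from three parallel lists (values, row indices, column indices) rather than by updating an existing CSR matrix row by row. A minimal sketch of that construction with hypothetical toy match vectors:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy per-document match vectors (hypothetical): 3 docs x 4 names
toy_results = [[0, 1, 0, 1],
               [0, 0, 0, 0],
               [1, 0, 0, 0]]

row_idx, col_idx = [], []
for row, result in enumerate(toy_results):
    non0 = np.nonzero(result)[0].tolist()  # columns with a match
    row_idx.extend([row] * len(non0))
    col_idx.extend(non0)

# All stored values are 1 (presence); int8 is enough for 0/1 flags
values  = np.ones(len(row_idx), dtype=np.int8)
toy_csr = csr_matrix((values, (row_idx, col_idx)),
                     shape=(len(toy_results), 4))
print(toy_csr.toarray())
```

Only the non-zero positions are stored, so the document with no matches adds nothing to the three lists.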

In [27]:
def get_match_csr(txt):

  with multiprocessing.Pool(processes=14) as pool:
    results_ncbi_list = list(tqdm(pool.imap(task, enumerate(txt)), 
                                  total=len(txt)))

  row_idx   = []
  col_idx   = []
  csr_val   = []
  for row, results_ncbi in enumerate(results_ncbi_list):
    non0_idx = np.nonzero(results_ncbi)[0].tolist()
    row_idx.extend([row]*len(non0_idx))
    col_idx.extend(non0_idx)
    csr_val.extend([1]*len(non0_idx))

  # create a sparse matrix with shape=(num_docs, num_names)
  match_csr = csr_matrix((csr_val, (row_idx, col_idx)),
                         shape=(txt.shape[0], len(offspring_names)), 
                         dtype=np.intp)  # np.int0 alias removed in NumPy 2.0

  return match_csr

def task(item):
  '''Task to parallelize
  Args:
    item (tuple): (row_number, doc)
  Return:
    results_ncbi (list): an offspring_name is present in the doc (1) or not (0)
  '''
  (row, doc) = item
  # Get the matching common names as a list
  results_usda = [name for name in common_names if(f" {name} " in doc)]

  # Add the results to doc
  for cname in results_usda:  # for each common name
    genus = cnames[cname][0]  # get the genus name
    doc += f" {genus}"        # add the genus name to doc
  
  # Match to NCBI names
  results_ncbi = [1 if(name in doc) else 0 for name in offspring_names]

  return results_ncbi

In [28]:
match_csr = get_match_csr(corpus['txt_clean'])

# Save as a pickle
with open(work_dir / "match_csr.pickle", "wb") as f:
  pickle.dump(match_csr, f)

  2%|▏         | 9301/421658 [00:38<28:43, 239.22it/s]


KeyboardInterrupt: 

In [None]:
# run the above code in script_5_1a_find_names.py and load the saved obj
with open(work_dir / "match_csr.pickle", "rb") as f:
  match_csr = pickle.load(f)

## ___Genus level counts___

### Get genus level tax_id and names

In [None]:
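# a list of genus tax_ids, restricted to Viridiplantae offspring because
# rank_taxa["genus"] covers every genus in the NCBI taxonomy
offspring_set = set(offsprings_33090)
genus_taxids  = [tax_id for tax_id in rank_taxa["genus"] 
                 if tax_id in offspring_set]
len(genus_taxids), genus_taxids[:3]

In [None]:
# convert tax_ids to scientific names
genus_names = [names_dic[tax_id]['scientific name'][0] 
               for tax_id in genus_taxids]
len(genus_names), genus_names[:3]

In [None]:
# Get match_csr column index for genus names. A dict lookup avoids calling
# list.index() (a linear scan) for each name.
name_to_idx     = {name: idx for idx, name in enumerate(offspring_names)}
genus_csr_idx   = [name_to_idx[name] for name in genus_names]
len(genus_csr_idx)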

In [None]:
# Get the genus sub-csr
genus_csr = match_csr[:, genus_csr_idx]
genus_csr.shape
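
From a genus sub-matrix like this, the overall and per-year counts listed in the goals could be derived roughly as below. This is a sketch on hypothetical toy data; the variable names are made up, and the real dates would come from the `Date` column of the corpus.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy data (hypothetical): 4 records x 2 genera, with publication dates
toy_genus_names = ["Zea", "Solanum"]
toy_genus_csr   = csr_matrix(np.array([[1, 0], [1, 1], [0, 0], [0, 1]]))
toy_dates       = pd.to_datetime(["1975-01-01", "1975-06-01",
                                  "2007-05-17", "2017-08-02"])

# Overall: number of records mentioning each genus (column sums)
overall = pd.Series(np.asarray(toy_genus_csr.sum(axis=0)).ravel(),
                    index=toy_genus_names)

# Over time: sum the presence/absence rows within each publication year
dense = pd.DataFrame(toy_genus_csr.toarray(), columns=toy_genus_names)
dense["year"] = toy_dates.year
per_year = dense.groupby("year").sum()

print(overall)
print(per_year)
```

Converting to a dense array is fine for a toy example, but for the full corpus it is probably better to sum sparse row subsets one year at a time.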

## ___Code testing___

### Testing parent-child parsing results

In [88]:
'37868' in offsprings_33090

True

In [89]:
'Achnatherum' in offspring_names

True

In [90]:
rank_d

{'no rank': 233626,
 'superkingdom': 4,
 'genus': 104243,
 'species': 1996830,
 'order': 1761,
 'family': 9909,
 'subspecies': 27153,
 'subfamily': 3203,
 'strain': 45246,
 'serogroup': 140,
 'biotype': 17,
 'tribe': 2304,
 'phylum': 292,
 'class': 462,
 'species group': 347,
 'forma': 633,
 'clade': 917,
 'suborder': 373,
 'subclass': 166,
 'varietas': 9244,
 'kingdom': 13,
 'subphylum': 32,
 'forma specialis': 746,
 'isolate': 1322,
 'infraorder': 130,
 'superfamily': 891,
 'infraclass': 19,
 'superorder': 57,
 'subgenus': 1741,
 'superclass': 6,
 'parvorder': 26,
 'serotype': 1235,
 'species subgroup': 129,
 'subcohort': 3,
 'cohort': 5,
 'genotype': 21,
 'subtribe': 582,
 'section': 479,
 'series': 9,
 'morph': 12,
 'subkingdom': 1,
 'superphylum': 1,
 'subsection': 21,
 'pathogroup': 5}

In [91]:
# Spot check parent-child
len(parent_child['147383']), '37868' in parent_child['147383']

(33, True)

In [92]:
children_147383 = '''
    Achnatherum   
    Aciachne   
    Amelichloa   
    Anatherostipa   
    Anemanthele   
    Austrostipa   
    Barkworthia   
    Celtica   
    Eriocoma   
    Hesperostipa   
    Jarava   
    Lorenzochloa   
    Macrochloa   
    Nassella   
    Neotrinia   
    Oloptum   
    Ortachne   
    Oryzopsis (ricegrass)   
    Pappostipa   
    Patis   
    Piptatheropsis   
    Piptatherum   
    Piptochaetium   
    Psammochloa   
    Pseudoeriocoma   
    Ptilagrostiella   
    Ptilagrostis   
    Stipa   
    Stipellula   
    Thorneochloa   
    Timouria   
    Trikeraia   
    x Eriosella'''

### Test string preprocessing

In [93]:
doc = corpus['txt'][1]
doc = str(doc).strip()
doc = re.sub(r'[^\w\s]', '', doc)
doc

'Cholinesterases from plant tissues VI Preliminary characterization of enzymes from Solanum melongena L and Zea mays L Enzymes capable of hydrolyzing esters of thiocholine have been assayed in extracts of Solanum melongena L eggplant and Zea Mays L corn The enzymes from both species are inhibited by the anticholinesterases neostigmine physostigmine and 284c51 and by AMO1618 a plant growth retardant and they both have pH optima near pH 80 The enzyme from eggplant is maximally active at a substrate concentration of 015 mM acetylthiocholine and is inhibited at higher substrate concentrations On the basis of this last property the magnitude of inhibition by the various inhibitors and the substrate specificity we conclude that the enzyme from eggplant but not that from corn is a cholinesterase'

In [94]:
children_147383_parsed = []
for child in children_147383.split("\n"):
  children_147383_parsed.append(child.strip())

In [95]:
len(children_147383_parsed), len(parent_child['147383'])

(34, 33)

In [96]:
parent_child['147368']

['147381',
 '191503',
 '888031',
 '1648037',
 '1648038',
 '2822795',
 '2822796',
 '2822797']

In [97]:
# Still missing 1, this is a synonym
for child in children_147383_parsed:
  if child not in debug_list:
    print(child)


Oryzopsis (ricegrass)


In [98]:
# Spot check child_parent
for child in parent_child[target_id]:
  print(child, taxa_rank[child], child_parent[child])

3041 phylum 33090
35493 phylum 33090
144313 no rank 33090
144314 no rank 33090
2806169 phylum 33090


In [99]:
# this was missing in the 2021 taxa dump file, now it is there
child_parent['2950019']

'147383'

### Testing find names

#### Use list comprehension

In [169]:
doc = corpus['txt_clean'][1]
results_ncbi = [name for name in offspring_names if(name in doc)]
print(results_ncbi)

['Zea', 'Solanum']


In [170]:
results_ncbi = [1 if(name in doc) else 0 for name in offspring_names]
print(sum(results_ncbi))

2


In [171]:
# Note that a space is added on each side of the name as padding, to prevent
# matching substrings of longer words.
results_usda = [name for name in common_names if(f" {name} " in doc)]
print(results_usda)

#results_usda = [1 if(f" {name} " in doc) else 0 for name in common_names ]
#print(sum(results_usda))

['eggplant', 'corn']
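
One limitation of the space padding is that a name at the very start or end of the text has no space on one side and is missed. A regular expression with word boundaries is one possible alternative. The sketch below is not what the notebook uses, `find_common_names` is a hypothetical helper, and it would likely be slower over the full name list.

```python
import re

def find_common_names(doc, names):
    """Return the names that occur in doc as whole words or phrases."""
    hits = []
    for name in names:
        # \b also matches at the very start and end of the text;
        # re.escape protects any special characters in the name.
        if re.search(rf"\b{re.escape(name)}\b", doc):
            hits.append(name)
    return hits

toy_doc  = "corn and eggplant enzymes differ from popcorn"
toy_hits = find_common_names(toy_doc, ["corn", "eggplant", "pop"])
print(toy_hits)

# The padded test misses "corn" here: the only standalone "corn" starts the text
print(" corn " in toy_doc)
```

In the toy text, "pop" is not reported because it only occurs inside "popcorn".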


In [172]:
## Get genus names for common name matches and add them to the doc
for cname in results_usda:
  genus = cnames[cname][0]
  doc += f" {genus}"
print(doc)


Cholinesterases plant tissue VI Preliminary characterization enzyme Solanum melongena L Zea may L Enzymes capable hydrolyzing ester thiocholine assayed extract Solanum melongena L eggplant Zea Mays L corn The enzyme specie inhibited anticholinesterase neostigmine physostigmine 284c51 AMO1618 plant growth retardant pH optimum near pH 80 The enzyme eggplant maximally active substrate concentration 015 mM acetylthiocholine inhibited higher substrate concentration On basis last property magnitude inhibition various inhibitor substrate specificity conclude enzyme eggplant corn cholinesterase Solanum Zea


In [173]:
# Do the count again
results_ncbi = [1 if(name in doc) else 0 for name in offspring_names]
print(sum(results_ncbi))

2


In [174]:
# Get the non-zero indices
non0_idx = np.nonzero(results_ncbi)[0].tolist()
type(non0_idx), non0_idx

(list, [4360, 16432])

In [175]:
test_col_idx = [1,2]
test_col_idx.extend(non0_idx)
test_col_idx

[1, 2, 4360, 16432]

In [176]:
[1]*3

[1, 1, 1]

#### Use ordered dict comprehension

In [183]:
# test
test_names = ["eggplant", "corn", "spinach"]
[name if(f" {name} " in doc) else 0 for name in test_names ]

['eggplant', 'corn', 0]

In [184]:
OrderedDict((name, 1) if(f" {name} " in doc) else (name, 0) 
                                              for name in test_names)

OrderedDict([('eggplant', 1), ('corn', 1), ('spinach', 0)])

In [187]:
type(common_names), "eggplant" in common_names

(list, True)

In [188]:
results_usda = OrderedDict(
  (name, 1) if(f" {name} " in doc) else (name, 0) for name in common_names)

In [190]:
results_usda['eggplant'], results_usda['corn'], results_usda['spinach']

(1, 1, 0)

### Testing `get_match_csr`

In [108]:
# testing
test_txt_clean = corpus['txt_clean'][:5]
test_match_csr = get_match_csr(test_txt_clean)

100%|██████████| 5/5 [00:00<00:00, 24.48it/s]


In [109]:
# Expected 421658 x 26782
test_match_csr.shape

(421658, 26782)

In [117]:
col_sum = test_match_csr.sum(axis=0)
col_sum.shape

(1, 26782)

In [118]:
non0_idx = col_sum.nonzero()[1]
non0_idx


array([ 4360,  5401, 10061, 16432, 21477, 21986])

In [119]:
for idx in non0_idx:
  print(offspring_names[idx])

Zea
Hordeum
Spinacia
Solanum
Sesbania
Arachis


In [121]:
# Spinach
test_txt_clean[0]

'Identification 120 mu phase decay delayed fluorescence spinach chloroplast subchloroplast particle intrinsic back reaction The dependence level phase thylakoids internal pH After 500 mu laser flash 120 mu phase decay delayed fluorescence visible variety circumstance spinach chloroplast subchloroplast particle enriched Photosystem II prepared mean digitonin The level phase high case inhibition oxygen evolution donor side Photosystem II Comparison result Babcock Sauer 1975 Biochim Biophys Acta 376 329344 indicates EPR signal IIf suppose due Z oxidized first secondary donor Photosystem II well correlated large amplitude 120 mu phase We explain 120 mu phase intrinsic back reaction excited reaction center presence Z predicted Van Gorkom Donze 1973 Photochem Photobiol 17 333342 The redox state Z dependent internal pH thylakoids The result effect pH mu region compared obtained m region'

In [122]:
# Solanum, Zea
test_txt_clean[1]

'Cholinesterases plant tissue VI Preliminary characterization enzyme Solanum melongena L Zea may L Enzymes capable hydrolyzing ester thiocholine assayed extract Solanum melongena L eggplant Zea Mays L corn The enzyme specie inhibited anticholinesterase neostigmine physostigmine 284c51 AMO1618 plant growth retardant pH optimum near pH 80 The enzyme eggplant maximally active substrate concentration 015 mM acetylthiocholine inhibited higher substrate concentration On basis last property magnitude inhibition various inhibitor substrate specificity conclude enzyme eggplant corn cholinesterase'

In [123]:
# Arachis, Sesbania  
test_txt_clean[2]

'Fructose 16bisphosphate aldolase activity Rhizobium specie FDP aldolase found present cellfree extract Rhizobium leguminosarum Rhizobium phaseoli Rhizobium trifolii Rhizobium meliloti Rhizobium lupini Rhizobium japonicum Rhizobium specie Arachis hypogaea Sesbania cannabina The enzyme 3 representative specie optimal activity pH 84 02M veronal buffer The enzyme activity completely lost treatment 60 degree C 15 min The Km value range 238 455 X 106M FDP Metal chelating agent inhibited enzyme activity monovalent bivalent metal ion failed stimulate activity Bivalent metal ion general rather inhibitory'

In [124]:
# barley  
test_txt_clean[3]

'Studies trypsin inhibitor barley I Purification property To clarify property function trypsin inhibitor Japanese barley comparison inhibitor Pirkka barley inhibitor isolated barley Hordeum distichum L var emend Lamark extraction 1 NaCl ammonium sulfate fractionation repeated chromatography DEAEcellulose CMcellulose The final purified preparation inhibitor found homogeneous chromatographic electrophoretic analysis The inhibitor thermostable stable broad pH range 2 11 No inhibition observed heavy metal ion many reagent 102 M except pchloromercuribenzoate caused 69 loss activity The inhibitor subjected isoelectric focusing pH 751 molecular weight calculated 14200900 polyacrylamide gel electrophoresis presence sodium dodecyl sulfate The apparent dissociation constant complex inhibitor trypsinEC 34214 164 X 107M casein substrate One microgram purified inhibitor inhibited 15 mug pure trypsin hydrolysis alphaNbenzoylDLargininepnitroanilide By chemical modification arginyl residue inhibitor 1

In [125]:
# Soybean , Sesbania  
test_txt_clean[4]

'Reconstitution ion transport respiratory control vesicle formed reduced coenzyme Qcytochrome c reductase phospholipid Reduced coenzyme Qcytochrome c reductase bovine heart mitochondrion complex III incorporated phospholipid vesicle cholate dialysis procedure Soybean phospholipid mixture purified phosphatidylcholine phosphatidylethanolamine cardiolipin could used Oxidation reduced coenzyme Q2 reconstituted vesicle cytochrome c oxidant showed following energycoupling phenomenon 1 Protons translocated outward coupling ratio H2e 19 02 Measurements mitochondrion similar condition showed H2e ratio 18 Proton translocation seen presence uncoupling agent addition net acidification medium overall oxidation reaction 2 Potassium ion taken reconstituted vesicle presence valinomycin reaction coupled electron transfer The coupling ratio K uptake K2e 20 vesicle approximately 15 mitochondrion 3 The rate oxidation reduced coenzyme Q2 reconstituted vesicle stimulated 10fold uncouplers valinomycin plus n

### Testing `get_match_csr` to get run time estimate

https://stackoverflow.com/questions/28427236/set-row-of-csr-matrix

#### Create lists of values, row_idx, and col idx, then create csr

In [142]:
def get_match_csr_v1(txt):
  row_idx   = []
  col_idx   = []
  csr_val   = []
  for row, doc in enumerate(tqdm(txt)):
    # Get the matching common names as a list
    results_usda = [name for name in common_names if(f" {name} " in doc)]

    # Add the results to doc
    for cname in results_usda:  # for each common name
      genus = cnames[cname][0]  # get the genus name
      doc += f" {genus}"        # add the genus name to doc
    
    # Match to NCBI names
    results_ncbi = [1 if(name in doc) else 0 for name in offspring_names]

    # Assign row_idx, col_idx, and values for non-zero results_ncbi
    non0_idx = np.nonzero(results_ncbi)[0].tolist()
    row_idx.extend([row]*len(non0_idx))
    col_idx.extend(non0_idx)
    csr_val.extend([1]*len(non0_idx))

  # create a sparse matrix with shape=(num_docs, num_names)
  match_csr = csr_matrix((csr_val, (row_idx, col_idx)),
                         shape=(corpus.shape[0], len(offspring_names)), 
                         dtype=np.intp)

  return match_csr


In [143]:
test100 = corpus['txt_clean'][:100]
t = time()
test100_csr1 = get_match_csr_v1(test100)
print(time()-t)

100%|██████████| 100/100 [00:04<00:00, 22.50it/s]


4.452066421508789


#### Create empty match_csr first

In [197]:
# Create empty match_csr, then create a tmp_csr with row_idx, col_idx, values,
# then add match_csr with the tmp_csr
def get_match_csr_v2(txt):

  # create an empty sparse matrix with shape=(num_docs, num_names)
  match_csr = csr_matrix((corpus.shape[0], len(offspring_names)), 
                         dtype=np.intp)

  for row, doc in enumerate(tqdm(txt)):
    # Get the matching common names as a list
    results_usda = [name for name in common_names if(f" {name} " in doc)]

    # Add the results to doc
    for cname in results_usda:  # for each common name
      genus = cnames[cname][0]  # get the genus name
      doc += f" {genus}"        # add the genus name to doc
    
    # Match to NCBI names
    results_ncbi = [1 if(name in doc) else 0 for name in offspring_names]

    # Assign row_idx, col_idx, and values for non-zero results_ncbi
    non0_idx = np.nonzero(results_ncbi)[0].tolist()
    row_idx  = [row]*len(non0_idx)
    col_idx  = non0_idx
    csr_val  = [1]*len(non0_idx)

    # Create a tmp csr to hold this row
    tmp_csr = csr_matrix((csr_val, (row_idx, col_idx)),
                         shape=(corpus.shape[0], len(offspring_names)), 
                         dtype=np.intp)

    # Update match_csr by adding tmp_csr to it
    match_csr = match_csr + tmp_csr

  return match_csr


In [198]:
t = time()
test100_csr_v2 = get_match_csr_v2(test100)
print(time()-t)

100%|██████████| 100/100 [00:04<00:00, 21.38it/s]

4.682857036590576





#### Assign a list to a row in match_csr

In [150]:
def get_match_csr_v3(txt):

  # create an empty sparse matrix with shape=(num_docs, num_names)
  match_csr = csr_matrix((corpus.shape[0], len(offspring_names)), 
                         dtype=np.intp)

  row_idx   = []
  col_idx   = []
  csr_val   = []
  for row, doc in enumerate(tqdm(txt)):
    # Get the matching common names as a list
    results_usda = [name for name in common_names if(f" {name} " in doc)]

    # Add the results to doc
    for cname in results_usda:  # for each common name
      genus = cnames[cname][0]  # get the genus name
      doc += f" {genus}"        # add the genus name to doc
    
    # Match to NCBI names
    results_ncbi = [1 if(name in doc) else 0 for name in offspring_names]

    # Assign row_idx, col_idx, and values for non-zero results_ncbi
    #non0_idx = np.nonzero(results_ncbi)[0].tolist()
    #row_idx.extend([row]*len(non0_idx))
    #col_idx.extend(non0_idx)
    #csr_val.extend([1]*len(non0_idx))

    # Assign new row values to match_csr
    match_csr[row, :] = results_ncbi

  return match_csr


In [151]:
t = time()
test100_csr_v3 = get_match_csr_v3(test100)
print(time()-t)

  self._set_arrayXarray(i, j, x)
100%|██████████| 100/100 [00:10<00:00,  9.70it/s]

10.317140579223633





#### Generate lil matrix instead

In [199]:
# Similar to v3, but uses a lil matrix instead. A coo matrix was also tried,
# but it does not work (coo matrices do not support item assignment).
def get_match_lil_v4(txt):

  # create an empty sparse matrix with shape=(num_docs, num_names)
  # instead of csr, use lil
  match_lil = lil_matrix((corpus.shape[0], len(offspring_names)),
                         dtype=np.intp)

  #row_idx   = []
  #col_idx   = []
  #csr_val   = []
  for row, doc in enumerate(tqdm(txt)):
    # Get the matching common names as a list
    results_usda = [name for name in common_names if(f" {name} " in doc)]

    # Add the results to doc
    for cname in results_usda:  # for each common name
      genus = cnames[cname][0]  # get the genus name
      doc += f" {genus}"        # add the genus name to doc
    
    # Match to NCBI names
    results_ncbi = [1 if(name in doc) else 0 for name in offspring_names]

    # Assign row_idx, col_idx, and values for non-zero results_ncbi
    #non0_idx = np.nonzero(results_ncbi)[0].tolist()
    #row_idx.extend([row]*len(non0_idx))
    #col_idx.extend(non0_idx)
    #csr_val.extend([1]*len(non0_idx))

    # Assign new row values to match_lil
    match_lil[row, :] = np.asarray(results_ncbi)

  return match_lil


In [200]:
t = time()
test100_lil_v4 = get_match_lil_v4(test100)
print(time()-t)

100%|██████████| 100/100 [00:04<00:00, 22.62it/s]


4.567139387130737
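
Since lil matrices are meant for incremental, row-wise construction while csr is better suited to arithmetic and slicing, one option is to fill a lil matrix as in v4 and convert it once at the end with `.tocsr()`. A minimal sketch, where `docs` and `names` are hypothetical toy stand-ins for `test100` and `offspring_names`:

```python
import numpy as np
from scipy.sparse import lil_matrix

# Toy stand-ins (assumptions, not the real data)
docs  = ["Solanum melongena and Zea mays", "Arachis hypogaea", "no plant here"]
names = ["Arachis", "Solanum", "Zea"]

# Fill a lil matrix row by row, as in get_match_lil_v4
match_lil = lil_matrix((len(docs), len(names)), dtype=np.int64)
for row, doc in enumerate(docs):
    results = [1 if name in doc else 0 for name in names]
    match_lil[row, :] = np.asarray(results)

# Convert once at the end for downstream use
match_csr = match_lil.tocsr()
print(match_csr.toarray())
```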


#### Try dok_matrix

In [166]:
def get_match_dok_v5(txt):

  # create an empty sparse matrix with shape=(num_docs, num_names)
  # instead of csr, use dok
  match_dok = dok_matrix((corpus.shape[0], len(offspring_names)),
                         dtype=np.intp)

  row_idx   = []
  col_idx   = []
  csr_val   = []
  for row, doc in enumerate(tqdm(txt)):
    # Get the matching common names as a list
    results_usda = [name for name in common_names if(f" {name} " in doc)]

    # Add the results to doc
    for cname in results_usda:  # for each common name
      genus = cnames[cname][0]  # get the genus name
      doc += f" {genus}"        # add the genus name to doc
    
    # Match to NCBI names
    results_ncbi = [1 if(name in doc) else 0 for name in offspring_names]

    # Assign row_idx, col_idx, and values for non-zero results_ncbi
    #non0_idx = np.nonzero(results_ncbi)[0].tolist()
    #row_idx.extend([row]*len(non0_idx))
    #col_idx.extend(non0_idx)
    #csr_val.extend([1]*len(non0_idx))

    # Assign new row values to match_dok
    match_dok[row, :] = results_ncbi

  return match_dok


In [167]:
t = time()
test100_dok_v5 = get_match_dok_v5(test100)
print(time()-t)

100%|██████████| 100/100 [00:06<00:00, 14.87it/s]

6.7292234897613525





#### Use ordered dictionary comprehension

This is extremely slow: about 5 s per document in the run below, which was interrupted after 3 of the 100 test documents.

In [191]:
#https://www.pythonpool.com/python-ordereddict/
def get_match_csr_v6(txt):
  row_idx   = []
  col_idx   = []
  csr_val   = []
  for row, doc in enumerate(tqdm(txt)):
    # Get the matching common names as an ordered dict
    results_usda = OrderedDict((name, 1) if(f" {name} " in doc) else (name, 0) 
                               for name in common_names)

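    # Add the results to doc
    # NOTE: results_usda has every common name as a key (value 0 or 1), so
    # this loop appends the genus of every common name, matched or not
    for cname in results_usda:  # for each common name
      genus = cnames[cname][0]  # get the genus name
      doc += f" {genus}"        # add the genus name to doc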
    
    # Match to NCBI names as an ordered dict
    results_ncbi = OrderedDict((name, 1) if(name in doc) else (name, 0) 
                               for name in offspring_names)

    # Assign row_idx, col_idx, and values for non-zero results_ncbi
    # Convert the dict view to a list so np.nonzero sees a 1-D sequence
    non0_idx = np.nonzero(list(results_ncbi.values()))[0].tolist()
    row_idx.extend([row]*len(non0_idx))
    col_idx.extend(non0_idx)
    csr_val.extend([1]*len(non0_idx))

  # create a sparse matrix with shape=(num_docs, num_names)
  match_csr = csr_matrix((csr_val, (row_idx, col_idx)),
                         shape=(corpus.shape[0], len(offspring_names)),
                         dtype=np.intp)

  return match_csr


In [192]:
t = time()
test100_csr6 = get_match_csr_v6(test100)
print(time()-t)

  3%|▎         | 3/100 [00:15<08:31,  5.27s/it]


KeyboardInterrupt: 
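
One thing to watch with the ordered-dict version: iterating over an `OrderedDict` yields all of its keys, including those whose value is 0, so `for cname in results_usda` visits every common name and not only the matched ones. A minimal sketch of the difference, using a made-up three-name list and document (both hypothetical):

```python
from collections import OrderedDict

# Toy stand-ins (assumptions, not the real common_names or corpus)
common_names = ["corn", "eggplant", "tomato"]
doc = " the enzyme in eggplant and corn "

# Same construction as in get_match_csr_v6: one (name, 0/1) pair per name
flags = OrderedDict((name, 1) if f" {name} " in doc else (name, 0)
                    for name in common_names)

# Iterating the dict directly yields every key, matched or not
all_keys = list(flags)

# Filtering on the stored flag keeps only the names found in doc
hits = [name for name, found in flags.items() if found]

print(all_keys)
print(hits)
```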

#### Try multiprocessing

- https://superfastpython.com/multiprocessing-pool-for-loop/
- https://stackoverflow.com/questions/42749772/multiprocessing-how-to-use-pool-map-on-a-list-and-function-with-arguments
- https://python.omics.wiki/multiprocessing_map/multiprocessing_partial_function_multiple_arguments
- https://stackoverflow.com/questions/41920124/multiprocessing-use-tqdm-to-display-a-progress-bar

In [202]:
data_pairs = [ [3,5], [4,3], [7,3], [1,6] ]

def myfunc(p):
  product_of_list = np.prod(p)
  return product_of_list


# Use the pool as a context manager so worker processes are cleaned up
with multiprocessing.Pool(processes=4) as pool:
  result_list = pool.map(myfunc, data_pairs)
print(result_list)

[15, 12, 21, 6]


In [205]:
for i in enumerate(test100):
  print(type(i))
  break

<class 'tuple'>


In [245]:
def get_match_csr_v7(txt):

  with multiprocessing.Pool(processes=15) as pool:
    results_ncbi_list = list(tqdm(pool.imap(task, enumerate(txt)), 
                                  total=len(txt)))

  row_idx   = []
  col_idx   = []
  csr_val   = []
  for row, results_ncbi in enumerate(results_ncbi_list):
    non0_idx = np.nonzero(results_ncbi)[0].tolist()
    row_idx.extend([row]*len(non0_idx))
    col_idx.extend(non0_idx)
    csr_val.extend([1]*len(non0_idx))

  # create a sparse matrix with shape=(num_docs, num_names)
  match_csr = csr_matrix((csr_val, (row_idx, col_idx)),
                         shape=(txt.shape[0], len(offspring_names)),
                         dtype=np.intp)

  return match_csr

def task(item):
  '''Task to parallelize
  Args:
    item (tuple): (row_number, doc)
  Return:
    results_ncbi (list): an offspring_name is present in the doc (1) or not (0)
  '''
  (row, doc) = item
  # Get the matching common names as a list
  results_usda = [name for name in common_names if(f" {name} " in doc)]

  # Add the results to doc
  for cname in results_usda:  # for each common name
    genus = cnames[cname][0]  # get the genus name
    doc += f" {genus}"        # add the genus name to doc
  
  # Match to NCBI names
  results_ncbi = [1 if(name in doc) else 0 for name in offspring_names]

  return results_ncbi

In [246]:
t = time()
test100_csr7 = get_match_csr_v7(test100)
print(time()-t)

100%|██████████| 100/100 [00:00<00:00, 107.34it/s]


1.6789848804473877


In [236]:
test100_csr7.shape

(100, 26782)

In [237]:
test100[1]

'Cholinesterases plant tissue VI Preliminary characterization enzyme Solanum melongena L Zea may L Enzymes capable hydrolyzing ester thiocholine assayed extract Solanum melongena L eggplant Zea Mays L corn The enzyme specie inhibited anticholinesterase neostigmine physostigmine 284c51 AMO1618 plant growth retardant pH optimum near pH 80 The enzyme eggplant maximally active substrate concentration 015 mM acetylthiocholine inhibited higher substrate concentration On basis last property magnitude inhibition various inhibitor substrate specificity conclude enzyme eggplant corn cholinesterase'

In [238]:
offspring_names.index("Solanum"), offspring_names.index("Zea")

(16432, 4360)

In [239]:
# Expect (1, 1, 0): Solanum and Zea are mentioned in test100[1];
# column 16431 is a neighboring name that is not
test100_csr7[1, 16432], test100_csr7[1, 4360], test100_csr7[1, 16431]

(1, 1, 0)

In [240]:
test100[2]

'Fructose 16bisphosphate aldolase activity Rhizobium specie FDP aldolase found present cellfree extract Rhizobium leguminosarum Rhizobium phaseoli Rhizobium trifolii Rhizobium meliloti Rhizobium lupini Rhizobium japonicum Rhizobium specie Arachis hypogaea Sesbania cannabina The enzyme 3 representative specie optimal activity pH 84 02M veronal buffer The enzyme activity completely lost treatment 60 degree C 15 min The Km value range 238 455 X 106M FDP Metal chelating agent inhibited enzyme activity monovalent bivalent metal ion failed stimulate activity Bivalent metal ion general rather inhibitory'

In [241]:
offspring_names.index("Arachis"), offspring_names.index("Sesbania")

(21986, 21477)

In [242]:
test100_csr7[2, 21986], test100_csr7[2, 21477]

(1, 1)
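
Instead of checking one column at a time, the matched names for a document can be read from the column indices stored in that row of the csr matrix. A minimal sketch on toy data; `matched_names` is a hypothetical helper, and the four-name list and 2 x 4 matrix are made up for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins (assumptions, not the real offspring_names or test100_csr7)
names = ["Arachis", "Sesbania", "Solanum", "Zea"]
match = csr_matrix(np.array([[0, 0, 1, 1],
                             [1, 1, 0, 0]]))

def matched_names(m, row, names):
    """Return the sorted names whose columns are non-zero in the given row."""
    # m[row] is a 1 x N csr matrix; .indices holds its stored column indices
    return sorted(names[col] for col in m[row].indices)

print(matched_names(match, 0, names))
print(matched_names(match, 1, names))
```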