# Intro

This script/notebook takes a data file that was scraped from APS.org using the Samara scraper.

That file is first consumed by MAL's script, which inspects the disease names and determines which plant part is affected.  It adds two columns to the data file, resulting in a file with 14 columns.

This script then takes that file (aps_with_po_mappings.tsv), removes unneeded columns, and formats the remaining columns for consumption by DOSDP tools.

DOSDP tools consume a .tsv and create an ontology file.

In [1]:
# Imports
import csv
import pandas as pd
import numpy as np

## desired header order:
    defined_class	defined_class label	pathogen	pathogen label	host	host label	plantstructure	plantstructure label

defined_class (OOPS:xxx)

defined_class label (disease common name)

pathogen (NCBITaxon:xxxxxx)

pathogen label (pathogen name)

host (NCBITaxon:xxxxxx)

host label (host common name)

plantstructure (PO:xxx)

plantstructure label (common name)

# Clean the original file to prepare for pandas dataframe loading

This file was created by MAL using her script to identify plant structures contained within the original Samara scrape.

The headers are intact (Feb 2018), but the file is missing headers for the two new plant structure columns (plant_part_id and plant_part_name).  These are added.

Then the file is loaded into a pandas dataframe.

In [2]:
raw_scrape_file = '../data/aps_with_po_mappings.tsv'

In [3]:
# load in the headers:  Need to check if they are expected.  
column_headers = None

# print the first line of the file:  Shows there is no header. (FIXED: feb 2018)
with open(raw_scrape_file,'r') as infile:
    first_line = infile.readline()
    column_headers = first_line.split('\t')
    print(column_headers)
    print(len(column_headers))
    
# add the missing column headers
column_headers.append("plant_part_id")
column_headers.append("plant_part_name")
print(type(column_headers),len(column_headers))

######  NOTE THE NEWLINE CHARACTER AT THE END OF THE HEADER COLUMN!!!!
# remove extra whitespace from headers
clean_headers = [x.strip() for x in column_headers]

# we need to check if the headers are what we expect.  Raise an error otherwise
expected_headers = ['disease_name', 'source_taxon_verbatim_name', 'source_taxon_name', 'source_taxon_id', 'interaction_type_label', 'interaction_type_id', 'target_taxon_verbatim_name', 'target_taxon_name', 'target_taxon_id', 'source_citation', 'source_url', 'source_accessed_at', 'plant_part_id', 'plant_part_name']

if clean_headers == expected_headers:
    print("Headers are as expected.  Continue as planned.")
else:
    raise ValueError('infile headers do not match expected headers')




['disease_name', 'source_taxon_verbatim_name', 'source_taxon_name', 'source_taxon_id', 'interaction_type_label', 'interaction_type_id', 'target_taxon_verbatim_name', 'target_taxon_name', 'target_taxon_id', 'source_citation', 'source_url', 'source_accessed_at\n']
12
<class 'list'> 14
Headers are as expected.  Continue as planned.
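
When the headers do not match, it helps to know which ones differ. The following is a minimal sketch, assuming a hypothetical helper `compare_headers` (not part of the original notebook), that reports missing and unexpected header names instead of a plain yes/no:

```python
def compare_headers(found, expected):
    """Return (missing, unexpected) header names, preserving order.

    Hypothetical helper, not part of the original notebook.
    """
    # strip whitespace so a trailing newline on the last header is ignored
    found_clean = [h.strip() for h in found]
    missing = [h for h in expected if h not in found_clean]
    unexpected = [h for h in found_clean if h not in expected]
    return missing, unexpected


# toy example: the last header carries a newline and is misspelled
found = ['disease_name', 'source_taxon_id', 'source_acessed_at\n']
expected = ['disease_name', 'source_taxon_id', 'source_accessed_at']
missing, unexpected = compare_headers(found, expected)
print('missing:', missing)
print('unexpected:', unexpected)
```

Both lists are empty when the headers are as expected, so the same helper could feed the check above.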


# Load into Pandas DF:

Now that the original data is cleaned up, we can load it into a pandas dataframe for some further QC

In [4]:
# load in the raw file into a pandas df.  
# skiprows = don't read the header row.  We checked the headers earlier, and had to add the two extra ones.
df = pd.read_csv(raw_scrape_file, sep='\t', header=None, skiprows=1, names=clean_headers)

df.shape

(11751, 14)

# fill in blank plant parts with "whole plant"

The last two columns contain a lot of "NaN" values.  That's because Marie's script only filled in 'plant_part_id' and 'plant_part_name' for rows where a plant part could be found in the name/row.  This left a lot of blanks (NaN) in the pandas dataframe.  These need to be filled to prevent errors down the road.

Note that the blanks are filled with the bare ID "PO_0000003", not the full PURL.  The full PURL was deemed a problem, so it also needs to be removed from the other IDs.

In [5]:
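# fill with the bare ID (same PO_ form as the IDs left after the PURL is stripped)
df['plant_part_id'] = df['plant_part_id'].fillna('PO_0000003')
df['plant_part_name'] = df['plant_part_name'].fillna('whole plant')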

df

Unnamed: 0,disease_name,source_taxon_verbatim_name,source_taxon_name,source_taxon_id,interaction_type_label,interaction_type_id,target_taxon_verbatim_name,target_taxon_name,target_taxon_id,source_citation,source_url,source_accessed_at,plant_part_id,plant_part_name
0,Bacterial leaf spot,Pseudomonas cichorii (Swingle 1925) Stapp 1928,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:1441629,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
1,Southern wilt,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:1457195,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0000003,whole plant
2,Southern wilt,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0000003,whole plant
3,Ascochyta leaf spot,Ascochyta doronici Allesch.,Ascochyta doronici Allesch.,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
4,Alternaria leaf spot,Alternaria alternata (Fr.:Fr.) Keissl.,Alternaria alternata (Fr.,NCBITaxon:187775,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
5,Alternaria leaf spot,Alternaria alternata (Fr.:Fr.) Keissl.,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
6,Alternaria leaf spot,A. dauci (Kühn) Groves & Skolko,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
7,Alternaria leaf spot,A. gerberae Rabbe et al.,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
8,Black root rot,Thielaviopsis basicola (Berk. & Broome) Ferraris,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:124036,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0009005,root
9,Black root rot,Chalara elegans Nag Raj & Kendr. [synanam...,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:301394,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0009005,root


In [8]:
# remove the url from the plant part IDs

# take only the last 10 characters of the column (the IDs, e.g. PO_0025034, are 10 characters long)
df['plant_part_id'] = df['plant_part_id'].str[-10:]
df

Unnamed: 0,disease_name,source_taxon_verbatim_name,source_taxon_name,source_taxon_id,interaction_type_label,interaction_type_id,target_taxon_verbatim_name,target_taxon_name,target_taxon_id,source_citation,source_url,source_accessed_at,plant_part_id,plant_part_name
0,Bacterial leaf spot,Pseudomonas cichorii (Swingle 1925) Stapp 1928,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:1441629,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
1,Southern wilt,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:1457195,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0000003,whole plant
2,Southern wilt,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0000003,whole plant
3,Ascochyta leaf spot,Ascochyta doronici Allesch.,Ascochyta doronici Allesch.,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
4,Alternaria leaf spot,Alternaria alternata (Fr.:Fr.) Keissl.,Alternaria alternata (Fr.,NCBITaxon:187775,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
5,Alternaria leaf spot,Alternaria alternata (Fr.:Fr.) Keissl.,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
6,Alternaria leaf spot,A. dauci (Kühn) Groves & Skolko,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
7,Alternaria leaf spot,A. gerberae Rabbe et al.,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
8,Black root rot,Thielaviopsis basicola (Berk. & Broome) Ferraris,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:124036,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0009005,root
9,Black root rot,Chalara elegans Nag Raj & Kendr. [synanam...,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:301394,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0009005,root
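
Slicing the last 10 characters works because every PO identifier here has the same length. A slightly more defensive sketch, assuming a hypothetical helper `purl_to_id` (not in the original notebook), keeps whatever follows the final slash, so an ID of a different length would also survive:

```python
def purl_to_id(value):
    """Return the part of an OBO PURL after the last '/'.

    Values without a '/' (already bare IDs) are returned unchanged.
    Hypothetical helper, not part of the original notebook.
    """
    return value.rsplit('/', 1)[-1]


examples = [
    'http://purl.obolibrary.org/obo/PO_0025034',
    'PO_0000003',
]
for value in examples:
    print(value, '->', purl_to_id(value))
```

On the dataframe this could be applied with `df['plant_part_id'].map(purl_to_id)`.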



The data must be in a specific format for digestion using patternapply.py, so it is massaged using Python's pandas.

## Steps:

1. Make a copy of the original dataframe to avoid confusion

2. Remove all columns not needed (keep disease_name to feed disease_name_formatter())

3. Add additional columns needed by 'pattern_apply.py' (defined_class and defined_class label)

4. Rename all column headers to comply with the pattern

5. Remove all rows with missing data (maybe move them to a new 'rejects.tsv' for later use)

6. Write the remaining df to file

7. Run pattern_apply.py on it (this is now replaced by DOSDP tools)


## desired header order:
defined_class, defined_class label, pathogen, pathogen label, host, host label, plantstructure, plantstructure label

[defined_class (OOPS:xxx), defined_class label (disease common name), pathogen (NCBITaxon:xxxxxx), pathogen label (pathogen name), host (NCBITaxon:xxxxxx), host label (host common name), plantstructure (PO:xxx), plantstructure label (common name)]

In [10]:
#### Step 1 ####
# start with a copy of the original dataframe.
patterndf = df.copy()


#### Step 2 ####
# remove unused columns
patterndf.drop(columns=[
    "interaction_type_label",
    "interaction_type_id",
    "source_citation",
    "source_url",
    "source_accessed_at",
    "target_taxon_verbatim_name",
    "source_taxon_verbatim_name",
], inplace=True)

#### Step 3 ####
# add defined_class and defined_class label columns
patterndf['defined_class'] = 'OOPS:{}'  # in need of a system to assign ID numbers...
# just rename the disease name column as the defined_class label
patterndf.rename(columns = {'disease_name':'defined_class label'}, inplace = True)
# add a synonyms column, populate it with the original disease name
patterndf['synonyms'] = patterndf['defined_class label']

#### Step 4 ####
# rename column headers to accurately reflect the patternapply needs
patterndf.rename(columns={
    'source_taxon_name': 'pathogen label',
    'source_taxon_id': 'pathogen',
    'target_taxon_id': 'host',
    'target_taxon_name': 'host label',
    'plant_part_id': 'plantstructure',
    'plant_part_name': 'plantstructure label',
}, inplace=True)

#### Step 4.5 ####
## reorder the columns
desired_order = ['defined_class', 'defined_class label','pathogen','pathogen label','host','host label','plantstructure','plantstructure label','synonyms']
patterndf = patterndf[desired_order]

patterndf

Unnamed: 0,defined_class,defined_class label,pathogen,pathogen label,host,host label,plantstructure,plantstructure label,synonyms
0,OOPS:{},Bacterial leaf spot,NCBITaxon:1441629,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf,Bacterial leaf spot
1,OOPS:{},Southern wilt,NCBITaxon:1457195,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0000003,whole plant,Southern wilt
2,OOPS:{},Southern wilt,no:match,,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0000003,whole plant,Southern wilt
3,OOPS:{},Ascochyta leaf spot,no:match,Ascochyta doronici Allesch.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf,Ascochyta leaf spot
4,OOPS:{},Alternaria leaf spot,NCBITaxon:187775,Alternaria alternata (Fr.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf,Alternaria leaf spot
5,OOPS:{},Alternaria leaf spot,no:match,,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf,Alternaria leaf spot
6,OOPS:{},Alternaria leaf spot,no:match,,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf,Alternaria leaf spot
7,OOPS:{},Alternaria leaf spot,no:match,,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf,Alternaria leaf spot
8,OOPS:{},Black root rot,NCBITaxon:124036,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0009005,root,Black root rot
9,OOPS:{},Black root rot,NCBITaxon:301394,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0009005,root,Black root rot
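
The `OOPS:{}` placeholder in `defined_class` still needs real numbers. One possible scheme is sketched here, under the assumption that IDs are sequential and zero-padded to seven digits. The helper name `assign_ids`, the width, and the starting number are all illustrative guesses, not a decided convention:

```python
def assign_ids(n_rows, prefix='OOPS', start=1, width=7):
    """Build one sequential ID per row, e.g. 'OOPS:0000001'.

    prefix, start and width are illustrative assumptions.
    """
    template = prefix + ':{:0' + str(width) + 'd}'
    return [template.format(i) for i in range(start, start + n_rows)]


ids = assign_ids(3)
print(ids)
```

With the dataframe, `patterndf['defined_class'] = assign_ids(len(patterndf))` would fill the column, ideally after rejected rows have been removed so that no IDs are spent on them.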


# Clean the data:  

Will need to simply remove anything that has a "NaN" or "no:match"

In [11]:
# start with a check on the dimensions of our dataframe, establishing a starting point
patterndf.shape

(11751, 9)

In [12]:
# 1. Move all rows with a "no:match" in either pathogen or host column into a "reject.df"

no_pathogen_df = patterndf[patterndf['pathogen'] == 'no:match']
patterndf = patterndf[patterndf['pathogen'] != 'no:match']

print('Rows with pathogen: {}\nRows without: {}'.format(patterndf.shape[0],no_pathogen_df.shape[0]))

no_host_df = patterndf[patterndf['host'] == 'no:match']
patterndf = patterndf[patterndf['host'] != 'no:match']

print('Rows with host: {}\nRows without: {}'.format(patterndf.shape[0],no_host_df.shape[0]))


patterndf.shape

Rows with pathogen: 7029
Rows without: 4722
Rows with host: 6100
Rows without: 929


(6100, 9)
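
Step 5 mentions keeping the rejected rows for later use. A minimal sketch on a toy dataframe follows, assuming a single combined rejects table and the output file name `rejects.tsv` (both are assumptions; the notebook itself only keeps `no_pathogen_df` and `no_host_df` in memory):

```python
import pandas as pd

toy = pd.DataFrame({
    'pathogen': ['NCBITaxon:1', 'no:match', 'NCBITaxon:3', 'NCBITaxon:4'],
    'host': ['NCBITaxon:10', 'NCBITaxon:10', 'no:match', 'NCBITaxon:10'],
})

# a row is rejected if either the pathogen or the host has no NCBI match
is_reject = (toy['pathogen'] == 'no:match') | (toy['host'] == 'no:match')
rejects = toy[is_reject]
kept = toy[~is_reject]

print('kept: {}, rejected: {}'.format(len(kept), len(rejects)))
# rejects.to_csv('rejects.tsv', sep='\t', index=False)  # hypothetical output file
```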

In [13]:
### remove any row containing a "NaN"
patterndf = patterndf.dropna(how='any')

patterndf.shape

(6094, 9)

# Results:

The original data set had 11751 rows (diseases).
4722 of them did not have a pathogen match, a further 929 did not have a host match, and 6 more contained a NaN.

## 6094 diseases!
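
The counts can be cross-checked with a little arithmetic. The numbers below are copied from the cell outputs above; the six rows removed by `dropna` are inferred from the difference between the two reported shapes (6100 and 6094):

```python
total_rows = 11751      # df.shape[0] after loading
no_pathogen = 4722      # rows with pathogen == 'no:match'
no_host = 929           # remaining rows with host == 'no:match'
with_nan = 6100 - 6094  # rows removed by dropna

remaining = total_rows - no_pathogen - no_host - with_nan
print(remaining)
```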



In [14]:
# A few of the NCBITaxon IDs were weird, so we need to replace them.
taxon_corrections = '../data/taxonToBeChanged.txt'
correction_headers = ['wrong','right','right_label']
ncbi_corrections_df = pd.read_csv(taxon_corrections, sep='\t', header=None, names=correction_headers)
# set the index as the bad IDs
ncbi_corrections_df = ncbi_corrections_df.set_index('wrong')
# now we can look up the correct label using the wrong ID as the index
ncbi_corrections_df.loc['NCBITaxon:115135','right_label']

'Epicoccum nigrum'
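
The lookup table can then be used to swap the bad IDs in the main dataframe. This is a minimal sketch on toy data, assuming the corrections apply to the `pathogen` column (the notebook does not say which column holds the odd IDs) and using a made-up replacement ID:

```python
import pandas as pd

# toy corrections table; 'NCBITaxon:999999' is a made-up placeholder ID
corrections = pd.DataFrame(
    {'right': ['NCBITaxon:999999'], 'right_label': ['Epicoccum nigrum']},
    index=pd.Index(['NCBITaxon:115135'], name='wrong'),
)

toy = pd.DataFrame({'pathogen': ['NCBITaxon:115135', 'NCBITaxon:13547']})

# map wrong -> right; IDs without a correction are left unchanged
id_map = corrections['right'].to_dict()
toy['pathogen'] = toy['pathogen'].map(lambda taxon: id_map.get(taxon, taxon))
print(toy['pathogen'].tolist())
```

The matching labels could be updated the same way from the `right_label` column.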

# Name formatting function

It will be important to have a function that can dynamically name the diseases without too much repetition.
The function will be fed a name (full of weird parentheses and references), a pathogen (from NCBI), a host (also from NCBI), and a plant part (whole plant is used as a default).

If a plant part is included, that plant part should be used as part of the name.  But if the 'part' is the whole plant, it is not to be included in the name.

The important part of this function is that it doesn't do any column parsing or other dataframe work.  All reading of the original files and splitting of them into dataframes should be done outside of this function, so the function can be reused no matter how the original data is parsed.
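
The plant-part rule described above could look like the following. This is a sketch only: `add_plant_part` is a hypothetical helper, and the "name (part)" layout is an assumed format, not one the notebook settles on.

```python
def add_plant_part(name, part, default_part='whole plant'):
    """Append the plant part to a disease name unless it is the whole plant.

    Hypothetical helper; the 'name (part)' layout is an assumption.
    """
    part = part.strip()
    if not part or part.lower() == default_part:
        return name
    return '{} ({})'.format(name, part)


print(add_plant_part('Black root rot of Gerbera jamesonii', 'root'))
print(add_plant_part('Southern wilt of Gerbera jamesonii', 'whole plant'))
```

Like the formatter itself, it only takes plain strings, so it stays independent of how the data is parsed.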

In [15]:
# let's also write a little structure to hold the categories for names being formatted, so we can improve this process.
# you'll need to reset these values after looping through the whole dataframe
name_structures = {
    'A': 0,
    'B': 0,
    'C': 0,
    'D': 0
}

def disease_name_formatter(rawname,host,pathogen,part):
    """
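    Function to format the names of diseases.
    Requires:
    rawname - the name that was given in the APS scrape
    host - the host name (host label)
    pathogen - the pathogen name (pathogen label)
    part - the plant part; if not mentioned explicitly in 'aps_with_po_mappings' (from MAL), should default to "whole plant".
           It is accepted here but not yet included in the generated name.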
    
    """
    # start by checking that all the things passed in are what we expect them to be (ie: strings)
    # continue by stripping any bracketed reference, and whitespace.
    
    assert (isinstance(rawname, str)),"raw names must be strings"
    short_name = rawname.split('(')[0].strip()
    assert (isinstance(pathogen, str)),"pathogen names must be strings!!\n {} is not a string".format(pathogen)
    pathogenname = pathogen.split('(')[0].strip()
    assert (isinstance(host, str)),"host names must be strings"
    hostname = host.split('(')[0].strip()
    new_name = None
    

    if pathogenname in short_name and hostname in short_name:
        # A: the APS name already mentions both pathogen and host
        new_name = short_name
        name_structures['A'] +=1
        print('A -',new_name)
    elif pathogenname in short_name:
        # B: the APS name mentions the pathogen only, so append the host
        new_name = '{aps_name} of {host}'.format(aps_name=short_name, host=hostname)
        name_structures['B'] +=1
        print('B -', new_name)
    elif hostname in short_name:
        # C: the APS name mentions the host only, so prepend the pathogen
        new_name = '{pathogen} {aps_name}'.format(pathogen=pathogenname,aps_name=short_name)
        name_structures['C'] +=1
        print('C -', new_name)
    else:
        # D: the APS name mentions neither, so add both
        new_name = '{pathogen} {aps_name} of {host}'.format(aps_name=short_name, pathogen=pathogenname, host=hostname)
        name_structures['D'] +=1
        print('D -', new_name )
    return new_name


In [17]:

# reset the counts:
name_structures = {
    'A': 0,
    'B': 0,
    'C': 0,
    'D': 0
}

# run the whole dataframe through the name-formatter function
# iterrows() yields a copy of each row, so write the new name back with .at
for index, row in patterndf.iterrows():
    patterndf.at[index, 'defined_class label'] = disease_name_formatter(row['defined_class label'],row['host label'], row['pathogen label'],row['plantstructure label'])


D - Pseudomonas cichorii Bacterial leaf spot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Ralstonia solanacearum Southern wilt of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Alternaria alternata Alternaria leaf spot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Thielaviopsis basicola Black root rot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Chalara elegans Nag Raj & Kendr. [synanamorph] Black root rot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Botrytis cinerea Pers. Botrytis blight of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Cercospora gerberae Chupp & Viegas Cercospora leaf spot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Plasmopara sp. Downy mildew of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Fusarium solani Fusarium crown rot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Fusarium oxysporum Schlechtend. Fusarium root rot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Fusarium oxysporum Schlechtend. Fusarium wilt of Gerbera jamesonii H. Bolus ex J. D. Hoo

D - Pseudomonas solanacearum Bugtok of Musa spp.
D - Banana bunchy top virus Bunchy top of Musa spp.
D - Chalara paradoxa Ceratocystis fruit rot of Musa spp.
D - Cladosporium musae E. W. Mason Cladosporium speckle of Musa spp.
D - Junghuhnia vincta Corm dry rot of Musa spp.
D - Cordana johnstonii M. B. Ellis Cordana leaf spot of Musa spp.
D - Cordana musae Cordana leaf spot of Musa spp.
D - Fusarium pallidoroseum Crown rot of Musa spp.
D - Colletotrichum musae Crown rot of Musa spp.
D - Fusarium spp. Crown rot of Musa spp.
D - Acremonium spp. Crown rot of Musa spp.
D - Cylindrocladium spp. Cylindrocladium root rot of Musa spp.
D - Deightoniella torulosa Damping-off of Musa spp.
D - Deightoniella torulosa Deightoniella fruit speckle, leaf spot and tip rot of Musa spp.
D - Cercospora hayi Calp. Diamond spot of Musa spp.
D - Fusarium spp. Diamond spot of Musa spp.
D - Nattrassia mangiferae Dwarf Cavendish tip rot of Musa spp.
D - Unknown cause Elephantiasis of Musa spp.
D - Guignardia mus

D - Carnation latent virus Carnation latent of Dianthus caryophylium L.
D - Carnation mottle virus Carnation mottle of Dianthus caryophylium L.
D - Carnation necrotic fleck virus Carnation necrotic fleck & Carnation streak of Dianthus caryophylium L.
D - Carnation ringspot virus Carnation ring spot of Dianthus caryophylium L.
D - Carnation vein mottle virus Carnation vein mottle of Dianthus caryophylium L.
D - Xanthomonas campestris pv. manihotis Bacterial blight of Manihot esculenta Crantz
D - Xanthomonas campestris pv. cassavae Bacterial angular leaf spot of Manihot esculenta Crantz
D - Agrobacterium tumefaciens Bacterial stem gall of Manihot esculenta Crantz
D - Erwinia herbicola Bacterial wilt of Manihot esculenta Crantz
D - Colletotrichum gloeosporioides Anthracnose of Manihot esculenta Crantz
D - Armillaria mellea Armillaria root rot of Manihot esculenta Crantz
D - Scytalidium sp. Black root and stem rot of Manihot esculenta Crantz
D - Cercospora vicosae Muller & Chupp Blight lea

D - Mariannaea elegans Stalk rots, minor of Zea mays L.
D - Mucor sp. Stalk rots, minor of Zea mays L.
D - Aspergillus spp. Storage rots of Zea mays L.
D - Penicillium spp. and   fungi Storage rots of Zea mays L.
D - Phyllachora maydis Maubl. Tar spot* of Zea mays L.
D - Trichoderma viride Pers. Trichoderma ear rot and root rot of Zea mays L.
D - Hypocrea sp. Trichoderma ear rot and root rot of Zea mays L.
D - Stenocarpella maydis White ear rot, root and stalk rot of Zea mays L.
D - Ascochyta ischaemi Sacc. Yellow leaf blight of Zea mays L.
D - Phyllosticta maydis D.C. Arny & R.R. Nelson Yellow leaf blight of Zea mays L.
D - Mycosphaerella zeae-maydis Mukunya & Boothroyd Yellow leaf blight of Zea mays L.
D - Dolichodorus spp. Awl of Zea mays L.
D - Dolichodorus heterocephalus Cobb Awl of Zea mays L.
D - Ditylenchus dipsaci Bulb and stem* of Zea mays L.
D - Radopholus similis Burrowing of Zea mays L.
D - Heterodera avenae Wollenweb. Cyst of Zea mays L.
D - Heterodera zeae Koshy et al. C

D - Phymatotrichopsis omnivorum Texas root rot of Corylus
D - Microsphaera coryli Homma Powdery mildew of Corylus avellana
D - Microsphaera coryli Homma Powdery mildew of Corylus
D - Microsphaera ellisii U. Braun Powdery mildew of Corylus avellana
D - Microsphaera ellisii U. Braun Powdery mildew of Corylus
D - Microsphaera hommae U. Braun Powdery mildew of Corylus avellana
D - Microsphaera hommae U. Braun Powdery mildew of Corylus
D - Microsphaera verruculosa Yu & Lai on various Corylus sp. Powdery mildew of Corylus avellana
D - Microsphaera verruculosa Yu & Lai on various Corylus sp. Powdery mildew of Corylus
D - Phyllactinia guttata Powdery mildew of Corylus avellana
D - Phyllactinia guttata Powdery mildew of Corylus
D - Pucciniastrum coryli Komarov Rust of Corylus avellana
D - Pucciniastrum coryli Komarov Rust of Corylus
D - Apple mosaic virus Hazelnut mosaic of Corylus avellana
D - Apple mosaic virus Hazelnut mosaic of Corylus
D - Prunus necrotic ringspot virus Hazelnut mosaic of C

D - Erythricium salmonicolor Pink disease of Mangifera indica L.
D - Corticium salmonicolor Berk. & Broome Pink disease of Mangifera indica L.
D - Oidium asteris - punicei Peck [anamorph] Powdery mildew of Mangifera indica L.
D - Oidium mangiferae Berthet. Powdery mildew of Mangifera indica L.
D - Rhizopus arrhizus A. Fischer Rhizopus rot of Mangifera indica L.
D - Rhizopus oryzae Went & Prinsen Geerligs Rhizopus rot of Mangifera indica L.
D - Phymatotrichopsis omnivora Root rot of Mangifera indica L.
D - Phytophthora nicotianae Breda de Haan Root rot of Mangifera indica L.
D - Phytophthora nicotianae Breda de Haan var. parasitica Root rot of Mangifera indica L.
D - Phytophthora palmivora Root rot of Mangifera indica L.
D - Pythium spp. Root rot of Mangifera indica L.
D - Pythium splendens H. Braun Root rot of Mangifera indica L.
D - Rhizoctonia solani Kühn Root rot of Mangifera indica L.
D - Thanatephorus cucumeris Root rot of Mangifera indica L.
D - Elsinoe mangiferae Bitancourt & Je

D - Pseudomonas syringae pv. pisi Bacterial blight of Pisum sativum L.
D - Pseudomonas syringae subsp. syringae van Hall 1902 Brown spot of Pisum sativum L.
D - Alternaria alternata Alternaria blight of Pisum sativum L.
D - Colletotrichum gloeosporioides Anthracnose of Pisum sativum L.
D - Colletotrichum pisi Pat. Anthracnose of Pisum sativum L.
D - Aphanomyces euteiches Drechs. f. sp. pisi W.F. Pfender & D. J. Hagedorn Aphanomyces root rot of Pisum sativum L.
D - Ascochyta pinodes L.K. Jones [anamorph] Ascochyta blight of Pisum sativum L.
D - Ascochyta pinodella L.K. Jones Ascochyta foot rot and black stem of Pisum sativum L.
D - Fusicladium pisicola Linford Black leaf of Pisum sativum L.
D - Cercospora pisa-sativae J. A. Stevenson Cercospora leaf spot of Pisum sativum L.
D - Cladosporium cladosporioides Cladosporium blight of Pisum sativum L.
D - Cladosporium pisicola W.C. Snyder Cladosporium blight of Pisum sativum L.
D - Pythium spp. Damping off, seed rot of Pisum sativum L.
D - Pe

D - Syspastospora parasitica Leaf spot of Euphorbia pulcherrima Willd. ex Klotzsch
D - Melanospora parasitica Leaf spot of Euphorbia pulcherrima Willd. ex Klotzsch
D - Stemphylium sp. Leaf spot of Euphorbia pulcherrima Willd. ex Klotzsch
D - Armillaria tabescens Mushroom root rot of Euphorbia pulcherrima Willd. ex Klotzsch
D - Oidium sp. Powdery mildew of Euphorbia pulcherrima Willd. ex Klotzsch
D - Erysiphe sp. Powdery mildew of Euphorbia pulcherrima Willd. ex Klotzsch
D - Phytophthora nicotianae Breda de Haan Root rot of Euphorbia pulcherrima Willd. ex Klotzsch
D - Phytophthora nicotianae var. parasitica Root rot of Euphorbia pulcherrima Willd. ex Klotzsch
D - Pythium aphanidermatum Root rot of Euphorbia pulcherrima Willd. ex Klotzsch
D - Pythium debaryanum Auct. non R. Hesse Root rot of Euphorbia pulcherrima Willd. ex Klotzsch
D - Pythium myriotylum Drechs. Root rot of Euphorbia pulcherrima Willd. ex Klotzsch
D - Pythium perniciosum Serbinow Root rot of Euphorbia pulcherrima Willd. 

D - Alternaria alternata Seed mold of Trifolium pratense L.
D - Pythium debaryanum Auct. non R. Hesse Seed rot and damping-off of Trifolium pratense L.
D - Cymadothea trifolii Sooty blotch of Trifolium pratense L.
D - Colletotrichum trifolii Bain & Essary Southern anthracnose of Trifolium pratense L.
D - Athelia rolfsii Southern blight of Trifolium pratense L.
D - Stagonospora recedens Stagonospora leaf spot and root rot of Trifolium pratense L.
D - Cercospora zebrina Pass. Summer black stem of Trifolium pratense L.
D - Stemphylium sarciniforme Target spot of Trifolium pratense L.
D - Stemphylium botryosum Wallr. Target spot of Trifolium pratense L.
D - Pleospora tarda E. Simmons Target spot of Trifolium pratense L.
D - Rhizoctonia crocorum Violet root rot of Trifolium pratense L.
D - Coprinus psychromorbidus Redhead & J.A. Traquair Winter crown rot of Trifolium pratense L.
D - Phoma sclerotioides G. Preuss ex Sacc. Winter crown rot of Trifolium pratense L.
D - Plenodomus meliloti Dear

D - Sclerophthora macrospora Sclerophthora disease of Saccharum spp. hybrids
D - Alternaria alternata Seedling blight of Saccharum spp. hybrids
D - Bipolaris sacchari Seedling blight of Saccharum spp. hybrids
D - Bipolaris hawaiiensis Seedling blight of Saccharum spp. hybrids
D - Curvularia lunata Seedling blight of Saccharum spp. hybrids
D - Curvularia senegalensis Seedling blight of Saccharum spp. hybrids
D - Setosphaeria rostrata K.J. Leonard Seedling blight of Saccharum spp. hybrids
D - Exserohilum rostratum Seedling blight of Saccharum spp. hybrids
D - Cytospora sacchari E.J. Butler Sheath rot of Saccharum spp. hybrids
D - Helminthosporium sp. Priode Target blotch of Saccharum spp. hybrids
D - Deightoniella papuana D. Shaw Veneer blotch* of Saccharum spp. hybrids
D - Elsinoe sacchari Lo White rash* of Saccharum spp. hybrids
D - Sphaceloma sacchari Lo White rash* of Saccharum spp. hybrids
D - Fusarium sacchari Wilt of Saccharum spp. hybrids
D - Cephalosporium sacchari E.J. Butler i

D - Pratylenchus spp. Lesion of Nicotiana tabacum L.
D - Rotylenchulus reniformis Linford & Oliveira Reniform of Nicotiana tabacum L.
D - Meloidogyne arenaria Root-knot of Nicotiana tabacum L.
D - Meloidogyne hapla Chitwood Root-knot of Nicotiana tabacum L.
D - Meloidogyne incognita Root-knot of Nicotiana tabacum L.
D - Meloidogyne javanica Root-knot of Nicotiana tabacum L.
D - Helicotylenchus spp. Spiral of Nicotiana tabacum L.
D - Paratrichodorus spp. Stubby-root of Nicotiana tabacum L.
D - Trichodorus spp. Stubby-root of Nicotiana tabacum L.
D - Merlinius spp. Stunt of Nicotiana tabacum L.
D - Tylenchorhynchus spp. Stunt of Nicotiana tabacum L.
D - Orobanche ramosa L. Broomrape of Nicotiana tabacum L.
D - Orobanche ludoviciana Nutt. Broomrape of Nicotiana tabacum L.
D - Cuscuta spp. Dodder of Nicotiana tabacum L.
D - Striga gesnerioides Witchweed of Nicotiana tabacum L.
D - Alfalfa mosaic virus Alfalfa mosaic of Nicotiana tabacum L.
D - Beet curly top virus Beet curly top of Nicotia

In [18]:
#check the counts
name_structures

{'A': 0, 'B': 48, 'C': 1, 'D': 6045}

In [21]:
# Some diseases have a host ID that points to a mixed EST/genomic library taxon instead of the plant itself.
# These are corrected by substituting the plant's own NCBITaxon ID.

est_name_dict = {
'Beta vulgaris/Cercospora beticola mixed EST library': 'NCBITaxon:161934',
'Brassica juncea/Albugo candida mixed genomic library':'NCBITaxon:3707',
'Gossypium hirsutum/Verticillium dahliae mixed EST library': 'NCBITaxon:3635',
'Oryza sativa/Pyricularia oryzae mixed EST library': 'NCBITaxon:4530',
'Phaseolus vulgaris/Colletotrichum lindemuthianum mixed EST library': 'NCBITaxon:3885',
'Triticum aestivum/Phaeosphaeria nodorum mixed EST library': 'NCBITaxon:4565',
'Zea mays/Colletotrichum graminicola mixed EST library': 'NCBITaxon:4577'
}

est_id_dict = {
'NCBITaxon:1585532': 'NCBITaxon:161934',
'NCBITaxon:910407':'NCBITaxon:3707',
'NCBITaxon:69324': 'NCBITaxon:3635',
'NCBITaxon:105664': 'NCBITaxon:4530',
'NCBITaxon:709942': 'NCBITaxon:3885',
'NCBITaxon:331356': 'NCBITaxon:4565',
'NCBITaxon:176297': 'NCBITaxon:4577',
'NCBITaxon:553005': 'NCBITaxon:4154'
}
# count = 0
# for index, row in patterndf.iterrows():
#     hostname = row['host_label']
#     if hostname in est_dict:
#         print(row)
# #         row['host'] = est_dict[hostname]
# #         count +=1

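# Replace EST-library host IDs with the matching plant ID.
# Write through patterndf.at: assigning to `row` is not guaranteed to update the DataFrame.
count = 0
for index, row in patterndf.iterrows():
    host_id = row['host']
    if host_id in est_id_dict:
        print(row['host'], row['host label'])
        patterndf.at[index, 'host'] = est_id_dict[host_id]
        count += 1
print(count)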

NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxon:69324 Gossypium spp.
NCBITaxo
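
The same substitution can be sketched without an explicit loop. This is a minimal example on a made-up three-row frame: `toy_df`, `est_id_subset`, and the placeholder labels are assumptions for illustration, not data from the scrape. It uses `Series.map` with a dictionary lookup that falls back to the original ID when there is no match.

```python
import pandas as pd

# Two entries copied from est_id_dict; the others are omitted for brevity.
est_id_subset = {
    'NCBITaxon:69324': 'NCBITaxon:3635',
    'NCBITaxon:105664': 'NCBITaxon:4530',
}

# Hypothetical stand-in for patterndf (labels are placeholders).
toy_df = pd.DataFrame({
    'host': ['NCBITaxon:69324', 'NCBITaxon:13547', 'NCBITaxon:105664'],
    'host label': ['Gossypium spp.', 'placeholder host A', 'placeholder host B'],
})

# Count the rows that will change before changing them.
n_replaced = int(toy_df['host'].isin(list(est_id_subset)).sum())

# Look each ID up in the dictionary; keep the original ID when it is not a key.
toy_df['host'] = toy_df['host'].map(lambda host_id: est_id_subset.get(host_id, host_id))

print(n_replaced)
print(toy_df['host'].tolist())
```

Counting with `isin` before the substitution gives the same number the loop accumulates in `count`.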

In [23]:
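# Apply the same ID substitution to the pathogen column.
count = 0
for index, row in patterndf.iterrows():
    pathogen_id = row['pathogen']
    if pathogen_id in est_id_dict:
        print(row['pathogen'], row['pathogen label'])
        # write through .at so the change reaches patterndf, not just the row copy
        patterndf.at[index, 'pathogen'] = est_id_dict[pathogen_id]
        count += 1
print(count)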

NCBITaxon:709942 Bean leafroll virus (BLRV)
NCBITaxon:709942 Bean (pea) leaf roll virus
NCBITaxon:176297 Maize chlorotic dwarf virus (MCDV)
NCBITaxon:176297 Maize chlorotic mottle virus (MCMV)
NCBITaxon:176297 Maize leaf fleck virus (MLFV)
NCBITaxon:176297 Maize line virus (MLV)
NCBITaxon:176297 Maize mosaic virus (MMV)
NCBITaxon:176297 Maize pellucid ringspot virus (MPRV)
NCBITaxon:176297 Maize rayado fino virus (MRFV)
NCBITaxon:176297 Maize red stripe virus (MRSV)
NCBITaxon:176297 Maize ring mottle virus (MRMV)
NCBITaxon:176297 Maize rough dwarf virus (MRDV)
NCBITaxon:176297 Maize sterile stunt virus (strains of barley yellow striate virus)
NCBITaxon:176297 Maize streak virus (MSV)
NCBITaxon:176297 Maize tassel abortion virus (MTAV)
NCBITaxon:176297 Maize vein enation virus (MVEV)
NCBITaxon:176297 Maize wallaby ear virus (MWEV)
NCBITaxon:176297 Maize white leaf virus
NCBITaxon:176297 Maize white line mosaic virus (MWLMV)
NCBITaxon:105664 Rice black-streaked dwarf virus (RBSDV)
NCBITa

In [27]:
patterndf


Unnamed: 0,defined_class,defined_class label,pathogen,pathogen label,host,host label,plantstructure,plantstructure label,synonyms
0,OOPS:{},Pseudomonas cichorii Bacterial leaf spot of Ge...,NCBITaxon:1441629,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Bacterial leaf spot
1,OOPS:{},Ralstonia solanacearum Southern wilt of Gerber...,NCBITaxon:1457195,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Southern wilt
4,OOPS:{},Alternaria alternata Alternaria leaf spot of G...,NCBITaxon:187775,Alternaria alternata (Fr.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Alternaria leaf spot
8,OOPS:{},Thielaviopsis basicola Black root rot of Gerbe...,NCBITaxon:124036,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Black root rot
9,OOPS:{},Chalara elegans Nag Raj & Kendr. [synanamorph]...,NCBITaxon:301394,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Black root rot
10,OOPS:{},Botrytis cinerea Pers. Botrytis blight of Gerb...,NCBITaxon:1290391,Botrytis cinerea Pers.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Botrytis blight
12,OOPS:{},Cercospora gerberae Chupp & Viegas Cercospora ...,NCBITaxon:1247214,Cercospora gerberae Chupp & Viegas,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Cercospora leaf spot
13,OOPS:{},Plasmopara sp. Downy mildew of Gerbera jameson...,NCBITaxon:4780,Plasmopara sp.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Downy mildew
14,OOPS:{},Fusarium solani Fusarium crown rot of Gerbera ...,NCBITaxon:1501798,Fusarium solani (Mart.) Sacc.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Fusarium crown rot
15,OOPS:{},Fusarium oxysporum Schlechtend. Fusarium root ...,NCBITaxon:909455,Fusarium oxysporum Schlechtend.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Fusarium root rot


In [25]:
# Clean up the IDs and drop problem rows.
# (The OOPS IDs, autoiterated from 13, are assigned in a later cell; the old code is left commented out.)

# start = 13  # OOPS:0000013 is the last one in the upper_level.owl

bad_hosts = ['NCBITaxon:553005', 'NCBITaxon:71936']
rows_to_drop = set()  # collected during the loop and dropped once it has finished

for index, row in patterndf.iterrows():

#     # add in some OOPS:IDs
#     start += 1
#     new_id = '{0:07d}'.format(start)  # format the number to be 7 digit padded.
#     row['defined_class'] = row['defined_class'].format(new_id)

    # plant part IDs need ':' instead of '_'
    plantstructure = row['plantstructure'].replace("_", ":")
    patterndf.at[index, 'plantstructure'] = plantstructure

    # replace obsolete/alternate NCBI IDs with the corrected ID and label
    pathogen = row['pathogen']
    if pathogen in ncbi_corrections_df.index:
        print('found a bad ID')
        patterndf.at[index, 'pathogen label'] = ncbi_corrections_df.loc[pathogen, 'right_label']
        pathogen = ncbi_corrections_df.loc[pathogen, 'right']
        patterndf.at[index, 'pathogen'] = pathogen

    # remove problem rows (MAL found these rows do not land on the OOPS hierarchy correctly)
    if row['host'] in bad_hosts:
        rows_to_drop.add(index)
        print('dropped a bad host')

    if plantstructure == 'PO:0001081':
        rows_to_drop.add(index)
        print('dropped a bad plant structure')

    if pathogen == 'NCBITaxon:32644':
        rows_to_drop.add(index)
        print('dropped a bad pathogen')

# dropping after the loop avoids mutating the frame while iterating over it
patterndf.drop(list(rows_to_drop), inplace=True)
        

found a bad ID
found a bad ID
found a bad ID
found a bad ID
dropped a bad pathogen
found a bad ID
found a bad ID
found a bad ID
found a bad ID
found a bad ID
found a bad ID
dropped a bad pathogen
dropped a bad pathogen
dropped a bad pathogen
found a bad ID
found a bad ID
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
dropped a bad host
found a bad ID
dropped a bad host
dropped a bad host
found a bad ID
found a bad ID
found a bad ID
found a bad ID
dropped a bad plant structure
found a bad ID
found a bad ID
foun
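
Removing the problem rows can also be expressed as a single boolean filter. The sketch below assumes a small made-up frame (`toy_df` and its rows are hypothetical); the three ID lists mirror the checks in the cell above.

```python
import pandas as pd

bad_hosts = ['NCBITaxon:553005', 'NCBITaxon:71936']
bad_structures = ['PO:0001081']
bad_pathogens = ['NCBITaxon:32644']

# Hypothetical stand-in for patterndf: one good row and one row per problem type.
toy_df = pd.DataFrame({
    'pathogen': ['NCBITaxon:4780', 'NCBITaxon:32644', 'NCBITaxon:4780', 'NCBITaxon:4780'],
    'host': ['NCBITaxon:13547', 'NCBITaxon:13547', 'NCBITaxon:71936', 'NCBITaxon:13547'],
    'plantstructure': ['PO_0025034', 'PO_0009005', 'PO_0009005', 'PO_0001081'],
})

# Plant part IDs need ':' instead of '_'.
toy_df['plantstructure'] = toy_df['plantstructure'].str.replace('_', ':', regex=False)

# A row is bad if any one of the three checks matches.
is_bad = (
    toy_df['host'].isin(bad_hosts)
    | toy_df['plantstructure'].isin(bad_structures)
    | toy_df['pathogen'].isin(bad_pathogens)
)

# Keep only the good rows; .copy() makes the result an independent frame.
kept_df = toy_df[~is_bad].copy()
print(len(toy_df), len(kept_df))
```

A row that fails more than one check is still removed only once, so no bookkeeping of already-dropped indices is needed.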

In [26]:
patterndf

Unnamed: 0,defined_class,defined_class label,pathogen,pathogen label,host,host label,plantstructure,plantstructure label,synonyms
0,OOPS:{},Pseudomonas cichorii Bacterial leaf spot of Ge...,NCBITaxon:1441629,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Bacterial leaf spot
1,OOPS:{},Ralstonia solanacearum Southern wilt of Gerber...,NCBITaxon:1457195,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Southern wilt
4,OOPS:{},Alternaria alternata Alternaria leaf spot of G...,NCBITaxon:187775,Alternaria alternata (Fr.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Alternaria leaf spot
8,OOPS:{},Thielaviopsis basicola Black root rot of Gerbe...,NCBITaxon:124036,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Black root rot
9,OOPS:{},Chalara elegans Nag Raj & Kendr. [synanamorph]...,NCBITaxon:301394,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Black root rot
10,OOPS:{},Botrytis cinerea Pers. Botrytis blight of Gerb...,NCBITaxon:1290391,Botrytis cinerea Pers.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Botrytis blight
12,OOPS:{},Cercospora gerberae Chupp & Viegas Cercospora ...,NCBITaxon:1247214,Cercospora gerberae Chupp & Viegas,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Cercospora leaf spot
13,OOPS:{},Plasmopara sp. Downy mildew of Gerbera jameson...,NCBITaxon:4780,Plasmopara sp.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Downy mildew
14,OOPS:{},Fusarium solani Fusarium crown rot of Gerbera ...,NCBITaxon:1501798,Fusarium solani (Mart.) Sacc.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Fusarium crown rot
15,OOPS:{},Fusarium oxysporum Schlechtend. Fusarium root ...,NCBITaxon:909455,Fusarium oxysporum Schlechtend.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Fusarium root rot




# Find functional duplicates and merge their names into synonyms

`DOSDP-tools` uses the 'pathogen', 'host', and 'plantstructure' columns to build the OWL classes. Some rows, however, carry different disease names even though their pathogen/host/plantstructure combination is identical, so they would describe the same class.

We need to group these "functional duplicates" and merge their names into a synonyms column.
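
The grouping step can be sketched on a tiny frame. In this hedged example the IDs and labels in `toy_df` are made-up placeholders; only the column names come from the notebook. Rows that share the same host/pathogen/plantstructure key get their labels joined with `|`, and then one row per key is kept.

```python
import pandas as pd

# Hypothetical rows: the first two share the same key, the third does not.
toy_df = pd.DataFrame({
    'defined_class label': [
        'Pathogen X Lesion of Host Y',
        'Pathogen X Pin of Host Y',
        'Pathogen Z Cyst of Host Y',
    ],
    'pathogen': ['NCBITaxon:111', 'NCBITaxon:111', 'NCBITaxon:222'],
    'host': ['NCBITaxon:999', 'NCBITaxon:999', 'NCBITaxon:999'],
    'plantstructure': ['PO:0009005', 'PO:0009005', 'PO:0009005'],
})

keys = ['host', 'pathogen', 'plantstructure']

# Every row in a group receives the pipe-joined labels of the whole group.
toy_df['synonyms'] = (
    toy_df.groupby(keys)['defined_class label']
    .transform(lambda names: '|'.join(names))
)

# Keep the first row of each group; it already carries all the names.
merged_df = toy_df.drop_duplicates(keys).copy()
print(merged_df[['defined_class label', 'synonyms']])
```

Because `transform` writes the joined string onto every row of a group, it does not matter which duplicate `drop_duplicates` keeps.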


In [22]:
patterndf

Unnamed: 0,defined_class,defined_class label,pathogen,pathogen_label,host,host_label,plantstructure,plantstructure_label,synonyms,synonyms1,synonyms2
0,OOPS:{},Pseudomonas cichorii Bacterial leaf spot of Ge...,NCBITaxon:1441629,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Bacterial leaf spot,Pseudomonas cichorii Bacterial leaf spot of Ge...,Bacterial leaf spot
1,OOPS:{},Ralstonia solanacearum Southern wilt of Gerber...,NCBITaxon:1457195,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Southern wilt,Ralstonia solanacearum Southern wilt of Gerber...,Southern wilt
4,OOPS:{},Alternaria alternata Alternaria leaf spot of G...,NCBITaxon:187775,Alternaria alternata (Fr.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Alternaria leaf spot,Alternaria alternata Alternaria leaf spot of G...,Alternaria leaf spot
8,OOPS:{},Thielaviopsis basicola Black root rot of Gerbe...,NCBITaxon:124036,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Black root rot,Thielaviopsis basicola Black root rot of Gerbe...,Black root rot
9,OOPS:{},Chalara elegans Nag Raj & Kendr. [synanamorph]...,NCBITaxon:301394,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Black root rot,Chalara elegans Nag Raj & Kendr. [synanamorph]...,Black root rot
10,OOPS:{},Botrytis cinerea Pers. Botrytis blight of Gerb...,NCBITaxon:1290391,Botrytis cinerea Pers.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Botrytis blight,Botrytis cinerea Pers. Botrytis blight of Gerb...,Botrytis blight
12,OOPS:{},Cercospora gerberae Chupp & Viegas Cercospora ...,NCBITaxon:1247214,Cercospora gerberae Chupp & Viegas,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Cercospora leaf spot,Cercospora gerberae Chupp & Viegas Cercospora ...,Cercospora leaf spot
13,OOPS:{},Plasmopara sp. Downy mildew of Gerbera jameson...,NCBITaxon:4780,Plasmopara sp.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Downy mildew,Plasmopara sp. Downy mildew of Gerbera jameson...,Downy mildew
14,OOPS:{},Fusarium solani Fusarium crown rot of Gerbera ...,NCBITaxon:1501798,Fusarium solani (Mart.) Sacc.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Fusarium crown rot,Fusarium solani Fusarium crown rot of Gerbera ...,Fusarium crown rot
15,OOPS:{},Fusarium oxysporum Schlechtend. Fusarium root ...,NCBITaxon:909455,Fusarium oxysporum Schlechtend.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Fusarium root rot,Fusarium oxysporum Schlechtend. Fusarium root ...,Fusarium root rot


In [28]:
patterndf['synonyms1'] = patterndf.groupby(['host','pathogen','plantstructure'])['defined_class label'].transform(lambda x : '|'.join(x))

In [29]:
patterndf['synonyms2'] = patterndf.groupby(['host','pathogen','plantstructure'])['synonyms'].transform(lambda x : '|'.join(x))

In [30]:
# drop any duplicate classes (.copy() gives an independent frame that is safe to modify later)
final_df = patterndf.drop_duplicates(['host','pathogen','plantstructure']).copy()

In [31]:
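# combine the grouped full labels and the grouped short names into one pipe-delimited field
for index, row in final_df.iterrows():
    synonyms = row['synonyms1'] + '|' + row['synonyms2']
    # write through .at so the change reaches final_df, not just the row copy
    final_df.at[index, 'synonyms'] = synonyms
#         print('SYN1: {}\nSYN2: {}'.format(row['synonyms1'],row['synonyms2']))
    print(synonyms)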

Pseudomonas cichorii Bacterial leaf spot of Gerbera jamesonii H. Bolus ex J. D. Hook|Bacterial leaf spot
Ralstonia solanacearum Southern wilt of Gerbera jamesonii H. Bolus ex J. D. Hook|Southern wilt
Alternaria alternata Alternaria leaf spot of Gerbera jamesonii H. Bolus ex J. D. Hook|Alternaria leaf spot
Thielaviopsis basicola Black root rot of Gerbera jamesonii H. Bolus ex J. D. Hook|Black root rot
Chalara elegans Nag Raj & Kendr. [synanamorph] Black root rot of Gerbera jamesonii H. Bolus ex J. D. Hook|Black root rot
Botrytis cinerea Pers. Botrytis blight of Gerbera jamesonii H. Bolus ex J. D. Hook|Botrytis blight
Cercospora gerberae Chupp & Viegas Cercospora leaf spot of Gerbera jamesonii H. Bolus ex J. D. Hook|Cercospora leaf spot
Plasmopara sp. Downy mildew of Gerbera jamesonii H. Bolus ex J. D. Hook|Downy mildew
Fusarium solani Fusarium crown rot of Gerbera jamesonii H. Bolus ex J. D. Hook|Fusarium crown rot
Fusarium oxysporum Schlechtend. Fusarium root rot of Gerbera jamesonii H

Mycoleptodiscus terrestris Sudden death of Theobroma cacao L.|Sudden death
Ceratobasidium koleroga Thread blight of Theobroma cacao L.|Thread blight
Ganoderma philippii Wet root rot of Theobroma cacao L.|Wet root rot
Crinipellis perniciosa Witches’ broom of Theobroma cacao L.|Witches’ broom
Cephaleuros virescens Kunze Algal disease of Theobroma cacao L.|Algal disease
Dolichodorus spp. Awl nematode of Theobroma cacao L.|Awl nematode
Heterodera spp. Cyst nematode of Theobroma cacao L.|Cyst nematode
Xiphinema spp. Dagger nematode of Theobroma cacao L.|Dagger nematode
Pratylenchus spp. Lesion nematode of Theobroma cacao L.|Lesion nematode
Rotylenchulus spp. Reniform nematode of Theobroma cacao L.|Reniform nematode
Hoplolaimus spp. Ring nematode of Theobroma cacao L.|Ring nematode
Meloidogyne spp. Root-knot nematode of Theobroma cacao L.|Root-knot nematode
Trichodorus spp. Stubby root nematode of Theobroma cacao L.|Stubby root nematode
Loranthus spp. Mistletoe of Theobroma cacao L.|Mistleto

Phytophthora spp. Phytophthora root rot of Citrullus spp., Cucumis spp., Cucurbita spp., and  s|Phytophthora capsici Leonian Phytophthora root rot of Citrullus spp., Cucumis spp., Cucurbita spp., and  s|Phytophthora root rot|Phytophthora root rot
Trichothecium roseum Pink mold rot of Citrullus spp., Cucumis spp., Cucurbita spp., and  s|Pink mold rot
Pythium spp. Pythium fruit rot of Citrullus spp., Cucumis spp., Cucurbita spp., and  s|Pythium fruit rot (cottony leak)
Rhizopus stolonifer Rhizopus soft rot of Citrullus spp., Cucumis spp., Cucurbita spp., and  s|Rhizopus soft rot (fruit)
Cladosporium cucumerinum Ellis & Arth. Scab/gummosis of Citrullus spp., Cucumis spp., Cucurbita spp., and  s|Scab/gummosis
Sclerotinia sclerotiorum Sclerotinia stem rot of Citrullus spp., Cucumis spp., Cucurbita spp., and  s|Sclerotinia stem rot
Septoria cucurbitacearum Sacc. Septoria leaf blight of Citrullus spp., Cucumis spp., Cucurbita spp., and  s|Septoria leaf blight
Athelia rolfsii Southern blight o

Phaeosphaeria avenaria G. F. Weber) O. Eriksson Speckled blotch of Avena sativa L.|Speckled blotch (Septoria blight)
Gaeumannomyces graminis Take-all of Avena sativa L.|Gaeumannomyces graminis Take-all of Avena sativa L.|Take-all (white head)|Take-all (white head)
Bipolaris victoriae Victoria blight of Avena sativa L.|Victoria blight
Ditylenchus dipsaci Bulb and stem of Avena sativa L.|Bulb and stem (in Europe)
Heterodera avenae Wollenbeber Cyst, oat or cereal of Avena sativa L.|Cyst, oat or cereal
Heterodera hordecalis Andersson Cyst of Avena sativa L.|Cyst
Heterodera latipons Franklin Cyst of Avena sativa L.|Cyst
Punctodera chalcoensis Stone et al. Cyst of Avena sativa L.|Cyst
Xiphinema americanum Cobb Dagger, American of Avena sativa L.|Dagger, American
Pratylenchus spp. Lesion of Avena sativa L.|Pratylenchus spp. Pin of Avena sativa L.|Lesion|Pin
Pratylenchus thornei Sher & Allen Lesion of Avena sativa L.|Lesion
Nothocriconemella mutabilis Ring of Avena sativa L.|Ring
Meloidogyne s

Phytophthora nicotianae van Breda de Haan var. parasitica Pink rot of Solanum tuberosum L.|Pink rot
Pleospora herbarum of Solanum tuberosum L.|Pleospora herbarum
Stemphylium herbarum E. Simmons [anamorph] Pleospora herbarum of Solanum tuberosum L.|Pleospora herbarum
Rhizoctonia solani Kühn Rhizoctonia canker and black scurf of Solanum tuberosum L.|Rhizoctonia canker and black scurf
Thanatephorus cucumeris Rhizoctonia canker and black scurf of Solanum tuberosum L.|Rhizoctonia canker and black scurf
Rosellinia sp. Rosellinia black rot* of Solanum tuberosum L.|Rosellinia black rot*
Septoria lycopersici Speg. var. malagutii Ciccarone & Boerema Septoria leaf spot of Solanum tuberosum L.|Septoria leaf spot
Helminthosporium solani Dur. & Mont. Silver scurf of Solanum tuberosum L.|Silver scurf
Polyscytalum pustulans Skin spot of Solanum tuberosum L.|Skin spot
Athelia rolfsii Stem rot of Solanum tuberosum L.|Stem rot (southern blight)
Angiosorus solani Thecaphora smut of Solanum tuberosum L.|Th

Xanthomonas campestris pv. vasculorum Gumming disease* of Saccharum spp. hybrids|Gumming disease*
Xanthomonas albilineans Leaf scald of Saccharum spp. hybrids|Leaf scald
Pseudomonas rubrisubalbicans Mottled stripe of Saccharum spp. hybrids|Pseudomonas avenae Manns Red stripe of Saccharum spp. hybrids|Pseudomonas rubrilineans Red stripe of Saccharum spp. hybrids|Mottled stripe|Red stripe (top rot)|Red stripe (top rot)
Clavibacter xyli subsp. xyli Davis et al. Ratoon stunting disease of Saccharum spp. hybrids|Ratoon stunting disease
Thanatephorus cucumeris Banded sclerotial of Saccharum spp. hybrids|Banded sclerotial (leaf) disease
Rhizoctonia solani Kühn Banded sclerotial of Saccharum spp. hybrids|Banded sclerotial (leaf) disease
Ceratocystis adiposa Black rot of Saccharum spp. hybrids|Black rot
Chalara sp. Black rot of Saccharum spp. hybrids|Chalara paradoxa Pineapple disease of Saccharum spp. hybrids|Black rot|Pineapple disease
Cercospora atrofiliformis Yen et al. Black stripe* of Sac

In [35]:
final_df

Unnamed: 0,defined_class,defined_class label,pathogen,pathogen label,host,host label,plantstructure,plantstructure label,synonyms
0,OOPS:0000014,Pseudomonas cichorii Bacterial leaf spot of Ge...,NCBITaxon:1441629,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Pseudomonas cichorii Bacterial leaf spot of Ge...
1,OOPS:0000015,Ralstonia solanacearum Southern wilt of Gerber...,NCBITaxon:1457195,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Ralstonia solanacearum Southern wilt of Gerber...
4,OOPS:0000016,Alternaria alternata Alternaria leaf spot of G...,NCBITaxon:187775,Alternaria alternata (Fr.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Alternaria alternata Alternaria leaf spot of G...
8,OOPS:0000017,Thielaviopsis basicola Black root rot of Gerbe...,NCBITaxon:124036,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Thielaviopsis basicola Black root rot of Gerbe...
9,OOPS:0000018,Chalara elegans Nag Raj & Kendr. [synanamorph]...,NCBITaxon:301394,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Chalara elegans Nag Raj & Kendr. [synanamorph]...
10,OOPS:0000019,Botrytis cinerea Pers. Botrytis blight of Gerb...,NCBITaxon:1290391,Botrytis cinerea Pers.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Botrytis cinerea Pers. Botrytis blight of Gerb...
12,OOPS:0000020,Cercospora gerberae Chupp & Viegas Cercospora ...,NCBITaxon:1247214,Cercospora gerberae Chupp & Viegas,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Cercospora gerberae Chupp & Viegas Cercospora ...
13,OOPS:0000021,Plasmopara sp. Downy mildew of Gerbera jameson...,NCBITaxon:4780,Plasmopara sp.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Plasmopara sp. Downy mildew of Gerbera jameson...
14,OOPS:0000022,Fusarium solani Fusarium crown rot of Gerbera ...,NCBITaxon:1501798,Fusarium solani (Mart.) Sacc.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Fusarium solani Fusarium crown rot of Gerbera ...
15,OOPS:0000023,Fusarium oxysporum Schlechtend. Fusarium root ...,NCBITaxon:909455,Fusarium oxysporum Schlechtend.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Fusarium oxysporum Schlechtend. Fusarium root ...


In [33]:
list(final_df)

# remove the helper synonym columns
# (reassigning the result, rather than dropping in place, avoids pandas' SettingWithCopyWarning)
final_df = final_df.drop(["synonyms1", "synonyms2"], axis=1)
final_df


Unnamed: 0,defined_class,defined_class label,pathogen,pathogen label,host,host label,plantstructure,plantstructure label,synonyms
0,OOPS:{},Pseudomonas cichorii Bacterial leaf spot of Ge...,NCBITaxon:1441629,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Pseudomonas cichorii Bacterial leaf spot of Ge...
1,OOPS:{},Ralstonia solanacearum Southern wilt of Gerber...,NCBITaxon:1457195,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Ralstonia solanacearum Southern wilt of Gerber...
4,OOPS:{},Alternaria alternata Alternaria leaf spot of G...,NCBITaxon:187775,Alternaria alternata (Fr.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Alternaria alternata Alternaria leaf spot of G...
8,OOPS:{},Thielaviopsis basicola Black root rot of Gerbe...,NCBITaxon:124036,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Thielaviopsis basicola Black root rot of Gerbe...
9,OOPS:{},Chalara elegans Nag Raj & Kendr. [synanamorph]...,NCBITaxon:301394,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Chalara elegans Nag Raj & Kendr. [synanamorph]...
10,OOPS:{},Botrytis cinerea Pers. Botrytis blight of Gerb...,NCBITaxon:1290391,Botrytis cinerea Pers.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Botrytis cinerea Pers. Botrytis blight of Gerb...
12,OOPS:{},Cercospora gerberae Chupp & Viegas Cercospora ...,NCBITaxon:1247214,Cercospora gerberae Chupp & Viegas,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0025034,leaf,Cercospora gerberae Chupp & Viegas Cercospora ...
13,OOPS:{},Plasmopara sp. Downy mildew of Gerbera jameson...,NCBITaxon:4780,Plasmopara sp.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Plasmopara sp. Downy mildew of Gerbera jameson...
14,OOPS:{},Fusarium solani Fusarium crown rot of Gerbera ...,NCBITaxon:1501798,Fusarium solani (Mart.) Sacc.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0000003,whole plant,Fusarium solani Fusarium crown rot of Gerbera ...
15,OOPS:{},Fusarium oxysporum Schlechtend. Fusarium root ...,NCBITaxon:909455,Fusarium oxysporum Schlechtend.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO:0009005,root,Fusarium oxysporum Schlechtend. Fusarium root ...


In [34]:
# add in the IDs
start = 13  # OOPS:0000013 is the last one in the upper_level.owl

for index, row in final_df.iterrows():

    # add in some OOPS:IDs
    start += 1
    new_id = '{0:07d}'.format(start)  # zero-pad the number to 7 digits
    # write through .at so the change reaches final_df, not just the row copy
    final_df.at[index, 'defined_class'] = row['defined_class'].format(new_id)
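
The ID assignment can also be sketched without a loop. This minimal example assumes a made-up three-row frame (`toy_df` is hypothetical) whose index has gaps, as `final_df` does after duplicates are dropped; the IDs are numbered by row position, not by index label.

```python
import pandas as pd

# Hypothetical stand-in for final_df; note the non-contiguous index.
toy_df = pd.DataFrame({'defined_class': ['OOPS:{}'] * 3}, index=[0, 1, 4])

start = 13  # last ID already used, as in the cell above

# One zero-padded, 7-digit ID per row, counting up from start + 1.
new_ids = ['OOPS:{0:07d}'.format(n)
           for n in range(start + 1, start + 1 + len(toy_df))]

# Assigning a plain list fills the column by position, ignoring index labels.
toy_df['defined_class'] = new_ids
print(toy_df['defined_class'].tolist())
```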

In [36]:
#  2b. re-write the .tsv file
# Arguments that only restated pandas defaults are omitted, as are tupleize_cols and
# line_terminator, which recent pandas releases no longer accept.
final_df.to_csv('../patterns/disease/APS_Scrape_to_DP.tsv', sep='\t', na_rep='', header=True, index=False)
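
Before handing the file to DOSDP it can be worth checking that the columns survive a write/read round trip in the desired header order. The sketch below is an assumption-laden illustration: it writes a one-row made-up frame (`toy_df`, with a placeholder disease label) to an in-memory buffer instead of the real output path.

```python
import io
import pandas as pd

# Desired header order from the top of the notebook.
expected_columns = [
    'defined_class', 'defined_class label',
    'pathogen', 'pathogen label',
    'host', 'host label',
    'plantstructure', 'plantstructure label',
]

# Hypothetical single row; 'synonyms' is left out to keep the example short.
toy_df = pd.DataFrame([[
    'OOPS:0000021', 'placeholder disease label',
    'NCBITaxon:4780', 'Plasmopara sp.',
    'NCBITaxon:13547', 'Gerbera jamesonii H. Bolus ex J. D. Hook',
    'PO:0000003', 'whole plant',
]], columns=expected_columns)

# Write tab-separated text to memory, then read it back as strings.
buffer = io.StringIO()
toy_df.to_csv(buffer, sep='\t', index=False)
buffer.seek(0)
reloaded_df = pd.read_csv(buffer, sep='\t', dtype=str)

columns_match = list(reloaded_df.columns) == expected_columns
values_match = reloaded_df.values.tolist() == toy_df.values.tolist()
print(columns_match, values_match)
```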