# Intro

This script/notebook starts from a data file that was scraped from APS.org using the Samara scraper.

This file is consumed by MAL's script that inspects the disease names, and determines what plant part is being affected.  Two columns are added to the datafile, resulting in a file with 14 columns.

This script then takes that file (aps_with_po_mappings.tsv) and removes unneeded columns, and formats the remaining columns for consumption using DOSDP tools.

DOSDP consumes a .tsv, and creates an ontology file.

In [2]:
# Imports
import csv
import pandas as pd
import numpy as np

## desired header order:
iri	iri label	pathogen	pathogen label	host	host label	plantstructure	plantstructure label

iri  (OOPS:xxx)

iri_label (disease common name)

pathogen (NCBITaxon:xxxxxx)

pathogen_label ( pathogen name )

host (NCBITaxon:xxxxxx)

host_label (host common name)

plantstructure (PO:xxx)

plantstructure_label (common name)

# Clean the original file to prepare for pandas dataframe loading

This file was created by MAL using her script to identify plant structures contained within the original Samara scrape

The headers are intact (Feb 2018), but the file is missing headers for the two new plant structure columns (plant_part_id and plant_part_name).  These are added.

Then the file is loaded into a pandas dataframe.

In [1]:
raw_scrape_file = '../data/aps_with_po_mappings.tsv'

In [3]:
# load in the headers:  Need to check if they are expected.  
column_headers = None

# print the first line of the file:  Shows there is no header. (FIXED: feb 2018)
with open(raw_scrape_file,'r') as infile:
    first_line = infile.readline()
    column_headers = first_line.split('\t')
    print(column_headers)
    print(len(column_headers))
    
# add the missing column headers
column_headers.append("plant_part_id")
column_headers.append("plant_part_name")
print(type(column_headers),len(column_headers))

######  NOTE THE NEWLINE CHARACTER AT THE END OF THE HEADER COLUMN!!!!
# remove extra whitespace from headers
clean_headers = [x.strip() for x in column_headers]

# we need to check if the headers are what we expect.  Throw an error otherwise
expected_headers = ['disease_name', 'source_taxon_verbatim_name', 'source_taxon_name', 'source_taxon_id', 'interaction_type_label', 'interaction_type_id', 'target_taxon_verbatim_name', 'target_taxon_name', 'target_taxon_id', 'source_citation', 'source_url', 'source_accessed_at', 'plant_part_id', 'plant_part_name']

if clean_headers == expected_headers:
    print("Headers are as expected.  Continue as planned.")
else:
    raise ValueError('infile headers do not match expected headers')




['disease_name', 'source_taxon_verbatim_name', 'source_taxon_name', 'source_taxon_id', 'interaction_type_label', 'interaction_type_id', 'target_taxon_verbatim_name', 'target_taxon_name', 'target_taxon_id', 'source_citation', 'source_url', 'source_accessed_at\n']
12
<class 'list'> 14
Headers are as expected.  Continue as planned.
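
The header check above can also be wrapped in a small reusable function that reports which names are missing or unexpected. This is only a sketch: the function name `read_and_check_headers` and the three-column toy input are assumptions made for illustration, not part of the original pipeline.

```python
import io

def read_and_check_headers(handle, extra, expected):
    """Read the first line of a TSV, append `extra` names, compare to `expected`."""
    # strip() removes the trailing newline that readline() leaves on the last header
    headers = [h.strip() for h in handle.readline().split('\t')]
    headers.extend(extra)
    if headers != expected:
        missing = [h for h in expected if h not in headers]
        unexpected = [h for h in headers if h not in expected]
        raise ValueError(
            'header mismatch; missing={}, unexpected={}'.format(missing, unexpected))
    return headers

# toy stand-in for the real file: three scraped columns, then one data row
toy_file = io.StringIO('disease_name\tsource_taxon_id\ttarget_taxon_id\nrow1\ta\tb\n')
checked = read_and_check_headers(
    toy_file,
    extra=['plant_part_id', 'plant_part_name'],
    expected=['disease_name', 'source_taxon_id', 'target_taxon_id',
              'plant_part_id', 'plant_part_name'])
print(checked)
```

Raising instead of printing means a changed scrape format stops the notebook before any malformed rows reach the later steps.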


# Load into Pandas DF:

Now that the original data is cleaned up, we can load it into a pandas dataframe for some further QC

In [4]:
# load in the raw file into a pandas df.  
# skiprows = don't read the header row.  We checked the headers earlier, and had to add the two extra ones.
df = pd.read_csv(raw_scrape_file, sep='\t', header=None, skiprows=1, names=clean_headers)

df

Unnamed: 0,disease_name,source_taxon_verbatim_name,source_taxon_name,source_taxon_id,interaction_type_label,interaction_type_id,target_taxon_verbatim_name,target_taxon_name,target_taxon_id,source_citation,source_url,source_accessed_at,plant_part_id,plant_part_name
0,Bacterial leaf spot,Pseudomonas cichorii (Swingle 1925) Stapp 1928,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:1441629,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
1,Southern wilt,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:1457195,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,,
2,Southern wilt,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,,
3,Ascochyta leaf spot,Ascochyta doronici Allesch.,Ascochyta doronici Allesch.,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
4,Alternaria leaf spot,Alternaria alternata (Fr.:Fr.) Keissl.,Alternaria alternata (Fr.,NCBITaxon:187775,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
5,Alternaria leaf spot,Alternaria alternata (Fr.:Fr.) Keissl.,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
6,Alternaria leaf spot,A. dauci (Kühn) Groves & Skolko,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
7,Alternaria leaf spot,A. gerberae Rabbe et al.,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
8,Black root rot,Thielaviopsis basicola (Berk. & Broome) Ferraris,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:124036,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0009005,root
9,Black root rot,Chalara elegans Nag Raj & Kendr. [synanam...,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:301394,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0009005,root


# fill in blank plant parts with "whole plant"

You'll notice above that the last two columns contain a lot of "NaN" in them.  That's because Marie's script only filled in 'plant_part_id' and 'plant_part_name' for rows where a plant part could be found in the name/row.  This left a lot of blanks (NaN) in the pandas dataframe.  These need to be filled to prevent errors down the road.

Note that the blanks are filled with the short ID "PO_0000003", not the full purl.  Mixing the two forms was deemed a problem, so the purl prefix also needs to be removed from the other rows.
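
As a minimal sketch of the fill on a toy frame (the two rows below are a stand-in, not the real data), both columns can be filled in one call by passing a dict to `DataFrame.fillna`:

```python
import numpy as np
import pandas as pd

# toy stand-in for the scraped table: the second row has no plant part
toy = pd.DataFrame({
    'disease_name': ['Bacterial leaf spot', 'Southern wilt'],
    'plant_part_id': ['http://purl.obolibrary.org/obo/PO_0025034', np.nan],
    'plant_part_name': ['leaf', np.nan],
})

# a dict maps each column to its own fill value; the result is a new frame
filled = toy.fillna({'plant_part_id': 'PO_0000003',
                     'plant_part_name': 'whole plant'})
print(filled)
```

Because the result is a new frame, the original stays available for comparison.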

In [5]:
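# assign the result back; fillna(inplace=True) on a selected column may act on a copy in newer pandas
df['plant_part_id'] = df['plant_part_id'].fillna('PO_0000003')
df['plant_part_name'] = df['plant_part_name'].fillna('whole plant')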

df

Unnamed: 0,disease_name,source_taxon_verbatim_name,source_taxon_name,source_taxon_id,interaction_type_label,interaction_type_id,target_taxon_verbatim_name,target_taxon_name,target_taxon_id,source_citation,source_url,source_accessed_at,plant_part_id,plant_part_name
0,Bacterial leaf spot,Pseudomonas cichorii (Swingle 1925) Stapp 1928,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:1441629,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
1,Southern wilt,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:1457195,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0000003,whole plant
2,Southern wilt,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0000003,whole plant
3,Ascochyta leaf spot,Ascochyta doronici Allesch.,Ascochyta doronici Allesch.,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
4,Alternaria leaf spot,Alternaria alternata (Fr.:Fr.) Keissl.,Alternaria alternata (Fr.,NCBITaxon:187775,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
5,Alternaria leaf spot,Alternaria alternata (Fr.:Fr.) Keissl.,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
6,Alternaria leaf spot,A. dauci (Kühn) Groves & Skolko,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
7,Alternaria leaf spot,A. gerberae Rabbe et al.,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0025034,leaf
8,Black root rot,Thielaviopsis basicola (Berk. & Broome) Ferraris,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:124036,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0009005,root
9,Black root rot,Chalara elegans Nag Raj & Kendr. [synanam...,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:301394,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,http://purl.obolibrary.org/obo/PO_0009005,root


In [6]:
# Creating a new DataFrame using only the columns I need for patternapply.py

# remove the url from the plant part IDs

# take only the last 10 characters of the column (the PO IDs, e.g. PO_0025034, are 10 characters long)
df['plant_part_id'] = df['plant_part_id'].str[-10:]
df

Unnamed: 0,disease_name,source_taxon_verbatim_name,source_taxon_name,source_taxon_id,interaction_type_label,interaction_type_id,target_taxon_verbatim_name,target_taxon_name,target_taxon_id,source_citation,source_url,source_accessed_at,plant_part_id,plant_part_name
0,Bacterial leaf spot,Pseudomonas cichorii (Swingle 1925) Stapp 1928,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:1441629,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
1,Southern wilt,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:1457195,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0000003,whole plant
2,Southern wilt,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0000003,whole plant
3,Ascochyta leaf spot,Ascochyta doronici Allesch.,Ascochyta doronici Allesch.,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
4,Alternaria leaf spot,Alternaria alternata (Fr.:Fr.) Keissl.,Alternaria alternata (Fr.,NCBITaxon:187775,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
5,Alternaria leaf spot,Alternaria alternata (Fr.:Fr.) Keissl.,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
6,Alternaria leaf spot,A. dauci (Kühn) Groves & Skolko,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
7,Alternaria leaf spot,A. gerberae Rabbe et al.,,no:match,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0025034,leaf
8,Black root rot,Thielaviopsis basicola (Berk. & Broome) Ferraris,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:124036,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0009005,root
9,Black root rot,Chalara elegans Nag Raj & Kendr. [synanam...,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:301394,pathogen of,http://purl.obolibrary.org/obo/RO_0002556,Diseases of African Daisy (Gerbera jamesonii H...,Gerbera jamesonii H. Bolus ex J. D. Hook,NCBITaxon:13547,"R. Wick and B. Dicklow, primary collators (las...",http://www.apsnet.org/publications/commonnames...,2016-09-07,PO_0009005,root
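
The fixed-width slice works because every PO ID here is exactly 10 characters long. A sketch of an alternative that does not depend on the length is to split on the last `/`. The final conversion to the colon form (`PO:xxx`) is an assumption based on the desired header description, not something the original cell does.

```python
import pandas as pd

# toy values: one full purl and one ID that was already short
ids = pd.Series(['http://purl.obolibrary.org/obo/PO_0025034', 'PO_0000003'])

# keep whatever follows the last '/'; values without a '/' pass through unchanged
short_ids = ids.str.rsplit('/', n=1).str[-1]

# assumption: if the colon form is wanted, swap the first underscore for a colon
curies = short_ids.str.replace('_', ':', n=1, regex=False)

print(short_ids.tolist())
print(curies.tolist())
```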



patternapply.py expects its input in a specific format, so the data is reshaped with pandas first.

## Steps:

1. Make a copy of the original dataframe to avoid confusion

2. Remove all columns not needed (keep disease_name to feed disease_name_formatter())

3. Add additional columns needed by 'pattern_apply.py'  (iri, and iri label)

4. Rename all column headers to comply with the pattern

5. Remove all rows with missing data (maybe move them to new 'rejects.tsv' for later use)

6. Write the remaining df to file

7. Run pattern_apply.py on it. (this is now replaced by DOSDP tools)


## desired header order:
iri, iri label, pathogen, pathogen label, host, host label, plantstructure, plantstructure label

 [iri (OOPS:xxx), iri_label (disease common name), pathogen (NCBITaxon:xxxxxx), pathogen_label ( pathogen name ), host (NCBITaxon:xxxxxx), host_label (host common name), plantstructure (PO:xxx), plantstructure_label (common name)]

In [7]:
#### Step 1 ####
# start with a copy of the original dataframe.
patterndf = df.copy()


#### Step 2 ####
# remove unused columns
patterndf.drop("interaction_type_label", axis=1, inplace=True)
patterndf.drop("interaction_type_id", axis=1, inplace=True)
patterndf.drop("source_citation", axis=1, inplace=True)
patterndf.drop("source_url", axis=1, inplace=True)
patterndf.drop("source_accessed_at", axis=1, inplace=True)
patterndf.drop("target_taxon_verbatim_name", axis=1, inplace=True)
patterndf.drop("source_taxon_verbatim_name", axis=1, inplace=True)

#### Step 3 ####
# add iri and iri_label columns
patterndf['iri'] = 'OOPS:xxxxxxx'  # in need of a system to assign ID numbers...
# just rename the disease name column as the iri label
patterndf.rename(columns = {'disease_name':'iri_label'}, inplace = True)

#### Step 4 ####
# rename column headers to accurately reflect the patternapply needs
patterndf.rename(columns = {'source_taxon_name':'pathogen_label'}, inplace = True)
patterndf.rename(columns = {'source_taxon_id':'pathogen'}, inplace = True)
patterndf.rename(columns = {'target_taxon_id':'host'}, inplace = True)
patterndf.rename(columns = {'target_taxon_name':'host_label'}, inplace = True)
patterndf.rename(columns = {'plant_part_id':'plantstructure'}, inplace = True)
patterndf.rename(columns = {'plant_part_name':'plantstructure_label'}, inplace = True)

#### Step 4.5 ####
## reorder the columns
desired_order = ['iri', 'iri_label','pathogen','pathogen_label','host','host_label','plantstructure','plantstructure_label']
patterndf = patterndf[desired_order]

patterndf

Unnamed: 0,iri,iri_label,pathogen,pathogen_label,host,host_label,plantstructure,plantstructure_label
0,OOPS:xxxxxxx,Bacterial leaf spot,NCBITaxon:1441629,Pseudomonas cichorii (Swingle 1925) Stapp 1928,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf
1,OOPS:xxxxxxx,Southern wilt,NCBITaxon:1457195,Ralstonia solanacearum (Smith 1896) Yabuuchi e...,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0000003,whole plant
2,OOPS:xxxxxxx,Southern wilt,no:match,,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0000003,whole plant
3,OOPS:xxxxxxx,Ascochyta leaf spot,no:match,Ascochyta doronici Allesch.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf
4,OOPS:xxxxxxx,Alternaria leaf spot,NCBITaxon:187775,Alternaria alternata (Fr.,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf
5,OOPS:xxxxxxx,Alternaria leaf spot,no:match,,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf
6,OOPS:xxxxxxx,Alternaria leaf spot,no:match,,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf
7,OOPS:xxxxxxx,Alternaria leaf spot,no:match,,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0025034,leaf
8,OOPS:xxxxxxx,Black root rot,NCBITaxon:124036,Thielaviopsis basicola (Berk. & Broome) Ferraris,NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0009005,root
9,OOPS:xxxxxxx,Black root rot,NCBITaxon:301394,Chalara elegans Nag Raj & Kendr. [synanamorph],NCBITaxon:13547,Gerbera jamesonii H. Bolus ex J. D. Hook,PO_0009005,root
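
Steps 2 to 4.5 can also be written as one rename mapping followed by a column selection. This is a sketch on a single toy row; the name `column_map` is hypothetical, while the column names themselves come from the cell above.

```python
import pandas as pd

# old header -> header expected by the pattern
column_map = {
    'disease_name': 'iri_label',
    'source_taxon_id': 'pathogen',
    'source_taxon_name': 'pathogen_label',
    'target_taxon_id': 'host',
    'target_taxon_name': 'host_label',
    'plant_part_id': 'plantstructure',
    'plant_part_name': 'plantstructure_label',
}
desired_order = ['iri', 'iri_label', 'pathogen', 'pathogen_label',
                 'host', 'host_label', 'plantstructure', 'plantstructure_label']

# toy row with one extra column that should be dropped
toy = pd.DataFrame([{
    'disease_name': 'Bacterial leaf spot',
    'source_taxon_id': 'NCBITaxon:1441629',
    'source_taxon_name': 'Pseudomonas cichorii (Swingle 1925) Stapp 1928',
    'target_taxon_id': 'NCBITaxon:13547',
    'target_taxon_name': 'Gerbera jamesonii H. Bolus ex J. D. Hook',
    'plant_part_id': 'PO_0025034',
    'plant_part_name': 'leaf',
    'source_url': 'dropped by the selection below',
}])

# selecting desired_order both reorders the columns and drops the unlisted ones
shaped = toy.rename(columns=column_map).assign(iri='OOPS:xxxxxxx')[desired_order]
print(shaped.columns.tolist())
```

Selecting with a list raises a `KeyError` if any expected column is absent, which doubles as a check on the renaming.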


# Clean the data:  

Will need to simply remove anything that has a "NaN" or "no:match"

In [8]:
# start with a check on the dimensions of our dataframe, establishing a starting point
patterndf.shape

(11751, 8)

In [9]:
# 1. Move all rows with a "no:match" in either pathogen or host column into a "reject.df"

no_pathogen_df = patterndf[patterndf['pathogen'] == 'no:match']
patterndf = patterndf[patterndf['pathogen'] != 'no:match']

print('Rows with pathogen: {}\nRows without: {}'.format(patterndf.shape[0],no_pathogen_df.shape[0]))

no_host_df = patterndf[patterndf['host'] == 'no:match']
patterndf = patterndf[patterndf['host'] != 'no:match']

print('Rows with host: {}\nRows without: {}'.format(patterndf.shape[0],no_host_df.shape[0]))


patterndf.shape

Rows with pathogen: 7029
Rows without: 4722
Rows with host: 6100
Rows without: 929


(6100, 8)

In [10]:
### remove any row with a "NaN"
patterndf = patterndf.dropna(how='any')

patterndf.shape

(6094, 8)
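
Step 5 of the plan mentions keeping the removed rows for later use. A minimal sketch of that idea, assuming a toy frame and a hypothetical output file name `rejects.tsv`, builds one boolean mask and splits the frame with it:

```python
import numpy as np
import pandas as pd

# toy rows: one clean, one without pathogen, one without host, one with a NaN label
toy = pd.DataFrame({
    'pathogen': ['NCBITaxon:1441629', 'no:match', 'NCBITaxon:187775', 'NCBITaxon:124036'],
    'pathogen_label': ['Pseudomonas cichorii', np.nan, 'Alternaria alternata', np.nan],
    'host': ['NCBITaxon:13547', 'NCBITaxon:13547', 'no:match', 'NCBITaxon:13547'],
})

# a row is rejected if pathogen or host is 'no:match', or if any cell is NaN
bad = (toy[['pathogen', 'host']] == 'no:match').any(axis=1) | toy.isna().any(axis=1)
rejects = toy[bad]
kept = toy[~bad]

# hypothetical output path; uncomment to write the rejected rows to disk
# rejects.to_csv('rejects.tsv', sep='\t', index=False)
print(len(kept), len(rejects))
```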

# Results:

The original data set had 11751 rows (diseases).
4722 of them did not have a pathogen match, 929 more did not have a host match, and 6 more contained a NaN.

## 6094 diseases!
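
The row counts printed by the cleaning cells can be checked against each other with a few lines of arithmetic. All numbers below are copied from the outputs above.

```python
start_rows = 11751   # patterndf.shape before cleaning
no_pathogen = 4722   # rows with pathogen == 'no:match'
no_host = 929        # remaining rows with host == 'no:match'
final_rows = 6094    # patterndf.shape after dropna

after_pathogen = start_rows - no_pathogen
after_host = after_pathogen - no_host
dropped_nan = after_host - final_rows

print(after_pathogen, after_host, dropped_nan)  # 7029 6100 6
```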



# Name formatting function

It will be important to have a function that can dynamically name the diseases without too much repetition.
The function will be fed a name (full of weird parentheses and references), a pathogen (from NCBI), a host (also from NCBI), and a plant part (whole plant is used as a default).

If a plant part is included, that plant part should be used as part of the name.  But if the 'part' is the whole plant, that part is not to be included in the name.

The important part of this function is that it doesn't do any of the column parsing or other dataframe work.  All reading of the original files and splitting of them into dataframes should be done outside of this function, so this function can be reused no matter how the original data is parsed.
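
The plant part rule described above can be sketched as a small pure function. The function name `add_plant_part` and the `'{name} ({part})'` template are assumptions made for illustration; the notebook does not say how the part should appear in the name.

```python
def add_plant_part(name, part, default_part='whole plant', template='{name} ({part})'):
    """Return `name` with the plant part added, unless the part is the default.

    The output template is an assumption; swap it for the agreed naming scheme.
    """
    # the default part ('whole plant') is never added to the name
    if part == default_part:
        return name
    # skip the part if the name already mentions it (e.g. 'Black root rot' and 'root')
    if part.lower() in name.lower():
        return name
    return template.format(name=name, part=part)

print(add_plant_part('Southern wilt', 'whole plant'))
print(add_plant_part('Black root rot', 'root'))
print(add_plant_part('Anthracnose', 'leaf'))
```

Keeping this as a separate function that takes plain strings means it can be tested without a dataframe, in line with the reuse goal stated above.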

In [11]:
# let's also write a little structure to hold the categories for names being formatted, so we can improve this process.
# you'll need to reset these values after looping through the whole dataframe
name_structures = {
    'A': 0,
    'B': 0,
    'C': 0,
    'D': 0
}

def disease_name_formatter(rawname,host,pathogen,part):
    """
    function to format the names of diseases.  
    Requires:
    rawname - the name that was given in the APS scrape
    host - obvious
    pathogen - obvious
    part - plant part name; if not mentioned explicitly in 'aps_with_po_mappings' (from MAL), should default to "whole plant" (currently not used when building the name)
    
    """
    # start for real by checking if all the things passed in are what we expect them to be (ie: strings)
    # continue by stripping any bracketed reference, and whitespace.
    
    assert (isinstance(rawname, str)),"raw names must be strings"
    short_name = rawname.split('(')[0].strip()
    assert (isinstance(pathogen, str)),"pathogen names must be strings!!\n {} is not a string".format(pathogen)
    pathogenname = pathogen.split('(')[0].strip()
    assert (isinstance(host, str)),"host names must be strings"
    # strip any bracketed reference from the host name, as is done for the pathogen
    hostname = host.split('(')[0].strip()
    new_name = None
    

    if pathogenname in short_name and hostname in short_name:
        # A: the APS name already mentions both pathogen and host
        new_name = short_name
        name_structures['A'] +=1
        print('A -',new_name)
    elif pathogenname in short_name:
        # B: only the pathogen is mentioned, so append the host
        new_name = '{aps_name} of {host}'.format(aps_name=short_name, host=hostname)
        name_structures['B'] +=1
        print('B -', new_name)
    elif hostname in short_name:
        # C: only the host is mentioned, so prepend the pathogen
        new_name = '{pathogen} {aps_name}'.format(pathogen=pathogenname,aps_name=short_name)
        name_structures['C'] +=1
        print('C -', new_name)
    else:
        # D: neither is mentioned, so add both
        new_name = '{pathogen} {aps_name} of {host}'.format(aps_name=short_name, pathogen=pathogenname, host=hostname)
        name_structures['D'] +=1
        print('D -', new_name )
    return new_name


In [12]:

# reset the counts:
name_structures = {
    'A': 0,
    'B': 0,
    'C': 0,
    'D': 0
}

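# run the whole dataframe through the name-formatter function.
# Writing to the rows yielded by iterrows() is not guaranteed to update patterndf,
# so the formatted names are assigned back to the column instead.
patterndf['iri_label'] = patterndf.apply(
    lambda row: disease_name_formatter(row['iri_label'], row['host_label'],
                                       row['pathogen_label'], row['plantstructure_label']),
    axis=1)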


D - Pseudomonas cichorii Bacterial leaf spot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Ralstonia solanacearum Southern wilt of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Alternaria alternata Alternaria leaf spot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Thielaviopsis basicola Black root rot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Chalara elegans Nag Raj & Kendr. [synanamorph] Black root rot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Botrytis cinerea Pers. Botrytis blight of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Cercospora gerberae Chupp & Viegas Cercospora leaf spot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Plasmopara sp. Downy mildew of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Fusarium solani Fusarium crown rot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Fusarium oxysporum Schlechtend. Fusarium root rot of Gerbera jamesonii H. Bolus ex J. D. Hook
D - Fusarium oxysporum Schlechtend. Fusarium wilt of Gerbera jamesonii H. Bolus ex J. D. Hoo

D - Penicillium aurantiogriseum Dierckx. Blue mold rot of Asparagus officinalis
D - Cercospora asparagi Sacc. Cercospora blight of Asparagus officinalis
D - Fusarium culmorum Dead stem of Asparagus officinalis
D - Fusarium oxysporum Fusarium crown and root rot of Asparagus officinalis
D - Fusarium redolens Wollenw. Fusarium crown and root rot of Asparagus officinalis
D - Gibberella fujikuroi Fusarium crown and root rot of Asparagus officinalis
D - Gibberella fujikuroi Fusarium crown and root rot of Asparagus officinalis
D - Fusarium oxysporum Fusarium spear spot of Asparagus officinalis
D - Fusarium redolens Wollenw. Fusarium spear spot of Asparagus officinalis
D - Botrytis cinerea Pers. Gray mold shoot blight of Asparagus officinalis
D - Alternaria alternata Leaf spot of Asparagus officinalis
D - Phomopsis asparagi Phomopsis blight of Asparagus officinalis
D - Phomopsis asparagicola Bausa Alcalde Phomopsis blight of Asparagus officinalis
D - Phomopsis javanica Uecker et D.A. Johnson P

D - Pratylenchus spp. Lesion nematodes of Beta vulgaris L.
D - Longidorus spp. Needle nematodes of Beta vulgaris L.
D - Ditylenchus destructor Thorne Potato rot nematode of Beta vulgaris L.
D - Meloidogyne spp. Root-knot of Beta vulgaris L.
D - Ditylenchus dipsaci Stem and bulb nematode of Beta vulgaris L.
D - Paratrichodorus spp. Stubby-root nematodes of Beta vulgaris L.
D - Trichodorus spp. Stubby-root nematodes of Beta vulgaris L.
D - Alfalfa mosaic virus Alfalfa mosaic of Beta vulgaris L.
D - Beet curly top virus Beet curly top of Beta vulgaris L.
D - Beet western yellows virus Beet mild yellows & Beet western yellows of Beta vulgaris L.
D - Beet mosaic virus Beet mosaic of Beta vulgaris L.
D - Beet yellows virus Beet yellows of Beta vulgaris L.
D - Cucumber mosaic virus Cucumber mosaic of Beta vulgaris L.
D - Lettuce infectious yellows virus Lettuce infectious yellows of Beta vulgaris L.
D - Beet necrotic yellow vein virus Rhizomania of Beta vulgaris L.
D - Polymyxa betae Keskin) 

D - Bipolaris setaria Bipolaris leaf spot of Dendrathema × grandiflorum Kitam. = Chrysanthemum × morifolium Ramat.
D - Cercospora chrysanthemi Heald & F. A. Wolf Cercospora leaf spot of Dendrathema × grandiflorum Kitam. = Chrysanthemum × morifolium Ramat.
D - Macrophomina phaseolina Charcoal stem rot of Dendrathema × grandiflorum Kitam. = Chrysanthemum × morifolium Ramat.
D - Cylindrosporium chrysanthemi Ellis & spot Dearn. Cylindrosporium leaf of Dendrathema × grandiflorum Kitam. = Chrysanthemum × morifolium Ramat.
D - Fusarium oxysporum Schlechtend. Fusarium wilt of Dendrathema × grandiflorum Kitam. = Chrysanthemum × morifolium Ramat.
D - Itersonilia perplexans Derx Itersonilia petal blight of Dendrathema × grandiflorum Kitam. = Chrysanthemum × morifolium Ramat.
D - Pythium spp. Pythium root rot of Dendrathema × grandiflorum Kitam. = Chrysanthemum × morifolium Ramat.
D - Pythium ultimum Trow Pythium root rot of Dendrathema × grandiflorum Kitam. = Chrysanthemum × morifolium Ramat.
D -

D - Maize Rio IV virus Maize Rio IV* of Zea mays L.
D - Maize stunting virus Maize stunting* of Zea mays L.
D - Wheat spot mosaic virus Wheat spot mosaic of Zea mays L.
D - Xanthomonas campestris pv. malvacearum Bacterial blight of Gossypium spp.
D - Agrobacterium tumefaciens Crown gall of Gossypium spp.
D - Erwinia herbicola Lint degradation of Gossypium spp.
D - Pantoea agglomerans Lint degradation of Gossypium spp.
D - Colletotrichum gossypii Southworth [anamorph] Anthracnose of Gossypium spp.
D - Ramularia gossypii Areolate mildew of Gossypium spp.
D - Cercosporella gossypii Speg. Areolate mildew of Gossypium spp.
D - Mycosphaerella areola J. Ehrlich & F.A. Wolf [teleomorph] Areolate mildew of Gossypium spp.
D - Ascochyta gossypii Woronichin Ascochyta blight of Gossypium spp.
D - Thielaviopsis basicola Black root rot of Gossypium spp.
D - Chalara elegans Nag Raj & Kendrick [synanamorph] Black root rot of Gossypium spp.
D - Ascochyta gossypii Woronichin Boll rot of Gossypium spp.
D 

D - Botrytis cinerea Pers. Flower blight of Dahlia sp.
D - Alternaria alternata Leaf spot of Dahlia sp.
D - Erysiphe communis Powdery mildew of Dahlia sp.
D - Erysiphe polygoni DC. Powdery mildew of Dahlia sp.
D - Athelia rolfsii Southern blight of Dahlia sp.
D - Entyloma dahliae Syd. & P. Syn. Smut of Dahlia sp.
D - Rhizoctonia solani Kühn Stem and tuber rot of Dahlia sp.
D - Thanatephorus cucumeris Stem and tuber rot of Dahlia sp.
D - Sclerotinia sclerotiorum Cottony stem rot of Dahlia sp.
D - Fusarium oxysporum Schlechtend. Vascular wilt of Dahlia sp.
D - Verticillium albo-atrum Ranch and Berth. Vascular wilt of Dahlia sp.
D - Cucumber mosaic virus Mosaic of Dahlia sp.
D - Dahlia mosaic virus Mosaic of Dahlia sp.
D - Tomato spotted wilt virus Mosaic of Dahlia sp.
D - Impatiens necrotic spot virus Ringspot of Dahlia sp.
D - Xylophilus ampelinus Bacterial blight of Vitis spp.
D - Xanthomonas ampelina Panagopoulos Bacterial blight of Vitis spp.
D - Agrobacterium tumefaciens Crown gall 

D - Rhizoctonia cerealis Van der Hoeven Sharp eyespot of Avena sativa L.
D - Ceratobasidium cereale D. Murray & L.L. Burpee Sharp eyespot of Avena sativa L.
D - Ustilago segetum Smut, covered of Avena sativa L.
D - Ustilago avenae Smut, loose of Avena sativa L.
D - Microdochium nivale Snow mold, pink of Avena sativa L.
D - Fusarium nivale Ces. ex Berl. & Voglino Snow mold, pink of Avena sativa L.
D - Monographella nivalis Snow mold, pink of Avena sativa L.
D - Typhula idahoensis Remsberg Snow mold, speckled or gray of Avena sativa L.
D - Typhula incarnata Lasch Snow mold, speckled or gray of Avena sativa L.
D - Stagonospora avenae Speckled blotch of Avena sativa L.
D - Septoria avenae A.B. Frank Speckled blotch of Avena sativa L.
D - Phaeosphaeria avenaria G. F. Weber) O. Eriksson Speckled blotch of Avena sativa L.
D - Gaeumannomyces graminis Take-all of Avena sativa L.
D - Gaeumannomyces graminis Take-all of Avena sativa L.
D - Bipolaris victoriae Victoria blight of Avena sativa L.
D 

D - Tomato black ring virus Shoot stunting of Prunus persica var. nucipersica
D - Prune dwarf virus Stunt of Prunus persica
D - Prunus necrotic ringspot virus Stunt of Prunus persica
D - Prune dwarf virus Stunt of Prunus persica var. nucipersica
D - Prunus necrotic ringspot virus Stunt of Prunus persica var. nucipersica
D - Tomato ringspot virus Yellow bud mosaic of Prunus persica
D - Tomato ringspot virus Yellow bud mosaic of Prunus persica var. nucipersica
D - Cytospora canker Peach tree short life of Prunus persica
D - Cytospora canker Peach tree short life of Prunus persica var. nucipersica
D - Pseudomonas solanacearum Bacterial wilt of Arachis hypogaea L.
D - Alternaria tenuissima Alternaria leaf blight of Arachis hypogaea L.
D - Alternaria arachidis Kulk. Alternaria leaf spot of Arachis hypogaea L.
D - Alternaria alternata Alternaria spot and veinal necrosis of Arachis hypogaea L.
D - Colletotrichum arachidis Sawada Anthracnose of Arachis hypogaea L.
D - Colletotrichum dematium A

D - Pythium irregulare Buisman Root rot of Brassica rapa
D - Pythium irregulare Buisman Root rot of Brassica campestris
D - Rhizoctonia solani Kühn Root rot of Brassica napus
D - Rhizoctonia solani Kühn Root rot of Brassica rapa
D - Rhizoctonia solani Kühn Root rot of Brassica campestris
D - Thanatephorus cucumeris Root rot of Brassica napus
D - Thanatephorus cucumeris Root rot of Brassica rapa
D - Thanatephorus cucumeris Root rot of Brassica campestris
D - Athelia rolfsii Root rot of Brassica napus
D - Athelia rolfsii Root rot of Brassica rapa
D - Athelia rolfsii Root rot of Brassica campestris
D - Sclerotinia sclerotiorum Sclerotinia stem rot of Brassica napus
D - Sclerotinia sclerotiorum Sclerotinia stem rot of Brassica rapa
D - Sclerotinia sclerotiorum Sclerotinia stem rot of Brassica campestris
D - Alternaria spp. Seed rot, damping-off of Brassica napus
D - Alternaria spp. Seed rot, damping-off of Brassica rapa
D - Alternaria spp. Seed rot, damping-off of Brassica campestris
D - F

D - Plasmopara helianthi Novot. f. helianthi Downy mildew of Helianthus tuberosus
D - Fusarium equiseti Fusarium stalk rot of Helianthus annuus
D - Fusarium equiseti Fusarium stalk rot of Helianthus tuberosus
D - Nectria haematococca Berk. & Broome Fusarium stalk rot of Helianthus annuus
D - Nectria haematococca Berk. & Broome Fusarium stalk rot of Helianthus tuberosus
D - Fusarium tabacinum Fusarium stalk rot of Helianthus annuus
D - Fusarium tabacinum Fusarium stalk rot of Helianthus tuberosus
D - Gibberella fujikuroi Fusarium wilt of Helianthus annuus
D - Gibberella fujikuroi Fusarium wilt of Helianthus tuberosus
D - Myrothecium roridum Tode Myrothecium leaf and stem spot of Helianthus annuus
D - Myrothecium roridum Tode Myrothecium leaf and stem spot of Helianthus tuberosus
D - Phoma macdonaldii Boerema Phoma black stem of Helianthus annuus
D - Phoma macdonaldii Boerema Phoma black stem of Helianthus tuberosus
D - Leptosphaeria lindquistii Frezzi Phoma black stem of Helianthus annu

D - Tilletia contraversa J. G. Kühn Dwarf bunt of Triticum spp. L.
D - Tilletia controversa J. G. Kühn) Dwarf bunt of Triticum spp. L.
D - Tilletia brevifaciens G. W. Fisch. Dwarf bunt of Triticum spp. L.
D - Tilletia calospora Pass. Dwarf bunt of Triticum spp. L.
D - Tilletia contraversa var. elymi Zaprom. Dwarf bunt of Triticum spp. L.
D - Tilletia contraversa var. prostrata Lavrov Dwarf bunt of Triticum spp. L.
D - Tilletia elymicola Lavrov Dwarf bunt of Triticum spp. L.
D - Tilletia pancicii Bubák & Ranoj. Dwarf bunt of Triticum spp. L.
D - Tilletia prostrata Dwarf bunt of Triticum spp. L.
D - Claviceps purpurea Ergot of Triticum spp. L.
D - Claviceps microcephala Ergot of Triticum spp. L.
D - Claviceps purpurea var. purpurea Ergot of Triticum spp. L.
D - Claviceps sesleriae Stäger Ergot of Triticum spp. L.
D - Claviceps setulosa Ergot of Triticum spp. L.
D - Cordyceps purpurea Ergot of Triticum spp. L.
D - Cordyceps setulosa Quél. Ergot of Triticum spp. L.
D - Kentrosporium microc

In [13]:
# Check the counts per name structure
name_structures

{'A': 0, 'B': 48, 'C': 1, 'D': 6045}

# Write the file to .tsv for consumption in DOSDP tools

This works for now.  

## Next step: find duplicates and make them synonyms
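A minimal sketch of how that duplicate search might look in pandas. It assumes one possible reading of "duplicates": rows that share a disease name and host but list different pathogen names are synonym candidates, while fully identical rows are simply dropped. The column names `iri_label`, `pathogen_label`, and `host_label` follow the desired header order described earlier; the `sample` frame and the variable names are hypothetical, and the rows are taken from the printed output above with author citations stripped.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows from the printed output above.
sample = pd.DataFrame({
    "iri_label": ["Sharp eyespot", "Sharp eyespot", "Take-all", "Take-all"],
    "pathogen_label": [
        "Rhizoctonia cerealis",
        "Ceratobasidium cereale",
        "Gaeumannomyces graminis",
        "Gaeumannomyces graminis",
    ],
    "host_label": ["Avena sativa L."] * 4,
})

# Rows that exactly repeat an earlier row (every column identical).
exact_dupes = sample[sample.duplicated(keep="first")]

# After dropping exact repeats, collect the pathogen names that share
# the same disease name and host. Groups with more than one name are
# the candidates to review as synonyms (an assumption, not a rule).
grouped = (
    sample.drop_duplicates()
    .groupby(["iri_label", "host_label"])["pathogen_label"]
    .apply(list)
    .reset_index()
)
synonym_candidates = grouped[grouped["pathogen_label"].apply(len) > 1]

print(exact_dupes)
print(synonym_candidates)
```

Whether the grouped names are true synonyms (for example anamorph and teleomorph names of one fungus) or distinct pathogens causing the same disease still needs manual review; the grouping only narrows down where to look.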


In [42]:
# 2b. Re-write the .tsv file (tab-separated, empty string for missing values, no index column)
patterndf.to_csv('../patterns/disease/APS_Scrape_to_DP.tsv', sep='\t', na_rep='', header=True, index=False)
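A quick round-trip check can confirm that a file written this way reads back with the expected columns before it is handed to the DOSDP tools. The sketch below is an assumption-laden illustration, not part of the original workflow: it uses a small hypothetical frame and a temporary directory instead of `patterndf` and the real output path, and the `PO:xxx` value is only the placeholder form used earlier in this notebook.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for patterndf, with one missing value.
demo = pd.DataFrame({
    "iri_label": ["Ergot", "Take-all"],
    "plantstructure": [None, "PO:xxx"],
})

with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "demo.tsv")

    # Same style of call as the export cell: tab-separated, no index column.
    demo.to_csv(out_path, sep="\t", na_rep="", header=True, index=False)

    # Read everything back as strings and keep empty fields as "" rather than NaN.
    reloaded = pd.read_csv(out_path, sep="\t", dtype=str, keep_default_na=False)

print(list(reloaded.columns))
print(reloaded.shape)
```

Reading with `dtype=str` and `keep_default_na=False` avoids pandas reinterpreting identifiers or turning empty cells into `NaN`, which makes the comparison with what was written more direct.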