In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import imp
import sys
from sqlalchemy import create_engine

In [2]:
# Create db connection
r2_func_path = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\useful_resa2_code.py'
resa2 = imp.load_source('useful_resa2_code', r2_func_path)

engine, conn = resa2.connect_to_resa2()

# RA database

This notebook documents an initial look at the RADB following KET's e-mails received 16/03/2017 at 09.51 and 23/03/2017 at 14.25. The basic aim is to modify the database so that "summary effects" can be calculated for different taxonomic levels, rather than just for "species groups", as currently.

`ECOTOX_SUMMARY_EFFECT` shows an example of the output available based on "species groups". We want to create similar views for different levels in the taxonomic tree, as defined in the NCBI tables (see KET's e-mails for details). The starting point for this is the `ECOTOX_EXTRACT1_M` view, which contains the basic data, including the Latin name for each organism. The information in this view is then summarised into two more views (`ECOTOX_SUMMARY_EFFECT_PRIMARY` and `ECOTOX_SUMMARY_EFFECT_REVIEW`), and these are subsequently combined into `ECOTOX_SUMMARY_EFFECT`, which is what we're interested in here. 

This notebook aims to:

 1. Get a brief overview of the NCBI data and how it can be linked to `ECOTOX_EXTRACT1_M` <br><br>
 
 2. Understand how we can use the NCBI data to permit aggregation by any taxonomic rank
 
## 1. Example data from `ECOTOX_EXTRACT1_M`

In [None]:
# Get the first 100 rows from ECOTOX_EXTRACT1_M
sql = ("SELECT * FROM risk_assessment.ecotox_extract1_m "
       "WHERE rownum <= 100")
df = pd.read_sql_query(sql, engine)

df.head()

## 2. NCBI dataset

It looks as though there are two key tables: `NCBI_NAMES` and `NCBI_NODES`. The former includes the organism names, while the latter has the taxonomic ranks. It's not immediately obvious to me how to link this information to `ECOTOX_EXTRACT1_M`, but perhaps matching `LATIN_NAME` in the RADB to `NAME_TXT` in the NCBI database is a good place to start?

In [4]:
# First, get all unique latin names from `ECOTOX_EXTRACT1_M`
sql = ("SELECT UNIQUE(latin_name) "
       "FROM risk_assessment.ecotox_extract1_m")
lat_df = pd.read_sql_query(sql, engine)

print 'There are %s unique Latin names in the RADB.' % len(lat_df)

# Try matching these based on NCBI text
sql = ("SELECT * FROM risk_assessment.ncbi_names "
       "WHERE name_class = 'scientific name' "
       "AND name_txt IN ("
       "SELECT UNIQUE(latin_name) "
       "FROM risk_assessment.ecotox_extract1_m)")
ncbi_df = pd.read_sql_query(sql, engine)

print 'Of these, %s can be matched exactly in the NCBI database.' % len(ncbi_df)

There are 3781 unique Latin names in the RADB.
Of these, 2701 can be matched exactly in the NCBI database.


So, it looks as though we can link the ECOTOX data to the NCBI data based on Latin names in about 70% of cases. This obviously needs improving - **check with KET for options**.

Of these 2701 matches, how many of the 29 NCBI ranks are currently included in the RADB?

In [None]:
# Get matching NCBI ranks
sql = ("SELECT UNIQUE(rank) "
       "FROM risk_assessment.ncbi_nodes "
       "WHERE tax_id IN ("
       "SELECT tax_id FROM risk_assessment.ncbi_names "
       "WHERE name_txt IN ("
       "SELECT UNIQUE(latin_name) "
       "FROM risk_assessment.ecotox_extract1_m))")
rank_df = pd.read_sql_query(sql, engine)

rank_df

So the database currently includes entries for 15 of the 29 possible ranks.

For each organism, I need to be able to get (i) its NCBI rank and (ii) the name of every higher taxonomic level that the organism belongs to. The `NCBI_NODES` table does contain this information, but the table structure is recursive: each row has a "parent", which defines the hierarchy. It would be more useful to create a flattened version, with the ranks/taxonomic levels as columns and a row for each `tax_id`, something like this:

| tax_id | rank 1 | rank 2 | rank 3 | rank 4 |
|:------:|:------:|:------:|:------:|:------:|
|    1   |        |        |    1   |    3   |
|    2   |    1   |        |    5   |    7   |

With this structure, it should be straightforward to get taxonomic data for each organism from any level of the taxonomic tree. Generating this table looks a bit tricky, but Oracle has something called "[recursive/hierarchical queries](https://oracle-base.com/articles/misc/hierarchical-queries)", which may be useful in this context. I've never used them before, though, so I'll need to investigate further when I get back in April.

Summary so far:

 * Check with KET regarding linking ECOTOX data to the NCBI codes. Are there better options than linking `LATIN_NAME` to `NAME_TXT`? <br><br>
 
 * `NCBI_NODES` needs restructuring to flatten the recursive data into something we can use for taxonomic aggregation

**Update 05/04/2017:** Returning to this after a week away. Restructuring the NCBI table is tricky, especially because the table is quite large. I posted a question on Stack Overflow [here](http://stackoverflow.com/questions/43212421/flatten-recursive-taxonomic-table-oracle?noredirect=1#comment73499002_43212421) but have had no responses so far. As time is tight and I can't dig into the details of recursive SQL queries at present, I've decided to write my own Python code to accomplish the restructuring. The main issue is that my version is likely to be slow, so some optimisation may be necessary.

The first step is to read the entire `NCBI_NODES` table into memory and replace the `rank` text column with an integer column using `rank_ids` running from 0 to 28 (see the Word document received from KET 23/03/2017 at 14.25 for details). I can then extract just the three columns of interest - `tax_id`, `parent_tax_id` and `rank_id`. 

This table will be quite large: if all values are stored as 64 bit integers, I should end up with 1560278 rows and four columns (three of values, plus the index). This will occupy approximately

$$\frac{1560278 \times 4 \times 64}{8 \times 10^6} = 49.93 \text{ MB}$$

in memory. Although this is pretty big, I should still be able to work with it on my laptop quite easily.
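
The same arithmetic can be written as a small helper for reuse. This is only a sketch: `frame_size_mb` is a hypothetical name, and it assumes every value (including the index) is stored as an 8-byte integer or float, with 1 MB taken as $10^6$ bytes.

```python
def frame_size_mb(n_rows, n_cols, bytes_per_value=8):
    """Approximate in-memory size (in MB, i.e. 10^6 bytes) of a dense table.

    Assumes every column, including the index, uses `bytes_per_value` bytes
    per row (8 bytes for 64-bit integers or floats).
    """
    return n_rows * n_cols * bytes_per_value / 1.0e6

# tax_id, parent_tax_id and rank_id, plus the index = 4 columns
nodes_mb = frame_size_mb(1560278, 4)
```

Here `nodes_mb` comes out at roughly 49.93, matching the hand calculation above.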

In [None]:
# Read table of NCBI ranks
in_xlsx = r'C:\Data\James_Work\Staff\Knut_Eric_T\RA_Database\ncbi_ranks.xlsx'
rank_df = pd.read_excel(in_xlsx, sheetname='Sheet1')

# Read entire NCBI_NODES table
sql = ("SELECT tax_id, parent_tax_id, rank "
       "FROM risk_assessment.ncbi_nodes")
ncbi_df = pd.read_sql_query(sql, engine)

# Join rank ids
ncbi_df = ncbi_df.merge(rank_df, how='left',
                        on='rank')
del ncbi_df['rank']

print 'Total number of records:', len(ncbi_df)
print 'Size of data frame in memory: %.2f MB' % (ncbi_df.memory_usage().sum()/1.E6)

ncbi_df.head()

It's reassuring to see that the size of the data frame in memory is exactly as predicted. The hard bit is how to flatten the recursive relationship. Note also that the results array will be pretty large: 1560278 records and 31 columns (including the index) should occupy about 387 MB. 

The code below performs the flattening using Pandas. Because I'm concerned about performance, I've wrapped everything in a function to give me some options for optimisation and benchmarking later on.

In [None]:
%%time
def flatten_pandas(ncbi_df):
    """ Use Pandas to flatten the recursive NCBI table.
    """
    # Pre-allocate results array in memory
    data = np.zeros((len(ncbi_df), 30))*np.nan
    tax_df = pd.DataFrame(data=data, columns=['tax_id']+range(1,29)+[0])

    # Loop over tax_id
    for idx, row in ncbi_df[:1000].iterrows():
        # Add tax_id to output
        tax_df.ix[idx, 'tax_id'] = row.tax_id

        # Add tax_id to the appropriate level
        tax_df.ix[idx, row.rank_id] = row.tax_id

        # Walk tree
        par = row.parent_tax_id    
        while par != 1: # parent = 1 is the top level of the tree
            # Get the rank of the parent
            par_rank = ncbi_df[ncbi_df['tax_id']==par].rank_id.iloc[0]

            # Enter the parent at the appropriate rank
            tax_df.ix[idx, par_rank] = par

            # Get the next parent
            par = ncbi_df[ncbi_df['tax_id']==par].parent_tax_id.iloc[0]

    return tax_df

# Flatten data
tax_df = flatten_pandas(ncbi_df)
print tax_df.head()

This code works but, as I feared, it's very inefficient. The test example above takes 7.5 minutes to process the first 1000 rows, so the full table (roughly 1500 times larger) will take around $1500*7.5$ minutes, which is more than a week! Clearly this isn't good enough, so time for some optimisation...
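
For reference, here is the same extrapolation using the exact row count instead of the rounded factor of 1500. It is a rough sketch that assumes the run time scales linearly with the number of rows, which may not hold exactly.

```python
n_rows = 1560278            # total rows in NCBI_NODES
minutes_per_1000 = 7.5      # measured on the first 1000 rows

# Linear extrapolation to the full table
total_minutes = n_rows / 1000.0 * minutes_per_1000
total_days = total_minutes / (60 * 24)
```

This gives a little over 8 days, consistent with the "more than a week" estimate.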

Using Pandas makes for much cleaner/easier code, but cutting away this convenience and dropping down to more basic Numpy should be faster. As a first step, I'll convert `ncbi_df` to a `dict`. This will take up more memory, but should hopefully provide fast data access based on `tax_id`.

In [None]:
# First convert ncbi_df to a dict
ncbi_df.index = ncbi_df['tax_id']
del ncbi_df['tax_id']

ncbi_dict = ncbi_df.to_dict(orient='index')
print 'Size of ncbi_dict: %.2f MB.' % (sys.getsizeof(ncbi_dict)/1.E6)

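So `sys.getsizeof` reports roughly twice as much memory as before (and this figure only covers the outer dict, not the nested per-row dicts, so the true footprint is larger), but it's still manageable. The code below removes all the Pandas indexing sophistication and uses basic Numpy instead.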
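
Because `sys.getsizeof` is shallow, a fuller estimate of the dict's footprint needs to walk the nested structure. Below is a minimal sketch of one way to do this. `deep_sizeof` is a hypothetical helper (not part of the standard library), and `toy_dict` is a made-up stand-in with the same shape as `ncbi_dict`.

```python
import sys

def deep_sizeof(obj, seen=None):
    """Recursively sum sys.getsizeof over nested dicts, lists, tuples and sets.

    Objects are tracked by id() so that shared objects are only counted once.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_sizeof(key, seen) + deep_sizeof(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_sizeof(item, seen)
    return size

# Toy stand-in: {tax_id: {'parent_tax_id': ..., 'rank_id': ...}}
toy_dict = {i: {'parent_tax_id': max(i // 2, 1), 'rank_id': i % 29}
            for i in range(1, 1001)}

shallow_bytes = sys.getsizeof(toy_dict)   # outer container only
deep_bytes = deep_sizeof(toy_dict)        # container plus nested contents
```

On the toy data `deep_bytes` is several times larger than `shallow_bytes`, so the figure printed above should be read as a lower bound.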

In [None]:
%%time
def flatten_numpy(ncbi_dict):
    """ Use Numpy to flatten the recursive NCBI table.
    """
    # Pre-allocate results array in memory
    data = np.zeros((len(ncbi_dict.keys()), 30))*np.nan

    # Loop over tax_id
    for idx, tax_id in enumerate(ncbi_dict.keys()[:1000]):
        # Add tax_id to output
        data[idx, 0] = tax_id

        # Add tax_id to the appropriate level
        rank_id = ncbi_dict[tax_id]['rank_id']
        if rank_id == 0:
            rank_id = 29
        data[idx, rank_id] = tax_id

        # Walk tree
        par = ncbi_dict[tax_id]['parent_tax_id']  
        while par != 1: # parent = 1 is the top level of the tree
            # Get the rank of the parent
            par_rank = ncbi_dict[par]['rank_id']

            # Enter the parent at the appropriate rank
            if par_rank == 0:
                par_rank = 29
            data[idx, par_rank] = par

            # Get the next parent
            par = ncbi_dict[par]['parent_tax_id']

    # Build df
    tax_df = pd.DataFrame(data=data, columns=['tax_id']+range(1,29)+[0])
    
    return tax_df

# Flatten data
tax_df = flatten_numpy(ncbi_dict)
print tax_df.head()

Wow - this is a remarkable speedup! For the first 1000 rows we're down from 7.5 minutes to 0.4 seconds, a factor of more than 1000! How about the full dataset?

In [None]:
%%time
def flatten_numpy(ncbi_dict):
    """ Use Numpy to flatten the recursive NCBI table.
    """
    # Pre-allocate results array in memory
    data = np.zeros((len(ncbi_dict.keys()), 30))*np.nan

    # Loop over tax_id
    for idx, tax_id in enumerate(ncbi_dict.keys()):
        # Add tax_id to output
        data[idx, 0] = tax_id

        # Add tax_id to the appropriate level
        rank_id = ncbi_dict[tax_id]['rank_id']
        if rank_id == 0:
            rank_id = 29
        data[idx, rank_id] = tax_id

        # Walk tree
        par = ncbi_dict[tax_id]['parent_tax_id']  
        while par != 1: # parent = 1 is the top level of the tree
            # Get the rank of the parent
            par_rank = ncbi_dict[par]['rank_id']

            # Enter the parent at the appropriate rank
            if par_rank == 0:
                par_rank = 29
            data[idx, par_rank] = par

            # Get the next parent
            par = ncbi_dict[par]['parent_tax_id']

    # Build df
    tax_df = pd.DataFrame(data=data, columns=['tax_id']+range(1,29)+[0])
    
    return tax_df

# Flatten data
tax_df = flatten_numpy(ncbi_dict)
print tax_df.head()

25 seconds for the full thing is pretty good, and certainly good enough for this application. However, for future reference, note that additional performance increases may be possible using [Numba](http://numba.pydata.org/). Simply use the following import:

    from numba import jit
    
and then add the `@jit` decorator to the function definition:

    @jit
    def flatten_numpy(ncbi_dict):
        # code here

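Numba compiles the decorated function to machine code and can give further speed improvements, although functions built around plain Python dicts (like this one) may need restructuring to use Numpy arrays before they benefit. If you need to go even faster, these operations could be parallelised by splitting `ncbi_df` into chunks and using something like this: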

    from multiprocessing import Pool

    @jit # Optional Numba optimisation
    def flatten_numpy(ncbi_df):
        # code here

    if __name__ == '__main__':
        p = Pool(8) # Number of cores to use
        p.map(flatten_numpy, [ncbi_df1, ncbi_df2, ..., ncbi_df8])
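
One caveat with chunking: each worker still needs the complete lookup table, because a parent chain can lead to `tax_id`s outside the worker's own chunk. A simple way round this is to split the list of `tax_id`s rather than the table itself. The sketch below illustrates the idea on made-up toy data, using the built-in `map` in place of `Pool.map` (untested with a real pool here). `flatten_keys` and `split_into_chunks` are hypothetical helpers, and the column layout follows `flatten_numpy` above.

```python
from functools import partial

N_COLS = 30  # tax_id, ranks 1..28, then rank 0 in the last column

def flatten_keys(keys, lookup):
    """Flatten a subset of tax_ids; `lookup` must contain the whole tree."""
    rows = []
    for tax_id in keys:
        row = [None] * N_COLS
        row[0] = tax_id
        node = tax_id
        while True:
            rank_id = lookup[node]['rank_id'] or 29  # rank 0 -> column 29
            row[rank_id] = node
            parent = lookup[node]['parent_tax_id']
            if parent == 1:  # parent = 1 is the top level of the tree
                break
            node = parent
        rows.append(row)
    return rows

def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks consecutive pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

# Made-up toy tree (not real NCBI data)
toy_lookup = {
    1: {'parent_tax_id': 1, 'rank_id': 0},
    2: {'parent_tax_id': 1, 'rank_id': 1},
    3: {'parent_tax_id': 2, 'rank_id': 2},
    4: {'parent_tax_id': 3, 'rank_id': 4},
    5: {'parent_tax_id': 2, 'rank_id': 4},
}
keys = sorted(toy_lookup)

# Bind the full lookup so each call only receives its chunk of keys
worker = partial(flatten_keys, lookup=toy_lookup)

serial = worker(keys)
chunked = [row for part in map(worker, split_into_chunks(keys, 2))
           for row in part]
```

The chunked result is identical to the serial one, so in principle `map` could be swapped for `Pool.map` at the cost of sending a copy of the lookup table to each process.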
        
The final step is to write the output back to the RA database for inspection.

In [None]:
# Write results to RAdb
tax_df.to_sql('ncbi_nodes_flat', 
              schema='risk_assessment',
              con=engine, index=False)

This seems to have worked, and some basic manual checking using [this website](https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi) gives compatible results. Note, however, that in some cases the `NCBI_NODES` table has multiple parents with rank 0. As an example:

    tax_id 2499 (rank 4) > tax_id 36549 (rank 0) > tax_id 28384 (rank 0)
    
My code only allows a single `tax_id` for each rank, and in situations like this it is the highest taxonomic level that gets assigned i.e. my code returns the tree:

    tax_id 2499 (rank 4) > tax_id 28384 (rank 0)
    
(missing out 36549), because 36549 and 28384 both have rank 0, but 28384 is at a higher taxonomic level. This seems reasonable to me, but **check with KET**.
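
The overwriting behaviour can be reproduced on a three-node toy tree built from the example above. This is just an illustrative sketch: `lineage_by_rank` is a hypothetical helper, and placing `tax_id` 28384 directly below the root is an assumption made to keep the toy tree small.

```python
toy_nodes = {
    2499:  {'parent_tax_id': 36549, 'rank_id': 4},
    36549: {'parent_tax_id': 28384, 'rank_id': 0},
    28384: {'parent_tax_id': 1,     'rank_id': 0},  # assumed to sit below the root
}

def lineage_by_rank(tax_id, nodes):
    """Walk up the tree and return ({rank_id: tax_id}, overwritten tax_ids).

    As in flatten_numpy, each rank holds a single tax_id, so a higher
    ancestor with the same rank replaces a lower one.
    """
    by_rank = {nodes[tax_id]['rank_id']: tax_id}
    overwritten = []
    par = nodes[tax_id]['parent_tax_id']
    while par != 1:  # parent = 1 is the top level of the tree
        rank = nodes[par]['rank_id']
        if rank in by_rank:
            overwritten.append(by_rank[rank])
        by_rank[rank] = par
        par = nodes[par]['parent_tax_id']
    return by_rank, overwritten

by_rank, overwritten = lineage_by_rank(2499, toy_nodes)
```

Here `by_rank` is `{4: 2499, 0: 28384}` and `overwritten` is `[36549]`. Recording the overwritten ids like this would be one way to quantify how often the issue occurs in the full table.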

## 3. Updating views

The flattened NCBI data provides full taxonomic information for each `tax_id`. In principle, this should make it possible to perform aggregation at any taxonomic level. 

**NB:** Back in November 2016, Knut Erik asked me to modify some existing views in the database, such that grouping took place based on the `LATIN_NAME` column rather than the `SPECIES_GROUP`. In the end, I created a new view called `ECOTOX_SUMMARY_EFFECT_SPECIES` (see e-mails around 14/11/2016 for details).

Using the newly formatted NCBI data, it should be possible to generate a similar table by grouping based on `NCBI_RANK=4`, which corresponds to species. We can then check to see how aggregation via NCBI ranks compares to aggregation using Latin names, which should provide a useful test. To explore this, I've used the following workflow in the database:

 1. Created two new views, `ECOTOX_EXTRACT1_M_NCBI` and `ECOTOX_LIKE_EXTRACT1_NCBI`. These are identical to `ECOTOX_EXTRACT1_M` and `ECOTOX_LIKE_EXTRACT1`, except I've joined-in the NCBI ranks by matching NCBI `NAME_TXT` to RAdb `LATIN_NAME`. As noted above, not all the names can be matched, so this **needs further checking**. <br><br>
 
 2. Created two new views called `ECOTOX_SUM_EFF_PRI_SPEC` and `ECOTOX_SUM_EFF_REV_SPEC`, which are the same as `ECOTOX_SUMMARY_EFFECT_PRIMARY` and `ECOTOX_SUMMARY_EFFECT_REVIEW`, except the grouping is performed on `NCBI_RANK=4` (species) instead of on `LATIN_NAME` or `SPECIES_GROUP` as we have done previously. <br><br>
 
 3. Created a new view called `ECOTOX_SUM_EFF_SPEC`, which combines the results from `ECOTOX_SUM_EFF_PRI_SPEC` and `ECOTOX_SUM_EFF_REV_SPEC` (exactly analogous to the way `ECOTOX_SUMMARY_EFFECT` combines output from `ECOTOX_SUMMARY_EFFECT_PRIMARY` and `ECOTOX_SUMMARY_EFFECT_REVIEW`).
 
In summary, `ECOTOX_SUM_EFF_SPEC` is based on the NCBI data aggregated at `RANK=4`, whereas the table I created back in November (called `ECOTOX_SUMMARY_EFFECT_SPECIES`) is based on aggregation using the RAdb `LATIN_NAME`. These two tables are similar, but not the same, so **they need checking carefully to make sure the values based on the NCBI data look reasonable**.

The following issues need further investigation:

 * Not all Latin names in our database have exact matches in the NCBI database.  <br><br>
 
 * In a few cases, the Latin name specified in the RAdb matches more than one `tax_id` in `NCBI_NAMES`. This results in some duplication, i.e. the same Latin name gets assigned to more than one level in the taxonomic tree. This may lead to contradictory results.
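
One way to list names that map to more than one `tax_id` is to count the distinct ids per name. The sketch below uses a small hand-made data frame standing in for `NCBI_NAMES` (the rows are illustrative, not a real extract), so the column names are the only thing taken from the database.

```python
import pandas as pd

# Toy stand-in for NCBI_NAMES
names = pd.DataFrame({
    'tax_id':   [4450, 459300, 13578, 13578],
    'name_txt': ['Sagittaria', 'Sagittaria', 'Juncus', 'Juncus'],
})

# Number of distinct tax_ids associated with each name
n_ids = names.groupby('name_txt')['tax_id'].nunique()

# Names that are ambiguous, i.e. linked to more than one tax_id
ambiguous = sorted(n_ids[n_ids > 1].index)
```

In this toy case only `Sagittaria` is flagged, because the two `Juncus` rows share a single `tax_id`.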
 
## 4. Matching NCBI names to the RAdb

As described in section 2, above, not all the Latin names in the RAdb have matches in the NCBI database. KET has asked me to identify names that don't match so we can follow them up.

The code below is modified from section 2.

In [15]:
# Match LATIN_NAME to NAME_TXT
sql = ("SELECT unique(a.latin_name), b.name_txt "
       "FROM risk_assessment.ecotox_extract1_m a "
       "LEFT JOIN risk_assessment.ncbi_names b "
       "ON a.latin_name = b.name_txt")
df = pd.read_sql_query(sql, engine)

# Write output
out_path = r'C:\Data\James_Work\Staff\Knut_Eric_T\RA_Database\latin_names_match_ncbi_txt.csv'
df.to_csv(out_path)

## 5. Matching NCBI part 2

Karina has been through the output generated above and has added some additional taxonomic data. The next step is to see how many of the RAdb names can now be matched.

Firstly, the brief check in section 2 does not find *all* the names in the RADB, because it ignores the entries in `ECOTOX_LIKE_EXTRACT1`. The code below is more complete.

In [3]:
# Match LATIN_NAME to NAME_TXT
sql = ("SELECT latin_name, tax_id, name_txt "
       "FROM (SELECT UNIQUE(latin_name) AS latin_name "
             "FROM (SELECT latin_name FROM risk_assessment.ecotox_extract1_m "
                   "UNION "
                   "SELECT latin_name FROM risk_assessment.ecotox_like_extract1)) a "
       "LEFT OUTER JOIN (SELECT * FROM RISK_ASSESSMENT.NCBI_NAMES) b "
       "ON b.NAME_TXT = a.LATIN_NAME")
df = pd.read_sql_query(sql, engine)

print len(df)
df2 = df.dropna()
print len(df2)

# Write output
out_path = r'C:\Data\James_Work\Staff\Knut_Eric_T\RA_Database\latin_names_match_ncbi_txt.csv'
df.to_csv(out_path)

3865
2728


So there are actually 3865 names in the RADB, of which 2728 (71%) can be matched using NCBI. I have updated the files `latin_names_match_ncbi_txt.csv` and `latin_names_match_ncbi_txt.xlsx` to reflect this.
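
As an alternative to counting rows before and after `dropna()`, a left merge with `indicator=True` labels each row explicitly as matched or unmatched. The sketch below uses tiny made-up data frames (`radb` and `ncbi` are hypothetical stand-ins for the two tables) and also repeats the percentage calculation for the counts reported above.

```python
import pandas as pd

# Toy stand-ins for the RAdb names and NCBI_NAMES
radb = pd.DataFrame({'latin_name': ['Juncus', 'Melita sp.', 'Tilapia natalensis']})
ncbi = pd.DataFrame({'tax_id': [13578], 'name_txt': ['Juncus']})

# The _merge column is 'both' for matched rows and 'left_only' otherwise
matched = radb.merge(ncbi, how='left', left_on='latin_name',
                     right_on='name_txt', indicator=True)
n_matched = int((matched['_merge'] == 'both').sum())
pct_matched = 100.0 * n_matched / len(radb)

# Percentage for the counts reported above
pct_reported = round(100.0 * 2728 / 3865, 1)
```

For the reported counts this gives 70.6%, which rounds to the 71% quoted above.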

### 5.1. Link "sp." to genus

Karina noted that missing matches are often caused by "sp." in the Latin names in the RADB. In an e-mail dated 26/04/2017 at 09.07, Knut Erik states that these should all be considered as "genus". 

Using this information, how many of the missing entries can now be matched?

The first step is to get a list of all the NCBI names linked to their taxonomic level.

In [4]:
# Get ncbi_names and rank
sql = ("SELECT a.tax_id, name_txt, rank "
       "FROM risk_assessment.ncbi_names a "
       "LEFT OUTER JOIN risk_assessment.ncbi_nodes b "
       "ON a.tax_id = b.tax_id")
ncbi_df = pd.read_sql_query(sql, engine)

print 'Number of NCBI records:', len(ncbi_df)

ncbi_df.head()

Number of NCBI records: 2322907


Unnamed: 0,tax_id,name_txt,rank
0,63,Vitreoscilla filiformis (ex Pringsheim 1951) S...,species
1,63,strain L1401-2,species
2,64,Herpetosiphon,genus
3,64,Herpetosiphon Holt and Lewin 1968,genus
4,65,ATCC 23779,species


Next, split the list of Latin names in the RAdb into those that have already been matched successfully and those that haven't.

In [5]:
# Split data into names matched successfully and names not matched
df_match = df.dropna()
df_nomat = df.query('tax_id != tax_id')

# Remove blank cols
del df_nomat['tax_id'], df_nomat['name_txt']

print 'Number of unmatched entries:', len(df_nomat)

df_nomat.head()

Number of unmatched entries: 1137


Unnamed: 0,latin_name
2728,
2729,Stenocypris sp.
2730,Ephemerella infrequens
2731,Tilapia natalensis
2732,Melita sp.


So, there are 1137 entries that we need to find matches for. As a first step, we'll split the unmatched Latin names in the RAdb into two groups: those that end in `sp.` (or `Sp.`), and those ending in something else.

In [6]:
# Split latin name col at first word
df_nomat['first'], df_nomat['last'] = df_nomat['latin_name'].str.split(' ', 1).str

df_nomat.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


Unnamed: 0,latin_name,first,last
2728,,,
2729,Stenocypris sp.,Stenocypris,sp.
2730,Ephemerella infrequens,Ephemerella,infrequens
2731,Tilapia natalensis,Tilapia,natalensis
2732,Melita sp.,Melita,sp.
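
The "value is trying to be set on a copy" warning appears because `df_nomat` is a slice of `df`. A possible way to avoid it is to take an explicit copy before adding columns, and to split the names with `expand=True`. The sketch below runs on made-up rows, uses `isnull()` in place of the `tax_id != tax_id` query, and is not a drop-in replacement for the cells above.

```python
import pandas as pd

# Toy stand-in for the matched/unmatched data frame
df = pd.DataFrame({
    'latin_name': ['Juncus', 'Stenocypris sp.', 'Chlorella Sp.',
                   'Tilapia natalensis'],
    'tax_id': [13578.0, None, None, None],
})

# .copy() makes df_nomat independent of df, so new columns can be added safely
df_nomat = df[df['tax_id'].isnull()].copy()

# Split at the first space only: column 0 = first word, column 1 = the rest
parts = df_nomat['latin_name'].str.split(' ', n=1, expand=True)
df_nomat['first'] = parts[0]
df_nomat['last'] = parts[1]

# Case-insensitive test covers both "sp." and "Sp."
is_sp = df_nomat['last'].str.lower() == 'sp.'
df_sp = df_nomat[is_sp].copy()
```

Lower-casing the suffix before comparing removes the need to test for `sp.` and `Sp.` separately.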


In [7]:
# Get entries ending in "sp."
df_sp = df_nomat.query('(last=="Sp.") or (last=="sp.")')

print 'There are %s unmatched entries ending in "sp." or "Sp."' % len(df_sp)

df_sp.head()

There are 475 unmatched entries ending in "sp." or "Sp."


Unnamed: 0,latin_name,first,last
2729,Stenocypris sp.,Stenocypris,sp.
2732,Melita sp.,Melita,sp.
2733,Agraylea sp.,Agraylea,sp.
2736,Juncus sp.,Juncus,sp.
2737,Aphanocapsa sp.,Aphanocapsa,sp.


Of these 475 unmatched entries, how many can be matched using the first part of the Latin name and assuming the "rank" is equal to "genus" (as specified in KET's e-mail)?

In [8]:
# Add "rank" col for all "sp." entries equal to "genus"
df_sp['rank'] = 'genus'

# Join
df_sp2 = pd.merge(df_sp, ncbi_df, how='left',
                  left_on=['first', 'rank'],
                  right_on=['name_txt', 'rank'])

print len(df_sp2)

df_sp2.head()

493


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


Unnamed: 0,latin_name,first,last,rank,tax_id,name_txt
0,Stenocypris sp.,Stenocypris,sp.,genus,27410.0,Stenocypris
1,Melita sp.,Melita,sp.,genus,315597.0,Melita
2,Agraylea sp.,Agraylea,sp.,genus,997594.0,Agraylea
3,Juncus sp.,Juncus,sp.,genus,13578.0,Juncus
4,Aphanocapsa sp.,Aphanocapsa,sp.,genus,1119.0,Aphanocapsa


This data frame is slightly longer than the original (493 entries versus 475), which implies some duplicated matches. Here's a list of the duplicates.

In [9]:
# Identify duplicates
dup_df = df_sp2[df_sp2.duplicated(['latin_name'], keep=False)]

dup_df.head(10)

Unnamed: 0,latin_name,first,last,rank,tax_id,name_txt
56,Sagittaria sp.,Sagittaria,sp.,genus,4450.0,Sagittaria
57,Sagittaria sp.,Sagittaria,sp.,genus,459300.0,Sagittaria
78,Chlorella Sp.,Chlorella,Sp.,genus,3071.0,Chlorella
79,Chlorella Sp.,Chlorella,Sp.,genus,114049.0,Chlorella
80,Chlorella Sp.,Chlorella,Sp.,genus,114055.0,Chlorella
81,Chlorella Sp.,Chlorella,Sp.,genus,175242.0,Chlorella
105,Uronema sp.,Uronema,sp.,genus,35106.0,Uronema
106,Uronema sp.,Uronema,sp.,genus,104535.0,Uronema
136,Stagnicola sp.,Stagnicola,sp.,genus,55693.0,Stagnicola
137,Stagnicola sp.,Stagnicola,sp.,genus,182412.0,Stagnicola
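
The extra rows arise because a left merge returns one row for every matching key on the right-hand side. The toy example below (hand-made rows, not a real extract) shows this, together with the difference between `duplicated(keep=False)`, which flags every row of a duplicated name, and the default `duplicated()`, which flags only the second and later occurrences.

```python
import pandas as pd

left = pd.DataFrame({'latin_name': ['Sagittaria sp.', 'Juncus sp.'],
                     'first': ['Sagittaria', 'Juncus']})
right = pd.DataFrame({'tax_id': [4450, 459300, 13578],
                      'name_txt': ['Sagittaria', 'Sagittaria', 'Juncus']})

# Two NCBI rows share the name "Sagittaria", so that left row is repeated
merged = left.merge(right, how='left', left_on='first', right_on='name_txt')

# keep=False flags all rows belonging to a duplicated name
all_dups = merged[merged.duplicated(['latin_name'], keep=False)]

# The default (keep='first') flags only the later occurrences
later_dups = merged[merged.duplicated(['latin_name'])]
```

Two input names become three merged rows. `all_dups` holds both `Sagittaria` rows, while `later_dups` holds only the second one.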


To get a better idea of what's going on, we can check the original NCBI data for the first two examples in the table above (`Sagittaria` and `Chlorella`).

In [10]:
# Get info for Sagittaria and Chlorella
sql = ("SELECT * FROM risk_assessment.ncbi_names "
       "WHERE tax_id IN (4450, 459300, 3071, 114049, 114055, 175242) "
       "AND name_class = 'scientific name' "
       "ORDER BY name_txt")
tmp_df = pd.read_sql_query(sql, engine)

tmp_df

Unnamed: 0,tax_id,name_txt,unique_name,name_class
0,114049,Chlorella,Chlorella <unclassified Chlorophyceae>,scientific name
1,175242,Chlorella,Chlorella <Prasiolales>,scientific name
2,3071,Chlorella,Chlorella <Chlorellales>,scientific name
3,114055,Chlorella,Chlorella <unclassified Trebouxiophyceae>,scientific name
4,459300,Sagittaria,Sagittaria <Ciliata>,scientific name
5,4450,Sagittaria,Sagittaria <water-plantain>,scientific name


The problem is now clear: there are several genera in NCBI with the same name. Unfortunately, using the information available at present, I have no way of knowing which of the two `Sagittaria` entries above corresponds to `Sagittaria sp.` in the RAdb, or which of the four `Chlorella` options corresponds to `Chlorella sp.` etc.

For future reference, here's a list of the 16 duplicated names that need manually checking.

In [11]:
# Get a list of all the Latin names that need manually checking
df_sp2[df_sp2.duplicated(['latin_name'])][['latin_name']].drop_duplicates()

Unnamed: 0,latin_name
57,Sagittaria sp.
79,Chlorella Sp.
106,Uronema sp.
137,Stagnicola sp.
176,Nais sp.
212,Cynodon sp.
218,Pontogeneia sp.
222,Hydrophilus sp.
234,Iris sp.
245,Alaria sp.


Ignoring these issues for now, how many of the 475 "sp." names have we actually managed to match?

In [12]:
# Ignore duplication issue by keeping only the first match for each name (an arbitrary choice)
df_spmatch = df_sp2.drop_duplicates(['latin_name'])
assert len(df_spmatch) == len(df_sp)

# How many are still unmatched?
df_spnomat = df_spmatch.query('tax_id != tax_id')

print 'There are still %s names ending in "sp." that cannot be matched at genus level:' % len(df_spnomat)

df_spnomat

There are still 32 names ending in "sp." that cannot be matched at genus level:


Unnamed: 0,latin_name,first,last,rank,tax_id,name_txt
19,Jaeropsis sp.,Jaeropsis,sp.,genus,,
22,Planaria sp.,Planaria,sp.,genus,,
54,Echinosphaerium sp.,Echinosphaerium,sp.,genus,,
60,Isogenus sp.,Isogenus,sp.,genus,,
87,Adenorhynchus sp.,Adenorhynchus,sp.,genus,,
144,Paraponyx sp.,Paraponyx,sp.,genus,,
148,Neoniphargus sp.,Neoniphargus,sp.,genus,,
150,Texadina sp.,Texadina,sp.,genus,,
159,Australorbis sp.,Australorbis,sp.,genus,,
164,Ammania sp.,Ammania,sp.,genus,,


#### Summary for names ending in "sp."

There are 475 unmatched names in the RAdb ending in `sp.` or `Sp.` Of these, **32 have no match in the NCBI data at genus level** (see table above), while an **additional 16 have multiple matches** (see the `Sagittaria` and `Chlorella` examples, also above). These issues will need resolving manually before we can progress further.

### 5.2. Non-"sp." names

What about names not ending in "sp."? There were 1137 entries that could not be matched and, of these, 475 ended in `sp.`. This leaves 662 entries to be matched by another method.

In [13]:
df_nosp = df_nomat.query('(last!="Sp.") and (last!="sp.")')
print len(df_nosp)

df_nosp.head()

662


Unnamed: 0,latin_name,first,last
2728,,,
2730,Ephemerella infrequens,Ephemerella,infrequens
2731,Tilapia natalensis,Tilapia,natalensis
2734,Clibanarius olivaceus,Clibanarius,olivaceus
2735,Peloscolex gabriellae,Peloscolex,gabriellae


Karina has provided some additional information to help with this matching, but unfortunately it's still not possible to uniquely identify all the Latin names in the RAdb. As an example, take the second row of the table above, which is listed as `Ephemerella infrequens` in the RAdb. In Karina's spreadsheet, she has noted that this is a species level designation.

Let's compare this to the data in the NCBI tables.

In [14]:
# Search NCBI for Ephemerella
ncbi_df[ncbi_df['name_txt'].str.startswith('Ephemerella')]

Unnamed: 0,tax_id,name_txt,rank
113627,50634,Ephemerella,genus
113628,50634,"Ephemerella Walsh, 1862",genus
113629,50635,Ephemerella sp. MFW-1996,species
390148,219339,Ephemerella invaria,species
390149,219339,"Ephemerella invaria (Walker, 1853)",species
412876,248227,Ephemerella dorothea,species
412877,248227,"Ephemerella dorothea Needham, 1908",species
412878,248228,Ephemerella subvaria,species
412879,248228,"Ephemerella subvaria McDunnough, 1931",species
508434,309540,Ephemerella sp. EP008,species


So, there are 106 Ephemerella-like names in NCBI, but none of them match exactly. The best I can do is to go up a level (to genus) and match `Ephemerella infrequens` (the species) to `Ephemerella` (the genus). This is probably reasonable in this instance, but this approach results in multiple matches when applied more generally. 

The issues are identical to those noted above for the names ending in "sp.": essentially, we try to match the first part of the name (`Ephemerella` in the example here) to the NCBI data, subject to the constraint that `rank = genus`. If we try this with the remaining 662 unmatched names, we get a number of duplicates.

In [15]:
# Add "rank" col equal to "genus"
df_nosp['rank'] = 'genus'

# Join
df_nosp2 = pd.merge(df_nosp, ncbi_df, how='left',
                    left_on=['first', 'rank'],
                    right_on=['name_txt', 'rank'])

print len(df_nosp2)

698


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [16]:
# Identify duplicates
dup_df = df_nosp2[df_nosp2.duplicated(['latin_name'], keep=False)]

dup_df.head(10)

Unnamed: 0,latin_name,first,last,rank,tax_id,name_txt
1,Ephemerella infrequens,Ephemerella,infrequens,genus,50634.0,Ephemerella
2,Ephemerella infrequens,Ephemerella,infrequens,genus,998446.0,Ephemerella
6,Nitzschia accomodata,Nitzschia,accomodata,genus,2857.0,Nitzschia
7,Nitzschia accomodata,Nitzschia,accomodata,genus,651811.0,Nitzschia
25,Chlorella ovalis,Chlorella,ovalis,genus,3071.0,Chlorella
26,Chlorella ovalis,Chlorella,ovalis,genus,114049.0,Chlorella
27,Chlorella ovalis,Chlorella,ovalis,genus,114055.0,Chlorella
28,Chlorella ovalis,Chlorella,ovalis,genus,175242.0,Chlorella
71,Nitzschia gracilis,Nitzschia,gracilis,genus,2857.0,Nitzschia
72,Nitzschia gracilis,Nitzschia,gracilis,genus,651811.0,Nitzschia


In the example above, we see that the NCBI database contains e.g. 4 different genera named "Chlorella", which are matched multiple times to `Chlorella` in the RAdb. To move forward, we need to know which of the NCBI Chlorella genera actually match Chlorella in the RAdb.

The following entries in the RAdb need manually matching.

In [17]:
# Get a list of all the Latin names that need manually checking
df_nosp2[df_nosp2.duplicated(['latin_name'])][['latin_name']].drop_duplicates()

Unnamed: 0,latin_name
2,Ephemerella infrequens
7,Nitzschia accomodata
26,Chlorella ovalis
72,Nitzschia gracilis
81,Dugesia lugubris
94,Chlorella fusca var-fusca
161,Nitzschia angularis var. affinis
186,Ephemerella lata
310,Chlorella fusca vaculata
336,Nitzschia perminuta
