In [1]:
%matplotlib inline
import pandas as pd
import imp
from sqlalchemy import create_engine

In [2]:
# Create db connection
r2_func_path = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\useful_resa2_code.py'
resa2 = imp.load_source('useful_resa2_code', r2_func_path)

engine, conn = resa2.connect_to_resa2()

# RA database

This notebook documents an initial look at the RADB following KET's e-mails received 16/03/2017 at 09.51 and 23/03/2017 at 14.25. The basic aim is to modify the database so that "summary effects" can be calculated for different taxonomic levels, rather than only for "species groups", as is currently the case.

`ECOTOX_SUMMARY_EFFECT` shows an example of the output available based on "species groups". We want to create similar views for different levels in the taxonomic tree, as defined in the NCBI tables (see KET's e-mails for details). The starting point for this is the `ECOTOX_EXTRACT1_M` view, which contains the basic data, including the Latin name for each organism. The information in this view is then summarised into two more views (`ECOTOX_SUMMARY_EFFECT_PRIMARY` and `ECOTOX_SUMMARY_EFFECT_REVIEW`), and these are subsequently combined into `ECOTOX_SUMMARY_EFFECT`, which is what we're interested in here. 

This notebook aims to:

 1. Get a brief overview of the NCBI data and how it can be linked to `ECOTOX_EXTRACT1_M` <br><br>
 
 2. Understand how we can use the NCBI data to permit aggregation by any taxonomic rank
 
## 1. Example data from `ECOTOX_EXTRACT1_M`

In [3]:
# Get the first 100 rows from ECOTOX_EXTRACT1_M
sql = ("SELECT * FROM risk_assessment.ecotox_extract1_m "
       "WHERE rownum <= 100")
df = pd.read_sql_query(sql, engine)

df.head()

| | result_id | test_id | chemical_id | chemical_selection | cas_number_dash | chemical_name | organism_habitat | ecotox_group | species_group | common_name | ... | significance_code | significance_type | significance_level_mean_op | significance_level_mean | reference_number | author | title | source | publication_year | invalid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2150084 | 2082355 | 350 | Solution Organics | 121-75-5 | Malathion | Water | Fish,Standard Test Species | Fish | Channel Catfish | ... | ANOSIG | P | < | 0.05 | 822 | Areechon,N. | Acute and Subchronic Toxicity of Malathion in ... | Ph.D.Thesis, Auburn University, Auburn, AL:138 p. | 1987 | |
| 1 | 2150011 | 2082355 | 350 | Solution Organics | 121-75-5 | Malathion | Water | Fish,Standard Test Species | Fish | Channel Catfish | ... | NOSIG | P | < | 0.05 | 822 | Areechon,N. | Acute and Subchronic Toxicity of Malathion in ... | Ph.D.Thesis, Auburn University, Auburn, AL:138 p. | 1987 | |
| 2 | 2150013 | 2082355 | 350 | Solution Organics | 121-75-5 | Malathion | Water | Fish,Standard Test Species | Fish | Channel Catfish | ... | ANOSIG | P | < | 0.05 | 822 | Areechon,N. | Acute and Subchronic Toxicity of Malathion in ... | Ph.D.Thesis, Auburn University, Auburn, AL:138 p. | 1987 | |
| 3 | 2150032 | 2082355 | 350 | Solution Organics | 121-75-5 | Malathion | Water | Fish,Standard Test Species | Fish | Channel Catfish | ... | NOSIG | P | < | 0.05 | 822 | Areechon,N. | Acute and Subchronic Toxicity of Malathion in ... | Ph.D.Thesis, Auburn University, Auburn, AL:138 p. | 1987 | |
| 4 | 2150085 | 2082355 | 350 | Solution Organics | 121-75-5 | Malathion | Water | Fish,Standard Test Species | Fish | Channel Catfish | ... | ANOSIG | P | < | 0.05 | 822 | Areechon,N. | Acute and Subchronic Toxicity of Malathion in ... | Ph.D.Thesis, Auburn University, Auburn, AL:138 p. | 1987 | |


## 2. NCBI dataset

It looks as though there are two key tables: `NCBI_NAMES` and `NCBI_NODES`. The former includes the organism names, while the latter has the taxonomic ranks. It's not immediately obvious to me how to link this information to `ECOTOX_EXTRACT1_M`, but perhaps matching `LATIN_NAME` in the RADB to `NAME_TXT` in the NCBI database is a good place to start?

In [4]:
# First, get all unique latin names from `ECOTOX_EXTRACT1_M`
sql = ("SELECT UNIQUE(latin_name) "
       "FROM risk_assessment.ecotox_extract1_m")
lat_df = pd.read_sql_query(sql, engine)

print('There are %s unique Latin names in the RADB.' % len(lat_df))

# Try matching these based on NCBI text
sql = ("SELECT * FROM risk_assessment.ncbi_names "
       "WHERE name_txt IN ("
       "SELECT UNIQUE(latin_name) "
       "FROM risk_assessment.ecotox_extract1_m)")
ncbi_df = pd.read_sql_query(sql, engine)

print('Of these, %s can be matched exactly in the NCBI database.' % len(ncbi_df))

There are 3781 unique Latin names in the RADB.
Of these, 2701 can be matched exactly in the NCBI database.


So, it looks as though we can link the ECOTOX data to the NCBI data based on Latin names in about 70% of cases (2701 of 3781). This obviously needs improving - **check with KET for options**.
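
One way to see which names fail to match is to compare the two name lists in pandas rather than in SQL. The sketch below is only illustrative: the two data frames are made-up stand-ins for the query results (the column names `latin_name`, `name_txt` and `tax_id` follow the tables above, but the values and the `norm` helper are invented for the example). It normalises case and whitespace before matching, and counts matched *names* rather than matched rows, since the same `name_txt` may appear more than once in `NCBI_NAMES`.

```python
import pandas as pd

# Toy stand-ins for the query results (illustrative values only)
lat_df = pd.DataFrame({'latin_name': ['Ictalurus punctatus',
                                      'Daphnia magna ',
                                      'salmo salar',
                                      'Unknownus speciesus']})
ncbi_df = pd.DataFrame({'tax_id': [101, 102, 103, 103],
                        'name_txt': ['Ictalurus punctatus',
                                     'Daphnia magna',
                                     'Salmo salar',
                                     'Salmo salar']})

def norm(s):
    """Lower-case, trim and collapse whitespace (hypothetical helper)."""
    return s.str.strip().str.lower().str.replace(r'\s+', ' ', regex=True)

lat_df['key'] = norm(lat_df['latin_name'])
ncbi_df['key'] = norm(ncbi_df['name_txt'])

# Drop duplicate NCBI keys so each RADB name is counted at most once,
# then left-join so that unmatched RADB names are kept
ncbi_unique = ncbi_df.drop_duplicates('key')
merged = lat_df.merge(ncbi_unique[['key', 'tax_id']],
                      on='key', how='left', indicator=True)

n_matched = (merged['_merge'] == 'both').sum()
unmatched = merged.loc[merged['_merge'] == 'left_only', 'latin_name'].tolist()

print('%s of %s names matched' % (n_matched, len(lat_df)))
print('Unmatched: %s' % unmatched)
```

Listing the unmatched names should make it easier to judge whether the gaps are due to formatting differences, synonyms, or organisms that are genuinely absent from the NCBI tables.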

Which of the 29 NCBI ranks are represented among these 2701 matches?

In [5]:
# Get matching NCBI ranks
sql = ("SELECT UNIQUE(rank) "
       "FROM risk_assessment.ncbi_nodes "
       "WHERE tax_id IN ("
       "SELECT tax_id FROM risk_assessment.ncbi_names "
       "WHERE name_txt IN ("
       "SELECT UNIQUE(latin_name) "
       "FROM risk_assessment.ecotox_extract1_m))")
rank_df = pd.read_sql_query(sql, engine)

rank_df

| | rank |
|---|---|
| 0 | suborder |
| 1 | subspecies |
| 2 | subphylum |
| 3 | class |
| 4 | kingdom |
| 5 | superkingdom |
| 6 | genus |
| 7 | family |
| 8 | phylum |
| 9 | no rank |


So the database currently includes entries for 15 of the 29 possible ranks.

For each organism, I need to be able to get (i) its NCBI rank and (ii) the name of every higher taxonomic level that the organism belongs to. The `NCBI_NODES` table does contain this information, but the table structure is recursive: each row has a "parent", which defines the hierarchy. It would be more useful to create a flattened version, with the ranks/taxonomic levels as columns and a row for each `tax_id`, something like this:

| tax_id | rank 1 | rank 2 | rank 3 | rank 4 |
|:------:|:------:|:------:|:------:|:------:|
|    1   |        |        |    1   |    3   |
|    2   |    1   |        |    5   |    7   |

With this structure, it should be straightforward to get taxonomic data for each organism from any level of the taxonomic tree. Generating this table looks a bit tricky, but Oracle has something called "[recursive/hierarchical queries](https://oracle-base.com/articles/misc/hierarchical-queries)", which may be useful in this context. I've never used them before, though, so I'll need to investigate further when I get back in April.
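
As an alternative to (or a cross-check on) doing this in Oracle, the flattening can be sketched in pandas by following the parent pointers. The example below is a minimal illustration on a made-up five-row table. It assumes that `NCBI_NODES` has `tax_id`, `parent_tax_id` and `rank` columns (the parent column name is a guess based on the NCBI dump format) and that the root node is its own parent, as in the NCBI taxonomy dump. The `lineage` function is hypothetical.

```python
import pandas as pd

# Toy version of NCBI_NODES (values are illustrative, not real tax_ids)
nodes = pd.DataFrame({
    'tax_id':        [1, 10, 20, 30, 40],
    'parent_tax_id': [1, 1, 10, 20, 30],
    'rank':          ['no rank', 'class', 'family', 'genus', 'species'],
})

parent = nodes.set_index('tax_id')['parent_tax_id'].to_dict()
rank = nodes.set_index('tax_id')['rank'].to_dict()

def lineage(tax_id):
    """Return {rank: tax_id} for a node and all of its ancestors."""
    out = {}
    seen = set()
    # Stop at the self-parented root (or if a cycle is encountered)
    while tax_id not in seen:
        seen.add(tax_id)
        # 'no rank' can occur several times in one lineage, so skip it
        if rank[tax_id] != 'no rank':
            out[rank[tax_id]] = tax_id
        tax_id = parent[tax_id]
    return out

# One row per tax_id, one column per rank; missing ranks become NaN
rows = [dict(lineage(t), tax_id=t) for t in nodes['tax_id']]
flat = pd.DataFrame(rows).set_index('tax_id')
flat = flat.reindex(columns=['class', 'family', 'genus', 'species'])

print(flat)
```

The ancestor `tax_id`s could then be swapped for names by joining to `NCBI_NAMES`. For the full table a database-side hierarchical query may well be faster; this loop is only meant to make the target structure concrete.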

Summary so far:

 * Check with KET regarding linking ECOTOX data to the NCBI codes. Are there better options than linking `LATIN_NAME` to `NAME_TXT`? <br><br>
 
 * `NCBI_NODES` needs restructuring to flatten the recursive data into something we can use for taxonomic aggregation
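
To illustrate what a flattened table would make possible, here is a hedged sketch of aggregating by an arbitrary rank. Everything in it is hypothetical: the data frames, the `tax_id` link on the ECOTOX side, the `effect_value` column and the `summarise` function are invented for the example, and the count/median statistics are placeholders for whatever `ECOTOX_SUMMARY_EFFECT` actually computes.

```python
import pandas as pd

# Hypothetical ECOTOX results already linked to NCBI tax_ids
results = pd.DataFrame({
    'tax_id':       [40, 40, 41, 50],
    'effect_value': [1.0, 3.0, 5.0, 10.0],
})

# Hypothetical flattened taxonomy: one row per tax_id, one column per rank
flat = pd.DataFrame({
    'tax_id': [40, 41, 50],
    'genus':  [30, 30, 31],
    'family': [20, 20, 21],
})

def summarise(results, flat, rank):
    """Attach the chosen rank to each result and summarise per group."""
    joined = results.merge(flat[['tax_id', rank]], on='tax_id', how='left')
    return joined.groupby(rank)['effect_value'].agg(['count', 'median'])

by_genus = summarise(results, flat, 'genus')
by_family = summarise(results, flat, 'family')

print(by_genus)
print(by_family)
```

Because the rank is just a column name, the same function would serve for any level of the tree once the flattened table exists.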