# Check MIxS 6 IDs
### The purpose of this notebook is to check for mismatches between the unique IDs of MIxS terms found in the MIxS core and packages.
### This is done by comparing the core term IDs to the package term IDs. So far, all IDs are fine.
#### Note: There are only 3 core terms that appear as package terms:
- alt (MIXS:0000094)
- depth (MIXS:0000018)
- elev (MIXS:0000093)

In [3]:
import os, sys
sys.path.append(os.path.abspath('../../mixs_data/')) # add path to mixs 6 spreadsheet 

In [50]:
import pandas as pds
from pandasql import sqldf

### Read in all data from MIxS 6 term updates.xlsx

In [14]:
xls = pds.ExcelFile('../../mixs_data/MIxS 6 term updates.xlsx')
xls.sheet_names # list sheet names

['MIxS5 - do not edit',
 'MIxS5 packages - do not edit',
 'MIxS6 Core- to edit',
 'Checklists-Core (non-MIxS)',
 'MIxS6 packages - to edit',
 'NewChecklists_MIxS6']

## Create dataframes of core and package terms

In [16]:
coreDf = pds.read_excel(xls, 'MIxS6 Core- to edit')

In [38]:
packageDf = pds.read_excel(xls, 'MIxS6 packages - to edit')

### Normalize headers to a standardized form
- lowercase names
- remove surrounding spaces
- replace inner spaces with underscores

In [40]:
coreDf.columns = [x.lower().strip().replace(' ', '_') for x in list(coreDf.columns)]
packageDf.columns = [x.lower().strip().replace(' ', '_') for x in list(packageDf.columns)]
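The three normalization steps listed above can be sketched as a small helper. This is only an illustration: the function name `normalize_header` and the raw header strings are hypothetical, not taken from the spreadsheet.

```python
def normalize_header(name: str) -> str:
    """Trim surrounding spaces, lowercase, and replace inner spaces with underscores."""
    return name.strip().lower().replace(" ", "_")

# Hypothetical raw headers, chosen only to exercise each step.
raw_headers = ["Structured comment name", " Unique MIxS ID ", "Item"]
clean_headers = [normalize_header(h) for h in raw_headers]
print(clean_headers)
```

Note that a run of several inner spaces becomes a run of several underscores, which is also how the list comprehensions in the cell above behave.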

## Subset dataframes to names and ids

In [48]:
coreDf = coreDf[['structured_comment_name', 'unique_mixs_id']]
packageDf = packageDf[['structured_comment_name', 'unique_mixs_id']].drop_duplicates()

In [49]:
packageDf.head()

| | structured_comment_name | unique_mixs_id |
|---|---|---|
| 0 | alt | MIXS:0000094 |
| 1 | elev | MIXS:0000093 |
| 2 | barometric_press | MIXS:0000096 |
| 3 | carb_dioxide | MIXS:0000097 |
| 4 | carb_monoxide | MIXS:0000098 |
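After `drop_duplicates()`, each remaining row is a distinct (name, id) pair, so any name that still appears more than once must carry conflicting ids. A minimal sketch of that check on toy rows follows. The ids for `alt` and `elev` are the ones shown above, while `MIXS:9999999` is made up purely to trigger the check.

```python
import pandas as pd

# Toy package rows; "MIXS:9999999" is a made-up id used to create a conflict.
package_rows = pd.DataFrame({
    "structured_comment_name": ["alt", "alt", "elev", "alt"],
    "unique_mixs_id": ["MIXS:0000094", "MIXS:0000094", "MIXS:0000093", "MIXS:9999999"],
})

# Collapse exact repeats so that each row is a distinct (name, id) pair.
pairs = package_rows.drop_duplicates()

# Names that still occur more than once are associated with more than one id.
conflicts = pairs[pairs.duplicated("structured_comment_name", keep=False)]
print(conflicts)
```

On the real `packageDf`, an empty `conflicts` frame would indicate that every package term name maps to a single id.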


## Query dataframes for mismatches
- perform a left join of the core terms to the package terms on term name; no term should match more than one row (i.e., count > 1)
- perform a left join of the package terms to the core terms (the opposite of the above); again, no term should match more than one row (i.e., count > 1)

In [65]:
q1 = """
select 
    coreDf.structured_comment_name, coreDf.unique_mixs_id, count(*)
from 
    coreDf
left join
    packageDf
on
    coreDf.structured_comment_name = packageDf.structured_comment_name
group by
    coreDf.structured_comment_name, coreDf.unique_mixs_id
having 
    count(*) > 1
"""

In [66]:
sqldf(q1)

| | structured_comment_name | unique_mixs_id | count(*) |
|---|---|---|---|


In [67]:
q2 = """
select 
    packageDf.structured_comment_name, packageDf.unique_mixs_id, count(*)
from 
    packageDf
left join
    coreDf
on
    packageDf.structured_comment_name = coreDf.structured_comment_name
group by
    packageDf.structured_comment_name, packageDf.unique_mixs_id
having 
    count(*) > 1
"""

In [68]:
sqldf(q2)

| | structured_comment_name | unique_mixs_id | count(*) |
|---|---|---|---|
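The two queries above flag names that match more than one row on the other side. A complementary check, sketched here with a pandas merge instead of SQL, compares the two id columns directly for every shared name. The frames are toy data: the ids for `alt`, `depth`, `elev`, and `barometric_press` are the ones listed in this notebook, and the package id for `depth` is deliberately replaced by a made-up value (`MIXS:9999999`) so that the check has something to report.

```python
import pandas as pd

core = pd.DataFrame({
    "structured_comment_name": ["alt", "depth", "elev"],
    "unique_mixs_id": ["MIXS:0000094", "MIXS:0000018", "MIXS:0000093"],
})
package = pd.DataFrame({
    "structured_comment_name": ["alt", "depth", "elev", "barometric_press"],
    # "MIXS:9999999" is a made-up id that simulates a mismatch for "depth".
    "unique_mixs_id": ["MIXS:0000094", "MIXS:9999999", "MIXS:0000093", "MIXS:0000096"],
})

# An inner join keeps only names present in both frames; the suffixes
# distinguish the core id column from the package id column.
merged = core.merge(
    package,
    on="structured_comment_name",
    how="inner",
    suffixes=("_core", "_package"),
)

# Rows where the same name carries different ids on the two sides.
mismatches = merged[merged["unique_mixs_id_core"] != merged["unique_mixs_id_package"]]
print(mismatches)
```

Run against the real `coreDf` and `packageDf`, an empty `mismatches` frame would support the statement at the top of the notebook that all ids agree.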


## Check length of each dataframe

In [69]:
len(coreDf)

95

In [70]:
len(packageDf)

509

### Find core terms that are not package terms

In [76]:
q3 = """
select 
    coreDf.structured_comment_name, coreDf.unique_mixs_id
from 
    coreDf
where 
    coreDf.structured_comment_name not in (select packageDf.structured_comment_name from packageDf)
order by
    coreDf.structured_comment_name
"""

In [77]:
len(sqldf(q3))

92

### Find package terms that are not core terms

In [78]:
q4 = """
select 
    packageDf.structured_comment_name, packageDf.unique_mixs_id
from 
    packageDf
where 
    packageDf.structured_comment_name not in (select coreDf.structured_comment_name from coreDf)
order by
    packageDf.structured_comment_name
"""

In [79]:
len(sqldf(q4))

506
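For comparison, the `not in` subqueries in `q3` and `q4` can be approximated in pandas with `isin`. The sketch below uses toy frames: `toy_core_only` is a hypothetical term name, and the other names appear elsewhere in this notebook. One difference is worth keeping in mind: in SQL, `x not in (subquery)` never evaluates to true when the subquery returns a NULL, whereas `isin` has no such behavior, so the two approaches can disagree if a name column contains missing values.

```python
import pandas as pd

# Toy frames; "toy_core_only" is a hypothetical name used for illustration.
core = pd.DataFrame(
    {"structured_comment_name": ["alt", "depth", "elev", "toy_core_only"]}
)
package = pd.DataFrame(
    {"structured_comment_name": ["alt", "depth", "elev", "barometric_press", "carb_dioxide"]}
)

# Core terms whose name does not occur among the package terms (like q3).
in_package = core["structured_comment_name"].isin(package["structured_comment_name"])
core_only = core[~in_package]

# Package terms whose name does not occur among the core terms (like q4).
in_core = package["structured_comment_name"].isin(core["structured_comment_name"])
package_only = package[~in_core]

print(len(core_only), len(package_only))
```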

## Find terms that are both core and package terms

In [80]:
q5 = """
select 
    coreDf.structured_comment_name, coreDf.unique_mixs_id
from 
    coreDf
where 
    coreDf.structured_comment_name IN (select packageDf.structured_comment_name from packageDf)
order by
    coreDf.structured_comment_name
"""

In [81]:
sqldf(q5)

| | structured_comment_name | unique_mixs_id |
|---|---|---|
| 0 | alt | MIXS:0000094 |
| 1 | depth | MIXS:0000018 |
| 2 | elev | MIXS:0000093 |
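The counts reported in this notebook are mutually consistent: 92 core-only terms plus 3 shared terms give the 95 core rows, and 506 package-only rows plus 3 shared terms give the 509 package rows. The second sum also implies that each shared name occupies exactly one row of `packageDf`, which agrees with the empty results of the join queries. A small sketch of the same partition with Python sets follows. The helper `partition_terms` and the name `toy_core_only` are hypothetical.

```python
def partition_terms(core_names, package_names):
    """Split term names into core-only, package-only, and shared sets."""
    core_set, package_set = set(core_names), set(package_names)
    return core_set - package_set, package_set - core_set, core_set & package_set

# Toy name lists; "toy_core_only" is a hypothetical term.
core_names = ["alt", "depth", "elev", "toy_core_only"]
package_names = ["alt", "depth", "elev", "barometric_press", "carb_dioxide"]

core_only, package_only, shared = partition_terms(core_names, package_names)

# Each side must be fully covered by its exclusive part plus the shared part.
assert len(core_only) + len(shared) == len(set(core_names))
assert len(package_only) + len(shared) == len(set(package_names))

# The same arithmetic with the counts reported in this notebook.
assert 92 + 3 == 95
assert 506 + 3 == 509
print(sorted(shared))
```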
