### HARD CODED RULES
Requires a blank row between different phyla (e.g., an empty row above Diatom, Dinoflagellate, etc.).

Assumed all {Ochromonas, } are mixotrophs.

1. assume everything after "Unknown flagellates" is irrelevant (to be deleted)
2. diatoms are NOT mixotrophs
3. remove all "[name]-like" (without genus specified)
4. remove all "[genus name] spp." AND "[genus name] sp."
5. check "cysts of"

Status Key:  
Confirmed := explicitly in the Mixotroph Database  
Unsure (sp. in mdb) := the Mixotroph Database lists the genus only as "[genus name] sp." (e.g., Ochromonas sp. for Ochromonas danica)  
Unsure (inexact name) := the LIS name is contained in a longer Mixotroph Database name or vice versa (e.g., Chattonella marina in Chattonella marina var. ovata)  
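
The status key can be read as an ordered lookup: try an exact match first, then the genus-level "sp." entry, then substring containment. The sketch below is a minimal pure-Python illustration of that ordering. `classify_status` and the toy `mdb_names` set are hypothetical names introduced here; the notebook itself performs the same checks with pandas masks.

```python
def classify_status(name, mdb_names):
    """Return the status for one LIS species name, or None if no rule applies."""
    # 1. Exact match in the Mixotroph Database
    if name in mdb_names:
        return "Confirmed"
    # 2. The database lists the genus only as "<genus> sp."
    genus = name.split()[0]
    if f"{genus} sp." in mdb_names:
        return "Unsure (sp. in mdb)"
    # 3. One name is contained in the other
    if any(name in other or other in name for other in mdb_names):
        return "Unsure (inexact name)"
    return None


# Toy database names, for illustration only
mdb_names = {"Akashiwo sanguinea", "Ochromonas sp.", "Chattonella marina var. ovata"}

statuses = {
    name: classify_status(name, mdb_names)
    for name in [
        "Akashiwo sanguinea",
        "Ochromonas danica",
        "Chattonella marina",
        "Skeletonema costatum",
    ]
}
```

Because the checks run in order, a name that matches exactly is never downgraded to one of the "Unsure" statuses.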

### QUESTIONS TO ASK

1. Should "cysts of Linggulodinium polyedrum" be counted as a mixotroph?
2. How should I handle these "unsure" cases? - see status key above

### Code Steps
For Mixotroph Database:
1. reset headers
2. consider potential speed ups such as phylum to row numbers dictionary
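
One possible form of the speed-up in step 2 is pandas' `GroupBy.indices`, which maps each group label to the integer row positions of its members. This is a minimal sketch on a toy frame: `toy_mdb` and `group_to_rows` are illustrative names, the rows are made up, and only the two column names are taken from the database header.

```python
import pandas as pd

# Toy stand-in for the Mixotroph Database (values are illustrative)
toy_mdb = pd.DataFrame({
    "Species Name": ["Acanthometra fusca", "Akashiwo sanguinea", "Acanthometra pellucida"],
    "Taxonomic Group": ["Radiolaria", "Dinoflagellata", "Radiolaria"],
})

# Dictionary: taxonomic group -> array of row positions
group_to_rows = toy_mdb.groupby("Taxonomic Group").indices

# Restrict a lookup to one group without scanning the whole table
radiolaria_names = toy_mdb["Species Name"].iloc[group_to_rows["Radiolaria"]].tolist()
```

Whether this pays off depends on how many lookups are made per group; for a handful of `isin` calls the plain column scan is likely fast enough.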

For LIS Dataframe:
1. save initial header
2. reset headers
3. delete "unknown flagellates" and everything below
4. delete any row whose first column contains "TOTAL"
5. get array of indices of missing values in second column (skipped rows)
6. add 1 to every value in that array (to get the indices of the rows that hold the phylum name)
7. copy the values at those indices in the first column to a new first column called phylum
8. delete those values from their original positions
9. forward-fill the phylum column so that it is completely filled (fill in empty rows using the previous value)
10. delete all rows where there are missing values in the species name column
11. rename what is now the second column as genus
12. proceed to clear out what you know it is not based on the hard coded rules:
13. delete all rows where phylum is diatom
14. delete any row whose value in the species column ends with "-like"
15. find all rows whose species name contains spp or sp EXCEPT if it also contains ochromonas and delete those rows
16. create new column called status to become the first column
17. now proceed iteratively through the condensed dataframe:
18. check if name is in mixotroph database, if yes, status = confirmed
19. if not, is the only name for it in the database sp., if yes, status = unsure (sp. in mdb)
20. if not, is there a longer name for it in the database, if yes, status = unsure (inexact name)
21. drop all rows with status "None"
22. for each phylum block, insert a row "Totals" and get the sums of everywhere that is that phylum
23. save this new file as a csv in the outputs folder and have the name be the old name + the date (timestamp)
24. add back initial header combined with current as a multiheader
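
Steps 22 and 23 could look roughly like the sketch below. It is only an illustration under assumptions: the toy frame, the `count_cols` list, the phylum labels, and the `timestamped_name` helper are all hypothetical, and the file-name format is one possible reading of "old name + the date (timestamp)".

```python
from datetime import datetime
from pathlib import Path

import pandas as pd

# Toy condensed frame (values are illustrative)
toy = pd.DataFrame({
    "Phylum": ["Dinoflagellate", "Dinoflagellate", "Chrysophyte"],
    "Species": ["Akashiwo sanguinea", "Prorocentrum micans", "Ochromonas danica"],
    "1/3/19": [352.0, None, 88.0],
    "1/7/19": [100.0, 50.0, None],
})
count_cols = ["1/3/19", "1/7/19"]

# Step 22: append a "Totals" row after each phylum block
blocks = []
for phylum, block in toy.groupby("Phylum", sort=False):
    totals = block[count_cols].sum()  # missing counts are skipped by default
    totals_row = pd.DataFrame([{"Phylum": phylum, "Species": "Totals", **totals.to_dict()}])
    blocks.append(pd.concat([block, totals_row], ignore_index=True))
with_totals = pd.concat(blocks, ignore_index=True)


# Step 23: derive the output name from the original name plus a timestamp
def timestamped_name(original, now=None):
    """Build '<old stem>_<YYYY-MM-DD_HHMMSS>.csv' from the original file name."""
    now = now or datetime.now()
    return f"{Path(original).stem}_{now:%Y-%m-%d_%H%M%S}.csv"


out_name = timestamped_name(
    "LIS_2019-Phytoplankton_Final Report Data.csv", datetime(2022, 12, 3, 9, 30, 0)
)
# with_totals.to_csv(Path("outputs") / out_name, index=False)  # assumes an outputs/ folder exists
```

Passing `sort=False` to `groupby` keeps the phylum blocks in their original order, which matters if the output should mirror the layout of the source spreadsheet.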

In [271]:
import pandas as pd
import numpy as np

In [272]:
mdb = pd.read_csv("csvs/MDB - 3Dec2022.csv")
mdb.columns = mdb.iloc[1]
mdb = mdb.drop([0, 1]).reset_index(drop=True)
mdb.head()

1,Species Name,Taxonomic Group,AphiaID,Additional notes,Gene markers,PR2 Accession Number,GenBank Accession Number,Reference to sequence,MFT,Evidence of mixoplankton activity,...,REDS,SANT,SARC,SATL,SPSG,SSTC,SUND,TASM,WARM,WTRA
0,Acanthochiasma sp,Radiolaria,368427,Acantharia,18S_rRNA_nucleus;18S_rRNA_nucleus;18S_rRNA_nuc...,HM103395.1.1099_U;HM103418.1.1104_U;JN811207.1...,HM103395;HM103418;JN811207;GU825020;HM103399;H...,"Quaiser,A.. Comparative metagenomics of bathyp...",eSNCM,endosymbionts,...,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded
1,Acanthometra fusca,Radiolaria,not registered,Acantharia,18S_rRNA_nucleus;18S_rRNA_nucleus;18S_rRNA_nuc...,KC172856.1.1696_U;EU446351.1.1552_U;JN811165.1...,KC172856;EU446351;JN811165,"Decelle,J.. Diversity, ecology and biogeochemi...",eSNCM,endosymbionts,...,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded
2,Acanthodesmia vinculata,Radiolaria,493675,Acantharia,not recorded,not recorded,not recorded,not recorded,eSNCM,endosymbionts,...,not recorded,not recorded,not recorded,2,8,not recorded,not recorded,not recorded,15,not recorded
3,Acanthometra pellucida,Radiolaria,235750,Acantharia,18S_rRNA_nucleus;18S_rRNA_nucleus;18S_rRNA_nuc...,JN811196.1.1668_U;JQ697712.1.1693_U;JQ697708.1...,JN811196;JQ697712;JQ697708;JN811190;JQ697711;J...,"Decelle,J.. Molecular Phylogeny and Morphologi...",eSNCM,endosymbionts,...,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded
4,Acanthometron sp.,Radiolaria,391880,Acantharia,not recorded,not recorded,not recorded,not recorded,eSNCM,endosymbionts,...,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded


In [273]:
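# edit mdb so that species ending in "sp" now end in "sp."
# \b keeps the match to the standalone word "sp", not any name that merely ends in the letters "sp"
mdb['Species Name'] = mdb['Species Name'].str.replace(r'\bsp$', 'sp.', regex=True)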

In [274]:
# import and clean LIS data
lis = pd.read_csv("csvs/LIS_2019-Phytoplankton_Final Report Data.csv")
original_headers = lis.columns  # save original column headers
lis.columns = lis.iloc[1]  # reset column headers
lis = lis.iloc[3:].reset_index(drop=True)  

In [275]:
# remove rows after unknown flagellates
unknown_flagellates_ind = lis[lis["Phylum"] == "Unknown flagellates"].index[0] 
lis = lis.iloc[:unknown_flagellates_ind]
lis = lis.iloc[:lis.last_valid_index()+1]  # remove trailing nan rows

In [276]:
# remove rows that contain "TOTAL"
lis = lis[~lis["Phylum"].str.contains("TOTAL", na=False)].reset_index(drop=True)  

In [277]:
# construct correct phylum column
actual_phylum_ind = lis[lis["Species"].isna() & lis["Phylum"].isna()].index + 1
lis = lis.rename(columns={"Phylum": "Genus"}) # rename phylum column to genus
lis.insert(0, 'Phylum', lis["Genus"].iloc[actual_phylum_ind])  # reconstruct phylum column
lis['Phylum'] = lis['Phylum'].ffill()  # forwardfill phylum

lis['Genus'] = lis['Species'].str.split().str[0]  # fill genus using first word of species name

lis = lis.dropna(subset=['Species']).reset_index(drop=True) # delete rows with na in Species column
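
To make the phylum reconstruction easier to follow, here is the same sequence of operations on a tiny frame. The layout of `raw` is an assumption about the LIS sheet (a blank row, then a row carrying only the phylum name, then species rows); the names `raw`, `blank`, `header_pos`, and `tidy` are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed layout: blank row, phylum-name row, species rows, repeated per phylum
raw = pd.DataFrame({
    "Phylum": [np.nan, "Diatom", np.nan, np.nan, "Dinoflagellate", np.nan],
    "Species": [np.nan, np.nan, "Skeletonema costatum", np.nan, np.nan, "Akashiwo sanguinea"],
})

# Blank rows have nothing in either column; the phylum name sits one row below each
blank = raw["Phylum"].isna() & raw["Species"].isna()
header_pos = np.flatnonzero(blank) + 1

tidy = raw.rename(columns={"Phylum": "Genus"})
tidy.insert(0, "Phylum", tidy["Genus"].iloc[header_pos])  # aligned on index, NaN elsewhere
tidy["Phylum"] = tidy["Phylum"].ffill()                   # carry each phylum name downward
tidy = tidy.dropna(subset=["Species"]).reset_index(drop=True)
tidy["Genus"] = tidy["Species"].str.split().str[0]        # genus = first word of the species name
```

After these steps `tidy` holds one row per species, each labelled with the phylum whose block it came from.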

In [278]:
# add Status column
lis.insert(0, 'Status', None)

In [279]:
# store blocks of known mixotroph genera
ochromonas_ind = lis[lis["Species"].str.contains("Ochromonas")].index
ochromonas_block = lis.loc[ochromonas_ind]  # select by label, matching the index labels reused later

In [280]:
# remove based on hard coded rules (NOT RESETTING INDEX IN ORDER TO ADD BLOCKS BACK CORRECTLY)
lis = lis[lis["Phylum"] != "Diatom"] # remove all diatoms
lis = lis[~lis["Species"].str.contains("-like")] # remove species ending with "-like"
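# remove all sp. / spp.; the dot is escaped so names such as "... spirale" are not caught by "sp" + any character
lis = lis[~lis["Species"].str.contains(r"\bspp?\.", regex=True)]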
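
A note on the sp. / spp. filter: `str.contains` treats its pattern as a regular expression by default, so an unescaped `.` matches any character. The sketch below uses plain `re` and a few toy names (not taken from the LIS file) to show how a loose pattern and an escaped one differ.

```python
import re

# Toy names, for illustration only
names = [
    "Gyrodinium spirale",
    "Gymnodinium sp.",
    "Prorocentrum spp.",
    "Prorocentrum micans",
]

loose = re.compile(r"sp.|spp.")    # "." matches any character, so "spi" counts as a hit
strict = re.compile(r"\bspp?\.")   # literal dot, and "sp" must start a word

loose_hits = [n for n in names if loose.search(n)]
strict_hits = [n for n in names if strict.search(n)]
```

With the loose pattern, a fully named species whose epithet happens to begin with "sp" would be dropped along with the genuine sp. / spp. entries.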

In [281]:
# check "cysts of"
CYSTS_LEN = len("cysts of ")
cysts_of = lis[lis["Species"].str.contains("cysts of", regex=False)]["Species"].str.slice(CYSTS_LEN)
filtered = cysts_of.isin(mdb['Species Name'])
lis.loc[filtered[filtered].index, "Status"] = "Confirmed"

In [282]:
# add back stored blocks of known mixotrophs and mark as Confirmed
lis = pd.concat([lis, ochromonas_block])
lis = lis[~lis.index.duplicated()].sort_index()  # de-duplicate by row label rather than by row contents
lis.loc[ochromonas_ind, "Status"] = "Confirmed"

In [283]:
# check if (in none status) direct match and mark all Trues as "Confirmed"
filtered = lis[lis['Status'].isnull()]["Species"].isin(mdb['Species Name'])
lis.loc[filtered[filtered].index, "Status"] = "Confirmed"

# check (in remaining none status) if the genus has sp. and mark all Trues as "Unsure (sp. in mdb)"
# (no drop_duplicates here: every species of a matching genus should be marked, not only the first)
genus_to_check = lis[lis['Status'].isnull()]['Species'].str.split().str[0] + " sp."
filtered = genus_to_check.isin(mdb['Species Name'])
lis.loc[filtered[filtered].index, "Status"] = "Unsure (sp. in mdb)"

In [284]:
# check (in remaining none status) if the name is contained in the mdb and vice versa and mark all Trues as "Unsure (inexact name)"
filtered = lis[lis['Status'].isnull()]["Species"].apply(lambda x: mdb["Species Name"].str.contains(x, regex=False).any())
lis.loc[filtered[filtered].index, "Status"] = "Unsure (inexact name)"

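import re  # used to escape regex metacharacters such as "." in database names

# build an alternation of literal database names, skipping missing values
pattern = '|'.join(re.escape(name) for name in mdb['Species Name'].dropna())
filtered = lis[lis['Status'].isnull()]["Species"].str.contains(pattern, regex=True)
lis.loc[filtered[filtered].index, "Status"] = "Unsure (inexact name)"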

In [285]:
# drop all rows with Status = "None"
lis = lis.dropna(subset=['Status'])
lis

1,Status,Phylum,Genus,Species,1/3/19,1/3/19.1,1/3/19.2,1/7/19,1/7/19.1,1/7/19.2,...,12/6/19,12/6/19.1,12/6/19.2,12/5/19,12/16/19,12/16/19.1,12/16/19.2,12/4/19,12/4/19.1,12/4/19.2
143,Confirmed,Dinoflagellate,Akashiwo,Akashiwo sanguinea,,,,,,,...,,,,,,,,,,
152,Confirmed,Dinoflagellate,Dinophysis,Dinophysis acuminata,,,,,,,...,,,,,,,,,,
153,Confirmed,Dinoflagellate,Dinophysis,Dinophysis miles,,,,,,,...,,,,,,,,,,
154,Confirmed,Dinoflagellate,Dinophysis,Dinophysis norvegica,,,,,,,...,,,,,,,,,,
156,Confirmed,Dinoflagellate,Gambierdiscus,Gambierdiscus toxicus,352.0,,,,,,...,88.0,,,,,88.0,,,,
158,Confirmed,Dinoflagellate,Gonyaulax,Gonyaulax polygramma,,,,,,,...,,,,,,,,,,
164,Confirmed,Dinoflagellate,Heterocapsa,Heterocapsa circularisquama,17600.0,17600.0,30800.0,8800.0,,17600.0,...,13200.0,13200.0,2904.0,352.0,,,,2904.0,704.0,
167,Confirmed,Dinoflagellate,Noctiluca,Noctiluca scintillans,,,,,,,...,,,,,,,,,,
180,Confirmed,Dinoflagellate,Prorocentrum,Prorocentrum lima,,,,,,,...,,,,,,,,,,
182,Confirmed,Dinoflagellate,Prorocentrum,Prorocentrum micans,,,,,,,...,,,,,,,,,,
