### HARD CODED RULES
* requires a blank row between different phyla (e.g., an empty row above Diatom, Dinoflagellate, etc.)
* first word of species name is the genus

1. diatoms are NOT mixotrophs
2. remove all "[name]-like" (without genus specified)
3. remove all "[genus name] spp." AND "[genus name] sp." EXCEPT for Ochromonas genus
4. assume everything after "Unknown flagellates" is irrelevant (to be deleted)
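
Rules 1 to 3 can be read as a single per-row test. The following is a minimal sketch under assumptions, not the notebook's implementation: `should_drop` and `_SP_TOKEN` are hypothetical names, and matching "sp."/"spp." as a whole word (with or without the trailing period) is an assumed interpretation of rule 3.

```python
import re

# Hypothetical pattern: "sp", "sp.", "spp" or "spp." as a whole word.
_SP_TOKEN = re.compile(r"\bspp?\.?(\s|$)", re.IGNORECASE)

def should_drop(phylum: str, species: str) -> bool:
    """Return True if a row should be removed under hard-coded rules 1-3."""
    name = species.strip()
    genus = name.split()[0].lower() if name else ""
    if phylum.strip().lower() == "diatom":        # rule 1: diatoms are not mixotrophs
        return True
    if name.lower().endswith("-like"):            # rule 2: "[name]-like" entries
        return True
    if _SP_TOKEN.search(name) and genus != "ochromonas":  # rule 3: sp./spp. except Ochromonas
        return True
    return False
```

Rule 4 depends on row position rather than on a single row's contents, so it is left out of this predicate.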

Status Key:  
Confirmed := the name appears explicitly in the Mixotroph Database  
Unsure (sp. in mdb) := the Mixotroph Database lists the genus as "[genus name] sp." (e.g., Ochromonas sp. for Ochromonas danica)  
Unsure (inexact name) := the LIS name is contained in a longer Mixotroph Database name (e.g., Chattonella marina in Chattonella marina var. ovata)  
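
The status key can be sketched as a small lookup function. This is an illustration under assumptions: `classify` is a hypothetical name, the database names are passed in as a plain list, the "sp." check accepts the spellings "sp", "sp." and "spp.", and "contained in a longer name" is taken to mean a plain substring match.

```python
def classify(name, mdb_names):
    """Return the status for one LIS species name, or None if no rule matches."""
    name = name.strip()
    if not name:
        return None
    names = {n.strip() for n in mdb_names}

    # Confirmed: exact match in the database.
    if name in names:
        return "Confirmed"

    # Unsure (sp. in mdb): the database lists the genus as "[genus] sp."
    genus = name.split()[0]
    if any(f"{genus} {suffix}" in names for suffix in ("sp", "sp.", "spp.")):
        return "Unsure (sp. in mdb)"

    # Unsure (inexact name): the LIS name is part of a longer database name.
    if any(name in longer for longer in names):
        return "Unsure (inexact name)"

    return None
```

Names that fall through all three checks return `None`, which leaves the decision about unmatched species to the caller.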

### QUESTIONS TO ASK

1. Should "cysts of Linggulodinium polyedrum" be counted as mixotrophs?
2. How should I handle the "unsure" cases? (See the status key above.)

### CODE STEPS
For Mixotroph Database:
1. reset headers
2. consider potential speed-ups, such as a dictionary mapping each phylum to its row numbers
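
One possible shape for that speed-up is sketched below. It assumes the grouping column is "Taxonomic Group" (as shown in the database header); `mdb_demo`, `group_rows` and `mdb_names` are hypothetical names, and the third row of the toy table is a made-up placeholder.

```python
import pandas as pd

# Toy stand-in for the Mixotroph Database. The first two species appear in the
# notebook output; the third row is a made-up placeholder.
mdb_demo = pd.DataFrame({
    "Species Name": ["Acanthochiasma sp", "Acanthometra fusca", "Placeholder species"],
    "Taxonomic Group": ["Radiolaria", "Radiolaria", "PlaceholderGroup"],
})

# GroupBy.indices maps each group label to the integer positions of its rows.
group_rows = {
    group: positions.tolist()
    for group, positions in mdb_demo.groupby("Taxonomic Group").indices.items()
}

# A set of names gives constant-time exact-match lookups.
mdb_names = set(mdb_demo["Species Name"].str.strip())
```

With `group_rows` in hand, a search can be limited to one group's rows via `mdb_demo.iloc[group_rows["Radiolaria"]]` instead of scanning the whole table.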

For LIS Dataframe:
1. save initial header
2. reset headers
3. delete the "Unknown flagellates" row and everything below it
4. delete any row whose first column contains "TOTAL"
5. get an array of the indices of the skipped rows (rows that are entirely empty; the phylum rows also have a missing value in the second column, so that test alone is not enough)
6. add 1 to every value in that array (to get the indices of the rows that hold the phylum)
7. copy the values at those indices in the first column to a new first column called phylum
8. and delete them from where they were before
9. forward-fill the phylum column so that it is completely filled (fill in empty rows using the previous value)
10. delete all rows where there are missing values in the species name column
11. rename what is now the second column as genus
12. proceed to remove the rows that the hard-coded rules exclude:
13. delete all rows where phylum is diatom
14. delete any row whose value in the species column ends with "-like"
15. find all rows whose species name contains "spp." or "sp.", EXCEPT those that also contain Ochromonas, and delete those rows
16. create new column called status to become the first column
17. now proceed iteratively through the condensed dataframe:
18. for each row:
19. check whether the name is in the Mixotroph Database; if yes, status = Confirmed
20. if not, check whether the only name for it in the database is "[genus name] sp."; if yes, status = Unsure (sp. in mdb)
21. if not, check whether the database has a longer name that contains it; if yes, status = Unsure (inexact name)
22. then after iterating:
23. for each phylum block, insert a "Totals" row holding the sums over all rows of that phylum
24. save the result as a CSV in the outputs folder, named with the old file name plus the date (timestamp)
25. add back initial header combined with current as a multiheader
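
Steps 3 to 11 of the LIS list might look roughly like the sketch below. It is an illustration on a toy table, not the notebook's code: `tidy_lis` and `lis_demo` are hypothetical names, the toy counts other than the first are made up, "Unknown flagellates" is assumed to sit in the first column, and blank rows are detected as rows that are entirely empty.

```python
import numpy as np
import pandas as pd

def tidy_lis(df: pd.DataFrame) -> pd.DataFrame:
    first, species = df.columns[0], df.columns[1]

    # Step 3: drop "Unknown flagellates" and everything below it.
    stop = df[first].astype(str).str.strip().str.lower().eq("unknown flagellates")
    if stop.any():
        df = df.iloc[: int(stop.to_numpy().argmax())]

    # Step 4: drop rows whose first column contains "TOTAL".
    df = df[~df[first].astype(str).str.contains("TOTAL")].reset_index(drop=True)

    # Steps 5-6: a phylum label sits on the row right after each blank row.
    blank_pos = np.flatnonzero(df.isna().all(axis=1).to_numpy())
    header_pos = blank_pos + 1
    header_pos = header_pos[header_pos < len(df)]
    is_header = np.zeros(len(df), dtype=bool)
    is_header[header_pos] = True

    # Steps 7 and 9: keep only the labels, then forward-fill them downward.
    df.insert(0, "phylum", df[first].where(is_header).ffill())

    # Steps 8 and 10: header and blank rows have no species name, so drop them.
    df = df.dropna(subset=[species]).reset_index(drop=True)

    # Step 11: the old first column actually holds the genus.
    return df.rename(columns={first: "genus"})

lis_demo = pd.DataFrame({
    "Phylum": [np.nan, "Diatom", "Achnanthes", "Actinoptychus", "TOTAL", np.nan,
               "Dinoflagellate", "Dinophysis", "Unknown flagellates", "Other"],
    "Species": [np.nan, np.nan, "Achnanthes spp.", "Actinoptychus senarius", np.nan, np.nan,
                np.nan, "Dinophysis acuminata", np.nan, "ignored"],
    "1/3/19": [np.nan, np.nan, 352.0, np.nan, 352.0, np.nan, np.nan, 88.0, np.nan, 1.0],
})
tidy = tidy_lis(lis_demo)
```

Detecting blank rows with `isna().all(axis=1)` avoids mistaking a phylum row for a blank one, since phylum rows also have an empty species cell.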

In [1]:
import pandas as pd

In [72]:
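# Load the Mixotroph Database; the real column names sit in row 1 of the raw frame.
mdb = pd.read_csv("csvs/MDB - 3Dec2022.csv")
mdb.columns = mdb.iloc[1]                      # promote row 1 to the header
mdb = mdb.drop([0, 1]).reset_index(drop=True)  # drop the rows above the data
mdb.head()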

1,Species Name,Taxonomic Group,AphiaID,Additional notes,Gene markers,PR2 Accession Number,GenBank Accession Number,Reference to sequence,MFT,Evidence of mixoplankton activity,...,REDS,SANT,SARC,SATL,SPSG,SSTC,SUND,TASM,WARM,WTRA
0,Acanthochiasma sp,Radiolaria,368427,Acantharia,18S_rRNA_nucleus;18S_rRNA_nucleus;18S_rRNA_nuc...,HM103395.1.1099_U;HM103418.1.1104_U;JN811207.1...,HM103395;HM103418;JN811207;GU825020;HM103399;H...,"Quaiser,A.. Comparative metagenomics of bathyp...",eSNCM,endosymbionts,...,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded
1,Acanthometra fusca,Radiolaria,not registered,Acantharia,18S_rRNA_nucleus;18S_rRNA_nucleus;18S_rRNA_nuc...,KC172856.1.1696_U;EU446351.1.1552_U;JN811165.1...,KC172856;EU446351;JN811165,"Decelle,J.. Diversity, ecology and biogeochemi...",eSNCM,endosymbionts,...,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded
2,Acanthodesmia vinculata,Radiolaria,493675,Acantharia,not recorded,not recorded,not recorded,not recorded,eSNCM,endosymbionts,...,not recorded,not recorded,not recorded,2,8,not recorded,not recorded,not recorded,15,not recorded
3,Acanthometra pellucida,Radiolaria,235750,Acantharia,18S_rRNA_nucleus;18S_rRNA_nucleus;18S_rRNA_nuc...,JN811196.1.1668_U;JQ697712.1.1693_U;JQ697708.1...,JN811196;JQ697712;JQ697708;JN811190;JQ697711;J...,"Decelle,J.. Molecular Phylogeny and Morphologi...",eSNCM,endosymbionts,...,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded
4,Acanthometron sp.,Radiolaria,391880,Acantharia,not recorded,not recorded,not recorded,not recorded,eSNCM,endosymbionts,...,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded,not recorded


In [22]:
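# Load the LIS counts; row 1 holds the column names and the data starts at row 3.
lis = pd.read_csv("csvs/LIS_2019-Phytoplankton_Final Report Data.csv")
lis.columns = lis.iloc[1]                  # promote row 1 to the header
lis = lis.iloc[3:].reset_index(drop=True)  # keep only the rows from row 3 onward
lis.head()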

1,Phylum,Species,1/3/19,1/3/19.1,1/3/19.2,1/7/19,1/7/19.1,1/7/19.2,1/2/19,1/2/19.1,...,12/6/19,12/6/19.1,12/6/19.2,12/5/19,12/16/19,12/16/19.1,12/16/19.2,12/4/19,12/4/19.1,12/4/19.2
0,,,,,,,,,,,...,,,,,,,,,,
1,Diatom,,,,,,,,,,...,,,,,,,,,,
2,Achnanthes,Achnanthes spp.,,352.0,,,,,,,...,176.0,,88.0,,,,,,,
3,Actinocyclus,Actinocyclus spp.,,,,,,,,,...,,,,,,,,,,
4,Actinoptychus,Actinoptychus senarius,,,,,,,,,...,,,,,,,,,,
