# eDNA conversion of practice data - draft 3

**Resources:**
- https://docs.gbif-uat.org/publishing-dna-derived-data/1.0/en/#mapping-metabarcoding-edna-and-barcoding-data

In [1]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates
import pytz # for handling time zones

In [2]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\eDNA")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [3]:
## Plate data

plate = pd.read_csv('Plate_S_ASV_OBIS_data.csv')
print(plate.shape)
plate.head()

(280440, 11)


Unnamed: 0,ASV,FilterID,Sequence_ID,Reads,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_1,05114c01_12_edna_1_S,14825,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_2,05114c01_12_edna_2_S,16094,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
2,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_3,05114c01_12_edna_3_S,22459,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
3,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,11216c01_12_edna_1,11216c01_12_edna_1_S,19312,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
4,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,11216c01_12_edna_2,11216c01_12_edna_2_S,16491,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned


**FilterID** = SAMPLING_cruise (cruise where the sample was collected) +
    SAMPLING_station_number (here, c + a cast number, because these data were collected via CTD cast) + _ +
    SAMPLING_bottle (bottle number; if multiple samples were taken from a single bottle, replicates may be indicated by a, b, c, etc.) + _ +
    edna (filter indicator; edna means that the standard filter for the standard eDNA pipeline, i.e. a standard PBDF filter, was used; other options include hplc) + _ +
    1, 2 or 3 (PCR replicate)

To summarize, **FilterID** = cruise number + cast number + bottle number + filter indicator + replicate number.

**Note** that not all IDs exactly follow this formula. Example:
- JD10706C1_0m_1

**Sequence_ID** adds the plate indicator to the end of FilterID. In this case, all samples are from plate S. Plates are labeled A-Z, AA-ZZ, etc.
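
As a rough illustration of the naming scheme above, the following sketch splits a FilterID into its parts with a regular expression. The pattern and the `parse_filter_id` helper are assumptions made for illustration, not part of the original pipeline. IDs that do not follow the formula (such as JD10706C1_0m_1) return `None` so they can be reviewed by hand.

```python
import re

# Hypothetical pattern for the standard FilterID layout described above:
# cruise + cast (c + digits) + _ + bottle + _ + filter indicator + _ + PCR replicate
FILTER_ID_PATTERN = re.compile(
    r'^(?P<cruise>[^_]+?)(?P<cast>c\d+)_(?P<bottle>[^_]+)_(?P<filter>[^_]+)_(?P<replicate>\d+)$'
)

def parse_filter_id(filter_id):
    """Return the FilterID components as a dict, or None if the ID does not fit the pattern."""
    match = FILTER_ID_PATTERN.match(filter_id)
    return match.groupdict() if match else None

print(parse_filter_id('05114c01_12_edna_1'))
print(parse_filter_id('JD10706C1_0m_1'))
```

Returning `None` rather than raising keeps the check usable in a list comprehension over all IDs.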

In [4]:
## Taxonomy table

taxa = pd.read_csv('Filtered_ASV_taxa_table_all.csv')
print(taxa.shape)
taxa.head()

(4711, 79)


Unnamed: 0,ASV,Kingdom,Phylum,Class,Order,Family,Genus,Species,14213c01_12_edna_1,14213c01_12_edna_2,...,JD33306C1_0m_3,pcrblank2,EB_20161121,EB_20161228,EB_20170117,pcrblank3,CB_CANON160925_1,CB_CANON160925_2,CB_CANON160925_3,ArtComm2
0,ASV_1,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned,9,4,...,956,0,0,1312,1,0,197,3,210,44
1,ASV_2,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Calanidae,unassigned,unassigned,4,7,...,358,0,0,535,0,0,118,162,139,5976
2,ASV_3,Eukaryota,unknown,Dinophyceae,Gymnodiniales,Gymnodiniaceae,Akashiwo,Akashiwo sanguinea,2,3,...,795,0,0,0,35,0,1,0,1,8
3,ASV_4,Eukaryota,unknown,Dinophyceae,Gymnodiniales,Gymnodiniaceae,Cochlodinium,unassigned,2,3,...,82065,0,0,0,836,0,1,0,0,7
4,ASV_5,Eukaryota,unknown,Dinophyceae,unassigned,unassigned,unassigned,unassigned,148,147,...,14688,0,0,36,401,0,0,4,11,7


This shows the number of reads for each ASV detected for each FilterID (columns), plus the matched taxa information for each ASV, if available. 

Some columns are associated with controls:
- **CB** = collection blank which was taken on the ship. Water that should be "clean" was passed through the filter, so there should be no reads in this sample, although it tends to be the "dirtiest" control. It checks for contamination during filtration, I think.
- **EB** = extraction blank which was taken in the lab but contains no DNA (i.e. a negative control), so there should be no reads in this sample. It checks if contamination occurred during DNA extraction (lab work pre-PCR).
- **pcrblank** = A sample that went through PCR but contained no input DNA (i.e. a negative control), so there should be no reads in this sample. It checks if contamination occurred during PCR.
- **ArtComm** = artificial community that went through PCR and contained DNA from species that should not be in the study system (i.e. a positive control), so you know what results you expect.

To see these columns, use:
```python
for col in taxa.columns:
    if col not in plate['FilterID'].unique():
        print(col)
```
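
The control types listed above could also be labeled programmatically. This is a minimal sketch that assumes controls can be recognized by the column-name prefixes visible in the taxonomy table header (CB_, EB_, pcrblank, ArtComm); the `control_type` helper is hypothetical, and other plates may name their controls differently.

```python
# Prefixes taken from the control column names shown in the taxonomy table above.
# Matching on prefixes is an assumption; adjust if other plates use different names.
CONTROL_PREFIXES = {
    'CB_': 'collection blank',
    'EB_': 'extraction blank',
    'pcrblank': 'PCR blank',
    'ArtComm': 'artificial community',
}

def control_type(column_name):
    """Return the control type for a column name, or None if it is not a known control."""
    for prefix, label in CONTROL_PREFIXES.items():
        if column_name.startswith(prefix):
            return label
    return None

example_columns = ['14213c01_12_edna_1', 'pcrblank2', 'EB_20161121', 'CB_CANON160925_1', 'ArtComm2']
for col in example_columns:
    print(col, '->', control_type(col))
```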

Additionally, taxonomic columns include the following designations:
- **unknown** = GenBank couldn't give a scientifically-agreed-upon name for a given taxonomic level. I.e., either the name doesn't exist, or there isn't enough scientific consensus to give a name.
- **no_hit** = BLAST did not find any hits for the ASV.
- **unassigned** = The ASV got BLAST hits, but the post-processing program Megan6 didn't assign the ASV to any taxonomic group.
- **g_** or **s_** = After Megan there's an additional filtering step in which Genus or Species designations are kept only if they were assigned at or above a set, high confidence threshold. If this threshold is not met, a g_ or s_ is given instead.

In [5]:
## Plate metadata

plate_meta = pd.read_csv('Plate_S_meta_OBIS_data_022321.csv')
print(plate_meta.shape)
plate_meta.head()

(60, 72)


Unnamed: 0,sample_name,library,tag_sequence,primer_sequence_F,primer_sequence_R,R1,R2,PlateID,sample_type,locus,...,sop,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,seq_meth,sequencing_facility,seqID,identificationRemarks,identificationReferences,FilterID
0,14213c01_12_eDNA_1,S1,ACGAGACTGATT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_1_S1_L001_R1_001.fastq.gz,14213c01_12_edna_1_S1_L001_R2_001.fastq.gz,S,environmental,18S,...,ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...,1391f,EukBr,Amaral-Zettler et al. (2009),NGS Illumina Miseq,Stanford,14213c01_12_edna_1_S,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna
1,14213c01_12_eDNA_2,S2,GAATACCAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_2_S2_L001_R1_001.fastq.gz,14213c01_12_edna_2_S2_L001_R2_001.fastq.gz,S,environmental,18S,...,ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...,1391f,EukBr,Amaral-Zettler et al. (2009),NGS Illumina Miseq,Stanford,14213c01_12_edna_2_S,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna
2,14213c01_12_eDNA_3,S3,CGAGGGAAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_3_S3_L001_R1_001.fastq.gz,14213c01_12_edna_3_S3_L001_R2_001.fastq.gz,S,environmental,18S,...,ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...,1391f,EukBr,Amaral-Zettler et al. (2009),NGS Illumina Miseq,Stanford,14213c01_12_edna_3_S,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna
3,22013c01_12_eDNA_1,S4,GAACACTTTGGA,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,22013c01_12_edna_1_S4_L001_R1_001.fastq.gz,22013c01_12_edna_1_S4_L001_R2_001.fastq.gz,S,environmental,18S,...,ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...,1391f,EukBr,Amaral-Zettler et al. (2009),NGS Illumina Miseq,Stanford,22013c01_12_edna_1_S,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22013c01_12_edna
4,22013c01_12_eDNA_2,S5,ACTCACAGGAAT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,22013c01_12_edna_2_S5_L001_R1_001.fastq.gz,22013c01_12_edna_2_S5_L001_R2_001.fastq.gz,S,environmental,18S,...,ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...,1391f,EukBr,Amaral-Zettler et al. (2009),NGS Illumina Miseq,Stanford,22013c01_12_edna_2_S,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22013c01_12_edna


## Conversion

In [6]:
## eventID

occ = pd.DataFrame({'eventID':plate['Sequence_ID']})
print(occ.shape)
occ.head()

(280440, 1)


Unnamed: 0,eventID
0,05114c01_12_edna_1_S
1,05114c01_12_edna_2_S
2,05114c01_12_edna_3_S
3,11216c01_12_edna_1_S
4,11216c01_12_edna_2_S


In [7]:
## Merge with plate_meta to obtain eventDate, decimalLatitude, decimalLongitude, and other columns that can be added directly from metadata

metadata_cols = [
    'seqID',
    'SAMPLING_date_time', 
    'SAMPLING_dec_lat', 
    'SAMPLING_dec_lon',
    'env_biome',
    'env_feature',
    'env_material',
    'locus',
    'primer_sequence_F',
    'primer_sequence_R',
    'pcr_primer_name_forward',
    'pcr_primer_name_reverse',
    'pcr_primer_reference',
    'sop',
]

dwc_cols = [
    'eventID',
    'eventDate', 
    'decimalLatitude', 
    'decimalLongitude',
    'env_broad_scale',
    'env_local_scale',
    'env_medium',
    'target_gene',
    'primer_sequence_forward',
    'primer_sequence_reverse',
    'pcr_primer_name_forward',
    'pcr_primer_name_reverse',
    'pcr_primer_reference',
    'sop',
]

occ = occ.merge(plate_meta[metadata_cols], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)
occ.columns = dwc_cols
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop
0,05114c01_12_edna_1_S,2/20/2014 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...
1,05114c01_12_edna_2_S,2/20/2014 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...
2,05114c01_12_edna_3_S,2/20/2014 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...
3,11216c01_12_edna_1_S,4/21/2016 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...
4,11216c01_12_edna_2_S,4/21/2016 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...


**Note** that for pcr_primer_reference, I think it would be best to give either a DOI that links to the article or the full article citation (i.e. including the title and journal name). This isn't a requirement; it just seems like the safest way to ensure the correct article can be identified.
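
The merge in In [7] renames columns by position (`occ.columns = dwc_cols`), which silently mislabels data if the two lists ever drift out of order. As a hedged alternative, the sketch below renames by name and uses the `validate` and `indicator` options of `DataFrame.merge` to catch duplicate keys and unmatched eventIDs. The `merge_metadata` helper, the shortened `RENAME_MAP`, and the toy data are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical subset of the metadata-to-Darwin-Core mapping used in In [7]
RENAME_MAP = {
    'SAMPLING_date_time': 'eventDate',
    'SAMPLING_dec_lat': 'decimalLatitude',
    'SAMPLING_dec_lon': 'decimalLongitude',
    'locus': 'target_gene',
}

def merge_metadata(occ, meta, rename_map, key='seqID'):
    """Left-merge metadata onto occ, rename columns by name, and report unmatched eventIDs."""
    merged = occ.merge(
        meta[[key] + list(rename_map)],
        how='left',
        left_on='eventID',
        right_on=key,
        validate='many_to_one',  # raises MergeError if key values repeat in meta
        indicator=True,          # adds a _merge column: 'both' or 'left_only'
    )
    unmatched = merged.loc[merged['_merge'] == 'left_only', 'eventID'].unique().tolist()
    merged = merged.drop(columns=[key, '_merge']).rename(columns=rename_map)
    return merged, unmatched

# Toy data standing in for occ and plate_meta
toy_occ = pd.DataFrame({'eventID': ['a_S', 'a_S', 'b_S']})
toy_meta = pd.DataFrame({
    'seqID': ['a_S'],
    'SAMPLING_date_time': ['2/20/2014 15:33'],
    'SAMPLING_dec_lat': [36.7958],
    'SAMPLING_dec_lon': [-121.848],
    'locus': ['18S'],
})
merged, unmatched = merge_metadata(toy_occ, toy_meta, RENAME_MAP)
print(merged.columns.tolist())
print(unmatched)
```

Any eventID reported in `unmatched` would end up with empty metadata fields, so it is worth checking that the list is empty before continuing.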

In [8]:
## Format eventDate

pst = pytz.timezone('America/Los_Angeles')
eventDate = [pst.localize(datetime.strptime(dt, '%m/%d/%Y %H:%M')).isoformat() for dt in occ['eventDate']]
occ['eventDate'] = eventDate

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),ext:dx.doi.org/10.17504/protocols.io.xjufknw;a...
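
The offsets above differ (-08:00 in February, -07:00 in April) because `localize` applies daylight saving rules per timestamp. The sketch below wraps the same conversion in a hypothetical `to_iso8601` helper and passes `is_dst=None`, which makes pytz raise an error for timestamps that fall in a daylight saving transition instead of guessing. It assumes, as In [8] does, that the recorded times are local Pacific time.

```python
from datetime import datetime
import pytz

def to_iso8601(timestamp, tz_name='America/Los_Angeles', fmt='%m/%d/%Y %H:%M'):
    """Parse a local timestamp string and return an ISO 8601 string with UTC offset."""
    tz = pytz.timezone(tz_name)
    naive = datetime.strptime(timestamp, fmt)
    # is_dst=None makes pytz raise AmbiguousTimeError / NonExistentTimeError
    # instead of silently guessing around daylight saving transitions
    return tz.localize(naive, is_dst=None).isoformat()

print(to_iso8601('2/20/2014 15:33'))  # winter sample, Pacific Standard Time
print(to_iso8601('4/21/2016 14:39'))  # spring sample, Pacific Daylight Time
```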


In [9]:
## Clean sop

# Replace semicolons with pipes
occ['sop'] = occ['sop'].str.replace(';', ' | ', regex=False)

# Remove the clarifying prefixes (ext:, amp:, bioinformatics:) that Katie added, row by row
occ['sop'] = occ['sop'].replace('ext:|amp:|bioinformatics:', '', regex=True)
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....


In [10]:
## Add DNA_sequence

occ['DNA_sequence'] = plate['ASV']
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop,DNA_sequence
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...


In [11]:
## scientificName, taxonomic info

occ['scientificName'] = plate['Species']
occ['kingdom'] = plate['Kingdom']
occ['phylum'] = plate['Phylum']
occ['class'] = plate['Class']
occ['order'] = plate['Order']
occ['family'] = plate['Family']
occ['genus'] = plate['Genus']

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,pcr_primer_reference,sop,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


In [12]:
## Replace 'unknown', 'unassigned', etc. in scientificName and taxonomy columns with NaN

cols = ['scientificName', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus']
occ[cols] = occ[cols].replace({'unassigned':np.nan,
                              's_':np.nan,
                              'g_':np.nan,
                              'unknown':np.nan,
                              'no_hit':np.nan})
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,pcr_primer_reference,sop,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Amaral-Zettler et al. (2009),dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


In [13]:
## Get unique species names

names = occ['scientificName'].unique()
names = names[~pd.isnull(names)] # remove NaN
print(len(names))

323


These names **do** have to be WoRMS-approved in order to be submitted to OBIS. 

There are a number of entries in the scientificName column, like "uncultured marine eukaryote", "eukaryote clone OLI11007", and "Acantharian sp. 6201", that are **not proper Latin species names**. It seems reasonable to replace these with NaN as well. **I've used a simple rule to filter these out, and it would be worth checking it as the data set expands, just to be extra careful.**

To see the entries:
```python
names = occ['scientificName'].unique()
names = names[~pd.isnull(names)] # remove NaN
for name in names:
    words_in_name = name.split(' ')
    if len(words_in_name) > 2:
        print(name)
```

In [14]:
## Replace non-Latin species names with NaN

# Get non-Latin names
non_latin_names = []
for name in names:
    words_in_name = name.split(' ')
    if len(words_in_name) > 2:
        non_latin_names.append(name)
non_latin_names_dict = {i:np.nan for i in non_latin_names}

# Add any names that didn't get caught in the simple filter
non_latin_names_dict['phototrophic eukaryote'] = np.nan
non_latin_names_dict['Candida <clade Candida/Lodderomyces clade>'] = np.nan

# Replace (assign the result back rather than relying on an in-place edit of a column selection)
occ['scientificName'] = occ['scientificName'].replace(non_latin_names_dict)
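
Since the word-count rule needed two manual additions, a slightly stricter check may be worth considering. The sketch below flags any name that is not a plain two-word binomial (capitalized genus followed by a lowercase epithet). The pattern and the `looks_like_binomial` helper are assumptions for illustration; the rule would also reject some valid names, such as subspecies or hybrids, so flagged names should still be reviewed.

```python
import re

# Assumed shape of a plain binomial: capitalized genus + lowercase epithet, nothing else
BINOMIAL_PATTERN = re.compile(r'^[A-Z][a-z]+ [a-z][a-z-]+$')

def looks_like_binomial(name):
    """Return True if name has the form 'Genus epithet'."""
    return bool(BINOMIAL_PATTERN.match(name))

examples = [
    'Akashiwo sanguinea',
    'Hemistasia phaeocysticola',
    'uncultured marine eukaryote',
    'eukaryote clone OLI11007',
    'Acantharian sp. 6201',
    'phototrophic eukaryote',
    'Candida <clade Candida/Lodderomyces clade>',
]
for name in examples:
    print(name, '->', looks_like_binomial(name))
```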

In addition, 52380 records **only give "Eukaryota" as the scientific name** (i.e. Eukaryota is in the kingdom field, and there is no further taxonomic information). These should be replaced with Biota (http://marinespecies.org/aphia.php?p=taxdetails&id=1).

In [15]:
## Replace entries where kingdom = 'Eukaryota' with the WoRMS-approved 'Biota'

occ.loc[occ['kingdom'] == 'Eukaryota', 'kingdom'] = 'Biota'

Finally, **GenBank and WoRMS seem to disagree on their taxonomy**. Two examples:

1. 120 records have "Pelargonium" as the genus. [On GenBank](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=4030&lvl=3&lin=f&keep=1&srchmode=1&unlock) the full taxonomy is given as:
    - superkingdom Eukaryota
    - kingdom Viridiplantae (subkingdom on WoRMS)
    - phylum Streptophyta (infrakingdom on WoRMS, phylum listed as Tracheophyta)
    - subphylum Streptophytina (Spermatophytina on WoRMS)
    - class Magnoliopsida (matches on WoRMS)
        - this was indicated as 'unknown' in original data, i.e. GenBank couldn't give a scientifically-agreed-upon name for a given taxonomic level. **This doesn't make sense to me.**
    - order Geraniales (matches on WoRMS)
    - family Geraniaceae ([matches on WoRMS](http://marinespecies.org/aphia.php?p=taxdetails&id=382461))
    - genus Pelargonium (does NOT match on WoRMS)


2. 600 records have family Hemistasiidae (and no further taxonomic information). [On GenBank](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=2015512&lvl=3&lin=f&keep=1&srchmode=1&unlock), the full taxonomy is given as:
    - superkingdom Eukaryota
    - phylum Euglenozoa (matches on WoRMS)
    - order Diplonemea ([matches on WoRMS](http://marinespecies.org/aphia.php?p=taxdetails&id=582176))
    - family Hemistasiidae (does NOT match on WoRMS)

Interestingly, there are 60 records where species is Hemistasia phaeocysticola. These records also have family Hemistasiidae. The genus Hemistasia [matches on WoRMS](http://marinespecies.org/aphia.php?p=taxdetails&id=146165), but with a totally different taxonomy than [on GenBank](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=1503927&lvl=3&srchmode=1&keep=1&unlock):
- kingdom Protozoa
- phylum Euglenozoa
- class Kinetoplastea
- order not listed
- family not listed

**Is this messed up, or is this fine?**

**In both examples above**, the lowest taxonomic rank name given by the dataset does not match on WoRMS. I therefore need a system for searching through the higher taxonomic ranks given, finding the lowest one that will match on WoRMS, and putting that name in the scientificName column. The following few code blocks do this.
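
To make the idea concrete before the data frame version, here is a minimal per-record sketch of the fallback. The `lowest_matching_name` helper and the `accepted` set are hypothetical; the set stands in for names that match on WoRMS, whereas the real workflow queries the WoRMS API.

```python
# Ranks ordered from most to least specific; scientificName holds the species-level name, if any
RANKS = ['scientificName', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom']

def lowest_matching_name(record, accepted_names):
    """Return the most specific name in record that appears in accepted_names, else None."""
    for rank in RANKS:
        name = record.get(rank)
        if name and name in accepted_names:
            return name
    return None

# Hypothetical stand-in for names that match on WoRMS (a real run queries the WoRMS API)
accepted = {'Euglenozoa', 'Diplonemea', 'Hemistasia'}

# The Hemistasiidae example from above: the family does not match, so the order is used
hemistasiidae_record = {'phylum': 'Euglenozoa', 'order': 'Diplonemea', 'family': 'Hemistasiidae'}
print(lowest_matching_name(hemistasiidae_record, accepted))
```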

In [16]:
## Functions for finding the lowest available taxonomic rank that will match on WoRMS

def fill_lowest_taxon(df, cols):
    """ Takes the occurrence pandas data frame and fills missing values in scientificName with values from the first non-missing taxonomic rank column, listed in cols. """
    
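    # Walk from the lowest rank (last in cols) up to the highest, skipping scientificName (cols[0]).
    # reversed() leaves the caller's list untouched, so it does not need to be reversed back.
    for col in reversed(cols[1:]):
        df['scientificName'] = df['scientificName'].combine_first(df[col])
    
    return df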

def find_not_matched(df, name_dict):
    """ Takes the occurrence pandas data frame and name_dict matching scientificName values with names on WoRMS and returns a list of names that did not match on WoRMS. """
    
    # Drop missing values up front; removing NaN from a list afterwards depends on object identity
    not_matched = [
        name for name in df['scientificName'].dropna().unique()
        if name not in name_dict
    ]
    
    return not_matched

def replace_not_matched(df, not_matched, cols):
    """ Takes the occurrence pandas data frame and a list of scientificName values that did not match on WoRMS and replaces those values with NaN in the columns specified by cols. """
    
    df[cols] = df[cols].replace(not_matched, np.nan)
    
    return(df)  

In [17]:
## Iterate to match lowest possible taxonomic rank on WoRMS

# Note that cols (list of taxonomic column names) was defined when cleaning them above

# Initialize dictionaries
name_name_dict = {}
name_id_dict = {}
name_taxid_dict = {}
name_class_dict = {}

# Initialize not_matched
not_matched = [1]

# Iterate
while len(not_matched) > 0:
    
    # Step 1 - fill
    occ = fill_lowest_taxon(occ, cols)

    # Step 2 - get names to match
    to_match = find_not_matched(occ, name_name_dict)

    # Step 3 - match
    print('Matching {num} names on WoRMS.'.format(num = len(to_match)))
    name_id, name_name, name_taxid, name_class = WoRMS.run_get_worms_from_scientific_name(to_match, verbose_flag=False)
    name_id_dict = {**name_id_dict, **name_id}
    name_name_dict = {**name_name_dict, **name_name}
    name_taxid_dict = {**name_taxid_dict, **name_taxid}
    name_class_dict = {**name_class_dict, **name_class}
    print('Length of name_name_dict: {length}'.format(length = len(name_name_dict)))
    
    ## ---- Add in a progress bar? Also, a better/additional stopping criterion? What if not all names can be matched?

    # Step 4 - get names that didn't match
    not_matched = find_not_matched(occ, name_name_dict)
    print('Number of names not matched: {num}'.format(num = len(not_matched)))

    # Step 5 - replace these values with NaN
    occ = replace_not_matched(occ, not_matched, cols)

Matching 756 names on WoRMS.
Length of name_name_dict: 694
Number of names not matched: 62
Matching 36 names on WoRMS.
Length of name_name_dict: 714
Number of names not matched: 16
Matching 12 names on WoRMS.
Length of name_name_dict: 720
Number of names not matched: 6
Matching 3 names on WoRMS.
Length of name_name_dict: 721
Number of names not matched: 2
Matching 1 names on WoRMS.
Length of name_name_dict: 721
Number of names not matched: 1
Matching 0 names on WoRMS.
Length of name_name_dict: 721
Number of names not matched: 0


**Note** that there are 33360 records (~12% of the data) where no taxonomic information was obtained at all (i.e. scientificName is still NaN). **Pieter mentioned that I could possibly use 'Biota' for these records as well. Would that work?**

```python
occ[occ['scientificName'].isna()].shape
```

**These tentatively should be dropped, since they won't appear on OBIS/GBIF, but perhaps there's some argument to be made for leaving them in anyway?** I'm leaving them in for now.
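
Until that question is settled, the options could be kept switchable. The sketch below is an illustration only: the `handle_unidentified` helper and its `strategy` argument are hypothetical, and whether 'Biota' is acceptable for these records is still the open question raised above.

```python
import numpy as np
import pandas as pd

def handle_unidentified(occ, strategy='keep'):
    """Return a copy of occ with records lacking a scientificName kept, dropped, or set to Biota."""
    missing = occ['scientificName'].isna()
    print('{n} of {total} records ({pct:.1%}) have no scientificName'.format(
        n=missing.sum(), total=len(occ), pct=missing.mean()))
    out = occ.copy()
    if strategy == 'drop':
        out = out.loc[~missing]
    elif strategy == 'biota':
        out.loc[missing, 'scientificName'] = 'Biota'
    elif strategy != 'keep':
        raise ValueError('strategy must be keep, drop, or biota')
    return out

# Toy data standing in for occ
toy = pd.DataFrame({'scientificName': ['Paracalanus', np.nan, 'Diplonemea', np.nan]})
print(handle_unidentified(toy, 'drop').shape)
print(handle_unidentified(toy, 'biota')['scientificName'].tolist())
```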

Finally, the process above **overwrote values in the taxonomy columns** (names that did not match on WoRMS were replaced with NaN) in order to obtain the best possible scientificName column. I'll restore them below.

In [18]:
## Fix taxonomy columns

# Replace with original data
occ[cols[1:]] = plate[['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus']].copy()

# Replace missing data indicators in original data with empty strings ('')
occ[cols[1:]] = occ[cols[1:]].replace({
    'unassigned':'',
    's_':'',
    'g_':'',
    'unknown':'',
    'no_hit':''})

Check using:

```python
occ[occ['scientificName'] == 'Diplonemea'].head()
```

In [19]:
## Add scientific name-related columns

occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus,scientificNameID,taxonID
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0
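
A caveat about dictionary lookups like the one in this cell: `replace` leaves any name that is not a key in the dictionary unchanged, so an unmatched name would end up sitting in the ID column. `map` returns NaN for unmatched names instead. A minimal sketch, assuming a toy dictionary in place of `name_id_dict` (the name 'Unmatched name' is invented):

```python
import pandas as pd

# Hypothetical lookup; in the notebook this role is played by name_id_dict
toy_name_id = {'Paracalanus': 'urn:lsid:marinespecies.org:taxname:104196'}
names = pd.Series(['Paracalanus', 'Unmatched name'])

via_replace = names.replace(toy_name_id)  # unmatched name is carried over unchanged
via_map = names.map(toy_name_id)          # unmatched name becomes NaN

print(via_replace.tolist())
print(via_map.tolist())
```

Which behavior is preferable depends on whether the dictionaries are guaranteed to cover every name in scientificName.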


**Note** that Pieter said that taxonID or taxonConceptID could potentially be used to include a GenBank ID here (the latter seems like a better fit to me). **Can I request these from Katie? Also, where would I specify that this number is a GenBankID? In the taxonRemarks?**

In [20]:
## Replace scientificName with the name that matched on WoRMS

occ['scientificName'] = occ['scientificName'].replace(name_name_dict)
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus,scientificNameID,taxonID
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0


In [21]:
## Add nameAccordingTo

occ['nameAccordingTo'] = 'WoRMS'

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,scientificName,kingdom,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS


In [22]:
## basisOfRecord

occ['basisOfRecord'] = 'MaterialSample'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,kingdom,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo,basisOfRecord
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample


In [23]:
## Add materialSampleID, identificationRemarks, identificationReferences -- can probably do this through merging metadata when Katie adds this info

occ['materialSampleID'] = 'insert ID for actual frozen sample if exists'

occ = occ.merge(plate_meta[['seqID', 'identificationRemarks', 'identificationReferences']], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)

# Clean identificationRemarks ---- make this more human-readable after clarifying what exactly it means w/ Katie

# Clean identificationReferences
occ['identificationReferences'] = occ['identificationReferences'].str.replace('; ', ' | ', regex=False)
occ['identificationReferences'] = occ['identificationReferences'].str.replace(r' D.*\)', '', regex=True)

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,order,family,genus,scientificNameID,taxonID,nameAccordingTo,basisOfRecord,materialSampleID,identificationRemarks,identificationReferences
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...


**Note** that I should ask Katie to explain the content of identificationRemarks, and then make the field contents more descriptive. **I also need to ask if these actual water samples are frozen somewhere. If they are, that information can be included under materialSampleID.**
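
The metadata merge in cell 23 only keeps the row count and row order of occ intact if each seqID appears once in plate_meta, and the next cell relies on that when it assigns `plate['Reads']` by position. A minimal sketch of how that could be guarded, using toy frames (the column names follow the notebook, the values are invented):

```python
import pandas as pd

# Toy stand-ins for occ and plate_meta
toy_occ = pd.DataFrame({'eventID': ['a_S', 'a_S', 'b_S']})
toy_meta = pd.DataFrame({
    'seqID': ['a_S', 'b_S'],
    'identificationRemarks': ['remark a', 'remark b'],
})

# validate='many_to_one' raises a MergeError if seqID is duplicated in toy_meta
merged = toy_occ.merge(toy_meta, how='left', left_on='eventID',
                       right_on='seqID', validate='many_to_one')
merged = merged.drop(columns='seqID')

# A left merge on unique right-hand keys keeps the row count unchanged
assert len(merged) == len(toy_occ)
print(merged)
```
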

In [24]:
## organismQuantity (number of reads)

occ['organismQuantity'] = plate['Reads']
occ['organismQuantityType'] = 'DNA sequence reads'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,genus,scientificNameID,taxonID,nameAccordingTo,basisOfRecord,materialSampleID,identificationRemarks,identificationReferences,organismQuantity,organismQuantityType
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads


**Note** that there are 215537 rows where the number of reads is 0. These will need to be dropped at the end of the script.

In [25]:
## sampleSizeValue

count_by_seq = plate.groupby('Sequence_ID', as_index=False)['Reads'].sum()
occ = occ.merge(count_by_seq, how='left', left_on='eventID', right_on='Sequence_ID')
occ.drop(columns='Sequence_ID', inplace=True)
occ.rename(columns={'Reads':'sampleSizeValue'}, inplace=True)
print(occ.shape)
occ.head()

(280440, 32)


Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,scientificNameID,taxonID,nameAccordingTo,basisOfRecord,materialSampleID,identificationRemarks,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419
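
The same per-sample totals could probably also be computed without the merge, using a grouped `transform`. This is only a sketch on toy data (invented values), not a replacement I've tested on the full data set:

```python
import pandas as pd

# Toy stand-in for occ with two samples
toy = pd.DataFrame({
    'eventID': ['a_S', 'a_S', 'b_S', 'b_S'],
    'organismQuantity': [10, 0, 5, 7],
})

# Total reads per sample, broadcast back onto every row of that sample
toy['sampleSizeValue'] = toy.groupby('eventID')['organismQuantity'].transform('sum')

# Sanity check: no single record can exceed its sample total
all_within_total = bool((toy['organismQuantity'] <= toy['sampleSizeValue']).all())

print(toy)
print(all_within_total)
```
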


In [26]:
## sampleSizeUnit

occ['sampleSizeUnit'] = 'DNA sequence reads'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,taxonID,nameAccordingTo,basisOfRecord,materialSampleID,identificationRemarks,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue,sampleSizeUnit
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600,DNA sequence reads
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702,DNA sequence reads
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275,DNA sequence reads
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220,DNA sequence reads
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196.0,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419,DNA sequence reads


In [27]:
## associatedSequences

occ['associatedSequences'] = 'biosampleID from SRA if available'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,nameAccordingTo,basisOfRecord,materialSampleID,identificationRemarks,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue,sampleSizeUnit,associatedSequences
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600,DNA sequence reads,biosampleID from SRA if available
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702,DNA sequence reads,biosampleID from SRA if available
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275,DNA sequence reads,biosampleID from SRA if available
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220,DNA sequence reads,biosampleID from SRA if available
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,MaterialSample,insert ID for actual frozen sample if exists,Genbank nr September 20 2017;>80% identity,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419,DNA sequence reads,biosampleID from SRA if available


**Ask Katie about associatedSequences value.**

## Tidy

**Recall:** I'm leaving records where scientificName is missing in for now.

In [28]:
## Drop records where organismQuantity = 0 (absences are not meaningful for this data set)

occ = occ[occ['organismQuantity'] > 0].copy()  # copy so later column assignments don't act on a slice
print(occ.shape)

(64903, 34)


**Note** that the vast majority of this data has 0 reads (215537 records out of 280440, or about 77%). I guess that makes sense - it's sparse data - but it might be good to check?
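
The share of zero-read rows can be worked out directly from the counts reported in this notebook. On the real data, `(plate['Reads'] == 0).mean()` should give the same fraction.

```python
# Counts taken from the notebook output
n_total = 280440   # rows in plate
n_zero = 215537    # rows with 0 reads

n_kept = n_total - n_zero       # rows remaining after the drop
share_zero = n_zero / n_total   # fraction of rows with 0 reads

print(n_kept)
print(round(share_zero * 100, 1))
```
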

In [29]:
## Replace NaN values in text fields with ''

text_fields = ['scientificName', 'scientificNameID']
occ[text_fields] = occ[text_fields].replace(np.nan, '')
occ.isna().sum()

eventID                        0
eventDate                      0
decimalLatitude                0
decimalLongitude               0
env_broad_scale                0
env_local_scale                0
env_medium                     0
target_gene                    0
primer_sequence_forward        0
primer_sequence_reverse        0
pcr_primer_name_forward        0
pcr_primer_name_reverse        0
pcr_primer_reference           0
sop                            0
DNA_sequence                   0
scientificName                 0
kingdom                        0
phylum                         0
class                          0
order                          0
family                         0
genus                          0
scientificNameID               0
taxonID                     5278
nameAccordingTo                0
basisOfRecord                  0
materialSampleID               0
identificationRemarks          0
identificationReferences       0
organismQuantity               0
organismQu

## Save

In [30]:
## Save

occ.to_csv('eDNA_practice_plate_occ_20210308.csv', index=False, na_rep='NaN')
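
A quick look at what this save produces for missing values. With `na_rep='NaN'`, real NaN values (e.g. in taxonID) are written as the text `NaN`, while the text fields that were set to `''` above are written as empty fields. The sketch uses a toy frame with invented rows. The `Int64` conversion is an optional idea, not something the notebook currently does, for writing the IDs without the trailing `.0`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two columns of occ
toy = pd.DataFrame({
    'scientificName': ['Paracalanus', ''],
    'taxonID': [104196.0, np.nan],
})

# As saved in the notebook: NaN becomes 'NaN', '' becomes an empty field
as_float = toy.to_csv(index=False, na_rep='NaN').splitlines()

# Optional: a nullable integer dtype writes 104196 instead of 104196.0
toy_int = toy.astype({'taxonID': 'Int64'})
as_int = toy_int.to_csv(index=False, na_rep='NaN').splitlines()

print(as_float)
print(as_int)
```
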

## Questions

1. For pcr_primer_reference, I think it would be best to either put a DOI that links to the article, or the entire article citation (i.e. title, journal name included). This isn't a requirement, it just seems like the safest way to ensure the correct article would be identified.
2. Cleaning the sop field. Why is it formatted like this? Do you just type it in? If so, maybe it could adhere to DwC formatting requirements (pipes, no text)? If not, no worries, I can handle it.
3. GenBank and WoRMS often disagree on taxonomy, especially when you look at the whole tree. Is this a problem? Also, talk through general scheme of getting a WoRMS match for scientificName. **Note that this process and assigning scientificName, taxonID, and taxonConceptID to organisms with no Linnaean name remains under debate, especially in the context of how it will be interpreted by OBIS search. Katie had some interesting thoughts about searching eDNA data, if I recall correctly?**
4. Would putting 'Biota' be appropriate for taxa where nothing matched on GenBank at any taxonomic level? (http://marinespecies.org/aphia.php?p=taxdetails&id=1)
5. Is it possible to obtain GenBankID for each taxon? I would like to include this information. 
6. **Question for Saara or Pieter or Abby:** Where can I include the fact that the taxonConceptID is a GenBankID?
7. materialSampleID - are these water samples frozen somewhere? Do they have an ID number?
8. What is in the identificationRemarks column? Can we make the wording clearer?
9. associatedSequences - are there biosampleIDs from SRA that I can include?
10. About 77% of the rows in this dataset (215537 of 280440) have 0 reads. Does that seem right?
11. Follow up questions on control data: Has the control information been incorporated into this data set already? Are the control data still useful if you're dealing with the fully processed data set? Are the control data archived and therefore accessible via biosampleID?


**Note to self**: Would it work just as well to drop the records with 0 reads up top? I don't know that it would change how long the WoRMS lookup takes (since I think the 0's appear because the ASV was present in another sample on the same plate), but it might speed up some simple processes like formatting dates and performing replacements. Maybe.
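
A minimal sketch of what dropping the zeros up top might look like, on a toy version of plate (invented values). Two things it is meant to show: the per-sample read totals used for sampleSizeValue are unaffected, because zeros add nothing to a sum, and the index should be reset after filtering, since later cells assign columns from plate to occ by index.

```python
import pandas as pd

# Toy stand-in for plate
toy_plate = pd.DataFrame({
    'Sequence_ID': ['a_S', 'a_S', 'b_S', 'b_S'],
    'Reads': [10, 0, 0, 7],
})
totals_before = toy_plate.groupby('Sequence_ID')['Reads'].sum()

# Drop zero-read rows first, and reset the index so it is 0..n-1 again
toy_plate = toy_plate[toy_plate['Reads'] > 0].reset_index(drop=True)
totals_after = toy_plate.groupby('Sequence_ID')['Reads'].sum()

print(len(toy_plate))
print(totals_before.equals(totals_after))
```

One caveat: a sample whose reads are all 0 would disappear from the totals entirely, though it would not contribute any occurrence records either.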