# eDNA conversion of practice data

**Resources:**
- https://docs.gbif-uat.org/publishing-sequence-derived-data/1.0/en/


In [187]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates
import pytz # for handling time zones

## Load data

In [188]:
## Plate data

plate = pd.read_csv('Plate_S_ASV_OBIS_data.csv')
print(plate.shape)
plate.head()

(280440, 11)


Unnamed: 0,ASV,FilterID,Sequence_ID,Reads,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_1,05114c01_12_edna_1_S,14825,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_2,05114c01_12_edna_2_S,16094,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
2,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_3,05114c01_12_edna_3_S,22459,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
3,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,11216c01_12_edna_1,11216c01_12_edna_1_S,19312,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
4,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,11216c01_12_edna_2,11216c01_12_edna_2_S,16491,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned


**Note** that there are 60 unique FilterIDs and 60 unique Sequence_IDs in this table. Is it usual for the number of FilterIDs to equal the number of Sequence_IDs? **What information are the IDs built from?** Sequence_ID appears to be just the FilterID with `_S` appended.

In [189]:
## Sequence_ID is just FilterID + '_S'

temp = plate[['FilterID', 'Sequence_ID']].copy()
temp.drop_duplicates(inplace=True)
print(temp.shape)
temp.head()

(60, 2)


Unnamed: 0,FilterID,Sequence_ID
0,05114c01_12_edna_1,05114c01_12_edna_1_S
1,05114c01_12_edna_2,05114c01_12_edna_2_S
2,05114c01_12_edna_3,05114c01_12_edna_3_S
3,11216c01_12_edna_1,11216c01_12_edna_1_S
4,11216c01_12_edna_2,11216c01_12_edna_2_S
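
The first five rows only suggest the `FilterID` + `_S` pattern. A minimal sketch of how it could be checked across every row, using a small stand-in table (`plate_demo` is hypothetical; on the real data the same comparison would run on `plate`):

```python
import pandas as pd

# Toy stand-in for `plate` (hypothetical values copied from the preview above)
plate_demo = pd.DataFrame({
    'FilterID': ['05114c01_12_edna_1', '05114c01_12_edna_2', '11216c01_12_edna_1'],
    'Sequence_ID': ['05114c01_12_edna_1_S', '05114c01_12_edna_2_S', '11216c01_12_edna_1_S'],
})

# True only if every Sequence_ID is exactly its FilterID followed by '_S'
expected = plate_demo['FilterID'] + '_S'
suffix_rule_holds = (plate_demo['Sequence_ID'] == expected).all()

# Rows that break the rule, if any (empty when the rule holds)
exceptions = plate_demo[plate_demo['Sequence_ID'] != expected]
print(suffix_rule_holds, len(exceptions))
```

If `exceptions` came back non-empty on the real table, those rows would be the ones to ask about.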


**FilterID seems to be composed of:**
- SAMPLING_cruise from plate_meta
- c + SAMPLING_station_number (or just SAMPLING_station, lower case and zero-padded?)
- _ + SAMPLING_bottle
- _ + edna
- _ + 1, 2 or 3 (replicate?)
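
The guessed structure above can be tested by splitting the IDs with a regular expression. This is only a sketch: the pattern and the group names (`cruise`, `station_number`, `bottle`, `replicate`) are my assumptions about the ID layout, not a documented format, and IDs that do not follow it simply come back as NaN.

```python
import pandas as pd

# A few example IDs; 'pcrblank_1' is included to show a non-matching value
filter_ids = pd.Series(['05114c01_12_edna_1', 'CN13Dc01_12_edna_3', 'pcrblank_1'])

# Assumed layout: <cruise>c<station_number>_<bottle>_edna_<replicate>
pattern = (r'^(?P<cruise>.+?)c(?P<station_number>\d+)'
           r'_(?P<bottle>\d+)_edna_(?P<replicate>\d+)$')

# str.extract returns one column per named group, NaN where the pattern fails
parts = filter_ids.str.extract(pattern)
parts.insert(0, 'FilterID', filter_ids)
print(parts)
```

The extracted pieces could then be compared with SAMPLING_cruise, SAMPLING_station_number and SAMPLING_bottle in plate_meta to see whether the guess holds.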

In [190]:
## ASV taxa table

taxa = pd.read_csv('Filtered_ASV_taxa_table_all.csv')
print(taxa.shape)
print(taxa.columns)
taxa.head()

(4711, 79)
Index(['ASV', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus',
       'Species', '14213c01_12_edna_1', '14213c01_12_edna_2',
       '14213c01_12_edna_3', '22013c01_12_edna_1', '22013c01_12_edna_2',
       '22013c01_12_edna_3', 'CN13Dc01_12_edna_1', 'CN13Dc01_12_edna_2',
       'CN13Dc01_12_edna_3', '05114c01_12_edna_1', '05114c01_12_edna_2',
       '05114c01_12_edna_3', '14714c01_12_edna_1', '14714c01_12_edna_2',
       '14714c01_12_edna_3', '19114c01_12_edna_1', '19114c01_12_edna_2',
       '19114c01_12_edna_3', '30214c01_12_edna_1', '30214c01_12_edna_2',
       '30214c01_12_edna_3', '32414c01_12_edna_1', '32414c01_12_edna_2',
       '32414c01_12_edna_3', '12015c01_12_edna_1', '12015c01_12_edna_2',
       '12015c01_12_edna_3', '18815c01_12_edna_1', '18815c01_12_edna_2',
       '18815c01_12_edna_3', 'EB_20161116', 'pcrblank_1', '28215c01_12_edna_1',
       '28215c01_12_edna_2', '28215c01_12_edna_3', '34915c01_12_edna_1',
       '34915c01_12_edna_2', '34915c01_12_edn

Unnamed: 0,ASV,Kingdom,Phylum,Class,Order,Family,Genus,Species,14213c01_12_edna_1,14213c01_12_edna_2,...,JD33306C1_0m_3,pcrblank2,EB_20161121,EB_20161228,EB_20170117,pcrblank3,CB_CANON160925_1,CB_CANON160925_2,CB_CANON160925_3,ArtComm2
0,ASV_1,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned,9,4,...,956,0,0,1312,1,0,197,3,210,44
1,ASV_2,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Calanidae,unassigned,unassigned,4,7,...,358,0,0,535,0,0,118,162,139,5976
2,ASV_3,Eukaryota,unknown,Dinophyceae,Gymnodiniales,Gymnodiniaceae,Akashiwo,Akashiwo sanguinea,2,3,...,795,0,0,0,35,0,1,0,1,8
3,ASV_4,Eukaryota,unknown,Dinophyceae,Gymnodiniales,Gymnodiniaceae,Cochlodinium,unassigned,2,3,...,82065,0,0,0,836,0,1,0,0,7
4,ASV_5,Eukaryota,unknown,Dinophyceae,unassigned,unassigned,unassigned,unassigned,148,147,...,14688,0,0,36,401,0,0,4,11,7


It looks like this table gives the number of reads of each ASV (rows) in each FilterID (columns), plus the matched taxonomic information for each ASV, if available. Blank cells have been filled with the word 'unassigned'.

Some column names do not correspond to FilterIDs in plate. Besides the ASV and taxonomy columns (of course), these are:
- EB_20161116
- pcrblank_1
- pcrblank2
- EB_20161121
- EB_20161228
- EB_20170117
- pcrblank3
- CB_CANON160925_1
- CB_CANON160925_2
- CB_CANON160925_3
- ArtComm2

Are some of these controls? 

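```python
# Print every taxa column that is not a FilterID in plate
filter_ids = set(plate['FilterID'])  # build the lookup once instead of per column
for col in taxa.columns:
    if col not in filter_ids:
        print(col)
```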
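
Since the taxa table stores reads in a wide layout (one column per FilterID), reshaping it to one row per ASV and FilterID would make it directly comparable with plate. A small sketch with made-up numbers (`taxa_demo` and `taxonomy_cols` are hypothetical stand-ins; the real table has eight ASV and taxonomy columns):

```python
import pandas as pd

# Toy stand-in for `taxa`: two ASVs, two samples and one blank
taxa_demo = pd.DataFrame({
    'ASV': ['ASV_1', 'ASV_2'],
    'Kingdom': ['Eukaryota', 'Eukaryota'],
    '14213c01_12_edna_1': [9, 4],
    '14213c01_12_edna_2': [4, 7],
    'pcrblank_1': [0, 0],
})
taxonomy_cols = ['ASV', 'Kingdom']

# Wide -> long: every non-taxonomy column becomes a (FilterID, Reads) pair
long_reads = taxa_demo.melt(id_vars=taxonomy_cols,
                            var_name='FilterID', value_name='Reads')
print(long_reads.shape)
```

Note that the ASV column in taxa holds labels such as ASV_1, whereas plate holds the sequence itself, so joining the two would need a label-to-sequence mapping that I have not seen yet.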

In [191]:
## Plate metadata

plate_meta = pd.read_csv('Plate_S_meta_OBIS_data.csv')
print(plate_meta.shape)
print(plate_meta.columns)
plate_meta.head()

(60, 52)
Index(['SequenceID', 'sample_name', 'order', 'tag_sequence',
       'primer_sequence_F', 'primer_sequence_R', 'library_tag_combo',
       'library', 'date_PCR', 'sample_type', 'locus', 'tag_number', 'R1', 'R2',
       'SAMPLING_cruise', 'SAMPLING_station_number', 'SAMPLING_bottle',
       'depth', 'SAMPLING_station', 'SAMPLING_project', 'SAMPLING_platform',
       'SAMPLING_platform_type', 'SAMPLING_dec_lat', 'SAMPLING_dec_lon',
       'temp', 'salinity', 'chlorophyll', 'pressure_dbar', 'nitrate',
       'diss_oxygen', 'SAMPLING_real_depth', 'SAMPLING_transmiss_%',
       'SAMPLING_sig_t', 'SAMPLING_fluor', 'SAMPLING_date_time', 'description',
       'SAMPLING_PI', 'SAMPLING_institute', 'env_biome', 'env_feature',
       'env_material', 'samp_collection_device', 'project_name',
       'samp_vol_we_dna_ext', 'samp_filter_size_ext', 'samp_filter_ext_type',
       'samp_store_temp', 'seq_meth', 'sequencing_facility', 'geo_loc_name',
       'investigation_type', 'FilterID'],
     

Unnamed: 0,SequenceID,sample_name,order,tag_sequence,primer_sequence_F,primer_sequence_R,library_tag_combo,library,date_PCR,sample_type,...,project_name,samp_vol_we_dna_ext,samp_filter_size_ext,samp_filter_ext_type,samp_store_temp,seq_meth,sequencing_facility,geo_loc_name,investigation_type,FilterID
0,14213c01_12_edna_1_S,14213c01_12_edna_1,1,ACGAGACTGATT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,S1_ACGAGACTGATT,S1,20170124,environmental,...,MBON,100ml,0.22um,Poretics,-80C,NGS Illumina Miseq,Stanford,USA:California:Monterey Bay,eukaryote,14213c01_12_edna_1
1,14213c01_12_edna_2_S,14213c01_12_edna_2,2,GAATACCAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,S2_GAATACCAAGTC,S2,20170124,environmental,...,MBON,100ml,0.22um,Poretics,-80C,NGS Illumina Miseq,Stanford,USA:California:Monterey Bay,eukaryote,14213c01_12_edna_2
2,14213c01_12_edna_3_S,14213c01_12_edna_3,3,CGAGGGAAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,S3_CGAGGGAAAGTC,S3,20170124,environmental,...,MBON,100ml,0.22um,Poretics,-80C,NGS Illumina Miseq,Stanford,USA:California:Monterey Bay,eukaryote,14213c01_12_edna_3
3,22013c01_12_edna_1_S,22013c01_12_edna_1,4,GAACACTTTGGA,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,S4_GAACACTTTGGA,S4,20170124,environmental,...,MBON,100ml,0.22um,Poretics,-80C,NGS Illumina Miseq,Stanford,USA:California:Monterey Bay,eukaryote,22013c01_12_edna_1
4,22013c01_12_edna_2_S,22013c01_12_edna_2,5,ACTCACAGGAAT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,S5_ACTCACAGGAAT,S5,20170124,environmental,...,MBON,100ml,0.22um,Poretics,-80C,NGS Illumina Miseq,Stanford,USA:California:Monterey Bay,eukaryote,22013c01_12_edna_2


This shows the metadata associated with each SequenceID and FilterID. It **seems mostly redundant with the metadata in MB_20200317_1326_18S_analysis_metadata.csv**, so for now I'll work with plate_meta only.

## Conversion Plan

According to GBIF's new guide to publishing sequence-derived biodiversity data, my occurrence file should include the following:

<span style="color:blue">eventID</span> - Highly recommended, **SequenceID in plate and plate_meta** <br>
<span style="color:blue">eventDate</span> - Required, **SAMPLING_date_time in plate_meta** <br>
<span style="color:blue">decimalLatitude</span> - Highly recommended, **SAMPLING_dec_lat in plate_meta** <br>
<span style="color:blue">decimalLongitude</span> - Highly recommended, **SAMPLING_dec_lon in plate_meta**  <br>
<span style="color:blue">env_broad_scale</span> - Recommended, equivalent to env_biome in MIxS, the major environmental system your sample or specimen came from, use [subclasses of ENVO´s biome class](http://www.ontobee.org/ontology/ENVO?iri=http://purl.obolibrary.org/obo/ENVO_00000428), **env_biome in plate_meta** <br>
<span style="color:blue">env_local_scale</span> - Recommended, equivalent to env_feature in MIxS, the entity or entities which are in your sample or specimen´s local vicinity and which you believe have significant causal influences on your sample or specimen, use terms that are present in ENVO and which are of smaller spatial grain than your entry for env_broad_scale, **env_feature in plate_meta** <br>
<span style="color:blue">env_medium</span> - Recommended, equivalent to env_material in MIxS, environmental material that immediately surrounded your sample or specimen prior to sampling, use [subclasses of ENVO´s environmental material class](http://www.ontobee.org/ontology/ENVO?iri=http://purl.obolibrary.org/obo/ENVO_00010483), **env_material in plate_meta**<br>
<span style="color:blue">sop</span> - Recommended, standard operating procedures used in assembly and/or annotation of metagenomes or a reference to a well documented protocol (e.g. using protocols.io). **Does something like this exist? If not, what information might be important to include here?** <br>
<span style="color:blue">lib_layout</span> - Recommended, equivalent to lib_const_meth in MIxS, whether to expect single, paired, or other configuration of reads. **Is this relevant here?** <br>
<span style="color:blue">target_gene</span> - Highly recommended, targeted gene or marker name for marker-based studies (e.g. 16S rRNA), **locus? Does there need to be more info included here?** <br>
<span style="color:blue">target_subfragment</span> - Highly recommended, name of subfragment of a gene or marker (e.g. V6). **Is this relevant here? Should the tag_sequence fit in somewhere?** <br>
<span style="color:blue">pcr_primer_name_forward</span> - Highly recommended, name of forward primer. **Do these primers have names?** <br>
<span style="color:blue">pcr_primer_forward</span> - Highly recommended, sequence of the forward primer, **primer_sequence_F in plate_meta** <br>
<span style="color:blue">pcr_primer_name_reverse</span> - Highly recommended, name of reverse primer <br>
<span style="color:blue">pcr_primer_reverse</span> - Highly recommended, sequence of the reverse primer, **primer_sequence_R in plate_meta** <br>
<span style="color:blue">pcr_primer_reference</span> - Highly recommended, reference for primers (e.g. a DOI to a paper). **Is there a reference for primers?** <br>
<span style="color:blue">DNA_sequence</span> - Highly recommended, the actual DNA sequence of the ASV. TaxonID is highly recommended if DNA_sequence is not provided, **ASV in plate** <br>
<span style="color:blue">scientificName</span> - Required, Latin name of the closest known taxon or an OTU identifier from BOLD or UNITE, **a combination of Genus and Species from plate, or the lowest available taxon. This will have to be looked into more, especially for cases where no traditional taxonomic classification is available.** <br>
<span style="color:blue">kingdom</span> - Highly recommended, **Kingdom in plate** <br>
<span style="color:blue">phylum</span> - Recommended, **Phylum in plate** <br>
<span style="color:blue">class</span> - Recommended, **Class in plate** <br>
<span style="color:blue">order</span> - Recommended, **Order in plate** <br>
<span style="color:blue">family</span> - Recommended, **Family in plate** <br>
<span style="color:blue">genus</span> - Recommended, **Genus in plate** <br>
<span style="color:blue">basisOfRecord</span> - Required, MaterialSample <br>
<span style="color:blue">materialSampleID</span> - Highly recommended, an identifier for the MaterialSample; use the biosample ID if one was obtained from a nucleotide archive, otherwise construct a globally unique identifier. **Is this Sequence_ID again? FilterID? Something else?** <br>
<span style="color:blue">identificationRemarks</span> - Recommended, specification of the taxonomic identification process, ideally including the applied algorithm and reference database as well as the level of confidence in the resulting identification. **Is this information available somewhere?** <br>
<span style="color:blue">identificationReferences</span> - Recommended, link to protocol or code. **Is this information available somewhere?** <br>
<span style="color:blue">organismQuantity</span> - Highly recommended, number of reads, **Reads in plate** <br>
<span style="color:blue">organismQuantityType</span> - Highly recommended, DNA sequence reads <br>
<span style="color:blue">sampleSizeValue</span> - Highly recommended, total number of reads in the sample for calculating the relative abundance of sequence variants. **Is it accurate to just sum all the reads by Sequence_ID?** <br>
<span style="color:blue">sampleSizeUnit</span> - Highly recommended, DNA sequence reads <br>
<span style="color:blue">associatedSequences</span> - Recommended, list of identifiers linking to archived (raw) sequence reads. **Are these sequences already archived?** <br>
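
For the quantity fields in the plan above, one possible approach is sketched below. It assumes that summing Reads by Sequence_ID is an acceptable sampleSizeValue, which is exactly the open question, so treat it as an illustration rather than the answer. `plate_demo` and `occ_demo` are hypothetical toy tables.

```python
import pandas as pd

# Toy stand-in for `plate`: two samples with two ASVs each
plate_demo = pd.DataFrame({
    'Sequence_ID': ['a_S', 'a_S', 'b_S', 'b_S'],
    'Reads': [10, 30, 0, 5],
})

occ_demo = pd.DataFrame({
    'eventID': plate_demo['Sequence_ID'],
    'organismQuantity': plate_demo['Reads'],
    'organismQuantityType': 'DNA sequence reads',
    'sampleSizeUnit': 'DNA sequence reads',
})

# Total reads per sample, repeated on every row of that sample
# (assumption: the table holds all reads that should count toward the total)
occ_demo['sampleSizeValue'] = (
    plate_demo.groupby('Sequence_ID')['Reads'].transform('sum')
)
print(occ_demo)
```

`transform('sum')` keeps the original row order and length, so the result can be assigned straight back as a column.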

## Conversion

In [235]:
## eventID

occ = pd.DataFrame({'eventID':plate['Sequence_ID']})
print(occ.shape)
occ.head()

(280440, 1)


Unnamed: 0,eventID
0,05114c01_12_edna_1_S
1,05114c01_12_edna_2_S
2,05114c01_12_edna_3_S
3,11216c01_12_edna_1_S
4,11216c01_12_edna_2_S


In [236]:
## Merge with plate_meta to obtain eventDate, coordinates, environment terms, locus and primer sequences

occ = occ.merge(plate_meta[['SequenceID', 'SAMPLING_date_time', 'SAMPLING_dec_lat', 'SAMPLING_dec_lon',
                            'env_biome', 'env_feature', 'env_material', 'locus', 'primer_sequence_F',
                            'primer_sequence_R']], how='left', left_on='eventID', right_on='SequenceID')
occ.drop(columns='SequenceID', inplace=True)
occ.columns = ['eventID', 'eventDate', 'decimalLatitude', 'decimalLongitude', 'env_broad_scale', 'env_local_scale', 'env_medium',
               'target_gene', 'pcr_primer_forward', 'pcr_primer_reverse']
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse
0,05114c01_12_edna_1_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
1,05114c01_12_edna_2_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
2,05114c01_12_edna_3_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
3,11216c01_12_edna_1_S,2016-04-21 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
4,11216c01_12_edna_2_S,2016-04-21 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC


**Need to add sop? lib_layout? target_subfragment? primer names? primer references?**

In [237]:
## Format eventDate as ISO 8601 with a UTC offset (treats the recorded times as local Pacific time)

pst = pytz.timezone('America/Los_Angeles')
eventDate = [pst.localize(datetime.strptime(dt, '%Y-%m-%d %H:%M')).isoformat() for dt in occ['eventDate']]
occ['eventDate'] = eventDate

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
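
The same formatting can also be done with pandas alone. This is a sketch of an alternative, not a replacement for the cell above, and it carries the same assumption that SAMPLING_date_time is local Pacific time. By default `tz_localize` raises an error on times that are ambiguous or nonexistent around daylight saving changes, which would at least surface such rows.

```python
import pandas as pd

# Two timestamps from the preview above, one in winter and one in spring
raw = pd.Series(['2014-02-20 15:33', '2016-04-21 14:39'])

# Parse, attach the time zone, then render each value as an ISO 8601 string
localized = (pd.to_datetime(raw, format='%Y-%m-%d %H:%M')
               .dt.tz_localize('America/Los_Angeles'))
event_date = localized.map(lambda ts: ts.isoformat())
print(event_date.tolist())
```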


In [238]:
## Add DNA_sequence

occ['DNA_sequence'] = plate['ASV']
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,DNA_sequence
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
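
Assigning `plate['ASV']` directly relies on occ still having the same row order as plate after the merge. A sketch of a guard that could be run first, on toy tables (`plate_demo` and `meta_demo` are hypothetical); `validate='many_to_one'` makes the merge fail if a SequenceID appears more than once in the metadata, which would otherwise duplicate rows.

```python
import pandas as pd

# Toy stand-ins for `plate` and `plate_meta`
plate_demo = pd.DataFrame({'Sequence_ID': ['a_S', 'b_S', 'a_S'],
                           'ASV': ['ACGT', 'ACGT', 'GGCC']})
meta_demo = pd.DataFrame({'SequenceID': ['a_S', 'b_S'],
                          'SAMPLING_dec_lat': [36.7958, 36.7962]})

occ_demo = pd.DataFrame({'eventID': plate_demo['Sequence_ID']})
occ_demo = occ_demo.merge(meta_demo, how='left', left_on='eventID',
                          right_on='SequenceID', validate='many_to_one')

# Same length and same IDs in the same positions as the source table
rows_aligned = (len(occ_demo) == len(plate_demo) and
                (occ_demo['eventID'].values == plate_demo['Sequence_ID'].values).all())

# Only copy the sequences over once the alignment has been confirmed
if rows_aligned:
    occ_demo['DNA_sequence'] = plate_demo['ASV'].values
print(rows_aligned)
```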


Before adding taxonomic information, I'll have to clean up the Genus and Species columns in the plate df. **Weird values include:**
- 'unassigned' has been used when there's no Genus data; also when there's no Species data
- 'g_', which **I assume** also signifies no Genus; 's_' seems to be the equivalent in Species
- 'unknown' also has been used
- 'no_hit' in both Genus and Species
- 'Herdmania &lt;dinoflagellates>' - The Species entry for this seems to give both Genus and Species (Herdmania litoralis)
    - Side note: **How can a Phylum be unknown, but then Class, Order, Family, etc. be known?**
- 'Halofilum &lt;green algae>' - Similarly, Species is Halofilum ramosum
- 'Candida &lt;clade Candida/Lodderomyces clade>' - This has Species unassigned, Family = Debaryomycetaceae

Only in Species:
- 'uncultured marine eukaryote'
- 'eukaryote clone OLI11007'
- 'Dinophyceae sp. UDMS0803'
- 'uncultured marine picoeukaryote'
- 'Chaetoceros sp. UNC1415'
- 'bacterium'

**Note** there are loads of terms like those given above in the Species column. **What do they mean and where do they come from?** Maybe some of those numbers correspond to BOLD or UNITE OTU IDs. **Also note** that the Species column often contains the full binomial (Genus and Species, i.e. the proper scientific name) when one is available. I'm not sure this is always the case, but maybe I can use the Species column directly for scientificName rather than combining Genus and Species.
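
To get a feel for how many Species values are proper binomials and how many are these other terms, a rough filter could help. The pattern below is a heuristic of my own (capitalised word, space, lowercase word), so it will miss some valid names and is only meant for a first look.

```python
import pandas as pd

# Example values taken from the lists above
species = pd.Series(['Chaetoceros diadema', 'uncultured marine eukaryote',
                     'Dinophyceae sp. UDMS0803', 'bacterium', 's_', 'unassigned'])

# Heuristic: exactly "Genus epithet", nothing before or after
looks_binomial = species.str.match(r'^[A-Z][a-z]+ [a-z-]+$')

# Everything else is a candidate for manual review
needs_review = species[~looks_binomial]
print(needs_review.tolist())
```

On the real column, `needs_review.value_counts()` would give the full list of odd terms with their frequencies.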

In [239]:
## scientificName, taxonomic info

occ['scientificName'] = plate['Species']
occ['kingdom'] = plate['Kingdom']
occ['phylum'] = plate['Phylum']
occ['class'] = plate['Class']
occ['order'] = plate['Order']
occ['family'] = plate['Family']
occ['genus'] = plate['Genus']

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


**So we don't have to do the WoRMS thing for eDNA data?**

**Note** that:
- 63600 (out of a total 280440 records) have scientificName = 'unassigned'
- 163380 have scientificName = 's_'
- 0 have scientificName = 'unknown'
- 12540 have scientificName = 'no_hit'

For records where the species is 'unassigned' or 's_', other taxonomic levels (e.g. Family) seem to be known, although I haven't checked this in all cases. When the species is 'no_hit', though, every taxonomic level seems to be 'no_hit'.
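
Counts like the ones above, and the "no_hit at every level" impression, can be checked with a few lines. This sketch uses toy data (`names` and `tax_demo` are hypothetical); the placeholder list is the one from my notes above.

```python
import pandas as pd

# Toy stand-in for occ['scientificName']
names = pd.Series(['unassigned', 's_', 's_', 'no_hit', 'Chaetoceros diadema'])
placeholders = ['unassigned', 's_', 'g_', 'unknown', 'no_hit']

# Count each placeholder, reporting 0 for ones that never occur
counts = names.value_counts().reindex(placeholders, fill_value=0)
print(counts)

# Toy stand-in for the taxonomy columns of `plate`
tax_demo = pd.DataFrame({'Genus': ['no_hit', 'Paracalanus'],
                         'Species': ['no_hit', 'unassigned']})

# Does 'no_hit' in Species always mean 'no_hit' in every column?
species_no_hit = tax_demo['Species'] == 'no_hit'
all_no_hit = (tax_demo == 'no_hit').all(axis=1)
no_hit_is_consistent = (species_no_hit == all_no_hit).all()
print(no_hit_is_consistent)
```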

**Let me replace these values with NaNs to make things easier to work with.**

In [240]:
## Clean scientificName and taxonomy columns

cols = ['scientificName', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus']
occ[cols] = occ[cols].replace({'unassigned':np.nan,
                              's_':np.nan,
                              'g_':np.nan,
                              'unknown':np.nan,
                              'no_hit':np.nan})

Ok, so now we can see that:
- 9300 records have full taxonomic info (all columns are not NaN)
- A total of 239520 do not have a species designation (i.e. scientificName = NaN). This matches the counts above: 63600 + 163380 + 12540 = 239520.
- Of these, 9840 are only missing species
- 120 are only missing genus
- 300 are only missing family
- 1020 are only missing order
- 780 are only missing class
- 10860 are only missing phylum
- 0 are only missing kingdom
- This leaves 216600 records with more than one field missing

**How do I handle these? Based on the standard, it seems like I should fill NaNs in the scientificName column with the lowest known taxonomic rank. If every taxonomic column in a row is NaN, the record cannot be submitted to OBIS. *But* before deleting these, I need to look into the non-Linnaean name options.**

**Note** that 33360 rows have ALL NaNs in ALL taxonomic columns
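
The breakdown of missing values above can be reproduced from a boolean mask. A sketch on a toy table (`occ_demo` is hypothetical and the single-letter values are placeholders, not real taxa):

```python
import numpy as np
import pandas as pd

cols = ['scientificName', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus']
occ_demo = pd.DataFrame(
    [['s', 'k', 'p', 'c', 'o', 'f', 'g'],          # complete
     [np.nan, 'k', 'p', 'c', 'o', 'f', 'g'],       # only species missing
     ['s', 'k', np.nan, 'c', 'o', 'f', 'g'],       # only phylum missing
     [np.nan, 'k', 'p', 'c', 'o', 'f', np.nan],    # two fields missing
     [np.nan] * 7],                                # everything missing
    columns=cols)

missing = occ_demo[cols].isna()
n_missing = missing.sum(axis=1)  # number of missing fields per row

summary = {
    'complete': int((n_missing == 0).sum()),
    'more_than_one_missing': int((n_missing > 1).sum()),
    'all_missing': int((n_missing == len(cols)).sum()),
}
# For each column, rows where that column is the only missing one
only_missing = {col: int((missing[col] & (n_missing == 1)).sum()) for col in cols}
print(summary)
print(only_missing)
```

Here the categories partition the rows: complete, exactly one field missing, and more than one field missing always add up to the total row count, which is a useful sanity check on the real numbers.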

In [243]:
## Fill missing values in scientificName with the lowest available taxonomic rank

occ['scientificName'] = occ['scientificName'].combine_first(occ['genus'])
occ['scientificName'] = occ['scientificName'].combine_first(occ['family'])
occ['scientificName'] = occ['scientificName'].combine_first(occ['order'])
occ['scientificName'] = occ['scientificName'].combine_first(occ['class'])
occ['scientificName'] = occ['scientificName'].combine_first(occ['phylum'])
occ['scientificName'] = occ['scientificName'].combine_first(occ['kingdom'])
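
A more compact way to express the same fill, shown here on a toy table as a sketch: order the columns from lowest to highest rank and back-fill across each row, so the first column ends up holding the lowest rank that is not NaN. `occ_demo` is hypothetical; the loop afterwards repeats the chained `combine_first` calls for comparison.

```python
import numpy as np
import pandas as pd

# Columns ordered from lowest rank to highest
ranks = ['scientificName', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom']
occ_demo = pd.DataFrame(
    [[np.nan, 'Paracalanus', 'Paracalanidae', 'Calanoida', 'Hexanauplia', 'Arthropoda', 'Eukaryota'],
     ['Chaetoceros diadema', 'Chaetoceros', np.nan, np.nan, np.nan, np.nan, 'Eukaryota'],
     [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Eukaryota'],
     [np.nan] * 7],
    columns=ranks)

# Back-fill along each row: a NaN takes the next non-NaN value to its right
filled = occ_demo[ranks].bfill(axis=1)['scientificName']

# The chained version, for comparison
chained = occ_demo['scientificName']
for rank in ranks[1:]:
    chained = chained.combine_first(occ_demo[rank])

print(filled.tolist())
```

Rows with no taxonomic information at all stay NaN in both versions.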

In [245]:
occ['scientificName'].unique()

array(['Paracalanus', 'Chlorophyta', 'Eukaryota', 'Florenciellales',
       'Amoebophryaceae', 'Cercozoa', 'Chaetoceros diadema',
       'uncultured marine eukaryote', 'Protoperidinium',
       'Protoperidinium divergens', 'Dinophyceae',
       'Cyclotella meneghiniana', 'Prymnesium', 'Ensiculifera imariensis',
       'eukaryote clone OLI11007', 'Amoebophrya', nan, 'Gonyaulacaceae',
       'Chrysophyceae', 'Scrippsiella', 'Prymnesiaceae',
       'Dinophyceae sp. UDMS0803', 'Pelargonium', 'Hemistasiidae',
       'Thysanoessa', 'Thoracosphaeraceae', 'Diplopsaliaceae',
       'Hemiselmis', 'Euphausiidae', 'Gymnodiniaceae', 'Ciliophora',
       'Karlodinium', 'Basidiomycota', 'Dictyocha fibula', 'Characeae',
       'Picomonas', 'Polykrikos geminatum', 'Spirotrichea',
       'uncultured marine picoeukaryote', 'Strombidiidae',
       'Strombidium caudispina', 'Enteropneusta', 'Peridiniales',
       'Kareniaceae', 'Halodinium verrucatum', 'Phaeocystaceae',
       'Gyrodinium', 'Gonyaulax spin

**NEXT STEPS**
- Are any of the weird names from BOLD or UNITE already?
- If not, is there some way to search for these IDs using DNA sequences?
- Anything that can't be matched with a taxonomic rank or ID must ultimately be dropped.

**What does it mean when the number of reads is 0? Is that an absence record?**
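
Until that question is answered, it may help to at least count the zero-read rows and keep them separate. The 280440 rows in plate over 60 samples is exactly 4674 rows per sample, which is consistent with (but does not prove) every ASV being listed for every sample, zeros included. The sketch below keeps only rows with reads; treating zeros as "not detected" rather than as true absences is my assumption, and `occ_demo` is a hypothetical toy table.

```python
import pandas as pd

# Toy stand-in for occ with a read count column
occ_demo = pd.DataFrame({
    'eventID': ['a_S', 'a_S', 'b_S'],
    'organismQuantity': [14825, 0, 3],
})

# How many rows report zero reads?
n_zero = int((occ_demo['organismQuantity'] == 0).sum())

# Keep only rows where the ASV was actually detected in the sample
detections = occ_demo[occ_demo['organismQuantity'] > 0].reset_index(drop=True)
print(n_zero, len(detections))
```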