# Darwin Core Conversion of eDNA Sequence Data Using the DNA Derived Data Extension

**Author:** Diana LaScala-Gruenewald

**Last Updated:** 07-Jul-2021

**Requirements:**
- Python 3
- Python 3 packages:
    - datetime
    - random
    - os
- External packages:
    - biopython
    - numpy
    - pandas
    - pytz
- Custom modules:
    - WoRMS

**Resources:**
- https://docs.gbif-uat.org/publishing-dna-derived-data/1.0/en/
- https://tools.gbif.org/dwca-validator/extension.do?id=http://rs.gbif.org/terms/1.0/DNADerivedData

In [4]:
## Imports

from datetime import datetime
import os
import random

import numpy as np
import pandas as pd
import pytz # for handling time zones

import WoRMS # custom functions for querying WoRMS API

## Load data

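Note that in a Jupyter Notebook, the current working directory defaults to the folder containing the .ipynb file. The paths below assume the notebook sits in a `src` folder with the data in a sibling `raw` folder.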

In [16]:
## Plate data

filename = os.getcwd().replace('src', os.path.join('raw', 'asv_table.csv'))  
plate = pd.read_csv(filename)
print(plate.shape)
plate.head()

(280440, 11)


Unnamed: 0,ASV,FilterID,Sequence_ID,Reads,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_1,05114c01_12_edna_1_S,14825,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_2,05114c01_12_edna_2_S,16094,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
2,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_3,05114c01_12_edna_3_S,22459,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
3,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,11216c01_12_edna_1,11216c01_12_edna_1_S,19312,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
4,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,11216c01_12_edna_2,11216c01_12_edna_2_S,16491,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned


Plate data contains:

| Column name| Column definition                                                                                                                                                                           |
|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ASV        | The sequence of the Amplicon Sequence Variant observed                                                                                                                                       |
| FilterID   | A unique identifier for the filter the sample was obtained from, composed of: <br>- cruise number <br>- CTD cast number<br>- CTD bottle number <br>- filter indicator <br>- replicate number |
| Sequence_ID| The FilterID plus a letter indicating which plate the sample was on when sequenced                                                                                                           |
| Reads      | The number of reads for the ASV                                                                                                                                                             |
| Kingdom    | The Kingdom of the taxonomic identity assigned to the ASV, if known                                                                                                                          |
| Phylum     | The Phylum of the taxonomic identity assigned to the ASV, if known                                                                                                                           |
| Class      | The Class of the taxonomic identity assigned to the ASV, if known                                                                                                                            |
| Order      | The Order of the taxonomic identity assigned to the ASV, if known                                                                                                                            |
| Family     | The Family of the taxonomic identity assigned to the ASV, if known                                                                                                                           |
| Genus      | The Genus of the taxonomic identity assigned to the ASV, if known                                                                                                                            |
| Species    | The Species of the taxonomic identity assigned to the ASV, if known                                                                                                                          |

Additionally, taxonomic columns may include the following designations:
- **unknown** = GenBank couldn't give a scientifically-agreed-upon name for a given taxonomic rank. I.e., either the name doesn't exist, or there isn't enough scientific consensus to give a name.
- **no_hit** = BLAST did not find any hits for the ASV.
- **unassigned** = The ASV got BLAST hits, but the post-processing program Megan6 didn't assign the ASV to any taxonomic group.
- **g_** or **s_** = Megan6 assigned the ASV to a genus or species, but not with high enough confidence to include it. 
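
As a small illustration of the designations above, the following sketch tallies placeholder labels in a handful of made-up values. The helper name `is_placeholder` and the sample values are assumptions for illustration only. It treats each label as an exact cell value, which is also how the notebook replaces them later on.

```python
from collections import Counter

# Placeholder labels described in the list above
PLACEHOLDERS = {"unknown", "no_hit", "unassigned", "g_", "s_"}

def is_placeholder(value):
    """Return True if a taxonomy cell holds a placeholder rather than a name."""
    return value in PLACEHOLDERS

# Made-up example values for a single taxonomic column
genus_values = ["Paracalanus", "unassigned", "g_", "no_hit", "Paracalanus", "unknown"]

placeholder_counts = Counter(v for v in genus_values if is_placeholder(v))
named = [v for v in genus_values if not is_placeholder(v)]

print(placeholder_counts)
print(named)
```

Counting placeholders per column before replacing them gives a quick sense of how much of each rank is actually resolved.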

In [17]:
## Plate metadata

filename = os.getcwd().replace('src', os.path.join('raw', 'metadata_table.csv'))  
meta = pd.read_csv(filename)
print(meta.shape)
meta.head()

(60, 67)


Unnamed: 0,sample_name,library,tag_sequence,primer_sequence_forward,primer_sequence_reverse,R1,R2,PlateID,sample_type,target_gene,...,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,seq_meth,sequencing_facility,seqID,identificationRemarks,identificationReferences,FilterID,associatedSequences
0,14213c01_12_eDNA_1,S1,ACGAGACTGATT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_1_S1_L001_R1_001.fastq.gz,14213c01_12_edna_1_S1_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,14213c01_12_edna_1_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna,NCBI BioProject accession number PRJNA433203
1,14213c01_12_eDNA_2,S2,GAATACCAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_2_S2_L001_R1_001.fastq.gz,14213c01_12_edna_2_S2_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,14213c01_12_edna_2_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna,NCBI BioProject accession number PRJNA433203
2,14213c01_12_eDNA_3,S3,CGAGGGAAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_3_S3_L001_R1_001.fastq.gz,14213c01_12_edna_3_S3_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,14213c01_12_edna_3_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna,NCBI BioProject accession number PRJNA433203
3,22013c01_12_eDNA_1,S4,GAACACTTTGGA,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,22013c01_12_edna_1_S4_L001_R1_001.fastq.gz,22013c01_12_edna_1_S4_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,22013c01_12_edna_1_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22013c01_12_edna,NCBI BioProject accession number PRJNA433203
4,22013c01_12_eDNA_2,S5,ACTCACAGGAAT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,22013c01_12_edna_2_S5_L001_R1_001.fastq.gz,22013c01_12_edna_2_S5_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,22013c01_12_edna_2_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22013c01_12_edna,NCBI BioProject accession number PRJNA433203


In [18]:
meta.columns

Index(['sample_name', 'library', 'tag_sequence', 'primer_sequence_forward',
       'primer_sequence_reverse', 'R1', 'R2', 'PlateID', 'sample_type',
       'target_gene', 'replicate', 'eventDate', 'SAMPLING_cruise',
       'SAMPLING_station_number', 'SAMPLING_bottle', 'SAMPLING_station',
       'depth', 'decimalLatitude', 'decimalLongitude', 'SAMPLING_PI',
       'SAMPLING_institute', 'SAMPLING_campaign', 'project_name',
       'SAMPLING_platform', 'SAMPLING_platform_type', 'samp_collect_device',
       'geo_loc_name', 'fluor', 'SAMPLING_rdepth', 'density', 'pressure',
       'nitrate', 'chlorophyll', 'diss_oxygen', 'salinity', 'temp',
       'samp_filter_ext_type', 'samp_filter_size_ext', 'env_broad_scale',
       'env_local_scale', 'env_medium', 'ESP', 'SC', 'minimumDepthInMeters',
       'maximumDepthInMeters', 'start_GMT', 'end_GMT', 'deployment_ID',
       'date_PCR', 'DNA_concentration', 'PCR_settings', 'samp_vol_we_dna_ext',
       'extraction_date', 'nucl_acid_ext', 'nucl_acid_a

Metadata contains:

| Column name             | Column definition                                                                       |
|-------------------------|-----------------------------------------------------------------------------------------|
| primer_sequence_forward | The sequence of the forward primer used during PCR                                      |
| primer_sequence_reverse | The sequence of the reverse primer used during PCR                                      |
| target_gene             | The gene being targeted for amplification during PCR                                    |
| eventDate               | The date (and time, if available) the water sample was collected                        |
| decimalLatitude         | The latitude in decimal degrees where the water sample was collected (WGS84)            |
| decimalLongitude        | The longitude in decimal degrees where the water sample was collected (WGS84)           |
| env_broad_scale         | The most broad descriptor of the environment from which the water sample was collected  |
| env_local_scale         | A more specific descriptor of the environment from which the water sample was collected |
| env_medium              | A descriptor of the medium from which the DNA was collected                             |
| sop                     | Links or references to standard operating protocols used to obtain the data             |
| pcr_primer_name_forward | Name of the forward primer used during PCR                                              |
| pcr_primer_name_reverse | Name of the reverse primer used during PCR                                              |
| pcr_primer_reference    | Reference for PCR primers                                                               |

Note that only columns used in the following processing script are described.

## Create occurrence file

For this data set, an `event` is a filtered water sample that was sequenced and an `occurrence` is an ASV observed within a water sample. Since there are no event-level measurements (i.e., measurements that are associated with the water sample but not the ASV), a separate event file is not required. We will assemble an occurrence file complying with the Occurrence Core and a DNA derived data (ddd) file complying with the DNA derived data extension.

In [25]:
## eventID - the Sequence_ID column in the plate dataframe uniquely identifies a water sample

occ = pd.DataFrame({'eventID':plate['Sequence_ID']})
print(occ.shape)
occ.head()

(280440, 1)


Unnamed: 0,eventID
0,05114c01_12_edna_1_S
1,05114c01_12_edna_2_S
2,05114c01_12_edna_3_S
3,11216c01_12_edna_1_S
4,11216c01_12_edna_2_S


In [26]:
## Merge with meta to obtain eventDate, decimalLatitude, and decimalLongitude

metadata_cols = [
    'seqID',
    'eventDate', 
    'decimalLatitude', 
    'decimalLongitude',
]

dwc_cols = metadata_cols.copy()
dwc_cols[0] = 'eventID'

occ = occ.merge(meta[metadata_cols], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)
occ.columns = dwc_cols
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude
0,05114c01_12_edna_1_S,2/20/14 15:33,36.7958,-121.848
1,05114c01_12_edna_2_S,2/20/14 15:33,36.7958,-121.848
2,05114c01_12_edna_3_S,2/20/14 15:33,36.7958,-121.848
3,11216c01_12_edna_1_S,4/21/16 14:39,36.7962,-121.846
4,11216c01_12_edna_2_S,4/21/16 14:39,36.7962,-121.846
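
Because the merge above is a left join, any `eventID` without a matching `seqID` would silently receive NaN for the date and coordinates. Here is a minimal sketch of one way to check for that, using toy data frames (the `_demo` names and values are made up) and the `indicator` and `validate` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for occ and meta; the last eventID has no metadata row
occ_demo = pd.DataFrame({"eventID": ["a_S", "a_S", "b_S", "c_S"]})
meta_demo = pd.DataFrame({
    "seqID": ["a_S", "b_S"],
    "eventDate": ["2/20/14 15:33", "4/21/16 14:39"],
})

merged = occ_demo.merge(
    meta_demo,
    how="left",
    left_on="eventID",
    right_on="seqID",
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
    validate="many_to_one",  # raises if seqID is duplicated in the metadata
)

# eventIDs that found no metadata row
unmatched = merged.loc[merged["_merge"] == "left_only", "eventID"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every occurrence picked up its sample metadata.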


In [27]:
## Format eventDate

pst = pytz.timezone('America/Los_Angeles')
eventDate = [pst.localize(datetime.strptime(dt, '%m/%d/%y %H:%M')).isoformat() for dt in occ['eventDate']]
occ['eventDate'] = eventDate

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846
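
Note that the offsets in the output differ (-08:00 in February, -07:00 in April) because the time zone observes daylight saving time. As a quick sanity check, the sketch below confirms that a value parses as ISO 8601 and carries a UTC offset. The helper name `has_utc_offset` is hypothetical, and the sample strings are copied from the outputs above.

```python
from datetime import datetime

def has_utc_offset(value):
    """Return True if value parses as ISO 8601 and carries a UTC offset."""
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return False
    return parsed.utcoffset() is not None

# Two formatted values and one value still in the original format
samples = ["2014-02-20T15:33:00-08:00", "2016-04-21T14:39:00-07:00", "2/20/14 15:33"]
checks = [has_utc_offset(s) for s in samples]
print(checks)
```

On the real column, something like `occ['eventDate'].map(has_utc_offset).all()` could be used to check every row at once.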


In [35]:
## Add an occurrenceID that will uniquely identify an ASV within a water sample

occ['occurrenceID'] = plate.groupby('Sequence_ID')['ASV'].cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_occ' + occ['occurrenceID'].astype(str)
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,occurrenceID
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,05114c01_12_edna_1_S_occ1
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,05114c01_12_edna_2_S_occ1
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,05114c01_12_edna_3_S_occ1
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,11216c01_12_edna_1_S_occ1
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,11216c01_12_edna_2_S_occ1
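
To make the numbering explicit, here is a plain-Python sketch of what `groupby(...).cumcount() + 1` does: it keeps a running, 1-based count per sample in row order. The function name and the sample IDs are made up for illustration.

```python
from collections import defaultdict

def number_within_groups(sample_ids):
    """Build occurrence IDs from a running 1-based count per sample ID."""
    seen = defaultdict(int)
    ids = []
    for sample_id in sample_ids:
        seen[sample_id] += 1
        ids.append(f"{sample_id}_occ{seen[sample_id]}")
    return ids

# Made-up sample IDs in row order: two ASVs, each observed in two samples
rows = ["s1_S", "s2_S", "s1_S", "s2_S"]
occurrence_ids = number_within_groups(rows)
print(occurrence_ids)
```

Since the count depends on row order, re-sorting the plate data before this step would assign different IDs to the same ASVs.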


In [36]:
## scientificName, taxonomic info

occ['scientificName'] = plate['Species']
occ['kingdom'] = plate['Kingdom']
occ['phylum'] = plate['Phylum']
occ['class'] = plate['Class']
occ['order'] = plate['Order']
occ['family'] = plate['Family']
occ['genus'] = plate['Genus']

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,occurrenceID,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,05114c01_12_edna_1_S_occ1,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,05114c01_12_edna_2_S_occ1,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,05114c01_12_edna_3_S_occ1,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,11216c01_12_edna_1_S_occ1,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,11216c01_12_edna_2_S_occ1,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


For the purpose of submitting data to OBIS, all the variations on missing data (e.g. "unknown," "no_hit," etc.) do not add information. We can replace these with NaN, which is easy to work with in pandas.

In [37]:
## Replace 'unknown', 'unassigned', etc. in scientificName and taxonomy columns with NaN

cols = ['scientificName', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus']
occ[cols] = occ[cols].replace({'unassigned':np.nan,
                              's_':np.nan,
                              'g_':np.nan,
                              'unknown':np.nan,
                              'no_hit':np.nan})
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,occurrenceID,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,05114c01_12_edna_1_S_occ1,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,05114c01_12_edna_2_S_occ1,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,05114c01_12_edna_3_S_occ1,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,11216c01_12_edna_1_S_occ1,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,11216c01_12_edna_2_S_occ1,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


In [38]:
## Get unique species names

names = occ['scientificName'].unique()
names = names[~pd.isnull(names)]  # remove NaN
print(len(names))

323


These names **do** have to be WoRMS-approved in order to show up as valid occurrences in OBIS. But there are a number of entries in the `scientificName` column, like "uncultured marine eukaryote," "eukaryote clone OLI11007," and "Acantharian sp. 6201," that are **not proper Latin species names**. Since these essentially indicate that a more precise name is unknown, it seemed reasonable to replace these with NaN as well. **I used a simple rule to filter them out (any name with more than two words is treated as non-Latin), but it's important to check and see if any true species names are being removed.**

To check names:
```python
names = occ['scientificName'].unique()
names = names[~pd.isnull(names)]  # remove NaN
for name in names:
    words_in_name = name.split(' ')
    if len(words_in_name) > 2:
        print(name)
```
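
The word-count rule can miss two-word entries that are not species names, and it would also flag legitimate three-word names such as subspecies. As a complementary check, here is a sketch of a pattern-based test. The regular expression and the name `looks_binomial` are my own assumptions, not part of the original workflow, and the pattern is deliberately rough.

```python
import re

# Capitalized genus followed by one or two lowercase epithets
BINOMIAL = re.compile(r"^[A-Z][a-z]+( [a-z-]+){1,2}$")

def looks_binomial(name):
    """Return True if name has the shape of a Latin species or subspecies name."""
    return bool(BINOMIAL.match(name))

# Example entries taken from the scientificName column
candidates = [
    "Hemistasia phaeocysticola",
    "uncultured marine eukaryote",
    "eukaryote clone OLI11007",
    "Acantharian sp. 6201",
    "phototrophic eukaryote",
]
flagged = [n for n in candidates if not looks_binomial(n)]
print(flagged)
```

Anything flagged by one rule but not the other is worth inspecting by hand before it is replaced with NaN.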

In [22]:
## Clean sop

occ['sop'] = occ['sop'].str.replace('|', ' | ', regex=False)
occ['sop'].iloc[0]

'dx.doi.org/10.17504/protocols.io.xjufknw | dx.doi.org/10.17504/protocols.io.n2vdge6 | https://github.com/MBARI-BOG/BOG-Banzai-Dada2-Pipeline'

In [23]:
## Change column names as needed

occ = occ.rename(columns = {'primer_sequence_forward':'pcr_primer_forward',
                           'primer_sequence_reverse':'pcr_primer_reverse'})
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop
0,05114c01_12_edna_1_S,2/20/14 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....
1,05114c01_12_edna_2_S,2/20/14 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....
2,05114c01_12_edna_3_S,2/20/14 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....
3,11216c01_12_edna_1_S,4/21/16 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....
4,11216c01_12_edna_2_S,4/21/16 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....


In [24]:
## Add DNA_sequence

occ['DNA_sequence'] = plate['ASV']
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop,DNA_sequence
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...


In [28]:
## Replace non-Latin species names with NaN

# Get non-Latin names
non_latin_names = []
for name in names:
    words_in_name = name.split(' ')
    if len(words_in_name) > 2:
        non_latin_names.append(name)
non_latin_names_dict = {i:np.nan for i in non_latin_names}

# Add any names that didn't get caught in the simple filter
non_latin_names_dict['phototrophic eukaryote'] = np.nan
non_latin_names_dict['Candida <clade Candida/Lodderomyces clade>'] = np.nan

# Replace
occ['scientificName'] = occ['scientificName'].replace(non_latin_names_dict)

In addition, 52380 records **only give "Eukaryota" as the scientific name** (i.e. Eukaryota is in the kingdom field, and there is no more taxonomic information). These should be replaced with Biota (http://marinespecies.org/aphia.php?p=taxdetails&id=1).

In [29]:
## Replace entries where kingdom = 'Eukaryota' with the WoRMS-approved 'Biota'

occ.loc[occ['kingdom'] == 'Eukaryota', 'kingdom'] = 'Biota'

Finally, **GenBank and WoRMS seem to disagree on their taxonomy**. Two examples:

1. 120 records have "Pelargonium" as the genus. [On GenBank](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=4030&lvl=3&lin=f&keep=1&srchmode=1&unlock) the full taxonomy is given as:
    - superkingdom Eukaryota
    - kingdom Viridiplantae (subkingdom on WoRMS)
    - phylum Streptophyta (infrakingdom on WoRMS, phylum listed as Tracheophyta)
    - subphylum Streptophytina (Spermatophytina on WoRMS)
    - class Magnoliopsida (matches on WoRMS)
        - this was indicated as 'unknown' in original data, i.e. GenBank couldn't give a scientifically-agreed-upon name for a given taxonomic level. **This doesn't make sense to me.**
    - order Geraniales (matches on WoRMS)
    - family Geraniaceae ([matches on WoRMS](http://marinespecies.org/aphia.php?p=taxdetails&id=382461))
    - genus Pelargonium (does NOT match on WoRMS)


2. 600 records have family Hemistasiidae (and no further taxonomic information). [On GenBank](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=2015512&lvl=3&lin=f&keep=1&srchmode=1&unlock), the full taxonomy is given as:
    - superkingdom Eukaryota
    - phylum Euglenozoa (matches on WoRMS)
    - order Diplonemea ([matches on WoRMS](http://marinespecies.org/aphia.php?p=taxdetails&id=582176))
    - family Hemistasiidae (does NOT match on WoRMS)

Interestingly, there are 60 records where species is Hemistasia phaeocysticola. These records also have family Hemistasiidae. The genus Hemistasia [matches on WoRMS](http://marinespecies.org/aphia.php?p=taxdetails&id=146165), but with a totally different taxonomy than [on GenBank](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=1503927&lvl=3&srchmode=1&keep=1&unlock):
- kingdom Protozoa
- phylum Euglenozoa
- class Kinetoplastea
- order not listed
- family not listed

**There are ongoing discussions about this issue. IOOS is working on a crosswalk between WoRMS IDs and NCBI taxonomy IDs, but it will only work for named species. See [this](https://github.com/iobis/Project-team-Genetic-Data/issues/5) GitHub issue. Either way, this is a problem that is bigger than I can solve without going case-by-case.**

**In both examples above**, the lowest taxonomic rank name given by the dataset does not match on WoRMS. I therefore need a system for searching through the higher taxonomic ranks given, finding the lowest one that will match on WoRMS, and putting that name in the scientificName column. The following few code blocks do this.
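
For a single record, the idea can be sketched in plain Python as follows. This is only an illustration of the fallback logic, not the column-wise implementation used in this notebook. The `matched_names` set is a made-up stand-in for the names WoRMS recognizes, and the record is based on the Hemistasiidae example above.

```python
# Ranks ordered from most to least specific
RANKS = ["scientificName", "genus", "family", "order", "class", "phylum", "kingdom"]

def lowest_matching_name(record, matched_names):
    """Return the most specific name in record that is in matched_names, else None."""
    for rank in RANKS:
        name = record.get(rank)
        if name and name in matched_names:
            return name
    return None

# Toy record: family Hemistasiidae with no genus or species assigned
record = {
    "kingdom": "Eukaryota",
    "phylum": "Euglenozoa",
    "order": "Diplonemea",
    "family": "Hemistasiidae",
}
matched_names = {"Euglenozoa", "Diplonemea"}  # hypothetical WoRMS matches

best_name = lowest_matching_name(record, matched_names)
print(best_name)
```

Here the family does not match, so the order Diplonemea is the name that would end up in `scientificName`.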

In [30]:
## Functions for finding the lowest available taxonomic rank that will match on WoRMS

def fill_lowest_taxon(df, cols):
    """ Takes the occurrence pandas data frame and fills missing values in scientificName with values from the first non-missing taxonomic rank column, listed in cols. """
    
    # cols lists scientificName first, then ranks from kingdom down to genus.
    # Walk the rank columns from lowest to highest without mutating cols.
    for col in reversed(cols[1:]):
        df['scientificName'] = df['scientificName'].combine_first(df[col])
    
    return df

def find_not_matched(df, name_dict):
    """ Takes the occurrence pandas data frame and name_dict matching scientificName values with names on WoRMS and returns a list of names that did not match on WoRMS. """
    
    # Drop NaN up front rather than removing it from the list afterwards
    unique_names = df['scientificName'].dropna().unique()
    not_matched = [name for name in unique_names if name not in name_dict]
            
    return not_matched

def replace_not_matched(df, not_matched, cols):
    """ Takes the occurrence pandas data frame and a list of scientificName values that did not match on WoRMS and replaces those values with NaN in the columns specified by cols. """
    
    df[cols] = df[cols].replace(not_matched, np.nan)
    
    return df

In [32]:
## Iterate to match lowest possible taxonomic rank on WoRMS (takes ~8 minutes when starting with ~750 names)

# Note that cols (list of taxonomic column names) was defined when cleaning them above

# Initialize dictionaries
name_name_dict = {}
name_id_dict = {}
name_taxid_dict = {}
name_class_dict = {}

# Initialize not_matched
not_matched = [1]

# Iterate
while len(not_matched) > 0:
    
    # Step 1 - fill
    occ = fill_lowest_taxon(occ, cols)

    # Step 2 - get names to match
    to_match = find_not_matched(occ, name_name_dict)

    # Step 3 - match
    print('Matching {num} names on WoRMS.'.format(num = len(to_match)))
    name_id, name_name, name_taxid, name_class = WoRMS.run_get_worms_from_scientific_name(to_match, verbose_flag=False)
    name_id_dict = {**name_id_dict, **name_id}
    name_name_dict = {**name_name_dict, **name_name}
    name_taxid_dict = {**name_taxid_dict, **name_taxid}
    name_class_dict = {**name_class_dict, **name_class}
    print('Length of name_name_dict: {length}'.format(length = len(name_name_dict)))
    
    ## ---- Add in a progress bar? Also, a better/additional stopping criterion? What if not all names can be matched?
    ## ---- Also, update this code to use pyworms instead of my functions? (Could consider cloning pyworms and adding functionality)

    # Step 4 - get names that didn't match
    not_matched = find_not_matched(occ, name_name_dict)
    print('Number of names not matched: {num}'.format(num = len(not_matched)))

    # Step 5 - replace these values with NaN
    occ = replace_not_matched(occ, not_matched, cols)

Matching 756 names on WoRMS.
Length of name_name_dict: 695
Number of names not matched: 61
Matching 35 names on WoRMS.
Length of name_name_dict: 715
Number of names not matched: 15
Matching 11 names on WoRMS.
Length of name_name_dict: 720
Number of names not matched: 6
Matching 3 names on WoRMS.
Length of name_name_dict: 721
Number of names not matched: 2
Matching 1 names on WoRMS.
Length of name_name_dict: 721
Number of names not matched: 1
Matching 0 names on WoRMS.
Length of name_name_dict: 721
Number of names not matched: 0
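
On the question of a stopping criterion raised in the comments above, one option is to cap the number of rounds in addition to stopping when nothing is left to match. The sketch below shows that pattern on toy data. The names `iterate_matching`, `fallback`, and `lookup` are hypothetical, and `lookup` is a stand-in for the WoRMS query rather than the real function.

```python
def iterate_matching(current_names, fallback, lookup, max_rounds=10):
    """Repeatedly match names, moving unmatched ones to their fallback.

    current_names: set of names to match in the first round
    fallback: dict mapping a name to the next higher-rank name (or None)
    lookup: callable returning the subset of names that are recognized
    """
    matched = set()
    to_match = set(current_names)
    rounds = 0
    while to_match and rounds < max_rounds:
        rounds += 1
        found = lookup(to_match)
        matched |= found
        unmatched = to_match - found
        # Move each unmatched name up one rank; drop names with no fallback
        to_match = {fallback.get(n) for n in unmatched} - {None} - matched
    return matched, rounds

# Toy data based on the two examples discussed earlier
known = {"Paracalanus", "Diplonemea", "Geraniaceae"}
fallback = {"Hemistasiidae": "Diplonemea", "Pelargonium": "Geraniaceae"}

matched, rounds = iterate_matching(
    {"Paracalanus", "Hemistasiidae", "Pelargonium"},
    fallback,
    lookup=lambda names: names & known,
)
print(sorted(matched), rounds)
```

With a cap in place, a failing lookup (for example, a network outage) cannot keep the loop running indefinitely.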


**Note** that there are 33360 records (~12% of the data) where no taxonomic information was obtained at all (i.e. scientificName is still NaN). I've used 'Biota' for these records.

```python
occ[occ['scientificName'].isna() == True].shape
```

In [33]:
## Change scientificName to Biota in cases where all taxonomic information is missing

print(occ[occ['scientificName'].isna()].shape)
occ.loc[occ['scientificName'].isna(), 'scientificName'] = 'Biota'
occ[occ['scientificName'].isna()].shape

(33360, 22)


(0, 22)

Finally, during the above process, **I totally messed up the taxonomy columns in order to obtain the best possible scientificName column**. I'll fix that below.

In [35]:
## Fix taxonomy columns

# Replace with original data
occ[cols[1:]] = plate[['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus']].copy()

# Replace missing data indicators in original data with empty strings ('')
occ[cols[1:]] = occ[cols[1:]].replace({
    'unassigned':'',
    's_':'',
    'g_':'',
    'unknown':'',
    'no_hit':''})

Check using:

```python
occ[occ['scientificName'] == 'Diplonemea'].head()
```

In [36]:
## Add scientific name-related columns

# Derive the ID columns from the original names before overwriting scientificName
occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)

occ['scientificName'] = occ['scientificName'].replace(name_name_dict)

occ['nameAccordingTo'] = 'WoRMS'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,scientificName,kingdom,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS


In [37]:
## Get set up to query NCBI taxonomy 

from Bio import Entrez
Entrez.email = 'dianalg@mbari.org'

# Get list of all databases available through this tool
record = Entrez.read(Entrez.einfo())
all_dbs = record['DbList']
all_dbs

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

In [38]:
## Get NCBI taxIDs for each name in dataset ---- TAKES ~ 2 MINUTES FOR 300 RECORDS

name_ncbiid_dict = {}

for name in names:
    handle = Entrez.esearch(db='taxonomy', retmax=10, term=name)
    record = Entrez.read(handle)
    name_ncbiid_dict[name] = record['IdList'][0]
    handle.close()

**Note** that this code will throw an IndexError (IndexError: list index out of range) if a name returns no hits, because `record['IdList']` is then empty and has no element at index 0.
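One way to avoid the IndexError is to check whether the ID list is empty before indexing into it. The following is a minimal sketch, not the approach used in this notebook: `first_taxid` is a hypothetical helper, and the two toy records (including the ID `'12345'`) are made-up stand-ins shaped like the parsed result of `Entrez.read(Entrez.esearch(...))`.

```python
def first_taxid(record):
    """Return the first ID in a parsed esearch result, or '' if there were no hits."""
    id_list = record.get('IdList', [])
    return id_list[0] if id_list else ''

# Toy records (hypothetical values) standing in for parsed esearch results
found = {'Count': '1', 'IdList': ['12345']}
missing = {'Count': '0', 'IdList': []}

print(first_taxid(found))
print(repr(first_taxid(missing)))
```

Inside the loop above, the assignment could then read `name_ncbiid_dict[name] = first_taxid(record)`, so that names with no match map to an empty string instead of stopping the loop.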

In [39]:
## Add NCBI taxonomy IDs under taxonConceptID

# Map indicators that say no taxonomy was assigned to empty strings
for indicator in ['unassigned', 's_', 'no_hit', 'unknown', 'g_']:
    name_ncbiid_dict[indicator] = ''

# Create column
# (assign results back rather than using inplace=True on a column selection)
occ['taxonConceptID'] = plate['Species'].replace(name_ncbiid_dict)

# Add remainder of text and clean
occ['taxonConceptID'] = 'NCBI:txid' + occ['taxonConceptID']
# Rows with no NCBI ID now hold only the bare prefix; reset them to empty strings
occ['taxonConceptID'] = occ['taxonConceptID'].replace('NCBI:txid', '')
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,kingdom,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,


In [41]:
## identificationRemarks

# Get identificationRemarks
occ = occ.merge(plate_meta[['seqID', 'identificationRemarks']], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)

# Add name that matched in GenBank
occ['identificationRemarks'] = plate['Species'].copy() + ', ' + occ['identificationRemarks']
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."


In [42]:
## basisOfRecord

occ['basisOfRecord'] = 'MaterialSample'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample


In [43]:
## Add identificationReferences 

occ = occ.merge(plate_meta[['seqID', 'identificationReferences']], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)
occ['identificationReferences'] = occ['identificationReferences'].str.replace('| ', ' | ', regex=False)

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...


In [44]:
## organismQuantity (number of reads)

occ['organismQuantity'] = plate['Reads']
occ['organismQuantityType'] = 'DNA sequence reads'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads


**Note** that there are 215537 rows where the number of reads is 0. These will need to be dropped at the end of the script.
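A count like this can be obtained by comparing the column to 0 and summing the resulting booleans. The sketch below uses a small made-up frame (`toy`) rather than the real `occ` table, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for occ (hypothetical values)
toy = pd.DataFrame({
    'eventID': ['a', 'a', 'b', 'b'],
    'organismQuantity': [5, 0, 0, 0],
})

# True counts as 1 when summed, so this is the number of zero-read rows
n_zero = int((toy['organismQuantity'] == 0).sum())
print(n_zero, 'of', len(toy), 'rows have 0 reads')
```

Running the same expression on `occ['organismQuantity']` should give the figure quoted above.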

In [45]:
## sampleSizeValue

count_by_seq = plate.groupby('Sequence_ID', as_index=False)['Reads'].sum()
occ = occ.merge(count_by_seq, how='left', left_on='eventID', right_on='Sequence_ID')
occ.drop(columns='Sequence_ID', inplace=True)
occ.rename(columns={'Reads':'sampleSizeValue'}, inplace=True)
print(occ.shape)
occ.head()

(280440, 32)


Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419


In [47]:
## sampleSizeUnit

occ['sampleSizeUnit'] = 'DNA sequence reads'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue,sampleSizeUnit
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600,DNA sequence reads
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702,DNA sequence reads
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275,DNA sequence reads
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220,DNA sequence reads
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419,DNA sequence reads


In [48]:
## associatedSequences

occ = occ.merge(plate_meta[['seqID', 'associatedSequences']], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,...,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue,sampleSizeUnit,associatedSequences
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600,DNA sequence reads,NCBI BioProject accession number PRJNA433203
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702,DNA sequence reads,NCBI BioProject accession number PRJNA433203
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275,DNA sequence reads,NCBI BioProject accession number PRJNA433203
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220,DNA sequence reads,NCBI BioProject accession number PRJNA433203
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419,DNA sequence reads,NCBI BioProject accession number PRJNA433203


In [49]:
## Drop records where organismQuantity = 0 (absences are not meaningful for this data set)

occ = occ[occ['organismQuantity'] > 0]
print(occ.shape)

(64903, 34)


**Note** that the vast majority of records in this data set have 0 reads (215537 records out of 280440, or about 77%). **This makes sense, and is generally fine. However, Katie noted that the 0's could be important if the data set were to be reanalyzed, because the data are compositional. We can stay in conversation about this.**
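The row counts quoted in this notebook can be checked with a few lines of arithmetic. This is only a sanity check on the numbers printed above; the variable names are made up for illustration.

```python
total_rows = 280440   # rows in occ before filtering
zero_rows = 215537    # rows where organismQuantity is 0

kept_rows = total_rows - zero_rows                   # rows that survive the filter
pct_zero = round(100 * zero_rows / total_rows, 1)    # percentage of zero-read rows

print(kept_rows)   # 64903, matching the shape printed after filtering
print(pct_zero)
```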

Also **note** that there is one sample (34916c01_12_edna_2_S) that has no reads at all in the ASV table.
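Samples like this can be found by summing reads per sample and keeping the samples whose total is 0. The sketch below assumes the same `Sequence_ID` and `Reads` column names as the plate data, but runs on a small made-up frame (`toy_plate`) with hypothetical sample names.

```python
import pandas as pd

# Toy stand-in for the ASV table (hypothetical values)
toy_plate = pd.DataFrame({
    'Sequence_ID': ['s1', 's1', 's2', 's2'],
    'Reads': [10, 0, 0, 0],
})

# Total reads per sample, then the samples whose total is zero
totals = toy_plate.groupby('Sequence_ID')['Reads'].sum()
empty_samples = totals[totals == 0].index.tolist()
print(empty_samples)
```

Any sample listed this way drops out of `occ` entirely once zero-read rows are removed.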

In [69]:
## Check for NaN values in every column (including string fields)

occ.isna().sum()

eventID                     0
eventDate                   0
decimalLatitude             0
decimalLongitude            0
env_broad_scale             0
env_local_scale             0
env_medium                  0
target_gene                 0
primer_sequence_forward     0
primer_sequence_reverse     0
pcr_primer_name_forward     0
pcr_primer_name_reverse     0
pcr_primer_reference        0
sop                         0
DNA_sequence                0
scientificName              0
kingdom                     0
phylum                      0
class                       0
order                       0
family                      0
genus                       0
scientificNameID            0
taxonID                     0
nameAccordingTo             0
taxonConceptID              0
identificationRemarks       0
basisOfRecord               0
identificationReferences    0
organismQuantity            0
organismQuantityType        0
sampleSizeValue             0
sampleSizeUnit              0
associated

## Save

In [33]:
## Save

occ.to_csv('eDNA_practice_plate_occ_20210527.csv', index=False, na_rep='NaN')

## Questions

1. Katie still needs to remove the clarifying text from the sop column (e.g. 'ext: ') **Done.**