# eDNA conversion of practice data - draft 2

**Resources:**
- https://docs.gbif-uat.org/publishing-dna-derived-data/1.0/en/#mapping-metabarcoding-edna-and-barcoding-data

In [34]:
## Imports

import pandas as pd
import numpy as np
import random

from datetime import datetime # for handling dates
import pytz # for handling time zones

In [35]:
## Ensure my general functions for the MPA data integration project can be imported, and import them

import sys
sys.path.insert(0, "C:\\Users\\dianalg\\PycharmProjects\\PythonScripts\\eDNA")

import WoRMS # functions for querying WoRMS REST API

## Load data

In [36]:
## Plate data

plate = pd.read_csv('Plate_S_ASV_OBIS_data.csv')
print(plate.shape)
plate.head()

(280440, 11)


Unnamed: 0,ASV,FilterID,Sequence_ID,Reads,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_1,05114c01_12_edna_1_S,14825,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_2,05114c01_12_edna_2_S,16094,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
2,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_3,05114c01_12_edna_3_S,22459,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
3,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,11216c01_12_edna_1,11216c01_12_edna_1_S,19312,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
4,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,11216c01_12_edna_2,11216c01_12_edna_2_S,16491,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned


**FilterID** is the concatenation of the following parts, in order:
- SAMPLING_cruise: the cruise where the sample was collected
- SAMPLING_station_number: in this case c + a cast number, because these data were collected via CTD cast
- `_` + SAMPLING_bottle: the bottle number. If multiple samples were taken from a single bottle, there may be replicates indicated by a, b, c, etc.
- `_` + edna: the filter indicator. edna means that the standard filter for the standard eDNA pipeline, i.e. a standard PBDF filter, was used. Other options include hplc.
- `_` + 1, 2 or 3: the PCR replicate

To summarize, **FilterID** = cruise number + cast number + bottle number + filter indicator + replicate number.

**Note** that not all IDs exactly follow this formula. Example:
- JD10706C1_0m_1

**Sequence_ID** adds the plate indicator to the end of FilterID. In this case, all samples are from plate S. Plates are labeled A-Z, AA-ZZ, etc.
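
The naming formula above can be sketched as a regular expression. This is a minimal sketch, not part of the original pipeline: the function name `parse_sequence_id`, the group names, and the exact character classes (digits for cruise and cast, an optional lowercase letter for bottle replicates) are assumptions inferred only from the example IDs shown here. IDs that don't follow the formula, like JD10706C1_0m_1, return `None` rather than raising an error.

```python
import re

# Hypothetical pattern inferred from IDs such as 05114c01_12_edna_1_S
SEQUENCE_ID_PATTERN = re.compile(
    r'^(?P<cruise>\d+)'          # cruise number
    r'c(?P<cast>\d+)'            # c + CTD cast number
    r'_(?P<bottle>\d+[a-z]?)'    # bottle number, optional replicate letter
    r'_(?P<filter>[A-Za-z]+)'    # filter indicator, e.g. edna or hplc
    r'_(?P<pcr_replicate>\d+)'   # PCR replicate
    r'(?:_(?P<plate>[A-Z]+))?$'  # optional plate indicator (Sequence_ID only)
)

def parse_sequence_id(sample_id):
    """Split a FilterID or Sequence_ID into its parts, or return None if it doesn't fit."""
    match = SEQUENCE_ID_PATTERN.match(sample_id)
    return match.groupdict() if match else None
```

Running this over `plate['Sequence_ID'].unique()` and collecting the IDs that return `None` would give a quick list of the exceptions to the formula.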

In [37]:
## Taxonomy table

taxa = pd.read_csv('Filtered_ASV_taxa_table_all.csv')
print(taxa.shape)
taxa.head()

(4711, 79)


Unnamed: 0,ASV,Kingdom,Phylum,Class,Order,Family,Genus,Species,14213c01_12_edna_1,14213c01_12_edna_2,...,JD33306C1_0m_3,pcrblank2,EB_20161121,EB_20161228,EB_20170117,pcrblank3,CB_CANON160925_1,CB_CANON160925_2,CB_CANON160925_3,ArtComm2
0,ASV_1,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned,9,4,...,956,0,0,1312,1,0,197,3,210,44
1,ASV_2,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Calanidae,unassigned,unassigned,4,7,...,358,0,0,535,0,0,118,162,139,5976
2,ASV_3,Eukaryota,unknown,Dinophyceae,Gymnodiniales,Gymnodiniaceae,Akashiwo,Akashiwo sanguinea,2,3,...,795,0,0,0,35,0,1,0,1,8
3,ASV_4,Eukaryota,unknown,Dinophyceae,Gymnodiniales,Gymnodiniaceae,Cochlodinium,unassigned,2,3,...,82065,0,0,0,836,0,1,0,0,7
4,ASV_5,Eukaryota,unknown,Dinophyceae,unassigned,unassigned,unassigned,unassigned,148,147,...,14688,0,0,36,401,0,0,4,11,7


This shows the number of reads for each ASV detected for each FilterID (columns), plus the matched taxa information for each ASV, if available. 

Some columns are associated with controls:
- **CB** = collection blank which was taken on the ship. Water that should be "clean" was passed through the filter, so there should be no reads in this sample, although it tends to be the "dirtiest" control. It checks for contamination during filtration, I think.
- **EB** = extraction blank which was taken in the lab but contains no DNA (i.e. a negative control), so there should be no reads in this sample. It checks if contamination occurred during DNA extraction (lab work pre-PCR).
- **pcrblank** = A sample that went through PCR but contained no input DNA (i.e. a negative control), so there should be no reads in this sample. It checks if contamination occurred during PCR.
- **ArtComm** = artificial community that went through PCR and contained DNA from species that should not be in the study system (i.e. a positive control), so you know what results you expect.

To see these columns, use:
```python
plate_filter_ids = set(plate['FilterID']) # build the lookup once instead of once per column
for col in taxa.columns:
    if col not in plate_filter_ids:
        print(col)
```
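
The control types described above can also be told apart by column-name prefix. A minimal sketch, assuming the prefixes visible in this table's header (CB_, EB_, pcrblank, ArtComm) are the only ones in use; the function name `control_type` and the label strings are hypothetical.

```python
# Prefixes taken from the column names shown above; labels are illustrative
CONTROL_PREFIXES = {
    'CB_': 'collection blank',
    'EB_': 'extraction blank',
    'pcrblank': 'PCR blank',
    'ArtComm': 'artificial community',
}

def control_type(column_name):
    """Return the control type for a column name, or None for anything else."""
    for prefix, label in CONTROL_PREFIXES.items():
        if column_name.startswith(prefix):
            return label
    return None
```

For example, `{col: control_type(col) for col in taxa.columns if control_type(col)}` would map each control column to its type.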

Additionally, taxonomic columns include the following designations:
- **unknown** = GenBank couldn't give a scientifically-agreed-upon name for a given taxonomic level. I.e., either the name doesn't exist, or there isn't enough scientific consensus to give a name.
- **no_hit** = BLAST did not find any hits for the ASV.
- **unassigned** = The ASV got BLAST hits, but the post-processing program Megan6 didn't assign the ASV to any taxonomic group.
- **g_** or **s_** = After Megan6, there's an additional filtering step that keeps Genus or Species designations only if they were assigned at or above a set, high confidence threshold. If the threshold is not met, a g_ or s_ is given instead.

In [38]:
## Plate metadata

plate_meta = pd.read_csv('Plate_S_meta_OBIS_data.csv')
print(plate_meta.shape)
plate_meta.head()

(60, 52)


Unnamed: 0,SequenceID,sample_name,order,tag_sequence,primer_sequence_F,primer_sequence_R,library_tag_combo,library,date_PCR,sample_type,...,project_name,samp_vol_we_dna_ext,samp_filter_size_ext,samp_filter_ext_type,samp_store_temp,seq_meth,sequencing_facility,geo_loc_name,investigation_type,FilterID
0,14213c01_12_edna_1_S,14213c01_12_edna_1,1,ACGAGACTGATT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,S1_ACGAGACTGATT,S1,20170124,environmental,...,MBON,100ml,0.22um,Poretics,-80C,NGS Illumina Miseq,Stanford,USA:California:Monterey Bay,eukaryote,14213c01_12_edna_1
1,14213c01_12_edna_2_S,14213c01_12_edna_2,2,GAATACCAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,S2_GAATACCAAGTC,S2,20170124,environmental,...,MBON,100ml,0.22um,Poretics,-80C,NGS Illumina Miseq,Stanford,USA:California:Monterey Bay,eukaryote,14213c01_12_edna_2
2,14213c01_12_edna_3_S,14213c01_12_edna_3,3,CGAGGGAAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,S3_CGAGGGAAAGTC,S3,20170124,environmental,...,MBON,100ml,0.22um,Poretics,-80C,NGS Illumina Miseq,Stanford,USA:California:Monterey Bay,eukaryote,14213c01_12_edna_3
3,22013c01_12_edna_1_S,22013c01_12_edna_1,4,GAACACTTTGGA,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,S4_GAACACTTTGGA,S4,20170124,environmental,...,MBON,100ml,0.22um,Poretics,-80C,NGS Illumina Miseq,Stanford,USA:California:Monterey Bay,eukaryote,22013c01_12_edna_1
4,22013c01_12_edna_2_S,22013c01_12_edna_2,5,ACTCACAGGAAT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,S5_ACTCACAGGAAT,S5,20170124,environmental,...,MBON,100ml,0.22um,Poretics,-80C,NGS Illumina Miseq,Stanford,USA:California:Monterey Bay,eukaryote,22013c01_12_edna_2


## Conversion

In [127]:
## eventID

occ = pd.DataFrame({'eventID':plate['Sequence_ID']})
print(occ.shape)
occ.head()

(280440, 1)


Unnamed: 0,eventID
0,05114c01_12_edna_1_S
1,05114c01_12_edna_2_S
2,05114c01_12_edna_3_S
3,11216c01_12_edna_1_S
4,11216c01_12_edna_2_S


In [128]:
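## Merge with plate_meta to obtain eventDate, decimalLatitude, decimalLongitude, environment terms, target gene and primer sequences

occ = occ.merge(plate_meta[['SequenceID', 'SAMPLING_date_time', 'SAMPLING_dec_lat', 'SAMPLING_dec_lon',
                            'env_biome', 'env_feature', 'env_material', 'locus', 'primer_sequence_F',
                            'primer_sequence_R']], how='left', left_on='eventID', right_on='SequenceID')
occ.drop(columns='SequenceID', inplace=True)
# Rename by name rather than by position, so the mapping doesn't depend on column order
occ.rename(columns={'SAMPLING_date_time': 'eventDate', 'SAMPLING_dec_lat': 'decimalLatitude',
                    'SAMPLING_dec_lon': 'decimalLongitude', 'env_biome': 'env_broad_scale',
                    'env_feature': 'env_local_scale', 'env_material': 'env_medium', 'locus': 'target_gene',
                    'primer_sequence_F': 'pcr_primer_forward', 'primer_sequence_R': 'pcr_primer_reverse'},
           inplace=True)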
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse
0,05114c01_12_edna_1_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
1,05114c01_12_edna_2_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
2,05114c01_12_edna_3_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
3,11216c01_12_edna_1_S,2016-04-21 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
4,11216c01_12_edna_2_S,2016-04-21 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC
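
Because `occ` has one row per ASV per sample while `plate_meta` should have one row per SequenceID, this merge is many-to-one. A minimal sketch of how pandas can check that assumption, using toy stand-in frames rather than the real data: `validate='many_to_one'` raises a `MergeError` if the right-hand keys are duplicated, and `indicator=True` adds a `_merge` column that flags events with no metadata match. The frame names and values below are illustrative only.

```python
import pandas as pd

# Toy stand-ins for occ and plate_meta (values are illustrative only)
occ_demo = pd.DataFrame({'eventID': ['a_S', 'a_S', 'b_S', 'c_S']})
meta_demo = pd.DataFrame({'SequenceID': ['a_S', 'b_S'],
                          'SAMPLING_dec_lat': [36.7958, 36.7962]})

merged = occ_demo.merge(meta_demo, how='left', left_on='eventID', right_on='SequenceID',
                        validate='many_to_one',  # fail loudly on duplicate SequenceIDs
                        indicator=True)          # adds a _merge column

# Events that found no metadata row
unmatched = merged.loc[merged['_merge'] == 'left_only', 'eventID'].unique()
print(unmatched)
```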


In [129]:
## Add primer names and references

occ.insert(8, 'pcr_primer_name_forward', 'insert primer name')
occ.insert(10, 'pcr_primer_name_reverse', 'insert primer name')
occ.insert(12, 'pcr_primer_reference', 'insert primer reference')
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,pcr_primer_name_reverse,pcr_primer_reverse,pcr_primer_reference
0,05114c01_12_edna_1_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference
1,05114c01_12_edna_2_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference
2,05114c01_12_edna_3_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference
3,11216c01_12_edna_1_S,2016-04-21 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference
4,11216c01_12_edna_2_S,2016-04-21 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference


In [130]:
## Add sop

occ.insert(13, 'sop', 'insert sop link/doi')
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,pcr_primer_name_reverse,pcr_primer_reverse,pcr_primer_reference,sop
0,05114c01_12_edna_1_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi
1,05114c01_12_edna_2_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi
2,05114c01_12_edna_3_S,2014-02-20 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi
3,11216c01_12_edna_1_S,2016-04-21 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi
4,11216c01_12_edna_2_S,2016-04-21 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi


**Note** that pcr_primer_name_forward, pcr_primer_name_reverse, pcr_primer_reference, and sop will be in Katie's revised metadata file, so ultimately I should be able to incorporate all of them in the original merge step.

In [131]:
## Format eventDate

pst = pytz.timezone('America/Los_Angeles') # Pacific time; localize() applies PST or PDT depending on the date
eventDate = [pst.localize(datetime.strptime(dt, '%Y-%m-%d %H:%M')).isoformat() for dt in occ['eventDate']]
occ['eventDate'] = eventDate

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,pcr_primer_name_reverse,pcr_primer_reverse,pcr_primer_reference,sop
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi
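
The same conversion can be done without a Python-level loop. A minimal sketch, assuming all timestamps are local Pacific time as in the cell above; it uses the pandas `tz_localize` accessor on a toy Series (`raw` is illustrative) instead of calling `pytz` directly.

```python
import pandas as pd

raw = pd.Series(['2014-02-20 15:33', '2016-04-21 14:39'])  # toy sample of eventDate strings

localized = (pd.to_datetime(raw, format='%Y-%m-%d %H:%M')
               .dt.tz_localize('America/Los_Angeles'))  # applies PST or PDT by date
iso_dates = localized.map(lambda ts: ts.isoformat())
print(iso_dates.tolist())
```

By default `tz_localize` raises an error for times that are ambiguous or nonexistent around daylight saving transitions, which may be useful as a data check.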


In [132]:
## Add DNA_sequence

occ['DNA_sequence'] = plate['ASV']
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,pcr_primer_name_reverse,pcr_primer_reverse,pcr_primer_reference,sop,DNA_sequence
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,insert primer name,TGATCCTTCTGCAGGTTCACCTAC,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...


In [133]:
## scientificName, taxonomic info

occ['scientificName'] = plate['Species']
occ['kingdom'] = plate['Kingdom']
occ['phylum'] = plate['Phylum']
occ['class'] = plate['Class']
occ['order'] = plate['Order']
occ['family'] = plate['Family']
occ['genus'] = plate['Genus']

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,...,pcr_primer_reference,sop,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


In [134]:
## Replace 'unknown', 'unassigned', etc. in scientificName and taxonomy columns with NaN

cols = ['scientificName', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus']
occ[cols] = occ[cols].replace({'unassigned':np.nan,
                              's_':np.nan,
                              'g_':np.nan,
                              'unknown':np.nan,
                              'no_hit':np.nan})
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,...,pcr_primer_reference,sop,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert primer reference,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


**Note** that a number of entries in the scientificName column, like "uncultured marine eukaryote", "eukaryote clone OLI11007", and "Acantharian sp. 6201", are not proper Latin species names. It seems reasonable to replace these with NaN as well. **I've used a simple rule to filter these out (any name with more than two words is treated as non-Latin), and it would be worth re-checking that rule as the data set expands, just to be extra careful.**

To see the entries:
```python
names = occ['scientificName'].unique()
names = names[~pd.isnull(names)] # remove NaN
for name in names:
    words_in_name = name.split(' ')
    if len(words_in_name) > 2:
        print(name)
```
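
A slightly stricter alternative to the word-count rule is to test each name against the shape of a binomial: a capitalized genus followed by a lowercase epithet. This is a hedged sketch, not the rule used in this notebook; the function name `looks_like_binomial` and the regular expression are assumptions, and the pattern will still pass some non-species strings and reject valid names such as subspecies.

```python
import re

# Assumed shape of a species binomial: "Genus epithet"
BINOMIAL_PATTERN = re.compile(r'^[A-Z][a-z]+ [a-z][a-z-]+$')

def looks_like_binomial(name):
    """Return True if the name has the form of a two-word Latin binomial."""
    return bool(BINOMIAL_PATTERN.match(name))
```

Under this check, two-word entries such as "phototrophic eukaryote" are rejected because the first word is not capitalized, whereas the word-count rule lets them through.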

In [135]:
## Replace non-Latin species names with NaN

# Get names
names = occ['scientificName'].unique()
names = names[~pd.isnull(names)] # remove NaN

# Get non-Latin names
non_latin_names = []
for name in names:
    words_in_name = name.split(' ')
    if len(words_in_name) > 2:
        non_latin_names.append(name)
non_latin_names_dict = {i:np.nan for i in non_latin_names}

# Add any names that didn't get caught in the simple filter
non_latin_names_dict['phototrophic eukaryote'] = np.nan

# Replace
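# Assign the result back; calling replace(..., inplace=True) on a selected column may not update occ in newer pandas
occ['scientificName'] = occ['scientificName'].replace(non_latin_names_dict)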

In [136]:
## Fill missing values in scientificName with the lowest available taxonomic rank

occ['scientificName'] = occ['scientificName'].combine_first(occ['genus'])
occ['scientificName'] = occ['scientificName'].combine_first(occ['family'])
occ['scientificName'] = occ['scientificName'].combine_first(occ['order'])
occ['scientificName'] = occ['scientificName'].combine_first(occ['class'])
occ['scientificName'] = occ['scientificName'].combine_first(occ['phylum'])
occ['scientificName'] = occ['scientificName'].combine_first(occ['kingdom'])

**Note** that there are 33360 records (~12% of the data) where no taxonomic information was obtained at all (i.e. scientificName is still NaN). 
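
The chain of `combine_first` calls picks, for each row, the first non-missing value when reading from species up to kingdom. A minimal sketch of an equivalent one-step form, shown on toy data (the frame `demo` and its values are illustrative): back-fill across columns ordered from most to least specific, then keep the first column.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'scientificName': [np.nan, 'Akashiwo sanguinea', np.nan],
    'genus':          ['Paracalanus', 'Akashiwo', np.nan],
    'family':         ['Paracalanidae', 'Gymnodiniaceae', np.nan],
    'kingdom':        ['Eukaryota', 'Eukaryota', np.nan],
})

# Columns ordered from most to least specific rank
ranks = ['scientificName', 'genus', 'family', 'kingdom']
# Back-fill along each row, then keep the first column
lowest_rank = demo[ranks].bfill(axis=1).iloc[:, 0]
print(lowest_rank.tolist())
```

Rows with no taxonomic information at any rank stay NaN, which matches the behavior of the chained version.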

In [137]:
## Get unique species names

names = occ['scientificName'].unique()
names = names[~pd.isnull(names)] # remove NaN
len(names)

756

**Note:** As far as I can tell, these names do not have to be WoRMS-approved. As long as there's some taxonomic information, the record can be submitted to OBIS. Be aware, though, that a number of names in scientificName are not approved by WoRMS. To see them, use:
```python
name_id_dict, name_name_dict, name_taxid_dict, name_class_dict = WoRMS.run_get_worms_from_scientific_name(names, verbose_flag=True)
```

**Also**, many entries have no taxonomic information other than kingdom = "Eukaryota". This troubles me a little, since Eukaryota is a domain, not a kingdom. **Will this be a problem? Is it possible to get the actual kingdom information in the kingdom field?**

**Finally**, the 33360 records with no taxonomic information will need to be dropped eventually, but I'll leave that until the end in case we want to add any more columns from the original data.

In [138]:
## basisOfRecord

occ['basisOfRecord'] = 'MaterialSample'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,...,sop,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus,basisOfRecord
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,insert sop link/doi,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample


In [139]:
## Add materialSampleID, identificationRemarks, identificationReferences -- can probably do this through merging metadata when Katie adds this info

occ['materialSampleID'] = 'insert doi/SRA number (?) if sequence has been archived'
occ['identificationRemarks'] = 'insert the GenBank version used, the fact that BLAST and Megan6 were used, and the BLAST confidence level'
occ['identificationReferences'] = 'insert links/dois for published papers, and link to github code'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,...,kingdom,phylum,class,order,family,genus,basisOfRecord,materialSampleID,identificationRemarks,identificationReferences
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li..."
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li..."
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li..."
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li..."
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li..."


In [140]:
## organismQuantity (number of reads)

occ['organismQuantity'] = plate['Reads']
occ['organismQuantityType'] = 'DNA sequence reads'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,...,class,order,family,genus,basisOfRecord,materialSampleID,identificationRemarks,identificationReferences,organismQuantity,organismQuantityType
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",14825,DNA sequence reads
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",16094,DNA sequence reads
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",22459,DNA sequence reads
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",19312,DNA sequence reads
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",16491,DNA sequence reads


**Note** that there are 215537 rows where the number of reads is 0. These will need to be dropped at the end of the script.
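The zero-read count quoted above can be reproduced with a boolean mask. This is a minimal sketch on a made-up stand-in table (`plate_demo` and its values are hypothetical); on the real data the same two lines would be applied to `plate['Reads']`.

```python
import pandas as pd

# Toy stand-in for the `plate` table (values are made up for illustration)
plate_demo = pd.DataFrame({
    'Sequence_ID': ['s1', 's1', 's2', 's2'],
    'Reads': [10, 0, 0, 5],
})

# Boolean mask of rows with no reads; summing a boolean mask counts the True values
zero_mask = plate_demo['Reads'] == 0
n_zero = int(zero_mask.sum())

# Share of all rows that have zero reads
frac_zero = n_zero / len(plate_demo)
print(n_zero, frac_zero)
```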

In [141]:
## sampleSizeValue

count_by_seq = plate.groupby('Sequence_ID', as_index=False)['Reads'].sum()
occ = occ.merge(count_by_seq, how='left', left_on='eventID', right_on='Sequence_ID')
occ.drop(columns='Sequence_ID', inplace=True)
occ.rename(columns={'Reads':'sampleSizeValue'}, inplace=True)
print(occ.shape)
occ.head()

(280440, 29)


Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,...,order,family,genus,basisOfRecord,materialSampleID,identificationRemarks,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",14825,DNA sequence reads,85600
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",16094,DNA sequence reads,90702
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",22459,DNA sequence reads,130275
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",19312,DNA sequence reads,147220
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Calanoida,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",16491,DNA sequence reads,121419
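As a possible alternative to the merge/drop/rename sequence, `groupby(...).transform('sum')` returns one total per row, already aligned with the original rows. The sketch below uses a hypothetical `occ_demo` table with made-up values, and it sums `organismQuantity` within each `eventID`, which assumes `organismQuantity` still holds every row from `plate` at this point.

```python
import pandas as pd

# Toy stand-in for `occ` (values are made up for illustration)
occ_demo = pd.DataFrame({
    'eventID': ['s1', 's1', 's2', 's2', 's2'],
    'organismQuantity': [10, 0, 3, 5, 4],
})

# transform('sum') broadcasts each group's total back onto its rows,
# so no merge, drop, or rename is needed
occ_demo['sampleSizeValue'] = (
    occ_demo.groupby('eventID')['organismQuantity'].transform('sum')
)

# Sanity check: reads per event should add up to that event's sample size
totals = occ_demo.groupby('eventID')['organismQuantity'].sum()
sizes = occ_demo.groupby('eventID')['sampleSizeValue'].first()
assert (totals == sizes).all()
```

Either way, the totals are computed before the Tidy step, so they include reads from rows that are dropped later.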


In [142]:
## sampleSizeUnit

occ['sampleSizeUnit'] = 'DNA sequence reads'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_name_forward,pcr_primer_forward,...,family,genus,basisOfRecord,materialSampleID,identificationRemarks,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue,sampleSizeUnit
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",14825,DNA sequence reads,85600,DNA sequence reads
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",16094,DNA sequence reads,90702,DNA sequence reads
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",22459,DNA sequence reads,130275,DNA sequence reads
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",19312,DNA sequence reads,147220,DNA sequence reads
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,insert primer name,GTACACACCGCCCGTC,...,Paracalanidae,Paracalanus,MaterialSample,insert doi/SRA number (?) if sequence has been...,"insert the GenBank version used, the fact that...","insert links/dois for published papers, and li...",16491,DNA sequence reads,121419,DNA sequence reads


**I'm still a little unclear about the difference between materialSampleID and associatedSequences. Maybe I can ask Abby about this.**

## Tidy

In [143]:
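## Drop records where scientificName is NaN (no taxonomic information was available) or where organismQuantity is 0

# .copy() makes occ an independent DataFrame, so later column assignments don't raise SettingWithCopyWarning
occ = occ[occ['scientificName'].notna() & (occ['organismQuantity'] > 0)].copy()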
print(occ.shape)

(59625, 30)


**Note** that the vast majority of these data have 0 reads (215537 records out of 280440, or about 77%). I guess that makes sense (it's sparse data), but it might be good to check?
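One way to check whether the zeros are just sparsity is to see if the table is a full ASV-by-sample grid, in which case most ASV/sample combinations are expected to be empty. This is a sketch under that assumption, using a hypothetical `plate_demo` table with made-up values; the column names match the `plate` preview.

```python
import pandas as pd

# Toy long-format table: one row per ASV per sample (values are made up)
plate_demo = pd.DataFrame({
    'ASV': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Sequence_ID': ['s1', 's2', 's3', 's1', 's2', 's3'],
    'Reads': [7, 0, 0, 0, 0, 4],
})

# If every ASV has a row for every sample, the row count equals the product
n_asv = plate_demo['ASV'].nunique()
n_samples = plate_demo['Sequence_ID'].nunique()
is_full_grid = len(plate_demo) == n_asv * n_samples

# Share of zero-read rows within each sample (mean of a boolean is a proportion)
zero_share_by_sample = (
    plate_demo['Reads'].eq(0).groupby(plate_demo['Sequence_ID']).mean()
)
print(is_full_grid)
print(zero_share_by_sample)
```

A sample whose share is 1.0 has no reads at all and may be worth a closer look.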

In [144]:
## Replace NaN values in text fields with ''

text_fields = ['eventID', 'env_broad_scale', 'env_local_scale', 'env_medium', 'target_gene', 'pcr_primer_name_forward', 'pcr_primer_forward', 
               'pcr_primer_name_reverse', 'pcr_primer_reverse', 'pcr_primer_reference', 'sop', 'DNA_sequence', 'scientificName', 'kingdom', 
               'phylum', 'class', 'order', 'family', 'genus', 'basisOfRecord', 'materialSampleID', 'identificationRemarks', 'identificationReferences', 
               'organismQuantityType', 'sampleSizeUnit']
occ[text_fields] = occ[text_fields].fillna('')

## Save

In [145]:
## Save

occ.to_csv('eDNA_practice_plate_occ_20210108.csv', index=False, na_rep='NaN')
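To see what the saved file looks like for blank text fields versus missing numbers, here is a small round-trip sketch on an in-memory buffer. The `demo` table is hypothetical; the `keep_default_na=False` / `na_values=['NaN']` combination on read is one option for keeping empty strings as `''`, not something the notebook itself does.

```python
import io

import numpy as np
import pandas as pd

# Toy table: a text field already blanked with '' and a numeric field with a NaN
demo = pd.DataFrame({
    'genus': ['Paracalanus', ''],
    'decimalLatitude': [36.7958, np.nan],
})

# Write to an in-memory buffer with the same options as the Save cell
buffer = io.StringIO()
demo.to_csv(buffer, index=False, na_rep='NaN')
csv_text = buffer.getvalue()
print(csv_text)

# Read back, treating only the literal string 'NaN' as missing so that
# empty strings in text fields are preserved
round_trip = pd.read_csv(
    io.StringIO(csv_text), keep_default_na=False, na_values=['NaN']
)
```

With the default `read_csv` settings, the empty text fields would come back as NaN as well.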

## Questions

1. What's up with the set of FilterIDs that don't follow the formula? (e.g. JD10706C1_0m_1)
2. "Eukaryota" is a very common entry in the Kingdom field, but it's not a Kingdom, it's a Domain. Is this a problem? If so, is there some way to get the actual Kingdom information, or retain the Eukaryota classification elsewhere in the data set? **Yes, it's a problem. All occurrences have to have some entry in the scientificName column that matches on WoRMS, even though this is not clear from the GBIF guide. As such, things in the Kingdom column should be true Kingdoms, if possible. Maybe Katie can make sure this happens when she gets information from GenBank? Additionally, if the only taxonomic information for an organism is "Eukaryota", just put "Biota" in the scientificName column (which will match on WoRMS - http://marinespecies.org/aphia.php?p=taxdetails&id=1) and NaN everywhere else. This is also possibly an option for records where no taxonomic data at all is available. Note that Pieter is willing to run the data through OBIS's automated system to see how it manages matching on WoRMS. Also, he says there's a way to query the WoRMS API using GenBank ID (AphiaRecordByExternalID - https://www.marinespecies.org/rest/AphiaRecordByExternalID/399303?type=ncbi). Finally, GenBank IDs can likely be included in a taxonID field or the taxonConceptID field. A related issue is being tracked on GBIF's GitHub: https://github.com/gbif/doc-publishing-dna-derived-data/issues/35**
3. Difference between materialSampleID and associatedSequences term definitions? **Abby is emailing contacts at GBIF to get more information on this (1/13). But also, maybe Katie would understand the definitions better than I do? After talking with Pieter and Saara, it sounds like materialSampleID is probably the ID of the actual physical water sample, if it is frozen somewhere. associatedSequences would be the biosampleID from SRA. If I want, I could get in touch with the developers of the GBIF guide and let them know that this is confusing.**
4. How might one include the control data in DwC format? Abby suggested including it in the MoF file somehow? **No one knows. Abby will connect me to her contact at OBIS who is working through an eDNA data set; they might be at least thinking about this issue (1/13). Pieter and Saara contributed a number of questions/suggestions:**
    **1. Has the control information been incorporated into this data set already? Are the control data still useful if you're dealing with the fully processed data set?**
    **2. Are the control data archived and therefore accessible via biosampleID?**
    **3. At worst, the records could be included in the data set with no scientificName and no coordinates to make it clear they're not regular occurrences.**
    
    **These questions would be best addressed to Katie.**

Other than that, just waiting for an updated metadata file from Katie.

**Note** a new DwC extension is being developed for sequence-derived data: https://rs.gbif.org/sandbox/extension/dna_derived_data.xml
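The "Biota" suggestion from question 2 could be sketched roughly as follows: take the lowest rank that holds a usable name, fall back to "Biota" when nothing usable remains, and blank the rank columns for those fallback records. The helper names (`lowest_usable_name`, `NOT_A_NAME`, `worms_ncbi_lookup_url`) and the toy rows are hypothetical, and treating "unassigned" as missing is an assumption based on the `plate` preview. The URL helper only rebuilds the AphiaRecordByExternalID pattern quoted in the note; it does not send a request.

```python
import numpy as np
import pandas as pd

# Ranks from most to least specific, using the column names in `plate`
RANKS = ['Species', 'Genus', 'Family', 'Order', 'Class', 'Phylum', 'Kingdom']

# Assumption: these entries should not be used as a scientificName
NOT_A_NAME = {'unassigned', 'Eukaryota'}


def lowest_usable_name(row):
    """Return the most specific usable name in a row, or 'Biota' if none."""
    for rank in RANKS:
        value = row[rank]
        if pd.notna(value) and value not in NOT_A_NAME:
            return value
    return 'Biota'


def worms_ncbi_lookup_url(ncbi_id):
    """Build the WoRMS REST URL for a lookup by NCBI (GenBank) taxon ID."""
    return (
        'https://www.marinespecies.org/rest/AphiaRecordByExternalID/'
        f'{ncbi_id}?type=ncbi'
    )


# Toy rows (made up): one identified to genus, one with only "Eukaryota"
taxa_demo = pd.DataFrame({
    'Kingdom': ['Eukaryota', 'Eukaryota'],
    'Phylum': ['Arthropoda', 'unassigned'],
    'Class': ['Hexanauplia', 'unassigned'],
    'Order': ['Calanoida', 'unassigned'],
    'Family': ['Paracalanidae', 'unassigned'],
    'Genus': ['Paracalanus', 'unassigned'],
    'Species': ['unassigned', 'unassigned'],
})

taxa_demo['scientificName'] = taxa_demo.apply(lowest_usable_name, axis=1)

# For Biota-only records, put NaN in every rank column
taxa_demo.loc[taxa_demo['scientificName'] == 'Biota', RANKS] = np.nan

print(taxa_demo[['Kingdom', 'Genus', 'scientificName']])
print(worms_ncbi_lookup_url(399303))
```

Records that keep a real name still carry "Eukaryota" in the Kingdom column here; replacing that with a true Kingdom would need information from GenBank or WoRMS.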

Also **note**: We can choose for these data to become a use case (i.e. an example for others trying to prepare their sequence-derived data for submission to OBIS). This could be done with part or all of the data set, and would mean that your conversion process would be available as an example on GitHub. Would Francisco like to do this? For examples, look here: https://github.com/tdwg/dwc-for-biologging/wiki