This Jupyter notebook is one of many that fully document the data analysis completed for the publication "DNA Barcoding, Collection Management, and the Bird Collection in the Smithsonian’s National Museum of Natural History" by Schindel et al.

You can view all analysis notebooks, data, and figures in the GitHub repository here: https://github.com/MikeTrizna/USNMBirdDNABarcoding2017

# Creating Table 1 -- Sequence record and species counts

In [1]:
import pandas as pd

**Prior to Schindel 2011**

In [2]:
cols_to_keep = ['accession','specimen_voucher','scientific_name','trace_count']
before_schindel2011 = pd.read_csv('data/original/before_schindel2011.tsv', sep='\t', usecols=cols_to_keep)
print(before_schindel2011.head())

  accession     scientific_name specimen_voucher  trace_count
0  FJ231517  Anas platyrhynchos              NaN            0
1  FJ231516  Anas platyrhynchos              NaN            0
2  FJ231515  Anas platyrhynchos              NaN            0
3  FJ231514  Anas platyrhynchos              NaN            0
4  FJ231513  Anas platyrhynchos              NaN            0


**Schindel 2011**

In [3]:
schindel2011 = pd.read_csv('data/original/schindel2011.tsv', sep='\t', usecols=cols_to_keep)
print(schindel2011.head())

  accession        scientific_name   specimen_voucher  trace_count
0  JQ176458      Tangara episcopus   USNM:Birds:A_542            2
1  JQ174585      Crypturellus soui   USNM:Birds:A_257            2
2  JQ175834   Picumnus spilogaster   USNM:Birds:A_325            2
3  JQ175833   Picumnus spilogaster   USNM:Birds:A_434            2
4  JQ175812  Phylloscopus schwarzi  USNM:Birds:620463            2


**Final dataset for this paper**

In [4]:
schindel2017 = pd.read_csv('data/original/schindel2017.tsv', sep='\t', usecols=cols_to_keep)
print(schindel2017.head())

  accession          scientific_name specimen_voucher  trace_count
0  JQ176654  Xiphorhynchus obsoletus       KU:O:89742            2
1  JQ176549         Trogon violaceus       KU:O:88933            2
2  JQ176510    Todirostrum maculatum       KU:O:90939            2
3  JQ176359    Tachyphonus luctuosus       KU:O:89078            2
4  JQ176343       Tachornis squamata       KU:O:91651            2


**Combining "prior" dataset with "final" dataset**

In [5]:
prior_plus_final = pd.concat([before_schindel2011, schindel2017])
print(len(prior_plus_final))

10999
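
The combined length equals the sum of the two inputs (8019 + 2980 = 10999), because `pd.concat` stacks rows without removing anything. Any accession present in both inputs would therefore be counted twice. The following is a minimal sketch of how one might check for that, using invented toy rows; the accession values and the names `prior`, `final`, `combined`, and `shared` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins for the two real tables (values are invented for illustration).
prior = pd.DataFrame({'accession': ['AA000001', 'AA000002', 'AA000003'],
                      'trace_count': [0, 0, 2]})
final = pd.DataFrame({'accession': ['BB000001', 'AA000003'],
                      'trace_count': [2, 2]})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating the row labels of each input.
combined = pd.concat([prior, final], ignore_index=True)

# concat stacks rows, so the length is the sum of the input lengths.
print(len(combined) == len(prior) + len(final))

# Accessions present in both inputs appear twice in the combined frame.
shared = set(prior['accession']) & set(final['accession'])
print(sorted(shared))
```

Whether the real "prior" and "final" files share any accessions is not checked in this notebook; the sketch only shows how such a check could look.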


## Running counts on each dataset

In [6]:
from collections import OrderedDict

In [7]:
dataset_list = OrderedDict([('Prior to Schindel 2011:', before_schindel2011),
                            ('Schindel 2011:', schindel2011),
                            ('Final USNM dataset:', schindel2017),
                            ('Prior plus final:', prior_plus_final)])

Removing subspecies from the scientific name, so that each record is reduced to genus plus species

In [8]:
for df in dataset_list.values():
    # Keep only the first two words (genus and species) of each name.
    name_parts = df['scientific_name'].str.split(' ')
    df['genus_species'] = name_parts.str.get(0) + ' ' + name_parts.str.get(1)
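
To illustrate what this step does, here is a small sketch on made-up names (the `names` Series is an assumption for demonstration, including a hypothetical genus-only entry). It applies the same split-and-rejoin logic: a trinomial collapses to its binomial, and a one-word name has no second element, so the result for that row is missing.

```python
import pandas as pd

# Invented example names; the last one is a hypothetical genus-only entry.
names = pd.Series(['Anas platyrhynchos',
                   'Anas platyrhynchos platyrhynchos',
                   'Anas'])

# Same logic as the cell above: first word + ' ' + second word.
parts = names.str.split(' ')
genus_species = parts.str.get(0) + ' ' + parts.str.get(1)
print(genus_species.tolist())

# nunique() ignores missing values by default, so a row without a
# second word would not be counted as a species.
print(genus_species.nunique())
```

Whether the real files contain any one-word names is not checked here; if they did, those rows would silently drop out of the species counts.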

\# of records

In [9]:
for label, df in dataset_list.items():
    print(label, len(df))

Prior to Schindel 2011: 8019
Schindel 2011: 2803
Final USNM dataset: 2980
Prior plus final: 10999


\# of records with at least 2 traces

In [10]:
for label, df in dataset_list.items():
    print(label, len(df[df['trace_count'] > 1]))

Prior to Schindel 2011: 2447
Schindel 2011: 2802
Final USNM dataset: 2968
Prior plus final: 5415


\# of unique species

In [11]:
for label, df in dataset_list.items():
    print(label, df['genus_species'].nunique())

Prior to Schindel 2011: 1521
Schindel 2011: 1287
Final USNM dataset: 1346
Prior plus final: 2597


\# of unique species with at least 2 traces

In [12]:
for label, df in dataset_list.items():
    print(label, df[df['trace_count'] > 1]['genus_species'].nunique())

Prior to Schindel 2011: 685
Schindel 2011: 1287
Final USNM dataset: 1345
Prior plus final: 1961


\# of species with a single record

In [13]:
for label, df in dataset_list.items():
    species_size = df.groupby('genus_species').size()
    print(label, len(species_size[species_size == 1]))

Prior to Schindel 2011: 265
Schindel 2011: 421
Final USNM dataset: 448
Prior plus final: 504
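
The five counts above could also be gathered into a single table with one row per dataset. The following is a sketch under stated assumptions, not the method used for the publication: the helper `summarize`, the column labels, and the toy data are all hypothetical, and the helper expects a `genus_species` column like the one created earlier.

```python
import pandas as pd
from collections import OrderedDict

def summarize(df):
    """Return the five counts computed above for one dataset (hypothetical helper)."""
    multi = df[df['trace_count'] > 1]              # records with at least 2 traces
    per_species = df.groupby('genus_species').size()
    return OrderedDict([
        ('records', len(df)),
        ('records_2plus_traces', len(multi)),
        ('species', df['genus_species'].nunique()),
        ('species_2plus_traces', multi['genus_species'].nunique()),
        ('single_record_species', int((per_species == 1).sum())),
    ])

# Invented toy data standing in for one of the real datasets.
toy = pd.DataFrame({
    'genus_species': ['Anas platyrhynchos', 'Anas platyrhynchos', 'Trogon violaceus'],
    'trace_count': [0, 2, 2],
})
toy_datasets = OrderedDict([('Toy dataset:', toy)])

# One row per dataset, one column per count.
table = pd.DataFrame([summarize(df) for df in toy_datasets.values()],
                     index=list(toy_datasets.keys()))
print(table)
```

Passing `dataset_list` in place of `toy_datasets` should produce one row per dataset with the same numbers as the individual cells above.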
