This Jupyter notebook is one of many that fully document the data analysis completed for the publication "DNA Barcoding, Collection Management, and the Bird Collection in the Smithsonian’s National Museum of Natural History" by Schindel et al.

You can view all analysis notebooks, data, and figures in the GitHub repository here: https://github.com/MikeTrizna/USNMBirdDNABarcoding2017

# Creating Table 2 -- Species-level summary before and after adding the new dataset

In [1]:
import pandas as pd

In [2]:
# Load the tab-separated per-species distance summaries and print each row count.
before_schindel2011_summary = pd.read_csv('data/processed/before_schindel2011_distance_summary.tsv',
                                           sep='\t')
print(len(before_schindel2011_summary))
schindel2017_summary = pd.read_csv('data/processed/schindel2017_distance_summary.tsv',
                                   sep='\t')
print(len(schindel2017_summary))
combined_summary = pd.read_csv('data/processed/combined_distance_summary.tsv',
                               sep='\t')
print(len(combined_summary))
# Preview the columns shared by all three summaries.
print(combined_summary.head())

685
1344
1960
            scientific_name  count  max_intra  min_inter
0            Abrornis humei      1        NaN    0.07716
1         Abrornis inornata      3   0.002963    0.07716
2       Abrornis proregulus      3   0.006173    0.07870
3    Abroscopus albogularis      1        NaN    0.11040
4  Abroscopus superciliaris      1        NaN    0.11040
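
In the preview above, `max_intra` is `NaN` for every species with `count` equal to 1, presumably because a single record has no within-species distance to measure. The sketch below uses a small made-up frame (the `toy` values and `sp_*` names are hypothetical, not study data) to show how pandas treats those gaps: `median()` and `mean()` skip `NaN` by default, so singleton species drop out of the within-species statistics while still contributing to `min_inter`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same columns as the distance summaries.
toy = pd.DataFrame({
    'scientific_name': ['sp_a', 'sp_b', 'sp_c', 'sp_d'],
    'count': [1, 3, 3, 1],
    'max_intra': [np.nan, 0.002, 0.006, np.nan],
    'min_inter': [0.07, 0.07, 0.08, 0.11],
})

# Number of species with no within-species distance.
n_singletons = int(toy['max_intra'].isna().sum())

# skipna=True is the default, so only the two non-missing values are used.
median_intra = toy['max_intra'].median()

# min_inter has no gaps, so all four species contribute.
median_inter = toy['min_inter'].median()

print(n_singletons, median_intra, median_inter)
```

Running the same `isna().sum()` check on each real summary would show how many species actually feed the within-species rows of the tables that follow.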


In [3]:
from collections import OrderedDict

In [4]:
df_labels = OrderedDict([('Prior to Schindel 2011', before_schindel2011_summary),
                         ('Schindel 2017', schindel2017_summary),
                         ('Combined Datasets', combined_summary)])

In [5]:
median_dict = OrderedDict()
for title, df in df_labels.items():
    # Series.median() skips NaN, so species with no max_intra value are ignored.
    median_intra = df['max_intra'].median()
    median_inter = df['min_inter'].median()
    median_dict[title] = OrderedDict([
        ('Median maximum variability within species', '{:.2%}'.format(median_intra)),
        ('Median minimum divergence between species', '{:.2%}'.format(median_inter)),
        # Adjacent string literals are joined, which avoids stray spaces in the label.
        ('Ratio of median minimum divergence between to median '
         'maximum variability within', median_inter / median_intra)])
median_df = pd.DataFrame(median_dict)
median_df.style

| | Prior to Schindel 2011 | Schindel 2017 | Combined Datasets |
|---|---|---|---|
| Median maximum variability within species | 0.31% | 0.31% | 0.31% |
| Median minimum divergence between species | 6.63% | 7.87% | 7.05% |
| Ratio of median minimum divergence between to median maximum variability within | 21.4776 | 25.6603 | 22.9948 |


In [6]:
mean_dict = OrderedDict()
for title, df in df_labels.items():
    # Series.mean() skips NaN, so species with no max_intra value are ignored.
    mean_intra = df['max_intra'].mean()
    mean_inter = df['min_inter'].mean()
    mean_dict[title] = OrderedDict([
        ('Mean maximum variability within species', '{:.2%}'.format(mean_intra)),
        ('Mean minimum divergence between species', '{:.2%}'.format(mean_inter)),
        # Adjacent string literals are joined, which avoids stray spaces in the label.
        ('Ratio of mean minimum divergence between to mean '
         'maximum variability within', mean_inter / mean_intra)])
mean_df = pd.DataFrame(mean_dict)
mean_df.style

| | Prior to Schindel 2011 | Schindel 2017 | Combined Datasets |
|---|---|---|---|
| Mean maximum variability within species | 0.60% | 0.70% | 0.69% |
| Mean minimum divergence between species | 6.35% | 7.55% | 6.65% |
| Ratio of mean minimum divergence between to mean maximum variability within | 10.5683 | 10.7953 | 9.58865 |
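
The median and mean cells differ only in the statistic applied, and in both the ratio row is a ratio of two summary statistics rather than a summary of per-species ratios. As a possible refactor, the sketch below folds both cells into one helper. The names `summarize` and `toy` are hypothetical and not part of the original notebook, and the toy values are made up for illustration.

```python
from collections import OrderedDict

import numpy as np
import pandas as pd


def summarize(df, stat):
    """Build one table column for the statistic named by `stat` ('median' or 'mean')."""
    # Series.agg with a string name calls the matching method; NaN is skipped.
    intra = df['max_intra'].agg(stat)
    inter = df['min_inter'].agg(stat)
    label = stat.capitalize()
    return OrderedDict([
        ('{} maximum variability within species'.format(label),
         '{:.2%}'.format(intra)),
        ('{} minimum divergence between species'.format(label),
         '{:.2%}'.format(inter)),
        ('Ratio of {0} minimum divergence between to {0} '
         'maximum variability within'.format(stat), inter / intra),
    ])


# Hypothetical data with the same columns as the distance summaries.
toy = pd.DataFrame({
    'max_intra': [np.nan, 0.002, 0.006],
    'min_inter': [0.07, 0.07, 0.08],
})

median_row = summarize(toy, 'median')
mean_row = summarize(toy, 'mean')
print(pd.DataFrame({'toy median': median_row}))
```

With the real data, `pd.DataFrame({title: summarize(df, 'median') for title, df in df_labels.items()})` should produce a frame equivalent to `median_df`, assuming the three summaries are loaded as above.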
