## Pulling down mercury-stained sheets from NMNH DWCA

This notebook attempts to rebuild, from scratch, the dataset needed to replicate "Applications of deep convolutional neural networks to digitized natural history collections" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5680669/).

In [1]:
import tarfile
import pandas as pd
import numpy as np

### Getting barcodes from Figshare

First, we download the image bundles from Figshare and read the barcode of each image from its filename. The bundles are published as two separate datasets: [stained](https://smithsonian.figshare.com/articles/dataset/Mercury-stained_botany_images_for_deep_learning/5423083) and [unstained](https://smithsonian.figshare.com/articles/dataset/Unstained_botany_images_for_deep_learning/5423098).

In [2]:
stained_barcodes = []
with tarfile.open("stained.tar.gz", "r:gz") as tar:
    for filename in tar.getnames():
        if filename.endswith('.jpg'):
            # Member names are expected to look like <folder>/<barcode>.jpg
            barcode = filename.split('/')[1].split('.')[0]
            stained_barcodes.append(barcode)
stained_barcodes[:5]

['00000140', '00000162', '00000185', '00000209', '00000231']

In [3]:
unstained_barcodes = []
with tarfile.open("unstained.tar.gz", "r:gz") as tar:
    for filename in tar.getnames():
        if filename.endswith('.jpg'):
            barcode = filename.split('/')[1].split('.')[0]
            unstained_barcodes.append(barcode)
unstained_barcodes[:5]

['00000001', '00000003', '00000015', '00000020', '00000021']
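
The two cells above repeat the same loop, so it could be folded into one helper. The following is a minimal sketch, not the notebook's actual code: `barcodes_from_tar` is a hypothetical name, it takes the stem of the last path component (rather than the second component, as the cells above do), and the in-memory archive with its `stained/` folder is made up so the example runs without the Figshare downloads.

```python
import io
import tarfile
from pathlib import PurePosixPath


def barcodes_from_tar(tar_source, suffix=".jpg"):
    """Return the file stem of every archive member whose name ends with `suffix`.

    `tar_source` may be a filesystem path (str) or an open binary file object.
    """
    kwargs = {"name": tar_source} if isinstance(tar_source, str) else {"fileobj": tar_source}
    with tarfile.open(mode="r:gz", **kwargs) as tar:
        return [
            PurePosixPath(name).stem
            for name in tar.getnames()
            if name.endswith(suffix)
        ]


# Build a tiny in-memory archive standing in for stained.tar.gz (made-up layout).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for member in ("stained/00000140.jpg", "stained/00000162.jpg", "stained/README.txt"):
        info = tarfile.TarInfo(name=member)
        info.size = 0
        tar.addfile(info, io.BytesIO(b""))
buffer.seek(0)

demo_barcodes = barcodes_from_tar(buffer)
print(demo_barcodes)  # ['00000140', '00000162']
```

With the real downloads in place, the same helper would be called as `barcodes_from_tar("stained.tar.gz")` and `barcodes_from_tar("unstained.tar.gz")`.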

In [66]:
stained_barcode_df = pd.DataFrame(stained_barcodes, columns=['barcode'])
stained_barcode_df['stain_status'] = 'stained'
stained_barcode_df.head()

| | barcode | stain_status |
|---|---|---|
| 0 | 00000140 | stained |
| 1 | 00000162 | stained |
| 2 | 00000185 | stained |
| 3 | 00000209 | stained |
| 4 | 00000231 | stained |


In [67]:
unstained_barcode_df = pd.DataFrame(unstained_barcodes, columns=['barcode'])
unstained_barcode_df['stain_status'] = 'unstained'
unstained_barcode_df.head()

| | barcode | stain_status |
|---|---|---|
| 0 | 00000001 | unstained |
| 1 | 00000003 | unstained |
| 2 | 00000015 | unstained |
| 3 | 00000020 | unstained |
| 4 | 00000021 | unstained |


In [68]:
combined_barcode_df = pd.concat([stained_barcode_df, unstained_barcode_df])
combined_barcode_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15553 entries, 0 to 7776
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   barcode       15553 non-null  object
 1   stain_status  15553 non-null  object
dtypes: object(2)
memory usage: 364.5+ KB


In [69]:
combined_barcode_df['stain_status'].value_counts()

unstained    7777
stained      7776
Name: stain_status, dtype: int64

In [70]:
combined_barcode_df.to_csv('barcodes_from_figshare.tsv', index=False, sep='\t')
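
One caveat when reading this file back: the barcodes are zero-padded strings, and `pd.read_csv` will infer an integer column and drop the leading zeros unless told otherwise. The sketch below shows the difference on a two-row stand-in written to an in-memory buffer; the variable names are illustrative only.

```python
import io

import pandas as pd

# Two-row stand-in for combined_barcode_df
demo_df = pd.DataFrame(
    {"barcode": ["00000140", "00000001"], "stain_status": ["stained", "unstained"]}
)

# Write to an in-memory buffer instead of barcodes_from_figshare.tsv
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False, sep="\t")

# Default type inference turns the barcodes into integers
buffer.seek(0)
inferred = pd.read_csv(buffer, sep="\t")

# Forcing a string dtype keeps the zero padding intact
buffer.seek(0)
as_text = pd.read_csv(buffer, sep="\t", dtype={"barcode": str})

print(inferred["barcode"].tolist())  # [140, 1]
print(as_text["barcode"].tolist())   # ['00000140', '00000001']
```

Keeping the barcodes as strings matters later, because the matching against the DarwinCore multimedia descriptions is done on the zero-padded form.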

### Pulling multimedia data from NMNH DarwinCore Archive

Here is the link to the Smithsonian NMNH IPT: https://collections.nmnh.si.edu/ipt/resource?r=nmnh_extant_dwc-a

In [48]:
multimedia_df = pd.read_csv('nmnh_multimedia_1_35.tsv.gz', 
                            dtype={'providerLiteral':'category',
                                   'description':'string'},
                            sep='\t', compression='gzip')
multimedia_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10357547 entries, 0 to 10357546
Data columns (total 19 columns):
 #   Column                     Dtype   
---  ------                     -----   
 0   id                         object  
 1   identifier                 object  
 2   type                       object  
 3   title                      object  
 4   rights                     object  
 5   rights.1                   object  
 6   UsageTerms                 object  
 7   WebStatement               object  
 8   licenseLogoURL             object  
 9   source                     object  
 10  creator                    object  
 11  providerLiteral            category
 12  description                string  
 13  subjectCategoryVocabulary  object  
 14  scientificName             float64 
 15  accessURI                  object  
 16  format                     object  
 17  PixelXDimension            int64   
 18  PixelYDimension            int64   
dtypes: category(1), flo

In [49]:
multimedia_df['providerLiteral'].value_counts()

Smithsonian Institution, NMNH, Botany                   9257341
Smithsonian Institution, NMNH, Mammals                   577050
Smithsonian Institution, NMNH, Invertebrate Zoology      184797
Smithsonian Institution, NMNH, Entomology                167637
Smithsonian Institution, NMNH, Fishes                    134203
Smithsonian Institution, NMNH, Birds                      23401
Smithsonian Institution, NMNH, Amphibians & Reptiles      13118
Name: providerLiteral, dtype: int64

In [50]:
len(multimedia_df[multimedia_df.duplicated(keep='first')])

5942814

**Uh oh: 5,942,814 of the 10,357,547 rows are exact duplicates of an earlier row, so more than half of the file is redundant.**

In [51]:
multimedia_df = multimedia_df.drop_duplicates()

In [52]:
multimedia_df['providerLiteral'].value_counts()

Smithsonian Institution, NMNH, Botany                   3314799
Smithsonian Institution, NMNH, Mammals                   577050
Smithsonian Institution, NMNH, Invertebrate Zoology      184706
Smithsonian Institution, NMNH, Entomology                167637
Smithsonian Institution, NMNH, Fishes                    134193
Smithsonian Institution, NMNH, Birds                      23401
Smithsonian Institution, NMNH, Amphibians & Reptiles      12947
Name: providerLiteral, dtype: int64
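
Comparing the two `value_counts()` outputs shows that 5,942,542 of the 5,942,814 removed rows came from Botany. Rather than diffing the outputs by eye, the per-provider breakdown can be computed before dropping anything. This is a sketch on made-up toy data (the short provider names and `id` values are placeholders), using only `duplicated`, `assign`, and `groupby`.

```python
import pandas as pd

# Toy stand-in for multimedia_df: three copies of row "a", one "b", two copies of "c"
demo = pd.DataFrame(
    {
        "id": ["a", "a", "a", "b", "c", "c"],
        "providerLiteral": ["Botany", "Botany", "Botany", "Botany", "Fishes", "Fishes"],
    }
)

# True for every row that repeats an earlier, fully identical row
dup_mask = demo.duplicated(keep="first")

# Count total rows and duplicate rows per provider
summary = (
    demo.assign(is_duplicate=dup_mask)
    .groupby("providerLiteral")["is_duplicate"]
    .agg(total="size", duplicates="sum")
)
summary["unique"] = summary["total"] - summary["duplicates"]
print(summary)
```

On the toy data, Botany has 4 rows of which 2 are duplicates, and Fishes has 2 rows of which 1 is a duplicate.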

In [53]:
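is_botany = multimedia_df['providerLiteral'] == 'Smithsonian Institution, NMNH, Botany'
# case=False matches "Barcode" and "barcode"; na=False treats missing descriptions as non-matches
mentions_barcode = multimedia_df['description'].str.contains('barcode', case=False, na=False)
botany_barcodes = multimedia_df[is_botany & mentions_barcode].copy()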
botany_barcodes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3219718 entries, 1091 to 10357542
Data columns (total 19 columns):
 #   Column                     Dtype   
---  ------                     -----   
 0   id                         object  
 1   identifier                 object  
 2   type                       object  
 3   title                      object  
 4   rights                     object  
 5   rights.1                   object  
 6   UsageTerms                 object  
 7   WebStatement               object  
 8   licenseLogoURL             object  
 9   source                     object  
 10  creator                    object  
 11  providerLiteral            category
 12  description                string  
 13  subjectCategoryVocabulary  object  
 14  scientificName             float64 
 15  accessURI                  object  
 16  format                     object  
 17  PixelXDimension            int64   
 18  PixelYDimension            int64   
dtypes: category(1), f

In [54]:
botany_barcodes.sample(5)

Unnamed: 0,id,identifier,type,title,rights,rights.1,UsageTerms,WebStatement,licenseLogoURL,source,creator,providerLiteral,description,subjectCategoryVocabulary,scientificName,accessURI,format,PixelXDimension,PixelYDimension
2500654,http://n2t.net/ark:/65665/3b1e3673e-bd2d-4e25-...,http://collections.nmnh.si.edu/media/index.php...,image,01769128.tif,CC0,CC0,https://creativecommons.org/publicdomain/zero/...,https://naturalhistory.si.edu/research/nmnh-co...,https://www.si.edu/sites/default/files/icons/c...,"US National Herbarium, Department of Botany, N...",Conveyor Belt,"Smithsonian Institution, NMNH, Botany",Barcode 01769128,Specimen/Object,,http://n2t.net/ark:/65665/m3c4316b25-64a6-4e45...,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",6879,9123
2223761,http://n2t.net/ark:/65665/3dd0180a6-871f-49ea-...,http://collections.nmnh.si.edu/media/index.php...,image,01643868.tif,Usage Conditions Apply,Usage Conditions Apply,https://www.si.edu/termsofuse,https://naturalhistory.si.edu/research/nmnh-co...,,"US National Herbarium, Department of Botany, N...",Conveyor Belt,"Smithsonian Institution, NMNH, Botany",Barcode 01643868,Specimen/Object,,http://n2t.net/ark:/65665/m35eb950e2-1e99-43d7...,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",6872,9121
3624104,http://n2t.net/ark:/65665/312b0f61d-5771-4a4f-...,http://collections.nmnh.si.edu/media/index.php...,image,00342035.tif,CC0,CC0,https://creativecommons.org/publicdomain/zero/...,https://naturalhistory.si.edu/research/nmnh-co...,https://www.si.edu/sites/default/files/icons/c...,"US National Herbarium, Department of Botany, N...",Conveyor Belt,"Smithsonian Institution, NMNH, Botany",Barcode 00342035,Specimen/Object,,http://n2t.net/ark:/65665/m30c7857c7-eb83-4caa...,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",6807,8993
3746212,http://n2t.net/ark:/65665/370ff85e9-dbdc-4863-...,http://collections.nmnh.si.edu/media/index.php...,image,00380605.tif,CC0,CC0,https://creativecommons.org/publicdomain/zero/...,https://naturalhistory.si.edu/research/nmnh-co...,https://www.si.edu/sites/default/files/icons/c...,"US National Herbarium, Department of Botany, N...",Conveyor Belt,"Smithsonian Institution, NMNH, Botany",Barcode 00380605,Specimen/Object,,http://n2t.net/ark:/65665/m3fb4c2f09-e0e9-4363...,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",6759,8937
3149346,http://n2t.net/ark:/65665/320b4799f-e0d5-4e0e-...,http://collections.nmnh.si.edu/media/index.php...,image,00083001.tif,CC0,CC0,https://creativecommons.org/publicdomain/zero/...,https://naturalhistory.si.edu/research/nmnh-co...,https://www.si.edu/sites/default/files/icons/c...,"US National Herbarium, Department of Botany, N...",Conveyor Belt,"Smithsonian Institution, NMNH, Botany",Barcode 00083001,Specimen/Object,,http://n2t.net/ark:/65665/m341691bbf-7a51-43d0...,"tiff, jpeg, jpeg, jpeg, jpeg, jpeg",6923,9111


In [55]:
botany_barcodes.sample(5).to_dict(orient='records')

[{'id': 'http://n2t.net/ark:/65665/383c692aa-19df-4cff-a531-797416649565',
  'identifier': 'http://collections.nmnh.si.edu/media/index.php?irn=12287866',
  'type': 'image',
  'title': '02685579.tif',
  'rights': 'CC0',
  'rights.1': 'CC0',
  'UsageTerms': 'https://creativecommons.org/publicdomain/zero/1.0/',
  'WebStatement': 'https://naturalhistory.si.edu/research/nmnh-collections/museum-collections-policies',
  'licenseLogoURL': 'https://www.si.edu/sites/default/files/icons/cc0.svg',
  'source': 'US National Herbarium, Department of Botany, NMNH, Smithsonian Institution',
  'creator': 'Conveyor Belt',
  'providerLiteral': 'Smithsonian Institution, NMNH, Botany',
  'description': 'Barcode 02685579',
  'subjectCategoryVocabulary': 'Specimen/Object',
  'scientificName': nan,
  'accessURI': 'http://n2t.net/ark:/65665/m391ad16ee-637d-4fa3-84a2-d50b676df4a5',
  'format': 'tiff, jpeg, jpeg, jpeg, jpeg, jpeg',
  'PixelXDimension': 6777,
  'PixelYDimension': 8945},
 {'id': 'http://n2t.net/ark

In [56]:
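def extract_barcode(description_text):
    """Return the token that follows the word 'barcode' in a description.

    Returns NaN when 'barcode' is the last token. Assumes 'barcode' appears as a
    standalone whitespace-separated token; otherwise .index() raises ValueError.
    """
    space_split = description_text.lower().split()
    barcode_idx = space_split.index('barcode')
    if len(space_split) == barcode_idx + 1:
        return np.nan
    # Drop trailing punctuation, e.g. "00000209." or "00000209,"
    return space_split[barcode_idx + 1].strip('.').strip(',')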

In [57]:
botany_barcodes['barcode'] = botany_barcodes['description'].apply(extract_barcode)
botany_barcodes[['description','barcode']].sample(20)

Unnamed: 0,description,barcode
3480829,Barcode 00453051,453051
3911517,Barcode 00162711,162711
2580578,Barcode 02497538,2497538
230563,Barcode 01417384,1417384
2995726,Barcode 01578181,1578181
3134622,Barcode 02961835,2961835
3009970,"Swallen, J. R. 1436, US National Herbarium She...",1166232
10281760,Barcode 03948337,3948337
3623044,Barcode 00030398,30398
1337155,"Palmer, E. 48, US National Herbarium Sheet 823...",489097


In [58]:
botany_barcodes['barcode_len'] = botany_barcodes['barcode'].str.len()
botany_barcodes['barcode_len'].value_counts()

8.0     3219629
7.0           3
19.0          1
11.0          1
2.0           1
Name: barcode_len, dtype: int64
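
The length counts show that a handful of extracted values are not 8 characters long, which means the token after "barcode" is not always a clean barcode. A stricter alternative is to accept only an 8-digit number directly after the word. This is a sketch, not the function used above: `extract_barcode_strict` and `BARCODE_PATTERN` are hypothetical names, and the last three example descriptions are invented to exercise the edge cases.

```python
import re

# "barcode", whitespace, then exactly eight digits that end at a word boundary
BARCODE_PATTERN = re.compile(r"barcode\s+(\d{8})\b", flags=re.IGNORECASE)


def extract_barcode_strict(description_text):
    """Return the 8-digit number that follows 'barcode', or None if there is none."""
    if not isinstance(description_text, str):
        return None  # covers NaN / missing descriptions
    match = BARCODE_PATTERN.search(description_text)
    return match.group(1) if match else None


examples = [
    "Barcode 01769128",
    "US National Herbarium specimen, barcode 00000209",
    "Sheet with barcode 00000209.",   # invented: trailing punctuation
    "Barcode",                        # invented: nothing after the keyword
    "barcode 12",                     # invented: too short to be a barcode
]
results = [extract_barcode_strict(text) for text in examples]
print(results)  # ['01769128', '00000209', '00000209', None, None]
```

Values that fail the pattern come back as `None`, so they can be inspected separately instead of silently entering the join against the Figshare barcodes.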

In [59]:
print(len(stained_barcodes))

7776


In [61]:
stained_multimedia = botany_barcodes[botany_barcodes['barcode'].isin(stained_barcodes)]
len(stained_multimedia)

10922

In [65]:
len(stained_multimedia[stained_multimedia.duplicated(subset='barcode')])

3201

**Uh oh, even after dropping fully duplicated records, 3,201 of the 10,922 matching rows repeat a barcode that has already appeared.**

In [64]:
stained_multimedia[stained_multimedia.duplicated(subset='barcode',keep=False)].sort_values('barcode').head(10).to_dict(orient='records')

[{'id': 'http://n2t.net/ark:/65665/3ce233bf5-d0e1-4967-9d59-acc0726d5588',
  'identifier': 'http://collections.nmnh.si.edu/media/index.php?irn=10142667',
  'type': 'image',
  'title': '00000209.tif',
  'rights': 'CC0',
  'rights.1': 'CC0',
  'UsageTerms': 'https://creativecommons.org/publicdomain/zero/1.0/',
  'WebStatement': 'https://naturalhistory.si.edu/research/nmnh-collections/museum-collections-policies',
  'licenseLogoURL': 'https://www.si.edu/sites/default/files/icons/cc0.svg',
  'source': 'Specimen from Department of Botany, NMNH, Smithsonian Institution',
  'creator': 'Ingrid P. Lin',
  'providerLiteral': 'Smithsonian Institution, NMNH, Botany',
  'description': 'US National Herbarium specimen, barcode 00000209',
  'subjectCategoryVocabulary': 'Specimen/Object',
  'scientificName': nan,
  'accessURI': 'http://n2t.net/ark:/65665/m3325dc959-7428-4973-b804-f4c8273d7cbc',
  'format': 'tiff, jpeg, jpeg, jpeg, jpeg, jpeg',
  'PixelXDimension': 7319,
  'PixelYDimension': 10319,
  'b

The first duplicate barcode (00000209) appears to have 2 different specimen IDs:

* http://n2t.net/ark:/65665/3ce233bf5-d0e1-4967-9d59-acc0726d5588
* http://n2t.net/ark:/65665/350435d2c-8228-4f1c-b2ed-99b4bb0ab20d

Both links show the same herbarium sheet but carry slightly different specimen data, because two different specimens are mounted on that one sheet!
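
To check how common this is, the distinct specimen IDs can be counted per barcode. The sketch below uses made-up `ark-N` identifiers in place of the real `id` URLs; it also shows one way to keep every specimen ID for a sheet instead of discarding all but the first record.

```python
import pandas as pd

# Toy stand-in for stained_multimedia; the ark-N ids are placeholders
demo = pd.DataFrame(
    {
        "barcode": ["00000209", "00000209", "00000231", "00000231", "00000140"],
        "id": ["ark-1", "ark-2", "ark-3", "ark-3", "ark-4"],
    }
)

# Number of distinct specimen records attached to each sheet barcode
specimens_per_barcode = demo.groupby("barcode")["id"].nunique()
multi_specimen = specimens_per_barcode[specimens_per_barcode > 1]
print(multi_specimen)

# Alternative to keeping only the first row: collect all specimen ids per barcode
ids_per_barcode = demo.groupby("barcode")["id"].agg(lambda ids: sorted(set(ids)))
print(ids_per_barcode["00000209"])  # ['ark-1', 'ark-2']
```

In the toy data only `00000209` has more than one specimen; `00000231` appears twice but with the same ID, which is a different kind of duplicate.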

In [43]:
stained_multimedia = stained_multimedia.drop_duplicates(subset='barcode',keep='first')

In [44]:
stained_multimedia['rights'].value_counts()

CC0                       7061
Usage Conditions Apply     660
Name: rights, dtype: int64

In [45]:
stained_multimedia_barcodes = stained_multimedia['barcode'].unique().tolist()
len(stained_multimedia_barcodes)

7721

In [19]:
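# Figshare barcodes with no matching record in the DwC-A multimedia table
stained_multimedia_barcode_set = set(stained_multimedia_barcodes)  # fast membership tests
for bc in stained_barcodes:
    if bc not in stained_multimedia_barcode_set:
        print(bc)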

00006043
00007912
00013762
00026972
00093315
00093335
00093392
00093415
00093417
00093444
00093445
00093482
00093516
00093517
00093536
00093538
00093539
00093540
00093542
00093546
00093550
00093553
00093573
00093579
00093580
00093592
00093594
00093597
00093632
00093671
00093672
00093703
00093715
00093736
00093751
00093826
00093827
00093874
00093880
00093905
00093907
00093916
00093932
00093940
00094078
00094096
00094104
00094105
00094137
00098766_packet
00343811
00512568
00997924
01049663
01050380
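
These 55 missing barcodes account for the gap between the 7,776 Figshare images and the 7,721 matched records. One entry, `00098766_packet`, is not a plain 8-digit barcode, so it can never match as-is. The sketch below separates such irregular names from the genuinely missing barcodes; `split_missing` is a hypothetical helper, and the toy DwC-A list (including `00098766`) is invented for illustration, not taken from the archive.

```python
import re


def split_missing(figshare_barcodes, dwca_barcodes):
    """Return (missing, irregular): barcodes absent from the DwC-A list,
    and the subset of those that are not exactly eight digits."""
    dwca_set = set(dwca_barcodes)  # set lookup keeps the scan fast
    missing = [bc for bc in figshare_barcodes if bc not in dwca_set]
    irregular = [bc for bc in missing if not re.fullmatch(r"\d{8}", bc)]
    return missing, irregular


# Made-up inputs standing in for stained_barcodes and stained_multimedia_barcodes
demo_figshare = ["00000140", "00006043", "00098766_packet"]
demo_dwca = ["00000140", "00098766"]

missing, irregular = split_missing(demo_figshare, demo_dwca)
print(missing)    # ['00006043', '00098766_packet']
print(irregular)  # ['00098766_packet']
```

Whether an irregular name such as `00098766_packet` should be mapped back to its numeric prefix is a judgment call that needs a look at the image itself.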


In [31]:
from PIL import Image
import requests
import io

In [41]:
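test_irn_url = 'http://collections.nmnh.si.edu/media/index.php?irn=10086661'

test_ark_url = 'http://n2t.net/ark:/65665/m36e1bbdd7-8c33-4a87-ab66-6f47ea582d90'

width, height = np.nan, np.nan
image_url = test_ark_url
filename = 'test_irn_download.jpg'

try:
    r = requests.get(image_url, timeout=20)
    print(r.status_code)
    # Only try to decode the body when the server reports a JPEG
    if r.headers.get('Content-Type') == 'image/jpeg':
        try:
            with Image.open(io.BytesIO(r.content)) as im:
                width, height = im.size
                im.save(filename)
        except (OSError, ValueError):
            # Pillow could not read or save the image
            print('Could not read or save image from ' + image_url)
except requests.exceptions.RequestException as err:
    # Covers timeouts, connection errors, and other request failures
    print('Request failed for ' + image_url + ': ' + str(err))
print({'width': width, 'height': height})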

200
{'width': 7319, 'height': 10319}
