## Digital Specimen Demo

This simple demonstrator illustrates a few basic functions of 
a [Digital Specimen](https://dissco.tech/2020/03/31/what-is-a-digital-specimen/) -- a FAIR Digital Object. The goal is to highlight persistent data linking, data packaging, and the use of a persistent identifier to gather information for an external scientific workflow. 


This notebook shows: 
* How to get data about what is inside a Digital Specimen.
* How to get more information about an element through the Data Type Registry (DTR).
* How to get the EBI accession ID from a Digital Specimen and pass it to the enasearch API.
* How to find out whether a [UniProt](https://www.uniprot.org/) (a database of protein sequences) TrEMBL ID exists in the Digital Specimen repository.
* How to get the UniProt TrEMBL ID from a Digital Specimen identifier (NSId) and use it in another action or workflow.



In [1]:
# Load the libraries for markdown display
from IPython.display import Markdown as md

Digital Specimen is a FAIR Digital Object acting as a digital twin on the Internet for a specific physical specimen.

### What is inside this Digital Specimen?
Given a PID of a Digital Specimen (Natural Science Identifier: NSId), show the elements before retrieving the data.

In [3]:
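import json
import urllib.request
from IPython.display import display

# Resolve the Digital Specimen PID and load its JSON record
with urllib.request.urlopen("http://nsidr.org/objects/20.5000.1025/c2618387bb0932270617") as url:
    dsdata = json.loads(url.read().decode())

# Report the number of top-level elements, then list their names
display(md("<b>This Digital Specimen contains {} elements".format(len(dsdata))))
for element in dsdata:
    print(element)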

<b>This Digital Specimen contains 21 elements 

physicalSpecimenId
gbifId
institutionCode
authorReference
collectionCode
collectionDate
id
midslevel
catOfLifeReference
locality
country
scientificName
catalogNumber
recordedBy
decimalLatLon
countryCode
dwcaContent
colContent
ebiSearchResults
wikidata
wikidata_info


#### Tell me more about physicalSpecimenId
At the moment the Data Type Registry element is not in the Digital Specimen record, so let's assume a data type is attached to this element. Using the entry in the DTR, we can find out more about the element. The entry could also be used to verify the element before asking for it. 

In [4]:
import requests
# This is the PID of the data type in the Data Type Registry
url = 'http://dtr-test.pidconsortium.eu/objects/21.T11148/4ac7431c2616a213481e'
req = requests.get(url)
jsondtr = json.loads(req.content)
display(md("#### Response from the DTR"))
display(md("<b>Description"))
display(md('{}'.format(jsondtr['description'])))
display(md("<b>Provenance"))
display(md('{}'.format(jsondtr['provenance'])))

#### Response from the DTR

<b>Description

An identifier of a physical specimen in a (e.g., natural sciences) collection

<b>Provenance

{'contributors': [{'identifiedUsing': 'ORCID', 'name': 'Alex Hardisty', 'details': '0000-0002-0767-4310'}], 'creationDate': '2020-05-28T13:57:05.007Z', 'lastModificationDate': '2020-05-28T13:57:05.007Z'}
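
The provenance above is printed as a raw dictionary. As a minimal sketch, assuming the DTR response keeps the keys visible in that output (contributors, creationDate, lastModificationDate), a small helper could turn it into readable lines. The function name format_provenance is hypothetical and not part of any DTR client.

```python
def format_provenance(provenance):
    """Return readable lines for a DTR provenance dictionary."""
    lines = []
    for person in provenance.get("contributors", []):
        name = person.get("name", "unknown")
        scheme = person.get("identifiedUsing")
        details = person.get("details")
        if scheme and details:
            lines.append("Contributor: {} ({}: {})".format(name, scheme, details))
        else:
            lines.append("Contributor: {}".format(name))
    # Dates are kept as the ISO strings returned by the registry
    for key, label in (("creationDate", "Created"),
                       ("lastModificationDate", "Last modified")):
        if key in provenance:
            lines.append("{}: {}".format(label, provenance[key]))
    return lines


# Sample copied from the DTR response shown above
sample = {
    "contributors": [
        {"identifiedUsing": "ORCID", "name": "Alex Hardisty",
         "details": "0000-0002-0767-4310"}
    ],
    "creationDate": "2020-05-28T13:57:05.007Z",
    "lastModificationDate": "2020-05-28T13:57:05.007Z",
}
for line in format_provenance(sample):
    print(line)
```

In the notebook, the same helper could be called as format_provenance(jsondtr['provenance']).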

Now get the content of the physicalSpecimenId.

In [5]:
physSpec = dsdata['physicalSpecimenId']
display(md("<b>Physical specimen identifier"))
print(physSpec)

<b>Physical specimen identifier

http://coldb.mnhn.fr/catalognumber/mnhn/im/2013-7767


Organisations create and store physical specimen identifiers in different ways. 

### Sequence data 
Now get the sequence information. 
The element ebiSearchResults contains information gathered from the EBI portal. 

In [6]:
with urllib.request.urlopen("http://nsidr.org/objects/20.5000.1025/c2618387bb0932270617?jsonPointer=/ebiSearchResults/0") as url:
    data = json.loads(url.read().decode())
    display(md(' #### Accession ID and Specimen Source'))
    display(md('{}'.format(data['acc'])))
    display(md('{}'.format(data['fields'][54]['values'])))

 #### Accession ID and Specimen Source

KJ591664

['Pygmaepterys pointieri voucher MNHN-IM-2013-7767 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial.']

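We can now verify that the sequence was created using the physical specimen from MNHN: the voucher MNHN-IM-2013-7767 in the description matches the physical specimen identifier retrieved earlier.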
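
This check can also be done programmatically. The sketch below assumes the MNHN identifier layout seen in this record (/catalognumber/institution/collection/number) and that the voucher label is the upper-cased, hyphen-joined form of those parts. Both function names are hypothetical, and other institutions will need their own rules.

```python
from urllib.parse import urlparse


def voucher_from_specimen_url(specimen_url):
    """Derive a voucher label such as 'MNHN-IM-2013-7767' from an MNHN-style URL.

    Assumes a path of the form /catalognumber/<institution>/<collection>/<number>.
    """
    parts = [p for p in urlparse(specimen_url).path.split("/") if p]
    if len(parts) < 4 or parts[0] != "catalognumber":
        raise ValueError("unexpected identifier layout: {}".format(specimen_url))
    institution, collection, number = parts[1], parts[2], parts[3]
    return "{}-{}-{}".format(institution.upper(), collection.upper(), number)


def mentions_voucher(description_values, specimen_url):
    """Return True if any description string contains the derived voucher label."""
    voucher = voucher_from_specimen_url(specimen_url)
    return any(voucher in value for value in description_values)


# Values copied from the outputs shown above
specimen_url = "http://coldb.mnhn.fr/catalognumber/mnhn/im/2013-7767"
description = [
    "Pygmaepterys pointieri voucher MNHN-IM-2013-7767 cytochrome oxidase "
    "subunit 1 (COI) gene, partial cds; mitochondrial."
]
print(voucher_from_specimen_url(specimen_url))
print(mentions_voucher(description, specimen_url))
```

Matching on the voucher text avoids relying on the position of a field in the EBI search result, which may differ between records.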

The sequence is not stored in this Digital Specimen. To access it, we can use the accession ID with the ENA API to get more information.

In [7]:
# Load the enasearch API
import enasearch

# Retrieve the sequence for the accession ID in FASTA format
enaresult = enasearch.retrieve_data(
    ids="KJ591664",
    download=None,
    display="fasta",
    file=None,
    offset=0,
    header=None,
    expanded=True,
)
print(enaresult[0].seq)

AACTTTATATATTTTATTTGGTATATGATCAGGACTTGTTGGAACAGCCCTAAGTTTACTTATTCGGGCAGAGCTAGGACAGCCAGGAGCCTTACTTGGAGACGATCAGCTATATAATGTTATTGTAACGGCACATGCCTTTGTAATAATTTTTTTTTTAGTAATGCCGATAATGATTGGAGGGTTTGGAAATTGATTGGTTCCTTTAATATTAGGAGCTCCAGATATGGCTTTTCCTCGATTAAATAATATAAGATTTTGACTGTTGCCTCCTGCTCTTTTATTGTTGCTGTCTTCAGCTGCTGTAGAAAGCGGAGTGGGAACAGGATGGACTGTTTATCCTCCTTTAGCTGGAAATTTAGCACATGCTGGAGGTTCAGTAGATCTTGCAATTTTTTCTCTTCACTTAGCGGGAGCTTCATCTATCCTAGGAGCTGTTAATTTTATTACAACTATTGTCAATATACGTTGAACAGGAATACAGTTTGAACGGCTTCCATTATTTGTATGATCAGTGAAAATTACAGCCATTTTGTTATTGCTTTCTTTACCTGTGTTAGCTGGAGCTATTACTATACTTTTAACTGATCGAAATTTTAACACAGCGTTCTTTGATCCTGCAGGAGGTGGAGATCCTATTCTCTACCAACATCTATTC
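
If the enasearch package is not available, the FASTA text could be fetched over HTTP and parsed with the standard library. The sketch below is an illustration under assumptions: the URL pattern is the FASTA endpoint of the ENA Browser API as I understand it and should be checked against the ENA documentation, and the helper names and the sample header are made up for the example. Only the first 20 bases of the sequence above are used in the sample.

```python
def ena_fasta_url(accession):
    # Assumed ENA Browser API pattern; verify before relying on it
    return "https://www.ebi.ac.uk/ena/browser/api/fasta/{}".format(accession)


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Offline sample: hypothetical header, first 20 bases of the sequence above
sample = ">ENA|KJ591664|KJ591664.1 example header\nAACTTTATAT\nATTTTATTTG\n"
for header, seq in parse_fasta(sample):
    print(header, len(seq))
```

With network access, the text returned by requests.get(ena_fasta_url("KJ591664")).text could be passed to parse_fasta in the same way.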


### Other queries
I have a UniProt TrEMBL ID and want to see which digital and physical specimens are associated with it.

In [8]:
dsquery = "http://nsidr.org/objects/?query=A0A0A7CD09"
req = requests.get(dsquery)
jsonres = json.loads(req.content)

In [9]:
display(md("<b>Display results from digital specimen repo"))
display(md("<b>PID | Type | Physical specimen identifier |Scientific Name"))
# Show one line per matching digital specimen
for i in jsonres['results']:
    display(md('{} | {} | {}| {}'.format(i['id'], i['type'], i['content']['physicalSpecimenId'], i['content']['scientificName'])))

<b>Display results from digital specimen repo

<b>PID | Type | Physical specimen identifier |Scientific Name

20.5000.1025/7b03a195ff167782f6ad | DigitalSpecimen | http://coldb.mnhn.fr/catalognumber/mnhn/im/2013-7985| Pygmaepterys pointieri Garrigues & Merle, 2014

20.5000.1025/216738ba25f899a2d1fc | DigitalSpecimen | http://coldb.mnhn.fr/catalognumber/mnhn/im/2013-8488| Pygmaepterys pointieri Garrigues & Merle, 2014

20.5000.1025/03de9f562e2de9c62948 | DigitalSpecimen | http://coldb.mnhn.fr/catalognumber/mnhn/im/2013-8433| Pygmaepterys pointieri Garrigues & Merle, 2014

20.5000.1025/c2618387bb0932270617 | DigitalSpecimen | http://coldb.mnhn.fr/catalognumber/mnhn/im/2013-7767| Pygmaepterys pointieri Garrigues & Merle, 2014
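
The loop above displays each result as a separate line of text. As a sketch, the same fields could be assembled into a single Markdown table so that the notebook renders real columns. The function name results_to_markdown_table is hypothetical; the keys (id, type, content, physicalSpecimenId, scientificName) are the ones used in the cell above.

```python
def results_to_markdown_table(results):
    """Build one Markdown table from the repository's search results."""
    rows = [
        "| PID | Type | Physical specimen identifier | Scientific name |",
        "| --- | --- | --- | --- |",
    ]
    for item in results:
        content = item.get("content", {})
        rows.append("| {} | {} | {} | {} |".format(
            item.get("id", ""),
            item.get("type", ""),
            content.get("physicalSpecimenId", ""),
            content.get("scientificName", ""),
        ))
    return "\n".join(rows)


# Two results copied from the output above
sample_results = [
    {"id": "20.5000.1025/7b03a195ff167782f6ad", "type": "DigitalSpecimen",
     "content": {
         "physicalSpecimenId": "http://coldb.mnhn.fr/catalognumber/mnhn/im/2013-7985",
         "scientificName": "Pygmaepterys pointieri Garrigues & Merle, 2014"}},
    {"id": "20.5000.1025/c2618387bb0932270617", "type": "DigitalSpecimen",
     "content": {
         "physicalSpecimenId": "http://coldb.mnhn.fr/catalognumber/mnhn/im/2013-7767",
         "scientificName": "Pygmaepterys pointieri Garrigues & Merle, 2014"}},
]
print(results_to_markdown_table(sample_results))
```

In the notebook, display(md(results_to_markdown_table(jsonres['results']))) would render the table.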

### Other Action (beyond linking and query)
Given an NSId (the persistent identifier for a digital specimen), I would like to perform another operation on the protein sequence. This is a good example of reproducibility and reusability (from FAIR): certain workflows can be repeated and reused after publication using the PID. 


In [10]:
from Bio import SeqIO
import urllib.request
import urllib.parse

In [11]:
mynsid = "http://nsidr.org/objects/20.5000.1025/c2618387bb0932270617"

# http://nsidr.org/objects/20.5000.1025/c2618387bb0932270617?jsonPointer=/ebiSearchResults/1/fields/24
jsonPointer = "?jsonPointer=/ebiSearchResults/1/fields/24"

# Fetch only the field that holds the UniProt ID
with urllib.request.urlopen(mynsid + jsonPointer) as fetch:
    uniprot = json.loads(fetch.read().decode())

# 'values' is a list; this assumes its first entry is the UniProt ID
uniprotID = uniprot['values'][0]
uniproturl = "http://www.uniprot.org/uniprot/" + uniprotID + ".xml"
print(uniproturl)

http://www.uniprot.org/uniprot/A0A0A7CD09.xml
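
The jsonPointer query parameter used above asks the repository to return only part of the record. To illustrate what such a pointer selects, here is a minimal local resolver following the JSON Pointer rules of RFC 6901. It is a sketch for nested dicts and lists, not the repository's implementation, and the sample record is a heavily trimmed, hypothetical stand-in for the real one.

```python
def resolve_json_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against nested dicts and lists."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # RFC 6901 escapes: '~1' means '/', '~0' means '~' (in this order)
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


# Hypothetical, trimmed record with the same nesting as the pointer above
record = {
    "ebiSearchResults": [
        {"acc": "KJ591664"},
        {"fields": [{"values": ["A0A0A7CD09"]}]},
    ]
}
print(resolve_json_pointer(record, "/ebiSearchResults/0/acc"))
print(resolve_json_pointer(record, "/ebiSearchResults/1/fields/0/values"))
```

Resolving the pointer on the server keeps the response small, while a local resolver like this one is handy when the full record has already been downloaded.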


Show the record name and the annotations attached to this record so I can do 'X'. 

In [12]:
xmlhandle = urllib.request.urlopen("http://www.uniprot.org/uniprot/A0A0A7CD09.xml")
record = SeqIO.read(xmlhandle, "uniprot-xml")
#print(record)
print(record.name)
print(record.annotations['comment_catalyticactivity'])
print(record.annotations['comment_similarity'])

A0A0A7CD09_9CAEN
['4 [Fe(II)cytochrome c] + 4 H(+) + O2 = 4 [Fe(III)cytochrome c] + 2 H2O']
['Belongs to the heme-copper respiratory oxidase family.']


This is a very simple example to illustrate the advantage of a PID attached to a digital specimen (which is linked to a physical specimen) and the possibility of creating an actionable knowledge unit and a FAIR data lifecycle. 