# Analyzing COVID-19 literature with OGER
In this very short tutorial we show how to annotate an article using the OGER web API.

More information about OGER:
- [OGER Github repository](https://github.com/OntoGene/OGER)
- [Introduction to OGER web APIs](https://covid19.nlp.idsia.ch/oger-rest.html)
- [OGER introduction video](https://files.ifi.uzh.ch/cl/rinaldi/ISMB2020/ismb-609.mp4)
- [BLAH7 OGER project page](https://coree.github.io/blah7/)

In [11]:
import requests
import pandas as pd
import io

## Analyzing PubMed articles with OGER

Annotate an article obtained from a remote repository (fetch).
We can request the annotation of a PubMed abstract through the fetch endpoint. The request takes three route parameters: the source (currently only PubMed, `pubmed`, and PubMed Central, `pmc`, are enabled), the output format, and the ID of the document that we intend to process.

`'https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/[Source]/[Output_Format]/[Document_ID]'`

Here we use PubMed as the source and request TSV output.
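The URL template above can be filled in programmatically. The following is a minimal sketch: the helper name `build_fetch_url` and the constant `BASE_URL` are illustrative and not part of OGER, and the source check only mirrors the two sources named in the text.

```python
BASE_URL = 'https://pub.cl.uzh.ch/projects/ontogene/oger'

def build_fetch_url(source, output_format, document_id):
    """Compose a fetch URL from its three route parameters (hypothetical helper)."""
    # Only the two sources named in the text are accepted here.
    if source not in ('pubmed', 'pmc'):
        raise ValueError(f'unsupported source: {source!r}')
    return f'{BASE_URL}/fetch/{source}/{output_format}/{document_id}'

url = build_fetch_url('pubmed', 'tsv', 32895599)
print(url)
```

The resulting string can be passed to `curl` or to `requests.get` exactly like a hand-written URL.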

First identify articles in PubMed that discuss a drug of interest together with COVID-19. For example:
- [PubMed:32445440](https://pubmed.ncbi.nlm.nih.gov/32445440/) *Remdesivir for the Treatment of Covid-19 - Final Report*
- [PubMed:32895599](https://pubmed.ncbi.nlm.nih.gov/32895599/) *Favipiravir: A new and emerging antiviral option in COVID-19*


### Shell
For example, let's assume that we want to process the PubMed abstract 32895599.

The output is written to your shell (standard output), so you might want to redirect it to a file, e.g.

`curl https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/tsv/32895599 > 32895599.tsv`

In [12]:
! curl https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/tsv/32895599

DOCUMENT ID	TYPE	START POSITION	END POSITION	MATCHED TERM	PREFERRED FORM	ENTITY ID	ZONE	SENTENCE ID	ORIGIN	UMLS CUI
32895599	chemical	0	11	Favipiravir	favipiravir	CHEBI:134722	Title	S1	ChEBI	CUI-less
32895599	chemical	0	11	Favipiravir	favipiravir	C462182	Title	S1	MeSH supp (Chemicals and Drugs)	C1138226
32895599	chemical	0	11	Favipiravir	favipiravir	C462182	Title	S1	CTD (MESH)	C1138226
32895599	chemical	32	41	antiviral	antiviral agent	CHEBI:22587	Title	S1	ChEBI	CUI-less
32895599	disease	52	60	COVID-19	COVID-19	C000657245	Title	S1	MeSH supp (Diseases)	CUI-less
32895599	organ/tissue	114	119	globe	eyeball of camera-type eye	UBERON:0010230	Abstract	S2	Uberon	CUI-less
32895599	gene/protein	125	129	SARS	serine--tRNA ligase, cytoplasmic (rat)	PR:Q6P799	Abstract	S2	Protein Ontology	CUI-less
32895599	gene/protein	125	129	SARS	HTH-type transcriptional regulator SarS (Staphylococcus aureus subsp. aureus NCTC 8325)	PR:Q2G1N7	Abstract	S2	Protein Ontology	CUI-less
32895599	gene/protein	125	

### Python


In [13]:
url = 'https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/tsv/32895599'

In [14]:
req = requests.get(url)  
df = pd.read_csv(io.StringIO(req.text), sep='\t')

In [15]:
df.columns = [c.lower().replace(' ', '_') for c in df.columns]

In [16]:
df.head()

Unnamed: 0,document_id,type,start_position,end_position,matched_term,preferred_form,entity_id,zone,sentence_id,origin,umls_cui
0,32895599,chemical,0,11,Favipiravir,favipiravir,CHEBI:134722,Title,S1,ChEBI,CUI-less
1,32895599,chemical,0,11,Favipiravir,favipiravir,C462182,Title,S1,MeSH supp (Chemicals and Drugs),C1138226
2,32895599,chemical,0,11,Favipiravir,favipiravir,C462182,Title,S1,CTD (MESH),C1138226
3,32895599,chemical,32,41,antiviral,antiviral agent,CHEBI:22587,Title,S1,ChEBI,CUI-less
4,32895599,disease,52,60,COVID-19,COVID-19,C000657245,Title,S1,MeSH supp (Diseases),CUI-less
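In the output above, the same text span appears once per matching terminology entry (for example, *Favipiravir* is listed three times). One possible way to collapse such rows is to group them by span. The sketch below works on plain dictionaries so that it runs without a network call; the sample rows are copied from the output above and the helper name `group_by_span` is illustrative.

```python
from collections import defaultdict

# First five rows of the output above, reduced to the columns used here.
rows = [
    {'start_position': 0, 'end_position': 11, 'matched_term': 'Favipiravir', 'entity_id': 'CHEBI:134722'},
    {'start_position': 0, 'end_position': 11, 'matched_term': 'Favipiravir', 'entity_id': 'C462182'},
    {'start_position': 0, 'end_position': 11, 'matched_term': 'Favipiravir', 'entity_id': 'C462182'},
    {'start_position': 32, 'end_position': 41, 'matched_term': 'antiviral', 'entity_id': 'CHEBI:22587'},
    {'start_position': 52, 'end_position': 60, 'matched_term': 'COVID-19', 'entity_id': 'C000657245'},
]

def group_by_span(rows):
    """Map each (start, end, term) span to the distinct entity IDs assigned to it."""
    spans = defaultdict(list)
    for row in rows:
        key = (row['start_position'], row['end_position'], row['matched_term'])
        # Keep each entity ID once, in order of first appearance.
        if row['entity_id'] not in spans[key]:
            spans[key].append(row['entity_id'])
    return dict(spans)

grouped = group_by_span(rows)
for (start, end, term), ids in grouped.items():
    print(start, end, term, ids)
```

With the DataFrame from the previous cell, the same function should accept `df.to_dict('records')`.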


#### Exercise: Reconstructing annotated sentences
To reconstruct the sentences, you need a different OGER output format, one that includes the text as well as the entities. Try the `text_tsv` format.

In [17]:
url = 'https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/text_tsv/32895599'

req = requests.get(url)  

df = pd.read_csv(io.StringIO(req.text), sep='\t')
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df.head()

Unnamed: 0,document_id,type,start_position,end_position,matched_term,preferred_form,entity_id,zone,sentence_id,origin,umls_cui
0,32895599,chemical,0,11,Favipiravir,favipiravir,CHEBI:134722,Title,S1,ChEBI,CUI-less
1,32895599,chemical,0,11,Favipiravir,favipiravir,C462182,Title,S1,MeSH supp (Chemicals and Drugs),C1138226
2,32895599,chemical,0,11,Favipiravir,favipiravir,C462182,Title,S1,CTD (MESH),C1138226
3,32895599,,11,12,:,,,,S1,,
4,32895599,,13,14,A,,,,S1,,
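One possible approach to the exercise is sketched below. It assumes, based on the rows shown above, that `text_tsv` emits one row per token or annotated span with document-level character offsets, and that a gap between two offsets corresponds to whitespace in the original text. The helper name `reconstruct_sentences` is illustrative, and the sample rows are copied from the output above.

```python
# Rows copied from the text_tsv output above, reduced to the columns used here.
rows = [
    {'sentence_id': 'S1', 'start_position': 0, 'end_position': 11, 'matched_term': 'Favipiravir'},
    {'sentence_id': 'S1', 'start_position': 0, 'end_position': 11, 'matched_term': 'Favipiravir'},
    {'sentence_id': 'S1', 'start_position': 0, 'end_position': 11, 'matched_term': 'Favipiravir'},
    {'sentence_id': 'S1', 'start_position': 11, 'end_position': 12, 'matched_term': ':'},
    {'sentence_id': 'S1', 'start_position': 13, 'end_position': 14, 'matched_term': 'A'},
]

def reconstruct_sentences(rows):
    """Rebuild the text of each sentence from rows carrying character offsets."""
    pieces = {}    # sentence_id -> list of text fragments
    last_end = {}  # sentence_id -> end offset of the last fragment kept
    # Sort by start offset; at equal starts, take the longest span first.
    ordered = sorted(rows, key=lambda r: (r['start_position'], -r['end_position']))
    for row in ordered:
        sid = row['sentence_id']
        start, end = row['start_position'], row['end_position']
        # Skip rows that repeat or overlap text already emitted
        # (the same span is listed once per terminology match).
        if sid in last_end and start < last_end[sid]:
            continue
        fragments = pieces.setdefault(sid, [])
        # A gap between offsets is rendered as a single space (assumption).
        if fragments and start > last_end[sid]:
            fragments.append(' ')
        fragments.append(row['matched_term'])
        last_end[sid] = end
    return {sid: ''.join(fragments) for sid, fragments in pieces.items()}

print(reconstruct_sentences(rows))
```

With the DataFrame from the cell above, the function should accept `df.to_dict('records')`. Note that pandas parses tokens such as `NA` as missing values by default; passing `keep_default_na=False` to `pd.read_csv` is one way to keep them as strings.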


## Analyzing plain text articles with OGER

We can request the annotation of local data by sending a POST request to the `upload` endpoint, with route parameters that specify the input and output formats.

The request URL is composed of the base URL (https://pub.cl.uzh.ch/projects/ontogene/oger/), the target endpoint (`upload`), the input format and the output format, so the final URL is:

`https://pub.cl.uzh.ch/projects/ontogene/oger/upload/[Input_Format]/[Output_Format]`
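The composition described above can be sketched with the Python standard library alone. The helper name `build_txt_upload_request` is illustrative, and encoding the payload as UTF-8 is an assumption rather than something the OGER documentation quoted here states. The request object is only built, not sent.

```python
import urllib.request

BASE_URL = 'https://pub.cl.uzh.ch/projects/ontogene/oger'

def build_txt_upload_request(text, output_format='tsv'):
    """Prepare (but do not send) a POST request that uploads plain text."""
    url = f'{BASE_URL}/upload/txt/{output_format}'
    return urllib.request.Request(
        url,
        data=text.encode('utf-8'),  # assumption: the service accepts UTF-8
        headers={'Content-Type': 'text/plain'},
        method='POST',
    )

request = build_txt_upload_request('We analyzed data on the first 425 confirmed cases in Wuhan.')
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen(request)` would send it; the response body is then available as bytes through `.read()`.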


### Shell
Below is an example request. Here the uploaded data is raw text (`txt`), the requested output format is a tab-separated table (`tsv`), and the text to be annotated is passed in the POST payload.

In [18]:
! curl --location \
--request POST 'https://pub.cl.uzh.ch/projects/ontogene/oger/upload/txt/tsv' \
--header 'Content-Type: text/plain' \
--data-raw 'The initial cases of novel coronavirus (2019-nCoV)-infected \
pneumonia (NCIP) occurred in Wuhan, Hubei Province, China, in December 2019 \
and January 2020. \
We analyzed data on the first 425 confirmed cases in Wuhan to \
determine the epidemiologic characteristics of NCIP. We collected \
information on demographic characteristics, exposure history, and \
illness timelines of laboratory-confirmed cases of NCIP that had been \
reported by January 22, 2020.' 

DOCUMENT ID	TYPE	START POSITION	END POSITION	MATCHED TERM	PREFERRED FORM	ENTITY ID	ZONE	SENTENCE ID	ORIGIN	UMLS CUI
unknown	organism	27	38	coronavirus	Coronavirus	D017934		S1	MeSH desc (Organisms)	C0206419
unknown	organism	40	49	2019-nCoV	severe acute respiratory syndrome coronavirus 2	C000656484		S1	MeSH supp (Organisms)	CUI-less
unknown	disease	61	70	pneumonia	Pneumonia	D011014		S1	MeSH desc (Diseases)	C0032285
unknown	disease	61	70	pneumonia	Pneumonia	D011014		S1	CTD (MESH)	C0032285


### Python
Here we use plain text as the input format and request TSV output.

In [19]:
url = 'https://pub.cl.uzh.ch/projects/ontogene/oger/upload/txt/tsv?dict=509f822aaf527390'

body = 'The initial cases of novel coronavirus (2019-nCoV)-infected pneumonia (NCIP) occurred in Wuhan, Hubei Province, China, in December 2019 and January 2020. We analyzed data on the first 425 confirmed cases in Wuhan to determine the epidemiologic characteristics of NCIP. We collected information on demographic characteristics, exposure history, and illness timelines of laboratory-confirmed cases of NCIP that had been reported by January 22, 2020.'

headers = {'Content-Type': 'text/plain'}

In [20]:
req = requests.post(url, data=body, headers=headers)

df = pd.read_csv(io.StringIO(req.text), sep='\t')
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df.head()

Unnamed: 0,document_id,type,start_position,end_position,matched_term,preferred_form,entity_id,zone,sentence_id,origin,umls_cui
0,unknown,organism,27,38,coronavirus,Coronavirus,D017934,,S1,MeSH desc (Organisms),C0206419
1,unknown,organism,40,49,2019-nCoV,severe acute respiratory syndrome coronavirus 2,C000656484,,S1,MeSH supp (Organisms),CUI-less
2,unknown,disease,60,69,pneumonia,Pneumonia,D011014,,S1,MeSH desc (Diseases),C0032285
3,unknown,disease,60,69,pneumonia,Pneumonia,D011014,,S1,CTD (MESH),C0032285


#### Exercise: Changing output format

Supported output formats:

| `[Output_Format]` value | content-type              | description |
| :--------------- | :-------------------------: | :----------- |
| tsv             | text/tab-separated-values | entities in a tab-separated table |
| xml             | text/xml                  | entities in a simple, self-explanatory XML format |
| text_tsv        | text/tab-separated-values | text and entities in a tab-separated table |
| bioc            | text/xml                  | text and entities in [BioC](http://bioc.sourceforge.net/) XML |
| bioc_json       | application/json          | text and entities in [BioC JSON](https://github.com/ncbi-nlp/BioC-JSON) |
| pubanno_json    | application/json          | text and entities in [PubAnnotator JSON](http://www.pubannotation.org/docs/annotation-format/) |
| pubtator        | text/plain                | text and entities in [PubTator format](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/tutorial/index.html#ExportannotationinPubTator) (mixture of pipe- and tab-separated text) |
| pubtator_fbk    | text/plain                | a variant of the above, with slightly different entity attributes |
| odin            | text/xml                  | text and entities in [ODIN](http://www.ontogene.org/odin) XML |
| odin_custom     | text/xml                  | text and entities in [ODIN](http://www.ontogene.org/odin) XML, with customisable CSS |

In [21]:
output_format = 'bioc_json'
url = f'https://pub.cl.uzh.ch/projects/ontogene/oger/upload/txt/{output_format}?dict=509f822aaf527390'

In [22]:
body = 'The initial cases of novel coronavirus (2019-nCoV)-infected pneumonia (NCIP) occurred in Wuhan, Hubei Province, China, in December 2019 and January 2020. We analyzed data on the first 425 confirmed cases in Wuhan to determine the epidemiologic characteristics of NCIP. We collected information on demographic characteristics, exposure history, and illness timelines of laboratory-confirmed cases of NCIP that had been reported by January 22, 2020.'

headers = {'Content-Type': 'text/plain'}

In [23]:
req = requests.post(url, data=body, headers=headers)
req.content

b'{\n  "source": "",\n  "date": "",\n  "key": "",\n  "infons": {},\n  "documents": [\n    {\n      "relations": [],\n      "id": "unknown",\n      "infons": {},\n      "passages": [\n        {\n          "relations": [],\n          "text": "The initial cases of novel coronavirus (2019-nCoV)-infected pneumonia (NCIP) occurred in Wuhan, Hubei Province, China, in December 2019 and January 2020. We analyzed data on the first 425 confirmed cases in Wuhan to determine the epidemiologic characteristics of NCIP. We collected information on demographic characteristics, exposure history, and illness timelines of laboratory-confirmed cases of NCIP that had been reported by January 22, 2020.",\n          "infons": {\n            "type": ""\n          },\n          "offset": 0,\n          "annotations": [\n            {\n              "text": "coronavirus",\n              "id": "1",\n              "infons": {\n                "preferred_form": "Coronavirus",\n                "original_resource": "M
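The BioC JSON response nests annotations inside passages inside documents. The sketch below walks that structure using only the fields visible in the (truncated) output above; the sample string is a hand-reduced excerpt, not a full response, and the helper name `iter_annotations` is illustrative.

```python
import json

# Hand-reduced excerpt of the response above, keeping only visible fields.
sample = '''
{
  "documents": [
    {
      "id": "unknown",
      "passages": [
        {
          "offset": 0,
          "annotations": [
            {
              "text": "coronavirus",
              "id": "1",
              "infons": {"preferred_form": "Coronavirus"}
            }
          ]
        }
      ]
    }
  ]
}
'''

def iter_annotations(bioc):
    """Yield (document id, matched text, preferred form) for every annotation."""
    for document in bioc.get('documents', []):
        for passage in document.get('passages', []):
            for annotation in passage.get('annotations', []):
                infons = annotation.get('infons', {})
                yield document['id'], annotation['text'], infons.get('preferred_form')

annotations = list(iter_annotations(json.loads(sample)))
print(annotations)
```

For the live response, `req.json()` returns the parsed dictionary that can be passed to `iter_annotations` directly.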

## I/O formats
### Supported Sources 

| `Source` value | description |
| :-------------- | :----------- |
| pubmed         | PubMed abstract obtained directly from NCBI. |
| pmc            | PubMed Central full-text article obtained directly from NCBI. |

### Supported Input Formats

| `Input_Format` value | content-type     | description |
| :-------------- | :----------------: | :----------- |
| txt            | text/plain       | unstructured plain-text document |
| bioc           | text/xml         | document or collection in [BioC](http://bioc.sourceforge.net/) XML |
| bioc_json      | application/json | document or collection in [BioC JSON](https://github.com/ncbi-nlp/BioC-JSON) |
| pxml           | text/xml         | abstract in PubMed's citation XML |
| nxml           | text/xml         | article in PubMed Central's full-text XML |
| pxml.gz        | application/gzip | compressed collection of abstracts in Medline's citation XML |
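The table above can be turned into a small lookup so that the `Content-Type` header follows the chosen input format. This is a sketch: the names `INPUT_CONTENT_TYPES` and `headers_for` are illustrative, and whether the service rejects a mismatched header is not stated in the table.

```python
# Content types copied from the input format table above.
INPUT_CONTENT_TYPES = {
    'txt': 'text/plain',
    'bioc': 'text/xml',
    'bioc_json': 'application/json',
    'pxml': 'text/xml',
    'nxml': 'text/xml',
    'pxml.gz': 'application/gzip',
}

def headers_for(input_format):
    """Return request headers matching the given input format."""
    try:
        return {'Content-Type': INPUT_CONTENT_TYPES[input_format]}
    except KeyError:
        raise ValueError(f'unsupported input format: {input_format!r}') from None

print(headers_for('txt'))
```

The returned dictionary can be passed as the `headers` argument of `requests.post`, in place of the hand-written one used earlier.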


### Supported Output Formats

| `Output_Format` value | content-type              | description |
| :--------------- | :-------------------------: | :----------- |
| tsv             | text/tab-separated-values | entities in a tab-separated table |
| xml             | text/xml                  | entities in a simple, self-explanatory XML format |
| text_tsv        | text/tab-separated-values | text and entities in a tab-separated table |
| bioc            | text/xml                  | text and entities in [BioC](http://bioc.sourceforge.net/) XML |
| bioc_json       | application/json          | text and entities in [BioC JSON](https://github.com/ncbi-nlp/BioC-JSON) |
| pubanno_json    | application/json          | text and entities in [PubAnnotator JSON](http://www.pubannotation.org/docs/annotation-format/) |
| pubtator        | text/plain                | text and entities in [PubTator format](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/tutorial/index.html#ExportannotationinPubTator) (mixture of pipe- and tab-separated text) |
| pubtator_fbk    | text/plain                | a variant of the above, with slightly different entity attributes |
| odin            | text/xml                  | text and entities in [ODIN](http://www.ontogene.org/odin) XML |
| odin_custom     | text/xml                  | text and entities in [ODIN](http://www.ontogene.org/odin) XML, with customisable CSS |