# Analyzing COVID-19 literature with OGER
In this very short tutorial we show how to annotate an article using the OGER web API.

More information about OGER:
- [OGER Github repository](https://github.com/OntoGene/OGER)
- [Introduction to OGER web APIs](https://covid19.nlp.idsia.ch/oger-rest.html)
- [OGER introduction video](https://files.ifi.uzh.ch/cl/rinaldi/ISMB2020/ismb-609.mp4)
- [BLAH7 OGER project page](https://coree.github.io/blah7/)

In [1]:
import requests
import pandas as pd
import io

## Analyzing PubMed articles with OGER

Annotate an article obtained from a remote repository (fetch):

`'https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/[Source]/[Output_Format]/[Document_ID]'`

Here we use PubMed as the source and request the result as a TSV file.

Identify articles in PubMed that mention both a drug of interest and COVID-19. For example:
- [PubMed:32445440](https://pubmed.ncbi.nlm.nih.gov/32445440/) *Remdesivir for the Treatment of Covid-19 - Final Report*
- [PubMed:32895599](https://pubmed.ncbi.nlm.nih.gov/32895599/) *Favipiravir: A new and emerging antiviral option in COVID-19*
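
The fetch URL template above can be filled in programmatically when several articles are to be annotated. The following is a minimal sketch, not part of OGER itself: the helper name `build_fetch_url` and the constant `OGER_BASE` are hypothetical, and only the URL pattern is taken from this tutorial.

```python
from urllib.parse import quote

# Base address taken from the fetch URL template in this tutorial.
OGER_BASE = "https://pub.cl.uzh.ch/projects/ontogene/oger"


def build_fetch_url(source, output_format, document_id):
    """Fill the template /fetch/[Source]/[Output_Format]/[Document_ID].

    Hypothetical helper: it only assembles the URL string and
    does not contact the server.
    """
    parts = [quote(str(p), safe="") for p in (source, output_format, document_id)]
    return f"{OGER_BASE}/fetch/" + "/".join(parts)


# The two example articles listed above.
pmids = ["32445440", "32895599"]
urls = [build_fetch_url("pubmed", "tsv", pmid) for pmid in pmids]
print(urls[1])
```

Each resulting URL can then be passed to `requests.get` in the same way as the single URL used in the cells below.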


In [2]:
url = 'https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/tsv/32895599'

In [3]:
req = requests.get(url)
req.raise_for_status()  # stop here if the server returned an HTTP error
df = pd.read_csv(io.StringIO(req.text), sep='\t')

In [4]:
df.columns = [c.lower().replace(' ', '_') for c in df.columns]

In [5]:
df.head()

Unnamed: 0,document_id,type,start_position,end_position,matched_term,preferred_form,entity_id,zone,sentence_id,origin,umls_cui
0,32895599,chemical,0,11,Favipiravir,favipiravir,CHEBI:134722,Title,S1,ChEBI,CUI-less
1,32895599,chemical,0,11,Favipiravir,favipiravir,C462182,Title,S1,MeSH supp (Chemicals and Drugs),C1138226
2,32895599,chemical,0,11,Favipiravir,favipiravir,C462182,Title,S1,CTD (MESH),C1138226
3,32895599,chemical,32,41,antiviral,antiviral agent,CHEBI:22587,Title,S1,ChEBI,CUI-less
4,32895599,disease,52,60,COVID-19,COVID-19,C000657245,Title,S1,MeSH supp (Diseases),CUI-less
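
In the output above, the same mention (for example `Favipiravir` at positions 0 to 11) appears once per terminology that matched it. One possible way to get an overview is to group the rows by mention and collect the distinct entity IDs. This is a sketch using plain Python dictionaries: the function name `group_by_mention` is hypothetical, and the rows are copied by hand from the table above rather than fetched.

```python
from collections import defaultdict

# Rows copied from the df.head() output above (subset of columns).
rows = [
    {"start_position": 0, "end_position": 11, "matched_term": "Favipiravir",
     "entity_id": "CHEBI:134722", "origin": "ChEBI"},
    {"start_position": 0, "end_position": 11, "matched_term": "Favipiravir",
     "entity_id": "C462182", "origin": "MeSH supp (Chemicals and Drugs)"},
    {"start_position": 0, "end_position": 11, "matched_term": "Favipiravir",
     "entity_id": "C462182", "origin": "CTD (MESH)"},
    {"start_position": 32, "end_position": 41, "matched_term": "antiviral",
     "entity_id": "CHEBI:22587", "origin": "ChEBI"},
    {"start_position": 52, "end_position": 60, "matched_term": "COVID-19",
     "entity_id": "C000657245", "origin": "MeSH supp (Diseases)"},
]


def group_by_mention(rows):
    """Map (start, end, matched_term) to a sorted list of distinct entity IDs."""
    grouped = defaultdict(set)
    for row in rows:
        key = (row["start_position"], row["end_position"], row["matched_term"])
        grouped[key].add(row["entity_id"])
    return {key: sorted(ids) for key, ids in grouped.items()}


mentions = group_by_mention(rows)
for (start, end, term), ids in mentions.items():
    print(f"{start}-{end} {term}: {', '.join(ids)}")
```

With the full data frame, `df.to_dict('records')` should yield rows of the same shape.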


### Exercise: Reconstructing annotated sentences
To reconstruct the sentences, you need a different OGER output format. Try the `text_tsv` format, which returns the text along with the entities.

In [6]:
url = 'https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/text_tsv/32895599'

req = requests.get(url)  

df = pd.read_csv(io.StringIO(req.text), sep='\t')
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df.head()

Unnamed: 0,document_id,type,start_position,end_position,matched_term,preferred_form,entity_id,zone,sentence_id,origin,umls_cui
0,32895599,chemical,0,11,Favipiravir,favipiravir,CHEBI:134722,Title,S1,ChEBI,CUI-less
1,32895599,chemical,0,11,Favipiravir,favipiravir,C462182,Title,S1,MeSH supp (Chemicals and Drugs),C1138226
2,32895599,chemical,0,11,Favipiravir,favipiravir,C462182,Title,S1,CTD (MESH),C1138226
3,32895599,,11,12,:,,,,S1,,
4,32895599,,13,14,A,,,,S1,,
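
One possible approach to the exercise is sketched below. It assumes that `start_position` and `end_position` are character offsets into the document and that any gap between consecutive tokens is whitespace; neither assumption is confirmed by this tutorial. The function name `reconstruct_sentences` is hypothetical, and the rows are copied by hand from the output above.

```python
# Rows copied from the text_tsv output above (subset of columns).
rows = [
    {"start_position": 0, "end_position": 11, "matched_term": "Favipiravir", "sentence_id": "S1"},
    {"start_position": 0, "end_position": 11, "matched_term": "Favipiravir", "sentence_id": "S1"},
    {"start_position": 0, "end_position": 11, "matched_term": "Favipiravir", "sentence_id": "S1"},
    {"start_position": 11, "end_position": 12, "matched_term": ":", "sentence_id": "S1"},
    {"start_position": 13, "end_position": 14, "matched_term": "A", "sentence_id": "S1"},
]


def reconstruct_sentences(rows):
    """Rebuild sentence text from token rows, keyed by sentence_id."""
    # Keep one row per character span: entity spans are repeated
    # once per terminology that matched them.
    spans = {}
    for row in rows:
        spans.setdefault((row["start_position"], row["end_position"]), row)

    sentences = {}  # sentence_id -> [text so far, offset after the last token]
    for (start, end), row in sorted(spans.items()):
        sid = row["sentence_id"]
        if sid not in sentences:
            sentences[sid] = ["", start]
        text, cursor = sentences[sid]
        if start < cursor:
            # Simplification: skip spans that overlap text already emitted.
            continue
        gap = " " * (start - cursor)  # assumed: gaps between tokens are spaces
        sentences[sid] = [text + gap + row["matched_term"], end]
    return {sid: text for sid, (text, _) in sentences.items()}


print(reconstruct_sentences(rows))
```

On the five rows shown, this yields the beginning of the title for sentence `S1`; the complete response would be needed to rebuild whole sentences.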


## Analyzing plain text articles with OGER

Annotate an article sent by the client (upload):

`https://pub.cl.uzh.ch/projects/ontogene/oger/upload/[Input_Format]/[Output_Format]`

Here we use plain text as the input format and request the result as a TSV file.

In [None]:
url = 'https://pub.cl.uzh.ch/projects/ontogene/oger/upload/txt/tsv?dict=509f822aaf527390'

body = 'The initial cases of novel coronavirus (2019-nCoV)-infected pneumonia (NCIP) occurred in Wuhan, Hubei Province, China, in December 2019 and January 2020. We analyzed data on the first 425 confirmed cases in Wuhan to determine the epidemiologic characteristics of NCIP. We collected information on demographic characteristics, exposure history, and illness timelines of laboratory-confirmed cases of NCIP that had been reported by January 22, 2020.'

headers = {'Content-Type': 'text/plain'}

In [8]:
req = requests.post(url, data=body, headers=headers)
req.raise_for_status()  # stop here if the server returned an HTTP error

df = pd.read_csv(io.StringIO(req.text), sep='\t')
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df.head()

Unnamed: 0,document_id,type,start_position,end_position,matched_term,preferred_form,entity_id,zone,sentence_id,origin,umls_cui
0,unknown,organism,27,38,coronavirus,Coronavirus,D017934,,S1,MeSH desc (Organisms),C0206419
1,unknown,organism,40,49,2019-nCoV,severe acute respiratory syndrome coronavirus 2,C000656484,,S1,MeSH supp (Organisms),CUI-less
2,unknown,disease,60,69,pneumonia,Pneumonia,D011014,,S1,MeSH desc (Diseases),C0032285
3,unknown,disease,60,69,pneumonia,Pneumonia,D011014,,S1,CTD (MESH),C0032285
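
The pieces of the upload request above can be bundled into a small helper. This is a sketch under stated assumptions: the function name `prepare_upload` is hypothetical, the `dict` value is copied verbatim from the cell above (this tutorial does not explain what it selects), and encoding the body as UTF-8 is an assumption about what the server expects. Encoding explicitly avoids relying on how the HTTP library handles a `str` body.

```python
from urllib.parse import urlencode

# Base address taken from the upload URL template in this tutorial.
OGER_BASE = "https://pub.cl.uzh.ch/projects/ontogene/oger"


def prepare_upload(text, output_format="tsv", dict_id=None):
    """Return (url, data, headers) for a plain-text upload request.

    Hypothetical helper: it prepares the request but does not send it.
    """
    url = f"{OGER_BASE}/upload/txt/{output_format}"
    if dict_id is not None:
        url += "?" + urlencode({"dict": dict_id})
    data = text.encode("utf-8")  # assumption: the server accepts UTF-8 text
    headers = {"Content-Type": "text/plain"}
    return url, data, headers


url, data, headers = prepare_upload(
    "The initial cases of novel coronavirus (2019-nCoV)-infected pneumonia.",
    output_format="tsv",
    dict_id="509f822aaf527390",
)
print(url)
```

The returned values can be sent with `requests.post(url, data=data, headers=headers, timeout=30)`; the timeout value is an arbitrary choice.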


### Exercise: Changing output format

Supported output formats:

| `[Output_Format]` value | content-type              | description |
| :--------------- | :-------------------------: | :----------- |
| tsv             | text/tab-separated-values | entities in a tab-separated table |
| xml             | text/xml                  | entities in a simple, self-explanatory XML format |
| text_tsv        | text/tab-separated-values | text and entities in a tab-separated table |
| bioc            | text/xml                  | text and entities in [BioC](http://bioc.sourceforge.net/) XML |
| bioc_json       | application/json          | text and entities in [BioC JSON](https://github.com/ncbi-nlp/BioC-JSON) |
| pubanno_json    | application/json          | text and entities in [PubAnnotator JSON](http://www.pubannotation.org/docs/annotation-format/) |
| pubtator        | text/plain                | text and entities in [PubTator format](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/tutorial/index.html#ExportannotationinPubTator) (mixture of pipe- and tab-separated text) |
| pubtator_fbk    | text/plain                | a variant of the above, with slightly different entity attributes |
| odin            | text/xml                  | text and entities in [ODIN](http://www.ontogene.org/odin) XML |
| odin_custom     | text/xml                  | text and entities in [ODIN](http://www.ontogene.org/odin) XML, with customisable CSS |
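
Because the formats in the table come back with different content types, the response has to be parsed differently depending on the format requested. The following sketch dispatches on the `[Output_Format]` value using only the standard library. The function name `parse_oger_response` is hypothetical, and the two sample strings are shortened stand-ins written for illustration, not real server output.

```python
import csv
import io
import json

# Grouping derived from the content-type column of the table above.
TSV_FORMATS = {"tsv", "text_tsv"}
JSON_FORMATS = {"bioc_json", "pubanno_json"}


def parse_oger_response(text, output_format):
    """Parse a response body according to the requested output format."""
    if output_format in TSV_FORMATS:
        # One dict per row, keyed by the header line.
        return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    if output_format in JSON_FORMATS:
        return json.loads(text)
    # XML and PubTator variants are returned unparsed in this sketch.
    return text


# Made-up miniature inputs, only to show the return types.
tsv_sample = "type\tmatched_term\nchemical\tFavipiravir\n"
json_sample = '{"documents": []}'

print(parse_oger_response(tsv_sample, "tsv"))
print(parse_oger_response(json_sample, "bioc_json"))
```

In the notebook, `req.text` would take the place of the sample strings.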

In [22]:
output_format = 'bioc_json'
url = f'https://pub.cl.uzh.ch/projects/ontogene/oger/upload/txt/{output_format}?dict=509f822aaf527390'

In [23]:
body = 'The initial cases of novel coronavirus (2019-nCoV)-infected pneumonia (NCIP) occurred in Wuhan, Hubei Province, China, in December 2019 and January 2020. We analyzed data on the first 425 confirmed cases in Wuhan to determine the epidemiologic characteristics of NCIP. We collected information on demographic characteristics, exposure history, and illness timelines of laboratory-confirmed cases of NCIP that had been reported by January 22, 2020.'

headers = {'Content-Type': 'text/plain'}

In [25]:
req = requests.post(url, data=body, headers=headers)
req.content

b'{\n  "source": "",\n  "date": "",\n  "key": "",\n  "infons": {},\n  "documents": [\n    {\n      "relations": [],\n      "id": "unknown",\n      "infons": {},\n      "passages": [\n        {\n          "relations": [],\n          "text": "The initial cases of novel coronavirus (2019-nCoV)-infected pneumonia (NCIP) occurred in Wuhan, Hubei Province, China, in December 2019 and January 2020. We analyzed data on the first 425 confirmed cases in Wuhan to determine the epidemiologic characteristics of NCIP. We collected information on demographic characteristics, exposure history, and illness timelines of laboratory-confirmed cases of NCIP that had been reported by January 22, 2020.",\n          "infons": {\n            "type": ""\n          },\n          "offset": 0,\n          "annotations": [\n            {\n              "text": "coronavirus",\n              "id": "1",\n              "infons": {\n                "preferred_form": "Coronavirus",\n                "original_resource": "M