# Subtask 2
## Find evidence for or against specific drugs for the treatment of COVID-19

The goal of this subtask is to find evidence for or against specific drugs for the treatment of COVID-19 within the OGER-annotated LitCovid dataset (either the abstract dataset or the full-text dataset can be used).

In [1]:
import requests
import io
import pandas as pd
from operator import itemgetter
import numpy as np

The following drugs in particular will be considered: 

| Drug | Concept IDs|  
| :---- | :----- |  
|[hydroxychloroquine](https://en.wikipedia.org/wiki/Hydroxychloroquine) | `RxNorm:5521, CHEBI:5801, MeSH:D006886, UMLS:C0020336`|  
|[remdesivir](https://en.wikipedia.org/wiki/Remdesivir)| `RxNorm:2284718, MeSH:C000606551, UMLS:C4726677, CHEBI:145994`|  
|[avigan (favipiravir)](https://en.wikipedia.org/wiki/Favipiravir)| `CHEBI:134722, MeSH:C462182, UMLS:C1138226`|  

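Define the concept IDs for each drug and for COVID-19. The RxNorm, MeSH, and UMLS IDs are written without their source prefix; ChEBI IDs keep the `CHEBI:` prefix.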

In [2]:
hyd_ids = ['5521', 'CHEBI:5801', 'D006886', 'C0020336']
rem_ids = ['2284718', 'C000606551', 'C4726677', 'CHEBI:145994']
avi_ids = ['CHEBI:134722', 'C462182', 'C1138226']

cov_ids = ['D000086382','D000086402','C000657245','C000656484'] # Covid-19 and SARS-CoV-2
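
The ID lists above can also be derived from the concept IDs in the table. The following is a minimal sketch, assuming that OGER reports RxNorm, MeSH, and UMLS identifiers without their source prefix while ChEBI identifiers keep theirs (the `entity_id` values in the OGER output further down suggest this). `to_entity_id` is a hypothetical helper, not part of OGER.

```python
def to_entity_id(concept_id):
    """Drop the source prefix from a concept ID, except for ChEBI IDs."""
    prefix, _, local_id = concept_id.partition(':')
    return concept_id if prefix == 'CHEBI' else local_id


# Concept IDs for hydroxychloroquine, copied from the table
table_ids = 'RxNorm:5521, CHEBI:5801, MeSH:D006886, UMLS:C0020336'
hyd_ids_check = [to_entity_id(c.strip()) for c in table_ids.split(',')]
print(hyd_ids_check)  # ['5521', 'CHEBI:5801', 'D006886', 'C0020336']
```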


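Helper functions to reconstruct a sentence from the token offsets in the OGER output. `rm_duplicates` drops repeated tokens and tokens nested inside a longer token that shares the same start or end offset.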

In [3]:
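def rm_duplicates(offset_words):
    # Drop exact duplicates first
    offset_words = np.unique(offset_words)
    # Keep a token unless a longer token shares its end or its start offset.
    # Building a new list avoids deleting from a list while iterating over it.
    r = []
    for b in offset_words:
        nested = any(
            (a[1] == b[1] and a[0] < b[0]) or (a[0] == b[0] and a[1] > b[1])
            for a in offset_words
        )
        if not nested:
            r.append(b)
    return r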

In [4]:
# Function to create a sentence, given the OGER .tsv output as a pandas DataFrame and the sentence ID
def get_sentence(df, sent_id):
    #_df = df.drop_duplicates(subset=['start_position']) # instead use rm_duplicates()
    offsets = df[df["sentence_id"]==sent_id][['start_position','end_position','matched_term']].to_records(index=False)
    offsets = rm_duplicates(offsets)
    
    # Get the character offsets where the sentence ends and starts
    max_offset = max(offsets,key=itemgetter(1))[1]
    min_offset = min(offsets,key=itemgetter(0))[0]
    
    # Create a list of spaces spanning the sentence, then fill in the words
    sent_len = max_offset-min_offset
    l = list(" "*(sent_len))
    for start_pos, end_pos, word in offsets:
        l[(start_pos-min_offset):(end_pos-min_offset)] = list(word)
    sent = "".join(l)
    return sent
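
The offset-filling idea behind `get_sentence` can be shown on a few tokens. This is a simplified sketch rather than the notebook's function: `reconstruct` is a hypothetical name, it takes plain tuples instead of a DataFrame, and the token offsets are illustrative.

```python
def reconstruct(tokens):
    """Rebuild text from (start, end, term) tuples by writing each term
    into a buffer of spaces at its character offset."""
    start = min(t[0] for t in tokens)
    end = max(t[1] for t in tokens)
    buf = [" "] * (end - start)
    for s, e, term in tokens:
        # Shift offsets so the first token starts at index 0
        buf[s - start:e - start] = list(term)
    return "".join(buf)


# Illustrative tokens; the gaps between them become spaces
tokens = [(11, 14, "for"), (15, 18, "the"), (19, 28, "Treatment")]
rebuilt = reconstruct(tokens)
print(rebuilt)  # for the Treatment
```

Because the buffer starts out as spaces, whitespace between tokens is restored without being stored in the annotation table.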

### Use OGER to annotate the articles
Identify articles from PubMed that mention both the drug of interest and COVID-19. Here we use the article 'Remdesivir for the Treatment of Covid-19 - Final Report' (`PubMed:32445440`).

In [5]:
COVID19_IDS = cov_ids
DRUG_IDS = rem_ids

In [6]:
url = 'https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/text_tsv/32445440' # example PMIDs: 32205204 (hydroxychloroquine), 32445440 (remdesivir)

In [7]:
req = requests.get(url)  

df = pd.read_csv(io.StringIO(req.text), sep='\t')
df.columns = [c.lower().replace(' ', '_') for c in df.columns]

In [8]:
df

Unnamed: 0,document_id,type,start_position,end_position,matched_term,preferred_form,entity_id,zone,sentence_id,origin,umls_cui
0,32445440,clinical_drug,0,10,Remdesivir,remdesivir,2284718,Title,S1,RxNorm,CUI-less
1,32445440,chemical,0,10,Remdesivir,remdesivir,C000606551,Title,S1,MeSH supp (Chemicals and Drugs),C4726677
2,32445440,chemical,0,10,Remdesivir,remdesivir,CHEBI:145994,Title,S1,ChEBI,CUI-less
3,32445440,chemical,0,10,Remdesivir,GS-5734,C000606551,Title,S1,CTD (MESH),C4279131
4,32445440,,11,14,for,,,,S1,,
...,...,...,...,...,...,...,...,...,...,...,...
475,32445440,,2093,2094,",",,,,S12,,
476,32445440,gene/protein,2095,2098,NCT,nicastrin (fruit fly),PR:Q9VC27,CONCLUSIONS,S12,Protein Ontology,CUI-less
477,32445440,,2095,2106,NCT04280705,,,,S12,,
478,32445440,,2106,2107,.,,,,S12,,


Find all sentences that refer to both COVID-19 and the drug of interest. 

In [9]:
sent_ids = df.sentence_id.unique()
found_sentences = []
for sent_id in sent_ids:
    # Check whether the sentence mentions the drug as well as COVID-19
    drug = df[df['sentence_id']==sent_id]['entity_id'].isin(DRUG_IDS).any()
    covid = df[df['sentence_id']==sent_id]['entity_id'].isin(COVID19_IDS).any()
    if drug and covid:
        sent = get_sentence(df, sent_id)
        print(f"{sent_id}:\n{sent}\n")
        found_sentences.append(sent)

S1:
Remdesivir for the Treatment of Covid-19 - Final Report.

S3:
We conducted a double-blind, randomized, placebo-controlled trial of intravenous remdesivir in adults who were hospitalized with Covid-19 and had evidence of lower respiratory tract infection.

S11:
Our data show that remdesivir was superior to placebo in shortening the time to recovery in adults who were hospitalized with Covid-19 and had evidence of lower respiratory tract infection.
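
The same co-occurrence check can be expressed without an explicit loop. The following is a sketch on an invented toy table, not the OGER output itself: the entity IDs come from the lists defined earlier, while the sentence layout and the names `toy`, `drug_ids`, and `covid_ids` are assumptions for illustration.

```python
import pandas as pd

# Toy annotation table: one row per token, with its sentence and entity ID
toy = pd.DataFrame({
    "sentence_id": ["S1", "S1", "S2", "S3", "S3"],
    "entity_id": ["2284718", "D000086382", "2284718", None, "D000086382"],
})
drug_ids = ["2284718", "C000606551", "C4726677", "CHEBI:145994"]
covid_ids = ["D000086382", "D000086402", "C000657245", "C000656484"]

# For each sentence: does any token carry a drug ID / a COVID-19 ID?
# sort=False keeps the sentences in document order.
by_sentence = toy["sentence_id"]
has_drug = toy["entity_id"].isin(drug_ids).groupby(by_sentence, sort=False).any()
has_covid = toy["entity_id"].isin(covid_ids).groupby(by_sentence, sort=False).any()

both = has_drug & has_covid
matching_ids = both[both].index.tolist()
print(matching_ids)  # ['S1']
```

Only `S1` is returned: `S2` mentions the drug alone and `S3` mentions COVID-19 alone.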



## Process multiple articles

In [23]:
import json
from tqdm.notebook import tqdm


COVID19_IDS = cov_ids
DRUG_IDS = rem_ids

In [25]:
def oger_pubmed_ids(pubmed_id, covid19_ids, drug_ids):
    # print(f'Start processing article {pubmed_id}')
    
    try:
        url = f'https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/text_tsv/{pubmed_id}' 
        req = requests.get(url)  

        df = pd.read_csv(io.StringIO(req.text), sep='\t')
        df.columns = [c.lower().replace(' ', '_') for c in df.columns]

        sent_ids = df.sentence_id.unique()
        found_sentences = []

        for sent_id in sent_ids:
            # Check whether the sentence mentions the drug as well as COVID-19
            drug = df[df['sentence_id']==sent_id]['entity_id'].isin(drug_ids).any()
            covid = df[df['sentence_id']==sent_id]['entity_id'].isin(covid19_ids).any()
            if drug and covid:
                sent = get_sentence(df, sent_id)
                #print(f"{sent_id}:\n{sent}\n")
                found_sentences.append(sent)
                
    except Exception as e:
        print(f'Error in {pubmed_id}: {e}')
        found_sentences = []
        
    
    return found_sentences, pubmed_id

In [26]:
with open('./pmid_sets/pmid-Remdesivir-set.txt', 'r') as f:
    pubmed_ids = [line.strip() for line in f]
    
pubmed_ids[:2], pubmed_ids[-1]

(['32445440', '32423584'], '32837398')

In [27]:
out_dict = {}
pbar = tqdm(pubmed_ids)
for pubmed_id in pbar:
    pbar.set_description("Processing article : %s" % pubmed_id)
    found_sentences, pubmed_id = oger_pubmed_ids(pubmed_id, cov_ids, rem_ids)
    out_dict[pubmed_id] = found_sentences


Error in 32809050: Error tokenizing data. C error: EOF inside string starting at row 32
Error in 32597995: Error tokenizing data. C error: EOF inside string starting at row 166
Error in 32730095: Error tokenizing data. C error: EOF inside string starting at row 57
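
The `EOF inside string` errors are typical of a field that contains an unbalanced double quote, which the default parser reads as the start of a quoted string. One possible workaround, not used for the results in this notebook, is to disable quote handling when reading the TSV. In the sketch below, `read_oger_tsv` is a hypothetical helper and the sample text is invented.

```python
import csv
import io

import pandas as pd


def read_oger_tsv(text):
    """Parse tab-separated text, treating quote characters as ordinary text."""
    df = pd.read_csv(io.StringIO(text), sep='\t', quoting=csv.QUOTE_NONE)
    df.columns = [c.lower().replace(' ', '_') for c in df.columns]
    return df


# Invented sample: the first token is a lone double quote
sample = 'start position\tend position\tmatched term\n0\t1\t"\n1\t6\thello\n'
df_sample = read_oger_tsv(sample)
print(df_sample['matched_term'].tolist())  # ['"', 'hello']
```

Applying this to the three failing articles would likely change the sentence counts, so any reported totals would need to be recomputed.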



In [28]:
with open('found_sentences_remde.json', 'w') as fp:
    json.dump(out_dict, fp)

In [29]:
articles_with_sent = 0
found_sentences = 0

for k, v in out_dict.items():
    
    if v:
        articles_with_sent += 1
        found_sentences += len(v)
        
articles_with_sent, found_sentences, len(out_dict)

(414, 708, 800)
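
To repeat this count for each drug, the cell above can be wrapped in a function. This is a small sketch with a hypothetical name (`summarize`) and placeholder article IDs; it returns the three numbers in the same order as the tuple above.

```python
def summarize(sentences_by_article):
    """Return (articles with at least one sentence, total sentences, total articles)."""
    n_articles = sum(1 for v in sentences_by_article.values() if v)
    n_sentences = sum(len(v) for v in sentences_by_article.values())
    return n_articles, n_sentences, len(sentences_by_article)


# Placeholder IDs and sentences, for illustration only
toy_results = {'111': ['a', 'b'], '222': [], '333': ['c']}
summary = summarize(toy_results)
print(summary)  # (2, 3, 3)
```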

#### Notes:

|Drug|Articles with matching sentences|Matching sentences|Total articles|
|:---|:---|:---|:---|
|Hydroxychloroquine|398|737|800|
|Avigan|140|222|279|
|Remdesivir|414|708|800|
        