# AHLT - MIRI
# Drug Interaction Classifier

In [2]:
import nltk # NLTK Library
import xml.etree.ElementTree as ET # ElementTree Library
import os
import pandas as pd
import numpy as np
from xlm_parsers_functions import *

## TODO list
- Train the first ML model that will identify the drug names in a sentence
- Build the data structure that will hold the predictors and the response variable
- Train ML models with that data


## Objectives of this part
In this second part of the project, we focus on two tasks:
1. Detection of interactions between drugs
2. Classification of each drug-drug interaction according to one of the following types:
    - Advice: 'Interactions may be expected, and Uroxatral should not be used in combination with other alpha-blockers.'
    - Effect: 'In uninfected volunteers, 46% developed rash while receiving Sustiva and Clarithromycin.'
    - Mechanism: 'Grepafloxacin is a competitive inhibitor of the metabolism of theophylline.'
    - Int: 'The interaction of omeprazole and ketoconazole has been established.'

## Parsing the XML Files

### DrugBank and MedLine files

In [8]:
# Use xml_element.tag to get the name of an XML element
# Use xml_element.attrib to get all the attributes of an XML element as a dictionary

# Column names for the final dataset
headers = ['sentence_id', 'sentence_text', 'e1_id', 'e1_name', 'e1_type', 'e2_id', 'e2_name', 'e2_type', 'interaction']

# Parse the DrugBank files
drugs_dataset = []
parent_directory = '../LaboCase/small_train_DrugBank/'
for filename in os.listdir(parent_directory):
    if filename.endswith(".xml"):
        # Parse the file
        tree = ET.parse(os.path.join(parent_directory, filename))
        # Append this file's interactions (one list per drug pair) to the dataset
        drugs_dataset.extend(listDDIFromXML(tree.getroot()))


DrugBank_df = pd.DataFrame(drugs_dataset, columns=headers)
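
The helper `listDDIFromXML` comes from `xlm_parsers_functions`, which is not shown here. As a rough illustration of what such a helper might do, the sketch below assumes each file has `sentence` elements containing `entity` elements (attributes `id`, `text`, `type`) and `pair` elements (attributes `e1`, `e2`, `ddi`, and `type` when an interaction exists). The function name `list_pairs_sketch`, the label convention, and the sample XML are all invented for this example and may differ from the actual module and corpus.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed schema; not taken from the real corpus.
SAMPLE_XML = """
<document id="DDI-DrugBank.d0">
  <sentence id="DDI-DrugBank.d0.s0" text="Aspirin may increase the effect of warfarin.">
    <entity id="DDI-DrugBank.d0.s0.e0" charOffset="0-6" type="drug" text="Aspirin"/>
    <entity id="DDI-DrugBank.d0.s0.e1" charOffset="35-42" type="drug" text="warfarin"/>
    <pair id="DDI-DrugBank.d0.s0.p0" e1="DDI-DrugBank.d0.s0.e0"
          e2="DDI-DrugBank.d0.s0.e1" ddi="true" type="effect"/>
  </sentence>
</document>
"""


def list_pairs_sketch(root):
    """Return one row per candidate drug pair, in the same column order as `headers`."""
    rows = []
    for sentence in root.iter("sentence"):
        # Index the entities of this sentence by id so pairs can look them up
        entities = {e.get("id"): e for e in sentence.findall("entity")}
        for pair in sentence.findall("pair"):
            e1 = entities[pair.get("e1")]
            e2 = entities[pair.get("e2")]
            # Assumption: use the interaction type as the label when ddi == "true",
            # and the string "false" otherwise
            if pair.get("ddi") == "true":
                label = pair.get("type", "true")
            else:
                label = "false"
            rows.append([
                sentence.get("id"), sentence.get("text"),
                e1.get("id"), e1.get("text"), e1.get("type"),
                e2.get("id"), e2.get("text"), e2.get("type"),
                label,
            ])
    return rows


rows = list_pairs_sketch(ET.fromstring(SAMPLE_XML.strip()))
print(rows[0])
```

Each row is a plain list, so the result can be passed to `pd.DataFrame(rows, columns=headers)` in the same way as the output of the real helper.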

In [7]:
DrugBank_df

| | sentence_id | sentence_text | e1_id | e1_name | e1_type | e2_id | e2_name | e2_type | interaction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DDI-DrugBank.d297.s1 | Population pharmacokinetic analyses revealed t... | DDI-DrugBank.d297.s1.e0 | MTX | drug | DDI-DrugBank.d297.s1.e1 | NSAIDs | group | false |
| 1 | DDI-DrugBank.d297.s1 | Population pharmacokinetic analyses revealed t... | DDI-DrugBank.d297.s1.e0 | MTX | drug | DDI-DrugBank.d297.s1.e2 | corticosteroids | group | false |
| 2 | DDI-DrugBank.d297.s1 | Population pharmacokinetic analyses revealed t... | DDI-DrugBank.d297.s1.e0 | MTX | drug | DDI-DrugBank.d297.s1.e3 | TNF blocking agents | group | false |
| 3 | DDI-DrugBank.d297.s1 | Population pharmacokinetic analyses revealed t... | DDI-DrugBank.d297.s1.e0 | MTX | drug | DDI-DrugBank.d297.s1.e4 | abatacept | drug | false |
| 4 | DDI-DrugBank.d297.s1 | Population pharmacokinetic analyses revealed t... | DDI-DrugBank.d297.s1.e1 | NSAIDs | group | DDI-DrugBank.d297.s1.e2 | corticosteroids | group | false |
| 5 | DDI-DrugBank.d297.s1 | Population pharmacokinetic analyses revealed t... | DDI-DrugBank.d297.s1.e1 | NSAIDs | group | DDI-DrugBank.d297.s1.e3 | TNF blocking agents | group | false |
| 6 | DDI-DrugBank.d297.s1 | Population pharmacokinetic analyses revealed t... | DDI-DrugBank.d297.s1.e1 | NSAIDs | group | DDI-DrugBank.d297.s1.e4 | abatacept | drug | false |
| 7 | DDI-DrugBank.d297.s1 | Population pharmacokinetic analyses revealed t... | DDI-DrugBank.d297.s1.e2 | corticosteroids | group | DDI-DrugBank.d297.s1.e3 | TNF blocking agents | group | false |
| 8 | DDI-DrugBank.d297.s1 | Population pharmacokinetic analyses revealed t... | DDI-DrugBank.d297.s1.e2 | corticosteroids | group | DDI-DrugBank.d297.s1.e4 | abatacept | drug | false |
| 9 | DDI-DrugBank.d297.s1 | Population pharmacokinetic analyses revealed t... | DDI-DrugBank.d297.s1.e3 | TNF blocking agents | group | DDI-DrugBank.d297.s1.e4 | abatacept | drug | false |


## Creation of features
Before training our model, we need features that help determine whether the two drugs in a candidate pair interact.

Some ideas for features are the following:
- Does the sentence contain a modal verb (should, must, ...) between the two entities?
- Word bigrams: a binary feature for every word bigram that appears more than once in the corpus, indicating the presence or absence of that bigram in the sentence.
- Number of words between a pair of drugs
- Number of drugs between a pair of drugs
- POS of words between a pair of drugs: a binary feature for each POS tag produced by POS tagging, indicating the presence or absence of that tag between the two main drugs.
- Path between a pair of drugs: the path between the two main drugs in the parse tree is another feature in our system. Because syntactic paths are in general a sparse feature, we reduced the sparsity by collapsing identical adjacent non-terminal labels, e.g. NP-S-VP-VP-NP is converted to NP-S-VP-NP. This technique decreased the number of paths by 24.8%.
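
Three of these ideas (modal verb between the entities, number of words between the pair, and collapsing repeated labels in a parse path) can be sketched in a few lines. This is only a minimal illustration: the function names and the modal verb list are assumptions, and the span between the two drugs is located with a plain string search on the entity names, whereas a real pipeline would more likely rely on the character offsets given in the corpus.

```python
import re
from itertools import groupby

# Assumed list of modal verbs; extend or replace as needed.
MODAL_VERBS = {"can", "could", "may", "might", "must",
               "shall", "should", "will", "would"}


def tokens_between(sentence, e1_name, e2_name):
    """Lower-cased word tokens between the first mention of e1 and the next mention of e2."""
    start = sentence.find(e1_name)
    if start == -1:
        return []
    start += len(e1_name)
    end = sentence.find(e2_name, start)
    if end == -1:
        return []
    return re.findall(r"[A-Za-z0-9'-]+", sentence[start:end].lower())


def has_modal_between(sentence, e1_name, e2_name):
    """True if any token between the two entities is a modal verb."""
    return any(tok in MODAL_VERBS for tok in tokens_between(sentence, e1_name, e2_name))


def collapse_path(path, sep="-"):
    """Collapse runs of identical adjacent labels, e.g. VP-VP becomes VP."""
    return sep.join(label for label, _ in groupby(path.split(sep)))


sentence = ("Interactions may be expected, and Uroxatral should not be used "
            "in combination with other alpha-blockers.")
n_words = len(tokens_between(sentence, "Uroxatral", "alpha-blockers"))
modal = has_modal_between(sentence, "Uroxatral", "alpha-blockers")
collapsed = collapse_path("NP-S-VP-VP-NP")
print(n_words, modal, collapsed)
```

Applied row by row to `sentence_text`, `e1_name`, and `e2_name`, functions like these would yield one numeric or boolean column per feature, ready to be used as predictors.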