## Classifying protein-function comentions

A “*comention*” is a co-occurrence of two Named Entities (NEs) within a short span of text. Here, the two NEs of interest are a protein and a function, and the span of text is a sentence. In this assignment, you will write a Python module that loads and classifies sentence-level protein-function comentions.

Your first task is to load the comention data stored in the `TRMTR2-train-small.txt` file. All annotations were made manually on automatically tagged proteins (UniProt identifiers) and functions (Gene Ontology concepts). During annotation, the experts judged the correctness of the relationship between the two entities. A comention is labeled "GoodComention" if the sentence suggests that the protein has some relationship with the function; otherwise it is labeled "BadComention". The label is determined solely from the meaning of the sentence; no other information is used.

The file is tab-delimited: each line holds one comention, with fields separated by tab characters. Here is an example line from the file:

`24002896        PROT (suppressor of cytokine signaling 1), not only has an effect on cytokine signaling pathway, it is also involved in the FUNC pathway in cell apoptosis.    O15524     SOCS1   GO:0009966      regulation of signal transduction     GoodComention`

Each line has the format:

`articleID \t sentenceText \t UniprotID \t proteinTextMatched \t GOID \t GOTextMatched \t [GoodComention/BadComention]`

This indicates that the protein identified by `UniprotID` is comentioned with the function identified by `GOID` within a sentence. The label GoodComention means that a human curator evaluated this comention as a valid functional relationship between the protein and the functional category (i.e., not all comentions are valid). In writing your code, use the file **“TRMTR2-train-small.txt”**, which is related to the category `GO:0022857 (transmembrane transporter activity)`. This file contains 284 labeled protein-function comentions.

Within `sentenceText`: 
 - the `proteinTextMatched` is replaced with "PROT"
 - the `GOTextMatched` is replaced with "FUNC" 
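
As a rough illustration of the format above, the following sketch parses tab-delimited lines into dictionaries with the standard `csv` module. The function name `read_comentions`, the field name `label`, and the shortened sample sentence are assumptions made for this example; they are not part of the assignment files.

```python
import csv
import io

# Field names follow the format line above; "label" is a name chosen for this sketch.
FIELDS = ["articleID", "sentenceText", "UniprotID", "proteinTextMatched",
          "GOID", "GOTextMatched", "label"]

def read_comentions(handle):
    """Yield one dict per non-empty tab-delimited row."""
    # QUOTE_NONE keeps quote characters inside sentences as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        yield dict(zip(FIELDS, row))

# A shortened, hand-typed stand-in for one line of the training file.
sample = ("24002896\tPROT is also involved in the FUNC pathway.\tO15524\tSOCS1\t"
          "GO:0009966\tregulation of signal transduction\tGoodComention\n")
rows = list(read_comentions(io.StringIO(sample)))
print(rows[0]["UniprotID"], rows[0]["label"])
```

For the real file, the same function could be given a handle from `open("TRMTR2-train-small.txt", newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting.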

## Part 1: Baseline machine learning model

Develop a supervised machine learning model using bag-of-words or TF-IDF features. Use `cross_validate()` to evaluate the performance of your model using 5-fold cross-validation with your preferred learning algorithm. Report the precision (P), recall (R), F1 and AUROC.
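
For reference, P, R and F1 are defined relative to one "positive" class. The sketch below computes them from two label lists, assuming "GoodComention" is the positive class; that choice, the function name `precision_recall_f1`, and the toy labels are assumptions for illustration only.

```python
def precision_recall_f1(y_true, y_pred, positive="GoodComention"):
    """Return (precision, recall, F1) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels, only to exercise the function.
y_true = ["GoodComention", "GoodComention", "BadComention", "BadComention"]
y_pred = ["GoodComention", "BadComention", "GoodComention", "BadComention"]
print(precision_recall_f1(y_true, y_pred))
```

AUROC is not covered here because it needs a score or probability per example rather than a hard label. Note also that scikit-learn's built-in string scorers such as `"precision"` treat label 1 as positive by default, so mapping the labels to 0/1 first is one way to keep the metrics pointed at the intended class.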

In [0]:
# you can use this for removing non-ASCII characters from text
import re

def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)

In [0]:
#Write your code here for reading the files (you can use the csv library: https://docs.python.org/3/library/csv.html)

In [0]:
#Write your code here for cross-validation (use cross_validate())

1. How did you preprocess the sentences (e.g., did you remove special characters, remove stop words, apply stemming/lemmatization)?

2. Which ML algorithm did you use?

3. List P, R, F1 and AUROC from your baseline model.

## Part 2: Advanced machine learning model

Develop an advanced supervised machine learning model by developing a collection of features (entity-based, word-based, syntactic). Use `cross_validate()` to evaluate the performance of your advanced model using 5-fold cross-validation with your preferred learning algorithm (the one used for Part 1). Report the P, R, F1 and AUROC.

In [0]:
# write your code here for generating (at least 3) entity-based features of your choice

4. List your entity-based features.

In [0]:
# write your code here for generating (at least 3) word-based features of your choice

5. List your word-based features.

In this exercise you will use the **length of the dependency path** (see "Syntactic Features for Relation Extraction" in Lecture 7 slides) as a syntactic feature.

First, you will use the [spaCy](https://spacy.io/) library, an industrial-strength NLP library for Python, to find the **dependency parse** of each sentence.

In [0]:
#Install spaCy and download a spaCy model
! pip install spacy
! python -m spacy download en_core_web_sm

In [0]:
#following code snippet shows how to generate the dependencies between tokens
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u'Convulsions that occur after DTaP are caused by a fever.')

#Prints the dependencies: parent token, child token and the type of the dependency
for token in doc:
  print((token.head.text, token.text, token.dep_))

#You can also view the dependency structure by visiting this page: https://explosion.ai/demos/displacy?text=Convulsions%20that%20occur%20after%20DTaP%20are%20caused%20by%20a%20fever.&model=en_core_web_sm&cpu=1&cph=1

Second, you will use the [networkx](https://networkx.github.io/documentation/stable/index.html) library to build **dependency parse graphs** from the information obtained through spaCy. networkx is one of the most popular graph libraries for Python. You will then find the **shortest distance** between the two entities in this graph and use it as the feature value.

**You can ignore the types and directionality of the dependencies for the purposes of this assignment. In other words, assume all edges are undirected and of the same type.**
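
To make the undirected, untyped simplification concrete, here is a minimal sketch that computes the path length with a hand-written breadth-first search instead of networkx. The function name `dependency_path_length`, the five-token sentence, and its edge list are hypothetical and hand-written, not real parser output. In practice the `(head, child)` index pairs could come from spaCy's `token.head.i` and `token.i`.

```python
from collections import deque

def dependency_path_length(edges, source, target):
    """Number of edges on the shortest undirected path, or None if unconnected."""
    neighbors = {}
    for head, child in edges:
        if head == child:
            continue  # spaCy's root token is its own head; ignore the self-loop
        # Store each edge in both directions so direction is ignored.
        neighbors.setdefault(head, set()).add(child)
        neighbors.setdefault(child, set()).add(head)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical sentence and hand-written (head index, child index) pairs.
tokens = ["PROT", "regulates", "FUNC", "in", "cells"]
edges = [(1, 0), (1, 1), (1, 2), (1, 3), (3, 4)]
print(dependency_path_length(edges, tokens.index("PROT"), tokens.index("FUNC")))
```

Using token indices rather than token text as node identifiers keeps repeated words (for example, two occurrences of "the") from collapsing into a single node, which would otherwise shorten paths artificially.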

In [0]:
#install networkx
! pip install networkx

In [0]:
#You can generate graphs by following the networkx tutorial: https://networkx.github.io/documentation/stable/tutorial.html
#The following code snippet shows you how to generate a toy graph
import networkx as nx
G = nx.Graph()
G.add_node('caused')
G.add_node('Convulsions')
print(list(G.nodes))

G.add_edge('caused', 'Convulsions')
print(list(G.edges))

print(G.number_of_nodes())
print(G.number_of_edges())

# Get the path and its length using https://networkx.github.io/documentation/stable/reference/algorithms/shortest_paths.html
entity1 = 'Convulsions'
entity2 = 'caused'
print("dependency path = ",nx.shortest_path(G, source=entity1, target=entity2))
print("dependency path length = ", nx.shortest_path_length(G, source=entity1, target=entity2))

In [0]:
#Using the above spaCy/networkx code snippets as a guide, write code here to generate dependency path lengths for all the comentions in the training data. Add these values as the syntactic feature values.

In [0]:
#Write code to plot the distribution of dependency path lengths using the hist function: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html

In [0]:
#Write your code here for cross-validation (use cross_validate())

6. List P, R, F1 and AUROC from your advanced model.

In [0]:
#Write code here to generate a bar chart (using matplotlib) comparing the P, R, F1 and AUROC values of the two models (baseline vs. advanced)

## Part 3: Prediction

The (unlabeled) test data is given in `TRMTR2-test-unlabeled-small.txt`. The format of this file is as follows:

`articleID \t sentenceText \t UniprotID \t proteinTextMatched \t GOID \t GOTextMatched`

Using your best model, write code to predict labels for the test data. Then save your predictions to `TRMTR2-test-preds-small.txt`. The format of this file should **exactly** follow the training data format. **Upload this file to D2L (along with the ipynb file).**
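
One way to produce a file in the training format is to append the predicted label as a seventh tab-separated field to each six-field test row. The sketch below assumes the rows are already parsed into lists of strings; the function name `write_predictions` and the example row are made up for illustration.

```python
import io

def write_predictions(rows, labels, handle):
    """Write each 6-field test row plus its predicted label, tab-separated."""
    if len(rows) != len(labels):
        raise ValueError("need exactly one predicted label per test row")
    for fields, label in zip(rows, labels):
        handle.write("\t".join(list(fields) + [label]) + "\n")

# A made-up test row; real rows would come from the unlabeled test file.
test_rows = [["1", "PROT binds FUNC.", "P00001", "abc",
              "GO:0000001", "some function"]]
predicted = ["GoodComention"]

buffer = io.StringIO()
write_predictions(test_rows, predicted, buffer)
print(repr(buffer.getvalue()))
```

Writing to the real output would use a handle such as `open("TRMTR2-test-preds-small.txt", "w", encoding="utf-8")` in place of the in-memory buffer; the encoding is an assumption.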

We will use your generated `TRMTR2-test-preds-small.txt` file to evaluate the final performance of your code. Given that this is a class assignment, we do not expect state-of-the-art performance. But you need to show that you have thought carefully about the types of features that are informative for this task.

The best performance on the test data will earn bonus points (5% for first, 3% for second, 2% for third) toward the assignment grade. The leaderboard will be announced in class after the final evaluation.