# NLE Assessed Coursework 2

For this assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about these coursework questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell.

In [1]:
candidateno=198735 #this MUST be updated to your candidate number so that you get a unique data sample

In [2]:
#preliminary imports
import sys
sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
sys.path.append(r'/Users/Joe/Documents/Python Scripts/resources/resources')

import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.tokenize import word_tokenize

from sussex_nltk.corpus_readers import AmazonReviewCorpusReader
import random
from nltk.corpus import stopwords

Sussex NLTK root directory is /Users/Joe/Documents/Python Scripts/resources/resources


## Question 1: Document Similarity (25 marks)
The objective of this question is to investigate whether incorporating lexical knowledge from WordNet might improve document similarity methods.  For example, knowing that both *tiger* and *leopard* are hyponyms of *big_cat* should increase the similarity between a document mentioning a *tiger* and a document mentioning a *leopard*.

The code below will generate two document collections, both in bag-of-words format, one from the Medline Corpus and one from the Wall Street Journal corpus.

In this question, there are marks available for the quality of your code and the quality of your explanations.

In [3]:
from sussex_nltk.corpus_readers import MedlineCorpusReader
from sussex_nltk.corpus_readers import WSJCorpusReader
from nltk.stem.wordnet import WordNetLemmatizer

def normalise(tokenlist):
    tokenlist=[token.lower() for token in tokenlist]
    tokenlist=["NUM" if token.isdigit() else token for token in tokenlist]
    tokenlist=["Nth" if (token.endswith(("nd","st","th")) and token[:-2].isdigit()) else token for token in tokenlist]
    tokenlist=["NUM" if re.search(r"^[+-]?[0-9]+\.[0-9]",token) else token for token in tokenlist]
    return tokenlist

def filter_stopwords(tokenlist):
    stop = stopwords.words('english')
    return [w for w in tokenlist if w.isalpha() and w not in stop]

def stem(tokenlist):
    st=WordNetLemmatizer()
    return [st.lemmatize(token) for token in tokenlist]

   
def make_bow(somestring):
    rep=word_tokenize(somestring)  #step 1
    rep=normalise(rep)   #step 2
    rep=stem(rep)   #step 3
    rep=filter_stopwords(rep)  #step 4
    dict_rep={}
    for token in rep:
        dict_rep[token]=dict_rep.get(token,0)+1  #step 5
    return(dict_rep)

wsj=WSJCorpusReader()
medline=MedlineCorpusReader()

collectionsize=50
collections={"wsj":[],"medline":[]}

for key in collections.keys():
    if key=="wsj":
        generator=wsj.raw()
    else:
        generator=medline.raw()
    while len(collections[key])<collectionsize:
        collections[key].append(next(generator))

bow_collections={key:[make_bow(doc) for doc in collection] for key,collection in collections.items()}

a). For each step in the `make_bow()` function, **explain** what it does and why it is applicable when creating document representations for document similarity methods. \[8 marks\]

* Step 1 - `word_tokenize` splits the string `somestring` into a list of tokens, which is assigned to `rep`. It splits mainly on whitespace, but also separates punctuation such as "," and "." into tokens of their own. Tokenisation is needed because every later step works on individual tokens, and the tokens become the dictionary keys that are matched across documents.
* Step 2 - The token list `rep` is passed to `normalise`, which applies four passes, each reassigning `tokenlist`, and returns the result:
    * The first pass lowercases every token, so that "The" and "the" are counted as the same feature rather than as two different ones.
    * The second pass replaces any token made up only of digits with the string "NUM". For document similarity the exact value of a number rarely matters, only the fact that a number is present.
    * The third pass replaces ordinals such as "1st", "2nd" or "6th" with "Nth", for the same reason. Ordinals ending in "rd", such as "3rd", are not matched by this check.
    * The fourth pass uses a regular expression to find decimal values (an optional sign, digits, a decimal point and a further digit) and replaces them with "NUM", again because only the presence of a number matters.
* Step 3 - The token list is passed to `stem`, which lemmatizes each token with `WordNetLemmatizer`, replacing it with its base form. Because no part of speech is given, every token is treated as a noun: 'enzymes' becomes 'enzyme', but a verb form such as 'taking' is left unchanged, and 'was' becomes 'wa' (visible in the later outputs). Lemmatizing increases the chance that variants of the same word are matched as one feature.
* Step 4 - The token list is passed to `filter_stopwords`, which keeps only tokens that are purely alphabetic and are not in NLTK's English stop-word list. This removes punctuation, tokens containing digits or symbols, and function words such as "the", "an" and "a". These appear in almost every document and therefore contribute nothing to distinguishing similar documents from dissimilar ones.
* Step 5 - A dictionary is built by looping over the tokens in `rep`. For each token, `get` returns the current count for that key, or 0 if the key has not been seen yet, and 1 is added to it. The result maps each token to the number of times it occurs in the document. This bag-of-words representation is what the weighting and similarity calculations operate on, and it allows token frequencies to be compared across documents.
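
The five steps can be traced on a single sentence. The following is only a sketch: to stay runnable without any NLTK data, it swaps in a regex tokenizer, a hand-picked stop list and a crude plural-stripping rule (`toy_tokenize`, `TOY_STOPWORDS` and `toy_lemmatize` are all hypothetical stand-ins, not the functions used in this notebook). Only the normalisation passes mirror `normalise()` directly.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list (the notebook uses NLTK's English list).
TOY_STOPWORDS = {"the", "a", "an", "and", "on", "in", "of"}

def toy_tokenize(text):
    # Stand-in for word_tokenize: decimals first, then words, then single punctuation marks.
    return re.findall(r"\d+\.\d+|\w+|[^\w\s]", text)

def toy_normalise(tokens):
    # The same four passes as normalise() in the notebook.
    tokens = [t.lower() for t in tokens]
    tokens = ["NUM" if t.isdigit() else t for t in tokens]
    tokens = ["Nth" if t.endswith(("nd", "st", "th")) and t[:-2].isdigit() else t
              for t in tokens]
    tokens = ["NUM" if re.search(r"^[+-]?[0-9]+\.[0-9]", t) else t for t in tokens]
    return tokens

def toy_lemmatize(tokens):
    # Crude stand-in for WordNetLemmatizer: strip a final "s" from longer lowercase words.
    return [t[:-1] if t.islower() and len(t) > 3 and t.endswith("s") else t
            for t in tokens]

def toy_make_bow(text):
    tokens = toy_lemmatize(toy_normalise(toy_tokenize(text)))
    kept = [t for t in tokens if t.isalpha() and t not in TOY_STOPWORDS]
    return dict(Counter(kept))

bow = toy_make_bow("The 2nd trial on 3 tigers, and the tiger in 0.5 ml.")
print(bow)
```

In this toy run the placeholders "NUM" and "Nth" survive the stop-word filter because they are purely alphabetic, which is also why "NUM" appears in the document representations printed later in the notebook.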

b). Apply a TF-IDF weighting to the representations and then compute: 
* the average cosine similarity of medline documents to each other, 
* the average cosine similarity of WSJ documents to each other,
* the average cosine similarity of medline documents to WSJ documents
\[8 marks\]

In [27]:
import math

# Functions to compute data #
# document frequency: the number of documents each feature occurs in #
def doc_freq(doclist):
    df={}
    for doc in doclist:
        for feat in doc.keys():
            df[feat]=df.get(feat,0)+1
    return df

# gets the idf value for each feature in the documents #
def idf_values(doclist):
    idf = {}
    n = len(doclist)
    df = doc_freq(doclist)
    for feat in df.keys():
        idf[feat] = math.log10(n/df[feat])
    return idf

# weights each document's term frequencies by the idf values to give tf-idf #
def convert_to_tfidf(doclist,idf):
    tfidf = []
    for doc in doclist:
        cur_tfidf = {}
        for feat in idf.keys():
            tfidf_val = doc.get(feat,0) * idf.get(feat,0)
            if tfidf_val != 0:
                cur_tfidf[feat] = tfidf_val
        tfidf.append(cur_tfidf)
    return tfidf

# dot product of two documents #
def dot(docA,docB):
    the_sum=0
    for (key,value) in docA.items():
        the_sum+=value*docB.get(key,0)
    return the_sum

# cosine similarity of two documents #
def cos_sim(docA,docB):
    return dot(docA,docB)/(math.sqrt(dot(docA,docA) * dot(docB,docB)))

# average of the given values, given their number #
def average(length,values):
    return sum(values)/length

# average cosine similarity over all pairs drawn from two collections #
# note: if both arguments are the same collection, each document is also paired with itself #
def sim_collection(collectionA,collectionB):
    values = []
    for docA in collectionA:
        for docB in collectionB:
            values.append(cos_sim(docA,docB))
    return average(len(values), values)

In [5]:
# Converting to tfidf #
wsj_tfidf = convert_to_tfidf(bow_collections["wsj"], idf_values(bow_collections["wsj"]))
medline_tfidf = convert_to_tfidf(bow_collections["medline"], idf_values(bow_collections["medline"]))

In [6]:
# Calculating cosine similarity #
print("Medline to Medline Cosine Similarity: {}".format(sim_collection(medline_tfidf,medline_tfidf)))
print("WSJ to WSJ Cosine Similarity: {}".format(sim_collection(wsj_tfidf,wsj_tfidf)))
print("Medline to WSJ Cosine Similarity: {}".format(sim_collection(medline_tfidf,wsj_tfidf)))

Medline to Medline Cosine Similarity: 0.042554082780741256
WSJ to WSJ Cosine Similarity: 0.042765676237362854
Medline to WSJ Cosine Similarity: 0.007138281221940395
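
One point to bear in mind when reading the two within-collection figures: `sim_collection(medline_tfidf, medline_tfidf)` pairs every document with every document, including itself, and a non-empty document has a cosine similarity of 1 with itself. With 50 documents, the 50 self-pairs out of 2,500 contribute 50/2500 = 0.02 to each within-collection average, whereas the Medline-to-WSJ average contains no self-pairs. The sketch below shows one possible alternative that averages over distinct pairs only; the function names and the toy vectors are illustrative assumptions, not part of the submitted solution.

```python
import math
from itertools import combinations

def cosine(doc_a, doc_b):
    # Cosine similarity of two sparse vectors stored as {feature: weight} dicts.
    dot = sum(v * doc_b.get(k, 0) for k, v in doc_a.items())
    norm = math.sqrt(sum(v * v for v in doc_a.values()) *
                     sum(v * v for v in doc_b.values()))
    return dot / norm if norm else 0.0

def average_including_self(collection):
    # Same pairing as sim_collection(c, c): all ordered pairs, self-pairs included.
    sims = [cosine(a, b) for a in collection for b in collection]
    return sum(sims) / len(sims)

def average_excluding_self(collection):
    # Each unordered pair of distinct documents is counted once.
    sims = [cosine(a, b) for a, b in combinations(collection, 2)]
    return sum(sims) / len(sims)

# Hypothetical toy collection of three already-weighted documents.
toy = [{"cell": 1.0}, {"virus": 1.0}, {"cell": 1.0, "virus": 1.0}]
with_self = average_including_self(toy)
without_self = average_excluding_self(toy)
print(with_self, without_self)
```

On the toy collection the average that includes self-pairs is noticeably higher than the one that excludes them, which is the same effect, on a smaller scale, as the 0.02 offset described above.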


c). Expand the document representations by adding **synonyms** and **hypernyms** for each **noun** in the document.  For example, 2 occurrences of the word *tiger* should add 2 occurrences of each of the following **lemma_names** found in the WordNet hypernym hierarchy above *tiger*:
* \['tiger', 'Panthera_tigris'\]
* \['big_cat', 'cat'\]
* \['feline', 'felid'\]
* \['carnivore'\]
* \['placental', 'placental_mammal', 'eutherian', 'eutherian_mammal'\]
* \['mammal', 'mammalian'\]
* \['vertebrate', 'craniate'\]
* \['chordate'\]
* \['animal', 'animate_being', 'beast', 'brute', 'creature', 'fauna'\]
* \['organism', 'being'\]
* \['living_thing', 'animate_thing'\]
* \['whole', 'unit'\]
* \['object', 'physical_object'\]
* \['physical_entity'\]
* \['entity'\]

Recompute the similarities calculated in part b).  Discuss your results. \[9 marks\]

In [7]:
# Imports wordnet and sets up noun set to work out if word is a noun #
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wn_ic
nouns = {word.name().split('.', 1)[0] for word in wn.all_synsets('n')}

In [8]:
# Function to collect the lemma names of every hypernym above a synset, #
# following only the first hypernym at each level of the hierarchy #
def get_all_synonyms(syn, list_words):
    hypernyms = syn.hypernyms()
    if hypernyms:
        list_words += hypernyms[0].lemma_names()
        get_all_synonyms(hypernyms[0], list_words)
    return list_words

# Function to easily add all new words to dictionary #
def add_words(corpus, doc_count, add_list, occurs):
    for word in add_list:
        new_bow_collections[corpus][doc_count][word] = new_bow_collections[corpus][doc_count].get(word, 0) + occurs

In [30]:
# Adding to bow_collections # 
new_bow_collections = {key:[make_bow(doc) for doc in collection] for key,collection in collections.items()}
for corpus in bow_collections.keys():
    doc_count = 0
    for doc in bow_collections[corpus]:
        for word in doc.keys():
            if word in nouns:
                synsets = wn.synsets(word,wn.NOUN)
                occurs = doc[word]
                for syn in synsets:
                    add_words(corpus,doc_count,syn.lemma_names(), occurs)
                    add_words(corpus,doc_count,get_all_synonyms(syn, list()),occurs)
        doc_count += 1

In [10]:
# Converting to tfidf #
wsj_tfidf = convert_to_tfidf(new_bow_collections["wsj"], idf_values(new_bow_collections["wsj"]))
medline_tfidf = convert_to_tfidf(new_bow_collections["medline"], idf_values(new_bow_collections["medline"]))

In [11]:
# Calculating cosine similarity #
print("Medline to Medline Cosine Similarity: {}".format(sim_collection(medline_tfidf,medline_tfidf)))
print("WSJ to WSJ Cosine Similarity: {}".format(sim_collection(wsj_tfidf,wsj_tfidf)))
print("Medline to WSJ Cosine Similarity: {}".format(sim_collection(medline_tfidf,wsj_tfidf)))

Medline to Medline Cosine Similarity: 0.07991586613911628
WSJ to WSJ Cosine Similarity: 0.06693598218944792
Medline to WSJ Cosine Similarity: 0.027670734115524086


#### Results
The results show that all three similarities have increased from those in part b). This is because the expanded representations share many more features: documents now match on hypernyms as well as on the original words, the most common shared feature being 'entity'. The cross-collection similarity rose the most in relative terms (roughly fourfold, from 0.007 to 0.028), while the two within-collection similarities less than doubled. However, the code does not distinguish between the senses of an ambiguous word, such as 'tiger' referring to the animal or to a type of person, and adds the hypernym chains of every noun sense of such a word.

The similarity scores would be more meaningful if the correct sense of each word in the document could be identified, so that only the hypernym chain for that sense is added rather than the chains for every alternative sense. This would not necessarily raise or lower the cosine similarities; it would mean that they reflect the content of the documents more accurately than the current calculations do.
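
The effect of hypernym expansion can be seen on a pair of one-word documents. This is a minimal sketch under stated assumptions: `TOY_HYPERNYMS` is a hand-written, hypothetical stand-in for a small part of the WordNet hierarchy, each word has exactly one sense, and raw counts are compared without any TF-IDF weighting.

```python
import math

# Hypothetical single-parent hierarchy standing in for part of WordNet.
TOY_HYPERNYMS = {
    "tiger": "big_cat",
    "leopard": "big_cat",
    "big_cat": "feline",
    "feline": "carnivore",
}

def hypernym_chain(word):
    # Walk upwards until a word has no recorded hypernym.
    chain = []
    while word in TOY_HYPERNYMS:
        word = TOY_HYPERNYMS[word]
        chain.append(word)
    return chain

def expand(bow):
    # Add each hypernym as many times as the original word occurs.
    expanded = dict(bow)
    for word, count in bow.items():
        for hypernym in hypernym_chain(word):
            expanded[hypernym] = expanded.get(hypernym, 0) + count
    return expanded

def cosine(doc_a, doc_b):
    dot = sum(v * doc_b.get(k, 0) for k, v in doc_a.items())
    norm = math.sqrt(sum(v * v for v in doc_a.values()) *
                     sum(v * v for v in doc_b.values()))
    return dot / norm if norm else 0.0

doc_a = {"tiger": 2}
doc_b = {"leopard": 1}
before = cosine(doc_a, doc_b)
after = cosine(expand(doc_a), expand(doc_b))
print(before, after)
```

The two documents share no features before expansion and three (`big_cat`, `feline`, `carnivore`) afterwards. If a word had several senses and every chain were added, unrelated documents could gain shared features in the same way, which is the concern raised in the discussion above.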

## Question 2: Supervised Methods for WSD (25 marks)
The objective of this question is to build and evaluate a word sense disambiguation (WSD) system for words with multiple senses.  

a).  For each word occurring in the medline corpus (defined above), **write code** to find how many senses it has according to WordNet.  Print a list of the 10 most frequently occurring words with 2 senses (in this corpus). \[4 marks\]

In [12]:
import operator
def senses_dictionary(collection):
    words = {} # maps each word to a tuple: (list of WordNet senses, times seen in the collection)
    for doc in collection:
        for word in doc.keys():
            if word not in words:
                words[word] = (wn.synsets(word), doc[word])
            else:
                words[word] = (words[word][0], words[word][1] + doc[word])
    return words
# each word mapped to its senses and the number of times it occurs in the corpus
sense_dict = senses_dictionary(bow_collections['medline'])
# Keep only the words with exactly two senses
words_2_senses = {}
for word in sense_dict.keys():
    if len(sense_dict[word][0]) == 2:
        words_2_senses[word] = sense_dict[word]
# Sort words to display top 10
words_2_senses = sorted(words_2_senses.items(),key=lambda word:word[1][1] ,reverse=True)
print("Most frequent words with 2 senses:")
for i in range(10):
    print("{}. ".format(i+1) + words_2_senses[i][0])

Most frequent words with 2 senses:
1. membrane
2. molecular
3. temperature
4. p
5. iii
6. mph
7. uptake
8. may
9. amino
10. molecule


b). A *supervised* WSD algorithm derives model(s) from *sense-annotated corpus data* in order to predict senses of ambiguous words in un-annotated data.  Using the entire document as context, **implement** a supervised word sense disambiguation algorithm to determine the most likely sense of each occurrence of the 3 most frequently occurring words identified in part a). \[8 marks\]

#### WSD Algorithm

The algorithm below is trained on SemCor, the sense-tagged subset of the Brown Corpus. SemCor was sense-tagged by humans, so its annotations are assumed to be correct. SemCor is used here because the Medline corpus does not have a sense-tagged subset of its own; if it had, the algorithm would have been trained on that data instead.
SemCor is loaded and processed with the functions below, which are derived from Lab 6.1. Two functions have been added in the same code block: one creates and trains a classifier for a given word, and one selects only the Medline documents that contain that word.

The SemCor files are processed in random order and each sentence is reduced to its (word, sense) pairs. All of the SemCor files are used so that, even though they do not come from the same corpus as the test documents, the classifier sees as many tagged occurrences of the selected word as possible. The sentences containing the word are labelled with its tagged sense and fed into a Naive Bayes classifier for that word, which learns which context words are associated with each sense. The algorithm then gathers the Medline documents that contain the current word into a list. Each document in the list is classified and the result, which is the most likely sense for that document, is printed along with the words of the document.
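
One detail worth noting about the feature representation: `get_word_data` builds training features of the form `{token: True}`, whereas the Medline documents passed to the classifier are bag-of-words dictionaries of the form `{token: count}`. A count greater than 1 is a feature value that never occurs in the training data, so it may be handled as an unseen value. The helper below is a hypothetical sketch (it is not used in the code that follows) of how a document could be converted to the same presence features before classification.

```python
def to_presence_features(bow):
    """Map a {token: count} bag of words to {token: True} presence features."""
    return {token: True for token in bow}

# Hypothetical document in which 'membrane' occurs twice.
bow = {"membrane": 2, "cell": 1, "virus": 1}
features = to_presence_features(bow)
print(features)
```

With such a helper, the classification call could take `to_presence_features(doc)` instead of `doc`; whether this changes any of the predictions reported below has not been checked.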

In [13]:
# functions used to implement code below and imports # 
from nltk.corpus import semcor
import random
import nltk
from nltk.classify.naivebayes import NaiveBayesClassifier
def extract_tags(taggedsentence):
    '''
    For a tagged sentence in SemCor, identify single words which have been tagged with a WN synset
    taggedsentence: a list of items, some of which are of type nltk.tree.Tree
    :return: a list of pairs, (word,synset)
    
    '''
    alist=[]
    for item in taggedsentence:
        if isinstance(item,nltk.tree.Tree):   #check whether this is a Tree
            if isinstance(item.label(),nltk.corpus.reader.wordnet.Lemma) and len(item.leaves())==1:
                #check whether the tree's label is Lemma and whether the tree has a single leaf
                #if so add the pair (lowercased leaf,synsetlabel) to output list
                alist.append((item.leaves()[0].lower(),item.label().synset()))
    return alist
            

def extract_sentences(fileid_list):
    '''
    apply extract_tags to all sentences in all documents in a list of file ids
    fileid_list: list of ids
    :return: list of list of (token,tag) pairs, one for each sentence in corpus
    '''
    sentences=[]
    for fileid in fileid_list:
        print("Processing {}".format(fileid))
        sentences+=[extract_tags(taggedsentence) for taggedsentence in semcor.tagged_sents(fileid,tag='sem')]
    return sentences

def contains(sentence,astring):
    '''
    check whether sentence contains astring
    '''
    if len(sentence)>0:
        tokens,tags=zip(*sentence)
        return astring in tokens
    else:
        return False
    
def get_label(sentence,word):
    '''
    get the synset label for the word in this sentence
    '''
    count=0
    label="none"
    for token,tag in sentence:
        if token==word:
            count+=1
            label=str(tag)
    if count !=1:
        #print("Warning: {} occurs {} times in {}".format(word,count,sentence))
        pass
    return label

def get_word_data(sentences,word):
    '''
    select sentences containing word and construct labelled data set where each sentence is represented using the Bernoulli event model
    '''
    selected_sentences=[sentence for sentence in sentences if contains(sentence,word)]
    word_data=[({token:True for (token,tag) in sentence},get_label(sentence,word)) for sentence in selected_sentences] 
    return word_data

# function to create and train classifier # 
def classifier_create(training_sent, myword):
    training=get_word_data(training_sent,myword)
    return NaiveBayesClassifier.train(training), len(training)

# function to get the specific documents for the specified word to test the data upon #
def find_testing_sentences(collection, myword):
    testlist = []
    for doc in collection:
        if myword in doc.keys():
            testlist.append(doc)
    return testlist
            

In [14]:
# setting up the training sentences #
allfiles = semcor.fileids()
shuffled=list(allfiles)
random.shuffle(shuffled)
training_sentences=extract_sentences(shuffled)

Processing brownv/tagfiles/br-b21.xml
Processing brown1/tagfiles/br-p01.xml
Processing brown2/tagfiles/br-j29.xml
Processing brownv/tagfiles/br-h07.xml
Processing brown2/tagfiles/br-p24.xml
Processing brownv/tagfiles/br-a03.xml
Processing brownv/tagfiles/br-j24.xml
Processing brown1/tagfiles/br-j05.xml
Processing brownv/tagfiles/br-g05.xml
Processing brown1/tagfiles/br-k07.xml
Processing brownv/tagfiles/br-j25.xml
Processing brownv/tagfiles/br-a43.xml
Processing brown2/tagfiles/br-f08.xml
Processing brown2/tagfiles/br-l13.xml
Processing brownv/tagfiles/br-a09.xml
Processing brown1/tagfiles/br-j10.xml
Processing brown1/tagfiles/br-e24.xml
Processing brown2/tagfiles/br-p12.xml
Processing brownv/tagfiles/br-j27.xml
Processing brown2/tagfiles/br-h18.xml
Processing brownv/tagfiles/br-g09.xml
Processing brownv/tagfiles/br-m03.xml
Processing brown2/tagfiles/br-j33.xml
Processing brownv/tagfiles/br-a33.xml
Processing brownv/tagfiles/br-a37.xml
Processing brown1/tagfiles/br-a01.xml
Processing b

In [29]:
# testing the classifier # 
evalulating_tests = [[] for i in range(3)]
length_of_training_data = []
# for each word # 
for i in range(3):
    current_word =words_2_senses[i][0]
    current_classifier, length = classifier_create(training_sentences,current_word)
    length_of_training_data.append(length)
    testing_sentences = find_testing_sentences(bow_collections['medline'],current_word)
    doc_count = 0
    for doc in testing_sentences:
        doc_count += 1
        classification = current_classifier.classify(doc)
        evalulating_tests[i].append((list(doc.keys()),classification))
        print("{} Document {}: {}".format(current_word,doc_count,list(doc.keys())))
        print("")
        print("Classified as: {}".format(classification))
        print("")

membrane Document 1: ['polysaccharide', 'cause', 'inhibition', 'multiplication', 'mumps', 'virus', 'allantoic', 'sac', 'may', 'hemagglutination', 'moreover', 'substance', 'prevent', 'adsorption', 'erythrocyte', 'available', 'evidence', 'indicates', 'active', 'inhibitor', 'block', 'cell', 'living', 'membrane', 'influenza', 'b', 'newcastle', 'disease', 'well', 'pvm', 'also', 'appears', 'lack', 'correlation', 'vitro', 'vivo', 'inhibiting', 'activity']

Classified as: Synset('membrane.n.01')

membrane Document 2: ['interaction', 'NUM', 'phage', 'strain', 'bacillus', 'protoplast', 'l', 'form', 'subtilis', 'eight', 'mutant', 'two', 'lysogens', 'described', 'qualitatively', 'quantitatively', 'removal', 'cell', 'wall', 'still', 'adsorb', 'nine', 'kill', 'host', 'five', 'multiply', 'naked', 'bacteria', 'forming', 'plaque', 'lawn', 'individual', 'gene', 'mutation', 'similarly', 'pleiotropic', 'effect', 'strongly', 'dependent', 'upon', 'plating', 'medium', 'thus', 'gta', 'cause', 'loss', 'glucosy

c). Evaluate the performance of your WSD system.  How accurate is it for each of the 3 words? **Comment** on the strengths and weaknesses of your WSD system.\[8 marks\] 

In [16]:
# Code to print out results and definitions of the first word #
word1 = wn.synsets(words_2_senses[0][0])
print(words_2_senses[0][0])
print(word1)
print("1st definition: {}".format(word1[0].definition()))
print("2nd definition: {}".format(word1[1].definition()))
print("Length of training data: {}".format(length_of_training_data[0]))
i = 0
for test in evalulating_tests[0]:
    i += 1
    print("")
    print("Test {}".format(i))
    print(test[0])
    print("")
    print(test[1])

membrane
[Synset('membrane.n.01'), Synset('membrane.n.02')]
1st definition: a thin pliable sheet of material
2nd definition: a pliable sheet of tissue that covers or lines or connects the organs or cells of animals or plants
Length of training data: 1

Test 1
['polysaccharide', 'cause', 'inhibition', 'multiplication', 'mumps', 'virus', 'allantoic', 'sac', 'may', 'hemagglutination', 'moreover', 'substance', 'prevent', 'adsorption', 'erythrocyte', 'available', 'evidence', 'indicates', 'active', 'inhibitor', 'block', 'cell', 'living', 'membrane', 'influenza', 'b', 'newcastle', 'disease', 'well', 'pvm', 'also', 'appears', 'lack', 'correlation', 'vitro', 'vivo', 'inhibiting', 'activity']

Synset('membrane.n.01')

Test 2
['interaction', 'NUM', 'phage', 'strain', 'bacillus', 'protoplast', 'l', 'form', 'subtilis', 'eight', 'mutant', 'two', 'lysogens', 'described', 'qualitatively', 'quantitatively', 'removal', 'cell', 'wall', 'still', 'adsorb', 'nine', 'kill', 'host', 'five', 'multiply', 'naked

#### Membrane

As the printed output above shows, 'membrane' has two definitions. In the Medline documents the intended definition should be:
> "a pliable sheet of tissue that covers or lines or connects the organs or cells of animals or plants"

Unfortunately, none of the documents tested was assigned this definition. This is because the only training sentence for 'membrane' in the SemCor collection is tagged with the other definition:
>"a thin pliable sheet of material"

This is clearly not the sense used in the first test document, which includes the words:
> 'multiplication', 'mumps', 'virus',  'cell', 'living', 'influenza',  and 'disease'

These words point to the second definition. For comparison, here is another selection of words, from the 4th test shown above, which should also be classified as the 2nd definition:
> 'mutation', 'coli', 'bacteriophage', 'mutagenesis', 'antibiotic',  'cell', 'wall', 'biosynthesis', and 'genetically'

In summary, the membrane classifier failed because it relied on only 1 training sentence. The only sense it had seen was the first, so it assigned that sense to every document. Had that one sentence from the Brown corpus been tagged with the 2nd definition, the classifier would have appeared to be very accurate. That would be a false impression, as it would still simply be assigning the only sense it knows. In conclusion, the membrane classifier is not accurate.

Furthermore, this shows a weakness in the WSD system: the training data is not large enough to produce an accurate classifier.
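
A failure of this kind can be detected before any document is classified by counting how many training examples carry each sense label. The sketch below assumes training data in the same shape that `get_word_data` returns, a list of `(features, label)` pairs; the helper names and the single toy example are hypothetical.

```python
from collections import Counter

def sense_distribution(training):
    # Count how many training examples carry each sense label.
    return Counter(label for _features, label in training)

def is_degenerate(training):
    # With fewer than two distinct senses, every prediction is forced to be the same.
    return len(sense_distribution(training)) < 2

# Hypothetical training set holding a single example of one sense.
toy_training = [
    ({"thin": True, "membrane": True}, "Synset('membrane.n.01')"),
]
print(sense_distribution(toy_training), is_degenerate(toy_training))
```

A classifier trained on a degenerate set such as this one can only ever return the one sense it has seen, whatever context words the test document contains.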

In [17]:
# Code to print out results and definitions of the second word # 
word2 = wn.synsets(words_2_senses[1][0])
print(words_2_senses[1][0])
print(word2)
print("1st definition: {}".format(word2[0].definition()))
print("2nd definition: {}".format(word2[1].definition()))
print("Length of training data: {}".format(length_of_training_data[1]))
i = 0
for test in evalulating_tests[1]:
    i += 1
    print("")
    print("Test {}".format(i))
    print(test[0])
    print("")
    print(test[1])

molecular
[Synset('molecular.a.01'), Synset('molecular.a.02')]
1st definition: relating to or produced by or consisting of molecules
2nd definition: relating to simple or elementary organization; --G.A. Miller
Length of training data: 6

Test 1
['three', 'enzyme', 'one', 'ec', 'NUM', 'peptide', 'hydrolase', 'activity', 'separated', 'partially', 'purified', 'bacillus', 'subtilis', 'distinguished', 'respect', 'molecular', 'weight', 'catalytic', 'property', 'studied', 'relation', 'physiology', 'bacterium', 'designated', 'aminopeptidase', 'ha', 'produced', 'early', 'growth', 'hydrolyzes', 'rapidly', 'another', 'ii', 'also', 'third', 'iii', 'predominantly', 'stationary', 'phase', 'efficiently', 'utilizes', 'substrate', 'synthesis', 'suggests', 'selective', 'catabolism', 'occurs', 'time', 'perhaps', 'related', 'cessation', 'onset', 'event', 'dipeptide', 'well', 'identified', 'localized', 'cell', 'wall', 'periplasm', 'organism', 'evidence', 'variation', 'cycle', 'suggest', 'important', 'funct

#### Molecular
As seen above, 'molecular' has two definitions. The 1st is the one relevant to the Medline corpus:
> "relating to or produced by or consisting of molecules"

This refers to molecules in chemical and biological contexts, so in theory all the documents should be using this definition. The other definition, which is not relevant to the Medline corpus, is:
>"relating to simple or elementary organization; --G.A. Miller"

In my opinion the two descriptions are very similar, which could confuse the classifier. Nevertheless, the classifier assigned the first definition to every 'molecular' document, which is the correct result. That success is limited, however, as the classifier was trained on only 6 sentences, all of which could be tagged with the first definition. The classification does appear to be correct, as the first test shows:
>'activity', 'purified', 'weight', 'catalytic', 'property', 'bacterium', 'produced', 'early', 'growth', 'predominantly', 'synthesis',  'cell', 'organism', 'cycle', 'antibiotic',  and 'metabolism'

These words, selected from the first document, show that 'molecular' in this document refers to the 1st definition. The word "weight" appears directly after 'molecular' in some of the documents, such as Tests 2 and 6, so "weight" may be one of the context words that the classifier associates with the 1st definition.

In summary, the molecular classifier is accurate on these documents, but it should be tested on more documents to confirm that it can distinguish between the senses of 'molecular'. This shows a strength of the WSD algorithm: it is correct when it has been trained on the right sense.

In [18]:
# Code to print out the results and definitions of the third word #
word3 = wn.synsets(words_2_senses[2][0])
print(words_2_senses[2][0])
print(word3)
print("1st definition: {}".format(word3[0].definition()))
print("2nd definition: {}".format(word3[1].definition()))
print("Length of training data: {}".format(length_of_training_data[2]))
i = 0
for test in evalulating_tests[2]:
    i += 1
    print("")
    print("Test {}".format(i))
    print(test[0])
    print("")
    print(test[1])

temperature
[Synset('temperature.n.01'), Synset('temperature.n.02')]
1st definition: the degree of hotness or coldness of a body or environment (corresponding to its molecular activity)
2nd definition: the somatic sensation of cold or heat
Length of training data: 67

Test 1
['effect', 'elevated', 'temperature', 'growth', 'rate', 'wa', 'studied', 'five', 'strain', 'enterobacteriaceae', 'tested', 'shift', 'resulted', 'immediate', 'decrease', 'due', 'limitation', 'availability', 'endogenous', 'methionine', 'first', 'biosynthetic', 'enzyme', 'extract', 'aerobacter', 'aerogenes', 'salmonella', 'typhimurium', 'escherichia', 'coli', 'shown', 'sensitive']

Synset('temperature.n.02')

Test 2
['localization', 'mutation', 'affecting', 'ribonuclease', 'iii', 'activity', 'enzyme', 'specific', 'ribonucleic', 'acid', 'escherichia', 'coli', 'wa', 'attempted', 'series', 'mating', 'transduction', 'experiment', 'mapped', 'near', 'nadb', 'gene', 'strain', 'carrying', 'another', 'also', 'found', 'based', 

#### Temperature

As seen above, 'temperature' has two definitions. The 2nd relates to human body temperature, which is what we can assume the documents are referring to:

>the somatic sensation of cold or heat

The first definition also refers to a body of sorts, however, so in theory the sense of 'temperature' should be difficult to determine:

>the degree of hotness or coldness of a body or environment (corresponding to its molecular activity)

This is mostly due to the bracketed phrase, which features the word "molecular", but in general the 2nd definition is the one we are looking for. The classification of the 1st document is therefore correct, as it features the words:
> 'effect', 'elevated',  'growth', 'rate', 'enterobacteriaceae', 'biosynthetic', 'enzyme', 'extract',  'salmonella',  and 'coli'

These words relate to the body and hence to the second definition. However, the 7th document's classification is wrong, as it includes the words:
>'ionization', 'constant',  'nuclear', 'magnetic',  'proton', 'thermodynamic', 'restricted', 'environment', 'measurement', 'low', 'ph', and 'inflection'

The 7th document describes an experiment involving the 1st sense of temperature, but unfortunately it was classified as the 2nd sense. This shows a weakness in the WSD system, as it does not produce accurate results even with more extensive training data (67 sentences). Even though the two senses differ, it picks one sense and assigns it to every document. The system therefore appears to answer correctly only when the sense it favours happens to be the correct one for the Medline corpus, which is not how a classifier should work.

Overall, the main strength of this WSD system is the speed at which it runs, owing to the library functions it uses, which are themselves highly efficient. However, in its current setup it does not reliably classify the sense of a word given a document as context.
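
The accuracy judgements above were made by reading the documents. They could be expressed as a number by comparing the predicted senses against hand-assigned gold senses, alongside a baseline that always predicts the most common gold sense. The sketch below uses placeholder labels (`"sense_1"`, `"sense_2"`) and a hypothetical gold list; the values are illustrative and are not results from this notebook.

```python
from collections import Counter

def accuracy(predicted, gold):
    # Proportion of documents whose predicted sense matches the gold sense.
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def majority_baseline(gold):
    # Accuracy achieved by always predicting the most common gold sense.
    most_common_count = Counter(gold).most_common(1)[0][1]
    return most_common_count / len(gold)

# Hypothetical example: the classifier assigns the same sense to every document.
predicted = ["sense_2", "sense_2", "sense_2", "sense_2"]
gold = ["sense_2", "sense_2", "sense_1", "sense_2"]
print(accuracy(predicted, gold), majority_baseline(gold))
```

When a classifier assigns one sense to every document, its accuracy can be no higher than this baseline, which makes the comparison a useful check on whether the context words are contributing anything.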

d) How could you extend or improve your WSD system?  You are **not** expected to code any of these extensions or improvements, but your answer should give sufficient details to make it clear how they might be carried out in practice. \[5 marks\]

##### Ways to Extend the WSD system: 
* Include all the words from the top 10 list of 2-sense words. This would cover more of the words in the Medline corpus and their most likely senses, such as:
> p, iii, mph, uptake, may, amino, molecule

* Use more than one human sense-annotated corpus to train the classifier, so that it sees a wider range of senses and potentially more training examples for each sense of the given word.

##### Ways to Improve the WSD system: 
* Use a human sense-annotated Medline corpus, as this would provide training senses from the same domain as the documents being classified.

* Apply the algorithm to the Brown corpus instead, so that the top 10 2-sense words are drawn from the same corpus that the classifier is trained on.

* Choose words that are not always associated with medical contexts, as these may be better represented in the Brown corpus, such as:
> mph, iii, uptake

Use the code below to verify that the length of your submission does not exceed 2000 words.

In [28]:
##This code will word count all of the markdown cells in the notebook saved at filepath
##Running it before providing any answers shows that the questions have a word count of 1202

import io
from nbformat import current

filepath="a2.ipynb"
question_count=626

with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')

word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))

Submission length is 1992
