# NLE Assignment 2: Distributional Semantics

In this assignment, you will be investigating the *distributional hypothesis*: **words which appear in similar contexts tend to have similar meanings**.

For assessment, you are expected to complete and submit this notebook file. When answers require code, you may import and use library functions (unless explicitly told otherwise). All of your own code should be included in the notebook rather than imported from elsewhere. Written answers should also be included in the notebook. You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about the assignment questions with your peers. If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set-up the assignment and bring in some data. In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell. Otherwise do not change the code in these cells.

In [1]:
candidateno=198397 #this MUST be updated to your candidate number so that you get a unique data sample

In [2]:
#set up drives for resources.  Change the path as necessary

from google.colab import drive
#mount google drive
drive.mount('/content/drive/')
import sys
sys.path.append('/content/drive/My Drive/NLE Notebooks/resources/')


Mounted at /content/drive/


In [3]:
#do not change the code in this cell
#preliminary imports

import re
import random
import math
import pandas as pd
import matplotlib.pyplot as plt
from itertools import zip_longest

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('wordnet_ic')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wn_ic
brown_ic = wn_ic.ic("ic-brown.dat")


from sussex_nltk.corpus_readers import ReutersCorpusReader

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet_ic.zip.
Sussex NLTK root directory is /content/drive/My Drive/NLE Notebooks/resources


We are going to be using the Reuters corpus of financial documents for this assignment.  When you run the following cell you should see that it contains 1,113,359 sentences.

In [4]:
#do not change the code in this cell
rcr = ReutersCorpusReader().finance()
rcr.enumerate_sents()

1113359

The following cell will take 2-5 minutes to run.  It will generate a unique-to-you sample of 200,000 sentences.  These sentences are tokenised and normalised for case and number for you.

In [5]:
#do not change the code in this cell
def normalise(tokenlist):
    tokenlist=[token.lower() for token in tokenlist]
    tokenlist=["NUM" if token.isdigit() else token for token in tokenlist]
    tokenlist=["Nth" if (token.endswith(("nd","st","th")) and token[:-2].isdigit()) else token for token in tokenlist]
    tokenlist=["NUM" if re.search("^[+-]?[0-9]+\.[0-9]",token) else token for token in tokenlist]
    return tokenlist

random.seed(candidateno)  
samplesize=2000
iterations =100
sentences=[]
for i in range(0,iterations):
    sentences+=[normalise(sent) for sent in rcr.sample_sents(samplesize=samplesize)]
    print("Completed {}%".format(i))
print("Completed 100%")


Completed 0%
Completed 1%
Completed 2%
Completed 3%
Completed 4%
Completed 5%
Completed 6%
Completed 7%
Completed 8%
Completed 9%
Completed 10%
Completed 11%
Completed 12%
Completed 13%
Completed 14%
Completed 15%
Completed 16%
Completed 17%
Completed 18%
Completed 19%
Completed 20%
Completed 21%
Completed 22%
Completed 23%
Completed 24%
Completed 25%
Completed 26%
Completed 27%
Completed 28%
Completed 29%
Completed 30%
Completed 31%
Completed 32%
Completed 33%
Completed 34%
Completed 35%
Completed 36%
Completed 37%
Completed 38%
Completed 39%
Completed 40%
Completed 41%
Completed 42%
Completed 43%
Completed 44%
Completed 45%
Completed 46%
Completed 47%
Completed 48%
Completed 49%
Completed 50%
Completed 51%
Completed 52%
Completed 53%
Completed 54%
Completed 55%
Completed 56%
Completed 57%
Completed 58%
Completed 59%
Completed 60%
Completed 61%
Completed 62%
Completed 63%
Completed 64%
Completed 65%
Completed 66%
Completed 67%
Completed 68%
Completed 69%
Completed 70%
Completed 71%
Co

`generate_features()` will used in the first part of the assignment.

In [6]:
#do not change the code in this cell
def generate_features(sentences,window=1):
    mydict={}
    for sentence in sentences:
        for i,token in enumerate(sentence):
            current=mydict.get(token,{})
            features=sentence[max(0,i-window):i]+sentence[i+1:i+window+1]
            for feature in features:
                current[feature]=current.get(feature,0)+1
            mydict[token]=current
    return mydict

1) Run `generate_features(sentences[:5])`. With reference to the code and the specific examples, explain how the output was generated [10 marks]

In [None]:
generate_features(sentences[:5])

{'$': {'in': 1, 'mln': 1},
 '&': {'edwards': 1, 'sons': 1},
 "''": {',': 1, 'analyst': 1},
 '(': {'account': 1, 'in': 1, 'may': 1, 'postipankki': 1},
 ')': {'NUM': 2, 'mln': 1},
 ',': {"''": 1, 'inc': 1, 'realistic': 1, 'sons': 1},
 '.': {'gmt': 1, 'inc': 1, 'reuters': 1},
 '=': {'postipankki': 1, 'psp': 1},
 'NUM': {')': 2, 'NUM': 4, 'around': 1, 'gmt': 1, 'may': 1},
 '``': {'the': 1},
 'a.g.': {'edwards': 1},
 'account': {'(': 1, 'current': 1},
 'analyst': {"''": 1, 'zuna': 1},
 'are': {'not': 1, 'targets': 1},
 'around': {'NUM': 1, 'vote': 1},
 'baring': {'ing': 1, 'securities': 1},
 'current': {'account': 1},
 'edwards': {'&': 1, 'a.g.': 1},
 'gmt': {'.': 1, 'NUM': 1},
 'in': {'$': 1, '(': 1},
 'inc': {',': 1, '.': 1},
 'ing': {'baring': 1, 'of': 1},
 'mansoor': {'of': 1, 'zuna': 1},
 'may': {'(': 1, 'NUM': 1},
 'mln': {'$': 1, ')': 1},
 'not': {'are': 1, 'realistic': 1},
 'of': {'ing': 1, 'mansoor': 1},
 'postipankki': {'(': 1, '=': 1},
 'psp': {'=': 1},
 'realistic': {',': 1, 'no

> The function generate_features is a function which takes in an input 'sentences' and a second input 'window'. The 'sentences' is a list of sentences being [:5] meaning it takes up to 5 sentences  and the 'window' is by default 1 which is an area around a target word. The window  takes the default value of window (which equals one) to the set value of 5 i.e [1:5] being the first five sentences in the list. These sentences are then broken down into tokens and iterated through a for loop creating further windows specifically either side of each token. For example, [max(0,i-window):i]+sentence[i+1:i+window+1] takes a single word from either side, because window is the value i (being the position) so it goes back a position and infront a position of the target word. This then gives produces a word and a window of the words either side of it and counts the co-occurences of words. For example in the result above ')': {'NUM': 2, 'mln': 1} states that ')' co-occurs with 'NUM' twice and mln once. NUM can be an number and brackets are often used with mathematic equations so it is likely thats why this combination occurs most. This then outputs into a dictionary and we can see every word of the sentences with a list of words that occur with them and the amount of times they co-occur.

2) Write code and **find** the 1000 most frequently occurring words that
* are in your sample; AND
* have at least one noun sense according to WordNet [10 marks]

In [7]:
import operator
stop = stopwords.words('english')
dictnouns = {}
def getdictnouns(sentences):
  for sentence in sentences: #iterate through sentences
    for words in sentence: #iterates through words 
      if len(wn.synsets(sentence[0], wn.NOUN)) >0: #checks for atleast one noun sense
        if words in dictnouns:
          dictnouns[words] += 1 #adds to word frequency
        else:
          dictnouns.update({words:1}) #adds to dictionary and initialises frequency
getdictnouns(sentences)
dictnouns

#below is all the normalisation to remove tokens with no content suchas, number, symbols and stopwords
dictnouns_stopremoval = {k: v for k,v in dictnouns.items() if k not in stop and k.isalpha()and k != "NUM"}
sortednouns = dict(sorted(dictnouns_stopremoval.items(), key=operator.itemgetter(1), reverse=True))
sortednouns


{'said': 12706,
 'percent': 8142,
 'would': 4296,
 'million': 4046,
 'pct': 3989,
 'year': 3983,
 'bank': 3876,
 'may': 3728,
 'billion': 3393,
 'government': 3120,
 'trade': 2760,
 'tax': 2591,
 'economic': 2314,
 'last': 2235,
 'rate': 2164,
 'budget': 2156,
 'new': 2130,
 'june': 2122,
 'minister': 2045,
 'growth': 1995,
 'first': 1936,
 'april': 1920,
 'also': 1789,
 'price': 1765,
 'union': 1752,
 'central': 1739,
 'balance': 1678,
 'bonds': 1671,
 'state': 1667,
 'inflation': 1662,
 'finance': 1603,
 'total': 1592,
 'bln': 1550,
 'foreign': 1527,
 'march': 1496,
 'gdp': 1477,
 'prices': 1449,
 'expected': 1442,
 'wednesday': 1417,
 'index': 1410,
 'deficit': 1397,
 'tuesday': 1391,
 'could': 1378,
 'currency': 1357,
 'due': 1354,
 'economy': 1350,
 'thursday': 1346,
 'week': 1344,
 'july': 1344,
 'market': 1319,
 'rates': 1275,
 'change': 1271,
 'interest': 1263,
 'exports': 1234,
 'two': 1233,
 'european': 1212,
 'one': 1181,
 'officials': 1158,
 'monday': 1143,
 'monetary': 114

In [None]:
def top1000nouns(freqdist,k=1000):
  return sorted(freqdist.items(),key=operator.itemgetter(1),reverse=True)[:k] #returns list of sorted frequency of top 1000 noun senses
topnouns = top1000nouns(sortednouns)
topnouns

[('said', 12706),
 ('percent', 8142),
 ('would', 4296),
 ('million', 4046),
 ('pct', 3989),
 ('year', 3983),
 ('bank', 3876),
 ('may', 3728),
 ('billion', 3393),
 ('government', 3120),
 ('trade', 2760),
 ('tax', 2591),
 ('economic', 2314),
 ('last', 2235),
 ('rate', 2164),
 ('budget', 2156),
 ('new', 2130),
 ('june', 2122),
 ('minister', 2045),
 ('growth', 1995),
 ('first', 1936),
 ('april', 1920),
 ('also', 1789),
 ('price', 1765),
 ('union', 1752),
 ('central', 1739),
 ('balance', 1678),
 ('bonds', 1671),
 ('state', 1667),
 ('inflation', 1662),
 ('finance', 1603),
 ('total', 1592),
 ('bln', 1550),
 ('foreign', 1527),
 ('march', 1496),
 ('gdp', 1477),
 ('prices', 1449),
 ('expected', 1442),
 ('wednesday', 1417),
 ('index', 1410),
 ('deficit', 1397),
 ('tuesday', 1391),
 ('could', 1378),
 ('currency', 1357),
 ('due', 1354),
 ('economy', 1350),
 ('thursday', 1346),
 ('week', 1344),
 ('july', 1344),
 ('market', 1319),
 ('rates', 1275),
 ('change', 1271),
 ('interest', 1263),
 ('exports',

In [None]:
#do not change the code in this cell.  It relates to Q3
wordpair=("house","garden")
concept_1=wn.synsets(wordpair[0])[0]
concept_2=wn.synsets(wordpair[1])[0]
print("Path similarity between 1st sense of {} and 1st sense of {} is {}".format(wordpair[0],wordpair[1],wn.path_similarity(concept_1,concept_2)))
print("Resnik similarity between 1st sense of {} and 1st sense of {} is {}".format(wordpair[0],wordpair[1],wn.res_similarity(concept_1,concept_2, brown_ic)))
print("Lin similarity between 1st sense of {} and 1st sense of {} is {}".format(wordpair[0],wordpair[1],wn.lin_similarity(concept_1,concept_2, brown_ic)))

Path similarity between 1st sense of house and 1st sense of garden is 0.08333333333333333
Resnik similarity between 1st sense of house and 1st sense of garden is 1.2900256809649917
Lin similarity between 1st sense of house and 1st sense of garden is 0.15380807721262396


3) Consider the code above which outputs the path similarity score, the Resnik similarity score and the Lin similarity score for a pair of concepts in WordNet.  Answer the following questions.

a) Explain what each of the numbers in the output means.

b) Write code to find the semantic similarity of a pair of words according to WordNet with a parameter to specify the measure of semantic similarity between concepts.  Explain and justify the strategy used for words which have multiple senses.

c) Choose one of the measures of semantic similarity and then for every possible pair of words identified in Q2, determine the semantic similarity of the pair according to WordNet.  Justify your choice of semantic similarity measure.

d) Identify the 10 most similar words (according to WordNet) to the most frequent word in the corpus [20 marks]

3. A  Explain what each of the numbers in the output means.

>The path similarity returns a score in a range of 0 to 1 which determines how similar two word senses are, based on the shortest path that connects the senses through taxonomy.

>The Resnik similarity returns a score to determine how similar two word senses are based on the information content of the least common subsumer.

>The Lin similarity returns a score to determine how similar two word senses are based on the information conent of the least common subsumer and of the two input synsets.

3. B Write code to find the semantic similarity of a pair of words according to WordNet with a parameter to specify the measure of semantic similarity between concepts. Explain and justify the strategy used for words which have multiple senses.

> When coming across multiple senses a strategy to use could be supervised methods, by using human judgement and associating the answer with it. Because a human can understand the grammar, structure and choice of a chosen word so the machine can learn and adapt to that choice.


In [None]:
def word_similarity(wordA,wordB,pos=wn.NOUN,measure="path"): #method taking in parameters for word1 word2 and the type of measure
    synsetsA=wn.synsets(wordA,pos) #creates synset for chosen word
    synsetsB=wn.synsets(wordB,pos) #creates synset for chosen word
    maxsofar=0
    path = "similarity title" #sets default type of measure
    brown_ic=wn_ic.ic("ic-brown.dat")
    for synsetA in synsetsA:
        for synsetB in synsetsB:
            if measure=="path": #checks type of measure
                sim=wn.path_similarity(synsetA,synsetB)
                path = "Path similarity " #changes type of measure accordingly
            elif measure=="resnik": #checks type of measure
                sim=wn.res_similarity(synsetA,synsetB,brown_ic)
                path = "Resnik similarity" #changes type of measure accordingly
            elif measure=="lin": #checks type of measure
                sim=wn.lin_similarity(synsetA,synsetB,brown_ic)
                path = "Lin similarity " #changes type of measure accordingly
            
            if sim>maxsofar: #checks if similarity is at its highest point
                maxsofar=sim
    print("{} between {} and {} is {}".format(path,wordA,wordB,maxsofar)) #prints in a set format 

word_similarity("cat","dog",measure="path")
word_similarity("cat","dog",measure="resnik")
word_similarity("cat","dog",measure="lin")

Path similarity  between cat and dog is 0.2
Resnik similarity between cat and dog is 7.911666509036577
Lin similarity  between cat and dog is 0.8768009843733973


3. C  Choose one of the measures of semantic similarity and then for every possible pair of words identified in Q2, determine the semantic similarity of the pair according to WordNet. Justify your choice of semantic similarity measure.

Lin similarity returns a score to determine how similar two word senses are based on the information conent of the least common subsumer and of the two input synsets. I chose lin similarity because it takes the information content of compared concepts into consideration. With path similarity two pairs with equal lengths of the shortest path will have the same similarity which isn't helpful and with resnik similarity two pairs with the same lowest common subsumer will have the same similarity which also doesnt help. Therefore making lin similarity a more accurate approach.

In [None]:
def lin_similarity(wordA,wordB,pos=wn.NOUN):
    synsetsA=wn.synsets(wordA,pos) #puts words into synsets
    synsetsB=wn.synsets(wordB,pos) #puts words into synsets
    maxsofar=0
    for synsetA in synsetsA:
        for synsetB in synsetsB:
            sim=wn.lin_similarity(synsetA,synsetB,brown_ic) #uses lin on word synsets
            if sim>maxsofar:
                maxsofar=sim
    print("Lin similarity between {} and {} is {}".format(wordA,wordB,maxsofar))
lin_similarity("chicken","car")

Lin similarity between chicken and car is 0.17900106582025765


In [None]:
wordstopair = top1000nouns(sortednouns)
wordstopair
for i in range(0,len(wordstopair)): #loops through eveyr word
  for j in range(i + 1, len(wordstopair)): # nested loop to then cover every word ahead of the previous loop of words, thus covering every word with every word
    lin_similarity(wordstopair[i][0],wordstopair[j][0]) #conducts lin similarity check with each given pair of words

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Lin similarity between ratings and elections is 0.49617138095124536
Lin similarity between ratings and forced is 0
Lin similarity between ratings and refunding is 0
Lin similarity between ratings and suggested is 0
Lin similarity between ratings and release is 0.24192091816462227
Lin similarity between ratings and program is 0.35674075471326744
Lin similarity between ratings and accounts is 0.2955897861702772
Lin similarity between ratings and review is 0.8505368311357361
Lin similarity between ratings and steady is 0
Lin similarity between ratings and create is 0
Lin similarity between ratings and abroad is 0
Lin similarity between ratings and assembly is 0.2684889690283737
Lin similarity between ratings and personal is 0.0663488187969821
Lin similarity between ratings and concerns is 0.3467270079752856
Lin similarity between ratings and falls is 0.3163025522754525
Lin similarity between ratings and approval is 0.2976331

3. D Identify the 10 most similar words (according to WordNet) to the most frequent word in the corpus [20 marks]

In [None]:
import operator
#below creates seperate dictionary of words 
stop = stopwords.words('english')
dictwords = {}
def getdictwords(sentences):
  for sentence in sentences:
    for words in sentence:
        if words in dictwords:
          dictwords[words] += 1
        else:
          dictwords.update({words:1})
getdictwords(sentences)
dictwords 

#below normalises dictwords i.e removes stopwords numbers and punctuation
dictwords_stopremoval = {k: v for k,v in dictwords.items() if k not in stop and k.isalpha() and k != "NUM"} #removes number and stopwords
#below sorts order of words
sortedwords = dict(sorted(dictwords_stopremoval.items(), key=operator.itemgetter(1), reverse=True)) # sorts words order
sortedwords
def mostfreqword(freqdist,k=1): #identifies most frequent word
  return sorted(freqdist.items(),key=operator.itemgetter(1),reverse=True)[:k] #sorts the frequency of words into order and finds the single highest occuring word
mostfrequent = mostfreqword(sortedwords)

print("The synsets below are all similar words to 'said' the most frequently used word in the corpus:")
similarwords = [] 
synsets = wn.synsets("said")  
    #iterate over synsets
for synset in synsets:
  #iterates through synsets lemmas
  for l in synset.lemmas(): 
    #checks if similarwords list is no bigger than 10 long and that all words are unique
    if len(similarwords) < 10 and l.name() not in similarwords:
        similarwords.append(l.name()) 
print(similarwords)

The synsets below are all similar words to 'said' the most frequently used word in the corpus:
['state', 'say', 'tell', 'allege', 'aver', 'suppose', 'read', 'order', 'enjoin', 'pronounce']


4)a) Write code to construct distributional vector representations of words in the corpus with a parameter to specify context size.  Explain how you calculate the value of association between each word and each context feature.

b) Use your code to construct representations of the 1000 words identified in Q2 with a window size of 1 

c) Use your representations to find the 10 words which are distributionally most similar to the most frequent word in the corpus. [15 marks]

4)a) Write code to construct distributional vector representations of words in the corpus with a parameter to specify context size. Explain how you calculate the value of association between each word and each context feature.
>
Using positive pointwise mutual information (PMI) is a way to value the association berween words. This is because PMI can determine the significance of a frequency of co-occurence. PMI can determine how common a co-occurence is and by doing so can establish whether it is insignificant or significant. If they are common words and common co-occurences they will be insignificant, whereas, if they are rare words then co-occurences will be determined as more important in the representation of each word. To calculate PMI you do the following:
>
\begin{eqnarray*}
PMI(word,feat) = \frac{\mbox{freq}(word,feat) \times \Sigma_{w*,f*} \mbox{freq}(w*,f*)}{\Sigma_{f*} \mbox{freq}(word,f*) \times \Sigma_{w*} \mbox{freq}(w*,feat)}
\end{eqnarray*}
>


In [9]:
def dot(vecA,vecB):
    the_sum=0
    for (key,value) in vecA.items():
        the_sum+=value*vecB.get(key,0)
    return the_sum

In [10]:
class vectors:
    def __init__(self,sentences,window=3):
        self.sentences=sentences
        self.window=window
        self.reps={}
        self.wordtotals={}
        self.feattotals={}
        self.generate_features()
        self.grandtotal=sum(self.wordtotals.values())
        self.convert_to_ppmi()
    
    def generate_features(self):
        for sentence in self.sentences:
            for i,token in enumerate(sentence):
                current=self.reps.get(token,{})
                features=sentence[max(0,i-self.window):i]+sentence[i+1:i+self.window+1]
                for feature in features:
                    current[feature]=current.get(feature,0)+1
                    self.feattotals[feature]=self.feattotals.get(feature,0)+1
                self.wordtotals[token]=self.wordtotals.get(token,0)+len(features)
                self.reps[token]=current

    def convert_to_ppmi(self):
        self.ppmi={word:{feat:max(0,math.log((freq*self.grandtotal)/(self.wordtotals[word]*self.feattotals[feat]),2)) for (feat,freq) in rep.items()} for (word,rep) in self.reps.items()}
    
    def similarity(self,word1,word2):
        rep1=self.ppmi.get(word1,{})
        rep2=self.ppmi.get(word2,{})
        return dot(rep1,rep2)/math.sqrt(dot(rep1,rep1)*dot(rep2,rep2))
    
    def nearest_neighbours(self,word1,n=1000,k=10):
        candidates=sorted(self.wordtotals.items(),key=operator.itemgetter(1),reverse=True)[:n]
        sims=[(cand,self.similarity(word1,cand)) for (cand,_) in candidates]
        return sorted(sims,key=operator.itemgetter(1),reverse=True)[:k]

In [None]:
vcr=vectors(sentences,window=1)
vcr.convert_to_ppmi()
print(vcr.similarity('korea','germany'))

0.03672634043661373


In [None]:
vcr.reps['said']

In [None]:
vcr.generate_features()

In [None]:
vcr.similarity("said", "say")

0.034341937499454474

4.C

In [None]:
vcr.nearest_neighbours("said")

[('said', 1.0),
 ("''", 0.42742291973035473),
 ('told', 0.22901210586530568),
 (',', 0.19507157818399515),
 ('added', 0.1416384654932068),
 ("'s", 0.11012275321283063),
 ('but', 0.096675436568098),
 ('says', 0.08836968511984952),
 ('noted', 0.08819544901156108),
 ('.', 0.08223991888292137)]

5) Plan and carry out an investigation into the correlation between semantic similarity according to WordNet and distributional similarity with different context window sizes. You should make sure that you include a graph of how correlation varies with context window size and that you discuss your results. [25 marks]

In [12]:
vcr=vectors(sentences,window=1)
vcr.nearest_neighbours('said')

[('said', 1.0),
 ("''", 0.42742291973035473),
 ('told', 0.22901210586530568),
 (',', 0.19507157818399515),
 ('added', 0.1416384654932068),
 ("'s", 0.11012275321283063),
 ('but', 0.096675436568098),
 ('says', 0.08836968511984952),
 ('noted', 0.08819544901156108),
 ('.', 0.08223991888292137)]

In [13]:
vcr=vectors(sentences,window=5)
vcr.nearest_neighbours('said')

[('said', 1.0),
 ("''", 0.5290441758604836),
 (',', 0.3955101788939153),
 ('.', 0.2930806867989276),
 ('economist', 0.24068712320692287),
 ('he', 0.22958137259461872),
 ('the', 0.20529199549183638),
 ('minister', 0.19305798188873144),
 ('at', 0.1889938259501839),
 ('chief', 0.17971826978787217)]

In [14]:
vcr=vectors(sentences,window=10)
vcr.nearest_neighbours('said')

[('said', 1.0),
 ("''", 0.5424119155291371),
 (',', 0.4028636718802825),
 ('.', 0.32495113212439214),
 ('he', 0.2689308498941462),
 ('the', 0.267067353236507),
 ('of', 0.25247986636505915),
 ('economist', 0.24782400139489105),
 ('a', 0.2472763389925784),
 ('in', 0.2411345856709793)]

In [15]:
vcr=vectors(sentences,window=25)
vcr.nearest_neighbours('said')

[('said', 1.0),
 ("''", 0.5289694613477398),
 ('``', 0.3969227947270833),
 (',', 0.34138590962740445),
 ('a', 0.3288429811353119),
 ('it', 0.3282071673496143),
 ('the', 0.3244342753502713),
 ('he', 0.31633289775108114),
 ('in', 0.3104919590499118),
 ('of', 0.3070066691348281)]

In [None]:
winsize1.similarity("which","were")

0.08177185265684372

In [None]:
winsize10.similarity("which","were")

0.13540878782410715

In [None]:
winsize25.similarity("which","were")

0.1528039243955428

In [None]:
winsize50.similarity("which","were")

0.15466628602217478

In [None]:
winsize100.similarity("which","were")

0.15484694578366287

5. 
> I conducted a small investigation into the different window sizes. I did this by altering the context window size between 1 and 25 with the same word to see what the cosine similarity was for each word matching. It is clear and shown above, that as the context window increases so does the similarity score of each word. So to conclude, the correlation between semantic similarity and distribution similarity with different context window sizes is they both positively correlate together. If one increases so will the other, this is because words that are used and occur in the same contexts to each other typically follow similar meanings. 

In [None]:
##This code will word count all of the markdown cells in the notebook saved at filepath
##Running it before providing any answers shows that the questions have a word count of 388

import io
from nbformat import current

filepath="/content/drive/My Drive/NLE Notebooks/assessment/NLEassignment2.ipynb"
question_count=568

with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')

word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))


- use nbformat for read/write/validate public API
- use nbformat.vX directly to composing notebooks of a particular version

  """)


Submission length is 515
