# Week 7: Distributional Semantics

This week there is only one Jupyter notebook for you to complete!  

In the lectures, we have introduced the idea of distributional semantics. In a distributional model of meaning, words are represented in terms of their co-occurrences.

However, what does it mean for two words to co-occur?  Here we are going to look at how the **definition of co-occurrence** affects the nature of the similarity discovered.  In particular, we are going to contrast *close proximity* co-occurrence (where words co-occur, say, next to each other) with more *distant proximity* co-occurrence (where words co-occur, say, within a window of 10 words).

First, however, we need a corpus.  Here, we are going to work with the Reuters sports corpus.


In [10]:
#preliminary imports
import sys
import random
import operator
from Week4Labs.utils import normalise

sys.path.append(r'T:\Departments\Informatics\LanguageEngineering') 
sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
sys.path.append(r'/Users/juliewe/resources')

from sussex_nltk.corpus_readers import ReutersCorpusReader

#make sure you append the path to where your utils.py file is
sys.path.append(r'/Users/juliewe/Documents/teaching/NLE/NLE2019/w4/Week4Labs/')
from utils import *

First, we set up a corpus reader for the sport category of Reuters.  Using the `enumerate_sents()` method, we can see that it contains over 1 million sentences.

In [3]:
rcr = ReutersCorpusReader().sport()
rcr.enumerate_sents()

1113359

We are going to take a sample of this corpus, tokenize the sentences and normalise the text for case and number.  You could increase `samplesize` to 10000 (which, repeated 100 times, gives a total corpus size of 1000000 sentences), but this will noticeably slow down the cells.  Also note that 100 separate samples of size 2000 may contain duplicate sentences between them.  Don't worry about this: sampling "with replacement" is quite common and allows us to bootstrap estimates of statistics.  Here, however, we are doing it because it is faster and lets us check on progress with a "Completed %" statement.
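To see why repeated sampling can produce duplicates, here is a minimal sketch on toy data. It assumes each batch is drawn independently of the others; how `sample_sents` samples internally is not shown in this notebook, so `population` and the use of `random.sample` are illustrative stand-ins only.

```python
import random

random.seed(37)  # fixed seed so the sketch is repeatable

population = list(range(10))  # toy stand-in for the corpus: 10 "sentences"

# Draw 3 independent batches of 5. Each batch is sampled without replacement,
# but nothing stops a later batch from picking an item an earlier one picked.
batches = [random.sample(population, 5) for _ in range(3)]
combined = [item for batch in batches for item in batch]

distinct = len(set(combined))
print(len(combined), distinct)  # 15 items drawn, at most 10 of them distinct
```

With 15 draws from only 10 items, at least some items must appear more than once in `combined`.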

In [11]:
random.seed(37)  #this will ensure it is the same sample every time you run the cell
samplesize=2000
iterations =100
sentences=[]
for i in range(0,iterations):
    sentences+=[normalise(sent) for sent in rcr.sample_sents(samplesize=samplesize)]
    print("Completed {}%".format(i))

Completed 0%
Completed 1%
Completed 2%
Completed 3%
...

In [12]:
sentences[0]

['``',
 'but',
 'fortunately',
 ',',
 'stairs',
 'hit',
 'the',
 'ball',
 'hard',
 'and',
 'i',
 'charged',
 'it',
 'hard',
 'and',
 'got',
 'rid',
 'of',
 'it',
 'as',
 'quickly',
 'as',
 'i',
 'could',
 'and',
 'made',
 'a',
 'good',
 'throw',
 '.']

### Exercise 1
* Write (or adapt from previous labs) a function to find the frequency distribution of words in the sample of sentences
* Generate a list of the 100 most frequent words in the corpus. 
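As a starting point, one possible approach uses `collections.Counter` from the standard library. This is only a sketch: the function name `word_frequencies` and the toy sentences are hypothetical, and you may prefer to adapt your own code from earlier labs.

```python
from collections import Counter

def word_frequencies(sentences):
    """Count how often each token occurs across a list of tokenised sentences."""
    counts = Counter()
    for sent in sentences:
        counts.update(sent)
    return counts

# Hypothetical toy data; in the lab you would pass the sampled `sentences`.
toy_sentences = [["the", "ball", "hit", "the", "net"],
                 ["the", "player", "hit", "it"]]
freqs = word_frequencies(toy_sentences)

# most_common(n) returns (word, count) pairs, most frequent first.
top_words = [word for word, _ in freqs.most_common(2)]
print(top_words)  # ['the', 'hit']
```

For the exercise itself, you would ask for the 100 most common words rather than 2.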

### Generating feature representations

We want to be able to consider any words that are in a certain **window** around a target word as features of that word.  The code below demonstrates how to iterate through a sentence and find all of the tokens within a given window of each word.

In [None]:
tokens=word_tokenize("the moon is blue and made of cheese")

window=2


for i,word in enumerate(tokens):
    print(word,tokens[max(0,i-window):i]+tokens[i+1:i+window+1])

### Exercise 2.1
Write a function `generate_features(sentences,window=1)` which takes
* a list of sentences (where each sentence is a list of tokens); and
* a window size.

This function should output
* a dictionary of dictionaries

The key to the outermost dictionary is a word.  The key to each internal dictionary is another word (a co-occurrence feature).  The value in the internal dictionary should be the number of times the two words co-occur (within the given window).

For example, with the sentences in `sents`, your function should generate the following output:

<img src="files/output21.png">
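One possible shape for such a function is sketched below, reusing the window slicing shown earlier. The toy sentences here are hypothetical and are not the `sents` referred to above, so the printed result is not meant to match the expected output image.

```python
from collections import defaultdict

def generate_features(sentences, window=1):
    """Map each word to a dictionary of co-occurrence counts within the window."""
    reps = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, word in enumerate(sent):
            # Tokens up to `window` positions to the left and to the right.
            context = sent[max(0, i - window):i] + sent[i + 1:i + window + 1]
            for feat in context:
                reps[word][feat] += 1
    # Convert to plain dictionaries for easier inspection.
    return {word: dict(feats) for word, feats in reps.items()}

toy = [["the", "moon", "is", "blue"], ["the", "moon", "is", "cheese"]]
toy_reps = generate_features(toy, window=1)
print(toy_reps["moon"])  # {'the': 2, 'is': 2}
```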



## Pointwise Mutual Information (PMI)
So far, we have calculated the frequency of two events occurring together.  For example, we can see how often the word 'tennis' appears in the window around the word 'player'.

In [None]:
reps=generate_features(sentences,window=1)
reps['player']['tennis']

We use positive pointwise mutual information (PPMI) to establish how **significant** a given frequency of co-occurrence is.  If *player* and *tennis* are both very common words then their co-occurring 10 times may be insignificant.  However, if they are rare words, then a co-occurrence of 10 should be considered more important in the representation of each word.  PMI can be calculated as follows:

\begin{eqnarray*}
PMI(word,feat) = \log_2\left(\frac{\mbox{freq}(word,feat) \times \sum_{w*,f*} \mbox{freq}(w*,f*)}{\sum_{f*} \mbox{freq}(word,f*) \times \sum_{w*} \mbox{freq}(w*,feat)}\right)
\end{eqnarray*}



In order to carry out this calculation, we can see that we need the frequency of the co-occurrence *player* and *tennis*, the total number of times *player* has occurred with any feature, the total number of times *tennis* has occurred as a feature and the grand total of all possible co-occurrences.  We can keep track of these totals as we build the feature representations.
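The formula can be checked by hand on small numbers. The sketch below is a direct transcription of it; the function name `pmi` and all of the counts are hypothetical and chosen so that the arithmetic is easy to follow.

```python
import math

def pmi(cooc, word_total, feat_total, grand_total):
    """PMI from a co-occurrence count, the two marginal totals and the grand total."""
    return math.log2((cooc * grand_total) / (word_total * feat_total))

# Hypothetical counts: the pair co-occurs 8 times, the word has 64 feature
# occurrences in total, the feature occurs 32 times, and there are 1024
# co-occurrences overall.  (8 * 1024) / (64 * 32) = 4, and log2(4) = 2.
print(pmi(8, 64, 32, 1024))  # 2.0

# With only 1 co-occurrence the ratio is 0.5, so the PMI is negative.
print(pmi(1, 64, 32, 1024))  # -1.0
```

A positive value means the pair co-occurs more often than would be expected if the word and the feature were independent; a negative value means less often.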

### Exercise 3.1
Create a class `word_vectors`.  This should be initialised with a list of sentences and a desired window size.  On initialisation, the feature representations of all words, together with word totals and feature totals, should be generated and stored in the object as
* self.reps (the feature representations: a dictionary of dictionaries)
* self.wordtotals (the frequency of each word: a dictionary of integers with the same keys as self.reps)
* self.feattotals (the frequency of each feature: a dictionary of integers with the same keys as the dictionaries indexed by self.reps)

Generate vectors from the sample sentences with a window size of 3.  If you look at the representation of `player`, you should find that the feature `australian` has the value 17.  The total frequency of features for the word `player` is 2722, and the total frequency of occurrences of the feature `australian` is 2220.
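To make the three totals concrete, the sketch below derives them from an existing dictionary of dictionaries. The helper name `compute_totals` and the toy counts are hypothetical; in your class you might instead update the totals while building the representations, as suggested above.

```python
from collections import Counter

def compute_totals(reps):
    """Return (word totals, feature totals, grand total) for a dict of dicts."""
    wordtotals = Counter()
    feattotals = Counter()
    for word, feats in reps.items():
        for feat, freq in feats.items():
            wordtotals[word] += freq   # row sum: all features seen with this word
            feattotals[feat] += freq   # column sum: all words seen with this feature
    grandtotal = sum(wordtotals.values())
    return wordtotals, feattotals, grandtotal

# Hypothetical toy representations.
toy_reps = {
    "player": {"tennis": 3, "the": 5},
    "tennis": {"player": 3, "the": 2},
    "the": {"player": 5, "tennis": 2},
}
wordtotals, feattotals, grandtotal = compute_totals(toy_reps)
print(wordtotals["player"], feattotals["tennis"], grandtotal)  # 8 5 20
```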

## Positive PMI (PPMI)
We now want to convert the representation of each word from one based on frequency to one based on PMI.  In fact, we want to ignore any features with negative PMI, so we use **positive PMI**:

\begin{eqnarray*}
\mbox{PPMI}(word,feat)=
\begin{cases}PMI(word,feat),& \mbox{if PMI}(word,feat)>0\\
0,& \mbox{otherwise}
\end{cases}
\end{eqnarray*}

### Exercise 3.2
Now add a method to your `word_vectors` class which will calculate the PPMI value for each feature in each vector.

The PPMI between `player` and `australian` should be 3.49.

## Word Similarity
We are going to use cosine similarity to compute the similarity between two word vectors.  

First, let's define a function to compute the dot product of two vectors.  This could be imported or copied from Lab_4_2.  However, an implementation is given below which you can use.

In [None]:
def dot(vecA,vecB):
    the_sum=0
    for (key,value) in vecA.items():
        the_sum+=value*vecB.get(key,0)
    return the_sum


### Exercise 4.1

* Add a `similarity` method to your word_vectors class to enable you to calculate the similarity between two word representations.
* You should find that the similarity between `australian` and `african` is 0.05.
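As a reminder, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal standalone sketch on sparse dictionary vectors follows; the function name `cosine` and the toy vectors are hypothetical, and `dot` is repeated here only so that the block runs on its own.

```python
import math

def dot(vecA, vecB):
    """Dot product of two sparse vectors stored as dictionaries."""
    the_sum = 0
    for (key, value) in vecA.items():
        the_sum += value * vecB.get(key, 0)
    return the_sum

def cosine(vecA, vecB):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(dot(vecA, vecA) * dot(vecB, vecB))
    return dot(vecA, vecB) / norm if norm else 0.0

a = {"x": 1.0, "y": 1.0}
b = {"x": 1.0}
c = {"z": 1.0}
print(round(cosine(a, b), 4))  # 0.7071, i.e. 1 / sqrt(2)
print(cosine(a, c))            # 0.0, because the vectors share no features
```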

## Nearest Neighbours
We now want to be able to find the nearest neighbours of a given word.  In order to do this we need to find its similarity with every other word in a set of *candidates* and then rank them by similarity.
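The ranking step described above can be sketched as follows. This assumes cosine similarity over dictionary vectors; the names `nearest_neighbours` and `cosine` and the toy vectors are hypothetical, and in the exercise this logic would live inside your class.

```python
import math

def dot(vecA, vecB):
    """Dot product of two sparse vectors stored as dictionaries."""
    return sum(value * vecB.get(key, 0) for key, value in vecA.items())

def cosine(vecA, vecB):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(dot(vecA, vecA) * dot(vecB, vecB))
    return dot(vecA, vecB) / norm if norm else 0.0

def nearest_neighbours(target, vectors, candidates, k=3):
    """Score every candidate against the target and return the k most similar."""
    scored = [(cand, cosine(vectors[target], vectors[cand]))
              for cand in candidates
              if cand != target and cand in vectors]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors.
vectors = {
    "tennis": {"racket": 2.0, "court": 1.0},
    "squash": {"racket": 1.0, "court": 1.0},
    "football": {"goal": 2.0, "pitch": 1.0},
}
neighbours = nearest_neighbours("tennis", vectors, list(vectors), k=1)
print([word for word, _ in neighbours])  # ['squash']
```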

### Exercise 4.2
* Add functionality to your `word_vectors` class to enable you to find the *k* nearest neighbours of any word.   You can improve efficiency by only considering the 1000 most frequent words as *candidates*.
* Use your functionality to investigate the effect of increasing the window size on the neighbourhood of a word.  You should consider at least:
    * the words \['australian', 'football'\]
    * the window sizes: window = \[1, 10\]
* Comment on the differences.