# Exercise "Lecture 11: Lexical Semantics"


In this set of exercises, we will convert words to vectors representing their distributional properties. 

We will start by building a word co-occurrence matrix from the Wikipedia corpus.

We will then use Gensim and sklearn predefined methods to build an SVD word context matrix from the Wikipedia corpus used in the preceding two lectures. Finally, we will use cosine similarity to compare different pairs of words.

The exercises cover the following points:

* Converting a corpus to a list of integers where each integer represents a token (this is needed for efficient computation)
* Computing a word frequency distribution
* Creating word co-occurrence matrices
* Applying SVD
* Finding neighbours

### Create a corpus from a set of files

**Exercise 1:** Store all files in 'data/wkp/' into a pandas dataframe with column Text where each row contains the content of one file

* use os.scandir to list the files in the directory, read each file into a list of strings (one string per file)   
_**Cheatsheet:**_ python_basics
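
A minimal sketch of the `os.scandir` approach suggested in the hint. The helper name `read_text_files` is hypothetical, and the sketch assumes a flat directory of UTF-8 text files. The demonstration uses a throwaway temporary directory rather than the real `data/wkp/` folder.

```python
import os
import tempfile


def read_text_files(directory):
    """Return the content of every regular file in `directory`, one string per file."""
    texts = []
    # os.scandir yields entries in arbitrary order, so sort by name for reproducibility
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if entry.is_file():
            with open(entry.path, "r", encoding="utf-8") as f:
                texts.append(f.read())
    return texts


# Demonstration on a temporary directory (stand-in for the real corpus folder)
with tempfile.TemporaryDirectory() as tmp:
    for name, content in [("a.txt", "An airport bus."), ("b.txt", "Airport security.")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(content)
    demo_texts = read_text_files(tmp)

print(demo_texts)
```

The resulting list can then be wrapped in a dataframe, for instance with `pd.DataFrame({'Text': demo_texts})`, to get one row per file.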

In [1]:
import os
import pandas as pd
import numpy as np
import nltk

In [2]:
os.chdir("/experiments/cours nlp/data science/lecture11/wkp_sorted")
data_folder = os.listdir()
wiki_data = []
for d in data_folder:
    path = os.listdir(os.path.join(os.getcwd(), d))
    for text_data in path:
        with open(os.path.join(os.getcwd(),d,text_data),"r",encoding="utf-8") as f:
            data = f.readlines()
            wiki_data.append([data])

In [3]:
wiki = pd.DataFrame(wiki_data, columns= ['text'])
wiki

Unnamed: 0,text
0,[Airports of Serbia (Serbian Cyrillic: Аеродро...
1,[An airport authority is an independent entity...
2,"[An airport bus, or airport shuttle bus or air..."
3,[Airport check-in is the process whereby passe...
4,[Airport security refers to the techniques and...
...,...
155,"[Al-Wasat (Arabic: الوسط), also Alwasat, was a..."
156,"[The Burj Al Arab (Arabic: برج العرب, Tower of..."
157,[Al-Fazl ( Urdu الفضل) has been the most impor...
158,"[Al HaMishmar (Hebrew: על המשמר, On Guard) was..."


In [15]:
#path = "/experiments/cours nlp/data science/lecture11/wkp_sorted"
#for l in os.scandir(path):
   # print(l.is_dir())

In [4]:
wiki['text'] = wiki['text'].apply(lambda x : "".join(x))
text_ = wiki['text'].str.cat(sep = " ")

In [5]:
from nltk.tokenize import word_tokenize

**Exercise 2:** Clean the corpus

* Store the content of the 'Text' column into a string (cf. Pandas CS, "Extracting all text from a column")
* Tokenize the string into words (cf. NLTK CS)
* Print out the first and the last 10 words of your list of tokens. Do you see tokens that may not be useful for learning word representations?
* Lowercase all tokens and remove all tokens that contain characters that are not letters

In [6]:
word = word_tokenize(text_)

In [7]:
print(word[:10])
print(word[-10:])
word = [l.lower() for l in word if l.isalpha()]

['Airports', 'of', 'Serbia', '(', 'Serbian', 'Cyrillic', ':', 'Аеродроми', 'Србије', ')']
['was', 'first', 'published', 'in', 'May', '1983', '.', '==', 'References', '==']


In [12]:
print(word[:10])

['airports', 'of', 'serbia', 'serbian', 'cyrillic', 'аеродроми', 'србије', 'is', 'a', 'serbian']


### Create a frequency distribution

**Exercise 3:** Create a frequency distribution from the list of tokens created in the previous exercise. Print out the 10 most and least frequent tokens.

_**Cheatsheet:**_ lexical semantics and stats_and_visu 
* Create a frequency distribution by iterating over the list of tokens while incrementing each token's frequency accordingly
* Sort the tokens by decreasing frequency into a list
* Print out the first and last 10 tokens of this list (which tokens are most and least frequent?)
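
The steps above can be sketched by hand on a toy token list (the list is made up for illustration and is not taken from the corpus). This is only one possible way to do it; `collections.Counter` gives the same counts in a single call.

```python
# Toy token list standing in for the cleaned corpus
tokens = ["the", "airport", "of", "the", "city", "the", "airport"]

# Increment each token's count while iterating over the list
freq = {}
for token in tokens:
    freq[token] = freq.get(token, 0) + 1

# Sort by decreasing frequency; sorted() is stable, so ties keep first-seen order
ranked = sorted(freq.items(), key=lambda item: item[1], reverse=True)

print(ranked[:2])   # most frequent tokens
print(ranked[-2:])  # least frequent tokens
```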

In [8]:
from collections import defaultdict, Counter
freqdist = Counter(word)

In [9]:
print(f"First ten frequent tokens :{pd.DataFrame(freqdist.most_common()[:10])}\n \
Last ten frequent tokens {pd.DataFrame(freqdist.most_common()[-10:])}")

First ten frequent tokens :     0      1
0  the  11652
1   of   5443
2  and   4677
3   in   4564
4   to   3773
5    a   3479
6   is   1633
7  was   1596
8   as   1495
9  for   1284
 Last ten frequent tokens             0  1
0     mustafa  1
1    karchawi  1
2        abed  1
3       jabri  1
4  abdelkerim  1
5       mouti  1
6      jailed  1
7        alam  1
8     ittihad  1
9   ichtiraki  1


### Convert the corpus to a list of integers

**Exercise 4:** Convert the list of tokens created in Exercise 2 (corpus cleaning) into a list of integers

* Create a dictionary mapping each token to a distinct integer (Cf. Lexical Semantics CS)
* Use this dictionary to convert each token from your cleaned corpus (Exercise 2) into an integer.
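
A small sketch of the two steps on a toy token list, using `enumerate` to number the distinct tokens. The names `vocabulary`, `token_to_id` and `corpus_ids` are illustrative, and unlike the solution cell this sketch does not reserve identifier 0 for a special `<eos>` token.

```python
# Toy token list standing in for the cleaned corpus
tokens = ["airports", "of", "serbia", "serbian", "of", "serbia"]

# dict.fromkeys removes duplicates while preserving first-seen order (Python 3.7+)
vocabulary = list(dict.fromkeys(tokens))
token_to_id = {token: idx for idx, token in enumerate(vocabulary)}

# Convert the corpus into a list of integers
corpus_ids = [token_to_id[token] for token in tokens]

print(token_to_id)
print(corpus_ids)
```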

In [10]:
token2int = defaultdict(lambda: len(token2int))  # assigns the next free integer to each unseen token
token2int['<eos>'] = 0
# `word` is already a flat list of tokens, so each token can be looked up directly
word_ids = [token2int[token] for token in word]

In [None]:
token2int

In [28]:
#token2int.keys()# a function to understand

### Create a dictionary of co-occurrences

**Exercise 5:** In the previous exercise, you created a list of integers where each integer is the identifier for the corresponding token in your cleaned up corpus. Iterate over that list and for each "integer token" *i*:

* get the neighbours of *i* within a window of size 5 (only looking at the right side of *i*)
* store these neighbours in a dictionary of co-occurrences of the form {(i,j):f,} where *(i,j)* are neighbours and *f* is the frequency of the co-occurrence
* Sort co-occurrences using integer order, i.e., if the neighbour *n* is represented by an identifier smaller than *i*, store the co-occurrence as *(n,i)*, otherwise as *(i,n)*.
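
To make the expected dictionary concrete, here is a sketch on a made-up list of five integer tokens. It uses a window of 2 instead of 5 purely to keep the result small enough to check by hand.

```python
from collections import defaultdict

ids = [0, 1, 2, 1, 0]  # toy corpus already converted to integers
window = 2             # the exercise uses 5; 2 keeps the example readable

cooc = defaultdict(int)
for pos, i in enumerate(ids):
    # Right-hand neighbours only; the slice shortens automatically near the end
    for n in ids[pos + 1 : pos + 1 + window]:
        # Store the pair with the smaller identifier first
        cooc[(min(i, n), max(i, n))] += 1

print(dict(cooc))
```

Note that a token can co-occur with itself, which produces keys of the form *(i,i)* such as `(1, 1)` here.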


In [None]:
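cooccurrences = defaultdict(int)
window = 5

for i, token in enumerate(word):
    a = token2int[token]
    # Right-hand neighbours only; the slice shortens automatically near the end of the
    # corpus, so the last tokens are not skipped
    for neighbour_token in word[i + 1 : i + 1 + window]:
        b = token2int[neighbour_token]
        # Store each pair with the smaller identifier first
        cooccurrences[(min(a, b), max(a, b))] += 1
cooccurrences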

### Compute the SVD of the Co-occurrence Matrix

**Exercise 6:** Compute the SVD of the word co-occurrence matrix you just created

* Create a matrix A of size (vocab_length, vocab_length)
* Fill each cell *(i,j)* in this matrix with the frequency of the co-occurrence between *i* and *j*
* Use numpy's [svd](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) function from the linalg module
* full_matrices = False ensures that reduced SVD is applied (rather than full SVD)

A = U · diag(s) · V, where numpy returns V already transposed
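
The factorisation can be checked on a small made-up matrix. In this sketch the third return value is named `Vh` to stress that numpy returns it already transposed. The truncation step at the end is an assumption about how the factors are commonly used to get low-dimensional word vectors; it is not part of the exercise.

```python
import numpy as np

# Small symmetric matrix standing in for a co-occurrence matrix
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vh = np.linalg.svd(A, full_matrices=False)

# The three factors reconstruct A up to floating-point error
A_rebuilt = U @ np.diag(s) @ Vh
print(np.allclose(A, A_rebuilt))

# Keeping only the k largest singular values gives k-dimensional row vectors
k = 2
word_vectors = U[:, :k] * s[:k]
print(word_vectors.shape)
```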

In [22]:
import numpy as np

matrix = np.zeros((len(token2int), len(token2int)))
for (i, j), value in cooccurrences.items():
    matrix[i, j] = value
matrix

array([[  0.,   0.,   0., ...,   0.,   0.,   0.],
       [  0.,   9.,  37., ...,   0.,   0.,   0.],
       [  0.,   0., 628., ...,   0.,   0.,   0.],
       ...,
       [  0.,   0.,   0., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,   0.,   0.,   1.],
       [  0.,   0.,   0., ...,   0.,   0.,   0.]])

In [None]:
U, s, V = np.linalg.svd(matrix, full_matrices=False)

**Exercise 7 (PROVIDED):** Define a function which returns the similarity between two words and apply it to measure the similarity of airport and news, airport and international, john and peter
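
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The plain-Python sketch below (the helper name `cosine` is hypothetical) illustrates the computation on toy vectors; it assumes neither vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # parallel vectors: close to 1
print(cosine([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors: 0
```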

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def similarity(embedding, word1, word2):
  # token2int is the vocabulary built in Exercise 4; unknown words return None
  if word1 in token2int and word2 in token2int:
    v1 = embedding[token2int[word1]].reshape(1, -1)
    v2 = embedding[token2int[word2]].reshape(1, -1)
    return cosine_similarity(v1, v2)[0][0]

print('cosine(airport, news) =', similarity(U, 'airport', 'news'))
print('cosine(airport, international) =', similarity(U, 'airport', 'international'))
print('cosine(john, peter) =', similarity(U, 'john', 'peter'))

**Exercise 8 (PROVIDED):** Define a function which outputs the neighbours of a word

In [None]:
reverse_vocab = {j: i for i, j in token2int.items()}

def most_similar(embedding, word, n=10):
  if word in token2int:
    v = embedding[token2int[word]].reshape(1, -1)
    scores = cosine_similarity(v, embedding).reshape(-1)
    result = []

    # argsort gives n-best scores
    for i in reversed(scores.argsort()[-n:]):
      result.append((reverse_vocab[i], scores[i]))
    return result

print(most_similar(U, 'airport'))
print(most_similar(U, 'news'))

# 2. Creating a word co-occurrence matrix using vectorizers

In the preceding section, we created the word co-occurrence matrix programmatically (we wrote the algorithm for deriving the matrix from the corpus). There is in fact a much quicker way to do this, which can be summarised as follows:

* Use sklearn vectorizer methods (CountVectorizer, TfidfVectorizer) to convert the corpus (a list of documents) to a _**document/token matrix**_
* Use algebra to create the token/token matrix.  To create a _**token co-occurrence matrix**_ , we simply multiply the transpose of the document/token matrix by the document/token matrix
    * shape of X: (#doc, #tokens)   
    * shape of X transpose: (#tokens, #doc)   
    * shape of X transpose * X : (#tokens, #doc) * (#doc, #tokens) = (#tokens, #tokens)
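
The shape bookkeeping can be verified on a hand-made document/token matrix. The counts and the column labels in the comment are invented for illustration, and a dense numpy array is used here in place of the sparse matrix returned by the sklearn vectorizers.

```python
import numpy as np

# Toy document/token count matrix: 2 documents, 3 tokens
# columns: airport, bus, security
X = np.array([[2, 1, 0],
              [1, 0, 3]])

# (#tokens, #doc) @ (#doc, #tokens) = (#tokens, #tokens)
C = X.T @ X

print(C.shape)
print(C)
```

Each off-diagonal cell *(i,j)* sums, over documents, the product of the counts of tokens *i* and *j*, so it is non-zero only when both tokens appear in the same document. Here "airport" and "bus" share the first document, while "bus" and "security" never occur together.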


**Exercise 9:** Convert the Wikipedia files into a list of strings and preprocess each string

* Convert the Text column of the dataframe created in Exercise 1 into a list of strings
* Define a preprocessing function which takes a list of strings as input and, for each string, tokenizes it, lowercases the tokens, keeps only the tokens made of letters (use the isalpha() method) and converts the list of cleaned tokens back into a string (use "join"). The function returns the list of preprocessed strings

In [17]:
# One string per document, in dataframe order
str_list = wiki['text'].tolist()

In [25]:
def preprocess(s):
    text_prep = []
    for text in s:
        word = word_tokenize(text)
        word = [l.lower() for l in word if l.isalpha()]
        text_prep.append(" ".join(word))
    return text_prep

In [27]:
doc = preprocess(str_list)

**Exercise 10:** Creating a document / token matrix

* Use sklearn's [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class, which transforms a list of documents (strings) into a document/token matrix where each cell indicates the frequency of a token in a document
* Use the stop_words option to remove stop words

In [47]:
from sklearn.feature_extraction.text import CountVectorizer
count_model = CountVectorizer(ngram_range=(1,1), stop_words = 'english')
X = count_model.fit_transform(doc)

**Exercise 11:** Print out the vocabulary, i.e., the tokens contained in the document/token matrix

In [None]:
count_model.vocabulary_

**Exercise 12:** Create the co-occurrence matrix.

To create a token co-occurrence matrix, we simply multiply the transpose of the document/token matrix by the document/token matrix

* shape of X: (#doc, #tokens)
* shape of X transpose: (#tokens, #doc)
* shape of X transpose * X : (#tokens, #doc) * (#doc, #tokens) = (#tokens, #tokens)

In [50]:
X_term = X.T @ X  # (#tokens, #doc) @ (#doc, #tokens) = (#tokens, #tokens)

**Exercise 13:** Use the function given in Exercise 7 to measure the similarity of airport and news, airport and international, john and peter

* You'll need to modify the part of the function which retrieves the identifier of a word

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# CountVectorizer maps each token to its column index in X
vocabulary = count_model.vocabulary_

def similarity(embedding, word1, word2):
    # Unknown words (e.g. removed stop words) return None
    if word1 in vocabulary and word2 in vocabulary:
        v1 = embedding[vocabulary[word1]].reshape(1, -1)
        v2 = embedding[vocabulary[word2]].reshape(1, -1)
        return cosine_similarity(v1, v2)[0][0]

print('cosine(airport, news) =', similarity(X_term, 'airport', 'news'))
print('cosine(airport, international) =', similarity(X_term, 'airport', 'international'))
print('cosine(john, peter) =', similarity(X_term, 'john', 'peter'))