# Exercise "Lecture 11: Lexical Semantics"


In this set of exercises, we will convert words to vectors representing their distributional properties. 

In the first part, you will use Gensim and sklearn predefined methods to build an SVD word context matrix from the Wikipedia corpus used in the preceding two lectures and  use cosine to compare the similarity between the vectors representing   words. 

In the second part, you will build a word cooccurrences matrix from the Wikipedia corpus. 

The exercises cover the following points:


* Creating word coocurrence matrices
* Applying SVD decomposition
* Finding neighbours
* Converting a corpus to a list of integers where each integer represent a token (this is needed for efficient computation)
* Computing a word frequency distribution 

## Store a set of files into a Pandas data frame

**Exercise 1:** Store all files in 'data/wkp/' into a pandas dataframe with column Text where each row contains the content of one file

* use os.scandir to list the files in the directory   
    **11_CS_lexical_semantics-1** 
* read each file into a list of strings (one string per file)  
    **11_CS_python-2**
* store the list of strings into a pandas dataframe with header 'Text'   
    **08, pandas_cheat_sheet**



In [1]:
import pandas as pd
from sklearn.datasets import load_files
from os import scandir

DIR = "Comics_characters/"

# Load the data from the directory DIR using os.scandir
data = []
for entry in scandir(DIR):
    if entry.is_file() and entry.name.endswith(".txt"):
        with open(entry.path, "r") as f:
            data.append(f.read())

df = pd.DataFrame(data, columns = ["text"])
df.head()

Unnamed: 0,text
0,Al MacKenzie is a fictional character appearin...
1,Cannonball (Samuel Zachary Guthrie) is a ficti...
2,A list of the Famous Studios theatrical cartoo...
3,Donald Pierce is a fictional supervillain appe...
4,"Elongated Man (Randolph ""Ralph"" Dibny) is a fi..."


In [2]:
df.shape

(10, 1)

# PART 1

##  Creating a word cooccurence matrix using vectorizers


* Use sklearn vectorizer methods (CountVectorizer, TfidfVectorizer) to convert the corpus (a list of documents) to a _**document/token matrix**_
* Use algebra to create the token/token matrix.  To create a _**token co-occurence matrix**_ , we simply multiply the transpose of the documents/tokens matrix by the documents/token matrix
    * shape of X: (#doc, #tokens)   
    * shape of X transpose: (#tokens, #doc)   
    * shape of X transpose * X : (#tokens, #doc) * (#doc, #tokens) = (#tokens, #tokens)


**Exercise 2:** 
* Convert the Text column of the dataframe created in Exercise 1 into a list of strings   
   **Pandas CS-3**

In [3]:
text = list(df["text"])
len(text)

10

**Exercice 3:** Creating a document / token matrix

* Use sklearn [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) method which transforms a list of documents (strings) into a a document/token matrix where each cell indicates the frequency of a token in a document
* Use the stop_words option to remove stop words

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(text)

In [7]:
print(X.todense())
print(X.shape)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 2]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [1 0 0 ... 0 0 0]]
(10, 3729)


**Exercice 4:** Print out the word distribution i.e.,  the tokens contained in the document/token matrix and their frequency (use the vocabulary_ attribute of CountVectorizer module, cf. CS)

Your output should look like this:

{'donald': 1086, 'pierce': 2502, 'fictional': 1361, 'supervillain': 3261, 'appearing': 280, 'american': 242, 'comic': 720, 'books': 487, 'published': 2650, 'marvel': 2153, 'comics': 722, 'character': 631, 'depicted': 980, 'cyborg': 895, 'commonly': 731, 'enemy': 1205, 'men': 2193, 'portrayed': 2540, 'boyd': 501, 'holbrook': 1659, '2017': 64, 'film': 1371, 'logan': 2052, 'publication': 2649, 'history': 1655, 'appeared': 279, 'uncanny': 3497, '132': 14, 'april': 287, '1980': 32, 'created': 849, 'chris': 659, 'claremont': 677, 'john': 1875, 'byrne': 549, 'appearance': 277, 'modeled': 2247, 'sutherland': 3287, 'comes': 718, 'benjamin': 432, 'franklin': 1435, 'hawkeye': 1602, '1970': 28, 'biography': 448, 'born': 492, 'philadelphia': 2487, 'pennsylvania': 2466, 'appears': 281, 'high': 1647, 'ranking': 2687, 'member': 2187, 'inner': 1797, 'circle': 665, 'hellfire': 1625, 'club': 693, 'holds': 1663, 'position': 2543, 'white': 3656, 'bishop': 451, 'fact': 1318, 'genocidal': 1488, 'mutant': 2289, 'hater': 1595, 'joined': 1877, 'order': 2405, 'kill': 1915, 'members': 2188, 'mutants': 2290, 'addition': 160, 'hating': 1597, 'bigoted': 443, 'certain': 621, 'nationalities': 2312, 'harbors': 1584, 'sense': 2990, 'self': 2986, 'loathing': 2044, 'status': 3177, 'referring': 2739, 'half': 1571, 'man': 2124, 'ceo': 617, 'principal': 2595, 'shareholder': 3029, 'consolidated': 776, 'mining': 2225, 'operates': 2392, 'laboratory': 1946, 'complex': 742, 'cameron': 567, 'kentucky': 1906, 'mercenaries': 2198, 'kidnap': ....

In [8]:
voc = vectorizer.vocabulary_
voc

{'al': 205,
 'mackenzie': 2095,
 'fictional': 1362,
 'character': 631,
 'appearing': 280,
 'american': 242,
 'comic': 720,
 'books': 487,
 'published': 2652,
 'marvel': 2155,
 'comics': 722,
 'alphonso': 228,
 'mack': 2094,
 'appeared': 279,
 'cinematic': 664,
 'universe': 3526,
 'tv': 3476,
 'series': 3004,
 'agents': 193,
 'portrayed': 2542,
 'henry': 1636,
 'simmons': 3064,
 'eventually': 1260,
 'new': 2333,
 'director': 1045,
 'publication': 2651,
 'history': 1656,
 'nick': 2338,
 'fury': 1457,
 'vs': 3619,
 'aug': 363,
 '1988': 39,
 'created': 849,
 'bob': 474,
 'harras': 1590,
 'paul': 2461,
 'neary': 2320,
 'subsequently': 3233,
 'appears': 281,
 'agent': 192,
 'sept': 3000,
 '1989': 40,
 'jan': 1857,
 '1990': 41,
 'entry': 1233,
 'issue': 1850,
 'reference': 2738,
 'official': 2379,
 'handbook': 1576,
 'update': 3542,
 '89': 115,
 'biography': 448,
 'born': 492,
 'austin': 366,
 'texas': 3358,
 'liaison': 2014,
 'romantically': 2893,
 'involved': 1840,
 'contessa': 790,
 'valen

In [17]:
print(sorted(voc.items(), key=lambda x: x[1], reverse = True)[:20])

[('zombies', 3728), ('zeb', 3727), ('zauriel', 3726), ('zatara', 3725), ('zatanna', 3724), ('zander', 3723), ('zachary', 3722), ('yucatan', 3721), ('yr', 3720), ('younger', 3719), ('young', 3718), ('yorkes', 3717), ('york', 3716), ('yo', 3715), ('yeti', 3714), ('yellow', 3713), ('years', 3712), ('year', 3711), ('yakuza', 3710), ('xse', 3709)]


**Exercise 5:** Create the co-occurence matrix and apply svd decomposition to it.

To create a token co-occurence matrix, we simply multiply the transpose of the documents/tokens matrix by the documents/token matrix

* shape of X: (#doc, #tokens)
* shape of X transpose: (#tokens, #doc)
* shape of X transpose * X : (#tokens, #doc) * (#doc, #tokens) = (#tokens, #tokens)


The resulting  matrix A is of size (vocab_length, vocab_length)

* Use numpy [svd](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) method from the linalg module to apply SVD decomposition to A (A = U * s * V)
  - use the todense() method to create a matrix in dense format
  - when calling fnp.linalg.svd use the null_matrices = False option to ensures that reduced SVD is applied (rather than full SVD)



In [18]:
# transpose X
Xt = X.T

# multiply X with its transpose
A = Xt.dot(X)

# check the shapes
print(X.shape)
print(Xt.shape)
print(A.shape)

(10, 3729)
(3729, 10)
(3729, 3729)


In [20]:
# set the diagonal to zero
A.setdiag(0)
print(A.todense())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 2]
 [0 0 0 ... 0 2 0]]


In [32]:
import numpy as np
from numpy.linalg import svd

# apply non-reduced SVD decomposition to A
# A = U * s * Vt
U, s, Vt = svd(A.todense(), full_matrices=True) 

# check the shapes
print("Non reduced SVD:")
print(f"U: {U.shape}")
print(f"s: {s.shape}")
print(f"Vt: {Vt.shape}")

# apply reduced SVD decomposition to A
# A = U * s * Vt
U, s, Vt = svd(A.todense(), full_matrices=False)

# check the shapes
print("Reduced SVD:")
print(f"U: {U.shape}")
print(f"s: {s.shape}")
print(f"Vt: {Vt.shape}")

Non reduced SVD:
U: (3729, 3729)
s: (3729,)
Vt: (3729, 3729)
Reduced SVD:
U: (3729, 3729)
s: (3729,)
Vt: (3729, 3729)


In [33]:
# Keep the first 10 singular values
s10 = np.array([s  if i < 10 else 0 for i,s in enumerate(s)])

# Reconstruct the matrix A with the first 10 singular values
A10 = U.dot(np.diag(s10)).dot(Vt.T)

# check the shape
print(A10.shape)

# Print A and A10 as a DataFrame
pd.DataFrame(A.todense(), index=voc, columns=voc)


(3729, 3729)


Unnamed: 0,al,mackenzie,fictional,character,appearing,american,comic,books,published,marvel,...,scar,neck,conceptual,colorist,christina,journal,eyes,gael,garcã,bernal
al,0,0,0,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
mackenzie,0,0,0,1,1,0,0,0,2,0,...,0,0,1,0,0,2,1,1,0,0
fictional,0,0,0,0,0,1,0,1,0,0,...,0,20,0,0,0,0,0,0,0,0
character,0,1,0,0,1,0,0,0,2,0,...,0,0,1,0,0,2,1,1,0,0
appearing,0,1,0,1,0,0,0,0,2,0,...,0,0,1,0,0,2,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
journal,0,2,0,2,2,0,0,0,4,0,...,0,0,2,0,0,0,2,2,0,0
eyes,0,1,0,1,1,0,0,0,2,0,...,0,0,1,0,0,2,0,1,0,0
gael,0,1,0,1,1,0,0,0,2,0,...,0,0,1,0,0,2,1,0,0,0
garcã,0,0,0,0,0,0,1,0,1,1,...,1,0,0,1,0,0,0,0,0,2


In [34]:
pd.DataFrame(A10, index=voc, columns=voc)

Unnamed: 0,al,mackenzie,fictional,character,appearing,american,comic,books,published,marvel,...,scar,neck,conceptual,colorist,christina,journal,eyes,gael,garcã,bernal
al,-0.130410,-0.522481,0.053226,0.236261,-0.214353,-0.044951,0.176006,0.046005,0.099147,-0.093732,...,-11.160740,23.084403,10.808825,2.362791,-19.457889,-2.398453,3.976531,-12.591957,40.633160,3.379310
mackenzie,-0.217422,1.168408,0.211535,0.952388,-0.085639,-0.084423,0.305384,0.004226,0.038886,0.270177,...,7.136572,-5.646797,-1.281016,-0.248733,-15.388486,-7.523683,11.056087,-0.592459,8.165807,59.781745
fictional,0.170095,-0.785521,-0.191115,-0.060905,-0.131229,-0.009147,0.119905,-0.018667,0.056802,-0.187664,...,4.915930,-8.543045,-8.060225,9.066797,-28.651063,23.531821,-40.373609,-2.624257,-1.871770,17.479158
character,-0.217422,1.168408,0.211535,0.952388,-0.085639,-0.084423,0.305384,0.004226,0.038886,0.270177,...,7.136572,-5.646797,-1.281016,-0.248733,-15.388486,-7.523683,11.056087,-0.592459,8.165807,59.781745
appearing,-0.217422,1.168408,0.211535,0.952388,-0.085639,-0.084423,0.305384,0.004226,0.038886,0.270177,...,7.136572,-5.646797,-1.281016,-0.248733,-15.388486,-7.523683,11.056087,-0.592459,8.165807,59.781745
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
journal,-0.434623,2.336116,0.422866,1.904281,-0.171248,-0.168803,0.610651,0.008425,0.077753,0.540167,...,14.270305,-11.287148,-2.561927,-0.497480,-30.777282,-15.034466,22.106107,-1.182442,16.325255,119.529559
eyes,-0.217422,1.168408,0.211535,0.952388,-0.085639,-0.084423,0.305384,0.004226,0.038886,0.270177,...,7.136572,-5.646797,-1.281016,-0.248733,-15.388486,-7.523683,11.056087,-0.592459,8.165807,59.781745
gael,-0.217422,1.168408,0.211535,0.952388,-0.085639,-0.084423,0.305384,0.004226,0.038886,0.270177,...,7.136572,-5.646797,-1.281016,-0.248733,-15.388486,-7.523683,11.056087,-0.592459,8.165807,59.781745
garcã,0.232188,-0.543298,-0.335489,1.335355,-0.836530,-0.174042,0.906149,-0.070226,0.238415,-0.134740,...,0.607258,-6.496423,3.536097,1.318781,-104.036011,-8.383799,-1.729620,-5.878324,29.075718,-38.325303


**Exercise 6 (PROVIDED):** Define a function which returns the similarity between 2 tokens

In [12]:
from sklearn.metrics.pairwise import cosine_similarity
vocab = vectorizer.get_feature_names_out()
token2int = vectorizer.vocabulary_

In [13]:
def similarity(embeddings, word1, word2):
  if word1 in vocab and word2 in vocab:
    v1 = embeddings[token2int[word1]].reshape(1, -1)  
    v2 = embeddings[token2int[word2]].reshape(1, -1)
    return cosine_similarity(v1, v2)[0][0]


**Exercise 7:** Use the function given in Exercise 6 to measure the similarity between
- escape and fictional
- canada and handbook
- escape and captivity

You might need to modify that part of the function which defines vocab and token2int to make it compatible with the way you named your vectorizer model

In [34]:
print(similarity(A, "escape", "fictional"))
print(similarity(A, "canada", "handbook"))
print(similarity(A, "escape", "captivity"))

0.9362681251764143
0.5908997499020893
0.47912100855108686


# Part 2

**Exercise 8:** Clean the corpus 

* Store the content of the 'Text" column into a string   
    **Pandas CS, "Extracting all text from a colum"**
* Tokenize the string into words   
    **NLTK CS**
* Print out the first and the last 10 words of your list of tokens. Do you see tokens that may not be useful for learning word representations ?
* Lower case all tokens and remove all tokens that contains characters that are not letters (OPTIONAL)   
    **NLTK CS**

In [73]:
all_text = str.join(" ", text)

In [80]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download()

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is a test sentence", language="english")
tokens = [w for w in tokens if not w in stop_words]

# Print first and last 10 tokens
print(tokens[:10])
print(tokens[-10:])

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


### Create a frequency distribution (OPTIONAL)

**Exercise 9:** Create a frequency distribution from the list of tokens created in the previous exercise. Print out the 10 most and least frequent tokens.   
   **lexical semantics and stats_and_visu CS**
* Create a frequency distribution by iterating over the list of token while incrementing each token frequency accordingly
* Sort the tokens by decreasing frequency into a list
* Print out the first and last 10 tokens of this list (which tokens are most and least frequent ?)

### Convert the corpus to a list of integers

**Exercise 10:** Convert the list of tokens created in Exercise 2 (corpus cleaning) into a list of integers

* Create a dictionary mapping each token to a distinct integer (Cf. Lexical Semantics CS)
* Use this dictionary to convert each token from your cleaned corpus (Exercise 2) into an integer.

### Create a dictionary of co-occurences

**Exercise 11:** In the previous exercise, you created a list of integers where each integer is the identifier for the corresponding token in your cleaned up corpus. Iterate over that list and for each "integer token" *i*:

* get the neihbours of *i* within a window of size 5 (only looking at the right side of *i*)
* store these neihbours in a dictionary of coocurrences of the form {(i,j):f,} where *(i,j)* are neighbours and *f* is the frequency of the co-occurence 
* Sort co-occurences using integer order i.e., if the neighbour *n* is represented by an identifier smaller than *i*, store the co-occurence as *(n,i)*, otherwise as *(i,n)* .


### Compute the SVD decomposition of the Co-occurence Matrix

**Exercise 12:** Compute the  SVD decomposition of the word co-occurence matrix you just created

* Create a matrix A of size (vocab_length, vocab_length)
* Fill each cell *(i,j)* in this matrix with the frequency the co-occurrence between *i* and *j*
* Use numpy [svd](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) method from the linalg module
* full_matrices = False ensures that reduced SVD is applied (rather than full SVD)

A = U * s * V

**Exercise 13 (PROVIDED):** Define a function which outputs the neighbours of a word

In [None]:
reverse_vocab = {j: i for i, j in token2int.items()}

def most_similar(embedding, word, n=10):
  if word in vocab:
    v = embedding[token2int[word]].reshape(1, -1)
    scores = cosine_similarity(v, embedding).reshape(-1)
    result = []

    # argsort gives n-best scores
    for i in reversed(scores.argsort()[-n:]):
      result.append((reverse_vocab[i], scores[i]))
    return result

print(most_similar(U, 'fictional'))

In [None]:
print(most_similar(U, 'handbook'))