# Similarity: Vector Semantics

This project measures the similarity between words using a corpus that provides a context for each word. From that similarity we can infer the semantic meaning of words and the relationships between them.

*The meaning of a word is its use in the language* -  Wittgenstein, 1953  
The use of a word in the language can be characterized by counting the occurrences of other words around it.

The project uses three techniques:

- TF-IDF: a weighted frequency of words appearing together
- PPMI: a measure of how much more often two words co-occur than expected by chance
- Word2Vec: a shallow neural network trained to predict words from the context of other words

In [1]:

from nltk.corpus import PlaintextCorpusReader
from nltk.lm import Vocabulary
import numpy as np
import os, sys
# from gensim.models import Word2Vec

# The corpus is expected in ./corpora, relative to the current working directory
corpus = PlaintextCorpusReader(os.path.join(os.getcwd(), 'corpora'), r'p2_sherlock\.txt').words()
vocab = Vocabulary(corpus, unk_cutoff=1)
punct = ["\"", "'", ",", ".", ":", "?", "!", ".\"", "?\"", "!\"", "-", ",\"", "--"]

# Preprocessing: drop punctuation tokens and lowercase everything else
words = [token.lower() for token in corpus if token not in punct]

### TF-IDF techniques

$C(w_i, c_j)$:
- The number of times context word $c_j$ occurs in the local contexts of target word $w_i$
- Forms the co-occurrence matrix

$tf_{i,j} = \log_{10} (C(w_i, c_j) + 1)$:
- Term frequency
- Smoothed co-occurrence frequency on a log scale

$idf_j = \log_{10} \frac{N}{df_j}$:
- $df_j$: the number of context windows, over all target words, in which context word $c_j$ occurs
- $N$: the total number of context windows over all target words
- Words like "the", "a", or "it" have a high document frequency and therefore a low inverse document frequency

TF-IDF: $w_{i,j} = tf_{i,j} \times idf_j$
- Weighted frequency
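
As a small worked example of the formulas above, the sketch below computes the TF-IDF weight of two context words from hand-picked counts. The values of `N`, `cooc` and `df` are made up for illustration and are not taken from the Sherlock corpus; `tf_idf_weight` is a hypothetical helper name.

```python
import math

# Hypothetical counts, chosen only to illustrate the formulas
N = 1000                        # total number of context windows
cooc = {"hall": 9, "the": 99}   # C(w_i, c_j) for one fixed target word w_i
df = {"hall": 10, "the": 500}   # number of windows containing each context word

def tf_idf_weight(count, doc_freq, n_windows):
    tf = math.log10(count + 1)               # smoothed, log-scaled term frequency
    idf = math.log10(n_windows / doc_freq)   # inverse document frequency
    return tf * idf

weights = {c: tf_idf_weight(cooc[c], df[c], N) for c in cooc}
print(weights)
```

With these numbers the rare word "hall" gets a weight of about 2.0 while the frequent word "the" gets about 0.6, even though "the" co-occurs eleven times more often with the target.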

In [2]:
# Compute C(w_i, c_j) as a dictionary of dictionaries
# Context window of 5 words (2 before and 2 after the target). nltk.ngrams(words, 5) could also be used to list the 5-grams
# Accessing the contextual windows
dictionary = {}
# Compute df_j & N
df = {}
N = 0
for i in range(len(words)):
    if words[i] not in dictionary:
        dictionary[words[i]] = {}
    N += 1  # one context window per target token
    for j in [i-2, i-1, i+1, i+2]:
        if 0 <= j < len(words):  # skip positions outside the corpus
            dictionary[words[i]][words[j]] = dictionary[words[i]].get(words[j], 0)+1
            df[words[j]] = df.get(words[j], 0)+1

In [9]:
# TF-IDF weights of one target word: i is fixed to the target, j ranges over its context words
def tf_idf_one(target):
    closest = {}
    for j in dictionary[target]:
        tf_ij = np.log10( dictionary[target][j] + 1 )
        idf_j = np.log10( N / df[j] )
        w_ij = tf_ij*idf_j
        closest[j] =  w_ij
    return closest

# Compute tf-idf weighted matrix: w
tf_idf = {}
for i in set(words):
    tf_idf[i] = tf_idf_one(i)

tf_idf["baskerville"]

{'sir': 2.6499717475814424,
 'henry': 2.8043613034991757,
 'three': 0.8045531714532216,
 'broken': 1.4464836063347186,
 'threads': 1.0472375255259798,
 'hall': 3.542267416272485,
 'the': 0.8359804946545215,
 'charles': 2.7127583202582395,
 'whose': 1.5380362693763716,
 'sudden': 0.9857912880321348,
 'was': 1.2143324830523035,
 'written': 1.0032486035793737,
 'and': 1.2348030417274845,
 'in': 1.0291797662182496,
 'family': 1.2451032785465193,
 'but': 1.100649466082976,
 'from': 0.8198972231791053,
 'hugo': 2.513066165257388,
 'as': 1.325302482073515,
 'manor': 1.2576484629005042,
 'of': 1.1147394838432119,
 'held': 0.8596207943399203,
 'near': 1.4062523195504442,
 'estate': 0.9703928444215979,
 'by': 0.8843610913923687,
 'when': 0.964969018249066,
 'he': 0.8489006436580457,
 'for': 1.2088857214534718,
 'passed': 0.8087038586721457,
 'me': 0.7942161180589478,
 'lying': 1.0472375255259798,
 'on': 0.8602827422951178,
 'which': 0.743203050701596,
 'to': 1.1322674532909247,
 'his': 0.9350203

Cosine similarity between two words:

$\cos(v, w) = \frac{\sum_{i=1}^d v_i w_i}{\sqrt{\sum_{i=1}^d v_i^2} \sqrt{\sum_{i=1}^d w_i^2}}$
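
Because each row of the TF-IDF matrix is stored as a sparse dictionary, the dot product only needs the keys that the two vectors share. The following is a minimal sketch on two hand-made vectors; the values are illustrative and the function name `cosine` is a hypothetical stand-in, not the notebook's own implementation.

```python
import math

def cosine(v, w):
    """Cosine similarity between two sparse vectors stored as dicts."""
    # Only keys present in both vectors contribute to the dot product
    dot = sum(value * w[key] for key, value in v.items() if key in w)
    norm_v = math.sqrt(sum(value ** 2 for value in v.values()))
    norm_w = math.sqrt(sum(value ** 2 for value in w.values()))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_v * norm_w)

v = {"sir": 3.0, "hall": 4.0}
w = {"hall": 4.0, "manor": 3.0}
print(cosine(v, v))
print(cosine(v, w))
```

A vector compared with itself gives 1, and the two vectors above, which share only "hall", give 16 / 25 = 0.64.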

In [13]:
def cosine_similarity(word_1, word_2):
    top_sum = 0
    bot_left_sum = 0
    bot_right_sum = 0
    for word, value in tf_idf[word_1].items():
        if word in tf_idf[word_2].keys():
            top_sum += value*tf_idf[word_2][word]
        bot_left_sum += value**2
    for value_2 in tf_idf[word_2].values():
        bot_right_sum += value_2**2
    
    bot_left_sum = np.sqrt(bot_left_sum)
    bot_right_sum = np.sqrt(bot_right_sum)
    return top_sum / (bot_left_sum * bot_right_sum)

# Give the 5 closest words to the target,
# i.e. the words used in the most similar contexts
def closest_words(target):
    closest = []
    for word in set(words):
        if word != target:  # skip the target itself (similarity 1)
            closest.append( (word, cosine_similarity(target, word)) )
    closest = sorted(closest, key=lambda entry: -entry[1])

    return closest[:5]

In [14]:
for example in ["above", "maid", "distance", "the"]:
    print(f"Closest words to {example}:")
    for close_word in closest_words(example):
        print(f"- {close_word[0]}: {round(close_word[1], 3)}")

Closest words to above:
- beams: 0.329
- noticed: 0.253
- velvet: 0.218
- enables: 0.212
- boars: 0.212
Closest words to maid:
- scion: 0.277
- caretaker: 0.233
- existence: 0.229
- sketch: 0.227
- desmonds: 0.217
Closest words to distance:
- grazier: 0.401
- trench: 0.346
- miles: 0.34
- ordeal: 0.315
- crevice: 0.309
Closest words to the:
- of: 0.64
- and: 0.581
- in: 0.53
- a: 0.521
- that: 0.499


### PPMI techniques

**Positive Pointwise Mutual Information** (PPMI) is built on pointwise mutual information (PMI), which measures how much more often two words co-occur than expected by chance.

For a word *w* and a context word *c*,

$PMI(w,c) = \log_2 \frac{P(w,c)}{P(w) \cdot P(c)}$

Since the true word probabilities are unknown (the corpus is finite), we estimate them from smoothed counts and clip negative values to obtain PPMI:

$P(w_i, c_j) \approx \frac{C(w_i, c_j) + \epsilon}{\sum_{i=1}^V \sum_{j=1}^C [C(w_i, c_j) + \epsilon]}$

$P(w_i) = \sum_{j=1}^C P(w_i, c_j)$

$P(c_j) = \sum_{i=1}^V P(w_i, c_j)$

$PPMI(w_i,c_j) = \max( \log_2 \frac{P(w_i,c_j)}{P(w_i) \cdot P(c_j)}, 0 )$

With:
- $C(w_i,c_j)$ the same as before
- Vocabulary $V$ the set of words
- Set of context words $C$ (possibly $C = V$)
- Smoothing hyper-parameter $\epsilon \approx \frac{1}{|V|}$ (e.g. $\epsilon = 10^{-4}$)
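
The estimation above can be sketched as follows. This is only a minimal illustration under stated assumptions: it takes $C = V$, uses a dense NumPy matrix (fine for a toy vocabulary, likely too large for the full corpus), and runs on a made-up three-word count matrix. `ppmi_matrix` is a hypothetical helper name.

```python
import numpy as np

def ppmi_matrix(counts, epsilon=1e-4):
    """PPMI from a |V| x |C| co-occurrence count matrix with additive smoothing."""
    smoothed = counts + epsilon
    p_wc = smoothed / smoothed.sum()         # joint P(w_i, c_j)
    p_w = p_wc.sum(axis=1, keepdims=True)    # marginal P(w_i), shape (|V|, 1)
    p_c = p_wc.sum(axis=0, keepdims=True)    # marginal P(c_j), shape (1, |C|)
    pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)              # clip negative PMI values to 0

# Toy counts (illustrative only): rows are target words, columns are context words
counts = np.array([[0.0, 8.0, 1.0],
                   [8.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
ppmi = ppmi_matrix(counts)
print(np.round(ppmi, 3))
```

Pairs that never co-occur end up with a PPMI of 0, because the smoothed probability is far below what chance would predict. The rows of such a matrix could then be compared with the same cosine similarity used for TF-IDF.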

In [None]:
# 5 closest words to "above" with an epsilon of 0.0001

# 5 closest words to "maid" same epsilon

# 5 closest words to "distance" same epsilon

### Word2Vec techniques

Word2Vec represents each word as a dense vector. Similarity between these vectors reflects semantic relationships between words, which makes the model useful for NLP tasks such as word similarity, language translation, and document clustering.

The model is typically trained as a shallow neural network that predicts words from the context of other words, and the resulting word vectors capture semantic meaning and relationships in a computationally efficient way.
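
To make the prediction task concrete, the sketch below lists the (target, context) pairs that a skip-gram model would be trained on for a window of 2. It only illustrates how such pairs are formed and is not gensim's actual training code (gensim additionally samples the effective window size and applies negative sampling); `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs within `window` positions of each token."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "hound", "of", "the", "baskervilles"]
pairs = skipgram_pairs(tokens)
print(pairs[:4])
```

Each pair is one training example: given the target word, the network learns to score the observed context word higher than randomly sampled ones.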

In [29]:
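# 2.3 Vector Semantics (Word2Vec)
# above - maid - distance
from gensim.models import Word2Vec

# The corpus is expected in ./corpora, relative to the current working directory
corpus = PlaintextCorpusReader(os.path.join(os.getcwd(), 'corpora'), r'p2_sherlock\.txt').words()
vocab = Vocabulary(corpus, unk_cutoff=1)
punct = ["\"", "'", ",", ".", ":", "?", "!", ".\"", "?\"", "!\"", "-", ",\"", "--"]

# Preprocessing: drop punctuation tokens and lowercase everything else
words = [token.lower() for token in corpus if token not in punct]

# Skip-gram (sg=1) with 10 negative samples and a context window of 2.
# Note: the whole corpus is passed as a single sentence. According to gensim's documentation,
# its optimized routines truncate texts longer than 10000 words, so splitting into sentences may be preferable.
model = Word2Vec(sentences=[words], vector_size=100, window=2, min_count=2, sg=1, negative=10, epochs=300)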
most_sim_1 = model.wv.most_similar('above', topn=5)
most_sim_2 = model.wv.most_similar('maid', topn=5)
most_sim_3 = model.wv.most_similar('distance', topn=5)
print(most_sim_1)
print(most_sim_2)
print(most_sim_3)

# import gensim.downloader
# glove_vectors = gensim.downloader.load('word2vec-google-news-300')
# most_sim = glove_vectors.most_similar("investigation", topn=5)
# print(most_sim)
# [('investigations', 0.833749532699585), ('probe', 0.7943025827407837), ('inquiry', 0.7801670432090759), ('investgation', 0.6887422204017639), ('investigaton', 0.6771849989891052)]

[('shone', 0.8492485284805298), ('moon', 0.7147756218910217), ('swung', 0.7015173435211182), ('escape', 0.6910266876220703), ('jaws', 0.6594571471214294)]
[('oldmore', 0.6325182318687439), ('swung', 0.6313623785972595), ('lodge', 0.6225625276565552), ('offices', 0.6021811962127686), ('hounds', 0.5777137875556946)]
[('divided', 0.6811988353729248), ('pace', 0.596568763256073), ('later', 0.5763490200042725), ('top', 0.5754624009132385), ('bit', 0.5565328001976013)]
