# Lab 6 - word-2-vec with pytorch and gensim

 "A word is characterized by the company it keeps" - Firth (1957)
 

## Execise 0 (0pt)


To do the following exercises you will need certain python packages. This
first exercise is about installing them. You will need `sklearn`, `nltk`, `numpy`,
`gensim`. Please make sure you have installed them (by your distribution’s
package manager, pip, anaconda, . . . ) and check your installation by trying
to import them:

In [1]:
# %pip install sklearn
import sklearn
import nltk
import numpy
# %pip install gensim
import gensim

## Exercise 1.1 (0pt)

In `wordspace.py` you find some convenience functions to extract a word
cooccurrence matrix from text. Run the following script and evaluate the
embeddings by looking at the nearest neighbors of some words.

In [2]:
from wordspace import cooccurrence_matrix ,nearest_neighbor_loop

with open('brown.txt', 'r') as f:
    brown = f.read()

matrix , vocabulary = cooccurrence_matrix(brown)
vocabulary
# len(vocabulary.keys())



{'the': 0,
 ',': 1,
 '.': 2,
 'of': 3,
 'and': 4,
 'to': 5,
 'a': 6,
 'in': 7,
 'that': 8,
 'is': 9,
 'was': 10,
 "''": 11,
 'for': 12,
 '``': 13,
 'with': 14,
 'The': 15,
 'it': 16,
 'he': 17,
 'as': 18,
 'his': 19,
 'on': 20,
 'be': 21,
 'I': 22,
 "'s": 23,
 '&': 24,
 'had': 25,
 'by': 26,
 'at': 27,
 'not': 28,
 'are': 29,
 'from': 30,
 'or': 31,
 'this': 32,
 'have': 33,
 'an': 34,
 'which': 35,
 '*': 36,
 'were': 37,
 '<': 38,
 '>': 39,
 'but': 40,
 'He': 41,
 'you': 42,
 'one': 43,
 'her': 44,
 'they': 45,
 'would': 46,
 ';': 47,
 'all': 48,
 '#': 49,
 'their': 50,
 'him': 51,
 'been': 52,
 'has': 53,
 ')': 54,
 '(': 55,
 '?': 56,
 'who': 57,
 'will': 58,
 'It': 59,
 'more': 60,
 "n't": 61,
 'she': 62,
 'we': 63,
 'out': 64,
 'can': 65,
 'said': 66,
 'there': 67,
 'up': 68,
 'than': 69,
 'its': 70,
 'into': 71,
 'no': 72,
 'them': 73,
 'about': 74,
 'so': 75,
 'could': 76,
 'when': 77,
 'In': 78,
 ':': 79,
 'only': 80,
 'other': 81,
 'do': 82,
 'time': 83,
 'if': 84,
 'what': 85,

In [3]:
import numpy as np
nearest_neighbor_loop(np.asarray(matrix) , vocabulary)

Goodbye.


In [4]:
del matrix
del vocabulary

## Exercise 1.2 (1pt)

One simple way to improve a basic counting model is transforming the word
counts by, e.g., applying the square root afterwards.
Modify the script from exercise 1.1 by using `numpy.sqrt` to do so.

In [5]:
from wordspace import  nearest_neighbor_loop, cooccurrence_matrix
import numpy as np

with open('brown.txt', 'r') as f:
    brown = f.read()

matrix , vocabulary = cooccurrence_matrix(brown)
matrix = np.sqrt(matrix)
vocabulary

nearest_neighbor_loop(np.asarray(matrix) , vocabulary)

Goodbye.


## Exercise 1.3 (1pt)

Next let us examine the parameters of the function `cooccurrence_matrix`.
You can modify the `window_size` and/or try a different vectorizer than
the standard `CountVectorizer` to compute the cooccurrence scores. Try
`sklearn.feature_extraction.text.TfidfVectorizer`!

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


matrix , vocabulary = cooccurrence_matrix(
    brown , window_size=5, max_vocab_size=20000,
    same_word_zero=False , vectorizer=CountVectorizer
)
# matrix = np.sqrt(matrix)
# vocabulary
nearest_neighbor_loop(np.asarray(matrix) , vocabulary)

Goodbye.


In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

matrix , vocabulary = cooccurrence_matrix(
    brown , window_size=3, max_vocab_size=20000,
    same_word_zero=False , vectorizer=TfidfVectorizer
)
matrix = np.sqrt(matrix)
vocabulary
nearest_neighbor_loop(np.asarray(matrix) , vocabulary)

Goodbye.


# Singular Value Decomposition

## Exercise 2 (1pt)

With Singular Value Decomposition (SVD) you can reduce the dimensionality
of your embeddings. Try `sklearn.decomposition.TruncatedSVD` and
see how your embeddings change! Consider the following usage example:

In [8]:
from sklearn.decomposition import TruncatedSVD
import numpy as np

with open('brown.txt', 'r') as f:
    brown = f.read()
some_text = brown
try:
    C, V = cooccurrence_matrix(some_text)
except:
    print("Error")

svd = TruncatedSVD(
    n_components=100, algorithm="randomized",
    n_iter=5, random_state=42, tol=0.
)
new_C = svd.fit_transform(np.asarray(C))

nearest_neighbor_loop(np.asarray(new_C) , V)

Error


NameError: name 'C' is not defined

In [None]:
new_C.shape

(20000, 100)

In [None]:
C.shape

(20000, 20000)

In [None]:
C

matrix([[139945,      0,      0, ...,      0,      0,      0],
        [     0,      0,      0, ...,      0,      0,      0],
        [     0,      0,      0, ...,      0,      0,      0],
        ...,
        [     0,      0,      0, ...,      6,      0,      0],
        [     0,      0,      0, ...,      0,      6,      0],
        [     0,      0,      0, ...,      0,      0,      6]])

In [None]:
new_C

array([[ 1.40180987e+05, -9.78944357e+03, -1.85329998e+03, ...,
         2.17947103e+00, -9.83446186e-01, -1.43103094e+00],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       ...,
       [ 4.01801534e-02, -1.61822499e-02,  1.00006853e+00, ...,
        -1.86342415e-03, -1.63230732e-03, -1.46320769e-03],
       [ 1.13119133e-02, -2.68964775e-03,  2.06425080e-03, ...,
         1.35455199e-04,  4.46409073e-04, -1.55133255e-03],
       [ 4.69191491e-02, -1.80478421e-02,  9.99648854e-01, ...,
        -9.27589662e-04, -7.97547423e-04, -2.00336248e-03]])

In [None]:
del C
del new_C

- `n_components` - desired embedding dimension
- `algorithm` - SVD solver to use; either “arpack” or “randomized”
- `n_iter` - number of iterations for randomized SVD solver (not used by ARPACK)
- `random_state` - seed for pseudo-random number generator
- `tol` -  toleranze for ARPACK. Ignored by randomized SVD solver

# Word2Vec

## Exercise 3.1 (1pt)

Use the following code snippets to train your own word2vec model on the
brown corpus (or any other large text file you have). `semantic_tests.py`
contains some tests for your embeddings.

In [11]:
from semantic_tests import semantic_tests
from gensim.models.word2vec import Word2Vec
import nltk.data
from nltk.tokenize import word_tokenize
import logging
logging.basicConfig(
    format='%(asctime)s: %(levelname)s: %(message)s',
    level=logging.INFO
)

#nltk.download('punkt')

sent = nltk.data.load(
    'tokenizers/punkt/english.pickle'
)

with open('brown.txt', 'r') as f:
    sentences = sent.tokenize(f.read())
    sentences = map(lambda s: word_tokenize(s), sentences)

model = Word2Vec(
    sentences , vector_size=100, window=5,
    min_count=5, hs=0, negative=5,
    cbow_mean=1, epochs=15, workers=3
)

sims = model.wv.most_similar('among', topn=10)
print("Among")
print(sims)
sims = model.wv.most_similar('problems', topn=10)
sims = model.wv.most_similar('woman', topn=10)
print("Problems")
print(sims)

semantic_tests(model.wv)

print(model.wv.similarity("man", "woman"))
print(model.wv.similarity("problems", "solutions"))
model.wv.similarity("among", "scared")

# model.wv.most_similar(
#         positive=['among', 'among'], negative=['alone'], topn=3)

2024-05-23 08:09:41,269: INFO: collecting all words and their counts
2024-05-23 08:09:41,269: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-05-23 08:09:42,031: INFO: PROGRESS: at sentence #10000, processed 257804 words, keeping 25923 word types
2024-05-23 08:09:42,751: INFO: PROGRESS: at sentence #20000, processed 502508 words, keeping 38331 word types
2024-05-23 08:09:43,565: INFO: PROGRESS: at sentence #30000, processed 788265 words, keeping 47890 word types
2024-05-23 08:09:44,241: INFO: PROGRESS: at sentence #40000, processed 993450 words, keeping 53867 word types
2024-05-23 08:09:44,793: INFO: PROGRESS: at sentence #50000, processed 1156731 words, keeping 57747 word types
2024-05-23 08:09:44,883: INFO: collected 58661 word types from a corpus of 1184239 raw words and 51328 sentences
2024-05-23 08:09:44,883: INFO: Creating a fresh vocabulary
2024-05-23 08:09:44,914: INFO: Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 15068 unique wor

Among
[('scared', 0.37241917848587036), ('contractual', 0.3398129940032959), ('Internal', 0.3379846215248108), ('displaying', 0.32876285910606384), ('syllables', 0.3277350664138794), ('exposure', 0.32741430401802063), ('outer', 0.3218686282634735), ('mates', 0.31806090474128723), ('Cathy', 0.3173461854457855), ('fed', 0.316450297832489)]
Problems
[('deputy', 0.35137489438056946), ('staffs', 0.33552974462509155), ('behavior', 0.32039764523506165), ('lash', 0.316434770822525), ('Industrial', 0.31565478444099426), ('pull', 0.3142879605293274), ('originality', 0.3142140805721283), ('usually', 0.3120686709880829), ('policies', 0.31155094504356384), ('effectively', 0.31140434741973877)]

Semantic Tests!

Man is to king as woman is to deputy, lengthy, radical.
From breakfast, cereal, dinner and lunch -- breakfast does not match.
Similarity!
man -- woman: -0.025125857442617416
man -- silver: 0.0936817079782486
-0.025125857
-0.10157965


0.3724192

## Exercise 3.2 (1pt)

Instead of training your own word2vec model, you can also download pretrained
embeddings and load them into `gensim.` Are they doing better in
your `semantic_tests`?

A popular pre-trained option is the Google News dataset model, containing 300-dimensional embeddings for 3 millions words and phrases. Download the binary file ‘GoogleNews-vectors-negative300.bin’ (1.3 GB compressed) from https://code.google.com/archive/p/word2vec/.

In [None]:
from gensim.models import KeyedVectors
from semantic_tests import semantic_tests

model = KeyedVectors.load_word2vec_format(
    'vectors.bin',
    binary=True
)

semantic_tests(model)

2024-05-22 19:24:56,301: INFO: loading projection weights from vectors.bin
2024-05-22 19:25:16,064: INFO: KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from vectors.bin', 'binary': True, 'encoding': 'utf8', 'datetime': '2024-05-22T19:25:16.063671', 'gensim': '4.3.2', 'python': '3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:36:46) \n[Clang 16.0.6 ]', 'platform': 'macOS-14.3-x86_64-i386-64bit', 'event': 'load_word2vec_format'}



Semantic Tests!

Man is to king as woman is to queen, monarch, princess.
From breakfast, cereal, dinner and lunch -- cereal does not match.
Similarity!
man -- woman: 0.7664012312889099
man -- silver: 0.10574154555797577


# Implementation i pytorch

## Exercise 4 (2pt)


- Train word2vec skip-gram model on sentence "the quick brown fox jumps over the lazy dog". Assume context window = 2, embedding_dim = 5. No preprocessing apart from tokenization.
- Compute model output probabilities for words "lazy" and "dog". If you have trained the model correctly, the output probabilities for word "lazy" should be higher for words "over", "the", "dog" (close to 1/3 each) and lower for other words (close to 0 each). For word "dog", the output probabilities should be higher for words, "the", "lazy" (close to 1/2 each) and lower for other words (close to 0 each). 
- Compute dot product between the vector of word "dog" and the vector of word "lazy" (could be representation of center vector and representation of context vector) and between "dog" and "brown". Which one is higher? Why?


You can use this tutorial https://towardsdatascience.com/implementing-word2vec-in-pytorch-skip-gram-model-e6bae040d2fb

Use pytorch (or tensorflow).

In [None]:
# http://pytorch.org/
from os.path import exists

# from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
# platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
# cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
# accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

# !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.4.1-{platform}-linux_x86_64.whl torchvision
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x1986c8030>

In [None]:
sentence = "the quick brown fox jumps over the lazy dog"

If our vocabulary is bigger, the word2vec model needs a LOT of data to obtain reasonable results. With this amount of data, the code needs to be optimized very well. Writing such code will be more suitable for a project instead of a simple exercise, therefore in the next exercise we will use [gensim](https://radimrehurek.com/gensim/), a library made for efficient training of word vectors.

## * Exercise (2pt)

- Use [gensim](https://radimrehurek.com/gensim/) to train a word2vec model on [OpinRank](http://kavita-ganesan.com/entity-ranking-data/). You can follow this [tutorial](https://medium.freecodecamp.org/how-to-get-started-with-word2vec-and-then-how-to-make-it-work-d0a2fca9dad3), but make sure you have used negative sampling.
- Find 10 similar words to word "dirty" and "canada"
- Check if similarity between "dirty" and "dusty" is bigger than between "dirty" and "clean"