# Exploring Word2Vec with Gensim

## Overview

Word2Vec is an approach to learning *word embeddings*, vector representations of words that capture semantic and syntactic relationships between words based on their co-occurrences in natural language text.

This unsupervised learning approach also produces vectors of much lower dimensionality than sparse representations such as one-hot vectors. This saves memory and helps manage the *curse of dimensionality*, whereby data become relatively sparse in high-dimensional vector spaces, which makes machine learning harder.
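
To make the dimensionality point concrete, the sketch below compares a one-hot encoding with a dense embedding. The toy vocabulary and the 3-dimensional dense vectors are made up for illustration; they are not produced by Word2Vec.

```python
import math

# Hypothetical toy vocabulary; a real one has tens of thousands of entries.
vocab = ["tree", "leaf", "king", "queen"]

def one_hot(word):
    """Return a vector with one dimension per vocabulary entry."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked dense vectors (assumed values, for illustration only).
dense = {
    "tree": [0.9, 0.1, 0.0],
    "leaf": [0.8, 0.2, 0.1],
    "king": [0.0, 0.9, 0.4],
}

# One-hot vectors of distinct words are always orthogonal.
print(cosine(one_hot("tree"), one_hot("leaf")))  # 0.0

# Dense vectors can place related words closer together than unrelated ones.
print(cosine(dense["tree"], dense["leaf"]) > cosine(dense["tree"], dense["king"]))  # True
```

The one-hot vectors grow with the vocabulary and carry no notion of similarity, whereas the dense vectors keep a fixed, small size and can encode relatedness.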

In this exercise you will explore the capabilities of Word2Vec as implemented in the Gensim library.

## Requirements

Uncomment the lines below, run the installations once as needed, then comment the code out again.

In [None]:
# !pip install --upgrade pip
# !pip install --upgrade Cython
# !pip install --upgrade gensim

Import all necessary libraries.

In [None]:
# Import modules and set up logging.
from typing import List, Generator
import gensim.downloader as api
from gensim.models import Word2Vec
import logging
import numpy as np
import os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import ipytest
import pytest

ipytest.autoconfig()

## Download data

In [None]:
# Load the Text8 corpus.
print(api.info('text8'))
text8_corpus = api.load('text8')

{'num_records': 1701, 'record_format': 'list of str (tokens)', 'file_size': 33182058, 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py', 'license': 'not found', 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.', 'checksum': '68799af40b6bda07dfa47a32612e5364', 'file_name': 'text8.gz', 'read_more': ['http://mattmahoney.net/dc/textdata.html'], 'parts': 1}


## Train a model

In [None]:
# Train a Word2Vec model on the Text8 corpus with default hyperparameters.
model = Word2Vec(text8_corpus)

# Perform a sanity check on the trained model.
print(model.wv.similarity('tree', 'leaf'))

0.6665878


In [None]:
# Reduce the logging level. A second call to logging.basicConfig() has no effect
# once the root logger is configured, so set the level on the root logger directly.
logging.getLogger().setLevel(logging.WARNING)
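
A note on changing the log level after start-up: `logging.basicConfig` only configures the root logger when it has no handlers yet, so a later plain call does not change the level. The sketch below is a standalone illustration using only the standard library; the `force=True` argument (Python 3.8+) is used here only to give the demo a clean starting state.

```python
import logging

# force=True (Python 3.8+) resets the root logger so the demo starts clean.
logging.basicConfig(level=logging.INFO, force=True)
root = logging.getLogger()

# A second plain basicConfig call is ignored because a handler already exists.
logging.basicConfig(level=logging.WARNING)
level_after_second_call = root.level

# Setting the level on the root logger takes effect immediately.
root.setLevel(logging.WARNING)
level_after_set_level = root.level

print(logging.getLevelName(level_after_second_call))  # INFO
print(logging.getLevelName(level_after_set_level))    # WARNING
```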

In [None]:
print(model.wv.most_similar('tree'))
print(model.wv.most_similar('leaf'))

[('trees', 0.7079861760139465), ('leaf', 0.6665878295898438), ('bark', 0.6538001894950867), ('vine', 0.6142206788063049), ('fruit', 0.6016198992729187), ('bird', 0.6014313101768494), ('skeleton', 0.574469804763794), ('cave', 0.5741851925849915), ('avl', 0.5740269422531128), ('nest', 0.5717236399650574)]
[('bark', 0.7800428867340088), ('coloured', 0.7542052865028381), ('jelly', 0.7346197366714478), ('colored', 0.7331221699714661), ('flower', 0.7298780679702759), ('fried', 0.7292338013648987), ('pollen', 0.7290586829185486), ('abalone', 0.7280126810073853), ('sap', 0.7245625853538513), ('sperm', 0.724285900592804)]


## Relationships

Investigate the relationships between words in terms of trained representations.

### Evaluate analogies

With the model you have trained, evaluate the analogy `king-man+woman =~ queen`.

In [None]:
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5))

[('queen', 0.6895353198051453), ('throne', 0.6135202646255493), ('prince', 0.6088247895240784), ('princess', 0.6042246222496033), ('empress', 0.6005150079727173)]
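
As a rough sketch of what an analogy query does: the Gensim documentation describes `most_similar` as computing the cosine similarity between a simple mean of the input vectors (positive words added, negative words subtracted) and the vector of every other word. The 2-D vectors below are invented for illustration, and `toy_most_similar` is a hypothetical helper, not Gensim's implementation.

```python
import math

# Invented 2-D vectors: axis 0 is roughly "royalty", axis 1 roughly "gender".
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_most_similar(positive, negative, topn=3):
    """Rank words by cosine similarity to the signed mean of the inputs."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    mean = [0.0] * dim
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            mean[i] += sign * x / len(signed)
    query = unit(mean)
    scores = {
        w: sum(a * b for a, b in zip(query, unit(v)))
        for w, v in vectors.items()
        if w not in positive and w not in negative  # input words are excluded
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# 'queen' ranks first for king - man + woman in this toy space.
print(toy_most_similar(positive=["king", "woman"], negative=["man"]))
```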


Evaluate the analogy `ship-boat+rocket =~ spacecraft`. How similar is the left-hand side of the analogy to the right-hand side? Implement a function that answers this question for analogies in general. Assume that the right-hand side of the analogy is always a single, positive term.

In [None]:
from typing import Optional
from gensim.models import KeyedVectors

def eval_analogy(model: KeyedVectors, lhs_pos: List[str], lhs_neg: List[str], rhs: str) -> Optional[float]:
    """Returns the similarity between the left-hand and right-hand sides of an analogy.

        Arguments:
            model: Keyed vectors of a trained Gensim word2vec model (e.g., `model.wv`).
            lhs_pos: List of terms that are positive on the left-hand side in the analogy.
            lhs_neg: List of terms that are negative on the left-hand side in the analogy.
            rhs: A single positive term on the right-hand side in the analogy.

        Returns:
            Float of the similarity if the right-hand side term is found in the top 500 most similar terms.
            Otherwise, None."""
    # Rank the vocabulary by similarity to the combined left-hand side.
    similarities_list = model.most_similar(positive=lhs_pos, negative=lhs_neg, topn=500)
    # Map each candidate term to its similarity score for a direct lookup.
    similarities_dict = dict(similarities_list)
    if rhs in similarities_dict:
        return similarities_dict[rhs]
    print("Right-hand side term not found in the top 500 terms most similar to the left-hand side of the analogy.")
    return None

Test:

In [None]:
%%ipytest

def test_eval_analogy():
    assert eval_analogy(model.wv, ['ship', 'rocket'], ['boat'], 'spacecraft') == pytest.approx(0.7, abs=1e-1)

.                                                                                            [100%]
1 passed in 0.02s


## Load a pre-trained model

In [None]:
import gensim.downloader as api
model_loaded = api.load('word2vec-google-news-300')



In [None]:
loaded_analogy_eval = -1
# Evaluate the analogy 'king'-'man'+'woman' compared to 'queen' using the loaded model
# and assign the value to the variable `loaded_analogy_eval`.
loaded_analogy_eval = eval_analogy(model_loaded, ['king', 'woman'], ['man'], 'queen')

In [None]:
%%ipytest

def test_loaded_analogy_eval():
    assert loaded_analogy_eval != -1
    assert loaded_analogy_eval == pytest.approx(0.7, abs=1e-1)

.                                                                                            [100%]
1 passed in 0.03s


## Train Word2Vec on different corpora

In [None]:
# Download the rap lyrics of Kanye West.
! wget https://raw.githubusercontent.com/gsurma/text_predictor/master/data/kanye/input.txt
! mv input.txt kanye.txt

# Download the complete works of William Shakespeare.
! wget https://raw.githubusercontent.com/gsurma/text_predictor/master/data/shakespeare/input.txt
! mv input.txt shakespeare.txt

--2023-10-13 15:28:09--  https://raw.githubusercontent.com/gsurma/text_predictor/master/data/kanye/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330453 (323K) [text/plain]
Saving to: ‘input.txt’


2023-10-13 15:28:09 (10.1 MB/s) - ‘input.txt’ saved [330453/330453]

--2023-10-13 15:28:09--  https://raw.githubusercontent.com/gsurma/text_predictor/master/data/shakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4573338 (4.4M) [text/plain]
Saving to: ‘input.txt’


2023-10-13 15:28:10 (70.0 MB/s) - ‘input.t

In [None]:
from gensim import utils

class MyCorpus:
    """An iterable that yields sentences (lists of str)."""
    def __init__(self, data: str) -> None:
        # Path to a plain-text corpus file.
        self.data = data

    def __iter__(self) -> Generator[List[str], None, None]:
        # Assume there's one document per line, tokens separated by whitespace.
        with open(self.data, encoding='utf-8') as corpus_file:
            for line in corpus_file:
                yield utils.simple_preprocess(line)
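
The corpus is wrapped in a class with `__iter__` rather than passed as a plain generator because Word2Vec training reads the data more than once (a pass to build the vocabulary, then one pass per epoch). The standalone sketch below illustrates the difference using only the standard library; `sentence_generator`, `RestartableCorpus`, and the two-line corpus are hypothetical stand-ins, and `str.split` replaces Gensim's preprocessing for simplicity.

```python
import os
import tempfile

def sentence_generator(path):
    """A generator function: the returned generator can be consumed only once."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.split()

class RestartableCorpus:
    """An iterable: each call to __iter__ starts a fresh pass over the file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()

# Write a tiny two-line corpus to a temporary file (illustrative data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("to be or not to be\nthat is the question\n")
    corpus_path = tmp.name

one_shot = sentence_generator(corpus_path)
restartable = RestartableCorpus(corpus_path)

# Count the sentences seen on two consecutive passes over each object.
generator_passes = [len(list(one_shot)), len(list(one_shot))]
iterable_passes = [len(list(restartable)), len(list(restartable))]
print(generator_passes)  # [2, 0]
print(iterable_passes)   # [2, 2]

os.remove(corpus_path)
```

The generator is empty on its second pass, so a model trained from it would see no data after the vocabulary scan; the class-based iterable yields the full corpus every time.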

Separately train two new models on the two datasets, and compare how the datasets affect the learned relationships between words.

In [None]:
kanye_data = MyCorpus(os.path.join(os.getcwd(), 'kanye.txt'))
shakespeare_data = MyCorpus(os.path.join(os.getcwd(), 'shakespeare.txt'))

In [None]:
kanye_model = None
# Train a Word2Vec model on the Kanye corpus, and name it `kanye_model`.
kanye_model = Word2Vec(sentences=kanye_data)

In [None]:
shakespeare_model = None
# Train a Word2Vec model on the Shakespeare corpus, and name it `shakespeare_model`.
shakespeare_model = Word2Vec(sentences=shakespeare_data)

It is easy to find words for which the two models learn very different similarities.

In [None]:
# For example, compare:
print(kanye_model.wv.most_similar(positive=['king'], topn=5))
print(shakespeare_model.wv.most_similar(positive=['king'], topn=5))

[('our', 0.9988145232200623), ('big', 0.998805582523346), ('always', 0.9987574219703674), ('as', 0.9987520575523376), ('or', 0.9987413883209229)]
[('prince', 0.8835847973823547), ('bolingbroke', 0.7122901678085327), ('duke', 0.6925632953643799), ('crown', 0.6918205618858337), ('fifth', 0.6868830323219299)]
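
One simple way to quantify how different the two neighbour lists are is their Jaccard overlap. The sketch below is an illustrative measure, not part of Gensim; the word lists are copied from the output above with the scores omitted, and `jaccard` is a hypothetical helper.

```python
# Top-5 neighbours of 'king' copied from the output above (scores omitted).
kanye_neighbours = ["our", "big", "always", "as", "or"]
shakespeare_neighbours = ["prince", "bolingbroke", "duke", "crown", "fifth"]

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# The two models share no neighbours of 'king' in their top five.
print(jaccard(kanye_neighbours, shakespeare_neighbours))  # 0.0
```

With the trained models at hand, the same lists could be built directly, for example with `[word for word, _ in kanye_model.wv.most_similar(positive=['king'], topn=5)]`.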


For more information about Gensim, see https://radimrehurek.com/gensim.