<a href="https://colab.research.google.com/github/11bus11/deep_learning_course_umu/blob/main/Erik_VF_5TF078_Laboration_4_om_spr%C3%A5kteknologi_(1_2)_Word2vec_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 5TF078 Deep Learning Course
# Erik Vodopivec Forsman
## Exercise 4 NLP - Word2Vec
Created by Tomas Nordström, Umeå University

Revisions:
* 2023-12-03 Initial version with three different word2vec models (Builtin, Glove, GoogleNews) to be used from within GenSim /ToNo
* 2023-12-06 Fix for gensim.downloader that now seems missing /Tomas
* 2024-03-24 Updated tests for Kaggle. /Tomas
* 2024-04-23 Included student calculation of similarity /Tomas
* 2024-04-30 Fixed deprecated gensim glove2word2vec /Tomas

# Initialization

In [None]:
import sys
import os

### Is this notebook running on Colab?
IS_COLAB = "google.colab" in sys.modules

### Is this notebook running on Kaggle?
# Fool Kaggle into making kaggle_secrets available
try:
    import kaggle_secrets
except ImportError as e:
    pass
# Now we can test for Kaggle
IS_KAGGLE = "kaggle_secrets" in sys.modules

In [None]:
os.environ["KERAS_BACKEND"] = "jax" # Alternatives for Keras 3: "tensorflow", "torch"

# Import Keras/TF libraries

import keras
print('Keras version:', keras.__version__)

import tensorflow as tf
print('TensorFlow version:', tf.__version__)

Keras version: 2.15.0
TensorFlow version: 2.15.0


In [None]:
# Helper libraries
import time
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split

import urllib.request  # urlretrieve lives in the urllib.request submodule
from zipfile import ZipFile

import gensim
import gensim.downloader as gsapi

# Matplotlib plotting
import matplotlib
import matplotlib.pyplot as plt

In [None]:
# https://stackoverflow.com/questions/37748105/how-to-use-progressbar-module-with-urlretrieve#53643011
import progressbar
class MyProgressBar():
    def __init__(self):
        self.pbar = None

    def __call__(self, block_num, block_size, total_size):
        if not self.pbar:
            self.pbar=progressbar.ProgressBar(maxval=total_size)
            self.pbar.start()

        downloaded = block_num * block_size
        if downloaded < total_size:
            self.pbar.update(downloaded)
        else:
            self.pbar.finish()

# Based on the Gensim framework

Gensim Docs: https://radimrehurek.com/gensim/

There are many ways to download models or data for word2vec models:
1. The Gensim built-in models (where 'glove-twitter-100' downloads a 387 MB file).
2. One of the GloVe models (downloads an 822 MB zip file).
3. The GoogleNews-based model.

In [None]:
# Select one of Builtin, Glove, GoogleNews models to use
model_to_use            = 'GoogleNews'
model_to_use_if_builtin = 'glove-twitter-100'

# Set up a word2vec model

## Look at what builtin Gensim models we can use

Note that you need to select one out of the possible builtin models!

In [None]:
# https://radimrehurek.com/gensim/models/word2vec.html

# Show all available models in gensim-data
print(list(gsapi.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [None]:
if model_to_use == 'Builtin':
  w2v = gsapi.load(model_to_use_if_builtin)

## Using GloVe

Let's download pre-trained GloVe embeddings (an 822 MB zip file).

The archive contains text-encoded vectors of various sizes: 50-dimensional, 100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100-D ones.
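
To illustrate what "text-encoded" means here: each line of a GloVe `.txt` file holds a token followed by its vector components, separated by single spaces, with no header line. The sketch below parses two made-up lines into a dictionary. The numbers are invented, not real GloVe values, and `parse_glove_lines` is a hypothetical helper, not part of gensim.

```python
import io

def parse_glove_lines(lines):
    """Parse 'token v1 v2 ...' lines into {token: [float, ...]}."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Two invented 4-dimensional entries standing in for a real GloVe file
sample = io.StringIO("cat 0.1 -0.2 0.3 0.4\ndog 0.2 -0.1 0.5 0.0\n")
toy_vectors = parse_glove_lines(sample)
print(toy_vectors["cat"])
```

Word2vec text files start with a header line giving the vocabulary size and dimension, which GloVe files lack. That is presumably why the loader in this notebook is called with `no_header=True`.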



In [None]:
# Download a file to kerasdata, but first check whether it already exists (now with a progress bar)
DOWNLOADS_DIR = './kerasdata'
os.makedirs(DOWNLOADS_DIR, exist_ok=True) # create dir if not exist

if model_to_use == 'Glove':
  url= 'https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip'
  # Split on the rightmost / and take everything on the right side of that; up until '?'
  name = url.rsplit('/', 1)[-1].split('?', 1)[0]
  filename = os.path.join(DOWNLOADS_DIR, name)

  # Download the file if it does not exist
  if not os.path.isfile(filename):
    print(f'Retrieving url: {url}', flush=True)
    urllib.request.urlretrieve(url, filename, reporthook=MyProgressBar())
  else:
    print(f'Using local zip file: {filename}')

  GLOVEDIR = os.path.join(DOWNLOADS_DIR, 'Glove')
  path_to_glove_file =  os.path.join(GLOVEDIR, 'glove.6B.100d.txt')
  if not os.path.isfile(path_to_glove_file):
    ZipFile(filename).extractall(GLOVEDIR)

  print(f'Using local glove file: {path_to_glove_file}')

  # Now load the GloVe file as gensim KeyedVectors (the old glove2word2vec conversion is deprecated)
  # https://stackoverflow.com/questions/48743053/how-to-save-and-load-glove-models#51319383
  GENSIMGLOVEFILE = os.path.join(GLOVEDIR,"gensim_glove_vectors.txt")

  if not os.path.isfile(GENSIMGLOVEFILE):
    # glove2word2vec(glove_input_file=path_to_glove_file, word2vec_output_file=GENSIMGLOVEFILE)
    w2v = gensim.models.KeyedVectors.load_word2vec_format(path_to_glove_file, binary=False, no_header=True)
  else:
    w2v = gensim.models.KeyedVectors.load_word2vec_format(GENSIMGLOVEFILE, binary=False)


## Using GoogleNews file

In [None]:
if model_to_use == 'GoogleNews':
  # Download a file to kerasdata, but first check whether it already exists (now with a progress bar)
  DOWNLOADS_DIR = './kerasdata'
  os.makedirs(DOWNLOADS_DIR, exist_ok=True) # create dir if not exist

  url= 'https://git.ri.se/tomas.nordstrom/mldata/-/raw/main/GoogleNews-vectors-negative300.bin.gz?inline=false'
  # Split on the rightmost / and take everything on the right side of that; up until '?'
  name = url.rsplit('/', 1)[-1].split('?', 1)[0]
  filename = os.path.join(DOWNLOADS_DIR, name)

  # Download the file if it does not exist
  if not os.path.isfile(filename):
    print(f'Retrieving url: {url}', flush=True)
    urllib.request.urlretrieve(url, filename, reporthook=MyProgressBar())
  else:
    print(f'Using local file: {filename}')

  # Create the word2vec model
  w2v = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=True)

Retrieving url: https://git.ri.se/tomas.nordstrom/mldata/-/raw/main/GoogleNews-vectors-negative300.bin.gz?inline=false


100% (1647046227 of 1647046227) |########| Elapsed Time: 0:01:18 Time:  0:01:18


# Experiments

Check out https://radimrehurek.com/gensim/models/keyedvectors.html#what-can-i-do-with-word-vectors for examples of what you can do with these vectors.

In [None]:
# First check out the word2vec model
print(f'Model name {model_to_use}')
print(f'Total number of words: {len(w2v.key_to_index)}')
print(f'Some example words: {list(w2v.key_to_index)[:20]}')
embedding_len = w2v.vector_size  # dimensionality of each word vector
print(f'Embedding vector length: {embedding_len}')

# Then check out a vector
word1 = "cat"
print(f'\nLooking at "{word1}"')
cat_vector = w2v[word1]
print(f'Vector: {cat_vector[:10]}...')

Model name GoogleNews
Total number of words: 3000000
Some example words: ['</s>', 'in', 'for', 'that', 'is', 'on', '##', 'The', 'with', 'said', 'was', 'the', 'at', 'not', 'as', 'it', 'be', 'from', 'by', 'are']
Embedding vector length: 300

Looking at "cat"
Vector: [ 0.0123291   0.20410156 -0.28515625  0.21679688  0.11816406  0.08300781
  0.04980469 -0.00952148  0.22070312 -0.12597656]...


## Similarity

Now we want to find similarity between word vectors and need to define a [similarity measure](https://en.wikipedia.org/wiki/Similarity_measure). In this exercise we will use [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) as a similarity measure.

Your task is now to define a similarity function, and then generate three word vectors for “cat”, “cut”, and “dog” and compare the similarity between them using your similarity function.
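
For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths, cos θ = a·b / (‖a‖ ‖b‖). It ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). Below is a minimal sketch with plain Python lists and made-up 2-D vectors, so the arithmetic stays visible. The name `toy_cosine` is hypothetical and chosen so it does not clash with the function written for the task.

```python
import math

def toy_cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = toy_cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = toy_cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = toy_cosine([1.0, 1.0], [-1.0, -1.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

Note that the length of the vectors does not matter, only their direction: `[1, 2]` and `[2, 4]` come out as (numerically almost exactly) 1.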


In [None]:
from numpy import dot
from numpy.linalg import norm

def similarity(vec1,vec2):
##################### TODO: YOUR CODE STARTS HERE #####################
    # Cosine similarity: dot product divided by the product of the vector norms
    cosine_sim = dot(vec1,vec2)/(norm(vec1)*norm(vec2))
    return cosine_sim
##################### TODO: YOUR CODE ENDS HERE #######################

In [None]:
cut_vector = w2v["cut"]
dog_vector = w2v["dog"]

##################### TODO: YOUR CODE STARTS HERE #####################
# Compare the similarity between "cat", "cut", and "dog"
def calculate(vec1, vec2, vec3):
  similarity1 = similarity(vec1, vec2)
  similarity2 = similarity(vec1, vec3)
  similarity3 = similarity(vec3, vec2)
  print("result between 1 and 2:", str(similarity1))
  print("result between 1 and 3:", str(similarity2))
  print("result between 3 and 2:", str(similarity3))

calculate(cat_vector, cut_vector, dog_vector)


##################### TODO: YOUR CODE ENDS HERE #######################

result between 1 and 2: 0.092555486
result between 1 and 3: 0.76094574
result between 3 and 2: 0.05553734


## Task
Create more examples and analyze your results.

**Answer:** I compared several different word combinations and got sensible results. For example, rake and shovel were more similar to each other (0.39) than pot was to either rake (0.33) or shovel (0.24). This was expected, since the first two are both tools. The remaining word combinations also gave expected results.

In [None]:
##################### TODO: YOUR CODE STARTS HERE #####################
# Create more examples
print("no1")
eye_vector = w2v["eye"]
nose_vector = w2v["nose"]
toe_vector = w2v["toe"]

calculate(eye_vector, nose_vector, toe_vector)

print("no2")
rake_vector = w2v["rake"]
shovel_vector = w2v["shovel"]
pot_vector = w2v["pot"]

calculate(rake_vector, shovel_vector, pot_vector)

print("no3")
rabbit_vector = w2v["rabbit"]

calculate(dog_vector, cat_vector, rabbit_vector)

##################### TODO: YOUR CODE ENDS HERE #######################

no1
result between 1 and 2: 0.4343575
result between 1 and 3: 0.2866297
result between 3 and 2: 0.46895
no2
result between 1 and 2: 0.39041606
result between 1 and 3: 0.32889664
result between 3 and 2: 0.2366583
no3
result between 1 and 2: 0.76094574
result between 1 and 3: 0.5868356
result between 3 and 2: 0.62613827


## Finding most similar
We are also interested in finding the n closest words to a certain vector.

We could do this by calculating a similarity to all vectors in the word2vec dictionary/matrix and then sort according to similarity score and take the top n-values.
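
The brute-force approach described above can be sketched as follows. This is a toy version under stated assumptions: the 3-D vectors are invented, and `brute_force_most_similar` is a hypothetical helper written in plain Python, not a gensim function.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def brute_force_most_similar(query, embeddings, topn=2):
    """Score every word against `query`, sort by score, keep the top n."""
    scored = [(word, cosine(query, vec)) for word, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Invented 3-D vectors; the GoogleNews model has 3,000,000 rows of 300 dimensions
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.0],
    "cut": [0.0, 0.2, 0.9],
    "saw": [0.1, 0.1, 0.8],
}
query = toy_embeddings["cat"]
# Leave the query word itself out of the candidates
others = {w: v for w, v in toy_embeddings.items() if w != "cat"}
nearest = brute_force_most_similar(query, others, topn=2)
print(nearest)
```

Every query touches every row of the vocabulary, so the cost grows with vocabulary size times vector length. A vectorized matrix product would be the natural next step for a real model.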

This is clearly doable, but in this course round we will use the `most_similar` method of our `w2v` object instead.
To get the ten words most similar to "computer", we do:
`sims = w2v.most_similar('computer', topn=10)`


We sometimes want to do arithmetic on the vectors, such as "king" - "man" + "woman". To support that, `most_similar` takes two parameters, `positive` and `negative`:
`w2v.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)`
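
One way to read the `positive` and `negative` parameters is as vector addition and subtraction followed by a nearest-neighbour search. The sketch below does exactly that on invented 3-D vectors whose axes loosely stand for "royal", "male" and "female". It is a simplification, not gensim's implementation: gensim works with normalized vectors, and the only behaviour copied here is that the input words are excluded from the answer. `toy_analogy` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def toy_analogy(embeddings, positive, negative):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dims = len(next(iter(embeddings.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, embeddings[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, embeddings[word])]
    excluded = set(positive) | set(negative)
    candidates = [(w, cosine(target, v)) for w, v in embeddings.items() if w not in excluded]
    return max(candidates, key=lambda pair: pair[1])

# Invented vectors: [royal, male, female]
toy_embeddings = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.8, 1.0, 0.0],
}
best_word, best_score = toy_analogy(toy_embeddings, positive=["woman", "king"], negative=["man"])
print(best_word, best_score)
```

In this hand-built space the arithmetic lands exactly on "queen". Real embeddings are learned from text, so the match is only approximate there.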


In [None]:
# Classic example
result = w2v.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

[('queen', 0.7118193507194519)]


In [None]:
##################### TODO: YOUR CODE STARTS HERE #####################
# Create more examples
result = w2v.most_similar(positive=['money', 'win'], negative=['game'], topn=1)
print(result)


##################### TODO: YOUR CODE ENDS HERE #######################

[('funds', 0.5235512852668762)]


## Task
Create more "arithmetic examples" and analyze your results.

Each line lists the input words, followed by the top result and its similarity score:
* house, city, countryside: apartment, 0.487262487411499
* family, cat, kids: feline, 0.5297148823738098
* school, kids, toy: students, 0.6308646202087402
* building, adult, free: woodframe, 0.4492707848548889
* park, city, grass: town, 0.5479860901832581
* money, win, game: funds, 0.5235512852668762

The results are reasonable given the words I entered. However, I noticed that the top result was sometimes just a plural form of one of the input words. Getting relevant results for 'harder' combinations therefore took some trial and error.

## There is actually a builtin similarity method

In [None]:
# There is a built-in similarity method we can use instead of our own function
word1 = 'cat'
word2 = 'dog'
print(f"Similarity between {word1} and {word2} is {w2v.similarity(word1, word2)}")

word2 = 'cut'
print(f"Similarity between {word1} and {word2} is {w2v.similarity(word1, word2)}")

Similarity between cat and dog is 0.760945737361908
Similarity between cat and cut is 0.09255547821521759


## Task
Do you get the same result with your similarity function as with Gensim's built-in function? \
**Answer:** Yes. My results match to at least the first 4 decimal places. The built-in function prints more decimals.
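
To make "the same to N decimals" precise, one option is a tolerance-based comparison instead of comparing printed digits by eye. The sketch below applies `math.isclose` to the cat/dog and cat/cut values printed earlier in this notebook. The tolerance `1e-6` is an arbitrary choice for illustration.

```python
import math

# Values copied from the notebook outputs: (own function, built-in method)
pairs = {
    "cat/dog": (0.76094574, 0.760945737361908),
    "cat/cut": (0.092555486, 0.09255547821521759),
}

tolerance = 1e-6  # assumed tolerance; adjust to the precision you care about
agreement = {
    name: math.isclose(own, builtin, abs_tol=tolerance)
    for name, (own, builtin) in pairs.items()
}
for name, ok in agreement.items():
    print(f"{name}: agree within {tolerance} -> {ok}")
```

The remaining differences are on the order of 1e-9 to 1e-8, which is in the range one would expect from 32-bit floating-point rounding and from how the values are printed.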