<a href="https://colab.research.google.com/github/Benjamin-morel/TensorFlow/blob/main/07_word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---


# **Machine Learning Model: word2vec and skip-gram**

| | |
|------|------|
| Filename | 07_word2vec.ipynb |
| Author(s) | Benjamin Morel (benjaminmorel27@gmail.com) |
| Date | January 26, 2024 |
| Aim(s) | build a word embedding space |
| Dataset(s) | IMDb Movie Reviews dataset [[1]](https://aclanthology.org/P11-1015.pdf)|
| Version | Python 3.10.12 - TensorFlow 2.17.1 |


<br> **!!Read before running!!** <br>
1. Fill in the inputs
2. CPU execution is enough
3. Run all and read comments

---

#### **Motivation**

This notebook reuses the IMDb dataset from [02_classfication_text.ipynb](https://github.com/Benjamin-morel/TensorFlow/tree/main). A model is trained with the word2vec technique, and its weights are then used to build a word embedding space.

#### **Outline**



*   retrieve text data & pre-processing
*   training data generation
*   training phase
*   exploration of the embedding space
*   conclusion and comparison
*   references

---

## **0. Input section**

The model has already been trained. The user can either use the pre-trained model (`"No"`) or repeat the training phase (`"Yes"`). Using the pre-trained model saves time, computing resources and CO2 emissions.

In [1]:
training_phase = "No"



---

## **1. Python libraries & display utilities**

In [2]:
# @title 1.1. Python libraries [RUN ME]

""" math """
import numpy as np # linear algebra

""" file opening and pre-processing"""
import os # miscellaneous operating system interfaces
import pandas as pd # data manipulation tool
from re import escape # regular expressions
import string # string manipulation

""" ML models """
import tensorflow as tf # framework for ML/DL
from tensorflow import keras # API used to build model in TensorFlow

""" exportation """
import pickle # serialization

In [3]:
# @title 1.2. Import GitHub files [RUN ME]

"""Clone the GitHub repository TensorFlow and import the required files (see section 3.4)"""

def get_github_files():
  !git clone https://github.com/Benjamin-morel/TensorFlow.git TensorFlow_duplicata
  path_model = 'TensorFlow_duplicata/99_pre_trained_models/07_word2vec/07_word2vec.keras'
  path_dictionary = 'TensorFlow_duplicata/99_pre_trained_models/07_word2vec/dictionary_embedding.pkl'
  model = keras.models.load_model(path_model, custom_objects={'Word2Vec': Word2Vec})
  with open(path_dictionary, 'rb') as f:
    dictionary = pickle.load(f)
  !rm -rf TensorFlow_duplicata/
  return model, dictionary

---


## **2. Data retrieval and set generation**

The skip-gram method is a word embedding technique used in Word2Vec to learn vector representations of words based on their surrounding context. It works by selecting a target word and predicting its context words within a given window size.

>Example: "the dog barks loudly"

>Window = 1 <br>
>Target word: "dog" <br>
>Context words: "the" and "barks"

>Window = 2 <br>
>Target word: "dog" <br>
>Context words: "the" and "barks" and "loudly"

Each word is converted to an integer and passed through a neural network with a single hidden layer. The model is trained to maximize the probability of context words given the target word, often using negative sampling to improve efficiency.

After training, the hidden-layer weights form the word embedding space, in which each word is represented by a 16-dimensional vector. Skip-gram is especially useful for capturing relationships between words.
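
The window example above can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper name `skipgram_pairs` is hypothetical, and it only enumerates (target, context) pairs. It does not perform the frequency-based subsampling or the shuffling that a library routine may apply.

```python
def skipgram_pairs(tokens, window_size):
    """Return every (target, context) pair found within `window_size` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        first = max(0, i - window_size)
        last = min(len(tokens), i + window_size + 1)
        for j in range(first, last):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the dog barks loudly".split()
for window in (1, 2):
    contexts = [c for t, c in skipgram_pairs(tokens, window) if t == "dog"]
    print(f"window={window}, target='dog', contexts={contexts}")
```

With `window=1` the contexts of "dog" are "the" and "barks"; with `window=2`, "loudly" is added.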


### 2.1. Retrieve data

The IMDb Movie Reviews dataset was created for sentiment analysis of movie reviews and contains 50,000 reviews. The archive is downloaded and extracted into a folder organized as follows:

```markdown
**Imdb_dataset/**
. . . aclImdb/
. . . . . . train/
. . . . . . test/
. . . . . . README
. . . . . . imdb.vocab
. . . . . . imdbEr.txt
```

The folder also contains other files: `README` describes the dataset and how to use it, `imdb.vocab` lists the vocabulary, and `imdbEr.txt` gives the expected rating associated with each vocabulary token.

The sentiment labels are not needed to train the word2vec model, so the reviews are loaded without them (`labels=None`).

In [4]:
def get_data(url, batch_size):

  dataset_name = "Imdb_dataset_1"
  path = tf.keras.utils.get_file(dataset_name, url, extract=True)
  path = os.path.join(path, 'aclImdb')
  train_path = os.path.join(path, 'train')
  test_path = os.path.join(path, 'test')

  raw_train_ds = tf.keras.utils.text_dataset_from_directory(train_path, labels=None, batch_size=batch_size, validation_split=None, shuffle=True, seed=1, verbose=0)
  raw_test_ds = tf.keras.utils.text_dataset_from_directory(test_path, labels=None, batch_size=batch_size, verbose=0)

  return raw_train_ds, raw_test_ds

In [5]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

batch_size = 1

train_ds, test_ds = get_data(url, batch_size)

The batch size is set to 1 to simplify the pre-processing of the texts. The variable `text_nb` controls the number of reviews supplied to the model.



In [6]:
text_nb = 25000
train_ds = train_ds.take(text_nb)

In [7]:
for text in train_ds.take(1):
  print(text.numpy())

[b"Who actually created this piece of crap this is the worst movie i have ever seen in my life it is such a waste of time and money. I hate it how they create low budget sequels featuring D-Lister actors and a storyline so similar to the 1st one.<br /><br />I found this movie in the bargain bin sitting right next to Wild Things 2 and Death To The Supermodels for $2.99 what a fool i was to actually think that this could be good instead i watched in disgust as poor acting stereotypes ripped of the storyline and script from the 1st one.<br /><br />Whoever thought that this straight-to-video production was actually even a half decent film you must be on crackd or something because I think what pretty much most of the people who've seen this film thinks WHAT A LOAD OF CRAP!!!!"]


### 2.2. Pre-processing

The input data must be converted into tensors of the following form:
```markdown
Tensor(['The film was ....'])
Tensor(['I see this movie ...'])
. . .
```
Each tensor holds one sentence rather than an entire movie review, so a single review can produce several tensors. Each sentence is assumed to carry its own context and meaning. A sentence is defined here as a series of words ending with `. `, `? ` or `! `.

In [8]:
remove_ponctuation = escape(string.punctuation).replace('.', '').replace('?', '').replace('!', '')

The most frequent words in the dataset (about 50) are also removed. This list comes from the notebook [02_classfication_text.ipynb](https://github.com/Benjamin-morel/TensorFlow/tree/main). These words are removed because they carry little meaning and only lengthen the token sequences fed to the model:

```markdown
The dog barked loudly --> dog barked loudly --> ['dog', 'barked', 'loudly']
```
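
The cleaning steps described in this section can be sketched in plain Python with the standard `re` module. This is an illustration rather than the notebook's TensorFlow implementation: the function name `standardize` and the shortened `STOP_WORDS` list are hypothetical. As in the notebook, stop words are matched with a space on each side, so a stop word at the very start of the text is left in place.

```python
import re
import string

# Punctuation to strip: everything except the sentence terminators . ? !
KEPT = ".?!"
STRIP_CLASS = "[" + re.escape("".join(c for c in string.punctuation if c not in KEPT)) + "]"

# Hypothetical, shortened stop-word list (the notebook uses a longer one).
STOP_WORDS = ["the", "and", "a", "of", "is"]


def standardize(text):
    text = text.lower()                       # upper case --> lower case
    text = text.replace("<br />", " ")        # remove HTML line breaks
    for word in STOP_WORDS:
        text = text.replace(f" {word} ", " ")  # remove space-delimited stop words
    return re.sub(STRIP_CLASS, "", text)      # remove remaining punctuation


print(standardize("The dog barked loudly, and the cat ran!<br />It is a classic."))
```

The leading "the" survives because it is not preceded by a space, which is also visible in the pre-processed reviews printed by the notebook.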

In [9]:
common_token = [' the ', ' and ', ' a ', ' of ', ' to ', ' is ', ' in ', ' it ', ' i ', ' this ', ' that ', ' br ', ' was ', ' as ', ' with ', ' for ',
                ' you ', ' on ', ' are ', ' one ', ' be ', ' he ', ' its ', ' have ', ' an ', ' by ', ' at ', ' all ', ' from ', ' who ', ' so ',
                ' they ', ' her ', ' just ', ' some ', ' out ', ' about ', ' or ', ' s ', ' if ', '  c  ', ' there ', ' were ', ' would ', ' had ', ' it ', ' we ']

In [10]:
def custom_standardization(input_text):
  text_modified = tf.strings.lower(input_text) # upper cases --> lower cases
  text_modified = tf.strings.regex_replace(text_modified, '<br />', ' ') # remove HTML line breaks
  for token in common_token:
    text_modified = tf.strings.regex_replace(text_modified, token, ' ') # remove frequent words
  text_modified = tf.strings.regex_replace(text_modified, '[%s]' % remove_ponctuation, '') # remove punctuation except . ? !
  return text_modified

In [11]:
train_ds = train_ds.map(lambda x: custom_standardization(x))

An example of a pre-processed movie review is displayed. Only the punctuation elements `.`, `?` and `!` are retained.

In [12]:
for text in train_ds.take(1):
  print(text.numpy())

[b'nothing can prepare another lousy bimbo outing! time its being brought neverinevitable fred olen ray! far exploitation movies go doesnt click! science fiction its plain unoriginal! see an ugly feminine android wearing bikini destroy earth showing off thats nearly bare resist! give me fing break!!! kind entertainment your thing then why not dust off those old si swimsuit mags attic change?! been much better didnt set sleaze factor very high but still wouldnt make great. id like point another film called assault 1996 jim wynorski which resembles identity alienator. illustrates why topnotch 1stperson femme fatale action movies dont translate well america. sorry fellas!']


From these pre-processed texts, the sentences are separated and isolated into tensors.

In [13]:
def split_sentences(text):
    text =  tf.strings.split(text, '. ')
    text =  tf.strings.split(text, '? ')
    text =  tf.strings.split(text, '! ')
    return text

In [14]:
train_ds = train_ds.map(lambda x: split_sentences(x))

Finally, the tensors contained in `train_ds` - with variable size - are converted into individual scalar tensors.

In [15]:
def split_ragged_tensor(ragged_tensor):
    flat_values = ragged_tensor.flat_values
    return tf.data.Dataset.from_tensor_slices(flat_values)
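
The combined effect of the sentence splitting and the flattening step can be pictured with a plain Python sketch. The helper name `split_sentences_py` is hypothetical. It applies the three separators one after the other and keeps a flat list, which plays the role of the flattened values of the nested result.

```python
def split_sentences_py(text, separators=(". ", "? ", "! ")):
    """Split `text` on each separator in turn and return a flat list of sentences."""
    pieces = [text]
    for sep in separators:
        pieces = [part for piece in pieces for part in piece.split(sep)]
    return pieces


review = "great film. why not? go see it! sorry fellas!"
for sentence in split_sentences_py(review):
    print(sentence)
```

Because each separator includes a trailing space, a terminator at the very end of the text, or one that is not followed by a space, does not trigger a split.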

In [16]:
train_ds = train_ds.flat_map(split_ragged_tensor)

In [17]:
for text in train_ds.take(5):
  print(text.numpy())

b'this movie not great.i found story too banalordinary.theres not much originality here.its combination many other movies.its equal parts christmas carolits wonderful lifehow grinch stole christmasand even cat hat movie.the movie isnt very funnybut bit slapstick works.this movie isi feltoverly sentimental preachy.in facti felt like watching 90 minute commercial how important family is.nowdont get me wrong.family very important.i find subtlety works best these movies.this way too heavy handed me.but good news.the movie has great visual style.i meanit looks fantastically magical.and martin short terrific jack frostthe baddie piece.hes not really scarymore mildly disconcerting than anythingand even bit sad.i also like look gave him.this movie also bit tearjerker.anywaythis case style over substance.and while its not nearly mean spirited creepy part 2i still dont think good.the negatives outweigh positives.for methe santa clause 3 410'
b'without doubt my absolute favorite film time'
b'firs

### 2.3. Vectorization

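Each sentence is converted into a sequence of 20 integer tokens: longer sentences are truncated, shorter ones are padded with zeros. The vocabulary is limited to 5,000 entries, and words outside it are mapped to the `[UNK]` token.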
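
The idea behind this integer encoding can be sketched in plain Python. This is a simplified illustration, not the `TextVectorization` implementation: the helper names are hypothetical, and breaking frequency ties alphabetically is an assumption made here only to keep the result deterministic. Index 0 is reserved for padding and index 1 for unknown words, which matches the printed sequences in this notebook.

```python
from collections import Counter


def build_vocabulary(sentences, max_tokens):
    """Keep the most frequent words; two slots are reserved for '' and '[UNK]'."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # assumed tie-break
    return ["", "[UNK]"] + ranked[: max_tokens - 2]


def vectorize(sentence, vocabulary, max_length):
    """Map words to indices, then truncate or zero-pad to `max_length`."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(word, 1) for word in sentence.split()][:max_length]
    return ids + [0] * (max_length - len(ids))


corpus = ["good movie", "good film good movie", "good"]
vocabulary = build_vocabulary(corpus, max_tokens=4)
print(vocabulary)
print(vectorize("very good movie", vocabulary, max_length=5))
```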

In [18]:
max_features = 5000
max_length = 20

vectorize_layer = tf.keras.layers.TextVectorization(standardize=None, # already done
                                                    max_tokens=max_features,
                                                    output_mode='int',
                                                    output_sequence_length=max_length)

In [19]:
vectorize_layer.adapt(train_ds)

In [20]:
train_ds = train_ds.map(vectorize_layer)

In [21]:
word_list = vectorize_layer.get_vocabulary()
int_list = list(train_ds.as_numpy_iterator())

for seq in int_list[:5]:
  print(f"{seq} => {[word_list[i] for i in seq]}")

[ 63  43 969  58  82   1 655  22   0   0   0   0   0   0   0   0   0   0
   0   0] => ['this', 'first', 'creepy', 'movies', 'ever', '[UNK]', '5', 'time', '', '', '', '', '', '', '', '', '', '', '', '']
[1690   29   11    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0] => ['scared', 'me', 'good', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
[   3  269  237    1    1  868    7    1   19 1555  142   14 2904    0
    0    0    0    0    0    0] => ['but', 'night', 'put', '[UNK]', '[UNK]', 'eye', 'like', '[UNK]', 'my', 'mom', 'got', 'very', 'upset', '', '', '', '', '', '', '']
[  17 2320   19  868 4963  305  227   19  868    1    7 1634    1    0
    0    0    0    0    0    0] => ['she', 'clean', 'my', 'eye', 'alcohol', 'next', 'day', 'my', 'eye', '[UNK]', 'like', 'double', '[UNK]', '', '', '', '', '', '', '']
[112 134   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0] => ['now', 'thats', 'movie', '', '', '',

---


## **3. Training set generation**

### 3.1. Skip-gram pair generation

The function below generates skip-gram pairs with negative sampling for a list of sequences (int-encoded sentences), given a window size, a number of negative samples per positive pair and the vocabulary size.

In [22]:
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  targets, contexts, labels = [], [], []

  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size) # Build the sampling table for `vocab_size` tokens.

  for sequence in sequences: # Iterate over all sequences (sentences) in the dataset

    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence,
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0) # Generate positive skip-gram pairs for a sequence (sentence)

    for target_word, context_word in positive_skip_grams: # Iterate over each positive skip-gram pair to produce training examples with a positive context word and negative samples.
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")


      context = tf.concat([tf.squeeze(context_class,1), negative_sampling_candidates], 0) # Build context and label vectors (for one target word)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      targets.append(target_word) # Append each element from the training example to global lists.
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels
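
The structure of one training example can be sketched without TensorFlow. The helper `make_training_example` is hypothetical and simplifies the function above in two ways that are assumptions of this sketch: negatives are drawn uniformly (the notebook uses a log-uniform candidate sampler), and the positive context is explicitly excluded from the negatives.

```python
import random


def make_training_example(target, positive_context, num_ns, vocab_size, rng):
    """Build (target, context, label) for one positive pair and `num_ns` negatives.

    Requires vocab_size > num_ns + 1 so that enough distinct negatives exist.
    """
    negatives = []
    while len(negatives) < num_ns:
        candidate = rng.randrange(vocab_size)
        if candidate != positive_context and candidate not in negatives:
            negatives.append(candidate)
    context = [positive_context] + negatives  # positive word first
    label = [1] + [0] * num_ns                # 1 marks the true context word
    return target, context, label


rng = random.Random(1)
target, context, label = make_training_example(63, 43, num_ns=4, vocab_size=5000, rng=rng)
print(target, context, label)
```

Each target word is therefore paired with `num_ns + 1` candidate context words, and the model has to pick out the first one.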

In [23]:
if training_phase == "Yes":
  targets, contexts, labels = generate_training_data(sequences=int_list,
                                                    window_size=2,
                                                    num_ns=4,
                                                    vocab_size=max_features,
                                                    seed=1)

  targets = np.array(targets)
  contexts = np.array(contexts)
  labels = np.array(labels)

  print(f"targets.shape: {targets.shape}")
  print(f"contexts.shape: {contexts.shape}")
  print(f"labels.shape: {labels.shape}")

### 3.2. Performance and batching

In [24]:
if training_phase == "Yes":
  batch_size = 1024
  AUTOTUNE = tf.data.AUTOTUNE
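  dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
  # cache before shuffle/repeat: a cache placed after repeat() would keep growing on the endless stream
  dataset = dataset.cache().shuffle(len(targets)).batch(batch_size).repeat()

  dataset = dataset.prefetch(buffer_size=AUTOTUNE)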

### 3.3. Model definition

In [25]:
@keras.utils.register_keras_serializable()
class Word2Vec(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, **kwargs):
    super(Word2Vec, self).__init__()
    self.vocab_size = vocab_size # Store vocab_size as an attribute
    self.embedding_dim = embedding_dim # Store embedding_dim as an attribute
    self.target_embedding = tf.keras.layers.Embedding(vocab_size,
                                      embedding_dim,
                                      name="w2v_embedding")
    self.context_embedding = tf.keras.layers.Embedding(vocab_size,
                                       embedding_dim)

  def call(self, pair):
    target, context = pair
    # target: (batch, dummy?)  # The dummy axis doesn't exist in TF2.7+
    # context: (batch, context)
    if len(target.shape) == 2:
      target = tf.squeeze(target, axis=1)
    # target: (batch,)
    word_emb = self.target_embedding(target)
    # word_emb: (batch, embed)
    context_emb = self.context_embedding(context)
    # context_emb: (batch, context, embed)
    dots = tf.einsum('be,bce->bc', word_emb, context_emb)
    # dots: (batch, context)
    return dots

  def get_config(self): # Define get_config to include vocab_size and embedding_dim
    config = super(Word2Vec, self).get_config()
    config.update({"vocab_size": self.vocab_size, "embedding_dim": self.embedding_dim})
    return config

  @classmethod
  def from_config(cls, config): # Define from_config to use vocab_size and embedding_dim
      return cls(**config)
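
The `'be,bce->bc'` contraction used in `call` computes, for every example in the batch, the dot product between the target embedding and each candidate context embedding. The plain Python sketch below spells this out with explicit loops. The function name `pair_scores` and the toy values are hypothetical.

```python
def pair_scores(word_emb, context_emb):
    """Equivalent of einsum('be,bce->bc') on nested lists.

    word_emb:    [batch][embed]
    context_emb: [batch][context][embed]
    returns:     [batch][context]
    """
    return [
        [sum(w * c for w, c in zip(word_vec, ctx_vec)) for ctx_vec in ctx_vecs]
        for word_vec, ctx_vecs in zip(word_emb, context_emb)
    ]


word_emb = [[1.0, 2.0]]                                 # one target word, embed=2
context_emb = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]    # three candidate contexts
print(pair_scores(word_emb, context_emb))
```

Each row of the result holds one score (logit) per candidate context word, which is what the loss compares with the label vector.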

In [26]:
def create_model(embedding_dim):
  model = Word2Vec(max_features, embedding_dim)
  model.compile(optimizer='adam',
                loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                metrics=None)
  return model
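
The model is compiled with a categorical cross-entropy applied to logits. For a single example, the usual formula is the negative log of the softmax probability assigned to the true class. The plain Python sketch below illustrates that formula. It is not Keras' implementation, and the function name `softmax_cross_entropy` is hypothetical.

```python
import math


def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    shift = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return -sum(y * (x - log_sum_exp) for x, y in zip(logits, one_hot))


label = [1, 0, 0, 0, 0]  # true context first, then four negative samples
print(softmax_cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], label))   # untrained: log(5)
print(softmax_cross_entropy([4.0, 0.0, 0.0, 0.0, 0.0], label))   # true context favoured
```

The loss decreases as the dot product for the true context word grows relative to those of the negative samples.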

### 3.4. Training phase

In [27]:
def train_model(model, training_data, steps_per_epoch, **kwargs):
  kwargs.setdefault("epochs", 20)
  kwargs.setdefault("verbose", 2)
  log = model.fit(training_data, steps_per_epoch = steps_per_epoch, **kwargs)

  return log.history["loss"]

In [28]:
if training_phase == "Yes":
  model = create_model(16)
  steps_per_epoch = len(targets) // batch_size
  history = train_model(model, dataset, steps_per_epoch)
  model.save('07_word2vec.keras')
else:
  model, dictionary_embedding = get_github_files()

Cloning into 'TensorFlow_duplicata'...
remote: Enumerating objects: 655, done.
remote: Counting objects: 100% (287/287), done.
remote: Compressing objects: 100% (193/193), done.
remote: Total 655 (delta 197), reused 94 (delta 94), pack-reused 368 (from 1)
Receiving objects: 100% (655/655), 174.58 MiB | 21.22 MiB/s, done.
Resolving deltas: 100% (324/324), done.


---


## **4. Embedding space exploration**

### 4.1. Embedding dictionary

In [29]:
weights = model.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()
dictionary_embedding = {vocab[i]: weights[i] for i in range(len(vocab))} # build the embedding dictionary (token --> vector)

with open('dictionary_embedding.pkl', 'wb') as f:
  pickle.dump(dictionary_embedding, f)

In [30]:
dictionary_embedding["money"]

array([-0.3503012 , -0.15039295, -0.32417077,  0.36491925, -0.39782122,
        0.07117482,  0.4433823 , -0.30348516,  0.3206255 ,  0.2682087 ,
        0.5566837 ,  0.6386071 , -0.45531398,  0.09227507,  0.39550662,
        0.11250176], dtype=float32)

In [37]:
with open('vocab.pkl', 'wb') as f:
  pickle.dump(vocab, f)

In [38]:
with open('vocab.pkl', 'rb') as f:
  vocab = pickle.load(f)

### 4.2. Similarities and analogies

In [31]:
# get the distance between two elements in the embedding space

def get_distance(token1, token2):
  p1 = dictionary_embedding[token1]
  p2 = dictionary_embedding[token2]
  distance = np.linalg.norm(p2-p1)
  return distance

# get the cosine similarity between two elements in the embedding space

def get_cosinus_similarity(token1, token2):
  p1 = dictionary_embedding[token1]
  p2 = dictionary_embedding[token2]
  dot_product = np.dot(p1, p2)
  magnitude_1 = np.linalg.norm(p1)
  magnitude_2 = np.linalg.norm(p2)
  cosine_sim = dot_product / (magnitude_1 * magnitude_2)
  return cosine_sim

In [39]:
# get the n elements closest to a specific element in the embedding space

def get_synomym(token, n, used_distance=True):
  candidate_list = {} # maps each candidate token to its score (distance or cosine similarity)
  for i in range(1, len(vocab)): # the whole vocabulary is searched, except the padding token at index 0
    token_candidate = vocab[i]
    if used_distance: candidate_list[token_candidate] = get_distance(token, token_candidate)
    else: candidate_list[token_candidate] = get_cosinus_similarity(token, token_candidate)

  sorted_items = sorted(candidate_list.items(), key=lambda item: item[1]) # ascending scores

  if used_distance: synonym_list = sorted_items[1:n+1] # smallest distances, skipping the token itself
  else: synonym_list = sorted_items[-(n+1):-1] # largest similarities, skipping the token itself
  print(synonym_list)

In [43]:
word = "house"
nb = 10
get_synomym(word, nb, used_distance=False)

[('security', 0.811774), ('island', 0.81794286), ('water', 0.8186376), ('kitchen', 0.82224846), ('office', 0.8228932), ('hotel', 0.8328615), ('search', 0.8471924), ('mountain', 0.84733915), ('apartment', 0.85803413), ('bed', 0.9000992)]


**"terrible"**

*   awful - 0.9384527
*   horrible - 0.8618901
*   plain - 0.851014
*   abysmal - 0.8413805
*   uninspired - 0.83548206
*   lame - 0.8295864
*   unbelievably - 0.82898676
*   dreadful - 0.8224774
*   unwatchable - 0.81847626
*   horrendous - 0.81711394

**"amazing"**

*   brilliant - 0.8560008
*   understated - 0.8467297
*   remarkable - 0.84653646
*   outrageous - 0.8109178
*   wonderful - 0.8065811
*   delightful - 0.8046221
*   exceptional - 0.80231535
*   marvelous - 0.7814437
*   accomplished - 0.78039724
*   underrated - 0.78019947

**"house"**

*   bed - 0.9000992
*   apartment - 0.85803413
*   mountain - 0.84733915
*   search - 0.8471924
*   hotel - 0.8328615
*   office - 0.8228932
*   kitchen - 0.82224846
*   water - 0.8186376
*   island - 0.81794286
*   security - 0.811774

In [44]:
def get_analogy(vector, n):
  candidate_list = {} # maps each candidate token to its L1 distance from `vector`
  for i in range(1, len(vocab)):
    token_candidate = vocab[i]
    vector_candidate = dictionary_embedding[token_candidate]
    candidate_list[token_candidate] = sum(abs(vector - vector_candidate))

  sorted_items = sorted(candidate_list.items(), key=lambda item: item[1])
  synonym_list = sorted_items[0:n]
  words = [item[0] for item in synonym_list]
  print(words)

In [45]:
analogy = dictionary_embedding["funny"] - dictionary_embedding["nice"] + dictionary_embedding["terrible"]
get_analogy(analogy, 5)

['terrible', 'awful', 'bad', 'judging', 'horrendous']


funny-nice+terrible = ['terrible', 'awful', 'bad', 'judging', 'horrendous']
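
In the result above, the query word "terrible" is returned as its own nearest neighbour. A common variant excludes the words used to build the query vector. The sketch below shows this on a toy embedding in plain Python. The tokens, the 2-dimensional vectors and the helper names are all hypothetical and unrelated to the trained model.

```python
def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def nearest_tokens(embedding, vector, n, exclude=()):
    """Return the n tokens closest to `vector` (L1 distance), skipping `exclude`."""
    distances = {
        token: l1_distance(vector, candidate)
        for token, candidate in embedding.items()
        if token not in exclude
    }
    return sorted(distances, key=distances.get)[:n]


# Hypothetical toy embedding, chosen by hand for illustration only.
toy_embedding = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.1, 0.8],
    "woman": [0.1, 0.2],
}
query = [k - m + w for k, m, w in zip(toy_embedding["king"],
                                      toy_embedding["man"],
                                      toy_embedding["woman"])]
print(nearest_tokens(toy_embedding, query, 2))
print(nearest_tokens(toy_embedding, query, 1, exclude={"king", "man", "woman"}))
```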

---


## **5. Conclusion**

---


## **6. References**

| Index | Title | Author(s) | Type | Comments |
|------|------|------|------|------|
|[[1]](https://aclanthology.org/P11-1015.pdf) | IMDB dataset | Andrew L. Maas et al. | dataset & paper | - |
|[[2]](https://www.tensorflow.org/tutorials/keras/text_classification) | Basic text classification | TensorFlow | dataset | - |
|[[3]](https://www.tensorflow.org/guide/data_performance) | Better performance with the tf.data API | TensorFlow | tutorial | - |
|[[4]](https://www.cs.toronto.edu/~lczhang/360/lec/w05/w2v.html) | Word2Vec and GloVe Vectors | University of Toronto | website | - |