# NLP - Word representation

## The corpus

In [14]:
import numpy as np

In [15]:
texts = np.array(["I like chocolate",
            "I like tea",
            "You like chocolate",
            'You hate beer',
            'I hate wine'])
labels = np.array([1,1,1,0,0])

## The imports

In [16]:
import os
import pandas as pd
import numpy as np

import tensorflow as tf
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, TextVectorization, Dense, Flatten, Embedding

In [17]:
import warnings
warnings.filterwarnings("ignore")

## BOW representation (the first week)

In [18]:
# with Keras preprocessing layer
vectorize_layer = TextVectorization(output_mode='count', ngrams=(1,2))
# Fit the layer with the corpus
vectorize_layer.adapt(texts)

# define the model
input_ = Input(shape=(1,), dtype=tf.string)
x = vectorize_layer(input_)
hidden = Dense(32, activation='relu')(x)
output_ = Dense(1, activation='sigmoid')(hidden)
model = Model(input_, output_)
# summarize the model
model.summary()
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# fit the model
model.fit(texts, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(texts, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_3 (TextV  (None, 17)               0         
 ectorization)                                                   
                                                                 
 dense_4 (Dense)             (None, 32)                576       
                                                                 
 dense_5 (Dense)             (None, 1)                 33        
                                                                 
Total params: 609
Trainable params: 609
Non-trainable params: 0
_________________________________________________________________
Accuracy: 100.000000
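
To make the `(None, 17)` output shape above less mysterious, here is a minimal plain-Python sketch of what a count vectorizer with unigrams and bigrams computes. It is not the Keras implementation: the helper names (`ngrams`, `count_vector`) are invented for this illustration, the vocabulary is sorted alphabetically rather than by frequency, and reserving index 0 for out-of-vocabulary n-grams is an assumption that mirrors the single OOV slot of the `count` mode.

```python
from collections import Counter

texts = ["I like chocolate", "I like tea", "You like chocolate",
         "You hate beer", "I hate wine"]

def ngrams(text, max_n=2):
    """Lowercase, split on whitespace, and return all 1..max_n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

# Vocabulary: every distinct unigram and bigram seen in the corpus.
vocab = sorted({g for t in texts for g in ngrams(t)})
# Index 0 is reserved for out-of-vocabulary n-grams (assumption, see above).
index = {g: i + 1 for i, g in enumerate(vocab)}

def count_vector(text):
    """Return a count vector of length len(vocab) + 1."""
    vec = [0] * (len(vocab) + 1)
    for gram, count in Counter(ngrams(text)).items():
        vec[index.get(gram, 0)] += count
    return vec

print(len(vocab) + 1)               # size of the feature vector
print(count_vector("I like tea"))
```

The corpus contains 8 distinct unigrams and 8 distinct bigrams, so the feature vector has 16 + 1 = 17 entries, which is the width reported in the model summary.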


## Keras word embedding (last week)

In [19]:
# Constants
vocab_size = 1000  # Maximum vocab size.
max_len = 10  # Sequence length to pad the outputs to.
embedding_size = 100

# with Keras preprocessing layer
vectorize_layer = TextVectorization(max_tokens=vocab_size,
                                    output_mode='int',
                                    output_sequence_length=max_len)
# Fit the layer with the corpus
vectorize_layer.adapt(texts)

# define the model
input_ = Input(shape=(1,), dtype=tf.string)
x = vectorize_layer(input_)
x = Embedding(vocab_size, embedding_size, name="Embedding")(x)
x = Flatten()(x)
hidden = Dense(32, activation="relu")(x)
output_ = Dense(1, activation='sigmoid')(hidden)
model = Model(input_, output_)
# summarize the model
model.summary()
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# fit the model
model.fit(texts, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(texts, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_4 (TextV  (None, 10)               0         
 ectorization)                                                   
                                                                 
 Embedding (Embedding)       (None, 10, 100)           100000    
                                                                 
 flatten_1 (Flatten)         (None, 1000)              0         
                                                                 
 dense_6 (Dense)             (None, 32)                32032     
                                                                 
 dense_7 (Dense)             (None, 1)                 33        
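
The parameter counts in this summary can be checked by hand. The short sketch below redoes the arithmetic from the constants defined in the cell; the variable names are chosen for this illustration only.

```python
vocab_size, max_len, embedding_size, hidden_units = 1000, 10, 100, 32

# Embedding: one vector of size embedding_size per token id.
embedding_params = vocab_size * embedding_size
# Flatten: max_len vectors of size embedding_size laid end to end (no parameters).
flatten_width = max_len * embedding_size
# Dense layers: one weight per (input, unit) pair plus one bias per unit.
hidden_params = flatten_width * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

print(embedding_params, flatten_width, hidden_params, output_params)
```

The four numbers are 100000, 1000, 32032 and 33, matching the Embedding, Flatten and Dense rows of the table.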
                                                           

## Use a pre-trained embedding: GloVe/Word2Vec/FastText (this week)

**Traditional word embedding** techniques (GloVe/Word2Vec/FastText) learn a single, global embedding per word. They first build a global vocabulary from the unique words in the documents, ignoring the fact that a word can mean different things in different contexts. Words that frequently appear close to each other in the documents then receive similar representations. The problem is that such representations ignore a word's contextual meaning (the meaning derived from its surroundings). For example, only one representation is learnt for "left" in the sentence "I left my phone on the left side of the table." However, "left" has two different meanings in that sentence and would need two different representations in the embedding space.

For example, consider the two sentences:

1. I will show you a valid point of reference and talk to the point.
1. Where have you placed the point?

With pre-trained embeddings such as word2vec, the embedding of the word 'point' is the same for both of its occurrences in example 1 and also for its occurrence in example 2: all three occurrences share the same embedding.
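
The following sketch illustrates this with a static lookup table. The 3-dimensional vectors are invented for the illustration (they are not real word2vec values), and the `embed` helper is hypothetical; the point is only that a dictionary lookup cannot take context into account.

```python
# Hypothetical 3-dimensional static embeddings, invented for illustration.
static_embeddings = {
    "point": [0.2, -0.1, 0.7],
    "valid": [0.5, 0.3, -0.2],
    "placed": [-0.4, 0.6, 0.1],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback vector for words without an entry

def embed(sentence):
    """Return (token, vector) pairs using a context-free dictionary lookup."""
    tokens = sentence.lower().replace(".", " ").replace("?", " ").split()
    return [(tok, static_embeddings.get(tok, UNKNOWN)) for tok in tokens]

s1 = embed("I will show you a valid point of reference and talk to the point.")
s2 = embed("Where have you placed the point?")

point_vectors = [vec for tok, vec in s1 + s2 if tok == "point"]
print(len(point_vectors))                                  # occurrences of 'point'
print(all(v == point_vectors[0] for v in point_vectors))   # identical vectors?
```

All three occurrences of 'point' come back with the same vector, whatever words surround them.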

In [8]:
# Same steps as Keras Embedding
vocab_size = 25  # Maximum vocab size.
max_len = 10  # Sequence length to pad the outputs to.
embedding_dim = 50
hidden_size = 16
vectorizer = TextVectorization(max_tokens=vocab_size, output_sequence_length=max_len)
vectorizer.adapt(texts)

In [9]:
# Build word dict
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))
word_index

{'': 0,
 '[UNK]': 1,
 'like': 2,
 'i': 3,
 'you': 4,
 'hate': 5,
 'chocolate': 6,
 'wine': 7,
 'tea': 8,
 'beer': 9}

In [10]:
# We can look at the new shape of the vectors after TextVectorization 
texts_vec = vectorizer(texts)
texts_vec

<tf.Tensor: shape=(5, 10), dtype=int64, numpy=
array([[3, 2, 6, 0, 0, 0, 0, 0, 0, 0],
       [3, 2, 8, 0, 0, 0, 0, 0, 0, 0],
       [4, 2, 6, 0, 0, 0, 0, 0, 0, 0],
       [4, 5, 9, 0, 0, 0, 0, 0, 0, 0],
       [3, 5, 7, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)>
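
As a rough picture of what the layer does in `int` mode, the sketch below reproduces the same rows in plain Python from the word dict printed above. It is a simplification: the hypothetical `vectorize` helper only lowercases and splits on whitespace, whereas the Keras layer also strips punctuation by default.

```python
word_index = {'': 0, '[UNK]': 1, 'like': 2, 'i': 3, 'you': 4,
              'hate': 5, 'chocolate': 6, 'wine': 7, 'tea': 8, 'beer': 9}
max_len = 10
PAD_ID, OOV_ID = 0, 1

def vectorize(text):
    """Map each token to its id, truncate to max_len, then right-pad with 0."""
    tokens = text.lower().split()
    ids = [word_index.get(tok, OOV_ID) for tok in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

print(vectorize("I like chocolate"))
print(vectorize("You love beer"))   # 'love' is not in the vocabulary
```

The first call gives `[3, 2, 6, 0, 0, 0, 0, 0, 0, 0]`, the first row of the tensor above; in the second, the unknown word 'love' is mapped to index 1.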

In [9]:
# Download the pre-trained embedding matrix, for example from GloVe
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip -q glove.6B.zip

In [20]:
# Make a dict mapping words (strings) to their NumPy vector representation:
path_to_glove_file = "glove.6B.50d.txt"

embeddings_index = {}
# GloVe files are UTF-8 encoded; without an explicit encoding, platforms whose
# default is not UTF-8 (e.g. Windows) raise a UnicodeDecodeError here.
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Let's prepare a corresponding embedding matrix that we can use in a Keras Embedding layer. It's a simple NumPy matrix where the entry at index i is the pre-trained vector for the word of index i in our vectorizer's vocabulary.

In [11]:
num_tokens = len(voc) + 2  # voc already contains PAD ('') and OOV ('[UNK]'); the +2 only adds two unused all-zero rows
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    print(word, i)
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

 0
[UNK] 1
like 2
i 3
you 4
hate 5
chocolate 6
wine 7
tea 8
beer 9
Converted 8 words (2 misses)


In [12]:
# Initialize the Embedding layer with the weight of each word
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
num_tokens,embedding_dim, max_len

(12, 50, 10)

In [13]:
# define the model
input_ = Input(shape=(1,), dtype=tf.string)
x = vectorizer(input_)  # the layer adapted above, whose vocabulary indexes embedding_matrix
x = embedding_layer(x)
x = Flatten()(x)
hidden = Dense(hidden_size, activation="relu")(x)
output_ = Dense(1, activation='sigmoid')(hidden)
model = Model(input_, output_)
# summarize the model
model.summary()
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# fit the model
model.fit(texts, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(texts, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 10)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 10, 50)            600       
                                                                 
 flatten_1 (Flatten)         (None, 500)               0         
                                                                 
 dense_4 (Dense)             (None, 16)                8016      
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                           

## Train your own Word2Vec model with gensim

In [14]:
#!pip install gensim

To begin with, you need data to train a model. We will use part of the Brown corpus.

In [13]:
import nltk
from nltk.corpus import brown
from gensim.models import Word2Vec

# The Brown corpus is not bundled with NLTK: download it once before first use.
nltk.download('brown')
train_set = brown.sents()[:10000]


Let's go ahead and train a model on our corpus. Don't worry much about the training parameters for now; we'll revisit them later.

In [11]:
# gensim >= 4.0 names this parameter `vector_size`; releases before 4.0 called it `size`.
model = Word2Vec(sentences=train_set, vector_size=embedding_dim, window=5, min_count=1, workers=4)

Once we have our model, we can use it.

The main part of the model is model.wv, where "wv" stands for "word vectors".

In [12]:
# `model` must be the gensim Word2Vec model trained above, not a Keras model of the same name.
vector = model.wv['university']  # get numpy vector of a word
vector

In [18]:
model.wv.similarity('university', 'school')

0.99852455
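
The similarity score reported here is the cosine similarity between the two word vectors. A minimal plain-Python sketch of that measure, using short made-up vectors instead of the trained ones, could look like this (the function name is chosen for the illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [3.0, 0.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The score only depends on the direction of the vectors, not on their length: it is close to 1 for vectors pointing the same way, 0 for orthogonal vectors and close to -1 for opposite ones.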

In [19]:
sims = model.wv.most_similar('university', topn=10)  # get other similar words
sims

[('meeting', 0.9992534518241882),
 ('over', 0.9992371797561646),
 ('chairman', 0.999186098575592),
 ('collection', 0.9991796016693115),
 ('week', 0.9991607069969177),
 ('Christ', 0.9991540312767029),
 ('Soviet', 0.9991388320922852),
 ('last', 0.999138593673706),
 ('family', 0.9991305470466614),
 ('season', 0.9991235733032227)]

Training non-trivial models can take time.  Once the model is built, it can be saved using standard gensim methods:

In [20]:
import tempfile

with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
    print(temporary_filepath)
    model.save(temporary_filepath)
    #
    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.
    #
    # To load a saved model:
    #
    new_model = Word2Vec.load(temporary_filepath)

/var/folders/1p/3c9gtfld201dy53fjq35ky7c0000gn/T/gensim-model-ktkkx_xu


If you save the model you can continue training it later:

In [21]:
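from nltk.tokenize import word_tokenize
new_sentences = [word_tokenize(sent) for sent in texts]
# total_examples must be the number of sentences passed in, not 1.
new_model.train(new_sentences, total_examples=len(new_sentences), epochs=1)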

(15, 15)

If you no longer need to retrain the model, it can be saved with only the vectors and their keys. This results in a much smaller and faster object that can be loaded more quickly.

In [22]:
from gensim.models import KeyedVectors

# Store just the words + their trained embeddings.
word_vectors = new_model.wv
word_vectors.save("word2vec.wordvectors")

# Load back with memory-mapping = read-only, shared across processes.
new_word_vectors = KeyedVectors.load("word2vec.wordvectors", mmap='r')

vector = new_word_vectors['university']  # Get numpy vector of a word
vector

array([-0.20820735, -0.06647338, -0.11990894,  0.04388699, -0.155495  ,
       -0.13707717, -0.33804825,  0.1828707 , -0.10238216,  0.02814539,
        0.2875241 ,  0.07058284,  0.13326436,  0.00546349,  0.03286072,
       -0.01431172,  0.24371824,  0.08051413, -0.13486904,  0.03584216,
       -0.01797359, -0.1378914 ,  0.3005393 , -0.36257207, -0.206651  ,
        0.19838478, -0.16907132,  0.09443361, -0.11644898,  0.03170768,
        0.00233546,  0.2752823 , -0.16797155, -0.17364414,  0.03928268,
        0.340022  ,  0.10467251,  0.08548962,  0.06931432, -0.14733282,
        0.18021052,  0.2589757 ,  0.05634627,  0.09799216, -0.03593536,
       -0.17912693, -0.3799546 , -0.288907  , -0.16658436,  0.0912826 ],
      dtype=float32)

You can then use the model exactly as if it were a GloVe/Word2Vec/FastText model retrieved from the Internet.

In [35]:
# Same steps as Keras Embedding
vectorizer = TextVectorization(max_tokens=vocab_size, output_sequence_length=max_len, name="TextVectorization")
vectorizer.adapt(texts)



In [36]:
# Build word dict
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))
word_index

{'': 0,
 '[UNK]': 1,
 'like': 2,
 'i': 3,
 'you': 4,
 'hate': 5,
 'chocolate': 6,
 'wine': 7,
 'tea': 8,
 'beer': 9}

In [37]:
num_tokens = len(voc) + 2
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    try:
        embedding_matrix[i] = new_word_vectors[word]
        hits += 1
    except KeyError:
        # Words missing from the trained vectors keep an all-zeros row.
        # This includes the representations for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 7 words (3 misses)


In [38]:
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
    name="Embedding"
)

In [39]:
# define the model
input_ = Input(shape=(1,), dtype=tf.string, name="Input")
x = vectorize_layer(input_)
x = embedding_layer(x)
x = Flatten()(x)
hidden = Dense(hidden_size, activation="relu", name="Hidden")(x)
output_ = Dense(1, activation='sigmoid', name="Output")(hidden)
model = Model(input_, output_)
# summarize the model
model.summary()
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# fit the model
model.fit(texts, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(texts, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Input (InputLayer)          [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 10)               0         
 ectorization)                                                   
                                                                 
 Embedding (Embedding)       (None, 10, 50)            600       
                                                                 
 flatten_3 (Flatten)         (None, 500)               0         
                                                                 
 Hidden (Dense)              (None, 16)                8016      
                                                                 
 Output (Dense)              (None, 1)                 17        
                                                           

In [44]:
# Weights by layer
for i in range(len(model.layers)):
    name = model.layers[i].name
    if name=="Embedding":
        weights = model.layers[i].get_weights()[0]
        print(name, weights.shape)

Embedding (12, 50)


## Use the whole pre-trained embedding: GloVe/Word2Vec/FastText (this week)

So far, the embedding has been a matrix of size vocab_size * embedding_size
* vocab_size is the number of tokens in the training data: for example, in the previous situation this size was fixed at 5000

The goal here is to have a matrix of size pre_traine

In [45]:
# Build the vocabulary list
# Build the embedding matrix
path_to_glove_file = "/users/riveill/DS-models/glove.6B.50d.txt"

vocabulary = []
embedding_matrix = [np.zeros((embedding_dim)),
                    np.zeros((embedding_dim))] # See later : 0=PAD, 1=OOV
with open(path_to_glove_file, encoding="utf-8") as f:  # GloVe files are UTF-8 encoded
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        vocabulary += [word]
        embedding_matrix += [coefs]
embedding_matrix = np.array(embedding_matrix)

In [46]:
embedding_matrix.shape

(400002, 50)

In [61]:
vocab_size = len(embedding_matrix)
len(vocabulary), len(embedding_matrix), 

(400000, 400002)

In [62]:
# Build vectorizer layer
vectorize_layer = tf.keras.layers.TextVectorization(
        max_tokens=len(vocabulary)+2,  # +2 for the padding and OOV tokens
        output_mode="int",
        output_sequence_length=max_len,
        vocabulary=vocabulary  # Pass the vocabulary - no need to adapt the layer.
                               # The layer prepends the padding token ('') and OOV token ('[UNK]') itself.
)
vectorize_layer.get_vocabulary()[:10]

['', '[UNK]', 'the', ',', '.', 'of', 'to', 'and', 'in', 'a']

In [63]:
# Test vectorizer layer
vectorize_layer('the sandberger oov')

<tf.Tensor: shape=(10,), dtype=int64, numpy=
array([     2, 400001,      1,      0,      0,      0,      0,      0,
            0,      0])>
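
The ids returned here follow a simple offset rule: ids 0 and 1 are reserved for padding and OOV, so the word on row i of the GloVe file gets id i + 2, which is also its row in the embedding matrix built above. The sketch below shows the rule on the first four vocabulary entries printed earlier; the names `token_to_id` and `lookup` are hypothetical.

```python
# First words of the vocabulary, in file order (taken from the output above).
glove_words = ["the", ",", ".", "of"]

PAD_ID, OOV_ID, OFFSET = 0, 1, 2

# Word on file row i -> id i + 2, because ids 0 and 1 are reserved.
token_to_id = {word: i + OFFSET for i, word in enumerate(glove_words)}

def lookup(tokens):
    """Return the id of each token, falling back to OOV_ID for unknown words."""
    return [token_to_id.get(tok, OOV_ID) for tok in tokens]

print(lookup(["the", "of", "oov"]))
```

With the full file, the last of the 400000 words therefore gets id 400001, as in the tensor above.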

In [64]:
# The rest is similar to the approach with a Keras embedding

In [56]:
# Define embedding layer
embedding_layer = Embedding(
    vocab_size,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False, # False: don't fine tune the embedding matrix / True: fine tune
    name="Embedding"
)

In [57]:
# define the model
input_ = Input(shape=(1,), dtype=tf.string)
x = vectorize_layer(input_)
x = embedding_layer(x)
x = Flatten()(x)
hidden = Dense(hidden_size, activation="relu")(x)
output_ = Dense(1, activation='sigmoid')(hidden)
model = Model(input_, output_)

# summarize the model
model.summary()

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_4 (TextV  (None, 10)               0         
 ectorization)                                                   
                                                                 
 Embedding (Embedding)       (None, 10, 50)            20000100  
                                                                 
 flatten_5 (Flatten)         (None, 500)               0         
                                                                 
 dense_10 (Dense)            (None, 16)                8016      
                                                                 
 dense_11 (Dense)            (None, 1)                 17        
                                                           

In [58]:
# Weights by layer
for i in range(len(model.layers)):
    name = model.layers[i].name
    if name=="Embedding":
        weights = model.layers[i].get_weights()[0]
        print(name, weights.shape)

Embedding (400002, 50)
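
To get a feel for the cost of loading the whole pre-trained vocabulary, the arithmetic behind this shape can be spelled out. The memory estimate assumes the weights are stored as 32-bit floats, which is the Keras default; it ignores any overhead beyond the raw weights.

```python
n_words, embedding_dim = 400_000, 50

rows = n_words + 2                 # + padding row and OOV row
params = rows * embedding_dim      # one float per matrix entry
bytes_float32 = params * 4         # assumption: float32 weights (4 bytes each)

print(rows, params)
print(round(bytes_float32 / 2**20, 1), "MiB")
```

This gives 400002 rows and 20000100 parameters, the figure shown for the Embedding layer in the summary, or roughly 76 MiB of weights. With `trainable=False`, none of these parameters are updated during training.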


In [54]:
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# fit the model
model.fit(texts, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(texts, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


## Contextual word embedding (another week)

In contrast, **contextual embeddings** are generally obtained from deep pre-trained language models such as ELMo (which is LSTM-based) or the transformer-based BERT and GPT-3. The embeddings are obtained by passing the entire sentence to the pre-trained model. Note that there is still a vocabulary of words, but the vocabulary does not store the contextual embeddings: the embedding generated for each word depends on the other words in the sentence, which are referred to as its context. Transformer-based models rely on the attention mechanism, which is a way to look at the relation between a word and its neighbors. Thus a given word does not have a static embedding; its embedding is generated dynamically by the pre-trained (or fine-tuned) model.

For example, consider the two sentences:
1. I will show you a valid point of reference and talk to the point.
1. Where have you placed the point?

With embeddings from BERT, ELMo, or any similar contextual model, the two occurrences of the word 'point' in example 1 will have different embeddings. The word 'point' in example 2 will also have a different embedding from the ones in example 1.
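
The toy sketch below gives an intuition for why the vectors differ. It applies a single dot-product attention step to made-up 2-dimensional static vectors, with no learned weights at all. This is an assumption-laden simplification and not how BERT or ELMo are implemented: real models use learned projections, scaling, several heads and many layers. The `contextualize` helper and the vectors are hypothetical.

```python
import math

# Hypothetical 2-dimensional static vectors, invented for illustration.
static = {"valid": [1.0, 0.0], "point": [0.5, 0.5], "placed": [0.0, 1.0]}

def contextualize(tokens):
    """One self-attention step where queries, keys and values are the static vectors."""
    vecs = [static[tok] for tok in tokens]
    contextual = []
    for query in vecs:
        # Attention scores: dot product of the query with every token in the sentence.
        scores = [sum(a * b for a, b in zip(query, key)) for key in vecs]
        # Softmax turns the scores into weights that sum to 1.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # New vector: weighted average of all the vectors in the sentence.
        contextual.append([sum(w * v[d] for w, v in zip(weights, vecs))
                           for d in range(len(query))])
    return contextual

point_near_valid = contextualize(["valid", "point"])[1]
point_near_placed = contextualize(["placed", "point"])[1]
print(point_near_valid, point_near_placed)
```

The static vector of 'point' is the same in both calls, but its output vector is pulled towards 'valid' in the first sentence and towards 'placed' in the second, so the two results differ.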

In this course we will have a short introduction to Transformers. If you are in a hurry or just want to know how they work, [you can have a look at this great presentation](https://youtu.be/_QejEDB7GM8)