#### USING PRETRAINED WORD EMBEDDINGS

Sometimes you have so little training data available that you can’t use your data alone to learn an appropriate task-specific embedding of your vocabulary.
**What can you do?** Instead of learning word embeddings jointly with the task, you can load embedding vectors from a precomputed embedding space that is known to be highly structured and to capture generic aspects of language structure.

Examples:
* Word2Vec (https://code.google.com/archive/p/word2vec), developed by Tomas Mikolov at Google in 2013.
* Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), developed by Stanford researchers in 2014.

The GloVe word embeddings used here are precomputed on the 2014 English Wikipedia dataset. The download is an 822 MB zip file; the file used below contains 100-dimensional embedding vectors for 400,000 words (or non-word tokens).
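
To illustrate what "highly structured" means for such a space: related words tend to have vectors that point in similar directions, which is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (not real GloVe values), and the helper name `cosine_similarity` is only illustrative.

```python
import numpy as np

# Made-up 3-dimensional vectors for illustration only; real GloVe vectors
# used in this notebook have 100 dimensions and different values.
toy_vectors = {
    "good": np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "table": np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_related = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_unrelated = cosine_similarity(toy_vectors["good"], toy_vectors["table"])

# The related pair should score much higher than the unrelated pair.
print(f"good vs great: {sim_related:.3f}")
print(f"good vs table: {sim_unrelated:.3f}")
```

With real pretrained vectors the same comparison can be run on entries of the word-to-vector index once it has been loaded.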

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

In [1]:
import numpy as np
import tensorflow as tf
import keras
from keras import layers

# Toy training data
texts = ["positive text", "negative text", "neutral text", "positive review", "negative review"]
labels = [1, 0, 0, 1, 0]

# Text vectorization layer
max_tokens = 100 # maximum number of tokens in the vocabulary
text_vectorization = layers.TextVectorization(max_tokens=max_tokens)
text_vectorization.adapt(texts)


In [None]:
# Parse the unzipped .txt file to build an index that maps words (as strings) to their vector representation
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
  for line in f:
    # Split the line in two: the word, then the 100 space-separated coefficients
    word, coefs = line.split(maxsplit=1)
    # Convert the coefficient string to a NumPy array of float32 values
    coefs = np.fromstring(coefs, "f", sep=" ")
    # Map the word to its vector in the "embeddings_index" dictionary
    embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")
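
The parsing loop can be tried without downloading anything. The sketch below runs the same split-then-convert logic on an in-memory sample. The sample lines and their 4-dimensional values are invented for illustration, and a plain list comprehension stands in for `np.fromstring`; both should produce a float32 array.

```python
import io
import numpy as np

# Hypothetical sample in the "word v1 v2 ... vN" layout of the GloVe text
# file; the values are invented and are not real GloVe coefficients.
sample_file = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    "text -0.5 0.0 0.25 1.0\n"
)

sample_index = {}
for line in sample_file:
    # Split once: everything before the first whitespace is the token,
    # the remainder holds the space-separated coefficients.
    word, coefs = line.split(maxsplit=1)
    sample_index[word] = np.array([float(c) for c in coefs.split()], dtype="float32")

print(sample_index["text"])
print(sample_index["text"].shape)
```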

In [None]:
# Build an embedding matrix that you can load into an Embedding layer

embedding_dim = 100

# Retrieve the vocabulary indexed by our previous TextVectorization layer
vocabulary = text_vectorization.get_vocabulary()
# Use it to create a mapping from words to their index in the vocabulary
word_index = dict(zip(vocabulary, range(len(vocabulary))))

# Prepare a matrix that we'll fill with the GloVe vectors
embedding_matrix = np.zeros((max_tokens, embedding_dim))

for word, i in word_index.items():
  if i < max_tokens:
    embedding_vector = embeddings_index.get(word)
    # Fill row i with the GloVe vector for the word at index i;
    # words not found in the embedding index keep an all-zero row
    if embedding_vector is not None:
      embedding_matrix[i] = embedding_vector
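
It can be useful to check how many vocabulary entries actually received a pretrained vector. The sketch below repeats the fill loop with hit and miss counters. It uses the toy vocabulary from this notebook, while `toy_embeddings_index` and the 4-dimensional size are hypothetical stand-ins for the real GloVe index.

```python
import numpy as np

embedding_dim = 4   # toy size for illustration; the notebook uses 100
max_tokens = 100

# Vocabulary as produced by the TextVectorization layer on the toy texts.
vocabulary = ["", "[UNK]", "text", "review", "positive", "negative", "neutral"]
word_index = dict(zip(vocabulary, range(len(vocabulary))))

# Hypothetical stand-in for the GloVe index: each real word gets a vector of ones.
toy_embeddings_index = {
    w: np.ones(embedding_dim, dtype="float32")
    for w in ["text", "review", "positive", "negative", "neutral"]
}

embedding_matrix = np.zeros((max_tokens, embedding_dim))
hits, misses = 0, []
for word, i in word_index.items():
    if i >= max_tokens:
        continue  # index falls outside the matrix
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
        hits += 1
    else:
        misses.append(word)  # row i stays all zeros

print(f"Converted {hits} words ({len(misses)} misses: {misses})")
```

The padding token `''` and `[UNK]` have no entry in the toy index here, so their rows remain zero.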

In [None]:
embedding_layer = layers.Embedding(
  max_tokens,
  embedding_dim,
  # Initialize the weights with the values from embedding_matrix
  embeddings_initializer=keras.initializers.Constant(embedding_matrix),
  # Freeze the layer so training does not overwrite the pretrained vectors
  trainable=False,
)
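
Conceptually, a frozen embedding layer initialized this way behaves like a lookup table: each integer token id selects one row of the matrix. The NumPy sketch below mimics that lookup with a small arbitrary matrix. `toy_matrix` and `token_ids` are illustrative names, and this is not how Keras implements the layer internally.

```python
import numpy as np

# Toy 5 x 3 matrix standing in for embedding_matrix (values are arbitrary).
toy_matrix = np.arange(15, dtype="float32").reshape(5, 3)

# A batch of two sequences of integer token ids, in the style of a
# TextVectorization output (0 is the padding index).
token_ids = np.array([[2, 4, 0],
                      [1, 3, 0]])

# Indexing with an integer array picks one row per token id, turning a
# (batch, sequence) array into a (batch, sequence, embedding_dim) array.
embedded = toy_matrix[token_ids]

print(embedded.shape)  # (2, 3, 3)
```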

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.LSTM(32)(embedded)
#x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)


In [None]:
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

In [None]:
model.fit(text_vectorization(texts), np.array(labels), epochs=10)

In [2]:
vocabulary = text_vectorization.get_vocabulary() #Retrieve the vocabulary indexed by our previous TextVectorization layer
word_index = dict(zip(vocabulary, range(len(vocabulary))))

In [3]:
word_index

{'': 0,
 '[UNK]': 1,
 'text': 2,
 'review': 3,
 'positive': 4,
 'negative': 5,
 'neutral': 6}

In [4]:
for word, i in word_index.items():
  print(word, i)

 0
[UNK] 1
text 2
review 3
positive 4
negative 5
neutral 6


In [6]:

embedding_dim = 100
embedding_matrix = np.zeros((max_tokens, embedding_dim))
embedding_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [8]:
len(embedding_matrix[0])

100

In [9]:
embedding_matrix[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [10]:
embedding_matrix.shape

(100, 100)