# Deep Learning with Python
# 6.2 - Word Embeddings
## What?
- **One-hot encoding** or one-hot hashing creates word vectors that are
    - High dimensional (> 20k tokens so 20k dimensions)
    - Sparse: Every element is 0 except the one at the index corresponding to the specific word, which is 1.
    - Binary: The value of any element in a one-hot vector is either 1 or 0.
    - Explicitly defined or hardcoded, not learned through a machine learning algorithm.
- **Word embeddings** are vectors of words that are
    - Low-dimensional: Word-embedding vectors are typically 256, 512, or 1024 dimensional, even for vocabularies of 20k words or more. They pack much more information into far fewer dimensions.
    - Floating point: Not binary. Each element can be a floating point number. Makes them amenable to tensor models.
    - Learned from data.
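
The contrast between the two representations can be sketched with NumPy. This is only an illustration: the sizes, the word index `4242`, and the random table standing in for learned weights are all assumptions chosen for the example.

```python
import numpy as np

vocab_size = 20000      # illustrative vocabulary size
embedding_dim = 256     # illustrative embedding dimensionality
word_id = 4242          # hypothetical index of some word

# One-hot: a vocab_size-long binary vector with a single 1
one_hot = np.zeros(vocab_size, dtype=np.int8)
one_hot[word_id] = 1

# Embedding: one row of a (vocab_size, embedding_dim) float table.
# Random values stand in for weights that would be learned from data.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim)).astype(np.float32)
dense_vector = embedding_table[word_id]

print(one_hot.shape, int(one_hot.sum()))        # (20000,) 1
print(dense_vector.shape, dense_vector.dtype)   # (256,) float32
```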
    
Two ways to obtain word embeddings:
- Learn word embeddings jointly with training your model.
- Use preconfigured or pretrained word embeddings that were computed using a different machine learning task than the one you are trying to solve.

### TLDR: Word Embeddings
- Word embeddings map words in the vocabulary for a specific machine learning task to a geometric space.
- The distance between two words in this vector space (for example, the L2 distance) reflects the semantic difference between the words.
- Directions in the embedding space can encode meaningful transformations that express relationships between words.
- E.g. adding a `female` vector to the `king` vector in a word-embedding space should give a vector close to `queen`.
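
A toy version of this idea, using hand-made 2D vectors rather than learned embeddings. The vectors, the axis meanings, and the helper `nearest` are all hypothetical and exist only to show the arithmetic.

```python
import numpy as np

# Hand-made 2D vectors, purely illustrative:
# axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king":  np.array([1.0, 0.0]),
    "queen": np.array([1.0, 1.0]),
    "man":   np.array([0.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
}

# The "female" direction is the offset from man to woman
female = vectors["woman"] - vectors["man"]
candidate = vectors["king"] + female


def nearest(vec, table):
    """Return the word whose vector has the smallest L2 distance to vec."""
    return min(table, key=lambda w: np.linalg.norm(table[w] - vec))


print(nearest(candidate, vectors))  # queen
```

In a real embedding space the result is only approximately equal to the target vector, so a nearest-neighbour search like this one is the usual way to read off the answer.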

## Learning Word Embeddings
### Why?
- No universal word embedding space that can be used for all languages and all possible NLP tasks.
- All languages are different and are not isomorphic: relationships in one language's word-embedding space may not be transferable to that of another language.
- In fact, even within the same language, word embedding spaces differ from one application to another.
- So it's a good idea to learn a word-embedding space from scratch when working on a new ML problem.

In [17]:
from tensorflow.keras.layers import Embedding

The Embedding layer is best understood as a dictionary that maps integer indices (which represent specific words) to dense vectors. It takes integers (indices) as inputs, looks up these integers in an internal dictionary, and returns the associated vectors as output. 

In this case, the Embedding layer accepts integer word indices drawn from a vocabulary of 1000 tokens (indices 0 to 999) and returns, for each index, a 64-dimensional word-embedding vector. 

In [18]:
# Arg 1: Number of possible tokens - 1000 = 1 + maximum word index
# Arg 2: Dimensionality of the embeddings - 64
embedding_layer = Embedding(1000, 64)

In [19]:
embedding_layer.input_dim

1000

In [20]:
embedding_layer.output_dim

64

The Embedding layer takes as input a 2D tensor of integers of shape `(samples, sequence_length)`. Here, each entry represents a single sample and each sample is a sequence of integers. All sequences in a batch must have the same length, so shorter sequences may have to be zero-padded. 

The output is a 3D `float32` tensor of shape `(samples, sequence_length, embedding_dimensionality)`, where each sample consists of a sequence of words and each word is represented as an `N`-dimensional vector, with `N` being the dimensionality of the word-embedding vector space.
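
The shape change can be reproduced with plain NumPy indexing. This is a sketch of the lookup idea only, not the Keras implementation; the random `weights` matrix and the example batch are assumptions.

```python
import numpy as np

vocab_size, embedding_dim = 1000, 64

# Stand-in for the layer's weight matrix (random instead of learned)
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, embedding_dim)).astype(np.float32)

# A batch of 2 samples, each a zero-padded sequence of 5 word indices
batch = np.array([[0, 0, 12, 7, 999],
                  [3, 3, 41, 8, 15]])

# Integer-array indexing looks up one row of `weights` per index
embedded = weights[batch]

print(batch.shape)     # (2, 5)
print(embedded.shape)  # (2, 5, 64)
```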

## Learning Embedding for `imdb`

1. Prepare data by tokenizing it to a sequence of integers.
2. Restrict the movie reviews to the top 10k most common words.
3. Keep only 20 words of each review (with its default `truncating='pre'`, `pad_sequences` keeps the last 20).
4. Feed each sequence of integers to the Embedding layer which will convert each word (integer) to an 8-dimensional vector. 
5. Flatten the tensor to 2D.
6. Train a single `Dense` layer on top for classification.

In [21]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras import preprocessing

In [22]:
# Number of words to consider as features
max_features = 10000

# Cuts off the text after this number of words
maxlen = 20

In [23]:
# for loading IMDB dataset from a pickle file
import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
# Data is loaded as lists of integers
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# restore np.load for future normal usage
np.load = np_load_old

### Preprocessing
Turn the lists of training and test integers into 2D integer tensors of shape `(samples, maxlen)`. This means any review with fewer than 20 words will have to be zero-padded. 
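
A minimal NumPy sketch of the padding step, assuming the `pad_sequences` defaults `padding='pre'` and `truncating='pre'`: zeros are added on the left, and sequences that are too long keep their last `maxlen` items. The helper `pad_pre` is hypothetical and is not part of Keras.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep the last `maxlen` items of long sequences."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]              # truncate from the front
        if kept:
            out[row, -len(kept):] = kept  # pad on the left
    return out


print(pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4))
# [[0 0 1 2]
#  [4 5 6 7]]
```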

In [24]:
# Turns list of training integers into 2D integer tensor
# shape (samples, maxlen) - 0 padding for reviews that are less than 20 words
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)

In [25]:
# Do the same for the test samples
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

## Using `Embedding` Layer and Classifier on IMDB Data

In [26]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Embedding

In [27]:
model = Sequential()

The Embedding layer accepts inputs drawn from a vocabulary of up to 10k different words and encodes each word as an 8-dimensional word-embedding vector. The `input_length` argument specifies the length of each input sequence (here `maxlen`), which must be known so that the embedded output can be flattened later.

In [28]:
# Specifies maximum input length to Embedding layer
# So embedded inputs can later be flattened
model.add(Embedding(10000, 8, input_length=maxlen))

In [29]:
# Flatten the 3D tensor of embeddings into a 2D tensor of shape
# (samples, maxlen * 8)
model.add(Flatten())

In [30]:
# Add a classifier on top 
# But this will treat each word in the sample separately
# Without considering inter-word relationships and structure
model.add(Dense(1, activation='sigmoid'))

In [31]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

In [32]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_2 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
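
The parameter counts in this summary can be checked by hand. The arithmetic below uses only the values already set above (10,000 words, 8 dimensions, 20 words per review).

```python
vocab_size, embedding_dim, maxlen = 10000, 8, 20

embedding_params = vocab_size * embedding_dim   # one 8-d vector per word
flatten_units = maxlen * embedding_dim          # 20 words x 8 dims
dense_params = flatten_units * 1 + 1            # weights plus one bias

print(embedding_params)                 # 80000
print(flatten_units)                    # 160
print(dense_params)                     # 161
print(embedding_params + dense_params)  # 80161
```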


In [33]:
history = model.fit(x_train, y_train, 
                   epochs=10, batch_size=32, 
                   validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


We get a validation accuracy of about 76%, which is good considering that we were only using 20 words from each review for learning word embeddings. 

Using a densely connected classifier without an RNN or Conv1D layer is not the optimal way of learning word embeddings, because this architecture treats each word in the input sequence separately, without considering inter-word relationships and sentence structure.

So this network would be unable to differentiate between "This movie is a bomb" and "This movie is the bomb".

## Pretrained Word Embeddings

To integrate pretrained word embeddings with our model, we must start from the original IMDB text data. The IMDB data accessible through built-in `keras` functionality is already encoded as integers using its own word index. Here we want to tokenize the raw text ourselves, so that the encodings are assigned from scratch and each index can be matched to a word in the pretrained embeddings.

In [52]:
import os

In [53]:
# Path to original IMDB dataset on Drive
imdb_dir = '/Users/saads/OneDrive/Desktop/DL-Python-Repo/FYP-DL/Dl-Python-Book/chapter-6/aclImdb'

In [54]:
# Path to training directory - renamed to test in the new download
train_dir = os.path.join(imdb_dir, 'test')

In [64]:
# Empty lists to store class and extracted text from all reviews
labels = []
texts = []

In [68]:
pos_files = os.listdir(os.path.join(train_dir, 'pos'))
neg_files = os.listdir(os.path.join(train_dir, 'neg'))

In [75]:
pos_files[0][-4:]

'.txt'

In [84]:
# Parse all .txt files in the `neg` and `pos` subdirectories of 
# the training set. Read all reviews and add their text and labels to the 
# appropriate lists. 
for label_type in ['neg', 'pos']:
    # Parse the negative and positive class directories in turn
    dir_name = os.path.join(train_dir, label_type)
    
    # For every file in the subdirectory
    for fname in os.listdir(dir_name):
        # Only read text files
        if fname.endswith('.txt'):
            # Read the review and append it to the texts list.
            # UTF-8 must be specified, otherwise Python has issues reading from file.
            # The `with` block closes the file automatically.
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            
            # For each review added to the texts list, also add
            # 1 or 0 to the labels category on the corresponding index
            if label_type == 'neg':
                labels.append(0)  # negative class
            else:
                labels.append(1)  # positive class

### Tokenizing Data

The data is now ready to be tokenized i.e. converted from string to a series of integer indices, each of which represents a single word in the vocabulary.
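
A simplified pure-Python sketch of what tokenizing means here: build a word index ordered by frequency, then replace each word with its index. It only approximates the Keras `Tokenizer` (which also strips punctuation), and the names `texts_demo`, `word_index_demo`, and `sequences_demo` are made up for the illustration.

```python
from collections import Counter

texts_demo = ["The movie was great", "the movie was bad", "the cast"]

# Count word frequencies across all texts (lowercased, split on whitespace)
counts = Counter(word for text in texts_demo for word in text.lower().split())

# The most frequent word gets index 1; index 0 is left free for padding
word_index_demo = {
    word: i for i, (word, _) in enumerate(counts.most_common(), start=1)
}

# Replace each word with its integer index
sequences_demo = [
    [word_index_demo[word] for word in text.lower().split()]
    for text in texts_demo
]

print(word_index_demo["the"])  # 1
print(sequences_demo)
```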

Pre-trained word embeddings are most useful when we have too little training data to learn good embeddings of our own. To simulate this condition, we will limit ourselves to using only 200 samples from the training data.

In [85]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

In [86]:
# Cuts off reviews after 100 words
maxlen = 100

# Trains on only 200 samples - simulating limited training data 
training_samples = 200

# Validate on 10k samples
validation_samples = 10000

# Considers only the top 10k words in the dataset
max_words = 10000

In [87]:
# Instantiate a keras tokenizer for 10k words
tokenizer = Tokenizer(num_words=max_words)

# Build the word index from all words in the loaded texts
tokenizer.fit_on_texts(texts)

# Convert the samples to sequences of tokens
sequences = tokenizer.texts_to_sequences(texts)

In [88]:
# Dictionary that maps each word in the vocabulary to an index
word_index = tokenizer.word_index

In [89]:
print('Found %s unique tokens.' % len(word_index))

Found 72745 unique tokens.


In [90]:
data = pad_sequences(sequences, maxlen=maxlen)

In [91]:
labels = np.asarray(labels)

In [92]:
print("Shape of data tensor: ", data.shape)

Shape of data tensor:  (22455, 100)


In [93]:
print("Shape of label tensor: ", labels.shape)

Shape of label tensor:  (22455,)


In [94]:
# Split the data into a training set and validation set
# But first shuffle the data b/c we're starting with data
# in which samples are ordered (all negative first, then all positive)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
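
Indexing both arrays with the same permutation is what keeps each review paired with its label. A small sketch with made-up arrays (`data_demo` and `labels_demo` are illustrative, not the IMDB data):

```python
import numpy as np

data_demo = np.array([[10, 11], [20, 21], [30, 31], [40, 41]])
labels_demo = np.array([0, 0, 1, 1])   # ordered: negatives first

rng = np.random.default_rng(0)
perm = rng.permutation(len(data_demo))

# The same permutation is applied to rows and labels
data_shuffled = data_demo[perm]
labels_shuffled = labels_demo[perm]

# Rows starting with 10 or 20 were negatives, 30 or 40 positives
still_paired = all(
    int(label) == int(row[0] >= 30)
    for row, label in zip(data_shuffled, labels_shuffled)
)
print(still_paired)  # True
```

Shuffling `data` and `labels` with two independent calls would break this pairing.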

In [95]:
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

In [96]:
print(x_train.shape)

(200, 100)


## GloVe Word Embeddings

In [105]:
glove_dir = '/Users/saads/OneDrive/Desktop/DL-Python-Repo/FYP-DL/Dl-Python-Book/chapter-6/glove.6B'

In [None]:
embeddings_index = {}

In [None]:
# Specify UTF-8 explicitly, as when reading the reviews
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8')

In [None]:
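# Each line holds a word followed by the coefficients of its vector
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs

# Close the file once all vectors have been read
f.close()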
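
To make the parsing step concrete, here is the same logic applied to a single made-up line. The word and the four coefficients are invented for the example; real lines in `glove.6B.100d.txt` carry 100 coefficients.

```python
import numpy as np

# A made-up line in the "word v1 v2 ... vn" layout (4 dimensions here)
line = "movie 0.12 -0.50 0.33 0.08\n"

values = line.split()                             # split on whitespace
word = values[0]                                  # first field is the word
coefs = np.asarray(values[1:], dtype="float32")   # the rest is the vector

print(word)         # movie
print(coefs.shape)  # (4,)
```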

In [None]:
print('Found %s word vectors.' % len(embeddings_index))

In [None]:
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # Words not found in the embedding index will be all zeros
            embedding_matrix[i] = embedding_vector
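
The same construction on a toy vocabulary shows which rows end up filled. Everything here (the four-word index, the two "pretrained" vectors, the `_demo` names) is invented for the illustration.

```python
import numpy as np

max_words_demo, dim_demo = 4, 3
word_index_demo = {"the": 1, "movie": 2, "zzyzx": 3, "rare": 4}
embeddings_demo = {
    "the":   np.array([0.1, 0.2, 0.3], dtype="float32"),
    "movie": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

matrix = np.zeros((max_words_demo, dim_demo))
for word, i in word_index_demo.items():
    if i < max_words_demo:                  # skip indices beyond the vocabulary cap
        vector = embeddings_demo.get(word)
        if vector is not None:              # words without a pretrained vector stay zero
            matrix[i] = vector

print(matrix)
```

Row 0 stays zero because index 0 is reserved for padding, row 3 stays zero because `zzyzx` has no pretrained vector, and `rare` is dropped because its index is not below `max_words_demo`.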

### GloVe Model

In [106]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

In [None]:
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

## Loading the GloVe Embeddings in the Model

The weight matrix of the Embedding layer is a 2D float matrix of shape `(max_words, embedding_dim)`, where row `i` is the word vector associated with index `i`. 

In [None]:
model.layers[0].set_weights([embedding_matrix])

# Freeze the Embedding layer so its weights are not updated during training
# We want to reuse the pretrained features in the word-embedding as they are
model.layers[0].trainable = False

## Train the Model

In [None]:
# Training the model
model.compile(optimizer='rmsprop', 
             loss='binary_crossentropy', 
             metrics=['acc'])

In [None]:
history = model.fit(x_train, y_train, 
                   epochs=10, batch_size=32, 
                   validation_data=(x_val, y_val))

## Save the Model

In [None]:
model.save_weights('pre_trained_glove_model.h5')

## Plotting Model Performance

In [108]:
import matplotlib.pyplot as plt

In [None]:
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

In [None]:
plt.plot(epochs, acc, 'bo', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b-', label='Validation Accuracy')
plt.xlabel('Epochs'); plt.ylabel('Accuracy'); plt.grid(True);
plt.legend(); plt.title('GloVe Embeddings - Accuracy')

In [None]:
plt.plot(epochs, loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b-', label='Validation Loss')
plt.xlabel('Epochs'); plt.ylabel('Loss'); plt.grid(True);
plt.legend(); plt.title('GloVe Embeddings - Loss')

## Training w/o GloVe Embeddings

In [109]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

In [110]:
model = Sequential()

In [None]:
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

In [None]:
model.summary()

In [None]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', 
             metrics=['acc'])

In [None]:
history = model.fit(x_train, y_train, epochs=10, batch_size=32, 
                   validation_data=(x_val, y_val))

## Evaluating Model

In [112]:
test_dir = os.path.join(imdb_dir, 'test')
labels = []
texts = []

In [None]:
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname[-4:] == '.txt':
            # Specify UTF-8 explicitly, as when reading the training reviews
            f = open(os.path.join(dir_name, fname), encoding='utf-8')
            texts.append(f.read())
            f.close()

            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

In [None]:
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)

In [None]:
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)