*Sources* :
[How to Develop Word Embeddings in Python with Gensim, *machinelearningmastery.com*](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/)

[Understanding LSTM Networks, *colah.github.io*](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

[CNN Long Short-Term Memory Networks, *machinelearningmastery.com*](https://machinelearningmastery.com/cnn-long-short-term-memory-networks/)

# Cleaning a text example

- Split tokens on white space.
- Remove all punctuation from words.
- Remove all words that are not purely comprised of alphabetical characters.
- Remove all words that are known stop words.
- Remove all words that have a length <= 1 character.

In [5]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/macbook/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/macbook/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [6]:
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

text[:100]

FileNotFoundError: [Errno 2] No such file or directory: 'metamorphosis_clean.txt'

In [None]:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)

print(tokens[:100])

In [None]:
# convert to lower case
tokens = [w.lower() for w in tokens]

print(tokens[:100])

In [None]:
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

print(stripped[:100])

In [None]:
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
print(words[:100])

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

In [None]:
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])

In [None]:
# stemming of words : reduces each word to its root or base
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]
print(stemmed[:100])

# Loading and cleaning reviews

- Split tokens on white space.
- Remove all punctuation from words.
- Remove all words that are not purely comprised of alphabetical characters.
- Remove all words that are known stop words.
- Remove all words that have a length <= 1 character.

In [None]:
from nltk.corpus import stopwords
import string

In [None]:
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text
 
# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

In [None]:
# load the document
filename = 'review_polarity/txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens[:10])

# Define a vocabulary

It is important to define a vocabulary of known words when using a bag-of-words or embedding model. The more words there are, the larger the representation of the documents, so it is important to restrict the vocabulary to the words believed to be predictive. We can build the vocabulary as a Counter, a dictionary that maps words to their counts and that is easy to update and query.

$\implies$ register the occurrence of each word (after cleaning = removing punctuation and numbers)
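As a small illustration of the idea above, the sketch below builds a `Counter` over two made-up token lists (the toy documents and the names `toy_docs`, `toy_vocab` and `frequent` are invented for this example, they are not part of the review dataset) and queries it.

```python
from collections import Counter

# two toy documents, already cleaned and split into tokens (made-up data)
toy_docs = [
    ['great', 'film', 'great', 'cast'],
    ['boring', 'film', 'weak', 'cast'],
]

toy_vocab = Counter()
for doc_tokens in toy_docs:
    # update() adds the occurrences of each token to the running counts
    toy_vocab.update(doc_tokens)

print(len(toy_vocab))            # number of distinct words
print(toy_vocab.most_common(2))  # the two most frequent words with their counts

# keep only the words seen at least twice
frequent = [word for word, count in toy_vocab.items() if count >= 2]
print(frequent)
```

Filtering on a minimum count is what keeps the vocabulary small: rare words are dropped before any model sees them.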

In [None]:
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

In [None]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
 
# load all docs in a directory
def process_docs(directory, vocab, is_train):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        # skip any reviews in the training set
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

In [None]:
# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('review_polarity/txt_sentoken/neg', vocab, True)
process_docs('review_polarity/txt_sentoken/pos', vocab, True)
# print the size of the vocab
print(len(vocab))

In [None]:
# print the top words in the vocab (N.B. no more stopwords)
print(vocab.most_common(50))

In [None]:
# keep tokens with a min occurrence (here 2)
min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))

In [None]:
# save list  (= tokens) to file
def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # open file
    file = open(filename, 'w')
    # write text
    file.write(data)
    # close file
    file.close()
    
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

# Train embedding layer

*Example* : take the sentence **So hungry, need food** and break it down into four arbitrary symbols: **so** represented as K45, **hungry** as J83, **need** as Q67, and **food** as P21, all of which can then be processed by the computer. Each unique word is represented by a different symbol; the downside is that there is no apparent relationship between the symbols assigned to **hungry** and **food**. This prevents the NLP model from applying what it learned about hungry to food, although the two words are semantically related. Vector Space Models (VSM) help address this issue by embedding the words in a vector space where words with similar meanings are mapped near each other. This space is called a Word Embedding.
![](images/word_embeddings.png)
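The difference between arbitrary symbols and a vector space can be checked with a few lines of plain Python. In this sketch the symbols are written as one-hot vectors, and the 2-d "embedding" values are hand-picked for the illustration (they are hypothetical, not learned); `cosine` is a helper defined here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# one symbol per word, written as one-hot vectors: every pair is orthogonal
one_hot = {
    'so':     [1, 0, 0, 0],
    'hungry': [0, 1, 0, 0],
    'need':   [0, 0, 1, 0],
    'food':   [0, 0, 0, 1],
}

# hand-picked 2-d vectors (hypothetical values, not learned)
dense = {
    'so':     [0.9, 0.1],
    'hungry': [0.2, 0.8],
    'need':   [0.6, 0.5],
    'food':   [0.3, 0.9],
}

print(cosine(one_hot['hungry'], one_hot['food']))  # symbols share nothing
print(cosine(dense['hungry'], dense['food']))      # close to 1
print(cosine(dense['hungry'], dense['so']))        # noticeably smaller
```

With one symbol per word, every pair of words is equally unrelated; in the vector space, the distance between vectors can carry the relation between **hungry** and **food**.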

The real valued vector representation for words can be learned while training the neural network. We can do this in the Keras deep learning library using the Embedding layer.

## What are word embeddings for text?

### What are word embeddings ?

A **word embedding** is a way of *representing text* where *each word* in the vocabulary is represented by a *real valued vector in a high-dimensional space*. The vectors are learned in such a way that *words that have similar meanings will have similar representation in the vector space* (close in the vector space).<br><br>
$\implies$ word embeddings = a class of techniques where individual words are represented as real-valued vectors in a predefined vector space $\to$ each word is mapped to one vector and the vector values are learned.

### Word embedding algorithms

**Word embedding methods** learn a *real-valued vector representation* for a *predefined fixed sized vocabulary* from a corpus of text.

#### Embedding Layer

REQUIRES THE DOCUMENT TO BE CLEANED

The **Embedding Layer** is a *word embedding* that is learned jointly with a *neural network model* on a specific natural language processing task.
- The size of the vector space is specified as part of the model
- The embedding layer is used on the front end of a neural network and is fit in a supervised way using the Backpropagation algorithm
- If a multilayer Perceptron model is used, then the word vectors are concatenated before being fed as input to the model
- If a recurrent neural network is used, then each word may be taken as one input in a sequence

#### Word2vec

**Word2vec** (developed by a team at Google) is one of the most popular *models* used to *create word embeddings*.

2 methods :
- *Continuous Bag-of-Words model (CBOW)* = the less popular of the two models, uses the surrounding (source) words to predict the target word<br> $\implies$ "I want to learn Python" uses "I want to learn" to predict "Python"<br><br>
- *Skip-Gram Model* = uses the target word to predict the source words, i.e. its context<br> $\implies$ "The quick brown fox jumped over the lazy dog" $\to$ breaks the sentence into pairs (context = the 2 surrounding words, target) $\implies$ ([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ... $\to$ pairs reduced to (input = word, output = one of the 2 surrounding words), e.g. (quick, the), (quick, brown)<br>
![](images/Word2Vec-Training-Models.png)
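The pairs described above can be generated with a short sketch. The two helper functions are hypothetical names written for this note; a real word2vec implementation does more (for example sub-sampling of frequent words), so this only illustrates how the training samples are formed with a window of 1 word on each side.

```python
def context_target_pairs(tokens, window=1):
    """CBOW-style samples: (list of context words, target word)."""
    samples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        samples.append((left + right, target))
    return samples

def skipgram_pairs(tokens, window=1):
    """Skip-gram samples: one (input word, context word) pair per neighbour."""
    return [(target, ctx)
            for context, target in context_target_pairs(tokens, window)
            for ctx in context]

sentence = 'the quick brown fox jumped over the lazy dog'.split()
print(context_target_pairs(sentence)[1:4])  # context -> target (CBOW)
print(skipgram_pairs(sentence)[:5])         # target -> one context word (skip-gram)
```

The same sentence therefore yields two kinds of training data: CBOW predicts the middle word from its neighbours, skip-gram predicts each neighbour from the middle word.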

## Training

In [None]:
import keras
import numpy as np
from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras import backend as K

### Create a vocabulary

In [None]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
print (vocab[:50])

In [None]:
vocab = vocab.split()
print(vocab[:10])

In [None]:
print (set(vocab[:10]))
vocab = set(vocab)

### Load the documents

We need to load all of the movie reviews of the training data $\implies$ update process_docs() $\to$ load the documents (pos, neg) + clean them + return them as a list of strings (1 document per string)

In [None]:
# turn a doc into clean tokens
def clean_doc(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        # skip any reviews in the training set
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents
 
# load all training reviews
positive_docs = process_docs('review_polarity/txt_sentoken/pos', vocab, True)
print(positive_docs[0][:100])

In [None]:
# load all training reviews
negative_docs = process_docs('review_polarity/txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs
print (negative_docs[0][:100])

### Encode documents as sequences of integers

- The Keras Embedding layer requires integer inputs $\implies$ use the **Tokenizer class**
- The Embedding requires the specification of the vocabulary size + the size of the real-valued vector space + the maximum length of input documents
- 1 token = 1 vector
- Vectors are at first random and become meaningful
- Ensure that all documents have the same length for Keras efficient computation
- Create class labels for the neural network
- Encode and pad the test dataset
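To make the steps above concrete, here is a simplified pure-Python stand-in for the encoding and padding. It assumes, like the default Keras `Tokenizer`, that indices start at 1 and that more frequent words get smaller indices, with 0 reserved for padding; the helpers `fit_word_index`, `encode` and `pad_post` are hypothetical names for this sketch, not Keras functions, and the toy documents are made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda item: -item[1])  # stable sort
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(texts, word_index):
    """Replace every known word by its integer; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

def pad_post(sequences, maxlen):
    """Append zeros to each sequence (assumed no longer than maxlen)."""
    return [seq + [0] * (maxlen - len(seq)) for seq in sequences]

toy_docs = ['good film good cast', 'bad film']
word_index = fit_word_index(toy_docs)
encoded = encode(toy_docs, word_index)
padded = pad_post(encoded, maxlen=max(len(seq) for seq in encoded))
labels = [1, 0]  # one class label per document (1 = positive, 0 = negative)

print(word_index)
print(padded)
```

After padding, every document has the same length, which is what allows the whole dataset to be stored as a single 2-d array of integers.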

In [None]:
# create the tokenizer = instance of class
tokenizer = Tokenizer()
# fit the tokenizer on the documents = develops a consistent mapping from words in the vocabulary to unique integers
tokenizer.fit_on_texts(train_docs)

$\implies$ the mapping of words to integers is prepared
    $\implies$ use it to encode the reviews in the training dataset

In [None]:
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)

We also need to ensure that all documents have the same length.

In [None]:
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post') # adds 0s at the end to reach the length max_length

Finally, we can define the class labels for the training dataset (CNN).

In [None]:
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
print (ytrain)

We can then encode and pad the test dataset, needed later to evaluate the model after we train it.

In [None]:
# load all test reviews
positive_docs = process_docs('review_polarity/txt_sentoken/pos', vocab, False)
negative_docs = process_docs('review_polarity/txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

In [None]:
print(Xtrain.shape, Xtest.shape, ytrain.shape, ytest.shape)

### Convolutional Neural Network model and Embedding Layer

**An embedding** is a *mapping of a discrete variable* to a *vector of continuous numbers*<br>
$\implies$ the embedding vectors are first initialized randomly and then updated by the network optimizer, like the weights of any other layer

“deep learning is very deep” $\to$ 1 2 3 4 1<br>
The embedding matrix is created next: we decide how many ‘latent factors’ are assigned to each index, i.e. how long each vector should be (often 32 or 50).<br>
Instead of ending up with huge one-hot encoded vectors, we can use an embedding matrix to keep the size of each vector much smaller.

![](images/embeddings.png)

The Embedding requires the specification of the vocabulary size (= total number of words + 1 (for unknown words)) + the size of the real-valued vector space (here a 100-d) + the maximum length of input documents.

#### A simple example

Our training set consists only of two phrases:

Hope to see you soon<br>
Nice to see you again

We encode it :

Hope to see you soon $\to$ [0, 1, 2, 3, 4]<br>
Nice to see you again $\to$ [5, 1, 2, 3, 6]

We want to train a network whose first layer is an embedding layer. In this case, we should initialize it as follows :

In [None]:
Embedding(7, 2, input_length=5)

- The first argument (7) is the number of distinct words in the training set
- The second argument (2) indicates the size of the embedding vectors
- The input_length argument determines the size of each input sequence

Once the network has been trained, we can get the weights of the embedding layer, which in this case will be of size (7, 2). They form the table used to map integers to embedding vectors.

According to these embeddings, the sentence "Nice to see you again" will be represented as :

In [None]:
[[0.7, 1.7], [0.1, 4.2], [1.0, 3.1], [0.3, 2.1], [4.1, 2.0]]
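The lookup itself is just row selection in a table. The sketch below rebuilds a (7, 2) table that is consistent with the vectors listed above; the rows for indices 0 and 4 do not appear in the example, so their values are made-up placeholders, and `embed` is a hypothetical helper rather than a Keras function.

```python
# lookup table with one 2-d row per word index (7 words -> shape (7, 2));
# rows 0 and 4 are made-up placeholders, the others reuse the values above
embedding_table = [
    [0.5, 0.5],  # 0: hope  (hypothetical)
    [0.1, 4.2],  # 1: to
    [1.0, 3.1],  # 2: see
    [0.3, 2.1],  # 3: you
    [2.0, 2.0],  # 4: soon  (hypothetical)
    [0.7, 1.7],  # 5: nice
    [4.1, 2.0],  # 6: again
]

def embed(sequence, table):
    """Replace every integer of the sequence by the matching row of the table."""
    return [table[index] for index in sequence]

nice_to_see_you_again = [5, 1, 2, 3, 6]
print(embed(nice_to_see_you_again, embedding_table))
```

An input of length 5 therefore becomes a 5 x 2 matrix, one row per word, which is what the layers after the embedding receive.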

#### Embedding Layer and CNN

In [None]:
# define vocabulary size (largest integer value) N.B. : all the words of the texts will be classified in two categories : the vocabulary and the unknown words
vocab_size = len(tokenizer.word_index) + 1

In [None]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))

The **Convolutional Neural Network** = a Deep Learning algorithm which takes an input image and *assigns importance* (weights + biases) to various *aspects of the image*, so that it can differentiate one from another.<br><br>
$\to$ The CNN is an **alternation of convolutions and poolings** $\Longleftrightarrow$ 1 layer = 1 convolution layer + 1 pooling layer.<br><br>
A ConvNet captures the spatial and temporal dependencies of an image through its filters; its role is to reduce the image to a form that is easier to process, without losing the features that are critical for a good prediction.

![](images/cnn_architecture.jpeg)

**A convolution :** here with a stride length of 1; the weights of the kernel are fitted during training.

![](images/kernel.gif)
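For text, the notebook uses a 1-d convolution, which is easy to write by hand. This is a minimal sketch with a hypothetical fixed kernel (in a network the kernel weights are learned); like most deep learning libraries, it slides the kernel without flipping it, and `conv1d` is a helper written for this note.

```python
def conv1d(signal, kernel, stride=1):
    """Slide the kernel over the signal and return one weighted sum per position."""
    k = len(kernel)
    return [sum(signal[start + j] * kernel[j] for j in range(k))
            for start in range(0, len(signal) - k + 1, stride)]

signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]   # hypothetical fixed weights; a network would learn them

print(conv1d(signal, kernel))            # stride 1: one output per position
print(conv1d(signal, kernel, stride=2))  # stride 2: every other position
```

Without padding, an input of length n and a kernel of length k give n - k + 1 outputs at stride 1, which is why the borders contribute to fewer outputs than the middle.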

**Padding :** applied together with the convolution. The pixels in the corners are covered by fewer kernel positions than those in the middle = unequal weighting $\implies$ loss of information at the borders $\implies$ we add extra pixels around the boundary of the data.

![](images/padding.gif)

**A pooling layer :** we keep only the maximum value (max pooling) or the average (average pooling) inside each box; max pooling is the usual choice and also acts as a noise suppressant.<br> $\implies$ reduces the spatial size of the convolved feature

![](images/pooling_layers.png)
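Both kinds of pooling fit in a few lines for the 1-d case. The sketch assumes non-overlapping windows and drops a trailing value that does not fill a whole window; the helper names and the feature map values are made up for the illustration.

```python
def max_pool1d(values, pool_size=2):
    """Keep the largest value of each non-overlapping window."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]

def avg_pool1d(values, pool_size=2):
    """Keep the mean of each non-overlapping window."""
    return [sum(values[i:i + pool_size]) / pool_size
            for i in range(0, len(values) - pool_size + 1, pool_size)]

feature_map = [1, 3, 2, 8, 5, 4, 7]  # made-up output of a convolution

print(max_pool1d(feature_map))  # the last value (7) has no complete window
print(avg_pool1d(feature_map))
```

With a pool size of 2 the feature map is roughly halved, which is the reduction in size mentioned above.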

**Flattening :** converts the final output of the convolution and pooling layer(s) into a 1-dimensional array (column vector).

**Fully-connected layer or Dense layer :** learns non-linear combinations of the high-level features. The network always has an input layer and an output layer; the layers in between are called the hidden layers.

![](images/flattening_fully_connected_layers.png)

- **Unit** = Neuron consisting of :
    - $a_j(t)$ : the activation (active or inactive) = the neuron's state.<br>

    - $\theta_j$ : a threshold, which is fixed unless changed by a learning function; the neuron $j$ is activated if the input exceeds the threshold.<br>

    - $a_j(t+1) = f(a_j(t),p_j(t),\theta_j)$ : a (predefined) activation function which activates or deactivates the neuron.<br>

    - $o_j(t) = f_{out}(a_j(t))$ : the output function, applied to the activation.


- A neuron $j$ receives an input $p_j$ from its predecessor neurons $i$ : $p_j(t) = \sum_i o_i(t)w_{ij}+ w_{0j}$ where $w_{0j}$ is a bias, which can be added so that the threshold $\theta_j$ is learned like the other weights.
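The input formula and the role of the threshold can be tried out on a single neuron. In this sketch the weights, bias and inputs are hypothetical numbers, `step` is a threshold activation and `sigmoid` a smooth alternative; all three helpers are written for this note.

```python
import math

def propagate(outputs, weights, bias):
    """Net input p_j: weighted sum of the predecessors' outputs plus the bias w_0j."""
    return sum(o * w for o, w in zip(outputs, weights)) + bias

def step(p, theta):
    """Threshold activation: the neuron fires only if its input exceeds theta."""
    return 1 if p > theta else 0

def sigmoid(p):
    """Smooth activation with values between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-p))

predecessor_outputs = [1.0, 0.5]   # o_i(t)
weights = [0.4, -0.2]              # w_ij (hypothetical)
bias = 0.1                         # w_0j (hypothetical)

p_j = propagate(predecessor_outputs, weights, bias)
print(p_j)
print(step(p_j, theta=0.0), step(p_j, theta=0.5))
print(sigmoid(p_j))
```

Raising the bias has the same effect as lowering the threshold, which is why learning $w_{0j}$ is enough and $\theta_j$ can stay fixed.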


- **Hyper-parameters** : set manually (number of filters, of layers, ...).


- **Cost function** : $C$. Considering the whole network as a function $f$, training looks for the optimal solution $f^*$ (for a given task), i.e. $C(f^*) \leq C(f) \ \forall f \in F$, which means the network modifies its parameters (weights and biases) until it reaches the optimum.

In [None]:
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

### Fitting the network on the training data

We use a **binary cross entropy loss function** because the problem we are learning is a binary classification problem.<br><br> $\implies$ Adam implementation of stochastic gradient descent (= designed specifically for training deep neural networks)
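As a sanity check of what this loss measures, the sketch below computes the mean binary cross-entropy by hand for made-up labels and predicted probabilities. The function name and the clipping constant `eps` are choices made for this example, not the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
good_predictions = [0.9, 0.1, 0.8, 0.2]  # confident and correct
bad_predictions = [0.1, 0.9, 0.2, 0.8]   # confident and wrong

print(binary_cross_entropy(labels, good_predictions))
print(binary_cross_entropy(labels, bad_predictions))
```

The loss is small when the predicted probability is close to the label and grows quickly for confident mistakes, which is the signal the optimizer uses to update the weights.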

In [None]:
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=1, verbose=1)

The model is evaluated on the test dataset.

In [None]:
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))

### Train with word2vec Embedding

In [None]:
import gensim
from gensim.models import Word2Vec

The **word2vec algorithm** = approach to *learning a word embedding* from a text corpus in a *standalone way*.<br><br>
$\implies$ can produce high-quality word embeddings very efficiently (space and time complexity)
- The word2vec algorithm processes documents sentence by sentence

#### How to develop word embeddings in Python with Gensim

**Gensim** = open source Python library for natural language processing $\implies$ topic modelling for humans
- Suite of Natural Language Processing tools for topic modeling
- Tools for loading pre-trained word embeddings in a few formats and for using and querying a loaded embedding

Among the many *parameters* : 
- size (default 100; renamed `vector_size` in Gensim 4.0) : the number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word)
- window (default 5) : the maximum distance between a target word and words around the target word
- min_count (default 5) : the minimum count of words to consider when training the model; words that occur fewer times than this count are ignored
- workers (default 3) : the number of threads to use while training $\to$ set it higher on a machine with many cores (e.g. 8)
- sg (default 0 or CBOW) : the training algorithm, either CBOW (0) or skip-gram (1)

#### Develop Word2Vec Embedding

1. Prepare the documents = the same data cleaning steps from the previous section

In [None]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
 
# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

In [None]:
# turn a doc into clean tokens = clean line by line and return cleaned lines
def doc_to_clean_lines(doc, vocab):
    clean_lines = list()
    lines = doc.splitlines()
    for line in lines:
        # split into tokens by white space
        tokens = line.split()
        # remove punctuation from each token
        table = str.maketrans('', '', punctuation)
        tokens = [w.translate(table) for w in tokens]
        # filter out tokens not in vocab
        tokens = [w for w in tokens if w in vocab]
        clean_lines.append(tokens)
    return clean_lines

In [None]:
# load all docs in a directory = load and clean all of the documents in a folder and return a list of all document lines
def process_docs(directory, vocab, is_train):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        # skip any reviews in the training set
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        doc = load_doc(path)
        doc_lines = doc_to_clean_lines(doc, vocab)
        # add lines to list
        lines += doc_lines
    return lines

In [None]:
# load training data
positive_lines = process_docs('review_polarity/txt_sentoken/pos', vocab, True)
negative_lines = process_docs('review_polarity/txt_sentoken/neg', vocab, True)
sentences = negative_docs + positive_docs

2. Create the model : clean sentences  + size of the embedding vector space (here 100) + number of neighboring words to look at (here 5) + number of threads to use when fitting the model (here 8)

In [None]:
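# train word2vec model on the tokenized lines (one list of tokens per line);
# Word2Vec expects lists of tokens, whereas `sentences` holds strings
token_lines = negative_lines + positive_lines
model = Word2Vec(token_lines, size=100, window=5, workers=8, min_count=1)
print (model)
# summarize vocabulary size in model
words = list(model.wv.vocab)
print('Vocabulary size: %d' % len(words))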

In [None]:
# save model in ASCII (word2vec) format
filename = 'embedding_word2vec.txt'
model.wv.save_word2vec_format(filename, binary=False)

#### Visualize Word Embedding

In [None]:
from numpy import array
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
import pickle

Use classical projection methods to reduce the high-dimensional word vectors to two dimensions and plot them on a graph.
1. Retrieve all of the vectors from a trained model
2. Train a projection method on the vectors $\implies$ use **PCA** with **Pyzo**
3. Use matplotlib to plot the projection as a scatter plot

We need to use Pyzo to plot the figure because Jupyter failed (the kernel crashed). With this in mind, we save `text` and `X` to files so that they can be accessed from Pyzo.

In [None]:
text = [sentence.split() for sentence in sentences]
print([t[:5] for t in text[:10]])
with open("text.txt", "wb") as fp:   # save text in a file for Pyzo
    pickle.dump(text, fp)

In [None]:
# train word2vec model
model_visu = Word2Vec(text, size=100, window=5, workers=8, min_count=100)
# save model in ASCII (word2vec) format
filename = 'embedding_word2vec_visu.txt'
model_visu.wv.save_word2vec_format(filename, binary=False)

In [None]:
X = model_visu.wv[model_visu.wv.vocab]
# np.save('X.npy',X) # save with numpy as an array
#with open('X_list.pkl','wb') as f: # save with pickle as a list
    #pickle.dump(list(X), f)
with open ('X.pkl','wb') as f : # save with pickle as an array
        pickle.dump(X, f)

In [None]:
# in Pyzo
#def visualize_we (model) :
#    pca = PCA(n_components=2)
#    X = np.load('X.npy')
#    with open("text.txt", "rb") as fp:   
#        text = pickle.load(fp)
#    model_visu = Word2Vec(text, size=100, window=5, workers=8, min_count=1)
#    X = model[model.wv.vocab]
#    pca = PCA(n_components=2) # 2-dimensional PCA
#    result = pca.fit_transform(X)
#    # create a scatter plot of the projection
#    pyplot.scatter(result[:, 0], result[:, 1])
#    words = list(model.wv.vocab)
#    print (words)
#    for i, word in enumerate(words):
#        pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
#    pyplot.show()

#visualize_we(model_visu)

### Use pre-trained Embedding

In [None]:
import numpy
from numpy import asarray
from numpy import array
from numpy import zeros
from string import punctuation
from os import listdir
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers import Conv1D
from keras.layers import MaxPooling1D

Training word vectors is complicated $\implies$ use an existing pre-trained word embedding.

It is possible that the loaded embedding does not contain all of the words in our chosen vocabulary $\implies$ skip these words

In [None]:
# returns a dictionary of words mapped to their vectors in NumPy format
def load_embedding(filename):
    # load embedding into memory, skip the first line (header of the word2vec text format)
    file = open(filename,'r')
    lines = file.readlines()[1:]
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is string word, value is numpy array for vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

In [None]:
# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = zeros((vocab_size, 100))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        vector = embedding.get(word)
        # skip words missing from the embedding (their row stays at 0);
        # get() returns None for them, which is not a usable vector
        if vector is not None:
            weight_matrix[i] = vector
    return weight_matrix
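The two steps above (parsing the word2vec text format, then filling a weight matrix while skipping missing words) can be tried on a tiny in-memory file. Everything in this sketch is made up for the illustration: the 3-d vectors, the `toy_` names and the word index; plain lists stand in for NumPy arrays.

```python
import io

# tiny file in the word2vec text format: a header line "<count> <dimensions>",
# then one word followed by its vector per line (made-up values)
toy_file = io.StringIO("2 3\nfilm 0.1 0.2 0.3\ngood 0.4 0.5 0.6\n")

lines = toy_file.readlines()[1:]          # skip the header line
toy_embedding = {}
for line in lines:
    parts = line.split()
    toy_embedding[parts[0]] = [float(x) for x in parts[1:]]

# word -> integer mapping, in the form a fitted Tokenizer exposes as word_index
toy_word_index = {'good': 1, 'film': 2, 'cast': 3}   # 'cast' has no vector

dim = 3
# row 0 is never assigned to a word, hence the + 1
toy_matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    vector = toy_embedding.get(word)
    if vector is not None:                # words without a vector keep zeros
        toy_matrix[i] = vector

for row in toy_matrix:
    print(row)
```

The notebook does not show how `embedding_layer_pre_trained` is built. With the Keras 2 style API used in this notebook, such a matrix would usually be passed to the layer as `Embedding(vocab_size, 100, weights=[weight_matrix], input_length=max_length, trainable=False)`; treat that call as an assumption rather than the original cell.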

Now we can add this layer to our model.

In [None]:
# define model
model_pre_trained = Sequential()
model_pre_trained.add(embedding_layer_pre_trained)
model_pre_trained.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
model_pre_trained.add(MaxPooling1D(pool_size=2))
model_pre_trained.add(Flatten())
model_pre_trained.add(Dense(1, activation='sigmoid'))
print(model_pre_trained.summary())

# CNN LSTM

## LSTM

**Long Short Term Memory networks (LSTMs)** are a special kind of RNN, capable of *learning long-term dependencies*.<br><br>

$\to$ **RNNs** are networks with *loops* that allow information to persist; an RNN can be seen as a set of copies of the same network, where each copy passes a message to its successor.

![](images/rnn_architecture.png)

**LSTMs** have a chain-like structure with four neural network layers :

![](images/lstm_architecture.png)

![](images/lstm_architecture_annotations.png)

**Cell state** = conveyor belt which carries information that can be added or removed by *gates*; it can be seen as the memory of the network, which carries the information necessary to make good predictions.

![](images/lstm_C_line.png)

**Gates** = a sigmoid neural net layer (outputs numbers between zero and one) and a pointwise multiplication operation.

![](images/lstm_gates.png)
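A gate is small enough to write out directly. In this sketch the gate scores stand for the outputs of the gate's layer before the sigmoid; they are hypothetical numbers chosen to show the three typical cases (keep, block, halve), and `gate` is a helper written for this note.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(scores, values):
    """Scale each value by sigmoid(score): near 0 blocks it, near 1 lets it through."""
    return [sigmoid(s) * v for s, v in zip(scores, values)]

cell_state = [2.0, -1.0, 0.5]
scores = [10.0, -10.0, 0.0]  # hypothetical pre-activations of the gate layer

print(gate(scores, cell_state))
```

Because the sigmoid stays between 0 and 1, a gate can only attenuate what passes through it; it never amplifies or flips the sign of a value.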

- **Forget gate layer** : decides what information will be thrown away.<br>
Takes as inputs the previous hidden state $h_{t-1} = o_{t-1} \circ \tanh(C_{t-1})$, where $o_{t-1}$ is the output gate's activation vector and $\circ$ the element-wise product, and $x_t$, a word.

![](images/lstm_forget_gate.png)

$\implies$ outputs a number between 0 (completely get rid of it) and 1 (completely keep it) for each number of the cell state $C_{t-1}$.

- **Input gate** = a *sigmoid layer* $\sigma$ : decides which values will be updated.<br>
Then a *tanh layer* creates new candidate values $\tilde{C}_t$.

![](images/lstm_inpu_gate.png)

$\implies$ we add $i_t * \tilde{C}_t$ to the cell state.

- **Output layer** : decides what will be output. A sigmoid layer $\sigma$ chooses the parts of the cell state to output ($o_t$ gives a value that is applied to the cell state); the cell state is passed through a tanh function to get values between -1 and 1 (normalization, and a non-linear result to extract the most important features) and then multiplied by $o_t$.

![](images/lstm_ouput_layer.png)
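The three gates can be put together in one time step. This is a scalar sketch (input, hidden state and cell state are single numbers) that follows the usual LSTM update: forget part of the old cell state, add the gated candidate, then output a gated tanh of the new cell state. The weights in `params` are hypothetical, and `lstm_step` is a helper written for this note, not a library function.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for scalar input, hidden state and cell state."""
    def layer(name):
        # each layer sees the previous hidden state and the current input
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(layer('forget'))            # what to keep from c_prev
    i_t = sigmoid(layer('input'))             # which values to update
    c_tilde = math.tanh(layer('candidate'))   # new candidate value
    c_t = f_t * c_prev + i_t * c_tilde        # updated cell state
    o_t = sigmoid(layer('output'))            # what part to output
    h_t = o_t * math.tanh(c_t)                # new hidden state
    return h_t, c_t

# hypothetical weights (w_h, w_x, b), identical for every layer
params = {name: (0.5, 0.5, 0.0)
          for name in ('forget', 'input', 'candidate', 'output')}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:   # a made-up input sequence
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

The cell state `c` is only modified by a multiplication and an addition at each step, which is what lets information travel across many time steps.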

**LSTM** = efficient for predictions that require considering elements both close to and far from a given position
- Uses *Backpropagation Through Time* (BPTT) for updating the weights = modify the weights of a neural network in order to minimize the error of the network outputs compared to some expected output in response to corresponding inputs
- Each time step = one CNN model + sequence of LSTM models
- In the implementation, the CNN layer(s) are wrapped in a **TimeDistributed layer** = the same convolutional layers are applied at every step along the time dimension, so that the 1-d outputs of the individual steps form a 2-d sequence for the LSTM

## CNN LSTM Model

$\implies$ repeat the operation of the CNN on several images $\implies$ LSTM = build the internal state + update the weights (with BPTT) 

The **CNN Model** (*Conv2D* = interpret snapshots + *pooling layers* = consolidate or abstract the interpretation) for feature extraction $\implies$ handles just one image and turns its pixels into a matrix or vector.<br>
+<br>
The **LSTM Model** for interpreting the features across time steps.<br><br>
![](images/CNN_LSTM_dense.png)

In [None]:
lstm = Sequential()
lstm.add(TimeDistributed(cnn, ...))
lstm.add(LSTM(..))
lstm.add(Dense(...))