In [1]:
%load_ext autoreload
%autoreload 2

# DATA MANIPULATION
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np

# DATA VISUALISATION
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# TENSORFLOW
# Note: no `tk` alias for tensorflow.keras here, since `tk` is used for the Tokenizer later on
from tensorflow.keras import Sequential, layers, regularizers

from tensorflow.keras.utils import pad_sequences

from tensorflow.keras.callbacks import EarlyStopping

from tensorflow.keras.layers import Normalization
from tensorflow.keras.layers import Dense, SimpleRNN, Flatten

from tensorflow.keras.preprocessing.text import text_to_word_sequence, Tokenizer

# VIEWING OPTIONS IN THE NOTEBOOK
from sklearn import set_config; set_config(display='diagram')


from sklearn.model_selection import train_test_split, KFold

from sklearn.preprocessing import StandardScaler

from gensim.models import Word2Vec



# NLP with CNN

### Exercise objectives:

- Use CNN instead of RNN for NLP

<hr>


# The data


❓ **Question** ❓ Let's first load the data. You don't have to understand what is going on in the function, it does not matter here.

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, because your RAM can overflow. For that reason, **you should start with 10% of the sentences** and see if your computer handles it. If it does not, rerun with a lower number.

⚠️ **DISCLAIMER** ⚠️ **No need to play _who has the biggest_ (RAM) !** The idea is to run your models quickly so that you can prototype. Even in real life, it is recommended that you start with a subset of your data to iterate and debug quickly. So increase the percentage only if you want to chase the best accuracy.

In [2]:
######################################
### Run this cell to load the data ###
######################################

import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.text import text_to_word_sequence

def load_data(percentage_of_sentences=None):
    train_data, test_data = tfds.load(name="imdb_reviews", split=["train", "test"], batch_size=-1, as_supervised=True)

    train_sentences, y_train = tfds.as_numpy(train_data)
    test_sentences, y_test = tfds.as_numpy(test_data)
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert(percentage_of_sentences> 0 and percentage_of_sentences<=100)
        
        len_train = int(percentage_of_sentences/100*len(train_sentences))
        train_sentences, y_train = train_sentences[:len_train], y_train[:len_train]
  
        len_test = int(percentage_of_sentences/100*len(test_sentences))
        test_sentences, y_test = test_sentences[:len_test], y_test[:len_test]
    
    X_train = [text_to_word_sequence(_.decode("utf-8")) for _ in train_sentences]
    X_test = [text_to_word_sequence(_.decode("utf-8")) for _ in test_sentences]
    
    return X_train, y_train, X_test, y_test

X_train, y_train, X_test, y_test = load_data(percentage_of_sentences=20)


Remember that to do NLP, you need to go through one of the following options, as shown here: 

<img src="embedding_or_RNN.png" width="700px" />

But in both cases, you can replace the recurrent layer (top part) with a CNN layer. We will go through both, starting with the one on the left, where the embedding is learned in the network.




# Part 1 : Concatenate a Keras Embedding with a Conv1D🔥 

Let's train a fancy network here. 

Each of your words is represented by a vector of size N (the size of your embedding). Therefore, as a sentence is a sequence of words, it is represented by a matrix (number of words, N). So, all your sentences are actually represented as matrices once embedded.

If you think about it, an image is also a matrix. Said differently, you may represent your sentence of words as a matrix, where each column (or row, depending on how you want to look at it) is a word, and each row (or each column) corresponds to a coordinate in the embedding space, as shown here:

<img src="image_comparison.png" width="500px" />

Well, in that case, as these are close to images, why not use convolutions on them? Yes, convolutions!
But be careful. In the case of images, convolutions are 2-dimensional, as the filters can move up and down, and left and right. In the case of our sentences, we want the kernel to move _only_ in the word-by-word direction (the alternative, moving coordinate by coordinate through the embedding space, doesn't make much sense).

So let's create a model that uses convolutions.
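To make the "word by word" movement concrete, here is a minimal NumPy sketch of a single 1D filter sliding over a sentence matrix. The function name `conv1d_over_words` and the toy shapes are assumptions chosen for illustration; this is not how Keras implements `Conv1D` internally, only the same arithmetic with no padding and a stride of 1.

```python
import numpy as np

def conv1d_over_words(sentence, kernel, bias=0.0):
    """Slide one kernel along the word axis only (no padding, stride 1).

    sentence: array of shape (n_words, embedding_dim)
    kernel:   array of shape (kernel_size, embedding_dim)
    Returns an array of shape (n_words - kernel_size + 1,).
    """
    n_words = sentence.shape[0]
    kernel_size = kernel.shape[0]
    n_steps = n_words - kernel_size + 1
    out = np.empty(n_steps)
    for start in range(n_steps):
        # The window covers kernel_size consecutive words and the full embedding width
        window = sentence[start:start + kernel_size]
        out[start] = np.sum(window * kernel) + bias
    return out

rng = np.random.default_rng(0)
sentence = rng.normal(size=(6, 4))   # toy sentence: 6 words, embedding size 4
kernel = rng.normal(size=(3, 4))     # one filter looking at 3 side-by-side words

features = conv1d_over_words(sentence, kernel)
print(sentence.shape, "->", features.shape)   # (6, 4) -> (4,)
```

The kernel always spans the whole embedding dimension, so the only free direction is the word axis, and the output has `n_words - kernel_size + 1` positions per filter.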

## First, the data

❓ **Question** ❓ You will need to prepare your data. First, tokenize them. Then, you need to pad them (use a value `maxlen` equal to 150). You also might need to compute the size of your vocabulary ;)

In [3]:
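# Fit the tokenizer on the training sentences only
tk = Tokenizer()
tk.fit_on_texts(X_train)

# Replace each word with its integer index
X_train_tk = tk.texts_to_sequences(X_train)
X_test_tk = tk.texts_to_sequences(X_test)

# Pad (or truncate) every sequence to 150 tokens; indices are integers, so use an integer dtype
X_train_pad = pad_sequences(X_train_tk, dtype='int32', padding='post', maxlen=150)
X_test_pad = pad_sequences(X_test_tk, dtype='int32', padding='post', maxlen=150)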
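If the tokenizing and padding steps feel like a black box, the following pure-Python sketch mimics roughly what they do on a toy corpus. The helper names (`fit_word_index`, `to_sequences`, `pad_post`) and the toy sentences are hypothetical; the sketch assumes the Keras defaults used above, namely that words unseen during fitting are dropped and that sequences longer than `maxlen` lose their first tokens.

```python
from collections import Counter

def fit_word_index(sentences):
    # Most frequent word gets index 1; index 0 is reserved for padding
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(sentences, word_index):
    # Words that were not seen when fitting are silently skipped
    return [[word_index[w] for w in s if w in word_index] for s in sentences]

def pad_post(sequences, maxlen):
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                              # keep the last maxlen tokens
        padded.append(seq + [0] * (maxlen - len(seq)))   # zeros go at the end
    return padded

toy_train = [["a", "great", "movie"], ["a", "bad", "bad", "movie", "indeed"]]
toy_test = [["great", "unknown", "movie"]]

word_index = fit_word_index(toy_train)
train_padded = pad_post(to_sequences(toy_train, word_index), maxlen=4)
test_padded = pad_post(to_sequences(toy_test, word_index), maxlen=4)

print(len(word_index))   # 5 distinct words, so the Embedding needs input_dim = 5 + 1
print(train_padded)      # [[1, 4, 2, 0], [3, 3, 2, 5]]
print(test_padded)       # [[4, 2, 0, 0]]
```

The `+ 1` on the vocabulary size comes from index 0, which never corresponds to a word and is only used for padding.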

## Using 1D Convolution.

❓ **Question** ❓ Define a model that has :
- an `Embedding` layer: `input_dim` is the `vocab_size + 1`, `output_dim` is the embedding space dimension, and `mask_zero` has to be set to `True`. Here, for computational reasons, set `input_length` to the maximum length of your observations (that you just defined in the previous question).
- a `Conv1D` layer 
- a `Flatten` layer
- a `Dense` layer
- an output layer

Compile the model accordingly

❗ **Remark** ❗ The size of the `Conv1D` kernel corresponds exactly to the number of side-by-side words (tokens) each kernel is taking into account ;)

In [4]:
X_train_pad.shape

(5000, 150)

In [5]:
vocab_size = len(tk.word_index)
embedding_size = 70

def my_model():
    
    model = Sequential()
    
    model.add(layers.Embedding(
        input_dim=vocab_size+1, #+1 for the 0 padding
        input_length=150,
        output_dim=embedding_size, 
        mask_zero=True, # Built-in masking layer :)
    ))


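    # kernel_size=150 equals the padded length, so each filter sees the whole
    # sentence at once and the output has a single time step
    model.add(layers.Conv1D(20, kernel_size=150, activation='tanh'))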
    model.add(layers.Flatten())
    model.add(layers.Dense(10, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    
    

    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])

    
    return model

In [6]:
vocab_size

42660

❓ **Question** ❓ Look at the number of parameters. You can compare it to the models that you had in the previous exercise (especially the first one).

In [7]:
model = my_model()
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 150, 70)           2986270   
                                                                 
 conv1d (Conv1D)             (None, 1, 20)             210020    
                                                                 
 flatten (Flatten)           (None, 20)                0         
                                                                 
 dense (Dense)               (None, 10)                210       
                                                                 
 dense_1 (Dense)             (None, 1)                 11        
                                                                 
Total params: 3,196,511
Trainable params: 3,196,511
Non-trainable params: 0
_________________________________________________________________
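The parameter counts in this summary can be checked by hand. The sketch below recomputes them from the layer sizes used above; the helper function names are made up for this illustration, while the numbers (`vocab_size = 42660`, embedding size 70, 20 filters, kernel size 150) come from the cells above.

```python
def embedding_params(vocab_size, embedding_size):
    # One vector per word index, plus one for the padding index 0
    return (vocab_size + 1) * embedding_size

def conv1d_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * in_channels weights and one bias
    return kernel_size * in_channels * filters + filters

def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair and one bias per unit
    return n_inputs * n_units + n_units

def conv1d_output_steps(n_words, kernel_size):
    # No padding, stride 1
    return n_words - kernel_size + 1

vocab_size, embedding_size, maxlen = 42660, 70, 150
filters, kernel_size = 20, 150

steps = conv1d_output_steps(maxlen, kernel_size)   # 1 time step left after the convolution

counts = {
    "embedding": embedding_params(vocab_size, embedding_size),     # 2,986,270
    "conv1d": conv1d_params(kernel_size, embedding_size, filters), # 210,020
    "dense": dense_params(steps * filters, 10),                    # 210
    "dense_1": dense_params(10, 1),                                # 11
}
total = sum(counts.values())
print(counts, total)   # total is 3,196,511
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and the embedding dimension dominate the size of this model.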


❓ **Question** ❓ Fit your model with a stopping criterion, and evaluate it on the test data.

You will probably notice that it is ... **much faster** than RNN.

In [8]:
from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(patience=5, restore_best_weights=True)
model = my_model()
model.fit(X_train_pad, y_train, 
          epochs=20, 
          batch_size=32,
          validation_split=0.3,
          callbacks=[es]
         )


res = model.evaluate(X_test_pad, y_test, verbose=0)

print(f'The accuracy evaluated on the test set is of {res[1]*100:.3f}%')

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
The accuracy evaluated on the test set is of 77.780%
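As a reminder of what the stopping criterion does, here is a small pure-Python sketch of patience-based early stopping on the validation loss. The function `early_stopping_epoch` is hypothetical and the loss values are invented for illustration (the notebook output does not show the per-epoch losses); it only mirrors the idea behind `EarlyStopping(patience=5, restore_best_weights=True)`, not the Keras source code.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch whose weights are kept)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Invented validation losses: the best value is reached at epoch 2
made_up_val_losses = [0.69, 0.55, 0.58, 0.61, 0.66, 0.72, 0.80, 0.85]

stopped_at, best_epoch = early_stopping_epoch(made_up_val_losses, patience=5)
print(stopped_at, best_epoch)   # 7 2
```

With a patience of 5, training that stops after 7 epochs means the monitored value last improved at epoch 2, and `restore_best_weights=True` rolls the model back to that epoch.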


# Part 2 : Learn a Word2Vec representation, and then feed it into a NN with a `Conv1D`🔥 

In the first part of the exercise, you were asked to jointly learn the embedding representation and the convolution within the same network, which was the CNN counterpart of the left part of this figure:

<img src="embedding_or_RNN.png" width="700px" />

Now, let's try to replace the RNN with a CNN for an architecture as on the right side.

❓ **Question** ❓ Learn a Word2Vec model or load a trained one directly from Gensim (transfer learning). Then, prepare your data as in the previous exercise. This question is quite long because it prepares you for real-world challenges, but you have already built all the building blocks in the previous exercises.

In [9]:
word2vec = Word2Vec(sentences=X_train, vector_size=130, min_count=7, window=7)


In [10]:
# Function to convert a sentence (list of words) into a matrix representing the words in the embedding space
def embed_sentence(word2vec, sentence):
    embedded_sentence = []
    for word in sentence:
        if word in word2vec.wv:
            embedded_sentence.append(word2vec.wv[word])
        
    return np.array(embedded_sentence)

# Function that converts a list of sentences into a list of matrices
def embedding(word2vec, sentences):
    embed = []
    
    for sentence in sentences:
        embedded_sentence = embed_sentence(word2vec, sentence)
        embed.append(embedded_sentence)
        
    return embed

# Embed the training and test sentences
X_train_embed = embedding(word2vec, X_train)
X_test_embed = embedding(word2vec, X_test)


# Pad the training and test embedded sentences
X_train_pad2 = pad_sequences(X_train_embed, dtype='float32', padding='post', maxlen=200)
X_test_pad2 = pad_sequences(X_test_embed, dtype='float32', padding='post', maxlen=200)
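To see what shapes come out of this cell, here is a toy NumPy version of the same pipeline. It assumes a hand-written dictionary `toy_vectors` in place of a trained Word2Vec model, and the helpers `embed_sentence_toy` and `pad_post_matrices` are hypothetical stand-ins for `embed_sentence` and `pad_sequences`.

```python
import numpy as np

# Stand-in for word2vec.wv: only two words are "known", with vectors of size 3
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.5]),
    "movie": np.array([0.2, 0.9, 0.1]),
}

def embed_sentence_toy(vectors, sentence):
    # Words without a vector are dropped, so the matrix can be shorter than the sentence
    return np.array([vectors[w] for w in sentence if w in vectors])

def pad_post_matrices(matrices, maxlen, vector_size):
    out = np.zeros((len(matrices), maxlen, vector_size), dtype="float32")
    for i, matrix in enumerate(matrices):
        matrix = matrix[-maxlen:]            # keep the last maxlen words
        if len(matrix):
            out[i, :len(matrix)] = matrix    # remaining rows stay all-zero
    return out

toy_sentences = [["good", "movie"], ["a", "good", "unseen", "movie", "good"]]
embedded = [embed_sentence_toy(toy_vectors, s) for s in toy_sentences]
padded = pad_post_matrices(embedded, maxlen=4, vector_size=3)

print([e.shape for e in embedded])   # [(2, 3), (3, 3)]
print(padded.shape)                  # (2, 4, 3)
```

Each sentence becomes a `(maxlen, vector_size)` matrix, and the padded positions are rows of zeros, which is what a `Masking(mask_value=0)` layer looks for.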

❓ **Question** ❓ Now construct a model that has a `Conv1D` layer, a flatten layer, a dense layer, and an output layer. Compile it, and fit it on the train data. You can then evaluate it on the test set.

In [13]:
def my_model2():
    
    model = Sequential()
    
    model.add(layers.Masking(mask_value=0))

    # kernel_size=150 on sequences padded to 200 leaves 200 - 150 + 1 = 51 time steps
    model.add(layers.Conv1D(20, kernel_size=150, activation='tanh'))
    model.add(layers.Flatten())
    model.add(layers.Dense(10, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    
    

    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])

    
    return model

In [14]:
model2 = my_model2()
model2.fit(X_train_pad2, y_train, 
          epochs=20, 
          batch_size=32,
          validation_split=0.3,
          callbacks=[es]
         )


res2 = model2.evaluate(X_test_pad2, y_test, verbose=0)

print(f'The accuracy evaluated on the test set is of {res2[1]*100:.3f}%')

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
The accuracy evaluated on the test set is of 61.620%


❓ **Question** ❓ You might be frustrated by the accuracy you got. This happens to us all at some point. Once you have tested your first iteration, you need to improve your models by making them more complex, changing the parameters, stacking additional layers, and so on.

Only practice and experimentation will get you there. So you can go back to your previous models, change them and try to get better results ;)

