# **Introduction**

This is the last part of our journey to predict the rating for a given reviews. Using classification models on tfidf matrices doesn't allow us to explore the semantic relativities between words, which impacts the performance of the classification. We will in this section use embedded words instead of tfidf.

## **Data preparation for sequence models**
The main drawback of bag of words models is that they can't capture the order of the words in the setence. By using another representation of the document (this time, not by a bag of words for each document) but by a sequence of indices where each index points to a word in the vocabulary. 
For instant, the bag of words model can't not distinguish between 2 phrases :

"This is bad, I do not buy it again"

"This is not bad, I do buy it again"

But sequence models can. In fact, by reprepsenting each document as a sequence of indices, we can capture the proximity between different words and there for can capture better the sentiment. 
The function below will take each feature ('Text_prep' or 'Summary_prep') and create a sequence of indices with max length of 300 words. We will again in this vectorization process, select the k most relevant words in the corpus.


# Load packages

In [28]:
import pandas as pd
import numpy as np
import os  # Get directory

# Print every output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Retrieve current directory
repo = os.getcwd()

# Import data

In [29]:
# Load data
x_train = pd.read_csv(repo+'/train.csv')
y_train = pd.read_csv(repo+'/train_labels.csv')

x_val = pd.read_csv(repo+'/valid.csv')
y_val = pd.read_csv(repo+'/valid_labels.csv')  

x_test = pd.read_csv(repo+'/test.csv')
y_test = pd.read_csv(repo+'/test_labels.csv')  


# Check data types
x_train.head(3)
print('\n Shape of train data', x_train.shape)
print('\n Shape of validation data', x_val.shape)
print('\n Shape of test data', x_test.shape)

Unnamed: 0,Summary,Text,Summary_prep,Text_prep
0,"Not Impressive, but it is what it is",I agree that the shipping is too high. A litt...,impressive,agree ship high little fishy taste prefer larg...
1,Itchy scratchy golden retriever no more,This food is the best. Our golden retriever w...,itchy scratchy golden retriever,food best golden retriever prone sort allergie...
2,Perfect Every Cup,Since I must have my morning cappuccino it is ...,perfect every cup,since must morning cappuccino important coffee...



 Shape of train data (36498, 4)

 Shape of validation data (9125, 4)

 Shape of test data (11406, 4)


Since we will use CNN model, we don't need to split data in to validation set. Therefore we can merge some dataframes together

In [30]:
x_train = pd.concat([x_train, x_val], ignore_index=True)
y_train = pd.concat([y_train, y_val], ignore_index=True)
print('\n Shape of train data', x_train.shape)

del x_val, y_val


 Shape of train data (45623, 4)


In [4]:
from tensorflow.python.keras.preprocessing import sequence
from tensorflow.python.keras.preprocessing import text

In [5]:
def sequence_vect(feature_to_be_sequenced, x_train, x_test, 
                top_frequent_words=20000, max_sequence_length=300):

    """ This function transform the text into vector of indexes 
    IN : 
    feature_to_be_sequence : name of the col of df 
    x_train : dataframe of texts
    x_test : dataframe of texts
    top_frequent_words :
    max_sequence_length :
    
    OUT : 
    """
   
    # Extract the top_frequent_words in corpus
    tokenizer = text.Tokenizer(num_words=top_frequent_words, # Number of wds to keep
                               char_level=False, lower=False)
    
    # Fit on documents (create the mapping of words to integers)
    tokenizer.fit_on_texts(x_train[feature_to_be_sequenced])

    # Encode the reviews by the mapping of words to integers
    x_train = tokenizer.texts_to_sequences(x_train[feature_to_be_sequenced])
    x_test = tokenizer.texts_to_sequences(x_test[feature_to_be_sequenced])


    # Pad length for shorter sequences in text or truncate for longer sequence in text
    x_train = sequence.pad_sequences(x_train, maxlen=max_sequence_length)
    x_test = sequence.pad_sequences(x_test, maxlen=max_sequence_length)
    return(x_train, x_test, tokenizer.word_index)

Usually for sequence models, there is an embedding layer in which the network "learn" how each sequence of indices (document) should be represented by a vector in a dense word vector space. However, it is usually faster to use already embedded vectors. We will check the GloVe embedded vectors that were trained from the Wikipedia corpus

* Download if not yet
* !wget http://nlp.stanford.edu/data/glove.6B.zip
* !unzip -q glove.6B.zip


Below is the function to import the 300 length Glove Vectors. To map them a chosen feature.


In [9]:
def map_GloVe_feature(feature, x_train, x_test, top_frequent_words):
    """ This function will import the dictionary which maps each word in our dictionary to their numpy vect
    representation 
    """
    path_glove = repo + '/glove.6B/glove.6B.300d.txt'
    embedding_dim = 300
    hits = 0
    misses = 0

    # Retrieve the dictionary of the given feature
    x_train_emb, x_test_emb, dictionary_voc = sequence_vect(
                                            feature_to_be_sequenced=feature,
                                            max_sequence_length=300,
                                            x_train=x_train, 
                                            x_test=x_test, 
                                            top_frequent_words=top_frequent_words)
    
    # Create a dictionary to 
    embeddings_index = {}
    with open(path_glove, encoding='utf8') as f:
        first_line = f.readline()
        for line in f:
            
            # The first item is the word, the second item is 300D numpy vector
            word, coefs = line.split(maxsplit=1)
            coefs = np.fromstring(coefs, "f", sep=" ")
            embeddings_index[word] = coefs

    print("Found %s word vectors." % len(embeddings_index))

    vocabulary_size = len(dictionary_voc) + 1 # Number of words in vocabulary and 1 for unknown words
    
    # Prepare embedding matrix
    embedding_matrix = np.zeros((vocabulary_size, embedding_dim))
    for word, i in dictionary_voc.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            # This includes the representation for "padding" and "OOV"
            embedding_matrix[i] = embedding_vector
            hits += 1
        else:
            misses += 1
    print("Converted %d words (%d misses)" % (hits, misses))
    print("Ratio of words converted", hits/(hits+misses))

    return(x_train_emb, x_test_emb, embedding_matrix, vocabulary_size)


After different attempts with different number of most common words to match the vocabulary in the data with GloVe vector, we retain 20k most common words.

In [10]:
print('For feature Summary_prep, using', 20000, 'most frequent words')
embedded_summary, voc_size_summary = map_GloVe_feature('Summary_prep', x_train, x_test, top_frequent_words=20000)

print('For feature Text_prep, using', 20000, 'most frequent words')
embedded_text, voc_size_text = map_GloVe_feature('Text_prep', x_train, x_test, top_frequent_words=20000)


For feature Summary_prep, using 20000 most frequent words
Found 400000 word vectors.
Converted 7162 words (1979 misses)
Ratio of words converted 0.7835028990263647
For feature Text_prep, using 20000 most frequent words
Found 400000 word vectors.
Converted 21932 words (13842 misses)
Ratio of words converted 0.6130709453793257


About 80% of our vocabulary is converted (for the feature Summary_prep), we can think of using the embedded vectors from GloVe. For the sake of curiosity, we will later use the same model but with a trainable Embedding layer.

Next, we also remarked that we have more convertible words for the feature Summary_prep than the feature Text_prep, that's why we will use the feature Summary_prep in the following section.

We will use the feature 

In [11]:
from tensorflow.keras.layers import Embedding, SeparableConv1D, Dense, Dropout, Input, MaxPooling1D, GlobalAveragePooling1D
from keras.models import Sequential
from keras.backend import clear_session

In [14]:
# Clear the session before training model
clear_session()

# Define the constants
vocabulary_size = voc_size_summary
max_features = 20000
embedding_dim = 300
input_shape = x_train.shape[1:][0]
embedding_matrix = embedded_summary

# Modifiable hyper-params
drop_out_rate=0.5 
filters=64 
kernel_size=5
pool_size=3

In [37]:
# Dummify our labels so that we have 1 hot encoded matrix
y_train_cat = pd.get_dummies(y_train.astype('str'), drop_first=False).astype(float)
y_test_cat = pd.get_dummies(y_test.astype('str'), drop_first=False).astype(float)



See the difference between convolution and depthwise convolution [here](https://www.machinecurve.com/index.php/2019/09/24/creating-depthwise-separable-convolutions-in-keras/#a-brief-review-what-is-a-depthwise-separable-convolutional-layer>)


In [38]:
# MODEL WITH EMBEDDED GLOVE MATRIX
model = Sequential()
model.add(Embedding(vocabulary_size, output_dim=embedding_dim, 
                    input_length=input_shape, weights=[embedding_matrix], 
                    trainable=False)) 

# Drop out some units
model.add(Dropout(rate=drop_out_rate))

# Add 2 layers of depthwise convolution 
model.add(SeparableConv1D(filters=filters, kernel_size=kernel_size,
                          activation='relu', bias_initializer='random_uniform',
                          depthwise_initializer='random_uniform',
                          padding='same'))
model.add(SeparableConv1D(filters=filters, kernel_size=kernel_size, 
                          activation='relu', bias_initializer='random_uniform',
                          depthwise_initializer='random_uniform',
                          padding='same'))
# Add a Max pooling layer
model.add(MaxPooling1D(pool_size=pool_size))
# Add 2 convolution layers
model.add(SeparableConv1D(filters=filters*2, kernel_size=kernel_size,
                          activation='relu', bias_initializer='random_uniform',
                          depthwise_initializer='random_uniform',padding='same'))
model.add(SeparableConv1D(filters=filters*2, kernel_size=kernel_size,
                          activation='relu', bias_initializer='random_uniform',
                          depthwise_initializer='random_uniform',padding='same'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(rate=drop_out_rate))

# Prediction layer
model.add(Dense(5, activation="softmax"))
print(model.summary())

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer="adam", 
              metrics=['accuracy'])


# fit model
history = model.fit(x=x_train, y=y_train_cat, workers=10, epochs=20)

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 4, 300)            2742600   
_________________________________________________________________
dropout_8 (Dropout)          (None, 4, 300)            0         
_________________________________________________________________
separable_conv1d_16 (Separab (None, 4, 64)             20764     
_________________________________________________________________
separable_conv1d_17 (Separab (None, 4, 64)             4480      
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 1, 64)             0         
_________________________________________________________________
separable_conv1d_18 (Separab (None, 1, 128)            8640      
_________________________________________________________________
separable_conv1d_19 (Separab (None, 1, 128)           

UnimplementedError:  Cast string to float is not supported
	 [[node sequential_4/Cast (defined at \AppData\Local\Temp/ipykernel_33828/4153739501.py:41) ]] [Op:__inference_train_function_5235]

Function call stack:
train_function


In [None]:
# evaluate the model
_, train_acc = model.evaluate(x_train, y_train, verbose=1)
_, test_acc = model.evaluate(x_test, y_test, verbose=1)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
history.history

# plot out the loss during training
plt.figure(figsize=(10, 10))
plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
plt.show();

# plot out accuracy during training
plt.figure(figsize=(10, 10))
plt.title('Accuracy')
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.legend()
plt.show();