# P0 Student assignment: Simple models with Keras

**Goal**: implement **three models** for multiclass text classification on the [Women's E-commerce clothing reviews](https://github.com/ya-stack/Women-s-Ecommerce-Clothing-Reviews) [dataset](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews): two simple feed-forward models, using a `Tokenizer` and a `TextVectorization` layer respectively, and a Convolutional Neural Network (CNN) using a `TextVectorization` layer and embeddings.

**Teams**: one person or two --> **Martín Romero Romero and Pablo Miguel Pérez Ferreiro**.

**Due date**: October 4, 2023.

### 1. Data preparation

The first step is to download the dataset (a `csv` file) from *GitHub*. Suggestions:
- You can use the utility function [`tensorflow.keras.utils.get_file()`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/get_file) to download the file. You should set an absolute path to save the file, taking into account that, in *Google Colaboratory*, you have direct access to the folder `/content/`.
- There are many ways to load a `csv` in memory. One simple way is to use `csv.reader()`.

~~~
with open(path, newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
~~~

The resulting data structure (`data`) is a Python list of lists (the reviews).

In [1]:
# if needed (assumes that pip is installed)
# !pip install tensorflow==2.13.0
# !pip install pandas
# !pip install scikit-learn

import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from keras import models
from keras import layers
from keras.preprocessing.text import Tokenizer


In [2]:
print(tf.__version__)

2.11.0


In [3]:
# ToDo:
# load the dataset
# load and preprocess

# download the file; get_file() returns the local path it was saved to
# (the target must be specified as an absolute path, so this piece is not really portable)
path = tf.keras.utils.get_file(fname='/home/martin/Escritorio/NLU/P0/reviews.csv', origin='https://raw.githubusercontent.com/ya-stack/Women-s-Ecommerce-Clothing-Reviews/master/Womens%20Clothing%20E-Commerce%20Reviews.csv')

# read the file from the returned path and drop the unnecessary columns
data = pd.read_csv(path)
data = data.drop(columns=['Unnamed: 0', 'Clothing ID', 'Age', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'])
# replace null contents with empty strings (assignment avoids chained in-place modification)
data['Title'] = data['Title'].fillna('')
data['Review Text'] = data['Review Text'].fillna('')

Once you have the rows of the `csv` file in a data structure (remember that the first one is the names of the attributes of the data set, and must be discarded) you have to preprocess the data for its use as an input to the neural networks:
1. Extract the textual data from the rows, included in the fields `Title` and `Review Text`, and join both fields if `Title` is not empty.
2. Convert the field `Rating`, whose values are integers in the interval [1, 5], into three classes: negative (ratings 1, 2), neutral (rating 3) and positive (ratings 4, 5).
3. The dataset contains about 23,000 reviews. Reserve the first 18,000 for training, and the rest for validation.
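
The three steps above can be sketched in plain Python on a handful of rows. This is only an illustration: the rows are invented, the header is reduced to three columns, and the helper names `rating_to_class` and `preprocess` are hypothetical, not part of the assignment.

```python
def rating_to_class(rating: int) -> int:
    """Map a 1-5 rating to 0 (negative), 1 (neutral) or 2 (positive)."""
    if rating <= 2:
        return 0
    if rating == 3:
        return 1
    return 2


def preprocess(rows, n_train):
    """Discard the header, join Title and Review Text, map ratings, split in order."""
    header, body = rows[0], rows[1:]
    title_i = header.index("Title")
    text_i = header.index("Review Text")
    rating_i = header.index("Rating")

    texts, labels = [], []
    for row in body:
        title, review = row[title_i], row[text_i]
        # Join both fields only when the title is not empty.
        texts.append(f"{title} {review}" if title else review)
        labels.append(rating_to_class(int(row[rating_i])))

    # The first n_train examples are for training, the rest for validation.
    return (texts[:n_train], labels[:n_train]), (texts[n_train:], labels[n_train:])


# Invented rows, shaped like the output of csv.reader() with a reduced header.
rows = [
    ["Title", "Review Text", "Rating"],
    ["Great dress", "Fits well", "5"],
    ["", "Too small for me", "2"],
    ["Okay", "Nothing special", "3"],
]
(train_texts, train_labels), (val_texts, val_labels) = preprocess(rows, n_train=2)
print(train_texts, train_labels)
print(val_texts, val_labels)
```

With the real dataset, `n_train` would be 18000 as stated in step 3.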

In [4]:
# ToDo:
# carry out the required preprocessing on Title and Review Text and remove the header
# transform the rating 1-5 to three categories: negative:0, neutral:1, positive:2
# obtain train and validation sets.

# function to convert the ratings to categories
def convert_rating(rating):
    if rating <= 2:
        return 0
    elif rating == 3:
        return 1
    else:
        return 2

# create new field with all the textual content, and drop no longer needed fields
data['text']=data['Title']+' '+data['Review Text']
data.drop(['Title', 'Review Text'], axis=1, inplace=True)
# likewise with the converted ratings
data['sentiment'] = data['Rating'].apply(convert_rating)
data.drop(['Rating'], axis=1, inplace=True)
data=data[['text', 'sentiment']]
# split the dataset into train and validation sets (train_test_split shuffles, so this is a random 18,000-row training split), then separate each into predictors and target
train_data, val_data = train_test_split(data, train_size=18000, random_state=42)
train_X, train_y = train_data.iloc[:,0], train_data.iloc[:,1]
val_X, val_y = val_data.iloc[:,0], val_data.iloc[:,1]

### 2. Perceptron with Tokenizer.

In the first model, you are going to use a `Tokenizer()` object to process the training and validation texts, transforming each review into a binary vector (of length *n*, where *n* is the size of the vocabulary) in which the positions of the words appearing in the review are coded as `1` (clue: you can use the method `texts_to_matrix()` for this). You can set a maximum size for the vocabulary (parameter `num_words`), but it is not necessary.

Remember that you have to use the `fit_on_texts()` method in order to build the vocabulary of the tokenizer from the training data.
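
To see what the binary representation looks like, here is a hand-rolled sketch of the idea in plain Python. It is not the Keras implementation: it splits on whitespace only, breaks frequency ties alphabetically, and the helper names `build_word_index` and `texts_to_binary_matrix` are made up for this example. It does follow two conventions of `Tokenizer`: index 0 is left unused, and words not seen while building the vocabulary are ignored.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency, starting at index 1 (index 0 stays unused)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}


def texts_to_binary_matrix(texts, word_index):
    """One row per text, with 1.0 at the index of every known word it contains."""
    width = len(word_index) + 1  # plus one for the unused index 0
    matrix = []
    for text in texts:
        row = [0.0] * width
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None:  # words outside the vocabulary are skipped
                row[idx] = 1.0
        matrix.append(row)
    return matrix


# The vocabulary is built from the training texts only.
train_texts = ["nice dress nice fit", "poor fit"]
word_index = build_word_index(train_texts)
train_matrix = texts_to_binary_matrix(train_texts, word_index)
val_matrix = texts_to_binary_matrix(["nice colour"], word_index)
print(word_index)
print(train_matrix)
print(val_matrix)
```

Note that the validation text is encoded with the vocabulary learned from the training texts, so the unseen word `colour` leaves no trace in its vector.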

In addition, you have to convert the label vectors (negative=0, neutral=1, positive=2) of the training and validation sets to a format that can be used with the loss function [`categorical_crossentropy`](https://keras.io/api/losses/probabilistic_losses/#categorical_crossentropy-function) (clue: you may want to use the utility function [`tensorflow.keras.utils.to_categorical()`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical)).
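
The label format expected by `categorical_crossentropy` is a one-hot row per example. The following plain-Python sketch shows that encoding by hand; `one_hot` is a hypothetical helper written for illustration, and in the notebook itself `to_categorical()` would do this job.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows of length num_classes."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # a single 1.0 at the position of the class
        rows.append(row)
    return rows


# negative=0, positive=2, neutral=1
encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)
```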

In [5]:
# ToDo:
# create the tokenizer and build the vocabulary/word index
# obtain a vectorized representation for train and validation sets using Tokenizer

# create the tokenizer and fit with training textual data, building the vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_X)
# vectorize the textual data in both sets using the fitted tokenizer
train_X_oh = tokenizer.texts_to_matrix(train_X, mode='binary')
val_X_oh = tokenizer.texts_to_matrix(val_X, mode='binary')
# one-hot encode the target variables to allow the use of categorical_crossentropy
train_y_oh=tf.keras.utils.to_categorical(train_y)
val_y_oh=tf.keras.utils.to_categorical(val_y)

In [15]:
train_X_oh

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 1., 1., 1.]])

Now it is time to create the `Sequential` architecture of our first model. In this case, a simple perceptron with three layers (input, hidden, output) will suffice. A few pointers:
- You will need to set the `input_shape` of the first layer of the network to the size of the vocabulary in the `Tokenizer`.
- The number of units and the activation function in the output layer must be appropriate for a three-class classification problem.
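
For a three-class problem, the usual choice is three output units with a softmax activation, which turns the three raw scores into probabilities that sum to 1. The sketch below computes softmax by hand in plain Python to show that behaviour; the input scores are arbitrary values chosen for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary scores for the classes negative, neutral and positive.
probs = softmax([2.0, 0.5, -1.0])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is simply the position of the largest probability.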

In [6]:
# ToDo:
# Create a NN model: a simple perceptron

# define the model using the sequential API: input, hidden, output
model=tf.keras.Sequential()
model.add(tf.keras.Input(shape=(len(tokenizer.word_index)+1,))) # length of the word index, plus one for index 0
model.add(tf.keras.layers.Dense(24, activation='relu'))
model.add(tf.keras.layers.Dense(3, activation='softmax')) # three output neurons for the three classes


Now compile and train the model. You can use any optimizer you want, but the loss function must be [`categorical_crossentropy`](https://keras.io/api/losses/probabilistic_losses/#categorical_crossentropy-function), the metric used will be `accuracy`, and you will provide the validation sets for the computation of the validation loss and validation accuracy at the end of each epoch of training, with the argument `validation_data`.

The model will train for 10 epochs.

Expect a validation accuracy of approximately 0.80-0.83.

In [7]:
# ToDo: 
# Train your model here

# compile and fit the model
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.fit(train_X_oh, train_y_oh, epochs=10, validation_data=(val_X_oh, val_y_oh), batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd06c60e940>

*Does the validation accuracy grow with each epoch?*

No. It grows during the first few epochs, but then stagnates or even decreases. In fact, the best validation accuracy achieved is around 83% (82.8%), yet the validation accuracy at the tenth epoch is only 80.73%. What does grow with each epoch is the training accuracy, but that is far less interesting, because it can simply mean that the model is overfitting to the training examples. This hypothesis is reinforced by the fact that the validation accuracy falls: our model is progressively getting better at recognizing the textual patterns of the training set, but it is losing generalization power and thus performing worse on the validation set.


### 3. Perceptron with a TextVectorizer layer.

Now you are going to implement a new neural network, with two differences with respect to the previous one:
- We will use a [`TextVectorization`](https://keras.io/api/layers/preprocessing_layers/core_preprocessing_layers/text_vectorization/) layer instead of a `Tokenizer`.
- The loss function will be [`sparse_categorical_crossentropy`](https://keras.io/api/losses/probabilistic_losses/#sparse_categorical_crossentropy-function).
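
The two loss functions compute the same quantity and differ only in the label format they expect: one-hot rows for `categorical_crossentropy`, plain integer class indices for `sparse_categorical_crossentropy`. The hand-written sketch below illustrates this for a single example; the functions are simplified stand-ins written for this note, not the Keras implementations, and the probabilities are arbitrary.

```python
import math


def categorical_crossentropy(one_hot_target, probs):
    """Cross-entropy for a one-hot target row."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))


def sparse_categorical_crossentropy(class_index, probs):
    """Cross-entropy for an integer class label."""
    return -math.log(probs[class_index])


# Arbitrary predicted probabilities for one example whose true class is 2.
probs = [0.1, 0.2, 0.7]
loss_dense = categorical_crossentropy([0.0, 0.0, 1.0], probs)
loss_sparse = sparse_categorical_crossentropy(2, probs)
print(loss_dense, loss_sparse)
```

Both calls return the negative logarithm of the probability assigned to the true class, so integer labels can be passed to the sparse variant without one-hot encoding.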

Your first task is to set the `TextVectorization` layer. Remember you have to create the layer and call the method `adapt()` on the training data before adding the layer to the new model. You can use the default values when creating the layer if you wish, except for `output_mode` that has to be set to `'multi_hot'`, so a binary vector the size of the vocabulary is generated for each example, as `Tokenizer` did in the first model.

In [8]:
# ToDo:
# Prepare your text vectorizer layer

# let's transform our dataframes to tensors, so that they can be used with TextVectorizer
train_X_tensor=tf.convert_to_tensor(train_X)
val_X_tensor=tf.convert_to_tensor(val_X)
train_y_tensor=tf.convert_to_tensor(train_y)
val_y_tensor=tf.convert_to_tensor(val_y)

# create the vectorizer and train it with the training textual data.
text_vectorizer = layers.TextVectorization(output_mode='multi_hot')
text_vectorizer.adapt(train_X_tensor)

Now you can create your second `Sequential` model, adding its layers one by one. Obviously, the previously created `TextVectorization` layer goes first. There is no need to define an input layer. You can add the rest of the layers after the text vectorizer.



In [9]:
# ToDo:
# create the model2 in a similar way to the first model but with the text vectorization layer

# same model as before, only with text_vectorizer first
model = tf.keras.Sequential()
model.add(text_vectorizer)
model.add(tf.keras.layers.Dense(24, activation='relu'))
model.add(tf.keras.layers.Dense(3, activation='softmax'))

Once the topology of the new model is set, you will set the datasets, compile and train it. Important:

- Remember that you are supposed to use `sparse_categorical_crossentropy`, so the label vectors for both training and validation will have to be of the appropriate type and dimensions.
- `TextVectorization` accepts its training input in batches. That means you will have to change the way training data is passed to the model. One way to do it is to create a `tf.data.Dataset` object and organize it in batches, using the methods `tf.data.Dataset.from_tensor_slices()` and `tf.data.Dataset.batch()`:

~~~
train_ds = tf.data.Dataset.from_tensor_slices((x,y))
train_ds = train_ds.batch(batch_size)
~~~

where `x` and `y` are the training samples and labels, and `batch_size` is a number. You can do the same with the validation data, but it is not mandatory. When training the model, you have to pass the dataset instead of x and y, and **omit** the `batch_size` parameter, since it is already set in the dataset.

~~~
model.fit(train_ds,...,epochs=10)
~~~
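
To make the effect of batching concrete, here is a plain-Python sketch of what splitting a dataset into batches amounts to. The generator `batches` is a hypothetical stand-in for `.batch(batch_size)`, assuming the default behaviour of keeping a smaller final batch; the sizes used (18,000 training samples, batches of 64) are the ones that appear in this notebook.

```python
import math


def batches(samples, labels, batch_size):
    """Yield consecutive (samples, labels) chunks; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        end = start + batch_size
        yield samples[start:end], labels[start:end]


n_train, batch_size = 18000, 64
steps_per_epoch = math.ceil(n_train / batch_size)

# Placeholder data: only the batch sizes matter here.
dummy_x = list(range(n_train))
dummy_y = [0] * n_train
sizes = [len(x) for x, _ in batches(dummy_x, dummy_y, batch_size)]
print(steps_per_epoch, sizes[0], sizes[-1])
```

Each epoch therefore consists of 282 steps: 281 full batches of 64 samples and a final batch with the remaining 16.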

You can use whichever optimizer you prefer, but you will use accuracy to measure the performance of the model, provide the validation data through the argument `validation_data`, and train for 10 epochs.

In [10]:
# ToDo:
# train your model2 after having compiled and prepared the training input in batches as suggested in the instructions

# prepare our datasets from the tensors obtained before, and define their batches
train_ds=tf.data.Dataset.from_tensor_slices((train_X_tensor, train_y_tensor))
train_ds=train_ds.batch(64)
val_ds=tf.data.Dataset.from_tensor_slices((val_X_tensor, val_y_tensor))
val_ds=val_ds.batch(64)

# with sparse_categorical_crossentropy, there is no need to one-hot encode the labels, integers are alright
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.fit(train_ds, epochs=10, validation_data=val_ds)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fcfc0130ca0>

*Is the new model any better than the previous one?*

The model performs very similarly to the previous one. Both achieve very high training accuracy after the ten epochs, with far less impressive (although still very good) validation accuracy. The evolution of their validation accuracy even follows a similar pattern: it increases during the first three epochs and then progressively declines, landing at around 80.8%.

### 4. CNN with TextVectorizer layer and word embeddings

Finally, you are going to train a third model with the following components:
- A `TextVectorization` layer.
- An `Embedding` layer.
- One or more `Conv1D` layers.
- A `GlobalMaxPooling1D` layer.
- One or more `Dense` layers for the computation of results.
- An output layer with the appropriate activation function for a multiclass classifier.

You will use the functional API. See the doc reference [here](https://keras.io/guides/functional_api/).

Our goal is to process the input texts token by token using a Convolutional Neural Network (CNN) and embeddings. The first step is to define the `TextVectorization` layer. This time the output of this layer will be a vector of integers (the input for the [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer), with one integer for each token in the input text, so `output_mode` must be set to `int` or omitted (since `int` is the default value for this parameter). In addition, all sequences of integers (words) given to the embedding layer must have the same length. To ensure that, you will use the parameter `output_sequence_length` in the definition of the `TextVectorization` layer (e.g. `output_sequence_length=100`). That will truncate sequences longer than the value of `output_sequence_length` and pad shorter ones with zeros.
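
The effect of a fixed output length can be sketched in a few lines of plain Python. The helper `pad_or_truncate` is hypothetical and only mimics the behaviour described above (keep the first tokens, pad with zeros at the end); the token ids are invented.

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Return exactly `length` ids: cut the tail or append padding as needed."""
    kept = token_ids[:length]
    return kept + [pad_value] * (length - len(kept))


short_review = [12, 7, 3]             # invented token ids
long_review = [5, 9, 2, 8, 4, 6, 1]

print(pad_or_truncate(short_review, 5))
print(pad_or_truncate(long_review, 5))
```

Both results have the same length, which is what the embedding and convolutional layers that follow require.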

Once the layer is defined, it must be trained with the method `adapt()`.

In [11]:
# ToDo:
# Create the text vectorization layer...

# we will repeat the tensors/dataset preprocessing used before here; the code is duplicated to allow flexibility in the order of execution
train_X_tensor=tf.convert_to_tensor(train_X)
val_X_tensor=tf.convert_to_tensor(val_X)
train_y_tensor=tf.convert_to_tensor(train_y)
val_y_tensor=tf.convert_to_tensor(val_y)
train_ds=tf.data.Dataset.from_tensor_slices((train_X_tensor, train_y_tensor))
train_ds=train_ds.batch(64)
val_ds=tf.data.Dataset.from_tensor_slices((val_X_tensor, val_y_tensor))
val_ds=val_ds.batch(64)

# now create and train the text vectorizer
text_vectorizer = layers.TextVectorization(output_mode='int', output_sequence_length=100)
text_vectorizer.adapt(train_X_tensor)

You have to start the definition of the model with an `Input` layer, e.g.:

~~~
inputs = tf.keras.Input(shape=(1,),dtype=tf.string)
~~~

then you can add the `TextVectorization`, `Conv1D`, ... layers.

The `Embedding` layer has at least two parameters: the size of the vocabulary and the size of the embeddings. For the vocabulary you have two choices: set it in advance when creating the `TextVectorization` layer, via the `max_tokens` parameter, or let all tokens of the training set be part of the vocabulary. In the latter case, you can get the vocabulary size from the vectorization layer, using the method `vocabulary_size()`.

You must set the embedding dimension to an integer value, e.g. `30`.

In the `Conv1D` layer you have to set two parameters, `filters` and `kernel_size`. Both are integers. The first can take any positive integer value (e.g., `64` or `128`), but the higher it is set, the more computation is required, while the second should be small compared to the length of the sequences of words (e.g. `3` or `5`).
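
The following sketch works out how these parameters shape the data as it flows through the network. It assumes the default `Conv1D` settings (no padding, stride 1) and uses the example values mentioned in this section (sequence length 100, embedding size 30, 64 filters, kernel size 5); the helper functions are written for illustration only.

```python
def conv1d_output_length(seq_len, kernel_size, stride=1):
    """Number of positions a kernel can take along the sequence without padding."""
    return (seq_len - kernel_size) // stride + 1


def conv1d_param_count(in_channels, filters, kernel_size):
    """Weights of all filters plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def global_max_pool(feature_map):
    """Keep, for every filter (column), its maximum over all positions (rows)."""
    return [max(column) for column in zip(*feature_map)]


seq_len, embedding_dim, filters, kernel_size = 100, 30, 64, 5

conv_shape = (conv1d_output_length(seq_len, kernel_size), filters)
pooled_shape = (filters,)
conv_params = conv1d_param_count(embedding_dim, filters, kernel_size)
print(conv_shape, pooled_shape, conv_params)

# Tiny feature map with 3 positions and 2 filters, to show the pooling step.
print(global_max_pool([[1, 5], [3, 2], [0, 4]]))
```

Under these assumptions the convolution turns each review into 96 positions with 64 features each, and `GlobalMaxPooling1D` reduces that to a single vector of 64 values, one per filter, which the dense layers can consume.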

We finish with the output layer:

~~~
outputs = tf.keras.layers.Dense(...)(x)
~~~

where `x` is the output of the previous layer. At this point we can define the model:

~~~
model_functional = tf.keras.Model(inputs=inputs, outputs=outputs)
~~~

In [12]:
# ToDo:
# create your model3... for example:

# using the functional API, we define the model described above
inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = layers.Embedding(text_vectorizer.vocabulary_size(), 30)(x)
x = layers.Conv1D(filters=64, kernel_size=5)(x)
x = layers.GlobalMaxPooling1D()(x)
outputs = layers.Dense(3, activation='softmax')(x)
model = models.Model(inputs=inputs, outputs=outputs, name="sentiment_analysis")

You can use exactly the same datasets as in the previous model for training and validation, and the optimizer of your preference, but you will use accuracy as the performance metric and sparse categorical crossentropy as the loss function, provide the validation data through the argument `validation_data`, and train the model for 10 epochs.

In [13]:
# ToDo: 
# Train your model3 

# compile and train
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.fit(train_ds, epochs=10, validation_data=val_ds)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fcfac31a4c0>

*Does the new model perform any better than the previous two?*

It performs slightly better than both, although the difference is probably not significant: at the last epoch the validation accuracies are 81.1% vs 80.8%. The evolution of this metric is a bit more stable than in the previous models, though, and it takes considerably longer to start degrading (and it degrades more slowly). It is important to note that we have chosen the 'bare-minimum' model allowed by the instructions of the assignment; it is possible that including some of the additional layers mentioned in the instructions (i.e. more convolutional or dense layers) would make this model a bit better.