# NLP Assignment
Nontapat Pintira
Student ID: 6088118

## Analysis on Amazon Reviews
This dataset consists of a few million Amazon customer reviews (input text) and star ratings (output labels) for learning how to train fastText for sentiment analysis.

[Dataset on Google Drive](https://drive.google.com/open?id=1YJ8WU-4o31ehA7mSD_7Pv1EJeZwLByls)

I am going to use an RNN to build a model for this prediction task.
Note that the model in this notebook takes around 30 minutes to train.

In total, this notebook may take up to 2 hours to run from start to end
(on Google Colab with a GPU runtime).

**Please run this notebook on Google Colab**

## Library Used


In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.python.keras import models, layers, optimizers
import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
import bz2
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
import re

%matplotlib inline

# Check the content in the mounted directory
import os
print(os.listdir("../content/drive/My Drive/Skill Tree/CS - Special Topic/NLP/datasets"))

['test.ft.txt.bz2', 'train.ft.txt.bz2']


***
## Preprocessing

### Reading the Text
The text files used in this assignment are compressed in bzip2 format. <br>
To deal with this, I am going to use the `bz2` library to read the text line by line. Each line has the following format:

```
__label__<X> __label__<Y> ... <Text>
```

After inspecting the dataset, I see that the label is the first word of each review and the rest is the text. I am going to encode the label as an integer, 0 or 1 (0: negative, 1: positive).
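
As a rough illustration of this encoding, the sketch below parses a single line of the form `__label__<N> <text>`. The helper name `parse_line` and the sample line are made up for this example, and it assumes exactly one label per line with the value 1 or 2.

```python
import re

# Assumed line shape: "__label__<digit> <review text>"
LINE_PATTERN = re.compile(r'^__label__(\d)\s+(.*)$', re.DOTALL)

def parse_line(line):
    """Split a fastText-style line into (label, text), mapping 1/2 to 0/1."""
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError('unexpected line format: {!r}'.format(line))
    label = int(match.group(1)) - 1  # 1 -> 0 (negative), 2 -> 1 (positive)
    return label, match.group(2).strip()

# Hypothetical sample line, not taken from the dataset
print(parse_line('__label__2 Great sound: I love this CD.\n'))
```

Unlike fixed character offsets, a pattern like this fails loudly when a line does not start with the expected label prefix.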

Be sure to mount Google Drive in the notebook and set the paths to the `.bz2` files.

In [0]:
def get_labels_and_texts(file):
    labels = []
    texts = []
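    for line in bz2.BZ2File(file):
        x = line.decode("utf-8")
        # "__label__" is 9 characters long, so x[9] is the label digit (1 or 2)
        labels.append(int(x[9]) - 1)
        # Everything after the label digit is the review text
        texts.append(x[10:].strip())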
    return np.array(labels), texts

train_labels, train_texts = get_labels_and_texts('../content/drive/My Drive/Skill Tree/CS - Special Topic/NLP/datasets/train.ft.txt.bz2')
test_labels, test_texts = get_labels_and_texts('../content/drive/My Drive/Skill Tree/CS - Special Topic/NLP/datasets/test.ft.txt.bz2')

### Text Preprocessing
I am going to use regex to process the text data and create a sequence for further processing.

The steps that I will use to process the text data are as follows:


1. Lowercasing everything
2. Removing non-word characters
3. Removing non-ASCII characters

To use regular expressions in Python, I have to import the `re` library:


```
import re
```





In [0]:
NON_ALPHANUM = re.compile(r'[\W]')
# Keep only lowercase ASCII letters, digits 0-9, and whitespace
NON_ASCII = re.compile(r'[^a-z0-9\s]')
def normalize_texts(texts):
    normalized_texts = []
    for text in texts:
        lower = text.lower()
        no_punctuation = NON_ALPHANUM.sub(r' ', lower)
        no_non_ascii = NON_ASCII.sub(r'', no_punctuation)
        normalized_texts.append(no_non_ascii)
    return normalized_texts
        
train_texts = normalize_texts(train_texts)
test_texts = normalize_texts(test_texts)
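
To make the three steps concrete, here is a small sketch that applies them to one made-up review. The function name `normalize` and the sample string are hypothetical, and the sketch assumes that all digits 0-9 should be kept.

```python
import re

NON_WORD = re.compile(r'\W')               # punctuation and other non-word characters
NOT_ALLOWED = re.compile(r'[^a-z0-9\s]')   # anything except a-z, 0-9, whitespace

def normalize(text):
    """Lowercase, turn non-word characters into spaces, then drop other characters."""
    lowered = text.lower()
    spaced = NON_WORD.sub(' ', lowered)
    return NOT_ALLOWED.sub('', spaced)

# Hypothetical sample: the accented letter survives step 2 but is removed in step 3
sample = 'Great café, 5 stars!'
print(normalize(sample).split())
```

Punctuation becomes a space rather than disappearing, so neighbouring words are never merged. Accented letters are simply dropped, which shortens a word such as "café" to "caf".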

### Splitting Training and Validation Data

Out of almost 4 million training examples, I will hold out a development (validation) set of about 250,000 examples (around 7%), which is large enough to give a reliable estimate of model performance.

In [0]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, random_state=42, test_size=0.07)
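
The idea behind a hold-out split can be sketched with a shuffled index array. This is a simplified stand-in written with NumPy, not scikit-learn's implementation; the helper name `holdout_split` and the toy size of 1,000 examples are assumptions for illustration.

```python
import numpy as np

def holdout_split(n_examples, val_fraction, seed=42):
    """Shuffle the indices 0..n_examples-1 and reserve a fraction for validation."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_examples)
    n_val = int(round(n_examples * val_fraction))
    return order[n_val:], order[:n_val]  # (training indices, validation indices)

train_idx, val_idx = holdout_split(1000, 0.07)
print(len(train_idx), len(val_idx))
```

Fixing the seed plays the same role as `random_state=42` above: the same examples end up in the validation set on every run.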

## Feature Engineering

I will only use the top 12,000 words as features.<br>
Keras makes it easy to tokenize the text and create sequences suitable for deep learning models.

In [0]:
MAX_FEATURES = 12000
tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(train_texts)
train_texts = tokenizer.texts_to_sequences(train_texts)
val_texts = tokenizer.texts_to_sequences(val_texts)
test_texts = tokenizer.texts_to_sequences(test_texts)
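
Conceptually, the tokenizer ranks words by frequency and replaces each word with its rank. The sketch below imitates that behaviour in plain Python on toy data; it is a simplified approximation rather than the Keras implementation, and the helper names `fit_vocabulary` and `to_sequences` are hypothetical.

```python
from collections import Counter

def fit_vocabulary(texts, max_features):
    """Map the most frequent words to indices 1..max_features-1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(max_features - 1)
    return {word: index for index, (word, _) in enumerate(most_common, start=1)}

def to_sequences(texts, vocabulary):
    """Replace each known word with its index and drop words outside the vocabulary."""
    return [[vocabulary[w] for w in text.split() if w in vocabulary] for text in texts]

toy_texts = ['good book', 'good good movie', 'bad movie']
vocab = fit_vocabulary(toy_texts, max_features=3)
print(vocab)
print(to_sequences(toy_texts, vocab))
```

Rare words ("book" and "bad" here) vanish from the sequences, which is the trade-off of capping the vocabulary size.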

To train with mini-batches, the input sequences should be padded to equal length. I pad each dataset (training, validation, testing) to the length of the longest training sequence.

In [0]:
MAX_LENGTH = max(len(train_ex) for train_ex in train_texts)
train_texts = pad_sequences(train_texts, maxlen=MAX_LENGTH)
val_texts = pad_sequences(val_texts, maxlen=MAX_LENGTH)
test_texts = pad_sequences(test_texts, maxlen=MAX_LENGTH)
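
With its default settings, `pad_sequences` adds zeros at the front of short sequences and cuts tokens from the front of long ones. The plain-Python sketch below mirrors that behaviour on toy data; the helper name `pad_left` is hypothetical and it assumes `maxlen` is at least 1.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # keep the last maxlen tokens when the sequence is too long
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy_sequences = [[5, 2], [7, 1, 4, 3], [9]]
longest = max(len(seq) for seq in toy_sequences)
print(pad_left(toy_sequences, longest))  # [[0, 0, 5, 2], [7, 1, 4, 3], [0, 0, 0, 9]]
```

Padding to the longest training sequence means a single very long review determines the input width for every example, so memory use grows with that outlier.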

***
## Training RNN Model

Layers:


1. Two GRU layers (128 units each)
2. Dense hidden layer (32 units, ReLU activation)
3. Dense hidden layer (100 units, ReLU activation)
4. Sigmoid Output Layer

A dense layer computes:


```
output = activation(dot(input, kernel) + bias)
```
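
The formula above can be checked by hand with NumPy. In this sketch the function names `dense` and `relu` and all the numbers are made up for illustration; a real layer learns `kernel` and `bias` during training.

```python
import numpy as np

def relu(z):
    """Element-wise max(z, 0)."""
    return np.maximum(z, 0.0)

def dense(inputs, kernel, bias, activation):
    """output = activation(dot(input, kernel) + bias)"""
    return activation(np.dot(inputs, kernel) + bias)

inputs = np.array([[1.0, 2.0, 3.0]])   # one example with 3 features
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0],
                   [1.0, 1.0]])        # 3 inputs -> 2 units
bias = np.array([-5.0, 0.5])

output = dense(inputs, kernel, bias, relu)
print(output)  # [[0.  1.5]]
```

The first unit's pre-activation is 4 - 5 = -1, which ReLU clips to 0; the second is 1 + 0.5 = 1.5 and passes through unchanged.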





### Building the Model

In [0]:
def build_rnn_model():
    sequences = layers.Input(shape=(MAX_LENGTH,))
    embedded = layers.Embedding(MAX_FEATURES, 64)(sequences)
    x = layers.CuDNNGRU(128, return_sequences=True)(embedded)
    x = layers.CuDNNGRU(128)(x)
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dense(100, activation='relu')(x)
    predictions = layers.Dense(1, activation='sigmoid')(x)
    model = models.Model(inputs=sequences, outputs=predictions)
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['binary_accuracy']
    )
    return model
    
rnn_model = build_rnn_model()

### Fitting the Model

I am going to use the Adam optimization algorithm with mini-batches of size 128.

Note that this process may take up to 30 minutes.

In [10]:
rnn_model.fit(
    train_texts, 
    train_labels, 
    batch_size=128,
    epochs=1,
    validation_data=(val_texts, val_labels), )



<tensorflow.python.keras.callbacks.History at 0x7f1b733375c0>

### Testing the Model

There are multiple evaluation metrics that can be used. I choose accuracy, F1, and ROC AUC scores.

On training data

In [13]:
preds = rnn_model.predict(train_texts)
print('Accuracy score: {:0.4}'.format(accuracy_score(train_labels, 1 * (preds > 0.5))))
print('F1 score: {:0.4}'.format(f1_score(train_labels, 1 * (preds > 0.5))))
print('ROC AUC score: {:0.4}'.format(roc_auc_score(train_labels, preds)))

Accuracy score: 0.9563
F1 score: 0.9563
ROC AUC score: 0.9904


On testing data

In [14]:
preds = rnn_model.predict(test_texts)
print('Accuracy score: {:0.4}'.format(accuracy_score(test_labels, 1 * (preds > 0.5))))
print('F1 score: {:0.4}'.format(f1_score(test_labels, 1 * (preds > 0.5))))
print('ROC AUC score: {:0.4}'.format(roc_auc_score(test_labels, preds)))

Accuracy score: 0.9509
F1 score: 0.951
ROC AUC score: 0.9887
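
Accuracy and F1 are computed on hard 0/1 predictions, which is why the sigmoid outputs are thresholded at 0.5, while ROC AUC uses the raw probabilities. The sketch below computes the first two by hand on toy values; the helper names and the numbers are hypothetical and it is not the scikit-learn implementation.

```python
import numpy as np

def threshold(probabilities, cutoff=0.5):
    """Turn sigmoid outputs of shape (n, 1) into a flat array of 0/1 predictions."""
    return (np.asarray(probabilities).ravel() > cutoff).astype(int)

def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels, with 1 as the positive class."""
    y_true = np.asarray(y_true)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    accuracy = float(np.mean(y_pred == y_true))
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1

toy_labels = [1, 0, 1, 1, 0]
toy_probs = [[0.9], [0.6], [0.4], [0.8], [0.1]]  # made-up model outputs
toy_accuracy, toy_f1 = accuracy_and_f1(toy_labels, threshold(toy_probs))
print('Accuracy: {:0.4}, F1: {:0.4}'.format(toy_accuracy, toy_f1))
```

Here two positives are found, one negative is wrongly flagged, and one positive is missed, so accuracy is 3/5 and F1 is 4/6.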


We achieve relatively close scores on both training and testing data, which suggests the model is not overfitting much. To further improve the model, we can train for longer (more epochs), train a bigger network, choose a better neural network architecture, and tune the hyperparameters.


### Saving the model

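Saving the trained model means we don't have to train it again. `model.save` stores the architecture together with the weights of every layer in a single file; the weights are also saved separately further below.

To save the model we can use: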
```
model.save('my_model.h5')
```

In [0]:
rnn_model.save('NLP_6088118_model-v2.h5')

And to save the weights we can use:


```
model.save_weights('my_model_weights.h5')
```



In [0]:
rnn_model.save_weights('NLP_6088118_model_weight-v2.h5')

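To load the model, you can use these lines of code:


```
from tensorflow.keras.models import load_model

# load_model restores the architecture and the trained weights
model = load_model('my_model.h5')
# Optional: load weights that were saved separately with save_weights
model.load_weights('my_model_weights.h5')
```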

Now you can use the model to predict your input data.
