# 1. Define Research Goal

We want to train a model that's able to classify movie reviews into a positive and a negative class.

First, we import all Python libraries that we need.

In [1]:
import pandas as pd
import numpy as np
import pickle
from pathlib import Path
import os

from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelBinarizer
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout
import sklearn.datasets as skds
from sklearn import metrics # scikit-learn metrics (a second import of keras.metrics under the same name would be shadowed)

import matplotlib.pyplot as plt
import itertools

Using TensorFlow backend.


# 2. Retrieve Data
The IMDb Reviews are separated into training and test folders. Each of them has folders for positive and negative reviews.

First, we import all training data to create and train our neural net.

In [2]:
np.random.seed(1237)

labels = ["pos", "neg"] # contains all category labels that we want to classify
num_labels = 2 # number of labels

path_train = "./resources/aclImdb/train" # path to all reviews that we want to use for developing the model

We collect the file paths and labels with [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html). Afterwards, 'files_train' should contain the path and label for each training review.

Then, we start to read all reviews. Afterwards, data_list should contain tuples of file path, file label and file content for each review.

If this step takes too long, interrupt it and skip to "Load data from pickle".

In [None]:
files_train = skds.load_files(path_train, load_content=False, categories=labels, encoding="UTF-8")

file_paths = files_train.filenames
label_names = files_train.target_names
labelled_files_index = files_train.target

data_list = []
for i, file in enumerate(file_paths):
    data_list.append((file,
                      label_names[labelled_files_index[i]],
                      Path(file).read_text(encoding="UTF-8")))
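To illustrate the folder layout that this step relies on, here is a minimal sketch that reads a throwaway corpus with the standard library only. The helper `read_reviews` and the toy file names are assumptions made for this example; the notebook itself uses `skds.load_files`.

```python
import tempfile
from pathlib import Path

def read_reviews(root, labels):
    """Return (path, label, text) tuples for every .txt file under root/<label>/."""
    records = []
    for label in labels:
        for file in sorted((Path(root) / label).glob("*.txt")):
            records.append((str(file), label, file.read_text(encoding="UTF-8")))
    return records

# Build a tiny throwaway corpus with the same pos/neg folder layout.
with tempfile.TemporaryDirectory() as tmp:
    for label, text in [("pos", "Great film."), ("neg", "Dull plot.")]:
        folder = Path(tmp) / label
        folder.mkdir()
        (folder / "0_1.txt").write_text(text, encoding="UTF-8")
    toy_records = read_reviews(tmp, ["pos", "neg"])

print([(label, text) for _, label, text in toy_records])
```

Each tuple has the same three fields as the entries of data_list: file path, label and review text.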

The tuples are then converted into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.from_records.html).

In [None]:
data_tags=["filename","category","review"]
data = pd.DataFrame.from_records(data_list, columns=data_tags)

## 2.1. Load data from pickle
If reading the data from folders takes too long, you can load the DataFrame from a serialized pickle file. (Skip this step if your DataFrame is already created.)

In [3]:
with open('resources/dataframes/train_dataframe.pickle', 'rb') as handle:
    data = pickle.load(handle)
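The notebook only shows how to load the pickle file. As a rough sketch of the full round trip, the following example writes a small object to a temporary file and reads it back. The file name `toy_dataframe.pickle` and the toy records are made up for illustration; a DataFrame can be dumped the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for the DataFrame contents (hypothetical records).
toy_data = [
    ("train/pos/0_10.txt", "pos", "Great film."),
    ("train/neg/1_2.txt", "neg", "Dull plot."),
]

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = Path(tmp) / "toy_dataframe.pickle"
    # Serialize the object to disk ...
    with open(pickle_path, "wb") as handle:
        pickle.dump(toy_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # ... and read it back, as the cell above does for the real DataFrame.
    with open(pickle_path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == toy_data)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.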

data.head() should show your DataFrame like this:
<img src="resources/dataframe.png" alt="Data Frame Example">

In [4]:
data.head()

| | filename | category | review |
|---|---|---|---|
| 0 | ./resources/aclImdb/train/pos/11485_10.txt | pos | Zero Day leads you to think, even re-think why... |
| 1 | ./resources/aclImdb/train/neg/6802_1.txt | neg | Words can't describe how bad this movie is. I ... |
| 2 | ./resources/aclImdb/train/pos/7641_10.txt | pos | Everyone plays their part pretty well in this ... |
| 3 | ./resources/aclImdb/train/neg/9698_1.txt | neg | There are a lot of highly talented filmmakers/... |
| 4 | ./resources/aclImdb/train/neg/3141_2.txt | neg | I've just had the evidence that confirmed my s... |


# 3. Prepare Data
We now have a DataFrame with all training reviews. To develop the neural network, we split the DataFrame into a training set (80%) and a development (validation) set (20%). For training, we take the review text and the category from the first 80% of the DataFrame entries.

In [5]:
train_size = int(len(data) * .8) # number of reviews that we take for training

train_reviews = data['review'][:train_size]
train_tags = data['category'][:train_size]
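Taking the first 80% of the rows only gives a fair split if positive and negative reviews are already mixed, as they appear to be in the data.head() output. The following sketch shows the same slicing on a made-up list of tags and checks the class balance of the training part.

```python
from collections import Counter

# Hypothetical, already shuffled tags standing in for data['category'].
toy_tags = ["pos", "neg", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos"]

train_size = int(len(toy_tags) * .8)  # same rule as in the cell above
train_part = toy_tags[:train_size]    # first 80% for training
dev_part = toy_tags[train_size:]      # remaining 20% for development

print(len(train_part), len(dev_part))
print(Counter(train_part))
```

If the counts per class were very uneven, shuffling the rows before slicing would be one way to fix it.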

## 3.1. Vectorization
In order to make the reviews interpretable for the Neural Network, we need to tokenize and vectorize the content of the reviews. We use the Keras [Tokenizer](http://faroit.com/keras-docs/1.2.2/preprocessing/text/#tokenizer) to split each review into tokens.

The tokens of each review are weighted corresponding to the selected mode. "Binary" sets a 1 for a token if it appears in the review and 0 if it doesn't appear.
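To make the "binary" weighting concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: the helpers `fit_vocabulary` and `to_binary_vector` are hypothetical, tokens are split on whitespace only, and ties are broken alphabetically. Like the Keras Tokenizer, the sketch leaves index 0 unused and keeps only tokens whose index is below the vocabulary size.

```python
def fit_vocabulary(texts, vocab_size):
    """Map tokens to indices, most frequent token first (index 0 stays unused)."""
    counts = {}
    for text in texts:
        for token in text.lower().split():
            counts[token] = counts.get(token, 0) + 1
    # Sort by descending count; break ties alphabetically to stay deterministic.
    ranked = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: i for i, token in enumerate(ranked, start=1) if i < vocab_size}

def to_binary_vector(text, word_index, vocab_size):
    """Set position i to 1.0 if the token with index i occurs in the text."""
    vector = [0.0] * vocab_size
    for token in text.lower().split():
        index = word_index.get(token)
        if index is not None:
            vector[index] = 1.0
    return vector

toy_reviews = ["good film good cast", "bad film"]
toy_index = fit_vocabulary(toy_reviews, vocab_size=4)
toy_matrix = [to_binary_vector(review, toy_index, 4) for review in toy_reviews]

print(toy_index)
print(toy_matrix)
```

The token "cast" falls outside the four-slot vocabulary and is dropped, and "good" counts only once per review even though it occurs twice.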

**Now, it's your turn**: Change 'vocab_size' and 'mode' and see what happens to the neural net!

In [6]:
vocab_size = 5000 # determines the size of the vocabulary
mode = 'binary'

tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(train_reviews)

x_train = tokenizer.texts_to_matrix(train_reviews, mode=mode)

Let's see what our training matrix looks like (prints out the first 10 review vectors):

In [7]:
x_train[:10]

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])

In [11]:
model_path = 'resources/models/newmodel/' # adjust this path if you want to save a new model
os.makedirs(model_path, exist_ok=True) # also creates missing parent folders; no error if the folder already exists

We can save the tokenizer to a pickle file (for example, to load it in the evaluation notebook).

In [12]:
with open(model_path +'defaulttokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

## 3.2. Label Encoding
We need to encode the categories of our reviews, too. We use a LabelBinarizer, which sorts the category names and, for our two classes, produces a single column with 0 for "neg" and 1 for "pos".

In [13]:
encoder = LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
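As a hand-rolled illustration of the two-class encoding, the following sketch mimics the behaviour without scikit-learn. The functions `fit_classes`, `binarize` and `inverse` are hypothetical stand-ins, not part of any library.

```python
def fit_classes(tags):
    """Collect the distinct labels in sorted order, e.g. ['neg', 'pos']."""
    return sorted(set(tags))

def binarize(tags, classes):
    """Two classes: one column per row, 0 for classes[0] and 1 for classes[1]."""
    return [[classes.index(tag)] for tag in tags]

def inverse(encoded, classes):
    """Map the 0/1 column back to the original label names."""
    return [classes[row[0]] for row in encoded]

toy_tags = ["pos", "neg", "pos"]
toy_classes = fit_classes(toy_tags)
toy_encoded = binarize(toy_tags, toy_classes)

print(toy_classes)
print(toy_encoded)
print(inverse(toy_encoded, toy_classes))
```

Because the labels are sorted alphabetically, "neg" maps to 0 and "pos" to 1, which is why a sigmoid output close to 1 can be read as a positive review.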

# 4. Explore Data
To get to know the data a little better, you can count how many positive and negative reviews there are.

In [14]:
val, count = np.unique(y_train, return_counts=True) #count frequency of each encoded label in y_train

for i, c in enumerate(count):
    label = encoder.inverse_transform(val[i])
    print(label, c)

['neg'] 10008
['pos'] 9992


Or you can check the dimensions of our training data.<br>
Output: (x = number of training reviews, y = size of each vector)

In [15]:
x_train.shape

(20000, 5000)

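We can explore the vocabulary of our tokenizer. 'vocab' maps each token to its index; the indices are assigned in decreasing order of how often the token occurs in the training reviews. With mode "binary", 'sum_words' is a vector that contains the document frequency of each token, i.e. the number of training reviews in which it appears.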

In [16]:
vocab = tokenizer.word_index
sum_words = x_train.sum(axis=0)
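Why the column sums are document frequencies can be seen on a small made-up binary matrix. This sketch uses plain lists instead of NumPy; the values are invented for illustration.

```python
# Rows are reviews, columns are token indices (index 0 is unused).
toy_matrix = [
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Summing each column counts the reviews that contain the token,
# which corresponds to x_train.sum(axis=0) in the cell above.
document_frequency = [sum(column) for column in zip(*toy_matrix)]

print(document_frequency)
```

With a different mode, such as "count", the same sum would give total occurrences rather than the number of reviews.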

In [27]:
vocab

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'his': 23,
 'are': 24,
 'have': 25,
 'he': 26,
 'be': 27,
 'one': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'who': 34,
 'from': 35,
 'so': 36,
 'like': 37,
 'her': 38,
 'just': 39,
 'or': 40,
 'about': 41,
 'out': 42,
 "it's": 43,
 'has': 44,
 'if': 45,
 'there': 46,
 'some': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'up': 53,
 'she': 54,
 'even': 55,
 'time': 56,
 'no': 57,
 'my': 58,
 'would': 59,
 'which': 60,
 'story': 61,
 'only': 62,
 'really': 63,
 'see': 64,
 'had': 65,
 'their': 66,
 'can': 67,
 'me': 68,
 'were': 69,
 'well': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'been': 74,
 'get': 75,
 'people': 76,
 'bad': 77,
 'will': 78,
 'also': 79,
 'do': 80,
 'into': 81,
 'other': 82,
 ...}

We can explore the document frequencies by asking for specific tokens.

In [23]:
index_actress = vocab['actress']
index_actor = vocab['actor']

In [24]:
print(sum_words[index_actress])
print(sum_words[index_actor])

843.0
1519.0


# 5. Model Data
We now have a matrix of our training input and an array of the related categories. We can build a neural net and train it with x_train and y_train.

**Now, it's your turn:** Change parameters such as the optimizer, the number of layers, the layer activation, the layer size, etc.

In [None]:
optimizer = 'adam'
loss = 'binary_crossentropy'
batch_size = 10
epochs = 10
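For intuition about the 'binary_crossentropy' loss, here is a small worked example of the standard formula for a single review. The function below is a plain-Python sketch, not the Keras implementation, and the clipping constant `eps` is an assumption chosen for this example.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Loss for one example: -(y*log(p) + (1-y)*log(1-p))."""
    # Clip the probability so that log(0) can never occur.
    y_prob = min(max(y_prob, eps), 1 - eps)
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))

confident_right = binary_crossentropy(1, 0.9)  # true label pos, predicted 0.9
confident_wrong = binary_crossentropy(1, 0.1)  # true label pos, predicted 0.1

print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

A confident wrong prediction is penalized far more than a confident correct one, which is what pushes the sigmoid output towards the true label during training.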

In [None]:
classifier = Sequential()
#First Hidden Layer
classifier.add(Dense(4, activation='relu', kernel_initializer='random_normal', input_dim=vocab_size))
#Second Hidden Layer
classifier.add(Dense(4, activation='relu', kernel_initializer='random_normal'))
#Output Layer
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))
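To get a feeling for the size of this network, the trainable parameters can be counted by hand. The sketch below assumes vocab_size = 5000 and that every Dense layer has one weight per input-unit pair plus one bias per unit; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input size followed by the units of the three Dense layers above.
layer_sizes = [5000, 4, 4, 1]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [20004, 20, 5]
print(sum(per_layer))  # 20029
```

Almost all parameters sit in the first layer, so changing 'vocab_size' or the size of the first hidden layer has the largest effect on model size.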

In [None]:
#Compiling the neural network
classifier.compile(optimizer=optimizer, loss=loss,
                   metrics=['accuracy'])

In [None]:
#Fitting the data to the training dataset
classifier.fit(x_train,y_train, batch_size=batch_size, epochs=epochs)

## Save Model

In [None]:
classifier.save(model_path +'neuralnet.h5')

# Classify new Reviews
"classifier" contains our trained neural net. We can use it to classify new film reviews.

In [None]:
own_review = "Bad bad film!"

In [None]:
review_series = pd.Series(own_review)
x_review = tokenizer.texts_to_matrix(review_series, mode=mode) # use the same mode as for training

In [None]:
prob = classifier.predict(x_review)
prob

# 6. Improve Model
After trying the neural net on example reviews, we want to know how well it works in general. We use the validation set (the remaining 20% of the DataFrame) to evaluate the model.

In [None]:
val_reviews = data['review'][train_size:]
val_tags = data['category'][train_size:]

x_val = tokenizer.texts_to_matrix(val_reviews, mode=mode)
y_val = encoder.transform(val_tags)

Now, we [predict](https://keras.io/models/model/#predict) labels for all validation reviews. If the predicted probability of a positive review is greater than 0.5, "pos" is assigned; otherwise "neg" is assigned.

In [None]:
probs = classifier.predict(x_val)
y_classified = ['pos' if x > 0.5 else 'neg' for x in probs]

y_true = list(encoder.inverse_transform(y_val)) #transform true encoded categories (0 and 1) to labels (neg and pos)

print(y_classified[:10]) #print first 10 predictions and true labels
print(y_true[:10])
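The thresholding step can be tried in isolation on made-up probabilities. The sketch assumes that predictions come as one row per review with a single value, and `to_labels` is a hypothetical helper.

```python
def to_labels(probs, threshold=0.5):
    """Assign 'pos' if the single probability in a row exceeds the threshold."""
    return ["pos" if row[0] > threshold else "neg" for row in probs]

# Hypothetical model outputs, one row per review.
toy_probs = [[0.91], [0.08], [0.5]]
toy_labels = to_labels(toy_probs)

print(toy_labels)
```

A probability of exactly 0.5 is labelled "neg" here because the comparison is strict.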

We create a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) to compare all predicted labels(y_classified) with the true labels(y_true).

In [None]:
cm = metrics.confusion_matrix(y_true, y_classified, labels=["neg", "pos"])
cm
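To see how the cells of the matrix are laid out, here is a small hand-written version that counts (true label, predicted label) pairs. It is a sketch of the idea with invented labels, not the scikit-learn implementation.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in the order of `labels`."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix

toy_true = ["pos", "pos", "neg", "neg", "pos"]
toy_pred = ["pos", "neg", "neg", "pos", "pos"]
toy_cm = confusion_matrix(toy_true, toy_pred, labels=["neg", "pos"])

# With labels=["neg", "pos"], the flattened order is TN, FP, FN, TP.
tn, fp = toy_cm[0]
fn, tp = toy_cm[1]
print(toy_cm)
print(tn, fp, fn, tp)
```

The order of 'labels' decides which class counts as negative, so it has to match the order used when the matrix is unpacked.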

The confusion matrix gives us all values for further evaluation computations.

In [None]:
tn, fp, fn, tp = cm.ravel()
pre = metrics.precision_score(y_true, y_classified, pos_label='pos')
rec = metrics.recall_score(y_true, y_classified, pos_label='pos')
print("TN, FP, FN, TP ", (tn, fp, fn, tp))
print("Precision ", pre)
print("Recall ", rec)
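Precision and recall can also be derived directly from the four counts. The sketch below does this for made-up counts and additionally computes the F1 score, which the notebook itself does not report; `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts: 2 true positives, 1 false positive, 1 false negative.
toy_precision, toy_recall, toy_f1 = precision_recall_f1(tp=2, fp=1, fn=1)

print(toy_precision, toy_recall, toy_f1)
```

Precision is the share of predicted "pos" reviews that really are positive, while recall is the share of truly positive reviews that the model found.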

Now, you can improve your model by adjusting the configurations. If you think your configurations are ready, you can evaluate them with the test data by using the **EvaluateSentimentClassification** Notebook.

*** 