# 1. Define Research Goal

We want to train a model that's able to classify movie reviews into a positive and a negative class.

First, we import all the Python libraries we need.

In [None]:
import pandas as pd
import numpy as np
import pickle
from pathlib import Path

from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelBinarizer
from keras.models import Sequential
from keras import metrics
from keras.layers import Activation, Dense, Dropout
import sklearn.datasets as skds
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
import itertools

# 2. Retrieve Data

The IMDb reviews are split into a training and a test folder. Each of them contains one subfolder for positive and one for negative reviews.

First, we import all training data to create and train our neural net.

In [None]:
np.random.seed(1237)

labels = ["pos", "neg"] # contains all category labels that we want to classify
num_labels = 2 # number of labels

path_train = "./resources/aclImdb/train" # path to all reviews that we want to use for classification

We load the data with [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html). Because we pass load_content=False, files_train contains only the path and the label index of each training review, not the review text itself.

In [None]:
files_train = skds.load_files(path_train, load_content=False, categories=labels, encoding="UTF-8")

file_paths = files_train.filenames
label_names = files_train.target_names
labelled_files_index = files_train.target

Now we read all reviews (this might take some time). Afterwards, data_list contains one tuple of file path, file label and file content for each review.

In [None]:
data_list = []

for i, file in enumerate(file_paths):
    data_list.append((file,
                      label_names[labelled_files_index[i]],
                      Path(file).read_text(encoding="UTF-8")))
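
The folder layout is what supplies the labels: a review's category is the name of the subfolder it sits in. The following is a minimal sketch of that idea using only the standard library and a temporary directory. The file names and review texts are made up for illustration, and the notebook itself relies on load_files rather than on this loop.

```python
import tempfile
from pathlib import Path

folder_records = []

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    # Build a tiny stand-in for the train folder: one subfolder per label.
    for label, text in [("pos", "Loved it."), ("neg", "Dull and slow.")]:
        (root / label).mkdir()
        (root / label / "0_example.txt").write_text(text, encoding="UTF-8")

    # The label of a review is the name of the folder that contains it.
    for file in sorted(root.glob("*/*.txt")):
        folder_records.append(
            (file.name, file.parent.name, file.read_text(encoding="UTF-8"))
        )
```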

The tuples are converted into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.from_records.html). data.head() should display your DataFrame like this:
<img src="resources/dataframe.png" alt="Data Frame Example">

In [None]:
data_tags=["filename","category","review"]
data = pd.DataFrame.from_records(data_list, columns=data_tags)
data.head()

# 3. Prepare Data

We now have a DataFrame with all training reviews. To develop the neural network, we split the DataFrame into a training set (80%) and a development set (20%). For training, we take the review, category and file name from the first 80% of the DataFrame entries.

In [None]:
train_size = int(len(data) * .8) # number of reviews that we take for training

train_reviews = data['review'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]
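
Taking the first 80% by position only gives a useful split if positive and negative reviews are mixed throughout the DataFrame. As far as we know, load_files shuffles the file list by default, which is what makes this work here. The sketch below shows the same head/tail slicing on a made-up list of records and counts the labels in each part as a sanity check. All names in it are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the DataFrame rows: (filename, category) pairs.
toy_rows = [(f"review_{i}.txt", "pos" if i % 2 == 0 else "neg") for i in range(10)]

toy_train_size = int(len(toy_rows) * .8)
toy_train = toy_rows[:toy_train_size]  # first 80%
toy_dev = toy_rows[toy_train_size:]    # remaining 20%

# Both classes should show up in both parts; otherwise the split is unusable.
train_counts = Counter(category for _, category in toy_train)
dev_counts = Counter(category for _, category in toy_dev)
```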

## 3.1. Vectorization

To make the reviews usable as input for the neural network, we need to tokenize and vectorize their content. We use the Keras [Tokenizer](http://faroit.com/keras-docs/1.2.2/preprocessing/text/#tokenizer) to split each review into tokens.

The tokens of each review are weighted according to the selected mode. "Binary" sets a 1 for a token if it appears in the review and a 0 if it doesn't.
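
To make the effect of the mode concrete, here is a simplified sketch of bag-of-words vectorization in plain Python. It is not the Keras implementation: it only lowercases and splits on whitespace, and the helper names are hypothetical. It does follow the same basic idea, as we understand it, of ranking words by frequency, leaving index 0 unused, and dropping words ranked beyond the vocabulary size.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def texts_to_vectors(texts, word_index, vocab_size, mode="binary"):
    """Turn each text into a fixed-length vector of length vocab_size."""
    if mode not in ("binary", "count"):
        raise ValueError(f"unsupported mode: {mode}")
    vectors = []
    for text in texts:
        vector = [0] * vocab_size
        for word in text.lower().split():
            index = word_index.get(word)
            # Index 0 stays unused; words ranked beyond vocab_size are dropped.
            if index is not None and index < vocab_size:
                if mode == "binary":
                    vector[index] = 1
                else:
                    vector[index] += 1
        vectors.append(vector)
    return vectors


toy_reviews = ["good good film", "bad film"]
toy_index = build_word_index(toy_reviews)
binary_vectors = texts_to_vectors(toy_reviews, toy_index, vocab_size=4, mode="binary")
count_vectors = texts_to_vectors(toy_reviews, toy_index, vocab_size=4, mode="count")
```

In binary mode the repeated "good" in the first review still yields a 1, while count mode records it as 2. This is the kind of difference to watch for when changing the mode.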

**Now, it's your turn**: Change 'vocab_size' and 'mode' and see what happens to the neural net!

In [None]:
vocab_size = 5000 # determines the size of the vocabulary
mode = 'binary'

tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(train_reviews)

x_train = tokenizer.texts_to_matrix(train_reviews, mode=mode)

Let's see what our training matrix looks like (prints out the first 10 review vectors):

In [None]:
x_train[:10]

We can save the tokenizer to a pickle file, for example to load it in the evaluation notebook.

In [None]:
with open('resources/tokenizer/defaulttokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
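
Loading works the same way in reverse with pickle.load. The sketch below shows a full save-and-load round trip. It uses a plain dictionary as a stand-in for the fitted tokenizer and a temporary file with a hypothetical name, so it runs without Keras. Keep in mind that pickle files should only be loaded from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; in the notebook this would be the fitted tokenizer.
toy_state = {"word_index": {"film": 1, "good": 2}, "num_words": 5000}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "toy_tokenizer.pickle"  # hypothetical file name

    # Save: binary write mode, same protocol setting as in the cell above.
    with open(pickle_path, "wb") as handle:
        pickle.dump(toy_state, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Load: binary read mode returns an equal copy of the saved object.
    with open(pickle_path, "rb") as handle:
        restored_state = pickle.load(handle)
```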

## 3.2. Label Encoding

We need to encode the categories of our reviews, too. We can do it with a LabelBinarizer. For our two categories it produces a single column that contains a 0 or a 1 for each review.

In [None]:
encoder = LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
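
For two classes, the encoding boils down to sorting the class names and marking the second one with a 1. The following plain-Python sketch illustrates that mapping under this assumption. The function name is hypothetical and it is not the scikit-learn implementation.

```python
def binarize_labels(tags):
    """Map two string labels to a single 0/1 column, classes sorted alphabetically."""
    classes = sorted(set(tags))
    if len(classes) != 2:
        raise ValueError("this sketch only handles exactly two classes")
    positive_class = classes[1]
    # One single-element row per tag, mirroring a column vector.
    encoded = [[1 if tag == positive_class else 0] for tag in tags]
    return classes, encoded


toy_classes, toy_encoded = binarize_labels(["pos", "neg", "neg", "pos"])
```

With the labels used here, "neg" becomes 0 and "pos" becomes 1, so the network's single output can be read as the probability of a positive review.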

# 4. Explore Data

To get to know the data a little better, you can count how many positive and negative reviews there are.

In [None]:
val, count = np.unique(y_train, return_counts=True)  # count how often each encoded label occurs in y_train

for i, c in enumerate(count):
    # classes_ is sorted, so the encoded value (0 or 1) is also the index of its label
    label = encoder.classes_[val[i]]
    print(label, c)

In [None]:
vocab = tokenizer.word_index  # dictionary that maps each word to its integer index

# 5. Model Data

We now have a matrix of training inputs and an array of the corresponding categories. We can build a neural net and train it with x_train and y_train.

In [None]:
optimizer = 'adam'
loss = 'binary_crossentropy'
batch_size = 10
epochs = 10
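
The loss 'binary_crossentropy' penalizes the network more the further its predicted probability is from the true 0/1 label. As a rough illustration, the sketch below computes the mean binary cross-entropy by hand. The function and its eps clipping value are assumptions chosen for this example and may differ from what Keras does internally.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and predicted probabilities."""
    total = 0.0
    for target, prob in zip(y_true, y_prob):
        # Clip so that log(0) can never occur.
        prob = min(max(prob, eps), 1 - eps)
        total += -(target * math.log(prob) + (1 - target) * math.log(1 - prob))
    return total / len(y_true)


# Same two labels, once with confident correct and once with confident wrong predictions.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
```

The confidently wrong predictions produce a much larger loss than the confidently right ones, which is the signal the optimizer uses to adjust the weights.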

In [None]:
classifier = Sequential()
#First Hidden Layer
classifier.add(Dense(4, activation='relu', kernel_initializer='random_normal', input_dim=vocab_size))
#Second Hidden Layer
classifier.add(Dense(4, activation='relu', kernel_initializer='random_normal'))
#Output Layer
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))
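
To get a feeling for the size of this network, we can count its trainable parameters by hand. A fully connected layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes the default vocab_size of 5000, and the helper name is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


layer_sizes = [5000, 4, 4, 1]  # vocab_size, two hidden layers, output unit
params_per_layer = [
    dense_params(n_in, n_out) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)
```

Almost all parameters sit in the first layer, so changing vocab_size changes the model size roughly in proportion.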

In [None]:
#Compiling the neural network
classifier.compile(optimizer=optimizer, loss=loss,
                   metrics=['accuracy'])

In [None]:
#Fitting the neural network to the training data
classifier.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)

## Save Model

In [None]:
classifier.save('resources/models/defaultModel.h5')

# Classify new Reviews

"classifier" contains our trained neural net. We can use it to classify new film reviews.

In [None]:
own_review = "Greatest film ever"

In [None]:
review_series = pd.Series(own_review)
# use the same mode as for training, otherwise the input is weighted differently
x_review = tokenizer.texts_to_matrix(review_series, mode=mode)

In [None]:
prob = classifier.predict(x_review)
prob
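
The network returns a number between 0 and 1 rather than a label. A minimal sketch of how such a value could be turned into a category is shown below. It assumes that "neg" is encoded as 0 and "pos" as 1 and uses a cut-off of 0.5. The function name and the example probabilities are made up.

```python
def label_from_probability(prob, threshold=0.5, classes=("neg", "pos")):
    """Return classes[1] if prob exceeds the threshold, otherwise classes[0]."""
    return classes[1] if prob > threshold else classes[0]


# Made-up probabilities for illustration only.
example_labels = [label_from_probability(p) for p in (0.93, 0.5, 0.12)]
```

A value of exactly 0.5 falls on the negative side here because the comparison is strict.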

# Evaluate on Eval Data

In [None]:
eval_reviews = data['review'][train_size:]
eval_tags = data['category'][train_size:]
eval_files_names = data['filename'][train_size:]

x_eval = tokenizer.texts_to_matrix(eval_reviews, mode=mode)
y_eval = encoder.transform(eval_tags)

In [None]:
probs = classifier.predict(x_eval)
y_classified = (probs>0.5)

In [None]:
cm = confusion_matrix(y_eval, y_classified)
cm
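
The confusion matrix can be condensed into a few summary numbers. The sketch below assumes the scikit-learn layout for two classes, with true labels as rows and predicted labels as columns, so the matrix reads [[tn, fp], [fn, tp]]. The counts are made up for illustration and are not results from this notebook.

```python
def binary_metrics(cm):
    """Accuracy, precision and recall from a 2x2 matrix laid out as [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when a class is never predicted or never present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up counts for illustration only.
toy_cm = [[40, 10], [5, 45]]
toy_metrics = binary_metrics(toy_cm)
```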

In [None]:
labels = encoder.classes_

fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)
cax = ax.matshow(cm)
fig.colorbar(cax)
ax.set_yticks(np.arange(len(labels)))
ax.set_xticks(np.arange(len(labels)))
ax.set_xticklabels(labels, rotation='vertical')
ax.set_yticklabels(labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

Up to this point, you can try to improve your neural net using the training data. Once you are happy with your configuration, you can evaluate it with the **EvaluateSentimentClassification** notebook.

*** 