# KAIP Week 3 - Tutorial 3
# Supervised Learning using Neural Networks 

### Key Terminology
1. Training, validation and testing set
2. Hyperparameters
3. Cross-validation, Model Selection
4. Activation Function:
    - sigmoid
    - relu 
5. Loss/ Error/ Cost Function
    - Binary Cross Entropy
6. Optimization:
    - adam optimizer
7. Neural Network Classifiers:
    - Multilayer Perceptron (MLP)
    - Convolutional Neural Network (CNN)
    - Recurrent Neural Network (RNN)
    - Long Short Term Memory Network (LSTM)
8. Padding of sequences



### What is Sentiment Analysis?
Sentiment analysis is a natural language processing (NLP) task where the objective is to predict the sentiment (for example, a positive or negative opinion) expressed in a piece of text.


### Dataset
In this tutorial, we will analyze the sentiment of film reviews from the IMDB dataset, which is already **pre-processed** by Keras:

*Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.*

Read more here: https://keras.io/datasets/
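To make the encoding described above concrete, here is a minimal sketch on a toy corpus. The helper names `build_index` and `encode` and the toy reviews are made up for illustration; this is not how Keras implements it internally, only the idea of frequency-ranked indexes with "0" standing in for unknown or filtered-out words.

```python
from collections import Counter

def build_index(corpus):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for review in corpus for word in review.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(review, index, num_words=None, skip_top=0, unknown=0):
    """Encode a review as integers; words outside the kept range become `unknown`."""
    encoded = []
    for word in review.split():
        rank = index.get(word)
        keep = (rank is not None and rank > skip_top
                and (num_words is None or rank <= num_words))
        encoded.append(rank if keep else unknown)
    return encoded

# Hypothetical toy data: "the" is most frequent, then "film", then "was"
toy_corpus = ["the film was great", "the film was bad", "the film", "the"]
toy_index = build_index(toy_corpus)

# Keep only the top 3 words, but drop the single most frequent one
print(encode("the film was great", toy_index, num_words=3, skip_top=1))
```

Here "the" is dropped because it is the most frequent word, and "great" is dropped because it falls outside the top 3, so both are encoded as 0.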

### *Now, let's train some neural network models using the IMDB data!* 😃

Our goal is to learn to **classify** whether the review is POSITIVE or NEGATIVE i.e. supervised learning.

### (1) Import relevant libraries

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

from keras.utils import to_categorical
from keras import models, layers
from keras.datasets import imdb
from keras.models import Sequential
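# Import all layers from keras.layers directly: the keras.layers.convolutional and
# keras.layers.embeddings submodules are not available in recent Keras releases
from keras.layers import Dense, Flatten, LSTM, SimpleRNN, Embedding
from keras.layers import Conv1D, MaxPooling1D, AveragePooling1D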
from keras.preprocessing import sequence

from supervised_nn import *

# reproducibility
np.random.seed(7)

%load_ext autoreload
%autoreload 

### (2) Import Dataset
The dataset that we'll be using today is the IMDB dataset, which is already pre-processed by Keras (more to come on data pre-processing in Week 4!)

The features contain the encoded representation of each review. The labels indicate whether each review is positive or negative.

Labels definition:
     - 0 = negative review
     - 1 = positive review

In [None]:
from keras.datasets import imdb
n_samples = 25000
top_words = 1000
inputs, labels = load_imdb(top_words, n_samples)

### (3) Data Exploration
**All of the dataset** - Investigate the features of the dataset. 

In [None]:
print('Number of samples=' , len(inputs))

In [None]:
print("Unique categories:", np.unique(labels))
print("Number of unique words:", len(np.unique(np.hstack(inputs))))
length = [len(i) for i in inputs]
print("Average review length:", np.mean(length))
print("Standard Deviation:", round(np.std(length)))


**One data sample** - Let's look at what one data sample looks like:

In [None]:
item_num = 0
describe_review(item_num, inputs, labels)

**Question**: Investigate data item number 10 - what is the original review?

**Question**: What do you think are the challenges of working with this dataset?

### (4) Split data into training, validation and test set

In [None]:
x_train, y_train, x_test, y_test = split_dataset(inputs, labels, 0.2)
x_train, y_train, x_val, y_val = split_dataset(x_train, y_train, 0.2)
print('Total Number of samples: ', str(len(inputs)))
print('Number of training samples: ', str(len(x_train)))
print('Number of validation samples: ', str(len(x_val)))
print('Number of testing samples: ', str(len(x_test)))
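The `split_dataset` helper comes from the `supervised_nn` module, whose code is not shown here. As a rough idea of what such a helper might do, here is a minimal sketch under our own assumptions (a random shuffle followed by a hold-out of the requested fraction); the name `split_dataset_sketch` and the toy data are hypothetical, and the real helper may behave differently.

```python
import random

def split_dataset_sketch(inputs, labels, test_fraction, seed=7):
    """Shuffle the samples, then hold out `test_fraction` of them."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [inputs[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [inputs[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test

# Hypothetical toy data: 100 samples with alternating labels
toy_inputs = list(range(100))
toy_labels = [i % 2 for i in toy_inputs]

# Same two-step pattern as above: first hold out a test set, then a validation set
x_tr, y_tr, x_te, y_te = split_dataset_sketch(toy_inputs, toy_labels, 0.2)
x_tr, y_tr, x_va, y_va = split_dataset_sketch(x_tr, y_tr, 0.2)
print(len(x_tr), len(x_va), len(x_te))
```

Note that two successive 20% splits leave 64% of the data for training, 16% for validation and 20% for testing.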

In [None]:
# The length of reviews varies, so we will pad all sequences so that they are of the same length:
max_words = 300
x_train = sequence.pad_sequences(x_train, maxlen=max_words)
x_val = sequence.pad_sequences(x_val, maxlen=max_words)
x_test = sequence.pad_sequences(x_test, maxlen=max_words)


Can you tell what 'padding' does?

In [None]:
x_train[0]
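If the zeros at the start of the array look mysterious, the following sketch may help. By default, Keras `pad_sequences` pads with zeros at the start of short sequences and cuts long sequences from the start. The function `pad_pre` below is our own simplified, pure-Python illustration of that default behaviour for a single sequence, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]      # truncate from the start
    return [value] * (maxlen - len(seq)) + seq  # pad at the start

print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```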

__Question__: Draw the black-box model for the IMDB sentiment analysis problem. What are the inputs? The outputs?

## Keras recipe for Deep Learning
1. Model:
    - Sequential: A linear stack of layers. Read more here: https://keras.io/getting-started/sequential-model-guide/
<br>
2. Layers:
    - Embeddings: Turns positive integers (indexes) into dense vectors (Pre-processing step).
    - Dense: regular densely-connected NN layer.
    - Activation: applies an activation function to the input.
    - Simple RNN: Fully-connected RNN where the output is to be fed back to input.
    - LSTM: Long Short-Term Memory layer - Hochreiter 1997.
    - To read more about all the above layers: https://keras.io/layers/core/

<img src='workflow.png'>

3. Activation Functions: (we won't go into too much detail, but for your information!)
    <img src = 'activations.png'>
    Source: http://rasbt.github.io/mlxtend/user_guide/general_concepts/activation-functions/
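The two activation functions used in this tutorial are simple enough to write by hand. This is a minimal sketch for single numbers (Keras applies them element-wise to whole arrays); the function names are our own.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1). Fine for moderate values of z."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Keep positive values, replace negative values with 0."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), relu(z))
```

The sigmoid output can be read as a probability, which is why it is used in the output layer of a binary classifier.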

## Part 1: Simple Multilayer Perceptron (also known as MLP)

"A multilayer perceptron (MLP) is a fully connected neural network, i.e., all the nodes from the current layer are connected to the next layer. A MLP consisting in 3 or more layers: an input layer, an output layer and one or more hidden layers. Note that the activation function for the nodes in all the layers (except the input layer) is a non-linear function." Source: https://github.com/rcassani/mlp-example


<img src= 'mlp.png'>


MLPs are a very powerful tool because they allow us to very easily perform non-linear tasks.
Let’s begin with a very simple example, from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/, two curves on a plane. The network will learn to classify points as belonging to one or the other.

<img src = 'curves_plane.png'>

A linear classifier, such as logistic regression, would produce something like this:

<img src = 'curves_plane_linear.png'>

However, with a non-linear classifier, such as the MLP, you get the following:

<img src = 'curves_plane_nonlinear.png'>

Why? It's all thanks to the hidden layer! The hidden layer contains a non-linearity (for example the sigmoid from yesterday) that will "squish" the data so that it becomes linearly separable! The output layer can then successfully take a linear combination and correctly classify the output.

<img src = 'curves_plane_squished.png'>

We can look at an animation of what happens in a simple MLP with one hidden layer:
<img src = 'mlp_gif.gif'>

### I. Model Definition

In [None]:
mlp_model = models.Sequential()

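# Input layer: map each word index to a 30-dimensional embedding vector
mlp_model.add(Embedding(top_words, 30, input_length=max_words))

# Hidden layer: Dense is applied to each word position independently,
# because the embedding output has shape (max_words, 30)
mlp_model.add(layers.Dense(20, activation = "relu"))

# Flatten the (max_words, 20) activations into a single vector per review
mlp_model.add(Flatten())

# Output layer: a single sigmoid unit gives the probability of a positive review
mlp_model.add(layers.Dense(1, activation = "sigmoid"))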

# Summarize Model
mlp_model.summary()
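As a sanity check on the summary, the parameter counts can be worked out by hand. The sketch below assumes a vocabulary of 1000 words and reviews padded to 300 words (the values of `top_words` and `max_words` set earlier), and uses the standard counts: an embedding has vocabulary × dimension weights, and a dense layer has inputs × units weights plus one bias per unit. The variable names are our own.

```python
vocab_size, seq_len = 1000, 300   # assumed to match top_words and max_words
embed_dim, hidden_units = 30, 20

embedding_params = vocab_size * embed_dim                # one vector per word index
hidden_params = embed_dim * hidden_units + hidden_units  # shared across word positions
flattened_size = seq_len * hidden_units                  # output size of Flatten
output_params = flattened_size * 1 + 1                   # weights + bias of the sigmoid unit

total = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total)
```

If you change `top_words` or `max_words`, the numbers in the summary should change accordingly.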

### II. Model Compilation

In [None]:
mlp_model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])
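To get a feel for the `binary_crossentropy` loss, here is a minimal pure-Python sketch of the formula (the mean of -[y·log(p) + (1-y)·log(1-p)] over the samples). The function name and the clipping constant `eps` are our own choices for illustration; Keras has its own implementation.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

The loss punishes confident mistakes much more heavily than it rewards confident correct answers, which pushes the network towards well-calibrated probabilities.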

### III. Model Fitting or "Learning"

In [None]:
# Select hyperparameters
epochs = 4
batch_size = 128

results_mlp = mlp_model.fit(x_train, y_train, epochs= epochs, batch_size = batch_size, validation_data = (x_val, y_val))

### IV. Model Evaluation 
Evaluate the model in terms of accuracy, where accuracy = (number of correctly classified reviews) / (total number of reviews).

In [None]:
scores_mlp = mlp_model.evaluate(x_test, y_test, verbose=0)
print("MLP Accuracy: %.2f%%" % (scores_mlp[1]*100))
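Accuracy is the fraction of reviews that the model classifies correctly. Because the output layer produces a probability, a prediction counts as "positive" when that probability is above 0.5. The sketch below shows the computation on made-up labels and probabilities; the function name and toy values are our own.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)

# Hypothetical labels and predicted probabilities
toy_labels = [1, 0, 1, 1]
toy_probs = [0.8, 0.3, 0.4, 0.9]
print(accuracy(toy_labels, toy_probs))  # 0.75: the third review is misclassified
```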

**Question**: In your team, look up two advantages and two disadvantages of using an MLP. What kind of applications are appropriate for MLPs? (10 mins)

## Part 2: Convolutional Neural Network (also known as CNN)


Convolutional Neural Networks have been a revolution in the field of computer vision, because they can detect "features" in an image and combine them (first they see lines, then combine them into edges, then into high-level features like eyes, faces, ...).
A big emerging field is feature visualisation, which is closely tied to explainability and to opening the black box.

First, what does a CNN look like?

<img src = 'convnet_pic.png'>

You can see that each filter performs "convolutions", i.e. it looks at a small square of the image and sweeps across the whole input. It is this property that allows the network to see "real" things. Let's take a look at what the CNN sees.

<img src = 'layer_viz_colah.png'>

image source: https://distill.pub/2017/feature-visualization/ (distill is, in the authors' opinion, one of the best machine learning blogs out there).


But it turns out CNNs are not restricted to images! Just like we detect patterns in an image, we can use convolutions on sequences of words to recognize meaningful semantic patterns.

<img src = 'conv_1D_2D.png'>

image source : https://blog.goodaudience.com/introduction-to-1d-convolutional-neural-networks-in-keras-for-time-sequences-3a7ff801a2cf


<br>

Besides the fact that CNNs are very powerful for extracting meaningful features, they have two other big advantages.
- They use mathematical operations called convolutions (hence the name CNN), which can be computed extremely efficiently on graphics cards (GPUs). This has led GPU manufacturing companies such as Nvidia to grow immensely, and become leaders in the field of AI.
- They have far fewer parameters than MLPs, which are "fully connected" networks, i.e. each neuron in one layer is connected to each neuron in the next layer. This also makes them very fast to train.
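Both points can be illustrated with a few lines of plain Python. The sketch below slides a small kernel over a 1D sequence, then compares parameter counts under assumed sizes (32 input channels, 32 filters of width 3, sequences of length 300). The function name and the dense layer used for comparison are our own, for illustration only.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` and take a dot product at each position.

    As in most deep learning libraries, the kernel is not flipped
    (strictly speaking this is cross-correlation).
    """
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# The same 3 weights are reused at every position of the sequence
print(conv1d_valid([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2, -2, -2]

# Parameter comparison under the assumed sizes
conv_params = 3 * 32 * 32 + 32        # kernel_size * in_channels * filters + biases
dense_params = (300 * 32) * 32 + 32   # dense layer with 32 units on the flattened input
print(conv_params, dense_params)
```

Because the kernel weights are shared across positions, the convolution needs roughly a hundred times fewer parameters than the fully connected alternative in this example.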

### I. Model Definition

In [None]:
# create the model
cnn_model = Sequential()
cnn_model.add(Embedding(top_words, 32, input_length=max_words))
cnn_model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
cnn_model.add(AveragePooling1D(pool_size=2))
cnn_model.add(Flatten())
cnn_model.add(Dense(1, activation='sigmoid'))
# summary() prints the table itself and returns None, so no print() is needed
cnn_model.summary()

### II. Model Compilation

In [None]:
cnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

### III. Model Fitting or "Learning"

In [None]:
epochs = 3
batch_size = 128

cnn_model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=epochs, batch_size=batch_size)

### IV. Model Evaluation 

In [None]:
scores_cnn = cnn_model.evaluate(x_test, y_test, verbose=0)
print("CNN Accuracy: %.2f%%" % (scores_cnn[1]*100))

## Part 3: Recurrent Neural Networks (also known as RNN)


Traditional neural networks do not preserve temporal or sequential information. Recurrent neural networks were developed in the 1980s to address this; an early example is the Hopfield network, introduced by John Hopfield in 1982. <br>


<img src='RNN-unrolled.png'> <br>
Img reference and excellent resource: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
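The loop in the picture can be written in a few lines. This is a minimal sketch of a recurrent cell with a single scalar state, using the usual recurrence h_t = tanh(w_x · x_t + w_h · h_{t-1} + b); the function name and the weight values are made up for illustration, and a real layer such as `SimpleRNN` works with vectors and weight matrices instead.

```python
import math

def simple_rnn(inputs, w_x, w_h, b, h0=0.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states still "remember" it
print(simple_rnn([1.0, 0.0, 0.0], w_x=0.5, w_h=0.8, b=0.0))
```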

**Question**: In your team, research 3 different applications that RNNs are useful for. What are common features of the data across all applications?

They come in different forms depending on the application: <br>
<img src='rnn_types_1.png'>
Img reference and excellent resource: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

**Question**: Can you think of examples for each of the different RNN structures above?

Now let's train our own RNN! 

### I. Model Definition

In [None]:
rnn_model = Sequential()
rnn_model.add(Embedding(top_words, 32, input_length=max_words))
rnn_model.add(SimpleRNN(20))
rnn_model.add(Dense(1, activation='sigmoid'))
rnn_model.summary()

### II. Model Compilation

In [None]:
rnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

### III. Model Fitting or "Learning"

In [None]:
epochs=3 
batch_size=128

rnn_model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=epochs, batch_size=batch_size)


### IV. Model Evaluation 

In [None]:
# Final evaluation of the model
scores_rnn = rnn_model.evaluate(x_test, y_test, verbose=0)
print("RNN Accuracy: %.2f%%" % (scores_rnn[1]*100))

**A problem of RNNs that has been thoroughly explored in research is**: LONG-TERM DEPENDENCIES!

Example: Fill in the blank.
- easy task: "the camel is in the ?" --> easy to predict desert!
- more difficult task: "I grew up in the UAE… I speak fluent ?" --> need to look further back to guess that it's Arabic!


In theory, RNNs are capable of handling such "long-term dependencies". However, experiments have shown that in practice they struggle to learn them, which is why LSTMs were developed!
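One commonly cited reason is that the influence of an early input fades as it is passed through the recurrence again and again. The sketch below illustrates this with a deliberately simplified linear recurrence, h_t = w_h · h_{t-1} + x_t, with a scalar weight and no activation function. It is an illustration under those assumptions, not a model of the `SimpleRNN` layer, and the function names are our own.

```python
def final_state(inputs, w_h):
    """Run the linear recurrence h_t = w_h * h_{t-1} + x_t and return the last state."""
    h = 0.0
    for x in inputs:
        h = w_h * h + x
    return h

def influence_of_first_input(n_steps, w_h):
    """How much the final state changes when the first input goes from 0 to 1."""
    zeros = [0.0] * (n_steps + 1)
    bumped = [1.0] + [0.0] * n_steps
    return final_state(bumped, w_h) - final_state(zeros, w_h)

for n_steps in (1, 10, 50):
    print(n_steps, influence_of_first_input(n_steps, w_h=0.5))
```

With a recurrent weight of 0.5, the first word's contribution is halved at every step, so after 50 steps it is practically invisible.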

## Part 4: Long Short Term Memory Networks (also known as LSTM)

It is a VARIANT of RNNs!

"Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used." (colah's blog)


- 1) Simple RNN
    <img src='rnn_in.png'>


- 2) LSTM
    <img src='lstm_in.png'>

### I. Model Definition

In [None]:
# create the model
lstm_model = Sequential()
lstm_model.add(Embedding(top_words, 32, input_length=max_words))
lstm_model.add(LSTM(20))
lstm_model.add(Dense(1, activation='sigmoid'))
lstm_model.summary()
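To see what an LSTM computes at each step, here is a minimal sketch of one LSTM cell with scalar state, following the standard gate equations (forget, input and output gates plus a candidate value). The function names and the parameter values are hypothetical and chosen only to show how a cell state can persist over many steps; the Keras `LSTM` layer works with vectors and learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell. `params` maps each gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to add to the cell state
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: a forget gate biased towards "keep" lets the memory persist
toy_params = {"forget": (0.0, 0.0, 5.0), "input": (0.0, 0.0, -5.0),
              "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for x in [0.0] * 50:
    h, c = lstm_step(x, h, c, toy_params)
print(round(c, 3))  # most of the initial cell state survives 50 steps
```

The key design idea is the additive update of the cell state `c`: when the forget gate is close to 1, information can be carried across many time steps.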

### II. Model Compilation

In [None]:
lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

### III. Model Fitting or "Learning"

In [None]:
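# Validate on the validation set (not the test set), so the test set stays unseen until the final evaluation
lstm_model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=3, batch_size=64)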

### IV. Model Evaluation 

In [None]:
# Final evaluation of the model
scores_lstm = lstm_model.evaluate(x_test, y_test, verbose=0)
print("LSTM Accuracy: %.2f%%" % (scores_lstm[1]*100))

**Question**: In your team, research the advantages of LSTMs over simple RNNs! What are the disadvantages of LSTMs?

## Part 5: Linear classifier : Logistic regression (aka a simple benchmark)


So far, we have used some of the fanciest methods for NLP. But what about a linear classifier, like the ones we used yesterday?

### I. Model Definition

In [None]:
lr_model = models.Sequential()

# Input layer
lr_model.add(Embedding(top_words, 32, input_length=max_words))
lr_model.add(Flatten())

# Output layer 
lr_model.add(layers.Dense(1, activation = "sigmoid"))

# Print the model summary
lr_model.summary()

**Question**: What's another way to run logistic regression in Python? *Hint: scikit learn*

### II. Model Compilation

In [None]:
lr_model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])

### III. Model Training

In [None]:
# Select hyperparameters
epochs = 5
batch_size = 128

results_lr = lr_model.fit(x_train, y_train, epochs= epochs, batch_size = batch_size, validation_data = (x_val, y_val))

### IV. Model Evaluation

In [None]:
scores_lr = lr_model.evaluate(x_test, y_test, verbose=0)
print("Logistic Regression Accuracy: %.2f%%" % (scores_lr[1]*100))

## Part 6: Benchmark all models

In [None]:
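scores = [scores_lr[1], scores_mlp[1], scores_rnn[1], scores_lstm[1], scores_cnn[1]]
# Use a new name here: `models` would shadow the keras `models` module imported at the top
model_names = ['LR', 'MLP', 'RNN', 'LSTM', 'CNN']
plt.plot(model_names, scores, 'ro', markersize=15)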

To read more:
https://towardsdatascience.com/how-to-build-a-neural-network-with-keras-e8faa33d0ae4
https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

### Summary

In this tutorial, we've explored the IMDB sentiment analysis dataset in NLP. We've developed five different models (deep learning models and a benchmark logistic regression) and we compared their performances. 

We hope that throughout the tutorial, you gained a high-level, intuitive understanding of how these different models work. 
Besides that, there are three other very important take-home messages: 
- Always try simple models! As we saw, the linear classifier performs extremely well on this problem! Of course, this also has to do with the fact that non-linear methods such as CNNs, RNNs, etc. need (a lot) more design and adjustment. Indeed, the best performing networks on this problem have achieved 99% accuracy! See https://www.kaggle.com/c/word2vec-nlp-tutorial
- Data preprocessing (cleaning, but also dimensionality reduction) is absolutely crucial. The original Stanford paper on IMDB sentiment analysis, http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf , painstakingly achieved 87% accuracy. A couple of years later, word embeddings were popularized, and we can now achieve this score with a simple linear classifier! Data pre-processing and preparation will be the subject of Week 4.
- Evaluate the uncertainty (confidence intervals, standard deviation, etc) of your metrics using bootstrapping or cross validation to check for statistical significance.
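As an illustration of the last point, here is a minimal sketch of a percentile bootstrap confidence interval for accuracy, in plain Python. The function name, the number of resamples and the toy predictions are our own assumptions; in practice you would pass in the true test labels and the model's predicted labels.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=7):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    accuracies = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        accuracies.append(sum(sample) / n)
    accuracies.sort()
    lower = accuracies[int((alpha / 2) * n_resamples)]
    upper = accuracies[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical test set of 100 reviews where the model gets 85 right
toy_true = [1, 0] * 50
toy_pred = [1 - y for y in toy_true[:15]] + toy_true[15:]
print(bootstrap_accuracy_ci(toy_true, toy_pred))
```

If the intervals of two models overlap heavily, the difference between their accuracies may simply be noise.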