# Neural Networks and Deep Learning

Tim Repke, Julian Risch

July 2017

Some initial notes (not part of presentation)

# Artificial Neural Networks
- point
- point
- point
- point

some more on NN basics

## Formal Definition
$y=wx+b$
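
As a rough illustration of the formula above, the following sketch computes the output of a single neuron in plain Python. The function name `dense_forward` and the example values are made up for this illustration and are not part of any library.

```python
def dense_forward(weights, inputs, bias):
    """Compute y = w*x + b for one neuron; weights and inputs are equal-length lists."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Hypothetical example values, chosen only for illustration.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
b = 0.25

y = dense_forward(w, x, b)
print(y)
```

A full layer simply repeats this computation once per neuron, each with its own weights and bias.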


some more math in notes

## How do they learn?
- output neurons
- error "diff" function (loss, objective,...)
- error back-propagated using gradient
- weights and biases updated accordingly
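
The steps above can be sketched for the smallest possible case: a single neuron $y=wx+b$ trained with the squared error. This is a minimal sketch under simplifying assumptions (one input, one output, updates after every sample); `train_neuron`, the learning rate and the toy data are illustrative choices, not how Keras implements training.

```python
def train_neuron(samples, learning_rate=0.05, epochs=500):
    """Fit y = w*x + b to (x, target) pairs by gradient descent on 0.5 * error**2."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            prediction = w * x + b          # forward pass through the output neuron
            error = prediction - target     # derivative of the loss w.r.t. the prediction
            w -= learning_rate * error * x  # gradient w.r.t. the weight
            b -= learning_rate * error      # gradient w.r.t. the bias
    return w, b

# Toy data generated from y = 2x + 1 (made up for this example).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_neuron(samples)
print(round(w, 3), round(b, 3))
```

In a deeper network the same error signal is propagated backwards through every layer using the chain rule.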

## Other Types of Neural Networks
- **ANN/NN:** Artificial Neural Networks *(1D input, what we just saw)*
  - good for 1D input classification or regression
  - learning embeddings (reduction into lower dimensional "semantic" space)

- **CNN:** Convolutional Neural Networks *(2D input)*
  - 2D input classification; ideal for images, sometimes *fixed* length sequences
  - image segmentation (masking areas)

- **RNN:** Recurrent Neural Networks *(sequential input)*
  - sequence to sequence models (e.g. translation, tagging, summarisation, stock prognosis)
  - sequence to scalar also possible (e.g. classifying entire sequences)

## Neural Networks - Hands on!
First, we load some general dependencies we may need all the time.

In [None]:
import numpy as np
import pandas as pd
from pprint import pprint
from matplotlib import pyplot as plt
import os
import json

### The Dataset
TBD

### Loading the dataset

In [None]:
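from keras.utils import get_file
path = get_file(os.getcwd()+'/datasets/reuters.npz', origin='https://s3.amazonaws.com/text-datasets/reuters.npz')
# the samples are stored as an object array of lists, which newer NumPy versions only load with allow_pickle=True
with np.load(path, allow_pickle=True) as f:
    reuters_data, labels = f['x'], f['y']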

In [None]:
print('Number of samples:', reuters_data.shape)
print('Part of the first sample:', reuters_data[0][:10])

Why are there only numbers? Isn't that supposed to be text? Indeed it is, but encoded by a dictionary, which we can load as well:

In [None]:
path = get_file(os.getcwd()+'/datasets/reuters_word_index.json', origin='https://s3.amazonaws.com/text-datasets/reuters_word_index.json')
with open(path) as f:
    reuters_words = json.load(f)
    reuters_words_inverse = {v:k for k,v in reuters_words.items()}

In [None]:
print("Dictionary size:", len(reuters_words))
print("Word index:")
pprint(dict(list(reuters_words.items())[:5]))
print("\nReverse word index:")
pprint(dict(list(reuters_words_inverse.items())[:5]))

Let's "decode" some of those sentences!

In [None]:
' '.join([reuters_words_inverse[wi] for wi in reuters_data[42]])

Getting to know the data is important.
One interesting aspect is the distribution of labels.
An imbalance might require special attention during training.
It also means that we need to account for it in the evaluation later on, especially when creating the train/test split!
You'll see some parameters later on that relate to the distribution of labels, e.g. `stratify` and `balanced`.

In [None]:
%matplotlib inline

# a small pandas hack for "simplicity"
label_distribution = pd.DataFrame({'labels':labels}).groupby('labels').labels.count()

# make a barplot
label_distribution.plot.bar(figsize=(12,5))

# look at the bare numbers
label_distribution.to_frame().T # remove the '.T' to see the full list vertically

Feel free to play with the data. 
- What's the distribution of words?
- What's the distribution of words in different topics?
- How long are the text samples?
- ...

#### The easier way
Keras also has some [built in](https://keras.io/datasets/) data loading functions that essentially do what we did above.
That was just to show you how to go about things when you have your own data. 

In [None]:
from keras.datasets import reuters
max_words = 50
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words, test_split=0.2)

### Preparing the dataset
There are two things we look at in this section. 

**Train/Test split**

We split our data into two subsets.
One, the bigger chunk, we use for training the model, the other for testing it.
This way we have some data the model has never seen before.
If it classifies those samples reasonably well, we can take that as an indicator that it generalises to other examples as well.
However, there are several aspects one should consider depending on the dataset and task, like overlapping concepts or data.
Here, we neglect all that for simplicity though.

**Encoding the Text**

At its core, a neural network, like any other machine learning algorithm, is pure math. Text isn't. Thus we need to encode it as numerical input, as described further below.


----

**> Subsampling** data down to 4 topics

We just do that here in the example to make things easier and the evaluation less complex.

In [None]:
selected_labels = [4,11,16,19]
selected_labels_index = {l:i for i,l in enumerate(selected_labels)}
selection = [l in selected_labels for l in labels]
reuters_data_sampled = reuters_data[selection]
labels_sampled = labels[selection]

In [None]:
print('Original size of dataset:', len(labels))
print('Size of subset for',len(selected_labels),'labels:', len(labels_sampled))

**> Splitting the data** using scikit-learn [utility functions](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

Note the parameter `stratify`, to which we hand our labels. This tells the function not to sample completely at random, but to make sure that samples from all classes end up in both sets and that the relative class distribution is about the same in each.
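
To make the idea of stratification concrete, here is a minimal plain-Python sketch that splits each class separately. It only illustrates the principle; the function `stratified_split` and the toy labels are hypothetical, and scikit-learn's actual implementation differs in its details.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_indices, test_indices) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same share of every class goes to test
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy labels with an 80/20 imbalance (made up for this example).
toy_labels = ['a'] * 80 + ['b'] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
print(sum(1 for i in test_idx if toy_labels[i] == 'b'), 'samples of class b in the test set')
```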

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reuters_data_sampled, labels_sampled,
                                                    test_size=0.30, random_state=42,
                                                    stratify=labels_sampled)

In [None]:
print('Size training set:', len(X_train))
print('Size testing set:', len(X_test))

Usually you would split your data into three parts:
- one for **train**ing (usually the biggest chunk),
- one for development **test**ing, 
- and one for a final **eval**uation.

The last one you never look at or touch during your development and experiments. 
Only at the very end you will test the performance of what you determined to be your best model on that.
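
A three-way split can be sketched on sample indices as follows. The function `three_way_split` and the fractions are assumptions made for this illustration; for real data you would likely also want to stratify by label.

```python
import random

def three_way_split(n_samples, test_fraction=0.15, eval_fraction=0.15, seed=42):
    """Shuffle the indices 0..n_samples-1 and cut them into train/test/eval parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_eval = round(n_samples * eval_fraction)
    n_test = round(n_samples * test_fraction)
    eval_idx = indices[:n_eval]                 # set aside, untouched until the very end
    test_idx = indices[n_eval:n_eval + n_test]  # used during development
    train_idx = indices[n_eval + n_test:]       # the biggest chunk
    return train_idx, test_idx, eval_idx

train_idx, test_idx, eval_idx = three_way_split(1000)
print(len(train_idx), len(test_idx), len(eval_idx))
```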

**> Encoding** the text-data

In our first example with a simple neural network, we use a simple [bag-of-words (BoW)](https://en.wikipedia.org/wiki/Bag-of-words_model) encoding for our text.

Keras provides a [utility function](https://keras.io/preprocessing/text/#tokenizer), which we use to encode the text.
It takes the first `max_words` entries in the dictionary and creates a vector of that length for each sample. 
Each position in that vector corresponds to a word's index in the dictionary and is set to 1 if the word appears in the sample, 0 if not.
In other modes, some other score is used instead of a 1 (e.g. the number of occurrences in the sample).
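
The encoding described above can be sketched in a few lines of plain Python. This is a simplified stand-in written for illustration, not the Keras implementation; the function name `sequences_to_bow` and the toy sequences are made up.

```python
def sequences_to_bow(sequences, num_words, mode='binary'):
    """Encode each sequence of word indices as a vector of length num_words."""
    matrix = []
    for sequence in sequences:
        row = [0] * num_words
        for word_index in sequence:
            if word_index >= num_words:
                continue                 # words outside the vocabulary are dropped
            if mode == 'binary':
                row[word_index] = 1      # word is present
            else:
                row[word_index] += 1     # 'count' mode: number of occurrences
        matrix.append(row)
    return matrix

# Two toy samples given as lists of word indices (made up for this example).
toy_sequences = [[1, 3, 3], [2, 7]]
binary = sequences_to_bow(toy_sequences, num_words=5)
counts = sequences_to_bow(toy_sequences, num_words=5, mode='count')
print(binary)
print(counts)
```

Note that the word order is lost in this representation, which is why it is called a *bag* of words.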

Hints for experiments:
- Try different vocab sizes
- Try different modes
- Look into the [implementation](https://github.com/fchollet/keras/blob/master/keras/preprocessing/text.py#L297) and do something like dropping stopwords (extremely frequently used words)

In [None]:
from keras.preprocessing.text import Tokenizer
max_words=5000
tokenizer = Tokenizer(num_words=max_words)
X_train_tokenised = tokenizer.sequences_to_matrix(X_train, mode='binary')
X_test_tokenised = tokenizer.sequences_to_matrix(X_test, mode='binary')

In [None]:
print('X_train shape:', X_train.shape, '| X_train_tokenised shape:', X_train_tokenised.shape)
print('X_test  shape:', X_test.shape,  '| X_test_tokenised  shape:', X_test_tokenised.shape)

**> Encoding** the labels

We interpret the output of the last layer of our network as the predicted label.
Naively, we could use a single output neuron that outputs the number of the corresponding class.
However, it has proven to work significantly better to one-hot encode the labels, such that the last layer has as many output neurons as there are unique labels.
Each output neuron corresponds to one label.
Using the [softmax function](https://en.wikipedia.org/wiki/Softmax_function), we can interpret the output as a confidence score for that label: the higher (closer to 1), the more confident the network is.
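
Both ideas fit into a short plain-Python sketch. The helper names `one_hot` and `softmax` as well as the raw scores are chosen for this illustration only.

```python
import math

def one_hot(class_index, num_classes):
    """Vector with a 1.0 at the position of the class and 0.0 everywhere else."""
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]

def softmax(scores):
    """Turn arbitrary scores into positive values that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of four output neurons for one sample.
probs = softmax([2.0, 1.0, 0.1, -1.0])
predicted = probs.index(max(probs))  # the class with the highest confidence
print([round(p, 3) for p in probs], predicted)
print(one_hot(2, 4))
```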

In [None]:
from keras.utils import to_categorical
y_train_ = np.array([selected_labels_index[ty] for ty in y_train])
y_test_ = np.array([selected_labels_index[ty] for ty in y_test])
Y_train = to_categorical(y_train_, len(selected_labels))
Y_test = to_categorical(y_test_, len(selected_labels))

In [None]:
print('Y_train shape:', Y_train.shape)
print('Y_test shape:', Y_test.shape)

### Building the Neural Network
One hidden layer with 512 neurons with ReLU [activation](https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions); Softmax activation at the output layer; Optimisation using [Adam](https://arxiv.org/abs/1412.6980v8)

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dense(len(selected_labels)))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
model.summary()

Hints for experiments:
- Try adding another layer
- Check out the [documentation](https://keras.io/layers/core/)
- Use different sizes of hidden layers
- Use different weight initialisations
- Use different [activation](https://keras.io/activations/) functions
- Use different (or no) dropout
- Use different [loss](https://keras.io/losses/) functions
- Use different [optimisers](https://keras.io/optimizers/)

Below you see a different syntax with some additional parameters.

In [None]:
from keras.regularizers import l2

model = Sequential([
    Dense(max_words, activation='linear', input_dim=max_words, 
          kernel_initializer='he_uniform'),
    Dense(2048, activation='relu', bias_regularizer=l2(0.01),
          kernel_initializer='he_uniform', kernel_regularizer=l2(0.01)),
    Dropout(0.5),
    Dense(len(selected_labels), activation='softmax',
          kernel_initializer='he_uniform')
])

model.compile(loss='msle',
              optimizer='adam',
              metrics=['accuracy'])

**Class weight computation**

As we have seen earlier, the distribution of samples across classes is somewhat unbalanced.
This imbalance can bias our classification model, and there are different approaches to counteract that.
One is to use weights during training, such that the error of samples from over-represented classes is scaled down by a factor and, vice versa, amplified for samples of under-represented classes.
For further reading, you could look up keywords like over- or under-sampling as well.
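
One common way to derive such weights is the heuristic `n_samples / (n_classes * class_count)`, which is what scikit-learn documents for its `'balanced'` setting. The plain-Python sketch below reproduces that formula; the function name and the toy labels are made up for this illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}

# Toy labels: class 0 is over-represented, class 2 is rare (made up for this example).
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rare classes receive weights above 1 and frequent classes weights below 1, so each class contributes about equally to the total loss.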

In [None]:
from sklearn.utils import compute_class_weight
# newer scikit-learn versions only accept these arguments as keywords
class_weights = {key: value for key, value in enumerate(
                     compute_class_weight(class_weight='balanced',
                                          classes=np.arange(len(selected_labels)),
                                          y=np.array(y_train_)))}
class_weights

### Training the Model

There are quite a few things going on here!
`epochs` sets the number of times we let the network see the entire training set.
The batch size determines how many samples are processed per weight update.
Essentially, training grabs a batch of samples, does the forward passes, and only then does one back-propagation step based on the gradients of the whole batch.
This saves time compared to calculating gradients and updating all variables for each training sample.
The class weight can be used to amplify the error for samples from under-represented classes.
The validation split sets a small subset of *training* samples aside for testing during the training run.
If we had a proper train/test/eval split, we would use the test split for that (or a portion of it, because calculating these metrics takes time).
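
To illustrate how the batch size relates to the number of weight updates per epoch, here is a small sketch that cuts a dataset into batches. The generator `iterate_minibatches` is a hypothetical helper and ignores shuffling, which a real training loop would typically do before each epoch.

```python
def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# 100 placeholder samples; in practice these would be the encoded texts.
samples = list(range(100))
batches = list(iterate_minibatches(samples, batch_size=32))

updates_per_epoch = len(batches)  # one weight update per batch
print(updates_per_epoch, [len(batch) for batch in batches])
```

With a batch size of 1 the same data would cause 100 updates per epoch, which is slower but follows each individual sample more closely.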

In [None]:
batch_size = 32
epochs = 5
history = model.fit(X_train_tokenised, Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    #class_weight=class_weights,
                    validation_split=0.15)

### Testing the Model

Keras has a simple evaluation function built into the model, which can be used like this:

In [None]:
score = model.evaluate(X_test_tokenised, Y_test,
                       batch_size=batch_size, verbose=1)
print('\nTest score:', score[0])
print('Test accuracy:', score[1])

We want to dig much deeper, so below you'll find a number of helpful [scikit-learn functions](http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).

Get model predictions for test data

In [None]:
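# `predict` returns the softmax outputs; the older `predict_proba` alias has been removed from recent Keras versions
y_pred_proba = model.predict(X_test_tokenised)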
y_pred = y_pred_proba.argmax(axis=1)

The **classification report**

For each class, it shows the precision, recall and their harmonic mean (F1-score).
The support shows how many samples are in that particular class (based on the test labels).
We also calculate the accuracy.
Both functions accept a `sample_weight` argument, which can be used to correct for class imbalances.
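
To see where these numbers come from, the following sketch computes them by hand for one class. The function `precision_recall_f1` and the toy predictions are made up for this illustration; returning 0.0 for undefined ratios is a convention chosen here.

```python
def precision_recall_f1(y_true, y_pred, positive_class):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_class and p == positive_class)
    fp = sum(1 for t, p in pairs if t != positive_class and p == positive_class)
    fn = sum(1 for t, p in pairs if t == positive_class and p != positive_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn  # support = number of true samples of the class

# Toy labels and predictions for three classes (made up for this example).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

for cls in range(3):
    print(cls, precision_recall_f1(y_true, y_pred, cls))
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print('accuracy', accuracy)
```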

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

print(classification_report(y_test_, y_pred, target_names=['Topic '+str(sl) for sl in selected_labels]))
print("Accuracy: {0:.3f}".format(accuracy_score(y_test_, y_pred)))

The **confusion matrix**

Based on the confusion matrix you can figure out what kind of mistakes the trained model makes. 
Each row corresponds to the correct class, each column to the predicted class. 
The perfect classifier would only contain non-zero numbers along the diagonal.
Looking at the first row, you can see which other topics the samples of that topic are most often confused with.
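
The layout of such a matrix can be reproduced with a few lines of plain Python. The helpers `confusion_counts` and `normalise_rows` and the toy data are hypothetical and only serve to show which cell counts what.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """matrix[t][p] counts the samples of true class t that were predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def normalise_rows(matrix):
    """Divide every row by its sum, so each true class adds up to 1."""
    return [[count / sum(row) if sum(row) else 0.0 for count in row] for row in matrix]

# Toy labels and predictions for three classes (made up for this example).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

matrix = confusion_counts(y_true, y_pred, num_classes=3)
for row in matrix:
    print(row)
print(normalise_rows(matrix)[0])
```

Normalising per row makes classes of different sizes comparable, because every row then shows fractions of that class rather than absolute counts.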

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test_, y_pred))

To get a more visually appealing result, you need a bit more code. Note that the normalisation is done *per class* (per row) for convenience, not globally!

In [None]:
%matplotlib inline
from matplotlib import cm as colm
import matplotlib

conf_matrix_full = confusion_matrix(y_test_, y_pred, labels=list(range(len(selected_labels))))

matplotlib.rc('font', **{'size':12})
plt.figure(figsize=(6,4))
sub = plt.subplot(111)
normed = (conf_matrix_full.T/conf_matrix_full.sum(axis=1)).T
plt.imshow(normed, cmap=colm.Blues, vmax=normed.max()*1.4)
sub.set_yticks(list(range(len(selected_labels))))
sub.set_yticklabels(['Topic '+str(sl) for sl in selected_labels])
sub.set_xticks(list(range(len(selected_labels))))
sub.set_xticklabels(['Topic '+str(sl) for sl in selected_labels])

for i in range(normed.shape[0]):
    for j in range(normed.shape[1]):
        v = normed.T[i][j]
        c='%.2f'%v if v>0.005 else ''
        sub.text(i, j, c, va='center', ha='center')
        
plt.tight_layout()
plt.show()

**Area Under the Curve** and **Precision vs Recall curve**

As described earlier, we have one output neuron per class and interpret its activation as the confidence.
Above we made the simple choice of taking, for each sample, the class whose neuron has the highest activation.
We can also look at the precision-recall curve, where we slowly increase the threshold of the confidence score we accept.
When we demand a very high confidence close to 1, the precision will (hopefully) be very good.
However, we may lose a lot of samples where there's more uncertainty, and thus the recall drops.

The **precision-recall curve describes the tradeoff** we can make.
In the code below, you can print the variables returned by the sklearn function to see the range of thresholds.
The AUC score is simply the area under the curve just described.
A desirable model will have a high AUC, meaning the curve stays high and stable.
The perfect classifier, which never fails, would have a constant precision of 1 for all recall values.
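
The tradeoff can be made tangible by computing precision and recall by hand for a few thresholds. This is a small sketch with made-up confidence scores; the function `precision_recall_at` is hypothetical, and reporting a precision of 1.0 when nothing is predicted positive is a convention chosen for this example.

```python
def precision_recall_at(scores, is_positive, threshold):
    """Precision and recall when every score >= threshold counts as a positive prediction."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, is_positive) if p and y)
    fp = sum(1 for p, y in zip(predicted, is_positive) if p and not y)
    fn = sum(1 for p, y in zip(predicted, is_positive) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up confidence scores for one class and whether the sample really belongs to it.
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]
is_positive = [True, True, False, True, False, False]

for threshold in (0.3, 0.5, 0.7, 0.85):
    precision, recall = precision_recall_at(scores, is_positive, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

In this toy data, raising the threshold from 0.3 to 0.85 lifts the precision from 0.6 to 1.0, while the recall falls from 1.0 to about 0.67.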

In [None]:
%matplotlib inline
from sklearn.metrics import precision_recall_curve, auc
plt.clf()
for i, f in enumerate(['Topic '+str(sl) for sl in selected_labels]):
    precision, recall, thresholds = precision_recall_curve(y_test_ == i, y_pred_proba[:, i])
    auc_ = auc(recall, precision)
    plt.plot(recall, precision, label="{}, AUC={:.4}".format(f, auc_))
plt.ylabel('Precision')
plt.xlabel('Recall')
plt.ylim(0, 1.1)
plt.xlim(0, 1.1)
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
plt.grid()
plt.show()

The **training curve**

Remember that `history` variable we stored earlier during training? That contains all the values you see when you train the model with the verbose flag set to 1. We can also visualise it for convenience.

Things to look out for are as follows: The **loss** is the value the optimiser aims to minimise *(also called objective, error,...)*. It describes, based on the loss function, how far off the current prediction is from the correct response. The loss for the training samples should go down and converge. If something else happens, reconsider the data encoding or the setup as a whole. The evaluation loss is the same score calculated for the held-out validation data. It should behave very similarly; if it does not, your model might be over-fitting.

Essentially the same applies to the **accuracy**, which is calculated after each epoch, although it should go up, not down. When the curves for the training and evaluation data drift apart, you have to pay close attention to your results.
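
One simple way to spot diverging curves numerically is to compare the two loss series epoch by epoch. The sketch below uses invented loss values purely for illustration; `best_epoch` and `generalisation_gap` are hypothetical helper names.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

def generalisation_gap(train_losses, val_losses):
    """Validation loss minus training loss for every epoch."""
    return [v - t for t, v in zip(train_losses, val_losses)]

# Invented loss values that show a typical over-fitting pattern.
train_loss = [1.20, 0.80, 0.55, 0.40, 0.30, 0.22]
val_loss = [1.25, 0.90, 0.70, 0.68, 0.74, 0.85]

gaps = generalisation_gap(train_loss, val_loss)
print('best epoch:', best_epoch(val_loss))
print('gap per epoch:', [round(g, 2) for g in gaps])
```

Here the validation loss is lowest at epoch 3 (counting from 0) and rises afterwards while the training loss keeps falling, which is the typical sign of over-fitting.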

Some **keywords to search for**: overfitting, underfitting, vanishing gradient, exploding gradient

In [None]:
%matplotlib inline
plt.subplot(121)
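# older Keras versions log 'acc'/'val_acc', newer ones 'accuracy'/'val_accuracy'
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key], label='accuracy')
if 'val_' + acc_key in history.history:
    plt.plot(history.history['val_' + acc_key], label='eval_accuracy')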
plt.legend()#loc='lower right')

plt.subplot(122)
plt.plot(history.history['loss'], label='loss')
if 'val_loss' in history.history:
    plt.plot(history.history['val_loss'], label='eval_loss')
plt.legend()#loc='upper right')

plt.tight_layout(rect=(0, 0, 1.5, 1))
plt.show()

### The SciKit-learn Pipeline
putting it all into one big thing makes everything easier...

In [None]:
# putting it all in a scikit pipeline for convenience (in notes)

# Convolutional Neural Networks
- point
- point

pretty pictures -> getting deep!

# Recurrent Neural Networks
- point
- point

# Conclusion
recap

## What's next?
- Attention next big thing?
- GANs, reinforcement learning (AlphaGo)
- Curricula Training (Teacher/Student) 

# Resources
Links to images used in this tutorial and some further reading.
- [Online Book](http://neuralnetworksanddeeplearning.com/) on Neural Networks and Deep Learning by [Michael Nielsen](http://michaelnielsen.org/). *Very nice* introduction into the basic math behind ANNs (including SGD backprop training).
- [Another Online Book](http://www.deeplearningbook.org/) on Deep Learning by Goodfellow, Bengio, and Courville
- [TensorFlow](https://www.tensorflow.org/) (the machine learning thing from Google)
- [Keras](https://keras.io/) as a very popular abstraction layer on top of TensorFlow, Theano, or (soon) CNTK; created by [Francois Chollet](https://twitter.com/fchollet)
- Very simple and basic [implementation](https://gist.github.com/karpathy/d4dee566867f8291f086) of a character-level recurrent neural network that writes text by [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/).
- Chris Olah wrote a [blog post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) that roughly explains the intuition behind LSTMs in a nice visual way.
- The setup of the first hands-on example of this tutorial was inspired by [this example](https://github.com/fchollet/keras/blob/master/examples/reuters_mlp.py).
- [Paper](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) on why dropout helps to prevent over-fitting.
