# Neural Networks and Deep Learning

Tim Repke, Julian Risch

July 2017

# Artificial Neural Networks
- point
- point
- point
- point

some more on NN basics

## Formal Definition
$y=wx+b$


<center>![noimg](images/test.png)</center>

- Input to the $k$-th neuron in layer $l$: $$z^l_k=\sum^K_{i=1} a^{l-1}_i w^{l}_{ki}+b^l_k,$$ where $K$ is the number of neurons in the preceding layer
- Biases and weights can be represented as vectors and matrices
- Weight matrix between layer $l-1$ and $l$: $$W^l=
 \begin{pmatrix}
  w^l_{11} & w^l_{12} & \cdots & w^l_{1k} \\
  w^l_{21} & w^l_{22} & \cdots & w^l_{2k} \\
  \vdots  & \vdots  & \ddots & \vdots  \\
  w^l_{j1} & w^l_{j2} & \cdots & w^l_{jk} 
 \end{pmatrix}$$
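
The matrix form can be sketched in a few lines of NumPy. This is only an illustration: the toy values are made up, the rows of `W` are assumed to index the neurons of layer $l$ and the columns those of layer $l-1$, and ReLU is picked as the activation purely as an example.

```python
import numpy as np

def dense_forward(a_prev, W, b):
    """Compute z = W a_prev + b for one layer, followed by a ReLU activation."""
    z = W @ a_prev + b          # weighted sum for every neuron of the layer at once
    a = np.maximum(z, 0.0)      # ReLU: negative inputs are clipped to zero
    return z, a

# toy layer: 3 neurons in layer l-1, 2 neurons in layer l (made-up values)
a_prev = np.array([1.0, 2.0, 3.0])
W = np.array([[0.5, -1.0,  0.0],
              [1.0,  0.0, -1.0]])
b = np.array([2.0, 0.0])

z, a = dense_forward(a_prev, W, b)
print("z:", z)
print("a:", a)
```

Each entry of `z` is exactly the sum from the first bullet point, computed for all neurons of the layer in one matrix-vector product.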

some more math in notes

## How do they learn?
- output neurons
- error "diff" function (loss, cost, objective,...)
- error back-propagated using gradient
- weights and biases updated accordingly
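
The four steps above can be sketched for the simplest possible case, the single linear neuron $y=wx+b$. The toy data, the mean squared error as loss and the learning rate are assumptions chosen for illustration, not part of the tutorial's model.

```python
import numpy as np

# toy data generated from y = 2x + 1 (made-up values)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0          # start from arbitrary parameters
learning_rate = 0.1

for step in range(500):
    y_hat = w * x + b                  # output of the neuron
    error = y_hat - y                  # difference to the expected output
    loss = np.mean(error ** 2)         # mean squared error (assumed loss)
    grad_w = np.mean(2 * error * x)    # gradient of the loss w.r.t. w
    grad_b = np.mean(2 * error)        # gradient of the loss w.r.t. b
    w -= learning_rate * grad_w        # update against the gradient
    b -= learning_rate * grad_b

print("w:", w, "b:", b, "loss:", loss)
```

In a multi-layer network the gradients are obtained by propagating the error backwards layer by layer, but the update rule stays the same.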

## Other Types of Neural Networks
- **ANN/NN:** Artificial Neural Networks *(1D input, what we just saw)*
  - good for 1D input classification or regression
  - learning embeddings (reduction into lower dimensional "semantic" space)

- **CNN:** Convolutional Neural Networks *(2D input)*
  - 2D input classification; ideal for images, sometimes *fixed* length sequences
  - image segmentation (masking areas)

- **RNN:** Recurrent Neural Networks *(sequential input)*
  - sequence to sequence models (e.g. translation, tagging, stock prognosis)
  - sequence to scalar also possible (e.g. classifying entire sequences)

## Neural Networks - Hands on!
First, we load some general dependencies we may need all the time.

In [None]:
import numpy as np
import pandas as pd
from pprint import pprint
from matplotlib import pyplot as plt
import os
import json

### The Dataset
TBD

### Loading the dataset

In [None]:
from keras.utils.data_utils import get_file
path = get_file(os.getcwd()+'/datasets/reuters.npz', origin='https://s3.amazonaws.com/text-datasets/reuters.npz')
# the arrays hold Python lists, so NumPy needs allow_pickle=True to load them
with np.load(path, allow_pickle=True) as f:
    reuters_data, labels = f['x'], f['y']

In [None]:
print('Number of samples:', reuters_data.shape)
print('Part of the first sample:', reuters_data[0][:10])

Why are there only numbers? Isn't that supposed to be text? Indeed it is, but encoded by a dictionary, which we can load as well:

In [None]:
path = get_file(os.getcwd()+'/datasets/reuters_word_index.json', origin='https://s3.amazonaws.com/text-datasets/reuters_word_index.json')
with open(path) as f:
    reuters_words = json.load(f)
    reuters_words_inverse = {v:k for k,v in reuters_words.items()}

In [None]:
print("Dictionary size:", len(reuters_words))
print("Word index:")
pprint(dict(list(reuters_words.items())[:5]))
print("\nReverse word index:")
pprint(dict(list(reuters_words_inverse.items())[:5]))

Let's "decode" some of those sentences!

In [None]:
' '.join([reuters_words_inverse[wi] for wi in reuters_data[42]])

Getting to know the data is important.
One interesting aspect is the distribution of labels.
A skewed distribution might require special attention during training.
It also means that we need to account for this skew in the evaluation later on, especially when creating the train/test split!
You'll see some parameters later on that relate to the distribution of labels, namely `stratify` and `'balanced'`.

In [None]:
%matplotlib inline

# a small pandas hack for "simplicity"
label_distribution = pd.DataFrame({'labels':labels}).groupby('labels').labels.count()

# make a barplot
label_distribution.plot.bar(figsize=(12,5))

# look at the bare numbers
label_distribution.to_frame().T # remove the '.T' to see the full list vertically

Feel free to play with the data. 
- What's the distribution of words?
- What's the distribution of words in different topics?
- How long are the text samples?
- ...
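
As a possible starting point for these questions, here is a small sketch using only the standard library. The `toy_docs` list is a made-up stand-in for a few index-encoded samples; you would replace it with (a slice of) `reuters_data`.

```python
from collections import Counter

# made-up stand-in for a few index-encoded text samples
toy_docs = [[1, 2, 2, 3], [2, 3], [1, 2, 4, 4, 4]]

# how long are the text samples?
doc_lengths = [len(doc) for doc in toy_docs]

# what's the distribution of words?
word_frequencies = Counter(w for doc in toy_docs for w in doc)
most_common = word_frequencies.most_common(2)

print("lengths:", doc_lengths)
print("most frequent word indices:", most_common)
```

To compare topics, you could build one `Counter` per label and look at the words that are frequent in one topic but rare in the others.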

#### The easier way
Keras also has some [built in](https://keras.io/datasets/) data loading functions that essentially do what we did above.
The manual route above was just to show you how to go about things when you have your own data. 

In [None]:
from keras.datasets import reuters
max_words = 50
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words, test_split=0.2)

### Preparing the dataset
There are two things we look at in this section. 

**Train/Test split**

We split our data into two subsets.
The bigger chunk is used for training the model, the other for testing it.
This way we have some data the model has never seen before.
If it classifies those samples reasonably well, we can take that as an indicator that it generalises to other examples as well.
However, there are several aspects one should consider depending on the dataset and task, like overlapping concepts or data.
Here, we neglect all that for simplicity.

**Encoding the Text**

At its core, a neural network, like any other machine learning algorithm, is pure math. Text isn't. Thus we need to encode it as numerical input, as described further below.


----

**> Subsampling** data down to 4 topics

We just do that here in the example to make things easier and the evaluation less complex.

In [None]:
selected_labels = [4,11,16,19]
selected_labels_index = {l:i for i,l in enumerate(selected_labels)}
selection = [l in selected_labels for l in labels]
reuters_data_sampled = reuters_data[selection]
labels_sampled = labels[selection]

In [None]:
print('Original size of dataset:', len(labels))
print('Size of subset for',len(selected_labels),'labels:', len(labels_sampled))

**> Splitting the data** using scikit-learn [utility functions](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

Note the parameter `stratify`, to which we hand our labels. This tells the function not to sample completely at random, but to make sure that samples from all classes end up in both sets and that their relative distribution is about the same in each.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reuters_data_sampled, labels_sampled,
                                                    test_size=0.30, random_state=42,
                                                    stratify=labels_sampled)

In [None]:
print('Size training set:', len(X_train))
print('Size testing set:', len(X_test))

Usually you would split your data into three parts:
- one for **train**ing (usually the biggest chunk),
- one for development **test**ing, 
- and one for a final **eval**uation.

The last one you never look at or touch during your development and experiments. 
Only at the very end you will test the performance of what you determined to be your best model on that.
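
A three-way split could look roughly like the sketch below. It works on sample indices only, uses made-up proportions (70/15/15) and, unlike the split above, is *not* stratified; with scikit-learn you could instead call `train_test_split` twice and pass `stratify` each time.

```python
import numpy as np

rng = np.random.RandomState(42)      # fixed seed for reproducibility
n_samples = 100                      # made-up dataset size
indices = rng.permutation(n_samples)

n_eval = int(0.15 * n_samples)       # held back until the very end
n_test = int(0.15 * n_samples)       # used during development

eval_idx = indices[:n_eval]
test_idx = indices[n_eval:n_eval + n_test]
train_idx = indices[n_eval + n_test:]

print(len(train_idx), len(test_idx), len(eval_idx))
```

The index arrays can then be used to select rows from both the data and the label array, so the two always stay aligned.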

**> Encoding** the text-data

In our first example with a simple neural network, we use a simple [bag-of-words (BoW)](https://en.wikipedia.org/wiki/Bag-of-words_model) encoding for our text.

Keras provides a [utility function](https://keras.io/preprocessing/text/#tokenizer), which we use to encode the text.
It takes the first `max_words` entries in the dictionary and for each sample creates a vector of that length. 
Each position in that vector corresponds to a word's index in the dictionary and is set to 1 if the word appears in the sample, 0 if not.
In other modes, some other score is used instead of a 1 (e.g. the number of occurrences in the sample).
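
To make the binary mode concrete, the sketch below re-implements the idea by hand. It is meant to illustrate the encoding, not to reproduce the Keras implementation in detail; the function name and the toy sequences are made up.

```python
import numpy as np

def binary_bag_of_words(sequences, num_words):
    """Encode index sequences as 0/1 vectors of length num_words."""
    matrix = np.zeros((len(sequences), num_words))
    for row, seq in enumerate(sequences):
        for word_index in seq:
            if word_index < num_words:        # words beyond the vocabulary are dropped
                matrix[row, word_index] = 1.0
    return matrix

toy_sequences = [[1, 3, 3], [2, 7]]           # made-up samples
encoded = binary_bag_of_words(toy_sequences, num_words=5)
print(encoded)
```

Note that word order and repetitions are lost: the repeated index 3 only sets a single 1, and index 7 is ignored because it lies outside the vocabulary of size 5.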

Hints for experiments:
- Try different vocab sizes
- Try different modes
- Look into the [implementation](https://github.com/fchollet/keras/blob/master/keras/preprocessing/text.py#L297) and do something like dropping stopwords (extremely frequently used words)

In [None]:
from keras.preprocessing.text import Tokenizer
max_words=5000
tokenizer = Tokenizer(num_words=max_words)
X_train_tokenised = tokenizer.sequences_to_matrix(X_train, mode='binary')
X_test_tokenised = tokenizer.sequences_to_matrix(X_test, mode='binary')

In [None]:
print('X_train shape:', X_train.shape, '| X_train_tokenised shape:', X_train_tokenised.shape)
print('X_test  shape:', X_test.shape,  '| X_test_tokenised  shape:', X_test_tokenised.shape)

**> Encoding** the labels

We interpret the output of the last layer of our network as the predicted label.
Naively, we could use a single neuron that outputs the number of the corresponding class.
However, it has proven to work significantly better to one-hot encode the labels, such that the last layer has as many output neurons as there are unique labels.
Each output neuron corresponds to one label.
Using the [softmax function](https://en.wikipedia.org/wiki/Softmax_function), we can interpret the output as a confidence score for that label: the higher (closer to 1), the more confident the network is.

In [None]:
from keras.utils import to_categorical
y_train_ = np.array([selected_labels_index[ty] for ty in y_train])
y_test_ = np.array([selected_labels_index[ty] for ty in y_test])
Y_train = to_categorical(y_train_, len(selected_labels))
Y_test = to_categorical(y_test_, len(selected_labels))

In [None]:
print('Y_train shape:', Y_train.shape)
print('Y_test shape:', Y_test.shape)
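
For intuition, one-hot encoding and the softmax function can be written down in plain NumPy. The helper names and toy values are made up for this sketch; in the notebook, Keras does both for us.

```python
import numpy as np

def one_hot(class_ids, num_classes):
    """Turn class ids like [0, 2, 1] into rows with a single 1."""
    encoded = np.zeros((len(class_ids), num_classes))
    encoded[np.arange(len(class_ids)), class_ids] = 1.0
    return encoded

def softmax(logits):
    """Map raw output activations to scores that sum to 1 per sample."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

targets = one_hot(np.array([0, 2, 1]), num_classes=4)
scores = softmax(np.array([[2.0, 1.0, 0.1, -1.0]]))   # made-up activations
predicted = scores.argmax(axis=1)                     # most confident class

print(targets)
print(scores, predicted)
```

The predicted label is simply the position of the largest score, which is how the predictions are derived from the network output later on.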

### Building the Neural Network
One hidden layer with 512 neurons with ReLU [activation](https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions); Softmax activation at the output layer; Optimisation using [Adam](https://arxiv.org/abs/1412.6980v8)

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dense(len(selected_labels)))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
model.summary()

Hints for experiments:
- Try adding another layer
- Check out the [documentation](https://keras.io/layers/core/)
- Use different sizes of hidden layers
- Use different weight initialisations
- Use different [activation](https://keras.io/activations/) functions
- Use different (or no) dropout
- Use different [loss](https://keras.io/losses/) functions
- Use different [optimisers](https://keras.io/optimizers/)

Below you see a different syntax with some additional parameters.

In [None]:
from keras.regularizers import l2

model = Sequential([
    Dense(max_words, activation='linear', input_dim=max_words, 
          kernel_initializer='he_uniform'),
    Dense(2048, activation='relu', bias_regularizer=l2(0.01),
          kernel_initializer='he_uniform', kernel_regularizer=l2(0.01)),
    Dropout(0.5),
    Dense(len(selected_labels), activation='softmax',
          kernel_initializer='he_uniform')
])

model.compile(loss='msle',
              optimizer='adam',
              metrics=['accuracy'])

**Class weight computation**

As we have seen earlier, the distribution of samples across classes is somewhat unbalanced.
This imbalance can bias our classification model, and there are different approaches to counteract that.
One is to use weights during training, such that the error of samples from over-represented classes is scaled down by a factor and, vice versa, amplified for samples of under-represented classes.
For further reading, you could look up keywords like over- or under-sampling as well.

In [None]:
from sklearn.utils import compute_class_weight
# keyword arguments are required by newer scikit-learn versions
class_weights = {key: value for key, value in enumerate(
                     compute_class_weight(class_weight='balanced',
                                          classes=np.arange(len(selected_labels)),
                                          y=np.array(y_train_)))}
class_weights
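
The `'balanced'` heuristic boils down to `n_samples / (n_classes * count_per_class)`. The sketch below recomputes it by hand on made-up labels; the helper name is hypothetical.

```python
import numpy as np

def balanced_class_weights(y, num_classes):
    """Weight each class inversely proportional to its frequency."""
    counts = np.bincount(y, minlength=num_classes)
    return len(y) / (num_classes * counts)

# made-up labels: class 0 appears six times, classes 1 and 2 twice each
toy_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
toy_weights = balanced_class_weights(toy_labels, num_classes=3)
print(toy_weights)
```

The frequent class ends up with a weight below 1 and the rare classes with a weight above 1, so that every class contributes about equally to the total loss.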

### Training the Model

There are quite a few things going on here!
`epochs` sets the number of times we let the network see the entire training set.
The batch size determines how many samples are processed per weight update.
Essentially, the model grabs a batch of samples, does the forward passes, and only then does one back-propagation step based on the gradients of that batch.
This saves time compared to calculating gradients and updating all variables for each training sample individually.
The class weight can be used to amplify the error for samples from underrepresented classes.
The validation split sets a small subset of *training* samples aside for testing during the training run.
If we had a proper train/test/eval split, we would use the test split for that (or a portion of it, because calculating these metrics takes time).

In [None]:
batch_size = 32
epochs = 5
history = model.fit(X_train_tokenised, Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    #class_weight=class_weights,
                    validation_split=0.15)

### Testing the Model

Keras has a simple evaluation function built into the model, which can be used like this:

In [None]:
score = model.evaluate(X_test_tokenised, Y_test,
                       batch_size=batch_size, verbose=1)
print('\nTest score:', score[0])
print('Test accuracy:', score[1])

We want to dig much deeper, so below you'll find a number of helpful [scikit-learn functions](http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).

Get model predictions for test data

In [None]:
# predict() returns the softmax outputs; predict_proba() no longer exists in newer Keras versions
y_pred_proba = model.predict(X_test_tokenised)
y_pred = y_pred_proba.argmax(axis=1)

The **classification report**

For each class, it shows the precision, recall and their harmonic mean (F1-score).
The support shows how many samples are in that particular class (based on the test labels).
We also calculate the accuracy.
Both functions also accept a `sample_weight` argument, which can be used to correct for class imbalances.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

print(classification_report(y_test_, y_pred, target_names=['Topic '+str(sl) for sl in selected_labels]))
print("Accuracy: {0:.3f}".format(accuracy_score(y_test_, y_pred)))

The **confusion matrix**

Based on the confusion matrix you can figure out how the trained model makes mistakes. 
Each row corresponds to the correct class, each column to the predicted class. 
The perfect classifier would only contain non-zero numbers along the diagonal.
Reading along the first row, you can see which other topics the samples of that topic are most often confused with.

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test_, y_pred))
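
To see how such a matrix comes about, here is a hand-rolled sketch on made-up labels (rows are true classes, columns are predicted classes). It also shows a per-class normalisation, where every row is divided by the number of samples of that class.

```python
import numpy as np

def confusion_counts(y_true, y_hat, num_classes):
    """Count how often true class t (row) was predicted as class p (column)."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_hat):
        matrix[t, p] += 1
    return matrix

toy_true = np.array([0, 0, 0, 1, 1, 2])   # made-up correct labels
toy_pred = np.array([0, 0, 1, 1, 1, 0])   # made-up predictions

toy_matrix = confusion_counts(toy_true, toy_pred, num_classes=3)
toy_normed = toy_matrix / toy_matrix.sum(axis=1, keepdims=True)   # normalise per row

print(toy_matrix)
print(toy_normed)
```

After the row-wise normalisation, the diagonal holds the recall of each class.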

To get a more visually appealing result, you need a bit more code. Note that the normalisation was done *per class* for convenience, not globally!

In [None]:
%matplotlib inline
from matplotlib import cm as colm
import matplotlib

conf_matrix_full = confusion_matrix(y_test_, y_pred, labels=list(range(len(selected_labels))))

matplotlib.rc('font', **{'size':12})
plt.figure(figsize=(6,4))
sub = plt.subplot(111)
normed = (conf_matrix_full.T/conf_matrix_full.sum(axis=1)).T
plt.imshow(normed, cmap=colm.Blues, vmax=normed.max()*1.4)
sub.set_yticks(list(range(len(selected_labels))))
sub.set_yticklabels(['Topic '+str(sl) for sl in selected_labels])
sub.set_xticks(list(range(len(selected_labels))))
sub.set_xticklabels(['Topic '+str(sl) for sl in selected_labels])

for i in range(normed.shape[0]):
    for j in range(normed.shape[1]):
        v = normed.T[i][j]
        c='%.2f'%v if v>0.005 else ''
        sub.text(i, j, c, va='center', ha='center')
        
plt.tight_layout()
plt.show()

**Area Under the Curve** and **Precision vs Recall curve**

As described earlier, we have an output neuron per class and interpret its activation as the confidence.
Above we made the simple assumption to just take the class where the neuron has the highest activation for each sample.
We can also look at the precision recall curve, where we slowly increase the threshold of the confidence score we accept.
When we demand a very high confidence close to 1, the precision will (hopefully) be very good.
However, we may lose a lot of samples where there's more uncertainty, and thus the recall drops.

The **precision recall curve describes the tradeoff** we can make.
In the code below, you can print the variables returned by the sklearn function to see the range of thresholds.
The AUC score is simply the area under that curve.
A desirable model will have a high AUC, meaning the curve stays high and stable.
The perfect classifier, which never fails, would have a constant precision of 1 for all recall values.

In [None]:
%matplotlib inline
from sklearn.metrics import precision_recall_curve, auc
plt.clf()
for i, f in enumerate(['Topic '+str(sl) for sl in selected_labels]):
    precision, recall, thresholds = precision_recall_curve(y_test_ == i, y_pred_proba[:, i])
    auc_ = auc(recall, precision)
    plt.plot(recall, precision, label="{}, AUC={:.4}".format(f, auc_))
plt.ylabel('Precision')
plt.xlabel('Recall')
plt.ylim(0, 1.1)
plt.xlim(0, 1.1)
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
plt.grid()
plt.show()
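
A single point on such a curve can be computed by hand, which may help to read the plot. The sketch uses made-up confidences for one class; the helper name is hypothetical, and returning a precision of 0 when nothing passes the threshold is a simplification of this sketch.

```python
import numpy as np

def precision_recall_at(threshold, is_positive, confidence):
    """Precision and recall when accepting all samples with confidence >= threshold."""
    predicted_positive = confidence >= threshold
    true_positive = np.sum(predicted_positive & is_positive)
    precision = true_positive / max(np.sum(predicted_positive), 1)
    recall = true_positive / np.sum(is_positive)
    return precision, recall

# made-up data: does the sample belong to the class, and how confident is the model?
toy_is_positive = np.array([True, True, True, False, False])
toy_confidence = np.array([0.9, 0.8, 0.4, 0.6, 0.1])

for threshold in (0.3, 0.5, 0.85):
    p, r = precision_recall_at(threshold, toy_is_positive, toy_confidence)
    print("threshold", threshold, "precision", round(p, 2), "recall", round(r, 2))
```

Raising the threshold from 0.3 to 0.85 moves this toy example from full recall at a precision of 0.75 to perfect precision at a recall of one third, which is exactly the tradeoff the curve visualises.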

The **training curve**

Remember that `history` variable we stored earlier during training? It contains all the values you see when you train the model with the verbose flag set to 1. We can also visualise it for convenience.

Things to look out for are as follows: The **loss** is the value the optimiser aims to minimise *(also called objective, error,...)*. It describes, based on the loss function, how far off the current prediction is from the correct response. The loss for the training samples should go down and converge. If something else happens, reconsider the data encoding or the setup as a whole. The evaluation loss is the same score calculated for the held-out validation data. It should behave very similarly; if it does not, your model might be over-fitting.

Essentially the same applies to the **accuracy**, which is calculated after each epoch, although it should go up, not down. When the curves for the training and evaluation data grow apart, you have to pay close attention to your results.

Some **keywords to google**: overfitting, underfitting, vanishing gradient, exploding gradient

In [None]:
%matplotlib inline
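plt.subplot(121)
# older Keras versions log 'acc'/'val_acc', newer ones 'accuracy'/'val_accuracy'
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key], label='accuracy')
if 'val_' + acc_key in history.history:
    plt.plot(history.history['val_' + acc_key], label='eval_accuracy')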
plt.legend()#loc='lower right')

plt.subplot(122)
plt.plot(history.history['loss'], label='loss')
if 'val_loss' in history.history:
    plt.plot(history.history['val_loss'], label='eval_loss')
plt.legend()#loc='upper right')

plt.tight_layout(rect=(0, 0, 1.5, 1))
plt.show()

### The SciKit-learn Pipeline
SciKit-learn offers a [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) interface, which allows you to **chain or combine multiple models and processing steps**. This simplifies your evaluation and experiments. Keras also implements an abstraction layer that works with this [API](https://keras.io/scikit-learn-api/). You may find this [in-depth tutorial](https://bradzzz.gitbooks.io/ga-seattle-dsi/content/dsi/dsi_05_classification_databases/2.2-lesson/readme.html) on pipelines helpful to see what else you can do.

The pipeline unfortunately only allows you to transform the data/features, not the labels themselves. Nor does it allow for downsampling. The example below is too small to really apply the Pipeline itself, so some of the utility functions are used instead; the interface shows its real power once you get into **parameter tuning**. The [user guide](http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html) on model selection shows you some examples.

Let's also assume that we get raw text data and use SciKit-learn functions to transform the input.

**Hints for experiments:**
- Try to extend the example to do a [GridSearch](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) on the input layer size. You can add a parameter to the `build_model` function that sets the size. Click [here](http://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/) to cheat a bit.
- Try different parameters for the `CountVectorizer` (e.g. set `binary` to False)
- Replace the `CountVectorizer` with something else (e.g. `TfidfVectorizer`)


In [None]:
from sklearn.pipeline import Pipeline
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasClassifier
from keras.layers import Dense, Activation, Dropout
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import StratifiedKFold, cross_val_score

selected_labels = [4,11,16,19]
max_words=5000
num_folds = 5

# pretend we have raw text data
raw_reuters = [' '.join([reuters_words_inverse[wi] for wi in rd]) for rd in reuters_data]

# subsample the data
raw_reuters_sampled = [rr for i, rr in enumerate(raw_reuters) if labels[i] in selected_labels]
labels_sampled = [l for l in labels if l in selected_labels]

# helper function for our keras model
def build_model():
    model = Sequential()
    model.add(Dense(512, input_shape=(max_words,)))
    model.add(Activation('relu'))
    model.add(Dense(len(selected_labels)))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

# encode raw text
# toarray() yields a plain ndarray (todense() would return the legacy np.matrix type)
raw_reuters_encoded = CountVectorizer(max_features=max_words, binary=True).fit_transform(raw_reuters_sampled).toarray()

# prepare keras wrapper
model = KerasClassifier(build_fn=build_model, epochs=5, batch_size=20, verbose=0)

# set up kfold
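# random_state only has an effect (and is only accepted by newer versions) with shuffle=True
kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)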

# do the computations
results = cross_val_score(model, raw_reuters_encoded, labels_sampled, cv=kfold)
print(results.mean())
print(results)

# Convolutional Neural Networks
- point
- point

Motivation: for data that only makes sense in context. The NN should then also be able to process it as context (locality assumption).

pretty pictures -> getting deep!

## CNN - Hands on!
In this section:
- encode text as 2D input
- usage of keras embedding and convolutional layers

### Data Preparation
We already did that earlier on. In case you start the tutorial from here, you can execute the cell below to get all you need from above.

In [None]:
import numpy as np
import pandas as pd
from pprint import pprint
from matplotlib import pyplot as plt
import os
import json
from keras.utils.data_utils import get_file
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from sklearn.utils import compute_class_weight

path_data = get_file(os.getcwd()+'/datasets/reuters.npz', 
                origin='https://s3.amazonaws.com/text-datasets/reuters.npz')
path_dict = get_file(os.getcwd()+'/datasets/reuters_word_index.json', 
                     origin='https://s3.amazonaws.com/text-datasets/reuters_word_index.json')


with np.load(path_data, allow_pickle=True) as f_data, open(path_dict) as f_dict:
    reuters_data, labels = f_data['x'], f_data['y']
    reuters_words = json.load(f_dict)
    reuters_words_inverse = {v:k for k,v in reuters_words.items()}
    
    # downsample to a few topics
    selected_labels = [4,11,16,19]
    selected_labels_index = {l:i for i,l in enumerate(selected_labels)}
    selection = [l in selected_labels for l in labels]
    reuters_data_sampled = reuters_data[selection]
    labels_sampled = labels[selection]
    
    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(reuters_data_sampled, labels_sampled,
                                                    test_size=0.20, random_state=42,
                                                    stratify=labels_sampled)
    # one-hot encode the labels
    y_train_ = np.array([selected_labels_index[ty] for ty in y_train])
    y_test_ = np.array([selected_labels_index[ty] for ty in y_test])
    Y_train = to_categorical(y_train_, len(selected_labels))
    Y_test = to_categorical(y_test_, len(selected_labels))
    
    # calculate class weights
    class_weights = {key: value for key, value in enumerate(
        compute_class_weight(class_weight='balanced', classes=np.arange(len(selected_labels)),
                             y=np.array(y_train_)))}

### Encoding the Text
- Naive: "stretch out" the bag-of-words

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$
\text{"I like Rheinsberg"}
\xrightarrow{to num}
\begin{bmatrix}0&3&1&4\end{bmatrix}
\xrightarrow{one hot}
\begin{bmatrix}
0&0&0&0\\
0&0&1&0\\
0&0&0&0\\
0&1&0&0\\
0&0&0&1\\
\end{bmatrix}$
- towards state-of-the-art: word embeddings

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\xrightarrow{embedding}
\begin{bmatrix}
0.11&0.42&-0.44&0.12\\
0.23&0.50&0.35&-0.33\\
\end{bmatrix}
$

some text about previous slide
- [Keras Example](https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py)
- [Keras Tutorial](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)

**> Reducing the Vocabulary**

The unique number of words used can become very large and it makes sense to reduce it.

Feel free to adapt the later model accordingly and see how a reduced vocabulary helps to improve results.

This is done for a number of reasons. Most importantly, we don't have a large amount of training data, so the model should learn from things that are most likely to reappear during testing. We can just **remove very uncommon words**, as well as so-called **stop words** (like "and", "the", "or"), which appear very often and don't carry much meaning. The word usage frequency in natural languages usually follows [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law).

Note that **this is optional**. The number of words removed from the head and tail of the distribution depends on the data and problem at hand and probably requires some experimentation. [TF-IDF scores](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) can be helpful for that, as they give higher scores to words that help distinguish documents and lower scores to those that appear frequently in a lot of documents.

In [None]:
print("Number of words (train):", np.array([a for b in X_train for a in b]).shape)
print("Number of unique words (train):", np.unique([a for b in X_train for a in b]).shape)
print("Number of words (test):", np.array([a for b in X_test for a in b]).shape)
print("Number of unique words (test):", np.unique([a for b in X_test for a in b]).shape)

The code cells below allow you to count the frequency of each word and remove all but the most frequent (`vocabulary_size`) words from the texts.

In [None]:
vocabulary_size = 5000
word_counts = {}
for doc in X_train: 
    for term in doc:
        word_counts[term] = word_counts.get(term, 0) + 1

# a set makes the membership tests below fast
keep_words = {x[0] for x in sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:vocabulary_size]}

X_train_reduced = [[t for t in doc if t in keep_words] for doc in X_train]
X_test_reduced = [[t for t in doc if t in keep_words] for doc in X_test]

As you can see, we managed to reduce the vocabulary size significantly, while the total number of words in the texts did not change much. Note also that some words in the test set are missing from our training set. That is a common real-world scenario, which the model has to be able to cope with.

In [None]:
print("Number of words (train):", np.array([a for b in X_train_reduced for a in b]).shape)
print("Number of unique words (train):", np.unique([a for b in X_train_reduced for a in b]).shape)
print("Number of words (test):", np.array([a for b in X_test_reduced for a in b]).shape)
print("Number of unique words (test):", np.unique([a for b in X_test_reduced for a in b]).shape)

**> Normalising Text Length**

The input texts all have different lengths, but the neural network expects input of a fixed size. We therefore limit the input to a fixed length: long texts are cut, short texts are padded with zeros.

In [None]:
from keras.preprocessing import sequence
maxlen = 400
X_train_padded = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test_padded = sequence.pad_sequences(X_test, maxlen=maxlen)
print("Shape of train:", X_train_padded.shape)
print("Shape of test:", X_test_padded.shape)
X_train_padded
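
What the padding does can be sketched by hand. The sketch pads and truncates at the *front* of each sequence, which to my understanding matches the Keras defaults (`padding='pre'`, `truncating='pre'`); the helper name and the toy sequences are made up.

```python
import numpy as np

def pad_to_length(sequences, maxlen):
    """Left-pad short sequences with zeros and keep only the last maxlen tokens of long ones."""
    padded = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]             # cut long texts at the front
        if len(trimmed) > 0:                      # empty texts stay all-zero
            padded[row, -len(trimmed):] = trimmed
    return padded

toy_padded = pad_to_length([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)
```

The zeros carry no meaning of their own; they only serve to bring all samples to the same length.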

### Building the Convolutional Neural Network

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D

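# the embedding needs one row per possible word index (plus index 0 for padding),
# not just one per word that happens to occur in the training data
vocab_size = len(reuters_words) + 1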
embedding_size = 50
filters = 250
kernel_size = 3
hidden_dims = 250

model = Sequential()

model.add(Embedding(input_dim=vocab_size,
                    output_dim=embedding_size,
                    input_length=maxlen))
model.add(Dropout(0.2))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(len(selected_labels)))
model.add(Activation('softmax'))
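
The core of this model, a 1D convolution followed by global max pooling, can be sketched in NumPy to show what happens to the shapes. This is a simplified illustration with random made-up weights and without bias terms, not the Keras implementation.

```python
import numpy as np

def conv1d_valid(inputs, kernels):
    """inputs: (steps, channels); kernels: (kernel_size, channels, filters)."""
    kernel_size, _, num_filters = kernels.shape
    out_steps = inputs.shape[0] - kernel_size + 1     # 'valid': no padding at the borders
    output = np.zeros((out_steps, num_filters))
    for t in range(out_steps):
        window = inputs[t:t + kernel_size]            # kernel_size neighbouring tokens
        output[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return np.maximum(output, 0.0)                    # ReLU activation

rng = np.random.RandomState(0)
toy_embedded = rng.randn(10, 4)     # 10 tokens, each embedded in 4 dimensions
toy_kernels = rng.randn(3, 4, 2)    # kernel size 3, 2 filters

feature_maps = conv1d_valid(toy_embedded, toy_kernels)
pooled = feature_maps.max(axis=0)   # global max pooling: strongest response per filter

print(feature_maps.shape, pooled.shape)
```

With the settings of the model above (`maxlen = 400`, `kernel_size = 3`, `filters = 250`), the convolution yields 400 - 3 + 1 = 398 positions with 250 values each, which the pooling layer reduces to a single vector of 250 values per text.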

**Model overview**

In [None]:
# one-hot labels with a softmax output call for categorical cross-entropy
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

### Training the model

In [None]:
history = model.fit(X_train_padded, Y_train,
                    batch_size=20,
                    epochs=5,
                    class_weight=class_weights,
                    validation_split=0.1,
                    verbose=1)

In this case you probably won't find it useful, but generally you can use **embeddings as dimensionality reduction**. Word embeddings specifically have the nice property of "placing" semantically similar words nearby in the embedding space. Since you can have a look at each layer output of the sequential keras model, you can get the embedding of what it learned in our example.

In [None]:
from keras import backend as K

# embed the first sentence
to_embed = X_train_padded[:1]

# or something else you did yourself
to_embed = np.zeros((1,400))
to_embed[0][399] = 50

embed = K.function([model.layers[0].input], [model.layers[0].output])
embedded = embed([to_embed])[0]

print("size of vocabulary:", vocab_size)
print("size of embedding:", embedding_size)
print("shape of embedded text:", embedded.shape)
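
Conceptually, an embedding layer is a lookup table: every word index selects one row of a weight matrix. The sketch below mimics that with a random made-up matrix; in the model, these rows are the weights learned during training.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_vocab_size, toy_embedding_size = 6, 3
toy_embedding_matrix = rng.randn(toy_vocab_size, toy_embedding_size)   # one row per word index

toy_batch = np.array([[0, 0, 4, 2]])                  # one padded text of length 4
toy_embedded_batch = toy_embedding_matrix[toy_batch]  # look up a row for every index

print(toy_embedded_batch.shape)   # (texts, text length, embedding size)
```

This is why the embedded text above has one more dimension than the padded input: every word index is replaced by a vector.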

### Testing the model
Just as before, we can evaluate the model

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# predict() returns the softmax outputs; predict_proba() no longer exists in newer Keras versions
y_pred_proba = model.predict(X_test_padded)
y_pred = y_pred_proba.argmax(axis=1)
print()
print(classification_report(y_test_, y_pred, target_names=['Topic '+str(sl) for sl in selected_labels]))
print("Accuracy: {0:.3f}".format(accuracy_score(y_test_, y_pred)))

### Using pre-trained Word Embeddings

It is also possible to use pre-trained word embeddings. Those are trained on very large text corpora and therefore have a large vocabulary, from which you can infer meaningful word embeddings. There is a [Keras Tutorial](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) on how to use these embeddings directly in your model.

Note that you can use such embeddings as is (fixed) or continue updating the weights during training to bias them towards the wording used in the application.

1. Download: [GloVe](https://nlp.stanford.edu/projects/glove/) [word embedding](http://nlp.stanford.edu/data/glove.6B.zip) trained on Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, **Warning: 822 MB download**, almost 3GB unpacked)
2. unpack to `datasets/` directory

In [None]:
# load the word vectors into memory
# store them in a lookup dictionary
embeddings_index = {}
glove_embedding_size = 100
# the GloVe file contains non-ASCII tokens, so read it explicitly as UTF-8
with open(os.getcwd()+'/datasets/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

# map the pretrained vectors to the dictionary we already have
embedding_matrix = np.zeros((len(reuters_words) + 1, glove_embedding_size))
for word, i in reuters_words.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # words without a pretrained vector keep their all-zeros row
        

model = Sequential()
model.add(Embedding(len(reuters_words) + 1,
                    glove_embedding_size,
                    # set the weights (pretrained vectors)
                    weights=[embedding_matrix],
                    input_length=maxlen,
                    # set to True to fine-tune ("overfit") the vectors to the vocabulary at hand
                    trainable=False)) 
model.add(Dropout(0.2))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(len(selected_labels)))
model.add(Activation('softmax'))

# now you can compile the model and train
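
Before training, it can be worth checking how many words of the dataset vocabulary actually received a pretrained vector, since all other rows of the embedding matrix stay zero. A minimal sketch with invented toy dictionaries (`toy_word_index` and `toy_embeddings_index` stand in for `reuters_words` and `embeddings_index`; the words and vectors are made up):

```python
# hypothetical stand-ins for reuters_words and embeddings_index
toy_word_index = {"oil": 1, "price": 2, "opec": 3, "xyzzy": 4}
toy_embeddings_index = {
    "oil": [0.1, 0.2],
    "price": [0.3, 0.4],
    "opec": [0.5, 0.6],
}

# words of the dataset vocabulary that have no pretrained vector
missing = sorted(w for w in toy_word_index if w not in toy_embeddings_index)
coverage = 1.0 - len(missing) / len(toy_word_index)

print("coverage: {:.0%}".format(coverage))
print("words without a pretrained vector:", missing)
```

If the coverage turns out to be low, setting `trainable=True` or training the embedding from scratch may be the better choice.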

# Recurrent Neural Networks
- point
- point

## RNN - Hands on!
In Keras, we would just add another line of code.

Let's switch it up by going down one level and looking at TensorFlow!

This is inspired by [this blog post](https://danijar.com/variable-sequence-lengths-in-tensorflow/) ([full code](https://gist.github.com/danijar/3f3b547ff68effb03e20c470af22c696)).

Related material:
- https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/dynamic_rnn.py
- https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html
- http://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/

Please run the cell below. It sets up some additional dependencies we need from here on.

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
from pprint import pprint
from matplotlib import pyplot as plt
import os
import json
from keras.utils.data_utils import get_file

# choose the embedding size
glove_embedding_size = [50,100,200,300][1]

# select topics of interest
selected_labels = [4,11,16,19]
selected_labels_index = {l:i for i,l in enumerate(selected_labels)}


path_data = get_file(os.getcwd()+'/datasets/reuters.npz', 
                origin='https://s3.amazonaws.com/text-datasets/reuters.npz')
path_dict = get_file(os.getcwd()+'/datasets/reuters_word_index.json', 
                     origin='https://s3.amazonaws.com/text-datasets/reuters_word_index.json')
path_glove = os.getcwd()+'/datasets/glove.6B.'+str(glove_embedding_size)+'d.txt'
path_data_tfr = os.getcwd()+'/datasets/reuters.tfrecords'
    
# don't need to do that if the file already exists
if not os.path.isfile(path_data_tfr):
    print("tfrecords file does not exist yet, writing it!")
    with np.load(path_data) as f_data:
        # load data from old format file
        reuters_data, labels = f_data['x'], f_data['y']
        
        # this writes the training data into a TFRecords protobuf file
        writer = tf.python_io.TFRecordWriter(path_data_tfr)
        idx = np.arange(len(reuters_data))
        np.random.shuffle(idx)
        for i in idx:
            features = np.array(reuters_data[i])
            label = labels[i]

            example = tf.train.Example(features=tf.train.Features(feature={
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
                    'text': tf.train.Feature(int64_list=tf.train.Int64List(value=features.astype("int64"))),
            }))
            serialized = example.SerializeToString()

            writer.write(serialized)
        writer.close() 
else:
    print("tfrecords file already exists.")
    
with open(path_dict) as f_dict, open(path_glove, encoding='utf-8') as f_glove:
    reuters_words = json.load(f_dict)
    reuters_words_inverse = {v:k for k,v in reuters_words.items()}
    
    # loading embeddings takes a while
    # in case you just run this to reset X and y, this will skip reloading embeddings
    #embeddings_index = {}
    #del embeddings_index
    try:
        embeddings_index
        print('word vectors already loaded!')
        # to force reloading the pretrained vectors, uncomment the next line
        #raise NameError
    except NameError:
        print('word vectors not loaded yet, doing that now.')
        # build embedding index
        embeddings_index = {}
        for line in f_glove:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

        # map the pretrained vectors to the dictionary we already have
        embedding_matrix = np.zeros((len(reuters_words) + 1, glove_embedding_size))
        for word, i in reuters_words.items():
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector
            # words without a pretrained vector keep their all-zeros row

**> Initial Notes**

TensorFlow builds a graph of dependencies, which it uses to schedule computations efficiently. This graph "lives outside" a session and has a "memory": it persists until it is reset. User-defined data is passed in through placeholders that are filled when the graph is run in a session; variables need to be initialised at the beginning of each session.
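
To make the "define first, run later" idea concrete, here is a toy sketch in plain Python. It is not TensorFlow and does not use its API: the classes `Placeholder`, `Constant`, `Add`, `Mul` and the function `run` are hypothetical stand-ins that only mimic the pattern of building a graph and evaluating it later with fed-in values.

```python
class Placeholder:
    """A named slot whose value is supplied when the graph is run."""
    def __init__(self, name):
        self.name = name

class Constant:
    def __init__(self, value):
        self.value = value

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

class Mul:
    def __init__(self, left, right):
        self.left, self.right = left, right

def run(node, feed_dict):
    """Evaluate a node by recursively evaluating the nodes it depends on."""
    if isinstance(node, Placeholder):
        return feed_dict[node]
    if isinstance(node, Constant):
        return node.value
    if isinstance(node, Add):
        return run(node.left, feed_dict) + run(node.right, feed_dict)
    if isinstance(node, Mul):
        return run(node.left, feed_dict) * run(node.right, feed_dict)
    raise TypeError("unknown node type: %r" % type(node))

# building the graph computes nothing yet
x = Placeholder("x")
w = Constant(3.0)
b = Constant(1.0)
y = Add(Mul(w, x), b)   # y = w * x + b

# values only flow once the graph is run with concrete inputs
print(run(y, {x: 2.0}))
print(run(y, {x: 10.0}))
```

The same graph object `y` is reused for both runs; only the fed-in value changes. TensorFlow follows the same pattern, with the session taking the role of `run`.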

In [None]:
# this resets the graph
tf.reset_default_graph()

# at the beginning of a session, you need to initialise global variables
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

**> Define placeholders**

Those variables will be set and used later on.

In [None]:
# batch size will be set later
#batch_size_placeholder = tf.placeholder(tf.int32)
batch_size = 2

# data for a batch
#tf.placeholder()

**> Embedding Layer**

The embedding layer is essentially just a matrix, which we loaded earlier and now put into the TensorFlow graph.
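
As a rough NumPy illustration of what "just a matrix" means: looking up embeddings is row indexing, which is also what `tf.gather` does along the first axis. The tiny matrix and the token ids below are invented for this example.

```python
import numpy as np

# hypothetical embedding matrix: 5 words in the vocabulary, 4 dimensions each
toy_embedding_matrix = np.random.rand(5, 4).astype("float32")
toy_embedding_matrix[0] = 0.0   # index 0 is used for padding in this sketch

# a batch of 2 texts, each padded to 3 token ids
token_ids = np.array([[1, 4, 0],
                      [2, 2, 3]])

# integer-array indexing replaces every id with its row of the matrix
embedded_batch = toy_embedding_matrix[token_ids]

print(embedded_batch.shape)   # (batch size, sequence length, embedding size)
```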

In [None]:
# build the embedding layer
embedding_layer = tf.Variable(tf.constant(0.0, shape=list(embedding_matrix.shape)),
                              trainable=False, name="embedding")
embedding_placeholder = tf.placeholder(tf.float32, list(embedding_matrix.shape))
embedding_init = embedding_layer.assign(embedding_placeholder)

**> Queue for input data**

We define a function that reads and parses the data from the TFRecords file.

In [None]:
def filereader(input_files):
    # wrap a single filename in a list (comparing type() to the string 'list' is always unequal)
    if not isinstance(input_files, list):
        input_files = [input_files]
    filename_queue = tf.train.string_input_producer(input_files, num_epochs=None)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={
            'label': tf.FixedLenFeature([], tf.int64),
            'text': tf.VarLenFeature(tf.int64)
        })
    label = features['label']
    text = tf.sparse_tensor_to_dense(features['text'])
    return label, text

In [None]:
tf.reset_default_graph()

label, text = filereader(path_data_tfr)
print(label)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    
    for _ in range(3):
        print(sess.run([label, text]))
        
    coord.request_stop()
    coord.join(threads)
    

In [None]:
max_length = 100
frame_size = 64
num_hidden = 200
learning_rate = 0.003
batch_size = 50
selected_labels = [4,11,16,19]
out_size = 46 # = max label + 1

tf.reset_default_graph()

label, text = filereader(path_data_tfr)

embedding_layer = tf.Variable(tf.constant(0.0, shape=embedding_matrix.shape),
                              trainable=False, name="embedding")
embedding_placeholder = tf.placeholder(tf.float32, embedding_matrix.shape)
embedding_init = embedding_layer.assign(embedding_placeholder)

text_batch, labels_batch = tf.train.batch(
    [text, label], 
    batch_size=batch_size,
    capacity=200,
    dynamic_pad=True)

label_filter = [tf.equal(labels_batch, selected_labels[0])]
for i, sl in enumerate(selected_labels[1:]):
    label_filter.append(tf.logical_or(label_filter[i], tf.equal(labels_batch, sl)))

reduced_text_batch = tf.gather(text_batch, tf.reshape(tf.where(label_filter[-1]),[-1]))
reduced_labels_batch = tf.gather(labels_batch, tf.reshape(tf.where(label_filter[-1]),[-1]))

embedded = tf.gather(embedding_layer, reduced_text_batch)

labels_batch_ = tf.one_hot(reduced_labels_batch, out_size)
lengths = tf.cast(tf.reduce_sum(tf.sign(reduced_text_batch), axis=1), tf.int32)

output, state = tf.nn.dynamic_rnn(
    tf.nn.rnn_cell.GRUCell(num_hidden),
    embedded,
    dtype=tf.float32,
    sequence_length=lengths
)

current_batch_length = tf.shape(output)[0]
current_batch_width = tf.shape(output)[1]
current_output_size = tf.shape(output)[2]

last_relevant_index = tf.range(0, current_batch_length) * current_batch_width + (lengths - 1)
flat_output = tf.reshape(output, [-1, current_output_size])
last_out = tf.gather(flat_output, last_relevant_index)

weight = tf.Variable(tf.truncated_normal([num_hidden, out_size], stddev=0.01))
bias = tf.Variable(tf.constant(0.1, shape=[out_size]))

prediction = tf.nn.softmax(tf.matmul(last_out, weight) + bias)

cross_entropy_loss = -tf.reduce_sum(labels_batch_ * tf.log(prediction))

#optimizer = tf.train.RMSPropOptimizer(learning_rate)
optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.minimize(cross_entropy_loss)


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding_matrix})
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    
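    for epoch in range(5):
        for step in range(10): # num batches
            # fetch the batch dimensions in the same run call as the training step;
            # a separate .eval() would pull a fresh batch from the queue each time
            _, n_samples, n_steps = sess.run([train_op, current_batch_length, current_batch_width])
            print('{}({}).'.format(n_samples, n_steps), end='')
        
        # note: this evaluates the loss on the next batch from the queue
        error = sess.run(cross_entropy_loss)
        print(' Epoch {:2d} loss {:5.2f}'.format(epoch + 1, error))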
        
        
    coord.request_stop()
    coord.join(threads)
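
Two steps in the graph above are easy to miss: the true sequence lengths are recovered from the zero padding, and the last *relevant* RNN output of each sequence is picked by flattening the output tensor and gathering one row per sequence. The NumPy sketch below mirrors that logic on made-up data (`toy_output` stands in for the RNN output). It assumes, as the graph does, that token id 0 is only used for padding.

```python
import numpy as np

# padded batch of token ids: 3 sequences, padded to length 4
token_ids = np.array([[5, 2, 9, 0],
                      [7, 0, 0, 0],
                      [3, 8, 1, 4]])

# sign() maps every real token to 1 and padding to 0, so the row sum is the length
lengths = np.sign(token_ids).sum(axis=1)

# stand-in for the RNN output with shape (batch, time, hidden)
batch, time, hidden = 3, 4, 2
toy_output = np.arange(batch * time * hidden).reshape(batch, time, hidden)

# flatten to (batch * time, hidden) and compute the row of each last real step
flat_output = toy_output.reshape(-1, hidden)
last_relevant_index = np.arange(batch) * time + (lengths - 1)
last_out = flat_output[last_relevant_index]

print(lengths)
print(last_out)
```

Simply taking the output at the final time step would instead return the state after the padding for all sequences shorter than the longest one in the batch.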

# Conclusion
- DL is not a cure-all
- GIGO (garbage in garbage out)
- feature engineering


## What's next?
- Attention: the next big thing?
- GANs, reinforcement learning (AlphaGo)
- Curriculum training (teacher/student)

## Other example
Check out the other two notebooks.

# Resources
Links to images used in this tutorial and some further reading.
- [Online Book](http://neuralnetworksanddeeplearning.com/) on Neural Networks and Deep Learning by [Michael Nielsen](http://michaelnielsen.org/). *Very nice* introduction to the basic math behind ANNs (including training with SGD and backpropagation).
- [Another Online Book](http://www.deeplearningbook.org/) on Deep Learning by Goodfellow, Bengio, and Courville
- [TensorFlow](https://www.tensorflow.org/) (Google's machine learning framework)
- [Keras](https://keras.io/) as a very popular abstraction layer on top of TensorFlow, Theano, or (soon) CNTK; created by [Francois Chollet](https://twitter.com/fchollet)
- Very simple and basic [implementation](https://gist.github.com/karpathy/d4dee566867f8291f086) of a character-level recurrent neural network that writes text by [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/).
- Chris Olah wrote a [blog post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) that roughly explains the intuition behind LSTMs in a nice visual way.
- The setup of the first hands-on example of this tutorial was inspired by [this example](https://github.com/fchollet/keras/blob/master/examples/reuters_mlp.py).
- [Paper](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) on why dropout helps to prevent over-fitting.
- A [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on types of RNNs and their effectiveness by Andrej Karpathy.
- Very helpful [post](http://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/) by Denny Britz on hidden TensorFlow features, which may save you a lot of head scratching.
- This [blog post](https://indico.io/blog/tensorflow-data-inputs-part1-placeholders-protobufs-queues/) describes some perks of using protobuf queues in TensorFlow.
- Your TensorFlow code is slow even on a GPU? [This post](https://hanxiao.github.io/2017/07/07/Get-10x-Speedup-in-Tensorflow-Multi-Task-Learning-using-Python-Multiprocessing/) might help you.

- Talk: NNs for Classification (15') - 9:00
  - NNs are neurons in layers -> what other buzzwords do you know?
  - from perceptron model to layers
  - weights as matrices, forward pass as simple lin-alg, error at end -> back propagation
  - mathematical model in tensorflow, efficiently distribute computations (tensor dependency graph)
  - abstraction of that with keras
- NN hands on (20') - 9:15
  - dependencies, data, split&encoding
  - NN architecture (batch training, cost functions, learning rate, optimiser)
  - evaluation strategies
- Talk: CNN (10') - 9:35
  - motivation: locality assumption (some data is only useful in context; CNNs allow the NN to project context)
  - 1-hot vs embedding (sparse vs abstract), what is embedding
- CNN hands on (15') - 9:45
  - preprocessing (cutting/padding, dict limit, map to embedding)
  - max pooling, feature maps (deeper, wider nets)
- Talk: RNN (10') - 10:00
  - motivation: sequential data!
  - GRU/LSTM cells as "gates" that let information through or not
- RNN hands on (10') - 10:10
- Conclusion (5') - 10:20
  - DL not magic bullet!
  - GIGO
  - feature engineering shifts, now it's getting data and trying new ways to represent it
- play time (open end) - 10:25
  - experiment with hints, look at other datasets (see other notebooks)
