# Deep Learning

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Artificial-neural-networks" data-toc-modified-id="Artificial-neural-networks-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Artificial neural networks</a></span><ul class="toc-item"><li><span><a href="#Logistic-Regression-(0-hidden-layers-neural-network)" data-toc-modified-id="Logistic-Regression-(0-hidden-layers-neural-network)-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Logistic Regression (0 hidden layers neural network)</a></span></li><li><span><a href="#Multilayer-Perceptron-(1+-hidden-layers-neural-network)" data-toc-modified-id="Multilayer-Perceptron-(1+-hidden-layers-neural-network)-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Multilayer Perceptron (1+ hidden layers neural network)</a></span><ul class="toc-item"><li><span><a href="#Creation" data-toc-modified-id="Creation-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Creation</a></span></li><li><span><a href="#Layer-addition" data-toc-modified-id="Layer-addition-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Layer addition</a></span></li><li><span><a href="#Compilation" data-toc-modified-id="Compilation-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Compilation</a></span></li><li><span><a href="#Training" data-toc-modified-id="Training-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Predicting" data-toc-modified-id="Predicting-1.2.5"><span class="toc-item-num">1.2.5&nbsp;&nbsp;</span>Predicting</a></span></li><li><span><a href="#Exploration-of-layers" data-toc-modified-id="Exploration-of-layers-1.2.6"><span class="toc-item-num">1.2.6&nbsp;&nbsp;</span>Exploration of layers</a></span></li></ul></li></ul></li><li><span><a href="#Deep-Learning-for-Computer-Vision" data-toc-modified-id="Deep-Learning-for-Computer-Vision-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Deep Learning for Computer Vision</a></span><ul class="toc-item"><li><span><a 
href="#Not-DL-model" data-toc-modified-id="Not-DL-model-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Not DL model</a></span><ul class="toc-item"><li><span><a href="#Training" data-toc-modified-id="Training-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Predicting" data-toc-modified-id="Predicting-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Predicting</a></span></li><li><span><a href="#Accuracy-score" data-toc-modified-id="Accuracy-score-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Accuracy score</a></span></li></ul></li><li><span><a href="#Artificial-neural-network" data-toc-modified-id="Artificial-neural-network-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Artificial neural network</a></span></li><li><span><a href="#Convolutional-neural-network-(CNN)" data-toc-modified-id="Convolutional-neural-network-(CNN)-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Convolutional neural network (CNN)</a></span></li><li><span><a href="#Pretrained-nets" data-toc-modified-id="Pretrained-nets-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Pretrained nets</a></span></li></ul></li><li><span><a href="#Deep-Learning-for-NLP-(natural-language-processing)" data-toc-modified-id="Deep-Learning-for-NLP-(natural-language-processing)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Deep Learning for NLP (natural language processing)</a></span></li><li><span><a href="#Further-materials" data-toc-modified-id="Further-materials-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Further materials</a></span></li></ul></div>

In [None]:
import numpy as np  # used later for np.argwhere / np.argmax
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# TensorFlow: low-level library
# Keras: high-level API built on top of TensorFlow

In [None]:
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras import layers

In [None]:
import matplotlib.pyplot as plt

## Artificial neural networks

Let's build our first neural network to predict breast cancer. Will it perform better than Logistic Regression?

In [None]:
df = pd.read_csv("../datasets/breast_cancer.csv")

In [None]:
df.head()

In [None]:
X = df.drop("is_cancer", axis=1)
y = df.is_cancer

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

In [None]:
print(f"Training set has {X_train.shape[0]} entries")
print(f"Test set has {X_test.shape[0]} entries")

### Logistic Regression (0 hidden layers neural network)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log = LogisticRegression(max_iter=10000)

In [None]:
log.fit(X_train, y_train)

In [None]:
log.score(X_train, y_train)

In [None]:
log.score(X_test, y_test)

In [None]:
pd.Series(log.coef_[0], index=X.columns).sort_values().tail()

Logistic Regression finds the best weights $w_i$, one per feature, and an extra parameter $b$ (the bias), to predict $y$ given feature values $x_i$.

To predict a new test instance, it does two steps:

$$z = \sum_{i}x_i * w_i + b$$

$$y=\frac{1}{1 + e^{-z}}$$

<img width=300 src="https://miro.medium.com/max/1086/1*dkpb3XSLslX9IjIAGrSYsA.png">
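
The two prediction steps above can be sketched in a few lines of NumPy. This is only an illustration: the values of `x`, `w` and `b` are made up, and `predict_proba` is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Two-step logistic regression prediction for one instance."""
    z = np.dot(x, w) + b   # step 1: weighted sum of the features plus the bias
    return sigmoid(z)      # step 2: turn the score into a probability

# Hypothetical values, for illustration only
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.0
p = predict_proba(x, w, b)  # z = 0.5 - 0.5 + 0.0 = 0, so p = 0.5
```

With the model fitted above, `log.coef_[0]` and `log.intercept_[0]` play the roles of `w` and `b`.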

### Multilayer Perceptron (1+ hidden layers neural network)

<img width=500 src="./multilayer.png">

#### Creation

In [None]:
# neural network creation
network = models.Sequential()

The core building block of neural networks is the **layer**, composed of different **nodes**
 
Our neural network will have:
 * input layer: dimension 30 (number of predictors)
 * hidden layer 1: dimension 10
 * output layer: dimension 1 (is_cancer)

In general, layer N+1 has fewer nodes than layer N

In [None]:
X_train.shape

#### Layer addition

In [None]:
network.add(layers.Dense(10, activation='relu', input_shape=(30,)))
network.add(layers.Dense(1, activation='sigmoid'))

A network:
 * while training, finds the optimal weights (arrows between layers)
 * while predicting, feeds the features $x_i$ forward through the network. At every layer, it:
  * multiplies by the weights and adds the bias
  * runs the result through the activation function

How many weights do we need?

In [None]:
network.summary()
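
As a cross-check on the summary, the parameter count can be worked out by hand. A minimal sketch, assuming the 30-10-1 architecture defined above; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(30, 10)  # 30 * 10 weights + 10 biases = 310
output = dense_params(10, 1)   # 10 * 1 weights + 1 bias = 11
total = hidden + output        # 321 trainable parameters
```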

Activation functions give neural networks their non-linear nature

<img width=600 src="https://miro.medium.com/max/1200/1*ZafDv3VUm60Eh10OeJu1vw.png">
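
To see why the non-linearity matters, here is a small NumPy sketch. It defines the two activations used in this notebook and shows that two stacked layers without an activation collapse into a single linear layer. The matrices `W1`, `W2` and the vector `x` are random, hypothetical values.

```python
import numpy as np

def relu(z):
    """Keep positive values, clip negative ones to 0."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squash values into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)        # negatives become 0
sigmoid_out = sigmoid(z)  # every value lands strictly between 0 and 1

# Without an activation, two layers are equivalent to one linear layer
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x  # same result with a single weight matrix
```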

#### Compilation

To make our network ready for training, we need to pick three more things, as part of the "compilation" step:

 * A **loss function** (differentiable metric): this is how the network will be able to measure how good a job it is doing on its training data
 * An **optimizer**: this is the mechanism through which the network will update itself based on the data it sees.
 * **Metrics** to monitor during training and testing. Here we will only care about accuracy (the fraction of the samples that were correctly classified)

In [None]:
network.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

[Keras optimizers](https://keras.io/api/optimizers/)  
[Keras losses](https://keras.io/api/losses/)
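
For intuition about the loss chosen above, binary cross-entropy can be sketched in NumPy. This is an illustrative re-implementation, not the Keras source; the clipping constant `eps` and the sample predictions are assumptions.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predictions."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(per_sample))

y_true = np.array([1.0, 0.0, 1.0])
good_loss = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.9]))  # confident and right
bad_loss = binary_crossentropy(y_true, np.array([0.1, 0.9, 0.1]))   # confident and wrong
```

Confident wrong predictions are penalised much more heavily than confident right ones, which is what drives the weight updates during training.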

#### Training

A network is trained over epochs (full passes over the training data). For every batch within an epoch, it:
 1. predicts with the current weights (forward propagation)
 2. compares the predictions with the real labels (loss)
 3. updates the weights (back propagation)
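
The three steps above can be sketched with plain NumPy on the simplest possible network: a single sigmoid unit. This is a simplification under stated assumptions: the toy data is randomly generated, the whole dataset is used as one batch, and plain gradient descent stands in for the Adam optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 2 features, label depends on their sum
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
learning_rate = 0.5

def forward(X, w, b):
    """Weighted sum followed by a sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

losses = []
for epoch in range(50):
    p = forward(X, w, b)                 # 1. forward propagation
    p_safe = np.clip(p, 1e-7, 1 - 1e-7)  # avoid log(0)
    loss = -np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))  # 2. compare with labels
    losses.append(loss)
    grad_z = (p - y) / len(y)            # 3. gradient of the loss w.r.t. the weighted sum
    w -= learning_rate * (X.T @ grad_z)  #    update the weights...
    b -= learning_rate * grad_z.sum()    #    ...and the bias

accuracy = np.mean((forward(X, w, b) > 0.5) == (y == 1))
```

In a multilayer network the same idea applies, with the gradient propagated backwards through every layer via the chain rule.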

In [None]:
network.fit(X_train, y_train.astype(float), batch_size=32, epochs=300, validation_split=0.1)
# validation_split: fraction of samples held out from training and used for validation at the end of every epoch

#### Predicting

What does the network predict for the first 10 test entries?

In [None]:
# predictions
network.predict(X_test)[:10].round(3)

In [None]:
# real
y_test[:10]

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# accuracy_score expects (y_true, y_pred); ravel() flattens the (n, 1) predictions
accuracy_score(y_train, network.predict(X_train).ravel() > 0.5)

In [None]:
accuracy_score(y_test, network.predict(X_test).ravel() > 0.5)

Not better than logistic regression in this case. We may need to tune the number of layers, the number of nodes per layer, or the number of training epochs.

#### Exploration of layers

In [None]:
network.layers

In [None]:
network.layers[0].get_weights()[0].shape

In [None]:
network.layers[0].get_weights()[0][:, 0]

In [None]:
network.layers[1].get_weights()[0]

## Deep Learning for Computer Vision

Let's classify hand-written digits

The problem: classification in the MNIST dataset
 * **classify** grayscale images 
 * of handwritten digits
 * 28 pixels by 28 pixels
 * into their 10 categories (0 to 9)

The MNIST dataset comes pre-loaded in Keras, in the form of a set of four NumPy arrays:

In [None]:
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

In [None]:
type(train_images)

In [None]:
train_images.shape

In [None]:
plt.imshow(train_images[20], cmap="gray")

In [None]:
train_labels[20]

In [None]:
n_images = 10
fig, axs = plt.subplots(1, n_images, figsize=(20, 20))
for i in range(n_images):
    axs[i].imshow(train_images[i], cmap="gray")

In [None]:
train_labels[:n_images]

### Not DL model

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
train_images[0].shape

In [None]:
X_train = train_images.reshape(train_images.shape[0], -1)
X_test = test_images.reshape(test_images.shape[0], -1)

In [None]:
train_images.shape

In [None]:
train_images[0].shape

In [None]:
X_train.shape

In [None]:
X_train[0].shape

#### Training

In [None]:
model = GradientBoostingClassifier(n_estimators=10, max_depth=5, max_features=0.1)

In [None]:
%%time
model.fit(X_train, train_labels)

#### Predicting

Let's see how this works on the first 10 test images

In [None]:
test_labels[:10]

In [None]:
n_images = 10
fig, axs = plt.subplots(1, n_images, figsize=(20, 20))
for i in range(n_images):
    axs[i].imshow(test_images[i], cmap="gray")

In [None]:
model.predict(X_test[:10])

Let's now see some examples that were not correctly predicted

In [None]:
error_indices = np.argwhere(test_labels[:1000] != model.predict(X_test[:1000]))

In [None]:
n_images = 10
fig, axs = plt.subplots(1, n_images, figsize=(20, 20))
for i, index in zip(range(n_images), error_indices):
    axs[i].imshow(test_images[index][0], cmap="gray")

In [None]:
model.predict(X_test[error_indices.flatten()])[:10]

#### Accuracy score

In [None]:
model.score(X_train, train_labels)

In [None]:
model.score(X_test, test_labels)

### Artificial neural network

Our workflow will be as follows:
 * first we will present our neural network with the training data, `train_images` and `train_labels`...
 * this way, the network will then learn to associate images and labels
 * finally, we will ask the network to produce predictions for `test_images`...
 * and we will verify if these predictions match the labels from `test_labels`

In [None]:
from tensorflow.keras import models
from tensorflow.keras import layers

In [None]:
network = models.Sequential()

 * Here our network consists of a sequence of two `Dense` layers, which are densely-connected (also called "fully-connected") neural layers
 * The second (and last) layer is a 10-way "softmax" layer, which means...
 * it will return an array of 10 probability scores (summing to 1)
 * each score being the probability that the current digit image belongs to one of our 10 digit classes.
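
To make the softmax output concrete, here is a minimal NumPy sketch. The raw `scores` are made-up values standing in for the 10 outputs of the last layer before the activation is applied.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw scores, one per digit class (0 to 9)
scores = np.array([1.0, 2.0, 0.5, 0.0, 3.0, -1.0, 0.2, 0.1, 1.5, 0.3])
probs = softmax(scores)
predicted_digit = int(np.argmax(probs))  # class with the highest probability
```

Taking `np.argmax` over the probabilities is how the predicted digit is recovered from the network output later in this section.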

In [None]:
network.add(layers.Dense(100, activation='relu', input_shape=(784,)))
network.add(layers.Dense(10, activation='softmax'))

In [None]:
network.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

Before training:
 * reshape every 28x28 image into a vector of 784 values
 * scale it so that all values are in the `[0, 1]` interval

In [None]:
train_vectors = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_vectors = test_images.reshape((10000, 28 * 28)).astype('float32') / 255

In [None]:
train_vectors.shape

We also need to categorically encode (one-hot encode) the labels:

In [None]:
from tensorflow.keras.utils import to_categorical

In [None]:
train_labels

In [None]:
train_labels_hot = to_categorical(train_labels)
test_labels_hot = to_categorical(test_labels)

In [None]:
train_labels_hot[:10]
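
What the categorical encoding does can be sketched in NumPy. This is an illustrative equivalent, not the Keras implementation; `one_hot` is a hypothetical helper and the three labels are made up.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer labels as rows with a single 1 at the label's index."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

labels = np.array([5, 0, 4])  # hypothetical digit labels
encoded = one_hot(labels, num_classes=10)
decoded = np.argmax(encoded, axis=1)  # argmax recovers the original labels
```

This format matches the 10-way softmax output, so the loss can compare the two vectors position by position.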

We are now ready to train our network, which in Keras is done via a call to the `fit` method of the network: 
we "fit" the model to its training data.

In each forward pass we don't need to pass all 60k images at once: we can pass them in batches, which gives more than one training step (weight update) per epoch.
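
As a rough worked example, the number of weight updates can be computed by hand. The values below assume the settings used in the training cell of this section (`batch_size=128`, `validation_split=0.1`, `epochs=20`).

```python
import math

n_samples = 60000
validation_split = 0.1
batch_size = 128
epochs = 20

n_validation = round(n_samples * validation_split)  # 6000 samples held out
n_train = n_samples - n_validation                  # 54000 samples used for training
steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch is smaller than 128
total_steps = steps_per_epoch * epochs
```

So each epoch performs 422 weight updates instead of a single one.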

In [None]:
network.summary()

In [None]:
%%time
network.fit(train_vectors, train_labels_hot, epochs=20, batch_size=128, validation_split=0.1)
# no gpu

In [None]:
plt.plot(network.history.history["accuracy"], label="train")
plt.plot(network.history.history["val_accuracy"], label="validation")
plt.legend()

In [None]:
_, test_acc = network.evaluate(test_vectors, test_labels_hot)

In [None]:
test_acc

In [None]:
n_images = 10
fig, axs = plt.subplots(1, n_images, figsize=(20, 20))
for i in range(n_images):
    axs[i].imshow(test_images[i], cmap="gray")

Some examples

In [None]:
test_labels[:10]

In [None]:
np.argmax(network.predict(test_vectors), axis=-1)[:10]

What about the previous errors made by the gradient boosting model?

In [None]:
n_images = 10
fig, axs = plt.subplots(1, n_images, figsize=(20, 20))
for i, index in zip(range(n_images), error_indices):
    axs[i].imshow(test_images[index][0], cmap="gray")

In [None]:
np.argmax(network.predict(test_vectors[error_indices.flatten()]), axis=-1)[:10]

Some examples still not correctly predicted

In [None]:
error_indices_net = np.argwhere(test_labels[:1000] != np.argmax(network.predict(test_vectors), axis=-1)[:1000]).flatten()

In [None]:
error_indices_net

In [None]:
n_images = 10
fig, axs = plt.subplots(1, n_images, figsize=(20, 20))
for i, index in zip(range(n_images), error_indices_net):
    axs[i].imshow(test_images[index], cmap="gray")

In [None]:
test_labels[error_indices_net][:10]

In [None]:
np.argmax(network.predict(test_vectors[error_indices_net]), axis=-1)[:10]

Explore the network

In [None]:
network.layers[0].get_weights()[0].shape

In [None]:
network.layers[1].get_weights()[0].shape

On image datasets that are less homogeneous than MNIST, this type of network performs **badly**

This translates to: "never use the previous networks for images"

### Convolutional neural network (CNN)

CNNs are specifically designed for image analysis

(See the presentation slides)

In [None]:
model = models.Sequential()

In [None]:
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

In [None]:
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

In [None]:
model.summary()
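
The shapes and parameter counts in the summary can be reproduced by hand. A minimal sketch, assuming the Keras defaults used above (3x3 kernels, stride 1, no padding, 2x2 non-overlapping pooling); the helper functions are hypothetical.

```python
def conv_out(size, kernel=3):
    """An unpadded convolution with stride 1 shrinks each side by kernel - 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Non-overlapping pooling divides each side by the pool size (rounded down)."""
    return size // pool

def conv_params(in_channels, out_channels, kernel=3):
    """One kernel x kernel filter per (input, output) channel pair, plus one bias per output."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units

size = 28
size = conv_out(size)  # Conv2D(32):   26 x 26
size = pool_out(size)  # MaxPooling2D: 13 x 13
size = conv_out(size)  # Conv2D(64):   11 x 11
size = pool_out(size)  # MaxPooling2D:  5 x 5
size = conv_out(size)  # Conv2D(64):    3 x 3
flattened = size * size * 64  # input size of the first Dense layer

total_params = (
    conv_params(1, 32)
    + conv_params(32, 64)
    + conv_params(64, 64)
    + dense_params(flattened, 64)
    + dense_params(64, 10)
)
```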

In [None]:
train_images.shape

In [None]:
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255

In [None]:
model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

In [None]:
model.fit(train_images, train_labels_hot, epochs=10, batch_size=128, validation_split=0.1)

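Why does validation accuracy appear to be better than training accuracy? The training metrics are averaged over all batches of the epoch, while the model is still improving, whereas the validation metrics are computed at the end of the epoch. As the Keras documentation says:  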
"Besides, the training loss that Keras displays is the average of the losses for each batch of training data, over the current epoch."  
[LINK](https://keras.io/getting_started/faq/#why-is-my-training-loss-much-higher-than-my-testing-loss)

In [None]:
_, test_acc = model.evaluate(test_images, test_labels_hot)

In [None]:
test_acc

 * Before we had ~98% accuracy (~2% error rate)
 * Now we have 99.25% accuracy (0.75% error rate)
 * Meaning that the number of errors dropped by ~60%
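
The ~60% figure follows from comparing error rates rather than accuracies. A short worked example, using the approximate accuracies quoted above:

```python
old_accuracy = 0.98    # dense network, approximate figure quoted above
new_accuracy = 0.9925  # CNN, figure quoted above

old_error = 1 - old_accuracy  # about 2% of test images misclassified
new_error = 1 - new_accuracy  # about 0.75% of test images misclassified
relative_reduction = (old_error - new_error) / old_error  # fraction of errors removed
```

This is a net reduction: it does not guarantee that every image the CNN gets right was among the earlier mistakes, or the other way round.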

Some examples of previous bad predictions

In [None]:
error_indices_net

In [None]:
n_images = 10
fig, axs = plt.subplots(1, n_images, figsize=(20, 20))
for i, index in zip(range(n_images), error_indices_net):
    axs[i].imshow(test_images[index], cmap="gray")

In [None]:
# reality
test_labels[error_indices_net][:10]

Old model

In [None]:
np.argmax(network.predict(test_vectors[error_indices_net]), axis=-1)[:10]

New model

In [None]:
np.argmax(model.predict(test_images[error_indices_net]), axis=-1)[:10]

### Pretrained nets

In [None]:
from tensorflow.keras.applications import VGG16

conv_base = VGG16(
    weights='imagenet',
    include_top=False,
    input_shape=(150, 150, 3)
)

In [None]:
conv_base.summary()

## Deep Learning for NLP (natural language processing)

The building blocks of Deep Learning for NLP are word embeddings (as explained in the slides)

In [None]:
import gensim

First, download a pretrained embedding file (here, the GloVe vectors `glove.6B.50d.txt`) that gensim can load

In [None]:
%%time
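# GloVe files are plain text and lack the word2vec header line,
# so load them as text and skip the header (no_header assumes gensim >= 4.0)
model = gensim.models.KeyedVectors.load_word2vec_format('./glove.6B.50d.txt', binary=False, no_header=True)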

In [None]:
word = "queen"

In [None]:
word_vector = model[word]

In [None]:
word_vector.shape

In [None]:
other_words = ["dog", "king", "god", "cream", "princess"]

In [None]:
other_words_vectors = [model[w] for w in other_words]

In [None]:
similarities = model.cosine_similarities(
    word_vector, 
    other_words_vectors
)

In [None]:
similarities.round(2)
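
For reference, cosine similarity itself is simple to compute by hand. A minimal NumPy sketch with made-up 2-dimensional vectors (real embeddings here have 50 dimensions); `cosine_similarity` is a hypothetical helper, not the gensim method.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
same_direction = cosine_similarity(a, np.array([3.0, 0.0]))  # parallel vectors
orthogonal = cosine_similarity(a, np.array([0.0, 2.0]))      # unrelated directions
opposite = cosine_similarity(a, np.array([-1.0, 0.0]))       # opposite directions
```

Only the direction of the vectors matters, not their length, which is why the measure suits word embeddings.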

In [None]:
for w, s in zip(other_words, similarities.round(2)):
    print(f"{w:10} - queen: {s:.2f}")

In [None]:
most_similar = other_words[similarities.argmax()]

In [None]:
most_similar

## Further materials

[Transfer learning by sheriff & Pablo](https://www.youtube.com/watch?v=0xBUXy-9_3k&ab_channel=T3chFest)