## **Transfer Learning**

Let's split the Fashion MNIST training set in two:
* `X_train_A`: all images of all items except for sandals and shirts (classes 5 and 6).
* `X_train_B`: a much smaller training set of just the first 200 images of sandals or shirts.

The validation set and the test set are also split this way, but without restricting the number of images.

We will train a model on set A (classification task with 8 classes), and try to reuse it to tackle set B (binary classification). We hope to transfer a little bit of knowledge from task A to task B, since classes in set A (sneakers, ankle boots, coats, t-shirts, etc.) are somewhat similar to classes in set B (sandals and shirts). 

Since we are using `Dense` layers, only patterns that occur at the same location can be reused (in contrast, convolutional layers will transfer much better, since learned patterns can be detected anywhere in the image, as we will see in the CNN chapter).
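
The location dependence of dense layers can be illustrated with a small NumPy sketch. This is a hypothetical 1-D toy, not part of the Fashion MNIST pipeline: `pattern`, `place`, and the signal length are made up for illustration. A dense-style unit has one weight per input position, so it only responds when the pattern sits where it was learned, while a convolution-style detector slides the same three weights across the input.

```python
import numpy as np

# Hypothetical 1-D "feature" that a unit has learned to detect
pattern = np.array([1.0, 2.0, 1.0])


def place(pattern, position, length=10):
    """Return a zero signal of the given length with `pattern` written at `position`."""
    x = np.zeros(length)
    x[position:position + len(pattern)] = pattern
    return x


x_seen = place(pattern, 2)     # pattern where it appeared during training
x_shifted = place(pattern, 6)  # same pattern, different location

# Dense-style unit: one weight per input position, tuned to location 2
dense_weights = place(pattern, 2)
dense_seen = dense_weights @ x_seen
dense_shifted = dense_weights @ x_shifted

# Convolution-style detector: the same 3 weights slid over every position
conv_seen = np.correlate(x_seen, pattern, mode="valid").max()
conv_shifted = np.correlate(x_shifted, pattern, mode="valid").max()

print(dense_seen, dense_shifted)  # strong response, then no response
print(conv_seen, conv_shifted)    # same peak response in both cases
```

The dense unit's response drops to zero once the pattern moves, whereas the sliding detector's peak response is unchanged.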

In [14]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]




In [2]:


def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6)  # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2  # class indices 7, 8, 9 should be moved to 5, 6, 7

    # binary classification task: is it a shirt (class 6)?
    y_B = (y[y_5_or_6] == 6).astype(np.float32)
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))


(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]


In [3]:
X_train_A.shape, X_train_B.shape


((43986, 28, 28), (200, 28, 28))
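
To check what the relabeling in `split_dataset` does, here is a minimal sketch on a toy label vector containing each class exactly once. The names `X_toy` and `y_toy` are hypothetical; the function body is the same as in the cell above.

```python
import numpy as np


def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6)  # sandals or shirts
    y_A = y[~y_5_or_6]              # boolean indexing returns a copy
    y_A[y_A > 6] -= 2               # class indices 7, 8, 9 become 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32)  # 1.0 for shirts, 0.0 for sandals
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))


# Toy data: one sample per class; each "image" simply stores its original label
y_toy = np.arange(10)
X_toy = np.arange(10).reshape(-1, 1)

(X_A, y_A), (X_B, y_B) = split_dataset(X_toy, y_toy)

print(X_A.ravel(), y_A)  # original labels kept in task A, and their new indices
print(X_B.ravel(), y_B)  # original labels 5 and 6, and their binary targets
```

Because `y[~y_5_or_6]` is a copy, the in-place `-= 2` does not modify the original label array.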

In [4]:
from tensorflow.keras.layers import Dense, Flatten

tf.random.set_seed(42)
np.random.seed(42)

model_A = keras.models.Sequential()
model_A.add(Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(Dense(n_hidden, activation="selu", kernel_initializer='lecun_normal'))
model_A.add(Dense(8, activation="softmax"))


In [5]:
model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])


In [6]:
history = model_A.fit(X_train_A, y_train_A, epochs=20,
                      validation_data=(X_valid_A, y_valid_A))


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [7]:
model_A.save("resources/my_model_A.h5")


Reusing pretrained layers

In [8]:
model_A = keras.models.load_model("resources/my_model_A.h5")

model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))


Note that `model_B_on_A` and `model_A` actually share layers now, so when we train one, it will update both models. If we want to avoid that, we need to build `model_B_on_A` on top of a **clone** of `model_A`:

In [9]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())
model_B_on_A = keras.models.Sequential(model_A_clone.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))
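
The difference between sharing layers and cloning can be checked directly on a tiny model. The sketch below uses a hypothetical `toy_A` (4 inputs, 2 classes) as a stand-in for `model_A`, so it runs without training or loading anything; it assumes a TensorFlow installation that provides `tensorflow.keras`.

```python
import numpy as np
from tensorflow import keras

# Hypothetical toy model standing in for model_A
toy_A = keras.models.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(3, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])

# Reusing the layer objects directly: both models hold the very same layer
toy_shared = keras.models.Sequential(toy_A.layers[:-1])
layers_are_shared = toy_shared.layers[0] is toy_A.layers[0]

# clone_model() copies the architecture only, so the weights are copied explicitly
toy_clone = keras.models.clone_model(toy_A)
toy_clone.set_weights(toy_A.get_weights())
clone_is_separate = toy_clone.layers[0] is not toy_A.layers[0]
weights_match = all(
    np.array_equal(w_a, w_c)
    for w_a, w_c in zip(toy_A.get_weights(), toy_clone.get_weights())
)

print(layers_are_shared, clone_is_separate, weights_match)
```

The clone starts from the same weights but owns separate layer objects, so training a model built on it leaves the original untouched.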


In [10]:
# Freeze the reused layers so that the first few epochs train only the new output layer
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])

history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


Then unfreeze the reused layers (which requires compiling the model again) and continue training to fine-tune them. After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid damaging the reused weights.

In [11]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-4),
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))


Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


You should not be convinced by this result, because the author cheated: *I tried many configurations until I found one that demonstrated a strong improvement. If you try changing the classes or the random seed, you will see that the improvement generally drops, or even vanishes or reverses.*

Why did I cheat? It turns out that transfer learning does not work very well with small dense networks, presumably because small networks learn few patterns, and dense networks learn very specific patterns, which are unlikely to be useful in other tasks. Transfer learning works best with deep convolutional neural networks, which tend to learn feature detectors that are much more general.

## **Faster Optimizers**

* **Momentum optimization**
```python 
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9) 
```
* **Nesterov Accelerated Gradient**
```python 
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
```
* **AdaGrad** 
```python 
optimizer = keras.optimizers.Adagrad(learning_rate=0.001)
```
* **RMSProp**
```python 
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
```
* **Adam Optimization** 
```python 
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
```
* **AdaMax Optimization** 
```python 
optimizer = keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
```
* **Nadam Optimization** 
```python 
optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
```
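
To give a feel for what the `momentum` argument changes, here is a minimal pure-Python sketch of the update rule that Keras documents for `SGD` with momentum: `velocity = momentum * velocity - learning_rate * g`, then `w = w + velocity`. The quadratic objective, the helper names `grad` and `run_sgd`, and the step count are assumptions chosen for illustration; this is not how Keras implements the optimizer internally.

```python
def grad(x):
    """Gradient of the toy objective f(x) = 0.5 * x**2."""
    return x


def run_sgd(x0, learning_rate, momentum, n_steps):
    """Apply the SGD-with-momentum update n_steps times and return the final x."""
    x, velocity = x0, 0.0
    for _ in range(n_steps):
        velocity = momentum * velocity - learning_rate * grad(x)
        x = x + velocity
    return x


# Same learning rate as in the list above; momentum=0.0 is plain gradient descent
x_plain = run_sgd(1.0, learning_rate=0.001, momentum=0.0, n_steps=100)
x_momentum = run_sgd(1.0, learning_rate=0.001, momentum=0.9, n_steps=100)

print(x_plain, x_momentum)
```

On this toy problem, the run with `momentum=0.9` ends closer to the minimum at 0 than plain gradient descent after the same number of steps, because the velocity accumulates successive gradients that point in the same direction.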