# TP 3 Machine Learning
**Due date and time:** 30/05/2022 18:00

*For more information, see [the assignment page on the course's official wiki](https://github.com/ucseml-team/machine-learning-course/wiki/TP3_2022)*

# **Fashion-MNIST Dataset**

The dataset consists of 60,000 grayscale images of 28x28 pixels in 10 fashion categories, along with a test set of 10,000 images. It can be used as a drop-in replacement for MNIST.

<img src="https://tensorflow.org/images/fashion-mnist-sprite.png"></img>


**The classes are:**

Label   | Description
--------|------------------
0       | T-shirt/top
1       | Trouser
2       | Pullover
3       | Dress
4       | Coat
5       | Sandal
6       | Shirt
7       | Sneaker
8       | Bag
9       | Ankle boot


**Returns**

A tuple of NumPy arrays: (x_train, y_train), (x_test, y_test):

**x_train**: uint8 NumPy array of grayscale image data with shape (60000, 28, 28), containing the training data.

**y_train**: uint8 NumPy array of labels (integers in the range 0-9) with shape (60000,) for the training data.

**x_test**: uint8 NumPy array of grayscale image data with shape (10000, 28, 28), containing the test data.

**y_test**: uint8 NumPy array of labels (integers in the range 0-9) with shape (10000,) for the test data.

## **Loading the Fashion-MNIST dataset**

> **load_data** function


```
keras.datasets.fashion_mnist.load_data()
```

---


*For more information about the dataset, see [the Keras documentation page for Fashion MNIST](https://keras.io/api/datasets/fashion_mnist/)*

In [None]:
# !pip install -r requirements.txt

In [None]:
# from the Python standard library, for specifying file and directory paths
from pathlib import Path

# library for working with arrays
import numpy as np

# libraries used to display images and plots
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly import tools

# libraries used to build and train neural networks, which also provide
# utilities for reading image datasets

# TensorFlow and tf.keras
import tensorflow as tf
print(tf.__version__)
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input, Dropout, Convolution2D, MaxPooling2D, Flatten, Conv2D
from tensorflow.keras.preprocessing.image import load_img, img_to_array, ImageDataGenerator
from tensorflow.keras.utils import model_to_dot, plot_model

# libraries used for general machine learning tasks; here, metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# make plots render inside the notebook
%matplotlib inline

import pandas as pd

from IPython.display import SVG


In [None]:
fashion_mnist = keras.datasets.fashion_mnist

## Train and Test

In [None]:
# class names indexed by label; used throughout the notebook
CLOTHES = ("T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot")

# load_data() returns (x_train, y_train), (x_test, y_test)
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Exploratory analysis of the dataset

## Data volume

In [None]:
# Check the dimensions of the dataset
assert train_images.shape == (60000, 28, 28)
assert test_images.shape == (10000, 28, 28)
assert train_labels.shape == (60000,)
assert test_labels.shape == (10000,)

In [None]:
# Show the number of samples and the shape of each split
data_shape = [('train_images', len(train_images), train_images.shape),
              ('train_labels', len(train_labels), train_labels.shape),
              ('test_images', len(test_images), test_images.shape),
              ('test_labels', len(test_labels), test_labels.shape)]
pd.DataFrame(data_shape, columns=['Dataset Name', 'Number of samples', 'Shape'])

In [None]:
train_labels

## Structure and type of the images

In [None]:
def print_dataset_images(dimensions, images, labels, isColorbar=False):
  plt.figure(figsize=(10,10))
  for i in range(dimensions*dimensions):
      plt.subplot(dimensions,dimensions,i+1)
      plt.xticks([])
      plt.yticks([])
      plt.imshow(images[i])
      if isColorbar: 
        plt.colorbar()
      plt.grid(False)
      plt.title("Class: {} \n {}".format(labels[i], CLOTHES[labels[i]]))
  plt.subplots_adjust(top=1.1, right=1)
  plt.show()

In [None]:
print_dataset_images(5, train_images, train_labels)

In [None]:
print_dataset_images(5, train_images, train_labels, True)

## Distribution of the target variable

Both the train and test sets are perfectly balanced: every class has the same number of samples (6,000 per class in train and 1,000 per class in test).

In [None]:
def plot_pie_per_class(labels):
  # count how many samples each class has
  counts = np.bincount(labels, minlength=len(CLOTHES))
  plt.title("Distribution of clothing items per class")
  plt.pie(counts, labels=CLOTHES, autopct="%0.2f %%");

In [None]:
def get_count_per_class(yd):
    ydf = pd.DataFrame(yd)
    label_counts = ydf[0].value_counts() # number of samples per class
    total_samples = len(yd) # total number of samples

    for i in range(len(label_counts)): # print count and percentage for each class
        label = CLOTHES[label_counts.index[i]]
        count = label_counts.values[i]
        percent = (count / total_samples) * 100
        print("{:<20s}:   {} or {:.2f}%".format(label, count, percent))

### Train

In [None]:
get_count_per_class(train_labels)

In [None]:
plot_pie_per_class(train_labels)

### Test

In [None]:
get_count_per_class(test_labels)

In [None]:
plot_pie_per_class(test_labels)
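
The balance claim above can also be checked programmatically rather than read off the printed counts. The following is a minimal sketch, not part of the original notebook: `is_balanced` is a hypothetical helper, and the small `toy_labels` list stands in for `train_labels` or `test_labels`.

```python
from collections import Counter

def is_balanced(labels, num_classes=10):
    """Return True if every class in range(num_classes) appears equally often."""
    counts = Counter(int(label) for label in labels)
    per_class = [counts.get(c, 0) for c in range(num_classes)]
    return len(set(per_class)) == 1 and per_class[0] > 0

# Toy stand-in for train_labels / test_labels (assumption: 3 samples per class)
toy_labels = [c for c in range(10) for _ in range(3)]
print(is_balanced(toy_labels))        # balanced toy data
print(is_balanced(toy_labels + [0]))  # one extra sample of class 0
```

On the real data, the same call would be `is_balanced(train_labels)`.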

# Preprocessing the dataset

In [None]:
train_images = train_images / 255.0
test_images = test_images / 255.0
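
Dividing by 255 maps the uint8 pixel values from [0, 255] to floats in [0, 1], which keeps the network inputs on a small, uniform scale. A minimal sketch on a synthetic array (the random `fake_images` batch is an assumption standing in for the real dataset):

```python
import numpy as np

# Synthetic stand-in for a batch of uint8 grayscale images (assumption, not real data)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

scaled = fake_images / 255.0  # true division promotes uint8 to float64

print(fake_images.dtype, "->", scaled.dtype)
print(scaled.min() >= 0.0, scaled.max() <= 1.0)
```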

In [None]:
print_dataset_images(5, train_images, train_labels, True)
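
The CNN models later in this notebook declare `input_shape=(28, 28, 1)`, while the arrays here have shape `(N, 28, 28)`. One way to make the channel axis explicit, rather than relying on any implicit reshaping by the framework, is `np.expand_dims`. This is a sketch on a zero-filled placeholder array, not a change to the original pipeline:

```python
import numpy as np

# Placeholder batch with the same layout as train_images (assumption: 5 samples)
images = np.zeros((5, 28, 28), dtype=np.float32)

# Add a trailing channel axis: (N, 28, 28) -> (N, 28, 28, 1)
images_with_channel = np.expand_dims(images, axis=-1)

print(images_with_channel.shape)
```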

# Machine Learning 

## Helper functions

In [None]:
# Plots the train and validation accuracy curves from the History object returned by fit()
def curvas(model):
    plt.plot(model.history['accuracy'], label='train')
    plt.plot(model.history['val_accuracy'], label='test')
    plt.title('Accuracy over train epochs')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(loc='upper left')
    plt.show()

In [None]:
def crear_curvas(x, y, ylabel, color):
        trace = go.Scatter(
            x = x,y = y,
            name=ylabel,
            marker=dict(color=color),
            mode = "markers+lines",
            text=x
        )
        return trace
    
def plot_accuracy_and_loss(model):
    hist = model.history
    acc = hist['accuracy']
    val_acc = hist['val_accuracy']
    loss = hist['loss']
    val_loss = hist['val_loss']
    epochs = list(range(1,len(acc)+1))
    
    trace_ta = crear_curvas(epochs,acc,"Training accuracy", "Green")
    trace_va = crear_curvas(epochs,val_acc,"Validation accuracy", "Red")
    trace_tl = crear_curvas(epochs,loss,"Training loss", "Blue")
    trace_vl = crear_curvas(epochs,val_loss,"Validation loss", "Magenta")

    fig = tools.make_subplots(rows=1,cols=2, subplot_titles=('Training and validation accuracy',
                                                             'Training and validation loss'))
    fig.append_trace(trace_ta,1,1)
    fig.append_trace(trace_va,1,1)
    fig.append_trace(trace_tl,1,2)
    fig.append_trace(trace_vl,1,2)
    fig['layout']['xaxis'].update(title = 'Epoch')
    fig['layout']['xaxis2'].update(title = 'Epoch')
    fig['layout']['yaxis'].update(title = 'Accuracy', range=[0,1])
    fig['layout']['yaxis2'].update(title = 'Loss', range=[0,1])

    
    iplot(fig, filename='accuracy-loss')

## Modelos

### MLP 1:

In [None]:
model_mlp_1 = Sequential([
    Flatten(input_shape=(28,28)),

    Dense(10, activation='relu'),

    Dense(len(CLOTHES), activation='softmax'),
])

model_mlp_1.compile(
    # Gradient-descent optimizer to use:
    optimizer='adam',
    # Loss function to use:
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy',],
)
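
`sparse_categorical_crossentropy` takes integer labels (as in `train_labels`) instead of one-hot vectors and, for each sample, computes the negative log of the probability that the softmax output assigns to the true class. A minimal pure-Python sketch of that per-sample computation (the `logits` values are made up for illustration, and this is not Keras's actual implementation):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_label, probs):
    """Loss for one sample: -log(probability assigned to the true class)."""
    return -math.log(probs[true_label])

# Made-up scores for a 3-class problem
probs = softmax([2.0, 1.0, 0.1])
loss_if_class_0 = sparse_categorical_crossentropy(0, probs)
loss_if_class_2 = sparse_categorical_crossentropy(2, probs)
print(loss_if_class_0 < loss_if_class_2)  # the likelier class yields the lower loss
```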

In [None]:
# Train the model:
train_model_mlp_1 = model_mlp_1.fit(
    train_images,
    train_labels,
    epochs=5,
    batch_size=10,
    validation_data=(test_images, test_labels)
)

In [None]:
curvas(train_model_mlp_1)
plot_accuracy_and_loss(train_model_mlp_1)

In [None]:
mejor_puntaje = max(train_model_mlp_1.history['val_accuracy'])
mejor_epoca = np.array(train_model_mlp_1.history['val_accuracy']).argmax()+1
print('Best test accuracy was %f at epoch %i' % (mejor_puntaje,mejor_epoca))

### MLP 2:

In [None]:
model_mlp_2 = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(200, activation='relu'),
    Dropout(0.1),
    Dense(200, activation='relu'),
    Dropout(0.1),
    Dense(200, activation='relu'),
    Dropout(0.1),
    Dense(100, activation='relu'),
    Dropout(0.1),
    Dense(50, activation='relu'),
    Dropout(0.1),
    Dense(len(CLOTHES), activation='softmax'),
])

model_mlp_2.compile(
    # Gradient-descent optimizer to use:
    optimizer='adam',
    # Loss function to use:
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy',],
)

model_mlp_2.summary()
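
The parameter counts reported by `summary()` can be reproduced by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, while `Flatten` and `Dropout` add none. A short sketch for the layer sizes of MLP 2 (`dense_params` is a hypothetical helper):

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a fully connected layer: weights + biases."""
    return n_in * n_out + n_out

# Flatten turns each 28x28 image into 784 inputs; the rest are the Dense widths of MLP 2
layer_sizes = [28 * 28, 200, 200, 200, 100, 50, 10]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)
print(sum(per_layer))
```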

In [None]:
# Train the model:
train_model_mlp_2 = model_mlp_2.fit(
    train_images,
    train_labels,
    epochs=20,
    batch_size=300,
    validation_data=(test_images, test_labels)
)
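
The `batch_size` argument determines how many weight updates happen per epoch: with 60,000 training images, MLP 1 (batch size 10) performs many more, smaller updates per epoch than MLP 2 (batch size 300). A quick arithmetic sketch (`steps_per_epoch` is a hypothetical helper):

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches (gradient updates) needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

print(steps_per_epoch(60000, 10))   # MLP 1 settings
print(steps_per_epoch(60000, 300))  # MLP 2 settings
```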

In [None]:
curvas(train_model_mlp_2)
plot_accuracy_and_loss(train_model_mlp_2)

In [None]:
mejor_puntaje = max(train_model_mlp_2.history['val_accuracy'])
mejor_epoca = np.array(train_model_mlp_2.history['val_accuracy']).argmax()+1
print('Best test accuracy was %f at epoch %i' % (mejor_puntaje,mejor_epoca))

Accuracy on the training set increases with the number of epochs. Accuracy on the test set also increases at first, but at some point it can start to decrease because of overfitting. We begin to see this at epoch 16, where test accuracy reached 89% and then starts to decline.
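
One common way to act on this observation is early stopping: stop training once validation accuracy has not improved for a given number of epochs (the patience) and keep the best epoch. Keras provides this as the `EarlyStopping` callback; the sketch below only illustrates the rule in plain Python on a made-up accuracy curve (the values in `val_acc` are hypothetical, not results from this notebook):

```python
def early_stopping_epoch(val_accuracy, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops at the first epoch where validation accuracy has failed to
    improve on the best value seen so far for `patience` consecutive epochs.
    """
    best_value = float("-inf")
    best_epoch = 0
    for epoch, value in enumerate(val_accuracy, start=1):
        if value > best_value:
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_accuracy)

# Hypothetical validation-accuracy curve that peaks and then declines
val_acc = [0.80, 0.84, 0.86, 0.88, 0.87, 0.87, 0.86, 0.85]
print(early_stopping_epoch(val_acc, patience=3))
```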

### CNN 1:

In [None]:
model_cnn_1 = Sequential([
    Convolution2D(32, kernel_size=(3, 3),
                 activation='relu',
                 kernel_initializer='he_normal',
                 input_shape=(28, 28, 1)), # input_shape=(image rows, image cols, channels)
    MaxPooling2D((2, 2)),
    Convolution2D(64, kernel_size=(3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Convolution2D(128, (3, 3), activation='relu'),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(len(CLOTHES), activation='softmax'),
])

model_cnn_1.compile(
              optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy', ])

model_cnn_1.summary()
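
The output shapes in the summary follow from simple arithmetic: a 3x3 convolution with the default `'valid'` padding and stride 1 shrinks each spatial dimension by 2, and a 2x2 max pooling halves it (rounding down). A sketch tracing CNN 1's feature-map sizes and parameter counts (the helper names are hypothetical):

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool

def conv_params(in_channels, filters, kernel=3):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

size = 28
size = pool_out(conv_out(size))   # Conv(32) -> 26, MaxPool -> 13
size = pool_out(conv_out(size))   # Conv(64) -> 11, MaxPool -> 5
size = conv_out(size)             # Conv(128) -> 3
flattened = size * size * 128     # inputs to the first Dense layer

total_params = (
    conv_params(1, 32) + conv_params(32, 64) + conv_params(64, 128)
    + (flattened * 128 + 128)     # Dense(128)
    + (128 * 10 + 10)             # Dense(10) output layer
)
print(size, flattened, total_params)
```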

In [None]:
plot_model(model_cnn_1, to_file='model.png')
SVG(model_to_dot(model_cnn_1, dpi=65).create(prog='dot', format='svg'))

In [None]:
train_model_cnn_1 = model_cnn_1.fit(
                train_images,
                train_labels,
                epochs=40,
                batch_size=300,
                verbose=1,
                validation_data=(test_images, test_labels),
                )

In [None]:
score = model_cnn_1.evaluate(test_images, test_labels, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

In [None]:
plot_accuracy_and_loss(train_model_cnn_1)

### CNN 2: Adding Dropout layers

In [None]:
model_cnn_2 = Sequential([
    Convolution2D(32, kernel_size=(3, 3),
                 activation='relu',
                 kernel_initializer='he_normal',
                 input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Convolution2D(64, kernel_size=(3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Convolution2D(128, (3, 3), activation='relu'),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.25),
    Dense(len(CLOTHES), activation='softmax'),
])

model_cnn_2.compile(
              optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy', ])

model_cnn_2.summary()
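
`Dropout(0.25)` randomly zeroes 25% of its inputs during training and scales the remaining ones by 1 / (1 - 0.25) so that the expected activation is unchanged; at inference time it passes inputs through untouched. A minimal NumPy sketch of this "inverted dropout" rule (an illustration, not Keras's implementation):

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate   # True with probability 1 - rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
activations = np.ones((1000, 100))

train_out = dropout(activations, rate=0.25, training=True, rng=rng)
test_out = dropout(activations, rate=0.25, training=False, rng=rng)

print(round(float((train_out == 0).mean()), 2))  # fraction of dropped units
print(round(float(train_out.mean()), 2))         # mean stays close to 1
```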

In [None]:
plot_model(model_cnn_2, to_file='model.png')
SVG(model_to_dot(model_cnn_2, dpi=65).create(prog='dot', format='svg'))

In [None]:
train_model_cnn_2 = model_cnn_2.fit(
                train_images,
                train_labels,
                epochs=40,
                batch_size=300,
                verbose=1,
                validation_data=(test_images, test_labels),
                )

In [None]:
plot_accuracy_and_loss(train_model_cnn_2)

In [None]:
score = model_cnn_2.evaluate(test_images, test_labels, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

## Predictions with CNN 2

In [None]:
# predicted values
y_pred_enc = model_cnn_2.predict(test_images)

# decoding predicted values
y_pred = [np.argmax(i) for i in y_pred_enc]

print(y_pred_enc[0])
print(y_pred[0])
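
Each row of `y_pred_enc` is a vector of 10 class probabilities, and the predicted class is the index of the largest one. The same decoding can be done for the whole batch at once with `np.argmax(..., axis=1)`. A sketch on a small made-up probability matrix (3 classes for brevity):

```python
import numpy as np

# Made-up softmax outputs for 3 samples and 3 classes (each row sums to 1)
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
])

# Row-by-row decoding with a list comprehension
decoded_loop = [np.argmax(row) for row in probs]

# Vectorized equivalent: argmax along the class axis
decoded_vec = np.argmax(probs, axis=1)

print(decoded_vec)
```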

In [None]:
# predicted class (true class in parentheses) for the first 8 test images
fig = plt.figure(figsize=(18, 8))
fig.suptitle('Predicted values', fontsize=24)
for ind, row in enumerate(test_images[:8]):
    plt.subplot(2, 4, ind+1)
    # y_pred and test_labels already hold class indices, so compare them directly
    is_correct = y_pred[ind] == test_labels[ind]
    plt.title("{} ({})".format(CLOTHES[y_pred[ind]],
                               CLOTHES[test_labels[ind]]),
              color=("green" if is_correct else "red"))
    img = row.reshape(28, 28)
    plt.axis('off')
    plt.imshow(img, cmap='cividis')

In [None]:
print(classification_report(test_labels, y_pred))
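
The per-class figures in the report derive from counts of correct and incorrect predictions: precision for a class is the fraction of predictions of that class that were correct, and recall is the fraction of true samples of that class that were found. A small sketch on made-up labels (the `y_true` and `y_hat` lists are hypothetical, and `precision_recall` is an illustrative helper rather than scikit-learn's implementation):

```python
def precision_recall(y_true, y_hat, cls):
    """Precision and recall for one class, from paired true/predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_hat) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_hat) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_hat) if t == cls and p != cls)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

print(precision_recall(y_true, y_hat, cls=1))
```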