# Labeling Zalando pics - 2nd round

## Miguel Ángel Canela, IESE Business School

******

### Introduction

This example continues with the development of neural network classifiers for the Zalando images. In the data set, every row stands for an image.  

### Importing the data

The data come in seven parts, which I import one by one and put together with the Pandas function `concat`.

In [1]:
import pandas as pd
# folder = 'https://raw.githubusercontent.com/mcanela-iese/ML_Course/master/Data/'
folder = ''
df1 = pd.read_csv(folder + 'zalando1.csv')
df2 = pd.read_csv(folder + 'zalando2.csv')
df3 = pd.read_csv(folder + 'zalando3.csv')
df4 = pd.read_csv(folder + 'zalando4.csv')
df5 = pd.read_csv(folder + 'zalando5.csv')
df6 = pd.read_csv(folder + 'zalando6.csv')
df7 = pd.read_csv(folder + 'zalando7.csv')
df = pd.concat([df1, df2, df3, df4, df5, df6, df7], axis=0)
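The seven `read_csv` calls can also be written as a loop. The sketch below is a hedged variant, not the code used in this notebook: it builds small in-memory stand-ins for the CSV parts (the `parts` list and the column names `label`, `pixel1`, `pixel2` are made up for illustration) so that it runs without the data files. It also adds `ignore_index=True`, on the assumption that a clean 0 to n-1 row index is preferred over the repeated per-file indexes that `concat` keeps by default.

```python
import pandas as pd

# Hypothetical stand-ins for the seven CSV parts: 2 rows each,
# a 'label' column followed by two pixel columns (names are assumptions).
parts = [
    pd.DataFrame({'label': [k, k], 'pixel1': [0, 255], 'pixel2': [128, 64]})
    for k in range(7)
]

# With the real files, the list could be built as
# [pd.read_csv(folder + f'zalando{k}.csv') for k in range(1, 8)]

# Default behaviour: each part keeps its own 0, 1 row labels.
df_dup = pd.concat(parts, axis=0)

# ignore_index=True renumbers the rows from 0 to n-1.
df_clean = pd.concat(parts, axis=0, ignore_index=True)

print(df_dup.shape, df_dup.index.is_unique)
print(df_clean.shape, df_clean.index.is_unique)
```

Since the rest of the notebook only uses `.iloc` and `.values`, the duplicated index does no harm here; it matters only when rows are selected by label.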

Next, I split the data set into a **features matrix** and a **target vector**, normalizing the features in the usual way.

In [2]:
import numpy as np
X = df.iloc[:, 1:].values
X = X/255
y = df.iloc[:, 0].values
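As a small illustration of the normalization step, assuming the pixel intensities are stored as 8-bit integers from 0 to 255, dividing by 255 maps every value into the interval [0, 1]. The array below is made up; it is not taken from the Zalando data.

```python
import numpy as np

# Made-up 8-bit pixel intensities standing in for two image rows.
pixels = np.array([[0, 51, 255], [102, 204, 153]], dtype=np.uint8)

# Dividing by the maximum intensity (255) maps every value into [0, 1];
# NumPy promotes the integer array to float64 in the division.
scaled = pixels / 255

print(scaled.dtype, scaled.min(), scaled.max())
```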

I split the data with `train_test_split`, keeping 10,000 images for testing.  

In [3]:
from sklearn import model_selection
X_train, X_test, y_train, y_test = \
    model_selection.train_test_split(X, y, test_size=1/7)
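The value `test_size=1/7` gives 10,000 test images because the full data set has 70,000 rows. As a rough illustration of what a shuffled split does (a simplified stand-in, not scikit-learn's implementation), the sketch below permutes the row indices with NumPy and holds out one seventh of them. The toy sizes, the seed and the names `X_toy`, `y_toy` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed, for reproducibility

n = 70                            # toy stand-in for the 70,000 images
X_toy = np.arange(n * 4).reshape(n, 4)
y_toy = np.arange(n) % 10

n_test = n // 7                   # one seventh held out: 10 of 70
idx = rng.permutation(n)          # shuffled row indices
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_tr, X_te = X_toy[train_idx], X_toy[test_idx]
y_tr, y_te = y_toy[train_idx], y_toy[test_idx]

print(X_tr.shape, X_te.shape)
```

In `train_test_split` itself, the argument `random_state` makes the split reproducible, and `stratify=y` keeps the class proportions equal in both parts.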

### Our former model in Keras

The specification of neural network architectures in deep learning implementations is necessarily more complex than in libraries which are limited to MLP networks, due to the extra possibilities offered. So far, Keras seems to offer the simplest approach.

In my first example, I come back to the MLP architecture which I already tried in scikit-learn, consisting of a single hidden layer with 32 nodes. I import the three modules that I actually need, `utils`, `models` and `layers`:

In [4]:
from keras import utils, models, layers

Using TensorFlow backend.


To specify a classifier in Keras, the target vector has to be transformed into a matrix in which each column is a dummy associated with one of the target values. This can be done in many ways, using Pandas or scikit-learn. In Keras itself, it is done with the function `to_categorical`, from `utils`.

In [5]:
y_train = utils.to_categorical(y_train, dtype=int)
y_test = utils.to_categorical(y_test, dtype=int)
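To see what this transformation produces, here is a minimal NumPy sketch of the same dummy coding. It is an illustration under the assumption that the labels are integers from 0 to 9, not the Keras implementation, and the toy labels are made up.

```python
import numpy as np

y_toy = np.array([0, 3, 9, 3])     # made-up labels in 0..9
n_classes = 10

# Row k of the identity matrix is the dummy vector for class k,
# so indexing the identity with the labels one-hot encodes them.
Y_toy = np.eye(n_classes, dtype=int)[y_toy]

print(Y_toy.shape)
print(Y_toy[1])

# argmax along the columns recovers the original labels.
print(np.argmax(Y_toy, axis=1))
```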

I use the class `Sequential` of the module `models` to specify the network architecture. I start by initializing the class, with the default specification.

In [6]:
tfclf1 = models.Sequential()

In an ordinary MLP architecture, the layers added are specified with classes from the module `layers`. In this first model, the hidden layer and the output layer are **dense**, that is, every node is connected to all nodes of the preceding layer.

In the first layer that I create, which will be the hidden layer, I have to specify the shape of the input tensor, which is given by the number of features. In the subsequent layers, the input shape is determined by the number of nodes of the preceding layer. In Keras, the different types of layers have different activation defaults (in a dense layer, `activation=None`), so it is always safer to specify the activation function.

In [7]:
tfclf1.add(layers.Dense(32, input_shape=(784, ), activation='relu'))
tfclf1.add(layers.Dense(10, activation='softmax'))
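What these two layers compute can be sketched in plain NumPy. The weights below are random placeholders with the shapes implied by the architecture (784 inputs, 32 hidden nodes, 10 outputs); the code illustrates the forward pass and the parameter count, it is not how Keras initializes or trains the network.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    # Subtract the row maximum for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Random placeholder weights with the shapes of the two Dense layers.
W1, b1 = rng.normal(scale=0.05, size=(784, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.05, size=(32, 10)), np.zeros(10)

X_toy = rng.random((5, 784))          # five fake normalized images
hidden = relu(X_toy @ W1 + b1)        # shape (5, 32)
probs = softmax(hidden @ W2 + b2)     # shape (5, 10), rows sum to 1

# Weights plus biases: 784*32 + 32 + 32*10 + 10
n_params = W1.size + b1.size + W2.size + b2.size
print(hidden.shape, probs.shape, n_params)
```

The count of 25,450 parameters is what `tfclf1.summary()` should report for this model.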

Once the network architecture is completely specified, the model is compiled, with the method `compile`.

In [8]:
tfclf1.compile(optimizer='rmsprop', loss='categorical_crossentropy',
    metrics=['accuracy'])
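The loss and the metric chosen here can be written out in a few lines of NumPy. This is a hedged sketch of the usual definitions (mean negative log-probability of the true class, and share of correct predictions); the clipping constant `eps` is an assumption, and the Keras implementation may differ in such details. The targets and probabilities are made up.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean over samples of -log(probability given to the true class).
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

def accuracy(y_true, y_prob):
    # Share of samples whose most probable class is the true one.
    return float(np.mean(y_true.argmax(axis=1) == y_prob.argmax(axis=1)))

# Made-up one-hot targets and predicted probabilities (3 classes).
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.5, 0.3, 0.2]])

loss = categorical_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(loss, acc)
```

The third sample is misclassified, so the accuracy is 2/3, and that sample contributes the largest term to the loss.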

Now, we apply the method `fit`, which is just a bit more complex than in scikit-learn. The **number of epochs** is typically low, to prevent overfitting. The default is `epochs=1`. The **batch size** is typically set as a power of 2. The default is 32, although 64 is more popular. Note that these defaults are different from those of scikit-learn.
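To make the roles of the two arguments concrete, the arithmetic below counts the weight updates for the values used in this example (60,000 training images, batches of 64, 10 epochs). It assumes that the final, smaller batch of each epoch is processed rather than dropped.

```python
import math

n_train = 60000      # training images after the split
batch_size = 64
epochs = 10

# The last batch of an epoch is smaller when the batch size does not
# divide the number of samples, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_updates)
```

So each epoch consists of 938 updates, the last one based on 32 images, and the whole run performs 9,380 updates.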

In [9]:
tfclf1.fit(X_train, y_train, epochs=10, batch_size=64)




Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x17f72d610>

Finally, I evaluate the model on the test set. The method `evaluate` returns the loss and the accuracy. Note that, even with a very low number of epochs, we can overfit the data.

In [10]:
tfclf1.evaluate(X_test, y_test)



[0.3755474271059036, 0.8684999942779541]

### Convolutional network 

**2D convolutional neural networks** are the state of the art in image classification. They combine several types of layers, including dense layers, and are organized in two parts. In the first part, the input tensors, called here **feature maps**, are three-dimensional, with two spatial axes (`height` and `width`) as well as a `channels` axis. For grayscale pictures like those of this example, there is only one channel.

So, I start by reshaping the data so that we can feed the network with tensors of shape (28, 28, 1): 60,000 for training and 10,000 for testing.

In [11]:
X_train, X_test = X_train.reshape(60000, 28, 28, 1), X_test.reshape(10000, 28, 28, 1)
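The reshape only changes how the 784 values of each row are indexed; it does not move them. The sketch below checks this on fake data, assuming that each row stores the image row by row (row-major order). Using `-1` for the first dimension is an optional variant that lets NumPy infer the number of images instead of hard-coding it.

```python
import numpy as np

# Three fake flattened images of 784 values each.
flat = np.arange(3 * 784).reshape(3, 784)

# -1 lets NumPy infer the number of images from the array size.
images = flat.reshape(-1, 28, 28, 1)

print(images.shape)
# Row-major order: pixel (r, c) of image i is flat[i, 28 * r + c].
print(images[1, 2, 5, 0] == flat[1, 28 * 2 + 5])
```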

I initialize the class `Sequential`, in order to specify the network architecture.

In [12]:
tfclf2 = models.Sequential()

The first part of the network is a stack of alternating `Conv2D` and `MaxPooling2D` layers.

In [13]:
tfclf2.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
tfclf2.add(layers.MaxPooling2D((2, 2)))
tfclf2.add(layers.Conv2D(64, (3, 3), activation='relu'))
tfclf2.add(layers.MaxPooling2D((2, 2)))
tfclf2.add(layers.Conv2D(64, (3, 3), activation='relu'))




The method `summary` produces a report showing the shape of the output tensors and the number of parameters for each layer.

In [14]:
tfclf2.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________


Let us follow the track. The input of the first convolutional layer has shape (28, 28, 1). The convolution works in 3 x 3 patches, so we lose one unit on each border. The 28 x 28 grid is transformed into a 26 x 26 grid by each of the 32 filters, which means that the output tensor has shape (26, 26, 32).
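The loss of one unit on each border, and the parameter counts in the summary, can be checked with a short sketch. The function `conv2d_valid` is a naive single-channel illustration written for this purpose (no padding, stride 1), and `conv2d_params` is a hypothetical helper; neither is the Keras implementation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Naive single-channel convolution without padding (strictly speaking,
    # a cross-correlation, as in deep learning libraries): the kernel only
    # visits positions where it fits entirely inside the image.
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv2d_params(kernel_size, in_channels, filters):
    # One kernel_size x kernel_size x in_channels kernel plus one bias per filter.
    return (kernel_size * kernel_size * in_channels + 1) * filters

rng = np.random.default_rng(0)
feature_map = conv2d_valid(rng.random((28, 28)), rng.random((3, 3)))

print(feature_map.shape)
print(conv2d_params(3, 1, 32), conv2d_params(3, 32, 64), conv2d_params(3, 64, 64))
```

A 28 x 28 input gives a 26 x 26 output, and the three counts (320, 18,496 and 36,928) match the `Param #` column for the three `Conv2D` layers.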

A `MaxPooling2D` layer slides a window, typically a 2 x 2 window, over each feature map and outputs the maximum value in each window. This downsamples the feature maps, which reduces the number of values to process in the following layers, and induces a spatial-filter hierarchy. In the above summary, we see that, in the first `MaxPooling2D` layer, the 26 x 26 grids of the input tensor are transformed into 13 x 13 grids.
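A 2 x 2 max pooling on a single feature map can be sketched with a reshape. The function `max_pool_2x2` is a hypothetical helper for illustration; it assumes non-overlapping windows and drops an odd trailing row or column, which is consistent with the 11 x 11 grids becoming 5 x 5 in the summary.

```python
import numpy as np

def max_pool_2x2(fmap):
    # Non-overlapping 2 x 2 windows; an odd trailing row or column is dropped.
    h, w = fmap.shape[0] // 2, fmap.shape[1] // 2
    trimmed = fmap[:2 * h, :2 * w]
    # Axes 1 and 3 index the position inside each window.
    return trimmed.reshape(h, 2, w, 2).max(axis=(1, 3))

toy = np.array([[1, 2, 0, 1],
                [3, 4, 1, 0],
                [0, 0, 5, 6],
                [1, 0, 7, 8]])

pooled = max_pool_2x2(toy)
print(pooled)
print(max_pool_2x2(np.zeros((26, 26))).shape, max_pool_2x2(np.zeros((11, 11))).shape)
```

Each output value is the maximum of one 2 x 2 block of the input, so the toy 4 x 4 map is reduced to a 2 x 2 map.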

The last layer of this stack outputs a tensor of shape (3, 3, 64). The second part of the network is a stack of `Dense` layers. Since these layers process vectors, which are 1D, I have to flatten the 3D inputs to 1D. This is done with a `Flatten` layer. There is no calculation in this layer, just a reshape.

In this case, I have added a single `Dense` layer of 64 nodes before the output layer, which is the same as in the MLP architecture.

In [15]:
tfclf2.add(layers.Flatten())
tfclf2.add(layers.Dense(64, activation='relu'))
tfclf2.add(layers.Dense(10, activation='softmax'))
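The size of the flattened vector and the number of parameters added by the two `Dense` layers follow from simple arithmetic, sketched below. The helper `dense_params` is hypothetical; the convolutional counts are those reported in the summary above.

```python
def dense_params(n_inputs, n_units):
    # One weight per input-unit pair plus one bias per unit.
    return (n_inputs + 1) * n_units

flat_size = 3 * 3 * 64                    # Flatten: (3, 3, 64) -> 576
conv_total = 320 + 18496 + 36928          # convolutional part: 55,744

hidden_params = dense_params(flat_size, 64)
output_params = dense_params(64, 10)
total = conv_total + hidden_params + output_params

print(flat_size, hidden_params, output_params, total)
```

Under these formulas, the hidden dense layer adds 36,928 parameters and the output layer 650, for a total of 93,322 in the complete network.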

In [16]:
tfclf2.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)               

The rest (compilation, fitting and evaluation) is as in the dense network. Five epochs are typically enough, given that these algorithms are prone to overfitting, as we have seen in our first model, with a single hidden layer.

In [17]:
tfclf2.compile(optimizer='rmsprop', loss='categorical_crossentropy',
    metrics=['accuracy'])
tfclf2.fit(X_train, y_train, epochs=5, batch_size=64)
tfclf2.evaluate(X_test, y_test)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


[0.2653323025226593, 0.9072999954223633]