[View in Colaboratory](https://colab.research.google.com/github/LiliDumelle/Markdown-for-Manuscripts/blob/master/WiMLDS_Women_Who_Code_Deep_Learning_%22Hello_World%22.ipynb)

# Demystifying Deep Learning Tutorial
## WiMLDS x Women Who Code 
## July 25, 2018
### Lisa Nash & Jane Zanzig 


The purpose of this tutorial is to give a "hello world" example of building a neural network. We made the tutorial using Colaboratory (which is basically Google Drive for Jupyter notebooks), but all of the code can be run in Python on your own machine.

In [0]:
!pip install keras

### Import packages. 

`keras` is a deep learning library in Python that has a lot of pre-loaded models, so it allows for easy and fast prototyping of neural nets. It supports both *convolutional networks*, which are good at identifying _shapes_ and are more commonly used for images, and *recurrent networks*, which are good at identifying _sequences_ and more commonly used for speech and audio, as well as combinations of the two.

`numpy` is a scientific computing package that allows us to do mathematical computations efficiently. For example, you can use `numpy` to multiply matrices or generate (pseudo-)random numbers. 
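As a quick illustration of the two `numpy` uses mentioned above, here is a minimal sketch. The matrices `A` and `B` and the seed value are made up for the example and are not part of the tutorial's data.

```python
import numpy as np

# Multiply a 2x3 matrix by a 3x2 matrix; the result is 2x2.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])
product = A @ B
print(product)

# Draw three pseudo-random numbers in [0, 1).
# Seeding the generator makes the draw repeatable.
rng = np.random.default_rng(seed=0)
samples = rng.random(3)
print(samples)
```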


In [0]:
import numpy as np

# Import all the model layers we want to use from keras
# If you want to extend the model, you may need to add more here
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.utils import np_utils
from keras.datasets import mnist

### Load data.
Here we use a toy dataset, the MNIST dataset of handwritten digits. We chose it because a common application of image recognition is taking a picture of something handwritten and deducing what letter or number it is. People write characters in different ways, but they all follow some common patterns.

`keras` comes with this dataset already split into training and test sets. If you wanted to replace this with your own data, you'd just need an `X_train`, `y_train`, an `X_test`, and a `y_test`, where `X` contains your features (in this case, images) and `y` contains your labels.

In [0]:
# Load pre-shuffled MNIST data into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In this case, `X` contains grayscale images of handwritten digits. Each image is 28 x 28 pixels, so each input is a 28 x 28 matrix where the value in each cell is the intensity of that pixel (from 0 to 255).

In [0]:
print(X_train.shape)
# (60000, 28, 28)

We have 60,000 labeled digits in our training set. We can take a look at them. This is a good step to take as a "sanity check." 

### Look at data.

In [0]:
from matplotlib import pyplot as plt
plt.imshow(X_train[5])

### Reshape data.

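We need to quickly reshape our data to add a dimension for the depth of the input image. In other words, we want to transform our dataset from having shape (`n, width, height`) to (`n, depth, width, height`). Our MNIST images only have a depth of 1 (intensity); if we had color images, for instance, our depth might be 3, one for each of the red, green, and blue channels. Keras calls this depth-first layout `channels_first`; the alternative, `channels_last`, puts the depth at the end, giving (`n, width, height, depth`). Which one your installation expects depends on its configuration.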
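To make the two layouts concrete, here is a small sketch using a stand-in array of zeros instead of the real MNIST data (the array `images` and its size of 5 are invented for illustration). In Keras 2.x, `keras.backend.image_data_format()` should report which layout your installation is configured to use.

```python
import numpy as np

# Stand-in for a batch of 5 grayscale 28x28 images (hypothetical data, not MNIST)
images = np.zeros((5, 28, 28), dtype=np.uint8)

# Depth directly after the sample dimension: (n, depth, width, height)
channels_first = images.reshape(images.shape[0], 1, 28, 28)

# Depth as the last dimension: (n, width, height, depth)
channels_last = images.reshape(images.shape[0], 28, 28, 1)

print(channels_first.shape)
print(channels_last.shape)
```

Reshaping only changes how the same 5 x 28 x 28 values are indexed; no pixel values are altered.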

In [0]:
X_train = X_train.reshape(X_train.shape[0], 1, 28, 28)
X_test = X_test.reshape(X_test.shape[0], 1, 28, 28)

In [0]:
print(X_train.shape)

All good! 

In [0]:
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

This is just putting our inputs in the right format. We need to cast them as floats, or decimals, because they are stored as 8-bit integers, which can't hold the fractional values we are about to create. Then we divide them by 255 to normalize them to between 0 and 1. This kind of scaling is a common step when modeling or computing features, although some libraries and modeling tools will do it for you automatically.
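Here is the same cast-then-divide step on a tiny made-up array, so you can see the effect on individual values. The three pixel values are chosen for illustration only.

```python
import numpy as np

# Three example pixel intensities stored as 8-bit integers, like raw MNIST pixels
pixels = np.array([0, 128, 255], dtype=np.uint8)

# Cast to float first, then scale into the range [0, 1]
scaled = pixels.astype('float32') / 255

print(scaled.dtype)
print(scaled.min(), scaled.max())
```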

In [0]:
print(y_train.shape)

We have 60,000 images in our training set, and each one is labeled (a digit between 0 and 9). Currently the labels are a 60,000-long vector. Instead, we want a matrix with one row per image and 10 columns, with a 1 in the column of the label and 0 everywhere else. This matches the output of the model, which will produce one score per digit. You might have heard this called *one-hot encoding*.
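One-hot encoding is easy to reproduce by hand. The sketch below uses plain `numpy` rather than the `keras` helper, and the three labels are made up for the example.

```python
import numpy as np

labels = np.array([5, 0, 4])   # hypothetical labels for three images
num_classes = 10

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing the identity matrix by the labels encodes all of them at once.
one_hot = np.eye(num_classes, dtype='float32')[labels]

print(one_hot.shape)
print(one_hot[0])
```

Taking `one_hot.argmax(axis=1)` recovers the original labels, which is a handy way to check the encoding.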


In [0]:
# Convert 1-dimensional class arrays to 10-dimensional class matrices
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)

In [0]:
print(Y_train.shape)

### Build a model! 

In [0]:
model = Sequential()

The `Sequential` model is a pre-packaged `keras` model, and comprises a linear stack of layers. If you want to get fancier, you can use the [keras Model API](https://keras.io/models/model/), but that's beyond the scope of this tutorial!

Note that we create the model object without referencing our data at all at this point.

In [0]:
model.add(Conv2D(filters=32, 
                 kernel_size=1, 
                 activation='relu', 
                 input_shape=(1,28,28)))

Since this is the first layer in the model, we provide an `input_shape` argument. We don't need to do this for the following layers, as subsequent layers will infer the input shape from the output of the previous layer. (In `input_shape`, you can also put `None` for a dimension whose length can vary.)

`Conv2D` is a 2-dimensional convolutional layer. Convolution means sliding a small window of weights (the *kernel*) across the input and, at each position, computing a weighted sum of the values under the window. Depending on the weights, the same operation can be used for edge detection, smoothing, etc. The network learns the weights during training, so it decides for itself which patterns to look for. Convolution also makes sense when you're trying to model things that happen in time. The result is then passed through an activation function, here `relu`, which is short for rectified linear unit.

This layer creates a 2-dimensional convolution kernel. Here, the arguments we're using are `filters`, the number of kernels and therefore the depth of the output, and `kernel_size`, which is the width and height of the convolution window. The `activation` parameter specifies the activation function applied to the output; `relu` replaces negative values with 0 and leaves positive values unchanged.
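To see what a single convolution filter followed by `relu` does, here is a deliberately slow, loop-based sketch in `numpy`. It is not how `keras` implements `Conv2D`; the helper names `conv2d_valid` and `relu`, the tiny image, and the hand-picked kernel are all invented for illustration (in a real network the kernel weights are learned).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, step of 1) and take weighted sums."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype='float32')
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Replace negative values with 0 and keep positive values."""
    return np.maximum(x, 0)

# 4x4 image: dark left half, bright right half
image = np.array([[0, 0, 1, 1]] * 4, dtype='float32')

# Hand-picked kernel that responds to a dark-to-bright step from left to right
kernel = np.array([[-1, 1]], dtype='float32')

feature_map = relu(conv2d_valid(image, kernel))
print(feature_map)
```

The output is non-zero only in the column where the image changes from dark to bright, which is the sense in which a filter "detects" an edge.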

In [0]:
print(model.output_shape)

In [0]:
model.add(Conv2D(filters=32, 
                 kernel_size=1, 
                 activation='relu'))
print(model.output_shape)
model.add(MaxPooling2D(pool_size=1,
                      strides=1))
print(model.output_shape)

Here we've added two more layers: another 2D convolution layer and a `MaxPooling2D` layer. Pooling is a way of reducing the dimension by taking the maximum over a moving window. `pool_size` is the size of that window and `strides` is the step size between windows, which determines whether they overlap or leave gaps (what happens if `strides` is less than `pool_size`? Greater?). Note that with `pool_size=1` and `strides=1` each window holds a single value, so the output shape printed after this layer matches its input shape.
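The interaction between `pool_size` and `strides` is easiest to see on a small grid. The following is a loop-based sketch, not the `keras` implementation; the function name `max_pool_2d` and the 4 x 4 input are made up for the example.

```python
import numpy as np

def max_pool_2d(x, pool_size, strides):
    """Take the maximum over square windows of side `pool_size`, moved `strides` at a time."""
    out_h = (x.shape[0] - pool_size) // strides + 1
    out_w = (x.shape[1] - pool_size) // strides + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * strides, j * strides
            out[i, j] = x[r:r + pool_size, c:c + pool_size].max()
    return out

x = np.arange(16).reshape(4, 4)

same = max_pool_2d(x, pool_size=1, strides=1)         # windows of one value: nothing changes
halved = max_pool_2d(x, pool_size=2, strides=2)       # non-overlapping windows: 4x4 -> 2x2
overlapping = max_pool_2d(x, pool_size=2, strides=1)  # overlapping windows: 4x4 -> 3x3

print(same.shape, halved.shape, overlapping.shape)
print(halved)
```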

As with any feature engineering problem, arriving at the right values for these parameters takes a combination of common sense, understanding your problem, and trial and error. In general, though, you'll probably use larger values in earlier layers, where the dimensions are larger, since there is more to reduce and more computation to save.

In [0]:
model.add(Dropout(0.25))
print(model.output_shape)
model.add(Flatten())
print(model.output_shape)
model.add(Dense(128, activation='relu')) # this will have an output size of 128
print(model.output_shape)
model.add(Dropout(0.5))
print(model.output_shape)
model.add(Dense(10, activation='softmax'))
print(model.output_shape)

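Now we keep on adding layers. The `Dropout` layers randomly set a certain proportion of their inputs to 0 during training, which helps prevent overfitting. `Flatten` unrolls its multi-dimensional input into a single vector per example so that it can be fed into the `Dense` layers. A `Dense` layer is fully connected: every output is a weighted sum of all of its inputs. A dropout layer doesn't have any trainable parameters, whereas a dense layer has weights that it learns.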
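Here is a rough sketch of what dropout does to a vector of activations during training. The function name `dropout`, the all-ones input, and the seed are assumptions made for the example. The rescaling by `1 / (1 - rate)` keeps the expected sum of the activations unchanged, which is also how the `keras` documentation describes its `Dropout` layer.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero each entry with probability `rate`, then rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(seed=0)
activations = np.ones(1000, dtype='float32')   # hypothetical layer output

dropped = dropout(activations, rate=0.25, rng=rng)
zero_fraction = np.mean(dropped == 0)

# Roughly a quarter of the entries should now be zero
print(zero_fraction)
```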

Notice that the output of the last layer is 10, because there are 10 labels. The `softmax` activation function will take the vector of 10 values and make them all between 0 and 1 and add up to 1. 
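The `softmax` function itself is short enough to write out. This is a minimal `numpy` sketch with made-up scores; subtracting the maximum before exponentiating does not change the result and only guards against overflow.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into positive values that sum to 1."""
    shifted = z - np.max(z)   # for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw scores for the 10 digit classes
scores = np.array([1.0, 0.5, 3.0, 0.0, -1.0, 0.2, 0.1, 2.0, -0.5, 0.3])
probs = softmax(scores)

print(probs.round(3))
print(probs.sum())
```

The largest score still gets the largest probability, so the ranking of the classes is preserved.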

### Configure the learning process
To actually configure our model, we need all of the layers we've added to `model` so far, plus the following:
- *An optimizer.* This could be the string identifier of an existing optimizer (e.g. `'rmsprop'` or `'adagrad'`) or an optimizer instance (e.g. `keras.optimizers.SGD()`).
- *A loss function.* This is the objective that the model will try to minimize. It can be the string identifier of an existing loss function (e.g. `'categorical_crossentropy'` or `'mse'`) or a loss function itself (e.g. `keras.losses.mean_squared_error`).
- *A list of metrics.* A metric could be the string identifier of an existing metric (e.g. `'accuracy'`) or a metric function, possibly one that you defined (e.g. `keras.metrics.binary_crossentropy`).

In [0]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Here you specify the loss function, which measures how "good" your model is, and an optimizer, which decides how to adjust the weights to reduce that loss. Compare this to a linear regression, where you have a closed-form solution for the optimal coefficients, expressed as a function of the observed data. Here, you _don't_ have that luxury: you are searching over a very non-linear space, one small step at a time, with no guarantee that a step leads toward the best possible solution. We chose [adam](https://arxiv.org/abs/1412.6980v8), which we won't go into, but you can read the paper if you like. _Note: Sometimes optimizers also have parameters that you need to tune._
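To illustrate the contrast with linear regression, the sketch below fits the same small regression two ways: with the closed-form solution, and with plain gradient descent, which takes many small steps downhill on the loss. This is ordinary gradient descent, not adam, and the data, `true_w`, `learning_rate`, and step count are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: an intercept column plus one feature, with y = 1 + 2x exactly
X = np.column_stack([np.ones(50), rng.random(50)])
true_w = np.array([1.0, 2.0])
y = X @ true_w

# Closed-form least-squares solution (the normal equations)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared error
w = np.zeros(2)
learning_rate = 0.1
for _ in range(5000):
    gradient = 2 * X.T @ (X @ w - y) / len(y)
    w -= learning_rate * gradient

print(w_closed)
print(w)
```

For this simple problem both routes arrive at the same coefficients. A neural network has no equivalent of the first route, so an iterative optimizer is the only option.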


In [0]:
model.fit(X_train, Y_train,  
          batch_size=32,
          epochs=10, 
          verbose=1)

Now we fit the model to the data (note this is the first time we reference `X` or `Y` at all). The parameters that we're using here are:
- `epochs`, where one epoch = one forward pass and one backward pass of all the training examples
- `batch_size`, the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
- `verbose` just means that it displays the progress bar. This doesn't affect the model, just what you see and how you interact with it.
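The relationship between `epochs` and `batch_size` comes down to simple arithmetic. The numbers below are the ones used in this tutorial; the variable names are just for illustration.

```python
import math

n_train = 60000    # number of training images
batch_size = 32
epochs = 10

# Each batch produces one weight update, so one epoch takes this many updates
steps_per_epoch = math.ceil(n_train / batch_size)

# Total number of weight updates over the whole training run
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)
```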

You can see that the accuracy is increasing over time, but that the gains decrease over time, so a common "stopping condition" would be when the difference between the loss function in two consecutive epochs is sufficiently small.

It will print out the `loss`, or cost function, for each epoch. Ideally this decreases with each epoch, but you'll notice that the returns are diminishing. Here it also prints out the `accuracy` for each epoch, because we included that in the call to `model.compile()`.
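The stopping condition described above can be sketched as a small check on the list of per-epoch losses. The function name `should_stop`, the threshold `min_delta`, and the loss values are hypothetical. `keras` also ships an `EarlyStopping` callback in `keras.callbacks` that serves a similar purpose, if you'd rather not write your own.

```python
def should_stop(losses, min_delta=1e-3):
    """Return True when the last two epoch losses differ by less than `min_delta`."""
    if len(losses) < 2:
        return False
    return abs(losses[-2] - losses[-1]) < min_delta

# Hypothetical losses recorded after each of four epochs
losses = [0.30, 0.12, 0.08, 0.0795]

after_three_epochs = should_stop(losses[:3])   # loss is still dropping quickly
after_four_epochs = should_stop(losses)        # improvement is now tiny

print(after_three_epochs, after_four_epochs)
```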

In [0]:
score = model.evaluate(X_test, Y_test, verbose=1)
print(score)

Now we evaluate our model on unseen data, or a test set, to see how well it will generalize (remember overfitting?). The `evaluate()` method applies the model to `X_test`, compares the results to the true labels, `Y_test`, and calculates the loss as well as any of the metrics passed to `model.compile()` (in this case, `accuracy`).

If you just want the raw predictions, to do your own scoring or to investigate the results that you're getting in more depth, you can call `model.predict()`.

In [0]:
# Predict on the test set so that pred[i] lines up with X_test[i] below
pred = model.predict(X_test)


The `predict` method returns, for each input, a predicted probability for each of the 10 classes. Since this is a multi-class problem, you can take the most probable class as the prediction, or use the full set of probabilities, for example to flag digits the model is unsure about.
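Turning class probabilities into a single predicted label is a one-liner with `argmax`. The sketch below uses a small made-up probability matrix (three examples, four classes) rather than real model output.

```python
import numpy as np

# Hypothetical predicted probabilities: one row per example, one column per class
pred = np.array([[0.1, 0.7, 0.1, 0.1],
                 [0.6, 0.2, 0.1, 0.1],
                 [0.3, 0.4, 0.2, 0.1]])
true_labels = np.array([1, 0, 3])

# The predicted label is the class with the highest probability in each row
predicted_labels = pred.argmax(axis=1)

# Accuracy is the fraction of examples where the prediction matches the truth
accuracy = (predicted_labels == true_labels).mean()

print(predicted_labels)
print(accuracy)
```

In the third row the model's top probability is only 0.4, so it is both wrong and not very confident, which is the kind of case the full probabilities let you spot.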

Now we can look at how accurate our predictions were.

In [0]:
pred[6]

In [0]:
# Drop the depth dimension so that imshow receives a 28 x 28 image
plt.imshow(X_test[6].reshape(28, 28))

### Things to Try!
- Go back through and try different numbers of layers, different types of layers, different hyperparameters. What effect does it have on your results?
- Try this same exercise with another dataset! `keras` has [a few other pre-loaded datasets](https://keras.io/datasets/) for you to play with, or find something else you're interested in! 

### Acknowledgements & Resources

We used the tutorial at [Elite Data Science](https://elitedatascience.com/keras-tutorial-deep-learning-in-python) as a starting point for this tutorial. 

If you're interested in going deeper into neural networks, check out [CS231n: Convolutional Neural Networks for Visual Recognition](http://cs231n.stanford.edu/).

Thanks to our colleague Nat Steinsultz for alerting us to the presence of Colaboratory and the amazing Neocognitron video.
