

# Presentation of the notebook


The goal of this notebook is to explain the basics of deep learning through a specific example. The notebook is meant to be interactive and points readers to other websites where they can explore certain concepts in more depth if they wish.

To do so, we have chosen a dataset from the website http://www.andrewjanowczyk.com/deep-learning/ .
The use case is Invasive Ductal Carcinoma identification.

Invasive Ductal Carcinoma (IDC) is a cancer that begins growing in a milk duct and invades the fibrous or fatty tissue of the breast outside of the duct. It is the most common form of breast cancer and represents about 80% of all breast cancer diagnoses. For more information, you can check the link below: https://www.hopkinsmedicine.org/breast_center/breast_cancers_other_conditions/invasive_ductal_carcinoma.html

# Getting started

In this notebook we are going to build an image classifier from scratch. Every notebook starts with the following three lines: they ensure that any edits you make to libraries are reloaded automatically, and that any charts or images are displayed in this notebook.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

Then, we import all the necessary libraries. In this notebook, we will work with the fastai v1 library (http://www.fast.ai/2018/10/02/fastai-ai/): it provides many useful functions that enable us to quickly and easily build neural networks and train our models.

You can find more information about this library here: https://www.fast.ai 

Other libraries exist, such as Keras or Caffe. For more information on these libraries, you can check these pages: https://www.pyimagesearch.com/2017/12/11/image-classification-with-keras-and-deep-learning/ and http://caffe.berkeleyvision.org/ .

In [4]:
from fastai.vision import *
from fastai.metrics import error_rate
import numpy as np 
import pandas as pd 
import os

Running this notebook requires enough memory on your machine. If your GPU (Graphics Processing Unit) has little memory, you can rerun this notebook after reducing the batch size below.

For more information on the batch size, you can check this link: https://radiopaedia.org/articles/batch-size-machine-learning

In [5]:
bs = 64
# bs = 16  # uncomment this line and comment out the line above if your GPU runs out of memory
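
As a rough illustration of what the batch size controls, the sketch below computes how many batches one pass over the data contains. The dataset size of 10,000 images is a made-up figure, not the size of this dataset, and the helper name is hypothetical.

```python
import math

def batches_per_epoch(n_images, batch_size):
    """Number of batches needed to see every image once (the last batch may be smaller)."""
    return math.ceil(n_images / batch_size)

n_images = 10_000  # hypothetical dataset size, for illustration only
for batch_size in (64, 16):
    print(batch_size, batches_per_epoch(n_images, batch_size))
```

A smaller batch needs less GPU memory per step, at the cost of more steps per epoch.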

## Step 1: Looking at the data

The first step of this notebook is to understand the dataset we are working with. Taking a look at the data means understanding how the data directories are structured, what the labels are and what some sample images look like.

The dataset is organised in two folders: train and test. The train folder contains the labelled images used to train the model. Once the model has been trained, we can use the test folder to evaluate it on images it has never seen.

Keeping the two sets separate matters because a model evaluated on its own training images looks better than it really is: it may have memorised details of individual images instead of learning features that generalise. This problem, called overfitting, is also why you should not train the model for too long, as we will see later in this notebook.

In order to use the data, the notebook has to know where it is. Therefore, we should create a folder containing the notebook and the data. Then, using the Path class (from Python's pathlib module, made available by the fastai import above), we can point to the data.

In [6]:
path = Path('train');
path

PosixPath('train')

Then, we can make sure it works by using the ls() method, which fastai adds to Path objects. It lists all the files and folders contained in path.

In [7]:
path.ls()

FileNotFoundError: [Errno 2] No such file or directory: 'train'
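
The FileNotFoundError above means that no train folder exists relative to the directory the notebook runs in. A small check like the following, written with the standard pathlib module only, can make that failure easier to diagnose. The helper name list_class_folders is hypothetical.

```python
from pathlib import Path

def list_class_folders(data_dir):
    """Return the sorted names of the sub-folders of data_dir, or raise a clear error."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(
            f"'{data_dir}' not found; expected it relative to {Path.cwd()}"
        )
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())
```

Calling list_class_folders('train') before building the data either returns the folder names or reports where the folder was expected.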

Before anything, the dataset needs to be converted into a DataBunch object and in the case of images into an ImageDataBunch subclass. For more information, you can check: https://docs.fast.ai/vision.data.html

Image classification datasets differ mainly in the way their labels are stored. The function ImageDataBunch.from_folder() used below reads the label of each image from the name of the folder that contains it, so the images must be grouped into one sub-folder per class. In this particular dataset, the labels also appear in the filenames themselves, so we could have extracted them with a regular expression (re) instead, for example through ImageDataBunch.from_name_re() (for more information: https://docs.python.org/3.6/library/re.html).
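
For comparison, here is a minimal sketch of extracting a label from a filename with a regular expression. The filename and the class(\d) pattern are assumptions made for illustration; adapt them to the actual naming scheme of the files.

```python
import re

# Hypothetical filename; the real naming scheme of the dataset may differ.
filename = "10253_idx5_x1001_y1001_class0.png"

# Assumed pattern: the label is the digit right after "class" at the end of the name.
LABEL_PATTERN = re.compile(r"class(\d)\.png$")

def label_from_filename(name):
    """Return the class digit captured from the filename, or None if it does not match."""
    match = LABEL_PATTERN.search(name)
    return match.group(1) if match else None

print(label_from_filename(filename))
```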

We call np.random.seed() before building the data so that the pseudo-random number generator always produces the same sequence. Because valid_pct=0.2 selects the validation images at random, re-using the same seed value gives us the same training/validation split every time the notebook is run. For more information: https://pynative.com/python-random-seed/
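
The effect of a fixed seed can be sketched with the standard library alone. The snippet below illustrates a reproducible 80/20 split; it is not fastai's own splitting code, it uses Python's random module as a stand-in for the NumPy generator, and the filenames are made up.

```python
import random

def split_train_valid(items, valid_pct=0.2, seed=42):
    """Shuffle a copy of items with a seeded generator and hold out valid_pct of them."""
    rng = random.Random(seed)      # same seed -> same shuffle order
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]

files = [f"img_{i}.png" for i in range(10)]  # made-up filenames
train_a, valid_a = split_train_valid(files)
train_b, valid_b = split_train_valid(files)
print(valid_a == valid_b)  # both runs hold out the same images
```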

In [8]:
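np.random.seed(42)  # fix the seed so the random validation split is reproducible
data = ImageDataBunch.from_folder(
    path,
    train=".",                 # training images live directly under `path`
    test='../test',            # test folder, relative to `path`
    valid_pct=0.2,             # hold out 20% of the training images for validation
    ds_tfms=get_transforms(),  # default data augmentation
    size=224, bs=bs, num_workers=4  # pass the batch size defined above
).normalize(imagenet_stats)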

AssertionError: train is not a valid directory.
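
The .normalize(imagenet_stats) call in the cell above rescales each colour channel with the mean and standard deviation of the ImageNet dataset, on which the pretrained network was trained. A minimal sketch of that per-channel operation on a single RGB pixel (values in [0, 1]) follows; the function name is hypothetical.

```python
# Per-channel ImageNet statistics (RGB), as commonly used with pretrained models.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))

# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(IMAGENET_MEAN))
```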

Now we can take a proper look at the images with the show_batch() method.

In [9]:
data.show_batch(rows=3, figsize=(7,6))

NameError: name 'data' is not defined

## Step 2: Training

Now we will start training our model. We will use a [convolutional neural network](http://cs231n.github.io/convolutional-networks/) backbone and a fully connected head with a single hidden layer as a classifier.
We are building a model which will take images as input and will output the predicted probability for each of the categories.
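
To make "predicted probability for each of the categories" concrete: the final layer produces one raw score per class, and a softmax turns those scores into probabilities that sum to 1. The scores below are invented for illustration.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.5]  # invented scores for two classes
probs = softmax(scores)
print(probs, sum(probs))
```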

Fastai offers several models based on residual neural networks (ResNets), such as resnet18, resnet34 and resnet50.
Broadly speaking, the number corresponds to the number of layers the model uses to recognize images. What makes these networks "residual" is that they use skip connections: the input of a block is added to its output, so the signal can bypass certain layers.

For more information, you can check this article: https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
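
The idea of a residual (skip) connection can be sketched in a few lines: a block outputs its input plus a learned correction, F(x) + x, rather than F(x) alone. The toy layer below is a stand-in for illustration, not a real convolution.

```python
def residual_block(x, layer_fn):
    """Return layer_fn(x) + x, element-wise: the skip connection adds the input back."""
    fx = layer_fn(x)
    return [f + xi for f, xi in zip(fx, x)]

def toy_layer(x):
    """Stand-in for a learned transformation (here: scale every value by 0.1)."""
    return [0.1 * xi for xi in x]

print(residual_block([1.0, 2.0, 3.0], toy_layer))
```

If the layer learns to output zeros, the block simply passes its input through unchanged, which is part of why very deep residual networks remain trainable.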


We will train for 3 epochs (3 cycles through all our data).

In [11]:
learn = cnn_learner(data,models.resnet34,metrics=error_rate,model_dir='/tmp/models')

NameError: name 'data' is not defined
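
The error_rate metric passed to the learner is the fraction of validation images that are misclassified, that is, 1 minus the accuracy. A plain-Python sketch with invented labels (the name simple_error_rate is hypothetical, chosen to avoid shadowing the fastai import):

```python
def simple_error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

predicted = [0, 1, 1, 0, 1]  # invented predictions
actual    = [0, 1, 0, 0, 1]  # invented ground truth
print(simple_error_rate(predicted, actual))
```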

We now use the function lr_find(). It launches an LR range test that helps you select a good learning rate. A good learning rate is important because it controls how quickly or slowly a neural network model learns a problem.

For more information you can look at: https://docs.fast.ai/basic_train.html#lr_find
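
The principle of an LR range test can be sketched on a toy problem: take one gradient step per iteration while the learning rate grows geometrically, and record the loss. The quadratic loss and the start, end and iteration values below are assumptions chosen for illustration; this is not fastai's implementation.

```python
def lr_range_test(start_lr=1e-7, end_lr=10.0, num_it=100):
    """Toy LR range test on loss(w) = w**2: one gradient step per learning rate."""
    w = 1.0
    lrs, losses = [], []
    for i in range(num_it):
        # The learning rate grows geometrically from start_lr to end_lr.
        lr = start_lr * (end_lr / start_lr) ** (i / (num_it - 1))
        w -= lr * 2 * w            # the gradient of w**2 is 2*w
        lrs.append(lr)
        losses.append(w ** 2)
    return lrs, losses

lrs, losses = lr_range_test()
best = losses.index(min(losses))
print(f"lowest loss at lr={lrs[best]:.3g}; final loss={losses[-1]:.3g}")
```

In this toy run the loss falls while the learning rate is moderate and grows again once the steps become large enough to overshoot the minimum.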



In [12]:
learn.lr_find()
learn.recorder.plot()

NameError: name 'learn' is not defined

We can see that as the learning rate increases, the loss first decreases until it reaches a minimum, after which it begins to increase. The increase means that the learning rate has become too large: each weight update overshoots, so the training loss diverges instead of improving. Note that this is a different phenomenon from the overfitting described above, where the model takes specific elements of images for generalities.


Thus, it is necessary to select an LR before this critical value.

Here, the value is:   . 
Therefore, we can work in between .... and .... . 

In [13]:
learn.fit_one_cycle(3, max_lr=slice(1e-6,1e-4))

NameError: name 'learn' is not defined
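
fit_one_cycle follows a "one-cycle" schedule in which the learning rate first rises to a maximum and then decreases again. The sketch below shows one such schedule in plain Python; the cosine shape and the constants (pct_start, div_factor, the final divisor of 1e4) are assumptions for illustration and may differ from what fastai does internally.

```python
import math

def cosine_anneal(start, end, pct):
    """Move smoothly from start (pct=0) to end (pct=1) along half a cosine wave."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0):
    """Learning rate at a given step: warm up to max_lr, then anneal far below the start."""
    warmup_steps = int(total_steps * pct_start)
    low_lr = max_lr / div_factor
    if step < warmup_steps:
        return cosine_anneal(low_lr, max_lr, step / warmup_steps)
    pct = (step - warmup_steps) / (total_steps - warmup_steps)
    return cosine_anneal(max_lr, low_lr / 1e4, pct)

schedule = [one_cycle_lr(step, total_steps=100, max_lr=1e-4) for step in range(100)]
peak = schedule.index(max(schedule))
print(peak, schedule[0], schedule[peak], schedule[-1])
```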

At this stage, we can save our work before looking at the results. 


In [14]:
learn.save('stage-1')

NameError: name 'learn' is not defined

## Step 3: Looking at the results

Here, we are going to look at the results our model produces.


We will first see which categories the model most often confused with one another, and try to judge whether its predictions were reasonable. In this case the mistakes look reasonable (none of them seems obviously naive), which is an indicator that our classifier is working correctly.

Furthermore, when we plot the confusion matrix, we can see that the distribution is heavily skewed: the model makes the same mistakes over and over again but rarely confuses other categories. This suggests that it simply finds some specific categories difficult to tell apart; this is normal behaviour.

### Most confused categories

In order to see where the model was most confused, we will use the function plot_top_losses(): it displays the validation images with the highest loss, each with this legend: prediction/actual/loss/probability.

In [15]:
interp = ClassificationInterpretation.from_learner(learn)

losses,idxs = interp.top_losses()

len(data.valid_ds)==len(losses)==len(idxs)

NameError: name 'learn' is not defined

In [16]:
interp.plot_top_losses(9, figsize=(15,11))

NameError: name 'interp' is not defined
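
What "top losses" means can be sketched without fastai: for each image, the cross-entropy loss is the negative logarithm of the probability the model gave to the true class, and the images are then ranked by that loss. The probabilities below are invented, and the helper is a simplified stand-in for interp.top_losses().

```python
import math

def top_losses(prob_of_actual, k):
    """Return (loss, index) pairs for the k samples with the highest cross-entropy loss."""
    losses = [-math.log(p) for p in prob_of_actual]
    ranked = sorted(enumerate(losses), key=lambda pair: pair[1], reverse=True)
    return [(loss, idx) for idx, loss in ranked[:k]]

# Invented probabilities that the model assigned to the true class of three images.
prob_of_actual = [0.9, 0.2, 0.6]
for loss, idx in top_losses(prob_of_actual, k=2):
    print(idx, round(loss, 3))
```

The image to which the model gave the lowest probability for its true class comes first.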

### Most common mistakes

Here, we will use the confusion matrix to find where the model makes its most common mistakes. This will allow us to see whether its errors are ones that we, as humans, could make as well, or whether there is a problem with our model.


In [17]:
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

NameError: name 'interp' is not defined

In [18]:
interp.most_confused(min_val=2)

NameError: name 'interp' is not defined
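
The information behind a confusion matrix and a "most confused" list is simply a count of (actual, predicted) pairs. The sketch below reproduces that idea in plain Python with invented labels; it is a simplified stand-in, not fastai's implementation.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def most_confused(actual, predicted, min_val=1):
    """List (actual, predicted, count) for misclassifications seen at least min_val times."""
    counts = confusion_counts(actual, predicted)
    mistakes = [(a, p, n) for (a, p), n in counts.items() if a != p and n >= min_val]
    return sorted(mistakes, key=lambda item: item[2], reverse=True)

actual    = ["0", "0", "1", "1", "1", "0"]  # invented labels
predicted = ["0", "1", "0", "0", "1", "1"]  # invented predictions
print(most_confused(actual, predicted, min_val=2))
```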

Having visualized the results, we can see that our model is working as we expect it to. We can continue working on it in order to improve its performance.

## Step 4: Unfreezing and working on the model.

Here, we are going to unfreeze the model, so that all of its layers (not only the classifier head) can be trained, and reload the weights we previously saved. We will use the functions unfreeze() and load().

In [19]:
learn.unfreeze()

NameError: name 'learn' is not defined

In [20]:
learn.load('stage-1');

NameError: name 'learn' is not defined
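
Conceptually, freezing means that some layer groups are excluded from weight updates. The toy class below is a hypothetical illustration of that bookkeeping, not fastai's implementation; the group names are invented.

```python
class ToyLearner:
    """Hypothetical stand-in that only tracks which layer groups are trainable."""

    def __init__(self, layer_groups):
        self.layer_groups = list(layer_groups)
        self.trainable = {name: True for name in self.layer_groups}

    def freeze(self):
        """Train only the last layer group (the head)."""
        for name in self.layer_groups:
            self.trainable[name] = (name == self.layer_groups[-1])

    def unfreeze(self):
        """Make every layer group trainable."""
        for name in self.layer_groups:
            self.trainable[name] = True

learner = ToyLearner(["early_layers", "late_layers", "head"])  # invented group names
learner.freeze()
print(learner.trainable)
learner.unfreeze()
print(learner.trainable)
```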

Then, as we did previously, we are going to look for a suitable learning rate range.

In [21]:
learn.lr_find()
learn.recorder.plot()

NameError: name 'learn' is not defined

In [22]:
learn.unfreeze()
learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4))

NameError: name 'learn' is not defined
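
When max_lr is given as slice(start, end), the fastai documentation describes the first layer group as training with start, the last with end, and the groups in between with geometrically spaced values. The sketch below reproduces that spacing in plain Python; the function name and the assumption of three layer groups are illustrative.

```python
def spread_learning_rates(start, stop, n_groups):
    """Return n_groups learning rates geometrically spaced from start to stop."""
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]

# Assuming three layer groups, slice(1e-6, 1e-4) would correspond to roughly:
rates = spread_learning_rates(1e-6, 1e-4, 3)
print(rates)
```

The earliest layers, which hold the most general pretrained features, thus receive the smallest updates.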

The error rate has continued to decrease and we have a model that seems pretty accurate. 