$\color{red}{\mathrm{Dear\ Learner,}}$

$\color{red}{\mathrm{This\ is\ the\ archival\ version\  of\ the\ original\ notebook\ authored\ at\ NUS}}$

$\color{red}{\mathrm{This\ has\ been\ provided\ to\ you\ with\ programming\ answers\ for\ your\ individual\ reference\ and\ review }}$

$\color{red}{\mathrm{Please\ do\ not\ repost\ to\ any\ public\ sites\ so\ that\ future\ students\ can\ avoid\ gratuitous\ copying\ and\ can\ benefit\ from\ the\ challenge\ of\ learning}}$

Available at http://www.comp.nus.edu.sg/~cs3244/1910/09.colab

![Machine Learning](https://www.comp.nus.edu.sg/~cs3244/1910/img/banner-1910.png)
---
See **Credits** below for acknowledgements and rights.  For NUS class credit, you'll need to do the corresponding _Assessment_ in [CS3244 in Coursemology](http://coursemology.org/courses/1677) by the respective deadline (as in Coursemology). 

**You must acknowledge that your submitted Assessment is your independent work, see questions in the Assessment at the end.**


**Learning Outcomes for this Week** 

After watching the videos and completing the exercises for this week, you should be able to:

* Explain how CNNs and RNNs work;
* Implement basic forms of CNNs and RNNs;
* Explain the idea of embeddings and how they can be used to improve performance;
* Apply embeddings to sentiment analysis data;
* Tune hyperparameters of a CNN;



_Welcome to the Week $09$ Python notebook._ This week we will learn about **Deep Learning**.  We introduce several _Deep Learning_ architectures in the lecture videos, and will be reviewing this material in the seventh tutorial.

In this notebook, we will go through the concepts of Convolutional Neural Networks, Recurrent Neural Networks, and Embeddings. We will also cover a few programming exercises in both the pre-tutorial and post-tutorial sections of the notebook. We will be using `PyTorch` to implement these architectures with some interesting datasets.

---
# Week 09: Pre-Tutorial Work 

Watch the videos for Week 9 Pre. This week, we will learn about deep neural networks, which are networks of units with more hidden layers, making them deeper and often more efficient at representing complex functions.


## 1 Basic questions from the videos

The following are some simple questions you should be able to answer after watching the pre-class videos, notably the video on Convolutional Neural Networks.

**Your Turn (Question 1)**: What is the neural network in the **left figure** doing with the weight vectors described?

_Choose from: Detecting features of the first row, Detecting features of the first column, Convolution of the features, Analyzing all feature with same weight._

**Your Turn (Question 2)**: What is the neural network in the **right figure** doing with the weight vectors described?

_Choose from: Detecting features of the first row, Detecting features of the first column, Convolution of the features, Analyzing all feature with same weight._


<img src="https://www.comp.nus.edu.sg/~neamul/Images/9pre01.PNG" align="left" width = 430/><img src="https://www.comp.nus.edu.sg/~neamul/Images/09pre02.PNG" align="right" width = 430/>

## 2 CNN Coding Example

Now let's do some edge detection using convolutional neural networks!  We will see different kinds of filtering to detect horizontal, vertical or diagonal lines from an image, by using a simple CNN.

We will use the `pytorch` library, which we already used in Week 8. Let's first install the necessary libraries.

In [0]:
### Import the pytorch libraries
import torch
import torchvision
import torch.nn as nn

# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install "Pillow>=4.1.1"

import matplotlib.pyplot as plt
import numpy as np

Now we will define a simple convolutional neural network. In the following code block, we define a CNN which has only 1 `input channel`, 1 `output channel` and a 3x3 `kernel matrix`.

In [0]:
# Defining a simple CNN with 1 input channel, 1 output channel and a 3x3 kernel matrix.
# We use the simplest non-linear activation function here, the ReLU.
net = nn.Sequential(nn.Conv2d(1,1,3),nn.ReLU())

# Setting the bias value of the network to 0
net[0].bias.data = torch.FloatTensor([0])
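
To make concrete what this tiny network computes, here is a minimal sketch in plain Python (no PyTorch) of a single-channel 3x3 convolution followed by ReLU. The helper names `conv2d_valid` and `relu`, the toy image and the example kernel are illustrative assumptions, not part of the notebook. Like `nn.Conv2d` with its default settings, the sketch slides the kernel without flipping it and uses no padding, so a 5x5 input yields a 3x3 output.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            # Weighted sum of the k x k window whose top-left corner is (i, j)
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out

def relu(feature_map):
    """Replace every negative response with 0."""
    return [[max(0.0, v) for v in row] for row in feature_map]

# Toy 5x5 image: a single bright pixel in the centre
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0

# Example kernel (deliberately not one of the four used in edgeDetection)
kernel = [[0.0, -1.0, 0.0],
          [-1.0, 4.0, -1.0],
          [0.0, -1.0, 0.0]]

raw = conv2d_valid(image, kernel)
activated = relu(raw)
print(len(raw), len(raw[0]))  # 3 3
print(activated)              # only the centre response (4.0) survives the ReLU
```

The same shrinkage applies to the real image later in this section: a 450x900 input passed through a 3x3 kernel without padding gives a 448x898 output.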

In the following code block, we define a function which sets the weight vector of our neural network based on the input. For different settings of the weight vector, we obtain different forms of filtering.

In [0]:
def edgeDetection(filterType):
  """ 
    Args:
        filterType (int) : type of filter (ranging from 1 to 4)
        
  """
  # Setting the weight vector of the network to detect horizontal / vertical / diagonal lines
  if filterType == 1:
    net[0].weight.data = torch.FloatTensor ([[[[-0.5,1.0,-0.5],
                                               [-0.5,1.0,-0.5],
                                               [-0.5,1.0,-0.5]]]])
  elif filterType == 2:
    net[0].weight.data = torch.FloatTensor ([[[[-0.5,1.0,1.0],
                                               [1.0,-0.5,1.0],
                                               [1.0,1.0,-0.5]]]])
  elif filterType == 3:
    net[0].weight.data = torch.FloatTensor ([[[[ 1.0, 1.0, 1.0],
                                               [-0.5,-0.5,-0.5],
                                               [-0.5,-0.5,-0.5]]]])
  elif filterType == 4:
    net[0].weight.data = torch.FloatTensor ([[[[1.0,-0.5,-0.5],
                                               [-0.5,1.0,-0.5],
                                               [-0.5,-0.5,1.0]]]])

In the following code block, we download an image from a URL using the `torchvision` library. Then we display the image. We will use this image to check how the filtering works.

Let's check what the original image looks like.

In [0]:
from PIL import Image
# Downloading an image from a given URL using the torchvision library
torchvision.datasets.utils.download_url('https://d5wt70d4gnm1t.cloudfront.net/media/a-s/articles/1181-043236208562/close-look-sol-lewitt-900x450-c.jpg',root='data/image',filename="Picture.jpg",md5="")

# Opening the image from the uploaded folder
k = Image.open('data/image/Picture.jpg').convert('L')

# Giving picture credits.  Sol LeWitt was a famous American artist connected with the minimalist movement
# https://en.wikipedia.org/wiki/Sol_LeWitt
print ("'Detail of A Square Divided Horizontally and Vertically into Four Equal Parts, Each with a Different Direction of Alternating Parallel Bands of Lines', 1982, by Sol LeWitt")

# Showing the original image
k

Now we have everything we need. In this code block, we put together all of the code blocks we have written so far in a compact, standalone representation. We first convert the image to a `Tensor`. Finally, we call the `edgeDetection` function with different parameters ranging from $1$ to $4$ to achieve the different filtering types.

Let's explore how the filtering works.

In [0]:

# Loading the image as Tensor
m = torch.Tensor(np.asarray(k)).view([1,1,450,900])/256

#### 
edgeDetection(1) # change the parameter from 1 to 4 to see different type of filtering
####

# Filtering the image by passing it to the defined network
filteredImage = net(m.round()).detach()

# Displaying the image
plt.imshow(filteredImage[0,0])

# You can use the following code to see the image in a grid
# plt.imshow(torchvision.utils.make_grid(net(m.round()).detach()).permute(1, 2, 0))

**Your Turn:** Which of the following weight vectors corresponds to which kind of edge detection (*horizontal / vertical / diagonal*)?

* **A:** $\hspace{10mm}\begin{bmatrix}
    1.0 & 1.0 & 1.0 \\
    -0.5 & -0.5 & -0.5 \\  
    -0.5 & -0.5 & -0.5 
  \end{bmatrix}$
  
* **B:** $\hspace{10mm}\begin{bmatrix}
    -0.5 & 1.0 & 1.0 \\
    1.0 & -0.5 & 1.0 \\  
   1.0 & 1.0 & -0.5 
  \end{bmatrix}$
  
* **C:** $\hspace{10mm}\begin{bmatrix}
    1.0 & -0.5 & -0.5\\
   -0.5 & 1.0 & -0.5 \\  
   -0.5 &  -0.5 & 1.0 
  \end{bmatrix}$
  
* **D:** $\hspace{10mm}\begin{bmatrix}
    -0.5 & 1.0 & -0.5  \\
   -0.5 & 1.0 & -0.5  \\  
   -0.5 & 1.0 & -0.5  
  \end{bmatrix}$


**Your Turn (Question 3):**  Which is the corresponding weight vector used to detect **Horizontal Line**?

_Choose from: A, B, C, D_

**Your Turn (Question 4):**  Which is the corresponding weight vector used to detect **Vertical Line**?

_Choose from: A, B, C, D_

**Your Turn (Question 5):**  Which is the corresponding weight vector used to detect **Diagonal Line**?

_Choose from: A, B, C, D_

## 3 Programming : Recurrent Neural Networks (RNNs)



**Classifying names with a character-level RNN**

We will be building and training a basic character-level RNN to classify
words. A character-level RNN reads words as a series of characters -
outputting a prediction and "hidden state" at each step, feeding its
previous hidden state into each next step. We take the final prediction
to be the output, i.e. which class the word belongs to.

Specifically, we'll train on a few thousand surnames from 18 languages
of origin, and predict which language a name is from based on the
spelling:

```
    (input) Hinton
    (-0.47) Scottish
    (-1.52) English
    (-3.57) Irish

    (input) Schmidhuber
    (-0.19) German
    (-2.48) Czech
    (-2.68) Dutch
```

**Credits**: This section is adapted from one of pytorch's official tutorials, which can be found [here](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html).

### .a Prepare environment and data

Let's start by importing the libraries and downloading the data. 

Run the following cell after restarting your environment and/or copying the notebook. Libraries that we need for the specific section will be loaded at the beginning of each section.

In [0]:
from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os
import random
import unicodedata
import string
import time
import math

import torch
import torch.nn as nn
import seaborn as sns
import matplotlib.pyplot as plt

# Download the required data
!wget https://download.pytorch.org/tutorial/data.zip
!unzip data.zip

#### Clean and index data
The downloaded data are mostly romanized; however, we still need to convert them from Unicode to ASCII.

We also need to build the mapping from category (language) to specific names in order to retrieve training samples later.

We'll end up with a dictionary of lists of names per language, ```{language: [names ...]}```. 

In [0]:
def findFiles(path): return glob.glob(path)
# print(findFiles('data/names/*.txt'))

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# Build the category_names dictionary, a list of names per language
category_names = {}  # mapping from category (language) to name
all_categories = []  # list of all categories (languages)

# Read a file and split into lines
def readLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    names = readLines(filename)
    category_names[category] = names

# number of categories
n_categories = len(all_categories)
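
As a quick check of what the Unicode-to-ASCII step does, the following standalone sketch re-implements the same idea under the hypothetical names `ALLOWED` and `to_ascii` and runs it on a few sample strings. The sample names are made up for illustration. Note the last case: a letter that has no base-letter decomposition (such as `ł`) is dropped rather than converted.

```python
import string
import unicodedata

ALLOWED = string.ascii_letters + " .,;'"  # same alphabet as all_letters above

def to_ascii(s):
    # NFD splits an accented letter into a base letter plus a combining mark
    # (category 'Mn'); dropping the marks keeps only the base letter.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in ALLOWED
    )

print(to_ascii('Ślusàrski'))  # Slusarski
print(to_ascii("O'Néàl"))     # O'Neal
print(to_ascii('Wałęsa'))     # Waesa ('ł' does not decompose, so it is removed)
```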

#### Turning names into Tensors
Now that we have all the names organized, we need to turn them into Tensors to make any use of them.

To represent a single letter, we use a "one-hot vector" of size <1 x n_letters>. 

A one-hot vector is filled with 0s except for a 1 at index of the current letter, e.g. "b" = <0 1 0 0 0 ...>. 

To make a word we join a bunch of those into a tensor of shape <name_length x 1 x n_letters>.

That extra 1 dimension is because PyTorch assumes everything is in batches - we're just using a batch size of 1 here.

In [0]:
# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a name into a <name_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def nameToTensor(name):
    tensor = torch.zeros(len(name), 1, n_letters)
    for li, letter in enumerate(name):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

# This is just an example. You can change to something else and see what happens!
print(letterToTensor('J'))
print(nameToTensor('Jones').size())
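
The encoding can also be traced by hand. The sketch below uses plain lists instead of tensors and hypothetical helper names (`one_hot`, `name_to_rows`); it omits the batch dimension of size 1. It also illustrates a caveat that seems worth knowing: `str.find` returns -1 for a character outside the alphabet, and index -1 addresses the last slot, so such a character is silently encoded like the final symbol `'`. PyTorch tensors accept negative indices too, so the functions above would presumably behave the same way on input that has not been cleaned first.

```python
import string

letters = string.ascii_letters + " .,;'"  # 52 letters + 5 other symbols = 57

def one_hot(letter):
    vec = [0] * len(letters)
    vec[letters.find(letter)] = 1
    return vec

def name_to_rows(name):
    # One row per character: the list analogue of <name_length x n_letters>
    return [one_hot(ch) for ch in name]

rows = name_to_rows('Jones')
print(len(rows), len(rows[0]))       # 5 57
print(rows[0].index(1))              # 35, because 'J' follows the 26 lowercase letters

# Caveat: an out-of-alphabet character collides with the last symbol
print(letters.find('3'))             # -1
print(one_hot('3') == one_hot("'"))  # True
```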

### .b Creating the network


You can implement an RNN in a very "pure" way, as regular feed-forward layers like Linear or Conv.

The RNN we use here is just 2 linear layers which operate on an input and hidden state, with a LogSoftmax layer after the output.

 <div align="center">
<img src="https://www.comp.nus.edu.sg/~neamul/Images/CS3244_1910/rnn_archi.png"  width = 450/>
<p>  A Simple Recurrent Neural Network. 
</div>

 (Diagram Credit: From [https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6](https://miro.medium.com/max/726/1*XxxiA0jJvPrHEJHD4z893g.png))

In [0]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        # hidden states are 0 at the beginning
        return torch.zeros(1, self.hidden_size)

To create an instance of this network we need to specify:

- `n_letters` : number of all possible letters.
- `n_hidden` : number of hidden units and hidden state size.
- `n_categories` : number of categories. In our case it is 18.

We can now create our RNN as follows:

In [0]:
rnn = RNN(input_size=n_letters, hidden_size=128, output_size=n_categories)

To run a step of this network we need to pass an input (in our case, the Tensor for the current letter) and a previous hidden state (which we initialize as zeros at first). We'll get back the output (log-probability of each language) and a next hidden state (which we keep for the next step).

In [0]:
# Try the following code. You can try different names.
input = nameToTensor('Albert')
hidden = torch.zeros(1, 128)

output, next_hidden = rnn(input[0], hidden)
print(output)
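
The printed values are negative because the last layer is a `LogSoftmax`: each entry is the logarithm of a probability. The following sketch, with the hypothetical helper `log_softmax` and made-up scores, shows the computation in plain Python and confirms that exponentiating the outputs recovers probabilities that sum to 1.

```python
import math

def log_softmax(scores):
    # Subtracting the maximum first avoids overflow and leaves the result unchanged
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

log_probs = log_softmax([2.0, 1.0, 0.1])   # made-up raw scores for 3 classes
probs = [math.exp(lp) for lp in log_probs]

print([round(lp, 2) for lp in log_probs])  # all values are <= 0
print(round(sum(probs), 6))                # 1.0
```

The largest raw score keeps the largest (least negative) log-probability, which is why taking the top entry of the output gives the predicted class.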

### .c Train the network

#### Prepare for training
Before going into training we should make a few helper functions. The
first is to interpret the output of the network, which we know to be a
log-likelihood of each category. We can use ``Tensor.topk`` to get the index
of the greatest value. We will also want a quick way to get a training example (a name and its language):

In [0]:
def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    name = randomChoice(category_names[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    name_tensor = nameToTensor(name)
    return category, name, category_tensor, name_tensor


Finally, let's setup the train and evaluation engine.

**Your Turn (Question 6)** We removed some code from `train_rnn()`. Please complete this function. The question marks in the template are placeholders for Python expressions.

In [0]:
def train_rnn(category_tensor, name_tensor, loss_function, optimizer):
    hidden = rnn.initHidden()

    optimizer.zero_grad()

    ####### Your Turn ########
    #
    # We removed some code here, now it's your turn to make it run.
    # Consider the following template:
    #
    # for ? in range(?):
    #    output, hidden = rnn(?, ?)
    
    for i in range(name_tensor.size()[0]):
        output, hidden = rnn(name_tensor[i], hidden)
        
    #########################

    loss = loss_function(output, category_tensor)
    loss.backward()
    optimizer.step()

    return output, loss.item()


**Your Turn (Question 7)** We removed some code from `evaluate_rnn()`. Please complete this function. The question marks in the template are placeholders for Python expressions.

In [0]:
def evaluate_rnn(name_tensor):
    hidden = rnn.initHidden()

    ####### Your Turn ########
    #
    # We removed some code here, now it's your turn to make it run.
    # Consider the following template:
    #
    # for ? in range(?):
    #    output, hidden = rnn(?, ?)
    
    for i in range(name_tensor.size()[0]):
        output, hidden = rnn(name_tensor[i], hidden)
        
    #########################

    return output

#### Now let's train it
Now we just have to run that with a bunch of examples. Since the
``train_rnn`` function returns both the output and loss we can print its
guesses and also keep track of loss for plotting. Since there are thousands
of examples we print only every ``print_every`` examples, and take an
average of the loss.

In [0]:
# Hyperparameters
learning_rate = 0.005
n_iters       = 100000
print_every   = 5000
plot_every    = 1000
loss_function = nn.NLLLoss()
optimizer     = torch.optim.SGD(rnn.parameters(), learning_rate)

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for i in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train_rnn(category_tensor, line_tensor, loss_function, optimizer)
    current_loss += loss

    # Print iter number, loss, name and guess
    if i % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (i, i / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if i % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

### .d Evaluate the network
To see how well the network performs on different categories, we will
create a confusion matrix, indicating for every actual language (rows)
which language the network guesses (columns). To calculate the confusion
matrix a bunch of samples are run through the network with
``evaluate_rnn()``, which is the same as ``train_rnn()`` minus the backprop.

In [0]:
# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 5000

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate_rnn(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(confusion, xticklabels=all_categories, yticklabels=all_categories, ax=ax)
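
The row normalization is easier to follow on a tiny example. The sketch below uses three hypothetical classes and a made-up list of (actual, guess) pairs. After dividing each row by its sum, the diagonal entry of a row is the fraction of that class guessed correctly. The guard against an empty row is an addition for illustration: dividing by a zero row sum would otherwise fail here (in the tensor version it would yield NaN).

```python
classes = ['A', 'B', 'C']  # hypothetical categories
pairs = [('A', 'A'), ('A', 'A'), ('A', 'B'),
         ('B', 'B'), ('B', 'B'), ('B', 'B'), ('B', 'C'),
         ('C', 'C')]       # made-up (actual, guess) results

# Rows are actual classes, columns are guesses
counts = [[0] * len(classes) for _ in classes]
for actual, guess in pairs:
    counts[classes.index(actual)][classes.index(guess)] += 1

# Normalize every row by its sum, skipping rows with no samples
normalized = [
    [c / sum(row) if sum(row) else 0.0 for c in row]
    for row in counts
]

print(counts)         # [[2, 1, 0], [0, 3, 1], [0, 0, 1]]
print(normalized[1])  # [0.0, 0.75, 0.25]
```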

**Your Turn (Question 8)**: Why do certain pairs of languages confuse our RNN more easily than other pairs?

_Replace with your answer: For example: because they have similar character combinations (contexts) in their surnames._

You can also evaluate this RNN using your own input:


In [0]:
#@title Try your own input. Type your name and run the code! (Double click the title to reveal the code. This is not a question!)
custom_input = "Xu" #@param {type:"string"}
def predict(input_name, n_predictions=3):
    print('\n> %s' % input_name)
    with torch.no_grad():
        output = evaluate_rnn(nameToTensor(input_name))

        # Get top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])


predict(custom_input)

---
# Week 09: Post-Tutorial Work




## 4 Embeddings



In this part, we'll learn a new concept called **embeddings**. We will use *movie review* string data to create a sparse feature vector and then implement a *sentiment-analysis* model using that feature vector. To get a better understanding of *embeddings* and why they are used, we will first implement a linear model and later use a DNN model with and without an embedding. Then we will visualize the results and try to see how embeddings help.

**Credits:** This section of the exercises is adapted from the [Google Tutorial on Embeddings](https://colab.research.google.com/notebooks/mlcc/intro_to_sparse_data_and_embeddings.ipynb), licensed under [Apache License version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### .a Building a Sentiment Analysis Model 

Let's train a sentiment-analysis model on this data that predicts if a review is generally *favorable* (label of 1) or *unfavorable* (label of 0).

To do so, we'll turn our string-value `terms` into feature vectors by using a *vocabulary*, a list of each term we expect to see in our data. For the purposes of this exercise, we've created a small vocabulary that focuses on a limited set of terms. Most of these terms were found to be strongly indicative of *favorable* or *unfavorable*, but some were just added because they're interesting.

Each term in the vocabulary is mapped to a coordinate in our feature vector. To convert the string-value `terms` for an example into this vector format, we encode such that each coordinate gets a value of 0 if the vocabulary term does not appear in the example string, and a value of 1 if it does. Terms in an example that don't appear in the vocabulary are thrown away.

**Note:** *We can use a larger vocabulary, and there are special tools for creating these. In addition, instead of just dropping terms that are not in the vocabulary, we can introduce a small number of OOV (out-of-vocabulary) buckets, to which we can hash the terms that are not in the vocabulary. We can also use a __feature hashing__ approach that hashes each term, instead of creating an explicit vocabulary. This works well in practice, but loses interpretability, which we want to retain for this exercise.*
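
The OOV-bucket idea from the note can be sketched as follows. This is only an illustration and is not used in the rest of the exercise; the vocabulary, the number of buckets and the helper names `term_index` and `encode` are all hypothetical. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings changes between interpreter runs.

```python
import hashlib

vocab = ("bad", "great", "boring")  # hypothetical tiny vocabulary
N_OOV_BUCKETS = 2                   # hypothetical number of extra buckets

def term_index(term):
    if term in vocab:
        return vocab.index(term)
    # Unknown terms share a small number of extra coordinates after the vocabulary
    digest = hashlib.md5(term.encode('utf-8')).hexdigest()
    return len(vocab) + int(digest, 16) % N_OOV_BUCKETS

def encode(terms):
    vec = [0] * (len(vocab) + N_OOV_BUCKETS)
    for t in terms:
        vec[term_index(t)] = 1
    return vec

print(encode(["great", "zzz"]))  # "great" sets index 1, "zzz" sets one of the two OOV slots
```

Different unknown terms can land in the same bucket, which is the price paid for keeping the vector short.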

Before we start, let's do some setup. We begin by importing the libraries that we need for all sections. 

Run this cell after restarting your environment and/or copying the notebook. Libraries that we need for the specific section will be loaded at the beginning of each section.

Next, we will define our `train` function, which we already used last week. We will use it multiple times in this section.

In [0]:
### Load the pytorch libraries
import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as transforms
from torch.autograd import Variable

# Load the other necessary libraries
import matplotlib.pyplot as plt
import numpy as np

In [0]:
def train(model, trainloader, validationloader, lossfunction, optimizer, n_epochs=100, verbose=True):
  """
    Args:
        model (pytorch neural network): the network we want to train
        trainloader (data loader): The data loader for the training set
        validationloader (data loader): The data loader for the validation set
        lossfunction (a pytorch loss function): The loss function used to train the network
        optimizer (a pytorch optimizer ): The optimizer used to train the network
        n_epochs (int): The number of epochs the network is trained
    Returns:
        trainingLosses,validationLosses (Lists of floats): returns the losses of each epoch
  """
  trainingLosses, validationLosses = [],[]
  for t in range(n_epochs):
    model = model.train()
    running_loss = 0.0
    for i, data in enumerate(trainloader):
      inputs, labels = data
      inputs, labels = Variable(inputs), Variable(labels) # See the comments from last week
      inputs =  inputs.cuda()
      labels = labels.cuda()
      optimizer.zero_grad() # See the comments from last week
      outputs = model(inputs) # See the comments from last week
    #   ipdb.set_trace()
      loss = lossfunction(outputs, labels) # Compute the loss
      loss.backward() # Compute the gradient for each variable
      torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
      optimizer.step() # Update the weights according to the computed gradient
            
      # for printing
      running_loss += loss.data.item()
    # The second loop is actually not training, it's just calculating the loss in the validation set
    # Otherwise, it's the same as above
    model = model.eval()
    with torch.no_grad():
        running_loss_val = 0.0
        for i, data in enumerate(validationloader):
            inputs, labels = data
            inputs, labels = Variable(inputs), Variable(labels).long()
            inputs = inputs.cuda()
            labels = labels.cuda()
            outputs = model(inputs) # problematic for students
            loss = lossfunction(outputs, labels) # Compute the loss
            
            # for printing
            running_loss_val += loss.data.item()
        trainingLosses.append(running_loss)
        validationLosses.append(running_loss_val)
    if verbose:
        print("Epoch: {} Training loss: {:f} Validation loss: {:f}".format(t+1,running_loss,running_loss_val))
#   return trainingLosses,validationLosses

### .b Building the Input Pipeline

We converted the data set from the original TensorFlow format to simple lists and then [pickled](https://wiki.python.org/moin/UsingPickle) them. You can download it with the following code. This might take a few minutes. 

In [0]:
 # This line will clone a git repository with the data and save it into the files of this notebook.
 # You'll only need to run this code once (per reset of the runtime); it will throw errors afterwards.

! git clone --recursive https://github.com/MartinStrobel/CS3244Week09data.git

In [0]:
import pickle
statements = pickle.load( open("CS3244Week09data/Week09_embeddings_statements.p",'rb'))
labels = pickle.load( open("CS3244Week09data/Week09_embeddings_labels.p",'rb'))

After downloading the dataset, we can look at a few statements. Let's see how one looks:


In [0]:
print(' '.join(statements[11])) # Why 11?

Now we need to transform the statements into a usable data set. Let's start very simply and focus only on whether 50 informative terms occur within the statements (as opposed to the 171,476 words in the [Oxford English Dictionary](https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language/).) 

In [0]:
# 50 informative terms that compose our model vocabulary 
informative_terms = ("bad", "great", "best", "worst", "fun", "beautiful",
                     "excellent", "poor", "boring", "awful", "terrible",
                     "definitely", "perfect", "liked", "worse", "waste",
                     "entertaining", "loved", "unfortunately", "amazing",
                     "enjoyed", "favorite", "horrible", "brilliant", "highly",
                     "simple", "annoying", "today", "hilarious", "enjoyable",
                     "dull", "fantastic", "poorly", "fails", "disappointing",
                     "disappointment", "not", "him", "her", "good", "time",
                     "?", ".", "!", "movie", "film", "action", "comedy",
                     "drama", "family")
# This is just a crude way of assigning emotions to the above words, from bad (-1), through neutral (0), to good (1).
# We will use that for plotting later
informative_terms_emotion = [-1,1,1,-1,1,1,1,-1,-1,-1,-1,0,1,1,-1,-1,1,1,-1,1,1,1,-1,1
                             ,1,0,-1,0,1,1,-1,1,-1,-1,-1,-1,0,0,0,1,0,0,0,0,0,0,0,0,0,0]


In [0]:
def convertStatmentToVector(statement, informative_terms=informative_terms):
  """
    Args:
        statement (list of strings) : the tokenized statement we want to convert to a vector
        informative_terms (tuple of strings) : the informative keywords that we want to find
    Returns:
        tensor (pytorch tensor) : pytorch tensor converted from the statement 
  """
  tensor = torch.zeros(len(informative_terms))
  for term in statement:
    if term in informative_terms:
        tensor[informative_terms.index(term)] = 1
  return tensor
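
To see the encoding on a concrete case, here is a list-based sketch with a hypothetical four-term vocabulary and a made-up review. The name `to_binary_vector` is illustrative only. Each coordinate records whether a term occurs at all, so repeated terms and word order leave no trace.

```python
vocab = ("bad", "great", "boring", "!")  # hypothetical 4-term vocabulary

def to_binary_vector(tokens, vocab=vocab):
    present = set(tokens)
    # 1 if the vocabulary term occurs anywhere in the review, else 0
    return [1 if term in present else 0 for term in vocab]

review = "a great great film , not boring at all !".split()
print(to_binary_vector(review))  # [0, 1, 1, 1]
```

Note that "not boring" still switches on the `boring` coordinate: this representation cannot express negation, which is one reason a model built on it has limited accuracy.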

Let's split our data into 20,000 training points and 5,000 validation points. The data is already shuffled, so we can just split it. We have done this many times before, so it should be familiar to you.

In [0]:
# This takes a couple of seconds
# For labels, we squeeze dimension 1 to make labels in shape (N,). 
X_train = torch.zeros([20000, len(informative_terms)], dtype=torch.float)
y_train = torch.tensor(np.asarray(labels)[:20000,None], dtype=torch.float).squeeze(1).long()
X_val   = torch.zeros([5000, len(informative_terms)], dtype=torch.float)
y_val   = torch.tensor(np.asarray(labels)[20000:,None], dtype=torch.float).squeeze(1).long()

# Loop to convert every training statement to a vector
for i,statement in enumerate(statements[:20000]):
  X_train[i] = convertStatmentToVector(statement)

# Loop to convert every validation statement to a vector
for i,statement in enumerate(statements[20000:]):
  X_val[i] = convertStatmentToVector(statement)

Now that we have transformed our data into tensors we can create our data loaders:

In [0]:
sentimentTrainloader      = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(X_train,y_train), batch_size=20000)
sentimentValidationloader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(X_val,y_val), batch_size=20000)

Next we define the loss function. This time we are using **nn.NLLLoss()**. This loss function is widely used with log_softmax activation for multi-class (k >= 2) classification. You are encouraged to read the documentation on [LogSoftmax](https://pytorch.org/docs/stable/nn.html#logsoftmax) and [NLLLoss](https://pytorch.org/docs/stable/nn.html#nllloss).

In [0]:
embeddingLoss = nn.NLLLoss()
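
As a rough picture of what this loss measures, the sketch below computes the negative log-likelihood by hand for a made-up batch of two examples with two classes. It assumes the default behaviour of averaging over the batch; the helper name `nll_loss` and the numbers are illustrative. The inputs are log-probabilities, as produced by a `LogSoftmax` layer.

```python
import math

def nll_loss(log_probs_batch, targets):
    # For each example, pick the log-probability of the true class and negate it
    picked = [-row[t] for row, t in zip(log_probs_batch, targets)]
    return sum(picked) / len(picked)

batch = [[math.log(0.9), math.log(0.1)],   # model is confident in class 0
         [math.log(0.2), math.log(0.8)]]   # model is confident in class 1

good = nll_loss(batch, [0, 1])  # labels agree with the confident guesses
bad = nll_loss(batch, [1, 0])   # labels contradict them

print(round(good, 2))  # about 0.16
print(round(bad, 2))   # about 1.96
```

Confident correct predictions give a loss near 0, while confident wrong ones are penalized heavily.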

In [0]:
def accuracy(model,dataloader):
  """
    Args:
        model (pytorch model) : the model we want to analyze
        dataloader (pytorch dataloader): the dataloader to load corresponding 
          dataset for that model like training or validation set
    Returns:
        accu (float) : the calculated accuracy of the model on the provided data
  """
  model = model.eval()
  with torch.no_grad():
    correct = 0.0
    total = 0.0
    for i, data in enumerate(dataloader):
        inputs, labels = data
        inputs, labels = Variable(inputs).cuda(), Variable(labels).cuda()
        outputs = model(inputs)
        correct += (labels == outputs.argmax(dim=1)).sum().item()
        total += len(labels)
    accu = correct/total    
  return accu

### .c Use a Linear Model with Sparse Inputs and an Explicit Vocabulary

For our first model, we'll build a Linear classifier using the 50 informative terms; always start simple (and then iterate)!

We use a different optimizer, Adagrad, this time. It's one of the many improvements on SGD. We could get similar results with SGD, but Adagrad converges faster and is more consistent, so it's better suited for this exercise. You can read more about Adagrad as well as many other optimization algorithms [here](http://ruder.io/optimizing-gradient-descent/index.html#adagrad).

In [0]:
linearNet = nn.Sequential(
    nn.Linear(len(informative_terms), 2),
    nn.LogSoftmax(dim=1)
).cuda()

linearOpt = optim.Adagrad(linearNet.parameters(), 1e-1)
# You may also try SGD for fun after you've finished this task
# linearOpt = optim.SGD(linearNet.parameters(), 1e-1)

Now we will train our linear classifier using the defined optimizer:

In [0]:
train(linearNet,sentimentTrainloader,sentimentValidationloader,embeddingLoss,linearOpt,verbose=True)

Let's see what accuracy we get:

In [0]:
accuracy(linearNet,sentimentValidationloader)

### .d Use a DNN with Sparse Inputs and an Explicit Vocabulary

In the previous section we saw how a linear model performs. Now we will try a DNN on the same data and see whether we can do better. We declare a DNN, slightly more complex than the linear model, together with an optimizer for the network.

In [0]:
DNNNet = nn.Sequential(
    nn.Linear(len(informative_terms), 100),
    nn.ReLU(),
    nn.Dropout(),
    nn.Linear(100, 2),
    nn.LogSoftmax(dim=1)
).cuda()

# The optimizer for DNNNet
DNNOpt = optim.Adagrad(DNNNet.parameters(), 1e-1, weight_decay=1e-4)
# DNNOpt = optim.SGD(DNNNet.parameters(), 1e-1, momentum=0.9, weight_decay=1e-3)

# Train our DNNNet with the optimizer
train(DNNNet,sentimentTrainloader,sentimentValidationloader,embeddingLoss,DNNOpt,n_epochs=100,verbose=True);

Let's see what accuracy we get this time:

In [0]:
accuracy(DNNNet,sentimentValidationloader)

### .e Use an Embedding with a DNN Model

From *Linear* to *DNN*, we have seen some improvement in accuracy. We will now try a DNN with an embedding layer. Let's see if we can do better than the plain DNN.

In this task, we'll implement our DNN model using an embedding layer. An embedding layer takes a sparse vector or an index as input and returns a dense vector as output.

Please note that its behavior is library-dependent and there is no "standard" implementation. For example, PyTorch's embedding layer takes indices instead of sparse vectors as input.

You are encouraged to read the documentation for the [Embedding](https://pytorch.org/docs/stable/nn.html#embedding) layer.
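
The relationship between "index in" and "sparse vector in" can be checked directly. The following sketch, with toy sizes chosen only for illustration, shows that looking up rows by index gives the same result as multiplying one-hot vectors by the embedding weight matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 5, 2          # toy sizes chosen for illustration
embedding = nn.Embedding(vocab_size, embedding_dim)

indices = torch.tensor([3, 0, 3])          # integer (int64) indices, i.e. a LongTensor

# Route 1: look the rows up directly by index
looked_up = embedding(indices)

# Route 2: build one-hot vectors and multiply by the weight matrix
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
multiplied = one_hot @ embedding.weight

print(looked_up.shape)
print(torch.allclose(looked_up, multiplied))
```

The index-based lookup avoids building the large, mostly-zero one-hot matrix, which is presumably why PyTorch exposes the layer this way.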

First, we need to convert our one-hot representation into indices. Note that since the inputs are indices, it should be LongTensor instead of FloatTensor. 

In [0]:
# We need an index to indicate "not present". Since we use 0~(vocab_size-1) 
# as indices for the vocab_size terms, vocab_size itself can be used as the 
# index for "not present"

def terms_to_indices(X, vocab_size):
    X_new = np.zeros([X.shape[0], vocab_size])
    for i, x in enumerate(X):
        X_new[i, :] = [ j if xi !=0 else vocab_size for j, xi in enumerate(x) ]
    return X_new
# Note the trailing underscore
X_train_ = torch.LongTensor(terms_to_indices(X_train, len(informative_terms)))
X_val_ = torch.LongTensor(terms_to_indices(X_val, len(informative_terms)))
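
To make the conversion concrete, here is the same function applied to a tiny made-up matrix (2 documents, 3 terms). The function body is repeated so the snippet runs on its own; `X_toy` and the vectorized `np.where` variant are illustrative additions, not part of the original notebook.

```python
import numpy as np

def terms_to_indices(X, vocab_size):
    # Same logic as the cell above, repeated so this snippet is self-contained
    X_new = np.zeros([X.shape[0], vocab_size])
    for i, x in enumerate(X):
        X_new[i, :] = [j if xi != 0 else vocab_size for j, xi in enumerate(x)]
    return X_new

# Toy multi-hot matrix: 2 documents, vocabulary of 3 terms (made-up values)
X_toy = np.array([[1, 0, 1],
                  [0, 0, 1]])

indices = terms_to_indices(X_toy, vocab_size=3)
print(indices)

# A vectorized alternative that should give the same result
positions = np.arange(X_toy.shape[1])
vectorized = np.where(X_toy != 0, positions, 3)
print(vectorized)
```

Each present term is replaced by its own column index, and each absent term by the "not present" index `vocab_size` (here 3).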

In [0]:
embeddingTrainloader      = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(X_train_,y_train), batch_size=20000)
embeddingValidationloader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(X_val_,y_val), batch_size=5000)

Now let's try the embedding layer. There are many ways of designing the network, let's use the most naive way. We will retrieve embeddings based on the input informative terms, and use the mean of all the retrieved embeddings as the feature of the document. Finally, we will use a linear classifier to perform sentiment analysis.

In [0]:
class EmbeddingNet(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super(EmbeddingNet, self).__init__()
        
        # These are two constants we need to transform the output 
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size
        
        # An embedding layer is basically a lookup table: the input is a vector of indices
        # and the output is their representation in an embedding space
        self.embeddings = nn.Embedding(vocab_size+1, embedding_dim, padding_idx=vocab_size)
        self.linear1 = nn.Linear(embedding_dim, 2)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        out = torch.log_softmax(self.linear1(embeds.mean(dim=1)),dim=1)

        return out
      
############ Your Turn: Edit this code block later ####################      
embeddingNet = EmbeddingNet(50,2).cuda()
## first parameter indicates the size of the informative_terms
## second parameter indicates the dimension
###############################################################
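
One detail of this design is worth checking on a toy example: the row at `padding_idx` starts as a zero vector, and `mean(dim=1)` divides by the number of positions, not by the number of terms that are actually present. The sketch below uses made-up sizes; the masked mean at the end is an alternative shown only for comparison and is not what `EmbeddingNet` does.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embedding_dim = 4, 2           # toy sizes chosen for illustration
emb = nn.Embedding(vocab_size + 1, embedding_dim, padding_idx=vocab_size)

# One document in which only terms 1 and 3 are present; the other
# positions hold the padding index (vocab_size)
doc = torch.tensor([[vocab_size, 1, vocab_size, 3]])

vectors = emb(doc)                          # shape (1, 4, 2)
plain_mean = vectors.mean(dim=1)            # divides by all 4 positions

# Alternative (not used in the notebook): average over present terms only
mask = (doc != vocab_size).unsqueeze(-1).float()
masked_mean = (vectors * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

print(vectors[0, 0])                        # embedding of the padding index
print(plain_mean, masked_mean)
```

Here the plain mean is exactly half of the masked mean, because two of the four positions are padding. Since every document in this notebook has the same number of positions, the plain mean is a consistent (if scaled) summary.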

We will now declare the `optimizer` for our embedding network. Then train the network.

In [0]:
######### You may need to re-run this block later to answer Q02-Q04 ########
embeddingOpt = optim.Adagrad(embeddingNet.parameters(), 1e-1)
train(embeddingNet,embeddingTrainloader,embeddingValidationloader,embeddingLoss,embeddingOpt,n_epochs=200,verbose=True);

Let's see... How well do we do on accuracy now that we use embeddings?

In [0]:
accuracy(embeddingNet,embeddingValidationloader)

### .f Examine the Embedding

Let's now take a look at the actual embedding space, and visualize where the terms end up in it.

We will first visualize the keywords. Run the code below and then try to answer the following question.

**Your Turn (Question 1):** Which of the following describes the output?

_Choose from: Similar words are clustered together, Most of the similar words are close to each other with some exception, Most of the words that are close are not correlated at all_

In [0]:
colors = {-1:"red", 0:"black", 1:"green" }
with torch.no_grad():
    for term_index in range(len(informative_terms)):
        # the embedding layer takes indices, instead of one-hot vectors, as inputs
        embedding_xy = embeddingNet.embeddings(torch.LongTensor([term_index]).cuda())
        plt.text(embedding_xy[0][0],
            embedding_xy[0][1],
            informative_terms[term_index], color=colors[informative_terms_emotion[term_index]])

# Do a little setup to make sure the plot displays nicely.  
# H/T Always do good when presenting.
plt.rcParams["figure.figsize"] = (15, 15)
embedding_matrix = embeddingNet.embeddings.weight.detach().cpu().numpy()
plt.xlim(1.2 * embedding_matrix.min(), 1.2 * embedding_matrix.max())
plt.ylim(1.2 * embedding_matrix.min(), 1.2 * embedding_matrix.max())
plt.show() 

**Your Turn:** You have seen the output by running it once. Now retrain the `embeddingNet` model and run the above code block again. Then answer the following question.

**Your Turn (Question 2):** What does the output look like after retraining the model?

_Choose from: It looks the same as before, It looks better than previous one, It looks worse than the previous one._

In this task, you need to retrain the `embeddingNet` model again, but this time with only **10 epochs**.  Change the code accordingly, then train the model, generate the graph and answer the following question.

**Your Turn (Question 3):** What does the output look like after retraining the model with only 10 epochs?

_Choose from: It looks the same as before, It looks better than previous one, It looks worse than the previous one._

**Your Turn (Question 4):** Explain the reason for your answer of Q03.

_Replace with your answer_

## 5 Image Classification with CNN



Next, we will implement a CNN using PyTorch. We start by importing the PyTorch libraries. We copied the code here so that you can start the post-class notebook without running the code above.



In [0]:
### Load the pytorch libraries
import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as transforms
from torch.autograd import Variable

# Load the other necessary libraries
import matplotlib.pyplot as plt
import numpy as np
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)

### .a Gathering and Loading Data

As with most Machine Learning projects, a minority of the code you end up writing has to do with actual statistics; most is spent on gathering, cleaning, and readying your data for analysis. CNNs in PyTorch are no exception.

PyTorch's companion `torchvision` package makes it easy to download and use image datasets for CNNs. To stick with convention and benchmark accurately, we’ll use the **CIFAR-10 dataset**. CIFAR-10 contains images of 10 different classes, and is a standard benchmark dataset for building CNNs. 

The first step to get our data is to use PyTorch to download it. This may take a few minutes to finish. 



In [0]:
## The compose function allows for multiple transforms
#
# transforms.ToTensor() converts our PILImage to a tensor of shape (C x H x W) in the range [0,1]
# transforms.Normalize(mean,std) normalizes a tensor to a (mean, std) for (R, G, B)

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_set = torchvision.datasets.CIFAR10(root='./cifardata', train=True, download=True, transform=transform)

test_set = torchvision.datasets.CIFAR10(root='./cifardata', train=False, download=True, transform=transform)
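
The effect of `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` can be reproduced by hand. The sketch below uses a random tensor as a stand-in for a real image (an assumption, so that nothing needs to be downloaded) and applies the per-channel formula `(x - mean) / std`.

```python
import torch

torch.manual_seed(0)

# Stand-in for the output of ToTensor(): a 3 x 32 x 32 image with values in [0, 1]
img = torch.rand(3, 32, 32)

mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)

# Normalize(mean, std) computes (x - mean) / std for each channel
normalized = (img - mean) / std

# Undo the normalization: x * std + mean, i.e. x / 2 + 0.5 for these values
restored = normalized / 2 + 0.5

print(normalized.min().item(), normalized.max().item())
```

With a mean and standard deviation of 0.5, pixel values move from the range [0, 1] to [-1, 1], and dividing by 2 and adding 0.5 maps them back for display.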

We then designate the 10 possible labels for each image:

In [0]:
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

The final step of data preparation is to define **data samplers** for our images. Data samplers are a very useful tool in PyTorch that help us split the available examples into training, validation, and test sets when we train our model later on. 

In [0]:
from torch.utils.data.sampler import SubsetRandomSampler

# Training
n_training_samples = 20000
train_sampler = SubsetRandomSampler(np.arange(n_training_samples, dtype=np.int64))

# Validation
n_val_samples = 5000
val_sampler = SubsetRandomSampler(np.arange(n_training_samples, n_training_samples + n_val_samples, dtype=np.int64))

# Test
n_test_samples = 5000
test_sampler = SubsetRandomSampler(np.arange(n_test_samples, dtype=np.int64))
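
To check how `SubsetRandomSampler` restricts a loader to a range of indices, the sketch below uses a toy dataset of 10 items whose only feature is the item's own index. The dataset, the split sizes, and the helper `seen_indices` are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

torch.manual_seed(0)

# Toy dataset of 10 items whose "feature" is just the item's own index
dataset = TensorDataset(torch.arange(10))

train_sampler = SubsetRandomSampler(list(range(0, 6)))    # indices 0..5
val_sampler = SubsetRandomSampler(list(range(6, 10)))     # indices 6..9

def seen_indices(sampler):
    # Collect every item the loader yields when driven by this sampler
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    return sorted(int(x) for (batch,) in loader for x in batch)

train_seen = seen_indices(train_sampler)
val_seen = seen_indices(val_sampler)
print(train_seen, val_seen)
```

Each loader visits its indices in random order but never leaves its own range, so two samplers built over disjoint ranges of the same dataset give non-overlapping splits.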

OK, now let us visualize some of the data samples in the CIFAR-10 dataset by running the following code:

In [0]:
# functions to show an image
def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))

# get some random training images
train_loader = torch.utils.data.DataLoader(train_set, batch_size=16, sampler=train_sampler)
dataiter = iter(train_loader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))

### .b Designing a Neural Net in Pytorch

PyTorch makes it pretty easy to implement the key components of a CNN (they are reviewed in the Readings section below). We’ll be making use of 4 major building blocks in our CNN class: 

- `torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)` – applies convolution

- `torch.nn.functional.relu(x)` – applies ReLU

- `torch.nn.MaxPool2d(kernel_size, stride, padding)` – applies Max Pooling

- `torch.nn.Linear(in_features, out_features)` – fully connected layer (multiply inputs by learned weights)

Now that we have all the components at hand, let's begin to write our CNN code in PyTorch. We’ll create a `SimpleCNN` class which inherits from the master `torch.nn.Module` class. 



In [0]:
from torch.autograd import Variable
import torch.nn.functional as F

class SimpleCNN(torch.nn.Module):
    
    # Our batch shape for input x is (3, 32, 32)
    
    def __init__(self):
        super(SimpleCNN, self).__init__()
        
        # Input channels = 3, output channels = 18
        self.conv1 = torch.nn.Conv2d(3, 18, kernel_size=3, stride=1, padding=1)
        self.pool = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        
        # 4608 input features, 64 output features (see sizing flow below)
        self.fc1 = torch.nn.Linear(18 * 16 * 16, 64)
        
        # 64 input features, 10 output features for our 10 defined classes
        self.fc2 = torch.nn.Linear(64, 10)
        
    def forward(self, x):
        
        # Computes the activation of the first convolution
        # Size changes from (3, 32, 32) to (18, 32, 32)
        x = F.relu(self.conv1(x))
        
        # Size changes from (18, 32, 32) to (18, 16, 16)
        x = self.pool(x)
        
        # Reshape data to input to the input layer of the neural net
        # Size changes from (18, 16, 16) to (1, 4608)
        # Recall that the -1 infers this dimension from the other given dimension
        x = x.view(-1, 18 * 16 *16)
        
        # Computes the activation of the first fully connected layer
        # Size changes from (1, 4608) to (1, 64)
        x = F.relu(self.fc1(x))
        
        # Computes the second fully connected layer (activation applied later)
        # Size changes from (1, 64) to (1, 10)
        x = self.fc2(x)
        return(x)
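
The size comments in the class leave out the batch dimension. A quick way to confirm the shapes is to push a dummy batch through the same layers, as in this sketch; the batch of 4 random images is an assumption used only for the shape check.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Dummy batch of 4 RGB images at CIFAR-10 resolution (random values, not real data)
x = torch.randn(4, 3, 32, 32)

conv1 = torch.nn.Conv2d(3, 18, kernel_size=3, stride=1, padding=1)
pool = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

after_conv = F.relu(conv1(x))
after_pool = pool(after_conv)
flattened = after_pool.view(-1, 18 * 16 * 16)

print(after_conv.shape)   # torch.Size([4, 18, 32, 32])
print(after_pool.shape)   # torch.Size([4, 18, 16, 16])
print(flattened.shape)    # torch.Size([4, 4608])
```

The leading dimension is always the batch size, and the `-1` in `view` is what lets it be inferred automatically.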

Let’s explain what’s going on here. We’re creating a `SimpleCNN` class with a constructor and one method, `forward()`. The constructor defines the layers of the net as attributes when an instance of `SimpleCNN` is created. The `forward()` method computes a forward pass of the CNN, which consists of the four steps listed above, by calling these layers in order.

One of the pesky parts about manually defining neural nets is that we need to specify the sizes of inputs and outputs at each part of the process. The comments should give some direction as to what’s happening with size changes at each step. 

You can see that in the above code, the input dimension is **(3, 32, 32)**, which means our input image has $3$ input channels, and each channel has a resolution of $32 \times 32$. The number of output channels is set to $18$, meaning that we have $18$ different kernels, and the `kernel size` is set to $3$, which means the size of each kernel is $3\times 3$. And for the convolution operation, we set `stride = 1` and `padding = 1`. Under this setting, we get an output tensor with shape **(18, 32, 32)** after the convolution.  



**Your Turn (Question 5):** What is the output dimension after convolution if we change the input dimension to $(6, 16, 16)$, with other settings remaining the same?

_Choose from: $(6,16,16), (6,15,15), (18,16,16), (18,15,15)$_

After the convolution, we send the output tensor (shape is **(18, 32, 32)**) to a max pooling layer, in which the parameters are set to: `kernel_size = 2`, `stride = 2`, and `padding = 0`. This results in a tensor with shape **(18, 16, 16)**. 
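
The spatial sizes quoted above follow from the usual output-size rule for convolution and pooling layers, `floor((size + 2 * padding - kernel_size) / stride) + 1`. The helper below is a small sketch of that rule; the name `output_size` is made up for this example, and it assumes square inputs and a dilation of 1.

```python
def output_size(size, kernel_size, stride, padding):
    # floor((size + 2 * padding - kernel_size) / stride) + 1
    return (size + 2 * padding - kernel_size) // stride + 1

# Convolution from the text: 32 x 32 input, 3 x 3 kernel, stride 1, padding 1
conv_out = output_size(32, kernel_size=3, stride=1, padding=1)

# Max pooling from the text: 2 x 2 window, stride 2, no padding
pool_out = output_size(conv_out, kernel_size=2, stride=2, padding=0)

print(conv_out, pool_out)   # 32 16
```

The number of channels is not given by this rule: a convolution outputs as many channels as it has kernels, while pooling keeps the channel count unchanged.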

**Your Turn (Question 6):** What is the output dimension after max pooling from input $(18,32,32)$ if we change the kernel size to $4$, and stride to $4$, with other settings remaining the same?

_Choose from: $(9,8,8), (9,4,4), (18,8,8), (18,4,4)$_

### .c Training our CNN in Pytorch



Once we’ve defined the class for our CNN, we need to train the net itself. This is where neural network code gets interesting. If you’re working with more basic types of machine learning algorithms, you can usually get meaningful output in just a few lines of code. With neural networks in PyTorch (and TensorFlow), though, it takes quite a bit more code than that. Our basic flow is **a training loop**: each time we pass through the loop (called an “epoch”), we compute a forward pass on the network and run backpropagation to adjust the weights. We’ll also record some other measurements like loss and time passed, so that we can analyze them as the net trains itself. 

To start, we’ll define our **data loaders** using the samplers we created above.


In [0]:
# DataLoader takes in a dataset and a sampler for loading (num_workers sets the number of worker subprocesses used to load data)
def get_train_loader(batch_size):
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size,
                                           sampler=train_sampler, num_workers=2)
    return(train_loader)

In [0]:
# Test and validation loaders have constant batch sizes, so we can define them directly
test_loader = torch.utils.data.DataLoader(test_set, batch_size=4, sampler=test_sampler, num_workers=2)
val_loader = torch.utils.data.DataLoader(train_set, batch_size=128, sampler=val_sampler, num_workers=2)

We’ll also define our loss and optimizer functions that the CNN will use to find the right weights. We’ll be using **Cross Entropy Loss (Log Loss)** as our loss function, which strongly penalizes high confidence in the wrong answer. The optimizer is the popular **Adam** algorithm. 

In [0]:
import torch.optim as optim

def createLossAndOptimizer(net, learning_rate=0.001):
    
    #Loss function
    loss = torch.nn.CrossEntropyLoss()
    
    #Optimizer
    optimizer = optim.Adam(net.parameters(), lr=learning_rate)
    
    return(loss, optimizer)
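
To see what "strongly penalizes high confidence in the wrong answer" means in numbers, here is a small hand-written version of the cross-entropy loss for a single example. The logits are made-up values, and `cross_entropy` is an illustrative helper, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, target):
    # Softmax over the logits, then negative log-probability of the target class
    top = max(logits)
    shifted = [z - top for z in logits]               # for numerical stability
    log_sum = math.log(sum(math.exp(z) for z in shifted))
    return -(shifted[target] - log_sum)

confident_right = cross_entropy([5.0, 0.0, 0.0], target=0)
unsure = cross_entropy([0.0, 0.0, 0.0], target=0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0], target=1)

print(confident_right, unsure, confident_wrong)
```

The loss is small when the model is confident and right, equals `log(3)` when it spreads its probability evenly over three classes, and grows large when it is confident and wrong.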

Finally, we’ll define a function to train our CNN using a simple for loop. 

In [0]:
import time

def trainNet(net, batch_size, n_epochs, learning_rate):
    
    #Print all of the hyperparameters of the training iteration:
    print("===== HYPERPARAMETERS =====")
    print("batch_size=", batch_size)
    print("epochs=", n_epochs)
    print("learning_rate=", learning_rate)
    print("=" * 30)
    
    #Get training data
    train_loader = get_train_loader(batch_size)
    n_batches = len(train_loader)
    
    # Create our loss and optimizer functions
    loss, optimizer = createLossAndOptimizer(net, learning_rate)
    
    # Time for printing
    training_start_time = time.time()
    
    # Loop for n_epochs
    for epoch in range(n_epochs):
        
        running_loss = 0.0
        print_every = n_batches // 10
        start_time = time.time()
        total_train_loss = 0
        
        for i, data in enumerate(train_loader, 0):
            
            # Get inputs
            inputs, labels = data
            
            # Set the parameter gradients to zero
            optimizer.zero_grad()
            
            # Forward pass, backward pass, optimize
            outputs = net(inputs)
            loss_size = loss(outputs, labels)
            loss_size.backward()
            optimizer.step()
            
            #Print statistics
            running_loss += loss_size.item()
            total_train_loss += loss_size.item()
            
            # Print roughly ten progress updates per epoch
            if (i + 1) % (print_every + 1) == 0:
                print("Epoch {}, {:d}% \t train_loss: {:.2f} took: {:.2f}s".format(
                        epoch+1, int(100 * (i+1) / n_batches), running_loss / (print_every + 1), time.time() - start_time))
                # Reset running loss and time
                running_loss = 0.0
                start_time = time.time()
            
        # At the end of the epoch, do a pass on the validation set
        # (no gradients are needed here, so autograd is switched off)
        total_val_loss = 0
        with torch.no_grad():
            for inputs, labels in val_loader:

                # Forward pass
                val_outputs = net(inputs)
                val_loss_size = loss(val_outputs, labels)
                total_val_loss += val_loss_size.item()
            
        print("Validation loss = {:.2f}".format(total_val_loss / len(val_loader)))
        
    print("Training finished, took {:.2f}s".format(time.time() - training_start_time))

During each epoch of training, we pass data to the model **in batches** whose size we define when we call the training loop. Each batch goes through a forward pass of the `SimpleCNN` class we’ve defined, and the running training loss is printed roughly ten times per epoch. At the end of each epoch, we also calculate the loss on our validation set. 

To actually train the net now only requires two lines of code: 

In [0]:
CNN = SimpleCNN()
trainNet(CNN, batch_size=32, n_epochs=5, learning_rate=0.001)

### .d Evaluation 

After training, let's see if our CNN works by evaluating its classification accuracy on the test data:

In [0]:
correct = 0
total = 0
# Gradients are not needed for evaluation
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = CNN(images)
        # The predicted class is the index of the largest output score
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the test images: %f %%' % (100 * correct / total))
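
When tuning, a single overall accuracy can hide which classes the network struggles with. The sketch below shows one possible way to break accuracy down per class; the `labels` and `predicted` tensors are made-up values for a 3-class problem, not CIFAR-10 results.

```python
import torch

# Made-up predictions and labels for a 3-class problem (illustration only)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 2])
predicted = torch.tensor([0, 1, 1, 1, 2, 0, 2])

n_classes = 3

# Count how many examples each class has, and how many of them were predicted correctly
per_class_total = torch.bincount(labels, minlength=n_classes)
per_class_correct = torch.bincount(labels[predicted == labels], minlength=n_classes)
per_class_acc = per_class_correct.float() / per_class_total.float()

print(per_class_acc)
print((predicted == labels).float().mean().item())
```

In the real evaluation loop, the same counts could be accumulated batch by batch, assuming every class appears at least once in the test data.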

OK, maybe you were expecting a better classification accuracy. Improving it is your task: try tuning the network structure or the hyperparameters of the simple CNN to get a better classification accuracy. We know that tuning a neural network can be tedious, but it is something every beginner has to do, and it is also a good way to understand CNNs better. 

After you finish, please record your best accuracy achieved and its corresponding network and parameter settings. 

**Your Turn (Question 7):** What is the best classification accuracy you achieved?

_Replace with your answer_

**Your Turn (Question 8):**  Please describe the network structure and parameter settings for your best experiment.

_Replace with your answer_

---
## 6 Readings - for better understanding



In this section, we review what a Convolutional Neural Network (CNN) is and how to implement one in PyTorch. 

As there are lots of great CNN tutorials on the Internet, we don't need to write a fresh version again. Most of the content of this post-class notebook is adapted from the blog post [https://blog.algorithmia.com/convolutional-neural-nets-in-pytorch/](https://blog.algorithmia.com/convolutional-neural-nets-in-pytorch/), which is very good for beginners learning CNNs with PyTorch, and we highly recommend it.  We modified some parts to add more detailed explanations and some exercises for you. 

OK, let's get started.  

### What is a Convolutional Neural Network (CNN)?

Convolutional Neural Networks (ConvNets or CNNs) are a category of neural networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs apart from powering vision in robots and self driving cars. 

I think you already learned how CNNs work after watching the post-class videos. So here I just do a quick recap of the key components of a CNN:

- Convolutional Layer
- ReLU Layer
- Max Pooling
- Fully Connected Layers

If you are still confused about some of the above concepts, don't worry!  We recommend going through the Stanford CS231n lecture notes, which are widely considered one of the best CNN tutorials by the community. The link is as follows:   

[http://cs231n.github.io/convolutional-networks/](http://cs231n.github.io/convolutional-networks/)



#### .a Convolution Layer



The CNN gets its name from the process of convolution. Think of convolution as applying a **filter** to our image. We slide a small matrix of weights, usually called a kernel, over the image and output the filtered result for each position it covers.

<div align="center">
<img src="http://deeplearning.stanford.edu/wiki/images/6/6c/Convolution_schematic.gif" width=400>
<p> Figure 1. The convolution operation. </p>
</div>

(Diagram credit:  [https://blog.algorithmia.com/convolutional-neural-nets-in-pytorch](https://blog.algorithmia.com/convolutional-neural-nets-in-pytorch))

Since an image is just a bunch of pixel values, in practice this means multiplying small parts of our input images by the filter. There are a few parameters that get adjusted here: 

 - **Kernel Size** – the size of the filter.
 - **Kernel Type** – the values of the actual filter. Some examples include *identity*, *edge detection*, and *sharpen*.
 - **Stride** – the rate at which the kernel passes over the input image. A stride of 2 moves the kernel in 2 pixel increments.
 - **Padding** – we can add layers of 0s to the outside of the image in order to make sure that the kernel properly passes over the edges of the image.
 - **Output Channels** – how many different kernels are applied to the image.

The output of the convolution process is called the “convolved feature” or “feature map.” Remember: it’s just a filtered version of our original image where we multiplied some pixels by some numbers.

The resulting feature map can be viewed as a representation of the input image that’s more informative to the eventual neural network that the image will be passed through. In practice, convolution combined with the next two steps has been shown to greatly increase the accuracy of neural networks on images.

You’ll see the convolution step through the use of the `torch.nn.Conv2d()` function in Pytorch.
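
The sliding-window computation described above can be written out in a few lines of plain Python. This is a minimal sketch for square inputs without padding; the function name `convolve2d`, the small binary image, and the kernel are chosen for illustration only.

```python
def convolve2d(image, kernel, stride=1):
    # "Valid" convolution (no padding). Like most deep learning libraries,
    # this is technically cross-correlation: the kernel is not flipped.
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    out = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = 0
            # Multiply the kernel with the patch under it and sum the products
            for i in range(k):
                for j in range(k):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)   # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 5 x 5 image and a 3 x 3 kernel with stride 1 give a 3 x 3 feature map. In a CNN the kernel values are not fixed by hand like this; they are learned during training.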

#### .b ReLU



Since the convolution operation is essentially a linear function, a CNN also needs a nonlinear function to help approximate non-linear relationships in the underlying data. 

The function most popular with CNNs is the ReLU function and it’s extremely simple. We already introduced ReLU to you in the Week 8 post-class notebook. Here for a quick recap, I give the formula of the ReLU function again, which simply converts all negative values to $0$. 

$$
R(z) = \max(0, z)
$$

We also show the same figure from Week 8 again to help you see the difference between the Sigmoid and ReLU functions. 

<div align="center">
<img src="https://miro.medium.com/max/1452/1*XxxiA0jJvPrHEJHD4z893g.png" width=400>
<p> Figure 2. The Sigmoid and ReLU Function.  </p>
</div>

 (Diagram Credit: From [https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6))

There are other functions that can be used to add non-linearity, like *tanh* or *softmax*. But in CNNs, ReLU is the most commonly used.

You’ll see the ReLU step through the use of the `torch.nn.functional.relu()` function (imported as `F.relu()`) in PyTorch. 
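
For a quick numerical feel of the two activation functions, the sketch below evaluates ReLU and Sigmoid on a handful of arbitrary inputs using plain Python.

```python
import math

def relu(z):
    # R(z) = max(0, z)
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

values = [-2.0, -0.5, 0.0, 0.5, 2.0]       # arbitrary sample inputs
relu_out = [relu(z) for z in values]
sigmoid_out = [sigmoid(z) for z in values]

print(relu_out)      # [0.0, 0.0, 0.0, 0.5, 2.0]
print(sigmoid_out)
```

ReLU zeroes out negative inputs and passes positive ones through unchanged, whereas Sigmoid squashes every input into the interval (0, 1).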

#### .c Max Pooling



Another important part in CNNs is pooling, and the name describes it pretty well: we pass over sections of our image and pool them into the highest value in the section. Depending on the size of the pool, this can greatly reduce the size of the feature set that we pass into the neural network. This graphic from Stanford’s course page visualizes it simply: 

<div align="center">
<img src="https://blog.algorithmia.com/wp-content/uploads/2018/03/word-image-5.png" width=400>
<p> Figure 3. The max pooling operation.  </p>
</div>

 (Diagram Credit: From [https://blog.algorithmia.com/convolutional-neural-nets-in-pytorch/](https://blog.algorithmia.com/convolutional-neural-nets-in-pytorch/))
 
Max pooling also has a few of the same parameters as convolution that can be adjusted, like **stride** and **padding**. There are also other types of pooling that can be applied, like **sum pooling** or **average pooling**.

You’ll see the Max Pooling step through the use of the `torch.nn.MaxPool2d()` function in Pytorch. 
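
Max pooling is simple enough to write by hand. The following sketch applies a 2 x 2 window with stride 2 to a small made-up feature map; `max_pool2d` here is an illustrative helper for square inputs without padding, not the PyTorch layer.

```python
def max_pool2d(feature_map, kernel_size=2, stride=2):
    out_size = (len(feature_map) - kernel_size) // stride + 1
    return [
        [
            # Keep only the largest value inside each window
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(kernel_size)
                for j in range(kernel_size)
            )
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]

feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool2d(feature_map)
print(pooled)   # [[6, 8], [3, 4]]
```

The 4 x 4 input shrinks to 2 x 2, so the layers that follow have a quarter as many values to process.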

#### .d Fully Connected Layers



After the above steps are applied (convolution, ReLU, and max-pooling), the resulting feature maps are flattened and passed into a traditional neural network architecture. Designing the optimal neural network is beyond the scope of this notebook, and we’ll be using a simple 2-layer format with one hidden layer and one output layer. 

This part of the CNN is almost identical to any other standard neural network. The key to understanding CNNs is this: the driver of better accuracy is the steps we take to engineer better features, not the classifier we end up passing those values through. Convolution, ReLU, and max pooling prepare our data for the neural network in a way that extracts all the useful information they have in an efficient manner. 

You’ll see the forward pass step through the use of the `torch.nn.Linear()` function in Pytorch.

---
# Credits
Authored by Martin Strobel, Xu Ziwei, [Liangming Pan](https://www.liangmingpan.com), [Min-Yen Kan](http://www.comp.nus.edu.sg/~kanmy) (2019), affiliated with [WING](http://wing.comp.nus.edu.sg), [NUS School of Computing](http://www.comp.nus.edu.sg) and [ALSET](http://www.nus.edu.sg/alset).
Licensed as: [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/ ) (CC BY 4.0).
Please retain and add to this credits cell if using this material as a whole or in part.
