# [CS Series] Lecture 10: Neural Networks

## May 8, 2022

### Hosted and maintained by the [Student Association for Applied Statistics (SAAS)](https://saas.berkeley.edu).
Created by Ritvik Iyer and Akhil Vemuri

### Table of Contents
1. [Gradient Descent](#grad_descent)
2. [What are Neural Nets?](#what_are_nn)
    1. [Neural net examples](#nn_examples)
    2. [Neural network inspiration](#neuroscience)
3. [The Single Neuron](#single_neuron)
4. [Activation Functions](#activation_fn)
5. [Linear Regression Connection](#lr-reg)
6. [Perceptrons](#perceptrons)
    1. [Single-layer Perceptrons](#slp)
    2. [Multi-layer Perceptrons](#mlp)
7. [Training a MLP on the Titanic Dataset](#titanic)  
8. [Summary](#summary)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
plt.rcParams["figure.figsize"] = [15.0, 10.0]

<img src="neural_nets_images/AIsubsets.png" alt="Drawing" style="width: 500px; height: 380px"/>

<img src="neural_nets_images/pikachu.jpeg" alt="Drawing" style="width: 300px; height: 250px"/>

<a id='grad_descent'></a>
## Gradient Descent 

In many machine learning contexts, the objective is to minimize a *loss function*, which tells us the error in our predictions based on the parameters we chose. In linear regression, the loss function was mean squared error. There are many other types of loss functions, such as cross entropy loss, zero-one loss, and more. Regardless of which loss function we choose, the objective is always to minimize it. Finding the minimum of a function can be exceptionally difficult, especially if the loss function is nonconvex. To make matters worse, we likely do not even know what the loss landscape looks like with respect to the model's parameters. For instance, imagine the difference in difficulty between finding the minimum of a parabola and finding the minimum of the landscape shown below:

![Picture title](neural_nets_images/image-20210418-121923.png)

There may not even be a nice expression for the above function, so how do we even try to find the minimum? Enter gradient descent. The idea behind gradient descent is we take steps in the direction of steepest descent to land in a local minimum. Conveniently, the gradient of a function gives us the direction of steepest ascent, so we can just move in the opposite direction. More specifically:

1) We calculate the gradient of the loss function with respect to the model parameters, $\nabla_\theta \mathcal{L}(\theta)$.

2) We subtract the gradient, scaled by a step size $\alpha$ of our choice (the *learning rate*), from the current model parameters. This is like taking a step in the direction opposite the gradient, with $\alpha$ controlling how far we move.

3) Repeat 1) and 2) until convergence.

In math:

1) Calculate $\nabla_\theta \mathcal{L}(\theta)$

2) $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta)$

3) Repeat 1) and 2) until convergence.
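
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production optimizer: the loss function, the starting point, and the choice to run a fixed number of steps (instead of checking for convergence) are all assumptions made for the example.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha=0.1, n_steps=200):
    """Repeat theta <- theta - alpha * grad_fn(theta) for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

# Hypothetical convex loss: L(theta) = (theta_0 - 3)^2 + 2 * (theta_1 + 1)^2,
# whose unique minimum is at theta = (3, -1).
def loss_grad(theta):
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)])

theta_hat = gradient_descent(loss_grad, theta0=[0.0, 0.0], alpha=0.1, n_steps=200)
print(theta_hat)  # very close to [3, -1]
```

Note that `gradient_descent` only ever calls `grad_fn`; it never needs a formula for the minimum itself.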

To demonstrate the power of gradient descent, we could use it to minimize linear regression loss, LASSO loss, ridge regression loss, or any other convex function, without having to know a closed-form solution (although it may take a bit longer). Essentially, you can use it as a black-box solver for functions with no spurious local minima, no saddle points, and no plateaus, and you can expect to find the global optimum fairly easily. There are several optimizers that are better and more widely used in practice, but they all draw heavily from the basic principles of gradient descent. Examples are stochastic gradient descent, AdaGrad, and Adam.

#### GD Questions

1) Does gradient descent guarantee that we will find the global optimum for every function?

2) What happens when $\alpha$ is large? What happens when $\alpha$ is very close to $0$? 

<a id='what_are_nn'></a>

## What are Neural Networks?

Neural networks derive their name from the neural network we have in our head: our brain. In a highly simplified model, the brain is a collection of neurons. Each neuron receives electrical input signals through its dendrites and sends an electrical output signal along a single axon, which connects to the dendrites of other neurons.

<img src="neural_nets_images/neuron_connection.gif" alt="Drawing" style="width: 300px; height: 250px"/>

<a id='nn_examples'></a>

### Neural Networks -  Examples

#### DALL-E - Creating Images from Text 

DALL-E is a neural network trained using image and text pairs in order to generate images from text descriptions. You can read more about it [here](https://openai.com/blog/dall-e/). Here is an example below of the images DALL-E created from the input text "an armchair in the shape of an avocado."

<img src="neural_nets_images/DallE_example.png" alt="Drawing" style="width: 800px; height: 600px"/>

<img src="neural_nets_images/dalle.png" alt="Drawing" style="width: 500px; height: 290px"/>

#### GPT-3 - Language Generation 

Generative Pre-trained Transformer 3 is an autoregressive language model that uses deep learning to produce human-like text. You can see some examples of how it is being used in real products [here](https://openai.com/blog/gpt-3-apps/). Below is an example of a conversation with a GPT-3 bot. 

<img src="neural_nets_images/gpt3exampl.png" alt="Drawing" style="width: 800px; height: 400px"/>

<a id='neuroscience'></a>

### Neural Nets Were (Predictably) Inspired by Neuroscience

Harvard neurophysiologists David H. Hubel and Torsten Wiesel recorded electrical brain activity from individual neurons in the brains of cats. They showed that some neurons are activated only when exposed to certain visual stimuli. When shown an animation of a rotating line, some neurons increased activity the more vertical the line got. They found that some neurons are specialized to detect certain visual cues, and that a combination of different neurons is necessary to comprehensively track a stimulus. 

This idea was adopted into the realm of neural nets, where multiple 'neurons' are strung together, often in a sequential order, to learn the label of their input. 

<img src="neural_nets_images/hubel_wiesel_cat.png" alt="Drawing" style="width: 400px; height: 250px"/>

<a id='single_neuron'></a>
## The Single Neuron

Every neural network is composed of many neurons connected by weights. The neuron is the basic building block of the network, and each neuron accomplishes the following tasks:

1. Receives information from the neurons in the previous layer, each multiplied by the weight on the connection pointing to this neuron.
2. Applies the activation function, denoted by $f$, to the sum of those weighted inputs.
3. Returns the result as its output, which is passed on to the next layer.

<img src="neural_nets_images/biological_neuron.jpeg" alt="Drawing" style="width: 400px; height: 250px"/>

The diagram above shows a single neuron taking in inputs from other neurons $x_0$, $x_1$ and $x_2$ with weights $w_0$, $w_1$ and $w_2$, respectively. The neuron's output travels along the output axon. The function $f$ is the activation function. The purpose of the activation function is to introduce non-linearity into the output of a neuron. This is necessary because real-world data is rarely linear, so a model built only from linear functions tends to fit it poorly. Greater accuracy requires neurons to learn nonlinear representations. We'll learn more about activation functions next.
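
As a rough sketch of the computation described above, a single neuron is a weighted sum followed by an activation. The numeric values, the bias term `b`, and the choice of ReLU for $f$ are assumptions made for illustration; they are not read off the diagram.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation f."""
    z = np.dot(w, x) + b
    return activation(z)

x = np.array([1.0, 2.0, 3.0])    # inputs x_0, x_1, x_2 (made-up values)
w = np.array([0.5, -1.0, 1.0])   # weights w_0, w_1, w_2 (made-up values)
b = 0.1                          # bias (an assumption for this sketch)

out = neuron(x, w, b)
print(out)  # the weighted sum is 1.6, which ReLU passes through unchanged
```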

<a id='activation_fn'></a>

## Activation Functions

#### What is an Activation Function? 

The purpose of an activation function is to help the network learn complex patterns in our data. An activation function introduces non-linear operations in the neural network. Without them, the network could only generate predictions from a series of (linear) matrix multiplications. 

<img src="neural_nets_images/ReLU.png" alt="Drawing" style="width: 500px; height: 300px"/>

An example of an activation function is the Rectified Linear Unit, or ReLU for short. As we see above, this is a non-linear function defined as: $f(x) = \max(0, x)$. 

$\textit{Sigmoid}$ is useful if you want outputs strictly between 0 and 1, for example to represent a probability. However, it has fallen out of favor for hidden layers because it causes gradients to vanish: when a neuron's activation saturates close to 0 or 1, the gradient is very close to 0, so during backpropagation the signal is lost. Also, because its output is not centered at 0, gradient updates are more likely to zig-zag, which slows down learning.

<img src="neural_nets_images/sigmoid.png" alt="Graph" style="width: 300px; height: 250px"/>

$\textit{Tanh}$ has an advantage over sigmoid: it is centered at zero and can output negative numbers. In practice, $\textit{Tanh}$ is usually preferred over sigmoid, but it still suffers from the vanishing gradient problem when $x$ becomes very large or very negative.

<img src="neural_nets_images/tanh.png" alt="Graph" style="width: 300px; height: 250px"/>
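
To make the saturation argument concrete, here is a small NumPy sketch of the three activation functions and the derivatives of sigmoid and tanh. The input values are arbitrary, chosen only to show one point near zero and two points far from it.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # derivative of sigmoid

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # derivative of tanh

xs = np.array([-10.0, 0.0, 10.0])
print(relu(xs))          # negative inputs are clipped to 0
print(sigmoid(xs))       # always strictly between 0 and 1
print(sigmoid_grad(xs))  # 0.25 at x = 0, nearly 0 at x = -10 and x = 10
print(tanh_grad(xs))     # 1 at x = 0, nearly 0 at x = -10 and x = 10
```

The near-zero derivatives at the two extremes are the "vanishing gradients": a saturated neuron passes almost no learning signal backwards.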

<a id='lr-reg'></a>
## Linear Regression Connection

Linear regression is one of the most basic forms of machine learning we have learned. It fits the best linear model to a set of data points in order to determine a "line of best fit". But the simple linear regression model you've learned is actually a neural network in disguise!

<div> <img src="https://joshuagoings.com/assets/linear.png" width="500"/> </div>

We can think of linear regression models as neural networks consisting of just a single artificial neuron, or as single-layer neural networks. Since for linear regression, every input is connected to every output (in this case there is only one output), we can regard this transformation as a *fully-connected layer*, or *dense layer*. We will talk more about single and multi-layer networks later.

A lot of neural network operations are linear as well, so there's no need to be alarmed by their seemingly complex nature. Each output node is simply a linear combination of weights and inputs, and there are potentially multiple output nodes. That's all there is to it! The only place where non-linearity is introduced is the activation function, and its only job is to let the network fit non-linear trends in the data. The similarities between OLS and neural nets are striking!
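
One way to see the connection is to train a single neuron with an identity activation by gradient descent on mean squared error and compare it with the closed-form least squares fit. The sketch below uses synthetic data invented for this example; the learning rate and number of steps are likewise arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): y = 2*x1 - 3*x2 + 1 + small noise
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 1.0 + 0.01 * rng.normal(size=200)

# Closed-form least squares, with a column of ones appended for the intercept
X1 = np.column_stack([X, np.ones(len(X))])
ols_params, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The same model as one neuron with identity activation, trained by gradient descent
w = np.zeros(2)
b = 0.0
alpha = 0.1
for _ in range(2000):
    pred = X @ w + b                      # forward pass: weighted sum, no non-linearity
    err = pred - y
    grad_w = 2.0 * (X.T @ err) / len(X)   # d(MSE)/dw
    grad_b = 2.0 * err.mean()             # d(MSE)/db
    w -= alpha * grad_w
    b -= alpha * grad_b

print(ols_params)  # [slope_1, slope_2, intercept] from least squares
print(w, b)        # the neuron's weights and bias after training
```

Both approaches minimize the same loss, so they end up at essentially the same parameters.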

<a id='perceptrons'></a>

## Perceptrons

<a id='slp'></a>

### Single Layer Perceptrons

Remember the image of the neuron above? The combination of the input, the neuron, and the output was actually the simplest example of a neural network, called a single-layer perceptron (SLP). An SLP is an algorithm for learning a binary classifier - that is, it classifies data into one of two classes based on some linear decision boundary.
The architecture of the SLP consists of:

1. **Input**: The SLP takes in multiple input values
2. **Summation**: Each input value is multiplied by a weight (which is learned by the neural net), and summed together along with a bias term. 
3. **Activation function and output**: If the summation is less than some pre-defined threshold, we classify it as class 0. Otherwise, we label it as class 1. 

<img src="neural_nets_images/slp.png" alt="Graph"/>

#### Training a SLP
During training, we present each datapoint $x$ and its label $y$ to the SLP. We feed each training datapoint through the network and calculate the error in our prediction. Then, we use this error to update the weights and obtain a more accurate decision boundary. We repeat this process until the training error is 0 (which is only possible when the data is linearly separable) or until we reach a fixed number of passes over the data.
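
The training loop described above can be sketched with the classic perceptron update rule. This is a minimal illustration: the toy dataset (logical AND), the learning rate, the threshold of 0, and the function name `train_perceptron` are assumptions made for the example.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Classic perceptron rule: adjust the weights only on misclassified points."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b >= 0 else 0   # threshold the weighted sum at 0
            update = lr * (yi - pred)                    # 0 when the prediction is right
            if update != 0:
                w += update * xi
                b += update
                errors += 1
        if errors == 0:   # every training point is classified correctly
            break
    return w, b

# Toy linearly separable data: the logical AND of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
preds = (X @ w + b >= 0).astype(int)
print(preds)  # [0 0 0 1]
```

The `max_epochs` cap matters: without it the loop would never stop on data that no straight line can separate.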

[Here](https://owenshen24.github.io/perceptron/) is an animated example of how a perceptron learns its decision boundary.

In this example, the decision boundary learned by the perceptron shifts as more data is added.
<img src="neural_nets_images/Perceptron_example.png" alt="Graph">

**Exercise**: What happens if we train an SLP on non-linearly separable data?

Because the decision boundary is linear (it takes the form $w_1x_1 + w_2x_2 + \dots + w_nx_n + b = 0$), an SLP can only correctly classify linearly separable data. Our solution is to connect multiple perceptrons together to create a **multi-layer perceptron (MLP)**. 

<a id='mlp'></a>

### Multi-layer Perceptron

<img src="neural_nets_images/mlp.png" alt="mlp" style="width: 400px; height: 250px"/>

An MLP differs from an SLP because, as its name suggests, it has multiple layers of neurons. The architecture of the MLP consists of:

1. **Input**: The MLP takes in multiple input values (same as SLP)
2. **Multiple layers**: MLPs have multiple layers of neurons. MLPs are also **fully connected**, which means the output of each input neuron gets fed into the input of each neuron in the next layer. Each input value is multiplied by a weight (which is learned by the neural net), and summed together along with a bias term (same as SLP). 
3. **Activation function**: MLPs more often than not have nonlinear activation functions, like the sigmoid or ReLU functions above. This is because a stack of layers with linear activation functions collapses into a single linear map, so the MLP would be no more expressive than a simple SLP. 
4. **Output layer**: The output of a MLP doesn't have to take the binary form an SLP does. MLPs can output one single value or a vector of values depending on the format of the problem. For example, an MLP for a multi-class classification problem can output multiple values, each corresponding to the probability of the input being a specific class.
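
The claim about linear activations in item 3 can be checked numerically. In the sketch below, the layer sizes and random weights are arbitrary choices for illustration; the point is only that two linear layers applied in sequence equal one linear layer with weights $W_2 W_1$ and bias $W_2 b_1 + b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked layers with NO non-linearity in between (sizes are arbitrary)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_linear_layers(x):
    return W2 @ (W1 @ x + b1) + b2

# The same map written as ONE linear layer
W = W2 @ W1
b = W2 @ b1 + b2

def one_linear_layer(x):
    return W @ x + b

x = rng.normal(size=3)
print(np.allclose(two_linear_layers(x), one_linear_layer(x)))  # True
```

Placing a non-linear activation such as ReLU between the two layers breaks this identity in general, which is what lets an MLP represent non-linear decision boundaries.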

#### Training a Multi-layer Perceptron

Training an MLP begins with a **forward pass**, in which the inputs are fed through the network. This is the same way inputs are passed into an SLP. After the forward pass is complete, we compare the output of the network to the expected output and calculate the error. This error is propagated back through the network until it reaches the first layer. This process is called **backpropagation** and is a vital step in training many machine learning models. The weights are updated using gradient descent, which is based on the error of each neuron in the network, and the process repeats for a set number of iterations until the network has (hopefully) obtained better weights.
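
Here is a minimal NumPy sketch of one forward pass, one backward pass, and one gradient descent update for a tiny MLP on a single example. The architecture (2 inputs, 3 ReLU hidden neurons, 1 sigmoid output), the binary cross entropy loss, and all numeric values are assumptions chosen for illustration. The finite-difference loop is only a sanity check that the backpropagated gradient is right; it is not part of normal training.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Forward pass: ReLU hidden layer followed by a single sigmoid output."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1              # hidden pre-activations
    h = np.maximum(0.0, z1)       # ReLU
    p = sigmoid(W2 @ h + b2)      # predicted probability of class 1
    return z1, h, p

def loss_fn(params, x, y):
    """Binary cross entropy for one example."""
    _, _, p = forward(params, x)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def backward(params, x, y):
    """Backpropagation: apply the chain rule from the output back to the first layer."""
    W1, b1, W2, b2 = params
    z1, h, p = forward(params, x)
    dz2 = p - y                   # error at the output (sigmoid + cross entropy)
    dW2 = dz2 * h
    db2 = np.array(dz2)
    dz1 = (dz2 * W2) * (z1 > 0)   # error pushed back through W2, then through the ReLU
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return [dW1, db1, dW2, db2]

# Made-up parameters and data
params = [rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=3), np.array(0.1)]
x, y = np.array([0.5, -1.2]), 1.0

grads = backward(params, x, y)

# Sanity check: compare the backprop gradient of W1 with central finite differences
eps = 1e-6
numeric_dW1 = np.zeros_like(params[0])
for i in range(3):
    for j in range(2):
        plus = [q.copy() for q in params]
        minus = [q.copy() for q in params]
        plus[0][i, j] += eps
        minus[0][i, j] -= eps
        numeric_dW1[i, j] = (loss_fn(plus, x, y) - loss_fn(minus, x, y)) / (2 * eps)

gradients_match = np.allclose(grads[0], numeric_dW1, atol=1e-6)
print(gradients_match)

# One gradient descent step using the backpropagated gradients
alpha = 0.1
params = [q - alpha * g for q, g in zip(params, grads)]
```

Frameworks such as PyTorch compute these gradients automatically, so in practice the `backward` function is rarely written by hand.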

**Exercise**: What is the difference in training a SLP vs MLP?

MLPs can be used for a variety of problems, from simple binary classification tasks to complex tasks like image recognition and speech recognition. Let's use an MLP to solve a very familiar problem: predicting whether or not someone survived on the Titanic.

<a id='titanic'></a>

## Training a MLP on the Titanic Dataset 🤠

There are many cool frameworks out there such as PyTorch, TensorFlow, and MXNet. The most popular ones are PyTorch and TensorFlow. Below we will use PyTorch, an open-source machine learning library developed by Facebook's AI Research lab. Don't worry about understanding every line of the code - focus on understanding the comments instead!

Our task today is to train a multi-layer perceptron to predict whether or not a particular passenger survived the Titanic accident.

In [None]:
#import necessary libraries
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.autograd import Variable
import torch.utils.data as data
import torch.optim as optim
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
#We read in our training and testing datapoints here.
train = pd.read_csv("titanic/train.csv")
test = pd.read_csv("titanic/test.csv")

In [None]:
# This function preprocesses the data to feed as inputs to our model for training.
def preprocess(df):
    #make a deep copy so we don't change df
    df1 = df.copy(deep=True)

    df1.drop(['Name','Ticket','Cabin','PassengerId'],axis=1,inplace=True)
    #one hot encoding
    sex = pd.get_dummies(df1['Sex'],drop_first=True)
    embark = pd.get_dummies(df1['Embarked'],drop_first=True)
    pclass = pd.get_dummies(df1['Pclass'],drop_first=True)
    df1 = pd.concat([df1,sex,embark,pclass],axis=1)

    df1.drop(['Sex','Embarked', 'Pclass'],axis=1,inplace=True)

    #to handle the test data, where we don't have the "Survived" column
    if "Survived" in df1.columns:
        y = df1.loc[:, 'Survived'].values
        del df1['Survived']
    else:
        y = []

    df1.fillna(df1.mean(),inplace=True)

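    # normalize input for NN to speed up learning and achieve faster convergence
    Scaler1 = StandardScaler()

    # cast to float and pass a plain array: the dummy columns mix string and integer names,
    # which recent scikit-learn versions reject when given a DataFrame
    df1 = pd.DataFrame(Scaler1.fit_transform(df1.astype(float).values))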


    X = df1.values

    return X,y

### Declaring a Dataset

In [None]:
# We will declare a TitanicDataset class to make it easier to access the items we need.
class TitanicDataset(data.Dataset):
    
    def __init__(self, df):
        self.df = df
        # Preprocess once up front instead of redoing it on every lookup.
        self.X, self.y = preprocess(df)

    def __len__(self):
        return self.df.shape[0]

    # Return the preprocessed features and label(s) at the given index or slice.
    def __getitem__(self, idx):
        X_idx = self.X[idx]
        y_idx = self.y[idx]
        return X_idx,y_idx


### Model

In [None]:
# This class represents our neural net architecture. 
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        # Our neural net will have 3 fully connected layers. Notice how the first layer has an input size of 9
        # and an output size of 512. The input size is the number of values fed to the network (one for each
        # feature column), and the output size is the number of inputs to the next fully connected layer.
        self.fc1 = nn.Linear(9, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 2)

        # We also include a dropout "layer", which during training randomly zeroes each activation with
        # probability 0.2
        self.dropout = nn.Dropout(0.2)
    
    # The forward function is the actual body of our neural net. 
    def forward(self, x):
        # Our input gets fed into the first fully connected layer and the output goes into the next layer,
        # after going through one round of dropout.
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        # We return the raw class scores (logits) without a sigmoid or softmax here,
        # because nn.CrossEntropyLoss applies log-softmax internally.
        return x
model = Net()
print(model)
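
To give a feel for what the dropout layer does, here is a rough NumPy sketch of "inverted" dropout. It is an illustration under stated assumptions, not PyTorch's implementation: the function name `dropout` and its arguments are made up for this example. Like `nn.Dropout`, it rescales the surviving values by $1/(1-p)$ during training and does nothing outside of training.

```python
import numpy as np

def dropout(x, p=0.2, training=True, rng=None):
    """Zero each element with probability p and rescale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                          # dropout is a no-op outside of training
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p       # keep each element with probability 1 - p
    return x * mask / (1.0 - p)           # rescale so the expected value is unchanged

activations = np.ones(10)
dropped = dropout(activations, p=0.2, rng=np.random.default_rng(0))
print(dropped)                                 # a mix of 0.0 and 1.25
print(dropout(activations, training=False))    # unchanged
```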

### Loss Function and Optimizer

In [None]:
# Our loss function will be cross entropy loss.
criterion = nn.CrossEntropyLoss()

# We will use stochastic gradient descent with a learning rate of 0.01 to find the optimum weights.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
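
For intuition about the loss, `nn.CrossEntropyLoss` combines a log-softmax with a negative log-likelihood, so it is designed to take raw class scores (logits) and integer class labels. The NumPy sketch below mirrors that computation on made-up scores; it is an illustration of the formula, not PyTorch's actual code.

```python
import numpy as np

def softmax(logits):
    # subtract the row-wise max for numerical stability (this does not change the result)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-probability of the correct class, computed from raw scores."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).mean()

# Hypothetical raw scores for 3 passengers and 2 classes
logits = np.array([[2.0, 0.0],
                   [0.0, 0.0],
                   [-1.0, 3.0]])
labels = np.array([0, 1, 1])

probs = softmax(logits)
print(probs.sum(axis=1))              # each row sums to 1
print(cross_entropy(logits, labels))  # small when the correct class gets the highest score
```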

### Training

In [None]:
batch_size = 64
n_epochs = 10
batch_no = len(train) // batch_size

train_loss_min = np.inf
for epoch in range(n_epochs):
    # reset the running loss so that each epoch's average is computed on its own
    train_loss = 0
    for i in range(batch_no):
        # create index to generate different training batches
        start = i*batch_size
        end = start+batch_size

        dfx, dfy = TitanicDataset(train)[start:end]
        # convert arrays into tensors
        x_var = Variable(torch.FloatTensor(dfx))
        y_var = Variable(torch.LongTensor(dfy)) 
        
        #clears old gradients from the last step
        optimizer.zero_grad()
        output = model(x_var)
        loss = criterion(output,y_var)
        #computes the derivative of the loss w.r.t. the parameters using backpropagation
        loss.backward()
        #making the optimizer to take a step based on the gradients of the parameters
        optimizer.step()
        
        #torch.max finds the maximum value within each row of the output (we are looking at each row because dim=1)
        #each row contains 2 scores, one per class (0 = did not survive, 1 = survived)
        #we are essentially choosing the class with the higher score
        #values stands for the max score in each row of the output 
        #labels stands for the corresponding label (0 or 1) associated with the max score

        values, labels = torch.max(output, 1)
        #find the number of correct predictions
        num_right = np.sum(labels.data.numpy() == dfy)
        train_loss += loss.item()*batch_size
    
    train_loss = train_loss / len(train)
    if train_loss <= train_loss_min:
        print("Training loss decreased ({:6f} ===> {:6f}). Saving the model...".format(train_loss_min,train_loss))
        torch.save(model.state_dict(), "model.pt")
        train_loss_min = train_loss
    

    if epoch % 5 == 0:
        print('')
        print("Epoch: {} \tTrain Loss: {} \tTrain Accuracy: {}".format(epoch+1, train_loss,num_right / len(dfy) ))
print('Training Ended! ')



### Inference

In [None]:
X_test, _ = TitanicDataset(test)[:]
X_test_var = Variable(torch.FloatTensor(X_test), requires_grad=False) 
#switch to evaluation mode so that dropout is turned off during inference
model.eval()
#to perform inference without gradient calculation.
with torch.no_grad():
    test_result = model(X_test_var)
values, labels = torch.max(test_result, 1)
survived = labels.data.numpy()

<a id='summary'></a>

## Summary

In this lesson, we introduced you to the machine learning topic of **neural nets**. We saw how neural nets are used in the real world and how they are inspired by the structure of the human brain. The building blocks of neural nets are neurons, which accept inputs and pass them through an activation function. Activation functions like sigmoid and ReLU help us learn complex patterns in the data. We also learned about single-layer perceptrons, which are composed of an input layer, a neuron that multiplies the inputs by weights and sums them, an activation function, and a binary output. However, they can only make correct decisions on linearly separable data. Multi-layer perceptrons have no such limitation, since they are composed of multiple SLPs and have non-linear activation functions. Finally, we ran through an example of an MLP being trained on the Titanic dataset to predict whether or not a particular passenger survived the accident.