#### Commmonly used activation functions 
* ReLU
* Sigmoid
* Tanh

##### Sigmoid (Logistic)

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>&#x03C3;<!-- σ --></mi>
  <mo stretchy="false">(</mo>
  <mi>x</mi>
  <mo stretchy="false">)</mo>
  <mo>=</mo>
  <mfrac>
    <mn>1/</mn>
    <mrow>
      <mn>(1</mn>
      <mo>+</mo>
      <msup>
        <mi>e</mi>
        <mrow class="MJX-TeXAtom-ORD">
          <mo>&#x2212;<!-- − --></mo>
          <mi>x)</mi>
        </mrow>
      </msup>
    </mrow>
  </mfrac>
</math>

Cons:
* Activation saturates at 0 or 1 with gradients ≈ 0
    * No signal to update weights → cannot learn
    * Solution: Have to carefully initialize weights to prevent this
* Outputs not centered around 0
    * If output always positive → gradients always positive or negative → bad for gradient updates
    
This causes vanishing gradients and poor learning for deep networks. This can occur when the weights of our networks are initialized poorly – with too-large negative and positive values.

##### Tanh

* tanh(x)=2σ(2x)−1
 
    A scaled sigmoid function
* Input number → [-1, 1]
* Cons:
    * Activation saturates at 0 or 1 with gradients ≈ 0
        * No signal to update weights → cannot learn
        * Solution: Have to carefully initialize weights to prevent this

##### ReLUs¶
* f(x)=max(0,x)
* Pros:
    * Accelerates convergence → train faster
    * Less computationally expensive operation compared to Sigmoid/Tanh exponentials
* Cons:
    *Many ReLU units "die" → gradients = 0 forever
        * Solution: careful learning rate choice

## importing the dataset

In [4]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as datasets

train_dataset = datasets.MNIST(root = './data',
                                   train = True,
                                   transform = transforms.ToTensor(),download = True)


test_dataset = datasets.MNIST(root = './data',
                                   train = False,
                                   transform = transforms.ToTensor(),download = True)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz


9913344it [00:28, 344804.52it/s]                             


Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Using downloaded and verified file: ./data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Using downloaded and verified file: ./data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Using downloaded and verified file: ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw

Processing...
Done!


  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)


## model parameters

In [38]:
from torch.utils.data import DataLoader
batch_size = 128
n_iters = 3000
num_epochs = int(n_iters/(len(train_dataset)/batch_size) ) + 1


train_loader = DataLoader(train_dataset,
                         shuffle = True, 
                         batch_size = batch_size)


test_loader = DataLoader(train_dataset,
                         shuffle = True, 
                         batch_size = batch_size)

## Model creation

In [6]:
input_dim = 28*28
hidden = 100
output_dim = 100

In [31]:
class Feedforwardneuralnetwork(nn.Module):
    
    def __init__(self, input_dim, hidden, output_dim):
        # as we have already told in the linear regression notebook, nn.Module is the parent class here 
        # and Feedforwardneuralnetwork is the child class that inherit from the parent class.
        
        
        # Then obvious question arise is that, we never calls the forward function although we define it
        # its because,there is a inbuilt _call_ function that calls the forward function from the parent class
        
        super(Feedforwardneuralnetwork, self).__init__()
        # first: linear function(first input layer) with relu activation
        self.fc1 = nn.Linear(input_dim, hidden)
        self.relu1 = nn.ReLU()
        
        
        # second: Linear layer
        self.fc2 = nn.Linear(hidden, hidden)
        self.relu2 = nn.ReLU()
        
        # linear fucntion layer
        self.fc3 = nn.Linear(hidden, hidden)
        self.relu3 = nn.ReLU()
        
        # output layer: linear
        self.fc4 = nn.Linear(hidden, output_dim)

    def forward(self, x,):
        # input layer
        out = self.fc1(x)
        out = self.relu1(out)
        
        # first hidden layer
        out = self.fc2(out)
        out = self.relu2(out)
        
        
        # second layer
        out = self.fc3(out)
        out = self.relu3(out)
        
        # output layer
        out = self.fc4(out)
        return out
        
model = Feedforwardneuralnetwork(input_dim, hidden, output_dim)

In [35]:
# checking initial weights shape 

print("first layer:", list(model.parameters())[0].shape)
print("second layer:", list(model.parameters())[1].shape)
print("third layer:", list(model.parameters())[2].shape)
print("fourth layer:", list(model.parameters())[3].shape)


first layer: torch.Size([100, 784])
second layer: torch.Size([100])
third layer: torch.Size([100, 100])
fourth layer: torch.Size([100])


#### instantiating the loss fucntion and the optimizer for the model

In [36]:
import torch.optim as optim
learning_rate = 0.1
optimizer = optim.SGD(model.parameters(), lr = learning_rate )
criterion = nn.CrossEntropyLoss()

#### Train the model

In [39]:
iter = 0;
for epoch in range(num_epochs):
    for i,(images, labels) in enumerate(train_loader):
        # reshaping the image
        images = images.view(-1, 28*28).requires_grad_()
        
        # clearing the optimizer gradient
        optimizer.zero_grad()
        
        # forward pass to get the output
        output = model(images)
        
        # calculate loss: softmax => cross entropy loss
        loss = criterion(output, labels)
        
        # getting gradients
        loss.backward()
        
        # updating the weights with 
        optimizer.step()
        
        iter += 1
        
        # validating the current model
        if iter%500 == 0:
            correct = 0
            total = len(labels)
            
            
            for images, labels in test_loader:
                images = images.view(-1, 28*28).requires_grad_()
                
                outputs = model.forward(images)
                
                _, predicted = torch.max(outputs, 1)
                
                total += labels.size(0)
                
                correct += (predicted == labels).sum()
                
                
            accuracy = 100*(correct/total)
            
            
#           Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))


                
                
                
        

Iteration: 500. Loss: 0.053387608379125595. Accuracy: 97.69491577148438
Iteration: 1000. Loss: 0.061938781291246414. Accuracy: 98.20050811767578
Iteration: 1500. Loss: 0.10773777216672897. Accuracy: 98.06912231445312
Iteration: 2000. Loss: 0.0716380923986435. Accuracy: 98.41504669189453
Iteration: 2500. Loss: 0.02929224818944931. Accuracy: 98.66285705566406
Iteration: 3000. Loss: 0.019005168229341507. Accuracy: 98.95722198486328
