<a href="https://colab.research.google.com/github/Sunkyoung/PyTorch-Study/blob/main/PyTorch_Study_03_PyTorch%2CMLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![picture](https://drive.google.com/uc?id=1vC0N3Obk4HZJk9JOG7fKgYE10YYlCqsg)

# Week 3: PyTorch, Logistic Regression and MLP

- We will cover the basic concepts of the PyTorch framework (tensor operations, GPU utilization, and autograd)
- We will implement simple logistic regression and multinomial logistic regression (softmax) with PyTorch
- We will use a simple linear model and a multi-layer perceptron (MLP) in this class

If you have any questions, feel free to ask
- For additional questions, post questions in classum or send emails to jihoontack@kaist.ac.kr

## Why PyTorch?

- Intuitive and concise code
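- Define-by-Run execution: the computation graph is built while the code runs (classic TensorFlow 1.x graph mode is Define-and-Run, where the graph is defined first and executed afterwards)
- High compatibility with NumPy (almost one-to-one mapping)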

![picture](https://drive.google.com/uc?id=1nAfTkF8Kp4YEI1pBeShs3L7NCPHx_iHQ)

## 0. Prelim: Load packages & GPU setup

In [1]:
# visualize current GPU usages in your server
!nvidia-smi

Thu Jan 13 10:19:30 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
# set gpu by number
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [4]:
# load packages
import torch
import numpy as np

In [5]:
# print the version of PyTorch
print(torch.__version__)

1.10.0+cu111


## 1. PyTorch and Numpy

PyTorch uses the **tensor** as its basic data structure.\
**Tensor: an n-dimensional array with GPU computation support**\
**Almost the same as a NumPy array**
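
As a minimal sketch of how closely the two structures are related, the snippet below converts between a NumPy array and a tensor. The variable names are illustrative only; the point is that `torch.from_numpy` shares memory with the source array, while `torch.tensor` makes a copy.

```python
import numpy as np
import torch

np_values = np.array([1.0, 2.0, 3.0])

# torch.from_numpy shares memory with the source array (no copy is made).
shared_tensor = torch.from_numpy(np_values)
# torch.tensor copies the data into a new tensor.
copied_tensor = torch.tensor(np_values)

np_values[0] = 100.0  # in-place change on the NumPy side

print(shared_tensor)  # the first element follows the change
print(copied_tensor)  # the copy is unchanged

# Going back: .numpy() works for tensors that live on the CPU.
back_to_numpy = copied_tensor.numpy()
print(type(back_to_numpy), back_to_numpy.dtype)
```

Sharing memory avoids a copy, but it also means an in-place edit on one side is visible on the other, so copy explicitly when the two should stay independent.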

![picture](https://drive.google.com/uc?id=1z2v05mGyhP_FpEa3Z4JsNpgbtEnkg0bo)

### PyTorch and NumPy share almost identical grammar


**We will show some examples of:**
- Same operation with identical grammar
- Same operation with different grammar
- Different operation with the same grammar

**We will not handle all examples in this class :(**
- For more examples, see the following reference: https://github.com/wkentaro/pytorch-for-numpy-users

**First, define NumPy arrays and PyTorch tensors**

In [6]:
np_array_1 = np.array([1, 2, 3, 4])
np_array_2 = np.array([5, 6, 7, 8])
torch_tensor_1 = torch.tensor([1, 2, 3, 4])
torch_tensor_2 = torch.tensor([5, 6, 7, 8])

print (np_array_1)
print (np_array_2)
print (torch_tensor_1)
print (torch_tensor_2)

[1 2 3 4]
[5 6 7 8]
tensor([1, 2, 3, 4])
tensor([5, 6, 7, 8])


**1) Same operations with identical grammar**

Example) Get the shape of the tensor

In [8]:
# numpy
print(np_array_1.shape)

# torch
print(torch_tensor_1.shape)
# size() and shape are equivalent in torch
print(torch_tensor_1.size())

(4,)
torch.Size([4])
torch.Size([4])


**2) Same operations with different grammar**

Example 1) Concatenate two tensors
- numpy uses `np.concatenate`
- torch uses `torch.cat`
- IMPORTANT: axis (numpy) and dim (torch) mean the same thing

In [14]:
# numpy
np_concat = np.concatenate([np_array_1, np_array_2], axis=0)
print('----numpy----')
print(np_concat)

# torch
torch_concat = torch.cat([torch_tensor_1, torch_tensor_2], dim=0)
print('----torch----')
print(torch_concat)

----numpy----
[1 2 3 4 5 6 7 8]
----torch----
tensor([1, 2, 3, 4, 5, 6, 7, 8])
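
The 1-D case above only has one axis to choose from. As a small sketch (the array names here are made up for illustration), the same correspondence between `axis` and `dim` can be checked on 2-D inputs:

```python
import numpy as np
import torch

np_a, np_b = np.zeros((2, 3)), np.ones((2, 3))
torch_a, torch_b = torch.zeros(2, 3), torch.ones(2, 3)

# axis=0 / dim=0 stacks along rows: (2, 3) + (2, 3) -> (4, 3)
rows_np = np.concatenate([np_a, np_b], axis=0)
rows_torch = torch.cat([torch_a, torch_b], dim=0)

# axis=1 / dim=1 stacks along columns: (2, 3) + (2, 3) -> (2, 6)
cols_np = np.concatenate([np_a, np_b], axis=1)
cols_torch = torch.cat([torch_a, torch_b], dim=1)

print(rows_np.shape, tuple(rows_torch.shape))
print(cols_np.shape, tuple(cols_torch.shape))
```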


Example 2) Reshape the tensor
- numpy uses `X.reshape`
- torch uses `X.view` (`X.reshape` also exists in torch)
- IMPORTANT: axis (numpy) and dim (torch) mean the same thing

In [18]:
# numpy
np_reshaped = np_concat.reshape(4, 2)
print('----numpy----')
print(np_reshaped)
print(np_reshaped.shape)

# torch
torch_reshaped = torch_concat.view(4, 2)
print('----torch----')
print(torch_reshaped)
print(torch_reshaped.shape)

----numpy----
[[1 2]
 [3 4]
 [5 6]
 [7 8]]
(4, 2)
----torch----
tensor([[1, 2],
        [3, 4],
        [5, 6],
        [7, 8]])
torch.Size([4, 2])
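
One caveat with `view` is that it only works when the requested shape is compatible with the tensor's memory layout. The following sketch, using an illustrative transposed tensor, shows a case where `view` raises an error while `reshape` (or `contiguous()` followed by `view`) succeeds:

```python
import torch

t = torch.arange(6).view(2, 3)   # [[0, 1, 2], [3, 4, 5]]
t_transposed = t.t()             # shape (3, 2), no longer contiguous in memory
print(t_transposed.is_contiguous())

# view cannot flatten this non-contiguous tensor
try:
    t_transposed.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape copies the data when needed, so it works here
flat_reshape = t_transposed.reshape(6)
# equivalent explicit version: make the tensor contiguous, then view
flat_view = t_transposed.contiguous().view(6)

print(view_failed)
print(flat_reshape)
```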


**3) Different operations with the same grammar (confusing operations)**

Example) Manipulating tensors
- The same name `repeat` performs different operations in numpy and torch

In [17]:
x = np.array([1, 2, 3])
x_repeat = x.repeat(3)

print('----numpy----')
print(x)
print(x_repeat)

x = torch.tensor([1, 2, 3])
x_repeat = x.repeat(3)

print('----torch----')
print(x)
print(x_repeat)

# To obtain the same result as np.repeat, use repeat_interleave (each element is repeated in place)
x_repeat = x.repeat_interleave(3)
print(x_repeat)

----numpy----
[1 2 3]
[1 1 1 2 2 2 3 3 3]
----torch----
tensor([1, 2, 3])
tensor([1, 2, 3, 1, 2, 3, 1, 2, 3])
tensor([1, 1, 1, 2, 2, 2, 3, 3, 3])
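
To keep the two behaviours apart, it may help to pair each torch call with its NumPy counterpart. The sketch below (variable names are illustrative) checks that `torch.Tensor.repeat` behaves like `np.tile`, and `torch.repeat_interleave` behaves like `np.repeat`:

```python
import numpy as np
import torch

x_np = np.array([1, 2, 3])
x_torch = torch.tensor([1, 2, 3])

# Whole-sequence repetition: np.tile <-> Tensor.repeat
tiled_np = np.tile(x_np, 3)
tiled_torch = x_torch.repeat(3)

# Element-wise repetition: np.repeat <-> torch.repeat_interleave
each_np = np.repeat(x_np, 3)
each_torch = torch.repeat_interleave(x_torch, 3)

print(tiled_np, tiled_torch)
print(each_np, each_torch)
```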


In [19]:
# similar manipulation operation: stack & repeat
x = torch.tensor([1, 2, 3])
x_repeat = x.repeat(4)
x_stack = torch.stack([x, x, x, x])

print (x_repeat)
print (x_stack)
print (x_repeat.view(4, 3)) # reshape x

tensor([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])
tensor([[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]])
tensor([[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]])


## 2. Tensor operations under GPU utilization

Deep learning frameworks utilize GPUs to accelerate computations.

In this section, we will learn **how to utilize GPU** in PyTorch

In [20]:
print(torch.cuda.is_available())  # Is GPU accessible?

True


In [31]:
a = torch.ones(3)
b = torch.randn(100, 50, 3)

In [32]:
print(a.device)
print(b.device)

cpu
cpu


In [23]:
c = a + b

In [24]:
print(c.device)

cpu


In [35]:
# upload a and b to GPU
# .to('cuda') is identical to .cuda()
a = a.to('cuda') # a.cuda()
b = b.to('cuda') # b.cuda()

In [36]:
a = a.cuda()
b = b.cuda()

In [37]:
print(a.device)
print(b.device)

cuda:0
cuda:0


In [38]:
c = a + b

In [39]:
print(c.device)

cuda:0


In [40]:
c = c.to('cpu')

In [30]:
print(c.device)

cpu
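
The cells above assume a GPU is present. A common pattern, sketched here under the assumption that the same code should also run on a CPU-only machine, is to pick the device once and pass it around:

```python
import torch

# Fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.ones(3, device=device)        # create directly on the device
b = torch.randn(100, 50, 3).to(device)  # or move an existing tensor

c = a + b  # both operands live on the same device
print(c.device)

# .numpy() only works for CPU tensors, so move the result back first
c_np = c.cpu().numpy()
print(c_np.shape)
```

Operands of an operation such as `a + b` generally need to be on the same device, which is why both tensors are moved before the addition.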


## 3. Autograd

Central to all neural networks in PyTorch is the `autograd` package. 

The `autograd` package provides automatic differentiation for all operations on Tensors. 

`torch.Tensor` is the central class of the package. If you set its attribute `.requires_grad` to `True`, it starts to track all operations on it. When you finish your computation, you can call `.backward()` and have all the gradients computed automatically. The gradient for this tensor is accumulated into its `.grad` attribute.

To stop a tensor from tracking history, you can call `.detach()` to detach it from the computation history, and to prevent future computation from being tracked.
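
Two points in the description above are easy to miss: gradients are accumulated (added) across `.backward()` calls, and `.detach()` returns a tensor that is no longer tracked. A minimal sketch with an illustrative one-element tensor `w`:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).sum().backward()
first_grad = w.grad.clone()

# Second backward pass: the new gradient is ADDED to .grad (4 + 4 = 8)
(w ** 2).sum().backward()
second_grad = w.grad.clone()

# Reset the accumulated gradient in place
w.grad.zero_()

# A detached tensor shares the values but is not tracked by autograd
w_detached = w.detach()
y = w_detached * 3

print(first_grad, second_grad, w.grad)
print(y.requires_grad)
```

This accumulation is the reason training loops call `optimizer.zero_grad()` before each backward pass.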

### Example

In [41]:
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


In [42]:
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


In [43]:
z = y * y * 3
print(z)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>)


In [44]:
out = z.mean()
print(out)

tensor(27., grad_fn=<MeanBackward0>)


In [45]:
# retain grad for the intermediate tensors y and z, then run backward from out
y.retain_grad()
z.retain_grad()
out.backward()

![picture](https://drive.google.com/uc?id=1JyMWTbaU6ktJAHx2XqiU7s4tId-cxiLF)
![picture](https://drive.google.com/uc?id=17j-aNqj1yjZfVPCKZJRt6YVZ-7usf5PH)

In [46]:
print(z.grad)

tensor([[0.2500, 0.2500],
        [0.2500, 0.2500]])


![picture](https://drive.google.com/uc?id=1jPfdq6piSkkwZ21nX7kIBa-xGJE6uPBu)
![picture](https://drive.google.com/uc?id=1NN0kpdvRRP9NwguXJHnU3u8VikMFUKw2)

In [47]:
print(y.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


![picture](https://drive.google.com/uc?id=1HllHu2CxuNFX8mc6QdQEEtnXJ3Rvo6TE)
![picture](https://drive.google.com/uc?id=1jWJPOXVLG6mdUyDSklocNWPVa9Rg62K3)

In [48]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])
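
The three printed gradients can be checked by hand with the chain rule. The derivation below is a worked sketch using the definitions from the cells above ($y_i = x_i + 2$, $z_i = 3 y_i^2$, and `out` as the mean of the four entries of $z$), evaluated at $x_i = 1$:

```latex
\begin{aligned}
\text{out} &= \frac{1}{4}\sum_{i=1}^{4} z_i, \qquad z_i = 3y_i^2, \qquad y_i = x_i + 2 \\[4pt]
\frac{\partial\,\text{out}}{\partial z_i} &= \frac{1}{4} = 0.25 \\[4pt]
\frac{\partial\,\text{out}}{\partial y_i} &= \frac{\partial\,\text{out}}{\partial z_i}\cdot\frac{\partial z_i}{\partial y_i}
  = \frac{1}{4}\cdot 6y_i = \frac{3}{2}\,y_i = 4.5 \quad (y_i = 3) \\[4pt]
\frac{\partial\,\text{out}}{\partial x_i} &= \frac{\partial\,\text{out}}{\partial y_i}\cdot\frac{\partial y_i}{\partial x_i}
  = 4.5 \cdot 1 = 4.5
\end{aligned}
```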


### Efficient inference (testing) with torch.no_grad()

To prevent tracking history (and using memory), you can also wrap the code block in `with torch.no_grad():`

Situation: when **gradient calculation is not required**, e.g., inference\
Solution: use `torch.no_grad()`. PyTorch then does not build the computational graph for backpropagation, which **saves memory and some computation**

In [49]:
# gradient calculation is not required
with torch.no_grad():
    x = torch.ones(2, 2, requires_grad=True)
    y = x + 2
    z = y * y * 3
    out = z.mean()

In [50]:
out

tensor(27.)

In [51]:
out.backward() ## ERROR!!!!: we used torch.no_grad()!!

RuntimeError: ignored
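
To see why the call above fails, it can help to compare the same expression computed inside and outside the `no_grad` block. This is a small sketch with illustrative variable names; results computed inside the block carry no `grad_fn`, so there is nothing to backpropagate through:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)

with torch.no_grad():
    y_untracked = (x + 2) * 3   # computed without building a graph

y_tracked = (x + 2) * 3         # computed with graph tracking

print(y_untracked.requires_grad, y_untracked.grad_fn)
print(y_tracked.requires_grad, y_tracked.grad_fn is not None)

# backward() on the untracked result raises a RuntimeError
try:
    y_untracked.sum().backward()
    backward_failed = False
except RuntimeError:
    backward_failed = True
print(backward_failed)
```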

## 4. nn.Module

![picture](https://drive.google.com/uc?id=1Vu3oRATA-EWDycO2zVWkBdzndU-8C5cB)

### Using pre-defined modules (subset of models) in PyTorch

In [53]:
import torch.nn as nn

X = torch.tensor([[1., 2., 3.], [4., 5., 6.]])

print(X)
print(X.shape)

tensor([[1., 2., 3.],
        [4., 5., 6.]])
torch.Size([2, 3])


In [54]:
# input dim 3, output dim 1
linear_fn = nn.Linear(3, 1)

In [55]:
linear_fn  # WX + b

Linear(in_features=3, out_features=1, bias=True)

In [56]:
Y = linear_fn(X)
print(Y)
print(Y.shape)

tensor([[-0.7191],
        [-0.3344]], grad_fn=<AddmmBackward0>)
torch.Size([2, 1])


In [57]:
Y = Y.sum()
print(Y)

tensor(-1.0535, grad_fn=<SumBackward0>)
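
As a quick check of what `nn.Linear` computes, the sketch below reproduces its output by hand from the layer's `weight` and `bias` parameters. The seed is only there to make the random initialization repeatable; the values will differ from the output printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed for repeatable initialization

X = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
linear_fn = nn.Linear(3, 1)

# weight is stored as (out_features, in_features), bias as (out_features,)
print(linear_fn.weight.shape)  # torch.Size([1, 3])
print(linear_fn.bias.shape)    # torch.Size([1])

# The layer computes X @ W^T + b for each row of X
manual = X @ linear_fn.weight.T + linear_fn.bias
builtin = linear_fn(X)
print(torch.allclose(manual, builtin))
```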


You can use other types of `nn.Module` in PyTorch

In [59]:
nn.Conv2d
nn.RNNCell
nn.LSTMCell
nn.GRUCell
nn.Transformer;

### How can we design a customized model (neural network)?

In [58]:
# Linear -> ReLU -> Linear
class Model(nn.Module):
    def __init__(self, input_dim, output_dim, hidden_dim):
        super(Model, self).__init__()
        self.linear_1 = nn.Linear(input_dim, hidden_dim)
        self.linear_2 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()
    def forward(self, x):
        x = self.linear_1(x)
        x = self.relu(x)
        x = self.linear_2(x)
        return x
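
A short usage sketch for a model of this shape follows. `TinyModel` is a hypothetical stand-in defined here so the snippet is self-contained, and the dimensions (3, 4, 2) are arbitrary. It shows that calling the module runs `forward`, and that layers assigned in `__init__` are registered as parameters automatically.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in with the same Linear -> ReLU -> Linear structure
class TinyModel(nn.Module):
    def __init__(self, input_dim, output_dim, hidden_dim):
        super().__init__()
        self.linear_1 = nn.Linear(input_dim, hidden_dim)
        self.linear_2 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.linear_2(self.relu(self.linear_1(x)))

model = TinyModel(input_dim=3, output_dim=2, hidden_dim=4)
batch = torch.randn(5, 3)   # 5 samples, 3 features each
out = model(batch)          # calling the module invokes forward()
print(out.shape)

# Parameters of the sub-layers are collected by nn.Module
for name, p in model.named_parameters():
    print(name, tuple(p.shape))
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # (3*4 + 4) + (4*2 + 2) = 26
```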

**What is an activation function?**
- Activation functions introduce non-linearity into deep neural networks
- Therefore, deep neural networks can approximate complex functions
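
To illustrate why the non-linearity matters, the sketch below stacks two linear layers without an activation and shows that the result equals a single linear map. Layer sizes and the seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed

fc1 = nn.Linear(3, 4)
fc2 = nn.Linear(4, 2)
x = torch.randn(5, 3)

# Two linear layers with no activation in between
stacked = fc2(fc1(x))

# The same computation as ONE linear map: W = W2 W1, b = W2 b1 + b2
W = fc2.weight @ fc1.weight
b = fc2.weight @ fc1.bias + fc2.bias
collapsed = x @ W.T + b
print(torch.allclose(stacked, collapsed, atol=1e-6))

# With a ReLU in between, the network is no longer a single linear map
with_relu = fc2(torch.relu(fc1(x)))
print(with_relu.shape)
```

Without an activation, adding more linear layers does not increase what the model can represent, which is why every hidden layer is followed by one.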

![picture](https://drive.google.com/uc?id=1dxJJUOzYykRfW2q3my2Qtg82RsjptIx4)

In [61]:
nn.Sigmoid
nn.ReLU
nn.LeakyReLU
nn.Tanh;

## 5. MNIST classification with PyTorch (Logistic regression & MLP)

![picture](https://drive.google.com/uc?id=1kdig6RLSCvYJNqarbb8gviYsnxZfSkYQ)

### What is MNIST & How to do multi-class classification?

The MNIST database of **handwritten digits from 0 to 9** has a training set of 60,000 examples and a test set of 10,000 examples.

Since we have 10 classes (0 to 9), the problem can be treated as **multinomial logistic regression** (**multi-class classification**).

Therefore, we use the **softmax** function to handle the multi-class output, together with the **cross-entropy** loss function.
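
The relationship between softmax and cross-entropy can be sketched on a toy example. The logits and labels below are made-up values for a 3-class problem. Note that `nn.CrossEntropyLoss` expects raw scores (logits) and applies log-softmax internally, so a model trained with it should not apply softmax itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 2 samples and 3 classes, plus their true labels
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])
labels = torch.tensor([0, 1])

# Softmax turns each row of logits into a probability distribution
probs = F.softmax(logits, dim=1)
print(probs.sum(dim=1))  # each row sums to 1

# Cross-entropy = mean negative log-probability of the true class
loss_manual = -torch.log(probs[torch.arange(2), labels]).mean()
loss_builtin = nn.CrossEntropyLoss()(logits, labels)
print(loss_manual, loss_builtin)
```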

![picture](https://drive.google.com/uc?id=1v-QvM2MEMku6wWMb_8f8NIqIDzby7wJP)

### Load packages

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils.data import DataLoader

import torchvision
import torchvision.transforms as transforms

### Load datasets for training & testing

In [4]:
# Download MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./', train=True, transform=transforms.ToTensor(), download=True)
test_dataset = torchvision.datasets.MNIST(root='./', train=False, transform=transforms.ToTensor())

# Data loader
# mini batch size : 128 for train, 100 for test
train_loader = DataLoader(dataset=train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=100, shuffle=False)
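
To see what a `DataLoader` yields without downloading anything, the sketch below uses a hypothetical stand-in dataset of 1,000 all-zero images with the same per-sample shape that `transforms.ToTensor()` gives for MNIST, namely (1, 28, 28). The names `fake_images`, `fake_labels`, and `loader` are illustrative.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data with MNIST-like shapes (assumption: 1000 samples for speed)
fake_images = torch.zeros(1000, 1, 28, 28)
fake_labels = torch.zeros(1000, dtype=torch.long)
loader = DataLoader(TensorDataset(fake_images, fake_labels),
                    batch_size=128, shuffle=True)

images, labels = next(iter(loader))
print(images.shape)   # one mini-batch of images: (128, 1, 28, 28)
print(labels.shape)   # matching labels: (128,)

# Flatten each image into a 784-dimensional vector for a linear model
flat = images.reshape(-1, 28 * 28)
print(flat.shape)

# Number of mini-batches per epoch is ceil(N / batch_size)
print(len(loader))                 # ceil(1000 / 128) = 8
steps_mnist = math.ceil(60000 / 128)
print(steps_mnist)                 # 469 steps for the 60,000 training images
```

The last mini-batch is smaller whenever the dataset size is not a multiple of the batch size, which is why reshaping with `-1` is safer than hard-coding 128.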

### Define model (we will use one layer classifier first)

![picture](https://drive.google.com/uc?id=1Xe4J88NglbuASnfYJYI7ISqA1c1rcs5P)

In [5]:
# Define model class
# This model has a single linear layer (no hidden layer)
class Multinomial_logistic_regression(nn.Module):
    def __init__(self, input_size, output_size):
        super(Multinomial_logistic_regression, self).__init__()
        self.linear = nn.Linear(input_size, output_size)
    def forward(self, x):
        # returns raw scores (logits); nn.CrossEntropyLoss applies log-softmax internally
        out = self.linear(x)
        return out

In [6]:
# Generate model
# input dim: 784  / output dim: 10
model = Multinomial_logistic_regression(784, 10) 

In [7]:
model

Multinomial_logistic_regression(
  (linear): Linear(in_features=784, out_features=10, bias=True)
)

In [8]:
# Upload model to GPU 
model = model.to('cuda') #.cuda

### Define optimizer

Optimization is about finding the best solution (model parameters) that fits the given dataset!

A PyTorch optimizer specifies **which optimization method to use for training**

We will not cover the details in this class. (See the **"Optimization for AI (AI505)"** course.)

In [9]:
# Define the optimizer
# optimizer = torch.optim.SGD(model.parameters(), lr=0.05) 
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
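
As a rough illustration of what one optimizer step does, the sketch below applies plain SGD (no momentum, for simplicity) to a toy two-element parameter and compares it with the hand-computed update `w - lr * grad`. The tensor `w`, the quadratic loss, and `lr=0.1` are arbitrary choices for this example.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)  # plain SGD, no momentum

loss = (w ** 2).sum()
optimizer.zero_grad()   # clear any previously accumulated gradient
loss.backward()         # grad = 2w = [2, -4]

# Hand-computed update, evaluated before the optimizer changes w
expected = w.detach() - 0.1 * w.grad

optimizer.step()        # in-place update: w <- w - lr * grad
print(w)
print(expected)
```

With `momentum=0.9`, the step additionally mixes in a running average of past gradients, so later updates would no longer match this simple formula.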

![picture](https://drive.google.com/uc?id=1BvkB6O1hsGZ4YkD92k-E3I59omprN7qz)

### Train the model

In [22]:
# Define the loss function (we use cross-entropy)
loss_fn = nn.CrossEntropyLoss()

# Train the model
num_epochs = 10
total_step = len(train_loader)

for epoch in range(num_epochs):
    # mini batch for loop
    for i, (images, labels) in enumerate(train_loader):
        # reshape images to (batch_size, 784) and load to GPU
        images = images.reshape(-1, 28*28).to('cuda')
        labels = labels.to('cuda') # labels shape (batch_size,)
        
        # Forward to model
        outputs = model(images)
        # calculate the loss (cross-entropy loss) between the predictions and the ground truth
        loss = loss_fn(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad() # reset gradients accumulated in the previous step
        loss.backward() # automatic gradient calculation (autograd)
        optimizer.step() # update the model parameters registered in the optimizer
        
        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

Epoch [1/10], Step [100/469], Loss: 0.3782
Epoch [1/10], Step [200/469], Loss: 0.3789
Epoch [1/10], Step [300/469], Loss: 0.3781
Epoch [1/10], Step [400/469], Loss: 0.3513
Epoch [2/10], Step [100/469], Loss: 0.2105
Epoch [2/10], Step [200/469], Loss: 0.3235
Epoch [2/10], Step [300/469], Loss: 0.4068
Epoch [2/10], Step [400/469], Loss: 0.3053
Epoch [3/10], Step [100/469], Loss: 0.1749
Epoch [3/10], Step [200/469], Loss: 0.3380
Epoch [3/10], Step [300/469], Loss: 0.2674
Epoch [3/10], Step [400/469], Loss: 0.2586
Epoch [4/10], Step [100/469], Loss: 0.2581
Epoch [4/10], Step [200/469], Loss: 0.1750
Epoch [4/10], Step [300/469], Loss: 0.2957
Epoch [4/10], Step [400/469], Loss: 0.3441
Epoch [5/10], Step [100/469], Loss: 0.3131
Epoch [5/10], Step [200/469], Loss: 0.3308
Epoch [5/10], Step [300/469], Loss: 0.1583
Epoch [5/10], Step [400/469], Loss: 0.4312
Epoch [6/10], Step [100/469], Loss: 0.2151
Epoch [6/10], Step [200/469], Loss: 0.2176
Epoch [6/10], Step [300/469], Loss: 0.2259
Epoch [6/10

### Test the model

In [23]:
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, 28*28).to('cuda')
        labels = labels.to('cuda')
        outputs = model(images)
        # classification model -> take the class with the highest score as the top-1 prediction
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))

Accuracy of the network on the 10000 test images: 92.34 %
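
The prediction step in the test loop relies on `torch.max` returning both the maximum values and their indices along the given dimension. A small sketch on made-up scores for 3 samples and 3 classes:

```python
import torch

# Made-up model outputs (logits) and ground-truth labels
outputs = torch.tensor([[0.1, 2.0, -1.0],
                        [1.5, 0.2, 0.3],
                        [0.0, 0.1, 3.0]])
labels = torch.tensor([1, 0, 1])

# torch.max along dim 1 returns (max values, index of the max) per row
max_values, predicted = torch.max(outputs, 1)
print(predicted)

# The indices are the same as argmax over the class dimension
same_as_argmax = torch.equal(predicted, outputs.argmax(dim=1))

# Count matches and convert to a percentage
correct = (predicted == labels).sum().item()
accuracy = 100 * correct / labels.size(0)
print(correct, accuracy)
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw logits gives the same predicted class as taking the argmax of the probabilities.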


### New model: MLP (multi-layer-perceptron)

The previous model was multinomial logistic regression (one linear layer).\
What if we use an **MLP (multi-layer perceptron)**, i.e., a neural network with hidden layers?

In [30]:
# New model with multi layer
# Linear -> Sigmoid -> Linear -> Sigmoid -> Linear
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()
    def forward(self, x):
        out = self.fc1(x)
        out = self.sigmoid(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        out = self.fc3(out)
        return out

In [31]:
# Generate model
# input dim: 784  / hidden dim: 20  / output dim: 10
model = NeuralNet(784, 20, 10)

# Upload model to GPU
model.cuda()

# Loss function define (we use cross-entropy)
loss_fn = nn.CrossEntropyLoss()

# Define optimizer : SGD, lr=0.05, momentum=0.9
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

# Train the model
total_step = len(train_loader)

for epoch in range(10):
    for i, (images, labels) in enumerate(train_loader):  # mini batch for loop
        # upload to gpu
        images = images.reshape(-1, 28*28).cuda()
        labels = labels.cuda()
        
        # Forward
        outputs = model(images)
        loss = loss_fn(outputs, labels)  # calculate the loss (crossentropy loss) with ground truth & prediction value
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()  # automatic gradient calculation (autograd)
        optimizer.step()  # update model parameter with requires_grad=True 
        
        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, 10, i+1, total_step, loss.item()))

Epoch [1/10], Step [100/469], Loss: 2.2840
Epoch [1/10], Step [200/469], Loss: 1.7887
Epoch [1/10], Step [300/469], Loss: 1.2332
Epoch [1/10], Step [400/469], Loss: 0.9947
Epoch [2/10], Step [100/469], Loss: 0.5507
Epoch [2/10], Step [200/469], Loss: 0.4818
Epoch [2/10], Step [300/469], Loss: 0.5647
Epoch [2/10], Step [400/469], Loss: 0.4883
Epoch [3/10], Step [100/469], Loss: 0.3765
Epoch [3/10], Step [200/469], Loss: 0.3756
Epoch [3/10], Step [300/469], Loss: 0.3558
Epoch [3/10], Step [400/469], Loss: 0.3290
Epoch [4/10], Step [100/469], Loss: 0.2686
Epoch [4/10], Step [200/469], Loss: 0.2944
Epoch [4/10], Step [300/469], Loss: 0.2886
Epoch [4/10], Step [400/469], Loss: 0.2286
Epoch [5/10], Step [100/469], Loss: 0.2221
Epoch [5/10], Step [200/469], Loss: 0.2409
Epoch [5/10], Step [300/469], Loss: 0.3241
Epoch [5/10], Step [400/469], Loss: 0.2258
Epoch [6/10], Step [100/469], Loss: 0.1664
Epoch [6/10], Step [200/469], Loss: 0.2001
Epoch [6/10], Step [300/469], Loss: 0.2272
Epoch [6/10

In [32]:
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, 28*28).cuda()
        labels = labels.cuda()
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)  # classification model -> take the class with the highest score as the top-1 prediction
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))

Accuracy of the network on the 10000 test images: 95.29 %
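
One way to put the two accuracies in context is to compare model sizes. The sketch below counts trainable parameters for layers with the same dimensions as the two models above (784 inputs, 20 hidden units, 10 outputs); `count_parameters` is a hypothetical helper, and `nn.Sequential` is used only to keep the example short.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of elements across all parameters of a module."""
    return sum(p.numel() for p in module.parameters())

# Same layer sizes as the logistic regression model: 784 -> 10
logistic = nn.Linear(784, 10)

# Same layer sizes as the MLP: 784 -> 20 -> 20 -> 10 with sigmoid activations
mlp = nn.Sequential(
    nn.Linear(784, 20), nn.Sigmoid(),
    nn.Linear(20, 20), nn.Sigmoid(),
    nn.Linear(20, 10),
)

n_logistic = count_parameters(logistic)  # 784*10 + 10
n_mlp = count_parameters(mlp)            # (784*20 + 20) + (20*20 + 20) + (20*10 + 10)
print(n_logistic, n_mlp)
```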


### Change the following options to obtain better accuracy! (Try it yourself)

#### (1) Model configurations: 
- number of hidden units per layer
- number of layers
- type of activation function (e.g., relu, tanh, softplus etc.)

#### (2) Optimization configurations
- learning rate
- number of epochs
- type of optimizer
- momentum hyperparameter
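
As a starting point for these experiments, here is a sketch of a configurable MLP. `build_mlp` is a hypothetical helper, not part of PyTorch, and the hidden sizes, activation, and learning rate shown are arbitrary example values rather than recommended settings.

```python
import torch
import torch.nn as nn

def build_mlp(input_size, hidden_sizes, output_size, activation=nn.ReLU):
    """Build Linear -> activation -> ... -> Linear from a list of hidden sizes."""
    layers = []
    prev = input_size
    for h in hidden_sizes:
        layers.append(nn.Linear(prev, h))
        layers.append(activation())  # a fresh activation module per hidden layer
        prev = h
    layers.append(nn.Linear(prev, output_size))  # final layer outputs logits
    return nn.Sequential(*layers)

# Example configuration: two hidden layers with Tanh, trained with Adam
model = build_mlp(784, [128, 64], 10, activation=nn.Tanh)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Quick shape check on a random mini-batch of 4 flattened images
logits = model(torch.randn(4, 784))
print(model)
print(logits.shape)
```

The returned model can replace `NeuralNet(784, 20, 10)` in the training cell above; the loss function and the training loop stay the same.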