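## Setting up the Colab environment


### Libraries used (pinned versions)

*   [torch==1.9.0](https://pytorch.org/)
*   [pytorch-lightning==1.4.2](https://pypi.org/project/pytorch-lightning/1.4.2/)


### Notes

*   A GPU runtime can be used for up to 12 hours continuously.
*   Installed packages do not persist across runtimes, so run the install commands in every new runtime to make sure the pinned versions are present.
*   Putting the package-installation code in the very first cell of the notebook is recommended.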



In [1]:
!pip3 install torch==1.9.0 torchvision torchaudio
!pip3 install pytorch-lightning==1.4.2

Collecting torchaudio
  Downloading torchaudio-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (1.9 MB)
Installing collected packages: torchaudio
Successfully installed torchaudio-0.9.0
Collecting pytorch-lightning==1.4.2
  Downloading pytorch_lightning-1.4.2-py3-none-any.whl (916 kB)
Collecting pyDeprecate==0.3.1
  Downloading pyDeprecate-0.3.1-py3-none-any.whl (10 kB)
Collecting fsspec[http]!=2021.06.0,>=2021.05.0
  Downloading fsspec-2021.8.1-py3-none-any.whl (119 kB)
Collecting PyYAML>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting torchmetrics>=0.4.0
  Downloading torchmetrics-0.5.0-py3-none-any.whl (272 kB)
Collecting future>=0.17.1
  Download

In [2]:
# Torch, Cuda, Cudnn Version Check
import torch

print("Torch version:{}".format(torch.__version__))
print("Cuda version: {}".format(torch.version.cuda))
print("Cudnn version:{}".format(torch.backends.cudnn.version()))


Torch version:1.9.0+cu102
Cuda version: 10.2
Cudnn version:7605


## PyTorch Lightning tutorial

### Main components
1. Data
2. Model
3. Loss
4. Optimizer

[Reference source code](https://colab.research.google.com/drive/1Mowb4NzWlRCxzAFjOIJqUmmk_wAT-XP3)

In [3]:
# Package imports
import os

import pytorch_lightning as pl
from pytorch_lightning import LightningDataModule, LightningModule, Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
# pytorch_lightning.metrics is deprecated; the same function lives in torchmetrics,
# which is installed as a dependency of pytorch-lightning.
from torchmetrics.functional import accuracy

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split

from torchvision import transforms
from torchvision.datasets import MNIST

## Data Preparation
1. Download images
2. Image transforms
3. Train, Validation, Test dataset splits
4. Wrap each dataset split in a DataLoader


### 1. Data : [MNIST](https://pytorch.org/vision/stable/datasets.html#mnist)

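* MNIST dataset: each image is 28×28 pixels
* The torchvision.datasets package makes loading easy (it implements well-known datasets such as MNIST, CIFAR, and COCO)

```
import torchvision
from torchvision import transforms

transform = transforms.ToTensor()  # any transform pipeline; see the next section

train_data = torchvision.datasets.MNIST(
  './data',            # where the data is stored
  train=True,          # True: train set, False: test set
  download=True,       # download the data if it is not already present
  transform=transform  # preprocessing applied to each image
  )
```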

### 2. Image transforms
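- Pixel values range from 0 to 255; converting with ToTensor() rescales them to floats between 0 and 1
- transforms.Normalize(mean, std) then computes (x - mean) / std per channel; with mean = std = 0.5 the result lies between -1 and 1, while the MNIST statistics used below (0.1307, 0.3081) give roughly -0.42 to 2.82


```
- transforms.ToTensor()                          # converts a PIL image or numpy.ndarray image to a tensor
- transforms.Normalize(mean, std, inplace=False) # normalizes the image with mean and std (standard deviation)
- transforms.Compose([])                         # chains several transforms together
```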
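
To make the value ranges concrete, here is a minimal sketch in plain Python (no PyTorch needed) that applies the (x - mean) / std formula to the darkest and brightest possible pixel. The helper name `normalize` is made up for this illustration and is not part of torchvision.

```python
def normalize(value, mean, std):
    """Same per-pixel formula as transforms.Normalize: (x - mean) / std."""
    return (value - mean) / std

# ToTensor() rescales uint8 pixels from [0, 255] to floats in [0, 1].
darkest, brightest = 0 / 255, 255 / 255

# mean = std = 0.5 maps [0, 1] onto [-1, 1].
half_range = (normalize(darkest, 0.5, 0.5), normalize(brightest, 0.5, 0.5))

# The MNIST statistics used in this notebook give a different, asymmetric range.
mnist_range = (normalize(darkest, 0.1307, 0.3081),
               normalize(brightest, 0.1307, 0.3081))

print(half_range)
print([round(v, 4) for v in mnist_range])
```

The second range is not symmetric around zero because the MNIST mean is well below 0.5: most pixels are background.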


### 3. Dataset splits
- When only train and test data are provided, split the train data into a train set and a validation set
- For example, to split train : valid = 0.9 : 0.1:

```
from torch.utils.data import random_split
num_train = int(len(train_data)*0.9)
train_set, valid_set = random_split(train_data, [num_train, len(train_data) - num_train])
```
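
The length arithmetic in the snippet above can be checked without loading any data. The sketch below is plain Python; `split_lengths` is a hypothetical helper, and 60,000 is the size of the MNIST training set. Taking the validation length as the remainder guarantees the two lengths sum to the dataset size, which `random_split` requires.

```python
def split_lengths(total, train_fraction):
    """Return (num_train, num_valid) so that the two lengths sum to total."""
    num_train = int(total * train_fraction)
    return num_train, total - num_train

# The MNIST training set has 60,000 images.
num_train, num_valid = split_lengths(60000, 0.9)
print(num_train, num_valid)  # 54000 6000
```

Note that the DataModule in this notebook uses fixed lengths of 55,000 and 5,000 instead of a 0.9 : 0.1 ratio.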




### 4. DataLoader
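- batch_size: how many samples are fed to the model per training step (chosen with GPU, memory, etc. in mind)
- shuffle: whether to shuffle the data (usually shuffle=True only for the train data, and False for the valid and test data)

```
from torch.utils.data import DataLoader
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=8, shuffle=False)
test_loader = DataLoader(test_set, batch_size=8, shuffle=False)  # test_set: MNIST loaded with train=False
```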
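
The batch size also determines how many steps one epoch takes. A rough sketch in plain Python, assuming the split sizes (55,000 / 5,000 / 10,000) and the batch size of 64 used later in this notebook; `batches_per_epoch` is a hypothetical helper that mirrors how a DataLoader counts batches with and without `drop_last`.

```python
import math

def batches_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches one pass over the dataset yields."""
    if drop_last:
        # The final, smaller batch is discarded.
        return num_samples // batch_size
    # The final batch is kept even if it is smaller than batch_size.
    return math.ceil(num_samples / batch_size)

print(batches_per_epoch(55000, 64))                  # 860
print(batches_per_epoch(5000, 64))                   # 79
print(batches_per_epoch(10000, 64))                  # 157
print(batches_per_epoch(55000, 64, drop_last=True))  # 859
```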


In [4]:
class MNISTDataModule(pl.LightningDataModule):
  def __init__(self, data_folder, batch_size):
    super().__init__()

    self.data_folder = data_folder
    self.batch_size = batch_size

    # transforms for images: convert to tensor, then standardize with the MNIST mean and std
    self.transform = transforms.Compose([transforms.ToTensor(),
                                         transforms.Normalize((0.1307,), (0.3081,))])

  def setup(self, stage=None):
    # Lightning calls setup(stage=...), so accept the argument even though it is unused here
    # download (if needed) and load the MNIST train and test sets
    mnist_train = MNIST(self.data_folder, train=True, download=True, transform=self.transform)
    self.mnist_test = MNIST(self.data_folder, train=False, download=True, transform=self.transform)

    # split the 60,000 training images into 55,000 train / 5,000 validation
    self.mnist_train, self.mnist_val = random_split(mnist_train, [55000, 5000])

  def train_dataloader(self):
    return DataLoader(self.mnist_train, batch_size=self.batch_size, shuffle=True)

  def val_dataloader(self):
    # val_dataloader is the hook name Lightning looks for
    return DataLoader(self.mnist_val, batch_size=self.batch_size)

  # alias kept so that the dm.valid_dataloader() call later in the notebook still works
  valid_dataloader = val_dataloader

  def test_dataloader(self):
    return DataLoader(self.mnist_test, batch_size=self.batch_size)

## Lightning Module
* Network architecture and computation
* Training/Validation/Test loop
* Optimizer
---

```
Overridable methods: forward, training_step, validation_step, test_step,
validation_step_end, validation_epoch_end, configure_optimizers, etc.
```


### training_step
* The body of the training loop
* Arguments: `batch` and `batch_idx`, supplied by the training dataloader
* Computes the training loss and returns it
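
The classifier in this notebook applies `log_softmax` in `forward` and then uses `F.nll_loss` as the loss. That pair is equivalent to cross-entropy on the raw scores (which is what PyTorch's `F.cross_entropy` computes in one call). The sketch below checks the equivalence for a single sample in plain Python; the scores are made-up numbers and the two helpers are simplified stand-ins, not the PyTorch functions.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    shift = max(scores)
    log_sum = math.log(sum(math.exp(s - shift) for s in scores))
    return [s - shift - log_sum for s in scores]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]

scores = [2.0, 0.5, -1.0]  # hypothetical raw outputs for 3 classes
target = 0                 # hypothetical true label

log_probs = log_softmax(scores)
loss = nll_loss(log_probs, target)

# Cross-entropy computed directly from the softmax probability of the target.
prob_target = math.exp(scores[target]) / sum(math.exp(s) for s in scores)
cross_entropy = -math.log(prob_target)

print(abs(loss - cross_entropy) < 1e-12)  # True
```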

### validation_step
* Checks model performance periodically during training
* Arguments: `batch` and `batch_idx`, supplied by the validation dataloader
* Stores logged values, e.g. `self.log('val_loss', loss)`
* The logged val_loss is used to pick the best-performing model

### test_step
* Evaluates model performance
* Arguments: `batch` and `batch_idx`, supplied by the test dataloader
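
The notebook reports accuracy as its evaluation metric. As a rough illustration of what that number means, the sketch below computes it in plain Python: for each sample, take the class with the highest score and compare it with the label. The function and the toy batch are assumptions made for this example; the notebook itself uses a library `accuracy` function on tensors.

```python
def accuracy(log_probs_batch, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for log_probs, label in zip(log_probs_batch, labels):
        # argmax over the class dimension
        predicted = max(range(len(log_probs)), key=lambda i: log_probs[i])
        correct += int(predicted == label)
    return correct / len(labels)

# Toy batch: 4 samples, 3 classes (made-up log-probabilities).
batch = [
    [-0.1, -2.5, -3.0],  # predicts class 0
    [-1.8, -0.3, -2.2],  # predicts class 1
    [-0.9, -1.1, -1.4],  # predicts class 0
    [-2.0, -1.5, -0.4],  # predicts class 2
]
labels = [0, 1, 2, 2]

print(accuracy(batch, labels))  # 0.75
```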

### configure_optimizers
* Defines the optimizer, and optionally the scheduler, used to find the model's optimal parameters
* Sets the optimizer type, learning rate, etc.

In [5]:
class MNISTClassifier(pl.LightningModule):

  def __init__(self):
    super().__init__()

    # mnist images are (1, 28, 28) (channels, height, width)
    self.layer_1 = torch.nn.Linear(28 * 28, 128)
    self.layer_2 = torch.nn.Linear(128, 256)
    self.layer_3 = torch.nn.Linear(256, 10)

  def forward(self, x):
    batch_size = x.size(0)

    # (batch_size, 1, 28, 28) -> (batch_size, 1*28*28)
    x = x.view(batch_size, -1)

    # layer 1
    x = self.layer_1(x)
    x = torch.relu(x)

    # layer 2
    x = self.layer_2(x)
    x = torch.relu(x)

    # layer 3
    x = self.layer_3(x)

    # log-probabilities over the 10 labels (paired with nll_loss below)
    x = torch.log_softmax(x, dim=1)

    return x

  def training_step(self, batch, batch_idx):
    x, y = batch
    log_probs = self(x)
    loss = self.cross_entropy_loss(log_probs, y)
    self.log('train_loss', loss)
    return loss

  def validation_step(self, batch, batch_idx):
    x, y = batch
    log_probs = self(x)
    loss = self.cross_entropy_loss(log_probs, y)
    acc = accuracy(log_probs, y)
    metrics = {'val_acc': acc, 'val_loss': loss}
    self.log_dict(metrics)

  def test_step(self, batch, batch_idx):
    x, y = batch
    log_probs = self(x)
    loss = self.cross_entropy_loss(log_probs, y)
    acc = accuracy(log_probs, y)
    metrics = {'test_acc': acc, 'test_loss': loss}
    self.log_dict(metrics)

  def cross_entropy_loss(self, log_probs, labels):
    # nll_loss on log_softmax outputs is equivalent to cross-entropy on raw scores
    return F.nll_loss(log_probs, labels)

  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    return optimizer

In [6]:
# seed (fix the random seed)
pl.seed_everything(42)

# after mounting Google Drive, store the data and model under your own Drive folder
path = '/content/drive/MyDrive/AISoftware/week1'
data_folder = os.path.join(path, 'data')
model_folder = os.path.join(path, 'model')

os.makedirs(data_folder, exist_ok=True)
os.makedirs(model_folder, exist_ok=True)

batch_size = 64

dm = MNISTDataModule(data_folder, batch_size)
dm.setup()

Global seed set to 42


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw/train-images-idx3-ubyte.gz to /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw/train-labels-idx1-ubyte.gz to /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw/t10k-images-idx3-ubyte.gz to /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw/t10k-labels-idx1-ubyte.gz to /content/drive/MyDrive/AISoftware/week1/data/MNIST/raw


### Checkpoint callback
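* By default, PyTorch Lightning saves checkpoints separately for each run version
* To change the checkpoint name, save frequency, monitored metric, etc., configure a ModelCheckpoint callback and pass it to the Trainer


```
checkpoint_callback = ModelCheckpoint(
  dirpath='checkpoints',  # directory the checkpoints are written to
  filename='{epoch:d}',   # checkpoint file name template
  monitor='val_acc',      # metric used to decide which checkpoint to keep
  mode='max'              # whether a larger ('max') or smaller ('min') value of the metric is better
)
```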
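
The `filename` template uses ordinary Python format fields, and `mode` only decides the direction of the comparison. The sketch below illustrates both ideas in plain Python with made-up metric values; `is_better` is a hypothetical helper, not a Lightning function. Lightning's real file names may look different: as far as I know it prefixes each value with the metric name by default and appends `.ckpt`.

```python
# Template of the form used for ModelCheckpoint(filename=...) in this notebook.
template = "{epoch:02d}-{val_loss:.2f}"

# Hypothetical metric values at the moment a checkpoint is saved.
metrics = {"epoch": 4, "val_loss": 0.089}
name = template.format(**metrics)
print(name)  # 04-0.09

def is_better(new, best, mode):
    """Compare a new metric value against the best so far for a given mode."""
    if mode == "max":
        return new > best   # e.g. accuracy: higher is better
    return new < best       # e.g. loss: lower is better

print(is_better(0.98, 0.97, "max"))  # True
print(is_better(0.10, 0.09, "min"))  # False
```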



In [7]:
checkpoint_callback = ModelCheckpoint(monitor='val_loss', mode='min', dirpath=model_folder, filename='{epoch:02d}-{val_loss:.2f}')
logger = TensorBoardLogger(model_folder, name='tensorboard')

model = MNISTClassifier()
# create the trainer -- Single GPU training
trainer = Trainer(
    max_epochs=100, gpus=1, auto_select_gpus=True,
    logger=logger,
    callbacks=[
        checkpoint_callback,
        LearningRateMonitor(logging_interval='step'),
        EarlyStopping(monitor='val_loss', verbose=True, patience=10),
    ],
)

trainer.fit(model, dm.train_dataloader(), dm.valid_dataloader())  # train

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type   | Params
-----------------------------------
0 | layer_1 | Linear | 100 K 
1 | layer_2 | Linear | 33.0 K
2 | layer_3 | Linear | 2.6 K 
-----------------------------------
136 K     Trainable params
0         Non-trainable params
136 K     Total params
0.544     Total estimated model params size (MB)


Global seed set to 42

Metric val_loss improved. New best score: 0.145
Metric val_loss improved by 0.041 >= min_delta = 0.0. New best score: 0.104
Metric val_loss improved by 0.002 >= min_delta = 0.0. New best score: 0.102
Metric val_loss improved by 0.013 >= min_delta = 0.0. New best score: 0.089

Monitored metric val_loss did not improve in the last 10 records. Best score: 0.089. Signaling Trainer to stop.
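
The stopping behaviour in this log follows a simple patience rule: a counter resets whenever val_loss improves and training stops once it reaches `patience`. The sketch below is a simplified plain-Python model of that rule, not Lightning's actual implementation. In the example sequence only the four improving values (0.145, 0.104, 0.102, 0.089) come from the log above; every other number is a made-up placeholder.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0  # consecutive validation checks without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Improving values are taken from the log; 0.110 and the 0.095 entries are placeholders.
val_losses = [0.145, 0.104, 0.102, 0.110, 0.089] + [0.095] * 10

print(early_stopping_epoch(val_losses, patience=10))  # 14
```

With `patience=10`, the ten non-improving checks after the best score of 0.089 trigger the stop, well before `max_epochs=100` is reached.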


In [8]:
# Evaluate performance on the test set
trainer.test(model, dm.test_dataloader())

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.9787999987602234, 'test_loss': 0.1038069948554039}
--------------------------------------------------------------------------------


[{'test_acc': 0.9787999987602234, 'test_loss': 0.1038069948554039}]