# **Data Installation**
This notebook uses the MedNIST dataset from the MONAI GitHub repository: https://github.com/Project-MONAI/MONAI-extra-test-data/

The code below downloads the `MedNIST.tar.gz` file from the MONAI GitHub repository and extracts it into a folder named `MedNIST`.

If you already have the MedNIST dataset, assign its directory path (a `str`) to **`MedNIST_DATA_DIR`**.

On Colab, it may take some time for the `MedNIST` folder to become visible.

In [1]:
! pip install wget # for Colab users: the Python wget package is not preinstalled on Colab

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9676 sha256=b4fa6bb1098877f1cd2e93a9cceee9e8de89b9d9b20a853c15edb3d538f75fe9
  Stored in directory: /root/.cache/pip/wheels/04/5f/3e/46cc37c5d698415694d83f607f833f83f0149e49b3af9d0f38
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [2]:

 
import wget
import os

MedNIST_DATA_DIR = None # change here

if MedNIST_DATA_DIR is not None:
  print("skip MedNIST dataset installation")

else:
  if os.path.isfile('MedNIST.tar.gz'):
    print("skip pulling MedNIST dataset from MONAI github")

  else:
    wget.download("https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz")

  print("unzipping ...")
  if os.path.isdir("./MedNIST"):  # MedNIST is a folder, so check with isdir rather than isfile
    os.system("rm -rf MedNIST")   # remove any stale extraction first
  
  os.system("tar -xf MedNIST.tar.gz")
  MedNIST_DATA_DIR = "./MedNIST"

unzipping ...
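
The clean-up and extraction steps above rely on the shell commands `rm` and `tar`. As a minimal sketch, the same two steps could be done with Python's standard library instead. The helper name `fresh_extract` and its arguments are hypothetical and are not part of the notebook; the sketch also assumes the archive contains a single top-level folder.

```python
import os
import shutil
import tarfile


def fresh_extract(archive_path, extracted_dir, dest="."):
    """Remove a stale extracted_dir (if any), then unpack archive_path into dest."""
    if os.path.isdir(extracted_dir):
        shutil.rmtree(extracted_dir)       # same effect as `rm -rf` on that folder
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)           # same effect as `tar -xf`
    return extracted_dir


# Possible usage in this notebook (left commented out, since it needs the archive):
# MedNIST_DATA_DIR = fresh_extract("MedNIST.tar.gz", "./MedNIST")
```

This variant does not depend on a Unix shell, so it should behave the same outside Colab.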


# **Import Packages**
The whole notebook uses **`torch`** as its main framework. However, several other packages are needed to preprocess the MedNIST dataset.

**numpy** : basic array-handling package, useful for dealing with multi-dimensional arrays

**PIL** : package used to read images and convert them into arrays

**matplotlib** : a widely used package for displaying images in *python*

**random** : package that provides useful functions for shuffling the dataset



In [3]:
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy
from PIL import Image
import random

# **Preprocess parameters and constants**

Preprocessing is the most important part of this notebook.

The final model will be a classifier, so there is one more thing to consider when preprocessing the dataset besides normalization and shuffling.

When training this kind of multi-class classifier, it is important that the training set contains the **same proportion of data from each class**.

So, instead of mixing all the data and then splitting it, we will split the data of each class by a fixed ratio, defined as **`TEST_RATIO`**.




In [4]:
TEST_RATIO = 0.2

Then we iterate through all classes, appending the larger part of each class to the training set and the remaining part to the test set.

If the dataset has already been saved in npy format, it is simply loaded instead.

In [5]:
if os.path.isfile("./X_train.npy") and os.path.isfile("./Y_train.npy") and os.path.isfile("./X_test.npy") and os.path.isfile("./Y_test.npy"):
  train_x = numpy.load('./X_train.npy')
  train_y = numpy.load('./Y_train.npy')
  valid_x = numpy.load('./X_test.npy')
  valid_y = numpy.load('./Y_test.npy')

else:
  # sorted so that the class index assigned to each folder is the same on every run
  data_folder_list = sorted( name for name in os.listdir(MedNIST_DATA_DIR) if os.path.isdir(os.path.join(MedNIST_DATA_DIR, name)) )
  dataset_dict = dict()
  for category,folder in enumerate(data_folder_list):
      dataset_dict[f'{folder}'] = list()
      for img in os.listdir(f'{MedNIST_DATA_DIR}/{folder}'):
          img_array = numpy.asarray(Image.open(f'{MedNIST_DATA_DIR}/{folder}/{img}'))
          dataset_dict[f'{folder}'].append([img_array,category])
  tmp_test = list()
  tmp_train = list()
  
  for key in dataset_dict.keys():
      tmp_test.append(dataset_dict[key][:int(TEST_RATIO*len(dataset_dict[key]))])
      tmp_train.append(dataset_dict[key][int(TEST_RATIO*len(dataset_dict[key])):])

  train_x = list()
  train_y = list()
  valid_x = list()
  valid_y = list()

  for key in range(len(tmp_train)):  # one entry per class folder
      for train_data_x, train_data_y in tmp_train[key]:
          train_x.append([train_data_x])
          train_y.append(train_data_y)

      for test_data_x, test_data_y in tmp_test[key]:
          valid_x.append([test_data_x])
          valid_y.append(test_data_y)

  train_x = numpy.asarray(train_x)
  train_y = numpy.asarray(train_y)
  valid_x = numpy.asarray(valid_x)
  valid_y = numpy.asarray(valid_y)

  train_x = train_x/255.0 # normalization on train_image
  valid_x = valid_x/255.0 # normalization on test_image

  train_y = numpy.ravel(train_y)
  valid_y = numpy.ravel(valid_y)
  
  train_index = numpy.arange(len(train_x))
  random.Random(6).shuffle(train_index)

  train_x = train_x[train_index]
  train_y = train_y[train_index]

  test_index = numpy.arange(len(valid_x))
  random.Random(7).shuffle(test_index)

  valid_x = valid_x[test_index]
  valid_y = valid_y[test_index]

  numpy.save('./X_train.npy',train_x)
  numpy.save('./Y_train.npy',train_y)
  numpy.save('./X_test.npy',valid_x)
  numpy.save('./Y_test.npy',valid_y)

print("data preparation completed ..")

data preparation completed ..
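
The per-class split used above can be illustrated in isolation. The following is a minimal sketch on toy data, not the notebook's own code: the function name `split_per_class` and the toy samples are hypothetical, and strings stand in for image arrays.

```python
def split_per_class(samples_by_class, test_ratio):
    """Split each class separately so both sets keep the same class proportions."""
    train, test = [], []
    for label, samples in enumerate(samples_by_class):
        cut = int(test_ratio * len(samples))      # first `cut` samples go to the test set
        test.extend((s, label) for s in samples[:cut])
        train.extend((s, label) for s in samples[cut:])
    return train, test


# Hypothetical toy data: 10 samples of class 0 and 5 samples of class 1.
toy = [[f"a{i}" for i in range(10)], [f"b{i}" for i in range(5)]]
train, test = split_per_class(toy, test_ratio=0.2)
print(len(train), len(test))   # 12 3
```

Each class contributes 20% of its own samples to the test set, so an imbalanced class cannot end up missing from either set purely by chance.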


# **Convert to Tensor**

A PyTorch model only takes Tensors as inputs and labels, so the numpy-format data must be converted to Tensors.

Image data (**inputs**) should be of type FloatTensor, and class indices (**labels**) should be of type LongTensor.

In [6]:
train_x = torch.FloatTensor(train_x)
valid_x = torch.FloatTensor(valid_x)
train_y = torch.LongTensor(train_y)
valid_y = torch.LongTensor(valid_y)

# **Set Dataloader**

The DataLoader is one of the most important features of PyTorch.

Before the input and label tensors are passed to a DataLoader, they must be packed into a TensorDataset.

The batch size determines the size of the minibatch in each iteration.

Assign the minibatch size to **`BATCH_SIZE`**.

During training, we enumerate the DataLoader, which automatically yields one minibatch per iteration.

In [7]:
BATCH_SIZE = 1024

In [8]:
train_set = torch.utils.data.TensorDataset(train_x, train_y)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=False)

valid_set = torch.utils.data.TensorDataset(valid_x, valid_y)
valid_loader = torch.utils.data.DataLoader(valid_set, batch_size=BATCH_SIZE, shuffle=False)
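
To make the role of the batch size concrete, here is a small sketch of how an unshuffled loader that keeps the final partial batch walks over the dataset. The helper `batch_slices` and the sample count of 2500 are hypothetical and chosen only for illustration.

```python
import math


def batch_slices(num_samples, batch_size):
    """Yield (start, stop) index pairs covering the dataset in order, one pair per minibatch."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)


# Hypothetical dataset size, not the real MedNIST sample count.
slices = list(batch_slices(num_samples=2500, batch_size=1024))
print(slices)                    # [(0, 1024), (1024, 2048), (2048, 2500)]
print(math.ceil(2500 / 1024))    # 3
```

The number of iterations per epoch is the sample count divided by the batch size, rounded up, and the last minibatch may be smaller than the others.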

# **Prepare Device**
When using PyTorch, the user has to select a device; otherwise the CPU is used by default.

If multiple GPUs are available, select one GPU id and insert it after `cuda:`.

In [9]:
USE_CUDA = torch.cuda.is_available()

if USE_CUDA:
    device = torch.device('cuda:0')
    torch.cuda.set_device(device)
else:
    device = torch.device('cpu')

# **Config Model**
The model is a lightweight CNN.

**Note that there is no activation function after the last layer**

`CrossEntropyLoss` will be used later, and PyTorch's cross-entropy loss applies log-softmax to the raw logits internally.

In [10]:
class Conv2d_Model(nn.Module):
    def __init__(self):
        super(Conv2d_Model, self).__init__()
        self.convolution_layer1 = nn.Conv2d(1,6,3)
        self.convolution_layer2 = nn.Conv2d(6,12,3)
        self.pool = nn.MaxPool2d(2, 2)
        self.dense_layer1 = nn.Linear(14*14*12, 256)
        self.dense_layer2 = nn.Linear(256, 32)
        self.dense_layer3 = nn.Linear(32, 6)

    def forward(self,x):
        x = F.relu(self.convolution_layer1(x))
        x = self.pool(x)
        x = F.relu(self.convolution_layer2(x))
        x = self.pool(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.dense_layer1(x))
        x = F.relu(self.dense_layer2(x))
        x = self.dense_layer3(x)
        return x
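
The input size `14*14*12` of `dense_layer1` follows from the layer arithmetic. The sketch below traces one spatial dimension through the network, assuming 64x64 single-channel input images (an assumption that is consistent with `14*14*12`, but not stated in the notebook). The helper `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


side = 64                                    # assumed input height/width
side = conv_out(side, kernel=3)              # convolution_layer1: 64 -> 62
side = conv_out(side, kernel=2, stride=2)    # pool:               62 -> 31
side = conv_out(side, kernel=3)              # convolution_layer2: 31 -> 29
side = conv_out(side, kernel=2, stride=2)    # pool:               29 -> 14
flat_features = side * side * 12             # 12 output channels from convolution_layer2
print(side, flat_features)                   # 14 2352
```

If the input resolution changes, the first argument of `dense_layer1` has to be recomputed in the same way.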

# **Create Model object**
Unlike TensorFlow, PyTorch does not automatically place the model on the GPU, so it must be moved there explicitly.

In [11]:
model = Conv2d_Model()
if USE_CUDA: model.to(device)

# **Define Loss Function and Optimizer**
After creating the model object, pass its parameters to the optimizer.

This determines the update rule applied to the model parameters.

In [12]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
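
Because the model outputs raw logits, it may help to see what a cross-entropy loss on logits computes for a single sample. This is a plain-Python sketch of the formula (negative log-softmax of the target class), not PyTorch's implementation; the function name and the example logits are hypothetical.

```python
import math


def cross_entropy_from_logits(logits, target):
    """Negative log-softmax of the target class, computed from raw logits."""
    peak = max(logits)                       # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


# Hypothetical logits for one sample over 6 classes.
logits = [2.0, 0.5, -1.0, 0.0, 0.3, -0.7]
loss_if_class_0 = cross_entropy_from_logits(logits, target=0)
loss_if_class_2 = cross_entropy_from_logits(logits, target=2)
print(loss_if_class_0 < loss_if_class_2)     # True: class 0 has the largest logit
```

Applying a softmax in the model's last layer as well would therefore normalize the outputs twice, which is why the last layer has no activation.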

# **Parameters for training session**
PyTorch does not provide a single built-in method to train a model; however, the DataLoader keeps the training loop simple.

As mentioned before, a PyTorch training session is usually done by enumerating the training DataLoader. In every iteration, the loss is calculated on the given minibatch and the model parameters are updated from its gradients.

## **Metric**
However, it is more convenient if we can monitor the training.
The code below prints the training progress and the accumulated loss for each epoch.
`print('\r',f"training {train_idx+1}/{len(train_loader)}, train_loss: {train_loss:0.4f}",end=" ")`


Moreover, to check whether the model is being trained correctly, we can evaluate it on the validation set and print the result as well.

`print('\r',f"validing {val_idx+1}/{len(valid_loader)}, val_loss:{tot_val_loss:0.4f}, val_acc: {acc:0.4}%", end=" ")`
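
The validation accuracy is accumulated batch by batch: the predicted class is the index of the largest logit, and the running counts of correct and total samples give the percentage. A minimal plain-Python sketch of that bookkeeping follows; the helper name and the toy logits and labels are hypothetical.

```python
def batch_accuracy_counts(logits_batch, labels):
    """Return (correct, total) for one minibatch using the argmax of each logit row."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits_batch]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct, len(labels)


# Two hypothetical minibatches of (logits, labels) over 3 classes.
batches = [
    ([[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]], [1, 2]),
    ([[0.0, 0.1, 3.0]], [2]),
]

correct, total = 0, 0
for logits_batch, labels in batches:
    c, t = batch_accuracy_counts(logits_batch, labels)
    correct += c
    total += t
acc = 100.0 * correct / total    # running accuracy in percent
```

Accumulating counts rather than averaging per-batch accuracies keeps the result correct when the last minibatch is smaller than the others.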



## **Early Stopping**
While there are many ways to customize a training session, **early stopping** is one of the most frequently used techniques.

So, early stopping is used in the code below. It can be implemented by defining a few parameters.


## **Parameter**
**`EPOCH`**: The maximum number of training epochs.

**`CURRENT_PATIENCE`**: The current patience count. After every epoch, if the validation loss has not become smaller than the best value so far, the patience count is increased by one; otherwise it is reset to zero.

**`STANDARD_VAL_LOSS`**: The reference validation loss, initialized to a very large value. Whenever the validation loss is smaller than this, it is replaced with the latest validation loss.

**`PATIENCE_LIMIT`**: The maximum patience count. If the patience count reaches this limit, the training session is halted.
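
How these parameters interact can be replayed on a plain list of validation losses. The sketch below is a simplified variant that stops as soon as the patience count reaches the limit; the function name and the loss values are hypothetical and are not taken from an actual run.

```python
def replay_early_stopping(val_losses, patience_limit):
    """Return (best_epoch, last_epoch_run), with epochs counted from 1."""
    best_loss = float("inf")     # plays the role of STANDARD_VAL_LOSS
    best_epoch = None
    patience = 0                 # plays the role of the current patience count
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, patience = loss, epoch, 0
        else:
            patience += 1
            if patience >= patience_limit:
                return best_epoch, epoch      # stop early
    return best_epoch, len(val_losses)


# Hypothetical per-epoch validation losses.
losses = [0.9, 0.5, 0.6, 0.55, 0.7, 0.4]
print(replay_early_stopping(losses, patience_limit=3))   # (2, 5)
print(replay_early_stopping(losses, patience_limit=4))   # (6, 6)
```

With a limit of 3, training stops at epoch 5 and never reaches the better loss at epoch 6, which shows the trade-off behind the choice of the limit.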

In [13]:
EPOCH = 100
PATIENCE_LIMIT = 8
CURRENT_PATIENCE = 0
STANDARD_VAL_LOSS = 10**9

# **Training Session**

In [14]:
import copy

for epoch in range(EPOCH):
  print(f"---------------Epoch : {epoch+1}/{EPOCH}--------------------")
  train_loss = 0.0

  for train_idx, data in enumerate(train_loader, 0):
    optimizer.zero_grad()
    print('\r',f"training {train_idx+1}/{len(train_loader)}, train_loss: {train_loss:0.4f}",end=" ")
    inputs, labels = data

    outputs = model(inputs.to(device))
    loss = criterion(outputs, labels.to(device))

    loss.backward()
    optimizer.step()

    train_loss += loss.item()

  print('')

  total = 0
  correct = 0
  tot_val_loss = 0.0
  acc = 0.0

  # Omitting torch.no_grad() would not affect the model training, since backward()
  # and optimizer.step() are never called on the loss calculated on the validation set.
  # However, it is conventional to use torch.no_grad() in model validation.
  with torch.no_grad():
    for val_idx, val_data in enumerate(valid_loader, 0):
      print('\r',f"validing {val_idx+1}/{len(valid_loader)}, val_loss:{tot_val_loss:0.4f}, val_acc: {acc:0.4}%", end=" ")
        
      val_inputs, val_label = val_data
      val_output = model(val_inputs.to(device))
      val_loss = criterion(val_output, val_label.to(device))
        
      prediction = torch.argmax(val_output,1)
      tot_val_loss += val_loss.item() 

      total += val_label.size(0)
      correct += (prediction == val_label.to(device)).sum().item()
      acc = 100.0*correct/total
    print('')
    print('\n')
  
  if PATIENCE_LIMIT > CURRENT_PATIENCE:

    # compare the loss summed over the whole validation set, not only the last minibatch
    if tot_val_loss < STANDARD_VAL_LOSS:
      
      STANDARD_VAL_LOSS = tot_val_loss
      best_epoch = epoch+1
      # copy the weights: a plain reference to the model would keep changing in later epochs
      best_state = copy.deepcopy(model.state_dict())
      CURRENT_PATIENCE = 0

    else:
      CURRENT_PATIENCE += 1
  else:
    break

print('')
print('\n')

if (epoch+1) != EPOCH:
  print("Early Stopping ...")

if os.path.isfile('./2D_CNN_model_parameter'):
  os.remove('./2D_CNN_model_parameter')

torch.save(best_state, './2D_CNN_model_parameter')
print("================================================")
print(f"model parameters from epoch {best_epoch} saved!")
print("================================================")
print("\n")
print(".. model train finished")

---------------Epoch : 1/100--------------------
 training 47/47, train_loss: 21.7071 
 validing 12/12, val_loss:0.4026, val_acc: 98.97% 


---------------Epoch : 2/100--------------------
 training 47/47, train_loss: 0.6671 
 validing 12/12, val_loss:0.0763, val_acc: 99.77% 


---------------Epoch : 3/100--------------------
 training 47/47, train_loss: 0.2747 
 validing 12/12, val_loss:0.0504, val_acc: 99.8% 


---------------Epoch : 4/100--------------------
 training 47/47, train_loss: 0.2023 
 validing 12/12, val_loss:0.0441, val_acc: 99.85% 


---------------Epoch : 5/100--------------------
 training 47/47, train_loss: 0.1378 
 validing 12/12, val_loss:0.0366, val_acc: 99.87% 


---------------Epoch : 6/100--------------------
 training 47/47, train_loss: 0.0966 
 validing 12/12, val_loss:0.0441, val_acc: 99.87% 


---------------Epoch : 7/100--------------------
 training 47/47, train_loss: 0.0989 
 validing 12/12, val_loss:0.0320, val_acc: 99.9% 


---------------Epoch : 8/100