# Hw1 report 0616215 
## Github: https://github.com/Darklanx/deep_visual_recog_hw1
I originally used **resnet50** as the backbone for this assignment, but I found that a deeper network such as resnet101 produces better results.

Later, I switched to **resnest**, a network that adds an **attention mechanism** to the original resnet, for further improvement: https://github.com/zhanghang1989/ResNeSt. In the end, I achieved my best score on Kaggle by using a pretrained resnest200 with an additional 21 epochs of transfer learning.

code reference: https://www.kaggle.com/deepbear/pytorch-car-classifier-90-accuracy

In order to run resnest successfully, please run `pip3 install resnest --pre`.


In [None]:
%load_ext autoreload
%autoreload 2
%env CUDA_VISIBLE_DEVICES=0


Fixing the random seed for reproducibility:

In [None]:


from Libs.Dataset import Dataset
from Libs.Model import Net
from Libs.train import train_model, eval_model
import pandas as pd
import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import numpy as np
import csv
import random
import os 
print(torch.cuda.device_count())

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)


Setting the batch size. Note that if multiple GPUs are used, the batch_size should be divisible by the number of GPUs:

In [None]:
BATCH_SIZE = 30
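
As a rough illustration of the divisibility note above, the sketch below computes how a batch would be divided among GPUs. It assumes that `nn.DataParallel` splits a batch the same way `torch.chunk` sizes its pieces (ceil-sized chunks, with any remainder in the last one). `chunk_sizes` is a hypothetical helper, not part of this notebook or of PyTorch.

```python
import math

def chunk_sizes(batch_size, num_gpus):
    """Per-GPU chunk sizes, assuming torch.chunk-style splitting (an assumption)."""
    size = math.ceil(batch_size / num_gpus)  # every chunk but the last has this size
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

print(chunk_sizes(30, 3))  # even split across 3 GPUs
print(chunk_sizes(30, 4))  # uneven split across 4 GPUs
```

Under this assumption, a batch of 30 splits evenly over 3 GPUs, while with 4 GPUs the last GPU receives a smaller chunk than the others.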

Setting the path to the data folder and the label file

In [None]:
dir_training = "./data/training_data/training_data"
dir_testing = "./data/testing_data/testing_data"
csv_file = "data/training_labels.csv"

Loading the csv file to obtain labels for the training set:

In [None]:

df = pd.read_csv(csv_file)
label_ids = {}
for label in df["label"]:
    if label not in label_ids:
        label_ids[label] = len(label_ids)
id_labels =  {v: k for k, v in label_ids.items()}

# Dataset preprocessing
For data preprocessing, the training dataset is transformed with the following operations:

1. Resize to 400×400.

2. Random horizontal flip with p=0.5 (data augmentation).

3. Random rotation of ±15 degrees (data augmentation).

4. Scale the pixel values from 0~255 to 0~1.

5. Normalize each channel with mean and std both set to 0.5.
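
Steps 4 and 5 together map each pixel value from 0~255 to the range -1~1. The following is a minimal sketch of that arithmetic on single values. The helper names `to_unit_range` and `normalize` are hypothetical and only mirror what `transforms.ToTensor` and `transforms.Normalize` do per pixel.

```python
def to_unit_range(pixel):
    """Step 4: scale an 8-bit pixel value from 0~255 to 0~1."""
    return pixel / 255.0

def normalize(value, mean=0.5, std=0.5):
    """Step 5: subtract the mean and divide by the std."""
    return (value - mean) / std

for pixel in (0, 255):
    print(pixel, "->", normalize(to_unit_range(pixel)))
```

With mean and std both at 0.5, a black pixel (0) maps to -1.0 and a white pixel (255) maps to 1.0.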

After defining the transformations, I hold out 10% of the training dataset as a testing dataset. This testing dataset acts as a validation set: it is used to adjust the learning rate during training with PyTorch's learning rate scheduler **ReduceLROnPlateau**, which is discussed further in the Training section.
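
The internals of `Dataset.train_test_split` are not shown in this report, so the following is only a sketch of what a 90/10 split could look like, assuming a seeded random shuffle of sample indices. `split_indices` and its parameters are hypothetical.

```python
import random

def split_indices(num_samples, train_fraction=0.9, seed=0):
    """Shuffle sample indices with a fixed seed and cut them into two parts."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # local RNG, leaves the global seed untouched
    cut = round(num_samples * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```

Using a local `random.Random(seed)` keeps the split reproducible without interfering with the global seed set earlier in the notebook.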

## Note
Random horizontal flip is a data augmentation method and is crucial to training the model: with resnet50 as the backbone, applying it boosted the performance significantly.

In [None]:

train_trans = transforms.Compose([transforms.Resize((400, 400)),
                                 transforms.RandomHorizontalFlip(),
                                 transforms.RandomRotation(15),
                                 transforms.ToTensor(), # range [0, 255] -> [0.0,1.0]
                                 transforms.Normalize((0.5), (0.5))])
train_dataset = Dataset(dir_training, csv_file, label_ids, transform=train_trans)
train_dataset, test_dataset = train_dataset.train_test_split(0.9)
print("train dataset size: ", train_dataset.data.shape[0])
print("test dataset size: ", test_dataset.data.shape[0])
print(test_dataset.data[0])

train_loader =  torch.utils.data.DataLoader(train_dataset, batch_size = BATCH_SIZE, shuffle=True,drop_last=False, num_workers=4)
                               
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=False, num_workers=4)



# Training
### Hyperparameters:
1. Optimizer selection and the parameters of the optimizer (lr, momentum, etc.)

2. Learning rate scheduler and the parameters of the scheduler

3. Number of training epochs

4. Backbone selection: resnet50, resnet101, resnet200, resnest50, resnest101, resnest200.

### Training details
The model is trained with transfer learning by replacing the last fully connected layer of the pretrained **resnest200** to fit the task.
#### Optimizer
SGD with lr=0.01 and momentum=0.9 is used as the optimizer.
#### Learning Rate Scheduler
The learning rate is scheduled by the **ReduceLROnPlateau** scheduler, which takes a value as input after every epoch and adjusts the learning rate accordingly.

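In my training, I feed the model's accuracy on the testing dataset to the scheduler. The learning rate is multiplied by 0.5 when the accuracy fails to improve on the best value so far by more than 0.8 (an absolute threshold) for more than one consecutive epoch (`patience=1`), and it is never reduced below `min_lr=1e-4`.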
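
To make the rule concrete, here is a simplified pure-Python simulation of the plateau logic with the settings used in this notebook (`mode='max'`, `threshold_mode='abs'`, no cooldown). It is a sketch of my understanding of the scheduler's behavior, not PyTorch's actual implementation. `simulate_plateau_schedule` is a hypothetical helper, and the accuracy values are made up for illustration (assumed to be in percent).

```python
def simulate_plateau_schedule(accuracies, lr=0.01, factor=0.5, patience=1,
                              threshold=0.8, min_lr=1e-4):
    """Return the learning rate in effect after each epoch's accuracy is reported."""
    best = float("-inf")
    bad_epochs = 0
    history = []
    for acc in accuracies:
        if acc > best + threshold:  # 'max' mode with an absolute threshold
            best = acc
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:  # too many epochs without enough improvement
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history

# Made-up accuracies (in percent), for illustration only.
print(simulate_plateau_schedule([50.0, 60.0, 60.5, 60.7, 65.0, 65.1, 65.2]))
```

In this made-up run the learning rate is halved after the 4th and 7th epochs, each time following two consecutive epochs whose accuracy did not beat the best value by more than 0.8.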

<div class="alert alert-info">
I removed the epoch-by-epoch outputs of the training (which contain the loss and testing error for every epoch) for simplicity; however, the TA can easily reproduce the results by running this notebook.
</div>



In [None]:
TRAIN_EPOCH_LOAD = 0
MODEL_DIR = "./model/"
end_epoch = 21
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Net(use_att=False)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
    print("torch.cuda.device_count(): ", torch.cuda.device_count())

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# lrscheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
lrscheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5,patience=1, threshold = 0.8, threshold_mode='abs', min_lr=1e-4)


if TRAIN_EPOCH_LOAD <= 0:
    start_epoch = 0
else:
    start_epoch = TRAIN_EPOCH_LOAD
    checkpoint = torch.load('{}.pth'.format(os.path.join(MODEL_DIR, str(TRAIN_EPOCH_LOAD))))
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    print(optimizer.param_groups[0]['lr'])
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.to(device)

    lrscheduler.step(checkpoint["test_acc"])
    # lrscheduler = checkpoint["scheduler"]



model.train()
model.to(device)


model, training_losses, training_accs, test_accs = train_model(model, train_loader, test_loader, criterion, optimizer, lrscheduler, start_epoch, end_epoch)


# Prediction
To produce the submission, every input image is resized and normalized in the same way as the training dataset, without the random augmentations.

In [None]:
predict = {}
load_epoch = 15
print("Testing...")
model.load_state_dict(torch.load('{}.pth'.format(os.path.join("./model/", str(load_epoch))))["model"])
model.eval()

print("Test accuracy of epoch {}: {}".format(load_epoch, eval_model(model, test_loader)))

eval_trans = transforms.Compose([transforms.Resize((400, 400)),
                                 transforms.ToTensor(), # range [0, 255] -> [0.0,1.0]
                                 transforms.Normalize((0.5), (0.5))])
eval_dataset =  Dataset(dir_testing, csv_file, label_ids=None, transform=eval_trans, eval=True)
eval_loader = torch.utils.data.DataLoader(eval_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)  # no shuffling needed for inference

with torch.no_grad():
    for b, inputs in enumerate(eval_loader):
        imgs, img_names = inputs
        imgs = imgs.to(device)
        output = model(imgs)
        p = torch.argmax(output, 1)
        for i, img_name in enumerate(img_names):
            predict[img_name] = id_labels[p[i].item()]

with open('submission.csv', 'w',newline='') as csvfile:
    fieldnames=["id", "label"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for key, value in predict.items():
        writer.writerow({fieldnames[0]: key, fieldnames[1]: value})
print("Done")
    

# Discussion
Several findings emerged from my experiments:

1. The more complex the model is, the better the result it produces. (No overfitting was observed.)

2. Applying split-attention with resnet as the backbone (resnest) gave an improvement of 1~2%. The original paper (ResNeSt: Split-Attention Networks) reports an improvement of 3~4%; the gap may be due to my optimizer and lr scheduler settings.

3. Adjusting the learning rate based on the testing accuracy/error performs better than adjusting it by a fixed amount every epoch.

4. Reducing the learning rate to a small value near the end of training can help stabilize the model, leading to a slight improvement.
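
To give some intuition for finding 3, the sketch below works out how quickly a fixed schedule decays. It assumes the fixed schedule is the commented-out `StepLR(optimizer, step_size=1, gamma=0.5)` from the training cell, which halves the learning rate every epoch. `step_lr` is a hypothetical helper that only reproduces the arithmetic. Whether this fast decay explains the observed difference is not verified here.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` epochs of a fixed step schedule."""
    return initial_lr * gamma ** (epoch // step_size)

for epoch in (0, 6, 7, 21):
    print(epoch, step_lr(0.01, 0.5, 1, epoch))
```

Under these assumptions the learning rate drops below 1e-4, the floor used by the plateau scheduler, after 7 halvings, well before epoch 21.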