# **Homework 1: COVID-19 Cases Prediction (Regression)**

Author: Heng-Jui Chang

Slides: https://github.com/ga642381/ML2021-Spring/blob/main/HW01/HW01.pdf  
Video: TBA

Objectives:
* Solve a regression problem with deep neural networks (DNN).
* Understand basic DNN training tips.
* Get familiar with PyTorch.

If you have any questions, please contact the TAs via TA hours, NTU COOL, or email.


# **Download Data**


If the Google Drive links are dead, you can download the data from [Kaggle](https://www.kaggle.com/c/ml2021spring-hw1/data) and upload it manually to the workspace.

In [1]:
tr_path = 'covid.train.csv'  # path to training data
tt_path = 'covid.test.csv'   # path to testing data

!gdown --id '19CCyCgJrUxtvgZF53vnctJiOJ23T5mqF' --output covid.train.csv
!gdown --id '1CE240jLm2npU-tdz81-oVKEF3T2yfT1O' --output covid.test.csv

Downloading...
From: https://drive.google.com/uc?id=19CCyCgJrUxtvgZF53vnctJiOJ23T5mqF
To: /content/covid.train.csv
100% 2.00M/2.00M [00:00<00:00, 64.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1CE240jLm2npU-tdz81-oVKEF3T2yfT1O
To: /content/covid.test.csv
100% 651k/651k [00:00<00:00, 91.6MB/s]



# **Import Some Packages**

In [2]:
# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# For data preprocessing
import numpy as np
import csv
import os

# For plotting
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

from sklearn.model_selection import StratifiedKFold

myseed = 820  # set a random seed for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(myseed)
torch.manual_seed(myseed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(myseed)

# **Some Utilities**

You do not need to modify this part.

In [3]:
def get_device():
    ''' Get device (if GPU is available, use GPU) '''
    return 'cuda' if torch.cuda.is_available() else 'cpu'

def plot_learning_curve(loss_record, title=''):
    ''' Plot learning curve of your DNN (train & dev loss) '''
    total_steps = len(loss_record['train'])
    x_1 = range(total_steps)
    x_2 = x_1[::len(loss_record['train']) // len(loss_record['dev'])]
    figure(figsize=(6, 4))
    plt.plot(x_1, loss_record['train'], c='tab:red', label='train')
    plt.plot(x_2, loss_record['dev'], c='tab:cyan', label='dev')
    plt.ylim(0.0, 5.)
    plt.xlabel('Training steps')
    plt.ylabel('MSE loss')
    plt.title('Learning curve of {}'.format(title))
    plt.legend()
    plt.show()


def plot_pred(dv_set, model, device, lim=35., preds=None, targets=None):
    ''' Plot prediction of your DNN '''
    if preds is None or targets is None:
        model.eval()
        preds, targets = [], []
        for x, y in dv_set:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                preds.append(pred.detach().cpu())
                targets.append(y.detach().cpu())
        preds = torch.cat(preds, dim=0).numpy()
        targets = torch.cat(targets, dim=0).numpy()

    figure(figsize=(5, 5))
    plt.scatter(targets, preds, c='r', alpha=0.5)
    plt.plot([-0.2, lim], [-0.2, lim], c='b')
    plt.xlim(-0.2, lim)
    plt.ylim(-0.2, lim)
    plt.xlabel('ground truth value')
    plt.ylabel('predicted value')
    plt.title('Ground Truth v.s. Prediction')
    plt.show()

# **Preprocess**

We have three kinds of datasets:
* `train`: for training
* `dev`: for validation
* `test`: for testing (w/o target value)

## **Dataset**

The `COVID19Dataset` below does:
* read `.csv` files
* extract features
* split `covid.train.csv` into train/dev sets
* normalize features

Completing the `TODO` below may be enough to pass the medium baseline.

In [4]:
with open('covid.train.csv', 'r') as fp:
  data = list(csv.reader(fp))  # read all rows into a list; data[0] is the header, data[1] the first data row
  row = data[0][1:]            # header without the leading id column
  feats = list(range(52))
  feats.append(57)
  feats = feats + list(range(58, 70))
  feats.append(75)
  feats = feats + list(range(76, 88))

  data = np.array(data[1:])[:, 1:].astype(float)[:, feats]

  print(len(data))
  indices = [i for i in range(len(data)) if i % 5 == 0]
  print(indices)

  print(row)
  split1 = row[:40]
  print(split1)
  split1_1 = row[44:52]
  print(split1_1)
  t1 = row[57]
  print(t1)
  split2 = row[62:70]
  print(split2)
  t2 = row[75]
  print(t2)
  split3 = row[80:88]
  print(split3)
  result = split1 + [t1] + split2 + [t2] + split3
  print(len(result))

2700
[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300, 305, 310, 315, 320, 325, 330, 335, 340, 345, 350, 355, 360, 365, 370, 375, 380, 385, 390, 395, 400, 405, 410, 415, 420, 425, 430, 435, 440, 445, 450, 455, 460, 465, 470, 475, 480, 485, 490, 495, 500, 505, 510, 515, 520, 525, 530, 535, 540, 545, 550, 555, 560, 565, 570, 575, 580, 585, 590, 595, 600, 605, 610, 615, 620, 625, 630, 635, 640, 645, 650, 655, 660, 665, 670, 675, 680, 685, 690, 695, 700, 705, 710, 715, 720, 725, 730, 735, 740, 745, 750, 755, 760, 765, 770, 775, 780, 785, 790, 795, 800, 805, 810, 815, 820, 825, 830, 835, 840, 845, 850, 855, 860, 865, 870, 875, 880, 885, 890, 895, 900, 905, 910, 915, 920, 925, 930, 935, 940, 945, 950, 955, 960, 965, 970, 975, 980, 985, 990, 995, 1000, 1005, 1010

In [5]:
class COVID19Dataset(Dataset):
    ''' Dataset for loading and preprocessing the COVID19 dataset '''
    def __init__(self,
                 path,
                 fold_num,
                 mode='train',
                 target_only=False):
        self.mode = mode

        # Read data into numpy arrays
        with open(path, 'r') as fp:
            data = list(csv.reader(fp))  # read all rows into a list; data[0] is the header, data[1] the first data row
            data = np.array(data[1:])[:, 1:].astype(float)  # drop the header row and the first (id) column, then convert to a float array
        
        if not target_only:
            # feats = list(range(93))
            """
            feats = list(range(40))
            feats = feats + list(range(44, 52))
            feats.append(57)
            feats = feats + list(range(62, 70))
            feats.append(75)
            feats = feats + list(range(80, 88))
            """
            feats = list(range(52))
            feats.append(57)
            feats = feats + list(range(58, 70))
            feats.append(75)
            feats = feats + list(range(76, 88))

        else:
            # TODO: Using 40 states & 2 tested_positive features (indices = 57 & 75)
            # Use only the 40 state features and the two tested_positive features
            feats = list(range(40))
            feats.append(57)
            feats.append(75)

        if mode == 'test':
            # Testing data
            # data: 893 x 93 (40 states + day 1 (18) + day 2 (18) + day 3 (17))
            data = data[:, feats]
            self.data = torch.FloatTensor(data)  # convert data to a PyTorch tensor
        else:
            # Training data (train/dev sets)
            # data: 2700 x 94 (40 states + day 1 (18) + day 2 (18) + day 3 (18))
            target = data[:, -1]  # use the last column as the label
            data = data[:, feats]
            
            
            # Splitting training data into train & dev sets
            # Samples with i % 5 == fold_num form the dev set, the rest the train set (a 1:4 split)
            if mode == 'train':
                indices = [i for i in range(len(data)) if i % 5 != fold_num]
            elif mode == 'dev':
                indices = [i for i in range(len(data)) if i % 5 == fold_num]
            
            # Convert data into PyTorch tensors
            self.data = torch.FloatTensor(data[indices])
            self.target = torch.FloatTensor(target[indices])

        # Normalize features (you may remove this part to see what will happen)
        self.data[:, 40:] = \
            (self.data[:, 40:] - self.data[:, 40:].mean(dim=0, keepdim=True)) \
            / self.data[:, 40:].std(dim=0, keepdim=True)

        self.dim = self.data.shape[1]

        print('Finished reading the {} set of COVID19 Dataset ({} samples found, each dim = {})'
              .format(mode, len(self.data), self.dim))

    def __getitem__(self, index):
        # Returns one sample at a time
        if self.mode in ['train', 'dev']:
            # For training
            return self.data[index], self.target[index]
        else:
            # For testing (no target)
            return self.data[index]

    def __len__(self):
        # Returns the size of the dataset
        return len(self.data)
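
Note that the normalization step above standardizes each split with its own mean and standard deviation, so the train, dev, and test features are not scaled identically. A common alternative is to fit the statistics on the training split and reuse them everywhere. The sketch below illustrates that idea on random toy tensors; `fit_normalizer` and `apply_normalizer` are hypothetical helpers and are not part of the original code.

```python
import torch

def fit_normalizer(train_x, start_col=40):
    """Per-column mean/std of the columns from start_col on (training data only)."""
    mean = train_x[:, start_col:].mean(dim=0, keepdim=True)
    std = train_x[:, start_col:].std(dim=0, keepdim=True)
    return mean, std

def apply_normalizer(x, mean, std, start_col=40):
    """Standardize the columns from start_col on with statistics fitted elsewhere."""
    out = x.clone()                      # leave the input tensor untouched
    out[:, start_col:] = (out[:, start_col:] - mean) / std
    return out

# Toy data (assumption): 45 columns, the first 40 playing the role of the one-hot states.
torch.manual_seed(0)
train_x = torch.rand(100, 45) * 10
dev_x = torch.rand(20, 45) * 10

mean, std = fit_normalizer(train_x)
train_n = apply_normalizer(train_x, mean, std)
dev_n = apply_normalizer(dev_x, mean, std)   # dev reuses the training statistics
```

With this arrangement the dev and test splits would be transformed exactly as the training data was, at the cost of having to pass the fitted statistics into every dataset instance.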

## **DataLoader**

A `DataLoader` loads data from a given `Dataset` into batches.


In [6]:
def prep_dataloader(path, mode, batch_size, fold_num, n_jobs=0, target_only=False):
    ''' Generates a dataset and wraps it in a dataloader. '''
    dataset = COVID19Dataset(path, fold_num, mode=mode, target_only=target_only)  # Construct dataset
    dataloader = DataLoader(
        dataset, batch_size,
        shuffle=(mode == 'train'), drop_last=False,
        num_workers=n_jobs, pin_memory=True)                            # Construct dataloader
    return dataloader
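
To see what batching means in practice, here is a minimal sketch on a toy `TensorDataset` (an assumption for illustration; the homework uses `COVID19Dataset`). With `drop_last=False`, as in `prep_dataloader`, the final batch is simply shorter when the dataset size is not a multiple of the batch size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples with 3 features each, plus one scalar target per sample.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
targets = torch.arange(10, dtype=torch.float32)
toy_set = TensorDataset(features, targets)

# shuffle=False keeps the sample order; drop_last=False keeps the short final batch.
loader = DataLoader(toy_set, batch_size=4, shuffle=False, drop_last=False)

batch_sizes = [len(x) for x, y in loader]
print(batch_sizes)  # [4, 4, 2]
```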


# **Deep Neural Network**

`NeuralNet` is an `nn.Module` designed for regression.
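The DNN below consists of 3 fully-connected layers with SiLU activations and dropout after the first layer (the original sample network, 2 layers with ReLU, is kept in the commented-out block).
This module also includes a method `cal_loss` for calculating the loss.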


In [7]:
class NeuralNet(nn.Module):
    ''' A simple fully-connected deep neural network '''
    def __init__(self, input_dim):
        super(NeuralNet, self).__init__()

        # Define your neural network here
        # TODO: How to modify this model to achieve better performance?
        """self.net = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )"""

        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.Dropout(0.5),
            nn.SiLU(),
            nn.Linear(256, 128),
            nn.SiLU(),
            nn.Linear(128, 1)
        )


        # Mean squared error loss
        self.criterion = nn.MSELoss(reduction='mean')

    def forward(self, x):
        ''' Given input of size (batch_size x input_dim), compute output of the network '''
        return self.net(x).squeeze(1)

    def cal_loss(self, pred, target):
        ''' Calculate loss '''
        # TODO: you may implement L2 regularization here
        return self.criterion(pred, target)
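
The `TODO` in `cal_loss` mentions L2 regularization. One possible way to do it, sketched here on a tiny stand-in model, is to add a scaled sum of squared weights to the MSE. The helper `l2_penalty` and the coefficient `lam` are hypothetical names, and skipping the biases is a design choice rather than a requirement. The `weight_decay` option of the optimizer, which the config in this notebook already sets, is a closely related alternative.

```python
import torch
import torch.nn as nn

def l2_penalty(model, lam=1e-4):
    """lam times the sum of squared weights; bias parameters are skipped."""
    return lam * sum(p.pow(2).sum()
                     for name, p in model.named_parameters()
                     if 'weight' in name)

# Stand-in model with hand-set parameters so the numbers are easy to check.
model = nn.Linear(3, 1)
with torch.no_grad():
    model.weight.fill_(2.0)
    model.bias.fill_(5.0)

criterion = nn.MSELoss(reduction='mean')
x = torch.zeros(4, 3)
y = torch.full((4,), 5.0)

pred = model(x).squeeze(1)            # all predictions equal the bias, so MSE is 0
loss = criterion(pred, y) + l2_penalty(model, lam=0.1)
print(loss.item())                    # about 1.2 = 0.1 * (3 weights * 2.0 ** 2)
```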

# **Train/Dev/Test**

## **Training**

In [8]:
def train(tr_set, dv_set, model, config, device, fold_num):
    ''' DNN training '''

    n_epochs = config['n_epochs']  # Maximum number of epochs

    # Setup optimizer
    optimizer = getattr(torch.optim, config['optimizer'])(
        model.parameters(), **config['optim_hparas'])

    min_mse = 1000.
    loss_record = {'train': [], 'dev': []}      # for recording training loss
    early_stop_cnt = 0
    epoch = 0
    while epoch < n_epochs:
        model.train()                           # set model to training mode
        for x, y in tr_set:                     # iterate through the dataloader
            optimizer.zero_grad()               # set gradient to zero
            x, y = x.to(device), y.to(device)   # move data to device (cpu/cuda)
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
            mse_loss.backward()                 # compute gradient (backpropagation)
            optimizer.step()                    # update model with optimizer
            loss_record['train'].append(mse_loss.detach().cpu().item())

        # After each epoch, test your model on the validation (development) set.
        dev_mse = dev(dv_set, model, device)
        if dev_mse < min_mse:
            # Save model if your model improved
            min_mse = dev_mse
            print('Saving model (epoch = {:4d}, train loss = {:.4f}, dev loss = {:4f})'
                .format(epoch + 1, mse_loss, min_mse))
            torch.save(model.state_dict(), 'models/model' + str(fold_num) + '.pth')  # Save model to specified path
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1

        epoch += 1
        loss_record['dev'].append(dev_mse)
        if early_stop_cnt > config['early_stop']:
            # Stop training if your model stops improving for "config['early_stop']" epochs.
            break

    print('Finished training after {} epochs'.format(epoch))
    return min_mse, loss_record
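
The early-stopping rule in `train` can be replayed on a plain list of dev losses to see when it fires. This is only a sketch: `epochs_until_stop` is a hypothetical helper, and it starts from infinity rather than the `min_mse = 1000.` used above. Because the check is a strict `>`, training stops after `patience + 1` consecutive epochs without improvement.

```python
def epochs_until_stop(dev_losses, patience):
    """Return (epochs run, best epoch, best loss) under the rule used in train()."""
    best = float('inf')
    best_epoch = None
    wait = 0                                  # epochs since the last improvement
    for epoch, loss in enumerate(dev_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
        if wait > patience:                   # same strict comparison as in train()
            return epoch, best_epoch, best
    return len(dev_losses), best_epoch, best

# The loss at epoch 7 would have been better, but training already stopped at epoch 6.
result = epochs_until_stop([5, 4, 3, 3.5, 3.2, 3.1, 2.0], patience=2)
print(result)  # (6, 3, 3)
```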

## **Validation**

In [9]:
def dev(dv_set, model, device):
    model.eval()                                # set model to evaluation mode
    total_loss = 0
    for x, y in dv_set:                         # iterate through the dataloader
        x, y = x.to(device), y.to(device)       # move data to device (cpu/cuda)
        with torch.no_grad():                   # disable gradient calculation
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
        total_loss += mse_loss.detach().cpu().item() * len(x)  # accumulate loss
    total_loss = total_loss / len(dv_set.dataset)              # compute averaged loss

    return total_loss
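
`dev` multiplies each batch loss by the batch size before dividing by the dataset size. The toy numbers below (made up for illustration) show why: when the last batch is shorter, a plain mean of per-batch means differs from the MSE over all samples, while the size-weighted version matches it.

```python
import numpy as np

# Squared errors of 10 samples, split into batches of size 4, 4, and 2.
sq_err = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 4.0])
batches = [sq_err[0:4], sq_err[4:8], sq_err[8:10]]

# Weighted by batch size, as in dev().
weighted = sum(b.mean() * len(b) for b in batches) / len(sq_err)
# Plain mean of batch means: the short final batch counts too much.
unweighted = sum(b.mean() for b in batches) / len(batches)
overall = sq_err.mean()

print(weighted, unweighted, overall)  # 1.6 2.0 1.6
```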

## **Testing**

In [10]:
def test(tt_set, model, device):
    model.eval()                                # set model to evaluation mode
    preds = []
    for x in tt_set:                            # iterate through the dataloader
        x = x.to(device)                        # move data to device (cpu/cuda)
        with torch.no_grad():                   # disable gradient calculation
            pred = model(x)                     # forward pass (compute output)
            preds.append(pred.detach().cpu())   # collect prediction
    preds = torch.cat(preds, dim=0).numpy()     # concatenate all predictions and convert to a numpy array
    return preds

# **Setup Hyper-parameters**

`config` contains hyper-parameters for training and the path to save your model.

In [11]:
device = get_device()                 # get the current available device ('cpu' or 'cuda')
os.makedirs('models', exist_ok=True)  # The trained model will be saved to ./models/
target_only = False                   # TODO: Using 40 states & 2 tested_positive features

# TODO: How to tune these hyper-parameters to improve your model's performance?
config = {
    'fold_num': 5,
    'n_epochs': 5000,                # maximum number of epochs
    'batch_size': 200,               # mini-batch size for dataloader
    'optimizer': 'SGD',              # optimization algorithm (optimizer in torch.optim)
    'optim_hparas': {                # hyper-parameters for the optimizer (depends on which optimizer you are using)
        'lr': 0.001,                 # learning rate of SGD
        'momentum': 0.9,             # momentum for SGD
        'weight_decay': 1e-6         # weight decay (L2 penalty) for SGD
        #'lr': 1e-4,                 # learning rate of Adam
        #'weight_decay': 1e-6        # weight decay for Adam
    },
    'early_stop': 300,               # early stopping patience (number of epochs since your model's last improvement)
    # 'save_path': 'models/model.pth'  # your model will be saved here
}
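
The `'optimizer'` entry is a class name that `train` looks up in `torch.optim` with `getattr`, and `'optim_hparas'` is unpacked as keyword arguments. A minimal sketch of that lookup, using a stand-in model and a hypothetical `toy_config`:

```python
import torch
import torch.nn as nn

toy_config = {
    'optimizer': 'SGD',
    'optim_hparas': {'lr': 0.001, 'momentum': 0.9, 'weight_decay': 1e-6},
}
toy_model = nn.Linear(4, 1)   # stand-in for NeuralNet

# getattr(torch.optim, 'SGD') is the class torch.optim.SGD.
optimizer_cls = getattr(torch.optim, toy_config['optimizer'])
optimizer = optimizer_cls(toy_model.parameters(), **toy_config['optim_hparas'])
print(type(optimizer).__name__)  # SGD
```

Switching to Adam would then mean changing the string to `'Adam'` and replacing the hyper-parameters with ones that `torch.optim.Adam` accepts (it has no `momentum` argument).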

# **Start Training**

In [12]:
for fold in range(5):
  print('Model ', fold, ' start training!')
  tr_set = prep_dataloader(tr_path, 'train', config['batch_size'], fold, target_only=target_only)
  dv_set = prep_dataloader(tr_path, 'dev', config['batch_size'], fold, target_only=target_only)
  model = NeuralNet(tr_set.dataset.dim).to(device)  # Construct model and move to device
  model_loss, model_loss_record = train(tr_set, dv_set, model, config, device, fold)

  del model

Model  0  start training!
Finished reading the train set of COVID19 Dataset (2160 samples found, each dim = 78)
Finished reading the dev set of COVID19 Dataset (540 samples found, each dim = 78)
Saving model (epoch =    1, train loss = 251.5282, dev loss = 236.567426)
Saving model (epoch =    2, train loss = 68.4863, dev loss = 92.446802)
Saving model (epoch =    3, train loss = 20.3497, dev loss = 34.026524)
Saving model (epoch =    4, train loss = 12.8303, dev loss = 14.887577)
Saving model (epoch =    5, train loss = 7.5224, dev loss = 8.184994)
Saving model (epoch =    6, train loss = 5.1088, dev loss = 7.311154)
Saving model (epoch =    7, train loss = 4.4401, dev loss = 4.179840)
Saving model (epoch =    8, train loss = 3.1419, dev loss = 3.223811)
Saving model (epoch =    9, train loss = 3.4617, dev loss = 2.666975)
Saving model (epoch =   11, train loss = 2.3772, dev loss = 2.435452)
Saving model (epoch =   12, train loss = 3.6670, dev loss = 2.368988)
Saving model (epoch =   1
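
The 2160/540 sample counts in the log come from the index-modulo split inside `COVID19Dataset`. The sketch below reproduces just that split with a hypothetical helper `fold_indices` and checks that the five dev folds partition the 2700 training rows.

```python
def fold_indices(n_samples, fold, n_folds=5):
    """Train/dev indices for one fold, using the same i % 5 rule as the dataset class."""
    dev = [i for i in range(n_samples) if i % n_folds == fold]
    train = [i for i in range(n_samples) if i % n_folds != fold]
    return train, dev

n_samples = 2700
train_idx, dev_idx = fold_indices(n_samples, fold=0)
print(len(train_idx), len(dev_idx))  # 2160 540

# Every sample lands in the dev set of exactly one fold.
all_dev = sorted(i for f in range(5) for i in fold_indices(n_samples, f)[1])
```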

# **Testing**
The predictions of your model on the testing set will be stored in `pred.csv`.

In [17]:
def save_pred(preds, file):
    ''' Save predictions to specified file '''
    print('Saving results to {}'.format(file))
    with open(file, 'w', newline='') as fp:  # newline='' is recommended when writing with the csv module
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])

pred = []  # one array of test predictions per fold
for i in range(5):
  model = NeuralNet(tr_set.dataset.dim).to(device)
  ckpt = torch.load('models/model' + str(i) + '.pth', map_location='cpu')  # Load the best model of fold i
  model.load_state_dict(ckpt)
  tt_set = prep_dataloader(tt_path, 'test', config['batch_size'], i, target_only=target_only)
  preds = test(tt_set, model, device)  # predict COVID-19 cases with this fold's model
  pred.append(preds)                   # reuse the result instead of running inference twice
  print(preds[:5])
  print(preds.shape)
preds = np.concatenate(pred)       # concatenate the per-fold predictions into one 1-D array
preds = preds.reshape((-1, 893))   # reshape to (5, 893): one row per fold
preds = np.mean(preds, axis=0)     # axis=0: average over folds for each test sample
save_pred(preds, 'pred.csv')         # save prediction file to pred.csv

Finished reading the test set of COVID19 Dataset (893 samples found, each dim = 78)
[20.461685   3.1870391  3.3155158 10.840319   3.2112334]
(893,)
Finished reading the test set of COVID19 Dataset (893 samples found, each dim = 78)
[20.769821   2.958387   3.060844  10.860355   3.6221693]
(893,)
Finished reading the test set of COVID19 Dataset (893 samples found, each dim = 78)
[20.301563   2.6114378  2.5979648 10.354824   2.866485 ]
(893,)
Finished reading the test set of COVID19 Dataset (893 samples found, each dim = 78)
[20.40987    2.3792107  2.803878  10.360481   2.7472353]
(893,)
Finished reading the test set of COVID19 Dataset (893 samples found, each dim = 78)
[20.694126   2.5977337  2.783979  10.367838   2.8152301]
(893,)
(5, 893)
[20.461685   3.1870391  3.3155158 10.840319   3.2112334]
[20.769821   2.958387   3.060844  10.860355   3.6221693]
[20.301563   2.6114378  2.5979648 10.354824   2.866485 ]
(893,)
20.527414
Saving results to pred.csv
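
The final prediction is the per-sample mean over the five fold models. A small sketch with made-up numbers (three folds, three test samples) shows the averaging step; `np.stack` is used here as an alternative to the concatenate-and-reshape in the cell above, and both give the same result.

```python
import numpy as np

# Hypothetical per-fold predictions for 3 test samples.
fold_preds = [
    np.array([20.0, 3.0, 10.0]),
    np.array([21.0, 2.0, 11.0]),
    np.array([19.0, 4.0, 9.0]),
]

stacked = np.stack(fold_preds)        # shape (n_folds, n_samples)
ensemble = stacked.mean(axis=0)       # average over folds for each sample

# Same result via concatenate + reshape, as done in the notebook.
alt = np.concatenate(fold_preds).reshape(-1, 3).mean(axis=0)
print(ensemble)  # [20.  3. 10.]
```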


# **Hints**

## **Simple Baseline**
* Run sample code

## **Medium Baseline**
* Feature selection: 40 states + 2 `tested_positive` (`TODO` in dataset)

## **Strong Baseline**
* Feature selection (what other features are useful?)
* DNN architecture (layers? dimension? activation function?)
* Training (mini-batch? optimizer? learning rate?)
* L2 regularization
* There are some mistakes in the sample code, can you find them?
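
For the feature-selection hint, one possible starting point (an assumption, not the method prescribed by the homework) is to rank columns by the absolute Pearson correlation with the target. The sketch below uses synthetic data and a hypothetical helper `rank_by_correlation`; it assumes no column is constant, since a constant column would make the denominator zero.

```python
import numpy as np

def rank_by_correlation(features, target):
    """Column indices sorted by |Pearson correlation| with target, strongest first."""
    fc = features - features.mean(axis=0)
    tc = target - target.mean()
    denom = np.sqrt((fc ** 2).sum(axis=0) * (tc ** 2).sum())
    corr = (fc * tc[:, None]).sum(axis=0) / denom
    return np.argsort(-np.abs(corr)), corr

# Synthetic data: column 0 is pure noise, column 1 tracks the target closely,
# column 2 is a noisier, sign-flipped copy of the target.
rng = np.random.default_rng(0)
noise = rng.normal(size=(200, 3))
target = rng.normal(size=200)
features = np.column_stack([
    noise[:, 0],
    target + 0.1 * noise[:, 1],
    -target + 1.0 * noise[:, 2],
])

order, corr = rank_by_correlation(features, target)
```

On the real data, `features` would be the columns of `covid.train.csv` and `target` its last column. Correlation only captures linear relationships, so the ranking is a rough guide at best.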

# **Reference**
This code was written entirely by Heng-Jui Chang @ NTUEE.  
If you copy or reuse this code, you must credit the original author.

E.g.  
Source: Heng-Jui Chang @ NTUEE (https://github.com/ga642381/ML2021-Spring/blob/main/HW01/HW01.ipynb)
