# **Homework 1: COVID-19 Cases Prediction (Regression)**

Objectives:
* Solve a regression problem with deep neural networks (DNN).
* Understand basic DNN training tips.
* Familiarize yourself with PyTorch.

If you have any questions, please contact the TAs via TA hours, NTU COOL, or email at mlta-2023-spring@googlegroups.com.

In [None]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


# Download data
If the Google Drive links below do not work, use the Dropbox links below or download the data from [Kaggle](https://www.kaggle.com/t/a339b77fa5214978bfb8dde62d3151fe) and upload it manually to the workspace.

In [9]:
# google drive link
!pip install gdown
!gdown --id '1BjXalPZxq9mybPKNjF3h5L3NcF7XKTS-' --output covid_train.csv
!gdown --id '1B55t74Jg2E5FCsKCsUEkPKIuqaY7UIi1' --output covid_test.csv

# dropbox link
!wget -O covid_train.csv https://www.dropbox.com/s/lmy1riadzoy0ahw/covid.train.csv?dl=0
!wget -O covid_test.csv https://www.dropbox.com/s/zalbw42lu4nmhr2/covid.test.csv?dl=0

Downloading...
From: https://drive.google.com/uc?id=1BjXalPZxq9mybPKNjF3h5L3NcF7XKTS-
To: /content/covid_train.csv
100% 2.16M/2.16M [00:00<00:00, 11.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1B55t74Jg2E5FCsKCsUEkPKIuqaY7UIi1
To: /content/covid_test.csv
100% 638k/638k [00:00<00:00, 94.9MB/s]
--2025-09-12 14:49:12--  https://www.dropbox.com/s/lmy1riadzoy0ahw/covid.train.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.dropbox.com/scl/fi/ewl0ff7lviu0s7f53jp9o/covid.train.csv?rlkey=pocojbo26thh2ncv0xkxfafiv&dl=0 [following]
--2025-09-12 14:49:12--  https://www.dropbox.com/scl/fi/ewl0ff7lviu0s7f53jp9o/covid.train.csv?rlkey=pocojbo26thh2ncv0xkxfafiv&dl=0
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uce0d9

# Import packages

In [1]:
# Numerical Operations
import math

import numpy as np

# Reading/Writing Data
import pandas as pd
import os
import csv

# For Progress Bar
from tqdm import tqdm

# Pytorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split

# For plotting learning curve
from torch.utils.tensorboard import SummaryWriter

# For selecting features
from sklearn.feature_selection import SelectKBest, f_regression

# Some Utility Functions

You do not need to modify this part.

In [2]:
def same_seed(seed):
    '''Fixes random number generator seeds so that results are reproducible across runs.'''
    torch.backends.cudnn.deterministic = True  # Force cuDNN to use deterministic algorithms (avoids GPU nondeterminism).
    torch.backends.cudnn.benchmark = False  # Disable cuDNN auto-tuning, which can make results differ between runs.
    np.random.seed(seed)  # Seed NumPy's random number generator.
    torch.manual_seed(seed)  # Seed PyTorch's CPU random number generator.
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # Seed PyTorch's random number generators on all GPUs.

def train_valid_split(data_set, valid_ratio, seed):  # valid_ratio: fraction of the data held out for validation
    '''Split provided training data into training set and validation set'''
    valid_set_size = int(valid_ratio * len(data_set))
    train_set_size = len(data_set) - valid_set_size
    # Randomly split with the given sizes; the seeded generator keeps the split reproducible.
    train_set, valid_set = random_split(data_set, [train_set_size, valid_set_size], generator=torch.Generator().manual_seed(seed))
    return np.array(train_set), np.array(valid_set)  # Return both splits as NumPy arrays.

def predict(test_loader, model, device):
    '''Runs the trained model on the test set and returns its predictions.'''
    model.eval()  # Set your model to evaluation mode (disables dropout and stops batchnorm statistics from updating).
    preds = []
    for x in tqdm(test_loader):  # Iterate over the test loader (tqdm shows a progress bar).
        x = x.to(device)  # Move the batch to the GPU or CPU.
        with torch.no_grad():  # Disable gradient tracking to speed up inference and save memory.
            pred = model(x)  # Forward pass to get predictions.
            preds.append(pred.detach().cpu())  # Collect each batch's predictions in a list.
    preds = torch.cat(preds, dim=0).numpy()  # Concatenate all batches and convert to a NumPy array.
    return preds
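
To see why `train_valid_split` passes a seeded `torch.Generator` to `random_split`, the minimal sketch below splits a toy list twice with the same seed and checks that both runs produce the same partition. The toy data, the `split_indices` helper, and the 8/2 sizes are illustrative assumptions, not part of the homework code.

```python
import torch
from torch.utils.data import random_split

# Toy stand-in for the rows of a dataset (assumption for illustration).
data = list(range(10))

def split_indices(seed):
    """Hypothetical helper: return the row indices of a seeded 8/2 split."""
    gen = torch.Generator().manual_seed(seed)
    train, valid = random_split(data, [8, 2], generator=gen)
    return list(train.indices), list(valid.indices)

first = split_indices(0)
second = split_indices(0)

# The same seed gives the same partition on every call.
print(first == second)
```

Because the generator is created and seeded inside the call, the split does not depend on how much global random state was consumed earlier in the notebook.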

# Dataset

In [3]:
class COVID19Dataset(Dataset):
    '''
    x: Features.
    y: Targets, if none, do prediction.
    '''
    def __init__(self, x, y=None):
        if y is None:
            self.y = y  # No targets (prediction stage), so no labels are stored.
        else:
            self.y = torch.FloatTensor(y)  # Targets given (training/validation stage): convert them to a torch.FloatTensor.
        self.x = torch.FloatTensor(x)

    # Defines what dataset[i] returns.
    def __getitem__(self, idx):
        if self.y is None:
            return self.x[idx]  # No labels (prediction stage): return only the features x[idx].
        else:
            return self.x[idx], self.y[idx]  # With labels (training/validation stage): return the pair (x[idx], y[idx]).


    def __len__(self):
        return len(self.x)

# Neural Network Model
Try out different model architectures by modifying the class below.

In [4]:
class My_Model(nn.Module):
    def __init__(self, input_dim):
        super(My_Model, self).__init__()
        # TODO: modify model's structure, be aware of dimensions.
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 8),
            nn.ReLU(),
            nn.Linear(8, 4),
            nn.ReLU(),
            nn.Linear(4, 1)
        )

    def forward(self, x):
        x = self.layers(x)
        x = x.squeeze(1) # (B, 1) -> (B,): drop dimension 1 of the output, where B is the batch size,
        # so that it lines up with the labels y in later computations.
        return x
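
The `squeeze(1)` in `forward` matters because the labels have shape `(B,)`. The sketch below, which uses a made-up layer and random tensors purely for illustration, shows that subtracting a `(B,)` target from a `(B, 1)` output broadcasts to `(B, B)`, whereas the squeezed output stays elementwise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions): batch of 3 samples with 4 features each.
layer = nn.Linear(4, 1)
x = torch.randn(3, 4)
y = torch.randn(3)

raw = layer(x)         # shape (3, 1)
pred = raw.squeeze(1)  # shape (3,)

# Without the squeeze, broadcasting pairs every prediction with every target.
broadcast_diff = raw - y   # shape (3, 3)
aligned_diff = pred - y    # shape (3,)

print(raw.shape, pred.shape)
print(broadcast_diff.shape, aligned_diff.shape)
```

A loss computed on the broadcast difference would average over all prediction/target pairs rather than matching each sample to its own label.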

# Feature Selection
Choose features you deem useful by modifying the function below.

In [5]:
# SelectKBest and f_regression are used for feature selection:
#   SelectKBest: keeps the K highest-scoring features.
#   f_regression: scores the linear relationship between a continuous feature and a continuous target (F-score).
def select_top_features_regression(df_features, y, top_k):
    # df_features: feature matrix as a DataFrame; y: target vector (continuous); top_k: number of features to select.
    selector = SelectKBest(score_func=f_regression, k=top_k)  # Create the selector with f_regression as the scoring function.
    selector.fit(df_features, y)  # Compute each feature's F-score against the target.

    # Boolean mask over the columns; True marks a selected feature.
    mask = selector.get_support()

    # Convert the mask to integer column indices.
    selected_indices = [i for i, x in enumerate(mask) if x]

    # Collect every feature's F-score in a DataFrame, sorted from highest to lowest,
    # so the most important features are easy to inspect.
    feature_scores = pd.DataFrame({
        "feature": df_features.columns,
        "score": selector.scores_,
        "index": range(len(df_features.columns))
    }).sort_values(by="score", ascending=False)


    # Print the top_k features and their F-scores for inspection.
    print(f"Top {top_k} features for regression:")
    print(feature_scores.head(top_k))

    return selected_indices
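
As a rough illustration of what `SelectKBest` with `f_regression` does, the sketch below builds a synthetic DataFrame in which only two columns drive the target and checks which columns the selector keeps. The column names, coefficients, and sample size are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200

# Synthetic features (assumption): two informative columns, two pure-noise columns.
df = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
# The target depends linearly on the two signal columns plus a little noise.
y = 3.0 * df["signal_a"] - 2.0 * df["signal_b"] + 0.1 * rng.normal(size=n)

selector = SelectKBest(score_func=f_regression, k=2).fit(df, y)

# get_support() returns a boolean mask in column order.
selected = list(df.columns[selector.get_support()])
print(selected)
```

Note that `f_regression` scores each feature on its own, so it can miss features that are only useful in combination with others.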


In [7]:
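df_origin = pd.read_csv("./data/covid_train.csv").drop(columns = ['id']) # Load the data and drop the id column.
df_test = pd.read_csv("./data/covid_test.csv").drop(columns = ['id'])

# Caution: 'tested_positive' is not the last column; the label that select_feat
# uses (the last column) is 'tested_positive.2'.
target = df_origin['tested_positive']


# df_origin is passed whole, so the label column is also scored as a candidate feature.
list_column_features = select_top_features_regression(df_origin, target, 16)
# Rule of thumb: keeping the number of samples well above the number of features
# (usually at least 10x) is safer and helps avoid overfitting.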


Top 16 features for regression:
              feature          score  index
69  tested_positive.1  256663.154076     69
87  tested_positive.2  104538.477954     87
46       hh_cmnty_cli   13524.111217     46
47     nohh_cmnty_cli   13432.693434     47
36    wnohh_cmnty_cli   11640.642445     36
65   nohh_cmnty_cli.1   11290.342757     65
64     hh_cmnty_cli.1   11284.344366     64
54  wnohh_cmnty_cli.1    9820.753823     54
83   nohh_cmnty_cli.2    9438.214125     83
82     hh_cmnty_cli.2    9370.679094     82
72  wnohh_cmnty_cli.2    8248.330144     72
34                cli    7577.491969     34
35                ili    7564.492163     35
52              cli.1    6448.391622     52
53              ili.1    6433.907516     53
70              cli.2    5435.753285     70


In [8]:
df_features = df_origin.iloc[:, list_column_features]

# Keep only the columns that df_features and df_test have in common.
common_cols = df_features.columns.intersection(df_test.columns)

df_test = df_test[common_cols]

df_features

Unnamed: 0,cli,ili,wnohh_cmnty_cli,hh_cmnty_cli,nohh_cmnty_cli,cli.1,ili.1,wnohh_cmnty_cli.1,hh_cmnty_cli.1,nohh_cmnty_cli.1,tested_positive.1,cli.2,wnohh_cmnty_cli.2,hh_cmnty_cli.2,nohh_cmnty_cli.2,tested_positive.2
0,1.509413,1.511169,18.583362,23.857405,18.757247,1.451798,1.460472,17.684337,22.897375,18.037506,18.876155,1.308107,17.194312,22.686202,17.583283,18.490787
1,1.451798,1.460472,17.684337,22.897375,18.037506,1.308107,1.366300,17.194312,22.686202,17.583283,18.490787,1.406672,16.733442,22.484758,17.219515,16.329253
2,1.308107,1.366300,17.194312,22.686202,17.583283,1.406672,1.488543,16.733442,22.484758,17.219515,16.329253,1.381060,16.580258,22.506261,17.128204,16.522931
3,1.406672,1.488543,16.733442,22.484758,17.219515,1.381060,1.453365,16.580258,22.506261,17.128204,16.522931,1.307137,17.291188,22.369951,17.069263,15.578501
4,1.381060,1.453365,16.580258,22.506261,17.128204,1.307137,1.400021,17.291188,22.369951,17.069263,15.578501,1.206659,16.705053,21.440588,16.207377,14.171920
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3004,1.145430,1.174613,11.825744,15.376992,11.689749,1.121511,1.150313,11.772172,15.420157,11.780117,5.910541,1.049163,11.764646,15.416655,11.389769,6.487310
3005,1.121511,1.150313,11.772172,15.420157,11.780117,1.049163,1.077583,11.764646,15.416655,11.389769,6.487310,1.210881,12.415059,15.755713,11.811906,6.112827
3006,1.049163,1.077583,11.764646,15.416655,11.389769,1.210881,1.239045,12.415059,15.755713,11.811906,6.112827,1.257293,12.270085,15.538073,11.435870,6.151394
3007,1.210881,1.239045,12.415059,15.755713,11.811906,1.257293,1.238664,12.270085,15.538073,11.435870,6.151394,1.647812,13.535265,15.980730,11.592346,7.165580


In [9]:
def select_feat(train_data, valid_data, test_data, feat_idx = None):
    '''Selects useful features to perform regression'''
    y_train, y_valid = train_data[:,-1], valid_data[:,-1]
    raw_x_train, raw_x_valid, raw_x_test = train_data[:,:-1], valid_data[:,:-1], test_data

    if feat_idx is None:
        feat_idx = list(range(raw_x_train.shape[1]))
    # TODO: Select suitable feature columns.

    return raw_x_train[:,feat_idx], raw_x_valid[:,feat_idx], raw_x_test[:,feat_idx], y_train, y_valid

# Training Loop

In [10]:
def trainer(train_loader, valid_loader, model, config, device):

    criterion = nn.MSELoss(reduction='mean') # Define your loss function, do not modify this.
    # reduction='mean' averages the loss over the samples in a batch.

    # Define your optimization algorithm.
    # TODO: Please check https://pytorch.org/docs/stable/optim.html to get more available algorithms.
    # TODO: L2 regularization (optimizer(weight decay...) or implement by yourself).

    # Use Adam as the optimizer.
    optimizer = torch.optim.Adam(model.parameters(), lr=config['learning_rate'], weight_decay=config['weight_decay'])

    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

    writer = SummaryWriter() # Writer of TensorBoard.

    if not os.path.isdir('./models'):
        os.mkdir('./models') # Create directory of saving models.

    n_epochs, best_loss, step, early_stop_count = config['n_epochs'], math.inf, 0, 0

    for epoch in range(n_epochs):
        model.train() # Set your model to train mode (enables dropout and batchnorm updates).
        loss_record = []

        # tqdm is a package to visualize your training progress.
        train_pbar = tqdm(train_loader, position=0, leave=True)

        for x, y in train_pbar:
            optimizer.zero_grad()               # Set gradient to zero.
            x, y = x.to(device), y.to(device)   # Move your data to device.
            pred = model(x)
            loss = criterion(pred, y)
            loss.backward()                     # Compute gradient(backpropagation).
            optimizer.step()                    # Update parameters.
            step += 1
            loss_record.append(loss.detach().item())

            # Display current epoch number and loss on tqdm progress bar.
            train_pbar.set_description(f'Epoch [{epoch+1}/{n_epochs}]')
            train_pbar.set_postfix({'loss': loss.detach().item()})

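        scheduler.step() # Advance the learning-rate schedule once per epoch; without this call the scheduler has no effect.

        mean_train_loss = sum(loss_record)/len(loss_record)
        writer.add_scalar('Loss/train', mean_train_loss, step)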

        model.eval() # Set your model to evaluation mode.
        loss_record = []
        for x, y in valid_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                loss = criterion(pred, y)

            loss_record.append(loss.item())

        mean_valid_loss = sum(loss_record)/len(loss_record)
        print(f'Epoch [{epoch+1}/{n_epochs}]: Train loss: {mean_train_loss:.4f}, Valid loss: {mean_valid_loss:.4f}')
        writer.add_scalar('Loss/valid', mean_valid_loss, step)

        if mean_valid_loss < best_loss:
            best_loss = mean_valid_loss
            torch.save(model.state_dict(), config['save_path']) # Save your best model
            print('Saving model with loss {:.3f}...'.format(best_loss))
            early_stop_count = 0
        else:
            early_stop_count += 1

        if early_stop_count >= config['early_stop']:
            print('\nModel is not improving, so we halt the training session.')
            return
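
The trainer builds a `CosineAnnealingWarmRestarts` scheduler with `T_0=10` and `T_mult=2`. The sketch below records the learning rate such a scheduler would produce when `scheduler.step()` is called once per epoch. The dummy parameter, the SGD optimizer, and the base rate of 0.1 are assumptions chosen only to make the numbers easy to read.

```python
import torch

# Dummy parameter and optimizer (assumptions for illustration only).
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2
)

lrs = []
for epoch in range(31):
    lrs.append(optimizer.param_groups[0]["lr"])  # Learning rate used in this epoch.
    optimizer.step()   # Stands in for one epoch of parameter updates.
    scheduler.step()   # Advance the schedule once per epoch.

# The rate decays along a cosine within each cycle and jumps back to the
# base value at epochs 10 and 30, because the cycle length doubles (10, then 20).
print([round(lr, 4) for lr in lrs[:11]])
```

If `scheduler.step()` is never called, the optimizer keeps the learning rate it was constructed with for the whole run.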

# Configurations
`config` contains hyper-parameters for training and the path to save your model.

In [11]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = {
    "weight_decay": 1e-4,  # L2 regularization strength, passed to the optimizer as weight_decay.
    'seed': 10101,      # Your seed number, you can pick your lucky number. :)
    #'select_all': True,   # Whether to use all features.
    'valid_ratio': 0.2,   # validation_size = train_size * valid_ratio
    'n_epochs': 10000,     # Number of epochs.
    'batch_size': 64,
    'learning_rate': 1e-4,
    'early_stop': 300,    # If model has not improved for this many consecutive epochs, stop training.
    # Rule of thumb: an early_stop of roughly 1%-5% of n_epochs is usually reasonable.
    'save_path': './models/model.ckpt'  # Your model will be saved here
}
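
The `early_stop` entry acts as a patience counter: training halts once the validation loss has failed to improve for that many consecutive epochs. The sketch below replays that rule on a made-up sequence of validation losses. The `epochs_until_stop` helper and the loss values are hypothetical and exist only for this illustration.

```python
def epochs_until_stop(valid_losses, patience):
    """Hypothetical helper: return (epochs_run, best_loss) under patience-based early stopping."""
    best_loss = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss = loss      # New best: remember it and reset the counter.
            stale_epochs = 0
        else:
            stale_epochs += 1     # No improvement this epoch.
        if stale_epochs >= patience:
            return epoch, best_loss
    return len(valid_losses), best_loss

# Made-up validation losses: the improvement at epoch 6 is never reached with patience 3.
losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
short_patience = epochs_until_stop(losses, patience=3)
long_patience = epochs_until_stop(losses, patience=4)

print(short_patience)  # (5, 0.8)
print(long_patience)   # (6, 0.7)
```

A small patience saves time but can stop before a late improvement, which is the trade-off to keep in mind when tuning `early_stop` against `n_epochs`.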

# Dataloader
Read data from files and set up training, validation, and testing sets. You do not need to modify this part.

In [13]:
# Set seed for reproducibility
same_seed(config['seed'])


# train_data size: 3009 x 89 (35 states + 18 features x 3 days)
# test_data size: 997 x 88 (without last day's positive rate)
train_data, test_data = pd.read_csv('./data/covid_train.csv').values, pd.read_csv('./data/covid_test.csv').values
train_data, valid_data = train_valid_split(train_data, config['valid_ratio'], config['seed'])

# Print out the data size.
print(f"""train_data size: {train_data.shape}
valid_data size: {valid_data.shape}
test_data size: {test_data.shape}""")

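# Select features
# Note: list_column_features was computed on a DataFrame with the 'id' column dropped, while
# train_data here still contains it, so each index lands one column to the left of the feature it was chosen for.
x_train, x_valid, x_test, y_train, y_valid = select_feat(train_data, valid_data, test_data, list_column_features)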

# Print out the number of features.
print(f'number of features: {x_train.shape[1]}')

train_dataset, valid_dataset, test_dataset = COVID19Dataset(x_train, y_train), \
                                            COVID19Dataset(x_valid, y_valid), \
                                            COVID19Dataset(x_test)

# The PyTorch data loader loads a PyTorch dataset in batches.
train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
valid_loader = DataLoader(valid_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=config['batch_size'], shuffle=False, pin_memory=True)

train_data size: (2408, 89)
valid_data size: (601, 89)
test_data size: (997, 88)
number of features: 16


# Start training!

In [None]:
model = My_Model(input_dim=x_train.shape[1]).to(device) # put your model and data on the same computation device.
trainer(train_loader, valid_loader, model, config, device)

# Testing
The predictions of your model on the testing set will be stored in `pred.csv`.

In [14]:
def save_pred(preds, file):
    ''' Save predictions to specified file '''
    with open(file, 'w', newline='') as fp:  # newline='' prevents blank rows between records on Windows.
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])

model = My_Model(input_dim=x_train.shape[1]).to(device)
model.load_state_dict(torch.load(config['save_path'], map_location=device))  # map_location lets a GPU-saved checkpoint load on CPU.
preds = predict(test_loader, model, device)
save_pred(preds, 'pred.csv')

100%|██████████| 16/16 [00:00<00:00, 1774.85it/s]


# Download

Run this block, then click the link to download `pred.csv`.

In [None]:
from IPython.display import FileLink
FileLink(r'pred.csv')

# Reference
This notebook uses code written by Heng-Jui Chang @ NTUEE (https://github.com/ga642381/ML2021-Spring/blob/main/HW01/HW01.ipynb)