<h1 style="text-align: center">Introduction to Machine Learning: Problem Set 6</h1>

<h2 style="text-align: center;">221300079, 王俊童, <a href="mailto:221300079@smail.nju.edu.cn">221300079@smail.nju.edu.cn</a></h2>

In [None]:
# Record the execution time of every cell

try:
    %load_ext autotime
except:
    !pip install ipython-autotime
    %load_ext autotime

In [1]:
# Import libraries (standard-library modules here; third-party packages in the next cell)

import os, re, glob, time, random, datetime
import multiprocessing as mp

GLOBAL_START_TIME = time.time()

# # !pip install ipywidgets widgetsnbextension pandas-profiling
# from tqdm.notebook import trange, tqdm


In [2]:

import numpy as np
import pandas as pd
import sklearn
import joblib

import torch

# mindspore is only needed for the bonus question 2 (4), so treat it as optional.
try:
    import mindspore
except ModuleNotFoundError:
    pass

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import optuna

%matplotlib inline

matplotlib.rcParams['figure.dpi'] = 128
matplotlib.rcParams['figure.figsize'] = (8, 6)

In [None]:
# Fix the random seeds

GLOBAL_SEED = 0

def fix_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if 'mindspore' in globals():
        mindspore.set_seed(seed)

fix_seed(GLOBAL_SEED)

# 1 [20pts] Data Processing

## (1) [5pts] Load the data

Load the data from `./data/`: features as `np.float64` and labels as `np.int64`. Note that the labels in the raw data start at `1`; convert them so that they start at `0`.
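
A minimal sketch of the loading step, assuming whitespace-separated text files. The `load_split` helper and the demo file names are hypothetical and are not the actual names under `./data/`; the demo writes tiny temporary files so that it runs on its own.

```python
import os
import tempfile

import numpy as np


def load_split(x_path: str, y_path: str):
    """Load one split; the file layout is an assumption made for illustration."""
    x = np.loadtxt(x_path, dtype=np.float64)
    # Labels in the raw files start at 1, so shift them to start at 0.
    y = np.loadtxt(y_path, dtype=np.int64) - 1
    return x, y


# Tiny self-contained demo on temporary files (not the real data set).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_x_path = os.path.join(tmp_dir, 'demo_x.txt')
    demo_y_path = os.path.join(tmp_dir, 'demo_y.txt')
    np.savetxt(demo_x_path, np.array([[0.1, -0.2], [0.3, 0.4], [0.5, 0.6]]))
    np.savetxt(demo_y_path, np.array([1, 2, 3]), fmt='%d')
    demo_x, demo_y = load_split(demo_x_path, demo_y_path)

print(demo_x.dtype, demo_x.shape, demo_y.dtype, demo_y.tolist())
```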

In [None]:
# Load the data

# Hint: np.loadtxt
train_x = None
train_y = None
test_x = None
test_y = None

print(f'train_x.dtype={train_x.dtype}; train_x.shape={train_x.shape}; train_y.dtype={train_y.dtype}; train_y.shape={train_y.shape};')
print(f'test_x.dtype={test_x.dtype}; test_x.shape={test_x.shape}; test_y.dtype={test_y.dtype}; test_y.shape={test_y.shape};')

assert all((train_x.shape == (7352, 561), train_y.shape == (7352,), test_x.shape == (2947, 561), test_y.shape == (2947,)))
assert all((train_x.dtype == np.float64, train_y.dtype == np.int64, test_x.dtype == np.float64, test_y.dtype == np.int64))

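## (2) [5pts] Inspect the data

Analyze the data and answer the following questions:

- Are there missing values in the data?

> ⌨ Answer here

- Is the data class-imbalanced?

> ⌨ Answer here

- Do the feature values need to be normalized?

> ⌨ Answer here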
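
The three checks above could be sketched as follows. The arrays `demo_x` and `demo_y` are synthetic stand-ins for `train_x` and `train_y` (six classes and values in [-1, 1] are assumptions), so the printed numbers say nothing about the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for train_x / train_y (assumption: 6 classes, values in [-1, 1]).
demo_x = rng.uniform(-1.0, 1.0, size=(200, 5))
demo_y = rng.integers(0, 6, size=200)

# Missing values: count NaN entries over the whole matrix.
num_missing = int(np.isnan(demo_x).sum())

# Class balance: number of samples per class and the largest-to-smallest ratio.
class_counts = np.bincount(demo_y, minlength=6)
imbalance_ratio = class_counts.max() / class_counts.min()

# Value ranges: per-feature minimum and maximum.
feature_min = demo_x.min(axis=0)
feature_max = demo_x.max(axis=0)

print(f'num_missing={num_missing}; class_counts={class_counts.tolist()}; imbalance_ratio={imbalance_ratio:.2f};')
print(f'global min={feature_min.min():.3f}; global max={feature_max.max():.3f};')
```

If every feature already lies in a common bounded range, normalization matters less for distance-based models such as $k$-nearest neighbors; whether that holds for the real data has to be read off the actual statistics.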

In [None]:
# Print missing-value statistics for the training data

# Hint: np.isnan


In [None]:
# Print the number of samples per class in the training data

# Hint: np.bincount


In [None]:
# Print summary statistics of the feature values in the training data



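## (3) [5pts] Visualize a feature distribution

After normalizing the feature values, select the feature with the largest variance and draw a violin plot comparing how samples of each class are distributed along that feature.

Use the following figure as a reference: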

<div><img src="./plot/violinplot.png" width="512"/></div>
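
A small sketch of the feature-selection step, assuming that "normalization" here means per-feature min-max scaling to [0, 1] (standardization would make every variance equal to 1 and the choice meaningless). The data is synthetic and the names `demo_x`, `demo_x_norm` and `best_feature` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for train_x: three features with different scales (assumption).
demo_x = np.column_stack([
    rng.normal(0.0, 1.0, size=300),
    rng.uniform(0.0, 100.0, size=300),
    rng.normal(5.0, 0.1, size=300),
])

# Min-max normalization to [0, 1] per feature; eps guards against constant columns.
eps = 1e-12
x_min = demo_x.min(axis=0)
x_max = demo_x.max(axis=0)
demo_x_norm = (demo_x - x_min) / (x_max - x_min + eps)

# Index of the feature with the largest variance after normalization.
feature_var = demo_x_norm.var(axis=0)
best_feature = int(np.argmax(feature_var))
print(f'best_feature={best_feature}; variance={feature_var[best_feature]:.4f};')
```

The selected column and the labels can then be handed to `sns.violinplot`, for example as `sns.violinplot(x=train_y, y=train_x_norm[:, best_feature])`, where `train_x_norm` is a hypothetical name for the normalized training matrix.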

In [None]:
# Select the feature with the largest variance

# Hint: np.argmax


In [None]:
# Draw a violin plot of each class's distribution along that feature

# Hint: sns.violinplot


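## (4) [5pts] Visualize feature correlations

Draw a heatmap of the pairwise Pearson correlation coefficients among the first 51 features.

Use the following figure as a reference: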

<div><img src="./plot/heatmap.png" width="512"/></div>
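
One way to obtain the correlation matrix is `np.corrcoef` with `rowvar=False`, sketched here on synthetic data (`demo_x` is a stand-in for `train_x`).

```python
import numpy as np

rng = np.random.default_rng(0)
demo_x = rng.normal(size=(500, 60))
# Make feature 1 strongly correlated with feature 0 so the matrix has visible structure.
demo_x[:, 1] = demo_x[:, 0] + 0.1 * rng.normal(size=500)

num_features = 51
# rowvar=False treats each column as a variable, giving a (51, 51) matrix.
corr = np.corrcoef(demo_x[:, :num_features], rowvar=False)
print(corr.shape, round(float(corr[0, 1]), 3))
```

The resulting matrix can be passed directly to `sns.heatmap`, for instance with `vmin=-1` and `vmax=1` so that the color scale covers the full range of the coefficient.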

In [None]:
# Draw a heatmap of the pairwise Pearson correlation coefficients among the first 51 features

# Hint: sns.heatmap


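# 2 [15pts + 5 bonus pts] Classification Models

## (1) [5pts] Baseline models with `sklearn`

With fixed hyperparameters, report the running time and accuracy of the following baseline models: $k$-nearest neighbors, a Gaussian-kernel support vector machine (the Gaussian kernel is also known as the radial basis function kernel), and a random forest.

> - $k$-nearest neighbors: elapsed=????s; accuracy=????%;
> - Gaussian-kernel SVM: elapsed=????s; accuracy=????%;
> - Random forest: elapsed=????s; accuracy=????%;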
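
A hedged sketch of how the three baselines might be timed and scored. It runs on synthetic data from `make_classification` rather than the real split, and the hyperparameter values are illustrative choices, not values prescribed by the assignment.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

SEED = 0  # mirrors GLOBAL_SEED in this notebook

# Synthetic stand-in for the real train/test split (assumption).
demo_x, demo_y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=SEED)
fit_x, eval_x, fit_y, eval_y = train_test_split(
    demo_x, demo_y, test_size=0.25, random_state=SEED, stratify=demo_y)

# Fixed hyperparameters; the specific values are illustrative.
baselines = {
    'k-NN': KNeighborsClassifier(n_neighbors=5),
    'RBF SVM': SVC(kernel='rbf', C=1.0, random_state=SEED),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=SEED),
}

results = {}
for name, model in baselines.items():
    start = time.time()
    model.fit(fit_x, fit_y)
    accuracy = accuracy_score(eval_y, model.predict(eval_x))
    results[name] = (time.time() - start, accuracy)
    print(f'[{name}] elapsed={results[name][0]:.2f}s; accuracy={accuracy*100:.2f}%;')
```

Only the random forest is randomized among these three with the settings shown, so `random_state` matters most there; passing it to `SVC` is harmless.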

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Fix the random seed so that the experiment is reproducible


## (2) [5pts] Boosting model with `xgboost`

With fixed hyperparameters, report the running time and accuracy of `xgboost`.

> - `xgboost`: elapsed=????s; accuracy=????%;

In [None]:
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Fix the random seed so that the experiment is reproducible


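## (3) [5pts] Train a neural network with `torch`

Save the current model weights to `./ckpt/` after every pass over the training data. With fixed hyperparameters, plot the network's accuracy on the training and test sets against the number of training epochs as a line chart.

Use the following figure as a reference: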

<div><img src="./plot/line.png" width="512"/></div>
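
The per-epoch pattern described above (gradient step on each batch, then one checkpoint per epoch) can be sketched in isolation. This is a minimal illustration on synthetic data with a small `nn.Sequential` model and a temporary directory in place of `./ckpt/`; it is not the `MLPClassifier` defined in the cell that follows.

```python
import os
import tempfile

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
rng = np.random.default_rng(0)
# Synthetic two-class data standing in for train_x / train_y (assumption).
demo_x = rng.normal(size=(256, 8)).astype(np.float32)
demo_y = (demo_x[:, 0] > 0).astype(np.int64)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
dataset = TensorDataset(torch.from_numpy(demo_x), torch.from_numpy(demo_y))
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)

ckpt_dir = tempfile.mkdtemp()  # temporary stand-in for ./ckpt/
num_epochs = 5
epoch_losses = []
model.train()
for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, labels in dataloader:
        optimizer.zero_grad()            # clear gradients from the previous batch
        outputs = model(inputs)          # forward pass
        loss = loss_fn(outputs, labels)  # loss of the current batch
        loss.backward()                  # backpropagate
        optimizer.step()                 # update the weights
        running_loss += loss.item()      # accumulate the loss over the epoch
    epoch_losses.append(running_loss)
    # Save the weights once per epoch.
    torch.save(model.state_dict(), os.path.join(ckpt_dir, f'{epoch:03d}.pt'))
    print(f'[{epoch+1}/{num_epochs}] running_loss={running_loss:.4f};')
```

Saving `state_dict()` rather than the whole module keeps each checkpoint independent of the class definition, which is why the model has to be rebuilt before `load_state_dict` is called.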

In [None]:
from sklearn.metrics import accuracy_score
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class MLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dims: list, output_dim: int):
        super(MLP, self).__init__()
        self.input_layer = nn.Sequential(
            nn.Linear(input_dim, hidden_dims[0]),
            nn.ReLU(),
        )
        self.hidden_layers = [
            nn.Sequential(
                nn.Linear(dim_in, dim_out),
                nn.ReLU(),
            )
            for dim_in, dim_out in zip(hidden_dims[:-1], hidden_dims[+1:])
        ]
        self.hidden_layers = nn.Sequential(*self.hidden_layers)
        self.output_layer = nn.Sequential(
            nn.Linear(hidden_dims[-1], output_dim),
        )
    def forward(self, x):
        x = self.input_layer(x)
        for hidden_layer in self.hidden_layers:
            x = hidden_layer(x)
        x = self.output_layer(x)
        return x

class MLPClassifier(object):
    def __init__(self, hidden_dims: list = [128, 32], batch_size: int = 128, learning_rate: float = 1e-2, num_epochs: int = 16, ckpt_dir: str = None):
        self.hidden_dims = hidden_dims
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.num_epochs = num_epochs
        self.ckpt_dir = ckpt_dir
    def define_mlp(self, x: np.ndarray, y: np.ndarray):
        _, self.input_dim = x.shape
        self.output_dim = y.max() + 1
        self.mlp = MLP(input_dim=self.input_dim, hidden_dims=self.hidden_dims, output_dim=self.output_dim)
    def train_mlp(self, x: np.ndarray, y: np.ndarray):
        self.mlp.train()
        x = torch.from_numpy(x).float()
        y = torch.from_numpy(y).long()
        dataset = TensorDataset(x, y)
        dataloader = DataLoader(dataset, batch_size=self.batch_size, shuffle=True)
        loss_fn = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.mlp.parameters(), lr=self.learning_rate)
        for epoch in range(self.num_epochs):
            running_loss = 0.0
            for inputs, labels in dataloader:
                raise NotImplementedError()
                # TODO: Read the example in the official documentation and complete the gradient-descent code; remember to add the current batch's loss to running_loss.
                # https://pytorch.org/tutorials/beginner/introyt/trainingyt.html#the-training-loop
            print(f'\033[1m[{epoch+1:3d}/{self.num_epochs:3d}]\033[0m running_loss={running_loss:8.4f};')
            if self.ckpt_dir:
                os.makedirs(self.ckpt_dir, exist_ok=True)  # create the checkpoint directory if it is missing
                torch.save(self.mlp.state_dict(), os.path.join(self.ckpt_dir, f'{epoch:03d}.pt'))
    def load(self, x: np.ndarray, y: np.ndarray, ckpt_path: str):
        self.define_mlp(x=x, y=y)
        assert os.path.exists(ckpt_path)
        self.mlp.load_state_dict(torch.load(ckpt_path))
    def fit(self, x: np.ndarray, y: np.ndarray):
        self.define_mlp(x=x, y=y)
        self.train_mlp(x=x, y=y)
    def predict(self, x: np.ndarray):
        self.mlp.eval()
        x = torch.from_numpy(x).float()
        dataset = TensorDataset(x)
        dataloader = DataLoader(dataset, batch_size=self.batch_size, shuffle=False)
        y_pred = []
        with torch.no_grad():
            for inputs, in dataloader:
                outputs = self.mlp(inputs)
                labels = outputs.argmax(dim=1)
                y_pred.append(labels)
        y_pred = torch.cat(y_pred, dim=0)
        return y_pred.detach().cpu().numpy()

classifier = MLPClassifier(ckpt_dir='./ckpt/')

fix_seed(GLOBAL_SEED)
_start_time = time.time()
classifier.fit(train_x, train_y)
test_y_hat = classifier.predict(test_x)
accuracy = accuracy_score(test_y, test_y_hat)
_end_time = time.time()
_elapsed_time = _end_time - _start_time
print(f'\033[1m[{classifier.__class__.__name__}]\033[0m elapsed={_elapsed_time:.2f}s; accuracy={accuracy*100:.2f}%;')

del classifier, test_y_hat, accuracy
del _start_time, _end_time, _elapsed_time

In [None]:
# Plot training and test accuracy against the number of training epochs as a line chart

classifier = MLPClassifier(ckpt_dir=None)
train_accs = []
test_accs = []
for epoch in range(16):
    classifier.load(train_x, train_y, os.path.join('./ckpt/', f'{epoch:03d}.pt'))
    train_y_hat = classifier.predict(train_x)
    train_acc = accuracy_score(train_y, train_y_hat)
    train_accs.append(train_acc)
    test_y_hat = classifier.predict(test_x)
    test_acc = accuracy_score(test_y, test_y_hat)
    test_accs.append(test_acc)

# print(f'train_accs = {train_accs};')
# print(f'test_accs = {test_accs};')



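## (4) [5 bonus pts] Train a neural network with `mindspore`

Reproduce the results of (3) using domestically developed software, and compare the two frameworks in terms of efficiency and other aspects.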

In [None]:
from sklearn.metrics import accuracy_score
import mindspore
import mindspore.nn as nn
import mindspore.ops as ops

class MLP(nn.Cell):
    def __init__(self, input_dim: int, hidden_dims: list, output_dim: int):
        super(MLP, self).__init__()
        self.input_layer = nn.SequentialCell(
            nn.Dense(input_dim, hidden_dims[0]),
            nn.ReLU(),
        )
        self.hidden_layers = [
            nn.SequentialCell(
                nn.Dense(dim_in, dim_out),
                nn.ReLU(),
            )
            for dim_in, dim_out in zip(hidden_dims[:-1], hidden_dims[+1:])
        ]
        self.hidden_layers = nn.SequentialCell(*self.hidden_layers)
        self.output_layer = nn.SequentialCell(
            nn.Dense(hidden_dims[-1], output_dim),
        )
    def construct(self, x):
        x = self.input_layer(x)
        for hidden_layer in self.hidden_layers:
            x = hidden_layer(x)
        x = self.output_layer(x)
        return x

class MLPClassifier(object):
    def __init__(self, hidden_dims: list = [128, 32], batch_size: int = 128, learning_rate: float = 1e-2, num_epochs: int = 16, ckpt_dir: str = None):
        self.hidden_dims = hidden_dims
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.num_epochs = num_epochs
        self.ckpt_dir = ckpt_dir
    def define_mlp(self, x: np.ndarray, y: np.ndarray):
        _, self.input_dim = x.shape
        self.output_dim = y.max() + 1
        self.input_dim = int(self.input_dim)
        self.output_dim = int(self.output_dim)
        self.mlp = MLP(input_dim=self.input_dim, hidden_dims=self.hidden_dims, output_dim=self.output_dim)
    def train_mlp(self, x: np.ndarray, y: np.ndarray):
        self.mlp.set_train(True)
        x = mindspore.Tensor.from_numpy(x).float()
        y = mindspore.Tensor.from_numpy(y).int()
        dataset = mindspore.dataset.GeneratorDataset([(xi, yi) for xi, yi in zip(x, y)], column_names=['inputs', 'labels'], shuffle=True).batch(batch_size=self.batch_size)
        dataloader = dataset.create_tuple_iterator()
        loss_fn = nn.CrossEntropyLoss()
        optimizer = nn.Adam(self.mlp.trainable_params(), learning_rate=self.learning_rate)
        def forward_fn(inputs, labels):
            logits = self.mlp(inputs)
            loss = loss_fn(logits, labels)
            return loss, logits
        grad_fn = mindspore.value_and_grad(forward_fn, None, optimizer.parameters, has_aux=True)
        for epoch in range(self.num_epochs):
            running_loss = 0.0
            for inputs, labels in dataloader:
                raise NotImplementedError()
                # TODO: Read the example in the official documentation and complete the gradient-descent code; remember to add the current batch's loss to running_loss.
                # https://www.mindspore.cn/tutorials/zh-CN/r2.3.0rc2/beginner/train.html#%E8%AE%AD%E7%BB%83%E4%B8%8E%E8%AF%84%E4%BC%B0
            print(f'\033[1m[{epoch+1:3d}/{self.num_epochs:3d}]\033[0m running_loss={running_loss:8.4f};')
            if self.ckpt_dir:
                os.makedirs(self.ckpt_dir, exist_ok=True)  # create the checkpoint directory if it is missing
                mindspore.save_checkpoint(self.mlp, os.path.join(self.ckpt_dir, f'{epoch:03d}.ckpt'))
    def load(self, x: np.ndarray, y: np.ndarray, ckpt_path: str):
        self.define_mlp(x=x, y=y)
        assert os.path.exists(ckpt_path)
        _, _ = mindspore.load_param_into_net(self.mlp, mindspore.load_checkpoint(ckpt_path))
    def fit(self, x: np.ndarray, y: np.ndarray):
        self.define_mlp(x=x, y=y)
        self.train_mlp(x=x, y=y)
    def predict(self, x: np.ndarray):
        self.mlp.set_train(False)
        x = mindspore.Tensor.from_numpy(x).float()
        dataset = mindspore.dataset.GeneratorDataset([xi for xi in x], column_names=['inputs'], shuffle=False).batch(batch_size=self.batch_size)
        dataloader = dataset.create_tuple_iterator()
        y_pred = []
        # No torch.no_grad() here: this cell uses MindSpore, which only computes gradients inside value_and_grad.
        for inputs, in dataloader:
            outputs = self.mlp(inputs)
            labels = outputs.argmax(axis=1)
            y_pred.append(labels)
        y_pred = ops.Concat(axis=0)(y_pred)
        return y_pred.asnumpy()

classifier = MLPClassifier(ckpt_dir='./ckpt/')

fix_seed(GLOBAL_SEED)
_start_time = time.time()
classifier.fit(train_x, train_y)
test_y_hat = classifier.predict(test_x)
accuracy = accuracy_score(test_y, test_y_hat)
_end_time = time.time()
_elapsed_time = _end_time - _start_time
print(f'\033[1m[{classifier.__class__.__name__}]\033[0m elapsed={_elapsed_time:.2f}s; accuracy={accuracy*100:.2f}%;')

del classifier, test_y_hat, accuracy
del _start_time, _end_time, _elapsed_time

In [None]:
# Plot training and test accuracy against the number of training epochs as a line chart

classifier = MLPClassifier(ckpt_dir=None)
train_accs = []
test_accs = []
for epoch in range(16):
    classifier.load(train_x, train_y, os.path.join('./ckpt/', f'{epoch:03d}.ckpt'))
    train_y_hat = classifier.predict(train_x)
    train_acc = accuracy_score(train_y, train_y_hat)
    train_accs.append(train_acc)
    test_y_hat = classifier.predict(test_x)
    test_acc = accuracy_score(test_y, test_y_hat)
    test_accs.append(test_acc)

# print(f'train_accs = {train_accs};')
# print(f'test_accs = {test_accs};')



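# 3 [15pts] Hyperparameter Tuning

## (1) [5pts] 5-fold cross-validation

Using `sklearn`, select the best number of neighbors $k$ for $k$-nearest neighbors with $k \in \{1, \cdots, 16\}$. Report the $k$ selected on the training set together with the 5-fold cross-validation accuracies.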

> - $k$=1, accuracy=????%;
> - $k$=2, accuracy=????%;
> - $k$=3, accuracy=????%;
> - $k$=4, accuracy=????%;
> - $k$=5, accuracy=????%;
> - $k$=6, accuracy=????%;
> - $k$=7, accuracy=????%;
> - $k$=8, accuracy=????%;
> - $k$=9, accuracy=????%;
> - $k$=10, accuracy=????%;
> - $k$=11, accuracy=????%;
> - $k$=12, accuracy=????%;
> - $k$=13, accuracy=????%;
> - $k$=14, accuracy=????%;
> - $k$=15, accuracy=????%;
> - $k$=16, accuracy=????%;
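
A sketch of selecting $k$ by stratified 5-fold cross-validation on synthetic data. The `cv_accuracy` helper is hypothetical and written only for this example; it is not the `evaluate_5_fold_cv` function imported from `evaluate` in the cell that follows, whose signature is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

SEED = 0

# Synthetic stand-in for train_x / train_y (assumption).
demo_x, demo_y = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=3, random_state=SEED)


def cv_accuracy(make_model, x, y, n_splits=5, seed=SEED):
    """Mean accuracy over stratified folds (hypothetical helper for this sketch)."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for fit_idx, val_idx in folds.split(x, y):
        model = make_model()  # a fresh model for every fold
        model.fit(x[fit_idx], y[fit_idx])
        scores.append(accuracy_score(y[val_idx], model.predict(x[val_idx])))
    return float(np.mean(scores))


cv_scores = {
    k: cv_accuracy(lambda: KNeighborsClassifier(n_neighbors=k), demo_x, demo_y)
    for k in range(1, 17)
}
best_k = max(cv_scores, key=cv_scores.get)
for k, score in cv_scores.items():
    print(f'k={k}, accuracy={score*100:.2f}%;')
print(f'best_k={best_k};')
```

Using the same seeded `StratifiedKFold` for every $k$ means all candidates are compared on identical folds, so differences in accuracy come from $k$ rather than from the split.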

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

from evaluate import evaluate_5_fold_cv



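## (2) [5pts] Multi-process parallel speed-up

Select the best regularization coefficient $C$ for the Gaussian-kernel support vector machine with $C \in \{0.01, 0.1, 1.0, 10.0, 100.0\}$. Report the $C$ selected on the training set together with the 5-fold cross-validation accuracies, and also report the total time taken when using multi-process parallelism.

> - $C$=0.01, accuracy=????%
> - $C$=0.1, accuracy=????%
> - $C$=1.0, accuracy=????%
> - $C$=10.0, accuracy=????%
> - $C$=100.0, accuracy=????%
> - Total time: ????s.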
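
A hedged sketch of running one task per ($C$, fold) pair with `joblib`. The worker `evaluate_one_fold` is a hypothetical stand-in defined only for this example, not the provided `distributed_evaluate_classifier`; the data is synthetic and `n_jobs=2` is an arbitrary choice.

```python
import time

import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

SEED = 0


def evaluate_one_fold(task_id, c, fit_x, fit_y, predict_x, predict_y):
    """Fit one RBF SVM on one fold (hypothetical worker for this sketch)."""
    model = SVC(C=c, kernel='rbf')
    model.fit(fit_x, fit_y)
    return task_id, accuracy_score(predict_y, model.predict(predict_x))


# Synthetic stand-in for train_x / train_y (assumption).
demo_x, demo_y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=SEED)
folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED).split(demo_x, demo_y))
c_values = [0.01, 0.1, 1.0, 10.0, 100.0]

# One task per (C, fold) pair; task_id records both so results can be regrouped.
tasks = [
    delayed(evaluate_one_fold)(
        (c, fold_id), c, demo_x[fit_idx], demo_y[fit_idx], demo_x[val_idx], demo_y[val_idx])
    for c in c_values
    for fold_id, (fit_idx, val_idx) in enumerate(folds)
]

start = time.time()
outputs = Parallel(n_jobs=2)(tasks)  # n_jobs=2 is an arbitrary choice for the sketch
elapsed = time.time() - start

# Average the fold accuracies for each value of C.
mean_acc = {
    c: float(np.mean([acc for (task_c, _), acc in outputs if task_c == c]))
    for c in c_values
}
best_c = max(mean_acc, key=mean_acc.get)
print(f'mean_acc={mean_acc}; best_c={best_c}; elapsed={elapsed:.2f}s;')
```

Flattening the grid into 25 independent tasks lets the scheduler keep all workers busy even when some values of $C$ train much more slowly than others.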

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from joblib import Parallel, delayed

from evaluate import distributed_evaluate_classifier

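# Hint: `joblib` is recommended; its interface is more concise than that of `multiprocessing`.
# Note: on Windows, the process entry function must be imported from a package file.
# The entry function is already implemented as distributed_evaluate_classifier; you need to fill in its arguments, for example:
# packed = [task_id, SVC, [], {'C': c, 'kernel': 'rbf'}, fit_x, fit_y, predict_x, predict_y]
# Each process will then execute:
# distributed_evaluate_classifier(packed)
# Here task_id identifies the current task; for this code it should contain the value of C and the index of the cross-validation fold.
# fit_x and fit_y are the training data of the fold, and predict_x and predict_y are its held-out (validation) data.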
delayed_entrance = delayed(distributed_evaluate_classifier)



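## (3) [5pts] Hyperparameter search

Use `optuna` to search the hyperparameters of `xgboost`. Report the hyperparameters selected on the training set and their 5-fold cross-validation accuracy (which should stay at or above 93.0% when the random seed is changed).

> - Key hyperparameters: n_estimators=????; max_depth=????; max_leaves=????; eta=????; list any other hyperparameters you searched here as well
> - 5-fold cross-validation accuracy: ????%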

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

from evaluate import evaluate_5_fold_cv

import optuna

optuna.logging.set_verbosity(optuna.logging.ERROR)



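# 4 [15pts] Ensemble Models

## (1) [5pts] Simple majority voting

Use simple majority voting to combine the classifiers tuned in the previous questions: $k$-nearest neighbors, the Gaussian-kernel support vector machine, and `xgboost`.

Report the improvement in test-set accuracy.

> Improvement in test-set accuracy: .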
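
Majority voting over hard predictions needs only NumPy. The sketch below uses made-up predictions from three models; the `majority_vote` helper is illustrative, and breaking ties in favor of the smallest class index is an arbitrary choice.

```python
import numpy as np


def majority_vote(predictions: np.ndarray, num_classes: int) -> np.ndarray:
    """Combine hard predictions of shape (num_models, num_samples) by plurality vote.

    Ties are broken in favor of the smallest class index (an arbitrary choice).
    """
    num_samples = predictions.shape[1]
    votes = np.zeros((num_samples, num_classes), dtype=np.int64)
    for model_pred in predictions:
        # Each model adds one vote per sample to the class it predicted.
        votes[np.arange(num_samples), model_pred] += 1
    return votes.argmax(axis=1)


# Made-up predictions from three models on five samples.
demo_pred = np.array([
    [0, 1, 2, 2, 1],
    [0, 1, 1, 2, 0],
    [1, 1, 2, 0, 2],
])
voted = majority_vote(demo_pred, num_classes=3)
print(voted.tolist())  # -> [0, 1, 2, 2, 0]
```

With three voters and more than two classes, a three-way tie is possible (the last sample above), so the tie-breaking rule should be stated when reporting results. `sklearn.ensemble.VotingClassifier` with `voting='hard'` is an alternative when the base models are refit together.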

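## (2) [5pts] Stacking

Use Stacking to combine the classifiers from the previous question, training the Stacking model with 5-fold cross-validation. Report the improvement in test-set accuracy.

> Improvement in test-set accuracy: .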
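
A possible starting point is `sklearn.ensemble.StackingClassifier`, which trains the meta-learner on out-of-fold predictions. This sketch runs on synthetic data, uses a random forest as a stand-in for the tuned `xgboost` model to stay within `sklearn`, and picks logistic regression as the meta-learner; all three are assumptions of the example rather than requirements of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

SEED = 0

# Synthetic stand-in for the real train/test split (assumption).
demo_x, demo_y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=SEED)
fit_x, eval_x, fit_y, eval_y = train_test_split(
    demo_x, demo_y, test_size=0.25, random_state=SEED, stratify=demo_y)

base_models = [
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('svm', SVC(kernel='rbf', C=1.0, random_state=SEED)),
    # Random forest stands in for the tuned xgboost model in this sketch.
    ('rf', RandomForestClassifier(n_estimators=50, random_state=SEED)),
]

# cv=5: the meta-learner is fit on 5-fold out-of-fold outputs of the base models.
stacker = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stacker.fit(fit_x, fit_y)
stacking_acc = accuracy_score(eval_y, stacker.predict(eval_x))
print(f'stacking accuracy={stacking_acc*100:.2f}%;')
```

Training the meta-learner on out-of-fold outputs prevents it from seeing predictions that the base models made on their own training samples, which would otherwise overstate how reliable each base model is.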

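## (3) [5pts] Explore other ensemble methods

Choose one of the following: use the output of the neural network's last hidden layer as new features; train a routing model that decides which model to use for each sample; or propose your own ensemble method with a clear explanation. Report the improvement in test-set accuracy.

> Improvement in test-set accuracy: .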

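# 6 [5 bonus pts] Learnware Market

<h2>Email address registered on 北冥坞 (Beimingwu): <code>TODO@smail.nju.edu.cn</code></h2>

<h2>ID of the uploaded learnware: <code>TODO</code></h2>

<h2>Screenshot of searching for learnwares by specification:</h2>

<h2>What you think 北冥坞 still needs to improve: (e.g., error logs are not detailed enough, uploading or reusing a learnware involves clearly redundant steps, &hellip;)</h2>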

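# Acknowledgements

<h2>You may discuss the assignment with other students who have likewise not yet finished it, but you must note this here and acknowledge them; if you consulted online materials or output generated by large language models while working on the assignment and found them helpful, you must also note and acknowledge them.</h2>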