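# Caltech-101 Image Classification by Fine-Tuning ResNet-18: Experiment Report

This report describes how a pretrained ResNet-18 model was fine-tuned for image classification on the Caltech-101 dataset, covering the procedure, configuration, code implementation, and visualization of the results.

## 1. Introduction

Image classification is a fundamental task in computer vision. Deep learning, and convolutional neural networks (CNNs) in particular, has been highly successful at it. ResNet (Residual Network) is a classic CNN architecture that introduced residual connections to address the difficulty of training deep networks. Caltech-101 is a standard benchmark widely used in object recognition research.

The goals of this experiment are:
1. Modify a pretrained ResNet-18 model to fit the 101 classes of the Caltech-101 dataset.
2. Apply a fine-tuning strategy: freeze most of the pretrained layers, train the newly added classification layer, and fine-tune some of the top convolutional layers.
3. Complete training on a local CPU within a limited time.
4. Visualize the loss and accuracy during training with TensorBoard.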

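## 2. Model and Dataset

### 2.1 The ResNet-18 Model
ResNet (Residual Network) is a deep CNN architecture whose core innovation is the residual learning unit. Through shortcut (skip) connections, these units let gradients propagate more directly back to earlier layers, which makes very deep networks trainable. ResNet-18 is a relatively shallow member of the family (18 weighted layers) that still performs well. It offers a good balance between recognition performance and computational cost, which makes it suitable for fine-tuning under limited resources.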
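
The effect of a shortcut connection on gradient flow can be written out for a single simplified unit. The symbols here are assumptions introduced for illustration: $x$ is the unit's input, $\mathcal{F}$ the stacked convolutional layers inside the unit, $y$ its output, $\mathcal{L}$ the loss, and $I$ the identity. The sketch ignores the ReLU applied after the addition and the projection shortcut used where feature dimensions change.

```latex
y = \mathcal{F}(x) + x,
\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( \frac{\partial \mathcal{F}}{\partial x} + I \right)
```

Because of the identity term, part of the gradient reaches $x$ unchanged even when $\partial \mathcal{F} / \partial x$ is small.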

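### 2.2 The Caltech-101 Dataset
The Caltech-101 dataset contains 101 object categories (for example airplanes, motorbikes, and Buddha statues) plus one background category. Each object category has roughly 40 to 800 images, for a total of about 9,000 images of varying sizes. Its challenges include large intra-class variation, viewpoint changes, differing lighting conditions, and cluttered backgrounds. In this experiment the background category `BACKGROUND_Google` is not treated as a valid learning target, so I excluded it manually when loading the data.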
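
The exclusion described above was done by hand. As a rough alternative, a sketch along the following lines could list the class folders while skipping `BACKGROUND_Google`. The helper name `list_object_classes` and the stand-in directory tree are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path

# Folder to skip: the background category named in the text above.
EXCLUDED_FOLDERS = {"BACKGROUND_Google"}


def list_object_classes(root):
    """Return the sorted class-folder names under root, skipping excluded folders."""
    return sorted(
        entry.name
        for entry in Path(root).iterdir()
        if entry.is_dir() and entry.name not in EXCLUDED_FOLDERS
    )


# Tiny stand-in directory tree (hypothetical; the real dataset has 101 object folders).
with tempfile.TemporaryDirectory() as demo_root:
    for folder in ("airplanes", "Motorbikes", "buddha", "BACKGROUND_Google"):
        (Path(demo_root) / folder).mkdir()
    demo_classes = list_object_classes(demo_root)

print(demo_classes)
```

Filtering by name leaves the files on disk untouched. Because `ImageFolder` scans every subdirectory of its root, the folder still has to be removed or moved (as was done here) unless a filtered dataset class is used.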

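## 3. Experimental Setup and Code Implementation

The whole pipeline is implemented in Python with the PyTorch framework. The configuration and functionality of the code are described module by module below.

### 3.1 Initial Configuration and Imports

First, we import the required libraries and define the configuration parameters for the experiment.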

In [10]:
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import ImageFolder
import time
import os
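from torch.utils.tensorboard import SummaryWriter  # For TensorBoard visualization

# --- Configuration ---
DATA_DIR = 'caltech-101/101_ObjectCategories/101_ObjectCategories'  # Dataset path; make sure it is correct
NUM_CLASSES = 101                # Number of classes in Caltech-101
BATCH_SIZE = 32                  # Batch size; adjust to the available CPU and memory
IMAGE_SIZE = (128, 128)          # Size all input images are resized to; kept small to speed up CPU training
NUM_EPOCHS = 5                   # Number of training epochs (raised from the original value of 1)
LEARNING_RATE_FC = 0.001         # Learning rate for the new fully connected layer
LEARNING_RATE_FINETUNE = 0.0001  # Learning rate for the fine-tuned pretrained layers
TRAIN_SPLIT_RATIO = 0.8          # Fraction of the data used for training
RANDOM_SEED = 0                  # Random seed for a reproducible train/validation split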

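### 3.2 Data Loading and Preprocessing Module (`get_data_loaders`)

This module loads the Caltech-101 dataset from the given path, applies the required preprocessing, and creates the training and validation DataLoaders.

The main steps are:
- Define the image transforms: resize images to `IMAGE_SIZE`; apply random horizontal flips, random rotations, and color jitter as data augmentation; convert to a Tensor; and normalize with the ImageNet mean and standard deviation.
- Load the data with `ImageFolder`, which infers class labels from the subfolder names.
- Split the dataset into training and validation sets according to `TRAIN_SPLIT_RATIO`.
- Create the `DataLoader`s, which load data efficiently in batches during training and validation.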
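
The split and batching arithmetic can be checked by hand. The sketch below assumes the 8,677 images implied by the sample counts in the run log (6,941 + 1,736) and a `DataLoader` that keeps the final partial batch, which is the default. The helper names are hypothetical.

```python
import math


def split_sizes(num_samples, train_ratio):
    """Mirror the notebook's split: truncate the training share, give the rest to validation."""
    num_train = int(num_samples * train_ratio)
    return num_train, num_samples - num_train


def num_batches(num_samples, batch_size):
    """Number of batches when the last, possibly smaller, batch is kept."""
    return math.ceil(num_samples / batch_size)


TOTAL_IMAGES = 8677  # 6941 + 1736, taken from the run log
num_train, num_val = split_sizes(TOTAL_IMAGES, 0.8)
train_batches = num_batches(num_train, 32)
val_batches = num_batches(num_val, 32)

print(num_train, num_val, train_batches, val_batches)
```

These values agree with the 217 training batches per epoch shown in the training log.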

In [12]:
def get_data_loaders(data_dir, image_size, batch_size, train_split_ratio, random_seed):
    """
    Load the Caltech-101 dataset and create the training and validation DataLoaders.
    """
    print(f"Loading data from: {data_dir}")

    transform = transforms.Compose([
        transforms.Resize(image_size),
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(10),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])  # ImageNet statistics
    ])

    try:
        full_dataset = ImageFolder(root=data_dir, transform=transform)
        print(f"Dataset loaded. Total classes found: {len(full_dataset.classes)}")
        if len(full_dataset.classes) == 0:
            print(f"Error: no classes found. Check that '{data_dir}' contains one subdirectory per class.")
            return None, None, 0
        if len(full_dataset) == 0:
            print(f"Error: the dataset is empty. Check the image files in {data_dir}.")
            return None, None, len(full_dataset.classes)

    except Exception as e:
        print(f"Error while loading the dataset: {e}")
        print(f"Make sure the Caltech-101 dataset is placed correctly in the '{data_dir}' directory,")
        print("that it is not empty or corrupted, and that unwanted class folders (such as 'BACKGROUND_Google') have been removed if needed.")
        return None, None, 0

    # Split the dataset
    num_train = int(len(full_dataset) * train_split_ratio)
    num_val = len(full_dataset) - num_train

    if num_train == 0 or num_val == 0:
        print(f"Error: not enough data to split into training and validation sets. Training: {num_train}, validation: {num_val}")
        return None, None, len(full_dataset.classes)

    # Note: both subsets share full_dataset, so the validation subset also
    # receives the random augmentations defined in `transform` above.
    train_dataset, val_dataset = random_split(full_dataset, [num_train, num_val],
                                              generator=torch.Generator().manual_seed(random_seed))

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2, pin_memory=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=2, pin_memory=True)

    print(f"Training samples: {len(train_dataset)}")
    print(f"Validation samples: {len(val_dataset)}")
    print(f"Number of classes: {len(full_dataset.classes)}")

    return train_loader, val_loader, len(full_dataset.classes)

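### 3.3 Model Definition and Fine-Tuning Strategy Module (`get_model`)

This module loads the pretrained ResNet-18 model and replaces its classification head to match the number of classes in Caltech-101. It also sets up separate learning rates for fine-tuning.

The main steps are:
- Load the pretrained ResNet-18: `torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)` loads weights pretrained on ImageNet.
- Freeze the parameters: initially, gradient updates are disabled for every parameter (`param.requires_grad = False`).
- Replace the classification layer: read the number of input features of the original fully connected layer (`model.fc`) and replace it with a new `nn.Linear` layer whose output size is `NUM_CLASSES`. The new layer has `requires_grad=True` by default.
- Apply Xavier initialization: the weights of the new fully connected layer are initialized with Xavier uniform initialization.
- Unfreeze some pretrained layers: the later convolutional stages of ResNet (`layer3` and `layer4`) are unfrozen by setting `requires_grad = True` so that they can be fine-tuned.
- Configure the optimizer: use `Adam` with two parameter groups, one for the new fully connected layer (with `LEARNING_RATE_FC`) and one for the unfrozen pretrained layers (with `LEARNING_RATE_FINETUNE`).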
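
The freezing and grouping logic above can be sketched without loading the model by working on parameter names alone. The notebook's `get_model` uses substring checks such as `'layer3' in name`; the variant below matches the top-level module name instead, which is an assumption made for this illustration. The names in `example_names` follow the pattern torchvision uses for ResNet-18 parameters, and `assign_group` is a hypothetical helper.

```python
# Learning rates per group, matching LEARNING_RATE_FC and LEARNING_RATE_FINETUNE.
GROUP_LEARNING_RATES = {"fc": 0.001, "finetune": 0.0001}


def assign_group(param_name):
    """Map a parameter name to 'fc', 'finetune', or 'frozen' by its top-level module."""
    top_level = param_name.split(".")[0]
    if top_level == "fc":
        return "fc"
    if top_level in ("layer3", "layer4"):
        return "finetune"
    return "frozen"


example_names = [
    "conv1.weight",
    "bn1.weight",
    "layer1.0.conv1.weight",
    "layer2.1.bn2.bias",
    "layer3.0.downsample.0.weight",
    "layer4.1.bn2.bias",
    "fc.weight",
    "fc.bias",
]

groups = {name: assign_group(name) for name in example_names}
for name, group in groups.items():
    # Frozen parameters are not passed to the optimizer, so they have no learning rate.
    print(f"{name:32s} {group:9s} lr={GROUP_LEARNING_RATES.get(group)}")
```

Matching on the top-level module avoids accidental hits if a parameter name ever contained `fc` or `layer3` as part of a longer identifier. For the stock ResNet-18 both approaches select the same parameters.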

In [13]:
def get_model(num_classes, learning_rate_fc, learning_rate_finetune):
    """
    Load a pretrained ResNet-18 model and replace its classifier.
    Set different learning rates for the new classifier and the pretrained layers.
    """
    model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)

    # Freeze all parameters first
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final fully connected layer
    num_ftrs = model.fc.in_features
    model.fc = nn.Linear(num_ftrs, num_classes)  # New layers have requires_grad=True by default
    nn.init.xavier_uniform_(model.fc.weight)  # Xavier initialization

    # Set up an optimizer with different learning rates
    # Parameters of the newly initialized fully connected layer
    fc_params = model.fc.parameters()
    # Parameters of the pretrained layers (some are unfrozen for fine-tuning)
    # For ResNet, we fine-tune layer3 and layer4, which hold the more task-specific features
    finetune_params = []
    for name, param in model.named_parameters():
        if "fc" not in name:  # Skip the fc layer handled above
            # Unfreeze the later layers for fine-tuning
            if 'layer4' in name or 'layer3' in name:
                param.requires_grad = True
                finetune_params.append(param)
            else:
                param.requires_grad = False  # Keep the earlier layers frozen

    optimizer = optim.Adam([
        {'params': fc_params, 'lr': learning_rate_fc},
        {'params': finetune_params, 'lr': learning_rate_finetune}
    ], lr=learning_rate_fc)  # Default learning rate; each group's own 'lr' takes precedence

    return model, optimizer

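### 3.4 Training and Evaluation Module (`train_model`)

This module contains the core logic for training and validating the model.

Main functionality:
- Iterate for the specified `num_epochs`.
- **Training phase** (within each epoch):
    - Put the model in training mode (`model.train()`).
    - Iterate over the training data loader (`train_loader`).
    - Move the data to the selected device (`device`).
    - Zero the optimizer's gradients (`optimizer.zero_grad()`).
    - Run the forward pass to compute the model outputs (`outputs = model(inputs)`).
    - Compute the loss (`loss = criterion(outputs, labels)`).
    - Run the backward pass to compute gradients (`loss.backward()`).
    - Update the model parameters (`optimizer.step()`).
    - Accumulate the loss and compute the training accuracy.
    - Log the training loss of every batch to TensorBoard.
- **Validation phase** (at the end of each epoch):
    - Put the model in evaluation mode (`model.eval()`).
    - Run inside a `torch.no_grad()` context, which disables gradient computation to save memory and compute.
    - Iterate over the validation data loader (`val_loader`).
    - Compute the total loss and accuracy on the validation set.
- **Logging**:
    - Print the training and validation loss and accuracy, along with the elapsed time, for each epoch.
    - Log each epoch's average training loss, training accuracy, validation loss, and validation accuracy to TensorBoard.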
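
One detail of the loss accumulation is easy to miss. `nn.CrossEntropyLoss` with its default reduction returns the mean loss of a batch, so the training loop multiplies each batch loss by the batch size before summing and then divides by the number of samples. The sketch below, using made-up batch losses and a hypothetical helper name, shows how this differs from a plain mean over batches when the last batch is smaller.

```python
def epoch_average_loss(batch_losses, batch_sizes):
    """Sample-weighted mean of per-batch mean losses."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Illustrative values only: three batches, the last one smaller than the others.
batch_losses = [2.0, 1.0, 4.0]
batch_sizes = [32, 32, 16]

weighted = epoch_average_loss(batch_losses, batch_sizes)
naive = sum(batch_losses) / len(batch_losses)

print(f"sample-weighted: {weighted:.4f}, plain mean over batches: {naive:.4f}")
```

The weighted form gives every sample the same influence on the epoch loss, regardless of which batch it landed in.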

In [14]:
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, device, writer):  # writer is used for TensorBoard
    """
    Train the model and evaluate it on the validation set.
    Log training and validation metrics to TensorBoard.
    """
    print(f"Training on {device}...")
    global_step = 0  # Step counter for the per-batch training loss
    for epoch in range(num_epochs):
        start_time_epoch = time.time()
        model.train()  # Put the model in training mode
        running_loss_train = 0.0
        correct_train = 0
        total_train = 0

        for i, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss_train += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs.data, 1)
            total_train += labels.size(0)
            correct_train += (predicted == labels).sum().item()

            # Log the training loss to TensorBoard after every batch
            writer.add_scalar('Loss/train_batch', loss.item(), global_step)
            global_step += 1

            if (i + 1) % 20 == 0:  # Print a log line every 20 batches
                print(f'Epoch [{epoch+1}/{num_epochs}], Batch [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')

        epoch_loss_train = running_loss_train / len(train_loader.dataset)
        epoch_acc_train = correct_train / total_train

        # Log the average training loss and accuracy after every epoch
        writer.add_scalar('Loss/train_epoch', epoch_loss_train, epoch)
        writer.add_scalar('Accuracy/train_epoch', epoch_acc_train, epoch)

        # --- Validation phase ---
        model.eval()  # Put the model in evaluation mode
        running_loss_val = 0.0
        correct_val = 0
        total_val = 0
        with torch.no_grad():  # No gradients are needed during evaluation
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                running_loss_val += loss.item() * inputs.size(0)
                _, predicted = torch.max(outputs.data, 1)
                total_val += labels.size(0)
                correct_val += (predicted == labels).sum().item()

        epoch_loss_val = running_loss_val / len(val_loader.dataset)
        epoch_acc_val = correct_val / total_val

        # Log the validation loss and accuracy to TensorBoard after every epoch
        writer.add_scalar('Loss/validation_epoch', epoch_loss_val, epoch)
        writer.add_scalar('Accuracy/validation_epoch', epoch_acc_val, epoch)

        end_time_epoch = time.time()
        epoch_duration = end_time_epoch - start_time_epoch

        print(f"Epoch {epoch+1}/{num_epochs} took {epoch_duration:.2f}s")
        print(f"  Train loss: {epoch_loss_train:.4f}, train accuracy: {epoch_acc_train:.4f}")
        print(f"  Validation loss: {epoch_loss_val:.4f}, validation accuracy: {epoch_acc_val:.4f}")

    print("Training complete")
    return model

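### 3.5 Main Execution Logic

This is the main part of the script. It combines all the modules above to run the full training and evaluation pipeline.

The steps are:
- Initialize a `SummaryWriter` for TensorBoard logging. Logs are saved under `runs/Notebook_Run_<timestamp>`.
- Select the compute device (CUDA GPU if available, otherwise CPU).
- Call `get_data_loaders` to load the data.
- Check whether the number of loaded classes matches the expected `NUM_CLASSES`; if not, use the number actually loaded.
- Call `get_model` to obtain the model and optimizer.
- Define the loss function (`nn.CrossEntropyLoss`).
- Call `train_model` to run training and evaluation.
- Close the `SummaryWriter`.
- Print the total execution time.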
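
The class-count check in the steps above can be isolated into a small helper. This is a sketch rather than the notebook's code: `resolve_num_classes` is a hypothetical name, and it raises an exception when no classes are found, which stops a notebook cell without ending the session.

```python
def resolve_num_classes(expected, detected):
    """Return the class count to build the model with, preferring what was actually loaded."""
    if detected <= 0:
        raise ValueError("No classes were loaded; check the dataset directory.")
    if detected != expected:
        print(f"Warning: detected {detected} classes, expected {expected}; using {detected}.")
    return detected


print(resolve_num_classes(expected=101, detected=101))
print(resolve_num_classes(expected=101, detected=102))
```

A mismatch of this kind would occur, for instance, if the `BACKGROUND_Google` folder were still present in the dataset directory.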

In [15]:
if __name__ == '__main__' and '__file__' not in globals():  # In a Jupyter notebook, __name__ is usually '__main__' but __file__ is undefined
    print("Starting the Caltech-101 classification task...")
    overall_start_time = time.time()

    # Run name and log directory for an optional, separately named TensorBoard log
    run_name = f"Caltech101_ResNet18_{time.strftime('%Y%m%d-%H%M%S')}"
    log_dir_notebook = os.path.join('runs', run_name)
    # Note: the TensorBoard cell later in the notebook points at a fixed, previously recorded log path
    # writer = SummaryWriter(log_dir=log_dir_notebook)  # Uncomment this line (and add a matching close) to log to this directory
    # print(f"New TensorBoard logs will be saved to: {log_dir_notebook}")

    # Select the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    if device.type == "cuda":
        print(f"CUDA device name: {torch.cuda.get_device_name(0)}")

    # Build the DataLoaders
    train_loader, val_loader, num_actual_classes = get_data_loaders(
        data_dir=DATA_DIR,
        image_size=IMAGE_SIZE,
        batch_size=BATCH_SIZE,
        train_split_ratio=TRAIN_SPLIT_RATIO,
        random_seed=RANDOM_SEED
    )

    if train_loader is None or val_loader is None:
        print("Failed to load data. Exiting.")
        raise SystemExit(1)  # Stops execution in a script and in a notebook cell

    if num_actual_classes != NUM_CLASSES and num_actual_classes > 0:
        print(f"Warning: the detected number of classes ({num_actual_classes}) differs from NUM_CLASSES ({NUM_CLASSES}).")
        print(f"Using {num_actual_classes} as the number of classes.")
        current_num_classes = num_actual_classes
    elif num_actual_classes == 0:
        print("Error: no classes were loaded. Exiting.")
        raise SystemExit(1)
    else:
        current_num_classes = NUM_CLASSES

    # Build the model and optimizer
    model, optimizer = get_model(
        num_classes=current_num_classes,
        learning_rate_fc=LEARNING_RATE_FC,
        learning_rate_finetune=LEARNING_RATE_FINETUNE
    )
    model.to(device)

    # Loss function
    criterion = nn.CrossEntropyLoss()

    # --- Training run with its own TensorBoard writer ---
    temp_run_name = f"Notebook_Run_{time.strftime('%Y%m%d-%H%M%S')}"
    temp_log_dir = os.path.join('runs', temp_run_name)
    notebook_writer = SummaryWriter(log_dir=temp_log_dir)
    print(f"TensorBoard logs for this notebook run will be saved to: {temp_log_dir}")

    trained_model = train_model(model, train_loader, val_loader, criterion, optimizer, NUM_EPOCHS, device, notebook_writer)

    notebook_writer.close()
    # --- End of training run ---

    overall_end_time = time.time()
    total_duration = overall_end_time - overall_start_time
    print(f"Total execution time: {total_duration:.2f} seconds.")
    if total_duration > 600 and NUM_EPOCHS > 1:  # 10 minutes
        print("Warning: training took longer than 10 minutes.")

    print("Task complete.")
else:
    print("Code blocks defined. Call the functions or run the main logic from the notebook as needed.")

Starting the Caltech-101 classification task...
Using device: cpu
Loading data from: caltech-101/101_ObjectCategories/101_ObjectCategories
Dataset loaded. Total classes found: 101
Training samples: 6941
Validation samples: 1736
Number of classes: 101
TensorBoard logs for this notebook run will be saved to: runs\Notebook_Run_20250529-151921
Training on cpu...
Epoch [1/5], Batch [20/217], Loss: 3.7138
Epoch [1/5], Batch [40/217], Loss: 1.6943
Epoch [1/5], Batch [60/217], Loss: 1.0137
Epoch [1/5], Batch [80/217], Loss: 1.5851
Epoch [1/5], Batch [100/217], Loss: 1.2455
Epoch [1/5], Batch [120/217], Loss: 1.7049
Epoch [1/5], Batch [140/217], Loss: 1.2910
Epoch [1/5], Batch [160/217], Loss: 0.9077
Epoch [1/5], Batch [180/217], Loss: 0.7428
Epoch [1/5], Batch [200/217], Loss: 0.6682
Epoch 1/5 took 138.10s
  Train loss: 1.4771, train accuracy: 0.6730
  Validation loss: 0.5599, validation accuracy: 0.8497
Epoch [2/5], Batch [20/217], Loss: 0.3265
Epoch [2/5], Batch [40/217], Loss: 0.3769
Epoch [2/5], Batch [60/217], Loss: 0.2004
Epoch [2/5], Batch [80/217], Loss: 0.5411
Epoch [2/5], Batch [100/217], Loss: 0.2292
Epoch [2/5], Batch [120/217], Loss: 0.2949
Epoch [2/5

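### 3.6 Experimental Results

After 5 epochs of training, our model reached an accuracy of 0.9055 on the validation set.

## 4. TensorBoard Visualization

We use TensorBoard to visualize the key metrics recorded during training, including the loss curves on the training and validation sets and the change in accuracy on the validation set.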


In [28]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

# Path of the TensorBoard log directory saved by the earlier training run
#log_dir_path = "C:\\Users\\Ruiyu\\Desktop\\NN_and_DL\\MidTerm\\runs\\Notebook_Run_20250529-151921"

%tensorboard --logdir="C:\\Users\\Ruiyu\\Desktop\\NN_and_DL\\MidTerm\\runs\\Notebook_Run_20250529-151921"

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6010 (pid 24336), started 0:00:29 ago. (Use '!kill 24336' to kill it.)

We then save the model parameters.

In [29]:
# --- Save the model parameters ---

model_save_path = "model_parameters.pth"  # Saved in the current working directory

if 'trained_model' in locals():
    try:
        torch.save(trained_model.state_dict(), model_save_path)
        print(f"Model parameters saved successfully to: {os.path.abspath(model_save_path)}")
    except Exception as e:
        print(f"Error while saving the model parameters: {e}")
else:
    print("Error: 'trained_model' is not defined. Make sure the model has finished training.")


Model parameters saved successfully to: c:\Users\Ruiyu\Desktop\NN_and_DL\MidTerm\model_parameters.pth


## 5. Hyperparameter Selection

Because compute resources are limited, we run only a very small, toy hyperparameter search here.
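
The size of this search is easy to enumerate up front. The sketch below builds the grid from the option lists used in the search cell (three epoch counts, two learning rates for the fully connected layer, and a fine-tuning rate fixed at 0.1 times that rate). The names `build_grid` and `FINETUNE_LR_FACTOR` are hypothetical.

```python
from itertools import product

NUM_EPOCHS_OPTIONS = [1, 3, 5]
LR_FC_OPTIONS = [0.001, 0.002]
FINETUNE_LR_FACTOR = 0.1  # fine-tuning rate as a fraction of the fc rate


def build_grid(epoch_options, lr_fc_options, finetune_factor):
    """Enumerate every (epochs, lr_fc) pair and derive the fine-tuning rate from lr_fc."""
    return [
        {"epochs": epochs, "lr_fc": lr_fc, "lr_finetune": lr_fc * finetune_factor}
        for epochs, lr_fc in product(epoch_options, lr_fc_options)
    ]


grid = build_grid(NUM_EPOCHS_OPTIONS, LR_FC_OPTIONS, FINETUNE_LR_FACTOR)
total_epochs = sum(config["epochs"] for config in grid)

print(f"{len(grid)} configurations, {total_epochs} training epochs in total")
for config in grid:
    print(config)
```

Each configuration trains a fresh model, so the cost of the search scales with the total number of epochs across all configurations rather than with the longest run.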

In [30]:
import time
import os
# Removed: from torch.utils.tensorboard import SummaryWriter
import torch
import torch.nn as nn # In case criterion was not made global

print("--- Starting Hyperparameter Search (No TensorBoard Logging) ---")

# --- 0. Prerequisite Check & Setup ---
# Ensure essential variables from previous cells are available.
required_globals = ['device', 'train_loader', 'val_loader', 'current_num_classes', 
                    'get_model', 'criterion', 'DATA_DIR', 'IMAGE_SIZE', 'BATCH_SIZE',
                    'TRAIN_SPLIT_RATIO', 'RANDOM_SEED']
missing_vars = [var for var in required_globals if var not in globals()]

if missing_vars:
    print(f"Error: Essential variables not found in global scope: {missing_vars}")
    print("Please ensure the main training setup cells (defining imports, configs, data loaders, model function, etc.) have been run.")
    # Depending on your notebook structure, you might want to stop execution here
    # For example: raise RuntimeError("Missing prerequisite variables for hyperparameter search.")
else:
    print("Prerequisite variables found. Proceeding with hyperparameter search.")

    # --- 1. Define Hyperparameter Space ---
    num_epochs_options = [1, 3, 5]
    lr_fc_options = [0.001, 0.002]
    #lr_finetune_options = [0.0001, 0.0002]

    hyperparameter_search_results = []

    # --- 2. Define a modified train_model function for this cell that returns final validation loss ---
    # (TensorBoard 'writer' parameter and related calls are removed)
    def train_model_for_hp_search_no_tb(model, train_loader, val_loader, criterion_fn, optimizer, num_epochs, device):
        print(f"Training on {device} for {num_epochs} epoch(s)...")
        log_indent = "  " 
        
        final_epoch_val_loss = float('inf')
        final_epoch_val_acc = 0.0

        for epoch in range(num_epochs):
            epoch_start_time = time.time()
            model.train()
            current_epoch_train_loss = 0.0
            current_epoch_correct_train = 0
            current_epoch_total_train = 0

            for i, (inputs, labels) in enumerate(train_loader):
                inputs, labels = inputs.to(device), labels.to(device)
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = criterion_fn(outputs, labels)
                loss.backward()
                optimizer.step()

                current_epoch_train_loss += loss.item() * inputs.size(0)
                _, predicted = torch.max(outputs.data, 1)
                current_epoch_total_train += labels.size(0)
                current_epoch_correct_train += (predicted == labels).sum().item()
                
                if (i + 1) % 20 == 0: 
                    print(f'{log_indent}Epoch [{epoch+1}/{num_epochs}], Batch [{i+1}/{len(train_loader)}], Batch Loss: {loss.item():.4f}')

            avg_epoch_train_loss = current_epoch_train_loss / current_epoch_total_train if current_epoch_total_train > 0 else 0
            avg_epoch_train_acc = current_epoch_correct_train / current_epoch_total_train if current_epoch_total_train > 0 else 0
            
            # Validation phase
            model.eval()
            current_epoch_val_loss = 0.0
            current_epoch_correct_val = 0
            current_epoch_total_val = 0
            with torch.no_grad():
                for inputs, labels in val_loader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    outputs = model(inputs)
                    loss = criterion_fn(outputs, labels)
                    current_epoch_val_loss += loss.item() * inputs.size(0)
                    _, predicted = torch.max(outputs.data, 1)
                    current_epoch_total_val += labels.size(0)
                    current_epoch_correct_val += (predicted == labels).sum().item()
            
            avg_epoch_val_loss = current_epoch_val_loss / current_epoch_total_val if current_epoch_total_val > 0 else float('inf')
            avg_epoch_val_acc = current_epoch_correct_val / current_epoch_total_val if current_epoch_total_val > 0 else 0.0
            
            if epoch == num_epochs - 1:
                final_epoch_val_loss = avg_epoch_val_loss
                final_epoch_val_acc = avg_epoch_val_acc
            
            epoch_duration = time.time() - epoch_start_time
            print(f'{log_indent}Epoch {epoch+1}/{num_epochs} took {epoch_duration:.2f}s')
            print(f'{log_indent}  Train Loss: {avg_epoch_train_loss:.4f}, Train Acc: {avg_epoch_train_acc:.4f}')
            print(f'{log_indent}  Val Loss:   {avg_epoch_val_loss:.4f}, Val Acc:   {avg_epoch_val_acc:.4f}')

        print(f"{log_indent}Training completed for this hyperparameter set.")
        return model, final_epoch_val_loss, final_epoch_val_acc
    # --- End of modified train_model_for_hp_search_no_tb ---

    # --- 3. Iterate through Hyperparameters ---
    if not missing_vars: 
        for n_epochs_hp in num_epochs_options:
            for fc_lr_hp in lr_fc_options:
                ft_lr_hp = fc_lr_hp * 0.1               
                current_hparams_values = {
                    "epochs": n_epochs_hp,
                    "lr_fc": fc_lr_hp,
                    "lr_finetune": ft_lr_hp
                }
                print(f"\n{'='*10} Testing Hyperparameters: {current_hparams_values} {'='*10}")

                model_hp, optimizer_hp = get_model(
                    num_classes=current_num_classes, 
                    learning_rate_fc=fc_lr_hp,
                    learning_rate_finetune=ft_lr_hp
                )
                model_hp.to(device)

                # Train model using the modified function (no writer)
                _, final_val_loss_run, final_val_acc_run = train_model_for_hp_search_no_tb(
                    model_hp,
                    train_loader, 
                    val_loader,   
                    criterion,    
                    optimizer_hp,
                    n_epochs_hp,
                    device
                )
                
                hyperparameter_search_results.append({
                    "params": current_hparams_values,
                    "final_val_loss": final_val_loss_run,
                    "final_val_acc": final_val_acc_run
                })
                print(f"Reported Final Validation Loss for {current_hparams_values}: {final_val_loss_run:.4f}, Acc: {final_val_acc_run:.4f}")
                print(f"{'='*50}\n")

        # --- 4. Print Summary of Hyperparameter Search ----
        print(f"\n\n{'='*20} Hyperparameter Search Full Report {'='*20}")
        if hyperparameter_search_results:
            sorted_results = sorted(hyperparameter_search_results, key=lambda x: x['final_val_loss'])
            print("Top performing hyperparameter sets (by validation loss):")
            for result in sorted_results:
                print(f"  Params: Epochs={result['params']['epochs']}, FC LR={result['params']['lr_fc']}, FT LR={result['params']['lr_finetune']} "
                      f"-> Final Val Loss: {result['final_val_loss']:.4f}, Final Val Acc: {result['final_val_acc']:.4f}")
        else:
            print("No hyperparameter search results were collected (or an error occurred).")

print("\n--- Hyperparameter Search Cell Execution Finished (No TensorBoard Logging) ---")

--- Starting Hyperparameter Search (No TensorBoard Logging) ---
Prerequisite variables found. Proceeding with hyperparameter search.

Training on cpu for 1 epoch(s)...
  Epoch [1/1], Batch [20/217], Batch Loss: 3.7174
  Epoch [1/1], Batch [40/217], Batch Loss: 2.3352
  Epoch [1/1], Batch [60/217], Batch Loss: 1.4172
  Epoch [1/1], Batch [80/217], Batch Loss: 1.0213
  Epoch [1/1], Batch [100/217], Batch Loss: 1.0038
  Epoch [1/1], Batch [120/217], Batch Loss: 1.1085
  Epoch [1/1], Batch [140/217], Batch Loss: 1.0488
  Epoch [1/1], Batch [160/217], Batch Loss: 0.7869
  Epoch [1/1], Batch [180/217], Batch Loss: 0.5555
  Epoch [1/1], Batch [200/217], Batch Loss: 1.0363
  Epoch 1/1 took 141.91s
    Train Loss: 1.4889, Train Acc: 0.6780
    Val Loss:   0.5565, Val Acc:   0.8485
  Training completed for this hyperparameter set.
Reported Final Validation Loss for {'epochs': 1, 'lr_fc': 0.001, 'lr_finetune': 0.0001}: 0.5565, Acc: 0.8485


Training on cpu for 1 epoch(s)...
  Epoch [1/1], Batch [

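The search confirms our earlier choice: the best hyperparameters are 5 training epochs with a learning rate of 0.001 for the new layer and a fine-tuning learning rate of 0.0001 (the search fixes the fine-tuning rate at 0.1 times the base rate). Classification accuracy on the training set reached 90.15%. The trend suggests that accuracy would keep rising with further training, but my compute resources were too limited to continue.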

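## 6. Comparison with a Model Trained from Scratch

We train a ResNet-18 model from scratch for 5 epochs. Because every parameter is trained, this run already uses more compute than our fine-tuned model: in the logs, its first epoch took 202.34 s, compared with 138.10 s for fine-tuning.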

In [31]:
print("--- Training ResNet-18 From Scratch ---")

# --- 0. Configuration for Scratch Training ---
# Most configurations like DATA_DIR, IMAGE_SIZE, BATCH_SIZE, etc., are assumed to be globally defined.
# RANDOM_SEED is also assumed global for consistent data splitting.
SCRATCH_NUM_EPOCHS = 5
SCRATCH_LEARNING_RATE = 0.001 # A single learning rate for all parameters

# --- 1. Prerequisite Check ---
required_globals_scratch = ['device', 'train_loader', 'val_loader', 'current_num_classes', 
                            'criterion', 'IMAGE_SIZE', 'BATCH_SIZE', 'RANDOM_SEED']
missing_vars_scratch = [var for var in required_globals_scratch if var not in globals()]

if missing_vars_scratch:
    print(f"Error: Essential variables not found in global scope: {missing_vars_scratch}")
    print("Please ensure the main training setup cells (defining imports, configs, data loaders, etc.) have been run.")
else:
    print("Prerequisite variables found. Proceeding with training from scratch.")

    # --- 2. Define Model Instantiation for Scratch Training ---
    def get_resnet18_scratch(num_classes):
        # Initialize ResNet-18 WITHOUT pre-trained weights
        model_scratch = torchvision.models.resnet18(weights=None) # Key change: weights=None
        
        # Modify the final fully connected layer for the number of classes
        num_ftrs_scratch = model_scratch.fc.in_features
        model_scratch.fc = nn.Linear(num_ftrs_scratch, num_classes)
        
        # Initialize all parameters using Xavier initialization
        for m in model_scratch.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
        # The loop above already covers model_scratch.fc, so the final layer
        # needs no separate re-initialization.

        # When training from scratch, every parameter keeps requires_grad=True
        # by default, so all parameters are passed to the optimizer.
        return model_scratch

    # --- 3. Define Training Function (similar to previous, no TensorBoard) ---
    def train_model_scratch_epochs(model, train_loader, val_loader, criterion_fn, optimizer, num_epochs, device):
        print(f"Training from scratch on {device} for {num_epochs} epoch(s)...")
        log_indent = "  "
        
        epoch_summary = []

        for epoch in range(num_epochs):
            epoch_start_time = time.time()
            model.train()
            current_epoch_train_loss_sum = 0.0
            current_epoch_correct_train = 0
            current_epoch_total_train = 0

            for i, (inputs, labels) in enumerate(train_loader):
                inputs, labels = inputs.to(device), labels.to(device)
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = criterion_fn(outputs, labels)
                loss.backward()
                optimizer.step()

                current_epoch_train_loss_sum += loss.item() * inputs.size(0)
                _, predicted = torch.max(outputs.data, 1)
                current_epoch_total_train += labels.size(0)
                current_epoch_correct_train += (predicted == labels).sum().item()
                
                if (i + 1) % 20 == 0: 
                    print(f'{log_indent}Epoch [{epoch+1}/{num_epochs}], Batch [{i+1}/{len(train_loader)}], Batch Loss: {loss.item():.4f}')

            avg_epoch_train_loss = current_epoch_train_loss_sum / current_epoch_total_train if current_epoch_total_train > 0 else 0
            avg_epoch_train_acc = current_epoch_correct_train / current_epoch_total_train if current_epoch_total_train > 0 else 0
            
            model.eval()
            current_epoch_val_loss_sum = 0.0
            current_epoch_correct_val = 0
            current_epoch_total_val = 0
            with torch.no_grad():
                for inputs, labels in val_loader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    outputs = model(inputs)
                    loss = criterion_fn(outputs, labels)
                    current_epoch_val_loss_sum += loss.item() * inputs.size(0)
                    _, predicted = torch.max(outputs.data, 1)
                    current_epoch_total_val += labels.size(0)
                    current_epoch_correct_val += (predicted == labels).sum().item()
            
            avg_epoch_val_loss = current_epoch_val_loss_sum / current_epoch_total_val if current_epoch_total_val > 0 else float('inf')
            avg_epoch_val_acc = current_epoch_correct_val / current_epoch_total_val if current_epoch_total_val > 0 else 0.0
            
            epoch_duration = time.time() - epoch_start_time
            print(f'{log_indent}Epoch {epoch+1}/{num_epochs} took {epoch_duration:.2f}s')
            print(f'{log_indent}  Train Loss: {avg_epoch_train_loss:.4f}, Train Acc: {avg_epoch_train_acc:.4f}')
            print(f'{log_indent}  Val Loss:   {avg_epoch_val_loss:.4f}, Val Acc:   {avg_epoch_val_acc:.4f}')
            epoch_summary.append({
                'epoch': epoch + 1,
                'train_loss': avg_epoch_train_loss,
                'train_acc': avg_epoch_train_acc,
                'val_loss': avg_epoch_val_loss,
                'val_acc': avg_epoch_val_acc
            })

        print(f"{log_indent}Training from scratch completed.")
        return model, epoch_summary
    # --- End of train_model_scratch_epochs ---

    # --- 4. Execute Scratch Training ---
    if not missing_vars_scratch:
        print(f"\nInstantiating ResNet-18 for training from scratch with {current_num_classes} classes.")
        model_scratch_instance = get_resnet18_scratch(num_classes=current_num_classes)
        model_scratch_instance.to(device)

        # Optimizer for scratch training - all parameters with the same learning rate
        optimizer_scratch = optim.Adam(model_scratch_instance.parameters(), lr=SCRATCH_LEARNING_RATE)
        
        # 'criterion' should be globally defined (e.g., criterion = nn.CrossEntropyLoss())
        scratch_start_time = time.time()
        
        trained_model_scratch, scratch_training_summary = train_model_scratch_epochs(
            model_scratch_instance,
            train_loader, # Global train_loader
            val_loader,   # Global val_loader
            criterion,    # Global criterion
            optimizer_scratch,
            SCRATCH_NUM_EPOCHS,
            device        # Global device
        )
        
        scratch_total_time = time.time() - scratch_start_time
        print(f"\nTotal time for scratch training ({SCRATCH_NUM_EPOCHS} epochs): {scratch_total_time:.2f} seconds.")

        print("\nSummary of Training from Scratch:")
        for item in scratch_training_summary:
            print(f"  Epoch {item['epoch']}: Train Loss={item['train_loss']:.4f}, Train Acc={item['train_acc']:.4f}, "
                  f"Val Loss={item['val_loss']:.4f}, Val Acc={item['val_acc']:.4f}")
        
        # Optionally, save this model's parameters too
        # scratch_model_save_path = "model_parameters_scratch.pth"
        # torch.save(trained_model_scratch.state_dict(), scratch_model_save_path)
        # print(f"Scratch model parameters saved to: {os.path.abspath(scratch_model_save_path)}")

print("\n--- Scratch Training Cell Execution Finished ---")

--- Training ResNet-18 From Scratch ---
Prerequisite variables found. Proceeding with training from scratch.

Instantiating ResNet-18 for training from scratch with 101 classes.
Training from scratch on cpu for 5 epoch(s)...
  Epoch [1/5], Batch [20/217], Batch Loss: 4.1954
  Epoch [1/5], Batch [40/217], Batch Loss: 3.5331
  Epoch [1/5], Batch [60/217], Batch Loss: 3.6803
  Epoch [1/5], Batch [80/217], Batch Loss: 3.8752
  Epoch [1/5], Batch [100/217], Batch Loss: 2.8735
  Epoch [1/5], Batch [120/217], Batch Loss: 3.1450
  Epoch [1/5], Batch [140/217], Batch Loss: 2.9949
  Epoch [1/5], Batch [160/217], Batch Loss: 3.0378
  Epoch [1/5], Batch [180/217], Batch Loss: 2.5939
  Epoch [1/5], Batch [200/217], Batch Loss: 2.5988
  Epoch 1/5 took 202.34s
    Train Loss: 3.3635, Train Acc: 0.3062
    Val Loss:   2.9619, Val Acc:   0.3669
  Epoch [2/5], Batch [20/217], Batch Loss: 2.9442
  Epoch [2/5], Batch [40/217], Batch Loss: 2.2189
  Epoch [2/5], Batch [60/217], Batch Loss: 2.1546
  Epoch [2

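The accuracy of the model trained from scratch is clearly much lower than that of the fine-tuned model: after the first epoch, its validation accuracy was 0.3669, compared with 0.8497 for fine-tuning.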

## 7. GitHub Repository and Weights File

GitHub repository: https://github.com/RuiyuanHuang/NN_and_DL_Mideterm1/tree/main

Weights file: https://drive.google.com/file/d/1TpXDjjGhcojaFSK53TGxED5ZgdDrY3OV/view?usp=drive_link