# Main Training Script

> The main entry point for training; it orchestrates the other modules to train the model.

## Description:
The main module is the project's primary entry point for training. It combines the task definitions from the core module with the data loading from the data module, and runs training through PyTorch Lightning’s Trainer. Configuration classes let users switch datasets, models, and training strategies quickly when setting up experiments.
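
The idea of switching experiments through configuration classes can be sketched with plain dataclasses. This is only an illustration: `DatasetConfig` and `TaskConfig` are hypothetical stand-ins, and their fields mirror only the attributes set later in this notebook (`learning_rate`, `experiment_index`, `dataset_config.batch_size`). The real `ClassificationTaskConfig` may be structured differently.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class DatasetConfig:
    # Hypothetical stand-in for the dataset part of the real config.
    name: str = "CIFAR100"
    batch_size: int = 64


@dataclass(frozen=True)
class TaskConfig:
    # Hypothetical stand-in for ClassificationTaskConfig; fields are illustrative.
    learning_rate: float = 3e-4
    experiment_index: int = 1
    dataset_config: DatasetConfig = field(default_factory=DatasetConfig)


base = TaskConfig()
# Derive a variant experiment without mutating the baseline config.
variant = replace(
    base,
    learning_rate=1e-3,
    dataset_config=replace(base.dataset_config, batch_size=128),
)
print(base.dataset_config.batch_size, variant.dataset_config.batch_size)
```

The sketch freezes the configs and derives variants with `replace`, so a baseline stays comparable across runs. The notebook itself assigns attributes on a mutable config object, which is equally valid.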

## Main symbols:

- Trainer: The PyTorch Lightning controller that manages the training process.

- ClassificationTask: The primary task class for model training, imported from core.

- CIFAR100DataModule: The data loading module, imported from data.

In [1]:
#| default_exp __main__

In [1]:
#| hide
%load_ext autoreload
%autoreload 2
from nbdev.showdoc import *

In [1]:
#| export
import torch
from namable_classify.core import ClassificationTask, ClassificationTaskConfig

config = ClassificationTaskConfig()
# Other learning rates kept for reference: 1, 1e-1, 1e-3, 1e-5, 1e-6
config.learning_rate = 3e-4
config.experiment_index = 1
config.dataset_config.batch_size = 64

# Build the task from the config and print the model structure.
cls_task = ClassificationTask(config)
cls_task.print_model_pretty()

# Optional: compile the model (fullgraph=True is a further option).
# cls_task.cls_model = torch.compile(cls_task.cls_model, mode='reduce-overhead')

Seed set to 1


In [2]:
#| export
from boguan_yuequ.auto import AutoYueQuAlgorithm
AutoYueQuAlgorithm(cls_task, "LORA")

Before 约取 (YueQu) , the model structure is: 


Using LORA Algorithm from Hugging Face PEFT Library. 
peft_config: LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=None, inference_mode=False, r=8, target_modules=['query', 'value'], lora_alpha=8, lora_dropout=0.0, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False))
After 约取 (YueQu) , the model structure is: 


<boguan_yuequ.auto.AutoYueQuAlgorithm at 0x71cb46476110>
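
To make the `r=8`, `lora_alpha=8` values in the printed `peft_config` more concrete, the following sketch counts the extra trainable parameters that one LoRA adapter adds to a single dense projection. LoRA represents the weight update as a product of two low-rank matrices, scaled by `lora_alpha / r` when `use_rslora=False`. The hidden size of 768 is an assumption chosen for illustration; the log does not state the model's dimensions.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


r, lora_alpha = 8, 8      # values taken from the peft_config printed above
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A
hidden = 768              # ASSUMPTION: hypothetical hidden size, not from the log

full = hidden * hidden                        # weights in one dense projection
extra = lora_extra_params(hidden, hidden, r)  # trainable LoRA weights for it
print(scaling, full, extra, f"{extra / full:.2%}")
```

Under that assumed size, each adapted `query` or `value` projection trains roughly 2% as many weights as the frozen matrix it wraps.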

In [4]:
# #| export
# import lightning as L
# trainer = L.Trainer()
# from lightning.pytorch.tuner import Tuner
# tuner = Tuner(trainer)
# found_batch_size = tuner.scale_batch_size(cls_task, datamodule=cls_task.lit_data, 
#                                         #   mode='binsearch', 
#                                           mode='power', 
#                                           init_val=64)
# # found_batch_size, cls_task.lit_data.hparams.batch_size
# print(f"Found batch size: {found_batch_size}")
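
The commented-out cell above uses the batch size scaler in `power` mode, which keeps doubling the batch size until a trial run no longer fits in memory. A minimal sketch of that search follows. `fits` is a hypothetical predicate standing in for an actual trial run, and the function is not Lightning's implementation.

```python
def scale_batch_size_power(fits, init_val=64, max_trials=25):
    """Double the batch size until `fits` fails; return the last size that worked."""
    size, best = init_val, None
    for _ in range(max_trials):
        if not fits(size):  # stands in for a trial run hitting out-of-memory
            break
        best = size
        size *= 2
    return best


# Pretend that anything above 1000 samples per batch runs out of memory.
print(scale_batch_size_power(lambda b: b <= 1000))
```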

In [8]:
#| export
import lightning as L
from namable_classify.utils import runs_path
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks import ModelSummary, StochasticWeightAveraging, DeviceStatsMonitor, LearningRateMonitor, LearningRateFinder
from lightning.pytorch.loggers import TensorBoardLogger, CSVLogger, WandbLogger

trainer = L.Trainer(
    default_root_dir=runs_path,
    enable_checkpointing=True,
    enable_model_summary=True,
    # Run 2 validation batches before training so that validation bugs surface
    # immediately instead of after a long training epoch.
    num_sanity_val_steps=2,
    callbacks=[
        # EarlyStopping(monitor="val_loss", mode="min")
        EarlyStopping(
            monitor="val_acc1",
            mode="max",
            check_finite=True,
            patience=5,
            # patience=6,
            check_on_train_epoch_end=False,  # check at the end of validation
            verbose=True,
        ),
        ModelSummary(max_depth=3),
        # https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/
        # StochasticWeightAveraging(swa_lrs=1e-2),
        # DeviceStatsMonitor(cpu_stats=True),
        LearningRateMonitor(),
        # LearningRateFinder(),  # disabled: triggers a strange bug
    ],
    max_epochs=15,
    # gradient_clip_val=1.0, gradient_clip_algorithm="value",
    logger=[
        # TensorBoardLogger(save_dir=runs_path/"tensorboard"),
        TensorBoardLogger(save_dir=runs_path),
        CSVLogger(save_dir=runs_path),
        WandbLogger(project="namable_classify", name="test"),
    ],
    # profiler="simple",
    # fast_dev_run=True,
    # limit_train_batches=10, limit_val_batches=5,
    # strategy="ddp", accelerator="gpu", devices=4,
)

Trainer will use only 1 of 8 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=8)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
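
The early stopping rule configured above (`monitor="val_acc1"`, `mode="max"`, `patience=5`) can be sketched in plain Python. This is a simplified model of the patience logic only: it ignores `check_finite` and `min_delta`, and the validation scores are made up for illustration.

```python
def stop_epoch(val_scores, patience=5):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best:  # mode="max": a strictly higher score is an improvement
            best, wait = score, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation accuracies; the best value (0.65) is first reached at epoch 2.
scores = [0.50, 0.60, 0.65, 0.64, 0.65, 0.63, 0.62, 0.61, 0.70]
print(stop_epoch(scores))
```

In this made-up run, training stops after five consecutive checks without improvement, so the later jump to 0.70 is never reached. That is the trade-off a small `patience` makes.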


In [None]:
# #| export
# from lightning.pytorch.tuner import Tuner
# tuner = Tuner(trainer)

# lr_finder = tuner.lr_find(
#     cls_task,
#     datamodule=cls_task.lit_data,
#     # max_lr=1e-2,
#     method="fit",
#     min_lr=1e-8,
#     max_lr=1,
#     num_training=100,
#     mode="exponential",
# )
# print(lr_finder.results)

# fig = lr_finder.plot(suggest=True)
# from matplotlib import pyplot as plt
# from namable_classify.utils import runs_figs_path
# plt.savefig(runs_figs_path/'lr_finder.png')
# # fig.show()
# new_lr = lr_finder.suggestion()
# # new_lr, cls_task.hparams.learning_rate
# print("New learning rate: ", new_lr)
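
The commented-out learning rate finder above sweeps from `min_lr=1e-8` to `max_lr=1` over `num_training=100` steps in `exponential` mode. A rough sketch of such a sweep is a geometric progression between the two bounds. The exact schedule Lightning uses may differ at the endpoints, so treat this as an approximation rather than the library's code.

```python
def exponential_lr_sweep(min_lr: float, max_lr: float, num_training: int) -> list[float]:
    """Geometrically spaced learning rates from min_lr to max_lr (both included)."""
    if num_training < 2:
        raise ValueError("need at least two steps")
    ratio = max_lr / min_lr
    return [min_lr * ratio ** (i / (num_training - 1)) for i in range(num_training)]


lrs = exponential_lr_sweep(min_lr=1e-8, max_lr=1.0, num_training=100)
print(lrs[0], lrs[len(lrs) // 2], lrs[-1])
```

Because consecutive steps differ by a constant factor, the sweep spends as many steps between 1e-8 and 1e-7 as between 1e-1 and 1, which suits a search spanning eight orders of magnitude.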

In [None]:
#| export
trainer.fit(cls_task, datamodule=cls_task.lit_data)

In [None]:
#| export
trainer.test(cls_task, datamodule=cls_task.lit_data)

In [11]:
from namable_classify.utils import lib_repo_path
trainer.test(cls_task, datamodule=cls_task.lit_data, 
             ckpt_path=lib_repo_path/"deprecated/lightning_logs/version_53/checkpoints/epoch=11-step=8448.ckpt")

You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision


Files already downloaded and verified
Files already downloaded and verified


[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33m2603119857[0m ([33mhandicraft-computing[0m). Use [1m`wandb login --relogin`[0m to force relogin



Restoring states from the checkpoint path at /home/ycm/repos/research/cv/cls/NamableClassify/deprecated/lightning_logs/version_53/checkpoints/epoch=11-step=8448.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Loaded model weights from the checkpoint at /home/ycm/repos/research/cv/cls/NamableClassify/deprecated/lightning_logs/version_53/checkpoints/epoch=11-step=8448.ckpt



[{'test_loss': 1.0386210680007935,
  'test_acc1': 0.9221000075340271,
  'test_acc2': 0.9702000021934509,
  'test_acc3': 0.9850000143051147,
  'test_acc5': 0.991599977016449,
  'test_acc10': 0.9961000084877014,
  'test_acc20': 0.9980000257492065,
  'test_roc_auc': 0.9985058903694153,
  'test_matthews_corrcoef': 0.9213470220565796,
  'test_f1': 0.9220957159996033,
  'test_precision': 0.9248200058937073,
  'test_recall': 0.9221000075340271,
  'test_log_loss': 0.4029653072357178,
  'test_balanced_accuracy': 0.9221000075340271,
  'test_cohen_kappa': 0.9213131070137024,
  'test_hinge_loss': 0.2577705681324005}]
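
Assuming that metrics such as `test_acc1` and `test_acc5` denote top-k accuracy (the names suggest this, but the notebook does not say so), the following sketch shows how such a metric is computed. The scores and labels are made up, and the function is an illustration rather than the metric implementation used by the task.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Made-up scores for 3 samples over 4 classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: top-1 hit
    [0.4, 0.3, 0.2, 0.1],  # true class 1: only a top-2 hit
    [0.1, 0.2, 0.3, 0.4],  # true class 0: missed even at top-3
]
labels = [1, 1, 0]

for k in (1, 2, 3, 4):
    print(k, top_k_accuracy(scores, labels, k))
```

Top-k accuracy can only grow with k, which matches the monotone increase from `test_acc1` to `test_acc20` in the results above.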

In [6]:
#| hide
import nbdev; nbdev.nbdev_export()