### Tutorial: Training Demo

This notebook demonstrates the HME training pipeline using a minimal setup. 

**⚠️ Important:**
This notebook is for **educational purposes** to understand the code structure and data flow. For actual large-scale training (multi-GPU, DeepSpeed), please use the provided shell scripts in `scripts/`.

**Steps:**
1. Verify that the small demo dataset generated by `data_preprocess.ipynb` is present.
2. Configure training arguments (CPU/Single-GPU mode).
3. Run the training loop for a few steps.

In [2]:
import sys
import os
import json
import torch
from pathlib import Path

# Add project root to path
sys.path.append('..')

# Import the training function
from hme.run_clm import train

# Output directory for this demo
DEMO_OUTPUT_DIR = Path("demo_training_output")
DEMO_OUTPUT_DIR.mkdir(exist_ok=True)

In [7]:
# We will use the data generated in Tutorial 01
json_data_path = "../datasets/property_qa_test_2.json"
pt_data_path = "./demo_subset_1000.json.cfm.pt"

# Validate that prerequisites are met
missing = [p for p in (json_data_path, pt_data_path) if not os.path.exists(p)]
if missing:
    # Raising here stops execution of subsequent cells when data is missing
    raise FileNotFoundError(
        f"Demo data not found: {missing}. "
        "Please run 'data_preprocess.ipynb' first to generate the demo subset."
    )
print(f"Found training data: {json_data_path}")
print(f"Found embeddings: {pt_data_path}")

Found training data: ../datasets/property_qa_test_2.json
Found embeddings: ./demo_subset_1000.json.cfm.pt
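
Before training, it can help to check what the JSON file actually contains. The sketch below is a hypothetical helper (`summarize_json` is not part of the HME code base) that counts records and lists the field names of the first one. It assumes the file is a JSON list of dict records, and the sample records are invented stand-ins so the snippet runs on its own; the real schema of the demo dataset may differ.

```python
import json
import tempfile
from pathlib import Path


def summarize_json(path):
    """Return (number of records, sorted field names of the first record)."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the file is a JSON list of dict records.
    # A top-level dict is treated as a single record.
    records = data if isinstance(data, list) else [data]
    first = records[0] if records else {}
    fields = sorted(first) if isinstance(first, dict) else []
    return len(records), fields


# Stand-in file with hypothetical records so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.json"
    sample_path.write_text(
        json.dumps([
            {"question": "q1", "answer": "a1"},
            {"question": "q2", "answer": "a2"},
        ]),
        encoding="utf-8",
    )
    count, fields = summarize_json(sample_path)

print(count, fields)
```

In the notebook, calling `summarize_json(json_data_path)` would give the same kind of summary for the real demo file.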


### 1. Configure Training Arguments

We translate the parameters from `scripts/run_zero2_comprehension-pretrain.sh` into a Python `sys.argv` list, which `train()` then parses as if it had been launched from the command line.

**Key Adjustments for Demo:**
*   `--num_train_epochs 1`
*   `--max_steps 20` (stop after 20 steps; this overrides `--num_train_epochs`)
*   Disable DeepSpeed (Run standard PyTorch)
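
The mechanism behind these adjustments is that `sys.argv` is overwritten before the training entry point parses it. The sketch below illustrates the idea with the standard-library `argparse` rather than the `HfArgumentParser` that the training script uses, so it is an analogy under that assumption, not the project's actual parser. The helper `parse_demo_args`, the program name `demo`, and the default values are hypothetical.

```python
import argparse
import sys


def parse_demo_args(argv):
    """Parse a small subset of the demo flags from an argv-style list."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--num_train_epochs", type=float, default=3.0)
    parser.add_argument("--max_steps", type=int, default=-1)
    parser.add_argument("--do_train", action="store_true")
    # argv[0] is the program name, so parsing starts at argv[1].
    return parser.parse_args(argv[1:])


# Overwrite sys.argv as a notebook cell would, then parse it.
saved_argv = sys.argv
sys.argv = ["demo", "--num_train_epochs", "1", "--max_steps", "20", "--do_train"]
try:
    args = parse_demo_args(sys.argv)
finally:
    sys.argv = saved_argv  # restore so later cells are unaffected

print(args.max_steps, args.num_train_epochs, args.do_train)
```

Restoring `sys.argv` afterwards is optional in a throwaway demo, but it avoids surprising any later cell that also reads command-line arguments.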

In [4]:
# Simulate command line arguments
# Note: We point 'base_model_path' to the downloaded Llama-3 folder
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Use GPU 0 if available
base_model_path = "../checkpoints/Meta-Llama-3-8B-Instruct"

if not os.path.exists(base_model_path):
    print(f"⚠️ Warning: Base model not found at {base_model_path}. Please run Tutorial 02 first.")
else:
    # Construct arguments list
    # These map directly to HfArgumentParser fields in hme/run_clm.py
    sys.argv = [
        "hme.run_clm",
        "--model_name_or_path", base_model_path,
        "--output_dir", str(DEMO_OUTPUT_DIR),
        
        # Data Config (Using Real Demo Data)
        "--data_path", json_data_path,
        "--task_type", "qa",
        "--data_type", "1d,2d,3d,frg",
        "--emb_dict_mol", pt_data_path,
        "--emb_dict_protein", "none",
        
        # LoRA Config
        "--lora_r", "8",
        "--lora_alpha", "16",
        "--lora_targets", "q_proj,v_proj",
        "--modules_to_save", "feature_fuser",
        "--merge_when_finished", "False",
        
        # Training Config (Minimal for Speed)
        "--max_length", "128",         # Shorter length for speed
        "--per_device_train_batch_size", "1", # Batch size 1 for CPU/Low-mem GPU compatibility
        "--gradient_accumulation_steps", "1",
        "--learning_rate", "1e-4",
        "--num_train_epochs", "1",
        "--max_steps", "20",            # Only run 20 steps to verify the loop
        "--save_strategy", "no",
        "--logging_steps", "1",        # Log every step to show progress immediately
        "--report_to", "none",
        "--bf16", "False",
        "--do_train"
    ]

    print("Starting training demo with real data...")
    print("(This may take a minute to load the model...)")
    
    train()
    print("\nDemo training finished successfully!")

Starting training demo with real data...
(This may take a minute to load the model...)


Using vocab_file: vocab_800_other_tasks.txt to load fragment list.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

trainable params: 267,644,928 || all params: 8,568,729,600 || trainable%: 3.1235


  torch.load(emb_dict_mol, map_location="cpu")


Now the length of the dataset is 4060


max_steps is given, it will override any value given in num_train_epochs


[2025-12-08 07:59:17,527] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/home/lvliuzhenghao/miniconda3/envs/mollama/bin/../lib/gcc/x86_64-conda-linux-gnu/11.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/home/lvliuzhenghao/miniconda3/envs/mollama/bin/../lib/gcc/x86_64-conda-linux-gnu/11.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lcufile: No such file or directory
collect2: error: ld returned 1 exit status
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


| Step | Training Loss |
|------|---------------|
| 1 | 3.935 |
| 2 | 2.1574 |
| 3 | 2.9623 |
| 4 | 3.8283 |
| 5 | 1.9189 |
| 6 | 2.1684 |
| 7 | 1.069 |
| 8 | 1.3924 |
| 9 | 1.1911 |
| 10 | 1.1686 |

Demo training finished successfully!
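
The numbers in the log can be cross-checked with a little arithmetic. The sketch below takes the dataset length and parameter counts printed above, together with the batch settings from the argument list, and works out how little of an epoch the demo covers and what share of the model is trainable. It assumes a single device, and the steps-per-epoch figure is an approximation of how the trainer counts steps, not a reproduction of its internals.

```python
import math

dataset_len = 4060      # from the log: "Now the length of the dataset is 4060"
per_device_batch = 1    # --per_device_train_batch_size
grad_accum = 1          # --gradient_accumulation_steps
num_devices = 1         # assumption: one GPU (CUDA_VISIBLE_DEVICES='0')
max_steps = 20          # --max_steps

# Samples consumed per optimizer step, and steps needed for one full epoch.
samples_per_step = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(dataset_len / samples_per_step)

# max_steps overrides num_train_epochs, so the run stops long before one epoch.
samples_seen = max_steps * samples_per_step
epoch_fraction = samples_seen / dataset_len

# Share of trainable parameters (LoRA adapters plus modules_to_save).
trainable, total = 267_644_928, 8_568_729_600
trainable_pct = 100 * trainable / total

print(f"steps per epoch: {steps_per_epoch}")
print(f"samples seen: {samples_seen} ({epoch_fraction:.2%} of one epoch)")
print(f"trainable: {trainable_pct:.4f}%")
```

With these settings a full epoch would take 4060 steps, so a 20-step run touches well under 1% of the data. That is enough to confirm the loop runs end to end, but the loss values should not be read as a measure of model quality.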
