
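# Task
Fine-tune the "https://huggingface.co/Weyaxi/Einstein-v6.1-Llama3-8B" model with QLoRA on the "/content/dataset_7_30-finnal_1.json" dataset. The dataset contains `instruction`, `output`, and `system` fields, and 15% of it is held out as a validation set during training.

## Environment Setup

### Subtask:
Install the required libraries, such as `transformers`, `peft`, `bitsandbytes`, and `trl`.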


**Reasoning**:
Install the necessary libraries for QLoRA fine-tuning.



In [1]:
%pip install transformers peft bitsandbytes trl wandb numpy datasets



## Load the Dataset

### Subtask:
Load the dataset from the `/content/dataset_7_30-finnal_1.json` file.


**Reasoning**:
Import the necessary function and load the dataset from the specified JSON file.



In [2]:
from datasets import load_dataset
import os
#os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
dataset = load_dataset("json", data_files="/content/dataset82_restructured.json", split="train")
print(dataset)

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 1405
})


## Data Processing

### Subtask:
Format the dataset into the layout required for training, then split it into training and validation sets at an 85:15 ratio.


**Reasoning**:
Define the formatting function, apply it to the dataset, and then split the dataset into training and validation sets.



In [3]:
import numpy as np

def format_data(example):
    # Build an Alpaca-style prompt; the Input section is included only when it is non-empty
    if example.get('input') and example['input'].strip():
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return prompt

formatted_dataset = dataset.map(lambda x: {"text": format_data(x)})

# Hold out 15% of the examples for validation
train_test_split = formatted_dataset.train_test_split(test_size=0.15)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

def compute_metrics(eval_pred):
    # Token-level accuracy; defined here but not passed to the trainer later on
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": np.mean(predictions == labels)}

print("Training dataset size:", len(train_dataset))
print("Validation dataset size:", len(eval_dataset))
print("First formatted example in training set:")
print(train_dataset[0]['text'])

Training dataset size: 1194
Validation dataset size: 211
First formatted example in training set:
### Instruction:
You are QuantumMentor, an AI teaching assistant with advanced expertise in Quantum Materials Science and Photonic Materials.

Your academic foundation spans the core disciplines of physics, materials science, and applied optics, with a strong emphasis on the theoretical and computational frameworks that underlie modern quantum and photonic material systems.

Your primary responsibility is to accurately answer complex questions while helping master's and advanced undergraduate students **understand** the physical principles and mathematical reasoning involved.

Your answers must satisfy two goals:
- Provide technically correct and complete solutions.
- Promote conceptual understanding and learning.

To achieve this, follow these principles:

1. **Structured Explanation**  
   - Start with core physical or material concepts.  
   - Proceed step by step toward mathematical fo
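The split sizes printed above follow from the 85:15 ratio. The sketch below reproduces the arithmetic, assuming the test share is rounded up and the remainder goes to training. That rounding rule is an assumption, chosen because it matches the printed sizes. `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(n_rows * test_size)  # test share rounded up (assumed rounding rule)
    n_train = n_rows - n_test               # the remainder is used for training
    return n_train, n_test

n_train, n_test = split_sizes(1405, 0.15)
print("train:", n_train, "validation:", n_test)
```

Because `train_test_split` is called without a `seed`, the rows that land in each split can differ between runs even though the sizes stay the same.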

## Load the Pretrained Model and Tokenizer

### Subtask:
Load the `Weyaxi/Einstein-v6.1-Llama3-8B` model and its corresponding tokenizer.


In [4]:
#%pip install flash-attn
#--no-build-isolation --force-reinstall

**Reasoning**:
Load the tokenizer and the model with 4-bit quantization. The model is loaded without FlashAttention 2.



In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Weyaxi/Einstein-v6.1-Llama3-8B"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    # padding="max_length" needs a pad token; fall back to EOS if none is defined
    tokenizer.pad_token = tokenizer.eos_token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run computation in bfloat16
    bnb_4bit_use_double_quant=True,         # enable double quantization
    bnb_4bit_quant_type="nf4",              # use the NF4 data type
)

def tokenize_function(examples):
    # padding="max_length" and truncation=True give every output sequence the same length
    return tokenizer(examples["text"], padding="max_length", max_length=2048, truncation=True)

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True)

# Load the model in 4-bit, without FlashAttention 2
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,         # attach the quantization config
    device_map="auto",                      # places the weights on the GPU; no extra .to("cuda") is needed
)

# Print confirmation
print("Tokenizer loaded:", tokenizer.__class__.__name__)
print("Model loaded on device:", next(model.parameters()).device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Tokenizer loaded: PreTrainedTokenizerFast
Model loaded on device: cuda:0


## Configure QLoRA

### Subtask:
Set the QLoRA parameters, such as `r`, `lora_alpha`, and `lora_dropout`.


**Reasoning**:
Import LoraConfig and create a LoraConfig object with specified parameters for QLoRA fine-tuning.



In [5]:

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],  # all attention and MLP projection layers
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
print(lora_config)

LoraConfig(task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='Weyaxi/Einstein-v6.1-Llama3-8B', revision=None, inference_mode=False, r=16, target_modules={'k_proj', 'up_proj', 'gate_proj', 'down_proj', 'q_proj', 'o_proj', 'v_proj'}, exclude_modules=None, lora_alpha=64, lora_dropout=0.0, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', trainable_token_indices=None, loftq_config={}, eva_config=None, corda_config=None, use_dora=False, use_qalora=False, qalora_group_size=16, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False)
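To see what `r=16` and `lora_alpha=64` mean in practice, the sketch below counts the trainable LoRA parameters and computes the scaling factor. The layer dimensions are assumptions taken from the public Llama 3 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); they are not printed anywhere in this notebook.

```python
# Llama 3 8B dimensions (assumed from the public model config, not from this notebook)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128      # 8 key/value heads x head dimension 128
NUM_LAYERS = 32

R = 16
LORA_ALPHA = 64

# (in_features, out_features) of each projection listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) to each module: r * (in + out) parameters
per_layer = sum(R * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
total_trainable = per_layer * NUM_LAYERS

# With use_rslora=False the low-rank update is scaled by lora_alpha / r
scaling = LORA_ALPHA / R

print(f"per layer: {per_layer:,}  total: {total_trainable:,}  scaling: {scaling}")
```

On a live model, `model.print_trainable_parameters()` from PEFT reports the actual count, which is the authoritative number if it differs from this estimate.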


## Configure Training Arguments

### Subtask:
Set the training parameters, such as `num_train_epochs`, `per_device_train_batch_size`, and `learning_rate`.


In [6]:
from transformers import TrainingArguments
!wandb login
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=50,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    per_device_eval_batch_size=2,
    #eval_accumulation_steps=32,
    gradient_checkpointing=False,
    eval_strategy="steps",
    eval_steps=200,
    logging_steps=200,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False, # bf16 is used instead when the hardware supports it
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(), # bfloat16 mixed precision only on GPUs that support it
    max_grad_norm=0.3,
    warmup_ratio=0.03, # has no effect with the "constant" scheduler; "constant_with_warmup" would apply it
    lr_scheduler_type="constant",
    save_strategy="steps",
    save_steps=200,
    push_to_hub=False,
    report_to="wandb",
    run_name="qlora-einstein-v6.1-run3",
)

print(training_args)

[34m[1mwandb[0m: Currently logged in as: [33mling7zhao[0m ([33mling7zhao-university-of-glasgow[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
d

## Initialize the Trainer

### Subtask:
Initialize the `Trainer` with the loaded model, the processed data, the QLoRA configuration, and the training arguments.


In [7]:
%pip install kernels
from trl import SFTTrainer
import kernels
# The model is already wrapped by get_peft_model, so peft_config is not passed a second time
trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    args=training_args,
    #compute_metrics=compute_metrics,
)

print(trainer)






No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


<trl.trainer.sft_trainer.SFTTrainer object at 0x7b77edf305d0>


## Start Training

### Subtask:
Run the fine-tuning training.


**Reasoning**:
Start the model fine-tuning training process using the initialized trainer object.



In [8]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mling7zhao[0m ([33mling7zhao-university-of-glasgow[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin



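| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 200  | 0.3214 | 0.404891 |
| 400  | 0.0777 | 0.550248 |
| 600  | 0.0348 | 0.568023 |
| 800  | 0.0244 | 0.618831 |
| 1000 | 0.0202 | 0.639058 |
| 1200 | 0.0178 | 0.658923 |
| 1400 | 0.0163 | 0.668836 |
| 1600 | 0.0166 | 0.671167 |
| 1800 | 0.0165 | 0.706637 |
| 2000 | 0.0156 | 0.691275 |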
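In the logged values, validation loss is lowest at the first evaluation and rises afterwards while training loss keeps falling, which is the usual signature of overfitting. The sketch below picks the best logged step from those numbers. The `history` list is copied by hand from the log above, and the variable names are illustrative only.

```python
# (step, training_loss, validation_loss) copied from the training log above
history = [
    (200, 0.3214, 0.404891),
    (400, 0.0777, 0.550248),
    (600, 0.0348, 0.568023),
    (800, 0.0244, 0.618831),
    (1000, 0.0202, 0.639058),
    (1200, 0.0178, 0.658923),
    (1400, 0.0163, 0.668836),
    (1600, 0.0166, 0.671167),
    (1800, 0.0165, 0.706637),
    (2000, 0.0156, 0.691275),
]

# The step with the lowest validation loss is the best checkpoint candidate
best_step, best_train, best_val = min(history, key=lambda row: row[2])

# The gap between validation and training loss shows how far the two have diverged
gaps = {step: val - train for step, train, val in history}

print("best step:", best_step, "validation loss:", best_val)
print("gap at best step:", round(gaps[best_step], 4), "gap at last step:", round(gaps[2000], 4))
```

If the goal were to keep that checkpoint automatically, `TrainingArguments` has `load_best_model_at_end` and `metric_for_best_model` options. The notebook does not use them, so this is a possible design choice rather than part of the recorded run.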


TrainOutput(global_step=3750, training_loss=0.03666259587605794, metrics={'train_runtime': 25294.652, 'train_samples_per_second': 2.36, 'train_steps_per_second': 0.148, 'total_flos': 2.768172936383693e+18, 'train_loss': 0.03666259587605794})
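The reported `global_step=3750` and throughput figures can be checked against the training arguments. The sketch below does so under two assumptions: a single GPU (the printed arguments show `_n_gpu=1`) and a final partial accumulation window that still counts as one optimizer step. Those assumptions are consistent with the reported step count.

```python
import math

n_train = 1194           # training examples after the 85:15 split
per_device_batch = 2     # per_device_train_batch_size
grad_accum = 8           # gradient_accumulation_steps
epochs = 50              # num_train_epochs
train_runtime_s = 25294.652

effective_batch = per_device_batch * grad_accum                 # 16 examples per optimizer step
batches_per_epoch = math.ceil(n_train / per_device_batch)       # 597 micro-batches
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)   # 75 optimizer steps
total_steps = updates_per_epoch * epochs                        # 3750, matches global_step

steps_per_second = total_steps / train_runtime_s
samples_per_second = n_train * epochs / train_runtime_s
epochs_between_evals = 200 / updates_per_epoch                  # eval_steps=200

print(total_steps, round(steps_per_second, 3), round(samples_per_second, 2))
print("hours:", round(train_runtime_s / 3600, 2), "epochs per eval:", round(epochs_between_evals, 2))
```

With evaluation every 200 steps, the first validation measurement arrives only after more than two full passes over the training data.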

In [9]:
model.save_pretrained("/content/qlora-adapter-only")
trainer.save_model("/content/qlora-finetuned-model")
tokenizer.save_pretrained("/content/qlora-finetuned-model")

('/content/qlora-finetuned-model/tokenizer_config.json',
 '/content/qlora-finetuned-model/special_tokens_map.json',
 '/content/qlora-finetuned-model/chat_template.jinja',
 '/content/qlora-finetuned-model/tokenizer.json')

In [10]:
import shutil
shutil.make_archive("/content/qlora-finetuned-model", 'zip', "/content/qlora-finetuned-model")

'/content/qlora-finetuned-model.zip'

**Reasoning**:
An earlier training attempt failed due to a Flash Attention import error, and the `kernels` installation did not fully resolve it; the training run recorded above nevertheless completed. The next step is to reload the saved model and tokenizer for inference.



In [11]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer
finetuned_model_path = "/content/qlora-finetuned-model"
model = AutoModelForCausalLM.from_pretrained(finetuned_model_path)
tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)

# Set the model to evaluation mode and move it to the GPU
model.eval()
model.to("cuda")

print("Finetuned model loaded and set to evaluation mode.")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Loading adapter weights from /content/qlora-finetuned-model led to unexpected keys not found in the model: base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight, base_model.model.model.layers.0.mlp.down_proj.lora_B.default.weight, base_model.model.model.layers.0.mlp.gate_proj.lora_A.default.weight, base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight, base_model.model.model.layers.0.mlp.up_proj.lora_A.default.weight, base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight, base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight, base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight, base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight, base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight, base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight, base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight, base_model.model.model.layers.0.self_attn.v_proj.lo

Finetuned model loaded and set to evaluation mode.
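The "unexpected keys" warning above lists LoRA tensors that were not matched to the model, which suggests the adapter may not have been applied as intended. One alternative is to load the base model first and attach the adapter explicitly with PEFT. The sketch below shows that approach; `load_base_with_adapter` is a hypothetical helper, and the paths are the ones used earlier in this notebook. It has not been run against this checkpoint, so treat it as a starting point under those assumptions.

```python
BASE_ID = "Weyaxi/Einstein-v6.1-Llama3-8B"
ADAPTER_DIR = "/content/qlora-adapter-only"      # written by model.save_pretrained earlier
TOKENIZER_DIR = "/content/qlora-finetuned-model"  # written by tokenizer.save_pretrained earlier

def load_base_with_adapter(base_id, adapter_dir, tokenizer_dir):
    """Load the 4-bit base model, then attach the saved LoRA adapter on top of it."""
    # Imports are local so the helper can be defined without a GPU environment
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_id, quantization_config=bnb_config, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_dir)  # attaches the LoRA weights
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    return model, tokenizer

# Usage (requires a GPU and the saved adapter directory):
# model, tokenizer = load_base_with_adapter(BASE_ID, ADAPTER_DIR, TOKENIZER_DIR)
```

Comparing generations from this model with those from the base model alone is a simple way to confirm that the adapter changes the outputs.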


In [17]:
# Example inference
prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nWhat is the significance of the distance $d$ in the context of lattice planes and reciprocal lattice vectors？<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7,
        #repetition_penalty=1.1
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

user
What is the significance of the distance $d$ in the context of lattice planes and reciprocal lattice vectors？assistant.weixin
The distance $d$ is a crucial parameter in the context of lattice planes and reciprocal lattice vectors. It represents the interplanar spacing, which is the distance between adjacent lattice planes. This distance is directly related to the diffraction pattern observed in X-ray diffraction experiments, as well as the positions of Bragg peaks in the diffraction pattern.

The reciprocal lattice vectors are a set of vectors that are orthogonal to the lattice planes and have a magnitude that is inversely proportional to the interplanar spacing. In other words, the reciprocal lattice vectors are a mathematical tool used to describe the structure of the crystal lattice in terms of its periodicity and symmetry.

The distance $d$ plays a significant role in determining the direction and intensity of the Bragg peaks in the diffraction pattern. The Bragg peaks occur w
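The training text was built with the `### Instruction:` / `### Response:` template from `format_data`, whereas the inference prompt above uses Llama 3 header tokens. The sketch below builds a prompt that mirrors the training template instead. `build_training_style_prompt` is a hypothetical helper, and the notebook does not establish whether this mismatch affects output quality.

```python
def build_training_style_prompt(instruction: str, input_text: str = "") -> str:
    """Mirror format_data, but stop after the response header so the model fills it in."""
    if input_text and input_text.strip():
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            f"### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

question = (
    "What is the significance of the distance $d$ in the context of "
    "lattice planes and reciprocal lattice vectors?"
)
prompt = build_training_style_prompt(question)
print(prompt)
```

The resulting string can be passed to `tokenizer(prompt, return_tensors="pt")` in place of the hand-written prompt, with the rest of the generation code unchanged.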

In [None]:
import json
import pandas as pd
from datasets import load_dataset
import torch

# Load the test dataset
test_dataset_path = "/content/test_100.json"
test_dataset = load_dataset("json", data_files=test_dataset_path, split="train")

# Select 20 examples from the test set
num_examples = 20
selected_examples = test_dataset.select(range(num_examples))

results = []

# Ensure model and tokenizer are loaded and on the correct device (assuming they were loaded in previous cells)
# model.eval() # Assuming model is already in eval mode
# model.to("cuda") # Assuming model is already on cuda

for example in selected_examples:
    instruction = example['instruction']
    ground_truth_output = example['output']

    # Format the prompt for inference
    prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{example.get('system', '')}<|eot_id|>"
    if example.get('input') and example['input'].strip():
        prompt += f"<|start_header_id|>user<|end_header_id|>\n{instruction}\n{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    else:
        prompt += f"<|start_header_id|>user<|end_header_id|>\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"


    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=600, # Increased max_new_tokens for potentially longer responses
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,
            top_k=50,
            top_p=0.93,
            temperature=0.6
        )

    generated_response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)

    results.append({
        "Instruction": instruction,
        "Generated Output": generated_response,
        "Ground Truth Output": ground_truth_output
    })

# Create a pandas DataFrame
results_df = pd.DataFrame(results)

# Display the DataFrame
display(results_df)

In [None]:
# Save the DataFrame to a CSV file
results_df.to_csv("inference_results.csv", index=False)

print("DataFrame saved to inference_results.csv")

You can find the `inference_results.csv` file in the file browser on the left and right-click it to download. Alternatively, run the following code to download the file:

In [None]:
from google.colab import files
files.download("inference_results.csv")