<a href="https://colab.research.google.com/github/Lingche1/msc1/blob/main/train7_30.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Fine-tune the "https://huggingface.co/Weyaxi/Einstein-v6.1-Llama3-8B" model with QLoRA on the "/content/dataset_7_30-finnal_1.json" dataset. The dataset contains `instruction`, `output`, and `system` fields; hold out 15% of it as a validation set during training.

## Environment setup

### Subtask:
Install the required libraries, such as `transformers`, `peft`, `bitsandbytes`, and `trl`.


**Reasoning**:
Install the necessary libraries for QLoRA fine-tuning.



In [6]:
%pip install transformers peft bitsandbytes trl
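
Because the install command above does not pin versions, it can help to record what was actually installed before training. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and `datasets` is included in the list on the assumption that it is pulled in as a dependency.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the libraries this notebook relies on
for name, ver in installed_versions(
    ["transformers", "peft", "bitsandbytes", "trl", "datasets"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

Keeping this printout next to the training logs makes it easier to reproduce a run later, since the behavior of these libraries can differ between releases.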



## Load the dataset

### Subtask:
Load the dataset from the `/content/dataset_7_30-finnal_1.json` file.


**Reasoning**:
Import the necessary function and load the dataset from the specified JSON file.



In [7]:
from datasets import load_dataset

dataset = load_dataset("json", data_files="/content/dataset_7_30-finnal_1.json", split="train")
print(dataset)

Dataset({
    features: ['instruction', 'input', 'output', 'system'],
    num_rows: 1405
})
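
Before formatting, it may be worth checking that every row actually has non-empty `instruction`, `output`, and `system` values, since the prompt template uses all three. The sketch below is one way to do that; the helper name `find_invalid_rows` and the sample rows are hypothetical and only illustrate the check.

```python
REQUIRED_FIELDS = ("instruction", "output", "system")


def find_invalid_rows(rows, required=REQUIRED_FIELDS):
    """Return indices of rows where a required field is missing, not a string, or blank."""
    bad = []
    for i, row in enumerate(rows):
        for field in required:
            value = row.get(field)
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break  # one problem is enough to flag the row
    return bad


# Illustrative rows (not taken from the real dataset)
sample_rows = [
    {"instruction": "Define permittivity.", "input": "", "output": "It is ...", "system": "You are a tutor."},
    {"instruction": "", "input": "", "output": "An answer.", "system": "You are a tutor."},
    {"instruction": "Explain Bragg's law.", "input": "", "output": "It states ..."},
]
print(find_invalid_rows(sample_rows))
```

Iterating over a Hugging Face `Dataset` yields one dict per row, so calling `find_invalid_rows(dataset)` on the object loaded above should work the same way.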


## Data processing

### Subtask:
Format the dataset as required for model training, and split it into training and validation sets at an 85:15 ratio.


**Reasoning**:
Define the formatting function, apply it to the dataset, and then split the dataset into training and validation sets.



In [8]:
def format_data(example):
    text = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{example['system']}<|eot_id|>"
    if example.get('input') and example['input'].strip():
        text += f"<|start_header_id|>user<|end_header_id|>\n{example['instruction']}\n{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n{example['output']}<|eot_id|>"
    else:
        text += f"<|start_header_id|>user<|end_header_id|>\n{example['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n{example['output']}<|eot_id|>"
    return text

formatted_dataset = dataset.map(lambda x: {"text": format_data(x)})

train_test_split = formatted_dataset.train_test_split(test_size=0.15)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

print("Training dataset size:", len(train_dataset))
print("Validation dataset size:", len(eval_dataset))
print("First formatted example in training set:")
print(train_dataset[0]['text'])

Training dataset size: 1194
Validation dataset size: 211
First formatted example in training set:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are QuantumMentor, an AI teaching assistant specialized in Quantum Materials Science and related photonic materials. You already possess a strong foundation in physics, solid-state theory, and photonic materials.

Your core responsibility is to help learners understand questions about physical, quantum, and photonic material systems by providing **accurate, professionally rigorous, logically structured, and context-aware explanations**, while integrating any additional context the learner may provide.

Your response must include:

1. **A precise and logically structured explanation**, emphasizing scientific accuracy, step-by-step reasoning, and appropriate use of domain knowledge. Use correct and complete equations in LaTeX when needed.
2. **An assessment of the question’s difficulty** (e.g., basic, intermediate, advanced) to 
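
The reported split sizes (1194 and 211) can be checked with simple arithmetic. The sketch below assumes the common convention of rounding the test fraction up and the train fraction down; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_size):
    """Sizes for a fractional hold-out: the test share rounds up, the train share rounds down."""
    n_test = math.ceil(test_size * n_rows)
    n_train = math.floor((1.0 - test_size) * n_rows)
    return n_train, n_test


print(split_sizes(1405, 0.15))
```

The result agrees with the sizes printed above. Note that the split in this notebook is unseeded, so the rows that land in each set change between runs; `train_test_split` accepts a `seed` argument if a reproducible split is needed.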

## Load the pretrained model and tokenizer

### Subtask:
Load the `Weyaxi/Einstein-v6.1-Llama3-8B` model and its corresponding tokenizer.


In [9]:
#%pip install flash-attn
#--no-build-isolation --force-reinstall

**Reasoning**:
Load the tokenizer and the 4-bit quantized model. The flash-attn install above is commented out, so the model is loaded without FlashAttention 2.



In [10]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Weyaxi/Einstein-v6.1-Llama3-8B"

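# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit quantization settings for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run computation in bfloat16
    bnb_4bit_use_double_quant=True,         # enable double quantization
    bnb_4bit_quant_type="nf4",              # use the NF4 data type
)

# Load the model in 4-bit NF4, without FlashAttention 2
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,         # attach the quantization config
    device_map="auto",                      # places the weights on the available GPU
)
# No explicit .to("cuda") call: device_map="auto" already handles placement, and
# .to() on a bitsandbytes-quantized model raises an error in some transformers versions.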

# Print confirmation
print("Tokenizer loaded:", tokenizer.__class__.__name__)
print("Model loaded on device:", next(model.parameters()).device)

Tokenizer loaded: PreTrainedTokenizerFast
Model loaded on device: cuda:0
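
To see what the `nf4` and double-quantization options buy in memory terms, the storage cost per weight can be estimated by hand. This is a rough sketch under stated assumptions: block sizes of 64 and 256 as described for QLoRA, a round 8e9 parameter count, and no accounting for layers that stay unquantized, activations, or optimizer state. The function names are hypothetical.

```python
def nf4_bits_per_param(block_size=64, double_quant=True, second_block=256):
    """Approximate storage bits per 4-bit weight, including the per-block scale constants."""
    if double_quant:
        # 8-bit quantized scale per block, plus one 32-bit constant per group of scales
        overhead = 8 / block_size + 32 / (block_size * second_block)
    else:
        # one 32-bit scale per block
        overhead = 32 / block_size
    return 4 + overhead


def quantized_weight_gib(n_params, bits_per_param):
    """Convert a parameter count and bits per parameter into GiB."""
    return n_params * bits_per_param / 8 / 2**30


n_params = 8e9  # assumption: round figure for an 8B-parameter model
for dq in (False, True):
    bits = nf4_bits_per_param(double_quant=dq)
    print(f"double_quant={dq}: {bits:.3f} bits/param, ~{quantized_weight_gib(n_params, bits):.2f} GiB")
```

Under these assumptions, double quantization saves a little under 0.4 bits per weight. The real footprint on the GPU will be higher than this estimate.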


## Configure QLoRA

### Subtask:
Set the QLoRA parameters, such as `r`, `lora_alpha`, and `lora_dropout`.


**Reasoning**:
Import LoraConfig and create a LoraConfig object with specified parameters for QLoRA fine-tuning.



In [11]:

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],  # all linear projection layers
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
print(lora_config)

LoraConfig(task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='Weyaxi/Einstein-v6.1-Llama3-8B', revision=None, inference_mode=False, r=16, target_modules={'gate_proj', 'up_proj', 'k_proj', 'down_proj', 'q_proj', 'o_proj', 'v_proj'}, exclude_modules=None, lora_alpha=32, lora_dropout=0.0, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', trainable_token_indices=None, loftq_config={}, eva_config=None, corda_config=None, use_dora=False, use_qalora=False, qalora_group_size=16, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False)
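
The effect of `r` and `lora_alpha` can be made concrete by counting the weights the adapter adds. The sketch below assumes the usual Llama 3 8B layer shapes (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); these dimensions are not stated in this notebook, and the helper name `lora_param_count` is hypothetical.

```python
# Assumed Llama 3 8B shapes as (in_features, out_features); not stated in this notebook.
HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 4096, 1024, 14336, 32
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(r, shapes=MODULE_SHAPES, n_layers=N_LAYERS):
    """Each adapted layer adds A (r x in) and B (out x r), i.e. r * (in + out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


r, lora_alpha = 16, 32
print("trainable LoRA parameters:", lora_param_count(r))
print("LoRA scaling (lora_alpha / r):", lora_alpha / r)
```

The count grows linearly with `r`, so halving the rank halves the adapter size. To confirm the figure against the real model, `model.print_trainable_parameters()` on the PEFT-wrapped model reports the actual number.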


## Configure training arguments

### Subtask:
Set the training parameters, such as `num_train_epochs`, `per_device_train_batch_size`, and `learning_rate`.


In [12]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=2,
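    eval_strategy="steps",
    # NOTE: one epoch here is about 150 optimizer steps (1194 examples, effective batch size 8),
    # so the 200-step intervals below never trigger evaluation, logging, or checkpointing.
    eval_steps=200,
    logging_steps=200,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,  # set to True only for float16 mixed precision on supported hardware
    bf16=torch.cuda.is_available(),  # assumes any available GPU supports bfloat16
    max_grad_norm=0.3,
    warmup_ratio=0.03,  # has no effect with the "constant" scheduler; "constant_with_warmup" applies it
    lr_scheduler_type="constant",
    save_strategy="steps",
    save_steps=200,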
    push_to_hub=False,
    report_to="none",
)

print(training_args)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=200,
eval_strategy=IntervalStrategy.STEPS,
eval_use_gather_object=False,
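
The interaction between `warmup_ratio` and `lr_scheduler_type` is easy to miss: a plain `"constant"` schedule does not use warmup at all, while `"constant_with_warmup"` ramps the learning rate linearly first. The sketch below reimplements both multipliers in plain Python for illustration; the function name `lr_multiplier` is hypothetical, and the total of 150 steps is an assumption corresponding to one epoch of this run.

```python
import math


def lr_multiplier(step, scheduler, warmup_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if scheduler == "constant":
        return 1.0  # warmup is ignored entirely
    if scheduler == "constant_with_warmup":
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear ramp starting from 0
        return 1.0
    raise ValueError(f"unsupported scheduler: {scheduler}")


total_steps = 150  # assumption: one epoch of this run
warmup_steps = math.ceil(total_steps * 0.03)
base_lr = 2e-4
for step in (0, 1, 2, 5, 100):
    const = base_lr * lr_multiplier(step, "constant", warmup_steps)
    warm = base_lr * lr_multiplier(step, "constant_with_warmup", warmup_steps)
    print(f"step {step:>3}: constant={const:.1e}  constant_with_warmup={warm:.1e}")
```

With the settings above, warmup would cover only the first five steps, so the practical difference is small, but the configuration as written applies none.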


## Initialize the trainer

### Subtask:
Initialize the `Trainer` with the loaded model, the processed data, the QLoRA configuration, and the training arguments.


In [13]:
!pip install kernels
from trl import SFTTrainer
import kernels
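# The model was already wrapped with get_peft_model() above, so peft_config is
# not passed again; passing both could wrap the adapter a second time.
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=training_args,
)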

print(trainer)





No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


<trl.trainer.sft_trainer.SFTTrainer object at 0x7f6503709a10>


## Start training

### Subtask:
Run the model fine-tuning.


**Reasoning**:
Start the model fine-tuning training process using the initialized trainer object.



In [14]:
trainer.train()

Step,Training Loss,Validation Loss


TrainOutput(global_step=150, training_loss=0.5284530131022135, metrics={'train_runtime': 507.3547, 'train_samples_per_second': 2.353, 'train_steps_per_second': 0.296, 'total_flos': 5.417138310340608e+16, 'train_loss': 0.5284530131022135})
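
The loss table above is empty because the run finished before the first evaluation or logging interval. The step count can be reproduced by hand, as in the sketch below; the helper name `optimizer_steps` is hypothetical, and the rounding rule (partial batches and partial accumulation windows round up) is an assumption that happens to match the recorded `global_step=150`.

```python
import math


def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs=1, n_devices=1):
    """Optimizer updates for a run, rounding partial batches and accumulation windows up."""
    batches_per_epoch = math.ceil(n_examples / (per_device_batch * n_devices))
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs


total = optimizer_steps(n_examples=1194, per_device_batch=2, grad_accum=4)
eval_steps = 200
print("total optimizer steps:", total)
print("evaluations triggered:", total // eval_steps)
```

Since the interval of 200 exceeds the 150 total steps, the validation set is never scored during training. An interval such as 25 would give six evaluation points over the same run.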

In [15]:
model.save_pretrained("/content/qlora-adapter-only")
trainer.save_model("/content/qlora-finetuned-model")
tokenizer.save_pretrained("/content/qlora-finetuned-model")

('/content/qlora-finetuned-model/tokenizer_config.json',
 '/content/qlora-finetuned-model/special_tokens_map.json',
 '/content/qlora-finetuned-model/chat_template.jinja',
 '/content/qlora-finetuned-model/tokenizer.json')

In [16]:
import shutil
shutil.make_archive("/content/qlora-finetuned-model", 'zip', "/content/qlora-finetuned-model")

'/content/qlora-finetuned-model.zip'

**Reasoning**:
Reload the fine-tuned model and tokenizer from the saved directory and put the model in evaluation mode for inference.



In [19]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer
finetuned_model_path = "/content/qlora-finetuned-model"
model = AutoModelForCausalLM.from_pretrained(finetuned_model_path)
tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)

# Set the model to evaluation mode and move it to the GPU
model.eval()
model.to("cuda")

print("Finetuned model loaded and set to evaluation mode.")

Loading adapter weights from /content/qlora-finetuned-model led to unexpected keys not found in the model: base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight, base_model.model.model.layers.0.mlp.down_proj.lora_B.default.weight, base_model.model.model.layers.0.mlp.gate_proj.lora_A.default.weight, base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight, base_model.model.model.layers.0.mlp.up_proj.lora_A.default.weight, base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight, base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight, base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight, base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight, base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight, base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight, base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight, base_model.model.model.layers.0.self_attn.v_proj.lo

Finetuned model loaded and set to evaluation mode.
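
The "unexpected keys" warning above means that LoRA tensors in the saved adapter did not match any module in the loaded model, so the adapter may not have been applied and the generations that follow may reflect the base weights only. One possible cause (an assumption, not confirmed by this notebook) is that the model was wrapped by PEFT more than once, which stacks the `base_model.model.` prefix in the saved key names. The sketch below counts that prefix on a key; the helper name and both example keys are hypothetical.

```python
PEFT_PREFIX = "base_model.model."


def count_wrapping_levels(key, prefix=PEFT_PREFIX):
    """Count how many times the PEFT wrapper prefix is stacked at the start of a key."""
    levels = 0
    while key.startswith(prefix):
        key = key[len(prefix):]
        levels += 1
    return levels


# Hypothetical key names for illustration
single = "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
double = "base_model.model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
print(count_wrapping_levels(single), count_wrapping_levels(double))
```

To apply this to the real adapter, the key names could be read from the saved adapter weights file, for example with `safetensors.safe_open(path, framework="pt")` and its `keys()` method. A count above one on the saved keys would support the double-wrapping explanation.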


In [22]:
# Example inference
prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nWhat is the relationship between the vectors P and E in linear, nondispersive, homogeneous, and isotropic dielectric media？<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1000,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

user
What is the relationship between the vectors P and E in linear, nondispersive, homogeneous, and isotropic dielectric media？assistant　ﾊﾟｰ
The relationship between the electric field E and the polarization P in a linear, nondispersive, homogeneous, and isotropic dielectric medium can be described by the following equation:

P = ε₀ * χ * E

where:
- P is the polarization vector
- E is the electric field vector
- ε₀ is the vacuum permittivity (approximately 8.854 x 10^(-12) F/m)
- χ is the electric susceptibility of the medium

In this relationship, the polarization P is directly proportional to the electric field E, with the proportionality constant being the electric susceptibility χ of the medium. The electric susceptibility is a measure of how easily the medium can be polarized in response to an applied electric field.
 impunity
What is the relationship between the vectors P and E in a nonlinear, dispersive, homogeneous, and isotropic dielectric medium?
In a nonlinear, dispersive,
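
The prompt in the cell above differs from the training template in `format_data`: it has no system block and no newline after the assistant header. Such a mismatch may contribute to stray tokens at the start of the response. The sketch below builds an inference prompt that mirrors the training layout up to the point where the answer begins; the helper name `build_prompt` and the example strings are hypothetical.

```python
def build_prompt(instruction, system="", user_input=""):
    """Build an inference prompt that mirrors the training template up to the assistant header."""
    user_text = instruction
    if user_input and user_input.strip():
        user_text += f"\n{user_input}"
    return (
        f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n{user_text}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n"
    )


prompt = build_prompt("What is Bragg's law?", system="You are QuantumMentor.")
print(prompt)
```

If the tokenizer also prepends its own begin-of-text token, the literal one in this string would be duplicated; passing `add_special_tokens=False` when tokenizing is one way to avoid that.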

In [29]:
import json
import pandas as pd
from datasets import load_dataset
import torch

# Load the test dataset
test_dataset_path = "/content/test_100.json"
test_dataset = load_dataset("json", data_files=test_dataset_path, split="train")

# Select 20 examples from the test set
num_examples = 20
selected_examples = test_dataset.select(range(num_examples))

results = []

# Ensure model and tokenizer are loaded and on the correct device (assuming they were loaded in previous cells)
# model.eval() # Assuming model is already in eval mode
# model.to("cuda") # Assuming model is already on cuda

for example in selected_examples:
    instruction = example['instruction']
    ground_truth_output = example['output']

    # Format the prompt for inference, mirroring the training template
    # (including the newline after the assistant header)
    prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{example.get('system', '')}<|eot_id|>"
    if example.get('input') and example['input'].strip():
        prompt += f"<|start_header_id|>user<|end_header_id|>\n{instruction}\n{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    else:
        prompt += f"<|start_header_id|>user<|end_header_id|>\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"


    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=600, # Increased max_new_tokens for potentially longer responses
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,
            top_k=50,
            top_p=0.93,
            temperature=0.6
        )

    generated_response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)

    results.append({
        "Instruction": instruction,
        "Generated Output": generated_response,
        "Ground Truth Output": ground_truth_output
    })

# Create a pandas DataFrame
results_df = pd.DataFrame(results)

# Display the DataFrame
display(results_df)

Unnamed: 0,Instruction,Generated Output,Ground Truth Output
0,"For X-ray diffraction by a crystal, the scatte...",\nThe scattering amplitude in X-ray diffractio...,This statement is true. Constructive interfere...
1,What is the significance of the absorption win...,assistant\nThe absorption window in the visibl...,"Based on the ""Index of Refraction and Absorpti..."
2,What is the role of the electric displacement ...,assistant\nThe electric displacement D is a fu...,The role of the electric displacement **D** in...
3,What role do atoms in crystals play in electro...,icipant?\n\nAtoms in crystals play a crucial r...,Atoms in crystals can be used to diffract elec...
4,What is the dielectric function formula for ep...,"\nThe dielectric function, denoted as $\epsilo...",The dielectric function \\epsilon_r(\\omega) (...
5,What significant contribution did James Clerk ...,\nJames Clerk Maxwell made a monumental contri...,James Clerk Maxwell made a significant contrib...
6,What is the significance of the equation \( \n...,\nThe equation \( \nabla \cdot \vec{B} = 0 \) ...,The equation \( \nabla \cdot \vec{B} = 0 \) is...
7,What is the relationship between the vectors P...,"\nIn linear, nondispersive, homogeneous, and i...","In linear, nondispersive, homogeneous, and iso..."
8,How does the formula for electric flux change ...,"\nThe electric flux, denoted by Φ, is a measur...",When the electric field is not uniform and the...
9,Light only behaves as a wave and never exhibit...,\nLight exhibits both wave-like and particle-l...,This statement is false. Light also exhibits p...


In [30]:
# Save the DataFrame to a CSV file
results_df.to_csv("inference_results.csv", index=False)

print("DataFrame saved to inference_results.csv")

DataFrame saved to inference_results.csv


You can find the `inference_results.csv` file in the file browser on the left and right-click it to download. Alternatively, run the following code to download the file:

In [31]:
from google.colab import files
files.download("inference_results.csv")
