<a href="https://colab.research.google.com/github/Lingche1/msc1/blob/main/pm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Fine-tune the "Weyaxi/Einstein-v6.1-Llama3-8B" model using LoRA on a small, 429-entry English Alpaca dataset to adapt it for the quantum materials domain.

## Install necessary libraries

### Subtask:
Install `transformers`, `datasets`, `peft`, `trl`, `bitsandbytes`, `accelerate`, and `scipy`.


**Reasoning**:
Install the necessary libraries using pip.



In [1]:
%pip install transformers datasets peft trl accelerate scipy
%pip install -U bitsandbytes
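
If the install cell is interrupted partway (for example by a `KeyboardInterrupt`), some packages may be missing. The following is a minimal sketch for checking what is actually installed, using only the standard library. The names `REQUIRED` and `installed_versions` are illustrative and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the install subtask above.
REQUIRED = ["transformers", "datasets", "peft", "trl",
            "bitsandbytes", "accelerate", "scipy"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

Any package reported as missing can be reinstalled before moving on to the model-loading step.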


## Load the model and tokenizer

### Subtask:
Load the `Weyaxi/Einstein-v6.1-Llama3-8B` model and its corresponding tokenizer.


**Reasoning**:
Import necessary classes and load the tokenizer and model.



In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "Weyaxi/Einstein-v6.1-Llama3-8B"

# Load model in 4-bit precision for Q-LoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

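tokenizer = AutoTokenizer.from_pretrained(model_name)
# Padding to a fixed length during preprocessing needs a pad token; fall back to EOS if none is defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)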
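
To see why 4-bit loading matters here, the weight memory can be estimated with simple arithmetic. This is a rough sketch, not a measurement: it assumes a round figure of 8 billion parameters and ignores quantization constants, layers that stay in higher precision, activations, and optimizer state. The helper name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 8e9  # assumption: round figure for an "8B" model

for label, bits in [("float32", 32), ("float16", 16), ("4-bit NF4", 4)]:
    print(f"{label:<10} ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the float16 footprint, which is the main saving `BitsAndBytesConfig(load_in_4bit=True)` is meant to provide.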

## Load and preprocess the dataset

### Subtask:
Load the small Alpaca dataset and preprocess it for fine-tuning. This involves formatting each entry into a single prompt string and tokenizing it into the input format the model expects.


In [None]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("json", data_files="/content/alpaca-cleaned.json")

# Select the first 429 entries
small_dataset = dataset["train"].select(range(429))

# Define the preprocessing function
def preprocess_function(examples):
    texts = []
    for instruction, input_text, output_text in zip(examples['instruction'], examples['input'], examples['output']):
        if input_text:
            text = f"Instruction: {instruction}\nInput: {input_text}\nOutput: {output_text}"
        else:
            text = f"Instruction: {instruction}\nOutput: {output_text}"
        texts.append(text)
    tokenized_inputs = tokenizer(texts, padding="max_length", truncation=True, max_length=512)
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy() # Create labels for causal language modeling
    return tokenized_inputs

# Apply the preprocessing function
processed_dataset = small_dataset.map(preprocess_function, batched=True)

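# Remove original text columns (only those actually present in the file)
text_columns = [c for c in ['instruction', 'input', 'output', 'system'] if c in processed_dataset.column_names]
processed_dataset = processed_dataset.remove_columns(text_columns)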

# Set the format to torch
processed_dataset.set_format(type='torch')

display(processed_dataset)

# Hold out 15% of the examples for validation using the dataset's built-in
# splitter, which avoids a round trip through pandas
split = processed_dataset.train_test_split(test_size=0.15, seed=42)
train_dataset = split["train"]
val_dataset = split["test"]

display(train_dataset)
display(val_dataset)
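
The prompt template used in `preprocess_function` can be checked in isolation, without a tokenizer. The sketch below restates it as a standalone function and also shows one optional refinement that is not in the original notebook: replacing the labels at padded positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the loss. The names `build_prompt` and `mask_padding` are illustrative.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def build_prompt(instruction, input_text, output_text):
    """Format one Alpaca-style record with the template used in this notebook."""
    if input_text:
        return f"Instruction: {instruction}\nInput: {input_text}\nOutput: {output_text}"
    return f"Instruction: {instruction}\nOutput: {output_text}"

def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

print(build_prompt("Define a band gap.", "", "The energy range with no electron states."))
print(mask_padding([101, 102, 103, 0, 0], [1, 1, 1, 0, 0]))
```

Whether masking is needed depends on the data collator the trainer uses, so treat it as something to verify rather than a required change.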

## Configure LoRA

### Subtask:
Set up the LoRA configuration for efficient fine-tuning. This involves specifying the LoRA parameters like `r`, `lora_alpha`, and `lora_dropout`.


**Reasoning**:
Import the LoraConfig class and instantiate it with the specified parameters.



In [None]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=64, # Higher rank for potentially better performance with Q-LoRA
    lora_alpha=16, # Adapter updates are scaled by lora_alpha / r = 0.25
    lora_dropout=0.1, # Dropout on the LoRA layers for regularization
    bias="none",
    task_type="CAUSAL_LM",
)

print(lora_config)
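
The effect of `r` and `lora_alpha` can be made concrete with a short calculation. This is a sketch under stated assumptions: it uses the commonly published Llama-3-8B shapes (hidden size 4096, key/value projection width 1024, 32 layers) and assumes adapters are attached only to `q_proj` and `v_proj`, which is what PEFT is expected to pick for Llama models when `target_modules` is not set. The number printed by `print_trainable_parameters()` on the real model is the authoritative one.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

R, ALPHA = 64, 16  # values from the LoraConfig above

# Assumed Llama-3-8B shapes (not read from the model in this sketch)
HIDDEN, KV_DIM, NUM_LAYERS = 4096, 1024, 32

per_layer = (lora_param_count(HIDDEN, HIDDEN, R)    # q_proj: 4096 -> 4096
             + lora_param_count(HIDDEN, KV_DIM, R))  # v_proj: 4096 -> 1024
total = per_layer * NUM_LAYERS

print(f"scaling = lora_alpha / r = {ALPHA / R}")
print(f"estimated trainable LoRA parameters = {total:,}")
```

Under these assumptions the adapters add roughly 27 million trainable parameters, a small fraction of the base model.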

## Configure training parameters

### Subtask:
Set up the training arguments, including hyperparameters like learning rate, batch size, number of epochs, and optimization strategy.


**Reasoning**:
Import the `TrainingArguments` class and instantiate it with the specified parameters. Then print the object to verify the configuration.



In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    optim="adamw_torch",
    logging_dir="./logs",
    logging_steps=10,
)

print(training_args)
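
For a dataset this small, it helps to know how many optimizer steps the settings above produce. A minimal sketch, assuming a single GPU, no gradient accumulation, and that the last partial batch is kept; the helper name `optimizer_steps` is illustrative.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, num_epochs):
    """Total optimizer steps on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
    return steps_per_epoch * num_epochs

full_set = 429                                   # all entries
train_only = full_set - math.ceil(0.15 * full_set)  # after holding out 15%

print("steps on all 429 entries:", optimizer_steps(full_set, 4, 3))
print("steps on the training split:", optimizer_steps(train_only, 4, 3))
```

With `logging_steps=10`, either case yields only a few dozen logged loss values, so the loss curve will be coarse.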

## Prepare the model for LoRA training

### Subtask:
Wrap the base model with the LoRA layers.


**Reasoning**:
Wrap the base model with the LoRA layers and print the number of trainable parameters.



In [None]:
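from peft import get_peft_model

# Let gradients flow through the frozen input embeddings and disable the generation cache for training
model.enable_input_require_grads()
model.config.use_cache = False

# Keep the wrapped model under the name used by the trainer below
peft_model = get_peft_model(model, lora_config)

# The 4-bit model is placed on the GPU when it is loaded, so no explicit
# .to(device) call is needed (it can raise an error on quantized models)

# print_trainable_parameters() prints its summary itself and returns None
peft_model.print_trainable_parameters()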

## Summary:

### Q&A
**Can the "Weyaxi/Einstein-v6.1-Llama3-8B" model be fine-tuned on a small Alpaca dataset with LoRA given the available computational resources?**

No, the "Weyaxi/Einstein-v6.1-Llama3-8B" model could not be fine-tuned on the provided dataset with the available computational resources. The training process repeatedly failed due to `OutOfMemoryError` on the GPU, even after implementing several memory-saving techniques.

### Data Analysis Key Findings
* The training process failed with an `OutOfMemoryError`, indicating that the "Weyaxi/Einstein-v6.1-Llama3-8B" model is too large for the available GPU memory (approximately 40GB).
* Several memory optimization techniques were attempted, including:
  * Reducing the `per_device_train_batch_size` to 1.
  * Enabling `gradient_checkpointing`.
  * Reducing the `max_length` of the tokenized sequences to 256.
* Despite these optimizations, the `OutOfMemoryError` persisted, suggesting that the memory issue is not solvable with these techniques alone on the given hardware.

### Insights or Next Steps
* To successfully fine-tune this model, it is necessary to use a more powerful GPU with a larger memory capacity.
* An alternative approach could be to use a smaller, less memory-intensive base model for fine-tuning.


## Set up and run the SFTTrainer

### Subtask:
Set up the `SFTTrainer` and start the training process.

**Reasoning**:
Import the `SFTTrainer` class and instantiate it with the LoRA-wrapped model, the datasets, and the training arguments. Then, call the `train()` method to start the fine-tuning.

In [None]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=peft_model,
    train_dataset=train_dataset,  # train on the training split only, so validation examples stay unseen
    eval_dataset=val_dataset,
    args=training_args,
)

# Start training
trainer.train()

## Save the fine-tuned model

### Subtask:
Save the fine-tuned model to a specified directory.

**Reasoning**:
Save the fine-tuned model using the `save_model` method of the trainer.

In [None]:
# Save the fine-tuned model
trainer.save_model("./fine_tuned_model")

## Load and evaluate the fine-tuned model

### Subtask:
Load the fine-tuned model and perform manual evaluation by generating text.

**Reasoning**:
Load the base model and the fine-tuned LoRA adapters. Merge the adapters with the base model and then use the merged model to generate text for manual evaluation.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)

# Load the LoRA adapters
model = PeftModel.from_pretrained(base_model, "./fine_tuned_model")

# Merge the LoRA adapters with the base model
merged_model = model.merge_and_unload()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Function to generate text
def generate_text(prompt, model, tokenizer, max_length=200):
    # Move the inputs to the model's device and pass the attention mask along with the input IDs
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=max_length, num_return_sequences=1, pad_token_id=tokenizer.pad_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example prompt for manual evaluation
prompt = "Instruction: What is the concept of a 'reciprocal lattice' and how is it related to the 'Brillouin zone' in solid-state physics?\nOutput:"
generated_text = generate_text(prompt, merged_model, tokenizer)

print("Generated Text:")
print(generated_text)
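
Because a decoder-only model returns the prompt followed by its continuation, the decoded string starts with the original instruction. For manual evaluation it can be convenient to look at the answer alone. A minimal sketch, assuming the `Output:` marker from the prompt template; the helper name `extract_completion` is illustrative.

```python
def extract_completion(generated_text, marker="Output:"):
    """Return the text after the first marker, or the whole string if the marker is absent."""
    _, found, completion = generated_text.partition(marker)
    return completion.strip() if found else generated_text.strip()

sample = "Instruction: What is a Brillouin zone?\nOutput: It is a primitive cell of the reciprocal lattice."
print(extract_completion(sample))
```

In the notebook this would be applied as `extract_completion(generated_text)` after the generation cell has run.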