# 🚀 Fine-tune Llama 3 8B with QLoRA for CV-JD Matching

**Purpose:** Fine-tune Llama 3 8B to evaluate and match CVs against job descriptions (JDs).

**Requirements:**
- Google Colab (the free tier with a T4 GPU is sufficient)
- A Hugging Face account with access to Llama 3
- Training data from the preprocessing pipeline

**Estimated time:** 2-4 hours for a full training run

## 1. Setup & Installation

In [None]:
# Check the GPU
!nvidia-smi

In [None]:
%%capture
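# Install dependencies
!pip install -U transformers datasets accelerate peft bitsandbytes trl
!pip install -U huggingface_hub scipy
# flash-attn is left out: FlashAttention-2 needs an Ampere or newer GPU (the T4 is Turing),
# and nothing in this notebook uses it.
# !pip install flash-attn --no-build-isolation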

In [None]:
# Import libraries
import os
import json
import torch
from datetime import datetime

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from datasets import load_dataset, Dataset
from trl import SFTTrainer

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## 2. Hugging Face Login

In [None]:
from huggingface_hub import login

# Enter your Hugging Face token
# Get a token at: https://huggingface.co/settings/tokens
HF_TOKEN = ""  # @param {type:"string"}

if HF_TOKEN:
    login(token=HF_TOKEN)
    print("✅ Logged in to Hugging Face")
else:
    print("⚠️ Please enter your Hugging Face token")
    login()

## 3. Configuration

In [None]:
# ============================================
# CONFIGURATION - ADJUST AS NEEDED
# ============================================

# Model configuration
BASE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # Base model
NEW_MODEL_NAME = "llama3-8b-cv-jd-matcher"  # Name of the fine-tuned model

# Training configuration
EPOCHS = 3
BATCH_SIZE = 2  # Reduce if you run out of memory
GRADIENT_ACCUMULATION = 4  # Effective batch = BATCH_SIZE * GRADIENT_ACCUMULATION
LEARNING_RATE = 2e-4
MAX_SEQ_LENGTH = 2048
WARMUP_RATIO = 0.03

# LoRA configuration
LORA_R = 64  # LoRA rank
LORA_ALPHA = 128  # LoRA alpha (typically 2 * r)
LORA_DROPOUT = 0.05

# QLoRA 4-bit quantization
USE_4BIT = True
BNB_4BIT_COMPUTE_DTYPE = "float16"
BNB_4BIT_QUANT_TYPE = "nf4"
USE_NESTED_QUANT = False

# Output
OUTPUT_DIR = "./results"
LOGGING_STEPS = 25
SAVE_STEPS = 100

print("📋 Configuration:")
print(f"   Base Model: {BASE_MODEL}")
print(f"   Epochs: {EPOCHS}")
print(f"   Effective Batch Size: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"   Learning Rate: {LEARNING_RATE}")
print(f"   LoRA Rank: {LORA_R}")
print(f"   Max Sequence Length: {MAX_SEQ_LENGTH}")

## 4. Upload Training Data

Upload the `train.jsonl` file from your local machine.

In [None]:
from google.colab import files

# Option 1: Upload from your local machine
print("📤 Upload training data (train.jsonl):")
uploaded = files.upload()

# Check the uploaded file
if 'train.jsonl' in uploaded:
    print(f"✅ Uploaded train.jsonl ({len(uploaded['train.jsonl'])/1024:.1f} KB)")
else:
    print("⚠️ Please upload train.jsonl file")

In [None]:
# Option 2: Copy from Google Drive (for large files)
# from google.colab import drive
# drive.mount('/content/drive')
# !cp "/content/drive/MyDrive/path/to/train.jsonl" .

In [None]:
# Load and inspect the dataset
def load_jsonl(filepath):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                data.append(json.loads(line))
    return data

train_data = load_jsonl('train.jsonl')
print(f"📊 Loaded {len(train_data)} training examples")

# Show a sample
print("\n📝 Sample:")
sample = train_data[0]
print(f"   Messages: {len(sample['messages'])}")
print(f"   System: {sample['messages'][0]['content'][:100]}...")
print(f"   Match Score: {sample['metadata']['match_score']}")
## 5. Prepare Dataset

In [None]:
def format_chat_template(example):
    """Format messages into Llama 3 chat template."""
    messages = example['messages']
    
    # Llama 3 chat format
    formatted = ""
    for msg in messages:
        role = msg['role']
        content = msg['content']
        
        if role == 'system':
            formatted += f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{content}<|eot_id|>"
        elif role == 'user':
            formatted += f"<|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>"
        elif role == 'assistant':
            formatted += f"<|start_header_id|>assistant<|end_header_id|>\n\n{content}<|eot_id|>"
    
    return {'text': formatted}

# Convert to Hugging Face Dataset
dataset = Dataset.from_list(train_data)
dataset = dataset.map(format_chat_template)

print(f"✅ Dataset prepared: {len(dataset)} examples")
print(f"   Sample text length: {len(dataset[0]['text'])} chars")
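
The same Llama 3 layout is needed again at inference time, where the prompt must end with an open assistant header so the model writes the answer. The sketch below is a variant of `format_chat_template`, not a replacement for it: `render_llama3` is a hypothetical name, it always emits `<|begin_of_text|>` exactly once (even without a system message), and it does not reproduce every detail of the tokenizer's built-in chat template.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Render chat messages in the Llama 3 header/eot layout (illustrative sketch)."""
    text = "<|begin_of_text|>"
    for msg in messages:
        text += (
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so generation continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

demo_messages = [
    {"role": "system", "content": "Always answer in JSON."},
    {"role": "user", "content": "Compare this CV and JD."},
]
demo_prompt = render_llama3(demo_messages, add_generation_prompt=True)
print(demo_prompt)
```

Building the training text and the test prompt from one function keeps the two formats from drifting apart.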

## 6. Load Model with QLoRA

In [None]:
# 4-bit quantization config
compute_dtype = getattr(torch, BNB_4BIT_COMPUTE_DTYPE)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=USE_4BIT,
    bnb_4bit_quant_type=BNB_4BIT_QUANT_TYPE,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=USE_NESTED_QUANT,
)

print("📥 Loading base model (this may take a few minutes)...")

# Load model
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=compute_dtype,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

print("✅ Model loaded!")
print(f"   Model size: {model.get_memory_footprint() / 1024**3:.2f} GB")
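
A back-of-the-envelope calculation shows why 4-bit loading is what makes an 8B model fit on a T4. The sketch below counts weights only and assumes roughly 8.03 billion parameters; `weight_memory_gib` is a hypothetical helper. The figure printed by `get_memory_footprint()` will typically be higher than the 4-bit estimate, because some tensors (for example embeddings and normalization layers) stay in 16-bit and the quantization constants add overhead.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 8.03e9  # assumed approximate parameter count for Llama 3 8B

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"FP16 weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

The 16-bit estimate leaves almost no room on a 16 GB card for activations, gradients, and optimizer state, while the 4-bit estimate leaves several gigabytes for them.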

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("✅ Tokenizer loaded!")
print(f"   Vocab size: {tokenizer.vocab_size}")

## 7. Configure LoRA

In [None]:
# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
peft_config = LoraConfig(
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    r=LORA_R,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

model = get_peft_model(model, peft_config)

# Print trainable parameters
model.print_trainable_parameters()
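
The number reported by `print_trainable_parameters()` can be checked by hand: each LoRA-adapted linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` trainable weights. The sketch below assumes the layer shapes from the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, 32 layers, key/value projections of width 1024 because of grouped-query attention); treat those shapes as assumptions and compare the result with what the cell above prints.

```python
LORA_RANK = 64

# Assumed Llama 3 8B shapes: (input features, output features) per target module.
HIDDEN, KV_DIM, MLP, NUM_LAYERS = 4096, 1024, 14336, 32
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(rank, shapes, num_layers):
    """Trainable weights added by LoRA: A is (rank x d_in), B is (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

total = lora_params(LORA_RANK, MODULE_SHAPES, NUM_LAYERS)
print(f"Estimated LoRA trainable parameters: {total:,}")
```

Because the count scales linearly with the rank, halving `LORA_R` halves the adapter size, which is one of the cheaper ways to save memory if training runs out of it.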

## 8. Training

In [None]:
# Training arguments
training_arguments = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    optim="paged_adamw_32bit",
    save_steps=SAVE_STEPS,
    logging_steps=LOGGING_STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=WARMUP_RATIO,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="none",  # Disable wandb
)

print("📋 Training Arguments:")
print(f"   Epochs: {EPOCHS}")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Gradient accumulation: {GRADIENT_ACCUMULATION}")
print(f"   Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"   Learning rate: {LEARNING_RATE}")
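
`warmup_ratio` and `lr_scheduler_type="cosine"` together define the learning-rate curve: a linear ramp up to the peak, then a half-cosine decay to zero. The function below is a plain-Python sketch of that shape, not the library's implementation; `lr_at_step` is a hypothetical name, and the 375 total steps and 12 warmup steps are illustrative values only.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=2e-4):
    """Learning rate under linear warmup followed by cosine decay (sketch)."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup steps.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS, WARMUP_STEPS = 375, 12  # illustrative values
for step in (0, 6, 12, 190, 375):
    print(step, f"{lr_at_step(step, TOTAL_STEPS, WARMUP_STEPS):.2e}")
```

Most of the run therefore happens below the nominal `LEARNING_RATE`, which is worth keeping in mind when comparing runs with different epoch counts.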

In [None]:
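# Initialize trainer
# The model is already wrapped by get_peft_model() above, so peft_config is not passed again.
# Note: these keyword arguments follow older TRL releases. Newer releases expect
# dataset_text_field and the sequence-length limit on SFTConfig and take the tokenizer
# as processing_class, so pin the TRL version or adapt the call if it raises a TypeError.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

print("✅ Trainer initialized!")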

In [None]:
# 🚀 START TRAINING
print("="*60)
print("🚀 STARTING TRAINING")
print("="*60)
print(f"   Start time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"   Estimated duration: 2-4 hours")
print("="*60)

# Train!
trainer.train()

print("\n" + "="*60)
print("✅ TRAINING COMPLETE!")
print("="*60)

## 9. Save Model

In [None]:
# Save LoRA adapters
ADAPTER_PATH = f"./adapters/{NEW_MODEL_NAME}"
trainer.model.save_pretrained(ADAPTER_PATH)
tokenizer.save_pretrained(ADAPTER_PATH)

print(f"✅ LoRA adapters saved to: {ADAPTER_PATH}")
!ls -la {ADAPTER_PATH}

In [None]:
# Download adapters to local machine
!zip -r adapters.zip ./adapters
print("📥 Downloading adapters.zip...")
files.download('adapters.zip')

## 10. Test Model

In [None]:
# Test prompt. The CV, JD, and instructions are kept in Vietnamese, the language the prompt was written in.
# CV: software engineer with 3 years of experience who wants to grow in AI/ML;
# skills, education (BSc in Computer Science, 2020), and experience (ML Engineer at FPT Software).
test_cv = """
**Mục tiêu nghề nghiệp:** Kỹ sư phần mềm với 3 năm kinh nghiệm, mong muốn phát triển trong lĩnh vực AI/ML
**Kỹ năng:** Python, TensorFlow, PyTorch, SQL, Docker, Git
**Học vấn:** Cử nhân Khoa học Máy tính - Đại học Bách khoa Hà Nội (2020)
**Kinh nghiệm:** ML Engineer tại FPT Software (2020-2023)
"""

# JD: Senior Machine Learning Engineer; IT degree or equivalent;
# at least 2 years of ML/AI experience; required skills listed below.
test_jd = """
**Vị trí:** Senior Machine Learning Engineer
**Yêu cầu học vấn:** Cử nhân CNTT hoặc tương đương
**Yêu cầu kinh nghiệm:** Ít nhất 2 năm kinh nghiệm ML/AI
**Kỹ năng yêu cầu:** Python, TensorFlow/PyTorch, MLOps, Kubernetes
"""

# System message: "You are an expert at evaluating and matching CVs against job descriptions (JDs).
# Your task is to analyze how well the candidate's CV fits the job requirements. Always answer in JSON."
# User message: "Analyze how well the following CV and JD match ... Evaluate and answer in JSON."
prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Bạn là chuyên gia đánh giá và so khớp CV với Job Description (JD). 
Nhiệm vụ của bạn là phân tích mức độ phù hợp giữa CV của ứng viên và yêu cầu công việc.
Luôn trả lời bằng JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

Phân tích mức độ phù hợp giữa CV và JD sau:

## CV Ứng viên:
{test_cv}

## Yêu cầu Công việc (JD):
{test_jd}

Hãy đánh giá và trả lời bằng JSON.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

print("🧪 Testing model...")

In [None]:
# Generate response
model.eval()  # switch off LoRA dropout for inference
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens. Splitting the full text on "assistant"
# would break whenever the answer itself contains that word.
prompt_length = inputs["input_ids"].shape[1]
assistant_response = tokenizer.decode(
    outputs[0][prompt_length:], skip_special_tokens=True
).strip()

print("\n" + "="*60)
print("📝 MODEL RESPONSE:")
print("="*60)
print(assistant_response)
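
Since the prompt asks for JSON, the next step is usually to parse `assistant_response` rather than just print it. Models sometimes wrap the object in extra text or code fences, so the sketch below scans for the first decodable JSON object. `extract_json` is a hypothetical helper, and the `match_score` key in the demo string is only an illustrative guess at the output schema.

```python
import json

def extract_json(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores the rest.
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None

demo_response = 'Here is the result:\n```json\n{"match_score": 80, "missing": ["MLOps"]}\n```'
parsed = extract_json(demo_response)
print(parsed)
```

In the notebook this would be called as `extract_json(assistant_response)`; a `None` result is a useful signal that the model drifted away from the requested format.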

## 11. (Optional) Merge & Export to GGUF

To use the model with Ollama, merge the LoRA adapters into the base model and export the result to GGUF format.

In [None]:
# Merge LoRA into the base model (needs a lot of RAM)
# Only run this if you have enough RAM (>16 GB)

MERGE_MODEL = False  # Set to True to merge

if MERGE_MODEL:
    print("🔄 Merging LoRA adapters with base model...")
    
    # Reload model in FP16
    base_model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    
    # Merge
    merged_model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
    merged_model = merged_model.merge_and_unload()
    
    # Save merged model
    MERGED_PATH = f"./merged/{NEW_MODEL_NAME}"
    merged_model.save_pretrained(MERGED_PATH)
    tokenizer.save_pretrained(MERGED_PATH)
    
    print(f"✅ Merged model saved to: {MERGED_PATH}")
else:
    print("⏭️ Skipping merge (set MERGE_MODEL=True to enable)")

## 12. Push to Hugging Face Hub (Optional)

In [None]:
PUSH_TO_HUB = False  # Set to True to push
HUB_MODEL_ID = "your-username/llama3-8b-cv-jd-matcher"  # Replace with your username

if PUSH_TO_HUB:
    print(f"📤 Pushing to Hugging Face Hub: {HUB_MODEL_ID}")
    trainer.model.push_to_hub(HUB_MODEL_ID)
    tokenizer.push_to_hub(HUB_MODEL_ID)
    print("✅ Pushed successfully!")
else:
    print("⏭️ Skipping push (set PUSH_TO_HUB=True to enable)")

---

## 🎉 Done!

### Next steps:

1. **Download adapters.zip** to your local machine
2. **Merge with the base model** (if you haven't already)
3. **Convert to GGUF** for use with Ollama
4. **Deploy** into your application

### To convert to GGUF and use with Ollama:

```bash
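# Clone llama.cpp and install the converter's Python dependencies
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Convert the merged Hugging Face model to GGUF
# (recent llama.cpp checkouts ship convert_hf_to_gguf.py; older ones used convert.py)
python convert_hf_to_gguf.py /path/to/merged/model --outfile model.gguf

# Quantize (optional, reduces size). Build llama.cpp first; the tool is
# named llama-quantize in recent checkouts and ./quantize in older ones.
cmake -B build && cmake --build build --config Release
./build/bin/llama-quantize model.gguf model-q4_0.gguf Q4_0

# Create Ollama Modelfile
echo 'FROM ./model-q4_0.gguf' > Modelfile

# Create Ollama model
ollama create cv-jd-matcher -f Modelfile

# Test
ollama run cv-jd-matcher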
```