<a href="https://colab.research.google.com/github/your-username/your-repo/blob/main/Qwen2_5_VL_Cardboard_QC_Finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Qwen2.5-VL Fine-tuning for Cardboard Quality Control 📦🔍

<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
</div>

Fine-tune Qwen2.5-VL (7B) for automated cardboard bundle quality control using your custom dataset.

- **Dataset**: `Cong2612/cardboard-qc-dataset` (168 samples)
- **Task**: Binary classification: Pass (flat) vs. Fail (warped)
- **Model**: Vision-language model for industrial quality control

---

## 🔧 Installation & Setup

Install Unsloth and required dependencies. This will take a few minutes.
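
The Colab branch of the install cell picks an xformers build from the installed PyTorch version. As a small illustration of that logic, the sketch below applies the same regular expression to two example version strings. The helper names `base_version` and `pick_xformers` and the sample version strings are assumptions made for this example; the version pins are the ones used in the cell.

```python
import re

def base_version(version_string):
    """Return the leading numeric part of a version string (drops suffixes such as '+cu126')."""
    match = re.match(r"[0-9\.]{3,}", version_string)
    return match.group(0) if match else None

def pick_xformers(torch_version):
    """Mirror the install cell: one pin for PyTorch 2.8.0, another for every other version."""
    v = base_version(torch_version)
    return "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")

# Example version strings (illustrative, not read from a real environment)
for example in ["2.8.0+cu126", "2.6.0+cu124"]:
    print(example, "->", base_version(example), "->", pick_xformers(example))
```

Any PyTorch version other than 2.8.0 falls through to the older pin, so it may be worth checking the printed version if the install fails.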

In [None]:
%%capture
import os, re
import torch

# Check if we're in Colab
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Colab-specific installation
    v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
    
# Additional packages for evaluation
!pip install scikit-learn

print("✅ Installation complete!")

## 🚀 Model Setup

Load Qwen2.5-VL-7B with 4-bit quantization and add LoRA adapters for efficient fine-tuning.
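
To give a feel for why LoRA is parameter-efficient, the sketch below counts the trainable parameters that a rank-16 adapter adds to a single linear layer. The layer dimensions are hypothetical and chosen only for illustration; they are not read from the Qwen2.5-VL configuration.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix the adapter is attached to."""
    return d_in * d_out

# Hypothetical layer size chosen for illustration only
d_in, d_out, r = 4096, 4096, 16

added = lora_added_params(d_in, d_out, r)
frozen = dense_params(d_in, d_out)
print(f"LoRA adds {added:,} trainable parameters next to {frozen:,} frozen ones "
      f"({added / frozen:.2%} of the layer)")
```

Doubling `r` doubles the added parameters, which is the trade-off behind the rank setting in the adapter configuration.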

In [None]:
from unsloth import FastVisionModel
import torch

print("🔧 Setting up Qwen2.5-VL for Cardboard Quality Control Fine-tuning")
print("=" * 60)

# Load the base model with 4-bit quantization
print("📥 Loading Qwen2.5-VL-7B model...")
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
    load_in_4bit=True,  # 4-bit quantization for memory efficiency
    use_gradient_checkpointing="unsloth",  # Memory optimization
)

print("✅ Base model loaded successfully!")

In [None]:
# Add LoRA adapters for parameter-efficient fine-tuning
print("🔧 Adding LoRA adapters...")
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,      # Fine-tune vision layers for cardboard images
    finetune_language_layers=True,    # Fine-tune language for QC assessment
    finetune_attention_modules=True,  # Fine-tune attention layers
    finetune_mlp_modules=True,        # Fine-tune MLP layers

    r=16,             # LoRA rank - balance between accuracy and efficiency
    lora_alpha=16,    # LoRA alpha - recommended to match r
    lora_dropout=0,   # No dropout for stable training
    bias="none",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

print("✅ LoRA adapters added successfully!")
print("📊 Model ready for fine-tuning on cardboard QC task")

## 📊 Dataset Loading & Preparation

Load the cardboard QC dataset and convert it to the format required for vision-language training.
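
Before converting anything, it can help to confirm that each row has the fields the rest of the notebook reads. The following is a minimal sketch of such a check. The field names and the `Pass`/`Fail` labels are the ones the notebook uses; the function name `find_problems` and the two toy rows are assumptions made for this example.

```python
REQUIRED_FIELDS = ("filename", "label", "reason", "image", "conversations")
VALID_LABELS = {"Pass", "Fail"}

def find_problems(sample):
    """Return a list of human-readable problems found in one dataset row."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in sample]
    if "label" in sample and sample["label"] not in VALID_LABELS:
        problems.append(f"unexpected label: {sample['label']!r}")
    for i, msg in enumerate(sample.get("conversations", [])):
        if "role" not in msg or "content" not in msg:
            problems.append(f"conversation message {i} lacks 'role' or 'content'")
    return problems

# Toy rows for illustration; a string stands in for the PIL image
good_row = {
    "filename": "a.jpg", "label": "Pass", "reason": "flat", "image": "<image>",
    "conversations": [{"role": "user", "content": "Is it flat?"},
                      {"role": "assistant", "content": "Pass"}],
}
bad_row = {"filename": "b.jpg", "label": "Maybe", "conversations": [{"role": "user"}]}

print(find_problems(good_row))
print(find_problems(bad_row))
```

Once the dataset is loaded, the same function could be applied to a real row, for example `find_problems(dataset['train'][0])`.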

In [None]:
from datasets import load_dataset

print("📊 Loading Cardboard Quality Control Dataset...")

# Load your uploaded dataset from Hugging Face Hub
dataset = load_dataset("Cong2612/cardboard-qc-dataset")

print(f"✅ Dataset loaded successfully!")
print(f"   Train samples: {len(dataset['train'])}")
print(f"   Validation samples: {len(dataset['validation'])}")
print(f"   Test samples: {len(dataset['test'])}")

# Display example data
example = dataset['train'][0]
print(f"\n📝 Example data point:")
print(f"   Filename: {example['filename']}")
print(f"   Label: {example['label']}")
print(f"   Reason: {example['reason']}")
print(f"   Image size: {example['image'].size}")
print(f"   Conversations: {len(example['conversations'])} messages")

# Show the conversation structure
print(f"\n💬 Conversation example:")
for i, msg in enumerate(example['conversations']):
    print(f"   {i+1}. {msg['role']}: {msg['content'][:80]}...")

In [None]:
# Display sample images
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('Sample Cardboard Bundle Images', fontsize=16)

for i in range(3):
    sample = dataset['train'][i]
    axes[i].imshow(sample['image'])
    axes[i].set_title(f"{sample['label']}\n{sample['filename'][:20]}...", fontsize=10)
    axes[i].axis('off')

plt.tight_layout()
plt.show()

# Show label distribution (read only the label column so the images are not decoded)
train_labels = list(dataset['train']['label'])
pass_count = train_labels.count('Pass')
fail_count = train_labels.count('Fail')

print(f"\n📈 Training Set Label Distribution:")
print(f"   Pass (flat bundles): {pass_count} ({pass_count/len(train_labels)*100:.1f}%)")
print(f"   Fail (warped bundles): {fail_count} ({fail_count/len(train_labels)*100:.1f}%)")

## 🔄 Data Format Conversion

Convert the dataset from your conversation format to the Unsloth vision training format.
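
The notebook's conversion function attaches the image to every user message. If a conversation ever contains more than one user turn, that would repeat the image in each of them. The sketch below is a variant, not the notebook's own method, that attaches the image only to the first user turn. The function name `to_messages` and the toy conversation are assumptions, and a plain string stands in for the PIL image.

```python
def to_messages(conversations, image):
    """Wrap role/content pairs in the list-of-parts format, attaching the image once."""
    messages = []
    image_attached = False
    for msg in conversations:
        content = [{"type": "text", "text": msg["content"]}]
        if msg["role"] == "user" and not image_attached:
            content.append({"type": "image", "image": image})
            image_attached = True
        messages.append({"role": msg["role"], "content": content})
    return {"messages": messages}

# Toy two-turn conversation for illustration
toy_conversations = [
    {"role": "user", "content": "Is this bundle flat or warped?"},
    {"role": "assistant", "content": "Fail: the bundle is warped."},
    {"role": "user", "content": "Which side is affected?"},
    {"role": "assistant", "content": "The left edge."},
]
toy_image = "<PIL image placeholder>"

result = to_messages(toy_conversations, toy_image)
for message in result["messages"]:
    print(message["role"], [part["type"] for part in message["content"]])
```

For single-turn conversations this variant produces the same structure as the function in the next cell.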

In [None]:
def convert_cardboard_to_conversation(sample):
    """
    Convert one cardboard QC sample to the Unsloth conversation format.
    The dataset's conversations are already structured as role/content pairs,
    so only the content needs wrapping and the image needs attaching.
    """
    # Extract conversations from your dataset format
    conversations = sample["conversations"]
    
    # Convert to Unsloth format
    messages = []
    for msg in conversations:
        if msg["role"] == "user":
            # User message with text + image
            messages.append({
                "role": "user",
                "content": [
                    {"type": "text", "text": msg["content"]},
                    {"type": "image", "image": sample["image"]}
                ]
            })
        else:  # assistant
            # Assistant response with text only
            messages.append({
                "role": "assistant",
                "content": [
                    {"type": "text", "text": msg["content"]}
                ]
            })
    
    return {"messages": messages}

# Convert the training dataset
print("🔄 Converting dataset to training format...")
train_dataset = dataset['train']
converted_train = [convert_cardboard_to_conversation(sample) for sample in train_dataset]

# Also convert validation set for potential evaluation
val_dataset = dataset['validation']
converted_val = [convert_cardboard_to_conversation(sample) for sample in val_dataset]

print(f"✅ Dataset conversion complete!")
print(f"   Converted train samples: {len(converted_train)}")
print(f"   Converted validation samples: {len(converted_val)}")

# Display converted example
print(f"\n📋 Converted conversation structure:")
example_conv = converted_train[0]['messages']
for i, msg in enumerate(example_conv):
    print(f"   {i+1}. Role: {msg['role']}")
    print(f"      Content items: {len(msg['content'])}")
    for j, content in enumerate(msg['content']):
        if content['type'] == 'text':
            print(f"         Text: {content['text'][:60]}...")
        else:
            print(f"         Image: {content['type']}")

## 🧪 Pre-Training Model Test

Let's test the model before fine-tuning to see its baseline response on one cardboard image.

In [None]:
print("🧪 Testing model before fine-tuning...")
FastVisionModel.for_inference(model)  # Switch to inference mode

# Test with a sample from the dataset
test_sample = train_dataset[5]  # Pick sample 5
test_image = test_sample["image"]
test_instruction = "Analyze this cardboard bundle image. Is the bundle flat or warped? Provide your assessment."

print(f"🔍 Testing on: {test_sample['filename']}")
print(f"📝 True label: {test_sample['label']} - {test_sample['reason']}")

# Prepare input
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": test_instruction}
    ]}
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    test_image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

print("\n🤖 Model response BEFORE training:")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs, 
    streamer=text_streamer, 
    max_new_tokens=128,
    use_cache=True, 
    temperature=1.0, 
    min_p=0.1
)

# Display the test image
plt.figure(figsize=(6, 6))
plt.imshow(test_image)
plt.title(f"Test Image: {test_sample['label']} - {test_sample['filename'][:30]}...")
plt.axis('off')
plt.show()

## 🏋️ Training Setup

Configure the trainer. With a per-device batch size of 2 and 4 gradient-accumulation steps, each optimizer update sees 8 samples, and training runs for 3 epochs.
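
A quick back-of-the-envelope calculation shows how many optimizer steps the settings in the next cell produce. The sketch assumes the 117 training samples reported in the summary and rounds partial batches up; the exact count can differ by a few steps depending on the `transformers` version. The function name `optimizer_steps` is an assumption made for this example.

```python
import math

def optimizer_steps(num_samples, per_device_batch, grad_accum, epochs):
    """Estimate the total optimizer updates for a single-GPU run, rounding partial batches up."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

# 117 samples is taken from the notebook's summary; the other values match the trainer settings
total_steps = optimizer_steps(num_samples=117, per_device_batch=2, grad_accum=4, epochs=3)
warmup_steps, save_steps = 10, 50

print(f"Estimated optimizer steps: {total_steps}")
print(f"Warm-up share of training: {warmup_steps / total_steps:.0%}")
print(f"Reaches the first save interval: {total_steps >= save_steps}")
```

Under these assumptions the run is about 45 steps, so the 10 warm-up steps cover roughly a fifth of training and the run ends before the first `save_steps=50` interval is reached. Lowering `save_steps` may be worthwhile if intermediate checkpoints are wanted.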

In [None]:
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

print("🏋️ Setting up training configuration...")

# Switch model to training mode
FastVisionModel.for_training(model)

# Training configuration for the cardboard QC dataset
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),  # Essential for vision training
    train_dataset=converted_train,
    args=SFTConfig(
        per_device_train_batch_size=2,        # Adjust based on GPU memory
        gradient_accumulation_steps=4,         # Effective batch size = 2 * 4 = 8
        warmup_steps=10,                       # Warm-up steps for stable training
        num_train_epochs=3,                    # Full training epochs
        # max_steps=60,                        # Use this for quick testing instead
        learning_rate=2e-4,                    # Learning rate for LoRA
        logging_steps=5,                       # Log every 5 steps
        optim="adamw_8bit",                    # 8-bit optimizer for memory efficiency
        weight_decay=0.01,                     # Regularization
        lr_scheduler_type="linear",            # Learning rate schedule
        seed=3407,                             # For reproducibility
        output_dir="outputs",                  # Checkpoint directory
        report_to="none",                      # Change to "wandb" if using W&B
        save_steps=50,                         # Save checkpoint every 50 steps
        save_total_limit=2,                    # Keep only 2 checkpoints
        
        # Essential settings for vision fine-tuning
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_length=2048,
    ),
)

print("✅ Trainer configured successfully!")

# Show memory statistics before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"\n💾 Memory Statistics:")
print(f"   GPU: {gpu_stats.name}")
print(f"   Max memory: {max_memory} GB")
print(f"   Reserved memory: {start_gpu_memory} GB")
print(f"   Available memory: {max_memory - start_gpu_memory:.1f} GB")

## 🚀 Model Training

Start fine-tuning! This will take approximately 15-30 minutes depending on your GPU.

In [None]:
print("🚀 Starting fine-tuning...")
print("⏰ This will take some time. Grab a coffee! ☕")
print("=" * 60)

# Start training
trainer_stats = trainer.train()

print("\n🎉 Training Complete!")
print("=" * 60)

In [None]:
# Show training statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"📈 Training Statistics:")
print(f"   Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"   Training time: {trainer_stats.metrics['train_runtime']/60:.2f} minutes")
print(f"   Peak memory: {used_memory} GB")
print(f"   Peak memory for training: {used_memory_for_lora} GB")
print(f"   Peak memory %: {used_percentage}%")
print(f"   Training memory %: {lora_percentage}%")

# Show training loss if available
if 'train_loss' in trainer_stats.metrics:
    print(f"   Final training loss: {trainer_stats.metrics['train_loss']:.4f}")

## 🧪 Post-Training Model Test

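Let's test the fine-tuned model on the same sample used before training. Because that sample comes from the training set, this shows whether training took effect rather than how well the model generalizes.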

In [None]:
print("🧪 Testing fine-tuned model...")
FastVisionModel.for_inference(model)  # Switch to inference mode

# Test with the same sample as before
inputs = tokenizer(
    test_image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

print(f"🔍 Testing on same image: {test_sample['filename']}")
print(f"📝 True label: {test_sample['label']} - {test_sample['reason']}")
print("\n🤖 Model response AFTER fine-tuning:")
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs, 
    streamer=text_streamer, 
    max_new_tokens=128,
    use_cache=True, 
    temperature=0.7,  # Slightly lower for more focused responses
    min_p=0.1
)

In [None]:
# Test on a few more samples (also from the training set, so this is a sanity check)
print("\n🎯 Testing on multiple samples...")
test_indices = [0, 10, 20]  # Test different samples

for i, idx in enumerate(test_indices):
    sample = train_dataset[idx]
    
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": test_instruction}
        ]}
    ]
    
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    inputs = tokenizer(
        sample["image"],
        input_text,
        add_special_tokens=False,
        return_tensors="pt",
    ).to("cuda")
    
    print(f"\n📝 Sample {i+1}: {sample['filename'][:30]}...")
    print(f"   True: {sample['label']} - {sample['reason']}")
    print(f"   Response: ", end="")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs, 
            max_new_tokens=80,
            use_cache=True, 
            temperature=0.7, 
            min_p=0.1,
            do_sample=True
        )
    
    # Decode only the newly generated tokens (everything after the prompt)
    prompt_tokens = inputs['input_ids'].shape[1]
    generated_text = tokenizer.decode(outputs[0][prompt_tokens:], skip_special_tokens=True).strip()
    print(generated_text[:100] + "..." if len(generated_text) > 100 else generated_text)

## 📊 Model Evaluation

Evaluate the fine-tuned model on the first 10 validation samples for a quick performance estimate.
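
The metrics cell further down drops any response that could not be parsed as Pass or Fail before computing accuracy. The sketch below contrasts that with a stricter variant that counts an unparsed response as an error. Both function names and the toy lists are assumptions made for this example, and the numbers are not results from the model.

```python
def accuracy_excluding_unknown(predictions, labels):
    """Accuracy over only those samples whose prediction parsed as Pass or Fail."""
    pairs = [(p, t) for p, t in zip(predictions, labels) if p in ("Pass", "Fail")]
    if not pairs:
        return None
    return sum(p == t for p, t in pairs) / len(pairs)

def accuracy_counting_unknown_as_wrong(predictions, labels):
    """Accuracy over all samples; an 'Unknown' prediction never matches a label."""
    if not labels:
        return None
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Made-up toy data for illustration only
toy_predictions = ["Pass", "Fail", "Unknown", "Pass"]
toy_labels = ["Pass", "Fail", "Fail", "Fail"]

lenient = accuracy_excluding_unknown(toy_predictions, toy_labels)
strict = accuracy_counting_unknown_as_wrong(toy_predictions, toy_labels)
print(f"Excluding unknowns: {lenient:.3f}")
print(f"Counting unknowns as wrong: {strict:.3f}")
```

Reporting both figures, or the number of unparsed responses alongside the accuracy, keeps a model that often answers ambiguously from looking better than it is.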

In [None]:
import re
from sklearn.metrics import accuracy_score, classification_report

def extract_prediction(response_text):
    """Extract Pass/Fail prediction from model response."""
    response_lower = response_text.lower()
    
    # Look for clear indicators
    if 'pass' in response_lower and 'fail' not in response_lower:
        return "Pass"
    elif 'fail' in response_lower and 'pass' not in response_lower:
        return "Fail"
    elif 'flat' in response_lower and 'warp' not in response_lower:
        return "Pass"
    elif 'warp' in response_lower and 'flat' not in response_lower:
        return "Fail"
    else:
        # Check for more subtle indicators
        pass_count = sum(1 for word in ['good', 'acceptable', 'appears flat', 'looks flat'] 
                        if word in response_lower)
        fail_count = sum(1 for word in ['bad', 'unacceptable', 'appears warped', 'looks warped'] 
                        if word in response_lower)
        
        if pass_count > fail_count:
            return "Pass"
        elif fail_count > pass_count:
            return "Fail"
        else:
            return "Unknown"

def evaluate_on_validation():
    """Evaluate model on validation set."""
    print("📊 Evaluating on validation set...")
    
    predictions = []
    true_labels = []
    
    val_samples = min(10, len(val_dataset))  # Limit to 10 for speed in Colab
    
    for i in range(val_samples):
        sample = val_dataset[i]
        
        # Prepare input
        messages = [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": test_instruction}
            ]}
        ]
        
        input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        inputs = tokenizer(
            sample["image"],
            input_text,
            add_special_tokens=False,
            return_tensors="pt",
        ).to("cuda")
        
        # Generate response with greedy decoding so repeated runs give the same predictions
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=100,
                use_cache=True,
                do_sample=False,
            )
        
        # Decode only the newly generated tokens (everything after the prompt)
        prompt_tokens = inputs['input_ids'].shape[1]
        generated_text = tokenizer.decode(outputs[0][prompt_tokens:], skip_special_tokens=True).strip()
        
        # Extract prediction
        predicted_label = extract_prediction(generated_text)
        true_label = sample['label']
        
        predictions.append(predicted_label)
        true_labels.append(true_label)
        
        print(f"   {i+1:2d}. {sample['filename'][:25]:25} | True: {true_label:4} | Pred: {predicted_label:4} | {'✓' if predicted_label == true_label else '✗'}")
    
    return predictions, true_labels

# Run evaluation
predictions, true_labels = evaluate_on_validation()

In [None]:
# Calculate metrics
valid_indices = [i for i, pred in enumerate(predictions) if pred in ["Pass", "Fail"]]
valid_predictions = [predictions[i] for i in valid_indices]
valid_true_labels = [true_labels[i] for i in valid_indices]

if valid_predictions:
    accuracy = accuracy_score(valid_true_labels, valid_predictions)
    
    print(f"\n📈 Evaluation Results:")
    print(f"   Samples with a Pass/Fail prediction: {len(valid_predictions)}/{len(predictions)}")
    print(f"   Accuracy (excluding 'Unknown' responses): {accuracy:.3f} ({accuracy*100:.1f}%)")
    
    print(f"\n📊 Classification Report:")
    print(classification_report(valid_true_labels, valid_predictions))
    
    # Count correct predictions
    correct = sum(1 for p, t in zip(valid_predictions, valid_true_labels) if p == t)
    print(f"\n✅ Correct predictions: {correct}/{len(valid_predictions)}")
    
else:
    print("❌ No valid predictions found. Model may need more training.")

## 💾 Save the Fine-tuned Model

Save the LoRA adapters for later use or deployment.
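
After saving, a quick check that the directory holds the expected files can catch an incomplete save before the zip is downloaded. The sketch below assumes the file names PEFT typically writes (`adapter_config.json` plus `adapter_model.safetensors` or `adapter_model.bin`); the notebook itself does not state them, and the function name `missing_adapter_files` is hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed weight file names; either one counts as present
WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")

def missing_adapter_files(directory):
    """Return the names of expected LoRA adapter files that are absent from `directory`."""
    folder = Path(directory)
    missing = []
    if not (folder / "adapter_config.json").is_file():
        missing.append("adapter_config.json")
    if not any((folder / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing

# Demo on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    empty_result = missing_adapter_files(tmp)
    (Path(tmp) / "adapter_config.json").write_text("{}")
    (Path(tmp) / "adapter_model.safetensors").write_bytes(b"")
    complete_result = missing_adapter_files(tmp)

print("Empty directory is missing:", empty_result)
print("After adding placeholders:", complete_result)
```

In the notebook the check would be `missing_adapter_files("cardboard_qc_lora")` after the save cell has run.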

In [None]:
print("💾 Saving fine-tuned model...")

# Save LoRA adapters locally
model.save_pretrained("cardboard_qc_lora")
tokenizer.save_pretrained("cardboard_qc_lora")

print("✅ LoRA adapters saved to 'cardboard_qc_lora' directory")

# Check what was saved
import os
saved_files = os.listdir("cardboard_qc_lora")
print(f"📁 Saved files: {saved_files}")

# Optional: Zip the model for download
!zip -r cardboard_qc_lora.zip cardboard_qc_lora/
print("📦 Model zipped as 'cardboard_qc_lora.zip' for download")

## 📤 Optional: Upload to Hugging Face Hub

Upload your fine-tuned model to Hugging Face Hub for easy sharing and deployment.
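
Pasting an access token into a notebook makes it easy to leak when the notebook is shared. One alternative, sketched below, is to read the token from the `HF_TOKEN` environment variable and fail early if it is missing. The helper name `resolve_hf_token` is an assumption made for this example.

```python
import os

PLACEHOLDER = "your_huggingface_token_here"

def resolve_hf_token(env_var="HF_TOKEN"):
    """Read an access token from the environment, rejecting empty or placeholder values."""
    token = os.environ.get(env_var, "").strip()
    if not token or token == PLACEHOLDER:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a Hugging Face access token."
        )
    return token

# Report the outcome without printing the token itself
try:
    token = resolve_hf_token()
    print("Token found, length:", len(token))
except RuntimeError as err:
    print(err)
```

The returned value could then be passed as `token=` to the `push_to_hub` calls in the upload cell.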

In [None]:
# Set to True if you want to upload to HuggingFace Hub
UPLOAD_TO_HUB = False  # Change to True if you want to upload

if UPLOAD_TO_HUB:
    # Configure these settings
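    # Prefer setting the HF_TOKEN environment variable over pasting a token into the notebook.
    # Tokens are created at https://huggingface.co/settings/tokens
    HF_TOKEN = os.environ.get("HF_TOKEN", "your_huggingface_token_here")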
    MODEL_NAME = "your-username/qwen2.5-vl-cardboard-qc-lora"  # Change to your desired name
    
    print(f"📤 Uploading model to Hugging Face Hub: {MODEL_NAME}")
    
    try:
        model.push_to_hub(
            MODEL_NAME,
            token=HF_TOKEN,
            private=False  # Set to True for private repository
        )
        tokenizer.push_to_hub(
            MODEL_NAME,
            token=HF_TOKEN,
            private=False
        )
        
        print(f"✅ Model uploaded successfully!")
        print(f"🔗 Model URL: https://huggingface.co/{MODEL_NAME}")
        
    except Exception as e:
        print(f"❌ Upload failed: {e}")
        print("Make sure to set your HF_TOKEN and MODEL_NAME correctly")
        
else:
    print("⏭️  Skipping upload to Hugging Face Hub")
    print("💡 Set UPLOAD_TO_HUB = True above if you want to upload")

## 🔄 Load and Test Saved Model

Verify that the saved model can be loaded correctly.

In [None]:
# Test loading the saved model (optional)
TEST_LOADING = False  # Set to True to reload the saved adapters and run a quick test

if TEST_LOADING:
    print("🔄 Testing model loading...")
    
    # Clear GPU memory
    del model, tokenizer
    torch.cuda.empty_cache()
    
    # Load the saved model
    from unsloth import FastVisionModel
    model, tokenizer = FastVisionModel.from_pretrained(
        model_name="cardboard_qc_lora",
        load_in_4bit=True,
    )
    FastVisionModel.for_inference(model)
    
    # Quick test
    test_sample = train_dataset[0]
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze this cardboard bundle. Is it flat or warped?"}
        ]}
    ]
    
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    inputs = tokenizer(
        test_sample["image"],
        input_text,
        add_special_tokens=False,
        return_tensors="pt",
    ).to("cuda")
    
    print("🧪 Testing loaded model:")
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    _ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=50)
    
    print("\n✅ Model loading test successful!")
    
else:
    print("⏭️  Skipping model loading test")

## 🎉 Fine-tuning Complete!

### Summary

- ✅ **Model**: Qwen2.5-VL-7B fine-tuned for cardboard quality control
- ✅ **Dataset**: 117 training samples, 16 validation samples
- ✅ **Task**: Binary classification (Pass/Fail for cardboard bundles)
- ✅ **Training**: 3 epochs with LoRA adapters
- ✅ **Output**: Saved LoRA adapters ready for deployment

### Next Steps

1. **📊 Evaluation**: Evaluate on the full validation and test sets; this notebook scores only the first 10 validation samples
2. **🚀 Deployment**: Once held-out results are acceptable, use the saved model for production quality control
3. **📈 Improvement**: Collect more data and retrain if needed
4. **🔧 Integration**: Build API endpoints for real-time quality assessment

### Usage Example

```python
# To load your fine-tuned model later:
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    model_name="cardboard_qc_lora",  # or your HuggingFace repo
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Then use for prediction on new cardboard images!
```

### 🎯 Your Model's Specialty

Your fine-tuned Qwen2.5-VL model now specializes in:
- 📸 **Vision**: Analyzing cardboard bundle images
- 🧠 **Understanding**: Recognizing flat vs warped bundles  
- 💬 **Communication**: Providing clear quality assessments
- ⚡ **Efficiency**: Fast inference for production use

**Congratulations on successfully fine-tuning your cardboard QC model! 🎊**