# Turkish E-commerce Product Information Extraction with Fine-Tuned LLaMA


## 🎯 Project Overview
This project fine-tunes the LLaMA 3.2 3B Instruct model with Unsloth to extract structured product information from Turkish e-commerce HTML pages. The model converts unstructured HTML into clean JSON, which suits web scraping, price monitoring, and data analysis applications.

## ✨ Key Features

- 🇹🇷 Turkish Language Specialized: Fine-tuned on Turkish e-commerce product pages
- 🚀 Efficient Fine-Tuning: LoRA training on 500 synthetic examples (dataset size is configurable)
- 📊 JSON Output: Clean, structured data extraction
- ⚡ Local Deployment: Runs locally through Ollama, so page data stays on your machine
- 🎛️ Customizable: Easy to extend for new product categories

## 🏗️ Architecture

HTML Input → Fine-Tuned LLaMA 3.2 3B → Structured JSON Output

### Model Details

* Base Model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
* Fine-tuning: LoRA (Low-Rank Adaptation)
* Quantization: Q4_K_M, a common trade-off between file size and output quality
* Deployment: Ollama GGUF format
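
To make the target of this pipeline concrete, the sketch below models the JSON record the fine-tuned model is trained to emit. The field names mirror the `output` dictionary built in the dataset-generation cell; the `ProductRecord` type and the `missing_fields` helper are illustrative names and not part of the notebook.

```python
from typing import Any, Dict, List, TypedDict


class ProductRecord(TypedDict):
    """Illustrative schema; every value is a string except `features`."""
    name: str
    price: str
    original_price: str
    discount: str
    brand: str
    category: str
    seller: str
    rating: str
    review_count: str
    features: List[str]
    shipping: str
    availability: str


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the schema keys that are absent from a model response."""
    return [key for key in ProductRecord.__annotations__ if key not in record]


# A deliberately incomplete record, used only to demonstrate the check.
sample = {"name": "Dell XPS 13", "price": "45,000 TL", "features": ["16GB RAM"]}
print(missing_fields(sample))
```

A check like this can run on every model response before the record is stored, so that partially extracted products are flagged instead of silently accepted.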

## 📋 Table of Contents

1. Setup & Installation
2. Dataset Generation
3. Model Training
4. Testing & Validation
5. GGUF Conversion
6. Ollama Deployment
7. Usage Examples
8. Performance Metrics


## 🚀 Setup & Installation
### Prerequisites

* Google Colab with a GPU runtime (T4/V100/A100)
* Python 3.8+
* CUDA-compatible GPU (for local training)



In [None]:
# ============================================================================
# 🚀 TURKISH E-COMMERCE AI MODEL SETUP
# Fine-tuning LLaMA 3.2 3B for Turkish e-commerce data extraction
# ============================================================================

import sys
import torch
import warnings
warnings.filterwarnings('ignore')

print("🔍 System Information")
print("=" * 50)
print(f"Python Version: {sys.version}")
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"CUDA Version: {torch.version.cuda}")
else:
    print("⚠️  No GPU detected! Please enable GPU in Runtime settings")
    print("   Runtime → Change runtime type → Hardware accelerator → GPU")

In [None]:
# ============================================================================
# 📦 INSTALL REQUIRED PACKAGES
# Installing Unsloth and dependencies for efficient fine-tuning
# ============================================================================

print("🔧 Installing Unsloth for efficient fine-tuning...")
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

print("📚 Installing additional dependencies...")
!pip install -q datasets transformers accelerate bitsandbytes trl

print("✅ Installation completed!")
print("⚠️  IMPORTANT: Please restart runtime now!")
print("   Runtime → Restart runtime")
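
After restarting the runtime, it can be useful to confirm that the packages installed above are actually visible to the interpreter. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list simply repeats the names from the install commands.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


required = ["unsloth", "datasets", "transformers", "accelerate", "bitsandbytes", "trl"]
for name, version in installed_versions(required).items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```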

In [None]:
# ============================================================================
# 📊 TURKISH E-COMMERCE DATASET GENERATION
# Creating diverse synthetic examples for training
# ============================================================================

import json
import random
from typing import Dict, List, Any

def generate_turkish_ecommerce_dataset(size: int = 500) -> List[Dict[str, Any]]:
    """
    Generate synthetic Turkish e-commerce dataset.

    Args:
        size (int): Number of examples to generate

    Returns:
        List[Dict]: Generated dataset with input/output pairs
    """

    print(f"📊 Generating {size} Turkish e-commerce examples...")

    # Product categories (Turkish names) with well-known product models
    products = {
        "Telefon": [
            "iPhone 15 Pro Max", "Samsung Galaxy S24 Ultra", "Xiaomi 14 Pro",
            "Google Pixel 8 Pro", "OnePlus 12", "Huawei P60 Pro"
        ],
        "Laptop": [
            "MacBook Air M3", "Dell XPS 13", "HP Envy 15", "Asus ROG Strix",
            "MSI Gaming GF63", "Lenovo ThinkPad X1", "Surface Laptop 5"
        ],
        "Kulaklık": [
            "AirPods Pro 2", "Sony WH-1000XM5", "Bose QuietComfort 45",
            "Sennheiser HD 660S", "JBL Tour One M2", "Marshall Major IV"
        ],
        "Televizyon": [
            "Samsung QLED 4K", "LG OLED C3", "Sony Bravia XR",
            "TCL 4K Android TV", "Philips Ambilight"
        ],
        "Ayakkabı": [
            "Nike Air Max 270", "Adidas Ultraboost 22", "Puma RS-X",
            "New Balance 990v6", "Converse Chuck Taylor"
        ]
    }

    # Turkish e-commerce seller names
    sellers = [
        "ResmiMağaza", "TechWorld", "Elektronik Dünyası",
        "Hızlı Teknoloji", "Digital Plaza", "Mega Store"
    ]

    # Category-specific features
    features_by_category = {
        "Telefon": [
            "128GB Depolama", "256GB Depolama", "512GB Depolama",
            "12GB RAM", "8GB RAM", "48MP Kamera", "5G Desteği",
            "Wireless Charging", "Su Geçirmez", "Face ID"
        ],
        "Laptop": [
            "Intel i7 İşlemci", "AMD Ryzen 7", "16GB RAM", "512GB SSD",
            "NVIDIA RTX 4060", "15.6 İnç Ekran", "Backlit Keyboard"
        ],
        "Kulaklık": [
            "Aktif Gürültü Engelleme", "Wireless", "30 Saat Pil Ömrü",
            "Hi-Res Audio", "Su Geçirmez", "Voice Assistant"
        ],
        "Televizyon": [
            "4K Ultra HD", "Smart TV", "55 İnç", "HDR10+",
            "Dolby Vision", "Voice Control"
        ],
        "Ayakkabı": [
            "Air Cushioning", "Su Geçirmez", "Nefes Alabilen",
            "Ortopedik Destek", "Anti-Slip Taban"
        ]
    }

    # Price ranges by category (in Turkish Lira)
    price_ranges = {
        "Telefon": (15000, 80000),
        "Laptop": (20000, 100000),
        "Kulaklık": (500, 15000),
        "Televizyon": (8000, 50000),
        "Ayakkabı": (300, 3000)
    }

    dataset = []

    for i in range(size):
        # Select random category and product
        category = random.choice(list(products.keys()))
        product = random.choice(products[category])
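        # Derive the brand from the first word, correcting product lines
        # whose first word is not the manufacturer (e.g. "iPhone" -> "Apple")
        first_word = product.split()[0]
        brand = {
            "iPhone": "Apple", "MacBook": "Apple", "AirPods": "Apple",
            "Surface": "Microsoft", "New": "New Balance",
        }.get(first_word, first_word)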
        seller = random.choice(sellers)

        # Generate realistic pricing
        min_price, max_price = price_ranges[category]
        price = random.randint(min_price, max_price)
        original_price = price + random.randint(int(price * 0.1), int(price * 0.3))
        discount = round(((original_price - price) / original_price) * 100)

        # Generate ratings and reviews
        rating = round(random.uniform(3.5, 5.0), 1)
        review_count = random.randint(10, 5000)

        # Select random features
        available_features = features_by_category[category]
        num_features = random.randint(3, min(6, len(available_features)))
        selected_features = random.sample(available_features, num_features)

        # Render the feature list with one <li> per line. This is built outside
        # the f-string because backslashes inside f-string expressions are only
        # allowed from Python 3.12 onward.
        features_html = "\n".join(
            f"            <li>{feature}</li>" for feature in selected_features
        )

        # Generate HTML template
        html_template = f"""<div class="urun-detayi">
    <h1 class="baslik">{product}</h1>
    <div class="fiyat-bolumu">
        <span class="aktuel-fiyat">{price:,} TL</span>
        <span class="eski-fiyat">{original_price:,} TL</span>
        <span class="indirim-orani">%{discount} İndirim</span>
    </div>
    <div class="urun-bilgileri">
        <div class="marka">Marka: <strong>{brand}</strong></div>
        <div class="kategori">Kategori: {category}</div>
        <div class="satici">Satıcı: {seller}</div>
    </div>
    <div class="degerlendirme">
        <span class="puan">{rating}</span>
        <span class="yorum-sayisi">({review_count:,} değerlendirme)</span>
    </div>
    <div class="ozellikler">
        <h3>Ürün Özellikleri:</h3>
        <ul>
{features_html}
        </ul>
    </div>
    <div class="kargo-bilgi">🚚 Ücretsiz Kargo</div>
    <div class="stok-durumu">✅ Stokta mevcut</div>
</div>"""

        # Generate target JSON output
        output = {
            "name": product,
            "price": f"{price:,} TL",
            "original_price": f"{original_price:,} TL",
            "discount": f"%{discount} İndirim",
            "brand": brand,
            "category": category,
            "seller": seller,
            "rating": str(rating),
            "review_count": f"{review_count:,}",
            "features": selected_features,
            "shipping": "Ücretsiz Kargo",
            "availability": "Stokta mevcut"
        }

        dataset.append({
            "input": f"Aşağıdaki Türkçe e-ticaret ürün sayfasından bilgileri çıkar:\n\n{html_template}",
            "output": output
        })

        if (i + 1) % 100 == 0:
            print(f"   ✅ Generated {i + 1}/{size} examples")

    return dataset

# Generate dataset
dataset = generate_turkish_ecommerce_dataset(500)

print(f"\n✅ Dataset ready with {len(dataset)} examples!")

# Save dataset
with open('turkish_ecommerce_dataset.json', 'w', encoding='utf-8') as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)

print("\n📝 Sample data preview:")
print("Input (first 200 chars):")
print(dataset[0]["input"][:200] + "...")
print("\nOutput:")
print(json.dumps(dataset[0]["output"], ensure_ascii=False, indent=2))
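
One detail worth checking before training: the generator formats prices with Python's `:,` specifier, which yields comma separators such as `54,999 TL`, while the test pages later in this notebook use the dot-separated Turkish convention (`54.999 TL`). A minimal sketch of a formatter that produces the dot-separated form is shown below. `format_tl` is a hypothetical helper, and whether to switch the training data to it is a judgment call.

```python
def format_tl(amount: int) -> str:
    """Format an integer lira amount with dots as thousands separators."""
    # Python has no locale-independent "dot" grouping specifier,
    # so format with commas first and then swap the separator.
    return f"{amount:,}".replace(",", ".") + " TL"


print(format_tl(54999))    # 54.999 TL
print(format_tl(1234567))  # 1.234.567 TL
print(format_tl(300))      # 300 TL
```

Mixing both separator styles in the synthetic data is another option if the model is expected to see pages written in either convention.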

In [None]:
# ============================================================================
# 🤖 MODEL LOADING & LORA CONFIGURATION
# Setting up LLaMA 3.2 3B with efficient fine-tuning
# ============================================================================

from unsloth import FastLanguageModel
import torch

# Model configuration
MODEL_NAME = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"
MAX_SEQ_LENGTH = 4096

print(f"🔄 Loading model: {MODEL_NAME}")

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,
    load_in_4bit=True,
)

print("✅ Base model loaded successfully!")

# Configure LoRA (Low-Rank Adaptation)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,  # LoRA rank; higher values add adapter capacity but use more memory
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=128,  # 2x rank
    lora_dropout=0.1,  # Regularization against overfitting; may train slower than 0 under Unsloth
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=False,
    loftq_config=None,
)

print("✅ LoRA adapters configured!")
print(f"📊 Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
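
As a rough cross-check for the trainable-parameter count printed above, LoRA adds `r * (d_in + d_out)` weights to each adapted linear layer. The sketch below applies that formula with layer shapes assumed here for Llama 3.2 3B (hidden size 3072, MLP size 8192, 1024-wide key/value projections, 28 layers). These shapes are assumptions, so confirm them against `model.config` before relying on the number.

```python
from typing import Dict, Tuple

# ASSUMPTION: (d_in, d_out) shapes for Llama 3.2 3B; confirm via model.config.
ASSUMED_SHAPES: Dict[str, Tuple[int, int]] = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
ASSUMED_NUM_LAYERS = 28


def lora_param_count(rank: int, shapes: Dict[str, Tuple[int, int]],
                     num_layers: int) -> int:
    """Count LoRA weights: each layer gets A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


print(f"{lora_param_count(64, ASSUMED_SHAPES, ASSUMED_NUM_LAYERS):,}")
print(f"LoRA scaling (alpha / r): {128 / 64}")
```

With `use_rslora=False`, the adapter update is scaled by `lora_alpha / r`, which is 2.0 for the values in this cell.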

In [None]:
# ============================================================================
# 📊 DATA PREPROCESSING & FORMATTING
# Preparing dataset for fine-tuning
# ============================================================================

from datasets import Dataset

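# Use the tokenizer's real end-of-sequence token so the model learns to stop.
# A literal "<|endoftext|>" string is not a special token in the Llama 3
# vocabulary and would be learned as ordinary text.
EOS_TOKEN = tokenizer.eos_token

def format_prompt_turkish(example):
    """Format training examples with Turkish e-commerce prompt template."""
    return f"""### Görev:
{example['input']}

### Çıktı:
{json.dumps(example['output'], ensure_ascii=False)}{EOS_TOKEN}"""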

# Load and format dataset
with open('turkish_ecommerce_dataset.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Format all examples
formatted_data = [format_prompt_turkish(item) for item in data]
train_dataset = Dataset.from_dict({"text": formatted_data})

print(f"✅ Formatted {len(train_dataset)} training examples")
print("\n📝 Sample formatted prompt:")
print("=" * 80)
print(train_dataset[0]['text'][:500] + "...")
print("=" * 80)
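
The cell above sends every formatted example to training, which leaves nothing to measure generalization on. A minimal sketch of a reproducible hold-out split over plain Python lists follows. `split_examples` and the 10% ratio are illustrative choices and not part of the original pipeline.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_examples(items: Sequence[T], eval_fraction: float = 0.1,
                   seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `items` and split it into (train, eval) lists."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG; global state untouched
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder strings stand in for the formatted prompts.
train_items, eval_items = split_examples([f"example {i}" for i in range(500)])
print(len(train_items), len(eval_items))  # 450 50
```

The held-out prompts can then be run through the extraction function after training to estimate accuracy on pages the model has not seen.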

In [None]:
# ============================================================================
# 🚀 FINE-TUNING EXECUTION
# Training the model with optimized hyperparameters
# ============================================================================

from trl import SFTTrainer
from transformers import TrainingArguments

# Training configuration (hyperparameters chosen for this small synthetic dataset)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    args=TrainingArguments(
        # Batch size and gradient settings
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # Effective batch size = 16

        # Learning rate and scheduling
        learning_rate=2e-4,
        warmup_steps=20,
        num_train_epochs=3,
        lr_scheduler_type="cosine",

        # Optimization settings
        optim="adamw_8bit",
        weight_decay=0.01,
        label_smoothing_factor=0.1,

        # Precision settings
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),

        # Logging and saving
        logging_steps=10,
        output_dir="./turkish-ecommerce-model",
        save_strategy="epoch",
        save_total_limit=2,

        # Memory optimization
        dataloader_pin_memory=False,
        remove_unused_columns=False,

        # Disable external logging
        report_to="none",
        seed=42,
    ),
)

print("🔧 Trainer configured successfully!")
print("\n🚀 Starting fine-tuning process...")
print("⏱️  This may take 15-30 minutes depending on your GPU...")

# Start training
trainer_stats = trainer.train()

print("\n✅ Fine-tuning completed!")
print("📊 Training Statistics:")
print(f"   Average Training Loss: {trainer_stats.training_loss:.4f}")
print(f"   Training Steps: {trainer_stats.global_step}")

# Save the fine-tuned model
model_save_path = "./turkish-ecommerce-final"
trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)

print(f"\n💾 Model saved to: {model_save_path}")
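
To judge whether `warmup_steps=20` is proportionate, it helps to estimate how many optimizer updates the run performs. The sketch below derives a lower and an upper bound from the settings in this cell. The exact count reported by the trainer depends on how the installed `transformers` version rounds the last partial accumulation window, so treat the result as an estimate.

```python
import math
from typing import Tuple


def estimate_update_steps(num_examples: int, per_device_batch: int,
                          grad_accum: int, epochs: int) -> Tuple[int, int]:
    """Return (low, high) bounds on the number of optimizer updates."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    # Lower bound: a trailing partial accumulation window is dropped.
    low = max(1, batches_per_epoch // grad_accum) * epochs
    # Upper bound: a trailing partial accumulation window still counts.
    high = math.ceil(batches_per_epoch / grad_accum) * epochs
    return low, high


low, high = estimate_update_steps(500, 2, 8, 3)
print(f"Estimated optimizer updates: {low}-{high}")
print(f"Warmup share: {20 / high:.0%} to {20 / low:.0%}")
```

For the defaults used here this gives roughly 93 to 96 updates, so warmup covers about a fifth of training.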

In [None]:
# ============================================================================
# 🧪 MODEL TESTING & VALIDATION
# Evaluating the fine-tuned model performance
# ============================================================================

import json
from typing import Dict, Any

# Prepare model for inference
FastLanguageModel.for_inference(model)

def extract_turkish_ecommerce(html_content: str) -> Dict[str, Any]:
    """
    Extract product information from Turkish e-commerce HTML.

    Args:
        html_content (str): Raw HTML content

    Returns:
        Dict[str, Any]: Extracted product information or error details
    """

    prompt = f"""### Görev:
Aşağıdaki Türkçe e-ticaret ürün sayfasından bilgileri çıkar:

{html_content}

### Çıktı:
"""

    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.3,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract JSON from response
    try:
        json_start = response.find('### Çıktı:')
        if json_start == -1:
            return {"error": "Output header not found", "raw_response": response}

        json_start += len('### Çıktı:')
        json_str = response[json_start:].strip()

        # Clean up response
        if '<|endoftext|>' in json_str:
            json_str = json_str.split('<|endoftext|>')[0].strip()

        # Parse JSON
        result = json.loads(json_str)
        return result

    except json.JSONDecodeError as e:
        return {
            "error": f"JSON parsing error: {str(e)}",
            "raw_response": response,
            "extracted_json": json_str if 'json_str' in locals() else "Not found"
        }
    except Exception as e:
        return {"error": f"General error: {str(e)}", "raw_response": response}

# Test cases
test_cases = [
    {
        "name": "iPhone Test",
        "html": """
<div class="product-page">
    <h1>Apple iPhone 15 Pro 128GB - Doğal Titanyum</h1>
    <div class="price-section">
        <span class="price">54.999 TL</span>
        <span class="old-price">59.999 TL</span>
        <span class="discount">%8 İndirim</span>
    </div>
    <div class="brand">Apple</div>
    <div class="category">Akıllı Telefon</div>
    <div class="seller">Apple Store</div>
    <div class="rating">4.8 ⭐ (2.156 değerlendirme)</div>
    <div class="features">
        <li>128GB Depolama</li>
        <li>Titanium Tasarım</li>
        <li>48MP Ana Kamera</li>
    </div>
    <div class="shipping">Ücretsiz Kargo</div>
    <div class="stock">Stokta var</div>
</div>
"""
    },
    {
        "name": "Samsung Test",
        "html": """
<div class="urun-detay">
    <h1>Samsung Galaxy S24 Ultra 256GB - Siyah</h1>
    <div class="fiyat">
        <span class="guncel-fiyat">45.999 TL</span>
        <span class="eski-fiyat">49.999 TL</span>
        <span class="indirim">%8 İndirim</span>
    </div>
    <div class="marka">Samsung</div>
    <div class="kategori">Telefon</div>
    <div class="satici">Samsung Türkiye</div>
    <div class="puan">4.6 (1.234 değerlendirme)</div>
    <ul class="ozellikler">
        <li>256GB Dahili Hafıza</li>
        <li>12GB RAM</li>
        <li>200MP Ana Kamera</li>
    </ul>
    <div class="kargo">Ücretsiz Kargo</div>
    <div class="stok">Stokta mevcut</div>
</div>
"""
    }
]

# Run tests
print("🧪 Running model validation tests...")
print("=" * 80)

successful_tests = 0
total_tests = len(test_cases)

for i, test_case in enumerate(test_cases, 1):
    print(f"\n📋 Test {i}: {test_case['name']}")
    print("-" * 40)

    result = extract_turkish_ecommerce(test_case['html'])

    if "error" not in result:
        print("✅ Success!")
        print(json.dumps(result, ensure_ascii=False, indent=2))
        successful_tests += 1
    else:
        print("❌ Error occurred:")
        print(f"Error: {result['error']}")

print(f"\n📊 Test Results: {successful_tests}/{total_tests} tests passed")
print(f"Success Rate: {(successful_tests/total_tests)*100:.1f}%")
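
The loop above counts a test as passed whenever the response parses as JSON, which says nothing about whether the extracted values are correct. A minimal sketch of a field-level comparison is given below. `field_accuracy` is an illustrative helper, the expected record is transcribed by hand from the iPhone test page, and the predicted record is a made-up response used only for demonstration.

```python
from typing import Any, Dict, Tuple


def field_accuracy(predicted: Dict[str, Any],
                   expected: Dict[str, Any]) -> Tuple[float, Dict[str, bool]]:
    """Compare predicted fields against expected ones by exact match."""
    per_field = {key: predicted.get(key) == value for key, value in expected.items()}
    score = sum(per_field.values()) / len(per_field) if per_field else 0.0
    return score, per_field


expected = {
    "name": "Apple iPhone 15 Pro 128GB - Doğal Titanyum",
    "price": "54.999 TL",
    "brand": "Apple",
    "seller": "Apple Store",
}
# Hypothetical model response with one wrong field (brand).
predicted = {
    "name": "Apple iPhone 15 Pro 128GB - Doğal Titanyum",
    "price": "54.999 TL",
    "brand": "iPhone",
    "seller": "Apple Store",
}

score, details = field_accuracy(predicted, expected)
print(f"Field accuracy: {score:.0%}")
print([key for key, ok in details.items() if not ok])
```

Exact matching is strict on purpose. Looser comparisons, such as normalizing whitespace or number separators, can be layered on top once the failure patterns are known.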

In [None]:
# ============================================================================
# 🔧 GGUF CONVERSION FOR OLLAMA DEPLOYMENT
# Converting fine-tuned model to efficient GGUF format
# ============================================================================

print("🔄 Converting model to GGUF format for Ollama deployment...")
print("⏱️  This process may take several minutes...")

# Convert to GGUF with Q4_K_M quantization (a common size/quality trade-off)
model.save_pretrained_gguf(
    "turkish_ecommerce_gguf",
    tokenizer,
    quantization_method="q4_k_m"
)

print("✅ GGUF conversion completed!")

# Display file information
import os
gguf_path = "turkish_ecommerce_gguf"
if os.path.exists(gguf_path):
    print(f"\n📁 GGUF files generated in: {gguf_path}/")
    for file in os.listdir(gguf_path):
        if file.endswith('.gguf'):
            file_path = os.path.join(gguf_path, file)
            size_mb = os.path.getsize(file_path) / (1024 * 1024)
            print(f"   📄 {file}: {size_mb:.1f} MB")

print("\n🎯 Next Steps:")
print("1. Download the GGUF files to your local machine")
print("2. Install Ollama (https://ollama.ai)")
print("3. Create and run the model locally")
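
For step 3, Ollama builds a model from a `Modelfile`. The sketch below writes one whose template reproduces the `### Görev:` / `### Çıktı:` prompt used during training. The GGUF file name, the model name `turkish-ecommerce`, and the parameter values are placeholders to adapt; take the real file name from the listing printed above.

```python
from pathlib import Path


def build_modelfile(gguf_file: str, temperature: float = 0.3) -> str:
    """Return Modelfile text that wraps user input in the training prompt."""
    return (
        f"FROM ./{gguf_file}\n"
        'TEMPLATE """### Görev:\n'
        "{{ .Prompt }}\n"
        "\n"
        "### Çıktı:\n"
        '"""\n'
        f"PARAMETER temperature {temperature}\n"
        'PARAMETER stop "### Görev:"\n'
    )


# "model.gguf" is a placeholder; use the file name produced by the conversion.
modelfile_text = build_modelfile("model.gguf")
Path("Modelfile").write_text(modelfile_text, encoding="utf-8")
print(modelfile_text)
```

With the file next to the GGUF weights, `ollama create turkish-ecommerce -f Modelfile` followed by `ollama run turkish-ecommerce` should register and start the model. The user message should still begin with the Turkish instruction line used in training, followed by the HTML.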

In [None]:
# ============================================================================
# 📥 DOWNLOAD GGUF FILES TO LOCAL MACHINE
# Using Google Drive for large file transfer
# ============================================================================

from google.colab import drive
import shutil
import os

print("☁️  Uploading GGUF files to Google Drive...")

try:
    # Mount Google Drive
    drive.mount('/content/drive')
    print("✅ Google Drive connected!")

    # Create destination directory
    drive_destination = "/content/drive/MyDrive/turkish_ecommerce_gguf"

    if os.path.exists("turkish_ecommerce_gguf"):
        # Copy files to Google Drive
        if os.path.exists(drive_destination):
            shutil.rmtree(drive_destination)

        shutil.copytree("turkish_ecommerce_gguf", drive_destination)
        print(f"✅ Files uploaded to Google Drive: {drive_destination}")

        # Verify upload
        uploaded_files = os.listdir(drive_destination)
        total_size = 0
        for file in uploaded_files:
            file_path = os.path.join(drive_destination, file)
            if os.path.isfile(file_path):
                size_mb = os.path.getsize(file_path) / (1024 * 1024)
                total_size += size_mb
                print(f"   📄 {file}: {size_mb:.1f} MB")

        print(f"\n📊 Total size: {total_size:.1f} MB")
        print("\n📱 Access your files:")
        print("1. Open Google Drive: https://drive.google.com")
        print("2. Navigate to 'turkish_ecommerce_gguf' folder")
        print("3. Download the .gguf file to your computer")

    else:
        print("❌ GGUF folder not found!")

except Exception as e:
    print(f"❌ Error uploading to Google Drive: {e}")
    print("💡 Try downloading files manually using Google Colab's file browser")