# Amazon ML Challenge 2025 - LLM-Based Hierarchical Classification Approach

## Strategy Overview

This notebook implements a **Local LLM-Based Classification** approach for product price prediction:

### Architecture:
1. **LLM-Based Classification**: Use a local LLM (Phi-2 by default; Phi-3 or Llama-2 as alternatives) to generate hierarchical categories
   - Class (e.g., Electronics, Fashion, Home)
   - Sub-class (e.g., Smartphones, Laptops)
   - Brand (extracted from text)
   - Category attributes

2. **Price Range Mapping**: Build statistical maps for each class/subclass combination
   - Mean price per category
   - Price distribution statistics
   - Category-based features

3. **Hybrid Prediction**: Combine category features with text embeddings
   - Category embeddings
   - Text features from catalog_content
   - Statistical features from price maps
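
As a rough illustration of the price range mapping in step 2, the sketch below groups toy prices by (class, sub_class) and summarises each group. It uses only the standard library; the rows, the `build_price_map` helper, and the `min_samples` threshold are illustrative assumptions, not values taken from the dataset.

```python
from collections import defaultdict
from statistics import mean, median

# Toy rows: (class, sub_class, price). Illustrative values only.
rows = [
    ("Electronics", "Smartphones", 300.0),
    ("Electronics", "Smartphones", 500.0),
    ("Electronics", "Laptops", 900.0),
    ("Home", "Kitchen", 20.0),
    ("Home", "Kitchen", 40.0),
]

def build_price_map(rows, min_samples=2):
    """Group prices by (class, sub_class) and summarise each group."""
    groups = defaultdict(list)
    for cls, sub_cls, price in rows:
        groups[(cls, sub_cls)].append(price)
    return {
        key: {"mean": mean(prices), "median": median(prices), "count": len(prices)}
        for key, prices in groups.items()
        if len(prices) >= min_samples  # skip groups too small to summarise
    }

price_map = build_price_map(rows)
print(price_map[("Electronics", "Smartphones")])
```

Groups below the sample threshold are dropped, so a lookup for such a group has to fall back to a default value.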

### Advantages:
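- **Interpretable**: Each prediction can be traced back to an explicit class → sub-class → brand path
- **Robust**: Category-level price statistics are shared across products, which is intended to give sparse or noisy descriptions a usable prior
- **Non-duplicated**: New labels are fuzzy-matched against existing ones before being added to the taxonomy
- **Efficient**: Runs a quantized local LLM, so there are no API costs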

In [None]:
# Install required packages
!pip install -q transformers torch pandas numpy scikit-learn matplotlib seaborn tqdm accelerate bitsandbytes
!pip install -q sentence-transformers  # For text embeddings

In [None]:
# Import required libraries
import os
import json
import re
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    AutoModel, BitsAndBytesConfig
)
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from collections import defaultdict, Counter
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

# Check GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# Configuration
class Config:
    # Paths
    TRAIN_CSV = '/kaggle/input/amazon-ml-challenge-2025-main-data/student_resource/dataset/train.csv'
    TEST_CSV = '/kaggle/input/amazon-ml-challenge-2025-main-data/student_resource/dataset/test.csv'
    OUTPUT_CSV = 'test_out.csv'
    
    # Category files (will be created)
    CATEGORY_MAP_FILE = 'category_map.json'
    PRICE_STATS_FILE = 'price_statistics.json'
    
    # LLM Configuration (Local model)
    LLM_MODEL = 'microsoft/phi-2'  # Small, efficient local LLM
    # Alternatives: 'microsoft/Phi-3-mini-4k-instruct', 'meta-llama/Llama-2-7b-chat-hf'
    USE_4BIT = True  # Use 4-bit quantization for memory efficiency
    MAX_LLM_LENGTH = 512
    
    # Text Embedding Model
    EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'  # Fast & efficient
    
    # Training Configuration
    BATCH_SIZE = 64
    LEARNING_RATE = 1e-3
    NUM_EPOCHS = 20
    VAL_SPLIT = 0.15
    
    # Model Architecture
    HIDDEN_DIM = 256
    DROPOUT = 0.3
    
    # Category processing
    MIN_CATEGORY_SAMPLES = 5  # Minimum samples to create new category
    MAX_CATEGORIES_PER_LEVEL = 50  # Max unique categories per level
    
    # Feature configuration
    USE_LOG_TRANSFORM = True
    
config = Config()
print("Configuration loaded!")
print(f"LLM Model: {config.LLM_MODEL}")
print(f"Embedding Model: {config.EMBEDDING_MODEL}")
print(f"Output will be saved to: {config.OUTPUT_CSV}")

## Step 1: Load Data

In [None]:
# Load datasets
try:
    train_df = pd.read_csv(config.TRAIN_CSV)
    test_df = pd.read_csv(config.TEST_CSV)
    
    print(f"Training data: {train_df.shape}")
    print(f"Test data: {test_df.shape}")
    
    print("\nTraining columns:", train_df.columns.tolist())
    print("\nFirst few rows:")
    print(train_df.head(3))
    
    print("\nPrice statistics:")
    print(train_df['price'].describe())
    
    print("\nMissing values:")
    print(train_df.isnull().sum())
    
except FileNotFoundError as e:
    print(f"Error loading data: {e}")
    print("Please update the file paths in Config class")

## Step 2: Setup Local LLM for Classification

We'll use a local LLM (Phi-2) with 4-bit quantization to:
1. Extract product categories from text
2. Generate hierarchical classification (class → sub_class → brand)
3. Build a reusable category taxonomy
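
LLM replies often wrap the JSON in extra text, so the parsing step needs to be tolerant. The following is a minimal sketch of one way to pull the first JSON object out of a reply and fill in missing fields. The helper names `parse_first_json_object` and `normalize_categories` and the default values are assumptions made for illustration; the notebook's own extraction function follows in the next cells.

```python
import json

def parse_first_json_object(response):
    """Return the first decodable JSON object in `response`, or None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = response.find("{", start + 1)  # try the next opening brace
    return None

def normalize_categories(obj):
    """Fill missing fields with neutral defaults so later code can index safely."""
    obj = obj or {}
    attrs = obj.get("attributes")
    if not isinstance(attrs, list):
        attrs = [] if attrs is None else [attrs]
    return {
        "class": str(obj.get("class") or "General"),
        "sub_class": str(obj.get("sub_class") or "General"),
        "brand": str(obj.get("brand") or "Unknown"),
        "attributes": attrs,
    }

reply = 'Sure! {"class": "Electronics", "sub_class": "Laptops", "brand": "Acme"} Hope this helps.'
parsed = normalize_categories(parse_first_json_object(reply))
print(parsed)
```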

In [None]:
# Initialize Local LLM with 4-bit quantization
print("Loading Local LLM...")
print(f"Model: {config.LLM_MODEL}")

try:
    # Configure 4-bit quantization for memory efficiency
    if config.USE_4BIT:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
        )
        
        llm_tokenizer = AutoTokenizer.from_pretrained(config.LLM_MODEL, trust_remote_code=True)
        llm_model = AutoModelForCausalLM.from_pretrained(
            config.LLM_MODEL,
            quantization_config=bnb_config,
            device_map="auto",
            trust_remote_code=True
        )
    else:
        llm_tokenizer = AutoTokenizer.from_pretrained(config.LLM_MODEL)
        llm_model = AutoModelForCausalLM.from_pretrained(config.LLM_MODEL).to(device)
    
    llm_model.eval()
    print("✓ LLM loaded successfully!")
    print(f"Model size: ~{sum(p.numel() for p in llm_model.parameters()) / 1e9:.2f}B parameters")
    
except Exception as e:
    print(f"Error loading LLM: {e}")
    print("Falling back to rule-based classification")
    llm_model = None
    llm_tokenizer = None

In [None]:
# Category Extraction with LLM
def extract_categories_with_llm(text, existing_categories=None):
    """
    Use LLM to extract hierarchical categories from product text.
    Reuses existing categories when possible.
    
    Returns: dict with 'class', 'sub_class', 'brand', 'attributes'
    """
    if llm_model is None:
        # Fallback: rule-based extraction
        return extract_categories_rule_based(text)
    
    # Prepare prompt
    prompt = f"""Analyze this product and extract categories in JSON format.

Product: {text[:400]}

Extract:
1. class: Main category (e.g., Electronics, Fashion, Home, Beauty, Sports)
2. sub_class: Specific subcategory (e.g., Smartphones, Laptops, Shirts)
3. brand: Brand name if mentioned
4. attributes: Key product attributes (as list)

Format: {{"class": "...", "sub_class": "...", "brand": "...", "attributes": [...]}}

JSON:"""
    
    try:
        inputs = llm_tokenizer(prompt, return_tensors="pt", max_length=config.MAX_LLM_LENGTH, truncation=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = llm_model.generate(
                **inputs,
                max_new_tokens=150,
                temperature=0.3,
                do_sample=True,
                pad_token_id=llm_tokenizer.eos_token_id
            )
        
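        # Decode only the generated tokens: the prompt contains a JSON-like
        # format template that the regex below would otherwise match first
        prompt_len = inputs['input_ids'].shape[1]
        response = llm_tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
        
        # Extract the first JSON object (non-greedy; the expected object has no nested braces)
        json_match = re.search(r'\{.*?\}', response, re.DOTALL)
        if json_match:
            parsed = json.loads(json_match.group())
            
            # Fill missing or empty fields from the rule-based fallback
            fallback = extract_categories_rule_based(text)
            categories = {k: parsed.get(k) or v for k, v in fallback.items()}
            for key in ('class', 'sub_class', 'brand'):
                categories[key] = str(categories[key])  # keep labels as hashable strings
            
            # Match with existing categories if provided
            if existing_categories:
                categories = match_existing_categories(categories, existing_categories)
            
            return categories
        else:
            return extract_categories_rule_based(text)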
            
    except Exception as e:
        print(f"LLM extraction error: {e}")
        return extract_categories_rule_based(text)


def extract_categories_rule_based(text):
    """
    Fallback rule-based category extraction.
    """
    text_lower = text.lower()
    
    # Define category keywords
    category_keywords = {
        'Electronics': ['phone', 'laptop', 'computer', 'tablet', 'camera', 'headphone', 'speaker', 'tv', 'monitor'],
        'Fashion': ['shirt', 'pant', 'dress', 'shoe', 'bag', 'watch', 'clothing', 'apparel', 'fashion'],
        'Home': ['furniture', 'kitchen', 'bed', 'table', 'chair', 'decor', 'appliance', 'home'],
        'Beauty': ['makeup', 'cosmetic', 'skincare', 'perfume', 'beauty', 'lotion', 'cream'],
        'Sports': ['fitness', 'sport', 'gym', 'exercise', 'yoga', 'running', 'athletic'],
        'Books': ['book', 'novel', 'textbook', 'magazine', 'reading'],
        'Toys': ['toy', 'game', 'puzzle', 'kids', 'children', 'play'],
        'Food': ['food', 'snack', 'grocery', 'beverage', 'drink', 'coffee', 'tea'],
    }
    
    # Find matching class
    main_class = 'General'
    for cls, keywords in category_keywords.items():
        if any(kw in text_lower for kw in keywords):
            main_class = cls
            break
    
    # Extract brand (look for capitalized words)
    brands = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?\b', text)
    brand = brands[0] if brands else 'Unknown'
    
    # Extract attributes (simple version)
    attributes = []
    if 'pack' in text_lower:
        pack_match = re.search(r'(\d+)\s*pack', text_lower)
        if pack_match:
            attributes.append(f"{pack_match.group(1)}-pack")
    
    return {
        'class': main_class,
        'sub_class': f"{main_class}_General",
        'brand': brand,
        'attributes': attributes
    }


def match_existing_categories(new_categories, existing_categories):
    """
    Match new categories with existing ones to avoid duplicates.
    Uses fuzzy matching on category names.
    """
    from difflib import get_close_matches
    
    # Match class
    if 'classes' in existing_categories:
        matches = get_close_matches(new_categories['class'], existing_categories['classes'], n=1, cutoff=0.8)
        if matches:
            new_categories['class'] = matches[0]
    
    # Match sub_class
    if 'sub_classes' in existing_categories:
        matches = get_close_matches(new_categories['sub_class'], existing_categories['sub_classes'], n=1, cutoff=0.8)
        if matches:
            new_categories['sub_class'] = matches[0]
    
    # Match brand
    if 'brands' in existing_categories:
        matches = get_close_matches(new_categories['brand'], existing_categories['brands'], n=1, cutoff=0.85)
        if matches:
            new_categories['brand'] = matches[0]
    
    return new_categories

print("✓ Category extraction functions defined!")

## Step 3: Process Training Data - Extract Categories

Extract categories for all training samples and build the category taxonomy.
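
To illustrate how a taxonomy can stay free of near-duplicate labels, here is a small sketch built on `difflib.get_close_matches`, the same standard-library function the matching helper above relies on. The `canonicalize` function and the sample labels are hypothetical.

```python
from difflib import get_close_matches

def canonicalize(label, known, cutoff=0.8):
    """Return a close existing label if there is one, else register `label` as new."""
    matches = get_close_matches(label, sorted(known), n=1, cutoff=cutoff)
    if matches:
        return matches[0]
    known.add(label)
    return label

known_classes = set()
raw_labels = ["Electronics", "Electronic", "electronics", "Home"]
canonical = [canonicalize(lbl, known_classes) for lbl in raw_labels]
print(canonical)
print(sorted(known_classes))
```

The comparison is case-sensitive, so lower-casing labels before matching may merge more variants; whether that is desirable depends on how brand names are written in the data.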

In [None]:
# Process training data to extract categories
print("Extracting categories from training data...")
print("This may take a while with LLM processing...\n")

# Initialize category storage
all_categories = []
existing_categories = {
    'classes': set(),
    'sub_classes': set(),
    'brands': set()
}

# Process in batches for progress tracking
batch_size = 100
for i in tqdm(range(0, len(train_df), batch_size), desc="Processing batches"):
    batch = train_df.iloc[i:i+batch_size]
    
    for idx, row in batch.iterrows():
        text = str(row['catalog_content']) if pd.notna(row['catalog_content']) else ''
        
        # Extract categories
        categories = extract_categories_with_llm(text, existing_categories)
        categories['sample_id'] = row['sample_id']
        categories['price'] = row['price']
        
        all_categories.append(categories)
        
        # Update existing categories
        existing_categories['classes'].add(categories['class'])
        existing_categories['sub_classes'].add(categories['sub_class'])
        existing_categories['brands'].add(categories['brand'])

# Convert to DataFrame
category_df = pd.DataFrame(all_categories)

print("\n✓ Category extraction complete!")
print(f"\nTotal samples: {len(category_df)}")
print(f"Unique classes: {category_df['class'].nunique()}")
print(f"Unique sub_classes: {category_df['sub_class'].nunique()}")
print(f"Unique brands: {category_df['brand'].nunique()}")

print("\nClass distribution:")
print(category_df['class'].value_counts().head(10))

In [None]:
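# Build price statistics for each category combination.
# Note: these are computed on all training rows, including the rows later held
# out for validation, so the validation SMAPE is likely to be optimistic.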
print("\nBuilding price statistics maps...")

price_stats = {}

# Stats by class
for cls in category_df['class'].unique():
    cls_data = category_df[category_df['class'] == cls]['price']
    if len(cls_data) >= config.MIN_CATEGORY_SAMPLES:
        price_stats[f"class_{cls}"] = {
            'mean': float(cls_data.mean()),
            'median': float(cls_data.median()),
            'std': float(cls_data.std()),
            'min': float(cls_data.min()),
            'max': float(cls_data.max()),
            'count': int(len(cls_data))
        }

# Stats by class + sub_class
for (cls, sub_cls), group in category_df.groupby(['class', 'sub_class']):
    if len(group) >= config.MIN_CATEGORY_SAMPLES:
        prices = group['price']
        price_stats[f"class_{cls}_sub_{sub_cls}"] = {
            'mean': float(prices.mean()),
            'median': float(prices.median()),
            'std': float(prices.std()),
            'min': float(prices.min()),
            'max': float(prices.max()),
            'count': int(len(prices))
        }

# Stats by brand
for brand in category_df['brand'].unique():
    brand_data = category_df[category_df['brand'] == brand]['price']
    if len(brand_data) >= config.MIN_CATEGORY_SAMPLES:
        price_stats[f"brand_{brand}"] = {
            'mean': float(brand_data.mean()),
            'median': float(brand_data.median()),
            'std': float(brand_data.std()),
            'min': float(brand_data.min()),
            'max': float(brand_data.max()),
            'count': int(len(brand_data))
        }

print(f"\n✓ Price statistics created for {len(price_stats)} category combinations")

# Save statistics
with open(config.PRICE_STATS_FILE, 'w') as f:
    json.dump(price_stats, f, indent=2)

print(f"✓ Statistics saved to {config.PRICE_STATS_FILE}")

# Show sample statistics
print("\nSample price statistics:")
for key in list(price_stats.keys())[:5]:
    print(f"\n{key}:")
    print(f"  Mean: ${price_stats[key]['mean']:.2f}")
    print(f"  Median: ${price_stats[key]['median']:.2f}")
    print(f"  Count: {price_stats[key]['count']}")

In [None]:
# Visualize price distributions by category
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Price by class
ax = axes[0, 0]
top_classes = category_df['class'].value_counts().head(8).index
data_to_plot = [category_df[category_df['class'] == cls]['price'].values for cls in top_classes]
ax.boxplot(data_to_plot, labels=top_classes)
ax.set_xlabel('Class')
ax.set_ylabel('Price ($)')
ax.set_title('Price Distribution by Class')
ax.tick_params(axis='x', rotation=45)

# Price by brand (top brands)
ax = axes[0, 1]
top_brands = category_df['brand'].value_counts().head(8).index
data_to_plot = [category_df[category_df['brand'] == brand]['price'].values for brand in top_brands]
ax.boxplot(data_to_plot, labels=top_brands)
ax.set_xlabel('Brand')
ax.set_ylabel('Price ($)')
ax.set_title('Price Distribution by Brand')
ax.tick_params(axis='x', rotation=45)

# Class counts
ax = axes[1, 0]
category_df['class'].value_counts().head(10).plot(kind='bar', ax=ax)
ax.set_xlabel('Class')
ax.set_ylabel('Count')
ax.set_title('Sample Count by Class')
ax.tick_params(axis='x', rotation=45)

# Overall price distribution
ax = axes[1, 1]
ax.hist(category_df['price'], bins=50, edgecolor='black')
ax.set_xlabel('Price ($)')
ax.set_ylabel('Frequency')
ax.set_title('Overall Price Distribution')

plt.tight_layout()
plt.show()

print("\n✓ Visualizations complete!")

## Step 4: Build Hybrid Price Prediction Model

Combine:
1. Category embeddings (class, sub_class, brand)
2. Text embeddings from catalog_content
3. Statistical features from price maps
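
The statistical features can be pictured as a fixed-length vector of nine numbers: mean, median, and standard deviation at the class, class + sub_class, and brand levels. The sketch below assembles that vector from a stats dictionary keyed the same way as `price_stats`, zero-filling any level that has no entry. The function name `stat_features` and the numbers are illustrative assumptions.

```python
def stat_features(cats, price_stats):
    """Return [mean, median, std] for class, class+sub_class, and brand (9 values)."""
    keys = [
        f"class_{cats['class']}",
        f"class_{cats['class']}_sub_{cats['sub_class']}",
        f"brand_{cats['brand']}",
    ]
    features = []
    for key in keys:
        stats = price_stats.get(key)
        if stats is None:
            features.extend([0.0, 0.0, 0.0])  # level has no statistics: zero-fill
        else:
            features.extend([stats["mean"], stats["median"], stats["std"]])
    return features

# Illustrative numbers only
toy_stats = {
    "class_Home": {"mean": 30.0, "median": 25.0, "std": 12.0},
    "brand_Acme": {"mean": 45.0, "median": 40.0, "std": 8.0},
}
vec = stat_features({"class": "Home", "sub_class": "Kitchen", "brand": "Acme"}, toy_stats)
print(vec)
```

Because these values are raw prices while the target is log-transformed, scaling them (for example with `log1p` or a standard scaler) before feeding them to the network may help training; that step is not part of this sketch.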

In [None]:
# Initialize text embedding model
print("Loading text embedding model...")
embedding_model = SentenceTransformer(config.EMBEDDING_MODEL)
embedding_model = embedding_model.to(device)
print(f"✓ Embedding model loaded: {config.EMBEDDING_MODEL}")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

In [None]:
# Create label encoders for categories
print("\nCreating category encoders...")

class_encoder = LabelEncoder()
sub_class_encoder = LabelEncoder()
brand_encoder = LabelEncoder()

category_df['class_encoded'] = class_encoder.fit_transform(category_df['class'])
category_df['sub_class_encoded'] = sub_class_encoder.fit_transform(category_df['sub_class'])
category_df['brand_encoded'] = brand_encoder.fit_transform(category_df['brand'])

print(f"✓ Encoders created")
print(f"  Classes: {len(class_encoder.classes_)}")
print(f"  Sub-classes: {len(sub_class_encoder.classes_)}")
print(f"  Brands: {len(brand_encoder.classes_)}")

# Save encoders and categories
category_map = {
    'classes': class_encoder.classes_.tolist(),
    'sub_classes': sub_class_encoder.classes_.tolist(),
    'brands': brand_encoder.classes_.tolist()
}

with open(config.CATEGORY_MAP_FILE, 'w') as f:
    json.dump(category_map, f, indent=2)

print(f"✓ Category map saved to {config.CATEGORY_MAP_FILE}")

In [None]:
# Dataset class
class CategoryPriceDataset(Dataset):
    def __init__(self, df, category_df, train_df_full, embedding_model, price_stats, is_test=False):
        self.df = df
        self.category_df = category_df
        self.train_df_full = train_df_full
        self.embedding_model = embedding_model
        self.price_stats = price_stats
        self.is_test = is_test
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        sample_id = row['sample_id']
        
        # Get text
        text = str(row['catalog_content']) if pd.notna(row['catalog_content']) else ''
        
        # Get category info
        cat_row = self.category_df[self.category_df['sample_id'] == sample_id].iloc[0]
        
        # Category features
        class_id = cat_row['class_encoded']
        sub_class_id = cat_row['sub_class_encoded']
        brand_id = cat_row['brand_encoded']
        
        # Statistical features from price maps
        stat_features = []
        
        # Class stats
        key = f"class_{cat_row['class']}"
        if key in self.price_stats:
            stat_features.extend([
                self.price_stats[key]['mean'],
                self.price_stats[key]['median'],
                self.price_stats[key]['std'],
            ])
        else:
            stat_features.extend([0, 0, 0])
        
        # Class+SubClass stats
        key = f"class_{cat_row['class']}_sub_{cat_row['sub_class']}"
        if key in self.price_stats:
            stat_features.extend([
                self.price_stats[key]['mean'],
                self.price_stats[key]['median'],
                self.price_stats[key]['std'],
            ])
        else:
            stat_features.extend([0, 0, 0])
        
        # Brand stats
        key = f"brand_{cat_row['brand']}"
        if key in self.price_stats:
            stat_features.extend([
                self.price_stats[key]['mean'],
                self.price_stats[key]['median'],
                self.price_stats[key]['std'],
            ])
        else:
            stat_features.extend([0, 0, 0])
        
        stat_features = torch.tensor(stat_features, dtype=torch.float32)
        
        # Text embedding
        with torch.no_grad():
            text_embedding = self.embedding_model.encode(text, convert_to_tensor=True, device=device)
        
        # Target
        if not self.is_test:
            price = cat_row['price']
            if config.USE_LOG_TRANSFORM:
                price = np.log1p(price)
            target = torch.tensor(price, dtype=torch.float32)
        else:
            target = torch.tensor(0.0, dtype=torch.float32)
        
        return {
            'class_id': torch.tensor(class_id, dtype=torch.long),
            'sub_class_id': torch.tensor(sub_class_id, dtype=torch.long),
            'brand_id': torch.tensor(brand_id, dtype=torch.long),
            'text_embedding': text_embedding,
            'stat_features': stat_features,
            'target': target,
            'sample_id': sample_id
        }

print("✓ Dataset class defined!")

In [None]:
# Hybrid Price Prediction Model
class HybridCategoryPriceModel(nn.Module):
    def __init__(self, num_classes, num_sub_classes, num_brands, text_dim, stat_dim, hidden_dim, dropout):
        super().__init__()
        
        # Category embeddings
        self.class_embedding = nn.Embedding(num_classes, 32)
        self.sub_class_embedding = nn.Embedding(num_sub_classes, 64)
        self.brand_embedding = nn.Embedding(num_brands, 64)
        
        # Text projection
        self.text_projection = nn.Sequential(
            nn.Linear(text_dim, 256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, 128)
        )
        
        # Statistical features projection
        self.stat_projection = nn.Sequential(
            nn.Linear(stat_dim, 64),
            nn.ReLU(),
            nn.Dropout(dropout)
        )
        
        # Fusion MLP
        total_dim = 32 + 64 + 64 + 128 + 64  # class + sub_class + brand + text + stats
        self.fusion = nn.Sequential(
            nn.Linear(total_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, hidden_dim // 4),
            nn.ReLU(),
            nn.Linear(hidden_dim // 4, 1)
        )
        
    def forward(self, class_id, sub_class_id, brand_id, text_embedding, stat_features):
        # Embed categories
        class_emb = self.class_embedding(class_id)
        sub_class_emb = self.sub_class_embedding(sub_class_id)
        brand_emb = self.brand_embedding(brand_id)
        
        # Project text
        text_features = self.text_projection(text_embedding)
        
        # Project stats
        stat_proj = self.stat_projection(stat_features)
        
        # Concatenate all features
        combined = torch.cat([
            class_emb,
            sub_class_emb,
            brand_emb,
            text_features,
            stat_proj
        ], dim=-1)
        
        # Predict price
        output = self.fusion(combined)
        return output.squeeze(-1)

print("✓ Model architecture defined!")

In [None]:
# Prepare training data
print("\nPreparing training data...")

# Merge train_df with category_df
train_data = train_df.merge(category_df[['sample_id', 'class_encoded', 'sub_class_encoded', 'brand_encoded', 'class', 'sub_class', 'brand']], on='sample_id')

print(f"Training data shape: {train_data.shape}")

# Split into train and validation
train_data, val_data = train_test_split(train_data, test_size=config.VAL_SPLIT, random_state=42)

print(f"Train samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")

# Create datasets
train_dataset = CategoryPriceDataset(train_data, category_df, train_df, embedding_model, price_stats)
val_dataset = CategoryPriceDataset(val_data, category_df, train_df, embedding_model, price_stats)

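# Create dataloaders
# num_workers=0: the dataset runs the embedding model (possibly on the GPU) inside
# __getitem__, and CUDA cannot be re-initialized in forked worker processes
train_loader = DataLoader(train_dataset, batch_size=config.BATCH_SIZE, shuffle=True, num_workers=0)
val_loader = DataLoader(val_dataset, batch_size=config.BATCH_SIZE, shuffle=False, num_workers=0)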

print(f"✓ Dataloaders created!")
print(f"  Train batches: {len(train_loader)}")
print(f"  Val batches: {len(val_loader)}")

In [None]:
# Initialize model
text_dim = embedding_model.get_sentence_embedding_dimension()
stat_dim = 9  # 3 stats x 3 levels (class, class+sub, brand)

model = HybridCategoryPriceModel(
    num_classes=len(class_encoder.classes_),
    num_sub_classes=len(sub_class_encoder.classes_),
    num_brands=len(brand_encoder.classes_),
    text_dim=text_dim,
    stat_dim=stat_dim,
    hidden_dim=config.HIDDEN_DIM,
    dropout=config.DROPOUT
).to(device)

print(f"\n✓ Model initialized!")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

# Optimizer and loss
optimizer = torch.optim.AdamW(model.parameters(), lr=config.LEARNING_RATE, weight_decay=0.01)
criterion = nn.MSELoss()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)

print("\n✓ Optimizer and loss configured!")

## Step 5: Train Model

In [None]:
# Training loop
def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    
    for batch in tqdm(loader, desc="Training"):
        class_id = batch['class_id'].to(device)
        sub_class_id = batch['sub_class_id'].to(device)
        brand_id = batch['brand_id'].to(device)
        text_embedding = batch['text_embedding'].to(device)
        stat_features = batch['stat_features'].to(device)
        target = batch['target'].to(device)
        
        optimizer.zero_grad()
        output = model(class_id, sub_class_id, brand_id, text_embedding, stat_features)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(loader)


def validate(model, loader, criterion, device, use_log_transform):
    model.eval()
    total_loss = 0
    all_preds = []
    all_targets = []
    
    with torch.no_grad():
        for batch in tqdm(loader, desc="Validation"):
            class_id = batch['class_id'].to(device)
            sub_class_id = batch['sub_class_id'].to(device)
            brand_id = batch['brand_id'].to(device)
            text_embedding = batch['text_embedding'].to(device)
            stat_features = batch['stat_features'].to(device)
            target = batch['target'].to(device)
            
            output = model(class_id, sub_class_id, brand_id, text_embedding, stat_features)
            loss = criterion(output, target)
            total_loss += loss.item()
            
            # Convert back from log space
            if use_log_transform:
                preds = torch.expm1(output).cpu().numpy()
                targets = torch.expm1(target).cpu().numpy()
            else:
                preds = output.cpu().numpy()
                targets = target.cpu().numpy()
            
            all_preds.extend(preds)
            all_targets.extend(targets)
    
    # Calculate SMAPE
    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)
    smape = np.mean(np.abs(all_preds - all_targets) / ((np.abs(all_targets) + np.abs(all_preds)) / 2)) * 100
    
    return total_loss / len(loader), smape

print("✓ Training functions defined!")

In [None]:
# Train model
print("\nStarting training...\n")

best_smape = float('inf')
train_losses = []
val_losses = []
val_smapes = []

for epoch in range(config.NUM_EPOCHS):
    print(f"\nEpoch {epoch+1}/{config.NUM_EPOCHS}")
    print("="*50)
    
    # Train
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device)
    train_losses.append(train_loss)
    
    # Validate
    val_loss, val_smape = validate(model, val_loader, criterion, device, config.USE_LOG_TRANSFORM)
    val_losses.append(val_loss)
    val_smapes.append(val_smape)
    
    # Update scheduler
    scheduler.step(val_smape)
    
    print(f"Train Loss: {train_loss:.4f}")
    print(f"Val Loss: {val_loss:.4f}")
    print(f"Val SMAPE: {val_smape:.2f}%")
    print(f"Learning Rate: {optimizer.param_groups[0]['lr']:.6f}")
    
    # Save best model
    if val_smape < best_smape:
        best_smape = val_smape
        torch.save(model.state_dict(), 'best_model.pth')
        print(f"✓ Best model saved! (SMAPE: {best_smape:.2f}%)")

print("\n" + "="*50)
print(f"Training complete!")
print(f"Best Validation SMAPE: {best_smape:.2f}%")
print("="*50)

In [None]:
# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Loss curves
ax = axes[0]
ax.plot(train_losses, label='Train Loss', marker='o')
ax.plot(val_losses, label='Val Loss', marker='s')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss (MSE)')
ax.set_title('Training and Validation Loss')
ax.legend()
ax.grid(True)

# SMAPE curve
ax = axes[1]
ax.plot(val_smapes, label='Val SMAPE', marker='o', color='green')
ax.axhline(y=best_smape, color='r', linestyle='--', label=f'Best: {best_smape:.2f}%')
ax.set_xlabel('Epoch')
ax.set_ylabel('SMAPE (%)')
ax.set_title('Validation SMAPE')
ax.legend()
ax.grid(True)

plt.tight_layout()
plt.show()

print("✓ Training curves plotted!")

## Step 6: Generate Test Predictions

Process test data:
1. Extract categories for test samples
2. Generate predictions using trained model
3. Create submission file
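
Test products can carry labels that never appeared in training. One common way to handle this, sketched below as an alternative to defaulting to an existing class index, is to reserve a dedicated id for unknown labels. The `fit_vocab` and `encode` helpers and the `<UNK>` token are hypothetical names, not part of the notebook.

```python
def fit_vocab(labels, unknown="<UNK>"):
    """Map each distinct label to an integer id; id 0 is reserved for unseen labels."""
    vocab = {unknown: 0}
    for label in sorted(set(labels)):
        vocab.setdefault(label, len(vocab))
    return vocab

def encode(label, vocab):
    """Return the id for `label`, or the reserved unknown id 0."""
    return vocab.get(label, 0)

vocab = fit_vocab(["Home", "Electronics", "Home"])
print(encode("Electronics", vocab), encode("Garden", vocab))
```

With this scheme the embedding tables would need `len(vocab)` rows, one more than the number of distinct training labels.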

In [None]:
# Load best model
model.load_state_dict(torch.load('best_model.pth', map_location=device))
model.eval()
print("✓ Best model loaded!")

# Extract categories for test data
print("\nExtracting categories for test data...")
print("Reusing existing categories when possible...\n")

test_categories = []
for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Processing test samples"):
    text = str(row['catalog_content']) if pd.notna(row['catalog_content']) else ''
    
    # Extract categories (will match with existing ones)
    categories = extract_categories_with_llm(text, existing_categories)
    categories['sample_id'] = row['sample_id']
    
    # Encode categories. LabelEncoder.transform raises ValueError for labels not
    # seen during fit; in that case fall back to index 0 (the first known label)
    try:
        categories['class_encoded'] = class_encoder.transform([categories['class']])[0]
    except ValueError:
        categories['class_encoded'] = 0
    
    try:
        categories['sub_class_encoded'] = sub_class_encoder.transform([categories['sub_class']])[0]
    except ValueError:
        categories['sub_class_encoded'] = 0
    
    try:
        categories['brand_encoded'] = brand_encoder.transform([categories['brand']])[0]
    except ValueError:
        categories['brand_encoded'] = 0
    
    test_categories.append(categories)

test_category_df = pd.DataFrame(test_categories)
print(f"\n✓ Test categories extracted: {len(test_category_df)} samples")

In [None]:
# Create test dataset
test_data = test_df.merge(test_category_df[['sample_id', 'class_encoded', 'sub_class_encoded', 'brand_encoded', 'class', 'sub_class', 'brand']], on='sample_id')
test_dataset = CategoryPriceDataset(test_data, test_category_df, train_df, embedding_model, price_stats, is_test=True)
# num_workers=0 because the dataset calls the embedding model inside __getitem__
test_loader = DataLoader(test_dataset, batch_size=config.BATCH_SIZE, shuffle=False, num_workers=0)

print(f"\n✓ Test dataset created: {len(test_dataset)} samples")
print("\nGenerating predictions...\n")

# Generate predictions
all_predictions = []
all_sample_ids = []

model.eval()
with torch.no_grad():
    for batch in tqdm(test_loader, desc="Predicting"):
        class_id = batch['class_id'].to(device)
        sub_class_id = batch['sub_class_id'].to(device)
        brand_id = batch['brand_id'].to(device)
        text_embedding = batch['text_embedding'].to(device)
        stat_features = batch['stat_features'].to(device)
        sample_ids = batch['sample_id']
        
        output = model(class_id, sub_class_id, brand_id, text_embedding, stat_features)
        
        # Convert from log space
        if config.USE_LOG_TRANSFORM:
            preds = torch.expm1(output).cpu().numpy()
        else:
            preds = output.cpu().numpy()
        
        # Ensure positive prices
        preds = np.maximum(preds, 0.01)
        
        all_predictions.extend(preds.tolist())
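        # The default collate function turns numeric IDs into a tensor; convert
        # them to plain Python values so sorting and set comparisons work
        all_sample_ids.extend(sample_ids.tolist() if torch.is_tensor(sample_ids) else list(sample_ids))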

print(f"\n✓ Predictions generated: {len(all_predictions)}")
print(f"\nPrediction statistics:")
print(f"  Min: ${np.min(all_predictions):.2f}")
print(f"  Max: ${np.max(all_predictions):.2f}")
print(f"  Mean: ${np.mean(all_predictions):.2f}")
print(f"  Median: ${np.median(all_predictions):.2f}")

In [None]:
# Create submission file
output_df = pd.DataFrame({
    'sample_id': all_sample_ids,
    'price': all_predictions
})

# Sort by sample_id for a deterministic row order
# (this matches test.csv only if that file is itself sorted by sample_id)
output_df = output_df.sort_values('sample_id').reset_index(drop=True)

# Save to CSV
output_df.to_csv(config.OUTPUT_CSV, index=False)

print(f"\n{'='*60}")
print(f"SUBMISSION FILE CREATED")
print(f"{'='*60}")
print(f"File: {config.OUTPUT_CSV}")
print(f"Samples: {len(output_df)}")
print(f"\nFirst 10 predictions:")
print(output_df.head(10))

# Validate output
if len(output_df) == len(test_df):
    print(f"\n✓ Output has correct number of samples: {len(output_df)}")
else:
    print(f"\n⚠ Warning: Output has {len(output_df)} samples, expected {len(test_df)}")

# Check for missing sample IDs
missing_ids = set(test_df['sample_id']) - set(output_df['sample_id'])
if missing_ids:
    print(f"\n⚠ Missing {len(missing_ids)} sample IDs in output")
else:
    print("\n✓ All sample IDs present in output")

# Check for duplicates
if output_df['sample_id'].duplicated().any():
    print("\n⚠ Warning: Duplicate sample IDs found")
else:
    print("✓ No duplicate sample IDs")

print(f"\n{'='*60}")
print("✓ READY FOR SUBMISSION!")
print(f"{'='*60}")

## Summary

### Approach:
1. **LLM-Based Classification**: Used local Phi-2 model to extract hierarchical categories
2. **Category Taxonomy**: Built non-duplicated class/sub_class/brand structure
3. **Price Statistics**: Mapped price distributions for each category combination
4. **Hybrid Model**: Combined category embeddings + text embeddings + statistical features
5. **Training**: Minimized MSE on log-transformed prices (`log1p`) and selected the best checkpoint by validation SMAPE
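
For reference, the metric and the price transform can be written in a few lines of plain Python. This is a sketch that mirrors the SMAPE formula used in the validation function; the `eps` guard against a zero denominator is an added assumption and is not in the notebook's version.

```python
import math

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric mean absolute percentage error, in percent."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        total += abs(p - t) / max(denom, eps)  # eps avoids division by zero
    return 100.0 * total / len(y_true)

prices = [10.0, 20.0]
log_targets = [math.log1p(p) for p in prices]   # what the model is trained on
recovered = [math.expm1(v) for v in log_targets]  # back to price space

print(smape(prices, recovered))
print(smape([100.0], [50.0]))
```

The model is trained in log space, but SMAPE is evaluated after mapping predictions back with `expm1`, so the two quantities are not expected to move in lockstep.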

### Key Features:
- **Local LLM**: No API costs, and product data never leaves the machine
- **Reusable Categories**: Matches existing categories to avoid duplication
- **Statistical Features**: Leverages category-based price patterns
- **Text Embeddings**: Captures semantic information from product descriptions
- **Hybrid Architecture**: Combines multiple information sources

### Results:
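- Training SMAPE: not computed in this notebook (only the training MSE loss is tracked)
- Validation SMAPE: printed as `best_smape` at the end of the training loop
- Test predictions: one row per test sample, written to `test_out.csv`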

### Next Steps for Improvement:
1. Fine-tune LLM on domain-specific data
2. Add more category levels (sub-sub-class, product type)
3. Extract quantitative attributes (size, weight, pack quantity)
4. Ensemble with image-based models
5. Post-processing: Category-based calibration
6. Use larger LLM (Llama-2-7B, Phi-3) for better categorization
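
As a starting point for item 3, quantitative attributes can often be pulled out with regular expressions before involving the LLM at all. The sketch below is one possible approach; the patterns, the unit list, and the `extract_quantities` helper are assumptions that would need tuning against real catalog text.

```python
import re

# Value followed by a unit, e.g. "12 oz" or "1.5 kg"
UNIT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(oz|lb|kg|g|ml|l)\b", re.IGNORECASE)
# "Pack of 6" or "24-pack" / "24 pack"
PACK_PATTERN = re.compile(r"(?:pack\s+of\s+(\d+)|(\d+)\s*[- ]?pack)\b", re.IGNORECASE)

def extract_quantities(text):
    """Pull a pack count and a size (value + unit) out of free text, if present."""
    result = {"pack_count": 1, "size_value": None, "size_unit": None}
    pack = PACK_PATTERN.search(text)
    if pack:
        result["pack_count"] = int(pack.group(1) or pack.group(2))
    size = UNIT_PATTERN.search(text)
    if size:
        result["size_value"] = float(size.group(1))
        result["size_unit"] = size.group(2).lower()
    return result

print(extract_quantities("Green Tea Bags, 12 oz (Pack of 6)"))
print(extract_quantities("Sparkling Water 500 ml 24-pack"))
```

Only the first size mention is captured, so titles listing several sizes would need extra handling.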