# 🚀 **ULTRA-FAST LLM Feature Extraction with vLLM** - Amazon ML Challenge 2025

## ⚡ **What's New:**
- ✅ **vLLM Engine** - 5-10x faster than HuggingFace Transformers
- ✅ **PagedAttention** - Efficient KV cache management
- ✅ **Continuous Batching** - Finished sequences are replaced without waiting for the whole batch
- ✅ **Tensor Parallelism** - Optional sharding across GPUs (set to 1 here for a single A100 80GB)
- ✅ **Batched Offline Inference** - One `llm.generate()` call handles a whole batch of prompts

## 📊 **Performance Comparison:**

| Method | 140K Rows (A100 80GB) | Throughput |
|--------|----------------------|------------|
| **HuggingFace (Current)** | ~8-12 hours | ~3-5 samples/sec |
| **vLLM (Optimized)** | **~1-2 hours** | **20-50 samples/sec** |
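
The time column follows from the throughput column by simple division. As a quick sanity check (a sketch, using only the 140K row count and the throughput ranges quoted in the table), the conversion looks like this:

```python
def estimate_hours(num_rows: int, samples_per_sec: float) -> float:
    """Wall-clock hours needed to process num_rows at a steady throughput."""
    if samples_per_sec <= 0:
        raise ValueError("samples_per_sec must be positive")
    return num_rows / samples_per_sec / 3600


NUM_ROWS = 140_000  # dataset size quoted in the table

for label, rate in [("HuggingFace low", 3), ("HuggingFace high", 5),
                    ("vLLM low", 20), ("vLLM high", 50)]:
    hours = estimate_hours(NUM_ROWS, rate)
    print(f"{label:17s} @ {rate:>2} samples/sec -> {hours:5.2f} h")
```

At 20-50 samples/sec the full dataset takes roughly 0.8-1.9 hours, which matches the "~1-2 hours" entry. The throughput figures themselves are estimates, so measure on a small sample before relying on them.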

## 🎯 **Same Features as Before:**
- 15 product fields extracted per item
- Anti-hallucination prompts
- Checkpoint system
- Raw text processing

---

## 📋 **Configuration Section**

In [None]:
# ===============================
# ⚙️ CONFIGURATION
# ===============================

# Model Selection (choose one or specify your own)
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # Recommended for A100
# MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"  # Faster, less accurate
# MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"  # Alternative

# vLLM Settings (OPTIMIZED FOR A100 80GB)
TENSOR_PARALLEL_SIZE = 1  # Set to 1 for single GPU, 2-4 for multi-GPU
MAX_MODEL_LEN = 2048  # Context length (reduce if OOM)
GPU_MEMORY_UTILIZATION = 0.90  # Use 90% of GPU memory (safe for A100)
MAX_NUM_BATCHED_TOKENS = 8192  # Increase for A100 (more throughput)
MAX_NUM_SEQS = 256  # Process up to 256 sequences in parallel

# Generation Settings
MAX_TOKENS = 500  # Max tokens per response
TEMPERATURE = 0.1  # Lower = more deterministic
TOP_P = 0.95
FREQUENCY_PENALTY = 0.0

# Processing Settings
BATCH_SIZE = 1000  # Large batch for vLLM (it handles internal batching)
NUM_WORKERS = 4  # Parallel prompt preparation

# Data Paths
INPUT_CSV = "/root/train.csv"
OUTPUT_CSV = "train_llm_vllm_extracted_features.csv"
CHECKPOINT_FILE = "vllm_extraction_checkpoint.json"

# Processing Options
USE_CHECKPOINTS = True
CHECKPOINT_INTERVAL = 10  # Save every 10 batches (faster now!)
RESUME_FROM_CHECKPOINT = True

# Sample Size (for testing - set to None to process all rows)
SAMPLE_SIZE = None  # None = process all 140K rows

print("✅ vLLM Configuration loaded!")
print(f"   Model: {MODEL_NAME}")
print(f"   Tensor Parallel: {TENSOR_PARALLEL_SIZE}")
print(f"   GPU Memory: {GPU_MEMORY_UTILIZATION * 100}%")
print(f"   Max Parallel Sequences: {MAX_NUM_SEQS}")
print(f"   Output: {OUTPUT_CSV}")
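
The settings above interact with each other, so it can help to check them before spending minutes on model loading. The sketch below is a hypothetical helper (`check_config` is not part of vLLM). The constraints are assumptions based on how these parameters typically relate: the per-step token budget should cover one full-length sequence and at least one token per parallel sequence, and the generation budget must leave room for the prompt. vLLM may enforce different rules depending on version and on whether chunked prefill is enabled.

```python
def check_config(max_model_len, max_num_batched_tokens, max_num_seqs,
                 gpu_memory_utilization, max_tokens):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if not 0 < gpu_memory_utilization <= 1:
        problems.append("GPU_MEMORY_UTILIZATION must be in (0, 1]")
    # Assumption: without chunked prefill, one step must fit a full-length prompt.
    if max_num_batched_tokens < max_model_len:
        problems.append("MAX_NUM_BATCHED_TOKENS is smaller than MAX_MODEL_LEN")
    # Assumption: each scheduled sequence needs at least one token per step.
    if max_num_batched_tokens < max_num_seqs:
        problems.append("MAX_NUM_BATCHED_TOKENS is smaller than MAX_NUM_SEQS")
    # Prompt tokens and generated tokens share the same context window.
    if max_tokens >= max_model_len:
        problems.append("MAX_TOKENS leaves no room for the prompt")
    return problems


# Values copied from the configuration cell.
print(check_config(max_model_len=2048, max_num_batched_tokens=8192,
                   max_num_seqs=256, gpu_memory_utilization=0.90,
                   max_tokens=500))
```

An empty list means the values are at least mutually consistent. It says nothing about whether they fit in GPU memory.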

In [None]:
# ===============================
# 📦 Step 1: Install vLLM and Dependencies
# ===============================
# vLLM is optimized for high-throughput inference
# It uses PagedAttention for efficient memory management

# Quote the requirement so the shell does not treat '>' as output redirection
%pip install -q "vllm>=0.6.0"
%pip install -q pandas numpy tqdm

print("✅ vLLM installed successfully!")
print("   This is a MUCH faster inference engine than HuggingFace Transformers")

In [None]:
# ===============================
# 📚 Step 2: Imports
# ===============================
import pandas as pd
import numpy as np
import json
import re
import os
from vllm import LLM, SamplingParams
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

# Enable tqdm for pandas
tqdm.pandas()

print("✅ All libraries loaded!")
print("   Using vLLM for ultra-fast inference")

---
## 🎨 **Prompt Engineering Section**
---

In [None]:
# ===============================
# 🎨 Step 3: Define Extraction Prompt (Same as Before)
# ===============================

def create_extraction_prompt(raw_catalog_content):
    """
    Create a prompt for the LLM to extract comprehensive product information.
    
    NO PREPROCESSING - Raw text goes directly to LLM!
    """
    
    prompt = f"""You are an expert product data analyst. Extract product information from the RAW catalog content below and return ONLY a valid JSON object.

**IMPORTANT RULES:**
1. Extract ONLY from the provided raw data - DO NOT make up or guess information
2. If a field is not present in the data, return "N/A" (not null, not empty string)
3. Return ONLY the JSON object, no explanations or extra text
4. Use exact formatting as shown in the examples

**RAW CATALOG CONTENT:**
{raw_catalog_content}

**EXTRACT THESE FIELDS:**
{{
  "product_name": "Core product name without brand, measurements, or pack info",
  "brand_name": "Manufacturer or brand name",
  "product_type": "Specific product category (e.g., 'beans', 'oil', 'snack', 'pasta', 'sauce')",
  "category": "Broad category - choose ONLY from: food, beverage, beauty, health, home, electronics, clothing, pet, unknown",
  "quantity": "Numeric quantity value (e.g., '2', '500', '1.5')",
  "quantity_unit": "Unit of quantity (e.g., 'lb', 'kg', 'oz', 'ml', 'g', 'l')",
  "amount_packs": "Number of packs/items (e.g., '2', '6', '12')",
  "value": "Formatted value from data (e.g., '2 pound', '500 millilitre')",
  "unit": "Formatted unit from data (e.g., 'pound', 'millilitre', 'gram')",
  "packaging_type": "Package format - choose from: Bottle, Pouch, Jar, Can, Box, Packet, Bag, Container, or N/A",
  "country_of_origin": "Country where product is made/sourced",
  "use_case": "Primary use or benefit",
  "shelf_life": "Storage duration or expiry info",
  "sentiment_quality": "Quality indicators: premium, luxury, organic, natural, economy, affordable, budget",
  "summarized_description": "Brief 2-3 sentence summary"
}}

Return ONLY the JSON:"""
    
    return prompt


print("✅ Enhanced prompt template defined!")
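
Because the raw text is passed through unmodified, a very long catalog entry could push the prompt past `MAX_MODEL_LEN`. One optional guard is to cap the catalog text before building the prompt. This is a rough sketch, not part of the original pipeline: `truncate_content` is a hypothetical helper, the 4-characters-per-token ratio is a crude heuristic, and `template_tokens=600` is a guessed allowance for the fixed instructions. For an exact budget, count tokens with the model's tokenizer instead.

```python
def truncate_content(raw_text: str, max_model_len: int = 2048,
                     max_new_tokens: int = 500, template_tokens: int = 600,
                     chars_per_token: float = 4.0) -> str:
    """Cap raw_text so prompt + generation should fit in the context window.

    template_tokens and chars_per_token are rough assumptions, not measured values.
    """
    budget_tokens = max_model_len - max_new_tokens - template_tokens
    if budget_tokens <= 0:
        raise ValueError("No room left for catalog content; raise max_model_len")
    budget_chars = int(budget_tokens * chars_per_token)
    if len(raw_text) <= budget_chars:
        return raw_text  # short entries pass through untouched
    return raw_text[:budget_chars]


long_entry = "Item Name: Example product. " * 500
print(len(long_entry), "->", len(truncate_content(long_entry)))
```

With the defaults shown, entries shorter than the character budget are returned unchanged, so typical rows still reach the model as raw text.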

In [None]:
# ===============================
# 🤖 Step 4: Load vLLM Model
# ===============================
print(f"🚀 Loading model with vLLM: {MODEL_NAME}")
print("This may take 2-3 minutes for initial loading...\n")

# Initialize vLLM engine with optimized settings for A100 80GB
llm = LLM(
    model=MODEL_NAME,
    tensor_parallel_size=TENSOR_PARALLEL_SIZE,
    gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
    max_model_len=MAX_MODEL_LEN,
    max_num_batched_tokens=MAX_NUM_BATCHED_TOKENS,
    max_num_seqs=MAX_NUM_SEQS,
    trust_remote_code=True,
    dtype="float16",  # Use FP16 for speed
    enforce_eager=False,  # Use CUDA graphs for speed
)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=TEMPERATURE,
    top_p=TOP_P,
    max_tokens=MAX_TOKENS,
    frequency_penalty=FREQUENCY_PENALTY,
    stop=None,  # Let model decide when to stop
)

print(f"\n✅ vLLM Model loaded successfully!")
print(f"   Tensor Parallel Size: {TENSOR_PARALLEL_SIZE}")
print(f"   GPU Memory Utilization: {GPU_MEMORY_UTILIZATION * 100}%")
print(f"   Max Parallel Sequences: {MAX_NUM_SEQS}")
print(f"\n🚀 Ready for ULTRA-FAST inference!")

---
## 🔧 **Extraction Functions (vLLM Optimized)**
---

In [None]:
# ===============================
# 🔹 Function 1: Parse LLM JSON Output (Same as Before)
# ===============================

def parse_llm_output(output_text, default_values=None):
    """
    Parse JSON from LLM output with robust error handling.
    """
    if default_values is None:
        default_values = {
            'product_name': 'N/A',
            'brand_name': 'N/A',
            'product_type': 'N/A',
            'category': 'unknown',
            'quantity': 'N/A',
            'quantity_unit': 'N/A',
            'amount_packs': 'N/A',
            'value': 'N/A',
            'unit': 'N/A',
            'packaging_type': 'N/A',
            'country_of_origin': 'N/A',
            'use_case': 'N/A',
            'shelf_life': 'N/A',
            'sentiment_quality': 'N/A',
            'summarized_description': 'N/A'
        }
    
    try:
        # Try to find JSON in the output (the pattern tolerates one level of nested braces)
        json_match = re.search(r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}', output_text, re.DOTALL)
        if json_match:
            json_str = json_match.group(0)
            parsed = json.loads(json_str)
            
            # Merge with defaults
            result = default_values.copy()
            result.update(parsed)
            
            # Convert N/A variants (and JSON null) to standard "N/A"
            for key, value in result.items():
                if value is None:
                    result[key] = 'N/A'
                elif isinstance(value, str):
                    if value.strip().lower() in ['na', 'n/a', 'none', 'null', 'unknown', '']:
                        result[key] = 'N/A'
            
            return result
        else:
            return default_values.copy()
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON or non-string input: fall back to a fresh copy of the defaults
        return default_values.copy()

print("✅ parse_llm_output() - Enhanced 15-field parser")
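
The regex in `parse_llm_output` matches braces nested at most one level deep, so a reply with deeper nesting would be cut at the wrong brace. If that ever shows up in practice, one alternative is to let the JSON decoder find the end of the object. The helper below is a sketch of that idea using the standard library's `json.JSONDecoder.raw_decode`; `extract_first_json_object` is a hypothetical name, not something the notebook defines.

```python
import json


def extract_first_json_object(text: str):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)  # try the next opening brace
    return None


reply = ('Sure! Here is the JSON:\n'
         '{"brand_name": "Swad", "specs": {"size": {"value": 2, "unit": "lb"}}}\n'
         'Done.')
print(extract_first_json_object(reply))
```

The result could then be merged with the defaults exactly as the existing parser does.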

In [None]:
# ===============================
# 🔹 Function 2: vLLM Batch Extraction (ULTRA-FAST)
# ===============================

def extract_with_vllm_batch(raw_catalog_contents, sample_ids):
    """
    ULTRA-FAST batch processing with vLLM.
    vLLM handles internal continuous batching automatically.
    
    Args:
        raw_catalog_contents: List of raw catalog content strings
        sample_ids: List of sample IDs
    
    Returns:
        List of extracted feature dictionaries
    """
    # Create prompts for entire batch
    prompts = [create_extraction_prompt(raw_content) for raw_content in raw_catalog_contents]
    
    # vLLM does continuous batching internally - just pass all prompts!
    # This is MUCH faster than HuggingFace's sequential approach
    outputs = llm.generate(prompts, sampling_params)
    
    # Parse all outputs
    results = []
    for i, output in enumerate(outputs):
        generated_text = output.outputs[0].text
        parsed_result = parse_llm_output(generated_text)
        parsed_result['sample_id'] = sample_ids[i]
        results.append(parsed_result)
    
    return results


def process_batch(batch_df):
    """
    Process a batch of products with vLLM.
    """
    # Extract raw catalog content
    raw_contents = []
    sample_ids = []
    
    for idx, row in batch_df.iterrows():
        # Use catalog_content as-is, or combine available fields
        if 'catalog_content' in row and pd.notna(row['catalog_content']):
            raw_contents.append(str(row['catalog_content']))
        else:
            # Fallback: create raw-like content from available fields
            raw = f"Item Name: {row.get('item_name', 'N/A')}\n"
            if 'bullet_points_text' in row and pd.notna(row['bullet_points_text']):
                raw += f"Details: {row['bullet_points_text']}\n"
            if 'product_description' in row and pd.notna(row['product_description']):
                raw += f"Description: {row['product_description']}\n"
            raw_contents.append(raw)
        
        sample_ids.append(row.get('sample_id', idx))
    
    # vLLM batch extraction
    extracted_batch = extract_with_vllm_batch(raw_contents, sample_ids)
    
    # Convert to DataFrame
    result_df = pd.DataFrame(extracted_batch)
    
    return result_df


def save_checkpoint(processed_df, batch_num):
    """Save checkpoint to resume processing later."""
    checkpoint_data = {
        'batch_num': batch_num,
        'rows_processed': len(processed_df)
    }
    
    with open(CHECKPOINT_FILE, 'w') as f:
        json.dump(checkpoint_data, f)
    
    # Save partial results
    processed_df.to_csv(OUTPUT_CSV, index=False)


def load_checkpoint():
    """Load checkpoint if exists."""
    try:
        with open(CHECKPOINT_FILE, 'r') as f:
            return json.load(f)
    except FileNotFoundError:
        return None

print("✅ vLLM Batch processing implemented!")
print("   ⚡ Continuous batching - processes sequences as they complete")
print("   🚀 PagedAttention - efficient KV cache management")
print("   💪 Expected throughput: 20-50 samples/sec on A100 80GB")
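
`save_checkpoint` writes the JSON file in place, so an interruption in the middle of the write could leave a truncated checkpoint that `load_checkpoint` cannot parse. A possible hardening, shown here as a sketch with hypothetical helper names, is to write to a temporary file and swap it in with `os.replace`, and to treat an unreadable checkpoint the same as a missing one.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(path, batch_num, rows_processed):
    """Write the checkpoint to a temp file, then atomically replace the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"batch_num": batch_num, "rows_processed": rows_processed}, f)
        os.replace(tmp_path, path)  # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint_safe(path):
    """Return the checkpoint dict, or None if it is missing or corrupt."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "checkpoint.json")
    print(load_checkpoint_safe(ckpt))          # nothing saved yet
    save_checkpoint_atomic(ckpt, batch_num=10, rows_processed=10_000)
    print(load_checkpoint_safe(ckpt))
```

The same pattern could be applied to the partial results CSV, which is larger and therefore more exposed to interrupted writes.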

---
## 🧪 **Test vLLM on Sample Data**
---

In [None]:
# ===============================
# 🧪 Step 5: Test vLLM Extraction
# ===============================
import time

print("Testing vLLM extraction on sample data...\n")

# Test cases with RAW catalog content
test_cases = [
    """Item Name: Swad Organic White Kidney Beans 2lb (Pack of 2)
Bullet Point 1: Premium quality organic beans
Bullet Point 2: Rich in protein and fiber
Bullet Point 3: USDA certified organic
Product Description: High-quality white kidney beans perfect for soups and salads. Sourced from certified organic farms in India.
Value: 2 pound
Unit: pound
Item Type Keyword: beans, legumes""",
    
    """Item Name: Jiva USDA Organic Extra Virgin Olive Oil 1 Liter
Bullet Point 1: Cold-pressed premium olive oil
Bullet Point 2: Non-GMO, gluten-free
Bullet Point 3: Rich in antioxidants
Product Description: Premium organic olive oil from Mediterranean olives. Perfect for cooking and salads. Bottled in glass to preserve freshness.
Value: 1000 millilitre
Unit: millilitre
Packaging: Glass Bottle""",
    
    """Item Name: Great Value Semi-Sweet Chocolate Chips 12oz (Pack of 6)
Bullet Point 1: Perfect for baking cookies and desserts
Bullet Point 2: Rich chocolate flavor
Bullet Point 3: Economy pack
Product Description: Affordable chocolate chips in convenient chip format. Great for everyday baking needs.
Value: 12 ounce
Unit: ounce
Pack Count: 6"""
]

print("="*70)
print("⚡ Testing vLLM Performance")
print("="*70)

# Test batch processing speed
sample_ids = [f"test_{i}" for i in range(len(test_cases))]
start_time = time.time()
results = extract_with_vllm_batch(test_cases, sample_ids)
elapsed = time.time() - start_time

print(f"\n⏱️  Processed {len(test_cases)} items in {elapsed:.2f} seconds")
print(f"   Throughput: {len(test_cases) / elapsed:.2f} samples/sec")

for i, result in enumerate(results, 1):
    print(f"\n{'='*70}")
    print(f"TEST CASE {i}")
    print("="*70)
    print(f"📦 Extracted Features:")
    print("-"*70)
    for key, value in result.items():
        if key != 'sample_id':
            print(f"  {key:25s}: {value}")
    print("="*70)

print("\n✅ vLLM extraction test complete!")
print("\n💡 If results look good, proceed to process the full 140K dataset!")

---
## 📊 **Load & Process Full Dataset**
---

In [None]:
# ===============================
# 📂 Step 6: Load Data
# ===============================
print("Loading data...\n")

# Load training data
train = pd.read_csv(INPUT_CSV)

print(f"✓ Dataset loaded: {train.shape[0]:,} rows × {train.shape[1]} columns")
print(f"✓ Columns: {train.columns.tolist()}")

# Check if catalog_content exists
if 'catalog_content' in train.columns:
    print(f"\n✅ 'catalog_content' column found - using RAW data")
    print(f"   Sample raw content (first 200 chars):")
    print("-"*70)
    # str() guards against a missing (NaN) value in the first row
    print(str(train['catalog_content'].iloc[0])[:200] + "...")
    print("-"*70)
else:
    print(f"\n⚠️  No 'catalog_content' column - will use available columns")

# Sample data if specified
if SAMPLE_SIZE is not None:
    train = train.head(SAMPLE_SIZE)
    print(f"\n⚠️  Processing sample of {SAMPLE_SIZE} rows for testing")
else:
    print(f"\n🚀 Processing ALL {len(train):,} rows with vLLM")

print(f"\n📊 Dataset ready for vLLM batch processing!")
train.head(3)

In [None]:
# ===============================
# 🚀 Step 7: Process All Data with vLLM (ULTRA-FAST)
# ===============================

print("\n" + "="*70)
print("🚀 STARTING ULTRA-FAST vLLM BATCH PROCESSING")
print("="*70)

# Check for checkpoint
start_batch = 0
processed_results = []

if RESUME_FROM_CHECKPOINT and USE_CHECKPOINTS:
    checkpoint = load_checkpoint()
    if checkpoint:
        start_batch = checkpoint['batch_num']
        print(f"\n📌 Resuming from checkpoint: Batch {start_batch}")
        print(f"   Already processed: {checkpoint['rows_processed']} rows")
        
        # Load partial results
        try:
            # keep_default_na=False keeps literal 'N/A' strings instead of turning them into NaN
            processed_df = pd.read_csv(OUTPUT_CSV, keep_default_na=False)
            processed_results = [processed_df]
            # Note: this row offset assumes no earlier batch was skipped because of an error
            train = train.iloc[checkpoint['rows_processed']:].reset_index(drop=True)
        except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError):
            print("   ⚠️ Could not load partial results, starting fresh")

# Calculate batches
total_rows = len(train)
num_batches = (total_rows + BATCH_SIZE - 1) // BATCH_SIZE

print(f"\n📊 Processing Plan:")
print(f"   Total rows: {total_rows:,}")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Number of batches: {num_batches}")
print(f"   Max parallel sequences: {MAX_NUM_SEQS}")

# Estimate time
estimated_throughput = 30  # Assumed throughput for A100 (samples/sec); expected range is 20-50
estimated_time_sec = total_rows / estimated_throughput
estimated_time_min = estimated_time_sec / 60
estimated_time_hr = estimated_time_min / 60

print(f"\n⏱️  Estimated time:")
print(f"   @ {estimated_throughput} samples/sec: {estimated_time_hr:.1f} hours ({estimated_time_min:.0f} min)")
print(f"\n⚡ vLLM is 5-10x FASTER than HuggingFace Transformers!")

print(f"\n⏳ Starting extraction...\n")

# Track overall timing
import time
start_time = time.time()
total_processed = 0

# Process in batches
for batch_idx in tqdm(range(num_batches), desc="Processing batches", unit="batch"):
    batch_start_time = time.time()
    
    start_idx = batch_idx * BATCH_SIZE
    end_idx = min(start_idx + BATCH_SIZE, total_rows)
    
    batch_df = train.iloc[start_idx:end_idx]
    
    try:
        # vLLM batch processing
        batch_results = process_batch(batch_df)
        processed_results.append(batch_results)
        
        # Update progress
        total_processed += len(batch_df)
        batch_time = time.time() - batch_start_time
        throughput = len(batch_df) / batch_time
        
        # Show progress every 5 batches
        if (batch_idx + 1) % 5 == 0:
            elapsed = time.time() - start_time
            remaining = total_rows - total_processed
            eta_sec = remaining / (total_processed / elapsed) if total_processed > 0 else 0
            eta_min = eta_sec / 60
            
            print(f"\n📊 Progress: {total_processed:,}/{total_rows:,} rows ({100*total_processed/total_rows:.1f}%)")
            print(f"   Throughput: {throughput:.1f} samples/sec")
            print(f"   ETA: {eta_min:.1f} minutes")
        
    except Exception as e:
        print(f"\n⚠️ Error in batch {batch_idx}: {e}")
        print("   Continuing with next batch...")
        continue
    
    # Save checkpoint periodically
    if USE_CHECKPOINTS and (batch_idx + 1) % CHECKPOINT_INTERVAL == 0:
        combined_df = pd.concat(processed_results, ignore_index=True)
        save_checkpoint(combined_df, batch_idx + 1)
        print(f"\n💾 Checkpoint saved: {len(combined_df):,} rows processed")

# Combine all results
final_df = pd.concat(processed_results, ignore_index=True)

# Calculate final stats (count rows actually processed in this run, not rows attempted)
total_time = time.time() - start_time
final_throughput = total_processed / total_time

print("\n" + "="*70)
print("✅ vLLM BATCH PROCESSING COMPLETE!")
print("="*70)
print(f"\n📊 Results:")
print(f"   Processed: {len(final_df):,} rows")
print(f"   Total time: {total_time/60:.1f} minutes ({total_time/3600:.2f} hours)")
print(f"   Throughput: {final_throughput:.2f} samples/sec")
print(f"   Extracted features: {len(final_df.columns)} columns")
print(f"\n🚀 vLLM is the FASTEST way to do LLM inference!")
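
The loop above relies on two small pieces of arithmetic: ceiling division to split rows into batches, and a running-rate ETA. Pulled out as standalone functions (hypothetical names, a sketch rather than the notebook's own code), they are easy to check in isolation. The `start_row` argument shows how a resume offset could be expressed in rows rather than in batch numbers.

```python
def batch_ranges(total_rows, batch_size, start_row=0):
    """Yield half-open (start, end) row ranges covering start_row..total_rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(start_row, total_rows, batch_size):
        yield start, min(start + batch_size, total_rows)


def eta_seconds(processed, total, elapsed_sec):
    """Remaining time at the average rate so far, or None if no rate is known yet."""
    if processed <= 0 or elapsed_sec <= 0:
        return None
    rate = processed / elapsed_sec  # rows per second
    return (total - processed) / rate


print(list(batch_ranges(2500, 1000)))                  # last batch is shorter
print(list(batch_ranges(2500, 1000, start_row=2000)))  # resuming after 2000 rows
print(eta_seconds(processed=1000, total=4000, elapsed_sec=50.0))
```

Tracking progress in rows keeps the resume point valid even if `BATCH_SIZE` changes between runs.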

In [None]:
# ===============================
# 💾 Step 8: Save Results
# ===============================
print("\n💾 Saving final results...\n")

# Ensure sample_id exists
if 'sample_id' not in final_df.columns:
    final_df['sample_id'] = range(len(final_df))

# Define column order for output
output_columns = [
    'sample_id',
    'product_name',
    'brand_name',
    'product_type',
    'category',
    'quantity',
    'quantity_unit',
    'amount_packs',
    'value',
    'unit',
    'packaging_type',
    'country_of_origin',
    'use_case',
    'shelf_life',
    'sentiment_quality',
    'summarized_description'
]

# Keep only existing columns
final_columns = [col for col in output_columns if col in final_df.columns]
final_df_ordered = final_df[final_columns]

# Replace 'N/A' with empty string for CSV
final_df_csv = final_df_ordered.replace('N/A', '')

# Save to CSV
final_df_csv.to_csv(OUTPUT_CSV, index=False)

print(f"✅ Results saved to: {OUTPUT_CSV}")
print(f"   Shape: {final_df_csv.shape}")
print(f"   Columns: {list(final_df_csv.columns)}")

# Show data quality stats
print(f"\n📊 Data Quality:")
for col in final_df_ordered.columns:
    if col != 'sample_id':
        na_count = (final_df_ordered[col] == 'N/A').sum()
        na_pct = 100 * na_count / len(final_df_ordered)
        filled_pct = 100 - na_pct
        print(f"   {col:25s}: {filled_pct:5.1f}% filled ({na_count:,} N/A)")

# Clean up checkpoint file
if os.path.exists(CHECKPOINT_FILE):
    os.remove(CHECKPOINT_FILE)
    print(f"\n🗑️  Checkpoint file removed (processing complete)")

print("\n" + "="*70)
print("🎉 ALL DONE! CSV saved successfully")
print("="*70)

---
## 📊 **Performance Analysis**
---

In [None]:
# ===============================
# 📊 Step 9: Analyze Results
# ===============================
print("\n" + "="*70)
print("📊 COMPREHENSIVE EXTRACTION ANALYSIS")
print("="*70)

# Load the saved CSV
# keep_default_na=False reads empty cells as '' (not NaN), so the '' checks below work
analysis_df = pd.read_csv(OUTPUT_CSV, keep_default_na=False)

print(f"\n📋 Dataset Overview:")
print(f"   Total rows: {len(analysis_df):,}")
print(f"   Total columns: {len(analysis_df.columns)}")

# Analyze each feature
print(f"\n🏷️  BRAND NAMES:")
brand_counts = analysis_df['brand_name'].replace('', 'N/A').value_counts()
print(f"   Unique brands: {len(brand_counts)}")
print(f"   Missing/N/A: {(analysis_df['brand_name'] == '').sum()}")
print(f"   Top 10:\n{brand_counts.head(10)}")

print(f"\n📦 PRODUCT TYPES:")
type_counts = analysis_df['product_type'].replace('', 'N/A').value_counts()
print(f"   Unique types: {len(type_counts)}")
print(f"   Top 10:\n{type_counts.head(10)}")

print(f"\n🏪 CATEGORIES:")
category_counts = analysis_df['category'].replace('', 'unknown').value_counts()
print(f"   Distribution:\n{category_counts}")

print(f"\n📦 PACKAGING TYPES:")
packaging_counts = analysis_df['packaging_type'].replace('', 'N/A').value_counts()
print(f"   Distribution:\n{packaging_counts.head(10)}")

# Sample extractions
print(f"\n📝 SAMPLE EXTRACTIONS:")
print("="*70)
for idx in [0, len(analysis_df)//4, len(analysis_df)//2, 3*len(analysis_df)//4]:
    if idx < len(analysis_df):
        row = analysis_df.iloc[idx]
        print(f"\nSample {idx}:")
        print(f"  Product: {row['product_name']}")
        print(f"  Brand: {row['brand_name']}")
        print(f"  Type: {row['product_type']} | Category: {row['category']}")
        print(f"  Quantity: {row['quantity']} {row['quantity_unit']} (Pack: {row['amount_packs']})")
        print(f"  Packaging: {row['packaging_type']}")
        print(f"  Description: {str(row['summarized_description'])[:100]}...")
        print("-"*70)

print("\n✅ Analysis complete!")

---
## 🚀 **vLLM Optimization Guide**

### ⚡ **Why vLLM is MUCH Faster:**

1. **PagedAttention** - Efficient KV cache management (like virtual memory)
2. **Continuous Batching** - New requests processed immediately (no waiting)
3. **CUDA Graphs** - Reduced kernel launch overhead
4. **Optimized Kernels** - Hand-tuned CUDA kernels for attention
5. **Dynamic Batching** - Automatically groups requests for efficiency
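
To make the "like virtual memory" comparison concrete, here is a toy block allocator. It is an illustration of the paging idea only, not vLLM's implementation: the class name is hypothetical and the block size of 16 tokens is an assumed value. The point is that each sequence gets cache blocks on demand instead of a full-context reservation up front.

```python
class PagedKVCache:
    """Toy allocator: each sequence maps to a list of fixed-size cache blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("a")   # 20 tokens need 2 blocks
for _ in range(16):
    cache.append_token("b")   # 16 tokens need exactly 1 block
print(cache.block_tables, "free:", len(cache.free_blocks))
cache.free("a")
print("free after releasing 'a':", len(cache.free_blocks))
```

Waste is bounded by one partially filled block per sequence, which is why more sequences fit in the same memory than with fixed per-sequence allocation.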

### 📊 **Performance Comparison:**

| Feature | HuggingFace | vLLM |
|---------|-------------|------|
| **Throughput** | 3-5 samples/sec | 20-50 samples/sec |
| **140K rows** | 8-12 hours | 1-2 hours |
| **GPU Utilization** | 60-70% | 85-95% |
| **Batching** | Static | Continuous |
| **Memory** | Fixed allocation | PagedAttention |
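
The "Static" versus "Continuous" row can be illustrated with a small simulation. This is a simplified model under stated assumptions: every sequence produces one token per step, each batch slot holds one sequence, and prefill cost is ignored. It counts decoding steps only and is not a measurement of either library.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """A freed slot is refilled from the queue before the next step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active sequence emits one token; those with none left drop out.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


output_lengths = [10, 2, 2, 2, 10, 2, 2, 2]  # made-up generation lengths
print("static:    ", static_batching_steps(output_lengths, slots=2))
print("continuous:", continuous_batching_steps(output_lengths, slots=2))
```

In this made-up workload the static schedule needs 24 steps and the continuous one 16, because short sequences no longer wait for the long one sharing their batch. The gap grows with the variance in output length.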

### 🎯 **Tuning for A100 80GB:**

**For Maximum Speed:**
```python
MAX_NUM_SEQS = 512  # More parallel sequences
MAX_NUM_BATCHED_TOKENS = 16384  # Larger batches
GPU_MEMORY_UTILIZATION = 0.95  # Use more GPU memory
BATCH_SIZE = 2000  # Larger input batches
```

**For Safety (if OOM):**
```python
MAX_NUM_SEQS = 128
MAX_NUM_BATCHED_TOKENS = 4096
GPU_MEMORY_UTILIZATION = 0.85
BATCH_SIZE = 500
```
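
Whether a given `MAX_NUM_SEQS` can fit comes down to KV cache arithmetic. The sketch below estimates the worst case, where every sequence fills the whole context. The architecture numbers (28 layers, 4 KV heads, head dimension 128) are assumptions for Qwen2.5-7B-Instruct and should be checked against the model's `config.json`. Model weights and activations need additional memory on top of this.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token; the leading 2 covers one key and one value."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(num_seqs, tokens_per_seq, bytes_per_token):
    """Worst-case KV cache size if every sequence reaches tokens_per_seq."""
    return num_seqs * tokens_per_seq * bytes_per_token / 1024**3


# Assumed architecture values (verify against the model's config.json).
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")

for num_seqs in (128, 256, 512):
    gib = kv_cache_gib(num_seqs, tokens_per_seq=2048, bytes_per_token=per_token)
    print(f"MAX_NUM_SEQS={num_seqs:3d} at 2048 tokens each -> {gib:.0f} GiB")
```

Under these assumptions 256 full-length sequences need about 28 GiB and 512 need about 56 GiB. That makes the 512 setting tight once FP16 weights are loaded, unless most sequences stay well below the context limit.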

### 💡 **Pro Tips:**

1. **Monitor GPU:** `watch -n 1 nvidia-smi`
2. **Adjust MAX_NUM_SEQS:** Higher = more throughput (but more VRAM)
3. **Use FP16:** Already enabled (2x faster than FP32)
4. **Tensor Parallel:** Set to 2-4 for multi-GPU setups
5. **Profile:** Use `nsys profile` to find bottlenecks

### 🔧 **Troubleshooting:**

**OOM Error:**
- Reduce `MAX_NUM_SEQS` to 64-128
- Reduce `MAX_MODEL_LEN` to 1024
- Reduce `GPU_MEMORY_UTILIZATION` to 0.80

**Slow Performance:**
- Increase `MAX_NUM_SEQS` to 256-512
- Increase `MAX_NUM_BATCHED_TOKENS` to 8192-16384
- Check GPU utilization with `nvidia-smi`

### 📈 **Expected Performance on A100 80GB:**

| Batch Size | Throughput | 140K Rows |
|------------|------------|-----------|
| 500 | ~25 samples/sec | ~1.5 hours |
| 1000 | ~35 samples/sec | ~1.1 hours |
| 2000 | ~45 samples/sec | ~0.9 hours |

---
## 🎯 **Next Steps**

1. ✅ Test with small sample (100-1000 rows)
2. ✅ Monitor GPU usage and adjust settings
3. ✅ Run on full 140K dataset
4. ✅ Merge with other features for ML training

**Key Advantages:**
- ⚡ 5-10x faster than HuggingFace
- 🚀 Continuous batching (no waiting)
- 💪 Better GPU utilization
- 📊 Same quality extractions
- 💾 Checkpoint system for safety

---