# ProductionCarMatcher Demo

## Overview

The `ProductionCarMatcher` class standardizes car make and model names when cleaning real-world data. It handles the typos, abbreviations, and inconsistent naming conventions common in automotive datasets.

## Key Features

- **Fuzzy Matching**: Handles typos and variations in car names
- **Brand Constraints**: Prevents cross-brand model confusion
- **Numeric Intelligence**: Enhanced matching for models with numbers (e.g., "BMW 3 Series")
- **Fallback Strategies**: Multiple approaches to find best matches
- **Batch Processing**: Efficient cleaning of entire datasets
- **Confidence Scoring**: Quality assessment for each match
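
To make the first two features and the confidence score concrete, here is a minimal sketch of brand-constrained fuzzy matching. It is not the `ProductionCarMatcher` implementation: it uses Python's standard `difflib` as a stand-in, and the `CATALOGUE` dictionary, the `best_match` helper, and the `0.6` cutoff are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical toy catalogue; the real matcher uses its own optimized database.
CATALOGUE = {
    "bmw": ["1 series", "3 series", "x1", "x5"],
    "audi": ["a1", "a3", "a4", "q5"],
    "toyota": ["corolla", "camry", "yaris"],
}


def best_match(raw, candidates, cutoff=0.6):
    """Return (candidate, score) for the closest candidate, or (None, 0.0)."""
    query = raw.strip().lower()
    hits = get_close_matches(query, candidates, n=1, cutoff=cutoff)
    if not hits:
        return None, 0.0
    # The similarity ratio (0.0 to 1.0) doubles as a simple confidence score.
    return hits[0], SequenceMatcher(None, query, hits[0]).ratio()


# Step 1: resolve the make against the known brands.
make, make_score = best_match("Toyotta", list(CATALOGUE))

# Step 2: resolve the model, searching only within that make's models.
model, model_score = best_match("Corola", CATALOGUE.get(make, []))

print(make, round(make_score, 2), model, round(model_score, 2))

# The brand constraint in action: an Audi model name finds no match under BMW.
print(best_match("a3", CATALOGUE["bmw"]))
```

Restricting the model search to the resolved make's list is what keeps a string such as "a3" from being matched to another brand's model, at the cost of depending on the make being resolved correctly first.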

## What We'll Demonstrate

1. Database setup and initialization
2. Individual make/model matching
3. Batch dataset cleaning
4. Edge case handling
5. Performance analysis

In [1]:
# Essential imports
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

# Load the ProductionCarMatcher
os.chdir('/Users/leonardodicaterina/Documents/GitHub/ML_group_45')
from utils.preprocessing.CarDatabase import ProductionCarMatcher, create_optimized_database

# Create sample car database (in production, this would be loaded from a comprehensive dataset)
sample_car_data = [
    {'Make': 'BMW', 'Models': ['1 Series', '2 Series', '3 Series', '4 Series', '5 Series', 'X1', 'X3', 'X5', 'i3', 'i8']},
    {'Make': 'Mercedes-Benz', 'Models': ['A-Class', 'C-Class', 'E-Class', 'S-Class', 'GLA', 'GLC', 'GLE', 'GLS']},
    {'Make': 'Audi', 'Models': ['A1', 'A3', 'A4', 'A6', 'A8', 'Q2', 'Q3', 'Q5', 'Q7', 'TT', 'R8']},
    {'Make': 'Toyota', 'Models': ['Corolla', 'Camry', 'Prius', 'RAV4', 'Highlander', 'Yaris', 'C-HR']},
    {'Make': 'Volkswagen', 'Models': ['Golf', 'Passat', 'Polo', 'Tiguan', 'Touareg', 'ID.3', 'ID.4']},
    {'Make': 'Ford', 'Models': ['Fiesta', 'Focus', 'Mondeo', 'Kuga', 'Explorer', 'Mustang']},
    {'Make': 'Hyundai', 'Models': ['i10', 'i20', 'i30', 'Tucson', 'Santa Fe', 'Kona']},
    {'Make': 'Opel', 'Models': ['Corsa', 'Astra', 'Insignia', 'Mokka', 'Grandland', 'Crossland']}
]

print("Creating optimized car database...")
optimized_db = create_optimized_database(sample_car_data, max_model_words=2)

print("Initializing ProductionCarMatcher...")
matcher = ProductionCarMatcher(optimized_db)

print("\nDatabase Summary:")
print(f"  Total brands: {len(optimized_db)}")
print(f"  Sample brands: {list(optimized_db.keys())[:5]}")

# Show database structure for one brand
print("\nExample - BMW database entry:")
if 'bmw' in optimized_db:
    bmw_data = optimized_db['bmw']
    print(f"  Canonical name: {bmw_data['canonical_name']}")
    print(f"  Aliases: {bmw_data['aliases']}")
    print(f"  Models: {bmw_data['models'][:8]}...")

File path: /Users/leonardodicaterina/.cache/kagglehub/datasets/bourzamraid/global-car-make-and-model-list/versions/1/vehicle models.json
Creating optimized car database...
Analyzing word frequencies...
Created optimized database with 9 makes
Initializing ProductionCarMatcher...

Database Summary:
  Total brands: 9
  Sample brands: ['bmw', 'mercedes-benz', 'audi', 'toyota', 'volkswagen']

Example - BMW database entry:
  Canonical name: bmw
  Aliases: ['beemer', 'bimmer']
  Models: ['1 series', '2 series', '3 series', '4 series', '5 series', '6 series', '7 series', '8 series']...
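
The printed BMW entry shows that each brand record carries a `canonical_name`, a list of `aliases`, and a list of `models`. As a rough sketch of how such a structure could support alias resolution, the snippet below builds a reverse lookup from aliases to canonical names. The `build_alias_index` helper and the Volkswagen `vw` alias are hypothetical and only mirror the shape of the output above; they are not taken from the `CarDatabase` module.

```python
# Hypothetical records shaped like the BMW entry printed above.
db = {
    "bmw": {
        "canonical_name": "bmw",
        "aliases": ["beemer", "bimmer"],
        "models": ["1 series", "3 series"],
    },
    "volkswagen": {
        "canonical_name": "volkswagen",
        "aliases": ["vw"],  # assumed alias, for illustration only
        "models": ["golf", "polo"],
    },
}


def build_alias_index(database):
    """Map every brand key and alias (lowercased) to its canonical name."""
    index = {}
    for key, entry in database.items():
        index[key.lower()] = entry["canonical_name"]
        for alias in entry["aliases"]:
            index[alias.lower()] = entry["canonical_name"]
    return index


alias_index = build_alias_index(db)

# Exact alias hits can be resolved before any fuzzy matching is attempted.
print(alias_index.get("bimmer"))
print(alias_index.get("unknown brand"))
```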

