## Market Comparisons

This exercise builds on the ideas we explore in the class "Cluster Analysis for Feature Engineering". Our goal is to understand how building a separate pricing model for each location, for a specific car make/model, changes the accuracy and coverage of the models.

1. Pick a car make and model.
1. For each location, build a pricing model for just that location. 
1. Calculate errors and coverage for your family of pricing models.
1. Compare the coverage and errors of the local models to those of a single pricing model fit across the entire country.
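
The steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the course solution: the `toy` table, the helpers `fit_local_family` and `family_coverage`, and the `MIN_ROWS` threshold are all hypothetical names introduced here. It also assumes "coverage" means the share of listings whose location has enough rows to support its own model; your definition may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the listings table (NOT the course data).
rng = np.random.default_rng(0)
sizes = {"loc_a": 200, "loc_b": 120, "loc_c": 8}  # loc_c is too small to model
frames = []
for loc, n in sizes.items():
    age = rng.integers(1, 15, size=n)
    price = 25000 - 1200 * age + rng.normal(0, 1500, size=n)
    frames.append(pd.DataFrame({"location": loc, "age": age, "price": price}))
toy = pd.concat(frames, ignore_index=True)

MIN_ROWS = 30  # hypothetical threshold; tune it for your data


def fit_local_family(data, features, target, min_rows=MIN_ROWS):
    """Fit one LinearRegression per location with at least min_rows listings."""
    models = {}
    for loc, group in data.groupby("location"):
        if len(group) >= min_rows:
            models[loc] = LinearRegression().fit(group[features], group[target])
    return models


def family_coverage(data, models):
    """Share of listings whose location has a fitted model (assumed definition)."""
    return data["location"].isin(list(models)).mean()


family = fit_local_family(toy, ["age"], "price")
coverage = family_coverage(toy, family)

# In-sample MAE per modeled location (scored on the rows each model was fit to).
errors = {}
for loc, fitted in family.items():
    rows = toy[toy["location"] == loc]
    errors[loc] = mean_absolute_error(rows["price"], fitted.predict(rows[["age"]]))

print(sorted(family), f"coverage={coverage:.3f}")
print({loc: round(err, 1) for loc, err in errors.items()})
```

A single national model covers every listing by construction, so the comparison in step 4 is a trade-off: the family of local models may fit each market more closely but leaves small markets without a model.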

**Note to UST Faculty**: Typically we would connect to GBQ for the data, but to ensure access the data can be reached via [this link](dropbox transfer TK). Download this zip and extract the contents into a folder called `data/` within the repository. 

---

### Exercise Boilerplate

#### Purpose of Exercises
Exercises are a critical part of this class. While lectures and readings introduce concepts, exercises help you:
- Develop practical implementation skills
- Understand common pitfalls and debugging strategies
- Build intuition through experimentation
- Create a portfolio of working examples
- Practice real-world data analysis workflows

#### Using These Notebooks
- **Dive Right In**: These exercises often reveal unexpected challenges
- **Work Incrementally**: Test each step before moving forward
- **Ask Questions**: Use class Teams for help, ask your instructor, ask classmates
- **Compare Solutions**: Solutions are available to you in this folder
- **Save Your Work**: Commit working versions to your repository

#### Using AI Assistants
AI coding assistants (ChatGPT, Claude, GitHub Copilot, etc.) are powerful tools that you'll use in your career. In this class:
- ✅ Use AI to understand code snippets
- ✅ Use AI to debug errors
- ✅ Use AI to explore alternative approaches
- ✅ Use AI to explain concepts
- ❌ Don't just paste the whole exercise
- ❌ Don't submit AI-generated code without understanding it

Document your AI interactions in a comment block and include a link to your chat:
```python
# AI Interaction Log:
# 1. Asked Claude to explain the difference between train_test_split and TimeSeriesSplit
# 2. Used GitHub Copilot to help write data validation functions
# 3. Had ChatGPT debug a pandas groupby error
# 4. Chat logs are available here: https://chatgpt.com/c/671d1f08-1ebc-8011-a128-8a29255f24fe
```

#### Evaluation
Exercises are not evaluated by your instructor. They are for your learning.


### Setup
Load the libraries required for our analysis. We'll use pandas for data handling, scikit-learn for modeling, and matplotlib and seaborn for visualization.

---

In [1]:
# Imports for this exercise
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
# root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score


# For my solution
from IPython.display import display, Markdown


### Helper Functions
These functions handle the core operations we'll need: loading data, creating features, filtering data, and evaluating models.

In [2]:
# Load and prepare data
def load_data():
    """Load car listing data and add an `age` column (no rows are filtered here)"""
    listings = pd.read_csv('../data/processed_listing_pages.csv')
    # Age is measured relative to 2024, the reference year used in this exercise
    listings['age'] = 2024 - listings['year']
    return listings

# Get the model features 
def get_features(data):
    """Create feature matrix for modeling"""
    return pd.DataFrame({
        'age': data['age'],
        'log_odometer': data['log_odometer'],
        'age_miles': data['age'] * data['log_odometer'],
        'condition_mapped': data['condition_mapped']
    })

def filter_data(data, make, model, location=None):
    """Filter data for specific make/model and optional location"""
    mask = (data['make'] == make) & (data['model'] == model)
    if location:
        mask = mask & (data['location'] == location)
    return data[mask].copy()

def create_model(X, y):
    """Create and fit a linear regression model"""
    model = LinearRegression()
    model.fit(X, y)
    return model

def evaluate_model(model, X, y):
    """Calculate model performance metrics"""
    pred = model.predict(X)
    return {
        'rmse': root_mean_squared_error(y, pred),
        'mae': mean_absolute_error(y, pred),
        'r2': r2_score(y, pred)
    }

# For my solution
# Function to create a markdown table from metrics
def create_markdown_table(metrics, title):
    """Render a dict of metric name -> value as a markdown table under a heading"""
    table = f"### {title}\n\n"
    table += "| Metric | Value |\n"
    table += "|--------|-------|\n"
    for key, value in metrics.items():
        table += f"| {key.upper()} | {value:.2f} |\n"
    return table



### Load Data
First, let's load our car listing data.

In [3]:
listing_data = load_data()

In [4]:
# Select your make, model, and location

make = 'nissan'
model_name = 'altima'
location = 'sfbay'

### Build National Model
Create a baseline model using all national data for this make/model.

In [5]:
national_data = filter_data(listing_data, make, model_name)
X_national = get_features(national_data)
y_national = national_data['price']
national_model = create_model(X_national, y_national)

### Build Local Model
Create a model specific to our chosen location to compare against the national model.

In [6]:
local_data = filter_data(listing_data, make, model_name, location)
X_local = get_features(local_data)
y_local = local_data['price']
local_model = create_model(X_local, y_local)

### Compare Models
Evaluate both models on the local listings, then calculate how much the local model improves on the national model as a percentage reduction in RMSE and MAE.

In [7]:
national_metrics = evaluate_model(national_model, X_local, y_local)
local_metrics = evaluate_model(local_model, X_local, y_local)
# Percentage reduction in error when moving from the national to the local model
rmse_improvement = 100 * (national_metrics['rmse'] - local_metrics['rmse']) / national_metrics['rmse']
mae_improvement = 100 * (national_metrics['mae'] - local_metrics['mae']) / national_metrics['mae']

print(f"Location: {location}")
print(f"Sample size: {len(local_data)}")
print(f"RMSE Improvement: {rmse_improvement:.1f}%")
print(f"MAE Improvement: {mae_improvement:.1f}%")

Location: sfbay
Sample size: 631
RMSE Improvement: 10.3%
MAE Improvement: 12.6%
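
One caveat about the comparison above: both models are scored on the same local rows that the local model was fit to. Least squares minimizes squared error on its training rows, so with identical features the local model cannot have a higher in-sample RMSE on those rows than the national model, and part of the improvement is expected by construction. A fairer check holds out some local listings that neither model sees during fitting. The sketch below illustrates the idea on synthetic data; the `toy` table, its column values, and the 25% split are assumptions made for illustration and are not part of the exercise data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic listings (NOT the course data): one market is priced higher.
rng = np.random.default_rng(42)
n = 2000
toy = pd.DataFrame({
    "location": rng.choice(["here", "elsewhere"], size=n, p=[0.3, 0.7]),
    "age": rng.integers(1, 15, size=n),
})
premium = np.where(toy["location"] == "here", 3000, 0)
toy["price"] = 25000 - 1200 * toy["age"] + premium + rng.normal(0, 1500, size=n)

# Hold out 25% of the local rows; neither model sees them during fitting.
local = toy[toy["location"] == "here"]
local_train, local_test = train_test_split(local, test_size=0.25, random_state=0)
national_train = toy.drop(index=local_test.index)

features = ["age"]
national_fit = LinearRegression().fit(national_train[features], national_train["price"])
local_fit = LinearRegression().fit(local_train[features], local_train["price"])

# Score both models on the same held-out local listings.
national_mae = mean_absolute_error(
    local_test["price"], national_fit.predict(local_test[features])
)
local_mae = mean_absolute_error(
    local_test["price"], local_fit.predict(local_test[features])
)
improvement = 100 * (national_mae - local_mae) / national_mae
print(f"Held-out MAE improvement: {improvement:.1f}%")
```

In this toy setup the local market carries a built-in price premium, so the local model should still win on held-out rows. On real data the held-out gap may be smaller than the in-sample one, especially for locations with few listings.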


In [8]:

national_table = create_markdown_table(national_metrics, "National Metrics")
local_table = create_markdown_table(local_metrics, "Local Metrics")

# Display markdown tables
display(Markdown(national_table))
display(Markdown(local_table))


### National Metrics

| Metric | Value |
|--------|-------|
| RMSE | 1416.95 |
| MAE | 1025.28 |
| R2 | 0.82 |


### Local Metrics

| Metric | Value |
|--------|-------|
| RMSE | 1271.07 |
| MAE | 896.02 |
| R2 | 0.86 |
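
As a quick sanity check, the improvement percentages printed earlier can be recomputed from the two tables above. The snippet copies the displayed (rounded) values by hand, so it is only an arithmetic check on the reported numbers, not a rerun of the models.

```python
# Values copied from the National Metrics and Local Metrics tables
national_rmse, local_rmse = 1416.95, 1271.07
national_mae, local_mae = 1025.28, 896.02

# Percentage reduction in error relative to the national model
rmse_gain = 100 * (national_rmse - local_rmse) / national_rmse
mae_gain = 100 * (national_mae - local_mae) / national_mae

print(f"RMSE Improvement: {rmse_gain:.1f}%")
print(f"MAE Improvement: {mae_gain:.1f}%")
```

Both figures agree with the printed output to one decimal place (10.3% and 12.6%).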
