## Location-Based Market Clustering

This exercise builds on our initial market comparison work. Instead of building individual models for each location, we'll use clustering to group similar markets together and build models for each cluster.

### Our Goals:
1. Understand what makes markets similar
2. Use clustering to group markets effectively
3. Balance model accuracy with market coverage
4. Compare cluster-based models to national models

### Key Steps:
1. Compare locations across different dimensions
2. Choose a representative make/model
3. Cluster locations using appropriate distance metrics
4. Build and evaluate cluster-specific models

**Note to UST Faculty**: We would typically pull the data from GBQ, but to ensure access it is also available via [this link](https://www.dropbox.com/t/H5lPzrmt29Oq1F5V). Download the zip and extract its contents into a folder called `data/` within the repository.

---


### Exercise Boilerplate

#### Purpose of Exercises
Exercises are a critical part of this class. While lectures and readings introduce concepts, exercises help you:
- Develop practical implementation skills
- Understand common pitfalls and debugging strategies
- Build intuition through experimentation
- Create a portfolio of working examples
- Practice real-world data analysis workflows

#### Using These Notebooks
- **Dive Right In**: These exercises often reveal unexpected challenges
- **Work Incrementally**: Test each step before moving forward
- **Ask Questions**: Use class Teams for help, ask your instructor, ask classmates
- **Compare Solutions**: Solutions are available to you in this folder
- **Save Your Work**: Commit working versions to your repository

#### Using AI Assistants
AI coding assistants (ChatGPT, Claude, GitHub Copilot, etc.) are powerful tools that you'll use in your career. In this class:
- ✅ Use AI to understand code snippets
- ✅ Use AI to debug errors
- ✅ Use AI to explore alternative approaches
- ✅ Use AI to explain concepts
- ❌ Don't just paste the whole exercise
- ❌ Don't submit AI-generated code without understanding it

Document your AI interactions in a comment block and include a link to your chat:
```python
# AI Interaction Log:
# 1. Asked Claude to explain the difference between train_test_split and TimeSeriesSplit
# 2. Used GitHub Copilot to help write data validation functions
# 3. Had ChatGPT debug a pandas groupby error
# 4. Chat logs are available here: https://chatgpt.com/c/671d1f08-1ebc-8011-a128-8a29255f24fe
```

#### Evaluation
Exercises are not evaluated by your instructor. They are for your learning.

### Setup
First, we'll load our required libraries and prepare our data sources. We need libraries for data manipulation, clustering, modeling, and visualization.

---

In [12]:
# Core data analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling & clustering
from sklearn.linear_model import LinearRegression
from sklearn_extra.cluster import KMedoids
import gower

# Display utilities
from IPython.display import display, Markdown

### Helper Functions
We'll reuse some functions from our previous exercise and add clustering-specific functionality.
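
As a reference point for the `create_model` and `evaluate_model` stubs, here is a minimal sketch of one way they could look. It is not the official solution: the `_sketch` function names, the synthetic data, and the `mileage` feature are assumptions made for illustration. The metric keys (`rmse`, `mae`, `r2`) mirror the ones `build_cluster_model` reports.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def create_model_sketch(X, y):
    """Fit an ordinary least squares model (one possible create_model)."""
    return LinearRegression().fit(X, y)

def evaluate_model_sketch(model, X, y):
    """Return the metric keys that build_cluster_model reports."""
    pred = model.predict(X)
    return {
        'rmse': float(np.sqrt(mean_squared_error(y, pred))),
        'mae': float(mean_absolute_error(y, pred)),
        'r2': float(r2_score(y, pred)),
    }

# Synthetic listings: price falls with age and mileage (made-up coefficients)
rng = np.random.default_rng(0)
age = rng.uniform(0, 15, size=200)
mileage = rng.uniform(5, 150, size=200)  # thousands of miles, hypothetical feature
price = 30000 - 1200 * age - 50 * mileage + rng.normal(0, 1500, size=200)

X = np.column_stack([age, mileage])
model = create_model_sketch(X, price)
metrics = evaluate_model_sketch(model, X, price)
print(metrics)
```

These metrics are computed in-sample, so they will look optimistic; a held-out split would give a fairer picture.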

In [None]:
# Essentials from previous notebook
def load_data():
    """Load car listing data and market summaries"""
    listing_data = pd.read_csv('../data/processed_listing_pages.csv')
    listing_data['age'] = 2024 - listing_data['year']
    
    # Load location summary data
    loc_data = pd.read_csv('../data/location_summary_data.csv')
    return listing_data, loc_data

def get_features(data):
    """Create feature matrix for modeling"""
    # TODO
    pass

def filter_data(data, make, model, location=None):
    """Filter data for specific make/model and optional location"""
    mask = (data['make'] == make) & (data['model'] == model)
    if location:
        mask = mask & (data['location'] == location)
    return data[mask].copy()

def create_model(X, y):
    """Create and fit a linear regression model"""
    # TODO
    pass

def evaluate_model(model, X, y):
    """Calculate model performance metrics"""
    # TODO
    pass


# TODO: Create a function to compare markets
def compare_markets(loc_data, markets_to_compare):
    """Compare key statistics across different markets
    
    Parameters:
    -----------
    loc_data: DataFrame with market data
    markets_to_compare: list of market names to compare
    
    Returns:
    --------
    DataFrame with market comparisons
    """
    pass  # Your code here
    
# TODO: Create a function to prepare data for clustering
def prepare_cluster_data(loc_data):
    """Prepare location data for clustering
    
    Consider:
    - Which numeric features to include?
    - How to handle categorical data?
    - Should we scale the data?
    
    Returns:
    --------
    DataFrame ready for clustering
    """
    pass  # Your code here

# Helper for naming clusters
def create_cluster_names(loc_data):
    """Create descriptive names for each market cluster
    combining region and dominant state"""
    names = loc_data.groupby('cluster').agg({
        'census_region': lambda x: x.mode()[0],
        'state': lambda x: x.mode()[0]
    }).apply(lambda x: f"{x.name}-{x['census_region']}-{x['state']}", axis=1)
    return names

def build_cluster_model(data, needed_obvs=50):
    """Build model for a cluster if enough data exists"""
    if len(data) < needed_obvs:
        return {
            'n_obs': len(data),
            'rmse': np.nan,
            'mae': np.nan,
            'r2': np.nan,
            'price': np.mean(data['price'])
        }
    
    # TODO: Implement model building, return metrics
    #metrics = evaluate_model(model, X, y)
    #metrics['n_obs'] = len(data)
    #metrics['price'] = np.mean(y)
    
    return None #TODO: metrics

### Load and Explore Market Data
First, we'll load our market data and look at the characteristics that might make markets similar.
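
One way to read several markets side by side is to put each market in its own column. The sketch below is a possible shape for `compare_markets`, run on a toy table: the column names come from this notebook, but the values and the `_sketch` function name are invented for illustration.

```python
import pandas as pd

# Toy stand-in for loc_data; the values are invented for illustration
toy_loc_data = pd.DataFrame({
    'location': ['seattle', 'newyork', 'houston', 'denver'],
    'state': ['WA', 'NY', 'TX', 'CO'],
    'census_region': ['West', 'Northeast', 'South', 'West'],
    'total_listings': [1200, 3400, 2100, 900],
    'avg_price': [21500.0, 19800.0, 18200.0, 20100.0],
})

def compare_markets_sketch(loc_data, markets_to_compare):
    """Return one column per market so features can be read side by side."""
    subset = loc_data[loc_data['location'].isin(markets_to_compare)]
    return subset.set_index('location').T

comparison = compare_markets_sketch(toy_loc_data, ['seattle', 'houston'])
print(comparison)
```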

In [None]:
# Load location summaries and listing data
listing_data, loc_data = load_data() 

# TODO: Explore what makes markets similar
# Look at a few example markets to understand what features we have
example_markets = ['seattle', 'newyork', 'houston']
market_comparison = loc_data[loc_data['location'].isin(example_markets)]
display(market_comparison)


In [None]:

# TODO: Choose features for clustering
# Consider:
# 1. Basic market statistics (listings, prices, ages)
# 2. Geographic information
# 3. Car-specific features (F150 prices vs Civic prices)

# Print the columns we have available
print("\nAvailable features:")
for col in loc_data.columns:
    print(f"- {col}")

# TODO: Examine distributions of key features
# HINT: Consider using sns.histplot or plt.hist (sns.distplot is deprecated)
# Which features might need scaling?


In [None]:

# TODO: Prepare data for clustering
# 1. Select relevant features
# 2. Handle categorical variables
# 3. Consider scaling numeric features

# Example of handling categorical variables
# (integer codes impose an arbitrary order, so treat them as labels, not magnitudes):
market_features = loc_data.drop('location', axis=1)
market_features['state'] = pd.Categorical(market_features['state']).codes
market_features['census_region'] = pd.Categorical(market_features['census_region']).codes

# Extension: Try different feature combinations
# How do they affect your clusters?

### Implement Clustering
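We'll use k-medoids (PAM) clustering on a Gower distance matrix, which handles our mix of numeric and categorical features. Based on our market exploration, k=8 is a reasonable starting point: it gives reasonable market sizes while maintaining geographic coherence.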
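
The cells that follow compute distances with `gower.gower_matrix`. To build intuition for what a Gower distance measures, here is a hand-rolled sketch in NumPy: numeric features contribute a range-scaled absolute difference, categorical features contribute 0 or 1, and the result is the average over features. The function name and the three toy markets are assumptions, and the `gower` package may differ in details such as missing-value handling.

```python
import numpy as np

def gower_sketch(numeric, categorical):
    """Average per-feature dissimilarity over numeric and categorical columns.

    numeric: array-like of floats, shape (n_rows, n_numeric)
    categorical: array-like of labels, shape (n_rows, n_categorical)
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)
    n = numeric.shape[0]
    total = np.zeros((n, n))

    # Numeric features: absolute difference scaled by the feature's range
    # (a constant column contributes zero distance)
    for j in range(numeric.shape[1]):
        col = numeric[:, j]
        col_range = col.max() - col.min()
        if col_range > 0:
            total += np.abs(col[:, None] - col[None, :]) / col_range

    # Categorical features: 0 if the labels match, 1 otherwise
    for j in range(categorical.shape[1]):
        col = categorical[:, j]
        total += (col[:, None] != col[None, :]).astype(float)

    return total / (numeric.shape[1] + categorical.shape[1])

# Three made-up markets: [avg_price, total_listings] and [census_region]
num = [[20000, 1000], [22000, 3000], [18000, 2000]]
cat = [['West'], ['Northeast'], ['West']]
dist = gower_sketch(num, cat)
print(np.round(dist, 3))
```

Because each numeric feature is divided by its range, every feature lands on a 0 to 1 scale, which is why separate scaling matters less here than it would for Euclidean distance.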

In [None]:
# TODO: Prepare data for clustering
# First we need to separate numeric and categorical columns
cat_cols = ['state', 'census_region']
# TODO: What other columns should we use for clustering?
# HINT: Look at loc_data.columns and think about what makes markets similar
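# Keep only numeric columns, and skip 'cluster' in case this cell is re-run
# after cluster labels have been added to loc_data
num_cols = [col for col in loc_data.select_dtypes(include='number').columns
            if col not in cat_cols + ['location', 'cluster']]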

# Print what we're using
print("Features for clustering:")
print("Categorical:", cat_cols)
print("Numeric:", num_cols)


In [None]:

# TODO: Handle missing values
# Build the feature table first so we can inspect it
cluster_features = loc_data[num_cols + cat_cols].copy()

# First, let's see what we're dealing with
print("\nMissing values by column:")
missing_counts = cluster_features.isna().sum()
print(missing_counts[missing_counts > 0])

# TODO: Choose a strategy for missing values
# 1. Remove columns with too many missing values?
# 2. Fill with mean/median?
# 3. Something else?

# Here's one approach - is it the best one?
cluster_features[num_cols] = cluster_features[num_cols].fillna(
    cluster_features[num_cols].mean()
)

# Calculate distance matrix using Gower distance
# This handles mixed numeric/categorical data
gower_dist = gower.gower_matrix(cluster_features)

# TODO: Examine the distances
# 1. What's the range of distances?
# 2. Which markets are most similar?
# 3. Which are most different?

print("\nDistance matrix shape:", gower_dist.shape)
print("Distance range:", np.min(gower_dist), "to", np.max(gower_dist))
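
For the "most similar" and "most different" questions, one option is to scan the upper triangle of the distance matrix so that each pair is considered once and the zero diagonal is skipped. The matrix and market labels below are made up; with the real data you would pass `gower_dist` and `loc_data['location']` instead.

```python
import numpy as np

# Small made-up symmetric distance matrix and market labels
names = ['seattle', 'newyork', 'houston', 'denver']
dist = np.array([
    [0.00, 0.62, 0.48, 0.15],
    [0.62, 0.00, 0.55, 0.70],
    [0.48, 0.55, 0.00, 0.41],
    [0.15, 0.70, 0.41, 0.00],
])

# Upper triangle only: each pair appears once, and the diagonal
# (each market compared with itself) is ignored
rows, cols = np.triu_indices_from(dist, k=1)
pair_dist = dist[rows, cols]

closest = int(np.argmin(pair_dist))
farthest = int(np.argmax(pair_dist))
closest_pair = (names[rows[closest]], names[cols[closest]])
farthest_pair = (names[rows[farthest]], names[cols[farthest]])

print("Most similar:", closest_pair, pair_dist[closest])
print("Most different:", farthest_pair, pair_dist[farthest])
```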

In [None]:
# Use PAM/KMedoids on the distance matrix
k = None # TODO: Choose a number of clusters
pam = KMedoids(n_clusters=k, random_state=42, metric='precomputed')
market_clusters = pam.fit_predict(gower_dist)

# Add clusters back to location data
loc_data['cluster'] = market_clusters

# See cluster sizes and characteristics
print("\nCluster Sizes:")
print(pd.Series(market_clusters).value_counts().sort_index())

# Examine cluster characteristics
cluster_summary = loc_data.groupby('cluster').agg({
    'total_listings': 'mean',
    'avg_price': 'mean',
    'census_region': lambda x: x.mode()[0],
    'state': lambda x: x.mode()[0]
}).round(2)

display(cluster_summary)

# Optional: Look at most similar markets to a specific market
example_market = 'seattle'
# Use the row position (not the index label) so it lines up with gower_dist
market_pos = np.flatnonzero(loc_data['location'].to_numpy() == example_market)[0]
distances = gower_dist[market_pos]
similar_idx = np.argsort(distances)[:5]  # 5 closest rows; the first is the market itself
similar_markets = loc_data.iloc[similar_idx]
print(f"\nMarkets most similar to {example_market}:")
display(similar_markets[['location', 'cluster', 'total_listings', 'avg_price']])
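
If you are unsure which `k` to choose, one common heuristic is to compare silhouette scores across candidate values. The sketch below does this on a precomputed distance matrix. To stay self-contained it uses a simplified alternating k-medoids written in NumPy (a stand-in, not PAM and not the `sklearn_extra` implementation) and made-up 2-D points; in the notebook you would replace `simple_kmedoids(dist, k)` with `KMedoids(n_clusters=k, metric='precomputed', random_state=42).fit_predict(gower_dist)`.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def simple_kmedoids(dist, k, n_iter=50, seed=42):
    """Alternating k-medoids on a precomputed distance matrix.

    Simplified stand-in for illustration; returns one label per row.
    """
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: member with the smallest total distance to the rest
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(dist[:, medoids], axis=1)

# Made-up points in three separated groups, turned into a distance matrix
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
points = np.vstack([c + rng.normal(0, 0.5, size=(15, 2)) for c in centers])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Higher silhouette means tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    labels = simple_kmedoids(dist, k)
    scores[k] = silhouette_score(dist, labels, metric='precomputed')

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```

Silhouette is only one input. For this exercise you also need each cluster to contain enough listings to fit a model, so a slightly lower score with more balanced clusters may be the better trade.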

### Choose Make/Model and Build Models
Now we'll select a popular make/model combination and compare national vs cluster-specific models.
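
A quick way to find a "popular" make/model is to count listings and distinct markets per pair, since a good candidate needs both volume and geographic spread. The sketch below uses a toy table with the `make`, `model`, `location`, and `price` columns from `listing_data`; the rows themselves are invented.

```python
import pandas as pd

# Toy stand-in for listing_data; rows are invented for illustration
toy_listings = pd.DataFrame({
    'make': ['ford', 'ford', 'honda', 'ford', 'honda', 'toyota'],
    'model': ['f150', 'f150', 'civic', 'f150', 'civic', 'camry'],
    'location': ['seattle', 'houston', 'seattle', 'newyork', 'houston', 'seattle'],
    'price': [32000, 29500, 14000, 31000, 13500, 17000],
})

# Count listings and distinct markets for each make/model pair
popularity = (
    toy_listings.groupby(['make', 'model'])
    .agg(n_listings=('price', 'size'), n_locations=('location', 'nunique'))
    .sort_values('n_listings', ascending=False)
)
print(popularity)

# The most-listed pair is the first row of the sorted table
top_make, top_model = popularity.index[0]
```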

In [None]:
# Select make/model and merge cluster assignments
make = None # TODO: Choose a make
model_name = None # TODO: Choose a model

mm_data = filter_data(listing_data, make, model_name)
mm_data = mm_data.merge(loc_data[['location', 'cluster']], on='location', how='left')

print(f"Total {make} {model_name} listings: {len(mm_data)}")
print("\nListings per cluster:")
display(mm_data.groupby('cluster')['price'].count().sort_values(ascending=False))

### Build and Compare Models
Now that we have our market clusters, let's:
1. Build a national model for our chosen make/model
2. Build separate models for each cluster
3. Compare the performance across approaches
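
When comparing approaches, a plain average of cluster RMSEs treats a 30-listing cluster the same as a 400-listing one. A hedged alternative is to pool the errors weighted by listing count and to report coverage alongside. The numbers below are made up and only mimic the shape of the per-cluster results; `toy_national_rmse` is a hypothetical value.

```python
import numpy as np
import pandas as pd

# Made-up results in the shape build_cluster_model returns (one row per cluster)
toy_results = pd.DataFrame({
    'cluster': [0, 1, 2, 3],
    'n_obs': [400, 250, 120, 30],
    'rmse': [2800.0, 3100.0, 3600.0, np.nan],  # NaN: too few listings for a model
})
toy_national_rmse = 3300.0  # hypothetical national-model RMSE

has_model = toy_results['rmse'].notna()
modeled = toy_results[has_model]

# Share of listings that fall in a cluster with its own model
coverage = modeled['n_obs'].sum() / toy_results['n_obs'].sum()

# Pool squared errors across clusters, weighting each by its listing count
pooled_rmse = np.sqrt(np.average(modeled['rmse'] ** 2, weights=modeled['n_obs']))
improvement_pct = (toy_national_rmse - pooled_rmse) / toy_national_rmse * 100

print(f"Coverage: {coverage:.1%}")
print(f"Pooled cluster RMSE: {pooled_rmse:.0f}")
print(f"Improvement vs national: {improvement_pct:.1f}%")
```

For a strictly fair comparison, the national RMSE would be evaluated on the same covered listings, since the clusters without models are excluded from the pooled figure.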

In [None]:

# Build national model
# TODO: Fit a model on all of mm_data and store its metrics in `national_metrics`
# (the summary statistics below expect national_metrics['rmse'])

# Build cluster models and evaluate
cluster_results = []
for cluster_id in range(k):
    # TODO - build model for each cluster
    cluster_data = mm_data[mm_data['cluster'] == cluster_id]
    metrics = build_cluster_model(cluster_data, needed_obvs=50)
    metrics['cluster'] = cluster_id
    cluster_results.append(metrics)

cluster_results = pd.DataFrame(cluster_results)

# Calculate summary statistics
has_model = cluster_results['rmse'].notna()
n_markets_with_models = has_model.sum()
# Coverage counts only listings in clusters large enough to get their own model
total_coverage = cluster_results.loc[has_model, 'n_obs'].sum() / len(mm_data)
avg_improvement = ((national_metrics['rmse'] - cluster_results['rmse'].mean()) 
                  / national_metrics['rmse'] * 100)

print("Coverage Statistics:")
print(f"Clusters with models: {n_markets_with_models} of {k}")
print(f"Total coverage: {total_coverage:.1%}")
print(f"Average RMSE improvement: {avg_improvement:.1f}%")

### Visualize Results
Let's create some visualizations to understand our clusters and their performance.

I've left these for you here to save time.

In [None]:
# Create cluster name mapping (reuses the helper defined earlier)
cluster_names = create_cluster_names(loc_data)

# Plot average prices by cluster
plt.figure(figsize=(12, 6))
cluster_results['avg_price'] = cluster_results['price']  # from build_cluster_model
plot_data = cluster_results.sort_values('avg_price', ascending=True)
plt.barh(range(len(plot_data)), plot_data['avg_price'])
plt.yticks(range(len(plot_data)), 
          [cluster_names[i] for i in plot_data['cluster']])
plt.xlabel('Average Price')
plt.title(f'Average {make.title()} {model_name.upper()} Price by Market Cluster')
plt.tight_layout()
plt.show()

# Plot RMSE improvement
plt.figure(figsize=(12, 6))
improvement = (national_metrics['rmse'] - cluster_results['rmse']) / national_metrics['rmse'] * 100
plot_data = cluster_results.assign(improvement=improvement).sort_values('improvement')
plt.barh(range(len(plot_data)), plot_data['improvement'])
plt.yticks(range(len(plot_data)), 
          [cluster_names[i] for i in plot_data['cluster']])
plt.axvline(x=0, color='red', linestyle='--')
plt.xlabel('RMSE Improvement (%)')
plt.title('Model Improvement by Cluster vs National Model')
plt.tight_layout()
plt.show()