> ### Note on Labs and Assignments:
>
> 🔧 Look for the **wrench emoji** 🔧 — it highlights where you're expected to take action!
>
> These sections are graded and are not optional.
>

# IS 4487 Lab 9: Segmentation

In this lab, we return to the **SF Rent** dataset that we used in **Lab 4: Data Understanding** and **Lab 5: Exploratory Data Analysis (EDA)**.

This time, we’ll explore how to segment the rental listings using both:
- **Manual segmentation** based on business rules
- **Automatic segmentation** using KMeans clustering

Segmentation helps identify meaningful groups within data, such as counties with high rent burden or low affordability. This is valuable for making targeted decisions in housing policy, urban planning, and social support.


## Outline

- Load and inspect the SF Rent dataset  
- Engineer and prepare features  
- Create manual segments using binning  
- Perform KMeans clustering for automatic segments  
- Visualize and compare results  

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Labs/lab_09_segmentation.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Dataset Overview

**Dataset:** `rent.csv`  
Source: [TidyTuesday – 2022-07-05](https://github.com/rfordatascience/tidytuesday/blob/main/data/2022/2022-07-05/rent.csv)

| Variable       | Type       | Description |
|----------------|------------|-------------|
| `post_id`      | Categorical| Unique listing ID |
| `date`         | Numeric    | Listing date (numeric format) |
| `year`         | Integer    | Year of listing |
| `nhood`        | Categorical| Neighborhood |
| `city`         | Categorical| City |
| `county`       | Categorical| County |
| `price`        | Numeric    | Listing price (USD) |
| `beds`         | Numeric    | Number of bedrooms |
| `baths`        | Numeric    | Number of bathrooms |
| `sqft`         | Numeric    | Square footage |
| `room_in_apt`  | Binary     | 1 = room in apartment |
| `address`      | Categorical| Street address |
| `lat`          | Numeric    | Latitude |
| `lon`          | Numeric    | Longitude |
| `title`        | Text       | Listing title |
| `descr`        | Text       | Listing description |
| `details`      | Text       | Additional details |


## Part 1: Importing the Data + Prepare for Segmentation

### Instructions:
- Import the `pandas` library.
- Import the data from `rent.csv` into a dataframe using the TidyTuesday link.
- Use `.info()` and `.head()` to inspect the structure and preview the data.
- Remove duplicates
- Handle missing values
- Remove outliers (for price, beds, baths, sqft)
- Fix data types
- Optionally impute or filter variables
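
The lab's cleaning code removes outliers with fixed, common-sense ranges. As a point of comparison, here is a minimal sketch of a data-driven alternative, the 1.5 × IQR rule. The helper `iqr_bounds` and the toy prices are made up for illustration and are not part of the lab's code or of `rent.csv`.

```python
import pandas as pd

# Made-up prices for illustration (not taken from rent.csv)
toy = pd.DataFrame({"price": [1200, 1500, 1800, 2100, 2400, 2700, 95000]})

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(toy["price"])

# Keep only rows whose price falls inside the fences
filtered = toy[toy["price"].between(lower, upper)]

print(f"Fences: {lower:.0f} to {upper:.0f}")
print(filtered)
```

On skewed data such as rents, the IQR fences can cut off legitimate high-end listings, which is one reason fixed ranges may be the simpler choice here.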

In [None]:
import pandas as pd

# Load the dataset
url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-07-05/rent.csv'
df = pd.read_csv(url)

# Get a quick overview
df.info()
df.head()


In [None]:
# STEP 1: Drop duplicates
df = df.drop_duplicates(subset='post_id')

# STEP 2: Drop rows with nulls in essential columns
essential = ['price', 'beds', 'baths', 'sqft', 'lat', 'lon']
df = df.dropna(subset=essential)

# STEP 3: Remove outliers (common-sense filtering)
df = df[df['price'].between(500, 20000)]
df = df[df['beds'].between(0, 10)]
df = df[df['baths'].between(0.5, 10)]
df = df[df['sqft'].between(100, 5000)]

# STEP 4: Convert data types if needed
df['beds'] = df['beds'].astype(int)
df['baths'] = df['baths'].astype(float)  # decimal values allowed
df['sqft'] = df['sqft'].astype(int)
df['price'] = df['price'].astype(int)

# STEP 5: Reset index
df = df.reset_index(drop=True)

# Preview cleaned data
df.info()
df.head()


## Part 2: Engineer and Prepare Features

We’ll select features for clustering:
- Property: `price`, `beds`, `baths`, `sqft`
- Geographic: `lat`, `lon`

We’ll standardize features to ensure fair weighting in distance-based clustering.

### Why This Matters:
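KMeans groups points by Euclidean distance, so without standardization a larger-scale variable (like `price`, measured in thousands of dollars) would outweigh smaller-scale ones (like `beds`).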

### Things to think about:
- Should all variables be scaled?
- Do geographic coordinates need standardization?
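
To make the scale issue concrete, the sketch below measures how much of the squared distance between two listings comes from `price`, before and after standardizing. The four listings and the helper `price_share` are invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up listings (illustrative values, not taken from rent.csv)
toy = pd.DataFrame({"price": [1500, 3000, 4500, 6000], "beds": [1, 2, 3, 4]})

def price_share(a, b):
    """Fraction of the squared Euclidean distance between a and b
    that comes from column 0 (price)."""
    sq_diff = (a - b) ** 2
    return sq_diff[0] / sq_diff.sum()

raw = toy.to_numpy(dtype=float)
scaled = StandardScaler().fit_transform(toy)

# Compare the first two listings in raw and standardized units
raw_share = price_share(raw[0], raw[1])
scaled_share = price_share(scaled[0], scaled[1])

print(f"price share of distance, raw:          {raw_share:.4f}")
print(f"price share of distance, standardized: {scaled_share:.4f}")
```

In this toy case `price` accounts for nearly all of the raw distance, while after standardization `price` and `beds` contribute equally.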


In [None]:
from sklearn.preprocessing import StandardScaler

# Select and scale features
features = ['price', 'beds', 'baths', 'sqft', 'lat', 'lon']
segment_df = df[features].copy()

scaler = StandardScaler()
scaled_data = scaler.fit_transform(segment_df)

# Convert back to DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=features)
scaled_df.head()


### 🔧 Try It Yourself – Part 2

1. Why should lat/lon be scaled before clustering? What would happen if they weren’t?

🔧 Add comment here:

## Part 3: Create Manual Segments Using Binning

Let’s group listings by price into:
- **Low**: up to $2,000
- **Mid**: above $2,000, up to $4,000
- **High**: above $4,000

### Why This Matters:
Manual bins based on thresholds offer simple segmentation — useful when business rules exist.

### Things to think about:
- Are fixed cutoffs better than percentiles?
- Should you also bin square footage?
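
One way to explore the fixed-cutoffs-versus-percentiles question is to bin the same values both ways. The sketch below uses `pd.cut` with the lab's price thresholds and `pd.qcut` with three equal-sized groups. The nine prices are made up for illustration.

```python
import pandas as pd

# Made-up prices, skewed toward the low end (not taken from rent.csv)
prices = pd.Series([900, 1200, 1500, 1700, 1900, 2500, 3200, 4500, 12000])
labels = ["Low", "Mid", "High"]

# Fixed cutoffs: bin edges are chosen by a business rule
fixed = pd.cut(prices, bins=[0, 2000, 4000, float("inf")], labels=labels)

# Percentile cutoffs: bin edges are chosen so each bin holds about a third of the rows
terciles = pd.qcut(prices, q=3, labels=labels)

print(fixed.value_counts().sort_index())
print(terciles.value_counts().sort_index())
```

Fixed cutoffs keep the same meaning across datasets but can produce uneven group sizes. Percentile bins are balanced by construction, but their edges move whenever the data changes.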


In [None]:
# Create price segments
df['price_segment'] = pd.cut(
    df['price'],
    bins=[0, 2000, 4000, float('inf')],
    labels=['Low', 'Mid', 'High']
)

df['price_segment'].value_counts()


### 🔧 Try It Yourself – Part 3

1. Create a column called `sqft_segment` using bins:  
  - Small: `< 800`, Medium: `800–1400`, Large: `>1400`  
2. Count how many listings fall into each `sqft_segment` using `.value_counts()`  
3. Use `.head()` to preview both `price_segment` and `sqft_segment`

In [None]:
# 🔧 Add code here

## Part 4: Perform KMeans Clustering

We’ll create two sets of clusters:
1. **Feature-based** (price, beds, baths, sqft)
2. **Geographic-based** (lat, lon)

### Why This Matters:
Unsupervised clustering finds hidden patterns — useful for market segmentation, targeting, etc.

### Things to think about:
- How many clusters should you use?
- How do results differ between property and location clusters?
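
A common heuristic for choosing the number of clusters is the elbow method: fit KMeans for several values of `n_clusters` and look for the point where inertia (the within-cluster sum of squared distances) stops dropping sharply. The sketch below runs it on synthetic data with three well-separated blobs. The data, the range of k, and `n_init=10` are assumptions for illustration, not settings taken from the lab.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three blobs of 50 points each (illustrative only)
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit KMeans for k = 1..6 and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    model.fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

On real data the elbow is often less clear than in this synthetic case, so it is best treated as one input alongside whether the resulting segments are interpretable.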


In [None]:
from sklearn.cluster import KMeans

# Select only the standardized property features
X_feat = scaled_df[['price', 'beds', 'baths', 'sqft']]

# Apply KMeans clustering with 4 clusters
# (n_init is set explicitly so results are consistent across scikit-learn versions)
kmeans_feat = KMeans(n_clusters=4, n_init=10, random_state=1)
scaled_df['feature_cluster'] = kmeans_feat.fit_predict(X_feat)

# Show number of listings in each cluster
scaled_df['feature_cluster'].value_counts()


In [None]:
# Select only standardized geographic coordinates
X_geo = scaled_df[['lat', 'lon']]

# Apply KMeans clustering with 5 clusters
# (n_init is set explicitly so results are consistent across scikit-learn versions)
kmeans_geo = KMeans(n_clusters=5, n_init=10, random_state=1)
scaled_df['geo_cluster'] = kmeans_geo.fit_predict(X_geo)

# Show number of listings in each geographic cluster
scaled_df['geo_cluster'].value_counts()

### 🔧 Try It Yourself – Part 4

1. Run KMeans again using `n_clusters=3` and compare the cluster counts  
2. Plot a histogram of `price` grouped by `feature_cluster`  

In [None]:
# 🔧 Add code here

## Part 5: Visualize and Compare Results

We’ll now visualize:
- Clusters on a map (lat/lon)
- Clusters in price vs sqft space

### Why This Matters:
Visual validation helps determine if clusters are interpretable and useful.

### Things to think about:
- Are location clusters geographically meaningful?
- Do property clusters separate by price or size?
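
Besides plots, a cross-tabulation is a quick way to compare two segmentations of the same rows. The sketch below counts how manual price segments line up with cluster labels. The six labelled rows are made up, and the column names simply mirror the ones used in this lab.

```python
import pandas as pd

# Made-up segment and cluster labels for six listings (illustrative only)
toy = pd.DataFrame({
    "price_segment": ["Low", "Low", "Mid", "Mid", "High", "High"],
    "feature_cluster": [0, 0, 0, 1, 1, 1],
})

# Rows: manual segment, columns: cluster label, cells: number of listings
table = pd.crosstab(toy["price_segment"], toy["feature_cluster"])
print(table)
```

If most of each row's count sits in a single column, the clusters largely reproduce the manual bins. Counts spread across columns suggest the clusters are picking up something other than price.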

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot latitude vs longitude colored by geographic cluster
plt.figure(figsize=(10,6))
sns.scatterplot(data=scaled_df,
                x='lon',
                y='lat',
                hue='geo_cluster',
                palette='tab10')

plt.title("KMeans Geographic Clusters (lat/lon)")
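# scaled_df holds standardized values, so the axes are in z-score units
plt.xlabel("Longitude (standardized)")
plt.ylabel("Latitude (standardized)")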
plt.legend(title='Geo Cluster')
plt.show()

In [None]:
# Plot price vs square footage colored by feature cluster
plt.figure(figsize=(10,6))
sns.scatterplot(data=scaled_df,
                x='price',
                y='sqft',
                hue='feature_cluster',
                palette='Set2')

plt.title("KMeans Property Clusters (Price vs Sqft)")
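# scaled_df holds standardized values, so the axes are in z-score units
plt.xlabel("Price (standardized)")
plt.ylabel("Square Footage (standardized)")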
plt.legend(title='Feature Cluster')
plt.show()

### 🔧 Try It Yourself – Part 5

1. Add `beds` as the **point size** in your scatterplot of `price` vs `sqft`  
2. Add `baths` as the **point style** in your `sns.scatterplot()`  
3. Group by `feature_cluster` and calculate:
  - Average `price`
  - Average `sqft`
  - Average `beds`  


In [None]:
# 🔧 Add code here

## 🔧 Part 6: Reflection (100 words or less per question)

1. Which method—manual binning or KMeans clustering—gave you more useful insights?
2. How might missing data or outliers affect your segmentation results?


🔧 Add comment here

## Export Your Notebook to Submit in Canvas
- Use the instructions from Lab 1

In [None]:
!jupyter nbconvert --to html "lab_09_LastnameFirstname.ipynb"