> ### Note on Labs and Assignments:
>
> 🔧 Look for the **wrench emoji** 🔧 — it highlights where you're expected to take action!
>
> These sections are graded and are not optional.
>

# IS 4487 Lab 10: Classification Trees with SF Rent Data

## Outline

- Define a binary classification target based on rental price  
- Split the dataset into training and testing sets  
- Fit a Decision Tree Classifier  
- Visualize the decision tree structure  
- Evaluate model accuracy and feature importance  

In this lab, we’ll build a classification model to predict whether a rental listing in San Francisco is considered **"expensive"** based on its features.

We are continuing with the **SF Rent dataset** used in **Lab 9**, and will explore how classification trees can uncover patterns in rental prices.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Labs/lab_10_classification.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



## Dataset Overview

**Dataset:** `rent.csv`  
Source: [TidyTuesday-2022-7-05](https://github.com/rfordatascience/tidytuesday/blob/main/data/2022/2022-07-05/rent.csv)

| Variable       | Type       | Description |
|----------------|------------|-------------|
| `post_id`      | Categorical| Unique listing ID |
| `date`         | Numeric    | Listing date (numeric format) |
| `year`         | Integer    | Year of listing |
| `nhood`        | Categorical| Neighborhood |
| `city`         | Categorical| City |
| `county`       | Categorical| County |
| `price`        | Numeric    | Listing price (USD) |
| `beds`         | Numeric    | Number of bedrooms |
| `baths`        | Numeric    | Number of bathrooms |
| `sqft`         | Numeric    | Square footage |
| `room_in_apt`  | Binary     | 1 = room in apartment |
| `address`      | Categorical| Street address |
| `lat`          | Numeric    | Latitude |
| `lon`          | Numeric    | Longitude |
| `title`        | Text       | Listing title |
| `descr`        | Text       | Listing description |
| `details`      | Text       | Additional details |

## Part 1: Clean and Prepare the Data

We'll repeat the cleaning process from Lab 9, including:
- Removing duplicates
- Dropping missing values in key columns
- Filtering out extreme outliers
- Converting data types

### Why This Matters:
Good models depend on clean, reliable inputs.


In [None]:
import pandas as pd

# Load dataset
url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-07-05/rent.csv'
df = pd.read_csv(url)

# Drop duplicate listings using post_id
df = df.drop_duplicates(subset='post_id')

# Drop rows missing essential features
required_cols = ['price', 'beds', 'baths', 'sqft', 'lat', 'lon']
df = df.dropna(subset=required_cols)

# Remove outliers
df = df[df['price'].between(500, 20000)]
df = df[df['beds'].between(0, 10)]
df = df[df['baths'].between(0.5, 10)]
df = df[df['sqft'].between(100, 5000)]

# Convert columns to appropriate data types
df['beds'] = df['beds'].astype(int)
df['baths'] = df['baths'].astype(float)
df['sqft'] = df['sqft'].astype(int)
df['price'] = df['price'].astype(int)

# Reset index
df = df.reset_index(drop=True)

# Quick check
df.info()
df.head()


## Part 2: Create a Binary Target Variable

We'll classify whether a listing is **expensive** (price > $4,000).

- 1 = Expensive
- 0 = Not expensive

### Why This Matters:
Classification trees predict a categorical target. A binary target (two classes) is the simplest case, so we convert the continuous `price` column into a 0/1 label.


In [None]:
# Create binary target variable
df['expensive'] = (df['price'] > 4000).astype(int)

# Check target class distribution
df['expensive'].value_counts(normalize=True)


### 🔧 Try It Yourself – Part 2

1. Adjust the expensive threshold to $5,000 and check the class balance again  
2. Calculate what percentage of listings are labeled as expensive



In [None]:
# 🔧 Add code here

🔧 Add comment here

In [None]:
# Reset binary target back to 4000

# Create binary target variable
df['expensive'] = (df['price'] > 4000).astype(int)

## Part 3: Select Features and Split Data

We'll use property characteristics as predictors:
- `beds`, `baths`, `sqft`

Then split the data into 70% training and 30% testing.

### Why This Matters:
Train/test splitting ensures our model is evaluated on unseen data.
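As a rough illustration of what a split does, the sketch below uses a small made-up dataset (the `toy` frame and its values are hypothetical, not taken from `rent.csv`). It also passes `stratify`, an optional `train_test_split` argument that the lab's own split does not use, to show how class proportions can be kept similar in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 listings, 20% labeled expensive
toy = pd.DataFrame({
    "sqft": range(100),
    "expensive": [1] * 20 + [0] * 80,
})

X_toy = toy[["sqft"]]
y_toy = toy["expensive"]

# stratify=y_toy asks for roughly the same share of each class in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("Train rows:", len(X_tr), "Test rows:", len(X_te))
print("Share expensive (train):", y_tr.mean())
print("Share expensive (test):", y_te.mean())
```

Stratifying is a design choice that tends to matter most when one class is rare; with a balanced target, a plain random split usually gives similar proportions anyway.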

In [None]:
from sklearn.model_selection import train_test_split

# Select features and target
X = df[['beds', 'baths', 'sqft']]
y = df['expensive']

# Create train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Check dimensions
X_train.shape, X_test.shape


### 🔧 Try It Yourself – Part 3

1. Add `lat` or `price_segment` as an additional feature in `X`
2. Now add `price` as a feature/predictor. What changes do you see in the results?

In [None]:
# 🔧 Add code here

🔧 Add comment here

## Part 4: Build and Fit the Decision Tree Classifier

We'll train a `DecisionTreeClassifier` on our data.

### Why This Matters:
Decision trees make predictions by splitting data into branches using rules that reduce impurity.
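To make "reducing impurity" concrete, here is a minimal sketch of Gini impurity, the default splitting criterion in scikit-learn's `DecisionTreeClassifier`. The `listings` values and the helper names `gini` and `split_impurity` are hypothetical and exist only for illustration; scikit-learn performs this search internally over many candidate thresholds.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, threshold):
    """Weighted Gini impurity after splitting rows on sqft <= threshold."""
    left = [label for sqft, label in rows if sqft <= threshold]
    right = [label for sqft, label in rows if sqft > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical (sqft, expensive) pairs
listings = [(500, 0), (700, 0), (900, 0), (1200, 1),
            (1500, 1), (2000, 1), (800, 1), (1300, 0)]

parent = [label for _, label in listings]
print("Impurity before splitting:", gini(parent))

# A lower weighted impurity means the split separates the classes better
for threshold in (750, 1000, 1400):
    print("sqft <=", threshold, "->", round(split_impurity(listings, threshold), 3))
```

A tree repeats this comparison at every node, picking the feature and threshold with the lowest weighted impurity, until a stopping rule such as `max_depth` is reached or the nodes are pure.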


In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize classifier
clf = DecisionTreeClassifier(random_state=42)

# Train model on training data
clf.fit(X_train, y_train)

### 🔧 Try It Yourself – Part 4

1. Create a second model called `clf_limited` with `max_depth=3`  
2. Fit it on the same training data  
> You'll compare the accuracy of this smaller tree with the original in Part 5.


In [None]:
# 🔧 Add code here

## Part 5: Evaluate the Model

We’ll check how well the tree performed on the test data using:
- Accuracy score
- Confusion matrix

### Why This Matters:
We want to know how often the model is right and what kind of mistakes it makes.
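The sketch below shows one way to read a confusion matrix, using made-up labels (`y_true` and `y_hat` are hypothetical, not model output from this lab). In scikit-learn, rows are the actual classes and columns are the predicted classes, so for a binary target `.ravel()` returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = expensive)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Unpack the 2x2 matrix: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

accuracy = accuracy_score(y_true, y_hat)
recall = tp / (tp + fn)      # share of expensive listings the model caught
precision = tp / (tp + fp)   # share of "expensive" predictions that were right

# Baseline: accuracy from always predicting the most common class
majority_share = max(y_true.count(0), y_true.count(1)) / len(y_true)

print("Accuracy:", accuracy)
print("Recall:", recall)
print("Precision:", round(precision, 3))
print("Majority-class baseline:", majority_share)
```

Comparing accuracy with the majority-class baseline is a useful sanity check: when most listings are not expensive, a model can score well on accuracy while still missing many expensive listings.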


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Predict on test set
y_pred = clf.predict(X_test)

# Accuracy
acc = accuracy_score(y_test, y_pred)
print("Test Accuracy:", acc)

# Confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


### 🔧 Try It Yourself – Part 5

1. Use the `clf_limited` model you created in Part 4 to predict on the test set  
2. Calculate and print the **test accuracy** for `clf_limited`  
3. Compare the test accuracy of `clf_limited` to the original `clf`. Which performs better?


In [None]:
# 🔧 Add code here

🔧 Add comment here:

## Part 6: Visualize Tree and Feature Importance

We’ll plot the tree and rank the most important predictors.

### Why This Matters:
Tree plots explain the model's logic. Feature importance shows what drives predictions.
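As a small illustration of both ideas, the sketch below fits a shallow tree on synthetic data in which the label depends only on `sqft` (the data, the 1500 cutoff, and the names `X_demo` and `demo_tree` are assumptions made for this example). `export_text` prints the learned rules as plain text, and `feature_importances_` reports each feature's normalized share of the total impurity reduction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Synthetic features: beds is pure noise, sqft determines the label
beds = rng.integers(0, 5, size=200)
sqft = rng.integers(300, 3000, size=200)
label = (sqft > 1500).astype(int)  # hypothetical rule for illustration

X_demo = np.column_stack([beds, sqft])
feature_names = ["beds", "sqft"]

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, label)

# Text view of the splitting rules
print(export_text(demo_tree, feature_names=feature_names))

# Importances are normalized, so they sum to 1 across features
for name, score in zip(feature_names, demo_tree.feature_importances_):
    print(name, round(float(score), 3))
```

Because importances describe what this particular tree used, a feature can score low simply because a correlated feature was chosen first, so treat the ranking as a description of the model rather than proof of what drives rent.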


In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Plot decision tree
plt.figure(figsize=(14, 8))
# Pass a plain list: some scikit-learn versions reject a pandas Index for feature_names
plot_tree(clf, feature_names=list(X.columns), class_names=['Not Expensive', 'Expensive'], filled=True)
plt.title("Decision Tree for Expensive Listings")
plt.show()


In [None]:
import seaborn as sns

# Plot feature importance, sorted so the most important predictor appears first
feat_imp = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
sns.barplot(x=feat_imp.values, y=feat_imp.index)
plt.title("Feature Importance Scores")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()


### 🔧 Try It Yourself – Part 6

1. Plot the decision tree for `clf_limited` using `plot_tree()`  
2. Which feature does it split on first?  
3. Is the structure simpler or more complex than the original `clf` tree?

In [None]:
# 🔧 Add code here

🔧 Add comment here:

## 🔧 Part 7: Reflection (100 words or fewer per question)

1. Which feature was most useful in identifying expensive listings?  
2. Would you trust this model to recommend pricing to landlords?


🔧 Add comment here:

## Export Your Notebook to Submit in Canvas
- Use the instructions from Lab 1

In [None]:
!jupyter nbconvert --to html "lab_10_LastnameFirstname.ipynb"