# Assignment 2: Engineering Predictive Features

**Student Name:** Sean Sampietro

**Date:** 2/5/26

---

## Assignment Overview

In this assignment, you'll practice feature engineering by creating new predictive features from the Ames Housing dataset. You'll build a baseline model with raw features, engineer at least 5 new features based on real estate intuition, and measure how feature engineering improves model performance.

---

## Step 1: Import Libraries and Load Data

In [58]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


In [59]:
# Load the Ames Housing dataset
# TODO: Load train.csv from the data folder
df = pd.read_csv(r"C:\Users\Ssanp\Downloads\train.csv")

# Display basic information
# TODO: Display the first few rows and basic info about the dataset


print("\n" + "="*80)
print("CHECKPOINT: Verify dataset loaded correctly")
print(f"Dataset shape: {df.shape if df is not None else 'Not loaded'}")
print("="*80)


CHECKPOINT: Verify dataset loaded correctly
Dataset shape: (1460, 81)


---
## Step 2: Build Baseline Model with Raw Features

### Select Raw Features for Baseline

Select 10-15 raw features to use in your baseline model. Here's a suggested starting set (you can adjust):

**Suggested features:**
- `GrLivArea` - Above grade living area square feet
- `OverallQual` - Overall material and finish quality
- `YearBuilt` - Original construction year
- `TotalBsmtSF` - Total basement square feet
- `FullBath` - Full bathrooms above grade
- `BedroomAbvGr` - Bedrooms above grade
- `GarageArea` - Size of garage in square feet
- `LotArea` - Lot size in square feet
- `Neighborhood` - Physical location (categorical)
- Add 5-10 more features you think are important

In [62]:
# Select features for baseline model
# TODO: Create a list of feature names you want to use
baseline_features = [
    'OverallQual',
    'GrLivArea',
    'GarageCars',
    'GarageArea',
    'TotalBsmtSF',
    '1stFlrSF',
    'YearBuilt',
    'YearRemodAdd',
    'FullBath',
    'TotRmsAbvGrd',
    'Fireplaces',
    'LotArea',
    'Neighborhood',
]

# TODO: Create X (features) and y (target) for baseline
# Make sure to handle missing values and encode categorical variables
X = df[baseline_features].copy()
y = df['SalePrice']

# Numeric columns: fill missing values with the column median
num_cols = X.select_dtypes(include='number').columns
X[num_cols] = X[num_cols].fillna(X[num_cols].median())

# Categorical columns: fill missing values with an explicit 'None' label
cat_cols = X.select_dtypes(include=['object']).columns
X[cat_cols] = X[cat_cols].fillna('None')

print(f"Baseline features selected: {len(baseline_features)}")
print(f"Target variable shape: {y.shape if y is not None else 'Not defined'}")

Baseline features selected: 13
Target variable shape: (1460,)


### Preprocess Baseline Features
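
The two preprocessing steps in this section (filling missing values, then one-hot encoding) can be illustrated on a tiny frame. This is only a sketch: the `toy` DataFrame and its values are made up, and it borrows two column names from the dataset purely for readability.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "GrLivArea": [1500.0, None, 2100.0],
    "Neighborhood": ["NAmes", None, "OldTown"],
})

# Numeric columns: fill gaps with the column median
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Categorical columns: fill gaps with an explicit 'None' label
cat_cols = toy.select_dtypes(exclude="number").columns
toy[cat_cols] = toy[cat_cols].fillna("None")

# One-hot encode; drop_first removes one level per categorical column
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.shape)
print(encoded.columns.tolist())
```

Note that encoding keeps the number of rows and only changes the number of columns, so a row count that differs from the original data after this step points to a bug upstream.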

In [63]:
# Handle missing values
# TODO: Fill missing values appropriately
# Numeric: Use median or 0
# Categorical: Use 'None' or most frequent
# (Missing values were already filled on X in the previous cell.)

# Encode categorical variables
# TODO: Use pd.get_dummies() for categorical features
X_baseline = pd.get_dummies(X, drop_first=True)

print("\n" + "="*80)
print("CHECKPOINT: After preprocessing")
print(f"X_baseline shape: {X_baseline.shape if X_baseline is not None else 'Not defined'}")
print(f"Missing values: {X_baseline.isnull().sum().sum() if X_baseline is not None else 'N/A'}")
print("="*80)


### Train Baseline Model

In [65]:
# Split data into train and test sets
# TODO: Use train_test_split with test_size=0.2, random_state=42
X_train, X_test, y_train, y_test = train_test_split(
    X_baseline, y, test_size=0.2, random_state=42
)

# Train baseline Random Forest model
# TODO: Create and train RandomForestRegressor(n_estimators=100, random_state=42)
baseline_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
baseline_model.fit(X_train, y_train)

# Make predictions
# TODO: Generate predictions on test set
baseline_predictions = baseline_model.predict(X_test)

# Calculate metrics
# TODO: Calculate R² and RMSE
baseline_r2 = r2_score(y_test, baseline_predictions)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_predictions))

print("\n" + "="*80)
print("BASELINE MODEL RESULTS")
print("="*80)
print(f"R² Score: {baseline_r2 if baseline_r2 is not None else 'Not calculated'}")
print(f"RMSE: ${baseline_rmse:,.2f}" if baseline_rmse is not None else "RMSE: Not calculated")
print("="*80)
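
To make the two metrics in this step concrete, here is a small sketch that computes RMSE and R² by hand and compares them with scikit-learn. The prices and predictions are invented for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up sale prices and predictions, for illustration only
y_true = np.array([200_000.0, 150_000.0, 320_000.0, 180_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0, 190_000.0])

# RMSE: square root of the mean squared residual, in dollars
residuals = y_true - y_pred
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both should agree with scikit-learn's implementations
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)
print(f"RMSE: ${rmse_manual:,.2f}  R²: {r2_manual:.4f}")
```

RMSE is in the same units as `SalePrice`, which makes it easier to interpret than R² when judging whether an improvement matters in practice.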

### Visualize Baseline Feature Importances

In [67]:
# Extract and visualize feature importances
# TODO: Get feature importances from baseline_model
# TODO: Create a horizontal bar plot of top 10 features
# Index by the encoded training columns: after one-hot encoding there are
# more columns than names in baseline_features
baseline_importances = pd.Series(
    baseline_model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

plt.figure(figsize=(8, 6))
baseline_importances.head(10).plot(kind='barh')
plt.gca().invert_yaxis()  # largest importance at the top
plt.title('Top 10 Baseline Feature Importances')
plt.xlabel('Importance')
plt.show()

print("\n" + "="*80)
print("CHECKPOINT: Review which raw features are most important")
print("="*80)
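
A common pitfall in this step is labelling importances with the raw feature list instead of the encoded columns. The sketch below uses synthetic data (not Ames values; the `toy` frame and `price` series are assumptions for illustration) to show that a fitted forest reports one importance per encoded column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in for the housing data (not real Ames values)
toy = pd.DataFrame({
    "GrLivArea": rng.uniform(800, 3000, size=200),
    "Neighborhood": rng.choice(["A", "B", "C"], size=200),
})
price = 100 * toy["GrLivArea"] + rng.normal(0, 5_000, size=200)

# One-hot encoding turns the single Neighborhood column into several columns
X_toy = pd.get_dummies(toy, drop_first=True)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_toy, price)

# One importance per encoded column, so index by X_toy.columns
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

Because a categorical feature is spread across several dummy columns, its individual dummies may each look unimportant even when the original feature matters as a whole.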

---
## Step 3: Engineer New Features

### Feature 1: Quality x Area - [Interaction]

**Business Justification:**
This feature captures the interaction between a home’s overall quality and its living area. Larger homes tend to sell for higher prices, but size contributes more value when the home is also high quality. Real estate intuition suggests that buyers pay a premium for homes that are both spacious and well built.

In [35]:
# TODO: Create your first engineered feature
# Example: df['total_bathrooms'] = df['FullBath'] + 0.5 * df['HalfBath']
df['Quality_x_Area'] = df['OverallQual'] * df['GrLivArea']

### Feature 2: Home Age - [Derived]

**Business Justification:**
This feature measures how old the home is at the time of sale. Newer homes generally command higher prices due to modern designs, updated systems, and lower expected maintenance costs. Home age is a strong proxy for depreciation in real estate markets.

In [None]:
# TODO: Create your second engineered feature
df['HomeAge'] = df['YrSold'] - df['YearBuilt']

### Feature 3: Has Fireplace - [Boolean]

**Business Justification:**
This feature indicates whether a home has at least one fireplace. Fireplaces are considered a luxury amenity and can increase buyer appeal, especially in colder climates. Buyers often value the presence of a fireplace more than the total number.

In [None]:
# TODO: Create your third engineered feature
df['HasFireplace'] = (df['Fireplaces'] > 0).astype(int)

### Feature 4: Quality per Square Foot - [Ratio]

**Business Justification:**
This feature measures the quality of a home relative to its size. It helps distinguish homes that are efficiently designed and high quality from large homes with lower construction quality. Buyers often pay more for homes that offer higher quality per square foot.

In [None]:
# TODO: Create your fourth engineered feature
df['Qual_per_SF'] = df['OverallQual'] / (df['GrLivArea'] + 1)


### Feature 5: Total Square Footage - [Aggregation]

**Business Justification:**
This feature represents the total usable square footage of the home by combining basement and above-ground living space. Total living space is one of the most important drivers of housing prices. Larger homes generally command higher sale prices due to increased utility and comfort.

In [None]:
# TODO: Create your fifth engineered feature
df['TotalSF'] = df['TotalBsmtSF'] + df['GrLivArea']


### Add More Engineered Features (Optional)

You can create additional features beyond the required 5 if you think they'll improve performance.
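
One candidate, taken from the example comment in the Feature 1 cell, is a combined bathroom count. The sketch below runs on made-up rows; the column name `TotalBathrooms` is a hypothetical choice, and counting a half bath as 0.5 follows the template's example rather than any tested weighting.

```python
import pandas as pd

# Toy rows with the same column names as the Ames data (values are made up)
toy = pd.DataFrame({
    "FullBath": [1, 2, 2],
    "HalfBath": [0, 1, 2],
})

# Count each half bath as half of a full bath
toy["TotalBathrooms"] = toy["FullBath"] + 0.5 * toy["HalfBath"]
print(toy)
```

If a feature like this is added to `df`, its name also needs to be added to `engineered_features` in Step 4 for the model to use it.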

In [None]:
# Optional: Create additional engineered features


---
## Step 4: Train Model with Engineered Features

In [68]:
# Create feature list combining baseline + engineered features
# TODO: List all your engineered feature names
# (The Step 3 cells must be run first so these columns exist in df.)
engineered_features = [
    'Quality_x_Area',
    'HomeAge',
    'HasFireplace',
    'Qual_per_SF',
    'TotalSF',
]

# Combine baseline and engineered features
all_features = baseline_features + engineered_features

# TODO: Create X_engineered with all features
# Remember to handle missing values and encode categoricals
X_engineered = df[all_features].copy()

# Same preprocessing as the baseline: median fill, 'None' fill, one-hot encoding
num_cols = X_engineered.select_dtypes(include='number').columns
X_engineered[num_cols] = X_engineered[num_cols].fillna(X_engineered[num_cols].median())

cat_cols = X_engineered.select_dtypes(include=['object']).columns
X_engineered[cat_cols] = X_engineered[cat_cols].fillna('None')

X_engineered = pd.get_dummies(X_engineered, drop_first=True)

print(f"Total features in engineered model: {len(all_features)}")
print(f"New engineered features: {len(engineered_features)}")

In [70]:
# Split data (use same random_state for fair comparison)
# TODO: Split X_engineered and y
X_train_eng, X_test_eng, y_train_eng, y_test_eng = train_test_split(
    X_engineered, y, test_size=0.2, random_state=42
)

# Train model with engineered features
# TODO: Train RandomForestRegressor(n_estimators=100, random_state=42)
engineered_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
engineered_model.fit(X_train_eng, y_train_eng)

# Make predictions
# TODO: Generate predictions on test set
engineered_predictions = engineered_model.predict(X_test_eng)

# Calculate metrics
# TODO: Calculate R² and RMSE
engineered_r2 = r2_score(y_test_eng, engineered_predictions)
engineered_rmse = np.sqrt(mean_squared_error(y_test_eng, engineered_predictions))

print("\n" + "="*80)
print("ENGINEERED MODEL RESULTS")
print("="*80)
print(f"R² Score: {engineered_r2 if engineered_r2 is not None else 'Not calculated'}")
print(f"RMSE: ${engineered_rmse:,.2f}" if engineered_rmse is not None else "RMSE: Not calculated")
print("="*80)
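
The comparison is only fair if both models are scored on the same held-out rows. As a minimal sketch on synthetic data (the frames `X_a` and `X_b` are invented here), reusing `test_size` and `random_state` gives the same test rows even when the feature matrices have different columns:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Two feature matrices over the same 10 rows (synthetic values)
X_a = pd.DataFrame({"a": np.arange(10)})
X_b = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})
y = pd.Series(np.arange(10) * 100)

# Same test_size and random_state for both splits
_, X_a_test, _, y_a_test = train_test_split(X_a, y, test_size=0.2, random_state=42)
_, X_b_test, _, y_b_test = train_test_split(X_b, y, test_size=0.2, random_state=42)

# The held-out rows are identical, so the metrics are comparable
print(X_a_test.index.tolist(), X_b_test.index.tolist())
```

This holds as long as both matrices have the same number of rows in the same order, which is the case when both are built from the same `df`.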

---
## Step 5: Compare Models and Identify Most Valuable Features

### Create Comparison Table

In [None]:
# Create comparison DataFrame
# TODO: Create a table comparing baseline vs engineered model
comparison = pd.DataFrame({
    'Model': ['Baseline', 'Engineered'],
    'Num Features': [len(baseline_features), len(all_features)],
    'R2': [baseline_r2, engineered_r2],
    'RMSE': [baseline_rmse, engineered_rmse],
})

print("\n" + "="*80)
print("MODEL COMPARISON")
print("="*80)
# TODO: Display comparison table
print(comparison.to_string(index=False))
print("="*80)

# Calculate improvement (skipped if any metric is missing)
metrics = (baseline_r2, engineered_r2, baseline_rmse, engineered_rmse)
if all(m is not None for m in metrics):
    r2_improvement = ((engineered_r2 - baseline_r2) / baseline_r2) * 100
    rmse_improvement = ((baseline_rmse - engineered_rmse) / baseline_rmse) * 100
    print(f"\nR² Improvement: {r2_improvement:.2f}%")
    print(f"RMSE Improvement: {rmse_improvement:.2f}%")
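
The improvement figures in this cell are relative changes against the baseline, with the sign flipped for RMSE because lower is better. A small sketch with invented metric values (the helper name `pct_improvement` is hypothetical) shows the arithmetic:

```python
def pct_improvement(baseline: float, new: float, higher_is_better: bool) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    delta = (new - baseline) if higher_is_better else (baseline - new)
    return delta / baseline * 100


# Made-up metric values, for illustration only
r2_gain = pct_improvement(0.80, 0.84, higher_is_better=True)
rmse_gain = pct_improvement(30_000.0, 27_000.0, higher_is_better=False)
print(f"R² improvement: {r2_gain:.2f}%")
print(f"RMSE improvement: {rmse_gain:.2f}%")
```

A positive value means the engineered model is better on that metric; a negative value means the added features made it worse.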

### Visualize Feature Importances from Engineered Model

In [None]:
# Extract and visualize top 15 feature importances
# TODO: Get feature importances from engineered_model
# TODO: Create horizontal bar plot of top 15 features



### Analysis: Most Valuable Features

**Write 3-5 bullet points analyzing your results:**

- [Which of YOUR engineered features appeared in the top 15 most important features?]
- [Why do you think these specific features performed well?]
- [Were any engineered features less valuable than you expected? Why?]
- [What did you learn about feature engineering from this analysis?]
- [If you were to create more features, what would you try based on these results?]

---
## Step 6: Submit Your Work

Before submitting:
1. Make sure all code cells run without errors
2. Verify you have at least 5 engineered features with business justifications
3. Check that your comparison table and visualizations display correctly
4. Complete the analysis section above

Then push to GitHub:
```bash
git add .
git commit -m 'completed feature engineering assignment'
git push
```

Submit your GitHub repository link on the course platform.