# BA5: Predictive Modeling – Housing Price Estimation

In this notebook, we will build a predictive model to estimate the **total value of homes** using data from **West Roxbury**, Boston. This is a classic case of **supervised regression modeling**.

We will walk through the full data science process: from framing the problem to building and evaluating a multiple linear regression model.


## 1. Problem Framing

Our objective is to predict the variable `TOTAL_VALUE` using other available features. One variable, `TAX`, will be excluded because it is derived from the target variable and would introduce **data leakage**.
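
To make the leakage concern concrete, here is a minimal sketch using made-up numbers rather than the actual West Roxbury data. The values and the flat `assumed_rate` are assumptions chosen purely for illustration. It shows that a column computed directly from the target is perfectly correlated with it, so a model given that column would simply learn to invert the tax formula.

```python
import pandas as pd

# Hypothetical home values; NOT taken from WestRoxbury.csv.
toy = pd.DataFrame({"TOTAL_VALUE": [300.0, 350.0, 410.0, 520.0]})

# Assumed flat tax rate, chosen only for illustration.
assumed_rate = 0.0125
toy["TAX"] = toy["TOTAL_VALUE"] * assumed_rate

# A predictor that is a deterministic function of the target
# is perfectly correlated with it.
leak_corr = toy["TAX"].corr(toy["TOTAL_VALUE"])
print(f"Correlation between TAX and TOTAL_VALUE: {leak_corr:.3f}")
```

On the real data, the same check would be `housing_df["TAX"].corr(housing_df["TOTAL_VALUE"])`, run before `TAX` is dropped.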

We will also handle a categorical variable `REMODEL` by converting it into dummy variables.
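
As a rough illustration of what dummy encoding does, the sketch below applies `pd.get_dummies` to a toy column. The category labels (`"None"`, `"Old"`, `"Recent"`) are assumed for the example and may not match the labels in the actual file. With `drop_first=True`, the first category in sorted order is dropped and serves as the baseline.

```python
import pandas as pd

# Toy column; the labels are assumptions for illustration only.
toy = pd.DataFrame({"REMODEL": ["None", "Old", "Recent", "None"]})

# One 0/1 column per category, minus the first (baseline) category.
dummies = pd.get_dummies(toy, columns=["REMODEL"], drop_first=True, dtype=int)
print(dummies)
```

Dropping one level avoids perfect collinearity among the dummy columns in a linear regression that includes an intercept; the coefficients on the remaining dummies are then read relative to the dropped baseline.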


## 2. Load and Prepare the Data
We will load the dataset, drop the `TAX` column, convert `REMODEL` to dummy variables, and correct an anomalous `FLOORS` value (an entry of 15 is treated as a mis-keyed 1.5).

In [None]:
import pandas as pd

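housing_df = pd.read_csv("/mnt/data/WestRoxbury.csv")

# Normalize headers (strip whitespace, spaces -> underscores) so names such as
# "TOTAL_VALUE" match. Assumption: some copies of this file ship with spaces in
# the headers; this is a no-op if the headers are already clean.
housing_df.columns = [c.strip().replace(" ", "_") for c in housing_df.columns]

# Drop TAX: it is derived from TOTAL_VALUE and would leak the target
housing_df = housing_df.drop(columns=["TAX"], errors="ignore")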

# Convert categorical 'REMODEL' to dummy variables
housing_df = pd.get_dummies(housing_df, columns=["REMODEL"], drop_first=True)

# Correct the anomalous FLOORS entry (if any)
housing_df["FLOORS"] = housing_df["FLOORS"].replace(15, 1.5)

housing_df.head()

## 3. Split the Data into Training and Validation Sets

We will split the dataset into a 60% training set and 40% validation set to build and evaluate our model.


In [None]:
from sklearn.model_selection import train_test_split

exclude_cols = ["TOTAL_VALUE"]
predictors = [col for col in housing_df.columns if col not in exclude_cols]
X = housing_df[predictors]
y = housing_df["TOTAL_VALUE"]

train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)


## 4. Fit a Multiple Linear Regression Model

We will fit the model using the **training data**.


In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(train_X, train_y)
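
Once a model is fitted, its intercept and coefficients can be inspected. The sketch below is a self-contained illustration on synthetic data with hypothetical predictors `x1` and `x2`, not the housing data. Because the target is generated without noise from known coefficients, the fit should recover them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (hypothetical predictors x1, x2).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
toy_y = 5.0 + 2.0 * toy_X["x1"] + 3.0 * toy_X["x2"]

toy_model = LinearRegression().fit(toy_X, toy_y)

# Pair each coefficient with its column name for readability.
coef_table = pd.Series(toy_model.coef_, index=toy_X.columns)
print("Intercept:", round(float(toy_model.intercept_), 3))
print(coef_table.round(3))
```

With the notebook's own objects, the same pattern would be `pd.Series(model.coef_, index=train_X.columns)`.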

## 5. Predict and Evaluate Residuals

We compute the predicted values and residuals for both training and validation sets.


In [None]:
train_pred = model.predict(train_X)
valid_pred = model.predict(valid_X)

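# Training-set results: actual value, prediction, and residual (actual - predicted)
train_results = pd.DataFrame({
    "Actual": train_y,
    "Predicted": train_pred,
    "Residual": train_y - train_pred
})

# Same table for the validation set, which the model did not see during fitting
valid_results = pd.DataFrame({
    "Actual": valid_y,
    "Predicted": valid_pred,
    "Residual": valid_y - valid_pred
})

train_results.head()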
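
One property worth knowing when reading residuals: for ordinary least squares with an intercept, the training residuals average to zero by construction, while validation residuals need not. The following sketch checks this on synthetic data; the coefficients and noise level are arbitrary assumptions, not estimates from the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (arbitrary coefficients and noise).
rng = np.random.default_rng(1)
demo_X = rng.normal(size=(200, 3))
demo_y = demo_X @ np.array([1.5, -2.0, 0.5]) + 10.0 + rng.normal(size=200)

tr_X, va_X, tr_y, va_y = train_test_split(
    demo_X, demo_y, test_size=0.4, random_state=1
)
demo_model = LinearRegression().fit(tr_X, tr_y)

# Residual = actual - predicted, matching the convention used above.
train_resid = tr_y - demo_model.predict(tr_X)
valid_resid = va_y - demo_model.predict(va_X)

print(f"Mean training residual:   {train_resid.mean():.6f}")
print(f"Mean validation residual: {valid_resid.mean():.6f}")
```

A validation mean that sits far from zero relative to the spread of the residuals would hint at systematic over- or under-prediction on unseen data.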

## 6. Evaluate Model Performance with RMSE

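Root Mean Squared Error (RMSE) is the square root of the average squared residual, so it is expressed in the same units as `TOTAL_VALUE`. Comparing the training and validation RMSE indicates how well the model generalizes: a validation RMSE far above the training RMSE suggests overfitting.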


In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np

train_rmse = np.sqrt(mean_squared_error(train_y, train_pred))
valid_rmse = np.sqrt(mean_squared_error(valid_y, valid_pred))

print(f"Train RMSE: ${train_rmse:,.2f}")
print(f"Validation RMSE: ${valid_rmse:,.2f}")
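
To show what the RMSE calculation does step by step, here is a small worked sketch with made-up actual and predicted values. It computes the metric by hand and compares it with the `mean_squared_error` route used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only.
actual = np.array([300.0, 350.0, 410.0, 520.0])
predicted = np.array([310.0, 340.0, 400.0, 540.0])

errors = actual - predicted                   # -10, 10, 10, -20
manual_rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((100+100+100+400)/4) = sqrt(175)
sk_rmse = np.sqrt(mean_squared_error(actual, predicted))

print(f"Manual RMSE:  {manual_rmse:.4f}")
print(f"sklearn RMSE: {sk_rmse:.4f}")
```

Because the errors are squared before averaging, the single 20-unit miss contributes more to the total than the three 10-unit misses combined, which is why RMSE is sensitive to large errors.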


## 🔍 Challenge Exercises

1. Try excluding one of the predictors and see whether the validation RMSE improves.
2. Use only 2 or 3 predictors. Does model performance degrade?
3. Try predicting `TOTAL_VALUE` for a hypothetical home.
4. Change the training/validation split ratio and observe how the RMSE changes.
5. (Advanced) Try using `Ridge` or `Lasso` regression instead of standard Linear Regression.
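
As a possible starting point for exercise 5, the sketch below compares plain linear regression with `Ridge` and `Lasso` on synthetic stand-in data. The data, the `alpha` values, and the choice to standardize predictors first are all assumptions for illustration; the penalties depend on predictor scale, which is why a `StandardScaler` is placed in front of the regularized models.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; swap in the notebook's X and y for the exercise.
rng = np.random.default_rng(42)
demo_X = rng.normal(size=(300, 5))
true_coefs = np.array([4.0, 0.0, -3.0, 0.0, 1.0])
demo_y = demo_X @ true_coefs + 20.0 + rng.normal(scale=2.0, size=300)

tr_X, va_X, tr_y, va_y = train_test_split(
    demo_X, demo_y, test_size=0.4, random_state=1
)

# alpha values are arbitrary starting points, not tuned recommendations.
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=1.0)": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso(alpha=0.1)": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
}

valid_rmse_by_model = {}
for name, estimator in candidates.items():
    estimator.fit(tr_X, tr_y)
    pred = estimator.predict(va_X)
    valid_rmse_by_model[name] = float(np.sqrt(mean_squared_error(va_y, pred)))
    print(f"{name}: validation RMSE = {valid_rmse_by_model[name]:.3f}")
```

Trying several `alpha` values and comparing the validation RMSE of each is a natural next step.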
