# BA5b: Diagnostics and Transformations for Predictive Modeling

This notebook builds on BA5 and BA5a. It checks whether the **assumptions of linear regression** hold for the fitted model and explores **transformations** that may improve the model's performance and validity.

Topics covered:
- Residual analysis
- Detecting skewness in predictors and response
- Applying transformations (log, sqrt)
- Checking multicollinearity (VIF)


## 1. Load and Prepare Data

We use the same West Roxbury housing dataset as in BA5. We'll fit a linear regression model and then run diagnostics on it.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

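housing_df = pd.read_csv("/mnt/data/WestRoxbury.csv")

# Drop TAX so it is not used as a predictor
housing_df = housing_df.drop(columns=["TAX"])

# One-hot encode REMODEL; dtype=float keeps the dummies numeric for the VIF step later
housing_df = pd.get_dummies(housing_df, columns=["REMODEL"], drop_first=True, dtype=float)

# Recode a FLOORS value of 15 as 1.5 (presumably a data-entry error)
housing_df["FLOORS"] = housing_df["FLOORS"].replace(15, 1.5)

# Separate predictors from the response, then hold out 40% for validation
X = housing_df.drop(columns=["TOTAL_VALUE"])
y = housing_df["TOTAL_VALUE"]
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)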


## 2. Fit the Linear Model and View Residuals


In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import mean_squared_error

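# Fit ordinary least squares on the training split
model = LinearRegression()
model.fit(train_X, train_y)

# Predict on the validation split; residual = actual - predicted
pred_y = model.predict(valid_X)
residuals = valid_y - pred_y

# Baseline validation RMSE on the original scale, for later comparison
baseline_rmse = np.sqrt(mean_squared_error(valid_y, pred_y))
print(f"Baseline Model RMSE: ${baseline_rmse:,.2f}")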


## 3. Residual Plot

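If the linear model is adequate, the residuals are **randomly scattered** around zero with roughly constant spread. A curved pattern suggests a nonlinear relationship the model has missed, and a funnel shape (spread that grows with the predicted value) suggests non-constant variance.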


In [None]:
import matplotlib.pyplot as plt

plt.scatter(pred_y, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()
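
The visual check above can be backed by a rough numeric one. The sketch below is an illustration on synthetic data, not the West Roxbury results: `pred` and `resid` are made-up stand-ins for `pred_y` and `residuals`, generated so that the noise grows with the prediction. It compares the residual spread in the lower and upper halves of the predictions and correlates the absolute residuals with the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for pred_y and residuals (assumption: not the real data).
# The noise standard deviation grows with the prediction, mimicking a funnel.
pred = rng.uniform(200, 800, size=2000)
resid = rng.normal(0, 0.05 * pred)

# Split the residuals into the lower and upper half of the predictions
order = np.argsort(pred)
low, high = resid[order[:1000]], resid[order[1000:]]
spread_ratio = high.std() / low.std()

# Correlation between |residual| and prediction; near 0 under constant variance
abs_corr = np.corrcoef(pred, np.abs(resid))[0, 1]

print(f"Residual SD ratio (high vs. low predictions): {spread_ratio:.2f}")
print(f"Correlation of |residual| with prediction:   {abs_corr:.2f}")
```

A ratio well above 1, or a clearly positive correlation, points the same way as a funnel in the plot. To try it on the notebook's model, replace the synthetic arrays with `pred_y` and `residuals.to_numpy()`.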


## 4. Check for Skewness in Target Variable

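Linear regression does not require a normally distributed response, but a heavily skewed response often produces skewed residuals with non-constant variance, and a few extreme values can dominate the fit.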


In [None]:
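import seaborn as sns

# Histogram with a kernel density estimate; a long right tail indicates positive skew
sns.histplot(train_y, kde=True)
plt.xlabel("TOTAL_VALUE")
plt.title("Distribution of TOTAL_VALUE (training set)")
plt.show()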
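
Skewness can also be expressed as a number. The following is a minimal sketch on synthetic data: `prices` is a made-up lognormal series standing in for `train_y`, so the printed values say nothing about the actual dataset. It uses pandas' `Series.skew()` before and after a log transformation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic right-skewed "prices" (assumption: lognormal stand-in for train_y)
prices = pd.Series(rng.lognormal(mean=6.0, sigma=0.5, size=5000))

raw_skew = prices.skew()          # sample skewness on the original scale
log_skew = np.log(prices).skew()  # sample skewness after taking logs

print(f"Skewness (raw): {raw_skew:.2f}")
print(f"Skewness (log): {log_skew:.2f}")
```

Values near 0 indicate a roughly symmetric distribution; a common rule of thumb treats an absolute skewness above about 1 as strong. On the real data, `train_y.skew()` and `np.log(train_y).skew()` give the corresponding figures.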


## 5. Apply Log Transformation to Target and Refit Model

Let’s try modeling `log(TOTAL_VALUE)` instead of `TOTAL_VALUE`.


In [None]:
log_train_y = np.log(train_y)
log_valid_y = np.log(valid_y)

log_model = LinearRegression()
log_model.fit(train_X, log_train_y)
log_pred = log_model.predict(valid_X)

# Convert predictions back to the original scale so the RMSE
# is comparable with the untransformed model's RMSE
reverted_pred = np.exp(log_pred)
log_rmse = np.sqrt(mean_squared_error(valid_y, reverted_pred))

print(f"Log-transformed Model RMSE: ${log_rmse:,.2f}")
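
One caveat about the back-transformation: exponentiating an average taken on the log scale recovers the geometric mean, which for right-skewed data sits near the median and below the arithmetic mean. The sketch below illustrates this on synthetic lognormal values (an assumption, not the housing data), which is why `np.exp(log_pred)` may tend to under-predict the mean on the original scale.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic right-skewed values (assumption: lognormal, not the real prices)
y = rng.lognormal(mean=6.0, sigma=0.5, size=10000)

mean_y = y.mean()                            # arithmetic mean, original scale
back_transformed = np.exp(np.log(y).mean())  # exp of the mean on the log scale
median_y = np.median(y)

print(f"Mean:             {mean_y:,.1f}")
print(f"exp(mean(log y)): {back_transformed:,.1f}")
print(f"Median:           {median_y:,.1f}")
```

Whether this matters in practice depends on the goal: compare the log model's RMSE with the untransformed model's RMSE on the validation set before drawing conclusions.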


## 6. Check for Multicollinearity using VIF

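The variance inflation factor (VIF) measures how well each predictor is explained by the other predictors. By common rules of thumb, a VIF above 5 (or, more leniently, 10) suggests multicollinearity, which inflates the variance of the coefficient estimates and makes them unstable, even when predictive accuracy is largely unaffected.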


In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

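# Add an intercept column; cast to float so boolean dummy columns do not break the calculation
X_const = sm.add_constant(train_X.astype(float))

# Column 0 of X_const is the constant, so predictor i sits at index i + 1
vif_df = pd.DataFrame()
vif_df["Feature"] = train_X.columns
vif_df["VIF"] = [variance_inflation_factor(X_const.values, i + 1) for i in range(len(train_X.columns))]
vif_df.sort_values("VIF", ascending=False)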
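
To see what the VIF measures, it can be computed by hand: regress one predictor on all the others, take the resulting R², and compute 1 / (1 - R²). The sketch below does this with NumPy on three synthetic predictors. The helper `vif` and the columns `x1`, `x2`, `x3` are hypothetical names introduced for illustration; `x2` is built as a noisy copy of `x1`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Hypothetical predictors: x2 is nearly a copy of x1, x3 is independent
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # intercept + other predictors
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

The two correlated columns should show large VIFs, while the independent one stays close to 1.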


## 🧪 Challenges

1. Apply a log or square-root transformation to one or more **predictors** and refit the model.
2. Drop high-VIF variables and refit the model. Observe how the validation RMSE changes.
3. Combine transformed predictors and a transformed response in one model.
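
As a possible starting point for the first challenge, the sketch below compares how a square-root and a log transformation change the skewness of a single predictor. The column name `lot_size` and its values are synthetic assumptions, not taken from the dataset; `np.log1p` (log of 1 + x) is used so that zero values would not cause errors.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical right-skewed predictor (assumption: synthetic, not a real column)
df = pd.DataFrame({"lot_size": rng.lognormal(mean=9.0, sigma=0.8, size=5000)})

# Sample skewness on the raw scale and after each candidate transformation
skews = {
    "raw": df["lot_size"].skew(),
    "sqrt": np.sqrt(df["lot_size"]).skew(),
    "log1p": np.log1p(df["lot_size"]).skew(),
}

for name, value in skews.items():
    print(f"{name:>5}: skewness = {value:.2f}")
```

A transformation that reduces skewness does not guarantee a better model, so judge the refitted model by its validation RMSE and its residual plot.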
