# Comprehensive Research: Advanced Housing Regression

## 1. Environment & Data
**Objective**: Predict house prices ($y$) while minimizing RMSLE (root mean squared logarithmic error).
**Challenge**: Outliers, Skewed Target, Multicollinearity.
**Models**: Ridge, Lasso, GradientBoosting, Stacking.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, KFold

sns.set_style("darkgrid")

# --- 1. DATA GENERATION ---
def make_housing_data(n=1000, seed=42):
    # Seeded generator so every run of the notebook produces the same data
    rng = np.random.default_rng(seed)
    # Area: Log-Normal (Most houses small, few mansions)
    area = rng.lognormal(7.5, 0.4, n)
    # Age: uniform integers in 0-99 (upper bound is exclusive)
    age = rng.integers(0, 100, n)
    # Quality: integers in 3-9 (upper bound is exclusive)
    qual = rng.integers(3, 10, n)

    # Target Price: linear in the features plus Gaussian noise
    price = 10000 + area*150 + qual*10000 - age*500 + rng.normal(0, 20000, n)
    price = np.abs(price) # Ensure non-negative

    # Add Outlier (The "Manor")
    area[0] = 50000
    price[0] = 200000 # Unusually cheap for size (Foreclosure?)

    return pd.DataFrame({"GrLivArea": area, "Age": age, "OverallQual": qual}), price

X, y = make_housing_data()
print(f"Data Shape: {X.shape}")

## 2. Deep EDA: Target Analysis
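Inference for linear regression assumes that the residuals are roughly normally distributed with constant variance. A strongly skewed target tends to produce skewed, heteroscedastic residuals, so we inspect the distribution of $y$ first.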

In [None]:
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(y, kde=True)
plt.title("Original Price Distribution (Skewed)")

plt.subplot(1, 2, 2)
stats.probplot(y, plot=plt)
plt.title("Q-Q Plot (Not Normal)")
plt.show()

print(f"Skewness: {pd.Series(y).skew():.4f}")

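**Observation**: The target is right-skewed. We apply a `log1p` transformation to reduce the skew; this also matches the RMSLE objective, which measures error on the log scale.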
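
The effect of `log1p` on skewness can be checked numerically. The sketch below uses hypothetical log-normal draws as stand-in prices (not the notebook's data) and compares the sample skewness before and after the transform. It also confirms that `expm1` inverts `log1p`, which is what lets predictions be mapped back to price units.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed "prices": log-normal draws, not the notebook's data
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.5, size=5000)

# Sample skewness before and after the log1p transform
skew_raw = stats.skew(prices)
skew_log = stats.skew(np.log1p(prices))

print(f"Skewness (raw):   {skew_raw:.3f}")
print(f"Skewness (log1p): {skew_log:.3f}")

# log1p and expm1 are inverses, so log-scale predictions can be mapped back
roundtrip_ok = np.allclose(np.expm1(np.log1p(prices)), prices)
print(f"Round trip exact: {roundtrip_ok}")
```

For data like this the raw skewness is clearly positive, while the transformed values are close to symmetric.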

## 3. Outlier Analysis

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X['GrLivArea'], y=y)
plt.title("Area vs Price (Look for Outliers)")
plt.show()

**Observation**: There is one point at 50,000 sqft with a low price. Because it sits far from the other areas, it is a high-leverage point and can pull the fitted regression slope strongly toward itself. We drop it before modeling.
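
To make the leverage argument concrete, here is a minimal sketch on hypothetical data (the variable names and the uniform area range are assumptions for illustration). It fits an ordinary least squares line with and without one huge, cheap house, and computes each point's leverage as the diagonal of the hat matrix $H = X (X^T X)^{-1} X^T$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: price rises by about 150 per square foot, plus noise
area = rng.uniform(800, 4000, size=200)
price = 10_000 + 150 * area + rng.normal(0, 20_000, size=200)

def ols_slope(x, y):
    # np.polyfit with deg=1 returns [slope, intercept]
    return np.polyfit(x, y, deg=1)[0]

slope_clean = ols_slope(area, price)

# Append one huge, cheap house (mirrors the 50,000 sqft / 200,000 point)
area_out = np.append(area, 50_000)
price_out = np.append(price, 200_000)
slope_outlier = ols_slope(area_out, price_out)

# Leverage: diagonal of the hat matrix, computed as diag(X @ pinv(X))
X_design = np.column_stack([np.ones_like(area_out), area_out])
hat_diag = np.einsum("ij,ji->i", X_design, np.linalg.pinv(X_design))

print(f"Slope without outlier: {slope_clean:.1f}")
print(f"Slope with outlier:    {slope_outlier:.1f}")
print(f"Leverage of outlier:   {hat_diag[-1]:.3f}")
```

A leverage value close to 1 means the fitted line is almost forced through that single point, which is why one observation can flatten the slope for the whole dataset.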

In [None]:
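# 3.1 Hard Drop
# Locate the outlier by position so X (DataFrame) and y (NumPy array) stay aligned,
# even if X ever carries a non-default index
outlier_pos = int(X['GrLivArea'].to_numpy().argmax())
print(f"Dropping Outlier Position: {outlier_pos}")
X = X.drop(X.index[outlier_pos]).reset_index(drop=True)
y = np.delete(y, outlier_pos)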

## 4. Feature Selection (Lasso)
Lasso (L1 regularization) shrinks coefficients and can set those of uninformative features exactly to zero. We use the fitted coefficients as a rough indication of feature importance.

In [None]:
# Scale first: RobustScaler centers on the median and scales by the IQR (not a z-score standardization)
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# LassoCV automatically finds best Alpha
lasso = LassoCV(cv=5, random_state=42).fit(X_scaled, np.log1p(y))

coef = pd.Series(lasso.coef_, index=X.columns)
coef.sort_values().plot(kind='barh')
plt.title("Lasso Feature Importance")
plt.show()
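
The zeroing behaviour of the L1 penalty is easiest to see with a feature that is known to be useless. The sketch below is an illustration on hypothetical data: one informative column, one pure-noise column, and a fixed `alpha` chosen by hand (an assumption; `LassoCV` would select it by cross-validation).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: one informative column and one pure-noise column
signal = rng.normal(size=n)
noise_feature = rng.normal(size=n)
X_demo = np.column_stack([signal, noise_feature])
y_demo = 3.0 * signal + rng.normal(scale=0.5, size=n)

# Put both features on the same scale so the penalty treats them equally
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# alpha is fixed here for illustration only
lasso_demo = Lasso(alpha=0.1).fit(X_demo_scaled, y_demo)

for name, value in zip(["signal", "noise_feature"], lasso_demo.coef_):
    print(f"{name:>14}: {value:.4f}")
```

The informative coefficient is shrunk slightly below its unpenalized value, while the noise coefficient is set to zero rather than merely made small. That exact zero is what distinguishes Lasso from Ridge.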

## 5. Model Stacking
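Stacking combines a linear model (Ridge) with a non-linear model (gradient boosting). A final Ridge estimator learns how to weight their cross-validated predictions.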
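
As a rough illustration of what the final estimator learns, the sketch below fits a small `StackingRegressor` on hypothetical data (a linear trend plus a step effect, both assumptions) and reads the weights the meta-learner assigns to each base model from `final_estimator_.coef_`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Hypothetical data: a linear trend plus a step that a linear model cannot fit
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + 5.0 * (X_demo[:, 1] > 5) + rng.normal(0, 0.5, size=300)

stack_demo = StackingRegressor(
    estimators=[
        ("ridge", RidgeCV()),
        ("gbr", GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)),
    ],
    final_estimator=RidgeCV(),
    cv=5,  # the final estimator is trained on out-of-fold base predictions
)
stack_demo.fit(X_demo, y_demo)

# One weight per base model, in the order the estimators were listed
meta_weights = dict(zip(["ridge", "gbr"], stack_demo.final_estimator_.coef_))
r2_train = stack_demo.score(X_demo, y_demo)

print("Meta-learner weights:", meta_weights)
print(f"Training R^2: {r2_train:.3f}")
```

Training the final estimator on out-of-fold predictions keeps it from simply rewarding whichever base model overfits the training data most.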

In [None]:
# Base Models
ridge = RidgeCV(cv=5)
gbr = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)  # seeded for reproducibility

# Stacking
stack = StackingRegressor(
    estimators=[('ridge', ridge), ('gbr', gbr)],
    final_estimator=RidgeCV()
)

# Wrap in TransformedTargetRegressor to handle Log-Y automatically
model = TransformedTargetRegressor(
    regressor=stack,
    func=np.log1p,
    inverse_func=np.expm1
)

# Evaluate: predictions are inverse-transformed, so this score is RMSE in price units, not RMSLE
kf = KFold(5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='neg_root_mean_squared_error')
print(f"Stacking RMSE: {-scores.mean():.2f}")
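
Since the stated objective is RMSLE, it can help to report that metric alongside RMSE. The following sketch computes both from out-of-fold predictions. It runs on hypothetical stand-in data and swaps in a plain `Ridge` for the stack to keep it fast; the `rmsle` helper is defined here for illustration and is not a scikit-learn function.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)

# Hypothetical stand-in data (one area-like feature), not the notebook's X and y
X_demo = rng.lognormal(7.5, 0.4, size=(300, 1))
y_demo = np.abs(10_000 + 150 * X_demo[:, 0] + rng.normal(0, 20_000, size=300))

# A plain Ridge replaces the stacked model to keep this sketch fast
model_demo = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0), func=np.log1p, inverse_func=np.expm1
)

# Out-of-fold predictions, already mapped back to price units
kf_demo = KFold(5, shuffle=True, random_state=42)
oof_preds = cross_val_predict(model_demo, X_demo, y_demo, cv=kf_demo)

def rmsle(y_true, y_pred):
    # Root mean squared error between log1p-transformed values
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

rmse_value = float(np.sqrt(np.mean((oof_preds - y_demo) ** 2)))
rmsle_value = rmsle(y_demo, oof_preds)

print(f"RMSE  (price units): {rmse_value:.2f}")
print(f"RMSLE (log scale):   {rmsle_value:.4f}")
```

RMSE is dominated by errors on expensive houses, whereas RMSLE penalizes relative error, so the two metrics can rank models differently.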

## 6. Residual Analysis
A well-specified model has residuals scattered randomly around 0. Systematic patterns, such as curvature or a funnel shape, indicate missing features or a mis-specified functional form.
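
A small sketch of how residuals reveal missing information, using hypothetical data with a quadratic effect. A linear-only model leaves residuals that correlate strongly with the omitted $x^2$ term; once that term is added, the correlation vanishes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data with a quadratic effect that a straight line cannot capture
x = rng.uniform(-3, 3, size=400)
y_demo = 2.0 * x + 1.5 * x**2 + rng.normal(0, 1.0, size=400)

# Model A: linear term only (the x^2 information is missing)
X_lin = x.reshape(-1, 1)
res_lin = y_demo - LinearRegression().fit(X_lin, y_demo).predict(X_lin)

# Model B: linear and quadratic terms
X_quad = np.column_stack([x, x**2])
res_quad = y_demo - LinearRegression().fit(X_quad, y_demo).predict(X_quad)

# Correlation between residuals and the omitted term exposes the pattern
corr_lin = np.corrcoef(res_lin, x**2)[0, 1]
corr_quad = np.corrcoef(res_quad, x**2)[0, 1]

print(f"Residual vs x^2 correlation, linear model:    {corr_lin:.3f}")
print(f"Residual vs x^2 correlation, quadratic model: {corr_quad:.3f}")
```

In a residual plot the first model would show a clear U-shape, while the second would show a structureless band around zero.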

In [None]:
# Fit on all data: these are in-sample residuals in price units, so they are optimistic
model.fit(X, y)
preds = model.predict(X)
residuals = y - preds

plt.figure(figsize=(10, 5))
plt.scatter(preds, residuals, alpha=0.5)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel("Predicted Price")
plt.ylabel("Residual (Error)")
plt.title("Residual Plot")
plt.show()