# Lab 2 — Task 2: Multiple Linear Regression (All Features → median_house_value)

Author: LUV-KUSHWAHA

This notebook completes Task 2 of the assignment: build a multiple linear regression model using all available input features to predict `median_house_value`.

It follows the ML pipeline steps with clear, line-by-line comments in code cells so you can copy-paste directly into your assignment notebook/script.

Notes:
- The notebook first attempts to load a local CSV (the common Kaggle variant with `ocean_proximity`). If none is found, it falls back to scikit-learn's California housing dataset, which does NOT include `ocean_proximity` or `total_bedrooms`, uses different feature names (for example `MedInc`, `AveRooms`), and reports the target in units of $100,000 rather than dollars.
- We handle missing values (median imputation for `total_bedrooms`), encode categorical features (`ocean_proximity`) if present, scale numeric features, train a `LinearRegression` model, and report coefficients and evaluation metrics.

## 1) Imports and setup

Import required libraries and set plotting style.

In [None]:
# Standard imports
import numpy as np                                  # numerical ops
import pandas as pd                                 # DataFrame handling
import matplotlib.pyplot as plt                     # plotting
import seaborn as sns                               # nicer plots

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Optional imports for encoding / imputation
from sklearn.impute import SimpleImputer

sns.set(style='whitegrid', context='notebook')      # plotting style

## 2) Data retrieval and collection

Try to load a local CSV (Kaggle-style California housing with `ocean_proximity`). If none is available, fall back to the scikit-learn dataset.

In [None]:
# Attempt to read a local CSV first (common filenames used by many tutorials)
csv_candidates = ['housing.csv', 'california_housing.csv']  # try common names
df = None
for fname in csv_candidates:
    try:
        df = pd.read_csv(fname)                        # try to load CSV
        print(f"Loaded data from local file: {fname}")
        break
    except FileNotFoundError:
        continue                                       # file not present; try the next candidate

if df is None:
    # Fallback: use scikit-learn's California housing dataset
    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing(as_frame=True)
    df = housing.data.copy()
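    df['median_house_value'] = housing.target          # note: scikit-learn reports this in units of $100,000
    # rename the one feature that maps directly to the assignment naming; other names (MedInc, AveRooms, ...) stay as-is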
    if 'HouseAge' in df.columns:
        df = df.rename(columns={'HouseAge': 'housing_median_age'})
    print('Loaded scikit-learn California housing dataset (no ocean_proximity column).')

# Quick top-level inspect
print('\nDataset shape (rows, cols):', df.shape)
print('\nColumns:')
print(list(df.columns))

# Show first rows (visual inspection)
df.head()

## 3) Data cleaning

Check missing values and data types, then handle missing values (we will impute `total_bedrooms` using the median).

In [None]:
# Check missing values per column
missing = df.isnull().sum()
print('Missing counts per column:\n')
print(missing)

# Report fraction missing for total_bedrooms if present
if 'total_bedrooms' in df.columns:
    total_rows = df.shape[0]
    n_missing = missing['total_bedrooms']
    frac = n_missing / total_rows
    print(f"\ntotal_bedrooms missing: {n_missing} rows ({frac:.2%} of dataset)")

# Data types
print('\nData types:')
print(df.dtypes)

# Imputation strategy: median for total_bedrooms (robust to outliers)
if 'total_bedrooms' in df.columns and df['total_bedrooms'].isnull().any():
    imp_med = SimpleImputer(strategy='median')        # median imputer
    df[['total_bedrooms']] = imp_med.fit_transform(df[['total_bedrooms']])
    print('\nApplied median imputation to total_bedrooms.')

# Verify no missing values remain (for numeric columns used later)
print('\nMissing counts after imputation:')
print(df.isnull().sum())

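Observation: we used median imputation for `total_bedrooms` because it is the only column with missing entries in the common Kaggle CSV, and the median is robust to outliers.
Note that the imputer is fit on the full dataset before the train/test split, so the fill value uses a small amount of test-set information; for final reporting, fit it on the training split only. Multiple imputation (MICE) and model-based imputation are alternatives.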
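
To illustrate why the median is the safer default, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the housing data; it only shows how a single extreme value moves the mean fill but not the median fill.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (hypothetical values): one missing entry and one extreme outlier
toy = pd.DataFrame({'total_bedrooms': [100.0, 120.0, np.nan, 110.0, 5000.0]})

# Fit each imputer and read back the value used to fill the gap (row 2)
median_filled = SimpleImputer(strategy='median').fit_transform(toy)[2, 0]
mean_filled = SimpleImputer(strategy='mean').fit_transform(toy)[2, 0]

print(f'median fill: {median_filled}')   # 115.0, close to the typical rows
print(f'mean fill:   {mean_filled}')     # 1332.5, dragged up by the outlier
```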

## 4) Feature design

Prepare the feature matrix X and target y. Encode categorical features (`ocean_proximity`) using one-hot encoding if present. Scale numeric features with StandardScaler.
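
An alternative way to organize these steps, sketched here under assumptions, is to wrap imputation, encoding, and scaling in a scikit-learn `Pipeline` so that all preprocessing statistics are learned from the training split only. The data frame `demo` below is synthetic (column names mimic the Kaggle CSV, values are made up), and the names `demo`, `preprocess`, and `pipe` are illustrative; this is not the approach the cells in this notebook take.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'median_income': rng.uniform(1, 10, n),
    'total_bedrooms': rng.uniform(100, 1000, n),
    'ocean_proximity': rng.choice(['INLAND', 'NEAR BAY', 'NEAR OCEAN'], n),
})
demo.loc[::25, 'total_bedrooms'] = np.nan          # inject a few missing values
demo['median_house_value'] = 40000 * demo['median_income'] + rng.normal(0, 10000, n)

num_cols = ['median_income', 'total_bedrooms']
cat_cols = ['ocean_proximity']

# Numeric columns: median-impute, then standardize. Categorical: one-hot, drop first level.
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), num_cols),
    ('cat', OneHotEncoder(drop='first'), cat_cols),
])
pipe = Pipeline([('prep', preprocess), ('ols', LinearRegression())])

X_demo = demo.drop(columns='median_house_value')
y_demo = demo['median_house_value']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe.fit(X_tr, y_tr)                  # imputer/scaler/encoder statistics come from X_tr only
r2_demo = pipe.score(X_te, y_te)      # R² on rows the preprocessing never saw
print(f'Demo test R²: {r2_demo:.3f}')
```

With this layout the one-hot columns stay as 0/1 indicators, so their coefficients read directly as "category vs. reference" differences in target units.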

In [None]:
# Identify target and drop it from features
target_col = 'median_house_value'
if target_col not in df.columns:
    raise ValueError(f"Target column '{target_col}' not found in dataset")

X = df.drop(columns=[target_col]).copy()            # features DataFrame
y = df[target_col].copy()                            # target Series

# Handle categorical column 'ocean_proximity' if present: treat every non-numeric column as categorical
categorical_cols = [c for c in X.columns if not pd.api.types.is_numeric_dtype(X[c])]
print('Categorical columns detected:', categorical_cols)

if len(categorical_cols) > 0:
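    # Use pandas one-hot encoding for simplicity and interpretability.
    # drop_first avoids collinearity with the intercept; dtype=float keeps the dummy columns
    # numeric (recent pandas versions default to bool, which select_dtypes(np.number) would drop)
    X = pd.get_dummies(X, columns=categorical_cols, drop_first=True, dtype=float)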
    print('\nApplied one-hot encoding to categorical columns. New shape:', X.shape)

# Now scale numeric features using StandardScaler (recommended when features have different scales)
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()   # numeric feature list
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X[numeric_cols]), columns=numeric_cols, index=X.index)  # keep row labels aligned with y

# Final feature matrix
X_final = X_scaled.copy()
print('\nFinal features shape (after encoding & scaling):', X_final.shape)
X_final.head()

Why scaling? `LinearRegression` does not need scaling for correctness: predictions and R² are the same either way. Scaling puts every coefficient in the same unit (change in the target per one standard deviation of the feature), so coefficient magnitudes can be compared across predictors measured in very different units.
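
As a quick check of that claim, the sketch below fits ordinary least squares on synthetic features with very different scales, once raw and once standardized, and compares the predictions. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (made-up data)
rng = np.random.default_rng(1)
X_raw = np.column_stack([rng.uniform(0, 10, 300),       # small-scale feature
                         rng.uniform(0, 5000, 300)])    # large-scale feature
y_syn = 3.0 * X_raw[:, 0] + 0.002 * X_raw[:, 1] + rng.normal(0, 0.5, 300)

X_std = StandardScaler().fit_transform(X_raw)

# Fit the same OLS model on raw and on standardized inputs
pred_raw = LinearRegression().fit(X_raw, y_syn).predict(X_raw)
pred_std = LinearRegression().fit(X_std, y_syn).predict(X_std)

same_predictions = np.allclose(pred_raw, pred_std)
print('Predictions identical with and without scaling:', same_predictions)
```

Only the coefficients change between the two fits, not the fitted values.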

## 5) Algorithm selection & loss

We use Ordinary Least Squares Linear Regression and evaluate with Mean Squared Error (MSE) and R². This choice is consistent with Task 1 and gives interpretable coefficients for each (standardized) feature.
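
For intuition, ordinary least squares chooses the intercept and coefficients that minimize the sum of squared residuals. The sketch below, on synthetic data with made-up names, solves that least-squares problem directly with `np.linalg.lstsq` and checks that `LinearRegression` arrives at the same solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (made-up data)
rng = np.random.default_rng(2)
X_syn = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y_syn = 4.0 + X_syn @ true_beta + rng.normal(0, 0.1, 200)

# Least squares by hand: prepend a column of ones so the first weight is the intercept
A = np.column_stack([np.ones(len(X_syn)), X_syn])
beta_hat, *_ = np.linalg.lstsq(A, y_syn, rcond=None)

# Same problem through scikit-learn
ols = LinearRegression().fit(X_syn, y_syn)
matches = np.allclose(beta_hat, np.r_[ols.intercept_, ols.coef_])

# The loss being minimized, evaluated at the solution
mse_train = np.mean((y_syn - A @ beta_hat) ** 2)
print('lstsq and LinearRegression agree:', matches)
print(f'Training MSE: {mse_train:.4f}')
```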

## 6) Model learning

Split the data into training and test sets (80/20), then fit the model on the training set.

In [None]:
# Split data and fit
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.20, random_state=42)

model = LinearRegression()                           # ordinary least squares model
model.fit(X_train, y_train)                          # fit model on training data

print('Model trained.')
print(f'Intercept: {model.intercept_:.4f}')

# Map coefficients to feature names (note: features are standardized so coefficients refer to 1-std changes)
coef_series = pd.Series(model.coef_, index=X_final.columns).sort_values(ascending=False)
print('\n10 largest coefficients (feature : coef):')
print(coef_series.head(10))
print('\n10 smallest (most negative) coefficients (feature : coef):')
print(coef_series.tail(10))

Interpretation note: Because we standardized the features (zero mean, unit variance), each coefficient is the expected change in `median_house_value` for a one-standard-deviation increase in that feature, holding the others constant. One-hot columns (e.g., `ocean_proximity_X`) pass through the same scaler, so their coefficients are also per standard deviation of the 0/1 column; dividing a coefficient by that column's standard deviation gives the expected difference between that category and the reference (dropped) category.
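
A minimal sketch of that conversion, on synthetic data with hypothetical names (`income`, `near_ocean`, `scaler_demo`): the target is built so that the dummy adds about 50,000 on average, and dividing the standardized coefficient by `StandardScaler.scale_` recovers that per-unit effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
income = rng.uniform(1, 10, 500)                      # continuous feature (made up)
near_ocean = rng.integers(0, 2, 500).astype(float)    # 0/1 dummy (made up)
X_raw = np.column_stack([income, near_ocean])

# Target constructed so that near_ocean adds 50,000 on average
y_syn = 40000 * income + 50000 * near_ocean + rng.normal(0, 5000, 500)

# Standardize both columns, then fit OLS on the standardized inputs
scaler_demo = StandardScaler().fit(X_raw)
std_model = LinearRegression().fit(scaler_demo.transform(X_raw), y_syn)

per_sd = std_model.coef_                    # change in y per 1 SD of each column
per_unit = per_sd / scaler_demo.scale_      # change in y per 1 original unit

print('Per-SD coefficients:  ', per_sd.round(0))
print('Per-unit coefficients:', per_unit.round(0))
```

For the dummy column, the per-unit value is the difference between "category present" and the reference category in target units.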

## 7) Model evaluation

Evaluate on the test set (MSE, RMSE, R²) and present diagnostic plots.
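
As a reminder of what the three metrics measure, here is a small worked example that computes them by hand and compares against scikit-learn. The four actual and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])    # made-up actual values
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])    # made-up predictions

errors = y_true_demo - y_pred_demo
mse_manual = np.mean(errors ** 2)                           # average squared error
rmse_manual = np.sqrt(mse_manual)                           # back in the target's units
ss_res = np.sum(errors ** 2)                                # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot                             # share of variance explained

print(f'MSE  = {mse_manual:.3f}')    # 0.375
print(f'RMSE = {rmse_manual:.3f}')   # 0.612
print(f'R²   = {r2_manual:.3f}')     # 0.925

# The hand-computed values agree with scikit-learn's implementations
agrees = (np.isclose(mse_manual, mean_squared_error(y_true_demo, y_pred_demo))
          and np.isclose(r2_manual, r2_score(y_true_demo, y_pred_demo)))
```

RMSE is usually the easiest of the three to report, since it is in the same units as `median_house_value`.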

In [None]:
# Predictions on test set
y_pred = model.predict(X_test)

# Metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'Test MSE: {mse:.4f}')
print(f'Test RMSE: {rmse:.4f}')
print(f'Test R²: {r2:.4f}')

# Residuals
residuals = y_test - y_pred
print('\nResiduals summary:')
print(pd.Series(residuals).describe())

In [None]:
# Plot 1: Predicted vs Actual
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linewidth=2)
plt.xlabel('Actual median_house_value')
plt.ylabel('Predicted median_house_value')
plt.title('Predicted vs Actual — Multiple Linear Regression (Test set)')
plt.show()

# Plot 2: Residuals vs Fitted
plt.figure(figsize=(8, 6))
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values (predictions)')
plt.ylabel('Residuals (actual - predicted)')
plt.title('Residuals vs Fitted')
plt.show()

# Plot 3: Residuals distribution
plt.figure(figsize=(8, 4))
sns.histplot(residuals, kde=True, bins=40, color='purple')
plt.xlabel('Residual (actual - predicted)')
plt.title('Residuals distribution (Test set)')
plt.show()

## Model coefficients (detailed)

List all coefficients with feature names so they can be interpreted in the assignment. Because numeric features were standardized, coefficients correspond to a 1‑standard‑deviation change in the predictor.

In [None]:
# Create a clear table of feature, coefficient, and absolute importance
coef_table = pd.DataFrame({
    'feature': X_final.columns,
    'coefficient': model.coef_
})
coef_table['abs_coef'] = coef_table['coefficient'].abs()
coef_table = coef_table.sort_values(by='abs_coef', ascending=False)
coef_table.reset_index(drop=True, inplace=True)

# Show top 20 most important features by absolute coefficient
coef_table.head(20)

## Comparison to Task 1 and conclusion

Summarize improvement over single-feature model and final recommendations to include in your assignment.
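
One way to make the comparison concrete is to fit both models on the same split and tabulate their test metrics side by side. The sketch below does this on synthetic data; the column names mimic the housing data, `median_income` is only assumed to be the Task 1 feature, and the resulting numbers are not the real housing results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (made up): the target depends on all three features
rng = np.random.default_rng(4)
n = 400
X_cmp = pd.DataFrame({
    'median_income': rng.uniform(1, 10, n),
    'housing_median_age': rng.uniform(1, 50, n),
    'latitude': rng.uniform(32, 42, n),
})
y_cmp = (40000 * X_cmp['median_income'] + 1500 * X_cmp['housing_median_age']
         - 8000 * X_cmp['latitude'] + rng.normal(0, 20000, n))

X_tr, X_te, y_tr, y_te = train_test_split(X_cmp, y_cmp, test_size=0.2, random_state=42)

def evaluate(columns):
    """Fit OLS on the given columns and return test RMSE and R²."""
    m = LinearRegression().fit(X_tr[columns], y_tr)
    pred = m.predict(X_te[columns])
    return {'rmse': np.sqrt(mean_squared_error(y_te, pred)), 'r2': r2_score(y_te, pred)}

# One row per model, one column per metric
comparison = pd.DataFrame({
    'single feature (median_income)': evaluate(['median_income']),
    'all features': evaluate(list(X_cmp.columns)),
}).T
print(comparison)
```

In the real notebook, the same table can be built from the Task 1 metrics and the `mse`, `rmse`, and `r2` values computed above.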

In [None]:
print('Summary:')
print('- Using multiple features typically improves predictive performance relative to the single-feature model because more relevant information is provided to the learner.')
print('- Report the metrics above (MSE, RMSE, R²) and discuss whether the R² and RMSE indicate good predictive performance for your use-case.')

print('\nRecommendations:')
print('1) If residual diagnostics show heteroscedasticity or non-linearity, try target-transformations (log) or polynomial features.')
print('2) Consider regularized regression (Ridge, Lasso) if overfitting or coefficient instability is a concern.')
print('3) For inference, provide confidence intervals or run bootstrap to quantify uncertainty; for production, evaluate on a held-out validation set.')
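
A possible starting point for recommendation 2 is sketched below: it compares OLS, Ridge, and Lasso by 5-fold cross-validated RMSE. The data is synthetic with one nearly collinear feature pair, and the `alpha` values are arbitrary placeholders rather than tuned settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up) with a nearly collinear pair of features
rng = np.random.default_rng(5)
X_syn = rng.normal(size=(300, 6))
X_syn[:, 5] = X_syn[:, 0] + rng.normal(0, 0.05, 300)
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(0, 1.0, 300)

candidates = {
    'OLS': LinearRegression(),
    'Ridge(alpha=1.0)': Ridge(alpha=1.0),     # alpha chosen arbitrarily for the sketch
    'Lasso(alpha=0.1)': Lasso(alpha=0.1),     # alpha chosen arbitrarily for the sketch
}

cv_rmse = {}
for name, estimator in candidates.items():
    pipe = make_pipeline(StandardScaler(), estimator)    # scaler is refit inside each fold
    scores = cross_val_score(pipe, X_syn, y_syn, cv=5,
                             scoring='neg_root_mean_squared_error')
    cv_rmse[name] = -scores.mean()                       # scorer returns negative RMSE

for name, value in cv_rmse.items():
    print(f'{name}: CV RMSE = {value:.3f}')
```

In practice the regularization strength would be chosen by a search (for example `RidgeCV` or `LassoCV`) rather than fixed by hand.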

## Deliverables (what to paste into your assignment)

Include:
- The code cells above (with comments) in your notebook.
- The imputation strategy and rationale (we used median for total_bedrooms).
- The final evaluation metrics (MSE, RMSE, R²) and coefficient table.
- Diagnostic plots and an interpretation of whether linear regression assumptions hold.
