# Session 5-6: Linear Regression

# Exercise

You are a data scientist tasked with developing a predictive model for house prices using the Ames Housing dataset from Kaggle:

🔗 https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Your objective is to build and evaluate a linear regression model that predicts house sale prices based on selected features.

## Step 1. Data Preparation

Goal: Load data, explore it briefly, select features, and handle missing values.

In [None]:
# Step 1: Data preparation

import pandas as pd
import numpy as np

# Load the Ames Housing training dataset
data = pd.read_csv("train.csv")

In [None]:
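# Quick look at the dataset structure
print(data.head())
# info() prints its summary directly and returns None,
# so wrapping it in print() would add a stray "None" line
data.info()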

In [None]:
# Target variable (what we want to predict)
y = data["SalePrice"]

# Select a small set of meaningful numerical features
# (keeping it simple for linear regression)
features = [
    "OverallQual",   # Overall material and finish quality
    "GrLivArea",     # Above ground living area (sq ft)
    "GarageCars",    # Size of garage in car capacity
    "TotalBsmtSF",   # Total basement area
    "YearBuilt"      # Year the house was built
]

X = data[features]

In [None]:
# Check missing values
print(X.isnull().sum())

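# Handle missing values by filling with the column median
# (the median is robust to outliers)
# Note: these medians are computed on all rows, including rows that
# later land in the test set; for a strict evaluation, compute them
# on the training split only
X = X.fillna(X.median())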

## Step 2. Train–Test Split

Goal: Split the data into 70% training and 30% testing.

In [None]:
# Step 2: Train-test split
from sklearn.model_selection import train_test_split

# Split data: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,
    random_state=42
)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)
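The median fill in Step 1 uses every row, so information from the test rows leaks into the fill values. A stricter variant learns the medians from the training split and applies them to both splits. The sketch below illustrates the idea on a tiny made-up table; the names `X_train_demo` and `X_test_demo` and all values are hypothetical and are not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two splits; the values are made up for illustration
X_train_demo = pd.DataFrame({
    "GarageCars": [1.0, 2.0, np.nan, 2.0],
    "TotalBsmtSF": [800.0, np.nan, 1200.0, 1000.0],
})
X_test_demo = pd.DataFrame({
    "GarageCars": [np.nan, 3.0],
    "TotalBsmtSF": [900.0, np.nan],
})

# Learn the fill values from the training rows only
train_medians = X_train_demo.median()

# Apply the same training medians to both splits
X_train_filled = X_train_demo.fillna(train_medians)
X_test_filled = X_test_demo.fillna(train_medians)

print(train_medians)
print(X_test_filled)
```

In the exercise itself, this would mean moving the `fillna` call after `train_test_split` and using `X_train.median()` as the fill values for both `X_train` and `X_test`.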

## Step 3. Model Development

Goal: Build and train a linear regression model.

In [None]:
# Step 3: Model development

from sklearn.linear_model import LinearRegression

# Initialize the linear regression model
model = LinearRegression()

# Train the model using the training data
model.fit(X_train, y_train)

# Display learned coefficients
coefficients = pd.Series(model.coef_, index=features)
print("Model coefficients:")
print(coefficients)
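`LinearRegression` fits an ordinary least squares model: it picks the intercept and coefficients that minimize the sum of squared differences between predictions and targets. To make that concrete, the sketch below solves the same kind of problem directly with NumPy on synthetic data. The data, the names `X_demo` and `y_demo`, and the "true" coefficients are assumptions chosen for illustration; this is not how scikit-learn is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two features, known coefficients, no noise
X_demo = rng.normal(size=(50, 2))
true_coef = np.array([3.0, -2.0])
true_intercept = 5.0
y_demo = X_demo @ true_coef + true_intercept

# Add a column of ones so the intercept is estimated as well
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Solve the least-squares problem: minimize ||X_design @ beta - y_demo||^2
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

intercept_hat, coef_hat = beta[0], beta[1:]
print("intercept:", intercept_hat)
print("coefficients:", coef_hat)
```

Because the synthetic targets contain no noise, the recovered values match the chosen intercept and coefficients up to floating-point error. On the fitted model from this step, the corresponding quantities are `model.intercept_` and `model.coef_`.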

## Step 4. Model Evaluation

Goal: Evaluate model performance using RMSE, MAE, and R².

In [None]:
# Step 4: Model evaluation

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Predict house prices on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model performance on test data:")
print("RMSE:", rmse)
print("MAE:", mae)
print("R^2:", r2)
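To see what the three metrics measure, it can help to compute them by hand on a few numbers. The sketch below uses four made-up prices (in thousands) and follows the standard definitions: RMSE is the square root of the mean squared error, MAE is the mean absolute error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The arrays `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np

# Made-up actual and predicted prices, in thousands
y_true_demo = np.array([200.0, 150.0, 300.0, 250.0])
y_pred_demo = np.array([210.0, 140.0, 280.0, 260.0])

errors = y_true_demo - y_pred_demo  # [-10, 10, 20, -10]

# RMSE: square root of the mean squared error (penalizes large errors more)
rmse_demo = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors
mae_demo = np.mean(np.abs(errors))

# R^2: share of the target's variance explained by the predictions
ss_res = np.sum(errors ** 2)                                # 700
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)    # 12500
r2_demo = 1 - ss_res / ss_tot

print("RMSE:", rmse_demo)  # sqrt(175), about 13.23
print("MAE:", mae_demo)    # 12.5
print("R^2:", r2_demo)     # 0.944
```

RMSE and MAE are in the same units as the target, so on the real data they read as a typical error in sale price. RMSE exceeds MAE here because the single error of 20 is weighted more heavily after squaring.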

## Step 5. Interpretation and Insights

Goal: Interpret coefficients, strengths, and limitations.

In [None]:
# Step 5: Interpretation and insights

# Combine coefficients with feature names
interpretation = pd.DataFrame({
    "Feature": features,
    "Coefficient": model.coef_
}).sort_values(by="Coefficient", ascending=False)

# Raw coefficients depend on each feature's units, so this ordering
# is not a ranking of feature importance
print("Linear regression coefficients (most positive first):")
print(interpretation)
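Raw coefficients are expressed per unit of each feature, so a feature measured in square feet and one measured on a 1-10 scale cannot be compared by coefficient size alone. One common remedy is to standardize the features before fitting, which puts every coefficient on a "per standard deviation" scale. The sketch below shows the effect on synthetic data; the feature names, scales, and coefficients are invented for illustration, and `ols_coefficients` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features on very different scales (values are made up)
area = rng.normal(1500, 500, size=n)    # square feet
quality = rng.normal(6, 1.5, size=n)    # rating on a roughly 1-10 scale
price = 100 * area + 20000 * quality + rng.normal(0, 5000, size=n)

X_raw = np.column_stack([area, quality])

def ols_coefficients(X, y):
    """Return the OLS slopes (intercept dropped) via least squares."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

raw_coef = ols_coefficients(X_raw, price)

# Standardize each column to mean 0 and standard deviation 1
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
std_coef = ols_coefficients(X_std, price)

print("raw coefficients:         ", raw_coef)
print("standardized coefficients:", std_coef)
```

In this toy setup the raw coefficient of `quality` is far larger than that of `area`, yet after standardization `area` has the larger coefficient, because a one-standard-deviation change in area moves the price more. The same idea could be applied to the exercise by standardizing `X_train` before calling `model.fit`.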

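## Interpretation notes:
- A positive coefficient means the predicted price rises as that feature increases, holding the other features fixed; a negative coefficient means it falls.
- Each coefficient is expressed in units of SalePrice per unit of its feature, so magnitudes are comparable only across features on similar scales. A larger magnitude indicates a stronger influence only once the features have been standardized.
- Linear regression assumes:
  * Linear relationships between the features and the target
  * No strong multicollinearity among the features
  * Homoscedastic errors (constant error variance)

Limitations:
- Cannot capture nonlinear effects unless transformed features are added
- Sensitive to outliers
- Feature interactions are ignored unless interaction terms are added explicitly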
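
The multicollinearity assumption listed above can be checked numerically. One widely used diagnostic is the variance inflation factor (VIF): regress each feature on the remaining features and compute 1 / (1 - R²) for that regression. The sketch below is a minimal NumPy version run on synthetic columns; the function name `vif` and the data are assumptions made for illustration, and the threshold mentioned afterwards is a rule of thumb rather than a fixed standard.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns (with an intercept)
        design = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ beta
        r2 = 1 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly a copy of a
c = rng.normal(size=300)                 # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

Values near 1 suggest a feature is close to independent of the others, while values above roughly 5 to 10 are often read as a warning sign. Here the two nearly identical columns produce large VIFs and the unrelated column stays near 1. For the exercise, the same function could be applied to `X_train.to_numpy()`.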