
# Task‑01 — Linear Regression on House Prices (Kaggle)

**Goal:** Build a simple **Linear Regression** model to predict `SalePrice` (house price) using **only**:
- `GrLivArea` — above-ground living area (square feet)
- `BedroomAbvGr` — number of bedrooms above ground
- `FullBath` — number of full bathrooms

We’ll proceed step-by-step and explain *what* we do and *why*. This notebook assumes you have these files in the **same folder**:
- `train.excel` (or `train.xlsx` / `train.csv`)
- `test.excel` (or `test.xlsx` / `test.csv`)
- `sample_submission.excel` (optional, just for reference)
- `data_description.txt` (optional, for field descriptions)

> **Note:** The official Kaggle competition distributes CSV files. `.excel` is not a standard extension, so if your files carry it, consider renaming them to `.xlsx`. The loader below picks the reader based on the file extension.



## Glossary (plain‑English definitions)

- **Feature (predictor, input variable):** A measurable property used to make predictions. Here: `GrLivArea`, `BedroomAbvGr`, `FullBath`.
- **Target (label, output variable):** The value we want to predict. Here: `SalePrice`.
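- **Model:** A mathematical function that maps features to a prediction. **Linear regression** predicts the target as a weighted sum of the features plus an intercept (a straight line with one feature, a plane or hyperplane with several).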
- **Training:** Feeding data to a model so it can learn patterns that map inputs to outputs.
- **Validation / Hold‑out set:** A portion of data not used during training, to fairly evaluate how well the model generalizes.
- **Overfitting:** When a model memorizes the training data but performs poorly on new data.
- **Underfitting:** When a model is too simple and fails to capture important patterns.
- **Coefficient (weight):** The number the model learns for each feature; it says how much the prediction changes when that feature increases by 1 unit (holding others fixed).
- **Intercept (bias):** The base value of the prediction when all features are zero.
- **Residual:** The difference between the true target and the model’s prediction (`y_true − y_pred`).
- **RMSE (Root Mean Squared Error):** A common error metric; lower is better. Roughly, the typical size of prediction errors, expressed in the target’s units (dollars here), with large errors penalized more heavily.
- **R² (Coefficient of Determination):** Measures how much variance in the target is explained by the model (1.0 is perfect; can be negative if very poor).
- **Imputation:** Filling in missing values so algorithms can run.
- **Baseline:** A simple method (like predicting the mean) to sanity-check whether our model actually learns anything.
- **Assumptions (for linear regression):** Roughly linear relationships, errors with constant spread (homoscedasticity), limited multicollinearity among features, and independent errors.
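
To make the **Residual**, **RMSE**, and **R²** entries concrete, here is a small worked example. The prices are made up for illustration (they are not from the Kaggle data), and the code uses only the standard library so each formula stays visible.

```python
import math

# Hypothetical true prices and model predictions (illustrative values only)
y_true = [200_000, 150_000, 300_000, 250_000]
y_pred = [210_000, 140_000, 280_000, 260_000]

# Residual: y_true - y_pred for each house
residuals = [t - p for t, p in zip(y_true, y_pred)]

# RMSE: square root of the mean squared residual
mse = sum(r ** 2 for r in residuals) / len(residuals)
rmse = math.sqrt(mse)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
mean_true = sum(y_true) / len(y_true)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(residuals)                      # [-10000, 10000, 20000, -10000]
print(round(rmse, 2), round(r2, 4))   # 13228.76 0.944
```

The later steps get the same two numbers from `mean_squared_error` and `r2_score` in scikit-learn.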


In [None]:

# If running locally for the first time, uncomment to install dependencies:
# !pip install -q pandas scikit-learn matplotlib numpy openpyxl

from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


In [None]:

def smart_read(path: str):
    """
    Read a file that may be .csv, .xlsx/.xls, or a non-standard '.excel' extension.
    """
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == '.csv':
        return pd.read_csv(p)
    if suffix in ('.xlsx', '.xls', '.xlsm', '.excel'):
        # '.excel' is non-standard; treat it like an Excel workbook
        return pd.read_excel(p)
    raise ValueError(f"Unsupported file extension for {path}. Please use .csv or .xlsx/.xls.")



## Step 1 — Load the dataset

We’ll load the **training** and **test** datasets. The training data has both **features** and the **target** (`SalePrice`). The test data has features only.


In [None]:

TRAIN_PATH = 'train.excel'  # change if your file is named differently (e.g., 'train.csv' or 'train.xlsx')
TEST_PATH  = 'test.excel'   # change if needed
SAMPLE_SUB_PATH = 'sample_submission.excel'  # optional

train_df = smart_read(TRAIN_PATH)
test_df = smart_read(TEST_PATH)

# Peek at the columns and first few rows
print('Train shape:', train_df.shape)
print('Test shape :', test_df.shape)
display(train_df.head())
display(test_df.head())



## Step 2 — Select features and the target

For this baseline, we’ll use **exactly three features**:
- `GrLivArea` (square footage above ground)
- `BedroomAbvGr` (number of bedrooms above ground)
- `FullBath` (number of full bathrooms)

The **target** is `SalePrice`.


In [None]:

FEATURES = ['GrLivArea', 'BedroomAbvGr', 'FullBath']
TARGET = 'SalePrice'

missing_cols = [c for c in FEATURES + [TARGET] if c not in train_df.columns]
if missing_cols:
    raise KeyError(f"Missing columns in training data: {missing_cols}. Check your file names or dataset.")

X = train_df[FEATURES].copy()
y = train_df[TARGET].copy()

# Basic sanity checks
print('Missing in X:\n', X.isna().sum())
print('Missing in y:', y.isna().sum())



## Step 3 — Quick sanity checks (EDA)

We’ll look at simple scatter plots to check that each feature’s relationship with `SalePrice` looks *roughly* linear and points in a plausible direction (for example, larger homes selling for more). In practice, more features (like neighborhoods, quality scores) matter a lot in this competition, but we intentionally keep it simple for learning.


In [None]:

# Scatter plots: each feature vs SalePrice
for col in FEATURES:
    plt.figure()
    plt.scatter(train_df[col], y, alpha=0.5)
    plt.xlabel(col)
    plt.ylabel('SalePrice')
    plt.title(f'{col} vs SalePrice')
    plt.show()
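
As a numeric companion to the scatter plots, the Pearson correlation of each column with the target gives a quick (and rough) measure of linear association. The sketch below runs on a synthetic stand-in frame, not the Kaggle data; the `Noise` column and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in frame (not the Kaggle data)
area = rng.uniform(800, 3000, size=300)
demo = pd.DataFrame({
    'GrLivArea': area,
    'Noise': rng.normal(size=300),                               # unrelated column
    'SalePrice': 100 * area + rng.normal(0, 20_000, size=300),   # roughly linear in area
})

# Pearson correlation of every column with the target
corr_with_target = demo.corr()['SalePrice'].drop('SalePrice')
print(corr_with_target.round(2))
```

On the notebook's data, the equivalent call would be `train_df[FEATURES + [TARGET]].corr()[TARGET]`. A correlation near zero does not rule out a nonlinear relationship, so the plots remain worth a look.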



## Step 4 — Train/validation split

We hold out a **validation** set to estimate generalization. This helps detect **overfitting** and **underfitting**.


In [None]:

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train.shape, X_val.shape



## Step 5 — Handle missing values (imputation)

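scikit-learn’s `LinearRegression` raises an error on inputs containing `NaN`s, so we impute any missing values with the column **median**, which is less sensitive to outliers than the mean. The medians are computed on the training split only and reused for the validation split, so no information leaks from validation into training. For these specific columns, missingness is rare, but we do this defensively.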


In [None]:

X_train = X_train.fillna(X_train.median(numeric_only=True))
X_val   = X_val.fillna(X_train.median(numeric_only=True))  # use train stats
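
An alternative to calling `fillna` by hand is to put the imputation inside a scikit-learn pipeline, so the medians learned during `fit` are reused automatically at prediction time. This is a minimal sketch on a tiny made-up frame (not the Kaggle data), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up frame with one missing value (illustrative only)
X_demo = pd.DataFrame({
    'GrLivArea': [1000.0, 1500.0, np.nan, 2000.0],
    'BedroomAbvGr': [2, 3, 3, 4],
    'FullBath': [1, 2, 2, 2],
})
y_demo = pd.Series([100_000, 150_000, 160_000, 200_000])

# The imputer learns per-column medians in fit() and reuses them in predict()
pipe = make_pipeline(SimpleImputer(strategy='median'), LinearRegression())
pipe.fit(X_demo, y_demo)

learned_medians = pipe.named_steps['simpleimputer'].statistics_
print(learned_medians)  # medians of GrLivArea, BedroomAbvGr, FullBath

# A new row with a missing value is filled with the training median before predicting
X_new = pd.DataFrame({'GrLivArea': [np.nan], 'BedroomAbvGr': [3], 'FullBath': [2]})
print(pipe.predict(X_new))
```

The design benefit is that training statistics cannot accidentally be recomputed on validation or test data, because the pipeline owns them.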



## Step 6 — Build a **baseline**

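Before fitting the model, set up a trivial baseline: always predict the **mean** of `SalePrice` from the training set. Its validation R² will be close to 0 (or slightly negative), since it explains none of the variance. If our linear model can’t beat this, something is wrong.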


In [None]:

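# Predict the training-set mean for every validation row
baseline_pred = np.full(len(y_val), y_train.mean(), dtype=np.float64)
# Take the square root explicitly: the `squared=False` argument was deprecated
# in scikit-learn 1.4 and is rejected by later releases
baseline_rmse = np.sqrt(mean_squared_error(y_val, baseline_pred))
baseline_r2   = r2_score(y_val, baseline_pred)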
print(f'Baseline RMSE: {baseline_rmse:,.2f}')
print(f'Baseline R^2 : {baseline_r2:,.4f}')
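
scikit-learn also ships a ready-made mean predictor, `DummyRegressor`, which some people find tidier than building the array by hand. The sketch below uses made-up prices (not the Kaggle data) and shows that the mean baseline scores an R² of exactly 0 when the validation mean happens to equal the training mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Made-up prices (illustrative only)
y_tr = np.array([100_000.0, 200_000.0, 300_000.0])
y_va = np.array([150_000.0, 250_000.0])

# DummyRegressor ignores the features, so placeholder zeros are enough
dummy = DummyRegressor(strategy='mean')
dummy.fit(np.zeros((len(y_tr), 1)), y_tr)
dummy_pred = dummy.predict(np.zeros((len(y_va), 1)))

dummy_r2 = r2_score(y_va, dummy_pred)
print(dummy_pred)  # every prediction equals the training mean
print(dummy_r2)
```

On real splits the two means differ slightly, so the baseline R² usually lands a little below 0.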



## Step 7 — Train a **Linear Regression** model

**Why linear regression?** It’s simple, fast, and interpretable:
- Each feature gets a **coefficient** representing how much price changes per unit of that feature (holding others constant).
- We can inspect these coefficients to understand the model’s reasoning.


In [None]:

linreg = LinearRegression()
linreg.fit(X_train, y_train)

print('Intercept (bias):', linreg.intercept_)
coef_table = pd.DataFrame({'feature': FEATURES, 'coefficient': linreg.coef_})
display(coef_table)
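
To see what the intercept and coefficients mean in practice, the sketch below fits a model on synthetic data generated from a known rule (two invented features, not the Kaggle data) and then reproduces a prediction by hand as intercept plus the sum of coefficient times feature value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known rule (not the Kaggle data):
# price = 20_000 + 100 * area + 5_000 * bedrooms
X_demo = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4], [1200, 1]], dtype=float)
y_demo = 20_000 + 100 * X_demo[:, 0] + 5_000 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)
print(model.intercept_, model.coef_)  # recovers the rule up to floating-point error

# A prediction is just intercept + sum(coefficient * feature value)
house = np.array([1800.0, 3.0])
manual = model.intercept_ + np.dot(model.coef_, house)
auto = model.predict(house.reshape(1, -1))[0]
print(manual, auto)
```

Because this toy data is noise-free, the fit recovers the generating rule; on real data the coefficients are estimates, and correlated features (such as area and bedroom count) can make individual coefficients hard to interpret in isolation.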



## Step 8 — Evaluate on the validation set

We’ll compute **RMSE** and **R²**. Lower RMSE is better; higher R² is better (max 1.0).


In [None]:

y_val_pred = linreg.predict(X_val)
# Take the square root explicitly: `squared=False` was deprecated in scikit-learn 1.4
rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
r2   = r2_score(y_val, y_val_pred)

print(f'Validation RMSE: {rmse:,.2f}')
print(f'Validation R^2 : {r2:,.4f}')



## Step 9 — Residual diagnostics (basic)

We want residuals (errors) to look like random noise with fairly constant spread. Systematic patterns suggest nonlinearity or missing features.


In [None]:

residuals = y_val - y_val_pred

plt.figure()
plt.scatter(y_val_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle='--')
plt.xlabel('Predicted SalePrice')
plt.ylabel('Residual (y_true - y_pred)')
plt.title('Residuals vs Predicted')
plt.show()
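
The plot can be complemented by a rough numeric check of "fairly constant spread": compare the standard deviation of residuals for the cheaper and pricier halves of the predictions. The sketch below is an informal heuristic rather than a formal test, and it runs on synthetic residuals whose noise is deliberately made to grow with price.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictions and residuals (not the Kaggle data); the noise is
# deliberately scaled with the predicted price to mimic heteroscedasticity
pred = rng.uniform(100_000, 400_000, size=1_000)
resid = rng.normal(0.0, 0.1 * pred)

# Compare residual spread below and above the median prediction
cutoff = np.median(pred)
spread_low = resid[pred <= cutoff].std()
spread_high = resid[pred > cutoff].std()

print(f'Std of residuals, cheaper half : {spread_low:,.0f}')
print(f'Std of residuals, pricier half : {spread_high:,.0f}')
print(f'Ratio (high / low)             : {spread_high / spread_low:.2f}')
```

With the notebook's variables, `pred` and `resid` would be `y_val_pred` and `residuals.to_numpy()`. A ratio well above 1 hints that errors grow with price, which is one motivation for modeling the log of `SalePrice`.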



## Step 10 — Retrain on **all** training data and predict the **test** set

Finally, train on the full dataset and produce predictions for `test`. We’ll impute missing values using medians from the full training set, and save a Kaggle‑ready `submission.csv`.


In [None]:

# Refit on ALL training data
X_full = train_df[FEATURES].copy().fillna(train_df[FEATURES].median(numeric_only=True))
y_full = train_df[TARGET].copy()

linreg_full = LinearRegression()
linreg_full.fit(X_full, y_full)

# Prepare test features
X_test = test_df[FEATURES].copy()
X_test = X_test.fillna(X_full.median(numeric_only=True))

test_preds = linreg_full.predict(X_test)

submission = pd.DataFrame({
    'Id': test_df['Id'],
    'SalePrice': test_preds
})
submission_path = 'submission.csv'
submission.to_csv(submission_path, index=False)
submission.head()
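
Before uploading, it can help to sanity-check the submission frame. A plain linear model can output negative prices for unusual inputs, and missing or duplicated rows are easy to overlook. The helper `check_submission` below is hypothetical (it is not part of Kaggle's tooling or of this notebook), and the frames are made up for illustration.

```python
import numpy as np
import pandas as pd


def check_submission(sub: pd.DataFrame, expected_rows: int) -> list:
    """Hypothetical helper: return a list of problems (empty if none were found)."""
    if list(sub.columns) != ['Id', 'SalePrice']:
        return [f'unexpected columns: {list(sub.columns)}']
    problems = []
    if len(sub) != expected_rows:
        problems.append(f'expected {expected_rows} rows, got {len(sub)}')
    if sub['Id'].duplicated().any():
        problems.append('duplicate Id values')
    if sub['SalePrice'].isna().any():
        problems.append('missing SalePrice values')
    if (sub['SalePrice'] <= 0).any():
        problems.append('non-positive SalePrice values')
    return problems


# Made-up frames for illustration (not real predictions)
good = pd.DataFrame({'Id': [1461, 1462], 'SalePrice': [150_000.0, 210_000.0]})
bad = pd.DataFrame({'Id': [1461, 1461], 'SalePrice': [np.nan, -5_000.0]})

print(check_submission(good, expected_rows=2))  # []
print(check_submission(bad, expected_rows=2))
```

In the notebook this would be called as `check_submission(submission, expected_rows=len(test_df))`.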



## Where to go next

- Try a **log transform** of `SalePrice`: fit on `np.log1p(y)` and apply `np.expm1` to the predictions to get back to prices. This reduces the influence of very expensive homes.
- Add more informative features (overall quality, year built, garage, neighborhood) and try **regularized** linear models (**Ridge** / **Lasso**) to handle multicollinearity and feature selection.
- Consider **cross-validation** (e.g., `KFold`) instead of a single train/validation split for more reliable estimates.
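
The three suggestions above can be combined in a few lines. The sketch below is one possible arrangement, not a tuned solution: it wraps a `Ridge` model in `TransformedTargetRegressor` so the log transform and its inverse are applied automatically, then scores it with 5-fold cross-validation. The data is a synthetic stand-in (not the Kaggle data), and `alpha=1.0` is an arbitrary starting value.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in data (not the Kaggle data): price grows multiplicatively
area = rng.uniform(800, 3000, size=200)
beds = rng.integers(1, 6, size=200)
X_demo = np.column_stack([area, beds])
y_demo = 50_000 * np.exp(0.0005 * area + 0.05 * beds + rng.normal(0, 0.1, size=200))

# Fit on log1p(price); predictions are mapped back to prices with expm1
model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)

# 5-fold cross-validation; scikit-learn reports RMSE negated so that higher is better
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                         scoring='neg_root_mean_squared_error')
rmse_per_fold = -scores

print(rmse_per_fold.round(0))
print(f'Mean RMSE: {rmse_per_fold.mean():,.0f} (+/- {rmse_per_fold.std():,.0f})')
```

The spread across folds gives a sense of how much a single train/validation split could mislead.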
