
In [None]:
# Run this cell first to download the Kaggle dataset with kagglehub.
# Note: this environment differs from Kaggle's Python environment, so some
# libraries used by the notebook may be missing.
import kagglehub
olymahmud_housing_prices_dataset_path = kagglehub.dataset_download('olymahmud/housing-prices-dataset')

print('Data source import complete.')


## Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import joblib   # for saving the model

pandas & numpy → handle and manipulate data.

matplotlib.pyplot → for visualizations.

sklearn.model_selection → split dataset and perform cross-validation.

LinearRegression → the model we’re using.

StandardScaler & OneHotEncoder → preprocess numeric and categorical features.

ColumnTransformer → apply different transformations to different columns.

Pipeline → combine preprocessing and modeling into one reusable object.

metrics → evaluate model performance.

joblib → save the trained pipeline for production use.

In [None]:
RANDOM_STATE = 42

In [None]:
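# Step 1: Load the dataset
# This path exists on Kaggle; in other environments (e.g. Colab) the CSV is
# expected under the directory returned by kagglehub.dataset_download above.
df = pd.read_csv("/kaggle/input/housing-prices-dataset/housing_prices_dataset.csv")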

Loads the dataset into a pandas DataFrame.

In production, you could also load from a database, cloud storage, or API.
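As a rough illustration of the database option, the sketch below reads a table into a DataFrame with `pd.read_sql_query`. The in-memory SQLite database, the `houses` table, and the three toy rows are assumptions made up for this example; they are not part of the housing dataset.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a production data store.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "area": [1500, 2000, 2400],
    "bedrooms": [2, 3, 4],
    "price": [300000, 420000, 510000],
}).to_sql("houses", conn, index=False)

# Load the table into a DataFrame, much like read_csv does for a file.
df_sql = pd.read_sql_query("SELECT * FROM houses", conn)
conn.close()

print(df_sql.shape)
print(df_sql.columns.tolist())
```

The rest of the notebook would work the same way on `df_sql`, since every later step only needs a DataFrame.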

In [None]:
# Step 2: Inspect dataset
print(df.head())
print(df.info())
print(df.isnull().sum())  # check missing values

head() → see the first few rows, check columns.

info() → check data types and number of non-null entries.

isnull().sum() → detect missing values.

In [None]:
# Step 3: Handle missing values (production: log, then impute/drop)
df = df.dropna()

Drops any rows with missing values.

In production, you might instead impute missing values: mean or median for numeric features, mode (most frequent value) for categorical ones.
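A minimal sketch of the imputation alternative, using scikit-learn's `SimpleImputer`. The toy frame and the choice of median and most-frequent strategies are assumptions for illustration; the notebook itself drops rows instead.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (illustrative data only).
toy = pd.DataFrame({
    "area": [1500.0, np.nan, 2400.0],
    "furnishingstatus": pd.Series(["furnished", np.nan, "furnished"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")         # numeric: fill with the median
cat_imputer = SimpleImputer(strategy="most_frequent")  # categorical: fill with the mode

area_filled = num_imputer.fit_transform(toy[["area"]])
status_filled = cat_imputer.fit_transform(toy[["furnishingstatus"]])

print(area_filled.ravel())
print(status_filled.ravel())
```

In a real pipeline the imputers would normally sit inside the numeric and categorical transformers, so they are fitted on training data only.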

In [None]:
# Step 4: Split features and target
X = df.drop("price", axis=1)
y = df["price"]

X = all input features (everything except the target).

y = the target variable (house price).

In [None]:
# Step 5: Separate categorical and numerical features
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()
numerical_features = X.select_dtypes(exclude=["object"]).columns.tolist()

Automatically separates categorical vs numerical features.

Categorical features need encoding; numeric features may need scaling.
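The split relies purely on column dtypes, which the following sketch shows on a toy frame. The `zone_code` column is hypothetical and is included to show one caveat: a category stored as integers ends up in the numerical group.

```python
import pandas as pd

# Toy frame; zone_code is a made-up column that is categorical in meaning
# but stored as integers.
toy = pd.DataFrame({
    "area": [1500, 2000],
    "mainroad": pd.Series(["yes", "no"], dtype=object),
    "zone_code": [1, 2],
})

cat_cols = toy.select_dtypes(include=["object"]).columns.tolist()
num_cols = toy.select_dtypes(exclude=["object"]).columns.tolist()

print(cat_cols)
print(num_cols)
```

If a dataset has such integer-coded categories, listing the column names by hand is a safer choice than the automatic split.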

In [None]:
# Step 6: Define preprocessing for numerical and categorical
numeric_transformer = Pipeline(steps=[
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("encoder", OneHotEncoder(drop="first", handle_unknown="ignore"))
])

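Numerical: Standardize values to mean=0, std=1 → important when feature ranges differ. For plain LinearRegression this does not change the predictions, but it puts the coefficients on a comparable scale.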

Categorical: One-hot encode → converts categories into numeric format.

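drop="first" → avoids the dummy variable trap (perfect collinearity between the one-hot columns and the intercept).

handle_unknown="ignore" → prevents errors if a new category appears during prediction. The unseen category is encoded as all zeros, which with drop="first" is the same encoding as the dropped baseline category.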

In [None]:
# Step 7: Build ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

Applies different preprocessing pipelines to numeric vs categorical features automatically.

This ensures the model gets clean numeric input.

In [None]:
# Step 8: Build pipeline (preprocessing + model)
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

Bundles preprocessing and the model into one reusable object.

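This prevents data leakage: the preprocessing steps are fitted only on the data passed to fit() (the training set, or each training fold during cross-validation), and the same fitted transformations are then applied to test and new data.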
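A small sketch of why this matters, on made-up numbers: the scaler inside the pipeline learns its mean from the training rows only, so an extreme held-out row cannot influence the preprocessing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data; pretend the last row is unseen data.
X_toy = np.array([[1.0], [2.0], [3.0], [100.0]])
y_toy = np.array([1.0, 2.0, 3.0, 100.0])
X_tr, y_tr = X_toy[:3], y_toy[:3]

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipe.fit(X_tr, y_tr)

learned_mean = pipe.named_steps["scaler"].mean_[0]
print("Mean learned by the scaler:", learned_mean)  # training rows only
print("Mean of all rows:", X_toy.mean())            # never seen by the scaler

# The held-out row is transformed with the training statistics, then predicted.
held_out_pred = pipe.predict(X_toy[3:])[0]
print("Prediction for held-out row:", held_out_pred)
```

Scaling the full dataset before splitting would instead bake the held-out row into the mean and standard deviation.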

In [None]:
# Step 9: Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

Splits data into training set (80%) and test set (20%).

random_state ensures reproducibility.
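The effect of `random_state` can be checked directly. This sketch uses ten toy rows and calls `train_test_split` twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same seed give the same split.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(len(a_train), len(a_test))
print(np.array_equal(a_test, b_test))
```

Without a fixed seed, each run would shuffle differently and the reported test metrics would vary from run to run.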

In [None]:
# Step 10: Train model
model.fit(X_train, y_train)

Fits the pipeline:

Preprocesses X_train (scaling + encoding).

Fits LinearRegression on preprocessed data.
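After fitting, the individual steps remain accessible through `named_steps`. The sketch below uses a toy pipeline (not the housing model) with data generated from y = 2x + 1 to show that the regression coefficients refer to the standardized features, not the raw ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 2 * X_toy.ravel() + 1  # exact linear relationship

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipe.fit(X_toy, y_toy)

regressor = pipe.named_steps["regressor"]
coef = regressor.coef_[0]
intercept = regressor.intercept_

# On standardized inputs the slope is 2 * std(x) and the intercept is mean(y).
print("Coefficient:", coef)
print("Intercept:", intercept)
```

The same access pattern applies to the housing pipeline through its "preprocessor" and "regressor" step names.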

In [None]:
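# Step 11: Cross-validation (for more reliable evaluation)
# Run on the training split only so the test set stays an untouched holdout;
# cross_val_score clones the pipeline, so the model fitted in Step 10 is unchanged.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")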
print("Cross-validated R² scores:", cv_scores)
print("Mean CV R²:", np.mean(cv_scores))

In [None]:
# Step 12: Evaluate on holdout test set
y_pred = model.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))
print("Test R²:", r2_score(y_test, y_pred))
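For reference, both metrics can be reproduced by hand. The sketch below uses four made-up values and also derives RMSE, which is in the same units as the target and therefore easier to read than MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy actual and predicted values (illustrative only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # back in the units of the target

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse)
print("RMSE:", rmse)
print("R² (manual):", r2_manual)
print("R² (sklearn):", r2_score(y_true, y_hat))
```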

In [None]:
# Step 13: Predict on new data (raw dict, just like production API input)
new_house = pd.DataFrame([{
    "area": 2000,
    "bedrooms": 3,
    "bathrooms": 2,
    "stories": 2,
    "mainroad": "yes",
    "guestroom": "no",
    "basement": "yes",
    "hotwaterheating": "no",
    "airconditioning": "yes",
    "parking": 1,
    "prefarea": "yes",
    "furnishingstatus": "semi-furnished"
}])

predicted_price = model.predict(new_house)
print("Predicted Price:", predicted_price[0])

In [None]:
# Step 14: Save the trained pipeline (model + preprocessing bundled)
joblib.dump(model, "linear_regression_housing.pkl")
print("Model saved as linear_regression_housing.pkl")

Saves the pipeline as a single file.

In production, you can load it anywhere (joblib.load) and predict new houses without retraining.
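A minimal round-trip sketch with a toy pipeline: dump, load, and confirm the restored object predicts the same values. The file name `toy_pipeline.pkl` and the temporary directory are assumptions for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipe.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)  # no retraining needed

same = np.allclose(pipe.predict(X_toy), restored.predict(X_toy))
new_pred = restored.predict(np.array([[5.0]]))[0]
print(same, new_pred)
```

Since joblib files are pickle-based, load only files from a trusted source, and keep the scikit-learn version at load time consistent with the one used for saving.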

In [None]:
# 🔹 Predicted vs Actual Visualization
plt.figure(figsize=(8,8))
plt.scatter(y_test, y_pred, alpha=0.6, color="blue", edgecolors="k", label="Predictions")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color="red", linestyle="--", linewidth=2, label="Perfect Prediction Line")

plt.xlabel("Actual House Prices", fontsize=12)
plt.ylabel("Predicted House Prices", fontsize=12)
plt.title("Predicted vs Actual House Prices", fontsize=14)
plt.legend()
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()