<a href="https://colab.research.google.com/github/Sg134-ch/Machine-Learning-Projects-/blob/main/ML_Theory_Assignment_01_Q2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# House Price Prediction – Bias vs Variance Study
This notebook trains Linear, Ridge and Decision Tree models and compares underfitting vs overfitting.


In [None]:
!pip install pandas scikit-learn numpy matplotlib



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from google.colab import files
import io

In [None]:
# Load dataset
print("Please upload the 'housing.csv' file:")
uploaded = files.upload()

for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
    housing = pd.read_csv(io.BytesIO(uploaded[fn]))
housing = housing.dropna()

Please upload the 'housing.csv' file:


Saving housing.csv to housing (1).csv
User uploaded file "housing (1).csv" with length 1423529 bytes


In [None]:
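# Separate the features from the target column
X = housing.drop('median_house_value', axis=1)
y = housing['median_house_value']
# One-hot encode categorical columns; drop_first avoids a redundant dummy column
X = pd.get_dummies(X, drop_first=True)
# Hold out 20% of the rows for testing (fixed seed for a repeatable split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)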

In [None]:
# Fit the scaler on the training split only, then reuse its statistics on the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Decision Tree': DecisionTreeRegressor(max_depth=10)
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    rmse_train = np.sqrt(mean_squared_error(y_train, train_pred))
    rmse_test = np.sqrt(mean_squared_error(y_test, test_pred))
    mae_test = mean_absolute_error(y_test, test_pred)
    results.append([name, rmse_train, rmse_test, mae_test])

pd.DataFrame(results, columns=['Model','RMSE Train','RMSE Test','MAE Test'])

           Model    RMSE Train     RMSE Test      MAE Test
0         Linear  68487.306669  69297.716691  50413.433308
1          Ridge  68487.314293  69297.746637  50412.415336
2  Decision Tree  47437.297752   61236.20029  40491.283477


📊 Bias–Variance Analysis Summary (Point-wise)

Underfitting (High Bias):
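Linear Regression underfits the data: its training RMSE (about 68,487) and test RMSE (about 69,298) are both high and nearly equal, which suggests the model is too simple to capture the non-linear relationships in house prices. Ridge with alpha=1.0 produces almost identical errors, so it shares the same high bias.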

Overfitting (High Variance):
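The Decision Tree (max_depth=10) shows the clearest sign of overfitting: its training RMSE (about 47,437) is far below its test RMSE (about 61,236), so part of what it learned is specific to the training data. The gap is much wider than for the linear models, whose train and test errors differ by less than 1,000.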

Best Generalizing Model:
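The Decision Tree gave the lowest test RMSE (about 61,236) and the lowest test MAE (about 40,491), despite its larger train-test gap. Ridge Regression did not improve on Linear Regression: their test RMSE values (69,297.75 vs 69,297.72) are effectively the same. Regularization mainly reduces variance, and the linear models here are limited by bias, so alpha=1.0 brings no measurable benefit.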
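
The gap between training and test error can be examined by sweeping the tree depth. The sketch below does this on synthetic data, not on housing.csv, so its numbers are illustrative only. The helper name `depth_sweep` and the toy data generator are assumptions introduced for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error


def depth_sweep(X, y, depths, seed=42):
    """Return (depth, train RMSE, test RMSE) tuples for each tree depth."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    rows = []
    for depth in depths:
        tree = DecisionTreeRegressor(max_depth=depth, random_state=seed)
        tree.fit(X_tr, y_tr)
        rmse_tr = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
        rmse_te = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))
        rows.append((depth, rmse_tr, rmse_te))
    return rows


# Synthetic stand-in data (an assumption, not the housing dataset):
# a non-linear target with additive noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = (np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2
          + rng.normal(0, 0.3, size=500))

# max_depth=None lets the tree grow until every leaf is pure.
sweep = depth_sweep(X_demo, y_demo, depths=[1, 3, 5, 10, None])
for depth, rmse_tr, rmse_te in sweep:
    print(f"max_depth={depth}: train RMSE={rmse_tr:.3f}, test RMSE={rmse_te:.3f}")
```

A shallow tree tends to have similar (high) errors on both splits, while an unrestricted tree drives training error to zero and leaves the test error well above it. Running the same sweep on the housing features would show where max_depth=10 sits on that curve.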

Minimum Tasks Performed

Handling Missing Values:
Missing values were handled using dropna() to remove incomplete records.
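
Before dropping rows it can help to check how much data is lost. A minimal sketch on a toy frame follows; the column names `rooms` and `price` are hypothetical and not the housing.csv schema.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns (not the real housing.csv schema).
toy = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [100.0, 150.0, np.nan, 200.0],
})

missing_per_column = toy.isna().sum()   # NaN count per column
cleaned = toy.dropna()                  # drop every row that contains a NaN
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

If the dropped fraction is large, imputation may be worth considering instead, but that is a different technique from the one used in this notebook.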

Encoding Categorical Variables:
The categorical variable ocean_proximity was one-hot encoded with pd.get_dummies(drop_first=True), which drops one category to avoid a redundant column.
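
As a small illustration of what one-hot encoding with `drop_first=True` produces, here is a sketch on a toy frame. The category labels and the `income` column are made up for the example and are not taken from housing.csv.

```python
import pandas as pd

# Toy data: category labels and the 'income' column are illustrative assumptions.
toy = pd.DataFrame({
    "ocean_proximity": ["inland", "coastal", "inland", "island"],
    "income": [3.2, 5.1, 2.8, 4.0],
})

# Numeric columns pass through unchanged; each remaining category becomes
# its own indicator column. drop_first=True omits the first category in
# sorted order ('coastal'), which is then implied when all indicators are 0.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Three categories therefore yield two indicator columns, and the result has three columns in total.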

Feature Scaling:
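StandardScaler was used to standardize every feature column (including the one-hot columns) to zero mean and unit variance. The scaler was fit on the training split only and then applied to the test split.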
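
The effect of fitting the scaler on the training split only can be checked directly. The sketch below uses random stand-in data (an assumption, not the housing features).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data (an assumption, not the housing features).
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Training columns end up with mean 0 and std 1 (up to floating point);
# test columns are only approximately standardized.
print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

The test split is close to, but not exactly, zero mean and unit variance. That is expected, because its statistics were never used for fitting, which keeps test information out of the preprocessing step.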

Training and Error Reporting:
Linear, Ridge and Decision Tree models were trained and evaluated using RMSE (Train & Test) and MAE (Test).
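
For reference, RMSE and MAE can be computed by hand and compared with the scikit-learn functions used in this notebook. The four target and prediction values below are made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Made-up values for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 440.0])

errors = y_pred - y_true                      # [10, -10, 0, 40]
rmse_manual = np.sqrt(np.mean(errors ** 2))   # sqrt(1800 / 4) = sqrt(450)
mae_manual = np.mean(np.abs(errors))          # 60 / 4 = 15

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(f"RMSE: manual={rmse_manual:.3f}, sklearn={rmse_sklearn:.3f}")
print(f"MAE:  manual={mae_manual:.3f}, sklearn={mae_sklearn:.3f}")
```

Because RMSE squares each error before averaging, the single error of 40 dominates it, while MAE weights all errors linearly. This is why RMSE is always at least as large as MAE on the same predictions.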

Real-World ML Issue

Outliers and Noisy Features:
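Extreme house prices and noisy inputs distort learning. Squared-error training weights large residuals heavily, so a few extreme values can pull a linear fit away from the bulk of the data, while a deep tree can create leaves that memorize individual noisy points. Both effects reduce the reliability of predictions on real-world data.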