**Github Link:**
https://github.com/PRASHIRAWAL/House-Price-Prediction-Regression-Underfitting-Overfitting-

In [6]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error
from IPython.display import display

In [7]:
data_path = "/content/housing.csv"  # update path if needed
df = pd.read_csv(data_path)

In [8]:
# Fill missing numerical values with each column's median
# (note: the medians are computed on the full dataset, before the train/test split)
df.fillna(df.median(numeric_only=True), inplace=True)
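
Because the medians above are computed before the split, the test rows contribute to the fill values. A common alternative is to impute inside the pipeline so the medians are learned from the training rows only. The following is a minimal sketch of that idea, not the notebook's method: the toy DataFrame and the names `toy`, `numeric_steps`, and `learned_median` are hypothetical stand-ins for the numeric housing columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the numeric housing columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rooms": rng.normal(5.0, 1.0, size=200),
    "income": rng.normal(4.0, 2.0, size=200),
})
toy.loc[::10, "rooms"] = np.nan  # inject missing values in every 10th row

# Split first, so the test rows play no part in fitting the imputer.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# Impute, then scale; both steps learn their statistics in fit().
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

train_scaled = numeric_steps.fit_transform(toy_train)  # fit on train only
test_scaled = numeric_steps.transform(toy_test)        # reuse train statistics

# Median the imputer learned for the first column ("rooms").
learned_median = numeric_steps.named_steps["impute"].statistics_[0]
```

In the notebook's setup, a sub-pipeline like `numeric_steps` could take the place of the bare `StandardScaler()` in the `"num"` entry of the `ColumnTransformer`. Whether that changes the reported metrics noticeably is not something this sketch establishes.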

In [9]:
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

# Identify numerical and categorical columns
num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(exclude=np.number).columns

In [10]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),   # feature scaling
        ("cat", OneHotEncoder(handle_unknown='ignore'), cat_cols)       # categorical encoding
    ]
)

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [12]:
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42)
}

results = []

In [13]:
for model_name, model in models.items():

    pipeline = Pipeline([
        ("preprocessing", preprocessor),
        ("model", model)
    ])

    pipeline.fit(X_train, y_train)

    # Predictions
    train_pred = pipeline.predict(X_train)
    test_pred = pipeline.predict(X_test)

    # Metrics
    rmse_train = np.sqrt(mean_squared_error(y_train, train_pred))
    rmse_test = np.sqrt(mean_squared_error(y_test, test_pred))
    mae_test = mean_absolute_error(y_test, test_pred)

    results.append([
        model_name,
        rmse_train,
        rmse_test,
        mae_test
    ])
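
For reference, the two metrics computed in the loop follow the standard definitions below, where $y_i$ is the true value, $\hat{y}_i$ the prediction, and $n$ the number of samples (these symbols are introduced here for illustration and do not appear in the notebook):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$

Because RMSE squares each error before averaging, it weights large errors more heavily than MAE does, so a model can rank differently on the two metrics.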

In [14]:
results_df = pd.DataFrame(
    results,
    columns=["Model", "RMSE (Train)", "RMSE (Test)", "MAE (Test)"]
)

print("\nModel Comparison Table:\n")
display(results_df)


Model Comparison Table:



|   | Model | RMSE (Train) | RMSE (Test) | MAE (Test) |
|---|---|---|---|---|
| 0 | Linear Regression | 68433.937367 | 70060.521845 | 50670.738241 |
| 1 | Ridge Regression | 68434.995896 | 70067.3465 | 50677.170993 |
| 2 | Decision Tree Regressor | 0.0 | 69078.765224 | 43550.404797 |


## Model Performance Analysis

**Key Observations from the results:**

*   **Overfitting (High Variance):** The **Decision Tree Regressor** shows clear signs of overfitting. Its RMSE (Train) of 0.0 indicates that it has memorized the training data, while its RMSE (Test) is about 69,079. A gap that large means the tree has fit noise and sample-specific patterns in the training set in addition to generalizable relationships. Its test error is nevertheless no worse than that of the linear models: it has the lowest RMSE (Test) and the lowest MAE (Test), about 43,550, of the three. The high variance shows up in the train-test gap, not in poor test performance relative to the other models.

*   **Underfitting (High Bias):** None of the models shows signs of severe underfitting (high bias) in this comparison. Underfitting would appear as high RMSE on *both* the training and test sets, meaning the model is too simple to capture the underlying patterns in the data. Linear and Ridge Regression do have moderately high RMSE on both sets (about 68,434 on train and about 70,060 on test), but their test error is close to their training error, which suggests a reasonable fit without severe underfitting. They might benefit from richer features or a more flexible model, but they are not drastically underfit.

*   **Linear and Ridge Regression:** These two models perform almost identically (RMSE (Test) of about 70,060.5 versus 70,067.3), so the L2 penalty at `alpha=1.0` has almost no effect here. Both show a small gap between training and test error (roughly 1,600), which suggests a balanced bias-variance trade-off in this scenario. Their overall predictive power is still limited compared to an ideal model: RMSE stays near 70,000 on unseen data.
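
The train-test gap used as the diagnostic above can be explored by limiting tree depth. The sketch below is illustrative only: it uses hypothetical synthetic data (a noisy sine curve) instead of the housing dataset, and the depth values are arbitrary choices, so the numbers it prints say nothing about the results in the table.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3.0, 3.0, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0.0, 0.3, size=500)

Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Compare a very shallow tree, a moderate tree, and an unlimited tree.
gaps = {}
for depth in (1, 4, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(Xtr, ytr)
    rmse_tr = np.sqrt(mean_squared_error(ytr, tree.predict(Xtr)))
    rmse_te = np.sqrt(mean_squared_error(yte, tree.predict(Xte)))
    gaps[depth] = (rmse_tr, rmse_te)
    print(f"max_depth={depth}: train RMSE={rmse_tr:.3f}, test RMSE={rmse_te:.3f}")
```

With `max_depth=None` the tree keeps splitting until it reproduces the training targets, which matches the zero training RMSE seen in the table. Constraining `max_depth` (or `min_samples_leaf`) trades some training accuracy for a smaller gap; a suitable value for the housing data would have to be chosen by validation.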