House Price Prediction – Model Training Notebook

This notebook contains the complete machine learning pipeline for predicting house prices.
It covers:
- Data loading and inspection
- Exploratory Data Analysis (EDA)
- Outlier detection and removal
- Feature–target split
- Training multiple regression models
- Model evaluation and comparison
- Saving the best-performing model for deployment


In [None]:
# 1. Import Required Libraries (Code Cell)

import pandas as pd
import numpy as np

# 2. Load Dataset (Code Cell)

df = pd.read_csv("house_prices_dataset.csv")
df.head()


In [None]:
# 3. Dataset Information (EDA) (Code Cell)

print("Dataset information:")
df.info()

print("\nSummary statistics:")
df.describe()


In [None]:
# 4. Check Missing Values (Code Cell)

print("Checking null values.....")
df.isnull().sum()


# 5. Outlier Detection and Removal (Markdown Cell)

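Outliers can distort a regression fit and inflate error metrics.
We use the **Interquartile Range (IQR)** method to detect and remove them:
for each numerical column, rows with a value below Q1 - 1.5 × IQR or above
Q3 + 1.5 × IQR are dropped, where IQR = Q3 - Q1.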
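
As a small worked illustration of the IQR rule, the sketch below applies it to a made-up series. The values in `prices` are hypothetical and are not taken from the dataset; the quartiles rely on the default linear interpolation of pandas `quantile`.

```python
import pandas as pd

# Hypothetical toy data: six typical values and one extreme value.
prices = pd.Series([100, 110, 120, 130, 140, 150, 1000])

q1 = prices.quantile(0.25)   # 115.0
q3 = prices.quantile(0.75)   # 145.0
iqr = q3 - q1                # 30.0

lower_bound = q1 - 1.5 * iqr  # 70.0
upper_bound = q3 + 1.5 * iqr  # 190.0

# Keep only the values inside the fences; 1000 falls outside and is dropped.
kept = prices[(prices >= lower_bound) & (prices <= upper_bound)]
print(kept.tolist())
```

Only the extreme value lies outside the interval [70, 190], so the six typical values are kept.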


In [None]:
# 6. Outlier Removal Function (Code Cell)

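def remove_outliers(dataframe):
    """Drop rows with a value outside the 1.5 * IQR fences in any numeric column."""
    df_clean = dataframe.copy()

    # Only numeric columns are checked: quantiles are undefined for text columns.
    numeric_cols = df_clean.select_dtypes(include="number").columns

    for col in numeric_cols:
        # Quartiles are recomputed on the rows that survived the previous columns.
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Rows with NaN in this column are dropped too, because
        # comparisons against NaN evaluate to False.
        df_clean = df_clean[
            (df_clean[col] >= lower_bound) &
            (df_clean[col] <= upper_bound)
        ]

    return df_clean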


In [None]:
# 7. Apply Outlier Removal (Code Cell)

df_clean = remove_outliers(df)

print("Before outlier removal:", df.shape)
print("After outlier removal:", df_clean.shape)

df_clean.describe()


# 8. Feature and Target Selection

- Features (X): All columns except "price"
- Target (y): "price"



In [None]:
# 9. Feature & Target Split

X = df_clean.drop("price", axis=1)
y = df_clean["price"]



In [None]:
# 10. Train–Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


# 11. Model Training & Evaluation

We train and evaluate the following regression models:
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor

Evaluation metrics used:
- R² Score
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
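
To make the three metrics concrete, here is a minimal sketch that computes them by hand with NumPy and compares the results with the scikit-learn functions. The arrays `y_true` and `y_pred` are hypothetical values chosen for easy arithmetic, not predictions from this notebook.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Hypothetical true and predicted prices, for illustration only.
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_pred = np.array([210.0, 240.0, 310.0, 340.0])

errors = y_true - y_pred

# MAE: average absolute error (same units as the target).
mae = np.mean(np.abs(errors))

# MSE: average squared error (penalizes large errors more heavily).
mse = np.mean(errors ** 2)

# R²: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae, "| sklearn:", mean_absolute_error(y_true, y_pred))
print("MSE:", mse, "| sklearn:", mean_squared_error(y_true, y_pred))
print("R2: ", r2, "| sklearn:", r2_score(y_true, y_pred))
```

A higher R² is better (1.0 is a perfect fit), while lower MAE and MSE are better.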


In [None]:
# 12. Train Multiple Models

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42)
}

result_clean = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    result_clean[name] = {
        "R2": r2_score(y_test, predictions),
        "MAE": mean_absolute_error(y_test, predictions),
        "MSE": mean_squared_error(y_test, predictions)
    }

result_clean


# 13. Model Selection

Based on the test-set metrics in result_clean, "Linear Regression" achieved the
highest R² score and the lowest MAE and MSE of the three models.
It was therefore selected as the final model for deployment.
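
The comparison can also be done programmatically rather than by eye. The sketch below ranks models by R² from a dictionary shaped like the results dictionary used in this notebook. The names and scores in `example_results` are hypothetical placeholders, not results from this notebook.

```python
import pandas as pd

# Hypothetical scores for illustration only; not results from this notebook.
example_results = {
    "Model A": {"R2": 0.80, "MAE": 12.0, "MSE": 250.0},
    "Model B": {"R2": 0.91, "MAE": 8.5, "MSE": 120.0},
    "Model C": {"R2": 0.88, "MAE": 9.0, "MSE": 150.0},
}

# Transpose so that each model is a row and each metric is a column,
# then sort so the model with the highest R² comes first.
comparison = pd.DataFrame(example_results).T.sort_values("R2", ascending=False)

best_name = comparison.index[0]
print(comparison)
print("Best model by R2:", best_name)
```

Ranking by a single metric is a simplification; if R² and the error metrics disagree, the choice needs a judgment call.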


In [None]:
# 14. Train Final Model

model = LinearRegression()
model.fit(X_train, y_train)


# 15. Save Trained Model

The trained model is saved using "joblib" so that a FastAPI application can load it later.


In [None]:
# 16. Save Model to File

import joblib

joblib.dump(model, "house_price_model.pkl")
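
# 17. Reload Check (illustrative sketch)

Before wiring a saved model into an application, it can help to confirm that it reloads and predicts the same values. The sketch below shows a joblib round trip on a tiny stand-in model. The toy data, the file name "toy_model.pkl", and the variable names are assumptions for illustration, not part of the pipeline above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data following y = 2 * x + 1.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)
    loaded_model = joblib.load(path)

# The reloaded model should reproduce the original predictions.
new_input = np.array([[5.0]])
original_pred = toy_model.predict(new_input)
reloaded_pred = loaded_model.predict(new_input)
print(original_pred, reloaded_pred)
```

When loading the real model, the input passed to predict needs the same feature columns, in the same order, as the X used for training. The file should also be loaded with a scikit-learn version compatible with the one used to save it.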
