## **Nigeria Housing Price Prediction Using scikit-learn**
### **Introduction**

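This notebook revisits the earlier from-scratch Ridge regression implementation and rebuilds the same house-price prediction task with scikit-learn: preprocessing in a `Pipeline`, `Ridge` as the model, and `GridSearchCV` to tune the regularization strength `alpha`.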



### **About the Dataset**

The dataset comes from Kaggle, where it was contributed by [chik0di](http://kaggle.com/chik0di). It was scraped from [Nigeria Property Centre](https://nigeriapropertycentre.com/for-sale/houses/showtype) and describes houses listed for sale in the Nigerian real estate market.

Each listing records attributes such as property type, number of bedrooms, bathrooms, toilets and parking spaces, total and covered area, district, state, and listed price. This makes the dataset suitable for machine learning projects, pricing models, or market studies.

In [1]:
# import the necessary libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

In [2]:
# Load the dataset into a pandas DataFrame

df = pd.read_csv('house_data.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Property Ref | Added On | Last Updated | Market Status | Type | Bedrooms | Bathrooms | Toilets | Parking Spaces | Total Area | Covered Area | Price | District | State | Servicing | Furnishing | Service Charge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1841352 | 28 Jul 2023 | 08 Nov 2024 | Available | Detached Duplex | 7.0 | 7.0 | 7.0 | 8.0 | 850 sqm | 850 sqm | 450000000.0 | Gwarinpa | Abuja | | | |
| 1 | 1 | 2601226 | 02 Dec 2024 | 04 Dec 2024 | Available | Semi-detached Duplex | 4.0 | 4.0 | 5.0 | 4.0 | | | 135000000.0 | Lekki | Lagos | | | |
| 2 | 2 | 2601251 | 02 Dec 2024 | 04 Dec 2024 | Available | Block of Flats | 2.0 | 2.0 | 3.0 | 2.0 | | | 90000000.0 | Lekki | Lagos | | | |
| 3 | 3 | 2607973 | 05 Dec 2024 | 05 Dec 2024 | Available | Terraced Duplex | 4.0 | 4.0 | 5.0 | | | | 110000000.0 | Lekki | Lagos | | | |
| 4 | 4 | 2607972 | 05 Dec 2024 | 05 Dec 2024 | Available | Detached Duplex | 6.0 | 7.0 | 7.0 | 6.0 | | | 400000000.0 | Lekki | Lagos | | | |


In [3]:
# Fix data types: parse the listing dates (e.g. "28 Jul 2023") into datetimes
df['Added On'] = pd.to_datetime(df['Added On'])
df['Last Updated'] = pd.to_datetime(df['Last Updated'])

# Strip the "sqm" unit from the area columns and convert them to numbers;
# anything that still fails to parse becomes NaN
for col in ['Total Area', 'Covered Area']:
    df[col] = df[col].str.replace('sqm', '', regex=False).str.strip()
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Simple feature engineering: days between first listing and last update
df['Listing_Duration_Days'] = (df['Last Updated'] - df['Added On']).dt.days
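
To see what these type fixes do, here is a small, self-contained sketch on a made-up frame (the two rows are illustrative stand-ins, not the real file). It assumes area strings look like `850 sqm` and dates like `28 Jul 2023`; a missing area stays `NaN` after conversion.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'Added On': ['28 Jul 2023', '02 Dec 2024'],
    'Last Updated': ['08 Nov 2024', '04 Dec 2024'],
    'Total Area': ['850 sqm', None],
})

# Parse the date strings into datetimes
toy['Added On'] = pd.to_datetime(toy['Added On'])
toy['Last Updated'] = pd.to_datetime(toy['Last Updated'])

# Remove the unit, then convert; unparseable or missing values become NaN
area = toy['Total Area'].str.replace('sqm', '', regex=False).str.strip()
toy['Total Area'] = pd.to_numeric(area, errors='coerce')

# Whole days between the two dates
toy['Listing_Duration_Days'] = (toy['Last Updated'] - toy['Added On']).dt.days
print(toy[['Total Area', 'Listing_Duration_Days']])
```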

In [4]:
# Drop index artifacts and unused columns first
df = df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0', 'Property Ref', 'Service Charge'])

# Fill categoricals with 'Unknown'
df['Servicing'] = df['Servicing'].fillna('Unknown')
df['Furnishing'] = df['Furnishing'].fillna('Unknown')

In [5]:
# Split data; the target is log1p(Price), so all errors below are on the log scale
X = df.drop(columns='Price')
y = np.log1p(df['Price'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)  # same seed as the from-scratch notebook
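
Because the target is `log1p(Price)`, the model's predictions come back on that scale too. A short sketch of the round trip, using three prices from the preview rows as sample values: `np.expm1` is the inverse of `np.log1p`, so it maps predictions back to the original price scale.

```python
import numpy as np

prices = np.array([90_000_000.0, 135_000_000.0, 450_000_000.0])  # sample values
log_prices = np.log1p(prices)       # scale the model is trained on
recovered = np.expm1(log_prices)    # back to the original price scale

print(log_prices.round(2))
print(np.allclose(recovered, prices))
```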

In [6]:
# Define features
numeric_features = ['Bedrooms','Bathrooms','Toilets','Parking Spaces','Total Area','Covered Area','Listing_Duration_Days']
categorical_features = ['Market Status', 'Type', 'District', 'State', 'Servicing', 'Furnishing']

In [7]:
# Missing-value indicators are added inside the pipeline, for numeric columns only
def add_missing_flags_numeric(X):
    """Return a copy of X with a 0/1 `<col>_missing` flag for each listed numeric column."""
    X_out = X.copy()
    for col in ['Bedrooms','Bathrooms','Toilets','Parking Spaces','Total Area','Covered Area']:
        if col in X_out.columns:
            X_out[col + '_missing'] = X_out[col].isna().astype(int)
    return X_out
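
scikit-learn can also produce missing-value flags itself: `SimpleImputer(add_indicator=True)` appends one indicator column for each feature that had missing values during `fit`. The sketch below shows this on a toy array as a possible alternative to the custom function. It is not what this notebook uses, and unlike the function above it only flags columns that actually contained missing values at fit time.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array with one missing value per column (illustrative only)
X_toy = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 6.0],
])

# Median imputation plus indicator columns appended on the right
imputer = SimpleImputer(strategy='median', add_indicator=True)
X_out = imputer.fit_transform(X_toy)
print(X_out)
```

The first two output columns are the imputed features; the last two are the 0/1 missing indicators.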

In [8]:
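# Numeric columns: add missing flags, impute with the median, then standardize
# (the flag columns are standardized too)
numeric_transformer = Pipeline(steps=[
    ('missing_flags', FunctionTransformer(add_missing_flags_numeric, validate=False)),
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical columns: fill any remaining gaps, then one-hot encode.
# handle_unknown='ignore' encodes categories unseen during fit as all zeros;
# sparse_output requires scikit-learn >= 1.2.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Column transformer: columns not listed here are dropped (remainder='drop' is the default)
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])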

In [9]:
# Full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('ridge', Ridge(fit_intercept=True))
])

In [10]:
param_grid = {'ridge__alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]}

# Grid search with CV
grid_search = GridSearchCV(pipeline, param_grid, scoring='neg_root_mean_squared_error', cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best alpha:", grid_search.best_params_)
print("Best CV RMSE:", (-grid_search.best_score_).round(2))

Best alpha: {'ridge__alpha': 10}
Best CV RMSE: 0.79
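
The selected `alpha` of 10 is the largest value in the grid, which often means the search range was cut off rather than that 10 is the best setting. Below is a hedged sketch of how one might detect this and propose a wider, log-spaced grid. The helper name `on_grid_edge` and the new upper bound are illustrative choices, not part of the notebook.

```python
import numpy as np

def on_grid_edge(best_value, grid):
    """Return True if the selected value is the smallest or largest candidate."""
    grid = np.sort(np.asarray(grid, dtype=float))
    return bool(np.isclose(best_value, grid[0]) or np.isclose(best_value, grid[-1]))

alphas = [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]
best_alpha = 10  # value reported by the grid search above

if on_grid_edge(best_alpha, alphas):
    # Extend the search two decades past the old upper bound (assumed range)
    wider_alphas = np.logspace(-5, 3, 9)
    print(wider_alphas)
```

The wider grid could then be passed as `{'ridge__alpha': wider_alphas}` to a new `GridSearchCV` run.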


In [11]:
# Evaluate on the held-out test set (RMSE is on the log1p(Price) scale)
best_model = grid_search.best_estimator_
y_pred_test = best_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
print("Test RMSE:", test_rmse.round(2))

Test RMSE: 0.67
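
Both RMSE values are on the `log1p(Price)` scale, not in currency units. As a rough reading, a log-scale error of 0.67 corresponds to predictions being off by a multiplicative factor of about `exp(0.67) ≈ 1.95`. To report an error on the price scale, predictions and targets can be mapped back with `np.expm1` first. The sketch below uses made-up arrays in place of `y_test` and `y_pred_test`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Stand-ins for y_test and y_pred_test (log1p scale); values are illustrative
y_true_log = np.log1p(np.array([90e6, 135e6, 400e6]))
y_pred_log = np.log1p(np.array([100e6, 120e6, 350e6]))

rmse_log = np.sqrt(mean_squared_error(y_true_log, y_pred_log))

# Back-transform to the original price scale before computing the error
y_true = np.expm1(y_true_log)
y_pred = np.expm1(y_pred_log)
rmse_price = np.sqrt(mean_squared_error(y_true, y_pred))

print("RMSE (log scale):", round(float(rmse_log), 3))
print("RMSE (price scale):", round(float(rmse_price)))
print("Typical multiplicative error:", round(float(np.exp(rmse_log)), 3))
```

The price-scale RMSE is dominated by the most expensive listings, which is one reason the log-scale metric is often the more informative of the two.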
