# Using Scikit-Learn to Create a House Price Prediction Model

The real estate market is a dynamic, competitive environment where accurate pricing is essential for buyers and sellers. This article walks through estimating house prices with several regression algorithms from the Python scikit-learn library. We will demonstrate data preprocessing, model performance evaluation, and hyperparameter tuning.

**Understanding the Problem**

The objective is to create a model that predicts the median home value of a neighborhood (MEDV) from features such as the average number of rooms (RM), the crime rate (CRIM), the distance to employment centers (DIS), and other variables. Accurate forecasts can benefit real estate agents, buyers, and sellers alike.

**Dataset Preparation**

We need a dataset that contains relevant features and house values. This example uses the Boston house price dataset, which is available on [Kaggle](https://www.kaggle.com/datasets/fedesoriano/the-boston-houseprice-data). The rest of this example assumes you have saved it as a CSV file named 'house_prices_dataset.csv'.

**Step 1: Data Preprocessing**

Data preprocessing is an essential step in preparing a dataset for modeling. It includes loading, cleaning, and transforming the data.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

In [None]:
# Load your dataset
data = pd.read_csv('house_prices_dataset.csv')
# Check the first few rows of the dataset
data.head()

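| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |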


In [None]:
# Get summary statistics of the data
data.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


In [None]:
# Check for missing values
data.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [None]:
# Separate features (X) and target variable (y)
X = data.drop('MEDV', axis=1)
y = data['MEDV']

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
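
To make the 80/20 split concrete, here is a minimal sketch that runs the same `train_test_split` call on a synthetic stand-in. The random arrays and the `_demo` names are assumptions for illustration only; they merely match the shape of the housing data (506 rows, 13 features).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the housing data:
# 506 rows and 13 feature columns (values are random, not real prices).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test share is rounded up, so 506 rows split into 404 and 102.
print(X_tr.shape, X_te.shape)  # (404, 13) (102, 13)
```

Fixing `random_state` keeps the split identical between runs, which makes the model comparison in the next step repeatable.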

In [None]:
# Standardize the features (optional, but often improves model performance)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
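
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing. The sketch below checks this on synthetic data; the arrays and the `demo_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train/test features (illustrative values only).
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=50.0, scale=10.0, size=(400, 3))
demo_test = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(demo_train)  # learns mean/std from train
test_scaled = demo_scaler.transform(demo_test)        # reuses the train statistics

# The same transformation written out by hand.
manual = (demo_test - demo_train.mean(axis=0)) / demo_train.std(axis=0)

print(np.allclose(test_scaled, manual))            # True
print(np.allclose(train_scaled.mean(axis=0), 0.0)) # True
```

The scaled test set will generally not have a mean of exactly 0, because it is centered with the training statistics rather than its own.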

**Step 2: Evaluate various regression algorithms**

Now, let's compare several regression algorithms by fitting each one on the training set and scoring it on the test set.

In [None]:
# Create a dictionary to store models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Support Vector Regression': SVR(),
    'Decision Tree Regressor': DecisionTreeRegressor(),
    'Random Forest Regressor': RandomForestRegressor(),
    'Gradient Boosting Regressor': GradientBoostingRegressor()
}

# Train and evaluate each model
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[model_name] = {'MSE': mse, 'R2': r2}

For analysis, we use Mean Squared Error (MSE) and R-squared (R2), two metrics commonly used in machine learning and statistics to evaluate regression models. Together they describe how closely a model's predictions match the observed data.

The MSE measures the difference between the actual (observed) values and the regression model's predicted values. It is the average of the squared errors. Lower MSE values imply that the model fits the data better. A perfect model has an MSE of 0, indicating no difference between the predicted and actual values.
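
As a small worked example, the sketch below computes the MSE by hand and compares it with `mean_squared_error`. The true values are the first four MEDV entries shown earlier; the predictions are hypothetical numbers chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_demo = np.array([24.0, 21.6, 34.7, 33.4])  # first four MEDV values
y_hat_demo = np.array([25.0, 20.6, 32.7, 33.4])   # hypothetical predictions

errors = y_true_demo - y_hat_demo      # roughly [-1, 1, 2, 0]
mse_manual = np.mean(errors ** 2)      # (1 + 1 + 4 + 0) / 4

print(f'{mse_manual:.2f}')                                   # 1.50
print(f'{mean_squared_error(y_true_demo, y_hat_demo):.2f}')  # 1.50
```

Because the errors are squared, one prediction that is off by 2 contributes four times as much as a prediction that is off by 1.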

R2, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable (target) that is explained by the regression model's independent variables (features). R2 is at most 1. A value of 0 indicates that the model explains none of the variance, and on held-out data R2 can even be negative when the model predicts worse than simply using the mean of the target. High R2 values (close to 1) suggest that the model accounts for a substantial proportion of the data's variance, indicating a good fit. Low R2 values indicate that the model does not fit the data well.
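
The definition can be checked with a short sketch that computes R2 as one minus the ratio of the residual sum of squares to the total sum of squares. The arrays and the helper `r2_manual` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_obs = np.array([10.0, 20.0, 30.0, 40.0])
good_pred = np.array([12.0, 18.0, 31.0, 39.0])  # close to the observations
bad_pred = np.array([40.0, 30.0, 20.0, 10.0])   # reversed, worse than the mean
mean_pred = np.full(4, y_obs.mean())            # always predict the mean

def r2_manual(y, y_hat):
    """R2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)       # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
    return 1.0 - ss_res / ss_tot

print(f'{r2_manual(y_obs, good_pred):.2f}')  # 0.98
print(f'{r2_score(y_obs, mean_pred):.2f}')   # 0.00
print(f'{r2_score(y_obs, bad_pred):.2f}')    # -3.00
```

The last case shows why R2 is not strictly bounded below by 0 when it is computed on predictions that fit worse than the mean.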

In [None]:
results = pd.DataFrame(results)
results

| | Linear Regression | Ridge Regression | Lasso Regression | Support Vector Regression | Decision Tree Regressor | Random Forest Regressor | Gradient Boosting Regressor |
|---|---|---|---|---|---|---|---|
| MSE | 24.291119 | 24.312904 | 27.577692 | 25.66854 | 18.037255 | 7.816395 | 6.304847 |
| R2 | 0.668759 | 0.668462 | 0.623943 | 0.649977 | 0.754039 | 0.893413 | 0.914025 |
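
With seven models side by side, a ranking is easier to read than a wide table. The sketch below rebuilds the scores reported above in a standalone DataFrame (the name `scores` is an assumption) and sorts the models by MSE. In the notebook itself, `results.T.sort_values('MSE')` should give the same view, although exact numbers for the tree-based models may vary between runs because no `random_state` is set.

```python
import pandas as pd

# Scores copied from the results table above.
scores = pd.DataFrame({
    'Linear Regression': {'MSE': 24.291119, 'R2': 0.668759},
    'Ridge Regression': {'MSE': 24.312904, 'R2': 0.668462},
    'Lasso Regression': {'MSE': 27.577692, 'R2': 0.623943},
    'Support Vector Regression': {'MSE': 25.66854, 'R2': 0.649977},
    'Decision Tree Regressor': {'MSE': 18.037255, 'R2': 0.754039},
    'Random Forest Regressor': {'MSE': 7.816395, 'R2': 0.893413},
    'Gradient Boosting Regressor': {'MSE': 6.304847, 'R2': 0.914025},
})

# Transpose so each model is a row, then sort from lowest to highest MSE.
ranked = scores.T.sort_values('MSE')
print(ranked)
```

The two ensemble methods lead the ranking, and the three linear models and SVR cluster together at the bottom.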


**Step 3: Tuning Hyperparameters**

Hyperparameter tuning can considerably affect model performance. We demonstrate this by tuning the Gradient Boosting Regressor, which achieved the lowest MSE and the highest R2 in Step 2.
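
An exhaustive grid search fits one model per parameter combination per cross-validation fold, so it is worth estimating the cost before running it. The sketch below counts the candidates for the six-parameter grid used in this step with 5-fold cross-validation; the names `demo_grid`, `n_candidates`, and `n_fits` are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched in this step.
demo_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.8, 0.9, 1.0],
}

n_candidates = len(ParameterGrid(demo_grid))  # 3 ** 6 combinations
n_fits = n_candidates * 5                     # one fit per fold with cv=5

print(n_candidates, n_fits)  # 729 3645
```

If that many fits takes too long, passing `n_jobs=-1` to `GridSearchCV` runs them in parallel, and trimming the grid reduces the count.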

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.8, 0.9, 1.0]
}
# Create a GridSearchCV object
grid_search = GridSearchCV(GradientBoostingRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the best model
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Best Model Hyperparameters: {best_params}')
print(f'Best Model MSE: {mse:.2f}')
print(f'Best Model R2: {r2:.2f}')

Best Model Hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300, 'subsample': 0.8}
Best Model MSE: 6.05
Best Model R2: 0.92
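
Because `GridSearchCV` always maximizes its score, `scoring='neg_mean_squared_error'` makes it work with the negated MSE, so `best_score_` is zero or negative. The sketch below shows how to read it back as a cross-validated MSE. It uses synthetic data from `make_regression` and a deliberately tiny grid, both of which are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (not the housing data).
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

small_grid = {'max_depth': [2, 3], 'learning_rate': [0.05, 0.1]}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    small_grid,
    cv=3,
    scoring='neg_mean_squared_error',
)
search.fit(X_syn, y_syn)

# best_score_ is the mean negated MSE over the folds; flip the sign to read it.
cv_mse = -search.best_score_
print(f'best_score_: {search.best_score_:.2f}')
print(f'Cross-validated MSE: {cv_mse:.2f}')
print(f'Best parameters: {search.best_params_}')
```

The cross-validated MSE is computed only on the training data, so it complements the test-set MSE reported above.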


**Conclusion**

This article demonstrated how to forecast house prices with scikit-learn by preprocessing the data, comparing multiple regression algorithms, tuning hyperparameters, and evaluating model performance. The same techniques can be applied to real-world real estate applications, helping decision-makers make better-informed pricing decisions.