## About the dataset
I am a data scientist working for a real estate company that is planning to invest in Boston real estate. I collected information about various areas of Boston, and my task was to build a model that predicts the median house price for an area so that it can be used to make offers.

The dataset describes areas/towns, not individual houses. The features are:

CRIM: Crime per capita

ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS: Proportion of non-retail business acres per town

CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

NOX: Nitric oxides concentration (parts per 10 million)

RM: Average number of rooms per dwelling

AGE: Proportion of owner-occupied units built prior to 1940

DIS: Weighted distances to five Boston employment centers

RAD: Index of accessibility to radial highways

TAX: Full-value property-tax rate per $10,000

PTRATIO: Pupil-teacher ratio by town

LSTAT: Percent lower status of the population

MEDV: Median value of owner-occupied homes in $1000s

Step 1: Import the libraries

In [1]:
# Pandas will allow us to create a dataframe of the data so it can be used and manipulated
import pandas as pd
import numpy as np
# Regression Tree Algorithm
from sklearn.tree import DecisionTreeRegressor
# Split our data into a training and testing data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


Step 2: Read the data

In [2]:
data = pd.read_csv("C:/Users/micho/OneDrive/Desktop/DATABASE/GITHUB/DATASET/Real_estate_data.csv")

In [3]:
data.head()

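| | Unnamed: 0 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | NaN | 36.2 |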


In [4]:
data.tail()

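| | Unnamed: 0 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | NaN | 22.4 |
| 502 | 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.12 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 9.08 | 20.6 |
| 503 | 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 5.64 | 23.9 |
| 504 | 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 6.48 | 22.0 |
| 505 | 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.03 | NaN | 2.505 | 1 | 273 | 21.0 | 7.88 | 11.9 |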


Step 3: Understanding the data

In [5]:
data.shape

(506, 14)

In [6]:
data.columns

Index(['Unnamed: 0', 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
       'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV'],
      dtype='object')

In [7]:
data.isna().sum()

Unnamed: 0     0
CRIM          20
ZN            20
INDUS         20
CHAS          20
NOX            0
RM             0
AGE           20
DIS            0
RAD            0
TAX            0
PTRATIO        0
LSTAT         20
MEDV           0
dtype: int64

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  506 non-null    int64  
 1   CRIM        486 non-null    float64
 2   ZN          486 non-null    float64
 3   INDUS       486 non-null    float64
 4   CHAS        486 non-null    float64
 5   NOX         506 non-null    float64
 6   RM          506 non-null    float64
 7   AGE         486 non-null    float64
 8   DIS         506 non-null    float64
 9   RAD         506 non-null    int64  
 10  TAX         506 non-null    int64  
 11  PTRATIO     506 non-null    float64
 12  LSTAT       486 non-null    float64
 13  MEDV        506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB


Step 4: Data Pre-Processing

In [9]:
data.drop('Unnamed: 0', axis=1, inplace=True)
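The `Unnamed: 0` column usually appears when a CSV was saved together with its row index. As a minimal sketch, assuming a made-up three-row CSV rather than the real file, the column can also be avoided at read time with `index_col=0`:

```python
import io

import pandas as pd

# Hypothetical three-row CSV that mimics a file saved with its row index,
# which ends up as an unnamed first column.
csv_text = """,CRIM,RM,MEDV
0,0.00632,6.575,24.0
1,0.02731,6.421,21.6
2,0.02729,7.185,34.7
"""

# Default read: the saved index comes back as a column named "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Either approach gives the same feature columns; dropping the column after reading, as done above, is equally valid.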

In [10]:
data_filled = data.apply(lambda col: col.fillna(col.mean()), axis=0)

In [11]:
data_filled.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
LSTAT      0
MEDV       0
dtype: int64
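The mean fill above uses every row, including rows that later end up in the test set. A possible alternative, sketched here on a made-up toy frame (the values and the names `toy`, `toy_train`, `toy_test` are assumptions for illustration), is to learn the column means from the training rows only with `SimpleImputer` and reuse them on the test rows:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up toy data with a few missing values (not the real estate file).
toy = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.4, 0.5, np.nan, 0.7, 0.8, 0.9, 1.0],
    "RM":   [6.0, 6.5, np.nan, 5.5, 7.0, 6.2, 6.8, np.nan, 5.9, 6.1],
})
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1)

# Fit the imputer on the training rows only, then apply it to both parts.
imputer = SimpleImputer(strategy="mean")
train_filled = pd.DataFrame(
    imputer.fit_transform(toy_train), columns=toy.columns, index=toy_train.index
)
test_filled = pd.DataFrame(
    imputer.transform(toy_test), columns=toy.columns, index=toy_test.index
)

print(imputer.statistics_)  # the per-column training means used for filling
```

With only 20 missing values per column out of 506 rows, the difference from filling on the full dataset is likely small, but this version keeps test information out of the preprocessing.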

Step 5: Split the dataset into features and what I am predicting (target)

In [12]:
X = data_filled.drop(columns=["MEDV"])
Y = data_filled["MEDV"]

In [13]:
X.head()

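| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 12.715432 |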


In [14]:
Y.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

Step 6: Finally, split the data into a training and testing dataset using train_test_split from sklearn.model_selection

In [15]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)
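As a quick check of what `test_size=.2` means for 506 rows, here is a small sketch using placeholder arrays (`X_demo` and `y_demo` are stand-ins, not the real features). It also shows that a fixed `random_state` reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (506).
X_demo = np.arange(506 * 2).reshape(506, 2)
y_demo = np.arange(506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same rows.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(np.array_equal(X_te, X_te2))
```

The test share is rounded up, so 20% of 506 rows gives 102 test rows and 404 training rows.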

Step 7: Create Regression Tree

Regression Trees are implemented using DecisionTreeRegressor from sklearn.tree

The important parameters of DecisionTreeRegressor are

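criterion: {"squared_error", "friedman_mse", "absolute_error", "poisson"} - The function used to measure the quality of a split (scikit-learn releases before 1.0 used the names "mse" and "mae" for the squared and absolute error)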

max_depth - The max depth the tree can be

min_samples_split - The minimum number of samples required to split a node

min_samples_leaf - The minimum number of samples that a leaf can contain

max_features: {None, "sqrt", "log2"} - The number of features to examine when looking for the best split; limiting it can speed up training
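To get a feel for how these parameters shape the tree, here is a minimal sketch on synthetic data from `make_regression` (the data and the chosen values are assumptions for illustration, not the real estate dataset). It compares an unrestricted tree with one limited by `max_depth` and `min_samples_leaf`:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to illustrate the parameters.
X_syn, y_syn = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

# No limits: the tree keeps splitting until the leaves are pure.
unrestricted = DecisionTreeRegressor(random_state=0).fit(X_syn, y_syn)

# Limited depth and a minimum leaf size give a much smaller tree.
restricted = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_syn, y_syn)

print("unrestricted:", unrestricted.get_depth(), unrestricted.get_n_leaves())
print("restricted:  ", restricted.get_depth(), restricted.get_n_leaves())
```

A smaller tree usually fits the training data less closely, which is the lever these parameters give against overfitting.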

In [16]:
regression_tree = DecisionTreeRegressor(criterion = 'squared_error')
regression_tree

Step 8: Training and Evaluating the model before prediction

In [17]:
regression_tree.fit(X_train, Y_train)

In [18]:
regression_tree.score(X_test, Y_test)

0.8576732180240081

About 86% R2 (Coefficient of Determination)
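For a regressor, `score` returns R2, which is one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch of that calculation, using the first five MEDV values with made-up predictions (the predictions are an assumption for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])  # first five MEDV values
y_pred = np.array([25.0, 20.0, 33.0, 35.0, 30.0])  # made-up predictions

# Residual sum of squares: how far the predictions are from the truth.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: how far the truth is from its own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

An R2 of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for a model that does worse than the mean.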

In [19]:
regression_tree_2 = DecisionTreeRegressor(criterion = 'friedman_mse')
regression_tree_2

In [20]:
regression_tree_2.fit(X_train, Y_train)

In [21]:
regression_tree_2.score(X_test, Y_test)

0.8349012900367571

About 83% R2 (Coefficient of Determination)

Step 9: I can also compute the average absolute error on the testing set, which is the average error in the predicted median home value in dollars

In [22]:
prediction = regression_tree.predict(X_test)

print("$",(prediction - Y_test).abs().mean()*1000)

$ 2734.3137254901967
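The figure above is the mean absolute error (MAE) multiplied by 1000, because MEDV is recorded in $1000s. The same number can be obtained with `mean_absolute_error`; a small sketch with made-up predictions (assumed values, not the model's output):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

actual = pd.Series([24.0, 21.6, 34.7, 33.4, 36.2], name="MEDV")
predicted = np.array([22.0, 23.6, 30.7, 33.4, 38.2])  # made-up predictions

# Manual version, as in the cell above.
mae_manual = (predicted - actual).abs().mean()

# Library version of the same quantity.
mae_sklearn = mean_absolute_error(actual, predicted)

# MEDV is in $1000s, so multiply by 1000 to report dollars.
mae_dollars = mae_sklearn * 1000
print(f"${mae_dollars:,.2f}")
```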


In [23]:
regression_tree = DecisionTreeRegressor(criterion = "absolute_error")

regression_tree.fit(X_train, Y_train)

print(regression_tree.score(X_test, Y_test))

prediction = regression_tree.predict(X_test)

print("$",(prediction - Y_test).abs().mean()*1000)

0.6806225939509677
$ 3126.470588235294


Step 10: Parameter tuning

In [24]:
# Initialize the DecisionTreeRegressor
dt_regressor = DecisionTreeRegressor()

In [25]:
# Define the parameters to search over
param_grid = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']
}

# Initialize the GridSearchCV with the DecisionTreeRegressor
grid_search = GridSearchCV(estimator=dt_regressor, 
                           param_grid=param_grid, 
                           cv=5, 
                           n_jobs=-1, 
                           scoring='neg_mean_squared_error',
                           verbose=2)

# Fit the model on X_train and Y_train
grid_search.fit(X_train, Y_train)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best parameters found: ", best_params)
print("Best score found: ", best_score)

Fitting 5 folds for each of 1296 candidates, totalling 6480 fits
Best parameters found:  {'criterion': 'absolute_error', 'max_depth': 20, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 10, 'splitter': 'best'}
Best score found:  -15.301344675925927
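With `scoring='neg_mean_squared_error'`, `best_score_` is the negated mean MSE across the cross-validation folds, which is why it is negative. The reported -15.30 corresponds to an MSE of 15.30, or an RMSE of roughly 3.91 in $1000s. A minimal sketch of that conversion, assuming a much smaller grid and synthetic data so that it runs quickly:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data and a deliberately small grid, for illustration only.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
small_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=small_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_syn, y_syn)

# Flip the sign to get the MSE, then take the square root for the RMSE.
best_mse = -search.best_score_
best_rmse = np.sqrt(best_mse)

print("Best parameters:", search.best_params_)
print("CV MSE:", best_mse, "CV RMSE:", best_rmse)
```

Because this score is an average over validation folds, it is a better guide to unseen-data performance than the training metrics.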


In [26]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Fetch the best estimator from GridSearchCV
best_model = grid_search.best_estimator_

# Predict on the training data and test data
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)

# Evaluate the model on the training and test data
mse_train = mean_squared_error(Y_train, y_train_pred)
mae_train = mean_absolute_error(Y_train, y_train_pred)
r2_train = r2_score(Y_train, y_train_pred)

mse_test = mean_squared_error(Y_test, y_test_pred)
mae_test = mean_absolute_error(Y_test, y_test_pred)
r2_test = r2_score(Y_test, y_test_pred)

# Print the evaluation metrics for the training data
print("Training Data Evaluation:")
print("Mean Squared Error (MSE):", mse_train)
print("Mean Absolute Error (MAE):", mae_train)
print("R-squared (R2):", r2_train)

# Print the evaluation metrics for the test data
print("\nTest Data Evaluation:")
print("Mean Squared Error (MSE):", mse_test)
print("Mean Absolute Error (MAE):", mae_test)
print("R-squared (R2):", r2_test)


Training Data Evaluation:
Mean Squared Error (MSE): 2.614207920792079
Mean Absolute Error (MAE): 0.9207920792079208
R-squared (R2): 0.9676384859754553

Test Data Evaluation:
Mean Squared Error (MSE): 30.12975490196078
Mean Absolute Error (MAE): 3.048039215686275
R-squared (R2): 0.6951274837381858


Mean Squared Error (MSE): This metric measures the average of the squared differences between the predicted and actual values. Because the errors are squared, MSE is expressed in squared target units, so it cannot be read directly as a distance from the actual values; taking the square root (RMSE) puts it back in the units of MEDV ($1000s). The training MSE of 2.61 corresponds to an RMSE of about 1.62, and the test MSE of 30.13 corresponds to an RMSE of about 5.49. The much higher test MSE indicates that the model is overfitting to the training data and is not generalizing well to new, unseen data.
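MSE is in squared units, so a square root is needed to read it on the same scale as MEDV. A quick check using the two MSE values printed in the evaluation output above (the variable names are my own):

```python
import math

# MSE values copied from the evaluation output above.
mse_train_reported = 2.614207920792079
mse_test_reported = 30.12975490196078

# RMSE is the square root of MSE, in the same units as MEDV ($1000s).
rmse_train = math.sqrt(mse_train_reported)
rmse_test = math.sqrt(mse_test_reported)

print(round(rmse_train, 2), round(rmse_test, 2))
```

RMSE is typically larger than MAE on the same data because squaring gives large errors more weight.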

Mean Absolute Error (MAE): This metric measures the average of the absolute differences between the predicted and actual values. The training MAE of 0.92 means that the model's predictions are, on average, about 0.92 units (roughly $920) away from the actual values in the training set. The test MAE of 3.05 means that the predictions are, on average, about 3.05 units (roughly $3,050) away from the actual values in the test set. The MAE is generally more interpretable than the MSE, since it is in the same units as the target variable.

R-squared (R2): This metric measures the proportion of variance in the target variable that is explained by the model. The training R2 of 0.97 means that the model explains about 97% of the variance in the training data. The test R2 of 0.70 means that it explains about 70% of the variance in the test data. A high R2 indicates a good fit to the data it is measured on, but a high training R2 alone does not mean that the model is accurate or generalizable, and the gap between the two values here is a further sign of overfitting.

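Overall, my model performs well on the training set, with a low MSE and MAE and a high R2. However, the much higher test MSE and MAE, together with the drop in R2 from 0.97 to 0.70, suggest that the model is overfitting to the training data and is not generalizing well to new, unseen data. This tuned tree also scores lower on the test set (R2 of about 0.70) than the default squared_error tree in Step 8 (about 0.86), so the grid search did not improve test performance on this particular split. L1 and L2 penalties apply to linear models rather than trees, so to improve performance I may want to constrain the tree more strongly instead, for example with a smaller max_depth, a larger min_samples_leaf, or cost-complexity pruning through ccp_alpha. I could also use an ensemble such as a random forest or gradient boosting. Finally, I may want to report k-fold cross-validated scores for the final model rather than relying on a single train/test split.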
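As a rough sketch of how such a follow-up comparison could look, the snippet below scores a single tree, a more constrained tree, and a random forest with 5-fold cross-validation. It uses synthetic data from `make_friedman1` and arbitrary parameter values, both assumptions for illustration, so the numbers say nothing about the real estate dataset:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression data, for illustration only.
X_syn, y_syn = make_friedman1(n_samples=300, noise=1.0, random_state=0)

candidates = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "constrained tree": DecisionTreeRegressor(
        max_depth=4, min_samples_leaf=5, random_state=0
    ),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean R2 over 5 folds for each candidate model.
cv_r2 = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="r2")
    cv_r2[name] = fold_scores.mean()
    print(f"{name}: mean R2 = {cv_r2[name]:.3f}")
```

On the real data, `X` and `Y` from Step 5 would take the place of `X_syn` and `y_syn`.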

## Author
[Jolayemi Babatunde](https://linkedin.com/in/babatunde-jolayemi-a05312275)