Title: Predicting house price per unit area from real estate features

This notebook builds a model that predicts the house price per unit area. To do this, I use a data set of real estate information such as the house age, latitude, longitude, distance to the nearest MRT station, and number of nearby convenience stores.

This CSV file contains the real estate data set:

https://www.kaggle.com/datasets/quantbruce/real-estate-price-prediction

In [51]:
# Import libraries and load the CSV file

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import SelectKBest, f_classif, chi2
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv('Real estate.csv')

In [52]:
#Data cleaning 

df.columns

Index(['No', 'X1 transaction date', 'X2 house age',
       'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')

In [53]:
#Remove features that are not used in this model

remove = ['No', 'X1 transaction date']
df.drop(remove, inplace=True, axis=1)

In [54]:
#Check if features were removed
df.columns

Index(['X2 house age', 'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')

In [55]:
# Check for missing values in each column
missing_values = df.isnull().sum()

print(f"Missing Values : \n {missing_values}")

Missing Values : 
 X2 house age                              0
X3 distance to the nearest MRT station    0
X4 number of convenience stores           0
X5 latitude                               0
X6 longitude                              0
Y house price of unit area                0
dtype: int64


In [56]:
# rename the target variable
df = df.rename(columns={'Y house price of unit area' : 'target'})
df['target'].value_counts(dropna=False)

42.5    4
40.3    4
29.3    4
40.6    4
37.4    4
       ..
55.9    1
22.9    1
21.5    1
55.1    1
63.9    1
Name: target, Length: 270, dtype: int64

In [57]:
#Rename other features 

#Rename 'X2 house age' column
df = df.rename(columns={'X2 house age' : 'Age'})
df['Age'].value_counts(dropna=False)


0.0     17
13.6     7
13.3     6
16.2     6
16.4     6
        ..
30.2     1
4.3      1
24.0     1
8.4      1
18.8     1
Name: Age, Length: 236, dtype: int64

In [58]:
#Rename 'X3 distance to the nearest MRT station' column
df = df.rename(columns={'X3 distance to the nearest MRT station' : 'MRT_Station_Distance'})
df['MRT_Station_Distance'].value_counts(dropna=False)

289.32480     13
90.45606      11
492.23130      9
1360.13900     8
104.81010      8
              ..
4527.68700     1
401.88070      1
432.03850      1
472.17450      1
390.96960      1
Name: MRT_Station_Distance, Length: 259, dtype: int64

In [59]:
#Rename 'X4 number of convenience stores' column
df = df.rename(columns={'X4 number of convenience stores' : 'Convenience_store_count'})
df['Convenience_store_count'].value_counts(dropna=False)

5     67
0     67
3     46
1     46
6     37
7     31
4     31
8     30
9     25
2     24
10    10
Name: Convenience_store_count, dtype: int64

In [60]:
#Rename 'X5 latitude' column
df = df.rename(columns={'X5 latitude' : 'Latitude'})
df['Latitude'].value_counts(dropna=False)

24.97433    14
24.98203    13
24.96674     9
24.96515     9
24.96299     8
            ..
24.98034     1
24.97493     1
24.94898     1
24.98489     1
24.97923     1
Name: Latitude, Length: 234, dtype: int64

In [61]:
#Rename 'X6 longitude' column
df = df.rename(columns={'X6 longitude' : 'Longitude'})
df['Longitude'].value_counts(dropna=False)
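The renames above can also be done in a single call by passing one mapping to `rename`. A minimal sketch, assuming the original column names shown earlier; the `raw` frame and its values are made up for illustration and are not rows from the data set:

```python
import pandas as pd

# Toy frame with the original column names; the values are invented.
raw = pd.DataFrame({
    'X2 house age': [32.0, 19.5],
    'X3 distance to the nearest MRT station': [84.9, 306.6],
    'X4 number of convenience stores': [10, 9],
    'X5 latitude': [24.98, 24.98],
    'X6 longitude': [121.54, 121.54],
    'Y house price of unit area': [37.9, 42.2],
})

# One mapping from the original names to the names used in this notebook.
column_names = {
    'X2 house age': 'Age',
    'X3 distance to the nearest MRT station': 'MRT_Station_Distance',
    'X4 number of convenience stores': 'Convenience_store_count',
    'X5 latitude': 'Latitude',
    'X6 longitude': 'Longitude',
    'Y house price of unit area': 'target',
}

renamed = raw.rename(columns=column_names)
print(list(renamed.columns))
```

Keeping the mapping in one dictionary makes it harder to inspect the wrong column after a rename.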

In [62]:
# Check for duplicates in the 'Longitude' column

duplicates = df.duplicated(subset= ['Longitude'], keep='first')

print(sum(duplicates))

182


In [63]:
#Check for duplicates in the 'Latitude' column

duplicates = df.duplicated(subset= ['Latitude'], keep='first')

print(sum(duplicates))

180


In [64]:
# Count rows that share both 'Latitude' and 'Longitude' with another row

# drop_duplicates returns a new DataFrame and leaves df unchanged unless the
# result is assigned back. The deduplicated copy is used only for counting
# here, so df keeps all of its rows for the steps that follow.
deduped = df.drop_duplicates(subset=['Latitude', 'Longitude'], keep='last')

rows_before = df.shape[0]
rows_after = deduped.shape[0]

# Number of rows whose coordinates repeat those of another row
duplicates_found = rows_before - rows_after

print(f"Rows sharing coordinates with another row: {duplicates_found}")
print(f"Rows in df: {rows_before}")
print(f"Rows with unique coordinates: {rows_after}")
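Note that `drop_duplicates` returns a new DataFrame rather than modifying the one it is called on, so rows are only removed if the result is assigned (or `inplace=True` is passed). A minimal sketch on made-up data; the `toy` frame and its values are assumptions for illustration:

```python
import pandas as pd

# Three invented rows; the first two share the same coordinates.
toy = pd.DataFrame({
    'Latitude': [24.98, 24.98, 24.97],
    'Longitude': [121.54, 121.54, 121.53],
    'target': [37.9, 42.2, 47.3],
})

# Result discarded: toy still has all three rows afterwards.
toy.drop_duplicates(subset=['Latitude', 'Longitude'], keep='last')
rows_unchanged = toy.shape[0]

# Result assigned: the copy keeps the last row of each coordinate pair.
deduped_toy = toy.drop_duplicates(subset=['Latitude', 'Longitude'], keep='last')
rows_deduped = deduped_toy.shape[0]

print(rows_unchanged, rows_deduped)
```

Whether rows at the same coordinates should be dropped at all is a modelling choice, since several distinct properties can share one location.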


In [65]:
#check for outliers

df.boxplot(column=['target'])

<Axes: >

In [66]:
# Remove outliers: keep only rows whose target is below 100

df = df[df['target'] < 100]
df.boxplot(column=['target'])

<Axes: >
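The cut-off of 100 above is read off the box plot by eye. One common alternative is the 1.5 × IQR rule that the box plot whiskers are based on. A minimal sketch, assuming made-up prices in `prices` (not values from the data set):

```python
import pandas as pd

# Invented prices with one obviously extreme value.
prices = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 117.5], name='target')

# Interquartile range of the prices.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
kept = prices[(prices >= lower) & (prices <= upper)]

print(f"Bounds: [{lower}, {upper}], rows kept: {len(kept)}")
```

A rule like this adapts to the data, whereas a fixed threshold has to be revisited if the price distribution changes.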

In [67]:
# Feature selection: separate the candidate features from the target
x = df.loc[:, ['Age', 'MRT_Station_Distance', 'Convenience_store_count', 'Latitude', 'Longitude']]
y = df.loc[:, 'target'] 

In [68]:
np.set_printoptions(suppress = True)
fs = SelectKBest(score_func=f_classif, k='all')
bestFeatures = fs.fit(x, y)
print(f'F-Score: {bestFeatures.scores_}')
print(f'P-Values: {bestFeatures.pvalues_}')

F-Score: [1.61752597 4.80511195 2.01107741 2.36130223 1.91004868]
P-Values: [0.00073162 0.         0.00000254 0.00000001 0.00001122]


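All five features have p-values below 0.001, so none of them is discarded. MRT_Station_Distance has the highest F-score (4.81), which suggests it is the feature most strongly associated with the target. Note that an F-score is not a correlation coefficient, and that f_classif is intended for categorical targets: it treats every distinct price as a separate class. For a continuous target such as this one, f_regression is the appropriate scorer.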
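For a continuous target, scikit-learn provides `f_regression`, which scores each feature with a univariate linear regression F-test. A minimal sketch on synthetic data; `X_demo`, `y_demo`, and the coefficients are assumptions for illustration, and the scores it prints say nothing about the real estate data:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: three features, only the first one drives the target.
rng = np.random.default_rng(0)
n_rows = 200
X_demo = rng.normal(size=(n_rows, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=n_rows)

# Score every feature against the continuous target.
selector = SelectKBest(score_func=f_regression, k='all')
selector.fit(X_demo, y_demo)

print(f'F-Score: {selector.scores_}')
print(f'P-Values: {selector.pvalues_}')
```

In the notebook, the same call would be `SelectKBest(score_func=f_regression, k='all').fit(x, y)`.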

In [69]:
#Split the dataset into 80% training dataset and 20% testing dataset. 

# Split the data into X and Y
X = df.drop(columns=['target'])
y = df['target']

# Split the data into training (80%) and testing (20%) datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)


In [70]:
# check the shape of the training data and testing data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (330, 5)
X_test shape: (83, 5)
y_train shape: (330,)
y_test shape: (83,)


In [72]:
# Build a multiple linear regression model on the five features

# Create and fit a linear regression model

model = LinearRegression()
model.fit(X_train, y_train)

In [76]:
# Calculate the R-squared value on the training data
print("Training R-squared value: %.4f" %  model.score(X_train,y_train))

Training R-squared value: 0.6323
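The score above is computed on the training data, so it does not show how well the model generalises. A minimal sketch of evaluating on the held-out split instead, using synthetic data; `X_demo`, `y_demo`, and the coefficients are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: two features plus Gaussian noise.
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=300)

# Same 80/20 split as in the notebook.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)

# Fit on the training split only, then predict the unseen test split.
demo_model = LinearRegression().fit(Xd_train, yd_train)
predictions = demo_model.predict(Xd_test)

# R-squared and root mean squared error on the test split.
test_r2 = r2_score(yd_test, predictions)
test_rmse = np.sqrt(mean_squared_error(yd_test, predictions))

print("Test R-squared value: %.4f" % test_r2)
print("Test RMSE: %.4f" % test_rmse)
```

In the notebook, the equivalent check would be `model.score(X_test, y_test)`, which can then be compared with the training score.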
