The goal of this project is to build a linear regression model that predicts house price
per unit area from features such as house age, distance to the nearest MRT station, number
of convenience stores, and location (latitude and longitude). The model helps quantify how
each factor affects the price per unit area and can assist potential buyers and real estate
agents in making informed decisions.

## Import the necessary libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score


## Loading the dataset

In [3]:
data=pd.read_csv('Real estate.csv')
data.head()

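|   | No | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude | Y house price of unit area |
|---|----|---------------------|--------------|----------------------------------------|---------------------------------|-------------|--------------|----------------------------|
| 0 | 1 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 3 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 4 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 5 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 |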


## Data Exploration and Cleaning

In [4]:
data.shape

(414, 8)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   No                                      414 non-null    int64  
 1   X1 transaction date                     414 non-null    float64
 2   X2 house age                            414 non-null    float64
 3   X3 distance to the nearest MRT station  414 non-null    float64
 4   X4 number of convenience stores         414 non-null    int64  
 5   X5 latitude                             414 non-null    float64
 6   X6 longitude                            414 non-null    float64
 7   Y house price of unit area              414 non-null    float64
dtypes: float64(6), int64(2)
memory usage: 26.0 KB


In [6]:
##checking summary statistics
data.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| No | 414.0 | 207.5 | 119.655756 | 1.0 | 104.25 | 207.5 | 310.75 | 414.0 |
| X1 transaction date | 414.0 | 2013.148971 | 0.281967 | 2012.667 | 2012.917 | 2013.167 | 2013.417 | 2013.583 |
| X2 house age | 414.0 | 17.71256 | 11.392485 | 0.0 | 9.025 | 16.1 | 28.15 | 43.8 |
| X3 distance to the nearest MRT station | 414.0 | 1083.885689 | 1262.109595 | 23.38284 | 289.3248 | 492.2313 | 1454.279 | 6488.021 |
| X4 number of convenience stores | 414.0 | 4.094203 | 2.945562 | 0.0 | 1.0 | 4.0 | 6.0 | 10.0 |
| X5 latitude | 414.0 | 24.96903 | 0.01241 | 24.93207 | 24.963 | 24.9711 | 24.977455 | 25.01459 |
| X6 longitude | 414.0 | 121.533361 | 0.015347 | 121.47353 | 121.528085 | 121.53863 | 121.543305 | 121.56627 |
| Y house price of unit area | 414.0 | 37.980193 | 13.606488 | 7.6 | 27.7 | 38.45 | 46.6 | 117.5 |


In [7]:
##checking missing values
data.isnull().sum()

No                                        0
X1 transaction date                       0
X2 house age                              0
X3 distance to the nearest MRT station    0
X4 number of convenience stores           0
X5 latitude                               0
X6 longitude                              0
Y house price of unit area                0


## Data splitting

In [9]:
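##assigning variables: x holds the features, y holds the target
##note: only 'No' and 'X1 transaction date' are dropped, so x still contains
##the target column 'Y house price of unit area'
x=data.drop(['No','X1 transaction date'],axis=1)
y=data['Y house price of unit area']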

## 1. Build and Train the Linear Regression Model

1.1 Splitting the data into training and test sets

In [11]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
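
To make the effect of `test_size=0.2` and `random_state=42` concrete, here is a minimal sketch on a synthetic array with the same number of rows as the dataset (414). The names `X_demo`, `y_demo`, and the random values are assumptions for illustration only and are not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 414 rows, 5 arbitrary feature columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(414, 5))
y_demo = rng.normal(size=414)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test share is rounded up: ceil(0.2 * 414) = 83 rows; the other 331 train.
print(X_tr.shape, X_te.shape)

# Fixing random_state makes the shuffle reproducible across calls.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

With the same `random_state`, rerunning the notebook gives the same rows in each partition, so metrics stay comparable between runs.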

1.2 Building the model

In [12]:
Reg=LinearRegression()

In [13]:
##fitting linear regression
Reg.fit(x_train,y_train)

In [14]:
##checking the intercept
Reg.intercept_

np.float64(2.6787461138155777e-12)

In [15]:
##checking the coefficients
Reg.coef_

array([ 1.17451965e-16, -8.32667268e-17, -7.25060203e-17, -8.61546468e-14,
       -3.65330329e-15,  1.00000000e+00])
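
The six values in `Reg.coef_` follow the column order of `x`. Since `x` was built by dropping only `No` and `X1 transaction date`, its sixth column is `Y house price of unit area`, so the coefficient of 1 in the last position (with all others near zero) means the model is reproducing the target from itself. The sketch below shows one way to label coefficients by column name and reproduces the same pattern on synthetic data; `feature_a`, `feature_b`, `target`, and the generating formula are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): two features and a noisy linear target.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
demo["target"] = (
    3.0 * demo["feature_a"] - 2.0 * demo["feature_b"]
    + rng.normal(scale=0.5, size=n)
)

# Deliberately leave the target among the features, as in the cell above.
leaky_X = demo
model = LinearRegression().fit(leaky_X, demo["target"])

# Pair each coefficient with its column name so the pattern is readable.
coef_table = pd.Series(model.coef_, index=leaky_X.columns)
print(coef_table.round(6))
```

In the notebook itself, the equivalent labelling would be `pd.Series(Reg.coef_, index=x.columns)`.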

## 2. Evaluating the model

In [19]:
##predicting on the test set
y_pred=Reg.predict(x_test)

In [22]:
##checking the accuracy using r2 score
r2_score(y_test,y_pred)

1.0
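
An R² of exactly 1.0 on held-out data is what a linear model produces when the target column is among its inputs, which is the case here because `x` still includes `Y house price of unit area`. The following sketch compares a fit that keeps the target in the features with one that drops it, and also reports RMSE through `mean_squared_error`, which is imported above but not used. The helper `fit_and_score`, the synthetic frame, and its generating formula are assumptions for illustration; the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

TARGET = "Y house price of unit area"


def fit_and_score(frame, drop_cols):
    """Hypothetical helper: fit on frame minus drop_cols, return (R2, RMSE)."""
    X = frame.drop(columns=drop_cols)
    y = frame[TARGET]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    return float(r2_score(y_test, y_pred)), rmse


# Synthetic stand-in (assumption): two made-up features and a noisy target.
rng = np.random.default_rng(0)
n = 414
synthetic = pd.DataFrame({
    "feature_a": rng.uniform(0, 40, n),
    "feature_b": rng.uniform(20, 6000, n),
})
synthetic[TARGET] = (
    45 - 0.3 * synthetic["feature_a"] - 0.005 * synthetic["feature_b"]
    + rng.normal(scale=5, size=n)
)

# Target left in the features: the score is perfect but meaningless.
leaky_r2, leaky_rmse = fit_and_score(synthetic, drop_cols=[])

# Target removed from the features: the score reflects real predictive power.
clean_r2, clean_rmse = fit_and_score(synthetic, drop_cols=[TARGET])

print(f"with target in features:  R2={leaky_r2:.4f}, RMSE={leaky_rmse:.4f}")
print(f"target dropped:           R2={clean_r2:.4f}, RMSE={clean_rmse:.4f}")
```

On the notebook's own frame, the leak-free call would be `fit_and_score(data, ["No", "X1 transaction date", TARGET])`, which leaves the five predictors X2 through X6 as features.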