# __House Price Prediction Model__

### __Problem Statement__ 

The objective of this project is to build a __regression model__ that accurately predicts house prices from the influencing factors recorded in the dataset.

### __Dataset Used__

The dataset is sourced from Kaggle and contains features such as the BHK count, under-construction status, RERA status, area in square feet, and other relevant attributes that affect the price of a house, together with the target price.

### __Model__
### *_Data preprocessing_*
#### 1. Importing Libraries
We import the libraries used throughout the model.

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

#Display Settings
%matplotlib inline

#### 2. Load the Dataset

We load the training dataset __train.csv__ into a DataFrame `df`.

In [4]:
df = pd.read_csv("train.csv")

#### 3. Displaying First few rows of the Dataset

In [5]:
df.head()

Unnamed: 0,POSTED_BY,UNDER_CONSTRUCTION,RERA,BHK_NO.,BHK_OR_RK,SQUARE_FT,READY_TO_MOVE,RESALE,ADDRESS,LONGITUDE,LATITUDE,TARGET(PRICE_IN_LACS)
0,Owner,0,0,2,BHK,1300.236407,1,1,"Ksfc Layout,Bangalore",12.96991,77.59796,55.0
1,Dealer,0,0,2,BHK,1275.0,1,1,"Vishweshwara Nagar,Mysore",12.274538,76.644605,51.0
2,Owner,0,0,2,BHK,933.159722,1,1,"Jigani,Bangalore",12.778033,77.632191,43.0
3,Owner,0,1,2,BHK,929.921143,1,1,"Sector-1 Vaishali,Ghaziabad",28.6423,77.3445,62.5
4,Dealer,1,0,2,BHK,999.009247,0,1,"New Town,Kolkata",22.5922,88.484911,60.5


#### 4. Summary Statistics

We get summary statistics for the numeric columns with the `describe` function.

In [7]:
print("\nSummary Statistics of the Dataset:")
df.describe()


Summary Statistics of the Dataset:


Unnamed: 0,UNDER_CONSTRUCTION,RERA,BHK_NO.,SQUARE_FT,READY_TO_MOVE,RESALE,LONGITUDE,LATITUDE,TARGET(PRICE_IN_LACS)
count,29451.0,29451.0,29451.0,29451.0,29451.0,29451.0,29451.0,29451.0,29451.0
mean,0.179756,0.317918,2.392279,19802.17,0.820244,0.929578,21.300255,76.837695,142.898746
std,0.383991,0.465675,0.879091,1901335.0,0.383991,0.255861,6.205306,10.557747,656.880713
min,0.0,0.0,1.0,3.0,0.0,0.0,-37.713008,-121.761248,0.25
25%,0.0,0.0,2.0,900.0211,1.0,1.0,18.452663,73.7981,38.0
50%,0.0,0.0,2.0,1175.057,1.0,1.0,20.75,77.324137,62.0
75%,0.0,1.0,3.0,1550.688,1.0,1.0,26.900926,77.82874,100.0
max,1.0,1.0,20.0,254545500.0,1.0,1.0,59.912884,152.962676,30000.0
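
The summary above shows a `SQUARE_FT` maximum several orders of magnitude above the 75th percentile, which suggests a few extreme listings. A minimal sketch for inspecting such rows follows. The 0.99 percentile cutoff and the toy data are assumptions chosen for illustration and are not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "SQUARE_FT": [900.0, 1175.0, 1300.0, 1550.0, 254545500.0],
    "TARGET(PRICE_IN_LACS)": [38.0, 62.0, 55.0, 100.0, 30000.0],
})

# Hypothetical cutoff: flag rows above the 99th percentile of SQUARE_FT.
cutoff = toy["SQUARE_FT"].quantile(0.99)
outliers = toy[toy["SQUARE_FT"] > cutoff]

print(f"Rows above the cutoff: {len(outliers)}")
print(outliers)
```

Whether such rows should be removed, capped, or kept is a modelling decision; the sketch only shows how to surface them.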


#### 5. Check for missing values and drop them

In [11]:
print("\nMissing values in the dataset:")
print(df.isnull().sum())
df = df.dropna()


Missing values in the dataset:
POSTED_BY                0
UNDER_CONSTRUCTION       0
RERA                     0
BHK_NO.                  0
BHK_OR_RK                0
SQUARE_FT                0
READY_TO_MOVE            0
RESALE                   0
ADDRESS                  0
LONGITUDE                0
LATITUDE                 0
TARGET(PRICE_IN_LACS)    0
dtype: int64


#### 6. Convert categorical variables to dummy/indicator variables 

In [12]:
df = pd.get_dummies(df, drop_first = True)
df.head()

Unnamed: 0,UNDER_CONSTRUCTION,RERA,BHK_NO.,SQUARE_FT,READY_TO_MOVE,RESALE,LONGITUDE,LATITUDE,TARGET(PRICE_IN_LACS),POSTED_BY_Dealer,...,"ADDRESS_vasundhara nagar,Jalna","ADDRESS_veeraragavalu Nagar, Vinayagapuram, Kathirvedu Village, Ambattur Taluk,Chennai","ADDRESS_vidyut nagar,Rajkot","ADDRESS_vikas nagar,Karnal","ADDRESS_vinayaka,Varanasi","ADDRESS_virar,Palghar","ADDRESS_vishakoderu,Bhimavaram","ADDRESS_walkeshwari nagari,Jamnagar","ADDRESS_west mambalam,Chennai","ADDRESS_yelahanka/Jakkur,Bangalore"
0,0,0,2,1300.236407,1,1,12.96991,77.59796,55.0,False,...,False,False,False,False,False,False,False,False,False,False
1,0,0,2,1275.0,1,1,12.274538,76.644605,51.0,True,...,False,False,False,False,False,False,False,False,False,False
2,0,0,2,933.159722,1,1,12.778033,77.632191,43.0,False,...,False,False,False,False,False,False,False,False,False,False
3,0,1,2,929.921143,1,1,28.6423,77.3445,62.5,False,...,False,False,False,False,False,False,False,False,False,False
4,1,0,2,999.009247,0,1,22.5922,88.484911,60.5,True,...,False,False,False,False,False,False,False,False,False,False
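
Because `get_dummies` creates one indicator column per distinct `ADDRESS` value, the encoded frame becomes very wide, as the header above shows. One possible alternative, sketched here under the assumption that the text after the last comma in `ADDRESS` is the city, is to encode only that part. The `CITY` column and the toy rows are illustrative and do not appear in the original notebook:

```python
import pandas as pd

# Toy rows modelled on the first few records shown earlier.
toy = pd.DataFrame({
    "POSTED_BY": ["Owner", "Dealer", "Owner", "Dealer"],
    "ADDRESS": [
        "Ksfc Layout,Bangalore",
        "Vishweshwara Nagar,Mysore",
        "Jigani,Bangalore",
        "New Town,Kolkata",
    ],
    "SQUARE_FT": [1300.2, 1275.0, 933.2, 999.0],
})

# Assumption: the city is whatever follows the last comma in ADDRESS.
toy["CITY"] = toy["ADDRESS"].str.rsplit(",", n=1).str[-1].str.strip()

# Drop the raw address and one-hot encode the remaining categorical columns.
encoded = pd.get_dummies(toy.drop(columns="ADDRESS"), drop_first=True)
print(encoded.columns.tolist())
```

This trades location detail for far fewer columns; whether that helps the model would need to be checked on the validation set.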


#### 7. Separate Features and Target

We assume **TARGET(PRICE_IN_LACS)** is the column to predict; replace it with the actual column name if it differs.

In [19]:
Target_column = 'TARGET(PRICE_IN_LACS)' #Replace with the actual target column name

X = df.drop(Target_column, axis=1)
y = df[Target_column]

print(X.head())
print(y.head())


   UNDER_CONSTRUCTION  RERA  BHK_NO.    SQUARE_FT  READY_TO_MOVE  RESALE  \
0                   0     0        2  1300.236407              1       1   
1                   0     0        2  1275.000000              1       1   
2                   0     0        2   933.159722              1       1   
3                   0     1        2   929.921143              1       1   
4                   1     0        2   999.009247              0       1   

   LONGITUDE   LATITUDE  POSTED_BY_Dealer  POSTED_BY_Owner  ...  \
0  12.969910  77.597960             False             True  ...   
1  12.274538  76.644605              True            False  ...   
2  12.778033  77.632191             False             True  ...   
3  28.642300  77.344500             False             True  ...   
4  22.592200  88.484911              True            False  ...   

   ADDRESS_vasundhara nagar,Jalna  \
0                           False   
1                           False   
2                           F

#### 9. Split the Data into Training, Validation, and Test Sets

Although we have a separate dataset for testing, we still split the training dataset into three parts: a training set, a validation set, and a test set. The validation set supports hyperparameter tuning and helps detect overfitting, while the held-out test set gives a less biased estimate of performance.

In [33]:
# Define the split proportions
train_size = 0.6  # 60% of the training dataset for training the model
val_size = 0.2    # 20% of the training dataset for validating the model
test_size = 0.2   # 20% of the training dataset for testing the model

# random_state fixes the shuffle seed so the split is reproducible
# First split into train and temp sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=val_size + test_size, random_state=42)
print(X_train.head(), X_temp.head(), y_train.head(), y_temp.head())


       UNDER_CONSTRUCTION  RERA  BHK_NO.    SQUARE_FT  READY_TO_MOVE  RESALE  \
1238                    0     0        4  2718.868951              1       1   
11804                   0     1        3  1875.779496              1       1   
10605                   0     1        3  1949.756280              1       1   
9909                    0     1        2   979.929162              1       1   
6109                    1     1        1   404.526848              0       0   

       LONGITUDE   LATITUDE  POSTED_BY_Dealer  POSTED_BY_Owner  ...  \
1238   19.058710  72.899690              True            False  ...   
11804  30.666500  76.862500              True            False  ...   
10605  30.594843  76.849830              True            False  ...   
9909   28.509982  77.051850              True            False  ...   
6109   19.113330  72.921567              True            False  ...   

       ADDRESS_vasundhara nagar,Jalna  \
1238                            False   
11804     

In [34]:
# Split temp set into validation and test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=test_size / (val_size + test_size), random_state=42)
print("\n\n---------------------------------------------------------------------------------\n",X_val.head(), X_test.head(), y_val.head(), y_test.head())



---------------------------------------------------------------------------------
        UNDER_CONSTRUCTION  RERA  BHK_NO.    SQUARE_FT  READY_TO_MOVE  RESALE  \
330                     0     1        3  1902.843071              1       1   
23351                   0     0        3  1500.375094              1       1   
28661                   1     1        2   822.268242              0       0   
11857                   1     1        2   503.343239              0       0   
24870                   0     1        3  1000.000000              1       1   

       LONGITUDE   LATITUDE  POSTED_BY_Dealer  POSTED_BY_Owner  ...  \
330    12.969910  77.597960              True            False  ...   
23351  10.929882  76.948199             False             True  ...   
28661  28.250000  77.070000              True            False  ...   
11857  24.862517  78.282467              True            False  ...   
24870  28.404730  77.367514              True            False  ...   
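
To make the two-stage split concrete: the first call holds out `val_size + test_size` = 0.4 of the rows, and the second call divides that held-out part using `test_size / (val_size + test_size)` = 0.5, which gives 60/20/20 overall. A small check on synthetic data is sketched below; the 100-row arrays and the `_demo` names are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

val_size, test_size = 0.2, 0.2

# Stage 1: hold out 40% of the rows as a temporary set.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=val_size + test_size, random_state=42
)

# Stage 2: split the temporary set in half into validation and test.
X_v, X_te, y_v, y_te = train_test_split(
    X_tmp, y_tmp, test_size=test_size / (val_size + test_size), random_state=42
)

print(len(X_tr), len(X_v), len(X_te))
```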

       

### _Model Training_

First, we fit a linear regression model to the dataset.

In [35]:
model = LinearRegression()
model.fit(X_train, y_train)
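
The imports at the top include `mean_squared_error` and `r2_score`, which suggests the fitted model is meant to be scored on held-out data. A minimal sketch of that step on synthetic data follows. The arrays, the noise level, and the names `X_demo`, `demo_model`, `val_mse`, and `val_r2` are assumptions for illustration, not results from the notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: one feature standing in for SQUARE_FT, plus noisy prices.
rng = np.random.default_rng(42)
X_demo = rng.uniform(500, 3000, size=(200, 1))
y_demo = 0.05 * X_demo[:, 0] + rng.normal(0, 5, size=200)

X_tr, X_v, y_tr, y_v = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training part only, then score on the held-out part.
demo_model = LinearRegression()
demo_model.fit(X_tr, y_tr)

val_pred = demo_model.predict(X_v)
val_mse = mean_squared_error(y_v, val_pred)
val_r2 = r2_score(y_v, val_pred)
print(f"Validation MSE: {val_mse:.2f}, R^2: {val_r2:.3f}")
```

In the notebook itself, the same calls would take `X_val` and `y_val` from step 9 in place of the synthetic arrays.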