# About Dataset
House Price Prediction Challenge

## Overview

Welcome to the House Price Prediction Challenge. You will test your regression skills by designing an algorithm that accurately predicts house prices in India. Predicting house prices accurately can be a daunting task: buyers are concerned with more than the size (in square feet) of a house, and several other factors play a key role in deciding the price of a property. It can be extremely difficult to work out which set of attributes best explains buyer behavior. This dataset was collected from various property aggregators across India.

## Attributes Description:
| Column | Description |
| --- | --- |
| POSTED_BY | Category marking who has listed the property |
| UNDER_CONSTRUCTION | Under Construction or Not |
| RERA | Rera approved or Not |
| BHK_NO | Number of Rooms |
| BHK_OR_RK | Type of property |
| SQUARE_FT | Total area of the house in square feet |
| READY_TO_MOVE | Category marking Ready to move or Not |
| RESALE | Category marking Resale or not |
| ADDRESS | Address of the property |
| LONGITUDE | Longitude of the property |
| LATITUDE | Latitude of the property |

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('train.csv')
df.head()

# EDA

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
df['POSTED_BY'].value_counts()

In [None]:
df['UNDER_CONSTRUCTION'].value_counts()

In [None]:
df['RERA'].value_counts()

In [None]:
df['BHK_NO.'].value_counts()

In [None]:
df['BHK_OR_RK'].value_counts()

In [None]:
df['SQUARE_FT'].value_counts()

In [None]:
df['READY_TO_MOVE'].value_counts()

In [None]:
df['RESALE'].value_counts()

In [None]:
df['ADDRESS'].value_counts()

In [None]:
df['LONGITUDE'].value_counts()

In [None]:
df['LATITUDE'].value_counts()

In [None]:
df['TARGET(PRICE_IN_LACS)'].value_counts()

In [None]:
df['POSTED_BY'].value_counts().plot(kind='bar')

In [None]:
df['UNDER_CONSTRUCTION'].value_counts().plot(kind='bar')

In [None]:
df.info()

In [None]:
# Cast POSTED_BY and BHK_OR_RK to the category dtype
df['POSTED_BY'] = df['POSTED_BY'].astype('category')
df['BHK_OR_RK'] = df['BHK_OR_RK'].astype('category')

In [None]:
df.info()

In [None]:
# Drop Address column
df.drop('ADDRESS', axis=1, inplace=True)

In [None]:
df.info()

In [None]:
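# Label Encoding (one encoder per column, so each fitted mapping stays available)
from sklearn.preprocessing import LabelEncoder
le_posted_by = LabelEncoder()
le_bhk_or_rk = LabelEncoder()
df['POSTED_BY'] = le_posted_by.fit_transform(df['POSTED_BY'])
df['BHK_OR_RK'] = le_bhk_or_rk.fit_transform(df['BHK_OR_RK'])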
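
Label encoding maps each category to an integer, which hands linear models an ordering that a column such as `POSTED_BY` probably does not have. One alternative is one-hot encoding. The sketch below uses a small hypothetical frame; the category values and the names `toy` and `encoded` are assumptions for illustration and are not taken from `train.csv`.

```python
import pandas as pd

# Hypothetical toy frame; the category values are assumed, not read from train.csv.
toy = pd.DataFrame({
    "POSTED_BY": ["Owner", "Dealer", "Builder", "Dealer"],
    "BHK_OR_RK": ["BHK", "BHK", "RK", "BHK"],
    "SQUARE_FT": [1300.0, 1275.0, 933.0, 929.0],
})

# One indicator column per category. drop_first removes the first category of
# each column, since it is fully determined by the remaining indicators.
encoded = pd.get_dummies(
    toy, columns=["POSTED_BY", "BHK_OR_RK"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Tree-based models are usually less sensitive to this choice than linear ones, so the benefit is likely to show up mainly for Linear Regression, Bayesian Ridge, and Elastic Net.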

In [None]:
df.info()

In [None]:
df.head()

# Data Visualization

In [None]:
sns.heatmap(df.corr(), annot=True)

In [None]:
# Scatter plot Target vs other features
columns = ['POSTED_BY', 'UNDER_CONSTRUCTION', 'RERA', 'BHK_NO.', 'BHK_OR_RK', 'SQUARE_FT', 'READY_TO_MOVE', 'RESALE', 'LONGITUDE', 'LATITUDE']

for i in columns:
    plt.scatter(df[i], df['TARGET(PRICE_IN_LACS)'])
    plt.xlabel(i)
    plt.ylabel('TARGET(PRICE_IN_LACS)')
    plt.show()


# Detecting Outliers for SQUARE_FT

In [None]:
sns.boxplot(x=df['SQUARE_FT'])

In [None]:
df['SQUARE_FT'].describe()

In [None]:
# Delete outliers: keep only rows with SQUARE_FT below 1,000,000
df = df[df['SQUARE_FT'] < 1000000]
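
The cutoff of 1,000,000 in the cell above is hand-picked. As one possible alternative, an upper bound can be derived from the data with the interquartile range (Tukey's fence). The sketch below runs on made-up areas, not on the dataset; `areas`, `upper`, and `kept` are illustrative names.

```python
import pandas as pd

# Hypothetical areas in square feet; the last value mimics a data-entry error.
areas = pd.Series(
    [650.0, 900.0, 1100.0, 1250.0, 1400.0, 1800.0, 2500.0, 2_000_000.0],
    name="SQUARE_FT",
)

# Quartiles and interquartile range.
q1, q3 = areas.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey's upper fence: values above it are treated as outliers.
upper = q3 + 1.5 * iqr

kept = areas[areas <= upper]
print(f"upper fence: {upper:.1f}, rows kept: {len(kept)} of {len(areas)}")
```

On heavily skewed data such as property areas, the 1.5 multiplier may remove more rows than intended, so the fence is worth checking against the box plot before dropping anything.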

In [None]:
sns.boxplot(x=df['SQUARE_FT'])

In [None]:
# Scatter plot Target vs SQUARE_FT
plt.scatter(df['SQUARE_FT'], df['TARGET(PRICE_IN_LACS)'])

## Feature Engineering

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Split data into X and y
X = df.drop('TARGET(PRICE_IN_LACS)', axis=1)
y = df['TARGET(PRICE_IN_LACS)']

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# Split data into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=101, test_size=0.2)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
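
The cell above fits the scaler on the training split only, so no test-set statistics reach the model. The grid search later in the notebook, however, cross-validates on data that was already scaled with statistics from the whole training split. One possible way to keep scaling inside each fold is a scikit-learn `Pipeline`. The sketch below uses synthetic data; `X_demo`, `y_demo`, and `pipe` are illustrative names that do not appear in the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (an assumption for illustration).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([2.0, 0.05, 300.0]) + rng.normal(0, 0.5, size=200)

# The scaler is refit on the training part of every fold, so the validation
# part of a fold never influences the scaling statistics.
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)  # R2 per fold
print(scores.round(3))
```

The same `pipe` object could be passed to `GridSearchCV` in place of a bare estimator, with parameter names prefixed by the step name.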

In [None]:
X_train

In [None]:
X_test

## Model Building
### Linear Regression

In [None]:
# Linear Regression
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

### Decision Tree Regressor

In [None]:
# Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)

In [None]:
dtr.score(X_train, y_train)

In [None]:
dtr.score(X_test, y_test)

In [None]:
# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

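# 'mse' and 'mae' were renamed to 'squared_error' and 'absolute_error' in scikit-learn 1.0; the old names were removed in 1.2
params = {'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'], 'splitter': ['best', 'random'], 'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10]}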
grid = GridSearchCV(dtr, param_grid=params, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

In [None]:
grid.best_params_


In [106]:
grid.best_score_

0.951269667012744

In [107]:
grid.score(X_train, y_train)

0.9736282435609258

In [108]:
grid.score(X_test, y_test)

0.932763780082391
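
By default `GridSearchCV` refits the best parameter combination on the whole training split and exposes the result as `best_estimator_`. The evaluation cells at the end of this notebook call `dtr.predict`, which is the untuned tree, so their Decision Tree numbers do not reflect the tuned scores shown here. A minimal sketch of pulling out and scoring the tuned model, on synthetic data with hypothetical names (`X_demo`, `y_demo`, `search`, `tuned`):

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * X_demo[:, 1] + rng.normal(0, 0.1, size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=101)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},
    cv=5,
)
search.fit(Xtr, ytr)

# best_estimator_ is the tree refit with the best parameters (refit=True by default).
tuned = search.best_estimator_
tuned_r2 = r2_score(yte, tuned.predict(Xte))
print(search.best_params_, round(tuned_r2, 3))
```

With the default `scoring=None`, `search.score(Xte, yte)` returns the same R2 value, because it delegates to the refit estimator.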

### Random Forest Regressor

In [109]:
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)

In [110]:
rfr.score(X_train, y_train)

0.9932255086868519

In [111]:
rfr.score(X_test, y_test)

0.9479160788046371

### Bayesian Ridge

In [112]:
# Bayesian Ridge
from sklearn.linear_model import BayesianRidge

br = BayesianRidge()
br.fit(X_train, y_train)

In [113]:
br.score(X_train, y_train)

0.3746563601228028

In [114]:
br.score(X_test, y_test)

0.48990878843920527

### Elastic Net

In [115]:
# Elastic Net
from sklearn.linear_model import ElasticNet

en = ElasticNet()
en.fit(X_train, y_train)

In [116]:
en.score(X_train, y_train)

0.3372094720739097

In [117]:
en.score(X_test, y_test)

0.3989837143946131

# Model Evaluation

In [119]:
# Explained variance score on the test split (dtr is the untuned tree, not the grid-searched one)
from sklearn.metrics import explained_variance_score

print('Linear Regression: ', explained_variance_score(y_test, lr.predict(X_test)))
print('Decision Tree Regressor: ', explained_variance_score(y_test, dtr.predict(X_test)))
print('Random Forest Regressor: ', explained_variance_score(y_test, rfr.predict(X_test)))
print('Bayesian Ridge: ', explained_variance_score(y_test, br.predict(X_test)))
print('Elastic Net: ', explained_variance_score(y_test, en.predict(X_test)))

Linear Regression:  0.4900897640048708
Decision Tree Regressor:  0.8846273232053333
Random Forest Regressor:  0.9479200706485754
Bayesian Ridge:  0.48995923773856875
Elastic Net:  0.3990559673341091


In [120]:
# R2 Score
from sklearn.metrics import r2_score

print('Linear Regression: ', r2_score(y_test, lr.predict(X_test)))
print('Decision Tree Regressor: ', r2_score(y_test, dtr.predict(X_test)))
print('Random Forest Regressor: ', r2_score(y_test, rfr.predict(X_test)))
print('Bayesian Ridge: ', r2_score(y_test, br.predict(X_test)))
print('Elastic Net: ', r2_score(y_test, en.predict(X_test)))

Linear Regression:  0.49003936902665723
Decision Tree Regressor:  0.8846041349057705
Random Forest Regressor:  0.9479160788046371
Bayesian Ridge:  0.48990878843920527
Elastic Net:  0.3989837143946131
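
R2 and explained variance are unitless, so they do not say how far off a prediction is in lakhs. As a possible complement, mean absolute error (MAE) and root mean squared error (RMSE) report error in the target's own unit. The values below are made up for illustration and are not results from this dataset; `y_true_demo` and `y_pred_demo` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical prices in lakhs and hypothetical predictions for them.
y_true_demo = np.array([55.0, 51.0, 43.0, 62.5, 60.5])
y_pred_demo = np.array([50.0, 54.0, 45.0, 60.5, 64.5])

# MAE: average absolute deviation, in lakhs.
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# RMSE: square root of the mean squared error, also in lakhs,
# but penalising large misses more heavily than MAE does.
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(f"MAE: {mae:.2f} lakhs, RMSE: {rmse:.2f} lakhs")
```

In the notebook itself, the same two calls could be applied to `y_test` and each model's `predict(X_test)` output.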
