**1. HOUSE PRICE PREDICTION**

In this notebook, we use an available dataset to build a predictive model for house prices.
We cover three questions:

- How do we build and initialize the model?
- How do we feed data to train the model?
- How do we use the trained model to predict?

**2. Step-by-step implementation**

2.1. Load dataset

Import necessary libraries

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

%matplotlib inline

In [2]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [3]:
path = 'clean_data.csv'
df = pd.read_csv(path)
df.shape

(7120, 6)

In [4]:
df.head()

Unnamed: 0,bath,balcony,price,total_sqft_float,bhk,price_per_sqft
0,3.0,2.0,150.0,1672.0,3,8971.291866
1,3.0,3.0,149.0,1750.0,3,8514.285714
2,3.0,2.0,150.0,1750.0,3,8571.428571
3,2.0,2.0,40.0,1250.0,2,3200.0
4,2.0,2.0,83.0,1200.0,2,6916.666667


2.2. Split dataset

In [5]:
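# Note: in the rows shown by df.head(), price_per_sqft equals
# price * 100000 / total_sqft_float, so keeping it in X gives the model
# a feature derived from the target.
X = df.drop(columns='price')
y = df['price']
X.shape, y.shape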

((7120, 5), (7120,))

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
print("Shape of input training set:", X_train.shape)
print("Shape of output training set:", y_train.shape)
print("Shape of input test set:", X_test.shape)
print("Shape of output test set:", y_test.shape)

Shape of input training set: (5696, 5)
Shape of output training set: (5696,)
Shape of input test set: (1424, 5)
Shape of output test set: (1424,)


2.3. Feature Scaling

Without feature scaling, algorithms that are sensitive to feature magnitudes (for example,
regularized linear models and SVMs) tend to give more weight to features with larger numeric
values and less weight to features with smaller values, regardless of the units involved.

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
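
To make the scaling step concrete, here is a minimal sketch on toy data. The arrays and the `_toy` names are hypothetical and are not taken from `clean_data.csv`. It illustrates that `StandardScaler` computes `(x - mean) / std` per column, and that the statistics learned from the training rows are reused for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data (hypothetical values): column 0 is an area in square feet,
# column 1 is a bathroom count, so the columns differ by orders of magnitude.
X_train_toy = np.array([[1000.0, 1.0],
                        [1500.0, 2.0],
                        [2000.0, 3.0]])
X_test_toy = np.array([[1750.0, 2.0]])

scaler_toy = StandardScaler()
scaler_toy.fit(X_train_toy)  # learns per-column mean and std from training rows only

X_train_scaled = scaler_toy.transform(X_train_toy)
X_test_scaled = scaler_toy.transform(X_test_toy)  # reuses the training statistics

# The same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (X_train_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(np.allclose(X_train_scaled, manual))
```

Fitting on the training split only, as the notebook does, keeps information about the test rows out of the preprocessing step.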

2.4. Build regression model

2.4.1. Linear regression

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

lr = LinearRegression()

In [9]:
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))
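
As a quick sanity check of the `rmse` helper, the sketch below works through the formula by hand on four made-up values (the `_toy` arrays are hypothetical) and compares the result with the scikit-learn route.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical prices, in the same unit as the target column
y_true_toy = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_toy = np.array([12.0, 18.0, 33.0, 39.0])

# RMSE by hand: square the errors, average them, take the square root
errors = y_pred_toy - y_true_toy             # [2, -2, 3, -1]
rmse_manual = np.sqrt(np.mean(errors ** 2))  # sqrt((4 + 4 + 9 + 1) / 4) = sqrt(4.5)

# Same quantity through scikit-learn, mirroring the rmse helper
rmse_sklearn = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, RMSE is expressed in the same unit as the target and penalizes large errors more heavily than small ones.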

Linear model

In [10]:
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print(rmse(y_test, y_pred_lr))

32.89717191486189


Lasso model

In [11]:
lambda_ = [0.01, 0.05, 0.075, 0.1, 0.25, 0.5]
error_lasso = []

for alpha in lambda_:
    lr_lasso = Lasso(alpha=alpha)
    lr_lasso.fit(X_train, y_train)
    y_pred_lasso = lr_lasso.predict(X_test)
    error_lasso.append(rmse(y_test, y_pred_lasso))

i = np.argmin(error_lasso)
print("Minimum loss of Lasso regression:")
print(np.min(error_lasso))
print("Best lambda: ", lambda_[i])

Minimum loss of Lasso regression:
32.89910777134902
Best lambda:  0.01


Ridge model

In [12]:
error_ridge = []

for alpha in lambda_:
    lr_ridge = Ridge(alpha=alpha)
    lr_ridge.fit(X_train, y_train)
    y_pred_ridge = lr_ridge.predict(X_test)
    error_ridge.append(rmse(y_test, y_pred_ridge))

i = np.argmin(error_ridge)
print("Minimum loss of Ridge regression:")
print(np.min(error_ridge))
print("Best lambda: ", lambda_[i])

Minimum loss of Ridge regression:
32.89719109092792
Best lambda:  0.01
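
The two loops above choose the penalty by comparing errors on the test set, which means the test set is no longer an untouched estimate of generalization. One common alternative, sketched here under assumptions, is to select the penalty by cross-validation on the training split only. The data below is synthetic (`make_regression`), the variable names are hypothetical, and the grid is the same as `lambda_`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the housing data (an assumption of this sketch)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

alphas = [0.01, 0.05, 0.075, 0.1, 0.25, 0.5]  # same grid as lambda_

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alphas},
    scoring="neg_root_mean_squared_error",  # scores are maximized, hence the negated RMSE
    cv=5,
)
search.fit(X_tr, y_tr)  # the test split is never seen during selection

best_alpha = search.best_params_["alpha"]
cv_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print("Best alpha (cross-validated):", best_alpha)
print("Cross-validated RMSE:", cv_rmse)
```

After selection, `search.best_estimator_` is refit on the whole training split and can be evaluated once on the held-out test rows.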


2.4.2. Support Vector Machine (SVM)

In [13]:
from sklearn.svm import SVR

svr = SVR(kernel='rbf', C=1000)
svr.fit(X_train, y_train)
# For regressors, score() returns the R^2 coefficient of determination
svr_score = svr.score(X_test, y_test)
svr_rmse = rmse(y_true=y_test, y_pred=svr.predict(X_test))

print(svr_score)
print(svr_rmse)

0.9817439476705249
14.396655658930486
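
The first number printed above is the R² score. As a small worked example on made-up values (the `_toy` arrays are hypothetical), R² can be computed as one minus the ratio of the residual sum of squares to the total sum of squares, and checked against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions
y_true_toy = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_toy = np.array([12.0, 18.0, 33.0, 39.0])

# Residual sum of squares: 4 + 4 + 9 + 1 = 18
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
# Total sum of squares around the mean (25): 225 + 25 + 25 + 225 = 500
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot  # 1 - 18 / 500 = 0.964
r2_sklearn = r2_score(y_true_toy, y_pred_toy)

print(r2_manual, r2_sklearn)
```

A value of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.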


2.4.3. Random Forest

In [14]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100, verbose=1)
rfr.fit(X_train, y_train)
rfr_score = rfr.score(X_test, y_test)
rfr_rmse = rmse(y_true=y_test, y_pred=rfr.predict(X_test))
print(rfr_score)
print(rfr_rmse)


[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.9s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    1.9s finished
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished

0.9651759807060099
19.88371953505962
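
The forest above is created without a fixed seed, so its scores can differ slightly from run to run. The sketch below, on synthetic data with hypothetical names, suggests how passing `random_state` makes the fit repeatable. It uses fewer trees than the notebook only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the housing data (an assumption of this sketch)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

def fit_predict(seed):
    """Fit a small forest with the given seed and predict the first five rows."""
    forest = RandomForestRegressor(n_estimators=20, random_state=seed)
    forest.fit(X_demo, y_demo)
    return forest.predict(X_demo[:5])

same_a = fit_predict(seed=42)
same_b = fit_predict(seed=42)  # same seed: same bootstrap samples and splits
other = fit_predict(seed=7)    # different seed: a different forest

print(np.allclose(same_a, same_b))
print(np.allclose(same_a, other))
```

Fixing the seed is mainly useful when comparing models or sharing results, since it removes run-to-run noise from the comparison.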


2.5. Save and load model

In [44]:
import joblib

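# Note: only the SVR is saved here; new inputs must still be transformed
# with the fitted scaler from section 2.3 before calling predict.
joblib.dump(svr, 'bangalore_house_price_prediction_model.pkl')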

['bangalore_house_price_prediction_model.pkl']

In [45]:
bangalore_house_price_prediction_model = joblib.load('bangalore_house_price_prediction_model.pkl')

In [47]:
y_pred_model = bangalore_house_price_prediction_model.predict(X_test)
rmse(y_true=y_test, y_pred=y_pred_model)

np.float64(14.396655658930486)

In [48]:
bangalore_house_price_prediction_model.score(X_test, y_test)

0.9817439476705249
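
The saved file contains the SVR but not the `StandardScaler` fitted in section 2.3, so anyone loading it must scale new inputs the same way. One possible approach, sketched here on synthetic data, is to wrap both steps in a scikit-learn `Pipeline` and persist that single object. The file name `demo_pipeline.pkl` and the variable names are hypothetical, and this is not the notebook's own method.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data (an assumption of this sketch)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

# Scaling and regression travel together, so raw features can be passed to predict()
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1000))
pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline reproduces the original predictions on unscaled inputs
same_predictions = np.allclose(pipeline.predict(X_demo[:5]), restored.predict(X_demo[:5]))
print(same_predictions)
```

As with any pickle-based format, files written by `joblib.dump` should only be loaded from trusted sources and ideally with the same library versions used to save them.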