# Multiple Linear Regression
## Housing Case Study

#### Problem Statement:

Consider a real estate company that has a dataset containing the prices of properties in the Delhi region. It wishes to use the data to optimise the sale prices of the properties based on important factors such as area, bedrooms, parking, etc.

Essentially, the company wants:


- To identify the variables affecting house prices, e.g. area, number of rooms, bathrooms, etc.

- To create a linear model that quantitatively relates house prices with variables such as number of rooms, area, number of bathrooms, etc.

- To know the accuracy of the model, i.e. how well these variables can predict house prices.

**So interpretation is important!**
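
As a sketch of what such a linear model looks like, the price can be written as a weighted sum of the predictors. The symbols $\beta_j$ (coefficients) and $\varepsilon$ (error term) are notation introduced here for illustration; they are not column names in the dataset.

$$
\text{price} = \beta_0 + \beta_1\,\text{area} + \beta_2\,\text{bedrooms} + \beta_3\,\text{bathrooms} + \dots + \beta_p\,x_p + \varepsilon
$$

Each $\beta_j$ is the expected change in price for a one-unit change in that predictor with the other predictors held fixed, which is why the coefficients need to be interpretable.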

## Step 1: Reading and Understanding the Data

Let us first import NumPy, pandas and the plotting libraries, and then read the housing dataset.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
housing_df=pd.read_csv('housing.csv')

In [6]:
housing_df.head()

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [13]:
def binary_map(x):
    return x.map({'yes':1,'no':0})

In [14]:
cat_variables=['mainroad','guestroom','basement','hotwaterheating','airconditioning','prefarea']

In [15]:
housing_df[cat_variables]=housing_df[cat_variables].apply(binary_map)


In [16]:
housing_df.head()

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,1,0,0,0,1,2,1,furnished
1,12250000,8960,4,4,4,1,0,0,0,1,3,0,furnished
2,12250000,9960,3,2,2,1,0,1,0,0,2,1,semi-furnished
3,12215000,7500,4,2,2,1,0,1,0,1,3,1,furnished
4,11410000,7420,4,1,2,1,1,1,0,1,2,0,furnished
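
One thing worth checking after a mapping like `binary_map` is whether every label was covered: `Series.map` with a dictionary turns any value that is not a key into `NaN`. The following is a minimal sketch on a hypothetical toy column (not the housing data), where the capitalised `'Yes'` is deliberately left out of the mapping.

```python
import pandas as pd

# Hypothetical toy column; 'Yes' (capitalised) is deliberately not in the mapping.
toy = pd.Series(['yes', 'no', 'Yes'])

mapped = toy.map({'yes': 1, 'no': 0})
print(mapped.tolist())  # labels missing from the dictionary become NaN

# Count the values the mapping did not cover before modelling.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

If the count is non-zero on real data, the labels probably need cleaning (case, whitespace) before the mapping is applied.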


In [17]:
furninshingstatus=pd.get_dummies(housing_df['furnishingstatus'])

In [18]:
furninshingstatus.head()

,furnished,semi-furnished,unfurnished
0,1,0,0
1,1,0,0
2,0,1,0
3,1,0,0
4,1,0,0


In [19]:
furninshingstatus.drop('furnished',axis=1,inplace=True)
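
Dropping one of the three indicator columns loses no information: a row with 0 in both `semi-furnished` and `unfurnished` must be `furnished`. Keeping all three would make the columns sum to 1 in every row, which is perfectly collinear with the model's constant term. As a sketch, `pd.get_dummies` can do the same in one step with `drop_first=True`, assuming the level to drop is the first one in sorted order. The toy column below is hypothetical.

```python
import pandas as pd

# Hypothetical toy column with the three furnishing levels from the dataset.
status = pd.Series(['furnished', 'semi-furnished', 'unfurnished', 'furnished'])

# drop_first=True removes the first level in sorted order ('furnished'),
# which then acts as the baseline category.
dummies = pd.get_dummies(status, drop_first=True, dtype=int)
print(dummies)
```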

In [63]:
housing=pd.concat([housing_df,furninshingstatus],axis=1).drop('furnishingstatus',axis=1)

In [64]:
housing.head()

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,semi-furnished,unfurnished
0,13300000,7420,4,2,3,1,0,0,0,1,2,1,0,0
1,12250000,8960,4,4,4,1,0,0,0,1,3,0,0,0
2,12250000,9960,3,2,2,1,0,1,0,0,2,1,1,0
3,12215000,7500,4,2,2,1,0,1,0,1,3,1,0,0
4,11410000,7420,4,1,2,1,1,1,0,1,2,0,0,0


## Splitting the data

In [41]:
from sklearn.model_selection import train_test_split

In [65]:
train_df,test_df=train_test_split(housing,test_size=0.3,random_state=100)
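
Here `test_size=0.3` holds out roughly 30% of the rows for testing, and `random_state` fixes the shuffle so the split is reproducible. A minimal sketch on a hypothetical 10-row frame (not the housing data) illustrates both properties:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({'x': range(10), 'y': range(10, 20)})

train_a, test_a = train_test_split(toy, test_size=0.3, random_state=100)
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=100)

print(len(train_a), len(test_a))

# Same seed gives the same split, and no row lands on both sides.
same_split = train_a.index.equals(train_b.index)
no_overlap = set(train_a.index).isdisjoint(test_a.index)
print(same_split, no_overlap)
```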

## Preprocessing

In [44]:
from sklearn.preprocessing import MinMaxScaler

In [66]:
train_df.head()

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,semi-furnished,unfurnished
359,3710000,3600,3,1,1,1,0,0,0,0,1,0,0,1
19,8855000,6420,3,2,2,1,0,0,0,1,1,1,1,0
159,5460000,3150,3,2,1,1,1,1,0,1,0,0,0,0
35,8080940,7000,3,2,4,1,0,0,0,1,2,0,0,0
28,8400000,7950,5,2,2,1,0,1,1,0,2,0,0,1


In [67]:
num_var=['area','bedrooms','bathrooms','stories','mainroad','guestroom','parking','price']

In [68]:
scaler=MinMaxScaler()
train_df[num_var]=scaler.fit_transform(train_df[num_var])

In [69]:
train_df.head()

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,semi-furnished,unfurnished
359,0.169697,0.155227,0.4,0.0,0.0,1.0,0.0,0,0,0,0.333333,0,0,1
19,0.615152,0.403379,0.4,0.5,0.333333,1.0,0.0,0,0,1,0.333333,1,1,0
159,0.321212,0.115628,0.4,0.5,0.0,1.0,1.0,1,0,1,0.0,0,0,0
35,0.548133,0.454417,0.4,0.5,1.0,1.0,0.0,0,0,1,0.666667,0,0,0
28,0.575758,0.538015,0.8,0.5,0.333333,1.0,0.0,1,1,0,0.666667,0,0,1
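
`MinMaxScaler` rescales each column to `(x - min) / (max - min)`, using the minimum and maximum it saw during `fit`. When a test set is scaled later, the usual practice is to call `transform` only, so that the training minimum and maximum are reused. The sketch below uses hypothetical toy frames to show this; note that test values can then fall outside the 0 to 1 range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frames standing in for the train and test splits.
train = pd.DataFrame({'area': [2000.0, 4000.0, 6000.0]})
test = pd.DataFrame({'area': [3000.0, 7000.0]})

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=2000 and max=6000
test_scaled = scaler.transform(test)        # reuses the training min and max

# The same result by hand: (x - min) / (max - min)
manual = (train['area'] - 2000.0) / (6000.0 - 2000.0)

print(train_scaled.ravel())
print(test_scaled.ravel())
```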


## Split data into X and Y

In [70]:
Y_train_df=train_df.pop('price')
X_train_df=train_df

In [71]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

In [72]:
lr=LinearRegression()
model=lr.fit(X_train_df,Y_train_df)

In [94]:
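# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe=RFE(model,n_features_to_select=10)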
rfe=rfe.fit(X_train_df,Y_train_df)

In [95]:
pd.DataFrame(list(zip(X_train_df.columns,rfe.support_,rfe.ranking_)),columns=['Feature','Support','Rank'])

,Feature,Support,Rank
0,area,True,1
1,bedrooms,True,1
2,bathrooms,True,1
3,stories,True,1
4,mainroad,True,1
5,guestroom,True,1
6,basement,False,3
7,hotwaterheating,True,1
8,airconditioning,True,1
9,parking,True,1
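
Recursive feature elimination fits the estimator, removes the feature with the smallest absolute coefficient, and repeats until the requested number of features remains. A rank of 1 means the feature was kept; larger ranks mark features that were eliminated earlier. The following sketch runs it on synthetic data (an assumption made for illustration) where only two of four columns actually drive the target.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): only columns 0 and 2 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)  # boolean mask of the kept features
print(selector.ranking_)  # 1 = kept; larger = eliminated earlier
```

Because the elimination compares coefficient magnitudes, it is most meaningful when the features are on a similar scale, as they are after min-max scaling.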


In [98]:
import statsmodels.api as sm
from statsmodels.api import OLS

In [102]:
X_train_df=sm.add_constant(X_train_df)
lr=OLS(Y_train_df,X_train_df)
lr_model=lr.fit()

In [104]:
lr_model.summary()

Dep. Variable:,price,R-squared:,0.681
Model:,OLS,Adj. R-squared:,0.67
Method:,Least Squares,F-statistic:,60.4
Date:,"Wed, 25 Sep 2019",Prob (F-statistic):,8.83e-83
Time:,23:37:53,Log-Likelihood:,381.79
No. Observations:,381,AIC:,-735.6
Df Residuals:,367,BIC:,-680.4
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0200,0.021,0.955,0.340,-0.021,0.061
area,0.2347,0.030,7.795,0.000,0.175,0.294
bedrooms,0.0467,0.037,1.267,0.206,-0.026,0.119
bathrooms,0.1908,0.022,8.679,0.000,0.148,0.234
stories,0.1085,0.019,5.661,0.000,0.071,0.146
mainroad,0.0504,0.014,3.520,0.000,0.022,0.079
guestroom,0.0304,0.014,2.233,0.026,0.004,0.057
basement,0.0216,0.011,1.943,0.053,-0.000,0.043
hotwaterheating,0.0849,0.022,3.934,0.000,0.042,0.127

Omnibus:,93.687,Durbin-Watson:,2.093
Prob(Omnibus):,0.0,Jarque-Bera (JB):,304.917
Skew:,1.091,Prob(JB):,6.14e-67
Kurtosis:,6.801,Cond. No.,14.6
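
The adjusted R-squared in the summary can be reproduced from the other reported figures using the standard formula `1 - (1 - R²)(n - 1)/(n - p - 1)`, where `n` is the number of observations and `p` the number of predictors. As a quick check with the values from the first summary table above (the variable names are chosen here for illustration):

```python
# Values read from the first summary table above.
n = 381     # No. Observations
p = 13      # Df Model (predictors, excluding the constant)
r2 = 0.681  # R-squared

# n - p - 1 equals the reported Df Residuals (367).
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))
```

The result agrees with the reported Adj. R-squared of 0.67. Unlike plain R-squared, this quantity can decrease when a predictor that adds little is included.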


In [106]:
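# Drop bedrooms: its p-value (0.206) is above 0.05, so it is not significant in this model.
# The constant column added earlier is still present, so add_constant is not needed again.
X_train_df=X_train_df.drop("bedrooms",axis=1)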
lr=OLS(Y_train_df,X_train_df)
lr_model=lr.fit()

In [108]:
lr_model.summary()

Dep. Variable:,price,R-squared:,0.68
Model:,OLS,Adj. R-squared:,0.67
Method:,Least Squares,F-statistic:,65.2
Date:,"Wed, 25 Sep 2019",Prob (F-statistic):,2.35e-83
Time:,23:40:26,Log-Likelihood:,380.96
No. Observations:,381,AIC:,-735.9
Df Residuals:,368,BIC:,-684.7
Df Model:,12,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0351,0.017,2.032,0.043,0.001,0.069
area,0.2350,0.030,7.799,0.000,0.176,0.294
bathrooms,0.1964,0.022,9.114,0.000,0.154,0.239
stories,0.1178,0.018,6.643,0.000,0.083,0.153
mainroad,0.0488,0.014,3.419,0.001,0.021,0.077
guestroom,0.0301,0.014,2.207,0.028,0.003,0.057
basement,0.0239,0.011,2.179,0.030,0.002,0.045
hotwaterheating,0.0864,0.022,4.007,0.000,0.044,0.129
airconditioning,0.0666,0.011,5.870,0.000,0.044,0.089

Omnibus:,97.809,Durbin-Watson:,2.097
Prob(Omnibus):,0.0,Jarque-Bera (JB):,326.485
Skew:,1.131,Prob(JB):,1.27e-71
Kurtosis:,6.93,Cond. No.,11.1
