# Multiple Linear Regression

## Housing Case Study

### Problem Statement:

Consider a real estate company that has a dataset containing the prices of properties in the Delhi region. It wishes to use the data to optimise the sale prices of the properties based on important factors such as area, bedrooms, parking, etc.

Essentially, the company wants:

- To identify the variables affecting house prices, e.g. area, number of rooms, bathrooms, etc.

- To create a linear model that quantitatively relates house prices with variables such as number of rooms, area, number of bathrooms, etc.

- To know the accuracy of the model, i.e. how well these variables can predict house prices.

__So interpretation is important!__

The steps we will follow in this exercise are as follows:

1. Reading, understanding and visualising the data
2. Preparing the data for modelling (train-test split, rescaling, etc.)
3. Training the model
4. Residual analysis
5. Prediction and evaluation on the test set

## Step 1: Reading and Understanding the Data

Let us first import NumPy and pandas, along with the plotting and modelling libraries used later, and read the housing dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import r2_score


import warnings
warnings.filterwarnings('ignore')

In [2]:
#read the data
housing=pd.read_csv('../input/housing/Housing.csv')
housing.head()

In [3]:
housing.shape

In [4]:
housing.info()

In [5]:
housing.describe()

In [6]:
#visualize the numerical variables
sns.pairplot(housing)
plt.show()

In [7]:
#visualize the categorical variables
plt.figure(figsize=(10,8))
sns.boxplot(x='mainroad',y='price',data=housing)

In [8]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'mainroad', y = 'price', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'guestroom', y = 'price', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'basement', y = 'price', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'hotwaterheating', y = 'price', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'airconditioning', y = 'price', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'furnishingstatus', y = 'price', data = housing)
plt.show()

## Step 2: Preparing the data for modelling
- Encoding:
    - Converting binary vars to 1/0
    - Other categorical vars to dummy vars
- Splitting into train and test
- Rescaling of variables

In [9]:
housing.mainroad.value_counts()

In [10]:
# yes/no variables
varlist=['mainroad','guestroom','basement','hotwaterheating','airconditioning','prefarea']
housing[varlist]=housing[varlist].apply(lambda x: x.map({'yes':1,'no':0}))
housing[varlist].head()


## Dummy Variables

In [11]:
#creating dummy variables for furnishing status

In [12]:
status=pd.get_dummies(housing['furnishingstatus'])
status

Now, you don't need three columns. You can drop the furnished column, as the type of furnishing can be identified from just the last two columns (semi-furnished, unfurnished), where:

- 00 will correspond to furnished
- 01 will correspond to unfurnished
- 10 will correspond to semi-furnished
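
The encoding above can be checked on a small example. This is a minimal sketch on a toy Series (the four values are made up for illustration, not taken from the housing data); `dtype=int` is passed only so that the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Toy column with the three furnishing levels (illustrative values only).
furnishing = pd.Series(
    ["furnished", "semi-furnished", "unfurnished", "furnished"],
    name="furnishingstatus",
)

# drop_first=True drops the first level in sorted order ("furnished"),
# which then becomes the baseline encoded as zeros in both columns.
dummies = pd.get_dummies(furnishing, drop_first=True, dtype=int)
print(dummies)
```

Each row now reads as a (semi-furnished, unfurnished) pair, so a furnished house is the row with zeros in both columns.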

In [13]:
#creating dummy vars for furnishing status
# dropping a redundant dummy var
status=pd.get_dummies(housing['furnishingstatus'],drop_first=True)
status.head()

In [14]:
#concat the dummy df with the original one
housing=pd.concat([housing,status],axis=1)
housing.head()

In [15]:
housing=housing.drop('furnishingstatus',axis=1)
housing.head()

## Splitting into train and test

In [16]:
df_train,df_test=train_test_split(housing,train_size=0.7,random_state=100)
print(df_train.shape)
print(df_test.shape)

## Rescaling the Features

As you saw in the demonstration for Simple Linear Regression, scaling doesn't change the quality of the fit: an ordinary least squares model has the same R-squared whether or not the variables are rescaled. What scaling does change is the size of the coefficients. Here we can see that except for area, all the columns have small integer values. If the variables are not on comparable scales, some of the coefficients obtained by fitting the regression model will be very large or very small compared to the others, which makes them hard to compare and interpret at the time of model evaluation. So it is advisable to standardise or normalise the variables so that the coefficients are all on a comparable scale. As you know, there are two common ways of rescaling:

1. Min-Max scaling
2. Standardisation (mean 0, sigma 1)

This time, we will use MinMax scaling.
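
The effect of rescaling can be checked on synthetic data. The sketch below is an illustration under stated assumptions: the data are randomly generated (not the housing dataset), and `min_max` and `fit_ols` are hypothetical helper names. It fits ordinary least squares with NumPy before and after min-max scaling and compares R-squared and the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features on very different scales (illustrative values only).
area = rng.uniform(1500, 12000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
price = 500.0 * area + 300000.0 * bedrooms + rng.normal(0, 200000.0, n)

def min_max(x):
    """Min-max scaling: (x - xmin) / (xmax - xmin)."""
    return (x - x.min()) / (x.max() - x.min())

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, R-squared)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, r2

coef_raw, r2_raw = fit_ols(np.column_stack([area, bedrooms]), price)
coef_scaled, r2_scaled = fit_ols(
    np.column_stack([min_max(area), min_max(bedrooms)]), min_max(price)
)

print("R-squared raw / scaled:", r2_raw, r2_scaled)
print("raw slopes (area, bedrooms):", coef_raw[1:])
print("scaled slopes (area, bedrooms):", coef_scaled[1:])
```

In this setup the two R-squared values agree, while the raw slopes differ by orders of magnitude and the scaled slopes are of similar size, which is what makes them easier to compare.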

In [17]:
#1. Instantiate an object

scaler=MinMaxScaler()

#create a list of numeric variables
num_vars=['area','bedrooms','bathrooms','stories','parking','price']


#2. Fit on the training data and transform it
#fit(): learn xmin and xmax for each column
#transform(): compute (x - xmin) / (xmax - xmin)
#fit_transform(): do both in one call
df_train[num_vars]=scaler.fit_transform(df_train[num_vars])
df_train.head()

## Step 3: Training the model

In [18]:
#heatmap
plt.figure(figsize=(16,10))
sns.heatmap(df_train.corr(),annot=True)
plt.show()

In [19]:
# X_train, y_train
y_train=df_train.pop('price')
X_train=df_train

In [20]:
X_train.head()

In [21]:
y_train.head()

In [22]:
# add a constant
X_train_sm=sm.add_constant(X_train['area'])

#create a first model
lr=sm.OLS(y_train,X_train_sm)

#fit
lr_model=lr.fit()

#params
lr_model.params

In [23]:
lr_model.summary()

In [24]:
# add another variable
X_train_sm=X_train[['area','bathrooms']]
X_train_sm=sm.add_constant(X_train_sm)

#create a second model (area + bathrooms)
lr=sm.OLS(y_train,X_train_sm)

#fit
lr_model=lr.fit()

#params
lr_model.params

#summary
lr_model.summary()

In [25]:
# add another variable
X_train_sm=X_train[['area','bathrooms','bedrooms']]
X_train_sm=sm.add_constant(X_train_sm)

#create a third model (area + bathrooms + bedrooms)
lr=sm.OLS(y_train,X_train_sm)

#fit
lr_model=lr.fit()

#params
lr_model.params

#summary
lr_model.summary()

In [26]:
housing.columns

In [27]:
#build a model with all variables
X_train_sm=sm.add_constant(X_train)
#create the model with all variables
lr=sm.OLS(y_train,X_train_sm)

#fit
lr_model=lr.fit()

#params
lr_model.params

#summary
lr_model.summary()


In [28]:
# significance  (p-values)
# VIF

## Checking VIF

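The Variance Inflation Factor (VIF) gives a basic quantitative idea of how strongly each feature variable is correlated with the other feature variables, that is, how much multicollinearity is present. It is an important diagnostic for a multiple linear regression model, because highly correlated predictors make the individual coefficients unstable and hard to interpret.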
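
For feature i, the VIF is usually defined as 1 / (1 - R_i^2), where R_i^2 is the R-squared obtained by regressing feature i on all the other features. The sketch below computes this directly with NumPy on synthetic data; `vif_by_regression` is a hypothetical helper written for illustration. It adds an intercept to each auxiliary regression, whereas `variance_inflation_factor` uses the matrix exactly as it is passed, so the numbers need not match the table in this notebook.

```python
import numpy as np

def vif_by_regression(X):
    """Return VIF_i = 1 / (1 - R_i^2) for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Regress column i on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features: b is almost a copy of a, c is unrelated to both.
rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)

vifs = vif_by_regression(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get large VIFs, while the unrelated column stays close to 1.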

In [29]:
vif=pd.DataFrame()
vif['Features']=X_train.columns
vif['VIF']=[variance_inflation_factor(X_train.values,i) for i in range(X_train.shape[1])]
vif['VIF']=round(vif['VIF'],2)
vif=vif.sort_values(by='VIF',ascending=False)
vif

We could have:
- High p-value, high VIF
- High-low:
    - High p, low VIF: remove these first
    - Low p, high VIF: remove these after the ones above
- Low p-value, low VIF
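
One way to turn these cases into a repeatable rule is sketched below. Everything in it is an assumption made for illustration: the helper name `next_to_drop`, the cut-offs of 0.05 for the p-value and 5 for the VIF (common rules of thumb, not values stated in this notebook), the choice to drop features that fail both checks before the others, and the toy numbers.

```python
def next_to_drop(p_values, vifs, p_cut=0.05, vif_cut=5.0):
    """Pick the next feature to drop, or None if every feature passes.

    p_values and vifs are dicts keyed by feature name. The cut-offs are
    rule-of-thumb assumptions.
    """
    high_p = {f for f, p in p_values.items() if p > p_cut}
    high_vif = {f for f, v in vifs.items() if v > vif_cut}

    # Priority: fails both checks, then high p only, then high VIF only.
    candidates = [
        (high_p & high_vif, p_values),
        (high_p - high_vif, p_values),
        (high_vif - high_p, vifs),
    ]
    for group, score in candidates:
        if group:
            # Within a group, drop the worst offender first.
            return max(group, key=score.get)
    return None

# Toy values only. In a real workflow the model is refitted and the
# p-values and VIFs are recomputed after every single drop.
p_values = {"a": 0.60, "b": 0.30, "c": 0.01, "d": 0.001}
vifs = {"a": 8.0, "b": 2.0, "c": 9.0, "d": 1.2}

order = []
while True:
    feature = next_to_drop(p_values, vifs)
    if feature is None:
        break
    order.append(feature)
    p_values.pop(feature)
    vifs.pop(feature)

print(order)
```

Dropping one variable at a time matters because removing a feature changes both the p-values and the VIFs of the ones that remain.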

In [30]:
X=X_train.drop('semi-furnished',axis=1)

In [31]:
#create another model
X_train_sm=sm.add_constant(X)
#create the model without semi-furnished
lr=sm.OLS(y_train,X_train_sm)

#fit
lr_model=lr.fit()

#params
lr_model.params

#summary
lr_model.summary()


In [32]:
vif = pd.DataFrame()
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [33]:
X=X.drop('bedrooms',axis=1)

In [34]:
#create another model
X_train_sm=sm.add_constant(X)
#create the model without bedrooms
lr=sm.OLS(y_train,X_train_sm)

#fit
lr_model=lr.fit()

#params
lr_model.params

#summary
lr_model.summary()


In [35]:
# Calculate the VIFs again for the new model
vif = pd.DataFrame()
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

## Step 4: Residual Analysis

In [36]:
y_train_pred=lr_model.predict(X_train_sm)
y_train_pred

In [37]:
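res=y_train-y_train_pred
#distribution of the residuals (distplot is deprecated in recent seaborn releases)
sns.histplot(res,kde=True)
plt.show()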

## Step 5: Prediction and Evaluation on the Test Set

Apply the scaling to the test set, reusing the scaler that was fitted on the training data (transform only, no refitting).
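
The difference between fitting and transforming can be seen on a tiny example. This is a sketch with made-up one-column arrays, not the housing data: the scaler learns the minimum and maximum from the training values only, so test values outside that range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values only.
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[0.0], [6.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=1 and max=5 from train
test_scaled = scaler.transform(test)        # reuses those values, learns nothing new

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Calling `fit_transform` on the test set instead would rescale it with its own minimum and maximum, so train and test features would no longer be on the same scale and the evaluation would use information from the test data.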

In [38]:
num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking','price']

df_test[num_vars] = scaler.transform(df_test[num_vars])

In [39]:
df_test.describe()

In [40]:
y_test=df_test.pop('price')
X_test=df_test

In [41]:
#add a constant
X_test_sm=sm.add_constant(X_test)
X_test_sm.head()

In [42]:
X_test_sm=X_test_sm.drop(["bedrooms","semi-furnished"],axis=1)

In [43]:
#predictions
y_test_pred=lr_model.predict(X_test_sm)



In [44]:
#evaluate 
r2_score(y_true=y_test,y_pred=y_test_pred)
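
The value returned by `r2_score` is the coefficient of determination, 1 - SS_res / SS_tot. A minimal NumPy sketch of that formula, using made-up numbers and a hypothetical helper name `r_squared`, looks like this:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

score = r_squared(y_true, y_pred)
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # always predicting the mean

print(score, baseline)
```

Predicting the mean for every row scores 0, and on a test set the score can even be negative if the model does worse than that baseline.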

Overall we have a decent model, but we also acknowledge that we could do better. The fitted relationship (intercept omitted, numeric variables min-max scaled) is:

price = 0.236 × area + 0.202 × bathrooms + 0.11 × stories + 0.05 × mainroad + 0.04 × guestroom + 0.0876 × hotwaterheating + 0.0682 × airconditioning + 0.0629 × parking + 0.0637 × prefarea - 0.0337 × unfurnished

We have a couple of options:

1. Add new features (bathrooms/bedrooms, area/stories, etc.)
2. Build a non-linear model
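
As a rough sketch of the first option, ratio features can be derived with plain column arithmetic. The toy DataFrame and the new column names `bath_per_bed` and `area_per_story` are assumptions made for illustration.

```python
import pandas as pd

# Toy rows on the original (unscaled) units; values are illustrative only.
toy = pd.DataFrame({
    "area": [3000, 4500, 6000],
    "bedrooms": [2, 3, 4],
    "bathrooms": [1, 2, 2],
    "stories": [1, 2, 3],
})

# Derived ratio features.
toy["bath_per_bed"] = toy["bathrooms"] / toy["bedrooms"]
toy["area_per_story"] = toy["area"] / toy["stories"]

print(toy)
```

Ratios like these are best computed before rescaling, because a min-max scaled column contains zeros and dividing by it would produce infinite or undefined values.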