# Multiple Linear Regression - Micro Project

# Client Requirement
#### 1. Predict the price of a house with high confidence (95% confidence)
#### 2. Adjust for price inflation by multiplying the values by 30
#### 3. The house should have 4 rooms
#### 4. The area where the house is located should have a school with a PT (pupil-teacher) ratio of 10
#### 5. The house should be near the Charles River
#### 6. Inputs must be given from the keyboard
#### 7. Negative and zero values for the number of rooms are not allowed
#### 8. Negative and zero values for the PT ratio are not allowed
#### 9. Charles River is a yes/no choice: set yes = 1 and no = 0

# Data Preprocessing

#### Data preprocessing is a data mining technique that transforms raw data into an understandable format. Raw, real-world data is often incomplete or inconsistent, and passing it to a model as-is can cause errors or misleading results. That is why the data is preprocessed before it is sent through a model.

Steps in Data Preprocessing:
1. Import libraries
2. Import dataset
3. Check for missing values
4. Encoding categorical data
5. Data splitting
6. Feature Scaling

# 1. Importing Libraries

In [1]:
#Importing the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
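# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires scikit-learn < 1.2
from sklearn.datasets import load_boston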
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 2. Importing Dataset

In [2]:
# 2. Importing Dataset
boston_dataset = load_boston()

In [3]:
data = pd.DataFrame(data= boston_dataset.data, columns= boston_dataset.feature_names)
data.shape

(506, 13)
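
Because `load_boston` is missing from newer scikit-learn releases, the same 506 x 13 table may have to be rebuilt from the raw text file. The following is a minimal sketch, not the notebook's original method: the helper name `load_boston_raw` is hypothetical, and the 22-line preamble and the two-lines-per-record layout (11 values, then 3) are assumptions based on the scikit-learn deprecation notice. The demonstration parses only the first record, so it runs offline.

```python
import io

import numpy as np
import pandas as pd

# Column names as listed by data.info() in this notebook
FEATURE_NAMES = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']


def load_boston_raw(source, skiprows=22):
    """Hypothetical helper: parse the two-lines-per-record raw Boston file."""
    raw = pd.read_csv(source, sep=r"\s+", skiprows=skiprows, header=None)
    values = raw.values
    # Even rows hold the first 11 features; odd rows hold B, LSTAT and the target
    features = np.hstack([values[::2, :], values[1::2, :2]])
    target = values[1::2, 2]
    frame = pd.DataFrame(features, columns=FEATURE_NAMES)
    frame['PRICE'] = target
    return frame


# Offline demonstration using the first record shown by data.head()
sample = io.StringIO(
    "0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30\n"
    "396.90 4.98 24.00\n"
)
demo = load_boston_raw(sample, skiprows=0)
print(demo)
```

With network access, `source` could instead be the URL named in the deprecation notice (assumed to be `http://lib.stat.cmu.edu/datasets/boston`), keeping the default `skiprows=22`.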

In [4]:
data['PRICE'] = boston_dataset.target
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  PRICE    506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB


In [5]:
data.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


In [6]:
data.head()

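| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |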


In [7]:
data.tail()

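| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.12 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.9 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.9 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.03 | 80.8 | 2.505 | 1.0 | 273.0 | 21.0 | 396.9 | 7.88 | 11.9 |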


# 3. Checking for Missing Values

In [8]:
data.isnull()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 502 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 503 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 504 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |


In [9]:
data.isnull().any()

CRIM       False
ZN         False
INDUS      False
CHAS       False
NOX        False
RM         False
AGE        False
DIS        False
RAD        False
TAX        False
PTRATIO    False
B          False
LSTAT      False
PRICE      False
dtype: bool

In [10]:
data.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
PRICE      0
dtype: int64

# 4. Encoding Categorical Data
#### No encoding is needed: every column is already numeric, and CHAS is already a 0/1 indicator.

# 5. Data splitting

In [11]:
features = data.drop(['PRICE'], axis=1)

prices = data['PRICE']

# Data Normalization (Log Transform of PRICE)

In [12]:
prices_log = np.log(data['PRICE'])

In [13]:
X_train, X_test, y_train, y_test = train_test_split(features, prices_log, test_size= 0.20, random_state= 42)

In [14]:
print("X_train Size :",len(X_train))
print("Y_train Size :",len(y_train))
print("X_test Size :",len(X_test))
print("Y_test Size :",len(y_test))
print("Train Size (%) :", (len(X_train)/len(features))*100)
print("Test Size (%) :", (len(X_test)/len(features))*100)

X_train Size : 404
Y_train Size : 404
X_test Size : 102
Y_test Size : 102
Train Size (%) : 79.84189723320159
Test Size (%) : 20.158102766798418
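
The split is not exactly 80/20 because 506 rows cannot be divided that way. A small worked check, assuming `train_test_split` rounds the test share up and gives the remainder to the training set, reproduces the sizes printed above:

```python
import math

n_samples = 506      # rows in the dataset
test_size = 0.20     # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print('train rows:', n_train, 'test rows:', n_test)
print('train share (%):', 100 * n_train / n_samples)
```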


In [15]:
# Multiple Linear Regression

regr = LinearRegression()
regr.fit(X_train, y_train)

LinearRegression()

In [16]:
coefficients = pd.DataFrame(index= X_train.columns, data= regr.coef_, columns= ['Coefficient'])
coefficients

| | Coefficient |
|---|---|
| CRIM | -0.009679 |
| ZN | 0.000757 |
| INDUS | 0.003057 |
| CHAS | 0.096207 |
| NOX | -0.727261 |
| RM | 0.113095 |
| AGE | -0.000139 |
| DIS | -0.048944 |
| RAD | 0.011139 |
| TAX | -0.000505 |


In [17]:
print('Intercept :', regr.intercept_)

Intercept : 3.840920309917581
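
Since the model is fitted on `log(PRICE)`, each coefficient acts multiplicatively on the price: a one-unit increase in a feature multiplies the predicted price by `exp(coefficient)`, holding the other features fixed. The sketch below converts a few of the coefficients listed above into percentage changes; the helper name `percent_change` is illustrative only.

```python
import math

# Coefficients copied from the coefficient table of the log-price model
coefficients = {'CHAS': 0.096207, 'RM': 0.113095, 'NOX': -0.727261}


def percent_change(coef):
    """Percentage change in predicted price for a one-unit increase in the feature."""
    return (math.exp(coef) - 1.0) * 100.0


for name, coef in coefficients.items():
    print(f'{name}: {percent_change(coef):+.1f}% per unit')
```

A full unit of NOX is far outside the observed range (0.385 to 0.871), so its percentage should be read per small increment rather than per whole unit.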


# Price Prediction

In [18]:
y_hat = regr.predict(X_test)

In [19]:
print('r2_score :',r2_score(y_test, y_hat))

r2_score : 0.7462724975382733
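
The score is the share of variance in the log-prices that the model explains: R² = 1 - SS_res / SS_tot. A minimal NumPy sketch of that formula on made-up values (the function name `r_squared` and the sample numbers are illustrative):

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y = [3.0, 3.2, 3.5, 3.1]                          # made-up log-prices
perfect = r_squared(y, y)                         # exact predictions
baseline = r_squared(y, [np.mean(y)] * len(y))    # always predicting the mean
print(perfect, baseline)
```

Predicting the mean for every row scores 0, so the value reported above means the model removes roughly three quarters of the variance that a mean-only prediction leaves on the test set.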


# Statsmodels API: Model 1

In [20]:
X_incl_const_1 = sm.add_constant(X_train)

model_1 = sm.OLS(y_train, X_incl_const_1)
results_1 = model_1.fit()
print(results_1.summary())

                            OLS Regression Results                            
Dep. Variable:                  PRICE   R-squared:                       0.796
Model:                            OLS   Adj. R-squared:                  0.789
Method:                 Least Squares   F-statistic:                     116.9
Date:                Thu, 05 Nov 2020   Prob (F-statistic):          1.35e-125
Time:                        00:04:08   Log-Likelihood:                 106.78
No. Observations:                 404   AIC:                            -185.6
Df Residuals:                     390   BIC:                            -129.5
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.8409      0.227     16.943      0.0

# Variance Inflation Factor 1

In [21]:
vif_1 = [] #empty list                     

for i in range (X_incl_const_1.shape[1]):
    
    vif_1.append(variance_inflation_factor(exog= X_incl_const_1.values, exog_idx=i))

vif_1

[580.7472632659341,
 1.7131869906128505,
 2.465630718663123,
 3.8778553502602815,
 1.0966737120634569,
 4.469150159170631,
 1.9478087495837588,
 2.9899478376482787,
 4.16857837354429,
 7.658315779148442,
 8.943301431814218,
 1.851448407067042,
 1.3251213980906684,
 2.818045379538575]
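
The first entry in the list belongs to the added constant column and is not a sign of collinearity among the features. For a feature, the VIF is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The following NumPy sketch illustrates the definition on synthetic data; the function name `vif` and the data are made up, and the function expects a feature matrix without the constant column because it adds its own intercept.

```python
import numpy as np


def vif(X, idx):
    """VIF of column idx: regress it on the other columns plus an intercept."""
    y = X[:, idx]
    others = np.delete(X, idx, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                # unrelated to x1
x3 = x1 + 0.01 * rng.normal(size=200)    # almost a copy of x1
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, i) for i in range(X.shape[1])]
print(vifs)
```

The nearly duplicated columns get very large values, while the independent column stays close to 1.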

# Model 2 by Dropping Features from Model 1 Using P_values > 0.05

In [22]:
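X_incl_const_2 = sm.add_constant(X_train)

P_values = round(results_1.pvalues, 3)

# Drop every column whose p-value is above 0.05 (not significant at the 5% level).
# Selecting by label avoids positional indexing on a label-indexed Series.
insignificant = P_values[P_values > 0.05].index
X_incl_const_2 = X_incl_const_2.drop(columns=insignificant)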

model_2 = sm.OLS(y_train, X_incl_const_2)
results_2 = model_2.fit()

print(results_2.summary())

                            OLS Regression Results                            
Dep. Variable:                  PRICE   R-squared:                       0.794
Model:                            OLS   Adj. R-squared:                  0.789
Method:                 Least Squares   F-statistic:                     151.9
Date:                Thu, 05 Nov 2020   Prob (F-statistic):          2.66e-128
Time:                        00:04:09   Log-Likelihood:                 105.49
No. Observations:                 404   AIC:                            -189.0
Df Residuals:                     393   BIC:                            -145.0
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.8528      0.226     17.070      0.0

# Variance Inflation Factor 2

In [23]:
vif_2 = [] #empty list                          

for i in range (X_incl_const_2.shape[1]):
    
    vif_2.append(variance_inflation_factor(exog= X_incl_const_2.values, exog_idx=i))

vif_2

[576.4258577750066,
 1.6995063324698798,
 1.0833550411756299,
 3.7973870197322683,
 1.8066696516255714,
 2.619209863656011,
 7.090392330230745,
 7.078572054426567,
 1.5665696655442751,
 1.318601128410779,
 2.551020818165935]

# Model 3 by Dropping Features from Model 2 Using VIF > 10

In [24]:
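# Collect the columns with VIF > 10 first, then drop them in one step, so the
# positions in vif_2 stay aligned with the column names.
# Only the constant exceeds 10 here; it is dropped and added back in the next cell,
# which is why Model 3 ends up identical to Model 2.
high_vif = [col for col, vif in zip(X_incl_const_2.columns, vif_2) if vif > 10]
X_incl_const_2 = X_incl_const_2.drop(columns=high_vif)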

In [25]:
X_incl_const_3 = sm.add_constant(X_incl_const_2)

model_3 = sm.OLS(y_train, X_incl_const_3)
results_3 = model_3.fit()

print(results_3.summary())

                            OLS Regression Results                            
Dep. Variable:                  PRICE   R-squared:                       0.794
Model:                            OLS   Adj. R-squared:                  0.789
Method:                 Least Squares   F-statistic:                     151.9
Date:                Thu, 05 Nov 2020   Prob (F-statistic):          2.66e-128
Time:                        00:04:09   Log-Likelihood:                 105.49
No. Observations:                 404   AIC:                            -189.0
Df Residuals:                     393   BIC:                            -145.0
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.8528      0.226     17.070      0.0

# Variance Inflation Factor 3

In [26]:
vif_3=[] #empty list                        

for i in range (X_incl_const_3.shape[1]):
    
    vif_3.append(variance_inflation_factor(exog= X_incl_const_3.values, exog_idx=i))

vif_3

[576.4258577750066,
 1.6995063324698798,
 1.0833550411756299,
 3.7973870197322683,
 1.8066696516255714,
 2.619209863656011,
 7.090392330230745,
 7.078572054426567,
 1.5665696655442751,
 1.318601128410779,
 2.551020818165935]

## Prediction Using Optimized Model 3

In [27]:
# Keep only the features that Model 3 was trained on; the constant is added back below

X_test = X_test[X_incl_const_2.columns]    
Pred_y_test = results_3.predict(sm.add_constant(X_test))

In [28]:
print('Optimised r2_score :', r2_score(y_test, Pred_y_test))

Optimised r2_score : 0.7430764959352337


### Enter the Data to Predict the price

In [29]:
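PT_ratio = round(float(input('Enter required PT_RATIO :')),1)
Rooms    = round(float(input('Enter required no.of Rooms :')),3)
Chas     = input('Should the tract bound the Charles River? [yes/no] : ').strip().lower()

# Requirements 7 and 8: zero or negative values are not allowed
if PT_ratio <= 0 or Rooms <= 0:
    raise ValueError('PT_RATIO and no.of Rooms must be greater than zero')

# Requirement 9: yes = 1, no = 0
if(Chas=='yes'):
    chas=1.0    
elif(Chas=='no'):
    chas=0.0  
else:
    raise ValueError('Wrong input Chas: Enter yes or no')

Enter required PT_RATIO : 10
Enter required no.of Rooms : 4
Should the tract bound the Charles River? [yes/no] :  no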
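
Requirements 7 to 9 can also be enforced by re-prompting instead of stopping at the first bad value. The sketch below is one possible arrangement, not part of the original notebook; the helper names `parse_positive`, `parse_yes_no` and `ask` are hypothetical.

```python
def parse_positive(text, name):
    """Convert text to a float and reject zero, negative and NaN values."""
    value = float(text)
    if not value > 0:
        raise ValueError(f'{name} must be greater than zero, got {value}')
    return value


def parse_yes_no(text):
    """Map yes/no (any case, surrounding spaces ignored) to 1.0/0.0."""
    answer = text.strip().lower()
    if answer not in ('yes', 'no'):
        raise ValueError('Enter yes or no')
    return 1.0 if answer == 'yes' else 0.0


def ask(prompt, parser):
    """Keep prompting on the keyboard until the parser accepts the answer."""
    while True:
        try:
            return parser(input(prompt))
        except ValueError as err:
            print('Wrong input:', err)


# The parsers can be checked without typing anything
rooms = parse_positive('4', 'Rooms')
chas_flag = parse_yes_no(' No ')
print(rooms, chas_flag)
```

In the notebook this could be called as `ask('Enter required no.of Rooms :', lambda t: parse_positive(t, 'Rooms'))`.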


### Using the Mean Values for Other Features

In [30]:
data_3 = X_incl_const_3

X_new=[]
for i in range (data_3.shape[1]):
    if(data_3.columns[i]=='PTRATIO'):
        X_new.append(PT_ratio)
    elif(data_3.columns[i]=='RM'):
        X_new.append(Rooms)
    elif(data_3.columns[i]=='CHAS'):
        X_new.append(chas)
    else:
        X_new.append(data_3[data_3.columns[i]].mean())

### Predicting Your Output

In [31]:
Predicted_Log_Price = results_3.predict(X_new)[0]

print(f'Predicted Price for Your Requirements : ${round((30000*np.e**Predicted_Log_Price),3)}')

Predicted Price for Your Requirements : $666827.573
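
The factor 30000 combines two steps: the model predicts the logarithm of a price recorded in thousands of dollars (an assumption about the dataset's units that the notebook does not state explicitly), and requirement 2 multiplies the result by 30. A short sketch of that back-transformation, with a hypothetical helper name `to_dollars`:

```python
import math


def to_dollars(log_price, inflation=30, unit=1000):
    """Undo the log transform, convert from $1000s, then apply the inflation factor."""
    return inflation * unit * math.exp(log_price)


# A recorded PRICE of 24.0 (first row of the dataset) maps to 24.0 * 1000 * 30
example = to_dollars(math.log(24.0))
print(round(example, 3))
```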


### Price Range With 95% Confidence

In [32]:
# Approximate 95% range: prediction +/- 2 residual standard deviations in log space.
# This ignores the uncertainty of the fitted coefficients, so it is a rough estimate.
upper_bound = 30000*np.e**(Predicted_Log_Price + 2*np.sqrt(results_3.mse_resid))
lower_bound = 30000*np.e**(Predicted_Log_Price - 2*np.sqrt(results_3.mse_resid))


print(f'With about 95% confidence, the price of the house ranges from ${round(lower_bound,3)} to ${round(upper_bound,3)}')

With about 95% confidence, the price of the house ranges from $456970.788 to $973057.85
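
The bounds are symmetric around the prediction in log space, so in dollars the upper bound lies further from the point estimate than the lower one. The sketch below shows the arithmetic on made-up inputs; the function name `price_interval` and the values `math.log(20.0)` and `rmse=0.19` are illustrative, not results from the model.

```python
import math


def price_interval(log_prediction, rmse, z=2.0, scale=30000):
    """Return (lower, point, upper) in dollars for a log-space prediction."""
    point = scale * math.exp(log_prediction)
    lower = scale * math.exp(log_prediction - z * rmse)
    upper = scale * math.exp(log_prediction + z * rmse)
    return lower, point, upper


lower, point, upper = price_interval(math.log(20.0), rmse=0.19)
print(round(lower, 3), round(point, 3), round(upper, 3))
print('distance below:', round(point - lower, 3), 'distance above:', round(upper - point, 3))
```

Passing `z=1.96` gives the conventional normal quantile for 95%; the value 2 used above is a common rounding.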
