### Multiple Linear Regression Introduction

In this notebook (and following quizzes), you will be creating a few simple linear regression models, as well as a multiple linear regression model, to predict home value.

Let's get started by importing the necessary libraries and reading in the data you will be using.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('./house_prices.csv')
df.head()



Unnamed: 0,house_id,neighborhood,area,bedrooms,bathrooms,style,price
0,1112,B,1188,3,2,ranch,598291
1,491,B,3512,5,3,victorian,1744259
2,5952,B,1134,3,2,ranch,571669
3,3525,A,1940,4,2,ranch,493675
4,5108,B,2208,6,4,victorian,1101539


`1.` Using statsmodels, fit three individual simple linear regression models to predict price.  You should have a model that uses **area**, another using **bedrooms**, and a final one using **bathrooms**.  You will also want to use an intercept in each of your three models.

Use the results from each of your models to answer the first two quiz questions below.

In [7]:
# Fitting a simple linear regression model for area
X_area = df['area']
y = df['price']
X_area = sm.add_constant(X_area) # add an intercept
model_area = sm.OLS(y, X_area).fit()
print(model_area.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.678
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                 1.269e+04
Date:                Thu, 01 Jun 2023   Prob (F-statistic):               0.00
Time:                        08:59:44   Log-Likelihood:                -84517.
No. Observations:                6028   AIC:                         1.690e+05
Df Residuals:                    6026   BIC:                         1.691e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       9587.8878   7637.479      1.255      0.2

In [8]:
# Fitting a simple linear regression model for bedrooms
X_bedrooms = df['bedrooms']
X_bedrooms = sm.add_constant(X_bedrooms) # add an intercept
model_bedrooms = sm.OLS(y, X_bedrooms).fit()
print(model_bedrooms.summary())


                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.553
Model:                            OLS   Adj. R-squared:                  0.553
Method:                 Least Squares   F-statistic:                     7446.
Date:                Thu, 01 Jun 2023   Prob (F-statistic):               0.00
Time:                        08:59:57   Log-Likelihood:                -85509.
No. Observations:                6028   AIC:                         1.710e+05
Df Residuals:                    6026   BIC:                         1.710e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -9.485e+04   1.08e+04     -8.762      0.0

In [9]:
# Fitting a simple linear regression model for bathrooms
X_bathrooms = df['bathrooms']
X_bathrooms = sm.add_constant(X_bathrooms) # add an intercept
model_bathrooms = sm.OLS(y, X_bathrooms).fit()
print(model_bathrooms.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.541
Model:                            OLS   Adj. R-squared:                  0.541
Method:                 Least Squares   F-statistic:                     7116.
Date:                Thu, 01 Jun 2023   Prob (F-statistic):               0.00
Time:                        09:00:02   Log-Likelihood:                -85583.
No. Observations:                6028   AIC:                         1.712e+05
Df Residuals:                    6026   BIC:                         1.712e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4.314e+04   9587.189      4.500      0.0

>Based on the three simple linear regression models, each variable (area, bedrooms, and bathrooms) is statistically significant for predicting price when used on its own.

`2.` Now that you have looked at the results from the simple linear regression models, let's try a multiple linear regression model using all three of these variables at the same time.  You will still want an intercept in this model.

In [10]:
df['intercept'] = 1  # column of ones so the model estimates an intercept

In [11]:
# Fit price on the intercept column plus all three predictors
lm = sm.OLS(df['price'], df[['intercept', 'bathrooms', 'bedrooms', 'area']])
results = lm.fit()
results.summary()

Dep. Variable:,price,R-squared:,0.678
Model:,OLS,Adj. R-squared:,0.678
Method:,Least Squares,F-statistic:,4230.0
Date:,"Thu, 01 Jun 2023",Prob (F-statistic):,0.0
Time:,09:01:25,Log-Likelihood:,-84517.0
No. Observations:,6028,AIC:,169000.0
Df Residuals:,6024,BIC:,169100.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
intercept,1.007e+04,1.04e+04,0.972,0.331,-1.02e+04,3.04e+04
bathrooms,7345.3917,1.43e+04,0.515,0.607,-2.06e+04,3.53e+04
bedrooms,-2925.8063,1.03e+04,-0.285,0.775,-2.3e+04,1.72e+04
area,345.9110,7.227,47.863,0.000,331.743,360.079

Omnibus:,367.658,Durbin-Watson:,2.007
Prob(Omnibus):,0.0,Jarque-Bera (JB):,350.116
Skew:,0.536,Prob(JB):,9.4e-77
Kurtosis:,2.503,Cond. No.,11600.0
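
To make the coefficient table concrete, the sketch below plugs the first row of `df.head()` into the fitted equation. The coefficients are copied from the summary above (the intercept is only reported to four significant figures, so the result is approximate); the names `coef`, `house`, `predicted`, and `residual` are hypothetical.

```python
# Coefficients copied from the multiple regression summary
coef = {"intercept": 1.007e+04, "bathrooms": 7345.3917,
        "bedrooms": -2925.8063, "area": 345.9110}

# First row of df.head(): area 1188, 3 bedrooms, 2 bathrooms, price 598291
house = {"intercept": 1, "bathrooms": 2, "bedrooms": 3, "area": 1188}

# Prediction = sum of coefficient * value over all terms
predicted = sum(coef[name] * house[name] for name in coef)
residual = 598291 - predicted  # actual minus predicted

print(f"predicted price: {predicted:,.0f}")
print(f"residual (actual - predicted): {residual:,.0f}")
```

The prediction comes out at roughly 427,000, well below the recorded price of 598,291. Nearly all of it comes from the area term, which matches area being the only coefficient with a small p-value.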


In [12]:
# Another way to fit the multiple linear regression model with all three predictors:
# let sm.add_constant supply the intercept instead of a hand-made column of ones
X = df[['area', 'bedrooms', 'bathrooms']]
X = sm.add_constant(X) # add an intercept
model_multiple = sm.OLS(y, X).fit()
print(model_multiple.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.678
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                     4230.
Date:                Thu, 01 Jun 2023   Prob (F-statistic):               0.00
Time:                        09:02:29   Log-Likelihood:                -84517.
No. Observations:                6028   AIC:                         1.690e+05
Df Residuals:                    6024   BIC:                         1.691e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.007e+04   1.04e+04      0.972      0.3

`3.` Along with using the **area**, **bedrooms**, and **bathrooms** you might also want to use **style** to predict the price.  Try adding this to your multiple linear regression model.  What happens?  Use the final quiz below to provide your answer.

In [17]:
# Converting the categorical style variable into dummy variables
style_dummies = pd.get_dummies(df['style'], prefix='style')
style_dummies.head()

Unnamed: 0,style_lodge,style_ranch,style_victorian
0,0,1,0
1,0,0,1
2,0,1,0
3,0,1,0
4,0,0,1
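
Only two of the three dummy columns should enter a model that already has an intercept. Each row's dummies sum to 1, so including all three would duplicate the constant column. The sketch below shows this on a tiny made-up `styles` series (hypothetical data) and uses `drop_first=True` as one way to drop the baseline level.

```python
import pandas as pd

# Tiny made-up example with the same three style levels
styles = pd.Series(
    ["ranch", "victorian", "ranch", "ranch", "victorian", "lodge"],
    name="style",
)

# One column per level: every row sums to 1
all_levels = pd.get_dummies(styles, prefix="style", dtype=int)

# Drop the first level (alphabetically, "lodge") to use it as the baseline
baseline_dropped = pd.get_dummies(styles, prefix="style",
                                  drop_first=True, dtype=int)

print(all_levels.sum(axis=1).unique())
print(list(baseline_dropped.columns))
```

With `lodge` as the baseline, the `style_ranch` and `style_victorian` coefficients are read as differences from a lodge with the same area, bedrooms, and bathrooms.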


In [18]:
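# Concatenating the dummy variables with the original dataset
# Run this cell once: re-running it appends the dummy columns again,
# which is where the duplicated `.1` columns in the output come from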
df = pd.concat([df, style_dummies], axis=1)
df.head()

Unnamed: 0,house_id,neighborhood,area,bedrooms,bathrooms,style,price,intercept,style_lodge,style_ranch,style_victorian,style_lodge.1,style_ranch.1,style_victorian.1
0,1112,B,1188,3,2,ranch,598291,1,0,1,0,0,1,0
1,491,B,3512,5,3,victorian,1744259,1,0,0,1,0,0,1
2,5952,B,1134,3,2,ranch,571669,1,0,1,0,0,1,0
3,3525,A,1940,4,2,ranch,493675,1,0,1,0,0,1,0
4,5108,B,2208,6,4,victorian,1101539,1,0,0,1,0,0,1


In [16]:
# Fitting a multiple linear regression model using all four predictor variables
# (style enters as two dummy columns; style_lodge is left out as the baseline)
X = df[['area', 'bedrooms', 'bathrooms', 'style_ranch', 'style_victorian']]
X = sm.add_constant(X) # add an intercept
model_multiple_style = sm.OLS(y, X).fit()
print(model_multiple_style.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.678
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                     2538.
Date:                Thu, 01 Jun 2023   Prob (F-statistic):               0.00
Time:                        09:05:49   Log-Likelihood:                -84516.
No. Observations:                6028   AIC:                         1.690e+05
Df Residuals:                    6022   BIC:                         1.691e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            7876.6441   1.11e+04     

>In the multiple linear regression models, only area is statistically significant; bedrooms and bathrooms, which were significant on their own, are not once area is included.
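
One possible explanation, not verified here against `house_prices.csv`, is that bedrooms and bathrooms are strongly correlated with area, so they add little information once area is in the model. A correlation matrix of the predictors is a quick first check. The sketch below runs on synthetic data built to be correlated (an assumption made purely for illustration); on the real data the equivalent call would be `df[['area', 'bedrooms', 'bathrooms']].corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic predictors, deliberately generated to move together with area
rng = np.random.default_rng(2)
n = 1000
area = rng.uniform(800, 4000, n)
bedrooms = np.round(area / 600 + rng.normal(0, 0.6, n)).clip(1, 8)
bathrooms = np.round(bedrooms / 1.5 + rng.normal(0, 0.4, n)).clip(1, 6)
demo = pd.DataFrame({"area": area, "bedrooms": bedrooms,
                     "bathrooms": bathrooms})

# Pairwise Pearson correlations between the predictors
corr = demo.corr()
print(corr.round(2))
```

Correlations close to 1 between predictors would be consistent with the large standard errors on the bedrooms and bathrooms coefficients in the multiple regression summary.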