In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('data/df.csv', delimiter = ',')
df.head()

Unnamed: 0,OverallQual,YearBuilt,ExterQual,BsmtQual,TotalBsmtSF,GrLivArea,FullBath,KitchenQual,GarageCars,OverallGrade,SalePrice
0,7.0,2003.0,4.0,4.0,856.0,1710.0,2.0,4.0,2.0,35.0,208500
1,6.0,1976.0,3.0,4.0,1262.0,1262.0,2.0,3.0,2.0,48.0,181500
2,7.0,2001.0,4.0,4.0,920.0,1786.0,2.0,4.0,2.0,35.0,223500
3,7.0,1915.0,3.0,3.0,756.0,1717.0,1.0,4.0,3.0,35.0,140000
4,8.0,2000.0,4.0,4.0,1145.0,2198.0,2.0,4.0,3.0,40.0,250000


In [19]:
X = df.drop(columns='SalePrice')
X.head(2)

Unnamed: 0,OverallQual,YearBuilt,ExterQual,BsmtQual,TotalBsmtSF,GrLivArea,FullBath,KitchenQual,GarageCars,OverallGrade
0,7.0,2003.0,4.0,4.0,856.0,1710.0,2.0,4.0,2.0,35.0
1,6.0,1976.0,3.0,4.0,1262.0,1262.0,2.0,3.0,2.0,48.0


In [9]:
y = df['SalePrice']
y.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

Let's model the sale price with linear regression.

Python has two main libraries for this: `statsmodels` and `sklearn`.

In [10]:
import statsmodels.api as sm

We have to add a constant column to our predictors so that the model also estimates an intercept. `sm.OLS` does not add one by default; if we skip this step, the intercept is fixed at 0.
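To see why the constant matters, here is a minimal sketch with plain NumPy on made-up data. The arrays `x` and `y_demo` are illustrative assumptions and are not taken from the housing dataset. A column of ones plays the same role as the constant added for statsmodels.

```python
import numpy as np

# Made-up data that follows y = 5 + 2x exactly (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = 5.0 + 2.0 * x

# Without a constant column the fitted line is forced through the origin.
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y_demo, rcond=None)

# With a leading column of ones the intercept is estimated as well.
X_with_const = np.column_stack([np.ones_like(x), x])
intercept_and_slope, *_ = np.linalg.lstsq(X_with_const, y_demo, rcond=None)

print(slope_only)           # a single slope, distorted by the missing intercept
print(intercept_and_slope)  # approximately [5., 2.]
```

The slope from the first fit is larger than 2 because it has to absorb the intercept that the model was not allowed to estimate.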

In [11]:
X_const = sm.add_constant(X) # adding a constant; X itself stays unchanged for sklearn later

Now, we can create a Python object that will represent linear regression:

In [12]:
lin_reg = sm.OLS(y, X_const)

If we call `type(lin_reg)`, we will see that the object is an ordinary least squares (OLS) linear model.

As the next step, we fit the model to our data and print its summary:

In [13]:
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.819
Model:                            OLS   Adj. R-squared:                  0.818
Method:                 Least Squares   F-statistic:                     657.2
Date:                Wed, 02 Dec 2020   Prob (F-statistic):               0.00
Time:                        13:08:15   Log-Likelihood:                -17283.
No. Observations:                1459   AIC:                         3.459e+04
Df Residuals:                    1448   BIC:                         3.465e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         -8.58e+05   9.33e+04     -9.197   

We can see there is a lot of statistical information here.

Let's try the same with sklearn.

In [14]:
from sklearn.linear_model import LinearRegression

In [20]:
regressor = LinearRegression()
regressor.fit(X, y)

LinearRegression()

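The output echoes the fitted estimator. Older versions of sklearn print every parameter here; recent versions show only the parameters that differ from their defaults, hence the bare `LinearRegression()`.

The most important parameter is `fit_intercept`. In sklearn, we don't have to add a constant to the dataset: `fit_intercept` defaults to `True`, so the intercept is estimated automatically. We would set it to `False` only if we wanted to force the fit through the origin.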
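The effect of `fit_intercept` can be sketched on made-up data. The arrays below are illustrative assumptions, not the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3 + 2x exactly (illustrative only).
x = np.arange(1.0, 6.0).reshape(-1, 1)
y_demo = 3.0 + 2.0 * x.ravel()

# fit_intercept=True is the default, so no constant column is needed.
with_intercept = LinearRegression().fit(x, y_demo)

# fit_intercept=False forces the line through the origin.
through_origin = LinearRegression(fit_intercept=False).fit(x, y_demo)

print(with_intercept.intercept_, with_intercept.coef_)
print(through_origin.intercept_, through_origin.coef_)
```

With the intercept enabled the fit recovers the intercept 3 and the slope 2; without it the intercept is 0 and the slope is inflated to compensate.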

We can check the beta coefficients now (the intercept is stored separately in `regressor.intercept_`):

In [21]:
print(regressor.coef_)

[  6029.68368939    372.71507605  13762.20724747   1501.16506653
     39.24475291     62.64457489 -10053.24395591  11634.9228926
  10224.9534246    1069.75813647]


In [22]:
regressor.score(X,y)

0.8194412843036943

We now have a fitted model. The score is the coefficient of determination R² on the data we passed in; it matches the R-squared value (not the adjusted one) in the statsmodels summary. But if we wanted to see how close the predictions are to our real data, how could we do this? How do predictions look compared to the real data?
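The value returned by `score` matches the `R-squared` line of the statsmodels summary rather than `Adj. R-squared`. As a quick check, the adjusted value can be derived from the plain one with the standard formula for a model with an intercept. The helper `adjusted_r2` below is a hypothetical name introduced for illustration; the inputs are the figures reported above.

```python
def adjusted_r2(r2: float, n_obs: int, n_features: int) -> float:
    """Adjusted R-squared from plain R-squared, for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Values reported above: score 0.8194..., 1459 observations, 10 predictors.
r2 = 0.8194412843036943
adj = adjusted_r2(r2, n_obs=1459, n_features=10)

print(round(r2, 3), round(adj, 3))
```

Rounded to three decimals, the two values agree with the 0.819 and 0.818 shown in the OLS summary.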

In [46]:
prices = regressor.predict(X)

In [47]:
# Predictions go into a 'SalesPrice' column; the actual prices are in 'SalePrice'.
y_df = pd.DataFrame(prices, columns=['SalesPrice'])
df_with_predictions = df.merge(y_df, left_index=True, right_index=True)

In [48]:
df_with_predictions[['SalePrice', 'SalesPrice']]

Unnamed: 0,SalePrice,SalesPrice
0,208500,216839.069326
1,181500,177124.404346
2,223500,223366.291052
3,140000,185569.004434
4,250000,279236.638032
...,...,...
1454,175000,178432.737324
1455,210000,226826.017912
1456,266500,276802.580231
1457,142125,133362.894266
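One way to put a number on how close the predictions are is to look at the residuals (actual minus predicted). A minimal sketch, using only the first five rows displayed in the table above; the column names `residual` and `abs_error` are introduced here for illustration.

```python
import pandas as pd

# The first five rows shown in the table above (actual vs. predicted).
comparison = pd.DataFrame({
    'SalePrice': [208500, 181500, 223500, 140000, 250000],
    'SalesPrice': [216839.069326, 177124.404346, 223366.291052,
                   185569.004434, 279236.638032],
})

# Residual = actual price minus predicted price.
comparison['residual'] = comparison['SalePrice'] - comparison['SalesPrice']
comparison['abs_error'] = comparison['residual'].abs()

print(comparison)
print('Mean absolute error on these rows:', comparison['abs_error'].mean())
print('Row with the largest error:', comparison['abs_error'].idxmax())
```

The same two lines could be applied to the full `df_with_predictions` frame to summarise the error over all houses.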


What if we had information on new houses coming in? We probably won't know the real price of these houses, but we can see what our model predicts.

In [50]:
newHouses = {
    'OverallQual' : [8.0, 6.0, 9.0],
    'YearBuilt' : [2002.0, 2022.0, 1992.0],
    'ExterQual' : [3.0, 3.0, 4.0],
    'BsmtQual' : [4.0, 3.0, 4.0],
    'TotalBsmtSF' : [1034.0, 830.0, 916.0],
    'GrLivArea' : [1500.0, 1704.0, 1897.0],
    'FullBath' : [1.0, 2.0, 1.0],
    'KitchenQual' : [4.0, 3.0, 4.0],
    'GarageCars' : [2.0, 2.0, 2.0],
    'OverallGrade' : [35.0, 48.0, 40.0],
}

In [51]:
new_houses = pd.DataFrame(newHouses)
new_houses

Unnamed: 0,OverallQual,YearBuilt,ExterQual,BsmtQual,TotalBsmtSF,GrLivArea,FullBath,KitchenQual,GarageCars,OverallGrade
0,8.0,2002.0,3.0,4.0,1034.0,1500.0,1.0,4.0,2.0,35.0
1,6.0,2022.0,3.0,3.0,830.0,1704.0,2.0,3.0,2.0,48.0
2,9.0,1992.0,4.0,4.0,916.0,1897.0,1.0,4.0,2.0,40.0


In [52]:
prices = regressor.predict(new_houses)

In [53]:
y_df = pd.DataFrame(prices, columns =['SalesPrice'])

In [54]:
new_houses.merge(y_df, left_index=True, right_index=True)

Unnamed: 0,OverallQual,YearBuilt,ExterQual,BsmtQual,TotalBsmtSF,GrLivArea,FullBath,KitchenQual,GarageCars,OverallGrade,SalesPrice
0,8.0,2002.0,3.0,4.0,1034.0,1500.0,1.0,4.0,2.0,35.0,212617.279938
1,6.0,2022.0,3.0,3.0,830.0,1704.0,2.0,3.0,2.0,48.0,203503.301624
2,9.0,1992.0,4.0,4.0,916.0,1897.0,1.0,4.0,2.0,40.0,254269.826186
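One thing to watch for when predicting on new data: the columns should have the same names and the same order as the data the model was trained on. Depending on the sklearn version, a mismatch may raise an error, and with plain arrays a wrong order silently produces wrong predictions. A minimal sketch, assuming a toy model with two made-up features (`area`, `rooms`) that are not part of the housing dataset:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data with made-up feature names (illustrative only).
train = pd.DataFrame({'area': [50.0, 80.0, 120.0, 200.0],
                      'rooms': [2.0, 3.0, 4.0, 6.0],
                      'price': [100.0, 150.0, 230.0, 380.0]})
feature_names = ['area', 'rooms']
toy_model = LinearRegression().fit(train[feature_names], train['price'])

# New rows whose columns arrive in a different order.
incoming = pd.DataFrame({'rooms': [3.0, 5.0], 'area': [90.0, 150.0]})

# Select the training columns by name so the order matches the fit.
aligned = incoming[feature_names]
predicted = toy_model.predict(aligned)

print(aligned.assign(predicted=predicted))
```

In the notebook above, the equivalent step would be `new_houses[X.columns]` before calling `regressor.predict`.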
