House prediction dataset
`https://www.kaggle.com/datasets/harlfoxem/housesalesprediction`
The dataset contains house sale prices for King County, Washington, which includes Seattle.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Load data
df = pd.read_csv("../data/kc_house_data.csv")

In [2]:
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


## Linear Regression
Simple linear regression is a method for modeling the relationship between two variables:
- The predictor/independent variable (X)
- The response/dependent variable (Y)

The result of a linear regression is a linear function that predicts the response variable from the predictor.

$
\widehat{Y} = a + bx
$

- `a` refers to the `intercept` of the line
- `b` refers to the `slope` of the line
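
To make the roles of `a` and `b` concrete, here is a minimal sketch that computes them by hand with the ordinary least squares formulas for a single predictor. The toy arrays `x` and `y` are hypothetical and are not taken from the King County data; they are built so that the true intercept is 2 and the true slope is 3.

```python
import numpy as np

# Hypothetical toy data (not from the dataset): y = 2 + 3x exactly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# Ordinary least squares with one predictor:
#   slope     b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   intercept a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(a, b)
```

Because the toy data lie exactly on a line, the recovered `a` and `b` should match the values used to build `y`, up to floating-point rounding.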

In [3]:
from sklearn.linear_model import LinearRegression

In [4]:

# Create a linear regression object
lm = LinearRegression()
lm

### How can `sqft_living` help us predict the house `price`?

In [5]:
X = df[['sqft_living']] # This needs to be a 2D array
Y = df[['price']] # The target may be 1D or 2D; it is 2D here, so predictions come back as a column
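
The difference between double and single brackets can be checked on a small frame. This is a sketch only: the frame `toy` is a hypothetical stand-in that reuses three rows from the `df.head()` output above.

```python
import pandas as pd

# Hypothetical three-row frame with values copied from df.head()
toy = pd.DataFrame({
    "sqft_living": [1180, 2570, 770],
    "price": [221900.0, 538000.0, 180000.0],
})

X_2d = toy[["sqft_living"]]  # double brackets -> DataFrame, shape (3, 1)
x_1d = toy["sqft_living"]    # single brackets -> Series, shape (3,)

print(type(X_2d).__name__, X_2d.shape)
print(type(x_1d).__name__, x_1d.shape)
```

scikit-learn estimators expect the feature matrix to be two-dimensional (samples by features), which is why `X` is selected with double brackets.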

In [6]:
# Fit linear model
lm.fit(X,Y)

In [7]:
# Get the output prediction using X
Yhat = lm.predict(X)
# Print first 5 results
Yhat[:5]

array([[287555.06702452],
       [677621.82640197],
       [172499.40418656],
       [506441.44998452],
       [427866.85097324]])

In [8]:
# Get the value of the intercept (a)
lm.intercept_

array([-43580.74309447])

In [9]:
# Get the value of the slope (b)
lm.coef_

array([[280.6235679]])

Final estimated linear model, with intercept `a = -43580.74` and slope `b = 280.62356`:

$
\widehat{Y} = a + bx = -43580.74 + 280.62356x
$

In [10]:
# Apply the fitted equation by hand to the 2D DataFrame X
Yhat = -43580.74 + 280.62356*X
print(f'Using X price= {Yhat}')
# or apply it to the 1D Series df['sqft_living']
price = -43580.74 + 280.62356*df['sqft_living']
print(f'Using price = {price}')

Using X price=        sqft_living
0      287555.0608
1      677621.8092
2      172499.4012
3      506441.4376
4      427866.8408
...            ...
21608  385773.3068
21609  604659.6836
21610  242655.2912
21611  405416.9560
21612  242655.2912

[21613 rows x 1 columns]
Using price = 0        287555.0608
1        677621.8092
2        172499.4012
3        506441.4376
4        427866.8408
            ...     
21608    385773.3068
21609    604659.6836
21610    242655.2912
21611    405416.9560
21612    242655.2912
Name: sqft_living, Length: 21613, dtype: float64
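
Instead of copying rounded numbers by hand, the intercept and slope can be read from the fitted model. The sketch below assumes a hypothetical five-row frame `toy` (values copied from `df.head()`) and a separate model named `model`, so the fitted numbers will differ from those of the full dataset; the point is only that the hand-written equation and `predict` agree.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical five-row frame; column names mirror the notebook
toy = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
})
X_toy = toy[["sqft_living"]]
Y_toy = toy[["price"]]

model = LinearRegression().fit(X_toy, Y_toy)

# With a 2D target, intercept_ has shape (1,) and coef_ has shape (1, 1)
a = model.intercept_[0]
b = model.coef_[0, 0]

manual = a + b * toy["sqft_living"]        # equation written out by hand
predicted = model.predict(X_toy).ravel()   # flatten the (n, 1) output

print(np.allclose(manual, predicted))
```

Reading the parameters from the model avoids the small rounding differences visible between the hand-computed values above and the output of `lm.predict(X)`.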


## Task 1
Repeat the process (Linear Model Prediction) but using `bedrooms` as the independent variable and `price` as your dependent variable.

In [11]:
X = df[['bedrooms']] # 2D array
Y = df[['price']] # same target as before
lm2 = LinearRegression()
lm2.fit(X,Y)
lm2

In [12]:
# Find the slope and intercept
lm2.coef_

array([[121716.12651184]])

In [13]:
lm2.intercept_

array([129802.35631826])

In [14]:
# Equation
Yhat = 129802.35 + 121716.1265*X
Yhat

Unnamed: 0,bedrooms
0,494950.7295
1,494950.7295
2,373234.6030
3,616666.8560
4,494950.7295
...,...
21608,494950.7295
21609,616666.8560
21610,373234.6030
21611,494950.7295
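
As a quick check of the equation, the arithmetic for a few bedroom counts can be worked out directly. This sketch reuses the rounded intercept and slope from the cell above; the dictionary name `predictions` is just illustrative.

```python
# Rounded intercept and slope from the Task 1 equation
a = 129802.35
b = 121716.1265

# Predicted price for houses with 2, 3 and 4 bedrooms
predictions = {bedrooms: a + b * bedrooms for bedrooms in (2, 3, 4)}

for bedrooms, price in predictions.items():
    print(bedrooms, round(price, 4))
```

Every house with the same number of bedrooms gets the same predicted price, which is why the output above repeats only a handful of distinct values.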


## Multiple Linear Regression
If we want to use more than one variable in our model to predict the price, we use a `multiple linear regression`:

$
\widehat{Y} = a + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + \dots + b_{n}X_{n}
$

For this example the predictors of price will be:
- sqft_living
- bedrooms
- bathrooms
- sqft_lot

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.linear_model import LinearRegression
# Load data
df = pd.read_csv("../data/kc_house_data.csv")

In [16]:
# Get all the predictor variables
X = df[['sqft_living', 'bedrooms', 'bathrooms', 'sqft_lot']] # 2d
Y = df['price']

In [17]:
lm = LinearRegression()
lm.fit(X,Y)

In [18]:
# Get the intercept (a)
a = lm.intercept_
a

79092.32040168752

In [19]:
# Get the slopes b1, b2, b3, b4
bs = list(lm.coef_)
bs

[314.29172074654906,
 -59406.812405435456,
 6268.660401823767,
 -0.37765257884149556]

In [20]:
Yhat = a + (bs[0]*X['sqft_living']) + (bs[1]*X['bedrooms']) + (bs[2]*X['bathrooms']) + (bs[3]*X['sqft_lot'])
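
The term-by-term sum above is equivalent to a matrix-vector product. The sketch below illustrates this with NumPy only, using the intercept and slopes printed in cells 18 and 19 and the first two rows shown by `df.head()`; the array name `X_rows` is an assumption made for this example.

```python
import numpy as np

# Intercept and slopes as printed by the notebook (cells 18 and 19)
a = 79092.32040168752
bs = np.array([314.29172074654906, -59406.812405435456,
               6268.660401823767, -0.37765257884149556])

# First two rows of df.head(): sqft_living, bedrooms, bathrooms, sqft_lot
X_rows = np.array([
    [1180, 3, 1.0, 5650],
    [2570, 3, 2.25, 7242],
])

# Term-by-term sum, mirroring the notebook's last cell
term_by_term = (a + bs[0] * X_rows[:, 0] + bs[1] * X_rows[:, 1]
                + bs[2] * X_rows[:, 2] + bs[3] * X_rows[:, 3])

# The same computation written as a + X @ b
matrix_form = a + X_rows @ bs

print(term_by_term)
print(matrix_form)
```

On the full data, `lm.predict(X)` should give the same values as either form, and the matrix form scales to any number of predictors without writing out each term.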