# Simple Linear Regression (Wednesday - House Data)

In this notebook we walk through the steps of Simple Linear Regression, using data on house sales in King County, USA (https://www.kingcounty.gov) to predict house prices. You will:
* Upload and preprocess the data
* Write a function to compute the Simple Linear Regression weights
* Write a function to make predictions of the output given the input feature
* Compare different models for predicting house prices

# Upload and preprocess the data

The dataset contains house sales from King County, USA (https://www.kingcounty.gov).

In [1]:
import pandas as pd
sales = pd.read_csv('housing.csv')
# Look at the table to check potential features
sales[:10]

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
5,7237550310,20140512T000000,1225000.0,4,4.5,5420,101930,1.0,0,0,...,11,3890,1530,2001,0,98053,47.6561,-122.005,4760,101930
6,1321400060,20140627T000000,257500.0,3,2.25,1715,6819,2.0,0,0,...,7,1715,0,1995,0,98003,47.3097,-122.327,2238,6819
7,2008000270,20150115T000000,291850.0,3,1.5,1060,9711,1.0,0,0,...,7,1060,0,1963,0,98198,47.4095,-122.315,1650,9711
8,2414600126,20150415T000000,229500.0,3,1.0,1780,7470,1.0,0,0,...,7,1050,730,1960,0,98146,47.5123,-122.337,1780,8113
9,3793500160,20150312T000000,323000.0,3,2.5,1890,6560,2.0,0,0,...,7,1890,0,2003,0,98038,47.3684,-122.031,2390,7570


# Split data into training and testing

In [2]:
from sklearn.model_selection import train_test_split
# Split data set into 80% train and 20% test data 
train_data, test_data = train_test_split(sales, test_size=0.2)
# Look at the shape to check the split ratio
train_data.shape, test_data.shape

((17290, 21), (4323, 21))

# Compute average prices and sum of squares

In [3]:
# The average house price can be computed in two different ways. First get the prices:
prices = train_data['price'] # extract the price column

# The arithmetic average (the mean) is the sum of the prices divided by the total number of houses:
sum_prices = prices.sum()
num_houses = len(prices)
avg_price_1 = sum_prices/num_houses

# There is already a mean function that we can use
avg_price_2 = prices.mean()

print('Average price via arithmetic average: ' + str(avg_price_1))
print('Average price via mean function: ' + str(avg_price_2))

Average price via arithmetic average: 541967.252631579
Average price via mean function: 541967.252631579


As we can see, both approaches give the same result.

In [4]:
# Let's compute the sum of squares of prices.
prices_squared = prices*prices
sum_prices_squared = prices_squared.sum()
print('The sum of price squared is: ' + str(sum_prices_squared))

The sum of price squared is: 7487008182652632.0


# Build a generic simple linear regression function 

Let's write a function that uses the closed-form solution to compute the slope and intercept of a simple linear regression.
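
For reference, these are the closed-form formulas that the function in the next cell computes, written out as a sketch. The symbols are notation introduced here, not taken from the notebook: $x_i$ is the input feature of house $i$, $y_i$ is its output (the price), and $n$ is the number of rows.

$$
\text{slope} = \frac{\sum_i x_i y_i - \frac{1}{n}\sum_i x_i \sum_i y_i}{\sum_i x_i^2 - \frac{1}{n}\left(\sum_i x_i\right)^2},
\qquad
\text{intercept} = \frac{1}{n}\sum_i y_i - \text{slope}\cdot\frac{1}{n}\sum_i x_i
$$

The intercept formula says that the fitted line always passes through the point given by the mean input and the mean output.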

In [5]:
def simple_linear_regression(input_feature, output):
    n = len(input_feature)
    # compute the sum of input_feature and output
    x_sum = input_feature.sum()
    y_sum = output.sum()
    # compute the product of the output and the input_feature and its sum
    xy = input_feature * output
    xy_sum = xy.sum()
    # compute the squared value of the input_feature and its sum
    x_squared = input_feature * input_feature
    x_squared_sum = x_squared.sum()
    # use the formula for the slope
    slope = (xy_sum - x_sum * y_sum / n) / (x_squared_sum - x_sum * x_sum / n)
    # use the formula for the intercept
    intercept = y_sum / n - slope * x_sum / n
    return (intercept, slope)
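
A quick way to gain confidence in the function is to run it on data where the answer is known and to compare it with NumPy's least-squares fit. The sketch below is not part of the original notebook: it repeats the function so that it runs on its own, and the toy arrays `x` and `y` are made up for illustration.

```python
import numpy as np

def simple_linear_regression(input_feature, output):
    # Same closed-form formulas as in the cell above
    n = len(input_feature)
    x_sum = input_feature.sum()
    y_sum = output.sum()
    xy_sum = (input_feature * output).sum()
    x_squared_sum = (input_feature * input_feature).sum()
    slope = (xy_sum - x_sum * y_sum / n) / (x_squared_sum - x_sum * x_sum / n)
    intercept = y_sum / n - slope * x_sum / n
    return (intercept, slope)

# Toy data (assumed for illustration): points lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

intercept, slope = simple_linear_regression(x, y)

# np.polyfit with degree 1 returns the coefficients as [slope, intercept]
ref_slope, ref_intercept = np.polyfit(x, y, 1)

print('Closed form:', intercept, slope)
print('Matches np.polyfit:',
      np.isclose(intercept, ref_intercept) and np.isclose(slope, ref_slope))
```

Because the toy points lie exactly on a line, the function should recover an intercept of 1 and a slope of 2.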

Now let's use that function to fit a simple regression model that predicts price from sqft_living. Remember that we train on train_data!

In [6]:
sqft_living_intercept, sqft_living_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])

print('Intercept: ' + str(sqft_living_intercept))
print('Slope: ' + str(sqft_living_slope))

Intercept: -53110.29412003083
Slope: 285.5238980221615


# Predicting values

With the model parameters (intercept and slope) we can now write a function that returns the predicted output for a given input_feature:

In [7]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    predicted_values = intercept + input_feature * slope
    return predicted_values
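
Since the function only uses `+` and `*`, it should work for a single number as well as for a whole column of values. The following minimal sketch illustrates this with a NumPy array; the intercept and slope are hypothetical values chosen for easy arithmetic, not the fitted parameters.

```python
import numpy as np

def get_regression_predictions(input_feature, intercept, slope):
    # Same formula as in the cell above
    return intercept + input_feature * slope

# Hypothetical parameters (NOT the fitted values from this notebook)
intercept, slope = 10000.0, 200.0

# A single house
single = get_regression_predictions(1500, intercept, slope)

# Several houses at once: the arithmetic is applied element-wise
many = get_regression_predictions(np.array([1000, 1500, 2000]), intercept, slope)

print(single)
print(many)
```

The same element-wise behaviour applies to a pandas column such as `train_data['sqft_living']`.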

Now that we have this function let's make a prediction. 

* What is the estimated price for a house with 2650 square feet (living) according to the sqft_living model we estimated above?

In [8]:
my_house_sqft_living = 2650
estimated_price_living = get_regression_predictions(my_house_sqft_living, 
                                             sqft_living_intercept, 
                                             sqft_living_slope)

print('The estimated price for a house with %d squarefeet is $%.2f' % (my_house_sqft_living, estimated_price_living))

The estimated price for a house with 2650 squarefeet is $703528.04


# Your individual task 1: Build models for other features!

Answer the following questions by applying what you have learned so far:

* What is the estimated price for a house with 2650 squarefeet (lot)?
* What is the estimated price for a house with 4 bedrooms?
* What is the estimated price for a house with 3 bathrooms?


Go through the entire notebook first and think: What can be reused and what needs to be changed? (15min)

=> Present your thoughts to the group!

Now add more code and variables where needed! (45min)

=> Present your code to the group!


In [9]:
# Your code here...

# Residual Sum of Squares

Now we want to evaluate our models using Residual Sum of Squares (RSS) via a function. 

Recall that RSS is the sum of the squared residuals, where each residual is the difference between the predicted output and the true output.

In [10]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # First get the predictions
    predicted_output = get_regression_predictions(input_feature, intercept, slope)
    # Compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = predicted_output - output
    # Square the residuals and add them up
    residuals_squared = residuals * residuals
    RSS = residuals_squared.sum()
    return RSS
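
To see what the function measures, it can help to compute the RSS by hand on a few points. The sketch below uses three made-up points and two hypothetical lines; it repeats the computation in a self-contained form rather than calling the notebook's cells.

```python
import numpy as np

def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # Predictions, residuals, and the sum of squared residuals
    predicted_output = intercept + input_feature * slope
    residuals = predicted_output - output
    return (residuals * residuals).sum()

# Toy data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])

# Line A: y = 0 + 2x -> predictions 2, 4, 6 -> residuals 0, 0, -1
rss_a = get_residual_sum_of_squares(x, y, 0.0, 2.0)

# Line B: y = 1 + 2x -> predictions 3, 5, 7 -> residuals 1, 1, 0
rss_b = get_residual_sum_of_squares(x, y, 1.0, 2.0)

print(rss_a, rss_b)
```

Line A has the smaller RSS, so on these three points it fits better than line B.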

Now let's use the function to calculate the RSS on training data from the squarefeet (living) model calculated above.

In [11]:
rss_prices_on_sqft_living = get_residual_sum_of_squares(train_data['sqft_living'], 
                                                 train_data['price'], 
                                                 sqft_living_intercept, 
                                                 sqft_living_slope)

print('The RSS of predicting prices based on Square Feet (living) is : ' + str(rss_prices_on_sqft_living))

The RSS of predicting prices based on Square Feet (living) is : 1210693708866647.0


# Your individual task 2: Test your Linear Regression algorithm

If we have more models for predicting the price of a house, how do we know which one is better? 

Calculate the RSS on the TEST data (remember this data wasn't involved in learning the model) and identify which model works best:

* Simple Linear Regression on squarefeet (living)?
* Simple Linear Regression on squarefeet (lot)?
* Simple Linear Regression on bedrooms?
* Simple Linear Regression on bathrooms?

(45min)

In [12]:
# Compute RSS when using Square Feet (living) on TEST data.
# A separate variable name keeps the training RSS from In [11] available for comparison.
rss_test_prices_on_sqft_living = get_residual_sum_of_squares(test_data['sqft_living'], 
                                                 test_data['price'], 
                                                 sqft_living_intercept, 
                                                 sqft_living_slope)

print('The test data RSS of predicting prices based on Square Feet (living) is: ' + str(rss_test_prices_on_sqft_living))

The test data RSS of predicting prices based on Square Feet (living) is: 267029915753831.7
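
Note that RSS grows with the number of rows, so the training RSS (17290 rows) and the test RSS (4323 rows) are not directly comparable. For comparing the four models on the same test data, RSS is sufficient. If you also want to compare training and test error, one common option is the root mean squared error (RMSE). The sketch below is an addition, not part of the original exercise; the helper name `rmse_from_rss` is hypothetical, and the numbers are copied from the outputs shown in this notebook, so they will differ for another random split.

```python
import math

def rmse_from_rss(rss, n_rows):
    # Root mean squared error: RSS averaged over the rows, then square-rooted
    return math.sqrt(rss / n_rows)

# RSS values and row counts copied from the outputs shown in this notebook
train_rmse = rmse_from_rss(1210693708866647.0, 17290)
test_rmse = rmse_from_rss(267029915753831.7, 4323)

print('Train RMSE: %.2f' % train_rmse)
print('Test RMSE:  %.2f' % test_rmse)
```

RMSE is expressed in the same unit as the output (here dollars), which makes it easier to interpret than the raw RSS.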


In [13]:
# Your code here...

# Additional: Predict the squarefeet given price

Now we want to predict the square feet given the price. Since the model is the equation y = a + b\*x, we can solve it for x: x = (y - a) / b. So if we know the intercept (a), the slope (b) and the price (y), we can compute the estimated square feet (x).

In [14]:
def inverse_regression_predictions(output, intercept, slope):
    # Solve output = intercept + slope * input_feature for input_feature. Use this equation to compute the inverse predictions:
    estimated_feature = (output - intercept) / slope
    return estimated_feature
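
A simple check for the inverse function is a round trip: predict a price from a size, then feed that price back in and confirm that the original size comes out. The sketch below is self-contained and uses hypothetical parameters, not the fitted values. It also assumes a non-zero slope, since the inverse divides by it.

```python
def get_regression_predictions(input_feature, intercept, slope):
    # Forward direction: input feature -> predicted output
    return intercept + input_feature * slope

def inverse_regression_predictions(output, intercept, slope):
    # Backward direction: output -> estimated input feature (slope must not be 0)
    return (output - intercept) / slope

# Hypothetical parameters (NOT the fitted values from this notebook)
intercept, slope = 10000.0, 200.0

sqft = 1500
price = get_regression_predictions(sqft, intercept, slope)
recovered_sqft = inverse_regression_predictions(price, intercept, slope)

print(price, recovered_sqft)
```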

Now that we have a function to compute the square feet given the price from our simple regression model, let's see how big we might expect a house that costs $750,000 to be.

In [15]:
my_house_price = 750000
estimated_squarefeet_living = inverse_regression_predictions(my_house_price, 
                                                             sqft_living_intercept, 
                                                             sqft_living_slope)

print('The estimated squarefeet (living) for a house worth $%.2f is %d.' % (my_house_price, estimated_squarefeet_living))

The estimated squarefeet (living) for a house worth $750000.00 is 2812.
