# Regression Week 1: Simple Linear Regression

In this notebook we will use data on house sales in King County to predict house prices using simple (one input) linear regression. You will:

* Write a function to compute the simple linear regression weights using the closed-form solution
* Write a function to make predictions of the output given the input feature
* Turn the regression around to predict the input given the output
* Compare two different models for predicting house prices

In this notebook you will be provided with some already complete code as well as some code that you should complete yourself in order to answer quiz questions. The helper code we provide is optional and is there to assist you with solving the problems, but feel free to ignore it and write your own.

### Import libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split


In [2]:
sales = pd.read_csv('kc_house_data.csv')

sales.head  # note: without parentheses this displays the bound method; call sales.head() for a 5-row preview

<bound method NDFrame.head of                id             date      price  bedrooms  bathrooms  \
0      7129300520  20141013T000000   221900.0         3       1.00   
1      6414100192  20141209T000000   538000.0         3       2.25   
2      5631500400  20150225T000000   180000.0         2       1.00   
3      2487200875  20141209T000000   604000.0         4       3.00   
4      1954400510  20150218T000000   510000.0         3       2.00   
5      7237550310  20140512T000000  1225000.0         4       4.50   
6      1321400060  20140627T000000   257500.0         3       2.25   
7      2008000270  20150115T000000   291850.0         3       1.50   
8      2414600126  20150415T000000   229500.0         3       1.00   
9      3793500160  20150312T000000   323000.0         3       2.50   
10     1736800520  20150403T000000   662500.0         3       2.50   
11     9212900260  20140527T000000   468000.0         2       1.00   
12      114101516  20140528T000000   310000.0         3     

# Split Data into Train and Test Data sets

In [3]:
sales.shape

y = sales['price']
X = sales.drop('price', axis=1)  # pass axis by keyword; the positional form is not accepted by recent pandas
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# train_test_split() gives a different X_train and y_train than the split provided with the problem,
# so we also load the provided train and test files.

train = pd.read_csv('kc_house_train_data.csv')
test = pd.read_csv('kc_house_test_data.csv')

In [4]:
train.shape
test.shape

(4229, 21)

# Explore Simple functions

In [25]:
prices = sales['price'] # extract the price column of the sales DataFrame -- this is a pandas Series

# recall that the arithmetic average (the mean) is the sum of the prices divided by the total number of houses:
sum_prices = prices.sum()
num_houses = len(prices) # len() of a Series returns its number of rows
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean() # if you just want the average, use the .mean() method
print ("average price via method 1: " + str(avg_price_1))
print ("average price via method 2: " + str(avg_price_2))

average price via method 1: 540088.1417665294
average price via method 2: 540088.1417665294


In [27]:
prices_squared = prices*prices
sum_prices_squared = prices_squared.sum() # prices_squared is a Series of the squares and we want to add them up.
print ("the sum of price squared is: " + str(sum_prices_squared))

the sum of price squared is: 9217325138472052.0


# Build a generic simple linear regression function

Armed with these Series operations we can use the closed-form solution from lecture to compute the slope and intercept for a simple linear regression on observations stored as pandas Series (or NumPy arrays): input_feature, output.
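For reference, one common way to write the closed-form solution is sketched below. The symbols are assumptions of this sketch rather than notation taken from the lecture: $x_i$ is the input feature, $y_i$ the output, $n$ the number of observations, $a$ the intercept and $b$ the slope.

$$
b = \frac{\sum_{i} x_i y_i - \frac{1}{n}\left(\sum_{i} x_i\right)\left(\sum_{i} y_i\right)}{\sum_{i} x_i^2 - \frac{1}{n}\left(\sum_{i} x_i\right)^2},
\qquad
a = \frac{1}{n}\sum_{i} y_i - b\,\frac{1}{n}\sum_{i} x_i
$$

Only four sums are needed: the sum of the inputs, the sum of the outputs, the sum of their products, and the sum of the squared inputs.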

Complete the following function (or write your own) to compute the simple linear regression slope and intercept:

In [6]:
def simple_linear_regression(input_feature, output):

    # compute the sum of input_feature and output
    input_feature_sum = input_feature.sum()
    output_sum = output.sum()
    
    # compute the product of the output and the input_feature and its sum
    product_sum = (input_feature * output).sum()
    
    # compute the squared value of the input_feature and its sum
    input_sq_sum = (input_feature * input_feature).sum()
    
    n = len(input_feature)
    
    # use the formula for the slope
    slope = (product_sum - (output_sum * input_feature_sum * 1.0 / n)) / \
            (input_sq_sum - (input_feature_sum * input_feature_sum * 1.0 / n))
    
    # use the formula for the intercept
    intercept = (output_sum * 1.0 / n) - \
                (slope * input_feature_sum * 1.0 / n)
    
    return (intercept, slope)

We can test that our function works by passing it something where we know the answer. In particular, we can generate a feature and then put the output exactly on a line: output = 1 + 1*input_feature. Then we know both our slope and intercept should be 1.

In [7]:
test_feature = np.arange(5)
test_output = 1 + 1*test_feature
(test_intercept, test_slope) =  simple_linear_regression(test_feature, test_output)
print ("Intercept: " + str(test_intercept))
print ("Slope: " + str(test_slope))

Intercept: 1.0
Slope: 1.0
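A second sanity check, sketched here under assumptions, is to compare the closed-form result against a library fit on data that does not lie exactly on a line. The toy numbers are hypothetical, and `np.polyfit` is used only as an independent reference, not as part of the assignment.

```python
import numpy as np

# Hypothetical toy data: points scattered around the line y = 3 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])

n = len(x)
# Closed-form slope and intercept built from the four sums
slope = ((x * y).sum() - x.sum() * y.sum() / n) / ((x * x).sum() - x.sum() ** 2 / n)
intercept = y.mean() - slope * x.mean()

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
ref_slope, ref_intercept = np.polyfit(x, y, 1)

print("closed form:", intercept, slope)
print("np.polyfit: ", ref_intercept, ref_slope)
```

The two pairs should agree up to floating-point rounding.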


Now that we know it works, let's build a regression model for predicting price based on sqft_living. Remember that we train on the training data (train)!

In [8]:
sqft_intercept, sqft_slope = simple_linear_regression(train['sqft_living'], train['price'])

print ("Intercept: " + str(sqft_intercept))
print ("Slope: " + str(sqft_slope))

Intercept: -47116.07907289418
Slope: 281.9588396303426


# Predict Values

Now that we have the model parameters (intercept and slope) we can make predictions. With a pandas Series it's easy to multiply by a constant and add a constant value. Complete the following function to return the predicted output given the input_feature, slope and intercept:

In [9]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    
    predicted_values = intercept + input_feature * slope
    return predicted_values

Now that we can calculate a prediction given the slope and intercept, let's make a prediction. Use (or alter) the following to find out the estimated price for a house with 2650 square feet according to the square feet model we estimated above.

In [10]:
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print ("The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price))

The estimated price for a house with 2650 squarefeet is $700074.85
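Because the prediction is plain arithmetic, the same function should also work element-wise on a whole Series. The sketch below assumes the intercept and slope printed above and uses hypothetical house sizes.

```python
import pandas as pd

def get_regression_predictions(input_feature, intercept, slope):
    # Same formula as above; applied element-wise when given a Series
    return intercept + input_feature * slope

# Intercept and slope as printed for the sqft_living model above
sqft_intercept = -47116.07907289418
sqft_slope = 281.9588396303426

# Hypothetical house sizes in square feet
house_sqft = pd.Series([1000, 2650, 4000])
estimated_prices = get_regression_predictions(house_sqft, sqft_intercept, sqft_slope)
print(estimated_prices.round(2))
```

The middle entry corresponds to the 2650 square foot house used above.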


# Residual Sum of Squares

Now that we have a model and can make predictions, let's evaluate our model using Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals, and a residual is just a fancy word for the difference between the predicted output and the true output.
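Written out as a formula, and assuming the notation $a$ for the intercept, $b$ for the slope, and $(x_i, y_i)$ for the $n$ observations (these symbols are a choice made for this sketch):

$$
\mathrm{RSS}(a, b) = \sum_{i=1}^{n} \bigl(y_i - (a + b\,x_i)\bigr)^2
$$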

Complete the following (or write your own) function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope:

In [11]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # First get the predictions
    predicts = get_regression_predictions(input_feature, intercept, slope)
    # then compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = output - predicts
    # square the residuals and add them up
    RSS = (residuals * residuals).sum()
    return(RSS)

Let's test our get_residual_sum_of_squares function by applying it to the test model where the data lie exactly on a line. Since they lie exactly on a line the residual sum of squares should be zero!

In [12]:
print (get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope)) # should be 0.0

0.0
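A case with a nonzero residual is also worth checking by hand. The sketch below uses hypothetical toy numbers where only the last point is off the candidate line, so the RSS can be verified mentally.

```python
import numpy as np

# Hypothetical toy data and a candidate line y = 0 + 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])
intercept, slope = 0.0, 2.0

predictions = intercept + slope * x   # [2, 4, 6]
residuals = y - predictions           # [0, 0, 1]
rss = (residuals ** 2).sum()          # 0 + 0 + 1
print(rss)  # 1.0
```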


Now use your function to calculate the RSS on training data from the square feet model calculated above.

### Quiz Question: According to this function and the slope and intercept from the square feet model, what is the RSS for the simple linear regression using square feet to predict prices on TRAINING data?

In [13]:
print (get_residual_sum_of_squares(train['sqft_living'], train['price'], sqft_intercept, sqft_slope)) 

1201918354177286.2


## Predict the squarefeet given price
What if we want to predict the square feet given the price? Since we have an equation y = a + b*x, we can solve it for x: if we have the intercept (a), the slope (b) and the price (y), we can solve for the estimated square feet (x).

Complete the following function to compute the inverse regression estimate, i.e. predict the input_feature given the output.



In [14]:
def inverse_regression_predictions(output, intercept, slope):
    # solve output = intercept + slope*input_feature for input_feature. Use this equation to compute the inverse predictions:
    
    estimated_feature = (output-intercept)/slope
    
    return estimated_feature

Now that we have a function to compute the square feet given the price from our simple regression model, let's see how big we might expect a house that costs $800,000 to be.

### Quiz Question: According to this function and the regression slope and intercept from (3), what is the estimated square feet for a house costing $800,000?

In [15]:
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print ("The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet))

The estimated squarefeet for a house worth $800000.00 is 3004
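One way to check the inverse function is a round trip: predict a price from a size, then feed that price back in and confirm the original size comes out. The sketch below assumes the intercept and slope printed earlier for the sqft_living model.

```python
def get_regression_predictions(input_feature, intercept, slope):
    return intercept + input_feature * slope

def inverse_regression_predictions(output, intercept, slope):
    return (output - intercept) / slope

# Intercept and slope as printed for the sqft_living model above
sqft_intercept = -47116.07907289418
sqft_slope = 281.9588396303426

sqft = 2650
price = get_regression_predictions(sqft, sqft_intercept, sqft_slope)
recovered_sqft = inverse_regression_predictions(price, sqft_intercept, sqft_slope)
print(recovered_sqft)  # approximately 2650, up to floating-point rounding
```

Note that this only inverts the fitted line algebraically. It is generally not the same as fitting a separate regression of square feet on price.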


## New Model: estimate prices from bedrooms
We have made one model for predicting house prices using square feet, but there are many other features in the sales DataFrame. Use your simple linear regression function to estimate the regression parameters for predicting price based on the number of bedrooms. Use the training data!

In [16]:
# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'
bedrooms_intercept, bedrooms_slope = simple_linear_regression(train['bedrooms'], train['price'])

print ("Intercept: " + str(bedrooms_intercept))
print ("Slope: " + str(bedrooms_slope))

Intercept: 109473.1776229596
Slope: 127588.95293398784


## Test your Linear Regression Algorithm
Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the TEST data (remember this data wasn't involved in learning the model). Compute the RSS from predicting prices using bedrooms and from predicting prices using squarefeet.

### Quiz Question: Which model (square feet or bedrooms) has lowest RSS on TEST data? Think about why this might be the case.

In [17]:
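# Compute RSS when using bedrooms on TEST data:
RSS_bedroom = get_residual_sum_of_squares(test['bedrooms'], test['price'], bedrooms_intercept, bedrooms_slope)
# NOTE: the value printed below uses X_test/y_test (the random split from train_test_split),
# not the provided test file, so it differs from RSS_bedroom
print (get_residual_sum_of_squares(X_test['bedrooms'], y_test, bedrooms_intercept, bedrooms_slope))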

472649169215554.06


In [18]:
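# Compute RSS when using square feet on TEST data:
RSS_ft_square = get_residual_sum_of_squares(test['sqft_living'], test['price'], sqft_intercept, sqft_slope)
# NOTE: the value printed below uses X_test/y_test (the random split from train_test_split),
# not the provided test file, so it differs from RSS_ft_square
print (get_residual_sum_of_squares(X_test['sqft_living'], y_test, sqft_intercept, sqft_slope))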

267278997754280.78


In [19]:
RSS_bedroom - RSS_ft_square 

217961652342488.28
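When comparing two models by RSS, both values need to come from the same held-out rows. The sketch below illustrates that pattern on hypothetical toy data in which price is exactly linear in square feet. It uses `np.polyfit` as a stand-in for the fitting function, and all names and numbers are assumptions made for illustration.

```python
import numpy as np

def rss(x, y, intercept, slope):
    # Sum of squared differences between true and predicted outputs
    residuals = y - (intercept + slope * x)
    return (residuals ** 2).sum()

# Hypothetical toy training data: price is exactly linear in sqft
train_sqft = np.array([1000.0, 1500.0, 2000.0, 2500.0])
train_beds = np.array([2.0, 3.0, 3.0, 4.0])
train_price = 280.0 * train_sqft - 40000.0

# Hypothetical held-out rows, generated by the same rule
test_sqft = np.array([3000.0, 3500.0])
test_beds = np.array([3.0, 4.0])
test_price = 280.0 * test_sqft - 40000.0

# Fit each one-feature model on the training rows only
sqft_slope, sqft_intercept = np.polyfit(train_sqft, train_price, 1)
beds_slope, beds_intercept = np.polyfit(train_beds, train_price, 1)

# Evaluate both models on the same held-out rows
rss_sqft = rss(test_sqft, test_price, sqft_intercept, sqft_slope)
rss_beds = rss(test_beds, test_price, beds_intercept, beds_slope)
print(rss_sqft < rss_beds)  # True
```

In this toy setup the square feet model wins by construction. On real data the gap reflects how much of the price variation each feature explains.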