# Regression Week 1: Simple Linear Regression

In this notebook we will use data on house sales in King County to predict house prices using simple (one input) linear regression. You will:
* Write a function to compute the Simple Linear Regression weights using the closed form solution
* Write a function to make predictions of the output given the input feature
* Turn the regression around to predict the input given the output
* Compare two different models for predicting house prices

**Load libraries**

In [28]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

**Load house sales data**

The dataset comes from house sales in King County, the region where the city of Seattle, WA is located.

In [30]:
dtype_dict = {'bathrooms':float,
              'waterfront':int,
              'sqft_above':int,
              'sqft_living15':float,
              'grade':int,
              'yr_renovated':int,
              'price':float,
              'bedrooms':float,
              'zipcode':str,
              'long':float,
              'sqft_lot15':float,
              'sqft_living':float,
              'floors':str,
              'condition':int,
              'lat':float,
              'date':str,
              'sqft_basement':int,
              'yr_built':int,
              'id':str,
              'sqft_lot':int,
              'view':int}

train_data = pd.read_csv('kc_house_train_data.csv', usecols = ['sqft_living', 'bedrooms', 'price'], dtype = dtype_dict)
test_data = pd.read_csv('kc_house_test_data.csv', usecols = ['sqft_living', 'bedrooms', 'price'], dtype = dtype_dict)
train_data.head()

Unnamed: 0,price,bedrooms,sqft_living
0,221900.0,3.0,1180.0
1,538000.0,3.0,2570.0
2,180000.0,2.0,770.0
3,604000.0,4.0,1960.0
4,510000.0,3.0,1680.0


In [31]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4229 entries, 0 to 4228
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   price        4229 non-null   float64
 1   bedrooms     4229 non-null   float64
 2   sqft_living  4229 non-null   float64
dtypes: float64(3)
memory usage: 99.2 KB


Aside: The Python notation x.xxe+yy means x.xx \* 10^(yy), e.g. 100 = 10^2 = 1 \* 10^2 = 1e2.

**Build a generic simple linear regression function**

We can use the closed form solution to compute the slope and intercept for a simple linear regression.
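
For reference, the closed form that the function in the next cell computes can be written as below. The symbols are notation chosen here for illustration: x_i is the input feature, y_i the output, N the number of observations, a the intercept and b the slope.

```latex
\begin{aligned}
b &= \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N}\left(\sum_{i=1}^{N} y_i\right)\left(\sum_{i=1}^{N} x_i\right)}
          {\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2} \\
a &= \frac{1}{N}\sum_{i=1}^{N} y_i - b \cdot \frac{1}{N}\sum_{i=1}^{N} x_i
\end{aligned}
```

The numerator and denominator of b correspond to `numerator` and `denominator` in the code, and the two terms of a correspond to `term1` and `term2`.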

In [36]:
def simple_linear_regression(input_feature, output):
    # compute the sum of input_feature and output
    input_sum = input_feature.sum()
    output_sum = output.sum()
    # compute the product of the output and the input_feature and its sum
    output_input_sum = (output * input_feature).sum()
    # compute the squared value of the input_feature and its sum
    squared_input_sum = (input_feature * input_feature).sum()
    # use the formula for the slope
    numerator = output_input_sum - ((output_sum * input_sum) / len(output))
    denominator = squared_input_sum - ((input_sum * input_sum) / len(output))
    slope = numerator / denominator
    # use the formula for the intercept
    term1 = output_sum / len(output)
    term2 = slope * (input_sum / len(output))
    intercept = term1 - term2
    
    return (intercept, slope)

We can test that our function works by passing it data for which we know the answer. In particular, we can generate a feature and then put the output exactly on a line: output = 1 + 1\*input_feature. Both the slope and the intercept should then be 1.

In [37]:
test_feature = np.array(range(5))
test_output = np.array(1 + 1*test_feature)
test_intercept, test_slope = simple_linear_regression(test_feature, test_output)
print("Intercept: " + str(test_intercept))
print("Slope: " + str(test_slope))

Intercept: 1.0
Slope: 1.0
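
A perfectly linear input is a fairly weak test, so it may also be worth cross-checking against an independent least-squares routine. The sketch below is one way to do that, assuming NumPy's `np.polyfit` as the reference; the noisy data is made up for illustration, and the function is restated so the block runs on its own.

```python
import numpy as np

def simple_linear_regression(input_feature, output):
    """Closed-form fit, restated from the cell above so this block is self-contained."""
    n = len(output)
    input_sum = input_feature.sum()
    output_sum = output.sum()
    numerator = (output * input_feature).sum() - (output_sum * input_sum) / n
    denominator = (input_feature * input_feature).sum() - (input_sum * input_sum) / n
    slope = numerator / denominator
    intercept = output_sum / n - slope * (input_sum / n)
    return intercept, slope

# Made-up noisy data around the line y = 3 + 2x (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

intercept, slope = simple_linear_regression(x, y)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
ref_slope, ref_intercept = np.polyfit(x, y, 1)

print(np.isclose(intercept, ref_intercept), np.isclose(slope, ref_slope))
```

If the two fits disagree beyond floating-point tolerance, the closed-form implementation is the first place to look.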


Now that we know it works, let's build a regression model for predicting price based on sqft_living. Remember that we train on train_data!

In [38]:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])

print("Intercept: " + str(sqft_intercept))
print("Slope: " + str(sqft_slope))

Intercept: -47116.07907289418
Slope: 281.9588396303426


**Predicting Values**

Now that we have the model parameters (intercept and slope), we can make predictions. Complete the following function to return the predicted output given the input_feature, slope and intercept:

In [39]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    predicted_values = intercept + slope * input_feature
    
    return predicted_values

Now that we can calculate a prediction given the slope and intercept, let's make one. Use (or alter) the following to find the estimated price for a house with 2,650 square-feet according to the model we estimated above.

**Quiz Question: Using your slope and intercept, what is the predicted price for a house with 2650 sqft?**

In [46]:
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print("The estimated price for a house with %d square-feet is $%.2f" % (my_house_sqft, estimated_price))

The estimated price for a house with 2650 square-feet is $700074.85
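
As a sanity check, the same number can be reproduced by hand from the intercept and slope printed earlier (values rounded here for readability):

```latex
\begin{aligned}
\hat{y} &= a + b \cdot x \\
        &\approx -47116.08 + 281.9588 \times 2650 \\
        &\approx -47116.08 + 747190.93 \\
        &= 700074.85
\end{aligned}
```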


**Residual Sum of Squares**

Now that we have a model and can make predictions, let's evaluate the model using the Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals, and a residual is just a fancy word for the difference between the predicted output and the true output.
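
Written out, with a the intercept, b the slope, and x_i, y_i the i-th input and output (notation chosen here for illustration), the definition reads:

```latex
\mathrm{RSS}(a, b) = \sum_{i=1}^{N} \bigl( y_i - (a + b\,x_i) \bigr)^2
```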

Complete the following (or write your own) function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope:

In [41]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # first get the predictions
    predicted_values = get_regression_predictions(input_feature, intercept, slope)
    # then compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = output - predicted_values
    # square the residuals and add them up
    RSS = (residuals * residuals).sum()

    return RSS

Let's test our get_residual_sum_of_squares function by applying it to the test model, where the data lie exactly on a line. Since they lie exactly on a line, the residual sum of squares should be zero!

In [42]:
print(get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope)) # should be 0.0

0.0
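
A zero result alone cannot distinguish a correct function from one that always returns zero, so a second check with a known non-zero answer may help. The sketch below uses three made-up points and a hypothetical line (intercept 0, slope 1) small enough to verify by hand; the function is restated so the block runs on its own.

```python
import numpy as np

def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    """RSS of a simple linear model, restated from the cell above."""
    predicted_values = intercept + slope * input_feature
    residuals = output - predicted_values
    return (residuals * residuals).sum()

# Made-up points (illustrative only)
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 2.0])

# With intercept 0 and slope 1 the predictions are 0, 1, 2,
# so the residuals are 1, 2, 0 and the RSS is 1 + 4 + 0 = 5.
rss = get_residual_sum_of_squares(x, y, 0.0, 1.0)
print(rss)
```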


Now use your function to calculate the RSS on training data from the model calculated above.

**Quiz Question: According to this function and the slope and intercept from the model, what is the RSS for the simple linear regression using square-feet to predict prices on the training data?**

In [48]:
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)
print('The RSS of predicting prices based on square-feet is : ' + str(rss_prices_on_sqft))

The RSS of predicting prices based on square-feet is : 1201918354177283.0


**Predict square-feet given price**

What if we want to predict square-feet given the price? Since we have the equation y = a + b\*x, we can solve it for x. Given the intercept (a), the slope (b) and the price (y), the estimated square-feet is x = (y - a) / b.

Complete the following function to compute the inverse regression estimate, i.e. predict the input_feature given the output.

In [49]:
def inverse_regression_predictions(output, intercept, slope):
    # solve output = intercept + slope*input_feature for input_feature. Use this equation to compute the inverse predictions
    estimated_feature = (output - intercept) / slope

    return estimated_feature
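
One way to check the inverse function is a round trip: predict an output from an input, invert it, and confirm the original input comes back. The sketch below does this with made-up square-footage values and hypothetical parameters; both functions are restated so the block runs on its own.

```python
import numpy as np

def get_regression_predictions(input_feature, intercept, slope):
    """Forward prediction, restated from the earlier cell."""
    return intercept + slope * input_feature

def inverse_regression_predictions(output, intercept, slope):
    """Algebraic inverse of the fitted line, restated from the cell above."""
    return (output - intercept) / slope

# Hypothetical parameters and inputs (illustrative only)
intercept, slope = -2.0, 0.5
sqft = np.array([800.0, 1500.0, 2650.0])

predicted = get_regression_predictions(sqft, intercept, slope)
recovered = inverse_regression_predictions(predicted, intercept, slope)

print(np.allclose(recovered, sqft))
```

Note that this is an algebraic inversion of the fitted line, which in general differs from fitting a separate regression of square-feet on price.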

Now that we have a function to compute the squarefeet given the price from our simple regression model, let's see how big we might expect a house that costs $800,000 to be.

**Quiz Question: According to this function and the regression slope and intercept, what is the estimated square-feet for a house costing $800,000?**

In [50]:
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print("The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet))

The estimated squarefeet for a house worth $800000.00 is 3004
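
The same figure follows by hand from the intercept and slope printed earlier (values rounded here for readability). The `%d` format truncates the result to 3004.

```latex
\begin{aligned}
\hat{x} &= \frac{y - a}{b} \\
        &\approx \frac{800000 - (-47116.08)}{281.9588} \\
        &= \frac{847116.08}{281.9588} \\
        &\approx 3004.40
\end{aligned}
```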


**New Model: Estimate prices using bedrooms**

We have made one model for predicting house prices using squarefeet, but the dataset contains many other features.
Use your simple linear regression function to estimate the regression parameters for predicting prices based on the number of bedrooms.

In [51]:
# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'
bedrooms_intercept, bedrooms_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])

print("Intercept: " + str(bedrooms_intercept))
print("Slope: " + str(bedrooms_slope))

Intercept: 109473.1776229596
Slope: 127588.95293398784


**Test your linear regression algorithm**

Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the test data (remember this data wasn't involved in learning the model). Compute the RSS from predicting prices using bedrooms and from predicting prices using square-feet.

**Quiz Question: Which model (square-feet or bedrooms) has the lowest RSS on the test data?**

In [52]:
# Compute RSS when using bedrooms on TEST data:
rss_prices_on_bedrooms = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'],
                                                     bedrooms_intercept, bedrooms_slope)

In [53]:
# Compute RSS when using squarefeet on TEST data:
rss_prices_on_sqft = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_intercept, sqft_slope)

In [54]:
rss_prices_on_bedrooms < rss_prices_on_sqft

False
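
The boolean above answers the quiz question, but seeing both RSS values side by side shows how large the gap is. The sketch below is one possible way to report them. It runs on a tiny made-up test set with hypothetical model parameters, not on the King County data, and the `models` dictionary is a name introduced here for illustration.

```python
import numpy as np

def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    """RSS of a simple linear model, restated from the earlier cell."""
    residuals = output - (intercept + slope * input_feature)
    return (residuals * residuals).sum()

# Made-up test set (illustrative only, NOT the King County data)
price = np.array([200.0, 310.0, 390.0, 520.0])
sqft = np.array([1.0, 2.0, 3.0, 4.0])
bedrooms = np.array([2.0, 3.0, 3.0, 3.0])

# Hypothetical fitted parameters: name -> (feature, intercept, slope)
models = {
    "sqft": (sqft, 100.0, 100.0),
    "bedrooms": (bedrooms, 50.0, 100.0),
}

# Evaluate every model on the same held-out outputs
rss_by_model = {
    name: get_residual_sum_of_squares(feature, price, intercept, slope)
    for name, (feature, intercept, slope) in models.items()
}

for name, rss in rss_by_model.items():
    print(f"{name}: RSS = {rss:.3e}")

best = min(rss_by_model, key=rss_by_model.get)
print("Lowest RSS:", best)
```

In the notebook itself, the same pattern would take `test_data['sqft_living']` and `test_data['bedrooms']` as features together with the parameters estimated on `train_data`.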