# Simple Linear Regression

In this project we will use data on house sales in King County to predict house prices using simple (one-input) linear regression. We will:
* Use Turi Create SArray and SFrame functions to compute important summary statistics
* Write a function to compute the Simple Linear Regression weights using the closed form solution
* Write a function to make predictions of the output given the input feature
* Turn the regression around to predict the input given the output
* Compare two different models for predicting house prices

In this project we are provided with some already completed code as well as some code that we should complete ourselves in order to answer the quiz questions. The code provided for completion is optional and is there to assist in solving the problems; feel free to ignore the helper code and write your own.

In [1]:
import turicreate as tc
import numpy as np

## Get the data

In [2]:
data = tc.SFrame("home_data.sframe")
train_data, test_data = data.random_split(0.8, seed=0)
data

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900.0,3.0,1.0,1180.0,5650.0,1.0,0
6414100192,2014-12-09 00:00:00+00:00,538000.0,3.0,2.25,2570.0,7242.0,2.0,0
5631500400,2015-02-25 00:00:00+00:00,180000.0,2.0,1.0,770.0,10000.0,1.0,0
2487200875,2014-12-09 00:00:00+00:00,604000.0,4.0,3.0,1960.0,5000.0,1.0,0
1954400510,2015-02-18 00:00:00+00:00,510000.0,3.0,2.0,1680.0,8080.0,1.0,0
7237550310,2014-05-12 00:00:00+00:00,1225000.0,4.0,4.5,5420.0,101930.0,1.0,0
1321400060,2014-06-27 00:00:00+00:00,257500.0,3.0,2.25,1715.0,6819.0,2.0,0
2008000270,2015-01-15 00:00:00+00:00,291850.0,3.0,1.5,1060.0,9711.0,1.0,0
2414600126,2015-04-15 00:00:00+00:00,229500.0,3.0,1.0,1780.0,7470.0,1.0,0
3793500160,2015-03-12 00:00:00+00:00,323000.0,3.0,2.5,1890.0,6560.0,2.0,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7.0,1180.0,0.0,1955.0,0.0,98178,47.51123398
0,3,7.0,2170.0,400.0,1951.0,1991.0,98125,47.72102274
0,3,6.0,770.0,0.0,1933.0,0.0,98028,47.73792661
0,5,7.0,1050.0,910.0,1965.0,0.0,98136,47.52082
0,3,8.0,1680.0,0.0,1987.0,0.0,98074,47.61681228
0,3,11.0,3890.0,1530.0,2001.0,0.0,98053,47.65611835
0,3,7.0,1715.0,0.0,1995.0,0.0,98003,47.30972002
0,3,7.0,1060.0,0.0,1963.0,0.0,98198,47.40949984
0,3,7.0,1050.0,730.0,1960.0,0.0,98146,47.51229381
0,3,7.0,1890.0,0.0,2003.0,0.0,98038,47.36840673

long,sqft_living15,sqft_lot15
-122.25677536,1340.0,5650.0
-122.3188624,1690.0,7639.0
-122.23319601,2720.0,8062.0
-122.39318505,1360.0,5000.0
-122.04490059,1800.0,7503.0
-122.00528655,4760.0,101930.0
-122.32704857,2238.0,6819.0
-122.31457273,1650.0,9711.0
-122.33659507,1780.0,8113.0
-122.0308176,2390.0,7570.0


## Simple linear regression function

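For reference, the closed-form least-squares solution for a single input can be written as follows. The notation $x_i$ (input), $y_i$ (output), $w_0$ (intercept), and $w_1$ (slope) is assumed here for illustration and does not appear in the notebook itself.

$$
w_1 = \frac{N\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{N\sum_i x_i^2 - \left(\sum_i x_i\right)^2},
\qquad
w_0 = \frac{\sum_i y_i}{N} - w_1\,\frac{\sum_i x_i}{N}
$$

In the function that follows, `a`, `b`, `c`, and `d` hold $\sum_i x_i$, $\sum_i y_i$, $\sum_i x_i y_i$, and $\sum_i x_i^2$ respectively, and the intercept is computed with $w_1$ substituted into the second expression.
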
In [3]:
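def simple_linear_regression(input_feature, output):
    """Return (intercept, slope) of the least-squares line, via the closed form."""
    N = len(input_feature)
    a = input_feature.sum()                   # sum of inputs
    b = output.sum()                          # sum of outputs
    c = (input_feature*output).sum()          # sum of input*output products
    d = (input_feature*input_feature).sum()   # sum of squared inputs
    slope = (N*c - a*b)/(N*d - a*a)
    # Equivalent to mean(output) - slope*mean(input), with the slope expanded.
    intercept = (b/N) - ((N*c*a - a*a*b)/(N*N*d - N*a*a))
    return (intercept , slope)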

#test_feature = tc.SArray(range(5))
#test_output = tc.SArray(1 + 1*test_feature)
#(test_intercept, test_slope) =  simple_linear_regression(test_feature, test_output)
#print("Intercept: " + str(test_intercept))
#print("Slope: " + str(test_slope))
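
The commented-out check above can also be tried without Turi Create. The following is a minimal plain-Python sketch of the same closed form, run on data that lies exactly on the line `y = 1 + 1*x`. The name `closed_form_fit` is hypothetical, and lists stand in for `SArray` objects.

```python
def closed_form_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


# Points on the line y = 1 + 1*x, so both weights should come out as 1.
test_feature = [0, 1, 2, 3, 4]
test_output = [1 + 1 * x for x in test_feature]
test_intercept, test_slope = closed_form_fit(test_feature, test_output)
print("Intercept:", test_intercept)
print("Slope:", test_slope)
```

If the function is correct, both printed values are 1.0, which gives a quick check before fitting the real data.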

In [4]:
input_feature = train_data["sqft_living"]
output = train_data["price"]
sqft_intercept , sqft_slope = simple_linear_regression(input_feature,output)
print("Intercept:",sqft_intercept,"\nSlope:",sqft_slope)

Intercept: -47116.07657494012 
Slope: 281.95883856769746


## Get-predictions function

In [5]:
def get_predictions(inputs, slope, intercept):
    # Wrap the inputs in an SArray so that plain Python lists are accepted too.
    predicted_outputs = intercept + slope*tc.SArray(inputs)
    return predicted_outputs

In [6]:
get_predictions([2650],sqft_slope,sqft_intercept)

dtype: float
Rows: 1
[700074.8456294581]

## Study results

In [11]:
predicted_prices = get_predictions(train_data["sqft_living"],sqft_slope,sqft_intercept)

# Residual sum of squares (RSS) of the sqft_living model on the training data
rss_sqft = ((train_data["price"] - predicted_prices)**2).sum()
rss_sqft

1201918356321967.5
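
The RSS computation above can be wrapped in a reusable helper. The following is a plain-Python sketch under the assumption that inputs and outputs are ordinary lists. The name `residual_sum_of_squares` and the toy data are illustrative and are not part of the notebook.

```python
def residual_sum_of_squares(inputs, outputs, intercept, slope):
    """Sum of squared differences between actual outputs and the line's predictions."""
    predictions = [intercept + slope * x for x in inputs]
    return sum((y - p) ** 2 for y, p in zip(outputs, predictions))


# Toy data lying exactly on y = 1 + 1*x.
toy_inputs = [0, 1, 2, 3, 4]
toy_outputs = [1, 2, 3, 4, 5]

# The true line leaves no residual at all.
perfect_rss = residual_sum_of_squares(toy_inputs, toy_outputs, 1.0, 1.0)

# Dropping the intercept to 0 makes every prediction off by 1, so the RSS is 5.
shifted_rss = residual_sum_of_squares(toy_inputs, toy_outputs, 0.0, 1.0)
```

A lower RSS on the same data means the line fits that data more closely, which is the basis for the model comparison later in this section.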

In [8]:
# Invert the fitted line: estimated square footage of a house priced at $800,000
sqft = (800000-sqft_intercept)/sqft_slope
sqft

3004.3962476159445
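
Turning the regression around amounts to solving `price = intercept + slope*sqft` for `sqft`. The sketch below wraps that step in a function. The name `inverse_regression_predictions` is hypothetical, and the weights are copied from the fitted output printed earlier in this notebook.

```python
def inverse_regression_predictions(output, intercept, slope):
    """Solve output = intercept + slope*input for the input."""
    return (output - intercept) / slope


# Weights copied from the sqft_living fit shown earlier.
sqft_intercept = -47116.07657494012
sqft_slope = 281.95883856769746

estimated_sqft = inverse_regression_predictions(800000, sqft_intercept, sqft_slope)

# Feeding the estimate back through the line should recover the price.
round_trip_price = sqft_intercept + sqft_slope * estimated_sqft
print(estimated_sqft, round_trip_price)
```

The round trip is a quick way to confirm that the inversion and the forward prediction use the same weights in the same roles.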

In [12]:
# Fit a second model that uses the number of bedrooms as the single input
bdr_intercept , bdr_slope = simple_linear_regression(train_data["bedrooms"],output)
print("bdrIntercept:",bdr_intercept,"\nbdrSlope:",bdr_slope)
# RSS of the bedrooms model on the training data
rss_bdr = ((train_data["price"] - get_predictions(train_data["bedrooms"],bdr_slope,bdr_intercept))**2).sum()
rss_bdr

bdrIntercept: 109473.18046928721 
bdrSlope: 127588.95217458384


2143244494226578.2

In [10]:
rss_bdr > rss_sqft

True