#Simple Linear Regression Model
In this notebook we will use data on house sales in King County to predict house prices using simple (one input) linear regression.



* Use turicreate SArray and SFrame functions to compute important summary statistics
* Write a function to compute the Simple Linear Regression weights using the closed form solution
* Write a function to make predictions of the output given the input feature
* Turn the regression around to predict the input given the output
* Compare two different models for predicting house prices



##Install turicreate

In [None]:
!pip install turicreate 

##To import turicreate 

In [None]:
import turicreate as tc
from turicreate import SFrame
from google.colab import files

## Uploading files and unzipping

In [None]:
uploaded = files.upload()

Saving home_data.sframe.zip to home_data.sframe (2).zip


In [None]:
!unzip home_data.sframe.zip

Archive:  home_data.sframe.zip
replace home_data.sframe/m_1ce96d9d245ca490.0000? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace __MACOSX/home_data.sframe/._m_1ce96d9d245ca490.0000? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


##Load house sales data

In [None]:
sales = tc.SFrame('home_data.sframe')

##Split data into training and testing


In [None]:
train_data,test_data = sales.random_split(.8,seed=0)

In [None]:
sales

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900.0,3.0,1.0,1180.0,5650.0,1.0,0
6414100192,2014-12-09 00:00:00+00:00,538000.0,3.0,2.25,2570.0,7242.0,2.0,0
5631500400,2015-02-25 00:00:00+00:00,180000.0,2.0,1.0,770.0,10000.0,1.0,0
2487200875,2014-12-09 00:00:00+00:00,604000.0,4.0,3.0,1960.0,5000.0,1.0,0
1954400510,2015-02-18 00:00:00+00:00,510000.0,3.0,2.0,1680.0,8080.0,1.0,0
7237550310,2014-05-12 00:00:00+00:00,1225000.0,4.0,4.5,5420.0,101930.0,1.0,0
1321400060,2014-06-27 00:00:00+00:00,257500.0,3.0,2.25,1715.0,6819.0,2.0,0
2008000270,2015-01-15 00:00:00+00:00,291850.0,3.0,1.5,1060.0,9711.0,1.0,0
2414600126,2015-04-15 00:00:00+00:00,229500.0,3.0,1.0,1780.0,7470.0,1.0,0
3793500160,2015-03-12 00:00:00+00:00,323000.0,3.0,2.5,1890.0,6560.0,2.0,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7.0,1180.0,0.0,1955.0,0.0,98178,47.51123398
0,3,7.0,2170.0,400.0,1951.0,1991.0,98125,47.72102274
0,3,6.0,770.0,0.0,1933.0,0.0,98028,47.73792661
0,5,7.0,1050.0,910.0,1965.0,0.0,98136,47.52082
0,3,8.0,1680.0,0.0,1987.0,0.0,98074,47.61681228
0,3,11.0,3890.0,1530.0,2001.0,0.0,98053,47.65611835
0,3,7.0,1715.0,0.0,1995.0,0.0,98003,47.30972002
0,3,7.0,1060.0,0.0,1963.0,0.0,98198,47.40949984
0,3,7.0,1050.0,730.0,1960.0,0.0,98146,47.51229381
0,3,7.0,1890.0,0.0,2003.0,0.0,98038,47.36840673

long,sqft_living15,sqft_lot15
-122.25677536,1340.0,5650.0
-122.3188624,1690.0,7639.0
-122.23319601,2720.0,8062.0
-122.39318505,1360.0,5000.0
-122.04490059,1800.0,7503.0
-122.00528655,4760.0,101930.0
-122.32704857,2238.0,6819.0
-122.31457273,1650.0,9711.0
-122.33659507,1780.0,8113.0
-122.0308176,2390.0,7570.0


In [None]:
# Let's compute the mean of the House Prices in King County in 2 different ways.
prices = sales['price'] # extract the price column of the sales SFrame -- this is now an SArray

# recall that the arithmetic average (the mean) is the sum of the prices divided by the total number of houses:
sum_prices = prices.sum()
num_houses = len(prices) # total number of houses (nnz() would count only the nonzero entries)
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean()
print ("average price via method 1: " + str(avg_price_1))
print ("average price via method 2: " + str(avg_price_2))

average price via method 1: 540088.1419053348
average price via method 2: 540088.1419053351


In [None]:
# Multiplying an SArray by a scalar scales every element:
half_prices = 0.5*prices
# Let's compute the sum of squares of price. Two SArrays of the same length can also be multiplied elementwise with *
prices_squared = prices*prices
sum_prices_squared = prices_squared.sum() # prices_squared is an SArray of the squares and we want to add them up.
print ("the sum of price squared is: " + str(sum_prices_squared))

the sum of price squared is: 9217325133550736.0


In [None]:
print (sales['price'].mean())

540088.1419053351


##Build a generic simple linear regression function

In [None]:
def simple_linear_regression(input_feature, output):
  Xi = input_feature
  Yi = output
  N = len(Xi)

  # compute the means of the input_feature and the output
  Ymean = Yi.mean()
  Xmean = Xi.mean()

  # compute the sum of the products Yi*Xi, and the product of the two sums divided by N
  SumYiXi = (Yi * Xi).sum()
  YiXiByN = (Yi.sum() * Xi.sum()) / N

  # compute the sum of the squared input_feature values, and the squared sum divided by N
  XiSq = (Xi * Xi).sum()
  XiXiByN = (Xi.sum() * Xi.sum()) / N

  # slope = (sum of products - product of sums / N) / (sum of squares - squared sum / N)
  slope = (SumYiXi - YiXiByN) / (XiSq - XiXiByN)

  # intercept = mean of the output - slope * mean of the input_feature
  intercept = Ymean - (slope * Xmean)
  return (intercept, slope)
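
For reference, the function above appears to implement the standard closed-form least-squares solution for a single input. Written out, with $x_i$ for the input feature, $y_i$ for the output, and $N$ for the number of observations (the symbols $w_0$ and $w_1$ for intercept and slope are notation introduced here, not names used in the code):

$$
w_1 = \frac{\sum_i x_i y_i - \frac{1}{N}\sum_i x_i \sum_i y_i}{\sum_i x_i^2 - \frac{1}{N}\left(\sum_i x_i\right)^2},
\qquad
w_0 = \bar{y} - w_1\,\bar{x}
$$

The numerator matches `SumYiXi - YiXiByN`, the denominator matches `XiSq - XiXiByN`, and the intercept matches `Ymean - slope * Xmean`.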

We can test that our function works by passing it data where we know the answer. In particular, we can generate a feature and put the output exactly on a line: output = 1 + 1*input_feature. Both the slope and the intercept should then be 1.

In [None]:
test_feature = tc.SArray(range(5))
test_output = tc.SArray(1 + 1*test_feature)
(test_intercept, test_slope) =  simple_linear_regression(test_feature, test_output)
print ("Intercept: " + str(test_intercept))
print ("Slope: " + str(test_slope))

Intercept: 1.0000000000000002
Slope: 1.0
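
As an extra sanity check, the same closed-form computation can be sketched on plain Python lists, without turicreate, using a different known line (output = 3 + 2*input). The names `closed_form_fit`, `check_intercept` and `check_slope` are hypothetical and used only for this illustration.

```python
def closed_form_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Same slope and intercept formulas as simple_linear_regression
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

# Points placed exactly on the line y = 3 + 2*x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 + 2.0 * x for x in xs]
check_intercept, check_slope = closed_form_fit(xs, ys)
print("Intercept: " + str(check_intercept))
print("Slope: " + str(check_slope))
```

If the printed values are close to 3 and 2, the formulas recover the line the data was generated from.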


Now that we know it works, let's build a regression model for predicting price based on sqft_living. Remember that we train on train_data!

In [None]:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])

print ("Intercept: " + str(sqft_intercept))
print ("Slope: " + str(sqft_slope))

Intercept: -47116.076574940584
Slope: 281.9588385676974


##Predicting Values


In [None]:
def get_regression_predictions(input_feature, intercept, slope):
    predicted_values = intercept + (slope * input_feature)
    return predicted_values

Now that we can calculate a prediction given the slope and intercept, let's make a prediction. Use (or alter) the following to find the estimated price for a house with 2650 square feet according to the square feet model we estimated above.

In [None]:
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print ("The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price))

The estimated price for a house with 2650 squarefeet is $700074.85


##Residual Sum of Squares

In [None]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    predicted_values = intercept + (slope * input_feature)
    residuals = output - predicted_values
    RSS = (residuals * residuals).sum()
    return(RSS)

In [None]:
print (get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope)) # should be 0.0

4.930380657631324e-32
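
The value printed above is tiny but not exactly 0.0, because the fitted intercept (1.0000000000000002) differs from 1 by floating-point rounding. A minimal sketch of checking "is the RSS zero?" with a tolerance instead of exact equality, on plain Python lists; the tolerance `1e-12` is an arbitrary choice made for this illustration.

```python
import math

# Same toy data as the test above: output = 1 + 1*input_feature
feature = [0.0, 1.0, 2.0, 3.0, 4.0]
output = [1.0 + x for x in feature]

# Intercept and slope as printed by the earlier test cell
intercept, slope = 1.0000000000000002, 1.0

# Residual sum of squares, computed the same way as get_residual_sum_of_squares
rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(feature, output))

# Compare against zero with an absolute tolerance rather than ==
rss_is_zero = math.isclose(rss, 0.0, abs_tol=1e-12)
print(rss, rss_is_zero)
```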


In [None]:
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)
print ('The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft))

The RSS of predicting Prices based on Square Feet is : 1201918356321966.2


##Predict the square feet given price
What if we want to predict the square feet given the price? Since we have the equation y = a + b*x, we can solve it for x. Given the intercept (a), the slope (b), and the price (y), the estimated square feet is x = (y - a)/b.

In [None]:
def inverse_regression_predictions(output, intercept, slope):
    # solve output = intercept + slope*input_feature for input_feature. Use this equation to compute the inverse predictions:
    estimated_feature = (output - intercept)/slope
    return estimated_feature

In [None]:
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print ("The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet))

The estimated squarefeet for a house worth $800000.00 is 3004
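
One way to check the inverse function is a round trip: predict a price from a size, then feed that price back in and confirm the original size comes out. The sketch below uses plain Python and the intercept and slope printed earlier for the sqft_living model; the helper names `predict_price` and `predict_sqft` are hypothetical stand-ins for the two notebook functions. Note that this is the algebraic inverse of the fitted line, which is not the same as fitting a new regression of square feet on price.

```python
# Intercept and slope printed above for the sqft_living model
sqft_intercept = -47116.076574940584
sqft_slope = 281.9588385676974

def predict_price(sqft, intercept, slope):
    # Forward prediction: price = intercept + slope * sqft
    return intercept + slope * sqft

def predict_sqft(price, intercept, slope):
    # Inverse prediction: solve price = intercept + slope * sqft for sqft
    return (price - intercept) / slope

price_for_2650 = predict_price(2650, sqft_intercept, sqft_slope)
recovered_sqft = predict_sqft(price_for_2650, sqft_intercept, sqft_slope)
print("Price for 2650 square feet: %.2f" % price_for_2650)
print("Square feet recovered from that price: %.2f" % recovered_sqft)
```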


##New Model: estimate prices from bedrooms
We have made one model for predicting house prices using square feet, but there are many other features in the sales SFrame. Use your simple linear regression function to estimate the regression parameters for predicting price based on the number of bedrooms. Use the training data!

In [None]:
# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'
bedrooms_intercept, bedrooms_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])

print ("Intercept: " + str(bedrooms_intercept))
print ("Slope: " + str(bedrooms_slope))

Intercept: 109473.1804692861
Slope: 127588.95217458377


##Test your Linear Regression Algorithm
Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the TEST data (remember this data wasn't involved in learning the model). Compute the RSS from predicting prices using bedrooms and from predicting prices using square feet.

In [None]:
# Compute RSS when using bedrooms on TEST data:
bedrooms_intercept, bedrooms_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])
rss_prices_on_bedrooms = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bedrooms_intercept, bedrooms_slope)
print ('The RSS of predicting Prices based on Bedrooms is : ' + str(rss_prices_on_bedrooms))

The RSS of predicting Prices based on Bedrooms is : 493364582868287.75


In [None]:
# Compute RSS when using square feet on TEST data:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])
rss_prices_on_sqft = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_intercept, sqft_slope)
print ('The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft))

The RSS of predicting Prices based on Square Feet is : 275402936247141.53


In [None]:
print ("The lowest RSS on TEST data: " + str(min(rss_prices_on_bedrooms,rss_prices_on_sqft) ))

The lowest RSS on TEST data: 275402936247141.53
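
Both RSS values are computed on the same test rows, so they can be compared directly. A small sketch that names the better feature and shows how far apart the two models are, using the test RSS values printed above; the dictionary `test_rss` is a hypothetical container introduced for this illustration.

```python
# Test-set RSS values copied from the outputs above
test_rss = {
    "bedrooms": 493364582868287.75,
    "sqft_living": 275402936247141.53,
}

# The feature whose model has the lowest RSS on the test data
best_feature = min(test_rss, key=test_rss.get)

# How many times larger the bedrooms RSS is than the sqft_living RSS
rss_ratio = test_rss["bedrooms"] / test_rss["sqft_living"]

print("Lowest test RSS: " + best_feature)
print("bedrooms RSS / sqft_living RSS: %.2f" % rss_ratio)
```

On these numbers the sqft_living model has the lower test RSS, which suggests it predicts price better than the bedrooms model on unseen data.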
