# Objectives:

In this project I used data on house sales in King County, where Seattle is located, to predict house prices using simple (one feature) linear regression. 

**In this project I:**

* Used SArray and SFrame functions to compute important summary statistics
* Wrote a function to compute the simple linear regression weights using the closed-form solution
* Wrote a function to make predictions of the output given the input feature
* Turned the regression around to predict the input feature given the output
* Compared two different models for predicting house prices


**Important notes:** 

For data manipulation, I used **SFrame**, an open-source, highly scalable Python library for data manipulation. An alternative is the **Pandas** library. A major advantage of SFrame over Pandas is that SFrame is not limited to datasets that fit in memory, so it can handle large datasets even on a laptop.

For matrix operations, I used **NumPy**, an open-source Python library that provides fast performance for data that fits in memory.

For plotting, I used Matplotlib, an open-source Python library with extensive plotting functionality.

As for the ML algorithms, I wanted, for self-education, to use **GraphLab Create**, a package developed by Dato (Seattle, WA, USA). A popular alternative is scikit-learn. GraphLab Create is more scalable than scikit-learn and simpler to use when the data are not numeric vectors. On the other hand, scikit-learn is open-source.

# Fire up GraphLab Create

In [39]:
import graphlab

# Load house sales data

The dataset contains house sales in King County, the region where the city of Seattle, WA is located.

In [40]:
graphlab.canvas.set_target('ipynb')

In [41]:
sales = graphlab.SFrame('kc_house_data.gl/')

# Split data into training and testing

I used seed=0 so that everyone running this notebook gets the same results. In practice, you may set your own random seed (or let GraphLab Create pick a random seed for you). I split the data into training and test sets containing 80% and 20% of the rows, respectively.

In [42]:
train_data,test_data = sales.random_split(.8,seed=0)
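
To illustrate why a fixed seed makes the split reproducible, here is a minimal sketch in plain Python. It is not GraphLab Create's implementation: the helper `random_split` below is a hypothetical stand-in that assumes each row is assigned independently at random, so the two parts are only approximately 80% and 20% of the rows.

```python
import random

def random_split(rows, fraction, seed):
    """Hypothetical helper: send each row to the first part with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed -> identical split on every run
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of the sales table
train_rows, test_rows = random_split(rows, 0.8, seed=0)
train_again, test_again = random_split(rows, 0.8, seed=0)

print(len(train_rows), len(test_rows))
print(train_rows == train_again)  # same seed, same split
```

Calling the helper with a different seed would generally produce a different partition of the same rows.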

In [43]:
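def simple_linear_regression(input_feature, output):
    # Means of the output (y) and of the input feature (x)
    y_mean = output.mean()
    x_mean = input_feature.mean()
    # Sum of the products of the deviations from the means
    numerator = (input_feature - x_mean) * (output - y_mean)
    numerator_sum = numerator.sum()
    # Sum of the squared deviations of the input feature
    denominator = (input_feature - x_mean) ** 2
    denominator_sum = denominator.sum()
    # Closed-form least-squares estimates
    slope = numerator_sum / denominator_sum
    intercept = y_mean - (slope * x_mean)

    return (intercept, slope)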
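
For reference, the computation in this function corresponds to the standard closed-form least-squares solution for a single feature. The symbols $w_0$ (intercept), $w_1$ (slope), $x_i$ (input feature), $y_i$ (output), and $N$ (number of rows) are notation introduced here, not names from the notebook:

$$
w_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
w_0 = \bar{y} - w_1 \bar{x}
$$

Here $\bar{x}$ and $\bar{y}$ are the means of the input feature and the output, matching `x_mean` and `y_mean` in the code.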

* The function "simple_linear_regression" accepts the 'input_feature' and 'output' columns and returns the 'intercept' and 'slope' of the fitted regression model.
* The function "get_regression_predictions" accepts 'input_feature' and the learned 'intercept' and 'slope', and returns the prediction 'predicted_output'.
* The function "get_residual_sum_of_squares" accepts the 'input_feature' and 'output' columns and the regression parameters 'intercept' and 'slope', and returns the Residual Sum of Squares (RSS).
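
The three functions described above can be tried without GraphLab Create. The following is a minimal sketch that uses plain Python lists instead of SArray columns, on a small made-up dataset (`toy_x` and `toy_y` are illustrative values, not taken from the house sales data).

```python
def simple_linear_regression(xs, ys):
    """Closed-form least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean
    return intercept, slope

def get_regression_predictions(xs, intercept, slope):
    """Predicted output for every input value."""
    return [intercept + slope * x for x in xs]

def get_residual_sum_of_squares(xs, ys, intercept, slope):
    """Sum of squared differences between actual and predicted outputs."""
    predictions = get_regression_predictions(xs, intercept, slope)
    return sum((y - p) ** 2 for y, p in zip(ys, predictions))

# Made-up data lying close to a straight line
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [5.1, 6.9, 9.2, 10.8, 13.0]

toy_intercept, toy_slope = simple_linear_regression(toy_x, toy_y)
toy_rss = get_residual_sum_of_squares(toy_x, toy_y, toy_intercept, toy_slope)
print(toy_intercept, toy_slope, toy_rss)
```

Because the fitted line minimizes the RSS on the data it was fitted to, nudging the slope or intercept away from the fitted values should only increase `toy_rss`.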

# Model 1: Predicting Price Using Sqft as Feature

In [44]:
squarefeet_intercept,squarefeet_slope=simple_linear_regression(
    train_data['sqft_living'],train_data['price'])

Save the value of the slope and intercept for later.

In [45]:
squarefeet_intercept, squarefeet_slope

(-47116.07657494082, 281.9588385676973)

In [46]:
def get_regression_predictions(input_feature, intercept, slope):
    predicted_output=intercept+(slope*input_feature)
    return predicted_output 

# Prediction for a House with 2,650 Square Feet

In [47]:
get_regression_predictions(2650,squarefeet_intercept,squarefeet_slope)

700074.845629457
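
The prediction can be checked by hand, since it is just the intercept plus the slope times the square footage. The sketch below copies the two fitted values from the output shown earlier; the variable names `sqft` and `predicted_price` are introduced here for illustration.

```python
# Fitted values copied from the notebook output above
squarefeet_intercept = -47116.07657494082
squarefeet_slope = 281.9588385676973

sqft = 2650
predicted_price = squarefeet_intercept + squarefeet_slope * sqft
print(round(predicted_price, 2))  # 700074.85
```

Read this way, the slope says that each additional square foot of living area raises the predicted price by about 282.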

In [48]:
sales.show(view="Scatter Plot", x="sqft_living", y="price")

In [49]:
def get_residual_sum_of_squares(input_feature, output, intercept,slope):
    RSSone=(output-(intercept+(slope*input_feature)))**2
    RSS=RSSone.sum()
    return RSS

In [50]:
get_residual_sum_of_squares(train_data['sqft_living'],train_data['price'],squarefeet_intercept,squarefeet_slope)

1201918356321967.5

# Predicting the input from the output:

In [51]:
def inverse_regression_predictions(output, intercept, slope):
    estimated_input=(output-intercept)/slope
    return estimated_input

In [52]:
inverse_regression_predictions(800000,squarefeet_intercept,squarefeet_slope)

3004.396247615949
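
As a sanity check, inverting the line and then predicting again should return the original price. The sketch below uses the fitted values copied from the earlier output; `predict_price` and `estimate_sqft` are hypothetical names for the two directions of the same line.

```python
# Fitted values copied from the notebook output above
squarefeet_intercept = -47116.07657494082
squarefeet_slope = 281.9588385676973

def predict_price(sqft):
    """Forward direction: price from square feet."""
    return squarefeet_intercept + squarefeet_slope * sqft

def estimate_sqft(price):
    """Inverse direction: solve the same line for square feet."""
    return (price - squarefeet_intercept) / squarefeet_slope

estimated_sqft = estimate_sqft(800000)
round_trip_price = predict_price(estimated_sqft)
print(estimated_sqft, round_trip_price)
```

Note that this is an algebraic inversion of the fitted line. It is generally not the same as fitting a separate regression of square feet on price.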

# Model 2: Predicting Price Using Number of Bedrooms as Feature

In [53]:
bed_intercept,bed_slope=simple_linear_regression(
    train_data['bedrooms'],train_data['price'])

In [54]:
# RSS for test data-sqft model
RSS_living=get_residual_sum_of_squares(test_data['sqft_living'],
                test_data['price'],squarefeet_intercept,squarefeet_slope)

Model 1 and Model 2 are compared based on their RSS values on the test data. Model 1 (using sqft_living as the feature) shows the lower RSS value.

In [55]:
# RSS for test data-bedrooms model
RSS_bed=get_residual_sum_of_squares(test_data['bedrooms'],
                            test_data['price'],bed_intercept,bed_slope)

In [56]:
RSS_living-RSS_bed

-217961646621146.6
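
The same comparison can be sketched end to end on synthetic data. Everything below is an assumption made for illustration: the data are randomly generated (not the King County sales), the relationship between the columns is invented, and `fit_line` and `rss` are compact hypothetical helpers rather than the notebook's functions.

```python
import random

def fit_line(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope

def rss(xs, ys, intercept, slope):
    """Residual sum of squares of the line on the given data."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Synthetic data: price depends directly on sqft, and bedrooms is only
# a noisy proxy for sqft (an invented relationship, for illustration).
rng = random.Random(0)
sqft = [rng.uniform(500, 4000) for _ in range(1000)]
bedrooms = [max(1, round(s / 800 + rng.gauss(0, 1))) for s in sqft]
price = [280 * s - 47000 + rng.gauss(0, 50000) for s in sqft]

train, test = slice(0, 800), slice(800, 1000)  # fixed 80/20 split

sqft_fit = fit_line(sqft[train], price[train])
bed_fit = fit_line(bedrooms[train], price[train])

# Both models are fitted on the training rows and scored on the test rows
rss_sqft = rss(sqft[test], price[test], *sqft_fit)
rss_bed = rss(bedrooms[test], price[test], *bed_fit)
print(rss_sqft - rss_bed)  # negative when the sqft model fits better
```

Scoring both models on the same held-out rows is what makes the two RSS values directly comparable.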

# Conclusion:

In conclusion, Model 1, which uses sqft_living as its input, has a lower RSS on the test data than Model 2, which uses bedrooms as its input.