# Fire up GraphLab Create
(See [Getting Started with SFrames](../Week%201/Getting%20Started%20with%20SFrames.ipynb) for setup instructions)

In [1]:
import graphlab

# Load some house sales data

The dataset contains house sales from King County, the region where the city of Seattle, WA is located.

In [2]:
sales = graphlab.SFrame('home_data.gl/')


# Question 1

In [3]:
highest = sales[sales['zipcode'] == '98039']

In [4]:
highest['price'].mean()

2160606.5999999996
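
The filter above compares against the string `'98039'`, which suggests the `zipcode` column holds strings rather than integers. As a rough illustration of the same filter-then-average step, here is a minimal plain-Python sketch. The rows in `toy_sales` are made up, and `mean_price_for_zipcode` is a hypothetical helper, not part of GraphLab Create.

```python
# Hypothetical toy rows standing in for the SFrame; the values are made up.
# Zip codes are stored as strings, matching the comparison in the cell above.
toy_sales = [
    {'zipcode': '98039', 'price': 2000000.0},
    {'zipcode': '98039', 'price': 1000000.0},
    {'zipcode': '98004', 'price': 500000.0},
]


def mean_price_for_zipcode(rows, zipcode):
    """Average price over the rows whose zipcode equals the given value."""
    prices = [row['price'] for row in rows if row['zipcode'] == zipcode]
    if not prices:
        return None  # no matching rows, so the mean is undefined
    return sum(prices) / len(prices)


mean_as_string = mean_price_for_zipcode(toy_sales, '98039')
# In plain Python a string never equals an integer, so this matches no rows.
mean_as_int = mean_price_for_zipcode(toy_sales, 98039)
```

In this sketch, passing the integer `98039` selects nothing, which is why the type of the value used in the comparison matters.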

# Question 2

In [5]:
filter_data = sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] <= 4000)]

In [6]:
float(len(filter_data)) / len(sales)

0.42187572294452413
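
The two cells above keep the rows with `sqft_living` strictly greater than 2000 and at most 4000, then divide the number of kept rows by the total. A minimal plain-Python sketch of that calculation follows. The toy rows are made up, and `fraction_in_range` is a hypothetical helper used only for illustration.

```python
# Hypothetical toy rows standing in for the SFrame; the values are made up.
toy_sales = [
    {'sqft_living': 1500},
    {'sqft_living': 2000},  # excluded: the lower bound is strict
    {'sqft_living': 2500},
    {'sqft_living': 4000},  # included: the upper bound is inclusive
    {'sqft_living': 4500},
]


def fraction_in_range(rows, column, low, high):
    """Fraction of rows with low < row[column] <= high."""
    matching = [row for row in rows if low < row[column] <= high]
    # float() keeps the division exact under Python 2 as well as Python 3.
    return float(len(matching)) / len(rows)


toy_fraction = fraction_in_range(toy_sales, 'sqft_living', 2000, 4000)
```

The explicit `float(...)` mirrors the notebook cell: under Python 2, dividing two integers would otherwise truncate the result to 0.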

In [7]:
graphlab.canvas.set_target('ipynb')

# Create a simple regression model of sqft_living to price

Split data into training and testing.  
We use seed=0 so that everyone running this notebook gets the same results.  In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).  

In [8]:
train_data,test_data = sales.random_split(.8, seed=0)
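
To illustrate why a fixed seed makes the split reproducible, here is a minimal sketch of a seeded random split in plain Python. It is not GraphLab Create's implementation: `seeded_random_split` is a hypothetical function that sends each row to the first list with probability `fraction`, so the split sizes are only approximately 80/20.

```python
import random


def seeded_random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # stand-in for row indices
train_a, test_a = seeded_random_split(rows, 0.8, seed=0)
train_b, test_b = seeded_random_split(rows, 0.8, seed=0)
# The same seed yields the same split on every run.
same_split = (train_a == train_b) and (test_a == test_b)
```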

## Build the regression model using only sqft_living as a feature

In [9]:
sqft_model = graphlab.linear_regression.create(train_data, target='price', 
                                               features=['sqft_living'], 
                                               validation_set=None)
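
With a single feature, a least-squares line can be fitted in closed form. The sketch below shows that textbook calculation in plain Python on made-up data. It is an illustration under that assumption only: GraphLab Create's solver and default settings may differ, so its fitted coefficients need not match a plain least-squares fit exactly. `fit_simple_regression` is a hypothetical helper.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data that lies exactly on price = 50000 + 200 * sqft.
toy_sqft = [1000.0, 1500.0, 2000.0, 2500.0]
toy_price = [250000.0, 350000.0, 450000.0, 550000.0]
intercept, slope = fit_simple_regression(toy_sqft, toy_price)
```

Because the toy points lie on a straight line, the fit recovers that line's intercept and slope.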

# Evaluate the simple model

In [10]:
print test_data['price'].mean()

543054.042563


In [11]:
print sqft_model.evaluate(test_data)

{'max_error': 4143550.8825285914, 'rmse': 255191.02870527367}


RMSE of about \$255,191!
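
The evaluation output above reports two keys, `max_error` and `rmse`. Assuming they carry their usual meanings (the largest absolute prediction error and the root mean squared error), they could be computed as in the sketch below. The toy prices are made up, and `evaluate_predictions` is a hypothetical helper, not the GraphLab Create method.

```python
import math


def evaluate_predictions(actual, predicted):
    """Return the largest absolute error and the root mean squared error."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mean_squared_error = sum(e ** 2 for e in errors) / float(len(errors))
    return {'max_error': max(errors), 'rmse': math.sqrt(mean_squared_error)}


# Made-up actual and predicted prices.
toy_actual = [300000.0, 500000.0, 700000.0]
toy_predicted = [330000.0, 460000.0, 700000.0]
toy_metrics = evaluate_predictions(toy_actual, toy_predicted)
```

Since RMSE is in the same units as the target, it can be read directly against the mean test price printed earlier.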

# Let's show what our predictions look like

Matplotlib is a Python plotting library.  You can install it with:

`pip install matplotlib`

In [12]:
import matplotlib.pyplot as plt
%matplotlib inline

Above:  blue dots are original data, green line is the prediction from the simple regression.

Below: we can view the learned regression coefficients. 

# Question 3

In [13]:
advanced_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 
                     'zipcode','condition','grade','waterfront', 'view','sqft_above',
                     'sqft_basement','yr_built', 'yr_renovated','lat', 'long', 'sqft_living15','sqft_lot15']

In [14]:
advanced_model = graphlab.linear_regression.create(train_data,target='price',features=advanced_features,
                                                   validation_set=None)

In [15]:
print advanced_model.evaluate(test_data)

{'max_error': 3556849.413848093, 'rmse': 156831.11680191013}


# Build a regression model with more features

In [17]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [18]:
my_features_model = graphlab.linear_regression.create(train_data,target='price',features=my_features,validation_set=None)

In [19]:
print my_features

['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']


## Comparing the results of the simple model with the model that uses more features

In [20]:
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)

{'max_error': 4143550.8825285914, 'rmse': 255191.02870527367}
{'max_error': 3486584.509381928, 'rmse': 179542.43331269105}


The RMSE goes down from about \$255,191 to about \$179,542 with more features.
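
To put the improvement in relative terms, the short sketch below computes the fractional drop in RMSE from the values printed by the evaluation cells in this notebook. `relative_reduction` is a hypothetical helper added for illustration.

```python
# RMSE values copied from the evaluation outputs in this notebook.
sqft_rmse = 255191.02870527367         # sqft_model
my_features_rmse = 179542.43331269105  # my_features_model
advanced_rmse = 156831.11680191013     # advanced_model


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


my_features_gain = relative_reduction(sqft_rmse, my_features_rmse)
advanced_gain = relative_reduction(sqft_rmse, advanced_rmse)
```

On these numbers, the six-feature model lowers the test RMSE by a bit under 30% relative to the single-feature model, and the model with the full advanced feature list lowers it by roughly 39%.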