In [1]:
import turicreate as tc

In [2]:
sales = tc.load_sframe("home_data.sframe")

# 1. Selection and summary statistics:  
In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price.  Now, take the sales data, select only the houses with this zip code, and compute the average price.  Save this result to answer the quiz at the end.

In [5]:
tc.show(sales['zipcode'], sales['price'])

In [19]:
sales.groupby(key_column_names='zipcode', operations={'avg': tc.aggregate.AVG('price')}).sort('avg', ascending=False)

zipcode,avg
98039,2160606.6
98004,1355927.0977917982
98040,1194230.0035461
98112,1095499.36802974
98102,901258.238095238
98109,879623.6238532111
98105,862825.2314410482
98006,859684.7630522088
98119,849448.0108695652
98005,810164.880952381


In [20]:
subsales = sales[sales['zipcode']=='98039']

In [21]:
subsales['price'].mean()

2160606.5999999996
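The same select-then-average pattern can be checked without Turi Create. Below is a minimal pure-Python sketch on a made-up table: the rows and the helper name `average_price_by_zip` are hypothetical and are not taken from `home_data.sframe`. It groups prices by zip code, picks the zip code with the highest mean, and then averages only the rows for that zip code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows standing in for the sales data (not real values).
toy_sales = [
    {"zipcode": "98039", "price": 2000000.0},
    {"zipcode": "98039", "price": 2400000.0},
    {"zipcode": "98004", "price": 1300000.0},
    {"zipcode": "98004", "price": 1500000.0},
    {"zipcode": "98040", "price": 1200000.0},
]


def average_price_by_zip(rows):
    """Return a dict mapping each zip code to its mean sale price."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["zipcode"]].append(row["price"])
    return {zipcode: mean(values) for zipcode, values in prices.items()}


averages = average_price_by_zip(toy_sales)
top_zip = max(averages, key=averages.get)

# Select only the rows for the top zip code and average their prices.
top_rows = [row for row in toy_sales if row["zipcode"] == top_zip]
top_mean = mean(row["price"] for row in top_rows)
print(top_zip, top_mean)
```

The filtered mean equals the group-by mean for that zip code, which is why the value in `In [21]` matches the first row of the table from `In [19]` up to floating-point rounding.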

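# 2. Filtering data:
One of the key features we used in our model was the number of square feet of living space ('sqft_living') in the house. For this part, we are going to use the idea of filtering (selecting) data. The cells below build a mask for houses whose 'sqft_living' is strictly between 2000 and 4000 sq ft and compute the fraction of all houses that fall in that range.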

In [22]:
mask = (sales['sqft_living']>2000) & (sales['sqft_living']<4000)

In [37]:
mask.sum() / sales['sqft_living'].shape[0]

0.4215518437977143
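The fraction above relies on a boolean mask whose true entries count as 1 when summed. Here is a small pure-Python sketch of that idea, assuming a made-up list of living areas (the values in `toy_sqft` are hypothetical, not taken from the dataset).

```python
# Hypothetical living-area values in square feet (not real data).
toy_sqft = [1180, 2570, 770, 1960, 3680, 5420, 2000, 4000, 2100, 3999]

# Strict inequalities, as in the mask above: exactly 2000 and 4000 are excluded.
mask = [2000 < value < 4000 for value in toy_sqft]

# True counts as 1, so summing the mask counts the matching rows.
fraction = sum(mask) / len(toy_sqft)
print(fraction)
```

Whether the endpoints are included depends on the comparison operators; switching to `<=` would also count houses of exactly 2000 or 4000 sq ft.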

# 3. Building a regression model with several more features:
In the sample notebook, we built two regression models to predict house prices: one using just 'sqft_living' and the other using a few more features. We called this set `my_features`:

In [38]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

Now, going back to the original dataset, you will build a model using the following features:

In [39]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]

**Compute the RMSE** (root mean squared error) on the test_data for the model using just my_features, and for the one using advanced_features.

**Note 1: both models must be trained on the original sales train dataset, not the one filtered on `sqft_living`.**

Note 2: when doing the train-test split, make sure you use seed=0, so you get the same training and test sets, and thus the same results, as we do.

Note 3: in the module we discussed residual sum of squares (RSS) as an error metric for regression, but Turi Create uses root mean squared error (RMSE). These are two common measures of error in regression, and RMSE is simply the square root of the mean RSS:

$$\mathrm{RMSE} = \sqrt{\frac{\mathrm{RSS}}{N}}$$

where N is the number of data points. RMSE can be more intuitive than RSS, since its units are the same as those of the target column in the data (in our case, dollars), and it does not grow with the number of data points as RSS does.
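To make the relationship between RSS and RMSE concrete, here is a minimal pure-Python sketch. The target and prediction values are hypothetical, and the helper names `rss` and `rmse` are illustrative only; they are not Turi Create functions.

```python
import math

# Hypothetical targets and predictions in dollars (not real data).
actual = [300000.0, 450000.0, 500000.0, 650000.0]
predicted = [310000.0, 430000.0, 520000.0, 640000.0]


def rss(y_true, y_pred):
    """Residual sum of squares: sum of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def rmse(y_true, y_pred):
    """Root mean squared error: square root of RSS divided by N."""
    return math.sqrt(rss(y_true, y_pred) / len(y_true))


rss_once = rss(actual, predicted)
rmse_once = rmse(actual, predicted)

# Repeating the same data doubles RSS but leaves RMSE unchanged.
rss_twice = rss(actual * 2, predicted * 2)
rmse_twice = rmse(actual * 2, predicted * 2)
print(rss_once, rss_twice, rmse_once, rmse_twice)
```

Duplicating every data point doubles the RSS while the RMSE stays the same, which illustrates why RMSE is easier to compare across datasets of different sizes.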

(Important note: when answering the question below using Turi Create, make sure you pass the parameter **validation_set=None** when you call the **linear_regression.create()** function, as done in the notebook you downloaded above. By default, Turi Create regression sets aside a small random subset of the data to validate some parameters. This process can cause fluctuations in the final RMSE, so we will avoid it to make sure everyone gets the same answer.)

**What is the difference in RMSE between the model trained with my_features and the one trained with advanced_features? Save this result to answer the quiz at the end.**

In [46]:
training_set, test_set = sales.random_split(.8,seed=0)

In [48]:
my_model = tc.linear_regression.create(training_set,target='price',features=my_features, validation_set=None)

In [49]:
my_model.evaluate(test_set)

{'max_error': 3152242.7848689733, 'rmse': 180439.07296639978}

In [50]:
advanced_model = tc.linear_regression.create(training_set,target='price',features=advanced_features, validation_set=None)

In [51]:
advanced_model.evaluate(test_set)

{'max_error': 3170363.1813858226, 'rmse': 155269.6579282571}

In [53]:
advanced_model.evaluate(test_set)['rmse'] - my_model.evaluate(test_set)['rmse']

-25169.415038142673
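The negative sign means the advanced model has the lower test RMSE. As a rough way to put the gap in context, the sketch below recomputes it from the two `rmse` values printed above and expresses it as a fraction of the simpler model's error. The variable names are illustrative, and the relative reduction is an added interpretation rather than part of the assignment.

```python
# RMSE values copied from the evaluate() outputs above.
rmse_my_features = 180439.07296639978
rmse_advanced_features = 155269.6579282571

# Positive value: how many dollars of RMSE the extra features remove.
absolute_reduction = rmse_my_features - rmse_advanced_features

# Same gap expressed relative to the simpler model's RMSE.
relative_reduction = absolute_reduction / rmse_my_features
print(f"{absolute_reduction:.2f} dollars, {relative_reduction:.1%}")
```

On these numbers the extra features cut the test RMSE by roughly 14%.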