### Load data

In [1]:
import graphlab

In [3]:
sales = graphlab.SFrame('../week2/home_data.gl/')
sales.head(2)

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900,3,1.0,1180,5650,1,0
6414100192,2014-12-09 00:00:00+00:00,538000,3,2.25,2570,7242,2,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7,1180,0,1955,0,98178,47.51123398
0,3,7,2170,400,1951,1991,98125,47.72102274

long,sqft_living15,sqft_lot15
-122.25677536,1340.0,5650.0
-122.3188624,1690.0,7639.0


### 1. Selection and summary statistics: 
In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price. Now, take the sales data, select only the houses with this zip code, and compute the average price. Save this result to answer the quiz at the end.
- **A: 2160606.5999999996**

In [5]:
graphlab.canvas.set_target('ipynb')
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')

In [14]:
highest_zipcode = '98039'
sales_zipcode_98039 = sales[sales['zipcode']==highest_zipcode]
sales_zipcode_98039.head(1)

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
3625049014,2014-08-29 00:00:00+00:00,2950000,4,3.5,4860,23885,2,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,12,4860,0,1996,0,98039,47.61717049

long,sqft_living15,sqft_lot15
-122.23040939,3580.0,16054.0


In [11]:
len(sales_zipcode_98039)

50

In [13]:
sales_zipcode_98039['price'].mean()

2160606.5999999996
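The select-then-aggregate pattern used above does not depend on GraphLab. As a minimal sketch in plain Python, assuming a made-up three-row list of dictionaries (the rows and the helper `mean_price_for_zip` are hypothetical and not part of the dataset or the GraphLab API):

```python
# Toy rows standing in for the SFrame; the values are invented for illustration.
toy_sales = [
    {"zipcode": "98039", "price": 2000000.0},
    {"zipcode": "98039", "price": 3000000.0},
    {"zipcode": "98178", "price": 221900.0},
]

def mean_price_for_zip(rows, zipcode):
    """Average the price of the rows whose zipcode matches (hypothetical helper)."""
    prices = [row["price"] for row in rows if row["zipcode"] == zipcode]
    if not prices:
        raise ValueError("no rows for zipcode %r" % zipcode)
    return sum(prices) / len(prices)

mean_98039 = mean_price_for_zip(toy_sales, "98039")
```

The zip code is compared as a string here, mirroring the `'98039'` literal used in the cell above.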

### 2. Filtering data:
One of the key features we used in our model was the number of square feet of living space (‘sqft_living’) in the house. For this part, we are going to use the idea of filtering (selecting) data.

In particular, we are going to use logical filters to select rows of an SFrame. You can find more info in the [Logical Filter section of this documentation](https://turi.com/products/create/docs/generated/graphlab.SFrame.html). Using such filters, first select the houses that have ‘sqft_living’ higher than 2000 sqft but no larger than 4000 sqft. What fraction of all the houses have ‘sqft_living’ in this range? Save this result to answer the quiz at the end.
- **A: 0.4266413732475825**

In [18]:
# Select houses with sqft_living between 2000 and 4000 sqft: sales[ (logical filter) ]
# Note: >= also keeps houses of exactly 2000 sqft; use > for a strict "higher than 2000".

filtered_sales = sales[(sales['sqft_living'] >= 2000) & (sales['sqft_living'] <= 4000)]
filtered_sales.head(2)

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
6414100192,2014-12-09 00:00:00+00:00,538000,3,2.25,2570,7242,2,0
1736800520,2015-04-03 00:00:00+00:00,662500,3,2.5,3560,9796,1,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7,2170,400,1951,1991,98125,47.72102274
0,3,8,1860,1700,1965,0,98007,47.60065993

long,sqft_living15,sqft_lot15
-122.3188624,1690.0,7639.0
-122.14529566,2210.0,8925.0


In [19]:
len(filtered_sales)

9221

In [20]:
# Fraction of all houses in the range; float() guards against integer division in Python 2
float(len(filtered_sales)) / len(sales)

0.4266413732475825
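The boundaries matter in this kind of filter: "higher than 2000" is a strict inequality, while "no larger than 4000" is inclusive. A small sketch in plain Python, with invented values and a hypothetical helper `fraction_in_range`, shows how choosing `>` versus `>=` on the lower bound changes the result:

```python
# Invented living-area values; two of them sit exactly on a boundary.
toy_sqft = [1180, 2000, 2570, 3560, 4000, 4860]

def fraction_in_range(values, low, high, include_low=False):
    """Fraction of values in (low, high], or in [low, high] if include_low is True."""
    if include_low:
        selected = [v for v in values if low <= v <= high]
    else:
        selected = [v for v in values if low < v <= high]
    # float() avoids integer division under Python 2
    return float(len(selected)) / len(values)

strict = fraction_in_range(toy_sqft, 2000, 4000)
inclusive = fraction_in_range(toy_sqft, 2000, 4000, include_low=True)
```

With these toy values the strict filter keeps three of six entries and the inclusive one keeps four, so the two fractions differ whenever a value sits exactly on the lower bound.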

### 3. Building a regression model with several more features: 
In the sample notebook, we built two regression models to predict house prices: one using just ‘sqft_living’ and the other using a few more features. We called this set:

In [23]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

Now, going back to the original dataset, you will build a model using the following features:

In [22]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]

**Compute the RMSE** (root mean squared error) on the test_data for the model using just my_features, and for the one using advanced_features.

Note 1: **both models must be trained on the original sales dataset, not the filtered one.**

Note 2: when doing the train-test split, make sure you use seed=0, so you get the same training and test sets, and thus results, as we do.
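Why a fixed seed gives everyone the same split can be illustrated without GraphLab. The sketch below uses Python's standard `random` module and a hypothetical helper `seeded_split`; it is not how `SFrame.random_split` is implemented, only an illustration of the idea that the same seed reproduces the same random assignment:

```python
import random

def seeded_split(rows, fraction, seed):
    """Send each row to the first part with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(100))
train_a, test_a = seeded_split(rows, 0.8, seed=0)
train_b, test_b = seeded_split(rows, 0.8, seed=0)
```

Both calls return identical train and test parts because they start from the same seed; a different seed would generally produce a different split and therefore a different RMSE.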

Note 3: in the module we discussed residual sum of squares (RSS) as an error metric for regression, but GraphLab Create uses root mean squared error (RMSE). These are two common measures of regression error, and RMSE is simply the square root of the mean RSS, where $N$ is the number of data points:

$$\mathrm{RMSE} = \sqrt{\frac{\mathrm{RSS}}{N}}$$
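The relationship between RSS and RMSE can be checked numerically. This is a minimal sketch in plain Python with invented actual and predicted values; the helpers `rss` and `rmse` are illustrative and are not GraphLab functions:

```python
import math

def rss(actual, predicted):
    """Residual sum of squares: sum of squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean squared error: square root of RSS divided by the number of points."""
    return math.sqrt(rss(actual, predicted) / float(len(actual)))

# Invented values for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]

toy_rss = rss(actual, predicted)    # 1 + 0 + 4 + 0 = 5
toy_rmse = rmse(actual, predicted)  # sqrt(5 / 4)
```

Because RMSE divides by the number of points and takes a square root, it is expressed in the same units as the target (dollars, for house prices), whereas RSS grows with the size of the dataset.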

(Important note: when answering the question below using GraphLab Create, make sure you pass the parameter **validation_set=None** when you call the linear_regression.create() function, as done in the notebook you downloaded above. By default, regression in GraphLab Create sets aside a small random subset of the data to validate some parameters. This process can cause fluctuations in the final RMSE, so we will avoid it to make sure everyone gets the same answer.)

**What is the difference in RMSE between the model trained with my_features and the one trained with advanced_features? Save this result to answer the quiz at the end.**

- **A: 22711.316510500183** 

In [26]:
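# Split the original (unfiltered) sales data with seed=0, as required by Note 2.
# The 0.8 training fraction is an assumption taken from the sample notebook.
train_data, test_data = sales.random_split(.8, seed=0)

# Build the regression models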

my_features_model = graphlab.linear_regression.create(
    train_data, target='price', features=my_features, 
    validation_set=None
)

advanced_features_model = graphlab.linear_regression.create(
    train_data, target='price', features=advanced_features,
    validation_set=None
)

In [28]:
# Evaluate both models on the test data (reports max_error and RMSE)
print my_features_model.evaluate(test_data)
print advanced_features_model.evaluate(test_data)

{'max_error': 3486584.509381705, 'rmse': 179542.4333126903}
{'max_error': 3556849.413858208, 'rmse': 156831.1168021901}


In [31]:
# difference in RMSE between the models
my_features_model.evaluate(test_data)['rmse'] - advanced_features_model.evaluate(test_data)['rmse']

22711.316510500183