This workbook adapts the Dataquest lesson on evaluating k-nearest neighbors (knn) model performance to the garfield dataset.

Let's modify the predict_price function so that it uses only the rows in the training set, rather than the full dataset, to find the nearest neighbors, average their price values, and return the predicted price. We'll then use this function to predict the price for just the rows in the test set. Once we have the predicted prices, we can compare them with the true prices and start to gauge the model's effectiveness.
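
As a rough illustration of that idea on made-up numbers, the sketch below finds the training rows closest to a new value, averages their prices, and returns the result. The toy_train frame, its values, and the knn_predict helper with its k parameter are hypothetical and not part of the lesson.

```python
import pandas as pd

# Hypothetical toy training data; these values are invented for illustration.
toy_train = pd.DataFrame({
    'Bedrooms':   [1, 2, 3, 3, 5, 6],
    'Sale_Price': [100, 150, 200, 220, 300, 400],
})

def knn_predict(train, feature, target, new_value, k=3):
    """Average the target of the k training rows closest to new_value."""
    distance = (train[feature] - new_value).abs()
    # mergesort is a stable sort, so tied rows keep their original order.
    nearest = distance.sort_values(kind='mergesort').index[:k]
    return train.loc[nearest, target].mean()

# The three closest rows have 3, 3 and 2 bedrooms (prices 200, 220, 150).
toy_prediction = knn_predict(toy_train, 'Bedrooms', 'Sale_Price', new_value=3)
print(toy_prediction)  # 190.0
```

Only rows from the training frame are ever consulted, which is the property the modified predict_price needs to have.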

To start, we assign the first 75% of the rows in the shuffled garfield DataFrame (dc_listings in the original lesson) to train_df and the last 25% of the rows to test_df.

In [1]:
import pandas as pd
import numpy as np
np.random.seed(1)
garfield = pd.read_csv('garfield_no_zeros_clean.csv')
garfield = garfield.loc[np.random.permutation(len(garfield))] # Shuffle the data frame
print(garfield.shape)

(1936, 13)


75% of 1936 total rows is 1452, so we will use this information to construct the training and testing datasets.

In [2]:
train_df = garfield.iloc[0:1452]
test_df = garfield.iloc[1452:]
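
The hard-coded 1452 only holds for this particular file. A minimal sketch of deriving the cut-off from the row count instead, using a small stand-in DataFrame (toy_df, train_fraction, and split_at are hypothetical names, not part of the lesson). Taking explicit copies is also one way to keep later column assignments from raising pandas' SettingWithCopyWarning.

```python
import numpy as np
import pandas as pd

# Stand-in for the shuffled data set; 8 rows so the 75/25 split is easy to check.
toy_df = pd.DataFrame({'Bedrooms': np.arange(8), 'Sale_Price': np.arange(8) * 1000})

train_fraction = 0.75
split_at = int(len(toy_df) * train_fraction)  # 6 for 8 rows; 1452 for 1936 rows

# .copy() gives independent DataFrames rather than slices of toy_df.
toy_train_df = toy_df.iloc[:split_at].copy()
toy_test_df = toy_df.iloc[split_at:].copy()

print(len(toy_train_df), len(toy_test_df))  # 6 2
```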

* Within the predict_price function, change the DataFrame that temp_df is assigned to. Change it from dc_listings (garfield in this workbook) to train_df, so only the training set is used.
* Use the Series method apply to pass all of the values in the accommodates column (Bedrooms in this workbook) from test_df through the predict_price function.
* Assign the resulting Series object to the predicted_price column in test_df.

In [3]:
def predict_price(new_listing):
    # DataFrame.copy() performs a deep copy by default, so train_df itself is left unchanged
    temp_df = train_df.copy()
    # Distance is the absolute difference in bedroom count (adapted to garfield data: Bedrooms)
    temp_df['distance'] = temp_df['Bedrooms'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    # Average the sale price of the 5 nearest neighbors (adapted to garfield data: Sale_Price)
    nearest_neighbor_prices = temp_df.iloc[0:5]['Sale_Price']
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price

test_df = test_df.copy() # Explicit copy so the assignment below does not trigger SettingWithCopyWarning
test_df['predicted_price'] = test_df['Bedrooms'].apply(predict_price)


We now need a metric that quantifies how good the predictions were on the test set. This class of metrics is called an error metric. As the name suggests, an error metric quantifies how far our predictions were from the actual values. In our case, the error metric tells us how far our predicted price values were from the actual price values for the rows in the test dataset.

We could start by calculating the difference between each predicted and actual value and then averaging these differences. This is referred to as mean error, but it isn't an effective error metric in most cases. Mean error lets positive and negative differences cancel each other out, whereas we're really interested in how far off the prediction is in either direction. If the true price was 200 dollars and the model predicted 210 or 190, it's off by 10 dollars either way.

We can instead use the mean absolute error, where we compute the absolute value of each error before we average all the errors.
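
To make the difference concrete, here is a small sketch using the 200-dollar example with one prediction of 190 and one of 210. The arrays are made up for illustration and are not taken from the garfield data.

```python
import numpy as np

# Made-up values: one prediction is 10 too low, the other 10 too high.
actual = np.array([200.0, 200.0])
predicted = np.array([190.0, 210.0])

errors = predicted - actual                       # [-10., 10.]
mean_error = errors.mean()                        # signed errors cancel out
mean_absolute_error = np.absolute(errors).mean()  # sign is dropped before averaging

print(mean_error, mean_absolute_error)  # 0.0 10.0
```

The mean error of 0.0 suggests perfect predictions, while the mean absolute error of 10.0 reflects how far off each prediction actually was.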

* Use numpy.absolute() to calculate the mean absolute error between predicted_price and price (Sale_Price in this workbook).
* Assign the MAE to mae.

In [5]:
test_df = test_df.assign(error=np.absolute(test_df['predicted_price'] - test_df['Sale_Price'])) # assign returns a new DataFrame, avoiding SettingWithCopyWarning
mae = test_df['error'].mean()
print(mae)

156376.56611570247
