1. Testing quality of predictions
==

We now have a function that can predict the price for any living space we want to list as long as we know the number of people it can accommodate. The function we wrote represents a **machine learning model**, which means that it outputs a prediction based on the input to the model.

A simple way to test the quality of your model is to:

- split the dataset into 2 partitions:
    - the training set: contains the majority of the rows (75%)
    - the test set: contains the remaining minority of the rows (25%)
- use the rows in the training set to predict the **price** value for the rows in the test set
    - add a new column named **predicted_price** to the test set
- compare the **predicted_price** values with the actual **price** values in the test set to see how accurate the predicted values were.

This validation process, where we use the training set to make predictions for the rows in the test set and then compare them with the known values, is known as **train/test validation**. Whenever you're performing machine learning, you want to perform validation of some kind to ensure that your machine learning model can make good predictions on new data. While train/test validation isn't perfect, we'll use it to understand the validation process and to select an error metric, and then we'll dive into a more robust validation process later in this course.

Let's modify the **predict_price** function to use only the rows in the training set, instead of the full dataset, to find the nearest neighbors, average the **price** values for those rows, and return the predicted price value. Then, we'll use this function to predict the price for just the rows in the test set. Once we have the predicted price values, we can compare them with the true price values and start to understand the model's effectiveness in the next screen.

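To start, 75% of the rows in **dc_listings** are assigned to **train_df** and the remaining 25% to **test_df**. The diagram shows a positional split (the first 75% and the last 25% of the rows); the code cell in this notebook instead draws the training rows at random with **sample**, which serves the same purpose. Here's a diagram explaining the split: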

<img width="600" alt="diagram of the train/test split" src="https://drive.google.com/uc?export=view&id=11IctHIyFi18HxRsg9LpsOf4tVKqfqvRz">
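
The positional split in the diagram can be sketched as follows. This is only an illustration on made-up data: `listings`, `train_toy`, and `test_toy` are hypothetical names that are not part of the notebook, and the values are invented.

```python
import pandas as pd

# Hypothetical toy data standing in for dc_listings (values are made up).
listings = pd.DataFrame({
    "accommodates": [2, 4, 1, 6, 3, 2, 5, 4],
    "price": [110.0, 250.0, 70.0, 159.0, 120.0, 125.0, 312.0, 375.0],
})

# Number of rows that go into the training set (75%, rounded down).
split_at = int(len(listings) * 0.75)

# .copy() gives each partition its own data, so adding columns later
# does not modify `listings`.
train_toy = listings.iloc[:split_at].copy()
test_toy = listings.iloc[split_at:].copy()

print(len(train_toy), len(test_toy))
```

A positional split like this is only reasonable when the rows are already in random order, which is why the notebook shuffles **dc_listings** before splitting.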


Exercise Start.
==

**Description**: 

1. Within the **predict_price** function, change the DataFrame that **temp_df** is assigned to. Change it from **dc_listings** to **train_df**, so only the training set is used.
2. Use the Series method **apply** to pass all of the values in the **accommodates** column from **test_df** through the **predict_price** function.
3. Assign the resulting Series object to the **predicted_price** column in **test_df**.

In [75]:
# importing packages
import pandas as pd
import numpy as np

# seed
np.random.seed(1)

# import dataset
dc_listings = pd.read_csv("dc_airbnb.csv")

# scramble the data
dc_listings = dc_listings.loc[np.random.permutation(dc_listings.index)]

# cleaning & preparing
stripped_commas = dc_listings['price'].str.replace(',', '')
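# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)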
dc_listings['price'] = stripped_dollars.astype('float')

# separate data into train and test (75%/25%)
train_df = dc_listings.sample(frac=0.75,random_state=1)
test_df = dc_listings.drop(train_df.index)


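def predict_price(new_listing):
    # temp_df is the same object as train_df, so the 'distance' column
    # assigned below is also added to train_df
    temp_df = train_df
    # absolute difference in capacity between each training row and the new listing
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    # average the price of the 5 nearest neighbors
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price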

In [83]:
predict_price(3)

206.6

In [82]:
train_df

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state,distance
932,100%,100%,5,2,Private room,1.0,2.5,2.0,110.0,$50.00,,7,365,6,38.913776,-77.038325,Washington,20009,DC,1
1509,90%,100%,2,1,Private room,1.0,1.5,1.0,70.0,$30.00,$150.00,1,1125,0,38.874765,-77.009623,Washington,20024,DC,2
2394,100%,80%,4,1,Shared room,1.0,1.0,1.0,90.0,,,1,4,77,38.914792,-77.047386,Washington,20009,DC,2
397,100%,100%,1,2,Private room,1.0,1.0,1.0,130.0,$35.00,$150.00,5,10,2,38.904678,-77.054143,Washington,20037,DC,1
2024,99%,89%,25,6,Entire home/apt,2.0,1.0,2.0,159.0,$115.00,$300.00,3,1125,2,38.892198,-77.000341,Washington,20002,DC,3
3022,100%,,1,2,Entire home/apt,0.0,1.0,1.0,120.0,$50.00,$300.00,4,1125,8,38.937001,-77.033109,Washington,20010,DC,1
1135,100%,75%,2,2,Entire home/apt,1.0,1.5,1.0,125.0,$30.00,$100.00,3,1125,0,38.913737,-77.035244,Washington,20009,DC,1
2163,100%,100%,1,4,Entire home/apt,2.0,2.0,2.0,250.0,$200.00,$250.00,4,28,7,38.895035,-76.999630,Washington,20002,DC,1
519,100%,93%,6,10,Entire home/apt,3.0,1.5,4.0,312.0,$150.00,$500.00,2,1125,1,38.905959,-77.023344,Washington,20001,DC,7
521,60%,100%,3,4,Entire home/apt,2.0,2.0,2.0,375.0,$40.00,,1,8,0,38.909571,-77.032955,Washington,20005,DC,1


In [81]:
test_df.shape

(931, 19)
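
Steps 2 and 3 of the exercise (passing a column through the function with **apply** and storing the result in a new column) can be sketched on toy data. Everything here is an assumption made for illustration: the data is invented, `predict_price_toy` is a hypothetical stand-in for **predict_price**, and it uses 3 neighbors instead of 5 only because the toy training set is so small.

```python
import numpy as np
import pandas as pd

# Hypothetical toy training and test sets (values are made up).
train_toy = pd.DataFrame({
    "accommodates": [1, 2, 2, 3, 4, 4, 6],
    "price": [60.0, 100.0, 120.0, 150.0, 200.0, 240.0, 300.0],
})
test_toy = pd.DataFrame({
    "accommodates": [2, 5],
    "price": [110.0, 260.0],
})

def predict_price_toy(new_listing, k=3):
    # Work on a copy so train_toy itself is left unchanged.
    temp_df = train_toy.copy()
    temp_df["distance"] = np.abs(temp_df["accommodates"] - new_listing)
    # mergesort is stable, so tied rows keep their original order.
    temp_df = temp_df.sort_values("distance", kind="mergesort")
    return temp_df.iloc[:k]["price"].mean()

# apply calls the function once per value in the column.
test_toy["predicted_price"] = test_toy["accommodates"].apply(predict_price_toy)
print(test_toy)
```

Copying the training set inside the function is a design choice for this sketch: it keeps the helper `distance` column out of the training data between calls.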

2. Error Metrics
==

We now need a metric that quantifies how good the predictions were on the test set. A metric of this kind is called an **error metric**. As the name suggests, an error metric quantifies how far our predictions were from the actual values. In our case, the error metric tells us how far off our predicted price values were from the actual price values for the living spaces in the test dataset.

We could start by calculating the difference between each predicted and actual value and then averaging these differences. This is referred to as **mean error** but isn't an effective error metric for most cases. Mean error keeps the sign of each difference, so positive and negative differences cancel each other out when averaged, but we're really interested in how far off the prediction is in either the positive or negative direction. If the true price was 200 dollars and the model predicted 210 or 190, it's off by 10 dollars either way.

We can instead use the **mean absolute error**, where we compute the absolute value of each error before we average all the errors.

$\displaystyle MAE = \frac{\left | actual_1 - predicted_1 \right | + \left | actual_2 - predicted_2 \right | + \
\ldots + \left | actual_n - predicted_n \right | }{n}$
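
To see why the absolute value matters, here is a small sketch using the 200-dollar example from the text. The arrays `actual` and `predicted` are illustrative names, not notebook variables.

```python
import numpy as np

# The example from the text: true price 200, predictions of 210 and 190.
actual = np.array([200.0, 200.0])
predicted = np.array([210.0, 190.0])

errors = actual - predicted          # [-10., 10.]
mean_error = errors.mean()           # signed errors cancel out
mae = np.absolute(errors).mean()     # absolute errors do not

print(mean_error, mae)
```

The mean error comes out as 0 even though both predictions miss by 10 dollars, while the MAE reports 10.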


Exercise Start.
==

**Description**: 

1. Use **numpy.absolute()** to calculate the mean absolute error between **predicted_price** and **price**.
2. Assign the MAE to **mae**.

In [93]:
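# accumulate the absolute error for every row in the test set
mae = 0
for index, row in test_df.iterrows():
    predict = predict_price(row["accommodates"])
    mae += np.absolute(row.price - predict)
# divide by the number of test rows to get the mean
mae = mae/test_df.shape[0]
print(mae)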

65.36240601503764


3. Mean Squared Error
==

For many prediction tasks, we want to penalize predicted values that are further away from the actual value much more than those that are closer to the actual value.

We can instead take the mean of the squared error values, which is called the **mean squared error** or MSE for short. Squaring magnifies large gaps between the predicted and actual values. A prediction that's off by 100 dollars will have a squared error (of 10,000) that's 100 times larger than that of a prediction that's off by only 10 dollars (which will have a squared error of 100).

Here's the formula for MSE:

$\displaystyle MSE = \frac{(actual_1 - predicted_1)^2 + (actual_2 - predicted_2)^2 + \
\ldots + (actual_n - predicted_n)^2 }{n}$

where **n** represents the number of rows in the test set. Let's calculate the MSE value for the predictions we made on the test set.
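
As a quick check of the arithmetic above, the sketch below computes the squared errors for one prediction that is off by 10 dollars and one that is off by 100 dollars. The values are made up for illustration.

```python
import numpy as np

actual = np.array([200.0, 200.0])
predicted = np.array([190.0, 100.0])        # off by 10 and by 100

squared_errors = (actual - predicted) ** 2  # [100., 10000.]
mse = squared_errors.mean()

print(squared_errors, mse)
```

The larger miss contributes 100 times as much to the sum as the smaller one, so it dominates the resulting MSE of 5050.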


Exercise Start.
==

**Description**: 

1. Calculate the MSE value between the **predicted_price** and **price** columns and assign to **mse**.

In [94]:
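# accumulate the squared error for every row in the test set
mse = 0
for index, row in test_df.iterrows():
    predict = predict_price(row.accommodates)
    mse += (row.price - predict)**2
# divide by the number of test rows to get the mean
mse = mse/test_df.shape[0]
print(mse)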

20021.818689581203


4. Training another model
==

The model we trained achieved a mean squared error of around **20021.8**. Is this a high or a low mean squared error value? What does this tell us about the quality of the predictions and the model? By itself, the mean squared error value for a single model isn't all that useful.

The units of mean squared error in our case are dollars squared (not dollars), which makes it hard to reason about intuitively as well. We can, however, train another model and then compare the mean squared error values to see which model performs better on a relative basis. Recall that a low error metric means that the gap between the predicted list price and actual list price values is low, while a high error metric means the gap is high.

Let's train another model, this time using the **bathrooms** column, and compare MSE values.


Exercise Start.
==

**Description**: 

1. Modify the **predict_price** function below to use the **bathrooms** column instead of the **accommodates** column to make predictions.
2. Apply the function to **test_df** and assign the resulting Series object containing the predicted price values to the **predicted_price** column in **test_df**.
3. Calculate the squared error between the **price** and **predicted_price** columns in **test_df** and assign the resulting Series object to the **squared_error** column in **test_df**.
4. Calculate the mean of the **squared_error** column in **test_df** and assign to **mse**.
5. Use the **print** function or the variables inspector to display the **MSE** value.


In [8]:
# separate data into train and test (75%/25%)
train_df = dc_listings.sample(frac=0.75,random_state=1)
test_df = dc_listings.drop(train_df.index)

def predict_price(new_listing):
    temp_df = train_df
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbors_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors_prices.mean()
    return predicted_price

In [95]:
# separate data into train and test (75%/25%)
train_df = dc_listings.sample(frac=0.75,random_state=1)
test_df = dc_listings.drop(train_df.index)

def predict_price2(new_listing):
    temp_df = train_df
    # same approach as predict_price, but distance is measured on the bathrooms column
    temp_df['distance'] = temp_df['bathrooms'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbors_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors_prices.mean()
    return predicted_price

In [98]:
mse = 0
for index, row in test_df.iterrows():
    predict = predict_price2(row.bathrooms)
    mse += (row.price - predict)**2
mse = mse/test_df.shape[0]
print(mse)

17217.43991407113
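
The cell above computes the MSE with a loop. Steps 3 and 4 of the exercise describe a column-based route instead, which might look like the sketch below. It runs on a hypothetical `test_toy` DataFrame with made-up prices and predictions, not on the notebook's **test_df**.

```python
import pandas as pd

# Hypothetical test set with predictions already filled in (made-up values).
test_toy = pd.DataFrame({
    "price": [100.0, 150.0, 300.0],
    "predicted_price": [110.0, 140.0, 250.0],
})

# Step 3: squared error per row, stored as a new column.
test_toy["squared_error"] = (test_toy["price"] - test_toy["predicted_price"]) ** 2

# Step 4: the mean of that column is the MSE.
mse_toy = test_toy["squared_error"].mean()
print(mse_toy)
```

Keeping the per-row squared errors in a column also makes it easy to inspect which rows contribute most to the MSE.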


5. Root Mean Squared Error
==

While comparing MSE values helps us identify which model performs better on a relative basis, it doesn't help us understand if the performance is good enough in general. This is because the units of the MSE metric are squared (in this case, dollars squared). An MSE value of 17217.4 dollars squared doesn't give us an intuitive sense of how far the model's predictions are from the true price value in dollars.

**Root mean squared error** is an error metric whose units are the base unit (in our case, dollars). RMSE for short, this error metric is calculated by taking the square root of the MSE value:

$\displaystyle RMSE=\sqrt{MSE}$

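Since the RMSE value uses the same units as the target column, we can understand how far off in real dollars we can expect the model to perform. For example, if a model achieves an RMSE value of 100, we can expect the predicted price value to be off by roughly 100 dollars for a typical listing. Because the errors are squared before they are averaged, large misses count for more in the RMSE than they would in a plain average of the errors.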
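
The following sketch shows the step from MSE back to dollars on two made-up errors of 10 and 100 dollars. The variable names are illustrative and the numbers are not taken from the dataset.

```python
import math

# Made-up absolute errors in dollars for two predictions.
errors = [10.0, 100.0]

mse = sum(e ** 2 for e in errors) / len(errors)   # dollars squared
rmse = math.sqrt(mse)                             # back in dollars
mae = sum(abs(e) for e in errors) / len(errors)   # plain average error

print(mse, round(rmse, 2), mae)
```

Here the MSE is 5050 dollars squared, the RMSE is about 71.06 dollars, and the MAE is 55 dollars, so the RMSE sits above the plain average because the 100-dollar miss is weighted more heavily.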

Let's calculate the RMSE value of the model we trained using the **bathrooms** column.


Exercise Start.
==

**Description**: 

1. Calculate the RMSE value of the model we trained using the **bathrooms** column and assign it to **rmse**.


In [99]:
rmse = np.sqrt(mse)
rmse

131.2152426895257

6. Comparing MAE and RMSE
==

The model achieved an RMSE value of approximately **131.22**, which implies that we should expect the model to be off by roughly **131.22** dollars on the predicted price values. Given that most of the living spaces are listed at just a few hundred dollars, we need to reduce this error as much as possible to improve the model's usefulness.

We discussed a few different error metrics we can use to understand a model's performance. As we mentioned earlier, these individual error metrics are helpful for comparing models. To better understand a specific model, we can compare multiple error metrics for the same model. This requires a better understanding of the mathematical properties of the error metrics.

If you look at the equation for MAE:

$$\displaystyle MAE = \frac{\left | actual_1 - predicted_1 \right | + \left | actual_2 - predicted_2 \right | + \
\ldots + \left | actual_n - predicted_n \right | }{n}$$

you'll notice that the individual errors (or differences between predicted and actual values) grow linearly. A prediction that's off by 10 dollars has a 10 times higher error than a prediction that's off by 1 dollar. If you look at the equation for RMSE, however:

$$\displaystyle RMSE = \sqrt{\frac{(actual_1 - predicted_1)^2 + (actual_2 - predicted_2)^2 + \
\ldots + (actual_n - predicted_n)^2 }{n}}$$

you'll notice that each error is squared before the errors are averaged and the square root is taken. This means that the individual errors grow quadratically and have a different effect on the final RMSE value.

Let's look at an example using different data entirely. We've created 2 Series objects containing 2 sets of errors and assigned them to **errors_one** and **errors_two**.

```python
errors_one = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10])
errors_two = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 1000])
```


Exercise Start.
==

**Description**: 

1. Calculate the MAE for **errors_one** and assign to **mae_one**.
2. Calculate the RMSE for **errors_one** and assign to **rmse_one**.
3. Calculate the MAE for **errors_two** and assign to **mae_two**.
4. Calculate the RMSE for **errors_two** and assign to **rmse_two**.

In [108]:
errors_one = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10])
errors_two = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 1000])

mae_one = np.sum(errors_one)/len(errors_one)
rmse_one = np.sqrt(np.sum(errors_one**2)/len(errors_one))
mae_two = np.sum(errors_two)/len(errors_two)
rmse_two = np.sqrt(np.sum(errors_two**2)/len(errors_two))
print("Mae_one: {}\nRmse_one: {}".format(mae_one,rmse_one))
print("Mae_two: {}\nRmse_two: {}".format(mae_two,rmse_two))

Mae_one: 7.5
Rmse_one: 7.905694150420948
Mae_two: 62.5
Rmse_two: 235.82302686548658


In [105]:
errors_one**2

0      25
1     100
2      25
3     100
4      25
5     100
6      25
7     100
8      25
9     100
10     25
11    100
12     25
13    100
14     25
15    100
16     25
17    100
dtype: int64

7. Next steps
==

While the MAE (7.5) to RMSE (about 7.91) ratio was about **1:1** for the first list of errors, the MAE (62.5) to RMSE (about 235.82) ratio was closer to **1:4** for the second list of errors. The only difference between the 2 sets of errors is the extreme 1000 value in **errors_two** instead of 10. When we're working with larger data sets, we can't inspect each value to understand whether there are a few outliers or whether all of the errors are systematically higher. Looking at the ratio of MAE to RMSE can help us understand if there are large but infrequent errors. You can read more about comparing MAE and RMSE in [this wonderful post](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d#.lyc8od1ix).
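
One way to turn this comparison into a reusable check is sketched below. The helper `rmse_to_mae_ratio` is a hypothetical function written for this illustration, not something defined earlier in the notebook; the two lists reproduce the values of **errors_one** and **errors_two**.

```python
import numpy as np

def rmse_to_mae_ratio(errors):
    """Return RMSE divided by MAE for a sequence of errors."""
    errors = np.asarray(errors, dtype=float)
    mae = np.absolute(errors).mean()
    rmse = np.sqrt((errors ** 2).mean())
    return rmse / mae

# Same values as errors_one and errors_two above.
errors_one = [5, 10] * 9
errors_two = [5, 10] * 8 + [5, 1000]

ratio_one = rmse_to_mae_ratio(errors_one)
ratio_two = rmse_to_mae_ratio(errors_two)
print(ratio_one, ratio_two)
```

A ratio near 1 suggests errors of similar size, while a ratio well above 1 (about 3.8 for the second list) hints at a few large errors dominating the RMSE.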

In this mission, we learned how to test our machine learning models. In the next 2 missions, we'll explore how adding more features to the machine learning model and selecting a better k value can help improve the model's performance.