# 1. Testing quality of predictions

A simple way to test the quality of your model is to:

*  **split the dataset into 2 partitions:**
   * the **training set:** contains the majority of the rows (75%)
   * the **test set:** contains the remaining minority of the rows (25%)

* **use the rows in the training set to predict the price value for the rows in the test set**

   add new column named predicted_price to the test set

* **compare the predicted_price values with the actual price values in the test set to see how accurate the predicted values were.**

### This validation process, where we use the training set to make predictions and the test set to predict values for, is known as train/test validation.

`Whenever you're performing machine learning, you want to perform validation of some kind to ensure that your machine learning model can make good predictions on new data.` While train/test validation isn't perfect, we'll use it to understand the validation process, to select an error metric,

## TODO:
* Within the predict_price function, change the Dataframe that temp_df is assigned to. Change it from dc_listings to train_df, so only the training set is used.
* Use the Series method apply to pass all of the values in the accommodates column from test_df through the predict_price function.
* Assign the resulting Series object to the predicted_price column in test_df.

In [1]:
import pandas as pd 
import numpy as np

dc_listings=pd.read_csv('dc_airbnb.csv')

dc_listings.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,$160.00,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,$350.00,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,$50.00,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,$95.00,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,$50.00,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD


In [2]:
# clean price column
dc_listings['price'].unique()

array(['$160.00', '$350.00', '$50.00', '$95.00', '$99.00', '$100.00',
       '$38.00', '$71.00', '$97.00', '$55.00', '$60.00', '$52.00',
       '$23.00', '$200.00', '$40.00', '$135.00', '$225.00', '$129.00',
       '$149.00', '$150.00', '$175.00', '$239.00', '$65.00', '$80.00',
       '$250.00', '$138.00', '$94.00', '$283.00', '$127.00', '$79.00',
       '$110.00', '$480.00', '$372.00', '$125.00', '$89.00', '$90.00',
       '$400.00', '$35.00', '$130.00', '$45.00', '$64.00', '$59.00',
       '$74.00', '$69.00', '$49.00', '$157.00', '$128.00', '$102.00',
       '$195.00', '$120.00', '$249.00', '$88.00', '$85.00', '$215.00',
       '$299.00', '$309.00', '$375.00', '$180.00', '$337.00', '$126.00',
       '$199.00', '$165.00', '$115.00', '$569.00', '$198.00', '$311.00',
       '$167.00', '$104.00', '$159.00', '$190.00', '$246.00', '$450.00',
       '$234.00', '$109.00', '$87.00', '$210.00', '$300.00', '$189.00',
       '$269.00', '$119.00', '$295.00', '$155.00', '$229.00', '$75.00',
      

In [3]:
dc_listings['price']=dc_listings['price'].str.replace('$','').str.replace(',','').astype(float)

In [4]:
dc_listings['price'].head()

0    160.0
1    350.0
2     50.0
3     95.0
4     50.0
Name: price, dtype: float64

In [5]:
dc_listings.shape

(3723, 19)

In [6]:
# split dataset in train and test dataset
train_df=dc_listings.iloc[0:2792]
test_df=dc_listings.iloc[2792:]

In [7]:
def predict_price(new_listing):
    ## DataFrame.copy() performs a deep copy
    temp_df = dc_listings.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return(predicted_price)

In [8]:
# trained using train_df dataset

def predict_price(new_listing):
    ## DataFrame.copy() performs a deep copy
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return(predicted_price)

In [9]:
# predict price using test_df dataset

test_df['predicted_price']=test_df['accommodates'].apply(predict_price)
test_df['predicted_price'].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


2792    104.0
2793    177.4
2794    145.8
2795    177.4
2796    187.2
Name: predicted_price, dtype: float64

# 2. Error Metrics

**We now need a metric that quantifies how good the predictions were on the test set. This class of metrics is called an error metric.**

* As the name suggests, an error metric quantifies how inaccurate our predictions were from the actual values.

* We could start by calculating the difference between each predicted and actual value and then averaging these differences. This is referred to as **mean error** but isn't an effective error metric for most cases. Mean error treats a positive difference differently than a negative difference, but we're really interested in how far off the prediction is in either the positive or negative direction.

* We can instead use the **mean absolute error**, where we compute the absolute value of each error before we average all the errors.

### $MAE = \frac{1}{n} \sum_{k=1}^{n} \lvert (actual_1 - predicted_1) \rvert + \cdots + \lvert (actual_n - predicted_n) \rvert$

## TODO:
* Use numpy.absolute() to calculate the mean absolute error between predicted_price and price.
* Assign the MAE to mae.

In [10]:
import numpy as np 
 
test_df['error']=np.absolute(test_df['predicted_price']-test_df['price'])
mae=test_df['error'].mean()
mae

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


56.29001074113876

# 3. Mean Squared Error

* `For many prediction tasks, we want to penalize predicted values that are further away from the actual value far more than those closer to the actual value`.

* We can instead take the **mean of the squared error values,** which is called the mean squared error or MSE for short. The MSE makes the gap between the predicted and actual values more clear.

### $MSE = \frac{1}{n} \sum_{k=1}^{n} (actual_1 - predicted_1)^{2} + \cdots + (actual_n - predicted_n)^{2}$

## TODO:
* Calculate the MSE value between the predicted_price and price columns and assign to mse.

In [11]:
test_df['squared_error']=(test_df['predicted_price']-test_df['price'])**2
mse=test_df['squared_error'].mean()
mse

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


18646.525370569325

# 4. Training another model

The model we trained achieved a mean squared error of around 18646.5. Is this a high or a low mean squared error value? What does this tell us about the quality of the predictions and the model? By itself, the mean squared error value for a single model isn't all that useful.

* The units of mean squared error in our case is dollars squared (not dollars), which makes it hard to reason about intuitively as well. We can, however, train another model and then compare the mean squared error values to see which model performs better on a relative basis.

**low error metric means that the gap between the predicted list price and actual list price values is low while a high error metric means the gap is high.**

## TODO:
* Modify the predict_price function to the right to use the bathrooms column instead of the accommodates column to make predictions.
* Apply the function to test_df and assign the resulting Series object containing the predicted price values to the predicted_price column in test_df.
* Calculate the squared error between the price and predicted_price columns in test_df and assign the resulting Series object to the squared_error column in test_df.
* Calculate the mean of the squared_error column in test_df and assign to mse.
* Use the print function or the variables inspector to display the MSE value.

In [12]:
def predict_price(new_listing):
    ## DataFrame.copy() performs a deep copy
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['bathrooms'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return(predicted_price)

In [13]:
test_df['predicted_price'] = test_df['bathrooms'].apply(lambda x: predict_price(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [14]:
test_df['squarred_error']=(test_df['predicted_price']-test_df['price'])**(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [15]:
mse=test_df['squared_error'].mean()

In [16]:
mse

18646.525370569325

# 5. Root Mean Squared Error

* **While comparing MSE values helps us identify which model performs better on a relative basis, it doesn't help us understand if the performance is good enough in general.** This is because the units of the MSE metric are squared (in this case, dollars squared).

`Root mean squared error is an error metric whose units are the base unit (in our case, dollars). RMSE for short, this error metric is calculated by taking the square root of the MSE value:`

**Since the RMSE value uses the same units as the target column, we can understand how far off in real dollars we can expect the model to perform.**


### $RMSE = \sqrt{MSE}$

## TODO:
* Calculate the RMSE value of the model we trained using the bathrooms column and assign it to rmse.

In [17]:
rmse=mse**(1/2)
rmse

136.55228072269364

# 6. Comparing MAE and RMSE

* The model achieved an RMSE value of approximately 135.6, which implies that we should expect for the model to be off by 135.6 dollars on average for the predicted price values.

* These individual error metrics are helpful for comparing models. **To better understand a specific model, we can compare multiple error metrics for the same model.**

### * $ MAE = \frac{1}{n} \sum_{k=1}^{n} \lvert (actual_1 - predicted_1) \rvert + \cdots + \lvert (actual_n - predicted_n) \rvert $

* you'll notice that that the **differences between predicted and actual values grow linearly.** A prediction that's off by 10 dollars has a 10 times higher error than a prediction that's off by 1 dollar.

### * $ RMSE = \sqrt { \frac{ \sum_{k=1}^{n} (actual_1 - predicted_1)^2 + \cdots + (actual_n - predicted_n)^2 } {n} } $

* you'll notice that each error is squared before the square root of the sum of all the errors is taken. This means that the **individual errors grows quadratically and has a different effect on the final RMSE value**.

## TODO:
* Calculate the MAE for errors_one and assign to mae_one.
* Calculate the RMSE for errors_one and assign to rmse_one.
* Calculate the MAE for errors_two and assign to mae_two.
* Calculate the RMSE for errors_two and assign to rmse_two.

In [18]:
errors_one = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10])
errors_two = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 1000])

In [19]:
mae_one=errors_one.sum()/len(errors_one)
rmse_one=((errors_one**2).sum()/len(errors_one))**(1/2)
mae_two=errors_two.sum()/len(errors_two)
rmse_two=((errors_two**2).sum()/len(errors_two))**(1/2)

While the MAE (7.5) to RMSE (7.9056941504209481) ratio was about 1:1 for the first list of errors, the MAE (62.5) to RMSE (235.82302686548658) ratio was closer to 1:4 for the second list of errors.**In general, we should expect that the MAE value be much less than the RMSE value.** The only difference between the 2 sets of errors is the extreme 1000 value in errors_two instead of 10. When we're working with larger data sets, we can't inspect each value to understand if there's one or some outliers or if all of the errors are systematically higher.**Looking at the ratio of MAE to RMSE can help us understand if there are large but infrequent errors.** You can read more about comparing MAE and RMSE in this wonderful post.