### 1. Recap

There are two ways to tweak the model to try to improve its accuracy (that is, to decrease the RMSE during validation):

- increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors
- increase k, the number of nearby neighbors the model uses when computing the prediction

#### Watch out for columns that don't work well with the distance equation!

This includes columns containing:

- non-numerical values (e.g. city or state)
    - Euclidean distance equation expects numerical values
- missing values 
    - distance equation expects a value for each observation and attribute
- non-ordinal values (e.g. latitude and longitude)
    - ranking by Euclidean distance doesn't make sense if all attributes aren't ordinal
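As a rough illustration, the first two checks can be automated with pandas. The toy DataFrame below uses made-up values, not the Airbnb data. Note that inspecting dtypes cannot tell whether a numerical column is ordinal, so columns like latitude and longitude still have to be judged by hand.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one text column and one missing value
toy = pd.DataFrame({
    'city': ['Washington', 'Arlington', 'Washington'],
    'bedrooms': [1.0, np.nan, 2.0],
    'accommodates': [2, 4, 3],
})

# Columns whose dtype is not numeric cannot go into the distance equation
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()

# Number of missing values per column
missing_counts = toy.isna().sum()

print(non_numeric_cols)
print(missing_counts[missing_counts > 0])
```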

In [2]:
import pandas as pd
import numpy as np
np.random.seed(1)

dc_listings = pd.read_csv('data/dc_airbnb.csv')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
stripped_commas = dc_listings['price'].str.replace(',', '')
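# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)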
dc_listings['price'] = stripped_dollars.astype('float')

dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null float64
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(6), int64(5), object(8)

In [3]:
# drop 9 columns: the non-numerical ones, the numerical but non-ordinal ones,
# and the three that describe the host instead of the living space
dc_listings.drop(['room_type','city','state','latitude','longitude','zipcode','host_response_rate',
                  'host_acceptance_rate','host_listings_count'], 
                 axis = 1, inplace = True)

In [4]:
# drop the cleaning_fee and security_deposit columns due to high number of missing values
dc_listings.drop(['cleaning_fee', 'security_deposit'], axis = 1, inplace = True)

In [5]:
# drop missing values
dc_listings.dropna(axis = 0, inplace = True)

# display null value counts for the updated df to confirm no missing values are left
dc_listings.isna().sum()

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64

#### 4. Normalize columns

To prevent any single column from having too much of an impact on the distance, we can normalize all of the columns to have a mean of 0 and a standard deviation of 1.

Normalizing the values in each column to a mean of 0 and a standard deviation of 1 (the parameters of the standard normal distribution) preserves the shape of each column's distribution while aligning the scales. To normalize the values in a column this way, you need to:

- subtract the mean of the column from each value
- divide each result by the standard deviation of the column

In [6]:
# normalize every column (the original price values are restored in the next cell)
# and assign the result to a new DataFrame
normalized_listings = (dc_listings - dc_listings.mean()) / (dc_listings.std())

In [7]:
# add the price column from dc_listings to normalized_listings
normalized_listings['price'] = dc_listings['price']

# display the first 3 rows in normalized_listings
normalized_listings.head(3)

| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
|---|---|---|---|---|---|---|---|---|
| 574 | -0.596544 | -0.249467 | -0.439151 | -0.546858 | 125.0 | -0.341375 | -0.016604 | 4.57965 |
| 1593 | -0.596544 | -0.249467 | 0.412923 | -0.546858 | 85.0 | -0.341375 | -0.016603 | 1.159275 |
| 3091 | -1.095499 | -0.249467 | -1.291226 | -0.546858 | 50.0 | -0.341375 | -0.016573 | -0.482505 |


#### 5. Euclidean distance for multivariate case

So far, we've been calculating Euclidean distance by writing the logic for the equation ourselves. We can instead use the `distance.euclidean()` function from `scipy.spatial`, which takes two vectors as parameters and calculates the Euclidean distance between them.

The `euclidean()` function expects:

- both vectors to be represented as list-like objects (Python list, NumPy array, or pandas Series)
- both vectors to be 1-dimensional and have the same number of elements
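To connect the library call back to the hand-written equation, the sketch below computes the distance both ways for two made-up 2-element vectors and checks that the results agree.

```python
import numpy as np
from scipy.spatial import distance

# Two hypothetical observations with two features each
first = np.array([1.0, 2.0])
second = np.array([4.0, 6.0])

# Hand-written equation: square root of the sum of squared differences
manual = np.sqrt(np.sum((first - second) ** 2))

# Same calculation with SciPy
scipy_result = distance.euclidean(first, second)

print(manual, scipy_result)
```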

In [19]:
from scipy.spatial import distance

# calculate the Euclidean distance between the 1st and 5th rows using only the accommodates and bathrooms features
first_fifth_distance = distance.euclidean(normalized_listings.iloc[0][['accommodates', 'bathrooms']], 
                                          normalized_listings.iloc[4][['accommodates', 'bathrooms']])

first_fifth_distance

5.272543124668404

Any model that predicts numerical values, like listing price in our case, is known as a regression model. The other main class of machine learning models is classification, where we try to predict a label from a fixed set of labels (e.g. blood type or gender). The word "regressor" in the class name `KNeighborsRegressor` refers to the regression model class we just discussed.

Scikit-learn uses an object-oriented style similar to Matplotlib's: you first instantiate an empty model by calling the constructor:

    from sklearn.neighbors import KNeighborsRegressor
    knn = KNeighborsRegressor()

The constructor accepts several parameters, including:

- `n_neighbors`: the number of neighbors to use (defaults to 5)
- `algorithm`: the algorithm used to compute the nearest neighbors
- `p`: the power parameter of the distance metric; the default of 2 corresponds to Euclidean distance
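As a small check on the `p` parameter, the sketch below fits two models on synthetic points (not the Airbnb data): one relying on `p=2` and one that names the metric explicitly with `metric='euclidean'`. Under these assumptions both should return the same predictions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data: two features per row and a made-up price target
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([100.0, 120.0, 110.0, 300.0, 320.0])
X_new = np.array([[0.2, 0.1], [5.5, 5.0]])

# p=2 selects Euclidean distance through the default Minkowski metric
knn_p2 = KNeighborsRegressor(n_neighbors=2, algorithm='brute', p=2)

# Naming the metric explicitly expresses the same choice
knn_euclid = KNeighborsRegressor(n_neighbors=2, algorithm='brute', metric='euclidean')

pred_p2 = knn_p2.fit(X_train, y_train).predict(X_new)
pred_euclid = knn_euclid.fit(X_train, y_train).predict(X_new)

print(pred_p2, pred_euclid)
```

Each prediction is the mean target value of the 2 nearest training rows.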

In [20]:
from sklearn.neighbors import KNeighborsRegressor

train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]

In [21]:
# create an instance of the KNeighborsRegressor class
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

# Use the fit method to specify the data we want the k-nearest neighbor model to use
train_features = train_df[['accommodates', 'bathrooms']]
train_target = train_df['price']

knn.fit(train_features, train_target)

# call the predict method to make predictions 
predictions = knn.predict(test_df[['accommodates', 'bathrooms']])

In [24]:
from sklearn.metrics import mean_squared_error

train_columns = ['accommodates', 'bathrooms']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute', metric='euclidean')
knn.fit(train_df[train_columns], train_df['price'])
predictions = knn.predict(test_df[train_columns])

# Use the mean_squared_error function to calculate the MSE value for the predictions
two_features_mse = mean_squared_error(test_df['price'], predictions)

# Calculate the RMSE value by taking the square root of the MSE value
two_features_rmse = np.sqrt(two_features_mse)

print('MSE: ', two_features_mse, 'RMSE: ', two_features_rmse)

MSE:  15660.39795221843 RMSE:  125.14151170662127
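For reference, MSE is the mean of the squared prediction errors and RMSE is its square root, so RMSE is expressed in the same units as the price. The sketch below, using made-up prices, computes both by hand with NumPy and compares the MSE against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices
actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 230.0])

# MSE: mean of the squared differences; RMSE: its square root
manual_mse = np.mean((actual - predicted) ** 2)
manual_rmse = np.sqrt(manual_mse)

# scikit-learn's helper should agree with the manual MSE
sklearn_mse = mean_squared_error(actual, predicted)

print(manual_mse, manual_rmse, sklearn_mse)
```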


In [26]:
features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

knn.fit(train_df[features], train_df['price'])

four_predictions = knn.predict(test_df[features])

four_mse = mean_squared_error(test_df['price'], four_predictions)

four_rmse = np.sqrt(four_mse)

print('MSE: ', four_mse, 'RMSE: ', four_rmse)

MSE:  13320.230625711036 RMSE:  115.41330350402


#### Train a k-nearest neighbors model on all of the columns except price

In [33]:
# Use all of the columns, except for the price column, to train a k-nearest 
# neighbors model using the same parameters for the KNeighborsRegressor class as the ones from the last few screens.
train_features = train_df.drop('price', axis = 1)
train_target = train_df['price']
test_features = test_df.drop('price', axis = 1)
test_target = test_df['price']

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

knn.fit(train_features, train_target)

# Use the model to make predictions on the test set and assign the resulting NumPy array of 
# predictions to all_features_predictions
all_features_predictions = knn.predict(test_features)

# Calculate the MSE and RMSE values and assign to all_features_mse and all_features_rmse accordingly
all_features_mse = mean_squared_error(test_target, all_features_predictions)

all_features_rmse = np.sqrt(all_features_mse)

print('MSE: ', all_features_mse, 'RMSE: ', all_features_rmse)

MSE:  15455.275631399316 RMSE:  124.31924883701363
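The three experiments above repeat the same fit, predict, and score steps with different feature lists. One way to make such comparisons less repetitive is a small helper; `knn_rmse` below is a hypothetical name, and the data is a random synthetic stand-in for `normalized_listings`, so the printed numbers say nothing about the Airbnb results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor


def knn_rmse(train_df, test_df, features, target='price', k=5):
    """Fit a k-nearest neighbors regressor on `features` and return the test RMSE."""
    knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    knn.fit(train_df[features], train_df[target])
    predictions = knn.predict(test_df[features])
    return np.sqrt(mean_squared_error(test_df[target], predictions))


# Synthetic stand-in for normalized_listings (random values, not the Airbnb data)
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(200, 3)),
                    columns=['accommodates', 'bedrooms', 'bathrooms'])
demo['price'] = 100 + 40 * demo['accommodates'] + rng.normal(scale=5, size=200)
demo_train, demo_test = demo.iloc[:150], demo.iloc[150:]

# Compare feature sets with the same model settings
feature_sets = [['accommodates'], ['accommodates', 'bedrooms', 'bathrooms']]
results = {tuple(f): knn_rmse(demo_train, demo_test, f) for f in feature_sets}
for feats, rmse in results.items():
    print(feats, round(rmse, 2))
```

The same helper could also be called with different values of `k` to explore the second tweak mentioned in the recap.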
