There are 2 ways we can tweak the model to try to improve the accuracy (decrease the RMSE during validation):

    increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors
    increase k, the number of nearby neighbors the model uses when computing the prediction

In this mission, we'll focus on increasing the number of attributes the model uses. When selecting more attributes to use in the model, we need to watch out for columns that don't work well with the distance equation. This includes columns containing:

    non-numerical values (e.g. city or state)
        Euclidean distance equation expects numerical values
    missing values
        distance equation expects a value for each observation and attribute
    non-ordinal values (e.g. latitude or longitude)
        ranking by Euclidean distance doesn't make sense if all attributes aren't ordinal
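
A quick way to screen for the first two problems is to check each column's dtype and missing-value count. The sketch below uses a small made-up DataFrame (the `toy` frame and its values are illustrative and not taken from dc_airbnb.csv). Non-ordinal numeric columns such as latitude can't be detected this way and still have to be spotted by reading the column descriptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few listing columns.
toy = pd.DataFrame({
    'accommodates': [2, 4, 1],
    'bedrooms': [1.0, np.nan, 1.0],
    'city': ['Washington', 'Washington', 'Washington'],
})

# Columns whose dtype is not numeric can't be used in the distance equation as-is.
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()

# Columns with at least one missing value need rows dropped or values filled in.
missing_counts = toy.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(non_numeric_cols)
print(cols_with_missing)
```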

In the following code screen, we've read the dc_airbnb.csv dataset from the last mission into pandas and brought over the data cleaning changes we made. Let's first look at the first row's values to identify any columns containing non-numerical or non-ordinal values. In the next screen, we'll drop those columns and then look for missing values in each of the remaining columns.

In [1]:
import pandas as pd
import numpy as np
np.random.seed(1)

dc_listings = pd.read_csv('dc_airbnb.csv')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null float64
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(6), int64(5), object(8)
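
The price-cleaning step in the cell above can be tried in isolation. This is a minimal sketch on made-up price strings (the `raw_prices` values are assumptions about what the raw column looks like, based on the characters the cell strips out). Passing `regex=False` makes pandas treat `$` as a literal character rather than as the regular-expression end-of-string anchor.

```python
import pandas as pd

# Hypothetical price strings in the style the cleaning code expects.
raw_prices = pd.Series(['$1,250.00', '$85.00', '$160.00'])

# Remove thousands separators and dollar signs literally, then convert to float.
cleaned = (
    raw_prices
    .str.replace(',', '', regex=False)
    .str.replace('$', '', regex=False)
    .astype('float')
)
print(cleaned.tolist())
```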

Removing features

The following columns contain non-numerical values:

    room_type: e.g. Private room
    city: e.g. Washington
    state: e.g. DC

while these columns contain numerical but non-ordinal values:

    latitude: e.g. 38.913458
    longitude: e.g. -77.031
    zipcode: e.g. 20009

Geographic values like these aren't ordinal, because a smaller numerical value doesn't correspond to a smaller quantity in any meaningful way. For example, the zip code 20009 isn't meaningfully smaller or larger than the zip code 75023; both are simply unique identifiers. Latitude and longitude pairs describe a point on a geographic coordinate system, and distances between such points are computed with different equations (e.g. the haversine formula) rather than with Euclidean distance.
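
For reference, here is a minimal sketch of the haversine formula mentioned above. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this illustration, and the second coordinate pair is made up. We don't use this in the mission, since we drop the latitude and longitude columns instead.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# The first point reuses the example coordinates quoted above; the second is made up.
print(haversine_km(38.913458, -77.031, 38.90, -77.04))
```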

While we could convert the host_response_rate and host_acceptance_rate columns to be numerical (right now they're object data types and contain the % sign), these columns describe the host and not the living space itself. Since a host could have many living spaces and we don't have enough information to uniquely group living spaces to the hosts themselves, let's avoid using any columns that don't directly describe the living space or the listing itself:

    host_response_rate
    host_acceptance_rate
    host_listings_count

Let's remove these 9 columns from the DataFrame.

In [2]:
labels = ['room_type', 'city', 'state', 'latitude', 'longitude', 'zipcode',
          'host_response_rate', 'host_acceptance_rate', 'host_listings_count']
dc_listings = dc_listings.drop(labels, axis=1)
print(dc_listings.isnull().sum())

accommodates            0
bedrooms               21
bathrooms              27
beds                   11
price                   0
cleaning_fee         1388
security_deposit     2297
minimum_nights          0
maximum_nights          0
number_of_reviews       0
dtype: int64


Handling missing values

Of the remaining columns, 3 columns have a few missing values (less than 1% of the total number of rows):

    bedrooms
    bathrooms
    beds

Since the number of rows containing missing values for one of these 3 columns is low, we can select and remove those rows without losing much information. There are also 2 columns that have a large number of missing values:

    cleaning_fee - 37.3% of the rows
    security_deposit - 61.7% of the rows

and we can't handle these easily. We can't just remove the rows containing missing values for these 2 columns because we'd miss out on the majority of the observations in the dataset. Instead, let's remove these 2 columns entirely from consideration.

In [3]:
dc_listings = dc_listings.drop(['cleaning_fee', 'security_deposit'], axis=1)
dc_listings = dc_listings.dropna(subset=['bedrooms', 'bathrooms', 'beds'], axis=0)
print(dc_listings.isnull().sum())

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64
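
The percentages quoted in this section follow directly from the missing-value counts and the 3723 total rows reported earlier. The sketch below redoes that arithmetic and applies a drop-rows-or-drop-column rule. The 5% cutoff is a hypothetical threshold chosen for illustration, not a value the mission prescribes.

```python
# Missing-value counts and total row count copied from the outputs above.
total_rows = 3723
missing_counts = {
    'bedrooms': 21, 'bathrooms': 27, 'beds': 11,
    'cleaning_fee': 1388, 'security_deposit': 2297,
}

# Hypothetical cutoff: drop the whole column if more than 5% of rows are missing,
# otherwise drop only the affected rows.
threshold = 0.05

missing_fraction = {col: count / total_rows for col, count in missing_counts.items()}
action = {
    col: 'drop column' if fraction > threshold else 'drop rows'
    for col, fraction in missing_fraction.items()
}

for col in missing_counts:
    print(f'{col}: {missing_fraction[col]:.1%} missing -> {action[col]}')
```

With a DataFrame at hand, the same fractions can be obtained with `dc_listings.isnull().mean()`.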


Normalize columns

Here's what the dc_listings DataFrame looks like after all the changes we made:
accommodates 	bedrooms 	bathrooms 	beds 	price 	minimum_nights 	maximum_nights 	number_of_reviews
2 	1.0 	1.0 	1.0 	125.0 	1 	4 	149
2 	1.0 	1.5 	1.0 	85.0 	1 	30 	49
1 	1.0 	0.5 	1.0 	50.0 	1 	1125 	1
2 	1.0 	1.0 	1.0 	209.0 	4 	730 	2
12 	5.0 	2.0 	5.0 	215.0 	2 	1825 	34

You may have noticed that while the accommodates, bedrooms, bathrooms, beds, and minimum_nights columns hover between 0 and 12 (at least in the first few rows), the values in the maximum_nights and number_of_reviews columns span much larger ranges. For example, maximum_nights ranges from 4 to 1825 within the first few rows alone. If we use these 2 columns in a k-nearest neighbors model, they could have an outsized effect on the distance calculations simply because their values are so large.

For example, 2 living spaces could be identical across every attribute except maximum_nights. If one listing had a maximum_nights value of 1825 and the other a value of 4, Euclidean distance would treat them as very far apart, because that one large difference dominates the sum of squared differences. To prevent any single column from having too much of an impact on the distance, we can normalize all of the columns to have a mean of 0 and a standard deviation of 1.
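
To see the effect in numbers, here is a small sketch using three made-up listings (the values and the helper name `euclidean` are illustrative assumptions). Two listings that differ only in maximum_nights end up far further apart than two listings that differ substantially in size.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical listings as [accommodates, bathrooms, maximum_nights].
listing_a = [2, 1.0, 4]
listing_b = [2, 1.0, 1825]   # identical to listing_a except maximum_nights
listing_c = [12, 3.0, 4]     # a much larger space with the same maximum_nights

print(euclidean(listing_a, listing_b))   # driven entirely by maximum_nights
print(euclidean(listing_a, listing_c))   # driven by accommodates and bathrooms
```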

Normalizing the values in each column to a mean of 0 and a standard deviation of 1 (often called standardizing) preserves the shape of each column's distribution while aligning the scales. To normalize the values in a column this way, you need to:

    from each value, subtract the mean of the column
    divide each resulting value by the standard deviation of the column

Here's the mathematical formula describing the transformation that needs to be applied to all values in a column:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is a value in a specific column, $\mu$ is the mean of all the values in the column, and $\sigma$ is the standard deviation of all the values in the column. Here's what the corresponding code, using pandas, looks like:

# Subtract the column mean from each value in the column.
first_transform = dc_listings['maximum_nights'] - dc_listings['maximum_nights'].mean()
# Divide each centered value by the standard deviation
# (centering doesn't change the standard deviation).
normalized_col = first_transform / first_transform.std()

To apply this transformation across all of the columns in a DataFrame, you can use the corresponding DataFrame methods mean() and std():

normalized_listings = (dc_listings - dc_listings.mean()) / (dc_listings.std())

Both mean() and std() return one value per column, and pandas aligns those values with the matching columns when subtracting and dividing, so each value in the DataFrame is transformed using its own column's mean and standard deviation. Let's now normalize all of the feature columns in dc_listings.

In [4]:

normalized_listings = (dc_listings - dc_listings.mean()) / dc_listings.std()
# price is the target, not a feature, so restore it to its original scale.
normalized_listings['price'] = dc_listings['price']
normalized_listings.head(5)

 	accommodates 	bedrooms 	bathrooms 	beds 	price 	minimum_nights 	maximum_nights 	number_of_reviews
574 	-0.596544 	-0.249467 	-0.439151 	-0.546858 	125.0 	-0.341375 	-0.016604 	4.579650
1593 	-0.596544 	-0.249467 	0.412923 	-0.546858 	85.0 	-0.341375 	-0.016603 	1.159275
3091 	-1.095499 	-0.249467 	-1.291226 	-0.546858 	50.0 	-0.341375 	-0.016573 	-0.482505
420 	-0.596544 	-0.249467 	-0.439151 	-0.546858 	209.0 	0.487635 	-0.016584 	-0.448301
808 	4.393004 	4.507903 	1.264998 	2.829956 	215.0 	-0.065038 	-0.016553 	0.646219
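
As a sanity check on the transformation, each normalized column should end up with a mean of roughly 0 and a standard deviation of roughly 1. The sketch below verifies this on a toy frame built from the accommodates and maximum_nights values of the five rows shown earlier (the name `toy` is just for illustration). Note that pandas' std() computes the sample standard deviation by default.

```python
import pandas as pd

# Two columns on very different scales, using the five rows shown earlier.
toy = pd.DataFrame({
    'accommodates': [2, 2, 1, 2, 12],
    'maximum_nights': [4, 30, 1125, 730, 1825],
})

# Column means and standard deviations are aligned with the columns by label.
normalized = (toy - toy.mean()) / toy.std()

print(normalized.mean().round(6))
print(normalized.std().round(6))
```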


Euclidean distance for multivariate case

In the last mission, we trained 2 univariate k-nearest neighbors models. The first one used the accommodates attribute while the second one used the bathrooms attribute. Let's now train a model that uses both attributes when determining how similar 2 living spaces are. Let's refer to the Euclidean distance equation again to see what the distance calculation using 2 attributes would look like:

$$d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}$$

where $p_i$ and $q_i$ are the values of attribute $i$ for the two observations. Since we're using 2 attributes, the distance calculation would look like:

$$d(p, q) = \sqrt{(q_{\text{accommodates}} - p_{\text{accommodates}})^2 + (q_{\text{bathrooms}} - p_{\text{bathrooms}})^2}$$

To find the distance between 2 living spaces, we need to calculate the squared difference between both accommodates values, the squared difference between both bathrooms values, add them together, and then take the square root of the resulting sum. Here's what the Euclidean distance between the first 2 rows in normalized_listings looks like:

$$d = \sqrt{(-0.596544 - (-0.596544))^2 + (0.412923 - (-0.439151))^2} = \sqrt{0 + 0.726030} \approx 0.852074$$

So far, we've been calculating Euclidean distance by writing out the logic for the equation ourselves. We can instead use the distance.euclidean() function from scipy.spatial, which takes 2 vectors as parameters and calculates the Euclidean distance between them. The euclidean() function expects:

    both of the vectors to be represented using a list-like object (Python list, NumPy array, or pandas Series)
    both of the vectors to be 1-dimensional and have the same number of elements

Here's a simple example:

from scipy.spatial import distance

first_listing = [-0.596544, -0.439151]
second_listing = [-0.596544, 0.412923]
dist = distance.euclidean(first_listing, second_listing)

Let's use the euclidean() function to calculate the Euclidean distance between 2 rows in our dataset to practice.

In [5]:
from scipy.spatial import distance
first_listing = normalized_listings.iloc[0][['accommodates', 'bathrooms']]
fifth_listing = normalized_listings.iloc[4][['accommodates', 'bathrooms']]
first_fifth_distance = distance.euclidean(first_listing, fifth_listing)
print(first_fifth_distance)

5.27254312467
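
The same distance can be reproduced by hand, which is a useful cross-check on the library call. This sketch copies the normalized accommodates and bathrooms values for the first and fifth rows from the table shown earlier, so the result should agree with the output above up to the rounding of those table values.

```python
import math

# Normalized [accommodates, bathrooms] values copied from the table shown earlier.
first_listing = [-0.596544, -0.439151]
fifth_listing = [4.393004, 1.264998]

# Squared difference per attribute, summed, then square-rooted.
squared_diffs = [(a - b) ** 2 for a, b in zip(first_listing, fifth_listing)]
manual_distance = math.sqrt(sum(squared_diffs))
print(manual_distance)
```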


In [6]:
# Introducing the KNeighborsRegressor class (a regressor, not a classifier)
from sklearn.neighbors import KNeighborsRegressor

# Use the first 2792 rows for training and the remaining rows for testing.
train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]

# algorithm='brute' computes the distance to every training row.
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
train_feature = train_df.loc[:, ['accommodates', 'bathrooms']]
train_target = train_df['price']
test_feature = test_df.loc[:, ['accommodates', 'bathrooms']]
test_target = test_df['price']
knn.fit(train_feature, train_target)
predictions = knn.predict(test_feature)
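
To make concrete what a brute-force k-nearest neighbors regressor computes, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the function name `knn_predict` and the toy rows are assumptions made for illustration, and tie-breaking between equally distant neighbors may differ from the library. It mirrors the default uniform weighting, where the prediction is the plain average of the k nearest neighbors' target values.

```python
import math

def knn_predict(train_features, train_targets, query, k):
    """Predict a target as the mean target of the k nearest training rows."""
    distances = []
    for features, target in zip(train_features, train_targets):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(features, query)))
        distances.append((dist, target))
    # Sort by distance only, so ties keep their original training order.
    distances.sort(key=lambda pair: pair[0])
    nearest_targets = [target for _, target in distances[:k]]
    return sum(nearest_targets) / k

# Hypothetical normalized [accommodates, bathrooms] rows and their prices.
train_features = [[-0.6, -0.4], [-0.6, 0.4], [-1.1, -1.3], [4.4, 1.3]]
train_targets = [125.0, 85.0, 50.0, 215.0]

print(knn_predict(train_features, train_targets, query=[-0.6, -0.4], k=2))
```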

In [7]:
from sklearn.metrics import mean_squared_error

# Compare the predictions directly with the true prices. This avoids assigning a
# new column to test_df, which is a slice of normalized_listings and would
# trigger pandas' SettingWithCopyWarning.
two_features_mse = mean_squared_error(test_df['price'], predictions)
two_features_rmse = np.sqrt(two_features_mse)
print(two_features_mse)
print(two_features_rmse)

15184.425165
123.225099574
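
MSE and RMSE are simple enough to compute by hand, which helps when interpreting the numbers above. The sketch below uses made-up prices and a hypothetical helper name, `mean_squared_error_manual`. The last line checks that the square root of the MSE printed above gives the printed RMSE.

```python
import math

def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true and predicted prices.
actual_prices = [125.0, 85.0, 50.0]
predicted_prices = [105.0, 95.0, 80.0]

mse = mean_squared_error_manual(actual_prices, predicted_prices)
rmse = math.sqrt(mse)   # RMSE is in the same units as price (dollars)
print(mse, rmse)

# Square root of the two-feature MSE reported above.
print(math.sqrt(15184.425165))
```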


In [8]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

four_features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']

train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(train_df[four_features], train_df['price'])
predictions = knn.predict(test_df[four_features])

# Score the predictions directly instead of storing them in test_df, which
# would trigger pandas' SettingWithCopyWarning.
four_features_mse = mean_squared_error(test_df['price'], predictions)
four_features_rmse = np.sqrt(four_features_mse)
print(four_features_mse)
print(four_features_rmse)

14044.0656655
118.507660788


In [10]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

features = train_df.columns.tolist()
features.remove('price')

knn.fit(train_df[features], train_df['price'])
all_features_predictions = knn.predict(test_df[features])
all_features_mse = mean_squared_error(test_df['price'], all_features_predictions)
all_features_rmse = np.sqrt(all_features_mse)
print(all_features_mse)
print(all_features_rmse)

15392.6253925
124.067019761


Next steps

Interestingly enough, the RMSE value actually increased to about 124.1 when we used all of the features available to us, compared with about 118.5 for the four-feature model. This means that selecting the right features is important and that using more features doesn't automatically improve prediction accuracy. We should re-phrase the lever we mentioned earlier from:

    increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors

to:

    select the relevant attributes the model uses to calculate similarity when ranking the closest neighbors

The process of selecting features to use in a model is known as feature selection.
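
One simple, if brute-force, way to frame feature selection is to score several candidate feature sets and keep the one with the lowest error. The sketch below does this with the three RMSE values reported above, and also counts how many non-empty feature subsets exist for our 7 feature columns. The variable names are illustrative, and evaluating every subset is only one possible strategy, not the approach this mission prescribes.

```python
from itertools import combinations

all_features = ['accommodates', 'bedrooms', 'bathrooms', 'beds',
                'minimum_nights', 'maximum_nights', 'number_of_reviews']

# RMSE values copied from the outputs above, keyed by the feature set used.
rmse_by_features = {
    ('accommodates', 'bathrooms'): 123.225099574,
    ('accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews'): 118.507660788,
    tuple(all_features): 124.067019761,
}

# Keep the feature set with the lowest validation RMSE.
best_features = min(rmse_by_features, key=rmse_by_features.get)
print(best_features, rmse_by_features[best_features])

# Every non-empty subset of the 7 features is a candidate model.
candidate_subsets = [
    subset
    for size in range(1, len(all_features) + 1)
    for subset in combinations(all_features, size)
]
print(len(candidate_subsets))
```

With only 7 features, trying every subset is feasible, but the count doubles with each added feature, so this approach doesn't scale to wide datasets.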

In this mission, we prepared the data to be able to use more features, trained a few models using multiple features, and evaluated the different performance tradeoffs. We explored how using more features doesn't always improve the accuracy of a k-nearest neighbors model. In the next mission, we'll explore another knob for tuning k-nearest neighbors models: the k value.

Practice the workflow

You may have noticed that the general workflow for finding the best model is:

    select relevant features to use for predicting the target column.
    use grid search to find the optimal hyperparameter value for the selected features.
    evaluate the model's accuracy and repeat the process.
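
The grid search step of this workflow boils down to evaluating every candidate value and keeping the one with the lowest error. Here is a minimal, library-free sketch of that idea. The helper `grid_search` and the scoring function `fake_validation_mse` are hypothetical. In practice the scoring function would fit a model with the given k and return its validation MSE, whereas the curve used here is made up so that the example runs on its own.

```python
def grid_search(evaluate, candidates):
    """Return (best_candidate, best_score); the earliest candidate wins ties."""
    scores = {candidate: evaluate(candidate) for candidate in candidates}
    best = min(scores, key=scores.get)
    return best, scores[best]

# Stand-in for "fit a model with n_neighbors=k and return its validation MSE".
# This made-up curve bottoms out at k=5; it is not derived from the dataset.
def fake_validation_mse(k):
    return 13000.0 + 40.0 * (k - 5) ** 2

best_k, best_mse = grid_search(fake_validation_mse, range(1, 21))
print(best_k, best_mse)
```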


In [11]:
two_features = ['accommodates', 'bathrooms']
three_features = ['accommodates', 'bathrooms', 'bedrooms']
hyper_params = list(range(1, 21))

def mse_for_each_k(features):
    """Return the test MSE for every k in hyper_params, in order."""
    mse_values = []
    for hp in hyper_params:
        knn = KNeighborsRegressor(n_neighbors=hp, algorithm='brute')
        knn.fit(train_df[features], train_df['price'])
        predictions = knn.predict(test_df[features])
        mse_values.append(mean_squared_error(test_df['price'], predictions))
    return mse_values

two_mse_values = mse_for_each_k(two_features)
three_mse_values = mse_for_each_k(three_features)

# The MSE lists line up with hyper_params, so pick the (mse, k) pair with the
# lowest MSE. On ties, min() keeps the smallest k, like a strict "<" comparison.
two_lowest_mse, two_lowest_k = min(zip(two_mse_values, hyper_params))
three_lowest_mse, three_lowest_k = min(zip(three_mse_values, hyper_params))

# Map each model's best k to the MSE it achieved.
two_hyp_mse = {two_lowest_k: two_lowest_mse}
three_hyp_mse = {three_lowest_k: three_lowest_mse}

print(two_hyp_mse)
print(three_hyp_mse)

{5: 15184.425164960181}
{5: 13281.215108077358}


The first model, which used the accommodates and bathrooms columns, achieved its lowest MSE, approximately 15184.4, at k = 5. The second model, which added the bedrooms column, achieved an MSE of approximately 13281.2 (also at k = 5), which is even lower than the MSE of approximately 14044.1 from the earlier model that used the accommodates, bedrooms, bathrooms, and number_of_reviews columns. Hopefully this demonstrates that using just one lever to find the best model isn't enough and you really want to use both levers in conjunction.