In [8]:
import pandas as pd
dc_listings = pd.read_csv('/Users/eleonoreserge/Documents/listings.csv')
print(dc_listings.iloc[0])

id                                                                            7087327
listing_url                                      https://www.airbnb.com/rooms/7087327
scrape_id                                                              20151002231825
last_scraped                                                               2015-10-03
name                                               Historic DC Condo-Walk to Capitol!
summary                             Professional pictures coming soon! Welcome to ...
space                                                                             NaN
description                         Professional pictures coming soon! Welcome to ...
experiences_offered                                                              none
neighborhood_overview                                                             NaN
notes                                                                             NaN
transit                                               

# Here's the strategy we wanted to use:

- Find a few similar listings.
- Calculate the average nightly rental price of these listings.
- Set the average price as the price for our listing.

The k-nearest neighbors algorithm is similar to this strategy.

## Euclidean distance 

When trying to predict a continuous value, like price, the main similarity metric that's used is Euclidean distance. Here's the general formula for Euclidean distance:

d = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 +...+ (q_n-p_n)^2}

where q_1 to q_n represent the feature values for one observation and p_1 to p_n represent the feature values for the other observation. 

Here, to keep things simple, we'll use just one feature. Since we're only using one feature, this is known as the univariate case. Here's what the formula looks like for the univariate case:

d = \sqrt{(q_1 - p_1)^2}
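
As a minimal sketch of the two formulas above, the general case can be written with NumPy, and with a single feature it reduces to an absolute difference. The function name `euclidean_distance` and the sample values are illustrative assumptions, not part of the dataset.

```python
import numpy as np

def euclidean_distance(q, p):
    """Euclidean distance between two equal-length feature vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sqrt(np.sum((q - p) ** 2)))

# Multivariate case: two made-up observations with three features each.
d_multi = euclidean_distance([3, 1, 1], [4, 1, 3])

# Univariate case: sqrt((q_1 - p_1)^2) equals |q_1 - p_1|.
d_uni = euclidean_distance([3], [4])
print(d_multi, d_uni)
```

Because the univariate distance is just an absolute difference, the cells that follow can compute it with np.abs().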

The living space that we want to rent can accommodate 3 people. Let's first calculate the distance, using just the accommodates feature, between the first living space in the dataset and our own.

In [9]:
## Euclidean distance ##

import numpy as np
our_acc_value = 3
first_living_space_value = dc_listings.iloc[0]['accommodates']
first_distance = np.abs(first_living_space_value - our_acc_value)
print(first_distance)

1


The Euclidean distance between the first row in the dc_listings DataFrame and our own living space is 1. How do we know if this is high or low? The lowest value the Euclidean distance equation can produce is 0, which happens when the two values are identical.
The closer the distance is to 0, the more similar the living spaces are.

# Calculate distance for all observations

In [10]:
new_listing = 3
dc_listings['distance'] = dc_listings['accommodates'].apply(lambda x: np.abs(x - new_listing))
print(dc_listings['distance'].value_counts())

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64


We can see that 461 living spaces can accommodate 3 people, just like ours (distance 0).
This means that the 5 "nearest neighbors" we select after sorting will all have a distance of 0. If we sort by the distance column and then just select the first 5 living spaces, we would be biasing the result toward the ordering of the dataset.

Let's instead randomize the ordering of the dataset and then sort the DataFrame by the distance column. This way, all of the living spaces that accommodate 3 people will still be at the top of the DataFrame but will be in random order across the first 461 rows.
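
The shuffle-then-sort idea can be sketched on a small made-up DataFrame. This sketch swaps in `DataFrame.sample(frac=1)` for the shuffle and passes `kind='mergesort'` to request a stable sort, on the assumption that tied rows should keep their shuffled order; the toy values are not from the dataset.

```python
import pandas as pd

# Toy stand-in for dc_listings; the values are made up.
toy = pd.DataFrame({
    'accommodates': [3, 2, 3, 6, 3, 4],
    'price': [100.0, 80.0, 120.0, 300.0, 110.0, 150.0],
})
toy['distance'] = (toy['accommodates'] - 3).abs()

# Shuffle all rows reproducibly, then sort with a stable algorithm so that
# rows with equal distance keep their shuffled relative order.
shuffled = toy.sample(frac=1, random_state=1)
ranked = shuffled.sort_values('distance', kind='mergesort')
print(ranked[['accommodates', 'distance', 'price']])
```

The three rows with a distance of 0 end up at the top, in whatever order the shuffle put them.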

In [12]:
## Randomizing, and sorting ##
import numpy as np
np.random.seed(1)
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
dc_listings = dc_listings.sort_values('distance')
print(dc_listings.iloc[0:10]['price'])

577     $185.00
2166    $180.00
3631    $175.00
71      $128.00
1011    $115.00
380     $219.00
943     $125.00
3107    $250.00
1499     $94.00
625     $150.00
Name: price, dtype: object


# Average price

Before we can select the 5 most similar living spaces and compute the average price, we need to clean the price column. Right now, the price column contains comma characters (,) and dollar sign characters ($) and is stored as text instead of as numbers. We need to remove these characters and convert the entire column to the float datatype. Then, we can calculate the average price.
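
A minimal sketch of this cleaning step on a few made-up price strings is shown here. It removes both characters in one pass with a regular-expression character class, which is an alternative to two separate replacements; `regex=True` is passed explicitly so the behavior does not depend on the default of the installed pandas version.

```python
import pandas as pd

# Made-up price strings in the same format as the price column.
raw_prices = pd.Series(['$185.00', '$1,250.00', '$94.00'])

# Remove '$' and ',' in one pass with a character class, then convert to float.
clean_prices = raw_prices.str.replace(r'[$,]', '', regex=True).astype(float)
print(clean_prices.tolist())  # [185.0, 1250.0, 94.0]
```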

In [13]:
stripped_commas = dc_listings['price'].str.replace(',', '')
# '$' is also a regex anchor, so ask pandas to match it literally.
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
mean_price = dc_listings.iloc[0:5]['price'].mean()
print(mean_price)

156.6


## Conclusion

Based on the average price of other listings that accommodate 3 people, we should charge 156.6 dollars per night for a guest to stay at our living space.

# Function to make predictions

Let's write a more general function that can suggest the optimal price for other values of the accommodates column.

In [16]:
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]

def predict_price(new_listing):
    # Work on a copy so dc_listings itself is not modified.
    temp_df = dc_listings.copy()
    # Univariate Euclidean distance between each listing and the new one.
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    # Average the price of the 5 nearest neighbors.
    nearest_neighbors = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors.mean()
    return predicted_price

acc_one = predict_price(1)
acc_two = predict_price(2)
acc_four = predict_price(4)
print(acc_one)
print(acc_two)
print(acc_four)


71.8
96.8
96.0


We now have a function that can predict the price for any living space we want to list as long as we know the number of people it can accommodate. The function we wrote represents a machine learning model, which means that it outputs a prediction based on the input to the model.

We just used a simple k-nearest neighbors machine learning model that used just one feature, or attribute, of the listing to predict the rent price.
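
As a hedged sketch, the same idea can be written so that the data and the number of neighbors are explicit arguments rather than fixed inside the function. The name `predict_price_knn`, its parameters, and the toy listings are assumptions made for illustration.

```python
import pandas as pd

def predict_price_knn(listings, accommodates, k=5):
    """Average price of the k listings closest in 'accommodates'."""
    distances = (listings['accommodates'] - accommodates).abs()
    # A stable sort keeps the existing row order among ties.
    nearest = distances.sort_values(kind='mergesort').index[:k]
    return listings.loc[nearest, 'price'].mean()

# Made-up listings, not rows from the dataset.
toy_listings = pd.DataFrame({
    'accommodates': [1, 2, 2, 3, 4, 6],
    'price': [60.0, 80.0, 90.0, 120.0, 150.0, 300.0],
})
prediction = predict_price_knn(toy_listings, accommodates=2, k=3)
print(prediction)
```

With k = 3, the prediction averages the two listings that accommodate 2 people and the first listing at a distance of 1.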

Note, however, that using just a single feature to compare listings doesn't reflect the reality of the market. An apartment that can accommodate 4 guests in a popular part of Washington D.C. will rent for much more than one that can accommodate 4 guests in a crime-ridden area.

There are 2 ways we can tweak the model to try to improve the accuracy (decrease the RMSE during validation):

- increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors
- increase k, the number of nearby neighbors the model uses when computing the prediction

We'll focus on increasing the number of attributes the model uses.

When selecting more attributes to use in the model, we need to watch out for columns that don't work well with the distance equation. This includes columns containing:

- non-numerical values (e.g. city or state): the Euclidean distance equation expects numerical values
- missing values: the distance equation expects a value for each observation and attribute
- non-ordinal values (e.g. latitude or longitude): ranking by Euclidean distance doesn't make sense if all attributes aren't ordinal
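
The first two checks can be sketched with standard pandas calls on a small made-up DataFrame. Spotting numerical but non-ordinal columns still requires judgment about what each column means, so this sketch does not attempt it.

```python
import numpy as np
import pandas as pd

# Made-up frame with one text column and one column containing a gap.
toy = pd.DataFrame({
    'accommodates': [2, 4, 3],
    'bedrooms': [1.0, np.nan, 1.0],
    'city': ['Washington', 'Washington', 'Washington'],
})

# Columns whose dtype is not numeric cannot go into the distance equation as-is.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()

# Number of missing values per column.
missing_counts = toy.isnull().sum()

print(non_numeric)
print(missing_counts)
```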

Let's look at the columns to identify any that contain non-numerical or non-ordinal values.
We use the DataFrame.info() method to display the data type and the number of non-null values of each column.

In [35]:
import pandas as pd
import numpy as np
np.random.seed(1)

dc_listings = pd.read_csv('/Users/eleonoreserge/Documents/dc_airbnb.csv')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
stripped_commas = dc_listings['price'].str.replace(',', '')
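# '$' is also a regex anchor, so ask pandas to match it literally.
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
# info() prints its summary directly, so no print() call is needed.
dc_listings.info()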

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null float64
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(6), int64(5), objec

# Removing features

The following columns contain non-numerical values:

- room_type: e.g. Private room
- city: e.g. Washington
- state: e.g. DC

while these columns contain numerical but non-ordinal values:

- latitude: e.g. 38.913458
- longitude: e.g. -77.031
- zipcode: e.g. 20009

While we could convert the host_response_rate and host_acceptance_rate columns to be numerical (right now they're object data types and contain the % sign), these columns describe the host and not the living space itself. Let's avoid using any columns that don't directly describe the living space or the listing itself:

- host_response_rate
- host_acceptance_rate
- host_listings_count

Let's remove these 9 columns from the Dataframe.

In [36]:
drop_columns = ['room_type', 'city', 'state', 'latitude', 'longitude', 'zipcode', 'host_response_rate', 'host_acceptance_rate', 'host_listings_count']
dc_listings = dc_listings.drop(drop_columns, axis=1)
print(dc_listings.isnull().sum())

accommodates            0
bedrooms               21
bathrooms              27
beds                   11
price                   0
cleaning_fee         1388
security_deposit     2297
minimum_nights          0
maximum_nights          0
number_of_reviews       0
dtype: int64


Of the remaining columns, 3 columns have a few missing values (less than 1% of the total number of rows):

- bedrooms
- bathrooms
- beds

Since the number of rows with missing values in these 3 columns is low, we can select and remove those rows without losing much information. There are also 2 columns that have a large number of missing values:

- cleaning_fee: 37.3% of the rows
- security_deposit: 61.7% of the rows

We can't handle these easily. We can't just remove the rows containing missing values for these 2 columns because we'd lose the majority of the observations in the dataset. Instead, let's remove these 2 columns entirely from consideration.
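
The decision rule described here can be sketched with a cutoff on the fraction of missing values per column. The 30% threshold and the toy data are hypothetical, chosen only to illustrate the order of operations: drop sparse columns first, then drop the few rows that still contain gaps.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'beds' has one gap, 'security_deposit' is mostly empty.
toy = pd.DataFrame({
    'accommodates': [2, 4, 3, 2, 6],
    'beds': [1.0, 2.0, np.nan, 1.0, 3.0],
    'security_deposit': [np.nan, np.nan, 100.0, np.nan, np.nan],
})

# Fraction of missing values per column.
missing_fraction = toy.isnull().mean()

# Hypothetical cutoff: drop columns with more than 30% missing values.
max_missing = 0.3
sparse_columns = missing_fraction[missing_fraction > max_missing].index.tolist()

# Drop the sparse columns first, then the few rows that still contain gaps.
cleaned = toy.drop(columns=sparse_columns).dropna(axis=0)
print(sparse_columns, cleaned.shape)
```

Dropping rows before columns would instead discard every row with an empty security_deposit, which is most of the toy data.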

In [37]:
dc_listings = dc_listings.drop(['cleaning_fee', 'security_deposit'], axis=1)
dc_listings = dc_listings.dropna(axis=0)
print(dc_listings.isnull().sum())

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64


# Normalize columns

To prevent any single column from having too much of an impact on the distance, we can normalize all of the columns to have a mean of 0 and a standard deviation of 1 :

- from each value, subtract the mean of the column
- divide each value by the standard deviation of the column

To apply this transformation across all of the columns in a DataFrame, we can use the corresponding DataFrame methods mean() and std().

Let's now normalize all of the feature columns in dc_listings.
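
A minimal sketch on made-up values shows what the transformation does to columns on very different scales. Note that pandas computes std() as the sample standard deviation by default, so the normalized columns have a sample standard deviation of 1.

```python
import pandas as pd

# Made-up feature values on very different scales.
toy = pd.DataFrame({
    'accommodates': [2.0, 4.0, 6.0],
    'maximum_nights': [30.0, 365.0, 1125.0],
})

# Column-wise standardization: subtract each column's mean, divide by its std.
normalized = (toy - toy.mean()) / toy.std()

print(normalized)
print(normalized.mean().round(10))
print(normalized.std())
```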

In [38]:
normalized_listings = (dc_listings - dc_listings.mean())/(dc_listings.std())
normalized_listings['price'] = dc_listings['price']
print(normalized_listings.head(3))

      accommodates  bedrooms  bathrooms      beds  price  minimum_nights  \
574      -0.596544 -0.249467  -0.439151 -0.546858  125.0       -0.341375   
1593     -0.596544 -0.249467   0.412923 -0.546858   85.0       -0.341375   
3091     -1.095499 -0.249467  -1.291226 -0.546858   50.0       -0.341375   

      maximum_nights  number_of_reviews  
574        -0.016604           4.579650  
1593       -0.016603           1.159275  
3091       -0.016573          -0.482505  


# Euclidean distance for multivariate case

Let's now train a model that uses 2 attributes, accommodates and bathrooms, when determining how similar 2 living spaces are. Let's refer to the Euclidean distance equation again to see what the distance calculation using 2 attributes would look like:

d = \sqrt{(accommodates_1-accommodates_2)^2 + (bathrooms_1-bathrooms_2)^2 }

So far, we've been calculating Euclidean distance by writing the logic for the equation ourselves. We can instead use the distance.euclidean() function from scipy.spatial, which takes in 2 vectors as the parameters and calculates the Euclidean distance between them. The euclidean() function expects:

- both of the vectors to be represented using a list-like object (Python list, NumPy array, or pandas Series)
- both of the vectors must be 1-dimensional and have the same number of elements
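
As a quick sanity check with made-up vectors, a 3-4-5 right triangle gives a distance that is easy to verify by hand:

```python
from scipy.spatial import distance

# Two made-up 2-element vectors forming a 3-4-5 right triangle.
d = distance.euclidean([0, 0], [3, 4])
print(d)  # 5.0
```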

Let's use the euclidean() function to calculate the Euclidean distance between 2 rows in our dataset to practice.

In [39]:
from scipy.spatial import distance
first_listing = normalized_listings.iloc[0][['accommodates', 'bathrooms']]
fifth_listing = normalized_listings.iloc[4][['accommodates', 'bathrooms']]
first_fifth_distance = distance.euclidean(first_listing, fifth_listing)
print(first_fifth_distance)

5.272543124668404


# scikit-learn

The scikit-learn workflow consists of 4 main steps:

- instantiate the specific machine learning model you want to use
- fit the model to the training data
- use the model to make predictions
- evaluate the accuracy of the predictions
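
The four steps can be sketched end to end on a tiny made-up dataset, using the KNeighborsRegressor class and the mean_squared_error() function that this notebook relies on further down. The arrays and the choice of 2 neighbors are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Made-up training data: one feature, four observations.
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

# 1. Instantiate the model.
knn = KNeighborsRegressor(n_neighbors=2, algorithm='brute')
# 2. Fit the model to the training data.
knn.fit(X_train, y_train)
# 3. Use the model to make predictions.
X_test = np.array([[2.4]])
y_test = np.array([24.0])
predictions = knn.predict(X_test)
# 4. Evaluate the predictions.
mse = mean_squared_error(y_test, predictions)
print(predictions, mse)
```

The two nearest training points to 2.4 are 2.0 and 3.0, so the prediction is the mean of their targets, 25.0.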

Each model in scikit-learn is implemented as a separate class and the first step is to identify the class we want to create an instance of. In our case, we want to use the KNeighborsRegressor class.

Any model that helps us predict numerical values, like listing price in our case, is known as a regression model. The other main class of machine learning models is called classification, where we're trying to predict a label from a fixed set of labels (e.g. blood type or gender).

Scikit-learn uses a similar object-oriented style to Matplotlib and you need to instantiate an empty model first by calling the constructor:

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

By default, the constructor sets the following parameters:

- n_neighbors: the number of neighbors, set to 5
- algorithm: the method for computing nearest neighbors, set to auto
- p: set to 2, corresponding to Euclidean distance

Let's set the algorithm parameter to brute and leave the n_neighbors value as 5. If we leave the algorithm parameter set to the default value of auto, scikit-learn will try to use tree-based optimizations to improve performance (which are outside of the scope of our task).

## Fitting a model and making predictions

Now, we can fit the model to the data using the fit() method.

Now that we've specified the training data we want used to make predictions, we can use the predict() method to make predictions on the test set.

The number of feature columns used during training and testing needs to match, or scikit-learn will raise an error.
The predict() method returns a NumPy array containing the predicted price values for the test set.

In [44]:
from sklearn.neighbors import KNeighborsRegressor

train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]
train_columns = ['accommodates', 'bathrooms']

# Instantiate ML model.
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

# Fit model to data.
knn.fit(train_df[train_columns], train_df['price'])

# Use model to make predictions.
predictions = knn.predict(test_df[train_columns])

# Calculating MSE using Scikit-Learn

We can use the mean_squared_error() function from sklearn.metrics to calculate the MSE, then take its square root to get the RMSE.

The mean_squared_error() function takes in 2 inputs:

- list-like object, representing the true values
- list-like object, representing the predicted values using the model
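
A small sketch with made-up prices shows the call and checks it against the definition of MSE computed by hand:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted prices.
true_prices = [100.0, 150.0, 200.0]
predicted_prices = [110.0, 140.0, 230.0]

mse = mean_squared_error(true_prices, predicted_prices)
rmse = np.sqrt(mse)

# The same MSE computed by hand: mean of squared differences.
manual_mse = np.mean((np.array(true_prices) - np.array(predicted_prices)) ** 2)
print(mse, rmse, manual_mse)
```

The RMSE is in the same units as the target (dollars here), which makes it easier to interpret than the MSE.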

In [46]:
from sklearn.metrics import mean_squared_error

train_columns = ['accommodates', 'bathrooms']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute', metric='euclidean')
knn.fit(train_df[train_columns], train_df['price'])
predictions = knn.predict(test_df[train_columns])

two_features_mse = mean_squared_error(test_df['price'], predictions)
two_features_rmse = two_features_mse ** (1/2)
print(two_features_mse)
print(two_features_rmse)

15660.39795221843
125.14151170662127


## Using more features

The model we trained using both features ended up performing better (lower error score) than either of the univariate models.

Let's now train a model using the following 4 features:

- accommodates
- bedrooms
- bathrooms
- number_of_reviews

In [42]:
features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(train_df[features], train_df['price'])
four_predictions = knn.predict(test_df[features])
four_mse = mean_squared_error(test_df['price'], four_predictions)
four_rmse = four_mse ** (1/2)
print(four_mse)
print(four_rmse)

13320.230625711036
115.41330350402


So far so good! As we increased the number of features the model used, we observed lower MSE and RMSE values.
However, selecting the right features is important, and using more features doesn't automatically improve prediction accuracy.
For example, when we use all of the features available to us, the RMSE value actually increases to about 124.

In [43]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

features = train_df.columns.tolist()
features.remove('price')

knn.fit(train_df[features], train_df['price'])
all_features_predictions = knn.predict(test_df[features])
all_features_mse = mean_squared_error(test_df['price'], all_features_predictions)
all_features_rmse = all_features_mse ** (1/2)
print(all_features_mse)
print(all_features_rmse)

15455.275631399316
124.31924883701363


# Recap

We prepared the data so that we could use more features and trained a few models using multiple features. We explored how using more features doesn't always improve the accuracy of a k-nearest neighbors model. Now, we'll explore another knob for tuning k-nearest neighbors models: the k value.

We'll focus on the impact of increasing k, the number of nearby neighbors the model uses to make predictions.
The data is split into a training set (dc_airbnb_train.csv) and a test set (dc_airbnb_test.csv).
Let's read both of these CSV files into DataFrames.

In [49]:
import pandas as pd
train_df = pd.read_csv('/Users/eleonoreserge/Documents/dc_airbnb_train.csv')
test_df = pd.read_csv('/Users/eleonoreserge/Documents/dc_airbnb_test.csv') 

# Hyperparameter optimization

When we vary the features that are used in the model, we're affecting the data that the model uses. On the other hand, varying the k value affects the behavior of the model independently of the actual data that's used when making predictions. In other words, we're impacting how the model performs without trying to change the data that's used.

Hyperparameters: values that affect the behavior and performance of a model that are unrelated to the data that's used.

## Grid search
A simple but common hyperparameter optimization technique is known as grid search.
Grid search essentially boils down to evaluating the model performance at different k values and selecting the k value that resulted in the lowest error. 

Let's confirm that grid search will work quickly for the dataset we're working with by first observing how the model performance changes as we increase the k value from 1 to 5. 

Let's use the features that resulted in the best model accuracy:

- accommodates
- bedrooms
- bathrooms
- number_of_reviews

In [50]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']
hyper_params = [1, 2, 3, 4, 5]
mse_values = list()

for hp in hyper_params:
    knn = KNeighborsRegressor(n_neighbors=hp, algorithm='brute')
    knn.fit(train_df[features], train_df['price'])
    predictions = knn.predict(test_df[features])
    mse = mean_squared_error(test_df['price'], predictions)
    mse_values.append(mse)

print(mse_values)

[26364.92832764505, 15100.52246871445, 14579.597901655923, 16212.300767918088, 14090.011649601822]


As we increased the k value from 1 to 5, the MSE value fell from approximately 26364 to approximately 14090, although not steadily: it rose to approximately 16212 at k = 4.

Let's expand grid search all the way to a k value of 20.

In [51]:
features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']
hyper_params = [x for x in range(1, 21)]
mse_values = list()

for hp in hyper_params:
    knn = KNeighborsRegressor(n_neighbors=hp, algorithm='brute')
    knn.fit(train_df[features], train_df['price'])
    predictions = knn.predict(test_df[features])
    mse = mean_squared_error(test_df['price'], predictions)
    mse_values.append(mse)
print(mse_values)

[26364.92832764505, 15100.52246871445, 14579.597901655923, 16212.300767918088, 14090.011649601822, 13657.45250284414, 14288.273896589353, 14853.448183304892, 14670.831907751512, 14642.451478953355, 14734.071380889252, 14854.802332195677, 14733.16190399257, 14777.975894453346, 14771.171543420554, 14870.178509847838, 14830.55072806075, 14782.595763283192, 14773.558705907935, 14676.544189419797]


As we increased the k value from 1 to 6, the MSE value decreased from approximately 26364 to approximately 13657. However, as we increased the k value from 7 to 20, the MSE value didn't decrease further but instead hovered between approximately 14288 and 14870. This means that, among the k values we tried, the optimal k value is 6, since it resulted in the lowest MSE value.
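
Rather than reading the list by eye, the best k can be picked programmatically. This sketch uses the first eight MSE values from the output above, rounded to two decimals; the variable names are illustrative.

```python
import numpy as np

# First eight MSE values from the output above, rounded to two decimals
# (k = 1 through k = 8).
mse_values = [26364.93, 15100.52, 14579.60, 16212.30,
              14090.01, 13657.45, 14288.27, 14853.45]
k_values = list(range(1, len(mse_values) + 1))

# Index of the smallest MSE, mapped back to its k value.
best_index = int(np.argmin(mse_values))
best_k = k_values[best_index]
print(best_k, mse_values[best_index])  # 6 13657.45
```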

## Visualizing hyperparameter values

Let's confirm this behavior visually using a scatter plot.

In [52]:
import matplotlib.pyplot as plt

features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']
hyper_params = [x for x in range(1, 21)]
mse_values = list()

for hp in hyper_params:
    knn = KNeighborsRegressor(n_neighbors=hp, algorithm='brute')
    knn.fit(train_df[features], train_df['price'])
    predictions = knn.predict(test_df[features])
    mse = mean_squared_error(test_df['price'], predictions)
    mse_values.append(mse)
plt.scatter(hyper_params, mse_values)
plt.show()

<Figure size 640x480 with 1 Axes>

From the scatter plot, you can tell that the lowest MSE value was achieved at the k value of 6.

The general workflow for finding the best model is:

- select relevant features to use for predicting the target column.
- use grid search to find the optimal hyperparameter value for the selected features.
- evaluate the model's accuracy and repeat the process.
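
One way to sketch this workflow is a small helper that runs grid search for a given feature set and returns the best k with its MSE, so that different feature sets can be compared in a loop. The helper name `grid_search_k`, its parameters, and the synthetic data are assumptions made for illustration; they are not part of the dataset or of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def grid_search_k(train, test, features, target='price', k_values=range(1, 6)):
    """Return (best_k, best_mse) for one feature set, scored on the test set."""
    results = {}
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
        knn.fit(train[features], train[target])
        predictions = knn.predict(test[features])
        results[k] = mean_squared_error(test[target], predictions)
    best_k = min(results, key=results.get)
    return best_k, results[best_k]

# Synthetic data for illustration only: price depends mostly on accommodates.
rng = np.random.default_rng(0)
n = 60
data = pd.DataFrame({
    'accommodates': rng.integers(1, 7, size=n).astype(float),
    'bathrooms': rng.integers(1, 3, size=n).astype(float),
})
data['price'] = 40 * data['accommodates'] + rng.normal(0, 10, size=n)
train, test = data.iloc[:45], data.iloc[45:]

# Repeat the search for each candidate feature set and compare.
for features in (['accommodates'], ['accommodates', 'bathrooms']):
    print(features, grid_search_k(train, test, features))
```

Keeping the search in one function avoids repeating the fitting loop for every feature set.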

# Workflow of finding the optimal model to make predictions.

In [54]:
two_features = ['accommodates', 'bathrooms']
three_features = ['accommodates', 'bathrooms', 'bedrooms']
hyper_params = [x for x in range(1,21)]
# Append the first model's MSE values to this list.
two_mse_values = list()
# Append the second model's MSE values to this list.
three_mse_values = list()
two_hyp_mse = dict()
three_hyp_mse = dict()
for hp in hyper_params:
    knn = KNeighborsRegressor(n_neighbors=hp, algorithm='brute')
    knn.fit(train_df[two_features], train_df['price'])
    predictions = knn.predict(test_df[two_features])
    mse = mean_squared_error(test_df['price'], predictions)
    two_mse_values.append(mse)

two_lowest_mse = two_mse_values[0]
two_lowest_k = 1

for k,mse in enumerate(two_mse_values):
    if mse < two_lowest_mse:
        two_lowest_mse = mse
        two_lowest_k = k + 1
    
for hp in hyper_params:
    knn = KNeighborsRegressor(n_neighbors=hp, algorithm='brute')
    knn.fit(train_df[three_features], train_df['price'])
    predictions = knn.predict(test_df[three_features])
    mse = mean_squared_error(test_df['price'], predictions)
    three_mse_values.append(mse)
    
three_lowest_mse = three_mse_values[0]
three_lowest_k = 1

for k,mse in enumerate(three_mse_values):
    if mse < three_lowest_mse:
        three_lowest_mse = mse
        three_lowest_k = k + 1

two_hyp_mse[two_lowest_k] = two_lowest_mse
three_hyp_mse[three_lowest_k] = three_lowest_mse

print(two_hyp_mse)
print(three_hyp_mse)

{5: 14790.314266211606}
{5: 13522.893333333333}
