In [8]:
import pandas as pd
dc_listings = pd.read_csv('/Users/eleonoreserge/Documents/listings.csv')
print(dc_listings.iloc[0])

id                                                                            7087327
listing_url                                      https://www.airbnb.com/rooms/7087327
scrape_id                                                              20151002231825
last_scraped                                                               2015-10-03
name                                               Historic DC Condo-Walk to Capitol!
summary                             Professional pictures coming soon! Welcome to ...
space                                                                             NaN
description                         Professional pictures coming soon! Welcome to ...
experiences_offered                                                              none
neighborhood_overview                                                             NaN
notes                                                                             NaN
transit                                               

# Here's the strategy we wanted to use:

1. Find a few similar listings.
2. Calculate the average nightly rental price of these listings.
3. Set the average price as the price for our listing.

The k-nearest neighbors algorithm is similar to this strategy.

## Euclidean distance 

When trying to predict a continuous value, like price, the main similarity metric that's used is Euclidean distance. Here's the general formula for Euclidean distance:

$$d = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_n-p_n)^2}$$

where $q_1$ to $q_n$ represent the feature values for one observation and $p_1$ to $p_n$ represent the feature values for the other observation.
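
As a minimal sketch of the general formula, the function below computes the distance between two feature vectors with NumPy. The function name `euclidean_distance` and the example feature values are assumptions made for illustration; they do not come from the listings data.

```python
import numpy as np

def euclidean_distance(q, p):
    """Euclidean distance between two equal-length feature vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    # Square the feature-wise differences, sum them, take the square root.
    return np.sqrt(np.sum((q - p) ** 2))

# Hypothetical listings described by three numeric features.
listing_a = [3, 1, 1]
listing_b = [6, 5, 1]
print(euclidean_distance(listing_a, listing_b))  # sqrt(9 + 16 + 0) = 5.0
```

With a single feature, the same function reduces to the univariate case.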

Here, to keep things simple, we'll use just one feature. Since we're only using one feature, this is known as the univariate case. Here's what the formula looks like for the univariate case:

$$d = \sqrt{(q_1 - p_1)^2}$$
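
Since the square root of a squared number is its absolute value, the univariate formula reduces to an absolute difference, which is why the code in this walkthrough computes the distance with `np.abs`. As an illustration with hypothetical values ($q_1 = 3$ for our listing and $p_1 = 5$ for another listing):

$$d = \sqrt{(q_1 - p_1)^2} = |q_1 - p_1| = |3 - 5| = 2$$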

The living space that we want to rent can accommodate 3 people. Let's first calculate the distance, using just the accommodates feature, between the first living space in the dataset and our own.

In [9]:
## Euclidean distance ##

import numpy as np
our_acc_value = 3
first_living_space_value = dc_listings.iloc[0]['accommodates']
first_distance = np.abs(first_living_space_value - our_acc_value)
print(first_distance)

1


The Euclidean distance between the first row in the dc_listings DataFrame and our own living space is 1. How do we know whether this is high or low? The lowest value the Euclidean distance can take is 0, which occurs when the two observations have identical feature values.
The closer the distance is to 0, the more similar the living spaces are.

# Calculate distance for all observations

In [10]:
new_listing = 3
dc_listings['distance'] = dc_listings['accommodates'].apply(lambda x: np.abs(x - new_listing))
print(dc_listings['distance'].value_counts())

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64


There are 461 living spaces that can accommodate 3 people, just like ours (distance 0).
This means the 5 "nearest neighbors" we select after sorting will all have a distance of 0. If we sort by the distance column and simply select the first 5 living spaces, we bias the result toward the original ordering of the dataset.

Let's instead randomize the ordering of the dataset and then sort the DataFrame by the distance column. This way, all of the living spaces that accommodate the same number of people as ours will still be at the top of the DataFrame, but they will be in random order across the first 461 rows.
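
The effect of shuffling before sorting can be seen on a small example. This is only a sketch on made-up data: the `toy` DataFrame and its values are hypothetical, and it uses `DataFrame.sample` with a stable sort rather than the `np.random.permutation` call used in this walkthrough.

```python
import pandas as pd

# Toy data (hypothetical): five listings, three of them tied at distance 0.
toy = pd.DataFrame({
    'listing': ['a', 'b', 'c', 'd', 'e'],
    'distance': [0, 2, 0, 1, 0],
})

# Shuffle the rows reproducibly, then sort by distance.
# A stable sort (mergesort) keeps the shuffled order among tied rows.
shuffled = toy.sample(frac=1, random_state=1)
ranked = shuffled.sort_values('distance', kind='mergesort')
print(ranked)
```

The three listings with distance 0 still come first, but their relative order comes from the shuffle instead of from their position in the original data.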

In [12]:
## Randomizing, and sorting ##
import numpy as np
np.random.seed(1)
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
dc_listings = dc_listings.sort_values('distance')
print(dc_listings.iloc[0:10]['price'])

577     $185.00
2166    $180.00
3631    $175.00
71      $128.00
1011    $115.00
380     $219.00
943     $125.00
3107    $250.00
1499     $94.00
625     $150.00
Name: price, dtype: object


# Average price

Before we can select the 5 most similar living spaces and compute the average price, we need to clean the price column. Right now, the price column contains comma characters (,) and dollar sign characters ($) and is stored as text instead of as numbers. We need to remove these characters and convert the entire column to the float datatype. Then, we can calculate the average price.

In [13]:
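# regex=False treats ',' and '$' as literal characters ('$' is otherwise a regex anchor).
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)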
dc_listings['price'] = stripped_dollars.astype('float')
mean_price = dc_listings.iloc[0:5]['price'].mean()
print(mean_price)

156.6
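
To see what the cleaning step does to individual values, here is a small sketch on hypothetical price strings (these values are made up and include a thousands separator, which does not appear in the rows printed earlier).

```python
import pandas as pd

# Hypothetical price strings in the same format as the price column.
prices = pd.Series(['$1,250.00', '$85.00', '$219.00'])

# regex=False treats '$' as a literal character rather than a regex anchor.
cleaned = (
    prices.str.replace(',', '', regex=False)
          .str.replace('$', '', regex=False)
          .astype('float')
)
print(cleaned.tolist())  # [1250.0, 85.0, 219.0]
print(cleaned.mean())    # 518.0
```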


## Conclusion

Based on the average price of the 5 selected listings that accommodate 3 people, we should charge 156.6 dollars per night for a guest to stay at our living space.

# Function to make predictions

Let's write a more general function that can suggest the optimal price for other values of the accommodates column.

In [16]:
# Shuffle the rows so that ties in distance are broken at random.
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]

def predict_price(new_listing):
    # Work on a copy so the original DataFrame is left unchanged.
    temp_df = dc_listings.copy()
    # Univariate distance on the accommodates column.
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    # Average the price of the 5 nearest neighbors.
    nearest_neighbors = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors.mean()
    return predicted_price

acc_one = predict_price(1)
acc_two = predict_price(2)
acc_four = predict_price(4)
print(acc_one)
print(acc_two)
print(acc_four)


71.8
96.8
96.0
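
The number of neighbors is fixed at 5 in the function above. As a possible variation, the sketch below takes the DataFrame and the number of neighbors as arguments. The name `predict_price_k`, its `k` and `seed` parameters, and the toy data are all assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

def predict_price_k(listings, new_listing, k=5, seed=1):
    """Mean price of the k listings closest in 'accommodates'."""
    # Shuffle first so that ties in distance are broken at random.
    shuffled = listings.sample(frac=1, random_state=seed)
    distance = (shuffled['accommodates'] - new_listing).abs()
    # A stable sort keeps the shuffled order among tied distances.
    nearest = distance.sort_values(kind='mergesort').index[:k]
    return shuffled.loc[nearest, 'price'].mean()

# Toy data (hypothetical), not the DC listings.
toy = pd.DataFrame({
    'accommodates': [1, 2, 2, 3, 4, 6],
    'price': [60.0, 80.0, 90.0, 120.0, 150.0, 300.0],
})
print(predict_price_k(toy, new_listing=2, k=2))  # 85.0
print(predict_price_k(toy, new_listing=6, k=1))  # 300.0
```

Passing the DataFrame in explicitly avoids relying on a global variable, and a small `k` makes the prediction more sensitive to individual listings.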
