# 1. Recap

In the last mission, we explored how to use a simple k-nearest neighbors machine learning model that used just one feature, or attribute, of the listing to predict the rent price. We first relied on the accommodates column, which describes the number of people a living space can comfortably accommodate. Then, we switched to the bathrooms column and observed an improvement in accuracy. While these were good features for becoming familiar with the basics of machine learning, it's clear that using just a single feature to compare listings doesn't reflect the reality of the market. An apartment that can accommodate 4 guests in a popular part of Washington D.C. will rent for much more than one that can accommodate 4 guests in a crime-ridden area.

**There are 2 ways we can tweak the model to try to improve the accuracy (decrease the RMSE during validation):**

* increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors
* increase k, the number of nearby neighbors the model uses when computing the prediction

In this mission, we'll focus on **increasing the number of attributes the model uses**. When selecting more attributes to use in the model, we need to **watch out for columns that don't work well with the distance equation**. This includes columns containing:

* **non-numerical values** (e.g. city or state)
  * the Euclidean distance equation expects numerical values
* **missing values**
  * the distance equation expects a value for each observation and attribute
* **non-ordinal values** (e.g. latitude or longitude)
  * ranking by Euclidean distance doesn't make sense if all attributes aren't ordinal
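The first two kinds of problem columns can be spotted programmatically. The following is a minimal sketch on a hypothetical toy DataFrame (the `toy` name and its values are assumptions for illustration, not the real dc_airbnb.csv data). Non-ordinal columns such as latitude still have to be identified by reading what the column means.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings, made up for illustration only.
toy = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "bathrooms": [1.0, np.nan, 0.5],
    "city": ["Washington", "Washington", "Washington"],
})

# Columns with a non-numeric dtype can't be fed to the distance equation as-is.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Columns that contain at least one missing value.
with_missing = [col for col in toy.columns if toy[col].isnull().any()]

print(non_numeric)   # ['city']
print(with_missing)  # ['bathrooms']
```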

## TODO:
Use the DataFrame.info() method to return the number of non-null values in each column.

In [1]:
import pandas as pd

dc_listings=pd.read_csv('dc_airbnb.csv')

In [2]:
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3723 entries, 0 to 3722
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null object
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(5), int64(5), object(9)

## Observation
* `host_response_rate`, `host_acceptance_rate`, `bedrooms`, `bathrooms`, `beds`, `cleaning_fee`, `security_deposit`, and `zipcode` have missing values.

In [3]:
dc_listings.loc[dc_listings['security_deposit'].isnull(),'security_deposit']

1       NaN
2       NaN
3       NaN
5       NaN
7       NaN
8       NaN
9       NaN
13      NaN
14      NaN
15      NaN
16      NaN
18      NaN
19      NaN
20      NaN
21      NaN
23      NaN
25      NaN
26      NaN
27      NaN
28      NaN
29      NaN
30      NaN
32      NaN
34      NaN
35      NaN
36      NaN
37      NaN
38      NaN
39      NaN
40      NaN
       ... 
3662    NaN
3663    NaN
3664    NaN
3665    NaN
3670    NaN
3671    NaN
3675    NaN
3676    NaN
3677    NaN
3679    NaN
3680    NaN
3681    NaN
3688    NaN
3689    NaN
3692    NaN
3694    NaN
3696    NaN
3698    NaN
3701    NaN
3703    NaN
3704    NaN
3705    NaN
3706    NaN
3710    NaN
3711    NaN
3712    NaN
3714    NaN
3716    NaN
3719    NaN
3721    NaN
Name: security_deposit, Length: 2297, dtype: object

In [4]:
import numpy as np
np.random.seed(1)


dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
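# regex=False treats ',' and '$' as literal characters ('$' is an anchor in a regex)
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)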
dc_listings['price'] = stripped_dollars.astype('float')
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null float64
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(6), int64(5), object(8)

# 2. Removing features

* The following columns contain non-numerical values:

  * room_type: e.g. Private room
  * city: e.g. Washington
  * state: e.g. DC
* These columns contain numerical but non-ordinal values:

  * latitude: e.g. 38.913548
  * longitude: e.g. -77.031
  * zipcode: e.g. 20009

In [5]:
dc_listings.head(3)

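|  | host_response_rate | host_acceptance_rate | host_listings_count | accommodates | room_type | bedrooms | bathrooms | beds | price | cleaning_fee | security_deposit | minimum_nights | maximum_nights | number_of_reviews | latitude | longitude | city | zipcode | state |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 574 | 100% | 100% | 1 | 2 | Private room | 1.0 | 1.0 | 1.0 | 125.0 | NaN | \$300.00 | 1 | 4 | 149 | 38.913548 | -77.031981 | Washington | 20009 | DC |
| 1593 | 87% | 100% | 2 | 2 | Private room | 1.0 | 1.5 | 1.0 | 85.0 | \$15.00 | NaN | 1 | 30 | 49 | 38.953431 | -77.030695 | Washington | 20011 | DC |
| 3091 | 100% | NaN | 1 | 1 | Private room | 1.0 | 0.5 | 1.0 | 50.0 | NaN | NaN | 1 | 1125 | 1 | 38.933491 | -77.029679 | Washington | 20010 | DC |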


While we could convert the host_response_rate and host_acceptance_rate columns to be numerical (right now they're object data types and contain the % sign), these columns describe the host and not the living space itself. The same is true of host_listings_count, so we'll leave out all 3 host columns:

* host_response_rate
* host_acceptance_rate
* host_listings_count
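For reference, converting a percentage column to numbers could look roughly like the sketch below. The `rates` Series is a hypothetical stand-in holding a few values in the same format as host_response_rate, and stripping the % sign with `str.rstrip` is just one possible approach.

```python
import pandas as pd

# Hypothetical values in the same format as host_response_rate (None = missing).
rates = pd.Series(["100%", "87%", None])

# Strip the trailing % sign, then cast the remaining text to float.
# Missing entries stay missing (NaN) after the conversion.
numeric_rates = rates.str.rstrip("%").astype("float")

print(numeric_rates)
```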

## TODO:
Remove the 9 columns we discussed above from dc_listings:
* 3 containing non-numerical values
* 3 containing numerical but non-ordinal values
* 3 describing the host instead of the living space itself

In [6]:
dc_listings=dc_listings.drop(['host_response_rate','host_acceptance_rate','host_listings_count','room_type','city','state',
                  'latitude','longitude','zipcode'],axis=1)

In [7]:
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 10 columns):
accommodates         3723 non-null int64
bedrooms             3702 non-null float64
bathrooms            3696 non-null float64
beds                 3712 non-null float64
price                3723 non-null float64
cleaning_fee         2335 non-null object
security_deposit     1426 non-null object
minimum_nights       3723 non-null int64
maximum_nights       3723 non-null int64
number_of_reviews    3723 non-null int64
dtypes: float64(4), int64(4), object(2)
memory usage: 319.9+ KB


In [8]:
print(dc_listings.isnull().sum())

accommodates            0
bedrooms               21
bathrooms              27
beds                   11
price                   0
cleaning_fee         1388
security_deposit     2297
minimum_nights          0
maximum_nights          0
number_of_reviews       0
dtype: int64


# 3. Handling missing values

Of the remaining columns, 3 columns have a few missing values (less than 1% of the total number of rows):

* bedrooms
* bathrooms
* beds

* Since the number of rows containing missing values for one of these 3 columns is low, **we can select and remove those rows without losing much information**

There are also 2 columns that have a large number of missing values:

* cleaning_fee: 37.3% of the rows
* security_deposit: 61.7% of the rows

We can't just remove the rows containing missing values for these 2 columns because we'd miss out on the majority of the observations in the dataset. Instead, **let's remove these 2 columns entirely from consideration**.
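The percentages above follow directly from the null counts printed earlier. As a quick check, this sketch redoes the arithmetic in plain Python (the `null_counts` and `missing_pct` names are just for illustration):

```python
total_rows = 3723  # number of entries reported by dc_listings.info()

# Null counts copied from the isnull().sum() output above.
null_counts = {
    "bedrooms": 21,
    "bathrooms": 27,
    "beds": 11,
    "cleaning_fee": 1388,
    "security_deposit": 2297,
}

# Share of rows with a missing value, as a percentage.
missing_pct = {col: 100 * n / total_rows for col, n in null_counts.items()}

for col, pct in missing_pct.items():
    print(f"{col}: {pct:.1f}%")
```

With pandas, the same figures could be obtained from the DataFrame itself with `dc_listings.isnull().mean() * 100`.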

## TODO:
* Drop the cleaning_fee and security_deposit columns from dc_listings.
* Then, remove all rows that contain a missing value for the bedrooms, bathrooms, or beds column from dc_listings.
* You can accomplish this by using the DataFrame method dropna() and setting the axis parameter to 0.
* Since only the bedrooms, bathrooms, and beds columns contain any missing values, rows containing missing values in these columns will be removed.
* Display the null value counts for the updated dc_listings DataFrame to confirm that there are no missing values left.

In [9]:
dc_listings=dc_listings.drop(['cleaning_fee','security_deposit'],axis=1)

In [10]:
dc_listings[['bedrooms','bathrooms','beds']].isnull().head()

|  | bedrooms | bathrooms | beds |
| --- | --- | --- | --- |
| 574 | False | False | False |
| 1593 | False | False | False |
| 3091 | False | False | False |
| 420 | False | False | False |
| 808 | False | False | False |


In [11]:
dc_listings=dc_listings.dropna(axis=0)

In [12]:
dc_listings.isnull().sum()

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64

# 4. Normalize columns

In [13]:
dc_listings.head()

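|  | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 574 | 2 | 1.0 | 1.0 | 1.0 | 125.0 | 1 | 4 | 149 |
| 1593 | 2 | 1.0 | 1.5 | 1.0 | 85.0 | 1 | 30 | 49 |
| 3091 | 1 | 1.0 | 0.5 | 1.0 | 50.0 | 1 | 1125 | 1 |
| 420 | 2 | 1.0 | 1.0 | 1.0 | 209.0 | 4 | 730 | 2 |
| 808 | 12 | 5.0 | 2.0 | 5.0 | 215.0 | 2 | 1825 | 34 |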


You may have noticed that while the values in the accommodates, bedrooms, bathrooms, beds, and minimum_nights columns range between 0 and 12 (at least in the first few rows), the values in the maximum_nights and number_of_reviews columns span much larger ranges. For example, within the first few rows alone, the maximum_nights column has values as low as 4 and as high as 1825. If we use these 2 columns as part of a k-nearest neighbors model, they could end up having an outsized effect on the distance calculations simply because their values are so much larger.

`To prevent any single column from having too much of an impact on the distance, we can normalize all of the columns to have a mean of 0 and a standard deviation of 1.`

**Rescaling the values in each column to have a mean of 0 and a standard deviation of 1 preserves the shape of each column's distribution while aligning the scales.**

To normalize the values in a column this way, you need to:

* subtract the mean of the column from each value
* divide each resulting difference by the standard deviation of the column

## TODO:
* Normalize all of the feature columns in dc_listings and assign the new DataFrame containing just the normalized feature columns to normalized_listings.
* Add the price column from dc_listings to normalized_listings.
* Display the first 3 rows in normalized_listings.

In [14]:
normalized_listings=(dc_listings - dc_listings.mean())/dc_listings.std()

In [15]:
normalized_listings['price']=dc_listings['price']

In [16]:
normalized_listings.head(3)

|  | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 574 | -0.596544 | -0.249467 | -0.439151 | -0.546858 | 125.0 | -0.341375 | -0.016604 | 4.57965 |
| 1593 | -0.596544 | -0.249467 | 0.412923 | -0.546858 | 85.0 | -0.341375 | -0.016603 | 1.159275 |
| 3091 | -1.095499 | -0.249467 | -1.291226 | -0.546858 | 50.0 | -0.341375 | -0.016573 | -0.482505 |


# 5. Euclidean distance for multivariate case

If we use 2 attributes, accommodates and bathrooms, the distance calculation looks like:

$$d = \sqrt{(\text{accommodates}_1-\text{accommodates}_2)^2 + (\text{bathrooms}_1-\text{bathrooms}_2)^2}$$

So far, we've been calculating Euclidean distance by writing the logic for the equation ourselves. We can instead use the [distance.euclidean() function](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.euclidean.html) from scipy.spatial, which takes 2 vectors as parameters and calculates the Euclidean distance between them. The euclidean() function expects:

* both vectors to be represented using a list-like object (Python list, NumPy array, or pandas Series)
* both vectors to be 1-dimensional and have the same number of elements
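To make the equation concrete, here is a small sketch that computes the two-feature distance by hand and compares it with `math.dist` from the Python standard library (available in Python 3.8+). The vectors are the normalized accommodates and bathrooms values of the first and third rows shown earlier. `distance.euclidean()` from SciPy should give the same number for the same inputs.

```python
import math

# (accommodates, bathrooms) for the first and third rows of normalized_listings.
first = (-0.596544, -0.439151)
third = (-1.095499, -1.291226)

# Manual calculation: square root of the sum of squared differences.
manual = math.sqrt(sum((a - b) ** 2 for a, b in zip(first, third)))

# Same calculation using the standard library helper.
builtin = math.dist(first, third)

print(manual, builtin)
```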

## TODO:
* Calculate the Euclidean distance using only the accommodates and bathrooms features between the first row and fifth row in normalized_listings using the distance.euclidean() function.
* Assign the distance value to first_fifth_distance and display using the print function.

In [17]:
from scipy.spatial import distance

In [18]:
first_listing = normalized_listings.iloc[0][['accommodates', 'bathrooms']]
fifth_listing = normalized_listings.iloc[4][['accommodates', 'bathrooms']]
first_fifth_distance = distance.euclidean(first_listing, fifth_listing)
print(first_fifth_distance)

5.272543124668404


# 6. Introduction to scikit-learn

We'll learn about the [scikit-learn library](https://scikit-learn.org/stable/), which is the most popular machine learning library in Python. Scikit-learn contains functions for all of the major machine learning algorithms and a simple, unified workflow. Both of these properties allow data scientists to be incredibly productive when training and testing different models on a new dataset.

The scikit-learn workflow consists of 4 main steps:

* instantiate the specific machine learning model you want to use
* fit the model to the training data
* use the model to make predictions
* evaluate the accuracy of the predictions

Each model in scikit-learn is implemented as a [separate class](https://scikit-learn.org/dev/modules/classes.html), and the first step is to identify the class we want to create an instance of. In our case, that's the KNeighborsRegressor class, since we're predicting a numerical value (the price).

Let's set the algorithm parameter to brute and leave the n_neighbors value as 5, which matches the implementation we wrote in the last mission. If we leave the algorithm parameter set to the default value of auto, scikit-learn will try to use tree-based optimizations to improve performance.
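As a reminder of what a brute-force search amounts to, here is a rough pure-Python sketch: measure the distance from the query to every training row, keep the k closest, and average their target values. The `knn_predict` function and the toy data are hypothetical, and this is not scikit-learn's actual implementation.

```python
import math

def knn_predict(train_X, train_y, query, k=5):
    """Average the targets of the k training rows closest to query."""
    # Brute force: compute the distance from the query to every training row.
    distances = [(math.dist(row, query), y) for row, y in zip(train_X, train_y)]
    # Sort by distance and keep the k nearest neighbors.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical toy data: two well-separated clusters of listings.
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
train_y = [100.0, 110.0, 90.0, 300.0, 310.0, 290.0]

prediction = knn_predict(train_X, train_y, (0.2, 0.1), k=3)
print(prediction)  # 100.0
```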

# 7. Fitting a model and making predictions

## TODO:
* Create an instance of the KNeighborsRegressor class with the following parameters:

  * n_neighbors: 5
  * algorithm: brute
* Use the fit method to specify the data we want the k-nearest neighbor model to use. Use the following parameters:

   * training data, feature columns: just the accommodates and bathrooms columns, in that order, from train_df.
   * training data, target column: the price column from train_df.
* Call the predict method to make predictions on:

   * the accommodates and bathrooms columns from test_df
   * assign the resulting NumPy array of predicted price values to predictions

In [19]:
from sklearn.neighbors import KNeighborsRegressor

train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]

knn=KNeighborsRegressor(n_neighbors=5,algorithm='brute')
knn.fit(train_df[['accommodates','bathrooms']],train_df['price'])
predictions=knn.predict(test_df[['accommodates','bathrooms']])

# 8. Calculating MSE using Scikit-Learn

Previously, we calculated the MSE and RMSE values using the pandas arithmetic operators to compare each predicted value with the actual value from the price column of our test set. Alternatively, we can use the [sklearn.metrics.mean_squared_error() function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error).
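For reference, the two error metrics can be written out by hand as below. The `actual` values are the first three prices shown earlier, while the `predicted` values are made up for illustration.

```python
import math

actual = [125.0, 85.0, 50.0]     # first three price values shown earlier
predicted = [110.0, 95.0, 70.0]  # hypothetical predictions

# MSE: the mean of the squared differences between predictions and actual values.
squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)

# RMSE: the square root of the MSE, expressed in the same units as price.
rmse = math.sqrt(mse)

print('mse:', mse)
print('rmse:', rmse)
```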

## TODO:
* Use the mean_squared_error function to calculate the MSE value for the predictions we made in the previous screen.
* Assign the MSE value to two_features_mse.
* Calculate the RMSE value by taking the square root of the MSE value and assign to two_features_rmse.
* Display both of these error scores using the print function.

In [20]:
from sklearn.metrics import mean_squared_error

In [21]:
two_features_mse=mean_squared_error(test_df['price'],predictions)
two_features_rmse=np.sqrt(two_features_mse)
print('mse:',two_features_mse)
print('rmse:',two_features_rmse)

mse: 15660.39795221843
rmse: 125.14151170662127


# 9. Using more features

## TODO:
* Create a new instance of the KNeighborsRegressor class with the following parameters:

  * n_neighbors: 5
  * algorithm: brute
* Fit a model that uses the following columns from our training set (train_df):

  * accommodates
  * bedrooms
  * bathrooms
  * number_of_reviews
* Use the model to make predictions on the test set (test_df) using the same columns. Assign the NumPy array of predictions to four_predictions.

* Use the mean_squared_error() function to calculate the MSE value for these predictions by comparing four_predictions with the price column from test_df. Assign the computed MSE value to four_mse.
* Calculate the RMSE value and assign to four_rmse.
* Display four_mse and four_rmse using the print function.

In [22]:
knn=KNeighborsRegressor(n_neighbors=5,algorithm='brute')
knn.fit(train_df[['accommodates','bedrooms','bathrooms','number_of_reviews']],train_df['price'])
four_predictions=knn.predict(test_df[['accommodates','bedrooms','bathrooms','number_of_reviews']])
four_mse=mean_squared_error(test_df['price'],four_predictions)
four_rmse=np.sqrt(four_mse)
print('four_mse:',four_mse)
print('four_rmse:',four_rmse)

four_mse: 13320.230625711036
four_rmse: 115.41330350402


# 10. Using all features

## TODO:
* Use all of the columns, except for the price column, to train a k-nearest neighbors model using the same parameters for the KNeighborsRegressor class as the ones from the last few screens.
* Use the model to make predictions on the test set and assign the resulting NumPy array of predictions to all_features_predictions.
* Calculate the MSE and RMSE values and assign to all_features_mse and all_features_rmse accordingly.
* Use the print function to display both error scores.

In [23]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

features = train_df.columns.tolist()
features.remove('price')

knn.fit(train_df[features], train_df['price'])
all_features_predictions = knn.predict(test_df[features])
all_features_mse = mean_squared_error(test_df['price'], all_features_predictions)
all_features_rmse = all_features_mse ** (1/2)
print(all_features_mse)
print(all_features_rmse)

15455.275631399316
124.31924883701363


Interestingly enough, the RMSE value actually increased from 115.4 to 124.3 when we used all of the features available to us, which is almost as high as the 125.1 we got with just two features. This means that **selecting the right features is important and that using more features doesn't automatically improve prediction accuracy**. We should re-phrase the lever we mentioned earlier from:

`increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors`

to:

`select the relevant attributes the model uses to calculate similarity when ranking the closest neighbors`

`The process of selecting features to use in a model is known as feature selection.`
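To get a feel for the size of the search, the sketch below enumerates every non-empty subset of the 7 feature columns and then picks the best of the three subsets evaluated in this mission. The RMSE values are rounded from the outputs shown earlier, and the variable names are just for illustration.

```python
from itertools import combinations

features = ["accommodates", "bedrooms", "bathrooms", "beds",
            "minimum_nights", "maximum_nights", "number_of_reviews"]

# Every non-empty subset of the 7 feature columns is a candidate model input.
candidate_subsets = [
    subset
    for size in range(1, len(features) + 1)
    for subset in combinations(features, size)
]
print(len(candidate_subsets))  # 127, i.e. 2**7 - 1

# RMSE values (rounded) for the three subsets tried so far.
rmse_so_far = {
    ("accommodates", "bathrooms"): 125.14,
    ("accommodates", "bedrooms", "bathrooms", "number_of_reviews"): 115.41,
    tuple(features): 124.32,
}

# The subset with the lowest validation RMSE is the best one seen so far.
best_subset = min(rmse_so_far, key=rmse_so_far.get)
print(best_subset)
```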

In this mission, we prepared the data to be able to use more features, trained a few models using multiple features, and evaluated the different performance tradeoffs. We explored how using more features doesn't always improve the accuracy of a k-nearest neighbors model.