# Lab 3: Training Decision Tree & KNN Classifiers

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.mode.chained_assignment = None 


from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In this Lab session, you will implement the following steps:

1. Load the Airbnb "listings" data set
2. Convert categorical features to one-hot encoded values
3. Split the data into training and test sets
4. Fit a Decision Tree classifier and evaluate the accuracy
 - Plot the accuracy of the DT model as a function of hyperparameter max depth
5. Fit a KNN classifier and evaluate the accuracy
 - Plot the accuracy of the KNN model as a function of hyperparameter $k$

## Part 1. Load the Dataset

We will work with a preprocessed version of the Airbnb NYC "listings" data set.

<b>Task</b>: load the data set into a Pandas DataFrame variable named `df`:

In [4]:
# Do not remove or edit the line below:
filename = os.path.join(os.getcwd(), "data", "airbnb.csv.gz")

df=pd.read_csv(filename,header=0)

In [5]:
df.shape

(28022, 44)

In [6]:
df.head(10)

Unnamed: 0,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,neighbourhood_group_cleansed,room_type,accommodates,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,0.8,0.17,False,8.0,8.0,True,True,Manhattan,Entire home/apt,1,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,0.09,0.69,False,1.0,1.0,True,True,Brooklyn,Entire home/apt,3,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,1.0,0.25,False,1.0,1.0,True,True,Brooklyn,Entire home/apt,4,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,1.0,1.0,False,1.0,1.0,True,False,Manhattan,Private room,2,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,0.890731,0.768297,False,1.0,1.0,True,True,Manhattan,Private room,1,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7
5,1.0,1.0,True,3.0,3.0,True,True,Brooklyn,Private room,2,...,4.82,4.87,4.73,False,3,1,2,0,1.48,7
6,1.0,1.0,False,1.0,1.0,True,True,Brooklyn,Entire home/apt,3,...,4.8,4.67,4.57,True,1,1,0,0,1.24,7
7,1.0,1.0,False,3.0,3.0,True,True,Manhattan,Private room,1,...,4.95,4.84,4.84,True,1,0,1,0,1.82,5
8,1.0,0.0,False,2.0,2.0,True,True,Brooklyn,Private room,1,...,5.0,5.0,5.0,False,2,0,2,0,0.07,5
9,1.0,0.99,True,1.0,1.0,True,True,Brooklyn,Entire home/apt,4,...,4.91,4.93,4.78,True,2,1,1,0,3.05,8


In [7]:
df.columns

Index(['host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_listings_count', 'host_total_listings_count',
       'host_has_profile_pic', 'host_identity_verified',
       'neighbourhood_group_cleansed', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights',
       'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'number_of_reviews', 'number_of_reviews_ltm',
       'number_of_reviews_l30d', 'review_scores_rating',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_coun

## Part 2. One-Hot Encode Categorical Values


Transform the string-valued categorical features into numerical boolean values using one-hot encoding.

### a. Find the Columns Containing String Values

First, let us identify all features that need to be one-hot encoded:

In [8]:
df.dtypes

host_response_rate                              float64
host_acceptance_rate                            float64
host_is_superhost                                  bool
host_listings_count                             float64
host_total_listings_count                       float64
host_has_profile_pic                               bool
host_identity_verified                             bool
neighbourhood_group_cleansed                     object
room_type                                        object
accommodates                                      int64
bathrooms                                       float64
bedrooms                                        float64
beds                                            float64
amenities                                        object
price                                           float64
minimum_nights                                    int64
maximum_nights                                    int64
minimum_minimum_nights                          

**Task**: add all of the column names of variables of type 'object' to a list named `to_encode`

In [9]:
to_encode = list(df.select_dtypes(include=['object']).columns)

Let's take a closer look at the candidates for one-hot encoding

In [10]:
df[to_encode].nunique()

neighbourhood_group_cleansed        5
room_type                           4
amenities                       25020
dtype: int64

Notice that one column stands out as containing two many values for us to attempt to transform. For this exercise, the best choice is to simply remove this column. Of course, this means losing potentially useful information. In a real-life situation, you would want to retain all of the information in a column, or you could selectively keep information in.

In the code cell below, drop this column from Dataframe `df` and from the `to_encode` list.

In [11]:
df.drop(columns='amenities', axis =1)
to_encode.remove('amenities')

### b. One-Hot Encode all Unique Values

All of the other columns in `to_encode` have reasonably small numbers of unique values, so we are going to simply one-hot encode every unique value of those columns.

<b>Task</b>: complete the code below to create one-hot encoded columns
Tip: Use the sklearn `OneHotEncoder` class

In [12]:
from sklearn.preprocessing import OneHotEncoder

# Create the encoder:
encoder = OneHotEncoder(handle_unknown='error', sparse=False)

# Apply the encoder:
df_enc = pd.DataFrame(encoder.fit_transform(df[to_encode]))

# Reinstate the original column names:
df_enc.columns = encoder.get_feature_names(to_encode)



In [13]:
df_enc.head()

Unnamed: 0,neighbourhood_group_cleansed_Bronx,neighbourhood_group_cleansed_Brooklyn,neighbourhood_group_cleansed_Manhattan,neighbourhood_group_cleansed_Queens,neighbourhood_group_cleansed_Staten Island,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


<b>Task</b>: You can now remove the original columns that we have just transformed from DataFrame `df`.


In [14]:
df.drop(columns=to_encode, inplace=True)

In [15]:
df.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,accommodates,bathrooms,bedrooms,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,0.8,0.17,False,8.0,8.0,True,True,1,1.0,1.323567,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,0.09,0.69,False,1.0,1.0,True,True,3,1.0,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,1.0,0.25,False,1.0,1.0,True,True,4,1.5,2.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,1.0,1.0,False,1.0,1.0,True,False,2,1.0,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,0.890731,0.768297,False,1.0,1.0,True,True,1,1.0,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7


<b>Task</b>: You can now join the transformed categorical features contained in `df_enc` with DataFrame `df`

In [16]:
df=df.join(df_enc)

Glance at the resulting column names:

In [17]:
df.columns

Index(['host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_listings_count', 'host_total_listings_count',
       'host_has_profile_pic', 'host_identity_verified', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights',
       'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'number_of_reviews', 'number_of_reviews_ltm',
       'number_of_reviews_l30d', 'review_scores_rating',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_co

Check for missing values.

In [18]:
df.isnull().sum()

host_response_rate                              0
host_acceptance_rate                            0
host_is_superhost                               0
host_listings_count                             0
host_total_listings_count                       0
host_has_profile_pic                            0
host_identity_verified                          0
accommodates                                    0
bathrooms                                       0
bedrooms                                        0
beds                                            0
amenities                                       0
price                                           0
minimum_nights                                  0
maximum_nights                                  0
minimum_minimum_nights                          0
maximum_minimum_nights                          0
minimum_maximum_nights                          0
maximum_maximum_nights                          0
minimum_nights_avg_ntm                          0


## Part 3. Create Training and Test Data Sets

### a. Create Labeled Examples 

<b>Task</b>: Choose columns from our data set to create labeled examples. 

In the `airbnb` dataset, we will choose column `host_is_superhost` to be the label. The remaining columns will be the features.

Obtain the features from DataFrame `df` and assign to `X`.
Obtain the label from DataFrame `df` and assign to `Y`


In [19]:
# YOUR CODE HERE
y=df['host_is_superhost']
X=df.drop(columns='host_is_superhost', axis=1)

In [20]:
print("Number of examples: " + str(X.shape[0]))
print("\nNumber of Features:" + str(X.shape[1]))
print(str(list(X.columns)))

Number of examples: 28022

Number of Features:50
['host_response_rate', 'host_acceptance_rate', 'host_listings_count', 'host_total_listings_count', 'host_has_profile_pic', 'host_identity_verified', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'review_scores_rating', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'reviews_per_mon

### b. Split Examples into Training and Test Sets

<b>Task</b>: In the code cell below create training and test sets out of the labeled examples using Scikit-learn's `train_test_split()` function. 

Specify:
    * A test set that is one third (.33) of the size of the data set.
    * A seed value of '123'. 

In [21]:
# YOUR CODE HERE
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=.33, random_state=123)

Check that the dimensions of the training and test datasets are what you expected

In [22]:
print(X_train.shape)
print(X_test.shape)

(18774, 50)
(9248, 50)


## Part 4. Implement a Decision Tree Classifier

The code cell below contains a shell of a function named `train_test_DT()`. This function should train a Decision Tree classifier on the training data, test the resulting model on the test data, and compute and return the accuracy score of the resulting predicted class labels on the test data. Remember to use ```DecisionTreeClassifier()``` to create a model object.

<b>Task:</b> Complete the function to make it work.

In [23]:
def train_test_DT(X_train, X_test, y_train, y_test, leaf, depth, crit='entropy'):
    '''
    Fit a Decision Tree classifier to the training data X_train, y_train.
    Return the accuracy of resulting predictions on the test set.
    Parameters:
        leaf := The minimum number of samples required to be at a leaf node 
        depth := The maximum depth of the tree
        crit := The function to be used to measure the quality of a split. Default: gini.
    '''
    
    # 1. Create the  Scikit-learn DecisionTreeClassifier model
    model=DecisionTreeClassifier(criterion = crit, max_depth = depth, min_samples_leaf = leaf)
  
    # 2. Fit the model to the training data below
    model.fit(X_train, y_train)
    
    # 3. Make predictions 
    class_label_predictions=model.predict(X_test)
        
  
    # 4. Compute the accuracy
    acc_score=accuracy_score(y_test,class_label,predictions)
       
   
    
    return acc_score

#### Visualization

The cell below contains a function that you will use to compare the accuracy results of training multiple models with different hyperparameter values.

Function `visualize_accuracy()` accepts two arguments:
1. a list of hyperparamter values
2. a list of accuracy scores

Both lists must be of the same size.

In [24]:
# Do not remove or edit the code below

def visualize_accuracy(hyperparam_range, acc):

    fig = plt.figure()
    ax = fig.add_subplot(111)
    p = sns.lineplot(x=hyperparam_range, y=acc, marker='o', label = 'Full training set')
        
    plt.title('Test set accuracy of the model predictions, for ' + ','.join([str(h) for h in hyperparam_range]))
    ax.set_xlabel('Hyperparameter value')
    ax.set_ylabel('Accuracy')
    plt.show()

#### Train on Different Values of Hyperparameter Max Depth

<b>Task:</b> 

Complete function `train_multiple_trees()` in the code cell below. The function should train multiple decision trees and return a list of accuracy scores.

The function will:

1. accept list `max_depth_range` and `leaf` as parameters; list `max_depth_range` will contain multiple values for hyperparameter max depth.

2. loop over list `max_depth_range` and at each iteration:

    a. index into list `max_depth_range` to obtain a value for max depth<br>
    b. call `train_test_DT` with the training and test set, the value of max depth, and the value of `leaf`<br>
    c. print the resulting accuracy score<br>
    d. append the accuracy score to list `accuracy_list`<br>


In [30]:
def train_multiple_trees(max_depth_range, leaf):
    
    accuracy_list = []
    
    for i in max_depth_range:
        acc_score=train_test_DT(X_train, X_test, y_train, y_test,i,leaf)
        print('Max Depth=' + str(i) + ', accuracy score: ' + str(acc_score))
       
        accuracy_list.append(float(acc_score))
        
    return accuracy_list

The code cell below tests function `train_multiple_trees()` and calls function `visualize_accuracy()` to visualize the results.

In [31]:
max_depth_range = [8, 32]
leaf = 1

acc = train_multiple_trees(max_depth_range, leaf)

visualize_accuracy(max_depth_range, acc)

ValueError: could not convert string to float: '["Crib", "Fire extinguisher", "Lock on bedroom door", "Smoke alarm", "Shower gel", "Extra pillows and blankets", "Dedicated workspace", "Hot water", "TV", "Elevator", "Hangers", "First aid kit", "Hair dryer", "Bed linens", "Long term stays allowed", "Air conditioning", "Cleaning before checkout", "Carbon monoxide alarm", "Paid parking off premises", "Private entrance", "Shampoo", "Iron", "Heating", "Wifi", "Essentials"]'

<b>Analysis</b>: Is this graph conclusive for determining a good value of max depth?

<Double click this Markdown cell to make it editable, and record your findings here.>

<b>Task:</b> Let's train on more values for max depth.

In the code cell below:

1. call `train_multiple_trees()` with arguments `max_depth_range` and `leaf`
2. call `visualize_accuracy()` with arguments `max_depth_range` and `acc`


In [27]:
max_depth_range = [2**i for i in range(6)]
leaf = 1
acc = train_multiple_trees(max_depth_range,leaf)
        
visualize_accuracy(max_depth_range, acc)

ValueError: could not convert string to float: '["Crib", "Fire extinguisher", "Lock on bedroom door", "Smoke alarm", "Shower gel", "Extra pillows and blankets", "Dedicated workspace", "Hot water", "TV", "Elevator", "Hangers", "First aid kit", "Hair dryer", "Bed linens", "Long term stays allowed", "Air conditioning", "Cleaning before checkout", "Carbon monoxide alarm", "Paid parking off premises", "Private entrance", "Shampoo", "Iron", "Heating", "Wifi", "Essentials"]'

<b>Analysis</b>: Analyze this graph. Keep in mind that this is the performance on the test set, and pay attention to the scale of the y-axis. Answer the following questions in the cell below.<br>
How would you go about choosing the best model based on this plot? Is it conclusive? <br>
What other hyperparameters of interest would you want to vary to make sure you are finding the best model fit?

<Double click this Markdown cell to make it editable, and record your answers here.>

## Part 5. Implement a KNN Classifier


Note: In this section you will train KNN classifiers using the same training and test data.

The code cell below contains a shell of a function named `train_test_knn()`. This function should train a KNN classifier on the training data, test the resulting model on the test data, and compute and return the accuracy score of the resulting predicted class labels on the test data. 

Remember to use ```KNeighborsClassifier()``` to create a model object and call the method with one parameter: `n_neighbors = k`. 

<b>Task:</b> Complete the function to make it work.

In [32]:
def train_test_knn(X_train, X_test, y_train, y_test, k):
    '''
    Fit a k Nearest Neighbors classifier to the training data X_train, y_train.
    Return the accuracy of resulting predictions on the test data.
    '''
    
    # YOUR CODE HERE
    model=KNeighborsClassifier(n_neighbors = k)

    model.fit(X_train, y_train)
    
    class_label_predictions=model.predict(X_test)

    acc_score=accuracy_score(y_test, class_label_predictions)
    
    return acc_score

#### Train on Different Values of Hyperparameter K

<b>Task:</b> 

Just as you did above, complete function `train_multiple_knns()` in the code cell below. The function should train multiple KNN models and return a list of accuracy scores.

The function will:

1. accept list `k_range` as a parameter; this list will contain multiple values for hyperparameter $k$

2. loop over list `k_range` and at each iteration:

    a. index into list `k_range` to obtain a value for $k$<br>
    b. call `train_test_knn` with the training and test set, and the value of $k$<br>
    c. print the resulting accuracy score<br>
    d. append the accuracy score to list `accuracy_list` <br>


In [36]:
def train_multiple_knns(k_range):
    
    accuracy_list = []

    for k in k_range:
        score = train_test_knn(X_train, X_test, y_train, y_test, k)
        print('k=' + str(k) + ', accuracy score: ' + str(score))
        acc1.append(float(score))
    
    return accuracy_list

The code cell below uses your `train_multiple_knn()` function to train 3 KNN models, specifying three values for $k$: $3, 30$, and $300$. It calls function `visualize_accuracy()` to visualize the results. Note: this make take a second.

In [37]:
k_range = [3, 30, 300]
acc = train_multiple_knns(k_range)

visualize_accuracy(k_range, acc)

ValueError: could not convert string to float: '["Crib", "Fire extinguisher", "Lock on bedroom door", "Smoke alarm", "Shower gel", "Extra pillows and blankets", "Dedicated workspace", "Hot water", "TV", "Elevator", "Hangers", "First aid kit", "Hair dryer", "Bed linens", "Long term stays allowed", "Air conditioning", "Cleaning before checkout", "Carbon monoxide alarm", "Paid parking off premises", "Private entrance", "Shampoo", "Iron", "Heating", "Wifi", "Essentials"]'

<b>Task:</b> Let's train on more values for $k$

In the code cell below:

1. call `train_multiple_knns()` with argument `k_range`
2. call `visualize_accuracy()` with arguments `k_range` and the resulting accuracy list obtained from `train_multiple_knns()`


In [38]:
k_range = np.arange(1, 40, step = 3) 

# YOUR CODE HERE
train_multiple_knns(k_range)
visualize_accuracy(k_range, acc_score)

ValueError: could not convert string to float: '["Crib", "Fire extinguisher", "Lock on bedroom door", "Smoke alarm", "Shower gel", "Extra pillows and blankets", "Dedicated workspace", "Hot water", "TV", "Elevator", "Hangers", "First aid kit", "Hair dryer", "Bed linens", "Long term stays allowed", "Air conditioning", "Cleaning before checkout", "Carbon monoxide alarm", "Paid parking off premises", "Private entrance", "Shampoo", "Iron", "Heating", "Wifi", "Essentials"]'

<b>Analysis</b>: Compare the performance of the KNN model relative to the Decision Tree model, with various hyperparameter values and record your findings in the cell below.

<Double click this Markdown cell to make it editable, and record your findings here.>