In this notebook we will build classification models using the decision tree and random forest classifiers from Python's scikit-learn library.

## Table of contents
1. Data Loading
2. Data Exploration
3. Visualization
4. Preprocessing
5. Decision Trees and hyperparameter analysis
6. Random Forest
7. Model comparison using ROC curve

## Loading Data

In this section we will import all the necessary packages and load the dataset we plan to work on. We will use the
<a href='https://www.kaggle.com/jessemostipak/hotel-booking-demand'>Hotel booking data</a> to build a model that predicts which customers will cancel their hotel booking.

In [240]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import *
from matplotlib.legend_handler import HandlerLine2D
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import recall_score, precision_score, accuracy_score, f1_score
from sklearn.metrics import confusion_matrix,auc,roc_auc_score,roc_curve
import warnings
warnings.filterwarnings('ignore')

In [241]:
# Load the data
# file_path = 'C:\Users\Tejal\Documents\Tejal\WWC-siliconvalley\hotel_bookings.csv'
# ALREADY MADE:
# file_path='https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv'
# df = pd.read_csv(file_path)

## Explore the dataset

Understanding the data, its features and their distributions is a major part of building ML models.

In [242]:
# df.head()

In [243]:
# check the shape of the dataset
# df.shape   

In [244]:
# Check the datatype of features
# df.dtypes

In [245]:
# Feature list 
# df.columns

In [246]:
# Check for null values
# percent_missing = df.isnull().sum() * 100 / len(df)
# missing_value_df = pd.DataFrame({'column_name': df.columns,
#                                  'percent_missing': percent_missing})
# missing_value_df.sort_values('percent_missing', ascending=False,inplace=True)
# missing_value_df

The company, agent, country and children columns have null values. There are multiple techniques for imputing null values, but for simplicity we impute them with 0. Because company has a very high percentage of null values, we drop that column.

In [247]:
# Let us create a copy of dataframe for backup and impute null with 0
# backup_df = df.copy()  # call copy(); without the parentheses this only stores the method
# df = df.drop('company',axis=1)
# df=df.fillna(0)

In [248]:
# The df has no Null values
# (df['agent'].isnull().sum()/len(df)) * 100

## Data Visualization

In this task, our target variable is is_canceled, which indicates whether the booking was canceled: 1 --> canceled, 0 --> not canceled.

In [249]:
# df['is_canceled'].value_counts().plot(kind='pie',autopct='%1.1f%%')

37% of customers canceled their bookings, so we see that our data is imbalanced.

In [250]:
# df.columns

In [251]:
# Hotel feature count and distribution across 0 and 1 class 
# df['hotel'].value_counts().plot(kind='pie',autopct='%1.1f%%')

In [252]:
# sns.countplot(x='is_canceled',hue='hotel',data=df)

Because the data has more city hotel reservations than resort hotel reservations, the counts in the plot above follow the same trend.

In [253]:
#market segments
# df.groupby(['market_segment'])['is_canceled'].count().plot(kind='bar')

## Feature Engineering

1. Derive new features using existing features
2. Remove irrelevant features
3. Transform existing features
4. Encoding categorical variables

In [254]:
# # Split data into train validation & test set in train:val:test=60:20:20 size
# # We are splitting the data into 3 chunks as we will be tuning many hyperparameters in this notebook
# train, val_test = train_test_split(df, test_size=0.4, random_state = 42)
# val, test = train_test_split(val_test, test_size=0.5, random_state = 42)

In [278]:
test_file_path='https://raw.githubusercontent.com/WomenWhoCode/WWCodeDataScience/master/Intro_to_MachineLearning/data/titanic/test.csv'
test = pd.read_csv(test_file_path)
train_file_path='https://raw.githubusercontent.com/WomenWhoCode/WWCodeDataScience/master/Intro_to_MachineLearning/data/titanic/train.csv'
train = pd.read_csv(train_file_path)

In [282]:
# train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])  # stackoverflow

In [279]:
train.shape

(891, 12)

In [280]:
test.shape

(418, 11)

In [259]:
val.shape

(23878, 31)

In [260]:
# #Let us add weekend stay and weekday stay days to get total days of stay
# train['total_days'] = train['stays_in_week_nights'] + train['stays_in_weekend_nights']
# test['total_days'] = test['stays_in_week_nights'] + test['stays_in_weekend_nights']
# val['total_days'] = val['stays_in_week_nights'] + val['stays_in_weekend_nights']

# # drop the weekend stay and weekday stay days features
# train = train.drop('stays_in_week_nights',axis=1).drop('stays_in_weekend_nights',axis=1)
# test = test.drop('stays_in_week_nights',axis=1).drop('stays_in_weekend_nights',axis=1)
# val = val.drop('stays_in_week_nights',axis=1).drop('stays_in_weekend_nights',axis=1)

KeyError: 'stays_in_week_nights'

In [None]:
# train_0=train[(train['is_canceled']==0)]
# train_1=train[train['is_canceled']==1]
# sns.set(rc={"figure.figsize": (20, 20)})
# subplot(2,2,1)
# ax = sns.distplot(train_0['total_days'], bins=100, color='r')
# subplot(2,2,2)
# ax=sns.distplot(train_1['total_days'], bins=100, color='g')

In [None]:
# #Total customers
# train['total_customers'] = train['adults'] + train['children']+train['babies']
# test['total_customers'] = test['adults'] + test['children']+test['babies']
# val['total_customers'] = val['adults'] + val['children']+val['babies']


# train = train.drop('adults',axis=1).drop('children',axis=1).drop('babies',axis=1)
# test = test.drop('adults',axis=1).drop('children',axis=1).drop('babies',axis=1)
# val = val.drop('adults',axis=1).drop('children',axis=1).drop('babies',axis=1)

In [None]:
train['total_customers'].value_counts().plot(kind='bar',figsize=(5,5)) 

In [None]:
train = train.drop(['reservation_status_date'],axis=1)
test = test.drop(['reservation_status_date'],axis=1)
val = val.drop(['reservation_status_date'],axis=1)

In [None]:
print (len(train['agent'].unique())) # 309 unique values - Large number of unique agents and it is categorical, difficult to encode
train = train.drop('agent',axis=1)
test = test.drop('agent',axis=1)
val = val.drop('agent',axis=1)

In [None]:
print(len(train['country'].unique())) # 160 countries
train = train.drop('country',axis=1)
test = test.drop('country',axis=1)
val = val.drop('country',axis=1)

In [None]:
train.hist(column='previous_bookings_not_canceled',bins=20,figsize=(10,5))

In [None]:
#train['previous_bookings_not_canceled'].value_counts() # We observe that most rows have value = 0; hence we drop the feature
#train.groupby(['is_canceled'])['previous_bookings_not_canceled'].value_counts() # We observe that the distribution remains the same across both classes
train = train.drop('previous_bookings_not_canceled',axis=1)
test = test.drop('previous_bookings_not_canceled',axis=1)
val = val.drop('previous_bookings_not_canceled',axis=1)

In [None]:
train.groupby(['is_canceled'])['previous_cancellations'].value_counts().plot(kind='bar',figsize=(5,5))
# We observe that most data has value = 0; and trend remains same across the 2 classes
train = train.drop('previous_cancellations',axis=1)
test = test.drop('previous_cancellations',axis=1)
val = val.drop('previous_cancellations',axis=1)

In [None]:
len(train.columns)

## Feature Correlation

In [None]:
backup_train = train.copy()
backup_test = test.copy()
backup_val = val.copy()

In [None]:
#Custom encoding
train['arrival_date_month'] = train['arrival_date_month'].map({'January':1, 'February': 2, 'March':3, \
                                                         'April':4, 'May':5, 'June':6, 'July':7,\
                                                         'August':8, 'September':9, 'October':10, \
                                                         'November':11, 'December':12})

In [None]:
encode = LabelEncoder()

In [None]:
train.columns

In [None]:
train['market_segment'].unique()

In [None]:
cat_col=['hotel','arrival_date_year','meal','market_segment','distribution_channel','reserved_room_type', 'assigned_room_type',\
        'deposit_type','customer_type','reservation_status']
for i in cat_col:
    train[i] = encode.fit_transform(train[i])

In [None]:
train['market_segment'].unique()

In [None]:
# train.head()
train.dtypes

### Feature correlation
<b>Spearman</b> and <b>Pearson</b> are two common statistical methods for computing the correlation between features.
- Pearson is the suggested method for features with continuous values and a linear relationship
- Spearman is the suggested method when features have ordinal categorical data or a non-linear relationship
<br>The pandas correlation method uses Pearson by default, but we can switch it to Spearman with the method argument </br>
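The difference between the two methods can be sketched on a small made-up example. The toy DataFrame below is an assumption for illustration and is not part of the hotel data: y increases with x, but not linearly, so Spearman (which works on ranks) reports a perfect correlation while Pearson does not.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not from the hotel dataset): y grows monotonically
# with x, but the relationship is strongly non-linear.
x = np.arange(1, 21)
toy = pd.DataFrame({"x": x, "y": np.exp(x / 2.0)})

# DataFrame.corr accepts method="pearson" (the default) or method="spearman".
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print("Pearson: ", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

On the notebook's data the equivalent call would be `train.corr(method="spearman")`.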

In [None]:
train.corr()

In [None]:
feat_corr = train.corr()
feat_corr['is_repeated_guest'].sort_values()

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(feat_corr)

The diagonal shows the correlation of each feature with itself, which is always the highest possible value (1).
Using the table and the plot, we observe that a few feature pairs have very high correlation.
For example:
1. arrival_date_week_number and arrival_date_month = 0.99
2. reserved_room_type vs assigned_room_type = 0.81

In [None]:
feat_corr['is_canceled'].sort_values()

The reservation_status feature has a high correlation with is_canceled. In the Naive Bayes session, we saw that removing the reservation_status feature caused the model performance to drop considerably. Let's see how it affects trees.

## Implementing Decision Tree

There are various decision tree algorithms, such as ID3, C4.5, C5.0 and CART. Scikit-learn implements an optimized version of the CART algorithm. A decision tree has multiple hyperparameters; see the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">sklearn documentation</a>
<br> <br>
We will look at the effect of the following hyperparameters on the model -
1. criterion -{gini and entropy}
2. max_depth
3. class_weight

### Model 1 -
Default hyperparameters -- Gini criterion, no class weight and no pruning

In [None]:
y_train = backup_train["is_canceled"]
X_train = backup_train.drop(["is_canceled"], axis=1)
y_val = backup_val["is_canceled"]
X_val = backup_val.drop(["is_canceled"], axis=1)

<b>Encoding categorical features</b>

Scikit-learn's decision tree and random forest implementations cannot handle string values, so we need to encode the categorical values as numbers. However, some other tools, such as R, Spark and Weka, have decision tree implementations that can handle string feature values <br><br>
We use one-hot encoding instead of a label encoder to avoid giving the model the illusion that categorical features are continuous, ordered values
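As a minimal sketch of the difference, the toy column below (made-up values that only mimic the meal feature) is encoded both ways. It also shows why the encoded validation frame is aligned to the training frame: a category missing from the validation data would otherwise produce a different set of columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns; the values mimic the 'meal' feature but are made up.
toy_train = pd.DataFrame({"meal": ["BB", "HB", "SC", "BB"]})
toy_val = pd.DataFrame({"meal": ["BB", "SC"]})  # 'HB' never appears here

# Label encoding: one integer column, which implies an order BB < HB < SC.
label_encoded = LabelEncoder().fit_transform(toy_train["meal"])
print(list(label_encoded))

# One-hot encoding: one indicator column per category, no implied order.
train_hot = pd.get_dummies(toy_train, columns=["meal"])
val_hot = pd.get_dummies(toy_val, columns=["meal"])

# Align validation columns to the training columns; the missing
# 'meal_HB' indicator is created as NaN and then filled with 0.
train_hot, val_hot = train_hot.align(val_hot, join="left", axis=1)
val_hot = val_hot.fillna(0)
print(val_hot.columns.tolist())
```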

In [None]:
cat_cols=['hotel','arrival_date_month','arrival_date_year','meal','market_segment','distribution_channel','reserved_room_type', 'assigned_room_type',\
        'deposit_type','customer_type','reservation_status']
X_train_enc = pd.get_dummies(data=X_train,columns=cat_cols)
X_val_enc = pd.get_dummies(data=X_val,columns=cat_cols)
X_train_enc,X_val_enc =X_train_enc.align(X_val_enc, join='left', axis=1)
X_val_enc=X_val_enc.fillna(0)

In [None]:
X_train_enc.head()

In [None]:
clf = DecisionTreeClassifier(random_state = 0)
clf.fit(X_train_enc, y_train)

In [None]:
y_pred = clf.predict(X_val_enc)
y_prob = clf.predict_proba(X_val_enc)

In [None]:
y_pred[:10]

## Evaluation metric

<b> Precision and Recall </b>

<img src="img/PR diagram1.PNG" width="200p"/>

<img src="img/PR diagram 2.PNG" width="400"/>

<br><b> Confusion Matrix </b>

<img src="img/Confusion matrix.PNG" width="200"/>
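To connect the diagrams to numbers, here is a small sketch that derives precision and recall by hand from a confusion matrix and checks them against scikit-learn's helpers. The labels are invented for illustration (1 = canceled, 0 = not canceled).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = canceled, 0 = not canceled.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted cancellations, how many were real
recall = tp / (tp + fn)     # of the real cancellations, how many were found

print("precision:", precision, "sklearn:", precision_score(y_true, y_hat))
print("recall:   ", recall, "sklearn:", recall_score(y_true, y_hat))
```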

In [None]:
print('validation-set confusion matrix:\n', confusion_matrix(y_val,y_pred)) 
print("recall score: ", recall_score(y_val,y_pred))
print("precision score: ", precision_score(y_val,y_pred))
print("f1 score: ", f1_score(y_val,y_pred))
print("accuracy score: ", accuracy_score(y_val,y_pred))

Feature Importance

In [None]:
d = pd.DataFrame(
    {'Features': list(X_train_enc.columns),
     'Importance': clf.feature_importances_
    })
d.sort_values(by=['Importance'],ascending=False)[:5]

The model performance is perfect, but the model relies on only 1 feature. Such a feature is likely leaking the target, so we should remove it to avoid data leakage

### Model 2 - 
Remove the feature that is highly correlated with the target feature
<br>
<b>reservation_status</b> has a high correlation with is_canceled. Looking at the values in the column reveals that Canceled is one of the reservation statuses. This is likely causing data leakage. Hence we will delete this feature and train with default hyperparameters

In [None]:
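# df is not loaded above (the read_csv cell is commented out), so inspect the unencoded training copy
backup_train['reservation_status'].unique()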

In [None]:
X_train = X_train.drop('reservation_status',axis=1)
X_val = X_val.drop('reservation_status',axis=1)

In [None]:
cat_cols=['hotel','arrival_date_month','arrival_date_year','meal','market_segment','distribution_channel','reserved_room_type', 'assigned_room_type',\
        'deposit_type','customer_type']
X_train_enc = pd.get_dummies(data=X_train,columns=cat_cols,drop_first=True)
X_val_enc = pd.get_dummies(data=X_val,columns=cat_cols,drop_first=True)
X_train_enc,X_val_enc =X_train_enc.align(X_val_enc, join='left', axis=1)
X_val_enc=X_val_enc.fillna(0)

In [None]:
clf2 = DecisionTreeClassifier(random_state = 0)
clf2.fit(X_train_enc, y_train)

In [None]:
y_pred2 = clf2.predict(X_val_enc)
y_prob2 = clf2.predict_proba(X_val_enc)

In [None]:
y_prob2[:10]

In [None]:
print('validation-set confusion matrix:\n', confusion_matrix(y_val,y_pred2)) 
print("recall score: ", recall_score(y_val,y_pred2))
print("precision score: ", precision_score(y_val,y_pred2))
print("f1 score: ", f1_score(y_val,y_pred2))
print("accuracy score: ", accuracy_score(y_val,y_pred2))

In [None]:
d = pd.DataFrame(
    {'Features': list(X_train_enc.columns),
     'Importance': clf2.feature_importances_
    })
d.sort_values(by=['Importance'],ascending=False)[:15]

### Model 3
Let us remove one feature from each highly correlated feature pair. We will remove the feature with the lower importance
1. arrival_date_week_number and arrival_date_month = 0.99
2. reserved_room_type vs assigned_room_type = 0.81
3. market_segment vs distribution_channel = 0.76

In [None]:
X_train = X_train.drop('arrival_date_month',axis=1)
X_val = X_val.drop('arrival_date_month',axis=1)

X_train = X_train.drop('market_segment',axis=1)
X_val = X_val.drop('market_segment',axis=1)

X_train = X_train.drop('reserved_room_type',axis=1)
X_val = X_val.drop('reserved_room_type',axis=1)

In [None]:
X_train.dtypes

In [None]:
cat_cols=['hotel','arrival_date_year','meal','distribution_channel','assigned_room_type',\
        'deposit_type','customer_type']
X_train_enc = pd.get_dummies(data=X_train,columns=cat_cols,drop_first=True)
X_val_enc = pd.get_dummies(data=X_val,columns=cat_cols,drop_first=True)
X_train_enc,X_val_enc =X_train_enc.align(X_val_enc, join='left', axis=1)
X_val_enc=X_val_enc.fillna(0)

In [None]:
clf3 = DecisionTreeClassifier(random_state = 0)
clf3.fit(X_train_enc, y_train)
y_pred3=clf3.predict(X_val_enc)
print("f1 score: ", f1_score(y_val,y_pred3))

#### Training metric

In [None]:
y_pred3_train = clf3.predict(X_train_enc)
print('train-set confusion matrix:\n', confusion_matrix(y_train,y_pred3_train)) 
print("f1 score: ", f1_score(y_train,y_pred3_train))
print("accuracy score: ", accuracy_score(y_train,y_pred3_train))

We see that the model has a very low training error but a considerably higher validation error. This indicates that the model has overfitted.

In [None]:
d = pd.DataFrame(
    {'Features': list(X_train_enc.columns),
     'Importance': clf3.feature_importances_
    })
d.sort_values(by=['Importance'],ascending=False)[:15]

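### Model 4 -
Use entropy instead of Gini impurity as the split criterion. For a node with $c$ classes, where $P_k$ is the fraction of samples in the node that belong to class $k$: <br><br>
$\text{Gini impurity} = \sum_{k=1}^{c} P_k (1 - P_k)$ <br><br>
$\text{Entropy} = -\sum_{k=1}^{c} P_k \log_2(P_k)$

<img src="img/Impurity criterion.PNG" width="400"/>
We see that both criteria follow a similar curve, indicating that there is no significant difference between the two.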
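The two formulas can be evaluated directly for a two-class node. The helper names `gini` and `entropy` below are made up for this sketch and are not scikit-learn functions; both measures are zero for a pure node and largest at a 50/50 split.

```python
import numpy as np

def gini(p):
    """Gini impurity: sum over classes of P_k * (1 - P_k)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def entropy(p):
    """Entropy in bits: -sum over classes of P_k * log2(P_k)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # a class with probability 0 contributes nothing
    return float(-np.sum(p * np.log2(p)))

# Class proportions for a binary node, from fairly pure to perfectly mixed.
for p1 in (0.1, 0.3, 0.5):
    probs = [p1, 1.0 - p1]
    print(p1, round(gini(probs), 3), round(entropy(probs), 3))
```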

In [None]:
y_train = backup_train["is_canceled"]
X_train = backup_train.drop(["is_canceled"], axis=1).drop(["reservation_status"],axis=1)
y_val = backup_val["is_canceled"]
X_val = backup_val.drop(["is_canceled"], axis=1).drop(["reservation_status"],axis=1)

In [None]:
cat_cols=['hotel','arrival_date_month','arrival_date_year','meal','market_segment','distribution_channel','reserved_room_type', 'assigned_room_type',\
        'deposit_type','customer_type']
X_train_enc = pd.get_dummies(data=X_train,columns=cat_cols,drop_first=True)
X_val_enc = pd.get_dummies(data=X_val,columns=cat_cols,drop_first=True)
X_train_enc,X_val_enc =X_train_enc.align(X_val_enc, join='left', axis=1)
X_val_enc=X_val_enc.fillna(0)

In [None]:
clf4 = DecisionTreeClassifier(criterion="entropy",random_state = 0)
clf4.fit(X_train_enc, y_train)

In [None]:
y_pred4 = clf4.predict(X_val_enc)
y_prob4 = clf4.predict_proba(X_val_enc)
print("f1 score: ", f1_score(y_val,y_pred4))

The impurity criterion did not affect our model performance. The feature importances of the 2 models also look similar

In [None]:
d = pd.DataFrame(
    {'Features': list(X_train_enc.columns),
     'Clf2_Importance': clf2.feature_importances_,
     'Clf4_Importance': clf4.feature_importances_

    })
d.sort_values(by=['Clf4_Importance'],ascending=False)[:15]

### Model 5 -
Pruning the model to avoid overfitting <br>
We have multiple hyperparameters, such as max_depth and min_samples_split, which can be tuned to prune the model.

In [None]:
max_depths = np.arange(1, 33)  # integer depths 1..32 (a float max_depth is rejected by recent scikit-learn versions)
train_results = []
test_results = []
for max_depth in max_depths:
   clf5 = DecisionTreeClassifier(max_depth=max_depth)
   clf5.fit(X_train_enc, y_train)
   train_pred = clf5.predict(X_train_enc)
   f1_score1 = f1_score(y_train,train_pred)
   train_results.append(f1_score1)

   y_pred = clf5.predict(X_val_enc)
   f1_score1 = f1_score(y_val,y_pred)
   test_results.append(f1_score1)
    
plt.figure(figsize=(10,10))
line1, = plt.plot(max_depths, train_results, 'b', label="Train F1-score")
line2, = plt.plot(max_depths, test_results, 'r', label="Val F1-score")
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('F1-score')
plt.xlabel('Tree depth')
plt.show()

As the tree depth increases, our training f1-score improves and eventually reaches 1. However, the widening gap between the training and validation curves indicates that the model is unable to generalize well to unseen data, i.e. the model has overfitted

In [None]:
clf5 = DecisionTreeClassifier(criterion="gini",random_state = 0,max_depth=12)
clf5.fit(X_train_enc, y_train)
y_pred5 = clf5.predict(X_val_enc)
y_pred5_train = clf5.predict(X_train_enc)
print("Train data f1 score: ", f1_score(y_train,y_pred5_train))
print("Val data f1 score: ", f1_score(y_val,y_pred5))

### Model 6
<b> Weighted Decision tree or Cost-sensitive tree </b> <br>
Experiment with class weight. Our data is slightly imbalanced, so we try assigning a higher weight to positive samples than to negative samples using the <b>class_weight</b> hyperparameter. We can assign {class:weight} or “balanced”<br>

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
</br>
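As a worked example of the “balanced” formula, the sketch below applies it to invented labels with roughly the 63/37 class split seen in this notebook and compares the result with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 63/37 split (0 = not canceled, 1 = canceled).
y = np.array([0] * 63 + [1] * 37)
n_classes = 2

# Formula quoted above: n_samples / (n_classes * np.bincount(y))
manual = len(y) / (n_classes * np.bincount(y))

# scikit-learn helper that applies the same "balanced" heuristic.
from_sklearn = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

print("manual: ", manual.round(3))
print("sklearn:", from_sklearn.round(3))
```

The minority class (1) receives the larger weight, so misclassifying a cancellation costs more during training.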

In [None]:
clf6 = DecisionTreeClassifier(criterion="gini",random_state = 0,max_depth=12,class_weight={0:1,1:2})
clf6.fit(X_train_enc, y_train)
y_pred6 = clf6.predict(X_val_enc)
y_pred6_train = clf6.predict(X_train_enc)
print("Train data f1 score: ", f1_score(y_train,y_pred6_train))
print("Val data f1 score: ", f1_score(y_val,y_pred6))

By adjusting class_weight, our validation f1 score has improved by 1.3%

A decision tree has many hyperparameters, and we can use sklearn's <b>GridSearchCV</b> or <b>RandomizedSearchCV</b> to find the best hyperparameters. <a href="https://scikit-learn.org/stable/modules/grid_search.html">Sklearn documentation</a>
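A minimal sketch of such a search is shown below. It runs on synthetic stand-in data from `make_classification` so that it is self-contained; in this notebook the inputs would be X_train_enc and y_train. The grid values are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a roughly 63/37 class split (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.63, 0.37], random_state=0
)

# Hypothetical search space over the hyperparameters explored above.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [4, 8, 12],
    "class_weight": [None, "balanced"],
}

# Every combination is scored by 3-fold cross-validated F1.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Note that GridSearchCV scores each candidate with cross-validation on the data it is given, rather than with a separate validation set as in the manual experiments above.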

## Ensemble model - Random forest
Random forest is an ensemble of decision trees which aims to improve prediction accuracy while avoiding overfitting. <br>

In sklearn's random forest implementation, each tree is by default fit on a sample that is the same size as the original data but drawn with replacement, provided the <b>bootstrap</b> hyperparameter is set to True. It has many hyperparameters similar to those of a decision tree. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">sklearn documentation</a>
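The idea of a bootstrap sample can be sketched with NumPy alone. This illustrates the sampling scheme and is not scikit-learn's internal code; the row count is an invented value. Because rows are drawn with replacement, a sample of the same size as the data contains only about 63% of the distinct rows, so each tree sees a slightly different dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000  # hypothetical training-set size

# Draw n_rows row indices with replacement, as a bootstrap sample does.
sample_idx = rng.integers(0, n_rows, size=n_rows)

# Fraction of distinct rows that made it into the sample.
unique_fraction = np.unique(sample_idx).size / n_rows

print("sample size:", len(sample_idx))
print("fraction of distinct rows:", round(unique_fraction, 3))
```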

In [None]:
X_train.dtypes

In [None]:
y_train = backup_train["is_canceled"]
X_train = backup_train.drop(["is_canceled"], axis=1).drop(["reservation_status"],axis=1)
y_val = backup_val["is_canceled"]
X_val = backup_val.drop(["is_canceled"], axis=1).drop(["reservation_status"],axis=1)

In [None]:
cat_cols=['hotel','arrival_date_month','arrival_date_year','meal','market_segment','distribution_channel','reserved_room_type', 'assigned_room_type',\
        'deposit_type','customer_type']
X_train_enc = pd.get_dummies(data=X_train,columns=cat_cols,drop_first=True)
X_val_enc = pd.get_dummies(data=X_val,columns=cat_cols,drop_first=True)
X_train_enc,X_val_enc =X_train_enc.align(X_val_enc, join='left', axis=1)
X_val_enc=X_val_enc.fillna(0)

In [None]:
rf1=RandomForestClassifier()
rf1.fit(X_train_enc,y_train)
y_pred_rf1=rf1.predict(X_val_enc)
y_pred_rf1_train=rf1.predict(X_train_enc)

print("Train data f1 score: ", f1_score(y_train,y_pred_rf1_train))
print("Val data f1 score: ", f1_score(y_val,y_pred_rf1))

In [None]:
rf1.get_params()

With the default settings, the model performs better on the validation data, but the training f1 score is very high, which signals overfitting.

### Model 2
1. Using optimal max_depth 
2. class_weight = "balanced_subsample" - same as "balanced", but the weights are computed for each tree from its own bootstrap sample

In [None]:
max_depths = np.arange(1, 33)  # integer depths 1..32 (a float max_depth is rejected by recent scikit-learn versions)
train_results = []
test_results = []
for max_depth in max_depths:
   rf2=RandomForestClassifier(max_depth=max_depth,n_estimators=30)
   rf2.fit(X_train_enc, y_train)
   train_pred = rf2.predict(X_train_enc)
   f1_score1 = f1_score(y_train,train_pred)
   train_results.append(f1_score1)

   y_pred = rf2.predict(X_val_enc)
   f1_score1 = f1_score(y_val,y_pred)
   test_results.append(f1_score1)
    
plt.figure(figsize=(10,10))
line1, = plt.plot(max_depths, train_results, 'b', label="Train F1-score")
line2, = plt.plot(max_depths, test_results, 'r', label="Val F1-score")
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('F1-score')
plt.xlabel('Tree depth')
plt.show()

In [None]:
rf2=RandomForestClassifier(max_depth=15,class_weight="balanced_subsample")
rf2.fit(X_train_enc,y_train)
y_pred_rf2=rf2.predict(X_val_enc)
y_pred_rf2_train=rf2.predict(X_train_enc)

print("Train data f1 score: ", f1_score(y_train,y_pred_rf2_train))
print("Val data f1 score: ", f1_score(y_val,y_pred_rf2))

### Model 3
1. Increase the number of estimators to 80; this also increases the training time

In [None]:
rf3=RandomForestClassifier(max_depth=17,class_weight="balanced_subsample",n_estimators=80)
rf3.fit(X_train_enc,y_train)
y_pred_rf3=rf3.predict(X_val_enc)
y_pred_rf3_train=rf3.predict(X_train_enc)

print("Train data f1 score: ", f1_score(y_train,y_pred_rf3_train))
print("Val data f1 score: ", f1_score(y_val,y_pred_rf3))

## ROC curve
We will compare the best versions of the 2 classifiers we have trained so far in classification session -
1. Decision tree model 
2. Random forest

Decision Tree

In [None]:
y_test = backup_test["is_canceled"]
X_test = backup_test.drop(["is_canceled"], axis=1).drop('reservation_status',axis=1)

cat_cols=['hotel','arrival_date_month','arrival_date_year','meal','market_segment','distribution_channel','reserved_room_type', 'assigned_room_type',\
        'deposit_type','customer_type']
X_test_enc = pd.get_dummies(data=X_test,columns=cat_cols,drop_first=True)
X_train_enc,X_test_enc =X_train_enc.align(X_test_enc, join='left', axis=1)
X_test_enc=X_test_enc.fillna(0)

In [None]:
y_prob6 = clf6.predict_proba(X_test_enc)
false_positive_rateDT_6, true_positive_rateDT_6, thresholdDT_6 = roc_curve(y_test, y_prob6[:,1])
roc_aucDT_6 = auc(false_positive_rateDT_6, true_positive_rateDT_6)

Random Forest

In [None]:
y_test = backup_test["is_canceled"]
X_test = backup_test.drop(["is_canceled"], axis=1).drop('reservation_status',axis=1)
cat_cols=['hotel','arrival_date_month','arrival_date_year','meal','market_segment','distribution_channel','reserved_room_type', 'assigned_room_type',\
        'deposit_type','customer_type']
X_test_enc = pd.get_dummies(data=X_test,columns=cat_cols,drop_first=True)
X_train_enc,X_test_enc =X_train_enc.align(X_test_enc, join='left', axis=1)
X_test_enc=X_test_enc.fillna(0)

In [None]:
y_prob_rf = rf3.predict_proba(X_test_enc)
false_positive_rateRF, true_positive_rateRF, thresholdRF = roc_curve(y_test, y_prob_rf[:,1])
roc_aucRF = auc(false_positive_rateRF, true_positive_rateRF)

In [None]:
plt.figure(figsize = (10,10))
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rateDT_6, true_positive_rateDT_6, color = 'red', label = 'DT AUC = %0.2f' % roc_aucDT_6)
plt.plot(false_positive_rateRF, true_positive_rateRF, color = 'green', label = 'RF AUC = %0.2f' % roc_aucRF)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], linestyle = '--')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

We see that the random forest performs better than the decision tree