![](https://www.rathbonehotel.co.uk/app/uploads/fly-images/945/Rathbone-Hotel-Studio-Suite1-1730x730-c.jpg)

### [1. EDA](#eda) ###
### [2. Data Preparation](#data) ###
*           [Missing Data](#missing)
*           [Categorical Data](#categorical)
        
### [3. Model training](#model)
### [4. Evaluating our model](#evaluation)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
sns.set_style("dark")

In [None]:
df = pd.read_csv("../input/hotel-booking-demand/hotel_bookings.csv")

In [None]:
df.info()

In [None]:
df.head(10)

In [None]:
df.describe()

<a id="eda"></a>
# EDA

In [None]:
df.columns

In [None]:
df['hotel'].unique()

#### We have guests from different countries and two hotels: a Resort Hotel and a City Hotel. According to https://www.sciencedirect.com/science/article/pii/S2352340918315191 both hotels are in Portugal. Our goal is to predict whether guests are going to cancel their booking or not. This can be really helpful for hotels when planning food and rooms.

In [None]:
sns.countplot(data=df, x = 'hotel')

In [None]:
sns.countplot(data=df, x = 'is_canceled', hue='is_repeated_guest')

We can see that most cancellations come from guests who are not repeated guests, which makes sense.
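
Raw counts can hide the picture here, because non-repeated guests also make up most of the bookings. A minimal sketch of comparing cancellation rates per group instead is shown below. The helper name `cancellation_rate_by` and the `toy` frame are illustrative assumptions, not part of the dataset; in this notebook you would pass `df`.

```python
import pandas as pd

def cancellation_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of canceled bookings within each value of `column`."""
    # is_canceled is 0/1, so its mean per group is the cancellation rate
    return frame.groupby(column)["is_canceled"].mean()

# Toy data for illustration only, not taken from hotel_bookings.csv
toy = pd.DataFrame({
    "is_repeated_guest": [0, 0, 0, 0, 1, 1],
    "is_canceled":       [1, 1, 0, 0, 0, 0],
})
rates = cancellation_rate_by(toy, "is_repeated_guest")
print(rates)
```

With the real data, `cancellation_rate_by(df, 'is_repeated_guest')` would give the two rates to compare.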

In [None]:
sns.countplot(data=df, x = 'hotel', hue='is_canceled')

In [None]:
fig = plt.figure(figsize=(10,5), dpi = 100)
sns.countplot(data=df, x = 'arrival_date_month')
plt.xlabel('Month', fontsize=15)
plt.xticks(rotation=45,fontsize=11);

In [None]:
df['reserved_room_type'].unique()

In [None]:
data = df[df['is_canceled'] == 0]
fig = plt.figure(figsize=(9,6), dpi = 100)
sns.boxplot(data= data, x = 'reserved_room_type', y = 'adr', hue = 'hotel')

#### The count plot of arrival_date_month above shows that August was the busiest month and January was the quietest.
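
Month names are plain strings, so plots and counts follow the order in which they appear in the data rather than the calendar. One possible way to get calendar order is an ordered categorical, sketched below. `MONTH_ORDER`, `order_months` and `toy_months` are hypothetical names introduced for this example.

```python
import pandas as pd

# Full English month names, as used in arrival_date_month
MONTH_ORDER = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

def order_months(months: pd.Series) -> pd.Series:
    """Convert month names to an ordered categorical in calendar order."""
    ordered = pd.Categorical(months, categories=MONTH_ORDER, ordered=True)
    return pd.Series(ordered, index=months.index)

# Toy data for illustration only
toy_months = pd.Series(["July", "August", "January", "August"])
# sort_index on a categorical index follows the category (calendar) order
counts = order_months(toy_months).value_counts().sort_index()
print(counts)
```

For the plots themselves, passing `order=MONTH_ORDER` to `sns.countplot` should have the same effect without changing the column.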

In [None]:
sns.boxplot(data= df,x = 'is_canceled', y='adr')
plt.ylim(0,600)
print(df['adr'].mean())

In [None]:
df['country'].value_counts()


In [None]:
fig = plt.figure(figsize=(12,4), dpi=150)


country_wise_guests = df[(df['is_canceled'] == 0)]['country'].value_counts().reset_index()
country_wise_guests.columns = ['country', 'No of guests']

country_wise_guests = country_wise_guests[country_wise_guests['No of guests'] > 60]

sns.barplot(data=country_wise_guests, x = 'country', y = 'No of guests')
plt.xticks(rotation=90,fontsize=11);

Most of the guests are from Portugal, which is reasonable because both hotels are in Portugal (PRT). That gives us a hint to fill the missing Country values with PRT later. Since there were 156 countries, I kept only the ones with more than 60 guests so we can have a clear plot.

In [None]:
fig = plt.figure(figsize=(10,5),dpi=100)

sns.lineplot(data=df, x= 'arrival_date_month', y = 'adr', hue='hotel')
plt.xticks(rotation=45,fontsize=10);

#### In general, the City Hotel has higher prices.
#### This plot shows that in August, which was the busiest month, the City Hotel charges guests the most.

In [None]:
sns.countplot(data=df, x= 'market_segment')
plt.xticks(rotation=45,fontsize=10);

#### Use of technology has increased massively in recent years, and we can see in the above plot that almost 50 percent of reservations are made via Online Travel Agents.
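
A share like "almost 50 percent" is easier to check as a number than to read off a bar chart. A small sketch using `value_counts(normalize=True)` follows. The helper `category_shares` and the `toy_segments` values are assumptions made for illustration.

```python
import pandas as pd

def category_shares(values: pd.Series) -> pd.Series:
    """Percentage of rows per category, largest first."""
    return values.value_counts(normalize=True).mul(100).round(1)

# Toy data for illustration only, not the real segment distribution
toy_segments = pd.Series(["Online TA", "Online TA", "Offline TA/TO", "Direct"])
shares = category_shares(toy_segments)
print(shares)
```

In the notebook, `category_shares(df['market_segment'])` would report the actual percentage per segment.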

In [None]:
sns.countplot(data=df, x= 'total_of_special_requests')

#### This is the number of special requests per booking. Now let's see its relationship with cancellation.

In [None]:
sns.countplot(data=df, x= 'total_of_special_requests', hue='is_canceled')
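
To put numbers on this plot, a row-normalized cross-tabulation gives the canceled and not-canceled share for each number of special requests. This is only a sketch on made-up data; the `toy` frame is an assumption and in the notebook the two columns would come from `df`.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "total_of_special_requests": [0, 0, 0, 0, 1, 1, 2],
    "is_canceled":               [1, 1, 0, 0, 0, 1, 0],
})

# normalize="index" makes every row sum to 1, so each cell is a share
rate_table = pd.crosstab(
    toy["total_of_special_requests"],
    toy["is_canceled"],
    normalize="index",
)
print(rate_table)
```

Column `1` of the resulting table is the cancellation rate for that number of requests.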

Nearly half of the bookings without any special requests were canceled, and the other half were not.

In [None]:
# Home-country pie chart; assumes the country_wise_guests frame built for the bar plot above
fig = px.pie(country_wise_guests,
             values="No of guests",
             names="country",
             title="Home country of guests",
             template="seaborn")
fig.show()

<a id="data"></a>
# Data Preparation

<a id="missing"></a>
**Missing Data**

In [None]:
df.isnull().sum()

In [None]:
df['agent'] = df['agent'].fillna(0)
df['children'] = df['children'].fillna(0)
df['country'] = df['country'].fillna('PRT')
df = df.drop('company', axis = 1)

In [None]:
df.isnull().sum()

Since the Company feature has too many missing values, I preferred to drop the whole feature.
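
The decision to drop a column can be backed by the fraction of missing values rather than the raw count. A minimal sketch, assuming a cutoff of 50 percent, is shown below. The `THRESHOLD` value, the helper `missing_share` and the `toy` frame are all hypothetical and not taken from this notebook.

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, highest first."""
    return frame.isnull().mean().sort_values(ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    "company":  [np.nan, np.nan, np.nan, 40.0],
    "children": [0.0, np.nan, 1.0, 2.0],
    "adr":      [80.0, 95.5, 120.0, 60.0],
})
share = missing_share(toy)

THRESHOLD = 0.5  # hypothetical cutoff, choose one that suits the data
to_drop = share[share > THRESHOLD].index.tolist()
print(share)
print(to_drop)
```

Columns below the cutoff, like `children` here, are candidates for filling instead of dropping.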

<a id="categorical"></a>
**Categorical Data**

In [None]:
useless_col = ['days_in_waiting_list', 'arrival_date_year', 'assigned_room_type', 'booking_changes',
               'reservation_status', 'country']

df.drop(useless_col, axis = 1, inplace = True)

Let's find the categorical features first:

In [None]:
a = df.select_dtypes(object).columns
for i in a:
    print (i, df[i].nunique())

According to our result, one-hot encoding is not a good fit for most of our categorical features, because it would create a lot of columns and add a lot of complexity to our model. Therefore we're going to use label encoding, which you can do either with LabelEncoder from sklearn or with a label converter installed with pip. I prefer the first way.
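
The difference between the two encodings can be seen on a single small column. The sketch below uses a made-up `toy` frame with a `meal` column as an assumption; it is not the notebook's actual encoding step.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only
toy = pd.DataFrame({"meal": ["BB", "HB", "SC", "BB", "FB"]})

# Label encoding: a single integer column, classes numbered in sorted order
encoder = LabelEncoder()
labels = encoder.fit_transform(toy["meal"])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# One-hot encoding: one new column per distinct category
one_hot = pd.get_dummies(toy["meal"], prefix="meal")
print(one_hot.shape)
```

Label encoding keeps one column per feature, but the integers imply an order that the categories do not really have. Tree-based models usually tolerate this, while linear models may not.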

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'])

df['year'] = df['reservation_status_date'].dt.year
df['month'] = df['reservation_status_date'].dt.month
df['day'] = df['reservation_status_date'].dt.day

df.drop(['reservation_status_date','arrival_date_month'] , axis = 1, inplace = True)

In [None]:
a = df.select_dtypes(object).columns
cat_list = []
for i in a:
    print (i, df[i].nunique())
    cat_list.append(i)

In [None]:
for i in cat_list:
    df[i] = le.fit_transform(df[i])
df['year'] = le.fit_transform(df['year'])
df['month'] = le.fit_transform(df['month'])
df['day'] = le.fit_transform(df['day'])

In [None]:
df.head()

<a id="model"></a>
# Train|Test Split

In [None]:
from sklearn.model_selection import train_test_split
y = df['is_canceled']
X = df.drop('is_canceled', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=101,test_size=0.3)

# Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train The Model

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

y_pred_dtc = dtc.predict(X_test)

# Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
acc_dtc = accuracy_score(y_test, y_pred_dtc)
conf = confusion_matrix(y_test, y_pred_dtc)
clf_report = classification_report(y_test, y_pred_dtc)
acc_dtc

<a id="evaluation"></a>
# Evaluation

In [None]:
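# plot_confusion_matrix is not available in scikit-learn 1.2 and later; ConfusionMatrixDisplay is used instead
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(dtc, X_test, y_test)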

In [None]:
print(clf_report)

Since the relationship between our label and the features is hard to see directly, the data frame below shows which features were most important in our model.

In [None]:
pd.DataFrame(index = X.columns, data = dtc.feature_importances_, 
             columns=['Feature Importance']).sort_values('Feature Importance', ascending = False)

Typically we would run a grid search to test different hyperparameters such as min_samples_split and min_samples_leaf, but since we are already getting good results there is no need for that.
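
For reference, such a grid search could look roughly like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and the grid values are arbitrary assumptions rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=101)

# Hypothetical grid; the values are examples, not recommendations
param_grid = {
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(
    DecisionTreeClassifier(random_state=101),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a tree refit on all the data with the best combination found.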

To visualize the tree, I plot it below. Since we have too many features to show the full tree, I set max_depth to 3 to get a smaller tree that we can read properly.

In [None]:
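from sklearn.tree import plot_tree

plt.figure(figsize=(12,8), dpi=200)

# Shallow tree for visualization only; dtc remains the model evaluated above
pruned_dtc = DecisionTreeClassifier(max_depth=3)
pruned_dtc.fit(X_train, y_train)
y_pred_pruned = pruned_dtc.predict(X_test)  # separate name so y_pred_dtc is not overwritten

# X is still a DataFrame, so its column names can label the splits
plot_tree(pruned_dtc, filled = True, feature_names = list(X.columns));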

Decision trees themselves are prone to overfitting. Many methods build on the decision tree model, such as Random Forests and gradient boosted trees, to extend it and fix some of its potential flaws. These are more advanced tree-based methods built from decision trees, so you rarely use a single decision tree in more realistic problems.
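
As a rough illustration of that comparison, the sketch below trains a single tree and a Random Forest on the same synthetic data. The data comes from `make_classification`, not from the hotel bookings, so the scores say nothing about this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=101
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=101),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=101),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # accuracy on the held-out split
print(scores)
```

Swapping in the notebook's `X_train`, `X_test`, `y_train` and `y_test` would show whether the ensemble actually helps on this dataset.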

