## Business Context

With the growing popularity and accessibility of online hotel booking platforms, customers tend to reserve in advance to avoid a last-minute rush and higher prices. These platforms offer flexible cancellation options, in some cases up to a day before the reservation. To compete, offline booking channels have also made cancellations more flexible. The result is a growing number of cancellations, with last-minute changes in travel plans among the primary reasons. These sudden changes can result from unforeseen circumstances such as personal emergencies, flight delays, or unexpected events at the travel destination.

Hotel booking cancellations are a crucial problem to solve because they lead to revenue loss and operational inefficiencies. Cancellations impact a hotel on several fronts:

1. Loss of revenue when the hotel cannot resell the room

2. Additional distribution-channel costs, from higher commissions or paid publicity, to help sell these rooms

3. Last-minute price cuts so the hotel can resell a room, resulting in reduced profit margins

## Problem Definition

The INN Hotels Group has been contending with rising cancellations for nearly a year now. However, over the last three months **inventory loss due to cancellations rose to an all-time high of 18%**. This has pushed **revenue loss to an all-time high of approx. \$0.25 million annually** and has significantly impacted profit margins.

- In the current context, inventory refers to a hotel room, and the inability to sell one leads to inventory loss

The group has been using heuristic mechanisms (rule-based and domain-expert-based) to try to reduce the revenue loss due to cancellations, but this approach hasn't been effective so far (it is neither efficient nor scalable), as is evident from the magnitude of the losses they are incurring.

The group has decided that they **need a Data Science-based solution to predict the likelihood of a booking being canceled** as they expect it to be more effective than their current mechanism. They hope that this proactive approach will help them significantly **minimize revenue loss and improve operational efficiency**.

In [None]:
# Import the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from scipy import stats

from sklearn.preprocessing import PowerTransformer

import warnings
warnings.filterwarnings('ignore')


### Load the Past Data and New Data

In [None]:
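# Raw strings (r"...") keep the backslashes in the Windows paths from being read as escape sequences
new_data = pd.read_csv(r"Z:\python classes\data\Hotel_cancellation_data\INNHotelsGroup_newdata.csv")
past_data = pd.read_csv(r"Z:\python classes\data\Hotel_cancellation_data\INNHotelsGroup_pastdata.csv")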

In [None]:
# Let's check their shapes
past_data.shape

In [None]:
new_data.shape

In [None]:
past_data.head(2)

In [None]:
past_data.tail(2)

In [None]:
new_data.head(2)

In [None]:
new_data.tail(2)

#### Let's understand exactly how many bookings were cancelled and rebooked at the last moment from Jan-21 to July-22

In [None]:
# Let's see how many bookings were cancelled

In [None]:
past_data['booking_status'].value_counts(normalize=True)

In [None]:
past_data['booking_status'].value_counts(normalize=True).plot(kind='pie',
                                    autopct='%.2f%%',radius=1.2,colors=['green','red'],
                                    shadow=True,explode=[0,0.1])
plt.show()

In [None]:
# Let's see how many of the cancelled bookings were rebooked

book_data = past_data[past_data['booking_status']=='Canceled']

In [None]:
book_data['rebooked'].value_counts(normalize=True).plot(kind='pie',
                                    autopct='%.2f%%',radius=1.2,colors=['green','red'],
                                    shadow=True,explode=[0,0.1])
plt.show()

Inference: Of all bookings, ~33% are cancelled, and of those cancelled bookings only ~20% are rebooked. Hence the heuristic approach is not effective; inventory losses are still very high.
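
The two approximate shares in the inference can be combined into a rough share of bookings that are cancelled and never resold. This is only a back-of-the-envelope sketch using the rounded figures read off the pie charts; it is a share of bookings, which is not the same measure as the inventory-loss percentage quoted in the problem definition.

```python
# Approximate shares read off the pie charts (rounded, not exact values)
cancel_rate = 0.33   # share of all bookings that are cancelled
rebook_rate = 0.20   # share of cancelled bookings that are rebooked

# Share of all bookings that are cancelled and never resold
net_loss_rate = cancel_rate * (1 - rebook_rate)
print(f"Bookings cancelled and not rebooked: {net_loss_rate:.1%}")
```

Under these rounded inputs, roughly a quarter of all bookings end as unsold rooms, which is the gap a predictive model is meant to narrow.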

## Exploring the data

In [None]:
past_data.head(2)

In [None]:
past_data.describe().T

In [None]:
# Booking ID is only an identifier, so let's make it the index.

past_data.set_index('booking_id',inplace=True)
new_data.set_index('booking_id',inplace=True)

In [None]:
# We also need to change the data type of the "arrival_date" column to datetime
past_data['arrival_date'] = pd.to_datetime(past_data['arrival_date'],format='%Y-%m-%d')
new_data['arrival_date'] = pd.to_datetime(new_data['arrival_date'],format='%Y-%m-%d')

In [None]:
past_data.dtypes

In [None]:
past_data.tail(2)

### Univariate Analysis 

In [None]:
num_cols = ['lead_time','avg_price_per_room']
cat_cols = ['market_segment_type', 'no_of_special_requests',
            'no_of_adults', 'no_of_weekend_nights','required_car_parking_space', 
            'no_of_week_nights','booking_status', 'rebooked']

In [None]:
t=1
plt.figure(figsize=(10,3))
for i in num_cols:
    plt.subplot(1,2,t)
    sns.kdeplot(data=past_data,x=i,fill=True)
    plt.title(f'Skewness: {round(past_data[i].skew(),2)}')
    t+=1
plt.tight_layout()
plt.show()
    

Inference:
* The lead time has very large values (>200), which means some bookings were made more than 200 days in advance.
* Some bookings appear to have an average price per room of 0.

In [None]:
t=1
plt.figure(figsize=(10,3))
for i in num_cols:
    plt.subplot(1,2,t)
    sns.boxplot(data=past_data,x=i)
    plt.title(f'Skewness: {round(past_data[i].skew(),2)}')
    t+=1
plt.tight_layout()
plt.show()

In [None]:
t=1
plt.figure(figsize=(10,17))
for i in cat_cols:
    plt.subplot(4,2,t)
    sns.countplot(data=past_data,x=i)
    t+=1
plt.tight_layout()
plt.show()

Inference:

* Most people book online, and most bookings are for 2 adults. Cancelled bookings are fewer than non-cancelled ones, and of the roughly 8,000 cancelled bookings approximately 2,000 were rebooked.

* Most guests don't need a parking space, which suggests they don't arrive in their own vehicle.

* Most people stay for 1 or 2 weekend nights.

### Bivariate Analysis

In [None]:
# Num VS Cat

In [None]:
t=1
plt.figure(figsize=(10,3))
for i in num_cols:
    plt.subplot(1,2,t)
    sns.kdeplot(data=past_data,x=i,fill=True,hue='booking_status')
    plt.title(f'Skewness: {round(past_data[i].skew(),2)}')
    t+=1
plt.tight_layout()
plt.show()

In [None]:
t=1
plt.figure(figsize=(10,3))
for i in num_cols:
    plt.subplot(1,2,t)
    sns.boxplot(data=past_data,x=i,y='booking_status')
    plt.title(f'Skewness: {round(past_data[i].skew(),2)}')
    t+=1
plt.tight_layout()
plt.show()

In [None]:
# Cat vs Cat (Grouped Bar plot)

for i in cat_cols:
    if i != 'booking_status':
        pd.crosstab(past_data[i],past_data['booking_status']).plot(kind='bar')
        plt.show()   # show only when a plot was actually drawn

### Multivariate Analysis

In [None]:
sns.heatmap(past_data.corr(numeric_only=True),annot=True,cmap='RdBu',vmin=-1)
plt.show()

## Inferential Statistics

In [None]:
# Let's statistically test the inferences that we have made.

In [None]:
# Lead_time vs Booking Status

samp1 = past_data[past_data['booking_status']=='Canceled']['lead_time']
samp2 = past_data[past_data['booking_status']=='Not Canceled']['lead_time']

# Let's go with a 2-sample t-test
# Ho: mu1=mu2 (Lead time does not affect booking cancellations)
# Ha: mu1!=mu2 (Lead time does affect booking cancellations)


# Assumption 1: the sample means must be approximately normal
# (each sample has more than 30 observations, so we rely on the central limit theorem)

# Assumption 2: variances must be equal
# Ho: Variances are equal
# Ha: Variances are not equal
print(stats.levene(samp1,samp2))  # p-value less than 0.05

# Hence the population variances are not equal, so let's go with the two-sample t-test
# with unequal variances (Welch's t-test)

print(stats.ttest_ind(samp1,samp2,equal_var=False))
# Since the p-value is < 0.05 we reject Ho: mean lead time differs by booking status
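
The Levene-then-t-test decision used in the cell above can be sketched on synthetic data. The two samples below are hypothetical stand-ins for the lead-time groups, not the hotel data, and the 0.05 cut-off is the same conventional level used in the comments above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two lead-time samples (hypothetical values)
canceled = rng.normal(loc=120, scale=80, size=500)
not_canceled = rng.normal(loc=60, scale=50, size=1000)

# Levene's test: a small p-value suggests the variances differ
lev_stat, lev_p = stats.levene(canceled, not_canceled)

# Use Student's t-test only if equal variances were not rejected, else Welch's
equal_var = bool(lev_p >= 0.05)
t_stat, t_p = stats.ttest_ind(canceled, not_canceled, equal_var=equal_var)

print(f"Levene p-value: {lev_p:.3g} -> equal_var={equal_var}")
print(f"t-test p-value: {t_p:.3g}")
```

With samples this large, even small differences in means come out significant, so it can help to look at the size of the difference alongside the p-value.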

In [None]:
# Avg_room_price VS booking_status

samp1 = past_data[past_data['booking_status']=='Canceled']['avg_price_per_room']
samp2 = past_data[past_data['booking_status']=='Not Canceled']['avg_price_per_room']

# Let's go with a 2-sample t-test
# Ho: mu1=mu2 (avg_price_per_room does not affect the booking status)
# Ha: mu1!=mu2 (avg_price_per_room does affect the booking status)


# Assumption 1: the sample means must be approximately normal
# (each sample has more than 30 observations, so we rely on the central limit theorem)

# Assumption 2: variances must be equal
# Ho: Variances are equal
# Ha: Variances are not equal
print(stats.levene(samp1,samp2))  # p-value less than 0.05

# Hence the population variances are not equal, so let's go with the two-sample t-test
# with unequal variances (Welch's t-test)

print(stats.ttest_ind(samp1,samp2,equal_var=False))
# Since the p-value is < 0.05 we reject Ho: mean room price differs by booking status

In [None]:
# All cat columns VS booking status

# chi-square test for independence

# Ho: The column is independent of booking status
# Ha: The column is related to booking status

for i in cat_cols:
    if i not in ['booking_status','rebooked']:
        ct = pd.crosstab(past_data['booking_status'],past_data[i])
        print(i,':\t',stats.chi2_contingency(ct)[1])  # index 1 of the result is the p-value

# Since the p-values for all the categorical columns are less than 0.05,
# every one of them has a statistically significant association with booking status
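
The loop above keeps only the p-value from `stats.chi2_contingency`. As a small illustration of what else the call returns, here is a sketch on a hypothetical toy table (the `parking` column and its counts are invented for the example, not taken from the hotel data).

```python
import pandas as pd
from scipy import stats

# Hypothetical toy bookings, not the hotel data
toy = pd.DataFrame({
    "booking_status": ["Canceled"] * 40 + ["Not Canceled"] * 60,
    "parking": [0] * 38 + [1] * 2 + [0] * 40 + [1] * 20,
})

ct = pd.crosstab(toy["booking_status"], toy["parking"])
chi2, p_value, dof, expected = stats.chi2_contingency(ct)

print(ct)
print(f"chi2={chi2:.2f}, p-value={p_value:.4f}, dof={dof}")
# Expected counts under independence; the test is less reliable when many are below 5
print(pd.DataFrame(expected, index=ct.index, columns=ct.columns).round(1))
```

Checking the expected counts is a quick way to see whether sparse categories might be undermining the chi-square approximation.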

## Data Preprocessing

In [None]:
# Let's remove "rebooked" from the predictive modelling, as this information
# will not be available for future data

In [None]:
past_data.drop(columns=['rebooked'],inplace=True)

In [None]:
past_data.head(2)

### Missing Values

In [None]:
past_data.isnull().sum()

In [None]:
new_data.isnull().sum()

### Outlier Treatment

In [None]:
t=1
plt.figure(figsize=(10,3))
for i in num_cols:
    plt.subplot(1,2,t)
    sns.boxplot(data=past_data,x=i)
    plt.title(f'Skewness: {round(past_data[i].skew(),2)}')
    t+=1
plt.tight_layout()
plt.show()

In [None]:
# Let's cap the extreme outliers at 2.5 * IQR beyond the quartiles
for i in num_cols:
    q1,q3 = np.quantile(past_data[i],[0.25,0.75])
    iqr = q3-q1
    ul,ll = q3+2.5*iqr,q1-2.5*iqr
    past_data[i]=past_data[i].apply(lambda x: ul if x>ul else 
                                    ll if x<ll else x)
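
The same IQR capping can be written with a single vectorized call. The sketch below uses hypothetical values rather than the hotel data and checks that `Series.clip` matches the element-wise lambda used above.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one extreme point, not the hotel data
s = pd.Series([10, 12, 14, 15, 16, 18, 20, 400], dtype=float)

q1, q3 = np.quantile(s, [0.25, 0.75])
iqr = q3 - q1
ll, ul = q1 - 2.5 * iqr, q3 + 2.5 * iqr

# Series.clip caps values outside [ll, ul] in one vectorized call
capped = s.clip(lower=ll, upper=ul)

# Element-wise version, as in the loop above, for comparison
via_apply = s.apply(lambda x: ul if x > ul else ll if x < ll else x)
print(capped.tolist())
print("Same result:", capped.equals(via_apply))
```

On large frames `clip` is usually faster than `apply` with a lambda, and it reads closer to the intent.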

In [None]:
t=1
plt.figure(figsize=(10,3))
for i in num_cols:
    plt.subplot(1,2,t)
    sns.boxplot(data=past_data,x=i)
    plt.title(f'Skewness: {round(past_data[i].skew(),2)}')
    t+=1
plt.tight_layout()
plt.show()

### Feature Encoding

In [None]:
past_data = pd.get_dummies(past_data,columns=['market_segment_type'],drop_first=True,dtype=int)

In [None]:
new_data = pd.get_dummies(new_data,columns=['market_segment_type'],drop_first=True,dtype=int)
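
Encoding the two frames separately works as long as both contain the same market segments. If the new data were missing a segment, the dummy columns would no longer line up. One possible safeguard, sketched here on hypothetical frames (the `segment` column and its values are assumptions for illustration), is to fix the category list from the training data first.

```python
import pandas as pd

# Hypothetical frames: the new data happens to contain only one segment
train = pd.DataFrame({"lead_time": [10, 40, 5],
                      "segment": ["Offline", "Online", "Online"]})
new = pd.DataFrame({"lead_time": [7, 30],
                    "segment": ["Online", "Online"]})

train_enc = pd.get_dummies(train, columns=["segment"], drop_first=True, dtype=int)
new_enc = pd.get_dummies(new, columns=["segment"], drop_first=True, dtype=int)
print(new_enc.columns.tolist())  # the dummy column is missing here

# Fixing the category list up front makes both frames produce the same dummies
cats = sorted(train["segment"].unique())
new_fixed = new.assign(segment=pd.Categorical(new["segment"], categories=cats))
new_fixed = pd.get_dummies(new_fixed, columns=["segment"], drop_first=True, dtype=int)
print(new_fixed.columns.tolist())
```

Simply adding the missing column filled with zeros would be wrong in this toy case, because every new row is in fact an online booking.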

In [None]:
# Let's encode booking status as a binary target (Canceled = 1, Not Canceled = 0)

past_data['booking_status'] = past_data['booking_status'].map({'Canceled':1,
                                                              'Not Canceled':0})

In [None]:
new_data['booking_status'] = new_data['booking_status'].map({'Canceled':1,
                                                              'Not Canceled':0})

In [None]:
new_data.head(2)

### Feature Transformation

In [None]:
transformer = PowerTransformer(standardize=False)

In [None]:
past_data[num_cols] = transformer.fit_transform(past_data[num_cols])
new_data[num_cols] = transformer.transform(new_data[num_cols])
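
The fit-on-past, transform-on-new pattern above can be illustrated in isolation. This is a sketch on a synthetic right-skewed feature (a hypothetical stand-in for `lead_time`), showing that the transformation parameter is learned once and then reused.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Hypothetical right-skewed feature standing in for lead_time
train = pd.DataFrame({"lead_time": rng.lognormal(mean=3, sigma=1, size=2000)})
new = pd.DataFrame({"lead_time": rng.lognormal(mean=3, sigma=1, size=500)})

pt = PowerTransformer(standardize=False)  # method defaults to 'yeo-johnson'
train_t = pt.fit_transform(train)         # lambda is estimated on the training data only
new_t = pt.transform(new)                 # the same lambda is reused for new data

skew_before = train["lead_time"].skew()
skew_after = pd.Series(train_t[:, 0]).skew()
print(f"lambda={pt.lambdas_[0]:.3f}, skew before={skew_before:.2f}, after={skew_after:.2f}")
```

Calling `fit_transform` on the new data instead would estimate a different lambda and put the two datasets on different scales.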

In [None]:
t=1
plt.figure(figsize=(10,3))
for i in num_cols:
    plt.subplot(1,2,t)
    sns.kdeplot(data=past_data,x=i,fill=True)
    plt.title(f'Skewness: {round(past_data[i].skew(),2)}')
    t+=1
plt.tight_layout()
plt.show()

### Feature Engineering (Generate new columns)

In [None]:
# Let's extract the month and weekday from the date of arrival
past_data['arrival_month']=past_data['arrival_date'].dt.month
past_data['arrival_wkday']=past_data['arrival_date'].dt.weekday

In [None]:
past_data.drop(columns='arrival_date',inplace=True)
past_data.head(2)

In [None]:
new_data['arrival_month']=new_data['arrival_date'].dt.month
new_data['arrival_wkday']=new_data['arrival_date'].dt.weekday
new_data.drop(columns='arrival_date',inplace=True)

In [None]:
new_data.head(2)

In [None]:
# We can add the number of week nights and weekend nights in the booking to get the stay duration
past_data['total_nights'] = past_data['no_of_week_nights'] + past_data['no_of_weekend_nights']

In [None]:
new_data['total_nights'] = new_data['no_of_week_nights'] + new_data['no_of_weekend_nights']

In [None]:
past_data.head(2)

In [None]:
new_data.head(2)

In [None]:
# We can also calculate the departure weekday from total nights and arrival 
# weekday

In [None]:
past_data['depart_wkday'] = past_data['arrival_wkday']+past_data['total_nights']

In [None]:
new_data['depart_wkday'] = new_data['arrival_wkday']+new_data['total_nights']

In [None]:
# Wrap weekday numbers above 6 back into the 0-6 range (0 = Monday, 6 = Sunday)
def setting_weekday(num):
    if num>6:
        return num%7
    else:
        return num

In [None]:
past_data['depart_wkday'] = past_data['depart_wkday'].apply(setting_weekday)
new_data['depart_wkday'] = new_data['depart_wkday'].apply(setting_weekday)
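
Because the weekday sums are non-negative integers, the wrap-around can also be done with a single modulo on the whole column. A minimal sketch with hypothetical arrival weekdays and stay lengths:

```python
import pandas as pd

def setting_weekday(num):
    # Same helper as above, repeated so this snippet runs on its own
    if num > 6:
        return num % 7
    else:
        return num

# Hypothetical arrival weekdays (0 = Monday) and stay lengths
toy = pd.DataFrame({"arrival_wkday": [0, 3, 5, 6], "total_nights": [2, 3, 4, 15]})

raw = toy["arrival_wkday"] + toy["total_nights"]
via_apply = raw.apply(setting_weekday)
via_mod = raw % 7   # vectorized; same result for non-negative integers

print(via_mod.tolist())  # → [2, 6, 2, 0]
```

Both routes give the same departure weekday; the modulo version just avoids the row-by-row function call.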

In [None]:
past_data.head()

In [None]:
past_data.shape

In [None]:
new_data.shape

# Predictive Modeling

In [None]:
#libraries
from sklearn.metrics import (classification_report, accuracy_score, f1_score, recall_score, precision_score, cohen_kappa_score, roc_curve, roc_auc_score)
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [None]:
y_train= past_data['booking_status']
x_train= past_data.drop(columns='booking_status')
y_test=new_data['booking_status']
x_test= new_data.drop(columns='booking_status')

In [None]:
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

In [None]:
#Create a function to train a model and predict on the test set
def model_training(model, xtrain=x_train, ytrain=y_train, xtest=x_test, cutoff=0.5):
    m=model
    m.fit(xtrain,ytrain)
    pred_s= m.predict_proba(xtest)[:,1]    # soft predictions: probability of cancellation
    pred_h= (pred_s>cutoff).astype(int)    # hard predictions: 1 if the probability exceeds the cutoff

    return pred_s, pred_h

In [None]:
def model_scores(predh, preds, ytest= y_test):
    print('Classification Report:\n', classification_report(ytest, predh))

    fpr, tpr, thres= roc_curve(ytest,preds)
    plt.plot(fpr, tpr)
    plt.plot([0,1],[0,1], ls='--', color='red')
    plt.title(f'ROC AUC: {round(roc_auc_score(ytest,preds),2)}')
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.show()

In [None]:
#Function for Model Tuning
def model_tuning(model, grid, x=x_train, y=y_train, folds=6, score='roc_auc'):
    gscv= GridSearchCV(estimator=model, param_grid=grid, cv=folds, verbose=0, scoring=score)
    gscv.fit(x,y)
    print('Best Score:', gscv.best_score_)
    return gscv.best_params_

In [None]:
#Function for score card (the lists below accumulate one entry per model)
mod= []
accu= []
rec= [] 
pred= []
f1= []
ck= []
auc=[]

def model_scorecard(model_name, predh, preds, ytest=y_test):
    mod.append(model_name)
    accu.append(accuracy_score(ytest, predh))
    rec.append(recall_score(ytest, predh))
    pred.append(precision_score(ytest, predh))
    f1.append(f1_score(ytest, predh))
    ck.append(cohen_kappa_score(ytest, predh))
    auc.append(roc_auc_score(ytest, preds))
    global scorecard 

    scorecard= pd.DataFrame({'Accuracy': accu, 'Recall': rec, 'Precision': pred, 'F1 Score': f1, 'Cohen Kappa Score': ck,
    'ROC AUC': auc}, index= mod)
    return scorecard

## Logistic Regression: Baseline Model

In [None]:
y_pred_lr_s, y_pred_lr_h= model_training(LogisticRegression())

In [None]:
model_scores(y_pred_lr_h, y_pred_lr_s)

In [None]:
model_scorecard('LOG REG(BASE MODEL)',  y_pred_lr_h, y_pred_lr_s)

## Decision Tree

In [None]:
grid={'max_depth':[4,5,6,7,8]}
best_params=model_tuning(DecisionTreeClassifier(class_weight='balanced'),grid)

In [None]:
ypred_dt_s, ypred_dt_h= model_training(DecisionTreeClassifier(class_weight='balanced' ,**best_params))

In [None]:
model_scores(ypred_dt_h, ypred_dt_s)

In [None]:
model_scorecard('Decision Tree', ypred_dt_h, ypred_dt_s)

## Random Forest

In [None]:
grid={'n_estimators': [70,100,120,150,180, 200],
'max_depth':[3,4,5,6]}
best_params=model_tuning(RandomForestClassifier(class_weight='balanced'), grid)

In [None]:
ypred_rf_s, ypred_rf_h=model_training(RandomForestClassifier(class_weight='balanced',**best_params))

In [None]:
model_scores(ypred_rf_h, ypred_rf_s)

In [None]:
model_scorecard('Random Forest', ypred_rf_h, ypred_rf_s)

## XGBoost

In [None]:
grid={'max_depth':[2,3,4,5],'n_estimators':[100,150,200,250],'learning_rate':[0.01,0.02,0.05,0.1]}

In [None]:
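# XGBClassifier has no class_weight parameter; scale_pos_weight (negatives / positives)
# is its way of balancing the classes
pos_weight = (y_train==0).sum()/(y_train==1).sum()
best_params=model_tuning(XGBClassifier(scale_pos_weight=pos_weight), grid)

In [None]:
ypred_xgb_s, ypred_xgb_h= model_training(XGBClassifier(scale_pos_weight=pos_weight, **best_params))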

In [None]:
model_scores(ypred_xgb_h, ypred_xgb_s)

In [None]:
model_scorecard('XGBoost', ypred_xgb_h, ypred_xgb_s)


In [None]:
best_params

## Final Model

In [None]:
#Let's explore XGBoost

model_XGB= XGBClassifier(n_estimators=100, max_depth=2, learning_rate=0.05)

In [None]:
#Train score:
ypred_xgb_s, ypred_xgb_h= model_training(model_XGB, xtest=x_train)
model_scores(ypred_xgb_h, ypred_xgb_s, ytest=y_train)

In [None]:
#Test score:
ypred_xgb_s, ypred_xgb_h= model_training(model_XGB)
model_scores(ypred_xgb_h, ypred_xgb_s)

In [None]:
#Let's look at the decision tree

model_dt= DecisionTreeClassifier( max_depth=7)

In [None]:
#Train score:
ypred_dt_s, ypred_dt_h= model_training(model_dt, xtest=x_train)
model_scores(ypred_dt_h, ypred_dt_s, ytest=y_train)

In [None]:
#Test score:
ypred_dt_s, ypred_dt_h= model_training(model_dt)
model_scores(ypred_dt_h, ypred_dt_s)

In [None]:
#Let's try voting to combine both models
from sklearn.ensemble import VotingClassifier

In [None]:
model_vote=VotingClassifier(estimators=[('DT',model_dt),('XGB', model_XGB)], voting='soft')

In [None]:
model_vote.fit(x_train, y_train)

In [None]:
ypred_vot_s= model_vote.predict_proba(x_test)[:,1]
ypred_vot_h= (ypred_vot_s>=0.4).astype(int)

In [None]:
fpr, tpr, thres= roc_curve(y_test, ypred_vot_s)
# Youden's index (YI) = TPR - FPR; the threshold with the highest YI is a candidate cutoff
pd.DataFrame({'FPR':fpr, 'TPR': tpr, 'Thres': thres, 'YI': tpr-fpr}).sort_values(by='YI', ascending=False).head(2)
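
For reference, Youden's J statistic is defined as sensitivity + specificity - 1, which equals TPR - FPR. The sketch below picks the threshold that maximizes it on hypothetical labels and scores (synthetic data, not the voting model's output). In practice the threshold would usually be chosen on a validation split rather than on the test set.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
# Hypothetical labels and scores, not the model output above
y_true = rng.integers(0, 2, size=500)
scores = np.clip(0.3 * y_true + rng.normal(0.4, 0.15, size=500), 0, 1)

fpr, tpr, thres = roc_curve(y_true, scores)
youden_j = tpr - fpr            # J = sensitivity + specificity - 1
best = int(np.argmax(youden_j))

print(f"best threshold={thres[best]:.3f}, TPR={tpr[best]:.2f}, FPR={fpr[best]:.2f}")
```

The first threshold returned by `roc_curve` is a sentinel above every score; it has J = 0 and so is never selected when the scores carry any signal.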

In [None]:
model_scores(ypred_vot_h, ypred_vot_s)

In [None]:
ypred_vot_s= model_vote.predict_proba(x_test)[:,1]
ypred_vot_h= (ypred_vot_s>=0.3).astype(int)

In [None]:
model_scores(ypred_vot_h, ypred_vot_s)
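
Lowering the cutoff from 0.4 to 0.3 flags more bookings as likely cancellations, which generally raises recall at the cost of precision. A minimal sketch of that trade-off on hypothetical labels and probabilities (not the model's actual predictions):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted cancellation probabilities
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
proba = np.array([0.10, 0.35, 0.45, 0.32, 0.55, 0.90, 0.20, 0.42])

results = {}
for cutoff in (0.4, 0.3):
    y_hat = (proba >= cutoff).astype(int)
    results[cutoff] = (recall_score(y_true, y_hat), precision_score(y_true, y_hat))
    print(f"cutoff={cutoff}: recall={results[cutoff][0]:.2f}, "
          f"precision={results[cutoff][1]:.2f}")
```

Which cutoff is preferable depends on whether a missed cancellation or a false alarm costs the hotel more.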

## Deployment

In [None]:
x_train.columns

In [None]:
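#Let's try the model on one sample booking

lt=15             # lead_time (days)
spcl=0            # no_of_special_requests
price=120         # avg_price_per_room
adul=2            # no_of_adults
wkend=2           # no_of_weekend_nights
park=0            # required_car_parking_space
wk=1              # no_of_week_nights
mrkt=1            # market_segment_type dummy column
amnth=10          # arrival_month
awk=3             # arrival_wkday
tn=wkend+wk       # total_nights (= 3)
dw=(awk+tn)%7     # depart_wkday, derived the same way as in feature engineering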

In [None]:
lt_t, price_t= transformer.transform([[lt, price]])[0]
lt_t, price_t

In [None]:
input_list=[lt_t, spcl, price_t, adul, wkend, park, wk, mrkt, amnth, awk, tn, dw]

In [None]:
model_vote.predict_proba([input_list])[:,1]

In [None]:
model_vote.predict([input_list])

In [None]:
#We need both of the above objects (transformer and model) for deployment, so let's save them as joblib files
import joblib

In [None]:
with open('transformer.joblib','wb') as file:
    joblib.dump(transformer, file)

In [None]:
with open('final_model.joblib','wb') as file:
    joblib.dump(model_vote, file)
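
A saved joblib file is only useful if it can be loaded back and gives the same predictions. The round trip below is a sketch using a small hypothetical stand-in model and a temporary directory, rather than the `transformer` and `model_vote` objects saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in model; the notebook saves the transformer and voting model instead
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)        # joblib.dump also accepts a file path directly
    restored = joblib.load(path)    # what a serving app would do at startup

same = np.allclose(model.predict_proba(X), restored.predict_proba(X))
print("Restored model gives identical probabilities:", same)
```

Loading works most reliably when the serving environment uses the same library versions as the one that saved the files.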

In [None]:
! pip install streamlit