# HOTEL BOOKING CANCELLATION PREDICTION USING MACHINE LEARNING MODELS

The objective of this project is to predict hotel booking cancellation status using machine learning algorithms. The dataset comprises booking-related features such as the number of adults and children, number of nights booked (weekend and weekday), meal plan, room type, lead time, market segment type, price, special requests, and reservation date. By applying a range of classification models, including Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Logistic Regression, Random Forest, and Decision Tree, this project aims to identify patterns in booking data that influence cancellation behavior. The performance of these models is assessed using metrics to determine the best model for accurately predicting booking cancellations.

Importing Dataset and Libraries

In [1]:
import pandas as pd
import warnings 
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
df=pd.read_csv(r"E:\data_analytics\ml_works\machine_learning_projects\booking.csv")
df.head()

Unnamed: 0,Booking_ID,number of adults,number of children,number of weekend nights,number of week nights,type of meal,car parking space,room type,lead time,market segment type,repeated,P-C,P-not-C,average price,special requests,date of reservation,booking status
0,INN00001,1,1,2,5,Meal Plan 1,0,Room_Type 1,224,Offline,0,0,0,88.0,0,10/2/2015,Not_Canceled
1,INN00002,1,0,1,3,Not Selected,0,Room_Type 1,5,Online,0,0,0,106.68,1,11/6/2018,Not_Canceled
2,INN00003,2,1,1,3,Meal Plan 1,0,Room_Type 1,1,Online,0,0,0,50.0,0,2/28/2018,Canceled
3,INN00004,1,0,0,2,Meal Plan 1,0,Room_Type 1,211,Online,0,0,0,100.0,1,5/20/2017,Canceled
4,INN00005,1,0,1,2,Not Selected,0,Room_Type 1,48,Online,0,0,0,77.0,0,4/11/2018,Canceled


Information about Data

In [2]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36285 entries, 0 to 36284
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Booking_ID                36285 non-null  object 
 1   number of adults          36285 non-null  int64  
 2   number of children        36285 non-null  int64  
 3   number of weekend nights  36285 non-null  int64  
 4   number of week nights     36285 non-null  int64  
 5   type of meal              36285 non-null  object 
 6   car parking space         36285 non-null  int64  
 7   room type                 36285 non-null  object 
 8   lead time                 36285 non-null  int64  
 9   market segment type       36285 non-null  object 
 10  repeated                  36285 non-null  int64  
 11  P-C                       36285 non-null  int64  
 12  P-not-C                   36285 non-null  int64  
 13  average price             36285 non-null  float64
 14  specia

Finding and Removing Null Values

In [None]:
df.isna().sum()
df.dropna(inplace=True)

Booking_ID                  0
number of adults            0
number of children          0
number of weekend nights    0
number of week nights       0
type of meal                0
car parking space           0
room type                   0
lead time                   0
market segment type         0
repeated                    0
P-C                         0
P-not-C                     0
average price               0
special requests            0
date of reservation         0
booking status              0
dtype: int64

Removing Space in Column name

In [4]:
df.columns=df.columns.str.replace(" ","_")
print(df.columns)

Index(['Booking_ID', 'number_of_adults', 'number_of_children',
       'number_of_weekend_nights', 'number_of_week_nights', 'type_of_meal',
       'car_parking_space', 'room_type', 'lead_time', 'market_segment_type',
       'repeated', 'P-C', 'P-not-C', 'average_price', 'special_requests',
       'date_of_reservation', 'booking_status'],
      dtype='object')


Encoding Categorical Value

In [5]:
le=LabelEncoder()
column_to_convert=['type_of_meal','room_type','market_segment_type','booking_status']
for col in column_to_convert:
    df[col]=le.fit_transform(df[col])
df.head(10)

Unnamed: 0,Booking_ID,number_of_adults,number_of_children,number_of_weekend_nights,number_of_week_nights,type_of_meal,car_parking_space,room_type,lead_time,market_segment_type,repeated,P-C,P-not-C,average_price,special_requests,date_of_reservation,booking_status
0,INN00001,1,1,2,5,0,0,0,224,3,0,0,0,88.0,0,10/2/2015,1
1,INN00002,1,0,1,3,3,0,0,5,4,0,0,0,106.68,1,11/6/2018,1
2,INN00003,2,1,1,3,0,0,0,1,4,0,0,0,50.0,0,2/28/2018,0
3,INN00004,1,0,0,2,0,0,0,211,4,0,0,0,100.0,1,5/20/2017,0
4,INN00005,1,0,1,2,3,0,0,48,4,0,0,0,77.0,0,4/11/2018,0
5,INN00006,1,0,0,2,1,0,0,346,3,0,0,0,100.0,1,9/13/2016,0
6,INN00007,1,1,1,4,0,0,0,34,4,0,0,0,107.55,1,10/15/2017,1
7,INN00008,3,0,1,3,0,0,3,83,4,0,0,0,105.61,1,12/26/2018,1
8,INN00009,1,1,0,4,0,0,0,121,3,0,0,0,96.9,1,7/6/2018,1
9,INN00010,2,0,0,5,0,0,3,44,4,0,0,0,133.44,3,10/18/2018,1


Independent and Dependent Variables

In [6]:
x=df.iloc[:,[7,8,9,10,11,12,13,14]].values
x=pd.DataFrame(x)
y=df.iloc[:,16].values
y=pd.DataFrame(y)
print(x)
print(y)

         0      1    2    3    4    5       6    7
0      0.0  224.0  3.0  0.0  0.0  0.0   88.00  0.0
1      0.0    5.0  4.0  0.0  0.0  0.0  106.68  1.0
2      0.0    1.0  4.0  0.0  0.0  0.0   50.00  0.0
3      0.0  211.0  4.0  0.0  0.0  0.0  100.00  1.0
4      0.0   48.0  4.0  0.0  0.0  0.0   77.00  0.0
...    ...    ...  ...  ...  ...  ...     ...  ...
36280  0.0  346.0  4.0  0.0  0.0  0.0  115.00  1.0
36281  0.0   34.0  4.0  0.0  0.0  0.0  107.55  1.0
36282  3.0   83.0  4.0  0.0  0.0  0.0  105.61  1.0
36283  0.0  121.0  3.0  0.0  0.0  0.0   96.90  1.0
36284  3.0   44.0  4.0  0.0  0.0  0.0  133.44  3.0

[36285 rows x 8 columns]
       0
0      1
1      1
2      0
3      0
4      0
...   ..
36280  0
36281  1
36282  1
36283  1
36284  1

[36285 rows x 1 columns]


Splitting Variable to Test and Train

In [7]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

Feature Scaling

In [8]:
st_x=StandardScaler()
x_train=st_x.fit_transform(x_train)
x_test=st_x.fit_transform(x_test)
print(x_train)

[[-0.50443941  0.95892444 -0.80924927 ... -0.08890487 -0.52128352
  -0.78650943]
 [-0.50443941 -0.73614609 -0.80924927 ... -0.08890487 -0.52128352
  -0.78650943]
 [-0.50443941  0.0185086   0.66076874 ... -0.08890487  0.31671783
   1.75320866]
 ...
 [ 1.64271351  1.04019495  0.66076874 ... -0.08890487  0.7671793
   0.48334961]
 [-0.50443941 -0.93351731 -0.80924927 ... -0.08890487 -1.09901824
  -0.78650943]
 [ 1.64271351 -0.49233457  0.66076874 ... -0.08890487  0.17914832
   0.48334961]]


### SVM Model

In [9]:
from sklearn.svm import SVC

Model Fitting

In [10]:
classifier=SVC(kernel='linear',random_state=0)
classifier.fit(x_train,y_train)

Model Prediction

In [11]:
y_pred=classifier.predict(x_test)
print(y_pred)
print(y_test)

[0 1 1 ... 1 1 1]
       0
25951  0
25828  1
36016  1
25304  0
8085   1
...   ..
15543  1
29200  0
21027  0
35141  1
29945  0

[7257 rows x 1 columns]


Model Evaluation

In [12]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.20711037618850764
mean squared error: 0.20711037618850764
root mean squared error: 0.4550938103166287


Prediction Accuracy

In [13]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

79.28896238114923 %


### KNN Model

In [14]:
from sklearn.neighbors import KNeighborsClassifier

Model Fitting

In [15]:
classifier=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(x_train,y_train)

Model Prediction

In [16]:
y_pred=classifier.predict(x_test)
print(y_pred)
print(y_test)

[0 1 1 ... 1 1 1]
       0
25951  0
25828  1
36016  1
25304  0
8085   1
...   ..
15543  1
29200  0
21027  0
35141  1
29945  0

[7257 rows x 1 columns]


Model Evaluation

In [17]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.15957007027697395
mean squared error: 0.15957007027697395
root mean squared error: 0.3994622263455882


Prediction Accuracy

In [18]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

84.0429929723026 %


### Logistic Regression Model

In [19]:
from sklearn.linear_model import LogisticRegression

Model Fitting

In [20]:
model=LogisticRegression(max_iter=1000)
model.fit(x_test,y_test)

Model Prediction

In [21]:
y_pred=model.predict(x_test)
print(y_test)
print(y_pred)

       0
25951  0
25828  1
36016  1
25304  0
8085   1
...   ..
15543  1
29200  0
21027  0
35141  1
29945  0

[7257 rows x 1 columns]
[0 1 1 ... 1 1 1]


Model Evaluation

In [22]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.20132286068623398
mean squared error: 0.20132286068623398
root mean squared error: 0.44869016112038157


Prediction Accuracy

In [23]:
score=metrics.accuracy_score(y_test,y_pred)
print(score*100,"%")

79.8677139313766 %


### Decision Tree Model

In [24]:
from sklearn.tree import DecisionTreeClassifier

Model Fitting

In [25]:
classifier=DecisionTreeClassifier(criterion='entropy',random_state=0)
classifier.fit(x_train,y_train)

Model Prediction

In [26]:
y_pred=classifier.predict(x_test)
print(y_pred)
print(y_test)

[0 1 0 ... 1 0 1]
       0
25951  0
25828  1
36016  1
25304  0
8085   1
...   ..
15543  1
29200  0
21027  0
35141  1
29945  0

[7257 rows x 1 columns]


Model Evaluation

In [27]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.2062835882596114
mean squared error: 0.2062835882596114
root mean squared error: 0.45418453106596596


Prediction Accuracy

In [28]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

79.37164117403886 %


### Random Forest Model

In [29]:
from sklearn.ensemble import RandomForestClassifier

Model Fitting

In [30]:
classifier=RandomForestClassifier(criterion='entropy',n_estimators=10,random_state=0)
classifier.fit(x_train,y_train)

Model Prediction

In [31]:
y_pred=classifier.predict(x_test)
print(y_test)
print(y_pred)

       0
25951  0
25828  1
36016  1
25304  0
8085   1
...   ..
15543  1
29200  0
21027  0
35141  1
29945  0

[7257 rows x 1 columns]
[0 1 1 ... 1 0 1]


Model Evaluation

In [32]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))


mean absolute error: 0.16756235358963759
mean squared error: 0.16756235358963759
root mean squared error: 0.409343808539518


Prediction Accuracy

In [33]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

83.24376464103625 %


Conclusion:
In this hotel booking cancellation prediction project, various machine learning models, including SVM, KNN, Logistic Regression, Random Forest, and Decision Tree, were implemented to predict whether a booking would be canceled. Among all the models, the K-Nearest Neighbors (KNN) algorithm demonstrated the highest accuracy of 84.04%, indicating its superior performance in capturing the patterns and trends in the booking data.

The success of the KNN model suggests that customer booking behavior and associated features have clear relationships that can be effectively leveraged using distance-based classification. The model's strong accuracy could aid hotel management in anticipating booking cancellations, enabling better inventory management, optimized resource allocation, and improved overall business strategies. Future enhancements could involve feature engineering, hyperparameter tuning, and testing with additional advanced models to further refine predictive performance.