
# Hotel Reservations Classification

## Business Understanding

### Summary

**Goal**: The primary objective is to **predict hotel booking cancellations** to mitigate revenue loss due to unoccupied rooms.

**Data**: Our dataset contains detailed information about hotel bookings, including customer specifics and whether the booking was cancelled or not.

**Model**: We aim to develop a **binary classification model** that can predict future booking cancellations accurately.

**Business Value**: This model can enable hotel management to identify potential cancellations early. The benefits include improved room occupancy rates, optimized revenue, and enhanced customer satisfaction.

### Success Metrics

The success of our model will be evaluated based on its predictive performance. In particular, we will use the following metrics:

**Accuracy:** The ratio of correctly predicted observations to total observations. It is the most intuitive measure, but it can be misleading when the classes are imbalanced.

**Precision:** The ratio of true positives to all predicted positives. It measures a classifier's exactness; low precision indicates a high number of false positives.

**Recall (Sensitivity):** The ratio of true positives to all observations that actually belong to the positive class. It measures a classifier's completeness; low recall indicates a high number of false negatives.

**F1 Score:** The harmonic mean of precision and recall. It balances the two and is more informative than accuracy when the class distribution is uneven.

*The exact importance of these metrics will depend on the business context. For instance, if the cost of falsely predicting that a booking will not be cancelled (a false negative) is high (e.g., due to lost revenue from not being able to fill the room), then we might want to prioritize Recall. Alternatively, if the cost of falsely predicting that a booking will be cancelled (a false positive) is high (e.g., due to lost customer goodwill from overbooking), then we might want to prioritize Precision.*
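
To make the four definitions concrete, the sketch below computes them from raw confusion-matrix counts. The function name `classification_metrics` and the counts passed to it are illustrative assumptions, not values from this dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never occurs
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts chosen only for illustration
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision (0.80) is lower than recall (about 0.89), and F1 sits between the two because the harmonic mean is pulled toward the smaller value.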

## Data Understanding

### Load necessary libraries

In [1]:
# Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.exceptions import ConvergenceWarning
import matplotlib.pyplot as plt
from scipy.stats import uniform, randint
from sklearn.model_selection import GridSearchCV
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning)

### Load dataset

In [2]:
df = pd.read_csv("/kaggle/input/hotel-reservations-classification-dataset/Hotel Reservations.csv")

### Explore data

In [3]:
df.head(10)  # Display the first few rows to understand the data

| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.0 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.0 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.0 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.5 | 0 | Canceled |
| 5 | INN00006 | 2 | 0 | 0 | 2 | Meal Plan 2 | 0 | Room_Type 1 | 346 | 2018 | 9 | 13 | Online | 0 | 0 | 0 | 115.0 | 1 | Canceled |
| 6 | INN00007 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 34 | 2017 | 10 | 15 | Online | 0 | 0 | 0 | 107.55 | 1 | Not_Canceled |
| 7 | INN00008 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 4 | 83 | 2018 | 12 | 26 | Online | 0 | 0 | 0 | 105.61 | 1 | Not_Canceled |
| 8 | INN00009 | 3 | 0 | 0 | 4 | Meal Plan 1 | 0 | Room_Type 1 | 121 | 2018 | 7 | 6 | Offline | 0 | 0 | 0 | 96.9 | 1 | Not_Canceled |
| 9 | INN00010 | 2 | 0 | 0 | 5 | Meal Plan 1 | 0 | Room_Type 4 | 44 | 2018 | 10 | 18 | Online | 0 | 0 | 0 | 133.44 | 3 | Not_Canceled |


In [4]:
# List column names by position; starting at 1 skips column 0 (Booking_ID)
for i in range(1, len(df.columns)):
    print(i, df.columns[i])

1 no_of_adults
2 no_of_children
3 no_of_weekend_nights
4 no_of_week_nights
5 type_of_meal_plan
6 required_car_parking_space
7 room_type_reserved
8 lead_time
9 arrival_year
10 arrival_month
11 arrival_date
12 market_segment_type
13 repeated_guest
14 no_of_previous_cancellations
15 no_of_previous_bookings_not_canceled
16 avg_price_per_room
17 no_of_special_requests
18 booking_status


In [5]:
df.describe()  # Show summary statistics to understand the data

| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 36275.0 | 36275.0 | 36275.0 | 36275.0 | 36275.0 | 36275.0 | 36275.0 | 36275.0 | 36275.0 | 36275.0 | 36275.0 | 36275.0 | 36275.0 | 36275.0 |
| mean | 1.844962 | 0.105279 | 0.810724 | 2.2043 | 0.030986 | 85.232557 | 2017.820427 | 7.423653 | 15.596995 | 0.025637 | 0.023349 | 0.153411 | 103.423539 | 0.619655 |
| std | 0.518715 | 0.402648 | 0.870644 | 1.410905 | 0.173281 | 85.930817 | 0.383836 | 3.069894 | 8.740447 | 0.158053 | 0.368331 | 1.754171 | 35.089424 | 0.786236 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2017.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 17.0 | 2018.0 | 5.0 | 8.0 | 0.0 | 0.0 | 0.0 | 80.3 | 0.0 |
| 50% | 2.0 | 0.0 | 1.0 | 2.0 | 0.0 | 57.0 | 2018.0 | 8.0 | 16.0 | 0.0 | 0.0 | 0.0 | 99.45 | 0.0 |
| 75% | 2.0 | 0.0 | 2.0 | 3.0 | 0.0 | 126.0 | 2018.0 | 10.0 | 23.0 | 0.0 | 0.0 | 0.0 | 120.0 | 1.0 |
| max | 4.0 | 10.0 | 7.0 | 17.0 | 1.0 | 443.0 | 2018.0 | 12.0 | 31.0 | 1.0 | 13.0 | 58.0 | 540.0 | 5.0 |


In [6]:
df.isnull().sum()  # Check for missing values to understand the data

Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

## Data Preparation

### One-hot encoding and Label Encoding

In [7]:
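categorical_vars = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']
# One-hot encode the nominal categorical features
df_encoded = pd.get_dummies(df, columns=categorical_vars)
# LabelEncoder assigns codes in sorted order: 'Canceled' -> 0, 'Not_Canceled' -> 1,
# so the positive class (1) in later precision/recall scores is 'Not_Canceled'
label_encoder = LabelEncoder()
df_encoded['booking_status'] = label_encoder.fit_transform(df_encoded['booking_status'])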
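
The effect of both encoders is easier to see on a tiny frame. The sketch below uses a toy DataFrame (an assumption for illustration; it reuses two column names and a few category values from the dataset) and prints the resulting columns and the label-to-code mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame using two column names and a few category values from the dataset
toy = pd.DataFrame({
    "market_segment_type": ["Offline", "Online", "Online"],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled"],
})

# One-hot encoding replaces the column with one indicator column per category
toy_encoded = pd.get_dummies(toy, columns=["market_segment_type"])
print(list(toy_encoded.columns))

# LabelEncoder sorts the class labels, so 'Canceled' becomes 0
encoder = LabelEncoder()
toy_encoded["booking_status"] = encoder.fit_transform(toy_encoded["booking_status"])
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Inspecting `encoder.classes_` like this is a quick way to confirm which class a metric such as `recall_score` treats as positive by default.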

### Split the data into features and target variable

In [8]:
X = df_encoded.drop(['Booking_ID', 'booking_status'], axis=1)
y = df_encoded['booking_status']

### Split the data into training and test sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
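
The split above is purely random. When the target classes are imbalanced, passing `stratify` keeps the class ratio the same in both subsets. The following is a minimal sketch on synthetic labels; the 70/30 class ratio and the `*_demo` names are assumptions for illustration, not properties of the hotel data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30% of them in class 1 (illustrative ratio only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both subsets keep the 30% share of class 1
print(y_tr.mean(), y_te.mean())
```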

### Feature Engineering

In [10]:
X_train_fe = X_train.copy()
X_test_fe = X_test.copy()

X_train_fe['total_nights'] = X_train_fe['no_of_weekend_nights'] + X_train_fe['no_of_week_nights']
X_train_fe['total_people'] = X_train_fe['no_of_adults'] + X_train_fe['no_of_children']

X_test_fe['total_nights'] = X_test_fe['no_of_weekend_nights'] + X_test_fe['no_of_week_nights']
X_test_fe['total_people'] = X_test_fe['no_of_adults'] + X_test_fe['no_of_children']
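
Because the same two features are added to both the training and the test frame, the logic could be wrapped in one helper so the two sets cannot drift apart. This is a sketch only; `add_stay_features` is a hypothetical name, and the sample rows are copied from the first and ninth bookings in the `df.head(10)` output.

```python
import pandas as pd


def add_stay_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with total_nights and total_people added."""
    out = frame.copy()
    out["total_nights"] = out["no_of_weekend_nights"] + out["no_of_week_nights"]
    out["total_people"] = out["no_of_adults"] + out["no_of_children"]
    return out


# Two rows taken from the df.head(10) output (INN00001 and INN00009)
sample = pd.DataFrame({
    "no_of_adults": [2, 3],
    "no_of_children": [0, 0],
    "no_of_weekend_nights": [1, 0],
    "no_of_week_nights": [2, 4],
})
sample_fe = add_stay_features(sample)
print(sample_fe[["total_nights", "total_people"]])
```

With such a helper, the cell above would reduce to `X_train_fe = add_stay_features(X_train)` and `X_test_fe = add_stay_features(X_test)`.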

## Modeling

### Logistic Regression Model

In [11]:
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)
# Note: the value printed here is accuracy on the training set, not on held-out data
logreg_train_accuracy = accuracy_score(y_train, logreg.predict(X_train))
print(f"Logistic regression model accuracy: {logreg_train_accuracy:.4f}")

Logistic regression model accuracy: 0.8017
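
Logistic regression is sensitive to feature scale, and the features here range from binary flags to `lead_time` values in the hundreds. One common option, not used in this notebook, is to standardize the features inside a pipeline. The sketch below runs on synthetic data from `make_classification` (an assumption; it is not the hotel dataset), so its score says nothing about the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the hotel data (assumption: not the real dataset)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)
X_syn[:, 0] *= 1000  # exaggerate one feature's scale, like lead_time vs. binary flags

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# The scaler is fit on the training fold only, then applied to the test fold
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)
syn_test_accuracy = scaled_logreg.score(X_te, y_te)
print(f"Test accuracy: {syn_test_accuracy:.4f}")
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which would happen if the whole frame were scaled before the split.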


### Random Forest Model

In [12]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
# Note: the value printed here is accuracy on the training set, not on held-out data
rf_train_accuracy = accuracy_score(y_train, rf.predict(X_train))
print(f"Random forest model accuracy: {rf_train_accuracy:.4f}")

Random forest model accuracy: 0.9940


### XGBoost Model

In [13]:
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)
# Note: the value printed here is accuracy on the training set, not on held-out data
xgb_train_accuracy = accuracy_score(y_train, xgb.predict(X_train))
print(f"XGBoost model accuracy: {xgb_train_accuracy:.4f}")

XGBoost model accuracy: 0.9181


### Hyperparameter Tuning for XGBoost

In [14]:
param_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 6],
    'n_estimators': [100, 200],
    'subsample': [0.5, 1],
    'colsample_bytree': [0.5, 1],
}

grid_search = GridSearchCV(XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
                           param_grid, cv=3, scoring='accuracy')

grid_search.fit(X_train_fe, y_train)

# Get the best parameters
best_params = grid_search.best_params_


In [15]:
print(best_params)

{'colsample_bytree': 1, 'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 200, 'subsample': 1}
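
A grid search trains one model per parameter combination per fold, so its cost grows quickly. The sketch below counts the fits implied by the grid above using scikit-learn's `ParameterGrid`; the variable names `n_candidates`, `n_folds`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
param_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 6],
    'n_estimators': [100, 200],
    'subsample': [0.5, 1],
    'colsample_bytree': [0.5, 1],
}

n_candidates = len(ParameterGrid(param_grid))  # 2**5 combinations
n_folds = 3                                    # cv=3 in the GridSearchCV call
n_fits = n_candidates * n_folds + 1            # +1 for the final refit (refit=True is the default)

print(n_candidates, n_fits)  # 32 97
```

Adding a third value to each of the five parameters would raise the candidate count from 32 to 243, which is why a randomized search is often considered for larger grids.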


### Train the XGBoost model with the best parameters

In [16]:
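xgb_best = XGBClassifier(**best_params, use_label_encoder=False, eval_metric='logloss', random_state=42)
xgb_best.fit(X_train_fe, y_train)
xgb_best_pred = xgb_best.predict(X_test_fe)
# Caution: this line scores the untuned `xgb` model on X_train, so the printed value
# repeats the untuned training accuracy; scoring the tuned model would instead use
# accuracy_score(y_train, xgb_best.predict(X_train_fe))
xgb_train_accuracy = accuracy_score(y_train, xgb.predict(X_train))
print(f"XGBoost model accuracy: {xgb_train_accuracy:.4f}")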

XGBoost model accuracy: 0.9181


### Overfitting Check

In [17]:
# Calculate test accuracy for each model
logreg_accuracy = accuracy_score(y_test, logreg.predict(X_test))
rf_accuracy = accuracy_score(y_test, rf.predict(X_test))
xgb_accuracy = accuracy_score(y_test, xgb.predict(X_test))

# Calculate training accuracy for each model
logreg_train_accuracy = accuracy_score(y_train, logreg.predict(X_train))
rf_train_accuracy = accuracy_score(y_train, rf.predict(X_train))
xgb_train_accuracy = accuracy_score(y_train, xgb.predict(X_train))

# Create a dataframe to hold the results
overfitting_check = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost' ],
    'Training Accuracy': [logreg_train_accuracy, rf_train_accuracy, xgb_train_accuracy],
    'Test Accuracy': [logreg_accuracy, rf_accuracy, xgb_accuracy]
})

overfitting_check['Difference'] = overfitting_check['Training Accuracy'] - overfitting_check['Test Accuracy']

overfitting_check

| | Model | Training Accuracy | Test Accuracy | Difference |
|---|---|---|---|---|
| 0 | Logistic Regression | 0.801688 | 0.80193 | -0.000241 |
| 1 | Random Forest | 0.994004 | 0.905031 | 0.088973 |
| 2 | XGBoost | 0.918125 | 0.891937 | 0.026189 |


In summary:
- The **Random Forest model** has the highest accuracy on both the training data (0.994) and the test data (0.905), but it also shows the largest gap between the two (about 0.089), which points to **overfitting**.
- The **XGBoost model** scores lower on the training data (0.918) and slightly lower on the test data (0.892), but its gap is much smaller (about 0.026), indicating that it **generalizes more consistently** to new data.
- The **Logistic Regression model** performs almost identically on the training and test data (about 0.802 on both), indicating **good generalization**, but its overall accuracy is the lowest of the three.



## Evaluation

### Confusion Matrix

In [18]:
# Compute the confusion matrix for the tuned XGBoost model (xgb_best)
# Rows are actual classes and columns are predicted classes; 0 = Canceled, 1 = Not_Canceled
conf_matrix = confusion_matrix(y_test, xgb_best_pred)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[1907  509]
 [ 306 4533]]
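
A bare 2x2 array is easy to misread. As a sketch, the matrix can be wrapped in a labeled DataFrame; the counts are copied from the output above, and the row and column names are illustrative choices based on LabelEncoder's sorted class order (0 = Canceled, 1 = Not_Canceled).

```python
import numpy as np
import pandas as pd

# Values copied from the confusion matrix printed in this section
conf_matrix = np.array([[1907, 509],
                        [306, 4533]])

# LabelEncoder sorted the classes, so index 0 is Canceled and index 1 is Not_Canceled
labels = ["Canceled", "Not_Canceled"]
labeled = pd.DataFrame(
    conf_matrix,
    index=[f"actual_{name}" for name in labels],
    columns=[f"predicted_{name}" for name in labels],
)
print(labeled)

# Cancellations the model missed (actual Canceled, predicted Not_Canceled)
missed_cancellations = int(labeled.loc["actual_Canceled", "predicted_Not_Canceled"])
print(missed_cancellations)  # 509
```

Read this way, 509 of the 2,416 bookings that were actually canceled were predicted as not canceled, which is the error type the business goal is most concerned with.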


### Metrics

In [19]:
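# Compute accuracy, precision, recall, and F1 score for the tuned XGBoost model
# With the default pos_label=1, precision, recall, and F1 describe the Not_Canceled class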
accuracy = accuracy_score(y_test, xgb_best_pred)
precision = precision_score(y_test, xgb_best_pred)
recall = recall_score(y_test, xgb_best_pred)
f1 = f1_score(y_test, xgb_best_pred)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')

Accuracy: 0.8877
Precision: 0.8990
Recall: 0.9368
F1 Score: 0.9175
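
Since the stated goal is to catch cancellations, it can help to score the Canceled class (encoded as 0) as the positive class as well. The sketch below derives per-class scores directly from the confusion matrix reported in this section; `per_class_scores` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# Confusion matrix from the Evaluation section: rows = actual, columns = predicted,
# index 0 = Canceled, index 1 = Not_Canceled
cm = np.array([[1907, 509],
               [306, 4533]])


def per_class_scores(cm, positive):
    """Precision, recall, and F1 treating class index `positive` as the positive class."""
    tp = cm[positive, positive]
    fp = cm[:, positive].sum() - tp   # predicted positive, actually the other class
    fn = cm[positive, :].sum() - tp   # actually positive, predicted the other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for idx, name in enumerate(["Canceled", "Not_Canceled"]):
    p, r, f = per_class_scores(cm, idx)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

On these counts the Not_Canceled row reproduces the scores printed above, while recall for the Canceled class comes out at about 0.79. With scikit-learn, the same view should be available by passing `pos_label=0` to `precision_score`, `recall_score`, and `f1_score`.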
