<h1 style="text-align: center;">[Capstone Project Module 3: Hotel Booking Demand Classification Model]</h1>
<h3 style="text-align: center;">[Alief Dharmawan]</h3>

---

## **Section 1. Business Understanding**

**1.1 Context**

"This data set contains booking information for a hotel located in Portugal, and includes information regarding room reservation for respective customers.
All personally identifying information has been removed from the data."

The data includes the following features:

country: Country of origin.

market_segment: Market segment designation. 

previous_cancellations: Number of previous bookings that were cancelled by the customer prior to the current booking.

booking_changes: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation.

deposit_type: Indicates whether the customer made a deposit to guarantee the booking.

days_in_waiting_list: Number of days the booking was in the waiting list before it was confirmed to the customer.

customer_type: Type of booking.

reserved_room_type: Code of room type reserved. Code is presented instead of designation for anonymity reasons.

required_car_parking_spaces: Number of car parking spaces required by the customer.

total_of_special_requests: Number of special requests made by the customer (e.g. twin bed or high floor).

is_canceled: Value indicating if the booking was canceled (1) or not (0).


The dataset contains booking information for a hotel in Portugal, including customer demographics, booking behaviors, and whether the reservation was canceled. Hotels often struggle with last-minute cancellations, which cause revenue loss and operational inefficiencies. By leveraging machine learning, the hotel can anticipate cancellations and take proactive measures such as overbooking or targeted customer outreach.

**1.2 Problem Statements**

In an era of growing technology adoption, hotels must keep pace to compete and stay in business. One channel every hotel needs to understand is the OTA (Online Travel Agency), a third-party website or platform that lets travelers book accommodations, flights, car rentals, and other travel services online. These platforms act as intermediaries, connecting hotels with potential guests and offering a convenient way to search, compare, and book travel arrangements. Relying on them comes with its own problems. Some common problems for modern hotels include:

Digital Convenience:
Guests want easy online booking, mobile check-in/check-out, and access to information via apps or websites. 
Relevant columns: `booking_changes`, `deposit_type`, `customer_type`

Competition:
The hospitality industry is highly competitive, requiring hotels to differentiate themselves and manage pricing effectively. 
Relevant columns: `previous_cancellations`, `market_segment`, `booking_changes`, `days_in_waiting_list`, `customer_type`

Lack of Guest Information:
Incomplete or inaccurate guest information from OTAs can lead to service issues. 
Relevant columns: `required_car_parking_spaces`, `total_of_special_requests`, `reserved_room_type`, `days_in_waiting_list`

Personalization:
Guests increasingly expect personalized experiences, including tailored recommendations, customized room preferences, and seamless digital interactions. 
Relevant columns: `total_of_special_requests`, `required_car_parking_spaces`, `reserved_room_type`

Cancellations are an important metric for hotels to track so that they can plan their strategy. The columns above are used to track and predict booking cancellations.

The hotel experiences a substantial number of last-minute booking cancellations, leading to:
- Lost revenue from unsold rooms.
- Operational inefficiencies in room allocation and staffing.
- Distorted demand forecasting and pricing decisions.

**Core Problem:**  
How can the hotel predict the likelihood of a booking being canceled early enough to take corrective actions?


**1.3 Goals**

- Build a predictive model to estimate the probability that a booking will be canceled.
- Enable the hotel to adjust operational and revenue strategies based on predicted risk levels.
- Provide actionable insights to reduce cancellation-related losses.

**1.4 Analytical Approach**

- **Problem Type:** Binary Classification (Canceled vs. Not Canceled).
- **Target Variable:** `is_canceled` (1 = canceled, 0 = not canceled).
- **Candidate Models:** Logistic Regression, Random Forest, Gradient Boosting (XGBoost, LightGBM).
- **Preprocessing Steps:**  
  - Handle missing values.  
  - Encode categorical features (`market_segment`, `deposit_type`, etc.).  
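  - Feature engineering (e.g., binary flags such as "has previous cancellations" or "requested parking"; booking season or lead-time categories would need date and lead-time columns, which are not among the features listed in Section 1.1).  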
  - Scale/normalize numerical variables if needed.
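
The preprocessing steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration rather than the project's final pipeline: the imputation strategies, the use of `StandardScaler`, and the choice of Logistic Regression as the estimator are assumptions, and the small data frame is made up only to show that the pipeline runs end to end.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column groups follow the feature list in Section 1.1.
cat_features = ["market_segment", "deposit_type", "customer_type", "reserved_room_type"]
num_features = ["previous_cancellations", "booking_changes",
                "required_car_parking_spaces", "total_of_special_requests"]

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Numerical columns: fill gaps with the median, then standardize (assumed choices).
numerical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer([
    ("cat", categorical_steps, cat_features),
    ("num", numerical_steps, num_features),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Made-up rows, not taken from the real dataset.
toy = pd.DataFrame({
    "market_segment": ["Online TA", "Direct", "Groups", "Online TA", "Direct", "Groups"],
    "deposit_type": ["No Deposit", "No Deposit", "Non Refund", "Non Refund", "No Deposit", "Non Refund"],
    "customer_type": ["Transient", "Transient", "Contract", "Transient", "Contract", "Transient"],
    "reserved_room_type": ["A", "D", "A", "A", "D", "A"],
    "previous_cancellations": [0, 0, 1, 2, 0, 1],
    "booking_changes": [0, 1, 0, 0, 2, 0],
    "required_car_parking_spaces": [0, 1, 0, 0, 1, 0],
    "total_of_special_requests": [1.0, 2.0, float("nan"), 0.0, 1.0, 0.0],
    "is_canceled": [0, 0, 1, 1, 0, 1],
})

X_toy = toy[cat_features + num_features]
model.fit(X_toy, toy["is_canceled"])

# Probability of the canceled class (label 1) for each booking.
cancel_probability = model.predict_proba(X_toy)[:, 1]
```

Keeping imputation and encoding inside the pipeline means they are fitted only on the training data, which avoids leaking information from the test set.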

**1.5 Metric Evaluation (Business Metric, Machine Learning Evaluation Metric)**

### Business Metric
- **Reduction in revenue loss from cancellations** - measured by comparing historical cancellation rates to projected rates after implementing the predictive model.
- **Operational efficiency gains** - e.g., reduced overstaffing or unused room inventory.

### Machine Learning Evaluation Metric
- **Primary Metric:** Recall on the "canceled" class - to maximize detection of likely cancellations.
- **Secondary Metrics:**  
  - F1-score - balance precision and recall.  
  - ROC AUC - assess overall classification performance.

Metric selection aligns with business priority: better to flag most high-risk bookings (higher recall) even at the cost of some false positives.
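
To make the trade-off concrete, the sketch below computes recall, precision, and F1 for the canceled class from plain label lists. The function name `cancellation_metrics` and the example labels are hypothetical and are not results from this project.

```python
def cancellation_metrics(y_true, y_pred):
    """Return recall, precision and F1 for the canceled class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # canceled, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # kept, but flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # canceled, missed

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Made-up labels: 4 real cancellations, 3 of them caught, 2 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
metrics = cancellation_metrics(y_true, y_pred)
```

In this example the model catches three of four cancellations (recall 0.75) while two of its five flags are false alarms (precision 0.60), which is the kind of trade-off the recall-first priority accepts.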

**1.6 Success Criteria**

### Business Success
- At least **X%** reduction in revenue loss from last-minute cancellations.  
- Improved room occupancy rates and staff scheduling efficiency.

### Machine Learning Success
- Recall ≥ **0.80** on test data.  
- F1-score ≥ **0.75** to maintain balanced performance.

## **Section 2. Data Understanding**

**2.1 General Information**

### Library Imports

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

In [8]:
df = pd.read_csv(r"C:\Users\Alief\Downloads\Purwadhika Data Science Program\Module 3 Machine Learning\Capstone Project Module 3\data_hotel_booking_demand.csv")

In [10]:
# Select features and target
features = [
    'market_segment', 'deposit_type', 'customer_type', 'reserved_room_type',
    'previous_cancellations', 'booking_changes',
    'required_car_parking_spaces', 'total_of_special_requests'
]
target = 'is_canceled'

In [12]:
# Separate categorical and numerical features
cat_features = ['market_segment', 'deposit_type', 'customer_type', 'reserved_room_type']
num_features = ['previous_cancellations', 'booking_changes',
                'required_car_parking_spaces', 'total_of_special_requests']
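
Before modeling, it can help to confirm that every selected column actually exists in the loaded frame, since a single misspelled name raises a `KeyError` later. The helper `missing_columns` below is a hypothetical illustration, and the toy frame deliberately uses two misspelled column names; the real CSV's columns are not shown here.

```python
import pandas as pd


def missing_columns(frame, columns):
    """Return the requested column names that are absent from the frame."""
    return [name for name in columns if name not in frame.columns]


selected = [
    "market_segment", "deposit_type", "customer_type", "reserved_room_type",
    "previous_cancellations", "booking_changes",
    "required_car_parking_spaces", "total_of_special_requests",
    "is_canceled",
]

# Empty toy frame whose last two feature names are intentionally singular.
toy = pd.DataFrame(columns=[
    "market_segment", "deposit_type", "customer_type", "reserved_room_type",
    "previous_cancellations", "booking_changes",
    "required_car_parking_space", "total_of_special_request",
    "is_canceled",
])

absent = missing_columns(toy, selected)
```

On the real data the same call would be `missing_columns(df, features + [target])`, and an empty list means the selection is safe to use.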

**2.2 Feature Information**

| Feature | Description | Impact to Business |
|---------| ----------- | ------------------ |

**2.3 Statistics Summary**

## **Section 3. Data Cleaning**

**3.1 Missing Values**

**3.2 Duplicated Values**

**3.3 Identify Spelling Errors**

**3.4 Identify Anomaly Values**
- Check Distribution (Numerical Variable)
- Check Cardinality (Categorical Variable)
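
The checks named in Sections 3.1 to 3.4 can be sketched with a few pandas calls. This is only an outline on a made-up frame; which columns to inspect and how to treat the findings are left to the actual analysis.

```python
import pandas as pd

# Made-up rows, not taken from the real dataset.
toy = pd.DataFrame({
    "market_segment": ["Online TA", "Online TA", "Direct", None, "Online TA"],
    "booking_changes": [0, 0, 1, 2, 0],
})

# 3.1 Missing values: count of missing entries per column.
missing_per_column = toy.isna().sum()

# 3.2 Duplicated values: rows identical to an earlier row.
duplicate_rows = int(toy.duplicated().sum())

# 3.4 Cardinality of a categorical column (missing values are not counted).
segment_cardinality = toy["market_segment"].nunique()

# 3.4 Distribution of a numerical column.
changes_summary = toy["booking_changes"].describe()
```

Listing `toy["market_segment"].value_counts()` is also a simple way to spot spelling variants of the same category (Section 3.3).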

## **Section 4. Data Generation**

**4.1 Constructing `Seen` and `Unseen` Data**

**4.2 Constructing `Training` and `Testing` Data (from `Seen` Dataset)**
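
One possible way to build the splits in 4.1 and 4.2 is two stratified calls to `train_test_split`: the first holds out an unseen set, the second divides the seen set into training and testing data. The 80/20 and 75/25 proportions, the `random_state`, and the placeholder rows below are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 100 row ids, 25% labeled as canceled.
rows = list(range(100))
labels = [1 if i % 4 == 0 else 0 for i in rows]

# 4.1 Seen vs. unseen: hold out 20% that the model never touches during development.
seen_X, unseen_X, seen_y, unseen_y = train_test_split(
    rows, labels, test_size=0.20, stratify=labels, random_state=42
)

# 4.2 Training vs. testing, drawn only from the seen data.
train_X, test_X, train_y, test_y = train_test_split(
    seen_X, seen_y, test_size=0.25, stratify=seen_y, random_state=42
)
```

Stratifying on the target keeps the cancellation rate similar across all three sets, which matters when recall on the canceled class is the primary metric.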

## **Section 5. Exploratory Data Analysis (EDA)**

**5.1 Analysis 1**

**5.2 Analysis 2**

## **Section 6. Data Preparation**

**6.1 Initialization**
- Initialization function
- Define Feature and Target

**6.2 Data Transformation (Feature Engineering)**

**6.3 Overview**

## **Section 7. Model Development**

**7.1 Initialization**
- Initialization Function
- Create Custom Metrics
- Create a workflow of the experiment

**7.2 Developing the Model Pipeline**

**7.3 Model Benchmarking (Comparing model base performance)**

**7.4 Tune Model**

**7.5 Analyze Model**

- Evaluate model on testing data
- Residual Analysis
- Learning Curve Inspection

**7.6 Model Calibration (Classification Only)**

**7.7 Model Explanation and Interpretation**
- Feature Importance (Tree Based Model) / Coefficient Regression (Regression Based Model)
- SHAP Value identification
- Counterfactual Analysis

## **Section 8. Model Deployment**

## **Section 9. Model Implementation**

**9.1 How to implement the model?**


**9.2 What are the limitations of the model?**

**9.3 Business Calculation (Simulation using unseen data)**

## **Section 10. Conclusion and Recommendation**

**10.1 Conclusion**
- Conclusion (Model)
- Conclusion (Business)

**10.2 Recommendation**
- Recommendation (Model)
- Recommendation (Business)