<h1 style="text-align: center;">[Capstone Project Module 3: Hotel Booking Demand Classification Model ]</h1>
<h3 style="text-align: center;">[Alief Dharmawan]</h3>

---

## **Section 1. Business Understanding**

**1.1 Context**

"Context
This data set contains booking information for a hotel located in Portugal, and includes information regarding room reservation for respective customers.
All personally identifying information has been removed from the data."

Features

country: Country of origin.

market_segment: Market segment designation. 

previous_cancellations: Number of previous bookings that were cancelled by the customer prior to the current booking.

booking_changes: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation.

deposit_type: Indication of whether the customer made a deposit to guarantee the booking.

days_in_waiting_list: Number of days the booking was in the waiting list before it was confirmed to the customer.

customer_type: Type of booking.

reserved_room_type: Code of room type reserved. Code is presented instead of designation for anonymity reasons.

required_car_parking_space: Number of car parking spaces required by the customer.

total_of_special_requests: Number of special requests made by the customer (e.g. twin bed or high floor).

is_canceled: Value indicating if the booking was canceled (1) or not (0).



**1.2 Problem Statements**

Common Modern Hotel Problems: 

In the hotel industry, an OTA, or Online Travel Agency, is a third-party website or platform that allows travelers to book accommodations, flights, car rentals, and other travel services online. These platforms act as intermediaries, connecting hotels with potential guests and offering a convenient way to search, compare, and book travel arrangements. 

Digital Convenience:

Guests want easy online booking, mobile check-in/check-out, and access to information via apps or websites.

Relevant columns: booking changes, deposit type, customer type

Competition:

The hospitality industry is highly competitive, requiring hotels to differentiate themselves and manage pricing effectively.

Relevant columns: previous cancellations, market segment, booking changes, days in waiting list, customer type

Lack of Guest Information:

Incomplete or inaccurate guest information from OTAs can lead to service issues.

Relevant columns: required car parking space, total of special requests, reserved room type, days in waiting list

Personalization:

Guests increasingly expect personalized experiences, including tailored recommendations, customized room preferences, and seamless digital interactions.

Relevant columns: total of special requests, required car parking space, reserved room type



We want to use all of these columns to predict possible hotel booking cancellations.


**1.3 Goals**

**1.4 Analytical Approach**

**1.5 Metric Evaluation (Business Metric, Machine Learning Evaluation Metric)**

**1.6 Success Criteria**

## **Section 2. Data Understanding**

**2.1 General Information**

# Library Imports

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

In [8]:
df = pd.read_csv(r"C:\Users\Alief\Downloads\Purwadhika Data Science Program\Module 3 Machine Learning\Capstone Project Module 3\data_hotel_booking_demand.csv")

In [10]:
# Select features and target
features = [
    'market_segment', 'deposit_type', 'customer_type', 'reserved_room_type',
    'previous_cancellations', 'booking_changes',
    'required_car_parking_spaces', 'total_of_special_requests'
]
target = 'is_canceled'

In [12]:
# Separate categorical and numerical features
cat_features = ['market_segment', 'deposit_type', 'customer_type', 'reserved_room_type']
num_features = ['previous_cancellations', 'booking_changes',
                'required_car_parking_spaces', 'total_of_special_requests']

In [13]:
# New minimal feature set (mix of 1 categorical + 2 numeric)
minimal_features = ['deposit_type', 'previous_cancellations', 'total_of_special_requests']

In [14]:
X_small = df[minimal_features]
y_small = df['is_canceled']

In [15]:
# Split the data into training and testing sets
X_train_small, X_test_small, y_train_small, y_test_small = train_test_split(
    X_small, y_small, test_size=0.2, random_state=42
)
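
The split above is purely random, so the share of cancellations in the test set can drift from the share in the full data. scikit-learn's `train_test_split` accepts a `stratify` argument to keep class shares aligned. The pure-Python helper below is only a hypothetical illustration of that idea on toy labels, not the library's implementation; `stratified_split` and the toy label counts are assumptions made for this sketch.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (illustrative only): 60 not canceled, 40 canceled.
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)

# Share of canceled bookings in the test portion.
test_canceled_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_canceled_share)
```

With the actual data, the equivalent would likely be passing `stratify=y_small` to the existing `train_test_split` call.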

In [16]:
# Preprocessing for the small dataset
small_preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['deposit_type']),
    ('num', SimpleImputer(strategy='mean'), ['previous_cancellations', 'total_of_special_requests'])
])
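
To make the `handle_unknown='ignore'` setting concrete, here is a minimal pure-Python sketch of one-hot encoding in which a category not seen during fitting becomes an all-zero vector instead of raising an error. The helper names and the category values are assumptions for illustration; they are not read from the CSV and this is not scikit-learn's implementation.

```python
def fit_categories(values):
    """Return the sorted unique categories seen during fitting."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value as a 0/1 vector; unseen values map to all zeros."""
    return [1.0 if value == cat else 0.0 for cat in categories]


# Illustrative training values for deposit_type (assumed, not taken from the data).
train_values = ["No Deposit", "Non Refund", "No Deposit", "Refundable"]
categories = fit_categories(train_values)

encoded_known = one_hot("Non Refund", categories)
encoded_unknown = one_hot("Partial Deposit", categories)  # hypothetical unseen value

print(categories)
print(encoded_known)
print(encoded_unknown)
```

The all-zero row means an unseen category contributes nothing through the one-hot columns, so the prediction falls back on the intercept and the numeric features.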

In [17]:
# Lighter pipeline
small_pipeline = Pipeline([
    ('preprocess', small_preprocessor),
    ('clf', LogisticRegression(max_iter=1000))
])

In [18]:
# Fit model
small_pipeline.fit(X_train_small, y_train_small)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['deposit_type']),
                                                 ('num', SimpleImputer(),
                                                  ['previous_cancellations',
                                                   'total_of_special_requests'])])),
                ('clf', LogisticRegression(max_iter=1000))])

In [19]:
# Predict probabilities and classes
y_probs_small = small_pipeline.predict_proba(X_test_small)
y_pred_small = small_pipeline.predict(X_test_small)

In [20]:
# Confusion matrix and classification report
cm_small = confusion_matrix(y_test_small, y_pred_small)
report_small = classification_report(y_test_small, y_pred_small, output_dict=True)

In [21]:
# Preview of sigmoid outputs
probs_small_preview = y_probs_small[:5]

(cm_small, report_small, probs_small_preview)

(array([[10534,    67],
        [ 3850,  2264]], dtype=int64),
 {'0': {'precision': 0.73234149054505,
   'recall': 0.9936798415243845,
   'f1-score': 0.8432259355613368,
   'support': 10601.0},
  '1': {'precision': 0.9712569712569713,
   'recall': 0.3702976774615636,
   'f1-score': 0.5361752516281824,
   'support': 6114.0},
  'accuracy': 0.7656595871971283,
  'macro avg': {'precision': 0.8517992309010107,
   'recall': 0.6819887594929741,
   'f1-score': 0.6897005935947595,
   'support': 16715.0},
  'weighted avg': {'precision': 0.8197318135526891,
   'recall': 0.7656595871971283,
   'f1-score': 0.7309131696883302,
   'support': 16715.0}},
 array([[0.67756454, 0.32243546],
        [0.67756454, 0.32243546],
        [0.75515787, 0.24484213],
        [0.75515787, 0.24484213],
        [0.81906484, 0.18093516]]))

In [22]:
# Convert classification report dict to a DataFrame
report_df = pd.DataFrame(report_small).transpose()

# Round for readability
report_df = report_df.round(3)

# Display only useful metrics (optional)
report_df = report_df[['precision', 'recall', 'f1-score', 'support']]

report_df.head()


| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.732 | 0.994 | 0.843 | 10601.0 |
| 1 | 0.971 | 0.37 | 0.536 | 6114.0 |
| accuracy | 0.766 | 0.766 | 0.766 | 0.766 |
| macro avg | 0.852 | 0.682 | 0.69 | 16715.0 |
| weighted avg | 0.82 | 0.766 | 0.731 | 16715.0 |


In [23]:
# Turn confusion matrix into labeled DataFrame
cm_df = pd.DataFrame(
    cm_small,
    index=['Actual Not Canceled', 'Actual Canceled'],
    columns=['Predicted Not Canceled', 'Predicted Canceled']
)

cm_df

| | Predicted Not Canceled | Predicted Canceled |
|---|---|---|
| Actual Not Canceled | 10534 | 67 |
| Actual Canceled | 3850 | 2264 |
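
As a cross-check, the class-1 (canceled) metrics in the classification report can be recomputed by hand from the four confusion-matrix counts above. The sketch also computes a majority-class baseline (always predicting "not canceled") for comparison; the variable names are chosen for this example only.

```python
# Counts copied from the confusion matrix above.
tn, fp = 10534, 67    # actual not canceled: predicted not canceled / canceled
fn, tp = 3850, 2264   # actual canceled: predicted not canceled / canceled

total = tn + fp + fn + tp

# Metrics for the "canceled" class (label 1).
precision_canceled = tp / (tp + fp)
recall_canceled = tp / (tp + fn)
f1_canceled = (2 * precision_canceled * recall_canceled
               / (precision_canceled + recall_canceled))

accuracy = (tp + tn) / total

# Accuracy of a trivial model that always predicts "not canceled".
baseline_accuracy = (tn + fp) / total

print(round(precision_canceled, 3), round(recall_canceled, 3),
      round(f1_canceled, 3), round(accuracy, 3), round(baseline_accuracy, 3))
```

The model is rarely wrong when it flags a cancellation (67 false positives), but it misses 3850 of the 6114 actual cancellations, which is why recall for class 1 is low even though accuracy is above the majority-class baseline.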


In [24]:
# Display probabilities as DataFrame with labeled columns
probs_df = pd.DataFrame(probs_small_preview, columns=['Prob_Not_Canceled', 'Prob_Canceled'])
probs_df = probs_df.round(3)
probs_df

| | Prob_Not_Canceled | Prob_Canceled |
|---|---|---|
| 0 | 0.678 | 0.322 |
| 1 | 0.678 | 0.322 |
| 2 | 0.755 | 0.245 |
| 3 | 0.755 | 0.245 |
| 4 | 0.819 | 0.181 |
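
In binary logistic regression, these probabilities come from applying the sigmoid function to a linear score, and the default class prediction corresponds to a 0.5 cutoff on `Prob_Canceled`. The sketch below shows the sigmoid itself and how a lower cutoff would relabel the five preview rows. The 0.3 threshold and the `classify` helper are illustrative assumptions, not tuned values or part of the notebook.

```python
import math


def sigmoid(z):
    """Map a real-valued score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(prob_canceled, threshold=0.5):
    """Return 1 (canceled) when the probability reaches the threshold."""
    return int(prob_canceled >= threshold)


# A score of zero sits exactly on the default decision boundary.
boundary_prob = sigmoid(0.0)

# Prob_Canceled values from the preview above.
preview_probs = [0.322, 0.322, 0.245, 0.245, 0.181]

default_labels = [classify(p) for p in preview_probs]
lowered_labels = [classify(p, threshold=0.3) for p in preview_probs]  # 0.3 is illustrative only

print(boundary_prob, default_labels, lowered_labels)
```

Lowering the cutoff trades precision for recall on the canceled class, so any real threshold should be chosen against the business cost of missed cancellations versus false alarms.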


**2.2 Feature Information**

| Feature | Description | Impact to Business |
|---------| ----------- | ------------------ |

**2.3 Statistics Summary**

## **Section 3. Data Cleaning**

**3.1 Missing Values**

**3.2 Duplicated Values**

**3.3 Identify Spelling Errors**

**3.4 Identify Anomaly Values**
- Check Distribution (Numerical Variable)
- Check Cardinality (Categorical Variable)

## **Section 4. Data Generation**

**4.1 Constructing `Seen` and `Unseen` Data**

**4.2 Constructing `Training` and `Testing` Data (from `Seen` Dataset)**

## **Section 5. Exploratory Data Analysis (EDA)**

**5.1 Analysis 1**

**5.2 Analysis 2**

## **Section 6. Data Preparation**

**6.1 Initialization**
- Initialization function
- Define Feature and Target

**6.2 Data Transformation (Feature Engineering)**

**6.3 Overview**

## **Section 7. Model Development**

**7.1 Initialization**
- Initialization Function
- Create Custom Metrics
- Create a workflow of the experiment

**7.2 Developing the Model Pipeline**

**7.3 Model Benchmarking (Comparing model base performance)**

**7.4 Tune Model**

**7.5 Analyze Model**

- Evaluate the model on the testing data
- Residual Analysis
- Learning Curve Inspection

**7.6 Model Calibration (Classification Only)**

**7.7 Model Explanation and Interpretation**
- Feature Importance (Tree Based Model) / Coefficient Regression (Regression Based Model)
- SHAP Value identification
- Counterfactual Analysis

## **Section 8. Model Deployment**

## **Section 9. Model Implementation**

**9.1 How to implement the model?**


**9.2 What are the limitations of the model?**

**9.3 Business Calculation (Simulation using unseen data)**

## **Section 10. Conclusion and Recommendation**

**10.1 Conclusion**
- Conclusion (Model)
- Conclusion (Business)

**10.2 Recommendation**
- Recommendation (Model)
- Recommendation (Business)