## Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
# Add any other packages you would like to use here

## Dataset

* The label in the dataset is given as `is_canceled`.
* For a complete description of the dataset, visit: https://www.sciencedirect.com/science/article/pii/S2352340918315191

In [None]:
df = pd.read_csv('~/data/train/hotel_bookings.csv')
df.head()

## Helpful EDA

In [None]:
df.info()

In [None]:
df['reservation_status'].unique()

In [None]:
df['is_canceled'].mean()

In [None]:
df.shape

As instructed, I will keep this part simple.

At first glance, the dataset looks very clean, even surprisingly clean (benchmark dataset).

According to the assignment, this is a classification task. I will use tree-based models, which have proven to be the most robust tool in practice, and I can rely on the `sklearn` package, where the complete handling is already implemented.

In [None]:
df.drop_duplicates().shape[0]
# A lot of duplicates, which is questionable. I can't see any ID column that identifies records.
# For this purpose, and given the short time, I am going to use only unique rows.

In [None]:
df = df.drop_duplicates()

## Data
Since this is a classification task, I am most interested in how balanced the data is between canceled and non-canceled bookings, plus the cross-correlation between predictors. After that, I will look at how to impute missing values.
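
A minimal sketch of the two checks mentioned above (class balance and share of missing values), run on a small made-up frame rather than the real file. The column names mirror the dataset, but every value in `toy` is invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bookings data; all values are invented.
toy = pd.DataFrame({
    "is_canceled": [0, 0, 0, 1, 1, 0, 0, 1],
    "country": ["PRT", None, "GBR", "PRT", None, "FRA", "PRT", "PRT"],
    "agent": [9.0, np.nan, np.nan, 240.0, np.nan, 1.0, 9.0, np.nan],
})

# Share of each class in the label (class balance).
class_share = toy["is_canceled"].value_counts(normalize=True)

# Fraction of missing values per column, worst column first.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(class_share)
print(missing_share)
```

On the real data the same two lines can be applied to `df` directly.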

In [None]:
df.columns

In [None]:
object_columns = list(df.select_dtypes(include='object').columns)
numeric_columns = list(df.select_dtypes(exclude='object').columns)
print("Object Columns:", object_columns)
print("\nNumeric Columns:", numeric_columns)

In [None]:
import matplotlib.pyplot as plt
def plot_hist(df: pd.DataFrame, column: str) -> None:
    """Plot a bar chart of value frequencies for one categorical column."""
    value_counts = df[column].value_counts()
    plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
    value_counts.plot(kind='bar')
    plt.title(column)  # label the plot with the column it shows
    plt.ylabel('Frequency')
    plt.show()

In [None]:
for col in object_columns:
    print(col)
    plot_hist(df, col) 

The categorical data is imbalanced, for example in columns such as meal, country, and so on. A decision tree could help with that.

`distribution_channel` is a subset of `market_segment`, and this could introduce uncertainty for the model.

Similarly, `reserved_room_type` and `assigned_room_type` seem identical at first glance.

`reservation_status` is particularly interesting. I see `Canceled` there, which I would assume will be strongly correlated with a canceled response.

Some predictors also contain NaNs. In this case, I would impute them as unknown. It could harm the model if I were to use the most common value, as these unknowns may carry information.
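
To make the difference between the two imputation choices concrete, here is a small sketch on an invented `country` column. It only uses plain pandas `fillna`; the values are made up and the variable names are hypothetical.

```python
import pandas as pd

# Invented categorical column with two missing entries.
country = pd.Series(["PRT", "PRT", None, "GBR", None], name="country")

# Option 1: keep missingness visible as its own category.
filled_unknown = country.fillna("unknown")

# Option 2: fill with the most common value, which hides the missingness.
most_common = country.mode()[0]
filled_mode = country.fillna(most_common)

print(filled_unknown.value_counts())
print(filled_mode.value_counts())
```

With the first option the model can still learn from the fact that a value was missing; with the second, those rows become indistinguishable from genuine `PRT` rows.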

In [None]:
(df["reserved_room_type"] != df["assigned_room_type"]).sum()
# They differ in only about 15k cases. Let's make a new variable named reserved_assigned_diff
# to distinguish between them, or, for the sake of simplicity, not use it at all.
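
A possible way to build the flag described in the comment above, sketched on invented rows. The column name `room_type_changed` is a hypothetical choice, not something that exists in the dataset.

```python
import pandas as pd

# Invented rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "reserved_room_type": ["A", "A", "D", "A", "E"],
    "assigned_room_type": ["A", "D", "D", "C", "E"],
    "is_canceled": [1, 0, 1, 0, 0],
})

# 1 when the guest was assigned a different room type than reserved, else 0.
toy["room_type_changed"] = (
    toy["reserved_room_type"] != toy["assigned_room_type"]
).astype(int)

# Cancellation rate with and without a room-type change.
cancel_rate = toy.groupby("room_type_changed")["is_canceled"].mean()
print(cancel_rate)
```

Comparing the two rates is a quick check of whether the flag carries any signal before deciding to keep it.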

In [None]:
sub_df = df[df["reserved_room_type"] != df["assigned_room_type"]]
sub_df[sub_df["is_canceled"] == 1][["reserved_room_type", "assigned_room_type"]]

In [None]:
df[df["reservation_status"] == "Canceled"]["is_canceled"].unique()
# this is going to be the strongest predictor ever!
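
A cross-tabulation shows the same relationship for all statuses at once. The sketch below uses invented rows, and the status values `Check-Out` and `No-Show` are assumptions about what the column contains.

```python
import pandas as pd

# Invented rows; status values other than "Canceled" are assumed.
toy = pd.DataFrame({
    "reservation_status": ["Check-Out", "Canceled", "No-Show", "Check-Out", "Canceled"],
    "is_canceled": [0, 1, 1, 0, 1],
})

# Rows: status, columns: label. A status that maps to exactly one label
# value is a sign that the column encodes the outcome itself.
status_vs_label = pd.crosstab(toy["reservation_status"], toy["is_canceled"])
print(status_vs_label)

# True when every status appears with only one label value.
is_deterministic = bool(((status_vs_label > 0).sum(axis=1) == 1).all())
print(is_deterministic)
```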

In [None]:
sum(df["is_canceled"]) - df[df["reservation_status"] == "Canceled"].shape[0]
# Not sure what I should predict here. There are two ways:
# - use the full dataset and let the model learn from already canceled cases to predict
#   cancellation in general.
# - split the data and train the model only on non-canceled data, which will lead the
#   model to learn patterns after cancellation.
# A customer can cancel a reservation shortly after making it, and this behaviour will
# bias the model we need to develop.

In [None]:
numeric_df = df.select_dtypes(include=['number'])  # separate name, so the numeric_columns list is not overwritten
correlation_matrix = numeric_df.corr()
correlation_matrix

In [None]:
df["company"].unique()  # odd-looking company values... I will not use this column

In [None]:
df.describe()  # let's check for highly unusual observations

In [None]:
# As expected, the response is highly correlated with lead_time, previous_cancellations,
# booking_changes, required_car_parking_spaces and total_of_special_requests.
# The regressors above are also strongly correlated with each other, which is not good for trees.

# I am a bit nervous about the strong correlation between previous_bookings_not_canceled
# and is_repeated_guest. Both look like good predictors; maybe merge them somehow.

# required_car_parking_spaces is another "clear" predictor: a person books a hotel,
# forgets to check whether it has parking, then realizes it does not...

# A similar story holds for total_of_special_requests, booking_changes, etc.

# In general I do not need all columns for a cold start. I will cherry-pick some of them
# to avoid multicollinearity. Dropped predictors are mostly suspicious or illogical for
# this kind of modelling (like meal, arrival_date_year). As a next step I propose a deeper dive.

# I will also turn a blind eye to extreme values.

# Of course, if we want to predict the future, arrival_date_year is not a good feature
# to extrapolate from (based on experience).
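
One way to support the cherry-picking above is to list only the strongly correlated pairs instead of reading the full matrix. This is a sketch on synthetic data; the feature names and the 0.8 threshold are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data: feature_b is almost a copy of feature_a, feature_c is independent.
rng = np.random.default_rng(42)
n = 500
base = rng.normal(size=n)
toy = pd.DataFrame({
    "feature_a": base,
    "feature_b": base + rng.normal(scale=0.1, size=n),
    "feature_c": rng.normal(size=n),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once.
corr = toy.corr().abs()
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Pairs above the (assumed) threshold are candidates for dropping or merging.
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```

On the real data, `toy` would be replaced by the numeric part of `df`.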

In [None]:
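features = df.drop(columns=["is_canceled"])  # X is only defined later, in the Model section
categorical_features = list(features.select_dtypes(include='object').columns)
numerical_features = list(features.select_dtypes(exclude='object').columns)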

In [None]:
print(numerical_features)

In [None]:
df["stays_in_nights"] = df["stays_in_weekend_nights"] + df["stays_in_week_nights"]

In [None]:
categorical_features = ['hotel', 'arrival_date_month', 'market_segment', 'deposit_type', 'customer_type']
numerical_features = ['lead_time', 'stays_in_nights', 'adults', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'agent', 'required_car_parking_spaces', 'total_of_special_requests']

## Model
For the training part I am going to use sklearn pipelines; I have had good experience with them as a benchmark. As a next step I propose experimenting with a torch MLP.

I drop a lot of features that look suspicious or do not fit my instincts.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import xgboost as xgb

X = df.drop(columns=["is_canceled"])
y = df["is_canceled"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), 
           ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

xgb_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('xgb', xgb.XGBClassifier(objective="binary:logistic", random_state=42))
])

# Define the parameter grid to search
# (in production this grid search is not necessary)
param_grid = {
    'xgb__n_estimators': [50, 100, 200],  # Number of boosting rounds
    'xgb__learning_rate': [0.01, 0.2, 0.5],  # Step size shrinkage used to prevent overfitting
    'xgb__max_depth': [3, 5, 10],  # Maximum depth of a tree
}

grid_search = GridSearchCV(xgb_pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

# Make predictions
y_pred = grid_search.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
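
Because the classes are not evenly balanced, plain accuracy is easier to read next to a majority-class baseline and a confusion matrix. The sketch below uses invented labels and predictions, not the output of the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Invented ground truth and predictions (7 non-canceled, 3 canceled).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# Accuracy of always predicting the most frequent class.
baseline_accuracy = np.bincount(y_true).max() / len(y_true)

model_accuracy = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)

# Share of actual cancellations that were caught.
cancel_recall = recall_score(y_true, y_hat)

print(baseline_accuracy, model_accuracy, cancel_recall)
print(cm)
```

The same calls can be applied to `y_test` and `y_pred` from the cell above.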

In [None]:
# I added reservation_status and the result is not surprising: I got the best model ever, with
# 100% accuracy, but this is not what we want.
#Best Parameters: {'xgb__learning_rate': 0.01, 'xgb__max_depth': 3, 'xgb__n_estimators': 50}
#Best CV Score: 1.0
#Accuracy: 1.0
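
The effect can be reproduced on synthetic data. In this sketch `leak` is a stand-in for an encoded `reservation_status` (it is simply a copy of the label), and a plain `DecisionTreeClassifier` is swapped in for XGBoost to keep the example small. None of the numbers come from the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic label, one uninformative feature, and one leaking feature.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
noise = rng.normal(size=n)  # carries no information about y
leak = y.copy()             # stand-in for an encoded reservation_status


def holdout_accuracy(features: np.ndarray) -> float:
    """Fit a small tree on a train split and score it on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.25, random_state=0
    )
    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
    return model.score(X_te, y_te)


acc_with_leak = holdout_accuracy(np.column_stack([noise, leak]))
acc_without_leak = holdout_accuracy(noise.reshape(-1, 1))

print(acc_with_leak, acc_without_leak)
```

A perfect score that disappears once a single column is removed is a strong hint that the column encodes the outcome and should be excluded from the features.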