<h1><center>Canceled bookings at a hotel</center></h1>


You have been assigned the task of building a model that predicts whether or not a hotel customer will cancel their booking. The data for this assignment is in the CSV file `hotel_clf.csv`.

<br> 
<div>
<img src="https://5.imimg.com/data5/PC/BL/MY-33192851/hotel-reservation-services-500x500.jpg" width="400"/>
</div>
<br> 
If the model predicts that a customer will cancel their booking, that customer will be sent a special deal to try to keep them from cancelling. If the prediction is correct (a True Positive), the expected gain is 1000 SEK. However, if the prediction is wrong (a False Positive), the expected loss is 500 SEK. 

Your goal is to build the most profitable model possible.

<hr style="border:1px solid pink"> </hr>

In [82]:
# Core imports
import pandas as pd
import numpy as np
import seaborn as sns
import xgboost as xgb

# Processing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

# Model Training 
from sklearn.model_selection import train_test_split, cross_val_score
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
from tensorflow.keras.models import load_model
from tensorflow.keras.layers import Dropout
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score, confusion_matrix

# Early stop to save time
from tensorflow.keras.callbacks import EarlyStopping

# Evaluate models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

## Q1 | Choose Metric

Reason about which metric you think will be best to optimize your model for.

- Recall?
- Precision?
- Accuracy?
- F1-score?

Make a decision about which metric you think will lead to the most profitable model

In [None]:
'''
Answer: I believe precision will lead to the most profitable model, as it measures
the proportion of true positives out of all predicted positives.
'''
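
One way to sanity-check this choice is to work through the payoffs from the task statement. The sketch below is illustrative only: the helper names `profit` and `break_even_precision` are assumptions, not part of the assignment. It suggests that sending an offer pays off on average once precision exceeds 1/3, and that total profit also grows with the number of cancellations caught, so precision on its own does not fix the profit.

```python
GAIN_TP = 1000  # SEK gained per true positive (from the task statement)
LOSS_FP = 500   # SEK lost per false positive (from the task statement)

def profit(tp, fp):
    """Total profit in SEK for the given true/false positive counts."""
    return tp * GAIN_TP - fp * LOSS_FP

def break_even_precision():
    """Precision p at which p * GAIN_TP - (1 - p) * LOSS_FP equals zero."""
    return LOSS_FP / (GAIN_TP + LOSS_FP)

# Worked example from the Q3 guidelines: 10 TP and 10 FP
print(profit(10, 10))           # 5000
print(break_even_precision())   # 0.333...

# Hypothetical counts: higher precision with fewer catches can still earn less
print(profit(tp=50, fp=5))      # precision ~0.91 -> 47500
print(profit(tp=200, fp=100))   # precision ~0.67 -> 150000
```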

## Q2 | Data preparation

- Prepare your data so that you end up with a clean and preprocessed train and test set
    
    
- Instructions for train test split:    
    - Test size = 0.2
    - Random state = 42

In [5]:
data = pd.read_csv('../data/hotel_clf.csv')

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   hotel               10000 non-null  object 
 1   is_canceled         10000 non-null  int64  
 2   lead_time           10000 non-null  int64  
 3   adults              10000 non-null  int64  
 4   children            10000 non-null  int64  
 5   market_segment      10000 non-null  object 
 6   country             10000 non-null  object 
 7   reserved_room_type  10000 non-null  object 
 8   booking_changes     10000 non-null  int64  
 9   adr                 10000 non-null  float64
 10  customer_type       10000 non-null  object 
dtypes: float64(1), int64(5), object(5)
memory usage: 859.5+ KB


In [17]:
data['hotel'].unique()

array(['City Hotel', 'Resort Hotel'], dtype=object)

In [30]:
# Encode hotel column
data = pd.get_dummies(data, columns=['hotel'], prefix='hotel', drop_first=True) # drop first = true to avoid dummy trap

In [19]:
data['market_segment'].unique()

array(['Online TA', 'Offline TA/TO', 'Groups', 'Corporate', 'Direct',
       'Complementary', 'Aviation'], dtype=object)

In [34]:
# Encode Market_segment column
data = pd.get_dummies(data, columns=['market_segment'], prefix='segment', drop_first=True)

In [21]:
data['country'].unique()

array(['ESP', 'FRA', 'PRT', 'DEU', 'DNK', 'GBR', 'AUT', 'USA', 'POL',
       'NLD', 'BRA', 'GRC', 'TUR', 'SWE', 'ITA', 'BEL', 'IND', 'CHE',
       'HRV', 'MYS', 'NOR', 'CHN', 'JPN', 'CN', 'ISR', 'LUX', 'IRL',
       'HUN', 'ROU', 'IRQ', 'AGO', 'NGA', 'IRN', 'CZE', 'AUS', 'FIN',
       'ARG', 'SGP', 'KOR', 'CYP', 'THA', 'PHL', 'LBN', 'TWN', 'SVN',
       'UKR', 'SRB', 'COL', 'BGR', 'NZL', 'CHL', 'KEN', 'LVA', 'MAR',
       'RUS', 'LTU', 'ALB', 'SAU', 'ARM', 'UZB', 'MEX', 'DZA', 'VEN',
       'IDN', 'BIH', 'ECU', 'ZAF', 'TUN', 'URY', 'AND', 'BGD', 'MOZ',
       'EGY', 'GEO', 'PAN', 'PAK', 'CPV', 'EST', 'ISL', 'PER', 'MUS',
       'GIB', 'TJK', 'CIV', 'GNB', 'AZE', 'ARE', 'GUY', 'GGY', 'SVK',
       'OMN', 'HKG', 'ZMB', 'BLR', 'CRI', 'MNE', 'BHR', 'MLT', 'MDV',
       'CUB', 'GAB', 'LIE', 'LAO', 'BRB'], dtype=object)

In [36]:
# Encode country based on region to reduce the number of unique values
# Create a dictionary for grouping countries into regions
country_to_region = {
    'ESP': 'Europe', 'FRA': 'Europe', 'PRT': 'Europe', 'DEU': 'Europe',
    'DNK': 'Europe', 'GBR': 'Europe', 'AUT': 'Europe', 'POL': 'Europe',
    'NLD': 'Europe', 'BEL': 'Europe', 'CHE': 'Europe', 'ITA': 'Europe',
    'SWE': 'Europe', 'FIN': 'Europe', 'NOR': 'Europe', 'IRL': 'Europe',
    'CZE': 'Europe', 'HUN': 'Europe', 'SVK': 'Europe', 'SVN': 'Europe',
    'LUX': 'Europe', 'LTU': 'Europe', 'LVA': 'Europe', 'EST': 'Europe',
    'ALB': 'Europe', 'BIH': 'Europe', 'HRV': 'Europe', 'MNE': 'Europe',
    'SRB': 'Europe', 'GRC': 'Europe', 'CYP': 'Europe', 'ISL': 'Europe',
    
    'USA': 'Americas', 'BRA': 'Americas', 'CAN': 'Americas', 'ARG': 'Americas',
    'MEX': 'Americas', 'COL': 'Americas', 'CHL': 'Americas', 'PER': 'Americas',
    'URY': 'Americas', 'ECU': 'Americas', 'PAN': 'Americas', 'VEN': 'Americas',
    'CRI': 'Americas', 'CUB': 'Americas', 'GUY': 'Americas',

    'IND': 'Asia', 'CHN': 'Asia', 'JPN': 'Asia', 'KOR': 'Asia', 'THA': 'Asia',
    'PHL': 'Asia', 'MYS': 'Asia', 'SGP': 'Asia', 'TWN': 'Asia', 'HKG': 'Asia',
    'ARE': 'Asia', 'IRQ': 'Asia', 'IRN': 'Asia', 'UZB': 'Asia', 'TJK': 'Asia',
    'PAK': 'Asia', 'BGD': 'Asia', 'MDV': 'Asia', 'KAZ': 'Asia', 'LKA': 'Asia',

    'ZAF': 'Africa', 'KEN': 'Africa', 'DZA': 'Africa', 'NGA': 'Africa', 
    'MAR': 'Africa', 'EGY': 'Africa', 'MOZ': 'Africa', 'GHA': 'Africa',
    'AGO': 'Africa', 'CIV': 'Africa', 'TUN': 'Africa', 'ZMB': 'Africa',

    'AUS': 'Oceania', 'NZL': 'Oceania', 'FJI': 'Oceania', 'WSM': 'Oceania'
}

# Map countries to regions
data['region'] = data['country'].map(country_to_region)

# One-hot encode the region column
data = pd.get_dummies(data, columns=['region'], prefix='region', drop_first=True)

In [54]:
data.drop(columns=['country'], inplace=True)
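
Some codes in the `country` column (for example `'TUR'` and `'CN'`) have no entry in `country_to_region`, so `map` returns `NaN` for them, and `get_dummies` then encodes those rows as all zeros, the same as the dropped baseline region. A minimal sketch of how this could be checked and handled, using a tiny stand-in DataFrame and an abbreviated mapping; the `'Other'` category is an assumption, not something the assignment prescribes.

```python
import pandas as pd

# Tiny stand-in for the real data; the mapping is deliberately abbreviated
df = pd.DataFrame({"country": ["ESP", "USA", "TUR", "CN", "ZAF"]})
country_to_region = {"ESP": "Europe", "USA": "Americas", "ZAF": "Africa"}

region = df["country"].map(country_to_region)

# Codes that the dictionary does not cover come back as NaN
unmapped = sorted(df.loc[region.isna(), "country"].unique())
print(unmapped)  # ['CN', 'TUR']

# Assumption: collect the leftovers in an explicit 'Other' category
df["region"] = region.fillna("Other")
dummies = pd.get_dummies(df, columns=["region"], prefix="region", drop_first=True)
print(dummies.columns.tolist())
```

Running the same `isna()` check on the full dataset before encoding would show how many bookings are affected.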

In [23]:
data['reserved_room_type'].unique()

array(['A', 'D', 'B', 'C', 'L', 'E', 'G', 'F', 'H'], dtype=object)

In [42]:
# Encode reserved_room_type
# One-hot encode the reserved_room_type column
data = pd.get_dummies(data, columns=['reserved_room_type'], prefix='room', drop_first=True)

In [25]:
data['customer_type'].unique()

array(['Transient', 'Transient-Party', 'Contract', 'Group'], dtype=object)

In [44]:
# Encode customer_type
# One-hot encode the customer_type column
data = pd.get_dummies(data, columns=['customer_type'], prefix='type', drop_first=True)

In [56]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 28 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   is_canceled            10000 non-null  int64  
 1   lead_time              10000 non-null  int64  
 2   adults                 10000 non-null  int64  
 3   children               10000 non-null  int64  
 4   booking_changes        10000 non-null  int64  
 5   adr                    10000 non-null  float64
 6   hotel_Resort Hotel     10000 non-null  bool   
 7   segment_Complementary  10000 non-null  bool   
 8   segment_Corporate      10000 non-null  bool   
 9   segment_Direct         10000 non-null  bool   
 10  segment_Groups         10000 non-null  bool   
 11  segment_Offline TA/TO  10000 non-null  bool   
 12  segment_Online TA      10000 non-null  bool   
 13  region_Americas        10000 non-null  bool   
 14  region_Asia            10000 non-null  bool   
 15  reg

In [59]:
# Now that the data is ready, I'll split it into train and test
# Separate Features and Target Variable
X = data.drop(columns=['is_canceled']) 
y = data['is_canceled']

# Split the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of training and test data
print("Training Features Shape:", X_train.shape)
print("Test Features Shape:", X_test.shape)
print("Training Labels Shape:", y_train.shape)
print("Test Labels Shape:", y_test.shape)

Training Features Shape: (8000, 27)
Test Features Shape: (2000, 27)
Training Labels Shape: (8000,)
Test Labels Shape: (2000,)


## Q3 | Build a LogReg Model

Guidelines:
- Use a LogisticRegression model
    - Random state = 42
- Use the metric you decided on in the previous question

- You are not allowed to change the model after looking at the performance on test data
- Your model's predictions on test data will be translated into SEK, e.g.:
    - 10 TP = 10 * 1 000 SEK = +10 000 SEK
    - 10 FP = 10 * -500 SEK = -5 000 SEK
        - Expected Value from model = +5 000 SEK 
        
        
After you have trained your model, make predictions for your test data and calculate the profitability of the model.

In [73]:
# Initialize Logistic Regression Model
logreg = LogisticRegression(random_state=42)

# Train the model
logreg.fit(X_train, y_train)

# Make predictions on the test data
y_pred = logreg.predict(X_test)

# Calculate precision to evaluate performance
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.4f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Unpacking the confusion matrix
tn, fp, fn, tp = cm.ravel()

# Calculate profits
profit_tp = tp * 1000  # Each TP gives 1000 SEK
profit_fp = fp * -500  # Each FP costs 500 SEK

total_profit = profit_tp + profit_fp
print(f"Total Profit: {total_profit} SEK")

Precision: 0.7345
Confusion Matrix:
[[1116  124]
 [ 417  343]]
Total Profit: 281000 SEK


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
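
The warning above reports that the solver reached its iteration limit, and it suggests either scaling the data or raising `max_iter`. A minimal sketch of both remedies combined in one pipeline follows. It runs on synthetic data from `make_classification` (an assumption, not the hotel dataset), so its output says nothing about the notebook's results, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the hotel dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
X_demo[:, 0] *= 1000  # exaggerate one feature's scale, as with lead_time vs. dummy columns

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fitted on the training split only, then reused on the test split
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, max_iter=1000),
)
scaled_logreg.fit(X_tr, y_tr)
y_pred_demo = scaled_logreg.predict(X_te)
print(y_pred_demo[:10])
```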


## Q4 | Build a RandomForestClassifier model

- Use a RandomForestClassifier model:
    - random_state = 42


- After you have trained your model, make predictions for your test data and calculate the profitability of the model

- Which model was more profitable, the LogReg or the RandomForestClassifier?

In [80]:
# Initialize and Train RandomForestClassifier Model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Make predictions on the test data
y_pred_rf = rf.predict(X_test)

# Calculate precision for RandomForestClassifier
precision_rf = precision_score(y_test, y_pred_rf)
print(f"RandomForest Precision: {precision_rf:.4f}")

# Confusion Matrix for RandomForest
cm_rf = confusion_matrix(y_test, y_pred_rf)
print("RandomForest Confusion Matrix:")
print(cm_rf)

# Unpacking the confusion matrix for RandomForest
tn_rf, fp_rf, fn_rf, tp_rf = cm_rf.ravel()

# Calculate profits for RandomForest
profit_tp_rf = tp_rf * 1000  # Each TP gives 1000 SEK
profit_fp_rf = fp_rf * -500  # Each FP costs 500 SEK
total_profit_rf = profit_tp_rf + profit_fp_rf
print(f"RandomForest Total Profit: {total_profit_rf} SEK")

RandomForest Precision: 0.7193
RandomForest Confusion Matrix:
[[1055  185]
 [ 286  474]]
RandomForest Total Profit: 381500 SEK


In [None]:
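# Logistic regression has the higher precision (0.7345 vs. 0.7193), but the RandomForestClassifier
# is more profitable (381500 SEK vs. 281000 SEK) because it catches more cancellations (474 TP vs. 343).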
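
Both models classify at the default 0.5 probability cutoff. Because a true positive is worth 1000 SEK and a false positive costs 500 SEK, the most profitable cutoff need not be 0.5. The sketch below is a hedged illustration: `profit_at_threshold` is a hypothetical helper and the toy arrays are made up. In the notebook the probabilities would come from something like `rf.predict_proba(X)[:, 1]`, and any cutoff should be chosen on training or validation data, not on the test set.

```python
import numpy as np

def profit_at_threshold(y_true, proba, threshold, gain_tp=1000, loss_fp=500):
    """Profit in SEK when every booking with proba >= threshold receives the offer."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(proba) >= threshold
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    return int(tp * gain_tp - fp * loss_fp)

# Toy labels and predicted cancellation probabilities (made up for illustration)
y_toy = np.array([1, 0, 1, 0, 1, 0, 0, 1])
p_toy = np.array([0.9, 0.8, 0.45, 0.4, 0.35, 0.2, 0.1, 0.6])

for t in (0.5, 0.3):
    print(t, profit_at_threshold(y_toy, p_toy, t))
# 0.5 1500
# 0.3 3000
```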

## Q5 | Did you choose the right metric? 

Calculate the profitability for the RandomForestClassifier for all 4 different metrics. Then rank order the outcome. I.e.:

- RFC (precision) = 1
- RFC (accuracy) = 2
- ...
- ...


***Note:*** You don't have to use a param_grid for this question, just run the RandomForest with default settings

In [92]:
# Initialize and Train RandomForestClassifier Model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Make predictions on the test data
y_pred_rf = rf.predict(X_test)

# Calculate the confusion matrix for RandomForest
cm_rf = confusion_matrix(y_test, y_pred_rf)
tn_rf, fp_rf, fn_rf, tp_rf = cm_rf.ravel()

# Profit calculation based on the confusion matrix
def calculate_profit(tp, fp, fn, tn=None):
    profit_tp = tp * 1000  # Each TP gives 1000 SEK
    profit_fp = fp * -500  # Each FP costs 500 SEK
    return profit_tp + profit_fp

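# Calculate profit for each metric.
# Note: all four figures come from the same default model and the same predictions,
# so precision, recall and F1 necessarily produce identical profits here.
# Precision Profit
precision_rf = precision_score(y_test, y_pred_rf)
precision_profit = calculate_profit(tp_rf, fp_rf, fn_rf)

# Accuracy Profit (uses TP, TN, FP and FN)
# Note: this line also pays 1000 SEK per TN and charges 500 SEK per FN. The task only
# defines payoffs for TP and FP, so this figure is not comparable to the other three.
accuracy_rf = accuracy_score(y_test, y_pred_rf)
accuracy_profit = (tp_rf + tn_rf) * 1000 + fp_rf * -500 + fn_rf * -500

# Recall Profit
recall_rf = recall_score(y_test, y_pred_rf)
recall_profit = calculate_profit(tp_rf, fp_rf, fn_rf)

# F1-score Profit
f1_rf = f1_score(y_test, y_pred_rf)
f1_profit = calculate_profit(tp_rf, fp_rf, fn_rf)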

# Rank the profitability based on each metric
profits = {
    "Precision": precision_profit,
    "Accuracy": accuracy_profit,
    "Recall": recall_profit,
    "F1-score": f1_profit
}

# Rank the models by profitability
ranked_metrics = sorted(profits.items(), key=lambda x: x[1], reverse=True)

# Display Results
print("RandomForest Profits:")
for i, (metric, profit) in enumerate(ranked_metrics, start=1):
    print(f"RFC ({metric}) = {i} with Profit: {profit} SEK")

RandomForest Profits:
RFC (Accuracy) = 1 with Profit: 1293500 SEK
RFC (Precision) = 2 with Profit: 381500 SEK
RFC (Recall) = 3 with Profit: 381500 SEK
RFC (F1-score) = 4 with Profit: 381500 SEK
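
A comparison across metrics only becomes informative when each metric is actually used to select the model, and every selected model is then priced with the same TP/FP payoff. One possible way to do that is sketched below with `GridSearchCV` and its `scoring` argument. The data is synthetic (an assumption, not the hotel dataset), the parameter grid is a deliberately tiny hypothetical one, and `profit_from_predictions` is an illustrative helper, so the printed ranking says nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: NOT the hotel dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

def profit_from_predictions(y_true, y_pred):
    """Same payoff for every model: +1000 SEK per TP, -500 SEK per FP."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(tp * 1000 - fp * 500)

param_grid = {"max_depth": [2, None]}  # hypothetical, deliberately tiny grid
profits = {}
for metric in ("precision", "recall", "accuracy", "f1"):
    # Select hyperparameters by cross-validation under this metric
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=25, random_state=42),
        param_grid, scoring=metric, cv=3,
    )
    search.fit(X_tr, y_tr)
    # Price the selected model's test predictions with the shared payoff
    profits[metric] = profit_from_predictions(y_te, search.predict(X_te))

ranked = sorted(profits.items(), key=lambda kv: kv[1], reverse=True)
for rank, (metric, sek) in enumerate(ranked, start=1):
    print(f"RFC ({metric}) = {rank} with profit {sek} SEK")
```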
