<div style="background-color: rgba(255, 100, 20, 0.7); padding: 10px;">
<span style="color: white;">

# Introduction

Welcome to this series of notebooks where we will evaluate and compare the performance of Logistic Regression and LightGBM classifiers. This comprehensive analysis will be conducted in several stages, each documented in a separate notebook to ensure clarity and organization.

## Series Outline

This is apart of a series of ntoebooks breaking down a classification project.

1. **Exploratory Data Analysis (EDA) and Feature Engineering**

2. **Generalized Linear Model (Logistic Regression) Development**  **<------ This Notebook**

3. **LightGBM Classifier Development**

4. **Model Evaluation and Comparison**
</span>


## Logistic Regression

In this notebook, we will focus on developing a Logistic Regression model to predict hotel booking cancellations. The process will involve several steps to ensure that we build a robust and accurate model. We will start with a base model and iteratively improve it through various techniques. The key steps include:

1. **Balancing an Imbalanced Dataset**: Addressing class imbalance to ensure that the model performs well across all classes.
2. **Logging Continuous Features**: Applying log transformations to continuous features to handle skewness and improve model performance.
3. **Advanced Encoding Techniques**: Utilising advanced encoding methods for categorical features to capture more information.
4. **Adding Interaction Terms**: Introducing interaction terms to capture relationships between features that may not be evident individually.
5. **Regularization**: Applying regularisation techniques to prevent overfitting and improve the generalizability of the model.

By the end of this notebook, we aim to have a well-tuned Logistic Regression model that can effectively predict hotel booking cancellations.


In [28]:
# Importing Libraries
import pandas as pd
import sweetviz as sv
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime
import warnings
import mlflow
import pickle
import category_encoders as ce
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import(
    ShuffleSplit,
    KFold,
    cross_validate,
    cross_val_predict,
    GridSearchCV,
    train_test_split
)
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import StratifiedKFold


## Creating train / test split file

Splitting the dataset into training and testing sets at an early stage is a crucial step in the machine learning workflow. This practice ensures that the evaluation of the model's performance is unbiased and reflective of its ability to generalize to unseen data. Here are several reasons why early train-test splitting is essential:

1. **Prevents Data Leakage**: By splitting the data early, we avoid the risk of data leakage, where information from the test set inadvertently influences the training process. This ensures that the model is evaluated on data it has never seen before, providing a more accurate assessment of its performance.

2. **Ensures Fair Evaluation**: Early splitting ensures that the test set remains untouched and is only used for final evaluation. This helps in obtaining a fair estimate of the model's performance on new, unseen data, which is critical for assessing its generalizability.

3. **Facilitates Hyperparameter Tuning**: When tuning hyperparameters or selecting features, it is important to use only the training set (or a validation set derived from it) to avoid overfitting. Early splitting ensures that the test set is not used during this process, maintaining the integrity of the evaluation.

4. **Supports Cross-Validation**: Cross-validation techniques, such as k-fold or stratified k-fold, rely on the training set to create multiple train-validation splits. Early splitting ensures that the test set is excluded from these processes, allowing for robust model validation and selection.

5. **Enhances Reproducibility**: By defining the train-test split early and using a fixed random state, we ensure that the data split is consistent across different runs and experiments. This enhances the reproducibility of the results and facilitates collaboration and comparison of different models.

In summary, performing the train-test split early in the machine learning pipeline is a best practice that helps in building robust, unbiased, and generalizable models. It safeguards 

In [29]:
# Load the dataset from the specified CSV file
df = pd.read_csv("Data/hotel_bookings.csv")

# Add an 'index' column to the DataFrame, which contains the original row indices
df['index'] = df.index

# Split the data into training (80%) and test (20%) sets
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# Further split the training set into validation (5%) and training (95%) sets
# Note: The train_size parameter here is 0.05, which means 5% of the original training set
df_val, df_train = train_test_split(df_train, train_size=0.05, random_state=42)

# Save the training, validation, and test sets to separate CSV files
df_train.to_csv('Data/train_data.csv', index=False)
df_val.to_csv('Data/validation_data.csv', index=False)
df_test.to_csv('Data/test_data.csv', index=False)

# Print the sizes of the training, test, and validation sets
print('Train size:', len(df_train), 'Test size:', len(df_test), 'Validation size:', len(df_val))


Train size: 90737 Test size: 23878 Validation size: 4775


From now on we will only ever load in our train and test data seperately 

In [30]:
# Loading in our training data
df_train = pd.read_csv("Data/train_data.csv")

# Creating processing pipeline

First we need to load in our cleaning and feature engineering functions from before:

In [31]:
def data_cleaning(raw_data):
    #Replacing the small amount of null values in children and country columns with sensible values
    null_replacements = {"children": 0.0,"country": "Unknown"} 
    clean_data = raw_data.fillna(null_replacements) 
    #Dropping only row with room type L
    clean_data = clean_data[clean_data['assigned_room_type'] != 'L']
    #Replace error entries with nulls
    clean_data['market_segment'].replace('Undefined', np.nan, inplace=True)
    clean_data['distribution_channel'].replace('Undefined', np.nan, inplace=True)
    #Dropping bookings with no guests
    empty_bookings = (clean_data.children == 0) & (clean_data.adults == 0) & (clean_data.babies == 0) # There are no guests when these conditions are true
    clean_data = clean_data[~empty_bookings] 
    #Dropping all the duplicates
    clean_data = clean_data.drop_duplicates()
    clean_data = clean_data.reset_index(drop=True)
    return clean_data
df_train = data_cleaning(df_train)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  clean_data['market_segment'].replace('Undefined', np.nan, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  clean_data['distribution_channel'].replace('Undefined', np.nan, inplace=True)


In [32]:
def feature_engineering(df):
    # Add a boolean variable for whether or not an agent was used to book
    df['used_agent'] = np.where(df['agent'].isnull(),0,1)
    # Combining weekdays and weekends for a total nights value
    df['total_nights'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']
    # Adding a column to map hotel to 0/1
    df['hotel_binary'] = df['hotel'].map({'Resort Hotel': 0, 'City Hotel': 1})
    # Creating a flags for guests coming from Portugal and EU
    df['portugal_flag'] = (df['country'] == 'PRT').astype('float64')
    eu_countries = ['AUT, BEL, BGR, HRV, CYP, CZE, DNK, EST, FIN, FRA, DEU, GRC, HUN, IRL, ITA, LVA, LTU, LUX, MLT, NLD, POL']
    df['eu_flag'] = df['country'].isin(eu_countries).astype('float64')
    
    # Now finding the day of the week for each booking 

    # Dictionary to map the month to its corresponding number
    month_to_number = {
        "January": 1,
        "February": 2,
        "March": 3,
        "April": 4,
        "May": 5,
        "June": 6,
        "July": 7,
        "August": 8,
        "September": 9,
        "October": 10,
        "November": 11,
        "December": 12
    }
    # adding columns for the date of arrivaL
    df['arrival_date_year'] = df['arrival_date_year'].astype(str)
    df['arrival_date_month_number'] = df['arrival_date_month'].map(month_to_number).astype(str).str.zfill(2)  
    df['arrival_date_day_of_month'] = df['arrival_date_day_of_month'].astype(str).str.zfill(2)  
    # string in the format "YYYY-MM-DD"
    df['raw_date_string'] = df['arrival_date_year'] + "-" + \
                                    df['arrival_date_month_number'] + "-" + \
                                    df['arrival_date_day_of_month']
    df['raw_date_string']

    # Creating a function that transforms the date into its corresponding day of the week
    def find_day(raw_date_string):
        date_object = datetime.strptime(raw_date_string, "%Y-%m-%d")
        day_of_week = date_object.weekday()
        return day_of_week
    df['day_of_week_on_arrival'] = df['raw_date_string'].apply(find_day)

    # Adding a column to encode the seasonal affect of which week in the year the booking was made
    period = df['arrival_date_week_number'].nunique()
    two_pi = 2*np.pi
    df['sin_week'] = np.sin(two_pi * df['arrival_date_week_number'] / period)
    df['cos_week'] = np.cos(two_pi * df['arrival_date_week_number'] / period)
    
    df['arrival_date_day_of_month'] = df['arrival_date_day_of_month'].astype(int)
    df['arrival_date_month_number'] = df['arrival_date_month_number'].astype(int)
    
    return df
df_train = feature_engineering(df_train)

With that done we can check our functions have been applied to df_train correctly.

In [33]:
# Checking our data has been properly processed
df_train

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,used_agent,total_nights,hotel_binary,portugal_flag,eu_flag,arrival_date_month_number,raw_date_string,day_of_week_on_arrival,sin_week,cos_week
0,City Hotel,0,13,2017,August,32,6,2,1,2,...,1,3,1,0.0,0.0,8,2017-08-06,6,-0.606800,-0.794854
1,Resort Hotel,0,1,2016,April,17,23,0,1,2,...,1,1,0,0.0,0.0,4,2016-04-23,5,0.902798,-0.430065
2,City Hotel,0,194,2017,April,13,1,1,1,2,...,1,2,1,0.0,0.0,4,2017-04-01,5,0.999561,0.029633
3,City Hotel,0,159,2016,September,40,30,2,3,2,...,1,5,1,0.0,0.0,9,2016-09-30,4,-0.999561,0.029633
4,City Hotel,1,214,2017,May,18,3,0,4,2,...,1,4,1,0.0,0.0,5,2017-05-03,2,0.845596,-0.533823
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90596,City Hotel,1,261,2015,October,42,16,0,1,2,...,1,1,1,1.0,0.0,10,2015-10-16,4,-0.964636,0.263587
90597,City Hotel,1,35,2017,August,33,16,0,4,2,...,1,4,1,1.0,0.0,8,2017-08-16,2,-0.696551,-0.717507
90598,City Hotel,0,8,2016,May,20,12,0,2,2,...,1,2,1,1.0,0.0,5,2016-05-12,3,0.696551,-0.717507
90599,City Hotel,0,0,2015,October,41,7,0,2,2,...,1,2,1,1.0,0.0,10,2015-10-07,2,-0.989040,0.147647


### Explanation of Scikit-Learn Pipelines

In this notebook, we utilize Scikit-Learn's `Pipeline` class to streamline and automate the preprocessing and modeling steps. Pipelines are a powerful tool in machine learning workflows, allowing us to chain multiple processing steps together, ensuring that they are executed in a specific order. This not only makes the code more readable and maintainable but also helps prevent data leakage by ensuring that the same transformations are applied to both the training and test data.

#### Key Components of Our Pipeline

1. **ColumnTransformer**: This allows us to apply different preprocessing steps to different subsets of features. For example, we can impute missing values in numerical features using the median, while filling missing values in categorical features with a constant value.

2. **SimpleImputer**: Used within the `ColumnTransformer` to handle missing values. We use different strategies for numerical, categorical, and binary features.

3. **OneHotEncoder and TargetEncoder**: These are used to encode categorical features. One-hot encoding is applied to features with lower cardinality, while target encoding is used for features with higher cardinality.

4. **StandardScaler**: This scales the features to have zero mean and unit variance, which is essential for many machine learning algorithms to perform optimally.

5. **LogisticRegression**: Our chosen model for this notebook. We use logistic regression to predict the probability of hotel booking cancellations.

By combining these components into a pipeline, we ensure that our preprocessing steps are consistently applied, and our model is trained and evaluated in a systematic manner. This approach not only improves the robustness of our model but also makes our workflow more efficient and reproducible.

First lets split our data into numerical, binary and categorical features:

In [34]:
# Checking the data types 
df_train.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90601 entries, 0 to 90600
Data columns (total 43 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   hotel                           90601 non-null  object 
 1   is_canceled                     90601 non-null  int64  
 2   lead_time                       90601 non-null  int64  
 3   arrival_date_year               90601 non-null  object 
 4   arrival_date_month              90601 non-null  object 
 5   arrival_date_week_number        90601 non-null  int64  
 6   arrival_date_day_of_month       90601 non-null  int32  
 7   stays_in_weekend_nights         90601 non-null  int64  
 8   stays_in_week_nights            90601 non-null  int64  
 9   adults                          90601 non-null  int64  
 10  children                        90601 non-null  float64
 11  babies                          90601 non-null  int64  
 12  meal                            

In [35]:
# Categorising the features into numerical, binary and categorical
numeric_feats = [
    "lead_time",
    "arrival_date_week_number",
    "arrival_date_day_of_month",
    "stays_in_weekend_nights",
    "stays_in_week_nights",
    "adults",
    "children",
    "babies",
    "previous_cancellations",
    'previous_bookings_not_canceled',
    "days_in_waiting_list",
    "adr",
    "required_car_parking_spaces",
    "total_of_special_requests",
    "total_nights",
    "day_of_week_on_arrival",
    "sin_week",
    "cos_week"
]

binary_feats = [
    "is_repeated_guest",
    "portugal_flag",
    "eu_flag",
    "used_agent",
    "hotel_binary",
]

categorical_feats = [
    "arrival_date_month",
    "meal",
    "country",
    "market_segment",
    "distribution_channel",
    "reserved_room_type",
    "assigned_room_type", # check this
    "deposit_type",
    "customer_type"
]

# Checking how many features we ended up with
all_feats = numeric_feats + binary_feats + categorical_feats 

print(f"Number of features: {len(all_feats)}")

Number of features: 32


Checking the features we have omitted from the pipeline:

In [36]:
columns = df.columns
set(columns)-set(all_feats)

{'agent',
 'arrival_date_year',
 'booking_changes',
 'company',
 'hotel',
 'index',
 'is_canceled',
 'reservation_status',
 'reservation_status_date'}

### Imputing Missing Values

Imputing is a technique used to handle missing data in a dataset. Missing data can occur for various reasons, such as data entry errors, equipment malfunctions, or respondents skipping questions in a survey. Imputing involves replacing missing values with substituted values to maintain the integrity of the dataset and allow for effective analysis.

In this notebook, we use the `ColumnTransformer` from Scikit-Learn to define a strategy for imputing missing values in different types of features. The code snippet below demonstrates how we set up the imputation process:

In [37]:
# Define the column transformer for imputing missing values
simple_imputer = ColumnTransformer(
    [
        # Impute numeric features using the median strategy and add an indicator for missing values
        ("numeric", SimpleImputer(strategy="median", add_indicator=True), numeric_feats),
        
        # Impute categorical features using a constant value ("missing")
        ("categorical", SimpleImputer(strategy="constant", fill_value="missing"), categorical_feats),
        
        # Impute binary features using the most frequent value and add an indicator for missing values
        ("binary", SimpleImputer(strategy="most_frequent", add_indicator=True), binary_feats)
    ],
    remainder="drop",  # Drop any columns not specified in the transformers
    verbose_feature_names_out=False  # Do not add transformer names to the output feature names
)

# Set the output of the transformer to be a pandas DataFrame
simple_imputer.set_output(transform="pandas")

### Encoding Categorical values

We will now check the values of our categorical features

In [38]:
# Checking the values of our categorical features
for feat in categorical_feats:
    print(df_train[feat].value_counts())

arrival_date_month
August       10529
July          9561
May           9003
October       8542
April         8413
June          8252
September     8017
March         7452
February      6081
November      5151
December      5113
January       4487
Name: count, dtype: int64
meal
BB           70150
HB           10982
SC            7995
Undefined      880
FB             594
Name: count, dtype: int64
country
PRT    36757
GBR     9195
FRA     7920
ESP     6515
DEU     5545
       ...  
CYM        1
DMA        1
NIC        1
MDG        1
MRT        1
Name: count, Length: 174, dtype: int64
market_segment
Online TA        43007
Offline TA/TO    18366
Groups           15006
Direct            9518
Corporate         3971
Complementary      552
Aviation           179
Name: count, dtype: int64
distribution_channel
TA/TO        74409
Direct       11042
Corporate     5006
GDS            142
Name: count, dtype: int64
reserved_room_type
A    65165
D    14669
E     4986
F     2202
G     1601
B      839
C


### One-Hot Encoding vs. Target Encoding

In this notebook, we use two different encoding techniques for categorical features: One-Hot Encoding and Target Encoding. Each method has its strengths and is better suited for different types of categorical features based on their cardinality.

#### One-Hot Encoding

One-Hot Encoding is a technique that converts categorical variables into a series of binary columns. Each unique category value is represented as a separate column, and a value of 1 or 0 is assigned to indicate the presence or absence of that category.

#### Target Encoding

Target Encoding, also known as Mean Encoding, replaces each category with the mean of the target variable for that category. This technique captures the relationship between the categorical feature and the target variable.

### Why Different Cardinalities Matter

The choice between one-hot encoding and target encoding largely depends on the cardinality of the categorical feature:

- **Low Cardinality**: One-hot encoding is preferred because it creates a manageable number of binary columns, and the interpretability of the encoded features is high.
- **High Cardinality**: Target encoding is preferred because it avoids the explosion of the feature space that one-hot encoding would cause. It also captures the relationship between the categorical feature and the target variable, which can be beneficial for model performance.

By using the appropriate encoding technique based on the cardinality of the categorical features, we can improve the efficiency and effectiveness of our machine learning models.


In [39]:
# These features have a limited number of unique values, making them suitable for one-hot encoding
ohe_feats = [
    "meal",
    "market_segment",
    "distribution_channel",
    "deposit_type",
    "customer_type"
]

# These features have a larger number of unique values, making them suitable for target encoding
target_encoded_feats = list(set(categorical_feats) - set(ohe_feats))

# Define the column transformer for encoding features
encoder = ColumnTransformer(
    [
        ("one_hot", ce.OneHotEncoder(use_cat_names=True), ohe_feats),
        ("target", ce.TargetEncoder(smoothing=100, min_samples_leaf=1000), target_encoded_feats)
    ],
    remainder="passthrough",  # Pass through any columns not specified in the transformers
    verbose_feature_names_out=False  # Do not add transformer names to the output feature names
)

# Set the output of the transformer to be a pandas DataFrame
encoder.set_output(transform="pandas")


### Scaling Features

Scaling is a crucial preprocessing step in machine learning, especially for algorithms that are sensitive to the scale of the input features, such as logistic regression. Scaling ensures that all features contribute equally to the model's performance by transforming them to a common scale. In this notebook, we use the `StandardScaler` from Scikit-Learn to standardize our features.

#### Why Scaling is Important

1. **Improves Model Performance**: Many machine learning algorithms, including logistic regression, perform better when the features are on a similar scale. Scaling helps in achieving this, leading to improved model performance.

2. **Speeds Up Convergence**: For optimization algorithms like gradient descent, scaling can speed up the convergence process, making the training process faster and more efficient.

3. **Prevents Dominance of Features**: Without scaling, features with larger ranges can dominate the learning process, leading to biased models. Scaling ensures that all features are treated equally.



In [40]:
# Initialize the StandardScaler for scaling features
scaler = StandardScaler()

# Set the output of the scaler to be a pandas DataFrame
scaler.set_output(transform="pandas")

# Define the preprocessing pipeline
prep_pipeline = Pipeline(
    [
        # Step 1: Impute missing values using the simple_imputer defined earlier
        ("imputation", simple_imputer),
        
        # Step 2: Encode categorical features using the encoder defined earlier
        ("encoding", encoder),
        
        # Step 3: Scale features using the StandardScaler
        ("scaling", scaler)
    ]
)

# Fit the preprocessing pipeline to the training data and transform it
# The target variable 'is_canceled' is not used in the transformation but is required for fitting
prep_pipeline.fit_transform(df_train, df_train['is_canceled'])

Unnamed: 0,meal_BB,meal_SC,meal_HB,meal_FB,meal_Undefined,market_segment_Online TA,market_segment_Offline TA/TO,market_segment_Groups,market_segment_Direct,market_segment_Aviation,...,total_of_special_requests,total_nights,day_of_week_on_arrival,sin_week,cos_week,is_repeated_guest,portugal_flag,eu_flag,used_agent,hotel_binary
0,0.539937,-0.311102,-0.371392,-0.081237,-0.099036,1.051978,-0.504236,-0.445539,-0.342616,-0.044493,...,-0.720842,-0.167365,1.540109,-0.809480,-0.954567,-0.179301,-0.826231,0.0,0.397049,0.709860
1,0.539937,-0.311102,-0.371392,-0.081237,-0.099036,1.051978,-0.504236,-0.445539,-0.342616,-0.044493,...,-0.720842,-0.953139,1.027567,1.267506,-0.408832,-0.179301,-0.826231,0.0,0.397049,-1.408729
2,0.539937,-0.311102,-0.371392,-0.081237,-0.099036,-0.950591,1.983200,-0.445539,-0.342616,-0.044493,...,-0.720842,-0.560252,1.027567,1.400637,0.278890,-0.179301,-0.826231,0.0,0.397049,0.709860
3,0.539937,-0.311102,-0.371392,-0.081237,-0.099036,1.051978,-0.504236,-0.445539,-0.342616,-0.044493,...,1.802202,0.618408,0.515025,-1.349861,0.278890,-0.179301,-0.826231,0.0,0.397049,0.709860
4,0.539937,-0.311102,-0.371392,-0.081237,-0.099036,1.051978,-0.504236,-0.445539,-0.342616,-0.044493,...,1.802202,0.225521,-0.510058,1.188804,-0.564057,-0.179301,-0.826231,0.0,0.397049,0.709860
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90596,-1.852066,-0.311102,2.692574,-0.081237,-0.099036,-0.950591,1.983200,-0.445539,-0.342616,-0.044493,...,-0.720842,-0.953139,0.515025,-1.301809,0.628892,-0.179301,1.210316,0.0,0.397049,0.709860
90597,-1.852066,3.214375,-0.371392,-0.081237,-0.099036,1.051978,-0.504236,-0.445539,-0.342616,-0.044493,...,0.540680,0.225521,-0.510058,-0.932964,-0.838853,-0.179301,1.210316,0.0,0.397049,0.709860
90598,0.539937,-0.311102,-0.371392,-0.081237,-0.099036,-0.950591,1.983200,-0.445539,-0.342616,-0.044493,...,-0.720842,-0.560252,0.002483,0.983740,-0.838853,-0.179301,1.210316,0.0,0.397049,0.709860
90599,0.539937,-0.311102,-0.371392,-0.081237,-0.099036,-0.950591,-0.504236,2.244471,-0.342616,-0.044493,...,-0.720842,-0.560252,-0.510058,-1.335386,0.455441,-0.179301,1.210316,0.0,0.397049,0.709860


We are now ready to initialise our Logistic Regression model and attach it to the end of our pipeline:

In [41]:
# Initialize the Logistic Regression model
lr_model = LogisticRegression()

# Create a pipeline with the preprocessing steps and the Logistic Regression model
# The pipeline consists of two steps:
# 1. "prep": Apply the preprocessing pipeline (cloned to ensure a fresh copy)
# 2. "model": Fit the Logistic Regression model
lr_pipe = Pipeline([
    ("prep", clone(prep_pipeline)),  # Clone the preprocessing pipeline to avoid modifying the original
    ("model", lr_model)  # Add the Logistic Regression model to the pipeline
])

# Fit the pipeline to the training data
# This will apply the preprocessing steps and then fit the Logistic Regression model
lr_pipe.fit(df_train, df_train["is_canceled"])

### Evaluating the model

#### Cross validation

Cross-validation is a robust technique used to evaluate the performance of a machine learning model. It helps in mitigating overfitting and provides a better understanding of how the model will generalize to an independent dataset. 

#### Metrics

In machine learning, it's generally advisable to focus on a single evaluation metric that aligns with your primary goal. This is because different metrics can sometimes provide conflicting information, making it difficult to optimize your model effectively. For instance, if your objective is to maximize recall to identify customers likely to cancel their hotel bookings, concentrating on recall ensures that your model is tuned to catch as many potential cancellations as possible. However, for this series of notebooks we wil be looking at 5 scorer metrics eachtime just out of curiosity.


In [42]:
# Initialize StratifiedKFold with 5 splits, shuffling, and a fixed random state for reproducibility
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define the scoring metrics to evaluate the model
scoring = ['neg_log_loss', "accuracy", "precision", "recall", "f1"]

# Perform cross-validation on the logistic regression pipeline
cv_results_lr = cross_validate(
    clone(lr_pipe),  # Clone the logistic regression pipeline to avoid modifying the original
    df_train,  
    df_train['is_canceled'],  
    cv=kf,  
    return_train_score=True,  
    scoring=scoring  
)

# Initialize an empty DataFrame to store the results
results = pd.DataFrame(columns=['model', 'neg_log_loss', 'accuracy', 'precision', 'recall', 'f1'])

# Create an entry with the mean cross-validation scores for the logistic regression model
entry = {
    "model": "BaseModel",  # Model name
    "neg_log_loss": np.mean(cv_results_lr['test_neg_log_loss']),  
    "accuracy": np.mean(cv_results_lr['test_accuracy']),  
    "precision": np.mean(cv_results_lr['test_precision']),  
    "recall": np.mean(cv_results_lr['test_recall']),  
    "f1": np.mean(cv_results_lr['test_f1']),  
}

# Concatenate the entry with the results DataFrame
entry = pd.DataFrame([entry])
results = pd.concat([results, entry], ignore_index=True)

results

  results = pd.concat([results, entry], ignore_index=True)


Unnamed: 0,model,neg_log_loss,accuracy,precision,recall,f1
0,BaseModel,-0.391137,0.807706,0.804423,0.633956,0.709033


As we can see the model does pretty well on all of the metrics except for Recall. This is most likely due to the fact that our data has an imbalance for the target as we noted during EDA. This means that our model may underfit the positive class, resulting in a low level of Recall (what % of the true positives do we predict successfully). To adress this we wil test out over sampling the positive class usng a `RandomOverSampler` from sklearn.

First lets remind ourselves as to how imbalanced our data is:

In [43]:
# Checking the distribution of the target variable
df_train['is_canceled'].value_counts() * 100 / len(df_train)

is_canceled
0    63.041247
1    36.958753
Name: count, dtype: float64

Okay now let's address this: 

In [44]:
# Initialize the RandomOverSampler with a fixed random state for reproducibility
ros = RandomOverSampler(random_state=42)

# Apply the RandomOverSampler to balance the training dataset
# This will oversample the minority class to match the number of samples in the majority class
df_train, df_train_y = ros.fit_resample(df_train, df_train['is_canceled'])

# Checking the distribution of the target variable after oversampling
df_train['is_canceled'].value_counts() * 100 / len(df_train)

is_canceled
0    50.0
1    50.0
Name: count, dtype: float64

As you can see we now have a 50/50 split for our target class. Let/s=='s now repeat the evaluation steps from before:

In [45]:
# Refitting the pipeline to the oversampled data
lr_pipe.fit(df_train, df_train["is_canceled"])

In [46]:
# Repeating the same CV as before

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ['neg_log_loss', "accuracy", "precision", "recall", "f1"]

cv_results_lr = cross_validate(
    clone(lr_pipe),
    df_train,
    df_train['is_canceled'],
    cv=kf,
    return_train_score=True,
    scoring=scoring
)

entry = {
    "model": "Added Balanced Data",
    "neg_log_loss": np.mean(cv_results_lr["test_neg_log_loss"]),
    "accuracy": np.mean(cv_results_lr["test_accuracy"]),
    "precision": np.mean(cv_results_lr["test_precision"]),
    "recall": np.mean(cv_results_lr["test_recall"]),
    "f1": np.mean(cv_results_lr["test_f1"]),
}
entry = pd.DataFrame([entry])
results = pd.concat([results, entry], ignore_index=True)
results

Unnamed: 0,model,neg_log_loss,accuracy,precision,recall,f1
0,BaseModel,-0.391137,0.807706,0.804423,0.633956,0.709033
1,Added Balanced Data,-0.4117,0.795425,0.81047,0.771202,0.790343


Nice, thats a huge increase on Recall from the previous model. Our precision also increased however accuracy and neg_log_loss did fall.

### Logging Continuous Features

Log transformations are a powerful technique for handling continuous features in Generalized Linear Models (GLMs), such as Logistic Regression. Applying a log transformation to continuous features can significantly enhance the performance and interpretability of the model. Here are several reasons why logging continuous features can be beneficial:

1. **Handling Skewness**: Many continuous features in real-world datasets exhibit skewed distributions. Skewed data can adversely affect the performance of GLMs, which assume that the input features are normally distributed. Log transformations help in reducing skewness, bringing the distribution closer to normality, and thereby improving model performance.

2. **Reducing the Impact of Outliers**: Continuous features often contain outliers that can disproportionately influence the model's coefficients. Log transformations compress the range of the data, reducing the impact of extreme values and making the model more robust to outliers.

3. **Improving Linearity**: GLMs, including Logistic Regression, assume a linear relationship between the input features and the log-odds of the target variable. Log transformations can help in achieving linearity by stabilizing the variance and making the relationship between the features and the target variable more linear.

By applying log transformations to continuous features, we can address these issues and build more robust, interpretable, and effective GLMs. This preprocessing step is particularly important when dealing with features that span several orders of magnitude or exhibit significant skewness.


In [47]:
def log_transform(df, cols):
    # Create a deep copy of the DataFrame to avoid modifying the original data
    df_copy = df.copy(deep=True)

    # Clip the values in the specified columns to ensure they are non-negative
    # This is necessary because the logarithm of a negative number is undefined
    df_copy.loc[:, cols] = df_copy[cols].clip(lower=0)

    # Apply the natural logarithm (log1p) transformation to the specified columns
    # np.log1p(x) is equivalent to np.log(1 + x), which handles zero values more gracefully
    df_copy.loc[:, cols] = df_copy[cols].apply(np.log1p)

    return df_copy

In [48]:
# Lower cardinality features one-hot encoding
ohe_feats = [
    "meal",
    "market_segment",
    "distribution_channel",
    "deposit_type",
    "customer_type"
]

# Higher cardinality features target encoding
js_encoded_feats = list(set(categorical_feats) - set(ohe_feats))

# Redefine the encoder with the new 
encoder = ColumnTransformer(
    [
        # One-hot encode the lower cardinality features
        ("one_hot", ce.OneHotEncoder(use_cat_names=True), ohe_feats),
        
        # Apply James-Stein encoding to the higher cardinality features
        ("james_stein", ce.JamesSteinEncoder(), js_encoded_feats)
    ],
    remainder="passthrough",  
    verbose_feature_names_out=False  
)

encoder.set_output(transform="pandas")

# Define the preprocessing pipeline with log transformation
prep_pipeline_w_log = Pipeline(
    [
        # Step 1: Impute missing values using the simple_imputer defined earlier
        ("imputation", simple_imputer),
        
        # Step 2: Encode categorical features using the encoder defined earlier
        ("encoding", encoder),
        
        # Step 3: Apply log transformation to numeric and higher cardinality features
        ("log_transform", FunctionTransformer(log_transform, kw_args={"cols": numeric_feats + js_encoded_feats})),
        
        # Step 4: Scale features using the StandardScaler
        ("scaling", scaler)
    ]
)

lr_pipe_w_log = Pipeline([("prep", clone(prep_pipeline_w_log)), ("model", lr_model)])
lr_pipe_w_log.fit(df_train, df_train["is_canceled"])

In [49]:
# Repeating the same CV as before
cv_results_lr = cross_validate(
    clone(lr_pipe_w_log),
    df_train,
    df_train["is_canceled"],
    cv=kf,
    return_train_score=True,
    scoring=scoring
)

entry = {
    "model": "Added Log + JS Encoding",
    "neg_log_loss": np.mean(cv_results_lr["test_neg_log_loss"]),
    "accuracy": np.mean(cv_results_lr["test_accuracy"]),
    "precision": np.mean(cv_results_lr["test_precision"]),
    "recall": np.mean(cv_results_lr["test_recall"]),
    "f1": np.mean(cv_results_lr["test_f1"]),
}
entry = pd.DataFrame([entry])
results = pd.concat([results, entry], ignore_index=True)
results

Unnamed: 0,model,neg_log_loss,accuracy,precision,recall,f1
0,BaseModel,-0.391137,0.807706,0.804423,0.633956,0.709033
1,Added Balanced Data,-0.4117,0.795425,0.81047,0.771202,0.790343
2,Added Log + JS Encoding,-0.387716,0.810841,0.821642,0.794068,0.807608


In [50]:
# Define a custom transformer class to add interaction terms between specified features
class AddInteractionTerms(BaseEstimator, TransformerMixin):
    
    # The fit method is used to learn any parameters from the training data
    def fit(self, X, y=None, **fit_params):
        
        # Store the input column names
        self.input_columns = X.columns

        # Initialize a list to keep track of the new interaction columns added
        self.added_columns = []
        
        # Check if both 'assigned_room_type' and 'reserved_room_type' columns are present in the input data
        if 'assigned_room_type' in X.columns and 'reserved_room_type' in X.columns:
            # If present, add the interaction term column name to the list
            self.added_columns.append('assigned_room_type_X_reserved_room_type')
        
        # Same here
        if 'previous_bookings_not_canceled' in X.columns and 'is_repeated_guest' in X.columns:
            self.added_columns.append('previous_bookings_not_canceled_X_is_repeated_guest')
        
        return self
    
    # The transform method is used to apply the learned parameters to the input data
    def transform(self, X):
        # Create a deep copy of the input data to avoid modifying the original data
        X_copy = X.copy(deep=True)

        # Check if both 'assigned_room_type' and 'reserved_room_type' columns are present in the input data
        if 'assigned_room_type' in X.columns and 'reserved_room_type' in X.columns:
            # If present, create the interaction term by multiplying the two columns
            X_copy['assigned_room_type_X_reserved_room_type'] = X_copy['assigned_room_type'] * X_copy['reserved_room_type']
    
        # Same here
        if 'previous_bookings_not_canceled' in X.columns and 'is_repeated_guest' in X.columns:
            X_copy['previous_bookings_not_canceled_X_is_repeated_guest'] = X_copy['previous_bookings_not_canceled'] * X_copy['is_repeated_guest']

        return X_copy
    
    # The get_feature_names_out method is used to get the names of the output features
    def get_feature_names_out(self, *args, **params):
        # Return the original input columns along with the newly added interaction columns
        return self.input_columns + self.added_columns

In [51]:
prep_pipeline_w_log_interactions = Pipeline(
    [
        # Step 1: Impute missing values using the simple_imputer defined earlier
        ("imputation", simple_imputer),
        
        # Step 2: Encode categorical features using the encoder defined earlier
        ("encoding", encoder),
        
        # Step 3: Apply log transformation to specified columns using the log_transform function
        ("log_transform", FunctionTransformer(log_transform, kw_args={"cols": numeric_feats + js_encoded_feats})),
        
        # Step 4: Add interaction terms between specified features using the AddInteractionTerms transformer
        ("interactions", AddInteractionTerms()),
        
        # Step 5: Scale features using the StandardScaler defined earlier
        ("scaling", scaler)
    ]
)

# Initialize the Logistic Regression model with a higher maximum number of iterations
lr_model = LogisticRegression(max_iter=10000)

# Create a pipeline with the new preprocessing step
lr_pipe_w_log_interactions = Pipeline([
    ("prep", clone(prep_pipeline_w_log_interactions)),  
    ("model", lr_model)  
])

lr_pipe_w_log_interactions.fit(df_train, df_train["is_canceled"])

In [52]:
# Repeating the same CV as before

cv_results_lr = cross_validate(
    clone(lr_pipe_w_log_interactions),
    df_train,
    df_train["is_canceled"],
    cv=kf,
    return_train_score=True,
    scoring=scoring
)

entry = {
    "model": "Added Interactions",
    "neg_log_loss": np.mean(cv_results_lr["test_neg_log_loss"]),
    "accuracy": np.mean(cv_results_lr["test_accuracy"]),
    "precision": np.mean(cv_results_lr["test_precision"]),
    "recall": np.mean(cv_results_lr["test_recall"]),
    "f1": np.mean(cv_results_lr["test_f1"]),
}
entry = pd.DataFrame([entry])
results = pd.concat([results, entry], ignore_index=True)
results

Unnamed: 0,model,neg_log_loss,accuracy,precision,recall,f1
0,BaseModel,-0.391137,0.807706,0.804423,0.633956,0.709033
1,Added Balanced Data,-0.4117,0.795425,0.81047,0.771202,0.790343
2,Added Log + JS Encoding,-0.387716,0.810841,0.821642,0.794068,0.807608
3,Added Interactions,-0.387013,0.811007,0.822123,0.793771,0.807687


In [53]:
# Define the parameter grid for regularization parameters
param_grid = {
    'model__C': [0.01, 0.1, 1, 10, 20],
    'model__penalty': ['l1', 'l2'],
    'model__solver': ['liblinear'],
    'model__max_iter': [1000]
}

# Initialize GridSearchCV with the logistic regression pipeline
grid_search = GridSearchCV(
    estimator=clone(lr_pipe_w_log_interactions),
    param_grid=param_grid,
    cv=kf,
    scoring='neg_log_loss',
    n_jobs=1,
    verbose=2
)

# Fit the grid search to the training data
grid_search.fit(df_train, df_train['is_canceled'])

# Get the best model from the grid search
best_model = grid_search.best_estimator_

best_model

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END model__C=0.01, model__max_iter=1000, model__penalty=l1, model__solver=liblinear; total time=   2.1s
[CV] END model__C=0.01, model__max_iter=1000, model__penalty=l1, model__solver=liblinear; total time=   2.8s
[CV] END model__C=0.01, model__max_iter=1000, model__penalty=l1, model__solver=liblinear; total time=   3.1s
[CV] END model__C=0.01, model__max_iter=1000, model__penalty=l1, model__solver=liblinear; total time=   5.0s
[CV] END model__C=0.01, model__max_iter=1000, model__penalty=l1, model__solver=liblinear; total time=   6.8s
[CV] END model__C=0.01, model__max_iter=1000, model__penalty=l2, model__solver=liblinear; total time=   2.4s
[CV] END model__C=0.01, model__max_iter=1000, model__penalty=l2, model__solver=liblinear; total time=   2.5s
[CV] END model__C=0.01, model__max_iter=1000, model__penalty=l2, model__solver=liblinear; total time=   2.5s
[CV] END model__C=0.01, model__max_iter=1000, model__penalty=l2, mo

In [54]:
# Repeating the same CV as before
cv_results_best_model = cross_validate(
    best_model,
    df_train,
    df_train['is_canceled'],
    cv=kf,
    return_train_score=True,
    scoring=scoring
)

entry = {
    "model": "Added Regularisation",
    "neg_log_loss": np.mean(cv_results_best_model['test_neg_log_loss']),
    "accuracy": np.mean(cv_results_best_model['test_accuracy']),
    "precision": np.mean(cv_results_best_model['test_precision']),
    "recall": np.mean(cv_results_best_model['test_recall']),
    "f1": np.mean(cv_results_best_model['test_f1']),
}
entry = pd.DataFrame([entry])

results = pd.concat([results, entry], ignore_index=True)
results

Unnamed: 0,model,neg_log_loss,accuracy,precision,recall,f1
0,BaseModel,-0.391137,0.807706,0.804423,0.633956,0.709033
1,Added Balanced Data,-0.4117,0.795425,0.81047,0.771202,0.790343
2,Added Log + JS Encoding,-0.387716,0.810841,0.821642,0.794068,0.807608
3,Added Interactions,-0.387013,0.811007,0.822123,0.793771,0.807687
4,Added Regularisation,-0.386729,0.81148,0.8221,0.795014,0.80832
