## ML with scikit-learn: Classification

Scikit-learn is a cornerstone in the Python machine learning toolkit, especially when dealing with tabular data. It's the go-to for scenarios where we're not tackling unstructured data types like text or images but focusing on more classic, structured datasets.

Scikit-learn works directly with pandas DataFrames and NumPy arrays, so the three together form a robust stack for data manipulation and analysis. Its API is consistent across models and transformers, which streamlines building and deploying machine learning models. It's not just about model creation; the library also provides essential tools for model evaluation and selection, so we can measure and improve our approach systematically.

With scikit-learn, you have access to a broad array of algorithms for various machine learning tasks: classification, regression, clustering, and more. It's also equipped with features for preprocessing data, selecting the right features, and fine-tuning models through cross-validation and hyperparameter optimization.

## Importing packages and data

In [1]:
# %pip install -U scikit-learn 

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures, KBinsDiscretizer
from sklearn.impute import SimpleImputer # missing values
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression # logistic regression: a linear model for classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import HistGradientBoostingClassifier # Fits many decision trees sequentially. Each tree is trained to improve on the performance of all the previous trees.
from sklearn.ensemble import RandomForestClassifier # Fits many decision trees independently, with some added randomness. The final prediction averages the predictions of all trees.
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score


In [3]:
pd.options.display.float_format = '{:.2f}'.format

In [4]:
hotel_df = pd.read_csv("data/output/hotel_dataset_cleaned.csv")


In [5]:
hotel_df.head()

| Unnamed: 0 | IsCanceled | LeadTime | ArrivalDateYear | ArrivalDateMonth | ArrivalDateWeekNumber | ArrivalDateDayOfMonth | StaysInWeekendNights | StaysInWeekNights | Adults | Children | ... | Agent | Company | DaysInWaitingList | CustomerType | ADR | RequiredCarParkingSpaces | TotalOfSpecialRequests | ReservationStatus | ReservationStatusDate | hotel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | ... | -1.0 | -1.0 | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | Resort hotel |
| 1 | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | ... | -1.0 | -1.0 | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | Resort hotel |
| 2 | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | ... | -1.0 | -1.0 | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | Resort hotel |
| 3 | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | ... | 304.0 | -1.0 | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | Resort hotel |
| 4 | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | ... | 240.0 | -1.0 | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 | Resort hotel |


## Predict who cancels reservations

**In this case study we want to predict who cancels a reservation.**

Finding out who cancels means **predicting a categorical value**. The target column `IsCanceled` is stored as a number (0 or 1), but it should be treated as a discrete category rather than a quantity. This task is called **classification**. Classification calls for slightly different machine learning models and, most importantly, different evaluation metrics.
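Before modelling, it can help to look at how the two classes are distributed, because a lopsided target makes plain accuracy harder to interpret. The snippet below is a minimal sketch on a made-up `toy_df` (a hypothetical stand-in, not the hotel data); in the notebook the same call would be applied to `hotel_df["IsCanceled"]`.

```python
import pandas as pd

# Toy stand-in for the hotel data; the values are made up for illustration.
toy_df = pd.DataFrame({"IsCanceled": [0, 0, 1, 0, 1, 0, 0, 1]})

# Treat the 0/1 target as a category and compute the share of each class.
class_share = toy_df["IsCanceled"].value_counts(normalize=True)
print(class_share)
```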

### Step 1: Splitting data into training and test sets

In [6]:
X = hotel_df.drop(["IsCanceled"], axis=1)
y = hotel_df["IsCanceled"]

In [7]:
# Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y)


In [8]:
print(f"{len(X_train) = } and {len(X_test) = }")

len(X_train) = 89407 and len(X_test) = 29803
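The split above uses the defaults, which hold out 25% of the rows and shuffle differently on every run. As a hedged sketch of two optional arguments, the example below fixes the shuffle with `random_state` and keeps the class proportions equal in both parts with `stratify`. The data (`toy_X`, `toy_y`) and the seed value are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 30% positives (made-up values for illustration only).
toy_X = pd.DataFrame({"LeadTime": np.arange(100)})
toy_y = pd.Series([1] * 30 + [0] * 70, name="IsCanceled")

# test_size=0.25 matches the default; random_state makes the split repeatable,
# stratify keeps the share of cancellations similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

A fixed seed is mainly useful for teaching and debugging, since it makes the reported metrics reproducible between runs.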


### Step 2: Preprocessing

Some important steps that precede training a machine learning model:
- One-hot encoding
- Scaling
- Creating polynomial features (interaction terms)
- Binning
- Dealing with outliers and missing values

Which of these steps are needed depends on the data type of each column.
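Two of the steps listed above, interaction terms and binning, are imported in this notebook but not used further on. The following is a small sketch of what they do, using a made-up array `toy`; the parameter choices are illustrative assumptions, not recommendations for the hotel data.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

# Two made-up numeric columns a and b (think: adults, nights).
toy = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 8.0]])

# degree=2 with interaction_only=True adds the product a*b but no squares.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
toy_poly = poly.fit_transform(toy)
print(toy_poly)  # columns: a, b, a*b

# Binning: cut the first column into 2 equal-width bins, encoded as 0 and 1.
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
toy_binned = binner.fit_transform(toy[:, [0]])
print(toy_binned.ravel())
```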

In scikit-learn, the `Pipeline` and `ColumnTransformer` classes offer an idiomatic way to chain multiple preprocessing steps and a model into a single workflow. The `Pipeline` class lets you assemble a sequence of transformations and a final model, which simplifies your code and helps prevent common mistakes, such as fitting preprocessing steps to the test data. The `ColumnTransformer` is particularly useful for applying different preprocessing to different columns, such as one-hot encoding for categorical variables and scaling for numerical variables. By combining these tools, you not only reduce the need for manual 'glue' code but also safeguard against the leakage of information from the test set into the training process. We strongly recommend using them throughout this course.

The syntax is quite simple:

*  `make_column_transformer` expects one or more tuples. The first element of each tuple is the transformer (the preprocessing step); the second is the list of columns it applies to (type: `list[str]`).
*  `make_pipeline` similarly expects its steps in order, so you typically pass the preprocessing first, followed by the model you want to apply.
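The two points above can be sketched on a miniature example. The DataFrame `toy_X`, the labels `toy_y`, and the choice of columns are made up for illustration; only the call pattern is the point here.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up miniature dataset, for illustration only.
toy_X = pd.DataFrame({
    "LeadTime": [5, 40, 200, 12, 310, 90],
    "hotel": ["City", "Resort", "City", "Resort", "City", "City"],
})
toy_y = [0, 0, 1, 0, 1, 1]

# Each tuple is (transformer, list of columns it applies to).
toy_preprocessing = make_column_transformer(
    (StandardScaler(), ["LeadTime"]),
    (OneHotEncoder(handle_unknown="ignore"), ["hotel"]),
)

# Steps run in the order given: preprocessing first, model last.
toy_pipe = make_pipeline(toy_preprocessing, LogisticRegression())
toy_pipe.fit(toy_X, toy_y)

print(toy_pipe.predict(toy_X))
```

Calling `fit` on the pipeline fits the scaler and encoder on the data passed in and nothing else, which is what keeps test rows out of the preprocessing statistics when only training data is supplied.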

Note that instead of handling null or empty values upfront (while cleaning the data), we can also handle them inside the pipeline with the `SimpleImputer` class.
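As a rough illustration of what `SimpleImputer` does, the sketch below fills one numeric and one categorical gap in a made-up DataFrame `toy`. The median strategy for the numeric column is an assumption chosen for the example; the notebook itself uses `strategy="constant"`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up columns with gaps, for illustration only.
toy = pd.DataFrame({
    "Children": [0.0, np.nan, 2.0, 1.0],
    "Meal": pd.Series(["BB", np.nan, "HB", "BB"], dtype=object),
})

# Numeric gap filled with the column median (the median of 0, 1, 2 is 1).
num_imputer = SimpleImputer(strategy="median")
children_filled = num_imputer.fit_transform(toy[["Children"]])

# Categorical gap filled with an explicit placeholder label.
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
meal_filled = cat_imputer.fit_transform(toy[["Meal"]])

print(children_filled.ravel())
print(meal_filled.ravel())
```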


In [9]:
num_features = ["LeadTime","ArrivalDateWeekNumber","ArrivalDateDayOfMonth",
                "StaysInWeekendNights","StaysInWeekNights","Adults","Children",
                "Babies","PreviousCancellations",
                "PreviousBookingsNotCanceled", "ADR"]

cat_features = ["hotel","ArrivalDateMonth","Meal","MarketSegment", "IsRepeatedGuest", "RequiredCarParkingSpaces", "TotalOfSpecialRequests",
                "DistributionChannel","ReservedRoomType","DepositType","CustomerType"]

In [10]:
# Make a pipeline for the numeric values
# (strategy="constant" without fill_value fills numeric gaps with 0)
numeric_preprocessing = make_pipeline(SimpleImputer(strategy="constant"), StandardScaler())

In [11]:
# Make a pipeline for the categorical values
categorical_preprocessing = make_pipeline(SimpleImputer(strategy="constant", fill_value="Unknown"), OneHotEncoder(handle_unknown='ignore'))


In [12]:
# compose the two pipelines you made
preprocessing = make_column_transformer((numeric_preprocessing, num_features), (categorical_preprocessing, cat_features))

### Step 3: Model Training + Model Prediction

In [13]:
# Logistic Regression
log_reg_pipe = make_pipeline(preprocessing, LogisticRegression())
log_reg_pipe.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
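The message above is a convergence warning: the solver stopped at its iteration limit before fully converging. One common response, as the message itself suggests, is to raise `max_iter`. The sketch below shows the idea on synthetic data from `make_classification` (a stand-in, not the hotel dataset); the value 1000 is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the hotel dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# max_iter defaults to 100; a larger budget gives the solver more room to converge.
demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_pipe.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
print(demo_pipe[-1].n_iter_)
```

In the notebook, the equivalent change would be `make_pipeline(preprocessing, LogisticRegression(max_iter=1000))`.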


In [14]:
predictions_log_reg_train = log_reg_pipe.predict(X_train)
predictions_log_reg_test = log_reg_pipe.predict(X_test)

### Step 4: Model Evaluation

In [15]:
# Calculate accuracy
accuracy_train = accuracy_score(y_train, predictions_log_reg_train)
accuracy_test = accuracy_score(y_test, predictions_log_reg_test)

# Calculate precision
precision_train = precision_score(y_train, predictions_log_reg_train)
precision_test = precision_score(y_test, predictions_log_reg_test)

# Calculate recall
recall_train = recall_score(y_train, predictions_log_reg_train)
recall_test = recall_score(y_test, predictions_log_reg_test)

# Calculate AUC score
auc_train = roc_auc_score(y_train, log_reg_pipe.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, log_reg_pipe.predict_proba(X_test)[:, 1])

# Print the evaluation metrics
print("Logistic Regression Model Evaluation:")
print("Training Accuracy:", accuracy_train)
print("Test Accuracy:", accuracy_test)
print("Training Precision:", precision_train)
print("Test Precision:", precision_test)
print("Training Recall:", recall_train)
print("Test Recall:", recall_test)
print("Training AUC:", auc_train)
print("Test AUC:", auc_test)

Logistic Regression Model Evaluation:
Training Accuracy: 0.8088740255237286
Test Accuracy: 0.8105224306277892
Training Precision: 0.8292903172844067
Test Precision: 0.8334769230769231
Training Recall: 0.609784806688196
Test Recall: 0.6119645761792879
Training AUC: 0.8521443872878045
Test AUC: 0.8525836012854449
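To make precision and recall more concrete, the sketch below recomputes both from the confusion matrix by hand and compares them with scikit-learn's functions. The labels `y_true` and `y_pred` are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted cancellations, how many were real
recall = tp / (tp + fn)     # of the real cancellations, how many were caught

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

Read this way, the test results above say that roughly 83% of the bookings the model flags as cancellations really are cancellations, while it catches only about 61% of all actual cancellations.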


### Step 5: Finding the best model: Cross-validation

In [16]:
# Evaluate using cross validation 
results_lr = cross_val_score(log_reg_pipe, X_train, y_train, cv=4, scoring="accuracy", n_jobs=-1) 
np.mean(results_lr)

0.8085831982130803

❓❓❓ **TASK**

Experiment with different machine learning models, preprocessing pipelines, and cross-validation settings, then evaluate and compare the resulting models.
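One possible starting point for the task is a loop that scores several candidate models with the same cross-validation setup. The sketch below runs on synthetic data from `make_classification` so that it is self-contained; the candidate list, the hyperparameters, and the `roc_auc` metric are illustrative assumptions. In the notebook, each model would be wrapped as `make_pipeline(preprocessing, model)` and scored on `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the hotel dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Same folds and metric for every candidate, so the means are comparable.
cv_means = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=4, scoring="roc_auc")
    cv_means[name] = scores.mean()
    print(f"{name}: {cv_means[name]:.3f}")
```

Only the model that wins on cross-validation should then be evaluated once on the held-out test set, so that the test score stays an honest estimate.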
