<img src="images/aaib.PNG" style="width:400px;height:250px;">

# About Today's Practice:

1. Content-Related: **Implementing Supervised ML Models: LR, SVM, DT, RF, AdaBoost, and GB!**
\
&nbsp;
2. **Dataset:** The dataset has 32 features (columns) and 119390 records (rows). Our main objective is to predict whether a **hotel booking** will be canceled by the customer, based on the reservation details available in our data.
\
&nbsp;

# Set-up 

In [None]:
# Commonly used libraries
import numpy as np
import pandas as pd

# From ScikitLearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Data Exploration & Preprocessing

In [None]:
# Load the dataset and check its first rows
data = pd.read_csv('datasets/hotel_bookings.csv')
data.head(10)

In [None]:
data.shape

In [None]:
# We drop some columns right away because they either have very low variance or would not make sense to use as predictors
data.drop(inplace=True, axis=1, labels=['agent', 'company','hotel','reservation_status_date'])

In [None]:
# Let's check for any null values, if there are any...
data.isnull().sum()

In [None]:
# As the focus of the practical is not on DPT, let's simply replace the null values with the mode.
data.fillna(data.mode().iloc[0], inplace=True)
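
To make the mode-based imputation above concrete, here is a minimal sketch on a made-up frame. The column values and the names `toy`, `modes`, and `filled` are illustrative assumptions, not part of the hotel dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical gap.
toy = pd.DataFrame({
    "children": [0.0, 0.0, np.nan, 2.0],
    "country": ["PRT", None, "PRT", "GBR"],
})

# mode() returns a DataFrame because ties are possible;
# .iloc[0] keeps the first mode of every column as a Series.
modes = toy.mode().iloc[0]

# fillna with a Series fills each column with its own mode.
filled = toy.fillna(modes)
print(filled)
```

Mode imputation is quick, but it pulls every missing entry toward the most frequent value, so it is best treated as a placeholder strategy.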

In [None]:
data.isnull().sum()

In [None]:
data.head()

In [None]:
# Now we separate the dependent and independent features
# The dependent variable is "is_canceled", which tells us whether a reservation was canceled (1) or not (0).
# After dropping 'hotel' above, it is the first column (index 0).
# X = ?
# y = ?

X = data.iloc[:,1:]
y = data.iloc[:,0]


In [None]:
# Use OneHotEncoder for the categorical features.
# Note: 'reservation_status' (Canceled / Check-Out / No-Show) is only known after the outcome and mirrors
# the target, so keeping it leaks the label; consider dropping it for a realistic evaluation.
ct = make_column_transformer(
    (OneHotEncoder(), ['meal', 'distribution_channel', 'reservation_status', 'country', 'arrival_date_month',
                       'market_segment', 'deposit_type', 'customer_type', 'reserved_room_type', 'assigned_room_type'
                      ]), remainder='passthrough')

In [None]:
# The ColumnTransformer is given the OneHotEncoder and the list of all categorical columns.
# Now we simply fit it and transform our independent variables.
X = ct.fit_transform(X).toarray()
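
The sketch below shows, on a tiny made-up frame, how one-hot encoding expands the column count. The frame `toy` and the transformer `toy_ct` are illustrative assumptions; only the column names are borrowed from the dataset.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Two categorical columns and one numeric column (hypothetical values).
toy = pd.DataFrame({
    "meal": ["BB", "HB", "BB", "SC"],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit", "No Deposit"],
    "lead_time": [10, 200, 35, 0],
})

toy_ct = make_column_transformer(
    (OneHotEncoder(), ["meal", "deposit_type"]),
    remainder="passthrough",  # keep lead_time untouched
)

encoded = toy_ct.fit_transform(toy)
# The result is sparse or dense depending on how many zeros it holds,
# so convert only when a sparse matrix comes back.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded

# 3 meal categories + 2 deposit types + 1 passthrough column = 6 columns
print(encoded.shape)
```

A column such as `country`, with well over a hundred distinct values, contributes one new column per value, which is where most of the growth in feature count comes from.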


In [None]:
X

In [None]:
y

In [None]:
# Now, we split data between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
X.shape

# Scaling & Dimensionality Reduction 

Note that the number of features just went **from 28 to 256**. That is a large number. The [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) arises when a dataset has too many variables: the data become sparse in the feature space, so a model needs more samples to generalize, and training slows down while many of the extra features carry little useful information.

To mitigate the curse of dimensionality we use [Dimensionality Reduction](https://en.wikipedia.org/wiki/Dimensionality_Reduction) methods. PCA, or [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis), is one of the most popular. PCA has one requirement, however: the features must be on a standard scale, because it is driven by variance and would otherwise be dominated by the features with the largest ranges. We standardize the data in the next cell:

In [None]:
# Scaling the data (Train and Test)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print("X_train ---------->\n", X_train, "\nX_test -------->\n", X_test)
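
The scaler above is fitted on the training split and only applied to the test split. A minimal sketch of what that means, using synthetic data (the arrays `toy_train` and `toy_test` are assumptions made for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

toy_sc = StandardScaler()
toy_train_scaled = toy_sc.fit_transform(toy_train)  # learns mean and std from train only
toy_test_scaled = toy_sc.transform(toy_test)        # reuses the train statistics

# Train columns end up with mean 0 and std 1 by construction.
print(toy_train_scaled.mean(axis=0).round(3), toy_train_scaled.std(axis=0).round(3))
# Test columns are only approximately standardized, which is expected.
print(toy_test_scaled.mean(axis=0).round(3), toy_test_scaled.std(axis=0).round(3))
```

Fitting the scaler on the test data as well would let test-set statistics influence preprocessing, which makes the evaluation slightly optimistic.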

In [None]:
# Implementing PCA - To reduce Dimensionality 
pca = PCA(n_components = 50)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
explained_variance
# We ask for 50 components here. In practice, you would first fit with n_components=None, inspect the explained
# variance (EV), and then keep the number of components that reaches a chosen cumulative-EV threshold.
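
One way to pick the number of components from the explained variance is sketched below. It runs on synthetic data from `make_classification`, and the 95% threshold is an assumption chosen for illustration, not a value taken from this practical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training data.
X_demo, _ = make_classification(n_samples=500, n_features=40,
                                n_informative=10, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Step 1: keep every component and look at the cumulative explained variance.
pca_full = PCA(n_components=None).fit(X_demo)
cumulative_ev = np.cumsum(pca_full.explained_variance_ratio_)

# Step 2: smallest number of components whose cumulative EV reaches the threshold.
threshold = 0.95  # assumed threshold
n_keep = int(np.searchsorted(cumulative_ev, threshold) + 1)
print(n_keep, cumulative_ev[n_keep - 1])

# Shortcut: a float in (0, 1) asks PCA to choose the count for that EV share.
pca_auto = PCA(n_components=threshold).fit(X_demo)
print(pca_auto.n_components_)
```

On the real data, the same two steps would be applied to the scaled `X_train` before fixing `n_components`.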

# Models Implementation - LR

In [None]:
# Logistic Regression
classifier = LogisticRegression(random_state = 0, max_iter=1000, solver = 'lbfgs')
classifier.fit(X_train, y_train)

Now, let's see how our model performs on the test data

In [None]:
# Let's check how our model performs on the test data
y_pred = classifier.predict(X_test)

A simple way to assess our model is to build a confusion matrix, from which the accuracy can be read off directly:

In [None]:
# Confusion matrix (CM)
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
# Accuracy score (computed on the test set, so y_pred is compared with y_test)
ac = accuracy_score(y_test, y_pred)
ac
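
To show how the confusion matrix and the accuracy score relate, here is a small sketch with hand-made labels (`y_true_demo` and `y_pred_demo` are made-up values). For binary labels, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()

# Accuracy is the share of correct predictions: the diagonal over the total.
acc_from_cm = (tp + tn) / cm_demo.sum()
print(cm_demo)
print(acc_from_cm, accuracy_score(y_true_demo, y_pred_demo))  # both 0.8

# The same four counts also give precision and recall for the positive class.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)  # 0.75 0.75
```

When cancellations and non-cancellations are unevenly represented, precision and recall tend to be more informative than accuracy alone.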

# Models Implementation - SVM

In [None]:
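# Support Vector Machine with an RBF kernel (max_iter caps the solver iterations, so a ConvergenceWarning is possible)
clf_svm = svm.SVC(max_iter=1000, gamma='scale', kernel = "rbf", random_state=0)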
clf_svm.fit(X_train, y_train)

In [None]:
y_pred = clf_svm.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
ac = accuracy_score(y_test, y_pred)
ac

# Models Implementation - DT

In [None]:
# Decision Tree (kept shallow, with minimum split and leaf sizes to limit overfitting)
clf_tree = tree.DecisionTreeClassifier(max_depth=5, criterion = "gini", min_samples_split=100,
                                       min_samples_leaf= 30, random_state=0)
clf_tree.fit(X_train, y_train)

In [None]:
y_pred = clf_tree.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
ac = accuracy_score(y_test, y_pred)
ac

# Models Implementation - RF

In [None]:
# Random Forest (30 trees, each limited to depth 5)
clf_rf = RandomForestClassifier(n_estimators=30, max_depth=5, random_state=0)
clf_rf.fit(X_train, y_train)

In [None]:
y_pred = clf_rf.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
ac = accuracy_score(y_test, y_pred)
ac

# Models Implementation - AB

In [None]:
# AdaBoost (100 boosting rounds)
clf_ab = AdaBoostClassifier(n_estimators=100, learning_rate=1, random_state=0)
clf_ab.fit(X_train, y_train)

In [None]:
y_pred = clf_ab.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
ac = accuracy_score(y_test, y_pred)
ac

# Models Implementation - GB

In [None]:
# Gradient Boosting (30 depth-1 trees, i.e. decision stumps)
clf_gb = GradientBoostingClassifier(n_estimators=30, learning_rate=0.1, max_depth=1, random_state=0)
clf_gb.fit(X_train, y_train)

In [None]:
y_pred = clf_gb.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
ac = accuracy_score(y_test, y_pred)
ac
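
The six sections above repeat the same fit, predict, and score steps. As a sketch, the same comparison can be written as one loop. It runs on synthetic data from `make_classification` so that it is self-contained; the names `models`, `scores`, and the `Xd_*`/`yd_*` splits are assumptions chosen so that the notebook's own variables are not overwritten.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed hotel data.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000, random_state=0),
    "SVM": SVC(kernel="rbf", gamma="scale", random_state=0),
    "DT": DecisionTreeClassifier(max_depth=5, random_state=0),
    "RF": RandomForestClassifier(n_estimators=30, max_depth=5, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "GB": GradientBoostingClassifier(n_estimators=30, max_depth=1, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    scores[name] = accuracy_score(yd_test, model.predict(Xd_test))

# Print the models from best to worst test accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>8}: {acc:.4f}")
```

Swapping the synthetic split for the notebook's `X_train`, `X_test`, `y_train`, and `y_test` would reproduce the per-model accuracies in one table.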

# Voting Classifier

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[
        ('log_reg', classifier),
        ('svm_clf', clf_svm),
        ('dt_clf', clf_tree),
        ('rf_clf', clf_rf),
        ('ada_clf', clf_ab),
        ('gb_clf', clf_gb)
    ],
    voting='hard'  # Can be 'soft' for averaging probabilities (the SVC then needs probability=True)
)
    
# Train and evaluate the Voting Classifier
accuracy_scores = cross_val_score(voting_clf, X_train, y_train, cv=5, scoring='accuracy')
f1_scores = cross_val_score(voting_clf, X_train, y_train, cv=5, scoring='f1')

print(f'Voting Classifier Accuracy: {accuracy_scores.mean():.4f}')
print(f'Voting Classifier F1-Score: {f1_scores.mean():.4f}')
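
The difference between hard and soft voting can be seen in a few lines of NumPy. The probabilities below are made-up outputs of three hypothetical classifiers for four samples; this sketch mirrors the idea behind `VotingClassifier` for a binary problem with equal weights, not its actual implementation.

```python
import numpy as np

# Rows: classifiers A, B, C. Columns: predicted probability of class 1 per sample.
proba_class1 = np.array([
    [0.95, 0.45, 0.20, 0.60],  # classifier A
    [0.40, 0.45, 0.30, 0.70],  # classifier B
    [0.40, 0.90, 0.10, 0.40],  # classifier C
])

# Hard voting: each classifier casts one vote, the majority wins.
hard_votes = (proba_class1 > 0.5).astype(int)
hard_pred = (hard_votes.sum(axis=0) > proba_class1.shape[0] / 2).astype(int)

# Soft voting: average the probabilities, then threshold the mean.
soft_pred = (proba_class1.mean(axis=0) > 0.5).astype(int)

print(hard_pred)  # [0 0 0 1]
print(soft_pred)  # [1 1 0 1]
```

Samples 0 and 1 flip because a single very confident classifier outweighs two mildly opposed ones under soft voting, while hard voting ignores confidence.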

In [95]:
# Re-create all base models (not fitted yet); the SVC gets probability=True so it can take part in soft voting
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

classifier = LogisticRegression(random_state=0, max_iter=1000, solver='lbfgs')
clf_svm = SVC(max_iter=1000, gamma='scale', kernel='rbf', random_state=0, probability=True)
clf_tree = DecisionTreeClassifier(max_depth=5, criterion='gini', min_samples_split=100, min_samples_leaf=30, random_state=0)
clf_rf = RandomForestClassifier(n_estimators=30, max_depth=5, random_state=0)
clf_ab = AdaBoostClassifier(n_estimators=100, learning_rate=1, random_state=0)
clf_gb = GradientBoostingClassifier(n_estimators=30, learning_rate=0.1, max_depth=1, random_state=0)

In [96]:
from sklearn.ensemble import VotingClassifier
from joblib import parallel_backend

# Combine models into a Voting Classifier
voting_clf = VotingClassifier(
    estimators=[
        ('log_reg', classifier),
        ('svm_clf', clf_svm),
        ('dt_clf', clf_tree),
        ('rf_clf', clf_rf),
        ('ada_clf', clf_ab),
        ('gb_clf', clf_gb)
    ],
    voting='soft'  # Use 'hard' for hard voting
)

# Use parallel processing if possible
with parallel_backend('threading', n_jobs=-1):
    voting_clf.fit(X_train, y_train)

# Predict using the Voting Classifier
y_pred = voting_clf.predict(X_test)

# Evaluate the Voting Classifier
accuracy = accuracy_score(y_test, y_pred)
print(f'Voting Classifier Accuracy: {accuracy:.4f}')




Voting Classifier Accuracy: 0.9964


## References:

- [Dataset from Kaggle](https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset)
- [Hotel Booking (Logistic Regression) by Amit Sharma](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand)