

### Exploratory data analysis

First, we must explore the data in order to better understand what we have and the statistical properties of the dataset.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("customer_booking.csv", encoding='latin-1')
df.head()

In [None]:
df.info()

The `.info()` method gives us a data description: the names of the columns, their data types, and how many non-null values each one has. Fortunately, we have no null values. Some of these columns should be converted into different data types, e.g. `flight_day`, which is stored as text.

To provide more context, below is a more detailed data description, explaining exactly what each column means:

- `num_passengers` = number of passengers travelling
- `sales_channel` = sales channel booking was made on
- `trip_type` = trip type (Round Trip, One Way, Circle Trip)
- `purchase_lead` = number of days between travel date and booking date
- `length_of_stay` = number of days spent at destination
- `flight_hour` = hour of flight departure
- `flight_day` = day of week of flight departure
- `route` = origin -> destination flight route
- `booking_origin` = country from where booking was made
- `wants_extra_baggage` = if the customer wanted extra baggage in the booking
- `wants_preferred_seat` = if the customer wanted a preferred seat in the booking
- `wants_in_flight_meals` = if the customer wanted in-flight meals in the booking
- `flight_duration` = total duration of flight (in hours)
- `booking_complete` = flag indicating if the customer completed the booking

Before we compute any statistics on the data, let's do the necessary data conversions.

In [None]:
df["flight_day"].unique()

In [None]:
mapping = {
    "Mon": 1,
    "Tue": 2,
    "Wed": 3,
    "Thu": 4,
    "Fri": 5,
    "Sat": 6,
    "Sun": 7,
}

df["flight_day"] = df["flight_day"].map(mapping)

In [None]:
df["flight_day"].unique()
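One caveat with `.map`: any value that is not a key in the dictionary becomes `NaN`, so rerunning the mapping cell on an already converted column silently turns every entry into `NaN`. A minimal sketch of a guarded version is shown below. The helper name `encode_flight_day` and the sample series are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

DAY_TO_NUMBER = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}


def encode_flight_day(days: pd.Series) -> pd.Series:
    """Map day abbreviations to 1-7 and fail loudly on unexpected labels."""
    encoded = days.map(DAY_TO_NUMBER)
    # Values missing from the dictionary come back as NaN; collect them.
    unmapped = days[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped flight_day values: {list(unmapped)}")
    return encoded.astype("int64")


# Hypothetical sample standing in for df["flight_day"]
sample = pd.Series(["Sat", "Wed", "Mon"], name="flight_day")
encoded_sample = encode_flight_day(sample)
print(encoded_sample.tolist())
```

Calling `encode_flight_day` a second time on the already encoded column raises a `ValueError` instead of producing a column of `NaN`.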

In [None]:
df.describe()

### Check for missing values

In [None]:
print(df.isnull().sum())

In [None]:
df.head()

### Encoding categorical data

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
categorical_features = ['sales_channel', 'trip_type', 'route', 'booking_origin']
df = pd.get_dummies(df, columns=categorical_features)
df.head()
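One-hot encoding creates one column per distinct value, so columns such as `route` and `booking_origin` can produce a very wide table if they contain many categories. One possible way to limit this, sketched below, is to merge infrequent categories into a single label before calling `pd.get_dummies`. The helper `group_rare_categories`, the `min_count` threshold, and the toy route labels are all assumptions made for illustration.

```python
import pandas as pd


def group_rare_categories(values: pd.Series, min_count: int,
                          other_label: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent categories as they are, relabel the rare ones.
    return values.where(~values.isin(rare), other_label)


# Toy data with invented route labels (not taken from the real dataset)
toy = pd.DataFrame({"route": ["A-B", "A-B", "A-B", "C-D", "C-D", "E-F", "G-H"]})
toy["route"] = group_rare_categories(toy["route"], min_count=2)

encoded = pd.get_dummies(toy, columns=["route"])
print(encoded.columns.tolist())
```

Whether grouping helps depends on the model and on how informative the rare categories are, so the threshold is best treated as a tuning choice.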

In [None]:
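# Fill any missing values in the numeric columns with the column mean.
# Assigning the scalar mean directly would overwrite every row with the same value.
numeric_columns = ['purchase_lead', 'length_of_stay', 'flight_hour', 'flight_duration']
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())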


df.head()

### Correlations

In [None]:
plt.figure(figsize=(45, 45))
correlation = df.corr()
sns.heatmap(
    correlation,
    xticklabels=correlation.columns.values,
    yticklabels=correlation.columns.values,
    annot=True,
    annot_kws={'size': 12}
)
# Axis ticks size
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()

The heatmap above shows the pairwise correlation matrix for all columns in the dataset.
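With many one-hot encoded columns, a full heatmap can be hard to read. A lighter alternative, sketched here on toy data, is to look only at each feature's correlation with the target and sort by absolute value. The helper `rank_by_target_correlation` and the toy values are assumptions for illustration.

```python
import pandas as pd


def rank_by_target_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return each feature's correlation with the target, strongest first."""
    correlations = frame.drop(columns=[target]).corrwith(frame[target])
    order = correlations.abs().sort_values(ascending=False).index
    return correlations.loc[order]


# Toy data using column names from the dataset; the values are invented
toy = pd.DataFrame({
    "wants_extra_baggage": [1, 0, 1, 1, 0, 0],
    "flight_hour": [7, 9, 12, 3, 15, 9],
    "booking_complete": [1, 0, 1, 1, 0, 0],
})

ranked = rank_by_target_correlation(toy, "booking_complete")
print(ranked)
```

On the real data this would be called as `rank_by_target_correlation(df, "booking_complete")`, optionally followed by `.head(20)` to keep only the strongest relationships.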

### Splitting the dataset into training and test sets

In [None]:
y = df['booking_complete']
X = df.drop(columns=['booking_complete'])

In [None]:
print(X)

In [None]:
print(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
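If the target classes are imbalanced, a purely random split can leave the training and test sets with different class proportions. The sketch below uses the `stratify` parameter of `train_test_split` to keep the proportions aligned. The toy arrays and the 20% positive rate are invented for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 4 of them positive (an assumed 20% positive rate)
X_toy = pd.DataFrame({"feature": np.arange(20)})
y_toy = pd.Series([1] * 4 + [0] * 16, name="booking_complete")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Both subsets keep the same share of positive labels
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Checking `y.value_counts(normalize=True)` on the real target first shows whether stratification is worth adding.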

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Create an imputer to replace missing values (NaN) with the mean of each column
imputer = SimpleImputer(strategy='mean')

In [None]:
# Fit the imputer on the training data and transform both training and testing data
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
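Fitting the imputer on the training data only keeps information from the test set out of the preprocessing step. One way to make that hard to get wrong is to chain the imputer and the classifier in a scikit-learn pipeline, as in the sketch below. The toy arrays and the small `n_estimators` value are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data with a few missing entries (invented values)
X_toy_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])
y_toy_train = np.array([0, 0, 1, 1])
X_toy_test = np.array([[np.nan, 35.0], [1.5, np.nan]])

pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)

# fit() learns the column means from the training rows only;
# predict() reuses those means to fill gaps in the test rows.
pipeline.fit(X_toy_train, y_toy_train)
toy_predictions = pipeline.predict(X_toy_test)
learned_means = pipeline.named_steps["simpleimputer"].statistics_
print(learned_means)
```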

In [None]:
# Impute missing values in the target variable (y_train)
imputer_y = SimpleImputer(strategy='most_frequent')  # Use most frequent for categorical target
y_train = imputer_y.fit_transform(y_train.values.reshape(-1, 1))  # Reshape for single feature
y_train = y_train.ravel() # Flatten the array

In [None]:
# Train a random forest classifier on the training data
model = RandomForestClassifier(n_estimators=1000)
model.fit(X_train, y_train)

In [None]:
# Accuracy on the training data (R² is a regression metric and does not apply to class labels)
from sklearn.metrics import accuracy_score

y_train_pred = model.predict(X_train)  # Predict on the training data
train_accuracy = accuracy_score(y_train, y_train_pred)  # Compare true labels with predictions
print("Accuracy on training data:", train_accuracy)

In [None]:
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Evaluate the model on the held-out test set
y_pred = model.predict(X_test)
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
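To see how these metrics relate to each other, the sketch below derives accuracy, precision, recall, and F1 from a binary confusion matrix in scikit-learn's layout, where rows are true labels and columns are predicted labels: `[[TN, FP], [FN, TP]]`. The counts are invented for illustration and are not results from this model.

```python
def metrics_from_confusion_matrix(cm):
    """Derive binary classification metrics from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented counts: 85 true negatives, 15 true positives in total
example_cm = [[80, 5], [10, 5]]
example_metrics = metrics_from_confusion_matrix(example_cm)
for name, value in example_metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case the accuracy is high because most negatives are classified correctly, while recall on the positive class is low, which is why accuracy alone can be misleading on imbalanced targets.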
