# Task 2

---

## Predictive modeling of customer bookings

This Jupyter notebook includes some code to get you started with this predictive modeling task. We will use various packages for data manipulation, feature engineering and machine learning.

### Exploratory data analysis

First, we must explore the data in order to better understand what we have and the statistical properties of the dataset.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("customer_booking.csv", encoding="ISO-8859-1")
df.head()

Unnamed: 0,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete
0,2,Internet,RoundTrip,262,19,7,Sat,AKLDEL,New Zealand,1,0,0,5.52,0
1,1,Internet,RoundTrip,112,20,3,Sat,AKLDEL,New Zealand,0,0,0,5.52,0
2,2,Internet,RoundTrip,243,22,17,Wed,AKLDEL,India,1,1,0,5.52,0
3,1,Internet,RoundTrip,96,31,4,Sat,AKLDEL,New Zealand,0,0,1,5.52,0
4,2,Internet,RoundTrip,68,22,15,Wed,AKLDEL,India,1,0,1,5.52,0


In [3]:
df = df.reset_index()
df = df.rename(columns = {"index" : "id"})
df.head()

Unnamed: 0,id,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete
0,0,2,Internet,RoundTrip,262,19,7,Sat,AKLDEL,New Zealand,1,0,0,5.52,0
1,1,1,Internet,RoundTrip,112,20,3,Sat,AKLDEL,New Zealand,0,0,0,5.52,0
2,2,2,Internet,RoundTrip,243,22,17,Wed,AKLDEL,India,1,1,0,5.52,0
3,3,1,Internet,RoundTrip,96,31,4,Sat,AKLDEL,New Zealand,0,0,1,5.52,0
4,4,2,Internet,RoundTrip,68,22,15,Wed,AKLDEL,India,1,0,1,5.52,0


In [4]:
# Count incomplete bookings for each passenger count from 1 to 4
for n in range(1, 5):
    count = len(df.loc[(df["num_passengers"] == n) & (df["booking_complete"] == 0)])
    print(f"Number of passengers is {n} and booking is not complete", count)

Number of passengers is 1 and booking is not complete 26897
Number of passengers is 2 and booking is not complete 10753
Number of passengers is 3 and booking is not complete 2450
Number of passengers is 4 and booking is not complete 1509
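
The same breakdown can be produced for every passenger count at once with a cross-tabulation. This is a minimal sketch, assuming a small hypothetical DataFrame `toy` that stands in for `df`; its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only.
toy = pd.DataFrame({
    "num_passengers":   [1, 1, 2, 2, 3, 1],
    "booking_complete": [0, 1, 0, 0, 1, 0],
})

# Rows: passenger count; columns: booking_complete flag; cells: number of bookings.
counts = pd.crosstab(toy["num_passengers"], toy["booking_complete"])
print(counts)

# Incomplete bookings (flag 0) per passenger count.
incomplete = counts[0]
```

On the real data, `pd.crosstab(df["num_passengers"], df["booking_complete"])` should give the counts printed above in its `0` column, along with the completed bookings in its `1` column.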


In [5]:
df.to_csv("customer_booking_new.csv")

The `.head()` method shows the first 5 rows of the dataset, which is useful for visually inspecting the columns.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     50000 non-null  int64  
 1   num_passengers         50000 non-null  int64  
 2   sales_channel          50000 non-null  object 
 3   trip_type              50000 non-null  object 
 4   purchase_lead          50000 non-null  int64  
 5   length_of_stay         50000 non-null  int64  
 6   flight_hour            50000 non-null  int64  
 7   flight_day             50000 non-null  object 
 8   route                  50000 non-null  object 
 9   booking_origin         50000 non-null  object 
 10  wants_extra_baggage    50000 non-null  int64  
 11  wants_preferred_seat   50000 non-null  int64  
 12  wants_in_flight_meals  50000 non-null  int64  
 13  flight_duration        50000 non-null  float64
 14  booking_complete       50000 non-null  int64  
dtypes: float64(1), int64(9), object(5)

The `.info()` method gives us a data description: the names of the columns, their data types and how many non-null values each column has. Every column has 50,000 non-null entries, so there are no missing values. Some columns should be converted to different data types, e.g. `flight_day`, which is stored as a string.

To provide more context, below is a more detailed data description, explaining exactly what each column means:

- `num_passengers` = number of passengers travelling
- `sales_channel` = sales channel booking was made on
- `trip_type` = trip type (Round Trip, One Way, Circle Trip)
- `purchase_lead` = number of days between travel date and booking date
- `length_of_stay` = number of days spent at destination
- `flight_hour` = hour of flight departure
- `flight_day` = day of week of flight departure
- `route` = origin -> destination flight route
- `booking_origin` = country from where booking was made
- `wants_extra_baggage` = if the customer wanted extra baggage in the booking
- `wants_preferred_seat` = if the customer wanted a preferred seat in the booking
- `wants_in_flight_meals` = if the customer wanted in-flight meals in the booking
- `flight_duration` = total duration of flight (in hours)
- `booking_complete` = flag indicating if the customer completed the booking

Before we compute any statistics on the data, let's do any necessary data conversion.

In [7]:
df["flight_day"].unique()

array(['Sat', 'Wed', 'Thu', 'Mon', 'Sun', 'Tue', 'Fri'], dtype=object)

In [8]:
mapping = {
    "Mon": 1,
    "Tue": 2,
    "Wed": 3,
    "Thu": 4,
    "Fri": 5,
    "Sat": 6,
    "Sun": 7,
}

df["flight_day"] = df["flight_day"].map(mapping)

In [9]:
df["flight_day"].unique()

array([6, 3, 4, 1, 7, 2, 5])
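
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so it can be worth confirming that the mapping covered every label. The sketch below uses a hypothetical series `days`; the `"Sa"` label is an invented typo that shows the failure case.

```python
import pandas as pd

mapping = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}

# Hypothetical input with one misspelled label.
days = pd.Series(["Sat", "Wed", "Sa"])
encoded = days.map(mapping)

# Labels that had no entry in the mapping show up as NaN after the map.
unmapped = days[encoded.isna()].unique().tolist()
print(unmapped)
```

In this notebook the check passes implicitly: the output above contains only the integers 1 to 7.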

In [10]:
df.describe()

Unnamed: 0,id,num_passengers,purchase_lead,length_of_stay,flight_hour,flight_day,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,24999.5,1.59124,84.94048,23.04456,9.06634,3.81442,0.66878,0.29696,0.42714,7.277561,0.14956
std,14433.901067,1.020165,90.451378,33.88767,5.41266,1.992792,0.470657,0.456923,0.494668,1.496863,0.356643
min,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4.67,0.0
25%,12499.75,1.0,21.0,5.0,5.0,2.0,0.0,0.0,0.0,5.62,0.0
50%,24999.5,1.0,51.0,17.0,9.0,4.0,1.0,0.0,0.0,7.57,0.0
75%,37499.25,2.0,115.0,28.0,13.0,5.0,1.0,1.0,1.0,8.83,0.0
max,49999.0,9.0,867.0,778.0,23.0,7.0,1.0,1.0,1.0,9.5,1.0


The `.describe()` method gives us a summary of descriptive statistics over the entire dataset (by default, only numeric columns are included). This gives us a quick overview of a few things such as the mean, min, max and overall distribution of each column.
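
The non-numeric columns can be summarised too by telling `.describe()` which dtypes to leave out. A small sketch, assuming a hypothetical two-column frame `toy` in place of `df`:

```python
import pandas as pd

# Hypothetical stand-in for df with one string column and one numeric column.
toy = pd.DataFrame({
    "sales_channel": ["Internet", "Internet", "Mobile"],
    "num_passengers": [2, 1, 2],
})

# For non-numeric columns describe() reports count, unique, top and freq.
cat_summary = toy.describe(exclude=["number"])
print(cat_summary)
```

Here `top` is the most frequent label and `freq` is how often it occurs.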

From this point, you should continue exploring the dataset with some visualisations and other metrics that you think may be useful. Then, you should prepare your dataset for predictive modelling. Finally, you should train your machine learning model, evaluate it with performance metrics and output visualisations for the contributing variables. All of this analysis should be summarised in your single slide.

In [11]:
df.columns

Index(['id', 'num_passengers', 'sales_channel', 'trip_type', 'purchase_lead',
       'length_of_stay', 'flight_hour', 'flight_day', 'route',
       'booking_origin', 'wants_extra_baggage', 'wants_preferred_seat',
       'wants_in_flight_meals', 'flight_duration', 'booking_complete'],
      dtype='object')

In `booking_complete`, the majority of entries (42,522) have a value of 0, while a smaller portion (7,478) have a value of 1, so the target classes are imbalanced.

In [12]:
df["booking_complete"].value_counts().reset_index()

Unnamed: 0,index,booking_complete
0,0,42522
1,1,7478
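
Expressing the counts as proportions makes the imbalance easier to read; in this dataset 7,478 of 50,000 bookings (about 15%) are complete. The sketch below uses a hypothetical label series `y_toy` with invented values, and also derives the accuracy that a model would get by always predicting the majority class.

```python
import pandas as pd

# Hypothetical labels with a similar kind of imbalance (values are illustrative).
y_toy = pd.Series([0, 0, 0, 0, 0, 1], name="booking_complete")

# Share of each class rather than the raw count.
shares = y_toy.value_counts(normalize=True)
print(shares.round(3))

# Accuracy of a model that always predicts the majority class.
majority_baseline = shares.max()
```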


In [13]:
# find categorical variables

categorical = [var for var in df.columns if df[var].dtype=='O']

print(f'There are {len(categorical)} categorical variables\n')

print('The categorical variables are :\n\n',categorical)

There are 4 categorical variables

The categorical variables are :

 ['sales_channel', 'trip_type', 'route', 'booking_origin']


In [14]:
df[categorical]

Unnamed: 0,sales_channel,trip_type,route,booking_origin
0,Internet,RoundTrip,AKLDEL,New Zealand
1,Internet,RoundTrip,AKLDEL,New Zealand
2,Internet,RoundTrip,AKLDEL,India
3,Internet,RoundTrip,AKLDEL,New Zealand
4,Internet,RoundTrip,AKLDEL,India
...,...,...,...,...
49995,Internet,RoundTrip,PERPNH,Australia
49996,Internet,RoundTrip,PERPNH,Australia
49997,Internet,RoundTrip,PERPNH,Australia
49998,Internet,RoundTrip,PERPNH,Australia


### Summary of categorical variables

In [15]:
df["sales_channel"].unique() # has 2 unique values

array(['Internet', 'Mobile'], dtype=object)

In [16]:
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
sns.__version__

'0.12.2'

In [18]:
# pandas is already imported above.
# Note: the pandas_profiling package has since been renamed ydata-profiling.
from pandas_profiling import ProfileReport


In [19]:
profile = ProfileReport(df)
profile.to_file("report.html")


In [20]:
df["sales_channel"].value_counts().reset_index()

Unnamed: 0,index,sales_channel
0,Internet,44382
1,Mobile,5618


In [21]:
df["trip_type"].unique() # has 3 unique values

array(['RoundTrip', 'CircleTrip', 'OneWay'], dtype=object)

In [22]:
df["trip_type"].value_counts().reset_index()

Unnamed: 0,index,trip_type
0,RoundTrip,49497
1,OneWay,387
2,CircleTrip,116


In [23]:
df["route"].nunique() # has 799 unique values

799

In [24]:
df["booking_origin"].nunique() # has 104 unique values

104

In [25]:
df["booking_origin"].value_counts()

Australia               17872
Malaysia                 7174
South Korea              4559
Japan                    3885
China                    3387
                        ...  
Panama                      1
Tonga                       1
Tanzania                    1
Bulgaria                    1
Svalbard & Jan Mayen        1
Name: booking_origin, Length: 104, dtype: int64

In [26]:
for var in categorical:

    print(var, ' contains ', len(df[var].unique()), ' labels')

sales_channel  contains  2  labels
trip_type  contains  3  labels
route  contains  799  labels
booking_origin  contains  104  labels
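
The label counts can also be collected in a single sorted series, which is handy when deciding how to encode high-cardinality columns such as `route`. A minimal sketch on a hypothetical frame `toy` (the route codes are placeholders):

```python
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only.
toy = pd.DataFrame({
    "sales_channel": ["Internet", "Mobile", "Internet"],
    "route": ["AKLDEL", "PERPNH", "AKLKUL"],
    "num_passengers": [2, 1, 2],
})

# Number of distinct labels per non-numeric column, highest cardinality first.
cardinality = (
    toy.select_dtypes(exclude="number")
       .nunique()
       .sort_values(ascending=False)
)
print(cardinality)
```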


In [27]:
# find numerical variables

numerical = [var for var in df.columns if df[var].dtype!='O']

print(f'There are {len(numerical)} numerical variables\n')

print('The numerical variables are :', numerical)

There are 11 numerical variables

The numerical variables are : ['id', 'num_passengers', 'purchase_lead', 'length_of_stay', 'flight_hour', 'flight_day', 'wants_extra_baggage', 'wants_preferred_seat', 'wants_in_flight_meals', 'flight_duration', 'booking_complete']


In [28]:
df[numerical].head()

Unnamed: 0,id,num_passengers,purchase_lead,length_of_stay,flight_hour,flight_day,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete
0,0,2,262,19,7,6,1,0,0,5.52,0
1,1,1,112,20,3,6,0,0,0,5.52,0
2,2,2,243,22,17,3,1,1,0,5.52,0
3,3,1,96,31,4,6,0,0,1,5.52,0
4,4,2,68,22,15,3,1,0,1,5.52,0


In [29]:
X = df.drop(["booking_complete"], axis=1)
y = df["booking_complete"]

In [30]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)
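
With an imbalanced target, `train_test_split` can be asked to preserve the class ratio in both parts through its `stratify` argument. The split above does not use it; the sketch below shows the option on hypothetical arrays `X_toy` and `y_toy`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 85 negatives and 15 positives.
y_toy = np.array([0] * 85 + [1] * 15)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```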

In [32]:
# display categorical variables

categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

# display numerical variables

numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

print(categorical)
print("\n")
print(numerical)

['sales_channel', 'trip_type', 'route', 'booking_origin']


['id', 'num_passengers', 'purchase_lead', 'length_of_stay', 'flight_hour', 'flight_day', 'wants_extra_baggage', 'wants_preferred_seat', 'wants_in_flight_meals', 'flight_duration']


In [33]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [34]:
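cols = ['sales_channel', 'trip_type', 'route', 'booking_origin']

# Encode each categorical column of the training set as integer codes.
for var in cols:
    X_train[var] = le.fit_transform(X_train[var].str.strip())

# Caveat: the encoder is refit on X_test here, so a given integer code may
# stand for a different category in X_test than in X_train.
for var in cols:
    X_test[var] = le.fit_transform(X_test[var].str.strip())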
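
Fitting a label encoder separately on the training and test sets assigns codes independently, so the two sets may not share the same category-to-integer mapping. One alternative is to learn the mapping from the training data only and reuse it on the test data. The sketch below swaps in scikit-learn's `OrdinalEncoder`, which can assign a placeholder code to categories that appear only in the test set; the frames `train` and `test` are hypothetical, and `-1` as the placeholder is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test frames; "Japan" appears only in the test set.
train = pd.DataFrame({"booking_origin": ["India", "New Zealand", "Australia"]})
test = pd.DataFrame({"booking_origin": ["New Zealand", "Japan"]})

# Unseen test categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_codes = encoder.fit_transform(train)  # learn the categories from train only
test_codes = encoder.transform(test)        # reuse the same codes on test
print(train_codes.ravel(), test_codes.ravel())
```

Switching the notebook to this approach would change the encoded test values and the scores reported later, so the recorded outputs would need to be regenerated.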

In [35]:
X_train.shape, X_test.shape

((35000, 14), (15000, 14))

In [36]:
from imblearn.over_sampling import SMOTE
from collections import Counter

In [37]:
print("Class distribution in training data before oversampling:", Counter(y_train))

Class distribution in training data before oversampling: Counter({0: 29751, 1: 5249})


In [38]:
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [39]:
print("Class distribution in training data after oversampling:", Counter(y_train_resampled))

Class distribution in training data after oversampling: Counter({1: 29751, 0: 29751})


In [40]:
cols_names = X.columns
cols_names

Index(['id', 'num_passengers', 'sales_channel', 'trip_type', 'purchase_lead',
       'length_of_stay', 'flight_hour', 'flight_day', 'route',
       'booking_origin', 'wants_extra_baggage', 'wants_preferred_seat',
       'wants_in_flight_meals', 'flight_duration'],
      dtype='object')

In [41]:
X_train_resampled

Unnamed: 0,id,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration
0,17325,1,0,2,95,46,11,5,558,47,1,0,0,4.750000
1,13544,1,1,2,112,18,13,7,404,48,1,0,0,6.620000
2,49844,1,0,2,284,6,9,1,697,48,1,0,0,4.670000
3,16371,1,1,2,3,86,10,4,508,4,1,0,0,8.580000
4,13084,1,0,2,114,36,23,3,396,4,1,0,1,8.830000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59497,30318,4,0,2,236,4,8,4,177,88,1,0,0,8.670000
59498,2837,1,0,2,61,80,4,5,54,8,0,0,1,6.075579
59499,42319,1,0,2,252,6,9,5,5,48,1,0,1,8.830000
59500,49956,2,0,2,25,6,5,6,697,48,0,0,1,4.670000
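
The original training set has 35,000 rows, so rows with an index of 35000 or higher in the table above should be the synthetic samples that SMOTE appended. SMOTE builds each of them by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step in NumPy; it is not imbalanced-learn's implementation, and the two points and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class samples: [route_code, flight_duration].
x_i = np.array([54.0, 5.52])
x_neighbor = np.array([60.0, 8.83])

# Synthetic point on the line segment between the sample and its neighbour.
lam = rng.random()  # uniform in [0, 1)
x_new = x_i + lam * (x_neighbor - x_i)
print(x_new)
```

Because the interpolation treats every column as continuous, it also blends columns such as `id` and the integer-encoded categories, where intermediate values have no natural meaning.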


In [42]:
X_train_resampled.drop(["id"], axis=1, inplace = True)
X_test.drop(["id"], axis=1, inplace = True)

In [43]:
X_train_resampled.head()

Unnamed: 0,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration
0,1,0,2,95,46,11,5,558,47,1,0,0,4.75
1,1,1,2,112,18,13,7,404,48,1,0,0,6.62
2,1,0,2,284,6,9,1,697,48,1,0,0,4.67
3,1,1,2,3,86,10,4,508,4,1,0,0,8.58
4,1,0,2,114,36,23,3,396,4,1,0,1,8.83


In [44]:
X_test.head()

Unnamed: 0,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration
11872,1,0,2,3,22,1,3,329,1,1,1,1,8.83
40828,1,0,2,1,5,3,7,559,1,1,0,1,8.83
36400,4,0,2,229,5,8,5,184,11,0,0,0,4.72
5166,1,0,2,8,17,4,5,132,11,0,0,0,6.42
30273,1,0,2,122,4,15,4,156,64,0,0,0,4.67


In [45]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_resampled = scaler.fit_transform(X_train_resampled)

X_test = scaler.transform(X_test)

In [52]:
cols_names = list(X.columns)
cols_names.remove("id")
cols_names

['num_passengers',
 'sales_channel',
 'trip_type',
 'purchase_lead',
 'length_of_stay',
 'flight_hour',
 'flight_day',
 'route',
 'booking_origin',
 'wants_extra_baggage',
 'wants_preferred_seat',
 'wants_in_flight_meals',
 'flight_duration']

In [53]:
# cols_names is already a list; wrapping it in another list would create MultiIndex columns
X_train_resampled = pd.DataFrame(X_train_resampled, columns=cols_names)
X_test = pd.DataFrame(X_test, columns=cols_names)

In [54]:
X_train_resampled

Unnamed: 0,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration
0,-0.540002,-0.274205,0.087655,0.159371,0.792026,0.421036,0.737252,0.792943,0.246712,0.760572,-0.551657,-0.729179,-1.585863
1,-0.540002,3.646907,0.087655,0.352712,-0.103991,0.821891,1.814851,0.115345,0.281991,0.760572,-0.551657,-0.729179,-0.331137
2,-0.540002,-0.274205,0.087655,2.308867,-0.487998,0.020180,-1.417946,1.404540,0.281991,0.760572,-0.551657,-0.729179,-1.639541
3,-0.540002,3.646907,0.087655,-0.886944,2.072051,0.220608,0.198453,0.572944,-1.270263,0.760572,-0.551657,-0.729179,0.983977
4,-0.540002,-0.274205,0.087655,0.375458,0.472020,2.826168,-0.340347,0.080146,-1.270263,0.760572,-0.551657,1.371406,1.151722
...,...,...,...,...,...,...,...,...,...,...,...,...,...
59497,2.673118,-0.274205,0.087655,1.762963,-0.552000,-0.180248,0.198453,-0.883450,1.693130,0.760572,-0.551657,-0.729179,1.044365
59498,-0.540002,-0.274205,0.087655,-0.227310,1.880047,-0.981958,0.737252,-1.424648,-1.129149,-1.314800,-0.551657,1.371406,-0.696430
59499,-0.540002,-0.274205,0.087655,1.944931,-0.487998,0.020180,0.737252,-1.640247,0.281991,0.760572,-0.551657,1.371406,1.151722
59500,0.531038,-0.274205,0.087655,-0.636738,-0.487998,-0.781531,1.276052,1.404540,0.281991,-1.314800,-0.551657,1.371406,-1.639541


### SVM model

In [61]:
from sklearn.svm import SVC

In [62]:
svc = SVC()

In [63]:
svc.fit(X_train_resampled, y_train_resampled)

In [64]:
svc.score(X_train_resampled, y_train_resampled)

0.7490000336123156

In [66]:
svc.score(X_test, y_test)

0.64

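The accuracy on the test set (0.64) is lower than on the resampled training set (0.75). The two figures are not directly comparable: the training set was balanced with SMOTE, whereas the test set keeps the original imbalance, with 12,771 of its 15,000 rows (about 85%) in class 0. A model that always predicts class 0 would therefore reach about 0.85 accuracy on the test set, so 0.64 is below that baseline.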
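
On an imbalanced test set, accuracy alone says little about how well the minority class is predicted. A confusion matrix, the recall of the positive class and the balanced accuracy give a fuller picture. The sketch below uses hypothetical arrays `y_true` and `y_pred`; they are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, recall_score

# Hypothetical labels and predictions (illustrative, not the notebook's results).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Recall of the positive class, and the mean of the per-class recalls.
recall_pos = recall_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(recall_pos, bal_acc)
```

In the notebook, the same functions could be called with `y_test` and `svc.predict(X_test)`.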

### Hyperparameter tuning

In [67]:
from sklearn.model_selection import RandomizedSearchCV

In [68]:
params_svc = {
    "C" : [0.01,0.1,1,10],
    "kernel": ["rbf", "linear", "poly"],
    "gamma":["auto", 0.1,0.3,0.5],
    "degree" : [1,2,3]
}
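
In `SVC`, `degree` is only used by the `poly` kernel and `gamma` is not used by the `linear` kernel, so a single flat dictionary produces candidates that differ only in parameters the model ignores. One option is to give the search one dictionary per kernel. The sketch below draws candidates with `ParameterSampler` to show the effect; the name `param_spaces` and the seed are assumptions, and the value lists are copied from the dictionary above.

```python
from sklearn.model_selection import ParameterSampler

# One dictionary per kernel, so each kernel is paired only with the
# hyperparameters it actually uses.
param_spaces = [
    {"kernel": ["linear"], "C": [0.01, 0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.01, 0.1, 1, 10],
     "gamma": ["auto", 0.1, 0.3, 0.5]},
    {"kernel": ["poly"], "C": [0.01, 0.1, 1, 10],
     "gamma": ["auto", 0.1, 0.3, 0.5], "degree": [1, 2, 3]},
]

# Draw a few candidates the way a randomized search would.
candidates = list(ParameterSampler(param_spaces, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```

`RandomizedSearchCV` accepts the same list of dictionaries as its `param_distributions` argument.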

In [69]:
rscv_svc = RandomizedSearchCV(svc, param_distributions=params_svc, n_iter = 5, cv=5, scoring="roc_auc", verbose=3)

In [70]:
rscv_svc.fit(X_train_resampled, y_train_resampled)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5] END C=0.1, degree=1, gamma=0.3, kernel=rbf;, score=0.678 total time= 2.6min
[CV 2/5] END C=0.1, degree=1, gamma=0.3, kernel=rbf;, score=0.830 total time= 2.8min
[CV 3/5] END C=0.1, degree=1, gamma=0.3, kernel=rbf;, score=0.828 total time= 2.7min
[CV 4/5] END C=0.1, degree=1, gamma=0.3, kernel=rbf;, score=0.826 total time= 2.7min
[CV 5/5] END C=0.1, degree=1, gamma=0.3, kernel=rbf;, score=0.824 total time= 2.8min
[CV 1/5] END C=1, degree=3, gamma=auto, kernel=rbf;, score=0.679 total time= 2.0min
[CV 2/5] END C=1, degree=3, gamma=auto, kernel=rbf;, score=0.826 total time= 2.3min
[CV 3/5] END C=1, degree=3, gamma=auto, kernel=rbf;, score=0.826 total time= 2.3min
[CV 4/5] END C=1, degree=3, gamma=auto, kernel=rbf;, score=0.826 total time= 2.3min
[CV 5/5] END C=1, degree=3, gamma=auto, kernel=rbf;, score=0.825 total time= 2.3min
[CV 1/5] END C=0.01, degree=1, gamma=0.5, kernel=linear;, score=0.585 total time= 1.7min
[CV 2/

In [72]:
print("Best parameter for SVM: ",rscv_svc.best_params_)

Best parameter for SVM:  {'kernel': 'rbf', 'gamma': 0.3, 'degree': 1, 'C': 0.1}


In [81]:
print(f"Best score we got : {rscv_svc.best_score_:.2f}")

Best score we got : 0.80
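
The best score above is the mean cross-validated ROC AUC on the resampled training data. To see how the tuned model does on unseen data, the refitted best estimator can be scored on a held-out set. The sketch below runs on a small synthetic dataset from `make_classification`, not on the booking data, and the reduced parameter space and the seeds are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Small synthetic dataset standing in for the booking data.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
    n_iter=3,
    cv=3,
    scoring="roc_auc",
    random_state=0,  # makes the sampled candidates reproducible
)
search.fit(X_tr, y_tr)

# SVC has no predict_proba by default; decision_function scores work for ROC AUC.
test_scores = search.best_estimator_.decision_function(X_te)
test_auc = roc_auc_score(y_te, test_scores)
print(search.best_params_, round(test_auc, 3))
```

In the notebook, the equivalent call would use `rscv_svc.best_estimator_` with `X_test` and `y_test`.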
