# Task 2

---

## Predictive modeling of customer bookings

This Jupyter notebook includes some code to get you started with this predictive modeling task. We will use various packages for data manipulation, feature engineering and machine learning.

### Exploratory data analysis

First, we must explore the data in order to better understand what we have and the statistical properties of the dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import classification_report,accuracy_score,roc_auc_score
from sklearn.linear_model import LogisticRegression,RidgeClassifier,SGDClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier,ExtraTreeClassifier
from sklearn.ensemble import AdaBoostClassifier,BaggingClassifier,GradientBoostingClassifier,RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler,LabelEncoder
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("customer_booking.csv", encoding="ISO-8859-1")
df.head()

| | num_passengers | sales_channel | trip_type | purchase_lead | length_of_stay | flight_hour | flight_day | route | booking_origin | wants_extra_baggage | wants_preferred_seat | wants_in_flight_meals | flight_duration | booking_complete |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Internet | RoundTrip | 262 | 19 | 7 | Sat | AKLDEL | New Zealand | 1 | 0 | 0 | 5.52 | 0 |
| 1 | 1 | Internet | RoundTrip | 112 | 20 | 3 | Sat | AKLDEL | New Zealand | 0 | 0 | 0 | 5.52 | 0 |
| 2 | 2 | Internet | RoundTrip | 243 | 22 | 17 | Wed | AKLDEL | India | 1 | 1 | 0 | 5.52 | 0 |
| 3 | 1 | Internet | RoundTrip | 96 | 31 | 4 | Sat | AKLDEL | New Zealand | 0 | 0 | 1 | 5.52 | 0 |
| 4 | 2 | Internet | RoundTrip | 68 | 22 | 15 | Wed | AKLDEL | India | 1 | 0 | 1 | 5.52 | 0 |


The `.head()` method shows the first 5 rows of the dataset, which is useful for a quick visual inspection of the columns.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   num_passengers         50000 non-null  int64  
 1   sales_channel          50000 non-null  object 
 2   trip_type              50000 non-null  object 
 3   purchase_lead          50000 non-null  int64  
 4   length_of_stay         50000 non-null  int64  
 5   flight_hour            50000 non-null  int64  
 6   flight_day             50000 non-null  object 
 7   route                  50000 non-null  object 
 8   booking_origin         50000 non-null  object 
 9   wants_extra_baggage    50000 non-null  int64  
 10  wants_preferred_seat   50000 non-null  int64  
 11  wants_in_flight_meals  50000 non-null  int64  
 12  flight_duration        50000 non-null  float64
 13  booking_complete       50000 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.3+ 

The `.info()` method gives us a data description: the name of each column, its data type and its non-null count. All 14 columns have 50,000 non-null entries, so there are no null values. Five columns (`sales_channel`, `trip_type`, `flight_day`, `route` and `booking_origin`) are stored as `object` and should be converted into numeric data types before modelling, e.g. `flight_day`.
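As a rough illustration of how the columns that need conversion could be picked out programmatically, the sketch below uses a tiny made-up frame (`toy` is not the booking data) and lists each non-numeric column with its number of distinct values. Cardinality is a reasonable first hint for choosing between a hand-written mapping and an automatic encoder.

```python
import pandas as pd

# Tiny made-up frame that mimics three of the booking columns (illustrative only).
toy = pd.DataFrame({
    "flight_day": ["Sat", "Wed", "Sat"],
    "route": ["AKLDEL", "AKLKUL", "AKLDEL"],
    "purchase_lead": [262, 112, 243],
})

# Everything that is not numeric will need encoding before modelling.
text_columns = toy.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per text column.
cardinality = toy[text_columns].nunique()

print(text_columns)
print(cardinality.to_dict())
```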

To provide more context, below is a more detailed data description, explaining exactly what each column means:

- `num_passengers` = number of passengers travelling
- `sales_channel` = sales channel booking was made on
- `trip_type` = trip type (Round Trip, One Way, Circle Trip)
- `purchase_lead` = number of days between travel date and booking date
- `length_of_stay` = number of days spent at destination
- `flight_hour` = hour of flight departure
- `flight_day` = day of week of flight departure
- `route` = origin -> destination flight route
- `booking_origin` = country from where booking was made
- `wants_extra_baggage` = if the customer wanted extra baggage in the booking
- `wants_preferred_seat` = if the customer wanted a preferred seat in the booking
- `wants_in_flight_meals` = if the customer wanted in-flight meals in the booking
- `flight_duration` = total duration of flight (in hours)
- `booking_complete` = flag indicating if the customer completed the booking

Before we compute any statistics on the data, let's do any necessary data conversion.

In [4]:
df.describe(include='all')

| | num_passengers | sales_channel | trip_type | purchase_lead | length_of_stay | flight_hour | flight_day | route | booking_origin | wants_extra_baggage | wants_preferred_seat | wants_in_flight_meals | flight_duration | booking_complete |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000.0 | 50000 | 50000 | 50000.0 | 50000.0 | 50000.0 | 50000 | 50000 | 50000 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| unique | | 2 | 3 | | | | 7 | 799 | 104 | | | | | |
| top | | Internet | RoundTrip | | | | Mon | AKLKUL | Australia | | | | | |
| freq | | 44382 | 49497 | | | | 8102 | 2680 | 17872 | | | | | |
| mean | 1.59124 | | | 84.94048 | 23.04456 | 9.06634 | | | | 0.66878 | 0.29696 | 0.42714 | 7.277561 | 0.14956 |
| std | 1.020165 | | | 90.451378 | 33.88767 | 5.41266 | | | | 0.470657 | 0.456923 | 0.494668 | 1.496863 | 0.356643 |
| min | 1.0 | | | 0.0 | 0.0 | 0.0 | | | | 0.0 | 0.0 | 0.0 | 4.67 | 0.0 |
| 25% | 1.0 | | | 21.0 | 5.0 | 5.0 | | | | 0.0 | 0.0 | 0.0 | 5.62 | 0.0 |
| 50% | 1.0 | | | 51.0 | 17.0 | 9.0 | | | | 1.0 | 0.0 | 0.0 | 7.57 | 0.0 |
| 75% | 2.0 | | | 115.0 | 28.0 | 13.0 | | | | 1.0 | 1.0 | 1.0 | 8.83 | 0.0 |


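The `.describe()` method gives us a summary of descriptive statistics over the entire dataset. By default it covers numeric columns only; passing `include='all'`, as in the cell above, also reports `unique`, `top` and `freq` for the text columns and leaves the statistics that do not apply to a column empty. This gives us a quick overview of a few things such as the mean, min, max and overall distribution of each column.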
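To make the difference between the default call and `include='all'` concrete, here is a minimal sketch on a made-up two-column frame (`toy` is illustrative and not part of the booking data).

```python
import pandas as pd

# Tiny made-up frame with one numeric and one text column (illustrative only).
toy = pd.DataFrame({
    "num_passengers": [2, 1, 2, 1],
    "sales_channel": ["Internet", "Internet", "Mobile", "Internet"],
})

numeric_only = toy.describe()              # default: numeric columns only
everything = toy.describe(include="all")   # adds unique/top/freq for text columns

print(numeric_only.columns.tolist())
print(everything.loc[["unique", "top", "freq"], "sales_channel"])
```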

From this point, you should continue exploring the dataset with some visualisations and other metrics that you think may be useful. Then, you should prepare your dataset for predictive modelling. Finally, you should train your machine learning model, evaluate it with performance metrics and output visualisations for the contributing variables. All of this analysis should be summarised in your single slide.
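For the last step mentioned above, one common way to surface contributing variables is the impurity-based feature importance of a tree ensemble. The sketch below is only an illustration of the mechanics: it uses synthetic data from `make_classification` and hypothetical column names (`feature_0` to `feature_4`), not the booking dataset, so the resulting numbers carry no meaning for this task.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; with shuffle=False the informative columns come first.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Importances sum to 1; sorting puts the strongest contributors first.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` on the result would turn it into a bar chart suitable for a slide, assuming matplotlib is available.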

# Finding NULL Values

In [5]:
df.isna().sum()

num_passengers           0
sales_channel            0
trip_type                0
purchase_lead            0
length_of_stay           0
flight_hour              0
flight_day               0
route                    0
booking_origin           0
wants_extra_baggage      0
wants_preferred_seat     0
wants_in_flight_meals    0
flight_duration          0
booking_complete         0
dtype: int64
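Note that `isna()` only counts true missing values. Placeholder strings, such as the `(not set)` entry that shows up among the `booking_origin` values later in this notebook, are ordinary text and are not counted. A minimal sketch of checking for such sentinels, on a made-up series and with a placeholder list that is an assumption to be extended as needed:

```python
import pandas as pd

# Made-up sample; "(not set)" mirrors the placeholder seen in booking_origin.
origins = pd.Series(["New Zealand", "(not set)", "India", "(not set)", "Japan"])

# Assumption: the list of sentinel strings worth checking for.
placeholders = ["(not set)"]

null_count = origins.isna().sum()                   # true missing values
placeholder_count = origins.isin(placeholders).sum()  # disguised missing values

print(null_count, placeholder_count)
```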

# Processing Flight_Day Column

In [6]:
df["flight_day"].unique()

array(['Sat', 'Wed', 'Thu', 'Mon', 'Sun', 'Tue', 'Fri'], dtype=object)

In [7]:
mapping = {
    "Mon": 1,
    "Tue": 2,
    "Wed": 3,
    "Thu": 4,
    "Fri": 5,
    "Sat": 6,
    "Sun": 7
}

df["flight_day"] = df["flight_day"].map(mapping)

In [8]:
df["flight_day"].unique()

array([6, 3, 4, 1, 7, 2, 5], dtype=int64)
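One thing to keep in mind with `Series.map` and a dictionary: any label missing from the dictionary silently becomes `NaN` rather than raising an error. Here all seven days are covered, but a quick check like the sketch below (on a made-up series with one deliberately unmapped label) can catch typos or unexpected categories.

```python
import pandas as pd

day_to_number = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}

# Made-up sample containing one label ("Sunday") that the mapping does not cover.
days = pd.Series(["Sat", "Wed", "Sunday", "Mon"])
encoded = days.map(day_to_number)

# Labels missing from the dictionary come back as NaN instead of raising.
unmapped = days[encoded.isna()].unique().tolist()
print(unmapped)
```

The same check applies to the `sales_channel` and `trip_type` mappings in the following cells.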

# Processing Sales_Channel Column

In [9]:
df["sales_channel"].unique()

array(['Internet', 'Mobile'], dtype=object)

In [10]:
mapping = {
    "Internet": 0,
    "Mobile": 1
}

df["sales_channel"] = df["sales_channel"].map(mapping)

In [11]:
df["sales_channel"].unique()

array([0, 1], dtype=int64)

# Processing Trip_Type Column

In [12]:
df["trip_type"].unique()

array(['RoundTrip', 'CircleTrip', 'OneWay'], dtype=object)

In [13]:
mapping = {
    "OneWay": 0,
    "CircleTrip": 1,
    "RoundTrip": 2
}

df["trip_type"] = df["trip_type"].map(mapping)

In [14]:
df["trip_type"].unique()

array([2, 1, 0], dtype=int64)

# Processing Route Column

In [15]:
df["route"].unique()

array(['AKLDEL', 'AKLHGH', 'AKLHND', 'AKLICN', 'AKLKIX', 'AKLKTM',
       'AKLKUL', 'AKLMRU', 'AKLPEK', 'AKLPVG', 'AKLTPE', 'AORICN',
       'AORKIX', 'AORKTM', 'AORMEL', 'BBIMEL', 'BBIOOL', 'BBIPER',
       'BBISYD', 'BDOCTS', 'BDOCTU', 'BDOHGH', 'BDOICN', 'BDOIKA',
       'BDOKIX', 'BDOMEL', 'BDOOOL', 'BDOPEK', 'BDOPER', 'BDOPUS',
       'BDOPVG', 'BDOSYD', 'BDOTPE', 'BDOXIY', 'BKICKG', 'BKICTS',
       'BKICTU', 'BKIHND', 'BKIICN', 'BKIKIX', 'BKIKTM', 'BKIMEL',
       'BKIMRU', 'BKIOOL', 'BKIPEK', 'BKIPER', 'BKIPUS', 'BKIPVG',
       'BKISYD', 'BKIXIY', 'BLRICN', 'BLRMEL', 'BLRPER', 'BLRSYD',
       'BOMMEL', 'BOMOOL', 'BOMPER', 'BOMSYD', 'BTJJED', 'BTUICN',
       'BTUPER', 'BTUSYD', 'BTUWUH', 'BWNCKG', 'BWNDEL', 'BWNHGH',
       'BWNIKA', 'BWNKTM', 'BWNMEL', 'BWNOOL', 'BWNPER', 'BWNSYD',
       'BWNTPE', 'CANDEL', 'CANIKA', 'CANMEL', 'CANMRU', 'CANOOL',
       'CANPER', 'CANSYD', 'CCUMEL', 'CCUMRU', 'CCUOOL', 'CCUPER',
       'CCUSYD', 'CCUTPE', 'CEBMEL', 'CEBOOL', 'CEBPER', 'CEBS

In [16]:
LE=LabelEncoder()
df['route']=LE.fit_transform(df['route'])

In [17]:
df["route"].unique()

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
        28,  29,  30,  31,  32,  33,  34,  36,  37,  38,  39,  41,  42,
        43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
        56,  57,  58,  59,  60,  61,  62,  64,  65,  66,  67,  68,  69,
        70,  71,  72,  73,  74,  75,  76,  77,  79,  80,  81,  82,  83,
        84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,
        97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
       110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 121, 122, 125,
       126, 127, 129, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140,
       141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153,
       154, 155, 157, 158, 159, 160, 161, 162, 163, 165, 166, 167, 170,
       171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
       185, 187, 188, 189, 190, 192, 193, 194, 195, 196, 197, 19

# Processing Booking_Origin Column

In [18]:
df['booking_origin'].unique()

array(['New Zealand', 'India', 'United Kingdom', 'China', 'South Korea',
       'Japan', 'Malaysia', 'Singapore', 'Switzerland', 'Germany',
       'Indonesia', 'Czech Republic', 'Vietnam', 'Thailand', 'Spain',
       'Romania', 'Ireland', 'Italy', 'Slovakia', 'United Arab Emirates',
       'Tonga', 'Réunion', '(not set)', 'Saudi Arabia', 'Netherlands',
       'Qatar', 'Hong Kong', 'Philippines', 'Sri Lanka', 'France',
       'Croatia', 'United States', 'Laos', 'Hungary', 'Portugal',
       'Cyprus', 'Australia', 'Cambodia', 'Poland', 'Belgium', 'Oman',
       'Bangladesh', 'Kazakhstan', 'Brazil', 'Turkey', 'Kenya', 'Taiwan',
       'Brunei', 'Chile', 'Bulgaria', 'Ukraine', 'Denmark', 'Colombia',
       'Iran', 'Bahrain', 'Solomon Islands', 'Slovenia', 'Mauritius',
       'Nepal', 'Russia', 'Kuwait', 'Mexico', 'Sweden', 'Austria',
       'Lebanon', 'Jordan', 'Greece', 'Mongolia', 'Canada', 'Tanzania',
       'Peru', 'Timor-Leste', 'Argentina', 'New Caledonia', 'Macau',
       'Myanmar (

In [19]:
LE=LabelEncoder()
df['booking_origin']=LE.fit_transform(df['booking_origin'])

In [20]:
df['booking_origin'].unique()

array([ 61,  36, 100,  17,  85,  43,  51,  80,  90,  28,  37,  21, 103,
        93,  86,  75,  40,  42,  81,  99,  95,  77,   0,  78,  59,  74,
        34,  71,  87,  27,  19, 101,  48,  35,  73,  20,   4,  14,  72,
         9,  65,   7,  45,  11,  97,  46,  91,  12,  16,  13,  98,  23,
        18,  38,   6,  83,  82,  54,  58,  76,  47,  55,  89,   5,  49,
        44,  31,  56,  15,  92,  70,  94,   3,  60,  50,  57,  64,  67,
        10,  63,  26,  62,  52,  24,  41,  96,  84,  68,  69,  25,  79,
         1,  32,  22,  53, 102,   8,  66,  39,  29,  30,  33,   2,  88])
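`LabelEncoder` assigns integers by the sorted order of the labels, so the codes are just indices and carry no ranking. Because the notebook reuses the name `LE` for both `route` and `booking_origin`, the first fitted encoder is overwritten. If the codes need to be translated back later, keeping one encoder per column is one option. A minimal sketch on a made-up list of routes:

```python
from sklearn.preprocessing import LabelEncoder

routes = ["AKLKUL", "AKLDEL", "BKIPER", "AKLDEL"]  # small made-up sample

route_encoder = LabelEncoder()  # hypothetical name: one encoder kept per column
codes = route_encoder.fit_transform(routes)

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(route_encoder.classes_))
print(codes.tolist())

# inverse_transform recovers the original labels from the codes.
decoded = route_encoder.inverse_transform(codes).tolist()
print(decoded)
```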

In [21]:
df.head()

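| | num_passengers | sales_channel | trip_type | purchase_lead | length_of_stay | flight_hour | flight_day | route | booking_origin | wants_extra_baggage | wants_preferred_seat | wants_in_flight_meals | flight_duration | booking_complete |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 2 | 262 | 19 | 7 | 6 | 0 | 61 | 1 | 0 | 0 | 5.52 | 0 |
| 1 | 1 | 0 | 2 | 112 | 20 | 3 | 6 | 0 | 61 | 0 | 0 | 0 | 5.52 | 0 |
| 2 | 2 | 0 | 2 | 243 | 22 | 17 | 3 | 0 | 36 | 1 | 1 | 0 | 5.52 | 0 |
| 3 | 1 | 0 | 2 | 96 | 31 | 4 | 6 | 0 | 61 | 0 | 0 | 1 | 5.52 | 0 |
| 4 | 2 | 0 | 2 | 68 | 22 | 15 | 3 | 0 | 36 | 1 | 0 | 1 | 5.52 | 0 |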


In [22]:
y=df['booking_complete'].values
df.drop('booking_complete',inplace=True,axis=1)
x=df.values

# Splitting the data

In [23]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=2)
print(f'Shape of x_train:{x_train.shape}')
print(f'Shape of y_train:{y_train.shape}')
print(f'Shape of x_test:{x_test.shape}')
print(f'Shape of y_test:{y_test.shape}')

Shape of x_train:(40000, 13)
Shape of y_train:(40000,)
Shape of x_test:(10000, 13)
Shape of y_test:(10000,)
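The target is imbalanced (the mean of `booking_complete` is 0.14956), and a plain random split lets the positive rate drift slightly between the two sets: the classification reports further down show 6,002 positives out of 40,000 training rows and 1,476 out of 10,000 test rows. Passing `stratify` to `train_test_split` keeps the rate the same in both. A small sketch on synthetic data (`X_demo` and `y_demo` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 15% positives (roughly the booking_complete rate).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 150 + [0] * 850)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

# With stratification the positive rate is preserved in both splits.
print(y_tr.mean(), y_te.mean())
```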


In [24]:
lr=LogisticRegression()
rc=RidgeClassifier()
sgdc=SGDClassifier()
knn=KNeighborsClassifier()
svm=SVC()
dtc=DecisionTreeClassifier()
etc=ExtraTreeClassifier()
abc=AdaBoostClassifier()
bc=BaggingClassifier()
gbc=GradientBoostingClassifier()
rfc=RandomForestClassifier()
gnb=GaussianNB()
mnb=MultinomialNB()
models=[lr,rc,sgdc,knn,svm,dtc,etc,abc,bc,gbc,rfc,gnb,mnb]

In [25]:
for model in models:
    clf = model
    clf.fit(x_train, y_train)
    # Predict once per split and reuse the labels for every metric below.
    train_pred = clf.predict(x_train)
    test_pred = clf.predict(x_test)
    print("Model:", model)
    print("\nTraining Accuracy_Score:", accuracy_score(y_train, train_pred))
    print("\nTesting Accuracy_Score:", accuracy_score(y_test, test_pred))
    # roc_auc_score receives hard 0/1 labels here, so it equals balanced accuracy.
    print("\nTraining AUC_Score:", roc_auc_score(y_train, train_pred))
    print("\nTesting AUC_Score:", roc_auc_score(y_test, test_pred))
    print("\nTraining Classification Report:\n", classification_report(y_train, train_pred))
    print("\nTesting Classification Report:\n", classification_report(y_test, test_pred))
    print("\n\n\n\n")
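A note on the AUC figures printed by this loop: `roc_auc_score` is given the hard 0/1 output of `predict`, which reduces the ROC curve to a single point and makes the score equal to balanced accuracy. A threshold-free AUC needs a continuous score per row. The sketch below shows one way to obtain it, using a hypothetical helper `positive_scores` and synthetic data rather than the booking set. It falls back to `decision_function` for estimators that do not offer `predict_proba` (for example `RidgeClassifier`, or `SVC` with default settings).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the booking data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)


def positive_scores(model, X):
    """Return a continuous score for class 1, whichever method the model offers."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)


auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))           # what the loop computes
auc_from_scores = roc_auc_score(y_te, positive_scores(clf, X_te))  # threshold-free AUC
print(auc_from_labels, auc_from_scores)
```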

Model: LogisticRegression()

Training Accuracy_Score: 0.8497

Testing Accuracy_Score: 0.8523

Training AUC_Score: 0.49985293252544266

Testing AUC_Score: 0.49994134209291413

Training Classification Report:
               precision    recall  f1-score   support

           0       0.85      1.00      0.92     33998
           1       0.00      0.00      0.00      6002

    accuracy                           0.85     40000
   macro avg       0.42      0.50      0.46     40000
weighted avg       0.72      0.85      0.78     40000


Testing Classification Report:
               precision    recall  f1-score   support

           0       0.85      1.00      0.92      8524
           1       0.00      0.00      0.00      1476

    accuracy                           0.85     10000
   macro avg       0.43      0.50      0.46     10000
weighted avg       0.73      0.85      0.78     10000






Model: RidgeClassifier()

Training Accuracy_Score: 0.84995

Testing Accuracy_Score: 0.8524

Training


Training Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.99      0.92     33998
           1       0.57      0.04      0.08      6002

    accuracy                           0.85     40000
   macro avg       0.71      0.52      0.50     40000
weighted avg       0.81      0.85      0.79     40000


Testing Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.99      0.92      8524
           1       0.48      0.03      0.06      1476

    accuracy                           0.85     10000
   macro avg       0.67      0.51      0.49     10000
weighted avg       0.80      0.85      0.79     10000






Model: RandomForestClassifier()

Training Accuracy_Score: 0.99985

Testing Accuracy_Score: 0.8537

Training AUC_Score: 0.9997059630631976

Testing AUC_Score: 0.5427768748593164

Training Classification Report:
               precision    recall  f1-score   support

        