# Task 2

---

## Predictive modeling of customer bookings

This Jupyter notebook includes some code to get you started with this predictive modeling task. We will use various packages for data manipulation, feature engineering and machine learning.

### Exploratory data analysis

First, we must explore the data in order to better understand what we have and the statistical properties of the dataset.

In [20]:
import pandas as pd
from sklearn import preprocessing

In [3]:
df = pd.read_csv("customer_booking.csv", encoding="ISO-8859-1")
df.head()

| | num_passengers | sales_channel | trip_type | purchase_lead | length_of_stay | flight_hour | flight_day | route | booking_origin | wants_extra_baggage | wants_preferred_seat | wants_in_flight_meals | flight_duration | booking_complete |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Internet | RoundTrip | 262 | 19 | 7 | Sat | AKLDEL | New Zealand | 1 | 0 | 0 | 5.52 | 0 |
| 1 | 1 | Internet | RoundTrip | 112 | 20 | 3 | Sat | AKLDEL | New Zealand | 0 | 0 | 0 | 5.52 | 0 |
| 2 | 2 | Internet | RoundTrip | 243 | 22 | 17 | Wed | AKLDEL | India | 1 | 1 | 0 | 5.52 | 0 |
| 3 | 1 | Internet | RoundTrip | 96 | 31 | 4 | Sat | AKLDEL | New Zealand | 0 | 0 | 1 | 5.52 | 0 |
| 4 | 2 | Internet | RoundTrip | 68 | 22 | 15 | Wed | AKLDEL | India | 1 | 0 | 1 | 5.52 | 0 |


The `.head()` method shows the first 5 rows of the dataset, which is useful for a quick visual inspection of the columns.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   num_passengers         50000 non-null  int64  
 1   sales_channel          50000 non-null  object 
 2   trip_type              50000 non-null  object 
 3   purchase_lead          50000 non-null  int64  
 4   length_of_stay         50000 non-null  int64  
 5   flight_hour            50000 non-null  int64  
 6   flight_day             50000 non-null  object 
 7   route                  50000 non-null  object 
 8   booking_origin         50000 non-null  object 
 9   wants_extra_baggage    50000 non-null  int64  
 10  wants_preferred_seat   50000 non-null  int64  
 11  wants_in_flight_meals  50000 non-null  int64  
 12  flight_duration        50000 non-null  float64
 13  booking_complete       50000 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.3+ 

The `.info()` method gives us a data description, telling us the names of the columns, their data types and how many non-null values each one holds. Fortunately, every column has 50000 non-null entries, so there are no missing values. Some columns are stored as `object` (text) and should be converted to a numeric type before modelling, e.g. `flight_day`.
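To find the columns that still need converting, it can help to list the non-numeric columns together with how many distinct values each one holds. The snippet below is a minimal sketch on a small hypothetical frame called `sample` (not the real dataset); the same calls should apply to `df`.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the bookings data.
sample = pd.DataFrame(
    {
        "num_passengers": [2, 1, 2, 1],
        "sales_channel": ["Internet", "Internet", "Mobile", "Internet"],
        "flight_day": ["Sat", "Sat", "Wed", "Mon"],
        "booking_complete": [0, 0, 1, 0],
    }
)

# Count missing values per column; all zeros means nothing to impute.
null_counts = sample.isna().sum()

# Select the non-numeric columns and count their distinct values.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
cardinality = sample[text_columns].nunique()

print(null_counts)
print(cardinality)
```

Columns with only a handful of distinct values (such as `flight_day`) are easy to encode by hand, while high-cardinality ones usually need a different strategy.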

To provide more context, below is a more detailed data description, explaining exactly what each column means:

- `num_passengers` = number of passengers travelling
- `sales_channel` = sales channel booking was made on
- `trip_type` = trip Type (Round Trip, One Way, Circle Trip)
- `purchase_lead` = number of days between travel date and booking date
- `length_of_stay` = number of days spent at destination
- `flight_hour` = hour of flight departure
- `flight_day` = day of week of flight departure
- `route` = origin -> destination flight route
- `booking_origin` = country from where booking was made
- `wants_extra_baggage` = if the customer wanted extra baggage in the booking
- `wants_preferred_seat` = if the customer wanted a preferred seat in the booking
- `wants_in_flight_meals` = if the customer wanted in-flight meals in the booking
- `flight_duration` = total duration of flight (in hours)
- `booking_complete` = flag indicating if the customer completed the booking

Before we compute any statistics on the data, let's do any necessary data conversion.

In [5]:
df["flight_day"].unique()

array(['Sat', 'Wed', 'Thu', 'Mon', 'Sun', 'Tue', 'Fri'], dtype=object)

In [6]:
mapping = {
    "Mon": 1,
    "Tue": 2,
    "Wed": 3,
    "Thu": 4,
    "Fri": 5,
    "Sat": 6,
    "Sun": 7,
}

df["flight_day"] = df["flight_day"].map(mapping)

In [7]:
df["flight_day"].unique()

array([6, 3, 4, 1, 7, 2, 5], dtype=int64)
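One thing to keep in mind with `.map()` and a dictionary is that any label missing from the dictionary silently becomes `NaN`. The check below is a small sketch of how that could be caught; the series `days` and the misspelt label `"Sunday"` are made up for illustration.

```python
import pandas as pd

day_to_number = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}

# Hypothetical column containing one label ("Sunday") that is not in the mapping.
days = pd.Series(["Sat", "Wed", "Sunday", "Mon"])

mapped = days.map(day_to_number)

# Labels that were not found in the dictionary come back as NaN.
unmapped_labels = days[mapped.isna()].unique().tolist()
print(unmapped_labels)  # ['Sunday']
```

In this notebook the `unique()` output after mapping contains only the integers 1 to 7, so every label was covered.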

In [8]:
df.describe()

| | num_passengers | purchase_lead | length_of_stay | flight_hour | flight_day | wants_extra_baggage | wants_preferred_seat | wants_in_flight_meals | flight_duration | booking_complete |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 1.59124 | 84.94048 | 23.04456 | 9.06634 | 3.81442 | 0.66878 | 0.29696 | 0.42714 | 7.277561 | 0.14956 |
| std | 1.020165 | 90.451378 | 33.88767 | 5.41266 | 1.992792 | 0.470657 | 0.456923 | 0.494668 | 1.496863 | 0.356643 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 4.67 | 0.0 |
| 25% | 1.0 | 21.0 | 5.0 | 5.0 | 2.0 | 0.0 | 0.0 | 0.0 | 5.62 | 0.0 |
| 50% | 1.0 | 51.0 | 17.0 | 9.0 | 4.0 | 1.0 | 0.0 | 0.0 | 7.57 | 0.0 |
| 75% | 2.0 | 115.0 | 28.0 | 13.0 | 5.0 | 1.0 | 1.0 | 1.0 | 8.83 | 0.0 |
| max | 9.0 | 867.0 | 778.0 | 23.0 | 7.0 | 1.0 | 1.0 | 1.0 | 9.5 | 1.0 |


The `.describe()` method gives us a summary of descriptive statistics over the entire dataset (only works for numeric columns). This gives us a quick overview of a few things such as the mean, min, max and overall distribution of each column.
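For a 0/1 flag such as `booking_complete`, the mean reported by `.describe()` is simply the share of rows equal to 1, so the value 0.14956 above means roughly 15% of bookings were completed. The sketch below shows the same idea with `value_counts(normalize=True)` on a hypothetical target column (`target` is illustrative data, not taken from the dataset).

```python
import pandas as pd

# Hypothetical target column: 3 completed bookings out of 20.
target = pd.Series([1] * 3 + [0] * 17, name="booking_complete")

# Share of each class; for a 0/1 column the share of 1s equals the mean.
class_share = target.value_counts(normalize=True)
positive_share = target.mean()

print(class_share)
print(f"Share of completed bookings: {positive_share:.2f}")
```

Knowing the class balance early matters because it determines how to read accuracy figures later on.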

From this point, you should continue exploring the dataset with some visualisations and other metrics that you think may be useful. Then, you should prepare your dataset for predictive modelling. Finally, you should train your machine learning model, evaluate it with performance metrics and output visualisations for the contributing variables. All of this analysis should be summarised in your single slide.

In [9]:
df["sales_channel"].unique()

array(['Internet', 'Mobile'], dtype=object)

In [10]:
mapping = {
    "Internet": 1,
    "Mobile": 2,
}

df["sales_channel"] = df["sales_channel"].map(mapping)

In [11]:
df["trip_type"].unique()

array(['RoundTrip', 'CircleTrip', 'OneWay'], dtype=object)

In [12]:
mapping = {
    "RoundTrip": 1,
    "CircleTrip": 2,
    "OneWay": 3,
}

df["trip_type"] = df["trip_type"].map(mapping)
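Integer codes like the ones above give the categories an arbitrary numeric order (RoundTrip < CircleTrip < OneWay). Tree-based models can often cope with that, but one common alternative for nominal columns is one-hot encoding. The following is a sketch using `pd.get_dummies` on a hypothetical frame `trips`, not a change to the notebook's own approach.

```python
import pandas as pd

# Hypothetical frame holding the original text labels.
trips = pd.DataFrame({"trip_type": ["RoundTrip", "CircleTrip", "OneWay", "RoundTrip"]})

# One indicator column per category instead of a single ordered integer code.
encoded = pd.get_dummies(trips, columns=["trip_type"], dtype=int)
print(encoded.columns.tolist())
```

Each row then has exactly one indicator set to 1, and no ordering between trip types is implied.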

In [13]:
df.head()

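| | num_passengers | sales_channel | trip_type | purchase_lead | length_of_stay | flight_hour | flight_day | route | booking_origin | wants_extra_baggage | wants_preferred_seat | wants_in_flight_meals | flight_duration | booking_complete |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 262 | 19 | 7 | 6 | AKLDEL | New Zealand | 1 | 0 | 0 | 5.52 | 0 |
| 1 | 1 | 1 | 1 | 112 | 20 | 3 | 6 | AKLDEL | New Zealand | 0 | 0 | 0 | 5.52 | 0 |
| 2 | 2 | 1 | 1 | 243 | 22 | 17 | 3 | AKLDEL | India | 1 | 1 | 0 | 5.52 | 0 |
| 3 | 1 | 1 | 1 | 96 | 31 | 4 | 6 | AKLDEL | New Zealand | 0 | 0 | 1 | 5.52 | 0 |
| 4 | 2 | 1 | 1 | 68 | 22 | 15 | 3 | AKLDEL | India | 1 | 0 | 1 | 5.52 | 0 |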


In [17]:
df = df.drop(['route', 'booking_origin'], axis=1)
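Dropping `route` and `booking_origin` keeps the feature matrix purely numeric, but it also discards whatever signal those columns carry. One possible alternative, sketched here under the assumption that a few labels dominate, is to keep the most frequent labels and collapse the rest into an `"Other"` bucket before encoding. The series `origins` and the cut-off `top_n` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of booking origins.
origins = pd.Series(
    ["Australia", "Malaysia", "Australia", "India", "Australia", "Malaysia", "Japan"],
    name="booking_origin",
)

top_n = 2  # hypothetical cut-off; tune as needed
top_values = origins.value_counts().nlargest(top_n).index

# Keep the frequent labels and collapse everything else into "Other".
grouped = origins.where(origins.isin(top_values), "Other")
print(grouped.value_counts())
```

The grouped column has few enough categories to be one-hot encoded without creating hundreds of sparse features.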

In [19]:
# Separate the features (X) from the target column (y)

X = df.loc[:, df.columns != 'booking_complete']
y = df.loc[:, df.columns == 'booking_complete']
print(X.shape)
print(y.shape)

(50000, 11)
(50000, 1)


In [22]:
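# Standardise every feature to zero mean and unit variance
X = preprocessing.scale(X)
# Flatten y to a 1-D array, which is the shape scikit-learn estimators expect
y = y.to_numpy().ravel()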

In [25]:
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
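The cells above scale the whole feature matrix before splitting, which lets statistics from the test rows influence the scaling, and the split is neither seeded nor stratified. A hedged sketch of one way to address both points is shown below: the scaler sits inside a `Pipeline` so it is fitted on the training rows only, and `stratify` plus `random_state` keep the class ratio and make the split repeatable. The arrays `X_demo` and `y_demo` are synthetic stand-ins, and the seed values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 15% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (rng.random(200) < 0.15).astype(int)

# Stratified, seeded split: both parts keep a similar class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# The scaler is fitted on the training rows only when the pipeline is fitted.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=5, random_state=1))
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

Decision trees and random forests are insensitive to feature scaling, so the scaler mainly matters if a scale-sensitive model is swapped in later.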

In [27]:
from sklearn.tree import DecisionTreeClassifier
DTclf_model = DecisionTreeClassifier(criterion="gini", random_state=1, max_depth=5, min_samples_leaf=1)
DTclf_model.fit(X_train,y_train)

DecisionTreeClassifier(max_depth=5, random_state=1)

In [28]:
y_predDT = DTclf_model.predict(X_test)
print('Accuracy of classifier on training set: {:.4f}'.format(DTclf_model.score(X_train, y_train)))
print('Accuracy of Decision tree classifier on test set: {:.4f}'.format(DTclf_model.score(X_test, y_test)))

Accuracy of classifier on training set: 0.8487
Accuracy of Decision tree classifier on test set: 0.8552
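These accuracies need to be read against the class balance: since only about 15% of bookings are completed (the mean of `booking_complete` is 0.14956), a model that always predicts "not completed" would already score about 0.85. The sketch below illustrates this with scikit-learn's `DummyClassifier` on hypothetical labels, and shows why recall and the confusion matrix are more informative than accuracy here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 15 completed bookings out of 100.
y_true = np.array([1] * 15 + [0] * 85)
X_dummy = np.zeros((100, 1))  # features are ignored by the dummy model

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_hat = baseline.predict(X_dummy)

accuracy = accuracy_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)  # share of completed bookings that were found
matrix = confusion_matrix(y_true, y_hat)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
print(matrix)
```

The baseline reaches 0.85 accuracy while finding none of the completed bookings, so a useful model has to beat it on recall, precision or ROC AUC rather than on accuracy alone.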


In [31]:
from sklearn.model_selection import GridSearchCV

parameters = {
    'max_leaf_nodes': [3, 7, 10, 30, 40],
    'min_samples_split': [4, 6, 8, 10],
    'max_depth': [4, 5, 6, 7, 8, 10, 13, 15, 20],
}

DTclf_model = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=parameters, cv=5, verbose=2, return_train_score=True)
DTclf_model.fit(X_train, y_train)

Fitting 5 folds for each of 180 candidates, totalling 900 fits

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [4, 5, 6, 7, 8, 10, 13, 15, 20],
                         'max_leaf_nodes': [3, 7, 10, 30, 40],
                         'min_samples_split': [4, 6, 8, 10]},
             return_train_score=True, verbose=2)

In [32]:
DTclf_model.best_score_

0.8484

In [33]:
DTclf_model.best_estimator_

DecisionTreeClassifier(max_depth=4, max_leaf_nodes=3, min_samples_split=4)

In [36]:
DTclf_model.best_estimator_.feature_importances_

array([0.        , 0.        , 0.        , 0.        , 0.43337525,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.56662475])
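The importance array is easier to read when paired with column names. The sketch below assumes the feature order matches the `df.info()` listing with `route`, `booking_origin` and `booking_complete` removed, which is the order `X` was built in; the importance values are copied from the output above.

```python
import numpy as np
import pandas as pd

# Assumed feature order: df.info() columns minus the dropped and target columns.
feature_names = [
    "num_passengers", "sales_channel", "trip_type", "purchase_lead",
    "length_of_stay", "flight_hour", "flight_day", "wants_extra_baggage",
    "wants_preferred_seat", "wants_in_flight_meals", "flight_duration",
]

# Values copied from best_estimator_.feature_importances_ above.
importances = np.array([0, 0, 0, 0, 0.43337525, 0, 0, 0, 0, 0, 0.56662475])

# Pair each importance with its column name and rank them.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked.head(3))
```

Under that assumption, the tuned tree (which has only three leaves) splits on `flight_duration` and `length_of_stay` and ignores every other feature.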

In [37]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100, max_depth=None, max_leaf_nodes=None, min_samples_split=2)
rf_clf.fit(X_train,y_train)
score = rf_clf.score(X_test, y_test)
print(score)



0.8512666666666666


In [38]:
param_grid = { 
    'n_estimators': [200, 500],
    'max_depth' : [3,5,7],
    'max_features': ['auto', 'sqrt', 'log2']
}

rf_clf = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, cv=5, verbose=2)
rf_clf.fit(X_train, y_train)

# print best parameter after tuning
print(rf_clf.best_params_)
 
# print how our model looks after hyper-parameter tuning
print(rf_clf.best_estimator_)

Fitting 5 folds for each of 18 candidates, totalling 90 fits




{'max_depth': 3, 'max_features': 'auto', 'n_estimators': 200}
RandomForestClassifier(max_depth=3, n_estimators=200)


In [47]:
print(rf_clf.best_estimator_)

RandomForestClassifier(max_depth=3, n_estimators=200)


In [42]:
rf_clf.best_score_

0.8484

In [39]:
rf_clf.best_estimator_.feature_importances_

array([0.00872709, 0.02356156, 0.0045052 , 0.04048927, 0.35868523,
       0.00910929, 0.0026896 , 0.12739948, 0.05120498, 0.02097004,
       0.35265826])
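Both grid searches report a best cross-validated accuracy of 0.8484, which is close to the share of non-completed bookings, so accuracy may not be separating good models from trivial ones. One option, sketched below on synthetic data, is to score the search with ROC AUC and weight the classes with `class_weight="balanced"`. The grid values, seeds and `make_classification` settings are illustrative assumptions. The sketch also leaves out `max_features="auto"`, which recent scikit-learn releases no longer accept (for classifiers it behaved like `"sqrt"`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data: roughly 85% negatives, 15% positives.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Score candidates by ROC AUC and reweight classes to offset the imbalance.
search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, 5]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With a ranking metric like ROC AUC, a model that always predicts the majority class scores about 0.5, so improvements over the trivial baseline become visible.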

In [41]:
!pip install xgboost
from xgboost import XGBClassifier
xgb_clf = XGBClassifier(learning_rate=0.5, n_estimators=150, base_score=0.5)
xgb_clf.fit(X_train, y_train)
y_predXG = xgb_clf.predict(X_test)
print('Accuracy of Extreme Gradient Boosting classifier on test set: {:.2f}'.format(xgb_clf.score(X_test, y_test)))

Collecting xgboost
  Downloading xgboost-1.7.3-py3-none-win_amd64.whl (89.1 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.7.3
Accuracy of Extreme Gradient Boosting classifier on test set: 0.84


In [43]:
param_grid = { 
    'learning_rate': [0.3,0.4,0.5,0.6,0.7],
    'n_estimators' : [50,100,150,200],
    'base_score': [0.4,0.5,0.6]
}

xgb_clf = GridSearchCV(estimator=XGBClassifier(), param_grid=param_grid, cv=5, verbose=2)
xgb_clf.fit(X_train, y_train)

# print best parameter after tuning
print(xgb_clf.best_params_)
 
# print how our model looks after hyper-parameter tuning
print(xgb_clf.best_estimator_)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
[CV] END base_score=0.6, learning_rate=0.3, n_estimators=100; total time=   1.2s
[CV] END base_score=0.6, learning_rate=0.3, n_estimators=100; total time=   1.2s
[CV] END base_score=0.6, learning_rate=0.3, n_estimators=150; total time=   1.8s
[CV] END base_score=0.6, learning_rate=0.3, n_estimators=150; total time=   2.1s
[CV] END base_score=0.6, learning_rate=0.3, n_estimators=150; total time=   2.1s
[CV] END base_score=0.6, learning_rate=0.3, n_estimators=150; total time=   2.0s
[CV] END base_score=0.6, learning_rate=0.3, n_estimators=150; total time=   1.8s
[CV] END base_score=0.6, lea

In [44]:
xgb_clf.best_score_

0.8466571428571428

In [45]:
xgb_clf.best_estimator_.feature_importances_

array([0.05075717, 0.08417379, 0.08104861, 0.05572521, 0.08536433,
       0.05288049, 0.05374562, 0.27242067, 0.07735417, 0.06940907,
       0.11712087], dtype=float32)

In [48]:
# Cross-validate the random forest configuration selected by the grid search
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(max_depth=3, n_estimators=200)
accuracy = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=10)
print(accuracy)
# mean accuracy across the 10 folds, as a percentage
print("Accuracy of Model with Cross Validation is:", accuracy.mean() * 100)

  estimator.fit(X_train, y_train, **fit_params)


[0.84857143 0.84857143 0.84857143 0.84857143 0.84828571 0.84828571
 0.84828571 0.84828571 0.84828571 0.84828571]
Accuracy of Model with Cross Validation is: 84.84
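
The fold scores above take only two values (0.84857143 and 0.84828571), which is the pattern expected when a model predicts the majority class on every fold. One way to check this, not part of the original notebook, is to compare against a baseline that ignores the features. A minimal sketch using scikit-learn's `DummyClassifier` on synthetic data; the 85/15 class split and the names `X_demo` and `y_demo` are assumptions made for illustration, not values taken from the booking dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the booking dataset): 85% class 0, 15% class 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0] * 850 + [1] * 150)

# A baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(
    baseline, X_demo, y_demo, scoring="accuracy", cv=10
)

print(baseline_scores)
print("Baseline accuracy:", baseline_scores.mean() * 100)
```

If a tuned model's cross-validated accuracy is no higher than this kind of baseline on the same data, accuracy alone is not showing that the model has learned anything about the minority class.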


In [50]:
# Refit on the full training set, then predict on the held-out test set
model.fit(X_train, y_train)
y_predRF = model.predict(X_test)

  model.fit(X_train,y_train)


In [51]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predRF))

              precision    recall  f1-score   support

           0       0.86      1.00      0.92     12828
           1       0.00      0.00      0.00      2172

    accuracy                           0.86     15000
   macro avg       0.43      0.50      0.46     15000
weighted avg       0.73      0.86      0.79     15000



  _warn_prf(average, modifier, msg_start, len(result))
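
The report above shows precision and recall of 0.00 for class 1, so this model never predicts a completed booking on the test set; its 0.86 accuracy matches the share of class 0 (12828 of 15000 rows). One common option for imbalanced classes, which the original notebook does not use, is to reweight the classes during training. A minimal sketch on synthetic data, assuming an 85/15 class split; `class_weight="balanced"` is a standard `RandomForestClassifier` parameter, while the data and variable names are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (NOT the booking dataset): roughly 85% class 0.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Same shallow forest twice: once unweighted, once with balanced class weights.
plain = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
balanced = RandomForestClassifier(
    n_estimators=100, max_depth=3, class_weight="balanced", random_state=0
)
plain.fit(X_tr, y_tr)
balanced.fit(X_tr, y_tr)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm_plain = confusion_matrix(y_te, plain.predict(X_te))
cm_balanced = confusion_matrix(y_te, balanced.predict(X_te))
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))

print("Unweighted:\n", cm_plain, "\nrecall (class 1):", recall_plain)
print("Balanced:\n", cm_balanced, "\nrecall (class 1):", recall_balanced)
```

Comparing the two confusion matrices shows how many minority-class rows each model recovers. Whether reweighting helps on the booking data would need to be checked on that data.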
