# Xente fraud detection

## Content

1. Intro
2. Getting the data
3. EDA

## 1. Intro
Xente is a payment service.
Have a look at the website or the video below to learn more.

Video: https://www.loom.com/share/95af830a57f5452085fe73e2f4edd414

Website: https://www.xente.co


## Set-up and Import

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Preprocessing
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay, recall_score, accuracy_score, precision_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix, roc_curve
from sklearn.metrics import fbeta_score, make_scorer

###
# Import helper functions from our own Python file (see visuals_script.py in the repo)
import visuals_script as vs

# Pretty display for notebooks
%matplotlib inline

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

RSEED = 42
###


# Define a plotting style to be used for all plots in this notebook
plt.style.use('tableau-colorblind10')

## Getting the data
The data is available on Zindi: https://zindi.africa/competitions/xente-fraud-detection-challenge/data

In [81]:
df = pd.read_csv('data/training.csv')
df.columns = df.columns.str.lower()
df.head()

Unnamed: 0,transactionid,batchid,accountid,subscriptionid,customerid,currencycode,countrycode,providerid,productid,productcategory,channelid,amount,value,transactionstarttime,pricingstrategy,fraudresult
0,TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2018-11-15T02:18:49Z,2,0
1,TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2018-11-15T02:19:08Z,2,0
2,TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProviderId_6,ProductId_1,airtime,ChannelId_3,500.0,500,2018-11-15T02:44:21Z,2,0
3,TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,UGX,256,ProviderId_1,ProductId_21,utility_bill,ChannelId_3,20000.0,21800,2018-11-15T03:32:55Z,2,0
4,TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-644.0,644,2018-11-15T03:34:21Z,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95657,TransactionId_89881,BatchId_96668,AccountId_4841,SubscriptionId_3829,CustomerId_3078,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-1000.0,1000,2019-02-13T09:54:09Z,2,0
95658,TransactionId_91597,BatchId_3503,AccountId_3439,SubscriptionId_2643,CustomerId_3874,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2019-02-13T09:54:25Z,2,0
95659,TransactionId_82501,BatchId_118602,AccountId_4841,SubscriptionId_3829,CustomerId_3874,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2019-02-13T09:54:35Z,2,0
95660,TransactionId_136354,BatchId_70924,AccountId_1346,SubscriptionId_652,CustomerId_1709,UGX,256,ProviderId_6,ProductId_19,tv,ChannelId_3,3000.0,3000,2019-02-13T10:01:10Z,2,0


In [3]:
df["transactionstarttime"]

0        2018-11-15T02:18:49Z
1        2018-11-15T02:19:08Z
2        2018-11-15T02:44:21Z
3        2018-11-15T03:32:55Z
4        2018-11-15T03:34:21Z
                 ...         
95657    2019-02-13T09:54:09Z
95658    2019-02-13T09:54:25Z
95659    2019-02-13T09:54:35Z
95660    2019-02-13T10:01:10Z
95661    2019-02-13T10:01:28Z
Name: transactionstarttime, Length: 95662, dtype: object

In [4]:
df['transactionstarttime'] = df['transactionstarttime'].str.replace('T', ' ')
df['transactionstarttime'] = df['transactionstarttime'].str.replace('Z', '')

In [5]:
# infer_datetime_format is deprecated in pandas 2.x; the default parser handles this format
df['transactionstarttime'] = pd.to_datetime(df['transactionstarttime'])
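The two string replacements above only exist to remove the ISO 8601 markers `T` and `Z`. As a minimal sketch of an alternative (not the approach used in this notebook), pandas can parse the raw strings directly. The variable names `raw` and `parsed` are made up for illustration, and the two timestamps are copied from the output above.

```python
import pandas as pd

# Two raw timestamps copied from the transactionstarttime column above
raw = pd.Series(["2018-11-15T02:18:49Z", "2019-02-13T10:01:28Z"])

# utc=True interprets the trailing "Z" as UTC; tz_localize(None) then drops the
# timezone again, so the result is timezone-naive like the replace-based version
parsed = pd.to_datetime(raw, utc=True).dt.tz_localize(None)

print(parsed.dt.hour.tolist())
print(parsed.dt.weekday.tolist())
```

The derived features (hour, weekday and so on) can then be read from `parsed.dt` exactly as in the next cell.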

In [6]:
df['year'] = df['transactionstarttime'].dt.year
df['month'] = df['transactionstarttime'].dt.month
df['day'] = df['transactionstarttime'].dt.day
df['hour'] = df['transactionstarttime'].dt.hour
df['weekday'] = df['transactionstarttime'].dt.weekday

In [7]:
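# The flag is 1 when amount is zero or negative and 0 when it is positive.
# According to the data description, positive amounts are debits from the customer account,
# so despite its name, debit == 1 marks credits into the account.
df["debit"] = df["amount"].apply(lambda x: 0 if x > 0 else 1)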
df[["amount", "debit"]]

Unnamed: 0,amount,debit
0,1000.0,0
1,-20.0,1
2,500.0,0
3,20000.0,0
4,-644.0,1
...,...,...
95657,-1000.0,1
95658,1000.0,0
95659,-20.0,1
95660,3000.0,0


In [8]:
df.drop('amount', axis=1, inplace=True)
df.drop('transactionstarttime', axis=1, inplace=True)

In [9]:
df.head()

Unnamed: 0,transactionid,batchid,accountid,subscriptionid,customerid,currencycode,countrycode,providerid,productid,productcategory,channelid,value,pricingstrategy,fraudresult,year,month,day,hour,weekday,debit
0,TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000,2,0,2018,11,15,2,3,0
1,TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,20,2,0,2018,11,15,2,3,1
2,TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProviderId_6,ProductId_1,airtime,ChannelId_3,500,2,0,2018,11,15,2,3,0
3,TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,UGX,256,ProviderId_1,ProductId_21,utility_bill,ChannelId_3,21800,2,0,2018,11,15,3,3,0
4,TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,644,2,0,2018,11,15,3,3,1


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95662 entries, 0 to 95661
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   transactionid    95662 non-null  object
 1   batchid          95662 non-null  object
 2   accountid        95662 non-null  object
 3   subscriptionid   95662 non-null  object
 4   customerid       95662 non-null  object
 5   currencycode     95662 non-null  object
 6   countrycode      95662 non-null  int64 
 7   providerid       95662 non-null  object
 8   productid        95662 non-null  object
 9   productcategory  95662 non-null  object
 10  channelid        95662 non-null  object
 11  value            95662 non-null  int64 
 12  pricingstrategy  95662 non-null  int64 
 13  fraudresult      95662 non-null  int64 
 14  year             95662 non-null  int64 
 15  month            95662 non-null  int64 
 16  day              95662 non-null  int64 
 17  hour             95662 non-null

In [11]:
df.nunique()

transactionid      95662
batchid            94809
accountid           3633
subscriptionid      3627
customerid          3742
currencycode           1
countrycode            1
providerid             6
productid             23
productcategory        9
channelid              4
value               1517
pricingstrategy        4
fraudresult            2
year                   2
month                  4
day                   31
hour                  24
weekday                7
debit                  2
dtype: int64

#### transactionid
- Unique transaction identifier on the platform
- Unique for every row, so it behaves like an index
- We strip the TransactionId_ prefix

In [12]:
# strip the "TransactionId_" prefix from every value
df["transactionid"] = df["transactionid"].apply(lambda x: x.replace("TransactionId_", ""))

#### batchid 
- Unique number assigned to a batch of transactions for processing
- We strip the BatchId_ prefix
- It is unclear whether this column carries useful information

In [13]:
# strip the "BatchId_" prefix from every value
df["batchid"] = df["batchid"].apply(lambda x: x.replace("BatchId_", ""))

In [14]:
#df['batchid'].value_counts()
# most batches have just one entry; some have more, for example:
# 67019     28
# 51870     16
# 113893    14
# 127204    12
# 116835    10

#### accountid
- Unique number identifying the customer on the Xente platform
- Every customer gets an ID
- We strip the AccountId_ prefix

In [15]:
# strip the "AccountId_" prefix from every value
df["accountid"] = df["accountid"].apply(lambda x: x.replace("AccountId_", ""))

In [16]:
#df['accountid'].value_counts()
# Some accounts have a lot of entries
#4841    30893
#4249     4457
#4840     1738
#3206     1105
#318      1070

#### subscriptionid
- Unique number identifying the customer subscription
- We strip the SubscriptionId_ prefix

In [17]:
# strip the "SubscriptionId_" prefix from every value
df["subscriptionid"] = df["subscriptionid"].apply(lambda x: x.replace("SubscriptionId_", ""))

In [18]:
#df['subscriptionid'].value_counts()
#3829    32630
#4429     4457
#1372     1105
#3087     1070
#4346      965

#### customerid  
- Unique identifier attached to the account
- Business partner or company that needs financial services and uses Xente
- We strip the CustomerId_ prefix


In [19]:
# strip the "CustomerId_" prefix from every value
df["customerid"] = df["customerid"].apply(lambda x: x.replace("CustomerId_", ""))

In [20]:
#df['customerid'].value_counts()
#7343    4091
#3634    2085
#647     1869
#1096     784
#4033     778

#### currencycode and countrycode 
- currencycode: country currency
- countrycode: numerical geographical code of the country
- Both columns have a single unique value and therefore contain no information
- We drop them
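Columns with a single unique value can also be found programmatically instead of by reading the `nunique()` output. The following is a minimal sketch on a hypothetical three-row frame (`toy`, `constant_cols` and `reduced` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical miniature of the data set: two constant columns and one that varies
toy = pd.DataFrame({
    "currencycode": ["UGX", "UGX", "UGX"],
    "countrycode": [256, 256, 256],
    "value": [1000, 20, 500],
})

# A column with exactly one unique value cannot help a model separate classes
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Drop all constant columns in one call
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced.columns))
```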


In [21]:
#df.currencycode.nunique()
#df.countrycode.nunique()

In [22]:
df.drop('currencycode', axis=1, inplace=True)
df.drop('countrycode', axis=1, inplace=True)
#df.head()

#### providerid
- Source provider of the item bought
- This is the phone company or utilities company
- Needs to be transformed into dummy variables

In [23]:
df['providerid'].value_counts()

ProviderId_4    38189
ProviderId_6    34186
ProviderId_5    14542
ProviderId_1     5643
ProviderId_3     3084
ProviderId_2       18
Name: providerid, dtype: int64

#### productid
- Name of the item being bought
- Products of the customer companies
- 23 distinct values, with counts ranging from 32,635 down to 1
- Needs to be transformed into dummy variables

In [24]:
#df['productid'].value_counts()

#### productcategory
- ProductIds are organized into these broader product categories.
- 9 categories, with counts ranging from 45,405 down to 2

In [25]:
#df['productcategory'].value_counts()

#### channelid
- Identifies whether the customer used web, Android, iOS, pay later or checkout
- 4 categories
- Needs to be transformed into dummy variables

In [26]:
df['channelid'].value_counts()

ChannelId_3    56935
ChannelId_2    37141
ChannelId_5     1048
ChannelId_1      538
Name: channelid, dtype: int64

#### amount
- Value of the transaction. Positive for debits from the customer account and negative for credits into the customer account
- 1,676 unique values, both positive and negative
- The sign is captured in the debit flag created above, and amount itself is dropped
- Scaling is applied later to value instead

#### value
- Absolute value of the amount
- All values are positive
- 1,517 unique values, fewer than amount (1,676), because positive and negative amounts of the same size collapse into one value

#### transactionstarttime
- Transaction start time

#### pricingstrategy
- Category of Xente's pricing structure for merchants
- 4 categories
- Needs to be transformed into dummy variables

#### fraudresult
- Fraud status of the transaction: 1 = fraud, 0 = no fraud
- Extremely imbalanced
- 0: 95,469 transactions
- 1: 193 transactions
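To put the imbalance into numbers, the short calculation below uses only the two class counts listed above. It also shows why accuracy is a poor metric here: a model that never predicts fraud would already be right almost every time. The variable names are illustrative.

```python
# Class counts taken from the list above
n_legit = 95469
n_fraud = 193
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the whole data set
fraud_share = n_fraud / n_total

# Accuracy of a trivial model that labels every transaction as legitimate
majority_accuracy = n_legit / n_total

print(f"Fraud share: {fraud_share:.4%}")
print(f"Accuracy of always predicting 'no fraud': {majority_accuracy:.4%}")
```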

In [27]:
df[df['fraudresult'] == 1].groupby('productcategory').count()

Unnamed: 0_level_0,transactionid,batchid,accountid,subscriptionid,customerid,providerid,productid,channelid,value,pricingstrategy,fraudresult,year,month,day,hour,weekday,debit
productcategory,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
airtime,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18
financial_services,161,161,161,161,161,161,161,161,161,161,161,161,161,161,161,161,161
transport,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
utility_bill,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12
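`groupby(...).count()` repeats the same number in every column and does not relate the fraud count to the size of each category. A more compact view, sketched here on hypothetical toy data (the six rows and the names `toy` and `summary` are made up for illustration), aggregates only the target column:

```python
import pandas as pd

# Hypothetical toy transactions; the real data has far more rows
toy = pd.DataFrame({
    "productcategory": ["airtime", "airtime", "airtime",
                        "financial_services", "financial_services", "transport"],
    "fraudresult": [0, 0, 1, 1, 0, 0],
})

# Because fraudresult is 0/1, its sum is the fraud count and its mean the fraud rate
summary = (
    toy.groupby("productcategory")["fraudresult"]
    .agg(n_transactions="count", n_fraud="sum", fraud_rate="mean")
    .sort_values("fraud_rate", ascending=False)
)

print(summary)
```

The same pattern should work for channelid, productid or any other categorical column by changing the groupby key.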


In [28]:
df[df['fraudresult'] == 1].groupby('channelid').count()

Unnamed: 0_level_0,transactionid,batchid,accountid,subscriptionid,customerid,providerid,productid,productcategory,value,pricingstrategy,fraudresult,year,month,day,hour,weekday,debit
channelid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
ChannelId_1,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
ChannelId_2,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
ChannelId_3,184,184,184,184,184,184,184,184,184,184,184,184,184,184,184,184,184


In [29]:
df[df['fraudresult'] == 1].groupby('productid').count()

Unnamed: 0_level_0,transactionid,batchid,accountid,subscriptionid,customerid,providerid,productcategory,channelid,value,pricingstrategy,fraudresult,year,month,day,hour,weekday,debit
productid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
ProductId_10,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6
ProductId_13,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6
ProductId_15,157,157,157,157,157,157,157,157,157,157,157,157,157,157,157,157,157
ProductId_21,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
ProductId_22,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
ProductId_3,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12
ProductId_5,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
ProductId_6,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
ProductId_9,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3


In [30]:
df[df['fraudresult'] == 1].groupby('value').count().sort_values('value',ascending=False).head(60)


Unnamed: 0_level_0,transactionid,batchid,accountid,subscriptionid,customerid,providerid,productid,productcategory,channelid,pricingstrategy,fraudresult,year,month,day,hour,weekday,debit
value,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
9880000,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
9870000,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
9860888,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
9856000,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
9850000,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
9800000,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
8600000,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
7000000,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
5000000,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13
4000000,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


# Exploring

In [31]:
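# Correlation matrix of the numeric columns
# (the ID and category columns are strings and are excluded explicitly,
# because recent pandas versions no longer drop them silently)
corrM = df.select_dtypes(include="number").corr()

corrM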

Unnamed: 0,value,pricingstrategy,fraudresult,year,month,day,hour,weekday,debit
value,1.0,-0.01702,0.566739,0.012001,-0.010452,-0.024104,0.000474,-0.013759,-0.039482
pricingstrategy,-0.01702,1.0,-0.033821,0.029757,-0.031291,-0.131597,0.007423,-0.122899,0.027161
fraudresult,0.566739,-0.033821,1.0,0.009811,-0.008887,-0.008636,0.008295,-0.006913,-0.034272
year,0.012001,0.029757,0.009811,1.0,-0.996205,-0.247493,-0.009621,-0.022821,-0.052524
month,-0.010452,-0.031291,-0.008887,-0.996205,1.0,0.207837,0.012241,0.025646,0.052523
day,-0.024104,-0.131597,-0.008636,-0.247493,0.207837,1.0,-0.019464,0.022521,-0.022912
hour,0.000474,0.007423,0.008295,-0.009621,0.012241,-0.019464,1.0,-0.004345,0.004606
weekday,-0.013759,-0.122899,-0.006913,-0.022821,0.025646,0.022521,-0.004345,1.0,-0.078143
debit,-0.039482,0.027161,-0.034272,-0.052524,0.052523,-0.022912,0.004606,-0.078143,1.0


In [32]:
#sns.pairplot(df, hue="fraudresult");

### Reebal's part

### Preparing the Data

In [33]:
df = df.drop(['transactionid', 'batchid', 'accountid', 'subscriptionid', 'customerid'], axis=1)
df

Unnamed: 0,providerid,productid,productcategory,channelid,value,pricingstrategy,fraudresult,year,month,day,hour,weekday,debit
0,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000,2,0,2018,11,15,2,3,0
1,ProviderId_4,ProductId_6,financial_services,ChannelId_2,20,2,0,2018,11,15,2,3,1
2,ProviderId_6,ProductId_1,airtime,ChannelId_3,500,2,0,2018,11,15,2,3,0
3,ProviderId_1,ProductId_21,utility_bill,ChannelId_3,21800,2,0,2018,11,15,3,3,0
4,ProviderId_4,ProductId_6,financial_services,ChannelId_2,644,2,0,2018,11,15,3,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95657,ProviderId_4,ProductId_6,financial_services,ChannelId_2,1000,2,0,2019,2,13,9,2,1
95658,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000,2,0,2019,2,13,9,2,0
95659,ProviderId_4,ProductId_6,financial_services,ChannelId_2,20,2,0,2019,2,13,9,2,1
95660,ProviderId_6,ProductId_19,tv,ChannelId_3,3000,2,0,2019,2,13,10,2,0


In [34]:
df["providerid"] = df["providerid"].apply(lambda x: x.replace("ProviderId_", ""))
df["productid"] = df["productid"].apply(lambda x: x.replace("ProductId_", ""))
df["channelid"] = df["channelid"].apply(lambda x: x.replace("ChannelId_", ""))

In [35]:
X = df.drop('fraudresult', axis=1)
y = df.fraudresult

In [36]:
df

Unnamed: 0,providerid,productid,productcategory,channelid,value,pricingstrategy,fraudresult,year,month,day,hour,weekday,debit
0,6,10,airtime,3,1000,2,0,2018,11,15,2,3,0
1,4,6,financial_services,2,20,2,0,2018,11,15,2,3,1
2,6,1,airtime,3,500,2,0,2018,11,15,2,3,0
3,1,21,utility_bill,3,21800,2,0,2018,11,15,3,3,0
4,4,6,financial_services,2,644,2,0,2018,11,15,3,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95657,4,6,financial_services,2,1000,2,0,2019,2,13,9,2,1
95658,6,10,airtime,3,1000,2,0,2019,2,13,9,2,0
95659,4,6,financial_services,2,20,2,0,2019,2,13,9,2,1
95660,6,19,tv,3,3000,2,0,2019,2,13,10,2,0


In [37]:
#checking X
X.head()

Unnamed: 0,providerid,productid,productcategory,channelid,value,pricingstrategy,year,month,day,hour,weekday,debit
0,6,10,airtime,3,1000,2,2018,11,15,2,3,0
1,4,6,financial_services,2,20,2,2018,11,15,2,3,1
2,6,1,airtime,3,500,2,2018,11,15,2,3,0
3,1,21,utility_bill,3,21800,2,2018,11,15,3,3,0
4,4,6,financial_services,2,644,2,2018,11,15,3,3,1


In [38]:
#checking y
y.head()

0    0
1    0
2    0
3    0
4    0
Name: fraudresult, dtype: int64

In [39]:
df.head()

Unnamed: 0,providerid,productid,productcategory,channelid,value,pricingstrategy,fraudresult,year,month,day,hour,weekday,debit
0,6,10,airtime,3,1000,2,0,2018,11,15,2,3,0
1,4,6,financial_services,2,20,2,0,2018,11,15,2,3,1
2,6,1,airtime,3,500,2,0,2018,11,15,2,3,0
3,1,21,utility_bill,3,21800,2,0,2018,11,15,3,3,0
4,4,6,financial_services,2,644,2,0,2018,11,15,3,3,1


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95662 entries, 0 to 95661
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   providerid       95662 non-null  object
 1   productid        95662 non-null  object
 2   productcategory  95662 non-null  object
 3   channelid        95662 non-null  object
 4   value            95662 non-null  int64 
 5   pricingstrategy  95662 non-null  int64 
 6   fraudresult      95662 non-null  int64 
 7   year             95662 non-null  int64 
 8   month            95662 non-null  int64 
 9   day              95662 non-null  int64 
 10  hour             95662 non-null  int64 
 11  weekday          95662 non-null  int64 
 12  debit            95662 non-null  int64 
dtypes: int64(9), object(4)
memory usage: 9.5+ MB


In [41]:
df.nunique()

providerid            6
productid            23
productcategory       9
channelid             4
value              1517
pricingstrategy       4
fraudresult           2
year                  2
month                 4
day                  31
hour                 24
weekday               7
debit                 2
dtype: int64

In [42]:
# One-hot encode the categorical features with pandas.get_dummies()
features = X
cat_feats = ['providerid', 'productid', 'productcategory', 'channelid', 'pricingstrategy', 'year', 'month', 'day', 'hour', 'weekday']
features_dummies = pd.get_dummies(features, columns=cat_feats, drop_first=True)

# The target is already numeric: 0 for a legitimate transaction and 1 for fraud
target_enc = y

In [43]:
features_dummies.head()

Unnamed: 0,value,debit,providerid_2,providerid_3,providerid_4,providerid_5,providerid_6,productid_10,productid_11,productid_12,...,hour_20,hour_21,hour_22,hour_23,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6
0,1000,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,1,0,0,0
1,20,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,500,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,21800,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,644,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [44]:
features_dummies.shape

(95662, 106)
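The 106 columns can be checked by hand. With `drop_first=True`, a feature with k levels yields k - 1 indicator columns, and the two remaining columns (value and debit) pass through unchanged. The calculation below uses the unique counts from the `df.nunique()` output above; the variable names are illustrative.

```python
# Unique values per encoded column, taken from the df.nunique() output above
n_unique = {
    "providerid": 6, "productid": 23, "productcategory": 9, "channelid": 4,
    "pricingstrategy": 4, "year": 2, "month": 4, "day": 31, "hour": 24, "weekday": 7,
}

# drop_first=True keeps k - 1 indicator columns for a feature with k levels
n_dummy_cols = sum(k - 1 for k in n_unique.values())

# 'value' and 'debit' are not encoded and are kept as they are
n_total_cols = n_dummy_cols + 2

print(n_dummy_cols, n_total_cols)
```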

In [45]:
features_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95662 entries, 0 to 95661
Columns: 106 entries, value to weekday_6
dtypes: int64(2), uint8(104)
memory usage: 10.9 MB


In [46]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95662 entries, 0 to 95661
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   providerid       95662 non-null  object
 1   productid        95662 non-null  object
 2   productcategory  95662 non-null  object
 3   channelid        95662 non-null  object
 4   value            95662 non-null  int64 
 5   pricingstrategy  95662 non-null  int64 
 6   year             95662 non-null  int64 
 7   month            95662 non-null  int64 
 8   day              95662 non-null  int64 
 9   hour             95662 non-null  int64 
 10  weekday          95662 non-null  int64 
 11  debit            95662 non-null  int64 
dtypes: int64(8), object(4)
memory usage: 8.8+ MB


In [47]:
features_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95662 entries, 0 to 95661
Columns: 106 entries, value to weekday_6
dtypes: int64(2), uint8(104)
memory usage: 10.9 MB


### Shuffle and Split Data

Now all _categorical variables_ have been converted into numerical features. We will now split the data (both features and their labels) into training and test sets. 70% of the data will be used for training and 30% for testing.  

In [48]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the features and the target into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_dummies, target_enc, test_size=0.3, random_state=RSEED, stratify=target_enc)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 66963 samples.
Testing set has 28699 samples.
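The printed sizes follow from the 30% test share. The sketch below reproduces them under the assumption that the test size is rounded up and the training set receives the remaining rows, which is consistent with the output above. It also estimates how many of the 193 fraud cases a stratified split should place in the test set.

```python
import math

n_total = 95662   # rows in the full data set
test_size = 0.3   # share passed to train_test_split

# Assumption: the test size is rounded up, the training set gets the rest
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

# With stratify, each split keeps roughly the same fraud share as the full data
n_fraud_total = 193
expected_fraud_test = n_fraud_total * test_size

print(n_train, n_test, round(expected_fraud_test))
```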


In [49]:
X_train

Unnamed: 0,value,debit,providerid_2,providerid_3,providerid_4,providerid_5,providerid_6,productid_10,productid_11,productid_12,...,hour_20,hour_21,hour_22,hour_23,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6
18178,50,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
81353,5000,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
31115,1000,0,0,0,0,0,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
21634,50,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
10517,17000,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65270,1000,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
37067,50,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
50895,3000,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
57355,10000,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [50]:
from collections import Counter
from imblearn.over_sampling import SMOTE



In [51]:
# summarize class distribution
counter = Counter(y)
print(counter)


Counter({0: 95469, 1: 193})


In [52]:
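# Oversample the minority class with SMOTE.
# Only the training data is resampled, so the test set keeps the real class distribution.
oversample = SMOTE(random_state=RSEED)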
X_train, y_train = oversample.fit_resample(X_train, y_train)


In [53]:
# summarize the new class distribution
counter = Counter(y_train)
print(counter)

Counter({0: 66828, 1: 66828})


### Normalizing Numerical Features
It is often good practice to scale numerical features. Scaling puts features on a comparable range, so that supervised learners that are sensitive to feature magnitude do not favour the feature with the largest values. Note that once scaling is applied, the values no longer have their original meaning, as the example below shows. To avoid data leakage, we fit the scaler on the training set only (after the split) and apply the fitted scaler to the test set.
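The leakage argument can be illustrated without any library: the mean and standard deviation are computed from the training values only and then reused for the test values. This is a minimal sketch with hypothetical numbers. `standardize` is an illustrative helper, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

# Hypothetical transaction values; the real notebook scales the 'value' column
train_values = [50.0, 5000.0, 1000.0, 50.0, 17000.0]
test_values = [1000.0, 20.0]

# "Fit": the statistics come from the training data only
train_mean = fmean(train_values)
train_std = pstdev(train_values)

def standardize(values, mean, std):
    """Shift and scale values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]

# "Transform": the same training statistics are applied to both sets
train_scaled = standardize(train_values, train_mean, train_std)
test_scaled = standardize(test_values, train_mean, train_std)

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction. The test values generally do not, which is expected and not a sign of an error.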

In [54]:
# Import sklearn.preprocessing.StandardScaler
from sklearn.preprocessing import StandardScaler

# Initialize a scaler, then apply it to the features
scaler = StandardScaler()
numerical = ['value']

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[numerical] = scaler.fit_transform(X_train_scaled[numerical])
X_test_scaled[numerical] = scaler.transform(X_test_scaled[numerical])

# Show an example of a record with scaling applied
X_train_scaled.head()

Unnamed: 0,value,debit,providerid_2,providerid_3,providerid_4,providerid_5,providerid_6,productid_10,productid_11,productid_12,...,hour_20,hour_21,hour_22,hour_23,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6
0,-0.528239,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,-0.524476,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,-0.527517,0,0,0,0,0,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
3,-0.528239,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,-0.515352,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [55]:
X_train_scaled.describe()

Unnamed: 0,value,debit,providerid_2,providerid_3,providerid_4,providerid_5,providerid_6,productid_10,productid_11,productid_12,...,hour_20,hour_21,hour_22,hour_23,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6
count,133656.0,133656.0,133656.0,133656.0,133656.0,133656.0,133656.0,133656.0,133656.0,133656.0,...,133656.0,133656.0,133656.0,133656.0,133656.0,133656.0,133656.0,133656.0,133656.0,133656.0
mean,8.505928e-18,0.204922,7.5e-05,0.132482,0.204922,0.140989,0.178959,0.08251,0.007242,7e-06,...,0.023276,0.016475,0.003561,0.001526,0.122942,0.086094,0.091765,0.171313,0.074654,0.05898
std,1.000004,0.403646,0.00865,0.339015,0.403646,0.348011,0.38332,0.275142,0.084794,0.002735,...,0.15078,0.127294,0.059571,0.039038,0.328372,0.280504,0.288696,0.376784,0.262834,0.235588
min,-0.5282758,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.527517,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-0.4902621,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,-0.0206732,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,6.965286,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Baseline Model

### Value of Product:
Find fraudulent transactions, save money, avoid reputational damage and prevent money laundering.
### Prediction:
Whether a transaction is fraudulent
### Evaluation Metric:
F1 score (recommended and given by Zindi)
### Baseline Model:
Every transaction with a value of 500,000 or more is classified as fraud.
### Score Baseline Model:
F1 score = 0.81

### Baseline Model:
A simple investigation of the dataset tells us how likely a transaction is to be fraudulent.

Let's take a look at the following:

- The total number of records, 'n_records' (95,662)
- The number of transactions with a value of 500,000 or more, 'n_greater_500k'
- The number of transactions with a value below 500,000, 'n_at_most_500k'
- The percentage of transactions with a value of 500,000 or more, 'greater_percent', which is the share the baseline flags as fraud

In [56]:
# Total number of records
n_records = df.shape[0]

# Number of transactions with a value of 500,000 or more
n_greater_500k = len(df[df["value"] >= 500000])

# Number of transactions with a value below 500,000
n_at_most_500k = len(df[df["value"] < 500000])

# Percentage of transactions with a value of 500,000 or more
greater_percent = n_greater_500k / n_records * 100.0

# Print the results
print("Total number of transactions: {}".format(n_records))
print("Transactions with a value of 500,000 or more: {}".format(n_greater_500k))
print("Transactions with a value below 500,000: {}".format(n_at_most_500k))
print("Percentage of transactions with a value of 500,000 or more (flagged as fraud by the baseline): {:.2f}%".format(greater_percent))

In [57]:
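# Baseline rule: flag every transaction with a value of 500,000 or more as fraud.
# Motivation: value is the feature most strongly correlated with fraudresult
# (about 0.57 in the correlation matrix above).
# Note: X_train has already been oversampled with SMOTE at this point,
# so the training scores below are computed on resampled data.
y_pred_baseline = (X_train.value >= 500000)*1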


In [58]:
y_true = (y_train)

In [59]:

print(accuracy_score(y_true, y_pred_baseline))
print(recall_score(y_true, y_pred_baseline))
print(precision_score(y_true, y_pred_baseline))
print(f1_score(y_true, y_pred_baseline))

0.9480307655473753
0.8966301550248399
0.9993662230227827
0.9452147713469942


In [60]:
y_pred_baseline_test = (X_test.value >= 500000)*1

In [61]:
y_true_test = (y_test)

In [77]:
print(accuracy_score(y_test, y_pred_baseline_test))
print(recall_score(y_test, y_pred_baseline_test))
print(precision_score(y_test, y_pred_baseline_test))
print(f1_score(y_test, y_pred_baseline_test))
print(confusion_matrix(y_test,y_pred_baseline_test))

0.9990243562493467
0.8793103448275862
0.7083333333333334
0.7846153846153847
[[28620    21]
 [    7    51]]
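The four scores printed above can be recomputed from the confusion matrix alone, which makes the relation between the metrics explicit. The sketch below uses the test-set matrix of the baseline and scikit-learn's layout of `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix of the baseline on the test set, copied from the output above
tn, fp = 28620, 21
fn, tp = 7, 51

# Share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Share of flagged transactions that really are fraud
precision = tp / (tp + fp)

# Share of fraud cases that were flagged
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Accuracy is close to 1 simply because true negatives dominate. Precision and recall ignore them, which is why the F1 score is the more informative number for this data set.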


Scores on the data set without oversampling (accuracy, recall, precision, F1 score):
0.9991487836566462
0.8686131386861314
0.7531645569620253
0.8067796610169492

## Alternative score for evaluating the result
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html
- sklearn.metrics.matthews_corrcoef(y_true, y_pred, *, sample_weight=None)
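For a binary problem, the Matthews correlation coefficient can be written directly in terms of the confusion matrix counts. The sketch below applies that formula to the baseline's test-set matrix from above. `matthews_corrcoef_from_counts` is a hypothetical helper written for illustration; in the notebook itself one would call `sklearn.metrics.matthews_corrcoef(y_test, y_pred_baseline_test)`.

```python
import math

def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """Binary MCC from confusion matrix counts (illustrative helper)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # When a row or column of the matrix is empty, the score is defined as 0
    return numerator / denominator if denominator else 0.0

# Baseline on the test set: [[TN, FP], [FN, TP]] = [[28620, 21], [7, 51]]
mcc_baseline = matthews_corrcoef_from_counts(tp=51, tn=28620, fp=21, fn=7)

# A model that never predicts fraud, for comparison
mcc_all_negative = matthews_corrcoef_from_counts(tp=0, tn=28641, fp=0, fn=58)

print(f"baseline MCC: {mcc_baseline:.3f}")
print(f"always 'no fraud' MCC: {mcc_all_negative:.3f}")
```

Unlike accuracy, the score of the trivial model drops to 0, so the metric is not flattered by the large number of true negatives.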

##  Supervised Learning Models

###  Model Application

Now we'll pick three supervised learning models that are appropriate for this problem (a decision tree, a support vector machine and AdaBoost) and test them on the Xente transaction data.

In [63]:
def train_predict_evaluate(model, X_train, y_train, X_test, y_test, fbeta=1.0):
    """Train model, make prediction and evaluate

    Args:
        model: Unfitted scikit-learn classifier
        X_train (pd.DataFrame): Train data features
        y_train (pd.Series): Train data target
        X_test (pd.DataFrame): Test data features
        y_test (pd.Series): Test data target
        fbeta (float, optional): Beta for f_beta score. Defaults to 1.0.
    """
    # train model
    model.fit(X_train, y_train)

    # make predictions
    y_pred = model.predict(X_test)

    # print metrics of predictions
    print(f"Model: {type(model)}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"recall: {recall_score(y_test, y_pred)}")
    print(f"F_beta: {fbeta_score(y_test, y_pred, beta=fbeta)}")
    print("Confusion matrix: ")
    print(confusion_matrix(y_test, y_pred))
    print('-----------------------------')

In [64]:
from sklearn.ensemble import AdaBoostClassifier
# Train a decision tree, an SVM and an AdaBoost classifier on the training data and evaluate them on the test data
model_dtree = DecisionTreeClassifier(random_state=RSEED)
model_svc = SVC()
model_adaboost = AdaBoostClassifier()

for model in [model_dtree, model_svc, model_adaboost]:
    train_predict_evaluate(model, X_train_scaled, y_train, X_test_scaled, y_test, fbeta=1)

Model: <class 'sklearn.tree._classes.DecisionTreeClassifier'>
Accuracy: 0.9988501341510158
recall: 0.7758620689655172
F_beta: 0.7317073170731708
Confusion matrix: 
[[28621    20]
 [   13    45]]
-----------------------------
Model: <class 'sklearn.svm._classes.SVC'>
Accuracy: 0.9980487124986933
recall: 0.46551724137931033
F_beta: 0.49090909090909085
Confusion matrix: 
[[28616    25]
 [   31    27]]
-----------------------------
Model: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
Accuracy: 0.9974215129447019
recall: 0.9310344827586207
F_beta: 0.5934065934065935
Confusion matrix: 
[[28571    70]
 [    4    54]]
-----------------------------


In [65]:
    """oversampling only in the training data, fbeta = 1, without stratify
    Model: <class 'sklearn.tree._classes.DecisionTreeClassifier'>
Accuracy: 0.9990243562493467
recall: 0.8928571428571429
F_beta: 0.78125
Confusion matrix: 
[[28621    22]
 [    6    50]]
-----------------------------
Model: <class 'sklearn.svm._classes.SVC'>
Accuracy: 0.990870762047458
recall: 0.9464285714285714
F_beta: 0.28804347826086957
Confusion matrix: 
[[28384   259]
 [    3    53]]
-----------------------------
Model: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
Accuracy: 0.9953308477647305
recall: 0.9107142857142857
F_beta: 0.43220338983050843
Confusion matrix: 
[[28514   129]
 [    5    51]]
    """


In [66]:
    """ oversampling only in the training data, fbeta = 0.5
    Model: <class 'sklearn.tree._classes.DecisionTreeClassifier'>
Accuracy: 0.9990243562493467
recall: 0.8928571428571429
F_beta: 0.7267441860465117
Confusion matrix: 
[[28621    22]
 [    6    50]]
-----------------------------
Model: <class 'sklearn.svm._classes.SVC'>
Accuracy: 0.990870762047458
recall: 0.9464285714285714
F_beta: 0.20322085889570557
Confusion matrix: 
[[28384   259]
 [    3    53]]
-----------------------------
Model: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
Accuracy: 0.9953308477647305
recall: 0.9107142857142857
F_beta: 0.32860824742268036
Confusion matrix: 
[[28514   129]
 [    5    51]]
-----------------------------
    """

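
The two recorded runs above differ only in `beta`, so their F-beta values can be reproduced directly from the confusion matrix. The helper below is a minimal sketch written for this note (`fbeta_from_confusion` is not part of the notebook); it assumes scikit-learn's `[[TN, FP], [FN, TP]]` layout and uses the decision tree matrix `[[28621, 22], [6, 50]]`.

```python
import numpy as np

def fbeta_from_confusion(cm, beta=1.0):
    """F-beta of the positive class from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    # beta < 1 weights precision more heavily, beta > 1 weights recall more heavily
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Decision tree, oversampling only in the training data
cm_dtree = [[28621, 22], [6, 50]]

f1 = fbeta_from_confusion(cm_dtree, beta=1.0)
f05 = fbeta_from_confusion(cm_dtree, beta=0.5)
print(f"F1   = {f1:.4f}")
print(f"F0.5 = {f05:.4f}")
```

Because the oversampled models have high recall but many false positives, lowering `beta` to 0.5 emphasizes precision and pulls every F-beta value down, which matches the recorded numbers.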

In [67]:
""" Results without oversampling

Model: <class 'sklearn.tree._classes.DecisionTreeClassifier'>
Accuracy: 0.999442489285341
recall: 0.8214285714285714
F_beta: 0.8712121212121213
Confusion matrix: 
[[28637     6]
 [   10    46]]
-----------------------------
Model: <class 'sklearn.svm._classes.SVC'>
Accuracy: 0.9994773337050071
recall: 0.8035714285714286
F_beta: 0.8928571428571429
Confusion matrix: 
[[28639     4]
 [   11    45]]
-----------------------------
Model: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
Accuracy: 0.9994076448656748
recall: 0.8214285714285714
F_beta: 0.8582089552238806
Confusion matrix: 
[[28636     7]
 [   10    46]]
"""


In [68]:
    """ Results with oversampling for train and test
    Model: <class 'sklearn.tree._classes.DecisionTreeClassifier'>
Accuracy: 0.998708145665305
recall: 0.9993050005212496
F_beta: 0.998361350080891
Confusion matrix: 
[[28451    54]
 [   20 28757]]
-----------------------------
Model: <class 'sklearn.svm._classes.SVC'>
Accuracy: 0.9941866554938724
recall: 0.9999652500260625
F_beta: 0.990847674730905
Confusion matrix: 
[[28173   332]
 [    1 28776]]
-----------------------------
Model: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
Accuracy: 0.9927551412311023
recall: 0.9890190082357438
F_beta: 0.9950215708622053
Confusion matrix: 
[[28406    99]
 [  316 28461]]
-----------------------------

    """


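
The near-perfect scores recorded for oversampling on both train and test should be read with caution: the confusion matrices show a roughly balanced test set, which no longer reflects the real fraud rate. The sketch below shows one way to oversample only the training split. It uses synthetic data from `make_classification` and `sklearn.utils.resample` as stand-ins, since the notebook's own oversampling code is not shown in this section; all names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, heavily imbalanced stand-in for the transaction data
X, y = make_classification(
    n_samples=5000, n_features=8, weights=[0.98, 0.02], random_state=42
)

# Split first, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Oversample the minority class in the training data only
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=42,
)
X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])

print("Train class counts after oversampling:", np.bincount(y_train_bal))
print("Test class counts (untouched):        ", np.bincount(y_test))
```

The test split stays imbalanced, so metrics computed on it remain comparable to what the model would see on new transactions.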

### Model Tuning
Using grid search (`GridSearchCV`) with different parameter/value combinations, we can tune our model for even better results. We will tune the AdaBoostClassifier since it reached the highest recall on the test data in the comparison above (0.93), although its F-score was lower than the decision tree's.
For AdaBoost, we'll tune the `n_estimators` and `learning_rate` parameters, and also the base classifier parameters (remember that our base classifier for the AdaBoost ensemble is a decision tree!).
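
To get a feel for the cost of the search before running it, the grid can be expanded with `ParameterGrid`. This is a small sketch using the same parameter values as the grid defined for this search; the fold count of 5 matches the `cv=5` setting, and the variable names are chosen for this example.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the grid passed to GridSearchCV; the double underscore
# addresses a parameter of the nested base estimator (the decision tree)
parameters = {
    'n_estimators': [50, 120],
    'learning_rate': [0.1, 0.5, 1.0],
    'base_estimator__max_depth': np.arange(3, 8, 2),
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)   # number of parameter combinations
n_fits = n_candidates * 5  # one fit per combination and CV fold

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

With `refit=True` (the default), `GridSearchCV` adds one more fit of the best combination on the full training data.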

In [69]:
# Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer 

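# Initialize the classifier with a decision tree as base estimator
# (scikit-learn >= 1.2 names this argument `estimator` instead of `base_estimator`)
clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier()) 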

# Create the parameters list you wish to tune
parameters = {'n_estimators':[50, 120],                
              'learning_rate':[0.1, 0.5, 1.],               
              #'base_estimator__min_samples_split' : np.arange(2, 8, 2),               
              'base_estimator__max_depth' : np.arange(3, 8, 2)              
             } 

# Make an fbeta_score scoring object
scorer = make_scorer(fbeta_score,beta=1) 


# Perform grid search on the classifier using 'scorer' as the scoring method
gs = GridSearchCV(clf, parameters, n_jobs=-1, scoring=scorer, cv=5, verbose=0)

# Fit the grid search object to the training data and find the optimal parameters
gs.fit(X_train_scaled, y_train)

GridSearchCV(cv=5,
             estimator=AdaBoostClassifier(base_estimator=DecisionTreeClassifier()),
             n_jobs=-1,
             param_grid={'base_estimator__max_depth': array([3, 5, 7]),
                         'learning_rate': [0.1, 0.5, 1.0],
                         'n_estimators': [50, 120]},
             scoring=make_scorer(fbeta_score, beta=1))

In [75]:
# Get the best estimator (already refit on the full training data by GridSearchCV)
best_clf = gs.best_estimator_

# Fit the unoptimized model and make predictions with it
clf.fit(X_train_scaled, y_train)
predictions = clf.predict(X_test_scaled)

# Make predictions using the optimized model
best_predictions = best_clf.predict(X_test_scaled)

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("Recall score on testing data: {:.4f}".format(recall_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 1)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final recall score on the testing data: {:.4f}".format(recall_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 1)))
print(confusion_matrix(y_test, best_predictions))
print(best_clf)

Unoptimized model
------
Accuracy score on testing data: 0.9993
Recall score on testing data: 0.8103
F-score on testing data: 0.8246

Optimized Model
------
Final accuracy score on the testing data: 0.9995
Final recall score on the testing data: 0.8966
Final F-score on the testing data: 0.8814
[[28633     8]
 [    6    52]]
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5),
                   learning_rate=0.5, n_estimators=120)


In [71]:
"""Unoptimized model
------
Accuracy score on testing data: 0.9993
Recall score on testing data: 0.8276
F-score on testing data: 0.8348

Optimized Model
------
Final accuracy score on the testing data: 0.9994
Final recall score on the testing data: 0.8448
Final F-score on the testing data: 0.8596
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
                   n_estimators=120)"""
