## Problem Statement

Problem Statement: PredCatch Analytics' Australian banking client's profitability and reputation are being hit by fraudulent ATM transactions. They want PredCatch to help them in reducing and if possible completely eliminating such fraudulent transactions. PredCatch believes it can do the same by building a predictive model to catch such fraudulent transactions in real time and decline them. Your job as PredCatch's Data Scientist is to build this fraud detection & prevention predictive model in the first step. If successful, in the 2nd step you will have to present your solutions and explain how it works to the client. The data has been made available to you.


The challenging part of the problem is that the data contains very few fraud instances in comparison to the overall population. To give more edge to the solution they have also collected data regarding location [geo_scores] of the transactions, their own proprietary index [Lambda_wts], on network turn around times [Qset_tats] and vulnerability qualification score [instance_scores]. As of now you don't need to understand what they mean.


Training data contains masked variables pertaining to each transaction id. Your prediction target here is 'Target'.<br>
1: Fraudulent transactions <br>
0: Clean transactions

## Importing Dependencies

In [1]:
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
warnings.filterwarnings('ignore')

### Importing Data

In [2]:
geo = pd.read_csv('Geo_scores.csv')
lambdawts = pd.read_csv('Lambda_wts.csv')
instance = pd.read_csv('instance_scores.csv')
qsets = pd.read_csv('Qset_tats.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test_share.csv')

In [3]:
geo.head(2)

Unnamed: 0,id,geo_score
0,26674,4.48
1,204314,4.48


In [4]:
geo['id'].value_counts()

26674     5
149679    5
114110    5
24969     5
262179    5
         ..
152225    5
259714    5
232       5
128848    5
258558    5
Name: id, Length: 284807, dtype: int64

In [5]:
geo.isnull().sum()/len(geo)*100

id           0.000000
geo_score    5.023964
dtype: float64

In [6]:
geo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1424035 entries, 0 to 1424034
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   id         1424035 non-null  int64  
 1   geo_score  1352492 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 21.7 MB


### Filling Nan with zero because the data given by bank is encrypted

e.g. According to the Documentation - The geo data gives the location of the ATM but in dataset it has only numbers
So there is conflict whether its categorical or Numerical data
Hence we are replacing Nan with zeros so that our ML model will work just fine
and then we will convey to the bank that these many values are missing, if you can provide that data itll be great otherwise I have created model using zeros in place of Nan

In [8]:
geo.fillna(0, inplace=True)

In [9]:
geo.isnull().sum()

id           0
geo_score    0
dtype: int64

### Going through the data

In [18]:
geo= geo.groupby('id').mean()

In [19]:
geo.head()

Unnamed: 0_level_0,geo_score
id,Unnamed: 1_level_1
0,-0.62
1,1.07
2,0.07
3,0.18
4,0.54


In [17]:
geo[geo['id']==0].mean()

id           0.00
geo_score   -0.62
dtype: float64

In [12]:
lambdawts.head(2)

Unnamed: 0,Group,lambda_wt
0,Grp936,3.41
1,Grp347,-2.88


In [13]:
lambdawts['Group'].value_counts()

Grp936     1
Grp1128    1
Grp341     1
Grp63      1
Grp173     1
          ..
Grp337     1
Grp649     1
Grp1183    1
Grp46      1
Grp37      1
Name: Group, Length: 1400, dtype: int64

In [14]:
lambdawts.isnull().sum()

Group        0
lambda_wt    0
dtype: int64

In [15]:
instance.head(2)

Unnamed: 0,id,instance_scores
0,173444,-0.88
1,259378,1.5


In [16]:
instance.isnull().sum()

id                 0
instance_scores    0
dtype: int64

In [17]:
instance = instance.groupby('id').mean()

In [18]:
instance.head()

Unnamed: 0_level_0,instance_scores
id,Unnamed: 1_level_1
0,0.09
1,-0.17
2,0.21
3,-0.05
4,0.75


In [19]:
qsets.head(2)

Unnamed: 0,id,qsets_normalized_tat
0,9983,2.41
1,266000,3.1


In [20]:
qsets.isnull().sum()/len(qsets)*100

id                      0.000000
qsets_normalized_tat    7.247083
dtype: float64

In [21]:
qsets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1424035 entries, 0 to 1424034
Data columns (total 2 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   id                    1424035 non-null  int64  
 1   qsets_normalized_tat  1320834 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 21.7 MB


In [22]:
qsets.fillna(0, inplace=True)

In [23]:
qsets.isnull().sum()

id                      0
qsets_normalized_tat    0
dtype: int64

In [24]:
qsets = qsets.groupby('id').mean()

In [25]:
qsets.head()

Unnamed: 0_level_0,qsets_normalized_tat
id,Unnamed: 1_level_1
0,0.21
1,-0.11
2,1.11
3,-0.68
4,-0.24


In [26]:
train.head(2)

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Dem8,Dem9,Cred1,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,Target
0,112751,Grp169,1.07,0.58,0.48,0.766667,1.233333,1.993333,0.34,1.01,...,0.68,0.726667,0.606667,1.01,0.933333,0.603333,0.686667,0.673333,-245.75,0
1,18495,Grp161,0.473333,1.206667,0.883333,1.43,0.726667,0.626667,0.81,0.783333,...,0.716667,0.743333,0.68,0.69,0.56,0.67,0.553333,0.653333,-248.0,0


In [27]:
train.isnull().sum()

id                0
Group             0
Per1              0
Per2              0
Per3              0
Per4              0
Per5              0
Per6              0
Per7              0
Per8              0
Per9              0
Dem1              0
Dem2              0
Dem3              0
Dem4              0
Dem5              0
Dem6              0
Dem7              0
Dem8              0
Dem9              0
Cred1             0
Cred2             0
Cred3             0
Cred4             0
Cred5             0
Cred6             0
Normalised_FNT    0
Target            0
dtype: int64

In [28]:
test.head(2)

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Dem7,Dem8,Dem9,Cred1,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT
0,146574,Grp229,-0.3,1.54,0.22,-0.28,0.57,0.26,0.7,1.076667,...,0.786667,0.546667,0.313333,0.703333,0.813333,0.776667,0.796667,0.823333,0.783333,-249.75
1,268759,Grp141,0.633333,0.953333,0.81,0.466667,0.91,0.253333,1.04,0.55,...,0.636667,0.77,0.993333,0.536667,0.703333,0.806667,0.63,0.673333,0.673333,-249.8125


In [29]:
train.head(2)

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Dem8,Dem9,Cred1,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,Target
0,112751,Grp169,1.07,0.58,0.48,0.766667,1.233333,1.993333,0.34,1.01,...,0.68,0.726667,0.606667,1.01,0.933333,0.603333,0.686667,0.673333,-245.75,0
1,18495,Grp161,0.473333,1.206667,0.883333,1.43,0.726667,0.626667,0.81,0.783333,...,0.716667,0.743333,0.68,0.69,0.56,0.67,0.553333,0.653333,-248.0,0


In [30]:
train['data'] = 'train'

In [31]:
train.head()

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Dem9,Cred1,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,Target,data
0,112751,Grp169,1.07,0.58,0.48,0.766667,1.233333,1.993333,0.34,1.01,...,0.726667,0.606667,1.01,0.933333,0.603333,0.686667,0.673333,-245.75,0,train
1,18495,Grp161,0.473333,1.206667,0.883333,1.43,0.726667,0.626667,0.81,0.783333,...,0.743333,0.68,0.69,0.56,0.67,0.553333,0.653333,-248.0,0,train
2,23915,Grp261,1.13,0.143333,0.946667,0.123333,0.08,0.836667,0.056667,0.756667,...,0.82,0.6,0.383333,0.763333,0.67,0.686667,0.673333,-233.125,0,train
3,50806,Grp198,0.636667,1.09,0.75,0.94,0.743333,0.346667,0.956667,0.633333,...,0.9,0.68,0.846667,0.423333,0.52,0.846667,0.76,-249.7775,0,train
4,184244,Grp228,0.56,1.013333,0.593333,0.416667,0.773333,0.46,0.853333,0.796667,...,0.486667,0.693333,0.526667,0.52,0.716667,0.706667,0.673333,-247.5775,0,train


In [32]:
test.head()

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Dem7,Dem8,Dem9,Cred1,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT
0,146574,Grp229,-0.3,1.54,0.22,-0.28,0.57,0.26,0.7,1.076667,...,0.786667,0.546667,0.313333,0.703333,0.813333,0.776667,0.796667,0.823333,0.783333,-249.75
1,268759,Grp141,0.633333,0.953333,0.81,0.466667,0.91,0.253333,1.04,0.55,...,0.636667,0.77,0.993333,0.536667,0.703333,0.806667,0.63,0.673333,0.673333,-249.8125
2,59727,Grp188,1.043333,0.74,0.86,1.006667,0.583333,0.616667,0.63,0.686667,...,0.626667,0.756667,0.953333,0.623333,0.753333,0.87,0.596667,0.68,0.67,-248.12
3,151544,Grp426,1.283333,0.3,0.576667,0.636667,0.256667,0.543333,0.356667,0.663333,...,0.48,0.46,0.26,0.8,0.606667,0.456667,0.32,0.676667,0.66,-222.9875
4,155008,Grp443,1.186667,0.326667,0.476667,0.866667,0.436667,0.68,0.476667,0.686667,...,0.706667,0.74,0.823333,0.67,0.896667,0.566667,0.546667,0.65,0.663333,-196.22


In [33]:
test['data'] = 'test'

In [34]:
test.head()

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Dem8,Dem9,Cred1,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,data
0,146574,Grp229,-0.3,1.54,0.22,-0.28,0.57,0.26,0.7,1.076667,...,0.546667,0.313333,0.703333,0.813333,0.776667,0.796667,0.823333,0.783333,-249.75,test
1,268759,Grp141,0.633333,0.953333,0.81,0.466667,0.91,0.253333,1.04,0.55,...,0.77,0.993333,0.536667,0.703333,0.806667,0.63,0.673333,0.673333,-249.8125,test
2,59727,Grp188,1.043333,0.74,0.86,1.006667,0.583333,0.616667,0.63,0.686667,...,0.756667,0.953333,0.623333,0.753333,0.87,0.596667,0.68,0.67,-248.12,test
3,151544,Grp426,1.283333,0.3,0.576667,0.636667,0.256667,0.543333,0.356667,0.663333,...,0.46,0.26,0.8,0.606667,0.456667,0.32,0.676667,0.66,-222.9875,test
4,155008,Grp443,1.186667,0.326667,0.476667,0.866667,0.436667,0.68,0.476667,0.686667,...,0.74,0.823333,0.67,0.896667,0.566667,0.546667,0.65,0.663333,-196.22,test


### Creating a combined train+test dataset for data preprocessing

In [35]:
all_data = pd.concat([train, test], axis=0)

In [36]:
all_data.tail()

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Dem9,Cred1,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,Target,data
56957,18333,Grp102,0.553333,1.043333,1.096667,0.686667,0.673333,0.34,0.9,0.643333,...,0.433333,0.66,0.776667,0.61,0.69,0.75,0.7,-249.505,,test
56958,244207,Grp504,1.353333,0.616667,0.276667,0.783333,0.69,0.65,0.473333,0.67,...,0.87,0.683333,0.69,0.64,0.883333,0.663333,0.66,-248.7525,,test
56959,103277,Grp78,1.083333,0.433333,0.806667,0.49,0.243333,0.316667,0.533333,0.606667,...,0.063333,0.753333,0.78,0.603333,0.88,0.643333,0.676667,-231.05,,test
56960,273294,Grp134,0.566667,1.153333,0.37,0.616667,0.793333,0.226667,0.91,0.696667,...,1.026667,0.626667,0.646667,0.566667,0.616667,0.713333,0.706667,-246.315,,test
56961,223337,Grp18,1.426667,0.11,-0.006667,-0.2,0.983333,1.87,0.033333,0.963333,...,0.67,0.77,0.893333,0.586667,0.616667,0.683333,0.65,-248.45,,test


In [37]:
all_data.shape

(284807, 29)

In [38]:
all_data = pd.merge(all_data, geo, on='id', how='left')

In [39]:
all_data.head()

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Cred1,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,Target,data,geo_score
0,112751,Grp169,1.07,0.58,0.48,0.766667,1.233333,1.993333,0.34,1.01,...,0.606667,1.01,0.933333,0.603333,0.686667,0.673333,-245.75,0.0,train,0.22
1,18495,Grp161,0.473333,1.206667,0.883333,1.43,0.726667,0.626667,0.81,0.783333,...,0.68,0.69,0.56,0.67,0.553333,0.653333,-248.0,0.0,train,-0.25
2,23915,Grp261,1.13,0.143333,0.946667,0.123333,0.08,0.836667,0.056667,0.756667,...,0.6,0.383333,0.763333,0.67,0.686667,0.673333,-233.125,0.0,train,-0.95
3,50806,Grp198,0.636667,1.09,0.75,0.94,0.743333,0.346667,0.956667,0.633333,...,0.68,0.846667,0.423333,0.52,0.846667,0.76,-249.7775,0.0,train,0.49
4,184244,Grp228,0.56,1.013333,0.593333,0.416667,0.773333,0.46,0.853333,0.796667,...,0.693333,0.526667,0.52,0.716667,0.706667,0.673333,-247.5775,0.0,train,0.85


In [40]:
all_data.isnull().sum()

id                    0
Group                 0
Per1                  0
Per2                  0
Per3                  0
Per4                  0
Per5                  0
Per6                  0
Per7                  0
Per8                  0
Per9                  0
Dem1                  0
Dem2                  0
Dem3                  0
Dem4                  0
Dem5                  0
Dem6                  0
Dem7                  0
Dem8                  0
Dem9                  0
Cred1                 0
Cred2                 0
Cred3                 0
Cred4                 0
Cred5                 0
Cred6                 0
Normalised_FNT        0
Target            56962
data                  0
geo_score             0
dtype: int64

In [41]:
all_data = pd.merge(all_data, lambdawts, on='Group', how='left')

In [42]:
all_data.head()

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,Target,data,geo_score,lambda_wt
0,112751,Grp169,1.07,0.58,0.48,0.766667,1.233333,1.993333,0.34,1.01,...,1.01,0.933333,0.603333,0.686667,0.673333,-245.75,0.0,train,0.22,-0.13
1,18495,Grp161,0.473333,1.206667,0.883333,1.43,0.726667,0.626667,0.81,0.783333,...,0.69,0.56,0.67,0.553333,0.653333,-248.0,0.0,train,-0.25,0.66
2,23915,Grp261,1.13,0.143333,0.946667,0.123333,0.08,0.836667,0.056667,0.756667,...,0.383333,0.763333,0.67,0.686667,0.673333,-233.125,0.0,train,-0.95,-0.51
3,50806,Grp198,0.636667,1.09,0.75,0.94,0.743333,0.346667,0.956667,0.633333,...,0.846667,0.423333,0.52,0.846667,0.76,-249.7775,0.0,train,0.49,0.72
4,184244,Grp228,0.56,1.013333,0.593333,0.416667,0.773333,0.46,0.853333,0.796667,...,0.526667,0.52,0.716667,0.706667,0.673333,-247.5775,0.0,train,0.85,0.6


In [43]:
all_data = pd.merge(all_data, qsets, on='id', how='left')

In [44]:
all_data.head()

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,Target,data,geo_score,lambda_wt,qsets_normalized_tat
0,112751,Grp169,1.07,0.58,0.48,0.766667,1.233333,1.993333,0.34,1.01,...,0.933333,0.603333,0.686667,0.673333,-245.75,0.0,train,0.22,-0.13,-0.7
1,18495,Grp161,0.473333,1.206667,0.883333,1.43,0.726667,0.626667,0.81,0.783333,...,0.56,0.67,0.553333,0.653333,-248.0,0.0,train,-0.25,0.66,0.14
2,23915,Grp261,1.13,0.143333,0.946667,0.123333,0.08,0.836667,0.056667,0.756667,...,0.763333,0.67,0.686667,0.673333,-233.125,0.0,train,-0.95,-0.51,-0.43
3,50806,Grp198,0.636667,1.09,0.75,0.94,0.743333,0.346667,0.956667,0.633333,...,0.423333,0.52,0.846667,0.76,-249.7775,0.0,train,0.49,0.72,-0.31
4,184244,Grp228,0.56,1.013333,0.593333,0.416667,0.773333,0.46,0.853333,0.796667,...,0.52,0.716667,0.706667,0.673333,-247.5775,0.0,train,0.85,0.6,-0.63


In [45]:
all_data = pd.merge(all_data, instance, on='id', how='left')

In [46]:
all_data.head()

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Cred4,Cred5,Cred6,Normalised_FNT,Target,data,geo_score,lambda_wt,qsets_normalized_tat,instance_scores
0,112751,Grp169,1.07,0.58,0.48,0.766667,1.233333,1.993333,0.34,1.01,...,0.603333,0.686667,0.673333,-245.75,0.0,train,0.22,-0.13,-0.7,-0.06
1,18495,Grp161,0.473333,1.206667,0.883333,1.43,0.726667,0.626667,0.81,0.783333,...,0.67,0.553333,0.653333,-248.0,0.0,train,-0.25,0.66,0.14,0.52
2,23915,Grp261,1.13,0.143333,0.946667,0.123333,0.08,0.836667,0.056667,0.756667,...,0.67,0.686667,0.673333,-233.125,0.0,train,-0.95,-0.51,-0.43,1.56
3,50806,Grp198,0.636667,1.09,0.75,0.94,0.743333,0.346667,0.956667,0.633333,...,0.52,0.846667,0.76,-249.7775,0.0,train,0.49,0.72,-0.31,0.7
4,184244,Grp228,0.56,1.013333,0.593333,0.416667,0.773333,0.46,0.853333,0.796667,...,0.716667,0.706667,0.673333,-247.5775,0.0,train,0.85,0.6,-0.63,-0.47


In [47]:
all_data.isnull().sum()

id                          0
Group                       0
Per1                        0
Per2                        0
Per3                        0
Per4                        0
Per5                        0
Per6                        0
Per7                        0
Per8                        0
Per9                        0
Dem1                        0
Dem2                        0
Dem3                        0
Dem4                        0
Dem5                        0
Dem6                        0
Dem7                        0
Dem8                        0
Dem9                        0
Cred1                       0
Cred2                       0
Cred3                       0
Cred4                       0
Cred5                       0
Cred6                       0
Normalised_FNT              0
Target                  56962
data                        0
geo_score                   0
lambda_wt                   0
qsets_normalized_tat        0
instance_scores             0
dtype: int

In [48]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 284807 entries, 0 to 284806
Data columns (total 33 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    284807 non-null  int64  
 1   Group                 284807 non-null  object 
 2   Per1                  284807 non-null  float64
 3   Per2                  284807 non-null  float64
 4   Per3                  284807 non-null  float64
 5   Per4                  284807 non-null  float64
 6   Per5                  284807 non-null  float64
 7   Per6                  284807 non-null  float64
 8   Per7                  284807 non-null  float64
 9   Per8                  284807 non-null  float64
 10  Per9                  284807 non-null  float64
 11  Dem1                  284807 non-null  float64
 12  Dem2                  284807 non-null  float64
 13  Dem3                  284807 non-null  float64
 14  Dem4                  284807 non-null  float64
 15  

In [49]:
train = all_data[all_data['data']=='train']
test = all_data[all_data['data']=='test']

In [50]:
train.tail()

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Cred4,Cred5,Cred6,Normalised_FNT,Target,data,geo_score,lambda_wt,qsets_normalized_tat,instance_scores
227840,97346,Grp232,0.476667,1.013333,0.536667,0.576667,1.406667,1.846667,0.6,1.103333,...,0.533333,0.68,0.693333,-246.5025,0.0,train,-0.14,0.75,-0.55,-0.44
227841,147361,Grp199,1.363333,0.73,0.06,0.776667,0.883333,0.466667,0.733333,0.59,...,0.73,0.646667,0.656667,-249.7775,0.0,train,0.39,-0.98,0.38,-0.4
227842,50989,Grp36,1.06,0.756667,0.906667,0.896667,0.503333,0.396667,0.683333,0.62,...,0.696667,0.663333,0.673333,-249.7775,0.0,train,1.03,0.15,0.01,-0.13
227843,149780,Grp445,0.433333,1.013333,1.163333,0.94,0.93,0.9,0.813333,0.72,...,0.54,0.766667,0.71,-242.75,0.0,train,-3.29,1.53,0.38,-0.66
227844,22175,Grp143,1.006667,0.553333,0.946667,1.206667,0.406667,0.75,0.52,0.756667,...,0.58,0.683333,0.676667,-235.0,0.0,train,-0.36,0.0,0.55,-0.22


### Feature Selection

In [51]:
# split the data into ind variable and dependent variable
x_train = train.drop(['id','Group','Target', 'data'], axis=1)
y_train = train['Target']

In [52]:
x_train.head()

Unnamed: 0,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,Per9,Dem1,...,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,geo_score,lambda_wt,qsets_normalized_tat,instance_scores
0,1.07,0.58,0.48,0.766667,1.233333,1.993333,0.34,1.01,0.863333,0.46,...,1.01,0.933333,0.603333,0.686667,0.673333,-245.75,0.22,-0.13,-0.7,-0.06
1,0.473333,1.206667,0.883333,1.43,0.726667,0.626667,0.81,0.783333,0.19,0.47,...,0.69,0.56,0.67,0.553333,0.653333,-248.0,-0.25,0.66,0.14,0.52
2,1.13,0.143333,0.946667,0.123333,0.08,0.836667,0.056667,0.756667,0.226667,0.66,...,0.383333,0.763333,0.67,0.686667,0.673333,-233.125,-0.95,-0.51,-0.43,1.56
3,0.636667,1.09,0.75,0.94,0.743333,0.346667,0.956667,0.633333,0.486667,1.096667,...,0.846667,0.423333,0.52,0.846667,0.76,-249.7775,0.49,0.72,-0.31,0.7
4,0.56,1.013333,0.593333,0.416667,0.773333,0.46,0.853333,0.796667,0.516667,0.756667,...,0.526667,0.52,0.716667,0.706667,0.673333,-247.5775,0.85,0.6,-0.63,-0.47


In [53]:
y_train.head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: Target, dtype: float64

In [54]:
test.head()

Unnamed: 0,id,Group,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,...,Cred4,Cred5,Cred6,Normalised_FNT,Target,data,geo_score,lambda_wt,qsets_normalized_tat,instance_scores
227845,146574,Grp229,-0.3,1.54,0.22,-0.28,0.57,0.26,0.7,1.076667,...,0.796667,0.823333,0.783333,-249.75,,test,0.25,0.76,-0.43,-0.04
227846,268759,Grp141,0.633333,0.953333,0.81,0.466667,0.91,0.253333,1.04,0.55,...,0.63,0.673333,0.673333,-249.8125,,test,0.43,0.18,-0.62,-0.77
227847,59727,Grp188,1.043333,0.74,0.86,1.006667,0.583333,0.616667,0.63,0.686667,...,0.596667,0.68,0.67,-248.12,,test,1.32,0.39,-0.41,0.11
227848,151544,Grp426,1.283333,0.3,0.576667,0.636667,0.256667,0.543333,0.356667,0.663333,...,0.32,0.676667,0.66,-222.9875,,test,-2.11,1.8,0.37,0.33
227849,155008,Grp443,1.186667,0.326667,0.476667,0.866667,0.436667,0.68,0.476667,0.686667,...,0.546667,0.65,0.663333,-196.22,,test,-2.11,1.89,-0.13,-0.37


In [55]:
x_test = test.drop(['id','Group','Target','data'], axis=1)
y_test = test['Target']

In [56]:
x_test.head()

Unnamed: 0,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,Per9,Dem1,...,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,geo_score,lambda_wt,qsets_normalized_tat,instance_scores
227845,-0.3,1.54,0.22,-0.28,0.57,0.26,0.7,1.076667,0.93,0.156667,...,0.813333,0.776667,0.796667,0.823333,0.783333,-249.75,0.25,0.76,-0.43,-0.04
227846,0.633333,0.953333,0.81,0.466667,0.91,0.253333,1.04,0.55,0.543333,0.433333,...,0.703333,0.806667,0.63,0.673333,0.673333,-249.8125,0.43,0.18,-0.62,-0.77
227847,1.043333,0.74,0.86,1.006667,0.583333,0.616667,0.63,0.686667,0.593333,1.25,...,0.753333,0.87,0.596667,0.68,0.67,-248.12,1.32,0.39,-0.41,0.11
227848,1.283333,0.3,0.576667,0.636667,0.256667,0.543333,0.356667,0.663333,1.156667,1.186667,...,0.606667,0.456667,0.32,0.676667,0.66,-222.9875,-2.11,1.8,0.37,0.33
227849,1.186667,0.326667,0.476667,0.866667,0.436667,0.68,0.476667,0.686667,1.476667,1.213333,...,0.896667,0.566667,0.546667,0.65,0.663333,-196.22,-2.11,1.89,-0.13,-0.37


In [57]:
y_test.head()

227845   NaN
227846   NaN
227847   NaN
227848   NaN
227849   NaN
Name: Target, dtype: float64

In [58]:
x_train.head()

Unnamed: 0,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,Per9,Dem1,...,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,geo_score,lambda_wt,qsets_normalized_tat,instance_scores
0,1.07,0.58,0.48,0.766667,1.233333,1.993333,0.34,1.01,0.863333,0.46,...,1.01,0.933333,0.603333,0.686667,0.673333,-245.75,0.22,-0.13,-0.7,-0.06
1,0.473333,1.206667,0.883333,1.43,0.726667,0.626667,0.81,0.783333,0.19,0.47,...,0.69,0.56,0.67,0.553333,0.653333,-248.0,-0.25,0.66,0.14,0.52
2,1.13,0.143333,0.946667,0.123333,0.08,0.836667,0.056667,0.756667,0.226667,0.66,...,0.383333,0.763333,0.67,0.686667,0.673333,-233.125,-0.95,-0.51,-0.43,1.56
3,0.636667,1.09,0.75,0.94,0.743333,0.346667,0.956667,0.633333,0.486667,1.096667,...,0.846667,0.423333,0.52,0.846667,0.76,-249.7775,0.49,0.72,-0.31,0.7
4,0.56,1.013333,0.593333,0.416667,0.773333,0.46,0.853333,0.796667,0.516667,0.756667,...,0.526667,0.52,0.716667,0.706667,0.673333,-247.5775,0.85,0.6,-0.63,-0.47


In [59]:
x_train.describe()

Unnamed: 0,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,Per9,Dem1,...,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,geo_score,lambda_wt,qsets_normalized_tat,instance_scores
count,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,...,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0
mean,0.666006,0.667701,0.666315,0.666687,0.666723,0.667378,0.666934,0.666279,0.666688,0.666576,...,0.666264,0.666755,0.666878,0.666566,0.666776,-227.95417,-0.000135,0.00035,-0.000103,-0.000123
std,0.654133,0.548305,0.506357,0.471956,0.461393,0.444573,0.415657,0.401546,0.366537,0.340436,...,0.202204,0.174204,0.160803,0.135762,0.111612,61.951661,0.997518,0.957957,0.850163,1.091488
min,-18.136667,-23.573333,-15.443333,-1.226667,-37.246667,-8.053333,-13.853333,-23.74,-3.81,-0.893333,...,-0.28,-2.766667,-0.08,-6.856667,-4.476667,-250.0,-18.68,-19.21,-25.16,-24.59
25%,0.36,0.47,0.37,0.383333,0.436667,0.41,0.483333,0.596667,0.453333,0.413333,...,0.546667,0.56,0.556667,0.643333,0.65,-248.6175,-0.41,-0.43,-0.48,-0.54
50%,0.67,0.69,0.726667,0.66,0.65,0.576667,0.68,0.673333,0.65,0.656667,...,0.68,0.673333,0.65,0.666667,0.67,-244.51,0.14,0.05,-0.07,-0.09
75%,1.103333,0.933333,1.01,0.913333,0.87,0.8,0.856667,0.776667,0.866667,0.913333,...,0.813333,0.783333,0.746667,0.696667,0.693333,-230.75,0.62,0.49,0.4,0.45
max,1.483333,8.02,3.793333,6.163333,12.266667,25.1,40.863333,7.336667,5.863333,4.673333,...,2.193333,3.173333,1.84,11.203333,11.95,6172.79,7.85,10.53,8.54,23.75


In [60]:
column = list(x_train.columns)

In [61]:
y_train.value_counts()

0.0    227451
1.0       394
Name: Target, dtype: int64

## Handling Imbalance in the Data

In [62]:
import imblearn

In [63]:
from imblearn.over_sampling import RandomOverSampler
over = RandomOverSampler()
x_over, y_over = over.fit_resample(x_train, y_train)

In [64]:
y_over.value_counts()

0.0    227451
1.0    227451
Name: Target, dtype: int64

In [65]:
y_test.value_counts()

Series([], Name: Target, dtype: int64)

In [66]:
x_train.isnull().sum()

Per1                    0
Per2                    0
Per3                    0
Per4                    0
Per5                    0
Per6                    0
Per7                    0
Per8                    0
Per9                    0
Dem1                    0
Dem2                    0
Dem3                    0
Dem4                    0
Dem5                    0
Dem6                    0
Dem7                    0
Dem8                    0
Dem9                    0
Cred1                   0
Cred2                   0
Cred3                   0
Cred4                   0
Cred5                   0
Cred6                   0
Normalised_FNT          0
geo_score               0
lambda_wt               0
qsets_normalized_tat    0
instance_scores         0
dtype: int64

In [67]:
# feature scaling 
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc_x_train = sc.fit_transform(x_over)
sc_x_test = sc.fit_transform(x_test)

In [68]:
#COnverting Back to DataFrame for better Visibility
x_train1 = pd.DataFrame(sc_x_train,columns = column)
x_test1 = pd.DataFrame(sc_x_test, columns = column)

In [69]:
x_train1.head()

Unnamed: 0,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,Per9,Dem1,...,Cred2,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,geo_score,lambda_wt,qsets_normalized_tat,instance_scores
0,0.63135,-0.550239,0.468098,-0.618878,0.759868,2.721847,0.30177,0.133083,0.80635,-0.906135,...,1.921603,1.135156,-0.440992,0.003355,-0.055037,-0.340953,0.73228,0.710502,0.435929,0.603078
1,0.321452,-0.05975,0.657208,0.005976,0.414859,0.343422,0.535265,-0.014205,-0.050014,-0.895077,...,0.215304,-0.515961,-0.022217,-0.372543,-0.191318,-0.376046,0.628442,0.881982,0.575983,0.729067
2,0.662512,-0.892016,0.686903,-1.224893,-0.025481,0.708887,0.16101,-0.031533,-0.00338,-0.684962,...,-1.4199,0.383308,-0.022217,0.003355,-0.055037,-0.144039,0.47379,0.628018,0.480946,0.954978
3,0.406284,-0.151064,0.594693,-0.4556,0.426208,-0.143865,0.608128,-0.111675,0.327295,-0.202067,...,1.050679,-1.120387,-0.96446,0.454431,0.535515,-0.40377,0.791931,0.895006,0.500954,0.768167
4,0.366465,-0.211071,0.521237,-0.948575,0.446636,0.05337,0.556793,-0.005541,0.36545,-0.578062,...,-0.65562,-0.692866,0.270926,0.059739,-0.055037,-0.369457,0.871467,0.868959,0.4476,0.514017


### Splitting the training data into train and test again to build the model and predict target variable basis test data

In [None]:
### Train-Test-Split

In [71]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_train1,y_over, train_size=0.80,random_state=1)
y_train = pd.DataFrame(y_train,columns = ['Target'])

# PYCARET - please explore more on this package

In [73]:
# !pip install pycaret

In [74]:
# import pycaret

In [75]:
# setting up an environmet in pycaret
# !pip install numba==0.53
# from pycaret.classification import *

In [76]:
# https://pycaret.gitbook.io/docs/get-started/quickstart#classification

In [77]:
# from pycaret.classification import *

In [78]:
# data = pd.concat([x_train,y_train],axis=1)

In [79]:
# data.tail()

In [80]:
# s = setup(data, target = 'Target', fold_shuffle=True)

In [81]:
# best = compare_models()

# Deep Neural Network 

In [83]:
import tensorflow as tf
from tensorflow import keras

In [84]:
from keras import Sequential
from keras.layers import Dense

In [85]:
dnn = tf.keras.models.Sequential()
dnn.add(tf.keras.layers.Dense(units=6, activation='relu'))
dnn.add(tf.keras.layers.Dense(units=6, activation='relu'))
dnn.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
dnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
dnn.fit(x_train, y_train, batch_size=64, epochs=1)



<keras.callbacks.History at 0x28fbc7e3520>

In [84]:
dnn.fit(x_test, y_test, batch_size=64, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1677c3e2a90>

In [86]:
y_pred = dnn.predict(x_test)



### Converting the output into classes

In [87]:
y_pred = (y_pred>0.5)
y_pred = pd.DataFrame(y_pred)
y_pred

Unnamed: 0,0
0,False
1,True
2,False
3,False
4,True
...,...
90976,True
90977,False
90978,True
90979,True


In [88]:
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
y_pred1 = lb.fit_transform(y_pred)

In [89]:
y_pred1

array([0, 1, 0, ..., 1, 1, 0], dtype=int64)

## Evaluation

In [90]:
from sklearn.metrics import accuracy_score,precision_score,recall_score

In [91]:
print(f'Accuracy-{accuracy_score(y_pred1,y_test)}, Precision-{precision_score(y_pred1,y_test)}, Recall-{recall_score(y_pred1,y_test)}')

Accuracy-0.9615084468185665, Precision-0.9409609477841159, Recall-0.981441647597254


## Making Predictions on Test Data

In [92]:
predictions = dnn.predict(x_test1)
predictions = (predictions>0.5)
predictions = pd.DataFrame(predictions)
predictions = lb.fit_transform(predictions)
predictions



array([0, 1, 1, ..., 1, 1, 0], dtype=int64)

In [93]:
predictions = pd.DataFrame(predictions,columns=['Target'])

In [94]:
predictions

Unnamed: 0,Target
0,0
1,1
2,1
3,1
4,1
...,...
56957,1
56958,1
56959,1
56960,1


In [95]:
Final_data = pd.concat([x_test1,predictions],axis=1)

In [97]:
Final_data.tail()

Unnamed: 0,Per1,Per2,Per3,Per4,Per5,Per6,Per7,Per8,Per9,Dem1,...,Cred3,Cred4,Cred5,Cred6,Normalised_FNT,geo_score,lambda_wt,qsets_normalized_tat,instance_scores,Target
56957,-0.179016,0.681375,0.8544,0.042552,0.015151,-0.732406,0.587567,-0.064751,-0.337788,-0.236705,...,-0.327395,0.150618,0.639967,0.326377,-0.335826,-0.149612,-0.414715,0.011275,-0.473436,1
56958,1.055686,-0.082058,-0.780228,0.24737,0.051795,-0.031261,-0.481964,0.004675,0.977692,-1.100845,...,-0.152975,1.355111,-0.028831,-0.060154,-0.324212,0.108857,-1.600739,0.070375,-0.427064,1
56959,0.638974,-0.410096,0.276299,-0.374147,-0.930274,-0.785181,-0.331561,-0.160212,-0.675793,-0.550938,...,-0.366155,1.334344,-0.183169,0.100901,-0.05099,-0.527374,0.303142,-0.024185,0.602391,1
56960,-0.158437,0.878197,-0.594173,-0.105765,0.27899,-0.988739,0.612634,0.074102,-0.182488,-0.492019,...,-0.579335,-0.306259,0.357014,0.390799,-0.286591,0.267915,-0.716423,0.720477,-0.853685,1
56961,1.168867,-0.988635,-1.345039,-1.836125,0.696736,2.728087,-1.584918,0.768365,-1.096015,-0.236705,...,-0.463055,-0.306259,0.125507,-0.156787,-0.319543,-0.696373,-0.383504,0.413156,1.455633,0
