<h1 align="center" style="color:#6699ff"> DataCamp IEEE Fraud Detection </h1>

<img src="https://github.com/DataCampM2DSSAF/suivi-du-data-camp-equipe-tchouacheu_toure_niang_chokki/blob/master/img/credit-card-fraud-detection.png?raw=true" width="800" align="right">

#  <a style="color:#6699ff"> Team </a>
- <a style="color:#6699ff">Mohamed NIANG </a>
- <a style="color:#6699ff">Fernanda Tchouacheu </a>
- <a style="color:#6699ff">Sokhna Penda Toure </a>
- <a style="color:#6699ff">Hypolite Chokki </a>

# <a style="color:#6699ff">  Table of Contents</a> 

<a style="color:#6699ff"> I. Introduction</a>

<a style="color:#6699ff"> II. Descriptive Statistics & Visualization</a>

<a style="color:#6699ff"> III. Preprocessing</a>

<a style="color:#6699ff"> IV. Machine Learning Models</a>

# <a style="color:#6699ff"> I. Introduction</a>

**Why fraud detection?**
> Fraud is a billion-dollar business and it is growing every year. PwC's 2018 Global Economic Crime Survey found that half (49%) of the 7,200 companies surveyed had been victims of fraud of some kind. This is an increase over the 2016 PwC study, in which slightly more than a third of the organizations surveyed (36%) had been victims of economic crime.


This competition is a **binary classification** problem: the target variable is binary (is the transaction fraudulent or not?), and our goal is to classify transactions as "fraudulent" or "not fraudulent" as accurately as possible.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file 

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from xgboost import XGBClassifier
import xgboost as xgb
import lightgbm as lgb
import catboost
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score
import matplotlib.gridspec as gridspec
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import gc
gc.enable()

import os
os.chdir('/kaggle/input/ieeecis-fraud-detection') # Set working directory
print(os.listdir('/kaggle/input/ieeecis-fraud-detection'))

# <a style="color:#6699ff"> II. Descriptive Statistics and Visualization</a>

In this competition, we want to predict the probability that an online transaction is fraudulent, as indicated by the binary target "isFraud".

The data is split into two files, **identity** and **transaction**, which are joined on "TransactionID". 

> Note: Not all transactions have corresponding identity information.
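
To make the note above concrete, here is a minimal sketch of how the share of transactions with identity data could be measured. The two small dataframes are made-up stand-ins for the real tables, and the column values are purely illustrative.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up).
transactions = pd.DataFrame(
    {"TransactionID": [1, 2, 3, 4], "TransactionAmt": [10.0, 25.5, 7.2, 99.0]}
)
identity = pd.DataFrame(
    {"TransactionID": [2, 4], "DeviceType": ["mobile", "desktop"]}
)

# A left join keeps every transaction; indicator=True adds a "_merge" column
# that records whether a matching identity row was found.
merged = transactions.merge(identity, on="TransactionID", how="left", indicator=True)
has_identity = merged["_merge"].eq("both")
coverage = has_identity.mean()
print(f"Share of transactions with identity data: {coverage:.0%}")
```

Transactions without a match keep their row and simply get NaN in the identity columns, which is one source of the missing values discussed later.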

**Categorical variables - Transaction**

- ProductCD
- emaildomain
- card1 - card6
- addr1, addr2
- P_emaildomain
- R_emaildomain
- M1 - M9

**Categorical variables - Identity**

- DeviceType
- DeviceInfo
- id_12 - id_38

**The TransactionDT variable is a time delta from a given reference datetime (in seconds).**
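
Since TransactionDT is an offset in seconds, it can be turned into calendar-like features. The sketch below assumes a reference date, which is a placeholder chosen for illustration: the real reference datetime is not given here, so only relative features such as the day index are independent of that choice.

```python
import pandas as pd

# ASSUMPTION: the true reference datetime is unknown; this date is a placeholder.
ASSUMED_START = pd.Timestamp("2017-12-01")

# Made-up offsets in seconds, mimicking TransactionDT.
dt_seconds = pd.Series([86400, 90000, 172800 + 3600 * 13], name="TransactionDT")

timestamps = ASSUMED_START + pd.to_timedelta(dt_seconds, unit="s")
features = pd.DataFrame(
    {
        "day_index": dt_seconds // 86400,    # whole days elapsed since the reference
        "hour": timestamps.dt.hour,          # hour of day, 0-23
        "weekday": timestamps.dt.dayofweek,  # Monday=0; depends on ASSUMED_START
    }
)
print(features)
```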

**Files**

- train_{transaction, identity}.csv - the training set
- test_{transaction, identity}.csv - the test set (**we must predict the isFraud value for these observations**)
- sample_submission.csv - a sample submission file in the correct format

> Questions

We start exploring the data through the categorical features and the transaction amounts. The goal is to answer questions such as:

1. What kind of data do we have?
2. How many columns, rows, and missing values do we have?
3. What is the distribution of the target?
4. How are transaction amounts distributed for fraudulent and non-fraudulent transactions?
5. Are some products predominantly fraudulent?
6. Which features, or the target itself, show interesting patterns?

And many other questions that the exploration will raise.

**Load data**

In [None]:
%%time
train_transaction = pd.read_csv('train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('test_transaction.csv', index_col='TransactionID')
train_identity = pd.read_csv('train_identity.csv', index_col='TransactionID')
test_identity = pd.read_csv('test_identity.csv', index_col='TransactionID')
print ("Data is loaded!")

In [None]:
print('train_transaction shape is {}'.format(train_transaction.shape))
print('test_transaction shape is {}'.format(test_transaction.shape))
print('train_identity shape is {}'.format(train_identity.shape))
print('test_identity shape is {}'.format(test_identity.shape))

In [None]:
train_transaction.head()

In [None]:
train_identity.head()

**Missing Data Problem**

We notice that many columns are largely **NaN**: 

- ``` V300 ... V339``` 
- ``` id_01 ... id_34``` 
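
A per-column view is often more useful than a single global percentage. The following is a minimal sketch on a made-up dataframe (the column names are borrowed from the dataset, the values are not); the 0.7 threshold is only an example.

```python
import numpy as np
import pandas as pd

# Made-up data: one complete column and two columns with missing values.
df = pd.DataFrame(
    {
        "TransactionAmt": [10.0, 20.0, 30.0, 40.0],
        "V300": [np.nan, np.nan, np.nan, 1.0],
        "id_01": [np.nan, 0.0, np.nan, -5.0],
    }
)

# isnull().mean() gives the fraction of missing values per column.
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns whose missing share exceeds an example threshold of 70%.
mostly_missing = missing_share[missing_share > 0.7].index.tolist()
print(mostly_missing)
```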

## Problem 1: NaN

**train_transaction**

In [None]:
missing_values_count = train_transaction.isnull().sum()
print (missing_values_count[0:10])
total_cells = np.prod(train_transaction.shape)
total_missing = missing_values_count.sum()
print ("% of missing data = ",(total_missing/total_cells) * 100)

**train_identity**

In [None]:
missing_values_count = train_identity.isnull().sum()
print (missing_values_count[0:10])
total_cells = np.prod(train_identity.shape)
total_missing = missing_values_count.sum()
print ("% of missing data = ",(total_missing/total_cells) * 100)

In [None]:
del missing_values_count, total_cells, total_missing
gc.collect()

**Imbalanced Problem**

In [None]:
ax = sns.countplot(y="isFraud", data=train_transaction)
plt.title('Distribution of  isFraud')

total = len(train_transaction['isFraud'])
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_width()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y))

plt.show()

## Problem 2: Imbalanced classes

We can see that our target ("isFraud") is imbalanced: most transactions are not fraudulent. If we use this dataframe as-is as the basis for our predictive models and analyses, we may get many errors, and our algorithms will probably overfit because they will "assume" that most transactions are not fraud. But we do not want our model to assume; we want it to detect the patterns that signal fraud!

**Imbalanced** means that the number of data points available differs from one class to another.
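
To illustrate why imbalance matters, here is a small sketch on made-up labels. It computes the fraud rate, the accuracy of a trivial "never fraud" baseline, and the negatives-to-positives ratio that is commonly used as a starting value for XGBoost's `scale_pos_weight` parameter. The 3.5% rate is an illustrative assumption, not a figure from this notebook.

```python
import numpy as np

# Made-up labels: 35 frauds out of 1000 transactions.
y = np.array([0] * 965 + [1] * 35)

n_pos = int(y.sum())
n_neg = int(len(y) - n_pos)
fraud_rate = n_pos / len(y)

# A model that always answers "not fraud" looks accurate but detects nothing.
baseline_accuracy = (y == 0).mean()

# Ratio of negatives to positives, a common heuristic for scale_pos_weight.
scale_pos_weight = n_neg / n_pos

print(f"fraud rate: {fraud_rate:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"scale_pos_weight: {scale_pos_weight:.2f}")
```

This is why accuracy alone is a poor guide here; a ranking metric or per-class precision and recall says more.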

In this part, we try to see, on the one hand, what some of the variables actually represent,
and on the other hand, whether there are links between the variables and the target.

In [None]:
train_transaction['TransactionDT'].head()

In [None]:
train_transaction['TransactionDT'].shape[0] , train_transaction['TransactionDT'].nunique()

In [None]:
train_transaction['TransactionDT'].value_counts().head(10)

## Distribution Plots

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,4))

time_val = train_transaction['TransactionDT'].values

sns.distplot(time_val, ax=ax[0], color='r')
ax[0].set_title('Distribution of TransactionDT', fontsize=14)
ax[0].set_xlim([min(time_val), max(time_val)])

sns.distplot(np.log(time_val), ax=ax[1], color='b')
ax[1].set_title('Distribution of LOG TransactionDT', fontsize=14)
ax[1].set_xlim([min(np.log(time_val)), max(np.log(time_val))])

plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,4))

time_val = train_transaction.loc[train_transaction['isFraud'] == 1]['TransactionDT'].values

sns.distplot(np.log(time_val), ax=ax[0], color='r')
ax[0].set_title('Distribution of LOG TransactionDT, isFraud=1', fontsize=14)
ax[0].set_xlim([min(np.log(time_val)), max(np.log(time_val))])

time_val = train_transaction.loc[train_transaction['isFraud'] == 0]['TransactionDT'].values

sns.distplot(np.log(time_val), ax=ax[1], color='b')
ax[1].set_title('Distribution of LOG TransactionDT, isFraud=0', fontsize=14)
ax[1].set_xlim([min(np.log(time_val)), max(np.log(time_val))])


plt.show()

In [None]:
train_transaction['TransactionDT'].plot(kind='hist',
                                        figsize=(15, 5),
                                        label='train',
                                        bins=50,
                                        title='Train vs Test TransactionDT distribution')
test_transaction['TransactionDT'].plot(kind='hist',
                                       label='test',
                                       bins=50)
plt.legend()
plt.show()

In [None]:
del fig, ax, time_val
gc.collect()

In [None]:
train_transaction.head()

**isFraud vs time**

In [None]:
i = 'isFraud'
cor = np.corrcoef(train_transaction['TransactionDT'], train_transaction[i])[0,1]
train_transaction.loc[train_transaction['isFraud'] == 0].set_index('TransactionDT')[i].plot(style='.', title=i+" corr= "+str(round(cor,3)), figsize=(15, 3), label="isFraud=0")
train_transaction.loc[train_transaction['isFraud'] == 1].set_index('TransactionDT')[i].plot(style='.', title=i+" corr= "+str(round(cor,3)), figsize=(15, 3), label="isFraud=1")
plt.legend()
plt.show()

**C features: C1, C2 ... C14**

In [None]:
c_features = list(train_transaction.columns[16:30])
for i in c_features:
    cor = np.corrcoef(train_transaction['TransactionDT'], train_transaction[i])[0,1]
    train_transaction.set_index('TransactionDT')[i].plot(style='.', title=i+" corr= "+str(round(cor,3)), figsize=(15, 3))
    test_transaction.set_index('TransactionDT')[i].plot(style='.', title=i+" corr= "+str(round(cor,3)), figsize=(15, 3))
    plt.show()

In [None]:
del cor, c_features
gc.collect()

**D features: D1 ... D15**

In [None]:
d_features = list(train_transaction.columns[30:45])

for i in d_features:
    cor = np.corrcoef(train_transaction['TransactionDT'], train_transaction[i])[0,1]
    train_transaction.set_index('TransactionDT')[i].plot(style='.', title=i+" corr= "+str(round(cor,3)), figsize=(15, 3))
    test_transaction.set_index('TransactionDT')[i].plot(style='.', title=i+" corr= "+str(round(cor,3)), figsize=(15, 3))
    plt.show()

The problem here is that the D features are mostly NaNs, which also makes the correlations in the plot titles above come out as `nan`.

In [None]:
train_transaction[d_features].head()

In [None]:
# Click output to see the number of missing values in each column
missing_values_count = train_transaction[d_features].isnull().sum()
missing_values_count

In [None]:
# how many total missing values do we have?
total_cells = np.prod(train_transaction[d_features].shape)
total_missing = missing_values_count.sum()
# percent of data that is missing
(total_missing/total_cells) * 100

Across the D features, 58.15% of the values are missing.

In [None]:
for i in d_features:
    cor_tr = np.corrcoef(train_transaction['TransactionDT'], train_transaction[i].fillna(-1))[0,1]
    cor_te = np.corrcoef(test_transaction['TransactionDT'], test_transaction[i].fillna(-1))[0,1]
    train_transaction.set_index('TransactionDT')[i].fillna(-1).plot(style='.', title=i+" corr_tr= "+str(round(cor_tr,3))+" || corr_te= "+str(round(cor_te,3)), figsize=(15, 3))
    test_transaction.set_index('TransactionDT')[i].fillna(-1).plot(style='.', title=i+" corr_tr= "+str(round(cor_tr,3))+"  || corr_te= "+str(round(cor_te,3)), figsize=(15, 3))
    plt.show()

In [None]:
del d_features, cor, missing_values_count, total_cells, total_missing
gc.collect()

**M features: M1 .. M9**

In [None]:
m_features = list(train_transaction.columns[45:54])
train_transaction[m_features].head()

**V150**

In [None]:
i = "V150"
cor_tr = np.corrcoef(train_transaction['TransactionDT'], train_transaction[i].fillna(-1))[0,1]
cor_te = np.corrcoef(test_transaction['TransactionDT'], test_transaction[i].fillna(-1))[0,1]
train_transaction.set_index('TransactionDT')[i].fillna(-1).plot(style='.', title=i+" corr_tr= "+str(round(cor_tr,3))+" || corr_te= "+str(round(cor_te,3)), figsize=(15, 3))
test_transaction.set_index('TransactionDT')[i].fillna(-1).plot(style='.', title=i+" corr_tr= "+str(round(cor_tr,3))+"  || corr_te= "+str(round(cor_te,3)), figsize=(15, 3))
plt.show()

In [None]:
del cor_tr, cor_te
gc.collect()

Let's check the number of null values in every V column.

In [None]:
train_transaction.loc[:,train_transaction.columns[train_transaction.columns.str.startswith('V')]].isnull().sum()

**TransactionAmt**

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,4))

time_val = train_transaction['TransactionAmt'].values

sns.distplot(time_val, ax=ax[0], color='r')
ax[0].set_title('Distribution of TransactionAmt', fontsize=14)
ax[0].set_xlim([min(time_val), max(time_val)])

sns.distplot(np.log(time_val), ax=ax[1], color='b')
ax[1].set_title('Distribution of LOG TransactionAmt', fontsize=14)
ax[1].set_xlim([min(np.log(time_val)), max(np.log(time_val))])

plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,4))

time_val = train_transaction.loc[train_transaction['isFraud'] == 1]['TransactionAmt'].values

sns.distplot(np.log(time_val), ax=ax[0], color='r')
ax[0].set_title('Distribution of LOG TransactionAmt, isFraud=1', fontsize=14)
ax[0].set_xlim([min(np.log(time_val)), max(np.log(time_val))])

time_val = train_transaction.loc[train_transaction['isFraud'] == 0]['TransactionAmt'].values

sns.distplot(np.log(time_val), ax=ax[1], color='b')
ax[1].set_title('Distribution of LOG TransactionAmt, isFraud=0', fontsize=14)
ax[1].set_xlim([min(np.log(time_val)), max(np.log(time_val))])


plt.show()

In [None]:
del fig, ax, time_val
gc.collect()

## Count Plots

**D Features**

In [None]:
plt.figure(figsize=(10, 7))
d_features = list(train_transaction.columns[30:45])
uniques = [len(train_transaction[col].unique()) for col in d_features]
sns.set(font_scale=1.2)
ax = sns.barplot(x=d_features, y=uniques)  # recent seaborn requires keyword x/y
ax.set_yscale('log')
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique values per feature TRAIN')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center") 

In [None]:
del d_features, uniques
gc.collect()

**C features**

In [None]:
plt.figure(figsize=(10, 7))
c_features = list(train_transaction.columns[16:30])
uniques = [len(train_transaction[col].unique()) for col in c_features]
sns.set(font_scale=1.2)
ax = sns.barplot(x=c_features, y=uniques)  # recent seaborn requires keyword x/y
ax.set_yscale('log')
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique values per feature TRAIN')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center")

In [None]:
del c_features, uniques
gc.collect()

**V features**

In [None]:
plt.figure(figsize=(35, 8))
v_features = list(train_transaction.columns[54:120])
uniques = [len(train_transaction[col].unique()) for col in v_features]
sns.set(font_scale=1.2)
ax = sns.barplot(x=v_features, y=uniques)  # recent seaborn requires keyword x/y
ax.set_yscale('log')
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique values per feature')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center") 

In [None]:
del v_features, uniques
gc.collect()

In [None]:
plt.figure(figsize=(35, 8))
v_features = list(train_transaction.columns[120:170])
uniques = [len(train_transaction[col].unique()) for col in v_features]
sns.set(font_scale=1.2)
ax = sns.barplot(x=v_features, y=uniques)  # recent seaborn requires keyword x/y
ax.set_yscale('log')
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique values per feature')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center") 

In [None]:
del v_features, uniques
gc.collect()

In [None]:
plt.figure(figsize=(35, 8))
v_features = list(train_transaction.columns[170:220])
uniques = [len(train_transaction[col].unique()) for col in v_features]
sns.set(font_scale=1.2)
ax = sns.barplot(x=v_features, y=uniques)  # recent seaborn requires keyword x/y
ax.set_yscale('log')
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique values per feature')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center") 

In [None]:
del v_features, uniques
gc.collect()

In [None]:
plt.figure(figsize=(35, 8))
v_features = list(train_transaction.columns[220:270])
uniques = [len(train_transaction[col].unique()) for col in v_features]
sns.set(font_scale=1.2)
ax = sns.barplot(x=v_features, y=uniques)  # recent seaborn requires keyword x/y
ax.set_yscale('log')
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique values per feature')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center") 

In [None]:
del v_features, uniques
gc.collect()

In [None]:
plt.figure(figsize=(35, 8))
v_features = list(train_transaction.columns[270:320])
uniques = [len(train_transaction[col].unique()) for col in v_features]
sns.set(font_scale=1.2)
ax = sns.barplot(x=v_features, y=uniques)  # recent seaborn requires keyword x/y
ax.set_yscale('log')
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique values per feature')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center") 

In [None]:
del v_features, uniques
gc.collect()

In [None]:
plt.figure(figsize=(38, 8))
v_features = list(train_transaction.columns[320:390])
uniques = [len(train_transaction[col].unique()) for col in v_features]
sns.set(font_scale=1.2)
ax = sns.barplot(x=v_features, y=uniques)  # recent seaborn requires keyword x/y
ax.set_yscale('log')
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique values per feature')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center") 

In [None]:
del v_features, uniques
gc.collect()

**id_code**

In [None]:
train_identity.head()

In [None]:
plt.figure(figsize=(35, 8))
features = list(train_identity.columns[0:38])
uniques = [len(train_identity[col].unique()) for col in features]
sns.set(font_scale=1.2)
ax = sns.barplot(x=features, y=uniques)  # recent seaborn requires keyword x/y
ax.set_yscale('log')
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique values per feature TRAIN')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center") 

In [None]:
del features, uniques
gc.collect()

**Categorical Features**

- ProductCD
- emaildomain
- card1 - card6
- addr1, addr2
- P_emaildomain
- R_emaildomain
- M1 - M9
- DeviceType
- DeviceInfo
- id_12 - id_38

In [None]:
train_transaction.head(6)

In [None]:
train_identity.head(6)

**ProductCD**

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20,5))

sns.countplot(x="ProductCD", ax=ax[0], hue = "isFraud", data=train_transaction)
ax[0].set_title('ProductCD train', fontsize=14)
sns.countplot(x="ProductCD", ax=ax[1], data=test_transaction)
ax[1].set_title('ProductCD test', fontsize=14)
plt.show()

**Device Type & Device Info**

In [None]:
ax = sns.countplot(x="DeviceType", data=train_identity)
ax.set_title('DeviceType', fontsize=14)
plt.show()

**Device information**

In [None]:
print ("Unique Devices = ",train_identity['DeviceInfo'].nunique())
train_identity['DeviceInfo'].value_counts().head()

**Card**

In [None]:
cards = ['card1', 'card2', 'card3', 'card4', 'card5', 'card6']
for i in cards:
    print ("Unique ",i, " = ",train_transaction[i].nunique())

In [None]:
fig, ax = plt.subplots(1, 4, figsize=(25,5))

sns.countplot(x="card4", ax=ax[0], data=train_transaction.loc[train_transaction['isFraud'] == 0])
ax[0].set_title('card4 isFraud=0', fontsize=14)
sns.countplot(x="card4", ax=ax[1], data=train_transaction.loc[train_transaction['isFraud'] == 1])
ax[1].set_title('card4 isFraud=1', fontsize=14)
sns.countplot(x="card6", ax=ax[2], data=train_transaction.loc[train_transaction['isFraud'] == 0])
ax[2].set_title('card6 isFraud=0', fontsize=14)
sns.countplot(x="card6", ax=ax[3], data=train_transaction.loc[train_transaction['isFraud'] == 1])
ax[3].set_title('card6 isFraud=1', fontsize=14)
plt.show()

**Email Domain**

In [None]:
"emaildomain" in train_transaction.columns, "emaildomain" in train_identity.columns

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(32,10))

sns.countplot(y="P_emaildomain", ax=ax[0], data=train_transaction)
ax[0].set_title('P_emaildomain', fontsize=14)
sns.countplot(y="P_emaildomain", ax=ax[1], data=train_transaction.loc[train_transaction['isFraud'] == 1])
ax[1].set_title('P_emaildomain isFraud = 1', fontsize=14)
sns.countplot(y="P_emaildomain", ax=ax[2], data=train_transaction.loc[train_transaction['isFraud'] == 0])
ax[2].set_title('P_emaildomain isFraud = 0', fontsize=14)
plt.show()

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(32,10))

sns.countplot(y="R_emaildomain", ax=ax[0], data=train_transaction)
ax[0].set_title('R_emaildomain', fontsize=14)
sns.countplot(y="R_emaildomain", ax=ax[1], data=train_transaction.loc[train_transaction['isFraud'] == 1])
ax[1].set_title('R_emaildomain isFraud = 1', fontsize=14)
sns.countplot(y="R_emaildomain", ax=ax[2], data=train_transaction.loc[train_transaction['isFraud'] == 0])
ax[2].set_title('R_emaildomain isFraud = 0', fontsize=14)
plt.show()

**Fraudsters appear to favor Gmail (judging by raw counts, not fraud rates).**
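
Raw counts favor whichever domain is most common overall, so a fraud rate per domain can give a different picture. Here is a minimal sketch on made-up rows; the domains are real column values, but the counts are invented for illustration.

```python
import pandas as pd

# Made-up sample: gmail.com is the most frequent domain but not the riskiest one.
toy = pd.DataFrame(
    {
        "P_emaildomain": ["gmail.com"] * 6 + ["yahoo.com"] * 3 + ["protonmail.com"] * 2,
        "isFraud": [1, 0, 0, 0, 0, 0] + [0, 0, 0] + [1, 0],
    }
)

# Per-domain size, number of frauds, and fraud rate (mean of a 0/1 column).
summary = (
    toy.groupby("P_emaildomain")["isFraud"]
    .agg(n="count", fraud_count="sum", fraud_rate="mean")
    .sort_values("fraud_rate", ascending=False)
)
print(summary)
```

In this toy sample gmail.com has the most rows, while protonmail.com has the highest fraud rate. Rates on very small groups are noisy, so the group size `n` should be read alongside them.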

In [None]:
del fig, ax
gc.collect()

# <a style="color:#6699ff"> III. Preprocessing</a>

## Merge transaction & identity 

In [None]:
%%time
train_df = pd.merge(train_transaction, train_identity, on = "TransactionID", how = "left")
print("Train: ",train_df.shape)
del train_transaction, train_identity
gc.collect()

In [None]:
%%time
test_df = pd.merge(test_transaction, test_identity, on = "TransactionID", how = "left")
print("Test: ",test_df.shape)
test_df["isFraud"] = 0
del test_transaction, test_identity
gc.collect()

## Pipeline of preprocessing

In [None]:
emails = {
'gmail': 'google', 
'att.net': 'att', 
'twc.com': 'spectrum', 
'scranton.edu': 'other', 
'optonline.net': 'other', 
'hotmail.co.uk': 'microsoft',
'comcast.net': 'other', 
'yahoo.com.mx': 'yahoo', 
'yahoo.fr': 'yahoo',
'yahoo.es': 'yahoo', 
'charter.net': 'spectrum', 
'live.com': 'microsoft', 
'aim.com': 'aol', 
'hotmail.de': 'microsoft', 
'centurylink.net': 'centurylink',
'gmail.com': 'google', 
'me.com': 'apple', 
'earthlink.net': 'other', 
'gmx.de': 'other',
'web.de': 'other', 
'cfl.rr.com': 'other', 
'hotmail.com': 'microsoft', 
'protonmail.com': 'other', 
'hotmail.fr': 'microsoft', 
'windstream.net': 'other', 
'outlook.es': 'microsoft', 
'yahoo.co.jp': 'yahoo', 
'yahoo.de': 'yahoo',
'servicios-ta.com': 'other', 
'netzero.net': 'other', 
'suddenlink.net': 'other',
'roadrunner.com': 'other', 
'sc.rr.com': 'other', 
'live.fr': 'microsoft',
'verizon.net': 'yahoo', 
'msn.com': 'microsoft', 
'q.com': 'centurylink', 
'prodigy.net.mx': 'att', 
'frontier.com': 'yahoo', 
'anonymous.com': 'other', 
'rocketmail.com': 'yahoo',
'sbcglobal.net': 'att',
'frontiernet.net': 'yahoo', 
'ymail.com': 'yahoo',
'outlook.com': 'microsoft',
'mail.com': 'other', 
'bellsouth.net': 'other',
'embarqmail.com': 'centurylink',
'cableone.net': 'other', 
'hotmail.es': 'microsoft', 
'mac.com': 'apple',
'yahoo.co.uk': 'yahoo',
'netzero.com': 'other', 
'yahoo.com': 'yahoo', 
'live.com.mx': 'microsoft',
'ptd.net': 'other',
'cox.net': 'other',
'aol.com': 'aol',
'juno.com': 'other',
'icloud.com': 'apple'
}

# number types for filtering the columns
int_types = ["int8", "int16", "int32", "int64", "float"]

In [None]:
# Check how many missing values each column has.

def check_nan(df, limit):
    '''
    Check the share of missing values in each column.
    If that share is higher than limit (a fraction between 0 and 1), we drop the column.
    '''
    
    total_rows = df.shape[0]
    total_cols = df.shape[1]
    
    total_dropped = 0
    col_to_drop = []
    
    for col in df.columns:

        null_sum = df[col].isnull().sum()
        perc_over_total = round((null_sum/total_rows), 2)
        
        if perc_over_total > limit:
            
            print("The col {} contains {} null values.\nThis represents {} of total rows."\
                  .format(col, null_sum, perc_over_total))
            
            print("Dropping column {} from the df.\n".format(col))
            
            col_to_drop.append(col)
            total_dropped += 1            
    
    df.drop(col_to_drop, axis = 1, inplace = True)
    print("We have dropped a total of {} columns.\nIt's {} of the total"\
          .format(total_dropped, round((total_dropped/total_cols), 2)))
    
    return df

In [None]:
def binarizer(df_train, df_test):
    '''
    Label-encode the categorical features (one integer code per category).
    Works with 2 dataframes at a time and returns a tuple of both.
    '''
    cat_cols = df_train.select_dtypes(exclude=int_types).columns

    for col in cat_cols:
        
        # fit on the union of train and test values so transform never hits an unseen label
        unique_train = list(df_train[col].unique())
        unique_test = list(df_test[col].unique())
        unique_values = list(set(unique_train + unique_test))
        
        enc = LabelEncoder()
        enc.fit(unique_values)
        
        # LabelEncoder expects 1-D input, so no reshape is needed
        df_train[col] = enc.transform(df_train[col].values)
        df_test[col] = enc.transform(df_test[col].values)
    
    return (df_train, df_test)
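
The idea behind `binarizer` can be seen on a toy column. This is a small sketch with made-up card brands, not the notebook's data: the encoder is fitted on the union of train and test values so that a category seen only in test still gets a code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical column; "discover" appears only in the test split.
train_col = pd.Series(["visa", "mastercard", "visa", "Other"])
test_col = pd.Series(["visa", "discover", "Other"])

enc = LabelEncoder()
enc.fit(pd.concat([train_col, test_col]).unique())

train_codes = enc.transform(train_col)
test_codes = enc.transform(test_col)

# classes_ is sorted, and the position of each class is its integer code.
print(dict(zip(enc.classes_, range(len(enc.classes_)))))
print(train_codes, test_codes)
```

The integer codes carry no meaningful order. Tree-based models usually tolerate this, but a linear model would read the codes as magnitudes.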

In [None]:
def cathegorical_imputer(df_train, df_test, strategy, fill_value):
    '''
    Impute missing values in the categorical features, either with a constant
    or with the most frequent value.
    '''
    cat_cols = df_train.select_dtypes(exclude=int_types).columns
    
    for col in cat_cols:
        print("Working with column {}".format(col))
        
        # select the imputer for the requested strategy
        if strategy == "constant":
            # fill every null with fill_value
            inputer = SimpleImputer(strategy=strategy, fill_value=fill_value)
        elif strategy == "most_frequent":
            inputer = SimpleImputer(strategy=strategy)
        
        # fit on train, then apply the same fill to test; ravel() flattens the (n, 1) output
        df_train[col] = inputer.fit_transform(X = (df_train[col].values).reshape(-1, 1)).ravel()
        df_test[col] = inputer.transform(X = (df_test[col].values).reshape(-1, 1)).ravel()
        
    return (df_train, df_test)

In [None]:
def numerical_inputer(df_train, df_test, strategy, fill_value):
    '''
    Replace NaN in the numerical features.
    Works with 2 dataframes at a time (train & test).
    Returns a tuple of both.
    '''
    
    # assert valid strategy
    message = "Please select a valid strategy (mean, median, constant (and give a fill_value) or most_frequent)"
    assert strategy in ["constant", "most_frequent", "mean", "median"], message
    
    # int_types defined earlier in the kernel
    num_cols = df_train.select_dtypes(include = int_types).columns
    
    for col in num_cols:

        print("Working with column {}".format(col))

        # select the correct inputer
        if strategy == "constant":
            inputer = SimpleImputer(strategy=strategy, fill_value=fill_value)
        elif strategy == "most_frequent":
            inputer = SimpleImputer(strategy=strategy)
        elif strategy == "mean":
            inputer = SimpleImputer(strategy=strategy)
        elif strategy == "median":
            inputer = SimpleImputer(strategy=strategy)

        # replace the nulls in train and test
        try:
            df_train[col] = inputer.fit_transform(X = (df_train[col].values).reshape(-1, 1)).ravel()
            df_test[col] = inputer.transform(X = (df_test[col].values).reshape(-1, 1)).ravel()
        except Exception:
            print("Col {} gave an error.".format(col))
            
    return (df_train, df_test)
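
The two imputation strategies used here behave quite differently, which a toy column makes visible. This is a minimal sketch with made-up numbers: the statistic is learned on train only and then reused on test, and the constant strategy marks missing cells with a sentinel value instead.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric column with missing values in both splits.
train_col = np.array([1.0, np.nan, 3.0, 100.0]).reshape(-1, 1)
test_col = np.array([np.nan, 5.0]).reshape(-1, 1)

# Median strategy: the median of the observed train values (1, 3, 100) is 3.
median_imputer = SimpleImputer(strategy="median")
train_filled = median_imputer.fit_transform(train_col).ravel()
test_filled = median_imputer.transform(test_col).ravel()

# Constant strategy: every missing cell becomes the sentinel -999.
constant_imputer = SimpleImputer(strategy="constant", fill_value=-999)
train_flagged = constant_imputer.fit_transform(train_col).ravel()

print(train_filled, test_filled, train_flagged)
```

A sentinel such as -999 lets tree-based models split "missing" away from real values, but it would distort distance-based or linear models.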

In [None]:
def pipeline(df_train, df_test):
    '''
    We define a personal pipeline to process the data and fill with processing functions.
    NOTE: modifies the df in place.
    '''
    print("Shape of train is {}".format(df_train.shape))
    print("Shape of test is {}".format(df_test.shape))
    # We set the limit to 70%: any column with more than 70% of its values NaN/missing is dropped,
    # since it is very unlikely to help our future model.
    print("Checking for nan values\n")
    df_train = check_nan(df_train, limit=0.7)
    
    # Keep in test only the columns that survived in train; copy() avoids SettingWithCopyWarning later on.
    df_test = df_test[list(df_train.columns)].copy()
          
    print("Shape of train is {}".format(df_train.shape))
    print("Shape of test is {}".format(df_test.shape))
          
    # mapping emails
    print("Mapping emails \n")
    df_train["EMAILP"] = df_train["P_emaildomain"].map(emails)
    df_test["EMAILP"] = df_test["P_emaildomain"].map(emails)

    print("Shape of train is {}".format(df_train.shape))
    print("Shape of test is {}".format(df_test.shape))
          
    # replace nulls in the categorical columns of train and test with the value "Other"
    print("Working with categorical values\n")
    df_train, df_test = cathegorical_imputer(df_train, df_test, strategy = "constant", fill_value = "Other")
    
    print("Shape of train is {}".format(df_train.shape))
    print("Shape of test is {}".format(df_test.shape))
          
    # now we label-encode these columns (one integer code per category, not one-hot)
    print("Label-encoding values\n")
    df_train, df_test = binarizer(df_train, df_test)
    
    print("Shape of train is {}".format(df_train.shape))
    print("Shape of test is {}".format(df_test.shape))
          
    # working with null values in numeric columns
    print("Working with numerical columns. NAN values\n")
    df_train, df_test = numerical_inputer(df_train, df_test, strategy = "constant", fill_value=-999)
        
    print("Shape of train is {}".format(df_train.shape))
    print("Shape of test is {}".format(df_test.shape))
          
    return (df_train, df_test)

In [None]:
# before preprocessing
print("Train before preprocessing: ",train_df.shape)
print("Test before preprocessing: ",test_df.shape)

train_df, test_df = pipeline(train_df, test_df)

# after preprocessing
print("Train after preprocessing: ",train_df.shape)
print("Test after preprocessing: ",test_df.shape)

In [None]:
# check for null values
columns = train_df.columns
for col in  columns:
    total_nulls = train_df[col].isnull().sum()
    if total_nulls > 0:
        print(col, total_nulls)
        
columns = test_df.select_dtypes(exclude=int_types).columns
train_df[columns]

columns = test_df.select_dtypes(include=int_types).columns
train_df[columns]

**Reduce Memory Usage**

In [None]:
%%time
# From kernel https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [None]:
%%time
train_df = reduce_mem_usage(train_df)
test_df = reduce_mem_usage(test_df)
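
One caveat with `reduce_mem_usage`: it only checks that the column range fits the smaller type, not that the values survive the cast. `float16` cannot represent every integer above 2048, so a float column can be silently rounded. The sketch below is a hypothetical helper (`fits_without_loss` is not part of the original kernel) that checks a round trip before downcasting.

```python
import numpy as np

def fits_without_loss(values, dtype):
    """Return True if casting to `dtype` and back leaves every value unchanged."""
    values = np.asarray(values, dtype=np.float64)
    back = values.astype(dtype).astype(np.float64)
    return bool(np.array_equal(values, back, equal_nan=True))

# 2049 is within the float16 range but is not exactly representable
print(float(np.float16(2049.0)))
print(fits_without_loss([2049.0], np.float16))
print(fits_without_loss([2049.0], np.float32))
# The -999 fill value used by the pipeline is exactly representable
print(fits_without_loss([-999.0, 1.5], np.float16))
```

Whether this rounding matters depends on the feature; for tree models it mainly risks merging nearby split points.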

# <a style="color:#6699ff"> IV. Machine Learning Models</a>

In [None]:
cols_to_drop = ["TransactionID", "isFraud", "TransactionDT"]
# keep every column except the ID, the target and TransactionDT
useful_cols = [col for col in train_df.columns if col not in cols_to_drop]

Y = train_df["isFraud"]
X = train_df[useful_cols]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
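
Because fraud is a rare class, a purely random split can leave the two parts with slightly different fraud rates. As a minimal sketch, assuming a hypothetical 3.5% positive rate on synthetic data (`X_toy`, `y_toy` are made up), passing `stratify` to `train_test_split` keeps the class ratio nearly identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))   # synthetic features
y_toy = np.zeros(1000, dtype=int)
y_toy[:35] = 1                       # hypothetical 3.5% positive rate

# stratify=y_toy preserves the positive rate in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)
print(y_toy.mean(), y_tr.mean(), y_va.mean())
```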

**Reduce Memory Usage**

In [None]:
%%time
X_train = reduce_mem_usage(X_train)
X_test = reduce_mem_usage(X_test)

## XGBoost classifier

In [None]:
eval_set = [(X_train, y_train), (X_test, y_test)]
xgboost_classifier = XGBClassifier()
xgboost_classifier.fit(X_train, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=0)

In [None]:
predictions = xgboost_classifier.predict(X_test)

In [None]:
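from sklearn.metrics import confusion_matrix

# scikit-learn expects the true labels first, then the predictions
print(confusion_matrix(y_test, predictions))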

In [None]:
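from sklearn.metrics import classification_report

# true labels first, otherwise precision and recall are swapped
print(classification_report(y_test, predictions))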
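
scikit-learn's metric functions take the true labels first (`y_true`, then `y_pred`), and swapping them transposes the confusion matrix. The Kaggle competition itself is scored on ROC AUC, which is computed from probabilities rather than hard labels. The sketch below uses made-up toy arrays (`y_true_toy`, `y_pred_toy`, `scores_toy`) only to illustrate both points.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true_toy = np.array([0, 0, 0, 0, 1, 1])                  # hypothetical labels
y_pred_toy = np.array([0, 0, 1, 1, 1, 0])                  # hypothetical hard predictions
scores_toy = np.array([0.1, 0.2, 0.55, 0.6, 0.9, 0.4])     # hypothetical fraud probabilities

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
# Swapping the arguments yields the transpose
cm_swapped = confusion_matrix(y_pred_toy, y_true_toy)
# AUC is computed from scores, not thresholded labels
auc = roc_auc_score(y_true_toy, scores_toy)

print(cm)
print(cm_swapped)
print(auc)
```

On the real split, the analogous call would be `roc_auc_score(y_test, xgboost_classifier.predict_proba(X_test)[:, 1])`.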

**Learning curve for XGBoost**

In [None]:
# retrieve performance metrics
results = xgboost_classifier.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)
# plot log loss
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.show()
# plot classification error
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
plt.ylabel('Classification Error')
plt.title('XGBoost Classification Error')
plt.show()
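
Besides plotting, the same `evals_result()` dictionary can be used to locate the boosting round where the validation loss is lowest. The sketch below uses a hypothetical dictionary (`toy_results`, with invented numbers) that mimics the structure read by the plotting code above.

```python
import numpy as np

# Hypothetical stand-in for xgboost_classifier.evals_result()
toy_results = {
    "validation_0": {"logloss": [0.60, 0.45, 0.35, 0.28, 0.22, 0.18]},  # train
    "validation_1": {"logloss": [0.61, 0.48, 0.41, 0.39, 0.40, 0.43]},  # held-out
}

train_loss = np.array(toy_results["validation_0"]["logloss"])
val_loss = np.array(toy_results["validation_1"]["logloss"])

# Round with the lowest held-out loss (0-based index)
best_round = int(np.argmin(val_loss))
# A growing gap after that round is a sign of overfitting
gap = val_loss - train_loss

print(best_round, val_loss[best_round])
print(gap)
```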

**Feature importance for XGBoost**

In [None]:
# importances follow the order of the columns the model was trained on
cols = list(X_train.columns)
feature_imp = pd.DataFrame(sorted(zip(xgboost_classifier.feature_importances_,cols)), columns=['Value','Feature'])
plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False).iloc[:50])
plt.title('XGBoost Most Important Features')
plt.tight_layout()
plt.show()
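
`feature_importances_` has one entry per training column, so the names paired with it must come from the training matrix and not from the full frame. The sketch below uses hypothetical column names and importances to show how `zip` silently shifts every label when the dropped columns are still in the list.

```python
import numpy as np
import pandas as pd

# Hypothetical column layout: three dropped columns followed by three features
all_cols = ["TransactionID", "isFraud", "TransactionDT", "TransactionAmt", "card1", "addr1"]
dropped = ["TransactionID", "isFraud", "TransactionDT"]
feature_cols = [c for c in all_cols if c not in dropped]

# Hypothetical importances, one per training feature
toy_importances = np.array([0.5, 0.3, 0.2])
assert len(toy_importances) == len(feature_cols)

# Correct pairing: index the importances by the training columns
imp = pd.Series(toy_importances, index=feature_cols).sort_values(ascending=False)
# Wrong pairing: zip truncates to the shorter input and mislabels every value
misaligned = dict(zip(all_cols, toy_importances))

print(imp)
print(misaligned)
```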

**First submission for XGBoost**

In [None]:
proba = xgboost_classifier.predict_proba(test_df[useful_cols])

# probability of the positive (fraud) class
y_preds = proba[:, 1]

submission = pd.read_csv('sample_submission.csv', index_col='TransactionID')
submission['isFraud'] = y_preds
submission.head()
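
Assigning a plain array to `submission['isFraud']` relies on `test_df` and `sample_submission.csv` listing the transactions in the same order. As a hedged sketch on made-up data (`toy_test`, `toy_sample`, `toy_scores` are hypothetical), indexing the predictions by `TransactionID` first makes the assignment independent of row order.

```python
import numpy as np
import pandas as pd

# Hypothetical test frame whose rows are not in submission order
toy_test = pd.DataFrame({"TransactionID": [103, 101, 102], "feat": [0.3, 0.1, 0.2]})
# Hypothetical probabilities, in the same row order as toy_test
toy_scores = np.array([0.9, 0.05, 0.4])
# Hypothetical sample submission indexed by TransactionID
toy_sample = pd.DataFrame(
    {"TransactionID": [101, 102, 103], "isFraud": 0.5}
).set_index("TransactionID")

# Label each score with its TransactionID, then align on the submission index
scores_by_id = pd.Series(toy_scores, index=toy_test["TransactionID"])
toy_sample["isFraud"] = scores_by_id.reindex(toy_sample.index)

# Every submission row should have received a prediction
assert toy_sample["isFraud"].notna().all()
print(toy_sample)
```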

In [None]:
plt.hist(submission.isFraud,bins=100)
plt.ylim((0,5000))
plt.title('XGBoost submission')
plt.show()

In [None]:
submission.to_csv('/kaggle/working/xgboost_submission.csv')

In [None]:
del eval_set, xgboost_classifier, predictions, results, epochs, x_axis, fig, ax, cols, feature_imp, proba, y_preds, submission
gc.collect()

## XGBoost classifier with hyperparameter tuning 

In [None]:
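# note: the preprocessing pipeline fills numerical NaNs with -999,
# while missing=-1 tells XGBoost to treat -1 as the missing value
xgboost_classifier = XGBClassifier(
        n_estimators=5000,
        max_depth=12,
        learning_rate=0.02,
        subsample=0.8,
        colsample_bytree=0.4,
        missing=-1,
        tree_method='gpu_hist')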

eval_set = [(X_train, y_train), (X_test, y_test)]
xgboost_classifier.fit(X_train, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=0)
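
The `missing` argument of `XGBClassifier` names the value that should be treated as missing, so it is worth checking which sentinel actually occurs in the features. The sketch below counts candidate sentinels in a small hypothetical frame (`toy_X` is made up); on the real data the same expression could be applied to `X_train`.

```python
import pandas as pd

# Hypothetical features after imputation with a -999 fill value
toy_X = pd.DataFrame({
    "TransactionAmt": [10.0, -999.0, 30.0],
    "dist1": [-999.0, -999.0, 4.0],
})

# Number of cells equal to each candidate sentinel
sentinel_counts = {s: int((toy_X == s).to_numpy().sum()) for s in (-1, -999)}
print(sentinel_counts)
```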

In [None]:
del eval_set
gc.collect()

In [None]:
predictions = xgboost_classifier.predict(X_test)

In [None]:
# true labels first, then the predictions
print(confusion_matrix(y_test, predictions))

In [None]:
# true labels first, otherwise precision and recall are swapped
print(classification_report(y_test, predictions))

In [None]:
del predictions
gc.collect()

**Learning curve for XGBoost with hyperparameter tuning**

In [None]:
# retrieve performance metrics
results = xgboost_classifier.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)
# plot log loss
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
plt.ylabel('Log Loss')
plt.title('XGBoost tuning Log Loss')
plt.show()
# plot classification error
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
plt.ylabel('Classification Error')
plt.title('XGBoost tuning Classification Error')
plt.show()

In [None]:
del results, epochs, x_axis, fig, ax

**Feature importance for XGBoost with hyperparameter tuning**

In [None]:
# importances follow the order of the columns the model was trained on
cols = list(X_train.columns)
feature_imp = pd.DataFrame(sorted(zip(xgboost_classifier.feature_importances_,cols)), columns=['Value','Feature'])
plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False).iloc[:50])
plt.title('XGBoost hyperparameter tuning Most Important Features')
plt.tight_layout()
plt.show()

In [None]:
del cols, feature_imp
gc.collect()

**Submission for XGBoost with hyperparameter tuning**

In [None]:
proba = xgboost_classifier.predict_proba(test_df[useful_cols])

# probability of the positive (fraud) class
y_preds = proba[:, 1]

submission = pd.read_csv('sample_submission.csv', index_col='TransactionID')
submission['isFraud'] = y_preds
submission.head()

In [None]:
plt.hist(submission.isFraud,bins=100)
plt.ylim((0,5000))
plt.title('XGBoost hyperparameter tuning submission')
plt.show()

In [None]:
submission.to_csv('/kaggle/working/xgboost_tunning_submission.csv')

In [None]:
del xgboost_classifier,proba, y_preds, submission
gc.collect()