# IEEE Fraud

Le but de cette compétition est de prédire la probabilité qu'une transaction en ligne soit frauduleuse. 

**Transaction variables**

* TransactionDT: timedelta from a given reference  datetime (not an actual timestamp)
* TransactionAMT: transaction payment amount in USD
* ProductCD: product code, the product for each transaction
* card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
* addr: address
* dist: distance
* P_ and (R__) emaildomain: purchaser and recipient email domain
* C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
* D1-D15: timedelta, such as days between previous transaction, etc.
* M1-M9: match, such as names on card and address, etc.
* Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

**Categorical Features - Transaction**

* ProductCD
* emaildomain
* card1 - card6
* addr1, addr2
* P_emaildomain
* R_emaildomain
* M1 - M9

**Categorical Features - Identity**

* DeviceType
* DeviceInfo
* id_12 - id_38

The TransactionDT feature is a timedelta from a given reference datetime (not an actual timestamp).




## Data preprocessing & analysis : 

In [2]:
import pandas as pd
import numpy as np
import scipy as sp
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Standard plotly imports
#import plotly.plotly as py
import plotly.graph_objs as go
import plotly.tools as tls
from plotly.offline import iplot, init_notebook_mode
#import cufflinks
#import cufflinks as cf
import plotly.figure_factory as ff

# Using plotly + cufflinks in offline mode
init_notebook_mode(connected=True)
#cufflinks.go_offline(connected=True)

# Preprocessing, modelling and evaluating
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, KFold
from xgboost import XGBClassifier
import xgboost as xgb

### Importer la Data :

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
path = '/content/drive/My Drive/Data Fraud/'

In [0]:
train_id = pd.read_csv(path+'train_identity.csv')
train_tr = pd.read_csv(path+'train_transaction.csv')
test_id = pd.read_csv(path+'test_identity.csv')
test_tr = pd.read_csv(path+'test_transaction.csv')

Ces données semblent assez énormes et difficiles à comprendre. TransactionID est la colonne commune dans les données de transaction et les données d'identité et les deux tables peuvent être jointes à l'aide de cette colonne commune.

###**Merging data** : 

In [0]:
train_data = pd.merge(train_tr, train_id, on='TransactionID', how='left')
test_data = pd.merge(test_tr, test_id, on='TransactionID', how='left')

In [7]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 590540 entries, 0 to 590539
Columns: 434 entries, TransactionID to DeviceInfo
dtypes: float64(399), int64(4), object(31)
memory usage: 1.9+ GB


In [8]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 506691 entries, 0 to 506690
Columns: 433 entries, TransactionID to DeviceInfo
dtypes: float64(399), int64(3), object(31)
memory usage: 1.6+ GB


### **Reduce memory :**



In [0]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: 
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [10]:
train = reduce_mem_usage(train_data)
test = reduce_mem_usage(test_data)

Mem. usage decreased to 650.48 Mb (66.8% reduction)
Mem. usage decreased to 565.37 Mb (66.3% reduction)


Comme on a fusionné les deux data et on a pu réduire leur mémoire,  **train_id** et **train_trasaction** (de même pour données test), maintenant on supprime les autres qui nous seront inutiles.

In [0]:
del train_id, train_tr, test_id, test_tr, train_data, test_data

**Reconnaître les variables catégorielles et numériques.**

1. Variables numériques :

In [12]:
num = list(train.select_dtypes(exclude=['object']).columns)
num

['TransactionID',
 'isFraud',
 'TransactionDT',
 'TransactionAmt',
 'card1',
 'card2',
 'card3',
 'card5',
 'addr1',
 'addr2',
 'dist1',
 'dist2',
 'C1',
 'C2',
 'C3',
 'C4',
 'C5',
 'C6',
 'C7',
 'C8',
 'C9',
 'C10',
 'C11',
 'C12',
 'C13',
 'C14',
 'D1',
 'D2',
 'D3',
 'D4',
 'D5',
 'D6',
 'D7',
 'D8',
 'D9',
 'D10',
 'D11',
 'D12',
 'D13',
 'D14',
 'D15',
 'V1',
 'V2',
 'V3',
 'V4',
 'V5',
 'V6',
 'V7',
 'V8',
 'V9',
 'V10',
 'V11',
 'V12',
 'V13',
 'V14',
 'V15',
 'V16',
 'V17',
 'V18',
 'V19',
 'V20',
 'V21',
 'V22',
 'V23',
 'V24',
 'V25',
 'V26',
 'V27',
 'V28',
 'V29',
 'V30',
 'V31',
 'V32',
 'V33',
 'V34',
 'V35',
 'V36',
 'V37',
 'V38',
 'V39',
 'V40',
 'V41',
 'V42',
 'V43',
 'V44',
 'V45',
 'V46',
 'V47',
 'V48',
 'V49',
 'V50',
 'V51',
 'V52',
 'V53',
 'V54',
 'V55',
 'V56',
 'V57',
 'V58',
 'V59',
 'V60',
 'V61',
 'V62',
 'V63',
 'V64',
 'V65',
 'V66',
 'V67',
 'V68',
 'V69',
 'V70',
 'V71',
 'V72',
 'V73',
 'V74',
 'V75',
 'V76',
 'V77',
 'V78',
 'V79',
 'V80',
 'V81',


2. variables catégorielles :

In [13]:
cat = list(train.select_dtypes(include=['object']).columns)
cat

['ProductCD',
 'card4',
 'card6',
 'P_emaildomain',
 'R_emaildomain',
 'M1',
 'M2',
 'M3',
 'M4',
 'M5',
 'M6',
 'M7',
 'M8',
 'M9',
 'id_12',
 'id_15',
 'id_16',
 'id_23',
 'id_27',
 'id_28',
 'id_29',
 'id_30',
 'id_31',
 'id_33',
 'id_34',
 'id_35',
 'id_36',
 'id_37',
 'id_38',
 'DeviceType',
 'DeviceInfo']

### Les données manquantes :

In [14]:
print(f'There are {train.isnull().any().sum()} columns in train dataset with missing values.')

There are 414 columns in train dataset with missing values.


In [15]:
print(f'There are {test.isnull().any().sum()} columns in train dataset with missing values.')


There are 385 columns in train dataset with missing values.


In [16]:
data_null = train.isnull().sum()/len(train) * 100
data_null = data_null.drop(data_null[data_null == 0].index).sort_values(ascending=False)[:500]
missing_data = pd.DataFrame({'Missing Ratio': data_null})
missing_data.head()

Unnamed: 0,Missing Ratio
id_24,99.196159
id_25,99.130965
id_07,99.12707
id_08,99.12707
id_21,99.126393


**Approche naïve :** Nous allons maintenant supprimer les variables dont la valeur manquante est supérieure à 70%.

In [0]:
def null_attr(data):
    null_cols = [c for c in data.columns if data[c].isnull().sum() / data.shape[0] > 0.7]
    return null_cols

In [0]:
def repeated_val(data):
    repval_cols = [c for c in train.columns if train[c].value_counts(dropna=False, normalize=True).values[0] > 0.9]
    return repval_cols

In [0]:
def useless_columns(data):
    null = null_attr(data)
    print("More than 90% null: " + str(len(null)))
    repeated = repeated_val(data)
    print("More than 90% repeated value: " + str(len(repeated)))
    cols_to_drop = list(set(null + repeated))
    cols_to_drop.remove('isFraud')
    return cols_to_drop

In [20]:
useless_columns(train)

More than 90% null: 208
More than 90% repeated value: 67


['V168',
 'V145',
 'V139',
 'V146',
 'V253',
 'V117',
 'V187',
 'id_28',
 'id_36',
 'V275',
 'V192',
 'V158',
 'V177',
 'V101',
 'V200',
 'V246',
 'V254',
 'V256',
 'V231',
 'V104',
 'V148',
 'V190',
 'V199',
 'V232',
 'V169',
 'V181',
 'V170',
 'id_22',
 'V269',
 'V132',
 'id_14',
 'V159',
 'V106',
 'V213',
 'V171',
 'V109',
 'V153',
 'V277',
 'V118',
 'V273',
 'V259',
 'id_11',
 'C3',
 'V266',
 'V142',
 'V201',
 'V239',
 'V267',
 'V320',
 'V208',
 'V298',
 'id_27',
 'V162',
 'V240',
 'V244',
 'V186',
 'V218',
 'V250',
 'V141',
 'V197',
 'V222',
 'V110',
 'V281',
 'V329',
 'V184',
 'V178',
 'DeviceType',
 'id_31',
 'V173',
 'V174',
 'V247',
 'V157',
 'V120',
 'V268',
 'V327',
 'V149',
 'V263',
 'V144',
 'id_06',
 'V233',
 'V252',
 'V274',
 'V234',
 'V207',
 'V276',
 'D8',
 'id_08',
 'id_15',
 'V188',
 'V185',
 'id_13',
 'V286',
 'V251',
 'V165',
 'id_32',
 'V335',
 'id_17',
 'V124',
 'V206',
 'V237',
 'V242',
 'V129',
 'V189',
 'V133',
 'V305',
 'V227',
 'V336',
 'V297',
 'D9',
 'V179

In [21]:
train = train.drop(useless_columns(train), axis=1)

More than 90% null: 208
More than 90% repeated value: 67


In [22]:
test = test.drop(useless_columns(test),axis=1)

More than 90% null: 208
More than 90% repeated value: 1


In [23]:
train.head(5)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,P_emaildomain,C1,C2,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D10,D11,D15,M1,M2,M3,M4,...,V88,V89,V90,V91,V92,V93,V94,V95,V96,V97,V99,V100,V126,V127,V128,V130,V131,V279,V280,V282,V283,V285,V287,V288,V289,V291,V292,V294,V302,V303,V304,V306,V307,V308,V310,V312,V313,V314,V315,V317
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,credit,315.0,87.0,19.0,,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,14.0,,13.0,,,13.0,13.0,0.0,T,T,T,M2,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,117.0
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,325.0,87.0,,gmail.com,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,0.0,,0.0,,,,M0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,debit,330.0,87.0,287.0,outlook.com,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,0.0,315.0,315.0,T,T,T,M0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,debit,476.0,87.0,,yahoo.com,2.0,5.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,1.0,0.0,25.0,1.0,112.0,112.0,0.0,94.0,0.0,84.0,,111.0,,,,M0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,48.0,28.0,10.0,4.0,50.0,1758.0,925.0,354.0,135.0,1.0,28.0,0.0,0.0,10.0,4.0,0.0,0.0,1.0,1.0,38.0,0.0,0.0,0.0,50.0,1758.0,925.0,354.0,135.0,0.0,0.0,0.0,1404.0
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,credit,420.0,87.0,,gmail.com,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,,,,,,,,,,,,...,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [0]:
le=preprocessing.LabelEncoder()
def imputation(data):
   for i in data.columns:
      if data[i].dtypes=='int64' or data[i].dtypes=='float64':
         data[i].fillna(data[i].mean(),inplace=True)
      if data[i].dtypes=='object':
         data[i].fillna(data[i].mode()[0],inplace=True)
      if 'M' in i:
         data[i].fillna(data[i].mode()[0],inplace=True)
         data[i]=le.fit_transform(data[i])

In [0]:
imputation(train)
imputation(test)

###**Application de modèles :**

récupération des variables numériques :

In [0]:
X_num = [i for i in train.columns if i in num]

In [0]:
X=train[X_num]

In [0]:
X.fillna(X.mean(), inplace=True)

In [29]:
X.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,dist1,C1,C2,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D10,D11,D15,V1,V2,V3,V4,V5,V6,V7,V8,...,V88,V89,V90,V91,V92,V93,V94,V95,V96,V97,V99,V100,V126,V127,V128,V130,V131,V279,V280,V282,V283,V285,V287,V288,V289,V291,V292,V294,V302,V303,V304,V306,V307,V308,V310,V312,V313,V314,V315,V317
0,2987000,0,86400,68.5,13926,362.5,150.0,142.0,315.0,87.0,19.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,14.0,169.625,13.0,140.0,42.34375,13.0,13.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,117.0
1,2987001,0,86401,29.0,2755,404.0,150.0,102.0,325.0,87.0,118.5,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,169.625,28.34375,0.0,42.34375,0.0,146.625,0.0,1.0,1.044922,1.078125,0.84668,0.876953,1.045898,1.073242,1.027344,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2987002,0,86469,59.0,4663,490.0,150.0,166.0,330.0,87.0,287.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,169.625,28.34375,0.0,42.34375,0.0,315.0,315.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2987003,0,86499,50.0,18132,567.0,150.0,117.0,476.0,87.0,118.5,2.0,5.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,1.0,0.0,25.0,1.0,112.0,112.0,0.0,94.0,0.0,84.0,146.625,111.0,1.0,1.044922,1.078125,0.84668,0.876953,1.045898,1.073242,1.027344,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,48.0,28.0,10.0,4.0,50.0,1758.0,925.0,354.0,135.0,1.0,28.0,0.0,0.0,10.0,4.0,0.0,0.0,1.0,1.0,38.0,0.0,0.0,0.0,50.0,1758.0,925.0,354.0,135.0,0.0,0.0,0.0,1404.0
4,2987004,0,86506,50.0,4497,514.0,150.0,102.0,420.0,87.0,118.5,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,169.625,28.34375,140.0,42.34375,124.0,146.625,163.75,1.0,1.044922,1.078125,0.84668,0.876953,1.045898,1.073242,1.027344,...,0.999023,0.000902,0.401855,0.42041,0.150269,0.154785,0.136963,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**1) KNN :**

In [0]:
Y = X["isFraud"] 
X = X.loc[:, X.columns != "isFraud"]

split data :

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.7, random_state=0)

In [32]:
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [33]:
pred_train = knn.predict_proba(X_train)
print("score auc train :",roc_auc_score(y_train, pred_train[:, 1]))

score auc train : 0.9817013035543113


In [34]:
pred_test = knn.predict_proba(X_test)
print("score auc test :",roc_auc_score(y_test, pred_test[:, 1]))

score auc test : 0.6726890087845089


**2) Regression Tree :**

In [35]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [36]:
pred_train = clf.predict_proba(X_train)
print("score auc train :",roc_auc_score(y_train, pred_train[:, 1]))

score auc train : 1.0


Comme on a le problème de imbalanced data, normal que ça 1 pour arbre de décision et même 0.98 pour KNN.

In [37]:
pred_test = clf.predict_proba(X_test)
print("score auc test :",roc_auc_score(y_test, pred_test[:, 1]))

score auc test : 0.7677666183445674


On va essayer un dernier : 

**3) XGBoost:**

In [39]:
boost = XGBClassifier()
boost.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [40]:
pred_train = boost.predict_proba(X_train)
print("score auc train :",roc_auc_score(y_train, pred_train[:, 1]))

score auc train : 0.8846411761809592


In [41]:
pred_test = boost.predict_proba(X_test)
print("score auc test :",roc_auc_score(y_test, pred_test[:, 1]))

score auc test : 0.8792474628877732


Problème "inbalanced Data":

D-Features:

V-Features:

C-Features:

P_R emails: