# Exercise 04

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. 

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [3]:
import zipfile
with zipfile.ZipFile('datasets/fraud_detection.csv.zip', 'r') as z:
    f = z.open('15_fraud_detection.csv')
    data = pd.read_csv(f, index_col=0)

In [4]:
data.head()

| Unnamed: 0 | accountAge | digitalItemCount | sumPurchaseCount1Day | sumPurchaseAmount1Day | sumPurchaseAmount30Day | paymentBillingPostalCode - LogOddsForClass_0 | accountPostalCode - LogOddsForClass_0 | paymentBillingState - LogOddsForClass_0 | accountState - LogOddsForClass_0 | paymentInstrumentAgeInAccount | ipState - LogOddsForClass_0 | transactionAmount | transactionAmountUSD | ipPostalCode - LogOddsForClass_0 | localHour - LogOddsForClass_0 | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 0 | 0 | 0.0 | 720.25 | 5.064533 | 0.421214 | 1.312186 | 0.566395 | 3279.574306 | 1.218157 | 599.0 | 626.16465 | 1.259543 | 4.745402 | 0 |
| 1 | 62 | 1 | 1 | 1185.44 | 2530.37 | 0.538996 | 0.481838 | 4.40137 | 4.500157 | 61.970139 | 4.035601 | 1185.44 | 1185.44 | 3.981118 | 4.921349 | 0 |
| 2 | 2000 | 0 | 0 | 0.0 | 0.0 | 5.064533 | 5.096396 | 3.056357 | 3.155226 | 0.0 | 3.314186 | 32.09 | 32.09 | 5.00849 | 4.742303 | 0 |
| 3 | 1 | 1 | 0 | 0.0 | 0.0 | 5.064533 | 5.096396 | 3.331154 | 3.331239 | 0.0 | 3.529398 | 133.28 | 132.729554 | 1.324925 | 4.745402 | 0 |
| 4 | 1 | 1 | 0 | 0.0 | 132.73 | 5.412885 | 0.342945 | 5.563677 | 4.086965 | 0.001389 | 3.529398 | 543.66 | 543.66 | 2.693451 | 4.876771 | 0 |


In [5]:
data.shape, data.Label.sum(), data.Label.mean()

((138721, 16), 797, 0.0057453449730033666)

In [6]:
X = data.drop(['Label'], axis=1)
Y = data['Label']


# Exercise 04.1

Estimate a Logistic Regression

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results
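
For reference, the F-beta score combines precision $P$ and recall $R$ as shown below. This is the standard definition and is not stated in the exercise itself.

$$F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$$

With $\beta = 1$ this reduces to the F1-Score. With $\beta = 10$ the score is dominated by recall, a common choice when a missed fraud is considered far more costly than a false alarm.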

In [7]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=1)



In [16]:
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

y_pred_class_log = logreg.predict(X_test)

In [9]:
print('Accuracy', metrics.accuracy_score(y_test, y_pred_class_log))

Accuracy 0.993973645512


In [18]:
y_test.values

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [17]:
print('F1 Score', metrics.f1_score(y_test.values, y_pred_class_log))

F1 Score 0.0




In [19]:
print('F Beta Score (Beta=10)', metrics.fbeta_score(y_test.values, y_pred_class_log, beta=10))

F Beta Score (Beta=10) 0.0




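The model's accuracy looks very good: 99.4% of the test observations are classified into the correct category. However, only about 0.57% of the transactions are fraudulent, so a model that labels every transaction as legitimate would reach roughly the same accuracy. The F1-Score and F-Beta-Score cannot be computed meaningfully and are reported as 0.0, which is consistent with the model not correctly identifying any fraud case in the test set.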
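
To make the point about accuracy concrete, the sketch below scores a "classifier" that never flags fraud. The label vector is hypothetical (10,000 transactions, 57 of them fraudulent, chosen to mimic the 0.57% fraud rate above), and `zero_division=0` assumes a reasonably recent scikit-learn version.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 10,000 transactions, 57 frauds (about 0.57%)
y_true = np.zeros(10_000, dtype=int)
y_true[:57] = 1

# A "classifier" that labels every transaction as legitimate
y_all_negative = np.zeros_like(y_true)

# scikit-learn metrics take (y_true, y_pred) in that order
baseline_accuracy = metrics.accuracy_score(y_true, y_all_negative)
baseline_f1 = metrics.f1_score(y_true, y_all_negative, zero_division=0)
baseline_fbeta = metrics.fbeta_score(y_true, y_all_negative, beta=10, zero_division=0)

print('Accuracy', baseline_accuracy)
print('F1 Score', baseline_f1)
print('F Beta Score (Beta=10)', baseline_fbeta)
```

The accuracy is high purely because of the class imbalance, while both F-scores are zero because no fraud is ever detected.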

# Exercise 04.2

Under-sample the negative class using random-under-sampling

Which value of target_percentage did you choose?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**

In [20]:
def UnderSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_0_new =  n_samples_1 / target_percentage - n_samples_1
    n_samples_0_new_per = n_samples_0_new / n_samples_0

    filter_ = y == 0

    np.random.seed(seed)
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples)
    
    filter_ = filter_ & rand_1
    filter_ = filter_ | (y == 1)
    filter_ = filter_.astype(bool)
    
    return X[filter_], y[filter_]
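
The arithmetic inside `UnderSampling` can be checked in isolation. The sketch below uses a hypothetical helper, `negatives_to_keep`, and hypothetical class counts; it only reproduces the formula `n_samples_1 / target_percentage - n_samples_1` from the function above.

```python
def negatives_to_keep(n_pos, target_percentage):
    """Number of negatives needed so that positives make up target_percentage."""
    return n_pos / target_percentage - n_pos

# Hypothetical class counts for a training set
n_pos, n_neg = 600, 103_400

for p in [0.1, 0.2, 0.5]:
    n_keep = negatives_to_keep(n_pos, p)
    keep_probability = n_keep / n_neg      # plays the role of n_samples_0_new_per
    share = n_pos / (n_pos + n_keep)       # resulting share of positives
    print(p, round(n_keep), round(keep_probability, 4), round(share, 3))
```

Because `UnderSampling` keeps each negative independently with probability `n_samples_0_new_per` (a binomial draw), the resulting `y.mean()` only approximates the requested target percentage.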

In [21]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5,0.6,0.7,0.8,0.9]:
    X_u, y_u = UnderSampling(X_train, y_train, target_percentage, 666)
    print('Target percentage', target_percentage)
    print('y.shape = ',y_u.shape[0], 'y.mean() = ', y_u.mean())
    logreg.fit(X_u, y_u)
    y_pred_class_log = logreg.predict(X_test)
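    # Note: scikit-learn metrics take (y_true, y_pred); the arguments are reversed here.
    # Accuracy and F1 are unaffected, but F-beta then weights precision instead of recall.
    print('Accuracy',metrics.accuracy_score(y_pred_class_log,y_test))
    print('F1 Score',metrics.f1_score(y_pred_class_log,y_test))
    print('F Beta Score (Beta=10)',metrics.fbeta_score(y_pred_class_log,y_test,beta=10))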
    print(' ')

Target percentage 0.1
y.shape =  5966 y.mean() =  0.09855849815621857
Accuracy 0.993483463568
F1 Score 0.0
F Beta Score (Beta=10) 0.0
 
Target percentage 0.2
y.shape =  3020 y.mean() =  0.1947019867549669
Accuracy 0.982238113088
F1 Score 0.0609756097561
F Beta Score (Beta=10) 0.0449798481373
 
Target percentage 0.3
y.shape =  2029 y.mean() =  0.2897979300147856
Accuracy 0.974597041608
F1 Score 0.049622437972
F Beta Score (Beta=10) 0.0322598564068
 
Target percentage 0.4
y.shape =  1489 y.mean() =  0.3948959032907992
Accuracy 0.919696663879
F1 Score 0.037331489803
F Beta Score (Beta=10) 0.0203046063237
 
Target percentage 0.5
y.shape =  1208 y.mean() =  0.4867549668874172
Accuracy 0.520371384908
F1 Score 0.0194529592077
F Beta Score (Beta=10) 0.00994504415743
 
Target percentage 0.6
y.shape =  998 y.mean() =  0.5891783567134269
Accuracy 0.378881808483
F1 Score 0.017155632614
F Beta Score (Beta=10) 0.00874616365017
 
Target percentage 0.7
y.shape =  842 y.mean() =  0.6983372921615202
Acc

The target percentage I would choose is **0.2**, since it gives the highest F1 Score and F Beta Score among the values shown. Accuracy is highest at 0.1 (0.9935), but both F-scores are 0.0 there. No single pattern holds across all metrics as the target percentage increases: the F-scores peak at 0.2 and then decline, while accuracy falls steadily.

# Exercise 04.3

Now using random-over-sampling

In [22]:
import random
def OverSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_1_new =  -target_percentage * n_samples_0 / (target_percentage- 1)

    np.random.seed(seed)
    filter_ = np.random.choice(X[y == 1].shape[0], int(n_samples_1_new))
    # filter_ is within the positives, change to be of all
    filter_ = np.nonzero(y == 1)[0][filter_]
    
    filter_ = np.concatenate((filter_, np.nonzero(y == 0)[0]), axis=0)
    
    return X[filter_], y[filter_]

In [23]:
X_u, y_u = OverSampling(X_train, y_train, 0.5, 666)
X_u

KeyError: '[ 41645  75082  12793 ..., 104037 104038 104039] not in index'

In [24]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5,0.6,0.7,0.8,0.9]:
    X_u, y_u = OverSampling(X_train.values, y_train, target_percentage, 666)
    print('Target percentage', target_percentage)
    print('y.shape = ',y_u.shape[0], 'y.mean() = ', y_u.mean())
    logreg.fit(X_u,y_u)
    y_pred_class_log = logreg.predict(X_test)
    print('Accuracy', metrics.accuracy_score(y_test, y_pred_class_log))
    print('F1 Score', metrics.f1_score(y_test, y_pred_class_log))
    print('F Beta Score (Beta=10)', metrics.fbeta_score(y_test, y_pred_class_log, beta=10))
    print(' ')

Target percentage 0.1
y.shape =  114946 y.mean() =  0.005550405561993047


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
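
Both errors above are consistent with label-based indexing: `X[filter_]` on a DataFrame looks up column labels, and `y[filter_]` on a pandas Series looks up index labels rather than positions, which would also explain why `y.mean()` stays near 0.0055 instead of 0.1. The sketch below is one possible fix under that assumption. It converts both inputs to NumPy arrays and indexes by position only. The function name `over_sampling_positional` and the toy data are hypothetical, and it has not been run on the fraud dataset.

```python
import numpy as np

def over_sampling_positional(X, y, target_percentage=0.5, seed=None):
    """Random over-sampling of the positive class using positional indexing only."""
    X = np.asarray(X)
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)

    # Positives needed so that they make up target_percentage of the result
    n_pos_new = int(round(target_percentage * len(neg_idx) / (1 - target_percentage)))

    # Draw positives with replacement, then keep every negative
    rng = np.random.RandomState(seed)
    sampled_pos = rng.choice(pos_idx, size=n_pos_new, replace=True)
    keep = np.concatenate((sampled_pos, neg_idx))
    return X[keep], y[keep]

# Toy data: 1000 rows, 20 positives
toy_rng = np.random.RandomState(0)
X_toy = toy_rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

X_o, y_o = over_sampling_positional(X_toy, y_toy, 0.5, seed=666)
print(X_o.shape, y_o.mean())
```

Because the function calls `np.asarray` itself, it should accept `X_train` and `y_train` directly, whether they are pandas objects or arrays.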

# Exercise 04.4*
Evaluate the results using SMOTE

Which parameters did you choose?

In [None]:
def SMOTE(X, y, target_percentage=0.5, k=5, seed=None):
    
    # New samples
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()
    n_samples_1_new =  int(-target_percentage * n_samples_0 / (target_percentage- 1) - n_samples_1)
    
    # A matrix to store the synthetic samples
    new = np.zeros((n_samples_1_new, X.shape[1]))
    
    # Create seeds
    np.random.seed(seed)
    seeds = np.random.randint(1, 1000000, 3)
    
    # Select examples to use as base
    np.random.seed(seeds[0])
    sel_ = np.random.choice(y[y==1].shape[0], n_samples_1_new)
    
    # Define random seeds (2 per example)
    np.random.seed(seeds[1])
    nn__ = np.random.choice(k, n_samples_1_new)
    np.random.seed(seeds[2])
    steps = np.random.uniform(size=n_samples_1_new)  

    # For each selected examples create one synthetic case
    for i, sel in enumerate(sel_):
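        # Select neighbor. Note: nn_ is drawn from range(k), so it indexes one of the
        # first k positive rows rather than one of the k nearest neighbors of sel.
        nn_ = nn__[i]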
        step = steps[i]
        # Create new sample
        new[i, :] = X[y==1][sel] - step * (X[y==1][sel] - X[y==1][nn_])
    
    X = np.vstack((X, new))
    y = np.append(y, np.ones(n_samples_1_new))
    
    return X, y
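
In the function above, `nn_` is drawn from `range(k)`, so the interpolation partner is one of the first `k` positive rows rather than one of the `k` nearest neighbours of the selected example. The sketch below shows one way to interpolate towards actual nearest neighbours. It relies on `sklearn.neighbors.NearestNeighbors`, which the original does not use, and the name `smote_knn` and the toy data are hypothetical. It is an illustration, not the method behind the results reported in this notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_knn(X, y, target_percentage=0.5, k=5, seed=None):
    """SMOTE-style over-sampling that interpolates towards k nearest positive neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_pos = X[y == 1]
    n_neg = int((y == 0).sum())
    n_pos = X_pos.shape[0]
    n_new = int(target_percentage * n_neg / (1 - target_percentage) - n_pos)
    if n_new <= 0:
        return X, y

    # Ask for k + 1 neighbours: a point is usually returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_pos)
    _, neighbours = nn.kneighbors(X_pos)

    rng = np.random.RandomState(seed)
    base = rng.choice(n_pos, n_new)                 # base example for each synthetic row
    pick = rng.choice(np.arange(1, k + 1), n_new)   # which neighbour (skipping column 0)
    steps = rng.uniform(size=(n_new, 1))            # interpolation weight in [0, 1)

    partner = neighbours[base, pick]
    new = X_pos[base] + steps * (X_pos[partner] - X_pos[base])
    return np.vstack((X, new)), np.append(y, np.ones(n_new, dtype=y.dtype))

# Toy data: 1000 rows, 20 positives
toy_rng = np.random.RandomState(0)
X_toy = toy_rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

X_s, y_s = smote_knn(X_toy, y_toy, target_percentage=0.5, k=5, seed=666)
print(X_s.shape, y_s.mean())
```

Each synthetic row lies on the segment between a positive example and one of its neighbours, so it stays inside the range spanned by the observed positives.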

In [108]:
for target_percentage in [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]:
    X_u, y_u = SMOTE(X_train.values, y_train, target_percentage, 10, seed=666)
    print('Target percentage', target_percentage, 'k ', 10)
    print('y.shape = ',y_u.shape[0], 'y.mean() = ', y_u.mean())
    logreg.fit(X_u,y_u)
    y_pred_class_log = logreg.predict(X_test)
    # Note: scikit-learn metrics take (y_true, y_pred); the arguments are reversed here.
    # Accuracy and F1 are unaffected, but F-beta then weights precision instead of recall.
    print('Accuracy', metrics.accuracy_score(y_pred_class_log,y_test))
    print('F1 Score',metrics.f1_score(y_pred_class_log,y_test))
    print('F Beta Score (Beta=10)',metrics.fbeta_score(y_pred_class_log,y_test,beta=10))
    print(' ')

Target percentage 0.1 k  10
y.shape =  114946 y.mean() =  0.0999947801576
Accuracy 0.991522735792
F1 Score 0.00675675675676
F Beta Score (Beta=10) 0.011336850376
 
Target percentage 0.2 k  10
y.shape =  129315 y.mean() =  0.2
Accuracy 0.975404400104
F1 Score 0.0361581920904
F Beta Score (Beta=10) 0.0238316447669
 
Target percentage 0.3 k  10
y.shape =  147788 y.mean() =  0.29999729342
Accuracy 0.956344972752
F1 Score 0.0269922879177
F Beta Score (Beta=10) 0.0157217087074
 
Target percentage 0.4 k  10
y.shape =  172420 y.mean() =  0.4
Accuracy 0.857818402007
F1 Score 0.0167497507478
F Beta Score (Beta=10) 0.00882263019203
 
Target percentage 0.5 k  10
y.shape =  206904 y.mean() =  0.5
Accuracy 0.698826446758
F1 Score 0.0174960022575
F Beta Score (Beta=10) 0.0090108585018
 
Target percentage 0.6 k  10
y.shape =  258629 y.mean() =  0.599998453383
Accuracy 0.529425333756
F1 Score 0.0153252081574
F Beta Score (Beta=10) 0.00783706816545
 
Target percentage 0.7 k  10
y.shape =  344839 y.mean(

With k = 10, the target percentage I would choose is **0.2**, since it gives the highest F1 Score and F Beta Score among the values shown. Accuracy is highest at 0.1 (0.9915), where both F-scores are much lower. No single pattern holds across all metrics as the target percentage increases: the F-scores peak at 0.2 and then generally decline, while accuracy falls steadily.

# Exercise 04.5

Estimate Logistic Regression, GaussianNB, K-nearest neighbors and Decision Tree **classifiers**

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results

Combine the classifiers and comment

In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

models = {'lr': LogisticRegression(),
          'dt': DecisionTreeClassifier(),
          'nb': GaussianNB(),
          'nn': KNeighborsClassifier()}

In [26]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=666)

for model in models.keys():
    models[model].fit(X_train, y_train)   

In [27]:
# predict test for each model
y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())
for model in models.keys():
    y_pred[model] = models[model].predict(X_test)
    

In [28]:
# Evaluate each model
import numpy as np
from sklearn.metrics import mean_squared_error

for model in models.keys():
    print(model,np.sqrt(mean_squared_error(y_pred[model], y_test)))

lr 0.076507183297
dt 0.105088109965
nb 0.137846867531
nn 0.0770704339682


In [29]:
from sklearn import metrics
for model in models.keys():
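    # Note: the arguments are passed as (y_pred, y_true), the reverse of scikit-learn's
    # (y_true, y_pred), so the printed precision and recall are interchanged and
    # F-beta weights precision instead of recall.
    print(model, 'precision',metrics.precision_score(y_pred[model],y_test))
    print(model, 'recall', metrics.recall_score(y_pred[model],y_test))
    print(model, 'f1.score',metrics.f1_score(y_pred[model],y_test))
    print(model, 'fbeta',metrics.fbeta_score(y_pred[model],y_test,beta=10))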
    print(' ')

lr precision 0.0
lr recall 0.0
lr f1.score 0.0
lr fbeta 0.0
 
dt precision 0.137931034483
dt recall 0.118644067797
dt f1.score 0.127562642369
dt fbeta 0.118808553544
 
nb precision 0.0246305418719
nb recall 0.0107296137339
nb f1.score 0.0149476831091
nb fbeta 0.0107899066299
 
nn precision 0.0443349753695
nn recall 0.428571428571
nn f1.score 0.0803571428571
nn fbeta 0.394702561876
 




In terms of precision, the best model is the decision tree, at 0.138. The model with the best recall is KNeighborsClassifier, at 0.4286; recall is the share of actual positives (true positives plus false negatives) that the model identifies. The best balance between precision and recall, the F1 score, belongs to the decision tree, at 0.1276. Finally, the best F-beta score, the weighted harmonic mean of precision and recall, is 0.3947 and is obtained by KNeighborsClassifier. Note that the metric calls pass the predictions as the first argument, whereas scikit-learn expects `(y_true, y_pred)`, so the values printed as precision and recall are interchanged.

In [30]:
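y_pred = y_pred.astype(int)
# Majority vote across models: the row-wise mode, taking the first value on ties
y_pred.mode(axis=1).values[:,0]
# Note: the arguments are passed as (y_pred, y_true), the reverse of scikit-learn's order
print('fbeta',metrics.fbeta_score(y_pred.mode(axis=1).values[:,0],y_test,beta=10))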

fbeta 0.333333333333


Combining the classifiers by majority vote does not improve the F-beta score: it drops from 0.3947 for KNeighborsClassifier alone to 0.3333 for the combination of logistic regression, decision tree, Gaussian Naive Bayes and k-nearest neighbors.
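
With four classifiers a 2-2 tie is possible, and the row-wise mode resolves it towards the smaller label, that is, towards "not fraud". The sketch below uses a small hypothetical vote table to compare the mode-based vote with two explicit thresholds on the share of positive votes. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical predictions of four models on four transactions
votes = pd.DataFrame({'lr': [0, 0, 1, 0],
                      'dt': [0, 1, 1, 1],
                      'nb': [0, 1, 0, 1],
                      'nn': [0, 0, 1, 1]})

# Row-wise mode, first value on ties (the approach used in the cell above)
mode_vote = votes.mode(axis=1).values[:, 0].astype(int)

# Explicit thresholds on the share of models voting "fraud"
share = votes.mean(axis=1)
strict_majority = (share > 0.5).astype(int).values
at_least_half = (share >= 0.5).astype(int).values

print(mode_vote, strict_majority, at_least_half)
```

Lowering the threshold flags the tied transaction as fraud, trading precision for recall. Whether that helps on the fraud data would need to be checked.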

# Exercise 04.6

Using the under-sampled dataset

Evaluate a RandomForestClassifier and compare the results

Change n_estimators=100. What happened?

In [31]:
from sklearn.ensemble import RandomForestClassifier

for target_percentage in [0.2,0.4,0.6,0.8,0.9]:
    for ntree in [100,200,500]:
        X_u, y_u = UnderSampling(X_train, y_train, target_percentage, 666)
        print('Target percentage', target_percentage)
        print('y.shape = ',y_u.shape[0], 'y.mean() = ', y_u.mean())
        print('ntree = ',ntree)
        rfclas = RandomForestClassifier(n_estimators=ntree, random_state=666, n_jobs=-1)
        rfclas.fit(X_u,y_u)
        y_pred_class_RF = rfclas.predict(X_test)
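        # Note: scikit-learn expects (y_true, y_pred); with the arguments reversed,
        # F-beta weights precision instead of recall.
        print('F Beta Score (Beta=10)',metrics.fbeta_score(y_pred_class_RF,y_test,beta=10))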
        print(' ')

Target percentage 0.2
y.shape =  3054 y.mean() =  0.1944990176817289
ntree =  100
F Beta Score (Beta=10) 0.0710014770582
 
Target percentage 0.2
y.shape =  3054 y.mean() =  0.1944990176817289
ntree =  200
F Beta Score (Beta=10) 0.0734674628106
 
Target percentage 0.2
y.shape =  3054 y.mean() =  0.1944990176817289
ntree =  500
F Beta Score (Beta=10) 0.0735419043227
 
Target percentage 0.4
y.shape =  1501 y.mean() =  0.39573617588274485
ntree =  100
F Beta Score (Beta=10) 0.0299223113789
 
Target percentage 0.4
y.shape =  1501 y.mean() =  0.39573617588274485
ntree =  200
F Beta Score (Beta=10) 0.0295718722621
 
Target percentage 0.4
y.shape =  1501 y.mean() =  0.39573617588274485
ntree =  500
F Beta Score (Beta=10) 0.0292587294494
 
Target percentage 0.6
y.shape =  1011 y.mean() =  0.5875370919881305
ntree =  100
F Beta Score (Beta=10) 0.0140301921278
 
Target percentage 0.6
y.shape =  1011 y.mean() =  0.5875370919881305
ntree =  200
F Beta Score (Beta=10) 0.0139785897549
 
Target percen

The previous exercise compares Random Forest models trained on under-sampled data while varying the number of trees. The combination that maximizes the F Beta Score is a target percentage of 0.2 with 500 trees. As the target percentage increases, the F Beta Score decreases. The effect of the number of trees is less consistent: for a target percentage of 0.2, more trees improve the F Beta Score, whereas for a target percentage of 0.9, more trees make it worse.