# Exercise 15

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. 

In [74]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [75]:
import zipfile
with zipfile.ZipFile('15_fraud_detection.csv.zip', 'r') as z:
    f = z.open('15_fraud_detection.csv')
    data = pd.io.parsers.read_table(f, index_col=0, sep=',')

In [76]:
data.head()

Unnamed: 0,accountAge,digitalItemCount,sumPurchaseCount1Day,sumPurchaseAmount1Day,sumPurchaseAmount30Day,paymentBillingPostalCode - LogOddsForClass_0,accountPostalCode - LogOddsForClass_0,paymentBillingState - LogOddsForClass_0,accountState - LogOddsForClass_0,paymentInstrumentAgeInAccount,ipState - LogOddsForClass_0,transactionAmount,transactionAmountUSD,ipPostalCode - LogOddsForClass_0,localHour - LogOddsForClass_0,Label
0,2000,0,0,0.0,720.25,5.064533,0.421214,1.312186,0.566395,3279.574306,1.218157,599.0,626.16465,1.259543,4.745402,0
1,62,1,1,1185.44,2530.37,0.538996,0.481838,4.40137,4.500157,61.970139,4.035601,1185.44,1185.44,3.981118,4.921349,0
2,2000,0,0,0.0,0.0,5.064533,5.096396,3.056357,3.155226,0.0,3.314186,32.09,32.09,5.00849,4.742303,0
3,1,1,0,0.0,0.0,5.064533,5.096396,3.331154,3.331239,0.0,3.529398,133.28,132.729554,1.324925,4.745402,0
4,1,1,0,0.0,132.73,5.412885,0.342945,5.563677,4.086965,0.001389,3.529398,543.66,543.66,2.693451,4.876771,0


In [4]:
data.shape, data.Label.sum(), data.Label.mean()

((138721, 16), 797, 0.0057453449730033666)

# Exercice 15.1

Estimate a Logistic Regression and a Decision Tree

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results

In [77]:
X = data.drop ('Label', axis=1)
y=  data.Label

# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, random_state=1)

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(random_state=42)

log_reg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = log_reg.predict(X_test)


In [14]:
from sklearn.metrics import accuracy_score, f1_score, fbeta_score
print(accuracy_score(y_pred_class, y_test))
print(f1_score(y_test, y_pred_class))
print(fbeta_score(y_test, y_pred_class,beta=10))

0.993973645512
0.0
0.0


  'precision', 'predicted', average, warn_for)


In [78]:
X = data.drop ('Label', axis=1)
y=  data.Label

# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, random_state=1)

from sklearn.tree import DecisionTreeClassifier
tree_reg = DecisionTreeClassifier(random_state=42)

tree_reg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = tree_reg.predict(X_test)

In [17]:
from sklearn.metrics import accuracy_score, f1_score, fbeta_score
print(accuracy_score(y_pred_class, y_test))
print(f1_score(y_test, y_pred_class))
print(fbeta_score(y_test, y_pred_class,beta=10))

0.989302499928
0.147126436782
0.15298684086


Se puede observar que el f1 score y f beta score muestran que el arbolo de desición realizauna mejor prediccion de los fraudes comparado con la regresión logistica

# Exercice 15.2 (2 points)

Under-sample the negative class using random-under-sampling

Which is parameter for target_percentage did you choose?
How the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**

In [18]:
def UnderSampling(X_train, y_train, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y_train.shape[0]
    n_samples_0 = (y_train == 0).sum()
    n_samples_1 = (y_train == 1).sum()

    n_samples_0_new =  n_samples_1 / target_percentage - n_samples_1
    n_samples_0_new_per = n_samples_0_new / n_samples_0

    filter_ = y_train == 0

    np.random.seed(seed)
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples)
    
    filter_ = filter_ & rand_1
    filter_ = filter_ | (y_train == 1)
    filter_ = filter_.astype(bool)
    
    return X_train[filter_], y_train[filter_]

In [21]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    X_u, y_u = UnderSampling(X_train, y_train, target_percentage, seed=12345)
    print('Target percentage', target_percentage)
    print('y_train.shape = ',y_u.shape[0], 'y_train.mean() = ', y_u.mean())

Target percentage 0.1
y_train.shape =  5948 y_train.mean() =  0.0988567585743
Target percentage 0.2
y_train.shape =  2958 y_train.mean() =  0.19878296146
Target percentage 0.3
y_train.shape =  1931 y_train.mean() =  0.304505437597
Target percentage 0.4
y_train.shape =  1454 y_train.mean() =  0.404401650619
Target percentage 0.5
y_train.shape =  1181 y_train.mean() =  0.497883149873


In [23]:
X_u, y_u=UnderSampling(X_train, y_train, 0.5, 12345)

In [28]:
log_reg.fit(X_u, y_u)
tree_reg.fit(X_u,y_u)

y_pred_class_log_reg = log_reg.predict(X_test)
y_pred_class_tree_reg = tree_reg.predict(X_test)

In [56]:
from sklearn.metrics import accuracy_score, f1_score, fbeta_score
print(accuracy_score(y_pred_class_log_reg, y_test))
print(accuracy_score(y_pred_class_tree_reg, y_test))

0.994002479744
0.988495141432


# Exercice 15.3 (2 points)

Same analysis using TomekLinks and Condensed Nearest Neighbours

Do not test different parameters for CNN

## TomekLinks

In [30]:
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=2)
nn.fit(X_train)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=2, p=2, radius=1.0)

In [31]:
nns = nn.kneighbors(X_train, return_distance=False)[:, 1]

In [32]:
# Initialize the boolean result as false, and also a counter
links = np.zeros(len(y_train), dtype=bool)

# Loop through each sample of the majority class then we
# look at its first neighbour. If its closest neighbour also has the
# current sample as its closest neighbour, the two form a Tomek link.
for ind, ele in enumerate(y_train):

    if ele == 1 | links[ind] == True:  # Keep all from the minority class
        continue

    if y[nns[ind]] == 1:

        # If they form a tomek link, put a True marker on this
        # sample, and increase counter by one.
        if nns[nns[ind]] == ind:
            links[ind] = True

In [33]:
filter_ = np.logical_not(links)
print('y_train.shape = ',y_train[filter_].shape[0], 'y_train.mean() = ', y_train[filter_].mean())

y_train.shape =  103748 y_train.mean() =  0.00566757913406


## Condensed Nearest Neighbours

In [45]:
X = data.drop ('Label', axis=1)
y= data.Label

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, random_state=123)

In [46]:
# Import the K-NN classifier
from sklearn.neighbors import KNeighborsClassifier
def CondensedNearestNeighbor(X_train, y_train, n_seeds_S=1, size_ngh=1, seed=None):
    # Randomly get one sample from the majority class
    np.random.seed(seed)
    maj_sample = np.random.choice(X_train[y_train == 0].shape[0], n_seeds_S)
    maj_sample = X_train[y_train == 0][maj_sample]
    # Create the set C
    # Select all positive and the randomly selected negatives
    C_x = np.append(X_train[y_train == 1], maj_sample, axis=0)
    C_y = np.append(y_train[y_train == 1], [0] * n_seeds_S)
    # Create the set S
    S_x = X_train[y_train == 0]
    S_y = y_train[y_train == 0]
    knn = KNeighborsClassifier(n_neighbors=size_ngh)

    # Fit C into the knn
    knn.fit(C_x, C_y)

    # Classify on S
    pred_S_y = knn.predict(S_x)
    # Find the misclassified S_y
    idx_tmp = np.nonzero(y_train == 0)[0][np.nonzero(pred_S_y != S_y)]

    filter_ = np.nonzero(y_train == 1)[0]
    filter_ = np.concatenate((filter_, idx_tmp), axis=0)

    return X_train[filter_], y_train[filter_]

In [47]:
for n_seeds_S, size_ngh in [(1, 1), (100, 100), (50, 50), (100, 50), (50, 100)]:
    X_u, y_u = CondensedNearestNeighbor(X_train, y_train, n_seeds_S, size_ngh, 1)
    print('n_seeds_S ', n_seeds_S, 'size_ngh ', size_ngh)
    print('y_train.shape = ',y_u.shape[0], 'y_train.mean() = ', y_u.mean())

n_seeds_S  1 size_ngh  1
y_train.shape =  103890 y_train.mean() =  0.00566945808066
n_seeds_S  100 size_ngh  100
y_train.shape =  104040 y_train.mean() =  0.00566128412149
n_seeds_S  50 size_ngh  50
y_train.shape =  104040 y_train.mean() =  0.00566128412149
n_seeds_S  100 size_ngh  50
y_train.shape =  104040 y_train.mean() =  0.00566128412149
n_seeds_S  50 size_ngh  100
y_train.shape =  104040 y_train.mean() =  0.00566128412149


In [62]:
X_u, y_u = CondensedNearestNeighbor(X_train, y_train, 1, 1, seed=12345)

In [63]:
log_reg.fit(X_u, y_u)
tree_reg.fit(X_u,y_u)

y_pred_class_log_reg = log_reg.predict(X_test)
y_pred_class_tree_reg = tree_reg.predict(X_test)

In [64]:
from sklearn.metrics import accuracy_score, f1_score, fbeta_score
print(accuracy_score(y_pred_class_log_reg, y_test))
print(accuracy_score(y_pred_class_tree_reg, y_test))

0.994002479744
0.988120296416


# Exercice 15.4 

Now using random-over-sampling

In [57]:
import random
def OverSampling(X_train, y_train, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y_train.shape[0]
    n_samples_0 = (y_train == 0).sum()
    n_samples_1 = (y_train == 1).sum()

    n_samples_1_new =  -target_percentage * n_samples_0 / (target_percentage- 1)

    np.random.seed(seed)
    filter_ = np.random.choice(X_train[y_train == 1].shape[0], int(n_samples_1_new))
    # filter_ is within the positives, change to be of all
    filter_ = np.nonzero(y == 1)[0][filter_]
    
    filter_ = np.concatenate((filter_, np.nonzero(y_train == 0)[0]), axis=0)
    
    return X_train[filter_], y_train[filter_]

In [60]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    X_u, y_u = OverSampling(X_train, y_train, target_percentage, 12345)
    print('Target percentage', target_percentage)
    print('y_train.shape = ',y_u.shape[0], 'y_train.mean() = ', y_u.mean())

Target percentage 0.1
y_train.shape =  114945 y_train.mean() =  0.000400191395885
Target percentage 0.2
y_train.shape =  129313 y_train.mean() =  0.000966646818185
Target percentage 0.3
y_train.shape =  147787 y_train.mean() =  0.00145479643
Target percentage 0.4
y_train.shape =  172418 y_train.mean() =  0.00193135287499
Target percentage 0.5
y_train.shape =  206902 y_train.mean() =  0.00248910112034


In [61]:
 X_u, y_u = OverSampling(X_train, y_train,0.5,12345)

In [65]:
log_reg.fit(X_u, y_u)
tree_reg.fit(X_u,y_u)

y_pred_class_log_reg = log_reg.predict(X_test)
y_pred_class_tree_reg = tree_reg.predict(X_test)

In [66]:
from sklearn.metrics import accuracy_score, f1_score, fbeta_score
print(accuracy_score(y_pred_class_log_reg, y_test))
print(accuracy_score(y_pred_class_tree_reg, y_test))

0.994002479744
0.988120296416


# Exercice 15.5 (3 points)

Evaluate the results using SMOTE

Which parameters did you choose?

In [79]:
n_samples = y_train.shape[0]
n_samples_0 = (y_train == 0).sum()
n_samples_1 = (y_train == 1).sum()
def SMOTE(X_train, y_train, target_percentage=0.5, k=5, seed=None):
    
    # New samples
    n_samples_1_new =  int(-target_percentage * n_samples_0 / (target_percentage- 1) - n_samples_1)
    
    # A matrix to store the synthetic samples
    new = np.zeros((n_samples_1_new, X_train.shape[1]))
    
    # Create seeds
    np.random.seed(seed)
    seeds = np.random.randint(1, 1000000, 3)
    
    # Select examples to use as base
    np.random.seed(seeds[0])
    sel_ = np.random.choice(y_train[y_train==1].shape[0], n_samples_1_new)
    
    # Define random seeds (2 per example)
    np.random.seed(seeds[1])
    nn__ = np.random.choice(k, n_samples_1_new)
    np.random.seed(seeds[2])
    steps = np.random.uniform(size=n_samples_1_new)  

    # For each selected examples create one synthetic case
    for i, sel in enumerate(sel_):
        # Select neighbor
        nn_ = nn__[i]
        step = steps[i]
        # Create new sample
        new[i, :] = X_train[y_train==1][sel] - step * (X_train[y_train==1][sel] - X_train[y_train==1][nn_])
    
    X_train = np.vstack((X_train, new))
    y_train = np.append(y_train, np.ones(n_samples_1_new))
    
    return X_train, y_train

In [70]:
for target_percentage in [0.25, 0.5]:
    for k in [5, 15]:
        X_u, y_u = SMOTE(X_train, y_train, target_percentage, k, seed=12345)
        print('Target percentage', target_percentage, 'k ', k)
        print('y.shape = ',y_u.shape[0], 'y.mean() = ', y_u.mean())

Target percentage 0.25 k  5
y.shape =  137934 y.mean() =  0.249996375078
Target percentage 0.25 k  15
y.shape =  137934 y.mean() =  0.249996375078
Target percentage 0.5 k  5
y.shape =  206902 y.mean() =  0.5
Target percentage 0.5 k  15
y.shape =  206902 y.mean() =  0.5


In [80]:
X_u, y_u = SMOTE(X_train, y_train, 0.5, 5, 12345)

In [81]:
log_reg.fit(X_u, y_u)
tree_reg.fit(X_u,y_u)

y_pred_class_log_reg = log_reg.predict(X_test)
y_pred_class_tree_reg = tree_reg.predict(X_test)

In [82]:
from sklearn.metrics import accuracy_score, f1_score, fbeta_score
print(accuracy_score(y_pred_class_log_reg, y_test))
print(accuracy_score(y_pred_class_tree_reg, y_test))
print(f1_score(y_pred_class_log_reg, y_test))
print(f1_score(y_pred_class_tree_reg, y_test))

0.709033764886
0.986505579424
0.0154161381598
0.093023255814


# Exercice 15.6 (3 points)

Compare and comment about the results

Dado que el accurancy no es un buen indicador en este caso dado que la base de datos presenta menor cantidad de reportes de fraude (1) comparado con los que no son fraude (0), puesto que los modelos anteriores predicen casi todo como 0 y es por esto que el accurancy no es confiable. Para corregir este error es necesario realizar esta comparación mediante el f1 score y se observa que el modelo mas apropiado corresponde al arbol de desición en el modelo SMOTE, una de las ventajas de este modelo es que nos permite añadir nueva información para realizar las estimaciones. 
