# Objective of this Notebook

We will use an SVM classifier to predict whether a credit card transaction is fraudulent. 

We'll use a dataset from Kaggle. 
You can download it at https://www.kaggle.com/mlg-ulb/creditcardfraud

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.



# Table of contents
- [1. Data Exploration](#1-data-exploration)
  * [Reading data](#reading-data1)
  * [Features types](#features-types1)
  * [Missing values](#missing-values2)
  * [Target classes repartition](#target-classes-repartition1)
- [2. Data preprocessing](#2-data-preprocessing2)
  * [Feature scaling](#feature-scaling1)
- [3. Modeling](#3-svm)
  * [Creation of training / validation / test sets](#creation-of-training---validation---test-sets)
  * [Support Vector Classification](#support-vector-classification-params)
    + [The kernel](#the-kernel)
    + [C : strength of regularization](#c---strength-of-regularization)
  * [Test of the best model](#test-of-the-best-model)


## 1. Data Exploration
<a id='1-data-exploration'></a>

### Reading data
<a id='reading-data1'></a>

In [1]:
import pandas as pd

# Load the Kaggle credit card fraud dataset from the working directory
dataset = pd.read_csv('creditcard.csv')
dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Features types
<a id='features-types1'></a>

In [2]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26  

The dataset contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data are not provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for example, for example-dependent cost-sensitive learning. 

Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

### Missing values
<a id='missing-values2'></a>

In [3]:
dataset.isnull().any().any()

False

The dataset is convenient to work with: there are no missing values.

### Target classes repartition
<a id='target-classes-repartition1'></a>

In [4]:
negative_class = dataset['Class'].value_counts()[0]
positive_class = dataset['Class'].value_counts()[1]
total =  positive_class + negative_class
print('Percentage of negative class (not fraudulent):',(negative_class / total)*100, "%")
print('Percentage of class (fraudulent):',(positive_class / total)*100, "%")

Percentage of negative class (not fraudulent): 99.82725143693798 %
Percentage of class (fraudulent): 0.1727485630620034 %


The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

Given the class imbalance ratio, it is recommended to measure performance using the Area Under the Precision-Recall Curve (AUPRC). Plain accuracy, as read off a confusion matrix, is not meaningful for unbalanced classification: a model that always predicts "not fraudulent" would already be about 99.83% accurate.
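To make the point about accuracy concrete, here is a minimal sketch that rebuilds the label vector from the class counts quoted above (492 frauds out of 284,807 transactions) and scores a hypothetical "model" that never flags a fraud. The variable names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts reported for this dataset
n_total = 284_807
n_fraud = 492

# Rebuild the labels; the position of the frauds does not matter here
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# Hypothetical baseline that always predicts "not fraudulent"
y_pred = np.zeros(n_total, dtype=int)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)

print(f"Accuracy: {baseline_accuracy:.4%}")        # Accuracy: 99.8273%
print(f"Recall on frauds: {baseline_recall:.1%}")  # Recall on frauds: 0.0%
```

The baseline looks excellent on accuracy while catching no fraud at all, which is why a precision-recall based metric is preferred here.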

## 2. Data preprocessing
<a id='2-data-preprocessing2'></a>
### Feature scaling
<a id='feature-scaling1'></a>

Before feeding data to an SVM, it is important to scale it. The main purpose of scaling is to prevent attributes with
larger numeric ranges from dominating those with smaller ranges. Another purpose is to avoid numerical difficulties
during calculation, since large attribute values might cause numerical problems.

Two techniques we can use: 

* Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. 

* Mean normalization involves subtracting the average value of an input variable from each of its values, resulting in a new average value of zero. 

To apply both techniques, we can adjust our input values with this formula:
$Z = \frac{x - \mu}{\sigma}$ where $x$ is the feature value, $\mu$ is the mean of all the values of that feature, and $\sigma$ is its standard deviation (or, alternatively, its range: max - min).
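As a small sanity check of the formula, the sketch below standardizes a toy matrix by hand and compares the result with scikit-learn's `StandardScaler`. The toy values are made up for illustration. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative values, not taken from the dataset):
# two columns with very different ranges
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 900.0]])

mu = X.mean(axis=0)       # per-feature mean
sigma = X.std(axis=0)     # per-feature population standard deviation (ddof=0)
Z_manual = (X - mu) / sigma

Z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(Z_manual, Z_sklearn))  # the two results should agree
print(Z_manual.mean(axis=0), Z_manual.std(axis=0))
```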





In [5]:
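# Pass the aggregation names as strings; recent pandas versions warn when given the builtins min/max
dataset.agg(['min', 'max'])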

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
min,0.0,-56.40751,-72.715728,-48.325589,-5.683171,-113.743307,-26.160506,-43.557242,-73.216718,-13.434066,...,-34.830382,-10.933144,-44.807735,-2.836627,-10.295397,-2.604551,-22.565679,-15.430084,0.0,0
max,172792.0,2.45493,22.057729,9.382558,16.875344,34.801666,73.301626,120.589494,20.007208,15.594995,...,27.202839,10.50309,22.528412,4.584549,7.519589,3.517346,31.612198,33.847808,25691.16,1


Here we can see huge differences between the feature ranges, so we should perform feature scaling.

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Feature column names, excluding the target 'Class' (the last column)
features_columns_names = dataset.columns[:-1]
scaled_data = scaler.fit_transform(dataset[features_columns_names])

#Replacing features old values with their scaled values
dataset.loc[:, features_columns_names] = scaled_data

In [7]:
dataset.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,-1.050379e-14,-8.157366e-16,3.154853e-17,-4.409878e-15,-6.734811e-16,-2.874435e-16,4.168992e-16,-8.767997e-16,-2.423604e-16,3.078727e-16,...,1.6850770000000002e-17,1.478472e-15,-6.797197e-16,1.234659e-16,-7.659279e-16,3.247603e-16,-2.953495e-18,5.401572000000001e-17,3.202236e-16,0.001727
std,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,...,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,0.041527
min,-1.996583,-28.79855,-44.03529,-31.87173,-4.013919,-82.4081,-19.63606,-35.2094,-61.30252,-12.22802,...,-47.41907,-15.06565,-71.75446,-4.683638,-19.75033,-5.401098,-55.9066,-46.74612,-0.3532294,0.0
25%,-0.855212,-0.4698918,-0.3624707,-0.5872142,-0.5993788,-0.5010686,-0.5766822,-0.447886,-0.1746805,-0.5853631,...,-0.3109433,-0.7473476,-0.2591784,-0.5854676,-0.6084001,-0.6780717,-0.1755053,-0.160444,-0.3308401,0.0
50%,-0.2131453,0.009245351,0.03965683,0.1186124,-0.01401724,-0.03936682,-0.2058046,0.03241723,0.01871982,-0.04681169,...,-0.04009429,0.009345377,-0.0179242,0.06765678,0.0318324,-0.1081217,0.003325174,0.03406368,-0.2652715,0.0
75%,0.9372174,0.6716939,0.4867202,0.6774569,0.5250082,0.4433465,0.2991625,0.4611107,0.2740785,0.5435305,...,0.2537392,0.728336,0.2364319,0.7257153,0.6728006,0.4996663,0.2255648,0.2371526,-0.04471707,0.0
max,1.642058,1.253351,13.35775,6.187993,11.91874,25.21413,55.02015,97.47824,16.75153,14.19494,...,37.03471,14.47304,36.07668,7.569684,14.42532,7.293975,78.3194,102.5434,102.3622,1.0


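Here the features have been standardized so that each one has mean 0 and standard deviation 1. Standardization only shifts and rescales a feature; it does not make it normally distributed. The `std` row shows 1.000002 rather than exactly 1 because `StandardScaler` divides by the population standard deviation, whereas `describe()` reports the sample standard deviation.
Standardized features help algorithms that rely on gradient descent or on distance measures, such as SVMs.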

## 3. Modeling
<a id='3-svm'></a>
### Creation of training / validation / test sets
<a id='creation-of-training---validation---test-sets'></a>
We will create 3 datasets: a training set (60% of the whole dataset), a validation set (20%), and a test set (20%).

In [8]:
from sklearn.model_selection import train_test_split 

target = dataset['Class']
features = dataset.drop(columns=['Class'])

#First we create a training (60% of the entire dataset) and test set (40% of the entire dataset)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.40, random_state=42, shuffle=True, stratify=target)

#Afterwards, We split the test set in 2 equal parts : the validation set and the new and smaller test set.
X_test, X_validation, y_test, y_validation = train_test_split(X_test, y_test, test_size=0.50, random_state=42, shuffle=True, stratify=y_test)


print('X_train shape:',X_train.shape, ' & y_train shape:',y_train.shape)
print('X_validation shape:',X_validation.shape,' & y_validation shape:',y_validation.shape)
print('X_test shape:',X_test.shape,' & y_test shape:',y_test.shape)

X_train shape: (170884, 30)  & y_train shape: (170884,)
X_validation shape: (56962, 30)  & y_validation shape: (56962,)
X_test shape: (56961, 30)  & y_test shape: (56961,)
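The following sketch reproduces the same two-step 60/20/20 stratified split on synthetic data and checks that each part keeps roughly the same share of positives. It also shows a common variant in which the scaler is fitted on the training part only, so that no validation or test statistics are used during training. This variant is an assumption for illustration and differs from the notebook, which scales the whole dataset before splitting. All names and the 1% positive ratio are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10,000 rows, 3 features, 1% positives (illustrative ratio)
rng = np.random.default_rng(0)
n_samples = 10_000
X_demo = rng.normal(loc=5.0, scale=3.0, size=(n_samples, 3))
y_demo = np.zeros(n_samples, dtype=int)
y_demo[:100] = 1

# Step 1: 60% training, 40% held out
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.40, random_state=42, stratify=y_demo)
# Step 2: split the held-out 40% into two equal parts (test and validation)
X_te, X_val, y_te, y_val = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest)

for name, y_part in [("train", y_tr), ("validation", y_val), ("test", y_te)]:
    print(name, len(y_part), f"positive share: {y_part.mean():.4f}")

# Variant: fit the scaler on the training part only, then reuse it elsewhere
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_val_scaled = demo_scaler.transform(X_val)
X_te_scaled = demo_scaler.transform(X_te)
```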


### Support Vector Classification parameters
<a id='support-vector-classification-params'></a>

We will train several classifiers using different combinations of parameter values.
The parameters are:
#### The kernel
<a id='the-kernel'></a>
The kernel is a similarity function that lets the SVM learn decision boundaries of varying complexity.
We will use two of the most common kernels: the linear kernel and the Gaussian (RBF) kernel.
The linear kernel is the simplest: it separates the data with a hyperplane (a straight line in two dimensions), which is reasonable when the data is linearly separable.
The Gaussian kernel is more flexible and is intended for data that is not linearly separable.

<img src="lin.png" />
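To illustrate the difference between the two kernels, here is a minimal sketch on a synthetic dataset made of two concentric circles, which no straight line can separate. The dataset and variable names are illustrative and unrelated to the credit card data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: a classic example of non linearly separable data
X_circles, y_circles = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_circles, y_circles, test_size=0.25, random_state=0, stratify=y_circles)

# Fit one SVM per kernel and record its accuracy on the held-out part
kernel_accuracy = {}
for kernel in ['linear', 'rbf']:
    clf = SVC(kernel=kernel, C=1).fit(Xc_train, yc_train)
    kernel_accuracy[kernel] = clf.score(Xc_test, yc_test)

print(kernel_accuracy)
```

On this kind of data the Gaussian kernel should score clearly higher than the linear one. Accuracy is acceptable as a metric here only because the two classes are balanced.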



#### C : strength of regularization
<a id='c---strength-of-regularization'></a>
C is a regularization parameter, which must be strictly positive. In scikit-learn, the strength of the regularization is inversely proportional to C.

For this parameter C, we should take two things into consideration:
* A small value of C means a small penalty on misclassified training examples (strong regularization), which does not push the model to classify every example correctly. This can cause high bias and low variance, i.e. underfitting.
* A large value of C means a larger penalty on misclassified training examples (weak regularization), which pushes the model to classify every example correctly. This can cause lower bias but higher variance, i.e. overfitting.
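One rough way to observe the effect of C is to count support vectors: with a small C the margin is typically wide and many points fall inside it, while a large C usually narrows the margin. The sketch below does this on two overlapping synthetic clusters. The cluster centers, spread, and the values of C are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping 2-D clusters (illustrative centers and spread)
X_blobs, y_blobs = make_blobs(
    n_samples=300, centers=[[-1.5, 0.0], [1.5, 0.0]], cluster_std=1.5, random_state=0)

# Fit a linear SVM for each C and count its support vectors
support_vector_counts = {}
for c in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=c).fit(X_blobs, y_blobs)
    support_vector_counts[c] = int(clf.n_support_.sum())

print(support_vector_counts)
```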


We will try several numerical values of C.

In [9]:
from sklearn.svm import SVC
from sklearn.metrics import auc, precision_recall_curve

kernels = ['linear', 'rbf']
C = [0.1, 0.3, 1, 3, 10]
results = []  # one [C, kernel, validation AUPRC] entry per combination
for kernel in kernels:
    for c in C:
        # Train on the training set, then evaluate on the validation set
        svc = SVC(C=c, kernel=kernel)
        svc.fit(X_train, y_train)
        predictions = svc.predict(X_validation)
        # The precision-recall curve is built from the hard 0/1 predictions
        precision, recall, _ = precision_recall_curve(y_validation.values, predictions)
        precision_recall_auc = auc(recall, precision)
        results.append([c, kernel, precision_recall_auc])

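As explained previously, given the class imbalance ratio, it is recommended to measure performance using the Area Under the Precision-Recall Curve (AUPRC). This is what the `auc(recall, precision)` step does: it computes the area under the precision-recall curve. Note that the curve here is built from the hard class predictions returned by `svc.predict`, so it has at most three points and the resulting area is a coarse estimate; passing continuous scores, such as those from `svc.decision_function`, would give a finer curve.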
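The sketch below contrasts the two ways of computing the area under the precision-recall curve: from hard 0/1 predictions, as in the cell above, and from the continuous scores returned by `decision_function`. It runs on a synthetic imbalanced dataset (about 5% positives, an arbitrary choice), so the numbers it prints say nothing about the credit card data. `average_precision_score` is shown as another scikit-learn summary of the same curve.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced problem (illustrative, about 5% positives)
X_imb, y_imb = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=0)
Xi_train, Xi_val, yi_train, yi_val = train_test_split(
    X_imb, y_imb, test_size=0.25, random_state=0, stratify=y_imb)

clf = SVC(kernel='rbf', C=1).fit(Xi_train, yi_train)

# Curve from hard 0/1 predictions: at most three points
p_lab, r_lab, _ = precision_recall_curve(yi_val, clf.predict(Xi_val))
auprc_from_labels = auc(r_lab, p_lab)

# Curve from continuous scores: one point per distinct score threshold
scores = clf.decision_function(Xi_val)
p_sc, r_sc, _ = precision_recall_curve(yi_val, scores)
auprc_from_scores = auc(r_sc, p_sc)
avg_precision = average_precision_score(yi_val, scores)

print("points on curve (labels / scores):", len(p_lab), "/", len(p_sc))
print("AUPRC from labels:", auprc_from_labels)
print("AUPRC from scores:", auprc_from_scores)
print("Average precision:", avg_precision)
```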

We can now : 
* figure out the best parameters giving the best auprc
* build the best model with the best parameters
* test the generalization of the best model on the test set.

### Test of the best model
<a id='test-of-the-best-model'></a>


In [10]:
def getBestParameters(results):
    # Return the [C, kernel, AUPRC] entry with the highest validation AUPRC
    # (on ties, the first such entry is kept)
    return max(results, key=lambda result: result[2])

best_params = getBestParameters(results)
best_c, best_kernel, highest_auprc = best_params


print('Best parameters:',best_params)

svc = SVC(C = best_c, kernel = best_kernel)
svc = svc.fit(X_train, y_train)
test_predictions = svc.predict(X_test)
precision, recall, _ = precision_recall_curve(y_test.values, test_predictions)
precision_recal_auc = auc(recall, precision)

print("Test set, area under precision-recall value:",precision_recal_auc)
        

Best parameters: [3, 'rbf', 0.8285812728599148]
Test set, area under precision-recall value: 0.8098635014125732
