# Credit card fraud detection

The main goal is to predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include the confidential variables `V1` through `V28` as well as `Amount`, the amount of the transaction.
 
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to a legitimate transaction.

### 1st step:
Import data and calculate the percentage of the observations in the dataset that are instances of fraud.

In [2]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC


df = pd.read_csv('fraud_data.csv')
# `Class` is 0/1, so its mean is the fraction of fraudulent transactions
print("%f %% of observations are fraud" % (df["Class"].mean() * 100))


1.641082 % of observations are fraud


### 2nd step:
Split the data into `X_train`, `X_test`, `y_train`, and `y_test`.

In [3]:
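# Features are every column except the last; the target (`Class`) is the last column
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# Default split: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)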
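
With so few positive examples, a purely random split can leave the two sets with slightly different fraud rates. A minimal sketch of a stratified split, assuming a small synthetic dataset in place of `fraud_data.csv` (the names `X_demo` and `y_demo` are illustrative and not from the notebook). The notebook itself does not stratify, so adopting this would change its reported numbers:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 2% positives (assumption)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print("train fraud rate: %.4f" % y_tr.mean())
print("test fraud rate:  %.4f" % y_te.mean())
```

Both rates should match the overall 2% because each class is split in the same 75/25 proportion.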

### 3rd step:
Train a dummy classifier that always predicts the majority class of the training data, and calculate its accuracy and recall.

In [4]:
# Baseline that always predicts the most frequent class (not fraud)
model = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
acc1 = model.score(X_test, y_test)
rec1 = recall_score(y_test, model.predict(X_test))
print("The accuracy of a dummy classifier is %f%% and its recall is %f%%" % (acc1*100, rec1*100))

The accuracy of a dummy classifier is 98.525074% and its recall is 0.000000%
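
To see why a high accuracy can coexist with zero recall, the metrics can be written out from confusion-matrix counts. The sketch below uses a hypothetical helper `metrics_from_counts` and illustrative counts (1,000 transactions, 15 of them fraudulent), not the notebook's actual test set:

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return (accuracy, recall, precision) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Recall: share of actual frauds that were caught
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of fraud alerts that were correct (0 when no alert is raised)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision


# A majority-class predictor never raises an alert: every fraud is a false negative
acc, rec, pre = metrics_from_counts(tp=0, fp=0, fn=15, tn=985)
print("accuracy=%.3f recall=%.3f precision=%.3f" % (acc, rec, pre))
```

The true negatives dominate the accuracy, so it stays high even though no fraud is detected.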


### 4th step: 
Because the percentage of fraud is very small, accuracy is not a good metric for this project: the dummy classifier reaches 98.5% accuracy without detecting a single fraud.
Recall and precision are better metrics for evaluating our model.
Since the recall of the dummy classifier is zero, we train an SVC classifier with the default parameters and calculate its accuracy, recall, and precision.

In [5]:
model = SVC().fit(X_train, y_train)
svm_predicted = model.predict(X_test)  # predict once and reuse
acc2 = model.score(X_test, y_test)
rec2 = recall_score(y_test, svm_predicted)
pre2 = precision_score(y_test, svm_predicted)

print("The accuracy of the SVC classifier is %f%%, its recall is %f%% and its precision is %f%%" % (acc2*100, rec2*100, pre2*100))

The accuracy of the SVC classifier is 99.004425%, its recall is 35.000000% and its precision is 93.333333%
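
Recall and precision are easier to interpret next to the raw counts. The notebook imports `confusion_matrix` without using it; a minimal sketch of how it could be applied, shown on small hand-made label lists (`y_true_demo` and `y_pred_demo` are illustrative) rather than on the real predictions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print("tn=%d fp=%d fn=%d tp=%d" % (tn, fp, fn, tp))

# The scores follow directly from the counts: tp/(tp+fn) and tp/(tp+fp)
print("recall    = %.3f" % recall_score(y_true_demo, y_pred_demo))
print("precision = %.3f" % precision_score(y_true_demo, y_pred_demo))
```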


### 5th step:
The SVC classifier has a much better recall than the dummy classifier, but it still misses 65% of the frauds.
Although its precision is good, we want to miss as few frauds as possible, so we improve recall by tuning the parameters of the SVC classifier and lowering its decision threshold.

In [6]:
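model = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)
# Flag a transaction as fraud when its decision score exceeds -220 (the default cut-off is 0)
svm_predicted_mc = model.decision_function(X_test) > -220
# Accuracy must compare the thresholded predictions with y_test;
# model.score(X_test, svm_predicted_mc) would only measure agreement with the default predictions
acc3 = (svm_predicted_mc == y_test).mean()
rec3 = recall_score(y_test, svm_predicted_mc)
pre3 = precision_score(y_test, svm_predicted_mc)
print("The accuracy of the optimized SVC classifier is %f%%, its recall is %f%% and its precision is %f%%" % (acc3*100, rec3*100, pre3*100))

The accuracy of the optimized SVC classifier is 99.299410%, its recall is 82.500000% and its precision is 73.333333%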
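
The threshold of -220 is applied to the output of `decision_function`. One way such a threshold could be chosen is to scan the precision-recall trade-off and keep the highest threshold that still reaches a target recall. The sketch below assumes hand-made scores and a hypothetical helper `threshold_for_recall`; it is not how the notebook's value was obtained, and in practice the scan would be run on a validation split rather than on the test set:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, scores, target_recall):
    """Highest threshold t such that predicting `scores >= t` reaches target_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall carry one extra trailing entry with no matching threshold
    reachable = recall[:-1] >= target_recall
    idx = np.flatnonzero(reachable)[-1]  # thresholds are sorted in increasing order
    return thresholds[idx], precision[idx], recall[idx]


y_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores_demo = np.array([-3.0, -2.0, -1.5, -1.0, -0.5, 0.2, 0.8, 1.5])

t, p, r = threshold_for_recall(y_demo, scores_demo, target_recall=1.0)
print("threshold=%.1f precision=%.2f recall=%.2f" % (t, p, r))
```

Note that `precision_recall_curve` treats a score equal to the threshold as positive (`>=`), whereas the cell above uses a strict `>`.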


### 6th step:
We train a logistic regression model and tune it with `GridSearchCV`, using recall as the scoring metric.
We then evaluate it again with precision and recall.

In [7]:
# Standardize the features using statistics from the training set only
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
lr = LogisticRegression()
# Search over the regularization strength C and keep the value with the best cross-validated recall
grid_values = {'penalty': ['l2'], 'C': [0.01, 0.1, 1, 10, 100]}
grid_clf_rec = GridSearchCV(lr, param_grid=grid_values, scoring='recall')
grid_clf_rec.fit(X_scaled, y_train)
lr_predict_mc = grid_clf_rec.predict(scaler.transform(X_test))
rec4 = recall_score(y_test, lr_predict_mc)
pre4 = precision_score(y_test, lr_predict_mc)
print("The precision of optimized logistic regression is %f%%, its recall is %f%%" % (pre4*100, rec4*100))

The precision of optimized logistic regression is 96.923077%, its recall is 78.750000%
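
In the cell above the scaler is fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling. A common alternative is to put the scaler and the model in a `Pipeline` so that scaling is refitted inside every fold. A minimal sketch, assuming synthetic data from `make_classification` and illustrative step names (`scaler`, `lr`); it is not the notebook's code and would not necessarily reproduce its scores:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data standing in for fraud_data.csv (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
# Step parameters are addressed as <step name>__<parameter>
param_grid = {"lr__C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid=param_grid, scoring="recall", cv=3)
search.fit(X_tr, y_tr)

print("best C:", search.best_params_["lr__C"])
# With scoring="recall", score() reports recall on the held-out data
test_recall = search.score(X_te, y_te)
print("test recall: %.3f" % test_recall)
```

Because the pipeline is a single estimator, `search.predict` applies the fitted scaler automatically, so no separate `scaler.transform` call is needed at prediction time.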


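###### In conclusion, the SVC model with C=1e9 and gamma=1e-07 reaches the highest recall (82.5%) and is the best model for our goal of missing as few frauds as possible, although the logistic regression offers a much higher precision (96.9%) at a slightly lower recall (78.75%).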