# Credit Card Fraud Detection
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

## Description of data

- The dataset contains transactions made by credit cards in September 2013 by European cardholders.
- It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions.
- The dataset is highly <b>unbalanced</b>: the positive class (frauds) accounts for 0.172% of all transactions.
- It contains only numerical input variables, which are the result of a PCA transformation.
- Due to confidentiality issues, the original features and further background information about the data are not provided.
- Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'.
- Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset.
- Feature 'Amount' is the transaction amount.
- Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Source: [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud)

## Import required Libraries

In [None]:
import pandas as pd
import numpy as np

# For scaling the features and train-test split
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

# For model building
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC

# For hyperparameter tuning
# from sklearn.model_selection import GridSearchCV

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from utils import predict_and_evaluate

In [None]:
# read data file
# this file is compressed in bzip2 format and index column is included in it
df = pd.read_csv("CC.csv.bz2",compression='bz2', index_col=0)


## Understand the data

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum() # Check Null Values!

In [None]:
df.columns

In [None]:
# Check Distribution Of Label
print('No Frauds', round(df['Class'].value_counts()[0]/len(df) * 100,2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts()[1]/len(df) * 100,2), '% of the dataset')

In [None]:
# The classes are heavily skewed. This is a problem that needs to be solved. How?
print('No Frauds', round(df['Class'].value_counts()[0],2), 'are normal transactions')
print('Frauds', round(df['Class'].value_counts()[1],2), 'are fraud')

In [None]:
colors = ["#0101DF", "#DF0101"]

# pass the column as a keyword argument; recent seaborn versions reject it positionally
sns.countplot(x='Class', data=df, palette=colors)
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)', fontsize=14)

- Notice how imbalanced our original dataset is!
- Most of the transactions are non-fraud.
- If we use this dataframe as the base for our predictive models and analysis, we might get a lot of errors, and our algorithms will probably overfit to the majority class, since they can score well simply by "assuming" that most transactions are not fraud.
- But we don't want our model to assume; we want it to detect patterns that give signs of fraud!
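
To make the cost of the imbalance concrete, here is a minimal sketch that uses only the class counts quoted in the data description (492 frauds out of 284,807 transactions). It scores a hypothetical "classifier" that always predicts non-fraud; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the dataset description
n_total = 284_807
n_fraud = 492
n_non_fraud = n_total - n_fraud

# Hypothetical baseline: predict "no fraud" (class 0) for every transaction
true_negatives = n_non_fraud   # every non-fraud row is labelled correctly
false_negatives = n_fraud      # every fraud row is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.4%}")
print(f"Recall on fraud: {recall:.0%}")
```

The baseline reaches an accuracy above 99.8% while catching no fraud at all, which is why accuracy alone says little on this dataset.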

## Preprocessing - Scaling and Distribution
- We will first scale the <b>Time</b> and <b>Amount</b> columns.
- Time and Amount should be scaled like the other columns.
- We also need to create a sub-sample of the dataframe with an equal number of fraud and non-fraud cases, which helps our algorithms better understand the patterns that determine whether a transaction is a fraud or not.

### What is a sub-Sample?
In this scenario, our sub-sample will be a dataframe with a 50/50 ratio of fraud and non-fraud transactions, meaning it will contain the same number of fraud and non-fraud transactions.

### Why do we create a sub-Sample?
We saw that the original dataframe is heavily imbalanced! Using the original dataframe will cause the following issues:
<ul>
<li><b>Overfitting: </b>Our classification models will assume that in most cases there are no frauds! What we want is for our model to be certain when a fraud occurs. </li>
<li><b>Wrong Correlations:</b> Although we don't know what the "V" features stand for, it is useful to understand how each of these features influences the result (Fraud or No Fraud). With an imbalanced dataframe we are not able to see the true correlations between the class and the features. </li>
</ul>

### Scaling

The **StandardScaler** assumes your data is normally distributed within each feature and will scale them such that the distribution is now centred around 0, with a standard deviation of 1. 

$$\frac{\text{x}-\text{mean}}{\text{standard deviation}}$$

The **MinMaxScaler** is probably the best-known scaling algorithm and applies the following formula to each feature.

$$\frac{\text{x}-\text{min}}{\text{max}-\text{min}}$$

It essentially shrinks the range so that it lies between 0 and 1 (the default; scikit-learn's MinMaxScaler accepts a `feature_range` parameter, e.g. (-1, 1), if a different range is needed). If the distribution is not Gaussian or the standard deviation is very small, the min-max scaler works better. However, it is sensitive to outliers, so if there are outliers in the data, you might want to consider the Robust Scaler below.

**Robust Scaler** scales features using statistics that are robust to outliers. The RobustScaler uses a similar method to the Min-Max scaler, but it subtracts the median and divides by the interquartile range rather than the min-max range, so that it is robust to outliers.

$$\frac{\text{x}-\text{median(x)}}{\text{Q3(x)}-\text{Q1(x)}}$$
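
The effect of an outlier on the three scalers can be sketched by applying the formulas by hand to a small made-up feature. This is an illustration using only the Python standard library, not the scikit-learn implementation; the toy values are an assumption, and the robust variant centers on the median, as scikit-learn's RobustScaler does by default.

```python
from statistics import mean, median, pstdev, quantiles

# Hypothetical toy feature: four ordinary values and one extreme outlier
values = [1.0, 2.0, 3.0, 4.0, 100.0]

# Standard scaling: (x - mean) / standard deviation (population std)
mu, sigma = mean(values), pstdev(values)
standard = [(x - mu) / sigma for x in values]

# Min-max scaling: (x - min) / (max - min)
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Robust scaling: (x - median) / (Q3 - Q1)
q1, _, q3 = quantiles(values, n=4, method="inclusive")
med = median(values)
robust = [(x - med) / (q3 - q1) for x in values]

for name, scaled in [("standard", standard), ("min-max", min_max), ("robust", robust)]:
    print(f"{name:>8}: {[round(v, 3) for v in scaled]}")
```

In this toy example the outlier squeezes the four ordinary values into a narrow band near 0 under min-max scaling, while robust scaling keeps them spread between -1 and 0.5 and leaves only the outlier far from the rest.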

In [None]:
# Since most of our data has already been scaled, we only scale the columns that are not (Amount and Time)
# RobustScaler is less sensitive to outliers.
rob_scaler = RobustScaler()

In [None]:
df.head()


In [None]:
df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1))

In [None]:
df.drop(['Amount'], axis=1, inplace=True) # remove the original Amount column from df

In [None]:
df.head()

In [None]:
# Rearranging the columns
scaled_amount = df['scaled_amount']

df.drop(['scaled_amount'], axis=1, inplace=True)
df.insert(0, 'scaled_amount', scaled_amount)

In [None]:
# Amount is Scaled!
df.head()

**EXERCISE:** Scale the Time Column

### Splitting the DataFrame

Before proceeding with any <b>sampling technique</b> we have to split the original dataframe.<br> 
<b>Why? For testing purposes: we want to test our models on the original testing set, not on a testing set created by one of these techniques.</b><br> The main goal is to fit the model on the dataframes that were undersampled or oversampled (so that our models can detect the patterns), and test it on the original testing set.
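
One way to see why the test set must keep the original class distribution is a small worked calculation. The sketch below assumes a hypothetical classifier with a fixed recall of 0.90 and a false positive rate of 0.02 (both made-up numbers), and compares its precision on a 50/50 test set against a test set with the original imbalance. The test-set sizes are derived from figures quoted in this notebook (492 frauds in 284,807 rows; 394 frauds and 227,451 non-frauds in the training split).

```python
def precision(n_fraud, n_non_fraud, recall, false_positive_rate):
    """Precision implied by a fixed recall and false positive rate on a given class mix."""
    true_positives = recall * n_fraud
    false_positives = false_positive_rate * n_non_fraud
    return true_positives / (true_positives + false_positives)

# Hypothetical classifier behaviour (assumed, not measured)
RECALL = 0.90
FALSE_POSITIVE_RATE = 0.02

# Test-set sizes derived from the counts quoted in this notebook
test_fraud = 492 - 394                        # 98 fraud rows
test_non_fraud = (284_807 - 492) - 227_451    # 56,864 non-fraud rows

balanced = precision(test_fraud, test_fraud, RECALL, FALSE_POSITIVE_RATE)
original = precision(test_fraud, test_non_fraud, RECALL, FALSE_POSITIVE_RATE)

print(f"Precision on a 50/50 test set:      {balanced:.3f}")
print(f"Precision on the original test mix: {original:.3f}")
```

Under these assumptions the same classifier looks far more precise on the balanced set than on realistic data, which is why evaluation stays on the untouched test split.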

In [None]:
ss = StratifiedShuffleSplit(n_splits=1,
                            test_size=0.2,
                            train_size=0.8,
                            random_state=42)

In [None]:
X = df.drop('Class', axis=1)
y = df['Class']

In [None]:
for train_index, test_index in ss.split(X, y):
    train_df = df.iloc[train_index]
    test_df = df.iloc[test_index]

In [None]:
print('Distributions: \n')
print("Train Set")
print(train_df.Class.value_counts())
print("\nTest Set")
print(test_df.Class.value_counts())
print("\nPercentage:")
print("\nTrain Set")
print((train_df.Class.value_counts()/ len(train_df))*100)
print("\nTest Set")
print((test_df.Class.value_counts()/ len(test_df))*100)

### Random Under-Sampling:

Implement *"Random Under Sampling"*, which basically consists of removing data in order to have a more <b>balanced dataset</b> and thus keep our models from overfitting.

**Steps:**
<ul>
<li>The first thing we have to do is determine how <b>imbalanced</b> our classes are (use "value_counts()" on the class column to determine the count for each label).</li>
<li>Once we determine how many instances are <b>fraud transactions</b> (Fraud = "1"), we should bring the <b>non-fraud transactions</b> down to the same number as the fraud transactions (assuming we want a 50/50 ratio). For the full dataset this would be 492 cases of fraud and 492 cases of non-fraud; in the training split used here it is 394 of each.</li>
<li>After implementing this technique, we have a sub-sample of our dataframe with a 50/50 ratio with regard to our classes. The next step is to <b>shuffle the data</b> to see if our models can maintain a certain accuracy every time we run this script.</li>
</ul>

**Note:** The main issue with "Random Under-Sampling" is that we run the risk that our classification models will not perform as accurately as we would like, since there is a great deal of <b>information loss</b> (randomly picking 394 non-fraud transactions from 227,451 non-fraud transactions).

In [None]:
# Lets shuffle the data before creating the subsamples
train_df = train_df.sample(frac=1)

In [None]:
# the training set contains 394 fraud rows; keep the same number of non-fraud rows
fraud_df = train_df.loc[train_df['Class'] == 1]
non_fraud_df = train_df.loc[train_df['Class'] == 0][:len(fraud_df)]

In [None]:
normal_distributed_df = pd.concat([fraud_df, non_fraud_df])

In [None]:
# As fraud_df and non_fraud_df are concatenated, Shuffle dataframe rows to mix the rows
df2 = normal_distributed_df.sample(frac=1, random_state=42)

In [None]:
df2.shape

In [None]:
df2.head()

###  Equally Distributing 
<a id="correlating"></a>
Now that we have our dataframe correctly balanced, we can go further with our <b>analysis</b> and <b>data preprocessing</b>.

In [None]:
print('Distribution of the Classes in the subsample dataset')
print(df2['Class'].value_counts()/len(df2))

In [None]:
colors = ["#0101DF", "#DF0101"]
sns.countplot(x='Class', data=df2, palette=colors)
plt.title('Equally Distributed Classes', fontsize=14)
plt.show()

## Training the ML Model for Fraud Detection (Classification)

In [None]:
# Create X_train, X_test, y_train, y_test for ease of use
X_train = df2.drop('Class', axis=1)
y_train = df2['Class']

X_test = test_df.drop('Class', axis=1)
y_test = test_df['Class']

### Random Forest

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, criterion="entropy")

In [None]:
rf_clf.fit(X_train, y_train)

In [None]:
# Decision Tree in the Forest
rf_clf.estimators_[0]

#### Evaluation Metrics
Given the class imbalance ratio, the confusion matrix and accuracy are not meaningful
on their own for unbalanced classification. A more robust evaluation is required to
measure the performance of a fraud detection model.

**1. False Positives:**
A false positive is an outcome where the model incorrectly predicts the positive class.
<br>**2. False Negatives:**
A false negative is an outcome where the model incorrectly predicts the negative class.
<br>**3. Precision:**
Precision describes how precise/accurate the model's positive predictions are, i.e. out of all predicted positives, how many are actually positive. Precision is a good measure when the cost of false positives is high. Here, a false positive means that a non-fraudulent transaction has been identified as fraudulent, which happens more often when the precision of the fraud detection model is low.
<br>**4. Recall:**
Recall calculates how many of the Actual Positives our model captures through labeling it as Positive (True Positive). If a fraudulent transaction is predicted as non-fraudulent (Predicted Negative), the consequence can be very bad for the bank.
<br>**5. F1 Score:**
F1 Score is used to seek a balance between Precision and Recall.
<br>**6. Matthews Correlation Coefficient:**
The coefficient takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1.<br> 
A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation.<br>
The Matthews correlation coefficient is more informative than F1 score and
accuracy in evaluating binary classification problems, because it takes into
account the balance ratios of the four confusion matrix categories (true
positives, true negatives, false positives, and false negatives).
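
The definitions above can be sketched directly from confusion-matrix counts. The function below is a minimal illustration, not the `predict_and_evaluate` helper imported from `utils` (whose implementation is not shown here); the function name and the example counts are hypothetical, and it assumes no denominator is zero.

```python
from math import sqrt

def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, MCC and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc, "accuracy": accuracy}

# Hypothetical counts for illustration only (not results from this notebook)
metrics = classification_metrics(tp=80, fp=120, fn=20, tn=9_780)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")
```

With these made-up counts the accuracy is 0.986 even though precision is only 0.4, which is the gap these metrics are meant to expose.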

In [None]:
rf_res = predict_and_evaluate(rf_clf, X_test, y_test)

#### Feature Importances

In order to quantify the usefulness of all the variables in the entire random forest, we can look at the relative importances of the variables.

In [None]:
rf_clf.feature_importances_

In [None]:
feature_importances = pd.Series(rf_clf.feature_importances_, index=X_train.columns)
feature_importances.sort_values(ascending=False, inplace=True)

In [None]:
feature_importances

In [None]:
fig = plt.figure(figsize=(8,4), dpi=100)
feature_importances.plot.bar()
plt.title("Feature importances using MDI")
plt.xlabel("Features")
plt.ylabel("Mean Decrease in Impurity")
plt.show()

**EXERCISE:** Train RF model again by considering only important features (e.g. top 10) and evaluate the model and observe the difference in the metrics.

### Gradient Boosting

In [None]:
gbm_clf = GradientBoostingClassifier()

In [None]:
gbm_clf.fit(X_train,y_train)

In [None]:
gbm_res = predict_and_evaluate(gbm_clf, X_test, y_test)

### XGBoost

In [None]:
xgb_clf = XGBClassifier()

In [None]:
xgb_clf.fit(X_train,y_train)

In [None]:
xgb_res = predict_and_evaluate(xgb_clf, X_test, y_test)

### SVM

In [None]:
svm_clf = SVC()

In [None]:
svm_clf.fit(X_train, y_train)

In [None]:
svm_res = predict_and_evaluate(svm_clf, X_test, y_test)

## Comparing the metrics for all the algorithms

In [None]:
results = pd.DataFrame(data=[rf_res, gbm_res, xgb_res, svm_res], 
             columns=('Algorithm','False Positives', 
                      'False Negatives', 'Precision', 
                      'Recall', 'F1 Score', 'MCC'))

In [None]:
results