## This is the EDA notebook for the credit card fraud detection project
### Table of contents:
1. Project overview
2. First look at the data
3. Plots labels vs features
4. Explore basic logistic model
5. Sum up with preprocessing steps

### 1. Project overview
This project aims to identify whether a credit card transaction is fraudulent or not. <br>
The data comes from a Kaggle dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud <br>


### 2. First look at the data
Some information from the task description: <br>
The data is highly unbalanced: only 492 out of 284,807 samples are fraudulent. <br>
Features V1-V28 are anonymous and uninterpretable principal components. <br>
Time and Amount have not been transformed. <br>
Time contains the number of seconds elapsed between each transaction and the first transaction in the dataset.<br>
Amount refers to the transaction amount.
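
To make the imbalance concrete, here is a small sketch of the arithmetic using only the two counts quoted above. The "all-genuine" classifier is a hypothetical baseline introduced for illustration: it shows how high accuracy can be without detecting a single fraud (the fraud share works out to roughly 0.17%).

```python
# Counts quoted in the dataset description above
n_fraud = 492
n_total = 284_807

# Share of fraudulent transactions in the full dataset
fraud_share = n_fraud / n_total
print(f"Fraud share: {fraud_share:.4%}")

# Accuracy of a hypothetical classifier that labels every transaction as genuine
baseline_accuracy = 1 - fraud_share
print(f"All-genuine baseline accuracy: {baseline_accuracy:.4%}")
```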


In [None]:
import numpy as np
import pandas as pd

In [None]:
df_train = pd.read_csv('creditcard_train.csv')

In [None]:
df_train.head()

In [None]:
df_train.info()

In [None]:
df_train.describe()

There are 199,364 data points, with no missing values.<br>
As we can see, Time needs to be converted to a timedelta object, and Class into a category. <br>
We will transform Time into a timedelta in seconds for now, and explore hours and minutes later. <br>
This dataset spans about 2 days (172792/86400 ≈ 2).
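
As a quick check of the time span, and as a possible vectorised alternative to converting row by row, the sketch below uses `pd.to_timedelta`. The `toy_times` values are made up for illustration; only 172792 comes from the text above.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

max_time = 172792  # largest Time value quoted above, in seconds
span_days = max_time / SECONDS_PER_DAY
print(span_days)

# Vectorised conversion of a few sample offsets (toy values, not real rows)
toy_times = pd.Series([0, 3600, 86400, max_time])
toy_deltas = pd.to_timedelta(toy_times, unit="s")
print(toy_deltas)
```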

In [None]:
# Convert Time into datetime delta; Class into category and sort by Time
import datetime
df_train['timedelta'] = df_train['Time'].apply(lambda x: datetime.timedelta(seconds=x))
df_train['fraud'] = df_train['Class'].astype('category')
df_train.sort_values('Time', inplace=True)

In [None]:
df_train.reset_index(inplace=True, drop=True)

In [None]:
df_train.head()

### 3. Plots

#### Fraudulent frequency

In [None]:
fraudulent_pct = df_train['fraud'].value_counts()[1]/len(df_train)*100
print('{}% of the transactions are fraudulent'.format(fraudulent_pct.round(2)))

#### Time vs Fraudulent Transactions

In [None]:
df_train['min'] = (df_train['Time']//60).astype('int')

In [None]:
df_train['hour'] = (df_train['Time']//3600).astype('int')

In [None]:
# import plotly.express as px
# df_train['fraud_by_hr'] = df_train.groupby('hour').mean()['Class']
# px.bar(x=df_train.groupby('hour').mean().index, y=df_train.groupby('hour').mean()['Class'], 
#         title='Average of fraudulent transactions by hour', 
#         labels={'x': 'Hour', 'y':  'Average fraudulant transactions'})

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
plt.figure(figsize=(15,10))
fig = sns.barplot(x='hour', y='Class', data=df_train)
fig.set(xlabel='Hour', ylabel='Average fraud count', title='Average fraud count by hour')

Roughly 2 to 7 hours after hour 0 of each day, there are spikes in fraudulent activity.
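
The bar heights in the plot above are the mean of Class within each hour, i.e. the fraud rate for that hour. A minimal sketch of the same aggregation on toy data (the rows below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy transactions: Time in seconds since the first transaction, Class 1 = fraud
toy = pd.DataFrame({
    "Time": [10, 3700, 3800, 7300, 7400, 7500],
    "Class": [0, 1, 0, 0, 0, 1],
})

# Same hour bucketing as in the notebook
toy["hour"] = (toy["Time"] // 3600).astype(int)

# Mean of a 0/1 label per group is the fraud rate of that group
fraud_rate_by_hour = toy.groupby("hour")["Class"].mean()
print(fraud_rate_by_hour)
```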

In [None]:
# px.line(x=df_train.groupby('min').mean().index, y=df_train.groupby('min').mean()['Class'], 
#         title='Average of fraudulent transactions by minute', 
#         labels={'x': 'Minute', 'y':  'Average fraudulant transactions'})

plt.figure(figsize=(15,10))
fig = sns.barplot(x='min', y='Class', data=df_train)
fig.set(xlabel='Minute', ylabel='Average fraud count', title='Average fraud count by minute')

There is a similar pattern by minute, and some day-to-day differences too. However, we will not be able to model day-to-day differences because the data covers only about two days. <br>
It seems that the increased fraud activity is concentrated in the early hours, and is not just due to outliers.<br>
Because there is a periodic pattern, we are going to transform Time, min and hour, and squeeze the data into one day.
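
One way to squeeze two days into one is the modulo operator, which also maps the exact day boundary (second 86400) back to 0. A small sketch on toy values (the `toy_time` series is hypothetical):

```python
import pandas as pd

SECONDS_PER_DAY = 86_400

# Toy offsets in seconds, including the day boundary and the last second of the data
toy_time = pd.Series([0, 3_599, 86_399, 86_400, 90_000, 172_792])

# Fold every offset onto a single day, then derive minute and hour of day
sec_of_day = toy_time % SECONDS_PER_DAY
minute_of_day = sec_of_day // 60
hour_of_day = sec_of_day // 3_600
print(hour_of_day.tolist())
```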

In [None]:
# Fold both days onto a single 24-hour cycle; modulo also wraps the exact day boundary to 0
df_train['sec'] = df_train['Time'] % 86400
df_train['min'] = df_train['min'] % 1440
df_train['hour'] = df_train['hour'] % 24

In [None]:
# px.bar(x=df_train.groupby('hour').mean().index, y=df_train.groupby('hour').mean()['Class'], 
#         title='Average of fraudulent transactions by hour', 
#         labels={'x': 'Hour', 'y':  'Average fraudulant transactions'})

# Plot average fraud count by hour again:
plt.figure(figsize=(15,10))
fig = sns.barplot(x='hour', y='Class', data=df_train)
fig.set(xlabel='Hour', ylabel='Average fraud count', title='Average fraud count by hour')

There is definitely a peak in fraudulent transactions. Should we consider isolating those hours? Or are there two types of fraud, one constant across all hours and another that happens in the early hours?

#### Amount vs Fraudulent Transactions

In [None]:
# Select Amount before aggregating so the non-numeric columns are never averaged
df_train.groupby('fraud')['Amount'].mean()

On average, the amount of a fraudulent transaction is 44% higher than that of a non-fraudulent transaction.
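
The percentage is a relative difference of the two group means. A minimal sketch of that arithmetic; the two means passed in are hypothetical round numbers chosen only to reproduce a 44% gap, not the actual values from the cell above:

```python
def pct_difference(mean_a, mean_b):
    """Return how much larger mean_a is than mean_b, in percent."""
    return (mean_a - mean_b) / mean_b * 100


# Hypothetical means chosen only to illustrate the arithmetic
example = pct_difference(144.0, 100.0)
print(example)
```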

In [None]:
# import plotly.graph_objects as go
# hours = list(set(df_train['hour'].astype('str')))
# fraudulent_amount_by_hour = df_train[df_train['fraud']==1].groupby('hour').mean()['Amount']
# genuine_amount_by_hour = df_train[df_train['fraud']==0].groupby('hour').mean()['Amount']
# fig = go.Figure(data=[
#     go.Bar(name='Fraudulent', x=hours, y=fraudulent_amount_by_hour),
#     go.Bar(name='Genuine', x=hours, y=genuine_amount_by_hour)
# ])
# fig.update_layout(title='Average raudulent and genuine ransaction amount by hour', barmode='group',
#                  yaxis=dict(title='Average transaction amount'),
#                  xaxis=dict(title='Hours'),
#                  hovermode='x unified')
# fig.show()

plt.figure(figsize=(15,10))
fig = sns.barplot(x='hour', y='Amount', hue='fraud', data=df_train)
fig.set(xlabel='Hour', ylabel='Average transaction amount', title='Average transaction amount by hour: fraud vs genuine')

The average genuine transaction amount is very stable, whereas fraudulent transaction amounts vary hugely, not only among themselves (huge confidence intervals) but also between different hours. <br>
Because the fraudulent amounts vary so much, we need to be very cautious about excluding any outliers from the data unless it is justified.

In [None]:
# from plotly.subplots import make_subplots
# fig = make_subplots(rows=2, cols=1, subplot_titles=['Distribution of amount: fraudulent', 'Distribution of amount: genuine'])
# fig.add_trace(go.Histogram(
#     x=df_train[df_train['fraud']==1]['Amount'], nbinsx=500
# ), row=1, col=1)
# fig.add_trace(go.Histogram(
#     x=df_train[df_train['fraud']==0]['Amount'], nbinsx=500
# ), row=2, col=1)
# fig.update_layout(showlegend=False)
# fig.show()

plt.figure(figsize=(15,10))
fig = sns.boxplot(data=df_train, x='fraud', y='Amount')
fig.set(title='Amount distribution: fraud vs genuine')

It looks like genuine transaction amounts vary more than fraudulent ones. Let's zoom in on amounts below 500.

In [None]:
plt.figure(figsize=(15,10))
fig = sns.boxplot(data=df_train, x='fraud', y='Amount')
fig.set(title='Amount distribution: fraud vs genuine', ylim=(0,500))

Genuine transactions have a higher median amount, but a lower upper fence. Fraudulent transaction amounts cluster more just above 0. Let's look at the histograms again.
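
For reference, the median and upper fence read off a boxplot can also be computed directly. The sketch below assumes the usual Tukey definition of the upper fence, Q3 + 1.5 × IQR; `box_stats` is a hypothetical helper and the input values are toy numbers.

```python
import numpy as np


def box_stats(values):
    """Return the median and the Tukey upper fence (Q3 + 1.5 * IQR)."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    return {"median": float(median), "upper_fence": float(q3 + 1.5 * iqr)}


# Toy amounts, not taken from the dataset
stats = box_stats([1, 2, 3, 4, 5])
print(stats)
```

Applied to the notebook's data, this could be called once per class, e.g. on the Amount values of the fraudulent rows and on those of the genuine rows.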

In [None]:
# displot is figure-level and creates its own figure, so no plt.figure() call is needed
fig = sns.displot(df_train[df_train['fraud']==1], x='Amount', height=8, aspect=2)
fig.set(title='Amount distribution: fraud')

In [None]:
# displot is figure-level and creates its own figure, so no plt.figure() call is needed
fig = sns.displot(df_train[df_train['fraud']==0], x='Amount', height=5, aspect=3)
fig.set(title='Amount distribution: genuine', ylim=(0,10000), xlim=(0,2500))

In [None]:
# #Zoom in to amount range from 0 to 500
# from plotly.subplots import make_subplots
# fig = make_subplots(rows=2, cols=1, subplot_titles=['Distribution of amount: fraudulent', 'Distribution of amount: genuine'])
# fig.add_trace(go.Histogram(
#     x=df_train[df_train['fraud']==1]['Amount'], nbinsx=500
# ), row=1, col=1)
# fig.add_trace(go.Histogram(
#     x=df_train[df_train['fraud']==0]['Amount'], nbinsx=5000
# ), row=2, col=1)
# fig.update_layout(showlegend=False)
# fig.update_xaxes(title_text='Transaction amount', range=[0,500], row=1, col=1)
# fig.update_xaxes(title_text='Transaction amount', range=[0,500], row=2, col=1)
# fig.show()

There seem to be two clusters of fraudulent transaction amounts, one around 0 and another around 100. Fraudulent transaction amounts rarely go beyond 500. <br>
Does this mean two different patterns of fraudulent activity, or is it noise due to day-to-day variance? <br>
There is a difference in mean amount between fraudulent and genuine transactions. Let's formally test it.

In [None]:
from scipy.stats import ttest_ind
fraud_amount = df_train[df_train['fraud']==1]['Amount']
genuine_amount = df_train[df_train['fraud']==0]['Amount']
ttest_ind(fraud_amount, genuine_amount)

There is a significant difference between the mean amounts of fraudulent and genuine transactions. But we need to be careful: although the average amount is higher for fraudulent transactions, flagging large amounts as fraudulent has a cost, because there is clearly a huge number of genuine high-value transactions. Maybe Amount combined with other features will be useful?
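
A note on the test itself: `ttest_ind` pools the two variances by default, which is a strong assumption when the groups differ this much in size and spread. A hedged alternative is Welch's t-test (`equal_var=False`). The sketch below runs both on synthetic data; the exponential samples are an assumption made purely for illustration and do not reproduce the notebook's result.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic, skewed "amounts" with very different group sizes (illustration only)
rng = np.random.default_rng(0)
toy_fraud = rng.exponential(scale=120.0, size=50)
toy_genuine = rng.exponential(scale=90.0, size=5_000)

# Student's t-test: pooled variance (SciPy default)
student = ttest_ind(toy_fraud, toy_genuine)
# Welch's t-test: does not assume equal variances
welch = ttest_ind(toy_fraud, toy_genuine, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```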

#### V1-28 vs fraudulence

A quick look at V1-V28. Below we plot V1-V28 in seaborn pairplots. On the diagonal is the distribution of each V feature, and the rest are scatter plots of one V feature against another. <br>
Blue represents genuine transaction data points. Orange represents fraudulent transaction data points.<br>
(The pairplot has been broken down into 7 parts so it takes less time to run.)


In [None]:
unkwn_features = ['V{}'.format(n+1) for n in range(0,28)]

In [None]:


# rows = (len(unkwn_features)//2)
# cols = 2
# subplot_titles = tuple(var+' distribution' for var in unkwn_features)
# fig = make_subplots(rows=rows, cols=cols, subplot_titles=subplot_titles)

# for i, feature in enumerate(unkwn_features):
#     row = (i//cols)+1
#     col = (i%cols)+1
    
#     fig.add_trace(go.Histogram(
#         x=df_train[feature]
#     ), row=row, col=col)

# fig.update_layout(height=4000, showlegend=False)
# fig.show()
from IPython.display import Image
nplots = 7
for i in range(nplots):
    df = df_train[unkwn_features[i*4: (i+1)*4]+['fraud']]
    fig = sns.pairplot(df, hue='fraud')
    fig.savefig("pairplot{}.png".format(i+1))
    plt.clf()  # Clear the pairplot figure created by seaborn

In [None]:
Image(filename='pairplot1.png') # Show pairplot as image

In [None]:
Image(filename='pairplot2.png') # Show pairplot as image

In [None]:
Image(filename='pairplot3.png') # Show pairplot as image

In [None]:
Image(filename='pairplot4.png') # Show pairplot as image

In [None]:
Image(filename='pairplot5.png') # Show pairplot as image

In [None]:
Image(filename='pairplot6.png') # Show pairplot as image

In [None]:
Image(filename='pairplot7.png') # Show pairplot as image

V1-V28 are all continuous standardised features with a mean of 0. <br>
The features for fraudulent transactions seem to have flatter distributions, often near or beyond one tail of the (roughly normal) distributions of the genuine transaction features.<br>
There is multicollinearity present amongst the V features. Fraudulent transactions sometimes sit at one extreme tail of the correlation (e.g. V10:V9), and sometimes sit outside of the correlation or form a different correlation (e.g. V5:V7).<br>
V1-V20 seem to be the more promising features. It is harder to tell for V21-V28, but we will look at these in detail later.

#### Heatmap outlining correlations

In [None]:
from sklearn.preprocessing import MinMaxScaler
df_train['hour'] = df_train['hour'].astype('category')
df_train = pd.get_dummies(df_train, drop_first=True)
scaler = MinMaxScaler()
df_train['Amount'] = np.squeeze(scaler.fit_transform(np.array(df_train['Amount']).reshape(-1,1)))
df_train.columns

In [None]:
df_train.rename(columns={'fraud_1':'fraud'}, inplace=True)
df_train.columns

In [None]:
features_lst = ['Class', 'Amount'] + unkwn_features + ['hour_{}'.format(i) for i in range(1,24)]
df_hm = pd.DataFrame(df_train[features_lst].corr()['Class'].sort_values(ascending=False))
plt.figure(figsize=(10,20))
sns.heatmap(df_hm, annot=True)

The most positively related to fraud transactions: V11, V4, V2, V21, hour_2, V19, hour_4, V20, V27, V8 <br>
The most negatively related to fraud transactions: V17, V14, V12, V10, V16, V3, V7, V18, V1, V9, V5, V6 <br>
The negative correlations seem to be stronger than the positive ones.

**SMOTE** Because we have very unbalanced data, feature importance drawn from the above heatmap might not be very accurate. Here we try to resolve this problem using the Synthetic Minority Oversampling Technique (SMOTE) and random undersampling. <br>
Ref: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/ <br>
We will: 1. Oversample fraud transactions to 10% of the number of genuine transactions. 2. Then undersample the genuine transactions so that their number matches the number of fraud transactions.
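
The two steps can be followed with simple arithmetic. The sketch below assumes imbalanced-learn's documented meaning of a float `sampling_strategy` (the desired minority/majority ratio after resampling). `resampled_counts` is a hypothetical helper, and the class counts are made up for illustration.

```python
def resampled_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Expected class sizes after oversampling, then undersampling."""
    # Oversampling: grow the minority class until minority/majority == over_ratio
    n_minority_after = max(n_minority, int(over_ratio * n_majority))
    # Undersampling: shrink the majority class until minority/majority == under_ratio
    n_majority_after = min(n_majority, int(n_minority_after / under_ratio))
    return n_majority_after, n_minority_after


# Hypothetical class counts, not the real ones from df_train
counts = resampled_counts(n_majority=199_000, n_minority=364,
                          over_ratio=0.1, under_ratio=1.0)
print(counts)
```

With these settings both classes end up the same size, which is what `y.value_counts()` is expected to confirm after resampling.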

In [None]:
X = df_train[features_lst].drop('Class', axis=1)
y = df_train['Class']

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

over = SMOTE(random_state=42, sampling_strategy=0.1)
under = RandomUnderSampler(random_state=42, sampling_strategy=1)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
X, y = pipeline.fit_resample(X, y)

In [None]:
y.value_counts()

In [None]:
# Put y back into the data frame for heatmap
X['fraud'] = y
df_hm = pd.DataFrame(X.corr()['fraud'].sort_values(ascending=False))
plt.figure(figsize=(10,20))
sns.heatmap(df_hm, annot=True)

The important features didn't seem to change, but the correlation magnitudes have increased and their order has changed. <br>
We are not picking out the most correlated features here, because we want to use a basic logistic model to help reduce the dimensionality and guide further feature engineering if needed.

#### Multicollinearity

In [None]:
plt.figure(figsize=(35,35))
sns.heatmap(X.corr(), annot=True)

There is very strong multicollinearity amongst V1-V18. We assume the anonymous features have some actual meaning, so it is useful to keep them as they are for interpretation. Therefore, we do not represent them with fewer features using PCA for the moment.
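
A large annotated heatmap is hard to read, so it can help to list the strongly correlated pairs directly. A minimal sketch follows; `correlated_pairs` and the 0.8 threshold are assumptions for illustration, and the toy frame stands in for the resampled feature matrix.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: b is nearly 2 * a, c is unrelated to both
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
strong = correlated_pairs(toy)
print(strong)
```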

### 4. Basic Logistic Model

We build a basic logistic model to:
1. Narrow down important features to help gain intuition or improve model performance (more generalised model?)
2. Play around with sampling methods and discover their impact
3. Explore other feature engineering

In [None]:
features_lst.remove('Class')

#### Evaluation metric
Because this data is heavily unbalanced, the usual accuracy metric is not suitable. Instead we use the area under the Precision-Recall curve to evaluate our model. With such a small group of positive (fraud) cases, the TN (true negative) count is so large that any metric including it is dominated by it. Precision and recall ignore TN, so they measure the model performance more accurately and sensitively. <br>
Here we are not trying to decide whether precision or recall is more important. A bank would presumably want both high precision, true positive/(true positive + false positive), meaning fewer false alarms, AND high recall, true positive/(true positive + false negative), meaning less undetected fraud.<br>
The calculation of the AUC of the PR curve is included in the metrics.py file.
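
To see why TN distorts accuracy, the sketch below computes precision, recall and accuracy from confusion-matrix counts. The counts are hypothetical and chosen to mimic a heavily unbalanced test set; the helper names are not from metrics.py.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct, including true negatives."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts: 100 frauds among 100,000 transactions
tp, fp, fn, tn = 80, 20, 20, 99_880
precision, recall = precision_recall(tp, fp, fn)
acc = accuracy(tp, fp, fn, tn)
print(precision, recall, acc)
```

In this made-up case one fraud in five is missed and one alarm in five is false, yet accuracy is still above 99.9% because TN dominates the count.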

#### Basic logistic model with over and under sampling

In [None]:
import statsmodels.api as sm
exog = sm.add_constant(X[features_lst])
endog = X['fraud']

In [None]:
exog.columns

In [None]:
logit_mod = sm.Logit(endog=endog, exog=exog)
log_res = logit_mod.fit(method='bfgs', maxiter=5000)

In [None]:
print(log_res.summary())

Open question: do sklearn and statsmodels give different results under the 'bfgs' solver?

**Complete quasi-separation** <br>
For a visual understanding of quasi-separation, read here: <br>
https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/what-are-complete-separation-and-quasi-complete-separation/ <br>

A possible complete quasi-separation perhaps indicates that some of the features yield perfect prediction for most values, but not all. Looking back at the pairplots produced for the unknown features, V9-V20 definitely exhibit some of that characteristic. <br>

(Unverified due to time limits) Some have suggested that features with insignificant results, big coefficients and huge confidence intervals contribute to quasi-separation. <br>

The logistic model does produce a high pseudo R-squared score. However, we will have a look at PR-AUC.
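
For intuition, separation can be checked on a single feature. The sketch below is a simplified, one-feature version of the idea (real separation can involve a linear combination of several features). `separation_status` is a hypothetical helper that expects both classes to be present, and the inputs are toy values.

```python
def separation_status(x, y):
    """Classify one numeric feature x against binary labels y (0/1)."""
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    # Check both directions: positives above negatives, or the reverse
    for low, high in ((neg, pos), (pos, neg)):
        if max(low) < min(high):
            return "complete"        # a threshold splits the classes perfectly
        if max(low) == min(high):
            return "quasi-complete"  # perfect split except for ties at one value
    return "none"


print(separation_status([1, 2, 3, 4], [0, 0, 1, 1]))
print(separation_status([1, 2, 3, 3, 4], [0, 0, 0, 1, 1]))
print(separation_status([1, 3, 2, 4], [0, 0, 1, 1]))
```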


In [None]:

y_pred = log_res.predict(exog=exog)

In [None]:

from sklearn.metrics import precision_recall_curve, auc, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def pr_auc(y, y_pred):
    # Area under the precision-recall curve (recall on the x-axis, precision on the y-axis)
    p, r, _ = precision_recall_curve(y, y_pred)
    return auc(r, p)

def evaluate_model(X, y, model):
    # 5-fold stratified cross-validation repeated 3 times, scored by PR AUC
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
    # needs_proba=True passes predict_proba output to pr_auc; newer scikit-learn
    # releases deprecate it in favour of response_method='predict_proba'
    metric = make_scorer(pr_auc, needs_proba=True)
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

In [None]:
from sklearn.linear_model import LogisticRegression
# L-BFGS solver; note that no intercept is fitted here (fit_intercept=False)
model = LogisticRegression(fit_intercept=False, solver='lbfgs', max_iter=1000)
scores = evaluate_model(X[features_lst], X['fraud'], model)
print('Mean PR AUC score: {}'.format(np.mean(scores)))

This is quite a good score!<br>
We could reduce the features based on the statsmodels output, but we will keep them as they are.<br>
We wrap up EDA with a summary of preprocessing steps, as below. <br>

### 5. Preprocessing steps

In [None]:
import datetime
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline


def preproc(df):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()

    # Time as a timedelta, Class as a categorical label, rows in chronological order
    df['timedelta'] = df['Time'].apply(lambda x: datetime.timedelta(seconds=x))
    df['fraud'] = df['Class'].astype('category')
    df.sort_values('Time', inplace=True)
    df.reset_index(inplace=True, drop=True)

    # Hour of day: fold the two days onto a single 24-hour cycle
    df['hour'] = (df['Time']//3600).astype('int') % 24
    df['hour'] = df['hour'].astype('category')
    unkwn_features = ['V{}'.format(n+1) for n in range(0, 28)]

    # One-hot encode the categorical columns (fraud -> fraud_1, hour -> hour_1 ... hour_23)
    df = pd.get_dummies(df, drop_first=True)
    # Scale Amount to the [0, 1] range
    scaler = MinMaxScaler()
    df['Amount'] = np.squeeze(scaler.fit_transform(
        np.array(df['Amount']).reshape(-1, 1)))
    df.rename(columns={'fraud_1': 'fraud'}, inplace=True)
    features_lst = ['Amount'] + unkwn_features + \
        ['hour_{}'.format(i) for i in range(1, 24)]
    X = df[features_lst]
    y = df['fraud']

    return X, y

def sampler(X, y, over_pct, under_pct):
    over = SMOTE(random_state=42, sampling_strategy=over_pct)
    under = RandomUnderSampler(random_state=42, sampling_strategy=under_pct)
    steps = [('o', over), ('u', under)]
    pipeline = Pipeline(steps=steps)
    X, y = pipeline.fit_resample(X, y)
    return X, y