# Introduction:

This dataset contains credit card transactions made in September 2013 by European cardholders.
The transactions occurred over two days and include 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
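
The class share quoted above follows directly from the two counts. As a quick check, the sketch below uses only the numbers stated in the description (the variable names are illustrative). The exact value is slightly above 0.1727%, which matches the quoted 0.172% when truncated to three decimals.

```python
# Counts taken from the dataset description above.
n_transactions = 284_807
n_frauds = 492

fraud_share = n_frauds / n_transactions   # fraction of positive (fraud) rows
fraud_pct = 100 * fraud_share             # the same share as a percentage

print(f"Fraud share: {fraud_pct:.4f}% of all transactions")
print(f"Roughly 1 fraud per {round(1 / fraud_share)} transactions")
```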

Due to confidentiality concerns, no information is available about the original features or further background on the data (I believe).

## Importing the data directly from Kaggle:

In [None]:
import opendatasets as od

In [None]:
dataset = 'https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud'

In [None]:
od.download(dataset)

In [None]:
import os
data_dir = os.path.join('.', 'creditcardfraud')  # portable path; '.\c...' relied on an invalid escape sequence
os.listdir(data_dir)

## Loading the packages:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
import xgboost as xgb

pd.set_option('display.max_columns', 100)

# TRAIN/VALIDATION/TEST SPLIT
# VALIDATION
VALID_SIZE = 0.20  # simple validation using train_test_split
TEST_SIZE = 0.20  # test size using train_test_split

# CROSS-VALIDATION
NUMBER_KFOLDS = 5  # number of KFolds for cross-validation

RANDOM_STATE = 2018

MAX_ROUNDS = 1000  # lgb iterations
EARLY_STOP = 50  # lgb early stop
OPT_ROUNDS = 1000  # To be adjusted based on best validation rounds
VERBOSE_EVAL = 50  # Print out metric result

### Reading and Checking the data:

In [None]:
data_df = pd.read_csv(os.path.join(data_dir, "creditcard.csv"))  # reuse data_dir from the download step instead of a machine-specific path
print("Credit Card Fraud Detection data -  rows:",
      data_df.shape[0]," columns:", data_df.shape[1])

In [None]:
data_df.head()

In [None]:
data_df.describe()

#### Checking for missing data:

In [None]:
total = data_df.isnull().sum().sort_values(ascending=False)
percent = (data_df.isnull().sum() / data_df.isnull().count() * 100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent']).transpose()
print("Missing Data:")
print(missing_data)

##### Clearly, there are no missing data.

### Data Unbalance:
##### Let's check data unbalance with respect to "Target" value, which is "Class".

In [None]:
# Create a DataFrame to hold the class counts
temp = data_df["Class"].value_counts()
df = pd.DataFrame({'Class': temp.index, 'values': temp.values})

# Create the bar chart
trace = go.Bar(
    x=df['Class'],
    y=df['values'],
    name="Credit Card Fraud Class - data unbalance (Not fraud=0, Fraud=1)",
    marker=dict(color="Red"),
    text=df['values']
)
data = [trace]
layout = dict(
    title='Credit Card Fraud Class - data unbalance (Not fraud=0, Fraud=1)',
    xaxis=dict(title='Class', showticklabels=True),
    yaxis=dict(title='Number of transactions'),
    hovermode='closest',
    width=600
)
fig = dict(data=data, layout=layout)
iplot(fig, filename='class')

##### Only 492 (or 0.172%) of transactions are fraudulent, which indicates that the data is highly unbalanced with respect to the target variable "Class".

## EDA:

#### 1. Histogram for "Transaction in Time":

In [None]:
import plotly.express as px

fig = px.histogram(data_df, x='Time', color='Class', nbins=50,
                   labels={'Time': 'Time [s]', 'Class': 'Class'},
                   title='Credit Card Transactions Time Density Plot',
                   barmode='overlay', histnorm='probability density')

fig.update_layout(showlegend=True)
fig.show()

##### We can infer the following:

- Transaction Time Distribution: The plot shows the distribution of transaction times in seconds (Time [s]) for both fraud and non-fraud transactions. The x-axis represents the transaction time, and the y-axis shows the probability density of the transactions.

- Peak Times: The plot allows us to identify the peak times when most transactions occur. For non-fraud transactions (class 0), there are one or more peaks where a large number of legitimate transactions occur. For fraud transactions (class 1), there may be different peak times or patterns compared to non-fraud transactions, which could indicate potential anomalies.

- Transaction Time Differences: We can observe if there are any notable differences in transaction time distributions between fraud and non-fraud transactions. Differences in peak times or shapes of the distributions may suggest potential patterns or anomalies in the fraud transactions.

- Overlapping Areas: The overlapping areas in the plot represent regions where both fraud and non-fraud transactions occur similarly in terms of transaction times. This overlap can be important in distinguishing between fraud and non-fraud transactions, as some fraudulent activities might be similar to legitimate transactions in terms of time.

#### Let's aggregate the data by hours.

In [None]:
data_df['Hour'] = np.floor(data_df['Time'] / 3600)  # whole hours elapsed since the first transaction

tmp = data_df.groupby(['Hour', 'Class'])['Amount'].aggregate(['min', 'max', 'count', 'sum', 'mean', 'median', 'var']).reset_index()
df = pd.DataFrame(tmp)

df.columns = ['Hour', 'Class', 'Min', 'Max', 'Transactions', 'Sum',
             'Mean', 'Median', 'Var']
df.head()
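
To make the hourly bucketing concrete, here is a minimal, dependency-free sketch of the same floor division. The sample offsets are hypothetical values chosen for illustration, and `cycle_hour` is only an hour index relative to the first transaction; the dataset does not say at what clock time the recording started.

```python
import math

SECONDS_PER_HOUR = 3600

def hour_bucket(seconds_since_start: float) -> int:
    """Index of the hour containing the given offset (0 = first hour)."""
    return math.floor(seconds_since_start / SECONDS_PER_HOUR)

# Hypothetical offsets in seconds, chosen only for illustration.
sample_times = [0, 3599, 3600, 86_399, 86_400, 172_792]

buckets = [hour_bucket(t) for t in sample_times]

# Position within a 24-hour cycle, counted from the first transaction
# (not necessarily the wall-clock hour).
cycle_hour = [b % 24 for b in buckets]

for t, b, c in zip(sample_times, buckets, cycle_hour):
    print(f"Time={t:>7} s -> Hour={b:>2}, hour within cycle={c:>2}")
```

Two days of data therefore produce hour indices from 0 to 47, which is the range the `Hour` axis covers in the plots that follow.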

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Sum", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Sum", data=df.loc[df.Class==1], color="red")
plt.suptitle("Total Amount")
plt.show();

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Transactions", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Transactions", data=df.loc[df.Class==1], color="red")
plt.suptitle("Total Number of Transactions")
plt.show();

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Mean", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Mean", data=df.loc[df.Class==1], color="red")
plt.suptitle("Average Amount of Transactions")
plt.show();

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Max", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Max", data=df.loc[df.Class==1], color="red")
plt.suptitle("Maximum Amount of Transactions")
plt.show();

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Median", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Median", data=df.loc[df.Class==1], color="red")
plt.suptitle("Median Amount of Transactions")
plt.show();

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Min", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Min", data=df.loc[df.Class==1], color="red")
plt.suptitle("Minimum Amount of Transactions")
plt.show();

#### Transactions amount:

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12,6))
s = sns.boxplot(ax = ax1, x="Class", y="Amount", hue="Class",data=data_df, palette="PRGn",showfliers=True)
s = sns.boxplot(ax = ax2, x="Class", y="Amount", hue="Class",data=data_df, palette="PRGn",showfliers=False)
plt.show();

In [None]:
tmp = data_df[['Amount','Class']].copy()
class_0 = tmp.loc[tmp['Class'] == 0]['Amount']
class_1 = tmp.loc[tmp['Class'] == 1]['Amount']
class_0.describe()

In [None]:
class_1.describe()

Legitimate transactions have a smaller mean, a larger Q1, a smaller Q3, and larger outliers (a much larger maximum); fraudulent transactions have a larger mean, a smaller Q1, a larger Q3, and smaller outliers.

##### Let's plot the fraudulent transactions (amount) against time.

In [None]:
fraudulent_transactions = data_df[data_df['Class'] == 1]

# Create a scatter plot for fraudulent transactions (amount) against time
plt.figure(figsize=(12, 6))
plt.scatter(fraudulent_transactions['Time'], fraudulent_transactions['Amount'], color='red', alpha=0.7)
plt.title('Fraudulent Transactions (Amount) vs Time')
plt.xlabel('Time (seconds)')
plt.ylabel('Amount')
plt.show()

#### Moving on to feature engineering:

##### Features correlation:

In [None]:
# Calculate the correlation matrix
corr_matrix = data_df.corr()

# Perform hierarchical clustering to reorder the rows and columns
g = sns.clustermap(corr_matrix, cmap='coolwarm', center=0, annot=True, fmt=".2f",
                   linewidths=.5, cbar_kws={"shrink": 0.8})

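# Set the title on the top dendrogram axes (plt.title acts on the current axes, which is not necessarily the heatmap)
g.ax_col_dendrogram.set_title('Credit Card Transactions Features Correlation Heatmap')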

# Rotate the x-axis labels for better readability
plt.setp(g.ax_heatmap.get_xticklabels(), rotation=45, ha='right')

# Show the plot
plt.show()

- There are no notable correlations between features V1-V28. This means that these features are relatively independent of each other and don't show strong linear relationships.

- There are certain correlations between some of these features and Time:

    * V3 shows an inverse correlation with Time, indicating that as Time increases, V3 tends to decrease (or vice versa). This suggests that there might be some time-dependent patterns in the data related to V3.

- There are certain correlations between some of these features and Amount:

    * V7 and V20 show a direct correlation with Amount, indicating that as Amount increases, V7 and V20 tend to increase as well. This suggests that there might be some relationship between the transaction amount and these features.
    * V1 and V5 show an inverse correlation with Amount, suggesting that as Amount increases, V1 and V5 tend to decrease (or vice versa). This could indicate some patterns related to the transaction amount and these features.
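
As a reminder of what "direct" and "inverse" correlation mean numerically, the sketch below computes the Pearson coefficient by hand (`DataFrame.corr()` uses Pearson by default). The three toy series are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, invented for this example only.
amount = [1.0, 5.0, 20.0, 80.0, 250.0]
rises_with_amount = [0.1, 0.0, 0.4, 0.9, 2.5]     # plays the role of a "direct" feature
falls_with_amount = [0.3, 0.2, -0.1, -1.0, -3.0]  # plays the role of an "inverse" feature

r_direct = pearson(amount, rises_with_amount)
r_inverse = pearson(amount, falls_with_amount)

print(f"direct:  r = {r_direct:+.3f}")
print(f"inverse: r = {r_inverse:+.3f}")
```

A coefficient near +1 or -1 indicates a strong linear relationship; values near 0, as seen among V1-V28, indicate none.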

#### Let's plot the correlated and inverse correlated values on the same graph.

In [None]:
# Creating a DataFrame with the correlated and inverse correlated values
correlated_df = data_df[['Amount', 'V7', 'V20']].copy()
inverse_correlated_df = data_df[['Amount', 'V1', 'V5']].copy()

# Creating subplots for correlated and inverse correlated plots
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18, 10))

# Plotting the correlated values
sns.scatterplot(x='Amount', y='V7', data=correlated_df, ax=ax1, color='blue', label='V7 (Correlated)')
sns.scatterplot(x='Amount', y='V20', data=correlated_df, ax=ax1, color='green', label='V20 (Correlated)')
ax1.set_title('Correlated Features vs. Amount')
ax1.legend()

# Plotting the inverse correlated values
sns.scatterplot(x='Amount', y='V1', data=inverse_correlated_df, ax=ax2, color='red', label='V1 (Inverse Correlated)')
sns.scatterplot(x='Amount', y='V5', data=inverse_correlated_df, ax=ax2, color='orange', label='V5 (Inverse Correlated)')
ax2.set_title('Inverse Correlated Features vs. Amount')
ax2.legend()

plt.tight_layout()
plt.show()

From the output, we can understand the following:

1. Correlated Features (V7 and V20) vs. Amount:

  * V7 (Correlated): As the transaction Amount increases, the values of V7 tend to increase as well. There is a positive correlation between V7 and the transaction Amount.
  * V20 (Correlated): Similarly, as the transaction Amount increases, the values of V20 also tend to increase. There is a positive correlation between V20 and the transaction Amount.

2. Inverse Correlated Features (V1 and V5) vs. Amount:

  * V1 (Inverse Correlated): As the transaction Amount increases, the values of V1 tend to decrease. There is a negative (inverse) correlation between V1 and the transaction Amount.
  * V5 (Inverse Correlated): Similarly, as the transaction Amount increases, the values of V5 tend to decrease. There is a negative (inverse) correlation between V5 and the transaction Amount.

#### Features density plot:

In [None]:
selected_features = ['V1', 'V5', 'V7', 'V20']

fig, axes = plt.subplots(nrows=len(selected_features), ncols=1, figsize=(8, 12), sharex=True)

# Plotting the density plot for each feature
for i, feature in enumerate(selected_features):
    sns.kdeplot(data_df[data_df['Class'] == 0][feature], label='Not Fraud', ax=axes[i], color='blue', linewidth=2)
    sns.kdeplot(data_df[data_df['Class'] == 1][feature], label='Fraud', ax=axes[i], color='red', linewidth=2)

    axes[i].set_title(f'{feature} Density Plot', fontsize=14)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Density')
    axes[i].legend()  # show which curve belongs to which class

plt.tight_layout()
plt.show()

#### What if I had to select all features?

In [None]:
selected_features = data_df.drop('Class', axis=1).columns.tolist()

# Each feature gets its own x-axis: Time and Amount span far wider ranges than V1-V28
fig, axes = plt.subplots(nrows=len(selected_features), ncols=1, figsize=(8, 4 * len(selected_features)), sharex=False)

for i, feature in enumerate(selected_features):
    sns.kdeplot(data_df[data_df['Class'] == 0][feature], label='Not Fraud', ax=axes[i], color='blue', linewidth=2)
    sns.kdeplot(data_df[data_df['Class'] == 1][feature], label='Fraud', ax=axes[i], color='red', linewidth=2)

    axes[i].set_title(f'{feature} Density Plot', fontsize=14)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Density')
    axes[i].legend()  # show which curve belongs to which class

plt.tight_layout()
plt.show()

#### Summary of my observations:

- Good Separation: Features V4 and V11 have clearly separated distributions for Class values 0 and 1, indicating they could be strong indicators for distinguishing between legitimate and fraudulent transactions.

- Partial Separation: Features V12, V14, and V18 show partial separation between the two classes, suggesting they still provide some discriminatory power, but there is some overlap in their distributions.

- Distinct Profiles: Features V1, V2, V3, and V10 have distinct profiles for the two values of Class, indicating they may also be informative in distinguishing between the classes.

- Similar Profiles: Features V25, V26, and V28 have similar profiles for the two values of Class, meaning their distributions are not very informative in differentiating between legitimate and fraudulent transactions.

- Centered Around 0: For most features (except Time and Amount), the distributions for Class = 0 (legitimate transactions) are centered around 0, with some having a long tail on one side. This suggests that in general, legitimate transactions tend to have values closer to 0 for these features.

- Skewed Distribution: For Class = 1 (fraudulent transactions), the distributions are skewed, indicating that certain features may have extreme values for fraudulent cases.

### Building the models:

##### Defining the predictors and targets:

In [None]:
target = 'Class'
predictors = ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
              'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',
              'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',
              'Amount']

##### Splitting the data:

In [None]:
train_df, test_df = train_test_split(data_df, test_size=TEST_SIZE, random_state=RANDOM_STATE, shuffle=True )
train_df, valid_df = train_test_split(train_df, test_size=VALID_SIZE, random_state=RANDOM_STATE, shuffle=True )
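
Because the second split is taken from what remains after the first, the validation set is 20% of 80%, not 20% of the full data. A small sketch of the resulting proportions follows; the fraud counts per split are rough expectations under a purely random split, not measured values.

```python
TEST_SIZE = 0.20    # same constants as defined earlier in the notebook
VALID_SIZE = 0.20

test_frac = TEST_SIZE
valid_frac = (1 - TEST_SIZE) * VALID_SIZE         # 20% of the remaining 80%
train_frac = (1 - TEST_SIZE) * (1 - VALID_SIZE)   # what is left for training

n_frauds = 492  # total frauds, from the dataset description

for name, frac in [("train", train_frac), ("valid", valid_frac), ("test", test_frac)]:
    print(f"{name}: {frac:.0%} of rows, about {frac * n_frauds:.0f} frauds expected")
```

With so few positives per split, the exact number of frauds in each part depends on the random seed. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in every part.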

### 1st model: RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

In [None]:
rf = RandomForestClassifier(n_jobs=4, 
                             random_state=42,
                             criterion='gini',
                             n_estimators=100,
                             verbose=False)

In [None]:
rf.fit(train_df[predictors], train_df[target].values)

In [None]:
preds = rf.predict(valid_df[predictors])

In [None]:
tmp = pd.DataFrame({'Feature': predictors, 'Feature importance': rf.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.tick_params(axis='x', labelrotation=90)  # rotate the feature names without re-setting the tick labels
plt.show()

##### The most important features are: V17, V14, V12, V10, V16, V11 and V9.

#### Let's plot the confusion matrix for this model:

In [None]:
cm = confusion_matrix(valid_df[target].values, preds)

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=["Not Fraud", "Fraud"], yticklabels=["Not Fraud", "Fraud"])
plt.xlabel("Predicted Class")
plt.ylabel("True Class")
plt.title("Confusion Matrix")
plt.show()

In [None]:
roc_auc_score(valid_df[target].values, preds)

#### The ROC-AUC score obtained with RandomForestClassifier is 0.85.
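
Note that the score above is computed from hard 0/1 labels returned by `rf.predict`, whereas ROC-AUC is usually computed from scores such as `rf.predict_proba(...)[:, 1]`. The sketch below illustrates the difference on toy data, using a hand-written pairwise AUC (the same quantity `roc_auc_score` reports). The labels and probabilities are invented for illustration, so the numbers say nothing about this notebook's model.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (fraud, non-fraud) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, invented for this example only.
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]       # stand-in for predicted probabilities
labels = [int(p >= 0.5) for p in proba]      # stand-in for hard predictions

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)

print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels:  ", auc_from_labels)
```

Thresholding collapses every score to 0 or 1, which creates ties and discards ranking information, so the label-based value can understate how well the model ranks transactions.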

### 2nd Model: LGBMClassifier

In [None]:
import lightgbm as lgb
from sklearn.metrics import confusion_matrix, roc_auc_score

# Define hyperparameters for LGBM model
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 7,
    'max_depth': 4,
    'min_child_samples': 100,
    'max_bin': 100,
    'subsample': 0.9,
    'subsample_freq': 1,
    'colsample_bytree': 0.7,
    'min_child_weight': 0,
    'min_split_gain': 0,
    'nthread': 8,
    'verbose': 0,
    'scale_pos_weight': 150
}

# Create the LGBM dataset
dtrain = lgb.Dataset(train_df[predictors].values,
                     label=train_df[target].values,
                     feature_name=predictors)

dvalid = lgb.Dataset(valid_df[predictors].values,
                     label=valid_df[target].values,
                     feature_name=predictors)

# Train the LGBM model
model = lgb.train(params,
                  dtrain,
                  valid_sets=[dtrain, dvalid],
                  valid_names=['train', 'valid'],
                  num_boost_round=MAX_ROUNDS)

# Make predictions on the validation set
preds = model.predict(valid_df[predictors])

# Convert probabilities to binary predictions (0 or 1) using a threshold of 0.5
preds_binary = (preds >= 0.5).astype(int)

# Evaluate the model using a confusion matrix
cm = confusion_matrix(valid_df[target].values, preds_binary)

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=["Not Fraud", "Fraud"], yticklabels=["Not Fraud", "Fraud"])
plt.xlabel("Predicted Class")
plt.ylabel("True Class")
plt.title("Confusion Matrix (LGBM)")
plt.show()

# Calculate the ROC AUC score for the LGBM model
roc_auc = roc_auc_score(valid_df[target].values, preds)
print("ROC AUC Score (LGBM):", roc_auc)

In [None]:
fig, (ax) = plt.subplots(ncols=1, figsize=(8,5))
lgb.plot_importance(model, height=0.8, title="Features importance (LightGBM)", ax=ax,color="red") 
plt.show()
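
The boosted models in this notebook set `scale_pos_weight` to 150. For comparison, the XGBoost parameter notes suggest the ratio of negative to positive instances as a typical starting value. The sketch below computes that ratio from the whole-dataset counts in the introduction; this is an assumption made for illustration, since the training split would give a similar but not identical figure.

```python
# Whole-dataset counts from the introduction (the training split will differ slightly).
n_transactions = 284_807
n_frauds = 492
n_legit = n_transactions - n_frauds

balanced_weight = n_legit / n_frauds   # negatives per positive
chosen_weight = 150                    # value used in the params dictionaries

# Share of the total instance weight carried by frauds under the chosen setting.
fraud_weight_share = chosen_weight * n_frauds / (chosen_weight * n_frauds + n_legit)

print(f"negatives per positive: {balanced_weight:.1f}")
print(f"chosen scale_pos_weight: {chosen_weight}")
print(f"fraud share of total weight at {chosen_weight}: {fraud_weight_share:.1%}")
```

A weight of 150 therefore boosts the fraud class substantially while stopping short of fully balancing it. Whether that trade-off is preferable is something to verify on the validation set.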

### 3rd Model: XGBoost

In [None]:
import xgboost as xgb

# Define hyperparameters for XGB model
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'eta': 0.05,
    'max_depth': 4,
    'subsample': 0.9,
    'colsample_bytree': 0.7,
    'min_child_weight': 0,
    'scale_pos_weight': 150
}

# Create the XGB dataset
dtrain = xgb.DMatrix(train_df[predictors].values, label=train_df[target].values)
dvalid = xgb.DMatrix(valid_df[predictors].values, label=valid_df[target].values)

# Train the XGB model
model = xgb.train(params,
                  dtrain,
                  num_boost_round=MAX_ROUNDS,
                  evals=[(dtrain, 'train'), (dvalid, 'valid')],
                  early_stopping_rounds=2*EARLY_STOP,
                  verbose_eval=VERBOSE_EVAL)

# Make predictions on the validation set
preds = model.predict(dvalid)

# Convert probabilities to binary predictions (0 or 1) using a threshold of 0.5
preds_binary = (preds >= 0.5).astype(int)

# Evaluate the model using a confusion matrix
cm = confusion_matrix(valid_df[target].values, preds_binary)

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=["Not Fraud", "Fraud"], yticklabels=["Not Fraud", "Fraud"])
plt.xlabel("Predicted Class")
plt.ylabel("True Class")
plt.title("Confusion Matrix (XGB)")
plt.show()

# Calculate the ROC AUC score for the XGB model
roc_auc = roc_auc_score(valid_df[target].values, preds)
print("ROC AUC Score (XGB):", roc_auc)

In [None]:
fig, (ax) = plt.subplots(ncols=1, figsize=(8,5))
xgb.plot_importance(model, height=0.8, title="Features importance (XGBoost)", ax=ax, color="green") 
plt.show()
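
Both confusion matrices above depend on the 0.5 cut-off applied to the predicted probabilities, while the ROC AUC does not. The following sketch, on invented toy data, shows how the four cells move as the threshold changes. The tuple order `(tn, fp, fn, tp)` matches the row-major layout of `confusion_matrix` for labels 0 and 1.

```python
def confusion_counts(y_true, proba, threshold):
    """Return (tn, fp, fn, tp) after thresholding the predicted probabilities."""
    tn = fp = fn = tp = 0
    for label, p in zip(y_true, proba):
        predicted = int(p >= threshold)
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Toy labels and probabilities, invented for this example only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.05, 0.10, 0.20, 0.35, 0.55, 0.70, 0.30, 0.45, 0.80, 0.95]

for threshold in (0.3, 0.5, 0.7):
    tn, fp, fn, tp = confusion_counts(y_true, proba, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold={threshold}: tn={tn} fp={fp} fn={fn} tp={tp} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold catches more frauds at the cost of more false alarms, so the cut-off is a choice to make for the application rather than a property of the model.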

In [None]:
from sklearn.model_selection import KFold
import gc

kf = KFold(n_splits=NUMBER_KFOLDS, random_state=RANDOM_STATE, shuffle=True)

# Create arrays and dataframes to store results
oof_preds = np.zeros(train_df.shape[0])
test_preds = np.zeros(test_df.shape[0])
feature_importance_df = pd.DataFrame()
n_fold = 0

for train_idx, valid_idx in kf.split(train_df):
    train_x, train_y = train_df[predictors].iloc[train_idx], train_df[target].iloc[train_idx]
    valid_x, valid_y = train_df[predictors].iloc[valid_idx], train_df[target].iloc[valid_idx]

    # XGBoost model initialization
    # eval_metric and early_stopping_rounds are set on the estimator because
    # XGBoost 2.0+ no longer accepts them in fit() (this form needs XGBoost >= 1.6)
    model = xgb.XGBClassifier(
        n_jobs=-1,
        n_estimators=2000,
        learning_rate=0.01,
        max_depth=4,
        subsample=0.9,
        colsample_bytree=0.7,
        random_state=RANDOM_STATE,
        eval_metric='auc',
        early_stopping_rounds=2*EARLY_STOP
    )

    model.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)],
              verbose=VERBOSE_EVAL)

    oof_preds[valid_idx] = model.predict_proba(valid_x)[:, 1]
    test_preds += model.predict_proba(test_df[predictors])[:, 1] / kf.n_splits

    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = predictors
    fold_importance_df["importance"] = model.feature_importances_
    fold_importance_df["fold"] = n_fold + 1

    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(valid_y, oof_preds[valid_idx])))
    del model, train_x, train_y, valid_x, valid_y
    gc.collect()
    n_fold = n_fold + 1

train_auc_score = roc_auc_score(train_df[target], oof_preds)
print('Full AUC score %.6f' % train_auc_score)
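
The loop above fills `oof_preds` so that every training row is scored by a model that never saw it, and averages the per-fold test predictions. The bookkeeping can be sketched without any ML library. Both the fold splitter and the "model" (which just predicts the training mean) are simplified stand-ins invented for this example, not what `KFold` or XGBoost do internally.

```python
import random

def kfold_indices(n_rows, n_splits, seed):
    """Yield (train_idx, valid_idx) pairs; a simplified stand-in for KFold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_splits] for k in range(n_splits)]
    for k in range(n_splits):
        valid_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, valid_idx

# Toy targets, invented for this example only.
train_y = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
n_test_rows = 3
n_splits = 5

oof_preds = [None] * len(train_y)   # one out-of-fold prediction per training row
test_preds = [0.0] * n_test_rows    # running average over the folds

for train_idx, valid_idx in kfold_indices(len(train_y), n_splits, seed=2018):
    # Stand-in "model": predict the mean target of the fold's training rows.
    fold_mean = sum(train_y[i] for i in train_idx) / len(train_idx)

    for i in valid_idx:
        oof_preds[i] = fold_mean    # each row is filled exactly once

    # Each fold contributes 1/n_splits of the final test prediction.
    test_preds = [p + fold_mean / n_splits for p in test_preds]

print("out-of-fold predictions:", oof_preds)
print("averaged test predictions:", test_preds)
```

The AUC computed on `oof_preds` is therefore a validation-style estimate over the whole training set, while `test_preds` is an ensemble average that still has to be scored against the test labels separately.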

In [None]:
final_pred = test_preds

In [None]:
print(final_pred)

# Conclusion:

We investigated the data, checking for class imbalance, visualizing the features, and examining the relationships between them. We then investigated three predictive models. The data was split into three parts: a train set, a validation set, and a test set. The random forest, LightGBM, and first XGBoost runs used only the train and validation sets; the test set was used only for the cross-validated XGBoost predictions.

We started with RandomForestClassifier, for which we obtained an AUC score of 0.85 when predicting the target for the validation set.

We then experimented with a LightGBM model. In this case, we used the validation set to validate the trained model. The best validation score obtained was 0.929.

We then presented the data to an XGBoost model. We used both a train-validation split and cross-validation to evaluate how well the model predicts the 'Class' value, i.e. whether a transaction is fraudulent. With the first method, we obtained an AUC of about 0.947980 for the validation set. For the test set, the score obtained was 0.903344.
With cross-validation, we obtained an AUC score for the test prediction of 0.947980.

Thanks to: https://www.kaggle.com/code/gpreda/credit-card-fraud-detection-predictive-models/notebook