## 1. Introduction

Credit card fraud detection is a crucial task for financial institutions to minimize financial losses and protect their customers. Machine learning algorithms have proven to be effective tools for identifying fraudulent transactions due to their ability to learn complex patterns from large datasets. 

This project aims to develop a credit card fraud detection system using various machine learning algorithms:
- Decision Trees
- Random Forest
- Light Gradient Boosting Machine (LightGBM)
- XGBoost
- Artificial Neural Networks (ANN)
- Isolation Forest
- Local Outlier Factor (LOF)

The project will utilize a comprehensive dataset of credit card transactions, including both fraudulent and legitimate transactions. The data will be preprocessed to handle missing values, outliers, and categorical variables. Subsequently, the machine learning algorithms will be trained and evaluated on the preprocessed data. The performance of each algorithm will be assessed using various metrics, such as accuracy, precision, recall, and F1-score.

The project outcomes will provide insights into the effectiveness of different machine learning algorithms for credit card fraud detection. The findings can be used to guide the selection of appropriate algorithms for practical fraud detection systems.

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Load Dataset

In [None]:
df = pd.read_csv('./creditcard.csv')
df.head()

## 2. Data Preprocessing

### Shape of Dataset

In [None]:
df.shape

### Dataset Information

In [None]:
df.info()

### Missing Values

In [None]:
# Columns with missing values
missing_values = df.isnull().sum()
print(missing_values)

### Drop Duplicates

In [None]:
# drop_duplicates() returns a new DataFrame, so assign the result back
df = df.drop_duplicates()
df.shape

## 3. Exploratory Data Analysis

In [None]:
labels, counts = np.unique(df.Class, return_counts=True)

plt.figure(figsize=(5, 5))
plt.pie(counts, autopct='%1.2f%%', labels=labels)
plt.legend(['Normal', 'Fraud'])
plt.title('Type of transaction')

plt.show()


In [None]:
count = df['Class'].value_counts(normalize=False).sort_values()
prop = df['Class'].value_counts(normalize=True)
# Express the proportion as a percentage so it matches the column label
dist = pd.DataFrame({'Freq[N]': count, 'Prop[%]': (prop * 100).round(2)})
dist

The highly imbalanced dataset (99.83% normal, 0.17% fraudulent transactions) poses significant challenges for fraud detection models. This imbalance can lead to biased predictions, poor performance on the minority class, and misleading evaluation metrics. Models trained on such data may struggle to learn fraud patterns effectively, potentially missing critical fraudulent activities.

Balancing the training data is one way to address these issues. It helps the model learn the characteristics of both normal and fraudulent transactions, reducing bias toward the majority class and improving detection of the minority class. The test set should keep its original class distribution so that evaluation reflects real-world conditions, which aligns with the primary business objective of identifying fraud even though it is rare.
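
To make the point about misleading evaluation metrics concrete, the sketch below scores a trivial baseline that labels every transaction as normal. The counts are hypothetical, chosen only to mirror the 99.83% / 0.17% split reported above; they are not taken from the dataset.

```python
# Hypothetical counts chosen to mirror the 99.83% / 0.17% class split.
n_normal = 9983
n_fraud = 17

# Ground truth: 0 = normal, 1 = fraud.
y_true = [0] * n_normal + [1] * n_fraud

# A trivial "model" that predicts normal for every transaction.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall = true_positives / n_fraud  # share of fraud cases that were caught

print(f"Accuracy: {accuracy:.4f}")
print(f"Fraud recall: {recall:.4f}")
```

The baseline reaches about 99.83% accuracy while catching no fraud at all, which is why precision, recall, and F1-score on the fraud class are more informative than accuracy here.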


### Time and Amount Distribution

In [None]:
normal_time = df.loc[df['Class'] == 0]["Time"]
fraud_time = df.loc[df['Class'] == 1]["Time"]

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(15, 8))

# Plot 1
sns.boxplot(x='Class', y='Time', data=df, ax=axs[0])
axs[0].set_title('Boxplot of Time by Class')
axs[0].set_xlabel('Normal = 0, Fraud = 1')
axs[0].set_ylabel('Time')

# Plot 2
sns.kdeplot(data=normal_time, ax=axs[1], label='Normal')
sns.kdeplot(data=fraud_time, ax=axs[1], label='Fraud')
axs[1].set_title('Density Plot of Time by Class')
axs[1].set_xlabel('Time [s]')
axs[1].set_ylabel('Density')
axs[1].legend()

plt.tight_layout()
plt.show()

In [None]:
normal_amount = df.loc[df['Class'] == 0]['Amount']
fraud_amount = df.loc[df['Class'] == 1]['Amount']

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(15, 8))

# Plot 1
sns.boxplot(x='Class', y='Amount', data=df, ax=axs[0])
axs[0].set_yscale('log')
axs[0].set_title('Boxplot of Amount by Class (Log Scale)')
axs[0].set_xlabel('Normal = 0, Fraud = 1')
axs[0].set_ylabel('Amount (Log Scale)')

# Plot 2
sns.kdeplot(data=np.log1p(normal_amount), ax=axs[1], label='Normal', fill=True, color='blue')
sns.kdeplot(data=np.log1p(fraud_amount), ax=axs[1], label='Fraud', fill=True, color='red')
axs[1].set_title('Log-Transformed Density of Amount by Class')
axs[1].set_xlabel('Log(Amount + 1)')
axs[1].set_ylabel('Density')
axs[1].legend()

plt.tight_layout()
plt.show()

Analyzing transaction time and amount alone did not yield significant insights for distinguishing fraudulent transactions from normal ones. The distribution of transaction times for both normal and fraudulent transactions appeared to be similar, indicating that fraudulent transactions were not concentrated at specific times. Similarly, the amount spent on both normal and fraudulent transactions exhibited overlapping distributions, suggesting that there was no clear spending threshold that could be used to identify fraudulent activity.

### Correlation Matrix

In [None]:
correlation_matrix = df.corr()

plt.figure(figsize=(15, 8))
sns.heatmap(correlation_matrix, annot=False, vmin=-1, vmax=1, cmap='vlag')
plt.title('Correlation Matrix')
plt.show()

In [None]:
top_correlations = pd.concat([correlation_matrix.unstack().sort_values(ascending=False).drop_duplicates().head(6),
                              correlation_matrix.unstack().sort_values(ascending=True).drop_duplicates().head(5)])
top_correlations

## 4. Data Preparation

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('Class', axis=1)
y = df['Class']
# stratify=y keeps the fraud ratio the same in the train and test partitions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
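
With so few fraud cases, a purely random split can leave the two partitions with slightly different fraud rates. The sketch below uses synthetic labels (not the real dataset) to show how the `stratify` argument of `train_test_split` keeps the class ratio the same in both partitions. All array names ending in `_demo` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 10,000 rows with 20 positives (0.2%).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo asks for the same class proportions in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Train fraud rate: {y_tr.mean():.4f}")
print(f"Test fraud rate:  {y_te.mean():.4f}")
```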

### Standard Scaler

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)
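
The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training. As a rough sketch of what that standardization computes, the NumPy code below applies z = (x - mean) / std with statistics taken from a synthetic training array. The arrays are made up for illustration.

```python
import numpy as np

# Synthetic, illustrative arrays standing in for the train and test splits.
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 2))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))

# Statistics come from the training data only.
mean = train_demo.mean(axis=0)
std = train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train_demo - mean) / std
test_scaled = (test_demo - mean) / std  # reuse the training statistics

# The training columns end up with mean 0 and std 1; the test columns are
# only approximately standardized because they use the training statistics.
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```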

### SMOTE

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_normalized, y_train)


In [None]:
print(X_train_resampled.shape)
print(y_train_resampled.shape)
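
SMOTE balances the training set by creating synthetic minority samples rather than duplicating existing ones: each new point is placed on the line segment between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates only that interpolation step on toy data. `smote_like_samples` is a hypothetical helper written for this illustration, not the `imblearn` implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its neighbours for each new point.
    base_idx = rng.integers(0, len(X_minority), size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    base = X_minority[base_idx]
    return base + gap * (X_minority[neigh_idx] - base)

# Toy minority class with five points in two dimensions.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
synthetic = smote_like_samples(X_min, n_new=10)
print(synthetic.shape)
```

In this notebook the resampling is applied after the split and only to the training data, so the test set keeps the original class imbalance.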

## 5. Modeling Algorithms

The following modeling algorithms will be used for the fraud detection:

- Decision Tree
- Random Forest
- Gradient Boosting
- XGBoost
- Artificial Neural Network

### Import Libraries

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')


### 5.1 Decision Tree

In [None]:
# Randomized Search Parameters
param_dist_dt = {
    'max_depth': [5, 10, 20, 30],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 5, 10]
}

# Randomized Search
random_search_dt = RandomizedSearchCV(estimator=DecisionTreeClassifier(),
                                      param_distributions=param_dist_dt,
                                      n_iter=10,  # Number of parameter settings sampled
                                      scoring='accuracy',
                                      cv=5,
                                      verbose=0,
                                      n_jobs=-1,
                                      random_state=42)

random_search_dt.fit(X_train_resampled, y_train_resampled)



In [None]:
#Best Parameters for Decision Tree
best_dt = random_search_dt.best_estimator_
y_pred_dt = best_dt.predict(X_test_normalized)

# Results
print(f'Best parameters found: {random_search_dt.best_params_}')
print(f'Best cross-validation accuracy: {random_search_dt.best_score_:.4f}')

test_accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f'Test set accuracy: {test_accuracy_dt:.4f}\n')
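
Accuracy alone says little on a test set where roughly 99.8% of rows are normal. The introduction lists precision, recall, and F1-score as evaluation metrics; one possible way to report them is sketched below with scikit-learn's `confusion_matrix` and `classification_report`. The labels and predictions are toy values for illustration, standing in for `y_test` and `y_pred_dt`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels and predictions (illustrative only): 0 = normal, 1 = fraud.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Per-class precision, recall, and F1-score.
report = classification_report(
    y_true_demo, y_pred_demo, target_names=['Normal', 'Fraud'], digits=4
)
print(report)
```

The same two calls could be applied to the predictions of each model in this section to compare them on the fraud class rather than on overall accuracy.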

### 5.2 Random Forest

In [None]:
# Randomized Search Parameters
param_dist_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, 30],
    'criterion': ['gini', 'entropy', 'log_loss'],  # Adjusted to the valid search space
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 4]
}

# Randomized Search
random_search_rf = RandomizedSearchCV(estimator=RandomForestClassifier(),
                                      param_distributions=param_dist_rf,
                                      n_iter=10,  # Number of parameter settings sampled
                                      scoring='accuracy',
                                      cv=5,
                                      verbose=0,
                                      n_jobs=-1,
                                      random_state=42)

random_search_rf.fit(X_train_resampled, y_train_resampled)


In [None]:
#Best Parameters for Random Forest
best_rf = random_search_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test_normalized)

# Results
print(f'Best parameters found: {random_search_rf.best_params_}')
print(f'Best cross-validation accuracy: {random_search_rf.best_score_:.4f}')

test_accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f'Test set accuracy: {test_accuracy_rf:.4f}\n')

### 5.3 Gradient Boosting

In [None]:
# Randomized Search Parameters
param_dist_gb = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.8, 0.9, 1.0]
}

random_search_gb = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(),
    param_distributions=param_dist_gb,
    n_iter=10,  # Number of random parameter combinations to try
    scoring='accuracy',
    cv=5,
    verbose=3,
    n_jobs=-1
)

random_search_gb.fit(X_train_resampled, y_train_resampled)


In [None]:
# Getting the best model and making predictions
best_gb = random_search_gb.best_estimator_
y_pred_gb = best_gb.predict(X_test_normalized)

# Printing results
print(f'Best parameters found: {random_search_gb.best_params_}')
print(f'Best cross-validation accuracy: {random_search_gb.best_score_:.4f}')

# Calculating accuracy on the test set
test_accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f'Test set accuracy: {test_accuracy_gb:.4f}\n')

### 5.4 XGB

In [None]:
# Grid Search Parameters
param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2],
    'reg_alpha': [0, 0.1, 0.5],
    'reg_lambda': [0.1, 1, 5]
}


grid_search_xgb = GridSearchCV(estimator=XGBClassifier(),
                               param_grid=param_grid_xgb,
                               scoring='accuracy',
                               cv=5,
                               verbose=0,
                               n_jobs=-1)

# Run the grid search on the resampled data
grid_search_xgb.fit(X_train_resampled, y_train_resampled)

In [None]:
# Best parameters for XGB
best_xgb = grid_search_xgb.best_estimator_
y_pred_xgb = best_xgb.predict(X_test_normalized)

#Results
print(f'Best parameters found: {grid_search_xgb.best_params_}')
print(f'Best cross-validation accuracy: {grid_search_xgb.best_score_:.4f}')

# Score the XGBoost predictions (not the Gradient Boosting ones)
test_accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print(f'Test set accuracy: {test_accuracy_xgb:.4f}\n')
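
An exhaustive grid search fits one model per parameter combination per fold, so its cost grows multiplicatively with every added parameter. The sketch below counts the cross-validation fits implied by a grid of the same shape as `param_grid_xgb` (nine parameters, three values each) with `cv=5`, and compares it with a randomized search that samples a fixed number of settings. The `n_iter` value of 10 mirrors the randomized searches used earlier; `grid_demo` is an illustrative copy of the grid.

```python
from math import prod

# Same shape as param_grid_xgb: nine parameters, three candidate values each.
grid_demo = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2],
    'reg_alpha': [0, 0.1, 0.5],
    'reg_lambda': [0.1, 1, 5],
}
cv_folds = 5

# Grid search: every combination is evaluated on every fold.
n_combinations = prod(len(values) for values in grid_demo.values())
grid_fits = n_combinations * cv_folds

# Randomized search: only n_iter sampled settings are evaluated.
n_iter = 10
random_fits = n_iter * cv_folds

print(f'Grid combinations: {n_combinations}')
print(f'Grid search CV fits: {grid_fits}')
print(f'Randomized search CV fits: {random_fits}')
```

With 19,683 combinations and 98,415 cross-validation fits, the exhaustive search may be impractical on the resampled training set; a smaller grid or a randomized search is a possible alternative.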

### 5.5 Artificial Neural Network

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

In [None]:
model_ann = Sequential([
    Dense(64, activation='relu', input_shape=(X_train_resampled.shape[1],)),
    Dropout(0.5),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model_ann.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

# Model summary
model_ann.summary()

# Train the model
history = model_ann.fit(X_train_resampled, y_train_resampled,
                        epochs=50, batch_size=64, verbose=1,
                        validation_data=(X_test_normalized, y_test))

In [None]:
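# The sigmoid output is a probability; threshold at 0.5 to obtain class labels
# (casting directly to int would truncate every probability below 1 to 0)
y_pred_ann = model_ann.predict(X_test_normalized).flatten()
y_pred_ann_int = (y_pred_ann >= 0.5).astype(np.int64)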


test_accuracy_ann = accuracy_score(y_test, y_pred_ann_int)
print(f'Test set accuracy: {test_accuracy_ann:.4f}\n')
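
A sigmoid output layer produces probabilities between 0 and 1, so class labels are obtained by comparing them with a decision threshold; casting the raw probabilities to integers truncates every value below 1 to 0. The sketch below contrasts the two approaches on made-up probabilities. The 0.5 threshold is a common default, not a tuned value.

```python
import numpy as np

# Made-up sigmoid outputs for four transactions (illustrative only).
probs = np.array([0.02, 0.40, 0.51, 0.97])

# Casting truncates toward zero, so every probability below 1 becomes 0.
truncated = probs.astype(np.int64)

# Thresholding maps probabilities at or above 0.5 to the fraud class.
thresholded = (probs >= 0.5).astype(np.int64)

print('Truncated:  ', truncated)
print('Thresholded:', thresholded)
```

For fraud detection, the threshold could also be lowered to trade some precision for higher recall on the fraud class.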