#  Steps:

1. Problem Statement

2. Data Collection and Preprocessing

3. EDA

4. Solving Class Imbalance Problem

5. Model Implementation

# 1. Problem statement

To build a machine learning model to identify fraudulent credit card
transactions.

# 2. Data Collection and Preprocessing

I am using the data from: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/

About the data:

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
The dataset covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. It is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, the original features and further background information about the data are not provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the transaction amount, which can be used, for example, for example-dependent cost-sensitive learning. 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.
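To see why this imbalance matters, the short sketch below recomputes the fraud rate from the two counts quoted above and compares it with the accuracy of a trivial classifier that always predicts class 0. The variable names are illustrative and not part of the notebook.

```python
# Counts quoted in the dataset description above
n_total = 284_807
n_fraud = 492

# Share of transactions that are fraudulent
fraud_rate = n_fraud / n_total

# A classifier that always predicts "not fraud" is correct on every non-fraud row
baseline_accuracy = (n_total - n_fraud) / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of always predicting class 0: {baseline_accuracy:.4%}")
```

Because such a baseline is already right on more than 99.8% of rows, accuracy alone says little on this dataset, which is why per-class precision, recall, and F1 are more informative.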

In [None]:
pip install -U imbalanced-learn

In [None]:
pip install xgboost


# 2.1 Data Collection

In [None]:
# importing necessary libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# importing models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing
from xgboost import XGBClassifier

# libraries for under-sampling
from imblearn.under_sampling import RandomUnderSampler


# importing evaluation metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# RocCurveDisplay replaces plot_roc_curve, which was removed in scikit-learn 1.2
from sklearn.metrics import roc_auc_score, RocCurveDisplay

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
import joblib

In [None]:
# load dataset
credit_card_df=pd.read_csv('creditcard.csv')
credit_card_df.head()

# 2.2 Preprocessing the data

In [None]:
# check the size of the dataset
credit_card_df.shape

In [None]:
# get the  dataset info
credit_card_df.info()

In [None]:
# check for null values in the data
credit_card_df.isna().sum()

There are no null values in the data

In [None]:
# check for any duplicate data in the dataset
credit_card_df.duplicated().any()

In [None]:
# drop duplicate rows
credit_card_df=credit_card_df.drop_duplicates()

In [None]:
credit_card_df['Amount']

In [None]:
# Normalising the data
# Features V1-V28 are already in normalised form; only Amount needs to be normalised
# Use StandardScaler to normalise the Amount feature

scaler=StandardScaler()
# double brackets keep Amount as a 2-D frame for the scaler; ravel() flattens the result back to one column
credit_card_df['Amount']=scaler.fit_transform(credit_card_df[['Amount']]).ravel()

In [None]:
credit_card_df['Amount']
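Note that the cell above fits the scaler on the full dataset, before the train/test split made later in the notebook, so the test rows contribute to the mean and standard deviation. A common alternative is to fit `StandardScaler` on the training rows only and reuse those statistics on the test rows. The sketch below shows that pattern on synthetic data; `toy_df` and the other `toy_*` names are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "Amount": rng.exponential(scale=80.0, size=1000),
    "Class": rng.integers(0, 2, size=1000),
})

toy_X = toy_df[["Amount"]]
toy_y = toy_df["Class"]
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

amount_scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only
toy_train_scaled = amount_scaler.fit_transform(toy_X_train)
# Reuse those training statistics on the test rows (no refitting)
toy_test_scaled = amount_scaler.transform(toy_X_test)

print(toy_train_scaled.mean(), toy_train_scaled.std())
```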

# 3. EDA

In [None]:
sns.heatmap(credit_card_df.corr(), cmap='YlGnBu', annot=False)

In [None]:
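# colour the points by the target column 'Class' (plotting every feature pair on the full dataset can be slow)
sns.pairplot(credit_card_df, hue='Class', palette='Blues')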

# 4. Solving Class Imbalance problem

In [None]:
# let's check the target variable
credit_card_df['Class'].value_counts()

In [None]:
credit_card_df['Class'].value_counts().plot(kind='bar', color=['red','blue'])

The target variable is heavily imbalanced: fraudulent transactions (class 1) are far fewer than non-fraudulent ones (class 0). To address this, I am going to use random under-sampling, which removes randomly chosen non-fraudulent transactions until their number matches the number of fraudulent transactions.

# Random Under-Sampling
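As a rough picture of what random under-sampling does, the sketch below balances a small toy label column by hand with NumPy. It is only an illustration of the idea, not the internals of `RandomUnderSampler`; the `toy_*` names are hypothetical, and this version samples without replacement, whereas the notebook passes `replacement=True`.

```python
import numpy as np
import pandas as pd

# Toy labels: 95 non-fraud rows and 5 fraud rows (hypothetical data)
toy_y = pd.Series([0] * 95 + [1] * 5, name="Class")

rng = np.random.default_rng(42)
majority_idx = toy_y[toy_y == 0].index.to_numpy()
minority_idx = toy_y[toy_y == 1].index.to_numpy()

# Keep only as many majority rows as there are minority rows
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

# Combine the kept majority rows with all minority rows
balanced_idx = np.concatenate([kept_majority, minority_idx])
toy_y_under = toy_y.loc[balanced_idx]

print(toy_y_under.value_counts())
```

The price of this balance is that most majority-class rows are discarded, so the models see far less data than the original training set contains.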

In [None]:
# splitting the data into X and y
X=credit_card_df.drop('Class', axis=1)
y=credit_card_df['Class']

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
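With so few fraud cases, a purely random split can leave the train and test sets with slightly different fraud rates. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A minimal sketch on toy labels (all `toy_*` names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows with 2% positives (hypothetical, for illustration only)
toy_X = np.arange(1000).reshape(-1, 1)
toy_y = np.array([0] * 980 + [1] * 20)

# stratify=toy_y preserves the class proportions in both parts of the split
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

print("train positive rate:", toy_y_train.mean())
print("test positive rate:", toy_y_test.mean())
```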

In [None]:
# random under-sampling
rus=RandomUnderSampler(random_state=42, replacement=True)
x_rus,y_rus=rus.fit_resample(X_train,y_train)

x_rus.shape

In [None]:
# Visualize class distribution before and after sampling
plt.figure(figsize=(12, 6))

# Plot original class distribution
plt.subplot(1, 2, 1)
sns.countplot(x=y)
plt.title("Original Class Distribution")

# Plot class distribution after random under-sampling
plt.subplot(1, 2, 2)
sns.countplot(x=y_rus)
plt.title("Class Distribution After Under-sampling")

plt.tight_layout()
plt.show()

The classes are now balanced.

# 5. Machine Learning Model Implementation

I will test the following models:

1. RandomForestClassifier

2. Logistic Regression

3. Support Vector Machines

4. Gradient Boosting Models (XGBoost)

In [None]:
models={'Logistic Regression':LogisticRegression(),
        'Random Forest Classifier':RandomForestClassifier(),
       'Support Vector Machine':SVC(),
       'XGBoost':XGBClassifier()}

In [None]:
models

In [None]:
# create a function to fit, save, and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model in the `models` dict, save it with joblib, and return a
    DataFrame of per-class precision, recall, and F1 on the test set."""
    # set random seed
    np.random.seed(42)
    # list of per-model score dictionaries
    model_scores = []
    # loop through models
    for name, model in models.items():
        # fit the model and predict on the test set
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # save the trained model with joblib so it can be reused later
        model_filename = f"{name}_model.pkl"
        joblib.dump(model, model_filename)
        print(f"Model {name} saved as {model_filename}")

        # evaluate the model and append its scores
        print(f"Evaluating {name}....")
        report_dict = classification_report(y_test, y_pred, output_dict=True)

        # extract the relevant metrics from the classification report
        precision_0 = report_dict['0']['precision']
        recall_0 = report_dict['0']['recall']
        f1_0 = report_dict['0']['f1-score']

        precision_1 = report_dict['1']['precision']
        recall_1 = report_dict['1']['recall']
        f1_1 = report_dict['1']['f1-score']

        model_scores.append({
            'Model': name,
            'Precision_0': precision_0,
            'Recall_0': recall_0,
            'F1_0': f1_0,
            'Precision_1': precision_1,
            'Recall_1': recall_1,
            'F1_1': f1_1
        })

    # convert the list of dictionaries to a DataFrame
    model_scores = pd.DataFrame(model_scores)
    return model_scores
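The function above writes each fitted model to disk with `joblib.dump`. A saved model can be restored later with `joblib.load`, as in the minimal sketch below. The toy data, the temporary directory, and the file name `toy_model.pkl` are assumptions made for illustration; they are not files produced by the notebook.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# Tiny toy problem (hypothetical data, for illustration only)
toy_X = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [0, 0, 1, 1]
toy_model = LogisticRegression().fit(toy_X, toy_y)

# Save the fitted model, then load it back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, toy_path)
    restored_model = joblib.load(toy_path)

# The restored model should give the same predictions as the original
same_predictions = list(restored_model.predict(toy_X)) == list(toy_model.predict(toy_X))
print("Predictions match after reload:", same_predictions)
```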

In [None]:
model_scores=fit_and_score(models=models, X_train=x_rus,X_test=X_test,y_train=y_rus ,y_test=y_test)
model_scores

Random Forest Classifier performs well: it maintains high precision, recall, and F1-score for non-fraudulent transactions (class 0), and it also achieves a good balance between precision and recall for fraudulent transactions (class 1).
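For reference, the per-class metrics in the table are derived from confusion-matrix counts. The sketch below computes precision, recall, and F1 for the fraud class from hypothetical counts; the numbers are made up for illustration and are not results from this notebook, and `precision_recall_f1` is a helper name introduced here.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for class 1 (fraud); not taken from the notebook's output
tp, fp, fn = 90, 900, 10
toy_precision_1, toy_recall_1, toy_f1_1 = precision_recall_f1(tp, fp, fn)

print(f"precision={toy_precision_1:.3f} recall={toy_recall_1:.3f} f1={toy_f1_1:.3f}")
```

In this made-up case recall is high but precision is low. That pattern can appear when a model trained on balanced data is scored on an imbalanced test set, because even a small false-positive rate on the many legitimate transactions can outnumber the true frauds.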

In [None]:
model_scores

In [None]:
# plot the results
# Melt the DataFrame to make it suitable for plotting
df_melted = pd.melt(model_scores, id_vars=['Model'], var_name='Metric', value_name='Score')

# Plotting using seaborn
plt.figure(figsize=(14, 8))
sns.barplot(x='Model', y='Score', hue='Metric', data=df_melted, palette='viridis')
plt.title('Model Comparison - Precision, Recall, and F1-Score')
plt.xlabel('Model')
plt.ylabel('Score')
plt.show()

Let's try random over-sampling to balance the classes and compare the results.

# Random Over-Sampling
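Random over-sampling works in the opposite direction to under-sampling: it repeats minority-class rows, drawn with replacement, until both classes have the same count. The sketch below does this by hand on toy labels. It illustrates the idea only and is not the internals of `RandomOverSampler`; the `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy labels: 95 non-fraud rows and 5 fraud rows (hypothetical data)
toy_y = pd.Series([0] * 95 + [1] * 5, name="Class")

rng = np.random.default_rng(42)
majority_idx = toy_y[toy_y == 0].index.to_numpy()
minority_idx = toy_y[toy_y == 1].index.to_numpy()

# Draw extra minority rows with replacement to close the gap to the majority count
n_extra = len(majority_idx) - len(minority_idx)
extra_minority = rng.choice(minority_idx, size=n_extra, replace=True)

# Keep every original row and append the duplicated minority rows
balanced_idx = np.concatenate([majority_idx, minority_idx, extra_minority])
toy_y_over = toy_y.loc[balanced_idx]

print(toy_y_over.value_counts())
```

Since the added rows are exact copies, no new information is created, and the resampling should be applied to the training split only, as the notebook does.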

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros=RandomOverSampler(random_state=42)
x_ros,y_ros=ros.fit_resample(X_train,y_train)


In [None]:
x_ros.shape

In [None]:
# Visualize class distribution before and after sampling
plt.figure(figsize=(12, 6))

# Plot original class distribution
plt.subplot(1, 2, 1)
sns.countplot(x=y)
plt.title("Original Class Distribution")

# Plot class distribution after random over-sampling
plt.subplot(1, 2, 2)
sns.countplot(x=y_ros)
plt.title("Class Distribution After Over-sampling")

plt.tight_layout()
plt.show()

# Evaluating the model
Since the Random Forest Classifier performed well on the under-sampled data, I am going to use it on the randomly over-sampled data as well.

In [None]:
rf=RandomForestClassifier()

In [None]:
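# fit_and_score loops over models.items(), so pass a dict of name -> model
model_scores=fit_and_score(models={'Random Forest Classifier': rf}, X_train=x_ros,X_test=X_test,y_train=y_ros ,y_test=y_test)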
model_scores