# Credit Card Fraud Detection Model


## Initial Observations

I found this dataset on Kaggle (https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023). My initial observation is that the amount variable may need to be transformed during preprocessing.

## Import Statements

In [2]:
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sea

# Modeling and evaluation
import statsmodels.api as sm
from kneed import KneeLocator
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

sea.set(style='ticks')



## Importing Data

In [54]:
card_data = pd.read_csv("C:/Users/james/OneDrive/Desktop/Datasets for Projects/Credit Card Fraud Detection/creditcard_2023.csv")


## Preprocessing

To preprocess the data, I drop the `id` column and standardize every feature with the `scale` function from `sklearn.preprocessing`, so that each column has zero mean and unit variance. I then separate the feature names (`x`) from the target (`y`) before variable selection.

In [55]:
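# Drop the row identifier column
card_data = card_data.drop(columns='id')

# Standardize every feature column (zero mean, unit variance); keep the labels unscaled
features = card_data.drop(columns='Class')
preprocessed = pd.DataFrame(preprocessing.scale(features), columns=features.columns)
preprocessed['Class'] = card_data['Class']

# Feature names and target name used for variable selection
x = preprocessed.columns.drop('Class')
y = ['Class']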
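
As a quick illustration of what the scaling step does, the sketch below checks on a small made-up array that `preprocessing.scale` matches the manual z-score `(value - mean) / std`, computed per column with the population standard deviation. The `toy` array and its numbers are placeholders, not values from the dataset.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 4 rows, 2 feature columns (not from the Kaggle file)
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 100.0]])

# Column-wise z-score by hand; np.std defaults to the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same transformation via scikit-learn
scaled = preprocessing.scale(toy)

print(np.allclose(manual, scaled))   # the two results agree
print(scaled.mean(axis=0).round(6))  # column means are approximately 0
print(scaled.std(axis=0).round(6))   # column standard deviations are approximately 1
```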


## Variable Selection

To reduce the number of variables, I used recursive feature elimination (RFE) with logistic regression as the estimator. With its default settings, RFE keeps half of the features, which here means 14 of the 29.

In [56]:
logreg = LogisticRegression()
rfe = RFE(logreg)
rfe = rfe.fit(preprocessed[x], preprocessed[y].values.ravel())
print(rfe.support_)
print(rfe.ranking_)

[ True False  True  True False False  True  True  True  True  True  True
 False  True False  True  True  True False False False  True False False
 False False False False False]
[ 1  8  1  1 15  3  1  1  1  1  1  1 14  1  5  1  1  1 13 11  4  1  2  6
  7 12 10  9 16]
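
In the output above, `support_` marks the kept features with `True`, and `ranking_` gives every kept feature rank 1, with larger numbers for features that were eliminated earlier. The sketch below shows the same two attributes on synthetic data, with `n_features_to_select` set explicitly instead of relying on the default. The data from `make_classification` and the choice of 3 features are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 300 rows, 6 features (not the credit card data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Keep exactly 3 features instead of the default (half of them)
rfe_demo = RFE(LogisticRegression(max_iter=200), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.support_)   # boolean mask: True for the 3 kept features
print(rfe_demo.ranking_)   # kept features have rank 1; larger rank = dropped earlier
```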


I then built the list of final variables from this output with a for loop, rather than sorting through the values by hand.

In [66]:
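# Avoid naming the variable `list`, which would shadow the Python built-in
support = rfe.support_.tolist()
final_variables = []
# Walk the feature names and the RFE mask together, keeping the selected names
for name, keep in zip(x, support):
    if keep:
        final_variables.append(name)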
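
Because `rfe.support_` is a boolean array aligned with the feature names, the same list can also be obtained by indexing with the mask directly. A minimal sketch with placeholder names and a made-up mask (the `demo_` variables are hypothetical stand-ins for `x` and `rfe.support_`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for x (feature names) and rfe.support_ (boolean mask)
demo_names = pd.Index(['V1', 'V2', 'V3', 'Amount'])
demo_support = np.array([True, False, True, False])

# Loop version: walk names and mask together, keep names flagged True
loop_selected = []
for name, keep in zip(demo_names, demo_support):
    if keep:
        loop_selected.append(name)

# Mask version: boolean indexing on the pandas Index gives the same names
mask_selected = demo_names[demo_support].tolist()

print(loop_selected)   # → ['V1', 'V3']
print(mask_selected)   # → ['V1', 'V3']
```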

Next I am checking the p-values of the chosen variables to ensure I don't need to do another pass of feature elimination. 

In [68]:
X = preprocessed[final_variables]
y = card_data['Class']
# Fit a logit model with statsmodels to get a p-value for each coefficient
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.144060
         Iterations 9
                          Results: Logit
Model:              Logit            Method:           MLE        
Dependent Variable: Class            Pseudo R-squared: 0.792      
Date:               2023-09-23 13:55 AIC:              163861.5868
No. Observations:   568630           BIC:              164019.1006
Df Model:           13               Log-Likelihood:   -81917.    
Df Residuals:       568616           LL-Null:          -3.9414e+05
Converged:          1.0000           LLR p-value:      0.0000     
No. Iterations:     9.0000           Scale:            1.0000     
--------------------------------------------------------------------
          Coef.    Std.Err.       z       P>|z|     [0.025    0.975]
--------------------------------------------------------------------
V1       -0.3546     0.0106    -33.4496   0.0000   -0.3754   -0.3338
V3       -0.3596     0.0116    -30.9643   0.0

## Data Splitting and Training

Next I split the data into training and testing sets, holding out 15% of the original data for testing, and fit a logistic regression on the training set.

In [72]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
logreg = LogisticRegression(solver='lbfgs', max_iter=200)
logreg.fit(X_train, y_train)
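
As a sanity check on what the split produces, the sketch below runs `train_test_split` with the same `test_size=0.15` and `random_state=0` on a small synthetic dataset and prints the resulting sizes. The `stratify` argument is an optional addition that is not used in the notebook; it keeps the class proportions the same in both parts. The arrays here are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 200 rows, 2 features, perfectly balanced 0/1 labels
X_demo = np.arange(400, dtype=float).reshape(200, 2)
y_demo = np.array([0, 1] * 100)

# Same test_size and random_state as the notebook; stratify is an optional extra
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=0, stratify=y_demo)

print(len(X_tr), len(X_te))        # roughly 85% / 15% of the 200 rows
print(y_tr.mean(), y_te.mean())    # share of class 1 in each part
```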

## Model Utilization

To finish, I generate predictions from the trained model for both the training and test sets so that its performance can be assessed.

In [73]:
y_train_predictions = logreg.predict(X_train)
y_test_predictions = logreg.predict(X_test)
y_train_predictions, y_test_predictions

(array([0, 1, 0, ..., 0, 1, 1], dtype=int64),
 array([1, 1, 1, ..., 0, 1, 0], dtype=int64))
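
`predict` returns hard 0/1 labels. For fraud detection it can also be useful to look at the estimated probabilities from `predict_proba`, since the decision cutoff can then be adjusted. The sketch below, on synthetic data, shows that thresholding the class-1 probability at 0.5 reproduces `predict`. The data and the alternative 0.3 cutoff are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the credit card data)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

model = LogisticRegression(solver='lbfgs', max_iter=200).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1
proba_class1 = model.predict_proba(X_demo)[:, 1]

# Thresholding at 0.5 gives the same labels as predict
labels_at_half = (proba_class1 > 0.5).astype(int)
print(np.array_equal(labels_at_half, model.predict(X_demo)))

# A lower cutoff (0.3 is an arbitrary example) flags at least as many rows as class 1
labels_at_03 = (proba_class1 > 0.3).astype(int)
print(labels_at_half.sum(), labels_at_03.sum())
```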

## Assessing Model Performance

Last but not least, I created a dataframe that collects the model's performance metrics, so that different models can be compared easily.

In [74]:
lr_train_MSE = mean_squared_error(y_train, y_train_predictions)
lr_train_R2 = r2_score(y_train, y_train_predictions)

lr_test_MSE = mean_squared_error(y_test, y_test_predictions)
lr_test_R2 = r2_score(y_test, y_test_predictions)

results = pd.DataFrame(['Logistic Regression', lr_train_MSE, lr_test_MSE, lr_train_R2, lr_test_R2]).transpose()
results.columns = ['Model', 'Train MSE', 'Test MSE', 'Train R^2', 'Test R^2']
results


|   | Model | Train MSE | Test MSE | Train R^2 | Test R^2 |
|---|-------|-----------|----------|-----------|----------|
| 0 | Logistic Regression | 0.035094 | 0.035313 | 0.859625 | 0.858749 |
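
Because both the labels and the predictions are 0 or 1, every squared error is either 0 or 1, so the MSE here is simply the fraction of misclassified rows and accuracy equals 1 - MSE. The reported test MSE of 0.035313 therefore corresponds to an accuracy of about 96.5%. Likewise, R^2 equals 1 - MSE / Var(y); if the classes are evenly split, Var(y) is 0.25, which is consistent with the train values above (1 - 0.035094 / 0.25 ≈ 0.8596). The sketch below verifies these identities on a small made-up example; the arrays are placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# Made-up labels and predictions: 8 rows, 1 misclassified
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 0, 1])

mse = mean_squared_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(mse)   # → 0.125 (1 error out of 8)
print(acc)   # → 0.875, i.e. 1 - MSE
print(r2)    # → 0.5, i.e. 1 - MSE / Var(y) with Var(y) = 0.25
```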


## Conclusions

Because the data is anonymized, I couldn't learn what each variable actually represents or how it contributes to detecting credit card fraud. One thing that surprised me was that recursive feature elimination recommended removing the purchase amount. Even after the variable was standardized, leaving it out still raised the performance of the model. I chose to assess the model using the mean squared error and the R^2 value, and I stored the results in a pandas dataframe to make it easier to compare multiple models.
