In this project, we'll use a dataset from Kaggle that contains information about credit card transactions and whether they were fraudulent or not.

The dataset has 31 variables:

id: Unique identifier for each transaction.

V1-V28: Anonymized features representing various transaction attributes (e.g., time, location).

Amount: The transaction amount.

Class: Binary label indicating whether the transaction is fraudulent (1) or not (0).

Our objective is to build a simple neural network model that predicts whether a transaction is fraudulent.

In [13]:
import pandas as pd
from get_path_from_config import get_path
import numpy as np
from tensorflow import keras
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt

In [14]:
# Load the data. Fail early if the path is not configured,
# because every later cell depends on `data`.
creditcard_path = get_path('creditcard_path')
if creditcard_path is None:
    raise ValueError("CSV file path not configured.")
data = pd.read_csv(creditcard_path)

data.head()

Unnamed: 0,id,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-0.260648,-0.469648,2.496266,-0.083724,0.129681,0.732898,0.519014,-0.130006,0.727159,...,-0.110552,0.217606,-0.134794,0.165959,0.12628,-0.434824,-0.08123,-0.151045,17982.1,0
1,1,0.9851,-0.356045,0.558056,-0.429654,0.27714,0.428605,0.406466,-0.133118,0.347452,...,-0.194936,-0.605761,0.079469,-0.577395,0.19009,0.296503,-0.248052,-0.064512,6531.37,0
2,2,-0.260272,-0.949385,1.728538,-0.457986,0.074062,1.419481,0.743511,-0.095576,-0.261297,...,-0.00502,0.702906,0.945045,-1.154666,-0.605564,-0.312895,-0.300258,-0.244718,2513.54,0
3,3,-0.152152,-0.508959,1.74684,-1.090178,0.249486,1.143312,0.518269,-0.06513,-0.205698,...,-0.146927,-0.038212,-0.214048,-1.893131,1.003963,-0.51595,-0.165316,0.048424,5384.44,0
4,4,-0.20682,-0.16528,1.527053,-0.448293,0.106125,0.530549,0.658849,-0.21266,1.049921,...,-0.106984,0.729727,-0.161666,0.312561,-0.414116,1.071126,0.023712,0.419117,14278.97,0


In [15]:
# data.info() lists the dtype and non-null count of every column, so we can check for missing values.
# The data is already clean, so in this case there are none.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568630 entries, 0 to 568629
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   id      568630 non-null  int64  
 1   V1      568630 non-null  float64
 2   V2      568630 non-null  float64
 3   V3      568630 non-null  float64
 4   V4      568630 non-null  float64
 5   V5      568630 non-null  float64
 6   V6      568630 non-null  float64
 7   V7      568630 non-null  float64
 8   V8      568630 non-null  float64
 9   V9      568630 non-null  float64
 10  V10     568630 non-null  float64
 11  V11     568630 non-null  float64
 12  V12     568630 non-null  float64
 13  V13     568630 non-null  float64
 14  V14     568630 non-null  float64
 15  V15     568630 non-null  float64
 16  V16     568630 non-null  float64
 17  V17     568630 non-null  float64
 18  V18     568630 non-null  float64
 19  V19     568630 non-null  float64
 20  V20     568630 non-null  float64
 21  V21     56

The variable 'Class' is stored as an integer but, since it can only take the values 0 and 1, we convert it to a categorical variable.

In [16]:
data["Class"] = data["Class"].astype("category")
print(data["Class"].dtype)

category


Now we can split the data into two datasets: train and test.

In [17]:
# Boolean mask for the train/test split: roughly 80% of the rows go to train
msk = np.random.rand(len(data)) < 0.8

train = data[msk]
test = data[~msk]

# Separate the explanatory variables from the target variable
x_train, y_train = train.drop("Class", axis=1), train["Class"]
x_test, y_test = test.drop("Class", axis=1), test["Class"]
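
The random mask above changes on every run, so the train and test sets (and any reported metrics) will differ between executions. A minimal sketch of a reproducible variant follows, assuming a seeded NumPy generator is acceptable; the helper name `split_mask` and the seed value are illustrative and not part of the original notebook. Each row is assigned independently, so the training share is close to 80% rather than exactly 80%.

```python
import numpy as np

def split_mask(n_rows, train_fraction=0.8, seed=42):
    """Return a boolean mask that is True for rows assigned to the training set.

    Hypothetical helper: the name and the default seed are assumptions.
    """
    rng = np.random.default_rng(seed)  # seeded generator, same mask on every run
    return rng.random(n_rows) < train_fraction

# Demo on a made-up row count; in the notebook this would be split_mask(len(data))
msk_demo = split_mask(10_000)
print(msk_demo.mean())  # share of rows in the training set, close to 0.8
```

In the notebook, `train = data[msk_demo]` and `test = data[~msk_demo]` would then be used exactly as before, provided the mask is built with `len(data)` rows.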

We now build our model. It is a sequential network with three dense layers. The first two layers use a ReLU activation function, while the output layer applies a sigmoid function, which returns a value between 0 and 1 that can be read as the predicted probability of fraud. We make this choice because this is a binary classification problem.
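
A sigmoid output is a continuous value in the interval (0, 1), and a class label is only obtained by thresholding it afterwards. The sketch below illustrates this with NumPy alone; the input values and the 0.5 threshold are assumptions chosen for illustration, not values taken from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation values of the output unit
logits = np.array([-4.0, 0.0, 4.0])

probabilities = sigmoid(logits)               # continuous values between 0 and 1
labels = (probabilities >= 0.5).astype(int)   # explicit thresholding step

print(probabilities)
print(labels)
```

Lowering the threshold would flag more transactions as fraudulent, trading precision for recall.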

In [18]:
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(x_train.shape[1],)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])

In [19]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [20]:
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x23e9d37a8d0>

In [21]:
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f"Loss: {test_loss}, Accuracy: {test_accuracy}")


Loss: 0.013413510285317898, Accuracy: 0.997198224067688


In [22]:
predictions = model.predict(x_test)
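
`model.predict` returns one probability per row, in an array of shape (n, 1), so the values have to be flattened and thresholded before they can be compared with the true labels. The sketch below shows one way to do this and to derive the confusion matrix, accuracy, precision and recall by hand. The helper `binary_metrics`, the 0.5 threshold and the demo values are assumptions for illustration; the scikit-learn functions imported at the top of the notebook compute the same quantities.

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold probabilities and compute basic binary-classification metrics.

    Hypothetical helper, not part of the original notebook.
    """
    y_true = np.asarray(y_true).astype(int).ravel()
    # Flatten an (n, 1) prediction array to (n,) before thresholding
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # fraud correctly flagged
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # legitimate correctly passed
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # legitimate flagged as fraud
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # fraud that was missed

    return {
        # Rows are the true class, columns the predicted class
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels and probabilities, only to show the call
y_true_demo = [0, 0, 1, 1, 1, 0]
y_prob_demo = [[0.1], [0.7], [0.9], [0.4], [0.8], [0.2]]
metrics_demo = binary_metrics(y_true_demo, y_prob_demo)
print(metrics_demo)
```

With the notebook's variables, the call would be `binary_metrics(y_test, predictions)`.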

