# Credit Card Fraud Prediction SCRUM method Pablo Sánchez Arias

In [1]:
# Importing the necessary libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from mpl_toolkits.axes_grid1 import make_axes_locatable
from google.colab import files

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import plotly.express as px
import plotly.graph_objects as go

In [2]:
uploaded = files.upload() # This is the dataset file: https://www.kaggle.com/datasets/neharoychoudhury/credit-card-fraud-data/data?select=fraud_data.csv

Saving fraud_data.csv to fraud_data.csv


In [3]:
df = pd.read_csv('/content/fraud_data.csv')
df.head()

Unnamed: 0,trans_date_trans_time,merchant,category,amt,city,state,lat,long,city_pop,job,dob,trans_num,merch_lat,merch_long,is_fraud
0,04-01-2019 00:58,"""Stokes, Christiansen and Sipes""",grocery_net,14.37,Wales,AK,64.7556,-165.6723,145,"""Administrator, education""",09-11-1939,a3806e984cec6ac0096d8184c64ad3a1,65.654142,-164.722603,1
1,04-01-2019 15:06,Predovic Inc,shopping_net,966.11,Wales,AK,64.7556,-165.6723,145,"""Administrator, education""",09-11-1939,a59185fe1b9ccf21323f581d7477573f,65.468863,-165.473127,1
2,04-01-2019 22:37,Wisozk and Sons,misc_pos,49.61,Wales,AK,64.7556,-165.6723,145,"""Administrator, education""",09-11-1939,86ba3a888b42cd3925881fa34177b4e0,65.347667,-165.914542,1
3,04-01-2019 23:06,Murray-Smitham,grocery_pos,295.26,Wales,AK,64.7556,-165.6723,145,"""Administrator, education""",09-11-1939,3a068fe1d856f0ecedbed33e4b5f4496,64.445035,-166.080207,1
4,04-01-2019 23:59,Friesen Lt,health_fitness,18.17,Wales,AK,64.7556,-165.6723,145,"""Administrator, education""",09-11-1939,891cdd1191028759dc20dc224347a0ff,65.447094,-165.446843,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14446 entries, 0 to 14445
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   trans_date_trans_time  14446 non-null  object 
 1   merchant               14446 non-null  object 
 2   category               14446 non-null  object 
 3   amt                    14446 non-null  float64
 4   city                   14446 non-null  object 
 5   state                  14446 non-null  object 
 6   lat                    14446 non-null  float64
 7   long                   14446 non-null  float64
 8   city_pop               14446 non-null  int64  
 9   job                    14446 non-null  object 
 10  dob                    14446 non-null  object 
 11  trans_num              14446 non-null  object 
 12  merch_lat              14446 non-null  float64
 13  merch_long             14446 non-null  float64
 14  is_fraud               14446 non-null  object 
dtypes: float64(5), int64(1), object(9)

In [5]:
df.columns

Index(['trans_date_trans_time', 'merchant', 'category', 'amt', 'city', 'state',
       'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'merch_lat',
       'merch_long', 'is_fraud'],
      dtype='object')

## Data Visualization

Different features were visualized using a variety of charts and graphs to uncover patterns and relationships, aiding in the understanding of their impact on the dataset. Scatter plots, bar charts, choropleth maps, histograms, and violin plots were employed to identify distributions that influence data behavior.

In [6]:
# Cleaning the 'is_fraud' column: extracting the first digit of each value to deal with malformed data, then converting to integer
df['is_fraud'] = df['is_fraud'].astype(str).str.extract(r'(\d)').fillna(0).astype(int)
# Printing unique values in the 'is_fraud' column to verify the conversion
print("Unique values in 'is_fraud' column:", df['is_fraud'].unique())

Unique values in 'is_fraud' column: [1 0]


This code handles cases where the `is_fraud` column contains malformed or unexpected data. The regular expression `(\d)` captures the first digit in each value, so an entry such as "1.0" becomes 1. A value that contains no digit at all, such as "Yes", produces a missing value, which `fillna(0)` turns into 0 (non-fraudulent). The result is a column that contains only integers (specifically 0 and 1, as the printed unique values confirm).
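To make the cleaning step concrete, here is a minimal sketch on a handful of invented values (they are not taken from `fraud_data.csv`). It uses `expand=False` and `fillna('0')`, small variations on the cell above chosen so that the example works on a plain Series of strings.

```python
import pandas as pd

# Hypothetical raw values, invented for illustration only
raw = pd.Series(['1', '0', '1.0', '0"2019-01-01', 'Yes'])

# expand=False returns a Series with the first digit found, or NaN if there is none
first_digit = raw.astype(str).str.extract(r'(\d)', expand=False)

# Values without any digit ('Yes') become 0, then everything is cast to int
cleaned = first_digit.fillna('0').astype(int)
print(cleaned.tolist())  # [1, 0, 1, 0, 0]
```

Note that a value with no digit is silently labelled as non-fraudulent, so it may be worth counting such rows before filling them.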

In [7]:
# Grouping by product category and counting the number of fraud occurrences
fraud_counts_by_category = df[df['is_fraud'] == 1].groupby('category').size().reset_index(name='fraud_count')

# Sorting the results in descending order by fraud_count
fraud_counts_by_category = fraud_counts_by_category.sort_values(by='fraud_count', ascending=False)

# Displaying the first five rows
fraud_counts_by_category.head()

Unnamed: 0,category,fraud_count
4,grocery_pos,444
11,shopping_net,396
8,misc_net,223
12,shopping_pos,194
2,gas_transport,159


Total transactions by state alongside the number of fraudulent transactions in each state.

In [8]:
# Filtering for fraudulent transactions, grouping by state, and counting occurrences
frauds_by_state = df[df['is_fraud'] == 1].groupby('state').size().reset_index(name='fraud_count')

# Grouping by state to count total transactions and merge with fraud counts
merged_df = df.groupby('state').size().reset_index(name='total_transactions').merge(frauds_by_state, on='state', how='left')

# Sorting the merged DataFrame by fraud_count in descending order
merged_df = merged_df.sort_values(by='fraud_count', ascending=False)

# Displaying the merged DataFrame
merged_df

Unnamed: 0,state,total_transactions,fraud_count
2,CA,3375,411
6,MO,2329,267
7,NE,1460,238
9,OR,1211,197
11,WA,1150,126
8,NM,1003,121
12,WY,1100,119
3,CO,856,115
10,UT,597,73
0,AK,173,65


Choropleth map to visualize statewise fraud rates

In [9]:
# Defining fraud rate
merged_df['fraud_rate'] = merged_df['fraud_count'] / merged_df['total_transactions'] * 100

# Creating a choropleth map with auto-sizing
fig = px.choropleth(
    merged_df, locations='state', locationmode='USA-states', color='fraud_rate',
    color_continuous_scale=px.colors.sequential.Inferno, scope='usa',
    title='Fraud Rate by State', labels={'fraud_rate': 'Fraud Rate (%)'}
)

# Automatically scale to fit the display
fig.update_layout(
    autosize=True,
    height=600,  # You can adjust height as needed
    margin=dict(l=0, r=0, t=50, b=0)  # Adjust margins to fit better
)

# Displaying the plot
fig.show()
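The fraud rate can rank states very differently from the raw fraud count. As a quick check, the sketch below recomputes the rate for three rows copied from the merged table shown earlier; the name `sample` is only used for this illustration.

```python
import pandas as pd

# Three rows copied from the merged state table above
sample = pd.DataFrame({
    'state': ['CA', 'NE', 'AK'],
    'total_transactions': [3375, 1460, 173],
    'fraud_count': [411, 238, 65],
})

# Same formula as in the notebook: share of fraudulent transactions, in percent
sample['fraud_rate'] = sample['fraud_count'] / sample['total_transactions'] * 100
print(sample.round(2))
```

CA has the most fraudulent transactions (411), but its rate is about 12.18%, whereas AK, with only 65 fraudulent transactions, has a rate of about 37.57%.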

In [10]:
# Calculating the minimum and maximum values for the 'amt' column
min_amt = df['amt'].min()
max_amt = df['amt'].max()

print(f"Minimum value in 'amt': {min_amt}")
print(f"Maximum value in 'amt': {max_amt}")

Minimum value in 'amt': 1.0
Maximum value in 'amt': 3261.47


In [11]:
# Define the ranges based on the min and max values
ranges = [1, 500, 1000, 2000, 3261.47]

# Create labels for those ranges
labels = ['1-500', '501-1000', '1001-2000', '2001-3261.47']

# Categorize the 'amt' values into these ranges
df['amt_range'] = pd.cut(df['amt'], bins=ranges, labels=labels, include_lowest=True)

# Count the number of occurrences within each range
range_counts = df['amt_range'].value_counts().sort_index()

# Display the counts for each range
print(range_counts)

amt_range
1-500           13457
501-1000          731
1001-2000         253
2001-3261.47        5
Name: count, dtype: int64


In [12]:
#Converting date of birth:'dob' and transaction date and time:'trans_date_trans_time' columns to datetime format
df['dob'] = pd.to_datetime(df['dob'], format='%d-%m-%Y')
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'], format='%d-%m-%Y %H:%M')

#Extracting date and time into separate columns
df['trans_date'] = df['trans_date_trans_time'].dt.date
df['trans_time'] = df['trans_date_trans_time'].dt.time

#Calculating approximate age in years as transaction year minus year of birth
df['age'] = df['trans_date_trans_time'].dt.year - df['dob'].dt.year

#Filtering for fraudulent transactions
fraud_df = df[df['is_fraud'] == 1]

In [13]:
fraud_df = fraud_df.copy()  # Avoid SettingWithCopyWarning
bins = range(0, 101, 10)
labels = [f'{i}-{i+9}' for i in bins[:-1]]

# Cut the 'age' into the bins
fraud_df['age_group'] = pd.cut(fraud_df['age'], bins=bins, labels=labels, right=False)

# Ensure the age groups are sorted correctly as an ordered categorical variable
fraud_df['age_group'] = pd.Categorical(fraud_df['age_group'], categories=labels, ordered=True)

# Sorting the DataFrame by the age_group to ensure correct order
fraud_df = fraud_df.sort_values(by='age_group')

# Create the histogram using Plotly Express
fig = px.histogram(
    fraud_df,
    x='age_group',
    title='Histogram of Fraudulent Transactions by Age Group',
    labels={'age_group': 'Age Group', 'count': 'Number of Fraudulent Transactions'},
)

# Update the layout to adjust the axes
fig.update_layout(
    xaxis_title='Age Group',
    yaxis_title='Number of Fraudulent Transactions'
)

# Show the plot
fig.show()

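The distribution of age groups among fraudulent transactions shows that cardholders aged 50-60 account for the largest number of fraud cases. Because the histogram shows raw counts rather than counts relative to each group's total transactions, it indicates where fraud is concentrated, not necessarily which group has the highest fraud rate.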
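The histogram counts fraudulent transactions per age group. A complementary view divides those counts by the total number of transactions in each group. The sketch below shows the idea on a small invented table (`toy`); in the notebook the same grouping would be applied to `df['age']` and `df['is_fraud']`.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the dataset
toy = pd.DataFrame({
    'age':      [25, 27, 31, 52, 55, 58, 58, 61, 64, 73],
    'is_fraud': [0,  1,  0,  1,  0,  0,  1,  0,  0,  1],
})

# Same ten-year bins and labels as in the histogram cell
bins = range(0, 101, 10)
labels = [f'{i}-{i+9}' for i in bins[:-1]]
toy['age_group'] = pd.cut(toy['age'], bins=bins, labels=labels, right=False)

# The mean of a 0/1 column is the share of fraudulent rows in each group
rate = toy.groupby('age_group', observed=True)['is_fraud'].agg(['sum', 'count', 'mean'])
rate = rate.rename(columns={'sum': 'fraud_count', 'count': 'transactions', 'mean': 'fraud_rate'})
print(rate)
```

In this toy example the 50-59 group has the most fraud cases (2), while the 70-79 group has the highest fraud rate (1 of 1).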

## Model training and prediction

In [14]:
#Extracting hour of transaction
df['transaction_hour'] = df['trans_date_trans_time'].dt.hour

#Dropping irrelevant columns
df = df.drop(columns=['trans_date_trans_time', 'trans_date', 'dob', 'trans_num','trans_time', 'merchant','state','city', 'amt_range'])


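These columns are dropped because they are redundant, have already been transformed into more useful features, or do not contribute meaningfully to the model. `trans_date_trans_time`, `trans_date`, `trans_time`, and `dob` are replaced by `transaction_hour` and `age`; `amt_range` is derived from `amt`; `trans_num` is a unique transaction identifier; and `merchant`, `city`, and `state` are text columns (location is still represented by `lat` and `long`). Dropping them simplifies the model, reduces noise, and lowers the risk of overfitting.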

In [15]:
#Choosing categorical columns to fit into model training
categorical_columns = ['category', 'job']
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

In [16]:
#Ensuring the target variable is an integer (it was already converted during cleaning, so this is a safeguard)
df['is_fraud'] = df['is_fraud'].astype(int)

In [17]:
#Features and target
X = df.drop(columns=['is_fraud'])
y = df['is_fraud']

In [18]:
df.head()

Unnamed: 0,category,amt,lat,long,city_pop,job,merch_lat,merch_long,is_fraud,age,transaction_hour
0,3,14.37,64.7556,-165.6723,145,1,65.654142,-164.722603,1,80,0
1,11,966.11,64.7556,-165.6723,145,1,65.468863,-165.473127,1,80,15
2,9,49.61,64.7556,-165.6723,145,1,65.347667,-165.914542,1,80,22
3,4,295.26,64.7556,-165.6723,145,1,64.445035,-166.080207,1,80,23
4,5,18.17,64.7556,-165.6723,145,1,65.447094,-165.446843,1,80,23


- **Label Encoding**: Converts categorical data into numeric form to make it compatible with machine learning models.
- **Target Conversion**: Ensures the target variable is in an appropriate numeric format.
- **Feature and Target Separation**: Prepares the dataset for model training by separating the input features (`X`) from the output label (`y`).

This code is essentially the final step in preprocessing the data, getting it ready to be fed into a machine learning model for training.
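To see what label encoding produces, the following sketch fits a `LabelEncoder` on a short invented list of category strings and prints the resulting mapping. The list `categories` is an assumption for illustration; in the notebook the fitted encoders are stored in `label_encoders`.

```python
from sklearn.preprocessing import LabelEncoder

# Small invented sample of category strings
categories = ['grocery_pos', 'shopping_net', 'misc_net', 'grocery_pos', 'gas_transport']

le = LabelEncoder()
codes = le.fit_transform(categories)

# classes_ is sorted alphabetically; a value's code is its position in classes_
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
print(le.inverse_transform(codes[:2]))
```

The integer codes carry no meaning beyond alphabetical order. Tree-based models can usually cope with that, but for linear models a one-hot encoding would likely be a safer choice.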

In [19]:
#Splitting the data 80:20 for training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The 80:20 split is a commonly used heuristic in data science and machine learning: it balances having enough data for training against keeping a sufficiently large test set. It works best when the dataset is large enough that the 20% held out is still a robust sample; here, the 14,446 rows yield 11,556 training rows and 2,890 test rows.
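For imbalanced targets, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both subsets. The sketch below demonstrates this on synthetic data (`X_demo` and `y_demo` are invented for illustration); it is a variation on the split above, not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 1,000 rows with 10% positives (illustrative only)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the 10% positive share in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of positives in the test set varies with the random seed, which makes metrics on the minority class less stable.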

In [20]:
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

These values are based on common practices, prior experience, and the need to explore different hyperparameters without making the search space excessively large.
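The size of the search space can be checked before running it. The sketch below counts the parameter combinations with scikit-learn's `ParameterGrid` and multiplies by the five cross-validation folds used in the grid-search cell of this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined in the cell above
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 * 3 * 2
cv_folds = 5                                   # matches cv=5 in GridSearchCV
print(n_candidates, n_candidates * cv_folds)   # 216 1080
```

So the search fits 1,080 forests in total, plus one final refit on the full training set when `refit` is left at its default.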

Because I will be working with a `Random Forest`, there is no need to standardize the features: tree splits depend only on the ordering of values within each feature, not on their scale. The choice of `Random Forest` was based on its robustness, ease of use, ability to handle complex relationships in the data, and strong performance in initial experiments.

In [21]:
# Initializing the model
rf = RandomForestClassifier(random_state=42)

# Setting up the GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=0) # verbose only used during development

# Fitting the grid search to the data
grid_search.fit(X_train, y_train)

In [22]:
#Output of the best parameters and score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_}")

Best Parameters: {'bootstrap': False, 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Best Score: 0.9809625695672419


In [23]:
best_params = grid_search.best_params_
final_rf = RandomForestClassifier(**best_params, random_state=42)
final_rf.fit(X_train, y_train)
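As a side note, `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`) and exposes it as `best_estimator_`, so retraining by hand should give an equivalent model. A minimal sketch on synthetic data follows; `X_demo`, `y_demo`, and the reduced grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic classification problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [10, 20], 'max_depth': [3, None]},
    cv=3,
)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already fitted on all of X_demo
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.predict(X_demo[:5]))
```

In the notebook, this would correspond to using `grid_search.best_estimator_` in place of `final_rf`.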

In [24]:
# Final prediction
y_pred = final_rf.predict(X_test)

y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [25]:
accuracy = accuracy_score(y_test, y_pred) # accuracy of the model
confusion = confusion_matrix(y_test, y_pred) # confusion matrix
report = classification_report(y_test, y_pred, output_dict=True) # classification report
report_df = pd.DataFrame(report).transpose() # converting classification report to dataframe for better visualisation

In [26]:
report_df

Unnamed: 0,precision,recall,f1-score,support
0,0.98654,0.99481,0.990658,2505.0
1,0.964286,0.911688,0.93725,385.0
accuracy,0.983737,0.983737,0.983737,0.983737
macro avg,0.975413,0.953249,0.963954,2890.0
weighted avg,0.983575,0.983737,0.983543,2890.0
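The confusion matrix is computed in the notebook but not displayed. Its entries can be recovered from the report, since recall times support gives the number of correctly classified rows per class. The sketch below does that arithmetic with values copied from the table above and checks the result against the reported precision and accuracy.

```python
# Support and recall values copied from the classification report above
support = {0: 2505, 1: 385}
recall = {0: 0.99481, 1: 0.911688}

tn = round(support[0] * recall[0])  # legitimate transactions correctly passed
tp = round(support[1] * recall[1])  # fraudulent transactions correctly flagged
fp = support[0] - tn                # legitimate transactions wrongly flagged
fn = support[1] - tp                # fraudulent transactions missed

# Rows are true classes, columns are predicted classes (scikit-learn's layout)
print([[tn, fp], [fn, tp]])  # [[2492, 13], [34, 351]]

# Consistency check against the reported class-1 precision and overall accuracy
print(round(tp / (tp + fp), 6), round((tn + tp) / sum(support.values()), 6))
```

The recovered counts reproduce the reported precision for Class 1 (0.964286) and the accuracy (0.983737), which suggests that 34 fraudulent transactions were missed and 13 legitimate ones were flagged.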


## Conclusion

**Strong Performance**: The model performs very well overall, with a test accuracy of 98.4%. On non-fraudulent transactions (Class 0) it reaches a precision of 0.987 and a recall of 0.995. On the more challenging fraudulent transactions (Class 1) it still reaches a precision of 0.964 and a recall of 0.912.

**Imbalance Management**: The lower recall for Class 1 means that, while the model is good at detecting fraud, about 8.8% of the fraudulent transactions in the test set are still missed. This is a point for further optimization if detecting all fraudulent transactions is critical.

**Ways to get a better score**: Because the data is imbalanced, I could have passed `stratify=y` to `train_test_split`, which keeps the fraud proportion the same in the training and test sets and makes the evaluation more reliable. Other models are also worth trying: `CatBoost`, for example, might give more accurate predictions across both classes and generalize better to unseen data.

## Bibliography

*   https://www.kaggle.com/code/renjiabarai/credit-card-fraud-prediction/notebook
*   https://www.kaggle.com/code/neharoychoudhury/credit-card-fraud-detection-analysis/notebook

