<a href="https://colab.research.google.com/github/Navyasri28/credit-card-fraud-detection/blob/main/CreditCardFraudDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Load the dataset
The credit card transaction dataset is downloaded from Kaggle with `kagglehub` and loaded into a pandas DataFrame. Each row describes one transaction: card type, entry mode, amount, transaction type, merchant group, the countries involved, and a `Fraud` flag marking whether the transaction was fraudulent. We start by previewing the rows and checking the structure of the data.

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("anurag629/credit-card-fraud-transaction-data")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/anurag629/credit-card-fraud-transaction-data?dataset_version_number=1...


100%|██████████| 1.63M/1.63M [00:00<00:00, 83.3MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/anurag629/credit-card-fraud-transaction-data/versions/1





In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv(path+"/CreditCardData.csv")

In [None]:
df.head()

Unnamed: 0,Transaction ID,Date,Day of Week,Time,Type of Card,Entry Mode,Amount,Type of Transaction,Merchant Group,Country of Transaction,Shipping Address,Country of Residence,Gender,Age,Bank,Fraud
0,#3577 209,14-Oct-20,Wednesday,19,Visa,Tap,£5,POS,Entertainment,United Kingdom,United Kingdom,United Kingdom,M,25.2,RBS,0
1,#3039 221,14-Oct-20,Wednesday,17,MasterCard,PIN,£288,POS,Services,USA,USA,USA,F,49.6,Lloyds,0
2,#2694 780,14-Oct-20,Wednesday,14,Visa,Tap,£5,POS,Restaurant,India,India,India,F,42.2,Barclays,0
3,#2640 960,13-Oct-20,Tuesday,14,Visa,Tap,£28,POS,Entertainment,United Kingdom,India,United Kingdom,F,51.0,Barclays,0
4,#2771 031,13-Oct-20,Tuesday,23,Visa,CVC,£91,Online,Electronics,USA,USA,United Kingdom,M,38.0,Halifax,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Transaction ID          100000 non-null  object 
 1   Date                    100000 non-null  object 
 2   Day of Week             100000 non-null  object 
 3   Time                    100000 non-null  int64  
 4   Type of Card            100000 non-null  object 
 5   Entry Mode              100000 non-null  object 
 6   Amount                  99994 non-null   object 
 7   Type of Transaction     100000 non-null  object 
 8   Merchant Group          99990 non-null   object 
 9   Country of Transaction  100000 non-null  object 
 10  Shipping Address        99995 non-null   object 
 11  Country of Residence    100000 non-null  object 
 12  Gender                  99996 non-null   object 
 13  Age                     100000 non-null  float64
 14  Bank                 

We identify and remove rows with missing values to maintain data integrity and avoid errors during model training.

In [None]:
df.isnull().sum()

Unnamed: 0,0
Transaction ID,0
Date,0
Day of Week,0
Time,0
Type of Card,0
Entry Mode,0
Amount,6
Type of Transaction,0
Merchant Group,10
Country of Transaction,0


In [None]:
df.dropna(inplace=True)
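
Dropping rows is cheap here because very few are affected: the notebook's own counts (100000 rows in `df.info()`, 99977 in the later `df.describe()`) imply 23 rows removed. A minimal sketch of checking that cost before calling `dropna()`, using a made-up frame whose values are illustrative and not taken from the dataset:

```python
import pandas as pd

# Toy frame with hypothetical values; None marks a missing entry.
toy = pd.DataFrame({
    "Amount": ["£5", None, "£28", "£91"],
    "Gender": ["M", "F", None, "M"],
    "Fraud": [0, 0, 0, 1],
})

# Count rows that contain at least one missing value.
rows_with_missing = int(toy.isnull().any(axis=1).sum())
share_lost = rows_with_missing / len(toy)

# Drop them without mutating the original frame.
cleaned = toy.dropna()
print(rows_with_missing, share_lost, len(cleaned))
```

If the share lost were large, imputing values would probably be a better choice than dropping rows.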

In [None]:
df['Fraud'].value_counts()

Fraud,count
0,92785
1,7192


In [None]:
# Fraud count as a percentage of the non-fraud count (a ratio, not the fraud share of all rows)
7192/92785 * 100

7.751252896481112
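
The figure above divides the fraud count by the non-fraud count, so it is a ratio between the two classes. The share of all transactions that are fraudulent uses the total as the denominator. A short sketch of both, using only the two counts printed by `value_counts()` (the variable names are illustrative):

```python
fraud, non_fraud = 7192, 92785  # counts printed by value_counts() above

# Fraud rows per 100 non-fraud rows.
ratio_pct = fraud / non_fraud * 100

# Fraud rows as a percentage of all rows.
share_pct = fraud / (fraud + non_fraud) * 100

print(f"ratio: {ratio_pct:.2f}%  share: {share_pct:.2f}%")
```

The share is the number that matches `value_counts(normalize=True)`.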

In [None]:
df.head()

Unnamed: 0,Transaction ID,Date,Day of Week,Time,Type of Card,Entry Mode,Amount,Type of Transaction,Merchant Group,Country of Transaction,Shipping Address,Country of Residence,Gender,Age,Bank,Fraud
0,#3577 209,14-Oct-20,Wednesday,19,Visa,Tap,£5,POS,Entertainment,United Kingdom,United Kingdom,United Kingdom,M,25.2,RBS,0
1,#3039 221,14-Oct-20,Wednesday,17,MasterCard,PIN,£288,POS,Services,USA,USA,USA,F,49.6,Lloyds,0
2,#2694 780,14-Oct-20,Wednesday,14,Visa,Tap,£5,POS,Restaurant,India,India,India,F,42.2,Barclays,0
3,#2640 960,13-Oct-20,Tuesday,14,Visa,Tap,£28,POS,Entertainment,United Kingdom,India,United Kingdom,F,51.0,Barclays,0
4,#2771 031,13-Oct-20,Tuesday,23,Visa,CVC,£91,Online,Electronics,USA,USA,United Kingdom,M,38.0,Halifax,1


In [None]:
df.drop('Transaction ID',axis=1,inplace=True)

In [None]:
df.columns

Index(['Date', 'Day of Week', 'Time', 'Type of Card', 'Entry Mode', 'Amount',
       'Type of Transaction', 'Merchant Group', 'Country of Transaction',
       'Shipping Address', 'Country of Residence', 'Gender', 'Age', 'Bank',
       'Fraud'],
      dtype='object')

In [None]:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
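
As a sanity check on the date parsing, the sketch below parses two strings copied from the `Date` preview with an explicit format and compares the resulting weekday with the `Day of Week` column. The format string `%d-%b-%y` is an assumption inferred from values such as `14-Oct-20`.

```python
import pandas as pd

# Sample strings copied from the Date column preview.
dates = pd.Series(["14-Oct-20", "13-Oct-20"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year.
parsed = pd.to_datetime(dates, format="%d-%b-%y")

# The weekday names should agree with the dataset's Day of Week column.
print(parsed.dt.day_name().tolist())
```

An explicit format avoids per-element format inference and makes an unexpected date layout fail loudly.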


In [None]:
df.columns

Index(['Date', 'Day of Week', 'Time', 'Type of Card', 'Entry Mode', 'Amount',
       'Type of Transaction', 'Merchant Group', 'Country of Transaction',
       'Shipping Address', 'Country of Residence', 'Gender', 'Age', 'Bank',
       'Fraud'],
      dtype='object')

In [None]:
for col in df.columns:
  if df[col].dtype == 'object':
    print(col)

Day of Week
Type of Card
Entry Mode
Amount
Type of Transaction
Merchant Group
Country of Transaction
Shipping Address
Country of Residence
Gender
Bank


In [None]:
df['Amount'] = df['Amount'].str[1:]  # drop the leading '£' symbol

In [None]:
df.head()

Unnamed: 0,Date,Day of Week,Time,Type of Card,Entry Mode,Amount,Type of Transaction,Merchant Group,Country of Transaction,Shipping Address,Country of Residence,Gender,Age,Bank,Fraud
0,2020-10-14,Wednesday,19,Visa,Tap,5,POS,Entertainment,United Kingdom,United Kingdom,United Kingdom,M,25.2,RBS,0
1,2020-10-14,Wednesday,17,MasterCard,PIN,288,POS,Services,USA,USA,USA,F,49.6,Lloyds,0
2,2020-10-14,Wednesday,14,Visa,Tap,5,POS,Restaurant,India,India,India,F,42.2,Barclays,0
3,2020-10-13,Tuesday,14,Visa,Tap,28,POS,Entertainment,United Kingdom,India,United Kingdom,F,51.0,Barclays,0
4,2020-10-13,Tuesday,23,Visa,CVC,91,Online,Electronics,USA,USA,United Kingdom,M,38.0,Halifax,1


In [None]:
df['Amount'] = df['Amount'].astype('float')
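
Slicing off the first character and casting to float works when every value has the shape `£<number>`. A slightly more defensive sketch, on hypothetical raw values, removes the symbol explicitly and turns anything unparseable into `NaN` instead of raising an error:

```python
import pandas as pd

# Hypothetical raw values in the same '£<number>' shape as the Amount column,
# plus one malformed entry for illustration.
raw = pd.Series(["£5", "£288", "£28", "n/a"])

# Remove the currency symbol, then convert; unparseable entries become NaN.
amount = pd.to_numeric(raw.str.replace("£", "", regex=False), errors="coerce")

print(amount.tolist())
```

Counting the resulting `NaN` values afterwards shows how many entries did not match the expected shape.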

In [None]:
df.describe()

Unnamed: 0,Date,Time,Amount,Age,Fraud
count,99977,99977.0,99977.0,99977.0,99977.0
mean,2020-10-13 12:02:43.765666304,14.5631,112.579933,44.993595,0.071937
min,2020-10-13 00:00:00,0.0,5.0,15.0,0.0
25%,2020-10-13 00:00:00,10.0,17.0,38.2,0.0
50%,2020-10-14 00:00:00,15.0,30.0,44.9,0.0
75%,2020-10-14 00:00:00,19.0,208.0,51.7,0.0
max,2020-10-16 00:00:00,24.0,400.0,86.1,1.0
std,,5.308202,123.435613,9.948121,0.258384


In [None]:
print(df.columns)

Index(['Date', 'Day of Week', 'Time', 'Type of Card', 'Entry Mode', 'Amount',
       'Type of Transaction', 'Merchant Group', 'Country of Transaction',
       'Shipping Address', 'Country of Residence', 'Gender', 'Age', 'Bank',
       'Fraud'],
      dtype='object')


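### One-Hot Encode Categorical Columns
Categorical columns such as day of the week, card type, country, and gender cannot be used directly by most machine learning models, so we convert them to numeric form with one-hot encoding, which creates one binary column per category. The hour in `Time` is encoded the same way. `drop_first=True` drops the first category of each column, since it is implied when all other indicators are false. The `Shipping Address` column is dropped beforehand.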

In [None]:
df.drop(['Shipping Address'], axis=1, inplace=True)


In [None]:
categorical_cols = [
    'Day of Week', 'Time', 'Type of Card', 'Entry Mode',
    'Type of Transaction', 'Merchant Group',
    'Country of Transaction', 'Country of Residence',
    'Gender', 'Bank'
]

df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
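
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame (toy values, not the real dataset). With two card types, only one indicator column survives, because the dropped category is implied when that indicator is false.

```python
import pandas as pd

# Toy frame with hypothetical rows.
toy = pd.DataFrame({
    "Type of Card": ["Visa", "MasterCard", "Visa"],
    "Amount": [5.0, 288.0, 5.0],
})

# Categories are sorted, so 'MasterCard' is the first one and gets dropped.
encoded = pd.get_dummies(toy, columns=["Type of Card"], drop_first=True)

print(list(encoded.columns))
```

This is why the notebook's output contains `Type of Card_Visa` but no `Type of Card_MasterCard` column.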

In [None]:
df.head()

Unnamed: 0,Date,Amount,Age,Fraud,Day of Week_Thursday,Day of Week_Tuesday,Day of Week_Wednesday,Time_1,Time_2,Time_3,...,Country of Residence_USA,Country of Residence_United Kingdom,Gender_M,Bank_Barlcays,Bank_HSBC,Bank_Halifax,Bank_Lloyds,Bank_Metro,Bank_Monzo,Bank_RBS
0,2020-10-14,5.0,25.2,0,False,False,True,False,False,False,...,False,True,True,False,False,False,False,False,False,True
1,2020-10-14,288.0,49.6,0,False,False,True,False,False,False,...,True,False,False,False,False,False,True,False,False,False
2,2020-10-14,5.0,42.2,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,2020-10-13,28.0,51.0,0,False,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
4,2020-10-13,91.0,38.0,1,False,True,False,False,False,False,...,False,True,True,False,False,True,False,False,False,False


In [None]:
print(df.columns)

Index(['Date', 'Amount', 'Age', 'Fraud', 'Day of Week_Thursday',
       'Day of Week_Tuesday', 'Day of Week_Wednesday', 'Time_1', 'Time_2',
       'Time_3', 'Time_4', 'Time_5', 'Time_6', 'Time_7', 'Time_8', 'Time_9',
       'Time_10', 'Time_11', 'Time_12', 'Time_13', 'Time_14', 'Time_15',
       'Time_16', 'Time_17', 'Time_18', 'Time_19', 'Time_20', 'Time_21',
       'Time_22', 'Time_23', 'Time_24', 'Type of Card_Visa', 'Entry Mode_PIN',
       'Entry Mode_Tap', 'Type of Transaction_Online',
       'Type of Transaction_POS', 'Merchant Group_Electronics',
       'Merchant Group_Entertainment', 'Merchant Group_Fashion',
       'Merchant Group_Food', 'Merchant Group_Gaming',
       'Merchant Group_Products', 'Merchant Group_Restaurant',
       'Merchant Group_Services', 'Merchant Group_Subscription',
       'Country of Transaction_India', 'Country of Transaction_Russia',
       'Country of Transaction_USA', 'Country of Transaction_United Kingdom',
       'Country of Residence_India', 'Cou

In [None]:
print(df.shape)

(99977, 61)


### Feature Scaling (Amount & Age)
The 'Amount' and 'Age' columns are standardized so that each has a mean of 0 and a standard deviation of 1. This puts both features on a common scale, so neither dominates scale-sensitive steps such as PCA simply because of its raw range. Note that the scaler here is fit on the full dataset, before the train/test split.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Amount', 'Age']] = scaler.fit_transform(df[['Amount', 'Age']])

In [None]:
print(df[['Amount', 'Age']].mean())
print(df[['Amount', 'Age']].std())

Amount    3.439818e-17
Age       1.642620e-15
dtype: float64
Amount    1.000005
Age       1.000005
dtype: float64
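
The printed standard deviation is 1.000005 and not exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` reports the sample standard deviation (`ddof=1`). The two differ by a factor of sqrt(n / (n - 1)). A small NumPy sketch of that relationship, using the five `Amount` values from the preview rows as sample data:

```python
import numpy as np

x = np.array([5.0, 288.0, 5.0, 28.0, 91.0])  # Amount values from the preview rows

# Standardize with the population standard deviation (ddof=0), as StandardScaler does.
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)     # 1 up to floating-point error
sample_std = z.std(ddof=1)  # sqrt(n / (n - 1)), the quantity pandas .std() reports

# Same factor for the full dataset (99977 rows after dropna()).
n = 99977
expected_pandas_std = float(np.sqrt(n / (n - 1)))
print(round(expected_pandas_std, 6))
```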


In [None]:
df[['Amount', 'Age']].head()

Unnamed: 0,Amount,Age
0,-0.871551,-1.989692
1,1.421153,0.463045
2,-0.871551,-0.280818
3,-0.685218,0.603776
4,-0.174828,-0.70301


In [None]:
df.head()

Unnamed: 0,Date,Amount,Age,Fraud,Day of Week_Thursday,Day of Week_Tuesday,Day of Week_Wednesday,Time_1,Time_2,Time_3,...,Country of Residence_USA,Country of Residence_United Kingdom,Gender_M,Bank_Barlcays,Bank_HSBC,Bank_Halifax,Bank_Lloyds,Bank_Metro,Bank_Monzo,Bank_RBS
0,2020-10-14,-0.871551,-1.989692,0,False,False,True,False,False,False,...,False,True,True,False,False,False,False,False,False,True
1,2020-10-14,1.421153,0.463045,0,False,False,True,False,False,False,...,True,False,False,False,False,False,True,False,False,False
2,2020-10-14,-0.871551,-0.280818,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,2020-10-13,-0.685218,0.603776,0,False,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
4,2020-10-13,-0.174828,-0.70301,1,False,True,False,False,False,False,...,False,True,True,False,False,True,False,False,False,False


In [None]:
print(df['Fraud'].value_counts(normalize=True) * 100)  #Check Class Imbalance

Fraud
0    92.806345
1     7.193655
Name: proportion, dtype: float64


### Split the dataset
The dataset is split into training and testing sets. We use 80% of the data for training the model and 20% for testing. Stratified sampling is applied to maintain the same fraud-to-non-fraud ratio in both sets.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('Fraud', axis=1)
y = df['Fraud']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


In [None]:
print("Training set class distribution:")
print(y_train.value_counts(normalize=True) * 100)

print("Test set class distribution:")
print(y_test.value_counts(normalize=True) * 100)

Training set class distribution:
Fraud
0    92.805791
1     7.194209
Name: proportion, dtype: float64
Test set class distribution:
Fraud
0    92.808562
1     7.191438
Name: proportion, dtype: float64
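
The near-identical class proportions above are what `stratify=y` is meant to produce. A minimal sketch on toy data (100 hypothetical rows with exactly 10% positives, chosen so the split divides evenly) shows the same behavior at a scale that is easy to check by hand:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 10% positives (hypothetical, chosen so the split is exact).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both subsets keep the 10% positive rate.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a small test set could end up with noticeably more or fewer fraud cases than the training set.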


In [None]:
print(df.dtypes)

Date                    datetime64[ns]
Amount                         float64
Age                            float64
Fraud                            int64
Day of Week_Thursday              bool
                             ...      
Bank_Halifax                      bool
Bank_Lloyds                       bool
Bank_Metro                        bool
Bank_Monzo                        bool
Bank_RBS                          bool
Length: 61, dtype: object


In [None]:
cols_to_exclude = ['Date', 'Amount']   # exclude Date and Amount from SMOTE, scaling, and PCA; Amount is added back later

X_train_numeric = X_train.drop(cols_to_exclude, axis=1)
X_test_numeric = X_test.drop(cols_to_exclude, axis=1)

### Handle class imbalance on training data only (using SMOTE)
The dataset has a class imbalance issue, with far fewer fraudulent transactions. We use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic examples of the minority class (fraud) only in the training data, helping the model learn from both classes more effectively.



In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_numeric, y_train)
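
The core idea behind SMOTE is interpolation: a synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority neighbors. The sketch below illustrates only that interpolation step with NumPy. It is not imbalanced-learn's implementation, and the function name and the two points are hypothetical.

```python
import numpy as np

def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the line segment between x and neighbor."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + lam * (neighbor - x)

rng = np.random.default_rng(42)

# Two hypothetical minority-class (fraud) points in a 2-D feature space.
fraud_a = np.array([0.2, 1.0])
fraud_b = np.array([0.6, 3.0])

synthetic = smote_like_sample(fraud_a, fraud_b, rng)
print(synthetic)
```

Because interpolation produces fractional values, synthetic rows for one-hot encoded columns are no longer strictly 0 or 1, which is worth keeping in mind for the boolean features used here.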

### Scale features (fit scaler on training set, transform both train and test)
After SMOTE, we again apply scaling to ensure that the newly created synthetic data and the existing data are on the same scale. This step is crucial before performing dimensionality reduction or model training.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test_numeric)


### PCA (Principal Component Analysis)
PCA is applied to reduce the number of features while retaining most of the data's variance. This makes training faster and can help prevent overfitting. We use 95% explained variance to determine how many components to keep.



In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
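
Passing a fraction such as `n_components=0.95` asks PCA to keep the smallest number of leading components whose cumulative explained variance reaches that fraction. The sketch below reproduces that selection rule with NumPy on hypothetical per-component variances. The helper name is illustrative and this is not scikit-learn's internal code.

```python
import numpy as np

def components_for_variance(explained_variance, threshold=0.95):
    """Smallest k whose leading components reach the cumulative variance threshold."""
    ratios = np.asarray(explained_variance, dtype=float)
    ratios = ratios / ratios.sum()
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= threshold, converted to a count.
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical per-component variances, sorted from largest to smallest.
variances = [60.0, 25.0, 8.0, 4.0, 3.0]

k = components_for_variance(variances)
print(k)
```

On a fitted scikit-learn `PCA` object, the number actually kept is available as `pca.n_components_`, and `pca.explained_variance_ratio_` holds the per-component ratios.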


In [None]:
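# Convert PCA arrays to DataFrame
X_train_pca_df = pd.DataFrame(X_train_pca, index=X_train_resampled.index)
X_test_pca_df = pd.DataFrame(X_test_pca, index=X_test.index)
# Add back the 'Amount' column. pandas aligns this assignment on index labels:
# the test frame shares X_test's index, but the SMOTE-resampled training frame
# does not share X_train's, so unmatched training rows become NaN (filled below).
X_train_pca_df['Amount'] = X_train['Amount']
X_test_pca_df['Amount'] = X_test['Amount']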
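
Assigning a Series to a DataFrame column matches rows by index label, not by position. That matters for the training frame, assuming imbalanced-learn returns the resampled data with a fresh 0..n-1 index while `X_train` keeps its original labels. A toy sketch of the difference, with hypothetical labels and values:

```python
import pandas as pd

# Frame with a fresh 0..n-1 index, like a resampled training frame.
features = pd.DataFrame({"pc0": [0.1, 0.2, 0.3]})

# Series that kept its labels from an earlier shuffle (hypothetical labels).
amount = pd.Series([5.0, 288.0, 28.0], index=[7, 2, 9])

# Label-based alignment: only label 2 exists on both sides.
features["Amount_by_label"] = amount

# Position-based assignment: ignores labels, requires equal length.
features["Amount_by_position"] = amount.to_numpy()

print(features)
```

Here the label-aligned column has two missing values, and the one value that did match lands in a different row than it would by position. Positional assignment is not available when SMOTE has added rows, so one option is to keep `Amount` among the features passed to SMOTE; that is a design suggestion, not what this notebook does.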

In [None]:
# Fix missing values in Amount using median
X_train_pca_df['Amount'] = X_train_pca_df['Amount'].fillna(X_train_pca_df['Amount'].median())


In [None]:
print("Missing values in 'Amount':", X_train_pca_df['Amount'].isnull().sum())

Missing values in 'Amount': 0


In [None]:
print("Train shape:", X_train_pca_df.shape)
print("Test shape:", X_test_pca_df.shape)
print("y_test shape:", y_test.shape)


Train shape: (148454, 49)
Test shape: (19996, 49)
y_test shape: (19996,)


In [None]:
print("Training features shape after SMOTE and PCA:", X_train_pca_df.shape)
print("Training labels shape after SMOTE:", y_train_resampled.shape)
print("Test features shape after PCA:", X_test_pca_df.shape)
print("Test labels shape:", y_test.shape)

Training features shape after SMOTE and PCA: (148454, 49)
Training labels shape after SMOTE: (148454,)
Test features shape after PCA: (19996, 49)
Test labels shape: (19996,)


# Model selection and Training
We train two models, XGBoost and Random Forest, on the processed data. Each learns to classify a transaction as fraudulent or not, and both are evaluated on the held-out test set using precision, recall, F1 score, AUC, and accuracy.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [None]:
X_train_pca_df.columns = X_train_pca_df.columns.astype(str)
X_test_pca_df.columns = X_test_pca_df.columns.astype(str)

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import accuracy_score

# Select the XGBoost model, setting the evaluation metric explicitly
model_xgb = XGBClassifier(eval_metric='logloss')
model_name_xgb = 'XGBoost'

# Train the model
model_xgb.fit(X_train_pca_df, y_train_resampled)



In [None]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest with 100 trees, trained in parallel on all CPU cores;
# random_state fixes the seed for reproducibility
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# Train the model
model.fit(X_train_pca_df, y_train_resampled)


In [None]:
# Make predictions
y_pred_xgb = model_xgb.predict(X_test_pca_df)
y_prob_xgb = model_xgb.predict_proba(X_test_pca_df)[:, 1]

# Calculate metrics
precision_xgb = precision_score(y_test, y_pred_xgb)
recall_xgb = recall_score(y_test, y_pred_xgb)
f1_xgb = f1_score(y_test, y_pred_xgb)
auc_xgb = roc_auc_score(y_test, y_prob_xgb)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)

# Print the results for the XGBoost model
print(f'Metrics for {model_name_xgb}:')
print(f'Precision: {precision_xgb:.4f}')
print(f'Recall: {recall_xgb:.4f}')
print(f'F1 Score: {f1_xgb:.4f}')
print(f'AUC: {auc_xgb:.4f}')
print(f'Accuracy: {accuracy_xgb:.4f}')

Metrics for XGBoost:
Precision: 0.7657
Recall: 0.7955
F1 Score: 0.7804
AUC: 0.9670
Accuracy: 0.9678


In [None]:
y_pred_rndf = model.predict(X_test_pca_df)
y_prob_rndf = model.predict_proba(X_test_pca_df)[:, 1]

# Calculate metrics
precision_rndf = precision_score(y_test, y_pred_rndf)
recall_rndf = recall_score(y_test, y_pred_rndf)
f1_rndf = f1_score(y_test, y_pred_rndf)
auc_rndf = roc_auc_score(y_test, y_prob_rndf)
accuracy_rndf = accuracy_score(y_test, y_pred_rndf)

# Print the results for the RandomForest model
print('Metrics for RandomForest:')
print(f'Precision: {precision_rndf:.4f}')
print(f'Recall: {recall_rndf:.4f}')
print(f'F1 Score: {f1_rndf:.4f}')
print(f'AUC: {auc_rndf:.4f}')
print(f'Accuracy: {accuracy_rndf:.4f}')

Metrics for RandomForest:
Precision: 0.7108
Recall: 0.8359
F1 Score: 0.7683
AUC: 0.9607
Accuracy: 0.9637


In [None]:
import joblib

# Define the filename for your model
filename1 = 'xgboost_fraud_model.joblib'
filename2 = 'random_forest_fraud_model.joblib'

# Save the model to the file
joblib.dump(model_xgb, filename1)
joblib.dump(model, filename2)

print(f"Model saved as '{filename1}'")
print(f"Model saved as '{filename2}'")

Model saved as 'xgboost_fraud_model.joblib'
Model saved as 'random_forest_fraud_model.joblib'
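
A saved file can be read back with `joblib.load`. The sketch below shows the round trip using a stand-in dictionary and a temporary directory so that it runs on its own. In the notebook, the object would be `model_xgb` or `model`, and the path one of the filenames above.

```python
import os
import tempfile

import joblib

# Stand-in object; in the notebook this would be model_xgb or model.
artifact = {"name": "demo", "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")  # hypothetical filename
    joblib.dump(artifact, path)
    restored = joblib.load(path)

print(restored == artifact)
```

Scoring new transactions would also require the fitted scaler and PCA objects, so that new data passes through the same transformations. Saving them alongside the models is a natural extension that this notebook does not include.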


# Conclusion

In this project, we built and evaluated machine learning models for credit card fraud detection. After loading the transaction data, we removed rows with missing values, converted data types, one-hot encoded the categorical columns, and scaled the numeric features. We then addressed the class imbalance with SMOTE on the training data and reduced dimensionality with PCA.

We trained two models, XGBoost and Random Forest, and evaluated them on the test set. XGBoost scored higher on precision (0.7657 vs 0.7108), F1 score (0.7804 vs 0.7683), AUC (0.9670 vs 0.9607), and accuracy (0.9678 vs 0.9637), while Random Forest achieved higher recall (0.8359 vs 0.7955). Based on this analysis, XGBoost is selected as the preferred model for identifying fraudulent credit card transactions in this dataset. Both trained models are saved with joblib for future use.