**TASK-5: CREDIT CARD FRAUD DETECTION**

> Author: Tushar Surja

> Batch: June 2025 batch B33

> Domain: Data Science


> Aim: To build a machine learning model to identify fraudulent credit card transactions.



## Load data

### Subtask:
Load the credit card transaction data from "creditcard.csv" into a pandas DataFrame.


**Reasoning**:
Load the data from the specified CSV file into a pandas DataFrame and display the first few rows.



In [None]:
import pandas as pd

df = pd.read_csv('creditcard.csv')
display(df.head())



Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0
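
Inspecting only `df.head()` can miss problems further down the file. The following is a minimal sketch of a few extra checks that could be run right after loading. It uses a tiny made-up CSV in place of `creditcard.csv`; the sample rows and the names `raw`, `sample`, and `non_numeric_cols` are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for creditcard.csv;
# the last row is deliberately incomplete.
raw = io.StringIO(
    "Time,V28,Amount,Class\n"
    "0,-0.021053,149.62,0\n"
    "0,0.014724,2.69,0\n"
    "1,-,,\n"
)
sample = pd.read_csv(raw)

# Columns that pandas could not parse as numbers end up with a non-numeric dtype.
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("Non-numeric columns:", non_numeric_cols)

# Missing values per column, plus the tail, where a truncated file shows up.
print(sample.isnull().sum())
print(sample.tail(1))
```

A column that should be numeric but appears in `non_numeric_cols` is a hint that it contains stray text.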


## Preprocess data

### Subtask:
Handle missing values, encode categorical features if any, and normalize numerical features.


**Reasoning**:
Check for missing values, handle them, and normalize numerical features.



In [2]:
print("Missing values before handling:")
display(df.isnull().sum())

# Fill missing values in 'Amount' with the column mean and in 'Class' with the mode.
# The last row of the file contains NaN in the 'Amount' and 'Class' columns
# (df.head() does not show this row; df.tail() does).
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which pandas warns will stop working in 3.0.
df['Amount'] = df['Amount'].fillna(df['Amount'].mean())
df['Class'] = df['Class'].fillna(df['Class'].mode()[0])

print("\nMissing values after handling:")
display(df.isnull().sum())

# No categorical features are present in the dataframe based on the column names and previous inspection.

# Normalize numerical features using StandardScaler
from sklearn.preprocessing import StandardScaler

# Exclude 'Time' and 'Class' columns from normalization as they are not features to be scaled in this context
features_to_scale = df.drop(['Time', 'Class'], axis=1).columns
scaler = StandardScaler()
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

display(df.head())

Missing values before handling:


Unnamed: 0,0
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0



Missing values after handling:




Unnamed: 0,0
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0


ValueError: could not convert string to float: '-'

**Reasoning**:
The traceback shows a ValueError during scaling, caused by a non-numeric value ('-') in the DataFrame. This value must be identified and handled before scaling.



In [3]:
# Identify columns with non-numeric values
non_numeric_cols = []
for col in df.columns:
    try:
        pd.to_numeric(df[col])
    except ValueError:
        print(f"Column '{col}' contains non-numeric values.")
        non_numeric_cols.append(col)
        # Replace '-' with NaN, then convert the column to numeric
        df[col] = pd.to_numeric(df[col].replace('-', pd.NA))

# Check for missing values again after replacing '-' with NaN
print("\nMissing values after replacing '-':")
display(df.isnull().sum())

# Fill the missing values created by the replacement with the column mean.
# Assign back instead of calling fillna(inplace=True) on a column selection.
for col in non_numeric_cols:
    df[col] = df[col].fillna(df[col].mean())

print("\nMissing values after filling new NaNs:")
display(df.isnull().sum())


# Normalize numerical features using StandardScaler
from sklearn.preprocessing import StandardScaler

# Exclude 'Time' and 'Class' columns from normalization as they are not features to be scaled in this context
features_to_scale = df.drop(['Time', 'Class'], axis=1).columns
scaler = StandardScaler()
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

display(df.head())

Column 'V28' contains non-numeric values.

Missing values after replacing '-':


Unnamed: 0,0
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0



Missing values after filling new NaNs:




Unnamed: 0,0
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-0.612705,-0.051651,1.465187,0.934321,-0.04127,0.300588,0.305712,0.028402,0.415601,...,0.028315,0.62257,-0.128385,0.092374,-0.004539,-0.43123,0.340687,-0.075561,0.233584,0.0
1,0,0.791173,0.158032,-0.400015,0.232144,0.262936,-0.124226,0.03256,0.017142,-0.151999,...,-0.261676,-0.820798,0.227058,-0.591362,0.083721,0.208575,-0.024957,0.04024,-0.357899,0.0
2,1,-0.611906,-0.835727,0.86456,0.18052,-0.167184,1.344094,0.779147,0.151773,-1.306272,...,0.400544,1.400298,1.583496,-1.178718,-1.046769,-0.32964,-0.143903,-0.200818,1.155611,0.0
3,1,-0.396189,-0.121216,0.880131,-0.758033,0.209228,0.912614,0.304005,0.25923,-1.18928,...,-0.097474,0.193322,-0.262408,-1.996117,1.180839,-0.497878,0.158981,0.191503,0.128435,0.0
4,2,-0.501803,0.536395,0.687874,0.198077,-0.093867,0.014804,0.608839,-0.277366,0.831717,...,0.040721,1.442188,-0.173678,0.217329,-0.768878,0.973064,0.560943,0.688974,-0.086976,0.0
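
To see exactly which entries block the conversion, the offending cells can be listed before they are replaced. The following is a minimal sketch on a made-up column (the `demo` frame and its values are assumptions). Passing `errors='coerce'` to `pd.to_numeric` turns anything unparseable into NaN, so comparing the result with the original missing-value mask isolates the bad tokens.

```python
import pandas as pd

# Hypothetical column with one stray '-' token and one genuinely missing value
demo = pd.DataFrame({"V28": ["-0.021053", "0.014724", "-", None]})

converted = pd.to_numeric(demo["V28"], errors="coerce")

# Cells that were present in the raw data but became NaN after coercion
bad_mask = converted.isna() & demo["V28"].notna()
bad_tokens = demo.loc[bad_mask, "V28"].tolist()
print("Unparseable tokens:", bad_tokens)

# Replace the column with its numeric version and impute with the mean
demo["V28"] = converted.fillna(converted.mean())
print(demo)
```

Unlike a hard-coded replacement of `'-'`, this approach would also surface any other unexpected token in the column.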


## Handle class imbalance

### Subtask:
Address the class imbalance issue by using techniques like oversampling or undersampling.


**Reasoning**:
Check the class distribution, separate features and target, apply SMOTE for oversampling, and verify the new class distribution.



In [4]:
# 1. Check the distribution of the 'Class' column
print("Class distribution before oversampling:")
display(df['Class'].value_counts())

# 2. Separate the features (X) and the target variable (y)
X = df.drop('Class', axis=1)
y = df['Class']

# 3. Apply SMOTE for oversampling
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

# 4. Verify the class distribution after applying SMOTE
print("\nClass distribution after oversampling:")
display(y_res.value_counts())

Class distribution before oversampling:


Unnamed: 0_level_0,count
Class,Unnamed: 1_level_1
0.0,136479
1.0,263



Class distribution after oversampling:


Unnamed: 0_level_0,count
Class,Unnamed: 1_level_1
0.0,136479
1.0,136479
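
SMOTE builds each synthetic minority sample by picking a minority point, choosing one of its nearest minority neighbours, and placing a new point on the line segment between the two. The sketch below illustrates that interpolation step with NumPy and scikit-learn. It is a simplified illustration, not imbalanced-learn's implementation; the function name `smote_like_samples` and the toy data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_min, n_new, k=5, seed=42):
    """Interpolate n_new synthetic points between minority samples and their neighbours.

    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours: with no duplicate rows, each point's
    # nearest neighbour is itself, so the first column is dropped.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_new)   # random minority points
    pick = rng.integers(0, k, size=n_new)            # one neighbour for each
    partner = neighbours[base, pick]
    lam = rng.random((n_new, 1))                     # position along the segment
    return X_min[base] + lam * (X_min[partner] - X_min[base])

# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic.shape)
print(synthetic)
```

Because every synthetic point is interpolated from existing minority points, oversampled rows are closely related to the originals they were built from.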


## Split data

### Subtask:
Split the dataset into training and testing sets.


**Reasoning**:
Split the oversampled data into training and testing sets.



In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (218366, 30)
Shape of X_test: (54592, 30)
Shape of y_train: (218366,)
Shape of y_test: (54592,)
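
An alternative ordering, sketched here on synthetic data rather than the notebook's DataFrame, is to split first with `stratify=` so that the test set keeps the original class ratio, and to fit the scaler on the training portion only. Resampling would then be applied to the training portion alone. `make_classification` and every variable name below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the fraud data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.98, 0.02], random_state=42
)

# stratify=y_demo preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("Positive rate overall:", y_demo.mean())
print("Positive rate in train:", y_tr.mean())
print("Positive rate in test:", y_te.mean())
```

With this ordering the test set contains only real rows at the original class ratio, so its metrics reflect the imbalance the model would face in practice.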


## Train model

### Subtask:
Train a classification algorithm (e.g., Logistic Regression, Random Forest) on the training data.


**Reasoning**:
Import the RandomForestClassifier model and train it on the training data.



In [6]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
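
Oversampling is not the only way to handle imbalance at training time. As a hedged alternative, scikit-learn's `RandomForestClassifier` accepts `class_weight="balanced"`, which weights classes inversely to their frequency instead of adding synthetic rows. The sketch below uses synthetic data from `make_classification`; the data and variable names are assumptions, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the fraud data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# class_weight="balanced" reweights classes instead of resampling rows
clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
clf.fit(X_tr, y_tr)

# Probability of the positive class for each held-out row
positive_proba = clf.predict_proba(X_te)[:, 1]
print(positive_proba[:5])
```

Whether class weighting or oversampling works better depends on the data, so it is worth comparing both on a held-out set.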

## Evaluate model

### Subtask:
Evaluate the trained model's performance using appropriate metrics like precision, recall, and F1-score on the testing data.


**Reasoning**:
Evaluate the trained model on the test set using precision, recall, and F1-score.



In [7]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     27210
         1.0       1.00      1.00      1.00     27382

    accuracy                           1.00     54592
   macro avg       1.00      1.00      1.00     54592
weighted avg       1.00      1.00      1.00     54592
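
All three metrics in the report derive from the confusion matrix: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of the two. The small sketch below uses hand-made labels (assumed values, not the notebook's predictions) and checks the manual calculation against scikit-learn.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels: 6 legitimate and 4 fraudulent transactions
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud that was flagged
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# The same values from scikit-learn, for the positive class only
sk_p, sk_r, sk_f1, _ = precision_recall_fscore_support(
    y_true, y_hat, average="binary"
)
print(sk_p, sk_r, sk_f1)
```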



In [8]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Display some predictions and the actual values for comparison
predictions_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
display(predictions_df.head())

Unnamed: 0,Actual,Predicted
101605,0.0,0.0
163508,1.0,1.0
53688,0.0,0.0
206362,1.0,1.0
41667,0.0,0.0


## Summary:

### Data Analysis Key Findings

* The initial dataset contained a significant class imbalance, with non-fraudulent transactions vastly outnumbering fraudulent ones.
* SMOTE oversampling successfully balanced the classes in the dataset.
* The trained Random Forest model scored 1.00 (rounded to two decimals) for precision, recall, and F1-score on both the fraudulent and non-fraudulent classes of the test set.

### Insights or Next Steps

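* Although the model scored perfectly on the test set, overfitting and data leakage are real possibilities: SMOTE and scaling were both applied before the train/test split, so the test set is balanced, contains synthetic samples, and shares scaling statistics with the training data. Validation on a separate, untouched dataset, or cross-validation with resampling confined to the training folds, would provide a more robust evaluation.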
* Investigate the feature importances from the Random Forest model to understand which features are most indicative of fraudulent transactions.
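
Both next steps can be sketched together: stratified cross-validation in which oversampling touches only the training fold, followed by a look at the averaged feature importances. To stay within scikit-learn, this sketch swaps SMOTE for plain random oversampling via `sklearn.utils.resample`; the synthetic data and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the fraud data
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=42,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_f1, fold_importances = [], []

for train_idx, test_idx in cv.split(X_demo, y_demo):
    X_tr, y_tr = X_demo[train_idx], y_demo[train_idx]

    # Randomly oversample the minority class in the training fold only
    n_majority = int((y_tr == 0).sum())
    minority_up = resample(
        X_tr[y_tr == 1], replace=True, n_samples=n_majority, random_state=42
    )
    X_bal = np.vstack([X_tr[y_tr == 0], minority_up])
    y_bal = np.concatenate([
        np.zeros(n_majority, dtype=int), np.ones(n_majority, dtype=int)
    ])

    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_bal, y_bal)

    # The test fold keeps its original, imbalanced class ratio
    fold_f1.append(f1_score(y_demo[test_idx], clf.predict(X_demo[test_idx])))
    fold_importances.append(clf.feature_importances_)

print("F1 per fold:", np.round(fold_f1, 3))

# Average the importances across folds and rank features, most important first
mean_importance = np.mean(fold_importances, axis=0)
print("Feature ranking:", np.argsort(mean_importance)[::-1])
```

Scores from a loop like this are usually lower than scores measured on an oversampled test set, and they are a more honest estimate of performance on unseen transactions.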