Goal: build a machine learning model that identifies fraudulent credit card transactions.

Importing Libraries: Let’s start by importing the required libraries for data handling, preprocessing, resampling, modeling, and evaluation:

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix


Loading the Dataset: The credit card fraud dataset is stored as a CSV file, so we can load it into a DataFrame with pandas:

In [3]:
# Load the credit card fraud dataset (replace the path below with the location of your copy of creditcard.csv)
credit_card_data = pd.read_csv(r'C:\Users\Lenovo\OneDrive\Desktop\creditcard.csv')


Exploring the Data: To gain initial insights into the dataset, we can perform some basic exploratory operations:

In [4]:
credit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [12]:
# Count the missing values in each column
credit_card_data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [5]:
# Standardize 'Amount' and 'Time' to zero mean and unit variance; V1-V28 are left unchanged
# fit_transform refits the scaler on every call, so reusing one scaler object for both columns is safe
scaler = StandardScaler()
credit_card_data['Amount'] = scaler.fit_transform(credit_card_data['Amount'].values.reshape(-1, 1))
credit_card_data['Time'] = scaler.fit_transform(credit_card_data['Time'].values.reshape(-1, 1))
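
The cell above fits the scaler on the full dataset before the train/test split, so statistics from the test rows influence the scaled training values. A common alternative is to compute the mean and standard deviation on the training rows only and reuse them for the test rows. The sketch below shows that idea in plain Python so the arithmetic is visible; the helper names and the hand-made split are hypothetical, and the amounts are simply the five values shown by head() above.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return (mean, std) computed from the given values only."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0), the convention StandardScaler uses
    if std == 0:
        std = 1.0  # avoid division by zero for a constant column
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics to any set of values."""
    return [(v - mean) / std for v in values]


# Hypothetical split of the five 'Amount' values shown by head()
train_amounts = [149.62, 2.69, 378.66, 123.50]
test_amounts = [69.99]

# Fit on the training values only, then reuse the same statistics for the test values
mean, std = fit_standardizer(train_amounts)
train_scaled = standardize(train_amounts, mean, std)
test_scaled = standardize(test_amounts, mean, std)
```

With scikit-learn, the same separation is obtained by calling fit on the training data and transform on the test data.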

In [6]:
# Define features and target variable
X = credit_card_data.drop('Class', axis=1)
y = credit_card_data['Class']

In [7]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
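# Handle class imbalance by randomly over-sampling the minority class, then randomly under-sampling the majority class
# For a binary target, sampling_strategy is the desired minority/majority ratio after each resampling step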
oversampler = RandomOverSampler(sampling_strategy=0.5, random_state=42)
undersampler = RandomUnderSampler(sampling_strategy=0.8, random_state=42)
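
To see what the two sampling_strategy values imply for class sizes, the arithmetic can be sketched without any library. The function name and the class counts below are hypothetical, chosen only for illustration, and the truncation to whole rows is an assumption about how the target counts are rounded.

```python
def resampled_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Class counts after over-sampling, then under-sampling.

    Each ratio is the desired minority/majority ratio after that step.
    Assumes over_ratio is larger than the original ratio and under_ratio
    is larger than over_ratio, as in the cell above.
    """
    # Over-sampling duplicates minority rows until minority/majority == over_ratio
    n_minority_over = int(n_majority * over_ratio)
    # Under-sampling drops majority rows until minority/majority == under_ratio
    n_majority_under = int(n_minority_over / under_ratio)
    return n_majority_under, n_minority_over


# Hypothetical training set: 80,000 genuine and 100 fraudulent transactions
n_majority, n_minority = resampled_counts(80_000, 100, over_ratio=0.5, under_ratio=0.8)
print(n_majority, n_minority)  # 50000 40000
```

In this illustration the classifier would be trained on 40,000 fraud rows, all duplicates of the original 100, which is one reason a resampled model tends to flag many genuine transactions.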

In [9]:
# Build a Logistic Regression model within an imblearn pipeline (the samplers run only during fit, not during predict)
model = Pipeline([
    ('over', oversampler),
    ('under', undersampler),
    ('classifier', LogisticRegression(random_state=42))
])

In [10]:
# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

In [11]:
# Evaluate the model's performance
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99     56864
           1       0.08      0.92      0.15        98

    accuracy                           0.98     56962
   macro avg       0.54      0.95      0.57     56962
weighted avg       1.00      0.98      0.99     56962

Confusion Matrix:
 [[55842  1022]
 [    8    90]]


# Classification Report:

Precision:

Precision for class 0 (genuine transactions) is high (1.00 after rounding): of the 55850 transactions the model predicted as genuine, 55842 were genuine and only 8 were fraudulent. When the model predicts a transaction as genuine, it is almost always correct.
Precision for class 1 (fraudulent transactions) is low (0.08): of the 1112 transactions the model flagged as fraudulent, only 90 were actually fraudulent, so roughly 92% of fraud alerts are false alarms.
Recall (Sensitivity):

Recall for class 0 is high (0.98): the model correctly identifies 98% of genuine transactions (55842 of 56864), and the remaining 1022 are wrongly flagged as fraudulent.
Recall for class 1 is also high (0.92): the model correctly identifies 92% of fraudulent transactions (90 of 98).
F1-Score:

The F1-score is the harmonic mean of precision and recall. For class 0, it is high (0.99) because both precision and recall are high. For class 1, it is low (0.15): the harmonic mean is dominated by the smaller value, so the low precision (0.08) pulls the score down despite the high recall (0.92).
Support:

Support is the number of actual occurrences of each class in the test set. The test set is heavily imbalanced: 56864 transactions are genuine (class 0) and only 98 are fraudulent (class 1).
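
The per-class figures in the report can be reproduced by hand from the four counts in the printed confusion matrix. The following is a minimal sketch of that arithmetic; the helper name precision_recall_f1 is hypothetical and is not part of scikit-learn.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for whichever class is treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the confusion matrix printed above
tn, fp, fn, tp = 55842, 1022, 8, 90

# Class 1 (fraud) as the positive class
fraud = precision_recall_f1(tp=tp, fp=fp, fn=fn)
# Class 0 (genuine) as the positive class: the roles of the cells swap
genuine = precision_recall_f1(tp=tn, fp=fn, fn=fp)

for label, scores in (("0", genuine), ("1", fraud)):
    print(label, [round(s, 2) for s in scores])
# 0 [1.0, 0.98, 0.99]
# 1 [0.08, 0.92, 0.15]
```

The rounded values match the precision, recall, and f1-score columns of the report for both classes.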

# Confusion Matrix:

True Positives (TP): 90 - The model correctly predicted 90 instances of fraudulent transactions.
True Negatives (TN): 55842 - The model correctly predicted 55842 instances of genuine transactions.
False Positives (FP): 1022 - The model incorrectly predicted 1022 instances as fraudulent when they were genuine (Type I error).
False Negatives (FN): 8 - The model incorrectly predicted 8 instances as genuine when they were fraudulent (Type II error).
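
The four cells can also be read directly from the printed matrix and turned into error rates. This is a small sketch using a plain nested list in place of the NumPy array returned by confusion_matrix; the variable names are illustrative.

```python
# Confusion matrix as printed by scikit-learn: rows are actual classes,
# columns are predicted classes, both in the order [0, 1]
matrix = [[55842, 1022],
          [8, 90]]

# Row 0 holds (TN, FP); row 1 holds (FN, TP)
(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
false_positive_rate = fp / (fp + tn)  # Type I error rate: genuine flagged as fraud
false_negative_rate = fn / (fn + tp)  # Type II error rate: fraud missed

print(f"accuracy={accuracy:.4f}")
print(f"false positive rate={false_positive_rate:.4f}")
print(f"false negative rate={false_negative_rate:.4f}")
```

The accuracy rounds to the 0.98 shown in the report. Because genuine transactions dominate the test set, accuracy is driven almost entirely by class 0, which is why the two error rates are more informative here.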
Conclusion: