## Credit Card Fraud Detection

In [6]:
import pandas as pd
import numpy as np

## Task 1: Loading the Data
In this task, you will load the provided credit card fraud dataset into a DataFrame. This is the first step in any machine learning workflow, where you need to ensure that the dataset is correctly imported and ready for analysis.

The dataset is in CSV format, and you need to use the appropriate Python libraries to load it into a DataFrame. You should also explore the initial structure of the dataset by displaying the first few rows.

In [7]:
import pandas as pd

# Load the dataset
file_path = 'credit_card_fraud_dataset_updated.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
print(data.head())

# Get the general info about the dataset
print("\nGeneral information about the dataset:")
print(data.info())


First 5 rows of the dataset:
   Transaction_Amount  Transaction_Time  Merchant_ID  Card_Balance  \
0          124.835708         11.625301    86.583402     58.986524   
1           93.086785          3.261624   174.054308     31.517630   
2          132.384427          2.235485   485.080780     51.674163   
3          176.151493          2.970619   390.613135     36.068555   
4           88.292331         26.230675   142.012825     29.165892   

   Transaction_Type  Transaction_Method  Fraud  
0                 1                   0      0  
1                 4                   1      0  
2                 4                   2      0  
3                 1                   2      0  
4                 1                   0      0  

General information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Transa

## Task 2: Splitting the Data into Train and Test
In this task, you will split the dataset into training and testing sets. The training set will be used to build the machine learning model, while the testing set will be used to evaluate its performance. A typical split ratio is 80% for training and 20% for testing, but you can experiment with other ratios as needed.

It is important to perform the split randomly to avoid bias in the model's performance. Also, ensure that the split is done in a way that the class distribution (fraud and non-fraud) is maintained in both training and testing sets.

In [8]:
from sklearn.model_selection import train_test_split

# Separate features and target variable
X = data.drop(columns=['Fraud'])  # Features
y = data['Fraud']  # Target variable

# Split the data into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check the shape of the resulting datasets
print("Shape of the training set (X_train, y_train):", X_train.shape, y_train.shape)
print("Shape of the testing set (X_test, y_test):", X_test.shape, y_test.shape)


Shape of the training set (X_train, y_train): (8000, 6) (8000,)
Shape of the testing set (X_test, y_test): (2000, 6) (2000,)
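
As a quick check on the stratified split described above, the following sketch compares the fraud rate in both parts. It uses a synthetic stand-in dataset with an assumed fraud rate of about 2%; the names `toy`, `X_tr`, and `y_te` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data (assumed ~2% positives)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "amount": rng.normal(100, 50, size=5000),
    "Fraud": (rng.random(5000) < 0.02).astype(int),
})

X_toy = toy.drop(columns=["Fraud"])
y_toy = toy["Fraud"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=y_toy, the fraud share should be nearly identical in both parts
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Fraud rate in train: {train_rate:.4f}")
print(f"Fraud rate in test:  {test_rate:.4f}")
```

Without `stratify`, a rare class can end up noticeably over- or under-represented in the test set purely by chance.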


## Task 3: Checking the Quality of Train Data (Missing Values, Outliers, Imbalance)

### Sub-task (a): Handling Missing Values
In this sub-task, you will focus on handling missing values in the dataset. Missing data can occur in different ways, and for this task, you will fill in the missing values in the categorical features (Transaction_Type and Transaction_Method) using mode imputation, and for numerical features, you will use mean imputation.

In [9]:
from sklearn.impute import SimpleImputer

# 1. Check for missing values in the training data
print("Missing values in the training data:")
print(X_train.isnull().sum())

# 2. Handle missing values in categorical features (impute with mode)
categorical_columns = ['Transaction_Type', 'Transaction_Method']
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train[categorical_columns] = cat_imputer.fit_transform(X_train[categorical_columns])

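# 3. Handle missing values in numerical features (impute with mean)
# Note: Merchant_ID is in neither column list, so its missing values are left untouched by this cell
numerical_columns = ['Transaction_Amount', 'Transaction_Time', 'Card_Balance']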
num_imputer = SimpleImputer(strategy='mean')
X_train[numerical_columns] = num_imputer.fit_transform(X_train[numerical_columns])

# Verify that there are no more missing values
print("\nMissing values after imputation:")
print(X_train.isnull().sum())


Missing values in the training data:
Transaction_Amount      0
Transaction_Time        0
Merchant_ID           785
Card_Balance            0
Transaction_Type        0
Transaction_Method      0
dtype: int64

Missing values after imputation:
Transaction_Amount      0
Transaction_Time        0
Merchant_ID           785
Card_Balance            0
Transaction_Type        0
Transaction_Method      0
dtype: int64
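
The output above still reports 785 missing values in Merchant_ID, because that column appears in neither column list. The sketch below shows one possible way to cover it, and how imputers fitted on the training data can be reused on the test data. It runs on a tiny hand-made frame; treating Merchant_ID as a numerical column for mean imputation is an assumption, not something the dataset description confirms.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny hand-made frames standing in for X_train and X_test
train_toy = pd.DataFrame({
    "Merchant_ID": [10.0, np.nan, 30.0, 50.0],
    "Transaction_Type": [1, 1, 4, np.nan],
})
test_toy = pd.DataFrame({
    "Merchant_ID": [np.nan, 20.0],
    "Transaction_Type": [np.nan, 4],
})

num_cols = ["Merchant_ID"]       # assumption: treated as numerical
cat_cols = ["Transaction_Type"]

num_imp = SimpleImputer(strategy="mean")
cat_imp = SimpleImputer(strategy="most_frequent")

# Learn the fill values from the training data only
train_toy[num_cols] = num_imp.fit_transform(train_toy[num_cols])
train_toy[cat_cols] = cat_imp.fit_transform(train_toy[cat_cols])

# Reuse the same fill values on the test data (transform, not fit_transform)
test_toy[num_cols] = num_imp.transform(test_toy[num_cols])
test_toy[cat_cols] = cat_imp.transform(test_toy[cat_cols])

print(train_toy.isnull().sum())
print(test_toy)
```

Fitting on the training set and only transforming the test set keeps test-set statistics out of the preprocessing step.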


### Sub-task (b): Handling Outliers

In this sub-task, you will focus on handling outliers in the numerical features. Outliers can skew model performance, so it's important to address them. You will resolve outliers with the interquartile range (IQR) method: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are clipped to those bounds.

In [10]:
# 1. Identifying and resolving outliers using IQR method
for col in numerical_columns:
    Q1 = X_train[col].quantile(0.25)
    Q3 = X_train[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Clip outliers to the lower and upper bounds
    X_train[col] = np.clip(X_train[col], lower_bound, upper_bound)

# Check the data after outlier handling
print("\nData after handling outliers:")
print(X_train[numerical_columns].describe())



Data after handling outliers:
       Transaction_Amount  Transaction_Time  Card_Balance
count         8000.000000       8000.000000   8000.000000
mean           106.605542         29.590670     49.878737
std             58.412112         26.011732     15.070003
min            -39.497426          0.001444      8.046668
25%             67.925854          9.052395     39.362634
50%            103.089080         21.528801     49.967581
75%            139.541374         43.087049     60.239944
max            246.964653         94.139031     91.555909
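
The IQR clipping above can also be wrapped in a small helper so that bounds learned on the training data are reused on the test data. This is a minimal sketch on hand-made values; the helper name `iqr_bounds` and its `k` parameter are illustrative assumptions.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) clipping bounds based on the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hand-made values: one extreme high value in train, one extreme low value in test
train_vals = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0])
test_vals = pd.Series([9.0, 14.0, -300.0])

# Bounds come from the training values only
lower, upper = iqr_bounds(train_vals)

train_clipped = train_vals.clip(lower, upper)
test_clipped = test_vals.clip(lower, upper)

print(f"Bounds: [{lower}, {upper}]")
print(train_clipped.tolist())
print(test_clipped.tolist())
```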


### Sub-task (c): Handling Class Imbalance
In this sub-task, you will address the issue of class imbalance in the target variable (Fraud). A common solution for handling imbalanced datasets is to use sampling techniques to either undersample the majority class or oversample the minority class. For this task, you will use undersampling to balance the classes by randomly sampling the majority class to match the number of minority class instances.

In [11]:
from sklearn.utils import resample

# 1. Check the class distribution in the training set
print("\nClass distribution before balancing:")
print(y_train.value_counts())

# 2. Apply undersampling to balance the classes
# Separate the majority and minority classes
majority_class = X_train[y_train == 0]
minority_class = X_train[y_train == 1]

# Downsample the majority class to match the minority class size
majority_class_downsampled = resample(majority_class,
                                      replace=False,  # without replacement
                                      n_samples=len(minority_class),  # match minority class size
                                      random_state=42)

# Combine the minority class with the downsampled majority class
X_train_balanced = pd.concat([majority_class_downsampled, minority_class])
# Select the matching labels by index label (.loc); .iloc expects integer positions
y_train_balanced = y_train.loc[X_train_balanced.index]

# Check the class distribution after balancing
print("\nClass distribution after balancing:")
print(y_train_balanced.value_counts())



Class distribution before balancing:
Fraud
0    7859
1     141
Name: count, dtype: int64

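
The sub-task description also mentions oversampling the minority class as an alternative. A minimal sketch of that option, on a hand-made frame with non-contiguous index labels, might look like the following. Keeping features and target in one DataFrame while resampling is a design choice made here so that rows and labels cannot drift out of alignment; the frame `toy` is purely illustrative.

```python
import pandas as pd
from sklearn.utils import resample

# Hand-made data with shuffled-looking index labels, as after train_test_split
toy = pd.DataFrame(
    {
        "amount": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        "Fraud": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    },
    index=[101, 205, 309, 412, 518, 623, 727, 834, 940, 1055],
)

majority = toy[toy["Fraud"] == 0]
minority = toy[toy["Fraud"] == 1]

# Oversample the minority class with replacement up to the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

balanced = pd.concat([majority, minority_upsampled])
X_bal = balanced.drop(columns=["Fraud"])
y_bal = balanced["Fraud"]

print(y_bal.value_counts())
```

Oversampling keeps every majority row but repeats minority rows, so it should be applied to the training set only.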

## Task 4: Understanding the Data - Feature Value Distributions

In this task, you will analyze the distributions of the features in the training data. Understanding the distribution of each feature is an important step because it can influence how you process and model the data. For example, features that are highly skewed might require transformations, and categorical features should be analyzed to understand their value counts.

You will analyze both continuous features (e.g., Transaction_Amount, Transaction_Time) and categorical features (e.g., Transaction_Type, Transaction_Method).



In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew

# 1. Visualizing the distribution of continuous features (histograms)
continuous_columns = ['Transaction_Amount', 'Transaction_Time', 'Card_Balance']  # List of continuous features

# Plot histograms for continuous features
plt.figure(figsize=(15, 4))
for i, col in enumerate(continuous_columns, 1):
    plt.subplot(1, 3, i)  # One row, three columns: one panel per continuous feature
    sns.histplot(X_train[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# 2. Visualizing the distribution of categorical features (bar plots)
categorical_columns = ['Transaction_Type', 'Transaction_Method']  # List of categorical features

# Plot bar plots for categorical features
plt.figure(figsize=(12, 6))
for i, col in enumerate(categorical_columns, 1):
    plt.subplot(1, 2, i)
    sns.countplot(x=X_train[col])
    plt.title(f'Count Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')

plt.tight_layout()
plt.show()

# 3. Analyzing skewness of continuous features
print("\nSkewness of continuous features:")
for col in continuous_columns:
    feature_skew = skew(X_train[col].dropna())
    print(f"{col}: Skewness = {feature_skew:.2f}")

# Suggest transformations if strongly skewed (|skewness| > 1)
print("\nSuggested transformations for skewed features:")
for col in continuous_columns:
    feature_skew = skew(X_train[col].dropna())
    if feature_skew > 1:  # Strong positive (right) skew
        print(f"{col}: Consider a log or square root transformation (shift values first if any are <= 0).")
    elif feature_skew < -1:  # Strong negative (left) skew
        print(f"{col}: Consider a square transformation, or reflect the values and then apply a log.")
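
To illustrate what a transformation does to a skewed feature, the sketch below applies `np.log1p` to synthetic right-skewed values and compares the skewness before and after. The lognormal data is an assumption chosen for illustration. Note that `log1p` needs values greater than -1, so a feature with negative values (Transaction_Amount has a negative minimum in the summary above) would have to be shifted first.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed values (lognormal), standing in for a skewed feature
rng = np.random.default_rng(0)
values = rng.lognormal(mean=3.0, sigma=1.0, size=5000)

skew_before = skew(values)
skew_after = skew(np.log1p(values))  # log(1 + x), defined for x > -1

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p:  {skew_after:.2f}")
```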


## Task 5: Choosing the Best Metric - Pros and Cons of Each
In this task, you will learn how to choose the best evaluation metric for a binary classification problem (such as fraud detection). Different metrics provide different insights into model performance, and it’s important to select the right one based on the nature of the problem and the business objectives.

In the case of fraud detection, where the goal is to correctly identify fraudulent transactions, some metrics are more appropriate than others due to the potential class imbalance (fraudulent transactions are usually much rarer than non-fraudulent ones).

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# Create binary arrays for y_dummy1 (true labels) and y_dummy2 (predicted labels)
y_dummy1 = np.random.choice([0, 1], size=1000)  # 1000 samples, with values 0 or 1 (fraud or not)
y_dummy2 = np.random.choice([0, 1], size=1000)  # Predicted labels

# 1. Evaluate using multiple metrics: Accuracy, Precision, Recall, F1-Score, and ROC AUC

# Accuracy: Proportion of correctly classified transactions (fraud and non-fraud)
accuracy = accuracy_score(y_dummy1, y_dummy2)

# Precision: How many of the predicted fraudulent transactions were actually fraud?
precision = precision_score(y_dummy1, y_dummy2)

# Recall: How many of the actual fraudulent transactions were correctly predicted?
recall = recall_score(y_dummy1, y_dummy2)

# F1-Score: Harmonic mean of precision and recall
f1 = f1_score(y_dummy1, y_dummy2)

# ROC AUC: Measures the model’s ability to distinguish between fraud and non-fraud
# ROC AUC is computed from scores rather than hard labels, so simulate predicted probabilities here
y_dummy2_prob = np.random.rand(1000)  # Random probability scores for predictions
roc_auc = roc_auc_score(y_dummy1, y_dummy2_prob)

# Printing the evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")
print(f"ROC AUC: {roc_auc}")
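
To make the trade-offs between these metrics concrete, the following sketch derives precision, recall, and F1-score by hand from a confusion matrix on a small imbalanced example (95 non-fraud, 5 fraud). The label arrays are hand-made for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# 95 non-fraud and 5 fraud cases; the predictions catch 2 frauds,
# miss 3, and raise 1 false alarm
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 94 + [1] + [1, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)   # of the flagged transactions, how many are fraud
recall_manual = tp / (tp + fn)      # of the fraud cases, how many were flagged
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy:  {accuracy_manual:.2f}")
print(f"Precision: {precision_manual:.2f}")
print(f"Recall:    {recall_manual:.2f}")
print(f"F1-Score:  {f1_manual:.2f}")
```

Accuracy looks high here even though most fraud cases are missed, which is why recall and precision deserve more attention on imbalanced data.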


## Task 6: Building a Baseline No-ML Solution
In this task, you will build a baseline no-machine learning solution. A baseline model is crucial for comparing the performance of machine learning models. It allows you to measure how well a machine learning model is performing relative to a simple or naïve approach.

For the fraud detection problem, a natural baseline approach is to predict that all transactions are non-fraudulent (i.e., predict 0 for all cases). This simple model will serve as a baseline, and you can later compare it with the performance of your logistic regression model.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# 1. Baseline solution: Predict 0 (non-fraudulent) for all transactions
y_pred_baseline = [0] * len(y_test)

# 2. Evaluate the baseline model's performance
# zero_division=0 avoids a warning: the baseline never predicts the positive class
accuracy_baseline = accuracy_score(y_test, y_pred_baseline)
precision_baseline = precision_score(y_test, y_pred_baseline, zero_division=0)
recall_baseline = recall_score(y_test, y_pred_baseline)
f1_baseline = f1_score(y_test, y_pred_baseline, zero_division=0)
roc_auc_baseline = roc_auc_score(y_test, y_pred_baseline)  # Constant predictions give 0.5

print(f"Baseline Accuracy: {accuracy_baseline:.4f}")
print(f"Baseline Precision: {precision_baseline:.4f}")
print(f"Baseline Recall: {recall_baseline:.4f}")
print(f"Baseline F1-Score: {f1_baseline:.4f}")
print(f"Baseline ROC AUC: {roc_auc_baseline:.4f}")
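
The same all-zero baseline can also be expressed with scikit-learn's `DummyClassifier`, which fits into the usual `fit`/`predict` workflow. This is a sketch on hand-made labels; the arrays below are illustrative, and the features are placeholders because the dummy model ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hand-made labels: 2% fraud in train, 4% fraud in test
y_tr = np.array([0] * 98 + [1] * 2)
y_te = np.array([0] * 48 + [1] * 2)
X_tr = np.zeros((100, 1))  # placeholder features, ignored by the dummy model
X_te = np.zeros((50, 1))

# "most_frequent" always predicts the majority class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_tr, y_tr)
y_hat = dummy.predict(X_te)

baseline_accuracy = accuracy_score(y_te, y_hat)
baseline_recall = recall_score(y_te, y_hat)

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline recall:   {baseline_recall:.2f}")
```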


## Task 7: Building a Basic Logistic Regression Model - Feature Transformation
In this task, you will build a basic logistic regression model for the fraud detection problem. You will also apply necessary feature transformations to ensure that the data is properly prepared for logistic regression. These transformations will include encoding categorical variables and normalizing continuous features.

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Split the data into features (X) and target (y)
X = data.drop('Fraud', axis=1)
y = data['Fraud']

# 2. Define which features are categorical and which are continuous
categorical_features = ['Transaction_Type', 'Transaction_Method']
continuous_features = [col for col in X.columns if col not in categorical_features]

# 3. Set up a column transformer to apply appropriate transformations
# Merchant_ID contains missing values and LogisticRegression rejects NaN, so impute first
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())  # Standardize continuous features
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, continuous_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# 4. Create a pipeline that first applies the transformations and then fits the logistic regression model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))  # Logistic Regression Model
])

# 5. Split the data into training and testing sets (stratified to preserve the fraud ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 6. Train the model using the pipeline
pipeline.fit(X_train, y_train)

# 7. Predict using the trained model
y_pred = pipeline.predict(X_test)

# 8. Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC AUC: {roc_auc:.4f}")
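
`predict` applies a default probability cut-off of 0.5, which can leave recall low when fraud is rare. The sketch below shows how a lower cut-off on `predict_proba` trades precision for recall. It uses synthetic data from `make_classification` with an assumed 5% positive rate, and the threshold 0.2 is an arbitrary illustration rather than a recommended value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumed ~5% positives)
X_syn, y_syn = make_classification(
    n_samples=4000, n_features=6, n_informative=4,
    weights=[0.95, 0.05], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
fraud_prob = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

results = {}
for threshold in (0.5, 0.2):
    y_hat = (fraud_prob >= threshold).astype(int)
    results[threshold] = {
        "precision": precision_score(y_te, y_hat, zero_division=0),
        "recall": recall_score(y_te, y_hat),
    }
    print(f"threshold={threshold}: "
          f"precision={results[threshold]['precision']:.3f}, "
          f"recall={results[threshold]['recall']:.3f}")
```

Lowering the threshold can only keep or increase recall, since every transaction flagged at 0.5 is also flagged at 0.2; the cost is usually lower precision.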

## Task 8: Evaluating Performance of Model Using Chosen Metric and Comparing with No-Machine Learning Solution
In this task, you will evaluate the performance of the logistic regression model built in Task 7 using the chosen classification metrics (accuracy, precision, recall, F1-score, and ROC AUC). Additionally, you will compare the model’s performance with the baseline no-machine learning solution built in Task 6.

In [None]:
# 1. Evaluate the logistic regression model using the same metrics
accuracy_logreg = accuracy_score(y_test, y_pred)
precision_logreg = precision_score(y_test, y_pred)
recall_logreg = recall_score(y_test, y_pred)
f1_logreg = f1_score(y_test, y_pred)
roc_auc_logreg = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])

# 2. Compare with the baseline model metrics
# We already calculated baseline metrics earlier, so we'll use those:

print("\nEvaluation of Logistic Regression Model:")
print(f"Logistic Regression Accuracy: {accuracy_logreg:.4f}")
print(f"Logistic Regression Precision: {precision_logreg:.4f}")
print(f"Logistic Regression Recall: {recall_logreg:.4f}")
print(f"Logistic Regression F1-Score: {f1_logreg:.4f}")
print(f"Logistic Regression ROC AUC: {roc_auc_logreg:.4f}")

print("\nEvaluation of Baseline Model (No-ML):")
print(f"Baseline Accuracy: {accuracy_baseline:.4f}")
print(f"Baseline Precision: {precision_baseline:.4f}")
print(f"Baseline Recall: {recall_baseline:.4f}")
print(f"Baseline F1-Score: {f1_baseline:.4f}")
print(f"Baseline ROC AUC: {roc_auc_baseline:.4f}")

# 3. Interpret the results:
print("\nInterpretation:")
print("1. Accuracy can be misleading in imbalanced datasets, so check precision and recall.")
print("2. The all-zero baseline never flags fraud, so its precision, recall, and F1-score are 0; a useful model must beat that.")
print("3. Look at the ROC AUC to see how well the model distinguishes between fraud and non-fraud (0.5 is chance level).")
print("4. If the logistic regression model significantly outperforms the baseline, it indicates the benefit of using machine learning.")
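
For a side-by-side view, the two sets of metrics can be collected into a single DataFrame. The sketch below does this with a hypothetical helper, `summarize`, and hand-made labels and scores; none of the numbers it prints come from the fraud dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def summarize(y_true, y_pred, y_score):
    """Collect the five metrics used in this notebook into one dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_score),
    }

# Hand-made example: 90 non-fraud and 10 fraud cases
y_true = np.array([0] * 90 + [1] * 10)

# Baseline: always predict non-fraud, with a constant score
base_pred = np.zeros(100, dtype=int)
base_score = np.zeros(100)

# Illustrative model scores: fraud cases tend to receive higher scores
model_score = np.concatenate([np.linspace(0.0, 0.6, 90), np.linspace(0.3, 0.9, 10)])
model_pred = (model_score >= 0.5).astype(int)

comparison = pd.DataFrame({
    "baseline": summarize(y_true, base_pred, base_score),
    "model": summarize(y_true, model_pred, model_score),
}).T

print(comparison.round(3))
```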
