# Exercise: Cross-Validation and Classifier Comparison on Imbalanced Data


In this exercise, you will:

- Use 5-fold cross-validation to evaluate the performance of several classifiers on an imbalanced dataset.

- Compare the classifiers using accuracy and F1-score metrics.

- Understand why accuracy can be misleading on imbalanced datasets and explore different classifier parameters.

a) Load and Explore the Dataset: We'll use the Credit Card Fraud Detection dataset, loaded from OpenML through scikit-learn's `fetch_openml`. It has a binary classification target and is highly imbalanced.

In this dataset, the target variable (Class) has two classes: 1 indicates a fraudulent transaction, and 0 indicates a non-fraudulent one.

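b) Define Cross-Validation and Performance Metrics: Use stratified 5-fold cross-validation (`StratifiedKFold`) to evaluate each classifier, so that every fold keeps roughly the same share of fraud cases as the full dataset.

c) Calculate accuracy and F1-score (for the fraud class) to assess model performance.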
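To see what stratification does, here is a minimal sketch on a small synthetic dataset. The array names (`X_demo`, `y_demo`) and the 2% positive rate are assumptions chosen for illustration, not values from the credit card data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data: 1,000 samples, 20 positives (assumed values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count how many positives land in each test fold
fold_positive_counts = []
for fold, (train_idx, test_idx) in enumerate(cv_demo.split(X_demo, y_demo), start=1):
    n_pos = int(y_demo[test_idx].sum())
    fold_positive_counts.append(n_pos)
    print(f"Fold {fold}: {len(test_idx)} test samples, {n_pos} positives")
```

Because the 20 positives divide evenly across 5 folds, each test fold should receive the same number of them. A plain `KFold` gives no such guarantee, and with rare classes a fold can end up with very few positives.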

d) Use the following classifiers:
- Naïve Bayes
- k-Nearest Neighbors (k-NN)
- Support Vector Machine (SVM)
- Decision Tree

For each classifier, try adjusting key parameters and observe the effect on performance metrics.
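One possible way to organize such a parameter sweep is sketched below for `n_neighbors` in k-NN. It runs on a small synthetic dataset from `make_classification` (an assumption made to keep it fast), and the scaling step and the particular values of k are illustrative choices rather than part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic imbalanced dataset (hypothetical stand-in, roughly 5% positives)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

sweep = {}
for k in (1, 5, 15):
    # Scale features first, since k-NN relies on distances between samples
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_validate(model, X_toy, y_toy, cv=cv_toy, scoring=["accuracy", "f1"])
    sweep[k] = (scores["test_accuracy"].mean(), scores["test_f1"].mean())
    print(f"k={k:>2}: accuracy={sweep[k][0]:.3f}, F1={sweep[k][1]:.3f}")
```

The same loop shape should carry over to `C` in `SVC` or `max_depth` in `DecisionTreeClassifier`: build the model inside the loop, score it with the same folds, and record both metrics side by side.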

e) Analyze the Results:
- Compare accuracy and F1-score for each classifier.
- Why might accuracy be misleading on this imbalanced dataset?
- Which classifier performed best on F1-score, and why might this be a better metric here?
- How did different parameter settings (e.g., k in k-NN, C in SVM) affect performance?
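As a starting point for the accuracy question, consider a trivial baseline that labels every transaction as non-fraud. The sketch below works through its metrics by hand, using the class counts printed by the first code cell (284315 and 492). The "always predict 0" model is a hypothetical reference point, not one of the classifiers in the exercise.

```python
# Class counts taken from the notebook's class distribution output
n_legit, n_fraud = 284315, 492

# Confusion-matrix entries for a model that always predicts class 0 (non-fraud)
tp, fp = 0, 0              # it never predicts fraud
fn, tn = n_fraud, n_legit  # every fraud is missed, every legitimate case is correct

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"Accuracy: {accuracy:.4f}")  # 0.9983
print(f"F1-score: {f1:.4f}")        # 0.0000
```

A model that detects no fraud at all still reaches about 99.8% accuracy, while its F1-score for the fraud class is 0. Any classifier in the comparison should be judged against that baseline.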

In [1]:
# Import necessary libraries
from sklearn.datasets import fetch_openml
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, accuracy_score, f1_score
import pandas as pd
import numpy as np

# Load the imbalanced dataset
data = fetch_openml(name="creditcard", version=1, as_frame=True)
df = data.frame

# Separate features and target
X = df.drop('Class', axis=1)
y = df['Class'].astype(int)  # Convert target labels to integers

# Check the imbalance in the dataset
print("Class distribution:\n", y.value_counts())




Class distribution:
 Class
0    284315
1       492
Name: count, dtype: int64


In [None]:
# Define 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define classifiers
classifiers = {
    "Naive Bayes": GaussianNB(),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
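    # Note: SVC fit time grows at least quadratically with the number of samples,
    # so this model can be very slow on the full dataset
    "SVM (C=1)": SVC(C=1, kernel='linear'),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42)  # fixed seed for reproducibility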
}

# Define scoring metrics
scoring = {
    "accuracy": make_scorer(accuracy_score),
    "f1_score": make_scorer(f1_score, pos_label=1)  # Focus on the minority class
}

# Evaluate each classifier. A single cross_validate call computes both metrics
# on the same folds, so each model is fitted 5 times instead of 10.
from sklearn.model_selection import cross_validate

results = {}
for name, clf in classifiers.items():
    print(f"\nEvaluating {name}...")
    scores = cross_validate(clf, X, y, cv=cv, scoring=scoring)

    # Result keys are "test_" + the names used in the scoring dict
    results[name] = {
        "Accuracy (mean)": np.mean(scores["test_accuracy"]),
        "F1-Score (mean)": np.mean(scores["test_f1_score"])
    }

# Display results
results_df = pd.DataFrame(results).T
print("\nCross-Validation Results:\n", results_df)