<a href="https://colab.research.google.com/github/BrundaSreedhar/credit-card-fraud-detection/blob/main/ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Initial EDA

In [None]:
import pandas as pd
import numpy as np

# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
file_path = "creditcard.csv"

# Load the latest version
df = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "mlg-ulb/creditcardfraud",
    file_path,
    # Provide any additional arguments like
    # sql_query or pandas_kwargs. See the
    # documentation for more information:
    # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

print("First 5 records:")
print(df.head())
print(df.shape)

In [None]:
print('Missing values:', df.isnull().sum().sum())
print('\nDtypes:\n', df.dtypes.value_counts())

In [None]:
df.info()

The dataset contains 284,807 credit card transactions in 31 columns: 30 input features plus the target variable Class, which marks a transaction as fraudulent (1) or legitimate (0). Time is the number of seconds elapsed since the first transaction in the dataset, and Amount is the transaction value. All columns are numerical, and the dataset contains no missing values.

In [None]:
df['Class'].value_counts()

In [None]:
df['Class'].value_counts(normalize=True)

**Core Difficulty:** The dataset is heavily imbalanced. Based on an initial analysis, approximately 99.83% of transactions are legitimate and only about 0.17% are fraudulent. A model that labels every transaction as legitimate would therefore look highly accurate while detecting no fraud at all, so accuracy alone is not a useful metric here.
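
To make the risk concrete, here is a small sketch of a baseline that labels every transaction as legitimate. The counts are assumptions for illustration: 284,807 rows as stated above, and 492 fraud cases, the figure commonly reported for this Kaggle dataset. Substitute the output of `df['Class'].value_counts()` if your copy differs.

```python
import numpy as np

# Assumed counts (see the note above); replace with df['Class'].value_counts() in the notebook
n_total = 284_807
n_fraud = 492

labels = np.zeros(n_total, dtype=int)
labels[:n_fraud] = 1                           # mark the assumed fraud cases
always_legit = np.zeros(n_total, dtype=int)    # baseline: always predict "legitimate"

baseline_accuracy = (always_legit == labels).mean()
baseline_recall = ((always_legit == 1) & (labels == 1)).sum() / labels.sum()

print(f"Accuracy: {baseline_accuracy:.4%}")
print(f"Fraud recall: {baseline_recall:.0%}")
```

The baseline never flags a transaction, so its fraud recall is zero even though its accuracy is very high.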

In [None]:
df[['Amount', 'Class']].groupby('Class').mean()


An initial comparison of transaction amounts shows that fraudulent transactions have a higher average value (approximately 122) than legitimate ones (approximately 88). This suggests that Amount may be a useful feature for fraud detection, but a difference in means says little about how much the two distributions overlap, so accurate classification will likely require combining Amount with the anonymized features provided in the dataset.

In [None]:
print("\n" + "-"*80)
print("Statistical Summary:")
print("-"*80)
print(df.describe())

In [None]:
import matplotlib.pyplot as plt

# Count class distribution
class_counts = df['Class'].value_counts()

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart
axes[0].bar(['Legitimate (0)', 'Fraudulent (1)'], class_counts.values, color=['steelblue', 'crimson'], edgecolor='black')
axes[0].set_ylabel('Number of Transactions')
axes[0].set_title('Class Distribution (Count)')

# Pie chart
axes[1].pie(
    class_counts.values,
    labels=['Legitimate', 'Fraudulent'],
    autopct='%1.3f%%',
    startangle=90
)
axes[1].set_title('Class Distribution (Percentage)')

plt.tight_layout()
plt.show()

# Print basic stats
print("Legitimate:", class_counts[0])
print("Fraudulent:", class_counts[1])
print("Imbalance Ratio (Legit : Fraud) =", round(class_counts[0] / class_counts[1], 0))


In [None]:
df.hist(bins=30, figsize=(30, 30))

Most of the other columns are roughly bell-shaped and centered on 0, which is expected because they have already been transformed. Amount is heavily right-skewed: most values are small, with a few extreme values above 25,000 that act as outliers. Time ranges from 0 to 172,792 and is spread fairly evenly across the period, with no heavy-tailed outliers. We preprocess the Amount and Time columns.


In [None]:
# --- 2.4 PCA feature distributions: fraud vs legit ---
v_features = [f'V{i}' for i in range(1, 29)]

fig, axes = plt.subplots(7, 4, figsize=(20, 28))
axes = axes.flatten()

for i, feat in enumerate(v_features):
    for label, color, name in [(0, 'steelblue', 'Legit'), (1, 'crimson', 'Fraud')]:
        axes[i].hist(df[df['Class']==label][feat], bins=50, alpha=0.5,
                     color=color, label=name, density=True)
    axes[i].set_title(feat, fontsize=10)
    axes[i].legend(fontsize=7)

plt.suptitle('PCA Feature Distributions: Fraud vs Legit', fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

In [None]:
# --- 2.5 Correlation heatmaps: legit vs fraud ---
import seaborn as sns
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

for ax, label, title in [(axes[0], 0, 'Correlation — Legit'), (axes[1], 1, 'Correlation — Fraud')]:
    corr = df[df['Class']==label][v_features + ['Amount']].corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))
    sns.heatmap(corr, mask=mask, ax=ax, cmap='RdBu_r', center=0,
                square=True, linewidths=0.5, annot=False, fmt='.1f')
    ax.set_title(title, fontsize=12)

plt.tight_layout()
plt.show()

Naive, fast and model-free check of what could potentially be important features. This does not consider variance or correlations between features but might give us some intuition on which variables might influence why specific transactions were flagged as fraud.

In [None]:
# --- 2.6 Feature importance preview: absolute difference of class means ---
fraud = df[df['Class']==1][v_features].mean()
legit = df[df['Class']==0][v_features].mean()
diff = (fraud - legit).abs().sort_values(ascending=False)

plt.figure(figsize=(12, 5))
diff.plot(kind='bar', color='darkorange', edgecolor='black')
plt.title('|Mean(Fraud) - Mean(Legit)| per Feature\n(Higher = more separability)', fontsize=13)
plt.ylabel('Absolute Mean Difference')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print('Top 10 most separating features:')
print(diff.head(10))
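
The mean-difference check above ignores how spread out each feature is. One possible variance-aware variant, sketched here on made-up data and not part of the original analysis, divides the difference by the pooled standard deviation. The helper name `standardized_mean_diff` and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features: both shift by 1.0 between classes, but "wide" is much noisier
legit_tight, fraud_tight = rng.normal(0, 0.5, 5000), rng.normal(1.0, 0.5, 200)
legit_wide, fraud_wide = rng.normal(0, 5.0, 5000), rng.normal(1.0, 5.0, 200)

def standardized_mean_diff(a, b):
    """|mean(a) - mean(b)| divided by the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return abs(a.mean() - b.mean()) / np.sqrt(pooled_var)

smd_tight = standardized_mean_diff(fraud_tight, legit_tight)
smd_wide = standardized_mean_diff(fraud_wide, legit_wide)
print(f"tight: {smd_tight:.2f}, wide: {smd_wide:.2f}")
```

Both toy features have the same underlying shift in means, but the noisier one separates the classes far less, which the raw mean difference alone would not reveal.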

In [None]:
from scipy.stats import mannwhitneyu

results = {}
for feat in v_features:
    stat, p = mannwhitneyu(
        df[df['Class']==0][feat],
        df[df['Class']==1][feat],
        alternative='two-sided'
    )
    results[feat] = {'stat': stat, 'p_value': p}

results_df = pd.DataFrame(results).T
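# Clip p-values that underflow to 0 so that -log10 stays finite and can be plotted
results_df['-log10(p)'] = -np.log10(results_df['p_value'].clip(lower=np.finfo(float).tiny))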
results_df = results_df.sort_values('-log10(p)', ascending=False)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# --- Left: -log10(p-value) ---
colors = ['crimson' if p < 0.05 else 'steelblue' for p in results_df['p_value']]
axes[0].barh(results_df.index, results_df['-log10(p)'], color=colors, edgecolor='black')
axes[0].axvline(-np.log10(0.05), color='black', linestyle='--', label='p=0.05')
axes[0].set_xlabel('-log10(p-value)')
axes[0].set_title('Mann-Whitney U: Feature Significance\n(red = significant, higher = more significant)')
axes[0].legend()

# --- Right: U-statistic (normalized) ---
n0 = (df['Class']==0).sum()
n1 = (df['Class']==1).sum()
results_df['U_norm'] = results_df['stat'] / (n0 * n1)  # ranges 0-1, 0.5 = no difference
axes[1].barh(results_df.index, results_df['U_norm'], color='darkorange', edgecolor='black')
axes[1].axvline(0.5, color='black', linestyle='--', label='No difference (0.5)')
axes[1].set_xlabel('Normalized U-statistic')
axes[1].set_title('Mann-Whitney U: Effect Size\n(further from 0.5 = stronger separation)')
axes[1].legend()

plt.tight_layout()
plt.show()


Left (-log10 p-value): which features are statistically different between fraud and legit. Almost all V features will be significant given the dataset size, so this alone isn't enough.


Right (normalized U-statistic): the *effect size*, which is more meaningful here. With the call order used above (legit sample first, fraud sample second) and a recent SciPy, it is essentially the probability that a random legit transaction scores higher than a random fraud one on that feature. Values near 0 or 1 mean strong separation; 0.5 means the feature does not separate the classes on its own.
The right plot is the more useful one for ranking features: with a dataset this large, even tiny, practically meaningless differences become statistically significant, so significance without effect size is misleading.
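
The probability interpretation can be checked on synthetic data. This sketch assumes SciPy 1.7 or later, where `mannwhitneyu(x, y)` returns the U statistic of the first sample; the samples are made up.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
toy_legit = rng.normal(0.0, 1.0, size=500)    # hypothetical feature values, legit
toy_fraud = rng.normal(-1.5, 1.0, size=40)    # hypothetical feature values, fraud

u_stat, p_value = mannwhitneyu(toy_legit, toy_fraud, alternative='two-sided')
u_norm = u_stat / (len(toy_legit) * len(toy_fraud))

# Direct estimate: share of (legit, fraud) pairs where legit > fraud, ties counted as half
pair_diff = toy_legit[:, None] - toy_fraud[None, :]
pairwise = ((pair_diff > 0).sum() + 0.5 * (pair_diff == 0).sum()) / pair_diff.size

print(f"normalized U: {u_norm:.4f}, pairwise probability: {pairwise:.4f}")
```

The two numbers agree, and swapping the argument order would give one minus this value, which is why the direction of the interpretation depends on which class is passed first.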

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))  # one row, two panels

# Overall Amount Distribution (with log scale for better visibility)
axes[0].hist(df['Amount'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Transaction Amount ($)', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Frequency (log scale)', fontsize=11, fontweight='bold')
axes[0].set_title('Distribution of Transaction Amounts (All)', fontsize=13, fontweight='bold')
axes[0].set_yscale('log')
axes[0].grid(alpha=0.3)
df_legit=df[df['Class']==0]
df_fraud=df[df['Class']==1]

# Amount Distribution by Class
# Box plot comparison
box_data = [df_legit['Amount'], df_fraud['Amount']]
bp = axes[1].boxplot(box_data, tick_labels=['Legitimate', 'Fraudulent'],
                         patch_artist=True, showfliers=False)
bp['boxes'][0].set_facecolor('#2ecc71')
bp['boxes'][1].set_facecolor('#e74c3c')
axes[1].set_ylabel('Transaction Amount ($)', fontsize=11, fontweight='bold')
axes[1].set_title('Amount Distribution by Class (No Outliers)', fontsize=13, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('02_amount_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

# Print statistics
print("="*80)
print("TRANSACTION AMOUNT STATISTICS")
print("="*80)
print(f"Overall - Mean: ${df['Amount'].mean():.2f}, Median: ${df['Amount'].median():.2f}")
print(f"Legitimate - Mean: ${df_legit['Amount'].mean():.2f}, Median: ${df_legit['Amount'].median():.2f}")
print(f"Fraudulent - Mean: ${df_fraud['Amount'].mean():.2f}, Median: ${df_fraud['Amount'].median():.2f}")
print("="*80)

The transaction amount distribution is highly right-skewed, as shown by the histogram on a log scale, indicating that most transactions involve small amounts while a few very large transactions occur infrequently. When comparing transaction amounts by class using the box plot (with outliers removed), fraudulent transactions tend to have a lower median amount than legitimate ones, even though their mean is higher due to the presence of some high-value fraud cases. This suggests that fraud commonly occurs at smaller transaction amounts, possibly to avoid detection, while occasional large fraudulent transactions significantly increase the average.
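
A higher mean combined with a lower median is a typical signature of a right-skewed distribution. The toy values below are made up and only illustrate how a single large value can pull the mean above the other group's while the median stays lower.

```python
import numpy as np

# Hypothetical amounts, chosen only to illustrate the effect
toy_legit = np.array([5.0, 10.0, 20.0, 22.0, 30.0, 40.0, 100.0])
toy_fraud = np.array([1.0, 1.0, 2.0, 9.0, 10.0, 50.0, 800.0])   # one large outlier

print("legit  mean / median:", toy_legit.mean(), np.median(toy_legit))
print("fraud  mean / median:", toy_fraud.mean(), np.median(toy_fraud))
```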

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Transaction frequency over time
axes[0, 0].hist(df['Time'], bins=100, color='teal', edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Time (seconds from first transaction)', fontsize=11, fontweight='bold')
axes[0, 0].set_ylabel('Number of Transactions', fontsize=11, fontweight='bold')
axes[0, 0].set_title('Transaction Frequency Over Time', fontsize=13, fontweight='bold')
axes[0, 0].grid(alpha=0.3)

# Time distribution by class
axes[0, 1].hist(df_legit['Time'], bins=100, alpha=0.6, label='Legitimate',
                color='#2ecc71', edgecolor='black')
axes[0, 1].hist(df_fraud['Time'], bins=100, alpha=0.6, label='Fraudulent',
                color='#e74c3c', edgecolor='black')
axes[0, 1].set_xlabel('Time (seconds)', fontsize=11, fontweight='bold')
axes[0, 1].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[0, 1].set_title('Time Distribution by Class', fontsize=13, fontweight='bold')
axes[0, 1].legend(fontsize=10)
axes[0, 1].grid(alpha=0.3)

# Fraud rate over time periods
time_bins = pd.cut(df['Time'], bins=48)
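# observed=False keeps empty time bins and avoids the pandas FutureWarning for categorical groupers
fraud_rate_time = df.groupby(time_bins, observed=False)['Class'].agg(['mean', 'count'])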
fraud_rate_time['fraud_pct'] = fraud_rate_time['mean'] * 100

axes[1, 0].plot(range(len(fraud_rate_time)), fraud_rate_time['fraud_pct'],
                marker='o', linewidth=2, markersize=4, color='crimson')
axes[1, 0].set_xlabel('Time Period', fontsize=11, fontweight='bold')
axes[1, 0].set_ylabel('Fraud Rate (%)', fontsize=11, fontweight='bold')
axes[1, 0].set_title('Fraud Rate Over Time Periods', fontsize=13, fontweight='bold')
axes[1, 0].grid(alpha=0.3)

# Scatter: Time vs Amount (sample for clarity)
sample_size = min(10000, len(df))
df_sample = df.sample(n=sample_size, random_state=42)
scatter_legit = df_sample[df_sample['Class'] == 0]
scatter_fraud = df_sample[df_sample['Class'] == 1]

axes[1, 1].scatter(scatter_legit['Time'], scatter_legit['Amount'],
                   alpha=0.3, s=10, c='#2ecc71', label='Legitimate')
axes[1, 1].scatter(scatter_fraud['Time'], scatter_fraud['Amount'],
                   alpha=0.8, s=30, c='#e74c3c', label='Fraudulent',
                   edgecolors='black', linewidth=0.5)
axes[1, 1].set_xlabel('Time (seconds)', fontsize=11, fontweight='bold')
axes[1, 1].set_ylabel('Amount ($)', fontsize=11, fontweight='bold')
axes[1, 1].set_title(f'Time vs Amount (Sample: {sample_size:,})', fontsize=13, fontweight='bold')
axes[1, 1].legend(fontsize=10)
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('03_time_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("Time distribution analysis complete!")

The time-based visualizations reveal several patterns in transaction behavior and fraud occurrence. Overall transaction frequency is not uniform across time: it shows distinct peaks and low-activity periods, which suggests cyclic usage patterns. Fraudulent transactions broadly follow the same temporal structure as legitimate ones, so fraud does not occur only at specific times but blends into normal activity. However, the fraud rate over time shows noticeable spikes in certain periods, where the proportion of fraudulent transactions rises even though the overall transaction count is low. Fraud risk therefore depends on the time window and not just on transaction volume. Finally, the Time vs Amount scatter plot shows that fraudulent transactions tend to cluster at lower amounts but occasionally appear as higher-value outliers, which reinforces the need to consider time-based patterns alongside transaction amounts when detecting fraud.
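
The cyclic pattern described above could be turned into a feature by folding Time into a 24-hour cycle. This is a sketch that the notebook does not use; it assumes Time is in seconds since the first transaction. The dataset does not give the clock time of that first transaction, so hour 0 is only a relative reference point.

```python
import numpy as np

# Hypothetical Time values in seconds since the first transaction
time_seconds = np.array([0, 3_599, 3_600, 86_399, 86_400, 172_792])

# Hour within a 24-hour cycle; hour 0 is the first transaction, not necessarily midnight
hour_of_cycle = (time_seconds // 3600) % 24
print(hour_of_cycle)
```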

# Data Preprocessing


In [None]:
# Amount ranges from 0 to 25691.16
# Note: the scaler below is fit on the full dataset, before the train/validation/test split

from sklearn.preprocessing import RobustScaler, StandardScaler
new_df = df.copy()
new_df['Amount'] = RobustScaler().fit_transform(new_df['Amount'].to_numpy().reshape(-1, 1))
new_df['Amount'].hist()

In [None]:
new_df['Amount'].describe()

The standard deviation is now much smaller. There are still outliers, but the scale is far more manageable than before.
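
With its default settings, RobustScaler subtracts the median and divides by the interquartile range, so a handful of extreme values has little influence on the scale. A hand-rolled sketch on made-up amounts, with mean/std scaling shown for comparison:

```python
import numpy as np

# Hypothetical amounts with one extreme value
toy_amounts = np.array([1.0, 5.0, 10.0, 20.0, 50.0, 25691.16])

median = np.median(toy_amounts)
q1, q3 = np.percentile(toy_amounts, [25, 75])
robust_scaled = (toy_amounts - median) / (q3 - q1)

# For comparison: mean/std scaling, which the single outlier dominates
standard_scaled = (toy_amounts - toy_amounts.mean()) / toy_amounts.std()

print("robust:  ", np.round(robust_scaled, 2))
print("standard:", np.round(standard_scaled, 2))
```

After robust scaling the typical values stay spread out around 0, whereas mean/std scaling squeezes them together because the outlier inflates both the mean and the standard deviation.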


We'll simply standardize Time, since it doesn't seem to have any outliers.

In [None]:
time = new_df['Time']
#standard scaler
new_df['Time'] = StandardScaler().fit_transform(new_df[['Time']])
new_df.head()

In [None]:
new_df = new_df.sample(frac=1, random_state=42)
new_df

In [None]:
from sklearn.model_selection import train_test_split

train, temp = train_test_split(
    new_df,
    test_size=0.2,
    stratify=new_df['Class'],
    random_state=42
)

test, val = train_test_split(
    temp,
    test_size=0.5,
    stratify=temp['Class'],
    random_state=42
)

In [None]:
x_train = train.drop(columns=['Class'])
y_train = train['Class']

x_test = test.drop(columns=['Class'])
y_test = test['Class']

x_val = val.drop(columns=['Class'])
y_val = val['Class']

x_train.shape, y_train.shape, x_test.shape, y_test.shape, x_val.shape, y_val.shape
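
The scalers in this notebook are fit on the full dataset before splitting. A stricter alternative, sketched here on synthetic data as an assumption and not as the notebook's method, fits each scaler on the training split only and then applies it to the other splits. All `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'Time': rng.uniform(0, 172792, size=2000),
    'Amount': rng.lognormal(mean=3.0, sigma=1.5, size=2000),
    'Class': (rng.random(2000) < 0.05).astype(int),
})

toy_train, toy_val = train_test_split(
    toy, test_size=0.2, stratify=toy['Class'], random_state=42
)
toy_train, toy_val = toy_train.copy(), toy_val.copy()

# Fit on the training split only, so validation data never influences the scaling
amount_scaler = RobustScaler().fit(toy_train[['Amount']])
time_scaler = StandardScaler().fit(toy_train[['Time']])

for part in (toy_train, toy_val):
    part['Amount'] = amount_scaler.transform(part[['Amount']]).ravel()
    part['Time'] = time_scaler.transform(part[['Time']]).ravel()

print(toy_train[['Time', 'Amount']].describe().loc[['mean', '50%']])
```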

# Modelling

Try Logistic Regression with and without class weights

In [None]:
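# Logistic regression baseline (no class weights)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# max_iter raised from the default of 100 to reduce the chance of a convergence warning
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(x_train, y_train)
# .score() reports plain accuracy, which is dominated by the majority class
logistic_model.score(x_val, y_val)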

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_val, logistic_model.predict(x_val), target_names=['Legit', 'Fraud']))

In [None]:
from sklearn.metrics import precision_recall_curve
import numpy as np

logistic_model_weighted = LogisticRegression(class_weight='balanced', max_iter=1000)
logistic_model_weighted.fit(x_train, y_train)

print("Before tuning threshold.. ")
print(classification_report(y_val, logistic_model_weighted.predict(x_val), target_names=['Legit', 'Fraud']))

probs = logistic_model_weighted.predict_proba(x_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, probs)

print("Precision:", precision)
print("Recall:", recall)
print("Thresholds:", thresholds)
#choose threshold that maximizes F1
f1 = 2 * (precision * recall) / (precision + recall + 1e-9)
best_idx = np.argmax(f1)
best_threshold = thresholds[best_idx]

print("Best threshold:", best_threshold)

In [None]:
from sklearn.metrics import average_precision_score

print("After tuning threshold.. ")
y_pred = (probs >= best_threshold).astype(int)
print(classification_report(y_val, y_pred))
#print PR AUC
print("PR AUC :", average_precision_score(y_val, probs))

Precision tells us how many of the transactions we flagged as fraud really were fraud, so it reflects false positives (calling a transaction fraud when it wasn't). Higher precision means fewer legitimate transactions were flagged.

Recall reflects false negatives: transactions that were fraud but that the model predicted as legitimate. Recall is important because we want to catch the fraudulent transactions.

Accuracy is not a useful guide here. Because the dataset is so imbalanced, a model can be highly accurate while missing most fraud, so we focus on precision and recall instead. Accuracy would be more informative if the classes were balanced.
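
The three metrics can be written directly in terms of confusion-matrix counts. The counts below are hypothetical and are not results from this notebook.

```python
# Hypothetical validation counts: true/false positives and negatives
tp, fp, fn, tn = 80, 20, 20, 99_880

precision_value = tp / (tp + fp)                  # of flagged transactions, share that were fraud
recall_value = tp / (tp + fn)                     # of fraud transactions, share that were flagged
accuracy_value = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that were correct

print(f"precision={precision_value:.2f} recall={recall_value:.2f} accuracy={accuracy_value:.4f}")
```

In this toy case one in five fraud cases is missed, yet accuracy stays very close to 1 because the legitimate class dominates the total.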

# Random Forest

Experiments with and without SMOTE

In [None]:
# Baseline random forest with default hyperparameters
from sklearn.ensemble import RandomForestClassifier

# random_state fixed so the run is reproducible
random_forest = RandomForestClassifier(random_state=42)
random_forest.fit(x_train, y_train)

print(classification_report(y_val, random_forest.predict(x_val), target_names=['Legit', 'Fraud']))

In [None]:
#try tuning threshold

y_probs = random_forest.predict_proba(x_val)[:, 1]   # probability of fraud

precision, recall, thresholds = precision_recall_curve(y_val, y_probs)

#choose threshold that maximizes F1
f1 = 2 * (precision * recall) / (precision + recall + 1e-9)
best_idx = np.argmax(f1)
best_threshold = thresholds[best_idx]

print("Best threshold:", best_threshold)

print("After tuning threshold.. ")
y_pred_tuned = (y_probs >= best_threshold).astype(int)
print(classification_report(y_val, y_pred_tuned))
print("PR AUC ", average_precision_score(y_val, y_probs))
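
The F1-based threshold search is repeated for several models in this notebook, so it could be factored into a helper. The function below is a sketch; the name `best_f1_threshold` is hypothetical. It drops the final precision/recall point, which has no matching threshold, before taking the argmax. Note that tuning and evaluating a threshold on the same validation set gives an optimistic estimate.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) for the threshold that maximizes F1 on the given data."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The last (precision=1, recall=0) point has no threshold; align the arrays
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0)
    best_idx = int(np.argmax(f1))
    return float(thresholds[best_idx]), float(f1[best_idx])

# Toy usage with made-up labels and scores
toy_threshold, toy_f1 = best_f1_threshold(
    np.array([0, 0, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8])
)
print(toy_threshold, toy_f1)
```

Predictions should then be formed with `scores >= threshold`, which matches how `precision_recall_curve` defines each threshold.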

Calibrated classifier (CalibratedClassifierCV)

In [None]:
from sklearn.calibration import CalibratedClassifierCV

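# cv='prefit' reuses the already-fitted forest and only fits the sigmoid calibrator.
# Recent scikit-learn releases deprecate this option in favour of wrapping the
# model in sklearn.frozen.FrozenEstimator.
calibrated_rf = CalibratedClassifierCV(
    random_forest,
    method='sigmoid',
    cv='prefit'
)

# Caveat: the calibrator is fit on the validation set and evaluated on it below,
# so these validation metrics are optimistic
calibrated_rf.fit(x_val, y_val)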

y_probs = calibrated_rf.predict_proba(x_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, y_probs)
#choose threshold that maximizes F1
f1 = 2 * (precision * recall) / (precision + recall + 1e-9)
best_idx = np.argmax(f1)
best_threshold = thresholds[best_idx]

print("Best threshold:", best_threshold)

print("After tuning threshold.. ")
y_pred_tuned = (y_probs >= best_threshold).astype(int)
print(classification_report(y_val, y_pred_tuned))
ap = average_precision_score(y_val, y_probs)
print("PR-AUC:", ap)
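
Sigmoid calibration passes the model's score through a fitted logistic map. As long as that map is increasing, which is the usual case, the ranking of transactions does not change, so ranking metrics such as PR-AUC stay the same and only the probability values and the chosen threshold move. A sketch with made-up scores and made-up sigmoid parameters (not fitted values):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
toy_y = (rng.random(2000) < 0.05).astype(int)
# Hypothetical raw scores: fraud cases tend to score higher
raw_scores = np.clip(rng.normal(0.2 + 0.5 * toy_y, 0.2), 0, 1)

def platt_like(scores, a=4.0, b=-2.0):
    """Sigmoid of a*score + b; a and b are illustrative, not fitted."""
    return 1.0 / (1.0 + np.exp(-(a * scores + b)))

ap_raw = average_precision_score(toy_y, raw_scores)
ap_sigmoid = average_precision_score(toy_y, platt_like(raw_scores))
print(ap_raw, ap_sigmoid)
```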

This classifier has pretty good precision, but we'd like to see whether we can improve recall without destroying precision.


Let's try adding class weights:

In [None]:
random_forest_weighted = RandomForestClassifier(class_weight="balanced", random_state=42)
random_forest_weighted.fit(x_train, y_train)

print(classification_report(y_val, random_forest_weighted.predict(x_val), target_names=['Legit', 'Fraud']))

Let's try SMOTE + Random Forest
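
For intuition, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function below is a simplified illustration of that idea on made-up 2-D points; it is not imblearn's implementation, and `synthesize_one` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in 2-D
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.3], [1.1, 1.4]])

def synthesize_one(points, k=2):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]     # skip position 0, the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                          # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([synthesize_one(minority) for _ in range(5)])
print(synthetic)
```

Because every synthetic point lies on a segment between two real minority points, oversampling adds no information beyond the existing fraud cases; it only changes how much weight that region of feature space receives during training.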


In [None]:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# define pipeline
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('rf', RandomForestClassifier(
        n_estimators=200,
        random_state=42,
        n_jobs=-1
    ))
])

# train (SMOTE applied ONLY to training data)
pipeline.fit(x_train, y_train)

# predict
y_pred = pipeline.predict(x_val)

print(classification_report(y_val, y_pred))
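# PR-AUC computed from this pipeline's predicted fraud probabilities
print("PR-AUC:", average_precision_score(y_val, pipeline.predict_proba(x_val)[:, 1]))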

In [None]:
y_probs = pipeline.predict_proba(x_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, y_probs)

#choose threshold that maximizes F1
f1 = 2 * (precision * recall) / (precision + recall + 1e-9)
best_idx = np.argmax(f1)
best_threshold = thresholds[best_idx]

print("Best threshold:", best_threshold)

print("After tuning threshold.. ")
y_pred_tuned = (y_probs >= best_threshold).astype(int)
print(classification_report(y_val, y_pred_tuned))

ap = average_precision_score(y_val, y_probs)
print("PR-AUC:", ap)

Another experiment with RF + SMOTE

In [None]:
rf_model = Pipeline([
    ("smote", SMOTE(
        sampling_strategy=0.3,   # minority/majority ratio of 0.3 after resampling, not a full 1:1 balance
        k_neighbors=5,
        random_state=42
    )),
    ("rf", RandomForestClassifier(
        n_estimators=400,
        max_depth=12,
        min_samples_leaf=2,
        n_jobs=-1,
        random_state=42
    ))
])

#fit model
rf_model.fit(x_train, y_train)

y_probs = rf_model.predict_proba(x_val)[:, 1]
#threshold tuning
precision, recall, thresholds = precision_recall_curve(y_val, y_probs)

f1 = 2 * precision * recall / (precision + recall + 1e-9)
best_idx = np.argmax(f1)
best_threshold = thresholds[best_idx]

print("Best threshold:", best_threshold)

y_pred = (y_probs >= best_threshold).astype(int)

print(classification_report(y_val, y_pred))

pr_auc = average_precision_score(y_val, y_probs)
print("PR-AUC:", pr_auc)

Cross-validated training

In [None]:
#gridsearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)

param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [8, 12, None],
    "min_samples_leaf": [1, 2, 5]
}

grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring="average_precision",   # select hyperparameters by PR-AUC (average precision)
    cv=3,
    n_jobs=-1
)

grid.fit(x_train, y_train)

print("Best PR-AUC:", grid.best_score_)
print("Best params:", grid.best_params_)

best_model = grid.best_estimator_

In [None]:

from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    rf_model,
    x_train,
    y_train,
    scoring="average_precision",
    cv=cv,
    n_jobs=-1
)

print("CV PR-AUC:", scores.mean())
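
Stratified folds matter here because fraud is so rare: stratification keeps the number of positives per fold as even as possible. A small sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 1,000 samples, 2% positive
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.zeros((len(y_toy), 1))    # features are irrelevant for the split itself

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
positives_per_fold = [int(y_toy[val_idx].sum()) for _, val_idx in cv_toy.split(X_toy, y_toy)]
print(positives_per_fold)
```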

# Neural Network

In [None]:
import tensorflow as tf
import keras
from keras import layers, callbacks

# --- Shallow feed-forward network ---

shallow_nn = keras.models.Sequential([
    layers.InputLayer(shape=(x_train.shape[1],)),

    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),

    layers.Dense(32, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),

    layers.Dense(1, activation='sigmoid')
])


# Class weights for imbalance: up-weight fraud, damped to 30% of the full neg/pos ratio
neg, pos = np.bincount(y_train.astype(int))
class_weight = {0: 1.0, 1: (neg / pos) * 0.3}

early_stop = callbacks.EarlyStopping(
    monitor='val_auc', patience=10,
    mode='max', restore_best_weights=True
)


checkpoint = callbacks.ModelCheckpoint(
    'shallow_nn.keras',      # saved in the native .keras format
    monitor='val_auc',       # track validation PR-AUC rather than loss
    save_best_only=True,
    mode='max'               # higher AUC is better
)

shallow_nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=[
        keras.metrics.AUC(name='auc', curve='PR')])
shallow_nn.summary()
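
The class_weight dictionary defined above scales each sample's loss by the weight of its class. The sketch below shows the effect on a per-sample binary cross-entropy using made-up counts and outputs. The final reduction is a simplification and may differ from how Keras normalises weighted losses.

```python
import numpy as np

# Hypothetical training counts
toy_neg, toy_pos = 9_980, 20
toy_class_weight = {0: 1.0, 1: (toy_neg / toy_pos) * 0.3}   # same formula as the notebook

toy_y = np.array([0, 0, 0, 1])
toy_prob = np.array([0.1, 0.2, 0.1, 0.3])    # made-up model outputs

# Per-sample binary cross-entropy
bce = -(toy_y * np.log(toy_prob) + (1 - toy_y) * np.log(1 - toy_prob))
weights = np.where(toy_y == 1, toy_class_weight[1], toy_class_weight[0])

unweighted_loss = bce.mean()
weighted_loss = (weights * bce).mean()       # simplified reduction (assumption)
print(toy_class_weight[1], unweighted_loss, weighted_loss)
```

The single poorly predicted fraud sample dominates the weighted loss, which is what pushes the network toward higher recall on the minority class.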

In [None]:
history = shallow_nn.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=100,          # let early stopping decide when to quit
    batch_size=2048,     # larger batches = faster epochs, more stable gradients
    class_weight=class_weight,
    callbacks=[checkpoint, early_stop]
)

# todo: try using algorithmic threshold split

In [None]:
y_proba = shallow_nn.predict(x_val)
y_pred = (y_proba > 0.8).astype(int)
print(classification_report(y_val, y_pred, target_names=['Legit', 'Fraud']))

In [None]:
from sklearn.metrics import precision_recall_curve, classification_report
import numpy as np

# Get validation probabilities (risk scores)
y_proba = shallow_nn.predict(x_val).ravel()

# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_val, y_proba)

# Compute F1 scores
f1 = 2 * (precision * recall) / (precision + recall + 1e-9)

# Find best threshold
best_idx = np.argmax(f1)
best_threshold = thresholds[best_idx]

print("Best threshold (validation):", best_threshold)

# Apply optimal threshold
# Use >= so the prediction rule matches how precision_recall_curve defines each threshold
y_pred_optimal = (y_proba >= best_threshold).astype(int)

print("\nValidation performance with optimal threshold:")
print(classification_report(y_val, y_pred_optimal, target_names=['Legit', 'Fraud']))


After increasing network capacity and adjusting class weights, the neural network reached a recall of 0.92 with a precision of 0.77 on the validation set.
This suggests that imbalance handling and architecture tuning can improve performance substantially, although the threshold was tuned on the same validation set, so these figures are likely optimistic.



# XGBoost

In [None]:
%pip install xgboost


In [None]:
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()

scale_pos_weight = neg / pos


We computed the number of legitimate and fraud transactions in the training set.
Since the dataset is highly imbalanced, we calculated scale_pos_weight to penalize mistakes on the minority (fraud) class more heavily.
This helps XGBoost focus more on detecting fraud cases.
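
One way to read scale_pos_weight = neg / pos is that it gives the fraud class the same total weight as the legitimate class. A small arithmetic sketch with hypothetical counts:

```python
# Hypothetical training counts
toy_neg, toy_pos = 9_980, 20
toy_scale_pos_weight = toy_neg / toy_pos

total_negative_weight = toy_neg * 1.0                    # each legit sample has weight 1
total_positive_weight = toy_pos * toy_scale_pos_weight   # each fraud sample is up-weighted

print(toy_scale_pos_weight, total_negative_weight, total_positive_weight)
```

Values below neg / pos are sometimes tried as well, trading some recall for better precision.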

In [None]:
from xgboost import XGBClassifier

# Same ratio as scale_pos_weight computed above (negatives per positive)
pos_weight = neg / pos

xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    scale_pos_weight=pos_weight,  # important for imbalance
    random_state=42
)

# Fit on the training split
xgb.fit(x_train, y_train)

y_pred = xgb.predict(x_val)
y_proba = xgb.predict_proba(x_val)[:, 1]


In [None]:
from sklearn.metrics import classification_report, confusion_matrix, average_precision_score

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

# PR-AUC needs continuous scores (fraud probabilities), not hard 0/1 predictions
pr_auc = average_precision_score(y_val, y_proba)
print("PR-AUC:", pr_auc)
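
average_precision_score summarises the whole precision-recall curve, so it is meant to receive continuous scores such as predict_proba output. Hard 0/1 predictions collapse the curve to a single operating point and generally give a different number. A toy sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import average_precision_score

toy_y = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])      # made-up probabilities
toy_hard = (toy_scores >= 0.5).astype(int)        # hard predictions at a 0.5 threshold

ap_from_scores = average_precision_score(toy_y, toy_scores)
ap_from_labels = average_precision_score(toy_y, toy_hard)
print(ap_from_scores, ap_from_labels)
```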



In [None]:
#Threshold tuning the XGBoost Model
precision, recall, thresholds = precision_recall_curve(y_val, y_proba)

#choose threshold that maximizes F1
f1 = 2 * (precision * recall) / (precision + recall + 1e-9)
best_idx = np.argmax(f1)
best_threshold = thresholds[best_idx]

print("Best threshold:", best_threshold)

print("After tuning threshold.. ")
y_pred_tuned = (y_proba >= best_threshold).astype(int)
print(classification_report(y_val, y_pred_tuned))