### Key Insights

- Handling class imbalance through SMOTE balanced the training data (284,315 samples per class), which is intended to improve the model's ability to detect fraudulent transactions.
- High recall means few false negatives, reducing the likelihood of undetected fraud.
- Random Forest's feature importance scores can show which attributes are most indicative of fraudulent behavior, although this analysis is not yet part of the notebook (see Limitations).
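
Recall depends on the decision threshold applied to the predicted probabilities, not only on the model. The sketch below is a minimal illustration, assuming a hypothetical helper `threshold_for_recall` and invented toy scores; it picks the strictest threshold that still reaches a target recall using scikit-learn's `precision_recall_curve`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, scores, target_recall=0.95):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall is still at least `target_recall`."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall carry one extra trailing point (recall = 0)
    # that has no matching threshold, so drop it before comparing.
    meets_target = recall[:-1] >= target_recall
    # Thresholds are sorted ascending and recall never increases with the
    # threshold, so the last qualifying index is the strictest cutoff.
    idx = np.flatnonzero(meets_target)[-1]
    return thresholds[idx], precision[idx], recall[idx]


# Toy labels and scores, invented for illustration only.
y_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90])

thr, prec, rec = threshold_for_recall(y_toy, scores_toy, target_recall=1.0)
print(f"threshold={thr:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Lowering the threshold raises recall at the cost of precision, so the target recall is a business decision about how many false alarms the investigation team can absorb.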

### Key Results

By distinguishing fraudulent from legitimate transactions, the business can focus manual investigation efforts only on flagged (high-risk) transactions, reducing the workload on fraud investigation teams.

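- ROC-AUC Score (1.0000 on the SMOTE-balanced test split): Indicates how well the model ranks fraudulent transactions above legitimate ones across all thresholds.
- AUPRC (1.0000 on the SMOTE-balanced test split): Summarizes the precision-recall trade-off, which is critical for imbalanced datasets. Higher values mean precision stays high as recall increases, so more fraud is caught with fewer false alarms.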
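
To see why AUPRC is the more demanding metric under imbalance, the following sketch compares both metrics on synthetic scores with roughly 1% positives. The data and both scorers are invented for illustration. It uses `average_precision_score`, a step-wise estimate of AUPRC that can differ slightly from the trapezoidal `auc(recall, precision)` used in the notebook.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels with roughly 1% positives (not the notebook's data).
n_samples = 20_000
y_demo = (rng.random(n_samples) < 0.01).astype(int)
prevalence = y_demo.mean()

# Scorer 1: pure noise that carries no information about the label.
random_scores = rng.random(n_samples)
# Scorer 2: noise plus a shift for positives, informative but imperfect.
model_scores = rng.normal(size=n_samples) + 2.0 * y_demo

for name, s in [("random", random_scores), ("informative", model_scores)]:
    print(f"{name:12s} ROC-AUC={roc_auc_score(y_demo, s):.3f} "
          f"AUPRC={average_precision_score(y_demo, s):.3f}")
print(f"Fraud prevalence (the AUPRC baseline for a random scorer): {prevalence:.3f}")
```

A random scorer sits near 0.5 on ROC-AUC but near the fraud prevalence on AUPRC, so AUPRC values should be read against the class ratio of the data they were computed on.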

### Limitations

1. Overfitting

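- Perfect Metrics: ROC-AUC, AUPRC, and every classification report metric score 1.00. Scores this perfect more plausibly indicate overfitting or data leakage than a flawless model, especially since the test data is not representative of real-world scenarios.
- Cause:
    - SMOTE is applied to the full dataset before the train/test split, so synthetic fraud samples interpolated from the same original cases appear in both sets and are too easy for the model to classify.
    - The scaler is likewise fitted on all rows before splitting.
    - The test set is balanced 50/50 and its fraud half is almost entirely synthetic (only 492 fraud cases are real), so it lacks real-world variability.
- Impact: The model may perform poorly on unseen data where fraudulent patterns differ.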

2. Dependency on SMOTE

- Issue: SMOTE balances the dataset artificially by generating synthetic samples. While this improves training, it does not reflect the real-world imbalance of fraudulent vs. non-fraudulent transactions.
- Impact: The model might face challenges when deployed in production, where the imbalance is extreme (fraud is about 0.17% of the original dataset and can fall below 0.1% in practice).
- Solution: Explore ensemble techniques or anomaly detection models that handle imbalanced datasets without relying heavily on resampling.
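
As a rough sketch of the two directions named in the solution, the code below trains a class-weighted Random Forest and an unsupervised `IsolationForest` without any resampling. The data is synthetic and deliberately easy to separate (an assumption made for illustration), so the printed scores say nothing about performance on real transactions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 5,000 "legitimate" rows near 0 and 50 "fraud" rows shifted away.
X_legit = rng.normal(loc=0.0, scale=1.0, size=(5000, 5))
X_fraud = rng.normal(loc=4.0, scale=1.0, size=(50, 5))
X_demo = np.vstack([X_legit, X_fraud])
y_demo = np.concatenate([np.zeros(5000, dtype=int), np.ones(50, dtype=int)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Option 1: reweight classes inside each bootstrap sample instead of resampling.
weighted_rf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced_subsample", random_state=0
)
weighted_rf.fit(X_tr, y_tr)
rf_ap = average_precision_score(y_te, weighted_rf.predict_proba(X_te)[:, 1])

# Option 2: unsupervised anomaly detection; labels are not used for fitting.
iso = IsolationForest(n_estimators=100, random_state=0).fit(X_tr)
# score_samples is higher for normal points, so negate it to get a fraud score.
iso_ap = average_precision_score(y_te, -iso.score_samples(X_te))

print(f"Class-weighted Random Forest AUPRC: {rf_ap:.3f}")
print(f"Isolation Forest AUPRC:             {iso_ap:.3f}")
```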

3. Lack of Feature Importance Analysis

- Issue: The project does not explicitly analyze which features are most important for detecting fraud.
- Impact: Without understanding feature importance, it is difficult to improve data collection or identify potential biases in the data.
- Solution: Use the Random Forest feature importance scores or SHAP values to gain insights into the factors driving predictions.
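
A minimal sketch of the first suggestion, using scikit-learn only: impurity-based importances from `feature_importances_` and, as a cross-check, `permutation_importance` on held-out rows. The dataset and the `feat_*` column names are hypothetical, and only `feat_0` and `feat_1` influence the label by construction. SHAP is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data with hypothetical column names.
feature_names = [f"feat_{i}" for i in range(5)]
X_demo = pd.DataFrame(rng.normal(size=(4000, 5)), columns=feature_names)
signal = 2.0 * X_demo["feat_0"] + X_demo["feat_1"] + 0.3 * rng.normal(size=4000)
y_demo = (signal > np.quantile(signal, 0.95)).astype(int)  # about 5% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances: computed from the training data during fitting.
impurity_imp = pd.Series(
    rf.feature_importances_, index=feature_names
).sort_values(ascending=False)

# Permutation importances: drop in held-out AUPRC when one column is shuffled.
perm = permutation_importance(
    rf, X_te, y_te, scoring="average_precision", n_repeats=5, random_state=0
)
perm_imp = pd.Series(
    perm.importances_mean, index=feature_names
).sort_values(ascending=False)

print("Impurity-based importance:\n", impurity_imp, sep="")
print("\nPermutation importance (held-out AUPRC drop):\n", perm_imp, sep="")
```

Impurity-based scores are computed on the training data and can favor high-cardinality features, so agreement between the two rankings is a useful sanity check before drawing conclusions.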

4. Model Interpretability

- Issue: Random Forests are not inherently interpretable, making it hard to explain predictions to stakeholders or regulators.
- Impact: Businesses may face difficulties in justifying flagged transactions to customers or complying with regulatory requirements.
- Solution: Complement the model with interpretable methods, such as logistic regression or rule-based systems, for explainability in critical cases.
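
As one possible form of the interpretable companion model, the sketch below fits a class-weighted logistic regression on standardized features and reads its coefficients as odds ratios. The data and the `feat_*` names are invented for illustration; by construction `feat_0` raises and `feat_1` lowers the fraud odds, while the remaining columns are noise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data with hypothetical column names.
feature_names = ["feat_0", "feat_1", "feat_2", "feat_3"]
X_demo = pd.DataFrame(rng.normal(size=(3000, 4)), columns=feature_names)
true_logit = 2.0 * X_demo["feat_0"] - 1.0 * X_demo["feat_1"] - 4.0
y_demo = (rng.random(3000) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

# Standardizing first makes coefficients comparable across features.
explainer = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
explainer.fit(X_demo, y_demo)

# Each coefficient is the change in log-odds per one standard deviation.
coefs = pd.Series(explainer[-1].coef_[0], index=feature_names)
odds_ratios = np.exp(coefs)

report = pd.DataFrame({"coefficient": coefs, "odds_ratio": odds_ratios})
print(report.sort_values("coefficient", key=np.abs, ascending=False))
```

An odds ratio above 1 means higher values of that feature push a transaction toward the fraud class, which gives a statement that can be communicated to stakeholders even when the Random Forest makes the final decision.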

In [1]:
import zipfile
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, auc
from imblearn.over_sampling import SMOTE  # SMOTE for oversampling the minority class
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
import joblib

In [2]:
# Path to the zip file
zip_file = 'Credit Card Fraud1.zip'
extract_folder = 'Credit Card Fraud1'  

# Extract the contents of the zip file
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    zip_ref.extractall(extract_folder)

# Check if the file exists after extraction
extracted_files = os.listdir(extract_folder)
print(f"Files extracted: {extracted_files}")

# Now load the CSV file
data = pd.read_csv(os.path.join(extract_folder, 'creditcard.csv'))

# Inspect the first few rows of the dataset
print(data.head())

# Check for missing values and general information
print(data.info())

# Check for class distribution
print(data['Class'].value_counts())

Files extracted: ['creditcard.csv']
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010 

In [3]:
# Feature columns (X) and target column (y)
X = data.drop('Class', axis=1)
y = data['Class']

# Standardize all features (including 'Time' and 'Amount')
# Note: the scaler is fitted on every row, including rows that later form the test set
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Scale all features

# Convert the scaled data back to DataFrame
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

# Check the distribution of the classes before splitting
print("Class distribution before splitting:")
print(y.value_counts())

Class distribution before splitting:
Class
0    284315
1       492
Name: count, dtype: int64


In [4]:
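# Handle class imbalance by oversampling the minority class (Class == 1) with SMOTE
# Caution: resampling here, before the train/test split, lets synthetic samples
# derived from the same original frauds land in both sets and inflates test metrics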
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_scaled, y)

# Check the new class distribution
print("\nClass distribution after SMOTE oversampling:")
print(pd.Series(y_resampled).value_counts())


Class distribution after SMOTE oversampling:
Class
0    284315
1    284315
Name: count, dtype: int64


In [5]:
# Split the resampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'class_weight': ['balanced', None]  # Class weighting; has little effect here because SMOTE already balanced the classes
}

# Perform GridSearchCV with StratifiedKFold cross-validation
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=StratifiedKFold(n_splits=3), n_jobs=-1, verbose=2, scoring='roc_auc')
grid_search.fit(X_train, y_train)

# Best parameters from GridSearchCV
print(f"Best Parameters: {grid_search.best_params_}")

Fitting 3 folds for each of 36 candidates, totalling 108 fits
Best Parameters: {'class_weight': None, 'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}


In [6]:
# Evaluate the model on the test set using the best parameters
best_rf_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_rf_model.predict(X_test)
y_pred_proba = best_rf_model.predict_proba(X_test)[:, 1]  # Get probability scores for ROC-AUC

# Evaluate the model's performance using the AUPRC metric
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Calculate the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

# Print the AUPRC score
print(f"Area Under the Precision-Recall Curve (AUPRC): {auprc:.4f}")

# Save the Model (Optional)
joblib.dump(best_rf_model, 'credit_card_fraud_model.pkl')
print("Model saved as 'credit_card_fraud_model.pkl'")


Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56863
           1       1.00      1.00      1.00     56863

    accuracy                           1.00    113726
   macro avg       1.00      1.00      1.00    113726
weighted avg       1.00      1.00      1.00    113726

ROC-AUC Score: 1.0000
Area Under the Precision-Recall Curve (AUPRC): 1.0000
Model saved as 'credit_card_fraud_model.pkl'
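
The saved file above contains only the Random Forest, so anyone loading it would also need the fitted scaler to prepare new transactions. A possible alternative, sketched here on invented data with a hypothetical file name, is to persist the scaler and the model together as one scikit-learn `Pipeline` and confirm that the reloaded object reproduces the same probabilities.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Invented data with very different column scales, for illustration only.
X_demo = rng.normal(size=(500, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = (X_demo[:, 0] > 1.0).astype(int)

# Bundle preprocessing and model so they are saved and loaded as one object.
fraud_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
fraud_pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the notebook itself saves 'credit_card_fraud_model.pkl'.
    model_path = os.path.join(tmp_dir, "fraud_pipeline_demo.pkl")
    joblib.dump(fraud_pipeline, model_path)
    loaded_pipeline = joblib.load(model_path)

# Raw (unscaled) rows go straight in; the pipeline applies the stored scaler.
new_rows = X_demo[:5]
probs_before = fraud_pipeline.predict_proba(new_rows)[:, 1]
probs_after = loaded_pipeline.predict_proba(new_rows)[:, 1]
print("Reloaded pipeline matches original:", np.allclose(probs_before, probs_after))
```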
