In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [2]:
# Load dataset
data_path = 'credit_card_fraud.csv'
df = pd.read_csv(data_path)

In [3]:
# Calculate the proportion of fraud cases to the total
fraud_proportion = df['is_fraud'].value_counts(normalize=True)

print(fraud_proportion)

is_fraud
0    0.994753
1    0.005247
Name: proportion, dtype: float64


<h1> Preprocessing </h1>

In [4]:
# Extracting datetime features and cardholder's age
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
df['dob'] = pd.to_datetime(df['dob'])
df['transaction_hour'] = df['trans_date_trans_time'].dt.hour
df['age'] = np.round((df['trans_date_trans_time'] - df['dob']).dt.days / 365.25, 0)

In [5]:
# Identifying categorical and numeric columns
categorical_cols = ['merchant', 'category', 'city', 'state', 'job']
numeric_cols = ['amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'transaction_hour', 'age']

In [8]:
# Defining transformers for the preprocessing pipeline
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
numeric_transformer = StandardScaler()

# Combining transformers into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Defining the feature set X and the target variable y
X = df.drop('is_fraud', axis=1)  # Features
y = df['is_fraud']  # Target

In [9]:
# Splitting the dataset into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fitting the ColumnTransformer on the training data only, then applying
# the fitted transformer to the test data (avoids leaking test statistics)
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

<h1> Training the model </h1>

In [10]:
# Initialize the models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
    "XGBoost": XGBClassifier(eval_metric='logloss', random_state=42)
}

# Dictionary to hold model predictions
predictions = {}

# Loop through models, train, and predict
for name, model in models.items():
    # Train the model on preprocessed training data
    model.fit(X_train_preprocessed, y_train)
    
    # Predict on the test set, which was already preprocessed above (and never resampled)
    y_pred = model.predict(X_test_preprocessed)
    
    # Store predictions
    predictions[name] = y_pred
    
    # Print model performance
    print(f"Model: {name}")
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print("-----------------------------------------------------")
    

Model: Logistic Regression
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     67566
           1       0.80      0.13      0.23       356

    accuracy                           1.00     67922
   macro avg       0.90      0.57      0.61     67922
weighted avg       0.99      1.00      0.99     67922

[[67554    12]
 [  308    48]]
-----------------------------------------------------
Model: Random Forest
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     67566
           1       0.99      0.54      0.70       356

    accuracy                           1.00     67922
   macro avg       1.00      0.77      0.85     67922
weighted avg       1.00      1.00      1.00     67922

[[67565     1]
 [  163   193]]
-----------------------------------------------------
Model: Gradient Boosting
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     

<h2> Summary and Review: </h2>

1. Logistic Regression:

- Precision for the fraudulent class is 0.80: of the 60 transactions flagged as fraud, only 12 are legitimate.
- Recall is low at 0.13: the model catches 48 of the 356 fraudulent transactions and misses the other 308.
- The F1-score for the fraudulent class is low at 0.23, dragged down by the poor recall.
- Overall accuracy rounds to 1.00, but this is not informative under class imbalance. With fraud at roughly 0.5% of transactions, a model that labels everything as legitimate would score about 0.995.

2. Random Forest:

- Precision for the fraudulent class rises to 0.99: only 1 legitimate transaction is mislabeled as fraudulent.
- Recall improves to 0.54, with 193 of the 356 fraudulent transactions detected.
- The F1-score for the fraudulent class increases to 0.70, a better balance between precision and recall than Logistic Regression achieves.
- Overall accuracy rounds to 1.00 (163 frauds are still missed), so it should be viewed with skepticism given the imbalanced dataset.

3. Gradient Boosting:

- Precision for the fraudulent class is 0.81, close to Logistic Regression (0.80) and well below Random Forest (0.99).
- Recall is considerably higher at 0.72, so the model detects most fraudulent transactions.
- The F1-score for the fraudulent class is 0.76, a better balance between precision and recall than either of the previous models.
- Overall accuracy rounds to 1.00, as with the other models.

4. XGBoost:

- Precision for the fraudulent class is 0.96, well above Gradient Boosting (0.81) and only slightly below Random Forest (0.99).
- Recall is the highest of the four models at 0.76, so roughly three in four fraudulent transactions are detected.
- The F1-score for the fraudulent class is 0.85, the best of the models shown.
- Overall accuracy again rounds to 1.00, consistent with the other models.
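
The fraud-class figures quoted above can be re-derived from the confusion matrices in the output. The sketch below does this for the two matrices that are printed in full (Logistic Regression and Random Forest). The helper name `fraud_metrics` is introduced here purely for illustration and is not part of the notebook. It also computes the accuracy of a trivial "always legitimate" baseline from the test-set support, which shows why an accuracy near 1.00 says little on this dataset.

```python
def fraud_metrics(cm):
    """Return (precision, recall, f1, accuracy) for the fraud class.

    cm follows the layout scikit-learn prints: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy


# Confusion matrices copied from the printed output above
matrices = {
    "Logistic Regression": [[67554, 12], [308, 48]],
    "Random Forest": [[67565, 1], [163, 193]],
}

results = {name: fraud_metrics(cm) for name, cm in matrices.items()}
for name, (p, r, f1, acc) in results.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.4f}")

# Accuracy of a model that labels every test transaction as legitimate:
# 67566 legitimate rows out of 67922 in the test set
baseline_accuracy = 67566 / 67922
print(f"Always-legitimate baseline accuracy: {baseline_accuracy:.4f}")
```

Both models beat the baseline on accuracy only in the third decimal place, while their fraud-class recall differs by a factor of four. That gap is why the review focuses on precision, recall, and F1.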

<h2> Overall Summary:</h2>

- XGBoost and Gradient Boosting show the strongest F1-scores for the fraudulent class, with XGBoost clearly ahead (0.85 vs 0.76). Both detect most fraud cases while keeping false positives low.

- Random Forest has the highest precision (0.99) but a moderate F1-score of 0.70, because its recall (0.54) trails Gradient Boosting (0.72) and XGBoost (0.76). Recall is critical in fraud detection.

- Logistic Regression, despite having a high precision, falls behind in recall and F1-score, making it less effective for this task compared to the ensemble methods.

- Given these results, XGBoost stands out as the best model for further tuning and operational use. The high F1-score indicates that it effectively balances precision and recall, making it a robust choice for fraud detection. However, considering the business impact of false positives and false negatives is also crucial for the final model selection and threshold tuning.
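
One way to make the closing point about business impact and threshold tuning concrete is to sweep the decision threshold and pick the one with the lowest total cost. The sketch below is a minimal illustration under loud assumptions: the ten labels and scores are made up, and the two cost constants are hypothetical placeholders rather than values from this analysis. In the notebook, the scores would come from something like `model.predict_proba(X_test_preprocessed)[:, 1]`.

```python
# Hypothetical true labels (1 = fraud) and predicted fraud probabilities.
# These values are invented for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.01, 0.03, 0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.95]

# Assumed business costs (placeholders): a missed fraud is taken to be
# 20 times as expensive as a false alarm.
COST_FALSE_POSITIVE = 1.0
COST_FALSE_NEGATIVE = 20.0


def evaluate(threshold):
    """Return (precision, recall, total_cost) when flagging score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, y_score):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total_cost = fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE
    return precision, recall, total_cost


candidate_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
table = {t: evaluate(t) for t in candidate_thresholds}
for t, (p, r, cost) in table.items():
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f} cost={cost:.1f}")

# Pick the threshold with the lowest total cost
best_threshold = min(candidate_thresholds, key=lambda t: table[t][2])
print(f"Lowest-cost threshold: {best_threshold}")
```

With these made-up numbers, the lowest-cost threshold is 0.3 rather than the default 0.5, because the assumed cost of a missed fraud outweighs a few extra false alarms. Real costs and a held-out validation set would be needed before choosing a threshold for operational use.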