# Logistic Regression with SMOTE for Credit Card Fraud Detection

### Data Loading
- Loads a pre-processed dataset with outliers removed: `creditcard_isoforest_cleaned_001.csv`.
- Splits the data into:
  - `X`: Features
  - `y`: Target variable (`Class`: 0 = Legit, 1 = Fraud)

### Train-Test Split
- Uses `train_test_split` with stratification to preserve class distribution.
- Test set size: **20%**
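To make the effect of stratification concrete, here is a minimal pure-Python sketch of a stratified split. It is not how `train_test_split` is implemented; the helper name `stratified_split` and the toy labels are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 1000 legitimate transactions and 10 frauds
labels = [0] * 1000 + [1] * 10
train_idx, test_idx = stratified_split(labels)
fraud_in_test = sum(labels[i] for i in test_idx)
print(len(test_idx), fraud_in_test)
```

Because each class is split separately, the rare fraud class cannot end up entirely in one partition by chance.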

### Class Balancing with SMOTE
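- Applies **SMOTE** to the training data only, so the test set keeps its original class distribution.
- SMOTE generates synthetic fraud cases by interpolating between existing ones, so that fraud and legitimate transactions are equally represented in the training set.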
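The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is an illustration only, not the `imblearn` implementation; the function name `smote_like` and the sample points are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D fraud samples
fraud = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(fraud, n_new=6)
print(len(new_points))
```

Every synthetic point lies on a line segment between two real fraud samples, which is why SMOTE adds variety without simply duplicating rows.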

### Model Training
- Trains a `LogisticRegression` model with the following settings:
  - `penalty="l1"`: Uses L1 regularization, which shrinks the coefficients of less important features to exactly zero. This is useful for this dataset, which contains 30 numerical features (mostly PCA components): L1 reduces noise and makes the model more interpretable.
  - `solver="liblinear"`: One of the scikit-learn solvers that supports the L1 penalty; it works well for binary classification on small to medium-sized datasets like this one (≈283K rows, 30 features).
  - `class_weight="balanced"`: Weights each class inversely to its frequency, so the model pays more attention to rare fraud cases (less than 0.2% of the original data). Because SMOTE has already balanced the training set, this option has little additional effect here.
  - `max_iter=1000`: Raises the iteration limit to give the solver room to converge, which matters when using regularization and class reweighting.
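As a rough illustration of what `class_weight="balanced"` computes, the sketch below applies the formula given in the scikit-learn documentation, `n_samples / (n_classes * class_count)`, to toy label lists. The helper name and the counts are assumptions, not values from this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * class_count), the formula
    scikit-learn documents for class_weight="balanced"."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Imbalanced toy labels (hypothetical counts): 998 legitimate, 2 fraud
imbalanced = balanced_class_weights([0] * 998 + [1] * 2)
# The same data after oversampling the minority class to 998 samples
resampled = balanced_class_weights([0] * 998 + [1] * 998)
print(imbalanced)
print(resampled)
```

On data that has already been oversampled to equal class counts, both weights come out as 1.0, so the option changes nothing in that case.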

### Prediction with a Custom Threshold

Instead of using the default threshold of **0.5**, a **customized threshold of 0.7** was applied when converting the predicted probabilities into binary class labels:

- The model outputs the probability that the transaction is fraudulent.
- By default, if this probability is ≥ 0.5, the transaction is classified as fraudulent.
- However, in highly **imbalanced datasets** such as this one, where **fraud events are extremely rare**, a threshold of 0.5 applied to a model trained on rebalanced data can produce **a high number of false positives**.

**Benefits of using 0.7**

- **Reduces the number of false positives**: Flagging too many legitimate transactions as fraudulent can lead to user dissatisfaction and unnecessary manual checks.
- **Improves precision**: A higher threshold makes the model more conservative, flagging a transaction as fraudulent only when the predicted probability is high. The trade-off is that some fraud cases with lower scores may be missed.
- **Better for business**: This choice balances **model efficiency** with **practical impact** by minimizing disruption to honest users.
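The thresholding step itself is simple. The sketch below uses made-up probabilities (not model output) and a hypothetical helper `apply_threshold` to show how moving from 0.5 to 0.7 changes which transactions are flagged.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Label a transaction as fraud (1) when its predicted probability
    is at least the threshold, otherwise as legitimate (0)."""
    return [int(p >= threshold) for p in probabilities]


# Hypothetical fraud probabilities for six transactions
y_prob = [0.05, 0.45, 0.55, 0.65, 0.72, 0.98]

default_labels = apply_threshold(y_prob, threshold=0.5)
strict_labels = apply_threshold(y_prob, threshold=0.7)
print(sum(default_labels), sum(strict_labels))  # flagged at 0.5 vs. at 0.7
```

Raising the threshold can only keep or reduce the number of flagged transactions, which is where the drop in false positives comes from.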

### Evaluation
- Displays:
  - Confusion matrix
  - Classification report with precision, recall, and F1-score
- Labels: `"Legit"` and `"Fraud"`

```python
import os
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

# Load the cleaned and scaled dataset
df = pd.read_csv("data/creditcard_isoforest_cleaned_001.csv")

# Split features and target
X = df.drop("Class", axis=1)
y = df["Class"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Apply SMOTE to training set only
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Train Logistic Regression with L1 regularization
model = LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight="balanced",
    penalty="l1",
    solver="liblinear"
)
model.fit(X_train_sm, y_train_sm)

# Predict with a raised threshold (0.7 instead of the default 0.5)
y_prob = model.predict_proba(X_test)[:, 1]
threshold = 0.7
y_pred = (y_prob >= threshold).astype(int)

# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, digits=4, target_names=["Legit", "Fraud"]))

# Save model with auto-incrementing name
model_dir = "models"
os.makedirs(model_dir, exist_ok=True)
base_filename = "logistic_regression_clean_eval"
ext = ".pkl"
i = 0
while True:
    filename = f"{base_filename}{'' if i == 0 else f'_{i:02d}'}{ext}"
    filepath = os.path.join(model_dir, filename)
    if not os.path.exists(filepath):
        break
    i += 1

joblib.dump(model, filepath)
print(f"Model saved to {filepath}")
```

```
Confusion Matrix:
 [[55842   762]
 [    9    76]]
Classification Report:
               precision    recall  f1-score   support

       Legit     0.9998    0.9865    0.9931     56604
       Fraud     0.0907    0.8941    0.1647        85

    accuracy                         0.9864     56689
   macro avg     0.5453    0.9403    0.5789     56689
weighted avg     0.9985    0.9864    0.9919     56689

Model saved to models/logistic_regression_clean_eval_08.pkl
```
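The fraud-class figures in the report follow directly from the confusion matrix. As a quick check, the sketch below recomputes them, assuming scikit-learn's layout `[[TN, FP], [FN, TP]]`; the helper name `fraud_metrics` is hypothetical.

```python
def fraud_metrics(confusion):
    """Precision, recall and F1 for the positive (fraud) class from a
    2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Confusion matrix printed by the SMOTE run above
precision, recall, f1 = fraud_metrics([[55842, 762], [9, 76]])
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.0907 0.8941 0.1647
```

The 762 false positives against only 76 true positives explain why precision stays near 9% even though recall is close to 90%.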


## Evaluation and Model Selection

After experimenting with several parameter configurations for Logistic Regression, I retained only the **best-performing runs** to ensure meaningful comparison.


**Classification Report**

| Class           | Precision | Recall | F1-score | Support |
|------------------|-----------|--------|----------|---------|
| Legit            | 0.9998    | 0.9868 | 0.9932   | 56,651  |
| Fraud            | 0.0988    | 0.8632 | 0.1773   | 95      |
| **Accuracy**     |           |        | **0.9866** | 56,746  |
| **Macro avg**    | 0.5493    | 0.9250 | **0.5853** | 56,746  |
| **Weighted avg** | 0.9983    | 0.9866 | **0.9919** | 56,746  |

> **Model saved to:** `models/logistic_regression_clean_eval_06.pkl`

```python
import os
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import BorderlineSMOTE

# Load the cleaned and scaled dataset
df = pd.read_csv("data/creditcard_isoforest_cleaned_001.csv")

# Split features and target
X = df.drop("Class", axis=1)
y = df["Class"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Apply BorderlineSMOTE to training set only
bsmote = BorderlineSMOTE(random_state=42, kind='borderline-1')
X_train_bsm, y_train_bsm = bsmote.fit_resample(X_train, y_train)

# Train Logistic Regression with L1 regularization
model = LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight="balanced",
    penalty="l1",
    solver="liblinear"
)
model.fit(X_train_bsm, y_train_bsm)

# Predict with a raised threshold (0.7 instead of the default 0.5)
y_prob = model.predict_proba(X_test)[:, 1]
threshold = 0.7
y_pred = (y_prob >= threshold).astype(int)

# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, digits=4, target_names=["Legit", "Fraud"]))

# Save model with auto-incrementing name
model_dir = "models"
os.makedirs(model_dir, exist_ok=True)
base_filename = "logistic_regression_bsmote_eval"
ext = ".pkl"
i = 0
while True:
    filename = f"{base_filename}{'' if i == 0 else f'_{i:02d}'}{ext}"
    filepath = os.path.join(model_dir, filename)
    if not os.path.exists(filepath):
        break
    i += 1

joblib.dump(model, filepath)
print(f"Model saved to {filepath}")
```

```
Confusion Matrix:
 [[56122   482]
 [    9    76]]
Classification Report:
               precision    recall  f1-score   support

       Legit     0.9998    0.9915    0.9956     56604
       Fraud     0.1362    0.8941    0.2364        85

    accuracy                         0.9913     56689
   macro avg     0.5680    0.9428    0.6160     56689
weighted avg     0.9985    0.9913    0.9945     56689

Model saved to models/logistic_regression_bsmote_eval_02.pkl
```


### Class Balancing with BorderlineSMOTE

- Standard SMOTE was replaced by **BorderlineSMOTE (kind='borderline-1')**, applied only to the training set.  
- **Reason**: BorderlineSMOTE synthesizes examples near the class boundary, which helps the classifier learn the subtle patterns of fraud cases that lie close to legitimate transactions.
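To illustrate what "near the class boundary" means, the following sketch marks minority samples as borderline ("danger") when at least half, but not all, of their nearest neighbours belong to the majority class, as the method is usually described. It is a simplified illustration with hypothetical 1-D data and a hypothetical helper name; the `imblearn` implementation may differ in its details.

```python
import math


def danger_samples(minority, majority, m=3):
    """Return minority points whose m nearest neighbours (over both classes)
    are at least half, but not all, majority points."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    danger = []
    for point in minority:
        neighbours = sorted(
            (item for item in labelled if item[0] is not point),
            key=lambda item: math.dist(point, item[0]),
        )[:m]
        n_majority = sum(1 for _, label in neighbours if label == 0)
        # All-majority neighbourhoods are treated as noise, mostly-minority ones as safe
        if m / 2 <= n_majority < m:
            danger.append(point)
    return danger


# Hypothetical 1-D samples: a fraud cluster, one borderline fraud, one isolated fraud
fraud = [(0.0,), (0.2,), (0.4,), (2.0,), (5.0,)]
legit = [(1.8,), (2.3,), (4.9,), (5.1,), (5.2,), (6.0,)]
borderline = danger_samples(fraud, legit)
print(borderline)
```

Only the borderline points are then used as seeds for interpolation, so the synthetic samples concentrate where the two classes are hardest to separate.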

### Model Training

A `LogisticRegression` model was trained on the BorderlineSMOTE-resampled data with the following settings:

- `penalty="l1"`  
  **L1 regularization** was used to drive many feature coefficients to zero, effectively selecting only the most informative variables out of the 30 features (mostly PCA components).

- `solver="liblinear"`  
  The **liblinear** solver was chosen because it supports L1 penalties and performs efficiently on small-to-medium datasets (≈227K training samples before resampling, 30 features).

- `class_weight="balanced"`  
  Class weights are scaled inversely to class frequencies. Fraud cases represent less than 0.2% of the original data, but the resampled training set is already balanced, so this setting has little additional effect here.

- `max_iter=1000`  
  The iteration limit was raised to give the solver room to converge, which matters when combining regularization with class weighting.
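Why an L1 penalty yields exact zeros while an L2 penalty does not can be seen from the shrinkage step each one induces. The sketch below is only an illustration with hypothetical coefficients; it is not the optimization routine that `liblinear` runs internally.

```python
def soft_threshold(weight, penalty):
    """Proximal step for an L1 penalty: shrink the weight toward zero and
    clip it to exactly zero once its magnitude falls below the penalty."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


def l2_shrink(weight, penalty):
    """Comparable L2-style shrinkage: scales the weight but never zeroes it."""
    return weight / (1.0 + penalty)


weights = [2.0, -0.3, 0.05, -1.5]  # hypothetical coefficients
l1_result = [soft_threshold(w, 0.5) for w in weights]
l2_result = [l2_shrink(w, 0.5) for w in weights]
print(l1_result)
print(l2_result)
```

The small coefficients are removed entirely under L1, which is the feature-selection effect described above; under L2 they only get smaller.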

### Prediction with Custom Threshold

Predicted fraud probabilities were converted into binary labels using **threshold = 0.7** instead of the default 0.5.  
- **Justification**:  
  - A **higher threshold** reduces false positives, minimizing the number of legitimate transactions flagged as fraud and improving user experience.  
  - This conservative approach ensures that only highly confident predictions are labeled as fraudulent.
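The precision/recall trade-off behind this choice can be made visible by sweeping the threshold. The labels and probabilities below are invented for illustration, and `precision_recall_at` is a hypothetical helper, not part of the notebook.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall of the fraud class at a given threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and predicted fraud probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.55, 0.6, 0.65, 0.75, 0.6, 0.8, 0.9, 0.95]

results = {t: precision_recall_at(y_true, y_prob, t) for t in (0.5, 0.7, 0.9)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy example precision rises and recall falls as the threshold increases, so the value 0.7 is a compromise rather than a free improvement.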

### Evaluation

- A **confusion matrix** and **classification report** (precision, recall, F1-score) were generated, using class labels `Legit` and `Fraud`.  
- The final model was saved with an auto-incremented filename in the `models/` directory for later comparison.
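The auto-incrementing filename logic appears in both training cells and could be factored into a small helper. The sketch below is one possible version using `pathlib`; the function name `next_free_path` is an assumption, and the demonstration writes only to a temporary directory.

```python
import tempfile
from pathlib import Path


def next_free_path(directory, base_name, ext=".pkl"):
    """Return the first non-existing path among base.pkl, base_01.pkl, base_02.pkl, ..."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    i = 0
    while True:
        suffix = "" if i == 0 else f"_{i:02d}"
        candidate = directory / f"{base_name}{suffix}{ext}"
        if not candidate.exists():
            return candidate
        i += 1


# Demonstration in a temporary directory so no real model files are touched
with tempfile.TemporaryDirectory() as tmp:
    first = next_free_path(tmp, "logistic_regression_bsmote_eval")
    first.touch()  # simulate a saved model
    second = next_free_path(tmp, "logistic_regression_bsmote_eval")
    print(first.name, second.name)
```

In the notebook, the returned path would be passed to `joblib.dump(model, path)` in place of the inline `while` loop.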

## Conclusions: Logistic Regression

1. **General Performance**
   - All logistic regression models achieved **very high accuracy (>97%)**, but this metric is misleading because of the class imbalance.
   - The key focus was on detecting **fraud cases** (the minority class), which had only 85 to 95 instances in the test sets.

2. **Recall on Fraud Class**
   - All models achieved **high recall** (≈0.86–0.89) for fraud, meaning they correctly identified most fraud cases.
   - This came at the cost of **very low precision** (≈0.05–0.14), indicating many false positives.

3. **Balanced SMOTE Models**
   - Models like `logistic_regression_bsmote_eval.pkl` and its variations:
     - Achieved **slightly higher precision** (up to 0.14) on the fraud class.
     - F1-score was still low (max ≈0.24).
     - ROC AUC varied between **0.927–0.944**, showing decent separation ability.

4. **Cleaned-Only Models**
   - Models saved as `logistic_regression_clean_eval_*.pkl` (cleaned data; the variant shown above adds standard SMOTE):
     - Had **fairly consistent recall** (≈0.86–0.89) across variants.
     - Precision remained very low (≈0.05–0.10), suggesting limited improvement.
     - Slight increase in ROC AUC (**up to 0.9690** in the best case).
     - Many of these models were effectively **identical**, suggesting the optimization had converged.

5. **Conclusion**
   - Logistic regression is not suitable for fraud detection in this case.
   - Despite its good recall, the **model is wrong too often**, labeling many legitimate transactions as fraudulent.
   - An **F1-score below 0.25** on the fraud class indicates low overall quality.
   - The models are not ready for actual use in a banking environment.