# RandomForestClassifier

### Class Balancing
- **BorderlineSMOTE (kind='borderline-1')** was applied **only** to the training set, so the test set contains real transactions only. It generates synthetic fraud samples near the decision boundary, strengthening the model’s ability to recognize the most ambiguous fraud cases.
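
To make "near the decision boundary" concrete, here is a minimal plain-Python sketch of how a borderline-style method can pick the minority samples it oversamples. It assumes the usual "danger" rule (at least half, but not all, of a point's `m` nearest neighbours belong to the majority class). The function name, the toy points, and `m=3` are illustrative assumptions; this is not the imbalanced-learn implementation.

```python
import math

def danger_samples(points, labels, minority=1, m=3):
    """Return indices of minority points whose neighbourhood is mostly,
    but not entirely, made up of majority points."""
    danger = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        if lab != minority:
            continue
        # Distances from p to every other point, nearest first.
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours = [j for _, j in others[:m]]
        n_majority = sum(labels[j] != minority for j in neighbours)
        # All-majority neighbourhoods are treated as noise and skipped.
        if m / 2 <= n_majority < m:
            danger.append(i)
    return danger

# Toy 2-D data: a majority cluster, two minority points touching it,
# and a minority cluster far away from the majority class.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0),   # majority (label 0)
          (2, 1), (3, 1),                           # minority, borderline
          (5, 5), (5, 6), (6, 5), (6, 6)]           # minority, safe
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

borderline = danger_samples(points, labels, m=3)
print(borderline)
```

Only the two minority points adjacent to the majority cluster are selected; the distant minority cluster is considered safe and would not be used as a seed for new samples.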

### Model Training
- `n_estimators=100`  
  Uses 100 trees to balance predictive power and training speed on the ~230k-row training split, which oversampling roughly doubles.
- `class_weight="balanced"`  
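  Weights each class inversely to its frequency in the data passed to `fit`. Because the training set has already been oversampled to a roughly 1:1 ratio, these weights end up close to 1, so the setting adds little here; it matters mainly when resampling is skipped.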
- `random_state=42`  
  Fixes the random seed so that repeated runs on the same data build the same forest.
- `n_jobs=-1`  
  Leverages all CPU cores for faster training.
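
As a rough illustration of what `class_weight="balanced"` does, the sketch below applies the formula scikit-learn documents for this mode, `n_samples / (n_classes * count_of_class)`, to two label lists. The class counts are made up for the example and are not the notebook's real counts.

```python
from collections import Counter

def balanced_class_weights(y):
    """Per-class weight: n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Hypothetical imbalanced labels: 990 legit (0) and 10 fraud (1).
imbalanced = [0] * 990 + [1] * 10
# The same data after oversampling the minority class to a 1:1 ratio.
resampled = [0] * 990 + [1] * 990

w_imb = balanced_class_weights(imbalanced)
w_res = balanced_class_weights(resampled)
print(w_imb)  # fraud weighted far more heavily than legit
print(w_res)  # both classes weighted equally
```

On the imbalanced labels the fraud class gets a weight of 50 against roughly 0.5 for legit; once the classes are balanced, both weights are exactly 1.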

### Prediction with Custom Threshold
- After computing fraud probabilities, a **0.7 threshold** (rather than 0.5) is used to convert probabilities into labels.  
- This higher cutoff **reduces false positives** at the cost of some missed frauds (lower recall), flagging only those transactions the model is highly confident are fraudulent.
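
The effect of moving the cutoff can be sketched on a handful of made-up probabilities. The labels, scores, and helper name below are illustrative assumptions, not values from the dataset.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, FN, TN when prob >= threshold is labelled fraud (1)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, prob in zip(y_true, y_prob):
        pred = int(prob >= threshold)
        if pred == 1 and truth == 1:
            counts["tp"] += 1
        elif pred == 1 and truth == 0:
            counts["fp"] += 1
        elif pred == 0 and truth == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Hypothetical true labels and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.20, 0.55, 0.65, 0.45, 0.60, 0.80, 0.95]

at_050 = confusion_counts(y_true, y_prob, threshold=0.5)
at_070 = confusion_counts(y_true, y_prob, threshold=0.7)
print("threshold 0.5:", at_050)
print("threshold 0.7:", at_070)
```

In this toy case, raising the threshold from 0.5 to 0.7 removes both false positives but also turns one detected fraud into a miss, which is the trade-off described above.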

In [17]:
import os
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import BorderlineSMOTE

# Load the cleaned dataset
df = pd.read_csv("data/creditcard_isoforest_cleaned_001.csv")

# Features and target
X = df.drop("Class", axis=1)
y = df["Class"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Apply BorderlineSMOTE to training set only
smote = BorderlineSMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Train RandomForest
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    class_weight="balanced",
    n_jobs=-1
)
model.fit(X_train_sm, y_train_sm)

# Predict with custom threshold
y_prob = model.predict_proba(X_test)[:, 1]
threshold = 0.7
y_pred = (y_prob >= threshold).astype(int)

# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, digits=4, target_names=["Legit", "Fraud"]))

# Save model with auto-incrementing name
model_dir = "models"
os.makedirs(model_dir, exist_ok=True)
base_filename = "random_forest_bsmote_eval"
ext = ".pkl"
i = 0
while True:
    filename = f"{base_filename}{'' if i == 0 else f'_{i:02d}'}{ext}"
    filepath = os.path.join(model_dir, filename)
    if not os.path.exists(filepath):
        break
    i += 1

joblib.dump(model, filepath)
print(f"Model saved to {filepath}")

Confusion Matrix:
 [[56603     1]
 [   20    65]]
Classification Report:
               precision    recall  f1-score   support

       Legit     0.9996    1.0000    0.9998     56604
       Fraud     0.9848    0.7647    0.8609        85

    accuracy                         0.9996     56689
   macro avg     0.9922    0.8823    0.9304     56689
weighted avg     0.9996    0.9996    0.9996     56689

Model saved to models/random_forest_bsmote_eval_06.pkl
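
As a cross-check, the Fraud-class figures in the report can be re-derived by hand from the confusion matrix printed above. The helper name is an assumption; the matrix layout follows scikit-learn's convention (rows are true classes, columns are predicted classes).

```python
def fraud_metrics(cm):
    """Derive accuracy and Fraud-class precision, recall, F1 from a 2x2
    confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
        "accuracy": round(accuracy, 4),
    }

# Confusion matrix from the BorderlineSMOTE run.
cm_bsmote = [[56603, 1], [20, 65]]
metrics_bsmote = fraud_metrics(cm_bsmote)
print(metrics_bsmote)
```

The recall of 65 / 85 shows that 20 of the 85 test frauds fall below the 0.7 threshold, while only one legitimate transaction is flagged.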


### Variant: Standard SMOTE
Only one change was made here: BorderlineSMOTE was replaced with standard SMOTE for class balancing. Every other step (train/test split, RandomForestClassifier settings, custom threshold of 0.7, and model saving) remains exactly the same.
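
For intuition about what standard SMOTE does differently, the sketch below shows its core interpolation step in plain Python: pick any minority sample (not only borderline ones), pick one of its `k` nearest minority neighbours, and place a synthetic point somewhere on the segment between them. The function name, `k=2`, and the toy points are assumptions for illustration; the library version handles many details this omits.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority sample and
    one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    i = rng.randrange(len(minority))
    base = minority[i]
    # Nearest minority neighbours of the chosen sample (excluding itself).
    neighbours = sorted(
        (p for j, p in enumerate(minority) if j != i),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbour))

# Hypothetical minority-class points in 2-D.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Every synthetic point lies between two existing minority points, so new samples are spread across the whole minority region rather than concentrated at the class boundary.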

In [20]:
from imblearn.over_sampling import SMOTE

# Load the cleaned dataset
df = pd.read_csv("data/creditcard_isoforest_cleaned_001.csv")

# Features and target
X = df.drop("Class", axis=1)
y = df["Class"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Apply SMOTE (not BorderlineSMOTE) to training set only
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Train RandomForest
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    class_weight="balanced",
    n_jobs=-1
)
model.fit(X_train_sm, y_train_sm)

# Predict with custom threshold
y_prob = model.predict_proba(X_test)[:, 1]
threshold = 0.7
y_pred = (y_prob >= threshold).astype(int)

# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, digits=4, target_names=["Legit", "Fraud"]))

# Save model with auto-incrementing name
model_dir = "models"
os.makedirs(model_dir, exist_ok=True)
base_filename = "random_forest_smote_eval"
ext = ".pkl"
i = 0
while True:
    filename = f"{base_filename}{'' if i == 0 else f'_{i:02d}'}{ext}"
    filepath = os.path.join(model_dir, filename)
    if not os.path.exists(filepath):
        break
    i += 1

joblib.dump(model, filepath)
print(f"Model saved to {filepath}")

Confusion Matrix:
 [[56601     3]
 [   17    68]]
Classification Report:
               precision    recall  f1-score   support

       Legit     0.9997    0.9999    0.9998     56604
       Fraud     0.9577    0.8000    0.8718        85

    accuracy                         0.9996     56689
   macro avg     0.9787    0.9000    0.9358     56689
weighted avg     0.9996    0.9996    0.9996     56689

Model saved to models/random_forest_smote_eval.pkl
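
Both cells repeat the same auto-incrementing filename loop. One possible way to factor it out is a small helper like the one below; the name `next_free_path` is hypothetical, and the demo runs in a temporary directory rather than in `models/`.

```python
import tempfile
from pathlib import Path

def next_free_path(directory, base, ext=".pkl"):
    """Return the first path that does not exist yet among
    base.ext, base_01.ext, base_02.ext, ..."""
    directory = Path(directory)
    i = 0
    while True:
        suffix = "" if i == 0 else f"_{i:02d}"
        candidate = directory / f"{base}{suffix}{ext}"
        if not candidate.exists():
            return candidate
        i += 1

# Demo: the second call skips the name taken by the first file.
with tempfile.TemporaryDirectory() as tmp:
    first = next_free_path(tmp, "random_forest_smote_eval")
    first.touch()
    second = next_free_path(tmp, "random_forest_smote_eval")
    names = (first.name, second.name)
print(names)
```

In the notebook, the returned path could be passed straight to `joblib.dump(model, path)`.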


## Conclusions RandomForest
1. Both models reached an overall accuracy of about 0.9996 at the custom threshold of 0.7, which controls the trade-off between false positives and false negatives. With only 85 fraud cases among 56,689 test transactions, accuracy is dominated by the Legit class, so the Fraud-class metrics are the more informative ones.

2. Precision for the Fraud class was 98.5% with BorderlineSMOTE and 95.8% with SMOTE, so almost every flagged transaction is a real fraud. Recall was lower: 76.5% (65 of 85) with BorderlineSMOTE and 80.0% (68 of 85) with SMOTE, meaning roughly one fraud in five is missed at this threshold.

3. The F1-score for Fraud was 0.86 with BorderlineSMOTE and 0.87 with SMOTE. The gap between precision and recall shows that the 0.7 threshold favors avoiding false positives over catching every fraud in the difficult (less represented) class.

4. ROC AUC was about 0.99+, which indicates that the model separates the classes well across probability thresholds, not only at 0.7.

5. Mild overfitting is possible, as some models had near-perfect metrics on the test set (e.g., 100% recall for the Legit class and almost 96% precision for Fraud). This may indicate that the model has partially adapted to this particular dataset, so it should be validated on a completely independent dataset.
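
The conclusions cite a ROC AUC value, but the cells above do not print one. In the notebook the usual call would be `sklearn.metrics.roc_auc_score(y_test, y_prob)`. The plain-Python sketch below shows what that number measures: the fraction of (fraud, legit) pairs in which the fraud case receives the higher score, with ties counted as half. The labels and scores are made up for the example.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive is scored
    higher than a random negative (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.05, 0.20, 0.55, 0.65, 0.45, 0.60, 0.80, 0.95]

auc = roc_auc(y_true, y_score)
print(auc)
```

Because this metric depends only on the ranking of scores, it is unaffected by the choice of the 0.7 threshold and can be high even when recall at that threshold is modest.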