# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## AI/ML Model Training, Testing, and Evaluation Workflow
1. Import the necessary libraries
2. Load data
3. Perform EDA
4. Preprocess data
5. Train the model
   - Split the data into training and test sets
   - Search for the best hyperparameters
   - Make predictions
6. Evaluate the model

**More Granular:**
- ☑️ Import necessary Python libraries and/or modules
- ☑️ Load data
- ☑️ Split train/test (keep fraud ratio with stratify)
- ☑️ Balance minority fraud class with SMOTE
- ☑️ Scale features for all models
- ☑️ Define classifiers in a dictionary
    - Not necessary for model training, but worth doing for a more streamlined approach.
- ☑️ Loop: search for best parameters using RandomizedSearchCV
- ☑️ Loop: train, predict, evaluate
- ☑️ Print Confusion Matrix & Classification Report for each

In [3]:
# Step 1: Import necessary libraries and modules
# pandas for handling dataframes and NumPy for handling arrays
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

# Load models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC, LinearSVC

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Preprocessing SMOTE Oversampling and Scaling module 
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Number of CPU cores available for parallel jobs (n_jobs)
from joblib import cpu_count
print(f"✔️There are currently {cpu_count()} CPU-Cores available for model training💪")

# Load evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
 
# Import modules for data splitting, hyperparameter search, and plotting
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

✔️There are currently 8 CPU-Cores available for model training💪


In [4]:
#Step 2:  Load data as pandas dataframe 
# Baseline transactions 
df_baseline = pd.read_csv("../data/baseline_transacts.csv")

# Baseline dummy encoding transform
df_base_dummy_encode = pd.read_csv("../data/baseline_dummy_encode_transacts.csv")

# Log feature transform
df_log_transacts = pd.read_csv("../data/log_transacts.csv")

# Log and dummy encoding transforms
df_log_dummy_encode_transacts = pd.read_csv("../data/log_dummy_encode_transacts.csv")


# Log, change in transaction, and dummy encoding transforms
df_log_diff_dummy_encode = pd.read_csv("../data/log_diff_dummy_transacts.csv")


In [5]:
# Preview the log-transformed, dummy-encoded dataframe (first and last rows)
df_log_dummy_encode_transacts

Unnamed: 0,isFraud,CASH_OUT,DEBIT,PAYMENT,TRANSFER,amount_log,oldbalanceOrig_log,newbalanceOrig_log,oldbalanceDest_log,newbalanceDest_log
0,0,False,False,False,False,10.608846,11.522173,11.859486,12.400281,12.515597
1,0,True,False,False,False,12.766230,8.528529,0.000000,8.457159,12.779584
2,0,False,False,True,False,7.707894,7.855409,5.871554,0.000000,0.000000
3,0,False,False,False,False,12.043679,14.555915,14.633881,14.796901,14.866930
4,0,False,False,False,False,12.473167,14.611048,14.722504,13.109396,12.355899
...,...,...,...,...,...,...,...,...,...,...
99995,1,False,False,False,True,12.774484,12.774484,0.000000,0.000000,0.000000
99996,1,True,False,False,False,12.799063,12.799063,0.000000,12.848402,13.517183
99997,1,False,False,False,True,12.862446,12.862446,0.000000,0.000000,0.000000
99998,1,True,False,False,False,13.689961,13.689961,0.000000,9.012266,13.699217


In [6]:
df_log_dummy_encode_transacts[df_log_dummy_encode_transacts['isFraud']==1].count()

isFraud               130
CASH_OUT              130
DEBIT                 130
PAYMENT               130
TRANSFER              130
amount_log            130
oldbalanceOrig_log    130
newbalanceOrig_log    130
oldbalanceDest_log    130
newbalanceDest_log    130
dtype: int64

## Questions
Is this a classification or regression task?  

This is a classification task: each transaction is assigned to one of two categories, fraud (1) or not fraud (0).

Are you predicting for multiple classes or binary classes?  

Binary classes: each transaction is labeled either fraud or not fraud, so there are only two possible labels.

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

I'm using Logistic Regression, Support Vector Machine (SVM), and AdaBoost.

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [7]:
# Training without SMOTE, using log-transformed predictors
# Step 3.1: Separate features (X) and target (y)
X = df_log_dummy_encode_transacts.drop(columns=['isFraud'])  # All features except target
y = df_log_dummy_encode_transacts['isFraud']                 # Target variable

# Step 3.2: Split into training and testing sets
# stratify=y keeps the fraud/non-fraud ratio the same in both splits.
# It is not redundant with SMOTE: SMOTE only resamples the training set, so the test set keeps the original ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
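
To see why `stratify=y` still matters when SMOTE is used later, here is a minimal sketch on toy labels (the `X_toy` and `y_toy` arrays are made up for illustration and are not taken from the transactions data). SMOTE only resamples the training split, so the class ratio of the test split is decided entirely at this step.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 1,000 rows with 50 positives (5% minority class).
y_toy = np.array([1] * 50 + [0] * 950)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# Stratified split: both partitions keep the 5% positive rate.
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Plain random split: the positive count in the test set depends on the seed.
_, _, y_tr_r, y_te_r = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print("stratified test positives:", int(y_te_s.sum()))
print("random test positives:    ", int(y_te_r.sum()))
```

In this notebook the stratified split puts 26 fraud rows in the test set, while the later unstratified splits report 17.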

In [None]:
# Without SMOTE
# Scale training and testing data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Apply same scaler to test set

# Print the scaled training and test arrays
print("✅ Data pre-processing complete: features scaled.\n", X_train_scaled)
print("\n✅ Data pre-processing complete: test data scaled.\n", X_test_scaled)


✅ Data pre-processing complete: features scaled.
 [[-0.73565498 -0.08327825 -0.71638167 ...  1.38401411  0.77906622
   0.69476195]
 [ 1.35933287 -0.08327825 -0.71638167 ... -0.84587112  1.29694534
   1.22357173]
 [-0.73565498 -0.08327825 -0.71638167 ... -0.84587112  1.20959949
   1.1653813 ]
 ...
 [-0.73565498 -0.08327825  1.39590395 ... -0.84587112 -1.14592334
  -1.2493916 ]
 [ 1.35933287 -0.08327825 -0.71638167 ... -0.84587112  0.66654768
   0.65842582]
 [-0.73565498 -0.08327825 -0.71638167 ...  1.08971234  1.05868003
   0.96875927]]

✅ Data pre-processing complete: test data scaled.
 [[-0.73565498 -0.08327825 -0.71638167 ...  1.6789918   0.67758919
   0.57115765]
 [-0.73565498 -0.08327825 -0.71638167 ... -0.84587112  1.07706241
   1.00858466]
 [ 1.35933287 -0.08327825 -0.71638167 ... -0.84587112 -1.14592334
   0.62668509]
 ...
 [-0.73565498 -0.08327825  1.39590395 ... -0.84587112 -1.14592334
  -1.2493916 ]
 [ 1.35933287 -0.08327825 -0.71638167 ... -0.84587112  1.01134023
   0.963044

In [9]:
# Step 6: hyperparameter Search Grids
# Without SMOTE 
param_grids = {
    "Random Forest": {
        "n_estimators": [100, 200, 400],
        "criterion": ["gini", "entropy", "log_loss"],
        "max_depth": range(10, 55, 5),
        "max_features": ["sqrt", "log2"],
        "min_samples_split": [2, 5, 20]
    },
    "AdaBoost": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.1, 0.5, 1.0],
        "estimator__max_depth": [1, 2, 3],              # ✅ Controls weak learner complexity
        "estimator__min_samples_split": [2, 5, 10],     # ✅ Regularization
        "estimator__min_samples_leaf": [1, 2, 4]
    },
    "Logistic Regression": {
        "C": np.logspace(-3, 3, 7),
        "penalty": ["l1", "l2"],
        "solver": ["saga"],                      
        "max_iter": [500, 1000, 1500],                       
    },
    "K-Nearest Neighbors": {
        "n_neighbors": [3, 5, 7, 11],
        "weights": ["uniform", "distance"],
        "p": [1, 2],
        "leaf_size": [20, 30, 50],
        "algorithm": ["auto"]
    }
} 
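
Since `n_iter` is commented out in the search cell, `RandomizedSearchCV` falls back to its default of 10 sampled combinations per model. A rough sketch of how small a share of each grid that covers (the counts are tallied by hand from `param_grids`; `grid_sizes` is a helper name introduced here for illustration):

```python
from math import prod

# Number of candidate values per hyperparameter, copied from param_grids
# (np.logspace(-3, 3, 7) has 7 values; range(10, 55, 5) has 9).
grid_sizes = {
    "Random Forest": [3, 3, 9, 2, 3],
    "AdaBoost": [3, 4, 3, 3, 3],
    "Logistic Regression": [7, 2, 1, 3],
    "K-Nearest Neighbors": [4, 2, 2, 3, 1],
}

N_ITER = 10  # RandomizedSearchCV's default when n_iter is not passed

for name, sizes in grid_sizes.items():
    total = prod(sizes)
    print(f"{name}: {total} combinations, {N_ITER / total:.1%} sampled")
```

Raising `n_iter` widens the search at the cost of longer runs, which matters most for the larger Random Forest and AdaBoost grids.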


In [10]:
# Step 6: Define classifiers for training
# Without SMOTE
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=42), random_state=42),
    "Logistic Regression": LogisticRegression(solver='saga', random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier()  
}

# Store best models 
best_models = {}

# Select models for training and tuning
for name, model in models.items():
    print(f"\n🔥 Training and tuning: {name}")
    
    param_grid = param_grids[name]  # Search space for this model
    
    random_search = RandomizedSearchCV(
        model,
        param_distributions=param_grid,
        #n_iter=10,      # Tuning param
        cv=5,
        scoring='f1',    # Scoring param, maybe I can also try 'roc_auc'??
        random_state=42,
        n_jobs=-1        # Number of CPU-Cores param 
    )
    
    # Fit to training data
    random_search.fit(X_train_scaled, y_train)
    
    # Save the best model
    best_models[name] = random_search.best_estimator_

    # Predict on test set using the best model
    yhat = best_models[name].predict(X_test_scaled)
    
    # Evaluate
    cm = confusion_matrix(y_test, yhat)
    report = classification_report(y_test, yhat, digits=3)
    
    print(f"✅ Best Parameters for {name}:\n", random_search.best_params_)
    print("📊 Confusion Matrix:\n", cm)
    print("\n📋 Classification Report:\n", report)



🔥 Training and tuning: Random Forest
✅ Best Parameters for Random Forest:
 {'n_estimators': 400, 'min_samples_split': 5, 'max_features': 'log2', 'max_depth': 35, 'criterion': 'gini'}
📊 Confusion Matrix:
 [[19973     1]
 [    8    18]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.947     0.692     0.800        26

    accuracy                          1.000     20000
   macro avg      0.973     0.846     0.900     20000
weighted avg      1.000     1.000     1.000     20000


🔥 Training and tuning: AdaBoost
✅ Best Parameters for AdaBoost:
 {'n_estimators': 50, 'learning_rate': 0.1, 'estimator__min_samples_split': 2, 'estimator__min_samples_leaf': 1, 'estimator__max_depth': 3}
📊 Confusion Matrix:
 [[19973     1]
 [    9    17]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1     


### Evaluation Metrics and Best Parameters: Log-Transformed Dataset with StandardScaler, Without SMOTE

🔥 Training and tuning: Random Forest
✅ Best Parameters for Random Forest:
 {'n_estimators': 100, 'min_samples_split': 2, 'max_features': 'log2', 'max_depth': 25, 'criterion': 'entropy'}
📊 Confusion Matrix:
 [[19973     1]
 [   10    16]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      0.941     0.615     0.744        26

    accuracy                          0.999     20000
   macro avg      0.970     0.808     0.872     20000
weighted avg      0.999     0.999     0.999     20000

---

🔥 Training and tuning: AdaBoost
✅ Best Parameters for AdaBoost:
 {'n_estimators': 50, 'learning_rate': 0.5, 'estimator__min_samples_split': 5, 'estimator__min_samples_leaf': 1, 'estimator__max_depth': 2}
📊 Confusion Matrix:
 [[19974     0]
 [   17     9]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      1.000     0.346     0.514        26

    accuracy                          0.999     20000
   macro avg      1.000     0.673     0.757     20000
weighted avg      0.999     0.999     0.999     20000

---

🔥 Training and tuning: Logistic Regression
✅ Best Parameters for Logistic Regression:
 {'solver': 'saga', 'penalty': 'l1', 'max_iter': 1000, 'C': np.float64(10.0)}
📊 Confusion Matrix:
 [[19974     0]
 [   14    12]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      1.000     0.462     0.632        26

    accuracy                          0.999     20000
   macro avg      1.000     0.731     0.816     20000
weighted avg      0.999     0.999     0.999     20000

---

🔥 Training and tuning: K-Nearest Neighbors
✅ Best Parameters for K-Nearest Neighbors:
 {'weights': 'distance', 'p': 2, 'n_neighbors': 7, 'leaf_size': 30, 'algorithm': 'auto'}
📊 Confusion Matrix:
 [[19973     1]
 [   12    14]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      0.933     0.538     0.683        26

    accuracy                          0.999     20000
   macro avg      0.966     0.769     0.841     20000
weighted avg      0.999     0.999     0.999     20000

---

### Evaluation Metrics and Best Parameters: Log-Transformed and Dummy-Encoded Transactions with StandardScaler, Without SMOTE


🔥 Training and tuning: Random Forest
✅ Best Parameters for Random Forest:
 {'n_estimators': 400, 'min_samples_split': 5, 'max_features': 'log2', 'max_depth': 35, 'criterion': 'gini'}
📊 Confusion Matrix:
 [[19973     1]
 [    8    18]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.947     0.692     0.800        26

    accuracy                          1.000     20000
   macro avg      0.973     0.846     0.900     20000
weighted avg      1.000     1.000     1.000     20000


🔥 Training and tuning: AdaBoost
✅ Best Parameters for AdaBoost:
 {'n_estimators': 50, 'learning_rate': 0.1, 'estimator__min_samples_split': 2, 'estimator__min_samples_leaf': 1, 'estimator__max_depth': 3}
📊 Confusion Matrix:
 [[19973     1]
 [    9    17]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.944     0.654     0.773        26

    accuracy                          1.000     20000
   macro avg      0.972     0.827     0.886     20000
weighted avg      0.999     1.000     0.999     20000


🔥 Training and tuning: Logistic Regression
✅ Best Parameters for Logistic Regression:
 {'solver': 'saga', 'penalty': 'l1', 'max_iter': 1000, 'C': np.float64(10.0)}
📊 Confusion Matrix:
 [[19973     1]
 [   12    14]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      0.933     0.538     0.683        26

    accuracy                          0.999     20000
   macro avg      0.966     0.769     0.841     20000
weighted avg      0.999     0.999     0.999     20000


🔥 Training and tuning: K-Nearest Neighbors
✅ Best Parameters for K-Nearest Neighbors:
 {'weights': 'distance', 'p': 1, 'n_neighbors': 7, 'leaf_size': 30, 'algorithm': 'auto'}
📊 Confusion Matrix:
 [[19972     2]
 [    9    17]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.895     0.654     0.756        26

    accuracy                          0.999     20000
   macro avg      0.947     0.827     0.878     20000
weighted avg      0.999     0.999     0.999     20000
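
The fraud-class numbers in these reports can be re-derived by hand from the confusion matrices. A small sketch using the Random Forest matrix reported above for the log-transformed, dummy-encoded data:

```python
# Confusion matrix reported above for Random Forest:
# rows are true classes, columns are predicted classes.
tn, fp = 19973, 1
fn, tp = 8, 18

precision = tp / (tp + fp)  # of the predicted fraud cases, the share that is fraud
recall = tp / (tp + fn)     # of the actual fraud cases, the share that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The printed values match the class `1` row of that report (0.947, 0.692, 0.800).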



In [None]:
# Step 3.1: Separate features (X) and target (y)
X = df_log_transacts.drop(columns=['isFraud'])  
y = df_log_transacts['isFraud']                 

# Step 3.2: Split into training and testing sets
# Note: stratify=y is not passed here, so the fraud/non-fraud ratio in the test set can differ from the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [51]:
df_log_transacts

Unnamed: 0,isFraud,amount_log,oldbalanceOrig_log,newbalanceOrig_log,oldbalanceDest_log,newbalanceDest_log
0,0,10.608846,11.522173,11.859486,12.400281,12.515597
1,0,12.766230,8.528529,0.000000,8.457159,12.779584
2,0,7.707894,7.855409,5.871554,0.000000,0.000000
3,0,12.043679,14.555915,14.633881,14.796901,14.866930
4,0,12.473167,14.611048,14.722504,13.109396,12.355899
...,...,...,...,...,...,...
99995,1,12.774484,12.774484,0.000000,0.000000,0.000000
99996,1,12.799063,12.799063,0.000000,12.848402,13.517183
99997,1,12.862446,12.862446,0.000000,0.000000,0.000000
99998,1,13.689961,13.689961,0.000000,9.012266,13.699217


In [52]:
# Step 4: Balance the training set using SMOTE
# SMOTE generates synthetic minority class (fraud) samples
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# Step 4.1: Scale the features (mean=0, variance=1)
# Fit scaler only on training data to prevent data leakage
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_resampled)
X_test_scaled = scaler.transform(X_test)  # Same scaler to test set

print("✅ Data pre-processing complete: SMOTE applied and features scaled.")
print("📊 Class balance after SMOTE:\n", y_resampled.value_counts())  

✅ Data pre-processing complete: SMOTE applied and features scaled.
📊 Class balance after SMOTE:
 isFraud
0    79887
1    79887
Name: count, dtype: int64
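
For intuition about what SMOTE adds to the training set, here is a simplified sketch of its core idea: a synthetic sample is placed on the line segment between a minority point and one of its minority-class neighbours. This is not imblearn's implementation. It uses only the single nearest neighbour, and the points and the `smote_like_sample` helper are hypothetical.

```python
import math
import random

def smote_like_sample(minority, rng):
    """Create one synthetic point between a minority sample and its nearest neighbour."""
    base = rng.choice(minority)
    # Nearest other minority point by Euclidean distance (real SMOTE picks
    # randomly among the k nearest neighbours; k=1 keeps this sketch short).
    neighbour = min(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )
    lam = rng.random()  # interpolation weight in [0, 1)
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbour))

# Hypothetical 2-D minority points (not taken from the transactions data).
minority_points = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.1), (2.8, 3.3)]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, rng) for _ in range(5)]
print(synthetic)
```

Because every synthetic point is an interpolation, SMOTE never creates fraud samples outside the region already spanned by the real fraud rows.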


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [None]:
# Step 6: hyperparameter Search Grids
# With SMOTE 
param_grids = {
    "Random Forest": {
        "n_estimators": [100, 200, 400],
        "criterion": ["gini", "entropy", "log_loss"],
        "max_depth": range(10, 55, 5),
        "max_features": ["sqrt", "log2"],
        "min_samples_split": [2, 5, 20]
    },
    "AdaBoost": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.1, 0.5, 1.0],
        "estimator__max_depth": [1, 2, 3],              # ✅ Controls weak learner complexity
        "estimator__min_samples_split": [2, 5, 10],     # ✅ Regularization
        "estimator__min_samples_leaf": [1, 2, 4]
    },
    "Logistic Regression": {
        "C": np.logspace(-3, 3, 7),
        "penalty": ["l1", "l2"],
        "solver": ["saga"],                      
        "max_iter": [500, 1000, 1500],                       
    },
    "K-Nearest Neighbors": {
        "n_neighbors": [3, 5, 7, 11],
        "weights": ["uniform", "distance"],
        "p": [1, 2],
        "leaf_size": [20, 30, 50],
        "algorithm": ["auto"]
    }
} 


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall, and sensitivity.

In [None]:
# Step 6: Define classifiers for training
# With SMOTE
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=42), random_state=42),
    "Logistic Regression": LogisticRegression(solver='saga', random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier()  
}

# Store best models 
best_models = {}

# Select models for training and tuning
for name, model in models.items():
    print(f"\n🔥 Training and tuning: {name}")
    
    param_grid = param_grids[name]  # Search space for this model
    
    random_search = RandomizedSearchCV(
        model,
        param_distributions=param_grid,
        #n_iter=10,      # Tuning param
        cv=5,
        scoring='f1',    # Scoring param, maybe I can also try 'roc_auc'??
        random_state=42,
        n_jobs=-1        # Number of CPU-Cores param 
    )
    
    # Fit to training data
    random_search.fit(X_resampled_scaled, y_resampled)
    
    # Save the best model
    best_models[name] = random_search.best_estimator_

    # Predict on test set using the best model
    yhat = best_models[name].predict(X_test_scaled)
    
    # Evaluate
    cm = confusion_matrix(y_test, yhat)
    report = classification_report(y_test, yhat, digits=3, zero_division=0)
    
    print(f"✅ Best Parameters for {name}:\n", random_search.best_params_)
    print("📊 Confusion Matrix:\n", cm)
    print("\n📋 Classification Report:\n", report)



🔥 Training and tuning: Random Forest
✅ Best Parameters for Random Forest:
 {'n_estimators': 400, 'min_samples_split': 5, 'max_features': 'log2', 'max_depth': 35, 'criterion': 'gini'}
📊 Confusion Matrix:
 [[19899    84]
 [    2    15]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.996     0.998     19983
           1      0.152     0.882     0.259        17

    accuracy                          0.996     20000
   macro avg      0.576     0.939     0.628     20000
weighted avg      0.999     0.996     0.997     20000


🔥 Training and tuning: AdaBoost
✅ Best Parameters for AdaBoost:
 {'n_estimators': 50, 'learning_rate': 0.5, 'estimator__min_samples_split': 5, 'estimator__min_samples_leaf': 1, 'estimator__max_depth': 2}
📊 Confusion Matrix:
 [[19137   846]
 [    3    14]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.958     0.978     19983
           1     

## Training and Evaluation Metrics Using baseline.csv (No Transformations)
💥 Training and evaluating: Random Forest
📊 Confusion Matrix:
 [[39820   128]  → True Negatives / False Positives
 [   11    41]]  → False Negatives / True Positives


📄 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.997     0.998     39948
           1      0.243     0.788     0.371        52

    accuracy                          0.997     40000
   macro avg      0.621     0.893     0.685     40000
weighted avg      0.999     0.997     0.997     40000


💥 Training and evaluating: AdaBoost
📊 Confusion Matrix:
 [[37111  2837] → True Negatives / False Positives
 [    1    51]] → False Negatives / True Positives


📄 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.929     0.963     39948
           1      0.018     0.981     0.035        52

    accuracy                          0.929     40000
   macro avg      0.509     0.955     0.499     40000
weighted avg      0.999     0.929     0.962     40000


💥 Training and evaluating: Logistic Regression
📊 Confusion Matrix:
 [[39049   899]
 [    5    47]]

📄 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.977     0.989     39948
           1      0.050     0.904     0.094        52

    accuracy                          0.977     40000
   macro avg      0.525     0.941     0.541     40000
weighted avg      0.999     0.977     0.987     40000


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [None]:
# The models for this step are already defined together in the `models` dictionary in the cells above.
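
When comparing the models, accuracy by itself says little because fraud is so rare. As a sanity check, this sketch scores a hypothetical predictor that labels every transaction as non-fraud, using the class counts of the stratified test split reported earlier (19,974 non-fraud, 26 fraud):

```python
# Class counts of the stratified test split reported earlier.
n_legit, n_fraud = 19974, 26

# A hypothetical "model" that labels every transaction as non-fraud.
tp, fn = 0, n_fraud
tn, fp = n_legit, 0

accuracy = (tp + tn) / (n_legit + n_fraud)
fraud_recall = tp / (tp + fn)

print(f"accuracy={accuracy:.4f}, fraud recall={fraud_recall:.2f}")
```

Every model here scores at or near that accuracy, so the fraud-class precision, recall, and F1 are the more useful basis for comparison.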

### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.


Based on the first run, the best model so far is **Support Vector Machine**, with the following configuration and results:

- **Parameter search grid** (the search space passed to the tuner, not a single best combination):
  ```python
  {
      'C': np.logspace(-1, 2, 5),
      'kernel': ['rbf', 'sigmoid'],
      'gamma': ['scale', 'auto']
  }
  ```

- **Confusion Matrix**:
  ```
  [[39867    81] → True Negatives / False Positives
  [    30    22]] → False Negatives / True Positives
  ```

- **Classification Report**:

  | Class | Precision | Recall | F1-Score | Support |
  |-------|-----------|--------|----------|---------|
  | 0     | 1.00      | 1.00   | 1.00     | 39948   |
  | 1     | 1.00      | 0.42   | 0.59     | 52      |

- This shows that the Support Vector Machine identifies non-fraud transactions almost perfectly and reaches an **F1-score** of 0.59 on the fraud class despite the class imbalance. I'll continue to tune model parameters and try the other transformed datasets to improve the precision-recall trade-off.
- Dataset used: **baseline_dummy_encode_transacts**
   - consisting of the **baseline (unscaled) columns** plus dummy encodings of the `type` column.

---

Based on the second run, the best model so far is **Support Vector Machine**, with the following configuration and results:

- **Parameter search grid** (the search space passed to the tuner, not a single best combination):
  ```python
  {
      'C': np.logspace(-2, 2, 5),
      'kernel': ['rbf', 'sigmoid'],
      'gamma': ['scale', 'auto']
  }
  ```

- **Confusion Matrix**:
  ```
  [[29960     1] → True Negatives / False Positives
  [    16    23]] → False Negatives / True Positives
  ```

- **Classification Report**:

  | Class | Precision | Recall | F1-Score | Support |
  |-------|-----------|--------|----------|---------|
  | 0     | 1.00      | 1.00   | 1.00     | 29961   |
  | 1     | 0.96      | 0.59   | 0.75     | 39      |

- This shows that the Support Vector Machine identifies non-fraud transactions almost perfectly and reaches a higher fraud-class **F1-score** than in the first run, despite the class imbalance. I'll continue to tune model parameters and try the other transformed datasets to improve the precision-recall trade-off.
- Dataset used: **log_transacts**
   - consisting of **log-transformed** columns only.

**Note:** Support is 39 because this run's test set held 30,000 rows (29,961 non-fraud and 39 fraud). Support counts the true instances of each class in the test set, so it is not affected by solver convergence. The run also raised `ConvergenceWarning: Solver terminated early (max_iter=10000). Consider pre-processing your data with StandardScaler or MinMaxScaler.` I'll try increasing `max_iter` to 20000.
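
The support values seen across the runs (26, 39, and 52 fraud rows) are consistent with stratified test splits of 20%, 30%, and 40%. A quick arithmetic check, assuming the dataset has 100,000 rows (inferred from the 20,000-row test sets at `test_size=0.2`) of which 130 are fraud; `stratified_test_counts` is a helper introduced here for illustration:

```python
# Assumed dataset size: 100,000 rows with 130 fraud cases (130 is the count
# shown earlier; 100,000 is inferred from the 20,000-row test sets).
N_ROWS, N_FRAUD = 100_000, 130

def stratified_test_counts(test_size):
    """Expected (non-fraud, fraud) support in a stratified test split."""
    n_test = round(N_ROWS * test_size)
    n_fraud = round(N_FRAUD * test_size)
    return n_test - n_fraud, n_fraud

for size in (0.2, 0.3, 0.4):
    print(size, stratified_test_counts(size))
```

Because the test sets differ in size between runs, the reports are only directly comparable when the same `test_size` and `random_state` are used.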


In [56]:
# Step 2: Separate features (X) and target (y)
X = df_log_transacts.drop(columns=['isFraud'])  # All features except target
y = df_log_transacts['isFraud']                 # Target variable

# Step 3: Split into training and testing sets
# Note: stratify=y is not passed here, so the fraud/non-fraud ratio in the test set can differ from the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [11]:
# Step 4: Balance the training set with SMOTE (synthetic fraud samples)
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# Step 5: Scale features (mean=0, variance=1)
# Fit scaler only on training data to prevent data leakage/peeking
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_resampled)  # Fit only on training data
X_test_scaled = scaler.transform(X_test)                # Transform test data

print("✅ Data pre-processing complete: SMOTE applied and features scaled.")
print("📊 Class balance after SMOTE:\n", y_resampled.value_counts()) 

✅ Data pre-processing complete: SMOTE applied and features scaled.
📊 Class balance after SMOTE:
 isFraud
0    79896
1    79896
Name: count, dtype: int64


In [12]:
# Run a randomized search on the SVC model to find the best hyperparameters
svc_params_grid = {
    'C': np.logspace(-1, 2, 5),  # 5 values from 0.1 to 100
    # 'kernel': ['rbf', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

# Cap the number of solver iterations to bound training time
svc = SVC(max_iter=15000, tol=1e-3, random_state=42)

# Set up RandomizedSearchCV with 5-fold cross-validation and F1 scoring
random_search = RandomizedSearchCV(
    svc,
    param_distributions=svc_params_grid,
    scoring='f1',
    n_jobs=-1,
    cv=5,
    random_state=42
)

# fit this model on scaled training data
random_search.fit(X_resampled_scaled, y_resampled)

#yhat = random_search.best_estimator_.predict(X_test_scaled)



FileNotFoundError: [Errno 2] No such file or directory: 'c:\\Users\\oneps\\anaconda3\\envs\\ds\\Lib\\site-packages\\sklearn\\utils\\_repr_html\\estimator.js'

RandomizedSearchCV(cv=5, estimator=SVC(max_iter=15000, random_state=42),
                   n_jobs=-1,
                   param_distributions={'C': array([  0.1       ,   0.56234133,   3.16227766,  17.7827941 ,
       100.        ]),
                                        'gamma': ['scale', 'auto']},
                   random_state=42, scoring='f1')

In [96]:
# best model after tuning
best_svc = random_search.best_estimator_

# scaled test data for predictions
yhat = best_svc.predict(X_test_scaled)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("✅ Confusion Matrix:\n", confusion)
print("\n✅ Classification Report:\n", class_report)

✅ Confusion Matrix:
 [[19746   237]
 [    0    17]]

✅ Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.99      0.99     19983
           1       0.07      1.00      0.13        17

    accuracy                           0.99     20000
   macro avg       0.53      0.99      0.56     20000
weighted avg       1.00      0.99      0.99     20000



In [79]:
# SMOTE generates synthetic minority class (fraud) samples
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# Step 5: Scale features (mean=0, variance=1)
# Fit scaler only on training data to prevent data leakage/peeking
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_resampled)
X_test_scaled = scaler.transform(X_test)  # Apply same scaler to test set

print("✅ Data pre-processing complete: SMOTE applied and features scaled.")
print("📊 Class balance after SMOTE:\n", y_resampled.value_counts())  

✅ Data pre-processing complete: SMOTE applied and features scaled.
📊 Class balance after SMOTE:
 isFraud
0    79896
1    79896
Name: count, dtype: int64


In [13]:

# Initialize a Support Vector Classifier with an RBF kernel to handle non-linearity
svc_non_linear = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)

# Train the classifier on the original training split (no SMOTE, no scaling)
svc_non_linear.fit(X_train, y_train)

# Make predictions on the held-out test set
yhat = svc_non_linear.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report:\n", class_report)

Confusion Matrix 
 [[19974     0]
 [   19     7]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     19974
           1       1.00      0.27      0.42        26

    accuracy                           1.00     20000
   macro avg       1.00      0.63      0.71     20000
weighted avg       1.00      1.00      1.00     20000



In [None]:
# Run a randomized search on the SVC model to find the best hyperparameters
svc_params_grid = {
    'C': np.logspace(-2, 2, 5),            
    'kernel': ['rbf', 'sigmoid'],          
    'gamma': ['scale', 'auto'],
    #'degree': [2, 3, 4, 5] 
}

svc = SVC(max_iter=15000, random_state=42)

# set up RandomizedSearchCV with 5-fold cross-validation
random_search = RandomizedSearchCV(svc, param_distributions=svc_params_grid, scoring='f1', n_jobs=-1, cv=5, random_state=42)

# Fit this model on the training data
# Timing notes: 36m 12.0s without n_jobs set; 16m 13.6s when scaling before training and using all 8 cores
random_search.fit(X_train, y_train)


In [94]:
best_svc = random_search.best_estimator_

# Make predictions on the held-out test set
yhat = best_svc.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report:\n", class_report)

Confusion Matrix 
 [[19981     2]
 [    7    10]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     19983
           1       0.83      0.59      0.69        17

    accuracy                           1.00     20000
   macro avg       0.92      0.79      0.84     20000
weighted avg       1.00      1.00      1.00     20000

