# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in the previous data transformation step (not the original dataset).

## AI/ML Model Training, Testing, and Evaluation Steps/workflow:
1. Import the necessary libraries
2. Load data
3. Perform EDA
4. Preprocess data
5. Train the model
   - Train/test split
   - Hyperparameter search
   - Make predictions
6. Evaluate the model

**More Granular:**
- ☑️ Import necessary Python libraries and/or modules
- ☑️ Load data
- ☑️ Split train/test (keep fraud ratio with stratify)
- ☑️ Balance minority fraud class with SMOTE
- ☑️ Scale features for all models
- ☑️ Define classifiers in a dictionary 
    - Not necessary for model training, but worth doing for a more streamlined approach.
- ☑️ Loop: search for best parameters using RandomizedSearchCV
- ☑️ Loop: train, predict, evaluate
- ☑️ Print Confusion Matrix & Classification Report for each

In [1]:
# Step 1: Import necessary libraries and modules
# pandas for handling DataFrames and NumPy for handling arrays
import pandas as pd
import numpy as np

# Load models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC, LinearSVC

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Preprocessing and Scaling modules
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# Preprocessing SMOTE Oversampling module 
from imblearn.over_sampling import SMOTE, KMeansSMOTE

# Number of CPU cores available for parallel jobs (n_jobs)
from joblib import cpu_count
print(f"✔️There are currently {cpu_count()} CPU-Cores available for model training💪")

# Load evaluation metrics
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    precision_score, recall_score, roc_auc_score,
)

# Import modules for data splitting, hyperparameter search, and plotting
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

✔️There are currently 8 CPU-Cores available for model training💪


In [2]:
# Step 2: Load data as pandas DataFrames
# Baseline transactions 
df_baseline = pd.read_csv("../data/baseline_transacts.csv")

# Baseline dummy encoding transform
df_base_dummy_encode = pd.read_csv("../data/baseline_dummy_encode_transacts.csv")

# Log feature transform
df_log_transacts = pd.read_csv("../data/log_transacts.csv")

# Log and dummy encoding transforms
df_log_dummy_encode_transacts = pd.read_csv("../data/log_dummy_encode_transacts.csv")


# Log, change in transaction, and dummy encoding transforms
df_log_diff_dummy_encode = pd.read_csv("../data/log_diff_dummy_transacts.csv")


In [3]:
# Preview the DataFrame (pandas shows the first and last rows)
df_log_diff_dummy_encode 

Unnamed: 0,isFraud,CASH_OUT,DEBIT,PAYMENT,TRANSFER,amount_log,oldbalanceOrig_log,newbalanceOrig_log,oldbalanceDest_log,newbalanceDest_log,balance_change_orig,balance_change_dest,amount_ratio_orig_balance,zero_balance_flag
0,0,False,False,False,False,10.608846,11.522173,11.859486,12.400281,12.515597,-80980.92,-1.080515e+04,0.401178,0
1,0,True,False,False,False,12.766230,8.528529,0.000000,8.457159,12.779584,-345132.68,5.820766e-11,69.248305,1
2,0,False,False,True,False,7.707894,7.855409,5.871554,0.000000,0.000000,0.00,-2.224850e+03,0.862462,0
3,0,False,False,False,False,12.043679,14.555915,14.633881,14.796901,14.866930,-340040.68,2.352559e+04,0.081086,0
4,0,False,False,False,False,12.473167,14.611048,14.722504,13.109396,12.355899,-522463.61,-5.224636e+05,0.117904,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,1,False,False,False,True,12.774484,12.774484,0.000000,0.000000,0.000000,0.00,-3.530910e+05,0.999997,1
99996,1,True,False,False,False,12.799063,12.799063,0.000000,12.848402,13.517183,0.00,-5.820766e-11,0.999997,1
99997,1,False,False,False,True,12.862446,12.862446,0.000000,0.000000,0.000000,0.00,-3.855568e+05,0.999997,1
99998,1,True,False,False,False,13.689961,13.689961,0.000000,9.012266,13.699217,0.00,0.000000e+00,0.999999,1


In [4]:
df_log_dummy_encode_transacts[df_log_dummy_encode_transacts['isFraud']==1].count()

isFraud               130
CASH_OUT              130
DEBIT                 130
PAYMENT               130
TRANSFER              130
amount_log            130
oldbalanceOrig_log    130
newbalanceOrig_log    130
oldbalanceDest_log    130
newbalanceDest_log    130
dtype: int64
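
The count above puts the class imbalance in perspective. Here is a small sketch of the arithmetic, assuming the 100,000-row total implied by the 20,000-row test set reported later in this notebook at `test_size=0.2`. The variable names are illustrative only.

```python
# Illustrative arithmetic only; the totals come from outputs in this notebook.
n_rows = 100_000      # assumed total: 20,000 test rows at test_size=0.2
n_fraud = 130         # fraud rows counted above
test_size = 0.2

fraud_rate = n_fraud / n_rows
expected_test_fraud = round(n_fraud * test_size)
expected_train_fraud = n_fraud - expected_test_fraud

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Fraud rows expected in a stratified test set: {expected_test_fraud}")
print(f"Fraud rows left for training: {expected_train_fraud}")
```

A classifier that always predicts "not fraud" would already be about 99.87% accurate on data like this, which is why fraud-class precision, recall, and F1 matter more than accuracy here.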

## Questions
Is this a classification or regression task?  

This is a classification task: each transaction is assigned to one of two discrete classes, fraud (1) or not fraud (0), rather than to a continuous value.

Are you predicting for multiple classes or binary classes?  

Binary classes. Each transaction is labeled either fraud or not fraud, so there are only two possible classes.

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

I'm using Logistic Regression, Support Vector Machine (SVM), and AdaBoost.

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [5]:
# Training without SMOTE and using log normalization on predictors
# Step 3.1: Separate features (X) and target (y)
X = df_log_diff_dummy_encode.drop(columns=['isFraud'])  # All features except target
y = df_log_diff_dummy_encode['isFraud']                 # Target variable

# Step 3.2: Split into training and testing sets
# stratify=y keeps the fraud/non-fraud ratio the same in the train and test sets.
# This is still useful with SMOTE, which only rebalances the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
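
To check that `stratify=y` behaves as intended, the split can be reproduced on a synthetic label vector with the same class counts as this dataset (130 fraud rows out of an assumed 100,000). This is a sketch on made-up data, not the notebook's DataFrame; `X_demo` and `y_demo` are placeholder names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100,000 labels with 130 positives (placeholder data)
y_demo = np.zeros(100_000, dtype=int)
y_demo[:130] = 1
X_demo = np.arange(100_000).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the same 0.13% fraud rate
print("train fraud rows:", int(y_tr.sum()), "of", len(y_tr))
print("test fraud rows: ", int(y_te.sum()), "of", len(y_te))
```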

In [6]:
# without using SMOTE
# Scale training and testing data (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Apply same scaler to test set

# Print output from training
print("✅ Data pre-processing complete: features scaled.\n", X_train_scaled)
print("\n✅ Data pre-processing complete: test data scaled.\n", X_test_scaled)


✅ Data pre-processing complete: features scaled.
 [[-0.73565498 -0.08327825 -0.71638167 ...  0.09354458 -0.14233443
  -1.14741344]
 [ 1.35933287 -0.08327825 -0.71638167 ...  0.11634656 -0.14228997
   0.87152544]
 [-0.73565498 -0.08327825 -0.71638167 ...  0.11634656  3.92743105
   0.87152544]
 ...
 [-0.73565498 -0.08327825  1.39590395 ...  0.099119   -0.14231597
   0.87152544]
 [ 1.35933287 -0.08327825 -0.71638167 ...  0.11634656  0.12999653
   0.87152544]
 [-0.73565498 -0.08327825 -0.71638167 ... -0.71870966 -0.1423015
  -1.14741344]]

✅ Data pre-processing complete: test data scaled.
 [[-0.73565498 -0.08327825 -0.71638167 ... -0.01690802 -0.14233443
  -1.14741344]
 [-0.73565498 -0.08327825 -0.71638167 ...  0.11634656 -0.14232477
   0.87152544]
 [ 1.35933287 -0.08327825 -0.71638167 ...  0.11634656 -0.14225037
   0.87152544]
 ...
 [-0.73565498 -0.08327825  1.39590395 ...  0.10026392 -0.14227453
   0.87152544]
 [ 1.35933287 -0.08327825 -0.71638167 ...  0.11634656  0.85778997
   0.8715254
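
The scaler is fit on the training split only, so the test split is transformed with the training mean and standard deviation. A minimal sketch on random placeholder data (`train` and `test` are hypothetical arrays, not the notebook's features) shows the consequence: the scaled training columns have mean 0 and unit variance, while the scaled test columns are only approximately so.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # placeholder training features
test = rng.normal(loc=5.0, scale=2.0, size=(200, 3))    # placeholder test features

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

# Equivalent manual computation using the training mean and standard deviation
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print("train means after scaling:", train_scaled.mean(axis=0).round(6))
print("test means after scaling: ", test_scaled.mean(axis=0).round(3))
print("matches manual formula:", np.allclose(test_scaled, manual_test))
```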

In [7]:
# Step 6: hyperparameter Search Grids
# Without SMOTE 
param_grids = {
    "Random Forest": {
        "n_estimators": [100, 200, 400],                # Number of trees in the forest
        "criterion": ["gini", "entropy", "log_loss"],   # Function to measure the quality of a split
        "max_depth": range(10, 55, 5),                  # Maximum depth of each tree
        "max_features": ["sqrt", "log2"],               # Number of features to consider at each split
        "min_samples_split": [2, 5, 20]                 # Minimum number of samples to split an internal node
    },
    "AdaBoost": {
        "n_estimators": [50, 100, 200],                 # Number of boosting rounds
        "learning_rate": [0.01, 0.1, 0.5, 1.0],         # Shrinks the contribution of each classifier
        "estimator__max_depth": [1, 2, 3],              # Controls weak learner complexity
        "estimator__min_samples_split": [2, 5, 10],     # Regularization
        "estimator__min_samples_leaf": [1, 2, 4]        # Minimum samples required at a leaf node
    },
    "Logistic Regression": {
        "C": np.logspace(-3, 3, 7),                     # Inverse of regularization strength
        "penalty": ["l1", "l2"],                        # Type of regularization to apply
        "solver": ["saga"],                             # Optimizer that supports L1/L2 with large data
        "max_iter": [500, 1000, 5000],                  # Max number of iterations for convergence
    },
    "K-Nearest Neighbors": {
        "n_neighbors": [3, 5, 7, 11],                   # Number of neighbors to use for prediction
        "weights": ["uniform", "distance"],             # Uniform: equal weights, Distance: closer = more weight
        "p": [1, 2],                                     # Distance metric: 1 = Manhattan, 2 = Euclidean
        "leaf_size": [20, 30, 50],                      # Leaf size for tree-based algorithms
        "algorithm": ["auto"]                           # Algorithm used to compute nearest neighbors
    }
}
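
For a sense of scale, the number of distinct combinations in each grid can be counted directly. The sketch below mirrors the list lengths in `param_grids` using plain Python. The `grid_sizes` dictionary is a hypothetical helper, with `np.logspace(-3, 3, 7)` counted as 7 values and `range(10, 55, 5)` as 9. `RandomizedSearchCV` samples only `n_iter` of these combinations (10 by default) rather than trying them all.

```python
from math import prod

# Number of candidate values per hyperparameter, copied from param_grids
grid_sizes = {
    "Random Forest": [3, 3, len(range(10, 55, 5)), 2, 3],
    "AdaBoost": [3, 4, 3, 3, 3],
    "Logistic Regression": [7, 2, 1, 3],
    "K-Nearest Neighbors": [4, 2, 2, 3, 1],
}

n_iter = 10  # RandomizedSearchCV default when n_iter is not set

for name, sizes in grid_sizes.items():
    total = prod(sizes)
    print(f"{name}: {total} combinations, {n_iter} sampled ({n_iter / total:.1%})")
```

Since the Random Forest and AdaBoost searches cover only a small share of their grids, rerunning with a different `random_state` or a larger `n_iter` may select different best parameters.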


In [10]:
# Step 6: Define classifiers for training
# Without SMOTE
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=42), random_state=42),
    "Logistic Regression": LogisticRegression(solver='saga', random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier()  
}

# Store best models 
best_models = {}

# Select models for training and tuning
for name, model in models.items():
    print(f"\n🔥 Training and tuning: {name}")
    
    param_grid = param_grids[name] # Select which param to use in randomized search
    
    random_search = RandomizedSearchCV(
        model,
        param_distributions=param_grid,
        # n_iter=10,     # Number of sampled settings (10 is the default)
        cv=5,
        scoring='f1',    # Optimize F1 for the fraud class; 'roc_auc' is another option
        random_state=42,
        n_jobs=-1        # Use all available CPU cores
    )
    
    # Fit to training data
    random_search.fit(X_train_scaled, y_train)
    
    # Save the best model
    best_models[name] = random_search.best_estimator_

    # Predict on test set using the best model
    yhat = best_models[name].predict(X_test_scaled)
    
    # Evaluate
    cm = confusion_matrix(y_test, yhat)
    report = classification_report(y_test, yhat, digits=3)
    
    print(f"✅ Best Parameters for {name}:\n", random_search.best_params_)
    print("📊 Confusion Matrix:\n", cm)
    print("\n📋 Classification Report:\n", report)



🔥 Training and tuning: Random Forest
✅ Best Parameters for Random Forest:
 {'n_estimators': 400, 'min_samples_split': 5, 'max_features': 'log2', 'max_depth': 35, 'criterion': 'gini'}
📊 Confusion Matrix:
 [[19973     1]
 [    8    18]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.947     0.692     0.800        26

    accuracy                          1.000     20000
   macro avg      0.973     0.846     0.900     20000
weighted avg      1.000     1.000     1.000     20000


🔥 Training and tuning: AdaBoost
✅ Best Parameters for AdaBoost:
 {'n_estimators': 50, 'learning_rate': 0.1, 'estimator__min_samples_split': 2, 'estimator__min_samples_leaf': 1, 'estimator__max_depth': 3}
📊 Confusion Matrix:
 [[19973     1]
 [    9    17]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1     

### Output of Evaluation Metrics and Best Parameters of Log-Transform, Transaction Deltas, and Dummy Encoding with StandardScaler, without Using SMOTE


🔥 Training and tuning: Random Forest
```
✅ Best Parameters for Random Forest:
 {'n_estimators': 100, 'min_samples_split': 20, 'max_features': 'sqrt', 'max_depth': 35, 'criterion': 'log_loss'}
```

```
📊 Confusion Matrix:
 [[19974     0]
 [    0    26]]
```

```
📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      1.000     1.000     1.000        26

    accuracy                          1.000     20000
   macro avg      1.000     1.000     1.000     20000
weighted avg      1.000     1.000     1.000     20000
```

🔥 Training and tuning: AdaBoost
```
✅ Best Parameters for AdaBoost:
 {'n_estimators': 200, 'learning_rate': 0.1, 'estimator__min_samples_split': 10, 'estimator__min_samples_leaf': 1, 'estimator__max_depth': 2}
```

```
📊 Confusion Matrix:
 [[19973     1]
 [    0    26]]
```

```
📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.963     1.000     0.981        26

    accuracy                          1.000     20000
   macro avg      0.981     1.000     0.991     20000
weighted avg      1.000     1.000     1.000     20000
```

🔥 Training and tuning: Logistic Regression
/usr/local/lib/python3.11/dist-packages/sklearn/linear_model/_sag.py:348: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn(

```
✅ Best Parameters for Logistic Regression:
 {'solver': 'saga', 'penalty': 'l2', 'max_iter': 5000, 'C': np.float64(10.0)}
```

```
📊 Confusion Matrix:
 [[19974     0]
 [   14    12]]
```

```
📋 Classification Report:
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      1.000     0.462     0.632        26

    accuracy                          0.999     20000
   macro avg      1.000     0.731     0.816     20000
weighted avg      0.999     0.999     0.999     20000
```

🔥 Training and tuning: K-Nearest Neighbors

```
✅ Best Parameters for K-Nearest Neighbors:
 {'weights': 'uniform', 'p': 1, 'n_neighbors': 5, 'leaf_size': 20, 'algorithm': 'auto'}
```

```
📊 Confusion Matrix:
 [[19973     1]
 [    9    17]]
```

```
📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.944     0.654     0.773        26

    accuracy                          1.000     20000
   macro avg      0.972     0.827     0.886     20000
weighted avg      0.999     1.000     0.999     20000
```
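
The fraud-class rows in these reports follow directly from the confusion matrices. As a check, the sketch below recomputes precision, recall, and F1 for two of the matrices above using plain Python. `fraud_metrics` is a hypothetical helper name, and the matrix layout is assumed to be scikit-learn's `[[TN, FP], [FN, TP]]`.

```python
def fraud_metrics(cm):
    """Return (precision, recall, f1) for the positive class of a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

matrices = {
    "Logistic Regression": [[19974, 0], [14, 12]],
    "K-Nearest Neighbors": [[19973, 1], [9, 17]],
}

for name, cm in matrices.items():
    p, r, f1 = fraud_metrics(cm)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Logistic Regression never raises a false alarm here but misses 14 of the 26 fraud cases, which is why its F1 is much lower than its perfect precision suggests.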

### Output of Evaluation Metrics and Best Parameters of Log-Normalized and Dummy-Encoded Dataset with StandardScaler, without Using SMOTE

🔥 Training and tuning: Random Forest

✅ Best Parameters for Random Forest:
```
 {'n_estimators': 100,
 'min_samples_split': 2,
 'max_features': 'log2',
 'max_depth': 25,
 'criterion': 'entropy'}
```
```
📊 Confusion Matrix:
 [[19973     1]
 [   10    16]]
```

📋 Classification Report:
```
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      0.941     0.615     0.744        26
```
---

🔥 Training and tuning: AdaBoost
✅ Best Parameters for AdaBoost:
 {'n_estimators': 50, 'learning_rate': 0.5, 'estimator__min_samples_split': 5, 'estimator__min_samples_leaf': 1, 'estimator__max_depth': 2}
📊 Confusion Matrix:
 [[19974     0]
 [   17     9]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      1.000     0.346     0.514        26

    accuracy                          0.999     20000
   macro avg      1.000     0.673     0.757     20000
weighted avg      0.999     0.999     0.999     20000

---

🔥 Training and tuning: Logistic Regression

✅ Best Parameters for Logistic Regression:
```
 {'solver': 'saga', 
 'penalty': 'l1',
 'max_iter': 1000, 
 'C': np.float64(10.0)}
```
```
📊 Confusion Matrix:
 [[19974     0]
 [   14    12]]
```

📋 Classification Report:
```
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      1.000     0.462     0.632        26
```
---

🔥 Training and tuning: K-Nearest Neighbors

✅ Best Parameters for K-Nearest Neighbors:
```
{'weights': 'distance',
'p': 2,
'n_neighbors': 7,
'leaf_size': 30,
'algorithm': 'auto'}
```
```
📊 Confusion Matrix:
 [[19973     1]
 [   12    14]]
```

📋 Classification Report:
```
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      0.933     0.538     0.683        26
```
---

### Output of Evaluation Metrics and Best Parameters of Log-Normalized and Dummy-Encoded Transactions Using StandardScaler, without Using SMOTE


🔥 Training and tuning: Random Forest
```
✅ Best Parameters for Random Forest:
 {'n_estimators': 400, 
 'min_samples_split': 5, 
 'max_features': 'log2', 
 'max_depth': 35, 
 'criterion': 'gini'}
```
```
📊 Confusion Matrix:
 [[19973     1]
 [    8    18]]
```

📋 Classification Report:
```
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.947     0.692     0.800        26

    accuracy                          1.000     20000
```

🔥 Training and tuning: AdaBoost
```
✅ Best Parameters for AdaBoost:
 {'n_estimators': 50,
 'learning_rate': 0.1,
 'estimator__min_samples_split': 2,
 'estimator__min_samples_leaf': 1,
 'estimator__max_depth': 3}
```
```
📊 Confusion Matrix:
 [[19973     1]
 [    9    17]]
```

📋 Classification Report:
```
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.944     0.654     0.773        26

    accuracy                          1.000     20000
```

🔥 Training and tuning: Logistic Regression
```
✅ Best Parameters for Logistic Regression:
 {'solver': 'saga',
 'penalty': 'l1',
 'max_iter': 1000, 
 'C': np.float64(10.0)}
```
```
📊 Confusion Matrix:
 [[19973     1]
 [   12    14]]
```

📋 Classification Report:
```
               precision    recall  f1-score   support

           0      0.999     1.000     1.000     19974
           1      0.933     0.538     0.683        26

    accuracy                          0.999     20000
```

🔥 Training and tuning: K-Nearest Neighbors
```
✅ Best Parameters for K-Nearest Neighbors:
 {'weights': 'distance',
 'p': 1, 
 'n_neighbors': 7, 
 'leaf_size': 30, 
 'algorithm': 'auto'}
```
```
📊 Confusion Matrix:
 [[19972     2]
 [    9    17]]
```

📋 Classification Report:
```
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.895     0.654     0.756        26

    accuracy                          0.999     20000
```



In [59]:
# Step 1: Separate features (X) and target (y)
X = df_log_dummy_encode_transacts.drop(columns=['isFraud'])  
y = df_log_dummy_encode_transacts['isFraud']                 

# Step 1.2: Split into training and testing sets
# stratify=y keeps the fraud/non-fraud ratio the same in the train and test sets.
# This is still useful with SMOTE, which only rebalances the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
df_log_dummy_encode_transacts

Unnamed: 0,isFraud,CASH_OUT,DEBIT,PAYMENT,TRANSFER,amount_log,oldbalanceOrig_log,newbalanceOrig_log,oldbalanceDest_log,newbalanceDest_log
0,0,False,False,False,False,10.608846,11.522173,11.859486,12.400281,12.515597
1,0,True,False,False,False,12.766230,8.528529,0.000000,8.457159,12.779584
2,0,False,False,True,False,7.707894,7.855409,5.871554,0.000000,0.000000
3,0,False,False,False,False,12.043679,14.555915,14.633881,14.796901,14.866930
4,0,False,False,False,False,12.473167,14.611048,14.722504,13.109396,12.355899
...,...,...,...,...,...,...,...,...,...,...
99995,1,False,False,False,True,12.774484,12.774484,0.000000,0.000000,0.000000
99996,1,True,False,False,False,12.799063,12.799063,0.000000,12.848402,13.517183
99997,1,False,False,False,True,12.862446,12.862446,0.000000,0.000000,0.000000
99998,1,True,False,False,False,13.689961,13.689961,0.000000,9.012266,13.699217


In [62]:
# Step 2: Apply SMOTE to training data only
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# Step 3: Scale features after SMOTE
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_resampled)
X_test_scaled = scaler.transform(X_test)

# Step 4: Show class balance and test skew
print("✅ SMOTE applied.")
print("📊 Class balance after SMOTE:\n", y_resampled.value_counts())
print("\n📉 y_test distribution:\n", y_test.value_counts())

✅ SMOTE applied.
📊 Class balance after SMOTE:
 isFraud
0    79896
1    79896
Name: count, dtype: int64

📉 y_test distribution:
 isFraud
0    19974
1       26
Name: count, dtype: int64


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [40]:
# Step 6: hyperparameter Search Grids
# With SMOTE 
param_grids = {
    "Random Forest": {
        "n_estimators": [100, 200, 400],                # Number of trees in the forest
        "criterion": ["gini", "entropy", "log_loss"],   # Function to measure the quality of a split
        "max_depth": range(10, 55, 5),                  # Maximum depth of each tree
        "max_features": ["sqrt", "log2"],               # Number of features to consider at each split
        "min_samples_split": [2, 5, 20]                 # Minimum number of samples to split an internal node
    },
    "AdaBoost": {
        "n_estimators": [50, 100, 200],                 # Number of boosting rounds
        "learning_rate": [0.01, 0.1, 0.5, 1.0],         # Shrinks the contribution of each classifier
        "estimator__max_depth": [1, 2, 3],              # Controls weak learner complexity
        "estimator__min_samples_split": [2, 5, 10],     # Regularization
        "estimator__min_samples_leaf": [1, 2, 4]        # Minimum samples required at a leaf node
    },
    "Logistic Regression": {
        "C": np.logspace(-3, 3, 7),                     # Inverse of regularization strength
        "penalty": ["l1", "l2"],                        # Type of regularization to apply
        "solver": ["saga"],                             # Optimizer that supports L1/L2 with large data
        "max_iter": [500, 1000, 1500],                  # Max number of iterations for convergence
    },
    "K-Nearest Neighbors": {
        "n_neighbors": [3, 5, 7, 11],                   # Number of neighbors to use for prediction
        "weights": ["uniform", "distance"],             # Uniform: equal weights, Distance: closer = more weight
        "p": [1, 2],                                     # Distance metric: 1 = Manhattan, 2 = Euclidean
        "leaf_size": [20, 30, 50],                      # Leaf size for tree-based algorithms
        "algorithm": ["auto"]                           # Algorithm used to compute nearest neighbors
    }
}


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall, and sensitivity.

In [41]:
# Step 6: Define classifiers for training
# With SMOTE
models = {
    "Random Forest": RandomForestClassifier(class_weight='balanced', random_state=42),
    "AdaBoost": AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=42), random_state=42),
    "Logistic Regression": LogisticRegression(solver='saga', class_weight='balanced', random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier()
}

# Store best models 
best_models = {}

# Select models for training and tuning
for name, model in models.items():
    print(f"\n🔥 Training and tuning: {name}")
    
    param_grid = param_grids[name] # Select which param to use in randomized search
    
    random_search = RandomizedSearchCV(
        model,
        param_distributions=param_grid,
        # n_iter=10,     # Number of sampled settings (10 is the default)
        cv=5,
        scoring='f1',    # Optimize F1 for the fraud class; 'roc_auc' is another option
        random_state=42,
        n_jobs=-1        # Use all available CPU cores
    )
    
    # Fit to training data
    random_search.fit(X_train_scaled, y_resampled)
    
    # Save the best model
    best_models[name] = random_search.best_estimator_

    # Predict on test set using the best model
    yhat = best_models[name].predict(X_test_scaled)

    # Evaluate
    cm = confusion_matrix(y_test, yhat)
    report = classification_report(y_test, yhat, digits=3, zero_division=0)
    # Note: yhat holds hard class labels, so this value equals balanced accuracy
    auc = roc_auc_score(y_test, yhat)
    
    print(f"✅ Best Parameters for {name}:\n", random_search.best_params_)
    print("📊 Confusion Matrix:\n", cm)
    print("📋 Classification Report:\n", report)
    print(f"🎯 ROC-AUC Score: {auc:.4f}")


🔥 Training and tuning: Random Forest
✅ Best Parameters for Random Forest:
 {'n_estimators': 200, 'min_samples_split': 2, 'max_features': 'log2', 'max_depth': 35, 'criterion': 'entropy'}
📊 Confusion Matrix:
 [[19875    99]
 [    4    22]]
📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.995     0.997     19974
           1      0.182     0.846     0.299        26

    accuracy                          0.995     20000
   macro avg      0.591     0.921     0.648     20000
weighted avg      0.999     0.995     0.997     20000

🎯 ROC-AUC Score: 0.9206

🔥 Training and tuning: AdaBoost
✅ Best Parameters for AdaBoost:
 {'n_estimators': 50, 'learning_rate': 0.5, 'estimator__min_samples_split': 5, 'estimator__min_samples_leaf': 1, 'estimator__max_depth': 2}
📊 Confusion Matrix:
 [[18811  1163]
 [    1    25]]
📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.942     0.970   
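
A note on the ROC-AUC figure printed above: because `roc_auc_score` receives hard 0/1 predictions (`yhat`) rather than probabilities, the value reduces to the average of the fraud recall and the non-fraud recall. The plain-Python sketch below reproduces the Random Forest figure from its confusion matrix; the variable names are illustrative.

```python
# Random Forest confusion matrix from the SMOTE run: [[TN, FP], [FN, TP]]
tn, fp = 19875, 99
fn, tp = 4, 22

tpr = tp / (tp + fn)   # recall on fraud (sensitivity)
tnr = tn / (tn + fp)   # recall on non-fraud (specificity)

# With a single operating point, the ROC curve is two line segments and
# the area under it equals the mean of TPR and TNR
auc_from_labels = (tpr + tnr) / 2
print(f"TPR={tpr:.4f}  TNR={tnr:.4f}  AUC from hard labels={auc_from_labels:.4f}")
```

Passing `best_models[name].predict_proba(X_test_scaled)[:, 1]` as the score instead would give an AUC that does not depend on the default 0.5 decision threshold.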

## Training and Evaluation Metrics Using Three Models on baseline.csv (No Transformations, Scaling, etc.)
💥 Training and evaluating: Random Forest

```
📊 Confusion Matrix:
 [[39820   128]  → True Negatives / False Positives
 [   11    41]]  → False Negatives / True Positives
 ```
📄 Classification Report:
```
               precision    recall  f1-score   support

           0      1.000     0.997     0.998     39948
           1      0.243     0.788     0.371        52

    accuracy                          0.997     40000
```

💥 Training and evaluating: AdaBoost
```
📊 Confusion Matrix:
 [[37111  2837] → True Negatives / False Positives
 [    1    51]] → False Negatives / True Positives
```
📄 Classification Report:
```
               precision    recall  f1-score   support

           0      1.000     0.929     0.963     39948
           1      0.018     0.981     0.035        52

    accuracy                          0.929     40000
```
   
💥 Training and evaluating: Logistic Regression
```
📊 Confusion Matrix:
 [[39049   899]
 [    5    47]]
 ```
 ```
📄 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.977     0.989     39948
           1      0.050     0.904     0.094        52

    accuracy                          0.977     40000
```

## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [None]:
# There are four models in the dictionary above...

### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

### Output of Evaluation Metrics and Best Parameters of Log-Transform, Transaction Deltas, and Dummy Encoding with StandardScaler, without Using SMOTE, and Using Google Colab

🔥 The best model so far, trained using Google Colab.
```
✅ Best Parameters for Random Forest:
 {'n_estimators': 100, 'min_samples_split': 20, 'max_features': 'sqrt', 'max_depth': 35, 'criterion': 'log_loss'}
```

```
📊 Confusion Matrix:
 [[19974     0]
 [    0    26]]
```

```
📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      1.000     1.000     1.000        26

    accuracy                          1.000     20000
   macro avg      1.000     1.000     1.000     20000
weighted avg      1.000     1.000     1.000     20000
```

- Random Forest classifies every test transaction correctly in this run, giving an **F1-score** of 1.000 for the fraud class despite the class imbalance (26 fraud cases out of 20,000 test rows).
- The dataset used: df_log_diff_dummy_encode
   - Consisting of **log-transformed, transaction-delta, and dummy-encoded** columns.
   - Done without SMOTE
   - Done with StandardScaler (mean=0, variance=1) and stratify=y

```
✅ Best Parameters for AdaBoost:
 {'n_estimators': 200, 'learning_rate': 0.1, 'estimator__min_samples_split': 10, 'estimator__min_samples_leaf': 1, 'estimator__max_depth': 2}
```

```
📊 Confusion Matrix:
 [[19973     1]
 [    0    26]]
```

```
📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000     19974
           1      0.963     1.000     0.981        26

    accuracy                          1.000     20000
   macro avg      0.981     1.000     0.991     20000
weighted avg      1.000     1.000     1.000     20000

```

---

The third-best performance so far, using **Random Forest**:

### Random Forest 

✅ Best Parameters for Random Forest:
```
 {'n_estimators': 400,
 'min_samples_split': 5,
 'max_features': 'log2',
 'max_depth': 35,
 'criterion': 'gini'}
```
```
📊 Confusion Matrix:
 [[19973     1] → True Negatives / False Positives
 [    8    18]] → False Negatives / True Positives
```

📋 Classification Report:

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 1.00      | 1.00   | 1.00     | 19974   |
| 1     | 0.95      | 0.69   | 0.80     | 26      |


- Random Forest identifies non-fraud transactions almost perfectly and reaches a strong **F1-score** (0.80) for the fraud class despite the class imbalance. I'll continue to tune model parameters and try the other transformed datasets to improve the precision-recall trade-off.
- The dataset used: df_log_dummy_encode_transacts
   - Consisting of **log-transformed and dummy-encoded transaction** columns only.
   - Done without SMOTE
   - Done with StandardScaler (mean=0, variance=1)

---

**Support Vector Machine using MinMaxScaler**

- **Hyperparameter search grid for SVC**
  ```python
  {
    'C': np.logspace(-2, 2, 5),            
    'kernel': ['rbf', 'sigmoid'],          
    'gamma': ['scale', 'auto'] 
  }
  ```

- **Confusion Matrix**:
  ```
  [[19973     1] → True Negatives / False Positives
   [    9    17]] → False Negatives / True Positives
  ```

- **Classification Report**:

  | Class | Precision | Recall | F1-Score | Support |
  |-------|-----------|--------|----------|---------|
  | 0     | 1.00      | 1.00   | 1.00     | 19974   |
  | 1     | 0.94      | 0.65   | 0.77     | 26      |

- Support Vector Machine identifies non-fraud transactions almost perfectly and achieves a strong **F1-score** (0.77) for fraud despite the class imbalance. I'll continue to tune model parameters and try other transformed datasets to improve the precision-recall trade-off.
- The dataset used: `log_transacts`
   - Consisting of **log-transformed** columns only.
   - Done without using SMOTE
   - Done using `MinMaxScaler`
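
The parameter block above lists the values that were searched rather than the single winning combination. A minimal sketch, on synthetic data, of how the selected parameters can be read back from a fitted search through `best_params_`. The `make_classification` dataset, the pipeline step names `scale` and `svc`, and the small sample size are assumptions chosen for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for the fraud data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling inside the pipeline means each CV fold fits its own scaler.
pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC(random_state=42))])

# Same search space as above, addressed through the "svc__" step prefix.
param_space = {
    "svc__C": np.logspace(-2, 2, 5),
    "svc__kernel": ["rbf", "sigmoid"],
    "svc__gamma": ["scale", "auto"],
}

search = RandomizedSearchCV(
    pipe, param_distributions=param_space, scoring="f1",
    cv=5, n_iter=10, random_state=42,
)
search.fit(X_tr, y_tr)

# The single combination chosen by cross-validation, and its mean CV F1.
print("Best parameters:", search.best_params_)
print("Best CV F1:", round(search.best_score_, 3))
```

Printing `best_params_` next to the confusion matrix makes each reported result reproducible without rerunning the search.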

In [82]:
# Step 2: Separate features and target
X = df_log_dummy_encode_transacts.drop(columns=['isFraud'])  # All features except target
y = df_log_dummy_encode_transacts['isFraud']                 # Target variable

# Step 3: Split into training and testing sets
# Use stratify=y to keep the fraud/non-fraud ratio consistent in both splits (SMOTE only resamples the training set, so this still matters for the test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
# Without SMOTE
# Initialize MinMaxScaler to scale data to [0, 1]
scaler = MinMaxScaler()

# Fit on training data and transform both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define hyperparameter grid for SVC
svc_param_grid = {
    'C': np.logspace(-2, 2, 5),
    'kernel': ['rbf', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

# Initialize the SVC model
svc = SVC(max_iter=10000, random_state=42)

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=svc,
    param_distributions=svc_param_grid,
    scoring='f1',
    cv=5,
    n_iter=10,
    n_jobs=-1,
    random_state=42
)

# Fit the model on the scaled training data
random_search.fit(X_train_scaled, y_train)


In [69]:
# best model after tuning
best_svc = random_search.best_estimator_

# scaled test data for predictions
yhat = best_svc.predict(X_test_scaled)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("✅ Confusion Matrix:\n", confusion)
print("\n✅ Classification Report:\n", class_report)

✅ Confusion Matrix:
 [[19973     1]
 [    9    17]]

✅ Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     19974
           1       0.94      0.65      0.77        26

    accuracy                           1.00     20000
   macro avg       0.97      0.83      0.89     20000
weighted avg       1.00      1.00      1.00     20000



In [83]:
from sklearn.cluster import KMeans

# With SMOTE
# Apply SMOTE to training data only
# ✅ KMeansSMOTE is faster + cluster-aware
kmeans = KMeans(n_clusters=15, random_state=42)
# cluster_balance_threshold must be a float or 'auto'; passing the string '0.001' raises InvalidParameterError
sm = KMeansSMOTE(random_state=42, n_jobs=-1, cluster_balance_threshold=0.001, kmeans_estimator=kmeans)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# Initialize MinMaxScaler to scale data to [0, 1]
scaler = MinMaxScaler()

# Fit the scaler on the resampled training data only (prevents data leakage), then transform both sets
X_train_scaled = scaler.fit_transform(X_resampled)
X_test_scaled = scaler.transform(X_test)

# Show class balance after SMOTE and the untouched test distribution
print("✅ SMOTE applied.")
print("📊 Class balance after SMOTE:\n", y_resampled.value_counts())
print("\n📉 y_test distribution:\n", y_test.value_counts())

In [80]:
# Define hyperparameter grid for SVC
# With SMOTE
svc_param_grid = {
    'C': np.logspace(-2, 2, 5),
    'kernel': ['rbf', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

# Initialize the SVC model
svc = SVC(max_iter=10000, random_state=42)

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=svc,
    param_distributions=svc_param_grid,
    scoring='f1',
    cv=5,
    n_iter=10,
    n_jobs=-1,
    random_state=42
)

# Fit the model on the scaled training data
random_search.fit(X_train_scaled, y_resampled)




In [81]:
# best model after tuning
best_svc = random_search.best_estimator_

# scaled test data for predictions
yhat = best_svc.predict(X_test_scaled)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("✅ Confusion Matrix:\n", confusion)
print("\n✅ Classification Report:\n", class_report)

✅ Confusion Matrix:
 [[19096   878]
 [    0    26]]

✅ Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.96      0.98     19974
           1       0.03      1.00      0.06        26

    accuracy                           0.96     20000
   macro avg       0.51      0.98      0.52     20000
weighted avg       1.00      0.96      0.98     20000

