# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## AI/ML Model Training, Testing, and Evaluation Workflow
1. Import the necessary libraries
2. Load data
3. Perform EDA
4. Preprocess data
5. Train the model
   - Create a train/test split
   - Search for the best hyperparameters
   - Make predictions
6. Evaluate the model

**More Granular:**
- ☑️ Import necessary Python libraries and/or modules
- ☑️ Load data
- ☑️ Split train/test (keep fraud ratio with stratify)
- ☑️ Balance minority fraud class with SMOTE
- ☑️ Scale features for all models
- ☑️ Define classifiers in a dictionary 
    -  Not necessary for model training, but worth doing for a more streamlined approach.
- ☑️ Loop: search for best parameters using RandomizedSearchCV
- ☑️ Loop: train, predict, evaluate
- ☑️ Print Confusion Matrix & Classification Report for each

In [6]:
# Import necessary libraries and modules
# pandas for handling DataFrames and NumPy for handling arrays
import pandas as pd
import numpy as np

# Load models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC, LinearSVC

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Preprocessing: feature scaling and SMOTE oversampling
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Number of CPU cores available for parallel training (n_jobs)
from joblib import cpu_count
print(f"✔️There are currently {cpu_count()} CPU-Cores available for model training💪")

# Load evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
 
# Import model selection utilities and plotting libraries
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

✔️There are currently 8 CPU-Cores available for model training💪


In [33]:
# read in csv files to pandas dataframe
# Baseline transactions 
df_baseline = pd.read_csv("../data/baseline_transacts.csv")

# Baseline dummy encoding transform
df_base_dummy_encode = pd.read_csv("../data/baseline_dummy_encode_transacts.csv")

# Log feature transform
df_log_transacts = pd.read_csv("../data/log_transacts.csv")

# Log and dummy encoding transforms
df_log_dummy_encode_transacts = pd.read_csv("../data/log_dummy_encode_transacts.csv")


# Log, change in transaction, and dummy encoding transforms
df_log_diff_dummy_encode = pd.read_csv("../data/log_diff_dummy_transacts.csv")


In [35]:
# Preview the DataFrame (pandas displays only the first and last rows)
df_log_transacts

Unnamed: 0,isFraud,amount_log,oldbalanceOrig_log,newbalanceOrig_log,oldbalanceDest_log,newbalanceDest_log
0,0,10.608846,11.522173,11.859486,12.400281,12.515597
1,0,12.766230,8.528529,0.000000,8.457159,12.779584
2,0,7.707894,7.855409,5.871554,0.000000,0.000000
3,0,12.043679,14.555915,14.633881,14.796901,14.866930
4,0,12.473167,14.611048,14.722504,13.109396,12.355899
...,...,...,...,...,...,...
149995,1,14.723745,14.723745,0.000000,0.000000,0.000000
149996,1,13.473565,13.473565,0.000000,13.406430,14.133707
149997,1,10.117585,10.117585,0.000000,0.000000,0.000000
149998,1,13.573852,13.573852,0.000000,13.597156,14.358291


In [24]:
df_log_transacts[df_log_transacts['isFraud'] == 1] 

Unnamed: 0,amount_log,oldbalanceOrig_log,newbalanceOrig_log,oldbalanceDest_log,newbalanceDest_log,isFraud
199741,14.928063,14.928063,0.0,0.000000,0.000000,1
199742,16.118096,16.118096,0.0,0.000000,16.118096,1
199743,8.352894,8.352894,0.0,0.000000,0.000000,1
199744,9.080735,9.080735,0.0,0.000000,9.080735,1
199745,12.582867,0.000000,0.0,0.000000,12.582867,1
...,...,...,...,...,...,...
199995,11.546066,11.546066,0.0,0.000000,11.546066,1
199996,14.366892,14.366892,0.0,0.000000,0.000000,1
199997,16.118096,16.118096,0.0,12.259846,16.138981,1
199998,13.974011,13.974011,0.0,0.000000,0.000000,1


## Questions
Is this a classification or regression task?  

This is a classification task: each transaction is assigned to one of two discrete classes, fraud (1) or not fraud (0), rather than predicted as a continuous value.

Are you predicting for multiple classes or binary classes?  

This is a binary classification problem: each transaction is either fraud or not fraud, so there are only two possible labels.
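A quick way to confirm both answers is to inspect the target column directly. The sketch below uses a small hypothetical series (`y_demo`) in place of the notebook's `isFraud` column, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's target column (995 non-fraud, 5 fraud)
y_demo = pd.Series([0] * 995 + [1] * 5, name="isFraud")

n_classes = y_demo.nunique()          # 2 distinct labels -> binary classification
fraud_share = y_demo.mean()           # mean of a 0/1 column = share of positives
class_counts = y_demo.value_counts()  # rows per class

print(f"classes: {n_classes}, fraud share: {fraud_share:.3%}")
print(class_counts)
```

Running the same three calls on `df_log_transacts['isFraud']` would show how severe the imbalance is before any resampling.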

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

I'm using Logistic Regression, Support Vector Machine (SVM), and AdaBoost.

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [36]:
# Step 2: Separate features (X) and target (y)
X = df_log_transacts.drop(columns=['isFraud'])  # All features except target
y = df_log_transacts['isFraud']                 # Target variable

# Step 3: Split into training and testing sets
# stratify=y keeps the fraud/non-fraud ratio the same in both splits.
# This is not redundant with SMOTE: SMOTE only rebalances the training set,
# while the test set should keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
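To see what `stratify` does, the following sketch splits a synthetic, heavily imbalanced dataset and compares the positive rate in each split. The arrays `X_demo` and `y_demo` are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 3))   # synthetic features
y_demo = np.zeros(10_000, dtype=int)
y_demo[:50] = 1                         # 0.5% positives, mimicking rare fraud

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()   # share of positives in the training split
test_rate = y_te.mean()    # share of positives in the test split
print(f"train positives: {y_tr.sum()} ({train_rate:.3%})")
print(f"test positives:  {y_te.sum()} ({test_rate:.3%})")
```

Without `stratify`, a random split of so few positives could leave the test set with noticeably more or fewer fraud cases than the overall ratio implies.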

In [None]:
# Step 4: Balance the training set using SMOTE
# SMOTE generates synthetic minority class (fraud) samples
sm = SMOTE(random_state=42)
X_sampled, y_sampled = sm.fit_resample(X_train, y_train)

# Step 5: Scale the features (mean=0, variance=1)
# Fit scaler only on training data to prevent data leakage
scaler = StandardScaler()
X_sampled_scaled = scaler.fit_transform(X_sampled)
X_test_scaled = scaler.transform(X_test)  # Apply same scaler to test set

print("✅ Data pre-processing complete: SMOTE applied and features scaled.")
print("📊 Class balance after SMOTE:\n", y_sampled.value_counts())  

✅ Data pre-processing complete: SMOTE applied and features scaled.
📊 Class balance after SMOTE:
 isFraud
0    119844
1    119844
Name: count, dtype: int64
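As a rough intuition for where the extra fraud rows come from, the sketch below reproduces the interpolation idea behind SMOTE on made-up points: each synthetic sample lies on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not `imblearn`'s implementation, and `X_minority`, `k`, and `n_new` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X_minority = rng.normal(size=(20, 2))   # hypothetical minority-class points

k = 5
# Ask for k + 1 neighbours because each point is its own nearest neighbour
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
_, neighbor_idx = nn.kneighbors(X_minority)

n_new = 30
base = rng.integers(0, len(X_minority), size=n_new)              # sample to start from
picked = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]  # skip column 0 (the point itself)
gap = rng.random((n_new, 1))                                     # position along the segment, in [0, 1)

# New point = base point + gap * (neighbour - base point)
X_synthetic = X_minority[base] + gap * (X_minority[picked] - X_minority[base])
print(X_synthetic.shape)
```

Because every synthetic point is a blend of two real minority points, SMOTE adds no information from outside the training data, which is why it is applied after the train/test split.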


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [38]:
# Step 6: Hyperparameter search grids
param_grids = {
    "Random Forest": {
        "n_estimators": [100, 200, 400, 600],
        "criterion": ["gini", "entropy", "log_loss"],
        "max_depth": range(10, 55, 5),
        "max_features": ["sqrt", "log2"],
        "min_samples_split": [2, 5, 20]
    },
    "AdaBoost": {
        "n_estimators": [50, 100, 200, 400],
        "learning_rate": [0.01, 0.1, 1, 2, 10, 100]
    },
    "Logistic Regression": {
        "C": np.logspace(-3, 3, 7),
        "penalty": ["l1", "l2"],
        "solver": ["liblinear"]
    }
}
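`RandomizedSearchCV` also accepts continuous distributions, not only lists of candidate values. The sketch below is one possible variation, assuming SciPy is available: it samples `C` from a log-uniform distribution over the same range as `np.logspace(-3, 3, 7)`. The synthetic data and the `class_weight` option are illustrative choices, not part of the grids above.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic, imbalanced dataset (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.9, 0.1], random_state=42
)

param_distributions = {
    "C": loguniform(1e-3, 1e3),            # continuous, sampled on a log scale
    "class_weight": [None, "balanced"],    # lists are still allowed
}

search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear", random_state=42),
    param_distributions=param_distributions,
    n_iter=8,
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(X_demo, y_demo)

best_C = search.best_params_["C"]
print(search.best_params_)
```

With a distribution, each of the `n_iter` draws can land on a new value of `C`, rather than revisiting the same seven grid points.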

### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall, and sensitivity.
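These metrics all derive from the four cells of the confusion matrix. As a worked example, the sketch below recomputes them from the tuned Random Forest confusion matrix reported in this notebook's output, `[[39867, 81], [5, 47]]`. Note that sensitivity is another name for recall; specificity is included for comparison.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (scikit-learn convention)
cm = np.array([[39867, 81],
               [5, 47]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision = tp / (tp + fp)           # share of flagged transactions that are fraud
recall = tp / (tp + fn)              # share of fraud cases caught (sensitivity)
specificity = tn / (tn + fp)         # share of non-fraud cases correctly passed
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```

The accuracy is near 1 even though precision is low, which is why accuracy alone is a poor guide on data this imbalanced.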

In [None]:
# Step 7: Define classifiers
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(solver='liblinear', random_state=42)
}

# Store the best estimator for each model so it can be reused without re-running the search
best_models = {}

# Tune, train, and evaluate each model
for name, model in models.items():
    print(f"\n🔥 Training and tuning: {name}")
    
    param_grid = param_grids[name]  # Search space for this model
    
    random_search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_grid,
        n_iter=10,  # Number of parameter settings sampled
        cv=5,  # 5-fold cross-validation
        scoring='f1',  # Metric used to rank candidates; 'roc_auc' is a possible alternative
        random_state=42,
        n_jobs=-1  # Use all available CPU cores
    )
    
    # Fit to training data
    random_search.fit(X_sampled_scaled, y_sampled)
    
    # Save the best model
    best_models[name] = random_search.best_estimator_

    # Predict on test set using the best model
    yhat = best_models[name].predict(X_test_scaled)
    
    # Evaluate
    cm = confusion_matrix(y_test, yhat)
    report = classification_report(y_test, yhat, digits=3)
    
    print(f"✅ Best Parameters for {name}:\n", random_search.best_params_)
    print("📊 Confusion Matrix:\n", cm)
    print("\n📋 Classification Report:\n", report)



🔥 Training and tuning: Random Forest
✅ Best Parameters for Random Forest:
 {'n_estimators': 100, 'min_samples_split': 5, 'max_features': 'sqrt', 'max_depth': 30, 'criterion': 'gini'}
📊 Confusion Matrix:
 [[39867    81]
 [    5    47]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.998     0.999     39948
           1      0.367     0.904     0.522        52

    accuracy                          0.998     40000
   macro avg      0.684     0.951     0.761     40000
weighted avg      0.999     0.998     0.998     40000


🔥 Training and tuning: AdaBoost
✅ Best Parameters for AdaBoost:
 {'n_estimators': 200, 'learning_rate': 1}
📊 Confusion Matrix:
 [[38875  1073]
 [    1    51]]

📋 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.973     0.986     39948
           1      0.045     0.981     0.087        52

    accuracy                          0.973     40000
   ma

## Training and Evaluation Metrics Using baseline.csv (No Transformations)
💥 Training and evaluating: Random Forest
📊 Confusion Matrix:
 [[39820   128]  → True Negatives / False Positives
 [   11    41]]  → False Negatives / True Positives


📄 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.997     0.998     39948
           1      0.243     0.788     0.371        52

    accuracy                          0.997     40000
   macro avg      0.621     0.893     0.685     40000
weighted avg      0.999     0.997     0.997     40000


💥 Training and evaluating: AdaBoost
📊 Confusion Matrix:
 [[37111  2837] → True Negatives / False Positives
 [    1    51]] → False Negatives / True Positives


📄 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.929     0.963     39948
           1      0.018     0.981     0.035        52

    accuracy                          0.929     40000
   macro avg      0.509     0.955     0.499     40000
weighted avg      0.999     0.929     0.962     40000


💥 Training and evaluating: Logistic Regression
📊 Confusion Matrix:
 [[39049   899]
 [    5    47]]

📄 Classification Report:
               precision    recall  f1-score   support

           0      1.000     0.977     0.989     39948
           1      0.050     0.904     0.094        52

    accuracy                          0.977     40000
   macro avg      0.525     0.941     0.541     40000
weighted avg      0.999     0.977     0.987     40000


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [None]:
# The three models are already defined in the `models` dictionary and tuned in the loop above.

### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.


The best model so far is the **Support Vector Machine**, with the following configuration and results:

- **Hyperparameter search space**:
  ```python
  {
      'C': np.logspace(-1, 2, 5),
      'kernel': ['rbf', 'sigmoid'],
      'gamma': ['scale', 'auto']
  }
  ```

- **Confusion Matrix**:
  ```
  [[39867    81] → True Negatives / False Positives
  [    30    22]] → False Negatives / True Positives
  ```

- **Classification Report**:

  | Class | Precision | Recall | F1-Score | Support |
  |-------|-----------|--------|----------|---------|
  | 0     | 1.00      | 1.00   | 1.00     | 39948   |
  | 1     | 1.00      | 0.42   | 0.59     | 52      |

- The Support Vector Machine classifies non-fraud transactions almost perfectly and reaches an **F1-score** of 0.59 on the fraud class despite the class imbalance, although its fraud recall is only 0.42. I'll continue to tune model parameters and try the other transformed datasets to achieve a better precision-recall trade-off.
- Dataset used: `baseline_dummy_encode_transacts`
   - It consists of the baseline columns (not scaled or `log1p`-transformed) plus dummy encodings of the `type` column.
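One way to explore the precision-recall trade-off without retraining is to move the decision threshold on predicted scores. The sketch below is a minimal illustration on synthetic data with a logistic regression standing in for the classifier. It picks the threshold that maximizes F1 on a held-out split; in practice that choice would be made on a separate validation set rather than the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=6, weights=[0.97, 0.03], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]   # estimated probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_te, scores)

# precision/recall have one more entry than thresholds; drop the final point
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)

best = int(np.argmax(f1))
best_threshold = thresholds[best]
print(f"threshold={best_threshold:.3f} precision={p[best]:.3f} "
      f"recall={r[best]:.3f} f1={f1[best]:.3f}")
```

Lowering the threshold generally raises recall at the cost of precision, so the right operating point depends on how costly a missed fraud case is compared with a false alarm.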

In [39]:
# SMOTE generates synthetic minority class (fraud) samples
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# Step 5: Scale features (mean=0, variance=1)
# Fit scaler only on training data to prevent data leakage/peeking
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_resampled)
X_test_scaled = scaler.transform(X_test)  # Apply same scaler to test set

print("✅ Data pre-processing complete: SMOTE applied and features scaled.")
print("📊 Class balance after SMOTE:\n", y_resampled.value_counts())  

✅ Data pre-processing complete: SMOTE applied and features scaled.
📊 Class balance after SMOTE:
 isFraud
0    119844
1    119844
Name: count, dtype: int64


In [40]:

# Initialize a Support Vector Classifier with an RBF kernel to handle non-linearity
svc_non_linear = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)

# Train the classifier on the training split
svc_non_linear.fit(X_train, y_train)

# Make predictions on the held-out test set
yhat = svc_non_linear.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report:\n", class_report)

Confusion Matrix 
 [[29961     0]
 [   25    14]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     29961
           1       1.00      0.36      0.53        39

    accuracy                           1.00     30000
   macro avg       1.00      0.68      0.76     30000
weighted avg       1.00      1.00      1.00     30000



In [41]:
# Set up a randomized search over SVC hyperparameters
svc_params_grid = {
    'C': np.logspace(-1, 2, 5),            
    'kernel': ['rbf', 'sigmoid'],          
    'gamma': ['scale', 'auto'],
    #'degree': [2, 3, 4, 5] 
}

svc = SVC(max_iter=10000, random_state=42)

# set up RandomizedSearchCV with 5-fold cross-validation
random_search = RandomizedSearchCV(svc, param_distributions=svc_params_grid, scoring='f1', n_jobs=-1, cv=5, random_state=42)

# Fit the search on the training data
# Runtime notes: 36m 12.0s without n_jobs set; 16m 13.6s when scaling before training and using all 8 cores
random_search.fit(X_train, y_train)
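Since scaling noticeably affects SVC training time, one option is to put the scaler and the classifier in a single scikit-learn `Pipeline` and search over that. The scaler is then refit inside each cross-validation fold. The sketch below uses small synthetic data and the same search space with `svc__` prefixes; the step names `scaler` and `svc` are arbitrary labels chosen here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic, imbalanced dataset (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # fit on the training folds only
    ("svc", SVC(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_distributions = {
    "svc__C": np.logspace(-1, 2, 5),
    "svc__kernel": ["rbf", "sigmoid"],
    "svc__gamma": ["scale", "auto"],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

`imblearn` provides its own `Pipeline` class that also accepts samplers such as SMOTE, which would let the oversampling step run inside each fold in the same way.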




In [43]:
best_svc = random_search.best_estimator_

# Make predictions on the held-out test set
yhat = best_svc.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report:\n", class_report)

Confusion Matrix 
 [[29960     1]
 [   16    23]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     29961
           1       0.96      0.59      0.73        39

    accuracy                           1.00     30000
   macro avg       0.98      0.79      0.86     30000
weighted avg       1.00      1.00      1.00     30000

