# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

Classification

Are you predicting for multiple classes or binary classes?  

Binary classes

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

Logistic regression, Random Forest Classifier

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [30]:
import numpy as np
import pandas as pd
from scipy.stats import loguniform, randint, uniform
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


In [6]:
transactions = pd.read_csv("../data/transactions_transformed_sample_encoded.csv")

X = transactions.drop(columns=['isFraud'])  
y = transactions['isFraud']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


print("Train set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)

Train set shape: (40000, 11) (40000,)
Test set shape: (10000, 11) (10000,)
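
The split above does not pass `stratify`, so the number of fraud cases that land in each set is left to chance. With a class this rare, that can matter. The following is a minimal sketch, on synthetic stand-in labels (the row and fraud counts are assumptions, not values from the real dataset), of how `stratify=y` keeps the class ratio the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 50,000 rows, 50 of them labelled as fraud
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50_000, 3))
y_demo = np.zeros(50_000, dtype=int)
y_demo[:50] = 1

# stratify=y_demo preserves the fraud ratio in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("Fraud cases in train:", y_tr.sum(), "of", len(y_tr))
print("Fraud cases in test: ", y_te.sum(), "of", len(y_te))
```

If the real split were stratified, the metrics reported later in this notebook would likely change, so the outputs would need to be regenerated.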


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [14]:
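# Scale the features, then fit a logistic regression
# (the 'liblinear' solver supports both the L1 and L2 penalties searched below)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(solver='liblinear', random_state=42))
])

# Define the hyperparameter search space
param_distributions = {
    'logreg__C': loguniform(1e-4, 1e2),  # Inverse of regularization strength (smaller C = stronger regularization)
    'logreg__penalty': ['l1', 'l2']
}

# Randomized search: 20 sampled candidates, 5-fold cross-validation, scored by F1
random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=20,
    scoring='f1',
    cv=5,
    verbose=1,
    n_jobs=-1,
    random_state=42
)

# Fit the search on the training data
random_search.fit(X_train, y_train)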


Fitting 5 folds for each of 20 candidates, totalling 100 fits
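
Beyond the single best candidate, the fitted search object also stores the cross-validated score of every sampled candidate in `cv_results_`. The sketch below shows one way to look at the top few. It runs on synthetic data from `make_classification` (an assumption, used so the example is self-contained), and the names `demo_search` and `top_candidates` are illustrative only.

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): roughly 5% positive class
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.95, 0.05], random_state=42
)

demo_search = RandomizedSearchCV(
    estimator=Pipeline([
        ("scaler", StandardScaler()),
        ("logreg", LogisticRegression(solver="liblinear", random_state=42)),
    ]),
    param_distributions={
        "logreg__C": loguniform(1e-4, 1e2),
        "logreg__penalty": ["l1", "l2"],
    },
    n_iter=10,
    scoring="f1",
    cv=5,
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate; rank 1 is the candidate with the best mean F1
results = pd.DataFrame(demo_search.cv_results_)
top_candidates = results.sort_values("rank_test_score")[
    ["param_logreg__C", "param_logreg__penalty", "mean_test_score", "std_test_score"]
].head(5)
print(top_candidates)
```

A large `std_test_score` relative to the gap between candidates suggests the ranking is sensitive to how the folds were drawn, which is common when positives are scarce.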




### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall (also called sensitivity), and F1 score.  

In [16]:
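# Retrieve the pipeline that was refit on the full training set with the best hyperparameters
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate on the held-out test set
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred)  # recall is the same quantity as sensitivity
f1 = f1_score(y_test, y_pred)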

print("Best Hyperparameters:", random_search.best_params_)
print("\n Evaluation Metrics on Test Set:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f} (Sensitivity)")
print(f"F1 Score:  {f1:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Best Hyperparameters: {'logreg__C': np.float64(69.58780103230364), 'logreg__penalty': 'l1'}

 Evaluation Metrics on Test Set:
Accuracy:  0.9995
Precision: 0.8750
Recall:    0.6364 (Sensitivity)
F1 Score:  0.7368

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      9989
           1       0.88      0.64      0.74        11

    accuracy                           1.00     10000
   macro avg       0.94      0.82      0.87     10000
weighted avg       1.00      1.00      1.00     10000

Confusion Matrix:
 [[9988    1]
 [   4    7]]
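
To make the link between the confusion matrix and the reported metrics explicit, the sketch below recomputes them by hand from the four counts printed above. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes), so the matrix unpacks as TN, FP, FN, TP.

```python
import numpy as np

# Confusion matrix reported above for the logistic regression model
cm = np.array([[9988, 1],
               [4, 7]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted fraud that is actually fraud
recall = tp / (tp + fn)             # share of actual fraud that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

Accuracy is dominated by the 9,989 non-fraud rows, which is why it stays near 1.00 even though 4 of the 11 fraud cases are missed.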


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare the evaluation metrics of the two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [22]:
# Re-create the train-test split (same random_state as before, so the split is identical)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


print("Train set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)

Train set shape: (40000, 11) (40000,)
Test set shape: (10000, 11) (10000,)


In [21]:
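# Search for the best hyperparameters

# Wrap the classifier in a pipeline so parameter names carry the 'rf__' prefix
# (tree-based models do not need feature scaling, so no scaler step is included)
pipeline_rf = Pipeline([
    ('rf', RandomForestClassifier(random_state=42, n_jobs=-1))
])

# Define the hyperparameter search space
# (scipy's randint(low, high) samples integers from low up to, but not including, high)
param_distributions_rf = {
    'rf__n_estimators': randint(100, 500),
    'rf__max_depth': randint(5, 30),
    'rf__min_samples_split': randint(2, 10),
    'rf__min_samples_leaf': randint(1, 10),
    'rf__max_features': ['sqrt', 'log2', None]
}

# Randomized search: 25 sampled candidates, 5-fold cross-validation, scored by F1
random_search_rf = RandomizedSearchCV(
    estimator=pipeline_rf,
    param_distributions=param_distributions_rf,
    n_iter=25,
    scoring='f1',
    cv=5,
    verbose=1,
    n_jobs=-1,
    random_state=42
)

# Fit the search on the training data
random_search_rf.fit(X_train, y_train)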


Fitting 5 folds for each of 25 candidates, totalling 125 fits


In [23]:
# Retrieve the best model (refit on the full training set) and predict on the test set
best_rf = random_search_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test)

# Evaluate on the held-out test set
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf, zero_division=0)
recall_rf = recall_score(y_test, y_pred_rf)  # recall is the same quantity as sensitivity
f1_rf = f1_score(y_test, y_pred_rf)

print("Best Random Forest Hyperparameters:", random_search_rf.best_params_)
print("\n Evaluation Metrics (Random Forest):")
print(f"Accuracy:  {accuracy_rf:.4f}")
print(f"Precision: {precision_rf:.4f}")
print(f"Recall:    {recall_rf:.4f}")
print(f"F1 Score:  {f1_rf:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

Best Random Forest Hyperparameters: {'rf__max_depth': 8, 'rf__max_features': 'sqrt', 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 3, 'rf__n_estimators': 364}

 Evaluation Metrics (Random Forest):
Accuracy:  0.9998
Precision: 1.0000
Recall:    0.8182
F1 Score:  0.9000

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      9989
           1       1.00      0.82      0.90        11

    accuracy                           1.00     10000
   macro avg       1.00      0.91      0.95     10000
weighted avg       1.00      1.00      1.00     10000

Confusion Matrix:
 [[9989    0]
 [   2    9]]
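
To compare the two models side by side, the sketch below collects the metrics and confusion-matrix counts printed above into one table. The values are copied from the outputs in this notebook. The column names are illustrative.

```python
import pandas as pd

# Metrics and error counts copied from the two evaluation outputs above
comparison = pd.DataFrame(
    {
        "Logistic Regression": {
            "accuracy": 0.9995, "precision": 0.8750, "recall": 0.6364,
            "f1": 0.7368, "missed_fraud_fn": 4, "false_alarms_fp": 1,
        },
        "Random Forest": {
            "accuracy": 0.9998, "precision": 1.0000, "recall": 0.8182,
            "f1": 0.9000, "missed_fraud_fn": 2, "false_alarms_fp": 0,
        },
    }
).T  # transpose so that each model is one row

print(comparison)
```

On this split, the random forest catches 9 of the 11 fraud cases with no false alarms, against 7 of 11 and one false alarm for logistic regression, while accuracy barely differs. With only 11 fraud cases in the test set, the gap comes down to two transactions, so it is better read as indicative than conclusive.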


### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.
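
As a starting point for the bonus step, here is a hedged sketch of what a third model could look like, following the same search-then-evaluate pattern as the first two. It uses a support vector classifier (`SVC` is already imported at the top of the notebook) on synthetic data, not on the transactions dataset. The search ranges, `class_weight="balanced"`, and all `*_svc` and `*_demo` names are assumptions chosen for illustration, not results or choices from this project.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): roughly 10% positive class
X_demo, y_demo = make_classification(
    n_samples=800, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# SVMs are sensitive to feature scale, so scaling goes inside the pipeline
pipeline_svc = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(class_weight="balanced", random_state=42)),
])

# Illustrative search ranges (assumption), sampled on a log scale
param_distributions_svc = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-4, 1e0),
}

random_search_svc = RandomizedSearchCV(
    estimator=pipeline_svc,
    param_distributions=param_distributions_svc,
    n_iter=10,
    scoring="f1",
    cv=5,
    random_state=42,
)
random_search_svc.fit(X_tr, y_tr)

# Evaluate the refit best model on the held-out test set
y_pred_svc = random_search_svc.best_estimator_.predict(X_te)
print("Best SVC hyperparameters:", random_search_svc.best_params_)
print(classification_report(y_te, y_pred_svc, zero_division=0))
```

To apply this to the project, replace the synthetic data with `X_train`, `X_test`, `y_train`, and `y_test` from the earlier split. Kernel SVMs can be slow on 40,000 rows, so a smaller `n_iter` or a subsample may be needed.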