# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you've created in your previous data transformation step (not the original dataset).

## Questions
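Is this a classification or regression task?  

This is a classification task: the target `isFraud` is a categorical label, not a continuous value.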



In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

import matplotlib.pyplot as plt
import seaborn as sns



Are you predicting for multiple classes or binary classes?  

We are predicting binary classes: the target variable `isFraud` takes two values, Not Fraud (0) and Fraud (1).
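A quick way to confirm that the target is binary, and to see how imbalanced it is, is to count the label values. The sketch below uses a tiny made-up frame in place of the real data; only the column name `isFraud` comes from this notebook, and the name `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the transformed dataset: 8 legitimate rows, 2 fraud rows.
toy = pd.DataFrame({"isFraud": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Exactly two distinct label values means a binary classification task.
n_classes = toy["isFraud"].nunique()

# Share of each class; a very small share for class 1 signals class imbalance.
class_share = toy["isFraud"].value_counts(normalize=True).sort_index()

print("Number of classes:", n_classes)
print(class_share)
```

Running the same two calls on the real `df["isFraud"]` column would show how rare the fraud class is before any model is chosen.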

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

List your models here:

- **Logistic Regression**
    - Simple, interpretable baseline model.
    - Performs well when the classes are close to linearly separable.
    - Works with class imbalance using `class_weight='balanced'`.
- **Random Forest Classifier**
    - Handles non-linear relationships well.
    - Robust to outliers and noisy features.
    - Averages many trees, which reduces variance compared with a single decision tree.
    - Also supports `class_weight='balanced'`.
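Both models rely on `class_weight='balanced'`, so it may help to see what that option computes. The sketch below applies scikit-learn's documented heuristic, `n_samples / (n_classes * np.bincount(y))`, to a made-up label vector (`y_toy` is hypothetical) and checks it against the `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 95 legitimate (0) and 5 fraud (1) transactions.
y_toy = np.array([0] * 95 + [1] * 5)
classes = np.array([0, 1])

# The "balanced" heuristic: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

# The same weights computed by scikit-learn's helper
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print("Manual: ", manual)
print("Sklearn:", weights)
```

The rare class receives a much larger weight, so each misclassified fraud row costs more during training than a misclassified legitimate row.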




## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the transformed data
df = pd.read_csv("transformed_data.csv")

# Separate the features from the target label
X = df.drop("isFraud", axis=1)
y = df["isFraud"]

# Hold out 20% for testing; stratify=y keeps the fraud rate equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
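The split above passes `stratify=y`, which matters when the positive class is rare. As a small sketch of the effect, the example below builds made-up data with a 2% positive rate (all `*_toy` names and the sizes are assumptions, not the fraud dataset) and compares the positive rate in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 1,000 rows with a 2% positive rate.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 20 + [0] * 980)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratify=y_toy, both splits keep the overall positive rate.
print("Overall positive rate:", y_toy.mean())
print("Train positive rate:  ", y_tr.mean())
print("Test positive rate:   ", y_te.mean())
```

Without stratification, a random 20% sample of such rare positives could by chance contain very few of them, which makes test metrics for the fraud class unstable.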




### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Define parameter grid
param_grid = {
    'n_estimators': [100],
    'max_depth': [None, 10],
    'class_weight': ['balanced']
}




rf = RandomForestClassifier(random_state=42)


grid_search = GridSearchCV(estimator=rf,
                           param_grid=param_grid,
                           cv=3,
                           scoring='f1',
                           n_jobs=-1,
                           verbose=1)


grid_search.fit(X_train, y_train)


print("Best Parameters Found:")
print(grid_search.best_params_)


best_rf = grid_search.best_estimator_
y_pred_rf_best = best_rf.predict(X_test)


print("\nClassification Report for Best Random Forest:")
print(classification_report(y_test, y_pred_rf_best))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf_best))
print("ROC AUC Score:", roc_auc_score(y_test, y_pred_rf_best))



Fitting 3 folds for each of 2 candidates, totalling 6 fits
Best Parameters Found:
{'class_weight': 'balanced', 'max_depth': None, 'n_estimators': 100}

Classification Report for Best Random Forest:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    199741
           1       0.98      0.72      0.83       259

    accuracy                           1.00    200000
   macro avg       0.99      0.86      0.92    200000
weighted avg       1.00      1.00      1.00    200000

Confusion Matrix:
[[199738      3]
 [    73    186]]
ROC AUC Score: 0.859065849348265
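The ROC AUC printed above was computed from hard 0/1 predictions. In that case the score reduces to the average of the per-class recalls (balanced accuracy), which can be checked against the confusion matrix: (186/259 + 199738/199741) / 2 ≈ 0.8591. ROC AUC is more commonly computed from predicted probabilities, which use the full ranking of test rows. The sketch below illustrates both on synthetic data from `make_classification`; the data, sample sizes, and variable names are assumptions and do not reproduce the fraud results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# The value reported above, reproduced from the confusion-matrix counts alone
tn, fp, fn, tp = 199738, 3, 73, 186
reported_auc = (tp / (tp + fn) + tn / (tn + fp)) / 2
print("Balanced accuracy from the matrix:", round(reported_auc, 6))

# Synthetic, imbalanced stand-in data (not the fraud dataset)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
).fit(X_tr, y_tr)

# ROC AUC from hard labels equals balanced accuracy for a binary target
y_hat = model.predict(X_te)
auc_hard = roc_auc_score(y_te, y_hat)
bal_acc = balanced_accuracy_score(y_te, y_hat)

# ROC AUC from class-1 probabilities ranks every test row by its score
auc_proba = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print("AUC from hard labels:  ", round(auc_hard, 4))
print("Balanced accuracy:     ", round(bal_acc, 4))
print("AUC from probabilities:", round(auc_proba, 4))
```

In this notebook the equivalent call would be `roc_auc_score(y_test, best_rf.predict_proba(X_test)[:, 1])`; its value is not shown here because it has not been run on the fraud data.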


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall, and sensitivity.  
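All of the metrics named above can be derived from the four confusion-matrix counts. Sensitivity is another name for recall, so the sketch below also reports specificity, which is an addition not requested in the prompt. The helper name `summarize` is hypothetical; the counts are read from the Random Forest confusion matrix printed in step 2.

```python
def summarize(tn, fp, fn, tp):
    """Return common binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }

# scikit-learn lays out the binary matrix as [[TN, FP], [FN, TP]]
rf_metrics = summarize(tn=199738, fp=3, fn=73, tp=186)

for name, value in rf_metrics.items():
    print(f"{name:>11}: {value:.4f}")
```

Accuracy is close to 1.0 mainly because legitimate transactions dominate the test set, so precision and recall on the fraud class are the more informative numbers here.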

In [11]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score


lr = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)


print("\nLogistic Regression Results:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))
print("ROC AUC Score:", roc_auc_score(y_test, y_pred_lr))



Logistic Regression Results:
Confusion Matrix:
[[189870   9871]
 [     2    257]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.95      0.97    199741
           1       0.03      0.99      0.05       259

    accuracy                           0.95    200000
   macro avg       0.51      0.97      0.51    200000
weighted avg       1.00      0.95      0.97    200000

ROC AUC Score: 0.9714294973380488
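At the default 0.5 cut-off, Logistic Regression catches 257 of 259 fraud cases but flags 9,871 legitimate transactions, which gives a fraud precision of about 0.03. The Random Forest catches 186 cases with only 3 false alarms. One common way to trade recall for precision is to raise the decision threshold applied to `predict_proba`. The sketch below shows the idea on synthetic data; the data, the threshold values, and the variable names are illustrative assumptions, not results from the fraud dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the fraud dataset)
X_toy, y_toy = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Try a few illustrative cut-offs; 0.5 corresponds to the default predict()
results = []
for threshold in (0.5, 0.7, 0.9):
    y_hat = (proba >= threshold).astype(int)
    precision = precision_score(y_te, y_hat, zero_division=0)
    recall = recall_score(y_te, y_hat, zero_division=0)
    results.append((threshold, precision, recall))
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

Recall can only stay the same or fall as the threshold rises, so the right cut-off depends on how costly a missed fraud is relative to a false alarm.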


### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.