# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you've created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

- This is a classification task because the goal is to predict whether a transaction is fraudulent or not.

Are you predicting for multiple classes or binary classes?  

- Binary, because our goal is to assign each transaction to one of two categories: Fraudulent or Not Fraudulent.

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

- Logistic Regression: a simple, fast, and interpretable baseline for binary classification.
- Random Forest Classifier: an ensemble of decision trees that handles nonlinear relationships well.
- XGBoost (Extreme Gradient Boosting): an efficient gradient-boosting implementation that often performs well on tabular data.
- Use all of the above with stratified cross-validation and SMOTE. Fraud cases are rare, so stratification keeps the class ratio the same in every fold, and SMOTE oversamples the minority class in the training data.
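To make the stratification point concrete, here is a minimal sketch on toy labels (the 2% fraud rate and the variable names are assumptions for illustration, not values from the transactions dataset). It checks how many positive cases land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 1,000 transactions, 2% fraudulent (assumed ratio).
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_fraud_counts = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    # Count the fraud cases that ended up in this validation fold.
    fold_fraud_counts.append(int(y_toy[val_idx].sum()))

print(fold_fraud_counts)  # 20 fraud cases spread evenly over 5 folds
```

Without stratification, a fold could by chance contain very few fraud cases, which makes fold-level F1 scores noisy.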

## First Model

Using the first model that you've chosen, implement the following steps.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [2]:
transactions = pd.read_csv("../data/bank_transactions.csv")
cleaned_transactions = transactions.drop(columns=['nameOrig', 'nameDest'])
cleaned_transactions

| Unnamed: 0 | type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | PAYMENT | 983.09 | 36730.24 | 35747.15 | 0.00 | 0.00 | 0 | 0 |
| 1 | PAYMENT | 55215.25 | 99414.00 | 44198.75 | 0.00 | 0.00 | 0 | 0 |
| 2 | CASH_IN | 220986.01 | 7773074.97 | 7994060.98 | 924031.48 | 703045.48 | 0 | 0 |
| 3 | TRANSFER | 2357394.75 | 0.00 | 0.00 | 4202580.45 | 6559975.19 | 0 | 0 |
| 4 | CASH_OUT | 67990.14 | 0.00 | 0.00 | 625317.04 | 693307.19 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999995 | PAYMENT | 13606.07 | 114122.11 | 100516.04 | 0.00 | 0.00 | 0 | 0 |
| 999996 | PAYMENT | 9139.61 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0 |
| 999997 | CASH_OUT | 153650.41 | 50677.00 | 0.00 | 0.00 | 380368.36 | 0 | 0 |
| 999998 | CASH_OUT | 163810.52 | 0.00 | 0.00 | 357850.15 | 521660.67 | 0 | 0 |
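The `Unnamed: 0` column in the output above looks like a row index that was written to the CSV by an earlier `to_csv` call (an assumption; the notebook does not show how the file was saved). If it stays in the feature matrix, a model can pick up on row position rather than transaction behaviour. A minimal sketch of two ways to handle it, using a hypothetical two-row CSV rather than the real file:

```python
import io
import pandas as pd

# Hypothetical CSV that mimics a file saved with the default index=True.
csv_text = ",type,amount,isFraud\n0,PAYMENT,983.09,0\n1,TRANSFER,2357394.75,0\n"

# Reading it naively reproduces the stray "Unnamed: 0" column.
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: treat the first column as the index while reading.
by_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the column after reading.
dropped = raw.drop(columns=["Unnamed: 0"])

print(list(raw.columns))
print(list(dropped.columns))
```

Whether to drop the column in this notebook is a judgment call; doing so would change the feature set and therefore the metrics reported further down.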


In [5]:
# Separate the features from the target
X = cleaned_transactions.drop(columns='isFraud')
y = cleaned_transactions['isFraud']

# One-hot encode the 'type' column
X = pd.get_dummies(X, columns=['type'], drop_first=True)

# Train-test split with stratification and a fixed random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y  # preserves class distribution
)

In [6]:
# One-hot encode the 'type' column
transactions_encoded = pd.get_dummies(cleaned_transactions, columns=['type'], drop_first=True)

# Define features and target
X = transactions_encoded.drop('isFraud', axis=1)
y = transactions_encoded['isFraud']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [13]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

In [16]:
# Sample 10% of the training data for quicker tuning
X_sample = X_train.sample(frac=0.1, random_state=42)
y_sample = y_train.loc[X_sample.index]


In [18]:
# Define the model
rf = RandomForestClassifier(random_state=42)

# Define the hyperparameter distributions
param_dist = {
    'n_estimators': randint(100, 300),
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2']
}

# Randomized Search
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=5,             # Try 5 random combinations
    cv=2,                  # Use 2 folds to make it faster
    scoring='f1',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

random_search.fit(X_sample, y_sample)

print("Best Hyperparameters:", random_search.best_params_)
best_rf = random_search.best_estimator_

Fitting 2 folds for each of 5 candidates, totalling 10 fits
Best Hyperparameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 174}
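The tuning subset above is drawn with a plain random sample, so the number of fraud cases it contains varies from seed to seed. One alternative is a stratified subsample, sketched below with `train_test_split`. The data here is a hypothetical stand-in (`X_demo`, `y_demo`, and the 1% positive rate are assumptions), not the notebook's `X_train` and `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for X_train / y_train: 10,000 rows, 1% positives.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"amount": rng.uniform(0, 1e5, size=10_000)})
y_demo = pd.Series([1] * 100 + [0] * 9_900, name="isFraud")

# Keep 10% of the rows; stratify=y_demo preserves the 1% positive rate.
X_sub, _, y_sub, _ = train_test_split(
    X_demo, y_demo, train_size=0.1, stratify=y_demo, random_state=42
)

print(len(X_sub), int(y_sub.sum()))
```

The subsample keeps the same rows in `X_sub` and `y_sub`, so it can be passed straight to `RandomizedSearchCV.fit`.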


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall, and sensitivity.  

In [19]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Initialize model with best hyperparameters
best_rf = RandomForestClassifier(
    max_depth=None,
    max_features='sqrt',
    min_samples_leaf=1,
    min_samples_split=5,
    n_estimators=174,
    random_state=42
)

# Train the model on the full training data
best_rf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_rf.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")

# Optional: Detailed classification report and confusion matrix
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy:  0.9996
Precision: 0.9720
Recall:    0.7147
F1 Score:  0.8237

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299611
           1       0.97      0.71      0.82       389

    accuracy                           1.00    300000
   macro avg       0.99      0.86      0.91    300000
weighted avg       1.00      1.00      1.00    300000

Confusion Matrix:
 [[299603      8]
 [   111    278]]
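As a worked check, the reported metrics can be recomputed by hand from the confusion matrix above, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. The sketch also adds specificity and the accuracy of a trivial "always non-fraud" baseline; neither is printed by the notebook, and both are included here only for comparison.

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 299603, 8, 111, 278
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that labels every transaction as non-fraud.
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"specificity={specificity:.5f} f1={f1:.4f} baseline={baseline_accuracy:.4f}")
```

The baseline reaches about 0.9987 accuracy without detecting a single fraud case, which is why precision, recall, and F1 say more about this model than accuracy does.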


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [26]:
from imblearn.over_sampling import SMOTE

# Sample a subset of data for speed
sample_data = cleaned_transactions.sample(n=50000, random_state=42)

# Encode 'type' column using one-hot encoding
sample_data_encoded = pd.get_dummies(sample_data, columns=['type'], drop_first=True)

# Split features and target
X = sample_data_encoded.drop('isFraud', axis=1)
y = sample_data_encoded['isFraud']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

# Resample with SMOTE (after encoding!)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [27]:
# Define logistic regression model
log_reg = LogisticRegression(solver='liblinear', random_state=42)

# Define hyperparameter grid
param_dist = {
    'C': np.logspace(-3, 2, 10),  # e.g. 0.001 to 100
    'penalty': ['l1', 'l2']
}

# Setup random search
random_search = RandomizedSearchCV(
    estimator=log_reg,
    param_distributions=param_dist,
    n_iter=10,
    scoring='f1',
    cv=5,
    random_state=42,
    n_jobs=-1,
    verbose=1
)

# Fit on resampled training data
random_search.fit(X_train_resampled, y_train_resampled)

# Output best model
best_logreg = random_search.best_estimator_
print("Best Hyperparameters:", random_search.best_params_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best Hyperparameters: {'penalty': 'l1', 'C': np.float64(0.001)}




In [28]:
# 1. Train the model with the best hyperparameters
best_logreg = LogisticRegression(
    solver='liblinear',   # 'liblinear' supports 'l1' penalty
    penalty='l1',
    C=0.001,
    random_state=42
)

best_logreg.fit(X_train_resampled, y_train_resampled)

# 2. Predict on the test set
y_pred = best_logreg.predict(X_test)

# 3. Evaluate the model
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 Score: ", f1_score(y_test, y_pred))

print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy:  0.9802
Precision: 0.05732484076433121
Recall:    0.9473684210526315
F1 Score:  0.10810810810810811

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99     14981
           1       0.06      0.95      0.11        19

    accuracy                           0.98     15000
   macro avg       0.53      0.96      0.55     15000
weighted avg       1.00      0.98      0.99     15000

Confusion Matrix:
 [[14685   296]
 [    1    18]]
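The logistic regression results above show high recall but very low precision. `predict` applies a 0.5 probability cutoff, and raising that cutoff is one common way to trade recall for precision. The sketch below illustrates the idea on synthetic data from `make_classification`, with `class_weight="balanced"` standing in for SMOTE so that it runs with scikit-learn alone. Both choices are assumptions for illustration, and the numbers it prints say nothing about the transactions dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); not the transactions dataset.
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=8, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=42
)

# class_weight="balanced" stands in for SMOTE here (an assumption).
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of the positive class

results = []
for threshold in (0.5, 0.7, 0.9):
    # Flag a sample as positive only if its probability clears the threshold.
    y_hat = (proba >= threshold).astype(int)
    results.append((
        threshold,
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    ))

for threshold, p, r in results:
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f}")
```

Recall can only stay the same or fall as the threshold rises, because fewer samples are flagged as positive. Precision usually rises but is not guaranteed to.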




## (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.