# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in the previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

Classification.

Fraud = 1, Not Fraud = 0

Are you predicting for multiple classes or binary classes?  

Binary: each transaction is either fraud or not fraud.

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

Random Forest

Decision Tree

In [2]:
import pandas as pd

# Load the dataset produced by the earlier data transformation step
transactions = pd.read_csv("../data/bank_transactions_final.csv")

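Before committing to models and metrics, it can help to confirm how imbalanced the label is. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real dataset); with the real data the same call would presumably be applied to `transactions['isFraud']`.

```python
import pandas as pd

# Hypothetical toy labels standing in for transactions['isFraud']
toy = pd.DataFrame({"isFraud": [0] * 98 + [1] * 2})

# Share of each class; normalize=True returns proportions instead of counts
class_share = toy["isFraud"].value_counts(normalize=True)
print(class_share)
```

When one class is this rare, accuracy alone says little, which is why precision and recall are reported later in the notebook.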

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [None]:
from sklearn.model_selection import train_test_split

# Select relevant numeric features (including encoded 'type_mapped')
features = ['type_mapped', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']

X = transactions[features]
y = transactions['isFraud']

# Split data: 70% train, 30% test; stratify=y keeps the fraud rate the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")


Training set size: 700000 samples
Testing set size: 300000 samples

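The `stratify=y` argument in the split above matters because fraud cases are rare: without it, a random split could leave the test set with noticeably more or fewer positives than the training set. A minimal sketch on synthetic data (`X_toy` and `y_toy` are hypothetical names, not part of the project data) showing that the positive rate is preserved:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 1,000 rows, 1% positives
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 990)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the same positive rate as the full data
print(f"train positive rate: {y_tr.mean():.4f}")
print(f"test positive rate:  {y_te.mean():.4f}")
```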

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Initialize the model
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

# Define hyperparameter distributions
param_distributions = {
    'n_estimators': randint(30, 80),
    'max_depth': [5, 10],
    'min_samples_split': randint(2, 5),
    'min_samples_leaf': randint(1, 3),
    'max_features': ['sqrt']
}

# Sample 30% of training data (features and labels aligned)
sample_frac = 0.3
sample_indices = X_train.sample(frac=sample_frac, random_state=42).index
X_sample = X_train.loc[sample_indices]
y_sample = y_train.loc[sample_indices]

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_distributions,
    n_iter=10,          # 10 random combinations
    cv=2,               # 2-fold cross-validation
    scoring='roc_auc',  # Threshold-independent ranking metric; with heavy imbalance, 'average_precision' is another option
    verbose=2,
    n_jobs=-1,
    random_state=42
)

# Fit the model on the sampled training data
random_search.fit(X_sample, y_sample)

# Output best hyperparameters
print("Best hyperparameters:", random_search.best_params_)

# Keep the best estimator (refit on the 30% sample by default); it is retrained on the full training set in step 3
best_rf = random_search.best_estimator_



Fitting 2 folds for each of 10 candidates, totalling 20 fits
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=44; total time=   1.8s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=44; total time=   1.9s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=   2.1s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=52; total time=   2.1s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=52; total time=   2.3s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=   2.3s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=65; total time=   2.7s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=65; total time=   

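Beyond `best_params_`, the fitted search object also exposes `cv_results_`, which shows how close the sampled candidates were to each other. The sketch below is an illustration on synthetic data (`X_syn`, `y_syn`, and the smaller search space are assumptions, not the notebook's setup), and it swaps in `scoring="average_precision"` as one alternative that is often more sensitive to the rare class than ROC AUC.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic, imbalanced data (about 5% positives)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=6, weights=[0.95, 0.05], random_state=42
)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(10, 30),
        "max_depth": [5, 10],
    },
    n_iter=4,
    cv=2,
    scoring="average_precision",  # assumption: swapped in for 'roc_auc'
    random_state=42,
)
search.fit(X_syn, y_syn)

# Rank the sampled candidates by mean cross-validated score
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

If the top candidates score almost identically, the cheaper configuration (fewer trees, shallower depth) is usually a reasonable pick.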
### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Train best model on the full training data
best_rf.fit(X_train, y_train)

# Generate predictions on the test set
y_pred = best_rf.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)  # Recall is sensitivity here
f1 = f1_score(y_test, y_pred)
conf_mat = confusion_matrix(y_test, y_pred)

# Print results
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall (Sensitivity):    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")

print("\nConfusion Matrix:")
print(conf_mat)


Accuracy:  0.9996
Precision: 0.9963
Recall (Sensitivity):    0.6838
F1 Score:  0.8110

Confusion Matrix:
[[299610      1]
 [   123    266]]

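As a check, the reported metrics can be reproduced by hand from the confusion matrix above. Reading it in scikit-learn's layout for labels 0 and 1, `[[TN, FP], [FN, TP]]`, gives TN = 299610, FP = 1, FN = 123, and TP = 266:

```latex
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP} = \frac{266}{267} \approx 0.9963 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{266}{389} \approx 0.6838 \\
F_1 &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{532}{656} \approx 0.8110 \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{299876}{300000} \approx 0.9996
\end{aligned}
```

The model raises only one false alarm but misses 123 of the 389 fraud cases in the test set (about 32%), which the near-perfect accuracy hides.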

## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Initialize Decision Tree with default hyperparameters (no search was run for this model)
dt = DecisionTreeClassifier(random_state=42)

# Train on full training data
dt.fit(X_train, y_train)

# Predict on test data
y_pred_dt = dt.predict(X_test)

# Evaluate performance (the _dt suffix keeps the Random Forest metrics from being overwritten)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)
conf_mat_dt = confusion_matrix(y_test, y_pred_dt)

print("Decision Tree Classifier performance:")
print(f"Accuracy:  {accuracy_dt:.4f}")
print(f"Precision: {precision_dt:.4f}")
print(f"Recall:    {recall_dt:.4f}")
print(f"F1 Score:  {f1_dt:.4f}")
print("\nConfusion Matrix:")
print(conf_mat_dt)


Decision Tree Classifier performance:
Accuracy:  0.9995
Precision: 0.8466
Recall:    0.7943
F1 Score:  0.8196

Confusion Matrix:
[[299555     56]
 [    80    309]]

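To compare the two models side by side, the metrics can be recomputed from the two confusion matrices printed in this notebook. The sketch below copies those matrices by hand (the names `matrices` and `comparison` are illustrative) and assumes scikit-learn's `[[TN, FP], [FN, TP]]` layout.

```python
import pandas as pd

# Confusion matrices copied from the outputs above, laid out as [[TN, FP], [FN, TP]]
matrices = {
    "Random Forest": [[299610, 1], [123, 266]],
    "Decision Tree": [[299555, 56], [80, 309]],
}

rows = []
for name, ((tn, fp), (fn, tp)) in matrices.items():
    rows.append({
        "model": name,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "missed_fraud": fn,
        "false_alarms": fp,
    })

comparison = pd.DataFrame(rows).set_index("model").round(4)
print(comparison)
```

On these numbers the decision tree catches more fraud (309 vs. 266 of 389 cases) at the cost of more false alarms (56 vs. 1), while F1 is nearly tied (0.8196 vs. 0.8110). Which trade-off counts as handling the imbalance better depends on the relative cost of a missed fraud versus a false alarm.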

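The decision tree above was trained with default hyperparameters, so step (2) was not repeated for it. One way it might be done is sketched below with `GridSearchCV`; the synthetic data, the grid values, and the choice of `scoring="f1"` are all assumptions for illustration and were not run on the project dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic, imbalanced data standing in for X_train / y_train
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=6, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

# Assumed grid; the values are illustrative, not tuned for the real dataset
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
    "class_weight": [None, "balanced"],
}

dt_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring="f1",  # assumption: F1 chosen to balance precision and recall
)
dt_search.fit(X_tr, y_tr)

print("Best hyperparameters:", dt_search.best_params_)
best_dt = dt_search.best_estimator_  # refit on all of X_tr with the best settings
```

Limiting `max_depth` or raising `min_samples_leaf` tends to reduce overfitting in a single tree, and `class_weight="balanced"` is one built-in option for weighting the rare class more heavily.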
## (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.