# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

Answer here

Are you predicting for multiple classes or binary classes?  

Answer here

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

List your models here

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [1]:
from sklearn.ensemble import RandomForestRegressor

# regressor models
from sklearn.linear_model import Lasso, Ridge
from sklearn.linear_model import LinearRegression

# accuracy metrics
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV

import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

In [3]:
new_transactions = pd.read_csv("bank_transaction_data.csv")
new_transactions.head()

Unnamed: 0.1,Unnamed: 0,type,amount,isFraud,isFlaggedFraud
0,0,PAYMENT,983.09,0,0
1,1,PAYMENT,55215.25,0,0
2,2,CASH_IN,220986.01,0,0
3,3,TRANSFER,2357394.75,0,0
4,4,CASH_OUT,67990.14,0,0


In [5]:
new_transactions2 = pd.read_csv("bank_transactions.csv")
new_transactions2.head()

Unnamed: 0.1,Unnamed: 0,amount,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,0,983.09,0,0,False,False,False,True,False
1,1,55215.25,0,0,False,False,False,True,False
2,2,220986.01,0,0,True,False,False,False,False
3,3,2357394.75,0,0,False,False,False,False,True
4,4,67990.14,0,0,False,True,False,False,False


In [18]:
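# Step 1: Create train/test splits for fraud detection.
# Drop the label and every leftover index column ("Unnamed: 0", "Unnamed: 0.1")
# so that only the 7 real features remain, matching the shapes printed below.
index_cols = [c for c in new_transactions2.columns if c.startswith("Unnamed")]
X = new_transactions2.drop(columns=["isFraud"] + index_cols)
y = new_transactions2["isFraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)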

In [19]:
print("Training features shape:", X_train.shape)
print("Test features shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Test labels shape:", y_test.shape)

Training features shape: (800000, 7)
Test features shape: (200000, 7)
Training labels shape: (800000,)
Test labels shape: (200000,)
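Because fraud labels are rare, a purely random split can leave the training and test sets with slightly different fraud rates. The sketch below shows how the `stratify` argument of `train_test_split` keeps the positive rate the same in both splits. The toy frame, its size, and its 1% fraud rate are assumptions for illustration, not the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transactions table (assumed, not the real data):
# 10,000 rows, of which 100 are labelled as fraud.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "amount": rng.exponential(scale=50_000, size=10_000),
    "isFraud": np.r_[np.ones(100, dtype=int), np.zeros(9_900, dtype=int)],
})

X_toy = toy.drop(columns=["isFraud"])
y_toy = toy["isFraud"]

# stratify=y_toy preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Train fraud rate:", y_tr.mean())
print("Test fraud rate:", y_te.mean())
```

Adding `stratify=y` to the split in this notebook would change which rows land in each set, so the metrics reported further down would shift slightly.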


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [42]:
## Step 2 Hyperparameter Tuning

param_dist = {
    "criterion": ["squared_error", "absolute_error", "friedman_mse", "poisson"],
    "max_depth": range(5, 50, 1),
    "max_features": ["sqrt", "log2"]
}
rf = RandomForestRegressor()


In [45]:

X_sample = X_train.sample(n=1000, random_state=42)
y_sample = y_train.loc[X_sample.index]

# Search the `param_dist` space defined above on the small sample to keep tuning fast.
random_search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=3, cv=2, random_state=42, n_jobs=-1)
random_search.fit(X_sample, y_sample)
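Once a search has been fitted, its `best_params_`, `best_score_`, and `cv_results_` attributes show which combination won and by how much. The following is a minimal sketch on synthetic regression data; the data, the reduced search space, and the `demo_` names are assumptions chosen so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the sampled training set (assumed).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=0.1, random_state=42
)

# A deliberately small search space so the example finishes quickly.
demo_dist = {
    "max_depth": [2, 3, 4, 5, 6, 7],
    "max_features": ["sqrt", "log2"],
}

demo_search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_distributions=demo_dist,
    n_iter=3,          # number of random combinations to try
    cv=2,              # folds per combination
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

print("Best parameters:", demo_search.best_params_)
# With no `scoring` argument, a regressor is scored with R^2.
print("Best mean CV score:", demo_search.best_score_)

# cv_results_ holds one row per sampled combination.
results = pd.DataFrame(demo_search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])
```

With `n_iter=3` and `cv=2`, only three combinations are evaluated on two folds each, so the "best" parameters are a rough guide rather than a thorough optimum.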


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).

In [46]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

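best_rf = random_search.best_estimator_
best_rf.fit(X_train, y_train)
# The regressor returns continuous scores between 0 and 1;
# rounding to the nearest integer converts them to 0/1 labels.
y_pred = np.round(best_rf.predict(X_test)).astype(int)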

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall (Sensitivity):", recall)
print("Confusion Matrix:\n", cm)


Accuracy: 0.99891
Precision: 0.9148936170212766
Recall (Sensitivity): 0.16731517509727625
Confusion Matrix:
 [[199739      4]
 [   214     43]]
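The reported metrics can be recovered by hand from the confusion matrix, which helps show why accuracy is so high while recall is so low. The sketch below uses the four counts printed above, read in scikit-learn's layout (rows are true labels, columns are predicted labels).

```python
# Counts taken from the confusion matrix printed above.
tn, fp = 199739, 4     # true label 0: predicted 0, predicted 1
fn, tp = 214, 43       # true label 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total     # share of all predictions that are correct
precision = tp / (tp + fp)       # of the transactions flagged, how many were fraud
recall = tp / (tp + fn)          # of the actual fraud cases, how many were caught

print(f"Accuracy:  {accuracy:.5f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
```

Only 43 of the 257 fraud cases in the test set are caught, so the high accuracy mostly reflects the 199,739 correctly labelled non-fraud transactions.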


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [48]:
from sklearn.ensemble import RandomForestClassifier

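# Hyperparameter tuning for RandomForestClassifier
# NOTE: `params` is not defined in the cells shown. The search space below is an
# assumed example, not necessarily the one that produced the results reported here.
params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20, None],
    "max_features": ["sqrt", "log2"],
}
clf = RandomForestClassifier(n_jobs=-1, random_state=42)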
quick_search = RandomizedSearchCV(
    clf,
    param_distributions=params,
    n_iter=5,
    cv=3,
    scoring='f1',
    random_state=42,
    n_jobs=-1
)

X_sample_clf = X_train.sample(n=1000, random_state=42)
y_sample_clf = y_train.loc[X_sample_clf.index]

quick_search.fit(X_sample_clf, y_sample_clf)




In [None]:
# Fit on a smaller training sample to speed things up; predictions still use the full test set
sample_idx = X_train.sample(n=10000, random_state=42).index
X_train_small = X_train.loc[sample_idx]
y_train_small = y_train.loc[sample_idx]

best_clf = quick_search.best_estimator_
best_clf.fit(X_train_small, y_train_small)
y_pred_clf = best_clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_clf))
print("Precision:", precision_score(y_test, y_pred_clf, zero_division=0))
print("Recall (Sensitivity):", recall_score(y_test, y_pred_clf, zero_division=0))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_clf))


Accuracy: 0.998225
Precision: 0.20481927710843373
Recall (Sensitivity): 0.13229571984435798
Confusion Matrix:
 [[199611    132]
 [   223     34]]
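With classes this imbalanced, accuracy alone says little about which model handles fraud better. The sketch below recomputes the metrics from the two confusion matrices printed in this notebook, adds the F1 score, and includes a trivial baseline that labels every test transaction as not fraud. The helper name `summarize` is hypothetical and used only for this illustration.

```python
def summarize(tn, fp, fn, tp):
    """Return accuracy, precision, recall, and F1 for one confusion matrix."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

models = {
    "first model (regressor, rounded)": summarize(199739, 4, 214, 43),
    "second model (classifier)": summarize(199611, 132, 223, 34),
    # Baseline: predict 0 for all 200,000 test rows (257 of them are fraud).
    "always predict 0": summarize(199743, 0, 257, 0),
}

for name, m in models.items():
    print(
        f"{name:34s} acc={m['accuracy']:.6f} "
        f"prec={m['precision']:.4f} rec={m['recall']:.4f} f1={m['f1']:.4f}"
    )
```

On these numbers the second model's accuracy is lower than the trivial baseline's, and the first model has the higher F1 of the two, which is why precision, recall, and F1 are more informative than accuracy for this comparison.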


### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.
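One possible candidate for a third model is a class-weighted logistic regression, sketched below on synthetic data. Everything here is an assumption for illustration: the data comes from `make_classification` rather than the transactions file, the 3% positive rate is arbitrary, and the hyperparameter search from step (2) is omitted to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumed): roughly 3% positives, 7 features.
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=7, n_informative=4,
    weights=[0.97, 0.03], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# class_weight="balanced" reweights samples inversely to class frequency,
# so errors on the rare class cost more during training.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

cm_syn = confusion_matrix(y_te, y_hat)
prec_syn = precision_score(y_te, y_hat, zero_division=0)
rec_syn = recall_score(y_te, y_hat, zero_division=0)

print("Precision:", prec_syn)
print("Recall (Sensitivity):", rec_syn)
print("Confusion Matrix:\n", cm_syn)
```

Class weighting typically trades precision for recall, so the result is best judged against the first two models on the same test set and the same metrics.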