# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you've created in your previous data transformation step (not the original model).

## Questions
Is this a classification or regression task?  

This is a classification task.

Are you predicting for multiple classes or binary classes?  

I am predicting for multiple classes.

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

AdaBoost would be better suited due to its high perecent.

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [34]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

from sklearn.metrics import classification_report 


transactions = pd.read_csv("../data/bank_transactions.csv")

sample_df = transactions.sample(n=5000)


X = sample_df[["oldbalanceDest", "amount"]]
y = sample_df["isFraud"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [25]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [31]:
param_dist = {
    "criterion": ["squared_error", "absolute_error", "friedman_mse", "poisson", "entropy", "gini", "log_loss"],
    "max_depth": range(5,50,1),
    "max_features": ["sqrt", "log2"]
}

rf= RandomForestClassifier()
random_search = RandomizedSearchCV(rf, param_distributions=param_dist, cv=5, random_state=42)


### 3) Train your model
Select the model with best hyperparameters and generate predictions on your test set. Evaluate your models accuracy, precision, recall, and sensitivity.  

In [35]:
rf = RandomForestClassifier()
random_search = RandomizedSearchCV(rf, param_distributions=param_dist, cv=5, random_state=42)
# TODO: fit this model on your training data
random_search.fit(X_train, y_train)

20 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\oamae135\miniconda3\envs\ds\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\oamae135\miniconda3\envs\ds\Lib\site-packages\sklearn\base.py", line 1382, in wrapper
    estimator._validate_params()
  File "c:\Users\oamae135\miniconda3\envs\ds\Lib\site-packages\sklearn\base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\oamae135\miniconda3\envs\ds\Lib\site-packages\sklearn\utils\_param_validation.py", line 98, in validate_

## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [28]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier


trans = DecisionTreeClassifier(max_depth=1)

# initialize the AdaBoostClassifier with 50 weak learners and a fixed random state
ada = AdaBoostClassifier(estimator=trans, random_state=42)

ada.fit(X_train, y_train)

In [29]:
from sklearn.metrics import confusion_matrix, classification_report


yhat = ada.predict(X_test) 

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[499   0]
 [  1   0]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       499
           1       0.00      0.00      0.00         1

    accuracy                           1.00       500
   macro avg       0.50      0.50      0.50       500
weighted avg       1.00      1.00      1.00       500



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.