Starting from the code you created last lesson, add a step to the Optuna objective that pre-processes the data with a scaler.

1. Think about where the pre-processing needs to happen.
2. Make sure the pre-processing is done properly in every trial, not just once.
3. Use the correct Optuna function to suggest which scaler to pick, then pre-process the data based on that selection.
4. Use 3 numerical scalers of your choice from scikit-learn (`MinMaxScaler`, etc.).
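
One way to address points 1 and 2 is to put the scaler into a scikit-learn pipeline, so that it is re-fitted on the training folds of every cross-validation split instead of once on the whole dataset. The sketch below is only an illustration under assumptions: the synthetic data from `make_classification`, the helper name `score_with_scaler`, and the fixed `DecisionTreeClassifier` are placeholders, and inside an Optuna objective the `scaler_name` argument would come from `trial.suggest_categorical`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Map the names Optuna would suggest to the scaler classes (illustrative choice).
SCALERS = {
    "StandardScaler": StandardScaler,
    "MinMaxScaler": MinMaxScaler,
    "RobustScaler": RobustScaler,
}


def score_with_scaler(scaler_name, X, y):
    """Cross-validated F1 for one scaler; hypothetical helper for this sketch."""
    # A fresh scaler instance is created on every call, i.e. once per trial.
    scaler = SCALERS[scaler_name]()
    # The pipeline fits the scaler on the training part of each fold only.
    model = make_pipeline(scaler, DecisionTreeClassifier(max_depth=3, random_state=0))
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return cross_val_score(model, X, y, cv=cv, scoring="f1").mean()


# Synthetic stand-in data, not the homework dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
scores = {name: score_with_scaler(name, X_demo, y_demo) for name in SCALERS}
print(scores)
```

Compared with calling `scaler.fit(X)` on the full data before cross-validation, this keeps information from the validation folds out of the scaling step.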


Add `AdaBoostClassifier` to the set of classifiers Optuna chooses from.
1. When AdaBoost is picked, Optuna should suggest `learning_rate` in the range from `1e-10` to `1e10`.
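
A range spanning twenty orders of magnitude behaves very differently under uniform and log-uniform sampling. The sketch below uses only the standard library to show the difference; the sample size and seed are arbitrary choices. In Optuna, passing `log=True` to `trial.suggest_float` requests sampling in the log domain, which is what the second list imitates.

```python
import random

rng = random.Random(0)  # fixed seed, arbitrary
LOW, HIGH = 1e-10, 1e10
N = 10_000

# Uniform sampling: almost every draw lands near the top of the range.
uniform_samples = [rng.uniform(LOW, HIGH) for _ in range(N)]

# Log-uniform sampling: draw the exponent uniformly, then exponentiate.
log_samples = [10 ** rng.uniform(-10, 10) for _ in range(N)]

frac_uniform_below_one = sum(v < 1 for v in uniform_samples) / N
frac_log_below_one = sum(v < 1 for v in log_samples) / N

print("uniform, share below 1:", frac_uniform_below_one)
print("log-uniform, share below 1:", frac_log_below_one)
```

With uniform sampling, small learning rates are effectively never tried; with log-uniform sampling, roughly half of the draws fall below 1.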

Change the metric from `accuracy` to `f1_score`. Accuracy only counts correct predictions (0 predicted as 0 and 1 predicted as 1); the F1 score also accounts for false negatives (the model predicts 0 instead of 1) and false positives (the model predicts 1 instead of 0).

An F1 of 1 means the model is performing great; an F1 of 0 means it is performing badly. Make sure Optuna is set to optimize this score in the correct direction!
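
To make the difference between the two metrics concrete, here is a small worked example in plain Python. The labels and predictions are made up for illustration; the formulas are the standard definitions of precision, recall, and F1 for the positive class.

```python
# Made-up labels and predictions for ten samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the accuracy is 0.7 while the F1 score is only 4/7 (about 0.571), because half of the positive samples were missed. Since a higher F1 is better, the study should maximize it.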

---
- Once everything above is completed and works (you can run the code, and Optuna can complete 100 trials and report the best score achieved), move on to the dataset below.

---

Load the dataset `/work/data/homework 21/data_banknote_authentication.csv`. It has a binary target class (0/1) indicating whether a banknote is fake or not; the first 4 columns are features derived from a photo of the banknote.
Set up the working code from above and run it on this dataset (you only need to change the data; everything else should remain the same).
Good luck hunting all the errors!
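
The loading code in this notebook passes `names=` to `pd.read_csv` for this file, which suggests the CSV has no header row. The sketch below shows what that argument does, using a few made-up rows in an in-memory buffer instead of the real file; the values are placeholders and the column names simply mirror those used later in the notebook.

```python
import io

import pandas as pd

# Made-up rows standing in for a headerless CSV (not the real data).
csv_text = "3.6,8.7,-2.8,-0.4,0\n4.5,8.2,-2.5,-1.5,0\n-1.4,3.3,-1.4,-2.0,1\n"
columns = ["first", "second", "third", "fourth", "final"]

# Without names=, pandas treats the first data row as the header and drops it.
df_wrong = pd.read_csv(io.StringIO(csv_text))

# With names=, every row is kept and the columns get explicit labels.
df = pd.read_csv(io.StringIO(csv_text), names=columns)

X = df.drop(columns=["final"])  # the four feature columns
y = df["final"]                 # the binary target

print(len(df_wrong), len(df))
print(X.shape, y.tolist())
```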

In [None]:
!pip install optuna


In [None]:
import optuna
import pandas as pd

import sklearn.ensemble
import sklearn.svm
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    Normalizer,
    PowerTransformer,
    RobustScaler,
    StandardScaler,
)

In [None]:
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.model_selection import cross_val_score


In [None]:


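def load_data(data_source):
    """Return the features X and the target y for the requested dataset."""
    if data_source == "diabetes":
        df = pd.read_csv("/work/data/homework 21/diabetes.csv")
        df.columns = df.columns.str.lower()
        X = df.drop(["outcome"], axis=1)
        y = df["outcome"]
        return X, y
    else:
        # Any other value loads the banknote dataset; column names are passed
        # explicitly via names=, with "final" as the target column.
        df = pd.read_csv(
            "/work/data/homework 21/data_banknote_authentication.csv",
            names=["first", "second", "third", "fourth", "final"],
        )
        X = df.drop(["final"], axis=1)
        y = df["final"]
        return X, y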

def objective(trial):
    # Let Optuna pick the classifier, then suggest hyperparameters for it.
    classifier_name = trial.suggest_categorical(
        "classifier",
        ["SVC", "RandomForest", "DecisionTreeClassifier", "AdaBoostClassifier"],
    )
    if classifier_name == "SVC":
        svc_c = trial.suggest_float("svc_c", 1e-5, 1e5, log=True)
        model = sklearn.svm.SVC(C=svc_c, gamma="auto")
    elif classifier_name == "RandomForest":
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 12, log=True)
        model = sklearn.ensemble.RandomForestClassifier(max_depth=rf_max_depth,
                                                        n_estimators=10)
    elif classifier_name == "DecisionTreeClassifier":
        dt_criteria = trial.suggest_categorical("dt_criteria", ["gini", "entropy"])
        dt_max_depth = trial.suggest_int("dt_max_depth", 2, 12)
        model = tree.DecisionTreeClassifier(criterion=dt_criteria,
                                            max_depth=dt_max_depth)
    elif classifier_name == "AdaBoostClassifier":
        # The assignment asks for 1e-10 to 1e10; a narrower range is used here.
        learning_rate = trial.suggest_float("learning_rate", 1e-3, 1)
        sug_ada_estims = trial.suggest_int("estimators", 2, 32)
        model = AdaBoostClassifier(n_estimators=sug_ada_estims,
                                   learning_rate=learning_rate)

    # Let Optuna pick the scaler (or none) for this trial.
    # An explicit name-to-class mapping replaces eval() on the suggested string.
    scalers = {
        "StandardScaler": StandardScaler,
        "RobustScaler": RobustScaler,
        "MinMaxScaler": MinMaxScaler,
        "MaxAbsScaler": MaxAbsScaler,
        "PowerTransformer": PowerTransformer,
        "Normalizer": Normalizer,
    }
    scaler_string = trial.suggest_categorical(
        "------------------------------------_scaler",
        ["no_scaler", "StandardScaler", "RobustScaler", "MinMaxScaler",
         "MaxAbsScaler", "PowerTransformer", "Normalizer"],
    )

    if scaler_string == "no_scaler":
        scaled_X = X
    else:
        # A fresh scaler is created and fitted on the full data in every trial.
        scaler = scalers[scaler_string]()
        scaled_X = scaler.fit_transform(X)

    # Score the model with stratified 5-fold cross-validation on weighted F1.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    score = cross_val_score(model, scaled_X, y, cv=cv, scoring="f1_weighted")
    trial_score = score.mean()

    return trial_score

X, y = load_data("diabetes")
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print(study.best_trial)


[I 2021-09-16 16:43:16,447] A new study created in memory with name: no-name-36b8720d-bd42-4bcc-9271-83a3eea258ce
[I 2021-09-16 16:43:16,669] Trial 0 finished with value: 0.7508862767458336 and parameters: {'classifier': 'RandomForest', 'rf_max_depth': 4, '------------------------------------_scaler': 'StandardScaler'}. Best is trial 0 with value: 0.7508862767458336.
[I 2021-09-16 16:43:16,721] Trial 1 finished with value: 0.7308796664378903 and parameters: {'classifier': 'DecisionTreeClassifier', 'dt_criteria': 'gini', 'dt_max_depth': 10, '------------------------------------_scaler': 'MaxAbsScaler'}. Best is trial 0 with value: 0.7508862767458336.
[I 2021-09-16 16:43:16,781] Trial 2 finished with value: 0.7181414849761961 and parameters: {'classifier': 'AdaBoostClassifier', 'learning_rate': 0.41165829923508857, 'estimators': 2, '------------------------------------_scaler': 'MinMaxScaler'}. Best is trial 0 with value: 0.7508862767458336

In [None]:
optuna.visualization.plot_optimization_history(study)
