### Project 01
**Selected model: Random Forest in the Extra Trees variant.**

Research problem: the dataset used for classification is 'Drum Kit Sound Samples':
*https://www.kaggle.com/datasets/anubhavchhabra/drum-kit-sound-samples*
We will therefore classify percussive sounds in order to tell the individual drum types apart, following the dataset's own English nomenclature: kick, snare, toms and overheads. The dataset contains exactly 160 audio samples, with 40 .wav files per class. The recordings were either captured 'live' or are computer-'simulated' sounds. The dataset can also be used for clustering tasks.

The goal and object of the classification is thus a model that learns to recognize specific kinds of percussive sounds and that could be used, for example, in music reproduction.

As described above, the dataset therefore contains 4 classes of 40 objects each, which gives a database of 160 elements in total. The file format is .wav, and the sampling rate is :

The models selected for classification are:
- Random Forest (Extra Trees variant) [implemented by Jakub Gucik]
- Support Vector Machines
- k-Nearest Neighbors
- Gaussian Naive Bayes

Each team member will initially work on one of the models; if time permits, each member will also analyze one additional model from the list.

This Jupyter Notebook is devoted to the Random Forest model.

### 1. Preprocessing

Preprocessing follows the standard procedure used in class: loading and labeling the recordings, then standardizing the data with `StandardScaler`.

Library imports:

In [44]:
# Libraries import :

import scipy.stats
import os
import librosa
import optuna
import pandas

import numpy as np

from collections import Counter
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import roc_auc_score, roc_curve, auc, precision_recall_curve, accuracy_score, recall_score, f1_score, make_scorer, confusion_matrix, log_loss
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate, StratifiedKFold
from pathlib import Path
from sklearn.model_selection import train_test_split

Paths to the folders of the individual classes are prepared for better organization, together with a list used to automate loading the recordings:

In [6]:
# Creation of directories and lists :

repoDir = Path.cwd()
archiveDir = os.path.join(repoDir, "IRMAS-TrainingData")
celDir = os.path.join(archiveDir, "cel")
claDir = os.path.join(archiveDir, "cla")
fluDir = os.path.join(archiveDir, "flu")
gacDir = os.path.join(archiveDir, "gac")
gelDir = os.path.join(archiveDir, "gel")
orgDir = os.path.join(archiveDir, "org")
piaDir = os.path.join(archiveDir, "pia")
saxDir = os.path.join(archiveDir, "sax")
truDir = os.path.join(archiveDir, "tru")
vioDir = os.path.join(archiveDir, "vio")
voiDir = os.path.join(archiveDir, "voi")


listDir = [celDir, claDir, fluDir, gacDir, gelDir, orgDir, piaDir, saxDir, truDir, vioDir, voiDir]
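
The eleven per-class paths above all follow one pattern, so they could also be generated from the list of class codes. A minimal sketch, assuming the same `IRMAS-TrainingData` folder layout; the names `class_codes`, `build_class_dirs` and `label_of` are illustrative and not part of the original notebook:

```python
from pathlib import Path

# Class codes taken from the folder names used above.
class_codes = ["cel", "cla", "flu", "gac", "gel", "org",
               "pia", "sax", "tru", "vio", "voi"]

def build_class_dirs(repo_dir, archive_name="IRMAS-TrainingData"):
    """Return one directory path per class, in label order (hypothetical helper)."""
    archive_dir = Path(repo_dir) / archive_name
    return [archive_dir / code for code in class_codes]

class_dirs = build_class_dirs(Path.cwd())

# The list index doubles as the numeric label, as with enumerate(listDir).
label_of = {path.name: label for label, path in enumerate(class_dirs)}
print(label_of)
```

Keeping the codes in a single list also makes it easy to map a numeric label back to its instrument code later, for example when reading a confusion matrix.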

Below, the data are prepared for analysis in the form of a `DataFrame`, for convenience. This is followed by a check that all recordings share the same sampling rate:

In [61]:
# Creation of database :

records = []
wavNames = []
labels = []
genre = []
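for label, classDir in enumerate(listDir):
    # Build full paths instead of changing the working directory.
    for recording in os.listdir(classDir):
        # librosa.load resamples to `sr`, so every signal is returned at 44100 Hz.
        records.append(librosa.load(os.path.join(classDir, recording), sr=44100))
        wavNames.append(str(recording[1:4]))  # instrument code from the file name
        labels.append(label)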
        if "[cla]" in str(recording):
            genre.append("classical")
        elif "[jaz_blu]" in str(recording):
            genre.append("jazz_blues")
        elif "[pop_roc]" in str(recording):
            genre.append("pop_rock")
        elif "[cou_fol]" in str(recording):
            genre.append("country_folk")
        elif "[lat_sou]" in str(recording):
            genre.append("latin_soul")

database = pandas.DataFrame(data=records)
database.columns = ["record_data", "sampling_frequency"]
database["instrument_type"] = wavNames
database["labels"] = labels
database["genre"] = genre
display(database)
os.chdir(repoDir)

# Sampling frequency check :

for frequency in database["sampling_frequency"]:
    if frequency == 44100:
        pass
    else:
        print("Frequency doesn't match 44100 Hz.")

Unnamed: 0,record_data,sampling_frequency,instrument_type,labels,genre
0,"[0.022232056, 0.02468872, 0.02645874, 0.027236...",44100,cel,0,classical
1,"[-0.0033721924, -0.003967285, -0.003479004, -0...",44100,cel,0,classical
2,"[0.00061035156, 0.0001373291, -0.0005950928, -...",44100,cel,0,classical
3,"[-0.00024414062, -0.0013885498, -0.0026397705,...",44100,cel,0,classical
4,"[-0.00018310547, 0.00076293945, 0.0011901855, ...",44100,cel,0,classical
...,...,...,...,...,...
1645,"[0.0059509277, 0.015350342, 0.023391724, 0.032...",44100,voi,10,pop_rock
1646,"[-0.11528015, -0.15473938, -0.15000916, -0.113...",44100,voi,10,pop_rock
1647,"[0.1661377, 0.15690613, 0.12338257, 0.07142639...",44100,voi,10,pop_rock
1648,"[0.14439392, 0.12376404, 0.10206604, 0.0853881...",44100,voi,10,pop_rock
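
Because `librosa.load(..., sr=44100)` resamples on load, the check above passes for any input file. To inspect the rate actually stored in a file, one option is to read the WAV header directly (alternatively, `librosa.load(path, sr=None)` keeps the native rate). The sketch below uses the standard-library `wave` module on a small synthetic file; the helper `native_sample_rate` and the demo file are assumptions for illustration, not part of the notebook, and `wave` only handles uncompressed PCM files:

```python
import os
import tempfile
import wave

def native_sample_rate(path):
    """Read the sampling rate stored in a WAV header, without resampling."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()

# Demo on a synthetic file (not dataset audio):
# 0.1 s of 16-bit mono silence written at 22050 Hz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(22050)
    wav_file.writeframes(b"\x00\x00" * 2205)

demo_rate = native_sample_rate(demo_path)
print(demo_rate)
```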


Next, the features have to be extracted. 13 MFCC coefficients are used, together with `MFCC_delta` and `MFCC_delta_delta`. All data are presented in a `DataFrame`:

In [68]:
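# Extracting MFCC coefficients

mfccs = []
mfccs_delta = []
mfccs_deltasq = []
for data in database["record_data"]:
    # NOTE: sr should match the rate used in librosa.load (44100 Hz);
    # 22050 is the value the recorded outputs below were computed with.
    mfcc = librosa.feature.mfcc(y=data, sr=22050, n_mfcc=13)
    mfccs.append(mfcc)
    mfccs_delta.append(librosa.feature.delta(mfcc))  # first-order delta
    mfccs_deltasq.append(librosa.feature.delta(mfcc, order=2))  # second-order delta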

database["mfcc"] = mfccs
database["mfcc_delta"] = mfccs_delta
database["mfcc_delta_delta"] = mfccs_deltasq

display(database)

Unnamed: 0,record_data,sampling_frequency,instrument_type,labels,genre,mfcc,mfcc_delta,mfcc_delta_delta
0,"[0.022232056, 0.02468872, 0.02645874, 0.027236...",44100,cel,0,classical,"[[-388.13446, -423.61435, -473.78613, -472.044...","[[-8.185071, -8.185071, -8.185071, -8.185071, ...","[[5.1802053, 5.1802053, 5.1802053, 5.1802053, ..."
1,"[-0.0033721924, -0.003967285, -0.003479004, -0...",44100,cel,0,classical,"[[-380.3471, -347.56705, -341.9458, -340.2455,...","[[3.440472, 3.440472, 3.440472, 3.440472, 3.44...","[[-2.7836533, -2.7836533, -2.7836533, -2.78365..."
2,"[0.00061035156, 0.0001373291, -0.0005950928, -...",44100,cel,0,classical,"[[-444.47504, -436.25427, -442.07135, -440.382...","[[1.7043172, 1.7043172, 1.7043172, 1.7043172, ...","[[0.64747626, 0.64747626, 0.64747626, 0.647476..."
3,"[-0.00024414062, -0.0013885498, -0.0026397705,...",44100,cel,0,classical,"[[-499.21375, -478.78656, -483.94357, -485.582...","[[1.6173273, 1.6173273, 1.6173273, 1.6173273, ...","[[-0.55970395, -0.55970395, -0.55970395, -0.55..."
4,"[-0.00018310547, 0.00076293945, 0.0011901855, ...",44100,cel,0,classical,"[[-491.483, -463.85297, -464.4726, -464.46667,...","[[1.5724477, 1.5724477, 1.5724477, 1.5724477, ...","[[-2.0295393, -2.0295393, -2.0295393, -2.02953..."
...,...,...,...,...,...,...,...,...
1645,"[0.0059509277, 0.015350342, 0.023391724, 0.032...",44100,voi,10,pop_rock,"[[-209.98701, -174.61427, -171.96695, -177.942...","[[-1.7447826, -1.7447826, -1.7447826, -1.74478...","[[-3.0017266, -3.0017266, -3.0017266, -3.00172..."
1646,"[-0.11528015, -0.15473938, -0.15000916, -0.113...",44100,voi,10,pop_rock,"[[-142.983, -119.2241, -121.361145, -122.55564...","[[-0.27515614, -0.27515614, -0.27515614, -0.27...","[[-1.6264538, -1.6264538, -1.6264538, -1.62645..."
1647,"[0.1661377, 0.15690613, 0.12338257, 0.07142639...",44100,voi,10,pop_rock,"[[-127.09217, -121.634254, -144.1307, -150.183...","[[10.00528, 10.00528, 10.00528, 10.00528, 10.0...","[[7.137136, 7.137136, 7.137136, 7.137136, 7.13..."
1648,"[0.14439392, 0.12376404, 0.10206604, 0.0853881...",44100,voi,10,pop_rock,"[[-125.003006, -52.63265, -53.847103, -72.9391...","[[-2.2320504, -2.2320504, -2.2320504, -2.23205...","[[-3.3040605, -3.3040605, -3.3040605, -3.30406..."


`MFCC` matrices whose sizes differ from one another would be hard to use directly. The recordings, however, were prepared in advance so that every `MFCC` matrix has the same size; the check below uses `mfcc_len = 3367` elements (13 × 259 matrices). This holds as long as the variant with 13 `MFCC` coefficients is kept. With a different number of coefficients the approach should still work, but the size will be larger or smaller and has to be verified. The code below performs the check:

In [70]:
mfcc_len = 3367

# Every MFCC, delta and delta-delta matrix should hold exactly mfcc_len values.
for column in ["mfcc", "mfcc_delta", "mfcc_delta_delta"]:
    for item in database[column]:
        if np.size(item) != mfcc_len:
            print("Length of arrays doesn't match !")
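
The value 3367 can be cross-checked by hand. With librosa's default framing (`hop_length=512`, `center=True`) a signal of `n_samples` samples yields `1 + n_samples // 512` frames. The sketch below assumes each excerpt is 3 seconds long at 44100 Hz; the duration is an assumption, not something stated in the notebook, and `mfcc_frame_count` is an illustrative helper:

```python
def mfcc_frame_count(n_samples, hop_length=512):
    """Frames produced with centered framing: 1 + n_samples // hop_length."""
    return 1 + n_samples // hop_length

sample_rate = 44100   # rate used in librosa.load above
duration_s = 3        # ASSUMPTION: each excerpt lasts 3 seconds
n_mfcc = 13

n_frames = mfcc_frame_count(sample_rate * duration_s)
n_values = n_mfcc * n_frames
print(n_frames, n_values)  # 259 3367
```

The frame count depends on the number of samples and the hop length only, so it is unaffected by the `sr` argument passed to `librosa.feature.mfcc`.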

We can also use an approach based on 'parameters' of the `MFCC`. The following can be analyzed:

- the mean,
- the standard deviation,
- the median,
- the first and third quartiles,
- the spread between the 10th and 90th percentiles,
- the kurtosis,
- the skewness,
- the minimum,
- the maximum.

This gives 10 parameters per `MFCC` coefficient row. All 13 coefficients would give 130 parameters per signal; the loop below runs over `range(0, 12)`, so it covers the first 12 coefficients and yields 120 parameters describing each element. For 160 signals that amounts to 19200 values in total, which are then written to the `DataFrame`. Because of how the database is structured, the values have to be computed by iterating over the successive `MFCC` rows:

In [71]:
mfcc_parameters = []
for mfcc in database["mfcc"]:
    mfcc_stack = []
    # range(0, 12) covers coefficient rows 0..11 (12 of the 13 rows).
    for i in range(0, 12):
        row = mfcc[i]
        data_stack = np.hstack((np.mean(row),
                    np.std(row),
                    np.median(row),
                    np.percentile(row, 25),
                    np.percentile(row, 75),
                    scipy.stats.iqr(row, rng=(10, 90)),
                    scipy.stats.kurtosis(row),
                    scipy.stats.skew(row),
                    np.min(row),
                    np.max(row)
                    ))
        mfcc_stack = np.hstack((mfcc_stack, data_stack))
    mfcc_parameters.append(mfcc_stack)

database["mfcc_parameters"] = mfcc_parameters
display(database)

Unnamed: 0,record_data,sampling_frequency,instrument_type,labels,genre,mfcc,mfcc_delta,mfcc_delta_delta,mfcc_parameters
0,"[0.022232056, 0.02468872, 0.02645874, 0.027236...",44100,cel,0,classical,"[[-388.13446, -423.61435, -473.78613, -472.044...","[[-8.185071, -8.185071, -8.185071, -8.185071, ...","[[5.1802053, 5.1802053, 5.1802053, 5.1802053, ...","[-453.2663269042969, 22.335796356201172, -451...."
1,"[-0.0033721924, -0.003967285, -0.003479004, -0...",44100,cel,0,classical,"[[-380.3471, -347.56705, -341.9458, -340.2455,...","[[3.440472, 3.440472, 3.440472, 3.440472, 3.44...","[[-2.7836533, -2.7836533, -2.7836533, -2.78365...","[-383.6215515136719, 25.697067260742188, -380...."
2,"[0.00061035156, 0.0001373291, -0.0005950928, -...",44100,cel,0,classical,"[[-444.47504, -436.25427, -442.07135, -440.382...","[[1.7043172, 1.7043172, 1.7043172, 1.7043172, ...","[[0.64747626, 0.64747626, 0.64747626, 0.647476...","[-472.3026428222656, 20.734647750854492, -470...."
3,"[-0.00024414062, -0.0013885498, -0.0026397705,...",44100,cel,0,classical,"[[-499.21375, -478.78656, -483.94357, -485.582...","[[1.6173273, 1.6173273, 1.6173273, 1.6173273, ...","[[-0.55970395, -0.55970395, -0.55970395, -0.55...","[-470.0331726074219, 17.2320613861084, -471.61..."
4,"[-0.00018310547, 0.00076293945, 0.0011901855, ...",44100,cel,0,classical,"[[-491.483, -463.85297, -464.4726, -464.46667,...","[[1.5724477, 1.5724477, 1.5724477, 1.5724477, ...","[[-2.0295393, -2.0295393, -2.0295393, -2.02953...","[-452.78167724609375, 26.359832763671875, -461..."
...,...,...,...,...,...,...,...,...,...
1645,"[0.0059509277, 0.015350342, 0.023391724, 0.032...",44100,voi,10,pop_rock,"[[-209.98701, -174.61427, -171.96695, -177.942...","[[-1.7447826, -1.7447826, -1.7447826, -1.74478...","[[-3.0017266, -3.0017266, -3.0017266, -3.00172...","[-202.75323486328125, 25.102352142333984, -202..."
1646,"[-0.11528015, -0.15473938, -0.15000916, -0.113...",44100,voi,10,pop_rock,"[[-142.983, -119.2241, -121.361145, -122.55564...","[[-0.27515614, -0.27515614, -0.27515614, -0.27...","[[-1.6264538, -1.6264538, -1.6264538, -1.62645...","[-153.88998413085938, 37.77930450439453, -157...."
1647,"[0.1661377, 0.15690613, 0.12338257, 0.07142639...",44100,voi,10,pop_rock,"[[-127.09217, -121.634254, -144.1307, -150.183...","[[10.00528, 10.00528, 10.00528, 10.00528, 10.0...","[[7.137136, 7.137136, 7.137136, 7.137136, 7.13...","[-110.45230865478516, 27.135652542114258, -114..."
1648,"[0.14439392, 0.12376404, 0.10206604, 0.0853881...",44100,voi,10,pop_rock,"[[-125.003006, -52.63265, -53.847103, -72.9391...","[[-2.2320504, -2.2320504, -2.2320504, -2.23205...","[[-3.3040605, -3.3040605, -3.3040605, -3.30406...","[-105.6734390258789, 29.895414352416992, -106...."
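
The per-row statistics can also be computed without the inner Python loop by applying each function along `axis=1`. This is a sketch on a synthetic matrix rather than the notebook's data; `mfcc_statistics` and its `n_coefficients` argument are illustrative names. The output keeps the ordering used above (ten values for row 0, then ten for row 1, and so on):

```python
import numpy as np
import scipy.stats

def mfcc_statistics(mfcc, n_coefficients=None):
    """Ten summary statistics per coefficient row, concatenated row by row."""
    rows = mfcc if n_coefficients is None else mfcc[:n_coefficients]
    stats = np.column_stack((
        np.mean(rows, axis=1),
        np.std(rows, axis=1),
        np.median(rows, axis=1),
        np.percentile(rows, 25, axis=1),
        np.percentile(rows, 75, axis=1),
        scipy.stats.iqr(rows, axis=1, rng=(10, 90)),
        scipy.stats.kurtosis(rows, axis=1),
        scipy.stats.skew(rows, axis=1),
        np.min(rows, axis=1),
        np.max(rows, axis=1),
    ))
    return stats.ravel()  # shape: (n_rows * 10,)

# Synthetic stand-in for one MFCC matrix (13 coefficients x 259 frames).
rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(13, 259))

features_all = mfcc_statistics(fake_mfcc)     # all 13 rows
features_12 = mfcc_statistics(fake_mfcc, 12)  # first 12 rows, as in range(0, 12)
print(features_all.shape, features_12.shape)
```

Passing `n_coefficients=12` reproduces the 120-value vector of the loop above, while the default uses all 13 rows and gives 130 values.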


We can therefore also check which of the approaches (`MFCC`, `MFCC-delta`, `MFCC-delta-delta` or the parameters) gives the best results. For the first three, dimensionality reduction is worth considering, since a single `MFCC` matrix has as many as 3367 elements. In addition, the `MFCC` and delta matrices have to be flattened so that they can be passed through `train_test_split` and `StandardScaler`. This is done below, and the original columns are replaced with the flattened versions. It is done last so that the 'parameters' above are computed from the unflattened matrices.

In [73]:
mfcc_flatten = []
mfcc_flatten_delta = []
mfcc_flatten_delta_delta = []
for i in range(len(database)):
    mfcc_flatten.append(database['mfcc'][i].flatten())
    mfcc_flatten_delta.append(database['mfcc_delta'][i].flatten())
    mfcc_flatten_delta_delta.append(database['mfcc_delta_delta'][i].flatten())

database.drop('mfcc', axis=1, inplace=True)
database.drop('mfcc_delta', axis=1, inplace=True)
database.drop('mfcc_delta_delta', axis=1, inplace=True)
database['mfcc'] = mfcc_flatten
database['mfcc_delta'] = mfcc_flatten_delta
database['mfcc_deltasq'] = mfcc_flatten_delta_delta
display(database)

Unnamed: 0,record_data,sampling_frequency,instrument_type,labels,genre,mfcc_parameters,mfcc,mfcc_delta,mfcc_deltasq
0,"[0.022232056, 0.02468872, 0.02645874, 0.027236...",44100,cel,0,classical,"[-453.2663269042969, 22.335796356201172, -451....","[-388.13446, -423.61435, -473.78613, -472.0447...","[-8.185071, -8.185071, -8.185071, -8.185071, -...","[5.1802053, 5.1802053, 5.1802053, 5.1802053, 5..."
1,"[-0.0033721924, -0.003967285, -0.003479004, -0...",44100,cel,0,classical,"[-383.6215515136719, 25.697067260742188, -380....","[-380.3471, -347.56705, -341.9458, -340.2455, ...","[3.440472, 3.440472, 3.440472, 3.440472, 3.440...","[-2.7836533, -2.7836533, -2.7836533, -2.783653..."
2,"[0.00061035156, 0.0001373291, -0.0005950928, -...",44100,cel,0,classical,"[-472.3026428222656, 20.734647750854492, -470....","[-444.47504, -436.25427, -442.07135, -440.3829...","[1.7043172, 1.7043172, 1.7043172, 1.7043172, 1...","[0.64747626, 0.64747626, 0.64747626, 0.6474762..."
3,"[-0.00024414062, -0.0013885498, -0.0026397705,...",44100,cel,0,classical,"[-470.0331726074219, 17.2320613861084, -471.61...","[-499.21375, -478.78656, -483.94357, -485.5826...","[1.6173273, 1.6173273, 1.6173273, 1.6173273, 1...","[-0.55970395, -0.55970395, -0.55970395, -0.559..."
4,"[-0.00018310547, 0.00076293945, 0.0011901855, ...",44100,cel,0,classical,"[-452.78167724609375, 26.359832763671875, -461...","[-491.483, -463.85297, -464.4726, -464.46667, ...","[1.5724477, 1.5724477, 1.5724477, 1.5724477, 1...","[-2.0295393, -2.0295393, -2.0295393, -2.029539..."
...,...,...,...,...,...,...,...,...,...
1645,"[0.0059509277, 0.015350342, 0.023391724, 0.032...",44100,voi,10,pop_rock,"[-202.75323486328125, 25.102352142333984, -202...","[-209.98701, -174.61427, -171.96695, -177.9427...","[-1.7447826, -1.7447826, -1.7447826, -1.744782...","[-3.0017266, -3.0017266, -3.0017266, -3.001726..."
1646,"[-0.11528015, -0.15473938, -0.15000916, -0.113...",44100,voi,10,pop_rock,"[-153.88998413085938, 37.77930450439453, -157....","[-142.983, -119.2241, -121.361145, -122.55564,...","[-0.27515614, -0.27515614, -0.27515614, -0.275...","[-1.6264538, -1.6264538, -1.6264538, -1.626453..."
1647,"[0.1661377, 0.15690613, 0.12338257, 0.07142639...",44100,voi,10,pop_rock,"[-110.45230865478516, 27.135652542114258, -114...","[-127.09217, -121.634254, -144.1307, -150.1832...","[10.00528, 10.00528, 10.00528, 10.00528, 10.00...","[7.137136, 7.137136, 7.137136, 7.137136, 7.137..."
1648,"[0.14439392, 0.12376404, 0.10206604, 0.0853881...",44100,voi,10,pop_rock,"[-105.6734390258789, 29.895414352416992, -106....","[-125.003006, -52.63265, -53.847103, -72.93917...","[-2.2320504, -2.2320504, -2.2320504, -2.232050...","[-3.3040605, -3.3040605, -3.3040605, -3.304060..."
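
The text mentions dimensionality reduction for the flattened vectors without naming a method. One common option is PCA; the sketch below is an illustration under that assumption, not the approach taken in this notebook. It uses synthetic data in place of the flattened `MFCC` columns, and the number of components (20) is arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for flattened MFCC vectors
# (ASSUMPTION: 60 samples with 3367 values each).
rng = np.random.default_rng(42)
X_train_flat = rng.normal(size=(48, 3367))
X_test_flat = rng.normal(size=(12, 3367))

demo_scaler = StandardScaler()
pca = PCA(n_components=20, random_state=42)  # 20 components is an arbitrary choice

# Fit on the training part only, then reuse the fitted transforms on the test part.
X_train_reduced = pca.fit_transform(demo_scaler.fit_transform(X_train_flat))
X_test_reduced = pca.transform(demo_scaler.transform(X_test_flat))
print(X_train_reduced.shape, X_test_reduced.shape)
```

Fitting both the scaler and the PCA on the training part alone keeps information from the test set out of the learned transform.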


### 2. Preparing the training sets and the model

Now the data and the model can be prepared. `train_test_split` is used (cross-validation will not be applied, because of the small class sizes). To work with both `sklearn` and `pandas`, the data taken from the `DataFrame` are converted to lists of arrays, so that the split sets can then be standardized with `StandardScaler`. For the MFCCs and deltas the data would also have to be flattened; the approach below, however, uses the MFCC 'parameters', which have already been prepared in a suitable form:

In [74]:
X_train, X_test, y_train, y_test = train_test_split(database["mfcc_parameters"].to_list(), database["labels"].to_list(), test_size=0.2, random_state=42, stratify=database["labels"].to_list())

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

1320.0
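
To illustrate what `stratify` and the fit/transform split of `StandardScaler` do, here is a small sketch on synthetic data. The sizes (11 classes of 150 samples, 120 features) are assumptions chosen to resemble the data frame above, and the `demo_`-style names are illustrative:

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 11 classes x 150 samples, 120 features each (ASSUMPTION).
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(1650, 120))
y = np.repeat(np.arange(11), 150)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Mean and variance are learned from the training part only,
# then applied unchanged to the test part.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

test_counts = Counter(y_te.tolist())
print(len(X_tr), test_counts)
```

With stratification every class contributes the same share to the test set, so per-class metrics are computed on equally sized groups.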

The model under analysis is `ExtraTreesClassifier`:

In [76]:
model = ExtraTreesClassifier(random_state=42)

First, we can check how the baseline model performs, without any modifications or tuning:

In [77]:
model.fit(X_train_scaled, y_train)
preds_test = model.predict(X_test_scaled)

In [78]:
print("Confusion matrix :")
print(confusion_matrix(y_test, preds_test))
print("Accuracy :")
print(accuracy_score(y_test, preds_test))
print("F1 score :")
print(f1_score(y_test, preds_test, average='macro'))

Confusion matrix :
[[21  1  4  3  1  0  0  0  0  0  0]
 [ 1 16  4  4  2  0  0  0  0  0  3]
 [ 1  1 20  1  0  4  1  0  0  1  1]
 [ 2  2  0 21  0  0  5  0  0  0  0]
 [ 0  0  0  3 16  5  1  0  2  1  2]
 [ 1  0  0  1  0 25  1  0  0  0  2]
 [ 1  0  0  1  1  1 20  1  1  0  4]
 [ 4  2  2  2  1  0  2 14  3  0  0]
 [ 0  0  2  0  1  1  0  1 23  0  2]
 [ 3  2  2  2  2  2  0  0  1 14  2]
 [ 0  0  0  0  0  2  0  0  1  0 27]]
Accuracy :
0.6575757575757576
F1 score :
0.6523863924874473
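
The confusion matrix above can be broken down per class. The sketch below copies the matrix from the output and derives the recall of each class; the mapping of label indices to instrument codes assumes the order of `listDir`, and the variable names are illustrative:

```python
import numpy as np

# Baseline confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([
    [21,  1,  4,  3,  1,  0,  0,  0,  0,  0,  0],
    [ 1, 16,  4,  4,  2,  0,  0,  0,  0,  0,  3],
    [ 1,  1, 20,  1,  0,  4,  1,  0,  0,  1,  1],
    [ 2,  2,  0, 21,  0,  0,  5,  0,  0,  0,  0],
    [ 0,  0,  0,  3, 16,  5,  1,  0,  2,  1,  2],
    [ 1,  0,  0,  1,  0, 25,  1,  0,  0,  0,  2],
    [ 1,  0,  0,  1,  1,  1, 20,  1,  1,  0,  4],
    [ 4,  2,  2,  2,  1,  0,  2, 14,  3,  0,  0],
    [ 0,  0,  2,  0,  1,  1,  0,  1, 23,  0,  2],
    [ 3,  2,  2,  2,  2,  2,  0,  0,  1, 14,  2],
    [ 0,  0,  0,  0,  0,  2,  0,  0,  1,  0, 27],
])
# ASSUMPTION: label i corresponds to the i-th folder in listDir.
codes = ["cel", "cla", "flu", "gac", "gel", "org",
         "pia", "sax", "tru", "vio", "voi"]

support = cm.sum(axis=1)           # true samples per class
recall = np.diag(cm) / support     # fraction of each class found
accuracy = np.trace(cm) / cm.sum()

for code, value in zip(codes, recall):
    print(f"{code}: recall = {value:.3f}")
weakest = [c for c, r in zip(codes, recall) if np.isclose(r, recall.min())]
print("accuracy =", accuracy, "| lowest recall:", weakest)
```

Each class has 30 test samples here, and the overall accuracy is 217 correct predictions out of 330.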


In [90]:
model = ExtraTreesClassifier
scoring = {'f1_macro': make_scorer(f1_score, average='macro')}

def get_space(trial):
    # Search space sampled by Optuna for each trial.
    space = {
            "n_estimators": trial.suggest_int("n_estimators", 2, 2000), # default value 100
            # max_depth must be a positive integer (or None), so the range starts at 1.
            "max_depth": trial.suggest_int("max_depth", 1, 2000), # default None
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 2000), # default 2
            "n_jobs": trial.suggest_int("n_jobs", -1, -1) # fixed at -1: use all cores
            # "criterion": trial.suggest_categorical("criterion", ["gini", "entropy", "log_loss"]), # default "gini"
        }
    return space

trials = 200

def objective(trial, model, X, y):
    # Build a model from the sampled hyperparameters.
    model_space = get_space(trial)
    mdl = model(**model_space)

    # Score it with stratified 11-fold cross-validation on macro F1.
    scores = cross_validate(mdl, X, y, scoring=scoring, cv=StratifiedKFold(n_splits=11), return_train_score=True)

    return np.mean(scores['test_f1_macro'])

In [91]:
%%time
study = optuna.create_study(direction='maximize')
study.optimize(lambda x: objective(x, model, X_train_scaled, y_train), n_trials=trials)

[I 2023-01-07 18:18:48,633] A new study created in memory with name: no-name-a5cb1793-2341-4080-bf1b-e3f5488cbd25
[I 2023-01-07 18:19:10,276] Trial 0 finished with value: 0.3339997876451606 and parameters: {'n_estimators': 1356, 'max_depth': 1495, 'min_samples_split': 265, 'n_jobs': -1}. Best is trial 0 with value: 0.3339997876451606.
[W 2023-01-07 18:19:30,504] Trial 1 failed because of the following error: KeyboardInterrupt()
Traceback (most recent call last):
  File "d:\Anaconda\envs\ML_P01\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "<timed exec>", line 2, in <lambda>
  File "C:\Users\kubag\AppData\Local\Temp\ipykernel_8876\763803486.py", line 20, in objective
    scores = cross_validate(mdl, X, y, scoring=scoring, cv=StratifiedKFold(n_splits=11), return_train_score=True)
  File "d:\Anaconda\envs\ML_P01\lib\site-packages\sklearn\model_selection\_validation.py", line 267, in cro

KeyboardInterrupt: 
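
Each trial of the study boils down to one cross-validated evaluation of a hyperparameter setting. The sketch below shows that single step without Optuna, on synthetic data from `make_classification`; the data, the helper `mean_cv_f1` and the two settings compared are assumptions for illustration. The scorer also sets `zero_division=0`, unlike the scorer above, to silence warnings when a class is never predicted:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (ASSUMPTION): 300 samples, 3 classes, 20 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    n_classes=3, random_state=42,
)

demo_scoring = {
    "f1_macro": make_scorer(f1_score, average="macro", zero_division=0)
}

def mean_cv_f1(params, X, y, n_splits=5):
    """Mean macro F1 over stratified folds for one hyperparameter setting."""
    clf = ExtraTreesClassifier(random_state=42, **params)
    scores = cross_validate(clf, X, y, scoring=demo_scoring,
                            cv=StratifiedKFold(n_splits=n_splits))
    return float(np.mean(scores["test_f1_macro"]))

# Each training fold holds 240 samples, so with min_samples_split=250
# no node can be split and every tree stays a single leaf.
score_default = mean_cv_f1({"n_estimators": 50, "min_samples_split": 2},
                           X_demo, y_demo)
score_no_split = mean_cv_f1({"n_estimators": 50, "min_samples_split": 250},
                            X_demo, y_demo)
print(score_default, score_no_split)
```

Large values of `min_samples_split` relative to the training-fold size restrict how far the trees can grow. Trial 0 in the log above combined `min_samples_split=265` with a score of about 0.33, which is consistent with this effect, though a single trial does not establish it.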

In [81]:
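print('params: ', study.best_params)

# Refit on the scaled training data, matching the data the study was tuned on.
best_model = model(**study.best_params)
best_model.fit(X_train_scaled, y_train)
preds = best_model.predict(X_test_scaled)
print('test accuracy = ', accuracy_score(y_test, preds))
# Note: this F1 uses average='weighted', while the baseline above reports 'macro'.
print('test F1 = ', f1_score(y_test, preds, average='weighted'))
print(confusion_matrix(y_test, preds))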

params:  {'n_estimators': 527, 'max_depth': 755, 'min_samples_split': 2, 'n_jobs': -1}
test accuracy =  0.6636363636363637
test F1 =  0.657439040822469
[[22  0  4  3  1  0  0  0  0  0  0]
 [ 1 15  5  3  1  0  0  0  0  0  5]
 [ 2  0 19  2  1  4  1  0  0  1  0]
 [ 2  0  0 23  0  0  4  0  0  1  0]
 [ 0  0  0  4 19  2  0  0  0  1  4]
 [ 0  1  0  0  0 25  1  0  0  1  2]
 [ 1  0  0  1  2  0 19  1  1  1  4]
 [ 5  0  1  2  1  1  3 13  2  1  1]
 [ 0  0  3  0  0  0  0  2 22  0  3]
 [ 1  2  2  2  3  2  0  0  1 13  4]
 [ 0  0  0  0  0  1  0  0  0  0 29]]
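
The tuned model's F1 is reported with `average='weighted'`, whereas the baseline used `average='macro'`. Because every class has the same number of test samples here, the two averages coincide, so the scores are still comparable. The sketch below recomputes both from the confusion matrix copied from the output above; `f1_from_confusion` is an illustrative helper, not part of the notebook:

```python
import numpy as np

def f1_from_confusion(cm):
    """Return (macro F1, support-weighted F1) from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)      # true samples per class
    predicted = cm.sum(axis=0)    # predictions per class
    f1 = 2 * tp / (support + predicted)  # equals 2TP / (2TP + FP + FN)
    return f1.mean(), np.average(f1, weights=support)

# Confusion matrix of the tuned model, copied from the output above.
tuned_cm = [
    [22,  0,  4,  3,  1,  0,  0,  0,  0,  0,  0],
    [ 1, 15,  5,  3,  1,  0,  0,  0,  0,  0,  5],
    [ 2,  0, 19,  2,  1,  4,  1,  0,  0,  1,  0],
    [ 2,  0,  0, 23,  0,  0,  4,  0,  0,  1,  0],
    [ 0,  0,  0,  4, 19,  2,  0,  0,  0,  1,  4],
    [ 0,  1,  0,  0,  0, 25,  1,  0,  0,  1,  2],
    [ 1,  0,  0,  1,  2,  0, 19,  1,  1,  1,  4],
    [ 5,  0,  1,  2,  1,  1,  3, 13,  2,  1,  1],
    [ 0,  0,  3,  0,  0,  0,  0,  2, 22,  0,  3],
    [ 1,  2,  2,  2,  3,  2,  0,  0,  1, 13,  4],
    [ 0,  0,  0,  0,  0,  1,  0,  0,  0,  0, 29],
]

macro_f1, weighted_f1 = f1_from_confusion(tuned_cm)
print(macro_f1, weighted_f1)
```

Compared with the baseline, the tuned model classifies 219 instead of 217 of the 330 test samples correctly.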
