## UM Project - study of models and optimization of classification methods
### Model part - Random Forest in the Extra Trees variant

The models selected for classification are:
- Random Forest (in the Extra Trees variant) [implemented by Jakub Gucik]
- Support Vector Machines [implemented by Tymoteusz Macewicz]

Each team member works on one of the models.

Research problem: for project reasons, the initial dataset used for classification was 'Drum Kit Sound Samples':

*https://www.kaggle.com/datasets/anubhavchhabra/drum-kit-sound-samples*

In that case percussion sounds were classified in order to distinguish, for example, bass drums from snares; more precisely, following the dataset's English nomenclature:
- "kick"
- "snare"
- "toms"
- "overheads"

The dataset contains exactly 160 audio samples, 40 .wav files per class. The recordings were either captured live or are computer-simulated sounds. The dataset can also be used for clustering tasks.

However, the selected models reached about 100% accuracy on this dataset with default settings, which led to the decision to change the data. The most likely reason for this behaviour is simply that the dataset is too small and not diverse enough. To avoid preprocessing problems and inconsistent recordings, the datasets were not merged; a larger dataset was chosen instead.

In the end the 'IRMAS' dataset was chosen. It is available at:

https://www.upf.edu/web/mtg/irmas

It is a set of .wav files intended for instrument recognition in musical audio signals. It contains many samples annotated with the predominant instrument, i.e. the labels required for supervised learning, so it is suited to automatic recognition and classification. The dataset was prepared by Pompeu Fabra University in Barcelona.
The files are annotated with the predominant instrument, using the following codes:

-   cel - cello
-   cla - clarinet
-   flu - flute
-   gac - acoustic guitar
-   gel - electric guitar
-   org - organ
-   pia - piano
-   sax - saxophone
-   tru - trumpet
-   vio - violin
-   voi - human voice

In addition, some files carry extra annotations for genre and the presence of drums:

-   dru - drums present
-   nod - no drums
-   cou_fol - country-folk
-   cla - classical
-   jaz_blu - jazz-blues
-   pop_roc - pop-rock
-   lat_sou - latin-soul

The recordings are excerpts from actual music pieces from the past century. As a result the dataset offers a wide variety of content and audio quality, which makes classification harder.
The dataset consists of 6705 audio recordings in 11 classes of varying size. The files are 16-bit stereo .wav files sampled at 44100 Hz.
Initially, because of the dataset's size, the number of .wav files was reduced and equalized to 150 per class (recordings tagged 'dru' and 'nod' were removed first), so the analyzed dataset contains 1650 items in total.
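
The balancing step described above could be sketched as follows. This is a hypothetical helper, not the script used for the project: `select_balanced` and the example file names are introduced here for illustration only. It keeps at most 150 files per class and discards files tagged `[dru]` or `[nod]` first.

```python
def select_balanced(files_by_class, per_class=150):
    """Keep at most `per_class` files per class, dropping [dru]/[nod] files first."""
    selected = {}
    for code, files in files_by_class.items():
        # Files without drum annotations sort first (False < True); sorted() is
        # stable, so the original order is preserved within each group.
        ranked = sorted(files, key=lambda name: "[dru]" in name or "[nod]" in name)
        selected[code] = ranked[:per_class]
    return selected


# Illustrative input: 100 drum-tagged files followed by 200 untagged files.
example = {
    "gel": [f"[gel][dru][pop_roc]{i:04d}__1.wav" for i in range(100)]
    + [f"[gel][pop_roc]{i:04d}__1.wav" for i in range(100, 300)]
}
balanced = select_balanced(example)
print(len(balanced["gel"]))
```

With enough untagged files available, none of the drum-annotated files survive the selection.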

The exact set of selected recordings is available at:

https://drive.google.com/drive/folders/1MegaYzPFYEbVGd5ica_SU_S_xoqSbcGF?usp=sharing

This Jupyter Notebook is devoted to the Random Forest model.

### 1. Preprocessing

Preprocessing follows the standard procedure used in class: the recordings are loaded and labelled, and the data are then standardized with `StandardScaler`.

Library imports:

In [101]:
# Libraries import :

import scipy.stats
import os
import librosa
import optuna
import pandas

import numpy as np

from collections import Counter
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score, make_scorer, confusion_matrix, log_loss, precision_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate, StratifiedKFold
from pathlib import Path
from sklearn.model_selection import train_test_split

Paths to the per-class folders are prepared for better organization, together with a list used to automate loading the recordings:

In [102]:
# Creation of directories and lists :

repoDir = Path.cwd()
archiveDir = os.path.join(repoDir, "IRMAS-TrainingData")
celDir = os.path.join(archiveDir, "cel")
claDir = os.path.join(archiveDir, "cla")
fluDir = os.path.join(archiveDir, "flu")
gacDir = os.path.join(archiveDir, "gac")
gelDir = os.path.join(archiveDir, "gel")
orgDir = os.path.join(archiveDir, "org")
piaDir = os.path.join(archiveDir, "pia")
saxDir = os.path.join(archiveDir, "sax")
truDir = os.path.join(archiveDir, "tru")
vioDir = os.path.join(archiveDir, "vio")
voiDir = os.path.join(archiveDir, "voi")


listDir = [celDir, claDir, fluDir, gacDir, gelDir, orgDir, piaDir, saxDir, truDir, vioDir, voiDir]

# Labels - as described in the report and in the introduction
# 0 - cel
# 1 - cla
# 2 - flu
# 3 - gac
# 4 - gel
# 5 - org
# 6 - pia
# 7 - sax
# 8 - tru
# 9 - vio
# 10 - voi
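
The eleven path variables above can also be generated from the list of class codes. A minimal sketch, assuming the same `IRMAS-TrainingData` folder layout; `classCodes` and `labelOf` are names introduced here for illustration:

```python
from pathlib import Path

# Class codes in label order (index in this list == numeric label)
classCodes = ["cel", "cla", "flu", "gac", "gel", "org", "pia", "sax", "tru", "vio", "voi"]

archiveDir = Path.cwd() / "IRMAS-TrainingData"
listDir = [archiveDir / code for code in classCodes]

# Mapping from instrument code to numeric label, e.g. labelOf["pia"] == 6
labelOf = {code: label for label, code in enumerate(classCodes)}
print(labelOf)
```

Keeping the codes in one list means the label numbering and the folder order cannot drift apart.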

Below, the data are prepared for analysis as a `DataFrame`, which makes them easier to handle. The cell also checks that all recordings share the same sampling frequency:

In [103]:
# Creation of database :

records = []
wavNames = []
labels = []
genre = []
# Genre tags embedded in the file names, checked in this order
genreTags = {
    "[cla]": "classical",
    "[jaz_blu]": "jazz_blues",
    "[pop_roc]": "pop_rock",
    "[cou_fol]": "country_folk",
    "[lat_sou]": "latin_soul",
}
for label, classDir in enumerate(listDir):
    for recording in os.listdir(classDir):
        # librosa.load returns a (samples, sampling_rate) tuple
        records.append(librosa.load(os.path.join(classDir, recording), sr=44100))
        wavNames.append(recording[1:4])  # instrument code, e.g. "cel"
        labels.append(label)
        # First matching genre tag; None keeps the list aligned with the rows
        genre.append(next((name for tag, name in genreTags.items() if tag in recording), None))

database = pandas.DataFrame(data=records)
database.columns = ["record_data", "sampling_frequency"]
database["instrument_type"] = wavNames
database["labels"] = labels
database["genre"] = genre
display(database)

# Sampling frequency check :

for frequency in database["sampling_frequency"]:
    if frequency != 44100:
        print("Frequency doesn't match 44100 Hz.")

Unnamed: 0,record_data,sampling_frequency,instrument_type,labels,genre
0,"[0.022232056, 0.02468872, 0.02645874, 0.027236...",44100,cel,0,classical
1,"[-0.0033721924, -0.003967285, -0.003479004, -0...",44100,cel,0,classical
2,"[0.00061035156, 0.0001373291, -0.0005950928, -...",44100,cel,0,classical
3,"[-0.00024414062, -0.0013885498, -0.0026397705,...",44100,cel,0,classical
4,"[-0.00018310547, 0.00076293945, 0.0011901855, ...",44100,cel,0,classical
...,...,...,...,...,...
1645,"[0.0059509277, 0.015350342, 0.023391724, 0.032...",44100,voi,10,pop_rock
1646,"[-0.11528015, -0.15473938, -0.15000916, -0.113...",44100,voi,10,pop_rock
1647,"[0.1661377, 0.15690613, 0.12338257, 0.07142639...",44100,voi,10,pop_rock
1648,"[0.14439392, 0.12376404, 0.10206604, 0.0853881...",44100,voi,10,pop_rock
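
The substring checks in the cell above can be generalized by extracting every bracketed tag from a file name. A small sketch, assuming names of the form `[cel][cla]0001__1.wav`; the helper `parse_tags` and the example names are illustrative and not part of the original notebook:

```python
import re

GENRES = {"cla": "classical", "jaz_blu": "jazz_blues", "pop_roc": "pop_rock",
          "cou_fol": "country_folk", "lat_sou": "latin_soul"}


def parse_tags(filename):
    """Split a file name into instrument code, drum annotation and genre."""
    tags = re.findall(r"\[([^\]]+)\]", filename)
    instrument = tags[0] if tags else None
    rest = tags[1:]  # everything after the instrument tag
    drums = next((t for t in rest if t in ("dru", "nod")), None)
    genre = next((GENRES[t] for t in rest if t in GENRES), None)
    return instrument, drums, genre


print(parse_tags("[cla][nod][jaz_blu]0123__2.wav"))
```

Because the instrument tag is separated first, a clarinet file (`[cla]`) is not confused with the classical genre tag, which a plain `"[cla]" in name` test cannot tell apart.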


Next, the features have to be extracted. 13 MFCC coefficients are used, together with `MFCC_delta` and `MFCC_delta_delta`. All data are stored in the `DataFrame`:

In [104]:
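# Extracting MFCC coefficients
# Note: the audio was loaded at 44100 Hz, but sr=22050 is passed below; this changes
# the frequencies of the mel filter bank, not the number of frames.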

mfccs = []
mfccs_delta = []
mfccs_deltasq = []
for data in database["record_data"]:
    mfcc = librosa.feature.mfcc(y=data, sr=22050, n_mfcc=13)
    mfccs.append(mfcc)
    mfccs_delta.append(librosa.feature.delta(mfcc))
    mfccs_deltasq.append(librosa.feature.delta(mfcc, order=2))

database["mfcc"] = mfccs
database["mfcc_delta"] = mfccs_delta
database["mfcc_delta_delta"] = mfccs_deltasq

display(database)

Unnamed: 0,record_data,sampling_frequency,instrument_type,labels,genre,mfcc,mfcc_delta,mfcc_delta_delta
0,"[0.022232056, 0.02468872, 0.02645874, 0.027236...",44100,cel,0,classical,"[[-388.13446, -423.61435, -473.78613, -472.044...","[[-8.185071, -8.185071, -8.185071, -8.185071, ...","[[5.1802053, 5.1802053, 5.1802053, 5.1802053, ..."
1,"[-0.0033721924, -0.003967285, -0.003479004, -0...",44100,cel,0,classical,"[[-380.3471, -347.56705, -341.9458, -340.2455,...","[[3.440472, 3.440472, 3.440472, 3.440472, 3.44...","[[-2.7836533, -2.7836533, -2.7836533, -2.78365..."
2,"[0.00061035156, 0.0001373291, -0.0005950928, -...",44100,cel,0,classical,"[[-444.47504, -436.25427, -442.07135, -440.382...","[[1.7043172, 1.7043172, 1.7043172, 1.7043172, ...","[[0.64747626, 0.64747626, 0.64747626, 0.647476..."
3,"[-0.00024414062, -0.0013885498, -0.0026397705,...",44100,cel,0,classical,"[[-499.21375, -478.78656, -483.94357, -485.582...","[[1.6173273, 1.6173273, 1.6173273, 1.6173273, ...","[[-0.55970395, -0.55970395, -0.55970395, -0.55..."
4,"[-0.00018310547, 0.00076293945, 0.0011901855, ...",44100,cel,0,classical,"[[-491.483, -463.85297, -464.4726, -464.46667,...","[[1.5724477, 1.5724477, 1.5724477, 1.5724477, ...","[[-2.0295393, -2.0295393, -2.0295393, -2.02953..."
...,...,...,...,...,...,...,...,...
1645,"[0.0059509277, 0.015350342, 0.023391724, 0.032...",44100,voi,10,pop_rock,"[[-209.98701, -174.61427, -171.96695, -177.942...","[[-1.7447826, -1.7447826, -1.7447826, -1.74478...","[[-3.0017266, -3.0017266, -3.0017266, -3.00172..."
1646,"[-0.11528015, -0.15473938, -0.15000916, -0.113...",44100,voi,10,pop_rock,"[[-142.983, -119.2241, -121.361145, -122.55564...","[[-0.27515614, -0.27515614, -0.27515614, -0.27...","[[-1.6264538, -1.6264538, -1.6264538, -1.62645..."
1647,"[0.1661377, 0.15690613, 0.12338257, 0.07142639...",44100,voi,10,pop_rock,"[[-127.09217, -121.634254, -144.1307, -150.183...","[[10.00528, 10.00528, 10.00528, 10.00528, 10.0...","[[7.137136, 7.137136, 7.137136, 7.137136, 7.13..."
1648,"[0.14439392, 0.12376404, 0.10206604, 0.0853881...",44100,voi,10,pop_rock,"[[-125.003006, -52.63265, -53.847103, -72.9391...","[[-2.2320504, -2.2320504, -2.2320504, -2.23205...","[[-3.3040605, -3.3040605, -3.3040605, -3.30406..."


It would be hard to use `MFCC` matrices whose lengths differ between recordings. The recordings in this dataset, however, were prepared so that every `MFCC` matrix has 3367 elements (13 coefficients × 259 frames), provided the 13-coefficient variant is kept. A different number of coefficients should work as well, but the element count changes and has to be verified again. The code below performs the check:

In [105]:
mfcc_len = 3367

# Every MFCC, delta and delta-delta matrix should hold mfcc_len elements
for column in ["mfcc", "mfcc_delta", "mfcc_delta_delta"]:
    for item in database[column]:
        if np.size(item) != mfcc_len:
            print("Length of arrays doesn't match !")
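
The figure 3367 can be cross-checked by hand. The sketch below rests on two assumptions that the notebook does not state: each excerpt lasts 3 s, and `librosa.feature.mfcc` runs with its default `hop_length=512` and centered frames, so that the number of frames is `1 + n_samples // hop_length`.

```python
sr = 44100          # sampling frequency used when loading
duration_s = 3      # assumed excerpt length in seconds
hop_length = 512    # librosa default
n_mfcc = 13

n_samples = sr * duration_s               # 132300 samples
n_frames = 1 + n_samples // hop_length    # frames with centered framing
expected_len = n_mfcc * n_frames
print(n_frames, expected_len)
```

Under these assumptions the arithmetic gives 259 frames and 13 × 259 = 3367 elements, in agreement with `mfcc_len`.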

A second approach is based on summary 'parameters' of the `MFCC`. For each coefficient we can compute:

- the mean,
- the standard deviation,
- the median,
- the first and third quartile,
- the spread between the 10th and 90th percentile,
- the kurtosis,
- the skewness,
- the minimum,
- the maximum.

This gives 10 parameters per `MFCC` coefficient, computed over all frames. With 13 coefficients that would be 130 parameters per signal. Note, however, that the loop below runs over `range(0, 12)`, so only the first 12 coefficients are used and each recording is described by 120 parameters, i.e. 198,000 values for the 1650 recordings. The values are stored in the `DataFrame`. Because of the way the data are laid out, the statistics are computed by iterating over the successive `MFCC` coefficients of every recording:

In [106]:
mfcc_parameters = []
for mfcc in database["mfcc"]:
    mfcc_stack = []
    # Note: range(0, 12) covers coefficients 0-11 only; the 13th coefficient is skipped
    for i in range(0, 12):
        coeff = mfcc[i]  # values of one MFCC coefficient over all frames
        data_stack = np.hstack((np.mean(coeff),
                    np.std(coeff),
                    np.median(coeff),
                    np.percentile(coeff, 25),
                    np.percentile(coeff, 75),
                    scipy.stats.iqr(coeff, rng=(10, 90)),
                    scipy.stats.kurtosis(coeff),
                    scipy.stats.skew(coeff),
                    np.min(coeff),
                    np.max(coeff)
                    ))
        mfcc_stack = np.hstack((mfcc_stack, data_stack))
    mfcc_parameters.append(mfcc_stack)

database["mfcc_parameters"] = mfcc_parameters
display(database)

Unnamed: 0,record_data,sampling_frequency,instrument_type,labels,genre,mfcc,mfcc_delta,mfcc_delta_delta,mfcc_parameters
0,"[0.022232056, 0.02468872, 0.02645874, 0.027236...",44100,cel,0,classical,"[[-388.13446, -423.61435, -473.78613, -472.044...","[[-8.185071, -8.185071, -8.185071, -8.185071, ...","[[5.1802053, 5.1802053, 5.1802053, 5.1802053, ...","[-453.2663269042969, 22.335796356201172, -451...."
1,"[-0.0033721924, -0.003967285, -0.003479004, -0...",44100,cel,0,classical,"[[-380.3471, -347.56705, -341.9458, -340.2455,...","[[3.440472, 3.440472, 3.440472, 3.440472, 3.44...","[[-2.7836533, -2.7836533, -2.7836533, -2.78365...","[-383.6215515136719, 25.697067260742188, -380...."
2,"[0.00061035156, 0.0001373291, -0.0005950928, -...",44100,cel,0,classical,"[[-444.47504, -436.25427, -442.07135, -440.382...","[[1.7043172, 1.7043172, 1.7043172, 1.7043172, ...","[[0.64747626, 0.64747626, 0.64747626, 0.647476...","[-472.3026428222656, 20.734647750854492, -470...."
3,"[-0.00024414062, -0.0013885498, -0.0026397705,...",44100,cel,0,classical,"[[-499.21375, -478.78656, -483.94357, -485.582...","[[1.6173273, 1.6173273, 1.6173273, 1.6173273, ...","[[-0.55970395, -0.55970395, -0.55970395, -0.55...","[-470.0331726074219, 17.2320613861084, -471.61..."
4,"[-0.00018310547, 0.00076293945, 0.0011901855, ...",44100,cel,0,classical,"[[-491.483, -463.85297, -464.4726, -464.46667,...","[[1.5724477, 1.5724477, 1.5724477, 1.5724477, ...","[[-2.0295393, -2.0295393, -2.0295393, -2.02953...","[-452.78167724609375, 26.359832763671875, -461..."
...,...,...,...,...,...,...,...,...,...
1645,"[0.0059509277, 0.015350342, 0.023391724, 0.032...",44100,voi,10,pop_rock,"[[-209.98701, -174.61427, -171.96695, -177.942...","[[-1.7447826, -1.7447826, -1.7447826, -1.74478...","[[-3.0017266, -3.0017266, -3.0017266, -3.00172...","[-202.75323486328125, 25.102352142333984, -202..."
1646,"[-0.11528015, -0.15473938, -0.15000916, -0.113...",44100,voi,10,pop_rock,"[[-142.983, -119.2241, -121.361145, -122.55564...","[[-0.27515614, -0.27515614, -0.27515614, -0.27...","[[-1.6264538, -1.6264538, -1.6264538, -1.62645...","[-153.88998413085938, 37.77930450439453, -157...."
1647,"[0.1661377, 0.15690613, 0.12338257, 0.07142639...",44100,voi,10,pop_rock,"[[-127.09217, -121.634254, -144.1307, -150.183...","[[10.00528, 10.00528, 10.00528, 10.00528, 10.0...","[[7.137136, 7.137136, 7.137136, 7.137136, 7.13...","[-110.45230865478516, 27.135652542114258, -114..."
1648,"[0.14439392, 0.12376404, 0.10206604, 0.0853881...",44100,voi,10,pop_rock,"[[-125.003006, -52.63265, -53.847103, -72.9391...","[[-2.2320504, -2.2320504, -2.2320504, -2.23205...","[[-3.3040605, -3.3040605, -3.3040605, -3.30406...","[-105.6734390258789, 29.895414352416992, -106...."
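
The per-coefficient loop above can also be written with array operations along the time axis. The following is a sketch using only NumPy: `summarize` is a hypothetical helper, the skewness and kurtosis formulas follow the `scipy.stats` defaults (biased estimators, Fisher definition of kurtosis), and random data stands in for one MFCC matrix. Unlike the loop above, it covers all rows of the matrix it receives.

```python
import numpy as np


def summarize(mfcc):
    """Return 10 statistics per coefficient (row) as one flat vector."""
    mean = mfcc.mean(axis=1)
    std = mfcc.std(axis=1)
    q10, q25, q50, q75, q90 = np.percentile(mfcc, [10, 25, 50, 75, 90], axis=1)
    centered = mfcc - mean[:, None]
    m2 = (centered ** 2).mean(axis=1)                      # second central moment
    skew = (centered ** 3).mean(axis=1) / m2 ** 1.5
    kurt = (centered ** 4).mean(axis=1) / m2 ** 2 - 3.0    # excess kurtosis
    stats = np.stack([mean, std, q50, q25, q75, q90 - q10, kurt, skew,
                      mfcc.min(axis=1), mfcc.max(axis=1)], axis=1)
    # Row-major flattening keeps the 10 values of each coefficient together
    return stats.ravel()


rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(13, 259))   # 13 coefficients x 259 frames
features = summarize(fake_mfcc)
print(features.shape)
```

The order of the ten statistics matches the order used in the loop, so the two feature layouts are directly comparable.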


We can therefore also check which approach (`MFCC`, `MFCC-delta`, `MFCC-delta-delta` or the parameters) gives the best results. For the first three, dimensionality reduction is worth considering, since a single `MFCC` has as many as 3367 elements. For `MFCC` and the deltas the data also have to be flattened before they can be passed through `train_test_split` and `StandardScaler`. This is done below, and the original columns are replaced with their flattened versions. The flattening is done last so that the 'parameters' above are computed from the unflattened matrices.

In [107]:
mfcc_flatten = []
mfcc_flatten_delta = []
mfcc_flatten_delta_delta = []
for i in range(len(database)):
    mfcc_flatten.append(database['mfcc'][i].flatten())
    mfcc_flatten_delta.append(database['mfcc_delta'][i].flatten())
    mfcc_flatten_delta_delta.append(database['mfcc_delta_delta'][i].flatten())

database.drop('mfcc', axis=1, inplace=True)
database.drop('mfcc_delta', axis=1, inplace=True)
database.drop('mfcc_delta_delta', axis=1, inplace=True)
database['mfcc'] = mfcc_flatten
database['mfcc_delta'] = mfcc_flatten_delta
database['mfcc_deltasq'] = mfcc_flatten_delta_delta
display(database)

Unnamed: 0,record_data,sampling_frequency,instrument_type,labels,genre,mfcc_parameters,mfcc,mfcc_delta,mfcc_deltasq
0,"[0.022232056, 0.02468872, 0.02645874, 0.027236...",44100,cel,0,classical,"[-453.2663269042969, 22.335796356201172, -451....","[-388.13446, -423.61435, -473.78613, -472.0447...","[-8.185071, -8.185071, -8.185071, -8.185071, -...","[5.1802053, 5.1802053, 5.1802053, 5.1802053, 5..."
1,"[-0.0033721924, -0.003967285, -0.003479004, -0...",44100,cel,0,classical,"[-383.6215515136719, 25.697067260742188, -380....","[-380.3471, -347.56705, -341.9458, -340.2455, ...","[3.440472, 3.440472, 3.440472, 3.440472, 3.440...","[-2.7836533, -2.7836533, -2.7836533, -2.783653..."
2,"[0.00061035156, 0.0001373291, -0.0005950928, -...",44100,cel,0,classical,"[-472.3026428222656, 20.734647750854492, -470....","[-444.47504, -436.25427, -442.07135, -440.3829...","[1.7043172, 1.7043172, 1.7043172, 1.7043172, 1...","[0.64747626, 0.64747626, 0.64747626, 0.6474762..."
3,"[-0.00024414062, -0.0013885498, -0.0026397705,...",44100,cel,0,classical,"[-470.0331726074219, 17.2320613861084, -471.61...","[-499.21375, -478.78656, -483.94357, -485.5826...","[1.6173273, 1.6173273, 1.6173273, 1.6173273, 1...","[-0.55970395, -0.55970395, -0.55970395, -0.559..."
4,"[-0.00018310547, 0.00076293945, 0.0011901855, ...",44100,cel,0,classical,"[-452.78167724609375, 26.359832763671875, -461...","[-491.483, -463.85297, -464.4726, -464.46667, ...","[1.5724477, 1.5724477, 1.5724477, 1.5724477, 1...","[-2.0295393, -2.0295393, -2.0295393, -2.029539..."
...,...,...,...,...,...,...,...,...,...
1645,"[0.0059509277, 0.015350342, 0.023391724, 0.032...",44100,voi,10,pop_rock,"[-202.75323486328125, 25.102352142333984, -202...","[-209.98701, -174.61427, -171.96695, -177.9427...","[-1.7447826, -1.7447826, -1.7447826, -1.744782...","[-3.0017266, -3.0017266, -3.0017266, -3.001726..."
1646,"[-0.11528015, -0.15473938, -0.15000916, -0.113...",44100,voi,10,pop_rock,"[-153.88998413085938, 37.77930450439453, -157....","[-142.983, -119.2241, -121.361145, -122.55564,...","[-0.27515614, -0.27515614, -0.27515614, -0.275...","[-1.6264538, -1.6264538, -1.6264538, -1.626453..."
1647,"[0.1661377, 0.15690613, 0.12338257, 0.07142639...",44100,voi,10,pop_rock,"[-110.45230865478516, 27.135652542114258, -114...","[-127.09217, -121.634254, -144.1307, -150.1832...","[10.00528, 10.00528, 10.00528, 10.00528, 10.00...","[7.137136, 7.137136, 7.137136, 7.137136, 7.137..."
1648,"[0.14439392, 0.12376404, 0.10206604, 0.0853881...",44100,voi,10,pop_rock,"[-105.6734390258789, 29.895414352416992, -106....","[-125.003006, -52.63265, -53.847103, -72.93917...","[-2.2320504, -2.2320504, -2.2320504, -2.232050...","[-3.3040605, -3.3040605, -3.3040605, -3.304060..."


### 2. Preparing the training sets and the model

The data and the model can now be prepared. `train_test_split` is used (no cross-validation at this stage, because of the small class sizes). To work with `sklearn` and `pandas`, the data taken from the `DataFrame` are converted to lists of arrays so that the split sets can then be standardized with `StandardScaler`. For MFCC and the deltas the data would also have to be flattened; the approach below uses the MFCC 'parameters', which are already in a suitable form:

In [108]:
X_train, X_test, y_train, y_test = train_test_split(database["mfcc_parameters"].to_list(), database["labels"].to_list(), test_size=0.2, random_state=42, stratify=database["labels"].to_list())

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
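
`StandardScaler` is fitted on the training part only and then applied to both parts, so that no information from the test set leaks into the scaling. The operation itself amounts to the following NumPy sketch. Random data stands in for the feature matrix and the `_demo` names are illustrative; scikit-learn's implementation additionally handles zero-variance columns.

```python
import numpy as np

rng = np.random.default_rng(42)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(25, 4))

# Statistics come from the training data only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)   # population std (ddof=0), as in StandardScaler

X_train_demo_scaled = (X_train_demo - mu) / sigma
X_test_demo_scaled = (X_test_demo - mu) / sigma

print(X_train_demo_scaled.mean(axis=0).round(6))
print(X_train_demo_scaled.std(axis=0).round(6))
```

The scaled training columns have zero mean and unit variance by construction; the test columns are only approximately standardized, which is expected.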

### The code and steps below carry only the essential description, mainly referring to the actions performed. Part of the analysis and the conclusions is presented in the report in the .pdf file.

The analyzed model is `ExtraTreesClassifier`:

In [109]:
model = ExtraTreesClassifier()

First, we can check how the basic model performs, without any modifications or tuning:

In [110]:
model.fit(X_train_scaled, y_train)
preds_test = model.predict(X_test_scaled)

The computed metrics are shown below:

In [111]:
print("Confusion matrix :")
print(confusion_matrix(y_test, preds_test))
print("Accuracy :")
print(accuracy_score(y_test, preds_test))
print("F1 score :")
print(f1_score(y_test, preds_test, average='macro'))
print("Precision :")
print(precision_score(y_test, preds_test, average='macro'))
print("Recall :")
print(recall_score(y_test, preds_test, average='macro'))

Confusion matrix :
[[21  1  3  3  1  1  0  0  0  0  0]
 [ 3 16  2  4  0  0  0  0  1  1  3]
 [ 1  0 19  1  0  3  2  0  0  4  0]
 [ 1  3  1 21  0  0  4  0  0  0  0]
 [ 0  1  0  3 17  2  0  0  2  1  4]
 [ 0  0  0  1  1 24  1  0  0  0  3]
 [ 0  1  0  1  1  0 19  1  2  2  3]
 [ 5  1  0  1  0  0  2 15  4  1  1]
 [ 0  2  2  0  0  0  1  1 21  0  3]
 [ 2  0  3  2  4  2  0  0  2 13  2]
 [ 0  0  0  0  0  1  0  0  0  0 29]]
Accuracy :
0.6515151515151515
F1 score :
0.6457811829430375
Precision :
0.6637928827650845
Recall :
0.6515151515151515
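
The reported values can be recomputed directly from the confusion matrix, which also shows how the macro-averaged metrics are defined. The sketch below uses the matrix printed above (rows are true classes, columns are predictions); the variable names are illustrative.

```python
cm = [
    [21, 1, 3, 3, 1, 1, 0, 0, 0, 0, 0],
    [3, 16, 2, 4, 0, 0, 0, 0, 1, 1, 3],
    [1, 0, 19, 1, 0, 3, 2, 0, 0, 4, 0],
    [1, 3, 1, 21, 0, 0, 4, 0, 0, 0, 0],
    [0, 1, 0, 3, 17, 2, 0, 0, 2, 1, 4],
    [0, 0, 0, 1, 1, 24, 1, 0, 0, 0, 3],
    [0, 1, 0, 1, 1, 0, 19, 1, 2, 2, 3],
    [5, 1, 0, 1, 0, 0, 2, 15, 4, 1, 1],
    [0, 2, 2, 0, 0, 0, 1, 1, 21, 0, 3],
    [2, 0, 3, 2, 4, 2, 0, 0, 2, 13, 2],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 29],
]
n = len(cm)
total = sum(sum(row) for row in cm)
diag = [cm[i][i] for i in range(n)]                              # correct predictions
row_sums = [sum(row) for row in cm]                              # true samples per class
col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]   # predictions per class

accuracy = sum(diag) / total
recall_macro = sum(d / r for d, r in zip(diag, row_sums)) / n
precision_macro = sum(d / c for d, c in zip(diag, col_sums)) / n
# Per-class F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row sum + column sum)
f1_macro = sum(2 * d / (r + c) for d, r, c in zip(diag, row_sums, col_sums)) / n
print(accuracy, precision_macro, recall_macro, f1_macro)
```

Every class has 30 test samples, so macro recall reduces to the overall accuracy, which is why the two values coincide in the output.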


Such results, obtained without any optimization, may be acceptable but are certainly not sufficient. Below is the first optimization attempt:

In [112]:
model = ExtraTreesClassifier
scoring = {'f1_macro': make_scorer(f1_score, average='macro')}

def get_space(trial): 
    space = {
            "n_estimators": trial.suggest_int("n_estimators", 2, 2000), #default value 100
            "max_depth": trial.suggest_int("max_depth", 1, 2000), # default None; scikit-learn requires an integer >= 1
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 2000), # default 2
            "n_jobs": trial.suggest_int("n_jobs", -1, -1)
        }
    return space
trials = 25

def objective(trial, model, X, y):
    model_space = get_space(trial)

    mdl = model(**model_space)
    scores = cross_validate(mdl, X, y, scoring=scoring, cv=StratifiedKFold(n_splits=11), return_train_score=True)

    return np.mean(scores['test_f1_macro'])

In [113]:
%%time
study01 = optuna.create_study(direction='maximize')
study01.optimize(lambda x: objective(x, model, X_train_scaled, y_train), n_trials=trials)

[I 2023-01-09 20:46:24,854] A new study created in memory with name: no-name-794951d9-4d8c-4085-8552-909313822f31
[I 2023-01-09 20:46:47,627] Trial 0 finished with value: 0.26600944611096816 and parameters: {'n_estimators': 925, 'max_depth': 1458, 'min_samples_split': 702, 'n_jobs': -1}. Best is trial 0 with value: 0.26600944611096816.
[I 2023-01-09 20:47:08,850] Trial 1 finished with value: 0.25500465729822347 and parameters: {'n_estimators': 983, 'max_depth': 866, 'min_samples_split': 830, 'n_jobs': -1}. Best is trial 0 with value: 0.26600944611096816.
[I 2023-01-09 20:47:42,656] Trial 2 finished with value: 0.013986013986013986 and parameters: {'n_estimators': 1868, 'max_depth': 618, 'min_samples_split': 1265, 'n_jobs': -1}. Best is trial 0 with value: 0.26600944611096816.
[I 2023-01-09 20:47:59,622] Trial 3 finished with value: 0.37073517927593486 and parameters: {'n_estimators': 710, 'max_depth': 1960, 'min_samples_split

CPU times: total: 1min 42s
Wall time: 6min 58s
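
The very low score of trial 2 (about 0.014) has a plausible explanation: with `min_samples_split` = 1265 and about 1200 training samples per fold, no node can be split, so every tree is a single leaf and the forest predicts one class for all samples. A rough check of the resulting macro F1, assuming a validation fold of 120 samples in which the predicted class has 10 samples:

```python
n_classes = 11
fold_size = 120        # 1320 training samples / 11 folds
in_class = 10          # assumed count of the predicted class in the fold

tp = in_class
fp = fold_size - in_class   # all other samples are assigned to the same class
fn = 0

f1_predicted_class = 2 * tp / (2 * tp + fp + fn)
# The remaining 10 classes are never predicted, so their F1 is 0
f1_macro_constant = f1_predicted_class / n_classes
print(f1_macro_constant)
```

The result agrees with the value logged for trial 2, which suggests that large `min_samples_split` values only produce degenerate models on a training set of this size.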


Below, the model is fitted with the best parameters, which are also printed:

In [114]:
print('params: ', study01.best_params)

lr = model(**study01.best_params)
lr.fit(X_train_scaled, y_train)
preds01 = lr.predict(X_test_scaled)

params:  {'n_estimators': 167, 'max_depth': 1457, 'min_samples_split': 10, 'n_jobs': -1}


The computed metrics:

In [115]:
print("Confusion matrix :")
print(confusion_matrix(y_test, preds01))
print("Accuracy :")
print(accuracy_score(y_test, preds01))
print("F1 score :")
print(f1_score(y_test, preds01, average='macro'))
print("Precision :")
print(precision_score(y_test, preds01, average='macro'))
print("Recall :")
print(recall_score(y_test, preds01, average='macro'))

Confusion matrix :
[[20  0  4  3  1  0  0  1  0  0  1]
 [ 1 11  5  5  3  1  0  0  0  0  4]
 [ 1  0 20  3  1  5  0  0  0  0  0]
 [ 1  0  1 20  0  0  4  0  0  1  3]
 [ 0  0  0  5 13  3  0  2  0  0  7]
 [ 0  0  0  1  1 23  1  0  0  0  4]
 [ 1  3  0  0  2  1 17  1  1  1  3]
 [ 3  2  2  1  1  1  3 11  4  1  1]
 [ 0  0  3  0  0  1  0  4 18  0  4]
 [ 1  2  2  2  3  4  0  0  0 13  3]
 [ 0  0  0  0  1  1  0  0  0  0 28]]
Accuracy :
0.5878787878787879
F1 score :
0.5805316631529825
Precision :
0.6161592773363862
Recall :
0.5878787878787879


Based on these observations, I widen the ranges of "n_estimators" and "max_depth" and narrow the range of "min_samples_split":

In [116]:
model = ExtraTreesClassifier
scoring = {'f1_macro': make_scorer(f1_score, average='macro')}

def get_space(trial): 
    space = {
            "n_estimators": trial.suggest_int("n_estimators", 2, 4000), #default value 100
            "max_depth": trial.suggest_int("max_depth", 1, 4000), # default None; scikit-learn requires an integer >= 1
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 100), # default 2
            "n_jobs": trial.suggest_int("n_jobs", -1, -1)
        }
    return space
trials = 25

def objective(trial, model, X, y):
    model_space = get_space(trial)

    mdl = model(**model_space)
    scores = cross_validate(mdl, X, y, scoring=scoring, cv=StratifiedKFold(n_splits=11), return_train_score=True)

    return np.mean(scores['test_f1_macro'])

Start of the second optimization:

In [117]:
%%time
study02 = optuna.create_study(direction='maximize')
study02.optimize(lambda x: objective(x, model, X_train_scaled, y_train), n_trials=trials)

[I 2023-01-09 20:53:24,261] A new study created in memory with name: no-name-67291e11-759d-4332-a682-43f3136b9191
[I 2023-01-09 20:54:25,499] Trial 0 finished with value: 0.41135991323860177 and parameters: {'n_estimators': 3793, 'max_depth': 1675, 'min_samples_split': 86, 'n_jobs': -1}. Best is trial 0 with value: 0.41135991323860177.
[I 2023-01-09 20:54:36,722] Trial 1 finished with value: 0.4535099897386195 and parameters: {'n_estimators': 549, 'max_depth': 1960, 'min_samples_split': 52, 'n_jobs': -1}. Best is trial 1 with value: 0.4535099897386195.
[I 2023-01-09 20:55:15,426] Trial 2 finished with value: 0.5479311491302412 and parameters: {'n_estimators': 2627, 'max_depth': 2954, 'min_samples_split': 17, 'n_jobs': -1}. Best is trial 2 with value: 0.5479311491302412.
[I 2023-01-09 20:55:59,266] Trial 3 finished with value: 0.4955199172498764 and parameters: {'n_estimators': 2608, 'max_depth': 2952, 'min_samples_split': 31,

CPU times: total: 5min 7s
Wall time: 16min 27s


Optimization result - parameters of the best model and prediction:

In [118]:
print('params: ', study02.best_params)

lr = model(**study02.best_params)
lr.fit(X_train_scaled, y_train)
preds02 = lr.predict(X_test_scaled)

params:  {'n_estimators': 880, 'max_depth': 2317, 'min_samples_split': 2, 'n_jobs': -1}


Metrics and confusion matrix:

In [119]:
print("Confusion matrix :")
print(confusion_matrix(y_test, preds02))
print("Accuracy :")
print(accuracy_score(y_test, preds02))
print("F1 score :")
print(f1_score(y_test, preds02, average='macro'))
print("Precision :")
print(precision_score(y_test, preds02, average='macro'))
print("Recall :")
print(recall_score(y_test, preds02, average='macro'))

Confusion matrix :
[[22  0  4  3  1  0  0  0  0  0  0]
 [ 2 14  4  4  1  0  0  0  0  0  5]
 [ 1  0 20  2  0  4  1  0  0  2  0]
 [ 1  0  1 23  0  0  4  0  0  1  0]
 [ 0  0  0  3 17  3  0  0  1  1  5]
 [ 0  0  0  0  0 27  1  0  0  0  2]
 [ 0  1  0  1  1  0 22  0  2  0  3]
 [ 3  0  1  2  1  1  3 13  5  0  1]
 [ 0  1  3  0  0  0  0  2 21  0  3]
 [ 3  2  2  2  2  2  0  0  1 14  2]
 [ 0  0  0  0  0  1  0  0  0  0 29]]
Accuracy :
0.6727272727272727
F1 score :
0.6644170811597889
Precision :
0.6995895421434286
Recall :
0.6727272727272727


Third optimization - adding the parameters "criterion", "max_features", "random_state" and "warm_start":

In [120]:
model = ExtraTreesClassifier
scoring = {'f1_macro': make_scorer(f1_score, average='macro')}         

def get_space(trial): 
    space = {
            "n_estimators": trial.suggest_int("n_estimators", 2, 4500), #default value 100
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 5), # default 2
            "n_jobs": trial.suggest_int("n_jobs", -1, -1),
            "criterion": trial.suggest_categorical("criterion", ["gini", "entropy", "log_loss"]), # default "gini"
            "max_features": trial.suggest_categorical("max_features", ["auto", "sqrt", "log2"]), # "auto" is deprecated since scikit-learn 1.1
            "random_state": trial.suggest_int("random_state", 1, 50),
            "warm_start": trial.suggest_categorical("warm_start", ["True"]) # note: the string "True", not the boolean True
        }
    return space
trials = 25

def objective(trial, model, X, y):
    model_space = get_space(trial)

    mdl = model(**model_space)
    scores = cross_validate(mdl, X, y, scoring=scoring, cv=StratifiedKFold(n_splits=11), return_train_score=True)

    return np.mean(scores['test_f1_macro'])

Third optimization run:

In [121]:
%%time
study03 = optuna.create_study(direction='maximize')
study03.optimize(lambda x: objective(x, model, X_train_scaled, y_train), n_trials=trials)

[I 2023-01-09 21:09:53,381] A new study created in memory with name: no-name-fe0a9d5d-c86b-4088-921e-9dff16d74e43
[I 2023-01-09 21:12:08,395] Trial 0 finished with value: 0.6907374124310103 and parameters: {'n_estimators': 2278, 'min_samples_split': 2, 'n_jobs': -1, 'criterion': 'entropy', 'max_features': 'sqrt', 'random_state': 17, 'warm_start': 'True'}. Best is trial 0 with value: 0.6907374124310103.
[I 2023-01-09 21:13:57,282] Trial 1 finished with value: 0.6594156473287422 and parameters: {'n_estimators': 2143, 'min_samples_split': 5, 'n_jobs': -1, 'criterion': 'gini', 'max_features': 'log2', 'random_state': 28, 'warm_start': 'True'}. Best is trial 0 with value: 0.6907374124310103.
11 fits failed out of a total of 11.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
---------

CPU times: total: 8min 54s
Wall time: 26min 5s


Predictions on the test set, results and the best model:

In [122]:
print('params: ', study03.best_params)

lr = model(**study03.best_params)
lr.fit(X_train_scaled, y_train)
preds03 = lr.predict(X_test_scaled)

params:  {'n_estimators': 2278, 'min_samples_split': 2, 'n_jobs': -1, 'criterion': 'entropy', 'max_features': 'sqrt', 'random_state': 17, 'warm_start': 'True'}
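
Note that `warm_start` comes back from the study as the string `'True'` rather than a boolean. A defensive sketch that converts such strings before the estimator is constructed; `normalize_params` is a hypothetical helper and not part of Optuna or scikit-learn:

```python
def normalize_params(params):
    """Return a copy of `params` with "True"/"False" strings turned into booleans."""
    mapping = {"True": True, "False": False}
    return {
        key: mapping.get(value, value) if isinstance(value, str) else value
        for key, value in params.items()
    }


best = {'n_estimators': 2278, 'min_samples_split': 2, 'n_jobs': -1, 'criterion': 'entropy',
        'max_features': 'sqrt', 'random_state': 17, 'warm_start': 'True'}
clean = normalize_params(best)
print(clean['warm_start'], type(clean['warm_start']))
```

Other string-valued parameters, such as `criterion`, pass through unchanged.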


In [123]:
print("Confusion matrix :")
print(confusion_matrix(y_test, preds03))
print("Accuracy :")
print(accuracy_score(y_test, preds03))
print("F1 score :")
print(f1_score(y_test, preds03, average='macro'))
print("Precision :")
print(precision_score(y_test, preds03, average='macro'))
print("Recall :")
print(recall_score(y_test, preds03, average='macro'))

Confusion matrix :
[[22  0  4  3  1  0  0  0  0  0  0]
 [ 1 15  5  3  1  0  0  0  0  0  5]
 [ 1  0 19  2  1  4  2  0  0  1  0]
 [ 1  1  1 23  0  0  4  0  0  0  0]
 [ 0  0  0  3 18  2  0  0  1  1  5]
 [ 0  0  0  0  0 27  1  0  0  0  2]
 [ 0  1  0  1  0  0 22  0  1  0  5]
 [ 5  0  1  2  1  0  3 15  3  0  0]
 [ 0  0  3  0  0  0  0  2 22  0  3]
 [ 3  1  2  2  2  1  0  0  1 15  3]
 [ 0  0  0  0  0  1  0  0  0  0 29]]
Accuracy :
0.6878787878787879
F1 score :
0.6840310521811128
Precision :
0.7226947072535308
Recall :
0.687878787878788
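
The per-class figures behind the macro averages can be read directly off the confusion matrix. The sketch below is not part of the original notebook: it copies the matrix printed above into NumPy and recomputes accuracy, per-class recall, and per-class precision. Because every class has exactly 30 test samples, macro recall coincides with accuracy, which explains why the two reported values match.

```python
import numpy as np

# Confusion matrix of the third study, copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([
    [22, 0, 4, 3, 1, 0, 0, 0, 0, 0, 0],
    [1, 15, 5, 3, 1, 0, 0, 0, 0, 0, 5],
    [1, 0, 19, 2, 1, 4, 2, 0, 0, 1, 0],
    [1, 1, 1, 23, 0, 0, 4, 0, 0, 0, 0],
    [0, 0, 0, 3, 18, 2, 0, 0, 1, 1, 5],
    [0, 0, 0, 0, 0, 27, 1, 0, 0, 0, 2],
    [0, 1, 0, 1, 0, 0, 22, 0, 1, 0, 5],
    [5, 0, 1, 2, 1, 0, 3, 15, 3, 0, 0],
    [0, 0, 3, 0, 0, 0, 0, 2, 22, 0, 3],
    [3, 1, 2, 2, 2, 1, 0, 0, 1, 15, 3],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 29],
])

support = cm.sum(axis=1)                  # true samples per class
recall = np.diag(cm) / support            # correct / all true samples of the class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = np.trace(cm) / cm.sum()

print("support per class  :", support)
print("per-class recall   :", np.round(recall, 3))
print("per-class precision:", np.round(precision, 3))
print("accuracy           :", accuracy)
print("macro recall       :", recall.mean())
print("macro precision    :", precision.mean())
```

The per-class arrays also show which classes are confused most often, something the macro averages hide.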


Fourth optimization: checking the effect of switching the scorer to accuracy:

In [124]:
model = ExtraTreesClassifier
scoring = {'accuracy': make_scorer(accuracy_score)}         

def get_space(trial): 
    space = {
            "n_estimators": trial.suggest_int("n_estimators", 2, 4000), #default value 100
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 2000), # default 2
            "n_jobs": trial.suggest_int("n_jobs", -1, -1)
        }
    return space
trials = 25

def objective(trial, model, X, y):
    model_space = get_space(trial)

    mdl = model(**model_space)
    scores = cross_validate(mdl, X, y, scoring=scoring, cv=StratifiedKFold(n_splits=11), return_train_score=True)

    return np.mean(scores['test_accuracy'])
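
`cross_validate` accepts several scorers at once and returns one `test_<name>` array per entry of the `scoring` dictionary, so accuracy and macro F1 can be recorded in a single run. A minimal sketch on synthetic data follows; the dataset from `make_classification`, the small forest, and the 5-fold split are placeholders rather than the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the scaled training features (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

scoring = {
    "accuracy": make_scorer(accuracy_score),
    "f1_macro": make_scorer(f1_score, average="macro"),
}

clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring,
                        cv=StratifiedKFold(n_splits=5))

# One array of per-fold scores for each scorer name.
mean_accuracy = np.mean(scores["test_accuracy"])
mean_f1_macro = np.mean(scores["test_f1_macro"])
print("mean accuracy:", mean_accuracy)
print("mean macro F1:", mean_f1_macro)
```

An objective function can still return only one of the two means to Optuna, while the other is logged for comparison.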

Fourth study:

In [125]:
%%time
study04 = optuna.create_study(direction='maximize')
study04.optimize(lambda x: objective(x, model, X_train_scaled, y_train), n_trials=trials)

[I 2023-01-09 21:36:05,565] A new study created in memory with name: no-name-62b890e4-f505-4cb1-8908-f8ab26893d9a
[I 2023-01-09 21:37:05,138] Trial 0 finished with value: 0.33636363636363636 and parameters: {'n_estimators': 2001, 'min_samples_split': 698, 'n_jobs': -1}. Best is trial 0 with value: 0.33636363636363636.
[I 2023-01-09 21:37:25,960] Trial 1 finished with value: 0.08333333333333334 and parameters: {'n_estimators': 581, 'min_samples_split': 1752, 'n_jobs': -1}. Best is trial 0 with value: 0.33636363636363636.
[I 2023-01-09 21:38:05,122] Trial 2 finished with value: 0.5242424242424243 and parameters: {'n_estimators': 1065, 'min_samples_split': 29, 'n_jobs': -1}. Best is trial 2 with value: 0.5242424242424243.
[I 2023-01-09 21:38:27,137] Trial 3 finished with value: 0.3530303030303031 and parameters: {'n_estimators': 564, 'min_samples_split': 509, 'n_jobs': -1}. Best is trial 2 with value: 0.5242424242424243.
...

CPU times: total: 4min 46s
Wall time: 20min 42s
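
The very low scores for large `min_samples_split` in the log above (for example 0.083 at `min_samples_split=1752`) are consistent with trees that cannot split at all: a node is split only if it holds at least `min_samples_split` samples, so if a training fold is smaller than that value every tree stays a single leaf. The sketch below illustrates the effect on synthetic data; the dataset and its size are placeholders, not the IRMAS features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

def fit_forest(min_samples_split):
    """Fit a small Extra Trees forest and return it with its tree sizes."""
    clf = ExtraTreesClassifier(n_estimators=20,
                               min_samples_split=min_samples_split,
                               random_state=0)
    clf.fit(X, y)
    node_counts = [est.tree_.node_count for est in clf.estimators_]
    return clf, node_counts

clf_default, sizes_default = fit_forest(2)          # scikit-learn default
clf_large, sizes_large = fit_forest(len(X) + 1)     # larger than the dataset

print("nodes per tree, min_samples_split=2  :", sizes_default[:5])
print("nodes per tree, min_samples_split=201:", sizes_large[:5])
print("distinct predictions with the large value:",
      np.unique(clf_large.predict(X)))
```

With the oversized value each tree consists of the root only, and the forest predicts one class for every sample, which is why restricting the upper bound of `min_samples_split` in later studies makes sense.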


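Results and metrics: a marked drop in performance after changing the scorer (test accuracy 0.624 versus 0.688 for the third study). Note that this search space also omits `criterion`, `max_features`, `random_state` and `warm_start`, so the scorer is not the only difference from the third study: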

In [126]:
print('params: ', study04.best_params)

lr = model(**study04.best_params)
lr.fit(X_train_scaled, y_train)
preds04 = lr.predict(X_test_scaled)

params:  {'n_estimators': 2280, 'min_samples_split': 7, 'n_jobs': -1}


In [127]:
print("Confusion matrix :")
print(confusion_matrix(y_test, preds04))
print("Accuracy :")
print(accuracy_score(y_test, preds04))
print("F1 score :")
print(f1_score(y_test, preds04, average='macro'))
print("Precision :")
print(precision_score(y_test, preds04, average='macro'))
print("Recall :")
print(recall_score(y_test, preds04, average='macro'))

Confusion matrix :
[[21  0  4  3  1  0  0  0  0  0  1]
 [ 2 14  4  4  1  0  0  0  0  0  5]
 [ 1  0 19  2  1  4  2  0  0  1  0]
 [ 1  0  1 22  0  0  4  0  0  1  1]
 [ 0  0  0  4 16  2  0  1  1  1  5]
 [ 0  0  0  0  0 25  1  0  0  2  2]
 [ 1  0  0  1  2  0 18  1  2  1  4]
 [ 3  0  1  3  1  1  4 11  4  1  1]
 [ 0  0  3  0  0  0  1  3 18  0  5]
 [ 2  1  2  3  2  2  0  0  2 14  2]
 [ 0  0  0  0  1  1  0  0  0  0 28]]
Accuracy :
0.6242424242424243
F1 score :
0.6174711787263328
Precision :
0.6533657552300817
Recall :
0.6242424242424242


Fifth optimization: examining how the number of cross-validation folds affects the results:

In [128]:
model = ExtraTreesClassifier
scoring = {'f1_macro': make_scorer(f1_score, average='macro')}         

def get_space(trial): 
    space = {
            "n_estimators": trial.suggest_int("n_estimators", 2, 4500), #default value 100
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 5), # default 2
            "n_jobs": trial.suggest_int("n_jobs", -1, -1),
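            "criterion": trial.suggest_categorical("criterion", ["gini", "entropy", "log_loss"]), # default "gini"; "log_loss" requires scikit-learn >= 1.1
            "max_features": trial.suggest_categorical("max_features", ["auto", "sqrt", "log2"]), # "auto" is deprecated since scikit-learn 1.1 (same as "sqrt" for classifiers)
            "random_state": trial.suggest_int("random_state", 1, 50),
            "warm_start": trial.suggest_categorical("warm_start", ["True"]) # note: this is the string "True", not the boolean True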
        }
    return space
trials = 25

def objective(trial, model, X, y):
    model_space = get_space(trial)

    mdl = model(**model_space)
    scores = cross_validate(mdl, X, y, scoring=scoring, cv=StratifiedKFold(n_splits=5), return_train_score=True)

    return np.mean(scores['test_f1_macro'])

Fifth study:

In [129]:
%%time
study05 = optuna.create_study(direction='maximize')
study05.optimize(lambda x: objective(x, model, X_train_scaled, y_train), n_trials=trials)

[I 2023-01-09 21:56:54,247] A new study created in memory with name: no-name-3828dec8-694a-4887-8582-b66b13bba51f
[I 2023-01-09 21:57:34,488] Trial 0 finished with value: 0.6371119877109087 and parameters: {'n_estimators': 1937, 'min_samples_split': 5, 'n_jobs': -1, 'criterion': 'gini', 'max_features': 'auto', 'random_state': 26, 'warm_start': 'True'}. Best is trial 0 with value: 0.6371119877109087.
[I 2023-01-09 21:58:07,517] Trial 1 finished with value: 0.6375257107939358 and parameters: {'n_estimators': 1624, 'min_samples_split': 5, 'n_jobs': -1, 'criterion': 'gini', 'max_features': 'auto', 'random_state': 16, 'warm_start': 'True'}. Best is trial 1 with value: 0.6375257107939358.
5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------

CPU times: total: 4min 57s
Wall time: 17min 38s
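
The number of folds changes how much data each cross-validated model sees: with k folds, every model is trained on (k-1)/k of the training set and validated on the remaining 1/k. The sketch below prints the split sizes for 5 and 11 folds. The label vector (11 classes with 100 samples each) is a made-up stand-in, since the actual training-set size is not shown in this section.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical balanced labels: 11 classes x 100 samples (assumption).
y = np.repeat(np.arange(11), 100)
X = np.zeros((len(y), 1))  # feature values do not influence the split sizes

def fold_sizes(n_splits):
    """Return (train size, validation size) for every fold."""
    skf = StratifiedKFold(n_splits=n_splits)
    return [(len(train_idx), len(val_idx))
            for train_idx, val_idx in skf.split(X, y)]

sizes_5 = fold_sizes(5)
sizes_11 = fold_sizes(11)

print("5 folds  (train, validation):", sizes_5[0])
print("11 folds (train, validation):", sizes_11[0])
```

Fewer folds mean fewer fits per trial and larger validation parts, at the cost of training each model on a smaller share of the data.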


Results of the fifth optimization: in this case 5 folds worked slightly better [the best overall on the test set; accuracy 0.691 versus 0.688 for the third study, i.e. one more correctly classified sample out of 330]:

In [130]:
print('params: ', study05.best_params)

lr = model(**study05.best_params)
lr.fit(X_train_scaled, y_train)
preds05 = lr.predict(X_test_scaled)

params:  {'n_estimators': 2837, 'min_samples_split': 2, 'n_jobs': -1, 'criterion': 'entropy', 'max_features': 'auto', 'random_state': 24, 'warm_start': 'True'}


In [131]:
print("Confusion matrix :")
print(confusion_matrix(y_test, preds05))
print("Accuracy :")
print(accuracy_score(y_test, preds05))
print("F1 score :")
print(f1_score(y_test, preds05, average='macro'))
print("Precision :")
print(precision_score(y_test, preds05, average='macro'))
print("Recall :")
print(recall_score(y_test, preds05, average='macro'))

Confusion matrix :
[[22  0  4  3  1  0  0  0  0  0  0]
 [ 1 15  5  3  1  0  0  0  0  0  5]
 [ 1  0 20  2  0  4  1  0  0  2  0]
 [ 1  0  1 23  0  0  4  0  0  1  0]
 [ 0  0  0  4 19  1  0  0  0  1  5]
 [ 0  0  0  0  0 28  1  0  0  0  1]
 [ 0  1  0  1  0  0 22  1  2  0  3]
 [ 4  0  1  2  1  1  3 15  3  0  0]
 [ 0  0  3  0  0  0  1  2 20  0  4]
 [ 2  1  2  3  1  2  0  0  1 15  3]
 [ 0  0  0  0  0  1  0  0  0  0 29]]
Accuracy :
0.6909090909090909
F1 score :
0.6865970306320115
Precision :
0.7228130023541898
Recall :
0.6909090909090909


Final optimization:

In [134]:
model = ExtraTreesClassifier
scoring = {'f1_macro': make_scorer(f1_score, average='macro')}         

def get_space(trial): 
    space = {
            "n_estimators": trial.suggest_int("n_estimators", 2, 2500), #default value 100
            "n_jobs": trial.suggest_int("n_jobs", -1, -1),
            "criterion": trial.suggest_categorical("criterion", ["entropy"]), # default "gini"
            "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
        }
    return space
trials = 50

def objective(trial, model, X, y):
    model_space = get_space(trial)

    mdl = model(**model_space)
    scores = cross_validate(mdl, X, y, scoring=scoring, cv=StratifiedKFold(n_splits=11), return_train_score=True)

    return np.mean(scores['test_f1_macro'])

In [135]:
%%time
study06 = optuna.create_study(direction='maximize')
study06.optimize(lambda x: objective(x, model, X_train_scaled, y_train), n_trials=trials)

[I 2023-01-09 22:17:04,742] A new study created in memory with name: no-name-79e9b520-786a-466f-bbd9-3d794aa0fa0a
[I 2023-01-09 22:21:00,408] Trial 0 finished with value: 0.6858923102054312 and parameters: {'n_estimators': 1897, 'n_jobs': -1, 'criterion': 'entropy', 'max_features': 'log2'}. Best is trial 0 with value: 0.6858923102054312.
[I 2023-01-09 22:21:34,142] Trial 1 finished with value: 0.6810537883936397 and parameters: {'n_estimators': 337, 'n_jobs': -1, 'criterion': 'entropy', 'max_features': 'log2'}. Best is trial 0 with value: 0.6858923102054312.
[I 2023-01-09 22:26:08,427] Trial 2 finished with value: 0.6830479250268031 and parameters: {'n_estimators': 1810, 'n_jobs': -1, 'criterion': 'entropy', 'max_features': 'log2'}. Best is trial 0 with value: 0.6858923102054312.
[I 2023-01-09 22:32:10,944] Trial 3 finished with value: 0.692547271118574 and parameters: {'n_estimators': 2315, 'n_jobs': -1, 'criterion': 'entrop

KeyboardInterrupt: 

Prediction results and parameters/metrics [best cross-validation score on the training set]:

In [136]:
print('params: ', study06.best_params)

lr = model(**study06.best_params)
lr.fit(X_train_scaled, y_train)
preds06 = lr.predict(X_test_scaled)

params:  {'n_estimators': 733, 'n_jobs': -1, 'criterion': 'entropy', 'max_features': 'sqrt'}
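
The final search space does not include `random_state`, so the model refitted from `study06.best_params` is built from a different random draw than the models scored during cross-validation, and repeated runs of the cell above can give slightly different test metrics. A minimal sketch of pinning the seed at refit time follows, on synthetic data; the dataset, the split, the parameter values, and the seed 17 are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features (assumption).
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Dictionary in the shape returned by study.best_params (illustrative values).
best_params = {"n_estimators": 100, "criterion": "entropy", "max_features": "sqrt"}

def refit_and_predict(seed):
    """Refit with a fixed seed and return class probabilities on the test set."""
    clf = ExtraTreesClassifier(**best_params, random_state=seed)
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_test)

proba_a = refit_and_predict(17)
proba_b = refit_and_predict(17)
print("identical outputs with a fixed seed:", np.array_equal(proba_a, proba_b))
```

Passing the seed separately at refit time keeps it out of the tuned hyperparameters while still making the reported test metrics repeatable.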


In [137]:
print("Confusion matrix :")
print(confusion_matrix(y_test, preds06))
print("Accuracy :")
print(accuracy_score(y_test, preds06))
print("F1 score :")
print(f1_score(y_test, preds06, average='macro'))
print("Precision :")
print(precision_score(y_test, preds06, average='macro'))
print("Recall :")
print(recall_score(y_test, preds06, average='macro'))

Confusion matrix :
[[22  0  4  3  1  0  0  0  0  0  0]
 [ 1 15  5  3  1  0  0  0  0  0  5]
 [ 1  0 20  2  0  4  1  0  0  2  0]
 [ 1  1  1 22  0  0  4  0  0  1  0]
 [ 0  0  0  4 17  2  0  1  1  1  4]
 [ 0  0  0  0  0 26  1  0  0  1  2]
 [ 0  0  0  1  1  0 23  0  2  1  2]
 [ 6  2  0  1  1  0  2 13  3  1  1]
 [ 0  0  3  0  0  0  1  1 22  0  3]
 [ 2  1  2  3  2  2  0  0  1 15  2]
 [ 0  0  0  0  0  1  0  0  0  0 29]]
Accuracy :
0.6787878787878788
F1 score :
0.6719292362877295
Precision :
0.700334660804979
Recall :
0.6787878787878788
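
For a side-by-side view, the test-set metrics reported in this section can be collected in one place. The values below are copied from the outputs above and rounded to four decimals. Only the four studies shown in this section are included, and the dictionary layout is just one convenient choice.

```python
# (accuracy, macro F1, macro precision) on the test set, copied from the outputs above.
results = {
    "study03": (0.6879, 0.6840, 0.7227),
    "study04": (0.6242, 0.6175, 0.6534),
    "study05": (0.6909, 0.6866, 0.7228),
    "study06": (0.6788, 0.6719, 0.7003),
}

print(f"{'study':<10}{'accuracy':>10}{'f1_macro':>10}{'precision':>11}")
for name, (acc, f1, prec) in results.items():
    print(f"{name:<10}{acc:>10.4f}{f1:>10.4f}{prec:>11.4f}")

# Rank the studies by macro F1 on the test set.
best = max(results, key=lambda name: results[name][1])
print("best by macro F1:", best)
```

Recall is omitted from the table because, with 30 test samples per class, macro recall equals accuracy in every study.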


### Discussion in the report.

With thanks for the free use of the IRMAS dataset:

Bosch, J. J., Janer, J., Fuhrmann, F., & Herrera, P. “A Comparison of Sound Segregation Techniques for Predominant Instrument Recognition in Musical Audio Signals”, in Proc. ISMIR (pp. 559-564), 2012

The creation of this dataset was partially supported by “La Caixa” Fellowship Program, and the following projects: Classical Planet: TSI-070100-2009-407 (MITYC), DRIMS: TIN2009-14247-C02-01 (MICINN) and MIRES: EC-FP7 ICT-2011.1.5 Networked Media and Search Systems, grant agreement No. 287711. Additionally supported by TECNIO network promoted by ACC1Ó agency by the Catalan Government.