# Task 3 Overview

Task 3 covers training and evaluating several machine learning models on a dataset derived from processed satellite imagery. The analysis experiments with different data transformations, data-splitting methods, and machine learning algorithms.

Key aspects of the analysis:

1. Processing and preparing the data from the attached CSV file.
2. Implementing different methods of splitting the data into training, test, and validation sets.
3. Training and tuning several machine learning models.
4. Evaluating model performance in detail using a range of metrics.

Note that I imprecisely labelled the unprocessed data as "Normal". This label refers to the original, unmodified data; a more accurate name would be "Original" or "Unmodified".

In addition to the original data, I also experimented with log-transformed data and with a "mixed" approach in which only some features are log-transformed. The original data appeared to be skewed, so a logarithmic transformation seemed more suitable for the selected distributions. This was done mostly out of curiosity, to see how the transformations affect model performance.
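The effect of a logarithmic transformation on a right-skewed feature can be illustrated with a minimal sketch. The sample values, the helper `skewness`, and the choice of `log1p` rather than a plain logarithm are assumptions made for illustration; `zadatak_3.py` may transform the features differently.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean cubed z-score of the sample."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical right-skewed feature (not taken from the task's CSV file).
feature = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 20.0, 150.0]

# log1p(x) = log(1 + x), which stays defined for zero-valued entries.
log_feature = [math.log1p(v) for v in feature]

print(f"skewness before: {skewness(feature):.2f}")
print(f"skewness after:  {skewness(log_feature):.2f}")
```

In a "mixed" setup, the same transformation would be applied only to the features whose skewness is judged to be large, leaving the others unchanged.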

The distributions of the different datasets are shown in the following figures:
- [Distribution of the original data](normal_feature_distributions.png)
- [Distribution of the log-transformed data](log_transformed_feature_distributions.png)
- [Distribution of the mixed data](mixed_feature_distributions.png)

In [None]:
# run the main script
%run zadatak_3.py

To evaluate model performance, we used cross-validation with different data splits. We compare the results for two splits:

- 70-15-15 split (70% training, 15% validation, 15% test)
- 60-20-20 split (60% training, 20% validation, 20% test)
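A three-way split of this kind can be sketched by shuffling the row indices once and cutting them at two boundaries. The function name `three_way_split`, the seed, and the row count are hypothetical; the notebook does not show how `zadatak_3.py` performs the split (it may, for example, stratify by class).

```python
import random


def three_way_split(n_rows, train_frac, val_frac, seed=42):
    """Return shuffled (train, val, test) index lists; test gets the remainder."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_frac)
    n_val = round(n_rows * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# The two splits compared in this notebook, applied to a hypothetical 1000 rows.
for train_frac, val_frac in [(0.70, 0.15), (0.60, 0.20)]:
    train, val, test = three_way_split(1000, train_frac, val_frac)
    print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, so that all models are compared on the same rows.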




In [1]:
import pandas as pd

cv_results_70_15_15 = pd.read_csv('results/cross_validation_multi_metric_results_70_15_15_split.csv')
cv_results_60_20_20 = pd.read_csv('results/cross_validation_multi_metric_results_60_20_20_split.csv')



In [2]:
cv_results_70_15_15


Unnamed: 0,Model,Data Type,CV Accuracy,CV Accuracy Std,CV F1 (Weighted),CV F1 (Weighted) Std,CV F1 (Macro),CV F1 (Macro) Std,CV ROC AUC,CV ROC AUC Std,CV Precision,CV Precision Std,CV Recall,CV Recall Std
0,KNN,Log,0.999228,0.000197,0.999229,0.000197,0.998788,0.000362,0.999999,2.545483e-07,0.99923,0.000196,0.999228,0.000197
1,KNN,Mixed,0.999228,0.000197,0.999229,0.000197,0.998788,0.000362,0.999999,2.545483e-07,0.99923,0.000196,0.999228,0.000197
2,KNN,Normal,0.999288,0.000173,0.999288,0.000173,0.998887,0.000303,0.999915,0.0001029853,0.999289,0.000172,0.999288,0.000173
3,LightGBM,Log,0.999288,0.000173,0.999288,0.000173,0.998912,0.000307,0.999997,2.016573e-06,0.999289,0.000172,0.999288,0.000173
4,LightGBM,Mixed,0.999288,0.000173,0.999288,0.000173,0.998912,0.000307,0.999997,2.016573e-06,0.999289,0.000172,0.999288,0.000173
5,LightGBM,Normal,0.999377,0.000145,0.999377,0.000145,0.999035,0.000317,0.999996,4.776803e-06,0.999378,0.000145,0.999377,0.000145
6,Logistic Regression,Log,0.997121,0.000178,0.997123,0.000178,0.995376,0.000372,0.999942,1.564464e-05,0.997128,0.00018,0.997121,0.000178
7,Logistic Regression,Mixed,0.997121,0.000178,0.997123,0.000178,0.995376,0.000372,0.999942,1.564464e-05,0.997128,0.00018,0.997121,0.000178
8,Logistic Regression,Normal,0.997211,5.9e-05,0.997212,6.1e-05,0.995532,0.000159,0.999936,1.952522e-05,0.997219,6.5e-05,0.997211,5.9e-05
9,Random Forest,Log,0.999021,0.000151,0.999021,0.000151,0.998516,0.000323,0.999997,1.35471e-06,0.999022,0.00015,0.999021,0.000151


In [3]:
cv_results_60_20_20

Unnamed: 0,Model,Data Type,CV Accuracy,CV Accuracy Std,CV F1 (Weighted),CV F1 (Weighted) Std,CV F1 (Macro),CV F1 (Macro) Std,CV ROC AUC,CV ROC AUC Std,CV Precision,CV Precision Std,CV Recall,CV Recall Std
0,KNN,Log,0.999238,0.000177,0.999238,0.000176,0.998817,0.000248,0.999901,9.2e-05,0.99924,0.000175,0.999238,0.000177
1,KNN,Mixed,0.999238,0.000177,0.999238,0.000176,0.998817,0.000248,0.999901,9.2e-05,0.99924,0.000175,0.999238,0.000177
2,KNN,Normal,0.999273,0.00017,0.999273,0.000169,0.998875,0.000279,0.999853,0.000142,0.999275,0.000168,0.999273,0.00017
3,LightGBM,Log,0.999238,0.000177,0.999238,0.000176,0.998846,0.000288,0.999998,1e-06,0.999239,0.000176,0.999238,0.000177
4,LightGBM,Mixed,0.999238,0.000177,0.999238,0.000176,0.998846,0.000288,0.999998,1e-06,0.999239,0.000176,0.999238,0.000177
5,LightGBM,Normal,0.999238,0.000139,0.999238,0.000138,0.998846,0.000224,0.999998,1e-06,0.999239,0.000139,0.999238,0.000139
6,Logistic Regression,Log,0.996988,0.001036,0.996989,0.001037,0.995212,0.001698,0.999936,4e-05,0.996994,0.001039,0.996988,0.001036
7,Logistic Regression,Mixed,0.996988,0.001036,0.996989,0.001037,0.995212,0.001698,0.999936,4e-05,0.996994,0.001039,0.996988,0.001036
8,Logistic Regression,Normal,0.997126,0.000907,0.997127,0.000908,0.995449,0.00145,0.999936,4.2e-05,0.997133,0.00091,0.997126,0.000907
9,Random Forest,Log,0.998996,0.000642,0.998996,0.000642,0.998558,0.000988,0.999997,2e-06,0.998998,0.000641,0.998996,0.000642
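One way to read the two tables side by side is to compare the mean CV accuracy of each model across the splits against its reported standard deviation. The sketch below uses the "Normal" rows copied from the tables above; the "within one standard deviation" check is an assumed rule of thumb, not a formal significance test, and the variable names are hypothetical.

```python
# (mean CV accuracy, std) copied from the "Normal" rows of the two tables above.
split_70_15_15 = {
    "KNN": (0.999288, 0.000173),
    "LightGBM": (0.999377, 0.000145),
    "Logistic Regression": (0.997211, 0.000059),
}
split_60_20_20 = {
    "KNN": (0.999273, 0.000170),
    "LightGBM": (0.999238, 0.000139),
    "Logistic Regression": (0.997126, 0.000907),
}

comparison = {}
for model, (acc_a, std_a) in split_70_15_15.items():
    acc_b, std_b = split_60_20_20[model]
    diff = acc_a - acc_b
    # Assumed rule of thumb: a gap no larger than the bigger std is treated as noise.
    within_noise = abs(diff) <= max(std_a, std_b)
    comparison[model] = (diff, within_noise)
    print(f"{model}: diff={diff:+.6f}, within one std: {within_noise}")
```

With the loaded DataFrames, the same comparison could be made by merging `cv_results_70_15_15` and `cv_results_60_20_20` on the `Model` and `Data Type` columns.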
