# NSU, exercises 2: meta-learning

### A: The OpenML website and the **openml** library, data preparation

A.1 Get the data from OpenML (https://www.openml.org/).
Find all datasets that have between 100 and 200 instances and between 4 and 100 features. Use the function **openml.datasets.list_datasets** and pass it the arguments
- number_instances (e.g. number_instances="\<x>..\<y>", where x and y are the smallest and largest allowed number of instances, e.g. "10..20")
- number_features (analogous to number_instances)
- output_format="dataframe"

and inspect the resulting pandas DataFrame. One of its columns is did (data ID).

The OpenML functions are described in the documentation at https://docs.openml.org/Python-API/
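
The range arguments act as filters on the dataset metadata. As a rough offline illustration (not the OpenML implementation), the same selection can be written against a small hand-made table. The helper `in_range` and the row `too_small` are hypothetical, and the bounds are assumed to be inclusive.

```python
import pandas as pd

# Hand-made stand-in for the metadata table; `too_small` is a hypothetical row.
meta = pd.DataFrame({
    "did": [10, 61, 62, -1],
    "name": ["lymph", "iris", "zoo", "too_small"],
    "NumberOfInstances": [148, 150, 101, 40],
    "NumberOfFeatures": [19, 5, 17, 3],
})

def in_range(column, spec):
    """Return a boolean mask for an "<x>..<y>" range (bounds assumed inclusive)."""
    low, high = (int(part) for part in spec.split(".."))
    return column.between(low, high)

mask = in_range(meta["NumberOfInstances"], "100..200") & in_range(meta["NumberOfFeatures"], "4..100")
selected = meta[mask]
print(selected["name"].tolist())
```

With the real call, this filtering happens on the server, so only the matching rows are returned.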

In [4]:
import numpy as np
import pandas as pd
import openml as oml

In [5]:
df = oml.datasets.list_datasets(number_instances = "100..200", number_features="4..100", output_format="dataframe")
df

Unnamed: 0,did,name,version,uploader,status,format,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures
10,10,lymph,1,1,active,ARFF,81.0,8.0,2.0,4.0,19.0,148.0,0.0,0.0,3.0,16.0
48,48,tae,1,1,active,ARFF,52.0,3.0,49.0,3.0,6.0,151.0,0.0,0.0,3.0,3.0
55,55,hepatitis,1,1,active,ARFF,123.0,2.0,32.0,2.0,20.0,155.0,75.0,167.0,6.0,14.0
61,61,iris,1,1,active,ARFF,50.0,3.0,50.0,3.0,5.0,150.0,0.0,0.0,4.0,1.0
62,62,zoo,1,1,active,ARFF,41.0,7.0,4.0,7.0,17.0,101.0,0.0,0.0,1.0,16.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43839,43839,IRIS-flower-dataset,1,30125,active,arff,,,,,5.0,150.0,0.0,0.0,4.0,0.0
43859,43859,iriiiiiis,2,28351,active,ARFF,50.0,,50.0,3.0,5.0,150.0,0.0,0.0,4.0,1.0
44151,44151,Iris,51,30495,active,arff,50.0,,49.0,3.0,5.0,149.0,0.0,0.0,4.0,0.0
44154,44154,iris_reproduced,1,30495,active,arff,50.0,,50.0,3.0,5.0,150.0,0.0,0.0,4.0,1.0


A.2 Load these datasets with the function **openml.datasets.get_datasets**, using the dataset IDs from the previous step. The first call may take a few minutes. Because the downloaded datasets are also stored on disk, every subsequent call will be considerably faster.

In [6]:
did = df['did']
did

10          10
48          48
55          55
61          61
62          62
         ...  
43839    43839
43859    43859
44151    44151
44154    44154
44344    44344
Name: did, Length: 252, dtype: int64

In [7]:
# for 4 to 10 features
#podatkovja = oml.datasets.get_datasets(dataset_ids=did)

In [8]:
# for 4 to 100 features
podatkovja2 = oml.datasets.get_datasets(dataset_ids=did)

A.3 Remove from the collection the datasets that are not suitable for classification. You can use the call **podatkovje.get_data()**, which returns

- x: a pandas DataFrame of features, which also includes the target variable if the argument target is not given
- y: a column with the values of the target variable if the argument target is given, and None otherwise
- nominalni: a list of True/False values that tells whether the i-th attribute is nominal
- atributi: a list of attribute names

The target variable of a dataset is available as **podatkovje.default_target_attribute**.

From all the datasets from the previous task, remove those that have
- an unknown (None) or numeric target variable,
- more than one target variable.

Make sure you keep only one version of each dataset: "iris", for example, appears more than 40 times. The name of a dataset is stored in the field "name" (so we access it with "podatkovje.name").
Because scikit-learn does not support nominal features directly, also remove all datasets that contain nominal features.
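
The filters described above can be sketched as a single predicate and tried on toy records before running them on the downloaded datasets. The function `keep_dataset` and the records are hypothetical. The comma test assumes that multiple targets are listed comma-separated in `default_target_attribute`, which is how the notebook code below treats them.

```python
def keep_dataset(name, target, attribute_names, nominal_flags, seen_names):
    """Return True if the dataset passes the filters described above."""
    if name in seen_names:                      # keep one version per name
        return False
    if target is None or "," in target:         # unknown or multiple targets
        return False
    if target not in attribute_names:           # target column missing from the data
        return False
    target_idx = attribute_names.index(target)
    if not nominal_flags[target_idx]:           # numeric target: not classification
        return False
    # every column other than the target must be numeric
    return sum(nominal_flags) == 1

# Toy records (hypothetical, not downloaded from OpenML):
records = [
    ("iris", "class", ["a", "b", "class"], [False, False, True]),
    ("iris", "class", ["a", "b", "class"], [False, False, True]),   # duplicate name
    ("housing", "price", ["a", "price"], [False, False]),           # numeric target
    ("multi", "y1,y2", ["a", "y1", "y2"], [False, True, True]),     # two targets
    ("zoo_like", "type", ["legs", "hair", "type"], [False, True, True]),  # nominal feature
    ("no_target", None, ["a", "b"], [False, False]),
]

kept = []
for name, target, attrs, flags in records:
    if keep_dataset(name, target, attrs, flags, kept):
        kept.append(name)
print(kept)
```

Only the first `iris` record survives; each of the other records fails exactly one of the rules.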


In [9]:
import pickle

with open("podatki.dat", "wb") as f:
    pickle.dump(podatkovja2, f)
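
The pickled list can be read back later instead of downloading everything again. The sketch below shows the round trip with a stand-in list of dictionaries. With the real dataset objects, loading presumably also requires a compatible version of openml to be installed.

```python
import os
import pickle
import tempfile

# Stand-in for the list of downloaded datasets (the real list holds OpenML dataset objects).
datasets = [{"name": "iris", "did": 61}, {"name": "zoo", "did": 62}]

path = os.path.join(tempfile.mkdtemp(), "podatki.dat")
with open(path, "wb") as f:
    pickle.dump(datasets, f)

# Later, e.g. in a fresh session: read the list back instead of downloading again.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == datasets)
```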

In [10]:
import copy

podatkovja_copy = copy.deepcopy(podatkovja2)
print(len(podatkovja2))
print(len(podatkovja_copy))

252
252


In [11]:
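odstrani = []  # indices of datasets to drop
imena = []     # names of datasets already kept
for i in range(len(podatkovja2)):
    ime = podatkovja2[i].name
    target = podatkovja2[i].default_target_attribute
    # Without a target argument, get_data() returns (X, y, nominal flags, attribute names)
    data = podatkovja2[i].get_data()
    znacilke = data[2]

    # Drop duplicates, datasets with more than one nominal column,
    # and datasets with a missing or multiple target.
    if ime in imena or sum(znacilke) > 1 or target is None or "," in target:
        odstrani.append(i)
        continue
    # The only nominal column allowed is the target itself.
    idx = data[3].index(target)
    if not data[2][idx]:
        odstrani.append(i)
    else:
        imena.append(ime)

# Delete from the end so the remaining indices stay valid.
for i in sorted(odstrani, reverse=True):
    del podatkovja_copy[i]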


In [12]:
n_vsa = len(podatkovja2)
n_ok = len(podatkovja_copy)
print(f"Kept {n_ok} datasets ({100 * n_ok / n_vsa:.1f}%)")

Kept 52 datasets (20.6%)


### B - Preparing the target variables for meta-learning
Our target variable will be the performance of individual machine learning methods on different datasets.

On all suitable datasets, run the models kNN (**sklearn.neighbors.KNeighborsClassifier**), decision tree (**sklearn.tree.DecisionTreeClassifier**) and naive Bayes (**sklearn.naive_bayes.GaussianNB**).
For a meaningful estimate of quality, split each dataset into a training and a test set. Put 1/4 of the instances into the test set.
Measure the quality of the predictions with accuracy.
If the code crashes during training or prediction, this may mean that something is wrong with the data and that it should have been filtered out earlier.
Store the results in a table with the columns

**ime podatkovja,knn,drevo,bayes,najboljsi**

where columns 2-4 give the accuracies of the models on the test subset of the dataset, and the column najboljsi gives the name of the best model. A pandas DataFrame can easily be saved to a CSV file with the function **df.to_csv**.
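
The task asks for the name of the best model. One possible way to get it, sketched here on three rows of accuracies copied from the results table in this notebook, is a row-wise `idxmax` over the accuracy columns. It returns the column label and picks the first one on ties.

```python
import pandas as pd

# Accuracies copied from three rows of the results table in this notebook.
rezultati = pd.DataFrame({
    "ime podatkovja": ["zoo", "wine", "hayes-roth"],
    "knn": [0.923077, 0.711111, 0.55],
    "drevo": [0.923077, 0.955556, 0.85],
    "bayes": [0.961538, 1.0, 0.75],
})

# idxmax returns the column label of the row-wise maximum (first one on ties).
rezultati["najboljsi"] = rezultati[["knn", "drevo", "bayes"]].idxmax(axis=1)
print(rezultati)

csv_text = rezultati.to_csv(index=False)  # pass a path instead to write a file
```

Storing the label rather than the estimator object keeps the CSV file readable and easy to use as a class label later.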

In [13]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(5)
drevo = DecisionTreeClassifier()
bayes = GaussianNB()
modeli = [knn, drevo, bayes]
vrstice = []  # one row of results per dataset

for podatkovje in podatkovja_copy:
    ime = podatkovje.name
    podatki = podatkovje.get_data()[0]
    target = podatkovje.default_target_attribute
    X = podatki.drop(target, axis=1)
    y = podatki[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    tocnosti = []
    for model in modeli:
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        tocnosti.append(accuracy_score(y_test, predictions))
    # On ties, the first model in the list wins.
    najboljsi = modeli[tocnosti.index(max(tocnosti))]
    vrstice.append({"ime podatkovja": ime, "knn": tocnosti[0], "drevo": tocnosti[1],
                    "bayes": tocnosti[2], "najboljsi": najboljsi})

# DataFrame.append is deprecated (removed in pandas 2.0), so build the table in one step.
rezultati = pd.DataFrame(vrstice, columns=["ime podatkovja", "knn", "drevo", "bayes", "najboljsi"])
rezultati


Unnamed: 0,ime podatkovja,knn,drevo,bayes,najboljsi
0,iris,1.0,1.0,1.0,KNeighborsClassifier()
1,zoo,0.923077,0.923077,0.961538,GaussianNB()
2,wine,0.711111,0.955556,1.0,GaussianNB()
3,hayes-roth,0.55,0.85,0.75,DecisionTreeClassifier()
4,fri_c3_100_50,0.48,0.68,0.52,DecisionTreeClassifier()
5,pwLinear,0.76,0.84,0.76,DecisionTreeClassifier()
6,fri_c2_100_5,0.8,0.64,0.72,KNeighborsClassifier()
7,visualizing_environmental,0.642857,0.75,0.678571,DecisionTreeClassifier()
8,wisconsin,0.530612,0.530612,0.612245,GaussianNB()
9,fri_c0_100_5,0.84,0.84,0.88,GaussianNB()


In [14]:
rezultati.to_csv("meta_target.csv", index=False)

### C - Preparing the predictor variables for meta-learning

The package pymfe (https://pypi.org/project/pymfe/) will help us prepare the meta-features.
The most useful tool is **MFE**, whose argument **groups** specifies which types of meta-features we want.
After setting it up, we also have to call its functions **fit** and **extract**.

First, let us try pymfe on one of the datasets in our collection.

C.1 Compute meta-features from the data (feature types "general" and "info-theory"). Inspect the resulting features.
You can also try other options, e.g. "statistical".

A warning that some features cannot be computed is nothing unusual.

In [64]:
from pymfe.mfe import MFE

testno_podatkovje = podatkovja_copy[0]
target_tp = testno_podatkovje.default_target_attribute
data_tp = testno_podatkovje.get_data()[0]

X = np.array(data_tp.drop(target_tp, axis=1))
y = np.array(data_tp[target_tp])

mfe = MFE(groups=["general", "info-theory"])
mfe.fit(X, y)
names, values = mfe.extract()
print(list(zip(names, values)))

[('attr_conc.mean', 0.20922243392447848), ('attr_conc.sd', 0.11995019623784713), ('attr_ent.mean', 2.279010448380773), ('attr_ent.sd', 0.05742641939520074), ('attr_to_inst', 0.02666666666666667), ('cat_to_num', 0.0), ('class_conc.mean', 0.27232594204286165), ('class_conc.sd', 0.14258948387896486), ('class_ent', 1.584962500721156), ('eq_num_attr', 1.8824081456478539), ('freq_class.mean', 0.3333333333333333), ('freq_class.sd', 0.0), ('inst_to_attr', 37.5), ('joint_ent.mean', 3.0219863138519676), ('joint_ent.sd', 0.38738046925935415), ('mut_inf.mean', 0.8419866352499615), ('mut_inf.sd', 0.42517984298139233), ('nr_attr', 4), ('nr_bin', 0), ('nr_cat', 0), ('nr_class', 3), ('nr_inst', 150), ('nr_num', 4), ('ns_ratio', 1.7067062028890763), ('num_to_cat', nan)]
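
A few of the values above can be checked by hand, which helps in reading the rest of the list. The sketch below assumes that `class_ent` is the Shannon entropy (in bits) of the class distribution and that the ratio features are plain quotients of the counts. The printed values are consistent with that reading, but it is not taken from the pymfe source.

```python
import numpy as np

# iris: 150 instances, 4 numeric attributes, 3 classes of 50 instances each
n_inst, n_attr = 150, 4
y = np.repeat([0, 1, 2], 50)

class_freq = np.bincount(y) / n_inst                    # relative class frequencies
class_ent = -np.sum(class_freq * np.log2(class_freq))   # Shannon entropy in bits

print(n_attr / n_inst)     # attr_to_inst
print(n_inst / n_attr)     # inst_to_attr
print(class_freq.mean())   # freq_class.mean
print(class_ent)           # class_ent, log2(3) for three balanced classes
```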


C.2 Compute meta-features from a simple model: a decision tree trained on the whole dataset. We create and train the model ourselves, and then obtain the features from the trained tree with **MFE**, the setting "model-based" and the function **extract_from_model**.

In [65]:
drevo = DecisionTreeClassifier()
drevo.fit(X,y)
mfe2 = MFE(groups = "model-based")
names2, values2 = mfe2.extract_from_model(drevo)
print(list(zip(names2, values2)))

[('leaves', 9), ('leaves_branch.mean', 3.7777777777777777), ('leaves_branch.sd', 1.2018504251546631), ('leaves_corrob.mean', 0.1111111111111111), ('leaves_corrob.sd', 0.15051762539834182), ('leaves_homo.mean', 37.46666666666667), ('leaves_homo.sd', 13.142298124757328), ('leaves_per_class.mean', 0.3333333333333333), ('leaves_per_class.sd', 0.22222222222222224), ('nodes', 8), ('nodes_per_attr', 2.0), ('nodes_per_inst', 0.05333333333333334), ('nodes_per_level.mean', 1.6), ('nodes_per_level.sd', 0.8944271909999159), ('nodes_repeated.mean', 2.0), ('nodes_repeated.sd', 1.4142135623730951), ('tree_depth.mean', 3.0588235294117645), ('tree_depth.sd', 1.4348601079588785), ('tree_imbalance.mean', 0.19491705385114735), ('tree_imbalance.sd', 0.13300709991513865), ('tree_shape.mean', 0.2708333333333333), ('tree_shape.sd', 0.10711960313126631), ('var_importance.mean', 0.25), ('var_importance.sd', 0.4487534065700905)]
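
In the output above, `leaves` is 9 and `nodes` is 8. This is consistent with `nodes` counting only the internal nodes, since a binary tree in which every internal node has two children has one internal node fewer than leaves. The sketch below reads the same basic counts directly from a scikit-learn tree. It uses `load_iris` from scikit-learn as a stand-in for the OpenML copy of iris, so the exact numbers may differ.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
drevo = DecisionTreeClassifier(random_state=0).fit(X, y)

n_leaves = drevo.get_n_leaves()
n_nodes_total = drevo.tree_.node_count      # internal nodes + leaves
n_internal = n_nodes_total - n_leaves

print("leaves:", n_leaves)
print("internal nodes:", n_internal)
print("max depth:", drevo.get_depth())
```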


C.3 Compute all three kinds of features for all datasets in our collection and store them in a table. Build a table with the columns

**name,znacilka1,znacilka2,...,znacilkaN**,

where N is the number of features (the feature columns are not called znacilkaI, but carry the actual names of the computed features).
The table may contain missing values. Also save the table to a file.
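
One way to allow missing values in such a table is to collect one dictionary per dataset and let pandas align the keys. The rows below are hypothetical and only illustrate that a feature absent for one dataset becomes NaN.

```python
import pandas as pd

# Hypothetical per-dataset results; "b" lacks one feature.
rows = [
    {"name": "a", "nr_attr": 4, "class_ent": 1.58},
    {"name": "b", "nr_attr": 16},
]

tabela = pd.DataFrame(rows)   # columns are the union of all keys
print(tabela)
print(tabela.isna().sum())
```

Because values are matched to columns by name, a feature that cannot be computed for one dataset does not shift the remaining values into the wrong columns.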

In [66]:
drevo = DecisionTreeClassifier()
vrstice = []  # one row of meta-features per dataset
for podatkovje in podatkovja_copy:
    target = podatkovje.default_target_attribute
    data = podatkovje.get_data()[0]
    ime = podatkovje.name

    X = np.array(data.drop(target, axis=1))
    y = np.array(data[target])

    # Meta-features computed directly from the data
    mfe = MFE(groups=["general", "info-theory"])
    mfe.fit(X, y)
    names, values = mfe.extract()

    # Meta-features read from a decision tree fitted on the whole dataset
    mfe2 = MFE(groups=["model-based"])
    drevo.fit(X, y)
    names2, values2 = mfe2.extract_from_model(drevo)

    # Pair each value with its own name, so a feature missing for one dataset
    # becomes NaN instead of shifting the columns.
    vrstica = {"ime_podatkovja": ime}
    vrstica.update(zip(names + names2, values + values2))
    vrstice.append(vrstica)

# DataFrame.append is deprecated (removed in pandas 2.0), so build the table in one step.
rezultati = pd.DataFrame(vrstice)
rezultati

Unnamed: 0,ime_podatkovja,attr_conc.mean,attr_conc.sd,attr_ent.mean,attr_ent.sd,attr_to_inst,cat_to_num,class_conc.mean,class_conc.sd,class_ent,...,nodes_repeated.mean,nodes_repeated.sd,tree_depth.mean,tree_depth.sd,tree_imbalance.mean,tree_imbalance.sd,tree_shape.mean,tree_shape.sd,var_importance.mean,var_importance.sd
0,iris,0.209222,0.11995,2.27901,0.05742642,0.026667,0.0,0.272326,0.142589,1.584963,...,2.666667,1.154701,3.058824,1.43486,0.194917,0.133007,0.270833,0.10712,0.25,0.448885
1,zoo,0.139005,0.205325,0.761123,0.3740142,0.158416,0.0,0.593161,0.304557,2.39056,...,1.125,0.353553,4.105263,2.131633,0.133181,0.127898,0.207813,0.179487,0.0625,0.106235
2,wine,0.071499,0.047768,2.317306,0.008828109,0.073034,0.0,0.152791,0.071452,1.566822,...,1.833333,1.169045,2.956522,1.260529,0.238917,0.173057,0.286458,0.085148,0.076923,0.129153
3,hayes-roth,0.036065,0.030615,1.737425,0.1297673,0.025,0.0,0.077131,0.048188,1.515466,...,6.0,1.632993,6.734694,2.643983,0.088969,0.119795,0.088047,0.145569,0.25,0.17689
4,fri_c3_100_50,0.031661,0.014676,2.0,0.0,0.5,0.0,0.013175,0.015654,0.958042,...,1.25,0.46291,2.857143,1.195229,0.258001,0.082751,0.295455,0.084275,0.02,0.076174
5,pwLinear,0.009696,0.008027,1.517821,0.183333,0.05,0.0,0.060243,0.133851,0.999351,...,3.5,1.95789,4.929577,1.823058,0.117964,0.117632,0.125326,0.091195,0.1,0.122995
6,fri_c2_100_5,0.089493,0.106381,2.0,0.0,0.05,0.0,0.042889,0.040935,0.970951,...,3.0,1.224745,4.064516,2.048341,0.118626,0.114936,0.208496,0.142942,0.2,0.159474
7,visualizing_environmental,0.129066,0.088687,1.998113,0.002195144,0.027027,0.0,0.043505,0.028962,0.998536,...,10.666667,1.527525,6.123077,2.85322,0.077573,0.096983,0.111313,0.129573,0.333333,0.166284
8,wisconsin,0.10313,0.14251,2.306292,0.08090571,0.164948,0.0,0.008079,0.005075,0.99624,...,1.842105,1.014515,5.380282,2.225606,0.104014,0.111756,0.118544,0.107807,0.03125,0.040348
9,fri_c0_100_5,0.02496,0.009571,2.0,0.0,0.05,0.0,0.041975,0.022654,0.995378,...,3.4,1.516575,4.685714,2.609614,0.092267,0.112422,0.185655,0.149728,0.2,0.125132


In [67]:
rezultati.to_csv("meta_features.csv", index=False)