# Naive Bayes algorithm

The classification problem can be framed as estimating the conditional probability of the target variable $y$ given the attributes $X$, written $p(y|X)$ (as usual, $X$ is shorthand for $X_1, X_2, ..., X_n$).

By the definition of conditional probability:
$$p(y|X) = \frac{p(X, y)}{p(X)}$$

How would you estimate this probability for a given data set?
We can try counting: (the number of times the given value of $y$ occurs together with the given values of all attributes $X_i$) divided by (the number of times the given values of all attributes $X_i$ occur together). But the more attributes $X_i$ we have, the less meaningful the counting becomes, because we will not have enough instances covering all possible combinations to estimate the probability reliably. We have to make the problem easier.
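The counting estimate described above can be sketched as follows. This is only an illustration: the `toy` frame and the helper `count_estimate` are hypothetical and not part of the notebook. It shows that the estimate is undefined as soon as a combination of attribute values never occurs in the data.

```python
import pandas as pd

# Hypothetical toy data: two binary attributes and a binary target.
# The combination (x1=1, x2=0) never occurs.
toy = pd.DataFrame({
    "x1": [0, 0, 1, 0, 1],
    "x2": [0, 1, 1, 0, 1],
    "y":  ["T", "F", "T", "T", "F"],
})

def count_estimate(df, x, y_value):
    """Estimate p(y | X = x) as count(X = x, y) / count(X = x)."""
    mask = pd.Series(True, index=df.index)
    for name, value in x.items():
        mask &= df[name] == value          # rows matching every attribute value
    n_x = mask.sum()
    if n_x == 0:
        return None                        # combination never observed
    n_xy = (mask & (df["y"] == y_value)).sum()
    return n_xy / n_x

print(count_estimate(toy, {"x1": 0, "x2": 0}, "T"))
print(count_estimate(toy, {"x1": 1, "x2": 0}, "T"))
```

With $n$ binary attributes there are $2^n$ combinations to cover, so the unobserved case quickly becomes the norm rather than the exception.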

The symmetric formula for the conditional probability of $X$ given $y$ is:
$$p(X|y) = \frac{p(X, y)}{p(y)}$$
Substituting it into $p(y|X)$ gives
$$p(y|X) = \frac{p(X|y)p(y)}{p(X)}$$
which is Bayes' formula. We now have three probabilities to estimate in order to get the one we care about. Of these three, $p(y)$ is easy: since it involves a single variable, we can estimate it by counting. The other two are hard because we have many attributes $X_i$. We therefore simplify the problem with one _naive_ assumption: all attributes $X_i$ are conditionally independent given $y$, i.e.
$$p(X|y) = \prod_{i=1}^{n}{p(X_i|y)}$$
Each individual conditional probability $p(X_i|y)$ is again easy to estimate by counting, because it involves only one attribute $X_i$ and the target variable $y$.
Substituting this into Bayes' formula, we get
$$p(y|X) = \frac{\prod_{i=1}^{n}{p(X_i|y)}p(y)}{p(X)}$$
Now we can easily estimate the whole numerator. Why can't we also factor $p(X) = \prod_{i=1}^{n}{p(X_i)}$? Because we only assumed that the attributes are conditionally independent given $y$, and conditional independence does not imply unconditional independence. So the denominator remains hard. However, since our goal is to find the most probable class, we do not need the denominator at all, because it does not depend on the class $y$.
$$p(y|X) \propto \prod_{i=1}^{n}{p(X_i|y)}p(y)$$
$$\hat{y} = \arg\max_{y}{p(y|X)} = \arg\max_{y}{\frac{\prod_{i=1}^{n}{p(X_i|y)}p(y)}{p(X)}} = \arg\max_{y}\prod_{i=1}^{n}{p(X_i|y)}p(y)$$
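The decision rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how scikit-learn implements it: the data are made up, the helpers `fit_counts` and `predict` are hypothetical names, and no smoothing is applied, so unseen attribute values give a score of zero.

```python
from collections import Counter, defaultdict

def fit_counts(rows, labels):
    """Collect the counts needed to estimate p(y) and p(X_i | y)."""
    class_counts = Counter(labels)
    # value_counts[(i, c)][v] = number of class-c rows whose attribute i equals v
    value_counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, c)][v] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    """Return argmax_y p(y) * prod_i p(x_i | y); p(X) is never computed."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                           # p(y = c)
        for i, v in enumerate(x):
            score *= value_counts[(i, c)][v] / n_c    # p(X_i = v | y = c)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Made-up training data: (color, size) -> inflated
rows = [("YELLOW", "SMALL"), ("YELLOW", "LARGE"), ("PURPLE", "SMALL"),
        ("PURPLE", "LARGE"), ("YELLOW", "SMALL")]
labels = ["T", "F", "F", "F", "T"]

counts = fit_counts(rows, labels)
label, scores = predict(("YELLOW", "SMALL"), *counts)
print(label, scores)
```

The scores are unnormalised: they are proportional to $p(y|X)$, which is all the argmax needs.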


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../balloons.csv')

In [3]:
df.head()

Unnamed: 0,color,size,act,age,inflated
0,YELLOW,SMALL,STRETCH,ADULT,T
1,YELLOW,SMALL,STRETCH,ADULT,T
2,YELLOW,SMALL,STRETCH,CHILD,F
3,YELLOW,SMALL,DIP,ADULT,F
4,YELLOW,SMALL,DIP,CHILD,F


In [4]:
X = df.drop('inflated', axis=1)
y = df['inflated']

In [5]:
X.shape

(76, 4)

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=123)

In [8]:
X_train.shape

(53, 4)

In [13]:
X_test.shape

(23, 4)

Before using the _CategoricalNB_ model, we have to convert each categorical attribute with $n_i$ categories into the numbers $0,1,...,n_i-1$.

In [14]:
from sklearn.preprocessing import OrdinalEncoder

In [16]:
oe = OrdinalEncoder()

As usual, we fit only on the training set.

In [17]:
oe.fit(X_train)

In [20]:
X_train = oe.transform(X_train)

In [19]:
X_test = oe.transform(X_test)
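To see what the encoder does, here is a small sketch on made-up data (the frames `toy_train` and `toy_test` are hypothetical). Each column's categories are sorted and mapped to $0,1,...,n_i-1$, and the mapping learned on the training set is reused for the test set.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames standing in for X_train / X_test.
toy_train = pd.DataFrame({"color": ["YELLOW", "PURPLE", "YELLOW"],
                          "size": ["SMALL", "LARGE", "LARGE"]})
toy_test = pd.DataFrame({"color": ["PURPLE"], "size": ["SMALL"]})

encoder = OrdinalEncoder()
encoder.fit(toy_train)                  # learn the categories from training data only
print(encoder.categories_)              # one sorted array of categories per column

encoded_test = encoder.transform(toy_test)
print(encoded_test)
```

Note that with the default settings, a test value that never appeared in the training data makes `transform` raise an error.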

Each class in the _naive_bayes_ module models $p(X_i|y)$ in a different way. Everything else is shared between them.

_CategoricalNB_ uses the following formula:
$$p(X_i=t|y=c) = \frac{N_{tic} + \alpha}{N_c + \alpha n_i}$$
where $N_{tic}$ is the number of instances of class $c$ in which attribute $X_i$ has the value $t$, and
$N_c$ is the number of instances of class $c$. The $\alpha$ in the formula is just an additive term that keeps us from running into problems with probabilities of $0$ and $1$.

The unsmoothed formula $$p(X_i=t|y=c) = \frac{N_{tic}}{N_c}$$
makes sense, but it can easily happen that the training set contains no instance of some class $c$ in which attribute $X_i$ equals some $t$. Then $N_{tic} = 0$ and the whole probability $p(y|X)$ becomes $0$ because of the multiplication by zero. We do not want that, since a single attribute then wipes out the entire probability. To solve this problem we use _smoothing_, that is, we leave a small probability for things that do not appear in the training set.
For example, suppose we have to use a training set to estimate the probability that the Sun will rise tomorrow. All $N$ instances in the training set have the same value: the Sun rose. The estimated probability is therefore $1$. That is too absolute, so if we start from the assumption that the Sun rises with probability $0.5$ on any given day, we can do the following:
$$p(\text{rises tomorrow}) = \frac{N+1}{N+2}$$
We add $1$ to the numerator for the option that the Sun rises tomorrow, and $2$ to the denominator because there are two options in total: it rises or it does not. In the general case:
$$\frac{N+1}{N+k}$$ where $k$ is the number of categories. The _CategoricalNB_ formula above has exactly this form, with the added parameter $\alpha$.
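The smoothing formula is easy to check numerically. The helper `smoothed_probability` below is a hypothetical name used only for this sketch, and $N = 100$ observed sunrises is an arbitrary choice.

```python
def smoothed_probability(n_tic, n_c, n_categories, alpha=1.0):
    """(N_tic + alpha) / (N_c + alpha * n_i), the smoothed estimate of p(X_i = t | y = c)."""
    return (n_tic + alpha) / (n_c + alpha * n_categories)

N = 100  # arbitrary number of days on which the Sun was observed to rise

p_rises = smoothed_probability(n_tic=N, n_c=N, n_categories=2)      # (N + 1) / (N + 2)
p_not_rises = smoothed_probability(n_tic=0, n_c=N, n_categories=2)  # 1 / (N + 2)

print(p_rises, p_not_rises, p_rises + p_not_rises)

# With alpha = 0 we are back to plain counting, and the unseen event gets probability 0.
print(smoothed_probability(0, N, 2, alpha=0.0))
```

The two smoothed probabilities still sum to one, which is why $\alpha n_i$ (and not just $\alpha$) is added to the denominator.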

In [35]:
from sklearn.naive_bayes import CategoricalNB

In [29]:
model = CategoricalNB()

In [30]:
model.fit(X_train, y_train)

Since all four input attributes and the target variable are binary, _category_count__ holds four $2\times2$ matrices with the co-occurrence counts of $X_i$ and $y$, which is essentially what we need to compute the probability $p(X_i|y)$.

In [31]:
model.category_count_

[array([[18., 11.],
        [ 8., 16.]]),
 array([[17., 12.],
        [ 6., 18.]]),
 array([[20.,  9.],
        [ 7., 17.]]),
 array([[11., 18.],
        [15.,  9.]])]

The attribute _class_count__ holds the number of instances per class, i.e. the unnormalized probabilities $p(y)$.

In [34]:
model.class_count_

array([29., 24.])

From these two quantities we compute the probability $p(y|X)$ and predict the class.
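As a rough sketch of that computation, the counts printed above can be plugged into the smoothed formula by hand. The arrays below are copied from the outputs of `category_count_` and `class_count_`; the value `alpha = 1.0` is an assumption (it is the default of _CategoricalNB_), and `posterior` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

# Copied from the category_count_ output: one (class x category) matrix per attribute.
category_count = [
    np.array([[18., 11.], [8., 16.]]),
    np.array([[17., 12.], [6., 18.]]),
    np.array([[20., 9.], [7., 17.]]),
    np.array([[11., 18.], [15., 9.]]),
]
# Copied from the class_count_ output.
class_count = np.array([29., 24.])
alpha = 1.0  # assumed smoothing parameter (the CategoricalNB default)

def posterior(x):
    """Return p(y | X = x) for an encoded instance x (one category index per attribute)."""
    score = class_count / class_count.sum()           # p(y)
    for counts, t in zip(category_count, x):
        n_categories = counts.shape[1]
        # multiply by the smoothed p(X_i = t | y) for both classes at once
        score = score * (counts[:, t] + alpha) / (class_count + alpha * n_categories)
    return score / score.sum()                         # normalising stands in for dividing by p(X)

probs = posterior([0, 0, 0, 0])
predicted_class = int(np.argmax(probs))
print(probs, predicted_class)
```

If the assumption about `alpha` holds, the result should agree with `model.predict_proba` for the same encoded instance.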

In [24]:
y_train_pred = model.predict(X_train)

In [25]:
y_test_pred = model.predict(X_test)

In [26]:
from sklearn.metrics import confusion_matrix

In [27]:
confusion_matrix(y_train, y_train_pred)

array([[25,  4],
       [ 9, 15]])

In [36]:
confusion_matrix(y_test, y_test_pred)

array([[9, 3],
       [3, 8]])

## Text classification

In [37]:
import os

In [38]:
def read_data(root_dir):
    """Read word-count files; each subdirectory of root_dir is treated as one class."""
    corpus = []
    classes = []
    for class_name in os.listdir(root_dir):
        class_dir = os.path.join(root_dir, class_name)
        for file_name in os.listdir(class_dir):
            file_path = os.path.join(class_dir, file_name)
            word_counts = {}
            with open(file_path, 'r') as f:
                # Each line holds a word and its count, separated by whitespace.
                for line in f:
                    word, count = line.split()
                    word_counts[word] = int(count)
            corpus.append(word_counts)
            classes.append(class_name)
    return corpus, classes

In [39]:
X_train, y_train = read_data('../ebart/ebart/VektoriEbart-5/Skup/')

In [62]:
len(X_train)

3492

In [64]:
len(y_train)

3492

We want to build a _TF_ matrix in which the columns are words, the rows are documents, and the value at position $(i,j)$ is the number of occurrences of word $j$ in document $i$. Try transforming this matrix into a _TF-IDF_ matrix and check how that affects the classification results.
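One possible starting point for the TF-IDF exercise is scikit-learn's `TfidfTransformer`, which takes a count matrix and reweights it. The sketch below uses a tiny made-up count matrix (`tf` is hypothetical); applying it to the real corpus, and whether it helps, is left as the exercise.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up TF matrix: 3 documents (rows) x 3 words (columns).
tf = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
])

transformer = TfidfTransformer()          # default settings: smoothed IDF, L2-normalised rows
tfidf = transformer.fit_transform(tf)     # returns a sparse matrix
dense = tfidf.toarray()
print(dense.round(3))
```

As with the encoder, the transformer should be fitted on the training matrix only and then applied to the test matrix with `transform`.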

In [42]:
from sklearn.feature_extraction import DictVectorizer

In [45]:
dv = DictVectorizer()
dv.fit(X_train)
print(f'Number of attributes (distinct words in the documents): {len(dv.feature_names_)}')

Number of attributes (distinct words in the documents): 36830


Because there are many distinct words and not all of them appear in every document, we get a sparse matrix.
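A small sketch of what "sparse" means here, on made-up documents (the word counts and labels below are hypothetical). It also shows that `MultinomialNB` can be fitted on the sparse matrix directly, so converting to a dense `DataFrame` is a convenience for inspection rather than a requirement.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up word-count dictionaries in the same shape read_data produces.
docs = [
    {"gol": 2, "utakmica": 1},
    {"vlada": 3},
    {"gol": 1, "tim": 1},
    {"vlada": 1, "zakon": 2},
]
labels = ["sport", "politika", "sport", "politika"]

vectorizer = DictVectorizer()
counts = vectorizer.fit_transform(docs)   # SciPy sparse matrix: only non-zero cells are stored

density = counts.nnz / (counts.shape[0] * counts.shape[1])
print(counts.shape, counts.nnz, density)

clf = MultinomialNB().fit(counts, labels)  # fitted on the sparse matrix, no .toarray() needed
pred = clf.predict(vectorizer.transform([{"gol": 1}]))
print(pred)
```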

In [46]:
sparse_matrix = dv.transform(X_train)

In [47]:
X_train = pd.DataFrame(sparse_matrix.toarray(), columns=dv.feature_names_)

In [48]:
X_train.head()

Unnamed: 0,ab,abasu,abati,abc,abdul,abdulah,abe,aberdin,abhaziji,abida,...,zxurno,zxustel,zxustrine,zxustro,zxuticx,zxutih,zxutilovine,zxuto,zxutra,zxuzxa
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


_MultinomialNB_ uses the following formula:
$$p(X_i|y) = \frac{N_{yi}+\alpha}{N_y + \alpha n}$$
where $N_{yi} = \sum_{x \in X_{train}}{x_i}$, with the sum taken over the training instances of class $y$, is the total of the values of attribute $X_i$ across all instances of class $y$. In our case this is the number of occurrences of a word in all documents of class $y$, for example how many times the word "gol" (goal) appears in all documents about sport. $N_y = \sum_{i=1}^{n}{N_{yi}}$ is the total of the values of all attributes for class $y$, and $n$ is the number of attributes. $\alpha$ is again just _smoothing_.
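The formula can be checked against scikit-learn on a tiny made-up count matrix. In this sketch `X`, `y` and `manual` are hypothetical names; `feature_log_prob_` is the fitted attribute that stores $\log p(X_i|y)$.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up counts: 4 documents x 3 words, two classes.
X = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 2],
])
y = np.array([0, 0, 1, 1])

alpha = 1.0
model = MultinomialNB(alpha=alpha).fit(X, y)

n = X.shape[1]  # number of attributes (words)
manual = np.vstack([
    # (N_yi + alpha) / (N_y + alpha * n), computed per class
    (X[y == c].sum(axis=0) + alpha) / (X[y == c].sum() + alpha * n)
    for c in model.classes_
])

print(manual)
print(np.exp(model.feature_log_prob_))
```

Both printouts should contain the same values, with each row summing to one.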

In [79]:
from sklearn.naive_bayes import MultinomialNB

In [80]:
model = MultinomialNB()

In [81]:
model.fit(X_train, y_train)

The number of instances of each class:

In [82]:
model.class_count_

array([333., 620., 627., 935., 977.])

The number of occurrences of each word in each class:

In [85]:
model.feature_count_

array([[ 1.,  0.,  0., ...,  2.,  0.,  0.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  1.,  0., ...,  5.,  1.,  0.],
       [ 0.,  0.,  0., ...,  2.,  0.,  0.],
       [ 2.,  0.,  1., ..., 27.,  0.,  2.]])

In [52]:
y_train_pred = model.predict(X_train)

In [53]:
confusion_matrix(y_train, y_train_pred)

array([[312,   4,   3,  14,   0],
       [ 13, 504,   5,  97,   1],
       [  1,   4, 614,   6,   2],
       [  9,  28,   6, 885,   7],
       [  1,   4,   1,   3, 968]])

Do the errors our model makes seem reasonable?
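One way to approach this question is to look at per-class recall and at the most frequent confusion. The sketch below copies the training confusion matrix printed above; classes are referred to by row and column index, since the mapping from index to class name depends on the directory names and is not shown here.

```python
import numpy as np

# Training confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([
    [312,   4,   3,  14,   0],
    [ 13, 504,   5,  97,   1],
    [  1,   4, 614,   6,   2],
    [  9,  28,   6, 885,   7],
    [  1,   4,   1,   3, 968],
])

recall = np.diag(cm) / cm.sum(axis=1)      # fraction of each true class predicted correctly
print(recall.round(3))

off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)          # keep only the errors
true_idx, pred_idx = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)
print(true_idx, pred_idx, off_diagonal[true_idx, pred_idx])
```

Mapping the two indices back to class names (for instance via `model.classes_`) shows which pair of topics is confused most often.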

In [54]:
from sklearn.metrics import accuracy_score

In [55]:
accuracy_score(y_train, y_train_pred)

0.9401489117983963

In [56]:
X_test, y_test = read_data('../ebart/ebart/VektoriEbart-5/Testing/')

In [57]:
len(X_test)

1743

We have to preprocess the test data in the same way.

In [58]:
X_test = pd.DataFrame(dv.transform(X_test).toarray(), columns=dv.feature_names_)

In [59]:
y_test_pred = model.predict(X_test)

In [60]:
accuracy_score(y_test, y_test_pred)

0.8995983935742972

In [61]:
confusion_matrix(y_test, y_test_pred)

array([[152,   0,   1,  13,   0],
       [ 10, 226,   6,  66,   1],
       [  2,   0, 301,   6,   4],
       [  8,  36,   9, 411,   3],
       [  2,   1,   1,   6, 478]])

Since X_train and X_test are now "ordinary" matrices, we can also classify these data with other models and compare the results.

In [65]:
from sklearn.neighbors import KNeighborsClassifier

In [66]:
model = KNeighborsClassifier()

In [67]:
model.fit(X_train, y_train)

In [68]:
y_pred = model.predict(X_test)

Why does _predict_ take longer than _fit_?

In [70]:
confusion_matrix(y_test, y_pred)

array([[ 45,   4,  21,  95,   1],
       [  3,  34,  36, 227,   9],
       [  1,   1, 182, 121,   8],
       [  4,  13,  25, 421,   4],
       [  1,   0,  15, 215, 257]])

In [71]:
accuracy_score(y_test, y_pred)

0.5387263339070568

In [72]:
from sklearn.tree import DecisionTreeClassifier

In [73]:
model = DecisionTreeClassifier()

In [74]:
model.fit(X_train, y_train)

In [75]:
y_pred = model.predict(X_test)

In [77]:
confusion_matrix(y_test, y_pred)

array([[ 95,  17,   8,  34,  12],
       [ 15, 180,  13,  80,  21],
       [  4,   7, 260,  17,  25],
       [ 29,  72,  11, 325,  30],
       [  6,  10,   6,  16, 450]])

In [78]:
accuracy_score(y_test, y_pred)

0.7515777395295468

Try to improve these results by tuning the hyperparameters.
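A possible starting point is a cross-validated grid search. The sketch below is only an illustration on synthetic count data (the arrays `X` and `y` are generated, not the Ebart corpus), and the grid of `alpha` values is an arbitrary choice. The same pattern applies to `n_neighbors` of `KNeighborsClassifier` or `max_depth` of `DecisionTreeClassifier`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Synthetic stand-in for a TF matrix: 200 "documents", 20 "words", 2 classes.
X = rng.poisson(lam=1.0, size=(200, 20))
y = rng.integers(0, 2, size=200)
X[y == 1, :5] += 2   # make the first five words more frequent in class 1

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}   # arbitrary candidate values
search = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)     # in the notebook this would be the training data only

print(search.best_params_, search.best_score_)
```

The test set should be used only once, to evaluate `search.best_estimator_` after the search is finished.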