## Decision Tree

### Information Gain

When building a decision tree, what determines the order of the splits (which features are tested first)?

#### Quinlan's ID3 Algorithm

$$\normalsize \textrm{Info}(D) = - \sum\limits_{i=1}^{m}p_i \log_2 (p_i)$$

$$\normalsize \textrm{Info}_A (D)=\sum\limits_{j=1}^{v} \frac{|D_j|}{|D|}\times \textrm{Info}(D_j)$$

$$\normalsize \textrm{Gain}(A)=\textrm{Info}(D)-\textrm{Info}_A(D)$$
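The three formulas above can be sketched directly in Python. This is a minimal illustration rather than a full ID3 implementation; the function names `info`, `info_a`, and `gain` and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(D): entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def info_a(values, labels):
    """Info_A(D): entropy of each partition D_j, weighted by |D_j| / |D|."""
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    total = len(labels)
    return sum(len(part) / total * info(part) for part in partitions.values())


def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_a(values, labels)


# Hypothetical toy data: attribute A separates the two classes perfectly,
# so splitting on it should remove all of the entropy in D.
values = ["a", "a", "b", "b"]
labels = ["yes", "yes", "no", "no"]
print(info(labels), info_a(values, labels), gain(values, labels))
```

An attribute that leaves every partition pure has `Info_A(D) = 0`, so its gain equals `Info(D)`; an attribute unrelated to the class leaves the entropy almost unchanged and gets a gain near zero.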

#### Example

<div style="text-align: center;">
  <img src="Naive-Bayes-Example.png" alt="Naive-Bayes Example" width="250">
</div>

Class P = buys_computer = "yes" $\normalsize (\frac{9}{14})$

Class N = buys_computer = "no" $\normalsize (\frac{5}{14})$

$$\normalsize \textrm{Info}(D) = I(9,5) = - \frac{9}{14}\log_2\left(\frac{9}{14}\right)-\frac{5}{14}\log_2\left(\frac{5}{14}\right)=0.940$$

|age|p<sub>i</sub>|n<sub>i</sub>|I(p<sub>i</sub>, n<sub>i</sub>)|
|:---:|:---:|:---:|:---:|
|<=30|2|3|0.971|
|31...40|4|0|0|
|>40|3|2|0.971|

$$\normalsize \textrm{Info}_{age}(D)= \frac{5}{14}I(2,3)+\frac{4}{14}I(4,0)+\frac{5}{14}I(3,2)=0.694$$

$\frac{5}{14}I(2,3)$ means that the group "age<=30" holds 5 of the 14 samples, 2 of them "yes" and 3 of them "no". Hence:

___Gain(age) = Info(D) - Info<sub>age</sub>(D)= 0.246___ 

Similarly,

___Gain(income) = 0.029___

___Gain(student) = 0.151___

___Gain(credit_rating) = 0.048___
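As a check on the arithmetic, the sketch below recomputes Gain(age) from the (yes, no) counts in the age table and then picks the attribute with the largest gain. The helper `entropy` and the variable names are assumptions for illustration; the counts and the four gain values are the ones stated above.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a tuple of class counts; empty classes are skipped."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# (yes, no) counts per age group, copied from the table above.
age_partitions = {"<=30": (2, 3), "31...40": (4, 0), ">40": (3, 2)}

n = sum(sum(c) for c in age_partitions.values())  # 14 samples in total
info_d = entropy((9, 5))                          # Info(D)
info_age = sum(sum(c) / n * entropy(c) for c in age_partitions.values())
gain_age = info_d - info_age
print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))

# Gains as reported in the text; the root split is the attribute with the largest gain.
gains = {"age": 0.246, "income": 0.029, "student": 0.151, "credit_rating": 0.048}
root = max(gains, key=gains.get)
print(root)
```

Computed without intermediate rounding, the gain for age comes out at roughly 0.2467; the 0.246 in the text is the difference of the already rounded 0.940 and 0.694. Either way, age has the largest gain and becomes the first split.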

In [2]:
from math import log2

In [9]:
def I(a, b):
    """Entropy (in bits) of a two-class group with counts a and b."""
    # A pure group has zero entropy; this guard also avoids log2(0).
    if a == 0 or b == 0:
        return 0
    p = a / (a + b)
    q = b / (a + b)
    return -p * log2(p) - q * log2(q)

In [10]:
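# With the zero guard in I, I(4,0) returns 0 and this evaluates to about 0.694,
# matching Info_age(D). Without the guard, log2(0) raises
# "ValueError: math domain error".
(5/14)*I(2,3)+(4/14)*I(4,0)+(5/14)*I(3,2)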

In [3]:
import numpy as np
import pandas as pd

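# Load the dataset from a local CSV file.
veriler = pd.read_csv("veriler.csv")

# Features: columns 1 to 3; target: everything from column 4 onward.
x = veriler.iloc[:, 1:4].values
y = veriler.iloc[:, 4:].values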

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.33, random_state=0)
x_train

array([[177,  60,  22],
       [190,  80,  25],
       [193,  90,  23],
       [129,  38,  12],
       [135,  34,  10],
       [180,  90,  30],
       [187,  80,  27],
       [185, 105,  33],
       [175,  90,  35],
       [183,  88,  28],
       [133,  30,   9],
       [130,  30,  10],
       [174,  70,  47],
       [160,  58,  39]], dtype=int64)

In [6]:
from sklearn.tree import DecisionTreeClassifier
# The default criterion is "gini"; unlike entropy, it squares each class
# probability (p_i * p_i) instead of multiplying it by log2(p_i).
dtc = DecisionTreeClassifier(criterion="entropy")
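
The difference between the two criteria can be seen on plain class counts. The sketch below is an illustration only: `gini` and `entropy` are helper names assumed for this example, not scikit-learn functions, and the count tuples are arbitrary apart from (9, 5), which is the class split from the example above.

```python
from math import log2


def gini(counts):
    """Gini impurity: 1 - sum(p_i ** 2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# Mixed, perfectly balanced, and pure class distributions.
for counts in [(9, 5), (7, 7), (14, 0)]:
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))
```

Both measures are zero for a pure group and largest for a balanced one (0.5 for Gini, 1 bit for entropy with two classes), so they usually rank candidate splits similarly.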

In [4]:
dtc.fit(x_train,y_train)
y_pred = dtc.predict(x_test)

In [5]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
cm

array([[1, 0],
       [1, 6]], dtype=int64)

## Random Forest

In [8]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, criterion="entropy")

In [10]:
import warnings
warnings.filterwarnings("ignore")

In [11]:
rfc.fit(x_train, y_train)
y_pred = rfc.predict(x_test)

In [12]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
cm

array([[1, 0],
       [0, 7]], dtype=int64)
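
To compare the two models, accuracy can be read off each confusion matrix as the diagonal sum divided by the total. The sketch below uses the two matrices printed above; the names `cm_tree`, `cm_forest`, and `accuracy` are assumptions for this example.

```python
# Confusion matrices copied from the outputs above
# (rows: true class, columns: predicted class).
cm_tree = [[1, 0], [1, 6]]
cm_forest = [[1, 0], [0, 7]]


def accuracy(cm):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


print(accuracy(cm_tree))    # 0.875
print(accuracy(cm_forest))  # 1.0
```

The test set holds only 8 samples and the forest is built without a fixed `random_state`, so a difference of one sample should be read as a hint rather than as firm evidence that one model is better.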