# Discretization (Assignment 2)

Task: perform discretization.

Find a dataset with numeric attributes (classification data).

1. Discretize the data using equal width and equal frequency.
2. Discretize the data using the entropy-based method from the uploaded material.

$$Entropy(D_1) = -\sum_{i=1}^m p_i \log_2 p_i$$

$$Gain(E_{new}) = E_{initial} - E_{new}$$

$$Info_A(D) = \frac{|D_1|}{|D|} Entropy(D_1) + \frac{|D_2|}{|D|} Entropy(D_2)$$
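
As a worked example with made-up numbers (not taken from the Iris data): suppose $D$ has 8 rows, 4 of class A and 4 of class B, and a split produces $D_1$ with 3 A and 1 B, and $D_2$ with 1 A and 3 B. The three formulas then give approximately:

$$Entropy(D) = -\left(\tfrac{4}{8}\log_2\tfrac{4}{8} + \tfrac{4}{8}\log_2\tfrac{4}{8}\right) = 1$$

$$Entropy(D_1) = Entropy(D_2) = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.811$$

$$Info_A(D) = \tfrac{4}{8}(0.811) + \tfrac{4}{8}(0.811) \approx 0.811$$

$$Gain = E_{initial} - E_{new} \approx 1 - 0.811 = 0.189$$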

## Import Data

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer



In [2]:
url = 'https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv'
data = pd.read_csv(url)


In [3]:
data.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [4]:
# CONSTANT SERIES

# class labels (iris variety names)
IRIS_VARIETY = pd.DataFrame(data["variety"])

# iris features without the class column
IRIS_DATA = data.drop(columns = "variety")

## Computation

### Definition of *discretization*

1. Discretization converts *numeric* data into *categorical* data.
2. Two approaches are used here: *equal-width intervals* and *equal-frequency intervals*.
3. *Equal-width intervals* split the value range into bins of the same width.
4. *Equal-frequency intervals* split the data into bins that hold (approximately) the same number of values.
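
The difference between the two approaches can be sketched on a toy series. This example uses pandas `cut` and `qcut` rather than the scikit-learn class used in the rest of the notebook, and the values are invented so that one outlier stretches the range:

```python
import numpy as np
import pandas as pd

# Toy values (not from the Iris data); the outlier 100 stretches the range.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal width: 3 bins that each span one third of the range [1, 100].
ew_codes = pd.cut(values, bins=3, labels=False)

# Equal frequency: 3 bins that each hold about one third of the values.
ef_codes = pd.qcut(values, q=3, labels=False)

# Number of values that fall in bin 0, bin 1 and bin 2.
ew_counts = np.bincount(ew_codes.to_numpy(), minlength=3).tolist()
ef_counts = np.bincount(ef_codes.to_numpy(), minlength=3).tolist()

print(ew_counts)  # [8, 0, 1]
print(ef_counts)  # [3, 3, 3]
```

With equal width, the outlier leaves the middle bin empty; with equal frequency, the bins have very different widths but the same number of values.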

#### The scikit-learn `KBinsDiscretizer` class

Used to *bin* data with the *equal width* and *equal frequency* approaches.


```
# Syntax
KBinsDiscretizer(n_bins = amount_of_binning, encode, strategy)
```


*   **Parameters**
*   *n_bins* is the number of bins (categories) to produce
*   *encode* sets the output encoding; use *encode = "ordinal"* so that each value is returned as its bin index (0, 1, 2, ...)
*   *strategy* determines how the bin edges are chosen
*   *strategy = "uniform"* gives *equal width* bins
*   *strategy = "quantile"* gives *equal frequency* bins
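
A minimal sketch of these parameters on a toy column (the data is hypothetical, not the Iris data). After fitting, the learned edges are available in the `bin_edges_` attribute, which helps when checking where each interval starts and ends:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy single-column data (hypothetical, not the Iris data).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
codes = est.fit_transform(X)

# bin_edges_ holds one array of edges per input column.
edges = est.bin_edges_[0]
print(edges)          # [0. 2. 4. 6.]
print(codes.ravel())  # [0. 0. 1. 1. 2. 2. 2.]
```

Note that the ordinal codes come back as floats (`0.`, `1.`, `2.`), which is why the comparisons later in this notebook use `0.0` and `1.0`.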


* The values are grouped into 3 categories:
* `Sedikit_Lebar` (slightly wide)
* `Lebar` (wide)
* `Sangat_Lebar` (very wide)


```
# pseudocode
if value == 0.0 then Sedikit_Lebar
else if value == 1.0 then Lebar
else Sangat_Lebar
```
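
The same mapping can be sketched with a lookup table instead of chained conditions. This is an alternative to the nested `np.where` calls used in the cells that follow, not the notebook's own code; `codes` and `code_to_label` are hypothetical names:

```python
import pandas as pd

labels = ["Sedikit_Lebar", "Lebar", "Sangat_Lebar"]

# Hypothetical ordinal codes, stored as floats like the discretizer output.
codes = pd.Series([0.0, 1.0, 2.0, 1.0])

# Bin index i maps to labels[i].
code_to_label = dict(enumerate(labels))
categories = codes.astype(int).map(code_to_label)

print(categories.tolist())
# ['Sedikit_Lebar', 'Lebar', 'Sangat_Lebar', 'Lebar']
```

A lookup like this scales to any number of bins without adding another nested condition per category.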






In [5]:
# set the category labels and the number of bins
labels = ["Sedikit_Lebar", "Lebar", "Sangat_Lebar"]
amount_of_binning = len(labels)

#### Equal Width

* ***strategy = "uniform"* is used for the *equal width* approach**
* Equal-width intervals for the Iris data


In [6]:
# equal-width intervals binning
est_ew_binning = KBinsDiscretizer(n_bins=amount_of_binning, encode='ordinal', strategy='uniform')

In [7]:
# fit estimator
est_ew_binning.fit(IRIS_DATA)

# transform the data into bin indices
iris_ew_binning = est_ew_binning.transform(IRIS_DATA)
value_of_ew_binned_df = pd.DataFrame(iris_ew_binning, columns=["sepal.length","sepal.width", "petal.length","petal.width"])
iris_ew_binning

array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 2., 0., 0.],
       [0., 2., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 2., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [1., 2., 0., 0.],
       [1., 2., 0., 0.],
       [0., 2., 0., 0.],
       [0., 1., 0., 0.],
       [1., 2., 0., 0.],
       [0., 2., 0., 0.],
       [0., 1., 0., 0.],
       [0., 2., 0., 0.],
       [0., 2., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 2., 0., 0.],
       [1., 2., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [1., 1., 0., 0.],
       [0., 2., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],


In [8]:
# map every bin index to its category label
ew_binned_df = pd.DataFrame(iris_ew_binning, columns=["sepal.length", "sepal.width", "petal.length", "petal.width"])
ew_binned_df["sepal.length"] = np.where(ew_binned_df["sepal.length"] == 0.0, "Sedikit_Lebar", np.where(ew_binned_df["sepal.length"] == 1.0, "Lebar", "Sangat_Lebar"))
ew_binned_df["sepal.width"] = np.where(ew_binned_df["sepal.width"] == 0.0, "Sedikit_Lebar", np.where(ew_binned_df["sepal.width"] == 1.0, "Lebar", "Sangat_Lebar"))
ew_binned_df["petal.length"] = np.where(ew_binned_df["petal.length"] == 0.0, "Sedikit_Lebar", np.where(ew_binned_df["petal.length"] == 1.0, "Lebar", "Sangat_Lebar"))
ew_binned_df["petal.width"] = np.where(ew_binned_df["petal.width"] == 0.0, "Sedikit_Lebar", np.where(ew_binned_df["petal.width"] == 1.0, "Lebar", "Sangat_Lebar"))

In [9]:
# create data frame from binned iris data and iris variety
class_of_ew_binned_df = pd.concat((ew_binned_df, IRIS_VARIETY), axis = 1)
class_of_ew_binned_df

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,Sedikit_Lebar,Lebar,Sedikit_Lebar,Sedikit_Lebar,Setosa
1,Sedikit_Lebar,Lebar,Sedikit_Lebar,Sedikit_Lebar,Setosa
2,Sedikit_Lebar,Lebar,Sedikit_Lebar,Sedikit_Lebar,Setosa
3,Sedikit_Lebar,Lebar,Sedikit_Lebar,Sedikit_Lebar,Setosa
4,Sedikit_Lebar,Sangat_Lebar,Sedikit_Lebar,Sedikit_Lebar,Setosa
...,...,...,...,...,...
145,Sangat_Lebar,Lebar,Sangat_Lebar,Sangat_Lebar,Virginica
146,Lebar,Sedikit_Lebar,Sangat_Lebar,Sangat_Lebar,Virginica
147,Lebar,Lebar,Sangat_Lebar,Sangat_Lebar,Virginica
148,Lebar,Lebar,Sangat_Lebar,Sangat_Lebar,Virginica


Count the number of rows in each category

In [10]:
class_of_ew_binned_df["sepal.length"].value_counts()

Lebar            70
Sedikit_Lebar    52
Sangat_Lebar     28
Name: sepal.length, dtype: int64

In [11]:
class_of_ew_binned_df["sepal.width"].value_counts()

Lebar            98
Sedikit_Lebar    33
Sangat_Lebar     19
Name: sepal.width, dtype: int64

In [12]:
class_of_ew_binned_df["petal.length"].value_counts()

Lebar            54
Sedikit_Lebar    50
Sangat_Lebar     46
Name: petal.length, dtype: int64

In [13]:
class_of_ew_binned_df["petal.width"].value_counts()

Lebar            52
Sedikit_Lebar    50
Sangat_Lebar     48
Name: petal.width, dtype: int64

In [14]:
class_of_ew_binned_df["variety"].value_counts()

Setosa        50
Versicolor    50
Virginica     50
Name: variety, dtype: int64

#### Equal Frequency

* ***strategy = "quantile"* is used for the *equal frequency* approach**
* Equal-frequency intervals for the Iris data

In [15]:
# equal-frequency intervals binning
est_ef_binning = KBinsDiscretizer(n_bins=amount_of_binning, encode='ordinal', strategy='quantile')

In [16]:
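# fit the estimator on all four feature columns
est_ef_binning.fit(IRIS_DATA)

# transform the data into bin indices
# (note: this rebinds est_ef_binning from the estimator to the transformed array)
est_ef_binning = est_ef_binning.transform(IRIS_DATA)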
value_of_ef_binned_df = pd.DataFrame(est_ef_binning, columns=["sepal.length","sepal.width", "petal.length","petal.width"])
value_of_ef_binned_df

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
0,0.0,2.0,0.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,2.0,0.0,0.0
3,0.0,1.0,0.0,0.0
4,0.0,2.0,0.0,0.0
...,...,...,...,...
145,2.0,1.0,2.0,2.0
146,2.0,0.0,2.0,2.0
147,2.0,1.0,2.0,2.0
148,1.0,2.0,2.0,2.0


In [17]:
# map every bin index to its category label
ef_binned_df = pd.DataFrame(est_ef_binning, columns=["sepal.length", "sepal.width", "petal.length", "petal.width"])
ef_binned_df["sepal.length"] = np.where(ef_binned_df["sepal.length"] == 0.0, "Sedikit_Lebar", np.where(ef_binned_df["sepal.length"] == 1.0, "Lebar", "Sangat_Lebar"))
ef_binned_df["sepal.width"] = np.where(ef_binned_df["sepal.width"] == 0.0, "Sedikit_Lebar", np.where(ef_binned_df["sepal.width"] == 1.0, "Lebar", "Sangat_Lebar"))
ef_binned_df["petal.length"] = np.where(ef_binned_df["petal.length"] == 0.0, "Sedikit_Lebar", np.where(ef_binned_df["petal.length"] == 1.0, "Lebar", "Sangat_Lebar"))
ef_binned_df["petal.width"] = np.where(ef_binned_df["petal.width"] == 0.0, "Sedikit_Lebar", np.where(ef_binned_df["petal.width"] == 1.0, "Lebar", "Sangat_Lebar"))

In [18]:
# create data frame from binned iris data and iris variety
class_of_ef_binned_df = pd.concat((ef_binned_df, IRIS_VARIETY), axis = 1)
class_of_ef_binned_df

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,Sedikit_Lebar,Sangat_Lebar,Sedikit_Lebar,Sedikit_Lebar,Setosa
1,Sedikit_Lebar,Lebar,Sedikit_Lebar,Sedikit_Lebar,Setosa
2,Sedikit_Lebar,Sangat_Lebar,Sedikit_Lebar,Sedikit_Lebar,Setosa
3,Sedikit_Lebar,Lebar,Sedikit_Lebar,Sedikit_Lebar,Setosa
4,Sedikit_Lebar,Sangat_Lebar,Sedikit_Lebar,Sedikit_Lebar,Setosa
...,...,...,...,...,...
145,Sangat_Lebar,Lebar,Sangat_Lebar,Sangat_Lebar,Virginica
146,Sangat_Lebar,Sedikit_Lebar,Sangat_Lebar,Sangat_Lebar,Virginica
147,Sangat_Lebar,Lebar,Sangat_Lebar,Sangat_Lebar,Virginica
148,Lebar,Sangat_Lebar,Sangat_Lebar,Sangat_Lebar,Virginica


Count the number of rows in each category

In [19]:
class_of_ef_binned_df["sepal.length"].value_counts()

Lebar            53
Sangat_Lebar     51
Sedikit_Lebar    46
Name: sepal.length, dtype: int64

In [20]:
class_of_ef_binned_df["sepal.width"].value_counts()

Sangat_Lebar     56
Lebar            47
Sedikit_Lebar    47
Name: sepal.width, dtype: int64

In [21]:
class_of_ef_binned_df["petal.length"].value_counts()

Sangat_Lebar     51
Sedikit_Lebar    50
Lebar            49
Name: petal.length, dtype: int64

In [22]:
class_of_ef_binned_df["petal.width"].value_counts()

Sangat_Lebar     52
Sedikit_Lebar    50
Lebar            48
Name: petal.width, dtype: int64

In [23]:
class_of_ef_binned_df["variety"].value_counts()

Setosa        50
Versicolor    50
Virginica     50
Name: variety, dtype: int64
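
The equal-frequency counts above are close to 50 per category but not exactly equal. One likely reason is tied values: rows with the same measurement always land in the same bin. The sketch below illustrates this with NumPy on invented values; it is a simplified quantile-binning rule written for illustration, not a claim about the exact internals of `KBinsDiscretizer`:

```python
import numpy as np

# Toy values with a heavy tie at 2 (hypothetical, not the Iris data).
values = np.array([1, 2, 2, 2, 2, 3, 4, 5, 6])

# Tertile edges: the 0, 1/3, 2/3 and 1 quantiles.
edges = np.quantile(values, [0, 1/3, 2/3, 1])

# Assign each value to a bin using the two inner edges;
# a value equal to an edge goes to the bin on its right.
codes = np.searchsorted(edges[1:-1], values, side="right")

counts = np.bincount(codes, minlength=3).tolist()
print(counts)  # [1, 5, 3]
```

All four copies of the value 2 end up in the same bin, so the bins cannot hold three values each.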

### Entropy

Entropy measures how mixed the categories within a set are. Here it is used to judge how consistently a candidate split agrees with the class labels: the lower the entropy after the split, the better the match.

Entropy formula, where $p_i$ is the proportion of rows in category $i$ \begin{align*}
\displaystyle Entropy(D_1) &= -\sum_{i=1}^{m} p_i \log_{2} p_i \\
\end{align*}

Information formula \begin{align} Info_A(D) = \frac {|D_1|}{|D|} Entropy(D_1) + \frac {|D_2|}{|D|} Entropy(D_2) \end{align}

Gain formula \begin{align} Gain(E_{new}) = E_{initial} - E_{new} \end{align}



#### Preparation


*   Import `log2` from the *math* package
*   Select the data to be used


```
# raw numeric values of the column to be split
data["petal.width"]

# class of binned dataframe (Equal Frequency Interval)
class_of_ef_binned_df 
```

*  Define the labels


```
labels = ["Sedikit_Lebar", "Lebar", "Sangat_Lebar"]
```



In [None]:
# module to calculate log
from math import log2

In [None]:
column_target_value = pd.DataFrame(data["petal.width"])
column_target_value

In [None]:
column_target_class = pd.DataFrame(class_of_ef_binned_df["petal.width"])
column_target_class

In [None]:
# create data target for entropy gain
target_data = pd.DataFrame(pd.concat((data["petal.width"], class_of_ef_binned_df["petal.width"]), axis = 1))

# change columns name of target data
target_data.columns = ["value", "category"]
target_data

#### Function to count the rows in each category

In [None]:
# count the rows of every label in the given category column
def countEveryCategory(data_column, labels, column, category):
    group = data_column.groupby(category).count()
    amount_of_every_category = []
    for label in labels:
        if label not in group.index:
            amount_of_every_category.append(0)
        else:
            amount_of_every_category.append(group.loc[label, column])
    return amount_of_every_category

In [None]:
countEveryCategory(target_data, labels, target_data.columns[0], target_data.columns[1])
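
The expected result is a list with one count per label, in label order, with 0 for labels that do not occur. A compact way to get the same kind of list is sketched below using `value_counts` and `reindex`; the toy frame is made up and only mimics the two-column shape of `target_data`:

```python
import pandas as pd

labels = ["Sedikit_Lebar", "Lebar", "Sangat_Lebar"]

# Toy frame in the same two-column shape as target_data (values are made up).
toy = pd.DataFrame({
    "value": [0.2, 0.3, 1.5],
    "category": ["Sedikit_Lebar", "Sedikit_Lebar", "Lebar"],
})

# Count rows per category, in label order, with 0 for absent labels.
counts = toy["category"].value_counts().reindex(labels, fill_value=0).tolist()

print(counts)  # [2, 1, 0]
```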

#### Function to count the rows in each category on both sides of a given *split value*

In [None]:
# split the given data based on category and split value 
def split(split_value, data_column, labels, col, category):
    less_group = data_column[data_column[col] < split_value]
    greater_group = data_column[data_column[col] >= split_value]
    
    length_less_group = countEveryCategory(less_group, labels, col, category)
    length_greater_group = countEveryCategory(greater_group, labels, col, category)
    
    return (length_less_group, length_greater_group)

#### Function to compute the gain between the initial entropy and the entropy for a given split value

In [None]:
# compute the information gain from the initial entropy and the new entropy
def count_gain(inisial_entropy, new_entropy):
  return inisial_entropy - new_entropy

#### Function to compute entropy

In [None]:
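# compute entropy from a list of per-category counts
def count_entropy(data_target):
    total = sum(data_target)
    # an empty group has no entropy; this also avoids division by zero
    if total == 0:
        return 0
    all_prob = []
    for count in data_target:
        prob = count / total
        # a category with no rows contributes 0 (0 * log2(0) is taken as 0)
        if prob != 0:
            all_prob.append(prob * log2(prob))
        else:
            all_prob.append(0)
    return -(sum(all_prob))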
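
Two boundary cases make a quick sanity check for any entropy function: a set with a single category has entropy 0, and a set with $k$ equally frequent categories has entropy $\log_2 k$. The sketch below checks both with a compact, self-contained variant; `entropy_from_counts` is a hypothetical helper name, not part of the notebook:

```python
from math import log2, isclose

def entropy_from_counts(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Zero counts are skipped (0 * log2(0) is taken as 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

pure = entropy_from_counts([50, 0, 0])        # one category only
balanced = entropy_from_counts([50, 50, 50])  # three equal categories

# pure is 0: a single category carries no uncertainty.
# balanced equals log2(3), roughly 1.585.
print(isclose(pure, 0.0, abs_tol=1e-12), isclose(balanced, log2(3)))
```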

#### Function to compute the Info (weighted entropy) for a given *split value*

In [None]:
# compute the size-weighted entropy of the groups produced by a split
def info(d, data_target):
    temp = []
    for value in d:
        temp.append((sum(value) / data_target.shape[0]) * count_entropy(value))
    return sum(temp)
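
To make the weighting concrete, here is a self-contained sketch with invented counts: a split that isolates one category completely and leaves the other two mixed. `entropy_from_counts` and `weighted_info` are hypothetical names that mirror, but do not replace, the notebook's functions:

```python
from math import log2

def entropy_from_counts(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def weighted_info(groups):
    """Size-weighted average entropy of the groups produced by a split."""
    total = sum(sum(g) for g in groups)
    return sum((sum(g) / total) * entropy_from_counts(g) for g in groups)

before = entropy_from_counts([50, 50, 50])
# Hypothetical split: one pure group of 50 and one mixed group of 100.
after = weighted_info([[50, 0, 0], [0, 50, 50]])
gain = before - after

print(round(before, 3), round(after, 3), round(gain, 3))
# 1.585 0.667 0.918
```

The pure group contributes nothing, so the remaining entropy comes entirely from the mixed group, weighted by its share of the rows (100 of 150).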

### Implementation


*   Compute the *entropy* of the target
*   Compute the *entropy* of the target for each *split value*
*   Compute the *entropy gain*



Initial entropy

In [None]:
# entropy data target
initial_data_target_entropy = count_entropy(countEveryCategory(target_data, labels, target_data.columns[0], target_data.columns[1]))



*   **Entropy trial 1**
*   Split value: $0.7$
*   First group: values $< 0.7$
*   Second group: values $\geq 0.7$



In [None]:
# count Entropy for the target given a split value, split value = 0.7
entropy_data_target_1 = info(split(0.7, target_data, labels, target_data.columns[0], target_data.columns[1]), target_data)



*   **Entropy trial 2**
*   Split value: $1.4$
*   First group: values $< 1.4$
*   Second group: values $\geq 1.4$



In [None]:
# count Entropy for the target given a split value, split value = 1.4
entropy_data_target_2 = info(split(1.4, target_data, labels, target_data.columns[0], target_data.columns[1]), target_data)



*   **Entropy trial 3**
*   Split value: $2.1$
*   First group: values $< 2.1$
*   Second group: values $\geq 2.1$



In [None]:
# count Entropy for the target given a split value, split value = 2.1
entropy_data_target_3 = info(split(2.1, target_data, labels, target_data.columns[0], target_data.columns[1]), target_data)

In [None]:
# information gain (entropy data target and entropy_data_target_1)
count_gain(initial_data_target_entropy, entropy_data_target_1)

In [None]:
# information gain (entropy data target and entropy_data_target_2)
count_gain(initial_data_target_entropy, entropy_data_target_2)

In [None]:
# information gain (entropy data target and entropy_data_target_3)
count_gain(initial_data_target_entropy, entropy_data_target_3)
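
The three trials above use hand-picked split values. One possible generalization, sketched here as an assumption rather than as the method required by the assignment, is to scan every midpoint between consecutive distinct values and keep the split with the highest gain. The function names and the toy data are hypothetical:

```python
from math import log2

def entropy_of(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return sum(-(c / total) * log2(c / total) for c in counts.values())

def best_split(values, labels):
    """Return (split_value, gain) with the highest information gain.

    Candidates are midpoints between consecutive distinct sorted values.
    """
    initial = entropy_of(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        split_value = (lo + hi) / 2
        less = [lab for v, lab in zip(values, labels) if v < split_value]
        greater = [lab for v, lab in zip(values, labels) if v >= split_value]
        # Size-weighted entropy of the two groups.
        info = (len(less) * entropy_of(less)
                + len(greater) * entropy_of(greater)) / len(values)
        gain = initial - info
        if gain > best[1]:
            best = (split_value, gain)
    return best

# Toy data (made up): low values are class "a", high values are class "b".
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
split_value, gain = best_split(values, labels)
print(split_value, gain)  # 6.5 1.0
```

On this toy data the midpoint 6.5 separates the two classes perfectly, so the gain equals the full initial entropy of 1 bit.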