# Data Mining Assignment

# Assignment 2

- Find a dataset with numeric features (a classification dataset)
- Discretize it with equal-width and equal-frequency binning
- Discretize it with an entropy-based method
- Submit the assignment as a GitHub link (a static site built with Jupyter Book)

In [1]:
import pandas as pd

In [2]:
# source data
dataset_url = "https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
data = pd.read_csv(dataset_url)

In [3]:
data.head(10)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
5,5.4,3.9,1.7,0.4,Setosa
6,4.6,3.4,1.4,0.3,Setosa
7,5.0,3.4,1.5,0.2,Setosa
8,4.4,2.9,1.4,0.2,Setosa
9,4.9,3.1,1.5,0.1,Setosa


In [4]:
# constant Series, one per numeric column

SEPAL_LENGTH_SERIES = data["sepal.length"]
SEPAL_WIDTH_SERIES = data["sepal.width"]
PETAL_LENGTH_SERIES = data["petal.length"]
PETAL_WIDTH_SERIES = data["petal.width"]

## Computing the Data
### Definition of discretization
1. Discretization converts numeric data into categorical data
2. Two basic approaches are equal-width intervals and equal-frequency intervals
3. Equal-width intervals split the value range into bins of the same width
4. Equal-frequency intervals split the data into bins that hold roughly the same number of observations
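
To make the difference between the two approaches concrete, the following sketch bins a small made-up series both ways. The series `toy` and its values are assumptions chosen for illustration and are not part of the Iris data; `labels=False` makes pandas return integer bin codes instead of interval labels.

```python
import pandas as pd

# Hypothetical toy data: eleven small values and one outlier.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100])

# Equal-width: the range 1..100 is cut into 3 bins of the same width,
# so the outlier sits alone in the last bin and the middle bin is empty.
equal_width_codes = pd.cut(toy, bins=3, labels=False).tolist()

# Equal-frequency: bin edges are quantiles, so each bin gets 4 values.
equal_freq_codes = pd.qcut(toy, q=3, labels=False).tolist()

print(equal_width_codes)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_freq_codes)   # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

Equal-width bins are sensitive to outliers, while equal-frequency bins adapt their edges to where the data is dense.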



## Cut

* `cut` is a *pandas* function that bins values into *equal-width intervals*


```
# Syntax
pd.cut(series, bins, right=True, labels=labels)
```
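
As a rough check of what "equal width" means, the bin width is `(max - min) / bins`, and `retbins=True` returns the edges pandas actually used. The sample values below are assumptions that only mimic the range of the sepal-width column (2.0 to 4.4). pandas lowers the first edge slightly so that the minimum falls inside the first right-closed bin.

```python
import pandas as pd

# Hypothetical sample spanning the same range as sepal.width (2.0 to 4.4).
sample = pd.Series([2.0, 2.5, 3.0, 3.5, 4.0, 4.4])
n_bins = 3

# Width of every bin when the range is divided evenly.
width = (sample.max() - sample.min()) / n_bins

# labels=False returns integer bin codes; retbins=True also returns the edges.
codes, edges = pd.cut(sample, bins=n_bins, right=True, labels=False, retbins=True)

print(round(width, 3))   # 0.8
print(codes.tolist())    # [0, 0, 1, 1, 2, 2]
print(edges)             # first edge is slightly below 2.0, last edge is 4.4
```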


## Sepal Width
- Equal-width intervals for the sepal width of the Iris flowers
- The values are grouped into 3 categories:
  - slightly wide (`sedikit_lebar`)
  - wide (`lebar`)
  - very wide (`sangat_lebar`)

In [5]:
# equal-width intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

sepal_width_ew_binning = pd.cut(SEPAL_WIDTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_sepal_width_ew_binning = sepal_width_ew_binning.value_counts()
interval_sepal_width_ew_binning = pd.cut(SEPAL_WIDTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [6]:
# dataframe of sepal-width and sepal category
data_sepal_width_ew = pd.DataFrame(pd.concat((SEPAL_WIDTH_SERIES, sepal_width_ew_binning), axis=1))

In [7]:
# change columns name
data_sepal_width_ew.columns = ["sepal.width", "category"]

In [8]:
data_sepal_width_ew

Unnamed: 0,sepal.width,category
0,3.5,lebar
1,3.0,lebar
2,3.2,lebar
3,3.1,lebar
4,3.6,lebar
...,...,...
145,3.0,lebar
146,2.5,sedikit_lebar
147,3.0,lebar
148,3.4,lebar


In [9]:
# equal-width intervals binning with label
labelled_sepal_width_ew_binning

lebar            88
sedikit_lebar    47
sangat_lebar     15
Name: sepal.width, dtype: int64

In [10]:
# equal-width intervals without label
interval_sepal_width_ew_binning

(2.8, 3.6]      88
(1.998, 2.8]    47
(3.6, 4.4]      15
Name: sepal.width, dtype: int64

## Petal Width
- Equal-width intervals for the petal width of the Iris flowers
- The values are grouped into 3 categories:
  - slightly wide (`sedikit_lebar`)
  - wide (`lebar`)
  - very wide (`sangat_lebar`)

In [11]:
# equal-width intervals
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

petal_width_ew_binning = pd.cut(PETAL_WIDTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_petal_width_ew_binning = petal_width_ew_binning.value_counts()
interval_petal_width_ew_binning = pd.cut(PETAL_WIDTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [12]:
# dataframe of petal-width and petal category
data_petal_width = pd.DataFrame(pd.concat((PETAL_WIDTH_SERIES, petal_width_ew_binning), axis=1))

In [13]:
# change columns name
data_petal_width.columns = ["petal.width", "category"]

In [14]:
data_petal_width

Unnamed: 0,petal.width,category
0,0.2,sedikit_lebar
1,0.2,sedikit_lebar
2,0.2,sedikit_lebar
3,0.2,sedikit_lebar
4,0.2,sedikit_lebar
...,...,...
145,2.3,sangat_lebar
146,1.9,sangat_lebar
147,2.0,sangat_lebar
148,2.3,sangat_lebar


In [15]:
# equal-width intervals with label
labelled_petal_width_ew_binning

lebar            54
sedikit_lebar    50
sangat_lebar     46
Name: petal.width, dtype: int64

In [16]:
# equal-width intervals without label
interval_petal_width_ew_binning

(0.9, 1.7]       54
(0.0976, 0.9]    50
(1.7, 2.5]       46
Name: petal.width, dtype: int64

## Sepal Length
- Equal-width intervals for the sepal length of the Iris flowers
- The values are grouped into 3 categories (the code reuses the width labels):
  - slightly wide (`sedikit_lebar`)
  - wide (`lebar`)
  - very wide (`sangat_lebar`)

In [17]:
# equal-width intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

sepal_length_ew_binning = pd.cut(SEPAL_LENGTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_sepal_length_ew_binning = sepal_length_ew_binning.value_counts()
interval_sepal_length_ew_binning = pd.cut(SEPAL_LENGTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [18]:
# dataframe of sepal-length and sepal category
data_sepal_length_ew = pd.DataFrame(pd.concat((SEPAL_LENGTH_SERIES, sepal_length_ew_binning), axis=1))

In [19]:
# change columns name
data_sepal_length_ew.columns = ["sepal_length", "category"]

In [20]:
data_sepal_length_ew

Unnamed: 0,sepal_length,category
0,5.1,sedikit_lebar
1,4.9,sedikit_lebar
2,4.7,sedikit_lebar
3,4.6,sedikit_lebar
4,5.0,sedikit_lebar
...,...,...
145,6.7,lebar
146,6.3,lebar
147,6.5,lebar
148,6.2,lebar


In [21]:
# equal-width intervals with label
labelled_sepal_length_ew_binning

lebar            71
sedikit_lebar    59
sangat_lebar     20
Name: sepal.length, dtype: int64

In [22]:
# equal-width intervals without label
interval_sepal_length_ew_binning

(5.5, 6.7]      71
(4.296, 5.5]    59
(6.7, 7.9]      20
Name: sepal.length, dtype: int64

## Petal Length
- Equal-width intervals for the petal length of the Iris flowers
- The values are grouped into 3 categories (the code reuses the width labels):
  - slightly wide (`sedikit_lebar`)
  - wide (`lebar`)
  - very wide (`sangat_lebar`)

In [23]:
# equal-width intervals
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

petal_length_ew_binning = pd.cut(PETAL_LENGTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_petal_length_ew_binning = petal_length_ew_binning.value_counts()
interval_petal_length_ew_binning = pd.cut(PETAL_LENGTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [24]:
# dataframe of petal-length and petal category
data_petal_length_ew = pd.DataFrame(pd.concat((PETAL_LENGTH_SERIES, petal_length_ew_binning), axis=1))

In [25]:
# change columns name
data_petal_length_ew.columns = ["petal_length", "category"]

In [26]:
data_petal_length_ew

Unnamed: 0,petal_length,category
0,1.4,sedikit_lebar
1,1.4,sedikit_lebar
2,1.3,sedikit_lebar
3,1.5,sedikit_lebar
4,1.4,sedikit_lebar
...,...,...
145,5.2,sangat_lebar
146,5.0,sangat_lebar
147,5.2,sangat_lebar
148,5.4,sangat_lebar


In [27]:
# equal-width intervals binning with label
labelled_petal_length_ew_binning

lebar            54
sedikit_lebar    50
sangat_lebar     46
Name: petal.length, dtype: int64

In [28]:
# equal-width intervals without label
interval_petal_length_ew_binning

(2.967, 4.933]    54
(0.994, 2.967]    50
(4.933, 6.9]      46
Name: petal.length, dtype: int64

## Qcut
`qcut` is a pandas function that bins values into equal-frequency intervals, using quantiles of the data as bin edges

```
# Syntax
pd.qcut(series, q, labels=labels)
```
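
The bin edges that `qcut` uses are quantiles of the data. The sketch below, on a made-up series (an assumption, not the Iris data), compares the edges returned by `retbins=True` with the values from `Series.quantile`.

```python
import pandas as pd

# Hypothetical data: the integers 1..12.
toy = pd.Series(range(1, 13))

# Three equal-frequency bins; retbins=True also returns the bin edges.
codes, edges = pd.qcut(toy, q=3, labels=False, retbins=True)

# The same edges computed directly as the 0, 1/3, 2/3 and 1 quantiles.
quantile_edges = toy.quantile([0, 1 / 3, 2 / 3, 1]).tolist()

print(codes.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
print(quantile_edges)
```

When many observations share the value at a quantile edge, the bins cannot be split exactly evenly, which is likely why the Iris counts in the sections below are close to, but not exactly, 50 each.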

## Sepal Width
- Equal-frequency intervals for the sepal width of the Iris flowers
- The values are grouped into 3 categories:
  - slightly wide (`sedikit_lebar`)
  - wide (`lebar`)
  - very wide (`sangat_lebar`)

In [29]:
# equal-frequency intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

sepal_width_ef_binning = pd.qcut(SEPAL_WIDTH_SERIES, q=amount_of_binning, labels=labels)
labelled_sepal_width_ef_binning = sepal_width_ef_binning.value_counts()
interval_sepal_width_ef_binning = pd.qcut(SEPAL_WIDTH_SERIES, q=amount_of_binning).value_counts()

In [30]:
# dataframe of sepal-width and sepal category
data_sepal_width_ef = pd.DataFrame(pd.concat((SEPAL_WIDTH_SERIES, sepal_width_ef_binning), axis = 1))

In [31]:
# change columns name
data_sepal_width_ef.columns = ["sepal_width", "category"]

In [32]:
data_sepal_width_ef

Unnamed: 0,sepal_width,category
0,3.5,sangat_lebar
1,3.0,lebar
2,3.2,lebar
3,3.1,lebar
4,3.6,sangat_lebar
...,...,...
145,3.0,lebar
146,2.5,sedikit_lebar
147,3.0,lebar
148,3.4,sangat_lebar


In [33]:
# equal-frequency intervals binning with label
labelled_sepal_width_ef_binning

sedikit_lebar    57
lebar            50
sangat_lebar     43
Name: sepal.width, dtype: int64

In [34]:
# equal-frequency intervals without label
interval_sepal_width_ef_binning

(1.999, 2.9]    57
(2.9, 3.2]      50
(3.2, 4.4]      43
Name: sepal.width, dtype: int64

## Petal Width
- Equal-frequency intervals for the petal width of the Iris flowers
- The values are grouped into 3 categories:
  - slightly wide (`sedikit_lebar`)
  - wide (`lebar`)
  - very wide (`sangat_lebar`)

In [35]:
# equal-frequency intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

petal_width_ef_binning = pd.qcut(PETAL_WIDTH_SERIES, q=amount_of_binning, labels=labels)
labelled_petal_width_ef_binning = petal_width_ef_binning.value_counts()
interval_petal_width_ef_binning = pd.qcut(PETAL_WIDTH_SERIES, q=amount_of_binning).value_counts()

In [36]:
# dataframe of petal-width and petal category
data_petal_width_ef = pd.DataFrame(pd.concat((PETAL_WIDTH_SERIES, petal_width_ef_binning), axis = 1))

In [37]:
# change columns name
data_petal_width_ef.columns = ["petal_width", "category"]

In [38]:
data_petal_width_ef

Unnamed: 0,petal_width,category
0,0.2,sedikit_lebar
1,0.2,sedikit_lebar
2,0.2,sedikit_lebar
3,0.2,sedikit_lebar
4,0.2,sedikit_lebar
...,...,...
145,2.3,sangat_lebar
146,1.9,sangat_lebar
147,2.0,sangat_lebar
148,2.3,sangat_lebar


In [39]:
# equal-frequency intervals binning with label
labelled_petal_width_ef_binning

lebar            52
sedikit_lebar    50
sangat_lebar     48
Name: petal.width, dtype: int64

In [40]:
# equal-frequency intervals without label
interval_petal_width_ef_binning

(0.867, 1.6]      52
(0.099, 0.867]    50
(1.6, 2.5]        48
Name: petal.width, dtype: int64

## Sepal Length
- Equal-frequency intervals for the sepal length of the Iris flowers
- The values are grouped into 3 categories (the code reuses the width labels):
  - slightly wide (`sedikit_lebar`)
  - wide (`lebar`)
  - very wide (`sangat_lebar`)

In [41]:
# equal-frequency intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

sepal_length_ef_binning = pd.qcut(SEPAL_LENGTH_SERIES, q=amount_of_binning, labels=labels)
labelled_sepal_length_ef_binning = sepal_length_ef_binning.value_counts()
interval_sepal_length_ef_binning = pd.qcut(SEPAL_LENGTH_SERIES, q=amount_of_binning).value_counts()

In [42]:
# dataframe of sepal-length and sepal category
data_sepal_length_ef = pd.DataFrame(pd.concat((SEPAL_LENGTH_SERIES, sepal_length_ef_binning), axis=1))

In [43]:
# change columns name
data_sepal_length_ef.columns = ["sepal_length", "category"]

In [44]:
data_sepal_length_ef

Unnamed: 0,sepal_length,category
0,5.1,sedikit_lebar
1,4.9,sedikit_lebar
2,4.7,sedikit_lebar
3,4.6,sedikit_lebar
4,5.0,sedikit_lebar
...,...,...
145,6.7,sangat_lebar
146,6.3,lebar
147,6.5,sangat_lebar
148,6.2,lebar


In [45]:
# equal-frequency intervals binning with label
labelled_sepal_length_ef_binning

lebar            56
sedikit_lebar    52
sangat_lebar     42
Name: sepal.length, dtype: int64

In [46]:
# equal-frequency intervals without label
interval_sepal_length_ef_binning

(5.4, 6.3]                   56
(4.2989999999999995, 5.4]    52
(6.3, 7.9]                   42
Name: sepal.length, dtype: int64

## Petal Length
- Equal-frequency intervals for the petal length of the Iris flowers
- The values are grouped into 3 categories (the code reuses the width labels):
  - slightly wide (`sedikit_lebar`)
  - wide (`lebar`)
  - very wide (`sangat_lebar`)

In [47]:
# equal-frequency intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

petal_length_ef_binning = pd.qcut(PETAL_LENGTH_SERIES, q=amount_of_binning, labels=labels)
labelled_petal_length_ef_binning = petal_length_ef_binning.value_counts()
interval_petal_length_ef_binning = pd.qcut(PETAL_LENGTH_SERIES, q=amount_of_binning).value_counts()

In [48]:
# dataframe of petal-length and petal category
data_petal_length_ef = pd.DataFrame(pd.concat((PETAL_LENGTH_SERIES, petal_length_ef_binning), axis=1))

In [49]:
# change columns name
data_petal_length_ef.columns = ["petal_length", "category"]

In [50]:
data_petal_length_ef

Unnamed: 0,petal_length,category
0,1.4,sedikit_lebar
1,1.4,sedikit_lebar
2,1.3,sedikit_lebar
3,1.5,sedikit_lebar
4,1.4,sedikit_lebar
...,...,...
145,5.2,sangat_lebar
146,5.0,sangat_lebar
147,5.2,sangat_lebar
148,5.4,sangat_lebar


In [51]:
# equal-frequency intervals binning with label
labelled_petal_length_ef_binning

lebar            54
sedikit_lebar    50
sangat_lebar     46
Name: petal.length, dtype: int64

In [52]:
# equal-frequency intervals without label
interval_petal_length_ef_binning

(2.633, 4.9]      54
(0.999, 2.633]    50
(4.9, 6.9]        46
Name: petal.length, dtype: int64

### Definition of Entropy-based Binning

1.   A method for grouping *numeric* data into *categorical* bins
2.   Bins are formed by searching for the best split value
3.   The best split is the one with the largest *information gain*, that is, the largest drop in entropy


*  Entropy formula, where $p_i$ is the proportion of observations in $S$ that belong to class $i$ and $k$ is the number of classes

$$
\begin{align*}
\displaystyle Entropy(S) &= -\sum_{i=1}^{k} p_i \log_{2} p_i \\
\end{align*}
$$
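
A small worked example may help with the entropy formula. The helper `entropy_from_counts` below is a hypothetical illustration, separate from the notebook's own `count_entropy`, and the class counts are made up.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    # Classes with zero count are skipped: 0 * log2(0) is taken as 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


# Three equally likely classes: the maximum for k = 3, which is log2(3).
balanced = entropy_from_counts([50, 50, 50])

# A single class: no uncertainty at all.
pure = entropy_from_counts([150, 0, 0])

# Two equally likely classes: exactly 1 bit.
two_way = entropy_from_counts([75, 75])

print(round(balanced, 4), abs(pure), two_way)  # 1.585 0.0 1.0
```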



#### Preparation


*   Import the `log2` function from the `math` module
*   Select the data to use


```
# dataframe of petal width (equal-frequency intervals)
data_petal_width_ef
```

*  Define the labels


```
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
```



In [53]:
# module to calculate log
from math import log2

In [54]:
# dataframe
target_data = data_petal_width_ef
target_data

Unnamed: 0,petal_width,category
0,0.2,sedikit_lebar
1,0.2,sedikit_lebar
2,0.2,sedikit_lebar
3,0.2,sedikit_lebar
4,0.2,sedikit_lebar
...,...,...
145,2.3,sangat_lebar
146,1.9,sangat_lebar
147,2.0,sangat_lebar
148,2.3,sangat_lebar


In [55]:
# category label
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]

## Functions
Function that counts the number of rows in each category

In [56]:
# count the rows of every label in the given category column
def countEveryCategory(data_column, labels, column, category):
    # observed=False keeps categories with zero rows, so every label can be looked up
    group = data_column.groupby(category, observed=False).count()
    amount_of_every_category = []
    for label in labels:
        amount_of_every_category.append(group.loc[label, column])
    return amount_of_every_category

Function that counts the rows of each category on either side of a given split value

In [57]:
# split the given data based on category and split value 
def split(split_value, data_column, labels, col, category):
    less_group = data_column[data_column[col] < split_value]
    greater_group = data_column[data_column[col] >= split_value]
    
    length_less_group = countEveryCategory(less_group, labels, col, category)
    length_greater_group = countEveryCategory(greater_group, labels, col, category)
    
    return (length_less_group, length_greater_group)

Function that computes the *gain*, the difference between the initial entropy and the entropy after the given split

---

$$Gain(E_{new}) = E_{initial} - E_{new}$$

---



In [58]:
# compute the information gain from the initial entropy and the new entropy
def count_gain(inisial_entropy, new_entropy):
    return inisial_entropy - new_entropy

Function that computes the entropy

In [59]:
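# compute the entropy of a list of per-category counts
def count_entropy(data_target):
    total = sum(data_target)
    if total == 0:
        # an empty group (e.g. one side of a split) carries no uncertainty
        return 0
    all_prob = []
    for count in data_target:
        prob = count / total
        # by convention 0 * log2(0) is treated as 0
        all_prob.append(prob * log2(prob) if prob != 0 else 0)
    return -(sum(all_prob))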

Function that computes the expected entropy (*Info*) after splitting the data at a given *split value*


---



$$ Info_A(D) = \frac{|D_1|}{|D|} Entropy(D_1) + \frac{|D_2|}{|D|} Entropy(D_2)$$

In [60]:
# compute the weighted entropy of the groups produced by a split value
def info(d, data_target):
    temp = []
    for value in d:
        temp.append((sum(value) / data_target.shape[0]) * count_entropy(value))
    return sum(temp)
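
To see how *Info* and *gain* fit together, the following sketch evaluates one split by hand. The per-side counts are assumptions: they take the label totals reported earlier for the equal-frequency petal-width bins (50, 52, 48) and suppose that a split at 0.7 puts all 50 `sedikit_lebar` rows on the lower side. The helper names are hypothetical and independent of the notebook's functions.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


# Assumed counts per label: [sedikit_lebar, lebar, sangat_lebar].
lower_side = [50, 0, 0]    # rows with value < split value
upper_side = [0, 52, 48]   # rows with value >= split value
all_rows = [a + b for a, b in zip(lower_side, upper_side)]
n = sum(all_rows)

# Info: entropy of each side, weighted by the share of rows on that side.
info_after_split = sum(
    (sum(side) / n) * entropy_from_counts(side)
    for side in (lower_side, upper_side)
)

# Gain: how much the split reduces the entropy of the whole data set.
gain = entropy_from_counts(all_rows) - info_after_split

print(round(info_after_split, 4))  # 0.6659
print(round(gain, 4))              # 0.9183
```

Under these assumptions the gain agrees with the value the notebook reports for the split at 0.7 in the implementation section.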

## Implementation
- Compute the entropy of the target
- Compute the entropy of the target after a split value
- Compute the information gain

In [61]:
# entropy data target
initial_data_target_entropy = count_entropy(countEveryCategory(target_data, labels, target_data.columns[0], target_data.columns[1]))

In [62]:
# count Entropy for the target given a split value, split value = 0.7
entropy_data_target_1 = info(split(0.7, target_data, labels, target_data.columns[0], target_data.columns[1]), target_data)

In [63]:
# information gain (entropy data target and entropy_data_target_1)
count_gain(initial_data_target_entropy, entropy_data_target_1)

0.9182958340544896
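
The cells above evaluate a single, hand-picked split value. One possible way to search for the best split, sketched here under assumptions, is to try the midpoint between every pair of adjacent distinct values and keep the one with the largest gain. The function `best_split`, the toy values, and the labels `a`, `b`, `c` are all hypothetical; this performs one binary split only and is not a full recursive entropy-based discretizer.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (split_value, gain) for the binary split with the largest gain."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    # Candidate split values: midpoints between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    base = entropy(labels)
    best_value, best_gain = None, -1.0
    for split_value in candidates:
        # Same convention as the notebook: "< split" versus ">= split".
        left = [lab for val, lab in pairs if val < split_value]
        right = [lab for val, lab in pairs if val >= split_value]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - info > best_gain:
            best_value, best_gain = split_value, base - info
    return best_value, best_gain


# Hypothetical toy data, not taken from the Iris set.
toy_values = [0.2, 0.3, 0.4, 1.3, 1.5, 1.8, 2.1, 2.3]
toy_labels = ["a", "a", "a", "b", "b", "b", "c", "c"]

split_value, gain = best_split(toy_values, toy_labels)
print(round(split_value, 2), round(gain, 4))  # 0.85 0.9544
```

The best split falls between 0.4 and 1.3 because it isolates every `a` row on one side, leaving only the `b` and `c` rows mixed on the other.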