# [Learning objectives]
- The code below shows how to use pandas.cut and pandas.qcut in Python to discretize numeric data into bin labels

# [Key points]
- Equal-width binning with pandas.cut (In [30], In [31])
- Equal-frequency binning with pandas.qcut (In [32], In [33])

In [1]:
# Load packages
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [29]:
# Set up the initial ages data
ages = pd.DataFrame({"age": [10,18,22,25,27,7,21,23,37,30,61,45,41,9,18,80,100]})

#### Equal-width binning

In [30]:
# Add a column "equal_width_age" that splits age into 4 equal-width bins
ages["equal_width_age"] = pd.cut(ages["age"], 4)
ages["equal_width_age"]

0     (6.907, 30.25]
1     (6.907, 30.25]
2     (6.907, 30.25]
3     (6.907, 30.25]
4     (6.907, 30.25]
5     (6.907, 30.25]
6     (6.907, 30.25]
7     (6.907, 30.25]
8      (30.25, 53.5]
9     (6.907, 30.25]
10     (53.5, 76.75]
11     (30.25, 53.5]
12     (30.25, 53.5]
13    (6.907, 30.25]
14    (6.907, 30.25]
15    (76.75, 100.0]
16    (76.75, 100.0]
Name: equal_width_age, dtype: category
Categories (4, interval[float64]): [(6.907, 30.25] < (30.25, 53.5] < (53.5, 76.75] < (76.75, 100.0]]

In [31]:
# Count how many values fall into each equal-width bin
ages["equal_width_age"].value_counts() # every bin covers a value range of the same width; the counts differ

(6.907, 30.25]    11
(30.25, 53.5]      3
(76.75, 100.0]     2
(53.5, 76.75]      1
Name: equal_width_age, dtype: int64
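
The bin edges above can be reproduced by hand. The following is a minimal sketch, assuming pandas' default behavior of cutting the span from the minimum to the maximum into equal pieces and nudging the lowest edge slightly downward so the minimum value is included. The names `manual_edges` and `cut_edges` are illustrative only.

```python
import numpy as np
import pandas as pd

age = pd.Series([10, 18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 41, 9, 18, 80, 100], name="age")

# Four equal-width bins need five evenly spaced edges between min and max
manual_edges = np.linspace(age.min(), age.max(), 5)

# retbins=True also returns the edges that pd.cut actually used
_, cut_edges = pd.cut(age, 4, retbins=True)

# Width of each manually computed bin
widths = np.diff(manual_edges)

print(manual_edges)
print(cut_edges)
print(widths)
```

Apart from the lowest edge, the two sets of edges should agree, which is why the first interval is printed as starting at 6.907 rather than 7.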

#### Equal-frequency binning

In [32]:
# Add a column "equal_freq_age" that splits age into 4 equal-frequency bins
ages["equal_freq_age"] = pd.qcut(ages["age"], 4)
ages["equal_freq_age"]

0     (6.999, 18.0]
1     (6.999, 18.0]
2      (18.0, 25.0]
3      (18.0, 25.0]
4      (25.0, 41.0]
5     (6.999, 18.0]
6      (18.0, 25.0]
7      (18.0, 25.0]
8      (25.0, 41.0]
9      (25.0, 41.0]
10    (41.0, 100.0]
11    (41.0, 100.0]
12     (25.0, 41.0]
13    (6.999, 18.0]
14    (6.999, 18.0]
15    (41.0, 100.0]
16    (41.0, 100.0]
Name: equal_freq_age, dtype: category
Categories (4, interval[float64]): [(6.999, 18.0] < (18.0, 25.0] < (25.0, 41.0] < (41.0, 100.0]]

In [33]:
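# Count how many values fall into each equal-frequency bin
ages["equal_freq_age"].value_counts() # every bin holds roughly the same number of rows (17 rows cannot split evenly into 4)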

(6.999, 18.0]    5
(41.0, 100.0]    4
(25.0, 41.0]     4
(18.0, 25.0]     4
Name: equal_freq_age, dtype: int64
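
The equal-frequency edges are the quartiles of the data. A minimal sketch of that relationship, assuming the default quantile interpolation; the names `quantile_edges` and `qcut_edges` are illustrative only.

```python
import numpy as np
import pandas as pd

age = pd.Series([10, 18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 41, 9, 18, 80, 100], name="age")

# Cut points at 0%, 25%, 50%, 75% and 100% of the sorted data
quantile_edges = age.quantile([0, 0.25, 0.5, 0.75, 1.0]).to_numpy()

# retbins=True returns the edges pd.qcut used alongside the binned values
binned, qcut_edges = pd.qcut(age, 4, retbins=True)

# Rows per bin, listed in bin order instead of by frequency
counts = binned.value_counts(sort=False)

print(quantile_edges)
print(qcut_edges)
print(counts)
```

Because the edges follow the data rather than the value range, the bins have very different widths (the last one runs from 41 to 100) while the row counts stay close to each other.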

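### Homework
Add a new column `customized_age_grp` that splits `age` into five groups: (0, 10], (10, 20], (20, 30], (30, 50], (50, 100]. Here '(' means the endpoint is excluded and ']' means it is included.

Hint: run `??pd.cut` to see how its `bins` parameter is used.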

In [34]:
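# Explicit bin edges; pd.cut builds right-closed intervals (a, b] by default
bins = [0, 10, 20, 30, 50, 100]
ages["customized_age_grp"] = pd.cut(ages["age"], bins)
ages["customized_age_grp"].value_counts()  # not displayed: a cell only echoes its last expression
ages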

    age equal_width_age equal_freq_age customized_age_grp
0    10  (6.907, 30.25]  (6.999, 18.0]            (0, 10]
1    18  (6.907, 30.25]  (6.999, 18.0]           (10, 20]
2    22  (6.907, 30.25]   (18.0, 25.0]           (20, 30]
3    25  (6.907, 30.25]   (18.0, 25.0]           (20, 30]
4    27  (6.907, 30.25]   (25.0, 41.0]           (20, 30]
5     7  (6.907, 30.25]  (6.999, 18.0]            (0, 10]
6    21  (6.907, 30.25]   (18.0, 25.0]           (20, 30]
7    23  (6.907, 30.25]   (18.0, 25.0]           (20, 30]
8    37   (30.25, 53.5]   (25.0, 41.0]           (30, 50]
9    30  (6.907, 30.25]   (25.0, 41.0]           (20, 30]
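
As a follow-up to the homework, the sketch below shows two other `pd.cut` parameters, `labels` and `right`, applied to the same bin edges. The label strings are hypothetical names chosen for illustration, not part of the original exercise.

```python
import pandas as pd

age = pd.Series([10, 18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 41, 9, 18, 80, 100], name="age")
bins = [0, 10, 20, 30, 50, 100]

# Hypothetical readable names, one per interval (len(bins) - 1 labels)
labels = ["0-10", "11-20", "21-30", "31-50", "51-100"]

# Same right-closed intervals as before, shown with custom labels
grouped = pd.cut(age, bins, labels=labels)
counts = grouped.value_counts(sort=False)
print(counts)

# right=False switches to left-closed intervals [a, b);
# the value 100 then lies outside the last bin [50, 100) and becomes NaN
left_closed = pd.cut(age, bins, right=False)
n_missing = int(left_closed.isna().sum())
print(n_missing)
```

Values that fall outside every interval are returned as NaN rather than raising an error, so it is worth checking for missing values after changing `right` or the bin edges.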
