# Handling categorical data

In [150]:
import pandas as pd

## Create a new data frame

In [151]:
df = pd.DataFrame( [ ['green', 'M', 10.1, 'class1'], ['red', 'L', 13.5, 'class2'], ['blue', 'XL', 15.3, 'class1'] ] )

In [152]:
df.columns = ['color', 'size', 'price', 'classlabel']

## Convert to the category dtype, which shows how many categories each column has

In [153]:
categories_col = ['color', 'size', 'classlabel']

for col in categories_col:
    df[col] = df[col].astype('category')

df.dtypes

color         category
size          category
price          float64
classlabel    category
dtype: object
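As a minimal sketch of what the category dtype offers, the snippet below rebuilds the same toy frame so that it runs on its own, then reads the distinct categories through the `.cat` accessor. The names `demo`, `n_categories`, and `color_categories` are only for illustration and are not part of the notebook.

```python
import pandas as pd

# Rebuild the toy data so this snippet is self-contained.
demo = pd.DataFrame(
    [['green', 'M', 10.1, 'class1'],
     ['red', 'L', 13.5, 'class2'],
     ['blue', 'XL', 15.3, 'class1']],
    columns=['color', 'size', 'price', 'classlabel'],
)

cat_cols = ['color', 'size', 'classlabel']
for col in cat_cols:
    demo[col] = demo[col].astype('category')

# .cat.categories lists the distinct values of a categorical column;
# its length is the number of categories.
n_categories = {col: len(demo[col].cat.categories) for col in cat_cols}
color_categories = list(demo['color'].cat.categories)

print(n_categories)
print(color_categories)
```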

## Mapping ordinal features (by hand)

In [154]:
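size_mapping = {'XL': 3, 'L': 2, 'M': 1}
# astype(int) keeps the result a plain integer column; on recent pandas,
# mapping a category column can otherwise return another category column
df['size'] = df['size'].map(size_mapping).astype(int)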

In [155]:
class_mapping = {'class1': 1, 'class2': 2}
# astype(int) keeps the result a plain integer column (see the size mapping)
df['classlabel'] = df['classlabel'].map(class_mapping).astype(int)
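A hand-written mapping is easy to reverse, which helps when predictions need to be shown with their original labels. The sketch below assumes the same `size_mapping` dictionary. The names `inv_size_mapping`, `sizes`, `encoded`, and `decoded` are illustrative.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
# Invert the dictionary: integer code -> original label.
inv_size_mapping = {code: label for label, code in size_mapping.items()}

sizes = pd.Series(['M', 'L', 'XL'])
encoded = sizes.map(size_mapping)        # labels -> integers
decoded = encoded.map(inv_size_mapping)  # integers -> labels

print(encoded.tolist())
print(decoded.tolist())
```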

## Mapping features to integer labels (using a library)

In [156]:
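from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Keep the encoded integers so they can be printed and decoded below
c = le.fit_transform(df['color'].values)
print(c)
print(le.inverse_transform(c))

# Restore the original string labels in the data frame
df['color'] = le.inverse_transform(c)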

[1 2 0]
['green' 'red' 'blue']
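To see which integer each label received, one option is to inspect the fitted encoder's `classes_` attribute, which holds the sorted unique labels, so a label's position is its code. This is a small self-contained sketch. The `mapping` and `codes` names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['green', 'red', 'blue'])

# classes_ is sorted, so enumerate() recovers the label -> code mapping.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(codes.tolist())
print(mapping)
```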


## One-hot encoding (using a library)

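* Why use one-hot encoding? If we convert *every* categorical column to integers (0, 1, 2, ...), we may introduce an error. After converting the color column, for example, the learning algorithm will assume red > green > blue, which is wrong. Training on such data anyway can still give decent results, but probably not the best ones.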


- Python Machine Learning, p. 132:
- After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are encoded as follows:
    - blue -> 0
    - green -> 1
    - red -> 2
- If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal.
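The ordering problem can be made concrete with a little arithmetic. The sketch below uses the blue/green/red codes quoted above and compares the gaps between them. The variable names are illustrative.

```python
# Integer codes as produced by the label encoding above.
codes = {'blue': 0, 'green': 1, 'red': 2}

# Differences between codes: a model that treats the column as numeric
# would read red as twice as far from blue as green is, although the
# colors have no such relationship.
d_blue_green = abs(codes['blue'] - codes['green'])
d_blue_red = abs(codes['blue'] - codes['red'])

print(d_blue_green, d_blue_red)
```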


* One-hot encoding creates a new feature for each category in a nominal feature. These new features are called dummy features.

- A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column.
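To show the idea without any encoding library, here is a minimal NumPy sketch that builds the dummy features by hand. It is only one possible way to do this, and the names `colors`, `categories`, `idx`, and `one_hot` are illustrative.

```python
import numpy as np

colors = np.array(['green', 'red', 'blue'])

# Sorted unique values, plus the index of each row's value among them.
categories, idx = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing with idx yields one dummy column per unique color.
one_hot = np.eye(len(categories))[idx]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column of its color, so no color is numerically larger than another.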

In [157]:
df

   color  size  price  classlabel
0  green     1   10.1           1
1    red     2   13.5           2
2   blue     3   15.3           1


In [159]:
df = pd.get_dummies(df)
df

   size  price  classlabel  color_blue  color_green  color_red
0     1   10.1           1         0.0          1.0        0.0
1     2   13.5           2         0.0          0.0        1.0
2     3   15.3           1         1.0          0.0        0.0
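`pd.get_dummies` encodes string and category columns by default, and newer pandas releases return boolean dummy columns instead of the 0.0/1.0 floats shown above. The sketch below passes `columns=` to name the column to encode and `dtype=float` to request float output. The `demo` frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

demo = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': [1, 2, 3],
    'price': [10.1, 13.5, 15.3],
    'classlabel': [1, 2, 1],
})

# columns= restricts encoding to the named columns;
# dtype=float asks for 0.0/1.0 values in the dummy columns.
encoded = pd.get_dummies(demo, columns=['color'], dtype=float)

print(encoded)
```

Naming the columns explicitly also protects numeric-looking categorical columns from being encoded or skipped by accident.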
