<a id=0></a>
# 7. Categorical Features
Handling categorical features (variables)

---
### [1.LabelEncoder()](#1)
### [2.get_dummies()](#2)
### [3.OneHotEncoder()](#3)
### [4.Differences between pd.get_dummies() and OneHotEncoder()](#4)
### [5.Using the Series str accessor](#5)

---

This notebook uses sample1_without_index.csv as the dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('./sample_without_index.csv')
df.head()

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Score,Difference,Color,Shape
0,1997-07-05,2291,25,2.94665,5.305868,45.8933,52.762659,0.276266,green,triangle
1,1997-07-06,506,16,1.915208,0.679004,50.611735,31.453719,-1.854628,blue,
2,1997-07-07,9629,32,7.869855,6.563335,43.830416,56.239011,0.623901,blue,square
3,1997-07-08,6161,67,6.375209,5.756029,41.358007,61.453113,1.145311,green,square
4,,8570,55,0.390629,3.578136,55.739709,,1.03719,red,square


In [3]:
df = df[['Color', 'Shape']]

In [4]:
df.head()

Unnamed: 0,Color,Shape
0,green,triangle
1,blue,
2,blue,square
3,green,square
4,red,square


In [5]:
df.isnull().sum()

Color    4
Shape    5
dtype: int64

In [6]:
df[df['Color'].isnull()].index

Index([19, 37, 40, 73], dtype='int64')
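
Before encoding, it can help to see how many rows fall into each category, missing values included. The following is a minimal sketch on a small hypothetical Series (the values are made up for illustration and are not the rows of the CSV above):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the data loaded from the CSV above
colors = pd.Series(['green', 'blue', np.nan, 'blue', 'red', np.nan], name='Color')

# dropna=False keeps NaN as its own entry in the counts
counts = colors.value_counts(dropna=False)
n_missing = int(colors.isnull().sum())

print(counts)
print('missing:', n_missing)
```

Counting with `dropna=False` makes it visible up front that NaN will behave like an extra category in the encoders that follow.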

---
<a id=1></a>
[Back to top](#0)

---
## 1. LabelEncoder()  
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html  
Note: replaces each label with an integer (0, 1, 2, ...)

In [7]:
from sklearn.preprocessing import LabelEncoder

In [8]:
encoder = LabelEncoder()

In [9]:
encoder.fit(df['Color'])

In [11]:
encoder.classes_

array(['blue', 'green', 'red', nan], dtype=object)

In [12]:
encoder.transform(df['Color'])

array([1, 0, 0, 1, 2, 1, 0, 2, 2, 2, 1, 0, 1, 1, 2, 0, 0, 0, 0, 3, 1, 0,
       1, 0, 1, 1, 1, 2, 1, 1, 0, 0, 0, 2, 0, 1, 1, 3, 0, 0, 3, 2, 1, 2,
       0, 0, 2, 1, 0, 0, 0, 1, 2, 1, 2, 2, 2, 2, 1, 0, 2, 2, 1, 1, 2, 1,
       2, 1, 1, 2, 1, 0, 1, 3, 1, 0, 1, 1, 0, 1, 0, 0, 0, 2, 0, 0, 2, 0,
       1, 2, 2, 2, 0, 1, 2, 0, 2, 0, 0, 2])

In [13]:
df.head()

Unnamed: 0,Color,Shape
0,green,triangle
1,blue,
2,blue,square
3,green,square
4,red,square


In [15]:
df_ce = df.copy()
df_ce['Color_encoded'] = encoder.transform(df['Color'])
df_ce = df_ce[['Color', 'Color_encoded', 'Shape']]
df_ce.loc[36:42]

Unnamed: 0,Color,Color_encoded,Shape
36,green,1,square
37,,3,circle
38,blue,0,square
39,blue,0,square
40,,3,triangle
41,red,2,
42,green,1,square


In [17]:
encoder.inverse_transform(df_ce.loc[36:42, 'Color_encoded'])

array(['green', nan, 'blue', 'blue', nan, 'red', 'green'], dtype=object)
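
The scikit-learn documentation describes LabelEncoder as an encoder for target labels (y). For input feature columns, OrdinalEncoder is the counterpart that accepts 2-D input. The sketch below compares the two on a hypothetical toy frame without missing values (the `toy` data is an assumption for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample without missing values
toy = pd.DataFrame({'Color': ['green', 'blue', 'blue', 'red']})

# LabelEncoder: 1-D input; classes are sorted and codes are their positions
le = LabelEncoder()
codes = le.fit_transform(toy['Color'])
print(le.classes_)
print(codes)

# OrdinalEncoder: 2-D input (one column per feature), returns floats
oe = OrdinalEncoder()
codes_2d = oe.fit_transform(toy[['Color']])
print(oe.categories_)
print(codes_2d.ravel())

# Round trip back to the original labels
restored = le.inverse_transform(codes)
print(restored)
```

Either way the integers only reflect alphabetical order, so a model may read an ordering into them (blue < green < red) that the data does not have.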

---
<a id=2></a>
[Back to top](#0)

---
## 2. get_dummies()  
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html  
Note: converts categorical variables into dummy variables (0 or 1)

* Creating dummy variables
* What drop_first=True does
* What happens to np.nan
---

Creating dummy variables

In [18]:
pd.get_dummies(df['Color']).head()

Unnamed: 0,blue,green,red
0,False,True,False
1,True,False,False
2,True,False,False
3,False,True,False
4,False,False,True


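What drop_first=True does: the first category (here blue) gets no column of its own and is represented by all remaining dummy columns being False.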

In [20]:
pd.get_dummies(df['Color'], drop_first=True).head()

Unnamed: 0,green,red
0,True,False
1,False,False
2,False,False
3,True,False
4,False,True
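
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical Series. `dtype=int` is passed only so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical sample for illustration
toy = pd.Series(['green', 'blue', 'red', 'blue'], name='Color')

full = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full.columns.tolist())     # one column per category
print(reduced.columns.tolist())  # the first category (alphabetical order) is dropped

# Rows that belonged to the dropped category are all zeros in the reduced frame
baseline_rows = reduced.index[reduced.sum(axis=1) == 0].tolist()
print(baseline_rows)
```

Dropping one column removes the redundancy among the dummies (they always sum to 1), which matters mostly for linear models.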


In [21]:
df_cd = pd.get_dummies(df, columns=['Color'], drop_first=True)

In [22]:
df_cd.head()

Unnamed: 0,Shape,Color_green,Color_red
0,triangle,True,False
1,,False,False
2,square,False,False
3,square,True,False
4,square,False,True


In [23]:
df_cd = pd.get_dummies(df, columns=['Color', 'Shape'], drop_first=True)

In [24]:
df_cd.head()

Unnamed: 0,Color_green,Color_red,Shape_square,Shape_triangle
0,True,False,False,True
1,False,False,False,False
2,False,False,True,False
3,True,False,True,False
4,False,True,True,False


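What happens to np.nan: by default a missing value gets False in every dummy column; with dummy_na=True it gets a column of its own.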

In [25]:
df_cd.isnull().sum()

Color_green       0
Color_red         0
Shape_square      0
Shape_triangle    0
dtype: int64

In [27]:
df_cd.loc[36:42]

Unnamed: 0,Color_green,Color_red,Shape_square,Shape_triangle
36,True,False,True,False
37,False,False,False,False
38,False,False,True,False
39,False,False,True,False
40,False,False,False,True
41,False,True,False,False
42,True,False,True,False


In [28]:
df_cd = pd.get_dummies(df, columns=['Color'], drop_first=True, dummy_na=True)
df_cd.loc[36:42]

Unnamed: 0,Shape,Color_green,Color_red,Color_nan
36,square,True,False,False
37,circle,False,False,True
38,square,False,False,False
39,square,False,False,False
40,triangle,False,False,True
41,,False,True,False
42,square,True,False,False
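
The difference between the default behavior and `dummy_na=True` can be checked on a single missing row. This is a sketch on a hypothetical frame. `dtype=int` is used only to keep the output numeric across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in row 1
toy = pd.DataFrame({'Color': ['green', np.nan, 'blue', 'red']})

plain = pd.get_dummies(toy, columns=['Color'], dtype=int)
with_na = pd.get_dummies(toy, columns=['Color'], dummy_na=True, dtype=int)

print(plain.columns.tolist())
print(with_na.columns.tolist())

# Row 1 is all zeros by default, but flagged in Color_nan when dummy_na=True
print(plain.loc[1].sum(), with_na.loc[1, 'Color_nan'])
```

Note that when `drop_first=True` is used without `dummy_na=True`, a missing value and the dropped first category both become all-zero rows and cannot be told apart (rows 37 and 38 in the output for In [27] above).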


---
<a id=3></a>
[Back to top](#0)

---
## 3. OneHotEncoder()  
Note: one-hot means exactly one entry is 1 and all others are 0  
Note: creates dummy variables using features that pd.get_dummies() does not offer

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Default keyword arguments: drop=None, handle_unknown='error'

* Trying out OneHotEncoder()
* Transforming multiple features
---

Trying out OneHotEncoder()

In [40]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()

In [41]:
encoder.fit(df[['Color']])

In [42]:
encoder.categories_

[array(['blue', 'green', 'red', nan], dtype=object)]

In [43]:
encoder.transform(df[['Color']])

<100x4 sparse matrix of type '<class 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>

In [44]:
encoder.transform(df[['Color']]).toarray()[:5]

array([[0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.]])

Transforming multiple features

In [45]:
encoder = OneHotEncoder()

In [46]:
encoder.fit(df)

In [47]:
encoder.categories_

[array(['blue', 'green', 'red', nan], dtype=object),
 array(['circle', 'square', 'triangle', nan], dtype=object)]

In [49]:
encoder.transform(df).toarray()[:5]

array([[0., 1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0.]])

In [50]:
encoder.inverse_transform([[0, 1, 0, 0, 0, 1, 0, 0]])

array([['green', 'square']], dtype=object)

---
<a id=4></a>
[Back to top](#0)

---
## 4. Differences between pd.get_dummies() and OneHotEncoder()

* get_dummies() can produce different columns for the training set and the test set
* With OneHotEncoder(handle_unknown='error', drop='first')
* With OneHotEncoder(handle_unknown='ignore')
---

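get_dummies() builds its columns from the values present in each frame, so the training set and the test set can end up with different columns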

In [52]:
np.random.seed(1)
s = pd.Series(np.random.choice([0, 1],len(df)), name='target')
s

0     1
1     1
2     0
3     0
4     1
     ..
95    0
96    1
97    1
98    0
99    1
Name: target, Length: 100, dtype: int64

In [55]:
df_new = pd.concat([df, s],axis=1)
df_new.head()

Unnamed: 0,Color,Shape,target
0,green,triangle,1
1,blue,,1
2,blue,square,0
3,green,square,0
4,red,square,1


In [56]:
from sklearn.model_selection import train_test_split

In [57]:
y = df_new.pop('target')
X = df_new

In [63]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, 
                                                    stratify=y, random_state=17)

In [64]:
X_test

Unnamed: 0,Color,Shape
16,blue,circle
29,green,triangle
80,blue,triangle
44,blue,triangle
48,blue,triangle


In [65]:
pd.get_dummies(X_train, drop_first=True, dummy_na=True).head()

Unnamed: 0,Color_green,Color_red,Color_nan,Shape_square,Shape_triangle,Shape_nan
94,False,True,False,True,False,False
3,True,False,False,True,False,False
25,True,False,False,False,True,False
42,True,False,False,True,False,False
69,False,True,False,False,True,False


In [66]:
pd.get_dummies(X_test, drop_first=True, dummy_na=True).head()

Unnamed: 0,Color_green,Color_nan,Shape_triangle,Shape_nan
16,False,False,False,False
29,True,False,True,False
80,False,False,True,False
44,False,False,True,False
48,False,False,True,False
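
If get_dummies() has to be used anyway, one common workaround (not part of the original notebook) is to align the test columns to the training columns with `reindex`. A sketch on hypothetical frames:

```python
import pandas as pd

# Hypothetical split: 'red' never appears in the test frame
train = pd.DataFrame({'Color': ['blue', 'green', 'red', 'green']})
test = pd.DataFrame({'Color': ['blue', 'green']})

train_d = pd.get_dummies(train, columns=['Color'], drop_first=True, dtype=int)
test_d = pd.get_dummies(test, columns=['Color'], drop_first=True, dtype=int)
print(train_d.columns.tolist())
print(test_d.columns.tolist())  # Color_red is missing here

# Add the missing columns (filled with 0) and use the training column order
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```

Be aware that `reindex` also silently drops any test column that does not exist in the training frame, so unseen categories disappear without a warning.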


In [71]:
encoder = OneHotEncoder(drop='first')

In [72]:
encoder.fit_transform(X_train).toarray()[:5]

array([[0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0.]])

In [73]:
encoder.transform(X_test).toarray()

array([[0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.]])

With OneHotEncoder(handle_unknown='error', drop='first')

In [77]:
encoder_error = OneHotEncoder(handle_unknown='error', drop='first')

In [82]:
encoder_error.fit_transform(X_train).toarray()[:5]

array([[0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0.]])

In [83]:
encoder_error.transform(X_test).toarray()[:5]

array([[0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.]])

In [80]:
X_test_new = X_test.copy()
X_test_new.loc[6, 'Color'] = 'purple'
X_test_new

Unnamed: 0,Color,Shape
16,blue,circle
29,green,triangle
80,blue,triangle
44,blue,triangle
48,blue,triangle
6,purple,


In [84]:
# encoder_error.transform(X_test_new)  # raises ValueError: 'purple' was not seen during fit
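
The call in the cell above is commented out because it fails. The failure can be reproduced safely inside a `try`/`except`, sketched here on hypothetical frames:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'purple' is absent from the training frame
train = pd.DataFrame({'Color': ['blue', 'green', 'red']})
unseen = pd.DataFrame({'Color': ['purple']})

enc = OneHotEncoder(handle_unknown='error', drop='first')
enc.fit(train)

raised = False
try:
    enc.transform(unseen)
except ValueError as exc:
    # handle_unknown='error' refuses categories not seen during fit
    raised = True
    print('transform failed:', exc)
```

Failing loudly is useful when an unseen category indicates a data problem that should be fixed upstream rather than encoded.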

With OneHotEncoder(handle_unknown='ignore')

In [96]:
encoder_ignore = OneHotEncoder(handle_unknown='ignore')  # when 'ignore' is specified, drop='first' is not used

In [97]:
encoder_ignore.fit(X_train)

In [98]:
encoder_ignore.transform(X_test).toarray()

array([[1., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.]])

In [99]:
encoder_ignore.transform(X_test_new).toarray()

array([[1., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.]])
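
The last row above shows the effect of `'ignore'`: the unknown color produces zeros in all of its columns. A smaller sketch on hypothetical frames, including what `inverse_transform` returns for such a row:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'purple' is absent from the training frame
train = pd.DataFrame({'Color': ['blue', 'green', 'red']})
new = pd.DataFrame({'Color': ['green', 'purple']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

encoded = enc.transform(new).toarray()
print(encoded)  # the 'purple' row is all zeros

# An all-zero row cannot be mapped back to a category and is decoded as None
decoded = enc.inverse_transform(encoded)
print(decoded)
```

Because no column is dropped here, an all-zero block can only mean an unknown category. That is a likely reason the notebook avoids combining `'ignore'` with `drop='first'`.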

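#### Choosing by situation (examples)
* Few distinct categories, many records  
    => no category is missing from the test data => get_dummies, OneHotEncoder(drop='first')
* Few distinct categories, few records  
    => some categories may be missing from the test data => OneHotEncoder(handle_unknown='error', drop='first')
* Many distinct categories, few records  
    => the test data will almost certainly contain values absent from the training data => OneHotEncoder, handle_unknown='ignore'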

---
<a id=5></a>
[Back to top](#0)

---
## 5. Using the Series str accessor

* What Series.str is
* Checking the methods
* Frequently used operations: replace, extract, split
---

What Series.str is

In [100]:
df = pd.DataFrame()
df['ID'] = ['A-123', 'B-456', 'A-789', 'B-123']
df['Color'] = ['py/white black', 'red green blue', 'py/yellow', 'purple white']
df

Unnamed: 0,ID,Color
0,A-123,py/white black
1,B-456,red green blue
2,A-789,py/yellow
3,B-123,purple white


In [102]:
df['ID'].str
# df.str  # AttributeError: the .str accessor exists on Series, not on DataFrame

<pandas.core.strings.accessor.StringMethods at 0x17de38e50>

In [109]:
df['ID'].str[:3]

0    A-1
1    B-4
2    A-7
3    B-1
Name: ID, dtype: object

Checking the methods

In [110]:
df['ID'].str.lower()

0    a-123
1    b-456
2    a-789
3    b-123
Name: ID, dtype: object

In [114]:
df['ID'].str.startswith('B')
# str.endswith() works the same way for suffixes

0    False
1     True
2    False
3     True
Name: ID, dtype: bool

In [115]:
df['Color'].str.contains('white')

0     True
1    False
2    False
3     True
Name: Color, dtype: bool

In [116]:
df['Color'].str.contains('ye|pu')

0    False
1    False
2     True
3     True
Name: Color, dtype: bool
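
The pattern `'ye|pu'` matches because `str.contains` treats its argument as a regular expression by default, so `|` means "or". A short sketch of the difference when regex handling is switched off:

```python
import pandas as pd

colors = pd.Series(['py/white black', 'red green blue', 'py/yellow', 'purple white'])

# Default: the pattern is a regular expression, so this means 'ye' OR 'pu'
as_regex = colors.str.contains('ye|pu')

# regex=False searches for the literal five characters 'ye|pu'
as_literal = colors.str.contains('ye|pu', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

Passing `regex=False` is the safer choice when the search text may contain characters such as `|`, `.`, or `(` that should be matched literally.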

Frequently used operations: replace, extract, split

In [117]:
df['Color'].str.replace('black', 'gold')

0     py/white gold
1    red green blue
2         py/yellow
3      purple white
Name: Color, dtype: object

In [118]:
df['ID'].str.split('-')

0    [A, 123]
1    [B, 456]
2    [A, 789]
3    [B, 123]
Name: ID, dtype: object

In [119]:
df['ID'].str.split('-', expand=True)

Unnamed: 0,0,1
0,A,123
1,B,456
2,A,789
3,B,123


In [121]:
df[['ID_a', 'ID_n']] = df['ID'].str.split('-', expand=True)
df

Unnamed: 0,ID,Color,ID_a,ID_n
0,A-123,py/white black,A,123
1,B-456,red green blue,B,456
2,A-789,py/yellow,A,789
3,B-123,purple white,B,123


In [122]:
df[['Color_1', 'Color_2', 'Color_3']] = df['Color'].str.split(' ', expand=True)
df

Unnamed: 0,ID,Color,ID_a,ID_n,Color_1,Color_2,Color_3
0,A-123,py/white black,A,123,py/white,black,
1,B-456,red green blue,B,456,red,green,blue
2,A-789,py/yellow,A,789,py/yellow,,
3,B-123,purple white,B,123,purple,white,


In [123]:
df['Color_1'].str.extract('(py/)', expand=True)

Unnamed: 0,0
0,py/
1,
2,py/
3,


In [125]:
df['py'] = df['Color_1'].str.extract('(py/)', expand=True)
df

Unnamed: 0,ID,Color,ID_a,ID_n,Color_1,Color_2,Color_3,py
0,A-123,py/white black,A,123,py/white,black,,py/
1,B-456,red green blue,B,456,red,green,blue,
2,A-789,py/yellow,A,789,py/yellow,,,py/
3,B-123,purple white,B,123,purple,white,,
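
`str.extract` can also pull out several parts at once: each capture group becomes a column, and named groups supply the column names. The sketch below is an alternative to the `split('-')` approach used earlier. The regular expression is an assumption about the ID format (one uppercase letter, a hyphen, then digits).

```python
import pandas as pd

ids = pd.Series(['A-123', 'B-456', 'A-789', 'B-123'], name='ID')

# Named groups (?P<name>...) become the column names of the result
parts = ids.str.extract(r'(?P<ID_a>[A-Z])-(?P<ID_n>\d+)')

# Extracted values are strings; convert the numeric part explicitly
parts['ID_n'] = parts['ID_n'].astype(int)
print(parts)
```

Unlike `split`, rows that do not match the pattern come back as NaN. That would surface malformed IDs, though the integer conversion above would then need to handle those missing values.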


In [126]:
df['Color_1'] = df['Color_1'].str.replace('py/', '')
df

Unnamed: 0,ID,Color,ID_a,ID_n,Color_1,Color_2,Color_3,py
0,A-123,py/white black,A,123,white,black,,py/
1,B-456,red green blue,B,456,red,green,blue,
2,A-789,py/yellow,A,789,yellow,,,py/
3,B-123,purple white,B,123,purple,white,,
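
The default of the `regex` parameter of `str.replace` has differed between pandas versions, so stating it explicitly keeps the behavior stable. A sketch of two ways to strip the `py/` prefix, on a hypothetical copy of the `Color_1` values:

```python
import pandas as pd

# Hypothetical copy of the Color_1 values before the prefix was removed
first = pd.Series(['py/white', 'red', 'py/yellow', 'purple'])

# Literal replacement: removes every occurrence of the text 'py/'
literal = first.str.replace('py/', '', regex=False)

# Regex replacement anchored with ^: removes 'py/' only at the start
anchored = first.str.replace(r'^py/', '', regex=True)

print(literal.tolist())
print(anchored.tolist())
```

Both give the same result here. The anchored version would leave a `py/` in the middle of a value untouched.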


---
[Back to top](#0)

---
## End
    
---