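# 3. Data Representation and Feature Engineering

## 3.1 Categorical and Continuous Variables

Some features represent categories and others represent continuous values. Each kind needs appropriate preprocessing before it is passed to a machine learning model.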

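- Categorical variable: takes discrete values that act as labels.  
  Examples: sex, marital status  
  Note: many features that look numeric should still be treated as categorical.  
  Examples: postal codes, telephone area codes

- Continuous variable: numeric data.  
  Example: price

- Ordinal variable: takes discrete values that have a clear order.  
  Example: grades (excellent, good, pass, fail)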
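
The three kinds of variable listed above can be declared explicitly in pandas. The following is a minimal sketch on hypothetical toy data (the column names `zip_code`, `price`, and `grade` and the English grade labels are assumptions for illustration, not part of the census dataset used below).

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "zip_code": [10001, 94105, 10001],        # looks numeric, but is a label
    "price": [12.5, 8.0, 21.0],               # continuous
    "grade": ["good", "excellent", "fail"],   # ordinal
})

# Mark the zip code as categorical so it is not treated as a quantity.
toy["zip_code"] = toy["zip_code"].astype("category")

# Declare the grade order explicitly, from worst to best.
grade_order = ["fail", "pass", "good", "excellent"]
toy["grade"] = pd.Categorical(toy["grade"], categories=grade_order, ordered=True)

# An ordered categorical exposes integer codes that follow the declared order.
grade_codes = toy["grade"].cat.codes.tolist()
print(toy.dtypes)
print(grade_codes)
```

Whether an ordinal variable should be fed to a model as integer codes or one-hot encoded depends on the model, so this only shows how to record the distinction in the data.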

Example: a dataset built from census results, used to predict whether a person's income exceeds $50K.

In [38]:
# hide
# Download the census dataset files from the UCI repository.
import subprocess
from pathlib import Path

import pandas as pd

DATASET = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names",
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
)

def download():
    # Save each file under data/, creating the directory if it does not exist.
    Path("data").mkdir(exist_ok=True)
    for u in DATASET:
        f = u.split("/")[-1]
        subprocess.run(["curl", u, "-o", str(Path("data") / f)], check=True)

# download()  # uncomment on the first run
file_name = "data/adult.data"

labels = """age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
target""".split("\n")

# The file has no header row, so the column names are supplied explicitly.
df = pd.read_csv(file_name, header=None, names=labels)
df

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |


In [39]:
df['workclass'].value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

In [40]:
df['relationship'].value_counts()

 Husband           13193
 Not-in-family      8305
 Own-child          5068
 Unmarried          3446
 Wife               1568
 Other-relative      981
Name: relationship, dtype: int64
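
The leading spaces in the counts above suggest that each string value carries a space left over from the `, ` separator in the raw file, and the `?` entry under `workclass` appears to stand for unknown values. The following is a minimal sketch of one way to clean this up, using hypothetical rows rather than the real file. The cells below do not apply this step.

```python
import pandas as pd

# Hypothetical rows imitating the raw values shown above (note the leading spaces).
raw = pd.DataFrame({"workclass": [" Private", " ?", " State-gov", " Private"]})

# Remove the surrounding whitespace from every label.
stripped = raw["workclass"].str.strip()

# Replace the "?" placeholder with a missing value (NaN).
cleaned = stripped.mask(stripped == "?")
print(cleaned.value_counts(dropna=False))
```

Applying this to `df` would also change the column names that `pd.get_dummies` generates (for example `workclass_Private` instead of `workclass_ Private`), so any later cell that refers to those names would need to be adjusted.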

## 3.2 One-Hot Encoding

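- To be used by a machine learning model, a categorical variable has to be converted to numbers in some way.
- One-hot encoding assigns a separate 0/1 feature to each category value.  
   Example: suppose feature X takes the four values A, B, C, and D. Introduce one feature for each of A, B, C, and D, and set it to 1 when X has that value and to 0 otherwise.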
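
The A, B, C, D example above can be sketched with `pd.get_dummies`. The toy column `X` and its values are hypothetical. `dtype=int` is passed so that the result is 0/1 integers regardless of the pandas version, since some versions return booleans by default.

```python
import pandas as pd

# Hypothetical feature X taking the four values A, B, C, D.
x = pd.DataFrame({"X": ["A", "C", "B", "D", "A"]})

# One new column per category value; exactly one of them is 1 in each row.
encoded = pd.get_dummies(x, dtype=int)
print(encoded)
```

The new columns are named `X_A`, `X_B`, `X_C`, and `X_D`, following the `<column>_<value>` pattern that also appears in the encoded census data below.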

In [41]:
# Keep only three columns so the effect of the encoding is easy to see.
shrinked = df[['age', 'workclass', 'target']]
shrinked

| | age | workclass | target |
|---|---|---|---|
| 0 | 39 | State-gov | <=50K |
| 1 | 50 | Self-emp-not-inc | <=50K |
| 2 | 38 | Private | <=50K |
| 3 | 53 | Private | <=50K |
| 4 | 28 | Private | <=50K |
| ... | ... | ... | ... |
| 32556 | 27 | Private | <=50K |
| 32557 | 40 | Private | >50K |
| 32558 | 58 | Private | <=50K |
| 32559 | 22 | Private | <=50K |


In [42]:
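# String columns are expanded into one 0/1 column per value; the numeric age column is left unchanged.
pd.get_dummies(shrinked)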

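| | age | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | workclass_ Without-pay | target_ <=50K | target_ >50K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 38 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 53 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 28 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 32557 | 40 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 32558 | 58 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 32559 | 22 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |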


In [52]:
# One-hot encode every string column of the full dataset.
df_e = pd.get_dummies(df)

In [57]:
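# Use the "<=50K" indicator as the label (1 means income <=50K),
# then remove both target columns from the feature matrix.
target = df_e['target_ <=50K']
df_e = df_e.drop(columns=['target_ <=50K', 'target_ >50K'])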

In [59]:
df_e.to_numpy()

array([[    39,  77516,     13, ...,      1,      0,      0],
       [    50,  83311,     13, ...,      1,      0,      0],
       [    38, 215646,      9, ...,      1,      0,      0],
       ...,
       [    58, 151910,      9, ...,      1,      0,      0],
       [    22, 201490,      9, ...,      1,      0,      0],
       [    52, 287927,      9, ...,      1,      0,      0]])

In [60]:
target.to_numpy()

array([1, 1, 1, ..., 1, 1, 0], dtype=uint8)

In [64]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

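# Hold out part of the data for testing (train_test_split keeps 25% by default).
X_train, X_test, y_train, y_test = train_test_split(df_e.to_numpy(), target.to_numpy(), random_state=1)

# Fit a logistic regression with default settings and predict on the held-out data.
model = LogisticRegression()
model.fit(X_train, y_train)
predicted = model.predict(X_test)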




In [71]:
import numpy as np
from sklearn.metrics import accuracy_score

# Accuracy computed by hand and with scikit-learn; accuracy_score expects (y_true, y_pred).
print(np.mean((predicted == y_test)))
print(accuracy_score(y_test, predicted))

0.8073946689595872
0.8073946689595872
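
The two printed values agree because accuracy is simply the fraction of predictions that match the true labels. The following is a minimal sketch with hypothetical labels and predictions (`y_true` and `y_pred` are made up for illustration and are unrelated to the model above).

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# The elementwise comparison yields booleans; their mean is the share of correct predictions.
accuracy = np.mean(y_pred == y_true)
print(accuracy)
```

Here three of the five predictions match, so the result is 0.6.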


In [76]:
from sklearn.ensemble import RandomForestClassifier

def eval_classifier(mdl, X_train, X_test, y_train, y_test):
    # Fit the model, then report accuracy on the training and test sets.
    mdl.fit(X_train, y_train)
    print("{}\ttrain: {}\ttest: {}".format(
        mdl.__class__.__name__,
        accuracy_score(y_train, mdl.predict(X_train)),
        accuracy_score(y_test, mdl.predict(X_test))))

eval_classifier(RandomForestClassifier(), X_train, X_test, y_train, y_test)


RandomForestClassifier	train: 0.9999590499590499	test: 0.8597223928264341


In [77]:
from sklearn.ensemble import GradientBoostingClassifier
 
eval_classifier(GradientBoostingClassifier(), X_train, X_test, y_train, y_test)

GradientBoostingClassifier	train: 0.8683046683046683	test: 0.8699177005281906


In [78]:
eval_classifier(LogisticRegression(), X_train, X_test, y_train, y_test)

LogisticRegression	train: 0.793980343980344	test: 0.8073946689595872
