# Import libraries

In [58]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import dataset

In [59]:
dataset: pd.DataFrame = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values  # all rows, every column except the last (features)
y = dataset.iloc[:, -1].values  # all rows, last column only (target)

print(x)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


# Take care of missing data (replace missing values with the column mean; a common choice, not an absolute rule)

In [60]:
from sklearn.impute import SimpleImputer  # import SimpleImputer class in scikit-learn


imputer: SimpleImputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])  # learn the mean of each numeric column (columns 1 and 2)
x[:, 1:3] = imputer.transform(x[:, 1:3])  # fill the missing cells with those means

print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
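
As a sanity check on the two imputed values above, the column means can be recomputed by hand with np.nanmean, which ignores missing entries. This is only a minimal sketch: the numbers are retyped from the printed output, and the names age and salary are assumptions, since the column headers are not shown here.

```python
import numpy as np

# Numeric columns retyped from the printed dataset above.
# The names "age" and "salary" are assumed; the headers are not shown.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array(
    [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
    dtype=float,
)

# nanmean skips NaN cells, which is what strategy='mean' does when fitting.
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

print(age_mean, salary_mean)
```

Both results should agree with the values that replaced nan in the output above.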


# Encoding Categorical Data

## Encoding the Independent Variable

In [61]:
from sklearn.compose import ColumnTransformer  # applies a transformer to selected columns
from sklearn.preprocessing import OneHotEncoder  # for one-hot encoding


ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))

print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


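For a model to work with categorical data, the categories have to be replaced with numbers.
However, if each category is simply mapped to an integer, the model may read meaning into the ordering of those integers and learn something wrong. One-hot encoding avoids this by turning each category into a vector.

- Example
    1. 5 kinds of category => [0, 0, 0, 1, 0]
    2. 2 kinds of category => [1, 0]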
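
The idea can be sketched without scikit-learn. The helper one_hot below is hypothetical and written only for illustration; it orders categories alphabetically, which matches the column order seen in the output above (France, Germany, Spain).

```python
import numpy as np


def one_hot(values):
    """Return (categories, matrix) with one row per value, one column per category."""
    # np.unique sorts the categories and gives each value its category index.
    categories, indices = np.unique(np.asarray(values), return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for category i.
    return categories, np.eye(len(categories))[indices]


countries = ['France', 'Spain', 'Germany', 'Spain']
categories, encoded = one_hot(countries)

print(categories)
print(encoded)
```

Each row contains exactly one 1, so no category is numerically larger than another.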

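Each entry in the transformers argument of ColumnTransformer is a tuple of three items:
1. a name for the step (here 'encoder')
2. the transformer to apply, i.e. the kind of encoding (here OneHotEncoder)
3. the indices of the columns to transform (here [0])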


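ColumnTransformer.fit_transform is not guaranteed to return a NumPy array: when the stacked output is sparse enough (density below the sparse_threshold setting), it returns a SciPy sparse matrix instead. The result is therefore wrapped in np.array(...) so that x is a NumPy array. In this notebook the output is already dense, so the wrap only makes the type explicit.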
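
The return type can be checked with a small experiment. The sketch below uses made-up single-column data with five distinct categories (an assumption, not part of Data.csv), so the one-hot output is mostly zeros. It also shows sparse_threshold=0, which asks ColumnTransformer for a dense result; note that np.array(...) on its own does not densify a sparse matrix, whereas .toarray() does.

```python
import numpy as np
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column, five distinct values.
labels = np.array([['a'], ['b'], ['c'], ['d'], ['e']], dtype=object)

# Default settings: the one-hot output is sparse enough to stay sparse.
ct_default = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough',
)
encoded_default = ct_default.fit_transform(labels)
is_sparse_default = sparse.issparse(encoded_default)

# sparse_threshold=0 requests a dense NumPy array regardless of density.
ct_dense = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,
)
encoded_dense = ct_dense.fit_transform(labels)

print(is_sparse_default, type(encoded_dense), encoded_dense.shape)
```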

## Encoding the Dependent Variable

In [62]:
from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
y = le.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]


LabelEncoder.fit_transform returns a NumPy array, so no conversion is needed here.
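
To see which label each integer stands for, the fitted encoder exposes classes_ and inverse_transform. A minimal sketch, with the target values retyped from the printed y above (the names y_raw and y_encoded are introduced only for this example):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Target column retyped from the printed output earlier in the notebook.
y_raw = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

le = LabelEncoder()
y_encoded = le.fit_transform(y_raw)

# classes_ lists the labels in sorted order; position i is encoded as integer i.
print(le.classes_)

# inverse_transform maps the integers back to the original labels.
y_restored = le.inverse_transform(y_encoded)
print(y_restored)
```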