## One-hot Encoding 

To compute the loss function for linear regression and apply gradient descent, the input data must be numeric.<br> But what if the data contains categorical features?

One-hot encoding, which creates one new column for each category, <br>
converts categorical data into numeric data.

For each row, one-hot encoding puts a 1 in the column that matches the row's category and a 0 in every other column.
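
The rule above can be sketched by hand before turning to a library helper. This is a minimal illustration only: the `colors` series and the `color_` column prefix are made-up names, not part of the Titanic data used below.

```python
import pandas as pd

# Hypothetical toy column, not taken from the Titanic data.
colors = pd.Series(["red", "green", "red", "blue"], name="color")

# Build one indicator column per distinct category, in sorted order.
categories = sorted(colors.unique())
manual_one_hot = pd.DataFrame(
    {f"color_{c}": (colors == c).astype(int) for c in categories}
)

print(manual_one_hot)
```

Every row of `manual_one_hot` contains exactly one 1, which is where the name "one-hot" comes from.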

In [2]:
import pandas as pd

In [3]:
titanic_df = pd.read_csv('titanic.csv')

In [7]:
# Drop the first column (position 0) and keep all remaining columns
titanic_df = titanic_df.iloc[:, 1:]

In [8]:
titanic_df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Categorical data
* Sex: the passenger's sex
* Embarked: port of embarkation (first letter)

In [12]:
titanic_df_se = titanic_df[['Sex','Embarked']]

In [13]:
titanic_df_se.head()

Unnamed: 0,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S


### one-hot encoding

#### get_dummies
The new column names show which original column and which category each encoded column came from.

In [15]:
one_hot_encoded_df = pd.get_dummies(titanic_df_se)

In [16]:
one_hot_encoded_df.head()

Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,1,0,0,1
1,1,0,1,0,0
2,1,0,0,0,1
3,1,0,0,0,1
4,0,1,0,0,1
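
As a small sketch of the naming convention, the example below encodes a hypothetical three-row stand-in for `titanic_df_se` (the `sample` frame is an assumption, not the real data) and splits each new column name back into its source column and category. It passes `dtype=int` explicitly because pandas 2.0 and later return boolean columns by default, so without it the output may show `True`/`False` instead of the 0/1 values above.

```python
import pandas as pd

# Hypothetical stand-in for titanic_df_se with the same two columns.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# dtype=int requests 0/1 integers instead of booleans.
encoded = pd.get_dummies(sample, dtype=int)
print(list(encoded.columns))

# Each name has the form "<original column>_<category>";
# split on the first underscore to recover both parts.
origin = {col: tuple(col.split("_", 1)) for col in encoded.columns}
print(origin)
```

Splitting on the first underscore only works when the original column name itself contains no underscore, as is the case for `Sex` and `Embarked`.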


#### Using optional parameters
* data: the DataFrame to one-hot encode
* columns: the names of the columns to one-hot encode (as a list)

In [18]:
one_hot_encoded_df = pd.get_dummies(data = titanic_df_se, columns=['Sex','Embarked'])
one_hot_encoded_df

Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,1,0,0,1
1,1,0,1,0,0
2,1,0,0,0,1
3,1,0,0,0,1
4,0,1,0,0,1
...,...,...,...,...,...
886,0,1,0,0,1
887,1,0,0,0,1
888,1,0,0,0,1
889,0,1,1,0,0
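
The `columns` parameter matters most when only some columns should be encoded. The sketch below uses a hypothetical three-row frame (`sample` and its values are illustrative assumptions) to show that columns left out of the list pass through unchanged.

```python
import pandas as pd

# Hypothetical mini-table mixing a numeric and two categorical columns.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Encode only 'Sex'; 'Age' and 'Embarked' are kept as they are.
partial = pd.get_dummies(data=sample, columns=["Sex"], dtype=int)
print(partial)
```

Here `Embarked` stays a text column, so it would still need encoding before being fed to a regression model.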


### Using the gender dataset

In [19]:
gender_df = pd.read_csv('gender.csv')

In [20]:
gender_df

Unnamed: 0,Favorite Color,Favorite Music Genre,Favorite Beverage,Favorite Soft Drink,Gender
0,Cool,Rock,Vodka,7UP/Sprite,F
1,Neutral,Hip hop,Vodka,Coca Cola/Pepsi,F
2,Warm,Rock,Wine,Coca Cola/Pepsi,F
3,Warm,Folk/Traditional,Whiskey,Fanta,F
4,Cool,Rock,Vodka,Coca Cola/Pepsi,F
...,...,...,...,...,...
61,Cool,Rock,Vodka,Coca Cola/Pepsi,M
62,Cool,Hip hop,Beer,Coca Cola/Pepsi,M
63,Neutral,Hip hop,Doesn't drink,Fanta,M
64,Cool,Rock,Wine,Coca Cola/Pepsi,M


### Splitting the data
Create a new DataFrame that excludes the Gender column, which will be used as the target variable.

In [21]:
input_data = gender_df.drop(['Gender'], axis=1)

In [24]:
X = pd.get_dummies(data = input_data)
X

Unnamed: 0,Favorite Color_Cool,Favorite Color_Neutral,Favorite Color_Warm,Favorite Music Genre_Electronic,Favorite Music Genre_Folk/Traditional,Favorite Music Genre_Hip hop,Favorite Music Genre_Jazz/Blues,Favorite Music Genre_Pop,Favorite Music Genre_R&B and soul,Favorite Music Genre_Rock,Favorite Beverage_Beer,Favorite Beverage_Doesn't drink,Favorite Beverage_Other,Favorite Beverage_Vodka,Favorite Beverage_Whiskey,Favorite Beverage_Wine,Favorite Soft Drink_7UP/Sprite,Favorite Soft Drink_Coca Cola/Pepsi,Favorite Soft Drink_Fanta,Favorite Soft Drink_Other
0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0
1,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0
2,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0
3,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
4,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0
62,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0
63,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0
64,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0
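
The notebook stops at the input matrix `X`. One possible next step, sketched here on a hypothetical four-row frame shaped like `gender.csv`, is to turn the target into numbers as well. The names `sample`, `X_sample`, `y_sample`, and `X_reduced` and the F=0 / M=1 mapping are assumptions for illustration, not the author's method. The last line shows the `drop_first=True` option, which drops one category per feature so the encoded columns are not redundant, something that can matter for linear models.

```python
import pandas as pd

# Hypothetical rows shaped like gender.csv (values are illustrative only).
sample = pd.DataFrame({
    "Favorite Color": ["Cool", "Neutral", "Warm", "Cool"],
    "Favorite Soft Drink": ["Fanta", "Other", "Fanta", "Other"],
    "Gender": ["F", "F", "M", "M"],
})

# Inputs: every column except the target, one-hot encoded.
features = sample.drop(columns=["Gender"])
X_sample = pd.get_dummies(features, dtype=int)

# Target: map the two labels to 0/1 so it is numeric too (assumed mapping).
y_sample = (sample["Gender"] == "M").astype(int)

# Optional variant: drop the first category of each feature.
X_reduced = pd.get_dummies(features, drop_first=True, dtype=int)

print(X_sample.shape, X_reduced.shape)
```

Whether to drop a category depends on the model; with regularized or tree-based models the full encoding is often kept.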
