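# Example: (Kaggle) House Price Prediction
## [Teaching objective]
Using the house price prediction data, observe the effects of label encoding and one-hot encoding.<br />
## [Example focus]
Observe how label encoding and one-hot encoding each affect the number of features, the linear regression score, and the linear regression time.<br />
Observe how label encoding and one-hot encoding each affect the number of features, the gradient boosting tree score, and the gradient boosting tree time.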

In [3]:
#Prepare for our feature engineering
import pandas as pd
import numpy as np
import copy, time
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

data_dir = './data/'
train_df = pd.read_csv(data_dir + 'train.csv')
test_df = pd.read_csv(data_dir + 'test.csv')

train_label = train_df.SalePrice
test_ids = test_df.Id

train_df = train_df.drop(['Id','SalePrice'], axis = 1)
test_df = test_df.drop(['Id'], axis = 1)

house_df = pd.concat([train_df,test_df])
house_df.head(5)

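| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal |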


In [6]:
#record the length of train_df
house_train_num = len(train_df)

#Take out the object type columns in house_df
object_features = house_df.columns[house_df.dtypes == 'object']

print(f' {len(object_features)} Object Features : {object_features} \n')

house_object_df = house_df[object_features]
house_object_df = house_object_df.fillna('None')
house_object_df.head(5)

 43 Object Features : Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object') 



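| | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | ... | GarageType | GarageFinish | GarageQual | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RL | Pave | None | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | ... | Attchd | RFn | TA | TA | Y | None | None | None | WD | Normal |
| 1 | RL | Pave | None | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | ... | Attchd | RFn | TA | TA | Y | None | None | None | WD | Normal |
| 2 | RL | Pave | None | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | ... | Attchd | RFn | TA | TA | Y | None | None | None | WD | Normal |
| 3 | RL | Pave | None | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | ... | Detchd | Unf | TA | TA | Y | None | None | None | WD | Abnorml |
| 4 | RL | Pave | None | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | ... | Attchd | RFn | TA | TA | Y | None | None | None | WD | Normal |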


In [7]:
temp_house_object_df = pd.DataFrame()

#Label encoding every column of house_object_df
for col in house_object_df.columns:
    temp_house_object_df[col] = LabelEncoder().fit_transform(house_object_df[col])

#Check the score under this label encoding
train_x = temp_house_object_df[:house_train_num]
LR = LinearRegression()
start = time.time()
print(f' shape : { train_x.shape } ')
print(f' score : {cross_val_score(LR,train_x,train_label,cv=5).mean()} ')
print(f' time : {time.time() - start} sec')

 shape : (1460, 43) 
 score : 0.635847097531328 
 time : 0.06618285179138184 sec
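
To make the difference between the two encodings concrete, here is a minimal sketch on a toy column. The values below are made up for illustration and are not taken from the Kaggle data. It shows that LabelEncoder keeps one integer column per feature, with codes assigned in sorted class order, while pd.get_dummies creates one indicator column per distinct value.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical values, not read from train.csv)
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "None"]})

# Label encoding: a single integer column; codes follow the sorted class order
encoder = LabelEncoder()
label_encoded = encoder.fit_transform(toy["Street"])
print(list(encoder.classes_))   # classes in sorted order
print(label_encoded.tolist())   # one integer code per row

# One-hot encoding: one indicator column per distinct value
one_hot = pd.get_dummies(toy)
print(one_hot.columns.tolist())
print(one_hot.shape)            # (rows, number of distinct values)
```

The integer codes carry an ordering (here Grvl < None < Pave) that only reflects alphabetical order, which is something a linear model will treat as meaningful.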


In [30]:
#one-hot encoding + linear regression
temp_house_object_df = pd.get_dummies(house_object_df)
train_x = temp_house_object_df[:house_train_num]
LR = LinearRegression()
start = time.time()

print(f' shape : {train_x.shape} ')
print(f' score : { cross_val_score(LR, train_x ,train_label , scoring = "r2", cv=5 ).mean() }')
print(f' time : {time.time() - start} sec')

 shape : (1460, 274) 
 score : -1.1222461341479273e+23
 time : 0.11149096488952637 sec
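
One plausible contributor to this extreme score is collinearity: the indicator columns generated from each original feature sum to 1 in every row, so they are linearly dependent on each other and on the intercept. The sketch below uses a toy frame (hypothetical values) to show the rank deficiency, and how the `drop_first=True` option of pd.get_dummies removes it. It is an illustration under these assumptions, not a diagnosis of the run above.

```python
import numpy as np
import pandas as pd

# Toy data with two categorical features (hypothetical values)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave", "Grvl", "Pave"],
    "CentralAir": ["Y", "N", "Y", "Y", "Y", "N"],
})

# Full one-hot encoding: the dummies of each feature sum to 1 in every row
full = pd.get_dummies(toy).astype(float)
street_sum = full[["Street_Grvl", "Street_Pave"]].sum(axis=1)
air_sum = full[["CentralAir_N", "CentralAir_Y"]].sum(axis=1)

# Because both groups sum to the same constant, the columns are linearly dependent
rank_full = np.linalg.matrix_rank(full.to_numpy())
print(full.shape[1], rank_full)        # more columns than the matrix rank

# Dropping one level per feature removes the exact dependency
reduced = pd.get_dummies(toy, drop_first=True).astype(float)
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())
print(reduced.shape[1], rank_reduced)  # column count equals the rank
```

On the real data, rare categories that appear in only some cross-validation folds may add to the instability, so a regularized model such as Ridge could be worth trying as an alternative to plain LinearRegression.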


In [10]:
#Label encoding + GBT

temp_house_object_df = pd.DataFrame()
for col in house_object_df.columns:
    temp_house_object_df[col] = LabelEncoder().fit_transform(house_object_df[col])

train_x = temp_house_object_df[:house_train_num]
GBT = GradientBoostingRegressor()
start = time.time()
print(f' shape : {train_x.shape}')
print(f' score : {cross_val_score(GBT,train_x,train_label,cv=5).mean()}')
print(f' time : {time.time() - start} sec')

 shape : (1460, 43)
 score : 0.7555628664753397
 time : 0.9405481815338135 sec


In [12]:
# one-hot-encoding + GBT
temp_house_object_df = pd.get_dummies(house_object_df)
train_x = temp_house_object_df[:house_train_num]
GBT = GradientBoostingRegressor()
start = time.time()
print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(GBT,train_x,train_label,cv=5).mean()}')
print(f' time : {time.time() - start} sec')

 shape : (1460, 274) 
 score : 0.7803543685378161
 time : 2.807400941848755 sec
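
The four cells above repeat the same pattern: encode, cross-validate, and time. A small helper can run every encoding and model pair in one loop. The sketch below is one way to do that; the `evaluate` function and the synthetic data are assumptions made for illustration and do not reproduce the Kaggle scores.

```python
import time

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder


def evaluate(model, features, target, cv=5):
    """Return (feature count, mean CV score, elapsed seconds) for one model."""
    start = time.perf_counter()
    score = cross_val_score(model, features, target, cv=cv).mean()
    elapsed = time.perf_counter() - start
    return features.shape[1], score, elapsed


# Synthetic stand-in for the categorical house features (hypothetical data)
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "quality": rng.choice(["Po", "Fa", "TA", "Gd", "Ex"], size=200),
    "zone": rng.choice(["RL", "RM", "FV"], size=200),
})
quality_value = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
target = raw["quality"].map(quality_value) + rng.normal(0, 0.1, size=200)

# The two encodings compared in the example
encodings = {
    "label": pd.DataFrame(
        {col: LabelEncoder().fit_transform(raw[col]) for col in raw.columns}
    ),
    "one-hot": pd.get_dummies(raw),
}
models = {
    "LinearRegression": LinearRegression,
    "GradientBoostingRegressor": GradientBoostingRegressor,
}

results = []
for enc_name, features in encodings.items():
    for model_name, model_cls in models.items():
        n_features, score, elapsed = evaluate(model_cls(), features, target)
        results.append((enc_name, model_name, n_features, score, elapsed))
        print(f"{enc_name:8s} {model_name:26s} "
              f"features={n_features} score={score:.3f} time={elapsed:.3f}s")
```

Using `time.perf_counter()` and timing only the `cross_val_score` call keeps the measurement from including the cost of printing.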


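# Homework: (Kaggle) Titanic Survival Prediction
https://www.kaggle.com/c/titanic <br />
<br />
## [Homework objective]
Following the example, observe the effects of label encoding and one-hot encoding on Titanic survival prediction.<br />
## [Homework focus]
Answer the question about the observations in the example.<br />
Observe how label encoding and one-hot encoding each affect the number of features, the logistic regression score, and the logistic regression time.<br />
<br />
## Homework 1
Looking at the example: in house price prediction, when the encoding is switched between label encoding (Label Encoder) and one-hot encoding (One Hot Encoder),<br />
which of the two models, linear regression or gradient boosting trees, is affected more?<br />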

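## Answer of homework 1:

The example shows that the choice of encoding affects the LinearRegression model far more than the gradient boosting model.<br />
With one-hot encoding, the cross-validated LinearRegression score collapses from about 0.64 to about -1.1e23, whereas the GradientBoostingRegressor score only moves from about 0.76 to about 0.78 (at roughly three times the running time).<br />
When no scoring argument is given, cross_val_score uses the estimator's own score method, which is R2 for regressors such as LinearRegression.<br />
R2 is negative when the model predicts worse than a constant model that always outputs the mean of the target. A value this extreme suggests that the fit is numerically unstable, probably because the 274 one-hot columns are highly collinear and some categories are very rare.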
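
As a small illustration of how R2 can go below zero, recall that R2 = 1 - SS_res / SS_tot, where SS_res is the squared error of the model and SS_tot is the squared error of always predicting the mean. The sketch below uses made-up numbers and `sklearn.metrics.r2_score` to show the three cases: a mean-only prediction, a perfect prediction, and a prediction that runs against the trend.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: always predict the mean, so SS_res equals SS_tot
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# Perfect prediction: SS_res is zero
r2_perfect = r2_score(y_true, y_true)

# Predictions that run against the trend: SS_res exceeds SS_tot
reversed_pred = y_true[::-1]
r2_reversed = r2_score(y_true, reversed_pred)

print(r2_mean, r2_perfect, r2_reversed)  # 0.0 1.0 -3.0
```

Because SS_res has no upper bound, R2 has no lower bound, which is why a numerically unstable fit can produce a score as extreme as the one in the example.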

In [14]:
from sklearn.linear_model import LogisticRegression

df_train = pd.read_csv(data_dir + 'titanic_train.csv')
df_test = pd.read_csv(data_dir + 'titanic_test.csv')

train_y = df_train.Survived
test_ids = df_test.PassengerId

df_train = df_train.drop(['Survived', 'PassengerId'], axis = 1)
df_test = df_test.drop(['PassengerId'], axis = 1)

df = pd.concat([df_train, df_test])
titanic_train_num = len(df_train)
df.head(5)

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [31]:
object_features = df.columns[df.dtypes == 'object']
print(f' {len(object_features)} Object Features : {object_features} \n')

df = df[object_features]
df = df.fillna('None')
df.head(5)

 5 Object Features : Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object') 



| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | male | A/5 21171 | None | S |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | PC 17599 | C85 | C |
| 2 | Heikkinen, Miss. Laina | female | STON/O2. 3101282 | None | S |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 113803 | C123 | S |
| 4 | Allen, Mr. William Henry | male | 373450 | None | S |


## Homework 2
In the Titanic problem, how do label encoding and one-hot encoding each affect the prediction results?

In [33]:
import warnings 
warnings.filterwarnings('ignore')
# Label Encoding + LogisticRegression
temp_df = pd.DataFrame()
for col in df.columns:
    temp_df[col] = LabelEncoder().fit_transform(df[col])

train_x = temp_df[:titanic_train_num]
LR = LogisticRegression()
start = time.time()

print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_y,cv=5).mean()} ')
print(f' time : {time.time() - start} second' )

 shape : (891, 5) 
 score : 0.780004837244799 
 time : 0.04443788528442383 second


In [34]:
#One hot encoding + LogisticRegression
temp_df = pd.get_dummies(df)
train_x = temp_df[:titanic_train_num]
LR = LogisticRegression()
start = time.time()
print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_y,cv=5).mean()} ')
print(f' time : {time.time() - start} second' )

 shape : (891, 2429) 
 score : 0.8013346043513216 
 time : 0.13790297508239746 second


## Answer of homework2
The results above show that one-hot encoding scores higher than label encoding with LogisticRegression (about 0.80 versus 0.78 accuracy). <br />
One possible reason: <br />
-> Label encoding turns Embarked into integers such as 0, 1, 2. This imposes an order (0 smallest, 2 largest) that the ports of embarkation do not actually have, which may mislead a linear model.<br />
<br />
One-hot encoding also makes the feature matrix much wider (5 columns become 2429) and the run slower (about 0.04 s versus 0.14 s), because columns such as Name and Ticket contain a large number of distinct values.
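
The width of a one-hot encoded frame can be estimated before encoding: each object column contributes one indicator column per distinct value. The sketch below checks this on a toy frame. The Ticket strings are copied from the table above, while the other values and the frame itself are made up for illustration.

```python
import pandas as pd

# Toy frame: two low-cardinality columns and one high-cardinality column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S"],
    "Ticket": ["A/5 21171", "PC 17599", "113803", "373450"],
})

# Distinct values per column predict the one-hot width
distinct_counts = toy.nunique()
print(distinct_counts.to_dict())

one_hot = pd.get_dummies(toy)
print(one_hot.shape[1], int(distinct_counts.sum()))  # both counts agree
```

Running `df.nunique()` on the real Titanic frame before calling pd.get_dummies would show which columns account for most of the 2429 features.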