# 範例 : (Kaggle)房價預測
以下用房價預測資料, 觀察均值編碼的效果 <br />
## [教學目標]
以下用房價預測資料, 觀察均值編碼的效果 <br />
## [範例重點]
觀察標籤編碼與均值編碼, 在特徵數量 / 線性迴歸分數 / 線性迴歸時間上, 分別有什麼影響 <br />
觀察標籤編碼與均值編碼, 在特徵數量 / 梯度提升樹分數 / 梯度提升樹時間上, 分別有什麼影響 <br />

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
import copy, time

data_dir = './data/'
train_df = pd.read_csv(data_dir + 'train.csv')
test_df = pd.read_csv(data_dir + 'test.csv')

train_label = train_df.SalePrice
test_ids = test_df.Id
train_df = train_df.drop(['Id','SalePrice'], axis = 1)
test_df = test_df.drop(['Id'], axis = 1)
train_num = len(train_df)

house_df = pd.concat([train_df,test_df])
house_df.head(5)

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,2,2008,WD,Normal
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,0,,,,0,5,2007,WD,Normal
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,9,2008,WD,Normal
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,,0,2,2006,WD,Abnorml
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,0,,,,0,12,2008,WD,Normal


In [2]:
object_features = train_df.columns[train_df.dtypes == 'object']
print(f' {len(object_features)} Object Features : {object_features} ')

object_house_df = house_df[object_features]
object_house_df = object_house_df.fillna('None')
object_house_df.head(5)

 43 Object Features : Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object') 


Unnamed: 0,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,...,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
0,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
1,RL,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
2,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
3,RL,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,...,Detchd,Unf,TA,TA,Y,,,,WD,Abnorml
4,RL,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal


In [3]:
#Label encoding + Linear regression
temp_df = pd.DataFrame()
for col in object_house_df.columns:
    temp_df[col] = LabelEncoder().fit_transform(object_house_df[col])

train_x = temp_df[:train_num]
LR = LinearRegression()
start = time.time()
print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_label,cv=5).mean()} ')
print(f' time : {time.time() - start} second ')

 shape : (1460, 43) 
 score : 0.635847097531328 
 time : 0.05076885223388672 second 


In [40]:
#mean value encoding + Linear regression
temp_data = pd.concat([ object_house_df[:train_num] , train_label ], axis = 1)

for col in object_house_df.columns:
    mean_df = temp_data.groupby([col])['SalePrice'].mean().reset_index()
    mean_df.columns = [col, f'{col}_mean']
    temp_data = pd.merge(temp_data,mean_df,on = [col], how = 'left')
    temp_data = temp_data.drop(col, axis = 1)
    
temp_data = temp_data.drop(['SalePrice'], axis = 1)
LR = LinearRegression()
start = time.time()
print(f' shape : {temp_data.shape} ')
print(f' score : {cross_val_score(LR,temp_data,train_label,cv=5).mean()} ')
print(f' time : {time.time() - start} second ')

 shape : (1460, 43) 
 score : 0.7253748535065236 
 time : 0.022532939910888672 second 


In [41]:
#Label encoding + GBR
temp_df = pd.DataFrame()
for col in object_house_df.columns:
    temp_df[col] = LabelEncoder().fit_transform(object_house_df[col])

train_x = temp_df[:train_num]
GBR = GradientBoostingRegressor()
start = time.time()

print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(GBR,train_x,train_label,cv=5).mean()} ')
print(f' time : {time.time() - start} second')

 shape : (1460, 43) 
 score : 0.7562449106550473 
 time : 0.7495038509368896 second


In [43]:
#Mean value encoding + GBR
temp_data = pd.concat([object_house_df[:train_num], train_label], axis = 1)
for col in object_house_df.columns:
    mean_df = temp_data.groupby([col])['SalePrice'].mean().reset_index()
    mean_df.columns = [col, f'{col}_mean' ]
    temp_data = pd.merge(temp_data,mean_df,on=[col],how = 'left')
    temp_data = temp_data.drop([col], axis = 1)

temp_data = temp_data.drop(['SalePrice'], axis = 1)
GBR = GradientBoostingRegressor()
print(f' shape : {temp_data.shape} ')
print(f' score : {cross_val_score(GBR,temp_data,train_label,cv=5).mean()} ')
print(f' time : {time.time() - start} second ')

 shape : (1460, 43) 
 score : 0.7918479466656946 
 time : 106.66513085365295 second 


# 作業 : (Kaggle)鐵達尼生存預測
https://www.kaggle.com/c/titanic <br />
<br />
## [作業目標]
試著模仿範例寫法, 在鐵達尼生存預測中, 觀察均值編碼的效果 <br />
## [作業重點]
仿造範例, 完成標籤編碼與均值編碼搭配邏輯斯迴歸的預測 <br />
觀察標籤編碼與均值編碼在特徵數量 / 邏輯斯迴歸分數 / 邏輯斯迴歸時間上, 分別有什麼影響 <br />

## 作業1
請仿照範例，將鐵達尼範例中的類別型特徵改用均值編碼實作一次<br />

In [45]:
df_train = pd.read_csv(data_dir + 'titanic_train.csv')
df_test = pd.read_csv(data_dir + 'titanic_test.csv')

train_y = df_train.Survived
test_idss = df_test.PassengerId

df_train = df_train.drop(['PassengerId','Survived'], axis = 1)
df_test = df_test.drop(['PassengerId'], axis = 1)
train_num = len(train_y)

df = pd.concat([df_train,df_test])
df.head(5)

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [47]:
object_features = df.columns[df.dtypes == 'object']
print(f' {len(object_features)} Object Features : {object_features} ')

object_df = df[object_features]
object_df = object_df.fillna('None')
object_df.head(5)

 5 Object Features : Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object') 


Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S


In [49]:
import warnings
warnings.filterwarnings('ignore')
#Label Encoding + LogisticRegression
df_temp = pd.DataFrame()
for col in object_df.columns:
    df_temp[col] = LabelEncoder().fit_transform(object_df[col])

train_x = df_temp[:train_num]
LR = LogisticRegression()
start = time.time()

print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_y,cv=5).mean() } ' )
print(f' time : {time.time() - start} second' )

 shape : (891, 5) 
 score : 0.780004837244799 
 time : 0.03939318656921387 second


In [68]:
#MeanVal encodgin + LogisticRegression
df_temp = pd.concat([object_df[:train_num], train_y] , axis = 1)

for col in object_df.columns:
    mean_df = df_temp.groupby([col])['Survived'].mean().reset_index()
    mean_df.columns = [col, f'{col}_mean']
    df_temp = pd.merge(df_temp,mean_df,on = [col], how = 'left')
    df_temp = df_temp.drop([col], axis = 1)

df_temp = df_temp.drop(['Survived'], axis = 1)
LR = LogisticRegression()
start = time.time()

print(f' shape : {df_temp.shape} ')
print(f' score : {cross_val_score(LR,df_temp,train_y,cv=5).mean() } ' )
print(f' time : {time.time() - start} second' )

 shape : (891, 5) 
 score : 1.0 
 time : 0.022444963455200195 second


In [69]:
print(df_temp.head(10))
print(train_y.head(10))

   Name_mean  Sex_mean  Ticket_mean  Cabin_mean  Embarked_mean
0          0  0.188908          0.0    0.299854       0.336957
1          1  0.742038          1.0    1.000000       0.553571
2          1  0.742038          1.0    0.299854       0.336957
3          1  0.742038          0.5    0.500000       0.336957
4          0  0.188908          0.0    0.299854       0.336957
5          0  0.188908          0.0    0.299854       0.389610
6          0  0.188908          0.0    0.000000       0.336957
7          0  0.188908          0.0    0.299854       0.336957
8          1  0.742038          1.0    0.299854       0.336957
9          1  0.742038          0.5    0.299854       0.553571
0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64


## Answer of homework 2

From above results, we can see that we got 1.0 score on MeanVal encoding + LogisticRegression case. This is very obvious an over-fitting case.<br />
After Checking the data, we found out that the Name column become almost the same as label column. Because the values in name column is too unique, so the mean_df of name column will be almost the same as Survived column.

In [70]:
#how about we drop the Name_mean column and try again
df_temp = df_temp.drop(['Name_mean'], axis = 1)
LR = LogisticRegression()
start = time.time()

print(f' shape : {df_temp.shape} ')
print(f' score : {cross_val_score(LR,df_temp,train_y,cv=5).mean() } ' )
print(f' time : {time.time() - start} second' )

 shape : (891, 4) 
 score : 0.9730773636448428 
 time : 0.03298592567443848 second


## Answer of homework 2

After we drop the Name_mean, the score decrease, but the score still pretty high. Because there several columns in titanic dataset that have very unique values, like ticket is another good example. Lets drop the ticket_mean and try again.

In [71]:
#how about we drop the Name_mean column and try again
df_temp = df_temp.drop(['Ticket_mean'], axis = 1)
LR = LogisticRegression()
start = time.time()

print(f' shape : {df_temp.shape} ')
print(f' score : {cross_val_score(LR,df_temp,train_y,cv=5).mean() } ' )
print(f' time : {time.time() - start} second' )

 shape : (891, 3) 
 score : 0.8350366889413987 
 time : 0.05136919021606445 second


## Conclusion of answer of hw 2

I think the meanval encoding might not be a good option for titanic dataset. Since there are too many unique values in the dataset. So the Label Encoding might be better. But we didnt try to smooth the meanVal columns yet. Maybe it will improve the over-fitting very well.