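# Example: (Kaggle) Titanic Survival Prediction
## [Learning Objectives]
Using the Titanic survival prediction data, observe the effect of count encoding and feature hashing<br />
## [Key Points]
Learn how to write count encoding, and see how count encoding combined with logistic regression affects the prediction results <br />
Observe how hash encoding, and count encoding + hash encoding, each combined with logistic regression, affect the prediction results <br />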
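
Before the full notebook, a minimal sketch of what count encoding does may help. The toy ticket codes below are hypothetical and are not taken from the Titanic files; the idea is only that each category is replaced by the number of rows that share it.

```python
import pandas as pd

# Toy data: hypothetical ticket codes, not from the Titanic dataset
toy = pd.DataFrame({'Ticket': ['A1', 'B2', 'A1', 'C3', 'A1', 'B2']})

# Count encoding: replace each category with the number of rows that share it
toy['Ticket_Count'] = toy['Ticket'].map(toy['Ticket'].value_counts())

# Equivalent form with groupby + transform, which keeps the original row order
toy['Ticket_Count_2'] = toy.groupby('Ticket')['Ticket'].transform('size')

print(toy)
```

Both forms avoid a separate merge step; the notebook below uses `groupby` plus `merge`, which yields the same column.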

In [1]:
import pandas as pd
import numpy as np
import copy, time
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

data_dir = './data/'
train_df = pd.read_csv(data_dir + 'titanic_train.csv')
test_df = pd.read_csv(data_dir + 'titanic_test.csv')

train_label = train_df.Survived
test_ids = test_df.PassengerId

train_df = train_df.drop(['PassengerId','Survived'], axis = 1)
test_df = test_df.drop(['PassengerId'], axis = 1)

train_num = len(train_df)
df = pd.concat([train_df,test_df])
df.head(5)

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
object_features = train_df.columns[train_df.dtypes == 'object']
print(f' {len(object_features)} Object Features : {object_features} ')
object_df = df[object_features]
object_df = object_df.fillna('None')
object_df.head(5)

 5 Object Features : Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object') 


Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S


In [3]:
object_df[object_features].apply(pd.Series.nunique)

Name        1307
Sex            2
Ticket       929
Cabin        187
Embarked       4
dtype: int64

In [4]:
#Label Encoding + LogisticRegression
df_temp = pd.DataFrame()
for col in object_df.columns:
    df_temp[col] = LabelEncoder().fit_transform(object_df[col])

train_x = df_temp[:train_num]
LR = LogisticRegression()
start = time.time()

print(f'shape : {train_x.shape} ')
print(f'score : {cross_val_score(LR,train_x,train_label,cv=5).mean()} ')
print(f'time : {time.time() - start} second ')

shape : (891, 5) 
score : 0.780004837244799 
time : 0.056607961654663086 second 


In [18]:
temp_df = copy.deepcopy(object_df)
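# Count how many passengers share each ticket (named aggregation; the dict-renaming form of agg was removed in pandas 1.0)
count_df = temp_df.groupby(['Ticket'])['Name'].agg(Ticket_Count='size').reset_index()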
temp_df = pd.merge(temp_df,count_df,on = ['Ticket'], how = 'left')
count_df.sort_values(by = ['Ticket_Count'], ascending = False).head(10)

Unnamed: 0,Ticket,Ticket_Count
778,CA. 2343,11
104,1601,8
775,CA 2144,8
335,3101295,7
454,347077,7
459,347082,7
847,S.O.C. 14879,7
824,PC 17608,7
123,19950,6
49,113781,6


In [19]:
temp_df.head(5)

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked,Ticket_Count
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S,1
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C,2
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S,1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S,2
4,"Allen, Mr. William Henry",male,373450,,S,1


In [20]:
#Counting encoding + LogisticRegression
for col in object_df.columns:
    temp_df[col] = LabelEncoder().fit_transform(object_df[col])
train_x = temp_df[:train_num]
LR = LogisticRegression()
start = time.time()

print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_label,cv = 5).mean()} ')
print(f' time : {time.time() - start} second' )

 shape : (891, 6) 
 score : 0.7811221556805532 
 time : 0.04552865028381348 second


In [23]:
#Hash encoding the Ticket feature + LogisticRegression
temp_df = pd.DataFrame()
for col in object_df.columns:
    temp_df[col] = LabelEncoder().fit_transform(object_df[col])

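# Note: 'Ticket' is already label-encoded to integers here, so hash(x) == x and each bucket is simply label % 10
temp_df['Ticket_Hash'] = temp_df['Ticket'].map(lambda x: hash(x) % 10)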
train_x = temp_df[:train_num]
LR = LogisticRegression()
start = time.time()

print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_label,cv = 5).mean()} ')
print(f' time : {time.time() - start} second' )

 shape : (891, 6) 
 score : 0.7721584999150645 
 time : 0.07723402976989746 second
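
The hash-encoding cell above calls Python's built-in `hash()` on the label-encoded `Ticket` column. To hash the raw ticket strings instead, note that `hash()` on `str` is salted per process by default, so the buckets would change between runs. The sketch below shows one possible alternative, assuming an MD5-based bucket function; the name `stable_hash_bucket` is hypothetical and is not part of the original notebook.

```python
import hashlib

def stable_hash_bucket(value: str, n_buckets: int = 10) -> int:
    """Map a string to one of n_buckets buckets, identically on every run."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

# Ticket strings as they appear in the first rows of the data
tickets = ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450']
buckets = [stable_hash_bucket(t) for t in tickets]
print(buckets)
```

In the notebook this could be applied as `object_df['Ticket'].map(stable_hash_bucket)`; whether it changes the score has not been checked here.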


In [25]:
#Hash encoding & Counting encoding the Ticket feature + LogisticRegression
count_df = object_df.groupby(['Ticket'])['Name'].agg(Ticket_Count='size').reset_index()
temp_df = copy.deepcopy(object_df)
temp_df = pd.merge(temp_df,count_df,on = ['Ticket'], how = 'left')

for col in object_df.columns:
    temp_df[col] = LabelEncoder().fit_transform(temp_df[col])

temp_df['Ticket_hash'] = temp_df['Ticket'].map(lambda x : hash(x) % 10)
train_x = temp_df[:train_num]
LR = LogisticRegression()
start = time.time()

print(temp_df.head(5))

print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_label,cv = 5).mean()} ')
print(f' time : {time.time() - start} second' )

   Name  Sex  Ticket  Cabin  Embarked  Ticket_Count  Ticket_hash
0   155    1     720    185         3             1            0
1   286    0     816    106         0             2            6
2   523    0     914    185         3             1            4
3   422    0      65     70         3             2            5
4    22    1     649    185         3             1            9
 shape : (891, 7) 
 score : 0.7743994138564367 
 time : 0.046534061431884766 second


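## Homework 1
Following the example, transform the Titanic cabin code ('Cabin') column with feature hashing / label encoding / target mean encoding, <br />
then predict the survival probability together with the other numeric columns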

In [35]:
#Hash Encoding + Label_Encoding + Mean Encoding + Counting Encoding on Cabin + LogisticRegression
temp_df = pd.concat([object_df[:train_num], train_label], axis = 1)
mean_count_df = temp_df.groupby(['Cabin'])['Survived'].agg(Cabin_Count='size', Cabin_Mean='mean').reset_index()
temp_df = pd.merge(temp_df,mean_count_df,on = ['Cabin'], how = 'left')

for col in object_df.columns:
    temp_df[col] = LabelEncoder().fit_transform(temp_df[col])
temp_df['Cabin_Hash'] = temp_df.Cabin.map(lambda x: hash(x)%5)
temp_df = temp_df.drop(['Survived'], axis = 1)

train_x = temp_df[:train_num]
LR = LogisticRegression()
start = time.time()

print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_label,cv = 5).mean()} ')
print(f' time : {time.time() - start} second' )
temp_df.head(10)

 shape : (891, 8) 
 score : 0.8283202241871461 
 time : 0.03581380844116211 second


Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked,Cabin_Count,Cabin_Mean,Cabin_Hash
0,108,1,523,146,3,687,0.299854,1
1,190,0,596,81,0,1,1.0,1
2,353,0,669,146,3,687,0.299854,1
3,272,0,49,55,3,2,0.5,0
4,15,1,472,146,3,687,0.299854,1
5,554,1,275,146,2,687,0.299854,1
6,515,1,85,129,3,1,0.0,4
7,624,1,395,146,3,687,0.299854,1
8,412,0,344,146,3,687,0.299854,1
9,576,0,132,146,0,687,0.299854,1


In [41]:
#Hash Encoding on Cabin + LogisticRegression

temp_df = pd.DataFrame()
for col in object_df.columns:
    temp_df[col] = LabelEncoder().fit_transform(object_df[col])
temp_df['Cabin_Hash'] = temp_df.Cabin.map(lambda x: hash(x)%10)
temp_df = temp_df.drop(['Cabin'], axis = 1)
train_x = temp_df[:train_num]
LR = LogisticRegression()
start = time.time()

print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_label,cv = 5).mean()} ')
print(f' time : {time.time() - start} second' )
temp_df.head(10)

 shape : (891, 5) 
 score : 0.7856039835632975 
 time : 0.02972698211669922 second


Unnamed: 0,Name,Sex,Ticket,Embarked,Cabin_Hash
0,155,1,720,3,5
1,286,0,816,0,6
2,523,0,914,3,5
3,422,0,65,3,0
4,22,1,649,3,5
5,818,1,373,2,5
6,767,1,109,3,3
7,914,1,541,3,5
8,605,0,477,3,5
9,847,0,174,0,5


In [38]:
#Label_Encoding  + LogisticRegression
temp_df = pd.DataFrame()
for col in object_df.columns:
    temp_df[col] = LabelEncoder().fit_transform(object_df[col])

train_x = temp_df[:train_num]
LR = LogisticRegression()
start = time.time()

print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_label,cv = 5).mean()} ')
print(f' time : {time.time() - start} second' )
temp_df.head(10)

 shape : (891, 5) 
 score : 0.780004837244799 
 time : 0.055966854095458984 second


Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,155,1,720,185,3
1,286,0,816,106,0
2,523,0,914,185,3
3,422,0,65,70,3
4,22,1,649,185,3
5,818,1,373,185,2
6,767,1,109,163,3
7,914,1,541,185,3
8,605,0,477,185,3
9,847,0,174,185,0


In [43]:
#Mean Encoding on Cabin + LogisticRegression
temp_df = pd.concat([object_df[:train_num], train_label], axis = 1)
mean_df = temp_df.groupby(['Cabin'])['Survived'].agg(Cabin_Mean='mean').reset_index()
temp_df = pd.merge(temp_df,mean_df,on = ['Cabin'], how = 'left')

for col in object_df.columns:
    temp_df[col] = LabelEncoder().fit_transform(temp_df[col])

temp_df = temp_df.drop(['Survived','Cabin'], axis = 1)

train_x = temp_df[:train_num]
LR = LogisticRegression()
start = time.time()

print(f' shape : {train_x.shape} ')
print(f' score : {cross_val_score(LR,train_x,train_label,cv = 5).mean()} ')
print(f' time : {time.time() - start} second' )
temp_df.head(10)

 shape : (891, 5) 
 score : 0.8305485839887907 
 time : 0.035204172134399414 second


Unnamed: 0,Name,Sex,Ticket,Embarked,Cabin_Mean
0,108,1,523,3,0.299854
1,190,0,596,0,1.0
2,353,0,669,3,0.299854
3,272,0,49,3,0.5
4,15,1,472,3,0.299854
5,554,1,275,2,0.299854
6,515,1,85,3,0.0
7,624,1,395,3,0.299854
8,412,0,344,3,0.299854
9,576,0,132,0,0.299854




## Homework 2
Continuing from the previous question, which of the three performs best?


## Answer to Homework 2
From the results, mean encoding achieves the best score (about 0.83, versus about 0.786 for hash encoding and about 0.780 for label encoding). Cabin has many unique values (187), so label encoding is clearly not a good fit: it imposes an arbitrary order on the Cabin values that has no meaningful relationship with the label.<br />
As for hash encoding, Python's built-in hash function may not hash the Cabin data well; in the cells above it is applied to the label-encoded integers, so each bucket is just the label modulo the bucket count. A hash function designed specifically for Cabin might give a better result.
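
One caveat when reading the mean-encoding score: the `Cabin_Mean` column above is computed from all training labels before cross-validation, so each row's own label contributes to its feature, which can inflate the score. A common remedy is out-of-fold mean encoding. The sketch below is a minimal illustration on hypothetical toy data; the function `oof_mean_encode` and its simple fold assignment are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def oof_mean_encode(cat: pd.Series, target: pd.Series, n_splits: int = 5) -> pd.Series:
    """Encode each row with its category's target mean, computed on the other folds only."""
    cat = cat.reset_index(drop=True)
    target = target.reset_index(drop=True)
    fold_id = np.arange(len(cat)) % n_splits  # simple deterministic fold assignment
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    for k in range(n_splits):
        held_out = fold_id == k
        # Category means from rows outside the held-out fold
        means = target[~held_out].groupby(cat[~held_out]).mean()
        # Categories unseen outside the fold fall back to the outside-fold overall mean
        fallback = target[~held_out].mean()
        encoded.loc[held_out] = cat[held_out].map(means).fillna(fallback)
    return encoded

# Toy data (hypothetical values)
cabin = pd.Series(['None', 'C85', 'None', 'C123', 'None', 'C123', 'None', 'None'])
survived = pd.Series([0, 1, 1, 1, 0, 0, 0, 1])
enc = oof_mean_encode(cabin, survived, n_splits=4)
print(enc)
```

In this toy run the single `C85` row, whose label is 1, receives the fallback value rather than 1.0, which is the leakage the plain mean encoding would introduce.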