# Competition overview / dataset and objective

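This competition is Spaceship Titanic on Kaggle. It is set in the year 2912: the Spaceship Titanic, carrying about 13,000 passengers, is taking them from the solar system to three different exoplanets. While passing Alpha Centauri, it collides with a spacetime anomaly hidden in a dust cloud, and almost half of the passengers are transported to another dimension.
The goal of the competition is to predict which passengers were transported by the anomaly.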

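# Why this competition

The dataset resembles the Titanic dataset used in the course, so it is a good place to practice the preprocessing methods taught in class. Its columns are more complex, though: a single column may hold several pieces of information, so some parts are challenging and call for independent study and experimentation.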

In [75]:
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset

In [76]:
df=pd.read_csv("spaceship_train.csv")
df.head()


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [77]:
df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

# HomePlanet

Inspect the distribution of HomePlanet.

In [78]:
df['HomePlanet'].value_counts()

HomePlanet
Earth     4602
Europa    2131
Mars      1759
Name: count, dtype: int64

Missing values in the HomePlanet column are all filled with the most frequent value, Earth.

In [79]:
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
df['HomePlanet'] = df['HomePlanet'].fillna(df['HomePlanet'].value_counts().idxmax())
df.isnull().sum()

PassengerId       0
HomePlanet        0
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

Result after filling: the Earth count rises from 4602 to 4803 (the 201 missing values).

In [80]:
df['HomePlanet'].value_counts()

HomePlanet
Earth     4803
Europa    2131
Mars      1759
Name: count, dtype: int64

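# Filling missing values in object-dtype columns

The HomePlanet fill above used the method taught in class: value_counts() counts the occurrences, and idxmax() picks the most frequent value.
The approach below first collects the object-dtype columns into a list, finds each column's mode with mode(), and fills the missing values in a loop.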
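
As a small illustration of the two approaches just described, the sketch below applies both to a toy Series. The values are made up and are not taken from the competition data; the variable names are only for illustration.

```python
import pandas as pd

# Toy column (made-up values) with two missing entries
s = pd.Series(["Earth", "Mars", None, "Earth", "Europa", None])

# Approach 1: count occurrences, then take the label with the highest count
most_common_a = s.value_counts().idxmax()

# Approach 2: mode() returns a Series of the most frequent value(s); take the first
most_common_b = s.mode()[0]

filled = s.fillna(most_common_b)
print(most_common_a, most_common_b, filled.isna().sum())
```

Both ignore missing values by default. They can disagree only when several values tie for most frequent, since mode() returns the tied values in sorted order.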

In [81]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8693 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [82]:
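# Note: .index of the boolean Series lists every object column, not only those
# with missing values; fillna() below is simply a no-op on the complete ones.
list_missing = list((df.select_dtypes(['object']).isna().sum() > 0).index)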
list_missing


['PassengerId',
 'HomePlanet',
 'CryoSleep',
 'Cabin',
 'Destination',
 'VIP',
 'Name']

In [83]:
for col in list_missing:
    df[col] = df[col].fillna(df[col].mode()[0])
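
The list used above contains every object column, including PassengerId and HomePlanet, which have no missing values. If the intent is to loop only over columns that actually contain NaN, one possible variant is to use the comparison as a boolean mask. This is a sketch on a made-up toy frame; `na_counts` and `cols_with_na` are hypothetical names.

```python
import pandas as pd

# Toy frame (made-up values); string columns are forced to object dtype
toy = pd.DataFrame({
    "PassengerId": pd.Series(["0001_01", "0002_01", "0003_01"], dtype=object),
    "HomePlanet": pd.Series(["Earth", None, "Earth"], dtype=object),
    "Age": [39.0, None, 58.0],
})

na_counts = toy.select_dtypes(["object"]).isna().sum()
# Index the counts with the boolean mask so only columns with NaN remain
cols_with_na = list(na_counts[na_counts > 0].index)

for col in cols_with_na:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(cols_with_na)
```

The numeric Age column is left alone here because select_dtypes(["object"]) excludes it.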

In [84]:
df.isnull().sum()

PassengerId       0
HomePlanet        0
CryoSleep         0
Cabin             0
Destination       0
Age             179
VIP               0
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name              0
Transported       0
dtype: int64

The Cabin column combines CabinDeck, CabinNo., and CabinSide in one string; split them out and store each in its own column.

In [85]:
df[["CabinDeck", "CabinNo.", "CabinSide"]] = df["Cabin"].str.split('/', expand = True)
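
To make the split concrete, here is a minimal sketch on three made-up cabin strings, one of them missing. It only illustrates how `str.split(..., expand=True)` behaves; in the notebook the missing cabins were already filled before this step.

```python
import pandas as pd

# Two cabin strings in deck/number/side form plus one missing value
cabins = pd.Series(["B/0/P", "F/1/S", None])

# expand=True returns one column per piece instead of a list per row
parts = cabins.str.split("/", expand=True)
parts.columns = ["CabinDeck", "CabinNo.", "CabinSide"]

print(parts)
```

A missing cabin yields missing values in all three new columns, which is why the order of filling and splitting matters.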

From PassengerId, keep only the first part, the group.

In [86]:
df[["Group", "NuminGroup"]] = df["PassengerId"].str.split('_', expand = True)

In [87]:
df.drop("NuminGroup",axis=1,inplace=True)

The Cabin and PassengerId columns have now been processed and can be dropped.

In [89]:
df.drop('Cabin',axis=1,inplace=True)
df.drop('PassengerId',axis=1,inplace=True)

Drop the Name column.

In [88]:
df.drop('Name',axis=1,inplace=True)

In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    8693 non-null   object 
 1   CryoSleep     8693 non-null   bool   
 2   Destination   8693 non-null   object 
 3   Age           8514 non-null   float64
 4   VIP           8693 non-null   bool   
 5   RoomService   8512 non-null   float64
 6   FoodCourt     8510 non-null   float64
 7   ShoppingMall  8485 non-null   float64
 8   Spa           8510 non-null   float64
 9   VRDeck        8505 non-null   float64
 10  Transported   8693 non-null   bool   
 11  CabinDeck     8693 non-null   object 
 12  CabinNo.      8693 non-null   object 
 13  CabinSide     8693 non-null   object 
 14  Group         8693 non-null   object 
dtypes: bool(3), float64(6), object(6)
memory usage: 840.6+ KB


In [91]:
df.isnull().sum()

HomePlanet        0
CryoSleep         0
Destination       0
Age             179
VIP               0
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Transported       0
CabinDeck         0
CabinNo.          0
CabinSide         0
Group             0
dtype: int64

### Convert object-dtype columns into numeric values the model can read

In [93]:
df['HomePlanet'].value_counts()

HomePlanet
Earth     4803
Europa    2131
Mars      1759
Name: count, dtype: int64

In [94]:
HomePlanet_mapping={"Earth":0,"Europa":1,"Mars":2}
df['HomePlanet'] = df['HomePlanet'].map(HomePlanet_mapping)
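
A dictionary passed to `map` silently turns any label that is not a key into NaN, so it can be worth checking the result. The sketch below shows such a check on made-up values. It also shows `pd.get_dummies` as one possible alternative that avoids implying an order between planets; that alternative is not used in this notebook.

```python
import pandas as pd

planets = pd.Series(["Earth", "Europa", "Mars", "Earth"])  # made-up sample
HomePlanet_mapping = {"Earth": 0, "Europa": 1, "Mars": 2}
encoded = planets.map(HomePlanet_mapping)

# Labels missing from the dict become NaN; fail early if that happens
unmapped = planets[encoded.isna()].unique()
assert len(unmapped) == 0, f"unmapped labels: {unmapped}"

# Alternative (assumption, not the notebook's method): one indicator column per label
one_hot = pd.get_dummies(planets, prefix="HomePlanet")
print(encoded.tolist(), list(one_hot.columns))
```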

In [95]:
df['Destination'].value_counts()

Destination
TRAPPIST-1e      6097
55 Cancri e      1800
PSO J318.5-22     796
Name: count, dtype: int64

In [96]:
Destination_mapping={"TRAPPIST-1e":0,"55 Cancri e":1,"PSO J318.5-22":2}
df['Destination'] = df['Destination'].map(Destination_mapping)

In [97]:
df['CabinDeck'].value_counts()

CabinDeck
F    2794
G    2758
E     876
B     779
C     747
D     478
A     256
T       5
Name: count, dtype: int64

In [98]:
CabinDeck_mapping={"A":0,"B":1,"C":2,"D":3,"E":4,"F":5,"G":6,"T":7}
df['CabinDeck'] = df['CabinDeck'].map(CabinDeck_mapping)

In [99]:
df['CabinNo.'].value_counts()

CabinNo.
734     208
82       28
86       22
19       22
56       21
       ... 
1644      1
1515      1
1639      1
1277      1
1894      1
Name: count, Length: 1817, dtype: int64

In [100]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['CabinNo.']=le.fit_transform(df['CabinNo.'])
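
One thing to keep in mind: after the split, CabinNo. holds digit strings, and LabelEncoder assigns codes by the sorted order of the labels, which for strings is lexicographic. The sketch below shows the effect on made-up labels and contrasts it with `pd.to_numeric`, which is only a possible alternative and not what this notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = ["2", "10", "734", "0"]  # made-up cabin numbers stored as strings

# String labels sort as "0" < "10" < "2" < "734", so "10" gets a smaller code than "2"
codes = LabelEncoder().fit_transform(labels)

# Converting to numbers keeps the natural numeric order instead
numeric = pd.to_numeric(pd.Series(labels))

print(codes.tolist(), numeric.tolist())
```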

In [101]:
df['CabinSide'].value_counts()

CabinSide
S    4487
P    4206
Name: count, dtype: int64

In [102]:
CabinSide_mapping={"S":0,"P":1}
df['CabinSide'] = df['CabinSide'].map(CabinSide_mapping)

In [103]:
df['Group'].value_counts()

Group
4498    8
8168    8
8728    8
8796    8
8956    8
       ..
3483    1
3480    1
3478    1
3473    1
4620    1
Name: count, Length: 6217, dtype: int64

In [104]:
df['Group']=le.fit_transform(df['Group'])

In [105]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    8693 non-null   int64  
 1   CryoSleep     8693 non-null   bool   
 2   Destination   8693 non-null   int64  
 3   Age           8514 non-null   float64
 4   VIP           8693 non-null   bool   
 5   RoomService   8512 non-null   float64
 6   FoodCourt     8510 non-null   float64
 7   ShoppingMall  8485 non-null   float64
 8   Spa           8510 non-null   float64
 9   VRDeck        8505 non-null   float64
 10  Transported   8693 non-null   bool   
 11  CabinDeck     8693 non-null   int64  
 12  CabinNo.      8693 non-null   int32  
 13  CabinSide     8693 non-null   int64  
 14  Group         8693 non-null   int32  
dtypes: bool(3), float64(6), int32(2), int64(4)
memory usage: 772.6 KB


In [106]:
df.isnull().sum()

HomePlanet        0
CryoSleep         0
Destination       0
Age             179
VIP               0
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Transported       0
CabinDeck         0
CabinNo.          0
CabinSide         0
Group             0
dtype: int64

# Filling missing values in numeric columns

Fill Age with the median.

In [107]:
# Assign the result back instead of using inplace=True on a column selection
df['Age'] = df['Age'].fillna(df['Age'].median())

In [108]:
df.isnull().sum()

HomePlanet        0
CryoSleep         0
Destination       0
Age               0
VIP               0
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Transported       0
CabinDeck         0
CabinNo.          0
CabinSide         0
Group             0
dtype: int64

Looking at the dataset, the five columns RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck are mostly 0, so their missing values are filled with 0.

In [110]:
# Fill the five spending columns with 0 in a single assignment
spend_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
df[spend_cols] = df[spend_cols].fillna(0)
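
The reasoning behind the zero fill can be checked by measuring what share of each spending column equals 0. The sketch below does this on a made-up toy frame, then applies the same median and zero fills; `zero_share` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, not the competition data
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 24.0, 58.0],
    "RoomService": [0.0, 109.0, np.nan, 0.0],
    "Spa": [np.nan, 549.0, 0.0, 0.0],
})
spend_cols = ["RoomService", "Spa"]

# Fraction of rows equal to 0 in each spending column (NaN counts as not zero)
zero_share = (toy[spend_cols] == 0).mean()

toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy[spend_cols] = toy[spend_cols].fillna(0)

print(zero_share)
print(toy)
```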

Confirm that all missing values have been filled.

In [111]:
df.isnull().sum()

HomePlanet      0
CryoSleep       0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
CabinDeck       0
CabinNo.        0
CabinSide       0
Group           0
dtype: int64

Inspect the correlations between columns. If two columns are highly correlated, feeding both into the model may affect the result.

In [112]:
# If two columns are highly correlated, training on both may affect the result
df.corr()


Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,CabinDeck,CabinNo.,CabinSide,Group
HomePlanet,1.0,0.083586,-0.155671,0.133577,0.123512,0.211751,0.071454,0.101383,0.055047,0.039824,0.115461,-0.404593,0.010726,-0.000636,-0.006093
CryoSleep,0.083586,1.0,0.118972,-0.071323,-0.078281,-0.244089,-0.205928,-0.207798,-0.198307,-0.192721,0.460132,0.020613,0.011922,-0.023858,-0.00692
Destination,-0.155671,0.118972,1.0,-0.014496,0.027678,-0.070954,0.026133,-0.036354,-0.000489,0.016114,0.067972,0.016025,0.002205,-0.003965,0.011048
Age,0.133577,-0.071323,-0.014496,1.0,0.091863,0.068629,0.12739,0.033148,0.120946,0.09959,-0.074233,-0.239202,-0.000948,-0.011621,-0.009122
VIP,0.123512,-0.078281,0.027678,0.091863,1.0,0.056566,0.125499,0.018412,0.060991,0.123061,-0.037261,-0.176063,0.008129,0.008798,0.013602
RoomService,0.211751,-0.244089,-0.070954,0.068629,0.056566,1.0,-0.015126,0.052337,0.009244,-0.018624,-0.241124,-0.021888,0.00708,0.006991,0.000366
FoodCourt,0.071454,-0.205928,0.026133,0.12739,0.125499,-0.015126,1.0,-0.013717,0.221468,0.224572,0.045583,-0.315318,0.006709,-0.019682,-0.009257
ShoppingMall,0.101383,-0.207798,-0.036354,0.033148,0.018412,0.052337,-0.013717,1.0,0.014542,-0.007849,0.009391,-0.033577,-0.003939,0.02094,0.017861
Spa,0.055047,-0.198307,-0.000489,0.120946,0.060991,0.009244,0.221468,0.014542,1.0,0.147658,-0.218545,-0.214971,0.021997,-0.0057,-0.005138
VRDeck,0.039824,-0.192721,0.016114,0.09959,0.123061,-0.018624,0.224572,-0.007849,0.147658,1.0,-0.204874,-0.251548,0.004951,0.009089,0.015977
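
Reading a full correlation matrix by eye gets tedious, so one option is to list only the pairs above a threshold. The sketch below does this on synthetic data. The 0.8 cutoff is an arbitrary assumption, and the column names `a`, `b`, `c` are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                     # independent noise
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Arbitrary threshold; NaN entries compare as False and are dropped
high_pairs = upper.stack().loc[lambda s: s > 0.8]

print(high_pairs)
```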


# Model training

Use LogisticRegression, the model covered in the course.

In [113]:
X=df.drop(['Transported'],axis=1)
y=df['Transported']

#split to training data & testing data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=67)

#using Logistic regression model
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression(max_iter=200)
lr.fit(X_train,y_train)
predictions=lr.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
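
The warning above suggests either raising max_iter or scaling the data. As a rough sketch of the scaling option, the example below puts a StandardScaler in front of LogisticRegression in a pipeline. It uses synthetic stand-in data rather than the notebook's `X_train`, so its score says nothing about this competition; all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, NOT the Spaceship Titanic features
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_demo[:, 0] *= 1000  # give one feature a much larger scale, like the spending columns

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=67)

# The scaler is fitted on the training split only, then applied before the classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_tr, y_tr)

n_iter = model[-1].n_iter_[0]  # iterations actually used by the solver
print(n_iter, model.score(X_te, y_te))
```

Whether scaling would change the scores reported below has not been tested here.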


# Model performance

In [115]:
from sklearn.metrics import confusion_matrix,accuracy_score,recall_score,precision_score
accuracy_score(y_test,predictions)

0.7887269938650306

In [116]:
recall_score(y_test,predictions)


0.8414634146341463

In [117]:
precision_score(y_test,predictions)

0.7629578438147893
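
The three scores above all derive from the confusion matrix, which is imported but not displayed. The sketch below shows the relationship on made-up labels; these are not the notebook's `y_test` and `predictions`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels: 4 actual positives, 6 actual negatives
y_true = [True, True, True, True, False, False, False, False, False, False]
y_pred = [True, True, True, False, True, True, False, False, False, False]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found
precision = tp / (tp + fp)                  # share of predicted positives that are correct

print(accuracy, recall, precision)
```

Read this way, a recall higher than the precision, as in the results above, means the model misses few transported passengers but raises more false alarms.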

Save the trained model to a file.

In [120]:
import joblib
filename = 'Spaceship-LR.pk1'
joblib.dump(lr, filename)

['Spaceship-LR.pk1']
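
A saved model can be read back with `joblib.load`. The sketch below shows a round trip on a small model trained on synthetic data and written to a temporary directory; the file name `demo-model.pkl` is hypothetical and is not the file saved above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model trained on synthetic data
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=200).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo-model.pkl")  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(same)
```

Any new data passed to a restored model needs the same preprocessing and column order as the training data.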