# Table of Contents
### 1. [Introduction](#1)
### 2. [Import Library](#2)
### 3. [Set Configuration](#3)
### 4. [Data Loading](#4)
### 5. [EDA](#5)
### 6. [Build Model](#6)
### 7. [Prediction](#7)
### 8. [Submit](#8)


References:<br>
[Feature Engineering, EDA and LightGBM](https://www.kaggle.com/code/taranmarley/feature-engineering-eda-and-lightgbm)<br>
[🚀Spaceship Titanic -📊EDA + 27 different models📈](https://www.kaggle.com/code/odins0n/spaceship-titanic-eda-27-different-models)

<a id="1"></a>
# Introduction

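#### From the competition [Description](https://www.kaggle.com/c/spaceship-titanic/overview/description)
> Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. A transmission has arrived from four light-years away, and things are not looking good.
<br><br>
The Spaceship Titanic is an interstellar passenger liner launched a month ago. With about 13,000 passengers on board, it set out on its maiden voyage, carrying emigrants from our solar system to three habitable exoplanets orbiting nearby stars.
<br><br>
On the way to its first destination, the torrid 55 Cancri E, the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met the same fate as its namesake from 1,000 years before. Although the ship stayed intact, almost half of the passengers were transported to an alternate dimension.
<br><br>
To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported to the alternate dimension by the anomaly, using records recovered from the spaceship's damaged computer system.
<br><br>
Help save them and change history!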

![](https://storage.googleapis.com/kaggle-media/competitions/Spaceship%20Titanic/joel-filipe-QwoNAhbmLLo-unsplash.jpg)

<a id="2"></a>
# Import Library


In [None]:
!pip install pycaret
!pip uninstall scikit-learn -y
!pip install scikit-learn==0.23.2

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from pycaret.classification import *
import sklearn.preprocessing as preprocessing

<a id="3"></a>
# Set Configuration

In [None]:
train_path = "../input/spaceship-titanic/train.csv"
test_path = "../input/spaceship-titanic/test.csv"

LGBM_PATH = "./model/lgbm"

SEED = 42

use_features = [
    "PassengerId_A", "PassengerId_B", "HomePlanet", "CryoSleep",
    "Cabin_A", "Cabin_B", "Cabin_C", "Destination", "Age", "VIP",
    "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck",
    "FamilyName", "Transported",
]

# Test features are the same columns without the target
use_test_features = [col for col in use_features if col != "Transported"]

<a id="4"></a>
# Data Loading

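### Column Descriptions
- `PassengerId` - A unique ID for each passenger. Each ID takes the form `gggg_pp`, where `gggg` is the group the passenger is travelling with and `pp` is their number within the group. People in a group are often family members, but not always.
- `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.
- `CryoSleep` - Whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- `Cabin` - The cabin number where the passenger is staying, in the form `deck/num/side`, where `side` is either `P` (Port, the left side) or `S` (Starboard, the right side).
- `Destination` - The passenger's destination.
- `Age` - The passenger's age.
- `VIP` - Whether the passenger paid for special VIP service during the voyage.
- `RoomService, FoodCourt, ShoppingMall, Spa, VRDeck` - The amount the passenger was billed at each of the Spaceship Titanic's amenities.
- `Name` - The passenger's full name.
- `Transported` - Whether the passenger was transported to another dimension. This is the target variable.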
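
The `PassengerId` and `Cabin` formats described above can be split into their parts with pandas string methods. The following is a minimal sketch on a made-up frame (the rows only mimic the documented formats and are not competition data); the column names `Group`, `GroupNo`, `Deck`, `CabinNum`, and `Side` are hypothetical and chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows that only mimic the documented formats (not real competition data).
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", np.nan, "F/12/S"],
})

# gggg_pp -> group id and number within the group.
toy[["Group", "GroupNo"]] = toy["PassengerId"].str.split("_", expand=True)
toy["Group"] = toy["Group"].astype("int64")

# deck/num/side -> three columns; a missing Cabin stays missing in all three.
toy[["Deck", "CabinNum", "Side"]] = toy["Cabin"].str.split("/", expand=True)

print(toy)
```

Keeping a missing `Cabin` as missing, rather than filling in a default, leaves the imputation decision to a later step.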

In [None]:
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

In [None]:
print(f"train shape:\t{train_df.shape}")
print(f"test shape:\t{test_df.shape}")

In [None]:
train_df.head()

#### Missing values in the train dataset

In [None]:
train_df.isna().sum()

#### Missing values in the test dataset

In [None]:
test_df.isna().sum()
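
The two missing-value counts above are easier to compare side by side, with percentages. The sketch below does this with a hypothetical helper `missing_summary`, demonstrated on made-up frames rather than the competition data.

```python
import numpy as np
import pandas as pd

def missing_summary(train, test):
    """Return per-column missing counts and percentages for both frames."""
    summary = pd.DataFrame({
        "train_missing": train.isna().sum(),
        "train_pct": (train.isna().mean() * 100).round(2),
        "test_missing": test.isna().sum(),
        "test_pct": (test.isna().mean() * 100).round(2),
    })
    # Columns that exist only in train (such as the target) show NaN for test.
    return summary.sort_values("train_missing", ascending=False)

# Made-up frames for illustration only.
toy_train = pd.DataFrame({
    "Age": [20.0, np.nan, 31.0, np.nan],
    "Spa": [0.0, 5.0, np.nan, 1.0],
    "Transported": [True, False, True, False],
})
toy_test = pd.DataFrame({"Age": [np.nan, 40.0], "Spa": [3.0, 2.0]})

summary = missing_summary(toy_train, toy_test)
print(summary)
```

With the notebook's frames, the call would be `missing_summary(train_df, test_df)`.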

<a id="5"></a>
# EDA

In [None]:
num_columns = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
cate_columns = ["HomePlanet", "Destination", "Cabin_A", "Cabin_B", "Cabin_C", "FamilyName", "PassengerId_B"]


# Numeric columns: fill missing values with 0

train_df["Age"] = train_df["Age"].fillna(0)
train_df["RoomService"] = train_df["RoomService"].fillna(0)
train_df["FoodCourt"] = train_df["FoodCourt"].fillna(0)
train_df["ShoppingMall"] = train_df["ShoppingMall"].fillna(0)
train_df["Spa"] = train_df["Spa"].fillna(0)
train_df["VRDeck"] = train_df["VRDeck"].fillna(0)

# Boolean columns (note: astype("bool") turns missing values into True)
train_df["CryoSleep"] = train_df["CryoSleep"].astype("bool")
train_df["VIP"] = train_df["VIP"].astype("bool")

# Categorical columns: fill missing values with defaults and split Cabin into deck/num/side
train_df["HomePlanet"] = train_df.HomePlanet.fillna("Earth")
train_df["Destination"] = train_df.Destination.fillna("TRAPPIST-1e")
train_df["Cabin_A"] = train_df["Cabin"].apply(lambda x: x.split("/")[0] if type(x) is not float else "F")
train_df["Cabin_B"] = train_df["Cabin"].apply(lambda x: x.split("/")[1] if type(x) is not float else "82")
train_df["Cabin_C"] = train_df["Cabin"].apply(lambda x: x.split("/")[2] if type(x) is not float else "S")

train_df["HomePlanet"] = train_df.HomePlanet.astype("category")
train_df["Destination"] = train_df.Destination.astype("category")
train_df["Cabin_A"] = train_df.Cabin_A.astype("category")
train_df["Cabin_B"] = train_df.Cabin_B.astype("category")
train_df["Cabin_C"] = train_df.Cabin_C.astype("category")

# Name: split into first name and family name ("XXX" when the name is missing)
train_df["FirstName"] = train_df.Name.apply(lambda x: x.split(" ")[0] if type(x) is not float else "XXX")
train_df["FamilyName"] = train_df.Name.apply(lambda x: x.split(" ")[1] if type(x) is not float else "XXX")

train_df["FirstName"] = train_df.FirstName.astype("str")
train_df["FamilyName"] = train_df.FamilyName.astype("category")

# PassengerId: split into group id and number within the group
train_df["PassengerId_A"] = train_df["PassengerId"].apply(lambda x: x.split("_")[0])
train_df["PassengerId_B"] = train_df["PassengerId"].apply(lambda x: x.split("_")[1])

train_df["PassengerId_A"] = train_df.PassengerId_A.astype("int64")
train_df["PassengerId_B"] = train_df.PassengerId_B.astype("category")

In [None]:
# Numeric columns: fill missing values with 0

test_df["Age"] = test_df["Age"].fillna(0)
test_df["RoomService"] = test_df["RoomService"].fillna(0)
test_df["FoodCourt"] = test_df["FoodCourt"].fillna(0)
test_df["ShoppingMall"] = test_df["ShoppingMall"].fillna(0)
test_df["Spa"] = test_df["Spa"].fillna(0)
test_df["VRDeck"] = test_df["VRDeck"].fillna(0)

# Log transform of the spending columns (the train set is transformed in a later cell)
test_df["RoomService"] = test_df["RoomService"].apply(lambda x: np.log10(x+1))
test_df["FoodCourt"] = test_df["FoodCourt"].apply(lambda x: np.log10(x+1))
test_df["ShoppingMall"] = test_df["ShoppingMall"].apply(lambda x: np.log10(x+1))
test_df["Spa"] = test_df["Spa"].apply(lambda x: np.log10(x+1))
test_df["VRDeck"] = test_df["VRDeck"].apply(lambda x: np.log10(x+1))

# Boolean columns (note: astype("bool") turns missing values into True)
test_df["CryoSleep"] = test_df["CryoSleep"].astype("bool")
test_df["VIP"] = test_df["VIP"].astype("bool")

# Categorical columns: fill missing values with defaults and split Cabin into deck/num/side
test_df["HomePlanet"] = test_df.HomePlanet.fillna("Earth")
test_df["Destination"] = test_df.Destination.fillna("TRAPPIST-1e")
test_df["Cabin_A"] = test_df["Cabin"].apply(lambda x: x.split("/")[0] if type(x) is not float else "F")
test_df["Cabin_B"] = test_df["Cabin"].apply(lambda x: x.split("/")[1] if type(x) is not float else "82")
test_df["Cabin_C"] = test_df["Cabin"].apply(lambda x: x.split("/")[2] if type(x) is not float else "S")

test_df["HomePlanet"] = test_df.HomePlanet.astype("category")
test_df["Destination"] = test_df.Destination.astype("category")
test_df["Cabin_A"] = test_df.Cabin_A.astype("category")
test_df["Cabin_B"] = test_df.Cabin_B.astype("category")
test_df["Cabin_C"] = test_df.Cabin_C.astype("category")

# Name: split into first name and family name ("XXX" when the name is missing)
test_df["FirstName"] = test_df.Name.apply(lambda x: x.split(" ")[0] if type(x) is not float else "XXX")
test_df["FamilyName"] = test_df.Name.apply(lambda x: x.split(" ")[1] if type(x) is not float else "XXX")

test_df["FirstName"] = test_df.FirstName.astype("str")
test_df["FamilyName"] = test_df.FamilyName.astype("category")

# PassengerId: split into group id and number within the group
test_df["PassengerId_A"] = test_df["PassengerId"].apply(lambda x: x.split("_")[0])
test_df["PassengerId_B"] = test_df["PassengerId"].apply(lambda x: x.split("_")[1])

test_df["PassengerId_A"] = test_df.PassengerId_A.astype("int64")
test_df["PassengerId_B"] = test_df.PassengerId_B.astype("category")
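
The train and test cells above repeat the same steps, so one way to keep them in sync is a single function applied to both frames. The sketch below is an illustration, not the notebook's actual code: the names `preprocess` and `SPEND_COLUMNS` are hypothetical, the defaults (`0`, `"Earth"`, `"TRAPPIST-1e"`, `F/82/S`, `"XXX"`) are copied from the cells above, and the log transform and dtype casts are left out for brevity.

```python
import numpy as np
import pandas as pd

SPEND_COLUMNS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

def preprocess(df):
    """Apply the imputation and splitting steps to a copy of df."""
    out = df.copy()

    # Numeric columns: missing values become 0.
    for col in ["Age"] + SPEND_COLUMNS:
        out[col] = out[col].fillna(0)

    # Categorical columns: same default values as the cells above.
    out["HomePlanet"] = out["HomePlanet"].fillna("Earth")
    out["Destination"] = out["Destination"].fillna("TRAPPIST-1e")

    # Cabin -> deck / num / side, using F/82/S when the cabin is missing.
    cabin = out["Cabin"].fillna("F/82/S").str.split("/", expand=True)
    out["Cabin_A"], out["Cabin_B"], out["Cabin_C"] = cabin[0], cabin[1], cabin[2]

    # Name -> first / family name, "XXX" when the name is missing.
    name = out["Name"].fillna("XXX XXX").str.split(" ", n=1, expand=True)
    out["FirstName"], out["FamilyName"] = name[0], name[1]

    # PassengerId -> group id and number within the group.
    pid = out["PassengerId"].str.split("_", expand=True)
    out["PassengerId_A"] = pid[0].astype("int64")
    out["PassengerId_B"] = pid[1]
    return out

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01"],
    "HomePlanet": ["Europa", np.nan],
    "Cabin": ["B/0/P", np.nan],
    "Destination": [np.nan, "55 Cancri e"],
    "Age": [39.0, np.nan],
    "RoomService": [0.0, np.nan],
    "FoodCourt": [12.0, 0.0],
    "ShoppingMall": [np.nan, 3.0],
    "Spa": [0.0, 0.0],
    "VRDeck": [7.0, np.nan],
    "Name": ["Ada Lovelace", np.nan],
})
clean = preprocess(toy)
print(clean[["Cabin_A", "Cabin_B", "Cabin_C", "FamilyName", "PassengerId_A"]])
```

Working on a copy leaves the input frame untouched, so the same function can be called as `preprocess(train_df)` and `preprocess(test_df)` without one call affecting the other.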

In [None]:
train_df.describe()

In [None]:
# Log transform of the spending columns
train_df["RoomService"] = train_df["RoomService"].apply(lambda x: np.log10(x+1))
train_df["FoodCourt"] = train_df["FoodCourt"].apply(lambda x: np.log10(x+1))
train_df["ShoppingMall"] = train_df["ShoppingMall"].apply(lambda x: np.log10(x+1))
train_df["Spa"] = train_df["Spa"].apply(lambda x: np.log10(x+1))
train_df["VRDeck"] = train_df["VRDeck"].apply(lambda x: np.log10(x+1))
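
The `log10(x + 1)` transform compresses the long right tail of the spending columns while still mapping a spend of 0 to 0. The sketch below, on made-up amounts, shows a vectorized equivalent of the `apply` calls, the inverse transform, and how the result relates to `np.log1p`.

```python
import numpy as np
import pandas as pd

# Made-up spending amounts with a heavy right tail (not competition data).
spend = pd.Series([0.0, 9.0, 99.0, 9999.0])

# Vectorized form of apply(lambda x: np.log10(x + 1)).
log_spend = np.log10(spend + 1)
print(log_spend.tolist())

# The transform is invertible, so the original amounts can be recovered.
restored = 10 ** log_spend - 1

# np.log1p uses the natural log, so it differs from log10(x + 1)
# only by the constant factor ln(10) (the zero entry is skipped to avoid 0/0).
ratio = np.log1p(spend.iloc[1:]) / log_spend.iloc[1:]
```

Since the two logarithms differ only by a constant factor, the choice between them should not matter for tree-based models such as LightGBM, which depend only on the ordering of feature values.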

In [None]:
sns.pairplot(train_df, hue="Transported", vars=num_columns)

In [None]:
num_corr = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "Transported"]

corr = train_df[num_corr].corr()
sns.heatmap(corr, square=True, annot=True)

In [None]:
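def category_plot(df, columns):
    """Draw one count plot per column, split by the Transported target."""
    # squeeze=False keeps `ax` two-dimensional even when only one column is given
    fig, ax = plt.subplots(len(columns), 1, figsize=(10, 5 * len(columns)), squeeze=False)
    for i, column in enumerate(columns):
        sns.countplot(x=column, hue="Transported", data=df, ax=ax[i, 0])

# Use a separate name so the `cate_columns` list defined earlier is not overwritten
plot_columns = ["HomePlanet", "Destination", "Cabin_A", "Cabin_C", "VIP", "CryoSleep"]
category_plot(train_df, plot_columns)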
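
Count plots show absolute counts, which can hide differences in rate when categories differ in size. As a numeric complement, the share of transported passengers per category can be computed with a `groupby`. The helper name `transport_rate` and the toy rows below are illustrative assumptions.

```python
import pandas as pd

def transport_rate(df, column, target="Transported"):
    """Return the row count and the share of transported passengers per category."""
    grouped = df.groupby(column)[target].agg(["count", "mean"])
    grouped = grouped.rename(columns={"count": "n", "mean": "rate"})
    return grouped.sort_values("rate", ascending=False)

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "HomePlanet": ["Earth"] * 5 + ["Europa"] * 3,
    "Transported": [True, False, False, False, False, True, True, False],
})

rates = transport_rate(toy, "HomePlanet")
print(rates)
```

With the notebook's frame, the call would be for example `transport_rate(train_df, "CryoSleep")`.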

<a id="6"></a>
# Build Model

In [None]:
setup(data=train_df[use_features], target="Transported", silent=True, normalize=True,
      session_id=SEED, categorical_features=cate_columns, verbose=0)

In [None]:
lightgbm = create_model("lightgbm")

In [None]:
ensemble = ensemble_model(lightgbm, n_estimators=2)
ensemble = finalize_model(ensemble)

In [None]:
plot_model(ensemble, "confusion_matrix")
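
A confusion matrix tabulates actual labels against predicted labels. The sketch below builds the same kind of table by hand with `pd.crosstab`, using made-up labels rather than model output.

```python
import pandas as pd

# Made-up labels for illustration only (True = transported).
actual = pd.Series([True, True, False, False, True, False], name="actual")
predicted = pd.Series([True, False, False, True, True, False], name="predicted")

# Rows are actual labels, columns are predicted labels.
matrix = pd.crosstab(actual, predicted)
print(matrix)

# The four cells, counted directly with boolean masks.
tp = int((actual & predicted).sum())    # transported, predicted transported
fn = int((actual & ~predicted).sum())   # transported, predicted not transported
fp = int((~actual & predicted).sum())   # not transported, predicted transported
tn = int((~actual & ~predicted).sum())  # not transported, predicted not transported

accuracy = (actual == predicted).mean()
print(f"accuracy: {accuracy:.3f}")
```

Note that `finalize_model` refits the model on the entire dataset, including the hold-out set, so a confusion matrix drawn from a finalized model is likely to look more favorable than performance on unseen data.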

<a id="7"></a>
# Prediction

In [None]:
# Predict with the base LightGBM model (the finalized `ensemble` above is not used here)
predictions = predict_model(lightgbm, data=test_df[use_test_features])

In [None]:
predictions.head(5)

<a id="8"></a>
# Submit

In [None]:
submission = pd.read_csv("../input/spaceship-titanic/sample_submission.csv")
# predict_model stores the predicted class in the "Label" column (PyCaret 2.x naming)
submission["Transported"] = predictions["Label"]
submission.to_csv("submission.csv", index=False)
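
Before uploading, it can help to check the submission against the sample file's layout. The sketch below uses a hypothetical helper `check_submission` on made-up frames. Requiring a boolean dtype is a deliberately strict assumption, since the CSV format itself only needs `True`/`False` text.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype

def check_submission(submission, sample):
    """Raise ValueError if `submission` does not match the layout of `sample`."""
    if list(submission.columns) != list(sample.columns):
        raise ValueError("columns differ from the sample submission")
    if not submission["PassengerId"].equals(sample["PassengerId"]):
        raise ValueError("PassengerId values or order differ from the sample submission")
    if not is_bool_dtype(submission["Transported"]):
        raise ValueError("Transported should hold boolean True/False values")

# Made-up frames for illustration only.
sample = pd.DataFrame({"PassengerId": ["0001_01", "0002_01"], "Transported": [False, False]})
good = sample.assign(Transported=[True, False])
bad = sample.assign(Transported=["True", "False"])  # strings, not booleans

check_submission(good, sample)  # passes silently
try:
    check_submission(bad, sample)
except ValueError as err:
    print(f"rejected: {err}")
```

If the predicted labels turn out to be strings, they would need to be converted to booleans before this check passes.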