# 1. Introduction
The goal of this competition is to predict whether each passenger was transported.

The data contains a number of features.
Each feature is described below:
- `PassengerId` - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.
- `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- `Cabin` - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- `Destination` - The planet the passenger will be debarking to.
- `Age` - The age of the passenger.
- `VIP` - Whether the passenger has paid for special VIP service during the voyage.
- `RoomService, FoodCourt, ShoppingMall, Spa, VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- `Name` - The first and last names of the passenger.
- `Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

We will first explore these features to decide how the data should be refined before modeling.

(This notebook draws on the notebooks below.)

* [Spaceship Titanic EDA & classification](https://www.kaggle.com/code/marianobasili/spaceship-titanic-eda-classification)
* [🚀Spaceship Titanic -📊EDA + 27 different models📈](https://www.kaggle.com/code/odins0n/spaceship-titanic-eda-27-different-models)


# 2. EDA
## 2-1. Importing data and preparation

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns 
import plotly.express as px
import missingno as msno
import matplotlib.ticker as mtick
import time
import re


pd.set_option('float_format', '{:f}'.format)

import warnings
warnings.filterwarnings(action='ignore') 

%config Completer.use_jedi = False

In [None]:
org_train = pd.read_csv('../input/spaceship-titanic/train.csv')
org_test = pd.read_csv('../input/spaceship-titanic/test.csv')
org_sample_sub = pd.read_csv('../input/spaceship-titanic/sample_submission.csv')

## 2-2. Overview of features and data

In [None]:
train = org_train.copy()
test = org_test.copy()
sample_sub = org_sample_sub

train.info()
test.info()

In [None]:
train

**The overview of features in data**\
Two of the features pack several pieces of information into one string, so we split them before building the model.

1. **PassengerId**\
PassengerId has the form `gggg_pp`.
The `pp` part identifies the individual passenger, while `gggg` identifies the group. Whether a passenger is transported may depend on the group, so we separate the group information.

2. **Cabin**\
Cabin combines `deck/num/side` into a single string.
If left combined, two passengers on the same deck but with a different num or side would be treated as having entirely unrelated cabins, so we separate the value into three features.
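As a small illustration of the two splits described above, here is a sketch on a toy frame. The rows are made up; only the column names and the string formats follow the data description. It also shows that a missing `Cabin` stays missing in all three derived columns.

```python
import pandas as pd

# Toy rows (made-up values) in the same format as the competition data.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", None, "F/12/S"],
})

# gggg_pp -> group number and position within the group.
toy[["Group", "Id"]] = toy["PassengerId"].str.split("_", expand=True)
toy["Group"] = pd.to_numeric(toy["Group"])

# deck/num/side -> three columns; a missing Cabin stays missing in all three.
toy[["Cabin_deck", "Cabin_num", "Cabin_side"]] = toy["Cabin"].str.split("/", expand=True)
toy["Cabin_num"] = pd.to_numeric(toy["Cabin_num"])

print(toy[["Group", "Cabin_deck", "Cabin_num", "Cabin_side"]])
```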


In [None]:
train[['Group','Id']] = train['PassengerId'].str.split('_', expand=True)
test[['Group','Id']] = test['PassengerId'].str.split('_', expand=True)

train['Group'] = pd.to_numeric(train['Group'])
test['Group'] = pd.to_numeric(test['Group'])

train.drop(['Id','PassengerId'], inplace=True, axis=1)
test.drop(['Id','PassengerId'], inplace=True, axis=1)

In [None]:
train[['Cabin_deck','Cabin_num','Cabin_side']] = train['Cabin'].str.split('/', expand=True)
test[['Cabin_deck','Cabin_num','Cabin_side']] = test['Cabin'].str.split('/', expand=True)

train['Cabin_num'] = pd.to_numeric(train['Cabin_num'])
test['Cabin_num'] = pd.to_numeric(test['Cabin_num'])

train.drop('Cabin', inplace=True, axis=1)
test.drop('Cabin', inplace=True, axis=1)

In [None]:
train.drop('Name', inplace=True, axis=1)
test.drop('Name', inplace=True, axis=1)

In [None]:
Target = 'Transported'
Features = [col for col in train.columns]
cat_feat = [col for col in train.columns if (train[col].dtypes == 'object') & ( col not in [Target])]
num_feat = [col for col in train.columns if (train[col].dtypes != 'object') & ( col not in [Target])]

In [None]:
print("Train set's dimension : ",train.shape)
print("Test set's dimension : ",test.shape)
print("Number of categorical features : " ,len(cat_feat))
print("Number of numerical features : " ,len(num_feat))

In [None]:
print("Number of missing values in train : ", train.isnull().sum().sum())
print("Number of missing values in test : ", test.isnull().sum().sum())

**The overview of train/test data**

* After the steps above, the train data has 15 columns and 8693 rows.
* The 15 columns are 6 categorical features, 8 numerical features, and 1 target.
* There are missing values in both the train and test data.

$\Rightarrow$ Missing values must be dealt with before data analysis.
We can either drop them or fill them in.
Later, we will fill them in by multiple imputation.

In [None]:
print("train data missing columns\n\n",train.isnull().sum())
print("\n\ntest data missing columns\n\n",test.isnull().sum())

In [None]:
# Approximate visualization of missing value distribution
msno.matrix(train)
plt.title('Missing Value Distribution in train set');

In [None]:
msno.matrix(test)
plt.title('Missing Value Distribution in test set');

In [None]:
fig,ax = plt.subplots(1,1, figsize=(12,8))
(train.isnull().mean()*100).plot(kind='bar', ax=ax, align='center', width=.4, color='violet')
(test.isnull().mean()*100).plot(kind='bar', ax=ax, align='edge',width=.4, color='dodgerblue')
plt.legend(labels=['Train Set','Test Set'])
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=80)
ax.set_ylabel('Missing Values (%)')
ax.set_title('Percentage of missing values in train and test set');

**The overview of missing value**\
The proportion of missing values per feature is similar in the train set and the test set.

Also, since the missing values fall in different rows for different features, removing every row with a missing value would cause a lot of data loss.\
$\rightarrow$ So we will replace missing values.
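To see why dropping rows can be costly even when each column has few gaps, here is a minimal sketch on a toy frame. The values are made up; only the column names match the data.

```python
import numpy as np
import pandas as pd

# Toy data (made-up): each column is missing in a different row.
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0, 16.0],
    "Spa": [0.0, 549.0, np.nan, 3329.0, 565.0],
    "VIP": [False, False, True, np.nan, False],
})

missing_per_column = toy.isnull().mean()          # share of missing cells per column
rows_with_any_missing = toy.isnull().any(axis=1)  # True where a row has at least one gap

print(missing_per_column)
print(f"Rows lost by dropna(): {rows_with_any_missing.mean():.0%}")
```

Each column is missing in only one of five rows, yet dropping incomplete rows would discard three of the five.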

We will replace missing values by *Multiple Imputation by Chained Equations* (MICE).\
MICE creates several imputed versions of the data set by modeling each feature with missing values as a function of the other features, fits the analysis model on each version (the `with` function in the R `mice` package), and pools the results (the `pool` function).
It often gives more plausible replacement values than a single constant such as the mean.

MICE : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
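The idea can be sketched with scikit-learn's `IterativeImputer`, used here as a stand-in for the `impyute` call in this notebook. This is an assumption made for illustration, not the method the notebook runs. The data is synthetic, and averaging the completed matrices is a simplification of proper pooling.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Toy numeric data (made-up): the third column depends on the first two.
X_full = rng.normal(size=(200, 3))
X_full[:, 2] = X_full[:, 0] + 0.5 * X_full[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 15% of the cells at random.
mask = rng.random(X_full.shape) < 0.15
X_missing = X_full.copy()
X_missing[mask] = np.nan

# Several imputations, each drawn from the posterior with a different seed ...
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed).fit_transform(X_missing)
    for seed in range(5)
]
# ... then averaged into one completed matrix.
X_imputed = np.mean(imputations, axis=0)

print("Remaining NaNs:", int(np.isnan(X_imputed).sum()))
```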

## 2-3. Overview of Target

We will examine the distribution of Transported.

If it is balanced, we will use KFold; otherwise we will use StratifiedKFold.

In [None]:
plt.title('Distribution of Transported')
sns.countplot(x= 'Transported',data=train);
plt.show()
print("Total number of Transported : ", len(train[train[Target]==True]))
print("Total number of Not transported : ", len(train[train[Target]==False]))

*The distribution is balanced, so we will use KFold.*
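For contrast, the sketch below shows what the choice would look like on an imbalanced target. The labels are made up (30% positive) and are not the competition data. `StratifiedKFold` keeps the class ratio the same in every fold, while `KFold` lets it drift.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy target (made-up): 30% positives, so the classes are imbalanced.
y = np.array([True] * 30 + [False] * 70)
X = np.zeros((len(y), 1))  # the features are irrelevant for the split itself

def positives_per_fold(splitter):
    """Count the positive labels that land in each validation fold."""
    return [int(y[valid_idx].sum()) for _, valid_idx in splitter.split(X, y)]

kfold_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per fold:          ", kfold_counts)
print("StratifiedKFold positives per fold:", strat_counts)
```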

## 2-4. Overview of the distribution of Categorical Features


In [None]:
cat_feat

In [None]:
plt.figure(figsize=(20,10))

plt.subplot(3,2,1)
plt.title('HomePlanet distribution based on Transported')
sns.countplot(x= 'HomePlanet', hue='Transported', data=train)

plt.subplot(3,2,2)
plt.title('CryoSleep distribution based on Transported')
sns.countplot(x= 'CryoSleep', hue='Transported', data=train)

plt.subplot(3,2,3)
plt.title('Destination distribution based on Transported')
sns.countplot(x= 'Destination', hue='Transported', data=train)

plt.subplot(3,2,4)
plt.title('VIP distribution based on Transported')
sns.countplot(x= 'VIP', hue='Transported', data=train)

plt.subplot(3,2,5)
plt.title('Cabin_deck distribution based on Transported')
sns.countplot(x= 'Cabin_deck', hue='Transported', data=train)

plt.subplot(3,2,6)
plt.title('Cabin_side distribution based on Transported')
sns.countplot(x= 'Cabin_side', hue='Transported', data=train)

plt.subplots_adjust(hspace=0.6)

plt.show()

**The overview of Categorical features**\
As you can see, the target is not evenly distributed within the categorical features.\
Passengers from Earth have a lower percentage of transfers, while passengers from Europa have a higher one.

Passengers in CryoSleep had a higher percentage of transfers than those who were not.

Other differences can be read from the remaining plots.
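The count plots can be complemented with the transported rate per category. The sketch below uses a toy frame with made-up rows; on the real data the same `groupby` could be applied to `train`.

```python
import pandas as pd

# Toy rows (made-up values) with the same column names as the train set.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", "Europa", "Europa", "Mars"],
    "Transported": [False, False, True, True, True, False],
})

# Mean of a boolean column = share of True values within each category.
rate = toy.groupby("HomePlanet")["Transported"].mean()
print(rate)
```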

## 2-5. Overview of the distribution of Numerical Features

In [None]:
num_feat

In [None]:
plt.figure(figsize=(20,15))

plt.subplot(4,2,1)
plt.title('Age distribution based on Transported')
sns.histplot(x="Age", data=train, kde=True, alpha = 0.1, color='#E6104C')
sns.histplot(x="Age", data=train, hue="Transported", multiple="stack")

plt.subplot(4,2,2)
plt.title('RoomService distribution based on Transported')
sns.histplot(x="RoomService", data=train, kde=True, alpha = 0.1, color='#E6104C')
sns.histplot(x="RoomService", data=train, hue="Transported", multiple="stack")

plt.subplot(4,2,3)
plt.title('FoodCourt distribution based on Transported')
sns.histplot(x="FoodCourt", data=train, kde=True, alpha = 0.1, color='#E6104C')
sns.histplot(x="FoodCourt", data=train, hue="Transported", multiple="stack")

plt.subplot(4,2,4)
plt.title('ShoppingMall distribution based on Transported')
sns.histplot(x="ShoppingMall", data=train, kde=True, alpha = 0.1, color='#E6104C')
sns.histplot(x="ShoppingMall", data=train, hue="Transported", multiple="stack")

plt.subplot(4,2,5)
plt.title('Spa distribution based on Transported')
sns.histplot(x="Spa", data=train, kde=True, alpha = 0.1, color='#E6104C')
sns.histplot(x="Spa", data=train, hue="Transported", multiple="stack")

plt.subplot(4,2,6)
plt.title('VRDeck distribution based on Transported')
sns.histplot(x="VRDeck", data=train, kde=True, alpha = 0.1, color='#E6104C')
sns.histplot(x="VRDeck", data=train, hue="Transported", multiple="stack")

plt.subplot(4,2,7)
plt.title('Group distribution based on Transported')
sns.histplot(x="Group", data=train, kde=True, alpha = 0.1, color='#E6104C')
sns.histplot(x="Group", data=train, hue="Transported", multiple="stack")

plt.subplot(4,2,8)
plt.title('Cabin_num distribution based on Transported')
sns.histplot(x="Cabin_num", data=train, kde=True, alpha = 0.1, color='#E6104C')
sns.histplot(x="Cabin_num", data=train, hue="Transported", multiple="stack")

plt.subplots_adjust(hspace=0.6)

plt.show()

**The overview of Numerical features**\
The red line is the kernel density estimate for the whole feature,
and the bars are the histogram split by Transported.

The Age and Cabin_num distributions appear skewed.

The Group distribution appears evenly spread.

The rest of the features are concentrated near zero.
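A concentration near zero can also be put into numbers. This sketch uses a made-up spending column and reports the share of zeros and the sample skewness; the name `RoomService` is reused only for readability.

```python
import pandas as pd

# Toy spending column (made-up values): mostly zeros with a few large bills.
spend = pd.Series([0, 0, 0, 0, 0, 0, 25, 40, 300, 5000], name="RoomService", dtype=float)

zero_share = (spend == 0).mean()  # fraction of passengers who spent nothing
skewness = spend.skew()           # > 0 means a long right tail

print(f"Share of zeros: {zero_share:.0%}")
print(f"Skewness: {skewness:.2f}")
```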

## 2-6. Handling missing values

Each numerical feature has a different distribution,\
and the categorical features are distributed differently depending on whether the passenger was transported.

For this reason, we do not fill the missing values with a constant such as 0, the mean, or the median; instead we use MICE, which takes the relationships between features into account.
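One point worth checking before one-hot encoding: by default `pd.get_dummies` does not carry a missing category through as NaN. It encodes it as a row of zeros, so an imputer run afterwards does not see those cells as missing. The toy column below (made-up values) shows the default and the `dummy_na=True` alternative.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) with one missing entry.
toy = pd.DataFrame({"HomePlanet": ["Earth", np.nan, "Mars"]})

default = pd.get_dummies(toy, dtype=float)                 # NaN row -> all zeros
flagged = pd.get_dummies(toy, dummy_na=True, dtype=float)  # NaN row -> its own indicator

print(default)
print(flagged)
```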


In [None]:
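# One-hot encode the categorical features; dtype=float keeps the matrix numeric for MICE.
# Note: get_dummies encodes a missing category as all zeros, so only the numerical
# features still contain NaN for MICE to fill.
train = pd.get_dummies(train, dtype=float)
test = pd.get_dummies(test, dtype=float)
Features = [col for col in train.columns if col not in ['PassengerId',Target]]
# Give test the same columns, in the same order, as train
test = test.reindex(columns=Features, fill_value=0)

# Replace missing values by Multiple Imputation
!pip install impyute
from impyute.imputation.cs import mice
train_imputed = mice(train[Features].values)  # start MICE fitting
test_imputed = mice(test[Features].values)

# Pass the original index so the imputed rows line up with train/test on assignment
train[Features] = pd.DataFrame(train_imputed, columns=Features, index=train.index)
test[Features] = pd.DataFrame(test_imputed, columns=Features, index=test.index)

train[Features].head()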

In [None]:
print("Number of missing values in train : ", train.isnull().sum().sum())
print("Number of missing values in test : ", test.isnull().sum().sum())

# 3. Modeling

Now that the data preprocessing is complete, let's make predictions with several models.
We will use five models:

1. Decision Tree
2. Random Forest
3. Extra Trees
4. CatBoost
5. LGBM

Finally, we will produce the final submission by hard voting over the five models.

## 3-1. Decision Tree

In [None]:
# 1. Decision Tree classifier (with KFold)
from sklearn import tree
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

DTC = tree.DecisionTreeClassifier(random_state=0)

DTC_pred_result = []
DTC_scores = []
DTC_feature_imp =[]

splitter = KFold(n_splits=10, shuffle=True, random_state=0)

print("Start Decision Tree Classify")
for fold, (train_index, valid_index) in enumerate(splitter.split(train[Features],train[Target])) :
    print(10*"=",f"Fold : {fold+1}",10*"=")
    start_time = time.time()
    
    X_train, X_valid = train.iloc[train_index][Features], train.iloc[valid_index][Features]
    y_train , y_valid = train[Target].iloc[train_index] , train[Target].iloc[valid_index]

    model=DTC
    model.fit(X_train, y_train)
    
    pred_valid_result = model.predict(X_valid)
    acc = accuracy_score(y_valid,pred_valid_result)
    DTC_scores.append(acc)
    run_time = time.time() - start_time
    
    print(f"Fold : {fold+1}, Accuracy : {acc:.3f}, Run Time : {run_time:.3f}s" )
    
    feature_imp = pd.DataFrame(index=Features,data=model.feature_importances_,columns=[f'{fold}_importance'])
    DTC_feature_imp.append(feature_imp)
    
    test_pred_result = model.predict(test[Features])
    DTC_pred_result.append(test_pred_result)
    
print(f"Decision Tree Classify Mean Accuracy : {(np.mean(DTC_scores)):.3f}")

### Feature Importance for Decision Tree (Top 10 Features)

In [None]:
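# Combine the per-fold importances and keep the 10 features with the highest mean importance
DTC_feature_imp_df = pd.concat(DTC_feature_imp, axis=1)
DTC_feature_imp_df = DTC_feature_imp_df.loc[DTC_feature_imp_df.mean(axis=1).nlargest(10).index]
DTC_feature_imp_df.sort_values('1_importance').plot(kind='barh', figsize=(15, 10), title='Feature Importance Across Folds')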
plt.show()

The mean accuracy of the Decision Tree is 0.751, and Cabin_num is the most important feature.
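The fold loop above has the same shape for every model in this section, so it could be factored into one function. The sketch below is a hypothetical helper (`cross_validate_model` is not part of the notebook) run on synthetic data. It fits a fresh `clone` of the estimator on each fold so that folds do not share fitted state.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cross_validate_model(estimator, X, y, n_splits=10, seed=0):
    """Hypothetical helper: fit a fresh copy of `estimator` on each fold and return fold accuracies."""
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in splitter.split(X):
        model = clone(estimator)  # unfitted copy, so folds do not share state
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[valid_idx], model.predict(X[valid_idx])))
    return scores

# Synthetic stand-in data, not the competition data.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
scores = cross_validate_model(DecisionTreeClassifier(random_state=0), X_toy, y_toy)
print(f"Mean accuracy over {len(scores)} folds: {np.mean(scores):.3f}")
```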

## 3-2. Random Forest

In [None]:
# 2. Random Forest classifier (with KFold)
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(random_state=0)

RFC_pred_result = []
RFC_scores = []
RFC_feature_imp =[]

splitter = KFold(n_splits=10, shuffle=True, random_state=0)

print("Start Random Forest Classify")
for fold, (train_index, valid_index) in enumerate(splitter.split(train[Features],train[Target])) :
    print(10*"=",f"Fold : {fold+1}",10*"=")
    start_time = time.time()
    
    X_train, X_valid = train.iloc[train_index][Features], train.iloc[valid_index][Features]
    y_train , y_valid = train[Target].iloc[train_index] , train[Target].iloc[valid_index]
    
    model=RFC
    model.fit(X_train, y_train)
    
    pred_valid_result = model.predict(X_valid)
    acc = accuracy_score(y_valid,pred_valid_result)
    RFC_scores.append(acc)
    run_time = time.time() - start_time
    
    print(f"Fold : {fold+1}, Accuracy : {acc:.3f}, Run Time : {run_time:.3f}s" )
    
    feature_imp = pd.DataFrame(index=Features,data=model.feature_importances_,columns=[f'{fold}_importance'])
    RFC_feature_imp.append(feature_imp)
    
    test_pred_result = model.predict(test[Features])
    RFC_pred_result.append(test_pred_result)
    
print(f"Random Forest Classify Mean Accuracy : {(np.mean(RFC_scores)):.3f}")

### Feature Importance for Random Forest (Top 10 Features)

In [None]:
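# Combine the per-fold importances and keep the 10 features with the highest mean importance
RFC_feature_imp_df = pd.concat(RFC_feature_imp, axis=1)
RFC_feature_imp_df = RFC_feature_imp_df.loc[RFC_feature_imp_df.mean(axis=1).nlargest(10).index]
RFC_feature_imp_df.sort_values('1_importance').plot(kind='barh', figsize=(15, 10), title='Feature Importance Across Folds')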
plt.show()

The mean accuracy of Random Forest is 0.8, and Cabin_num is the most important feature.

## 3-3. Extra Trees

In [None]:
# 3. Extra Trees classifier (with KFold)
from sklearn.ensemble import ExtraTreesClassifier

ETC = ExtraTreesClassifier(random_state=0)

ETC_pred_result = []
ETC_scores = []
ETC_feature_imp =[]

splitter = KFold(n_splits=10, shuffle=True, random_state=0)

print("Start Extra Tree Classify")
for fold, (train_index, valid_index) in enumerate(splitter.split(train[Features],train[Target])) :
    print(10*"=",f"Fold : {fold+1}",10*"=")
    start_time = time.time()
    
    X_train, X_valid = train.iloc[train_index][Features], train.iloc[valid_index][Features]
    y_train , y_valid = train[Target].iloc[train_index] , train[Target].iloc[valid_index]
    
    model=ETC
    model.fit(X_train, y_train)
    
    pred_valid_result = model.predict(X_valid)
    acc = accuracy_score(y_valid,pred_valid_result)
    ETC_scores.append(acc)
    run_time = time.time() - start_time
    
    print(f"Fold : {fold+1}, Accuracy : {acc:.3f}, Run Time : {run_time:.3f}s" )
    
    feature_imp = pd.DataFrame(index=Features,data=model.feature_importances_,columns=[f'{fold}_importance'])
    ETC_feature_imp.append(feature_imp)
    
    test_pred_result = model.predict(test[Features])
    ETC_pred_result.append(test_pred_result)
    
print(f"Extra Tree Classify Mean Accuracy : {(np.mean(ETC_scores)):.3f}")

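### Feature Importance for Extra Trees (Top 10 Features)

In [None]:
# Combine the per-fold importances and keep the 10 features with the highest mean importance
ETC_feature_imp_df = pd.concat(ETC_feature_imp, axis=1)
ETC_feature_imp_df = ETC_feature_imp_df.loc[ETC_feature_imp_df.mean(axis=1).nlargest(10).index]
ETC_feature_imp_df.sort_values('1_importance').plot(kind='barh', figsize=(15, 10), title='Feature Importance Across Folds')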
plt.show()

The mean accuracy of Extra Trees is 0.79, and Age is the most important feature.

## 3-4. CatBoost

In [None]:
# 4. CatBoost classifier (with KFold)
from catboost import CatBoostClassifier

CAT = CatBoostClassifier(silent=True,random_state=0)

CAT_pred_result = []
CAT_scores = []
CAT_feature_imp =[]

splitter = KFold(n_splits=10, shuffle=True, random_state=0)

print("Start CatBoost Classify")
for fold, (train_index, valid_index) in enumerate(splitter.split(train[Features],train[Target])) :
    print(10*"=",f"Fold : {fold+1}",10*"=")
    start_time = time.time()
    
    X_train, X_valid = train.iloc[train_index][Features], train.iloc[valid_index][Features]
    y_train , y_valid = train[Target].iloc[train_index] , train[Target].iloc[valid_index]
    
    model=CAT
    model.fit(X_train, y_train)
    
    pred_valid_result = model.predict(X_valid)
    pred_valid_result = [ele == "True" for ele in pred_valid_result]
    acc = accuracy_score(y_valid,pred_valid_result)
    CAT_scores.append(acc)
    run_time = time.time() - start_time
    
    print(f"Fold : {fold+1}, Accuracy : {acc:.3f}, Run Time : {run_time:.3f}s" )
    
    feature_imp = pd.DataFrame(index=Features,data=model.feature_importances_,columns=[f'{fold}_importance'])
    CAT_feature_imp.append(feature_imp)
    
    test_pred_result = model.predict(test[Features])
    CAT_pred_result.append(test_pred_result)
    
print(f"CATBoost Classify Mean Accuracy : {(np.mean(CAT_scores)):.3f}")

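### Feature Importance for CatBoost (Top 10 Features)

In [None]:
# Combine the per-fold importances and keep the 10 features with the highest mean importance
CAT_feature_imp_df = pd.concat(CAT_feature_imp, axis=1)
CAT_feature_imp_df = CAT_feature_imp_df.loc[CAT_feature_imp_df.mean(axis=1).nlargest(10).index]
CAT_feature_imp_df.sort_values('1_importance').plot(kind='barh', figsize=(15, 10), title='Feature Importance Across Folds')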
plt.show()

The mean accuracy of CatBoost is 0.813, and Spa is the most important feature.
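The CatBoost loop compares each prediction with the string `"True"`, which suggests the model returned string labels in the author's environment. If predictions may arrive as either strings or booleans (an assumption, since this can differ between versions), a small hypothetical helper such as `to_bool_labels` below would handle both cases.

```python
import numpy as np

def to_bool_labels(preds):
    """Hypothetical helper: map predictions given as bools or as 'True'/'False' strings to a bool array."""
    flat = np.asarray(preds).ravel()
    return np.array([str(p) == "True" for p in flat], dtype=bool)

print(to_bool_labels(["True", "False", "True"]))
print(to_bool_labels(np.array([True, False])))
```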

## 3-5. LGBM

In [None]:
# 5. LGBM classifier (with KFold)
from lightgbm import LGBMClassifier
LGB = LGBMClassifier(random_state=0)

LGB_pred_result = []
LGB_scores = []
LGB_feature_imp =[]

splitter = KFold(n_splits=10, shuffle=True, random_state=0)

print("Start LGBM Classify")
for fold, (train_index, valid_index) in enumerate(splitter.split(train[Features],train[Target])) :
    print(10*"=",f"Fold : {fold+1}",10*"=")
    start_time = time.time()
    
    X_train, X_valid = train.iloc[train_index][Features], train.iloc[valid_index][Features]
    y_train , y_valid = train[Target].iloc[train_index] , train[Target].iloc[valid_index]
    
    model=LGB
    model.fit(X_train, y_train)
    
    pred_valid_result = model.predict(X_valid)
    acc = accuracy_score(y_valid,pred_valid_result)
    LGB_scores.append(acc)
    run_time = time.time() - start_time
    
    print(f"Fold : {fold+1}, Accuracy : {acc:.3f}, Run Time : {run_time:.3f}s" )
    
    feature_imp = pd.DataFrame(index=Features,data=model.feature_importances_,columns=[f'{fold}_importance'])
    LGB_feature_imp.append(feature_imp)
    
    test_pred_result = model.predict(test[Features])
    LGB_pred_result.append(test_pred_result)
    
print(f"LGBM Classify Mean Accuracy : {(np.mean(LGB_scores)):.3f}")

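### Feature Importance for LGBM (Top 10 Features)

In [None]:
# Combine the per-fold importances and keep the 10 features with the highest mean importance
LGB_feature_imp_df = pd.concat(LGB_feature_imp, axis=1)
LGB_feature_imp_df = LGB_feature_imp_df.loc[LGB_feature_imp_df.mean(axis=1).nlargest(10).index]
LGB_feature_imp_df.sort_values('1_importance').plot(kind='barh', figsize=(15, 10), title='Feature Importance Across Folds')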
plt.show()

The mean accuracy of LGBM is 0.81, and Cabin_num is the most important feature.

# 4. Submission

We built five different models for predicting the result.\
We will combine them into an ensemble to produce the submission for the test data.

The ensemble uses hard voting.\
Hard voting takes the majority vote over the class labels predicted by the individual models.
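The voting rule itself is simple enough to write out by hand. The sketch below uses made-up predictions from five models for four passengers; it only illustrates the rule and is not the ensemble code used for the submission.

```python
import numpy as np

# Made-up predictions from five models (rows) for four passengers (columns).
preds = np.array([
    [True,  False, True,  False],
    [True,  False, False, False],
    [False, False, True,  True],
    [True,  True,  True,  False],
    [True,  False, False, False],
])

# With five voters and two classes there are no ties: a label wins with at least 3 votes.
votes_for_true = preds.sum(axis=0)
majority = votes_for_true > preds.shape[0] / 2

print(votes_for_true)  # [4 1 3 1]
print(majority)        # [ True False  True False]
```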

More information about ensemble learning : https://en.wikipedia.org/wiki/Ensemble_learning

In [None]:
#Ensemble(hard-voting)

from sklearn.ensemble import VotingClassifier
voting = VotingClassifier(estimators=[
         ('DecisionTree', DTC), ('RandomForest', RFC), ('ExtraTree', ETC), ('Catboost', CAT), ('LGBMboost', LGB)],
           voting='hard', n_jobs=5)
voting = voting.fit(train[Features],train[Target])

prediction = voting.predict(test[Features])

In [None]:
submission = pd.DataFrame({'PassengerId': org_test['PassengerId'], 'Transported': prediction})
submission.to_csv('./submission.csv',index=False)