# Introduction: Spaceship titanic

## info

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

![image.png](attachment:33d2e2f6-d79b-4762-ab22-10e475960c0f.png)

## Data

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly.

* **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
* **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
Destination - The planet the passenger will be debarking to.
* **Age** - The age of the passenger.
* **VIP** - Whether the passenger has paid for special VIP service during the voyage.
RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* **Name** - The first and last names of the passenger.
* **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

In [1]:
pip install kaggle




In [2]:
!kaggle competitions download -c spaceship-titanic

spaceship-titanic.zip: Skipping, found more recently modified local copy (use --force to force download)


In [3]:
import zipfile
import pandas as pd

with zipfile.ZipFile('spaceship-titanic.zip', 'r') as zip_ref:
    zip_ref.extractall('./dataset')

train = pd.read_csv('./dataset/train.csv')
test = pd.read_csv('./dataset/test.csv')

train.head(15)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
5,0005_01,Earth,False,F/0/P,PSO J318.5-22,44.0,False,0.0,483.0,0.0,291.0,0.0,Sandie Hinetthews,True
6,0006_01,Earth,False,F/2/S,TRAPPIST-1e,26.0,False,42.0,1539.0,3.0,0.0,0.0,Billex Jacostaffey,True
7,0006_02,Earth,True,G/0/S,TRAPPIST-1e,28.0,False,0.0,0.0,0.0,0.0,,Candra Jacostaffey,True
8,0007_01,Earth,False,F/3/S,TRAPPIST-1e,35.0,False,0.0,785.0,17.0,216.0,0.0,Andona Beston,True
9,0008_01,Europa,True,B/1/P,55 Cancri e,14.0,False,0.0,0.0,0.0,0.0,0.0,Erraiam Flatic,True


## imports

In [4]:
import sys
print(sys.executable)

C:\Users\Сергей\AppData\Local\Programs\Python\Python312\python.exe


In [5]:
!where pip

D:\Study\MachineLearning\Kaggle\SpaceshipTitanic\.venv\Scripts\pip.exe
C:\Users\‘ҐаЈҐ©\AppData\Local\Programs\Python\Python312\Scripts\pip.exe


In [6]:
!python -m site

sys.path = [
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312\\python312.zip',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312\\DLLs',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312\\Lib',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312',
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic\\.venv',
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic\\.venv\\Lib\\site-packages',
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic\\.venv\\Lib\\site-packages\\win32',
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic\\.venv\\Lib\\site-packages\\win32\\lib',
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic\\.venv\\Lib\\site-packages\\Pythonwin',
]
USER_BASE: 'C:\\Users\\Сергей\\AppData\\Roaming\\Python' (doesn't exist)
USER_SITE: 'C:\\Users\\Сергей\\AppData\\Roaming\\Python\\Python312\\site-packages' (doesn't exist)
EN

In [7]:
sys.path = [
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312\\python312.zip',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312\\DLLs',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312\\Lib',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312',
]

In [8]:
!python -m site

sys.path = [
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312\\python312.zip',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312\\DLLs',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312\\Lib',
    'C:\\Users\\Сергей\\AppData\\Local\\Programs\\Python\\Python312',
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic\\.venv',
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic\\.venv\\Lib\\site-packages',
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic\\.venv\\Lib\\site-packages\\win32',
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic\\.venv\\Lib\\site-packages\\win32\\lib',
    'D:\\Study\\MachineLearning\\Kaggle\\SpaceshipTitanic\\.venv\\Lib\\site-packages\\Pythonwin',
]
USER_BASE: 'C:\\Users\\Сергей\\AppData\\Roaming\\Python' (doesn't exist)
USER_SITE: 'C:\\Users\\Сергей\\AppData\\Roaming\\Python\\Python312\\site-packages' (doesn't exist)
EN

In [9]:
!pip install scikit-learn



In [11]:
pip list | findstr scikit-learn

scikit-learn              1.5.0
Note: you may need to restart the kernel to use updated packages.


In [13]:
!python -c "import sklearn; print(sklearn.__version__)"

1.5.2


In [14]:
import sklearn
import numpy as np
import os
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno
from prettytable import PrettyTable
%matplotlib inline
import seaborn as sns
sns.set(style='darkgrid', font_scale=1.4)
from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook
tqdm_notebook.get_lock().locks = []
# !pip install sweetviz
# import sweetviz as sv
import concurrent.futures
from copy import deepcopy       
from functools import partial
from itertools import combinations
import random
from random import randint, uniform
import gc
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler,PowerTransformer, FunctionTransformer
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from itertools import combinations
from sklearn.impute import SimpleImputer
import xgboost as xg
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import mean_squared_error,mean_squared_log_error, roc_auc_score, accuracy_score, f1_score, precision_recall_curve, log_loss
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from gap_statistic.optimalK import OptimalK
from scipy import stats
import statsmodels.api as sm
from scipy.stats import ttest_ind
from scipy.stats import boxcox
import math
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.base import BaseEstimator, TransformerMixin
import optuna
import xgboost as xgb
!pip install lightgbm --install-option=--gpu --install-option="--boost-root=C:/local/boost_1_69_0" --install-option="--boost-librarydir=C:/local/boost_1_69_0/lib64-msvc-14.1"
import lightgbm as lgb
from category_encoders import OneHotEncoder, OrdinalEncoder, CountEncoder, CatBoostEncoder
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier, GradientBoostingClassifier,ExtraTreesClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoost, CatBoostRegressor, CatBoostClassifier
from sklearn.svm import NuSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from catboost import Pool
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD

ModuleNotFoundError: No module named 'sklearn'

In [None]:
train['Transported'].value_counts()

Surprisingly the target is balanced so lets move to the EDA

## Feature engeniring

Take another look to the train dataset. we looking for some **feature engeniring**

In [None]:
train.info()

In [None]:
msno.matrix(train)
plt.show()

There is a lot of missing data here btw

In [None]:
train.head(15)

Since we read the data explanation we can clearly see that the most common thing we can do is to get the group Passanger stands in out of the PassengerId. Another common thing is to cut the ages into groups and check the class balance here. Next i would like to somehow get the sex of each passanger out of their names. 

## EDA (Exploratory Data Analysis)