# Spaceship Titanic Preprocessing

**Description:**

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!
___

# 1. Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme(style='darkgrid', font_scale=2)
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, RobustScaler, PowerTransformer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate, train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans, DBSCAN


In [2]:
COMPARING_MODELS = True
DEBUG = True
VISUALIZING = False

# 2. Loading Data

In [3]:
df_train = pd.read_csv('../data/train.csv')
df_test = pd.read_csv('../data/test.csv')

df_train.head() if VISUALIZING else None

**Columns Description**
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
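
The `gggg_pp` layout of `PassengerId` described above can be unpacked into a group label and a position within the group. The following is only a minimal sketch on made-up IDs, not a step this notebook performs; the column names `Group`, `NumInGroup`, and `GroupSize` are hypothetical.

```python
import pandas as pd

# Illustrative IDs in the gggg_pp format (not taken from the real dataset)
ids = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0002_02"]})

# Split on the underscore: the left part is the group, the right part the member number
parts = ids["PassengerId"].str.split("_", expand=True)
ids["Group"] = parts[0]
ids["NumInGroup"] = parts[1].astype(int)

# Number of passengers travelling in each group
ids["GroupSize"] = ids.groupby("Group")["PassengerId"].transform("count")

print(ids)
```

A group-size column like this is one way to capture whether a passenger travels alone, should that turn out to be useful.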

# 3. Exploring/Viewing Data

In [4]:
if VISUALIZING:
    r1,c1 = df_train.shape
    print('The training data has {} rows and {} columns'.format(r1,c1))
    r2,c2 = df_test.shape
    print('The test data has {} rows and {} columns'.format(r2,c2))

In [5]:
df_train.info() if VISUALIZING else None

In [6]:
df_train.describe() if VISUALIZING else None

In [7]:
df_test.describe() if VISUALIZING else None

## 3.B Inspecting Missing Values

In [8]:
if VISUALIZING:
    print('MISSING VALUES IN TRAINING DATASET:')
    print(df_train.isna().sum().nlargest(c1))
    print('')
    print('MISSING VALUES IN TEST DATASET:')
    print(df_test.isna().sum().nlargest(c2))

In [9]:
df_train.set_index('PassengerId',inplace=True)
df_test.set_index('PassengerId',inplace=True)

## 3.C Null Replacement 

In [10]:
df_train[['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']] = df_train[['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']].fillna(0)
df_test[['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']] = df_test[['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']].fillna(0)

df_train['Age'] =df_train['Age'].fillna(df_train['Age'].median())
df_test['Age'] =df_test['Age'].fillna(df_test['Age'].median())

df_train['VIP'] =df_train['VIP'].fillna(False)
df_test['VIP'] =df_test['VIP'].fillna(False)

df_train['HomePlanet'] =df_train['HomePlanet'].fillna('Mars')
df_test['HomePlanet'] =df_test['HomePlanet'].fillna('Mars')

df_train['Destination']=df_train['Destination'].fillna("PSO J318.5-22")
df_test['Destination']=df_test['Destination'].fillna("PSO J318.5-22")

df_train['CryoSleep'] =df_train['CryoSleep'].fillna(False)
df_test['CryoSleep'] =df_test['CryoSleep'].fillna(False)

df_train['Cabin'] =df_train['Cabin'].fillna('T/0/P')
df_test['Cabin'] =df_test['Cabin'].fillna('T/0/P')
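
The cell above fills gaps with fixed constants and with each frame's own median. One possible alternative, shown here only as a sketch on toy data, is to learn the fill values from the training frame and reuse them for the test frame, so both are imputed identically. The toy frames and the `fill_values` dictionary are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for df_train / df_test (values are made up)
train = pd.DataFrame({"Age": [20.0, None, 40.0],
                      "HomePlanet": ["Earth", "Earth", None]})
test = pd.DataFrame({"Age": [None, 70.0],
                     "HomePlanet": [None, "Mars"]})

# Learn fill values from the training data only
fill_values = {
    "Age": train["Age"].median(),                      # numeric column: median
    "HomePlanet": train["HomePlanet"].mode().iloc[0],  # categorical column: most frequent value
}

# Apply the same values to both frames
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(test_filled)
```

Whether mode-based fills beat the hand-picked constants used in this notebook would need to be checked against the validation accuracy.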

# 4. Exploration and Visualization 

In [11]:
plt.figure(figsize=(15,18)) if VISUALIZING else None
sns.heatmap(df_train.select_dtypes("number").corr(), annot=True) if VISUALIZING else None

In [12]:
plt.pie(df_train.Transported.value_counts(), shadow=True, explode=[.1,.1], autopct='%.1f%%') if VISUALIZING else None
plt.title('Transported ', size=18) if VISUALIZING else None
plt.legend(['False', 'True'], loc='best', fontsize=12) if VISUALIZING else None
plt.show() if VISUALIZING else None

In [13]:
sns.countplot(df_train, x="Transported") if VISUALIZING else None

In [14]:
sns.countplot(df_train, x="HomePlanet", hue="Transported") if VISUALIZING else None

In [15]:
sns.countplot(df_train, x="VIP", hue="Transported") if VISUALIZING else None

In [16]:
sns.countplot(df_train, x="CryoSleep", hue="Transported") if VISUALIZING else None

In [17]:
sns.countplot(df_train, x="Destination", hue="Transported") if VISUALIZING else None
plt.xticks(rotation=30) if VISUALIZING else None

In [18]:
sns.boxplot(df_train, y="Age", x="Transported") if VISUALIZING else None # Age will be divided into groups later

## 4.B Splitting Cabin Column

In [19]:
# Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
df_train[['Deck', 'Cabin_Num', 'Side']] = df_train.Cabin.str.split('/',expand=True)
df_test[['Deck', 'Cabin_Num', 'Side']] = df_test.Cabin.str.split('/',expand=True)

df_train['Cabin_Num'] = pd.to_numeric(df_train['Cabin_Num'], downcast='integer')
df_test['Cabin_Num'] = pd.to_numeric(df_test['Cabin_Num'], downcast='integer')

df_train['Side'] = df_train['Side'].map({'P':0,'S':1})
df_test['Side'] = df_test['Side'].map({'P':0,'S':1})

In [20]:
sns.countplot(df_train, y="Deck", hue="Transported", order=["A", "B", "C", "D", "E", "F", "T"]) if VISUALIZING else None

In [21]:
plt.figure(figsize=(10,5)) if VISUALIZING else None
sns.histplot(df_train, x='Cabin_Num', hue='Transported', bins=14, multiple="dodge", shrink=0.6) if VISUALIZING else None

In [22]:
sns.countplot(df_train, x="Side", hue="Transported") if VISUALIZING else None

In [23]:
sns.countplot(df_test, x="Side") if VISUALIZING else None

# 5. Feature Engineering

In [24]:
df_train['total_spent'] = df_train['RoomService'] + df_train['FoodCourt'] + df_train['ShoppingMall'] + df_train['Spa'] + df_train['VRDeck']
df_test['total_spent'] = df_test['RoomService'] + df_test['FoodCourt'] + df_test['ShoppingMall'] + df_test['Spa'] + df_test['VRDeck']

In [25]:
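# Bucket Age into decades: group i covers ages [10*i, 10*(i+1)).
# The loop only covers ages 0-59, so passengers aged 60+ keep the default
# value 0 and share a group with ages 0-9.
df_train['AgeGroup'] = 0
for i in range(6):
    df_train.loc[(df_train.Age >= 10*i) & (df_train.Age < 10*(i + 1)), 'AgeGroup'] = i
# Same for test data
df_test['AgeGroup'] = 0
for i in range(6):
    df_test.loc[(df_test.Age >= 10*i) & (df_test.Age < 10*(i + 1)), 'AgeGroup'] = i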
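
The decade bucketing can also be written with `pd.cut`. Because the loop above runs over `range(6)`, ages of 60 and over are left at the default group 0. The sketch below instead assumes they should get their own open-ended group, so it is a variation rather than an exact equivalent. The sample ages are made up.

```python
import numpy as np
import pandas as pd

# Made-up ages covering each boundary case
ages = pd.Series([0, 9, 10, 35, 59, 60, 79])

# Left-closed decade bins, with a final open-ended bin for ages 60 and above
bins = [0, 10, 20, 30, 40, 50, 60, np.inf]

# labels=False returns the integer bin index instead of an Interval label
age_group = pd.cut(ages, bins=bins, right=False, labels=False)

print(age_group.tolist())
```

Swapping this in would change the `AgeGroup` feature, so the accuracies reported later in the notebook would no longer apply as-is.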

In [26]:
sns.countplot(y=df_train['AgeGroup'], hue=df_train['Transported']) if VISUALIZING else None

## 5.B Encoding the Target

In [27]:
df_train['Transported'] = df_train['Transported'].replace({True:1,False:0})

# 6. Preprocessing

In [28]:
if DEBUG:
    df_train, df_test = train_test_split(df_train, test_size=0.25, random_state=42)
    y_test = df_test["Transported"]

X_test = df_test.drop(columns=["Transported"]) if DEBUG else df_test

X_train, y_train = (
    df_train.drop(columns=["Transported"]),
    df_train["Transported"],
)

## 6.A Encoding

In [29]:
numeric_feats = [
    "total_spent",
    "RoomService",
    "FoodCourt",
    "ShoppingMall",
    "Spa",
    "VRDeck",
]
log_numeric_feats = [
    "Cabin_Num",
]
ordinal_feats = [
    "AgeGroup",
]
categorical_feats = [
    "HomePlanet",
    "Destination",
    "Deck",
]
binary_feats = [
    "VIP",
    "CryoSleep",
    "Side",    
]
drop_feats = [
    "Cabin",
    "Name",
    "Age",
]

In [30]:
ct = make_column_transformer(
    (RobustScaler(), numeric_feats),
    (
        make_pipeline(
            PowerTransformer(),
            RobustScaler()
        ),
        log_numeric_feats
    ),
    (OneHotEncoder(), categorical_feats),
    (OrdinalEncoder(), binary_feats),
    (OrdinalEncoder(), ordinal_feats),
    ("drop", drop_feats),
)

In [31]:
ct if VISUALIZING else None

In [32]:
transformed = ct.fit_transform(X_train)

In [33]:
column_names = ct.get_feature_names_out()
clean_names = [name.split("__", 1)[-1] for name in column_names]

transformed = pd.DataFrame(transformed, columns=clean_names)

In [34]:
pd.DataFrame(transformed).head() if VISUALIZING else None

## 6.B Heatmap

In [35]:
plt.figure(figsize=(55,30)) if VISUALIZING else None
sns.heatmap(transformed.corr(), annot=True) if VISUALIZING else None

# 7. Models

In [36]:
results = {}

def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns the mean and standard deviation of cross-validation scores

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data
    **kwargs :
        extra keyword arguments passed on to cross_validate

    Returns
    ----------
        pandas Series of "mean (+/- std)" strings, one per cross_validate metric
    """

    scores = pd.DataFrame(cross_validate(model, X_train, y_train, **kwargs))

    mean_scores = scores.mean()
    std_scores = scores.std()

    # Pair means and stds by position; integer lookups such as mean_scores[i]
    # on a string-labelled Series are deprecated in recent pandas.
    out_col = [
        "%0.3f (+/- %0.3f)" % (mean, std)
        for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data=out_col, index=mean_scores.index)
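
The helper is defined but not called in the cells that follow, so here is a hedged usage sketch. The function is restated compactly so the block runs on its own, and the random toy data and the `DummyClassifier` baseline are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """Format each cross_validate metric as 'mean (+/- std)'."""
    scores = pd.DataFrame(cross_validate(model, X_train, y_train, **kwargs))
    mean_scores, std_scores = scores.mean(), scores.std()
    out_col = ["%0.3f (+/- %0.3f)" % (m, s) for m, s in zip(mean_scores, std_scores)]
    return pd.Series(data=out_col, index=mean_scores.index)

# Toy data: 40 rows, 3 random features, perfectly balanced binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = np.array([0, 1] * 20)

# Collect one formatted column per model in a dict, then show them side by side
results = {}
results["dummy"] = mean_std_cross_val_scores(
    DummyClassifier(strategy="most_frequent"), X_toy, y_toy,
    cv=5, return_train_score=True,
)
summary = pd.DataFrame(results)
print(summary)
```

Passing one of the notebook's own pipelines, such as `make_pipeline(ct, LogisticRegression())`, in place of the dummy model would add a comparable column to the table.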

## 7.A SVM

### Linear Kernel

In [37]:
grid_search = GridSearchCV(
    estimator=make_pipeline(ct, SVC(kernel='linear', max_iter=420)),
    param_grid={'svc__C': [0.01, 0.1, 1, 10, 100, 1000]},
    cv=2,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train) if COMPARING_MODELS else None
pd.DataFrame(list(grid_search.best_params_.items()), columns=['Hyperparameter', 'Best Value']) if COMPARING_MODELS else None

| | Hyperparameter | Best Value |
|---|---|---|
| 0 | `svc__C` | 10 |
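
The `svc__C` key in the parameter grid follows scikit-learn's `<step name>__<parameter>` convention, where `make_pipeline` derives each step name from the lowercased class name. A short sketch, using a standalone scaler instead of the notebook's `ct` so it runs on its own, lists the step names and the tunable SVC parameters.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# A stand-in pipeline; the notebook uses its column transformer as the first step
pipe = make_pipeline(RobustScaler(), SVC(kernel="linear"))

# Step names are generated automatically from the class names
step_names = list(pipe.named_steps)

# Every parameter reachable through GridSearchCV for the SVC step
tunable = sorted(p for p in pipe.get_params() if p.startswith("svc__"))

print(step_names)
print(tunable)
```

The same rule explains the longer `kneighborsclassifier__n_neighbors` and `logisticregression__C` keys used in the later grids.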


In [38]:
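# NOTE: SVC defaults to kernel='rbf', so as written this fits an RBF model with C=10
# rather than the linear-kernel model tuned above; pass kernel='linear' (and
# max_iter=420) to SVC to match the grid search.
svm_linear_model = make_pipeline(ct, SVC(C=10)).fit(X_train, y_train)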
svm_linear_model if VISUALIZING else None

In [39]:
y_pred = svm_linear_model.predict(X_test)

In [40]:
accuracy_score(y_test, y_pred) if DEBUG else None

0.781508739650414

In [41]:
sub = pd.DataFrame({'Transported':y_pred.astype(bool)}, index=df_test.index)
sub.head() if VISUALIZING else None

In [42]:
sub.to_csv("../predictions/svm_linear_optimized.csv")

### RBF Kernel

In [43]:
grid_search = GridSearchCV(
    estimator=make_pipeline(ct, SVC(kernel='rbf')),
    param_grid={'svc__C': [0.01, 0.1, 1, 10, 100, 1000], 'svc__gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]},
    cv=2,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train) if COMPARING_MODELS else None
pd.DataFrame(list(grid_search.best_params_.items()), columns=['Hyperparameter', 'Best Value']) if COMPARING_MODELS else None

| | Hyperparameter | Best Value |
|---|---|---|
| 0 | `svc__C` | 100.0 |
| 1 | `svc__gamma` | 0.0001 |


In [44]:
svm_rbf_model = make_pipeline(ct, SVC(C=100, gamma=0.0001)).fit(X_train, y_train)
svm_rbf_model if VISUALIZING else None

In [45]:
y_pred = svm_rbf_model.predict(X_test)

In [46]:
accuracy_score(y_test, y_pred) if DEBUG else None

0.7861085556577737

In [47]:
sub = pd.DataFrame({'Transported':y_pred.astype(bool)}, index=df_test.index)
sub.head() if VISUALIZING else None

In [48]:
sub.to_csv("../predictions/svm_rbf_optimized.csv")

## 7.B KNN Classifier

In [49]:
grid_search = GridSearchCV(
    estimator=make_pipeline(ct, KNeighborsClassifier()),
    param_grid={'kneighborsclassifier__n_neighbors': [1, 3, 5, 8, 10, 15, 20, 25]},
    cv=2,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train) if COMPARING_MODELS else None
pd.DataFrame(list(grid_search.best_params_.items()), columns=['Hyperparameter', 'Best Value']) if COMPARING_MODELS else None

| | Hyperparameter | Best Value |
|---|---|---|
| 0 | `kneighborsclassifier__n_neighbors` | 20 |


In [50]:
knn_model = make_pipeline(ct, KNeighborsClassifier(n_neighbors=20)).fit(X_train, y_train)
knn_model if VISUALIZING else None

In [51]:
y_pred = knn_model.predict(X_test)

In [52]:
accuracy_score(y_test, y_pred) if DEBUG else None

0.7861085556577737

In [53]:
sub = pd.DataFrame({'Transported':y_pred.astype(bool)}, index=df_test.index)
sub.head() if VISUALIZING else None

In [54]:
sub.to_csv("../predictions/knn.csv")

## 7.C Logistic Regression 

In [55]:
grid_search = GridSearchCV(
    estimator=make_pipeline(ct, LogisticRegression()),
    param_grid={'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    cv=2,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train) if COMPARING_MODELS else None
pd.DataFrame(list(grid_search.best_params_.items()), columns=['Hyperparameter', 'Best Value']) if COMPARING_MODELS else None

| | Hyperparameter | Best Value |
|---|---|---|
| 0 | `logisticregression__C` | 0.01 |


In [56]:
logreg_model = make_pipeline(ct, LogisticRegression(C=0.01)).fit(X_train, y_train)
logreg_model if VISUALIZING else None

In [57]:
y_pred = logreg_model.predict(X_test)

In [58]:
accuracy_score(y_test, y_pred) if DEBUG else None

0.7856485740570377

In [59]:
sub = pd.DataFrame({'Transported':y_pred.astype(bool)}, index=df_test.index)
sub.head() if VISUALIZING else None

In [60]:
sub.to_csv("../predictions/logistic_reg.csv")

## 7.D kMeans

In [61]:
k = 4 # TODO: need to tune

kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(transformed)

X_train_kmeans = pd.DataFrame(ct.transform(X_train), columns=clean_names)
X_test_kmeans = pd.DataFrame(ct.transform(X_test), columns=clean_names)

X_train_kmeans['Cluster'] = kmeans.predict(X_train_kmeans)
X_test_kmeans['Cluster'] = kmeans.predict(X_test_kmeans)

svc_model = SVC(C=10)
svc_model.fit(X_train_kmeans, y_train)

y_pred = svc_model.predict(X_test_kmeans)
accuracy_score(y_test, y_pred) if DEBUG else None

0.781508739650414
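
The cluster count `k` is marked as still needing tuning. One common option, sketched here on synthetic blobs rather than on the Spaceship Titanic features, is to compare silhouette scores across candidate values and keep the highest. The data, the candidate range, and the choice of the silhouette criterion are all assumptions, not something this notebook does.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated clusters (explicit centers)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Score each candidate k; a higher silhouette means tighter, better separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Since the cluster label here only serves as an extra feature for the SVC, selecting `k` by downstream validation accuracy would be an equally reasonable criterion.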

## 7.E DBSCAN

In [62]:
X_train_DBS = pd.DataFrame(ct.transform(X_train), columns=clean_names)
X_test_DBS = pd.DataFrame(ct.transform(X_test), columns=clean_names)

eps = 2.5  # TODO: need to tune
min_samples = 10 # TODO: need to tune

# fit to transformed dataframe
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
dbscan.fit(X_train_DBS)

dbscan_train_labels = dbscan.labels_

# use knn to predict which cluster to be in
cluster_predictor = KNeighborsClassifier(n_neighbors=5)
cluster_predictor.fit(X_train_DBS, dbscan_train_labels)
dbscan_test_labels = cluster_predictor.predict(X_test_DBS)

# make augmented dataframe - original data + cluster predicted
X_train_augmented = X_train_DBS.copy()
X_train_augmented['DBSCAN_Cluster'] = dbscan_train_labels

X_test_augmented = X_test_DBS.copy()
X_test_augmented['DBSCAN_Cluster'] = dbscan_test_labels


# train main SVC model on augmented data
svc_DBS = SVC(C=10).fit(X_train_augmented, y_train)


y_pred = svc_DBS.predict(X_test_augmented)
accuracy_score(y_test, y_pred) if DEBUG else None

0.7824287028518859
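
DBSCAN has no `predict` method, which is why the cell above trains a KNN classifier on the fitted labels to assign clusters to unseen rows. The toy sketch below shows that pattern in isolation, including the `-1` label DBSCAN gives to noise points. The coordinates and parameter values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

# Two tight groups of four points each, plus one far-away outlier
train_pts = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
    [20.0, 20.0],
])

# Fit DBSCAN; points without enough neighbours within eps are labelled -1 (noise)
db = DBSCAN(eps=0.5, min_samples=3).fit(train_pts)
train_labels = db.labels_

# Propagate the cluster labels to new points with a nearest-neighbour classifier
knn = KNeighborsClassifier(n_neighbors=3).fit(train_pts, train_labels)
new_pts = np.array([[0.05, 0.05], [5.05, 5.05]])
new_labels = knn.predict(new_pts)

print(train_labels, new_labels)
```

Note that the noise label `-1` is passed to the downstream model as if it were just another cluster id, which may or may not be desirable.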