# Spaceship Titanic

Predict which passengers are transported to an alternate dimension.

+ Current Rank: 690/1572 (Top 44%)
+ Current Score: 0.79237

### File and Data Field Descriptions

<details>
<summary>Click to expand</summary>

+ train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
  + PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
  + HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
  + CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
  + Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
  + Destination - The planet the passenger will be debarking to.
  + Age - The age of the passenger.
  + VIP - Whether the passenger has paid for special VIP service during the voyage.
  + RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
  + Name - The first and last names of the passenger.
  + Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
+ test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
+ sample_submission.csv - A submission file in the correct format.
  + PassengerId - Id for each passenger in the test set.
  + Transported - The target. For each passenger, predict either True or False.

</details>
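
The `gggg_pp` and `deck/num/side` formats described above can be split with plain string operations. The following is a minimal, dependency-free sketch; the helper names (`parse_passenger_id`, `parse_cabin`) and the sample values are made up for illustration and are not part of the competition data or of this notebook's pipeline.

```python
from typing import NamedTuple


class PassengerKey(NamedTuple):
    group: str   # the "gggg" part, shared by everyone travelling together
    number: int  # the "pp" part, the passenger's number within the group


class CabinInfo(NamedTuple):
    deck: str
    num: int
    side: str  # "P" for Port or "S" for Starboard


def parse_passenger_id(passenger_id: str) -> PassengerKey:
    """Split an Id of the form gggg_pp into its group and in-group number."""
    group, number = passenger_id.split("_")
    return PassengerKey(group=group, number=int(number))


def parse_cabin(cabin: str) -> CabinInfo:
    """Split a cabin string of the form deck/num/side into its three parts."""
    deck, num, side = cabin.split("/")
    return CabinInfo(deck=deck, num=int(num), side=side)


# Hypothetical sample values, only to show the shape of the result.
print(parse_passenger_id("0003_02"))
print(parse_cabin("F/12/S"))
```

The group is kept as a string so that leading zeros survive, while the numeric parts are converted to integers so they can be used directly as features.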

In [None]:

from datetime import datetime as time

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from Kaggle.Challenges.utils import make_mi_scores, plot_mi_scores

plt.figure(dpi=100, figsize=(8, 5))

In [None]:
# Importing the dataset
train = pd.read_csv('data/space_train.csv')
test = pd.read_csv('data/space_test.csv')

In [None]:
def feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Fill missing values
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    # Cast explicitly so the column is boolean regardless of pandas version
    df['CryoSleep'] = df['CryoSleep'].fillna(False).astype(bool)
    df['Cabin'] = df['Cabin'].fillna('U/0/U')  # 'U' marks an unknown deck/side

    # Despite its name, this is the "pp" part of PassengerId
    # (the passenger's number within their group), not a cabin number
    df['RoomNumber'] = df['PassengerId'].str.split('_').str[1].astype(int)

    # Missing amenity bills are treated as zero spend
    service_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
    df[service_cols] = df[service_cols].fillna(0)
    df['TotalServicesFee'] = df[service_cols].sum(axis=1)

    # Split Cabin (deck/num/side) into separate columns
    df[['Deck', 'Num', 'Side']] = df['Cabin'].str.split('/', expand=True)
    df['Num'] = df['Num'].astype(int)

    # Drop columns that are not used as features
    cols = [
        'Cabin',
        'PassengerId',
        'Name',
        'VIP',
        'Destination',
        'HomePlanet',
        'Side'
    ]
    df = df.drop(cols, axis=1)

    # Transform boolean columns to integer
    bool_col = df.select_dtypes(include=['bool']).columns
    df[bool_col] = df[bool_col].astype(int)

    # Encode the remaining categorical columns as ordinal codes.
    # Note: the encoder is fitted on whichever frame is passed in, so train and
    # test codes only match when both frames contain the same set of categories.
    ordinal_encoder = OrdinalEncoder()
    cat_columns = df.select_dtypes(include=['object']).columns
    df[cat_columns] = ordinal_encoder.fit_transform(df[cat_columns])

    # Use Scaler for data
    # scaler = MinMaxScaler(feature_range=(0, 1))
    # df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
    return df

In [None]:
transformed_train = feature_engineering(train)

In [None]:
X = transformed_train.drop('Transported', axis=1)
y = transformed_train['Transported']

In [None]:
# Investigate MI of data
discrete_features = X.dtypes == int # All discrete features should now have integer dtypes (double-check this before using MI!)
mi_scores = make_mi_scores(X, y, discrete_features)
plot_mi_scores(mi_scores)

In [None]:
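# Prepare data for the model
# The seed is derived from the current time, so every run produces a different split;
# replace it with a fixed integer if reproducible results are needed
random_state = int(time.now().timestamp()) % 4294967295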
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

In [None]:
# TODO: Try running this on a desktop machine with more compute
# from sklearn.ensemble import AdaBoostClassifier
# from sklearn.svm import SVC
# 
# svc_estimator = SVC(probability=True, random_state=random_state, kernel='linear')
# model = AdaBoostClassifier(estimator=svc_estimator)
# model.fit(X_train, y_train)

In [None]:
# Fit the model with fixed hyperparameters and inspect feature importances
params = {
    'bootstrap': False,
    'criterion': 'entropy',
    'max_depth': 10,
    'max_features': 'sqrt',
    'min_samples_leaf': 2,
    'min_samples_split': 10,
    'n_estimators': 100,
}
model = RandomForestClassifier(**params)
model.fit(X_train, y_train)
important_features = pd.Series(data=model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(important_features)
important_features.plot(kind='bar')

In [None]:
# Evaluate on the held-out 20% split
val_prediction = model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, val_prediction))  # accuracy of about 0.799 (79.9%)
# Use cross-validation to estimate performance
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation score: {cv_scores.mean()}")  # mean accuracy of about 0.79 (79%)

In [None]:
passenger_id = test['PassengerId']
transformed_test = feature_engineering(test)
test_prediction = model.predict(transformed_test) 
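
One thing worth checking at this step: `feature_engineering` fits a fresh `OrdinalEncoder` on each call, so the integer assigned to a category depends on which categories occur in that particular frame. The sketch below is a dependency-free stand-in that mimics this with a sorted mapping (it is not scikit-learn's implementation, and the deck letters are made up) to show how codes can shift when the test set lacks a category seen in training, and how reusing the training mapping avoids that.

```python
def fit_ordinal_mapping(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


# Made-up deck letters for illustration; the "test" list is missing deck "T".
train_decks = ["A", "B", "T", "U"]
test_decks = ["A", "B", "U"]

train_mapping = fit_ordinal_mapping(train_decks)
refit_mapping = fit_ordinal_mapping(test_decks)  # what a fresh fit on test would do

# The same deck "U" receives a different code under the two mappings.
print(train_mapping["U"], refit_mapping["U"])

# Reusing the mapping learned on the training data keeps the codes consistent.
test_codes = [train_mapping[deck] for deck in test_decks]
print(test_codes)
```

In the notebook itself, the equivalent would be to fit the encoder once on the training frame and only call its transform on the test frame; whether this matters in practice depends on whether train and test actually differ in their categories.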

In [None]:
# Save results
output = pd.DataFrame({'PassengerId': passenger_id, 'Transported': test_prediction})
output['Transported'] = output['Transported'].astype(bool)
output.to_csv('data/space_submission.csv', index=False)
print("Your submission was successfully saved!")
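
Before uploading, it can help to confirm that the saved file matches the format described for sample_submission.csv: a `PassengerId` column and a `Transported` column holding either True or False. The following is a minimal sketch using only the standard library; `validate_submission` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io


def validate_submission(csv_text: str) -> int:
    """Check the header and values of a submission; return the number of rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["PassengerId", "Transported"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    count = 0
    for row in reader:
        # PassengerId is expected to look like gggg_pp
        group, _, number = row["PassengerId"].partition("_")
        if not (group.isdigit() and number.isdigit()):
            raise ValueError(f"bad PassengerId: {row['PassengerId']!r}")
        if row["Transported"] not in ("True", "False"):
            raise ValueError(f"bad Transported value: {row['Transported']!r}")
        count += 1
    return count


# Made-up rows, only to exercise the check.
sample = "PassengerId,Transported\n0001_01,True\n0002_01,False\n"
print(validate_submission(sample))
```

To check the real file, the text could be read with `open('data/space_submission.csv').read()` and passed to the same function.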