## My Submission for Spaceship Titanic

We start by importing our training, testing, and submission files to our program.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

all_features = ["PassengerId", "HomePlanet", "CryoSleep", "Cabin", "Destination", "Age", 
            "VIP", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "Name"]

#Some observations

#train_data[["HomePlanet", "Transported"]].groupby(["HomePlanet", "Transported"]).agg(len).apply(lambda x:x)
#Curiously, around 2/3 of people from Europa are transported

#train_data[["CryoSleep", "Transported"]].groupby(["CryoSleep", "Transported"]).agg(len).apply(lambda x:x)
#People not in CryoSleep are less likely to get transported than people who are in CryoSleep

#view_data(["Destination", "Transported"], train_data).apply(lambda x:x)
#61 percent of people going to 55 Cancri were transported

#def brackets(step, price):
#    return np.floor(price/step)

#bracket = 800
#train_data["Expenses"] = train_data["Expenses"].map(lambda x: brackets(bracket, x))

#view_data(["Expenses", "Transported"], train_data).apply(lambda x:x)

/kaggle/input/spaceship-titanic/sample_submission.csv
/kaggle/input/spaceship-titanic/train.csv
/kaggle/input/spaceship-titanic/test.csv


We then want to determine what factors would lead to someone being transported. The first thing I noticed is that if a passenger was a VIP, then they were less likely to be transported, so this seemed to be a pretty good feature to focus on.

In [2]:
train_data = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")

transported_data = train_data[train_data["Transported"] == True]
num_transported = len(transported_data)

VIP_data = train_data[train_data["VIP"] == True]
#Fraction of transported passengers who were VIPs (the denominator is all transported passengers)
VIP_transport_percentage = len(VIP_data[VIP_data["Transported"] == True])/num_transported

VIP_transport_percentage

0.01735952489721334
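The ratio above divides by the number of transported passengers, so it measures the VIP share among transported passengers. A minimal sketch, on made-up rows rather than the competition data, of how that differs from the transported rate within each VIP group (the `toy` frame and both variable names are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for train.csv (made-up rows, not the competition data).
toy = pd.DataFrame({
    "VIP":         [True, True, False, False, False, False],
    "Transported": [False, True, True, True, False, True],
})

# Share of VIPs among transported passengers (the quantity computed in the cell above).
vip_share_of_transported = toy.loc[toy["Transported"], "VIP"].mean()

# Share of passengers who were transported, within each VIP group.
transported_rate_by_vip = toy.groupby("VIP")["Transported"].mean()

print(vip_share_of_transported)
print(transported_rate_by_vip)
```

On the real data, the second quantity is the one that says whether VIPs were more or less likely to be transported than everyone else.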

The value 0.0174 means that only about 1.74% of the transported passengers were VIPs (this is the VIP share among transported passengers, not the fraction of VIPs who were transported). VIPs are usually pretty rich, so it is natural to ask how one's personal wealth influences whether or not they were transported. There is no "Wealth" column in our table, but we can guess how much wealth a passenger had based on how much they spent. In other words, we estimate one's wealth from their expenses, which we calculate in the following way: $$Expenses = RoomService + FoodCourt + ShoppingMall + Spa + VRDeck$$ Here is the code for this calculation, along with a table that summarizes the result.

In [3]:
train_data["Expenses"] = train_data["RoomService"] + train_data["FoodCourt"] + train_data["ShoppingMall"] + train_data["Spa"] + train_data["VRDeck"]

#Generates a table showing how many people have these characteristics
def view_data(characteristics, dataset):
    return dataset[characteristics].groupby(characteristics).agg(len)

#A function which takes a price and returns the index i of the price range [step * i, step * (i + 1)) it falls into,
#where step is the width of each bracket (a money value)
def brackets(step, price):
    return np.floor(price/step)

bracket = 800
train_data["Bracket_Expenses"] = train_data["Expenses"].map(lambda x: brackets(bracket, x))

view_data(["Bracket_Expenses", "Transported"], train_data).apply(lambda x:x)

Bracket_Expenses  Transported
0.0               False          1463
                  True           2854
1.0               False          1133
                  True            478
2.0               False           424
                                 ... 
35.0              False             1
37.0              False             1
38.0              False             1
                  True              2
44.0              False             1
Length: 64, dtype: int64
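The table above lists raw counts, which makes brackets of very different sizes hard to compare. As a hedged sketch on made-up rows (not the competition data), `pd.crosstab` with `normalize="index"` turns the same kind of table into per-bracket rates; the `toy` frame and `rates` name are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (made-up rows).
toy = pd.DataFrame({
    "Expenses":    [0, 100, 799, 800, 1500, 1599, 2400, np.nan],
    "Transported": [True, True, False, False, True, False, False, True],
})

step = 800
# Same bracketing rule as brackets(): index i of the range [step * i, step * (i + 1)).
toy["Bracket_Expenses"] = np.floor(toy["Expenses"] / step)

# normalize="index" makes each row sum to 1; rows with missing Expenses are dropped.
rates = pd.crosstab(toy["Bracket_Expenses"], toy["Transported"], normalize="index")
print(rates)
```

Each row of `rates` then reads directly as the share of that bracket that was and was not transported.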

Unsurprisingly, people who spent less (those in the 0th Expenses bracket, where each bracket is 800 space bucks wide) were more likely to be transported than people who spent more: about 66% of bracket 0 was transported (2854 of 4317), compared with about 30% of bracket 1 (478 of 1611). However, this statistic does not account for people in CryoSleep or for where people were during the temporal anomaly. I have not yet looked closely at CryoSleep, but I have analyzed the relationship between cabin location and being transported. To analyze this relationship, I decided to parse and process the cabin location, which is given as Deck/Num/Side, so that Deck and Num are numeric and people without a cabin are sent to $/1000/N, which doesn't exist. Here is the code I used to clean the cabin locations as well as some observations I made.

In [4]:
def clean_cabin(cabin):
    if type(cabin) != str:
        return "$/1000/N" #If a person doesn't have a cabin
    else:
        return cabin

train_data["Cabin"] = train_data["Cabin"].map(clean_cabin)
attributes = train_data["Cabin"].map(lambda x: x.split("/"))

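#Deck is stored as the character code of its letter (e.g. ord("B") == 66); code 36 is "$", the placeholder for a missing cabin
train_data["Deck"] = list(map(lambda x: ord(x[0][0]), attributes))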
train_data["Num"] = list(map(lambda x: int(x[1]), attributes))
train_data["Side"] = list(map(lambda x: x[2], attributes))

view_data(["Side", "Transported"], train_data).apply(lambda x: x)
#On side N, ~50% chance of getting transported
#On side P, ~46% chance of getting transported (4206 people total)
#On side S, ~55% chance of getting transported (4288 people total)

view_data(["Deck", "Transported"], train_data).apply(lambda x: x)
#Decks B, C seem relatively unsafe (~73% and ~68% transported), E seems pretty safe (~36%), on Deck F only ~44% were transported
#A lot of people are on decks F and G

view_data(["Deck", "Transported"], train_data).apply(lambda x: x)
#It seems like people who shared a room were either mostly fine or not fine

Deck  Transported
36    False            99
      True            100
65    False           129
      True            127
66    False           207
      True            572
67    False           239
      True            508
68    False           271
      True            207
69    False           563
      True            313
70    False          1565
      True           1229
71    False          1238
      True           1321
84    False             4
      True              1
dtype: int64
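The Deck values in the table above are character codes, so they need decoding to be read as deck letters. A small sketch, assuming made-up cabin strings, of decoding those codes and of an alternative way to split `Cabin` with `str.split(expand=True)` that keeps missing cabins as missing values instead of a placeholder (the `toy`, `parts`, and `deck_letters` names are assumptions for illustration):

```python
import pandas as pd

# Decode the character codes shown in the table back into deck labels.
deck_codes = [36, 65, 66, 67, 68, 69, 70, 71, 84]
deck_letters = {code: chr(code) for code in deck_codes}
print(deck_letters)

# Toy cabins (made-up), including one missing value.
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", None, "G/3/S"]})

# Split "Deck/Num/Side" into three columns; a missing cabin stays missing.
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

# A nullable integer dtype keeps Num numeric without inventing a cabin number.
parts["Num"] = pd.to_numeric(parts["Num"]).astype("Int64")
print(parts)
```

Whether to keep a placeholder such as `$/1000/N` or a true missing value depends on the model; the placeholder has the advantage that every row stays usable without further imputation.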

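I also noticed that the majority of passengers are below 40 years old. However, those below 20 are more likely to be transported than those in the older 20-year age brackets: about 59% of passengers under 20 were transported (1271 of 2158), compared with about 47% of those aged 20 to 39 (2092 of 4497).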

In [5]:
age_bracket = 20
train_data["Age_Bracket"] = train_data["Age"].map(lambda x: brackets(age_bracket, x))

view_data(["Age_Bracket", "Transported"], train_data).map(lambda x: x)
#When age_bracket = 20, we see that most people are below 40 years old, but those below twenty are more likely to be transported.

Age_Bracket  Transported
0.0          False           887
             True           1271
1.0          False          2405
             True           2092
2.0          False           799
             True            806
3.0          False           135
             True            119
dtype: int64
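Numeric bracket indices such as 0.0 and 1.0 have to be translated back into age ranges by hand. One possible variant, sketched on made-up rows, uses `pd.cut` so that each bracket carries its interval as a label and reports the group size next to the transported rate (the `toy`, `edges`, and `summary` names are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the training data (made-up rows), with one missing age.
toy = pd.DataFrame({
    "Age":         [3, 19, 20, 35, 39, 41, 64, float("nan")],
    "Transported": [True, True, False, True, False, False, True, True],
})

# right=False gives half-open intervals [0, 20), [20, 40), ... like the floor-based brackets.
edges = [0, 20, 40, 60, 80]
toy["Age_Bracket"] = pd.cut(toy["Age"], bins=edges, right=False)

# Group size and transported rate per bracket; rows with a missing Age are left out.
summary = toy.groupby("Age_Bracket", observed=False)["Transported"].agg(["count", "mean"])
print(summary)
```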

Based on these observations, I decided to create a small model based on Expenses, cabin location (Deck, Num, and Side), Age, VIP status, and CryoSleep to predict whether someone was transported. But first, I had to create a function that cleaned the data appropriately.

In [6]:
def clean_cabin(cabin):
    if type(cabin) != str:
        return "$/1000/N" #If a person doesn't have a cabin
    else:
        return cabin

def transform_data(data):
    ord_encoder = OrdinalEncoder()
    
    #NaN == True is False, so missing values are treated as False (0) without a fillna downcast
    data["VIP"] = data["VIP"].eq(True).astype(int)
    
    data["CryoSleep"] = data["CryoSleep"].eq(True).astype(int)
    
    data["Cabin"] = data["Cabin"].map(clean_cabin)
    attributes = data["Cabin"].map(lambda x: x.split("/"))

    data["Deck"] = list(map(lambda x: x[0][0], attributes))
    data["Num"] = list(map(lambda x: x[1], attributes))
    data["Side"] = list(map(lambda x: x[2], attributes))

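    #Note: the encoder is refit on every call, so train and test can receive different codes
    #for the same value, and Num is ordered as a string ("10" sorts before "2")
    data[["Deck", "Num", "Side"]] = ord_encoder.fit_transform(data[["Deck", "Num", "Side"]])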

    data["Age"] = data["Age"].fillna(1000).astype(float) #I chose this value because no one is above 1000 years old
    
    purchases = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
    #A NaN in any purchase column makes the whole sum NaN, which is then filled with 0
    data["Expenses"] = sum([data[item] for item in purchases])
    data["Expenses"] = data["Expenses"].fillna(0).astype(float)

    return data
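Because `transform_data` builds a new `OrdinalEncoder` each time it is called, the training and test sets are encoded independently of each other. A hedged sketch of one alternative, on made-up rows: fit the encoder once on the training columns and reuse it for the test columns, mapping categories never seen in training to -1. The toy frames and the choice of -1 are assumptions for illustration, not part of the model above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames (made-up); deck "T" appears only in the test rows.
train = pd.DataFrame({"Deck": ["B", "F", "G"], "Side": ["P", "S", "S"]})
test = pd.DataFrame({"Deck": ["G", "B", "T"], "Side": ["S", "P", "P"]})

cols = ["Deck", "Side"]
# Unseen categories at transform time are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_encoded = train.copy()
test_encoded = test.copy()
train_encoded[cols] = encoder.fit_transform(train[cols])  # learn the codes from train only
test_encoded[cols] = encoder.transform(test[cols])        # reuse the same codes for test

print(train_encoded)
print(test_encoded)
```

With this approach a given deck letter gets the same number in both frames. The cabin number could similarly be converted with `int` rather than encoded as a category.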

Here is my model. (Make sure to run the cell above first, since it defines the data-cleaning function.)

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

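#Returns the validation error rate: for 0/1 labels, the mean absolute error
#equals the fraction of misclassified passengers (1 - accuracy), so lower is better
def score(estimators, data_train, data_test, val_train, val_test):
    rfc = RandomForestClassifier(n_estimators = estimators, random_state=0).fit(data_train, val_train)
    return mean_absolute_error(val_test, rfc.predict(data_test))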

train_data = transform_data(train_data)
test_data = transform_data(pd.read_csv('/kaggle/input/spaceship-titanic/test.csv'))
train_features = ["Expenses", "Deck", "Num", "Side", "VIP", "CryoSleep", "Age"]

X = train_data[train_features]
y = train_data["Transported"].map(lambda x: int(x))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

#rfc = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
#predictions = rfc.predict(X_test)

#estims = range(1, 140, 1)
#best_index = np.argmin(list(map(lambda x: score(x, X_train, X_test, y_train, y_test), estims)))
#best_value = estims[best_index]
#The lines above were used to determine how many estimators I should use in my model

best_value = 133
my_model = RandomForestClassifier(n_estimators = best_value, random_state=0).fit(X_train, y_train)

print(score(best_value, X_train, X_test, y_train, y_test))
print(best_value)


0.2622196664749856
133
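Since `y` holds 0/1 labels, the mean absolute error printed above is the fraction of misclassified validation rows, so 0.2622 corresponds to an accuracy of about 0.7378 on the held-out 20%. A minimal sketch with made-up labels (not the model's predictions) showing that the two metrics always sum to 1 in this setting:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up 0/1 labels and predictions; two of the eight predictions are wrong.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

mae = mean_absolute_error(y_true, y_pred)  # fraction of wrong predictions
acc = accuracy_score(y_true, y_pred)       # fraction of correct predictions

print(mae, acc, mae + acc)
```

Reporting `accuracy_score` directly may be easier to compare with the leaderboard, which is scored on classification accuracy.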


## Submission
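Here is my submission. The cell below predicts Transported for the test set and writes the PassengerId and Transported columns to a CSV file.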

In [8]:
test_data["Transported"] = my_model.predict(test_data[train_features])
test_data["Transported"] = test_data["Transported"].map(lambda x: bool(x == 1.0))

output = pd.DataFrame({'PassengerId': test_data["PassengerId"], 'Transported': test_data["Transported"]})
output.to_csv('sample_submission.csv', index = False)