# Spaceship Titanic

## Task description

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

## Files and Data Fields Descriptions


### **train.csv**  - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.

CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

Destination - The planet the passenger will be debarking to.

Age - The age of the passenger.

VIP - Whether the passenger has paid for special VIP service during the voyage.

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

Name - The first and last names of the passenger.

Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
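The gggg_pp and deck/num/side formats described above can be unpacked with plain string splitting. The sketch below is illustrative only: the helper names `split_passenger_id` and `split_cabin` are hypothetical and not part of this project, and missing values (Cabin has some) would need to be handled before calling them.

```python
def split_passenger_id(passenger_id):
    """Split 'gggg_pp' into (group, number_within_group)."""
    group, number = passenger_id.split("_")
    return group, int(number)


def split_cabin(cabin):
    """Split 'deck/num/side' into (deck, num, side)."""
    deck, num, side = cabin.split("/")
    return deck, int(num), side


# Example values taken from the records shown in this notebook
print(split_passenger_id("0003_02"))  # ('0003', 2)
print(split_cabin("A/0/S"))           # ('A', 0, 'S')
```

The group is kept as a string so that leading zeros survive. With pandas, the same idea could be applied column-wise through `Series.str.split`.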

### **test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data.

Your task is to predict the value of Transported for the passengers in this set.

### **sample_submission.csv** - A submission file in the correct format.

PassengerId - Id for each passenger in the test set.

Transported - The target. For each passenger, predict either True or False.



### Here are the first 5 rows of the data:

In [61]:
import pandas as pd

# Random seed for reproducibility
SEED = 123
# A file to save global variables
global_variables = pd.DataFrame({'SEED': [SEED]})
global_variables.to_csv('global_variables.csv')

train_unprocessed = pd.read_csv('datasets/train.csv')
test_unprocessed = pd.read_csv('datasets/test.csv')

train_size = len(train_unprocessed)
train_size_file = pd.DataFrame([train_size])
train_size_file.to_csv('train_size.csv')

# Collect Passenger Ids in the test dataset into a separate variable
test_Ids = test_unprocessed['PassengerId']
test_Ids.to_csv('test_Ids.csv')

data_unprocessed = pd.concat([train_unprocessed, test_unprocessed]).reset_index(drop=True)

data_unprocessed.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [62]:
data_unprocessed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   12970 non-null  object 
 1   HomePlanet    12682 non-null  object 
 2   CryoSleep     12660 non-null  object 
 3   Cabin         12671 non-null  object 
 4   Destination   12696 non-null  object 
 5   Age           12700 non-null  float64
 6   VIP           12674 non-null  object 
 7   RoomService   12707 non-null  float64
 8   FoodCourt     12681 non-null  float64
 9   ShoppingMall  12664 non-null  float64
 10  Spa           12686 non-null  float64
 11  VRDeck        12702 non-null  float64
 12  Name          12676 non-null  object 
 13  Transported   8693 non-null   object 
dtypes: float64(6), object(8)
memory usage: 1.4+ MB


## 00. Baseline

First, we'll make a baseline prediction: that all passengers were Transported. We'll calculate the Score of this prediction on the train set. In later steps we'll report the Train Score and the Cross-validation Score separately, but here the two are equal: a constant prediction involves no fitting, and because our cross-validation is stratified, every fold contains both classes and yields the same ROC AUC as the full train set.

Our Score = (Average Cross-validation ROC AUC) - (1 Standard deviation of Cross-validation ROC AUCs).

In this simple case, the standard deviation is 0 (every stratified fold gives the same ROC AUC), so the Score is simply the Train ROC AUC.
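As a minimal sketch of the Score formula, the function below takes a list of per-fold ROC AUC values and subtracts one standard deviation from their mean. The name `penalized_score` is hypothetical, and the use of the population standard deviation (the same convention as NumPy's default) is an assumption; the actual get_score function may differ.

```python
from statistics import mean, pstdev


def penalized_score(fold_aucs):
    """Mean of the per-fold ROC AUCs minus one (population) standard deviation."""
    return mean(fold_aucs) - pstdev(fold_aucs)


# Constant predictions score 0.5 in every fold, so nothing is subtracted.
print(penalized_score([0.5, 0.5, 0.5]))  # 0.5
```

Subtracting the standard deviation penalizes models whose quality varies a lot between folds, so two models with the same average ROC AUC are ranked by their stability.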

We'll save our intermediate results in DataFrame scores_df:

In [63]:
from sklearn.metrics import roc_auc_score

train_predictions_00 = pd.DataFrame(data=train_unprocessed['Transported'], columns=['Transported'])
train_predictions_00['Transported'] = True

scores_df = pd.DataFrame({'Comment': [], 'Train Score': [], 'Cross-val Score': [], 'Test Accuracy': []})

score_00 = roc_auc_score(train_unprocessed['Transported'], train_predictions_00['Transported'])

scores_df.loc[0, 'Comment'] = 'All True'
scores_df.loc[0, 'Train Score'] = score_00
scores_df.loc[0, 'Cross-val Score'] = score_00
scores_df

Unnamed: 0,Comment,Train Score,Cross-val Score,Test Accuracy
0,All True,0.5,0.5,


ROC AUC is 0.5, which means our predictions are no better than random guessing.
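To see why a constant prediction lands at exactly 0.5, recall that ROC AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. The sketch below computes it that way in plain Python. The function name `pairwise_roc_auc` is hypothetical, and the quadratic loop is meant for illustration only, not as a replacement for `roc_auc_score`.

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for t, s in zip(y_true, y_score) if t]
    negatives = [s for t, s in zip(y_true, y_score) if not t]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [True, False, True, True, False]
print(pairwise_roc_auc(labels, [1, 1, 1, 1, 1]))  # 0.5: every pair is a tie
```

Because all passengers receive the same prediction, every positive/negative pair is a tie, so the result is 0.5 regardless of the class balance.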

Now, we'll make a submission to Kaggle to see our Test Accuracy. We won't use Test Accuracy in making decisions, but we'll use it to catch bugs in our Score calculations:

In [64]:
test_predictions_00 = pd.DataFrame([True] * len(test_unprocessed), columns=['Transported'])
submission_00 = pd.concat([test_unprocessed['PassengerId'], test_predictions_00], axis=1)

submission_00.to_csv('submissions/submission_00.csv', index=False)

# Accuracy reported by Kaggle for this submission (entered manually)
scores_df.loc[0, 'Test Accuracy'] = 0.50689
scores_df.to_csv('scores_df.csv')
scores_df

Unnamed: 0,Comment,Train Score,Cross-val Score,Test Accuracy
0,All True,0.5,0.5,0.50689


## 01. Numerical features with 0's for missing values

Now, we'll make predictions on the numerical features only, filling missing values with zeros:

In [65]:
# Drop non-numerical columns
# 'number' selects every numeric dtype regardless of platform integer width
train = train_unprocessed.select_dtypes(include='number')
test = test_unprocessed.select_dtypes(include='number')

# Put the target variable back to the train dataset
train = pd.concat([train, train_unprocessed['Transported']], axis=1)

# Fill missing values with zeros
train = train.fillna(0)
test = test.fillna(0)

train.to_csv('new_datasets/train_01.csv')
test.to_csv('new_datasets/test_01.csv')

We'll use XGBoost with default parameters as our first estimator. 

### Choosing number of cross-validation splits

To calculate the Score, I wrote a get_score function, located in ['functions/get_score.py'](functions/get_score.py). This function takes the number of StratifiedKFold splits as one of its arguments.
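For readers unfamiliar with stratification, the idea is that each fold keeps roughly the same share of each class as the full dataset. The sketch below illustrates it by dealing the indices of each class round-robin into the folds. It is a simplified stand-in written for this explanation, not scikit-learn's StratifiedKFold implementation, and the name `stratified_folds` is hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign each sample index to one of n_splits folds, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin so every fold
        # receives a near-equal share of that class.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


labels = [True] * 6 + [False] * 6
for fold in stratified_folds(labels, n_splits=3):
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)  # each fold: 4 samples, 2 of them positive
```

Each fold serves once as the validation set while the remaining folds are used for training. This sketch does no shuffling, which a real splitter would typically offer.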

We want the number of splits that gives the best balance between bias and variance. To keep the run time of this notebook down, the optimal number of splits is calculated in a separate file: ['functions/n_splits.py'](functions/n_splits.py).

The trade-off sweet spot is at 3 splits.

In [66]:
N_SPLITS = 3
global_variables['N_SPLITS'] = N_SPLITS
global_variables.to_csv('global_variables.csv')

Let's find Scores and Test Accuracy for this number of splits:

In [67]:
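# get_score is defined in functions/get_score.py; model is the default-parameter
# XGBoost estimator described above. Both must already be in scope at this point.
train_score, cross_score, cross_scores_std, submission = get_score(train, test, model, scores_df,
                                                                  comment="All numerical features with 0's for missing values")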

In [68]:
submission.to_csv('submissions/submission_01.csv', index=False)

# Accuracy reported by Kaggle for this submission (entered manually)
scores_df.loc[1, 'Test Accuracy'] = 0.7877
scores_df.to_csv('scores_df.csv')
scores_df

Unnamed: 0,Comment,Train Score,Cross-val Score,Test Accuracy
0,All True,0.5,0.5,0.50689
1,All numerical features with 0's for missing va...,0.919507,0.828452,0.7877


## 02. Choosing numerical features

We'll find the set of numerical features that gives us the highest cross-validation Score: