# Spaceship.

## Task description

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

## Files and Data Fields Descriptions


### **train.csv**  - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.

CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

Destination - The planet the passenger will be debarking to.

Age - The age of the passenger.

VIP - Whether the passenger has paid for special VIP service during the voyage.

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

Name - The first and last names of the passenger.

Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
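
The two structured fields above, PassengerId (gggg_pp) and Cabin (deck/num/side), can be split into their parts with plain string operations. Below is a minimal sketch, assuming the value is present (not missing). The function names and the sample values are illustrative and are not taken from the dataset:

```python
def split_passenger_id(passenger_id: str) -> tuple[str, int]:
    """Split 'gggg_pp' into the group label and the number within the group."""
    group, number = passenger_id.split("_")
    return group, int(number)


def split_cabin(cabin: str) -> tuple[str, int, str]:
    """Split 'deck/num/side' into its three components."""
    deck, num, side = cabin.split("/")
    return deck, int(num), side


# Illustrative values that follow the formats described above
print(split_passenger_id("0003_02"))  # ('0003', 2)
print(split_cabin("F/12/S"))          # ('F', 12, 'S')
```

The group label is kept as a string so that leading zeros survive.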

### **test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data.

Your task is to predict the value of Transported for the passengers in this set.

### **sample_submission.csv** - A submission file in the correct format.

PassengerId - Id for each passenger in the test set.

Transported - The target. For each passenger, predict either True or False.



### Here are the first 5 rows of the data:

In [38]:
import pandas as pd

# Random seed for reproducibility
SEED = 123
# A file to save global variables
global_variables = pd.DataFrame({'SEED': [SEED]})
global_variables.to_csv('global_variables.csv')
global_variables.to_csv('functions/global_variables.csv')

train_unprocessed = pd.read_csv('datasets/train.csv')
test_unprocessed = pd.read_csv('datasets/test.csv')

train_size = len(train_unprocessed)
train_size_file = pd.DataFrame([train_size])
train_size_file.to_csv('train_size.csv')

# Collect Passenger Ids in the test dataset into a separate variable
test_Ids = test_unprocessed['PassengerId']
test_Ids.to_csv('test_Ids.csv')

# Stack the train and test sets into one DataFrame with a fresh index
data_unprocessed = pd.concat([train_unprocessed, test_unprocessed]).reset_index(drop=True)

data_unprocessed.head()

In [None]:
data_unprocessed.info()

## 00. Baseline

First, we'll make a baseline prediction: that all passengers were Transported. We'll calculate the Score of this prediction on the train set. In later chapters we'll calculate the Train Score and the Cross-validation Score separately, but here both are equal to the Train Score: a constant prediction gets the same ROC AUC on every fold, and stratified cross-validation guarantees that every fold contains both classes, so that ROC AUC is always defined.

Our Score = (Average Cross-validation ROC AUC) - (1 Standard deviation of Cross-validation ROC AUCs).

In this simple case, the standard deviation is 0 (again, every stratified fold gets the same ROC AUC), so the Score is just the Train ROC AUC.
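
The Score formula above can be sketched in a few lines. This is an illustration only: the per-fold values are hypothetical, and the use of the population standard deviation is an assumption, since the implementation inside get_score is not shown here:

```python
from statistics import mean, pstdev


def score(fold_aucs: list[float]) -> float:
    """Average of the per-fold ROC AUCs minus one standard deviation."""
    # pstdev is the population standard deviation; the original
    # get_score may use the sample version instead (assumption).
    return mean(fold_aucs) - pstdev(fold_aucs)


# Hypothetical per-fold ROC AUCs, for illustration only
print(score([0.80, 0.80, 0.80]))  # identical folds: std is 0, so Score equals the mean
print(score([0.78, 0.80, 0.82]))  # spread between folds lowers the Score
```

Subtracting the standard deviation penalizes models whose performance varies a lot from fold to fold.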

We'll save our intermediate results in DataFrame scores_df:

In [None]:
from sklearn.metrics import roc_auc_score

train_predictions_00 = pd.DataFrame(data=train_unprocessed['Transported'], columns=['Transported'])
train_predictions_00['Transported'] = True

scores_df = pd.DataFrame({'Comment': [], 'Train Score': [], 'Cross-val Score': [], 'Test Accuracy': []})

score_00 = roc_auc_score(train_unprocessed['Transported'], train_predictions_00['Transported'])

scores_df.loc[0, 'Comment'] = 'All True'
scores_df.loc[0, 'Train Score'] = score_00
scores_df.loc[0, 'Cross-val Score'] = score_00
scores_df

ROC AUC is 0.5, which means our predictions are no better than random guessing.
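
Why a constant prediction lands on exactly 0.5 is easiest to see from the ranking view of ROC AUC. The sketch below is a plain-Python illustration rather than the scikit-learn implementation; the function name and the toy labels are made up for the example:

```python
def roc_auc(y_true: list[bool], y_score: list[float]) -> float:
    """ROC AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (the Mann-Whitney formulation)."""
    positives = [s for t, s in zip(y_true, y_score) if t]
    negatives = [s for t, s in zip(y_true, y_score) if not t]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [True, False, True, True, False]   # toy labels, not the real data
all_true = [1.0] * len(labels)              # the "All True" baseline
print(roc_auc(labels, all_true))            # every pair is a tie, so the result is 0.5
print(roc_auc(labels, [0.9, 0.1, 0.8, 0.7, 0.2]))  # perfect ranking gives 1.0
```

Because every positive-negative pair is tied under a constant prediction, the result does not depend on the class balance.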

Now, we'll make a submission to Kaggle to see our Test Accuracy. We won't use Test Accuracy in making decisions, but we'll use it to catch bugs in our Score calculations:

In [None]:
test_predictions_00 = pd.DataFrame([True] * len(test_unprocessed), columns=['Transported'])
submission_00 = pd.concat([test_unprocessed['PassengerId'], test_predictions_00], axis=1)

submission_00.to_csv('submissions/submission_00.csv', index=False)

scores_df.loc[0, 'Test Accuracy'] = 0.50689
scores_df.to_csv('scores_df.csv')
scores_df

## 01. Numerical features with 0's for missing values

Now, we'll make predictions on the numerical features only, filling missing values with zeros:

In [None]:
# Drop non-numerical columns
train = train_unprocessed.select_dtypes(include=['int', 'float'])
test = test_unprocessed.select_dtypes(include=['int', 'float'])

# Put the target variable back to the train dataset
train = pd.concat([train, train_unprocessed['Transported']], axis=1)

# Fill missing values with zeros
train = train.fillna(0)
test = test.fillna(0)

train.to_csv('new_datasets/train_01.csv')
test.to_csv('new_datasets/test_01.csv')

We'll use XGBoost with default parameters as our first estimator. 

### Choosing number of cross-validation splits

To calculate the Score, I wrote the get_score function, located in ['functions/get_score.py'](functions/get_score.py). It takes the number of StratifiedKFold splits as one of its arguments.

We want the number of splits that gives the best balance between bias and variance. To keep the run time down, the optimal number of splits is calculated in a separate file: ['functions/n_splits.py'](functions/n_splits.py).

The tradeoff sweet spot is at 3 splits.

In [None]:
N_SPLITS = 3
global_variables['N_SPLITS'] = N_SPLITS
global_variables.to_csv('global_variables.csv')
global_variables.to_csv('functions/global_variables.csv')

Let's find Scores and Test Accuracy for this number of splits:

In [None]:
# UNCOMMENT TO INSTALL XGBOOST
#!pip install xgboost
import xgboost as xgb

# Instantiate the classifier
model = xgb.XGBClassifier(random_state=SEED, n_jobs=-1)

from functions.get_score import get_score

train_score, cross_score, cross_scores_std, submission = get_score(global_variables, train, test, model, scores_df,
                                                                  comment="All numerical features with 0's for missing values")

In [None]:
submission.to_csv('submissions/submission_01.csv', index=False)

scores_df.loc[1, 'Test Accuracy'] = 0.7877
scores_df.to_csv('scores_df.csv')
scores_df

## Hyperparameter tuning workflow description

For the sake of run time and for convenience, we'll do all our hyperparameter tuning in separate files. Here is the workflow:

-) Note the Report chapter number. For example, our next chapter, in which we'll do our first tuning, is 02.

-) If you need to restart any study from scratch, go to ['functions/initialize_studies.ipynb'](functions/initialize_studies.ipynb), run the first cell to import packages, then run a cell with your current Report chapter number. The progress of that study will be deleted.

!!!ATTENTION!!! Do not run the whole initialize_studies notebook, or all the studies will be restarted (unless that is what you want). !!!ATTENTION!!!

-) The current study is in studies/Report_chapter_number.py. For example, the next study will be in ['studies/02.py'](studies/02.py).

-) At the beginning of the study file, choose maximum run time and number of trials for the current run.

-) Hyperparameter tuning will be continued from the end of the previous run of this file.

-) At the end of the run, look at the best parameters. If some of the parameters are on the extreme ends of the search ranges, extend the ranges. The study progress will be kept.

-) At the end of the run, look at the total number of trials in the study and the number of the best trial.

-) If results (Average cross-val scores) keep going up, re-run the file. Repeat until satisfied.

-) If results (Average cross-val scores) do not improve for a large number of trials, then go back to the current Report chapter and load results (see below).

For a simple example, look at ['studies/test.ipynb'](studies/test.ipynb). This study maximizes the sum of two numbers chosen from a set of integers.
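
The resume-and-extend loop described above can be illustrated without Optuna. The sketch below is a library-free stand-in, not the code in the studies folder: plain random search replaces Optuna's sampler, and `pickle` replaces `joblib`. The file name `toy_study.pkl`, the integer range, and the limits are hypothetical; the objective is the toy one mentioned above (the sum of two integers):

```python
import pickle
import random
import tempfile
import time
from pathlib import Path

# Hypothetical settings, chosen at the top of the file for each run
STATE_FILE = Path(tempfile.gettempdir()) / "toy_study.pkl"
CHOICES = range(-10, 11)   # the set of integers to choose from
MAX_SECONDS = 5            # maximum run time for this run
N_TRIALS = 50              # number of trials for this run


def load_state() -> dict:
    """Resume from the saved state, or start fresh if none exists."""
    if STATE_FILE.exists():
        return pickle.loads(STATE_FILE.read_bytes())
    return {"n_trials": 0, "best_value": None, "best_params": None, "best_trial": None}


def run(state: dict) -> dict:
    """Add up to N_TRIALS random trials to the state, then save it."""
    start = time.monotonic()
    for _ in range(N_TRIALS):
        if time.monotonic() - start > MAX_SECONDS:
            break
        params = {"x": random.choice(CHOICES), "y": random.choice(CHOICES)}
        value = params["x"] + params["y"]  # objective: maximize the sum
        if state["best_value"] is None or value > state["best_value"]:
            state.update(best_value=value, best_params=params,
                         best_trial=state["n_trials"])
        state["n_trials"] += 1
    STATE_FILE.write_bytes(pickle.dumps(state))
    return state


state = run(load_state())
print("Total trials:", state["n_trials"], "| best trial:", state["best_trial"])
print("Best value:", state["best_value"], "| best params:", state["best_params"])
```

Running the script again continues from the saved state. Deleting the state file plays the role of re-initializing a study.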

## 02. Choosing numerical features

We'll find the set of numerical features that gives us the highest cross-validation Score:

In [None]:
import joblib
import optuna

study = joblib.load("studies/02.pkl")

print("Best average cross-validation Score:", study.best_trial.value)
print("Best hyperparameters:", study.best_params)

Here, "'Age': True" means that the best model used the 'Age' feature.

All the features were selected by our hyperparameter search.

Note that our best Score equals the Score from chapter 01, since we used the same set of features and the same random seed. Let's put this score into scores_df for consistency:

In [None]:
train_score, cross_score, cross_scores_std, submission = get_score(global_variables, train, test, model, scores_df,
                                                                  comment="All numerical features are selected")
scores_df.loc[2, 'Test Accuracy'] = 0.7877
scores_df.to_csv('scores_df.csv')
scores_df

## Categorical features

Let's look at our data column by column:

## 03. Group Size

**PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

The number of a passenger within their group is arbitrary, so we don't need it. However, group numbers may be important, so we'll create a new feature "Group":

In [None]:
train['Group'] = train_unprocessed['PassengerId'].str[:4]
test['Group'] = test_unprocessed['PassengerId'].str[:4]
print(train['Group'].info())
print(train['Group'].describe())
print('Unique Values:')
print(train['Group'].unique())

We have 6217 separate Groups in the training set among 8693 entries.

We need to transform Group into a numerical feature. Since the number of categories is high, creating dummy variables may not be worthwhile. We'll try Mean Target Encoding (the functions for Mean Target Encoding are based on work by Yauhen Babakhin):

In [None]:
from functions.target_encoding import mean_target_encoding

# We'll need to express Transported as 1 and 0 for Mean Target Encoding:
train['Transported'] = [1 if i else 0 for i in train['Transported']]

# Encode Group
train['Group_enc'], test['Group_enc'] = mean_target_encoding(train, test, 'Transported', 'Group', alpha=7.5)

test['Group_enc'].describe()

Oh, it seems that we have only one unique value for Group_enc in the test set:

In [None]:
test['Group_enc'].unique()

The reason is that no Groups are common to the train and test sets:

In [None]:
list(set(train['Group']) & set(test['Group']))
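
The single encoded value follows from how smoothed mean target encoding usually treats categories it has never seen. The sketch below assumes the common smoothed form and a fallback to the global mean; it is not the code from functions/target_encoding.py, and the function name and toy data are illustrative:

```python
from collections import defaultdict


def smoothed_target_encoding(train_cats, train_targets, test_cats, alpha):
    """Encode categories by a smoothed mean of the target.

    encoding = (sum of targets in category + alpha * global mean)
               / (category size + alpha)
    Categories unseen in training fall back to the global mean.
    """
    global_mean = sum(train_targets) / len(train_targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(train_cats, train_targets):
        sums[cat] += target
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + alpha * global_mean) / (counts[cat] + alpha)
        for cat in counts
    }
    return [encoding.get(cat, global_mean) for cat in test_cats]


# Toy data: the test groups never appear in the training set
train_groups = ["0001", "0002", "0002", "0003"]
train_target = [1, 0, 1, 1]
test_groups = ["0013", "0018", "0018"]

encoded_test = smoothed_target_encoding(train_groups, train_target, test_groups, alpha=7.5)
print(encoded_test)  # [0.75, 0.75, 0.75]: every value is the global training mean
```

With no overlap between the two sets, the encoding carries no per-group information into the test set.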

Therefore, distinguishing Groups is useless. However, we can use the Group column in another way: let's calculate the number of group members and assign it to a "GroupSize" variable:

In [None]:
train['GroupSize'] = train.groupby('Group')['Group'].transform('count')
test['GroupSize'] = test.groupby('Group')['Group'].transform('count')

for dataset in [train, test]:
    print(dataset['GroupSize'].info())
    print(dataset['GroupSize'].describe())
    print('Unique Values:')
    print(dataset['GroupSize'].unique())
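
For readers who want to check the logic without the datasets, the same count can be reproduced in plain Python. This is an illustrative equivalent of the groupby/transform('count') step above, and the PassengerId values are made up:

```python
from collections import Counter

# Illustrative PassengerIds in the gggg_pp format
passenger_ids = ["0001_01", "0002_01", "0003_01", "0003_02",
                 "0006_01", "0006_02", "0006_03"]

# The group label is the part before the underscore
groups = [pid.split("_")[0] for pid in passenger_ids]

# Count the members of each group, then map the count back to every passenger
members_per_group = Counter(groups)
group_size = [members_per_group[g] for g in groups]

print(group_size)  # [1, 1, 2, 2, 3, 3, 3]
```

Counting within each dataset separately is safe here because no group is split between the train and test sets.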
    

Now, let's select features again, this time with GroupSize as an option:

In [None]:
train.to_csv('new_datasets/train_03.csv')
test.to_csv('new_datasets/test_03.csv')

study = joblib.load("studies/03.pkl")

print("Best average cross-validation Score:", study.best_trial.value)
print("Best hyperparameters:", study.best_params)

We improved our Score by adding Group Size! Let's put results in the table:

In [None]:
# Create sets with selected columns

selected_columns = []
for key, value in study.best_params.items():
    if value:
        selected_columns.append(key)

train_selected = train[selected_columns]
train_selected = pd.concat([train_selected, train_unprocessed['Transported']], axis=1)
test_selected = test[selected_columns]


train_score, cross_score, cross_scores_std, submission = get_score(global_variables, 
                                    train_selected, test_selected,
                                    model, scores_df,
                                    comment="+ GroupSize")
submission.to_csv('submissions/submission_03.csv', index=False)
scores_df.loc[3, 'Test Accuracy'] = 0.7884
scores_df.to_csv('scores_df.csv')
scores_df

## Exploring missing values