# Spaceship. Part 4. New start.

## Task description

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

## Starting over

We'll start over, tweaking some steps of the process and testing every step with the best classifier we found in Part 3.

We'll repeat all the commentary, so it will be convenient for readers to start from Part 4 without reading the previous parts.

## Test function

But first, let's create a function that lets us easily test every new step. It computes the average cross-validation ROC AUC and accuracy scores and, optionally, prepares the data for a new submission:

In [194]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Random seed for reproducibility
SEED = 123

# Prepare our best model for training
from sklearn.ensemble import RandomForestClassifier
model_for_tests = RandomForestClassifier(
    random_state=SEED,
    n_estimators=516,
    criterion='log_loss',
    max_depth=17,
    max_features=0.7,
    max_leaf_nodes=123,
    min_impurity_decrease=0.00020380822483963789,
    min_samples_leaf=2,
    max_samples=0.9999360987512214,
    n_jobs=-1,
)



def get_cv_scores(train, test, model, scores_df, verbose=False, prepare_submission=False):
    
    '''
    Cross-validate a model and record the average scores.

    Takes the train and test sets, a model to cross-validate, and a DataFrame with previous scores.

    Appends a new row to scores_df (modified in place) with:
        1) Average training ROC AUC score.
        2) Average cross-validation ROC AUC score.
        3) Average training accuracy score.
        4) Average cross-validation accuracy score.
        5) NaN as a placeholder for the test accuracy.

    Setting verbose to True prints the updated scores DataFrame.

    Returns a dataset for a new submission if prepare_submission is True;
    otherwise returns a placeholder string.
    '''
    
    # Create a StratifiedKFold object (6 splits with equal proportion of positive target values)
    skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)
    
    # Empty lists for collecting scores
    train_roc_auc_scores = []
    cv_roc_auc_scores = []
    train_accuracy_scores = []
    cv_accuracy_scores = []
    
    # Iterate through folds
    for train_index, cv_index in skf.split(train.drop('Transported', axis=1), train['Transported']):
        # Obtain training and testing folds
        cv_train, cv_test = train.iloc[train_index], train.iloc[cv_index]
        
        # Fit the model
        model.fit(cv_train.drop('Transported', axis=1), cv_train['Transported']) 
        
        # Calculate scores and append to the scores lists
        train_pred_proba = model.predict_proba(cv_train.drop('Transported', axis=1))[:, 1]
        train_roc_auc_scores.append(roc_auc_score(cv_train['Transported'], train_pred_proba))
        cv_pred_proba = model.predict_proba(cv_test.drop('Transported', axis=1))[:, 1]
        cv_roc_auc_scores.append(roc_auc_score(cv_test['Transported'], cv_pred_proba))
        train_accuracy_scores.append(model.score(cv_train.drop('Transported', axis=1), cv_train['Transported']))
        cv_accuracy_scores.append(model.score(cv_test.drop('Transported', axis=1), cv_test['Transported']))
        

    # Update the scores DataFrame with average scores:
    
    scores_df.loc[len(scores_df)] = [
        np.mean(train_roc_auc_scores),
        np.mean(cv_roc_auc_scores),
        np.mean(train_accuracy_scores),
        np.mean(cv_accuracy_scores),
        np.nan,  # placeholder for the test accuracy, filled in after submission
    ]
    
    # Print the updated scores DataFrame
    if verbose:
        print(scores_df)
        
    submission = "prepare_submission=False"
        
    if prepare_submission:
    
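        # Prepare the submission DataFrame.
        # Note: the model used here is the one fitted on the last cross-validation fold,
        # and test_Ids is a global Series that must be defined before the first call.
        test_pred = model.predict(test)
        test_pred = ["True" if i == 1 else "False" for i in test_pred]
        test_pred = pd.DataFrame(test_pred, columns=['Transported'])
        submission = pd.concat([test_Ids, test_pred], axis=1)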

    
    return submission
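
As a cross-check, roughly the same four averages can be obtained with scikit-learn's `cross_validate`. The sketch below is not the function used in this notebook: it runs on synthetic data from `make_classification` and uses a small forest, both of which are assumptions made to keep it fast and self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

SEED = 123

# Synthetic stand-in for the training data (assumption: not the real dataset)
X, y = make_classification(n_samples=300, n_features=6, random_state=SEED)

# Small forest so that the sketch runs quickly
model = RandomForestClassifier(n_estimators=25, random_state=SEED)

# Same splitting scheme as in get_cv_scores: 6 stratified, shuffled folds
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)

results = cross_validate(
    model, X, y,
    cv=skf,
    scoring=['roc_auc', 'accuracy'],
    return_train_score=True,
)

# Average the per-fold scores, as get_cv_scores does
summary = {
    'Train ROC AUC': np.mean(results['train_roc_auc']),
    'Cross-val ROC AUC': np.mean(results['test_roc_auc']),
    'Train Accuracy': np.mean(results['train_accuracy']),
    'Cross-val Accuracy': np.mean(results['test_accuracy']),
}
print(summary)
```

The hand-written loop is kept in this notebook because it also appends to the scores DataFrame and prepares the submission in one call.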
                         

## Files and Data Fields Descriptions


### **train.csv**  - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.

CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

Destination - The planet the passenger will be debarking to.

Age - The age of the passenger.

VIP - Whether the passenger has paid for special VIP service during the voyage.

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

Name - The first and last names of the passenger.

Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

### **test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data.

Your task is to predict the value of Transported for the passengers in this set.

### **sample_submission.csv** - A submission file in the correct format.

PassengerId - Id for each passenger in the test set.

Transported - The target. For each passenger, predict either True or False.



### Here are the first 5 rows of the data:

In [195]:
train_unprocessed = pd.read_csv('datasets/train.csv')
test_unprocessed = pd.read_csv('datasets/test.csv')

train_size = len(train_unprocessed)

data_unprocessed = pd.concat([train_unprocessed, test_unprocessed]).reset_index(drop=True)

# Working copy that we'll modify step by step
data = data_unprocessed.copy()

data.head()

|  | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [196]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   12970 non-null  object 
 1   HomePlanet    12682 non-null  object 
 2   CryoSleep     12660 non-null  object 
 3   Cabin         12671 non-null  object 
 4   Destination   12696 non-null  object 
 5   Age           12700 non-null  float64
 6   VIP           12674 non-null  object 
 7   RoomService   12707 non-null  float64
 8   FoodCourt     12681 non-null  float64
 9   ShoppingMall  12664 non-null  float64
 10  Spa           12686 non-null  float64
 11  VRDeck        12702 non-null  float64
 12  Name          12676 non-null  object 
 13  Transported   8693 non-null   object 
dtypes: float64(6), object(8)
memory usage: 1.4+ MB


## First submission

Let's try our model on unprocessed data. RandomForestClassifier works only with numerical features, so we'll drop the non-numerical features for now. It also doesn't accept missing values, so we'll fill all the missing values with zeros for now:

In [197]:
%%time

# Create the scores DataFrame
scores_df = pd.DataFrame({'Train ROC AUC': [], 'Cross-val ROC AUC': [], 'Train Accuracy': [], \
                          'Cross-val Accuracy': [], 'Test accuracy': []})

# Collect Passenger Ids in the test dataset into a separate variable
test_Ids = test_unprocessed['PassengerId']

# Drop non-numerical columns
train = train_unprocessed.select_dtypes(include=['int', 'float'])
test = test_unprocessed.select_dtypes(include=['int', 'float'])

# Put the target variable back to the train dataset
train = pd.concat([train, train_unprocessed['Transported']], axis=1)

# Fill missing values with zeros
train = train.fillna(0)
test = test.fillna(0)

# Calculate scores
submission_00 = get_cv_scores(train, test, model_for_tests, scores_df, prepare_submission=True)

scores_df

CPU times: total: 9.73 s
Wall time: 10.2 s


|  | Train ROC AUC | Cross-val ROC AUC | Train Accuracy | Cross-val Accuracy | Test accuracy |
|---|---|---|---|---|---|
| 0 | 0.891046 | 0.847367 | 0.830047 | 0.790751 | NaN |


A cross-validation accuracy of 0.79 is not bad. The classifier we found in Part 3 works well even on unprocessed and truncated data. Now, let's create a submission file and submit it to the competition to see the test accuracy:

In [198]:
submission_00.to_csv('04_submission_00.csv', index=False)

scores_df.loc[0, 'Test accuracy'] = 0.80056

scores_df

|  | Train ROC AUC | Cross-val ROC AUC | Train Accuracy | Cross-val Accuracy | Test accuracy |
|---|---|---|---|---|---|
| 0 | 0.891046 | 0.847367 | 0.830047 | 0.790751 | 0.80056 |


Our test accuracy (0.80056) is even a bit higher than our best result on processed data in Part 3! That supports my suspicion that some of the preprocessing steps reduced our potential performance. From now on, we'll check the cross-validation performance at every step and reject the steps that decrease the scores.

## Data validation and feature engineering

Let's look at our data column by column:

**PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

The number of a passenger within their group is arbitrary, so we don't need it. However, group numbers may be important, so we'll create a new feature "Group":

In [199]:
data['Group'] = data['PassengerId'].str[:4]
data =  data.drop('PassengerId', axis=1)
print(data['Group'].info())
print(data['Group'].describe())
print('Unique Values:')
print(data['Group'].unique())

<class 'pandas.core.series.Series'>
RangeIndex: 12970 entries, 0 to 12969
Series name: Group
Non-Null Count  Dtype 
--------------  ----- 
12970 non-null  object
dtypes: object(1)
memory usage: 101.5+ KB
None
count     12970
unique     9280
top        6499
freq          8
Name: Group, dtype: object
Unique Values:
['0001' '0002' '0003' ... '9271' '9273' '9277']


We have 9280 separate Groups among 12970 entries.
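
For illustration, both parts of the Id can also be separated by splitting on the underscore. This is a minimal sketch on a few sample Ids; the column name `NumberInGroup` is a hypothetical label, not something used elsewhere in this notebook.

```python
import pandas as pd

# A few sample Ids in the gggg_pp format
ids = pd.Series(['0001_01', '0003_02', '9277_01'], name='PassengerId')

# Split each Id into the group part and the number within the group
parts = ids.str.split('_', expand=True)
parts.columns = ['Group', 'NumberInGroup']  # 'NumberInGroup' is a hypothetical name
print(parts)

# The group part matches the first four characters taken with .str[:4]
same_as_slice = (parts['Group'] == ids.str[:4]).all()
print(same_as_slice)
```

Slicing the first four characters is enough here because the group part always has exactly four digits.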

We need to transform Group into a numerical feature. Since the number of categories is high, creating dummy variables may not be worthwhile. We'll try Mean Target Encoding (the functions for Mean Target Encoding are based on work by Yauhen Babakhin):

In [200]:
def test_mean_target_encoding(train, test, target, categorical, alpha=5):
    # Calculate global mean on the train data
    global_mean = train[target].mean()
    
    # Group by the categorical feature and calculate its properties
    train_groups = train.groupby(categorical)
    category_sum = train_groups[target].sum()
    category_size = train_groups.size()
    
    # Calculate smoothed mean target statistics
    train_statistics = (category_sum + global_mean * alpha) / (category_size + alpha)
    
    # Apply statistics to the test data and fill new categories
    test_feature = test[categorical].map(train_statistics).fillna(global_mean)
    return test_feature.values

def train_mean_target_encoding(train, target, categorical, alpha=5):
    # Create 5-fold cross-validation
    skf = StratifiedKFold(n_splits=5, random_state=SEED, shuffle=True)
    train_feature = pd.Series(index=train.index, dtype='float64')
    
    # For each fold split (use the target argument instead of a hard-coded column name)
    for train_index, test_index in skf.split(train.drop(target, axis=1), train[target]):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
      
        # Calculate out-of-fold statistics and apply to cv_test
        cv_test_feature = test_mean_target_encoding(cv_train, cv_test, target, categorical, alpha)
        
        # Save new feature for this particular fold
        train_feature.iloc[test_index] = cv_test_feature       
    return train_feature.values

def mean_target_encoding(train, test, target, categorical, alpha=5):
  
    # Get the train feature
    train_feature = train_mean_target_encoding(train, target, categorical, alpha)
  
    # Get the test feature
    test_feature = test_mean_target_encoding(train, test, target, categorical, alpha)
    
    # Return new features to add to the model
    return train_feature, test_feature



# Add Group to train and test

train['Group'] = data.loc[:train_size-1, 'Group'].values
test['Group'] = data.loc[train_size:, 'Group'].values

# We'll need to express Transported as 1 and 0 for Mean Target Encoding:
train['Transported'] = [1 if i else 0 for i in train['Transported']]

# Encode Group
train['Group_enc'], test['Group_enc'] = mean_target_encoding(train, test, 'Transported', 'Group', alpha=7.5)

test['Group_enc'].describe()

count    4.277000e+03
mean     5.036236e-01
std      3.475404e-14
min      5.036236e-01
25%      5.036236e-01
50%      5.036236e-01
75%      5.036236e-01
max      5.036236e-01
Name: Group_enc, dtype: float64

Oh, it seems that Group_enc has only one unique value in the test set (the tiny standard deviation is just floating-point noise):

In [201]:
test['Group_enc'].unique()

array([0.50362361])

The reason is that there are no Groups in common between the train and test sets:

In [202]:
list(set(train['Group']) & set(test['Group']))

[]
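
To see why every test row ends up with the same value, here is a minimal sketch of the smoothed statistic, (category sum + global mean × alpha) / (category size + alpha), on toy data. The toy frames and group labels are hypothetical; only the formula and the fallback to the global mean follow the functions above.

```python
import pandas as pd

# Toy data (hypothetical): group 'A' has 3 members, group 'B' has 1
toy_train = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B'],
    'Transported': [1, 1, 0, 1],
})
# Group 'C' never appears in the training data
toy_test = pd.DataFrame({'Group': ['A', 'C']})

alpha = 5
global_mean = toy_train['Transported'].mean()  # 3 / 4 = 0.75

# Smoothed per-group mean, pulled toward the global mean by alpha
groups = toy_train.groupby('Group')['Transported']
smoothed = (groups.sum() + global_mean * alpha) / (groups.size() + alpha)

# Unseen categories map to NaN and are then filled with the global mean
encoded = toy_test['Group'].map(smoothed).fillna(global_mean)

print(smoothed)
print(encoded)
```

Group 'A' gets (2 + 0.75 × 5) / (3 + 5) = 0.71875, while the unseen group 'C' falls back to the global mean of 0.75. In our data every test Group is unseen, so every test row receives the global mean of the training target.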

Therefore, distinguishing between individual Groups is useless. However, we can use the Group column in another way: let's calculate the number of members in each group and store it in a new "GroupSize" feature:

In [203]:
# Revert train and test to the state used for the first submission
train = train_unprocessed.select_dtypes(include=['int', 'float'])
test = test_unprocessed.select_dtypes(include=['int', 'float'])
train = pd.concat([train, train_unprocessed['Transported']], axis=1)
train = train.fillna(0)
test = test.fillna(0)

# Calculate GroupSize
data['GroupSize'] = data.groupby('Group')['Group'].transform('count')
print(data['GroupSize'].info())
print(data['GroupSize'].describe())
print('Unique Values:')
print(data['GroupSize'].unique())

<class 'pandas.core.series.Series'>
RangeIndex: 12970 entries, 0 to 12969
Series name: GroupSize
Non-Null Count  Dtype
--------------  -----
12970 non-null  int64
dtypes: int64(1)
memory usage: 101.5 KB
None
count    12970.000000
mean         2.022976
std          1.577102
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          8.000000
Name: GroupSize, dtype: float64
Unique Values:
[1 2 3 6 4 7 5 8]
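
As a small illustration of what `transform('count')` does here, the sketch below applies the same call to a toy Group column (the values are made up for the example). Each row receives the number of rows that share its Group, and the result keeps the original row order.

```python
import pandas as pd

# Toy Group column (hypothetical values): one single, one pair, one trio
toy = pd.DataFrame({'Group': ['0001', '0003', '0003', '0006', '0006', '0006']})

# Count the rows per group and broadcast the count back to every row
toy['GroupSize'] = toy.groupby('Group')['Group'].transform('count')
print(toy)
```

Because the train and test sets share no Groups, counting over the combined data gives the same sizes as counting within each set separately.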


In [204]:
# Add GroupSize to train and test

train['GroupSize'] = data.loc[:train_size-1, 'GroupSize'].values
test['GroupSize'] = data.loc[train_size:, 'GroupSize'].values
print('Unique Values in train:')
print(train['GroupSize'].unique())
print('Unique Values in test:')
print(test['GroupSize'].unique())

Unique Values in train:
[1 2 3 6 4 7 5 8]
Unique Values in test:
[1 2 3 5 4 8 6 7]


In [205]:
%%time

# Let's test

get_cv_scores(train, test, model_for_tests, scores_df)

scores_df

CPU times: total: 14.2 s
Wall time: 11.1 s


|  | Train ROC AUC | Cross-val ROC AUC | Train Accuracy | Cross-val Accuracy | Test accuracy |
|---|---|---|---|---|---|
| 0 | 0.891046 | 0.847367 | 0.830047 | 0.790751 | 0.80056 |
| 1 | 0.89933 | 0.854434 | 0.828759 | 0.792477 | NaN |


Both cross-validation scores improved (ROC AUC from 0.8474 to 0.8544, accuracy from 0.7908 to 0.7925), so we'll keep GroupSize.

PassengerId, and therefore GroupSize, has no missing values. The other columns do have missing values, though. Let's explore whether there are any patterns in the missing data: