# Introduction
This is my first competition and I decided to join Kaggle's Spaceship Titanic Competition. We are given training data which contains information about passengers and whether they were transported to an alternate dimension or not. Using this, we must predict whether the passengers given in the testing data were transported or not.

Start by importing libraries and setting default options.

In [None]:
import warnings

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import optuna

from sklearn.impute import SimpleImputer
from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

#Allow more columns to be displayed
pd.options.display.max_columns = 50

# Mute warnings
warnings.filterwarnings('ignore')

Below is an initial look at the training data.

In [None]:
df_train = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv", index_col="PassengerId")
df_test = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv", index_col="PassengerId")
df_train.head()

## Initial Data Analysis ##
Every column except the target contains missing values. About 2% of the data is missing.

In [None]:
df_train.describe()

In [None]:
#Number of missing values in each column
df_train.isna().sum()
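
The "about 2%" figure can be checked by averaging the boolean mask returned by `isna()`. The snippet below is a minimal sketch on a small made-up frame (`toy` is a hypothetical stand-in for `df_train`, and its values are invented for illustration); the same two lines would apply to the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [39.0, 24.0, np.nan, 33.0],
    "VIP": [False, False, True, False],
})

# Fraction of missing cells in each column
missing_per_column = toy.isna().mean()

# Fraction of missing cells over the whole frame
missing_overall = toy.isna().to_numpy().mean()

print(missing_per_column)
print(f"Overall missing fraction: {missing_overall:.3f}")
```

Averaging the mask gives a fraction rather than a count, which is easier to compare across columns than the raw totals from `isna().sum()`.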

In [None]:
df_train.dtypes

### Categorical Data ###

Looking at the categorical columns below, `CryoSleep`, `VIP`, and `Transported` hold boolean values. `Cabin` and `Name` are close to unique for each passenger.

There don't seem to be any typos or errors that need to be cleaned.

In [None]:
for col in df_train.columns:
    if df_train[col].dtype in ["object", "bool"]:
        print(f"{df_train[col].value_counts()}")
        print(f"Missing: {df_train[col].isna().sum()}\n")

### Numerical Data ###
Below are kernel density estimate plots that show the distribution of every numerical feature. All features but `Age` show extreme right skew. These skewed features represent how much each passenger spent on certain amenities. The skew seems to result from the fact that a large majority of passengers spent little or no money.

In [None]:
fig, ax = plt.subplots(6, 1, figsize=(10, 20))

#Numerical columns
numerical_cols = [col for col in df_train if df_train[col].dtype == "float64"]

for i in range(6):
    col = numerical_cols[i]
    sns.kdeplot(df_train[col], ax=ax[i])
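
The skew seen in the plots can also be put into numbers with `DataFrame.skew()`. The sketch below uses a made-up frame (`toy` and its values are hypothetical, chosen only to mimic one roughly symmetric column and one column dominated by zeros); on the real data the call would be `df_train[numerical_cols].skew()`.

```python
import pandas as pd

# Made-up values: Age is roughly symmetric, RoomService is mostly zero with one large spender
toy = pd.DataFrame({
    "Age": [19.0, 24.0, 27.0, 33.0, 38.0, 45.0],
    "RoomService": [0.0, 0.0, 0.0, 0.0, 50.0, 900.0],
})

# Sample skewness per column; values well above 0 indicate a long right tail
skewness = toy.skew()
print(skewness)
```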

## Preprocessing Data ##
This function loads fresh dataframes whenever needed. It delegates encoding and imputation to their own separate functions, which are defined below after a closer look at the data.

In [None]:
def load_data():
    df_train = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv", index_col="PassengerId")
    df_test = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv", index_col="PassengerId")
    
    #Encode and impute training and testing data together
    df = pd.concat([df_train, df_test])
    df = encode(df)
    df = impute(df)
    
    # Reform splits
    df_train = df.loc[df_train.index, :]
    df_test = df.loc[df_test.index, :]
    return df_train, df_test

### Encoding Data Type ###
The categorical data will be converted to the pandas `category` dtype so data processing will be easier. The boolean data will be converted to `int` during imputation.

In [None]:
#Transported, VIP and CryoSleep are boolean. The remaining categorical features are nominal
nom_features = ["HomePlanet", "Cabin", "Destination", "Name"]
bool_features = ["VIP", "CryoSleep"]

def encode(df):
    for col in nom_features:
        df[col] = df[col].astype("category")   
        
    #Add "None" to categories for later
    df["Cabin"] = df["Cabin"].cat.add_categories("None")
    df["Name"] = df["Name"].cat.add_categories("None")
    
    #Target will be encoded as categorical to use XGBClassifier later
    df["Transported"] = df["Transported"].astype("category")
    
    return df

### Imputing Data ###
Besides the `Age` feature, all the numerical features are heavily skewed. The numerical columns will therefore be imputed with the median, which is more robust to extreme values than the mean.

The categorical features will be imputed with the most common value. This makes less sense for the `Cabin` and `Name` features, so their `NaN` values will be replaced with `"None"` instead.

Finally, the function will create indicator columns that indicate whether data was originally missing.

In [None]:
def impute(df):
    #Create indicator columns for missing values
    df_missing = pd.DataFrame()
    for name in df:
        if df[name].isna().any() and name != "Transported":     
            df_missing[name + "_missing"] = df[name].isna().astype(int)
            
    #Impute missing values with False since False is the most common value for all boolean features
    df_bool = df[bool_features]
    df_bool = df_bool.fillna(False)
    df_bool = df_bool.astype("int")
    
    num_imputer = SimpleImputer(strategy="median")
    cat_imputer = SimpleImputer(strategy="most_frequent")
    
    #Separate df to impute columns differently. Convert the imputed matrices back into dataframes
    df_num = df[numerical_cols]
    df_num = pd.DataFrame(num_imputer.fit_transform(df_num), columns=numerical_cols, index=df_num.index)
    
    cat_cols = ["HomePlanet", "Destination"]
    df_cat = df[cat_cols]
    df_cat = pd.DataFrame(cat_imputer.fit_transform(df_cat), columns=cat_cols, index=df_cat.index).astype("category")
    
    df_special = df[["Cabin", "Name"]]
    df_special = df_special.fillna("None")
    
    return pd.concat([df_num, df_bool, df_cat, df_special, df_missing, df["Transported"]], axis=1)
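
To illustrate why the median is preferred for the skewed spending columns, here is a small sketch comparing mean and median imputation on a made-up series (the values and the single large outlier are invented for illustration). It also builds a `_missing` indicator in the same way the function above does.

```python
import numpy as np
import pandas as pd

# Made-up spending values: mostly zero, one very large spender, one missing entry
spend = pd.Series([0.0, 0.0, 0.0, 10.0, np.nan, 5000.0], name="Spa")

# Indicator column recording which rows were originally missing
spa_missing = spend.isna().astype(int)

# The mean is pulled upward by the single large value; the median is not
filled_mean = spend.fillna(spend.mean())
filled_median = spend.fillna(spend.median())

print(pd.DataFrame({
    "Spa_missing": spa_missing,
    "mean_imputed": filled_mean,
    "median_imputed": filled_median,
}))
```

In this toy case the mean fills the gap with a value that no typical passenger spent, while the median fills it with the value most passengers actually have.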

## Load Data ##
Now, the function can be called to create the preprocessed dataframes.

In [None]:
df_train, df_test = load_data()

Here's a look at the newly processed training data.

In [None]:
df_train

## Establish Baseline ##
Below is a function that scores a set of features. This lets us establish a baseline score and judge whether newly created features are effective. It uses 5-fold cross-validation and scores using accuracy.

In [None]:
def score(X, y, model=None):
    #Work on a copy so the caller's dataframe is left untouched
    X = X.copy()
    #Create the default model on each call instead of sharing one instance as a default argument
    if model is None:
        model = XGBClassifier()
    
    #Label encode categorical features with many unique values
    label_cols = [col for col in X.columns if X[col].dtype == "category" and X[col].nunique() > 50]
    for col in label_cols:
        X[col], _ = X[col].factorize()
        
    #One hot encode other categorical features
    one_hot_cols = [col for col in X.columns if X[col].dtype == "category" and col not in label_cols]
    X = pd.get_dummies(X, columns=one_hot_cols, dtype="int")
   
    #Mean accuracy over the 5 cross-validation folds
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    return scores.mean()

In [None]:
X = df_train.copy()
y = X.pop("Transported")

score(X, y)
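
The two encodings used inside `score()` behave quite differently, which is why the choice depends on the number of unique values. The sketch below shows both on a made-up frame (`toy` and its values are hypothetical): `factorize()` maps each distinct value to an integer code in a single column, while `get_dummies()` adds one 0/1 column per category.

```python
import pandas as pd

# Made-up frame with one low-cardinality and one high-cardinality categorical column
toy = pd.DataFrame({
    "HomePlanet": pd.Categorical(["Earth", "Mars", "Earth", "Europa"]),
    "Name": pd.Categorical(["A B", "C D", "E F", "G H"]),
})

# Label encoding: one integer code per distinct value, still a single column
codes, uniques = toy["Name"].factorize()
toy["Name"] = codes

# One hot encoding: one indicator column per category
toy = pd.get_dummies(toy, columns=["HomePlanet"], dtype="int")

print(toy)
```

One hot encoding a column like `Name` would add a column for nearly every passenger, so label encoding keeps the frame at a manageable width.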

# Make Mutual Information Scores #
Now, a function to compute mutual information scores will be made. This will help identify the features with the most potential for feature engineering. It will also help determine if newly created features are useful.

In [None]:
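#Returns series with mi scores of each feature
def get_MI_scores(X, y):
    #The target is a class label, so the classification estimator is used instead of the regression one
    from sklearn.feature_selection import mutual_info_classif
    
    #Label encode categorical features
    for col in X.select_dtypes(["category"]):
        X[col], _ = X[col].factorize()

    #Cast the categorical True/False target to 0/1 before scoring
    mi_scores = mutual_info_classif(X, y.astype(int), random_state=0)
    mi_scores = pd.Series(mi_scores, index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores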

In [None]:
X = df_train.copy()
y = X.pop("Transported")

mi_scores = get_MI_scores(X, y)
mi_scores
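
For intuition about what a mutual information score measures, the quantity for two discrete variables is $I(X;Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$. The sketch below computes it directly from a contingency table on deliberately contrived data (the helper `discrete_mutual_info` and the `toy` frame are hypothetical, not part of this notebook's pipeline). scikit-learn's estimators use a nearest-neighbour method for continuous features, so their numbers will differ from a hand calculation like this one.

```python
import numpy as np
import pandas as pd

def discrete_mutual_info(x, y):
    """Mutual information (in nats) between two discrete series."""
    # Joint distribution p(x, y) as a table of relative frequencies
    joint = pd.crosstab(x, y, normalize=True).to_numpy()
    # Marginals p(x) and p(y), kept 2-D so their product matches the joint table
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    # Skip empty cells, which contribute nothing to the sum
    nonzero = joint > 0
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float((joint[nonzero] * np.log(ratio)).sum())

# Contrived data: CryoSleep matches the target exactly, VIP is independent of it
toy = pd.DataFrame({
    "CryoSleep":   [1, 1, 1, 1, 0, 0, 0, 0],
    "VIP":         [0, 1, 0, 1, 0, 1, 0, 1],
    "Transported": [1, 1, 1, 1, 0, 0, 0, 0],
})

mi_cryo = discrete_mutual_info(toy["CryoSleep"], toy["Transported"])  # log(2), about 0.693
mi_vip = discrete_mutual_info(toy["VIP"], toy["Transported"])         # 0.0

print(mi_cryo, mi_vip)
```

A score of zero means that knowing the feature says nothing about the target on its own, which is the criterion used for dropping columns in the next step.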

`CryoSleep` and the spending features seem to be the most informative features. On the other hand, many features, especially the indicator features, seem to be uninformative. This isn't a definitive ranking, however, since there may be interactions between features that prove useful. For now, the uninformative features will be dropped to reduce overfitting.

In [None]:
#Keep only columns with mi_score > 0, which drops every column whose mi_score is 0
informative = mi_scores > 0
#Keep Target column
informative["Transported"] = True

df_train = df_train.loc[:, informative]
df_test = df_test.loc[:, informative]

Our scoring function rates our model a little worse now, but that's fine. The effects of overfitting may only show up on the testing set.

In [None]:
X = df_train.copy()
y = X.pop("Transported")

score(X, y)

# Feature Engineering #
Here's a look at our training data so far again:

In [None]:
df_train.head(15)

Based on our data, some features look like they could have more information extracted from them. Here are the ideas that I was able to spot:

* The spending features can be totaled to create a `TotalSpending` feature
* `Cabin` can be split into three more features. Taken from the competition description: "Takes the form `deck/num/side`, where `side` can be either `P` for Port or `S` for Starboard."
* `PassengerId` has two parts separated by an underscore. Taken from the competition description: "Each Id takes the form `gggg_pp` where `gggg` indicates a group the passenger is travelling with and `pp` is their number within the group." From this, we can create a feature that represents how many people a passenger is traveling with.


## Spending Features ##
A `TotalSpending` feature will be created by adding all spending features together.

In [None]:
spending_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

df_train["TotalSpending"] = df_train[spending_cols].sum(axis=1)
df_test["TotalSpending"] = df_test[spending_cols].sum(axis=1)

df_train

Based on the `TotalSpending` feature, passengers in cryosleep didn't spend any money at all. This makes sense since they were asleep for the whole voyage.

In [None]:
sns.boxplot(df_train, x="CryoSleep", y="TotalSpending")

## Cabin Feature ##

`Cabin` is split into three new features: `Deck`, `Num`, and `Side`. `Cabin` itself is dropped since the three columns will probably represent a passenger's cabin better. `Cabin`'s cardinality is too high, so it would behave similarly to `PassengerId`.

In [None]:
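#Replace "None" values so they can be split.
#Assigning the result back avoids relying on an in-place replace of a column selection
df_train["Cabin"] = df_train["Cabin"].astype(str).replace("None", "None/0/None")
df_test["Cabin"] = df_test["Cabin"].astype(str).replace("None", "None/0/None")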

#expand=True expands into different columns
df_train[["Deck", "Num", "Side"]] = df_train["Cabin"].str.split("/", expand=True).astype("category")
df_test[["Deck", "Num", "Side"]] = df_test["Cabin"].str.split("/", expand=True).astype("category")

#Drop Cabin feature
df_train.drop("Cabin", axis=1, inplace=True)
df_test.drop("Cabin", axis=1, inplace=True)

#Set Num column as int
df_train["Num"] = df_train["Num"].astype("int")
df_test["Num"] = df_test["Num"].astype("int")

## Group Size ##
`PassengerId` will be used to create a `GroupSize` feature which indicates the size of a passenger's group. To recap, a passenger's ID has the form `gggg_pp`, and the number of passengers sharing the same `gggg` is the size of that party.

In [None]:
#Create a function so that steps can be repeated on testing and training set
def create_groupSize(df):
    
    groupSize_df = df.reset_index()
    groupSize_df = groupSize_df[["PassengerId"]]
    #Split PassengerID to get first 4 digits
    groupSize_df[["Group", "Number"]] = groupSize_df["PassengerId"].str.split("_", expand=True)
    #Groups entries by Group number and returns a count of each group
    groupSize_df['GroupSize'] = groupSize_df.groupby('Group')['Group'].transform('count')
    
    #Set index again and append GroupSize back to df
    groupSize_df.set_index(groupSize_df["PassengerId"], inplace=True)
    df["GroupSize"] = groupSize_df["GroupSize"]
    return df

create_groupSize(df_train)
create_groupSize(df_test)

df_train.head()
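
The grouping logic is easier to follow on a handful of IDs. The sketch below applies the same idea to a made-up frame (`toy`, its IDs, and its `Age` values are hypothetical): the part of the index before the underscore is extracted, and each row receives the count of rows sharing that prefix.

```python
import pandas as pd

# Made-up passengers: two travelling alone and one group of two
toy = pd.DataFrame(
    {"Age": [39.0, 24.0, 58.0, 33.0]},
    index=pd.Index(["0001_01", "0002_01", "0003_01", "0003_02"], name="PassengerId"),
)

# Group identifier: the part of PassengerId before the underscore
group = toy.index.to_series().str.split("_").str[0].rename("Group")

# Number of passengers sharing each group identifier, aligned back to every row
toy["GroupSize"] = group.groupby(group).transform("count")

print(toy)
```

Here the two solo travellers get a `GroupSize` of 1 and both members of group `0003` get 2.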

## Reevaluation ##

This section will reevaluate the data's performance and mutual information scores.

In [None]:
X = df_train.copy()
y = X.pop("Transported")

score(X, y)

The dataset's evaluation score is much higher now with the new features!

In [None]:
mi_scores = get_MI_scores(X, y)
mi_scores

Based on our new mutual information scores, `TotalSpending` seems to be the most important feature that we've created. The features derived from `Cabin` and `PassengerId` don't seem to have been as informative.

Also, even though the mutual information score of `Name` is nonzero, the feature shouldn't offer any actual useful information. It will be dropped to reduce overfitting.

In [None]:
df_train.drop("Name", axis=1, inplace=True)
df_test.drop("Name", axis=1, inplace=True)

# Train Model and Make Submission #

Using Optuna and the `score()` function from earlier, 100 trials are run to find the best hyperparameters for an `XGBClassifier`.

In [None]:
def objective(trial):
    xgb_params = dict(
        max_depth=trial.suggest_int("max_depth", 2, 10),
        max_leaves=trial.suggest_int("max_leaves", 10, 50),
        learning_rate=trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True),
        n_estimators=trial.suggest_int("n_estimators", 1000, 8000),
        min_child_weight=trial.suggest_int("min_child_weight", 1, 10),
        colsample_bytree=trial.suggest_float("colsample_bytree", 0.2, 1.0),
        subsample=trial.suggest_float("subsample", 0.2, 1.0),
        reg_alpha=trial.suggest_float("reg_alpha", 1e-4, 1e2, log=True),
        reg_lambda=trial.suggest_float("reg_lambda", 1e-4, 1e2, log=True),
    )
    model = XGBClassifier(**xgb_params)
    return score(X_train, y_train, model)

X_train = df_train.copy()
y_train = X_train.pop("Transported")

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

xgb_params = study.best_params

model = XGBClassifier(**xgb_params)
score(X_train, y_train, model)

Here is the best trial run so far: Trial 66 finished with value: 0.7958154859312953 and parameters: {'max_depth': 3, 'max_leaves': 31, 'learning_rate': 0.0037495100253379537, 'n_estimators': 1901, 'min_child_weight': 8, 'colsample_bytree': 0.35654469830643004, 'subsample': 0.42020401090898746, 'reg_alpha': 0.8693462304401648, 'reg_lambda': 0.00888150124652303}.

Train the tuned model on the data and make predictions.

In [None]:
#Remove filler target column in testing set
df_test.drop("Transported", axis=1, inplace=True)

#Label encode categorical features with many unique values.
#Codes come from the combined train and test values so a label maps to the same code in both sets
label_cols = [col for col in df_test.columns if df_test[col].dtype == "category" and df_test[col].nunique() > 50]
for col in label_cols:
    combined = pd.concat([df_train[col], df_test[col]]).astype(str)
    codes, _ = combined.factorize()
    df_train[col] = codes[:len(df_train)]
    df_test[col] = codes[len(df_train):]

#One hot encode other categorical features
one_hot_cols = [col for col in df_test.columns if df_test[col].dtype == "category" and col not in label_cols]
df_train = pd.get_dummies(df_train, columns=one_hot_cols, dtype="int")
df_test = pd.get_dummies(df_test, columns=one_hot_cols, dtype="int")


X_train = df_train
y_train = X_train.pop("Transported")
#Give the test set exactly the training columns, filling categories unseen in the test set with 0
X_test = df_test.reindex(columns=X_train.columns, fill_value=0)

model.fit(X_train, y_train)
predictions = model.predict(X_test).astype("bool")

output = pd.DataFrame({"PassengerId": X_test.index, "Transported": predictions})
output.to_csv('my_submission.csv', index=False)
print("Submission Saved")


In [None]:
output