# Spaceship Titanic Competition

Authors: Jimmy Bierenbroodspot, Maarten de Jue

In [None]:
# Data manipulation
import pandas as pd
import numpy  as np

# Visualization
import seaborn as sns
from matplotlib import pyplot as plt

# Table of Contents

- [Business Understanding](#Business-Understanding)
- [Data Understanding](#Data-Understanding)
    - [Load Dataframes](#Load-Dataframes)
    - [Dataframe Shape](#Dataframe-Shape)
    - [Description Table](#Description-Table)
    - [Column Datatypes](#Column-Datatypes)
        - [Datatype Table](#Datatype-Table)
- [Data Preparation](#Data-Preparation)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Modeling](#Modeling)

# Business Understanding

This competition is about a spaceship that collided with a spacetime anomaly, after which about half of the passengers were transported to a different dimension.

Our job is to predict which passengers were transported and which were not.

[To table of contents](#Table-of-Contents)

# Data Understanding

## Load Dataframes

Let's start off the data understanding by actually importing the datasets. We have two datasets:

- `train_df`, which contains a `Transported` column indicating whether a passenger was transported.
- `test_df`, which does not contain the target column. We are supposed to predict the `Transported` column for this dataset and submit the result to the competition.

In [None]:
test_df = pd.read_csv('data/test.csv')
train_df = pd.read_csv('data/train.csv')

## Dataframe Shape

First let's take a look at how much data we are working with. 

In [None]:
print(f'The train dataframe contains {train_df.shape[0]} rows and {train_df.shape[1]} columns.')
print(f'The test dataframe contains {test_df.shape[0]} rows and {test_df.shape[1]} columns.')

The train dataframe holds roughly two-thirds of all rows and the test dataframe the remaining one-third, which we have to predict. The test dataframe has one fewer column, which makes sense: it lacks the `Transported` column that we are supposed to predict.
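The split can also be computed rather than eyeballed. The sketch below uses a hypothetical helper, `row_share`, and two tiny stand-in frames instead of the competition files; passing `train_df` and `test_df` would give the real proportions.

```python
import pandas as pd


def row_share(part: pd.DataFrame, *others: pd.DataFrame) -> float:
    """Return the fraction of all rows that belong to `part` (hypothetical helper)."""
    total = len(part) + sum(len(other) for other in others)
    return len(part) / total


# Toy stand-ins for train_df and test_df (not the competition data)
toy_train = pd.DataFrame({"Age": [30.0, 24.0], "Transported": [True, False]})
toy_test = pd.DataFrame({"Age": [41.0]})

print(f"Train share of all rows: {row_share(toy_train, toy_test):.2f}")

# Columns present in train but missing from test; this should be the target only
missing_in_test = set(toy_train.columns) - set(toy_test.columns)
print(f"Columns only in train: {missing_in_test}")
```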

We can take a look at the `14` columns and what kind of data they contain. We are going to focus on `train_df` for now since it contains the same data as `test_df` but includes the target column as well.

In [None]:
train_df.head(5)

We can immediately see something unusual about `PassengerId`. You would normally expect an ID to be a plain number; that is not strictly necessary, but it is the most common format. According to [this article from the competition](https://www.kaggle.com/competitions/spaceship-titanic/data), everything to the left of the underscore is the group the passenger belongs to and everything to the right is the passenger's number within that group. So `0001_01` would be passenger number `01` of group `0001`.

[To table of contents](#Table-of-Contents)

## Description Table

Although the previously mentioned article already describes every column, we copy the descriptions here for ease of reference.

| Column name | Description |
| - | - |
| PassengerId | A unique Id for each passenger. Each Id takes the form `gggg_pp` where `gggg` indicates a group the passenger is traveling with and `pp` is their number within the group. People in a group are often family members, but not always. |
| HomePlanet | The planet the passenger departed from, typically their planet of permanent residence. |
| CryoSleep | Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins. |
| Cabin | The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either `P` for *Port* or `S` for *Starboard* |
| Destination | The planet the passenger will be debarking to. |
| Age | The age of the passenger. |
| VIP | Whether the passenger has paid for special VIP service during the voyage. |
| RoomService | Amount the passenger has billed at each of the *Spaceship Titanic*'s many luxury amenities. |
| FoodCourt | Amount the passenger has billed at each of the *Spaceship Titanic*'s many luxury amenities. |
| ShoppingMall | Amount the passenger has billed at each of the *Spaceship Titanic*'s many luxury amenities. |
| Spa | Amount the passenger has billed at each of the *Spaceship Titanic*'s many luxury amenities. |
| VRDeck | Amount the passenger has billed at each of the *Spaceship Titanic*'s many luxury amenities. |
| Name | The first and last names of the passenger. |
| Transported | Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. |

## Column Datatypes

We should check whether the datatype of each column is as expected.

In [None]:
train_df.dtypes

At first glance the datatypes seem to make sense, except for `Age`. Unless age is measured with decimals (e.g. thirty-and-a-half as `30.5`), you would expect it to be an integer. We can check whether this is the case by applying Python's `float.is_integer()` to every value in the column. This method returns `True` or `False` depending on whether the float has a zero fractional part. That gives us a `Series` containing only `True`s and `False`s. If its only unique value is `True`, the column contains nothing but whole numbers.

In [None]:
age_is_int = train_df['Age'].apply(float.is_integer)

tuple(age_is_int.unique())

So the `Age` column apparently also contains values that are not whole numbers. We can use a boolean mask to find out what these values are: we take the previously declared `age_is_int` `Series` and invert it with the bitwise NOT operator (`~`), since we only want the rows where it is `False`.

In [None]:
tuple(train_df['Age'][~age_is_int].unique())

Now it becomes clear that there are actually no decimal numbers: the only values causing `False` to show up are `NaN` values.
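The reason is that `float.is_integer()` returns `False` for `NaN`. A minimal sketch on a toy `Series` (a stand-in for `train_df['Age']`, not the real data) shows the effect, along with one possible way to sidestep it by dropping missing values before the check:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df['Age']: whole-number ages plus one missing value
toy_age = pd.Series([30.0, 24.0, np.nan])

# NaN is a float without an integer value, so it is flagged as False
is_whole = toy_age.apply(float.is_integer)
print(is_whole.tolist())  # [True, True, False]

# Dropping missing values first answers the question we actually care about:
# are all *known* ages whole numbers?
all_known_whole = toy_age.dropna().apply(float.is_integer).all()
print(all_known_whole)  # True
```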

The `object` datatype doesn't tell us much. We can apply the `type()` function to every value in the `PassengerId`, `HomePlanet`, `CryoSleep`, `Cabin`, `Destination`, `VIP` and `Name` columns to find out whether the underlying datatypes actually make sense.

In [None]:
object_dtype_columns = ['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'Name']

for column in object_dtype_columns:
    print(f'{column:12} contains the datatypes:', train_df[column].apply(type).unique())

As we can see, most of the columns contain the `float` datatype. This doesn't make sense for columns that also contain `str` values, for example. What we do know is that NumPy `NaN` values are seen as `float`s by Python.

In [None]:
print(f'The datatype of np.nan is: {type(np.nan)}')  # np.NAN alias was removed in NumPy 2.0

If all the `float` values are actually `NaN`s then we could figure that out by looking at the unique values with the `float` datatype for every column.

In [None]:
for column in object_dtype_columns:
    curr_column_series = train_df[column]
    check_if_float = lambda x: type(x) is float  # A function that checks whether the datatype of a value is a float

    float_values = curr_column_series[curr_column_series.map(check_if_float)]  # A series containing only the values with datatype float.

    print(f'The unique float values of {column} are: {float_values.unique()}')
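If the only `float` values in these columns are `NaN`s, then the "is a float" mask should coincide with pandas' own missing-value mask. The sketch below checks this on a toy object column (a stand-in for a column such as `HomePlanet`, not the real data); `Series.isna()` is offered as an alternative way to reach the same conclusion, not as the approach used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an object column with one missing value
toy_planet = pd.Series(["Earth", np.nan, "Mars"], dtype=object)

# Mask of values whose Python type is float
is_float = toy_planet.map(lambda value: isinstance(value, float))

# Mask of values pandas considers missing
is_missing = toy_planet.isna()

# If every float in the column is a NaN, both masks are identical
print(is_float.equals(is_missing))
```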

### Datatype Table 

We can now put the correct datatypes neatly in a table.

| Column name | Datatype |
| - | - |
| PassengerId | `str` |
| HomePlanet | `str` |
| CryoSleep | `bool` |
| Cabin | `str` |
| Destination | `str` |
| Age | `int` |
| VIP | `bool` |
| RoomService | `float` |
| FoodCourt | `float` |
| ShoppingMall | `float` |
| Spa | `float` |
| VRDeck | `float` |
| Name | `str` |
| Transported | `bool` |
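One caveat: plain `int` and `bool` columns cannot hold `NaN`, so the columns with missing values cannot be cast to these types directly. A possible workaround, sketched here on a toy frame and assuming a pandas version with nullable extension dtypes (1.0 or later), is to cast to `Int64`, `boolean` and `string`, which represent missing values as `<NA>`. This is only an illustration and not necessarily the conversion used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in with the same kinds of columns and missing values as train_df
toy_df = pd.DataFrame({
    "Age": [30.0, np.nan, 24.0],
    "VIP": pd.Series([False, np.nan, True], dtype=object),
    "HomePlanet": pd.Series(["Earth", np.nan, "Mars"], dtype=object),
})

# Nullable extension dtypes keep missing values as <NA> instead of forcing float/object
converted = toy_df.astype({
    "Age": "Int64",
    "VIP": "boolean",
    "HomePlanet": "string",
})

print(converted.dtypes)
print(converted)
```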

[To table of contents](#Table-of-Contents)

# Data Preparation

## PassengerId

The `PassengerId` column contains two pieces of information: the group ID and the ID within the group. We'll separate these into their own columns. Let's have a look at a sample.

In [None]:
train_df[["PassengerId"]].sample(2, random_state=0)

The Group ID and ID within group are separated by an underscore. We can easily split them.

In [None]:
train_df[["GroupId", "IdWithinGroup"]] = train_df["PassengerId"].str.split("_", n=1, expand=True)  # n=1: split once, always two parts
train_df[["GroupId", "IdWithinGroup"]].sample(2, random_state=0)

It appears as though we split the strings successfully. Let's verify the datatypes of the result.

In [None]:
train_df[["GroupId", "IdWithinGroup"]].dtypes

The IDs are represented by whole numbers. We probably won't be using math on these IDs, but we might as well store them as integers.

This way we should also find out if any values are not formatted as whole numbers.

In [None]:
# .astype() raises errors if not formatted properly
train_df[["GroupId", "IdWithinGroup"]] = train_df[["GroupId", "IdWithinGroup"]].astype(int)
train_df[["GroupId", "IdWithinGroup"]].dtypes
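To illustrate why the conversion doubles as a format check, here is a small sketch on toy strings (not the real IDs): zero-padded digits convert cleanly, while a malformed value makes `.astype(int)` raise a `ValueError`.

```python
import pandas as pd

# Well-formed, zero-padded IDs convert to plain integers
well_formed = pd.Series(["0001", "0002"], dtype=object)
converted = well_formed.astype(int).tolist()
print(converted)  # [1, 2]

# A malformed ID (hypothetical example value) makes the conversion fail loudly
malformed = pd.Series(["0001", "00x2"], dtype=object)
conversion_failed = False
try:
    malformed.astype(int)
except ValueError as error:
    conversion_failed = True
    print(f"Conversion failed: {error}")
```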

Wonderful. We can drop the original column, as we probably won't need it anymore.

In [None]:
train_df = train_df.drop(columns="PassengerId")

## Groups

Since it might be interesting to involve information about passengers' groups in the data analysis, we can aggregate some information about them into a new DataFrame.

We are interested in aggregating the following data:

- Number of people in the group.

In [None]:
group_df = train_df.groupby("GroupId").agg({
    "IdWithinGroup": len  # Amount of people in group. Random column, doesn't matter for result.
    # Add more functions to aggregate on here if so desired
})

group_df = group_df.rename(
    columns={
        "IdWithinGroup": "PeopleAmount"
    }
)

group_df.sample(5, random_state=0)

Looks good.
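If the group size is later needed per passenger rather than per group, it can be mapped back onto the passenger rows. The sketch below uses a toy frame and a hypothetical column name, `GroupSize`; `groupby(...).size()` is shown as an alternative to aggregating with `len`, and should yield the same counts.

```python
import pandas as pd

# Toy stand-in for train_df after splitting PassengerId
toy_df = pd.DataFrame({
    "GroupId": [1, 2, 2, 3, 3, 3],
    "IdWithinGroup": [1, 1, 2, 1, 2, 3],
})

# Number of passengers per group, indexed by GroupId
group_sizes = toy_df.groupby("GroupId").size().rename("PeopleAmount")
print(group_sizes)

# Look up each passenger's group size via their GroupId
toy_df["GroupSize"] = toy_df["GroupId"].map(group_sizes)
print(toy_df)
```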

[To table of contents](#Table-of-Contents)

# Exploratory Data Analysis

[To table of contents](#Table-of-Contents)

# Modeling

[To table of contents](#Table-of-Contents)