# Spaceship Titanic Competition

Authors: Jimmy Bierenbroodspot, Maarten de Jue

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
from matplotlib import pyplot as plt

# Table of Contents

- [Business Understanding](#Business-Understanding)
- [Data Understanding](#Data-Understanding)
    - [Load Dataframes](#Load-Dataframes)
    - [Dataframe Shape](#Dataframe-Shape)
    - [Description Table](#Description-Table)
- [Data Preparation](#Data-Preparation)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Modeling](#Modeling)

# Business Understanding

This competition is about a spaceship that collided with a spacetime anomaly, which transported about half of the passengers to a different dimension.

Our job is to predict which passengers were transported and which were not.

[To table of contents](#Table-of-Contents)

# Data Understanding

## Load Dataframes

Let's start the data understanding phase by loading the datasets. We have two datasets:

- `train_df`, which contains a `Transported` column indicating whether a passenger was transported.
- `test_df`, which does not contain the target column. We are supposed to predict the `Transported` value for every passenger in this dataset and submit the result to the competition.

In [None]:
test_df = pd.read_csv('data/test.csv')
train_df = pd.read_csv('data/train.csv')

## Dataframe Shape

First, let's look at how much data we are working with.

In [None]:
print(f'The train dataframe contains {train_df.shape[0]} rows and {train_df.shape[1]} columns.')
print(f'The test dataframe contains {test_df.shape[0]} rows and {test_df.shape[1]} columns.')

We have roughly two-thirds of the data to train with and one-third to predict. The test dataframe contains one fewer column, which makes sense: it lacks the `Transported` column because that is the column we are supposed to predict.
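
The comparison above can be checked in code rather than by eye. The following is a minimal sketch, not part of the original notebook: `compare_split` is a hypothetical helper, and the toy dataframes are made-up stand-ins for `train_df` and `test_df`.

```python
import pandas as pd


def compare_split(train, test):
    """Return the share of all rows held by train and the columns test lacks."""
    train_share = len(train) / (len(train) + len(test))
    missing = [col for col in train.columns if col not in test.columns]
    return train_share, missing


# Toy stand-ins for train_df and test_df (made-up values, not competition data).
toy_train = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01'],
    'Transported': [False, True],
})
toy_test = pd.DataFrame({'PassengerId': ['0013_01']})

train_share, missing_columns = compare_split(toy_train, toy_test)
print(f'Train share of rows: {train_share:.2f}')
print(f'Columns missing from test: {missing_columns}')
```

Calling `compare_split(train_df, test_df)` on the real dataframes should report the train share of rows and list `Transported` as the only missing column.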

Next, we can look at the `14` columns and what kind of data they contain. We are going to focus on `train_df` for now, since it has the same columns as `test_df` plus the target column.

In [None]:
train_df.head(5)

We can immediately see something unusual about `PassengerId`. IDs are most commonly plain numbers, although this is not strictly necessary. According to [the competition's data page](https://www.kaggle.com/competitions/spaceship-titanic/data), everything left of the underscore is the group the passenger belongs to and everything right of the underscore is the passenger's number within that group. So `0001_01` is passenger number `01` of group `0001`.
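
If the group turns out to be useful later on, the two parts of the ID can be separated with pandas string methods. This is a minimal sketch under assumptions: `split_passenger_id` and the column names `Group` and `GroupNumber` are hypothetical and not part of the dataset, and the toy dataframe only mimics the ID format.

```python
import pandas as pd


def split_passenger_id(df):
    """Return a copy of df with hypothetical Group and GroupNumber columns."""
    out = df.copy()
    # 'gggg_pp' -> column 0 holds 'gggg', column 1 holds 'pp'
    parts = out['PassengerId'].str.split('_', expand=True)
    # Keep both parts as strings so leading zeros are preserved.
    out['Group'] = parts[0]
    out['GroupNumber'] = parts[1]
    return out


# Toy data that only mimics the ID format (made-up values).
toy_df = pd.DataFrame({'PassengerId': ['0001_01', '0003_01', '0003_02']})
split_df = split_passenger_id(toy_df)
print(split_df)
```

The same function could be applied to `train_df` and `test_df`, since both contain the `PassengerId` column.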

[To table of contents](#Table-of-Contents)

## Description Table

Although the previously mentioned competition page already describes every column, we copy the descriptions here for ease of use.

| Column name | Description |
| - | - |
| PassengerId | A unique Id for each passenger. Each Id takes the form `gggg_pp` where `gggg` indicates a group the passenger is traveling with and `pp` is their number within the group. People in a group are often family members, but not always. |
| HomePlanet | The planet the passenger departed from, typically their planet of permanent residence. |
| CryoSleep | Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins. |
| Cabin | The cabin number where the passenger is staying. Takes the form `deck/num/side`, where side can be either `P` for *Port* or `S` for *Starboard*. |
| Destination | The planet the passenger will be debarking to. |
| Age | The age of the passenger. |
| VIP | Whether the passenger has paid for special VIP service during the voyage. |
| RoomService | Amount the passenger has billed for room service, one of the *Spaceship Titanic*'s many luxury amenities. |
| FoodCourt | Amount the passenger has billed at the food court, one of the *Spaceship Titanic*'s many luxury amenities. |
| ShoppingMall | Amount the passenger has billed at the shopping mall, one of the *Spaceship Titanic*'s many luxury amenities. |
| Spa | Amount the passenger has billed at the spa, one of the *Spaceship Titanic*'s many luxury amenities. |
| VRDeck | Amount the passenger has billed at the VR deck, one of the *Spaceship Titanic*'s many luxury amenities. |
| Name | The first and last names of the passenger. |
| Transported | Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. |
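
The `Cabin` column packs three values into one string, so it may be worth separating them before analysis. This is a minimal sketch under assumptions: `split_cabin` and the column names `Deck`, `CabinNum` and `Side` are hypothetical, the toy values only follow the `deck/num/side` form from the table, and the missing entry is included as an assumption to show how such a value would propagate.

```python
import pandas as pd


def split_cabin(df):
    """Return a copy of df with hypothetical Deck, CabinNum and Side columns."""
    out = df.copy()
    # 'deck/num/side' -> three columns; a missing cabin stays missing in all three
    parts = out['Cabin'].str.split('/', expand=True)
    out['Deck'] = parts[0]
    out['CabinNum'] = parts[1]
    out['Side'] = parts[2]
    return out


# Toy data following the deck/num/side form (made-up values).
toy_cabins = pd.DataFrame({'Cabin': ['B/0/P', 'F/1/S', None]})
cabin_df = split_cabin(toy_cabins)
print(cabin_df)
```

`CabinNum` is left as a string here; whether to convert it to a number is a decision for the data preparation step.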

[To table of contents](#Table-of-Contents)

# Data Preparation

[To table of contents](#Table-of-Contents)

# Exploratory Data Analysis

[To table of contents](#Table-of-Contents)

# Modeling

[To table of contents](#Table-of-Contents)