# ⚡️ Lightning Flash - Spaceship Titanic 🚀

## Goal of this Notebook
I have used Lightning for several projects and have really enjoyed it. I haven't yet had the chance to use Flash, and I am interested in trying its out-of-the-box tools for training a deep learning model. There has also been a lot of debate on Twitter about using deep learning for tabular data. Personally, I have only built deep learning models for image or text-based tasks, not for tabular data (all hail XGBoost!). However, I am intrigued by Flash's tabular data backbones!

## Why Flash?
To get started with Deep Learning.

### Easy to learn
If you are just getting started with deep learning, Flash offers common deep learning tasks you can use out-of-the-box in a few lines of code, no math, fancy nn.Modules or research experience required!

### Easy to scale
Flash is built on top of PyTorch Lightning, a powerful deep learning research framework for training models at scale. With the power of Lightning, you can train your flash tasks on any hardware: CPUs, GPUs or TPUs without any code changes.

### Easy to upskill
If you want to create more complex and customized models, you can refactor any part of flash with PyTorch or PyTorch Lightning components to get all the flexibility you need. Lightning is just organized PyTorch with the unnecessary engineering details abstracted away.

- Flash (high-level)
- Lightning (mid-level)
- PyTorch (low-level)

When you need more flexibility you can build your own tasks or simply use Lightning directly.

## More Information
- Check out Flash on GitHub: https://github.com/PyTorchLightning/lightning-flash
- See the Flash docs for more information: https://lightning-flash.readthedocs.io/en/latest/
- Join the community on Slack: https://www.pytorchlightning.ai/community
- A great Flash tutorial that I followed to structure this notebook: https://lightning-flash.readthedocs.io/en/latest/notebooks/flash_tutorials/electricity_forecasting.html

# References
- I would like to thank Jirka Borovec for their notebook! Check it out: https://www.kaggle.com/code/jirkaborovec/starter-flash-spaceship-titanic
- A great starter notebook: https://www.kaggle.com/code/odins0n/spaceship-titanic-eda-27-different-models

# Competition
The competition is organised by Kaggle and is part of the Getting Started prediction competition series.

In this competition, you are supposed to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Submissions are evaluated on Classification Accuracy.
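
Classification accuracy is simply the fraction of passengers whose predicted label matches the true one. Here is a minimal sketch in plain Python with made-up labels. The `accuracy` helper is hypothetical and is not Kaggle's scoring code.

```python
from typing import Sequence


def accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, invented for illustration only.
y_true = [True, False, True, True]
y_pred = [True, False, False, True]
score = accuracy(y_true, y_pred)
print(score)  # 3 of 4 correct -> 0.75
```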

# Install Packages

I found that the latest release, `lightning-flash==0.7.5`, had several unexpected behaviours that have since been changed in the pre-release version on GitHub (`0.8.0dev`), so this notebook installs Flash from source.

In [None]:
! pip uninstall -y torchtext
! pip install 'git+https://github.com/PyTorchLightning/lightning-flash.git#egg=lightning-flash[tabular]'
! pip install -q "omegaconf==2.1.*" "matplotlib==3.1.1" "pandas==1.3.5" --force-reinstall
! pip list | grep -e lightning -e torch -e tab

# Import Libraries

Use the function `seed_everything` from `pytorch_lightning` for reproducible results.
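
To see why seeding matters, here is a small sketch using only Python's built-in `random` module. `seed_everything` applies the same idea to the Python, NumPy and PyTorch generators. The `draw` helper is hypothetical and exists only for this illustration.

```python
import random
from typing import List


def draw(seed: int, n: int = 3) -> List[float]:
    """Draw n pseudo-random numbers from a generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


first_run = draw(seed=42)
second_run = draw(seed=42)
other_seed = draw(seed=7)

# Same seed gives the same sequence; a different seed gives a different one.
print(first_run == second_run)
print(first_run == other_seed)
```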

In [None]:
import itertools
from pprint import pprint

import flash
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pytorch_lightning as pl
import seaborn as sns
import torch
from flash import tabular

pl.seed_everything(42)

# Functions

In [None]:
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Creating additional features.
    
    Args:
        df: spaceship-titanic dataframe.
        
    Returns:
        pd.DataFrame
    """
    df = _split_cabin_number(df)
    df = _bin_age(df)
    df = _money_spent(df)
    return df

def _split_cabin_number(df: pd.DataFrame) -> pd.DataFrame:
    # Cabin takes the form deck/num/side, so the new columns are named in that order.
    df[["CabinDeck", "CabinNumber", "CabinSide"]] = [
        c.split("/") if isinstance(c, str) else [c] * 3 
        for c in df["Cabin"]
    ]
    return df


def _bin_age(df: pd.DataFrame) -> pd.DataFrame:
    # Note: comparisons with NaN are False, so missing ages fall into "Between 20 and 38".
    df["AgeCategorized"] = np.where(
        df["Age"] < 20,
        "Below 20",
        np.where(
            df["Age"] > 38,
            "Above 38",
            "Between 20 and 38",
        ),
    )
    return df

def _money_spent(df: pd.DataFrame) -> pd.DataFrame:
    df["MoneySpent"] = df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
    return df

def plot_metrics() -> None:
    """Plot the training metrics logged to CSV by the global `trainer`."""
    sns.set()
    metrics = pd.read_csv(f'{trainer.logger.log_dir}/metrics.csv')
    metrics.set_index("step", inplace=True)
    del metrics["epoch"]
    sns.relplot(data=metrics, kind="line")
    plt.gca().set_ylim([0, 1.25])
    plt.gcf().set_size_inches(10, 5)

# Loading Data

- `PassengerId` - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.
- `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- `Cabin` - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- `Destination` - The planet the passenger will be debarking to.
- `Age` - The age of the passenger.
- `VIP` - Whether the passenger has paid for special VIP service during the voyage.
- `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- `Name` - The first and last names of the passenger.
- `Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
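
The `gggg_pp` and `deck/num/side` formats described above can be unpacked with simple string splits. This is a minimal sketch in plain Python. The helper names and the example values are hypothetical and only follow the documented formats.

```python
from typing import NamedTuple, Optional, Tuple


class Cabin(NamedTuple):
    deck: str
    num: int
    side: str


def parse_passenger_id(passenger_id: str) -> Tuple[str, str]:
    """Split 'gggg_pp' into (group, number within the group)."""
    group, number = passenger_id.split("_")
    return group, number


def parse_cabin(cabin: Optional[str]) -> Optional[Cabin]:
    """Split 'deck/num/side'; a missing cabin stays missing."""
    if cabin is None:
        return None
    deck, num, side = cabin.split("/")
    return Cabin(deck, int(num), side)


# Illustrative values in the documented formats.
print(parse_passenger_id("0003_02"))  # ('0003', '02')
print(parse_cabin("F/12/S"))          # Cabin(deck='F', num=12, side='S')
print(parse_cabin(None))              # None
```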

In [None]:
df = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")

df.head()

# Process Training Data

Before loading the data into Flash, I pre-process it to create some extra features. For modelling, the target variable also needs to be label encoded, so the boolean `Transported` column is converted to integers.

In [None]:
df = process_data(df)
df["Transported"] = df["Transported"].apply(int)

df.head()

# Create Flash DataModule

Create a `TabularClassificationData` to split the data into training and validation samples for modelling. We need to specify our numerical, categorical, and target variables. 
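
To make the idea of a validation split concrete, here is a rough sketch of holding out 5% of the rows. It is only an illustration of the concept. Flash's own splitting logic may differ, and `split_indices` is a hypothetical helper.

```python
import random
from typing import List, Tuple


def split_indices(n_rows: int, val_split: float, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and hold out a fraction of them for validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(n_rows * val_split)
    # The first n_val shuffled indices become validation rows, the rest training rows.
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(n_rows=1000, val_split=0.05)
print(len(train_idx), len(val_idx))  # 950 50
```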

For this version of Flash, **numerical missing** values are imputed using the **median** and **categorical missing** values are labelled as **0**.
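
The two imputation rules described above can be sketched in plain Python as follows. This is not Flash's implementation, just an illustration on invented values. Both helpers are hypothetical, and the way non-missing categories are numbered here is an assumption.

```python
from statistics import median
from typing import List, Optional


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing numbers with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def encode_categories(values: List[Optional[str]]) -> List[int]:
    """Map categories to integer codes, reserving 0 for missing values."""
    observed = sorted({v for v in values if v is not None})
    codes = {category: i for i, category in enumerate(observed, start=1)}
    return [0 if v is None else codes[v] for v in values]


spend = [0.0, None, 120.0, 30.0]
planet = ["Earth", None, "Mars", "Earth"]
print(impute_median(spend))       # [0.0, 30.0, 120.0, 30.0]
print(encode_categories(planet))  # [1, 0, 2, 1]
```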

Set the `pin_memory` argument in DataLoader to `True` when working with GPUs. This allocates the data into page-locked memory, which speeds up data transfer to the GPU.

Once initialised, we can see the parameters of `TabularClassificationData` module:

In [None]:
train_datamodule = tabular.TabularClassificationData.from_data_frame(
    categorical_fields=["HomePlanet", "CryoSleep", "Destination", "VIP", "CabinNumber", "CabinDeck", "CabinSide", "AgeCategorized"],
    numerical_fields=["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "MoneySpent"],
    target_fields="Transported",
    train_data_frame=df,
    val_split=0.05,
    batch_size=128,
    pin_memory=True,
)

pprint(train_datamodule.parameters)

# Creating the Flash Task

We can use the `TabularClassifier` to create a tabular classification task. Let's try out the `tabtransformer` backbone for tabular tasks.

In [None]:
model = tabular.TabularClassifier.from_data(
    train_datamodule, 
    backbone="tabtransformer",
    optimizer="adamax",
    out_ff_activation="SiLU",
    num_attn_blocks=14,
    attn_dropout = 0.2,
    ff_dropout = 0.2,
)

# The Flash Trainer

Let's initialise the Flash Trainer. We apply gradient clipping (a common technique for tabular tasks) with `gradient_clip_val=0.01`, which caps the size of each gradient update and helps keep training stable. Let's also use the `EarlyStopping` callback to monitor our validation loss and stop training once it stops improving, which helps prevent over-fitting.
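
As a rough illustration of what clipping by norm does, the sketch below rescales a gradient vector whenever its length exceeds a threshold. It works on a plain Python list rather than tensors. `clip_by_norm` is a hypothetical helper, not the Lightning or PyTorch routine.

```python
import math
from typing import List


def clip_by_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale grads so their L2 norm is at most max_norm, keeping the direction."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


grads = [3.0, 4.0]  # L2 norm is 5.0
clipped = clip_by_norm(grads, max_norm=0.01)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(clipped, clipped_norm)
```

With a threshold as small as `0.01`, almost every update is scaled down, so the threshold acts much like a cap on the step size.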

With `precision=16`, Lightning uses half-precision whenever possible while retaining single-precision elsewhere. With minimal code modifications, this can give a 1.5x to 2x speed-up in training time on supported GPUs.
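
To get a feel for the trade-off, the sketch below rounds a number to half and single precision using only the standard library's `struct` module. It shows the loss of precision that mixed-precision training has to manage. It does not show how Lightning implements it, and the helper names are hypothetical.

```python
import struct


def to_half(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (16 bits)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x: float) -> float:
    """Round-trip a Python float through IEEE 754 single precision (32 bits)."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1
half_error = abs(to_half(value) - value)
single_error = abs(to_single(value) - value)
print(to_half(value))    # 0.0999755859375
print(to_single(value))  # 0.10000000149011612
print(half_error > single_error)  # True
```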

We can also log all of our training metrics to a `.csv` file so we can visualise them later.

In [None]:
early_stopping = pl.callbacks.EarlyStopping(monitor="valid_loss", patience=10, mode="min")

trainer = flash.Trainer(
    max_epochs=100, 
    gpus=torch.cuda.device_count(), 
    precision=16,
    gradient_clip_val=0.01,
    callbacks=[early_stopping],
    logger=pl.loggers.CSVLogger(save_dir='logs/')
)
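
The patience logic behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration on an invented loss curve. `stopping_epoch` is a hypothetical helper, and the real callback has more options.

```python
from typing import List, Optional


def stopping_epoch(val_losses: List[float], patience: int) -> Optional[int]:
    """Return the epoch index at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


toy_losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58]
print(stopping_epoch(toy_losses, patience=3))   # 5
print(stopping_epoch(toy_losses, patience=10))  # None
```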

# Automatically Find the Learning Rate

Tabular models can be particularly sensitive to the choice of learning rate. Helpfully, Lightning provides a built-in learning rate finder that suggests a suitable learning rate automatically. Here’s how to find the learning rate:
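
The general idea behind a learning rate finder is to try learning rates that grow exponentially and look for where the loss falls fastest. The sketch below shows that heuristic on an invented loss curve. It is a simplification, not Lightning's implementation, and both helper functions are hypothetical.

```python
from typing import List


def exponential_lrs(min_lr: float, max_lr: float, num_steps: int) -> List[float]:
    """Learning rates spaced evenly on a log scale from min_lr to max_lr."""
    ratio = max_lr / min_lr
    return [min_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


def suggest_lr(lrs: List[float], losses: List[float]) -> float:
    """Return the learning rate at the start of the steepest loss decrease."""
    drops = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    steepest = min(range(len(drops)), key=lambda i: drops[i])
    return lrs[steepest]


lrs = exponential_lrs(min_lr=1e-5, max_lr=1.0, num_steps=6)
# Invented losses: slow progress, then a sharp drop, then divergence.
toy_losses = [0.70, 0.69, 0.60, 0.40, 0.45, 2.00]
suggested = suggest_lr(lrs, toy_losses)
print(f"{suggested:.0e}")  # 1e-03
```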

In [None]:
res = trainer.tuner.lr_find(model, datamodule=train_datamodule, min_lr=1e-5)
print(f"Suggested learning rate: {res.suggestion()}")
res.plot(show=True, suggest=True).show()

Once the suggested learning rate has been found, we can update our model with it:

In [None]:
model.learning_rate = res.suggestion()

# Training the Model

Finally, let's train the model!

In [None]:
trainer.fit(model, datamodule=train_datamodule)

Awesome! Let's visualise the training metrics:

In [None]:
plot_metrics()

# Prediction

We use the parameters from the `train_datamodule` so that `df_test` can be transformed correctly for inference.

In [None]:
df_test = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
df_test = process_data(df_test)

predict_datamodule = tabular.TabularClassificationData.from_data_frame(
    predict_data_frame=df_test,
    parameters=train_datamodule.parameters,
    batch_size=8
)
    
predictions = trainer.predict(model, datamodule=predict_datamodule, output="classes")
predictions = list(itertools.chain(*predictions))
df_test["Transported"] = [str(bool(p)) for p in predictions]
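
The post-processing above flattens the per-batch predictions into one list and converts each class index into the `True`/`False` strings written to the submission file. Here is a small standalone sketch of those two steps. The batched values are made up.

```python
import itertools

# Made-up predictions: one inner list of class indices per batch.
batched_predictions = [[1, 0], [0, 1], [1]]

# Flatten the batches into a single list, preserving order.
flat_predictions = list(itertools.chain.from_iterable(batched_predictions))

# Convert class indices 0/1 to "False"/"True" strings.
labels = [str(bool(p)) for p in flat_predictions]
print(labels)  # ['True', 'False', 'False', 'True', 'True']
```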

# Submit

In [None]:
df_test.set_index("PassengerId")[["Transported"]].to_csv("submission.csv")

! head submission.csv

# Thanks for reading!

If you made it this far, thank you! As you've seen, there is a lot you can do with Flash. The performance of neural networks depends greatly on the hyperparameters used. Next, I will investigate how packages like Optuna integrate with the Lightning ecosystem for hyperparameter tuning.