## Original Description
> Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.
>
> The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.
>
> While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!
>
> To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.
>
> Help save them and change history!

### Features
- `PassengerId` - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.
- `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- `Cabin` - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- `Destination` - The planet the passenger will be debarking to.
- `Age` - The age of the passenger.
- `VIP` - Whether the passenger has paid for special VIP service during the voyage.
- `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- `Name` - The first and last names of the passenger.

### Labels
- `Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

In [37]:
# !pip3 install -r requirements.txt

In [38]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from lazypredict.Supervised import LazyClassifier

warnings.filterwarnings('ignore')

In [39]:
%matplotlib inline

In [40]:
redo_graphs = 0

In [41]:
try:
    df = pd.read_csv('data_FAA/train.csv')
    # df_test = pd.read_csv('../input/spaceship-titanic/test.csv')
except FileNotFoundError:
    df = pd.read_csv('data_FAA/train.csv')
    # df_test = pd.read_csv('/spaceship-titanic/test.csv')
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [46]:
def preprocess(df, preTrain = True, rem_columns=True):
    # drop the entire cols that have low to now interest for now
    if preTrain:
        df = df.dropna()

    df['PassengerId_split'] = df['PassengerId'].apply(lambda x : str(x).split(sep = '_', maxsplit=1))
    df['Group']= df['PassengerId_split'].apply(lambda x : np.nan if x[0] == 'nan' else x[0])
    df['Group_id']= df['PassengerId_split'].apply(lambda x : np.nan if x[0] == 'nan' else x[1])
    df = df.drop(['PassengerId','PassengerId_split'],axis=1)

    df['Name_split'] = df['Name'].apply(lambda x : str(x).split(sep = ' ', maxsplit=1))
    df['FirstName']= df['Name_split'].apply(lambda x : np.nan if x[0] == 'nan' else x[0])
    df['Surname']= df['Name_split'].apply(lambda x : np.nan if x[0] == 'nan' else x[1])
    df = df.drop(['Name','Name_split'],axis=1)

    # Split the cabins
    df['Cabin_splt'] = df['Cabin'].apply(lambda x : str(x).split(sep = '/'))

    df['Deck']= df['Cabin_splt'].apply(lambda x : np.nan if x[0] == 'nan' else x[0])
    df['CabinNumber']= df['Cabin_splt'].apply(lambda x : np.nan if x[0] == 'nan' else x[1]).astype('float')
    df['CabinSide']= df['Cabin_splt'].apply(lambda x : np.nan if x[0] == 'nan' else x[2])

    df = df.drop(['Cabin','Cabin_splt'],axis=1)

    numeric_cols = list(df.select_dtypes(include = np.number).columns)
    category_cols = list(df.select_dtypes(include = ['object']).columns)
    # processing categorical null value
    for col in category_cols:
        df[col] = df[col].fillna(df[col].mode()[0])

    # processing numerical null value
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].mode()[0])

    df_n=df[numeric_cols]
    df_c=df[category_cols]

    df_n['Total'] = sum(df[col] for col in numeric_cols if col not in ["Age", "CabinNumber"])

    # Categorize the Total Spending 0.43
    df_n['cat_Total'] = ''
    list_spend = [0, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000]
    for k in range(len(list_spend)-1):
        df_n.loc[df_n['Total'].between(list_spend[k], list_spend[k+1], 'right'), 'cat_Total'] = f'Under{list_spend[k+1]}'
    df_n.loc[df_n['Total'].between(list_spend[-1], 9999999, 'both'), 'cat_Total'] = 'rest'

    df_n['cat_Total'] = df_n['cat_Total'].astype('category').cat.codes.astype("int") 

    # Dividing in intervals of 10 gave a correlation of -0.09 to Transported, diving into 4 categories gave a -0.12
    df_n['cat_Age'] = ''
    for k in range(20):
        df_n.loc[df_n['Age'].between(5*k, 5*(k+1), 'both'), 'cat_Age'] = f'Under{k*5}'
    df_n['cat_Age'] = df_n['cat_Age'].astype('category').cat.codes.astype("int")   

    if rem_columns:
        df_n = df_n.drop(['Age', "Total"],axis=1)
    df_n["Transported"] = df["Transported"]

    numeric_cols = list(df.select_dtypes(include = np.number).columns)
    category_cols = list(df.select_dtypes(include = ['object']).columns)
    bool_cols  = list(df.select_dtypes(include = bool).columns)

    for col in bool_cols:
        if col != "Transported":
            df_n[f"boo_{col}"] = df[col]

    # label encode
    enc = LabelEncoder()
    for col in category_cols:
        enc.fit(df_c[col])
        df_c[col] = enc.transform(df_c[col])
        df_n[f"cat_{col}"] = df_c[col]
    category_cols.extend(("cat_Age", "cat_Total"))
    print(f"Boolean columns ({len(bool_cols)}) :", ", ".join(bool_cols))
    print(f"Numeric columns ({len(numeric_cols)}) :", ", ".join(numeric_cols))
    print(f"Categorical columns ({len(category_cols)}) :", ", ".join(category_cols))

    return df_n, category_cols, numeric_cols, bool_cols

def split_x_y(df, preTrain = True):
    target = df['Transported']
    target = target.astype(int)
    df = df.drop(['Age', "Total"],axis=1)
    df = df.drop(['Transported'],axis=1)
    return df, target

In [47]:
# initiate the traning and test datasets--for trainning
df_n, category_cols, numeric_cols, bool_cols = preprocess(df, rem_columns=False)
df_test, df_train = train_test_split(df_n, test_size = 0.2, random_state = 100)

Boolean columns (3) : CryoSleep, VIP, Transported
Numeric columns (7) : Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, CabinNumber
Categorical columns (10) : HomePlanet, Destination, Group, Group_id, FirstName, Surname, Deck, CabinSide, cat_Age, cat_Total


In [48]:
X_train, y_train = split_x_y(df_train)
X_test, y_test = split_x_y(df_test)

In [49]:
clf = LazyClassifier(verbose=0,
                     ignore_warnings=True,
                     custom_metric=None,
                     predictions=False,
                     random_state=12,
                     classifiers='all')

models, predictions = clf.fit(X_train , X_test , y_train , y_test)

100%|██████████| 29/29 [00:03<00:00,  8.68it/s]


In [50]:
models

Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,0.79,0.79,0.79,0.79,0.13
RandomForestClassifier,0.79,0.79,0.79,0.79,0.27
ExtraTreesClassifier,0.79,0.79,0.79,0.79,0.27
XGBClassifier,0.78,0.78,0.78,0.78,0.14
CalibratedClassifierCV,0.78,0.78,0.78,0.78,0.34
NuSVC,0.78,0.78,0.78,0.78,0.45
LinearSVC,0.78,0.78,0.78,0.78,0.11
SVC,0.78,0.78,0.78,0.78,0.5
LogisticRegression,0.78,0.78,0.78,0.78,0.02
AdaBoostClassifier,0.78,0.78,0.78,0.78,0.19


In [None]:
# Tentar visualizar a dependência de Transported com as variáveis binárias (ou com poucos elementos)
# Fazer o mesmo com as novas features (p.e. Age depois de agrupar)
fig, axes = plt.subplots(3, 2, figsize=(12,12),sharey=True)

k = 0
for col in cat_columns:
    if col not in ["cat_Group", "cat_Surname", "cat_FirstName", "cat_HomePlanet"]:
        sns.countplot(ax=axes[k%3, k//3],hue='Transported',x=col,data=df_train)
        k += 1
plt.savefig("Imagens_FAA/4graphs.png")

In [None]:
# #  Tentar visualizar a dependência de Transported com as restantes variáveis através de boxplots
# sns.boxplot(x='Transported',y='Age',data=df_train)
# plt.savefig("Imagens_FAA/Age.png")

In [None]:
# # Não dá para visualizar muito bem porque a maior parte dos valores é dada como outlier
if redo_graphs:
    sns.boxplot(x='Transported',y='VRDeck',data=df_train)
    plt.savefig("Imagens_FAA/Boxplot_VRDeck.png")

In [None]:
if redo_graphs:
    num_columns.append("Transported")
    sns.pairplot(df_train[num_columns], kind="scatter", hue="Transported")
    plt.savefig("Imagens_FAA/Full_pairplot_cont.png")
    plt.show()

In [None]:
if redo_graphs:
    sns.pairplot(df_train, kind="scatter", hue="Transported")
    plt.show()
    plt.savefig("Imagens_FAA/Full_pairplot.png")

In [None]:
if redo_graphs:
    corr = df_train.corr()
    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.4, cbar_kws={"shrink": .5})
    # Not sure of the reason why but the below command saves it some with border
    plt.savefig("Imagens_FAA/Correlation.png")

In [None]:
#define dimensions of subplots (rows, columns)
fig, axes = plt.subplots(1, 2, figsize=(10, 5))

#create chart in each subplot
sns.countplot(df_train, x="cat_CryoSleep", hue="Transported", ax=axes[1])
sns.countplot(df_train, x="SpendCategory", hue="Transported", ax=axes[0])

In [None]:
#define dimensions of subplots (rows, columns)
# fig, axes = plt.subplots(1, 2, figsize=(10, 5))

#create chart in each subplot
sns.scatterplot(df_train, x="FoodCourt", y="Spa", hue="Transported")

In [None]:
import joblib
X_train, y_train = split_x_y(df_train)
X_test, y_test = split_x_y(df_test)
with open('data_FAA/data.pkl', 'wb') as f:
    joblib.dump([X_train, y_train, X_test, y_test], f)

In [None]:
X_train["Transported"]