# Problem definition

The dataset used for this competition is synthetic but based on a real dataset (in this case, the actual Titanic data!) and was generated using a CTGAN.

Data description: 

| Variable | Definition | Key |
|---------------|:-------------|------:|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

<br>

Here `survival` (the `Survived` column in `train.csv`) is our target variable! 🎯

<br>

Check out these notebooks:

- Tuning a LightGBM model with Bayesian optimization using the `tidymodels` framework in R:
    - [https://www.kaggle.com/gomes555/tps-apr2021-r-eda-lightgbm-bayesopt/](https://www.kaggle.com/gomes555/tps-apr2021-r-eda-lightgbm-bayesopt/)
- Tuning a LightGBM model with Bayesian optimization using the `Optuna` framework in Python:
    - [https://www.kaggle.com/gomes555/tps-apr2021-lightgbm-optuna-pipelineopt](https://www.kaggle.com/gomes555/tps-apr2021-lightgbm-optuna-pipelineopt)

<br>

<p align="right"><span style="color:firebrick">Don't forget to upvote if you liked the notebook! <i class="fas fa-hand-peace"></i></span> </p>

In [None]:
# Install mljar
!pip install -q -U git+https://github.com/mljar/mljar-supervised.git@master

# Dependencies

In [None]:
import pandas as pd
import numpy as np
from supervised.automl import AutoML

pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 5)

In [None]:
train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv', index_col=0)
test = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv', index_col=0)
submission = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv')

In [None]:
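def initial_prep(x):
    """Derive cabin, ticket, name and family features, then drop the raw columns."""
    x = x.copy()  # work on a copy so the caller's DataFrame is not modified in place
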
    # Cabin: first run of digits -> CabinNumber; the remaining characters -> CabinClass
    x['CabinNumber'] = x['Cabin'].str.extract(r'(\d+)', expand=False).astype('float64')
    x['CabinClass'] = x['Cabin'].str.replace(r'\d+| ', '', regex=True).replace(r'^\s*$', np.nan, regex=True)
    x['CabinClass'] = x['CabinClass'].astype('category')
    # Ticket: first run of digits -> TicketNumber; dots, digits and spaces stripped -> TicketPrefix
    x['TicketNumber'] = x['Ticket'].str.extract(r'(\d+)', expand=False).astype('float64')
    x['TicketPrefix'] = x['Ticket'].str.replace(r'\.|\d+| ', '', regex=True).replace(r'^\s*$', np.nan, regex=True)
    x['TicketPrefix'] = x['TicketPrefix'].astype('category')
    x['Sex'] = np.where(x['Sex'] == 'male', 1, 0)

    x['Embarked'] = x['Embarked'].astype('category')
    # conditions = [
    #     (x["Embarked"].eq("C")),
    #     (x["Embarked"].eq("Q")),
    #     (x["Embarked"].eq("S"))
    # ]
    # choices = [2, 3, 1]
    # x["Embarked"] = np.select(conditions, choices)

    x['NameLen'] = x.loc[:,'Name'].str.len() - 2
    
    # Text after the first comma of Name; n is passed by keyword (positional use is deprecated in recent pandas)
    x['Name2'] = x['Name'].str.split(',', n=1).str[1]
    
    x['Name2'] = x['Name2'].astype('category')
    
    x['FamilySize'] = x['SibSp'] + x['Parch'] + 1

    x['IsAlone'] = np.where(x['FamilySize'] == 1, 1, 0)

    # 1 if any column is missing in the row, including the columns derived above
    x['AnyMissing'] = x.isnull().any(axis=1).astype(int)
    
    x['Age_Pclass'] = x['Age'] * x['Pclass']
    
    x = x.drop(['Ticket', 'Name', 'Cabin', 'SibSp', 'Parch'], axis = 1)
    
    return x
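To make the string handling in `initial_prep` more concrete, here is a minimal sketch of the ticket split using only the standard `re` module instead of pandas. The helper name `split_ticket` and the sample ticket strings are illustrative assumptions, not part of the notebook or the competition data.

```python
import re

def split_ticket(ticket):
    """Hypothetical helper mirroring the TicketNumber / TicketPrefix logic for one string."""
    if ticket is None:
        return None, None
    # Like str.extract(r'(\d+)'): only the FIRST run of digits is captured.
    match = re.search(r'\d+', ticket)
    number = float(match.group()) if match else None
    # Remove dots, every run of digits, and spaces; an empty remainder means "no prefix".
    prefix = re.sub(r'\.|\d+| ', '', ticket)
    return (prefix or None), number

# Illustrative ticket strings (assumed, not taken from the dataset)
for raw in ['PC 17599', '113803', 'A/5 21171']:
    print(raw, '->', split_ticket(raw))
```

One thing this makes visible: for a ticket whose prefix itself contains a digit, such as `'A/5 21171'`, the captured number is the `5` from the prefix rather than the trailing serial, because only the first digit run is extracted.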

In [None]:
train = train.pipe(initial_prep)
test = test.pipe(initial_prep)

# MLJAR Automated Machine Learning for Humans

See: [https://github.com/mljar/mljar-supervised](https://github.com/mljar/mljar-supervised)

In [None]:
X = train.drop('Survived', axis=1)
y = train['Survived']

In [None]:
# Reference: https://supervised.mljar.com/features/modes/
automl = AutoML(
    mode="Compete",
    eval_metric="accuracy",
    total_time_limit=60*60*7,  # in seconds, i.e. 7 hours
    algorithms=["Xgboost", "LightGBM", "CatBoost"],  # gradient boosting models only
    features_selection=True,
    validation_strategy={
        "validation_type": "kfold",
        "k_folds": 8,
        "shuffle": True,
        "stratify": True,
    }
)
automl.fit(X, y)
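The `validation_strategy` above asks for a shuffled, stratified 8-fold split. As a rough illustration of what stratification means, the sketch below deals the rows of each class round-robin into folds so that every fold keeps about the same share of survivors. This is a plain-Python toy under stated assumptions, not how mljar-supervised implements it; `stratified_folds` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Hypothetical helper: assign each row index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)                     # corresponds to "shuffle": True
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)      # round-robin per class keeps the class ratio
    return folds

labels = [1] * 40 + [0] * 60                     # toy target with 40% positives
folds = stratified_folds(labels, k=8)
for number, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(f"fold {number}: {len(fold)} rows, {positives} positives")
```

Each fold then serves once as the validation set while the other seven are used for training, and the reported accuracy is aggregated across the eight held-out folds.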

In [None]:
automl.report()

In [None]:
submission.loc[:, 'Survived'] = automl.predict(test)
submission.to_csv('submission.csv', index = False)