[Auto-PyTorch](https://automl.github.io/Auto-PyTorch/development/index.html) is a library for automatic deep learning developed by the [AutoML group](https://www.automl.org/) from Freiburg-Hannover, Germany.
# Generate tabular classification prediction using auto-pytorch
Main components of the library:
```python
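# Import the tabular classification API
from autoPyTorch.api.tabular_classification import TabularClassificationTask

# Configure model parameters
classifier = TabularClassificationTask()

# Fit model (search over pipelines, optimizing accuracy as in the run below)
classifier.search(X_train = X_train,
                  y_train = y_train,
                  optimize_metric = 'accuracy')

# Predict on test data
y_pred = classifier.predict(X_test)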
```

For more info on auto-pytorch applications, visit the [manual](https://automl.github.io/Auto-PyTorch/development/manual.html#manual).

This notebook is available on [GitHub](https://github.com/TomPham97/Kaggle-machine-learning/blob/main/Titanic-competition/2022-08-29-auto-pytorch-titanic.ipynb) or can be downloaded [here](/assets/posts/askl2/2022-08-29-auto-pytorch-titanic.ipynb).

In [1]:
%%capture
%pip install -Uq wheel setuptools
%pip install -Uq -r requirements-apt.txt

## Download the dataset from Kaggle
The dataset being used is from the [Kaggle Titanic competition](https://www.kaggle.com/competitions/titanic).

In [5]:
import fastkaggle
print("fastkaggle version: ", fastkaggle.__version__)

fastkaggle version:  0.0.7


In [8]:
comp = 'titanic' # competition name
path = fastkaggle.setup_comp(comp,
                  install = 'fastai "timm >= 0.6.2.dev0"')

In [9]:
# Import basic dependencies such as np, pd
import fastai
from fastai.imports import *
print("fastai version: ", fastai.__version__)

fastai version:  2.7.9


In [10]:
!ls {path}

gender_submission.csv  test.csv  train.csv


## Process and clean the data
Additional transformation and normalization are handled by Auto-PyTorch.

In [11]:
df = pd.read_csv(path/'train.csv', index_col = 'PassengerId')
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Define the features (`X`) and the label (`y`). The train/test split into `X_train` and `y_train` comes later.

In [12]:
y = df['Survived']
X = df.drop(['Survived', 'Name'], axis = 1)

Since Auto-PyTorch does not accept string columns, they must be converted into categorical columns.

In [13]:
# Find the non-numeric columns and convert them to the categorical dtype
from fastai.tabular.all import *

def to_cat(df):
    '''
    Convert the string-type columns of a dataframe into categorical columns, in place
    '''
    # Split column names into continuous and categorical; only the latter is needed
    _, cat = cont_cat_split(df, max_card = 1)

    # Cast each categorical column to the pandas categorical dtype
    for col in cat:
        df[col] = pd.Categorical(df[col])

In [14]:
to_cat(X)
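
The same conversion can be sketched with pandas alone, which may be handy when fastai is not installed. This is only an approximation of `to_cat`: it treats every non-numeric column as categorical instead of calling `cont_cat_split`, and it returns a copy rather than modifying the frame in place. The helper name `to_categorical` and the toy frame are made up for illustration.

```python
import pandas as pd

def to_categorical(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every non-numeric column is categorical."""
    out = frame.copy()
    for col in out.columns:
        # Numeric columns (int, float, bool) are left untouched
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out

# Toy frame with two string columns and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
    "Fare": [7.25, 71.2833, 7.925],
})

converted = to_categorical(toy)
print(converted.dtypes)
```

Missing values stay missing after the cast, so imputation is still left to the downstream pipeline.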

In [3]:
X.head()

| PassengerId | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Data exploration
The package pandas_profiling provides quick and valuable insights into the data.

In [16]:
import pandas_profiling
print("pandas_profiling version: ", pandas_profiling.__version__)

pandas_profiling version:  3.2.0


In [17]:
X.profile_report(progress_bar = False).to_notebook_iframe()
# Use .to_notebook_iframe() for HTML format or .to_widgets() for built in widget view
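
When the full profile report is more than needed, a couple of plain pandas calls give a rough first look. The sketch below is not a substitute for pandas_profiling; it only computes the share of missing values per column and summary statistics for the numeric columns. The `sample` frame is typed in by hand from the five rows shown earlier, with a subset of the columns.

```python
import pandas as pd

# First five rows of the feature table shown earlier; None marks a missing cabin
sample = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1, 3],
        "Sex": ["male", "female", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
        "Cabin": [None, "C85", None, "C123", None],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

# Share of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)

# Summary statistics for the numeric columns only
numeric_summary = sample.describe()

print(missing_share)
print(numeric_summary)
```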

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 42)

In [18]:
import pickle
df_list = [X, y, X_train, X_test, y_train, y_test]
# The context manager closes the file even if pickling fails
with open("training.pkl", "wb") as open_file:
    pickle.dump(df_list, open_file)

## Configure and train the model

In [5]:
import autoPyTorch
print('autoPyTorch version:', autoPyTorch.__version__)

autoPyTorch version: 0.2.1


In [6]:
%%capture
!sudo apt update
!sudo apt install -y libgl1-mesa-glx

In [7]:
# Import libraries
from autoPyTorch.api.tabular_classification import TabularClassificationTask

# Configure model parameters
cls = TabularClassificationTask(seed = 42, n_jobs = -1)

In [3]:
import pickle
# Reload the data saved earlier, e.g. after a kernel restart
with open("training.pkl", "rb") as open_file:
    X, y, X_train, X_test, y_train, y_test = pickle.load(open_file)

In [None]:
# Train the model
cls.search(X_train = X_train,
           y_train = y_train,
           X_test = X_test,
           y_test = y_test,
           optimize_metric = 'accuracy',
           dataset_name = 'Titanic',
           total_walltime_limit = 100,
           # func_eval_time_limit_secs = 5,
           memory_limit = None)



To save the progress so far, fastai's `save_pickle` function stores the trained model on disk.
To load the pickled model later, run the following:
```python
cls = load_pickle('cls.pkl')
```

In [None]:
save_pickle('cls.pkl', cls)
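
For readers without fastai, the same save-and-restore round trip can be sketched with the standard-library `pickle` module. The dictionary below is only a stand-in for the trained classifier, and the temporary directory is used so that the example leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the trained estimator; any picklable object works the same way
fake_model = {"name": "Titanic", "seed": 42}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "cls.pkl"

    # Serialize to disk; the context manager closes the file even on error
    with open(model_path, "wb") as f:
        pickle.dump(fake_model, f)

    # Load it back into a new object
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == fake_model)
```

As with any pickle file, only load files from a trusted source, since unpickling can execute arbitrary code.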

## Model insights

The contents of the model ensemble can be viewed below.

In [None]:
print(cls.sprint_statistics())

In [None]:
y_pred = cls.predict(X_test)
score = cls.score(y_pred, y_test)
print(score)
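
Because the search optimizes accuracy, the reported score can be cross-checked by hand: accuracy is simply the fraction of predictions that match the true labels. The sketch below uses made-up labels, not results from the trained model, and the helper name `accuracy` is introduced here for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Made-up labels for illustration only
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]

toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```

To apply it to the notebook's data, both `y_test` and the prediction array would first need to be converted to flat lists of labels.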

In [None]:
# Show the models in the final ensemble, reusing the classifier trained above
print(cls.show_models())

In [None]:
print(cls.leaderboard())

## Use the trained model to make predictions

In [None]:
# Refit the model on the full training set
cls.refit(X_train = X_train,
         y_train = y_train,
         X_test = X_test,
         dataset_name = "Titanic",
         memory_limit = None,
         )

Process the test data in the same way as the training features (`X_train`).

In [28]:
df_test = pd.read_csv(path/'test.csv', index_col = 'PassengerId')
df_test = df_test.drop('Name', axis = 1)
to_cat(df_test)
df_test.head()

| PassengerId | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 892 | 3 | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 893 | 3 | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 894 | 2 | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 895 | 3 | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 896 | 3 | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [29]:
# Make prediction
prediction = cls.predict(df_test)

In [30]:
# Convert the prediction from an ndarray to a dataframe
subm = pd.DataFrame(prediction,
                    index = df_test.index,
                    columns = ['Survived'])

Save the prediction as a .csv file.

In [31]:
subm.to_csv('subm.csv')
# View the first few rows
!head subm.csv

PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
897,0
898,1
899,0
900,1
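
Before submitting, a quick format check can catch a malformed file. The sketch below assumes the two-column layout printed above (`PassengerId`, `Survived`) with binary labels; the helper name `check_submission` is hypothetical, and the text is a hand-copied excerpt of the output rather than the real file.

```python
import csv
import io

# The first rows of subm.csv as printed above
submission_text = """PassengerId,Survived
892,0
893,1
894,0
"""

def check_submission(text):
    """Return the number of data rows, raising ValueError on a malformed file."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["PassengerId", "Survived"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        int(row["PassengerId"])  # raises ValueError if the ID is not an integer
        if row["Survived"] not in {"0", "1"}:
            raise ValueError(f"non-binary label in row {row}")
        n_rows += 1
    return n_rows

n_checked = check_submission(submission_text)
print(n_checked)
```

To check the real file, the same function could be given the contents of `subm.csv` read as text.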


Submit the prediction directly to the Kaggle competition. The scores can be viewed on [the submissions page](https://www.kaggle.com/competitions/titanic/submissions).

In [39]:
# Submit to competition
from kaggle import api
api.competition_submit_cli('subm.csv', # file name
                           'auto-pytorch', # version description
                           comp) # competition name

100%|██████████| 2.77k/2.77k [00:00<00:00, 5.35kB/s]


Successfully submitted to Titanic - Machine Learning from Disaster

This submission has an accuracy score of 79.186%, which places it in the top 6% of all submissions. *Note: numerous [top predictions with 100% accuracy](https://www.kaggle.com/competitions/titanic/leaderboard) on the leaderboard come from cheating*.