<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---




<p align="center"><h1 align="center">Titanic Dataset Classification Tutorial</h1> <h3 align="center">(Prepare a model and preprocessor for deployment to a REST API/Web Dashboard in four easy steps...)</h3></p>
<p align="center"><img width="80%" src="https://aimodelsharecontent.s3.amazonaws.com/ModelandPreprocessorObjectPreparation.jpeg" /></p>


---



## **(1) Preprocessor Function & Setup**

> ### A more advanced example demonstrating the flexibility of scikit-learn's *Column Transformer* approach, which applies separate preprocessing to numeric and categorical columns.

In [2]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']

# Replace missing values with the most frequent value (the mode), then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Final preprocessor object set up with ColumnTransformer...

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


# Drop the target and the free-text name column in one call; two separate
# drops on `data` would leave 'survived' in X.
X = data.drop(['survived', 'name'], axis=1)
y = data['survived']
y = y.map({0: 'died', 1: 'survived'})

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

preprocess = preprocess.fit(X_train)

In [3]:
def preprocessor(data):
    preprocessed_data = preprocess.transform(data)
    return preprocessed_data

In [4]:
preprocessor(X_train).shape

(1047, 10)

## **(2) Build Your Model Using `sklearn`**

In [5]:
print(X_train.shape, X_test.shape,
      y_train.shape, y_test.shape)

(1047, 12) (262, 12) (1047,) (262,)


In [7]:
# Penalized Logit...

hyperparameters = {'C':np.logspace(1, 10, 100), 'penalty':['l2']}

logit = LogisticRegression()
logit_cv = GridSearchCV(logit, hyperparameters, cv = 10)
logit_cv.fit(preprocessor(X_train), y_train)

print("Best Parameters:", logit_cv.best_params_)

Best Parameters: {'C': 10.0, 'penalty': 'l2'}
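
Two things about this search are worth a second look. The selected `C` of 10.0 is the smallest value in `np.logspace(1, 10, 100)`, so the best setting may lie below the grid. The inputs were also transformed by a preprocessor fitted on the whole training set before cross-validation. A possible variation, sketched here on synthetic data, wraps preprocessing and classifier in one `Pipeline` so that each fold refits the imputer and scaler, and searches a grid that extends to smaller `C`. The `toy` data, the step names `preprocess` and `clf`, and the grid bounds are assumptions for illustration, not the tutorial's settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic features (illustrative only).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.normal(30, 10, n),
    "fare": rng.exponential(30, n),
    "sex": rng.choice(["female", "male"], n),
})
toy.loc[::15, "age"] = np.nan  # inject some missing ages
toy_y = np.where((toy["sex"] == "female") | (toy["fare"] > 60), "survived", "died")

numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
toy_preprocess = ColumnTransformer(transformers=[
    ("num", numeric, ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

# Preprocessing lives inside the pipeline, so it is refit on every CV fold.
pipe = Pipeline(steps=[
    ("preprocess", toy_preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid spans small and large C; parameters are addressed as <step>__<name>.
param_grid = {"clf__C": np.logspace(-3, 3, 13)}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(toy, toy_y)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy: {:.3f}".format(search.best_score_))
```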


In [8]:
model = LogisticRegression(C=10, penalty='l2')

model.fit(preprocessor(X_train), y_train) # Fitting to the training set.

model.score(preprocessor(X_train), y_train) # Mean accuracy on the training set, 0-1 scale.

0.7793696275071633

In [9]:
y_pred = model.predict(preprocessor(X_test))

y_pred

array(['died', 'survived', 'died', 'died', 'died', 'survived', 'died',
       'died', 'died', 'died', 'died', 'died', 'died', 'survived',
       'survived', 'died', 'survived', 'died', 'survived', 'died', 'died',
       'died', 'died', 'survived', 'died', 'survived', 'died', 'died',
       'died', 'survived', 'survived', 'survived', 'survived', 'died',
       'survived', 'died', 'died', 'died', 'died', 'died', 'died', 'died',
       'died', 'died', 'survived', 'died', 'died', 'survived', 'died',
       'died', 'survived', 'died', 'died', 'survived', 'died', 'died',
       'survived', 'died', 'survived', 'survived', 'died', 'died',
       'survived', 'died', 'survived', 'survived', 'died', 'died', 'died',
       'survived', 'survived', 'died', 'died', 'died', 'survived',
       'survived', 'died', 'survived', 'survived', 'died', 'died',
       'survived', 'died', 'died', 'survived', 'survived', 'died', 'died',
       'died', 'died', 'died', 'died', 'died', 'died', 'died', 'died',
       ...

In [10]:
from sklearn.metrics import accuracy_score

print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))

Accuracy: 79.01%
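
Accuracy alone does not show which class the model gets wrong. As a small illustration, the sketch below builds a confusion matrix and a per-class report with `sklearn.metrics`. The two label lists are made up for the example; in the notebook the same calls would take `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels, standing in for y_test and y_pred.
y_true_demo = ["died", "died", "survived", "survived", "died", "survived"]
y_pred_demo = ["died", "survived", "survived", "died", "died", "survived"]
labels = ["died", "survived"]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print("Accuracy: {:.2f}%".format(acc * 100))
print(classification_report(y_true_demo, y_pred_demo, labels=labels))
```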


## **(3) Save Preprocessor**

In [None]:
# ! pip3 install aimodelshare

In [11]:
def export_preprocessor(preprocessor_function, filepath):
    import dill
    with open(filepath, "wb") as f:
        dill.dump(preprocessor_function, f)

# Once the aimodelshare package can be deployed, its export function
# replaces the local helper defined above:
# import aimodelshare as ai
# ai.export_preprocessor(preprocessor, "preprocessor.pkl")

export_preprocessor(preprocessor, "preprocessor.pkl")
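
Before deploying, it can be worth checking that the saved file loads and behaves like the original function. The sketch below is a minimal round trip with `dill` on a stand-in preprocessor. The names `scaler` and `demo_preprocessor` and the temporary path are assumptions for illustration. Passing `recurse=True` is an optional variation on the helper above: it asks `dill` to store the globals the function refers to (here the fitted `scaler`) alongside the function.

```python
import os
import tempfile

import dill
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted `preprocess` object used by the tutorial.
scaler = StandardScaler().fit(np.array([[0.0], [2.0], [4.0]]))

def demo_preprocessor(data):
    # Relies on the module-level `scaler`, like `preprocessor` relies on `preprocess`.
    return scaler.transform(data)

path = os.path.join(tempfile.mkdtemp(), "preprocessor.pkl")

# Save the function; recurse=True traces the globals it references.
with open(path, "wb") as f:
    dill.dump(demo_preprocessor, f, recurse=True)

# Load it back and compare against the original on a small sample.
with open(path, "rb") as f:
    restored = dill.load(f)

sample = np.array([[2.0], [4.0]])
print(np.allclose(restored(sample), demo_preprocessor(sample)))
```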

## **(4) Save `sklearn` Model to ONNX File Format**

In [None]:
! pip3 install skl2onnx

In [15]:
# Convert into ONNX format...

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

initial_type = [('float_input', FloatTensorType([None, 10]))]
onx = convert_sklearn(model, initial_types=initial_type)

# Save model to local .onnx file...
with open("my_model.onnx", "wb") as f:
    f.write(onx.SerializeToString())