<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Use case - Predict Survival on the Titanic Disaster</b>
</header>

## Description:
Researchers are still drawn to the Titanic disaster, which happened more than a century ago, as they try to understand _why some passengers survived while others did not_.
 
## Objective:
Our goal is to build a predictive model that uses Titanic passenger data to identify whether or not a passenger survived the ship's sinking.

## Steps followed:
* Import the required teradataml modules.
* Establish a context with the Vantage system.
* Load the data.
* Analyze the data, e.g. use various DataFrame functions to get details such as the shape and the count of surviving passengers.
* Preprocess the DataFrame.
    * Column selection.
    * Ordinal Encoding.
    * Split the data into train and validation sets.
    * Hyper-Parametrization of Non-Model trainer function (SimpleImpute).
* Build a predictive model using Hyperparameter-Tuning.
* Validate the best model.
* Cleanup.

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Import the required modules.</b>

In [1]:
# Import required packages.
import random
from getpass import getpass
from teradataml import *
from teradataml.hyperparameter_tuner import *

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Connect to Vantage.</b>

In [None]:
# Create connection with Vantage.
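# Each getpass() call gets its own prompt so the three inputs are not confused.
create_context(host=getpass(prompt="Host: "),
               user=getpass(prompt="Username: "),
               password=getpass(prompt="Password: "))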

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>DataFrame creation.</b>

In [3]:
# Load the example dataset.
load_example_data("teradataml", "titanic")
titanic = DataFrame("titanic")




In [4]:
titanic.head()

passenger,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S


<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>DataFrame Analysis.</b>

In [5]:
# Shape of the dataframe.
titanic.shape

(891, 12)

In [6]:
# Info about dataframe and null values.
titanic.info(null_counts=True)

<class 'teradataml.dataframe.dataframe.DataFrame'>
Data columns (total 12 columns):


passenger    891 non-null int  
survived     891 non-null int  
pclass       891 non-null int  
name         891 non-null str  
sex          891 non-null str  
age          714 non-null int  
sibsp        891 non-null int  
parch        891 non-null int  
ticket       891 non-null str  
fare         891 non-null float
cabin        204 non-null str  
embarked     889 non-null str  
dtypes: str(5), int(6), float(1)


In [7]:
# Generates statistics for numeric columns in titanic data. 
titanic.describe()

func,passenger,survived,pclass,age,sibsp,parch,fare
min,1.0,0.0,1.0,0.0,0.0,0.0,0.0
std,257.354,0.487,0.836,14.536,1.103,0.806,49.693
25%,223.5,0.0,2.0,20.0,0.0,0.0,7.91
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.454
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.329
mean,446.0,0.384,2.309,29.679,0.523,0.382,32.204
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0


In [8]:
# Count of survived passengers.
survived_count = titanic[titanic.survived == 1]
survived_count.shape[0]

342

In [9]:
# Count of lost passengers.
non_survived_count = titanic[titanic.survived == 0]
non_survived_count.shape[0]

549
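
As a quick consistency check, the two counts above should add up to the row count from `titanic.shape`, and their ratio should match the mean of `survived` reported by `describe()`. The sketch below uses only the numbers printed in this notebook; the variable names are illustrative.

```python
# Counts printed by the two cells above.
survived = 342
lost = 549

total = survived + lost             # should equal titanic.shape[0]
survival_rate = survived / total    # should match the mean of "survived"

print(total)                        # 891
print(round(survival_rate, 3))      # 0.384
```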

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>DataFrame preprocessing.</b>

<b style = 'font-size:20px;font-family:Arial;color:#000000'>Column selection.</b>

In [10]:
# Dropping unwanted columns.
titanic = titanic.drop(["name", "ticket", "cabin"], axis=1)
titanic

passenger,survived,pclass,sex,age,sibsp,parch,fare,embarked
326,1,1,female,36.0,0,0,135.6333,C
183,0,3,male,9.0,4,2,31.3875,S
652,1,2,female,18.0,0,1,23.0,S
40,1,3,female,14.0,1,0,11.2417,C
774,0,3,male,,0,0,7.225,C
366,0,3,male,30.0,0,0,7.25,S
509,0,3,male,28.0,0,0,22.525,S
795,0,3,male,25.0,0,0,7.8958,S
61,0,3,male,22.0,0,0,7.2292,C
469,0,3,male,,0,0,7.725,Q


<b style = 'font-size:20px;font-family:Arial;color:#000000'>Ordinal Encoding.</b>

In [11]:
# Fit OrdinalEncoding on the 'sex' and 'embarked' columns.
ordinal_obj = OrdinalEncodingFit(target_column=['sex', 'embarked'],
                                 data=titanic)

# Apply the fitted encoding to the data.
df = ordinal_obj.transform(data=titanic,
                           accumulate=['passenger', 'survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']).result

df


passenger,survived,pclass,age,sibsp,parch,fare,sex,embarked
326,1,1,36.0,0,0,135.6333,0,0
183,0,3,9.0,4,2,31.3875,1,2
652,1,2,18.0,0,1,23.0,0,2
265,0,3,,0,0,7.75,0,1
530,0,2,23.0,2,1,11.5,1,2
122,0,3,,0,0,8.05,1,2
591,0,3,35.0,0,0,7.125,1,2
387,0,3,1.0,5,2,46.9,1,2
734,0,2,23.0,0,0,13.0,1,2
795,0,3,25.0,0,0,7.8958,1,2
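
To make the encoded values easier to read, here is a minimal pure-Python sketch of ordinal encoding. It assumes categories are numbered in sorted order, which is consistent with the output above (`female` as 0, `male` as 1, and `C`, `Q`, `S` as 0, 1, 2) but is not confirmed to be how `OrdinalEncodingFit` assigns codes. The helper names `fit_ordinal` and `transform_ordinal` are hypothetical and not part of teradataml.

```python
def fit_ordinal(values):
    """Map each distinct non-null category to an integer, in sorted order."""
    categories = sorted({v for v in values if v is not None})
    return {category: code for code, category in enumerate(categories)}


def transform_ordinal(values, mapping):
    """Replace categories with their codes; nulls stay null."""
    return [mapping[v] if v is not None else None for v in values]


sex_map = fit_ordinal(["female", "male", "female", "male"])
embarked_map = fit_ordinal(["C", "S", "S", "Q", None])

print(sex_map)                                             # {'female': 0, 'male': 1}
print(embarked_map)                                        # {'C': 0, 'Q': 1, 'S': 2}
print(transform_ordinal(["S", None, "C"], embarked_map))   # [2, None, 0]
```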


<b style = 'font-size:20px;font-family:Arial;color:#000000'>Train-Validation split.</b>

In [12]:
# Sample 5% of data for model validation.
df_sample = df.sample(frac=[0.95, 0.05], randomize=True)
df_sample

passenger,survived,pclass,age,sibsp,parch,fare,sex,embarked,sampleid
530,0,2,23.0,2,1,11.5,1,2,1
591,0,3,35.0,0,0,7.125,1,2,1
387,0,3,1.0,5,2,46.9,1,2,1
856,1,3,18.0,0,1,9.35,0,2,1
244,0,3,22.0,0,0,7.125,1,2,1
713,1,1,48.0,1,0,52.0,1,2,1
448,1,1,34.0,0,0,26.55,1,2,1
122,0,3,,0,0,8.05,1,2,1
734,0,2,23.0,0,0,13.0,1,2,1
265,0,3,,0,0,7.75,0,1,1
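
The `sampleid` column labels each row with the fraction it was drawn into (1 for the 95% sample, 2 for the 5% sample). The sketch below imitates that partition in plain Python on passenger ids only. It assumes the validation size is obtained by rounding, which happens to give the 45 rows seen later in this notebook; the actual in-database sampling may size the groups differently. `split_ids` is a hypothetical helper.

```python
import random


def split_ids(ids, val_frac=0.05, seed=0):
    """Randomly partition ids into (train, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]


train_ids, val_ids = split_ids(range(1, 892))   # passenger ids 1..891
print(len(train_ids), len(val_ids))             # 846 45
```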


In [13]:
# Create the train dataset from sample 1 by filtering on "sampleid", then drop the "sampleid" column because it is not required for training the model.
data_train = df_sample[df_sample.sampleid == "1"].drop("sampleid", axis = 1)
data_train

passenger,survived,pclass,age,sibsp,parch,fare,sex,embarked
530,0,2,23.0,2,1,11.5,1,2
591,0,3,35.0,0,0,7.125,1,2
387,0,3,1.0,5,2,46.9,1,2
469,0,3,,0,0,7.725,1,1
326,1,1,36.0,0,0,135.6333,0,0
795,0,3,25.0,0,0,7.8958,1,2
183,0,3,9.0,4,2,31.3875,1,2
652,1,2,18.0,0,1,23.0,0,2
61,0,3,22.0,0,0,7.2292,1,0
122,0,3,,0,0,8.05,1,2


In [14]:
# Create the validation dataset from sample 2 by filtering on "sampleid", then drop the "sampleid" column because it is not required for validating the model.
data_val = df_sample[df_sample.sampleid == "2"].drop("sampleid", axis = 1)
data_val

passenger,survived,pclass,age,sibsp,parch,fare,sex,embarked
196,1,1,58.0,0,0,146.5208,0,0
11,1,3,4.0,1,1,16.7,0,2
663,0,1,47.0,0,0,25.5875,1,2
810,1,1,33.0,1,0,53.1,0,2
295,0,3,24.0,0,0,7.8958,1,2
110,1,3,,1,0,24.15,0,1
844,0,3,34.0,0,0,6.4375,1,0
270,1,1,35.0,0,0,135.6333,0,2
726,0,3,20.0,0,0,8.6625,1,2
297,0,3,23.0,0,0,7.2292,1,0


In [15]:
data_val.info(null_counts=True)

<class 'teradataml.dataframe.dataframe.DataFrame'>
Data columns (total 9 columns):
passenger    45 non-null int  
survived     45 non-null int  
pclass       45 non-null int  
age          34 non-null int  
sibsp        45 non-null int  
parch        45 non-null int  
fare         45 non-null float
sex          45 non-null int  
embarked     44 non-null int  
dtypes: int(8), float(1)


<b style = 'font-size:20px;font-family:Arial;color:#000000'>Hyper-Parametrization of SimpleImpute.</b>


In [16]:
# GridSearch also supports hyper-parameterization of non-model trainer functions.
# The "age" and "embarked" columns contain null values, so impute them with a
# statistic such as the mean, median, or mode, and use the imputed data to
# build the best model.

# Define hyperparameters for SimpleImputeFit.
# GridSearch performs imputation on "data_train" for each specified combination
# of parameters and returns the imputed data.
si_params = {"data":data_train,
            "stats_columns":["age", "embarked"],
            "stats":("median", "mean", "mode")}

In [17]:
# Perform GridSearch on SimpleImputeFit function.
si_gs_obj = GridSearch(func=SimpleImputeFit, params=si_params)
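
The sketch below illustrates how a parameter dictionary like `si_params` can expand into one function call per combination. It assumes that tuple values are the options to search over while lists and scalars are passed through unchanged, which matches the three imputation variants produced in this notebook. It is an illustration of the idea, not teradataml's implementation, and `expand_grid` is a hypothetical helper.

```python
from itertools import product


def expand_grid(params):
    """Yield one keyword dict per combination of the tuple-valued entries."""
    tuned = {k: v for k, v in params.items() if isinstance(v, tuple)}
    fixed = {k: v for k, v in params.items() if not isinstance(v, tuple)}
    for combo in product(*tuned.values()):
        yield {**fixed, **dict(zip(tuned.keys(), combo))}


# Stand-in for si_params; a string replaces the DataFrame so this runs anywhere.
params = {"data": "data_train",
          "stats_columns": ["age", "embarked"],
          "stats": ("median", "mean", "mode")}

grid = list(expand_grid(params))
print(len(grid))                     # 3
print([g["stats"] for g in grid])    # ['median', 'mean', 'mode']
```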

In [18]:
# Start the imputation task.
si_gs_obj.fit()

In [19]:
# Imputation task metadata shows three variants of imputation results.
si_gs_obj.models

Unnamed: 0,MODEL_ID,PARAMETERS,STATUS
0,SIMPLEIMPUTEFIT_0,"{'data': '""ALICE"".""ml__select__169947832716537...",PASS
1,SIMPLEIMPUTEFIT_2,"{'data': '""ALICE"".""ml__select__169947832716537...",PASS
2,SIMPLEIMPUTEFIT_1,"{'data': '""ALICE"".""ml__select__169947832716537...",PASS


In [20]:
models = si_gs_obj.models

In [21]:
# Apply each fitted SimpleImpute variant to the encoded data and store the
# results in a dictionary keyed by model ID.
imputed_data = {model: si_gs_obj.get_model(model).transform(
                    data=df,
                    accumulate=['passenger', 'survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']).result
                for model in models["MODEL_ID"]}
imputed_data

{'SIMPLEIMPUTEFIT_0':    passenger  survived  pclass  age  sibsp  parch      fare  sex  embarked
 0        326         1       1   36      0      0  135.6333    0         0
 1        183         0       3    9      4      2   31.3875    1         2
 2        652         1       2   18      0      1   23.0000    0         2
 3        265         0       3   28      0      0    7.7500    0         1
 4        530         0       2   23      2      1   11.5000    1         2
 5        122         0       3   28      0      0    8.0500    1         2
 6        591         0       3   35      0      0    7.1250    1         2
 7        387         0       3    1      5      2   46.9000    1         2
 8        734         0       2   23      0      0   13.0000    1         2
 9        795         0       3   25      0      0    7.8958    1         2,
 'SIMPLEIMPUTEFIT_2':    passenger  survived  pclass  age  sibsp  parch      fare  sex  embarked
 0        326         1       1   36      0  
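
In the `SIMPLEIMPUTEFIT_0` preview above, the two passengers whose age was null (265 and 122) now show an age of 28, which is consistent with median imputation given the 50% value of 28.0 from `describe()`. The sketch below shows the three strategies on a small list in plain Python. The ages are the ten rows displayed after ordinal encoding, so the fill values differ from those computed on the full training data; `impute` is a hypothetical helper.

```python
from statistics import mean, median, mode


def impute(values, stat):
    """Fill None entries with the chosen statistic of the non-null values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[stat](observed)
    return [fill if v is None else v for v in values]


# Ages of the ten rows displayed after ordinal encoding (None marks a null).
ages = [36, 9, 18, None, 23, None, 35, 1, 23, 25]

for stat in ("median", "mean", "mode"):
    print(stat, impute(ages, stat))
```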

In [22]:
# Impute nulls in the validation data with SimpleImpute, using the mean.
si_obj_val = SimpleImputeFit(data=data_val, stats_columns=["age", "embarked"], stats="mean")
val_df = si_obj_val.transform(data=data_val).result


<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Hyperparameter-Tuning to create optimal predictive model.</b>

In [23]:
# Define the XGBoost hyperparameter space: 10 x 10 x 10 x 4 = 4000 parameter combinations.
# Any combination within this hyperparameter space can be used for the hyperparameter tuning task.
XGB_params = {"input_columns":['pclass', 'age', 'sibsp', 'parch', 'fare', 'sex', 'embarked'],
              "response_column" : 'survived',
              "max_depth":tuple(random.randrange(3, 50) for i in range(10)),
              "lambda1" : tuple(round(random.uniform(0.001, 1.0), 3) for i in range(10)),
              "model_type" : "classification",
              "num_boosted_trees": 50,
              "shrinkage_factor":tuple(round(random.uniform(0.001, 1.0), 3) for i in range(10)),
              "iter_num":(50, 200, 500, 1000)}

In [24]:
# Define the evaluation parameters used for model evaluation.
eval_params = {"id_column": "passenger",
               "model_type": "classification",
               "accumulate": "survived",
               "object_order_column": ['task_index', 'tree_num', 'iter', 'class_num', 'tree_order']}

In [25]:
# Initialize RandomSearch for the XGBoost model.
# Although the hyperparameter space contains 4000 combinations, only "n_iter" of them
# are selected at random, and the selected set is used for model optimization.
# Note: The chosen hyperparameter combinations are applied to each hyper-parameterized (imputed) dataset.
rs_obj = RandomSearch(func=XGBoost, params=XGB_params, n_iter=4)
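
The figure of 4000 is the product of the tuple lengths in `XGB_params`. Because three of those tuples are drawn at random, they may contain repeated values, so the number of distinct combinations can be smaller. The sketch below rebuilds a space of the same shape with a fixed seed purely for illustration; the seed and the name `space` are assumptions, not part of the notebook.

```python
import random
from math import prod

rng = random.Random(0)  # fixed seed, for illustration only

# Same shape as the tuple-valued entries of XGB_params.
space = {
    "max_depth": tuple(rng.randrange(3, 50) for _ in range(10)),
    "lambda1": tuple(round(rng.uniform(0.001, 1.0), 3) for _ in range(10)),
    "shrinkage_factor": tuple(round(rng.uniform(0.001, 1.0), 3) for _ in range(10)),
    "iter_num": (50, 200, 500, 1000),
}

upper_bound = prod(len(v) for v in space.values())      # counts repeats
distinct = prod(len(set(v)) for v in space.values())    # ignores repeats

print(upper_bound)               # 4000
print(distinct <= upper_bound)   # True
```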

In [26]:
# Start the RandomSearch optimization.
rs_obj.fit(data=imputed_data,
           verbose=1, frac=0.85,
           **eval_params
            )

Completed: ｜⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿｜ 100% - 12/12
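
The progress bar counts 12 runs because the 4 sampled hyperparameter combinations are each trained on the 3 imputed datasets. A minimal sketch of that bookkeeping follows. It assumes combinations are sampled without replacement and uses a smaller stand-in space; `sample_combinations` is a hypothetical helper and does not reflect RandomSearch internals.

```python
import random
from itertools import product


def sample_combinations(space, n_iter, seed=None):
    """Pick n_iter distinct combinations of the tuple-valued entries."""
    tuned = {k: v for k, v in space.items() if isinstance(v, tuple)}
    fixed = {k: v for k, v in space.items() if not isinstance(v, tuple)}
    all_combos = list(product(*tuned.values()))
    chosen = random.Random(seed).sample(all_combos, n_iter)
    return [{**fixed, **dict(zip(tuned.keys(), combo))} for combo in chosen]


# Smaller stand-in for XGB_params.
space = {"model_type": "classification",
         "num_boosted_trees": 50,
         "max_depth": (3, 10, 25),
         "iter_num": (50, 200, 500, 1000)}

candidates = sample_combinations(space, n_iter=4, seed=0)
datasets = ["SIMPLEIMPUTEFIT_0", "SIMPLEIMPUTEFIT_1", "SIMPLEIMPUTEFIT_2"]
runs = [(data_id, params) for data_id in datasets for params in candidates]

print(len(candidates), len(runs))   # 4 12
```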

In [27]:
# The trained model metadata shows 4 models built for each hyper-parameterized dataset,
# so RandomSearch generates 12 models in total.
rs_obj.models

Unnamed: 0,MODEL_ID,DATA_ID,PARAMETERS,STATUS,ACCURACY
0,XGBOOST_3,SIMPLEIMPUTEFIT_0,"{'input_columns': ['pclass', 'age', 'sibsp', '...",PASS,0.783582
1,XGBOOST_0,SIMPLEIMPUTEFIT_0,"{'input_columns': ['pclass', 'age', 'sibsp', '...",PASS,0.783582
2,XGBOOST_2,SIMPLEIMPUTEFIT_1,"{'input_columns': ['pclass', 'age', 'sibsp', '...",PASS,0.783582
3,XGBOOST_1,SIMPLEIMPUTEFIT_2,"{'input_columns': ['pclass', 'age', 'sibsp', '...",PASS,0.798507
4,XGBOOST_4,SIMPLEIMPUTEFIT_2,"{'input_columns': ['pclass', 'age', 'sibsp', '...",PASS,0.791045
5,XGBOOST_5,SIMPLEIMPUTEFIT_1,"{'input_columns': ['pclass', 'age', 'sibsp', '...",PASS,0.783582
6,XGBOOST_7,SIMPLEIMPUTEFIT_2,"{'input_columns': ['pclass', 'age', 'sibsp', '...",PASS,0.798507
7,XGBOOST_6,SIMPLEIMPUTEFIT_0,"{'input_columns': ['pclass', 'age', 'sibsp', '...",PASS,0.791045
8,XGBOOST_9,SIMPLEIMPUTEFIT_0,"{'input_columns': ['pclass', 'age', 'sibsp', '...",PASS,0.776119
9,XGBOOST_8,SIMPLEIMPUTEFIT_1,"{'input_columns': ['pclass', 'age', 'sibsp', '...",PASS,0.776119


In [28]:
# RandomSearch model stats for XGBoost.
rs_obj.model_stats

Unnamed: 0,MODEL_ID,ACCURACY,MICRO-PRECISION,MICRO-RECALL,MICRO-F1,MACRO-PRECISION,MACRO-RECALL,MACRO-F1,WEIGHTED-PRECISION,WEIGHTED-RECALL,WEIGHTED-F1
0,XGBOOST_3,0.783582,0.783582,0.783582,0.783582,0.766927,0.73666,0.746923,0.778585,0.783582,0.777113
1,XGBOOST_0,0.783582,0.783582,0.783582,0.783582,0.766927,0.73666,0.746923,0.778585,0.783582,0.777113
2,XGBOOST_2,0.783582,0.783582,0.783582,0.783582,0.774639,0.763368,0.767737,0.781346,0.783582,0.781318
3,XGBOOST_1,0.798507,0.798507,0.798507,0.798507,0.777083,0.798876,0.783755,0.81334,0.798507,0.802301
4,XGBOOST_4,0.791045,0.791045,0.791045,0.791045,0.770311,0.793258,0.776667,0.808323,0.791045,0.795274
5,XGBOOST_5,0.783582,0.783582,0.783582,0.783582,0.774639,0.763368,0.767737,0.781346,0.783582,0.781318
6,XGBOOST_7,0.798507,0.798507,0.798507,0.798507,0.777083,0.798876,0.783755,0.81334,0.798507,0.802301
7,XGBOOST_6,0.791045,0.791045,0.791045,0.791045,0.774671,0.74753,0.757246,0.786528,0.791045,0.785637
8,XGBOOST_9,0.776119,0.776119,0.776119,0.776119,0.75907,0.725791,0.736428,0.77062,0.776119,0.768486
9,XGBOOST_8,0.776119,0.776119,0.776119,0.776119,0.767292,0.753752,0.758703,0.773597,0.776119,0.773217


In [29]:
# Best identified XGBoost model id.
rs_obj.best_model_id

'XGBOOST_10'

In [30]:
# Best identified data id.
rs_obj.best_data_id

'SIMPLEIMPUTEFIT_2'

In [31]:
# Best identified model score.
rs_obj.best_score_

0.8059701492537313
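
Conceptually, the best model is the row with the highest score in the model metadata. The sketch below applies that rule to the ten rows displayed by `rs_obj.models`. The progress bar reported 12 models, so two rows are not displayed, and the winning `XGBOOST_10` is presumably one of them; that is why this sketch returns a different id than `best_model_id`. How RandomSearch breaks ties is not shown in this notebook, so the first-match behavior of `max` here is an assumption.

```python
# (MODEL_ID, DATA_ID, ACCURACY) for the rows displayed by rs_obj.models.
rows = [
    ("XGBOOST_3", "SIMPLEIMPUTEFIT_0", 0.783582),
    ("XGBOOST_0", "SIMPLEIMPUTEFIT_0", 0.783582),
    ("XGBOOST_2", "SIMPLEIMPUTEFIT_1", 0.783582),
    ("XGBOOST_1", "SIMPLEIMPUTEFIT_2", 0.798507),
    ("XGBOOST_4", "SIMPLEIMPUTEFIT_2", 0.791045),
    ("XGBOOST_5", "SIMPLEIMPUTEFIT_1", 0.783582),
    ("XGBOOST_7", "SIMPLEIMPUTEFIT_2", 0.798507),
    ("XGBOOST_6", "SIMPLEIMPUTEFIT_0", 0.791045),
    ("XGBOOST_9", "SIMPLEIMPUTEFIT_0", 0.776119),
    ("XGBOOST_8", "SIMPLEIMPUTEFIT_1", 0.776119),
]

# max() keeps the first row it sees among ties.
best_id, best_data, best_acc = max(rows, key=lambda row: row[2])
print(best_id, best_data, best_acc)   # XGBOOST_1 SIMPLEIMPUTEFIT_2 0.798507
```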

In [32]:
# Best identified model hyperparameters.
rs_obj.best_params_

{'input_columns': ['pclass',
  'age',
  'sibsp',
  'parch',
  'fare',
  'sex',
  'embarked'],
 'response_column': 'survived',
 'max_depth': 38,
 'lambda1': 0.46,
 'model_type': 'classification',
 'num_boosted_trees': 50,
 'shrinkage_factor': 0.33,
 'iter_num': 1000,
 'data': '"ALICE"."ml___frmqry_v_1699472167308052"'}

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Perform validation on the best model.</b>

In [33]:
# Validating the best model.
rs_obj.evaluate(newdata=val_df,
                **eval_params)



############ output_data Output ############

   SeqNum                                              Metric  MetricValue
0       3  Micro-Recall                                           0.777778
1       5  Macro-Precision                                        0.776786
2       6  Macro-Recall                                           0.776786
3       7  Macro-F1                                               0.776786
4       9  Weighted-Recall                                        0.777778
5      10  Weighted-F1                                            0.777778
6       8  Weighted-Precision                                     0.777778
7       4  Micro-F1                                               0.777778
8       2  Micro-Precision                                        0.777778
9       1  Accuracy                                               0.777778


############ result Output ############

       Prediction  Mapping  CLASS_1  CLASS_2  Precision    Recall        F1  Support


<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Perform classification using best model.</b>


In [34]:
# Predict passenger survival using the best model.
result = rs_obj.predict(newdata=val_df,
                        **eval_params)
result



############ result Output ############

   survived  passenger  Prediction  Confidence_Lower  Confidence_upper
0         1        196           1              0.82              0.82
1         1         11           0              0.52              0.52
2         0        663           0              0.52              0.52
3         1        810           1              0.90              0.90
4         0        295           0              0.96              0.96
5         1        110           1              0.70              0.70
6         0        844           0              0.64              0.64
7         1        270           1              0.90              0.90
8         0        726           0              0.94              0.94
9         0        297           0              0.74              0.74
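
Comparing the `survived` and `Prediction` columns row by row gives a quick sanity check. The sketch below does this for the ten displayed rows only, so its result describes this preview and not the full 45-row validation set.

```python
# (survived, Prediction) pairs from the ten displayed rows.
pairs = [(1, 1), (1, 0), (0, 0), (1, 1), (0, 0),
         (1, 1), (0, 0), (1, 1), (0, 0), (0, 0)]

correct = sum(actual == predicted for actual, predicted in pairs)
accuracy = correct / len(pairs)

print(correct, accuracy)   # 9 0.9
```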


<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Cleanup</b>


In [38]:
remove_context()