<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Predict Survival on the Titanic Disaster using HyperParameter Tuning
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction:</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Researchers are still drawn to the Titanic disaster, even though it happened just over a century ago, as they try to understand why some people survived while others did not. Teradata Vantage and ClearScape Analytics provide a platform for building models that predict survival. ClearScape Analytics combines multiple analytic disciplines into a single, massively scalable platform which enables unique business outcomes and more accurate analytic and predictive models. With Vantage’s advanced in-database analytics, time series functions, and AI/ML capabilities, researchers can increase their confidence in these predictions. </p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Business Value</b></p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Improve speed, performance, and time-to-value by minimizing data movement and fully integrating data for faster results and trusted outcomes.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Easily deploy new, more accurate models to production.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Deploy preferred AI/ML tools and models directly to the VantageCloud platform.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Identify patterns and reasons leading to survival of passengers.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Advance research and development based on the results of the data and models produced.</li>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Teradata? </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Traditional ML and AI development and deployment pipelines require users to manually combine various tools and techniques across the lifecycle.  This leads to lengthy, fragile, manual, error-prone processes that are, in many cases, impossible to migrate out of the lab and into production in order to realize business value. ClearScape Analytics helps to solve this “development to deployment gap” by providing highly scalable, performant, and easy-to-use analytic capabilities that address all aspects of the development lifecycle.  The same tools and techniques that data scientists use in development can be seamlessly deployed into production using the same code, platform, and operational pipeline.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A critical strategy for Vantage and ClearScape Analytics is to embrace the value and innovation in the open-source and partner ML and AI community. This provides enterprises with the most scalable option for deploying custom machine learning pipelines. Users can leverage the innovation and familiarity of a broad range of tools and techniques, with the ability to prepare and score new data in near-real-time and at any scale; allowing the products of machine learning to become pervasive across all applications, reporting tools, and consumers in an organization. </p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The goal of this demo is to create a predictive model that can identify whether or not Titanic passengers survived the ship's sinking, using Titanic passenger data. Here we are analyzing data for 891 passengers.</p>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Connect to Vantage.</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
# Import required packages.
import random
import warnings
from getpass import getpass
from teradataml import *
from teradataml.hyperparameter_tuner import *
from matplotlib import pyplot as plt

# Limit the number of rows printed when a teradataml DataFrame is displayed.
display.max_rows = 5

# Suppress warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, then use down arrow to go to next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Predict_TitanicSurvival_Hyperparameter_Tuning_Python.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage.  You have the option of either running the demo using foreign tables to access the data without using any storage on your environment or downloading the data to local storage which may yield somewhat faster execution, but there could be considerations of available storage.  There are two statements in the following cell, and one is commented out.  You may switch which mode you choose by changing the comment string.</p>   


In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_TitanicSurvival_cloud');"
 # Takes about 30 seconds
%run -i ../run_procedure.py "call get_data('DEMO_TitanicSurvival_local');"
 # Takes about 50 seconds

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Analyze the raw data set</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a DataFrame to get the data from the table created.  We'll create a virtual DataFrame to keep the data in Vantage and not copy it down to the client.</p>



In [None]:
titanic = DataFrame(in_schema("DEMO_TitanicSurvival", "Passenger_Data"))
titanic

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Check the shape of the dataset.</p>

In [None]:
# Shape of the dataframe.
titanic.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The dataset contains 891 rows and 12 columns.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Check the number of nulls in each of the columns.</p>

In [None]:
# Info about dataframe and null values.
titanic.info(null_counts=True)

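<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Compare each column's non-null count with the total row count to see which columns contain NULLs. Missing values in the "age" and "embarked" columns are imputed in section 5.</p>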
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Describe is used to find the statistics of the numeric columns in the dataframe.</p>

In [None]:
# Generates statistics for numeric columns in titanic data. 
titanic.describe()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Check the number of passengers who survived and the number of passengers who did not survive.</p>

In [None]:
# Count of survived passengers.
survived_count = titanic[titanic.survived == 1]
survived_count.shape[0]

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As per the count 342 passengers survived.</p>

In [None]:
# Count of lost passengers.
non_survived_count = titanic[titanic.survived == 0]
non_survived_count.shape[0]

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As per the count 549 passengers did not survive.</p>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Data Preparation</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Column selection.</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We drop the columns which are not needed for further analysis. </p>

In [None]:
# Dropping unwanted columns.
titanic = titanic.drop(["passengername", "ticket", "cabin"], axis=1)
titanic

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.1 Ordinal Encoding.</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>OrdinalEncodingFit function identifies distinct categorical values from an input table or a user-defined list and returns the distinct categorical values along with the ordinal value for each category.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>OrdinalEncodingFit is useful in use cases where categorical data needs to be converted into numerical data for analysis or machine learning algorithms. For example, in a dataset with categorical variables such as "color" or "size", OrdinalEncodingFit can assign a numerical value to each category, making it possible to perform mathematical operations on the data. This helps in tasks such as clustering, classification, and regression analysis.</p>
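<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the fit and transform idea concrete, here is a minimal pure-Python sketch of ordinal encoding. It is illustrative only and is not the teradataml implementation: the helper names <code>ordinal_fit</code> and <code>ordinal_transform</code>, the sample values, the sorted category order, and the default of -1 for null or unseen values are all assumptions made for this example.</p>

```python
# Illustrative sketch of ordinal encoding; not the teradataml implementation.

def ordinal_fit(values):
    """Map each distinct non-null category to an integer, in sorted order (assumed order)."""
    categories = sorted({v for v in values if v is not None})
    return {category: index for index, category in enumerate(categories)}


def ordinal_transform(values, mapping, default=-1):
    """Replace each category with its ordinal; null or unseen values get `default` (assumed)."""
    return [mapping.get(v, default) for v in values]


# Hypothetical sample values for a 'sex' column.
sex = ["male", "female", "female", "male", None]

mapping = ordinal_fit(sex)
encoded = ordinal_transform(sex, mapping)

print(mapping)   # {'female': 0, 'male': 1}
print(encoded)   # [1, 0, 0, 1, -1]
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Separating the fit step from the transform step means the same mapping can be reused on data that was not seen when the mapping was learned.</p>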

In [None]:
# Perform OrdinalEncoding for the 'sex' and 'embarked' columns.
ordinal_obj = OrdinalEncodingFit(target_column=['sex', 'embarked'],
                                 data=titanic)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The OrdinalEncodingTransform function maps each categorical value to a specified ordinal value using the OrdinalEncodingFit output.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
OrdinalEncodingTransform follows this process:</p>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'>Select the table and columns to be encoded with the OrdinalEncodingFit function.</li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'>Use OrdinalEncodingTransform to map each category value to its ordinal value, using the OrdinalEncodingFit output.</li>


In [None]:
# Transforming the encoded data.
df = ordinal_obj.transform(data=titanic,
                           accumulate=['passenger', 'survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']).result

df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
As we can observe, the categorical values in the 'sex' column are converted into numeric values: female is replaced by 0 and male is replaced by 1. The 'embarked' column is encoded in the same way.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.2 Train-Test split.</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A train-test split simulates how a model would perform on new data. Here we use the DataFrame sample() method to divide the dataset into two subsets. The first subset (95% of the rows) is used to train the model. The second subset (5% of the rows) is used to make predictions and compare the predictions to actual values.</p>
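<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea behind a random 95/5 split can be sketched in plain Python as shown below. This only illustrates the concept on a list of row identifiers; the helper name <code>split_ids</code> and the seed are assumptions for the example, and the in-database sampling used in this notebook may assign rows differently.</p>

```python
import random


def split_ids(row_ids, holdout_frac=0.05, seed=42):
    """Shuffle the ids and hold out `holdout_frac` of them (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_frac)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# 891 passenger ids, matching the number of rows analyzed in this demo.
passenger_ids = range(1, 892)
train_ids, val_ids = split_ids(passenger_ids)

print(len(train_ids), len(val_ids))    # 846 45
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Every row lands in exactly one of the two subsets, so the held-out rows are never seen during training.</p>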

In [None]:
# Sample 5% of data for model validation.
df_sample = df.sample(frac=[0.95, 0.05], randomize=True)
df_sample

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Train dataset is created using sampleid = 1.</p>

In [None]:
# Create train dataset from sample 1 by filtering on "sampleid" and drop the "sampleid" column as it is not required for training the model.
data_train = df_sample[df_sample.sampleid == "1"].drop("sampleid", axis = 1)
data_train

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Validation dataset is created using sampleid = 2.</p>

In [None]:
# Create validation dataset from sample 2 by filtering on "sampleid" and drop the "sampleid" column as it is not required for validating the model.
data_val = df_sample[df_sample.sampleid == "2"].drop("sampleid", axis = 1)
data_val

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Hyper-Parametrization of SimpleImpute.</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>GridSearch is an exhaustive search algorithm that covers all possible parameter values to identify optimal hyperparameters. It works for teradataml analytic functions from Analytics Database, BYOM, VAL, and UAF features.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
teradataml GridSearch allows you to perform hyperparameter tuning for all model trainer and non-model trainer functions.</p>
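<p style = 'font-size:16px;font-family:Arial;color:#00233C'>An exhaustive search simply enumerates the Cartesian product of all candidate values. The sketch below shows that idea in plain Python. It is not how teradataml implements GridSearch internally; the helper name <code>expand_grid</code> is hypothetical, and treating tuple values as search axes while keeping other values fixed is an assumption modeled on how the parameter dictionaries in this notebook are written.</p>

```python
from itertools import product


def expand_grid(params):
    """Yield one dict per combination; tuple values are treated as search axes (assumption)."""
    axes = {k: v for k, v in params.items() if isinstance(v, tuple)}
    fixed = {k: v for k, v in params.items() if not isinstance(v, tuple)}
    names = list(axes)
    for combo in product(*(axes[name] for name in names)):
        yield {**fixed, **dict(zip(names, combo))}


# A small grid shaped like the imputation parameters used in this section.
grid = {"stats_columns": ["age", "embarked"],
        "stats": ("median", "mean", "mode")}

combos = list(expand_grid(grid))
for combo in combos:
    print(combo)

print(len(combos))   # 3
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because every combination is executed, the cost of an exhaustive search grows multiplicatively with each additional tuned parameter.</p>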
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
When used for model trainer functions:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Based on evaluation metrics, the search determines the best model.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>All methods and properties can be used.</li>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>When used for non-model trainer functions:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>You choose the output that best fits your use case.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Only the fit() method is supported.</li>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>teradataml GridSearch also allows you to use input data as a hyperparameter. This option is suitable when you want to identify the best model across a set of input datasets. When you pass a set of datasets as a hyperparameter for a model trainer function, the search determines the best dataset along with the best model based on the evaluation metrics.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>GridSearch offers hyper-parameterization for non-model trainer functions. The "age" and "embarked" columns contain NULL ('NaN') values, so we impute them with a summary statistic such as the mean, median, or mode, and use the imputed data to build the best model.</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Define Hyperparameters for SimpleImputeFit </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>GridSearch performs imputation on "data_train" for each specified combination of parameters and returns the imputed data.</p>
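<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The three imputation variants differ only in the statistic used to fill missing values. The following plain-Python sketch shows that difference on a small hypothetical list of ages. The helper name <code>impute</code> and the sample values are assumptions for illustration; this is not the SimpleImputeFit implementation.</p>

```python
from statistics import mean, median, mode


def impute(values, stat):
    """Fill None entries with the chosen statistic of the observed values (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[stat](observed)
    return [fill if v is None else v for v in values]


# Hypothetical ages with two missing entries.
ages = [22, 38, None, 35, 22, None]

imputed = {stat: impute(ages, stat) for stat in ("median", "mean", "mode")}
for stat, filled in imputed.items():
    print(stat, filled)
# median [22, 38, 28.5, 35, 22, 28.5]
# mean [22, 38, 29.25, 35, 22, 29.25]
# mode [22, 38, 22, 35, 22, 22]
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each statistic produces a slightly different dataset, which is why the imputed datasets can themselves be treated as a hyperparameter in the model search.</p>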

In [None]:
si_params = {"data":data_train,
            "stats_columns":["age", "embarked"],
            "stats":("median", "mean", "mode")}

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Perform GridSearch on SimpleImputeFit function.</p>

In [None]:
si_gs_obj = GridSearch(func=SimpleImputeFit, params=si_params)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The fit() method runs the teradataml analytic function for every selected set of hyperparameters. The search algorithm determines which sets from the parameter grid are executed.</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>For model trainer functions, the best parameters are selected based on the training results.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>For non-model trainer functions, the first executed parameter set is selected as the best parameters.</li>

In [None]:
# Start the imputation task.
si_gs_obj.fit()

In [None]:
# Imputation task metadata shows three variants of imputation results.
si_gs_obj.models

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As seen above the models property returns the generated models metadata.</p>

In [None]:
models = si_gs_obj.models

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The transform function of each model is then applied to the data.</p>

In [None]:
# Perform SimpleImpute transform and structure the data in dictionary format with labels.
imputed_data = dict((model, si_gs_obj.get_model(model).transform(data = df,
                    accumulate=['passenger', 'survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']).result) \
                    for model in models["MODEL_ID"])
imputed_data

In [None]:
# SimpleImpute performed on validation data.
si_obj_val = SimpleImputeFit(data=data_val, stats_columns=["age", "embarked"], stats="mean")
val_df = si_obj_val.transform(data=data_val).result


<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Hyperparameter-Tuning to create optimal predictive model.</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The RandomSearch algorithm performs random sampling on the hyperparameter space to identify optimal hyperparameters. It works for teradataml analytic functions from Analytics Database, BYOM, VAL, and UAF features.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>teradataml RandomSearch allows you to perform hyperparameter tuning for all model trainer and non-model trainer functions.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>When used for model trainer functions:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Based on evaluation metrics, the search determines the best model.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>All methods and properties can be used.</li>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>When used for non-model trainer functions:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>You choose the output that best fits your use case.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Only the fit() method is supported.</li>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Define a hyperparameter space of 4000 parameter combinations (10 x 10 x 10 x 4) for the XGBoost model. Any combination within this hyperparameter space can be selected for the hyperparameter tuning task.</p>
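<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The figure of 4000 is the product of the number of candidate values for each tuned parameter in this notebook: 10 values each for max_depth, lambda1, and shrinkage_factor, and 4 values for iter_num. A quick check, assuming all randomly drawn values are distinct (duplicates among the random draws would reduce the number of unique combinations):</p>

```python
from math import prod

# Number of candidate values per tuned hyperparameter in this notebook.
values_per_parameter = {
    "max_depth": 10,
    "lambda1": 10,
    "shrinkage_factor": 10,
    "iter_num": 4,
}

# Single-valued parameters (for example num_boosted_trees) do not multiply the space.
total_combinations = prod(values_per_parameter.values())
print(total_combinations)   # 4000
```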

In [None]:
XGB_params = {"input_columns":['pclass', 'age', 'sibsp', 'parch', 'fare', 'sex', 'embarked'],
              "response_column" : 'survived',
              "max_depth":tuple(random.randrange(3, 50) for i in range(10)),
              "lambda1" : tuple(round(random.uniform(0.001, 1.0), 3) for i in range(10)),
              "model_type" : "classification",
              "num_boosted_trees": 50,
              "shrinkage_factor":tuple(round(random.uniform(0.001, 1.0), 3) for i in range(10)),
              "iter_num":( 35,40,45,50)}


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Define the evaluation parameters used for model evaluation.</p>

In [None]:
eval_params = {"id_column": "passenger",
               "model_type": "classification",
               "accumulate": "survived",
               "object_order_column": ['task_index', 'tree_num', 'iter', 'class_num', 'tree_order']}

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Initialize RandomSearch for the XGBoost model. Although the hyperparameter space contains 4000 combinations, only "n_iter" combinations are selected at random, and this selected set is used for model optimization.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Note: Chosen hyperparameter combinations are used on hyper-parameterized data for model optimization.</p>
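<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Random selection of "n_iter" combinations from a larger space can be sketched as follows. This is a conceptual illustration, not the RandomSearch internals: the helper name <code>sample_combinations</code>, the small example space, the seed, and sampling without replacement are all assumptions.</p>

```python
import random
from itertools import product


def sample_combinations(space, n_iter, seed=0):
    """Pick `n_iter` distinct combinations at random from the full grid (hypothetical helper)."""
    names = list(space)
    all_combos = list(product(*(space[name] for name in names)))
    rng = random.Random(seed)                 # fixed seed so the sketch is repeatable
    chosen = rng.sample(all_combos, n_iter)   # sampling without replacement (assumption)
    return [dict(zip(names, combo)) for combo in chosen]


# A deliberately small, hypothetical space: 3 x 3 x 4 = 36 combinations.
space = {"max_depth": (3, 5, 8),
         "shrinkage_factor": (0.1, 0.3, 0.5),
         "iter_num": (35, 40, 45, 50)}

candidates = sample_combinations(space, n_iter=4)
for candidate in candidates:
    print(candidate)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Only the sampled candidates are trained, which keeps the run time bounded by "n_iter" rather than by the size of the full space.</p>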

In [None]:
rs_obj = RandomSearch(func=XGBoost, params=XGB_params, n_iter=4)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The fit() method runs the teradataml analytic function for every selected set of hyperparameters. The search algorithm determines which sets from the parameter grid are executed.</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>For model trainer functions, the best parameters are selected based on the training results.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>For non-model trainer functions, the first executed parameter set is selected as the best parameters.</li>
<p style = 'font-size:14px;font-family:Arial;color:#00233C'><b><i>Note: Because this step trains every selected model variation, it may take around 3-4 minutes.</i></b></p>

In [None]:
# Start the RandomSearch optimization.
rs_obj.fit(data=imputed_data,
           verbose=1, frac=0.85,run_parallel=False,
           **eval_params
            )

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The trained model metadata shows 4 models built for each of the 3 imputed datasets, so RandomSearch generates 12 models in total.</p>

In [None]:
rs_obj.models

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>RandomSearch provides several properties for analyzing the models created by the fit() method. We will use some of them to get details about the models created.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Check <b>model_stats:</b> The model_stats property returns a pandas DataFrame representing the model statistics of the model with best score.</p>

In [None]:
# RandomSearch model stats for XGBoost.
rs_obj.model_stats

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Check <b>best_model_id:</b> The best_model_id property returns a string representing the model id of the model with the best score.</p>

In [None]:
# Best identified XGBoost model id.
rs_obj.best_model_id

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Check <b>best_data_id:</b> The best_data_id property returns a string representing the "data_id" of a sampled data used for training the best model.</p>

In [None]:
# Best identified data id.
rs_obj.best_data_id

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Check <b>best_score_:</b> The best_score_ property returns a string representing the best score of the model out of all generated models.</p>

In [None]:
# Best identified model score.
rs_obj.best_score_

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Check <b>best_params_:</b> The best_params_ property returns a dictionary of the parameters used for the model with the best score.</p>

In [None]:
# Best identified model hyperparameters.
rs_obj.best_params_

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. Perform validation on the best model.</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The evaluate() method is used for evaluation using trained models from Analytics Database, VAL, and UAF features. Evaluation is done using the default trained model.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Validating the best model.</p>

In [None]:
rs_obj.evaluate(newdata=val_df,
                **eval_params)


<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>8. Perform classification using best model.</b></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The predict() method is used to generate predictions using trained models from Analytics Database, VAL, and UAF features. Predictions are made using the default trained model.</p>

In [None]:
# Predict passenger survival using the best model.
result = rs_obj.predict(newdata=val_df,
                        **eval_params)
df_pred=result.result
df_pred

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>9. Evaluate the model</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ClassificationEvaluator() function computes various metrics of a classification model based on its predictions on the data. Apart from accuracy, the secondary output data returns micro-, macro-, and weighted-averaged precision, recall, and F1-score values.</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The function works for multi-class scenarios as well. In any case, the primary output data contains class-level metrics, whereas the secondary output data contains metrics that apply across classes.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The function works only when the columns specified in 'observation_column' and 'prediction_column' have the same Teradata types.</li>
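<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For readers who want to see how class-level and cross-class metrics relate, here is a small pure-Python sketch of precision, recall, F1, accuracy, and macro-averaged F1. The helper name <code>class_metrics</code> and the label lists are hypothetical; this illustrates the standard formulas and is not the ClassificationEvaluator implementation.</p>

```python
def class_metrics(actual, predicted, label):
    """Return (precision, recall, f1) for one class label (hypothetical helper)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical observed and predicted labels, stored as strings of the same type.
actual    = ['0', '0', '1', '1', '1', '0']
predicted = ['0', '1', '1', '1', '0', '0']

# Class-level metrics, one entry per label.
per_class = {label: class_metrics(actual, predicted, label) for label in ('0', '1')}

# Metrics that apply across classes.
accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)

print(per_class)
print(round(accuracy, 4), round(macro_f1, 4))   # 0.6667 0.6667
```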

In [None]:
df_pred_1=df_pred
df_pred_1 = df_pred_1.assign(Prediction = df_pred_1.Prediction.cast(type_ = VARCHAR(2)))
df_pred_1 = df_pred_1.assign(survived = df_pred_1.survived.cast(type_ = VARCHAR(2)))
df_pred_1

In [None]:
ClassificationEvaluator_obj = ClassificationEvaluator(
                                                        data = df_pred_1,
                                                        observation_column = 'survived',
                                                        prediction_column = 'Prediction',
                                                        labels = ['0','1'])
df_result=ClassificationEvaluator_obj.result
df_result

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Show AUC-ROC Curve</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/search/all?query=TD_ROC&content-lang=en-US'>ROC</a> curve shows the performance of a binary classification model as its discrimination threshold varies. For a range of thresholds, the curve plots the true positive rate against false-positive rate.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This function accepts a set of prediction-actual pairs as input and calculates the following values for a range of discrimination thresholds.</p>
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
        <li>True-positive rate (TPR)</li>
        <li>False-positive rate (FPR)</li>
        <li>The area under the ROC curve (AUC)</li>
        <li>Gini coefficient</li>
        <li>Other details are mentioned in the documentation</li>
    </ul>
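<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The quantities listed above can be computed by hand for a tiny example. The sketch below builds the ROC points threshold by threshold, integrates them with the trapezoid rule, and derives the Gini coefficient as 2 x AUC - 1. The helper names and the four sample scores are assumptions for illustration. Note that a ROC curve is most informative when the input is a probability or score rather than a hard 0/1 prediction.</p>

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) points, one per distinct threshold from highest to lowest."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the given points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical actual labels and predicted scores.
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(actual, scores)
auc = auc_trapezoid(points)
gini = 2 * auc - 1

print(points)      # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc, gini)   # 0.75 0.5
```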



In [None]:
from sklearn import metrics
df_cm=df_pred.to_pandas()
fpr, tpr, thresholds = metrics.roc_curve(df_cm['survived'], df_cm['Prediction'])
auc = metrics.auc(fpr, tpr)
print("AUC="+str(auc))

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Plot the predictions.</b></p>

In [None]:
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.legend(loc=4)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("AUC-ROC Curve")
plt.show()

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Show Confusion Matrix</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The confusion matrix compares the actual and predicted values. It shows how many passengers who survived, and how many who did not survive the Titanic disaster, were classified correctly or incorrectly by the model.</p>
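<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The layout of a binary confusion matrix is easy to misread, so here is a minimal pure-Python sketch on hypothetical labels. It follows the convention that rows are actual classes and columns are predicted classes, with labels in sorted order, so index 0 corresponds to class 0 (did not survive) and index 1 to class 1 (survived). The helper name <code>confusion_counts</code> is an assumption for this example.</p>

```python
def confusion_counts(actual, predicted, labels=(0, 1)):
    """Count (actual, predicted) pairs; rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical labels: 0 = did not survive, 1 = survived.
actual    = [0, 0, 0, 1, 1, 0, 1, 1]
predicted = [0, 0, 1, 1, 0, 0, 1, 1]

cm = confusion_counts(actual, predicted)
print(cm)   # [[3, 1], [1, 3]]

# Row 0: actual non-survivors; row 1: actual survivors.
correct_non_survivors = cm[0][0] / sum(cm[0])
correct_survivors = cm[1][1] / sum(cm[1])
print(correct_non_survivors, correct_survivors)   # 0.75 0.75
```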


In [None]:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# confusion_matrix sorts the labels, so row/column 0 is class 0 (did not survive)
# and row/column 1 is class 1 (survived).
class_labels = ['Did not Survive', 'Survived']

# Compute confusion matrix
cm = confusion_matrix(df_cm['survived'], df_cm['Prediction'])

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = class_labels)
fig, ax = plt.subplots(figsize = (8, 8))
disp.plot(ax = ax, cmap = 'Blues', colorbar = True)

# Add labels and annotations
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.xticks(ticks = [0, 1], labels = class_labels)
plt.yticks(ticks = [0, 1], labels = class_labels)

# Add text to the plot to show the actual values of the confusion matrix
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, f'{cm[i, j]}', ha = 'center', va = 'center', color = 'white' if cm[i, j] > cm.max()/1.4 else 'black')

# Remove grid lines
ax.grid(False)

# Show the plot
plt.show()

print(f'''
This means that out of all the actual did-not-survive cases ({cm[0][0] + cm[0][1]}),
{round(cm[0][0]/(cm[0][0] + cm[0][1])*100, 2)}% were correctly classified as did not survive, while
{round(cm[0][1]/(cm[0][0] + cm[0][1])*100, 2)}% were incorrectly classified as survived.
Similarly, out of all the actual survived cases ({cm[1][0] + cm[1][1]}),
{round(cm[1][1]/(cm[1][0] + cm[1][1])*100, 2)}% were correctly classified as survived, while
{round(cm[1][0]/(cm[1][0] + cm[1][1])*100, 2)}% were incorrectly classified as did not survive.
''')

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Using hyperparameter tuning, we selected the best data preparation variant and the best trained model for the data, and used them to predict the target variable. Vantage's easy-to-use analytic and AI/ML capabilities help researchers and data scientists choose the best model and produce more accurate predictions. </p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>10. Cleanup</b></p>


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_TitanicSurvival');" 
#Takes 40 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>Required Materials</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let’s look at the elements we have available for reference for this notebook:</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Dataset</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'>Passenger Data </p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Survived: Survival (0 = No, 1 = Yes)</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Passenger: Unique ID of each passenger (integer)</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Sex: Sex ('male', 'female')</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Age: Age in years</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>SibSp: Number of siblings / spouses aboard the Titanic</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Parch: Number of parents / children aboard the Titanic</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Ticket: Ticket number</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Fare: Passenger fare</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Cabin: Cabin number</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)</li>
<p></p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Filters:</b></p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Industry:</b> Travel and Transportation</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Functionality:</b> Hyperparameter Tuning</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Use Case:</b> Titanic Survival Prediction</li>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Related Resources:</b></p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://www.teradata.com/blogs/nps-is-a-metric-not-the-goal'>In the fight to improve customer experience, NPS is a metric, not the goal</a></li>



<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023,2024. All Rights Reserved
        </div>
    </div>
</footer>