# <b><span style='color:#F1A424'>AutoML - Binary Classification - Titanic Survival Prediction</span> </b> 

### Disclaimer
Please note, the Vantage Functions via SQLAlchemy feature is a preview/beta code release with limited functionality (the “Code”). As such, you acknowledge that the Code is experimental in nature and that the Code is provided “AS IS” and may not be functional on any machine or in any environment. TERADATA DISCLAIMS ALL WARRANTIES RELATING TO THE CODE, EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES AGAINST INFRINGEMENT OF THIRD-PARTY RIGHTS, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

TERADATA SHALL NOT BE RESPONSIBLE OR LIABLE WITH RESPECT TO ANY SUBJECT MATTER OF THE CODE UNDER ANY CONTRACT, NEGLIGENCE, STRICT LIABILITY OR OTHER THEORY 
    (A) FOR LOSS OR INACCURACY OF DATA OR COST OF PROCUREMENT OF SUBSTITUTE GOODS, SERVICES OR TECHNOLOGY, OR 
    (B) FOR ANY INDIRECT, INCIDENTAL OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO LOSS OF REVENUES AND LOSS OF PROFITS. TERADATA SHALL NOT BE RESPONSIBLE FOR ANY MATTER BEYOND ITS REASONABLE CONTROL.

Notwithstanding anything to the contrary: 
    (a) Teradata will have no obligation of any kind with respect to any Code-related comments, suggestions, design changes or improvements that you elect to provide to Teradata in either verbal or written form (collectively, “Feedback”), and 
    (b) Teradata and its affiliates are hereby free to use any ideas, concepts, know-how or techniques, in whole or in part, contained in Feedback: 
        (i) for any purpose whatsoever, including developing, manufacturing, and/or marketing products and/or services incorporating Feedback in whole or in part, and 
        (ii) without any restrictions or limitations, including requiring the payment of any license fees, royalties, or other consideration. 

## <b> Problem overview - Binary Classification </b>
    


The Titanic dataset is a well-known dataset in the field of machine learning and data science. It contains information about passengers aboard the RMS Titanic, including whether they survived or not. The dataset is often used for predictive modeling and classification tasks. Here are some key details about the Titanic dataset:

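**Features** (the teradataml example table loaded below stores these columns in lowercase, with `passenger` as the identifier):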

- `PassengerId`: Unique identifier for each passenger.
- `Pclass`: Ticket class (1st, 2nd, or 3rd).
- `Name`: Passenger's name.
- `Sex`: Passenger's gender (male or female).
- `Age`: Passenger's age.
- `SibSp`: Number of siblings or spouses aboard.
- `Parch`: Number of parents or children aboard.
- `Ticket`: Ticket number.
- `Fare`: Fare paid for the ticket.
- `Cabin`: Cabin number.
- `Embarked`: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

**Target Variable**:

- `Survived`: Binary variable indicating whether the passenger survived (1) or not (0).
        
**Objective**:

The main objective is typically to build a predictive model that can accurately predict whether a passenger survived based on the available features.

**Challenges**:

- Missing data in columns such as `Age`, `Cabin`, and `Embarked`.
- Exploring feature engineering techniques to improve model performance (`feature exploration and engineering`).
- Understanding which passenger demographics and characteristics influenced survival (`model training`).

**Use case**:

- Here, we use the AutoML (Automated Machine Learning) functionality of teradataml to automate the entire process of developing a predictive model.
- In a single automated run, it performs `feature exploration`, `feature engineering`, `data preparation`, `model training`, and `model evaluation` on the dataset, and at the end produces a `leaderboard` containing the different models along with their performance.
- Each model also has a `rank`, which identifies the `best performing model` for the given data, followed by the other models.
- This notebook starts from models that were already deployed to tables, loads them, and uses them for prediction and evaluation.
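
The workflow described above can be sketched as follows. This is a minimal, hedged sketch and not the code that produced the tables loaded later in this notebook. It assumes an active Vantage connection, uses the `titanic` example table with `survived` as the target column, and treats the `task_type` value and the `top_n` argument of `deploy` as assumptions about the installed teradataml version. The helper name `train_and_deploy` is hypothetical.

```python
def train_and_deploy(table_name="top_10_models", top_n=10):
    """Fit AutoML on the Titanic example data and persist the best models.

    Hypothetical helper: it needs teradataml and an active create_context()
    connection, so the import happens inside the function.
    """
    from teradataml import AutoML, DataFrame, load_example_data

    # Load the example table that ships with teradataml.
    load_example_data("teradataml", "titanic")
    titanic = DataFrame("titanic")

    # One call runs feature exploration, feature engineering,
    # data preparation, model training and model evaluation.
    aml = AutoML(task_type="Classification")
    aml.fit(titanic, "survived")

    # The leaderboard lists the trained models ordered by rank.
    leaderboard = aml.leaderboard()

    # Assumed signature: save the top_n ranked models to a table
    # so that a later AutoML().load(table_name) can reuse them.
    aml.deploy(table_name, top_n=top_n)
    return leaderboard
```

Calling `train_and_deploy()` once would leave a table that the `load` calls in the following sections can read, provided the assumptions above hold.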

In [1]:
# Importing AutoML from teradataml
from teradataml import AutoML, AutoClassifier, AutoRegressor

In [2]:
# Importing other important libraries
import getpass
from teradataml import create_context, remove_context
from teradataml import DataFrame
from teradataml import load_example_data
from teradataml import TrainTestSplit

In [3]:
# Create the connection.
host = getpass.getpass("Host: ")
username = getpass.getpass("Username: ")
password = getpass.getpass("Password: ")

con = create_context(host=host, username=username, password=password)

Host:  ········
Username:  ········
Password:  ········


## <b><span style='color:#F1A424'>| 1.</span> Loading Deployed Models - 'top_10_models' </b>

### <b><span style='color:#F1A424'>| 1.1.</span> Loading Model </b>

In [4]:
# Creating AutoML object

aml = AutoML()

In [5]:
# Loading models

models_1 = aml.load('top_10_models')

In [6]:
# Display loaded models

models_1

Unnamed: 0,RANK,MODEL_ID,FEATURE_SELECTION,ACCURACY,MICRO-PRECISION,MICRO-RECALL,MICRO-F1,MACRO-PRECISION,MACRO-RECALL,MACRO-F1,WEIGHTED-PRECISION,WEIGHTED-RECALL,WEIGHTED-F1,DATA_TABLE
0,1,DECISIONFOREST_3,lasso,0.816,0.816,0.816,0.816,0.80699,0.816058,0.810118,0.8213,0.816,0.817337,ml__survived_lasso_1723106082106494
1,2,DECISIONFOREST_0,lasso,0.808,0.808,0.808,0.808,0.803885,0.787728,0.793616,0.806734,0.808,0.805385,ml__survived_lasso_1723106082106494
2,3,DECISIONFOREST_1,rfe,0.808,0.808,0.808,0.808,0.803885,0.787728,0.793616,0.806734,0.808,0.805385,ml__survived_rfe_1723106095613226
3,4,XGBOOST_2,pca,0.792,0.792,0.792,0.792,0.781821,0.781821,0.781821,0.792,0.792,0.792,ml__survived_pca_1723105437373421
4,5,XGBOOST_0,lasso,0.784,0.784,0.784,0.784,0.775751,0.786117,0.778325,0.792884,0.784,0.785986,ml__survived_lasso_1723106082106494
5,6,XGBOOST_3,lasso,0.784,0.784,0.784,0.784,0.775751,0.786117,0.778325,0.792884,0.784,0.785986,ml__survived_lasso_1723106082106494
6,7,XGBOOST_1,rfe,0.784,0.784,0.784,0.784,0.775751,0.786117,0.778325,0.792884,0.784,0.785986,ml__survived_rfe_1723106095613226
7,8,DECISIONFOREST_2,pca,0.728,0.728,0.728,0.728,0.714403,0.711063,0.712527,0.726245,0.728,0.726933,ml__survived_pca_1723105437373421
8,9,GLM_2,pca,0.712,0.712,0.712,0.712,0.711222,0.665279,0.669118,0.711555,0.712,0.694847,ml__survived_pca_1723105437373421
9,10,GLM_0,lasso,0.704,0.704,0.704,0.704,0.702632,0.655075,0.657636,0.7032,0.704,0.68485,ml__survived_lasso_1723106082106494


In [7]:
# Loading dataset for prediction

load_example_data('teradataml','titanic')
df = DataFrame('titanic')



In [8]:
# Display data

df

passenger,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
732,0,3,"Hassan, Mr. Houssein G N",male,11.0,0,0,2699,18.7875,,C
650,1,3,"Stanley, Miss. Amy Zillah Elsie",female,23.0,0,0,CA. 2314,7.55,,S
446,1,1,"Dodge, Master. Washington",male,4.0,0,2,33638,81.8583,A34,S
711,1,1,"Mayne, Mlle. Berthe Antonine (""Mrs de Villiers"")",female,24.0,0,0,PC 17482,49.5042,C90,C
444,1,2,"Reynaldo, Ms. Encarnacion",female,28.0,0,0,230434,13.0,,S
709,1,1,"Cleaver, Miss. Alice",female,22.0,0,0,113781,151.55,,S
99,1,2,"Doling, Mrs. John T (Ada Julia Bone)",female,34.0,0,1,231919,23.0,,S
385,0,3,"Plotcharsky, Mr. Vasil",male,,0,0,349227,7.8958,,S
305,0,3,"Williams, Mr. Howard Hugh ""Harry""",male,,0,0,A/5 2466,8.05,,S
326,1,1,"Young, Miss. Marie Grice",female,36.0,0,0,PC 17760,135.6333,C32,C


### <b><span style='color:#F1A424'>| 1.2.</span> Generating Prediction & Performance Metrics</b>

In [9]:
# Generate predictions for a slice of the data (df.iloc[:80]) using the model at rank 1

prediction = aml.predict(df.iloc[:80], rank=1)

Generating prediction using:
Model Name: DECISIONFOREST
Feature Selection: lasso
Completed: ｜⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿｜ 100% - 10/10           

In [10]:
prediction

id,prediction,prob_1,prob_0,survived
392,0,0.35,0.65,0
520,0,0.4,0.6,0
280,1,0.55,0.45,0
256,1,0.85,0.15,1
424,1,0.8,0.2,1
352,1,0.8,0.2,1
8,0,0.25,0.75,0
408,0,0.5,0.5,0
504,0,0.4,0.6,0
112,0,0.3,0.7,0
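
As a quick sanity check on the rows displayed above, the sketch below copies them into plain Python and tests two things: whether the `prediction` column is consistent with picking class 1 only when `prob_1` is strictly greater than `prob_0`, and which displayed rows disagree with the true `survived` label. The strict-inequality rule is an assumption inferred from these ten rows (including the 0.5/0.5 tie for id 408), not documented behavior.

```python
# Rows copied from the prediction output above:
# (id, prediction, prob_1, prob_0, survived)
rows = [
    (392, 0, 0.35, 0.65, 0),
    (520, 0, 0.40, 0.60, 0),
    (280, 1, 0.55, 0.45, 0),
    (256, 1, 0.85, 0.15, 1),
    (424, 1, 0.80, 0.20, 1),
    (352, 1, 0.80, 0.20, 1),
    (8, 0, 0.25, 0.75, 0),
    (408, 0, 0.50, 0.50, 0),
    (504, 0, 0.40, 0.60, 0),
    (112, 0, 0.30, 0.70, 0),
]

# Assumed rule: predict 1 only when prob_1 is strictly greater than prob_0.
labels_match_rule = all(pred == int(p1 > p0) for _, pred, p1, p0, _ in rows)

# Ids of the displayed rows where the prediction disagrees with the true label.
misclassified = [pid for pid, pred, _, _, actual in rows if pred != actual]

print(labels_match_rule)
print(misclassified)
```

Only the ten displayed rows are checked here; the full prediction table would have to be pulled from Vantage to check every row.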


In [11]:
# Generate performance metrics

performance_metric = aml.evaluate(df.iloc[:80], rank=1)

Generating performance metrics using:
Model Name: DECISIONFOREST
Feature Selection: lasso
Completed: ｜⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿｜ 100% - 10/10           

In [12]:
performance_metric


############ output_data Output ############

   SeqNum              Metric  MetricValue
0       3        Micro-Recall     0.772152
1       5     Macro-Precision     0.766801
2       6        Macro-Recall     0.761528
3       7            Macro-F1     0.763630
4       9     Weighted-Recall     0.772152
5      10         Weighted-F1     0.771016
6       8  Weighted-Precision     0.770893
7       4            Micro-F1     0.772152
8       2     Micro-Precision     0.772152
9       1            Accuracy     0.772152


############ result Output ############

       Prediction  Mapping  CLASS_1  CLASS_2  Precision    Recall        F1  Support
SeqNum                                                                              
1               1  CLASS_2        8       23   0.741935  0.696970  0.718750       33
0               0  CLASS_1       38       10   0.791667  0.826087  0.808511       46
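
The two tables above are linked by simple arithmetic. The sketch below recomputes several of the reported metrics from the confusion counts in the `result` table, reading its rows as predicted labels and its `CLASS_1` / `CLASS_2` columns as the actual labels 0 / 1. That reading is an interpretation of the output layout; it is the one under which the column sums equal the `Support` values.

```python
# Counts copied from the "result Output" table above.
# Outer key: predicted label. Inner key: actual label (CLASS_1 -> 0, CLASS_2 -> 1).
confusion = {
    0: {0: 38, 1: 10},  # predicted 0: 38 actually 0, 10 actually 1
    1: {0: 8, 1: 23},   # predicted 1: 8 actually 0, 23 actually 1
}
classes = [0, 1]
total = sum(sum(row.values()) for row in confusion.values())


def per_class_metrics(label):
    """Return (precision, recall, f1, support) for one class."""
    true_pos = confusion[label][label]
    predicted = sum(confusion[label].values())           # rows predicted as label
    support = sum(confusion[p][label] for p in classes)  # rows actually label
    precision = true_pos / predicted
    recall = true_pos / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


metrics = {label: per_class_metrics(label) for label in classes}

# Accuracy: correct predictions over all evaluated rows.
accuracy = sum(confusion[c][c] for c in classes) / total
# Macro average: unweighted mean over classes.
macro_f1 = sum(metrics[c][2] for c in classes) / len(classes)
# Weighted average: mean over classes weighted by support.
weighted_precision = sum(metrics[c][0] * metrics[c][3] for c in classes) / total

print(round(accuracy, 6), round(macro_f1, 6), round(weighted_precision, 6))
```

Because every row receives exactly one predicted label, the micro-averaged precision, recall and F1 all equal the accuracy, which is why those four values are identical in the `output_data` table.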


In [13]:
# Generate prediction using data and model rank

prediction = aml.predict(df.iloc[:80], rank=4)

Generating prediction using:
Model Name: XGBOOST
Feature Selection: pca
Completed: ｜⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿｜ 100% - 10/10           

In [14]:
prediction

id,Prediction,Prob_1,Prob_0,survived
344,1,0.8487479996791107,0.1512520003208892,0
424,1,0.977709499729398,0.022290500270602,1
584,0,0.04215793428692,0.95784206571308,0
352,1,0.980637563330441,0.0193624366695589,1
216,1,0.8487479996791107,0.1512520003208892,0
408,1,0.8848851971704411,0.1151148028295588,0
32,1,0.9765966832127928,0.0234033167872073,1
456,1,0.94612917208794,0.0538708279120598,1
272,1,0.7508947504578759,0.2491052495421241,0
392,0,0.4317393879390012,0.5682606120609988,0


In [15]:
# Generate performance metrics

performance_metric = aml.evaluate(df.iloc[:80], rank=4)

Generating performance metrics using:
Model Name: XGBOOST
Feature Selection: pca
Completed: ｜⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿｜ 100% - 10/10           

In [16]:
performance_metric


############ output_data Output ############

   SeqNum              Metric  MetricValue
0       3        Micro-Recall     0.746835
1       5     Macro-Precision     0.750321
2       6        Macro-Recall     0.756917
3       7            Macro-F1     0.745817
4       9     Weighted-Recall     0.746835
5      10         Weighted-F1     0.748465
6       8  Weighted-Precision     0.765425
7       4            Micro-F1     0.746835
8       2     Micro-Precision     0.746835
9       1            Accuracy     0.746835


############ result Output ############

       Prediction  Mapping  CLASS_1  CLASS_2  Precision    Recall        F1  Support
SeqNum                                                                              
0               0  CLASS_1       32        6   0.842105  0.695652  0.761905       46
1               1  CLASS_2       14       27   0.658537  0.818182  0.729730       33


## <b><span style='color:#F1A424'>| 2.</span> Loading Deployed Models - 'mixed_models' </b>

### <b><span style='color:#F1A424'>| 2.1.</span> Loading Model </b>

In [17]:
# Loading models

models_2 = aml.load('mixed_models')

In [18]:
models_2

Unnamed: 0,RANK,MODEL_ID,FEATURE_SELECTION,ACCURACY,MICRO-PRECISION,MICRO-RECALL,MICRO-F1,MACRO-PRECISION,MACRO-RECALL,MACRO-F1,WEIGHTED-PRECISION,WEIGHTED-RECALL,WEIGHTED-F1,DATA_TABLE
0,1,DECISIONFOREST_3,lasso,0.816,0.816,0.816,0.816,0.80699,0.816058,0.810118,0.8213,0.816,0.817337,ml__survived_lasso_1723105668448665
1,2,XGBOOST_2,pca,0.792,0.792,0.792,0.792,0.781821,0.781821,0.781821,0.792,0.792,0.792,ml__survived_pca_1723106068035810
2,3,XGBOOST_1,rfe,0.784,0.784,0.784,0.784,0.775751,0.786117,0.778325,0.792884,0.784,0.785986,ml__survived_rfe_1723104929766904
3,4,GLM_2,pca,0.712,0.712,0.712,0.712,0.711222,0.665279,0.669118,0.711555,0.712,0.694847,ml__survived_pca_1723106068035810
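
The ranks in a loaded table are relative to that table, not to the original leaderboard. The sketch below illustrates this with plain Python: picking ranks 1, 4, 7 and 9 from the 'top_10_models' leaderboard and renumbering from 1 gives the same order as 'mixed_models', and picking ranks 4 to 7 gives the same order as 'range_models'. The specific rank selections are inferred by comparing the displayed tables and are not stated in the notebook. The helper `select_and_rerank` is hypothetical and does not call teradataml.

```python
# Model id and feature selection copied from the 'top_10_models' table, in rank order.
top_10 = [
    ("DECISIONFOREST_3", "lasso"),
    ("DECISIONFOREST_0", "lasso"),
    ("DECISIONFOREST_1", "rfe"),
    ("XGBOOST_2", "pca"),
    ("XGBOOST_0", "lasso"),
    ("XGBOOST_3", "lasso"),
    ("XGBOOST_1", "rfe"),
    ("DECISIONFOREST_2", "pca"),
    ("GLM_2", "pca"),
    ("GLM_0", "lasso"),
]


def select_and_rerank(leaderboard, ranks):
    """Pick models by their 1-based rank and renumber them starting at 1."""
    picked = [leaderboard[r - 1] for r in ranks]
    return {new_rank: model for new_rank, model in enumerate(picked, start=1)}


# Inferred selections (assumptions, see the note above).
mixed = select_and_rerank(top_10, [1, 4, 7, 9])
ranged = select_and_rerank(top_10, range(4, 8))

print(mixed)
print(ranged)
```

Under this reading, rank 4 in 'top_10_models', rank 2 in 'mixed_models' and rank 1 in 'range_models' all refer to the same XGBOOST model with pca feature selection, which is consistent with the identical evaluation metrics reported for those three calls.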


### <b><span style='color:#F1A424'>| 2.2.</span> Generating Prediction & Performance Metrics</b>

In [19]:
# Generate prediction using data and model rank

prediction = aml.predict(df.iloc[:80], rank=2)

Generating prediction using:
Model Name: XGBOOST
Feature Selection: pca
Completed: ｜⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿｜ 100% - 10/10           

In [20]:
prediction

id,Prediction,Prob_1,Prob_0,survived
464,1,0.8487479996791107,0.1512520003208892,0
488,0,0.1751798792013437,0.8248201207986562,0
368,0,0.0880308372771982,0.9119691627228018,0
16,1,0.977709499729398,0.022290500270602,1
520,0,0.4543162633975348,0.5456837366024652,0
40,0,0.0880308372771982,0.9119691627228018,0
624,0,0.0880308372771982,0.9119691627228018,0
104,0,0.0863711270181098,0.9136288729818902,0
504,0,0.4108949343073637,0.5891050656926362,0
392,0,0.4317393879390012,0.5682606120609988,0


In [21]:
# Generate performance metrics

performance_metric = aml.evaluate(df.iloc[:80], rank=2)

Generating performance metrics using:
Model Name: XGBOOST
Feature Selection: pca
Completed: ｜⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿｜ 100% - 10/10           

In [22]:
performance_metric


############ output_data Output ############

   SeqNum              Metric  MetricValue
0       3        Micro-Recall     0.746835
1       5     Macro-Precision     0.750321
2       6        Macro-Recall     0.756917
3       7            Macro-F1     0.745817
4       9     Weighted-Recall     0.746835
5      10         Weighted-F1     0.748465
6       8  Weighted-Precision     0.765425
7       4            Micro-F1     0.746835
8       2     Micro-Precision     0.746835
9       1            Accuracy     0.746835


############ result Output ############

       Prediction  Mapping  CLASS_1  CLASS_2  Precision    Recall        F1  Support
SeqNum                                                                              
0               0  CLASS_1       32        6   0.842105  0.695652  0.761905       46
1               1  CLASS_2       14       27   0.658537  0.818182  0.729730       33


## <b><span style='color:#F1A424'>| 3.</span> Loading Deployed Models - 'range_models' </b>

### <b><span style='color:#F1A424'>| 3.1.</span> Loading Model</b>

In [24]:
# Creating another AutoML object

obj = AutoML()

In [25]:
# Loading models

models_3 = obj.load('range_models')

In [26]:
models_3

Unnamed: 0,RANK,MODEL_ID,FEATURE_SELECTION,ACCURACY,MICRO-PRECISION,MICRO-RECALL,MICRO-F1,MACRO-PRECISION,MACRO-RECALL,MACRO-F1,WEIGHTED-PRECISION,WEIGHTED-RECALL,WEIGHTED-F1,DATA_TABLE
0,1,XGBOOST_2,pca,0.792,0.792,0.792,0.792,0.781821,0.781821,0.781821,0.792,0.792,0.792,ml__survived_pca_1723105688011217
1,2,XGBOOST_0,lasso,0.784,0.784,0.784,0.784,0.775751,0.786117,0.778325,0.792884,0.784,0.785986,ml__survived_lasso_1723105440370502
2,3,XGBOOST_3,lasso,0.784,0.784,0.784,0.784,0.775751,0.786117,0.778325,0.792884,0.784,0.785986,ml__survived_lasso_1723105440370502
3,4,XGBOOST_1,rfe,0.784,0.784,0.784,0.784,0.775751,0.786117,0.778325,0.792884,0.784,0.785986,ml__survived_rfe_1723106221014756


### <b><span style='color:#F1A424'>| 3.2.</span> Generating Prediction & Performance Metrics</b>

In [27]:
# Generate prediction using data and model rank

prediction = obj.predict(df.iloc[:80], rank=1)

Generating prediction using:
Model Name: XGBOOST
Feature Selection: pca
Completed: ｜⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿｜ 100% - 10/10           

In [29]:
prediction

id,Prediction,Prob_1,Prob_0,survived
16,1,0.977709499729398,0.022290500270602,1
248,0,0.0959926382722336,0.9040073617277664,0
128,1,0.9302579838943104,0.0697420161056896,1
520,0,0.4543162633975348,0.5456837366024652,0
256,1,0.9612438995480754,0.0387561004519245,1
400,0,0.3709172066386672,0.6290827933613328,0
552,0,0.2228467930162311,0.7771532069837689,1
608,0,0.0880308372771982,0.9119691627228018,0
600,0,0.04215793428692,0.95784206571308,1
464,1,0.8487479996791107,0.1512520003208892,0


In [30]:
# Generate performance metrics

performance_metric = obj.evaluate(df.iloc[:80], rank=1)

Generating performance metrics using:
Model Name: XGBOOST
Feature Selection: pca
Completed: ｜⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿｜ 100% - 10/10           

In [31]:
performance_metric


############ output_data Output ############

   SeqNum              Metric  MetricValue
0       3        Micro-Recall     0.746835
1       5     Macro-Precision     0.750321
2       6        Macro-Recall     0.756917
3       7            Macro-F1     0.745817
4       9     Weighted-Recall     0.746835
5      10         Weighted-F1     0.748465
6       8  Weighted-Precision     0.765425
7       4            Micro-F1     0.746835
8       2     Micro-Precision     0.746835
9       1            Accuracy     0.746835


############ result Output ############

       Prediction  Mapping  CLASS_1  CLASS_2  Precision    Recall        F1  Support
SeqNum                                                                              
0               0  CLASS_1       32        6   0.842105  0.695652  0.761905       46
1               1  CLASS_2       14       27   0.658537  0.818182  0.729730       33


In [34]:
remove_context()

True