<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Usecase - Demonstrate training and evaluation of various scikit-learn models after data preprocessing of customer segmentation data in OpensourceML</b>
</header>

### Disclaimer
The sample code (“Sample Code”) provided is not covered by any Teradata agreements. Please be aware that Teradata has no control over the model responses to such sample code and such response may vary. The use of the model by Teradata is strictly for demonstration purposes and does not constitute any form of certification or endorsement. The sample code is provided “AS IS” and any express or implied warranties, including the implied warranties of merchantability and fitness for a particular purpose, are disclaimed. In no event shall Teradata be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) sustained by you or a third party, however caused and on any theory of liability, whether in contract, strict liability, or tort arising in any way out of the use of this sample code, even if advised of the possibility of such damage.

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Description</b>

<b style = 'font-size:16px;font-family:Arial;'>Context</b>
<p>Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests and spending habits.</p>

<p>Companies employing customer segmentation operate under the fact that every customer is different and that their marketing efforts would be better served if they target specific, smaller groups with messages that those consumers would find relevant and lead them to buy something. Companies also hope to gain a deeper understanding of their customers' preferences and needs with the idea of discovering what each segment finds most valuable to more accurately tailor marketing materials toward that segment.</p>

<b style = 'font-size:16px;font-family:Arial;'>Content</b>
<p>An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). After intensive market research, the company has deduced that the behavior of the new market is similar to that of its existing market.</p>

<p>In their existing market, the sales team has classified all customers into 4 segments (A, B, C, D). They then performed segmented outreach and communication for each customer segment. This strategy has worked exceptionally well for them. They plan to use the same strategy in the new markets and have identified 2627 new potential customers.</p>

<b>DISCLAIMER: The data and description for this usecase are taken from <a href="https://www.kaggle.com/datasets/abisheksudarshan/customer-segmentation">here</a></b>.

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Objective</b>

Our goal is to create a <b>multiclass</b> predictive model that helps the manager assign each new customer to the right segment.

<b style = 'font-size:16px;font-family:Arial;'>The following steps are followed to achieve the objective:</b>
* Import the required teradataml modules.
* Context establishment with Vantage system.
* Authenticate VantageCloud Lake and get conda environment from OpenAF to use in teradataml OpensourceML module.
* Loading both train and test data.
* DataFrame preprocessing.
    * Verify and remove NULL/NaN values using SimpleImputer and ColumnTransformer.
    * Ordinal Encoding.
    * Detect and exclude Outliers from dataset.
    * Split the data into train and validation sets.
* Build different predictive models using train data, predict and validate using validation data.
* Predict "Segmentation" on test data.
* Cleanup.

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Import the required libraries</b>

In [1]:
import numpy as np
from teradatasqlalchemy import VARCHAR

# Importing required libraries.
import getpass
from teradataml import create_context, remove_context, DataFrame, load_example_data
from teradataml import list_user_envs, get_env, set_auth_token, configure, display
from teradataml import td_sklearn as osml

In [2]:
# Ignoring unnecessary warnings.
import warnings
warnings.simplefilter(action='ignore', category=DeprecationWarning)
display.suppress_vantage_runtime_warnings = True

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Connect to Vantage</b>

In [3]:
# Read the connection parameters.
host = getpass.getpass("Host: ")
username = getpass.getpass("Username: ")
password = getpass.getpass("Password: ")

Host:  ········
Username:  ········
Password:  ········


In [4]:
# Create the connection.
con = create_context(host=host, username=username, password=password)

In [5]:
# Read configuration parameters for VantageCloud Lake authentication.
ues_url = getpass.getpass("UES URL: ")
auth_token = getpass.getpass("Auth Token: ")

UES URL:  ········
Auth Token:  ········


In [6]:
# Set configuration parameters for VantageCloud Lake authentication.
set_auth_token(auth_token=auth_token, ues_url=ues_url)

True

In [8]:
list_user_envs()

Unnamed: 0,env_name,env_description,base_env_name,language,conda
0,conda_env_3_10_demo,Conda environment for notebook demo,python_3.10,python,True
1,demo_env,Demo env 1.,python_3.10,Python,False
2,non_conda_env_3_8_demo,Non Conda environment for notebook demo,python_3.8,Python,False
3,openml_env,DONT DELETE: OpenML environment,python_3.10,Python,False
4,openml_env_dhan,DONT DELETE: OpenML environment,python_3.10,Python,False
5,testenv,This env 'testenv' is created with base env 'p...,python_3.10,Python,False


In [9]:
env = get_env("conda_env_3_10_demo")
env


Environment Name: conda_env_3_10_demo
Base Environment: python_3.10
Description: Conda environment for notebook demo

############ Libraries installed in User Environment ############

                name   version
0      _libgcc_mutex       0.1
1      _openmp_mutex       5.1
2               blas       1.0
3              bzip2     1.0.8
4    ca-certificates  2024.7.2
5       intel-openmp  2023.1.0
6             joblib     1.4.2
7   ld_impl_linux-64      2.38
8             libffi     3.4.4
9          libgcc-ng    11.2.0
10    libgfortran-ng    11.2.0
11      libgfortran5    11.2.0
12           libgomp    11.2.0
13      libstdcxx-ng    11.2.0
14           libuuid    1.41.5
15               mkl  2023.1.0
16       mkl-service     2.4.0
17           mkl_fft    1.3.10
18        mkl_random     1.2.7
19           ncurses       6.4
20             numpy    1.26.4
21        numpy-base    1.26.4
22           openssl    3.0.14
23               pip      24.2
24      pybind11-abi         4
25      

teradataml OpensourceML requires the Python version and the versions of the required Python packages to be the same on the client and in the OpenAF user environment.

In [10]:
# Verifying whether required packages are of same version in both client and OpenAF User environment (above cell).
!pip list | grep scikit-learn
!pip list | grep scipy
!pip list | grep numpy

scikit-learn              1.5.1
scipy                     1.13.1
numpy                     1.26.4
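
The same comparison can be scripted instead of read off by eye. The following is a minimal sketch, assuming the versions reported for the user environment are copied by hand into a dictionary; `installed_versions` and `find_mismatches` are hypothetical helper names and are not part of teradataml.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version} for the client; None if a package is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(client, remote):
    """Return {package: (client_version, remote_version)} for versions that differ."""
    return {
        name: (client.get(name), version)
        for name, version in remote.items()
        if client.get(name) != version
    }


# Assumption: the remote version is typed in by hand from the environment listing above.
remote_versions = {"numpy": "1.26.4"}
client_versions = installed_versions(remote_versions)
print(find_mismatches(client_versions, remote_versions))
```

An empty dictionary means the client matches the user environment for every package listed in `remote_versions`.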


In [11]:
# Use "conda_env_3_10_demo" environment for teradataml OpensourceML.
configure.openml_user_env = env

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. teradataml DataFrame Creation</b>

In [12]:
load_example_data("openml", ["customer_segmentation_train", "customer_segmentation_test"])



<b style = 'font-size:22px;font-family:Arial;color:#E37C4D'>4.1. Train Data</b>

In [13]:
df_train = DataFrame("customer_segmentation_train")

In [14]:
df_train

ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
459165,Male,No,27,No,Doctor,7,Low,3.0,Cat_6,C
464589,Male,Yes,30,No,Entertainment,11,Low,2.0,Cat_5,D
464385,Male,Yes,70,No,Engineer,0,Average,2.0,Cat_6,B
465323,Female,Yes,43,Yes,Artist,0,Low,1.0,Cat_6,A
463977,Male,Yes,28,Yes,Healthcare,9,High,2.0,Cat_6,B
466995,Female,No,35,Yes,Artist,8,Low,,Cat_6,A
464181,Female,No,31,Yes,Healthcare,6,Low,4.0,Cat_2,D
465731,Female,Yes,31,Yes,Artist,0,Average,2.0,Cat_4,A
463325,Male,Yes,22,No,Doctor,1,Low,4.0,Cat_6,A
467954,Male,No,31,No,Healthcare,8,Low,4.0,Cat_6,D


In [15]:
train_x_columns = df_train.columns[1:-1]
train_x_columns

['Gender',
 'Ever_Married',
 'Age',
 'Graduated',
 'Profession',
 'Work_Experience',
 'Spending_Score',
 'Family_Size',
 'Var_1']

In [16]:
train_y_columns = ["Segmentation"]
train_y_columns

['Segmentation']

<b style = 'font-size:22px;font-family:Arial;color:#E37C4D'>4.2. Test Data</b>

In [17]:
df_test = DataFrame("customer_segmentation_test")

In [18]:
df_test

ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
461121,Male,No,33,No,Doctor,0.0,Low,4.0,Cat_3
463588,Male,No,22,No,Healthcare,0.0,Low,1.0,Cat_6
466463,Female,No,33,Yes,Healthcare,3.0,Low,4.0,Cat_4
461426,Male,Yes,47,Yes,Artist,7.0,Average,2.0,Cat_6
460059,Female,No,46,No,Entertainment,4.0,Low,,Cat_4
463138,Female,No,27,No,Homemaker,,Low,,Cat_3
466320,Female,No,39,Yes,Engineer,8.0,Low,2.0,Cat_4
465811,Male,No,28,No,Healthcare,1.0,Low,4.0,Cat_4
467729,Male,Yes,62,Yes,Entertainment,2.0,Average,6.0,Cat_6
467954,Male,No,29,No,Healthcare,9.0,Low,4.0,Cat_6


In [19]:
test_x_columns = df_test.columns[1:]
test_x_columns

['Gender',
 'Ever_Married',
 'Age',
 'Graduated',
 'Profession',
 'Work_Experience',
 'Spending_Score',
 'Family_Size',
 'Var_1']
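
Both feature lists come from slicing `columns`: the train slice `[1:-1]` drops `ID` and the `Segmentation` target, while the test slice `[1:]` drops only `ID`. A small sketch on plain Python lists, with the column names copied from the outputs above, checks that the two slices agree:

```python
# Column names copied from the train and test outputs above.
train_columns = ['ID', 'Gender', 'Ever_Married', 'Age', 'Graduated', 'Profession',
                 'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1', 'Segmentation']
test_columns = ['ID', 'Gender', 'Ever_Married', 'Age', 'Graduated', 'Profession',
                'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1']

train_x_columns = train_columns[1:-1]  # drop ID (first) and Segmentation (last)
test_x_columns = test_columns[1:]      # drop ID only

# The model can only be applied to the test data if the feature columns line up.
print(train_x_columns == test_x_columns)
```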

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Data Preprocessing</b>

<b style = 'font-size:22px;font-family:Arial;color:#E37C4D'>5.1. NULL column information in both train and test data</b>

In [20]:
df_train.info(null_counts=True)

<class 'teradataml.dataframe.dataframe.DataFrame'>
Data columns (total 11 columns):
ID                 8068 non-null int
Gender             8068 non-null str
Ever_Married       7928 non-null str
Age                8068 non-null int
Graduated          7990 non-null str
Profession         7944 non-null str
Work_Experience    7239 non-null int
Spending_Score     8068 non-null str
Family_Size        7733 non-null int
Var_1              7992 non-null str
Segmentation       8068 non-null str
dtypes: int(4), str(7)


In [21]:
df_test.info(null_counts=True)

<class 'teradataml.dataframe.dataframe.DataFrame'>
Data columns (total 10 columns):
ID                 2627 non-null int
Gender             2627 non-null str
Ever_Married       2577 non-null str
Age                2627 non-null int
Graduated          2603 non-null str
Profession         2589 non-null str
Work_Experience    2358 non-null int
Spending_Score     2627 non-null str
Family_Size        2514 non-null int
Var_1              2595 non-null str
dtypes: int(4), str(6)


<b>We can see that some columns contain NULL/NaN values. These columns are listed in the cells below.</b>

In [22]:
# Same columns for both train and test data.
null_categorical_col_idxs = [2, 4, 5, 9]
null_categorical_cols = ['Ever_Married', 'Graduated', 'Profession', 'Var_1']

null_numerical_col_idxs = [6, 8]
null_numerical_cols = ['Work_Experience', 'Family_Size']

In [23]:
train_non_null_col_idxs = [0, 1, 3, 7, 10]
train_non_null_cols = ['ID', 'Gender', 'Age', 'Spending_Score', 'Segmentation']

test_non_null_col_idxs = [0, 1, 3, 7]
test_non_null_cols = ['ID', 'Gender', 'Age', 'Spending_Score']
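
The hard-coded index lists above can be cross-checked against the `info` output. The sketch below is an illustration only: it copies the train column names, types and non-null counts into a plain Python list (the name `train_info` is hypothetical) and derives the missing-value counts and the index lists from it.

```python
total_rows = 8068

# (column, dtype, non-null count) copied from df_train.info(null_counts=True) above.
train_info = [
    ("ID", "int", 8068), ("Gender", "str", 8068), ("Ever_Married", "str", 7928),
    ("Age", "int", 8068), ("Graduated", "str", 7990), ("Profession", "str", 7944),
    ("Work_Experience", "int", 7239), ("Spending_Score", "str", 8068),
    ("Family_Size", "int", 7733), ("Var_1", "str", 7992), ("Segmentation", "str", 8068),
]

# Number of missing values per column that has any.
missing_counts = {name: total_rows - n for name, _, n in train_info if n < total_rows}

# Positions of NULL-containing columns, split by type, and of fully populated columns.
cat_idxs = [i for i, (_, dtype, n) in enumerate(train_info) if n < total_rows and dtype == "str"]
num_idxs = [i for i, (_, dtype, n) in enumerate(train_info) if n < total_rows and dtype != "str"]
non_null_idxs = [i for i, (_, _, n) in enumerate(train_info) if n == total_rows]

print(missing_counts)
print(cat_idxs, num_idxs, non_null_idxs)
```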

<b style = 'font-size:22px;font-family:Arial;color:#E37C4D'>5.2. Run SimpleImputer to remove NULLs(Nones)/NaNs in categorical/numerical columns</b>

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>5.2.1. Object Creation</b>

Because the categorical and numerical columns need different `SimpleImputer` settings, a `ColumnTransformer` is used to apply each imputer to its own columns and combine the transformed output.

<b style = 'font-size:18px;font-family:Arial'>SimpleImputer for categorical columns</b>

In [24]:
cat_imp = osml.SimpleImputer(strategy='most_frequent', missing_values=None)
cat_imp

<b style = 'font-size:18px;font-family:Arial'>SimpleImputer for numerical columns</b>

In [25]:
num_imp = osml.SimpleImputer(missing_values=np.nan, strategy='median')
num_imp
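
To make the two strategies concrete, here is a plain-Python sketch of what `most_frequent` and `median` compute. It illustrates the idea only and is not the library's implementation; the helper names are hypothetical. Ties in `most_frequent` are resolved to the smallest value, which is the behavior scikit-learn documents for that strategy.

```python
import math
from collections import Counter
from statistics import median


def impute_most_frequent(values):
    """Replace None with the most frequent observed value (smallest value wins ties)."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    top = max(counts.values())
    fill = min(v for v, c in counts.items() if c == top)
    return [fill if v is None else v for v in values]


def impute_median(values):
    """Replace NaN with the median of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    fill = median(observed)
    return [fill if math.isnan(v) else v for v in values]


print(impute_most_frequent(["Yes", None, "No", "Yes"]))
print(impute_median([7.0, float("nan"), 1.0, 0.0]))
```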

<b style = 'font-size:18px;font-family:Arial'>ColumnTransformer to combine transformed output of SimpleImputers</b>

In [26]:
ct = osml.ColumnTransformer([("cat_cols", cat_imp, null_categorical_col_idxs),
                             ("num_cols", num_imp, null_numerical_col_idxs)],
                            remainder="passthrough")
# 'passthrough' passes the remaining (non-NULL) columns through as-is, without any transformation.
ct

In [27]:
ct.get_params()

{'force_int_remainder_cols': True,
 'n_jobs': None,
 'remainder': 'passthrough',
 'sparse_threshold': 0.3,
 'transformer_weights': None,
 'transformers': [('cat_cols',
   SimpleImputer(missing_values=None, strategy='most_frequent'),
   [2, 4, 5, 9]),
  ('num_cols', SimpleImputer(strategy='median'), [6, 8])],
 'verbose': False,
 'verbose_feature_names_out': True,
 'cat_cols': SimpleImputer(missing_values=None, strategy='most_frequent'),
 'num_cols': SimpleImputer(strategy='median'),
 'cat_cols__add_indicator': False,
 'cat_cols__copy': True,
 'cat_cols__fill_value': None,
 'cat_cols__keep_empty_features': False,
 'cat_cols__missing_values': None,
 'cat_cols__strategy': 'most_frequent',
 'num_cols__add_indicator': False,
 'num_cols__copy': True,
 'num_cols__fill_value': None,
 'num_cols__keep_empty_features': False,
 'num_cols__missing_values': nan,
 'num_cols__strategy': 'median'}

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>5.2.2. Run fit and transform</b>

<b style = 'font-size:14px;font-family:Arial;color:#E37C4D'>5.2.2.1. On Train Data</b>

In [28]:
ct.fit(X=df_train)

In [29]:
opt = ct.transform(df_train)
opt

ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation,columntransformer_transform_1,columntransformer_transform_2,columntransformer_transform_3,columntransformer_transform_4,columntransformer_transform_5,columntransformer_transform_6,columntransformer_transform_7,columntransformer_transform_8,columntransformer_transform_9,columntransformer_transform_10,columntransformer_transform_11
460490,Female,No,18,No,Healthcare,0.0,Low,4,Cat_3,D,No,No,Healthcare,Cat_3,0.0,4.0,460490,Female,18,Low,D
464711,Female,Yes,38,Yes,Artist,7.0,Average,3,Cat_1,B,Yes,Yes,Artist,Cat_1,7.0,3.0,464711,Female,38,Average,B
462162,Male,Yes,38,No,Artist,1.0,Low,2,Cat_6,B,Yes,No,Artist,Cat_6,1.0,2.0,462162,Male,38,Low,B
466118,Female,Yes,36,No,Engineer,1.0,Low,3,Cat_4,A,Yes,No,Engineer,Cat_4,1.0,3.0,466118,Female,36,Low,A
463834,Female,Yes,41,Yes,Engineer,1.0,Average,4,Cat_6,C,Yes,Yes,Engineer,Cat_6,1.0,4.0,463834,Female,41,Average,C
460612,Male,No,33,Yes,Entertainment,9.0,Low,4,Cat_3,C,No,Yes,Entertainment,Cat_3,9.0,4.0,460612,Male,33,Low,C
460551,Male,Yes,49,Yes,Engineer,0.0,Average,2,Cat_3,C,Yes,Yes,Engineer,Cat_3,0.0,2.0,460551,Male,49,Average,C
463508,Female,Yes,31,No,Artist,0.0,Low,3,Cat_6,A,Yes,No,Artist,Cat_6,0.0,3.0,463508,Female,31,Low,A
465262,Male,Yes,60,Yes,Entertainment,7.0,Average,4,Cat_2,B,Yes,Yes,Entertainment,Cat_2,7.0,4.0,465262,Male,60,Average,B
462448,Male,No,19,No,Healthcare,,Low,3,Cat_6,D,No,No,Healthcare,Cat_6,1.0,3.0,462448,Male,19,Low,D


In [30]:
# Categorical columns are transformed by the categorical SimpleImputer, then numerical columns by the
# numerical SimpleImputer; finally, the non-NULL columns are passed through in their original order.
# So, here is the column mapping before and after transformation:
# 1. NULL Categorical columns
#    a. Ever_Married - columntransformer_transform_1
#    b. Graduated - columntransformer_transform_2
#    c. Profession - columntransformer_transform_3
#    d. Var_1 - columntransformer_transform_4
# 2. NULL Numerical columns
#    a. Work_Experience - columntransformer_transform_5
#    b. Family_Size - columntransformer_transform_6
# 3. Non-NULL columns
#    a. ID - columntransformer_transform_7
#    b. Gender - columntransformer_transform_8
#    c. Age - columntransformer_transform_9
#    d. Spending_Score - columntransformer_transform_10
#    e. Segmentation - columntransformer_transform_11

# Ignoring non-NULL passed through columns.
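
The mapping in the comments above can be reproduced mechanically. The sketch below assumes the output order is: columns of each transformer in the order the transformers are listed, followed by the passthrough remainder in original order. The function name is hypothetical, and the `columntransformer_transform_` prefix is taken from the output shown earlier.

```python
def transformed_column_names(columns, transformer_idxs, prefix="columntransformer_transform_"):
    """Map each generated output column name to the original column it holds."""
    # Columns claimed by a transformer come first, in transformer order.
    used = [i for idxs in transformer_idxs for i in idxs]
    # Remaining columns are passed through afterwards, in their original order.
    remainder = [i for i in range(len(columns)) if i not in used]
    return {f"{prefix}{pos}": columns[i] for pos, i in enumerate(used + remainder, start=1)}


train_columns = ['ID', 'Gender', 'Ever_Married', 'Age', 'Graduated', 'Profession',
                 'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1', 'Segmentation']

mapping = transformed_column_names(train_columns, [[2, 4, 5, 9], [6, 8]])
for new_name, original in mapping.items():
    print(new_name, "<-", original)
```

Deriving the mapping this way avoids recounting positions by hand when the index lists change.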

In [31]:
df_train = opt.assign(ID=opt.ID,
                      Gender=opt.Gender,
                      Ever_Married=opt.columntransformer_transform_1,
                      Age=opt.Age,
                      Graduated=opt.columntransformer_transform_2,
                      Profession=opt.columntransformer_transform_3,
                      Work_Experience=opt.columntransformer_transform_5,
                      Spending_Score=opt.Spending_Score,
                      Var_1=opt.columntransformer_transform_4,
                      Family_Size=opt.columntransformer_transform_6,
                      Segmentation=opt.Segmentation,
                      drop_columns=True)
df_train

ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
460490,Female,No,18,No,Healthcare,0.0,Low,4.0,Cat_3,D
464711,Female,Yes,38,Yes,Artist,7.0,Average,3.0,Cat_1,B
462162,Male,Yes,38,No,Artist,1.0,Low,2.0,Cat_6,B
466118,Female,Yes,36,No,Engineer,1.0,Low,3.0,Cat_4,A
463834,Female,Yes,41,Yes,Engineer,1.0,Average,4.0,Cat_6,C
460612,Male,No,33,Yes,Entertainment,9.0,Low,4.0,Cat_3,C
460551,Male,Yes,49,Yes,Engineer,0.0,Average,2.0,Cat_3,C
463508,Female,Yes,31,No,Artist,0.0,Low,3.0,Cat_6,A
465262,Male,Yes,60,Yes,Entertainment,7.0,Average,4.0,Cat_2,B
462448,Male,No,19,No,Healthcare,1.0,Low,3.0,Cat_6,D


<b style = 'font-size:14px;font-family:Arial;color:#E37C4D'>5.2.2.2. On Test Data</b>

In [32]:
ct.fit(X=df_test)

In [33]:
opt = ct.transform(df_test)
opt

ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,columntransformer_transform_1,columntransformer_transform_2,columntransformer_transform_3,columntransformer_transform_4,columntransformer_transform_5,columntransformer_transform_6,columntransformer_transform_7,columntransformer_transform_8,columntransformer_transform_9,columntransformer_transform_10
465445,Female,No,29,Yes,Homemaker,5.0,Low,7,Cat_6,No,Yes,Homemaker,Cat_6,5.0,7.0,465445,Female,29,Low
464812,Female,Yes,49,Yes,Engineer,1.0,High,5,Cat_4,Yes,Yes,Engineer,Cat_4,1.0,5.0,464812,Female,49,High
463670,Female,No,26,Yes,Healthcare,,Low,4,Cat_6,No,Yes,Healthcare,Cat_6,1.0,4.0,463670,Female,26,Low
459163,Male,No,35,Yes,Entertainment,0.0,Low,1,Cat_6,No,Yes,Entertainment,Cat_6,0.0,1.0,459163,Male,35,Low
459428,Female,No,39,No,Marketing,0.0,Low,2,Cat_4,No,No,Marketing,Cat_4,0.0,2.0,459428,Female,39,Low
459897,Female,Yes,56,Yes,Artist,0.0,Low,2,Cat_6,Yes,Yes,Artist,Cat_6,0.0,2.0,459897,Female,56,Low
466606,Female,Yes,35,Yes,Engineer,1.0,Low,3,Cat_3,Yes,Yes,Engineer,Cat_3,1.0,3.0,466606,Female,35,Low
462141,Female,No,38,Yes,Artist,8.0,Low,1,Cat_6,No,Yes,Artist,Cat_6,8.0,1.0,462141,Female,38,Low
459144,Male,Yes,80,Yes,Lawyer,,High,2,Cat_6,Yes,Yes,Lawyer,Cat_6,1.0,2.0,459144,Male,80,High
465119,Female,Yes,45,Yes,Artist,0.0,Average,5,Cat_6,Yes,Yes,Artist,Cat_6,0.0,5.0,465119,Female,45,Average


In [34]:
# Categorical columns are transformed by the categorical SimpleImputer, then numerical columns by the
# numerical SimpleImputer; finally, the non-NULL columns are passed through in their original order.
# So, here is the column mapping before and after transformation:
# 1. NULL Categorical columns
#    a. Ever_Married - columntransformer_transform_1
#    b. Graduated - columntransformer_transform_2
#    c. Profession - columntransformer_transform_3
#    d. Var_1 - columntransformer_transform_4
# 2. NULL Numerical columns
#    a. Work_Experience - columntransformer_transform_5
#    b. Family_Size - columntransformer_transform_6
# 3. Non-NULL columns
#    a. ID - columntransformer_transform_7
#    b. Gender - columntransformer_transform_8
#    c. Age - columntransformer_transform_9
#    d. Spending_Score - columntransformer_transform_10

# Ignoring non-NULL passed through columns.

In [35]:
df_test = opt.assign(ID=opt.ID,
                     Gender=opt.Gender,
                     Ever_Married=opt.columntransformer_transform_1,
                     Age=opt.Age,
                     Graduated=opt.columntransformer_transform_2,
                     Profession=opt.columntransformer_transform_3,
                     Work_Experience=opt.columntransformer_transform_5,
                     Spending_Score=opt.Spending_Score,
                     Var_1=opt.columntransformer_transform_4,
                     Family_Size=opt.columntransformer_transform_6,
                     drop_columns=True)
df_test

ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
465445,Female,No,29,Yes,Homemaker,5.0,Low,7.0,Cat_6
464812,Female,Yes,49,Yes,Engineer,1.0,High,5.0,Cat_4
463670,Female,No,26,Yes,Healthcare,1.0,Low,4.0,Cat_6
459163,Male,No,35,Yes,Entertainment,0.0,Low,1.0,Cat_6
459428,Female,No,39,No,Marketing,0.0,Low,2.0,Cat_4
459897,Female,Yes,56,Yes,Artist,0.0,Low,2.0,Cat_6
466606,Female,Yes,35,Yes,Engineer,1.0,Low,3.0,Cat_3
462141,Female,No,38,Yes,Artist,8.0,Low,1.0,Cat_6
459144,Male,Yes,80,Yes,Lawyer,1.0,High,2.0,Cat_6
465119,Female,Yes,45,Yes,Artist,0.0,Average,5.0,Cat_6


<b style = 'font-size:22px;font-family:Arial;color:#E37C4D'>5.3. Verify NULL values after imputation</b>

In [36]:
df_train.info(null_counts=True)

<class 'teradataml.dataframe.dataframe.DataFrame'>
Data columns (total 11 columns):
ID                   int
Gender               str
Ever_Married         str
Age                  int
Graduated            str
Profession           str
Work_Experience    float
Spending_Score       str
Family_Size        float
Var_1                str
Segmentation         str
dtypes: int(2), str(7), float(2)


In [37]:
df_test.info(null_counts=True)

<class 'teradataml.dataframe.dataframe.DataFrame'>
Data columns (total 10 columns):
ID                   int
Gender               str
Ever_Married         str
Age                  int
Graduated            str
Profession           str
Work_Experience    float
Spending_Score       str
Family_Size        float
Var_1                str
dtypes: int(2), str(6), float(2)


<b>We can see that there are NO NULL/NaN values in any column.</b>
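
A per-column count makes this check explicit. The following is a minimal sketch on plain Python dictionaries, not on a teradataml DataFrame; `count_missing` is a hypothetical helper, and the sample row uses values copied from the outputs above.

```python
import math


def count_missing(rows):
    """Count None/NaN values per column in a list of row dictionaries."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            missing = value is None or (isinstance(value, float) and math.isnan(value))
            counts[column] = counts.get(column, 0) + int(missing)
    return counts


# The same row before and after imputation (ID 462448 in the train data).
before = [{"ID": 462448, "Work_Experience": None, "Family_Size": 3.0}]
after = [{"ID": 462448, "Work_Experience": 1.0, "Family_Size": 3.0}]

print(count_missing(before))
print(count_missing(after))
```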

<b style = 'font-size:22px;font-family:Arial;color:#E37C4D'>5.4. Encode categorical columns using OrdinalEncoder</b>

<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>5.4.1. Function to ordinal encode data</b>

In [38]:
def ordinal_encode_data(df):
    """
    This function does the following tasks:
        1. Prepare categorical columns (the ID column is included) and numerical columns.
        2. Ordinal encode categorical columns.
        3. Join the transformed DataFrame with input DataFrame to get all columns (transformed categorical columns,
           non-transformed categorical columns and numerical columns) in same DataFrame.
        4. Since transformed column names are different, get column mapping of original columns to transformed columns.
        5. Extract the required transformed columns and numerical columns from joined DataFrame.
    """

    # The ID column is needed to join the transformed teradataml DataFrame with the input teradataml DataFrame. It is not used in training/prediction.
    # categorical columns
    cat_columns = ['ID', 'Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score', 'Var_1']
    if 'Segmentation' in df.columns:
        # Train data has Segmentation column.
        cat_columns += ['Segmentation']

    # numerical columns
    num_columns = ["Age", "Work_Experience", "Family_Size"]

    print("Categorical columns:\n", cat_columns)
    print("\nNumerical columns:\n", num_columns)
    
    # Initiate and train Ordinal Encoder to encode categorical columns and convert them to integer types.
    oe = osml.OrdinalEncoder(dtype=int)
    print("\nInitial OridinalEncoder model:", oe)

    opt = oe.fit_transform(df.select(cat_columns))

    print("\nTansformed data:\n", opt)

    # Join transformed categorical columns with non-transformed numerical columns.
    joined_trained_opt = opt.join(df, on="ID", lsuffix="l", how="inner")
    print("\nJoined data (input DataFrame with transformed DataFrame):\n", joined_trained_opt)

    ## Extract only required columns (encoded categorical columns and numerical columns) from joined teradataml DataFrame.

    # Mapping of categorical columns to their encoded columns as SQLColumnExpressions.
    # Train data has 8 categorical columns (including "ID"); test data has 7. In the joined DataFrame these appear as
    # repeated pairs, from ID_l and ID to Segmentation_l and Segmentation, i.e. 16 columns for train data.
    # The transformed columns therefore start at index 2 * len(cat_columns) (index 16 for train data).
    ln_cat = len(cat_columns)
    cat_cols_map = dict((col, getattr(joined_trained_opt, joined_trained_opt.columns[2*ln_cat+i])) for i, col in enumerate(cat_columns))
    
    # After join, ID is no longer needed.
    del cat_cols_map["ID"]

    # Get SQLColumnExpressions of numerical columns.
    num_cols_map = dict((col, getattr(joined_trained_opt, col)) for col in num_columns)

    # Return final encoded DataFrame.
    return joined_trained_opt.assign(drop_columns=True, **{**num_cols_map, **cat_cols_map})
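
As a rough illustration of what ordinal encoding does to one column, the sketch below maps the sorted unique values to consecutive integers. It assumes the encoder's default behavior of determining categories automatically in sorted order, as scikit-learn documents for `OrdinalEncoder`; `ordinal_codes` is a hypothetical helper, not the encoder itself.

```python
def ordinal_codes(values):
    """Map each distinct value to its position in the sorted list of distinct values."""
    categories = sorted(set(values))
    return {category: code for code, category in enumerate(categories)}


spending_codes = ordinal_codes(["Low", "Average", "High", "Low"])
segment_codes = ordinal_codes(["C", "D", "B", "A", "B"])

print(spending_codes)
print(segment_codes)
```

Note that the codes follow alphabetical order, so `Low` receives a larger code than `High`; the integers carry no ranking of spending levels. Because `ID` is passed through the encoder as well, it is replaced by its rank among the IDs, which is why it is used only for the join and then discarded.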

<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>5.4.2. Encode Train Data</b>

In [39]:
opt1_train = ordinal_encode_data(df_train)
opt1_train

Categorical columns:
 ['ID', 'Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score', 'Var_1', 'Segmentation']

Numerical columns:
 ['Age', 'Work_Experience', 'Family_Size']

Initial OridinalEncoder model: OrdinalEncoder(dtype=<class 'int'>)

Tansformed data:
        ID  Gender Ever_Married Graduated     Profession Spending_Score  Var_1 Segmentation  ordinalencoder_transform_1  ordinalencoder_transform_2  ordinalencoder_transform_3  ordinalencoder_transform_4  ordinalencoder_transform_5  ordinalencoder_transform_6  ordinalencoder_transform_7  ordinalencoder_transform_8
0  462162    Male          Yes        No         Artist            Low  Cat_6            B                        2849                           1                           1                           0                           0                           2                           5                           1
1  463834  Female          Yes       Yes       Engineer        Average  Cat_6            C     

Gender,Ever_Married,Graduated,Profession,Spending_Score,Var_1,Segmentation,Age,Work_Experience,Family_Size
1,1,0,0,2,5,1,38,1.0,2.0
0,1,1,2,0,5,2,41,1.0,4.0
1,0,1,3,2,2,2,33,9.0,4.0
1,1,1,4,1,5,2,36,6.0,2.0
0,0,1,0,2,3,0,41,1.0,3.0
1,0,0,5,2,2,0,32,7.0,4.0
0,0,1,5,2,5,0,36,2.0,3.0
0,1,0,2,2,3,0,36,1.0,3.0
0,1,1,0,0,0,1,38,7.0,3.0
0,0,0,5,2,2,3,18,0.0,4.0


<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>5.4.3. Encode Test Data</b>

In [40]:
opt1_test = ordinal_encode_data(df_test)
opt1_test

Categorical columns:
 ['ID', 'Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score', 'Var_1']

Numerical columns:
 ['Age', 'Work_Experience', 'Family_Size']

Initial OridinalEncoder model: OrdinalEncoder(dtype=<class 'int'>)

Tansformed data:
        ID  Gender Ever_Married Graduated     Profession Spending_Score  Var_1  ordinalencoder_transform_1  ordinalencoder_transform_2  ordinalencoder_transform_3  ordinalencoder_transform_4  ordinalencoder_transform_5  ordinalencoder_transform_6  ordinalencoder_transform_7
0  463670  Female           No       Yes     Healthcare            Low  Cat_6                        1395                           0                           0                           1                           5                           2                           5
1  459428  Female           No        No      Marketing            Low  Cat_4                         148                           0                           0                           0    

Gender,Ever_Married,Graduated,Profession,Spending_Score,Var_1,Age,Work_Experience,Family_Size
0,0,1,5,2,5,26,1.0,4.0
0,0,0,8,2,3,39,0.0,2.0
0,1,1,0,2,5,56,0.0,2.0
1,0,1,3,2,5,38,1.0,2.0
0,0,1,0,2,5,39,7.0,1.0
0,1,1,1,2,1,39,0.0,1.0
1,0,1,3,2,1,26,9.0,1.0
1,0,1,3,2,5,35,0.0,1.0
0,1,1,2,1,3,49,1.0,5.0
0,0,1,6,2,5,29,5.0,7.0


<b style = 'font-size:22px;font-family:Arial;color:#E37C4D'>5.5. Outlier Detection</b>

In [41]:
df_x = opt1_train.select(train_x_columns + train_y_columns)
df_x

Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
1,1,38,0,0,1.0,2,2.0,5,1
0,1,41,1,2,1.0,0,4.0,5,2
1,0,33,1,3,9.0,2,4.0,2,2
1,1,36,1,4,6.0,1,2.0,5,2
0,0,41,1,0,1.0,2,3.0,3,0
1,0,32,0,5,7.0,2,4.0,2,0
0,0,36,1,5,2.0,2,3.0,5,0
0,1,36,0,2,1.0,2,3.0,3,0
0,1,38,1,0,7.0,0,3.0,0,1
0,0,18,0,5,0.0,2,4.0,2,3


<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>5.5.1. Using LocalOutlierFactor</b>

In [42]:
clf = osml.LocalOutlierFactor(n_neighbors=2)
clf

In [43]:
opt = clf.fit_predict(df_x)
opt

Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation,localoutlierfactor_fit_predict_1
1,0,33,1,3,9.0,2,4.0,2,2,1.0
0,0,41,1,0,1.0,2,3.0,3,0,1.0
1,0,32,0,5,7.0,2,4.0,2,0,1.0
1,1,52,1,4,1.0,1,4.0,5,1,-1.0
0,1,43,1,0,0.0,1,3.0,5,1,1.0
1,1,36,0,1,1.0,2,1.0,3,0,1.0
1,1,57,0,3,0.0,2,1.0,5,0,1.0
1,1,36,1,4,6.0,1,2.0,5,2,1.0
0,1,41,1,2,1.0,0,4.0,5,2,1.0
1,1,38,0,0,1.0,2,2.0,5,1,1.0


In [44]:
opt.groupby(["localoutlierfactor_fit_predict_1"]).count()

localoutlierfactor_fit_predict_1,count_Gender,count_Ever_Married,count_Age,count_Graduated,count_Profession,count_Work_Experience,count_Spending_Score,count_Family_Size,count_Var_1,count_Segmentation
-1.0,508,508,508,508,508,508,508,508,508,508
1.0,7560,7560,7560,7560,7560,7560,7560,7560,7560,7560


Outliers are grouped under -1.
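
The labelling convention can be seen on a toy example. This sketch uses plain scikit-learn on a local NumPy array rather than the OpensourceML wrapper, and the data is made up for illustration: four evenly spaced points and one point far away from them.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy one-dimensional data (an assumption for illustration): the last point is isolated.
X = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])

# fit_predict returns 1 for inliers and -1 for outliers.
labels = LocalOutlierFactor(n_neighbors=2).fit_predict(X)

# Keep only the rows labelled as inliers, as done for the training data.
inliers = X[labels == 1]

print(labels)
print(inliers.ravel())
```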

In [45]:
# Excluding outliers from LocalOutlierFactor for model training.
df_lof = opt[opt.localoutlierfactor_fit_predict_1 == 1].drop(columns="localoutlierfactor_fit_predict_1")
df_lof.shape

(7560, 10)

In [46]:
# Split data into train and validation data.
df_sample_lof = df_lof.sample(frac=[0.9, 0.1])
df_sample_lof

Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation,sampleid
1,1,63,1,3,1.0,0,2.0,5,2,1
1,0,23,0,0,0.0,2,8.0,3,3,2
1,1,55,1,7,1.0,2,4.0,2,2,1
1,0,30,1,8,8.0,2,1.0,5,0,1
1,0,45,1,0,0.0,2,2.0,5,0,1
1,1,32,1,1,0.0,0,2.0,2,3,1
0,1,25,1,3,0.0,0,2.0,0,0,2
0,0,21,1,5,0.0,2,4.0,0,3,2
0,1,38,1,0,1.0,2,2.0,2,2,1
0,1,48,1,6,3.0,0,2.0,2,1,1


In [47]:
# Training data for LocalOutlierFactor.
df_train_lof = df_sample_lof[df_sample_lof.sampleid == 1].drop(columns="sampleid")
df_train_lof.shape

(6804, 10)

In [48]:
# Validation data for LocalOutlierFactor.
df_validate_lof = df_sample_lof[df_sample_lof.sampleid == 2].drop(columns="sampleid")
df_validate_lof.shape

(756, 10)
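
The shapes above correspond to 90% and 10% of the 7560 inlier rows. The sketch below shows the same kind of split on a plain Python sequence. It illustrates the idea only and makes no claim about how `DataFrame.sample` is implemented; `split_by_fraction` and its `seed` parameter are hypothetical.

```python
import random


def split_by_fraction(rows, train_frac, seed=0):
    """Shuffle rows and split them into (train, validation) by the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]


# Row numbers stand in for the 7560 rows kept after outlier removal.
train_rows, valid_rows = split_by_fraction(range(7560), 0.9)

print(len(train_rows), len(valid_rows))
```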

<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>5.5.2. Using OneClassSVM</b>

In [49]:
oc_svm = osml.OneClassSVM(nu=0.25)
oc_svm

In [50]:
oc_svm.fit(df_x)

In [51]:
opt1 = oc_svm.predict(df_x)
opt1

Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation,oneclasssvm_predict_1
1,0,33,1,3,9.0,2,4.0,2,2,1
0,0,41,1,0,1.0,2,3.0,3,0,1
1,0,32,0,5,7.0,2,4.0,2,0,1
1,1,52,1,4,1.0,1,4.0,5,1,1
0,1,43,1,0,0.0,1,3.0,5,1,1
1,1,36,0,1,1.0,2,1.0,3,0,1
1,1,57,0,3,0.0,2,1.0,5,0,1
1,1,36,1,4,6.0,1,2.0,5,2,1
0,1,41,1,2,1.0,0,4.0,5,2,1
1,1,38,0,0,1.0,2,2.0,5,1,1


In [52]:
opt1.groupby(["oneclasssvm_predict_1"]).count()

oneclasssvm_predict_1,count_Gender,count_Ever_Married,count_Age,count_Graduated,count_Profession,count_Work_Experience,count_Spending_Score,count_Family_Size,count_Var_1,count_Segmentation
1,6052,6052,6052,6052,6052,6052,6052,6052,6052,6052
-1,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016


OneClassSVM likewise labels outliers as -1 and inliers as 1; here 2016 rows are flagged as outliers.
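
To put the two detectors side by side, the group counts can be turned into outlier fractions. The sketch below uses only the counts printed in this notebook; `outlier_fraction` is a hypothetical helper. In scikit-learn, `nu` is documented as an upper bound on the fraction of training errors, so a flagged share close to 0.25 is what `nu=0.25` would be expected to produce.

```python
def outlier_fraction(n_outliers, n_inliers):
    """Share of rows labelled -1 (hypothetical helper)."""
    return n_outliers / (n_outliers + n_inliers)

# Counts taken from the groupby outputs in this notebook.
lof_fraction = outlier_fraction(508, 7560)     # LocalOutlierFactor
ocsvm_fraction = outlier_fraction(2016, 6052)  # OneClassSVM(nu=0.25)

print(f"LocalOutlierFactor: {lof_fraction:.3f}")
print(f"OneClassSVM:        {ocsvm_fraction:.3f}")
```

OneClassSVM removes about a quarter of the rows, while LocalOutlierFactor removes about 6%, which is why the two training sets in the next section differ in size.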

In [53]:
# Excluding outliers from OneClassSVM for model training.
df_ocsvm = opt1[opt1.oneclasssvm_predict_1 == 1].drop(columns="oneclasssvm_predict_1")
df_ocsvm.shape

(6052, 10)

In [54]:
# Split data into train and validation data.
df_sample_ocsvm = df_ocsvm.sample(frac=[0.9, 0.1])
df_sample_ocsvm

Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation,sampleid
1,1,58,1,0,7.0,2,1.0,2,1,1
1,0,27,1,0,1.0,2,3.0,5,0,1
1,0,41,1,0,1.0,2,3.0,5,0,1
1,1,55,1,4,9.0,1,2.0,5,2,1
1,1,45,1,0,1.0,0,2.0,5,1,1
1,1,52,1,3,1.0,0,3.0,5,1,1
0,1,30,1,2,8.0,0,5.0,3,3,1
0,1,45,1,0,0.0,2,1.0,2,1,1
0,1,47,1,0,1.0,1,6.0,5,1,2
0,1,42,1,0,0.0,0,2.0,5,2,1


In [55]:
# Training data for OneClassSVM.
df_train_ocsvm = df_sample_ocsvm[df_sample_ocsvm.sampleid == 1].drop(columns="sampleid")
df_train_ocsvm.shape

(5447, 10)

In [56]:
# Validation data for OneClassSVM.
df_validate_ocsvm = df_sample_ocsvm[df_sample_ocsvm.sampleid == 2].drop(columns="sampleid")
df_validate_ocsvm.shape

(605, 10)

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. Model training, prediction and validation</b>

<b>Train, predict and evaluate multiple models</b>

In [57]:
def run_diff_models_on_data(df_train, df_test, models):
    """
    Helper function to train, predict, score and evaluate different models.
    """
    # Training data.
    train_x_df = df_train.select(train_x_columns)
    train_y_df = df_train.select(train_y_columns)

    # Validation data.
    valid_x_df = df_test.select(train_x_columns)
    valid_y_df = df_test.select(train_y_columns)

    for model in models:
        ## Train the model.
        model.fit(train_x_df, train_y_df)
        print('\nmodel: ', model)

        ## Score the trained model on the training data.
        print('\nScore: \n', model.score(train_x_df, train_y_df))

        ## Predict the values of validation data using trained model.
        # Note that we are passing y to predict() for the output DataFrame to contain both y and y_pred values.
        # Both y and y_pred values are needed for evaluation in next steps.
        mpred = model.predict(X=valid_x_df, y=valid_y_df)
        print('\nPredictions on validation data:', mpred)

        ## Run a few evaluation metrics.
        y_true = mpred.select("Segmentation") # Column with true y values.
        y_pred = mpred.select(mpred.columns[-1]) # Last column is predicted column.
        print('\nEvaluation on validation data using different metrics:\n')
        print('\nAccuracy_score: \n', osml.accuracy_score(y_true=y_true, y_pred=y_pred))
        print('\nConfusion_matrix: \n', osml.confusion_matrix(y_true=y_true, y_pred=y_pred))
        print('\nClassification Report: \n', osml.classification_report(y_true=y_true, y_pred=y_pred))

        print('\n...................................\n')
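
For readers who want to see what the accuracy and confusion-matrix metrics in the helper compute, here is a minimal plain-Python sketch. It is not the OpensourceML implementation: the function names are local to this example and the label lists are made up for illustration. The row/column layout follows the scikit-learn convention (rows are true labels, columns are predicted labels).

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def confusion_matrix(y_true, y_pred, labels):
    """Matrix where entry [i][j] counts rows with true label i predicted as j."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Made-up labels for the four segments 0..3 (illustration only).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 3, 1, 2, 2, 2, 3, 0]

acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(acc)
for row in cm:
    print(row)
```

The diagonal of the matrix holds the correct predictions, so accuracy is the sum of the diagonal divided by the number of rows.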

<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>6.1. Run different models on data extracted from OneClassSVM</b>

In [58]:
mnb_ocsvm = osml.MultinomialNB()
knn_ocsvm = osml.KNeighborsClassifier()
dtc_ocsvm = osml.DecisionTreeClassifier(criterion='gini')
abc_ocsvm = osml.AdaBoostClassifier(learning_rate=0.5)

In [59]:
models_ocsvm = [mnb_ocsvm, knn_ocsvm, dtc_ocsvm, abc_ocsvm]

In [60]:
run_diff_models_on_data(df_train_ocsvm, df_validate_ocsvm, models_ocsvm)


model:  MultinomialNB()

Score: 
       score
0  0.425188

Predictions on validation data:    Gender  Ever_Married  Age  Graduated  Profession  Work_Experience  Spending_Score  Family_Size  Var_1  Segmentation  multinomialnb_predict_1
0       0             0   48          1           0              4.0               2          1.0      5             2                        0
1       0             0   43          1           8              1.0               2          1.0      5             3                        3
2       0             0   39          0           2              1.0               2          3.0      2             0                        0
3       0             0   38          1           5              0.0               2          4.0      5             0                        3
4       0             0   51          1           0              1.0               2          1.0      3             1                        1
5       0             0   55          1     

<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>6.2. Run different models on data extracted from LocalOutlierFactor</b>

In [61]:
mnb_lof = osml.MultinomialNB()
knn_lof = osml.KNeighborsClassifier()
dtc_lof = osml.DecisionTreeClassifier(criterion='gini')
abc_lof = osml.AdaBoostClassifier(learning_rate=0.5)

In [62]:
models_lof = [mnb_lof, knn_lof, dtc_lof, abc_lof]

In [63]:
run_diff_models_on_data(df_train_lof, df_validate_lof, models_lof)


model:  MultinomialNB()

Score: 
      score
0  0.44562

Predictions on validation data:    Gender  Ever_Married  Age  Graduated  Profession  Work_Experience  Spending_Score  Family_Size  Var_1  Segmentation  multinomialnb_predict_1
0       1             1   69          0           1              8.0               2          2.0      5             0                        0
1       1             1   55          1           3              0.0               0          4.0      5             2                        2
2       1             1   59          1           0              1.0               0          4.0      5             1                        2
3       1             1   39          1           0              1.0               0          2.0      3             2                        2
4       1             1   63          1           0              1.0               2          1.0      0             2                        1
5       1             1   36          0       

In the validation run above, DecisionTreeClassifier scores around 0.95 on the training data, but its accuracy on the validation data is below 0.50, which indicates overfitting.
AdaBoostClassifier scores lower on the training data but generalizes better, with an accuracy of around 0.51 on both the training and validation data.
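
The comparison can be made explicit by looking at the gap between the training score and the validation accuracy. The sketch below uses the approximate figures quoted in the text, not exact outputs; in particular, 0.50 stands in as an upper bound for the DecisionTreeClassifier validation accuracy, and `generalization_gap` is a hypothetical helper.

```python
def generalization_gap(train_score, validation_score):
    """Training score minus validation score; a large gap suggests overfitting."""
    return train_score - validation_score

# (train score, validation accuracy), approximate values from the text.
scores = {
    "DecisionTreeClassifier": (0.95, 0.50),  # validation is "less than 0.50"
    "AdaBoostClassifier": (0.51, 0.51),
}

gaps = {name: generalization_gap(tr, va) for name, (tr, va) in scores.items()}
# Pick the model with the highest validation accuracy.
best = max(scores, key=lambda name: scores[name][1])

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.2f}")
print("Best on validation data:", best)
```

With these figures the decision tree shows a gap of at least 0.45, while AdaBoostClassifier shows essentially none, which is the reason it is used for the predictions in the next section.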

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>7. Run Predictions</b>

In [64]:
# Test data.
df_test_x = opt1_test.select(train_x_columns)
df_test_x

Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,0,26,1,5,1.0,2,4.0,5
0,0,39,0,8,0.0,2,2.0,3
0,1,56,1,0,0.0,2,2.0,5
1,0,38,1,3,1.0,2,2.0,5
0,0,39,1,0,7.0,2,1.0,5
0,1,39,1,1,0.0,2,1.0,1
1,0,26,1,3,9.0,2,1.0,1
1,0,35,1,3,0.0,2,1.0,5
0,1,49,1,2,1.0,1,5.0,3
0,0,29,1,6,5.0,2,7.0,5


In [65]:
# Run predict with AdaBoostClassifier (trained on the LocalOutlierFactor-filtered data), which has the better validation accuracy.
abc_lof.predict(df_test_x)

Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,adaboostclassifier_predict_1
0,1,56,1,0,0.0,2,2.0,5,2
0,0,39,1,0,7.0,2,1.0,5,0
0,1,39,1,1,0.0,2,1.0,1,0
1,1,82,0,7,1.0,2,1.0,5,0
0,1,66,1,3,0.0,1,2.0,5,1
0,0,30,0,5,0.0,2,3.0,5,3
1,1,36,1,0,1.0,0,2.0,5,2
1,0,38,1,3,1.0,2,2.0,5,0
0,0,39,0,8,0.0,2,2.0,3,3
0,0,26,1,5,1.0,2,4.0,5,3


<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>8. Remove Context</b>

In [66]:
remove_context()

True