***

<img src='teradata_logo.png' alt='Teradata' width='200'/>


# [Runbook for Regression Modelling using Vantage Analytics Library (VAL)](#title)

***


## [Table of Contents](#toc)

1. [Introduction](#Sec_1)
2. [Connection to Vantage](#Sec_2)
3. [Modelling with VAL](#Sec_3)
    1. [Invoking VAL functions using SQL](#Sec_3.1)
        1. [Model Training](#Sec_3.1.1)
        2. [Model Evaluation Report](#Sec_3.1.2)
        3. [Scoring and Evaluation](#Sec_3.1.3)
    2. [Invoking VAL functions using TD Python Wrappers](#Sec_3.2)
        1. [Model Training](#Sec_3.2.1)
        2. [Model Report](#Sec_3.2.2)
        3. [Model Evaluation](#Sec_3.2.3)
        4. [Scoring](#Sec_3.2.4)
    3. [Feature Encoding with VAL](#Sec_3.3)
        1. [Model Evaluation](#Sec_3.3.1)
        2. [Scoring](#Sec_3.3.2)
4. [Code for data upload to Vantage](#Sec_4)
        
***


## [1. Introduction](#Sec_1)

___This notebook complements the demand forecasting notebook available [here](../../b1e18b12-5ccd-4c94-b96b-c2cebf230150/notebooks/demand_forecasting.ipynb)___. That notebook provides the problem description and a detailed EDA of the datasets, and focuses on modelling with the machine learning functions of the scikit-learn Python library. It also explains how to prepare and upload the training and test datasets used for the AOPS demo (these are already available in Vantage).<br> 
This notebook, by contrast, models the same problem using the analytic functions available in the Vantage Analytics Library (VAL). In addition to scoring in-database, VAL lets you train the models in-database using Teradata-native analytic functions.<br><br> 

In [1]:
import pandas as pd
#teradata ML Libraries
from teradataml import DataFrame, create_context, copy_to_sql
import getpass

In [2]:
#VAL libraries and VAL installation path
from teradataml import valib
from teradataml import configure
configure.val_install_location = "VAL"

## [2. Connection to Vantage](#Sec_2)

In [3]:
# Establish connection to AOPS Teradata Vantage instance 
host = "3.238.151.85"
username = "AOA_DEMO" #update username as needed
password = getpass.getpass()  # prompt at run time; do not hard-code the password
logmech = None  # e.g. "LDAP"
database_name = "AOA_DEMO"

 ··········


In [4]:
# create the connection using credentials
eng=create_context(host=host, username=username, password=password)#, logmech=logmech)
conn=eng.connect()


## [3. Modelling with VAL](#Sec_3) 

The Vantage Analytics Library (VAL) provides a suite of algorithms for solving machine learning problems. VAL is under active development, with new functions continuously being added.<br>
For regression problems, VAL currently provides a linear regression algorithm, which is used here for the demand forecasting problem.<br>
VAL functions can be invoked in two ways: as a SQL call to the underlying UDFs, or through the Teradata ML Python wrappers.<br>
This notebook demonstrates both. In AOPS, however, the Python wrappers are the preferred way of invoking VAL functions. 



### [3.1 Invoking VAL functions using SQL](#Sec_3.1)


In [5]:
%load_ext sql

In [6]:
%sql teradatasql://$username:$password@$host/


#### [3.1.1 Model Training](#Sec_3.1.1)


In [7]:
%%sql
call VAL.td_analyze(
    'linear','
    database = AOA_DEMO;
    tablename = DEMAND_FORECAST_TRAIN_VAL;
    columns = center_id, meal_id, checkout_price, base_price, emailer_for_promotion, homepage_featured;
    dependent = num_orders;
    outputdatabase = AOA_DEMO;
    outputtablename = demand_forecast_val_linreg_model
    ');

 * teradatasql://AOA_DEMO:***@3.238.151.85/
0 rows affected.
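
The second argument of `td_analyze` is a single string of `name = value` pairs separated by semicolons. As a minimal sketch, assuming trusted inputs (nothing is escaped or validated), the call text can be assembled from a Python dictionary. The helper name `build_td_analyze_call` and its arguments are hypothetical; they are not part of VAL or teradataml.

```python
def build_td_analyze_call(function_name, params, val_database="VAL"):
    """Return the SQL text of a td_analyze call (hypothetical helper)."""
    pairs = []
    for name, value in params.items():
        # Lists (e.g. the input columns) become comma-separated values.
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)
        pairs.append(f"{name} = {value}")
    body = ";\n    ".join(pairs)
    return (
        f"call {val_database}.td_analyze(\n"
        f"    '{function_name}','\n"
        f"    {body}\n"
        f"    ');"
    )


# Same parameters as the training call in this section.
train_call = build_td_analyze_call(
    "linear",
    {
        "database": "AOA_DEMO",
        "tablename": "DEMAND_FORECAST_TRAIN_VAL",
        "columns": ["center_id", "meal_id", "checkout_price", "base_price",
                    "emailer_for_promotion", "homepage_featured"],
        "dependent": "num_orders",
        "outputdatabase": "AOA_DEMO",
        "outputtablename": "demand_forecast_val_linreg_model",
    },
)
print(train_call)
```

Building the string in one place keeps the table and column names in ordinary Python variables, which can help when the same call has to be repeated for several tables.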



#### [3.1.2 Model Evaluation Report](#Sec_3.1.2)


In [8]:
#read back analytics on train data in a TD DF
model_rpt = DataFrame("demand_forecast_val_linreg_model_rpt")
#model_txt = DataFrame("demand_forecast_val_linreg_model_txt")

In [9]:
# evaluation metrics report: the report table holds a single row, so
# transpose it to show one metric per row
model_rpt_pdf = model_rpt.to_pandas()
model_rpt_pdf = model_rpt_pdf.T
model_rpt_pdf.columns = ["Value"]
model_rpt_pdf

| Metric | Value |
|---|---|
| Total Observations | 365238.0 |
| Total Sum of Squares | 59310817250.780998 |
| Multiple Correlation Coefficient (R): | 0.445459 |
| Squared Multiple Correlation Coefficient (1-Tolerance) | 0.198434 |
| Adjusted R-Squared | 0.19842 |
| Standard Error of Estimate | 360.788655 |
| Regression Sum of Squares | 11769262840.7237 |
| Regression Degrees of Freedom | 6.0 |
| Regression Mean-Square | 1961543806.78729 |
| Regression F Ratio | 15069.271819 |
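
The entries of this report are tied together by the usual ordinary-least-squares identities. The sketch below recomputes the derived statistics from three base quantities taken from the report (observations, total sum of squares, regression sum of squares). It assumes VAL uses the textbook definitions, which is an assumption, although the recomputed values agree closely with the reported ones.

```python
import math

# Base quantities copied from the model report.
n_obs = 365238
total_ss = 59310817250.780998
regression_ss = 11769262840.7237
regression_df = 6  # one degree of freedom per predictor column

# Residual (error) quantities implied by the base values.
residual_df = n_obs - regression_df - 1
residual_ss = total_ss - regression_ss

# Textbook OLS identities (assumed to match VAL's definitions).
r_squared = regression_ss / total_ss
multiple_r = math.sqrt(r_squared)
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df
regression_ms = regression_ss / regression_df
residual_ms = residual_ss / residual_df
std_error_of_estimate = math.sqrt(residual_ms)
f_ratio = regression_ms / residual_ms

for name, value in [
    ("R", multiple_r),
    ("R-squared", r_squared),
    ("Adjusted R-squared", adjusted_r_squared),
    ("Standard error of estimate", std_error_of_estimate),
    ("Regression mean-square", regression_ms),
    ("F ratio", f_ratio),
]:
    print(f"{name}: {value:.6f}")
```

Read this way, an R-squared of about 0.198 means the six numeric predictors explain roughly 20% of the variance in `num_orders` on the training data.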



#### [3.1.3 Scoring and Evaluation](#Sec_3.1.3)

The `scoringmethod` parameter selects whether the function runs in scoring mode or in evaluation mode.


In [10]:
%%sql
call VAL.td_analyze(
    'linearscore','
    database = AOA_DEMO;
    tablename = DEMAND_FORECAST_TEST_VAL;
    modeldatabase  = AOA_DEMO;
    modeltablename  = demand_forecast_val_linreg_model;
    outputdatabase = AOA_DEMO;
    outputtablename = demand_forecast_val_linreg_results;
    scoringmethod = evaluate;
    ');

 * teradatasql://AOA_DEMO:***@3.238.151.85/
0 rows affected.


In [11]:
#read back results in a TD DF
result = DataFrame("demand_forecast_val_linreg_results_txt")
result

                        Maxmum Absolute Error  Average Absolute Error  Standard Error of Estimate
Minimum Absolute Error                                                                           
0.003523                         12272.711455              200.390488                  331.746754
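
Going by their names, the first three columns of the evaluation output summarise the absolute prediction error, |actual - predicted|, over the evaluated rows. A minimal sketch on made-up numbers (the values below are illustrative only and do not come from the demand forecasting tables) shows how such summaries are obtained. The standard error of estimate is left out because the source does not state which degrees-of-freedom convention VAL applies in evaluation mode.

```python
# Illustrative data only: actual and predicted order counts for four rows.
actual = [120.0, 80.0, 200.0, 45.0]
predicted = [110.0, 95.0, 190.0, 47.0]

# Absolute error per row.
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

min_abs_error = min(abs_errors)
max_abs_error = max(abs_errors)
avg_abs_error = sum(abs_errors) / len(abs_errors)

print(min_abs_error, max_abs_error, avg_abs_error)
```

A large gap between the maximum and the average absolute error, as in the output above, indicates that a few rows are predicted far worse than the typical row.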


### [3.2 Invoking VAL functions using TD Python Wrappers](#Sec_3.2)



#### [3.2.1 Model Training](#Sec_3.2.1)


In [12]:
train_df = DataFrame("DEMAND_FORECAST_TRAIN_VAL") 
features = ["center_id", "meal_id", "checkout_price", "base_price",
       "emailer_for_promotion", "homepage_featured"]
lin_reg_obj = valib.LinReg(data=train_df, 
                     columns=features, 
                     response_column="num_orders")


#### [3.2.2 Model Report](#Sec_3.2.2)


In [13]:
# evaluation metrics report: transpose the single-row statistics so each
# metric becomes a row
df = lin_reg_obj.statistical_measures
pdf = df.to_pandas()
pdf = pdf.T
pdf.columns = ["Value"]
pdf

| Metric | Value |
|---|---|
| Total Observations | 365238.0 |
| Total Sum of Squares | 59310817250.780998 |
| Multiple Correlation Coefficient (R): | 0.445459 |
| Squared Multiple Correlation Coefficient (1-Tolerance) | 0.198434 |
| Adjusted R-Squared | 0.19842 |
| Standard Error of Estimate | 360.788655 |
| Regression Sum of Squares | 11769262840.7237 |
| Regression Degrees of Freedom | 6.0 |
| Regression Mean-Square | 1961543806.78728 |
| Regression F Ratio | 15069.271819 |



#### [3.2.3 Model Evaluation](#Sec_3.2.3)


In [14]:
test_df = DataFrame("DEMAND_FORECAST_TEST_VAL") 
obj = valib.LinRegEvaluator(data=test_df, model=lin_reg_obj.model)                          

In [15]:
print(obj.result)

   Minimum Absolute Error  Maxmum Absolute Error  Average Absolute Error  Standard Error of Estimate
0                0.003523           12272.711455              200.390488                  331.746754



#### [3.2.4 Scoring](#Sec_3.2.4)


In [16]:
obj = valib.LinRegPredict(data=test_df,
                          model=lin_reg_obj.model,
                          response_column="num_orders")
print(obj.result)

    index  num_orders
0  399947  -26.676583
1  402965  265.249105
2  377679  171.798391
3  372112  199.750275
4  420135  213.504123
5  375130  239.434483
6  445421  679.796642
7  439385  248.025750
8  380228  270.993833
9  396929  613.909827



### [3.3 Feature Encoding with VAL](#Sec_3.3)

Extending the example above, and as in the scikit-learn modelling, we can improve the model by adding categorical features and encoding them with VAL's one-hot encoder (exposed through TD ML's Transformations library, which wraps VAL's transformers) before applying the LinReg algorithm.

In [17]:
# data transformation
from teradataml.analytics.Transformations import OneHotEncoder
from teradataml.analytics.Transformations import Retain

data = DataFrame("DEMAND_FORECAST_TRAIN_VAL")
# use TD ML's OneHotEncoder to turn each categorical variable into numeric 0/1 columns
centers = ["TYPE_A", "TYPE_B", "TYPE_C"]
cuisines = ["Continental", "Indian", "Italian", "Thai"]
meals = ["Beverages", "Biryani", "Desert", "Extras", "Fish", "Other Snacks", "Pasta", 
         "Pizza", "Rice Bowl", "Salad", "Sandwich", "Seafood", "Soup", "Starters"]
ohe_center = OneHotEncoder(values=centers, columns= "center_type")
ohe_cuisine = OneHotEncoder(values=cuisines, columns= "cuisine")
ohe_meal = OneHotEncoder(values=meals, columns= "category")
one_hot_encode = [ohe_center, ohe_cuisine, ohe_meal]

retained_cols = ["center_id", "meal_id", "checkout_price", "base_price",
       "emailer_for_promotion", "homepage_featured", "op_area", "num_orders"]
retain = Retain(columns=retained_cols)

tf = valib.Transform(data=data, one_hot_encode=one_hot_encode, retain=retain)
df_train = tf.result
df_train.head()

   index  center_id  meal_id  checkout_price  base_price  emailer_for_promotion  homepage_featured  op_area  num_orders  TYPE_A_center_type  TYPE_B_center_type  TYPE_C_center_type  Continental_cuisine  Indian_cuisine  Italian_cuisine  Thai_cuisine  Beverages_category  Biryani_category  Desert_category  Extras_category  Fish_category  Other Snacks_category  Pasta_category  Pizza_category  Rice Bowl_category  Salad_category  Sandwich_category  Seafood_category  Soup_category  Starters_category
0      2         55     2539          134.86      135.86                      0                  0      2.0         189                   0                   0                   1                    0               0                0             1                   1                 0                0                0              0                      0               0               0                   0               0                  0                 0              0                  0
1     

In [18]:
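# To avoid multicollinearity, pass only k-1 of the k one-hot columns of each
# categorical feature to LinReg: drop one reference category per feature,
# along with the response column.
# Note: this filter keeps every other column of df_train, including the
# "index" key column if it is present.
excluded = {"num_orders", "TYPE_C_center_type", "Thai_cuisine", "Starters_category"}
features = [col_name for col_name in df_train.columns if col_name not in excluded]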
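
The reason for passing only k-1 categories can be seen on a toy example. When all k one-hot columns of a feature are kept, they sum to 1 in every row, which makes them an exact linear combination of the intercept column and leaves the regression coefficients not uniquely determined. The sketch below uses plain Python and made-up rows; the helper `one_hot` is hypothetical and stands in for the VAL encoder.

```python
categories = ["TYPE_A", "TYPE_B", "TYPE_C"]
rows = ["TYPE_A", "TYPE_C", "TYPE_B", "TYPE_A"]  # illustrative data only


def one_hot(value, cats):
    """Encode one categorical value as a list of 0/1 indicators."""
    return [1 if value == cat else 0 for cat in cats]


# All k columns: every row sums to 1, duplicating the intercept column.
full = [one_hot(value, categories) for value in rows]
full_row_sums = [sum(encoded) for encoded in full]

# k-1 columns (TYPE_C dropped): TYPE_C rows are all zeros, so the columns
# no longer add up to a constant and the redundancy is gone.
reduced = [one_hot(value, categories[:-1]) for value in rows]
reduced_row_sums = [sum(encoded) for encoded in reduced]

print(full_row_sums)
print(reduced_row_sums)
```

With the reduced encoding, the dropped category acts as the baseline: each remaining coefficient measures the difference from that baseline.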

In [19]:
lin_reg_obj = valib.LinReg(data=df_train, 
                     columns=features, 
                     response_column="num_orders")

In [20]:
# evaluation metrics report
df = lin_reg_obj.statistical_measures
pdf = df.to_pandas()
pdf = pdf.T
pdf.columns = ["Value"]
pdf

| Metric | Value |
|---|---|
| Total Observations | 365238.0 |
| Total Sum of Squares | 59310817250.780998 |
| Multiple Correlation Coefficient (R): | 0.638477 |
| Squared Multiple Correlation Coefficient (1-Tolerance) | 0.407652 |
| Adjusted R-Squared | 0.40761 |
| Standard Error of Estimate | 310.158293 |
| Regression Sum of Squares | 24178188621.635899 |
| Regression Degrees of Freedom | 26.0 |
| Regression Mean-Square | 929930331.60138 |
| Regression F Ratio | 9666.819694 |



#### [3.3.1 Model Evaluation](#Sec_3.3.1)


In [None]:
test_df = DataFrame("DEMAND_FORECAST_TEST_VAL")
# transform data using the transformer object fitted to the training data
test_tf = valib.Transform(data=test_df, one_hot_encode=tf.one_hot_encode, retain=tf.retain)
test_df_tf = test_tf.result
test_df_tf.columns

In [22]:
obj = valib.LinRegEvaluator(data=test_df_tf, model=lin_reg_obj.model)

In [23]:
print(obj.result)

   Minimum Absolute Error  Maxmum Absolute Error  Average Absolute Error  Standard Error of Estimate
0                0.003503           12038.770401              158.375807                  280.287877



#### [3.3.2 Scoring](#Sec_3.3.2)


In [24]:
obj = valib.LinRegPredict(data=test_df_tf,
                          model=lin_reg_obj.model,
                          response_column="num_orders")
print(obj.result)

    index  num_orders
0  399947  -18.173020
1  402965  549.031949
2  377679   90.837168
3  372112  132.819333
4  420135  356.974342
5  375130  395.665648
6  445421  639.558520
7  439385  597.785478
8  380228  156.671643
9  396929  663.654131


## [4. Code for data upload to Vantage](#Sec_4)

In [None]:
# combining information from the meal_info and center_info tables with the base table
df_combined = DataFrame.from_query('''
SELECT a.*, b.category, b.cuisine, c.center_type, c.op_area
FROM demand_forecast_demo_base as a
	LEFT JOIN 
	demand_forecast_demo_meal as b 
	ON 
	a.meal_id = b.meal_id
	LEFT JOIN 
	demand_forecast_demo_center as c 
	ON
	a.center_id = c.center_id;
    ''')
#split and upload data to Vantage tables for use in AOPS 
n = round(df_combined.shape[0]*0.8) #80% data for training
copy_to_sql(df = df_combined.iloc[0:n], table_name="DEMAND_FORECAST_TRAIN_VAL", schema_name="AOA_DEMO", if_exists="replace", 
            index=True, index_label="index", primary_index="index")
copy_to_sql(df = df_combined.iloc[n:], table_name="DEMAND_FORECAST_TEST_VAL", schema_name="AOA_DEMO", if_exists="replace", 
            index=True, index_label="index", primary_index="index")
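
The split in the cell above is positional: the first 80% of the rows go to the training table and the remaining rows to the test table, with no shuffling. A minimal sketch of the same logic on a plain Python list (the helper `positional_split` is hypothetical and the rows are illustrative only):

```python
def positional_split(rows, train_fraction=0.8):
    """Split rows by position: leading part for training, rest for testing."""
    n = round(len(rows) * train_fraction)
    return rows[:n], rows[n:]


toy_rows = list(range(10))  # stand-in for ten table rows
train_rows, test_rows = positional_split(toy_rows)

print(train_rows)
print(test_rows)
```

Because the order of the rows decides which table each row ends up in, the result depends on how the source data is ordered.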

In [42]:
from teradataml import remove_context
remove_context()

True