# Runbook for Regression Modelling using Vantage Analytics Library (VAL)


## Introduction

___This notebook complements the demand forecasting notebook available [here](../../b1e18b12-5ccd-4c94-b96b-c2cebf230150/notebooks/demand_forecasting.ipynb).___ That notebook provides the problem description and a detailed EDA of the given datasets, and focuses on modelling with the machine learning functions of the scikit-learn Python library. It also explains how to prepare and upload the training and test datasets for this problem so that they can be used for demos in AOPS (they are already available in Vantage).<br> 
This notebook, by contrast, models the same problem using the analytics functions available in the Vantage Analytics Library (VAL). In addition to in-database scoring, VAL lets you train models in-database using Teradata's native analytics functions.<br><br> 

In [1]:
import pandas as pd
#teradata ML Libraries
from teradataml import DataFrame , create_context, copy_to_sql
import getpass

In [2]:
#VAL libraries and VAL installation path
from teradataml import valib
from teradataml import configure
configure.val_install_location = "VAL"

## 1. Establish connection to AOPS Vantage instance

In [3]:
#connection context; 
host = "3.238.151.85"
username = "AOA_DEMO" #update username as needed
password = getpass.getpass() #prompt for the password; never hard-code it
logmech = None #set to "LDAP" if required
database_name = "AOA_DEMO"

 ··········


In [4]:
# create the connection using credentials
eng=create_context(host=host, username=username, password=password)#, logmech=logmech)
conn=eng.connect()


## 2. Modelling with VAL 

The Vantage Analytics Library (VAL) provides a suite of algorithms for solving machine learning problems, and new functions are continuously being added.<br>
For regression problems, VAL currently provides a linear regression algorithm, which is used here for the demand forecasting problem.<br>
VAL functions can be invoked in two ways: 
as a SQL call to the underlying UDFs; 
or through the Teradata ML Python wrappers.<br>
This notebook demonstrates both. However, the Python wrappers are the preferred way of invoking VAL functions in AOPS. 



### 2.1. Invoking VAL functions using SQL


In [5]:
%load_ext sql

In [6]:
%sql teradatasql://$username:$password@$host/


#### 2.1.1. Model Training


In [7]:
%%sql
call VAL.td_analyze(
    'linear','
    database = AOA_DEMO;
    tablename = DEMAND_FORECAST_TRAIN_VAL;
    columns = center_id, meal_id, checkout_price, base_price, emailer_for_promotion, homepage_featured;
    dependent = num_orders;
    outputdatabase = AOA_DEMO;
    outputtablename = demand_forecast_val_linreg_model
    ');

 * teradatasql://AOA_DEMO:***@3.238.151.85/
0 rows affected.



#### 2.1.2. Model Evaluation Report


In [8]:
#read back analytics on train data in a TD DF
model_rpt = DataFrame("demand_forecast_val_linreg_model_rpt")
#model_txt = DataFrame("demand_forecast_val_linreg_model_txt")

In [9]:
# evaluation metrics report
#train_rpt.head()
#print(train_rpt[])
model_rpt_pdf = model_rpt.to_pandas()
model_rpt_pdf = model_rpt_pdf.T
model_rpt_pdf.columns = ["Value"]
#model_rpt_pdf.index.rename("Metric", inplace=True)
model_rpt_pdf

Unnamed: 0,Value
Total Observations,365238.0
Total Sum of Squares,59310817250.780998
Multiple Correlation Coefficient (R):,0.445459
Squared Multiple Correlation Coefficient (1-Tolerance),0.198434
Adjusted R-Squared,0.19842
Standard Error of Estimate,360.788655
Regression Sum of Squares,11769262840.7237
Regression Degrees of Freedom,6.0
Regression Mean-Square,1961543806.78729
Regression F Ratio,15069.271819



#### 2.1.3. Scoring and Evaluation

The scoringmethod parameter controls whether the function is invoked in scoring or evaluation mode.
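
Because the two modes differ only in the scoringmethod value, the parameter string can be assembled programmatically. The sketch below is a hypothetical helper (build_val_params is not part of VAL or teradataml) that joins key/value pairs into the "key = value;" form used in this notebook. The value "score" for scoring mode is an assumption; only "evaluate" appears in this notebook.

```python
def build_val_params(params):
    """Join key/value pairs into the 'key = value;' string passed to td_analyze."""
    return " ".join(f"{key} = {value};" for key, value in params.items())


# Parameters shared by both modes, taken from the call used in this notebook.
common = {
    "database": "AOA_DEMO",
    "tablename": "DEMAND_FORECAST_TEST_VAL",
    "modeldatabase": "AOA_DEMO",
    "modeltablename": "demand_forecast_val_linreg_model",
    "outputdatabase": "AOA_DEMO",
    "outputtablename": "demand_forecast_val_linreg_results",
}

# Evaluation mode, as used in this notebook.
evaluate_params = build_val_params({**common, "scoringmethod": "evaluate"})

# Scoring mode; the value "score" is an assumption, check the VAL documentation.
score_params = build_val_params({**common, "scoringmethod": "score"})

evaluate_call = f"call VAL.td_analyze('linearscore', '{evaluate_params}');"
print(evaluate_call)
```

The resulting string can be pasted into a %%sql cell or executed through the existing connection.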


In [10]:
%%sql
call VAL.td_analyze(
    'linearscore','
    database = AOA_DEMO;
    tablename = DEMAND_FORECAST_TEST_VAL;
    modeldatabase  = AOA_DEMO;
    modeltablename  = demand_forecast_val_linreg_model;
    outputdatabase = AOA_DEMO;
    outputtablename = demand_forecast_val_linreg_results;
    scoringmethod = evaluate;
    ');

 * teradatasql://AOA_DEMO:***@3.238.151.85/
0 rows affected.


In [11]:
#read back results in a TD DF
result = DataFrame("demand_forecast_val_linreg_results_txt")
result

                        Maxmum Absolute Error  Average Absolute Error  Standard Error of Estimate
Minimum Absolute Error                                                                           
0.003523                         12272.711455              200.390488                  331.746754
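
To make the four evaluation metrics concrete, here is a minimal sketch that computes them from lists of actual and predicted values. The data are toy values, not taken from the demand forecasting tables, and the standard error formula (square root of the squared-error sum divided by n - p - 1) is an assumption about how VAL defines it.

```python
import math


def evaluation_metrics(actual, predicted, n_features):
    """Compute absolute-error summaries and the standard error of estimate."""
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    sum_squared_errors = sum(e * e for e in abs_errors)
    # Assumed degrees of freedom: observations minus features minus intercept.
    dof = len(actual) - n_features - 1
    return {
        "Minimum Absolute Error": min(abs_errors),
        "Maximum Absolute Error": max(abs_errors),
        "Average Absolute Error": sum(abs_errors) / len(abs_errors),
        "Standard Error of Estimate": math.sqrt(sum_squared_errors / dof),
    }


# Toy example with a single feature.
actual = [10, 20, 30, 40, 50]
predicted = [12, 18, 33, 39, 55]
metrics = evaluation_metrics(actual, predicted, n_features=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```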


### 2.2. Invoking VAL functions using TD Python Wrappers



#### 2.2.1. Model Training


In [7]:
train_df = DataFrame("DEMAND_FORECAST_TRAIN_VAL") 
features = ["center_id", "meal_id", "checkout_price", "base_price",
       "emailer_for_promotion", "homepage_featured"]
lin_reg_obj = valib.LinReg(data=train_df, 
                     columns=features, 
                     response_column="num_orders")


#### 2.2.2. Model Report


In [13]:
# evaluation metrics report
df = lin_reg_obj.statistical_measures
pdf = df.to_pandas()
pdf = pdf.T
pdf.columns = ["Value"]
#train_rpt_pdf.index.rename("Metric", inplace=True)
pdf

Unnamed: 0,Value
Total Observations,365238.0
Total Sum of Squares,59310817250.780998
Multiple Correlation Coefficient (R):,0.445459
Squared Multiple Correlation Coefficient (1-Tolerance),0.198434
Adjusted R-Squared,0.19842
Standard Error of Estimate,360.788655
Regression Sum of Squares,11769262840.7237
Regression Degrees of Freedom,6.0
Regression Mean-Square,1961543806.78729
Regression F Ratio,15069.271819



#### 2.2.3. Model Evaluation


In [14]:
test_df = DataFrame("DEMAND_FORECAST_TEST_VAL") 
obj = valib.LinRegEvaluator(data=test_df, model=lin_reg_obj.model)                          

In [15]:
print(obj.result)

   Minimum Absolute Error  Maxmum Absolute Error  Average Absolute Error  Standard Error of Estimate
0                0.003523           12272.711455              200.390488                  331.746754



#### 2.2.4. Scoring


In [16]:
obj = valib.LinRegPredict(data=test_df,
                          model=lin_reg_obj.model,
                          response_column="num_orders")
print(obj.result)

    index  num_orders
0  436367  196.387543
1  411081    0.093443
2  425233  139.338417
3  408532  372.484039
4  383246  278.395831
5  397398  140.469640
6  453537  627.687378
7  366076  213.590031
8  441934  222.901165
9  382777  337.582222



### 2.3. Linear regression with onehot encoded categorical features

As in the scikit-learn modelling, the model above can be improved by adding some categorical features. These are encoded with VAL's one-hot encoder (through TD ML's Transformations library, which provides a wrapper around VAL's transformers) before the LinReg algorithm is applied.
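
As a plain-Python illustration of what one-hot encoding does, the sketch below replaces a categorical column with one 0/1 indicator column per category value. It is not the VAL implementation; the function name and the toy rows are hypothetical, and the value_column naming simply mirrors the Thai_cuisine style column names that appear in this notebook's outputs.

```python
def one_hot_encode(rows, column, values):
    """Replace `column` with one 0/1 indicator column per value in `values`."""
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: val for key, val in row.items() if key != column}
        for value in values:
            new_row[f"{value}_{column}"] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded


# Toy rows; the meal_id values are made up for illustration.
rows = [
    {"meal_id": 1, "cuisine": "Thai"},
    {"meal_id": 2, "cuisine": "Indian"},
]
cuisines = ["Continental", "Indian", "Italian", "Thai"]

encoded = one_hot_encode(rows, "cuisine", cuisines)
for row in encoded:
    print(row)
```

Each encoded row has exactly one indicator set to 1 per categorical feature, which matters when choosing the columns passed to the regression.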

In [8]:
# data transformation
from teradataml.analytics.Transformations import OneHotEncoder
data = DataFrame("DEMAND_FORECAST_TRAIN_VAL")
#use TD ML's OneHotEncoder to turn the categorical variables into numeric indicator columns
index_columns = ["center_id", "meal_id", "checkout_price", "base_price",
       "emailer_for_promotion", "homepage_featured", "op_area", "num_orders"]
centers = ["TYPE_A", "TYPE_B", "TYPE_C"]
cuisines = ["Continental", "Indian", "Italian", "Thai"]
meals = ["Beverages", "Biryani", "Desert", "Extras", "Fish", "Other Snacks", "Pasta", 
         "Pizza", "Rice Bowl", "Salad", "Sandwich", "Seafood", "Soup", "Starters"]
ohe_center = OneHotEncoder(values=centers, columns= "center_type")
ohe_cuisine = OneHotEncoder(values=cuisines, columns= "cuisine")
ohe_meal = OneHotEncoder(values=meals, columns= "category")
one_hot_encode = [ohe_center, ohe_cuisine, ohe_meal]

tf = valib.Transform(data=data, one_hot_encode=one_hot_encode, index_columns=index_columns)
df_train = tf.result
df_train.head()

   center_id  meal_id  checkout_price  base_price  emailer_for_promotion  homepage_featured  op_area  num_orders  TYPE_A_center_type  TYPE_B_center_type  TYPE_C_center_type  Continental_cuisine  Indian_cuisine  Italian_cuisine  Thai_cuisine  Beverages_category  Biryani_category  Desert_category  Extras_category  Fish_category  Other Snacks_category  Pasta_category  Pizza_category  Rice Bowl_category  Salad_category  Sandwich_category  Seafood_category  Soup_category  Starters_category
0         10     1847          256.08      258.08                      0                  0      6.3          28                   0                   1                   0                    0               0                0             1                   0                 0                0                0              0                      0               0               0                   0               0                  0                 0              1                  0
1         10     119

In [9]:
# to avoid multi-collinearity issue we need to pass 
# k-1 categories for each categorical feature to LinReg function
features = [col_name for col_name in df_train.columns if not (col_name=="num_orders" 
            or col_name=="TYPE_C_center_type"
            or col_name=="Thai_cuisine"
            or col_name=="Starters_category")]
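
The reason for dropping one category per feature is that the k indicator columns of a one-hot encoded feature always sum to 1, which duplicates the intercept column and makes the least-squares matrix singular. The sketch below shows this on a toy example and an equivalent, set-based way of writing the column filter; the shortened column list is illustrative only.

```python
# The three possible one-hot rows for center_type (TYPE_A, TYPE_B, TYPE_C).
center_dummies = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

# Every row sums to 1, so the three columns together reproduce the intercept.
row_sums = [sum(row) for row in center_dummies]
print(row_sums)

# Equivalent column filter using a set of excluded names:
# the target plus one reference category per categorical feature.
all_columns = [
    "center_id", "num_orders",
    "TYPE_A_center_type", "TYPE_B_center_type", "TYPE_C_center_type",
    "Continental_cuisine", "Indian_cuisine", "Italian_cuisine", "Thai_cuisine",
]
excluded = {"num_orders", "TYPE_C_center_type", "Thai_cuisine"}
features = [name for name in all_columns if name not in excluded]
print(features)
```

The dropped category becomes the reference level: the coefficients of the remaining indicators are interpreted relative to it.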

In [10]:
lin_reg_obj = valib.LinReg(data=df_train, 
                     columns=features, 
                     response_column="num_orders")

In [11]:
# evaluation metrics report
df = lin_reg_obj.statistical_measures
pdf = df.to_pandas()
pdf = pdf.T
pdf.columns = ["Value"]
pdf

Unnamed: 0,Value
Total Observations,350157.0
Total Sum of Squares,58602487769.626503
Multiple Correlation Coefficient (R):,0.635601
Squared Multiple Correlation Coefficient (1-Tolerance),0.403989
Adjusted R-Squared,0.403946
Standard Error of Estimate,315.8421
Regression Sum of Squares,23674738413.221699
Regression Degrees of Freedom,25.0
Regression Mean-Square,946989536.528869
Regression F Ratio,9493.036326



#### 2.3.1. Model Evaluation


In [17]:
test_df = DataFrame("DEMAND_FORECAST_TEST_VAL")
#index_columns = ["center_id", "meal_id", "checkout_price", "base_price",
#       "emailer_for_promotion", "homepage_featured", "op_area"]
# transform the data using the transformer object used to transform the training data
test_tf = valib.Transform(data=test_df, one_hot_encode=tf.one_hot_encode, index_columns=tf.index_columns)
#test_tf = valib.Transform(data=test_df, one_hot_encode=tf.one_hot_encode, index_columns=index_columns)
#test_tf = valib.Transform(data=test_df, one_hot_encode=tf.one_hot_encode)
test_df_tf = test_tf.result
test_df_tf.columns

['center_id',
 'meal_id',
 'checkout_price',
 'base_price',
 'emailer_for_promotion',
 'homepage_featured',
 'op_area',
 'num_orders',
 'TYPE_A_center_type',
 'TYPE_B_center_type',
 'TYPE_C_center_type',
 'Continental_cuisine',
 'Indian_cuisine',
 'Italian_cuisine',
 'Thai_cuisine',
 'Beverages_category',
 'Biryani_category',
 'Desert_category',
 'Extras_category',
 'Fish_category',
 'Other Snacks_category',
 'Pasta_category',
 'Pizza_category',
 'Rice Bowl_category',
 'Salad_category',
 'Sandwich_category',
 'Seafood_category',
 'Soup_category',
 'Starters_category']

In [19]:
df_test = test_df_tf.get(features)

In [20]:
df_test.columns

['center_id',
 'meal_id',
 'checkout_price',
 'base_price',
 'emailer_for_promotion',
 'homepage_featured',
 'op_area',
 'TYPE_A_center_type',
 'TYPE_B_center_type',
 'Continental_cuisine',
 'Indian_cuisine',
 'Italian_cuisine',
 'Beverages_category',
 'Biryani_category',
 'Desert_category',
 'Extras_category',
 'Fish_category',
 'Other Snacks_category',
 'Pasta_category',
 'Pizza_category',
 'Rice Bowl_category',
 'Salad_category',
 'Sandwich_category',
 'Seafood_category',
 'Soup_category']

In [21]:
#obj = valib.LinRegEvaluator(data=test_df_tf, model=lin_reg_obj.model)
obj = valib.LinRegEvaluator(data=df_test, model=lin_reg_obj.model)

OperationalError: [Version 17.10.0.2] [Session 12385] [Teradata Database] [Error 7825] in UDF/XSP/UDM VAL.td_analyze: SQLSTATE 38U01: [Teradata Database] [TeraJDBC 16.20.00.02] [Error 3720] [SQLState HY000] This view does not contain any complete index columns of the underlying table.
 at gosqldriver/teradatasql.(*teradataConnection).formatDatabaseError TeradataConnection.go:1142
 at gosqldriver/teradatasql.(*teradataConnection).makeChainedDatabaseError TeradataConnection.go:1158
 at gosqldriver/teradatasql.(*teradataConnection).processErrorParcel TeradataConnection.go:1232
 at gosqldriver/teradatasql.(*TeradataRows).processResponseBundle TeradataRows.go:2112
 at gosqldriver/teradatasql.(*TeradataRows).executeSQLRequest TeradataRows.go:794
 at gosqldriver/teradatasql.newTeradataRows TeradataRows.go:653
 at gosqldriver/teradatasql.(*teradataStatement).QueryContext TeradataStatement.go:122
 at gosqldriver/teradatasql.(*teradataConnection).QueryContext TeradataConnection.go:2113
 at database/sql.ctxDriverQuery ctxutil.go:48
 at database/sql.(*DB).queryDC.func1 sql.go:1579
 at database/sql.withLock sql.go:3204
 at database/sql.(*DB).queryDC sql.go:1574
 at database/sql.(*Conn).QueryContext sql.go:1823
 at main.goCreateRows goside.go:652
 at main._cgoexpwrap_7f5c3249bf12_goCreateRows _cgo_gotypes.go:363
 at runtime.cgocallbackg1 cgocall.go:332
 at runtime.cgocallbackg cgocall.go:207
 at runtime.cgocallback_gofunc asm_amd64.s:793
 at runtime.goexit asm_amd64.s:1373

In [29]:
tf.index_columns

['center_id',
 'meal_id',
 'checkout_price',
 'base_price',
 'emailer_for_promotion',
 'homepage_featured',
 'op_area',
 'num_orders']

In [30]:
print(tf.one_hot_encode)

[<teradataml.analytics.Transformations.OneHotEncoder object at 0x0000015AFC41EB80>, <teradataml.analytics.Transformations.OneHotEncoder object at 0x0000015AFC41EC10>, <teradataml.analytics.Transformations.OneHotEncoder object at 0x0000015AFC41EC40>]


In [16]:
print(obj.result)

   Minimum Absolute Error  Maxmum Absolute Error  Average Absolute Error  Standard Error of Estimate
0                0.003523           12272.711455              200.390488                  331.746754



#### 2.3.2. Scoring


In [29]:
obj = valib.LinRegPredict(data=test_df_tf,
                          model=lin_reg_obj.model,
                          response_column="num_orders")
print(obj.result)

OperationalError: [Version 17.0.0.8] [Session 10238] [Teradata Database] [Error 7825] in UDF/XSP/UDM VAL.td_analyze: SQLSTATE 38U01: Duplicate output names detected.  Try changing predicted or residual names.
 at gosqldriver/teradatasql.(*teradataConnection).formatDatabaseError TeradataConnection.go:1139
 at gosqldriver/teradatasql.(*teradataConnection).makeChainedDatabaseError TeradataConnection.go:1155
 at gosqldriver/teradatasql.(*teradataConnection).processErrorParcel TeradataConnection.go:1218
 at gosqldriver/teradatasql.(*TeradataRows).processResponseBundle TeradataRows.go:1807
 at gosqldriver/teradatasql.(*TeradataRows).executeSQLRequest TeradataRows.go:640
 at gosqldriver/teradatasql.newTeradataRows TeradataRows.go:499
 at gosqldriver/teradatasql.(*teradataStatement).QueryContext TeradataStatement.go:122
 at gosqldriver/teradatasql.(*teradataConnection).QueryContext TeradataConnection.go:2091
 at database/sql.ctxDriverQuery ctxutil.go:48
 at database/sql.(*DB).queryDC.func1 sql.go:1579
 at database/sql.withLock sql.go:3204
 at database/sql.(*DB).queryDC sql.go:1574
 at database/sql.(*Conn).QueryContext sql.go:1823
 at main.goCreateRows goside.go:652
 at main._cgoexpwrap_7f5c3249bf12_goCreateRows _cgo_gotypes.go:363
 at runtime.cgocallbackg1 cgocall.go:332
 at runtime.cgocallbackg cgocall.go:207
 at runtime.cgocallback_gofunc asm_amd64.s:793
 at runtime.goexit asm_amd64.s:1373

In [10]:
feature_names = ["center_id", "meal_id", "checkout_price", 
                 "base_price", "emailer_for_promotion", "homepage_featured",
                 "category", "cuisine", "center_type", "op_area"]
feature_names_cat = ["center_type", "category", "cuisine"]
target_name = "num_orders"

In [None]:
#extract categories as a list
train_df = DataFrame("DEMAND_FORECAST_TRAIN_VAL")
feature_cat = {}
for feature in ["center_type", "category", "cuisine"]:
    q = "select distinct(" + feature + ") from AOA_DEMO.DEMAND_FORECAST_TRAIN_VAL"
    r = DataFrame.from_query(q).to_pandas()
    s = r[feature]
    l = s.dropna().to_list()
    feature_cat[feature] = l 

In [None]:
#r = DataFrame.from_query("select distinct(center_type) from AOA_DEMO.DEMAND_FORECAST_TRAIN_VAL")
type(r)
p = r.to_pandas()
#r = DataFrame.from_query(q).get_values()
#t = r.select([feature]).squeeze()
#t

In [39]:
data.shape

(365238, 12)

In [13]:
model = valib.LinReg(data=df, 
                     columns=features, 
                     response_column="num_orders")

In [15]:
type(model.statistical_measures)

teradataml.dataframe.dataframe.DataFrame

In [43]:
copy_to_sql(df, table_name="demand_forecast_train_transformed", schema_name="AOA_DEMO", if_exists="replace")

In [45]:
#df_tf = DataFrame("demand_forecast_train_transformed")
df_tf.columns

['center_id',
 'meal_id',
 'checkout_price',
 'base_price',
 'emailer_for_promotion',
 'homepage_featured',
 'num_orders',
 'op_area',
 'TYPE_A_center_type',
 'TYPE_B_center_type',
 'TYPE_C_center_type',
 'Continental_cuisine',
 'Indian_cuisine',
 'Italian_cuisine',
 'Thai_cuisine',
 'Beverages_category',
 'Biryani_category',
 'Desert_category',
 'Extras_category',
 'Fish_category',
 'Other Snacks_category',
 'Pasta_category',
 'Pizza_category',
 'Rice Bowl_category',
 'Salad_category',
 'Sandwich_category',
 'Seafood_category',
 'Soup_category',
 'Starters_category']

In [None]:
# this doesn't work -- output cleared
%%sql
call td_analyze (
    'vartran','
    database = AOA_DEMO;
    tablename = DEMAND_FORECAST_TRAIN_VAL;
    designcode =
     {designstyle (dummycode), designvalues (TYPE_A, TYPE_B, TYPE_C), columns (center_type)};
    outputdatabase = AOA_DEMO;
    outputtablename = demand_forecast_val_encoder;
    ');

In [49]:
%%sql
call VAL.td_analyze(
    'linear','
    database = AOA_DEMO;
    tablename = demand_forecast_train_transformed;
    columns = center_id,meal_id,checkout_price,base_price,emailer_for_promotion,homepage_featured,op_area,TYPE_A_center_type,TYPE_B_center_type,TYPE_C_center_type,Continental_cuisine,Indian_cuisine,Italian_cuisine,Thai_cuisine,Beverages_category,Biryani_category,Desert_category,Extras_category,Fish_category,Other Snacks_category,Pasta_category,Pizza_category,Rice Bowl_category,Salad_category,Sandwich_category,Seafood_category,Soup_category,Starters_category;
    dependent = num_orders;
    outputdatabase = AOA_DEMO;
    outputtablename = demand_forecast_val_linreg_model_ext
    ');

 * teradatasql://AOA_DEMO:***@3.238.151.85/
(teradatasql.OperationalError) [Version 17.0.0.8] [Session 9652] [Teradata Database] [Error 7825] in UDF/XSP/UDM VAL.td_analyze: SQLSTATE 38U01: Unable to calculate results.  Matrix is not symmetric positive definite.
 at gosqldriver/teradatasql.(*teradataConnection).formatDatabaseError TeradataConnection.go:1139
 at gosqldriver/teradatasql.(*teradataConnection).makeChainedDatabaseError TeradataConnection.go:1155
 at gosqldriver/teradatasql.(*teradataConnection).processErrorParcel TeradataConnection.go:1218
 at gosqldriver/teradatasql.(*TeradataRows).processResponseBundle TeradataRows.go:1807
 at gosqldriver/teradatasql.(*TeradataRows).executeSQLRequest TeradataRows.go:640
 at gosqldriver/teradatasql.newTeradataRows TeradataRows.go:499
 at gosqldriver/teradatasql.(*teradataStatement).QueryContext TeradataStatement.go:122
 at gosqldriver/teradatasql.(*teradataConnection).QueryContext TeradataConnection.go:2091
 at database/sql.ctxDriverQuery c

## 3. Code for data upload to Vantage

In [None]:
# combining information from the meal_info and center_info tables with the base table
df_combined = DataFrame.from_query("""
SELECT a.*, b.category, b.cuisine, c.center_type, c.op_area
FROM demand_forecast_demo_base as a
	LEFT JOIN 
	demand_forecast_demo_meal as b 
	ON 
	a.meal_id = b.meal_id
	LEFT JOIN 
	demand_forecast_demo_center as c 
	ON
	a.center_id = c.center_id;
    """)
#split and upload data to Vantage tables for use in AOPS 
n = round(df_combined.shape[0]*0.8) #80% data for training
copy_to_sql(df = df_combined.iloc[0:n], table_name="DEMAND_FORECAST_TRAIN_VAL", schema_name="AOA_DEMO", if_exists="replace", 
            index=True, index_label="index", primary_index="index")
copy_to_sql(df = df_combined.iloc[n:], table_name="DEMAND_FORECAST_TEST_VAL", schema_name="AOA_DEMO", if_exists="replace", 
            index=True, index_label="index", primary_index="index")
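
The positional 80/20 split used above can be sketched on a plain Python list. This is only an illustration of the index arithmetic, with toy data and a hypothetical helper name; it does not touch Vantage.

```python
def split_index(n_rows, train_fraction=0.8):
    """Return the row position that separates the training and test parts."""
    return round(n_rows * train_fraction)


# Toy data standing in for the combined table.
rows = list(range(10))
n = split_index(len(rows))

train_rows = rows[:n]  # first 80% of the rows
test_rows = rows[n:]   # remaining 20%
print(len(train_rows), len(test_rows))
```

A split by position is only reproducible if the row order is stable, so it is worth confirming how the source rows are ordered before relying on it.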

In [42]:
from teradataml import remove_context
remove_context()

True