<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Use case: Predicting sales using teradataml OpenSourceML on VantageCloud Lake</b>
</header>

### Disclaimer
The sample code (“Sample Code”) provided is not covered by any Teradata agreements. Please be aware that Teradata has no control over the model responses to such sample code and such response may vary. The use of the model by Teradata is strictly for demonstration purposes and does not constitute any form of certification or endorsement. The sample code is provided “AS IS” and any express or implied warranties, including the implied warranties of merchantability and fitness for a particular purpose, are disclaimed. In no event shall Teradata be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) sustained by you or a third party, however caused and on any theory of liability, whether in contract, strict liability, or tort arising in any way out of the use of this sample code, even if advised of the possibility of such damage.

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Description</b>

## <b> Problem overview</b>
    

**Dataset Used : Advertising Sales Dataset**

**Features**:
- `TV`: Advertising done on TV.
- `radio`: Advertising done on radio.
- `newspaper`: Advertising done in newspapers.

**Target Variable**:
- `sales`: The sales achieved after the advertising.
    
**Objective**:
Build a model that accurately predicts sales from the advertising done on each channel.

**Use case**:
This notebook uses teradataml OpenSourceML to train and evaluate a model, then deploys the best model so that it can be loaded in a later session for scoring.

**Workflow Steps**:
- Import the required library functions and create a connection to Vantage.
- Authenticate with VantageCloud Lake and get the OpenAF user environment to use with the teradataml OpenSourceML module.
- Load the dataset and split it into train and test datasets.
- Train a `RandomizedSearchCV` model on the train data and access its trained attributes.
- Deploy the best model to Vantage.
- Deploy a model trained outside Vantage (<b>additional feature</b>).
- Load both saved models in a new session.
- Run predict and access attributes.
- Clean up and remove the connection to Vantage.

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Import libraries and create connection</b>

In [1]:
# Importing required libraries.
import getpass
from teradataml import create_context, remove_context, DataFrame, load_example_data
from teradataml import list_user_envs, get_env, set_auth_token, configure, display
from teradataml import td_sklearn as osml

In [2]:
# Ignoring unnecessary warnings.
import warnings
warnings.simplefilter(action='ignore', category=DeprecationWarning)
display.suppress_vantage_runtime_warnings = True

In [3]:
# Read the connection parameters.
host = getpass.getpass("Host: ")
username = getpass.getpass("Username: ")
password = getpass.getpass("Password: ")

Host:  ········
Username:  ········
Password:  ········


In [4]:
# Create the connection.
con = create_context(host=host, username=username, password=password)

In [5]:
# Read configuration parameters for VantageCloud Lake authentication.
ues_url = getpass.getpass("UES URL: ")
auth_token = getpass.getpass("Auth Token: ")

UES URL:  ········
Auth Token:  ········


In [6]:
# Set configuration parameters for VantageCloud Lake authentication.
set_auth_token(auth_token=auth_token, ues_url=ues_url)

True

In [7]:
# List existing user environments in OpenAF.
list_user_envs()

Unnamed: 0,env_name,env_description,base_env_name,language,conda
0,conda_env_3_10_demo,Conda environment for notebook demo,python_3.10,python,True
1,demo_env,Demo env 1.,python_3.10,Python,False
2,non_conda_env_3_8_demo,Non Conda environment for notebook demo,python_3.8,Python,False
3,openml_env,DONT DELETE: OpenML environment,python_3.10,Python,False
4,openml_env_dhan,DONT DELETE: OpenML environment,python_3.10,Python,False
5,testenv,This env 'testenv' is created with base env 'p...,python_3.10,Python,False


In [8]:
env = get_env("non_conda_env_3_8_demo")
env


Environment Name: non_conda_env_3_8_demo
Base Environment: python_3.8
Description: Non Conda environment for notebook demo

############ Files installed in User Environment ############

                               File   Size             Timestamp
0      file_1725416796321435___1001    769  2024-09-04T01:58:27Z
1      file_1725449732481207___1001   1022  2024-09-04T09:10:35Z
2      file_1725442681529059___1001    769  2024-09-04T09:10:27Z
3              sklearn_transform.py  11654  2024-09-04T09:09:50Z
4                  sklearn_score.py   4626  2024-09-04T09:09:50Z
5                    sklearn_fit.py   6472  2024-09-04T09:09:50Z
6            sklearn_fit_predict.py   4819  2024-09-04T09:09:51Z
7  sklearn_model_selection_split.py   5831  2024-09-04T09:09:51Z
8              sklearn_neighbors.py   5795  2024-09-04T09:09:51Z

############ Libraries installed in User Environment ############

            name version
0         joblib   1.4.2
1          numpy  1.22.4
2            pip  2

teradataml OpenSourceML requires the Python version and the versions of the required Python packages to be the same on the client and in the OpenAF user environment.
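The version requirement above can be checked programmatically instead of by eye. The following is a minimal sketch using only the Python standard library. The helper names `installed_version` and `find_version_mismatches` are hypothetical, and the remote versions are assumed to be copied by hand from the environment listing rather than fetched through any teradataml call.

```python
from importlib import metadata


def installed_version(package):
    """Return the client-side version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_version_mismatches(client_versions, env_versions):
    """Return {package: (client_version, env_version)} for every differing package."""
    mismatches = {}
    for package, env_version in env_versions.items():
        client_version = client_versions.get(package)
        if client_version != env_version:
            mismatches[package] = (client_version, env_version)
    return mismatches


# Hypothetical values for illustration: env_versions is typed in from the
# user environment listing; client_versions would normally be built with
# installed_version(...) for each package of interest.
env_versions = {"joblib": "1.4.2", "numpy": "1.22.4"}
client_versions = {"joblib": "1.4.2", "numpy": "1.24.0"}

mismatches = find_version_mismatches(client_versions, env_versions)
print(mismatches)
```

An empty result means every listed package matches. A non-empty result names the packages to align before using the environment.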

In [9]:
# Verify that the required packages have the same versions on the client as in the OpenAF user environment (listed in the cell above).
!pip list | grep scikit-learn
!pip list | grep scipy
!pip list | grep numpy

scikit-learn              1.1.3
scipy                     1.10.0
numpy                     1.22.4


In [10]:
# Use the "non_conda_env_3_8_demo" environment for teradataml OpenSourceML.
configure.openml_user_env = env

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Loading dataset</b>

In [11]:
# Loading dataset from example data collection.
load_example_data('teradataml','advertising')



In [12]:
# Create a teradataml DataFrame on the loaded table.
advertising_df = DataFrame("advertising")

In [13]:
# Look at 5 rows.
advertising_df.head(5)

TV,radio,newspaper,sales
5.4,29.9,9.4,5.3
7.8,38.9,50.6,6.6
7.3,28.1,41.4,5.5
4.1,11.6,5.7,3.2
0.7,39.6,8.7,1.6


In [14]:
advertising_df.shape

(200, 4)

In [15]:
# Performing sampling to get 80% for training and 20% for testing.
advertising_df_sample = advertising_df.sample(frac = [0.8, 0.2])

# Fetching train and test data.
df_train = advertising_df_sample[advertising_df_sample['sampleid'] == 1].drop('sampleid', axis=1)
df_test = advertising_df_sample[advertising_df_sample['sampleid'] == 2].drop('sampleid', axis=1)
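To make the role of the `sampleid` column concrete, here is a small local analogue in plain Python. It is only a sketch of the idea (each row receives a 1-based sample id, with group sizes set by the fractions) and makes no claim about how teradataml performs the sampling inside Vantage. The function name `assign_sample_ids` and the fixed seed are assumptions made for illustration.

```python
import random


def assign_sample_ids(n_rows, fractions, seed=0):
    """Return one 1-based sample id per row, with group sizes given by `fractions`."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)  # random row order, so each group is a random subset
    sample_ids = [0] * n_rows  # 0 marks a row that falls into no sample
    start = 0
    for sample_id, fraction in enumerate(fractions, start=1):
        size = round(n_rows * fraction)
        for row in order[start:start + size]:
            sample_ids[row] = sample_id
        start += size
    return sample_ids


# 200 rows split 80% / 20%, mirroring the shapes reported for the dataset.
sample_ids = assign_sample_ids(200, [0.8, 0.2])
train_rows = [i for i, s in enumerate(sample_ids) if s == 1]
test_rows = [i for i, s in enumerate(sample_ids) if s == 2]
print(len(train_rows), len(test_rows))
```

Filtering on sample id 1 or 2 and then discarding the id corresponds to the two filter-and-drop statements in the cell above.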

In [16]:
# Train data shape.
df_train.shape

(160, 4)

In [17]:
# Test data shape.
df_test.shape

(40, 4)

In [18]:
# Persisting test data to be used in another session.
df_test.to_sql(table_name='advertising_test', if_exists='replace')

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. Train RandomizedSearchCV on SGDRegressor estimator</b>

Use `SGDRegressor` as the estimator for `RandomizedSearchCV`, which samples hyperparameter combinations from the given distributions and keeps the best-scoring model.

In [19]:
obj = osml.SGDRegressor(max_iter=1000, tol=1e-4)
obj

In [20]:
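# Create the RandomizedSearchCV object.
from scipy.stats import uniform
# alpha is drawn uniformly from [0, 4]; penalty and loss are picked from the lists.
distributions = dict(alpha=uniform(loc=0, scale=4),
                     penalty=['l2', 'l1'],
                     # 'squared_error' replaces 'squared_loss', which scikit-learn deprecated in 1.0.
                     loss=['squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'])
clf = osml.RandomizedSearchCV(obj, distributions, random_state=0)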
clf

In [21]:
# Fit RandomizedSearchCV to find the best model among the sampled parameter combinations.
clf.fit(df_train.select(df_train.columns[:-1]), df_train.select(["sales"]))
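As a rough picture of what the search does before fitting, the sketch below draws ten candidate parameter sets in plain Python. Ten matches scikit-learn's default `n_iter` and the ten entries per array in `cv_results_`. The function `sample_candidates` is hypothetical and the `(low, high)` tuple stands in for `scipy.stats.uniform(loc=0, scale=4)`, which is uniform on `[loc, loc + scale]`. The sampled values will not match those produced by scikit-learn's own random state.

```python
import random


def sample_candidates(distributions, n_iter=10, seed=0):
    """Draw `n_iter` parameter sets: lists are sampled by choice, tuples uniformly."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        candidate = {}
        for name, spec in distributions.items():
            if isinstance(spec, list):
                candidate[name] = rng.choice(spec)  # categorical parameter
            else:
                low, high = spec
                candidate[name] = rng.uniform(low, high)  # continuous parameter
        candidates.append(candidate)
    return candidates


distributions = {
    "alpha": (0.0, 4.0),  # stand-in for uniform(loc=0, scale=4)
    "penalty": ["l2", "l1"],
    # 'squared_error' was named 'squared_loss' in older scikit-learn releases.
    "loss": ["squared_error", "huber", "epsilon_insensitive",
             "squared_epsilon_insensitive"],
}

candidates = sample_candidates(distributions)
for candidate in candidates[:3]:
    print(candidate)
```

Each candidate is then cross-validated, and the one with the best mean score becomes the best estimator.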


<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>4.1. Access attributes of the trained model</b>

In [22]:
clf.best_params_

{'alpha': 0.28414423279154777, 'loss': 'huber', 'penalty': 'l2'}

In [23]:
best_est = clf.best_estimator_
best_est

In [24]:
clf.best_score_

0.6255448637880374

In [25]:
clf.cv_results_

{'mean_fit_time': array([0.00096378, 0.00096583, 0.00069742, 0.00112352, 0.00073304,
        0.00097914, 0.00068316, 0.00074458, 0.0011024 , 0.00088735]),
 'std_fit_time': array([4.81758226e-04, 9.73403233e-05, 4.39837758e-05, 1.01823057e-04,
        2.29989788e-05, 1.02446707e-04, 9.61489366e-05, 5.68054968e-05,
        8.65620869e-05, 3.08857995e-05]),
 'mean_score_time': array([0.00038362, 0.00036674, 0.00036473, 0.00036979, 0.00036798,
        0.00037246, 0.00036321, 0.00036378, 0.00036469, 0.00037627]),
 'std_score_time': array([3.49229102e-05, 7.69674850e-06, 1.48604983e-05, 1.23601122e-05,
        1.96434918e-05, 1.78272062e-05, 6.70732103e-06, 1.03961437e-05,
        5.20080498e-06, 1.63697578e-05]),
 'param_alpha': masked_array(data=[2.195254015709299, 2.4110535042865755,
                    1.6946191973556188, 1.75034884505077,
                    3.854651042004117, 3.1669001523306584,
                    2.2721782443757292, 0.28414423279154777,
                    0.08087358

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Deploy the best model</b>

In [26]:
# Deploy the best estimator of the trained search under the name "sgd_regressor_model".
deployed_model = best_est.deploy(model_name="sgd_regressor_model", replace_if_exists=True)
deployed_model

Model is deleted.
Model is saved.
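The two messages above are consistent with `replace_if_exists=True` removing an earlier model of the same name before saving the new one. The sketch below is only an analogy for that save-by-name behaviour: `ModelRegistry` is a hypothetical in-memory class, a plain dictionary stands in for the model, and nothing here reflects how teradataml actually stores models in Vantage.

```python
import pickle


class ModelRegistry:
    """Hypothetical in-memory stand-in for a store of named, serialized models."""

    def __init__(self):
        self._store = {}

    def deploy(self, model_name, model, replace_if_exists=False):
        """Serialize and save `model`; refuse to overwrite unless asked to."""
        if model_name in self._store and not replace_if_exists:
            raise ValueError(f"Model '{model_name}' already exists.")
        self._store[model_name] = pickle.dumps(model)

    def load(self, model_name):
        """Return a fresh copy of the model saved under `model_name`."""
        return pickle.loads(self._store[model_name])


registry = ModelRegistry()
first = {"alpha": 0.28, "loss": "huber"}
second = {"alpha": 0.28, "loss": "huber", "penalty": "l2"}

registry.deploy("sgd_regressor_model", first)
registry.deploy("sgd_regressor_model", second, replace_if_exists=True)
restored = registry.load("sgd_regressor_model")
print(restored)
```

Because the model is stored under a name rather than held in a Python variable, it can be retrieved after the current session ends, which is what the later sections rely on.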


In [27]:
type(deployed_model)

teradataml.opensource.sklearn._sklearn_wrapper._SkLearnObjectWrapper

In [28]:
# Get the training score.
deployed_model.score(X=df_train.select(df_train.columns[:-1]), y=df_train.select(["sales"]))

score
0.7600104281305883
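Assuming `score` follows scikit-learn's convention for regressors, the value above is the coefficient of determination (R²) on the training data, whereas `best_score_` is the mean cross-validated score of the best estimator, which explains why the two numbers differ. The sketch below computes R² by hand. The sales values are the five rows shown by `head(5)`, and the function name `r2_score` is chosen for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [5.3, 6.6, 5.5, 3.2, 1.6]  # sales values from the head(5) output
mean_prediction = [sum(y_true) / len(y_true)] * len(y_true)

perfect = r2_score(y_true, y_true)            # predictions equal to the targets
baseline = r2_score(y_true, mean_prediction)  # always predicting the mean
print(perfect, baseline)
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model worse than that baseline scores below 0.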


<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. Deploy the model trained outside Vantage</b>

OpenSourceML can also deploy (save) models that were trained outside Vantage.<br>
These deployed models can later be loaded through OpenSourceML for prediction or scoring in another session.

In [29]:
# Train a scikit-learn model on the client.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
# Synthetic data with 3 features, standing in for TV, radio, newspaper.
X, y = make_regression(n_features=3, random_state=0)
y = abs(y) # Sales are non-negative.
clf = make_pipeline(StandardScaler(),
                    LinearRegression())
clf.fit(X, y)

In [30]:
outside_model = osml.deploy(model_name="LR_model_trained_outside", model=clf, replace_if_exists=True)
outside_model

Model is deleted.
Model is saved.


In [31]:
type(outside_model)

teradataml.opensource.sklearn._sklearn_wrapper._SkLearnObjectWrapper

In [32]:
# Removing the context to load the saved models in another session.
remove_context()

True

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>7. Load the saved models in another session</b>

In [33]:
# Create connection again to the same host.
con = create_context(host=host, username=username, password=password)

In [34]:
# Authenticate VantageCloud Lake.
set_auth_token(auth_token=auth_token, ues_url=ues_url)

True

In [35]:
# Set the required OpenSourceML user environment; the functions must run in the same environment as before.
configure.openml_user_env = get_env("non_conda_env_3_8_demo")

<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>7.1. Load the saved best model</b>

In [36]:
# Load the saved model.
loaded_model = osml.load(model_name="sgd_regressor_model")
loaded_model

In [37]:
type(loaded_model)

teradataml.opensource.sklearn._sklearn_wrapper._SkLearnObjectWrapper

In [38]:
loaded_model.get_params()

{'alpha': 0.28414423279154777,
 'average': False,
 'early_stopping': False,
 'epsilon': 0.1,
 'eta0': 0.01,
 'fit_intercept': True,
 'l1_ratio': 0.15,
 'learning_rate': 'invscaling',
 'loss': 'huber',
 'max_iter': 1000,
 'n_iter_no_change': 5,
 'penalty': 'l2',
 'power_t': 0.25,
 'random_state': None,
 'shuffle': True,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [39]:
# Predict sales on test data.
df_test = DataFrame("advertising_test")
opt = loaded_model.predict(df_test.select(df_test.columns[:-1]))
opt

TV,radio,newspaper,sgdregressor_predict_1
62.3,12.6,18.3,7.289186420563085
38.2,3.7,13.8,3.99301712690689
168.4,7.1,12.8,13.74526285312672
85.7,35.8,49.3,13.88296135297689
204.1,32.9,46.0,21.713120888831853
69.2,20.5,18.3,9.01708387634114
76.3,27.5,16.0,10.522097502282024
239.8,4.1,36.9,19.362240021652973
290.7,4.1,8.5,21.79536932114689
156.6,2.6,8.3,12.01304466060292


<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>7.1.1. Access attributes of saved model</b>

In [40]:
loaded_model.coef_

array([0.07113404, 0.15659147, 0.04181666])

In [41]:
loaded_model.intercept_

array([0.11923866])

In [42]:
loaded_model.n_iter_

28

In [43]:
loaded_model.t_

4481.0
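The `coef_` and `intercept_` values shown above are enough to reproduce a prediction by hand, since a linear model predicts the intercept plus the weighted sum of the features. The sketch below uses the coefficients as printed, in the column order `TV`, `radio`, `newspaper`. Because the printed values are rounded, the result agrees with the `predict` output only to within about 1e-5. The function name `predict_row` is chosen for illustration.

```python
# Values copied from the coef_ and intercept_ outputs above (rounded as printed).
coef = [0.07113404, 0.15659147, 0.04181666]  # order: TV, radio, newspaper
intercept = 0.11923866


def predict_row(features):
    """Linear prediction: intercept plus the dot product of coefficients and features."""
    return intercept + sum(c * x for c, x in zip(coef, features))


# First two rows of the prediction output: (TV, radio, newspaper).
first = predict_row([62.3, 12.6, 18.3])
second = predict_row([38.2, 3.7, 13.8])
print(first, second)
```

Comparing these numbers with the `sgdregressor_predict_1` column is a quick check that the loaded model carries the same fitted parameters as the one that was deployed.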

<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>7.2. Load the model trained outside Vantage</b>

In [44]:
# Load the saved model.
loaded_model = osml.load(model_name="LR_model_trained_outside")
loaded_model

In [45]:
type(loaded_model)

teradataml.opensource.sklearn._sklearn_wrapper._SkLearnObjectWrapper

In [46]:
opt = loaded_model.predict(df_test.select(df_test.columns[:-1]))
opt

TV,radio,newspaper,pipeline_predict_1
69.2,20.5,18.3,1403.6941966744905
187.8,21.1,9.5,3361.956848178012
202.5,22.3,31.6,3619.8221796831062
7.8,38.9,50.6,580.5686798158789
216.8,43.9,27.2,4069.303607275704
175.1,22.5,31.5,3170.408226003832
220.5,33.2,37.9,4025.749812480743
62.3,12.6,18.3,1211.5158612263383
204.1,32.9,46.0,3753.9799579475034
85.7,35.8,49.3,1832.855142961112


<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>8. Clean up and remove connection</b>

In [47]:
# Drop persisted table.
from teradataml import db_drop_table
db_drop_table(table_name='advertising_test')

True

In [48]:
remove_context()

True