<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Use case - Predicting sales using teradataml OpenSourceML on VantageCloud Lake</b>
</header>

### Disclaimer
The sample code (“Sample Code”) provided is not covered by any Teradata agreements. Please be aware that Teradata has no control over the model responses to such sample code and such response may vary. The use of the model by Teradata is strictly for demonstration purposes and does not constitute any form of certification or endorsement. The sample code is provided “AS IS” and any express or implied warranties, including the implied warranties of merchantability and fitness for a particular purpose, are disclaimed. In no event shall Teradata be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) sustained by you or a third party, however caused and on any theory of liability, whether in contract, strict liability, or tort arising in any way out of the use of this sample code, even if advised of the possibility of such damage.

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Description</b>

## <b> Problem overview</b>
    

**Dataset Used : Advertising Sales Dataset**

**Features**:
- `TV`: Advertising done on TV.
- `Radio`: Advertising done on Radio.
- `Newspaper`: Advertising done in newspapers.

**Target Variable**:
- `Sales`: The sales achieved after advertising.
    
**Objective**:
The objective is to build a model that accurately predicts sales from the advertising done on each channel.

**Use case**:
Here, we use teradataml OpenSourceML to build and evaluate the model, then deploy the best model so that later sessions can load it for scoring.

**Workflow Steps**:
- Import required library functions and create connection to Vantage.
- Authenticate to VantageCloud Lake and get the user environment from OpenAF for use with the teradataml OpenSourceML module.
- Load the dataset. Split dataset in train and test datasets.
- Train `RandomizedSearchCV` model on train data and access trained attributes.
- Deploy best model to Vantage.
- Load the saved model in new session.
- Run predict and access attributes.
- Remove connection to Vantage.

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Import libraries and create connection</b>

In [1]:
# Importing required libraries.
import getpass
from teradataml import create_context, remove_context, DataFrame, load_example_data
from teradataml import list_user_envs, get_env, set_auth_token, configure, display
from teradataml import td_sklearn as osml

In [2]:
# Ignoring unnecessary warnings.
import warnings
warnings.simplefilter(action='ignore', category=DeprecationWarning)
display.suppress_vantage_runtime_warnings = True

In [3]:
# Read the connection parameters.
host = getpass.getpass("Host: ")
username = getpass.getpass("Username: ")
password = getpass.getpass("Password: ")

In [4]:
# Create the connection.
con = create_context(host=host, username=username, password=password)

In [5]:
# Read configuration parameters for VantageCloud Lake authentication.
ues_url = getpass.getpass("UES URL: ")
auth_token = getpass.getpass("Auth Token: ")

In [6]:
# Set configuration parameters for VantageCloud Lake authentication.
set_auth_token(auth_token=auth_token, ues_url=ues_url)

True

In [7]:
# List existing user environments in OpenAF.
list_user_envs()

Unnamed: 0,env_name,env_description,base_env_name,language
0,non_conda_env_3_8_demo,Non Conda environment for notebook demo,python_3.8.13,Python
1,openml_env,DONT DELETE: OpenML environment,python_3.10.5,Python
2,testenv,This env 'testenv' is created with base env 'p...,python_3.10.5,Python


In [8]:
env = get_env("non_conda_env_3_8_demo")
env


Environment Name: non_conda_env_3_8_demo
Base Environment: python_3.8.13
Description: Non Conda environment for notebook demo

############ Files installed in User Environment ############

                               File  Size             Timestamp
0              sklearn_transform.py  9885  2024-08-08T12:11:29Z
1                  sklearn_score.py  4626  2024-08-08T12:11:28Z
2                    sklearn_fit.py  6457  2024-08-08T12:11:26Z
3            sklearn_fit_predict.py  4819  2024-08-08T12:11:31Z
4      file_1723117644752999___1001   769  2024-08-08T11:10:16Z
5  sklearn_model_selection_split.py  5831  2024-08-08T12:11:34Z
6              sklearn_neighbors.py  5795  2024-08-08T12:11:32Z

############ Libraries installed in User Environment ############

            name version
0         joblib   1.4.2
1          numpy  1.23.5
2            pip  22.0.4
3   scikit-learn   1.1.3
4          scipy  1.10.1
5     setuptools  56.0.0
6  threadpoolctl   3.5.0


teradataml OpenSourceML requires that the Python version and the versions of the required Python packages be the same on the client and in the OpenAF user environment.

In [9]:
# Verify that the required packages have the same versions on the client as in the OpenAF user environment (listed in the cell above).
!pip list | grep scikit-learn
!pip list | grep scipy
!pip list | grep numpy

scikit-learn              1.1.3
scipy                     1.10.1
numpy                     1.23.5
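
The same comparison can be done programmatically. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `find_mismatches` are hypothetical and not part of teradataml, and the environment versions are copied by hand from the listing above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package` on this client, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(client_versions, env_versions):
    """Return {package: (client, env)} for every package whose versions differ."""
    return {
        name: (client_versions.get(name), env_version)
        for name, env_version in env_versions.items()
        if client_versions.get(name) != env_version
    }


# Versions reported for the OpenAF user environment in the listing above.
env_versions = {"scikit-learn": "1.1.3", "scipy": "1.10.1", "numpy": "1.23.5"}

# Versions found on the client running this notebook.
client_versions = {name: installed_version(name) for name in env_versions}

mismatches = find_mismatches(client_versions, env_versions)
print(mismatches or "All required package versions match.")
```

An empty result means the three packages agree. Any entry in the result shows the client version first and the environment version second.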


In [10]:
# Use the "non_conda_env_3_8_demo" environment for teradataml OpenSourceML.
configure.openml_user_env = env

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Loading dataset</b>

In [11]:
# Loading dataset from example data collection.
load_example_data('teradataml','advertising')



In [12]:
# Create a teradataml DataFrame on the loaded table.
advertising_df = DataFrame("advertising")

In [13]:
# Look at 5 rows.
advertising_df.head(5)

TV,radio,newspaper,sales
5.4,29.9,9.4,5.3
7.8,38.9,50.6,6.6
7.3,28.1,41.4,5.5
4.1,11.6,5.7,3.2
0.7,39.6,8.7,1.6


In [14]:
advertising_df.shape

(200, 4)

In [15]:
# Performing sampling to get 80% for training and 20% for testing.
advertising_df_sample = advertising_df.sample(frac = [0.8, 0.2])

# Fetching train and test data.
df_train = advertising_df_sample[advertising_df_sample['sampleid'] == 1].drop('sampleid', axis=1)
df_test = advertising_df_sample[advertising_df_sample['sampleid'] == 2].drop('sampleid', axis=1)

In [16]:
# Train data shape.
df_train.shape

(160, 4)

In [17]:
# Test data shape.
df_test.shape

(40, 4)
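
To make the role of the `sampleid` column concrete, here is a local, pure-Python analogue of an 80/20 split. It is only an illustration under stated assumptions: the helper `assign_sample_ids` is hypothetical, and it does not describe how Vantage performs sampling internally. It labels each row index with 1 or 2, mirroring the `sampleid` values filtered on above.

```python
import random


def assign_sample_ids(n_rows, fractions, seed=None):
    """Shuffle row indices and label consecutive chunks 1, 2, ... by fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    sample_ids = {}
    start = 0
    for sample_id, fraction in enumerate(fractions, start=1):
        end = start + round(n_rows * fraction)
        for index in indices[start:end]:
            sample_ids[index] = sample_id
        start = end
    return sample_ids


# 200 rows split 80/20, matching the shapes reported above.
ids = assign_sample_ids(n_rows=200, fractions=[0.8, 0.2], seed=0)
train_rows = [i for i, s in ids.items() if s == 1]
test_rows = [i for i, s in ids.items() if s == 2]
print(len(train_rows), len(test_rows))  # 160 40
```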

In [18]:
# Persisting test data to be used in another session.
df_test.to_sql(table_name='advertising_test', if_exists='replace')

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. Train RandomizedSearchCV on SGDRegressor estimator</b>

Use `SGDRegressor` as the estimator for `RandomizedSearchCV` to find the best model among the sampled training parameters.

In [19]:
obj = osml.SGDRegressor(max_iter=1000, tol=1e-4)
obj

In [20]:
# Create RandomizedSearchCV object.
# 'squared_error' is the name of the squared-loss option in scikit-learn >= 1.0.
from scipy.stats import uniform
distributions = dict(alpha=uniform(loc=0, scale=4),
                     penalty=['l2', 'l1'],
                     loss=['squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'])
clf = osml.RandomizedSearchCV(obj, distributions, random_state=0)
clf
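
For intuition about what a randomized search draws, here is a minimal standard-library sketch of candidate sampling. It is not scikit-learn's implementation and will not reproduce its draws; the helper `sample_candidates` is hypothetical, and only `alpha` and `penalty` are included for brevity. Values given as lists are sampled uniformly, and `alpha` is drawn from the interval [0, 4], which is the range covered by `uniform(loc=0, scale=4)`. The default of 10 candidates matches scikit-learn's default `n_iter`.

```python
import random


def sample_candidates(distributions, n_iter, seed=None):
    """Draw n_iter parameter settings: lists are sampled uniformly, callables are called with the RNG."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        candidate = {}
        for name, dist in distributions.items():
            candidate[name] = dist(rng) if callable(dist) else rng.choice(dist)
        candidates.append(candidate)
    return candidates


distributions = {
    # Same range as scipy.stats.uniform(loc=0, scale=4): values in [0, 4].
    "alpha": lambda rng: rng.uniform(0, 4),
    "penalty": ["l2", "l1"],
}

for candidate in sample_candidates(distributions, n_iter=10, seed=0):
    print(candidate)
```

Each printed dictionary corresponds to one parameter setting that the search would fit and score with cross-validation.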

In [21]:
# Fit RandomizedSearchCV to find the best model among the sampled parameter settings.
clf.fit(df_train.select(df_train.columns[:-1]), df_train.select(["sales"]))


<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>4.1. Access attributes of the trained model</b>

In [22]:
clf.best_params_

{'alpha': 1.6946191973556188, 'loss': 'huber', 'penalty': 'l2'}

In [23]:
best_est = clf.best_estimator_
best_est

In [24]:
clf.best_score_

0.4844624247468654

In [25]:
clf.cv_results_

{'mean_fit_time': array([0.00102983, 0.0010107 , 0.00068583, 0.00101643, 0.00073547,
        0.00085764, 0.00068741, 0.00065713, 0.0008821 , 0.00086827]),
 'std_fit_time': array([5.47945664e-04, 1.02751426e-04, 8.87322846e-05, 1.47918960e-04,
        7.30688806e-05, 6.50999562e-05, 3.90175192e-05, 4.96427102e-05,
        1.16816287e-04, 1.39979477e-04]),
 'mean_score_time': array([0.00038218, 0.00035143, 0.00035048, 0.00035782, 0.00035524,
        0.00035381, 0.00035853, 0.00038123, 0.00035424, 0.00035744]),
 'std_score_time': array([4.08630248e-05, 4.19237225e-06, 8.75746200e-06, 1.82986221e-05,
        9.26095839e-06, 1.10323673e-05, 1.53645190e-05, 3.37888843e-05,
        9.42088321e-06, 9.68742540e-06]),
 'param_alpha': masked_array(data=[2.195254015709299, 2.4110535042865755,
                    1.6946191973556188, 1.75034884505077,
                    3.854651042004117, 3.1669001523306584,
                    2.2721782443757292, 0.28414423279154777,
                    0.08087358

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Deploy the best model</b>

In [26]:
# Deploy the best estimator of the trained search. `deploy` is called on the estimator itself and saves it under `model_name`.
deployed_model = best_est.deploy(model_name="sgd_regressor_model", replace_if_exists=True)
deployed_model

Model is deleted.
Model is saved.


In [27]:
type(deployed_model)

teradataml.opensource.sklearn._sklearn_wrapper._SkLearnObjectWrapper

In [28]:
# Get the training score.
deployed_model.score(X=df_train.select(df_train.columns[:-1]), y=df_train.select(["sales"]))

score
0.5602259529500108
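
For scikit-learn regressors such as `SGDRegressor`, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. Assuming the teradataml wrapper delegates to that method, the value above can be read as the share of variance in `sales` explained on the training data. The sketch below computes R² by hand; the data are made-up illustration values, not rows from the advertising table.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions, for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(round(r2_score(y_true, y_pred), 4))  # 0.9486
```

A score of 1.0 means perfect predictions, while 0.0 is what a model that always predicts the mean of the targets would get.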


In [29]:
# Removing the context to load the saved model in another session.
remove_context()

True

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. Load the saved model in another session</b>

In [30]:
# Create connection again to the same host.
con = create_context(host=host, username=username, password=password)

In [31]:
# Authenticate VantageCloud Lake.
set_auth_token(auth_token=auth_token, ues_url=ues_url)

True

In [32]:
# Load the saved model.
loaded_model = osml.load(model_name="sgd_regressor_model")
loaded_model

In [33]:
type(loaded_model)

teradataml.opensource.sklearn._sklearn_wrapper._SkLearnObjectWrapper

In [34]:
loaded_model.get_params()

{'alpha': 1.6946191973556188,
 'average': False,
 'early_stopping': False,
 'epsilon': 0.1,
 'eta0': 0.01,
 'fit_intercept': True,
 'l1_ratio': 0.15,
 'learning_rate': 'invscaling',
 'loss': 'huber',
 'max_iter': 1000,
 'n_iter_no_change': 5,
 'penalty': 'l2',
 'power_t': 0.25,
 'random_state': None,
 'shuffle': True,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [35]:
# Set the required OpenSourceML user environment. Functions must run in the same environment that was used earlier.
configure.openml_user_env = get_env("non_conda_env_3_8_demo")

In [36]:
# Predict sales on test data.
df_test = DataFrame("advertising_test")
opt = loaded_model.predict(df_test.select(df_test.columns[:-1]))
opt

TV,radio,newspaper,sgdregressor_predict_1
94.2,4.9,8.1,8.66649533297978
116.0,7.7,23.1,11.638819438329511
95.7,1.4,7.4,8.312827767832005
237.4,5.1,23.5,20.966673349317844
121.0,8.4,48.7,13.656794140134515
206.9,8.4,26.4,19.12877815923341
239.9,41.5,18.5,25.343099048247865
38.0,40.3,11.9,8.794401631426707
5.4,29.9,9.4,4.780685110378941
228.3,16.9,26.2,21.858942552851342


<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>6.1. Access attributes of saved model</b>

In [37]:
loaded_model.coef_

array([0.07927289, 0.12302594, 0.05998016])

In [38]:
loaded_model.intercept_

array([0.11032235])
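
These two attributes fully determine the predictions of a linear model: each prediction is the dot product of the feature values with `coef_`, plus `intercept_`. The sketch below recomputes two rows of the prediction table above from the printed, rounded values. It assumes the coefficients are in the feature order `TV`, `radio`, `newspaper`; `predict_sales` is a hypothetical helper, not a teradataml function.

```python
# Rounded values printed by loaded_model.coef_ and loaded_model.intercept_.
coef = [0.07927289, 0.12302594, 0.05998016]  # assumed order: TV, radio, newspaper
intercept = 0.11032235


def predict_sales(tv, radio, newspaper):
    """Linear prediction: dot(features, coef) + intercept."""
    features = (tv, radio, newspaper)
    return sum(c * x for c, x in zip(coef, features)) + intercept


print(round(predict_sales(94.2, 4.9, 8.1), 4))  # 8.6665 (table above: 8.66649533297978)
print(round(predict_sales(5.4, 29.9, 9.4), 4))  # 4.7807 (table above: 4.780685110378941)
```

The small differences in the later decimal places come from the rounding of the printed coefficients.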

In [39]:
loaded_model.n_iter_

20

In [40]:
loaded_model.t_

3201.0

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>7. Cleanup and remove connection</b>

In [41]:
# Drop persisted table.
from teradataml import db_drop_table
db_drop_table(table_name='advertising_test')

True

In [42]:
remove_context()

True