# Scikit-Learn Preprocessing and Training Pipeline

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB


Using data from Google Cloud Storage and SAP Datasphere

## Install fedml_gcp package

In [None]:
pip install fedml_gcp

## Import Libraries

In [None]:
import os

from fedml_gcp import dwcgcp

## Some constant variables to use throughout the notebook

In [None]:
PROJECT_ID = '<project_id>'
REGION = '<region>'

BUCKET_NAME = '<bucket_name>'
BUCKET_URI = "gs://"+BUCKET_NAME
BUCKET_FOLDER = 'preprocessed-pipeline'
MODEL_OUTPUT_DIR = BUCKET_URI+'/'+BUCKET_FOLDER

SCRIPT_PATH = 'PreprocessingAndTrainingPipelineScript.py'
JOB_NAME = "preprocessed-pipeline-training"

MODEL_DISPLAY_NAME = "preprocessed-pipeline-model"
DEPLOYED_MODEL_DISPLAY_NAME = 'preprocessed-pipeline-deployed-model'

## Create a DwcGCP instance to access class methods and train the model

The bucket passed here must already exist in Cloud Storage.

For information on this constructor, please refer to the readme.

In [None]:
params = {'project':PROJECT_ID,
         'location':REGION, 
         'staging_bucket':BUCKET_URI}

In [None]:
dwc = dwcgcp.DwcGCP(params)


## Choose the training image and the deployment image

For the available pre-built training containers, see: https://cloud.google.com/vertex-ai/docs/training/create-python-pre-built-container

For the available pre-built deployment containers, see: https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers

In [None]:
TRAIN_VERSION = "scikit-learn-cpu.0-23"
DEPLOY_VERSION = "sklearn-cpu.1-0"

TRAIN_IMAGE = "us-docker.pkg.dev/vertex-ai/training/{}:latest".format(TRAIN_VERSION)
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(DEPLOY_VERSION)
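
The two URIs follow the same pattern and differ only in the repository segment (training or prediction) and the version name. As a minimal sketch, a helper could build both from one template. The function name prebuilt_image_uri and its parameters are hypothetical and not part of fedml_gcp.

```python
def prebuilt_image_uri(kind, version, registry_prefix="us", tag="latest"):
    """Build a Vertex AI pre-built container URI.

    kind is "training" or "prediction"; version is a name such as
    "scikit-learn-cpu.0-23", taken from the pages linked above.
    """
    if kind not in ("training", "prediction"):
        raise ValueError("kind must be 'training' or 'prediction'")
    return f"{registry_prefix}-docker.pkg.dev/vertex-ai/{kind}/{version}:{tag}"


# Same values as the constants defined in the cell above.
train_image = prebuilt_image_uri("training", "scikit-learn-cpu.0-23")
deploy_image = prebuilt_image_uri("prediction", "sklearn-cpu.1-0")
print(train_image)
print(deploy_image)
```

Validating kind up front catches a mistyped repository segment before a job is submitted.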

## Training using a custom training job and a pre-built container

For information on the dwc.train_model() function, please refer to the readme.

The training inputs reference a script. When training from a script, we must also pass the packages that the script requires.

We also pass args, which hold the name of the table to read from along with other arguments the training script needs. Before running the following cell, upload a config.json to the bucket specified above so that it is available at the path '/gcs/' + bucket_name + '/config.json'. This path is set in the training script, inside the function called get_dwc_data, and is passed as the url parameter to DbConnection() so that DbConnection knows where to find your credentials for SAP Datasphere.
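
Vertex AI custom training jobs expose Cloud Storage buckets as a Cloud Storage FUSE mount under /gcs/, which is why the script can treat the credentials file as a local path. The sketch below shows how such a path could be assembled. The helper name config_path is hypothetical; the actual logic lives in get_dwc_data in the training script, which is not shown here.

```python
import posixpath


def config_path(bucket_name, filename="config.json"):
    """Return the local FUSE path of a file stored at the root of a bucket."""
    return posixpath.join("/gcs", bucket_name, filename)


# "my-bucket" is a placeholder bucket name used only for illustration.
print(config_path("my-bucket"))
```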

You should also have the view IMDB_TEST_VIEW created in SAP Datasphere. To obtain the data for it, download the test dataset from https://www.kaggle.com/mantri7/imdb-movie-reviews-dataset?select=train_data+%281%29.csv

The script also downloads data from Cloud Storage and uses it for training. Before proceeding, download the train dataset from the following link and upload it to your Cloud Storage bucket (file path = /data): https://www.kaggle.com/mantri7/imdb-movie-reviews-dataset?select=train_data+%281%29.csv
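
The training script itself is not reproduced in this notebook. Purely as an illustration, a text classification pipeline built from the TfidfVectorizer and MultinomialNB classes imported at the top could look like the sketch below. The toy reviews and labels are invented for the example and stand in for the IMDB data; the real script may preprocess and train differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for the IMDB reviews (illustrative only).
train_text = [
    "great movie loved it",
    "wonderful acting great film",
    "terrible boring movie",
    "awful plot bad acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Chain the preprocessing step and the classifier so both are fitted,
# saved, and served as a single model object.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(train_text, train_labels)

prediction = pipeline.predict(["great wonderful film"])
print(prediction[0])
```

Bundling the vectorizer with the classifier means the deployed endpoint can accept raw strings, which matches how the prediction cells later send review text directly.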

In [None]:
table_name = 'IMDB_TEST_VIEW'
file_path = BUCKET_FOLDER+'/data/imdb_train.csv'
table_size = 1
job_dir = BUCKET_URI

cmd_args = [
    "--table_name=" + table_name,
    "--table_size=" + str(table_size),
    "--file_path=" + file_path,
    "--job-dir=" + job_dir,
    "--bucket_name=" + BUCKET_NAME,
    "--bucket_folder=" + BUCKET_FOLDER
]
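
Inside the training script, these flags have to be parsed before they can be used. Since the script is not shown here, the following is only a sketch of how it might read them with argparse. The function name parse_args, the defaults, and the float type for table_size are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags passed to the training script via cmd_args."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--table_name", type=str, required=True)
    # The type of table_size is an assumption; the notebook passes 1.
    parser.add_argument("--table_size", type=float, default=1)
    parser.add_argument("--file_path", type=str, required=True)
    parser.add_argument("--job-dir", dest="job_dir", type=str, default=None)
    parser.add_argument("--bucket_name", type=str, required=True)
    parser.add_argument("--bucket_folder", type=str, default="")
    # parse_known_args tolerates unrecognized flags instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Placeholder values mirroring the shape of cmd_args above.
args = parse_args([
    "--table_name=IMDB_TEST_VIEW",
    "--table_size=1",
    "--file_path=preprocessed-pipeline/data/imdb_train.csv",
    "--job-dir=gs://my-bucket",
    "--bucket_name=my-bucket",
    "--bucket_folder=preprocessed-pipeline",
])
print(args.table_name, args.table_size, args.job_dir)
```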

In [None]:
required_packages = [
    'fedml_gcp',
    'matplotlib>=2.2.3',
    'seaborn>=0.9.0',
    'scikit-learn>=0.20.2',
    'pandas',
    'numpy',
    'hdbcli',
    'pandas-gbq'
]

In [None]:
inputs2 = {
    'display_name':JOB_NAME,
    'script_path':SCRIPT_PATH,
    'container_uri':TRAIN_IMAGE,
    'model_serving_container_image_uri':DEPLOY_IMAGE,
    'requirements':required_packages
}

In [None]:
run_job_params2 = {'model_display_name':MODEL_DISPLAY_NAME,
                  'args':cmd_args,
                  'replica_count':1,
                  'base_output_dir':MODEL_OUTPUT_DIR,
                  'sync':True}

In [None]:
model = dwc.train_model(training_inputs=inputs2, 
                      training_type='custom',
                     params=run_job_params2)

## Deployment

For information on the dwc.deploy() function, please refer to the readme.

Here we deploy the model trained in the cell above.

In [None]:
model_config = {
    'deployed_model_display_name': DEPLOYED_MODEL_DISPLAY_NAME,
    'traffic_split':{"0": 100},
    'machine_type':'n1-standard-2',
    'min_replica_count':1,
    'max_replica_count':1,
    'sync':True
}
deployed_endpoint = dwc.deploy(model=model, model_config=model_config)
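
In the Vertex AI SDK, traffic_split maps deployed model IDs to traffic percentages. The key "0" stands for the model being deployed in this call, and the percentages must add up to 100 unless the map is empty. Assuming dwc.deploy forwards model_config to that SDK, a small sanity check before deploying could look like the sketch below. The helper name check_traffic_split is hypothetical.

```python
def check_traffic_split(traffic_split):
    """Raise ValueError unless the percentages sum to 100 (or the map is empty)."""
    total = sum(traffic_split.values())
    if traffic_split and total != 100:
        raise ValueError(f"traffic percentages sum to {total}, expected 100")
    return True


# All traffic goes to the model being deployed (key "0").
print(check_traffic_split({"0": 100}))
```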

## Prediction

Once the model is deployed to an endpoint, we can run predictions on it.

For information on the dwc.predict() function, please refer to the readme.

Because we use DbConnection here, config.json must also be present in this notebook instance.

In [None]:
from fedml_gcp import DbConnection
import pandas as pd
import numpy as np

In [None]:
db = DbConnection()
res, column_headers = db.get_data_with_headers(table_name='IMDB_TEST_VIEW', size=1)
org_data = pd.DataFrame(res, columns=column_headers)

In [None]:
org_data = org_data.tail(1000)
org_data

In [None]:
series_data = org_data['Comment']
print(type(series_data))

In [None]:
params = {'instances':series_data.values.tolist()}

In [None]:
predictions = dwc.predict(endpoint=deployed_endpoint, predict_params=params)

In [None]:
predictions
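
Online prediction requests are limited in size, so sending all 1000 reviews in one call may fail when the texts are long. One option, sketched here under that assumption, is to split the instances into smaller batches and call dwc.predict once per batch. The helper chunk_instances and the batch size of 100 are illustrative choices and not part of fedml_gcp.

```python
def chunk_instances(instances, batch_size=100):
    """Yield successive slices of at most batch_size instances."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(instances), batch_size):
        yield instances[start:start + batch_size]


# Placeholder strings standing in for series_data.values.tolist().
reviews = [f"review {i}" for i in range(250)]
batches = list(chunk_instances(reviews))
print([len(batch) for batch in batches])

# Each batch could then be sent separately, for example:
# for batch in batches:
#     dwc.predict(endpoint=deployed_endpoint, predict_params={'instances': batch})
```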