In [None]:
# Uncomment the line below if you have not run cdsw-build.sh before
#!sh /home/cdsw/cdsw-build.sh

In [1]:
!pip show tensorflow

Name: tensorflow
Version: 2.13.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /home/cdsw/.local/lib/python3.9/site-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, libclang, numpy, opt-einsum, packaging, protobuf, setuptools, six, tensorboard, tensorflow-estimator, tensorflow-io-gcs-filesystem, termcolor, typing-extensions, wrapt
Required-by: 


In [2]:
import transformers

print(transformers.__version__)



4.31.0


In [3]:
# Other options: ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]
model_checkpoint = "t5-small"  # t5-small is the smallest model in the T5 family

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we pick the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Using the Hugging Face Tokenizer

Before we can feed text to our model, we need to preprocess it, since models only understand vectorized representations of text. This is done by a 🤗 Transformers `Tokenizer`, which (as the name indicates) tokenizes the inputs, converts the tokens to their corresponding IDs in the pretrained vocabulary, puts them in the format the model expects, and generates the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [5]:
tokenizer("Hello, this is a sentence!")

{'input_ids': [8774, 6, 48, 19, 3, 9, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer(["Hello, this is a sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 19, 3, 9, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

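To prepare the targets (labels) for our model, we tokenize them inside the `as_target_tokenizer` context manager, which makes sure the tokenizer uses the special tokens corresponding to the targets. In recent versions of 🤗 Transformers this context manager is deprecated and emits a warning; the replacement is to pass the targets through the `text_target` argument, as the comment in the cell below shows. For T5, the targets are encoded exactly like the inputs, as the identical output shows: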

In [None]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this is a sentence!", "This is another sentence."]))

# To avoid the deprecation warning, pass the targets via text_target instead:
# tokenizer(text_target=["Hello, this is a sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 19, 3, 9, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


If you are using one of the five T5 checkpoints, you have to prefix the inputs with "summarize: ". T5 was trained on several tasks (it can also translate, for example), so it needs the prefix to know which task to perform.

In [16]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that preprocesses our samples. We simply feed them to the `tokenizer` with the argument `truncation=True`, which ensures that any input longer than the selected model can handle is truncated to the maximum length the model accepts. Padding is handled later (in a data collator), so each example is padded to the longest sequence in its batch rather than to the longest sequence in the whole dataset.
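The preprocessing function itself is not shown in this section. To illustrate the dynamic padding that a data collator such as `DataCollatorForSeq2Seq` performs, here is a minimal pure-Python sketch; the helper `pad_batch` is hypothetical, not a Transformers API. It assumes T5's pad token id of 0, right-side padding, and the collator's default label padding value of -100 (a value the loss ignores). The example ids are the two sentences tokenized earlier in this notebook.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]],
              pad_token_id: int = 0,
              label_pad_token_id: int = -100) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: pad each example to the longest one in THIS batch only."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}

    # Labels are optional; pad them separately to the longest label sequence.
    has_labels = all("labels" in f for f in features)
    if has_labels:
        max_label_len = max(len(f["labels"]) for f in features)
        batch["labels"] = []

    for f in features:
        ids = f["input_ids"]
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        if has_labels:
            labels = f["labels"]
            batch["labels"].append(labels + [label_pad_token_id] * (max_label_len - len(labels)))
    return batch


# Token ids of the two example sentences tokenized earlier
batch = pad_batch([
    {"input_ids": [8774, 6, 48, 19, 3, 9, 7142, 55, 1]},
    {"input_ids": [100, 19, 430, 7142, 5, 1]},
])
print(batch["input_ids"][1])       # the shorter sentence, padded to length 9
print(batch["attention_mask"][1])
```

Padding per batch rather than per dataset keeps short batches short, which saves computation when most inputs are much shorter than the longest document.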

## Save the Model Locally and Run Inference

Let's see how to save the model locally so that we can load it later and use it to summarize text. First, we load the model from the Hub and save it, together with the tokenizer, to a folder called `models`. This means we can resume the code from the loading step without rerunning everything above every time, and you will learn how to store models locally for later use. One word of **caution**: do not save very large models in CML projects; use HDFS or an S3 bucket instead. Since we are using the smallest model in the class for this notebook, T5-small, storing it in the project is fine.

In [20]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
# let us save the model locally first
model.save_pretrained('/home/cdsw/models')

# let us save the tokenizer as well 
tokenizer.save_pretrained('/home/cdsw/models')



2023-08-02 05:31:30.011081: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-02 05:31:30.671410: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-02 05:31:30.672197: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


('/home/cdsw/models/tokenizer_config.json',
 '/home/cdsw/models/special_tokens_map.json',
 '/home/cdsw/models/tokenizer.json')

### Understanding the Warnings
- We are using the T5-small model here without a GPU (hence the `Could not find cuda drivers on your machine, GPU will not be used` messages), so that we can experiment with LLMs on a CPU-only runtime.
- Do recognize that this comes with limitations: input length is limited by the model (the T5 checkpoints were pretrained with a maximum input length of 512 tokens), and inference on a CPU is slow. Switching to a GPU runtime lets you use a larger model and speeds up inference.
- As mentioned earlier, T5 is a class of models, and you can choose from t5-small (about 60 million parameters) up to t5-11b (about 11 billion parameters). Read more about the T5 models [here](https://huggingface.co/docs/transformers/model_doc/t5).
- The message "All the weights of TFT5ForConditionalGeneration were initialized..." tells us that the model was loaded with a complete set of pre-trained weights and can be used directly for inference, if required. This is the approach we follow here. In subsequent approaches we will consider prompt tuning and fine-tuning the models.
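To put the caution about storing large models in CML projects into numbers, the sketch below estimates the size of a checkpoint's weights from its parameter count. It is a back-of-the-envelope assumption (weights only, no file-format or optimizer overhead), and `checkpoint_size_gb` is a hypothetical helper, not a library function.

```python
# Bytes needed to store one parameter in common numeric formats
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def checkpoint_size_gb(num_params: float, dtype: str = "float32") -> float:
    """Rough weight-only size estimate in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Approximate parameter counts of the smallest and largest T5 checkpoints
for name, params in [("t5-small", 60e6), ("t5-11b", 11e9)]:
    print(f"{name}: ~{checkpoint_size_gb(params):.2f} GB in float32")
```

Under these assumptions t5-small needs roughly a quarter of a gigabyte, while t5-11b needs tens of gigabytes, which is why the larger checkpoints belong in HDFS or object storage rather than in the project folder.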

In [22]:
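model_name = 't5-small'

# Load the tokenizer and model back from the folder saved above
# ('./models/' resolves to /home/cdsw/models when the working directory is /home/cdsw)
tokenizer = AutoTokenizer.from_pretrained('./models/')
model = TFAutoModelForSeq2SeqLM.from_pretrained('./models/')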

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at ./models/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Now let's use some sample data and the T5 model to summarize text. At this point we use the pre-trained model as is, without additional augmentation or fine-tuning. Don't forget to add "summarize: " at the start of the input if you're using a `T5` model.

In [24]:
document = 'The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we\'re neglected or forgotten," she said.\n"That may not be true but it is perhaps my perspective over the last few days.\n"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?"\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party\'s deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n"I was quite taken aback by the amount of damage that has been done," he said.\n"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses."\nHe said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on selkirk.news@bbc.co.uk or dumfries@bbc.co.uk.'
if 't5' in model_name:
    document = "summarize: " + document
tokenized = tokenizer([document], return_tensors='np')
out = model.generate(**tokenized, max_length=128)

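# The raw output keeps T5's special tokens (<pad>, </s>);
# pass skip_special_tokens=True to decode() to drop them.
print(tokenizer.decode(out[0]))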

<pad> the full cost of damage in Newton Stewart is still being assessed. many roads in peeblesshire remain badly affected by standing water. the water breached a retaining wall, flooding many commercial properties.</s>
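The deployment step that follows builds a model from a `summarize` function in `LLM_inference.py`, whose contents are not shown in this notebook. As a sketch under assumptions, the steps above can be wrapped into such a handler. We assume the CML convention that the model function receives the JSON request as a Python dict and returns a JSON-serializable dict; the `make_summarize` factory and the `"text"`/`"summary"` keys are hypothetical. Only the `tokenizer(...)`, `model.generate(...)` and `tokenizer.decode(..., skip_special_tokens=True)` calls come from the Transformers API used above.

```python
from typing import Any, Callable, Dict


def make_summarize(tokenizer: Any, model: Any, prefix: str = "summarize: ",
                   max_length: int = 128) -> Callable[[Dict[str, Any]], Dict[str, str]]:
    """Hypothetical factory: bind a tokenizer/model pair into a CML-style handler."""

    def summarize(args: Dict[str, Any]) -> Dict[str, str]:
        # "text" is an assumed request key; the T5 task prefix is added here
        text = prefix + args["text"]
        tokenized = tokenizer([text], return_tensors="np")
        out = model.generate(**tokenized, max_length=max_length)
        # skip_special_tokens=True drops <pad> and </s> from the decoded summary
        summary = tokenizer.decode(out[0], skip_special_tokens=True)
        return {"summary": summary}

    return summarize
```

In a script you would build the handler once at import time, for example `summarize = make_summarize(AutoTokenizer.from_pretrained('./models/'), TFAutoModelForSeq2SeqLM.from_pretrained('./models/'))`, so the model is loaded once per container rather than once per request. Because the handler adds the prefix itself, callers send the raw article text.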


## Deploying Saved Models

Until now we have learned how to save models locally, load them with the Transformers API, and run inference. In practice, it is better to deploy the model as an API endpoint, so that others can consume it without each keeping their own copy of the same model. We use the CML API v2 for this task.
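Once a model is deployed, other users consume it over HTTP. As an assumption based on CML's usual model-serving convention, a request is a JSON POST whose body wraps the function's input as `{"accessKey": ..., "request": {...}}`; the endpoint URL and access key are shown on the model's page in the CML UI. The sketch below uses only the Python standard library; `build_payload`, `call_model`, and the `"text"` input key are hypothetical, and the key must match whatever the deployed function reads.

```python
import json
import urllib.request


def build_payload(access_key: str, text: str) -> bytes:
    """Wrap the model input in the (assumed) CML request envelope."""
    body = {"accessKey": access_key, "request": {"text": text}}
    return json.dumps(body).encode("utf-8")


def call_model(endpoint_url: str, access_key: str, text: str, timeout: float = 60.0) -> dict:
    """POST the payload to the model endpoint and return the decoded JSON response."""
    req = urllib.request.Request(
        endpoint_url,
        data=build_payload(access_key, text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage (placeholders, not real values): `call_model("<model endpoint URL>", "<access key>", "Text to summarize ...")`.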

In [54]:
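import json
import cmlapi

# Get a handle to the CML API v2 and look up this project by name
client = cmlapi.default_client()
projects = client.list_projects(search_filter=json.dumps({"name": "LLM-demo-on-CML"}))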

In [55]:
print(projects)

{'next_page_token': '',
 'projects': [{'created_at': datetime.datetime(2023, 7, 31, 3, 14, 25, 999894, tzinfo=tzlocal()),
               'creation_status': 'success',
               'creator': {'email': 'vishrajagopalan@cloudera.com',
                           'name': 'Vish Rajagopalan',
                           'username': 'vishrajagopalan'},
               'default_engine_type': 'ml_runtime',
               'description': '',
               'environment': '{"CDSW_APP_POLLING_ENDPOINT":"/","PROJECT_OWNER":"vishrajagopalan"}',
               'ephemeral_storage_limit': 10,
               'ephemeral_storage_request': 0,
               'id': 'nuon-0bqb-uifv-vrc5',
               'name': 'LLM-demo-on-CML',
               'owner': {'email': 'vishrajagopalan@cloudera.com',
                         'name': 'Vish Rajagopalan',
                         'username': 'vishrajagopalan'},
               'permissions': {'admin': True,
                               'business_user': True,
         

In [60]:
import json
import os
import sys
import time
from datetime import datetime

import cmlapi

# Get a handle to the CML API v2
client = cmlapi.default_client()
# project_id = os.environ["CDSW_PROJECT_ID"]  # alternative: use the current project's ID directly
projects = client.list_projects(search_filter=json.dumps({"name": "LLM-demo-on-CML"}))
project = projects.projects[0]  # assumes the query returns exactly one project

# Create the model entry (note: this reassigns `model`, previously the TF model, to the CML model object)
model_body = cmlapi.CreateModelRequest(project_id=project.id, name="Text Summarization LLM v1 Test3", description="Text Summarization using T5-small LLM Model")
model = client.create_model(model_body, project.id)

# Create a model build request: `summarize` in LLM_inference.py is the entry point
runtime_details = 'docker.repository.cloudera.com/cloudera/cdsw/ml-runtime-workbench-python3.9-standard:2023.05.2-b7'
model_build_body = cmlapi.CreateModelBuildRequest(project_id=project.id, model_id=model.id, file_path="LLM_inference.py", function_name="summarize", kernel="python3", runtime_identifier=runtime_details)

start_time = datetime.now()
print(start_time.strftime("%H:%M:%S"))

# Build the model as a container image and poll until the build finishes
model_build = client.create_model_build(model_build_body, project.id, model.id)
while model_build.status not in ["built", "build failed"]:
    print("waiting for model to build...")
    time.sleep(10)
    model_build = client.get_model_build(project.id, model.id, model_build.id)

if model_build.status == "build failed":
    print("model build failed, see UI for more information")
    sys.exit(1)

build_time = datetime.now()
print(f"Time required for building model (sec): {int((build_time - start_time).total_seconds())}")
print("model built successfully!")

# Deploy the built image and poll until it reaches a terminal state
model_deployment_body = cmlapi.CreateModelDeploymentRequest(project_id=project.id, model_id=model.id, build_id=model_build.id, cpu=4, memory=8)
model_deployment = client.create_model_deployment(model_deployment_body, project.id, model.id, model_build.id)

while model_deployment.status not in ["stopped", "failed", "deployed"]:
    print("waiting for model to deploy...")
    time.sleep(10)
    model_deployment = client.get_model_deployment(project.id, model.id, model_build.id, model_deployment.id)

curr_time = datetime.now()

if model_deployment.status != "deployed":
    print("model deployment failed, see UI for more information")
    sys.exit(1)

print(f"Time required for deploying model (sec): {int((curr_time - build_time).total_seconds())}")
print("model deployed successfully!")

07:05:46
waiting for model to build...
w
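The two polling loops in the deployment cell wait indefinitely if a build or deployment stalls in a non-terminal state. A minimal sketch of a reusable alternative with a timeout follows; `wait_for_status` is a hypothetical helper, and the status names you pass in are the ones used in the cell above. The `sleep` parameter is injectable only so the helper can be exercised without real waiting.

```python
import time
from typing import Callable, Iterable


def wait_for_status(fetch_status: Callable[[], str],
                    done: Iterable[str],
                    interval: float = 10.0,
                    timeout: float = 1800.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_status() until it returns one of `done`, or raise TimeoutError."""
    done_states = set(done)
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout} s")
        print(f"status: {status}, waiting...")
        sleep(interval)
```

For example, the build loop could become `status = wait_for_status(lambda: client.get_model_build(project_id, model_id, build_id).status, done=["built", "build failed"])`, where the three IDs come from the objects created in the deployment cell; the caller still checks whether the returned status is a success or a failure.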