# Pretrained GPT2 Model Deployment Example

In this notebook, we run a text generation example using a GPT2 model exported from HuggingFace and deployed with Seldon's pre-packaged Triton server. The example also covers converting the model to ONNX format.
The example below implements the greedy approach to next-token prediction.
More info: https://huggingface.co/transformers/model_doc/gpt2.html?highlight=gpt2

After the model is deployed to Kubernetes, we run a simple load test to evaluate its inference performance.
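
To make the greedy approach mentioned above concrete, here is a minimal sketch in plain Python. It is not the notebook's implementation: `toy_model` is a hypothetical stand-in for GPT2 with a four-token vocabulary, and `greedy_generate` simply appends the highest-scoring token at each step.

```python
from typing import Callable, List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def greedy_generate(
    next_token_scores: Callable[[List[int]], Sequence[float]],
    prompt_ids: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Repeatedly append the highest-scoring next token to the sequence."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)  # one score per vocabulary entry
        ids.append(argmax(scores))
    return ids


def toy_model(ids: List[int]) -> List[float]:
    """Hypothetical model: always prefers (last token + 1) modulo 4."""
    preferred = (ids[-1] + 1) % 4
    return [1.0 if token == preferred else 0.0 for token in range(4)]


generated = greedy_generate(toy_model, prompt_ids=[0], max_new_tokens=5)
print(generated)
```

Greedy decoding never revisits an earlier choice, which keeps it simple and deterministic but can miss sequences that score better overall.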


## Steps:
1. Download the pretrained GPT2 model from HuggingFace
2. Convert the model to ONNX
3. Store it in an Azure Blob Storage container
4. Set up Seldon-Core in your Kubernetes cluster
5. Deploy the ONNX model with Seldon’s prepackaged Triton server
6. Interact with the model and run a greedy algorithm example (generate a sentence completion)
7. Run a load test using vegeta
8. Clean-up

## Basic requirements
* Helm v3.0.0+
* A Kubernetes cluster running v1.13 or above (minikube and docker-for-windows work well given enough RAM)
* kubectl v1.14+
* Python 3.6+ 
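
As a quick sanity check before starting, the sketch below reports whether the interpreter meets the Python requirement and whether `helm` and `kubectl` can be found on `PATH`. The helper names are illustrative, and the sketch does not verify the Helm, kubectl, or cluster versions listed above.

```python
import shutil
import sys
from typing import Dict, Iterable, Optional, Tuple


def python_ok(minimum: Tuple[int, int] = (3, 6)) -> bool:
    """True when the running interpreter is at least the given (major, minor)."""
    return sys.version_info[:2] >= minimum


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


tools = find_tools(["helm", "kubectl"])
print(f"Python >= 3.6: {python_ok()}")
for name, path in tools.items():
    print(f"{name}: {path or 'not found on PATH'}")
```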

In [1]:
%%writefile requirements.txt
transformers==4.5.1
torch==1.8.1
tokenizers<0.11,>=0.10.1
tensorflow==2.4.1
tf2onnx

Overwriting requirements.txt


In [2]:
!pip install --trusted-host=pypi.python.org --trusted-host=pypi.org --trusted-host=files.pythonhosted.org -r requirements.txt




### Export HuggingFace TFGPT2LMHeadModel pre-trained model and save it locally

In [3]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2", from_pt=True, pad_token_id=tokenizer.eos_token_id)
model.save_pretrained("./tfgpt2model", saved_model=True)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
INFO:tensorflow:Assets written to: ./tfgpt2model/saved_model/1/assets

### Convert the TensorFlow saved model to ONNX

In [4]:
!python -m tf2onnx.convert --saved-model ./tfgpt2model/saved_model/1 --opset 11  --output model.onnx

2021-05-27 23:54:39.198493: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-27 23:54:39.198556: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-05-27 23:54:41.091561: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-27 23:54:41.091930: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-05-27 23:54:41.091985: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-05-27 23:54:41.092035: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running 
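
If you prefer to drive the conversion from Python rather than a `!` shell cell, the same command can be assembled as an argument list. This is a sketch that only uses the flags shown in the cell above; `tf2onnx_command` is a hypothetical helper, and the actual run is left commented out because it requires `tf2onnx` and the saved model to be present.

```python
import sys
from typing import List


def tf2onnx_command(saved_model_dir: str, output_path: str, opset: int = 11) -> List[str]:
    """Build the argument list for `python -m tf2onnx.convert`."""
    return [
        sys.executable, "-m", "tf2onnx.convert",
        "--saved-model", saved_model_dir,
        "--opset", str(opset),
        "--output", output_path,
    ]


command = tf2onnx_command("./tfgpt2model/saved_model/1", "model.onnx")
print(" ".join(command))

# To run the conversion and raise on a non-zero exit code:
# import subprocess
# subprocess.run(command, check=True)
```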

## Azure Setup
We have provided an [Azure Setup Notebook](AzureSetup.ipynb) that deploys an AKS cluster and an Azure storage account, and installs the Azure Blob CSI driver. If an AKS cluster already exists, skip to the Blob Storage creation and CSI driver installation steps.

In [5]:
resource_group = "seldon"   # feel free to replace or use this default
aks_name = "modeltests"    

storage_account_name = "modeltestsgpt"        # fill in
storage_container_name = "gpt2onnx"             

### Copy your model to Azure Blob


In [6]:
%%time
# Copy model file
!az extension add --name storage-preview
!az storage azcopy blob upload --container {storage_container_name} \
                               --account-name {storage_account_name} \
                               --source  ./model.onnx \
                               --destination gpt2/1/model.onnx  

Extension 'storage-preview' is already installed.
Azcopy command: ['/home/lenisha/.azure/cliextensions/storage-preview/azext_storage_preview/azcopy/azcopy_linux_amd64_10.5.0/azcopy', 'copy', './model.onnx', 'https://modeltestsgpt.blob.core.windows.net/gpt2onnx/?se=2021-05-29T03%3A57%3A39Z&sp=rwdlacup&sv=2018-03-28&ss=b&srt=sco&sig=l1L7/xauvEWX2B3oV0Dvfl3s2ajxiq1PgV4/WLQpQ%2BU%3D']
INFO: Scanning...
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

Job f0fa2ca6-467a-034d-5ed5-fbdd73eea742 has started
Log file is located at: /home/lenisha/.azcopy/f0fa2ca6-467a-034d-5ed5-fbdd73eea742.log

INFO: azcopy: A newer version 10.10.0 is available to download

98.7 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 2-sec Throughput (Mb/s): 15.8563


Job f0fa2ca6-467a-034d-5ed5-fbdd73eea742 summary
Elapsed Time (Minutes): 0.967
Number of File Transfers: 1
Number of Folder Property Transfers: 0
Total Number of Tr
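
The destination `gpt2/1/model.onnx` used in the upload follows the `<model name>/<version>/<file>` layout that Triton's model repository convention expects. The sketch below builds that relative path and mirrors the layout in a temporary directory; `repository_path` is a hypothetical helper, and `model.onnx` as the default file name is an assumption taken from the upload command above.

```python
import tempfile
from pathlib import Path, PurePosixPath


def repository_path(model_name: str, version: int, filename: str = "model.onnx") -> PurePosixPath:
    """Relative path of a model file inside a Triton-style model repository."""
    if version < 1:
        raise ValueError("version directories are expected to be positive integers")
    return PurePosixPath(model_name) / str(version) / filename


destination = repository_path("gpt2", 1)
print(destination)

# Mirror the layout locally with an empty placeholder file to inspect the tree.
with tempfile.TemporaryDirectory() as root:
    local_file = Path(root) / destination
    local_file.parent.mkdir(parents=True)
    local_file.touch()
    layout = sorted(p.relative_to(root).as_posix() for p in Path(root).rglob("*"))
print(layout)
```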

In [11]:
# Verify the uploaded file
!az storage blob list \
    --account-name {storage_account_name} \
    --container-name {storage_container_name} \
    --output table
    

This command has been deprecated and will be removed in future release. Use 'az storage fs file list' instead. For more information go to https://github.com/Azure/azure-cli/blob/dev/src/azure-cli/azure/cli/command_modules/storage/docs/ADLS%20Gen2.md
The behavior of this command has been altered by the following extension: storage-preview
Name        IsDirectory    Blob Type    Blob Tier    Length     Content Type              Last Modified              Snapshot
----------  -------------  -----------  -----------  ---------  ------------------------  -------------------------  ----------
model.onnx                 BlockBlob    Hot          652535462  application/octet-stream  2021-05-28T03:58:37+00:00

### Run Seldon in your kubernetes cluster

Follow the [Seldon-Core Setup notebook](https://docs.seldon.io/projects/seldon-core/en/latest/examples/seldon_core_setup.html) to set up a cluster with Ambassador Ingress or Istio and install Seldon Core.

### Deploy your model with Seldon pre-packaged Triton server

In [12]:
%%writefile gpt2-deploy.yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: gpt2
spec:
  predictors:
  - graph:
      implementation: TRITON_SERVER
      logger:
        mode: all
      modelUri: pvc://pvc-blob
      name: gpt2
      type: MODEL
    name: default
    replicas: 1
  protocol: kfserving

Writing gpt2-deploy.yaml
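
The same manifest can also be generated programmatically, which helps when the deployment name or model URI varies between environments. The sketch below reproduces the fields of the YAML above as a Python dictionary and serializes it to JSON, which `kubectl apply -f` also accepts. `seldon_triton_manifest` is a hypothetical helper, and reusing one value for both the deployment name and the graph name is an assumption based on this example, where both are `gpt2`.

```python
import json
from typing import Any, Dict


def seldon_triton_manifest(name: str, model_uri: str, replicas: int = 1) -> Dict[str, Any]:
    """Build a SeldonDeployment with the same fields as gpt2-deploy.yaml."""
    return {
        "apiVersion": "machinelearning.seldon.io/v1alpha2",
        "kind": "SeldonDeployment",
        "metadata": {"name": name},
        "spec": {
            "protocol": "kfserving",
            "predictors": [
                {
                    "name": "default",
                    "replicas": replicas,
                    "graph": {
                        "name": name,
                        "type": "MODEL",
                        "implementation": "TRITON_SERVER",
                        "modelUri": model_uri,
                        "logger": {"mode": "all"},
                    },
                }
            ],
        },
    }


manifest = seldon_triton_manifest("gpt2", "pvc://pvc-blob")
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```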


In [13]:

!kubectl apply -f gpt2-deploy.yaml -n default

seldondeployment.machinelearning.seldon.io/gpt2 created


In [14]:
!kubectl rollout status deploy/$(kubectl get deploy -l seldon-deployment-id=gpt2 -o jsonpath='{.items[0].metadata.name}')

deployment "gpt2-default-0-gpt2" successfully rolled out


#### Interact with the model: get model metadata (a "test" request to make sure our model is available and loaded correctly)

In [20]:
ingress_ip=!(kubectl get svc --namespace seldon-system ambassador -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
ingress_ip = ingress_ip[0]

!curl -v http://{ingress_ip}:80/seldon/default/gpt2/v2/models/gpt2

{'20.186.162.33'}
*   Trying 20.186.162.33:80...
* TCP_NODELAY set
* Connected to 20.186.162.33 (20.186.162.33) port 80 (#0)





* Mark bundle as not supporting multiuse
* Connection #0 to host 20.186.162.33 left intact
{"error":"Request for unknown model: 'gpt2' is not found"}
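
The metadata and inference endpoints used in this notebook share one path pattern: `/seldon/<namespace>/<deployment>/v2/models/<model>`, with `/infer` appended for predictions. The sketch below builds both URLs from their parts. The function names are illustrative, `localhost` is a placeholder host, and `fetch_metadata` is defined but not called because it needs a reachable ingress.

```python
import json
import urllib.request
from typing import Any, Dict


def model_url(host: str, namespace: str, deployment: str, model: str, port: int = 80) -> str:
    """URL of the model metadata endpoint behind the Seldon ingress."""
    return f"http://{host}:{port}/seldon/{namespace}/{deployment}/v2/models/{model}"


def infer_url(host: str, namespace: str, deployment: str, model: str, port: int = 80) -> str:
    """URL of the inference endpoint for the same model."""
    return model_url(host, namespace, deployment, model, port) + "/infer"


def fetch_metadata(url: str, timeout: float = 10.0) -> Dict[str, Any]:
    """GET the metadata document; urlopen raises HTTPError on 4xx/5xx replies."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


metadata_endpoint = model_url("localhost", "default", "gpt2", "gpt2")
infer_endpoint = infer_url("localhost", "default", "gpt2", "gpt2")
print(metadata_endpoint)
print(infer_endpoint)
```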

### Run prediction test: generate a sentence completion using the GPT2 model (greedy approach)


In [None]:
import requests
import json
import numpy as np
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = 'I enjoy working in Seldon'
count = 0
max_gen_len = 10
gen_sentence = input_text
while count < max_gen_len:
    input_ids = tokenizer.encode(gen_sentence, return_tensors='tf')
    shape = input_ids.shape.as_list()
    payload = {
            "inputs": [
                {"name": "input_ids:0",
                 "datatype": "INT32",
                 "shape": shape,
                 "data": input_ids.numpy().tolist()
                 },
                {"name": "attention_mask:0",
                 "datatype": "INT32",
                 "shape": shape,
                 "data": np.ones(shape, dtype=np.int32).tolist()
                 }
                ]
            }

    ret = requests.post('http://localhost:80/seldon/default/gpt2/v2/models/gpt2/infer', json=payload)

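    try:
        res = ret.json()
    except ValueError:
        # The body was not valid JSON (for example an error page). Stop here:
        # retrying without incrementing count would loop forever.
        print(f"Unexpected response ({ret.status_code}): {ret.text[:200]}")
        break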

    # extract logits
    logits = np.array(res["outputs"][1]["data"])
    logits = logits.reshape(res["outputs"][1]["shape"])

    # take the most likely next token at the last input position (greedy approach)
    next_token = logits.argmax(axis=2)[0]
    next_token_str = tokenizer.decode(next_token[-1:], skip_special_tokens=True,
                                      clean_up_tokenization_spaces=True).strip()
    gen_sentence += ' ' + next_token_str
    count += 1

print(f'Input: {input_text}\nOutput: {gen_sentence}')
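
The logits handling in the loop above can be isolated into a small function. This sketch assumes the response carries the tensor as a flat, row-major `data` list plus a `shape` of `[batch, sequence, vocabulary]`, which is what the `reshape` call in the cell suggests. It picks the arg-max of the last position for the first batch item without NumPy. `last_position_argmax`, the output name `logits`, and the values in `fake_output` are all made up for illustration.

```python
from typing import Any, Dict, List


def last_position_argmax(output: Dict[str, Any]) -> int:
    """Return the highest-scoring vocabulary index at the final sequence position."""
    batch, seq_len, vocab = output["shape"]
    data: List[float] = output["data"]
    if len(data) != batch * seq_len * vocab:
        raise ValueError("data length does not match shape")
    start = (seq_len - 1) * vocab  # offset of the last position in batch item 0
    row = data[start:start + vocab]
    return max(range(vocab), key=lambda i: row[i])


# Hypothetical response fragment: 1 batch item, 2 positions, 3-token vocabulary.
fake_output = {
    "name": "logits",
    "shape": [1, 2, 3],
    "data": [0.1, 0.7, 0.2,   # position 0
             0.3, 0.1, 0.9],  # position 1 (last)
}
next_token_id = last_position_argmax(fake_output)
print(next_token_id)
```

Only the last row matters for greedy decoding, so slicing it out avoids running an arg-max over every position as `logits.argmax(axis=2)` does.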

### Run Load Test / Performance Test using vegeta

#### Install vegeta. For more details, see the official [vegeta](https://github.com/tsenart/vegeta#install) documentation

In [None]:
!wget https://github.com/tsenart/vegeta/releases/download/v12.8.3/vegeta-12.8.3-linux-amd64.tar.gz
!tar -zxvf vegeta-12.8.3-linux-amd64.tar.gz
!chmod +x vegeta

#### Generate a vegeta [target file](https://github.com/tsenart/vegeta#-targets) containing a POST command with the payload in the required structure

In [None]:
import base64
import json

import numpy as np
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = 'I enjoy working in Seldon'
input_ids = tokenizer.encode(input_text, return_tensors='tf')
shape = input_ids.shape.as_list()
payload = {
    "inputs": [
        {"name": "input_ids:0",
         "datatype": "INT32",
         "shape": shape,
         "data": input_ids.numpy().tolist()
         },
        {"name": "attention_mask:0",
         "datatype": "INT32",
         "shape": shape,
         "data": np.ones(shape, dtype=np.int32).tolist()
         }
    ]
}

# vegeta's JSON target format expects the request body to be base64-encoded
cmd = {"method": "POST",
       "header": {"Content-Type": ["application/json"]},
       "url": "http://localhost:80/seldon/default/gpt2/v2/models/gpt2/infer",
       "body": base64.b64encode(bytes(json.dumps(payload), "utf-8")).decode("utf-8")}

with open("vegeta_target.json", mode="w") as file:
    json.dump(cmd, file)
    file.write('\n\n')
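
Because the body in the target file is base64-encoded, it is easy to write a target that looks fine but carries a malformed payload. The sketch below builds a target the same way as the cell above and decodes it back to confirm that the payload round-trips. `make_target` and `decode_body` are hypothetical helpers, and the token ids in `sample_payload` are placeholders rather than real tokenizer output.

```python
import base64
import json
from typing import Any, Dict


def make_target(url: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build a vegeta JSON-format target with a base64-encoded JSON body."""
    body = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return {
        "method": "POST",
        "url": url,
        "header": {"Content-Type": ["application/json"]},
        "body": body,
    }


def decode_body(target: Dict[str, Any]) -> Dict[str, Any]:
    """Recover the JSON payload from a target's base64 body."""
    return json.loads(base64.b64decode(target["body"]))


sample_payload = {
    "inputs": [
        {"name": "input_ids:0", "datatype": "INT32", "shape": [1, 2], "data": [[1, 2]]},
        {"name": "attention_mask:0", "datatype": "INT32", "shape": [1, 2], "data": [[1, 1]]},
    ]
}
target = make_target("http://localhost:80/seldon/default/gpt2/v2/models/gpt2/infer", sample_payload)
line = json.dumps(target)                 # one target per line in the file
restored = decode_body(json.loads(line))  # what the server would receive
print(restored == sample_payload)
```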

In [None]:
!vegeta attack -targets=vegeta_target.json -rate=1 -duration=60s -format=json | vegeta report -type=text

### Clean-up

In [None]:
!kubectl delete -f gpt2-deploy.yaml -n default