### Tutorial on DeepSpeed inference

This tutorial follows the Towards Data Science blog post [DeepSpeed Deep Dive: Model Implementations for Inference (MII)](https://towardsdatascience.com/deepspeed-deep-dive-model-implementations-for-inference-mii-b02aa5d5e7f7).

The notebook was run successfully with **deepspeed==0.12.6, deepspeed-mii==0.1.3, transformers==4.36.2**.

#### Step 0. Set up the environment and dependencies

In [20]:
import os
import time

os.environ["TOKENIZERS_PARALLELISM"] = "false"

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

In [2]:
import torch
from transformers import pipeline

In [3]:
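def infer_speed_report(num_tokens, start_time, end_time):
    """Print throughput and the average time per generated token."""
    throughput = num_tokens / (end_time - start_time)
    print(f"Number of new tokens: {num_tokens}")
    print(f"Throughput: {throughput:.1f} tokens/sec")
    # "Latency" is the mean wall-clock time per new token, in milliseconds.
    print(f"Latency: {1000 / throughput:.1f} ms")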
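
Both reported numbers come from the same two quantities, so it may help to see the arithmetic in isolation. The sketch below uses a hypothetical helper, `speed_metrics`, which is not part of the notebook and returns the values instead of printing them. The "Latency" figure is the average wall-clock time per generated token, not the time to the first token.

```python
def speed_metrics(num_tokens, elapsed_seconds):
    """Return (tokens per second, average milliseconds per token).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if num_tokens <= 0 or elapsed_seconds <= 0:
        raise ValueError("num_tokens and elapsed_seconds must be positive")
    throughput = num_tokens / elapsed_seconds
    ms_per_token = 1000 / throughput
    return throughput, ms_per_token


# Example: 500 new tokens generated in 10 seconds of wall-clock time.
throughput, ms_per_token = speed_metrics(500, 10.0)
print(f"{throughput:.1f} tokens/sec, {ms_per_token:.1f} ms/token")
```

Returning the numbers rather than printing them makes it easier to collect results from several runs and compare them later.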

In [4]:
# LOCAL_MISTRAL_MODEL_ROOT must be set in the shell or in the .env file loaded above.
model_root = os.getenv('LOCAL_MISTRAL_MODEL_ROOT')
model_path = os.path.join(model_root, "Mistral-7B-v0.1")
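
If `LOCAL_MISTRAL_MODEL_ROOT` is not defined, `os.getenv` returns `None` and `os.path.join` fails with a `TypeError` that does not mention the variable. A minimal sketch of a guard is shown below; `resolve_model_path` is a hypothetical helper, and the `/path/to/models` directory is only a placeholder.

```python
import os


def resolve_model_path(env_var, model_name):
    """Join the directory stored in `env_var` with `model_name`.

    Hypothetical helper; raises a readable error if the variable is unset.
    """
    root = os.getenv(env_var)
    if not root:
        raise RuntimeError(
            f"{env_var} is not set; define it in your .env file or shell"
        )
    return os.path.join(root, model_name)


# Placeholder directory for the demo; the real value comes from your .env file.
os.environ.setdefault("LOCAL_MISTRAL_MODEL_ROOT", "/path/to/models")
print(resolve_model_path("LOCAL_MISTRAL_MODEL_ROOT", "Mistral-7B-v0.1"))
```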

In [5]:
prompt = "What should you do when you feel anxious?"
max_new_tokens = 500

#### Step 1. Use the native transformers pipeline for text generation

In [6]:
pipe = pipeline(
    "text-generation",
    model=model_path,
    device=0,
    torch_dtype=torch.float16,
)


In [7]:
tokenizer = pipe.tokenizer

In [8]:
t0 = time.time()

result = pipe(
    prompt,
    max_new_tokens=max_new_tokens,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

t1 = time.time()

In [9]:
gen_text = result[0]['generated_text']
gen_tokens = len(tokenizer(gen_text, return_tensors="pt").input_ids[0])
old_tokens = len(tokenizer(prompt, return_tensors="pt").input_ids[0])
new_tokens = gen_tokens - old_tokens

infer_speed_report(new_tokens, t0, t1)

Number of new tokens: 210
Throughput: 41.6 tokens/sec
Latency: 24.0 ms
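
The measurement above times a single call, so it can include one-off start-up costs and is sensitive to noise. A minimal sketch of a more robust approach is shown below: run a warm-up call, repeat the timed call a few times, and take the median. The `benchmark` helper is hypothetical and not part of the notebook, and the `time.sleep` workload is only a stand-in so the sketch runs without a GPU or a model.

```python
import statistics
import time


def benchmark(generate, warmup=1, repeats=3):
    """Time `generate()` several times and return the median seconds per call.

    `generate` is any zero-argument callable, for example
    lambda: pipe(prompt, max_new_tokens=max_new_tokens, do_sample=True).
    Hypothetical helper; not part of the original notebook.
    """
    for _ in range(warmup):
        generate()  # warm-up runs are executed but not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        generate()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload so the sketch runs anywhere.
median_seconds = benchmark(lambda: time.sleep(0.01))
print(f"Median time per call: {median_seconds * 1000:.1f} ms")
```

Because `do_sample=True` can end generation at different lengths, throughput should still be computed per run from that run's own token count.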


#### Step 2. Use DeepSpeed-Inference for text generation

In [10]:
import deepspeed

[2024-01-17 16:17:43,407] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)


In [11]:
# Replace the model's operators with DeepSpeed's optimized inference kernels.
pipe.model = deepspeed.init_inference(
    model=pipe.model,
    dtype=torch.half,
    # tp_size: number of devices to split the model across with tensor
    # parallelism; the default of 1 disables tensor parallelism.
    tensor_parallel={"tp_size": 1},
    replace_with_kernel_inject=True,
)

[2024-01-17 16:17:43,558] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.12.6, git-hash=unknown, git-branch=unknown
[2024-01-17 16:17:43,560] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1


In [12]:
t0 = time.time()

result = pipe(
    prompt,
    max_new_tokens=max_new_tokens,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

t1 = time.time()

In [13]:
gen_text = result[0]['generated_text']
gen_tokens = len(tokenizer(gen_text, return_tensors="pt").input_ids[0])
old_tokens = len(tokenizer(prompt, return_tensors="pt").input_ids[0])
new_tokens = gen_tokens - old_tokens

infer_speed_report(new_tokens, t0, t1)

Number of new tokens: 500
Throughput: 56.1 tokens/sec
Latency: 17.8 ms


#### Step 3. Use DeepSpeed-MII for text generation

In [22]:
import mii

mii.get_supported_models(task='text-generation')

['uf-aice-lab/math-roberta',
 'stevems1/distilroberta-base-SmithsModel',
 'sangjeedondrub/tibetan-roberta-causal-base',
 'samba/samba-large-bert-fine-tuned',
 'gokceuludogan/ChemBERTaLM',
 'GItaf/roberta-base-finetuned-mbti-0901',
 'GItaf/roberta-base-roberta-base-finetuned-mbti-0911',
 'GItaf/roberta-base-roberta-base-finetuned-mbti-0912-weight0',
 'GItaf/roberta-base-roberta-base-TF-weight2-epoch5',
 'GItaf/roberta-base-roberta-base-TF-weight0.5-epoch5',
 'GItaf/roberta-base-roberta-base-TF-weight1-epoch15',
 'GItaf/roberta-base-roberta-base-TF-weight1-epoch5',
 'GItaf/roberta-base-roberta-base-TF-weight1-epoch10',
 'mamiksik/CodeBertaCLM',
 'AndrewOgn/distilroberta-base-finetuned-wikitext2',
 'l-tran/distilroberta-base-OLID-MLM',
 'hf-tiny-model-private/tiny-random-RobertaForCausalLM',
 'pthpth0206/amazon-roberta',
 'sharoz/codebert-python-custom-functions-dataset-python',
 'ayushutkarsh/roberta_t3_nce',
 'tobijen/my_awesome_eli5_clm-model_roberta',
 'panghee/pangheezoa',
 'himanima

In [14]:
import mii

# Call mii.pipeline through the module so it does not shadow
# the transformers `pipeline` imported earlier.
mii_pipe = mii.pipeline(model_path)

[2024-01-17 16:17:52,670] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-17 16:17:52,671] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-01-17 16:17:52,675] [INFO] [engine_v2.py:82:__init__] Building model...


Using /home/hyp/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/hyp/.cache/torch_extensions/py38_cu121/inference_core_ops/build.ninja...
Building extension module inference_core_ops...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module inference_core_ops...


ninja: no work to do.
Time to load inference_core_ops op: 0.8310742378234863 seconds


Using /home/hyp/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/hyp/.cache/torch_extensions/py38_cu121/ragged_device_ops/build.ninja...
Building extension module ragged_device_ops...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module ragged_device_ops...


ninja: no work to do.
Time to load ragged_device_ops op: 0.8254876136779785 seconds


Using /home/hyp/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/hyp/.cache/torch_extensions/py38_cu121/ragged_ops/build.ninja...
Building extension module ragged_ops...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


ninja: no work to do.
Time to load ragged_ops op: 0.819563627243042 seconds
[2024-01-17 16:17:56,223] [INFO] [huggingface_engine.py:112:parameters] Loading checkpoint: /data1/model/mistral/mistralai/Mistral-7B-v0.1/model-00002-of-00002.safetensors


Loading extension module ragged_ops...


[2024-01-17 16:17:56,454] [INFO] [huggingface_engine.py:112:parameters] Loading checkpoint: /data1/model/mistral/mistralai/Mistral-7B-v0.1/model-00001-of-00002.safetensors
[2024-01-17 16:17:59,347] [INFO] [engine_v2.py:84:__init__] Model built.
[2024-01-17 16:17:59,356] [INFO] [kv_cache.py:135:__init__] Allocating KV-cache 0 with shape: (32, 6203, 64, 2, 8, 128) consisting of 6203 blocks.


In [18]:
t0 = time.time()

result = mii_pipe(
    [prompt],
    max_new_tokens=max_new_tokens,
    do_sample=True,
)

t1 = time.time()

In [19]:
gen_text = result[0].generated_text
gen_tokens = len(tokenizer(gen_text, return_tensors="pt").input_ids[0])
old_tokens = len(tokenizer(prompt, return_tensors="pt").input_ids[0])
new_tokens = gen_tokens - old_tokens

infer_speed_report(new_tokens, t0, t1)

Number of new tokens: 491
Throughput: 89.8 tokens/sec
Latency: 11.1 ms
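
To compare the three runs, the throughput figures reported above can be turned into speedups over the native transformers pipeline. The sketch below only restates the numbers printed in this notebook; they work out to roughly 1.35x for DeepSpeed-Inference and 2.16x for DeepSpeed-MII. Treat these as indicative only: each figure comes from a single run, and the runs generated different numbers of new tokens (210, 500 and 491).

```python
# Throughput figures (tokens/sec) copied from the three reports above.
throughput = {
    "transformers pipeline": 41.6,
    "DeepSpeed-Inference": 56.1,
    "DeepSpeed-MII": 89.8,
}

baseline = throughput["transformers pipeline"]
speedups = {name: value / baseline for name, value in throughput.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```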
