# Installing required libraries

In [1]:
!pip install -q -U transformers bitsandbytes

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import time

# Logging in to Hugging Face
Some of the models released by Mistral are gated and require an access token for authorization.

In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map = "auto", quantization_config = quant_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

# Creating a list of 32 prompts
To test the quantized model at a concurrency of 32, I randomly generated a list of 32 prompts. Because the prompts differ in length, the tokenizer pads each one (if it is shorter than 128 tokens) or truncates it (if it is longer than 128 tokens), so that every input has a fixed size of 128 tokens, as required by the problem statement.

In [4]:
prompt_list = [
    "A list of colors: red, blue",
    "Portugal is",
    "Write a story about a magical library where the books come alive at night.",
    "Describe your perfect day, from start to finish.",
    "You discover a secret door in your house that leads to a parallel universe. What do you find on the other side?",
    "Write a letter to your future self, 10 years from now.",
    "You wake up one morning to find that you've switched bodies with your pet. What happens next?",
    "Imagine you have the power to control time. How would you use this ability?",
    "You wake up in a strange place with no memory of how you got there. Describe your surroundings and what you do next.",
    "Write a story about a chance encounter that changes the course of someone's life.",
    "Describe your dream vacation, including the sights, sounds, and experiences.",
    "You find a genie's lamp and are granted three wishes. What do you wish for?",
    "Write a story about a character who discovers a hidden talent they never knew they had.",
    "Imagine a world where animals can talk. What would they say?",
    "You wake up one morning to find that you're the only person left in the world. What do you do?",
    "Write a story about a magical object that has the power to grant wishes, but with unexpected consequences.",
    "Describe your ideal job and what a typical day would be like.",
    "You discover a time machine and can travel to any point in history. Where do you go and why?",
    "Write a story about a character who has to overcome their greatest fear.",
    "Imagine a world where technology has advanced to the point where anything is possible. What does this world look like?",
    "You wake up to find that you have the ability to read minds. How does this change your life?",
    "Write a story about a character who discovers a secret society that exists in the shadows.",
    "Describe your perfect meal, from appetizer to dessert.",
    "You wake up one morning to find that you're a different age than you were the day before. How does this affect your life?",
    "Write a story about a character who has to make a difficult choice that will change the course of their life.",
    "Imagine a world where animals have taken over the planet. What does this world look like?",
    "You discover a mysterious object that has the power to transport you to any place you can imagine. Where do you go?",
    "Write a story about a character who has to overcome a physical or mental disability.",
    "Describe your dream house, including the layout, decor, and amenities.",
    "Write a story about a character who discovers a hidden talent they never knew they had.",
    "You wake up one morning to find that you have the ability to fly. How does this change your life?",
    "Imagine a world where magic is real. What does this world look like?"
]
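
The padding and truncation applied to each prompt can be illustrated without loading a tokenizer. The sketch below is a simplified stand-in, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are arbitrary, and the pad id of 2 is assumed from the EOS id that the generation log reports. It pads on the left, matching the `padding_side="left"` setting used when the tokenizer is loaded, and truncates on the right.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=2):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration only; pad_id=2 is an assumption.
    """
    ids = list(token_ids)[:max_length]             # truncate on the right
    n_pad = max_length - len(ids)
    input_ids = [pad_id] * n_pad + ids             # pad on the left
    attention_mask = [0] * n_pad + [1] * len(ids)  # 0 marks padding positions
    return input_ids, attention_mask


# A short prompt (3 arbitrary token ids) is padded up to the fixed length.
short_ids, short_mask = pad_or_truncate([1, 330, 1274], max_length=8)
# A long prompt (200 arbitrary token ids) is cut down to the fixed length.
long_ids, long_mask = pad_or_truncate(range(200), max_length=8)

print(short_ids)
print(short_mask)
print(long_ids)
```

The attention mask is what lets the model ignore the padding positions, which is why the notebook passes it to `generate` together with the input ids.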

# 4-bit quantization
To load an LLM as large as Mistral 7B on a 16 GB GPU such as the T4, I apply 4-bit quantization with bitsandbytes; the details of the configuration are in quant_config. The configuration uses NF4 (4-bit NormalFloat), the data type introduced with QLoRA, which generally yields better results than standard 4-bit quantization. It is combined with double quantization, in which the quantization constants produced by the first quantization step are themselves quantized, saving additional memory (roughly 0.4 bits per parameter according to the QLoRA paper).
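
A rough back-of-the-envelope calculation shows why 4-bit weights fit on a 16 GB card. The figures below are assumptions rather than measurements from this notebook: about 7.24 billion parameters for Mistral 7B, one quantization constant per block of 64 weights, and a second-level block of 256 constants for double quantization, as described in the QLoRA paper. The estimate covers weights only and ignores layers that stay unquantized, activations, and the KV cache.

```python
# Assumed values (not taken from the notebook): parameter count and block sizes.
N_PARAMS = 7.24e9   # approximate parameter count of Mistral 7B
BLOCK = 64          # weights sharing one quantization constant
BLOCK2 = 256        # constants sharing one second-level constant


def gigabytes(bits_per_param, n_params=N_PARAMS):
    """Convert an average bits-per-parameter figure to gigabytes."""
    return bits_per_param * n_params / 8 / 1e9


fp16_bits = 16.0
# 4-bit weights plus one fp32 constant per block of 64 weights.
nf4_bits = 4.0 + 32.0 / BLOCK
# Double quantization: 8-bit constants, plus one fp32 constant per 256 of them.
nf4_dq_bits = 4.0 + 8.0 / BLOCK + 32.0 / (BLOCK * BLOCK2)

for name, bits in [
    ("fp16", fp16_bits),
    ("nf4", nf4_bits),
    ("nf4 + double quant", nf4_dq_bits),
]:
    print(f"{name:<20} {bits:6.3f} bits/param  ~{gigabytes(bits):5.2f} GB")

saved_bits = nf4_bits - nf4_dq_bits
print(f"double quantization saves ~{saved_bits:.3f} bits/param")
```

Under these assumptions the fp16 weights alone would take about 14.5 GB, leaving almost no room on a T4, while the 4-bit variants need under 4.1 GB.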

In [16]:
from transformers import AutoTokenizer, AutoModelForCausalLM

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto", quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")

tokenizer.pad_token = tokenizer.eos_token  # Most LLMs don't have a pad token by default

# Tokenize the input using encode_plus
model_inputs = []
attention_inputs = []
for prompt in prompt_list:
    encoded = tokenizer.encode_plus(
        prompt,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors="pt"
    )
    model_inputs.append(encoded['input_ids']) # Collect the model_inputs
    attention_inputs.append(encoded['attention_mask'])  # Collect the attention mask

model_inputs = torch.cat(model_inputs).to("cuda") #Concatenate the model_inputs
attention_inputs = torch.cat(attention_inputs).to("cuda")  # Concatenate attention masks

# Generate output
start = time.time()
# max_new_tokens is set to 128 as per the problem statement
generated_ids = model.generate(input_ids=model_inputs, attention_mask=attention_inputs, max_new_tokens=128, do_sample=True)
end = time.time()

# Calculate latency and throughput
# generated_ids has shape (batch, prompt tokens + new tokens), so this count covers both input and output tokens
output_token_count = 0
for i in range(generated_ids.size(dim=0)):
    output_token_count += generated_ids[i].size(dim=0)

latency = (end - start) / 32 # (end - start) is the total generation time for the whole batch; dividing by 32 gives the average time per prompt
through_put = (output_token_count) / (end - start) # output_token_count sums the input tokens (128) and output tokens (128) of every sequence, i.e. 256 * 32

print(f"Latency: {latency} seconds")
print(f"Throughput: {through_put} tokens/second")

print(f"Generated IDs : {generated_ids}")

decoded = tokenizer.batch_decode(generated_ids)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Latency: 1.2604110017418861 seconds
Throughput: 203.108350884123 tokens/second
Generated IDs : tensor([[    2,     2,     2,  ...,  3640,   298, 17039],
        [    2,     2,     2,  ...,   297, 16160,  1159],
        [    2,     2,     2,  ...,  1190,   369,   948],
        ...,
        [    2,     2,     2,  ...,   676,  2580,   684],
        [    2,     2,     2,  ...,  2659, 28725,   304],
        [    2,     2,     2,  ...,   438,   272, 13711]], device='cuda:0')


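As the output shows, the script reaches a throughput of about 203 tokens/sec at a concurrency of 32. Note that this figure counts both the 128 input tokens and the 128 generated tokens of each sequence.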
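
To make the metric definitions explicit, the sketch below recomputes the reported numbers from the batch time. `batch_metrics` is a hypothetical helper, and the elapsed time is recovered from the printed latency (latency × 32) rather than measured again. It also reports the throughput when only generated tokens are counted, which is the more common convention for generation benchmarks.

```python
def batch_metrics(elapsed_s, batch_size, prompt_len, new_tokens):
    """Latency and throughput for one batched generate() call (illustrative)."""
    total_tokens = batch_size * (prompt_len + new_tokens)
    generated_tokens = batch_size * new_tokens
    return {
        # Average time per prompt; every prompt actually waits elapsed_s.
        "amortized_latency_s": elapsed_s / batch_size,
        # Counts padded prompt tokens plus generated tokens, as in the notebook.
        "throughput_total": total_tokens / elapsed_s,
        # Counts only the newly generated tokens.
        "throughput_generated": generated_tokens / elapsed_s,
    }


# Batch wall-clock time recovered from the latency printed above.
elapsed = 1.2604110017418861 * 32
metrics = batch_metrics(elapsed, batch_size=32, prompt_len=128, new_tokens=128)

for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```

Since the prompt and generation lengths are equal here, the generated-only throughput is exactly half of the total figure.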

# Looking at the decoded responses

In [14]:
for i, response in enumerate(decoded):
  print(i)
  print(response)

0
</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><s> A list of colors: red, blue and green has been generated in our computer. In a test the following questions are asked:

Which color (s) among the given ones is/are:

red- green- blue , or red- blue, or blue- green, or red?

Let $R,B,G$ denote whether the given color is really Red, Blue, or Green. Let `X = RG, Y = RB, Z = BG.` The questions are: $Y? X? Z? YXZ$ It is to be noted here that, for example,

$R?R
1
</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></