In [1]:
!pip -q install git+https://github.com/huggingface/transformers
!pip -q install bitsandbytes accelerate
# bitsandbytes provides the 4-bit quantization; accelerate handles device placement (device_map="auto")


In [2]:
"""
1. Loading the model in 16-bit precision (no quantization) produced two errors: (a) the CPU RAM was exhausted, and
   (b) GPU memory climbed to about 15 GB and then raised a CUDA out-of-memory error.
   To work around this, I loaded the model with 4-bit quantization and set device_map="auto".

2. Inference could be sped up further with flash-attention, but it does not currently support T4 GPUs.
"""

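The out-of-memory error in point 1 follows from simple arithmetic. A minimal sketch, assuming roughly 7.24 billion parameters for Mistral-7B (an approximate figure that this notebook does not state) and counting only the weights, not activations or the KV cache; `weight_memory_gib` is a hypothetical helper:

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 7.24e9  # assumption: approximate parameter count of Mistral-7B

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights alone come close to the memory of a 16 GB T4, whereas the 4-bit weights need about a quarter of that, before the small overhead of the quantization constants.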

In [3]:
from huggingface_hub import notebook_login
notebook_login()


In [4]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import nest_asyncio
import time
import asyncio
import textwrap

In [5]:
nest_asyncio.apply() # patches asyncio so that run_until_complete() can be called inside the notebook's already-running event loop

In [6]:
quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
)
# the model weights are stored in 4-bit NF4 precision, which reduces the GPU RAM needed to hold them;
# they are dequantized to 16-bit floating point (float16) for the actual computations
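To illustrate the idea of storing weights in 4 bits and computing in higher precision, here is a simplified sketch of block-wise absmax quantization. It is not the bitsandbytes implementation: NF4 uses non-uniform levels, whereas this sketch uses uniformly spaced integer codes for readability. The function names and sample weights are hypothetical.

```python
def quantize_block(values, bits=4):
    """Quantize one block of floats to signed integer codes with a shared absmax scale."""
    max_code = 2 ** (bits - 1) - 1               # 7 for 4 bits
    scale = max(abs(v) for v in values) or 1.0   # avoid dividing by zero for an all-zero block
    codes = [round(v / scale * max_code) for v in values]
    return codes, scale

def dequantize_block(codes, scale, bits=4):
    """Recover approximate floats; this is the step performed before the 16-bit computation."""
    max_code = 2 ** (bits - 1) - 1
    return [c / max_code * scale for c in codes]

weights = [0.12, -0.53, 0.07, 0.91, -0.30]  # made-up values for illustration
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
print(codes)
print([round(w, 3) for w in restored])
```

Each weight is stored as a small integer plus one shared scale per block, so the reconstruction error per weight is at most half a quantization step.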

In [7]:
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", quantization_config=quantization_config, device_map="auto")
# device_map="auto" lets the transformers library automatically distribute the model's layers across the available hardware so that the model fits in memory
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
""" Uncomment and run this cell to add LoRA adapters. After running the model with LoRA, the model weights took ~2.4 GB of GPU RAM.
    When we load the model without LoRA, it takes around ~6 GB. """
# from peft import LoraConfig, get_peft_model, PeftModel
# lora_config = LoraConfig(
#     r=8, # rank of the low-rank matrices introduced by LoRA 
#     lora_alpha=16, # scaling factor for the low-rank matrices 
#     target_modules=["q_proj", "v_proj"], # adapter targets query and value projection matrices
#     lora_dropout=0.05,
#     bias="none",
#     task_type="CAUSAL_LM"
# )
# model = get_peft_model(model, lora_config)

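To get a feel for the configuration above, here is a rough count of the parameters LoRA would add. LoRA keeps each targeted weight matrix frozen and adds two small matrices of rank `r` next to it. The layer dimensions below are assumptions taken from the published Mistral-7B architecture (hidden size 4096, 8 key/value heads of size 128, 32 layers), not from this notebook, and `lora_param_count` is a hypothetical helper:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

HIDDEN = 4096       # assumption: Mistral-7B hidden size
KV_DIM = 8 * 128    # assumption: 8 key/value heads of dimension 128
NUM_LAYERS = 32     # assumption: number of transformer layers
RANK = 8            # r from the LoraConfig above

# q_proj maps HIDDEN -> HIDDEN, v_proj maps HIDDEN -> KV_DIM
per_layer = lora_param_count(HIDDEN, HIDDEN, RANK) + lora_param_count(HIDDEN, KV_DIM, RANK)
total = per_layer * NUM_LAYERS
print(f"LoRA parameters added: {total:,}")
```

Under these assumptions the adapters add a few million parameters on top of the frozen base weights, so the main effect of LoRA is a small number of trainable parameters.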

In [8]:
tokenizer.pad_token = tokenizer.eos_token
# batched calls with padding=True need a pad token; this tokenizer most likely ships without one, which would explain
# why the model gives no output when this step is skipped, so the EOS token is reused for padding
device = 'cuda'
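What padding does to a batch can be sketched without the tokenizer. The helper below is hypothetical and the token ids are made up; it pads every sequence to the length of the longest one and builds the attention mask that tells the model which positions to ignore. Left padding is shown because it is commonly recommended for decoder-only generation, so that new tokens directly follow the prompt; this notebook does not set the padding side explicitly.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length and build the matching attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (width - len(seq))
        mask_pad = [0] * len(padding)   # 0 = padding position, ignored by attention
        real = [1] * len(seq)           # 1 = real token
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask

EOS_ID = 2  # matches the eos_token_id reported in this notebook's log output
ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=EOS_ID)
print(ids)
print(mask)
```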

In [9]:
# a basic function to show the output in a more readable format
def wrap_text(text, width=90):
    lines = text.split('\n')
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]
    wrapped_text = '\n'.join(wrapped_lines)
    return wrapped_text

In [10]:
async def generate(input_texts, max_length=128):
    # truncation=True has no effect here because no max_length is passed to the tokenizer
    inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True).to(device)
    # max_length caps prompt + generated tokens together, so longer answers are cut off mid-sentence
    outputs = model.generate(**inputs, max_length=max_length, temperature=0.1, do_sample=True)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    wrapped_texts = [wrap_text(text) for text in texts]
    for text in wrapped_texts:
        print(text)
    return wrapped_texts

In [11]:
# we run a few generations on a seed text before benchmarking; this absorbs one-time costs (CUDA initialization,
# kernel loading, memory allocation) so that they do not distort the timings measured afterwards
# https://medium.com/better-ml/model-warmup-8e9681ef4d41
def warmup_model(model, tokenizer, text, num_iterations=5):
    for _ in range(num_iterations):
        inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(device)
        model.generate(**inputs, max_new_tokens=100)
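Warm-up matters mainly for timing: the first calls pay one-time setup costs that later calls do not. A minimal sketch of the pattern with a stand-in workload; `time_with_warmup` is a hypothetical helper, and in the notebook the callable would wrap `model.generate(...)`:

```python
import time

def time_with_warmup(fn, warmup=3, runs=5):
    """Call fn a few times untimed, then return the mean wall-clock time of the timed runs."""
    for _ in range(warmup):
        fn()                      # untimed: absorbs one-time setup costs
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

# Stand-in workload used only for illustration
calls = []
mean_seconds = time_with_warmup(lambda: calls.append(sum(range(10_000))))
print(f"{len(calls)} calls, mean {mean_seconds:.6f} s per timed run")
```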

In [12]:
# this function estimates the throughput of our model for a specific set of constraints
async def benchmark_model(model, tokenizer, prompts, input_tokens, output_tokens):
    inputs = tokenizer(prompts, return_tensors="pt", max_length=input_tokens, padding=True, truncation=True).to(device)

    start_time = time.time()
    outputs = model.generate(**inputs, max_length=output_tokens)
    end_time = time.time()

    total_time = end_time - start_time
    # nominal token budget, not a count of the tokens actually processed: generate() stops once prompt + new tokens
    # reach max_length, so the resulting figure is an upper bound on the real throughput
    total_tokens = (input_tokens + output_tokens) * len(prompts)
    throughput = total_tokens / total_time

    print(f'Average time per inference: {total_time/len(prompts):.4f} seconds')
    print(f'Throughput: {throughput:.2f} tokens/sec')
    return throughput
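The throughput above is computed from the requested token budgets rather than from what was generated. A stricter alternative, sketched here as an assumption rather than as the notebook's method, counts only the tokens appended to each sequence. In the notebook, `prompt_len` would be `inputs["input_ids"].shape[1]` and `output_len` would be `outputs.shape[1]`, since `generate` returns the prompt followed by the new tokens for a decoder-only model; positions filled with padding after an early stop would still be counted. The helper name and the prompt length of 20 are hypothetical.

```python
def generated_token_throughput(prompt_len, output_len, batch_size, elapsed_seconds):
    """Tokens/sec counting only the tokens appended to each sequence during generation."""
    new_tokens_per_sequence = max(output_len - prompt_len, 0)
    return new_tokens_per_sequence * batch_size / elapsed_seconds

# Hypothetical shapes: a padded prompt of 20 tokens grown to the 128-token cap, batch of 32
elapsed = 0.8335 * 32   # total time implied by the per-inference average in this notebook's log
print(f"{generated_token_throughput(20, 128, 32, elapsed):.2f} tokens/sec")
```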

In [13]:
async def main_concurrent(model, tokenizer, concurrency=32):
    warmup_prompt = "<s>[INST] Warm-up prompt to prepare the model. [/INST]" # seed text to warm up the model
    benchmark_prompt = "<s>[INST] Please provide a summary of the latest news. [/INST]" # we use this text to test all the functions and check the throughput on a single prompt

    warmup_model(model, tokenizer, warmup_prompt)

    print(f'Benchmarking on predefined prompt:\n{benchmark_prompt}\n')
    await benchmark_model(model, tokenizer, [benchmark_prompt]*concurrency, input_tokens=128, output_tokens=128)

    # then we ask the user for prompts; the number of prompts equals the number of concurrent tasks that we want our model to handle
    input_prompts = []
    for i in range(concurrency):
        prompt = input(f"Enter prompt {i+1}/{concurrency}: ")
        input_prompts.append(prompt)

    print('Benchmarking on user provided prompts:')
    await benchmark_model(model, tokenizer, input_prompts, input_tokens=128, output_tokens=128)

    print('Generating responses for user provided prompts:')
    await generate(input_prompts)
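A note on what "concurrency" means here: none of the coroutines above awaits anything that yields control to the event loop, and `model.generate` is a blocking call, so the steps run one after another. The parallelism comes from batching all prompts into a single `generate` call. The difference between blocking and cooperative coroutines can be seen in a small standard-library sketch, with `time.sleep` standing in for a blocking call; all names are hypothetical:

```python
import asyncio
import time

async def blocking_task(log, name, delay=0.05):
    log.append(f"{name} start")
    time.sleep(delay)            # blocks the event loop; no other task can run meanwhile
    log.append(f"{name} end")

async def cooperative_task(log, name, delay=0.05):
    log.append(f"{name} start")
    await asyncio.sleep(delay)   # yields to the event loop while waiting
    log.append(f"{name} end")

async def run_pair(task):
    """Run two copies of the given task with gather and return the event log."""
    log = []
    await asyncio.gather(task(log, "a"), task(log, "b"))
    return log

print(asyncio.run(run_pair(blocking_task)))
print(asyncio.run(run_pair(cooperative_task)))
```

In the blocking case the second task starts only after the first has finished, even though both were submitted together.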

In [14]:
if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main_concurrent(model, tokenizer, concurrency=32))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Benchmarking on predefined prompt:
<s>[INST] Please provide a summary of the latest news. [/INST]

Average time per inference: 0.8335 seconds
Throughput: 307.12 tokens/sec
Enter prompt 1/32: what is cv? 
Enter prompt 2/32: what is ml? 
Enter prompt 3/32: what is dl?
Enter prompt 4/32: how do we make french toast? 
Enter prompt 5/32: how to make lasagna? 
Enter prompt 6/32: what is lasgana made of? 
Enter prompt 7/32: what is rl?
Enter prompt 8/32: how to use rl in self-driving cars?
Enter prompt 9/32: what is policy learning?
Enter prompt 10/32: is rl tough?
Enter prompt 11/32: what are some use cases of rl?
Enter prompt 12/32: how can we combine rl with llms?
Enter prompt 13/32: what are ViTs?
Enter prompt 14/32: what is the use of ViTs?
Enter prompt 15/32: explain me the working of ViTs?
Enter prompt 16/32: can we use ViT in a self-driving car?
Enter prompt 17/32: what is the difference between ViT and CV?
Enter prompt 18/32: do you know about tinygrad?
Enter prompt 19/32: what is th

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Benchmarking on user provided prompts:


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Average time per inference: 0.8798 seconds
Throughput: 290.97 tokens/sec
Generating responses for user provided prompts:
what is cv?

CV stands for Curriculum Vitae, which is a Latin term meaning "course of life." It is a
document used to provide detailed information about an individual's education, work
experience, skills, and other qualifications. A CV is often used when applying for jobs,
academic positions, or research grants. It is a more detailed and comprehensive version of
a resume, and is typically longer than one page. The purpose of a CV is to showcase an
individual's accomplishments and qualifications in a clear and concise manner, making
what is ml?

Machine Learning (ML) is a subfield of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being explicitly
programmed. It focuses on the development of computer programs that can access data and
use it to learn for themselves. The process of learning begins