## Before you get started, make sure you set your runtime to "GPU".

To do so:
* Click "runtime" above.
* Select "change runtime type".
* Change "hardware accelerator" to "GPU".
* Restart your kernel.

Also, make sure you install whatever packages you need. We've started you off with *transformers*.


In [None]:
!pip install transformers accelerate bitsandbytes

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting accelerate
  Downloading accelerate-0.18.0-py3-none-any.whl (215 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.37.2-py3-none-any.whl (84.2 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Download

In [None]:
!nvidia-smi

Thu Apr  6 16:50:25 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!rm -rf /content/*

In [None]:
from accelerate import init_empty_weights, load_checkpoint_and_dispatch, infer_auto_device_map
from huggingface_hub import hf_hub_download
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoModelForTokenClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    T5ForConditionalGeneration,
    pipeline,
)
import os
import torch
import psutil

Main documentation

https://huggingface.co/blog/accelerate-large-models

https://huggingface.co/docs/accelerate/usage_guides/big_modeling

https://huggingface.co/docs/transformers/main/main_classes/quantization

# Assessment 1

Using the HuggingFace transformers library, deploy https://huggingface.co/dslim/bert-base-NER. Create a function that takes a string as input and outputs a list of the people identified by the model.

Example Input: "John Smith and Mary walked to the beach."

Example Output: ["John Smith","Mary"]

In [None]:
# get tokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")

# load model
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# get pipeline for inference
nlp = pipeline("ner", model=model, tokenizer=tokenizer)


def get_person_names(text, nlp):
  # run NER and keep the word of every token tagged as a person (B-PER / I-PER)
  ner_results = nlp(text)
  person_names = []
  for res in ner_results:
    if 'PER' in res['entity']:
      person_names.append(res['word'])
  return person_names



example = "This morning Jason Candle was running in the park and his friend Mor Gadol came by"
person_names = get_person_names(example, nlp)

print(person_names)

['Jason', 'Can', '##dle', 'Mo', '##r', 'G', '##ado', '##l']


In [None]:
# inspect the raw token-level predictions
res = nlp(example)
res

[{'entity': 'B-PER',
  'score': 0.9997925,
  'index': 3,
  'word': 'Jason',
  'start': 13,
  'end': 18},
 {'entity': 'I-PER',
  'score': 0.99982375,
  'index': 4,
  'word': 'Can',
  'start': 19,
  'end': 22},
 {'entity': 'I-PER',
  'score': 0.9996581,
  'index': 5,
  'word': '##dle',
  'start': 22,
  'end': 25},
 {'entity': 'B-PER',
  'score': 0.99873394,
  'index': 14,
  'word': 'Mo',
  'start': 65,
  'end': 67},
 {'entity': 'B-PER',
  'score': 0.9697566,
  'index': 15,
  'word': '##r',
  'start': 67,
  'end': 68},
 {'entity': 'I-PER',
  'score': 0.99955446,
  'index': 16,
  'word': 'G',
  'start': 69,
  'end': 70},
 {'entity': 'I-PER',
  'score': 0.98712033,
  'index': 17,
  'word': '##ado',
  'start': 70,
  'end': 73},
 {'entity': 'I-PER',
  'score': 0.9894043,
  'index': 18,
  'word': '##l',
  'start': 73,
  'end': 74}]
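
The raw output above has one entry per word piece, which is why the names come back split (`'Can'`, `'##dle'`). Below is a minimal sketch of one way to merge the pieces back into full names using the `start`/`end` character offsets of each entry. The helper name `merge_person_tokens` and its adjacency rule are assumptions made for illustration, not part of the submitted solution. The transformers `pipeline` also accepts an `aggregation_strategy` argument that groups word pieces inside the library, which may be the simpler route.

```python
def merge_person_tokens(text, ner_results):
    """Merge word-piece PER predictions into full names via character offsets."""
    spans = []  # one [start, end] character span per person
    for tok in ner_results:
        if not tok["entity"].endswith("PER"):
            continue
        is_subword = tok["word"].startswith("##")
        is_inside = tok["entity"].startswith("I-")
        # A piece continues the current name when it starts at most one
        # character after the previous piece and is either a "##" sub-word
        # or an I-PER token.
        adjacent = bool(spans) and tok["start"] <= spans[-1][1] + 1
        if adjacent and (is_subword or is_inside):
            spans[-1][1] = tok["end"]
        else:
            spans.append([tok["start"], tok["end"]])
    return [text[start:end] for start, end in spans]


example = ("This morning Jason Candle was running in the park "
           "and his friend Mor Gadol came by")

# Entries copied from the pipeline output above (score and index omitted).
ner_results = [
    {"entity": "B-PER", "word": "Jason", "start": 13, "end": 18},
    {"entity": "I-PER", "word": "Can", "start": 19, "end": 22},
    {"entity": "I-PER", "word": "##dle", "start": 22, "end": 25},
    {"entity": "B-PER", "word": "Mo", "start": 65, "end": 67},
    {"entity": "B-PER", "word": "##r", "start": 67, "end": 68},
    {"entity": "I-PER", "word": "G", "start": 69, "end": 70},
    {"entity": "I-PER", "word": "##ado", "start": 70, "end": 73},
    {"entity": "I-PER", "word": "##l", "start": 73, "end": 74},
]

print(merge_person_tokens(example, ner_results))
# ['Jason Candle', 'Mor Gadol']
```

Slicing the original text by offsets, rather than joining the `word` fields, avoids having to strip the `##` markers and restores the original spacing.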

# Assessment 2

Create a token generation pipeline for the following model: https://huggingface.co/bigscience/bloom-7b1

For your pipeline, write out code to load as much of the model into the GPU as possible, then load the remainder into CPU Ram.

Since the model is too large to fit on the GPU, you will need to spill it over to disk and RAM. You will also need to consider how to use 8-bit or 16-bit precision to make this work.

When we test your code, we'll do so on 3 different instance types of varying sizes so don't just build for your colab instance. Assume there will always be at least some GPU space and enough CPU ram for the rest of the model.

You can use whatever packages or external code you'd like to accomplish the task.

Please remember to test the inference using the pipeline and print the output as part of the notebook.

In [None]:
# define model checkpoint (bloom-3b is used here instead of bloom-7b1; see the notes after the output)
checkpoint = "bigscience/bloom-3b"

# define no splitting block name for device mapping
no_split_block = 'BloomBlock'

# load empty model
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# device block mapping
device_map = infer_auto_device_map(model, 
                                   no_split_module_classes=[no_split_block],
                                   dtype='float16')

# set up the quantization config
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True,
                                         llm_int8_threshold=6.0,
                                         llm_int8_skip_modules=["lm_head"])

# load model from sharded files
model = AutoModelForCausalLM.from_pretrained(checkpoint, 
                                             device_map=device_map, 
                                             offload_folder="offload", 
                                             offload_state_dict=True,
                                             load_in_8bit=True,
                                             quantization_config=quantization_config,
                                             )

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# generate inputs
prompt = "The quick brown fox"
max_length = 50
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

# get outputs
outputs = model.generate(inputs["input_ids"], max_length=max_length)

# decode outputs
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# print outputs
print(decoded_outputs)

Downloading (…)lve/main/config.json:   0%|          | 0.00/693 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/6.01G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

['The quick brown fox jumps over the lazy dog.\n- What?\n- The quick brown fox jumps over the lazy dog.\n- What?\n- The quick brown fox jumps over the lazy dog.\n- What?\n- The quick brown fox jumps over the lazy']


The pipeline above follows these steps:

1. Load an empty model with the checkpoint's architecture.
2. Generate the device map for the model's blocks. It is important to keep each block together on a single device.
3. Load the model according to the device map, with an offload folder for storing parameters on disk in 8-bit format.
4. Load the tokenizer.
5. Test the model.

The exercise requires loading bloom-7b1; however, with that checkpoint nothing was stored in GPU RAM. So I tried bloom-1b1 and bloom-3b instead, and both work.
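
One way to check where the blocks actually ended up is to count the entries of the device map per device. The sketch below is only an illustration: `summarize_device_map` is a hypothetical helper, and `sample_map` is made-up data rather than the map produced in this notebook. A real map would come from `infer_auto_device_map(...)` or from `model.hf_device_map` after loading.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many modules were assigned to each device."""
    counts = Counter()
    for device in device_map.values():
        # GPU entries are integer indices; "cpu" and "disk" are strings
        label = f"gpu:{device}" if isinstance(device, int) else str(device)
        counts[label] += 1
    return dict(counts)


# Made-up map for illustration only
sample_map = {
    "transformer.word_embeddings": 0,
    "transformer.h.0": 0,
    "transformer.h.1": "cpu",
    "transformer.h.2": "disk",
    "lm_head": "disk",
}

print(summarize_device_map(sample_map))
# {'gpu:0': 2, 'cpu': 1, 'disk': 2}
```

If no `gpu:` key appears in the summary, nothing was placed on the GPU, which matches the symptom described above.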

Output printed:
['The quick brown fox jumps over the lazy dog.\n- What?\n- The quick brown fox jumps over the lazy dog.\n- What?\n- The quick brown fox jumps over the lazy dog.\n- What?\n- The quick brown fox jumps over the lazy']
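
Because the code will be tested on instances of different sizes, the memory limits could be derived at run time instead of being left entirely to the defaults. Both `infer_auto_device_map` and `from_pretrained` accept a `max_memory` dictionary. The sketch below builds one from measured free memory; the helper name `build_max_memory`, the 10% headroom, and the use of `MiB` size strings (in the style of the `"10GiB"` strings shown in the accelerate docs) are assumptions, not something the original solution used.

```python
def build_max_memory(gpu_free_bytes, cpu_available_bytes, headroom_percent=10):
    """Build a max_memory dict, leaving some headroom on every device.

    gpu_free_bytes: dict mapping GPU index -> free bytes on that GPU
    cpu_available_bytes: available CPU RAM in bytes
    """
    mib = 1024 ** 2
    keep = 100 - headroom_percent
    # integer arithmetic avoids float rounding surprises in the limits
    max_memory = {
        gpu_index: f"{free * keep // 100 // mib}MiB"
        for gpu_index, free in gpu_free_bytes.items()
    }
    max_memory["cpu"] = f"{cpu_available_bytes * keep // 100 // mib}MiB"
    return max_memory


# Illustrative numbers: one GPU with 15 GiB free and 12 GiB of free CPU RAM
print(build_max_memory({0: 15 * 1024 ** 3}, 12 * 1024 ** 3))
# {0: '13824MiB', 'cpu': '11059MiB'}
```

In the notebook, the inputs could be read with `torch.cuda.mem_get_info(i)[0]` for each GPU and `psutil.virtual_memory().available` for the CPU, and the result passed as `max_memory=...` when building the device map.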

# Assessment 3

Create a token generation pipeline for the following model: https://huggingface.co/EleutherAI/gpt-neox-20b

For your pipeline, write out code to load as much of the model into the GPU as possible, then load the remainder into CPU Ram.

When we test your code, we'll do so on 3 different instance types of varying sizes so don't just build for your colab instance. Assume there will always be at least some GPU space and enough CPU ram for the rest of the model.

You can use whatever packages or external code you'd like to accomplish the task.

Please remember to test the inference using the pipeline and print the output as part of the notebook.

In [None]:
# Limit the allocator's split size to reduce fragmentation; the value must be a plain integer
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:600"

# define model checkpoint
checkpoint = "EleutherAI/gpt-neox-20b"

# load empty model
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# set up the quantization config
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True,
                                         llm_int8_threshold=6.0)

# load model from sharded files
model = AutoModelForCausalLM.from_pretrained(checkpoint, 
                                             device_map='auto', 
                                             offload_folder="offload", 
                                             offload_state_dict=True,
                                             load_in_8bit=True,
                                             quantization_config=quantization_config,
                                             )

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# generate inputs
prompt = "My name is Teven and I am"
max_length = 50
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

# get outputs
outputs = model.generate(inputs["input_ids"], max_length=max_length)

# decode outputs
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# print outputs
print(decoded_outputs)

Downloading (…)model.bin.index.json:   0%|          | 0.00/57.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/46 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00046.bin:   0%|          | 0.00/926M [00:00<?, ?B/s]

Downloading (…)l-00002-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00003-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00004-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00005-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00006-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00007-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00008-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00009-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00010-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00011-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00012-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00013-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00014-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00015-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00016-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00017-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00018-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00019-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00020-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00021-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00022-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00023-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00024-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00025-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00026-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00027-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00028-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00029-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00030-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00031-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00032-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00033-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00034-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00035-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00036-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00037-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00038-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00039-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00040-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00041-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00042-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00043-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00044-of-00046.bin:   0%|          | 0.00/910M [00:00<?, ?B/s]

Downloading (…)l-00045-of-00046.bin:   0%|          | 0.00/604M [00:00<?, ?B/s]

Downloading (…)l-00046-of-00046.bin:   0%|          | 0.00/620M [00:00<?, ?B/s]

I used the same pipeline as for the bloom model, but there is not enough space on disk or on the GPU to load the sharded checkpoint files, which caused the Colab session to close.

I got this error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 14.75 GiB total capacity; 13.56 GiB 
already allocated; 336.81 MiB free; 13.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated 
memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and 
PYTORCH_CUDA_ALLOC_CONF

I tried to solve it by setting os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:<600>"

Then I got this error: RuntimeError: stoi

`stoi` is the C++ function that converts a string to an integer, and it can fail for a few reasons:

1. The string is not a valid integer: it contains non-numeric characters or is too large to fit into an integer.

2. The string is empty.

3. There is a bug in the code that calls `stoi`.

Here the first reason applies: the angle brackets in "<600>" are not numeric characters, so the value has to be written as a plain integer, for example "max_split_size_mb:600".
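
The failure can be reproduced without a GPU. The sketch below is a Python analogue of an integer-only option parser, not PyTorch's actual parsing code; `parse_alloc_conf` is a hypothetical helper, and it only covers integer-valued options such as `max_split_size_mb`.

```python
def parse_alloc_conf(value):
    """Parse 'key:value,key:value' pairs and require integer values."""
    options = {}
    for item in value.split(","):
        key, _, raw = item.partition(":")
        key, raw = key.strip(), raw.strip()
        try:
            options[key] = int(raw)
        except ValueError as err:
            # int() rejects "<600>" for the same reason stoi does
            raise ValueError(f"{key!r} needs an integer, got {raw!r}") from err
    return options


print(parse_alloc_conf("max_split_size_mb:600"))
# {'max_split_size_mb': 600}

try:
    parse_alloc_conf("max_split_size_mb:<600>")
except ValueError as err:
    print("rejected:", err)
```

Running a check like this on the string before assigning it to `os.environ` would surface the problem before any CUDA allocation is attempted.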

# Assessment 4

Create a token generation pipeline for the following model: https://huggingface.co/google/flan-ul2

For your pipeline, write out code to load as much of the model into the GPU as possible, then load the remainder into CPU Ram.

When we test your code, we'll do so on 3 different instance types of varying sizes so don't just build for your colab instance. Assume there will always be at least some GPU space and enough CPU ram for the rest of the model.

You can use whatever packages or external code you'd like to accomplish the task.

Please remember to test the inference using the pipeline and print the output as part of the notebook.

In [None]:
# define checkpoint
checkpoint = "google/flan-ul2"

# set up the quantization config
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True,
                                         llm_int8_threshold=6.0)

# load model 8bit
model = T5ForConditionalGeneration.from_pretrained(checkpoint, 
                                                   device_map="auto", 
                                                   offload_folder="offload", 
                                                   offload_state_dict=True,
                                                   load_in_8bit=True,
                                                   quantization_config=quantization_config) 

# get tokenizer                                                                
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# test the model for inference
input_string = """Answer the following question by reasoning step by step. 
                  The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, 
                  how many apple do they have?"""                                             
inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(inputs, max_length=200)
print(tokenizer.decode(outputs[0]))

Downloading (…)lve/main/config.json:   0%|          | 0.00/784 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json: 0.00B [00:00, ?B/s]

Downloading shards:   0%|          | 0/8 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00008.bin:   0%|          | 0.00/4.69G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00008.bin:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Downloading (…)l-00003-of-00008.bin:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Downloading (…)l-00004-of-00008.bin:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

Downloading (…)l-00005-of-00008.bin:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Downloading (…)l-00006-of-00008.bin:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

Downloading (…)l-00007-of-00008.bin:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Downloading (…)l-00008-of-00008.bin:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

I first tried the same pipeline, which resulted in an error message suggesting the use of the T5ForConditionalGeneration class from the transformers package, together with BitsAndBytesConfig, to load the model in 8-bit format.

However, the disk and the GPU RAM are saturated while loading the checkpoint's sharded files, causing the session to crash.
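
A rough size estimate shows why this happens. The sketch below adds up the shard sizes from the download log and estimates the weight footprint at different precisions. The parameter count of about 20 billion is an assumption on my part, and `weight_footprint_gb` is a hypothetical helper that counts the weights only (no activations or loading overhead).

```python
def weight_footprint_gb(n_params, bits_per_param):
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Shard sizes in GB, copied from the download log above
shard_sizes_gb = [4.69, 4.97, 4.97, 4.96, 5.00, 4.93, 5.00, 4.93]
checkpoint_gb = sum(shard_sizes_gb)
print(f"checkpoint on disk: {checkpoint_gb:.2f} GB")

n_params = 20e9  # assumption: roughly 20 billion parameters
for bits in (16, 8):
    print(f"{bits}-bit weights: ~{weight_footprint_gb(n_params, bits):.0f} GB")
```

Under these assumptions the download alone needs about 39 GB of disk, and even at 8 bits the weights come to roughly 20 GB, which is more than the 15360 MiB reported by `nvidia-smi` for the T4. Part of the model therefore has to live in CPU RAM or in the offload folder, and the free disk space could be checked beforehand with `shutil.disk_usage`.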

# Assessment 5

Create a token generation pipeline for the following model: https://huggingface.co/cerebras/Cerebras-GPT-13B

For your pipeline, write out code to load as much of the model into the GPU as possible, then load the remainder into CPU Ram.

When we test your code, we'll do so on 3 different instance types of varying sizes so don't just build for your colab instance. Assume there will always be at least some GPU space and enough CPU ram for the rest of the model.

You can use whatever packages or external code you'd like to accomplish the task.

Please remember to test the inference using the pipeline and print the output as part of the notebook.

In [None]:
# define model checkpoint (the 1.3B variant is used here instead of 13B; see the notes after the output)
checkpoint = "cerebras/Cerebras-GPT-1.3B"

# load empty model
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# set up the quantization config
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=False,
                                         llm_int8_threshold=6.0)

# load model from sharded files
model = AutoModelForCausalLM.from_pretrained(checkpoint, 
                                             device_map='auto', 
                                             offload_folder="offload", 
                                             offload_state_dict=True,
                                             load_in_8bit=True,
                                             quantization_config=quantization_config) 

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# generate inputs
prompt = "Generative AI is "
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

# get outputs
outputs = model.generate(**inputs, num_beams=5, 
                        max_new_tokens=50, early_stopping=True,
                        no_repeat_ngram_size=2)

# decode outputs
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# print outputs
print(decoded_outputs)


Downloading (…)lve/main/config.json:   0%|          | 0.00/360 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/5.36G [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json: 0.00B [00:00, ?B/s]

Downloading (…)olve/main/merges.txt: 0.00B [00:00, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Generative AI is  \nthe next step in the evolution of AI.\n\n~~~\nlucb1e\nI\'m not sure what you mean by "next step" in this context, but I think you\'re\ntalking about the next generation of']


With the Cerebras-GPT-13B model the disk is also saturated, so I tried the Cerebras-GPT-1.3B model instead, and it works.

Output printed:
'Generative AI is  \nthe next step in the evolution of AI.\n\n~~~\nlucb1e\nI\'m not sure what you mean by "next step" in this context, but I think you\'re\ntalking about the next generation of'

