## Model Sharding
Model sharding is the technique of partitioning a model into smaller sections, allowing each section to be processed separately across different devices or nodes.

In [2]:
import os
os.chdir(os.getcwd()+"/practice_llm")

In [3]:
import torch
print(torch.__version__)  # installed PyTorch version
print(torch.cuda.is_available())  # True when a CUDA GPU is visible

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoModelForCausalLM, AutoTokenizer

2.5.1+cu118
True


In [4]:
from accelerate import Accelerator
accelerator = Accelerator()

In [5]:
shard_output_path = os.getcwd() + "/model_sharded"

In [6]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(model_id)

In [7]:
accelerator.save_model(model=model,save_directory=shard_output_path,max_shard_size='250MB')

You should now find the shards in the 'model_sharded' folder, each written with a target maximum size of 250 MB.
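To make the effect of `max_shard_size` more concrete, here is a minimal sketch of a greedy packing scheme. It is an illustration only: `parse_size`, `plan_shards`, the tensor names, and the decimal interpretation of "MB" are assumptions made for this example, not Accelerate's actual implementation.

```python
def parse_size(size: str) -> int:
    """Convert a string such as '250MB' to bytes (decimal units assumed)."""
    units = {"KB": 10**3, "MB": 10**6, "GB": 10**9}
    for suffix, factor in units.items():
        if size.upper().endswith(suffix):
            return int(float(size[: -len(suffix)]) * factor)
    return int(size)


def plan_shards(tensor_sizes: dict, max_shard_size: str) -> list:
    """Pack tensors into shards in order, opening a new shard when the limit would be exceeded."""
    limit = parse_size(max_shard_size)
    shards, current, current_bytes = [], [], 0
    for name, nbytes in tensor_sizes.items():
        # Close the current shard if adding this tensor would push it over the limit.
        if current and current_bytes + nbytes > limit:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        shards.append(current)
    return shards


# Hypothetical tensor sizes in bytes, chosen only to exercise the logic.
example_sizes = {"a": 200_000_000, "b": 100_000_000, "c": 100_000_000, "d": 300_000_000}
example_plan = plan_shards(example_sizes, "250MB")
print(example_plan)
```

In this sketch a tensor larger than the limit (`d`) still gets a shard of its own, so the limit acts as a target rather than a hard guarantee for every file.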

### Clear the memory by deleting the model

In [8]:
del model

## Load the models on both GPU and CPU

Here, I will initialize an empty model, without allocating memory for its weights. This step should not increase your CPU or GPU memory usage.

In [9]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

In [10]:
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(model_id) 



In [11]:
weights_location = os.getcwd() + "/model_sharded/"


In [12]:
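# LlamaDecoderLayer is the repeated block of this model, so it is the unit that should not be split across devices.
model = load_checkpoint_and_dispatch(
    model, checkpoint=weights_location, device_map="auto", no_split_module_classes=["LlamaDecoderLayer"]
)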

Some parameters are on the meta device because they were offloaded to the cpu.                           


In [13]:
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): 

**The model is now loaded into the GPU's VRAM. For this demonstration I’m using a small model, TinyLlama. We'll keep loading additional copies of the model until we reach the VRAM limit and observe what happens.**
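To get a feel for how quickly the VRAM fills up, the parameter count can be estimated from the layer shapes printed above. This is a rough sketch under two assumptions that the printout does not confirm: the weights are stored in 32-bit floats (4 bytes per parameter), and the output head is a separate 2048 x 32000 matrix.

```python
# Shapes taken from the printed architecture above.
hidden, vocab, n_layers, kv_dim, mlp_dim = 2048, 32000, 22, 256, 5632

attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj and o_proj, k_proj and v_proj
mlp = 3 * hidden * mlp_dim                             # gate_proj, up_proj, down_proj
norms = 2 * hidden                                     # input and post-attention RMSNorm weights
per_layer = attention + mlp + norms

embed = vocab * hidden
lm_head = vocab * hidden  # assumption: untied output head, not visible in the truncated printout
final_norm = hidden

total_params = n_layers * per_layer + embed + lm_head + final_norm
bytes_fp32 = total_params * 4  # assumption: float32 weights
gib = bytes_fp32 / 2**30

print(f"{total_params:,} parameters, about {gib:.2f} GiB in float32")
```

Under these assumptions each copy of the model needs a little over 4 GiB for its weights alone, which is why only a few copies fit on a single consumer GPU.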


In [14]:
with init_empty_weights():
    model_1 = AutoModelForCausalLM.from_pretrained(model_id)
model_1 = load_checkpoint_and_dispatch(
    model_1, checkpoint=weights_location, device_map="auto", no_split_module_classes=["LlamaDecoderLayer"]
)

with init_empty_weights():
    model_2 = AutoModelForCausalLM.from_pretrained(model_id)
model_2 = load_checkpoint_and_dispatch(
    model_2, checkpoint=weights_location, device_map="auto", no_split_module_classes=["LlamaDecoderLayer"]
)

Some parameters are on the meta device because they were offloaded to the cpu.                           


In [15]:
with init_empty_weights():
    model_3 = AutoModelForCausalLM.from_pretrained(model_id)
model_3 = load_checkpoint_and_dispatch(
    model_3, checkpoint=weights_location, device_map="auto", no_split_module_classes=["LlamaDecoderLayer"]
)

In [16]:
with init_empty_weights():
    model_4 = AutoModelForCausalLM.from_pretrained(model_id)
model_4 = load_checkpoint_and_dispatch(
    model_4, checkpoint=weights_location, device_map="auto", no_split_module_classes=["LlamaDecoderLayer"]
)

In [17]:
print(model.device)
print(model_1.device) 
print(model_2.device) 
print(model_3.device) 
print(model_4.device) 

cuda:0
cuda:0
cpu
cpu
cpu


### Let's see how many layers have been offloaded to the CPU.

In [18]:
model_3.hf_device_map

{'': 'cpu'}

## Let's see the models in action

In [19]:
import time

In [20]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

In [33]:
prompts = ["Write me a small poem about Bangladesh:"]
inputs = tokenizer(prompts, return_tensors="pt", padding=False)

inputs = {key: value.to("cuda:0") for key, value in inputs.items()}


## Model 0 (GPU)

In [34]:
%%time
start = time.time()
output = model.generate(**inputs,max_new_tokens=50)
end = time.time()
print(end-start)
print(output)
print(tokenizer.batch_decode(output, skip_special_tokens=True))

30.869735717773438
tensor([[    1, 14350,   592,   263,  2319, 26576,  1048, 14320, 29880, 21754,
         29901,    13,    13, 29933,   574, 29880, 21754, 29892,   263,  2982,
           310, 11640,    13, 29909,  2982,   310,  4966,   322, 12561, 29879,
            13, 29909,  2982,   310,  5360,   322,  2562,    13, 29909,  2982,
           310, 10776,   322, 10311,  2592,    13,    13, 29909,  2982,   310,
         15409,   322, 17659,    13, 29909,  2982,   310,  8261,   267,   322,
         17173]], device='cuda:0')
['Write me a small poem about Bangladesh:\n\nBangladesh, a land of promise\nA land of hope and dreams\nA land of love and care\nA land of peace and harmony\n\nA land of beauty and grace\nA land of riches and wealth']
CPU times: user 28.5 s, sys: 1.14 s, total: 29.6 s
Wall time: 30.9 s


## Model 1 (GPU)

In [35]:
%%time
start = time.time()
output = model_1.generate(**inputs,max_new_tokens=50)
end = time.time()
print(end-start)
print(output)
print(tokenizer.batch_decode(output, skip_special_tokens=True))

63.77226996421814
tensor([[    1, 14350,   592,   263,  2319, 26576,  1048, 14320, 29880, 21754,
         29901,    13,    13, 29933,   574, 29880, 21754, 29892,   263,  2982,
           310, 11640,    13, 29909,  2982,   310,  4966,   322, 12561, 29879,
            13, 29909,  2982,   310,  5360,   322,  2562,    13, 29909,  2982,
           310, 10776,   322, 10311,  2592,    13,    13, 29909,  2982,   310,
         15409,   322, 17659,    13, 29909,  2982,   310,  8261,   267,   322,
         17173]], device='cuda:0')
['Write me a small poem about Bangladesh:\n\nBangladesh, a land of promise\nA land of hope and dreams\nA land of love and care\nA land of peace and harmony\n\nA land of beauty and grace\nA land of riches and wealth']
CPU times: user 57.8 s, sys: 2.68 s, total: 1min
Wall time: 1min 3s


## Model 2 (GPU)

In [39]:
%%time
start = time.time()
model_2 = model_2.to("cuda:0")
output=model_2.generate(**inputs,max_new_tokens=50)
end = time.time()
print(end-start)
print(output)
print(tokenizer.batch_decode(output, skip_special_tokens=True))

70.4834234714508
tensor([[    1, 14350,   592,   263,  2319, 26576,  1048, 14320, 29880, 21754,
         29901,    13,    13, 29933,   574, 29880, 21754, 29892,   263,  2982,
           310, 11640,    13, 29909,  2982,   310,  4966,   322, 12561, 29879,
            13, 29909,  2982,   310,  5360,   322,  2562,    13, 29909,  2982,
           310, 10776,   322, 10311,  2592,    13,    13, 29909,  2982,   310,
         15409,   322, 17659,    13, 29909,  2982,   310,  8261,   267,   322,
         17173]], device='cuda:0')
['Write me a small poem about Bangladesh:\n\nBangladesh, a land of promise\nA land of hope and dreams\nA land of love and care\nA land of peace and harmony\n\nA land of beauty and grace\nA land of riches and wealth']
CPU times: user 1min 7s, sys: 2.97 s, total: 1min 10s
Wall time: 1min 10s


## Model 3 (CPU, per its device map)

In [50]:
%%time
start = time.time()

inputs = {key: value.to("cpu") for key, value in inputs.items()}
output=model_3.generate(**inputs,max_new_tokens=50)
end = time.time()
print(end-start)
print(tokenizer.batch_decode(output, skip_special_tokens=True))

26.894044399261475
['Write me a small poem about Bangladesh:\n\nBangladesh, a land of promise\nA land of hope and dreams\nA land of love and care\nA land of peace and harmony\n\nA land of beauty and grace\nA land of riches and wealth']
CPU times: user 1min 44s, sys: 0 ns, total: 1min 44s
Wall time: 26.9 s


## Model 4 (CPU)

In [46]:

inputs = {key: value.to("cpu") for key, value in inputs.items()}


In [47]:
%%time
start = time.time()
output=model_4.generate(**inputs,max_new_tokens=50)
end = time.time()
print(end-start)
print(output)
print(tokenizer.batch_decode(output, skip_special_tokens=True))

28.74681282043457
tensor([[    1, 14350,   592,   263,  2319, 26576,  1048, 14320, 29880, 21754,
         29901,    13,    13, 29933,   574, 29880, 21754, 29892,   263,  2982,
           310, 11640,    13, 29909,  2982,   310,  4966,   322, 12561, 29879,
            13, 29909,  2982,   310,  5360,   322,  2562,    13, 29909,  2982,
           310, 10776,   322, 10311,  2592,    13,    13, 29909,  2982,   310,
         15409,   322, 17659,    13, 29909,  2982,   310,  8261,   267,   322,
         17173]])
['Write me a small poem about Bangladesh:\n\nBangladesh, a land of promise\nA land of hope and dreams\nA land of love and care\nA land of peace and harmony\n\nA land of beauty and grace\nA land of riches and wealth']
CPU times: user 1min 51s, sys: 0 ns, total: 1min 51s
Wall time: 28.8 s


This example suggests that automatic CPU offloading does not work well here: inference on the models that share the GPU is even slower than running the entire model on the CPU. We should control more precisely how many layers are offloaded to CPU memory.
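The timings are easier to compare as throughput. The sketch below uses the wall times printed in the cells above, rounded to two decimals, and assumes every run produced the full 50 new tokens (the printed output tensors contain 61 tokens for an 11-token prompt). Note that the `model_2` figure also includes the time spent moving the model to the GPU.

```python
# Wall times in seconds, copied from the cell outputs above and rounded.
wall_times = {
    "model": 30.87,
    "model_1": 63.77,
    "model_2": 70.48,
    "model_3": 26.89,
    "model_4": 28.75,
}
new_tokens = 50  # max_new_tokens used in every generate() call

# Tokens generated per second for each run.
throughput = {name: new_tokens / seconds for name, seconds in wall_times.items()}
fastest = max(throughput, key=throughput.get)

for name, tokens_per_second in throughput.items():
    print(f"{name}: {tokens_per_second:.2f} tokens/s")
print("fastest:", fastest)
```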

## Let's organize our layers better

In [51]:
del model
del model_1
del model_2
del model_3
del model_4

In [52]:
import gc
torch.cuda.empty_cache()
gc.collect()

27843

In [53]:
from accelerate import infer_auto_device_map

## Place the model by setting the maximum memory per device


In [54]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

In [55]:
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(model_id)

In [56]:
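# Cap GPU 0 at 4 GiB and CPU RAM at 10 GiB; modules that do not fit on the GPU are placed on the CPU.
device_map = infer_auto_device_map(model, max_memory={0: "4GiB", "cpu": "10GiB"})
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(model_id)
# The explicit device_map computed above is used as-is when dispatching the weights.
model = load_checkpoint_and_dispatch(
    model, checkpoint=weights_location, device_map=device_map, no_split_module_classes=["LlamaDecoderLayer"]
)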

Some parameters are on the meta device because they were offloaded to the cpu.                        
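The idea behind a memory-capped device map can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm Accelerate uses: `greedy_device_map`, the module names, and the sizes are all hypothetical, and real placement also has to account for details this sketch ignores.

```python
def greedy_device_map(module_sizes: dict, max_memory: dict) -> dict:
    """Assign modules to devices in order, moving to the next device once the current one is full."""
    devices = list(max_memory)
    remaining = dict(max_memory)
    index = 0
    placement = {}
    for name, size in module_sizes.items():
        # Skip ahead until a device with enough free budget is found; never go back to earlier devices.
        while index < len(devices) and size > remaining[devices[index]]:
            index += 1
        if index == len(devices):
            raise ValueError(f"no device has room left for {name}")
        device = devices[index]
        placement[name] = device
        remaining[device] -= size
    return placement


# Hypothetical sizes and budgets in arbitrary units.
toy_modules = {"embed": 2, "layer0": 3, "layer1": 3, "head": 2}
toy_map = greedy_device_map(toy_modules, {0: 6, "cpu": 10})
print(toy_map)
```

Lowering the budget of device `0` pushes more modules to the CPU, which is the knob that `max_memory` exposes in the cell above.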


In [57]:
model.hf_device_map

{'model.embed_tokens': 0,
 'model.layers.0': 0,
 'model.layers.1': 0,
 'model.layers.2': 0,
 'model.layers.3': 0,
 'model.layers.4': 0,
 'model.layers.5': 0,
 'model.layers.6': 0,
 'model.layers.7': 0,
 'model.layers.8': 0,
 'model.layers.9': 0,
 'model.layers.10': 0,
 'model.layers.11': 0,
 'model.layers.12': 0,
 'model.layers.13': 0,
 'model.layers.14': 0,
 'model.layers.15': 0,
 'model.layers.16': 0,
 'model.layers.17': 0,
 'model.layers.18': 0,
 'model.layers.19': 0,
 'model.layers.20': 0,
 'model.layers.21.self_attn': 0,
 'model.layers.21.input_layernorm': 'cpu',
 'model.layers.21.post_attention_layernorm': 'cpu',
 'model.norm': 'cpu',
 'model.rotary_emb': 'cpu',
 'lm_head': 'cpu',
 'model.layers.21.mlp': 'cpu'}
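A long device map is easier to read once it is summarized. The helpers below are a small sketch with hypothetical names (`summarize_device_map`, `split_layers`); the example dictionary is an excerpt of the map printed above, not the full output.

```python
import re
from collections import Counter


def summarize_device_map(device_map: dict) -> dict:
    """Count how many map entries live on each device."""
    return dict(Counter(str(device) for device in device_map.values()))


def split_layers(device_map: dict) -> list:
    """Return decoder layers whose submodules ended up on more than one device."""
    devices_per_layer = {}
    for name, device in device_map.items():
        match = re.match(r"(model\.layers\.\d+)", name)
        if match:
            devices_per_layer.setdefault(match.group(1), set()).add(str(device))
    return sorted(layer for layer, devices in devices_per_layer.items() if len(devices) > 1)


# Excerpt of the device map shown above.
excerpt = {
    "model.embed_tokens": 0,
    "model.layers.20": 0,
    "model.layers.21.self_attn": 0,
    "model.layers.21.input_layernorm": "cpu",
    "model.layers.21.post_attention_layernorm": "cpu",
    "model.norm": "cpu",
    "model.rotary_emb": "cpu",
    "lm_head": "cpu",
    "model.layers.21.mlp": "cpu",
}
print(summarize_device_map(excerpt))
print(split_layers(excerpt))
```

On the notebook's own map, `split_layers(model.hf_device_map)` would point at the last decoder layer, whose attention block sits on the GPU while its MLP and norms sit on the CPU.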

In [58]:
prompts = ["Write me a small poem about Naples:"]
inputs = tokenizer(prompts, return_tensors="pt", padding=False)
inputs.to("cuda:0")

{'input_ids': tensor([[    1, 14350,   592,   263,  2319, 26576,  1048,  8344,   793, 29901]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [59]:
%%time
start = time.time()
output = model.generate(**inputs,max_new_tokens=50)
end = time.time()
print(end-start)
print(tokenizer.batch_decode(output, skip_special_tokens=True))

27.723263263702393
['Write me a small poem about Naples:\n\nNaples, the city of the sun,\nWhere the sea breeze carries the scent of the past,\nWhere the streets are lined with ancient buildings,\nAnd the sky is painted with the colors of the']
CPU times: user 26.2 s, sys: 831 ms, total: 27.1 s
Wall time: 27.7 s


With this placement, the mixed GPU and CPU setup runs about as fast as the CPU-only and GPU runs above (27.7 s here, against roughly 27 to 31 s). The time required to transfer parameters and intermediate states between devices negates the performance gains from using a GPU, especially for a small model like TinyLlama, which is not deep enough to fully leverage the GPU's processing power.

For larger models, however, running on a GPU could be more advantageous. This is particularly true when GPU memory limitations prevent a model from running entirely on the GPU. In such cases, the overhead of moving data to and from the GPU is often justified by the significant boost in processing speed, making offloading a preferable choice for more complex models.