# L4-D - Building your own Quantizer: Load your Quantized Weights from Hugging Face Hub

May 24, 3:43am

Run the next cell to import the functions you used in the previous lesson(s) of `Building your own Quantizer`, so that you can follow along with the video.

- To access the `helper.py` file, you can click `File --> Open...`, on the top left.

In [1]:
pip install accelerate



In [1]:
import torch

from helper_L4_building_quantizer_load_from_hugging_face_hub import W8A16LinearLayer, replace_linear_with_target_and_quantize, replace_linear_with_target

## Memory Efficient Model Loading

- Load [EleutherAI/gpt-neo-125m](https://huggingface.co/EleutherAI/gpt-neo-125m)

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neo-125m"

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [3]:
replace_linear_with_target_and_quantize(model,
                                        W8A16LinearLayer,
                                        ["lm_head"])

In [4]:
model

GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPTNeoBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): W8A16LinearLayer()
            (v_proj): W8A16LinearLayer()
            (q_proj): W8A16LinearLayer()
            (out_proj): W8A16LinearLayer()
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): W8A16LinearLayer()
          (c_proj): W8A16LinearLayer()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_af
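
- The printout above shows every linear layer except `lm_head` replaced by `W8A16LinearLayer`. The helper's implementation is not shown in this notebook, so the following is only a minimal sketch of what 8-bit weight, 16-bit activation quantization commonly involves, assuming symmetric per-row absmax scaling. The names `quantize_rows_int8` and `w8a16_forward` are hypothetical, and float32 is used here only to keep the toy example simple on CPU.

```python
import torch
import torch.nn.functional as F

def quantize_rows_int8(weight: torch.Tensor):
    """Symmetric per-row absmax quantization to int8 (illustrative only)."""
    w_fp32 = weight.detach().to(torch.float32)
    # One scale per output row, chosen so the largest |value| maps to 127.
    scales = w_fp32.abs().max(dim=-1).values / 127
    scales = scales.clamp(min=1e-8)  # guard against all-zero rows
    int8_weight = torch.round(w_fp32 / scales.unsqueeze(1)).to(torch.int8)
    return int8_weight, scales

def w8a16_forward(int8_weight, scales, x, bias=None):
    """Cast int8 weights to the activation dtype, matmul, then rescale."""
    out = F.linear(x, int8_weight.to(x.dtype)) * scales.to(x.dtype)
    return out if bias is None else out + bias

torch.manual_seed(0)
weight = torch.randn(4, 8)   # toy weight matrix (out_features=4, in_features=8)
x = torch.randn(2, 8)        # toy activations

int8_weight, scales = quantize_rows_int8(weight)
approx = w8a16_forward(int8_weight, scales, x)
exact = F.linear(x, weight)
max_abs_error = (approx - exact).abs().max().item()
print(int8_weight.dtype, scales.shape, max_abs_error)
```

- Only the int8 weights and one scale per output row need to be stored, which is why the saved state dict is much smaller than the original checkpoint.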

In [5]:
quantized_state_dict = model.state_dict()
torch.save(quantized_state_dict, "quantized_state_dict.pth")

- The code below is for demonstration purposes only.
- You'll need your own Hugging Face username in order to run it.
- Add your username in `YOUR_HF_USERNAME = ""`.

```Python
from huggingface_hub import HfApi, create_repo

YOUR_HF_USERNAME = ""
your_repo_id = f"{YOUR_HF_USERNAME}/opt-125m-quantized-dlai"

api = HfApi()

# create_repo(your_repo_id)

api.upload_file(
    path_or_fileobj="quantized_state_dict.pth",
    path_in_repo="quantized_state_dict.pth",
    repo_id=your_repo_id
)
```

In [6]:
from huggingface_hub import HfApi, create_repo

YOUR_HF_USERNAME = "Laksh99"
your_repo_id = f"{YOUR_HF_USERNAME}/gpt-neo-125m"

api = HfApi()

create_repo(your_repo_id)

api.upload_file(
    path_or_fileobj="quantized_state_dict.pth",
    path_in_repo="quantized_state_dict.pth",
    repo_id=your_repo_id
)

quantized_state_dict.pth:   0%|          | 0.00/166M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Laksh99/gpt-neo-125m/commit/9fa7a6307888257665068187b5fd97f4bf49b71a', commit_message='Upload quantized_state_dict.pth with huggingface_hub', commit_description='', oid='9fa7a6307888257665068187b5fd97f4bf49b71a', pr_url=None, pr_revision=None, pr_num=None)
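
- The progress bar above reports an upload of about 166M. As a rough sanity check, the arithmetic below estimates the file size from the shapes in the model printout. It is a back-of-the-envelope sketch under stated assumptions: embeddings stay in bfloat16 (2 bytes each), every replaced linear weight is stored as int8 (1 byte each), `lm_head` shares its storage with `wte`, and small tensors (biases, scales, LayerNorm parameters) are ignored.

```python
# Shapes taken from the model printout: 12 blocks, hidden size 768,
# MLP width 3072, vocabulary of 50257 tokens, 2048 positions.
hidden, mlp, vocab, positions, blocks = 768, 3072, 50257, 2048, 12

# Token and position embeddings, assumed to remain in bfloat16 (2 bytes).
embedding_bytes = (vocab * hidden + positions * hidden) * 2

# Per block: q, k, v, out projections plus c_fc and c_proj, assumed int8.
attention_bytes = 4 * hidden * hidden
mlp_bytes = 2 * hidden * mlp
linear_bytes = blocks * (attention_bytes + mlp_bytes)

total_bytes = embedding_bytes + linear_bytes
print(f"{total_bytes / 1e6:.1f} MB")  # 165.3 MB
```

- The estimate lands close to the reported size, which is consistent with the linear weights having been stored in 8 bits.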

### Load the Model in the Meta Device

In [7]:
from transformers import GPTNeoForCausalLM, AutoTokenizer, AutoConfig

model_id = "EleutherAI/gpt-neo-125m"
config = AutoConfig.from_pretrained(model_id)

with torch.device("meta"):
    model = GPTNeoForCausalLM(config)

tokenizer = AutoTokenizer.from_pretrained(model_id)



In [8]:
for param in model.parameters():
    print(param)

Parameter containing:
tensor(..., device='meta', size=(50257, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(2048, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(3072, 768), requ
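
- The `tensor(..., device='meta', ...)` entries above carry a shape and dtype but no data. A minimal sketch of the same behavior on a single toy layer (the layer sizes are arbitrary and chosen only for illustration):

```python
import torch
from torch import nn

# Building a module under the meta device records shapes and dtypes only;
# no memory is allocated for the weights.
with torch.device("meta"):
    layer = nn.Linear(768, 3072)

print(layer.weight.device)   # meta
print(layer.weight.shape)    # torch.Size([3072, 768])
print(layer.weight.is_meta)  # True

# Shape-only computation still works: the result is another meta tensor.
out = layer(torch.empty(2, 768, device="meta"))
print(out.shape)             # torch.Size([2, 3072])
```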

In [9]:
model

GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPTNeoBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=False)
            (q_proj): Linear(in_features=768, out_features=768, bias=False)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear(in_fe

In [10]:
replace_linear_with_target(model, W8A16LinearLayer, ["lm_head"])

In [11]:
model

GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPTNeoBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): W8A16LinearLayer()
            (v_proj): W8A16LinearLayer()
            (q_proj): W8A16LinearLayer()
            (out_proj): W8A16LinearLayer()
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): W8A16LinearLayer()
          (c_proj): W8A16LinearLayer()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_af

In [12]:
from huggingface_hub import hf_hub_download

state_dict_cache_path = hf_hub_download(
    "Laksh99/gpt-neo-125m",
    "quantized_state_dict.pth"
)

In [13]:
state_dict = torch.load(state_dict_cache_path)

In [14]:
model.load_state_dict(state_dict, strict=True, assign=True)

<All keys matched successfully>

- Test your model.
- **Note:** Your generated text may differ from what you see in the video.

In [15]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [33]:
print(type(tokenizer))

<class 'transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast'>


In [16]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Hello today I am", max_new_tokens=40)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


RuntimeError: Tensor on device cpu is not on the expected device meta!

In [36]:
from transformers import pipeline
from accelerate import Accelerator

# Initialize the Accelerator
accelerator = Accelerator()

# Move the model to the appropriate device using accelerator
model = accelerator.prepare(model)

# Create the pipeline with the device set appropriately
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0 if accelerator.device.type == 'cuda' else -1)

# Generate text
result = pipe("Hello today I am", max_new_tokens=40)
print(result)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello today I am!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'}]


In [28]:
from transformers import pipeline

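# Materialize the model on CPU with to_empty.
# Caution: to_empty allocates uninitialized storage for every parameter and
# buffer, so the weights loaded with load_state_dict are not preserved.
cpu_model = model.to_empty(device='cpu')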

# Create the pipeline with CPU model and original tokenizer
pipe = pipeline("text-generation", model=cpu_model, tokenizer=tokenizer)

# Generate text
pipe("Hello today I am", max_new_tokens=40)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello today I am!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'}]
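
- The repeated `!` output is a sign that the model is no longer using the loaded weights. A likely explanation is that `to_empty` re-allocates all parameters and buffers without copying their values. A minimal sketch on a toy layer:

```python
import torch
from torch import nn

layer = nn.Linear(4, 3)
with torch.no_grad():
    layer.weight.fill_(1.0)        # stand-in for meaningful, loaded weights

old_ptr = layer.weight.data_ptr()
layer.to_empty(device="cpu")       # re-allocates every parameter and buffer

# The shape and device are kept, but the storage is new and uninitialized,
# so the previous values are not guaranteed to survive.
print(layer.weight.shape)
print(layer.weight.device)
print(layer.weight.data_ptr() != old_ptr)
```

- A less destructive route would be to materialize only the tensors that are still on the meta device and leave the loaded weights untouched.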

In [32]:
from transformers import pipeline
import random
import torch

# Set a specific random seed
random.seed(42)
torch.manual_seed(42)

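# Materialize the model on CPU with to_empty.
# Caution: to_empty allocates uninitialized storage for every parameter and
# buffer, so the weights loaded with load_state_dict are not preserved.
cpu_model = model.to_empty(device='cpu')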

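# Attempt to use higher precision for CPU computations.
# Note: these settings only affect tensors created afterwards; the model's
# existing weights keep their dtype. set_default_tensor_type is deprecated in
# recent PyTorch releases in favor of torch.set_default_dtype.
torch.set_flush_denormal(True)
torch.set_default_tensor_type(torch.DoubleTensor)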

# Create the pipeline with CPU model and original tokenizer
pipe = pipeline("text-generation", model=cpu_model, tokenizer=tokenizer)

# Generate text
pipe("Hello today I am", max_new_tokens=40)

  _C._set_default_tensor_type(t)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello today I am!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'}]
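
- The output is unchanged, which fits with the default-dtype setting applying only to tensors created after the call. A minimal sketch of that behavior, using `torch.set_default_dtype` as the non-deprecated way to change the default; `existing` is a stand-in for weights that were already loaded:

```python
import torch

existing = torch.ones(2, dtype=torch.bfloat16)   # stand-in for loaded weights

previous_default = torch.get_default_dtype()
torch.set_default_dtype(torch.float64)           # affects new tensors only
created_after = torch.ones(2)

print(existing.dtype)        # torch.bfloat16 (unchanged)
print(created_after.dtype)   # torch.float64

torch.set_default_dtype(previous_default)        # restore the global default
```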

In [35]:
from transformers import pipeline
import random
import torch

# Set a specific random seed
random.seed(42)
torch.manual_seed(42)

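# Materialize the model on CPU with to_empty.
# Caution: to_empty allocates uninitialized storage for every parameter and
# buffer, so the weights loaded with load_state_dict are not preserved.
cpu_model = model.to_empty(device='cpu')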

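# Attempt to use higher precision for CPU computations.
# Note: these settings only affect tensors created afterwards; the model's
# existing weights keep their dtype. set_default_tensor_type is deprecated in
# recent PyTorch releases in favor of torch.set_default_dtype.
torch.set_flush_denormal(True)
torch.set_default_tensor_type(torch.DoubleTensor)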

# Create the pipeline with CPU model and existing tokenizer
pipe = pipeline("text-generation", model=cpu_model, tokenizer=tokenizer)

# Generate text and print the output multiple times
for _ in range(3):
    generated_text = pipe("Hello today I am", max_new_tokens=240)[0]['generated_text']
    print(generated_text)
    print('-' * 50)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello today I am!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello today I am!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--------------------------------------------------
Hello today I am!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--------------------------------------------------


In [29]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Hello today I am giving a course about", max_new_tokens=10)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello today I am giving a course about!!!!!!!!!!'}]

In [31]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Once upon a time", max_new_tokens=50)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Once upon a time!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'}]