# Quantization

In [1]:
!pip install auto-round



In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from huggingface_hub import login, create_repo, upload_folder
import os

# Log in to Hugging Face. Read the token from the HF_TOKEN environment variable
# instead of hardcoding it; if the variable is unset, login() prompts for a token.
login(token=os.environ.get("HF_TOKEN"))

# Use the 7B model
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

# Configure AutoRound parameters
bits, group_size, sym = 4, 128, True
autoround = AutoRound(
    model, 
    tokenizer,
    nsamples=128,
    iters=256,
    gradient_accumulate_steps=2,
    batch_size=4,
    bits=bits,
    group_size=group_size,
    sym=sym
)

# Perform quantization
autoround.quantize()

# Set output directory
output_dir = "DeepSeek-R1-Distill-Qwen-7B-AutoRound-GPTQ-4bit"


# Save the quantized model
autoround.save_quantized(output_dir, format='auto_gptq', inplace=True)

# Create the repository if it doesn't exist
# create_repo(repo_id, exist_ok=True)




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

2025-01-23 15:14:14 INFO utils.py L577: Using GPU device
2025-01-23 15:14:14 INFO autoround.py L230: using torch.float16 for quantization tuning
2025-01-23 15:14:14 INFO autoround.py L300: start to cache block inputs
2025-01-23 15:14:20,577 INFO config.py L54: PyTorch version 2.5.1 available.
2025-01-23 15:14:29 INFO autoround.py L305: caching done


KeyboardInterrupt: 
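
To make the `bits`, `group_size`, and `sym` settings used above more concrete, here is a minimal sketch of symmetric group-wise quantization in plain Python. It uses simple round-to-nearest under one common convention for symmetric ranges; it is not AutoRound's tuned rounding, and the helper names (`quantize_group`, `quantize_groupwise`, `dequantize`) are illustrative assumptions, not part of the auto-round API.

```python
import random


def quantize_group(values, bits=4):
    """Quantize one group with a single shared scale (symmetric, round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed codes
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the signed integer range [-qmax - 1, qmax]
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_groupwise(weights, bits=4, group_size=128):
    """Split a flat weight list into groups and quantize each group separately."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(group, bits) for group in groups]


def dequantize(quantized):
    """Map integer codes back to approximate float weights."""
    return [code * scale for codes, scale in quantized for code in codes]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(256)]  # toy data, not real model weights
quantized = quantize_groupwise(weights, bits=4, group_size=128)
restored = dequantize(quantized)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(quantized), "groups, max abs error:", max_error)
```

Each group of 128 weights shares one scale, so the rounding error of any weight is bounded by half of its group's scale. Smaller groups lower that bound at the cost of storing more scales.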

In [4]:
from huggingface_hub import login, create_repo, upload_folder
login()
repo_id = "SuodhanJ6/DeepSeek-R1-Distill-Qwen-7B-AutoRound-GPTQ-4bit"
# Directory written by save_quantized() above; adjust the path if the notebook runs from another location
output_dir = "DeepSeek-R1-Distill-Qwen-7B-AutoRound-GPTQ-4bit"
# Create the repository if it doesn't exist yet
create_repo(repo_id, exist_ok=True)
# Upload the model to Hugging Face Hub
upload_folder(
    folder_path=output_dir,
    repo_id=repo_id,
    repo_type="model"
)


model-00002-of-00002.safetensors:   0%|          | 0.00/1.09G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/SuodhanJ6/DeepSeek-R1-Distill-Qwen-7B-AutoRound-GPTQ-4bit/commit/c772dcbf5d7cfaf09caaa6af62d3418a195b66f3', commit_message='Upload folder using huggingface_hub', commit_description='', oid='c772dcbf5d7cfaf09caaa6af62d3418a195b66f3', pr_url=None, repo_url=RepoUrl('https://huggingface.co/SuodhanJ6/DeepSeek-R1-Distill-Qwen-7B-AutoRound-GPTQ-4bit', endpoint='https://huggingface.co', repo_type='model', repo_id='SuodhanJ6/DeepSeek-R1-Distill-Qwen-7B-AutoRound-GPTQ-4bit'), pr_revision=None, pr_num=None)
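
As a rough guide to how much storage the quantized linear layers need, the sketch below estimates the effective bits per weight for a given bit width and group size. It assumes one 16-bit scale and one `bits`-wide zero point stored per group, which is a simplification of the GPTQ layout, and it ignores layers that stay unquantized (such as embeddings). The function names and the one-billion-weight figure are hypothetical, chosen only for illustration.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Bits per weight once per-group metadata is amortized (assumed storage layout)."""
    return bits + (scale_bits + zero_bits) / group_size


def estimate_size_gb(num_weights, **kwargs):
    """Approximate size in GB (10^9 bytes) of num_weights quantized weights."""
    return num_weights * effective_bits_per_weight(**kwargs) / 8 / 1e9


bpw = effective_bits_per_weight(bits=4, group_size=128)
print(bpw)  # 4.15625
print(estimate_size_gb(1_000_000_000))  # size of a hypothetical 1B quantized weights
```

With `group_size=128` the per-group metadata adds only about 0.16 bits per weight, which is why a larger group size barely changes the file size while a much smaller one would.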

# Inference with Hugging Face Transformers

In [5]:
!pip install --upgrade transformers
!pip install auto-gptq optimum  # needed to run GPTQ models with HF Transformers

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

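def test_model(model_name):
    """Load a model, ask it one fixed question, and print the decoded output."""
    # torch_dtype="auto" loads the weights in the dtype stored in the checkpoint;
    # the default float32 would double the memory needed by a 16-bit checkpoint
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto"
    ).to("cuda")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    prompt = "What is the maximum number of whole carrots this rabbit could consume in a 24-hour period?"
    messages = [
        {"role": "system", "content": "You are a very good thinker and reason very well."},
        {"role": "user", "content": prompt}
    ]
    # Render the conversation with the model's own chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    response_ids = model.generate(**inputs, max_new_tokens=4096)
    # The decoded text contains the prompt followed by the generated answer
    response = tokenizer.batch_decode(response_ids, skip_special_tokens=True)[0]
    print(response)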
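
The R1-distilled models write their reasoning after a `<think>` tag before giving the final answer. A small helper along the following lines can separate the two parts of a decoded response. This is a sketch: it assumes the reasoning is closed by a `</think>` tag, and `split_reasoning` and the sample string are hypothetical, not part of Transformers.

```python
def split_reasoning(response, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning, answer) from a decoded response.

    If the closing tag is missing (for example, generation was cut off),
    everything after the opening tag is treated as reasoning and the
    answer is empty.
    """
    reasoning, _, answer = response.partition(close_tag)
    # Drop the prompt and anything else that precedes the opening tag
    reasoning = reasoning.split(open_tag, 1)[-1]
    return reasoning.strip(), answer.strip()


sample = "<think>\nA carrot weighs about 100 g.\n</think>\nRoughly one carrot."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Separating the two parts makes it easier to compare final answers across the original, distilled, and quantized models without reading each full reasoning trace.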

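## Tests with Qwen2.5 models

* Original (`Qwen/Qwen2.5-3B-Instruct`)
* Distilled from R1 (`deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`)
* Distilled from R1 and quantized (`SuodhanJ6/DeepSeek-R1-Distill-Qwen-7B-AutoRound-GPTQ-4bit`)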

In [3]:
model_name = "Qwen/Qwen2.5-3B-Instruct"
test_model(model_name)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

system
You are a very good thinker and reason very well.
user
What is the maximum number of whole carrots this rabbit could consume in a 24-hour period?
assistant
To determine the maximum number of whole carrots a rabbit can consume in a 24-hour period, we need to consider the physical and biological limits of the rabbit's digestive system.

1. **Digestive System Limitations**:
   - A rabbit's stomach is relatively small compared to its body size.
   - Carrots contain fiber, which can cause digestive issues if not chewed properly or if too much is consumed at once.
   - Overeating carrots can lead to digestive problems such as bloat or diarrhea.

2. **Carrot Consumption**:
   - On average, a rabbit might eat about 10-15% of its body weight in food daily.
   - If a rabbit weighs 2 pounds (about 0.9 kg), it would theoretically consume around 0.2 to 0.3 pounds (about 85-140 grams) of food per day.

3. **Carrot Weight**:
   - A medium-sized carrot typically weighs about 100 grams (or 0.1 k

In [2]:
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
test_model(model_name)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


In [2]:
model_name = "SuodhanJ6/DeepSeek-R1-Distill-Qwen-7B-AutoRound-GPTQ-4bit"
test_model(model_name)

config.json:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

CUDA extension not installed.
`low_cpu_mem_usage` was None, now default to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/75.4k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.49G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.09G [00:00<?, ?B/s]

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

Found modules on cpu/disk. Using Exllama/Exllamav2 backend requires all the modules to be on GPU. Setting `disable_exllama=True`


tokenizer_config.json:   0%|          | 0.00/6.75k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/485 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are a very good thinker and reason very well.<｜User｜>What is the maximum number of whole carrots this rabbit could consume in a 24-hour period?<｜Assistant｜><think>
Alright, so I need to figure out the maximum number of whole carrots a rabbit could eat in 24 hours. Hmm, okay, let's start by thinking about how rabbits consume plants. I know rabbits are omnivores, so they eat both plants and vegetables, including carrots. 

First, I should consider the size of the carrots. Carrots vary in size, but let's assume they're average. Maybe about 6 inches long and 1 inch in diameter. That seems reasonable. Now, how much does a rabbit need to eat to stay healthy? I think adult rabbits need a certain number of calories per day. Let me check that. I remember that adult rabbits require about 2,400 to 3,000 calories daily, depending on the variety. Let's go with 2,400 calories for this calculation.

Next, I need to figure out how many calories a carrot provides. I'm not exactly sure, but I think 