> **Note:** Before we begin, please make sure to install the `torch` and `transformers` libraries, which are essential for working with language models in this notebook. Run the following commands:

```bash
# Install PyTorch (replace cu118 with your CUDA version, or use cpu-only if you don't have a GPU)
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Install Transformers
pip install transformers


In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time
import psutil
import os
torch.manual_seed(1230)

  from .autonotebook import tqdm as notebook_tqdm


<torch._C.Generator at 0x20b52a27a90>

In [2]:
def time_it(start,end):
    nano = end-start
    return nano/1e9

In [3]:
device = "cuda"
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
max_token = 200

## Model in Full Precision

In [4]:
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Print model size
print(f"Model size: {model.get_memory_footprint():,} bytes")

Model size: 4,400,196,480 bytes


In [5]:
def time_it(start, end):
    return (end - start) / 1e9 

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if torch.cuda.is_available():
    initial_gpu_memory = torch.cuda.memory_allocated(device) / (1024 ** 2) 
else:
    initial_gpu_memory = 0

process = psutil.Process(os.getpid())
initial_cpu_memory = process.memory_info().rss / (1024 ** 2) 

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(device)
start = time.time_ns()
outputs = model.generate(**inputs, max_new_tokens=max_token)
end = time.time_ns()
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text)

t = time_it(start, end)
print("Seconds:", t)

if torch.cuda.is_available():
    final_gpu_memory = torch.cuda.memory_allocated(device) / (1024 ** 2) 
    max_gpu_memory = torch.cuda.max_memory_allocated(device) / (1024 ** 2)  
else:
    final_gpu_memory = max_gpu_memory = 0

final_cpu_memory = process.memory_info().rss / (1024 ** 2) 

print(f"Initial GPU Memory Allocated: {initial_gpu_memory:.2f} MB")
print(f"Final GPU Memory Allocated: {final_gpu_memory:.2f} MB")
print(f"Peak GPU Memory Used: {max_gpu_memory:.2f} MB")
print(f"Initial CPU Memory Usage: {initial_cpu_memory:.2f} MB")
print(f"Final CPU Memory Usage: {final_cpu_memory:.2f} MB")
print(f"Memory increase in CPU: {final_cpu_memory - initial_cpu_memory:.2f} MB")

print("Token/s:", len(outputs[0]) / t)

Hello my name is John Smith and I am a software engineer. I have been working in the software industry for the past 5 years and have experience in developing web applications using various technologies such as Java, JavaScript, and HTML. I am proficient in using tools such as Git, JIRA, and Slack to manage projects and communicate with team members. I am also skilled in designing and implementing user-friendly interfaces using CSS and HTML. In my free time, I enjoy playing video games, reading books, and spending time with my family and friends. I am passionate about learning new technologies and staying up-to-date with the latest trends in the industry. I am looking forward to working with you and contributing to the development of the project. Thank you for considering my application. Best regards,

[Your Name]
Seconds: 144.3631253
Initial GPU Memory Allocated: 4196.36 MB
Final GPU Memory Allocated: 4204.49 MB
Peak GPU Memory Used: 4215.21 MB
Initial CPU Memory Usage: 2933.99 MB
Fina

In [6]:
del model

In [7]:
import bitsandbytes as bnb
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM

## INT4 and FP4 Quantization

In [8]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
print(f"Model size: {model_4bit.get_memory_footprint():,} bytes")

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Model size: 746,773,376 bytes


In [9]:
def time_it(start, end):
    return (end - start) / 1e9  

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if torch.cuda.is_available():
    initial_gpu_memory = torch.cuda.memory_allocated(device) / (1024 ** 2)
else:
    initial_gpu_memory = 0

process = psutil.Process(os.getpid())
initial_cpu_memory = process.memory_info().rss / (1024 ** 2)

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(device)
start = time.time_ns()
outputs = model_4bit.generate(**inputs, max_new_tokens=max_token)
end = time.time_ns()
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text)

t = time_it(start, end)
print("Seconds:", t)

if torch.cuda.is_available():
    final_gpu_memory = torch.cuda.memory_allocated(device) / (1024 ** 2) 
    max_gpu_memory = torch.cuda.max_memory_allocated(device) / (1024 ** 2)  
else:
    final_gpu_memory = max_gpu_memory = 0

final_cpu_memory = process.memory_info().rss / (1024 ** 2) 

print(f"Initial GPU Memory Allocated: {initial_gpu_memory:.2f} MB")
print(f"Final GPU Memory Allocated: {final_gpu_memory:.2f} MB")
print(f"Peak GPU Memory Used: {max_gpu_memory:.2f} MB")
print(f"Initial CPU Memory Usage: {initial_cpu_memory:.2f} MB")
print(f"Final CPU Memory Usage: {final_cpu_memory:.2f} MB")
print(f"Memory increase in CPU: {final_cpu_memory - initial_cpu_memory:.2f} MB")

print("Token/s:", len(outputs[0]) / t)

Hello my name is John and I am 25 years old. I am a student and I am studying in the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the University of London. I am a student of the
Seconds: 14.6878761
Initial GPU Memory Allocated: 786.01 MB
Final GPU Memory Allocated: 786.02 MB
Peak GPU Memory Used: 421

## INT4 and NF4 Quantization

In [10]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
print(f"Model size: {model_4bit.get_memory_footprint():,} bytes")

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Model size: 746,773,376 bytes


In [11]:
def time_it(start, end):
    return (end - start) / 1e9  

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if torch.cuda.is_available():
    initial_gpu_memory = torch.cuda.memory_allocated(device) / (1024 ** 2)
else:
    initial_gpu_memory = 0

process = psutil.Process(os.getpid())
initial_cpu_memory = process.memory_info().rss / (1024 ** 2)  

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(device)
start = time.time_ns()
outputs = model_4bit.generate(**inputs, max_new_tokens=max_token)
end = time.time_ns()
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text)

t = time_it(start, end)
print("Seconds:", t)

if torch.cuda.is_available():
    final_gpu_memory = torch.cuda.memory_allocated(device) / (1024 ** 2)  
    max_gpu_memory = torch.cuda.max_memory_allocated(device) / (1024 ** 2)  
else:
    final_gpu_memory = max_gpu_memory = 0

final_cpu_memory = process.memory_info().rss / (1024 ** 2) 


print(f"Initial GPU Memory Allocated: {initial_gpu_memory:.2f} MB")
print(f"Final GPU Memory Allocated: {final_gpu_memory:.2f} MB")
print(f"Peak GPU Memory Used: {max_gpu_memory:.2f} MB")
print(f"Initial CPU Memory Usage: {initial_cpu_memory:.2f} MB")
print(f"Final CPU Memory Usage: {final_cpu_memory:.2f} MB")
print(f"Memory increase in CPU: {final_cpu_memory - initial_cpu_memory:.2f} MB")

print("Token/s:", len(outputs[0]) / t)



Hello my name is John Smith and I am a student at the University of XYZ. I am currently enrolled in the Bachelor of Science in Computer Science program. I am currently taking 12 credits and have completed 10 credits. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project
Seconds: 14.6742134
Initial GPU Memory Allocated: 1

## 4-Bit Nested Quantization

In [12]:
double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
print(f"Model size: {model_4bit.get_memory_footprint():,} bytes")

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Model size: 746,773,376 bytes


In [13]:
def time_it(start, end):
    return (end - start) / 1e9 

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if torch.cuda.is_available():
    initial_gpu_memory = torch.cuda.memory_allocated(device) / (1024 ** 2) 
else:
    initial_gpu_memory = 0

process = psutil.Process(os.getpid())
initial_cpu_memory = process.memory_info().rss / (1024 ** 2) 

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(device)
start = time.time_ns()
outputs = model_4bit.generate(**inputs, max_new_tokens=max_token)
end = time.time_ns()
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text)

t = time_it(start, end)
print("Seconds:", t)

if torch.cuda.is_available():
    final_gpu_memory = torch.cuda.memory_allocated(device) / (1024 ** 2)  
    max_gpu_memory = torch.cuda.max_memory_allocated(device) / (1024 ** 2)  
else:
    final_gpu_memory = max_gpu_memory = 0

final_cpu_memory = process.memory_info().rss / (1024 ** 2)

print(f"Initial GPU Memory Allocated: {initial_gpu_memory:.2f} MB")
print(f"Final GPU Memory Allocated: {final_gpu_memory:.2f} MB")
print(f"Peak GPU Memory Used: {max_gpu_memory:.2f} MB")
print(f"Initial CPU Memory Usage: {initial_cpu_memory:.2f} MB")
print(f"Final CPU Memory Usage: {final_cpu_memory:.2f} MB")
print(f"Memory increase in CPU: {final_cpu_memory - initial_cpu_memory:.2f} MB")

print("Token/s:", len(outputs[0]) / t)

Hello my name is John Smith and I am a student at the University of XYZ. I am currently enrolled in the Bachelor of Science in Computer Science program. I am currently taking 12 credits and have completed 10 credits. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project
Seconds: 23.1847848
Initial GPU Memory Allocated: 1

In [14]:
del model_4bit

## Complete Set of Quantization Features

In [15]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
print(f"Model size: {model_4bit.get_memory_footprint():,} bytes")

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Model size: 746,773,376 bytes


In [16]:
def time_it(start, end):
    return (end - start) / 1e9 

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if torch.cuda.is_available():
    initial_gpu_memory = torch.cuda.memory_allocated(device) / (1024 ** 2) 
else:
    initial_gpu_memory = 0

process = psutil.Process(os.getpid())
initial_cpu_memory = process.memory_info().rss / (1024 ** 2)  

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(device)
start = time.time_ns()
outputs = model_4bit.generate(**inputs, max_new_tokens=max_token)
end = time.time_ns()
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text)

t = time_it(start, end)
print("Seconds:", t)

if torch.cuda.is_available():
    final_gpu_memory = torch.cuda.memory_allocated(device) / (1024 ** 2)  
    max_gpu_memory = torch.cuda.max_memory_allocated(device) / (1024 ** 2) 
else:
    final_gpu_memory = max_gpu_memory = 0

final_cpu_memory = process.memory_info().rss / (1024 ** 2)

print(f"Initial GPU Memory Allocated: {initial_gpu_memory:.2f} MB")
print(f"Final GPU Memory Allocated: {final_gpu_memory:.2f} MB")
print(f"Peak GPU Memory Used: {max_gpu_memory:.2f} MB")
print(f"Initial CPU Memory Usage: {initial_cpu_memory:.2f} MB")
print(f"Final CPU Memory Usage: {final_cpu_memory:.2f} MB")
print(f"Memory increase in CPU: {final_cpu_memory - initial_cpu_memory:.2f} MB")

print("Token/s:", len(outputs[0]) / t)

Hello my name is John Smith and I am a student at the University of XYZ. I am currently enrolled in the Bachelor of Science in Computer Science program. I am currently taking 12 credits and have completed 10 credits. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project for my final project. I am currently working on a project
Seconds: 17.0793422
Initial GPU Memory Allocated: 7

## Quantization Performance Summary
| Quant | GPU Memory (MB)| CPU Memory (MB)  | Inference (Tokens/s) |
| ------ | -------- | ------- | ------- |
| Full Precision | 4204.49| 2941.46 | 1.25 |
| 4 bit FP4 | 786.02 | 915.98 | 13.95 | 
| 4 bit Normal Float 4 | 1564.03 |829.85 | 13.97 |
| Nested 4 bit | 1565.15 | 1422.24 | 8.84 |
| All together | 786.52 | 1270.48 | 12.00 |