# Chapter 5: Low Memory Implementation Techniques

This chapter focuses on strategies for reducing the memory footprint of Large Language Models (LLMs), such as quantisation and pruning. These techniques enable the deployment of LLMs in resource-constrained environments without significantly sacrificing performance.

---

## Learning Objectives

By the end of this chapter, you will be able to:
- Understand the importance of low memory techniques for deploying LLMs.
- Implement dynamic quantisation to reduce model size.
- Explore pruning techniques to optimise model parameters.
- Apply these techniques in practical scenarios for efficient LLM deployment.

---

## 1. Introduction to Low Memory Techniques

Large Language Models often require significant memory and computational resources, which can make them impractical for certain applications. Low memory techniques, such as quantisation and pruning, address this challenge by shrinking the model, usually in exchange for a small loss in accuracy.
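To make the scale of the problem concrete, weight memory can be estimated as the parameter count multiplied by the bytes used per parameter. The sketch below is a rough back-of-the-envelope calculation: the helper name `weight_memory_gb` and the round figure of 7 billion parameters are illustrative assumptions, and the estimate ignores activations and other runtime overhead.

```python
# Rough weight-memory estimate: parameter count times bytes per parameter.
# The 7e9 parameter count is an illustrative round number, not a measurement.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 7_000_000_000  # hypothetical 7-billion-parameter model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB")
```

Under these assumptions, moving from `float32` to `int8` cuts weight storage from about 28 GB to about 7 GB.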

---

## 2. Quantisation: Reducing Model Precision

### What is Quantisation?

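Quantisation reduces the precision of a model’s weights (e.g., from 32-bit floating point to 8-bit integers). Going from 32-bit to 8-bit values cuts weight storage by roughly a factor of four, and it often speeds up computation on hardware with efficient integer kernels.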
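The idea can be illustrated on a handful of numbers. The following is a minimal sketch of symmetric 8-bit quantisation, one of several possible schemes and not necessarily the one PyTorch uses internally. The function names `quantize_symmetric_int8` and `dequantize` are hypothetical helpers written for this illustration.

```python
def quantize_symmetric_int8(values):
    """Map floats to integer codes in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, which would give a zero scale
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, scale = quantize_symmetric_int8(weights)
restored = dequantize(codes, scale)

# The round-trip error of each value is at most half a quantisation step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print(f"scale={scale:.6f}, max round-trip error={max_error:.6f}")
```

Each weight is stored as a one-byte code plus a shared scale, which is where the memory saving comes from. The price is the small rounding error printed at the end.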

---

### Code Example: Applying Dynamic Quantisation

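Dynamic quantisation is one of the simplest methods to reduce model size without retraining. It converts the weights of selected layers to 8-bit integers ahead of time and quantises activations on the fly during inference. In PyTorch, dynamically quantised models run on the CPU. Let's apply it to a pre-trained model like LLaMA.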


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# Load the LLaMA model
model_name = "meta-llama/Llama-2-7b-hf"  # Hugging Face format checkpoint (gated: requires approved access)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [None]:
# Check the original model size
original_size = sum(param.numel() for param in model.parameters()) * 4 / 1e6  # Float32 = 4 bytes
print(f"Original model size: {original_size:.2f} MB")

In [None]:
# Apply dynamic quantisation
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

In [None]:
# Check the quantised model size by serialising it: packed int8 weights do not appear in .parameters()
import io
buffer = io.BytesIO()
torch.save(quantized_model.state_dict(), buffer)
quantized_size = buffer.getbuffer().nbytes / 1e6
print(f"Quantised model size: {quantized_size:.2f} MB")

In [None]:
# Save the quantised weights with torch.save (save_pretrained may not handle packed quantised weights)
torch.save(quantized_model.state_dict(), "./quantized_llama_model.pt")
print("\nQuantised model saved.")

---

## Observations

After running the code above:
1. **Size Reduction**: Note the difference in model size before and after quantisation.
2. **Performance Trade-Offs**: Dynamic quantisation often causes only a small accuracy loss, but the effect depends on the model and the task. Compare generations from the original and quantised models on the same prompts to see whether there are any noticeable trade-offs.

---

## 3. Pruning: Removing Redundant Parameters

### What is Pruning?

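Pruning involves removing less significant weights or neurons from the model to reduce its complexity and size. A common criterion is magnitude: weights with the smallest absolute values are assumed to contribute least and are set to zero. Pruned models are often fine-tuned afterwards to recover any lost accuracy.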
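As a small illustration of magnitude-based pruning, the sketch below zeroes a chosen fraction of the smallest-magnitude values in a plain Python list. The helper `prune_by_magnitude` is a hypothetical name for this example. It follows the same idea as L1 unstructured pruning but is not the PyTorch implementation.

```python
def prune_by_magnitude(weights, amount):
    """Zero the `amount` fraction of weights with the smallest absolute value."""
    num_to_prune = round(amount * len(weights))
    # Indices sorted by |w|: the first `num_to_prune` are treated as least significant
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:num_to_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6, -0.02, 0.1]
pruned = prune_by_magnitude(weights, amount=0.3)
print(pruned)
# → [0.8, 0.0, 0.3, -0.9, 0.0, 0.4, -0.2, 0.6, 0.0, 0.1]
```

The three smallest-magnitude values (0.01, -0.02 and -0.05) are set to zero, while the list keeps its original length.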

---

### Code Example: Basic Weight Pruning

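Let's apply weight pruning to the LLaMA model. Note that unstructured pruning sets weights to zero but keeps the tensors dense, so memory savings only materialise with sparse storage formats or sparsity-aware kernels.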


import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

# Load the LLaMA model
model_name = "meta-llama/Llama-2-7b-hf"  # Hugging Face format checkpoint (gated: requires approved access)
model = AutoModelForCausalLM.from_pretrained(model_name)

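# Prune the 30% of weights with the smallest absolute value in all linear layers
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make pruning permanent so the model saves with standard parameter names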

# Check the sparsity of each linear layer after pruning
print("Model after pruning:")
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        sparsity = (module.weight == 0).float().mean().item()
        print(f"{name}: {sparsity:.1%} of weights are zero")

# Save the pruned model
model.save_pretrained("./pruned_llama_model")
print("\nPruned model saved.")


---

## Observations

1. **Weight Sparsity**: Observe the sparsity of weights after pruning. Many weights will be set to zero.
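2. **Impact on Model Performance**: Pruning reduces the number of non-zero weights, but the saved file stays the same size unless the weights are stored in a sparse format. Excessive pruning can also degrade performance. Experiment with different pruning percentages (e.g., 10%, 30%, 50%).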
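To see why zeroed weights do not automatically save memory, the sketch below compares dense storage with a deliberately simplified sparse layout. It assumes 4 bytes per value and one 4-byte index per non-zero value. Real sparse formats differ, so treat the numbers as illustrative. All function names are hypothetical helpers for this example.

```python
def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


def dense_bytes(weights, bytes_per_value=4):
    """Dense storage keeps every value, zero or not."""
    return len(weights) * bytes_per_value


def sparse_bytes(weights, bytes_per_value=4, bytes_per_index=4):
    """Simplified sparse storage: one value plus one index per non-zero weight."""
    nonzero = sum(1 for w in weights if w != 0.0)
    return nonzero * (bytes_per_value + bytes_per_index)


total = 1000
for percent in (10, 30, 50, 80):
    zeros = total * percent // 100
    weights = [0.0] * zeros + [1.0] * (total - zeros)  # toy tensor with a known zero count
    print(
        f"sparsity={sparsity(weights):.0%} "
        f"dense={dense_bytes(weights)} B sparse={sparse_bytes(weights)} B"
    )
```

Under these assumptions the dense size is constant, and the sparse layout only becomes smaller than dense storage once more than half of the weights are zero.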

---

## 4. Exercise: Combining Quantisation and Pruning

In this exercise, you will combine quantisation and pruning to further optimise the LLaMA model.

### Instructions

1. **Apply Pruning**: Start by pruning 20% of the weights in the linear layers.
2. **Quantise the Pruned Model**: Apply dynamic quantisation to the pruned model.
3. **Compare Model Sizes**: Measure the size of the model before and after applying both techniques.


In [None]:
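# Reload a fresh model so that the 20% applies to unpruned weights, then apply pruning
model = AutoModelForCausalLM.from_pretrained(model_name)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.2)
        prune.remove(module, "weight")  # make pruning permanent before quantising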


In [None]:
# Apply quantisation to the pruned model
pruned_quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

In [None]:
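# Check the final model size by serialising it: packed int8 weights do not appear in .parameters()
import io
buffer = io.BytesIO()
torch.save(pruned_quantized_model.state_dict(), buffer)
final_size = buffer.getbuffer().nbytes / 1e6
print(f"Final model size after pruning and quantisation: {final_size:.2f} MB")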

In [None]:
# Save the final weights with torch.save (save_pretrained may not handle packed quantised weights)
torch.save(pruned_quantized_model.state_dict(), "./pruned_quantized_llama_model.pt")
print("\nPruned and quantised model saved.")

---

## 5. Key Takeaways

- **Quantisation** reduces model size and computation costs by lowering weight precision.
- **Pruning** eliminates redundant parameters, making the model more efficient.
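- **Combining Techniques**: Combining quantisation and pruning can increase memory savings further, provided the pruned weights are stored in a way that exploits sparsity. Measure the accuracy impact on your target task to confirm that performance remains acceptable.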

---

## 6. Summary

In this chapter, we explored techniques for reducing the memory footprint of LLMs:

1. **Quantisation**: Reduced model precision to save memory.
2. **Pruning**: Removed redundant parameters for efficiency.
3. **Combined Optimisation**: Combined quantisation and pruning to achieve significant memory savings.

These techniques enable the deployment of LLMs in resource-constrained environments, making them accessible to a wider range of applications.
