# Energy Efficient AI and Model Compression

### 🎯 Objective
In this notebook, we explore how to make deep learning models more energy-efficient and computationally lightweight. You’ll learn about **model compression** techniques like pruning, quantization, and knowledge distillation, which help reduce computational cost and energy consumption without sacrificing too much accuracy.

## Why Energy Efficiency Matters

Modern AI models such as GPTs, BERT, and ResNet require massive computational power. As these models scale up, they consume large amounts of **energy**, increasing **carbon footprint** and **deployment costs**.

Thus, **energy-efficient AI** focuses on optimizing models for:
- Faster inference on edge devices (e.g., mobile phones, IoT)
- Reduced energy consumption
- Smaller memory footprint
- Sustainable computing


## 🧠 Model Compression Overview

**Model compression** techniques reduce model size and complexity while maintaining performance. Main strategies include:

1. **Pruning**: Removing unnecessary weights or neurons.
2. **Quantization**: Reducing precision of weights (e.g., from float32 to int8).
3. **Knowledge Distillation**: Training a smaller model (student) to mimic a larger one (teacher).
4. **Low-Rank Factorization**: Decomposing matrices to reduce redundant parameters.
5. **Neural Architecture Search (NAS)**: Automatically discovering efficient model designs.


## ⚙️ 1. Model Pruning

**Idea:** Remove weights that have little impact on the model’s output.

There are two types of pruning:
- **Unstructured pruning:** Removes individual weights.
- **Structured pruning:** Removes entire neurons, filters, or channels.

### Example (PyTorch):


```python
import torch
import torch.nn.utils.prune as prune
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Apply pruning to 40% of weights in the first layer
prune.l1_unstructured(model[0], name="weight", amount=0.4)

print("Sparsity after pruning:", float(torch.sum(model[0].weight == 0)) / model[0].weight.nelement())
```

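💡 **Result:** Pruning introduces sparsity by zeroing weights, which reduces the number of active parameters. Unstructured sparsity lowers computation time only when the runtime or hardware can exploit sparse tensors; structured pruning removes whole rows or channels, so the savings carry over to ordinary dense kernels once the layer is rebuilt at the smaller size.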

After pruning, retrain the model briefly to recover accuracy (**fine-tuning**).
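The structured variant described above can be sketched in the same way. This is a minimal illustration, assuming a standalone `nn.Linear` layer (the name `layer` and the 50% amount are chosen here for the example, not taken from the notebook). It uses `prune.ln_structured` to drop whole output neurons by L2 norm and `prune.remove` to make the result permanent.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(10, 20)  # weight shape: (20 output neurons, 10 inputs)

# Zero the 50% of output neurons (rows, dim=0) with the smallest L2 norm (n=2)
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)

# Fold the mask into the weight tensor and drop the pruning reparametrization
prune.remove(layer, "weight")

# Count rows whose weights are all zero
zero_rows = int((layer.weight.abs().sum(dim=1) == 0).sum())
print("Fully pruned output neurons:", zero_rows, "of", layer.out_features)
```

Note that the pruned rows are still stored as zeros. To realize a speedup with dense kernels, the layer would have to be rebuilt with fewer output features, which this sketch does not do.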

## ⚖️ 2. Quantization

**Idea:** Represent weights with lower bit precision to reduce memory and computation.

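Common conversions and their effects:
- FP32 → INT8 or FP16
- INT8 cuts weight storage by about 75% (4 bytes down to 1 byte per weight); FP16 cuts it by about 50%
- Can accelerate inference on CPUs and edge devices that provide low-precision kernels.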

### Example (Post-training quantization with PyTorch):


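```python
import io
import torch.quantization

# Make the pruning from the previous cell permanent so the model can be deep-copied
if prune.is_pruned(model[0]):
    prune.remove(model[0], "weight")
model_fp32 = model.eval()
model_int8 = torch.quantization.quantize_dynamic(model_fp32, {nn.Linear}, dtype=torch.qint8)

def size_in_bytes(m):  # parameters() hides packed int8 weights, so measure the saved state_dict
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes

print("Original model size (bytes):", size_in_bytes(model_fp32))
print("Quantized model size (bytes):", size_in_bytes(model_int8))
```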

💡 **Observation:** Quantization helps run models efficiently on hardware with limited resources (like smartphones or Raspberry Pi).
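To see what int8 quantization does to individual values, the following sketch maps a float tensor to 8-bit integers with a scale and zero point, then maps it back. The helper names `quantize_affine` and `dequantize_affine` are hypothetical, and this is a simplified per-tensor scheme for illustration, not necessarily the exact scheme PyTorch's quantized kernels use.

```python
import torch

def quantize_affine(x, num_bits=8):
    """Map a float tensor to signed integers using a per-tensor scale and zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # Extend the range to include 0 so that zero is exactly representable
    x_min, x_max = min(x.min().item(), 0.0), max(x.max().item(), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(qmin - x_min / scale))
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return (q.to(torch.float32) - zero_point) * scale

torch.manual_seed(0)
weights = torch.randn(1000)
q, scale, zero_point = quantize_affine(weights)
max_error = (weights - dequantize_affine(q, scale, zero_point)).abs().max().item()
print(f"scale={scale:.5f}, zero_point={zero_point}, max abs error={max_error:.5f}")
```

The round-trip error per value is bounded by roughly half the scale, which is why tensors with a narrow value range quantize with less accuracy loss than tensors with large outliers.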

## 🧩 3. Knowledge Distillation

**Idea:** Train a small “student” model to replicate the predictions of a large “teacher” model.

The student learns from the teacher’s **soft targets** (probability distributions over classes), usually in addition to the hard labels rather than instead of them.

### Example Flow:
```python
teacher_model = ... # large pretrained model
student_model = ... # smaller model

for data, target in dataloader:
    with torch.no_grad():
        teacher_logits = teacher_model(data)

    student_logits = student_model(data)
    loss = distillation_loss(student_logits, teacher_logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

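📘 The **distillation loss** is usually a weighted combination of cross-entropy against the hard labels and Kullback-Leibler (KL) divergence between the teacher's and student's temperature-softened output distributions (softmax of the logits divided by a temperature).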
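The `distillation_loss` called in the loop above is not defined in this notebook. One common formulation is sketched below; the `temperature` and `alpha` parameters and their default values are assumptions made for illustration, and would normally be tuned per task.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, temperature=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence."""
    # Standard cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, target)
    # Soften both distributions with the temperature before comparing them
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scaling by temperature**2 keeps gradient magnitudes comparable across temperatures
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    return alpha * hard_loss + (1 - alpha) * soft_loss

# Usage on random logits for a batch of 4 examples and 3 classes
torch.manual_seed(0)
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)
target = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, target)
loss.backward()
print("Distillation loss:", loss.item())
```

A higher temperature spreads the teacher's probability mass across more classes, which exposes more of its knowledge about class similarity to the student.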

## 🧮 4. Low-Rank Factorization

**Idea:** Approximate weight matrices using low-rank decompositions (e.g., SVD). This reduces redundant parameters.

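For a weight matrix $W \in \mathbb{R}^{m \times n}$ with singular value decomposition $W = U \Sigma V^T$, we keep only the top-$k$ singular values to obtain the approximation $W \approx U_k \Sigma_k V_k^T$. Storing the two factors takes $k(m + n)$ parameters instead of $mn$, which is a saving whenever $k < \frac{mn}{m + n}$.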

✅ Helps compress large layers like fully connected or convolutional layers.
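As a minimal sketch of this idea for a fully connected layer, the snippet below replaces one `nn.Linear` with two smaller ones built from a truncated SVD. The helper name `factorize_linear` and the layer sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

def factorize_linear(layer, rank):
    """Approximate an nn.Linear by two stacked linear layers of the given rank."""
    W = layer.weight.detach()  # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_k = U[:, :rank] * S[:rank]  # (out_features, rank), singular values folded in
    V_k = Vh[:rank, :]            # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    with torch.no_grad():
        first.weight.copy_(V_k)
        second.weight.copy_(U_k)
        if layer.bias is not None:
            second.bias.copy_(layer.bias)
    return nn.Sequential(first, second)

torch.manual_seed(0)
layer = nn.Linear(512, 256)
compressed = factorize_linear(layer, rank=32)

n_original = sum(p.numel() for p in layer.parameters())
n_compressed = sum(p.numel() for p in compressed.parameters())
print("Parameters before:", n_original, "after:", n_compressed)
```

A randomly initialized layer like this one has no low-rank structure, so the approximation error here would be large. Trained layers often have rapidly decaying singular values, and a brief fine-tuning pass after factorization is commonly used to recover accuracy.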

## ⚡ 5. Neural Architecture Search (NAS)

**Idea:** Automatically discover architectures that balance accuracy, latency, and efficiency.

For example, MnasNet, **MobileNetV3**, and the **EfficientNet**-B0 baseline were found with NAS that optimizes accuracy together with latency or FLOPs. The earlier MobileNet V1 and V2 architectures were designed by hand.

Techniques:
- Reinforcement Learning-based NAS
- Evolutionary algorithms
- Gradient-based NAS (e.g., DARTS)
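The techniques listed above are too involved for a short snippet, so the sketch below uses plain random search instead, the simplest baseline for NAS. Everything in it is an assumption made for illustration: the synthetic regression task, the search space of MLP widths and depths, and the `size_penalty` that trades validation loss against parameter count.

```python
import random
import torch
import torch.nn as nn

def build_mlp(in_dim, hidden, depth):
    """Build an MLP with `depth` hidden layers of width `hidden`."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)

def evaluate(model, x_train, y_train, x_val, y_val, steps=100):
    """Train briefly and return the validation MSE."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x_train), y_train).backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(model(x_val), y_val).item()

def random_search(trials=8, size_penalty=1e-5, seed=0):
    """Sample architectures at random and keep the best loss-plus-size score."""
    random.seed(seed)
    torch.manual_seed(seed)
    x = torch.randn(256, 8)
    y = torch.sin(x.sum(dim=1, keepdim=True))  # synthetic target
    x_train, y_train, x_val, y_val = x[:192], y[:192], x[192:], y[192:]
    best = None
    for _ in range(trials):
        config = {"hidden": random.choice([4, 8, 16, 32]), "depth": random.choice([1, 2, 3])}
        model = build_mlp(8, config["hidden"], config["depth"])
        n_params = sum(p.numel() for p in model.parameters())
        # Lower is better: validation error plus a penalty for model size
        score = evaluate(model, x_train, y_train, x_val, y_val) + size_penalty * n_params
        if best is None or score < best["score"]:
            best = {"score": score, "params": n_params, **config}
    return best

best = random_search()
print("Best architecture found:", best)
```

Real NAS systems replace the random sampler with a learned controller, an evolutionary population, or a differentiable relaxation, and typically measure latency or energy on the target device instead of using parameter count as a proxy.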


## 🌱 Sustainable AI Practices

Beyond technical optimizations, developers can reduce AI’s environmental impact by:
- Using **green data centers** powered by renewable energy.
- Preferring **smaller pre-trained models** when possible.
- Sharing models to reduce redundant training (e.g., via Hugging Face Hub).
- Monitoring **energy usage** via tools like `CodeCarbon`.

### Example: Estimating training carbon footprint
```python
from codecarbon import EmissionsTracker

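tracker = EmissionsTracker()
tracker.start()
# train_model()  # placeholder: run your training loop here
emissions = tracker.stop()  # estimated emissions in kg CO2-equivalent
print(f"Estimated emissions: {emissions} kg CO2eq")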
```

## 📊 Summary

| Technique | Goal | Benefit |
|------------|------|----------|
| Pruning | Remove redundant parameters | Reduces computation |
| Quantization | Lower precision weights | Reduces memory & latency |
| Knowledge Distillation | Smaller model mimics large model | Retains much of the teacher's accuracy in a smaller model |
| Low-Rank Factorization | Compress large matrices | Fewer parameters |
| NAS | Auto-optimize models | Efficient architectures |

These methods make models **greener, faster, and deployable** on limited hardware.

## ✅ Next Steps
- Try pruning or quantizing your own model (e.g., a trained CNN).
- Deploy a compressed model on an edge device (like Raspberry Pi).
- Measure energy savings using tools like `CodeCarbon`.

**Up Next:** `README.md` for `10-Advanced_Concepts` — summarizing all advanced techniques and concepts.