# 📘 Chapter 6: Model Resource Management Techniques

## Introduction
In production ML systems, especially in mobile, IoT, or serverless environments, resources such as compute, memory, and energy are limited.

This chapter focuses on optimizing machine learning models to be lightweight and efficient, covering techniques like:
- **Dimensionality Reduction**: Reduce input complexity
- **Quantization & Pruning**: Reduce model size and latency
- **Knowledge Distillation**: Retain performance with smaller models

---

## 1. Dimensionality Reduction
Reducing the number of input features helps models:
- Train faster
- Require less memory
- Often generalize better, because there are fewer irrelevant features to overfit

### Curse of Dimensionality
High-dimensional feature spaces lead to:
- Sparse data, requiring exponentially more samples
- Distance metrics (like Euclidean) becoming less informative
- Increased overfitting risk
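The loss of contrast between distances can be observed directly. The following is a minimal sketch on uniformly random points, not a benchmark; the helper name `relative_contrast` and the sample sizes are illustrative assumptions. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """Return (max - min) / min of Euclidean distances from a query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))   # uniform samples in the unit hypercube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:>4}  relative contrast={relative_contrast(dim):.3f}")
```

As the dimension grows, the contrast shrinks: nearest and farthest neighbors end up at nearly the same distance, which is why nearest-neighbor style reasoning degrades in high dimensions.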

### Solutions
- Manual feature selection (domain knowledge)
- Feature selection algorithms (correlation, mutual info, RFE)
- Dimensionality reduction (e.g., PCA, UMAP)
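As an illustration of the correlation-based option, here is a small sketch that ranks features by the absolute Pearson correlation with the target and keeps the top two. The data is synthetic and the function name `rank_features_by_correlation` is an assumption made for this example; it only captures linear relationships and assumes no feature is constant.

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Return feature indices sorted by |Pearson correlation| with y, strongest first."""
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    cov = X_centered.T @ y_centered
    denom = np.linalg.norm(X_centered, axis=0) * np.linalg.norm(y_centered)
    corr = cov / denom
    return np.argsort(-np.abs(corr)), corr

# Synthetic data (hypothetical): only features 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=500)

order, corr = rank_features_by_correlation(X, y)
top_k = order[:2]             # keep the two highest-scoring features
X_selected = X[:, top_k]
print("ranking:", order, "selected shape:", X_selected.shape)
```

Mutual information or RFE would follow the same pattern (score, rank, keep the top features) while also capturing non-linear or model-dependent relevance.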

---

## 2. Word Embedding Example in Keras
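A Keras example using the Reuters newswire dataset (46 topic classes):

- We train a model with **1000D embeddings** (large and expressive, but costly in parameters and memory)
- Then train the same model with **10D embeddings** (far cheaper, potentially lower accuracy)

Train both with identical settings, then compare test accuracy, parameter count, and training time.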


```python
# Keras setup and data prep
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import reuters
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

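num_words = 1000  # keep only the 1,000 most frequent words
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=num_words)
# Reuters has 46 topic classes; one-hot encode the integer labels
y_train = to_categorical(y_train, 46)
y_test = to_categorical(y_test, 46)

# Pad or truncate every newswire to exactly 20 token ids
x_train = pad_sequences(x_train, maxlen=20)
x_test = pad_sequences(x_test, maxlen=20)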
```

```python
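# High-dimensional embedding model
model = keras.Sequential([
    keras.Input(shape=(20,)),           # sequences padded to maxlen=20; builds the model up front
    layers.Embedding(num_words, 1000),  # 1000-dimensional vector per token
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(46, activation='softmax')
])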
```
```python
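# Low-dimensional version
model_lowdim = keras.Sequential([
    keras.Input(shape=(20,)),           # sequences padded to maxlen=20; builds the model up front
    layers.Embedding(num_words, 10),    # 10-dimensional vector per token
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(46, activation='softmax')
])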
```
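The efficiency gap between the two models can be estimated before any training by counting parameters. The sketch below does the arithmetic in plain Python for the Embedding → Flatten → Dense → Dropout → Dense stack, assuming a vocabulary of 1,000 tokens and sequences of length 20 as in the data prep above; the helper names are illustrative.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

def model_params(vocab_size, embed_dim, seq_len, hidden=256, classes=46):
    """Trainable parameters of Embedding -> Flatten -> Dense -> Dropout -> Dense."""
    embedding = vocab_size * embed_dim     # one vector per token id
    flattened = seq_len * embed_dim        # width of the Flatten output
    # Dropout has no parameters, so it does not appear in the sum.
    return embedding + dense_params(flattened, hidden) + dense_params(hidden, classes)

high = model_params(vocab_size=1000, embed_dim=1000, seq_len=20)
low = model_params(vocab_size=1000, embed_dim=10, seq_len=20)
print(f"1000D: {high:,} params, 10D: {low:,} params, ratio: {high / low:.1f}x")
```

Most of the cost sits in the first Dense layer, whose input width scales with the embedding size. On a built Keras model, `count_params()` should report the same totals.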

---

## 3. PCA (Principal Component Analysis)
PCA is an unsupervised algorithm that:
- Finds orthogonal axes (principal components) along which the data varies most
- Projects data onto the top components, giving a lower-dimensional representation

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

iris = load_iris()
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(iris.data)

plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=iris.target)
plt.title("PCA on Iris Dataset")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```
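To make "axes with maximal variance" concrete, the following sketch computes PCA by hand with a singular value decomposition of the centered data. It is a simplified illustration on synthetic data, not a replacement for `sklearn.decomposition.PCA`; the function name `pca_svd` is an assumption for this example.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its top principal components using SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return X_centered @ components.T, components, ratio

# Synthetic data (hypothetical): 3 features, most variance along one direction.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

X_reduced, components, ratio = pca_svd(X, n_components=2)
print("reduced shape:", X_reduced.shape)
print("explained variance ratio:", np.round(ratio, 3))
```

The explained variance ratio shows how much information each component keeps, which is a common way to choose `n_components`. Component signs may differ from scikit-learn's output, since each axis is only defined up to sign.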

---


## 4. Quantization and Pruning

### Quantization
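Quantization reduces model size by storing weights (and optionally activations) at lower precision:
- **float32** weights are converted to **int8** or **float16**, cutting storage by roughly 4× or 2×
- Smaller tensors reduce memory use and improve cache performance

It is especially common in edge AI deployments, with toolchains such as **TF Lite** and **ONNX Runtime**.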
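The float32 to int8 conversion can be sketched in NumPy as an affine mapping with a scale and a zero point. This is a simplified per-tensor scheme written for illustration; the function names are assumptions, and real converters add details such as per-channel scales and calibration.

```python
import numpy as np

def quantize_int8(weights):
    """Affine (asymmetric) per-tensor quantization of a float array to int8."""
    w = weights.astype(np.float64)
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0          # float value of one int8 step
    if scale == 0.0:
        scale = 1.0                          # guard against constant tensors
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

print("bytes float32:", weights.nbytes, "bytes int8:", q.nbytes)
print("max abs error:", float(np.abs(weights - restored).max()), "step size:", scale)
```

Storage drops by a factor of four, and the rounding error per weight is bounded by about half a quantization step. Whether that error matters for accuracy depends on the model and has to be measured.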

#### Post-training Quantization
```python
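# `model` is the trained Keras model defined earlier
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Without a representative dataset, DEFAULT applies dynamic range quantization
# (weights stored as int8, activations kept in float)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()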

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```

#### Quantization-aware Training
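Quantization-aware training inserts simulated ("fake") quantization into the forward pass during training, so the weights adapt to rounding error before the model is converted. It usually preserves more accuracy than post-training quantization, at the cost of extra training time.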
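The idea can be sketched on a toy linear regression in NumPy. The forward pass uses fake-quantized weights, while the gradient is applied to the full-precision weights as if rounding were the identity (the straight-through estimator). The data, the learning rate, and the function name `fake_quant` are assumptions made for this illustration; in practice a library such as the TensorFlow Model Optimization Toolkit handles this for Keras models.

```python
import numpy as np

def fake_quant(w, num_bits=8):
    """Quantize then dequantize, so values stay float but lie on the integer grid."""
    q_max = 2 ** (num_bits - 1) - 1                      # 127 for 8 bits
    scale = max(float(np.abs(w).max()), 1e-8) / q_max    # symmetric, per-tensor
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

# Synthetic regression task (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
true_w = np.array([0.5, -1.25, 2.0, 0.75])
y = X @ true_w

w = np.zeros(4)     # full-precision "shadow" weights kept during training
lr = 0.1
for _ in range(200):
    w_q = fake_quant(w)                          # forward pass sees quantized weights
    grad = 2.0 * X.T @ (X @ w_q - y) / len(X)    # gradient with respect to w_q
    w -= lr * grad                               # straight-through: apply it to w

final_loss = float(np.mean((X @ fake_quant(w) - y) ** 2))
print("quantized weights:", fake_quant(w), "loss:", final_loss)
```

Because the loss is always evaluated on quantized weights, the optimizer settles on values that work well after rounding, rather than values that are only good at full precision.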

---


### Pruning
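Pruning removes weights that contribute little to the output, typically those with the smallest magnitudes.
Benefits:
- Fewer operations, when the runtime or hardware can exploit sparsity
- Smaller model size after compression
- Accuracy is often largely retained at moderate sparsity, usually after some fine-tuning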
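Magnitude-based pruning can be sketched in a few lines of NumPy: pick a sparsity target, find the magnitude cutoff, and zero everything at or below it. The function name `magnitude_prune` and the random weight matrix are assumptions for illustration only.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Return a copy of weights with the smallest-magnitude fraction set to zero."""
    k = int(round(sparsity * weights.size))        # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # k-th smallest absolute value acts as the cutoff
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold             # True for weights that survive
    return weights * mask, mask

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64))
pruned, mask = magnitude_prune(weights, sparsity=0.5)
print("fraction zeroed:", 1.0 - mask.mean())
```

Note that a dense array full of zeros is no smaller or faster by itself; the savings appear once the model is compressed or run with sparsity-aware kernels.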

```python
import tensorflow_model_optimization as tfmot

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruning_params = {
  'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=0,
    end_step=1000)
}

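# Wrap the model so low-magnitude weights are masked during training.
# The wrapped model must be compiled again, and fit() needs the
# tfmot.sparsity.keras.UpdatePruningStep() callback to advance the schedule.
model_pruned = prune_low_magnitude(model, **pruning_params)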
```
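The schedule above ramps sparsity from 0% to 50% over 1,000 steps. A plain-Python sketch of such a polynomial ramp is shown below; the cubic exponent (`power=3`) and the exact formula are assumptions based on the library's documented default, so treat the numbers as illustrative rather than as the library's exact behavior.

```python
def polynomial_sparsity(step, initial_sparsity=0.0, final_sparsity=0.5,
                        begin_step=0, end_step=1000, power=3):
    """Target sparsity at a training step under a polynomial decay schedule."""
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power

for step in (0, 250, 500, 1000):
    print(step, round(polynomial_sparsity(step), 4))
```

With this shape, most weights are removed early, while the network still has many redundant weights, and pruning slows down as the target sparsity approaches.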

---


## 5. Knowledge Distillation
A smaller **student model** learns to mimic a larger **teacher model**.

### Why?
- Transfer performance of a large model to a smaller one
- Reduce size/latency while retaining accuracy
- Used in DistilBERT, TinyML, etc.

### Strategy
- Use soft targets (probability distributions) from the teacher
- Blend with standard supervised loss

```python
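import tensorflow as tf
from tensorflow import keras

def distillation_loss(y_true, y_pred, teacher_pred, temperature=5.0, alpha=0.5):
    # y_pred and teacher_pred must be raw logits (no softmax in the final layer)
    teacher_soft = tf.nn.softmax(teacher_pred / temperature, axis=-1)
    teacher_log_soft = tf.nn.log_softmax(teacher_pred / temperature, axis=-1)
    student_log_soft = tf.nn.log_softmax(y_pred / temperature, axis=-1)
    # KL(teacher || student) per example, computed in log space for stability
    kl_loss = tf.reduce_sum(teacher_soft * (teacher_log_soft - student_log_soft), axis=-1)
    # Hard-label loss on the unscaled student logits
    hard_loss = keras.losses.categorical_crossentropy(y_true, y_pred, from_logits=True)
    return alpha * hard_loss + (1 - alpha) * kl_loss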
```
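The effect of temperature on soft targets is easy to see on a single example. The sketch below uses NumPy with made-up logits for a three-class problem; the values and helper names are assumptions for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return float(np.sum(p * np.log(p / q)))

teacher_logits = [6.0, 2.0, 1.0]     # hypothetical teacher outputs for one example
student_logits = [4.0, 3.0, 0.5]     # hypothetical student outputs

for t in (1.0, 5.0):
    p_teacher = softmax(teacher_logits, t)
    p_student = softmax(student_logits, t)
    print(f"T={t}: teacher={np.round(p_teacher, 3)}  "
          f"KL={kl_divergence(p_teacher, p_student):.4f}")
```

At a temperature of 1 the teacher puts almost all probability on one class. At a higher temperature the smaller probabilities become visible, and these carry the information about class similarity that the student learns from.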

### Visual: Teacher-Student Architecture
```python
from matplotlib import pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(10, 3))

# Draw Teacher block
ax.add_patch(patches.Rectangle((0.1, 0.4), 0.2, 0.3, edgecolor='blue', facecolor='lightblue'))
ax.text(0.2, 0.55, 'Teacher Model', ha='center')

# Draw an arrow from the teacher (tail at xytext) to the student (head at xy)
ax.annotate('', xy=(0.5, 0.55), xytext=(0.3, 0.55), arrowprops=dict(facecolor='black', shrink=0.05))
ax.text(0.4, 0.58, 'Soft Targets', ha='center')

# Draw Student block
ax.add_patch(patches.Rectangle((0.5, 0.4), 0.2, 0.3, edgecolor='green', facecolor='lightgreen'))
ax.text(0.6, 0.55, 'Student Model', ha='center')

# Style
ax.axis('off')
plt.title('Knowledge Distillation: Teacher → Student')
plt.show()
```

### Real-world examples
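- **DistilBERT**: About 40% smaller and 60% faster than BERT while retaining roughly 97% of its language-understanding performance
- **TinyBERT**: Distills intermediate representations (attention maps and hidden states) in addition to the output layer
- **TMKD**: Two-stage multi-teacher distillation, where the student learns from several teachers
- **Noisy Student**: A teacher generates pseudo-labels for unlabeled data, and a student trained with added noise learns from them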

---

## Conclusion
Efficient model deployment requires:
- Reducing feature space (dimensionality reduction)
- Compressing models (quantization, pruning)
- Transferring knowledge (distillation)

These techniques help deploy on mobile, embedded, or edge devices at scale.

---

## Keywords

| Term | Definition |
|------|------------|
| **Dimensionality Reduction** | Reducing the number of input features while preserving important information. |
| **Curse of Dimensionality** | In high dimensions, data becomes sparse and distance metrics lose meaning. |
| **PCA** | Principal Component Analysis; projects data into a lower-dimensional space. |
| **Embedding** | Learned dense vector representation of discrete inputs, such as words or categories. |
| **Pruning** | Removing weights or neurons that have little impact on output to shrink model size. |
| **Quantization** | Reducing precision (e.g., float32 → int8) to improve efficiency. |
| **TF Lite** | TensorFlow Lite; optimized framework for deploying models on edge devices. |
| **Knowledge Distillation** | Training a smaller model (student) to mimic a larger model (teacher). |
| **Teacher/Student Models** | Architecture where the teacher guides the student with soft predictions. |
| **KL Divergence** | A measure of how one probability distribution differs from another; used as the soft-target loss in distillation. |
| **Softmax Temperature** | A scalar that divides the logits before softmax; values above 1 soften the output distribution. |
| **DistilBERT / TinyBERT / TMKD / Noisy Student** | Real-world examples of teacher-student training; the first three produce compressed models. |
| **Efficient Inference** | Making model predictions faster and lighter in production. |
| **Model Compression** | General term for reducing model size (quantization, pruning, etc.). |
| **Mobile and Edge ML** | Deploying machine learning models on mobile or embedded devices. |

