# Chapter 96: Edge AI and TinyML

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the concepts of Edge AI and TinyML and their relevance to time‑series prediction systems.
- Identify scenarios where deploying models on edge devices is advantageous over cloud‑based inference.
- Apply model compression techniques such as quantization, pruning, and knowledge distillation to reduce model size and latency.
- Use TensorFlow Lite to convert and optimise models for edge deployment.
- Implement a TinyML application for real‑time anomaly detection on streaming sensor data, using a simulated NEPSE‑like data stream.
- Evaluate the trade‑offs between model accuracy, size, inference time, and power consumption.
- Explore the hardware landscape for edge AI, from microcontrollers to edge GPUs.
- Understand the limitations and challenges of edge deployment, including hardware constraints and model updates.

---

## **96.1 Introduction to Edge AI and TinyML**

**Edge AI** refers to the deployment of artificial intelligence models on devices at the edge of the network, close to where data is generated, rather than in a centralised cloud. **TinyML** is a subfield of Edge AI focused on running machine learning models on ultra‑low‑power devices such as microcontrollers, often with memory and compute constraints measured in kilobytes and megahertz.

In the context of time‑series prediction systems like the NEPSE stock predictor, most of our work has been cloud‑centric: data is sent to a central server where models run. However, there are compelling reasons to move some intelligence to the edge:

- **Latency**: Real‑time applications (e.g., high‑frequency trading, industrial control) require predictions within milliseconds; sending data to the cloud and back may be too slow.
- **Bandwidth**: Streaming high‑frequency sensor data (e.g., from thousands of IoT devices) can overwhelm networks. Edge devices can process locally and only send summaries or anomalies.
- **Privacy**: Sensitive data (e.g., medical or financial) can be processed locally without ever leaving the device.
- **Reliability**: Edge devices can continue operating even when disconnected from the cloud.
- **Energy efficiency**: TinyML models can run within a power budget on the order of milliwatts, which lets battery‑operated devices run for months or years depending on the duty cycle.

For the NEPSE system, an edge scenario might involve a lightweight model running on a trader's smartphone, providing real‑time alerts based on streaming market data, or a sensor on a trading floor that detects unusual activity.

This chapter will guide you through the process of taking a time‑series model, compressing it for edge deployment, and running it on a simulated edge device.

---

## **96.2 Why Edge for Time‑Series?**

Time‑series data is often generated continuously and at high velocity. Consider these use cases where edge deployment shines:

- **Predictive maintenance on industrial equipment**: Vibration sensors generate data at kHz rates. Sending all data to the cloud is impractical; an edge device runs an anomaly detection model and only alerts when a fault is imminent.
- **Wearable health monitors**: Heart rate and ECG data are sensitive and require real‑time alerts for arrhythmias. Processing on the device protects privacy and ensures immediate response.
- **Smart grids**: Real‑time load forecasting at substations can optimise local energy distribution without waiting for cloud commands.
- **Financial trading terminals**: Low‑latency predictions for high‑frequency trading may be executed on FPGAs or GPUs near the exchange.

In each case, the edge device must be capable of running a model with limited resources. This necessitates model compression and optimisation.

---

## **96.3 Model Compression Techniques**

Deploying a model on an edge device often requires reducing its size and computational cost. The three main techniques are quantization, pruning, and knowledge distillation.

### **96.3.1 Quantization**

Quantization reduces the numerical precision of a model's weights and activations. For example, instead of 32‑bit floating point numbers, we can use 8‑bit integers (or even 1‑bit binary). This dramatically reduces model size and can speed up inference, especially on hardware with integer accelerators.

**Types of quantization**:

- **Post‑training quantization**: Convert a pre‑trained float model to lower precision (e.g., float16 or int8) without retraining. This is easy but may cause accuracy loss.
- **Quantization‑aware training**: Simulate quantization during training so the model learns to be robust to lower precision. This often yields better accuracy.

**Example with TensorFlow**:

```python
import numpy as np
import tensorflow as tf

# Load a pre‑trained Keras model
model = tf.keras.models.load_model('nepse_model.h5')

# Convert to TensorFlow Lite with float16 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

# Save the quantized model
with open('model_float16.tflite', 'wb') as f:
    f.write(tflite_model)

# For int8 quantization, we need a representative dataset to calibrate.
# Random inputs are only a placeholder here; in practice, yield real
# samples from the training set so the calibrated ranges match real data.
sample_shape = (1,) + tuple(model.input_shape[1:])  # assumes a single-input model

def representative_dataset():
    for _ in range(100):
        yield [np.random.randn(*sample_shape).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_int8_model = converter.convert()
```

**Explanation**:

- We first convert a Keras model to TensorFlow Lite format.
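- Float16 quantization roughly halves the model size, usually with minimal accuracy loss.
- Int8 quantization requires a representative dataset to determine the scaling factors (value ranges) for each tensor. The `representative_dataset` function should yield samples from the training data; the example above uses random inputs only as a placeholder.
- The resulting `.tflite` file can be deployed on edge devices. On microcontrollers, fully integer (int8) models are the usual choice, because many of these devices have no floating‑point unit.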
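
To make the "scaling factors" mentioned above more concrete, the following is a minimal NumPy sketch of affine (scale and zero‑point) int8 quantization. It illustrates the general idea only and is not the TFLite converter's actual calibration code; the helper names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Affine quantization of a float array to int8 (illustrative only)."""
    x_min, x_max = float(x.min()), float(x.max())
    # Widen the range so that 0.0 is exactly representable
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

# Synthetic "weights" standing in for one tensor of a trained model
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=1000).astype(np.float32)

q, scale, zero_point = quantize_int8(weights)
recovered = dequantize_int8(q, scale, zero_point)
max_error = float(np.max(np.abs(weights - recovered)))

print(f"scale={scale:.6f}, zero_point={zero_point}")
print(f"storage: {weights.nbytes} bytes -> {q.nbytes} bytes")
print(f"max round-trip error: {max_error:.6f}")
```

The round‑trip error is bounded by roughly one quantization step (`scale`), which is why a tensor with a few extreme outliers quantizes poorly: the outliers stretch the range and enlarge every step.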

### **96.3.2 Pruning**

Pruning removes less important connections (weights) from a neural network, creating a sparse model. This reduces storage and can accelerate inference on specialised hardware.

**Example with TensorFlow Model Optimization Toolkit**:

```python
import tensorflow_model_optimization as tfmot

# Apply pruning to a model
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
}
model_to_prune = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)

# Compile and train with pruning
model_to_prune.compile(...)
model_to_prune.fit(x_train, y_train, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# After training, strip pruning wrappers
stripped_model = tfmot.sparsity.keras.strip_pruning(model_to_prune)

# Convert to TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(stripped_model)
tflite_model = converter.convert()
```

**Explanation**:

- We wrap the model with `prune_low_magnitude`, which adds masking to the weights.
- During training, weights below a threshold are masked (set to zero). The sparsity gradually increases according to the schedule.
- After training, we strip the pruning wrappers to obtain a sparse model.
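- The sparse model can be converted to TFLite. Pruned weights are still stored as zeros, so the file shrinks mainly after compression (e.g., gzip), and inference is faster only on runtimes or hardware that exploit sparsity.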
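
The effect of magnitude pruning on storage can be sketched without TensorFlow. The example below is a simplified one‑shot illustration (not the gradual schedule that `prune_low_magnitude` applies during training): it zeroes the smallest‑magnitude half of a random weight matrix and compares compressed sizes using `zlib`. The helper `prune_by_magnitude` is a hypothetical name.

```python
import zlib
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # Magnitude of the k-th smallest weight; everything at or below it is pruned
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return np.where(mask, weights, np.float32(0.0))

rng = np.random.default_rng(0)
dense = rng.normal(size=(256, 256)).astype(np.float32)
sparse = prune_by_magnitude(dense, sparsity=0.5)

achieved_sparsity = float(np.mean(sparse == 0.0))
dense_compressed = len(zlib.compress(dense.tobytes()))
sparse_compressed = len(zlib.compress(sparse.tobytes()))

print(f"achieved sparsity: {achieved_sparsity:.2f}")
print(f"raw size (both):   {dense.nbytes:,} bytes")
print(f"compressed dense:  {dense_compressed:,} bytes")
print(f"compressed sparse: {sparse_compressed:,} bytes")
```

The raw arrays are the same size; only the compressed sizes differ. This is the reason a pruned model usually needs a compression step or a sparsity‑aware format before the storage benefit appears.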

### **96.3.3 Knowledge Distillation**

Knowledge distillation trains a smaller "student" model to mimic the predictions of a larger "teacher" model. The student learns from the teacher's soft outputs (probabilities), usually in addition to the hard labels, and often achieves better performance than the same small model trained from scratch.

**Example**:

```python
# Teacher: a large, accurate model
teacher = tf.keras.models.load_model('large_nepse_model.h5')

# Student: a smaller model (e.g., with fewer layers)
student = tf.keras.Sequential([...])

# Distillation loss: combination of hard loss (true labels) and soft loss (teacher logits).
# y_pred holds the student's raw logits (no softmax in the student's last layer).
def distillation_loss(y_true, y_pred, teacher_logits, temperature=3.0):
    hard_loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
    )
    soft_loss = tf.keras.losses.KLDivergence()(
        tf.nn.softmax(teacher_logits / temperature),
        tf.nn.softmax(y_pred / temperature)
    )
    return hard_loss + soft_loss * (temperature ** 2)

# Training loop with teacher providing logits
# ...
```

**Explanation**:

- The student learns from both the true labels and the teacher's softened probabilities.
- The temperature parameter controls the softness of the probability distribution.
- Distillation can produce a compact model that retains much of the teacher's accuracy.
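
The role of the temperature can be checked numerically. The sketch below re‑implements the same loss in plain NumPy, assuming both models output raw logits; the logit values are made up for illustration and do not come from any trained model.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(y_true, student_logits, teacher_logits, temperature=3.0):
    """Hard cross-entropy plus T^2-weighted KL(teacher || student) on softened outputs."""
    student_probs = softmax(student_logits)
    hard = -np.log(student_probs[np.arange(len(y_true)), y_true])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(np.mean(hard + temperature ** 2 * soft))

# Illustrative logits for one sample with three classes (hypothetical values)
teacher_logits = np.array([[4.0, 1.0, 0.0]])
student_logits = np.array([[2.0, 1.5, 0.5]])
labels = np.array([0])

p_t1 = softmax(teacher_logits, temperature=1.0)
p_t3 = softmax(teacher_logits, temperature=3.0)
loss = distillation_loss(labels, student_logits, teacher_logits)
loss_matched = distillation_loss(labels, teacher_logits, teacher_logits)

print("teacher probabilities at T=1:", np.round(p_t1, 3))
print("teacher probabilities at T=3:", np.round(p_t3, 3))
print(f"loss (student differs from teacher): {loss:.4f}")
print(f"loss (student matches teacher):      {loss_matched:.4f}")
```

A higher temperature flattens the teacher's distribution, so the relative scores of the non‑target classes carry more weight in the soft loss. When the student's logits equal the teacher's, the KL term vanishes and only the hard loss remains.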

---

## **96.4 TinyML Frameworks and Tools**

Several frameworks facilitate deployment on edge devices:

### **96.4.1 TensorFlow Lite for Microcontrollers**

TensorFlow Lite Micro is a lightweight interpreter designed to run on 32‑bit microcontrollers with only a few kilobytes of RAM. It supports a subset of TensorFlow operations and can execute quantized models.

**Key features**:

- Written in C++11 (newer releases may require a more recent C++ standard), with a small code footprint (the core runtime fits in roughly 16 KB on an Arm Cortex‑M3).
- Supports ARM Cortex‑M, ESP32, and other architectures.
- No operating system required; runs bare‑metal.

### **96.4.2 Edge Impulse**

Edge Impulse is a platform that streamlines the entire TinyML workflow: data ingestion, model design, training, testing, and deployment. It supports many development boards and provides a web‑based studio.

### **96.4.3 Other Frameworks**

- **Apache TVM**: Compiles models for a wide range of hardware (CPUs, GPUs, FPGAs).
- **ONNX Runtime**: Can run on edge devices with optimised backends.
- **uTensor**: Another lightweight inference engine for microcontrollers.

For our example, we will use TensorFlow Lite and simulate deployment on a microcontroller using the TFLite interpreter in Python (as a proxy).

---

## **96.5 Case Study: Deploying an Anomaly Detection Model on an Edge Device**

We will build a simple anomaly detection model for a streaming time series, compress it, and simulate its deployment on an edge device.

### **96.5.1 Problem Definition**

Imagine a sensor that monitors the trading volume of a NEPSE stock in real time. Unusual volume spikes may indicate market manipulation or news events. We want to run an anomaly detection model on a low‑power device near the data source. The device should raise an alert when an anomaly is detected.

We'll use a simple autoencoder model trained on normal volume patterns. The reconstruction error will be the anomaly score.

### **96.5.2 Data Preparation**

We'll generate synthetic volume data with occasional anomalies.

```python
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

# Generate normal volume data (log‑normal distribution)
np.random.seed(42)
n_normal = 10000
normal_volume = np.random.lognormal(mean=12, sigma=1, size=n_normal)

# Generate anomalous spikes (median about 20x higher, since e^(15-12) ≈ 20)
n_anomaly = 100
anomaly_volume = np.random.lognormal(mean=15, sigma=1, size=n_anomaly)

# Combine for testing
volume = np.concatenate([normal_volume, anomaly_volume])
np.random.shuffle(volume)

# Create sequences of 10 consecutive readings
def create_sequences(data, seq_length=10):
    xs = []
    for i in range(len(data) - seq_length):
        xs.append(data[i:i+seq_length])
    return np.array(xs)

X = create_sequences(normal_volume, seq_length=10)  # train on normal only
X_test = create_sequences(volume, seq_length=10)

# Normalize with the min/max of the normal data (anomalous test values can exceed 1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X.reshape(-1,1)).reshape(X.shape)
X_test_scaled = scaler.transform(X_test.reshape(-1,1)).reshape(X_test.shape)
```

### **96.5.3 Build and Train an Autoencoder**

```python
# Simple autoencoder
input_dim = 10
encoding_dim = 4

input_layer = layers.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation='relu')(input_layer)
decoded = layers.Dense(input_dim, activation='sigmoid')(encoded)

autoencoder = keras.Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# Train
history = autoencoder.fit(
    X_scaled, X_scaled,
    epochs=50,
    batch_size=32,
    validation_split=0.1,
    verbose=0
)

# Plot loss
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val')
plt.legend()
plt.title('Autoencoder Training')
plt.show()
```

### **96.5.4 Convert to TensorFlow Lite**

We'll apply quantization to reduce model size.

```python
# Convert to float16 TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(autoencoder)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

# Save
with open('anomaly_autoencoder_float16.tflite', 'wb') as f:
    f.write(tflite_model)

print(f"Original float32 weights: {autoencoder.count_params() * 4:,} bytes")
print(f"TFLite model size: {len(tflite_model):,} bytes")
```

**Explanation**:

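- This autoencoder has only 94 parameters (about 376 bytes of float32 weights), so both the saved Keras file and the TFLite file are dominated by format overhead. The savings from quantization become significant only for larger models.
- Float16 quantization stores weights as 16‑bit floats, roughly halving the weight storage.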

### **96.5.5 Simulate Edge Deployment**

We'll use the TFLite interpreter in Python to simulate inference on an edge device. In a real deployment, the equivalent logic would be written in C++ against the TFLite Micro library and run on the microcontroller itself.

```python
# Load TFLite model and allocate tensors
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()

# Get input and output tensors
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Set threshold for anomaly detection
threshold = 0.01  # placeholder; in practice, derive it from reconstruction errors on held-out normal data

# Simulate streaming: process each window
anomaly_scores = []
for i in range(len(X_test_scaled)):
    input_data = X_test_scaled[i].astype(np.float32).reshape(1, 10)
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    reconstructed = interpreter.get_tensor(output_details[0]['index'])
    mse = np.mean((input_data - reconstructed) ** 2)
    anomaly_scores.append(mse)

# Identify anomalies
anomaly_flags = np.array(anomaly_scores) > threshold

# Plot
plt.figure(figsize=(12,4))
plt.plot(anomaly_scores, label='Reconstruction Error')
plt.axhline(y=threshold, color='r', linestyle='--', label='Threshold')
plt.plot(np.where(anomaly_flags)[0], np.array(anomaly_scores)[anomaly_flags], 'ro', label='Detected')
plt.legend()
plt.title('Anomaly Detection on Edge')
plt.show()
```

**Explanation**:

- We load the quantized TFLite model into an interpreter.
- For each input window, we run inference and compute the reconstruction error.
- If the error exceeds a threshold, we flag an anomaly.
- This simulates what would happen on an edge device: the model runs locally and only sends alerts when anomalies occur.
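
One common way to pick the threshold is to take a high percentile of the reconstruction errors measured on held‑out normal data. The sketch below assumes that approach; the error values are synthetic stand‑ins drawn from gamma distributions, not outputs of the autoencoder above, and `choose_threshold` is a hypothetical helper.

```python
import numpy as np

def choose_threshold(normal_errors, percentile=99.0):
    """Return the error value that `percentile` percent of normal windows stay below."""
    return float(np.percentile(normal_errors, percentile))

rng = np.random.default_rng(0)
# Synthetic stand-ins for reconstruction errors (illustrative only)
val_errors = rng.gamma(shape=2.0, scale=0.002, size=5000)      # normal windows
anomalous_errors = rng.gamma(shape=2.0, scale=0.05, size=50)   # anomalous windows

threshold = choose_threshold(val_errors, percentile=99.0)
false_alarm_rate = float(np.mean(val_errors > threshold))
detection_rate = float(np.mean(anomalous_errors > threshold))

print(f"threshold:        {threshold:.5f}")
print(f"false alarm rate: {false_alarm_rate:.3f}")
print(f"detection rate:   {detection_rate:.3f}")
```

The percentile directly sets the expected false‑alarm rate on normal data, so it can be tuned to how many alerts per day the downstream system can tolerate.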

### **96.5.6 Deployment Considerations on Real Hardware**

On a real microcontroller, you would:

1. Include the TFLite Micro library in your firmware.
2. Convert the `.tflite` file to a C byte array (using `xxd`).
3. Write a simple C++ program that reads sensor data, runs inference, and triggers an alert (e.g., LED, message).
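
The control flow of step 3 can be prototyped in Python before it is ported to C++. The following is a rough sketch under stated assumptions: `StreamingDetector` is a hypothetical class, and `mean_reconstruction` is a trivial stand‑in for the model (it "reconstructs" each window as its own mean), so the numbers say nothing about the real autoencoder.

```python
from collections import deque
import numpy as np

class StreamingDetector:
    """Keeps a fixed-size window of readings and scores each full window."""

    def __init__(self, predict_fn, threshold, window=10):
        self.predict_fn = predict_fn          # maps a window to its reconstruction
        self.threshold = threshold
        self.buffer = deque(maxlen=window)    # oldest reading drops out automatically

    def update(self, reading):
        """Add one reading; return (score, is_anomaly), or None while the buffer fills."""
        self.buffer.append(float(reading))
        if len(self.buffer) < self.buffer.maxlen:
            return None
        x = np.asarray(self.buffer, dtype=np.float32)
        score = float(np.mean((x - self.predict_fn(x)) ** 2))
        return score, score > self.threshold

def mean_reconstruction(x):
    """Stand-in model: reconstruct every value as the window mean."""
    return np.full_like(x, x.mean())

detector = StreamingDetector(mean_reconstruction, threshold=0.05)
readings = [0.10] * 15 + [0.95] + [0.10] * 15   # one spike at index 15

alerts = []
for t, reading in enumerate(readings):
    result = detector.update(reading)
    if result is not None and result[1]:
        alerts.append(t)   # on a device: toggle an LED or send a message

print("alert time steps:", alerts)
```

A fixed‑size buffer keeps memory use constant, which matters on a device with a few kilobytes of RAM. The spike keeps triggering alerts for as long as it remains inside the window, so real firmware would typically debounce repeated alerts.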

**Example C byte array generation**:

```bash
xxd -i anomaly_autoencoder_float16.tflite > model_data.cpp
```

This creates a C array like `unsigned char anomaly_autoencoder_float16_tflite[] = {...}`.
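
If `xxd` is not available (for example on Windows), a few lines of Python can produce a similar array. This is a minimal sketch that imitates the layout of `xxd -i` output; `to_c_array` is a hypothetical helper, and the demo bytes are arbitrary rather than a real model.

```python
def to_c_array(data: bytes, name: str = "model_data", per_line: int = 12) -> str:
    """Render bytes as C source: an unsigned char array plus a length variable."""
    lines = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk))
    body = ",\n".join(lines)
    return (
        f"unsigned char {name}[] = {{\n{body}\n}};\n"
        f"unsigned int {name}_len = {len(data)};\n"
    )

# Arbitrary demo bytes; for a real model, read the .tflite file in 'rb' mode instead
example = to_c_array(bytes(range(16)), name="demo_model")
print(example)
```

Firmware projects often declare the array `const` so that the linker can keep it in flash rather than copying it into RAM.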

---

## **96.6 Performance Evaluation**

When deploying on edge, we must evaluate:

- **Model size**: Does it fit in flash memory? (e.g., 128 KB for some microcontrollers)
- **RAM usage**: Does inference exceed available RAM? (often a few KB to a few hundred KB)
- **Inference time**: Can it keep up with the data rate? (e.g., each inference must finish in under 10 ms to keep up with 100 Hz data)
- **Accuracy**: How much accuracy is lost due to quantization/pruning?
- **Power consumption**: Milliwatts or microwatts? Affects battery life.

For our autoencoder, float16 quantization should have negligible accuracy loss. The model size (~a few KB) fits easily on most microcontrollers.
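
A back‑of‑the‑envelope check of these budgets can be scripted. The sketch below counts the parameters of the 10‑4‑10 dense autoencoder from Section 96.5.3, estimates weight storage at different precisions, and derives the per‑inference time budget from a sample rate. The storage estimate ignores file‑format overhead and the wider bias types used in integer models, and the helper names are hypothetical.

```python
import time

def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def latency_budget_ms(sample_rate_hz):
    """Maximum time per inference if every sample triggers one inference."""
    return 1000.0 / sample_rate_hz

def mean_latency_ms(fn, arg, repeats=1000):
    """Average wall-clock time of fn(arg) on the current machine."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) * 1000.0 / repeats

BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}

params = dense_param_count([10, 4, 10])
weight_bytes = {name: params * size for name, size in BYTES_PER_WEIGHT.items()}

print(f"parameters: {params}")                                   # 94
print(f"weight storage: {weight_bytes}")
print(f"budget at 100 Hz: {latency_budget_ms(100):.1f} ms")      # 10.0 ms

# Timing a trivial stand-in workload; replace `sum` with a real inference call
print(f"measured: {mean_latency_ms(sum, list(range(10)), repeats=100):.5f} ms")
```

Timings measured on a desktop CPU do not transfer to a microcontroller, so the latency helper is mainly useful for comparing model variants against each other before profiling on the target hardware.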

---

## **96.7 Challenges and Limitations**

- **Hardware constraints**: Memory, compute, and power are severely limited.
- **Operator support**: Not all TensorFlow operations are implemented in TFLite Micro. You may need to simplify your model.
- **Floating point vs. integer**: Some microcontrollers have no FPU; integer‑only models are required.
- **Model updates**: Updating models on deployed devices can be difficult (over‑the‑air updates require careful design).
- **Security**: Edge devices can be physically tampered with; consider secure boot and encrypted storage.

---

## **96.8 Future Directions**

- **TinyML with neural architecture search**: Automatically find efficient architectures for edge devices.
- **On‑device learning**: Models that adapt to new data without cloud connectivity.
- **Integration with embedded operating systems**: Better support in FreeRTOS, Zephyr, etc.
- **Hardware advancements**: New ultra‑low‑power AI accelerators (e.g., ARM Ethos‑U, Syntiant).
- **Federated learning**: Train across many edge devices without sharing raw data.

---

## **96.9 Chapter Summary**

In this chapter, we explored Edge AI and TinyML, focusing on deploying time‑series models on resource‑constrained devices. We covered:

- The motivation for edge deployment: latency, bandwidth, privacy, reliability, and energy efficiency.
- Model compression techniques: quantization, pruning, and knowledge distillation.
- Frameworks like TensorFlow Lite for Microcontrollers and Edge Impulse.
- A complete case study: building an autoencoder for anomaly detection on streaming volume data, converting it to a quantized TFLite model, and simulating edge inference.
- Performance evaluation and the challenges of real‑world deployment.

Edge AI is a rapidly growing field, and for time‑series applications, it enables a new class of intelligent, low‑power, and responsive systems. While the NEPSE example here is simplified, the same principles apply to many real‑world problems, from predictive maintenance to healthcare monitoring.

In the next chapter, we will explore **Quantum Machine Learning**, a speculative but exciting frontier that may revolutionise computation for certain time‑series tasks.

---

**End of Chapter 96**

