# Chapter 8: Models in Production

In [None]:
from transformers import pipeline

In [None]:
bert_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=bert_ckpt)

In [None]:
query = """Hey, I'd like to rent a vehicle from Nov 1st to Nov 15th in Paris and I need a 15 passenger van"""
pipe(query)

Must address:
- Model performance: How well does the model do on a test set that reflects production data?
- Latency: How fast can the model deliver predictions?
- Memory: Can the model fit on an edge device, especially when there is no access to a cloud server?

In [None]:
class PerformanceBenchmark:
    def __init__(self, pipeline, dataset, optim_type="BERT baseline"):
        self.pipeline = pipeline
        self.dataset = dataset
        self.optim_type = optim_type
        
    def compute_accuracy(self):
        # We'll define this later
        pass

    def compute_size(self):
        # We'll define this later
        pass
    
    def time_pipeline(self):
        # define this later
        pass
    
    def run_benchmark(self):
        """Collect all metrics in a dictionary."""
        metrics = {}
        # keep track of different optimisation techniques
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics

In [None]:
from datasets import load_dataset

# load CLINC150 dataset from the hub; the "plus" config contains the out-of-scope training examples
clinc = load_dataset("clinc_oos", "plus")

In [None]:
sample = clinc["test"][42]
sample

In [None]:
intents = clinc["test"].features["intent"] # provided as IDs
intents.int2str(sample["intent"]) # map to strings

## Establish a Benchmark

In [None]:
from datasets import load_metric

accuracy_score = load_metric("accuracy")

In [None]:
def compute_accuracy(self):
    """Overrides PerformanceBenchmark.compute_accuracy() method.
    Expects the predictions and references (ground truth) to be integers.
    Use the pipeline to extract the predictions from the text and then str2int() method
    to map prediction to corresponding ID.
    
    Collects predictions and labels in lists before returning accuracy on the dataset.
    """
    preds, labels = [], []
    for example in self.dataset:
        pred = self.pipeline(example["text"])[0]["label"]
        label = example["intent"]
        preds.append(intents.str2int(pred))
        labels.append(label)
    accuracy = accuracy_score.compute(predictions=preds, references=labels)
    print(f"Accuracy on test set - {accuracy['accuracy']:.3f}")
    return accuracy

PerformanceBenchmark.compute_accuracy = compute_accuracy

In [None]:
# compute size of model with torch.save(); uses Pickle module under the hood
# can see weights and biases under the hood; each key/value corresponds to layer and tensor
list(pipe.model.state_dict().items())[42]

In [None]:
import torch
torch.save(pipe.model.state_dict(), "model.pt")

In [None]:
# get model size in bytes

from pathlib import Path

def compute_size(self):
    """Overrides PerformanceBenchmark.compute_size() method"""
    state_dict = self.pipeline.model.state_dict()
    tmp_path = Path("model.pt")
    torch.save(state_dict, tmp_path)
    # calc size in mb
    size_mb = Path(tmp_path).stat().st_size / (1024 * 1024)
    # delete temporary file
    tmp_path.unlink()
    print(f"Model size (MB) - {size_mb:.2f}")
    return {"size_mb": size_mb}

PerformanceBenchmark.compute_size = compute_size

Latency: the time it takes to feed a query text to the pipeline and get the predicted intent back from the model.

In [None]:
# time average latency per query

from time import perf_counter

for _ in range(3):
    start_time = perf_counter()
    _ = pipe(query)
    latency = perf_counter() - start_time
    print(f"Latency (ms) - {1000 * latency:.3f}")

In [None]:
import numpy as np

def time_pipeline(self, query="What is the pin number for my account?"):
    """Overrides the PerformanceBenchmark.time_pipeline() method.
    Performance varies depending on hardware, what's important is relative diff between 
    runs (consistency)."""
    latencies = []
    # Warmup
    for _ in range(10):
        _ = self.pipeline(query)
    # timed run
    for _ in range(100):
        start_time = perf_counter()
        _ = self.pipeline(query)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    print(f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}")
    return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}

PerformanceBenchmark.time_pipeline = time_pipeline

In [None]:
# benchmark BERT baseline
pb = PerformanceBenchmark(pipe, clinc["test"])
perf_metrics = pb.run_benchmark()

## Knowledge Distillation

Train a smaller student model to mimic a slower, larger, but better-performing teacher. We typically scale the logits with a temperature hyperparameter *T* before applying softmax, which produces a softer probability distribution over the classes and reveals more information about the decision boundary that the teacher has learned. Setting *T* = 1 recovers the original softmax distribution.
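To make the effect of the temperature concrete, here is a minimal NumPy sketch. The logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper, not part of any library.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits that are first divided by the temperature T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()  # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Made-up logits for a three-class problem
logits = [4.0, 1.0, 0.5]

p_t1 = softmax_with_temperature(logits, T=1.0)  # ordinary softmax
p_t4 = softmax_with_temperature(logits, T=4.0)  # softened distribution

print(p_t1.round(3))
print(p_t4.round(3))
```

With the higher temperature, the winning class keeps the largest probability but the other classes receive noticeably more mass, which is the extra signal the student learns from.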

We can use the **Kullback-Leibler (KL)** divergence to measure the difference between two probability distributions. Here it quantifies how much is lost when we approximate the probability distribution of the teacher with that of the student. This gives a knowledge distillation loss:

$L_{KD} = T^2D_{KL}$

Here $T^2$ is a normalisation factor that counters the fact that the magnitude of the gradients produced by soft labels scales as $1/T^2$. For classification tasks, the student loss is then a weighted average of the distillation loss and the usual cross-entropy loss $L_{CE}$ on the ground truth labels, $L_{student} = \alpha L_{CE} + (1 - \alpha) L_{KD}$, where the weighting parameter $\alpha$ controls the relative strength of the two losses.
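A small NumPy sketch of these losses for a single example may help. It assumes the same weighting convention as the `DistillationTrainer` in these notes ($\alpha$ multiplies the cross-entropy term), and all function names and logits here are hypothetical.

```python
import numpy as np

def log_softmax(logits, T=1.0):
    """Log-probabilities of temperature-scaled logits."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def distillation_loss(student_logits, teacher_logits, T):
    """L_KD = T^2 * KL(teacher || student) on the softened distributions."""
    log_q = log_softmax(student_logits, T)  # student
    log_p = log_softmax(teacher_logits, T)  # teacher
    kl = np.sum(np.exp(log_p) * (log_p - log_q))
    return T ** 2 * kl

def student_loss(student_logits, teacher_logits, label, alpha, T):
    """Weighted average of cross-entropy and distillation loss."""
    loss_ce = -log_softmax(student_logits)[label]
    loss_kd = distillation_loss(student_logits, teacher_logits, T)
    return alpha * loss_ce + (1.0 - alpha) * loss_kd

# Made-up logits; the true class is index 0
teacher = [5.0, 2.0, 0.5]
student = [3.0, 2.5, 0.1]

print(student_loss(student, teacher, label=0, alpha=0.5, T=2.0))
```

Setting `alpha=1` reduces the student loss to plain cross-entropy, while `alpha=0` trains on the teacher's soft labels only.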

**Pretraining**: Can be used during pretraining to create a general-purpose student to be later fine-tuned. Ex. in DistilBERT, the loss includes a cosine embedding loss to align the directions of the hidden state vectors between the teacher and student.

### Creating a Knowledge Distillation Trainer

Some things to add to Trainer base class:
- Hyper-parameters $\alpha$ and *T*, which control the relative weight of the distillation loss and how much the softmax probabilities should be smoothed
- Fine-tuned teacher model, which is BERT-base in our case
- New loss function that combines the cross-entropy loss with knowledge distillation loss

In [None]:
from transformers import TrainingArguments

class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha = alpha
        self.temperature = temperature

In [None]:
# subclass Trainer and override compute_loss() to include knowledge distillation loss L_kd

import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher_model = teacher_model
        
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs_stu = model(**inputs)
        # Extract cross-entropy loss and logits from student
        loss_ce = outputs_stu.loss
        logits_stu = outputs_stu.logits
        # Extract logits from teacher
        with torch.no_grad():
            outputs_tea = self.teacher_model(**inputs)
            logits_tea = outputs_tea.logits
        # soften probabilities and compute distillation loss
        loss_fct = nn.KLDivLoss(reduction="batchmean") # average loss over batch dim
        loss_kd = self.args.temperature ** 2 * loss_fct(
            # inputs as log prob.
            F.log_softmax(logits_stu / self.args.temperature, dim=-1),
            # labels as normal prob.
            F.softmax(logits_tea / self.args.temperature, dim=-1)
        )
        # return weighted student loss
        loss = self.args.alpha * loss_ce + (1. - self.args.alpha) * loss_kd
        return (loss, outputs_stu) if return_outputs else loss

### Choosing a Good Student Initialisation

In general, pick a smaller model for the student to reduce the latency and memory footprint. Rule of thumb from the literature: distillation works best when teacher and student are of the same model type. This is possibly because different model types have different output embedding spaces, which hinders the student's ability to mimic the teacher.

In [None]:
from transformers import AutoTokenizer

# instantiate tokenizer from DistilBERT
student_ckpt = "distilbert-base-uncased"
student_tokenizer = AutoTokenizer.from_pretrained(student_ckpt)

def tokenize_text(batch):
    return student_tokenizer(batch["text"], truncation=True)

# remove text column as we no longer need
clinc_enc = clinc.map(tokenize_text, batched=True, remove_columns=["text"])
# rename intent to labels
clinc_enc = clinc_enc.rename_column("intent", "labels")

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Now we need to define hyper-parameters and `compute_metrics()` function for DistillationTrainer.

In [None]:
def compute_metrics(pred):
    predictions, labels = pred
    # convert logits to most probable prediction using argmax
    predictions = np.argmax(predictions, axis=1)
    # can reuse the accuracy_score metric loaded previously
    return accuracy_score.compute(predictions=predictions, references=labels)

In [None]:
batch_size = 48

finetuned_ckpt = "distilbert-base-uncased-finetuned-clinc"
student_training_args = DistillationTrainingArguments(
    output_dir=finetuned_ckpt, evaluation_strategy = "epoch",
    num_train_epochs=5, learning_rate=2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    alpha=1, weight_decay=0.01,
    push_to_hub=True
)

In [None]:
# mappings between each intent and label ID; can be obtained from BERT base model
id2label = pipe.model.config.id2label
label2id = pipe.model.config.label2id

In [None]:
from transformers import AutoConfig

# configuration for student with information about label mappings
# also specify number of classes our model should expect
num_labels = intents.num_classes
student_config = (AutoConfig.from_pretrained(
    student_ckpt, num_labels=num_labels, id2label=id2label, label2id=label2id
))

In [None]:
import torch
from transformers import AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def student_init():
    # can provide config in .from_pretrained
    return (AutoModelForSequenceClassification.from_pretrained(student_ckpt, config=student_config).to(device))

The above is everything needed for the distillation trainer (distilBERT). Now we can load the teacher and fine-tune.

In [None]:
!sudo apt-get install git-lfs

In [None]:
import gc
torch.cuda.empty_cache()
gc.collect()

In [None]:
teacher_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
teacher_model = (AutoModelForSequenceClassification.from_pretrained(
    teacher_ckpt, num_labels=num_labels).to(device))

distilbert_trainer = DistillationTrainer(
    model_init=student_init, teacher_model=teacher_model, args=student_training_args,
    train_dataset=clinc_enc["train"], eval_dataset=clinc_enc["validation"],
    compute_metrics=compute_metrics, tokenizer=student_tokenizer
)

distilbert_trainer.train()

In [None]:
# push model to hub for later re-use
distilbert_trainer.push_to_hub("Training completed!")

In [None]:
# use model in a pipeline for our performance benchmark
finetuned_ckpt = "stevevee0101/distilbert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=finetuned_ckpt)

In [None]:
# pass to PerformanceBenchmark to compute metrics
optim_type = "DistilBERT"
pb = PerformanceBenchmark(pipe, clinc["test"], optim_type=optim_type)
perf_metrics.update(pb.run_benchmark())

Scatterplot accuracy vs latency, with point radius being the model size on disk. 

In [None]:
import matplotlib.pyplot as plt  # needed by plot_metrics() below
import pandas as pd

def plot_metrics(perf_metrics, current_optim_type):
    df = pd.DataFrame.from_dict(perf_metrics, orient="index")
    
    for idx in df.index:
        df_opt = df.loc[idx]
        # add dashed circle around current optimisation type
        if idx == current_optim_type:
            plt.scatter(df_opt["time_avg_ms"], df_opt["accuracy"] * 100,
                       alpha=0.5, s=df_opt["size_mb"], label=idx,
                       marker="$\u25CC$")
        else:
            plt.scatter(df_opt["time_avg_ms"], df_opt["accuracy"] * 100,
                       s=df_opt["size_mb"], label=idx, alpha=0.5)
        
    legend = plt.legend(bbox_to_anchor=(1,1))
    for handle in legend.legendHandles:
        handle.set_sizes([20])
        
    plt.ylim(80, 90)
    
    # use slowest model to define x-axis range
    xlim = int(perf_metrics["BERT baseline"]["time_avg_ms"] + 3)
    plt.xlim(1, xlim)
    plt.ylabel("Accuracy (%)")
    plt.xlabel("Average latency (ms)")
    plt.show()
    
plot_metrics(perf_metrics, optim_type)

The smaller model significantly decreases the average latency, with only a 1% reduction in accuracy. Next, try to close the accuracy gap by including the distillation loss from the teacher and finding good values for $\alpha$ and $T$.

### Find Good Hyperparameters with Optuna

We could do a grid search, but a better alternative is to use *Optuna*, an optimisation framework. In Optuna, we find the minimum of a function $f(x,y)$ by defining an `objective()` function that returns the value of $f(x,y)$.

In [None]:
def objective(trial):
    # specifies parameter ranges to sample uniformly from
    x = trial.suggest_float("x", -2, 2)
    y = trial.suggest_float("y", -2, 2)
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

In [None]:
import optuna 

study = optuna.create_study()
# collects multiple trials as a study
study.optimize(objective, n_trials=1000)

In [None]:
# get best parameters once study is completed
study.best_params

This finds values reasonably close to the global minimum at (1, 1). Following similar logic, we define the hyper-parameter space we wish to optimise over.

In [None]:
# include number of training epochs

def hp_space(trial):
    return {
        "num_train_epochs": trial.suggest_int("num_train_epochs", 5, 10),
        "alpha": trial.suggest_float("alpha", 0, 1),
        "temperature": trial.suggest_int("temperature", 2, 20)
    }

In [None]:
best_run = distilbert_trainer.hyperparameter_search(
    # specify 'maximize' as we want the best accuracy possible
    n_trials=20, direction="maximize", hp_space=hp_space
)

In [None]:
print(best_run)

Alpha being 0.12 tells us most of the training signal is coming from the knowledge distillation term instead of cross-entropy loss.

In [None]:
# update training arguments with these values and run final training run
for k, v in best_run.hyperparameters.items():
    setattr(student_training_args, k, v)

In [None]:
# define a new repository to store our distilled model
distilled_ckpt = "distilbert-base-uncased-distilled-clinc"
student_training_args.output_dir = distilled_ckpt

# create a new trainer with optimal parameters
distil_trainer = DistillationTrainer(
    model_init=student_init, teacher_model=teacher_model,
    args=student_training_args, train_dataset=clinc_enc["train"], 
    eval_dataset=clinc_enc["validation"], compute_metrics=compute_metrics,
    tokenizer=student_tokenizer
)

distil_trainer.train();

In [None]:
# here the student matches the accuracy of the teacher despite being half the size!
# push to hub for future use
distil_trainer.push_to_hub("Training complete")

### Benchmarking our Distilled Model

In [None]:
# redo benchmark
distilled_ckpt = "transformersbook/distilbert-base-uncased-distilled-clinc"
pipe = pipeline("text-classification", model=distilled_ckpt)
optim_type = "Distillation"
pb = PerformanceBenchmark(pipe, clinc["test"], optim_type=optim_type)
perf_metrics.update(pb.run_benchmark())

In [None]:
plot_metrics(perf_metrics, optim_type)

Accuracy surpassed teacher! Possibly because teacher has not been fine-tuned as systematically as the student. We can compress our model even further with Quantisation.

### Quantisation of our Model

Reduce the precision of the weights and activations, e.g. to 8-bit integers instead of the usual 32-bit floats, so the model requires less memory; this can often be done with little to no loss in accuracy. Once a model is trained, we only need the forward pass to run inference, so we can reduce the precision of the data types without impacting accuracy too much. We can control the range and precision of a fixed-point number by adjusting the scaling factor.

We map the floating-point range to a smaller integer range and distribute the values linearly in between. Values outside the range get clamped; when reverting, dequantisation gives the nearest fixed-point number.
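The round trip described above can be sketched in NumPy as follows. The floating-point range of [-0.1, 0.1], the zero point of 0, and the helper names `quantize` and `dequantize` are assumptions chosen for illustration.

```python
import numpy as np

def quantize(f, scale, zero_point, q_min=-128, q_max=127):
    """Map floats to INT8: scale, shift, round, then clamp to the integer range."""
    q = np.round(f / scale + zero_point)
    return np.clip(q, q_min, q_max).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map INT8 values back to the nearest representable float."""
    return scale * (q.astype(np.float32) - zero_point)

# Assumed floating-point range and a zero point of 0
f_min, f_max = -0.1, 0.1
scale = (f_max - f_min) / (127 - (-128))
zero_point = 0

# The first and last values lie outside the range and get clamped
values = np.array([-0.25, -0.1, -0.031, 0.0, 0.042, 0.1, 0.3], dtype=np.float32)
q = quantize(values, scale, zero_point)
recovered = dequantize(q, scale, zero_point)

print(q)
print(recovered)
```

For values inside the range, the round-trip error is at most half the scale; values outside the range collapse onto the nearest end of the range.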

Transformers and DNN are good candidates for quantisation as the weights and activations take values in relatively small ranges. So we don't have to squeeze a huge range into the 256 numbers of INT8.

In [None]:
import matplotlib.pyplot as plt

state_dict = pipe.model.state_dict()
weights = state_dict["distilbert.transformer.layer.0.attention.out_lin.weight"]
plt.hist(weights.flatten().numpy(), bins=250, range=(-0.3, 0.3), edgecolor="C0")
plt.show();

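Weights are distributed in the range [-0.1, 0.1] around zero. If we want to quantise them as 8-bit integers, the range of values is $[q_{min}, q_{max}] = [-128, 127]$. The zero points of FP32 and INT8 coincide, so $Z = 0$. The affine mapping from a quantised value $q$ back to a floating-point value $f$ is:

$f = \left(\frac{f_{max}-f_{min}}{q_{max} - q_{min}}\right)(q-Z) = S(q-Z)$

where the scale factor is $S = \frac{f_{max}-f_{min}}{q_{max} - q_{min}}$.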

In [None]:
zero_point = 0
scale = (weights.max() - weights.min()) / (127 - (-128))

To obtain the quantised tensor, we invert the mapping to get $q=f/S + Z$, clamp the values, round to the nearest integer and represent the result in the `torch.int8` data type using the `Tensor.char()` function.

In [None]:
(weights / scale + zero_point).clamp(-128, 127).round().char()

So we just quantised our first tensor! In PyTorch, we can simplify this with the `quantize_per_tensor()` function and the quantised data type `torch.qint8`, which is optimised for integer arithmetic operations.

In [None]:
from torch import quantize_per_tensor

dtype = torch.qint8
quantized_weights = quantize_per_tensor(weights, scale, zero_point, dtype)
quantized_weights.int_repr()

In [None]:
%%timeit
weights @ weights

In [None]:
# use QFunctional wrapper so we can perform operations with torch.qint8 data type
from torch.nn.quantized import QFunctional

q_fn = QFunctional()

In [None]:
%%timeit
q_fn.mul(quantized_weights, quantized_weights)

Almost 100x faster! Even faster with dedicated backends for running quantised operators efficiently.

Also reduces memory storage by factor of 4! Test with example.

In [None]:
import sys

sys.getsizeof(weights.storage()) / sys.getsizeof(quantized_weights.storage())

Trade-off: Changing precision at each layer introduces small disturbances which can compound and affect the model's performance. Three (of many) typical ways to quantise:
- *Dynamic quantisation*: Nothing changes during training; the adaptations happen only at inference time. The weights are converted to INT8 ahead of inference, and the model's activations are also quantised, on the fly. However, activations are written to and read from memory in floating-point format, and this conversion between integer and float can be a performance bottleneck.
- *Static quantisation*: Precompute the quantisation scheme: calculate and save it ahead of time. However, this requires access to a good data sample to determine a good quantisation scheme, and it does not address the precision discrepancy that leads to a drop in the model's metrics.
- *Quantisation-aware training*: Train while rounding the FP32 values to mimic the effect of quantisation in both the forward and backward pass. This improves the model metrics over static and dynamic quantisation.
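The dynamic scheme from the list above can be sketched in plain NumPy for a single linear layer. This is an illustration under simplifying assumptions (symmetric per-tensor quantisation, zero point fixed at 0, hypothetical helper names), not how PyTorch or ONNX Runtime implement it internally.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Symmetric per-tensor quantisation to signed integers (zero point is 0)."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / q_max
    q = np.clip(np.round(x / scale), -q_max - 1, q_max).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(0, 0.05, size=(16, 32)).astype(np.float32)  # layer weights
x = rng.normal(0, 1.0, size=(4, 32)).astype(np.float32)    # batch of activations

# Weights are quantised once, ahead of inference
W_q, w_scale = quantize(W)

def dynamic_linear(x):
    # Activations are quantised on the fly for every call
    x_q, x_scale = quantize(x)
    # Integer matmul with a wider accumulator, then convert back to float
    acc = x_q.astype(np.int32) @ W_q.T.astype(np.int32)
    return acc * (x_scale * w_scale)

y_ref = x @ W.T
y_dyn = dynamic_linear(x)
max_abs_err = np.abs(y_ref - y_dyn).max()
print(f"Max abs error vs FP32: {max_abs_err:.5f}")
```

The conversion back to float at the end of `dynamic_linear` is the integer-to-float round trip that the first bullet identifies as a possible bottleneck.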

The biggest bottleneck for inference is the compute and memory bandwidth associated with the enormous number of weights in these models, so dynamic quantisation is currently the best fit for transformer-based models in NLP. In smaller models, the limiting factor is the memory bandwidth of the activations, so static quantisation is generally used.

In [None]:
# simple to implement dynamic quantisation and can be done with a single line

from torch.quantization import quantize_dynamic

model_ckpt = "transformersbook/distilbert-base-uncased-distilled-clinc"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = (AutoModelForSequenceClassification.from_pretrained(model_ckpt).to("cpu"))

# specify classes we wish to quantise. See how much int8 impacts accuracy
model_quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

In [None]:
# benchmark quantised model
pipe = pipeline("text-classification", model=model_quantized, tokenizer=tokenizer)
optim_type = "Distillation + quantization"
pb = PerformanceBenchmark(pipe, clinc["test"], optim_type=optim_type)
perf_metrics.update(pb.run_benchmark())

Quantised model almost half the size of distilled and even slight accuracy gain! Push to limit with framework called ONNX Runtime.

## Optimizing Inference with ONNX and ONNX Runtime

ONNX optimises computation by converting a NN to a computation graph and can run on heavily optimised accelerators; going down a layer of abstraction for increased speed. Must convert to ONNX format for this, and can achieve with the following steps:
1. Initialise model as pipeline
2. Run placeholder inputs through pipeline so ONNX can record the computational graph
3. Define dynamic axes to handle dynamic sequence lengths
4. Save graph with network parameters

In [None]:
import os
from psutil import cpu_count

# must set some OpenMP environment variables for ONNX
os.environ["OMP_NUM_THREADS"] = f"{cpu_count()}"
os.environ["OMP_WAIT_POLICY"] = "ACTIVE" # specifies that waiting threads should be active

OpenMP is designed for developing highly parallelized applications.

In [None]:
from transformers.convert_graph_to_onnx import convert

model_ckpt = "transformersbook/distilbert-base-uncased-distilled-clinc"
onnx_model_path = Path("onnx/model.onnx")
# wrap model in a transformers pipeline() function during conversion
# also pass tokenizer to initialise pipeline
convert(framework="pt", model=model_ckpt, tokenizer=tokenizer,
       output=onnx_model_path, opset=12, pipeline_name="text-classification")

ONNX uses *operator sets* to group together immutable operator specifications, so opset=12 corresponds to specific version of ONNX library.

In [None]:
from onnxruntime import (
    GraphOptimizationLevel, InferenceSession, SessionOptions
)
def create_model_for_provider(model_path, provider="CPUExecutionProvider"):
    options = SessionOptions()
    options.intra_op_num_threads = 1
    options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
    # create inference session to feed inputs to model
    session = InferenceSession(str(model_path), options, providers=[provider])
    session.disable_fallback()
    return session

In [None]:
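# create an inference session from the exported ONNX model
onnx_model = create_model_for_provider(onnx_model_path)
inputs = clinc_enc["test"][:1]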
del inputs["labels"] # requires input_ids and attention_mask as inputs; so drop labels
logits_onnx = onnx_model.run(None, inputs)[0] 
logits_onnx.shape # can get class logits

In [None]:
np.argmax(logits_onnx)

In [None]:
clinc_enc["test"][0]["labels"]

ONNX model is not compatible with text-classification pipeline, so create our own class that mimics the core behaviour.

In [None]:
from scipy.special import softmax

class OnnxPipeline:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        
    def __call__(self, query):
        model_inputs = self.tokenizer(query, return_tensors="pt")
        inputs_onnx = {k: v.cpu().detach().numpy() for k, v in model_inputs.items()}
        logits = self.model.run(None, inputs_onnx)[0][0, :]
        probs = softmax(logits)
        pred_idx = np.argmax(probs).item()
        return [{"label": intents.int2str(pred_idx), "score": probs[pred_idx]}]

In [None]:
pipe = OnnxPipeline(onnx_model, tokenizer)
pipe(query)

Now create a Performance Benchmark. Override previous class' `compute_size()` because we cannot rely on `state_dict` and `torch.save()` to measure a model's size since `onnx_model` is technically an ONNX InferenceSession object that doesn't have access to attributes of PyTorch's nn.Module.

In [None]:
class OnnxPerformanceBenchmark(PerformanceBenchmark):
    def __init__(self, *args, model_path, **kwargs):
        super().__init__(*args, **kwargs)
        self.model_path = model_path
        
    def compute_size(self):
        size_mb = Path(self.model_path).stat().st_size / (1024 * 1024)
        print(f"Model size (MB) - {size_mb:.2f}")
        return {"size_mb": size_mb}

In [None]:
# now see how our distilled model compares with ONNX format
optim_type = "Distillation + ORT"
pb = OnnxPerformanceBenchmark(pipe, clinc["test"], optim_type, model_path="onnx/model.onnx")
perf_metrics.update(pb.run_benchmark())

Latency has improved! ORT offers the same three ways to quantise a model as described above. We'll apply dynamic quantisation to our distilled model.

In ORT, the quantisation is applied through `quantize_dynamic()` function, which requires a path to the ONNX model to quantize, a target path to save the quantized model to, and the data type to reduce the weights to.

In [None]:
from onnxruntime.quantization import quantize_dynamic, QuantType

model_input = "onnx/model.onnx"
model_output = "onnx/model.quant.onnx"
quantize_dynamic(model_input, model_output, weight_type=QuantType.QInt8)

In [None]:
onnx_quantized_model = create_model_for_provider(model_output)
pipe = OnnxPipeline(onnx_quantized_model, tokenizer)
optim_type = "Distillation + ORT (quantized)"
pb = OnnxPerformanceBenchmark(pipe,clinc["test"], optim_type, model_path=model_output)
perf_metrics.update(pb.run_benchmark())

In [None]:
plot_metrics(perf_metrics, optim_type)

Reduced latency by ~30% compared to PyTorch's quantization. One reason is that PyTorch only optimizes `nn.Linear` modules, whereas ONNX also quantises the embedding layer. Almost a 3x gain compared to the BERT baseline.

Another strategy to reduce the size is to remove some weights altogether; this is called *weight pruning*.

## Weight Pruning: Making Models Sparser

The idea is to gradually remove weight connections during training so that the model becomes progressively sparser. The resulting model has a smaller number of nonzero parameters, which can then be stored in a compact sparse matrix format.

Mathematically, most weight pruning methods calculate a matrix $S$ of *importance scores* and then select the top *k* percent of weights by importance.

In [1]:
!pip install latexify-py


In [6]:
import math
import latexify

@latexify.with_latex
def topk(s_ij, top_k_pct):
    if s_ij in top_k_pct:
        return 1
    else:
        return 0

topk

$\mathrm{Top}_k(S)_{ij} = \begin{cases} 1 & \text{if } S_{ij} \text{ is in the top } k\% \\ 0 & \text{otherwise} \end{cases}$

So $k$ is a new hyperparameter that controls the amount of sparsity in the model. We can then define a *mask matrix* **M** that masks the weights $W_{ij}$ during the forward pass with some input $x_i$ and creates a sparse network of activations.
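As a rough illustration, the mask can be built and applied in NumPy as below. The importance score used here (the absolute value of each weight) is only a placeholder assumption, and `topk_mask` is a hypothetical helper.

```python
import numpy as np

def topk_mask(scores, keep_fraction):
    """Binary mask that is 1 for the top `keep_fraction` of scores, 0 elsewhere."""
    k = max(1, int(round(keep_fraction * scores.size)))
    threshold = np.sort(scores, axis=None)[-k]  # k-th largest score
    return (scores >= threshold).astype(scores.dtype)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))  # toy weight matrix
S = np.abs(W)                # placeholder importance scores (assumption)
M = topk_mask(S, keep_fraction=0.25)

# Forward pass with masked weights: pruned connections contribute nothing
x = rng.normal(size=6)
a = (W * M) @ x

print(M)
print(f"Sparsity: {1 - M.mean():.2f}")
```

Only the mask changes when a different scoring rule is used, so the same forward pass works for any of the pruning methods discussed next.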

Consider:
- Which weights to eliminate
- How to adjust remaining weights for best performance
- How to eliminate computationally efficiently

**Magnitude Pruning**: Score the weights by their magnitude and keep the most important ones until the desired sparsity is reached. Doing this iteratively is computationally demanding, since the model needs to be trained to convergence at each pruning step. It is therefore better to increase the sparsity gradually from an initial value to the final target, pruning most aggressively early on and tapering off later. The binary masks are updated periodically so that masked weights can reactivate during training and the model can recover from potential accuracy loss. However, the method was only designed for pure supervised learning and can make fine-tuning difficult, as it may remove important connections.
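One common way to express such a gradual schedule is a cubic sparsity scheduler. The sketch below assumes that form; the function name, parameter names, and the numbers in the usage lines are all illustrative.

```python
def cubic_sparsity(step, s_initial, s_final, start_step, n_pruning_steps, delta_t):
    """Target sparsity at a training step under a cubic schedule.

    Sparsity rises from s_initial to s_final over n_pruning_steps pruning
    events that are delta_t training steps apart, starting at start_step.
    """
    span = n_pruning_steps * delta_t
    if step <= start_step:
        return s_initial
    if step >= start_step + span:
        return s_final
    progress = (step - start_step) / span
    return s_final + (s_initial - s_final) * (1 - progress) ** 3

# Illustrative values: go from 0% to 90% sparsity over 10 pruning events
schedule = [
    cubic_sparsity(t, s_initial=0.0, s_final=0.9,
                   start_step=0, n_pruning_steps=10, delta_t=100)
    for t in range(0, 1001, 100)
]
print([round(s, 3) for s in schedule])
```

The increments are largest at the start and shrink towards the end, matching the idea of pruning heavily early on and tapering off.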

**Movement Pruning**: Gradually remove weights during fine-tuning so model becomes progressively sparser. Scores increase as weights move away from zero, so most important weights are furthest from zero.
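A toy sketch of how such scores could evolve is shown below. It assumes the scores accumulate $-\frac{\partial L}{\partial W_{ij}} W_{ij}$ over training steps, and it uses made-up weights with a fixed, made-up gradient rather than a real model.

```python
import numpy as np

# Toy weights and a fixed gradient dL/dW, both made up for illustration
W = np.array([0.5, -0.5, 0.5, -0.5])
grad_W = np.array([-1.0, 1.0, 1.0, -1.0])
lr = 0.1

scores = np.zeros_like(W)
for _ in range(3):
    # Assumed score update: S <- S - lr * dL/dW * W
    scores -= lr * grad_W * W
    # Plain gradient descent step on the weights
    W -= lr * grad_W

print("weights:", W)
print("scores: ", scores)
```

The first two weights move away from zero and end up with positive scores, while the last two shrink towards zero and end up with negative scores, so they would be pruned first.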

Note, though, that sparse matrix operations are not well supported by current hardware.