## Note on Reproducibility and Scope

This notebook illustrates the model-compression techniques explored in our project. Because the original experiments used proprietary Google infrastructure, the full production pipeline cannot be released publicly. Here we provide a small demo of the knowledge distillation, quantization, and pruning flow.


### Setup

In Google Ads' internal conversion value prediction system, a large teacher model with a multi-head architecture predicts conversion outcomes and is trained with a Poisson log loss.

The student model is a smaller, more efficient neural network designed to mimic the teacher's behavior. Through knowledge distillation, the student learns to approximate the teacher's predictions while using significantly fewer parameters, enabling faster inference and lower serving costs in production.

To illustrate this pipeline, we define a synthetic conversion prediction task. Each training sample consists of:
  - A random feature vector
  - A binary label indicating whether a conversion occurred
  - A non-negative conversion count (with many zeros, reflecting real-world sparsity)

---

### Model Architectures

Teacher Model
- Architecture: [128, 64, 32] - three dense hidden layers
- Three output heads:
  - `conv_prob`: Sigmoid activation (conversion probability)
  - `conv_value`: Softplus activation (non-negative count for Poisson loss)
  - `prob_logits`: Raw logits for distillation

Student Model
- Architecture: [64, 32] - two hidden layers (fixed to ensure dimension compatibility with the teacher's bottleneck layer)
- Same output heads and activations as the teacher
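
To make the size gap between the two architectures concrete, the sketch below counts trainable parameters by hand. It assumes the 10 input features used later in this notebook and treats `prob_logits` and `conv_value` as the only heads with weights (`conv_prob` is a sigmoid applied to `prob_logits`). The helper names are illustrative and not part of the notebook.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def count_params(num_features, hidden_units, num_heads=2):
    """Total trainable parameters for a stack of dense layers plus 1-unit heads."""
    total, n_in = 0, num_features
    for units in hidden_units:
        total += dense_params(n_in, units)
        n_in = units
    # `prob_logits` and `conv_value` are Dense(1) heads; `conv_prob` is a
    # parameter-free sigmoid on top of `prob_logits`.
    total += num_heads * dense_params(n_in, 1)
    return total


teacher_params = count_params(10, [128, 64, 32])
student_params = count_params(10, [64, 32])
print(teacher_params, student_params, round(teacher_params / student_params, 2))
# 11810 2850 4.14
```

Under these assumptions the student carries roughly a quarter of the teacher's parameters before any quantization or pruning is applied.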

---
### Hyperparameter Sweep Configuration

We include an example config with sweeps to show how experiments can read the config and sweep over the specified hyperparameters. This example is much simpler than the pyplan config options available at Google, but it demonstrates the high-level workflow.
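
As a rough sketch of how such a sweep expands into individual runs, the snippet below takes the Cartesian product over every list in a nested config. The `expand_sweep` helper and the dotted key names are hypothetical; the notebook's own code reads the three lists explicitly instead.

```python
import itertools

# Hypothetical config in the same nested {section: {name: [values]}} shape.
config = {
    "student": {"dropout_rate": [0.0, 0.1, 0.2]},
    "distillation": {"alpha": [0.1, 0.5], "temperature": [3.0]},
}


def expand_sweep(cfg):
    """Yield one flat {"section.name": value} dict per point of the grid."""
    keys, value_lists = [], []
    for section, params in cfg.items():
        for name, values in params.items():
            keys.append(f"{section}.{name}")
            value_lists.append(values)
    for combo in itertools.product(*value_lists):
        yield dict(zip(keys, combo))


runs = list(expand_sweep(config))
print(len(runs))   # 3 * 2 * 1 = 6 combinations
print(runs[0])
```

A generic expansion like this keeps working when a new hyperparameter is added to the config, at the cost of less explicit argument names.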

This setup demonstrates how model compression via teacher-student training can retain accuracy while significantly improving deployment efficiency in large-scale systems such as Google Ads.


In [49]:
import numpy as np
import tensorflow as tf
import itertools
import os

# example of user config input (in reality would be separate file)
HYPERPARAM_CONFIG = {
    "student": {
        "dropout_rate": [0.0, 0.1, 0.2]
    },
    "distillation": {
        "alpha": [0.1, 0.5],
        "temperature": [3.0]
    }
}

# fix random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

num_features = 10
num_samples = 10000
X_train = np.random.normal(size=(num_samples, num_features)).astype(np.float32)

# simulate a scenario where conversion count follows a Poisson distribution based on features.
# define an arbitrary "true" weight vector for Poisson rate:
true_w = np.random.normal(scale=0.5, size=(num_features, 1))

# compute a latent log-rate and apply exp to get Poisson lambda
log_lambda = X_train.dot(true_w)  # shape (num_samples, 1)
lambda_vals = np.exp(log_lambda).flatten()

# sample conversion counts from Poisson(lambda). (Clip to some max for safety)
y_count = np.random.poisson(lam=lambda_vals).astype(np.float32)
y_count = np.clip(y_count, 0, 20)  # limit extreme values for stability

# conversion occurred or not (binary label) – 1 if count > 0 else 0
y_conv = (y_count > 0).astype(np.float32)

# prepare targets for Keras
y_train = {
    "conv_prob": y_conv,
    "conv_value": y_count,
    "prob_logits": np.zeros_like(y_conv)
}

def build_model(hidden_units, dropout=0.0, name="Model"):
    inputs = tf.keras.Input(shape=(10,), name="features")
    x = inputs

    for i, units in enumerate(hidden_units):
        layer_name = "bottleneck" if i == len(hidden_units) - 1 else f"hidden_{i}"
        x = tf.keras.layers.Dense(units, activation='relu', name=layer_name)(x)
        if dropout > 0:
            x = tf.keras.layers.Dropout(dropout)(x)

    prob_logits = tf.keras.layers.Dense(1, activation=None, name="prob_logits")(x)
    prob_output = tf.keras.layers.Activation('sigmoid', name="conv_prob")(prob_logits)
    value_output = tf.keras.layers.Dense(1, activation=tf.nn.softplus, name="conv_value")(x)

    # return a dictionary of outputs to match the keys in loss/y_train
    return tf.keras.Model(
        inputs=inputs,
        outputs={
            "conv_prob": prob_output,
            "conv_value": value_output,
            "prob_logits": prob_logits
        },
        name=name
    )

### Synthetic Training Setup

We generate random input features and construct labels to mimic a realistic conversion prediction task:

- The conversion count `y_count` is sampled from a Poisson distribution:  
  $$
  y_{\text{count}} \sim \text{Poisson}(\exp(w \cdot x))
  $$
- The binary conversion label `y_conv` is defined as:
  $$
  y_{\text{conv}} = \mathbb{1}(y_{\text{count}} > 0)
  $$
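
One consequence of these two definitions is that the binary label is itself tied to the Poisson rate: since $P(y_{\text{count}} = 0) = \exp(-\lambda)$, the label is 1 with probability $1 - \exp(-\lambda)$. The sketch below checks this numerically for a single illustrative rate; the value of `lam` and the sample size are arbitrary choices, not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.8                      # illustrative Poisson rate, i.e. exp(w . x) for one x
counts = rng.poisson(lam=lam, size=200_000)

empirical_conv_rate = float(np.mean(counts > 0))
analytic_conv_rate = 1.0 - np.exp(-lam)   # P(count > 0) = 1 - P(count = 0)

print(f"empirical: {empirical_conv_rate:.3f}, analytic: {analytic_conv_rate:.3f}")
```

Small rates therefore produce mostly zero counts and mostly negative labels, which is the sparsity the synthetic task is meant to mimic.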

---

### Teacher Model Training Objective

The teacher model is trained using two loss functions:

- Binary Cross-Entropy Loss for the `conv_prob` output, modeling the probability of a conversion
- Poisson Loss (`tf.keras.losses.Poisson`) for the `conv_value` output, modeling the conversion count or value

The Poisson loss in TensorFlow implements the negative Poisson log-likelihood:
$$
L = y_{\text{pred}} - y_{\text{true}} \log(y_{\text{pred}})
$$
which matches the standard Poisson log loss up to constant terms that do not affect optimization.
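
The "up to constant terms" statement can be checked with a small NumPy sketch. The two functions below are illustrative re-implementations, not TensorFlow's code; `eps` is an assumed small constant that guards against `log(0)`. The gap between the full negative log-likelihood and the shortened loss is the mean of $\log(y_{\text{true}}!)$, which does not depend on the predictions.

```python
import math
import numpy as np


def poisson_loss(y_true, y_pred, eps=1e-7):
    """Mean of y_pred - y_true * log(y_pred); eps guards against log(0)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean(y_pred - y_true * np.log(y_pred + eps)))


def poisson_nll(y_true, y_pred, eps=1e-7):
    """Full negative log-likelihood, including the log(y!) term."""
    log_factorial = np.mean([math.lgamma(y + 1.0) for y in y_true])
    return poisson_loss(y_true, y_pred, eps) + float(log_factorial)


y_true = [0.0, 1.0, 3.0, 5.0]
gaps = []
for scale in (0.5, 1.0, 2.0):
    y_pred = [scale * (y + 0.5) for y in y_true]
    gaps.append(poisson_nll(y_true, y_pred) - poisson_loss(y_true, y_pred))
    print(f"scale={scale}: gap={gaps[-1]:.4f}")   # same gap for every prediction
```

Because the gap is constant, both versions have the same gradients with respect to the predictions, so minimizing either one gives the same model.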

Here, the teacher model is trained briefly on synthetic data. In a real production setting, the teacher would be pre-trained on a massive dataset with richer features and longer training schedules.


## 2. Knowledge Distillation Training (Hard + Soft Targets)

Knowledge distillation (KD) compresses a large teacher model into a smaller student by training the student on both:
- Hard targets (ground-truth labels), and  
- Soft targets (the teacher's predictions).

The key idea is that the teacher's outputs encode *dark knowledge* about the underlying function, helping the student generalize better than training on labels alone.

---

### Distillation Loss

The student is trained with a weighted combination of losses:
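$$
L_{\text{total}} = (1 - \alpha) \, L_{\text{hard}}(y_{\text{true}}, y_{\text{student}})
+ \alpha \, T^2 \, L_{\text{soft}}(y_{\text{teacher}}, y_{\text{student}})
+ \gamma \, L_{\text{feat}}
$$
where alpha balances supervision from the true labels against supervision from the teacher, T is the distillation temperature, and gamma weights an optional feature-matching term (described after the training code). This is the weighting used in the training code in this notebook.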

- Hard losses:
  - Binary cross-entropy for conversion probability  
  - Poisson loss for conversion count  
- Soft losses:
  - Mean squared error (MSE) between teacher and student outputs for each head, with the probability head compared after dividing the logits by the temperature

The teacher's weights are frozen, and the student is trained using the combined distillation loss.
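
The following NumPy sketch shows one way to combine the hard and soft terms for the probability head only. It mirrors the weighting used in this notebook's training code, where the hard loss is scaled by `1 - alpha` and the soft loss by `alpha * temp**2`. The logits are made-up values, and the helper names are illustrative rather than part of the notebook.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(y_true, p, eps=1e-7):
    """Mean binary cross-entropy; clipping avoids log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))


def distillation_loss(y_true, student_logits, teacher_logits, alpha, temp):
    """(1 - alpha) * hard BCE + alpha * temp**2 * soft MSE, probability head only."""
    hard = bce(y_true, sigmoid(student_logits))
    # Dividing logits by temp > 1 pulls both probabilities toward 0.5.
    soft_diff = sigmoid(teacher_logits / temp) - sigmoid(student_logits / temp)
    soft = float(np.mean(soft_diff ** 2))
    return (1.0 - alpha) * hard + alpha * temp**2 * soft


y_true = np.array([1.0, 0.0, 1.0, 0.0])
teacher_logits = np.array([2.0, -1.5, 0.5, -3.0])   # made-up values
student_logits = np.array([1.0, -0.5, 0.2, -1.0])   # made-up values

for alpha in (0.1, 0.5):
    loss = distillation_loss(y_true, student_logits, teacher_logits, alpha, temp=3.0)
    print(alpha, round(loss, 4))
```

With `alpha = 0` the expression reduces to ordinary supervised training, and when the student reproduces the teacher's logits exactly the soft term vanishes.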


In [None]:
def run_experiment(dropout, alpha, temp, X_train, y_train, teacher_weights):

    # build student and teacher
    student_model = build_model([64, 32], dropout=dropout, name="Student")
    teacher_model = build_model([128, 64, 32], name="Teacher")
    teacher_model.set_weights(teacher_weights)
    teacher_model.trainable = False

    # feature extractors
    teacher_feat_extractor = tf.keras.Model(inputs=teacher_model.inputs, outputs=teacher_model.get_layer("bottleneck").output)
    student_feat_extractor = tf.keras.Model(inputs=student_model.inputs, outputs=student_model.get_layer("bottleneck").output)

    # training loop
    batch_size = 64
    epochs = 2
    learning_rate = 0.001
    gamma = 0.1

    optimizer = tf.keras.optimizers.Adam(learning_rate)
    dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size)

    final_loss = 0.0

    for epoch in range(epochs):
        epoch_loss = 0.0
        steps = 0
        for x_batch, y_batch in dataset:
            # teacher inference
            t_preds = teacher_model(x_batch, training=False)
            t_prob, t_val, t_logits = t_preds["conv_prob"], t_preds["conv_value"], t_preds["prob_logits"]

            with tf.GradientTape() as tape:
                # student training (dictionary output)
                s_preds = student_model(x_batch, training=True)
                s_prob = s_preds["conv_prob"]
                s_val = s_preds["conv_value"]
                s_logits = s_preds["prob_logits"]

                # squeeze outputs to match label dimensions
                s_prob_squeezed = tf.squeeze(s_prob, axis=-1)
                s_val_squeezed = tf.squeeze(s_val, axis=-1)

                # hard loss (compare student output to ground truth labels)
                loss_hard = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_batch["conv_prob"], s_prob_squeezed)) + \
                            tf.reduce_mean(tf.keras.losses.Poisson()(y_batch["conv_value"], s_val_squeezed))

                # soft loss (compare temperature-softened student and teacher outputs)
                t_prob_soft = tf.nn.sigmoid(t_logits / temp)
                s_prob_soft = tf.nn.sigmoid(s_logits / temp)
                mse = tf.keras.losses.MeanSquaredError()
                loss_soft = tf.reduce_mean(mse(t_prob_soft, s_prob_soft)) + \
                            tf.reduce_mean(mse(t_val, s_val))

                # feature loss
                feat_loss = 0.0
                if gamma > 0:
                    t_feat = teacher_feat_extractor(x_batch, training=False)
                    s_feat = student_feat_extractor(x_batch, training=True)
                    feat_loss = tf.reduce_mean(mse(t_feat, s_feat))

                total_loss = (1 - alpha) * loss_hard + (alpha * (temp**2) * loss_soft) + (gamma * feat_loss)

            grads = tape.gradient(total_loss, student_model.trainable_weights)
            optimizer.apply_gradients(zip(grads, student_model.trainable_weights))
            epoch_loss += total_loss
            steps += 1

        final_loss = float(epoch_loss / steps)

    return final_loss, student_model


def generate_configs():
    dropout_list = HYPERPARAM_CONFIG['student']['dropout_rate']
    alpha_list = HYPERPARAM_CONFIG['distillation']['alpha']
    temp_list = HYPERPARAM_CONFIG['distillation']['temperature']

    for dropout, alpha, temp in itertools.product(dropout_list, alpha_list, temp_list):
        label = f"Dropout={dropout}, Alpha={alpha}, Temp={temp}"
        yield (dropout, alpha, temp), label


## train teacher (stays constant)
print("Training Base Teacher ([128, 64, 32])...")
base_teacher = build_model([128, 64, 32], name="Teacher")
base_teacher.compile(optimizer='adam',
                      loss={"conv_prob": "binary_crossentropy",
                            "conv_value": "poisson",
                            "prob_logits": None})
base_teacher.fit(X_train, y_train, epochs=5, batch_size=64, verbose=0)
teacher_weights = base_teacher.get_weights()
print("Teacher Ready.\n")


## run experiments with all sweep combinations
results = []
print(f"{'RUN CONFIGURATION':<50} | {'LOSS':<8}")
print("-" * 70)

for (dropout, alpha, temp), label in generate_configs():
    loss, _ = run_experiment(dropout, alpha, temp, X_train, y_train, teacher_weights)
    print(f"{label:<50} | {loss:.4f}")
    results.append((label, loss, dropout, alpha, temp))

print("-" * 70)
best_run = min(results, key=lambda x: x[1])
print(f"Best Configuration: {best_run[0]}")
print(f"Lowest Loss: {best_run[1]:.4f}")

best_label, best_loss, best_dropout, best_alpha, best_temp = best_run


### Distillation Training Loop Summary

In each training batch, we compute:

- Teacher predictions: Conversion probability and value from the frozen teacher (soft targets).
- Student predictions: Corresponding outputs from the student model.

---

### Loss Components

- Hard loss:  
  - Binary cross-entropy between true conversion labels and student probabilities  
  - Poisson loss between true counts and student predicted values  

- Soft loss:  
  - Mean squared error (MSE) between teacher and student outputs for both heads  

- Feature matching loss (optional):  
  - MSE between intermediate teacher (“hint”) and student (“guided”) representations, encouraging the student to mimic the teacher's internal features  
  - Scaled by a factor gamma (set to 0 to disable)
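
A minimal sketch of the feature matching term is shown below, using random arrays as stand-ins for the bottleneck activations. It assumes both representations have the same width (32 units in this notebook); the `feature_matching_loss` helper is hypothetical and simply rejects mismatched shapes instead of projecting them.

```python
import numpy as np


def feature_matching_loss(teacher_feat, student_feat):
    """Mean squared error between two (batch, width) activation matrices."""
    if teacher_feat.shape != student_feat.shape:
        # With different widths a learned projection would be needed first;
        # that case is outside this sketch.
        raise ValueError(f"shape mismatch: {teacher_feat.shape} vs {student_feat.shape}")
    return float(np.mean((teacher_feat - student_feat) ** 2))


rng = np.random.default_rng(0)
teacher_feat = rng.normal(size=(64, 32))                        # stand-in hint activations
student_feat = teacher_feat + 0.1 * rng.normal(size=(64, 32))   # stand-in guided activations

gamma = 0.1
print(gamma * feature_matching_loss(teacher_feat, student_feat))
```

Keeping the student's last hidden layer at the teacher's bottleneck width is what lets this term be computed without an extra projection layer.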

---

### Optimization

The total student loss is a weighted sum of hard loss, soft loss, and feature matching loss. In this example:
- alpha is swept over [0.1, 0.5] (from the config) and weights the soft loss, with the hard loss weighted by 1 - alpha
- gamma = 0.1 lightly weights feature matching  

The student is trained for a few epochs, during which the distillation loss decreases, showing that it is learning from both ground truth and teacher outputs.

At this stage, the student model closely approximates the teacher. Next, we apply additional compression techniques such as quantization and pruning to further reduce model size and inference cost.


## 3. Post-Training Quantization (PTQ)

Post-training quantization (PTQ) reduces model size and improves inference efficiency after training, without updating model weights. In this implementation, we use dynamic range quantization, which converts weights from float32 to int8 while keeping activations as float32. This approach typically achieves ~2-4x model size reduction with minimal accuracy loss.

---

### PTQ with TensorFlow Lite

PTQ is applied using the TensorFlow Lite converter, which can directly convert a trained `tf.keras` model. Dynamic range quantization is the simplest form of PTQ—it only requires setting the optimization flag and does not require a representative dataset for calibration, making it straightforward to apply while still providing significant model compression.
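
To give some intuition for what weight-only int8 quantization does, the sketch below applies a simple symmetric, per-tensor scheme to a random stand-in kernel. This illustrates the general idea only; it is not TensorFlow Lite's implementation, whose exact scheme (for example, which tensors are quantized and at what granularity) may differ.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 -> (int8 values, float scale)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(64, 32)).astype(np.float32)   # stand-in kernel

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("bytes fp32:", w.nbytes, "bytes int8:", q.nbytes)        # 8192 vs 2048
print("max abs error:", float(np.max(np.abs(w - w_hat))), "half step:", scale / 2)
```

Each weight shrinks from four bytes to one, and the rounding error per weight is bounded by half a quantization step, which is why the effect on predictions is usually small.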

---

### Setup

Here, we apply PTQ to the best-performing student model from the distillation experiments:
- Convert the Keras model to a TFLite-compatible wrapper (single output instead of dictionary)
- Create both a float32 baseline and a quantized version using `tf.lite.Optimize.DEFAULT`
- Compare file sizes and verify prediction accuracy

This produces a compact, efficient model suitable for low-latency, production inference.

In [51]:
def quantize_verify(student_model, X_train, y_train, config_label):
  print(f"Post-training quantization: {config_label}")
  wrapper_input = tf.keras.Input(shape=(10,), name="input")
  student_outputs = student_model(wrapper_input)
  wrapper_model = tf.keras.Model(inputs=wrapper_input, outputs=student_outputs["conv_prob"], name="wrapper")

  # convert to FP32 TFLite model (baseline)
  converter_fp32 = tf.lite.TFLiteConverter.from_keras_model(wrapper_model)
  tflite_model_fp32 = converter_fp32.convert()

  # convert with dynamic range quantization (weights only)
  converter_quant = tf.lite.TFLiteConverter.from_keras_model(wrapper_model)
  converter_quant.optimizations = [tf.lite.Optimize.DEFAULT]
  tflite_model_quant = converter_quant.convert()

  with open("student_model_fp32.tflite", "wb") as f:
      f.write(tflite_model_fp32)
  with open("student_model_quantized.tflite", "wb") as f:
      f.write(tflite_model_quant)

  fp32_size = os.path.getsize("student_model_fp32.tflite") / 1024
  quant_size = os.path.getsize("student_model_quantized.tflite") / 1024
  compression_ratio = fp32_size / quant_size if quant_size > 0 else 0
  size_reduction = ((fp32_size - quant_size) / fp32_size * 100) if fp32_size > 0 else 0

  print(f"   FP32 TFLite model size:       {fp32_size:.2f} KB")
  print(f"   Quantized TFLite model size:  {quant_size:.2f} KB")
  print(f"   Compression ratio:            {compression_ratio:.2f}x")
  print(f"   Size reduction:               {size_reduction:.1f}%")

  # load interpreters
  interpreter_fp32 = tf.lite.Interpreter(model_content=tflite_model_fp32)
  interpreter_fp32.allocate_tensors()
  input_details_fp32 = interpreter_fp32.get_input_details()
  output_details_fp32 = interpreter_fp32.get_output_details()

  interpreter_quant = tf.lite.Interpreter(model_content=tflite_model_quant)
  interpreter_quant.allocate_tensors()
  input_details_quant = interpreter_quant.get_input_details()
  output_details_quant = interpreter_quant.get_output_details()

  print(f"\n   {'Example':<10} | {'Original':<12} | {'FP32 TFLite':<12} | {'Quantized':<12} | {'Quant Error':<12}")
  print("   " + "-"*70)

  total_error = 0.0
  num_examples = 10

  for i in range(num_examples):
      x = X_train[i:i+1]

      # original model prediction
      orig_pred = student_model(x, training=False)
      orig_prob = float(orig_pred["conv_prob"][0, 0])

      # FP32 TFLite model prediction
      interpreter_fp32.set_tensor(input_details_fp32[0]['index'], x)
      interpreter_fp32.invoke()
      fp32_prob = float(interpreter_fp32.get_tensor(output_details_fp32[0]['index'])[0, 0])

      # quantized TFLite model prediction
      interpreter_quant.set_tensor(input_details_quant[0]['index'], x)
      interpreter_quant.invoke()
      quant_prob = float(interpreter_quant.get_tensor(output_details_quant[0]['index'])[0, 0])

      error = abs(orig_prob - quant_prob)
      total_error += error

      print(f"   {i:<10} | {orig_prob:<12.4f} | {fp32_prob:<12.4f} | {quant_prob:<12.4f} | {error:<12.4f}")

  avg_error = total_error / num_examples
  print(f"\n   Average quantization error: {avg_error:.4f}")

  print("\n" + "="*80)

  return {
      "fp32_size_kb": fp32_size,
      "quantized_size_kb": quant_size,
      "compression_ratio": compression_ratio,
      "size_reduction_pct": size_reduction,
      "avg_quantization_error": avg_error
  }

### PTQ Results and Validation

After conversion, we compare the file sizes of the float32 and quantized TFLite models. The quantized model achieves approximately 1.75x compression (43% size reduction), reducing the model from ~13 KB to ~7.5 KB. While not the full 4x compression possible with full integer quantization, dynamic range quantization provides a good balance between compression and ease of implementation.
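
The compression ratio and the percentage reduction are two views of the same quantity: a ratio of $r$ corresponds to a reduction of $1 - 1/r$. The short helper below (a hypothetical name, not part of the notebook) makes the conversion explicit.

```python
def size_reduction_pct(compression_ratio):
    """Percentage of bytes saved for a given fp32/quantized size ratio."""
    return (1.0 - 1.0 / compression_ratio) * 100.0


for ratio in (1.75, 2.0, 4.0):
    print(f"{ratio:.2f}x -> {size_reduction_pct(ratio):.1f}% smaller")
# 1.75x -> 42.9% smaller
# 2.00x -> 50.0% smaller
# 4.00x -> 75.0% smaller
```

A 1.75x ratio therefore matches the roughly 43% reduction quoted above, while a 4x ratio would correspond to a 75% reduction.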

Post-training quantization does not require retraining, making it simple to apply. The main benefit of dynamic range quantization is that it introduces minimal accuracy loss. In our validation, the average quantization error is only 0.0003, meaning predictions remain nearly identical to the original model.

Finally, we validate the quantized model by running inference with a TensorFlow Lite interpreter on the first 10 training examples, comparing predictions from the original Keras model, the float32 TFLite model (to verify conversion accuracy), and the quantized model (to measure quantization impact). The results confirm the quantized model produces accurate outputs with negligible degradation.

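*Note on Loss Values*: The reported training loss may be negative because of the Poisson loss component. That loss omits the constant $\log(y_{\text{true}}!)$ term of the full log-likelihood, so it can fall below zero when predictions and targets are large. For example, $y_{\text{pred}} = y_{\text{true}} = 5$ gives $5 - 5 \log 5 \approx -3.05$.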


In [None]:
# run post-training quantization on the best model
print("Retraining best configuration for quantization...")
print(f"Config: Dropout={best_dropout}, Alpha={best_alpha}, Temp={best_temp}")

# retrain the best model
_, best_student_model = run_experiment(best_dropout, best_alpha, best_temp,
                                       X_train, y_train, teacher_weights)

# run quantization and verification
quant_results = quantize_verify(best_student_model, X_train, y_train, best_label)

print(f"\nFinal Summary:")
print(f"  Best Training Loss:        {best_loss:.4f}")
print(f"  FP32 Model Size:           {quant_results['fp32_size_kb']:.2f} KB")
print(f"  Quantized Model Size:      {quant_results['quantized_size_kb']:.2f} KB")
print(f"  Size Reduction:            {quant_results['size_reduction_pct']:.1f}%")
print(f"  Avg Quantization Error:    {quant_results['avg_quantization_error']:.4f}")


## 4. Model Pruning for Weight Sparsity

Pruning reduces model size by removing low-impact weights, making the network sparse. Many neural network weights contribute little to predictions, so zeroing them out and retraining can significantly reduce model size with minimal accuracy loss.

---

### Pruning Workflow

The code below implements magnitude-based pruning to compress the best-performing student model at target sparsity levels of 30%, 50%, and 70%. For each layer, it computes a threshold from the absolute weight values and zeroes out the weights that fall below it. It then fine-tunes the model to recover accuracy, re-applying the pruning masks after every gradient update so that the zeroed weights stay at zero. Finally, it compares predictions on a small sample of inputs before and after pruning to measure how much the outputs change at each sparsity level.
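
The core of this workflow, stripped of the model and the training loop, can be sketched in NumPy as follows. The kernel and the "gradient" are random stand-ins, and `magnitude_mask` is an illustrative helper rather than the notebook's function.

```python
import numpy as np


def magnitude_mask(kernel, sparsity):
    """Return a 0/1 mask that drops the smallest-magnitude `sparsity` fraction."""
    threshold = np.percentile(np.abs(kernel), sparsity * 100)
    return (np.abs(kernel) >= threshold).astype(kernel.dtype)


rng = np.random.default_rng(0)
kernel = rng.normal(size=(64, 32))          # stand-in dense kernel
mask = magnitude_mask(kernel, sparsity=0.7)
kernel = kernel * mask

# One stand-in "gradient step", followed by re-applying the mask so that
# pruned entries cannot drift away from zero during fine-tuning.
fake_grad = rng.normal(size=kernel.shape)
kernel = (kernel - 0.01 * fake_grad) * mask

print(f"sparsity: {np.mean(kernel == 0):.2%}")
```

Because the threshold is computed per kernel, every layer ends up near the target sparsity regardless of how its weight magnitudes compare with those of other layers.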




In [53]:
def prune_and_evaluate(student_model, X_train, y_train, sparsity_target=0.5):
    print(f"Magnitude-based pruning (target sparsity: {sparsity_target*100:.0f}%)")

    # clone model
    pruned_model = tf.keras.models.clone_model(student_model)
    pruned_model.set_weights(student_model.get_weights())

    # apply magnitude-based pruning and create masks
    pruning_masks = {}
    total_weights_pruned = 0
    total_weights = 0

    for i, layer in enumerate(pruned_model.layers):
        if hasattr(layer, 'kernel'):  # only prune layers with weights
            weights = layer.get_weights()
            kernel = weights[0]

            # find threshold value below which weights will be zeroed
            flat_weights = np.abs(kernel.flatten())
            threshold = np.percentile(flat_weights, sparsity_target * 100)

            # create mask: 0 for pruned weights, 1 for kept weights
            mask = (np.abs(kernel) >= threshold).astype(np.float32)
            pruned_kernel = kernel * mask

            # track pruning statistics
            weights_in_layer = kernel.size
            pruned_in_layer = np.sum(mask == 0)
            total_weights += weights_in_layer
            total_weights_pruned += pruned_in_layer

            # store mask and update layer weights
            pruning_masks[layer.name] = mask
            weights[0] = pruned_kernel
            layer.set_weights(weights)

    # fine-tune and preserve pruning
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
    dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(64)
    for epoch in range(2):
        for x_batch, y_batch in dataset:
            with tf.GradientTape() as tape:
                predictions = pruned_model(x_batch, training=True)

                # calculate losses
                prob_squeezed = tf.squeeze(predictions["conv_prob"], axis=-1)
                val_squeezed = tf.squeeze(predictions["conv_value"], axis=-1)

                loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_batch["conv_prob"], prob_squeezed)) + \
                       tf.reduce_mean(tf.keras.losses.Poisson()(y_batch["conv_value"], val_squeezed))

            # apply gradients and then re-apply masks to ensure zeros stay zero
            gradients = tape.gradient(loss, pruned_model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, pruned_model.trainable_variables))

            # re-zero pruned weights after each gradient update
            for layer in pruned_model.layers:
                if layer.name in pruning_masks and hasattr(layer, 'kernel'):
                    weights = layer.get_weights()
                    weights[0] = weights[0] * pruning_masks[layer.name]
                    layer.set_weights(weights)

    def get_sparsity(model, threshold=1e-7):
        zero_count = 0
        total_count = 0
        layer_stats = []

        for layer in model.layers:
            if hasattr(layer, 'kernel'):
                weights = layer.kernel.numpy()
                layer_zeros = np.sum(np.abs(weights) < threshold)
                layer_total = weights.size
                zero_count += layer_zeros
                total_count += layer_total
                layer_stats.append({
                    'name': layer.name,
                    'zeros': layer_zeros,
                    'total': layer_total,
                    'sparsity': (layer_zeros / layer_total * 100) if layer_total > 0 else 0
                })

        overall_sparsity = (zero_count / total_count * 100) if total_count > 0 else 0
        return overall_sparsity, layer_stats

    original_sparsity, _ = get_sparsity(student_model)
    pruned_sparsity, layer_stats = get_sparsity(pruned_model)

    print(f"Original model sparsity:  {original_sparsity:.2f}%")
    print(f"Pruned model sparsity:    {pruned_sparsity:.2f}%")
    print(f"{'Layer':<20} | {'Zeros':<10} | {'Total':<10} | {'Sparsity':<10}")
    print("-"*60)
    for stat in layer_stats:
        print(f"{stat['name']:<20} | {stat['zeros']:<10} | {stat['total']:<10} | {stat['sparsity']:<10.2f}%")

    # compare predictions
    print("\nPredictions comparison on test samples...")
    print(f"{'Example':<10} | {'Original':<12} | {'Pruned':<12} | {'Difference':<12}")
    print("-"*55)

    total_diff = 0.0
    num_examples = 10

    for i in range(num_examples):
        x = X_train[i:i+1]

        orig_pred = student_model(x, training=False)
        orig_prob = float(orig_pred["conv_prob"][0, 0])

        pruned_pred = pruned_model(x, training=False)
        pruned_prob = float(pruned_pred["conv_prob"][0, 0])

        diff = abs(orig_prob - pruned_prob)
        total_diff += diff

        print(f"{i:<10} | {orig_prob:<12.4f} | {pruned_prob:<12.4f} | {diff:<12.4f}")

    avg_diff = total_diff / num_examples
    print(f"Average prediction difference: {avg_diff:.4f}")
    print("=" * 80 + "\n")

    return {
        "pruned_model": pruned_model,
        "original_sparsity": original_sparsity,
        "pruned_sparsity": pruned_sparsity,
        "avg_prediction_diff": avg_diff
    }

In [None]:
sparsity_levels = [0.3, 0.5, 0.7]
pruning_results = []

for sparsity in sparsity_levels:
    result = prune_and_evaluate(best_student_model, X_train, y_train, sparsity_target=sparsity)
    pruning_results.append({
        "target_sparsity": sparsity,
        "achieved_sparsity": result["pruned_sparsity"],
        "avg_error": result["avg_prediction_diff"]
    })

print("Summary:")
print(f"\n{'Target Sparsity':<20} | {'Achieved Sparsity':<20} | {'Avg Prediction Error':<20}")
print("-"*70)

for res in pruning_results:
    print(f"{res['target_sparsity']*100:<19.0f}% | {res['achieved_sparsity']:<19.2f}% | {res['avg_error']:<20.4f}")

## 5. Conclusion

We presented a compression workflow inspired by the Google Ads conversion prediction setup. Starting from a large teacher model with multi-task outputs (conversion probability and value, trained with binary cross-entropy and Poisson losses), we trained a compact student model using knowledge distillation.

The student was further optimized using:
- Post-training quantization (PTQ) for 43% model size reduction with minimal accuracy loss  
- Pruning up to ~70% sparsity with fine-tuning  

Together, these steps significantly reduce model size, and with it the expected latency and compute cost, while keeping predictions close to those of the uncompressed student in this demo.
