![BridgingAI Logo](../bridgingai_logo.png)

# Deep Learning - Exercise 6.1: Scaling Laws

---
1. [Chinchilla Scaling Laws](#background)

2. [Implementation](#implementation)
<br/> &#9; 2.1 [Loss Function](#loss)
<br/> &#9; 2.2 [Parameter Optimization](#optimization)

3. [Analysis](#analysis)
<br/> &#9; 3.1 [Residual Analysis](#residuals)
<br/> &#9; 3.2 [Optimal Scaling Analysis](#scaling)
   
4. [References](#references)

---

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from itertools import product
from scipy.optimize import minimize

from utils import (
    preprocess_df,
    plot_residuals,
    model_scaling_plot,
)
from tests.sanity_checks import SanityChecks

# Load the Chinchilla data
df = pd.read_csv("data.csv")
df_clean = preprocess_df(df)

This assignment is based on two papers: the original [Chinchilla paper](https://arxiv.org/abs/2203.15556) and [Chinchilla Scaling: A replication attempt](https://arxiv.org/abs/2404.10102). It is less coding-intensive than the previous ones, but it requires you to understand scaling laws and how they apply to deep learning models. Specifically, you will implement *Approach 3* from the Chinchilla paper, which fits a parametric model to the data. We highly recommend reading both papers before starting the assignment.

In this assignment, you will work with scaling laws in deep learning, specifically focusing on the relationships between model size, training data, and model performance. You will:

1. Implement key components of scaling law analysis
2. Analyze real training data from language models
3. Compare your findings with the Chinchilla paper
4. Visualize the optimal compute allocation

# 1. Chinchilla Scaling Laws <a id="background"></a>

In applied machine learning (especially in large-scale deep learning), we often want to describe how model performance changes with variations in:
- Model size (N): Number of parameters
- Dataset size (D): Number of training tokens/samples
- Compute budget (C): Total training FLOPs

For neural networks, so-called "scaling laws" have been shown to hold empirically. These equations describe, in a predictable way, how model performance (loss) changes as the three quantities above vary.
The Chinchilla paper, for example, proposed approximating the loss as:

$$
L(N,D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}
$$

Where:
- $L$: Loss
- $N$: Model size
- $D$: Dataset size
- $E$: Irreducible loss
- $A,B$: Scaling coefficients
- $\alpha,\beta$: Scaling exponents

Given a collection of experiments (triples of $N,D,L$), we can fit the parameters of this model to the data, and then use the model to predict the loss for new values of $N$ and $D$. This is what you will do in this assignment.
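To make the formula concrete, the following minimal sketch evaluates $L(N,D)$ for two training configurations. The function name `predicted_loss` and the two example configurations are hypothetical; the parameter values are the ones reported in the Chinchilla paper and are used here only for illustration.

```python
import numpy as np

# Values reported in the Chinchilla paper (Approach 3), used only for illustration
CHINCHILLA = dict(A=406.4, B=410.7, E=1.69, alpha=0.34, beta=0.28)


def predicted_loss(N, D, A, B, E, alpha, beta):
    """Evaluate L(N, D) = E + A / N**alpha + B / D**beta elementwise."""
    N = np.asarray(N, dtype=float)
    D = np.asarray(D, dtype=float)
    return E + A / N**alpha + B / D**beta


# Hypothetical example: a 1B-parameter model trained on 20B tokens
loss_small = predicted_loss(1e9, 20e9, **CHINCHILLA)
# The same recipe scaled up tenfold in both N and D
loss_large = predicted_loss(1e10, 200e9, **CHINCHILLA)
print(f"{loss_small:.3f} -> {loss_large:.3f}")
```

Both reducible terms shrink as $N$ and $D$ grow, so the predicted loss decreases toward $E$ but never falls below it.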

# 2. Implementation <a id="implementation"></a>

## 2.1 Loss Function <a id="loss"></a>



In this section, you will have to implement the loss function that will be used to fit the model. The formula of the loss can be found in Equation (3) or Equation (11) of the [Chinchilla paper](https://arxiv.org/abs/2203.15556). 
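For orientation, the fitting objective can be written out as follows, using the log-parameters $a = \log A$, $b = \log B$, $e = \log E$ (the same reparameterization that `fit` undoes with `np.exp` later on). This is a restatement for reference; please check it against the paper before relying on it.

$$
\min_{a,b,e,\alpha,\beta} \; \sum_{i} \mathrm{Huber}_\delta\Big(\mathrm{LSE}\big(a - \alpha \log N_i,\; b - \beta \log D_i,\; e\big) - \log L_i\Big)
$$

$$
\mathrm{LSE}(x_1, x_2, x_3) = \log\big(e^{x_1} + e^{x_2} + e^{x_3}\big),
\qquad
\mathrm{Huber}_\delta(r) =
\begin{cases}
\frac{1}{2} r^2 & \text{if } |r| \le \delta \\
\delta \left(|r| - \frac{1}{2}\delta\right) & \text{otherwise}
\end{cases}
$$

Since $\exp(a - \alpha \log N) = A / N^\alpha$ and $\exp(b - \beta \log D) = B / D^\beta$, the LSE term equals $\log\big(E + A/N^\alpha + B/D^\beta\big)$, i.e. the logarithm of the predicted loss $L(N,D)$.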

In [None]:
def log_sum_exp(a, b, e, alpha, beta, N, D):
    """
    Calculate the log-sum-exp of the given parameters. This function corresponds to the LSE term in Equation 11 of the Chinchilla paper.

    Args:
        a, b, e, alpha, beta: Scalars
        N, D: numpy arrays of shape (num_samples,)

    Returns:
        lse: A numpy array of shape (num_samples,)
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return lse


def huber_loss(y_true, y_pred, delta=1e-3):
    """
    Calculate the Huber loss between the true and predicted values.

    Args:
        y_true: A numpy array of shape (num_samples,)
        y_pred: A numpy array of shape (num_samples,)
        delta: A scalar for the Huber loss calculation

    Returns:
        loss: A scalar representing the Huber loss
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return loss


def huber_loss_objective(params, N, D, losses):
    a, b, e, alpha, beta = params
    predictions = log_sum_exp(a, b, e, alpha, beta, N, D)
    return huber_loss(np.log(losses), predictions, delta=1e-3)

You can run the following code to sanity check your implementation. 

In [None]:
SanityChecks.verify_implementation(log_sum_exp)
SanityChecks.verify_implementation(huber_loss)

<a id="optimization"></a>
## 2.2 Parameter Optimization 

The optimization objective is highly non-convex, and small deviations in the parameters can lead to large changes in the resulting predictions. Following Hoffmann et al. (2022), we therefore run a local optimizer from every point on a grid of initial parameter values and keep the best result. Each run in the code below uses the L-BFGS-B method.
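The idea of restarting a local optimizer from a grid of initial values can be illustrated on a toy problem. The sketch below uses a hypothetical one-dimensional function `bumpy` with several local minima, not the scaling-law objective, and compares a single start against the best of several starts.

```python
import numpy as np
from scipy.optimize import minimize


def bumpy(x):
    """Toy 1-D objective with several local minima (not the scaling-law loss)."""
    return np.sin(3.0 * x[0]) + 0.1 * x[0] ** 2


# Grid of starting points, analogous to the grid of initial parameter values
starts = np.arange(-4.0, 4.5, 1.0)

best = None
for x0 in starts:
    result = minimize(bumpy, [x0], method="L-BFGS-B")
    # Keep the lowest converged objective value seen so far
    if result.success and (best is None or result.fun < best.fun):
        best = result

# For comparison: what a single run from the first grid point finds
single = minimize(bumpy, [starts[0]], method="L-BFGS-B")
print(f"single start: f={single.fun:.4f}, best of {len(starts)} starts: f={best.fun:.4f}")
```

A single run can stop in whichever local minimum is closest to its starting point, while the best result over the grid is by construction at least as good.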

**TODO**: Run the following code to fit the model to the data.

In [None]:
def fit(df, verbose=True):
    np.random.seed(42)
    N = df["Model Size"].values
    D = df["Training Tokens"].values
    losses = df["loss"].values

    alpha_vals = np.arange(0, 2.5, 0.5)
    beta_vals = np.arange(0, 2.5, 0.5)
    e_vals = np.arange(-1, 1.5, 0.5)
    a_vals = np.arange(0, 30, 5)
    b_vals = np.arange(0, 30, 5)

    # Run L-BFGS-B from every point on the grid of initial values and keep the best result
    best_loss = np.inf
    best_params = None

    results_dict = {}
    search_space = [alpha_vals, beta_vals, e_vals, a_vals, b_vals]
    total = np.prod([len(s) for s in search_space])
    pbar = tqdm(product(*search_space), total=total)

    for alpha, beta, e, a, b in pbar:
        init_params = [a, b, e, alpha, beta]
        result = minimize(
            huber_loss_objective, init_params, args=(N, D, losses), method="L-BFGS-B"
        )
        results_dict[tuple(init_params)] = {"params": result.x, "loss": result.fun}
        if result.success and result.fun < best_loss:
            best_loss = result.fun
            best_params = result.x

            if verbose:
                print(f"New best loss: {best_loss}")
                print(f"Best params: {best_params}")
                print(f"Initial guess: {init_params}")

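    if best_params is None:
        # No run converged, so there are no parameters to report or return
        raise RuntimeError("Optimization failed to converge for every initial guess.")

    # The optimizer works with a = log A, b = log B, e = log E; map them back
    A = np.exp(best_params[0])
    B = np.exp(best_params[1])
    E = np.exp(best_params[2])
    alpha = best_params[3]
    beta = best_params[4]
    print(f"Best fit parameters: A={A}, B={B}, E={E}, alpha={alpha}, beta={beta}")
    return dict(A=A, B=B, E=E, alpha=alpha, beta=beta)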


params = fit(df, verbose=False)
params_clean = fit(df_clean, verbose=False)

# 3. Analysis <a id="analysis"></a>

## 3.1 Residual Analysis <a id="residuals"></a>

First, we'll examine the residuals to assess how well our model fits the data. The residuals are the differences between the observed values and the model's predictions. A good model should have residuals that:

1. Are roughly symmetric around zero
2. Have consistent variance across the range of predictions 

We will visualize the residuals using scatter plots to analyze their distribution. We will start with the parameters as reported in the Chinchilla paper. Looking at the residuals, do you think these parameters are a good fit for the data?
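As a rough illustration of what to look for, the sketch below computes residuals on synthetic data, not on `data.csv`. It assumes residuals are taken in log space (observed minus predicted), which may differ from what `plot_residuals` does internally. The noise level and the deliberately biased parameter set are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic runs (hypothetical, not the assignment's data.csv)
N = 10 ** rng.uniform(8, 10, size=200)   # model sizes
D = 10 ** rng.uniform(9, 11, size=200)   # training tokens
true = dict(A=406.4, B=410.7, E=1.69, alpha=0.34, beta=0.28)


def predicted_loss(N, D, A, B, E, alpha, beta):
    return E + A / N**alpha + B / D**beta


# Observed losses = model prediction times small multiplicative noise
observed = predicted_loss(N, D, **true) * np.exp(rng.normal(0.0, 0.01, size=200))

# A deliberately wrong parameter set: the irreducible loss E is too high
biased = dict(true, E=1.9)

residual_means = {}
for name, p in [("true", true), ("biased", biased)]:
    # Assumption: residual = log(observed) - log(predicted)
    residuals = np.log(observed) - np.log(predicted_loss(N, D, **p))
    residual_means[name] = residuals.mean()
    print(f"{name:>6}: mean={residuals.mean():+.4f}, std={residuals.std():.4f}")
```

With the generating parameters the residuals scatter around zero, whereas the biased set shifts them systematically to one side. That kind of offset is the pattern to watch for in the plots.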

In [None]:
params_chinchilla = {
    "A": 406.4,
    "B": 410.7,
    "E": 1.69,
    "alpha": 0.34,
    "beta": 0.28,
}
plot_residuals(
    params_chinchilla, df, "Residuals of original Chinchilla parameters (with outliers)"
)
plot_residuals(
    params_chinchilla,
    df_clean,
    "Residuals of original Chinchilla parameters (without outliers)",
)

Now let's look at the residuals of the parameters we found. Do you think these parameters are a better fit for the data?

In [None]:
plot_residuals(params, df, "Scatter Plot of Residuals")
plot_residuals(params_clean, df_clean, "Scatter Plot of Residuals (Cleaned)")

## 3.2 Optimal Scaling Analysis <a id="scaling"></a>

This section explores how model performance scales with different combinations of model size (N) and training tokens (D) under a fixed compute budget. We visualize these relationships and analyze the optimal allocation of compute resources.

The key questions we examine:
1. How does model performance change as we vary N and D while keeping compute fixed?
2. What are the optimal values of N and D for a given compute budget?
3. How do our findings compare to the original Chinchilla paper's conclusions?

**Question:**
Given a compute budget of C = 5.76e23 FLOPs, what would be the optimal model size (N) and number of training tokens (D) according to the scaling laws? Use the `model_scaling_plot()` visualization to help determine these values.

You should give answers for the following configurations:
1. (`params_clean`, `df_clean`)
2. (`params_chinchilla`, `df`)
3. (`params`, `df`)
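As a cross-check for values read off the plot, the optimum can also be computed in closed form. The sketch below assumes the approximation $C \approx 6ND$ from the Chinchilla paper; the notebook's `model_scaling_plot()` may use a different constraint. The names `optimal_allocation`, `predicted_loss`, and `example_params` are hypothetical. Substituting $D = C/(6N)$ into $L(N,D)$ and setting the derivative with respect to $N$ to zero gives $N_{opt} = G\,(C/6)^{\beta/(\alpha+\beta)}$ with $G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}$.

```python
import numpy as np


def optimal_allocation(C, A, B, alpha, beta, flops_per_param_token=6.0):
    """Minimize A / N**alpha + B / D**beta subject to C = k * N * D.

    The default k = 6 is the approximation used in the Chinchilla paper;
    it is an assumption here, not something taken from the notebook's utilities.
    """
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    budget = C / flops_per_param_token  # equals N * D under the constraint
    N_opt = G * budget ** (beta / (alpha + beta))
    D_opt = budget / N_opt
    return N_opt, D_opt


def predicted_loss(N, D, A, B, E, alpha, beta):
    return E + A / N**alpha + B / D**beta


# Parameters as reported in the Chinchilla paper, used for illustration
example_params = dict(A=406.4, B=410.7, E=1.69, alpha=0.34, beta=0.28)
C = 5.76e23
shape_params = {k: example_params[k] for k in ("A", "B", "alpha", "beta")}
N_opt, D_opt = optimal_allocation(C, **shape_params)

# Numerical cross-check: sweep N at fixed compute and locate the lowest loss
N_grid = np.logspace(9, 12, 3001)
D_grid = C / (6.0 * N_grid)
N_best = N_grid[np.argmin(predicted_loss(N_grid, D_grid, **example_params))]
print(f"closed form: N={N_opt:.3e}, D={D_opt:.3e}; grid search: N={N_best:.3e}")
```

Note that $E$ does not affect the optimum, since it only shifts the loss by a constant. The same function can be called with `params` or `params_clean` to compare against the corresponding plots.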

In [None]:
# (params_clean, df_clean)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# (params_chinchilla, df)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# (params, df)
# YOUR CODE HERE
raise NotImplementedError()

<a id="references"></a>
# 4. References

- [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556)
- [Chinchilla Scaling: A replication attempt](https://arxiv.org/abs/2404.10102)