In [1]:
import numpy as np

First, we set the following parameters:
* embedding\_size: dimension of each embedding vector
* m: number of models to route among
* theta: vector of length m, where theta[i] measures the quality of the ith model (larger is better)

In [2]:
embedding_size = 10
m = 3 # number of prompts/models

# construct groundtruth theta "accuracies"
theta = np.random.random(m)*50 # all positive for now. Larger theta = higher quality model generations.
print("Theta:", theta)

We construct the mean vector $\mu = 0$ (of length m * embedding\_size) and the covariance matrix $\Sigma = \mathrm{diag}(1/(2\theta))$, where each entry $1/(2\theta_i)$ is repeated embedding\_size times along the diagonal.
These are the parameters of the multivariate Gaussian that describes each model's error in embedding space,
where error is defined as the vector difference between a model's generation and the unknown ground-truth generation.

In [3]:
# construct mu and sigma for multivariate gaussian formulation of the model.
# each model's variance 1/(2*theta[i]) is repeated once per embedding coordinate
sigma_diag = np.repeat(1 / (2 * theta), embedding_size)

print("Covariance matrix diagonal:", sigma_diag)

sigma = np.diag(sigma_diag)
mu = np.zeros(m * embedding_size)

print("Mean:", mu) # Zero mean

Theta: [ 7.51705954 39.08450852 49.46922858]
Covariance matrix: [0.06651537 0.06651537 0.06651537 0.06651537 0.06651537 0.06651537
 0.06651537 0.06651537 0.06651537 0.06651537 0.01279279 0.01279279
 0.01279279 0.01279279 0.01279279 0.01279279 0.01279279 0.01279279
 0.01279279 0.01279279 0.01010729 0.01010729 0.01010729 0.01010729
 0.01010729 0.01010729 0.01010729 0.01010729 0.01010729 0.01010729]
Mean: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]


Next, we generate our data. Our sampling process is:
1) Sample from $\mathcal{N}(\mu, \Sigma)$ and reshape the sample into a set of m error vectors of length embedding\_size.
2) Randomly sample an embedding vector corresponding to the ground-truth generation.
3) Add each error vector to the ground-truth embedding to get the embedding of each model's generation.

We end with a tensor all\_lfs\_y, which contains all m+1 embeddings per sample. In practice, only the first m entries of each sample, which correspond to the model generations' embeddings, are observable; the ground-truth embedding is kept here only to check the estimates.

In [4]:
n = 10000 # number of samples
all_lfs_y = []
all_diffs = []
for _ in range(n):
    # Construct embeddings by 1) sampling a multivariate Gaussian "diff", 2) sampling a ground-truth y,
    # and 3) adding them together, so that LF = diff + y
    diff = np.random.multivariate_normal(mu, sigma)
    all_diffs.append(diff)

    y = np.random.random(embedding_size)
    y_repeated = np.tile(y, reps=m)

    lfs = (y_repeated + diff).reshape((m, embedding_size))
    lfs_y = np.concatenate([lfs, y.reshape((-1, embedding_size))], axis=0)

    all_lfs_y.append(lfs_y)

all_lfs_y = np.array(all_lfs_y)
# Along the second axis, the first m rows are the model embeddings and the last row is the y embedding
all_lfs_y.shape

(10000, 4, 10)
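
Because $\Sigma$ is diagonal, the error coordinates are independent, so the same sampling process can also be written without a Python loop. The following is a minimal sketch of that idea, not the notebook's own procedure; the names `rng`, `theta_demo`, `errors`, `y_demo`, and `lfs_y_demo`, the seed, and the quality values are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

embedding_size, m, n = 10, 3, 10000
theta_demo = np.array([5.0, 20.0, 40.0])  # hypothetical qualities, not the notebook's theta

# Every coordinate of model i's error has variance 1 / (2 * theta_i).
std = np.sqrt(1 / (2 * theta_demo))

# Errors, shape (n, m, embedding_size); std broadcasts along the model axis.
errors = rng.normal(size=(n, m, embedding_size)) * std[None, :, None]

# Ground-truth embeddings, uniform on [0, 1) as in the loop, shape (n, 1, embedding_size).
y_demo = rng.random((n, 1, embedding_size))

# Model embeddings followed by the ground-truth embedding.
lfs_y_demo = np.concatenate([y_demo + errors, y_demo], axis=1)
print(lfs_y_demo.shape)
```

The result has the same layout as all\_lfs\_y: the first m rows along the second axis are model embeddings and the last row is the ground truth.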

Now that we have all\_lfs\_y, we can use Smoothie to estimate theta.

We describe the math behind Smoothie. Let $x$ be the input sample and let $\lambda_i(x)$ be the embedding of the generation for $x$ produced by the $i$th model. Let $y(x)$ be the embedding of the true optimal generation. Then we can express the fact that error embedding vectors are Gaussian as $[\lambda_1(x) - y(x), \dots, \lambda_m(x) - y(x)] \sim \mathcal{N}(\mu, \Sigma)$. The following holds:

$\begin{align}\mathbb{E}[\|\lambda_i(x) - \lambda_j(x)\|^2] &= \mathbb{E}[\|(\lambda_i(x) - y(x)) - (\lambda_j(x) - y(x))\|^2] \nonumber \\
&= \mathbb{E}[\|\lambda_i(x) - y(x) \|^2] + \mathbb{E}[\|\lambda_j(x) - y(x) \|^2] - 2\mathbb{E}[(\lambda_i(x) - y(x))^\top (\lambda_j(x) - y(x))] \nonumber \end{align}$

Since $\Sigma$ is a diagonal matrix and $\mu = 0$, the errors of different models are uncorrelated and have zero mean, so the $2\mathbb{E}[(\lambda_i(x) - y(x))^\top (\lambda_j(x) - y(x))]$ term is $0$. We therefore have a simple decomposition:
$\begin{align}\mathbb{E}[\|\lambda_i(x) - \lambda_j(x)\|^2] = \mathbb{E}[\|\lambda_i(x) - y(x) \|^2] + \mathbb{E}[\|\lambda_j(x) - y(x) \|^2] \nonumber \end{align}$
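
The decomposition can be checked numerically. The sketch below assumes two hypothetical models with made-up qualities `theta_i` and `theta_j`, and works directly with their error vectors, since $y(x)$ cancels in $\lambda_i(x) - \lambda_j(x)$. None of these names or values come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
d, n = 10, 100_000              # embedding size and number of samples (illustrative)
theta_i, theta_j = 5.0, 20.0    # hypothetical model qualities

# Independent zero-mean errors with per-coordinate variance 1 / (2 * theta).
err_i = rng.normal(scale=np.sqrt(1 / (2 * theta_i)), size=(n, d))
err_j = rng.normal(scale=np.sqrt(1 / (2 * theta_j)), size=(n, d))

# Left-hand side: mean squared distance between the two models' embeddings.
lhs = (np.linalg.norm(err_i - err_j, axis=1) ** 2).mean()

# Right-hand side: sum of the two mean squared error norms.
rhs = (np.linalg.norm(err_i, axis=1) ** 2).mean() + (np.linalg.norm(err_j, axis=1) ** 2).mean()

# Cross term, which should be close to zero.
cross = (err_i * err_j).sum(axis=1).mean()
print(lhs, rhs, cross)
```

With a finite sample the cross term is only approximately zero, so `lhs` and `rhs` agree up to sampling noise.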

We write this equation for the pairs $(\lambda_i, \lambda_j)$, $(\lambda_j, \lambda_k)$, and $(\lambda_i, \lambda_k)$ to get a system of three equations:
$\begin{align}\mathbb{E}[\|\lambda_i(x) - \lambda_j(x)\|^2] &= \mathbb{E}[\|\lambda_i(x) - y(x) \|^2] + \mathbb{E}[\|\lambda_j(x) - y(x) \|^2] \nonumber \\
\mathbb{E}[\|\lambda_j(x) - \lambda_k(x)\|^2] &= \mathbb{E}[\|\lambda_j(x) - y(x) \|^2] + \mathbb{E}[\|\lambda_k(x) - y(x) \|^2] \nonumber \\
\mathbb{E}[\|\lambda_i(x) - \lambda_k(x)\|^2] &= \mathbb{E}[\|\lambda_i(x) - y(x) \|^2] + \mathbb{E}[\|\lambda_k(x) - y(x) \|^2] \nonumber \end{align}$

There are three unknown quantities, the expected squared L2 norms of the $i$th, $j$th, and $k$th error vectors (RHS), and three observable quantities, the expected squared L2 distances between pairs of the $i, j, k$ embeddings (LHS). Solving this system of equations, we have
$\begin{align}\mathbb{E}[\|\lambda_i(x) - y(x) \|^2] &= \frac{1}{2} \big(\mathbb{E}[\|\lambda_i(x) - \lambda_j(x)\|^2] + \mathbb{E}[\|\lambda_i(x) - \lambda_k(x)\|^2] - \mathbb{E}[\|\lambda_j(x) - \lambda_k(x) \|^2] \big) \end{align}$

and similarly for $\mathbb{E}[\|\lambda_j(x) - y(x) \|^2]$ and $\mathbb{E}[\|\lambda_k(x) - y(x) \|^2]$. 
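
To see that the closed form is just the solution of a 3x3 linear system, here is a small sketch that solves the system with `np.linalg.solve` and compares the result to the formula. The values in `e_true` are made up so that the correct answer is known in advance; `A`, `b`, and the other names are illustrative assumptions.

```python
import numpy as np

# Hypothetical expected squared error norms (e_i, e_j, e_k), used to build the observables.
e_true = np.array([0.25, 0.55, 0.18])
d_ij = e_true[0] + e_true[1]
d_jk = e_true[1] + e_true[2]
d_ik = e_true[0] + e_true[2]

# Rows are the three equations; columns are the unknowns (e_i, e_j, e_k).
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
b = np.array([d_ij, d_jk, d_ik])

e_solved = np.linalg.solve(A, b)

# Closed-form expression for the first unknown.
e_i_closed_form = 0.5 * (d_ij + d_ik - d_jk)
print(e_solved, e_i_closed_form)
```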

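Each of the embedding\_size coordinates of the $i$th error vector has variance $\frac{1}{2\theta_i}$, so $\mathbb{E}[\|\lambda_i(x) - y(x) \|^2] = \frac{\text{embedding\_size}}{2\theta_i}$. Inverting this relation lets us recover $\theta_i = \frac{\text{embedding\_size}}{2\mathbb{E}[\|\lambda_i(x) - y(x) \|^2]}$, which are the final *Smoothie Weights*.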
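
As a quick sanity check of this last conversion, the sketch below simulates error vectors for a single hypothetical model with a made-up quality `theta_true`, measures the mean squared error norm, and inverts it. The names and values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
d, n = 10, 100_000              # embedding size and number of samples (illustrative)
theta_true = 12.0               # hypothetical model quality

# Error vectors with per-coordinate variance 1 / (2 * theta_true).
err = rng.normal(scale=np.sqrt(1 / (2 * theta_true)), size=(n, d))

# Empirical estimate of E[||lambda_i(x) - y(x)||^2]; its expected value is d / (2 * theta_true).
mean_sq_norm = (np.linalg.norm(err, axis=1) ** 2).mean()

# Invert the relation to recover theta.
theta_recovered = d / (2 * mean_sq_norm)
print(d / (2 * theta_true), mean_sq_norm, theta_recovered)
```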


In [19]:
def triplet(i, j, k):
    # Computes an estimate of E[||lambda_i(x) - y(x)||^2]
    diff_ij = (np.linalg.norm(all_lfs_y[:, i, :] - all_lfs_y[:, j, :], axis=1, ord=2)**2).mean()
    diff_ik = (np.linalg.norm(all_lfs_y[:, i, :] - all_lfs_y[:, k, :], axis=1, ord=2)**2).mean()
    diff_jk = (np.linalg.norm(all_lfs_y[:, j, :] - all_lfs_y[:, k, :], axis=1, ord=2)**2).mean()

    return 0.5*(diff_ij + diff_ik - diff_jk)

In [20]:
diff = np.zeros(m)

for i in range(m):
    other_idxs = np.delete(np.arange(m), i)
    j, k = np.random.choice(other_idxs, size=2, replace=False)
    diff[i] = triplet(i, j, k)

    # compare to true value
    print(diff[i], (np.linalg.norm(all_lfs_y[:, i, :] - all_lfs_y[:, m, :], axis=1)**2).mean())

0.23288287180858308 0.23369112730365055
0.5578053959351905 0.554497767481926
0.1796984318051682 0.18107175815446175


Convert to canonical parameters, i.e., Smoothie Weights.

In [21]:
# convert mean parameters to canonical parameters 
theta_estimate = embedding_size/(2*diff)

print(theta_estimate, theta)

[21.47002036  8.96369959 27.82439418] [21.44408149  9.07740938 27.70640861]
