<a href="https://colab.research.google.com/github/ReemMahmoud/leep-transferability-score/blob/main/leep_score.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Transferability Estimation Problem

In this notebook, we dissect the formulation of the LEEP (Log Expected Empirical Prediction) score from the paper "LEEP: A New Measure to Evaluate Transferability of Learned Representations" by C. Nguyen et al., published at ICML 2020.

## LEEP full code implementation

Source: https://github.com/thuml/LogME/blob/main/LEEP.py

In [None]:
import numpy as np

def LEEP(pseudo_source_label: np.ndarray, target_label: np.ndarray):
    """
    :param pseudo_source_label: shape [N, C_s]
    :param target_label: shape [N], elements in [0, C_t)
    :return: leep score
    """
    N, C_s = pseudo_source_label.shape
    target_label = target_label.reshape(-1)
    C_t = int(np.max(target_label) + 1)   # the number of target classes
    normalized_prob = pseudo_source_label / float(N)  # sum(normalized_prob) = 1
    joint = np.zeros((C_t, C_s), dtype=float)  # placeholder for joint distribution over (y, z)
    for i in range(C_t):
        this_class = normalized_prob[target_label == i]
        row = np.sum(this_class, axis=0)
        joint[i] = row # P (y , z)
    p_target_given_source = (joint / joint.sum(axis=0, keepdims=True)).T  # P(y | z)

    empirical_prediction = pseudo_source_label @ p_target_given_source
    empirical_prob = np.array([predict[label] for predict, label in zip(empirical_prediction, target_label)])
    leep_score = np.mean(np.log(empirical_prob))
    return leep_score

## Code Breakdown

In the remainder of the notebook, we dissect the above code implementation of LEEP with a dummy example to understand:

*   What computations are taking place
*   How these computations render a score that reflects transferability across tasks



### Define the givens:


*   Y = target labels
*   Z = pseudo source labels (or 'dummy' labels)
*   C_t = No. of labels in target task (labels are indexed from 0)
*   C_s = No. of labels in source task (labels are indexed from 0)
*   N = No. of samples in target dataset
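
For reference, the quantities that the code computes can be written out as below. This is a sketch in the notation of the list above. The symbol $\theta(x_i)_z$ (the source model's predicted probability of source label $z$ for target sample $x_i$) and the hat on $\hat{P}$ (marking an empirical estimate) are notation assumed here, following our reading of the paper.

$$
\hat{P}(y, z) = \frac{1}{N} \sum_{i:\, y_i = y} \theta(x_i)_z, \qquad
\hat{P}(z) = \sum_{y} \hat{P}(y, z), \qquad
\hat{P}(y \mid z) = \frac{\hat{P}(y, z)}{\hat{P}(z)}
$$

$$
\mathrm{LEEP} = \frac{1}{N} \sum_{i=1}^{N} \log \Big( \sum_{z} \hat{P}(y_i \mid z)\, \theta(x_i)_z \Big)
$$

In the code, `pseudo_source_label[i, z]` plays the role of $\theta(x_i)_z$, `joint` holds $\hat{P}(y, z)$ and `p_target_given_source` holds $\hat{P}(y \mid z)$.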


### Dummy Example 1:

We will define a dummy example to follow the computations of LEEP in a simple way. It includes:
1. A target dataset with 10 data samples (N=10)
2. A target task with 5 labels (C_t=5)
3. A source task with 4 labels (C_s=4)

In [None]:
# define a column reflecting the target output variable (Y) with 5 labels 
target_label = np.array([0, 2, 4, 4, 1, 1, 3, 0, 0, 2]) # C_t = 5, N = 10
target_label.shape

(10,)

In [None]:
# define a matrix of pseudo source labels that is generated from a forward pass of the target data samples through the pre-trained network
# pseudo source label matrix has 10 instances of predictions (since N=10) with label distribution across 4 labels 
#     Note: the rows don't have to be one-hot encoded. In practice we would use the softmax output probabilities of the pre-trained network, so each row sums to 1.
pseudo_source_label = np.array([[0, 0, 0, 1],
                               [0, 1, 0, 0],
                               [0, 1, 0, 0],
                               [0, 0, 1, 0],
                               [0, 1, 0, 0],
                               [1, 0, 0, 0],
                               [1, 0, 0, 0],
                               [0, 0, 0, 1],
                               [0, 0, 0, 1],
                               [1, 0, 0, 0]]) # C_s = 4, N = 10
pseudo_source_label.shape

(10, 4)

In [None]:
# Just for fun, let's compute the LEEP score across the dummy example sets:
LEEP(pseudo_source_label, target_label)

-0.6591673732008659

Note: The LEEP score is never positive. It is an average of log-probabilities, so its maximum is 0. The LARGER the score (the closer to 0), the better the transferability across tasks.
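
To check the sign of the score on something other than one-hot labels, here is a minimal sketch that feeds soft pseudo labels into a compact re-implementation. The helper name `leep_sketch`, the random seed and the random logits are all assumptions made up for illustration. They do not come from the paper or the source repository.

```python
import numpy as np

def leep_sketch(theta, y):
    """Compact LEEP re-implementation for illustration (hypothetical helper)."""
    n = theta.shape[0]
    c_t = int(y.max()) + 1
    onehot = np.eye(c_t)[y]                           # (n, C_t) one-hot target labels
    joint = onehot.T @ theta / n                      # (C_t, C_s) empirical P(y, z)
    cond = joint / joint.sum(axis=0, keepdims=True)   # P(y | z), columns sum to 1
    eep = theta @ cond.T                              # (n, C_t) predicted target distribution
    return np.mean(np.log(eep[np.arange(n), y]))      # mean log-probability of the true labels

rng = np.random.default_rng(0)                        # arbitrary seed
logits = rng.normal(size=(10, 4))                     # made-up source-model outputs
theta = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax, rows sum to 1
y = np.array([0, 2, 4, 4, 1, 1, 3, 0, 0, 2])

score = leep_sketch(theta, y)
print(score)
```

Each row of `eep` is a probability distribution over the target labels, so every log term is at most 0 and so is their mean.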

In [None]:
# defining parameters N, C_s, C_t 
N, C_s = pseudo_source_label.shape
C_t = int(np.max(target_label) + 1)
print(f' Size of dataset = {N} \n Size of target labels = {C_t} \n Size of source labels = {C_s}')

 Size of dataset = 10 
 Size of target labels = 5 
 Size of source labels = 4


In [None]:
# reshape(-1) flattens target_label into a 1-D array of shape (N,). It is a no-op here because the labels are already 1-D,
# but it lets the function also accept labels stored as a column of shape (N, 1), so that the mask `target_label == i` used later selects whole rows.
target_label = target_label.reshape(-1)
target_label.shape

(10,)
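
A small sketch of what the reshape buys us, assuming the labels arrive as a column of shape (N, 1). That layout is hypothetical here, as is the placeholder matrix `probs`.

```python
import numpy as np

# Hypothetical (N, 1) column layout of the same ten target labels
column_labels = np.array([[0], [2], [4], [4], [1], [1], [3], [0], [0], [2]])
flat_labels = column_labels.reshape(-1)       # flatten to shape (N,)

print(column_labels.shape)
print(flat_labels.shape)

probs = np.full((10, 4), 0.25)                # placeholder pseudo-label matrix
rows_for_class_0 = probs[flat_labels == 0]    # a 1-D boolean mask picks whole rows
print(rows_for_class_0.shape)
```

With the flattened labels, the mask has one entry per sample, which is what the row selection in the loop relies on.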

In [None]:
# divide the dummy labels by N up front: P(y,z) and P(z) are both averages over the N samples, so each carries a 1/N factor.
normalized_prob = pseudo_source_label / float(N) # Note: sum(normalized_prob) = 1
normalized_prob                                  

array([[0. , 0. , 0. , 0.1],
       [0. , 0.1, 0. , 0. ],
       [0. , 0.1, 0. , 0. ],
       [0. , 0. , 0.1, 0. ],
       [0. , 0.1, 0. , 0. ],
       [0.1, 0. , 0. , 0. ],
       [0.1, 0. , 0. , 0. ],
       [0. , 0. , 0. , 0.1],
       [0. , 0. , 0. , 0.1],
       [0.1, 0. , 0. , 0. ]])

In [None]:
# creating a placeholder for the joint distribution, P(y, z), that we'll compute soon
joint = np.zeros((C_t, C_s), dtype=float)  
print(joint.shape)
print(joint)

(5, 4)
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]




---

### Breakdown of one iteration of the loop to compute the joint probability P(y,z):

We will now step through a single iteration of the following loop, which computes P(y,z):


```
for i in range(C_t):
  this_class = normalized_prob[target_label == i]
  row = np.sum(this_class, axis=0)
  joint[i] = row

```



In [None]:
i = 0 # for label = 0 
this_class = normalized_prob[target_label == i] # select the rows of normalized_prob at the indices where target_label equals 0
this_class

array([[0. , 0. , 0. , 0.1],
       [0. , 0. , 0. , 0.1],
       [0. , 0. , 0. , 0.1]])

If you're confused, let's put things side by side.

We want to look at the normalized probabilities (remember: this is the pseudo source label matrix divided by N) in the rows whose true target label (or class, hence the variable name `this_class`) is '0'.


```
target_label = 
      np.array([0, 2, 4, 4, 1, 1, 3, 0, 0, 2])
```

We can see that label '0' occurs at indices 0, 7 and 8.

```
this_class = 
array([[0. , 0. , 0. , 0.1],
       [0. , 0. , 0. , 0.1],
       [0. , 0. , 0. , 0.1]])
```
You can now see that `this_class` captured the rows of `normalized_prob` at indices 0, 7 and 8.

See `normalized_prob` for reference below.

```
normalized_prob = 
array([[0. , 0. , 0. , 0.1],
       [0. , 0.1, 0. , 0. ],
       [0. , 0.1, 0. , 0. ],
       [0. , 0. , 0.1, 0. ],
       [0. , 0.1, 0. , 0. ],
       [0.1, 0. , 0. , 0. ],
       [0.1, 0. , 0. , 0. ],
       [0. , 0. , 0. , 0.1],
       [0. , 0. , 0. , 0.1],
       [0.1, 0. , 0. , 0. ]])
```









In [None]:
# sum across axis = 0 (through each column, summing value of all rows)
row = np.sum(this_class, axis=0) 
row

array([0. , 0. , 0. , 0.3])

In [None]:
# we replace the first row of the joint with the resulting summation from the prior step
joint[i] = row
joint

array([[0. , 0. , 0. , 0.3],
       [0. , 0. , 0. , 0. ],
       [0. , 0. , 0. , 0. ],
       [0. , 0. , 0. , 0. ],
       [0. , 0. , 0. , 0. ]])

**Et Voila!**

That's the computation of the joint probability P(y,z) over one iteration.

Next, we will run the loop to compute all rows of P(y,z).


---



In [None]:
# compute the full P(y,z) matrix
for i in range(C_t):
  this_class = normalized_prob[target_label == i]
  row = np.sum(this_class, axis=0)
  joint[i] = row

joint

array([[0. , 0. , 0. , 0.3],
       [0.1, 0.1, 0. , 0. ],
       [0.1, 0.1, 0. , 0. ],
       [0.1, 0. , 0. , 0. ],
       [0. , 0.1, 0.1, 0. ]])
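
The loop can also be written as a single matrix product. The sketch below is an alternative we are adding for illustration, not the source repository's implementation. The names `source_index`, `target_one_hot` and `joint_vectorized` are made up here.

```python
import numpy as np

target_label = np.array([0, 2, 4, 4, 1, 1, 3, 0, 0, 2])
source_index = np.array([3, 1, 1, 2, 1, 0, 0, 3, 3, 0])   # column holding the 1 in each pseudo-label row
pseudo_source_label = np.eye(4)[source_index]              # rebuilds the (10, 4) one-hot matrix

N = len(target_label)
C_t = int(target_label.max()) + 1

# Row i of target_one_hot has a 1 in column target_label[i]
target_one_hot = np.eye(C_t)[target_label]                 # shape (10, 5)

# Entry (y, z) sums pseudo_source_label[i, z] over the samples with label y, then averages over N
joint_vectorized = target_one_hot.T @ pseudo_source_label / N
print(joint_vectorized)
```

The result should match the `joint` matrix printed above. The loop version is arguably easier to read, while the product avoids a Python-level loop when C_t is large.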

### Compute the conditional probability P(y|z)

P(y|z) = P(y,z) / P(z)

We already computed P(y,z) in the previous step.

Now, we need to compute P(z) then divide to get the resulting P(y|z).

In [None]:
# compute the marginal probability P(z) by simply summing across axis=0 of the joint P(y,z)
joint_sum = joint.sum(axis=0, keepdims=True)
joint_sum

array([[0.3, 0.3, 0.1, 0.3]])
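
As a sanity check, P(z) should not depend on the target labels at all. It is simply the average pseudo source label over the dataset. A minimal sketch, with `source_index` and `p_z_direct` as names invented for this example:

```python
import numpy as np

source_index = np.array([3, 1, 1, 2, 1, 0, 0, 3, 3, 0])   # column holding the 1 in each pseudo-label row
pseudo_source_label = np.eye(4)[source_index]              # the (10, 4) one-hot matrix from Dummy Example 1

# Averaging the rows gives the empirical frequency of each source label
p_z_direct = pseudo_source_label.mean(axis=0)
print(p_z_direct)
```

This should agree with `joint_sum` above, since summing the joint over y leaves only the 1/N average over samples.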

In [None]:
joint / joint_sum

array([[0.        , 0.        , 0.        , 1.        ],
       [0.33333333, 0.33333333, 0.        , 0.        ],
       [0.33333333, 0.33333333, 0.        , 0.        ],
       [0.33333333, 0.        , 0.        , 0.        ],
       [0.        , 0.33333333, 1.        , 0.        ]])

In [None]:
# compute the conditional probability P(y|z): P(y,z)/P(z)

p_target_given_source = (joint / joint.sum(axis=0, keepdims=True))
p_target_given_source

array([[0.        , 0.        , 0.        , 1.        ],
       [0.33333333, 0.33333333, 0.        , 0.        ],
       [0.33333333, 0.33333333, 0.        , 0.        ],
       [0.33333333, 0.        , 0.        , 0.        ],
       [0.        , 0.33333333, 1.        , 0.        ]])
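
Two properties of this matrix are worth checking. Each column is a distribution over y, so it should sum to 1. The division also breaks down if some source label is never predicted, because then P(z) = 0. The sketch below illustrates both. The empty extra column and the `np.divide(..., where=...)` guard are our own additions for illustration and are not part of the original function.

```python
import numpy as np

joint = np.array([[0.0, 0.0, 0.0, 0.3],
                  [0.1, 0.1, 0.0, 0.0],
                  [0.1, 0.1, 0.0, 0.0],
                  [0.1, 0.0, 0.0, 0.0],
                  [0.0, 0.1, 0.1, 0.0]])

p_z = joint.sum(axis=0, keepdims=True)        # marginal P(z), shape (1, 4)
cond = joint / p_z                            # P(y | z), shape (5, 4)
print(cond.sum(axis=0))                       # each column sums to 1 (up to float rounding)

# Hypothetical edge case: a fifth source label that no sample is assigned to
joint_with_empty = np.hstack([joint, np.zeros((5, 1))])
p_z_empty = joint_with_empty.sum(axis=0, keepdims=True)

# Divide only where P(z) > 0 and leave the other entries at 0 instead of NaN
safe_cond = np.divide(joint_with_empty, p_z_empty,
                      out=np.zeros_like(joint_with_empty),
                      where=p_z_empty > 0)
print(safe_cond[:, -1])
```

With softmax outputs every P(z) is strictly positive, so the guard mainly matters for one-hot pseudo labels like the ones in this dummy example.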

In [None]:
pseudo_source_label @ p_target_given_source.T  # transpose (5, 4) -> (4, 5) so the shapes line up with (10, 4)

array([[1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.33333333, 0.33333333, 0.        , 0.33333333],
       [0.        , 0.33333333, 0.33333333, 0.        , 0.33333333],
       [0.        , 0.        , 0.        , 0.        , 1.        ],
       [0.        , 0.33333333, 0.33333333, 0.        , 0.33333333],
       [0.        , 0.33333333, 0.33333333, 0.33333333, 0.        ],
       [0.        , 0.33333333, 0.33333333, 0.33333333, 0.        ],
       [1.        , 0.        , 0.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.33333333, 0.33333333, 0.33333333, 0.        ]])

### Compute the LEEP score

#### Start by computing the empirical predictor result = pseudo_source_label @ p_target_given_source

In [None]:
print(pseudo_source_label.shape)
print(p_target_given_source.shape)

(10, 4)
(5, 4)


We need to transpose p_target_given_source in order to get dimensions (4x5) and successfully implement the matrix multiplication for the empirical predictor.

In [None]:
empirical_prediction = pseudo_source_label @ (p_target_given_source).T
print(empirical_prediction.shape)
print(empirical_prediction)

(10, 5)
[[1.         0.         0.         0.         0.        ]
 [0.         0.33333333 0.33333333 0.         0.33333333]
 [0.         0.33333333 0.33333333 0.         0.33333333]
 [0.         0.         0.         0.         1.        ]
 [0.         0.33333333 0.33333333 0.         0.33333333]
 [0.         0.33333333 0.33333333 0.33333333 0.        ]
 [0.         0.33333333 0.33333333 0.33333333 0.        ]
 [1.         0.         0.         0.         0.        ]
 [1.         0.         0.         0.         0.        ]
 [0.         0.33333333 0.33333333 0.33333333 0.        ]]


Our empirical predictor results in a (10x5) matrix:
- predictions across 10 samples
- 5 categories for the prediction (matching our target task class count)

In [None]:
target_label

array([0, 2, 4, 4, 1, 1, 3, 0, 0, 2])

In [None]:
# pick, for each of the 10 samples, the probability that the empirical predictor assigns to its true label
# Note: the result matches the dimension of our target_label

empirical_prob = np.array([predict[label] for predict, label in zip(empirical_prediction, target_label)])
print(empirical_prob.shape)
print(empirical_prob)

(10,)
[1.         0.33333333 0.33333333 1.         0.33333333 0.33333333
 0.33333333 1.         1.         0.33333333]


### More explanation:

`empirical_prob` holds the probability that the expected empirical predictor (EEP) assigns to the TRUE label of each sample in `target_label`.

So, again, let's put things side-by-side to see them clearly.

```
target_label = 
      np.array([0, 2, 4, 4, 1, 1, 3, 0, 0, 2])
```
```
empirical_prediction = 
[[1.         0.         0.         0.         0.        ]
 [0.         0.33333333 0.33333333 0.         0.33333333]
 [0.         0.33333333 0.33333333 0.         0.33333333]
 [0.         0.         0.         0.         1.        ]
 [0.         0.33333333 0.33333333 0.         0.33333333]
 [0.         0.33333333 0.33333333 0.33333333 0.        ]
 [0.         0.33333333 0.33333333 0.33333333 0.        ]
 [1.         0.         0.         0.         0.        ]
 [1.         0.         0.         0.         0.        ]
 [0.         0.33333333 0.33333333 0.33333333 0.        ]]
 ```

**To get the `empirical_prob`, we want to answer:**

For each row of `empirical_prediction`, what is the probability of the true label from `target_label`?


Example:
```
# First iteration
For first row of empirical_prediction = [1. 0. 0. 0. 0. ]
empirical_prediction[label=0] = 1

# Second iteration
For second row of empirical_prediction = [0. 0.333 0.333 0. 0.333]
empirical_prediction[label=2] = 0.333

Etc.

```
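
The same per-row lookup can be done with NumPy integer indexing instead of a list comprehension. This is a minimal sketch of an equivalent formulation, not the source repository's code. The row names `a` to `d` are shorthand introduced only to rebuild the matrix shown above.

```python
import numpy as np

target_label = np.array([0, 2, 4, 4, 1, 1, 3, 0, 0, 2])

t = 1.0 / 3.0
a = [1, 0, 0, 0, 0]          # the four distinct rows of empirical_prediction
b = [0, t, t, 0, t]
c = [0, 0, 0, 0, 1]
d = [0, t, t, t, 0]
empirical_prediction = np.array([a, b, b, c, b, d, d, a, a, d])

# Pair row i with column target_label[i]
rows = np.arange(len(target_label))
empirical_prob_indexed = empirical_prediction[rows, target_label]

# The list comprehension from the walkthrough, for comparison
empirical_prob_loop = np.array([predict[label]
                                for predict, label in zip(empirical_prediction, target_label)])
print(np.allclose(empirical_prob_indexed, empirical_prob_loop))
```

Both forms read one entry per row. The indexed version just hands the row and column positions to NumPy in one call.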



---


## Building our intuition: 

We are checking how much probability the empirical predictor assigns to the true target labels.

This gives a sense of how well the pseudo source labels predict the true labels, which in turn measures how close the semantics of the source and target tasks are.

We use this to judge whether knowledge transfer between the source and target tasks is likely to succeed.
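
To make the intuition concrete, consider the opposite extreme: pseudo source labels that carry no information about the sample. The sketch below assumes a hypothetical source model that outputs the same uniform distribution for every input. In that case the empirical predictor can only fall back on the label frequencies P(y), and the score collapses to the negative entropy of the target labels.

```python
import numpy as np

target_label = np.array([0, 2, 4, 4, 1, 1, 3, 0, 0, 2])
N, C_s = len(target_label), 4
C_t = int(target_label.max()) + 1

# Hypothetical uninformative source model: identical uniform output for every sample
uninformative = np.full((N, C_s), 1.0 / C_s)

joint = np.eye(C_t)[target_label].T @ uninformative / N   # P(y, z)
cond = joint / joint.sum(axis=0, keepdims=True)           # P(y | z), equal to P(y) here
eep = uninformative @ cond.T                              # every row equals P(y)
score = np.mean(np.log(eep[np.arange(N), target_label]))

# Negative entropy of the target label distribution, for comparison
p_y = np.bincount(target_label) / N
neg_entropy = np.sum(p_y * np.log(p_y))
print(score, neg_entropy)
```

For this dummy target set the two numbers coincide at roughly -1.56, well below the -0.659 obtained in Dummy Example 1, where the pseudo labels do say something about the target labels.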





## Compute LEEP


In [None]:
# compute LEEP by averaging across the log of the empirical probabilities
leep_score = np.mean(np.log(empirical_prob))
leep_score

-0.6591673732008659
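
The value can be reproduced by hand from `empirical_prob`: four samples get probability 1 and six get probability 1/3. A minimal sketch of that arithmetic, with variable names chosen for this example:

```python
import numpy as np

n_certain = 4    # samples whose true label gets probability 1
n_third = 6      # samples whose true label gets probability 1/3

# log(1) = 0, so only the six 1/3 entries contribute: (6 / 10) * log(1/3)
by_hand = (n_certain * np.log(1.0) + n_third * np.log(1.0 / 3.0)) / (n_certain + n_third)
print(by_hand)
```

This is -0.6 * ln(3), which matches the score above up to floating-point rounding.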

We won't be able to justify or make much sense out of this score since we're working with a dummy example.

Let's try a test where our source dummy labels exactly match our true target labels.

**We should expect to see a LEEP score = 0**

### Dummy Example 2:

In [None]:
target_label = np.array([0, 2, 4, 4, 1, 1, 3, 0, 0, 2]) # C_t = 5, N = 10
target_label.shape

(10,)

In [None]:
# for the sake of the dummy example, we use the same number of labels across source and target tasks
pseudo_source_label = np.array([[1, 0, 0, 0, 0],
                               [0, 0, 1, 0, 0],
                               [0, 0, 0, 0, 1],
                               [0, 0, 0, 0, 1],
                               [0, 1, 0, 0, 0],
                               [0, 1, 0, 0, 0],
                               [0, 0, 0, 1, 0],
                               [1, 0, 0, 0, 0],
                               [1, 0, 0, 0, 0],
                               [0, 0, 1, 0, 0]]) # C_s = 5, N = 10
pseudo_source_label.shape

(10, 5)

In [None]:
# verify that LEEP equals 0
LEEP(pseudo_source_label, target_label)

0.0

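As expected! Every pseudo source label maps to exactly one target label, so P(y|z) = 1 for each sample's true label and log(1) = 0.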