# Sequential Self-Teaching Worked Example

## Setup

Three features $(x_1, x_2, x_3)$ and two labels $(y_0, y_1)$.

The hypothesis space is given by:

$$
\begin{aligned}
h_1 &= [1, 1, 1] \\
h_2 &= [0, 1, 1] \\
h_3 &= [0, 0, 1] \\
h_4 &= [0, 0, 0]
\end{aligned}
$$

The learner's prior over hypotheses is uniform: $p_L(h') = 1/4$ for all $h' \in \{h_1, h_2, h_3, h_4\}$.

In [1]:
import numpy as np

def create_boundary_hyp_space(n_features):
    """Creates a hypothesis space of concepts defined by a linear boundary"""
    hyp_space = []
    for i in range(n_features + 1):
        hyp = [1 for _ in range(n_features)]
        hyp[:i] = [0 for _ in range(i)]
        hyp_space.append(hyp)
    hyp_space = np.array(hyp_space)
    return hyp_space

# initialize model
n_features = 3  # number of features
features = np.arange(n_features)  # features
n_labels = 2  # number of labels
labels = np.arange(n_labels)  # labels
hyp_space = create_boundary_hyp_space(n_features)
n_hyp = len(hyp_space)  # number of hypotheses
hyp_shape = (n_hyp, n_features, n_labels)  # shape of structures

# set learner's prior p_L(h) to be uniform over hypotheses
learner_prior = 1 / n_hyp * np.ones(hyp_shape)

# set self-teaching prior p_T(x|h) to be uniform over features
self_teaching_prior = 1 / n_features * np.ones(hyp_shape)

assert np.allclose(np.sum(learner_prior, axis=0), 1.0)
assert np.allclose(np.sum(self_teaching_prior, axis=1), 1.0)
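
For reference, the same hypothesis space can be built without a loop. This is only an illustrative sketch: the helper name `boundary_hyp_space_triu` is hypothetical and not part of the notebook, and it assumes each hypothesis is a row of 0/1 labels whose boundary moves one feature to the right per row.

```python
import numpy as np

def boundary_hyp_space_triu(n_features):
    """Hypothetical helper: row i labels the first i features 0 and the rest 1."""
    # The upper triangle of an all-ones (n_features + 1) x n_features matrix
    # keeps entry (i, j) only when j >= i, which is exactly the boundary pattern.
    return np.triu(np.ones((n_features + 1, n_features), dtype=int))

hyp_space_triu = boundary_hyp_space_triu(3)
print(hyp_space_triu)
```

The rows come out in the same order as $h_1, \dots, h_4$ above.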

The likelihood $p(y|x, h)$ is 1 when hypothesis $h$ assigns label $y$ to feature $x$, and 0 otherwise:

In [2]:
lik = np.ones(hyp_shape)

for i, hyp in enumerate(hyp_space):
    for j, feature in enumerate(features):
        for k, label in enumerate(labels):
            if hyp[feature] == label:
                lik[i, j, k] = 1
            else:
                lik[i, j, k] = 0
                
assert lik.shape == hyp_shape
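
The triple loop can also be written as a single broadcast comparison. The sketch below assumes the four-hypothesis space from the setup and uses the illustrative name `lik_vec` to avoid clashing with the notebook's `lik`.

```python
import numpy as np

# Hypothesis space from the setup: rows are h_1..h_4, columns are features.
hyp_space = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]])
labels = np.arange(2)

# lik_vec[i, j, k] is 1 if hypothesis i assigns label k to feature j, else 0.
lik_vec = (hyp_space[:, :, None] == labels[None, None, :]).astype(float)

print(lik_vec.shape)
```

Because each hypothesis assigns exactly one label to each feature, every `lik_vec[i, j, :]` sums to one.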

Thus, the learner's posterior $p_L(h|x, y) \propto p(y|x, h)p_T(x|h)p_L(h)$ is computed as follows:

In [3]:
# multiply everything together and normalize
learner_posterior = lik * self_teaching_prior * learner_prior
learner_posterior = learner_posterior / np.sum(learner_posterior, axis=0)

assert learner_posterior.shape == hyp_shape
assert np.allclose(np.sum(learner_posterior, axis=0), 1.0)
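
Two entries of this posterior are easy to check by hand. The sketch below rebuilds the likelihood for the four-hypothesis space and relies on the fact that the uniform $p_T(x|h)$ and $p_L(h)$ cancel on normalization; the name `posterior` is illustrative.

```python
import numpy as np

hyp_space = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]])
labels = np.arange(2)
lik = (hyp_space[:, :, None] == labels[None, None, :]).astype(float)

# With uniform p_T(x|h) and p_L(h), the constants cancel when normalizing over h.
posterior = lik / lik.sum(axis=0, keepdims=True)

print(posterior[:, 0, 1])  # first feature observed with label 1
print(posterior[:, 2, 1])  # third feature observed with label 1
```

Seeing label 1 on the first feature leaves only $h_1$, while seeing label 1 on the third feature leaves $h_1, h_2, h_3$ equally likely at $1/3$ each.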

Using the learner's posterior, we can calculate the probability that the self-teacher will teach each feature, i.e. $p_T(x|h)$, by the following equations:

$$
p_T(x,y|h') \propto p_L(h'|x,y)p_T(x,y), 
$$

where $p_T(x,y)$ is usually a uniform prior over $x,y$;

In [4]:
# set teaching prior to be uniform over labels and features
teaching_prior = 1 / (n_features * n_labels) * np.ones(hyp_shape)
assert np.allclose(np.sum(teaching_prior, axis=(1, 2)), 1.0)

# multiply with learner's posterior
feature_label_posterior = learner_posterior * teaching_prior
feature_label_posterior = (feature_label_posterior.T / 
                               np.sum(feature_label_posterior, axis=(1, 2))).T

assert feature_label_posterior.shape == hyp_shape
assert np.allclose(np.sum(feature_label_posterior, axis=(1, 2)), 1.0)

$$
\begin{aligned}
p_T(x,y,h'|h) &= p_T(x,y|h')p_a(h'|h) \\
&= p_T(x,y|h')p_L(h') = p_T(x,y,h'),
\end{aligned}
$$

where $p_a(h'|h) = p_L(h')$ captures the self-teaching idea that the learner's belief about $h'$ does not depend on the underlying true $h$, which makes the selection probability $p_T(x,y,h')$ independent of $h$;

In [5]:
self_teaching_hyp_prior = 1 / n_hyp * np.ones(hyp_shape)  # p(h'|h)
assert np.allclose(np.sum(self_teaching_hyp_prior, axis=0), 1.0)

joint_self_teaching_posterior = feature_label_posterior * self_teaching_hyp_prior
assert joint_self_teaching_posterior.shape == hyp_shape
assert np.isclose(np.sum(joint_self_teaching_posterior), 1.0)

$$p_T(x|h) = p_T(x) = \sum_y \sum_{h'} p_T(x,y,h').$$

In [6]:
self_teaching_posterior_original = np.sum(joint_self_teaching_posterior, axis=(0, 2))
print(self_teaching_posterior_original)
self_teaching_posterior = np.tile(np.tile(self_teaching_posterior_original, (n_labels, 1)).T, 
                                  (n_hyp, 1, 1))  # broadcast to be the same shape

assert self_teaching_posterior.shape == hyp_shape
assert np.allclose(np.sum(self_teaching_posterior, axis=1), 1.0)

[ 0.32467532  0.35064935  0.32467532]
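
The printed values can be reproduced exactly with rational arithmetic. This is a hand-check sketch, not the notebook's method: it assumes the uniform priors used above (so the uniform $p_T(x,y)$ cancels on normalization) and uses the standard-library `fractions` module.

```python
from fractions import Fraction

hyp_space = [[1, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]]
n_hyp, n_features = len(hyp_space), len(hyp_space[0])

p_x = [Fraction(0)] * n_features
for h_prime in hyp_space:
    # p_L(h'|x, y) at the label y = h'[x] that h' predicts is
    # 1 / (number of hypotheses agreeing with h' on x); it is 0 for the other label.
    weights = [Fraction(1, sum(h[x] == h_prime[x] for h in hyp_space))
               for x in range(n_features)]
    total = sum(weights)
    for x in range(n_features):
        # p_T(x, y|h') summed over y, weighted by the uniform p_L(h') = 1/n_hyp
        p_x[x] += weights[x] / total / n_hyp

print(p_x)
```

This gives $25/77$, $27/77$ and $25/77$, which match the decimals above to the printed precision.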


One can check that cooperative inference in self-teaching converges after one step: because $p_T(x|h) = p_T(x)$ does not depend on $h$, it cancels when the posterior is normalized over $h$, which means

$$
\begin{aligned}
p_L(h|x, y) &\propto p(y|x, h)p_T(x|h)p_L(h) \\
\rightarrow p_L(h|x, y) &\propto p(y|x, h)p_T(x)p_L(h) \\
\rightarrow p_L(h|x, y) &\propto p(y|x, h)p_L(h).
\end{aligned}
$$

In [7]:
updated_learner_posterior = lik * self_teaching_posterior * learner_prior
updated_learner_posterior = updated_learner_posterior / np.sum(updated_learner_posterior, axis=0)

assert np.allclose(updated_learner_posterior, learner_posterior)
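
The same cancellation holds for any positive $p_T(x)$ that is constant across hypotheses, not just the one computed here. A minimal sketch, where the values in `p_x` are arbitrary illustrations and `normalize_over_hyp` is a hypothetical helper:

```python
import numpy as np

hyp_space = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]])
labels = np.arange(2)
lik = (hyp_space[:, :, None] == labels[None, None, :]).astype(float)
prior = np.full(lik.shape, 1 / len(hyp_space))

def normalize_over_hyp(unnormalized):
    """Normalize along axis 0, the hypothesis axis."""
    return unnormalized / unnormalized.sum(axis=0, keepdims=True)

# Any positive p_T(x) that does not vary with h; these values are illustrative only.
p_x = np.array([0.2, 0.5, 0.3])[None, :, None]

with_factor = normalize_over_hyp(lik * p_x * prior)
without_factor = normalize_over_hyp(lik * prior)
print(np.allclose(with_factor, without_factor))
```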

## Look ahead

Up until here, the self-teacher has judged the effectiveness of selecting a particular feature by its immediate benefit. A self-teacher can also look ahead to see the consequence of this first selection and modify the selection probability according to a particular feature's future benefit. This can be done by the following steps:

First, compute the look-ahead posterior

$$
p_L(h|x^{(2)}, y^{(2)}, x^{(1)}, y^{(1)}) \propto p(y^{(2)}|h, x^{(2)}, x^{(1)}, y^{(1)})p_L(h|x^{(1)},y^{(1)}),
$$

where $x^{(2)}, y^{(2)}$ respectively denote the selection and outcome one more step into the future;

The first part of computing the look-ahead posterior is to calculate the likelihood $$p(y^{(2)}|h, x^{(2)}, x^{(1)}, y^{(1)})$$

In [8]:
sequential_hyp_shape = (n_hyp, n_features, n_labels, n_features, n_labels)
sequential_lik = np.ones(sequential_hyp_shape)

for i, hyp in enumerate(hyp_space):
    for j, feature_one in enumerate(features):
        for k, label_one in enumerate(labels):
            for l, feature_two in enumerate(features):
                for m, label_two in enumerate(labels):
                    if hyp[feature_one] == label_one and hyp[feature_two] == label_two:
                        sequential_lik[i, j, k, l, m] = 1
                    else:
                        sequential_lik[i, j, k, l, m] = 0

# code to double check sequential likelihood is calculated correctly
sequential_lik_two = np.repeat(lik, n_features * n_labels).reshape(sequential_hyp_shape)
# likelihood needs to calculate over having observed both features
for i, feature in enumerate(features):
    for j, label in enumerate(labels):
        sequential_lik_two[:, :, :, i, j] = sequential_lik_two[:, i, j, :, :] * \
            sequential_lik_two[:, :, :, i, j]
            
assert sequential_lik.shape == sequential_hyp_shape
assert np.array_equal(sequential_lik, sequential_lik_two)
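
Since a hypothesis fixes the label of every feature, the two-observation likelihood is the product of two single-observation likelihoods, which broadcasting expresses directly. The sketch below assumes that factorization and uses the illustrative name `seq_lik`.

```python
import numpy as np

hyp_space = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]])
labels = np.arange(2)
lik = (hyp_space[:, :, None] == labels[None, None, :]).astype(float)

# seq_lik[h, x1, y1, x2, y2] = lik[h, x1, y1] * lik[h, x2, y2]
seq_lik = lik[:, :, :, None, None] * lik[:, None, None, :, :]

print(seq_lik.shape)
```

The result is symmetric under swapping the first and second observation, which is a quick sanity check on the axis order.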

Next, we calculate the sequential self-teaching prior $$p_T(x^{(2)}|x^{(1)}, y^{(1)}, h)$$

In [9]:
# set sequential self teaching prior to be uniform
sequential_self_teaching_prior = 1 / n_features * np.ones(sequential_hyp_shape)

assert np.allclose(np.sum(sequential_self_teaching_prior, axis=3), 1.0)

# also consider setting the sequential self-teaching prior to
# zero when the second feature is the same as the first
# sequential_self_teaching_prior = np.array(sequential_self_teaching_prior, copy=True)
# for i, feature_one in enumerate(features):
#     for j, feature_two in enumerate(features):
#         if feature_one == feature_two:
#             sequential_self_teaching_prior[:, i, :, j, :] = 0
            
# # normalize 
# sequential_self_teaching_prior  = sequential_self_teaching_prior / \
#     np.repeat(np.sum(sequential_self_teaching_prior, axis=3), n_features).reshape(
#         sequential_hyp_shape)

# assert np.allclose(np.sum(sequential_self_teaching_prior, axis=3), 1.0)    

In [10]:
# expand posterior to same shape
sequential_learner_posterior_one = np.repeat(learner_posterior, n_features * n_labels).reshape(
    sequential_hyp_shape)
sequential_learner_posterior_two = np.repeat(learner_posterior, n_features * n_labels).reshape(
    sequential_hyp_shape)

# re-order axes so that posterior matches along the same axes
for i, feature in enumerate(features):
    for j, label in enumerate(labels):
        sequential_learner_posterior_one[:, :, :, i, j] = \
            sequential_learner_posterior_two[:, i, j, :, :]
            
# compute look-ahead posterior
lookahead_posterior = sequential_lik * sequential_self_teaching_prior * \
    sequential_learner_posterior_one
    
# normalize and set NaNs to zero
lookahead_posterior = lookahead_posterior / np.nansum(lookahead_posterior, axis=0)
lookahead_posterior = np.nan_to_num(lookahead_posterior)

# check that the posterior sums to either zero or one over hypotheses
assert lookahead_posterior.shape == sequential_hyp_shape
assert np.array_equal(np.sum(lookahead_posterior, axis=0), 
                      (np.sum(lookahead_posterior, axis=0)).astype(bool))
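
Some pairs of observations are consistent with no hypothesis, so the normalizer is zero there and the division above produces NaNs that are then zeroed. One possible alternative, sketched with a hypothetical helper `normalize_safe`, is to skip those divisions using the `where` and `out` arguments of `np.divide`:

```python
import numpy as np

def normalize_safe(unnormalized, axis=0):
    """Hypothetical helper: normalize along `axis`, leaving all-zero slices at zero."""
    total = unnormalized.sum(axis=axis, keepdims=True)
    # Entries whose normalizer is zero are never divided and keep the
    # zero they were initialized with in `out`.
    out = np.zeros_like(unnormalized, dtype=float)
    mask = np.broadcast_to(total > 0, unnormalized.shape)
    np.divide(unnormalized, total, out=out, where=mask)
    return out

example = np.array([[1.0, 0.0], [3.0, 0.0]])
print(normalize_safe(example, axis=0))
```

In this toy input the first column normalizes to 0.25 and 0.75, and the all-zero second column stays at zero without triggering a division warning.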



Second, compute

$$
p_T(x^{(2)},y^{(2)}|h',x,y) \propto p_L(h'|x^{(2)},y^{(2)},x,y)p_T(x^{(2)},y^{(2)}), 
$$

where $p_T(x^{(2)},y^{(2)})$ is uniform over $x^{(2)},y^{(2)}$ as before;

In [11]:
# set teaching prior to be uniform over labels and features
sequential_teaching_prior = 1 / (n_features * n_labels) * np.ones(sequential_hyp_shape)

assert np.allclose(np.sum(sequential_teaching_prior, axis=(3, 4)), 1.0)

# multiply with learner's posterior
sequential_feature_label_posterior = lookahead_posterior * sequential_teaching_prior
sequential_feature_label_data_lik = np.sum(sequential_feature_label_posterior, axis=(3, 4))

# normalize
for i, hyp in enumerate(hyp_space):
    for j, feature_one in enumerate(features):
        for k, label_one in enumerate(labels):
            sequential_feature_label_posterior[i, j, k, :, :] = \
                sequential_feature_label_posterior[i, j, k, :, :] / \
                    sequential_feature_label_data_lik[i, j, k]

# turn nans to zero
sequential_feature_label_posterior = np.nan_to_num(sequential_feature_label_posterior)

# check that the posterior sums to either zero or one over the second feature and label
assert sequential_feature_label_posterior.shape == sequential_hyp_shape
assert np.allclose(np.sum(sequential_feature_label_posterior, axis=(3, 4)), 
                    np.sum(sequential_feature_label_posterior, axis=(3, 4)).astype(bool)) 




Third, compute

$$
p_T(x^{(2)},y^{(2)},h'|x,y) = p_T(x^{(2)},y^{(2)}|h',x, y)p_L(h'|x, y);
$$

In [12]:
sequential_joint_posterior = sequential_feature_label_posterior * sequential_learner_posterior_one
sequential_joint_data_lik = np.sum(sequential_joint_posterior, axis=(0, 3, 4))

# normalize
for i, hyp in enumerate(hyp_space):
    for j, feature_two in enumerate(features):
        for k, label_two in enumerate(labels):
            sequential_joint_posterior[i, :, :, j, k] = sequential_joint_posterior[i, :, :, j, k] / \
                sequential_joint_data_lik

assert np.allclose(np.sum(sequential_joint_posterior, axis=(0, 3, 4)), 1.0)

Fourth, compute the predictive distribution (not 100% sure if this is right)

$$
p_L(y|x) = \sum_h p(y|h, x)p_L(h|x) = \sum_h p(y|h, x)p_L(h);
$$

In [13]:
label_predictives_original = np.sum(lik * learner_prior, axis=0) 
label_predictives = np.empty(sequential_hyp_shape)

# broadcast p_L(y|x) over hypotheses and over the second feature and label
for j, feature_one in enumerate(features):
    for k, label_one in enumerate(labels):
        label_predictives[:, j, k] = label_predictives_original[j, k]
            
assert np.allclose(np.sum(label_predictives, axis=2), 1.0)
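
For this hypothesis space the predictive can be written down by hand, and the expansion to five axes can be done with `np.broadcast_to` instead of a loop. A minimal sketch under the uniform prior used above; `predictive` and `predictive_5d` are illustrative names.

```python
import numpy as np

hyp_space = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]])
labels = np.arange(2)
lik = (hyp_space[:, :, None] == labels[None, None, :]).astype(float)
n_hyp, n_features, n_labels = lik.shape

# p_L(y|x) = sum_h p(y|h, x) p_L(h) with a uniform prior over hypotheses
predictive = (lik / n_hyp).sum(axis=0)  # shape (n_features, n_labels)

# Read-only view with the predictive repeated over h, x^(2) and y^(2)
target_shape = (n_hyp, n_features, n_labels, n_features, n_labels)
predictive_5d = np.broadcast_to(predictive[None, :, :, None, None], target_shape)

print(predictive)
```

Label 1 has probability $1/4$, $1/2$ and $3/4$ on the three features, since one, two and three of the four hypotheses assign it.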

Fifth, compute

$$
p_T(x^{(2)},y^{(2)},h',y|x) = p_T(x^{(2)},y^{(2)},h'|x, y)p_L(y|x);
$$

In [14]:
sequential_full_joint_posterior = sequential_joint_posterior * label_predictives

assert np.allclose(np.sum(sequential_full_joint_posterior, axis=(0, 2, 3, 4)), 1.0)

Sixth, compute

$$
p_T(x^{(2)}|x) = \sum_{y^{(2)}} \sum_{h'} \sum_y p_T(x^{(2)},y^{(2)},h',y|x);
$$

In [15]:
sequential_conditional_feature_prob = np.sum(sequential_full_joint_posterior, axis=(0, 2, 4))

assert np.allclose(np.sum(sequential_conditional_feature_prob, axis=1), 1.0)

print(np.sum(sequential_conditional_feature_prob, axis=1))
print(sequential_conditional_feature_prob[0] * self_teaching_posterior_original[0])

[ 1.  1.  1.]
[ 0.07615112  0.11511216  0.13341204]


Lastly, combine the benefit from $x^{(2)}$:

$$
\begin{aligned}
p_T(x^{(2)}, x) &= p_T(x^{(2)}|x) p_T(x) \\
p^{(2)}_T(x) &= \sum_{x^{(2)}} p_T(x^{(2)},x).
\end{aligned}
$$

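(Written out this way, it seems like look-ahead will never have any effect: since $\sum_{x^{(2)}} p_T(x^{(2)}|x) = 1$, summing the joint over $x^{(2)}$ simply returns $p_T(x)$. So, I am probably doing something wrong.)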

In [16]:
sequential_joint_feature_prob = sequential_conditional_feature_prob.T * self_teaching_posterior_original
assert np.isclose(np.sum(sequential_joint_feature_prob, axis=(0, 1)), 1.0)

sequential_self_teaching_prob = np.sum(sequential_joint_feature_prob, axis=0)

# normalize
sequential_self_teaching_prob = sequential_self_teaching_prob / \
    np.sum(sequential_self_teaching_prob)
    
print(sequential_self_teaching_prob)

[ 0.32467532  0.35064935  0.32467532]
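
The result is identical to the one-step distribution. A small numerical check suggests this is not specific to the conditional computed above: with any conditional whose rows sum to one, marginalizing the joint returns the original marginal. The conditional below is random and purely illustrative; the marginal is the one-step $p_T(x)$, written as the fractions $25/77$, $27/77$, $25/77$ that the printed decimals correspond to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary conditional p(x2 | x): each row is a distribution over x2.
conditional = rng.random((3, 3))
conditional /= conditional.sum(axis=1, keepdims=True)

marginal = np.array([25, 27, 25]) / 77  # one-step p_T(x)

joint = conditional * marginal[:, None]  # joint[x, x2] = p(x2 | x) p(x)
recovered = joint.sum(axis=1)            # sum over x2

print(np.allclose(recovered, marginal))
```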
