A researcher wants a generative model of entities and their attributes. The data take the form $\mathcal{D}_{\lambda} = [(e_1,a_1),(e_2,a_2), \ldots, (e_N,a_N)]$, where $\lambda$ is an extraction strategy, each $e_i$ belongs to an entity set $\mathcal{E}$, and each $a_i$ belongs to an attribute set $\mathcal{A}$. Note that $\lambda$ determines $\mathcal{D}_{\lambda}$.

There are several reasons to prefer a generative model of $\mathcal{D}_{\lambda}$:
1. testing hypotheses about $(e,a)$ pairs with likelihood ratios
2. quantifying uncertainty about conclusions with credible intervals
3. transparently incorporating prior beliefs about entities and their attributes 
     - possibly even including prior beliefs about extraction strategies $\lambda$, entity sets $\mathcal{E}$ and attribute sets $\mathcal{A}$
4. updating priors in the face of evidence
5. suggesting new interpretations of the data (e.g. suggesting new $a$ for inclusion in $\mathcal{A}$, or new $e$ for inclusion in $\mathcal{E}$)

The researcher expresses their beliefs about $\mathcal{A}$, $\mathcal{E}$ and $\lambda$ via rules.

Also:
- Blei [notes](https://www.cs.princeton.edu/courses/archive/spring12/cos424/pdf/em-mixtures.pdf)

 - Suppose the data are generated by a mixture of multinomials. (It could equally be a mixture of HMMs, etc.)
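
As a sketch of that setup, here is the two-component model and the EM updates it leads to. The notation is introduced here for illustration and is not taken from the linked notes: $n_{dv}$ is the count of word $v$ in document $d$, $D$ is the number of documents, $z_d \in \{0, 1\}$ is the latent component of document $d$, $\pi$ is the probability of component 0, $\theta_k$ is the word distribution of component $k$, and $\gamma_d$ is the posterior probability that $z_d = 0$ (the quantity the code below calls `lambda_hat`).

$$
\begin{aligned}
P(z_d = 0) &= \pi, \qquad n_d \mid z_d = k \sim \mathrm{Multinomial}(N_d, \theta_k) \\
\text{E-step:} \quad \gamma_d &= \frac{\pi \prod_v \theta_{0v}^{n_{dv}}}{\pi \prod_v \theta_{0v}^{n_{dv}} + (1 - \pi) \prod_v \theta_{1v}^{n_{dv}}} \\
\text{M-step:} \quad \pi &= \frac{1}{D} \sum_d \gamma_d, \qquad
\theta_{0v} = \frac{\sum_d \gamma_d \, n_{dv}}{\sum_d \sum_{v'} \gamma_d \, n_{dv'}}, \qquad
\theta_{1v} = \frac{\sum_d (1 - \gamma_d) \, n_{dv}}{\sum_d \sum_{v'} (1 - \gamma_d) \, n_{dv'}}
\end{aligned}
$$

The multinomial coefficient depends only on the counts, so it cancels between the numerator and denominator of the E-step.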
 

In [320]:
import numpy as np
import random
from numpy.random import multinomial
from numpy.random import beta


# data generation procedure
V = range(11)  # vocab of size 11 (word ids 0..10)

WORDSPERDOC = 100  # 100 words per doc
NDOCS = 14
alpha = [random.randint(1,5) for i in V]
alpha2 = [random.randint(1,5) for i in V]
a,b = [random.randint(1,5) for i in range(2)]

pi = beta(a,b)
group1 = np.random.dirichlet(alpha)
group2 = np.random.dirichlet(alpha2)
theta = np.vstack([group1, group2])

docs = np.zeros((NDOCS, len(V)))
lambda_ = np.zeros(NDOCS,)

for dno in range(NDOCS):
    if random.uniform(0, 1) < pi:
        w1 = multinomial(WORDSPERDOC, group1)
        docs[dno] = w1
        lambda_[dno] = 0
    else:
        w2 = multinomial(WORDSPERDOC, group2)
        docs[dno] = w2
        lambda_[dno] = 1

In [363]:
# init params
theta_hat = np.random.rand(2, len(V))
theta_hat /= np.sum(theta_hat, axis=1).reshape(-1, 1)
pi_hat = np.random.uniform(0,1)
lambda_hat = np.random.rand(1, NDOCS)

for i in range(10):
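    # estep: posterior probability that each doc came from component 0,
    # computed in log space so long documents do not underflow to 0/0
    log_d1 = np.sum(docs * np.log(theta_hat[0]), axis=1) + np.log(pi_hat)
    log_d2 = np.sum(docs * np.log(theta_hat[1]), axis=1) + np.log(1 - pi_hat)
    lambda_hat = np.exp(log_d1 - np.logaddexp(log_d1, log_d2))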

    # mstep
    
    #pi
    pi_hat = np.sum(lambda_hat)/NDOCS

    # theta 0
    expected_counts_under_assignments = lambda_hat.reshape(-1,1) * docs
    d = np.sum(expected_counts_under_assignments)
    n = np.sum(expected_counts_under_assignments,axis=0)
    theta_hat[0] = n/d

    # theta 1
    expected_counts_under_assignments = (1 - lambda_hat).reshape(-1,1) * docs
    d = np.sum(expected_counts_under_assignments)
    n = np.sum(expected_counts_under_assignments,axis=0)
    theta_hat[1] = n/d

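    # expected complete-data log-likelihood, word terms only
    # (a log-likelihood, not a negative one; the pi terms are omitted)
    ll0 = np.sum((lambda_hat.reshape(-1,1) * docs) * np.log(theta_hat[0]))
    ll1 = np.sum(((1 - lambda_hat).reshape(-1,1) * docs) * np.log(theta_hat[1]))
    print(ll0, ll1)
    print(ll0 + ll1)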

-209.22880375713453 -2845.4179912047252
-3054.64679496186
-366.9677219313346 -2642.782780389375
-3009.7505023207095
-571.8902709103461 -2381.8105576628004
-2953.7008285731463
-571.9055751413597 -2381.790073483815
-2953.6956486251747
-571.9055751413597 -2381.790073483815
-2953.6956486251747
-571.9055751413597 -2381.790073483815
-2953.6956486251747
-571.9055751413597 -2381.790073483815
-2953.6956486251747
-571.9055751413597 -2381.790073483815
-2953.6956486251747
-571.9055751413597 -2381.790073483815
-2953.6956486251747
-571.9055751413597 -2381.790073483815
-2953.6956486251747
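
Mixture components are identified only up to a permutation, so `theta_hat[0]` may correspond to either `group1` or `group2`. Below is a minimal sketch of how one might line the estimates up with the true parameters before comparing them. The helper name `align_components` is hypothetical, and the numbers in the toy example are made up rather than taken from the run above.

```python
import numpy as np


def align_components(theta_true, theta_est):
    """Return theta_est with rows ordered to best match theta_true.

    Hypothetical helper for the two-component case: it tries both row
    orderings and keeps the one with the smaller total absolute error.
    """
    theta_true = np.asarray(theta_true, dtype=float)
    theta_est = np.asarray(theta_est, dtype=float)
    err_same = np.abs(theta_true - theta_est).sum()
    err_swapped = np.abs(theta_true - theta_est[::-1]).sum()
    return theta_est if err_same <= err_swapped else theta_est[::-1]


# Toy example with made-up numbers: the estimate has its rows swapped.
theta_true = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
theta_est = np.array([[0.15, 0.10, 0.75], [0.70, 0.20, 0.10]])
aligned = align_components(theta_true, theta_est)
print(np.abs(theta_true - aligned).max())
```

In this notebook the call would be `align_components(theta, theta_hat)`. The same caveat applies to `lambda_hat`, which estimates the probability of component 0 and so may track either `lambda_` or `1 - lambda_`.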


In [361]:
theta_hat[0]

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
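
NaNs like the ones above can arise in this kind of EM loop in at least two ways: the E-step evaluates 0/0 when both exponentiated likelihoods underflow, or a word probability becomes exactly zero and `log(0)` is multiplied by a zero count. The following is a minimal sketch that guards against both, assuming a log-space E-step and a small pseudo-count in the M-step. The function name `em_two_multinomials` and the `smoothing` parameter are hypothetical, and the pseudo-count changes the estimator slightly (it is no longer the plain maximum-likelihood update used above).

```python
import numpy as np


def em_two_multinomials(docs, n_iter=50, smoothing=1e-3, seed=0):
    """EM for a two-component mixture of multinomials (illustrative sketch).

    docs: (n_docs, vocab_size) array of word counts.
    smoothing: hypothetical pseudo-count added in the M-step so that no
        word probability is ever exactly zero.
    Returns (pi_hat, theta_hat, resp), where resp[d] is the posterior
    probability that document d belongs to component 0.
    """
    rng = np.random.default_rng(seed)
    n_docs, vocab_size = docs.shape

    # Random initialisation; each row of theta_hat sums to one.
    theta_hat = rng.dirichlet(np.ones(vocab_size), size=2)
    pi_hat = 0.5

    for _ in range(n_iter):
        # E-step in log space: log p(doc, component) for each component.
        log_joint0 = docs @ np.log(theta_hat[0]) + np.log(pi_hat)
        log_joint1 = docs @ np.log(theta_hat[1]) + np.log1p(-pi_hat)
        resp = np.exp(log_joint0 - np.logaddexp(log_joint0, log_joint1))

        # M-step: mixing weight, clipped away from 0 and 1.
        pi_hat = np.clip(resp.mean(), 1e-6, 1 - 1e-6)

        # M-step: smoothed expected word counts, normalised per component.
        counts0 = resp @ docs + smoothing
        counts1 = (1 - resp) @ docs + smoothing
        theta_hat[0] = counts0 / counts0.sum()
        theta_hat[1] = counts1 / counts1.sum()

    return pi_hat, theta_hat, resp
```

With the notebook's data this would be called as `pi_hat, theta_hat, lambda_hat = em_two_multinomials(docs)`. EM can still converge to a poor local optimum, so running it from several seeds and keeping the best log-likelihood is a common precaution.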