A researcher wants a generative model of entities and their attributes. The data take the form $\mathcal{D}_{\lambda} = [(e_1,a_1),(e_2,a_2), \ldots, (e_N,a_N)]$, where $\lambda$ is an extraction strategy, each $e_i$ belongs to an entity set $\mathcal{E}$, and each $a_i$ belongs to an attribute set $\mathcal{A}$. Note that $\lambda$ defines $\mathcal{D}_{\lambda}$.

There are lots of reasons to prefer a generative model of $\mathcal{D}_{\lambda}$:
1. testing hypotheses about $(e,a)$ pairs with likelihood ratios
2. quantifying uncertainty about conclusions with credible intervals
3. transparently incorporating prior beliefs about entities and their attributes 
     - possibly even including prior beliefs about extraction strategies $\lambda$, entity sets $\mathcal{E}$ and attribute sets $\mathcal{A}$
4. updating priors in the face of evidence
5. suggesting new interpretations of the data (e.g. suggesting new $a$ for inclusion in $\mathcal{A}$, or new $e$ for inclusion in $\mathcal{E}$)

A researcher expresses their beliefs about $\mathcal{A}$, $\mathcal{E}$ and $\lambda$ via rules.

Also:
- Blei [notes](https://www.cs.princeton.edu/courses/archive/spring12/cos424/pdf/em-mixtures.pdf)


 

Title: Sort rules for entity-attribute analysis

Basic idea: use Snorkel rules to define E and A, which are typically unknown at the start of the process.
- What are the Ys? The Ys are the zs, i.e. the labels on $(e, a, m)$ triples. They are just not known at the moment.
- Let $(e,a,m)$ be an entity, attribute, mention triple. Let $z$ be a variable expressing: does $a$ refer to a permanent property of $e$ in $m$?
    - $z$ is analogous to $y$ in Snorkel.
    - $a$ is analogous to a version of $\lambda$ in Snorkel.
    - For now, assume $E$ is known ahead of time.

- Suggest other rules. Incorporate feedback from researcher into a soft analysis system.
- Rules could be generated from a regular expression, as in Sipser.

 - Say the data is generated via a mixture of multinomials. (It could also be a mixture of HMMs, etc). 
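
For concreteness, a two-component mixture of multinomials and its EM updates can be written as below. The notation is introduced here for illustration and is an assumption, not part of the original notes: $x_d$ is the word-count vector of document $d$, $n_d$ its length, $c_d \in \{0, 1\}$ its component, $D$ the number of documents, and $\gamma_d$ the responsibility of component 0.

$$
P(c_d = 0) = \pi, \qquad x_d \mid c_d = k \sim \mathrm{Multinomial}(n_d, \theta_k), \quad k \in \{0, 1\}
$$

E-step (the multinomial coefficients cancel in the ratio):

$$
\gamma_d = P(c_d = 0 \mid x_d) = \frac{\pi \prod_v \theta_{0v}^{x_{dv}}}{\pi \prod_v \theta_{0v}^{x_{dv}} + (1 - \pi) \prod_v \theta_{1v}^{x_{dv}}}
$$

M-step:

$$
\pi = \frac{1}{D} \sum_d \gamma_d, \qquad
\theta_{0v} = \frac{\sum_d \gamma_d \, x_{dv}}{\sum_d \sum_{v'} \gamma_d \, x_{dv'}}, \qquad
\theta_{1v} = \frac{\sum_d (1 - \gamma_d) \, x_{dv}}{\sum_d \sum_{v'} (1 - \gamma_d) \, x_{dv'}}
$$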
 

In [2]:
import autograd.numpy as np
import random
from numpy.random import multinomial
from numpy.random import beta


# data generation procedure
V = range(11)  # vocab of size 11 (word ids 0..10)

WORDSPERDOC = 100  # 100 words per doc
NDOCS = 21
alpha = [random.randint(1,6) for i in V]
alpha2 = [random.randint(1,6) for i in V]
a,b = [random.randint(1,6) for i in range(2)]

pi = beta(a,b)
group1 = np.random.dirichlet(alpha)
group2 = np.random.dirichlet(alpha2)
theta = np.vstack([group1, group2])

docs = np.zeros((NDOCS, len(V)))
lambda_ = np.zeros(NDOCS,)

for dno in range(NDOCS):
    # lambda_ records the true component of each doc: 0 = group1, 1 = group2
    if random.uniform(0, 1) < pi:
        docs[dno] = multinomial(WORDSPERDOC, group1)
        lambda_[dno] = 0
    else:
        docs[dno] = multinomial(WORDSPERDOC, group2)
        lambda_[dno] = 1

In [2]:
# init params
theta_hat = np.random.rand(2, len(V))
theta_hat /= np.sum(theta_hat, axis=1).reshape(-1, 1)
pi_hat = np.random.uniform(0,1)
lambda_hat = np.random.rand(1, NDOCS)

for i in range(10):
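    # estep: lambda_hat[d] is the posterior probability that doc d came from component 0
    # (exponentiating summed log-probabilities can underflow for long documents)
    d1 = np.exp(np.sum((docs * np.log(theta_hat[0])), axis=1) + np.log(pi_hat))
    d2 = np.exp(np.sum((docs * np.log(theta_hat[1])), axis=1) + np.log(1 - pi_hat))
    lambda_hat = (d1/(d1 + d2))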

    # mstep
    
    #pi
    pi_hat = np.sum(lambda_hat)/NDOCS

    # theta 0
    expected_counts_under_assignments = lambda_hat.reshape(-1,1) * docs
    d = np.sum(expected_counts_under_assignments)
    n = np.sum(expected_counts_under_assignments,axis=0)
    theta_hat[0] = n/d

    # theta 1
    expected_counts_under_assignments = (1 - lambda_hat).reshape(-1,1) * docs
    d = np.sum(expected_counts_under_assignments)
    n = np.sum(expected_counts_under_assignments,axis=0)
    theta_hat[1] = n/d

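    # expected complete-data log-likelihood, word terms only (mixing-weight terms omitted);
    # this is a log-likelihood rather than an NLL, so larger (less negative) is better
    ll0 = np.sum((lambda_hat.reshape(-1,1) * docs) * np.log(theta_hat[0]))
    ll1 = np.sum(((1 - lambda_hat).reshape(-1,1) * docs) * np.log(theta_hat[1]))
    print(ll0 + ll1)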

-4899.849849071947
-4585.374550037082
-4517.025288795171
-4517.025288795171
-4517.025288795171
-4517.025288795171
-4517.025288795171
-4517.025288795171
-4517.025288795171
-4517.025288795171
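
The E-step above exponentiates a sum of log-probabilities, which is fine for 100-word documents but underflows to zero for much longer ones. A minimal sketch of the same E-step in log space follows, assuming plain NumPy; the function name `e_step_log_space` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np

def e_step_log_space(docs, theta_hat, pi_hat):
    """Posterior probability of component 0 per doc, computed in log space."""
    # log p(doc, component k), up to the multinomial coefficient (which cancels)
    log_d1 = docs @ np.log(theta_hat[0]) + np.log(pi_hat)
    log_d2 = docs @ np.log(theta_hat[1]) + np.log(1 - pi_hat)
    # log(d1 + d2) without ever forming d1 or d2
    log_norm = np.logaddexp(log_d1, log_d2)
    # second value: observed-data log-likelihood, up to multinomial coefficients
    return np.exp(log_d1 - log_norm), log_norm.sum()


rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(11), size=2)
# 10,000 words per doc: long enough that exp(sum of log-probs) underflows to 0.0
long_docs = np.vstack([rng.multinomial(10_000, theta[k]) for k in (0, 1, 0, 1)])

resp, loglik = e_step_log_space(long_docs, theta, 0.5)
print(resp)
print(loglik)
```

The second return value can be printed once per iteration to monitor convergence; unlike the quantity printed in the loop above, it includes the mixing weights.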


In [3]:
from autograd import elementwise_grad as egrad
# https://github.com/HIPS/autograd

def tanh(x):                 # Define a function
    y = np.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)


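def eq1_simple(a,b,lambda_,y):
    # assumed reading (data programming, eq. 1): a[m] is the accuracy of rule m,
    # b[m] is its coverage, lambda_[m] in {-1, 0, 1} is its vote
    M = len(lambda_)
    ou = 1
    for m in range(M):
        # the three cases are mutually exclusive: sum them within a rule and
        # multiply across rules (multiplying the cases together always gives 0)
        ou *= (b[m] * a[m] * (lambda_[m] == y[m])
               + b[m] * (1 - a[m]) * (lambda_[m] == -1 * y[m])
               + (1 - b[m]) * (lambda_[m] == 0))
    return ou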
    

grad_tanh = egrad(tanh)        # Obtain its gradient function
grad_tanh(np.asarray([1.,2.])) # Evaluate the gradient elementwise at x = 1.0 and x = 2.0
grad = egrad(eq1_simple)


# what are the attributes? what are the entities?
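
A quick sanity check on the rule likelihood, assuming `eq1_simple` is meant to follow the per-rule likelihood from the data programming work behind Snorkel (accuracy times coverage when the rule agrees with the label, and so on). The function names and the accuracy and coverage values below are hypothetical, and a single label is shared by all rules rather than indexed per rule. If the reading is right, the three cases for one rule should sum to one.

```python
import numpy as np

def rule_likelihood(acc, cov, vote, label):
    """P(vote | label) for one rule: vote in {-1, 0, 1}, label in {-1, 1}."""
    return (cov * acc * (vote == label)
            + cov * (1 - acc) * (vote == -label)
            + (1 - cov) * (vote == 0))

def joint_likelihood(acc, cov, votes, label):
    """Product over rules, assuming rules vote independently given the label."""
    return np.prod([rule_likelihood(a, b, v, label)
                    for a, b, v in zip(acc, cov, votes)])

acc = np.array([0.9, 0.6])   # hypothetical accuracies
cov = np.array([0.5, 0.8])   # hypothetical coverages

# For each rule, the probabilities of the three possible votes sum to one
totals = [sum(rule_likelihood(a, b, v, 1) for v in (-1, 0, 1))
          for a, b in zip(acc, cov)]
print(totals)

# Rule 0 votes with the label, rule 1 abstains
joint = joint_likelihood(acc, cov, votes=[1, 0], label=1)
print(joint)
```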

In [574]:
from __future__ import division
import json
from collections import defaultdict
from collections import Counter

import numpy as np

breeds = ["pitbull", "lab", "golden"]
attributes = ["aggressive"]

emissions = defaultdict(list)
stream = []

import string 
punct = [i for i in string.punctuation]

with open("dogs.subreddit.mini.jsonl", "r") as inf:
    for i in inf:
        i = json.loads(i)
        words = [w["word"].lower() for w in i['tokens']]  # lowercase, to match the checks below
        if len(set(words) & set(breeds + attributes)) > 0:
            stream.append("SOS")
            for t in i["tokens"]:
                if t["word"] not in ['\\', 'n', '"', "''"] + punct:
                    pos = t["pos"]
                    if t["word"].lower() in breeds:
                        pos = "E"
                    if t["word"].lower() in attributes:
                        pos = "A"
                    emissions[pos].append(t["word"].lower())
                    stream.append(pos)
            stream.append("EOS")
        
transitions = defaultdict(list)
for sno in range(len(stream) - 1):
    s1,s2 = stream[sno:sno+2]
    transitions[s1].append(s2)

states = set(transitions)
V = set(j for i in emissions.values() for j in i)
s2n = {v:k for k,v in enumerate(states)}
v2n = {v:k for k,v in enumerate(V)}
n2v = {k:v for k,v in enumerate(v2n)}
n2s = {k:v for k,v in enumerate(s2n)}

tprobs = np.zeros((len(states), len(states)))
eprobs = np.zeros((len(states), len(V)))

for t in transitions:
    tot = len(transitions[t])
    observed = Counter(transitions[t])
    for s2,count in observed.items():
        tprobs[s2n[t]][s2n[s2]] = count/tot

for state in emissions:
    tot = len(emissions[state])
    observed = Counter(emissions[state])
    for em, count in observed.items():
        eprobs[s2n[state]][v2n[em]] = count/tot
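
The transition estimate above is a row-normalized bigram count over the tag stream. A toy version makes the resulting matrix easy to inspect; the tag stream and the `toy_` names below are hypothetical. Because documents are concatenated into one stream, the counts also include an `EOS` to `SOS` transition between documents.

```python
from collections import Counter, defaultdict
import numpy as np

# Hypothetical tag stream: two tiny "documents" wrapped in SOS/EOS markers
toy_stream = ["SOS", "E", "VBZ", "A", "EOS",
              "SOS", "A", "NN", "EOS"]

# Count bigram transitions between consecutive states
toy_counts = defaultdict(Counter)
for s1, s2 in zip(toy_stream, toy_stream[1:]):
    toy_counts[s1][s2] += 1

toy_states = sorted(set(toy_stream))
idx = {s: k for k, s in enumerate(toy_states)}

# Row-normalize the counts into transition probabilities
toy_tprobs = np.zeros((len(toy_states), len(toy_states)))
for s1, nxt in toy_counts.items():
    total = sum(nxt.values())
    for s2, count in nxt.items():
        toy_tprobs[idx[s1], idx[s2]] = count / total

print(toy_states)
print(toy_tprobs[idx["A"]])
```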

In [575]:
def sample_next_state(state):
    return n2s[np.nonzero(np.random.multinomial(1, tprobs[s2n[state]]))[0][0]]

def sample_emission(state):
    return n2v[np.nonzero(np.random.multinomial(1, eprobs[s2n[state]]))[0][0]]

from random import  randint

# generate from the model. This is equivalent to forward MC search over the WFSA lattice 

patterns = []

for i in range(1000):
    state = "A"
    K = randint(2, 6)
    pattern = " "
    for i in range(K):
        em = sample_emission(state)
        state = sample_next_state(state)
        if state == "EOS":
            break
        pattern = pattern + ' ' + em
    patterns.append(pattern.strip())
patterns = set([i for i in patterns if len(i.split(" ")) > 1])

In [576]:
with open("dogs.subreddit.mini.jsonl", "r") as inf:
    for i in inf:
        i = json.loads(i)
        words = " ".join([o['word'] for o in i['tokens']])
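        for p in patterns:
            # substring match: a pattern can also match part of a longer token
            if p in words:
                print(p)
                ix = words.index(p)
                # clamp the start so a match near the beginning still prints its context
                print(words[max(0, ix - 200):ix + 200])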

aggressive breed
the jaw - this technique can be used with all dogs but pitties in particular - check out badrap dog training \ n \ nAnyways bottom line is this - the queensland / austrialian cattle dog is a slightly aggressive breed - they are kick ass dogs who are brilliant and loyal but they typically are a family dog who is only cuddly once they know you .
aggressive but
The Austrialian shepard is a less aggressive but they can be .
aggressive but the
The Austrialian shepard is a less aggressive but they can be .
aggressive i
aggressive in
aggressive or
NO belgian should be aggressive or hostile in normal life .
aggressive or a
aying she 's for-sure fear aggressive or anything , but the growling is a little worrisome on that end and I just would n't push her too hard these first few weeks .
aggressive or
aying she 's for-sure fear aggressive or anything , but the growling is a little worrisome on that end and I just would n't push her too hard these first few weeks .
aggressive tow
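
Hits such as "aggressive i" and "aggressive tow" come from substring matching: the pattern matches the start of a longer token. A minimal sketch of matching on token boundaries instead, assuming whitespace-separated tokens; the helper `find_token_matches` and the example sentence are hypothetical.

```python
def find_token_matches(pattern, tokens, window=5):
    """Yield (start, context) for each place the pattern's tokens occur in sequence."""
    pat = pattern.lower().split()
    if not pat:
        return
    lowered = [t.lower() for t in tokens]
    for start in range(len(lowered) - len(pat) + 1):
        if lowered[start:start + len(pat)] == pat:
            left = max(0, start - window)  # clamp so early matches keep their context
            yield start, " ".join(tokens[left:start + len(pat) + window])


tokens = "He is aggressive in the yard but not aggressive indoors".split()
matches = list(find_token_matches("aggressive in", tokens))
partial = list(find_token_matches("aggressive i", tokens))
print(matches)
print(partial)
```

Only whole-token matches survive, so "aggressive in" is found once and "aggressive i" not at all.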

## Basic process

1. Define
2. grep
3. train
4. sample
5. define

## What is different about this? What is new here?

1. human-in-the-loop DIPRE
2. summarization problems are a new angle also
3. You could sell this as interpretable rules for manual coding.
      - Say you have entities and known gendered attributes. What are the syntactic structures connecting them? You have this with Wu. Can you make it into an interpretable rule? One simple rule is `a <-amod- e`. Even just defining the syntactic relationships between the entities and their attributes in an interpretable way is new. How are they related? The idea is that gendered or racialized words can predict the class. This is quite vague. **Why** is this racism or sexism? This is missing from the literature. One thing you could do is read. Another thing you could do is try to explain: "Explaining racist or gendered language."
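
The amod rule mentioned above can be sketched as a filter over dependency edges. This assumes a parser has already produced (head, relation, dependent) triples; the `Dep` type, the `amod_rule` helper, and the example parse are all hypothetical.

```python
from typing import NamedTuple

class Dep(NamedTuple):
    head: str       # governor token
    rel: str        # dependency relation label
    dependent: str  # dependent token

def amod_rule(deps, entities, attributes):
    """Return (e, a) pairs where attribute a modifies entity e via amod."""
    return [(d.head, d.dependent) for d in deps
            if d.rel == "amod" and d.head in entities and d.dependent in attributes]

# Hypothetical parse of "the aggressive pitbull chased a friendly lab"
deps = [
    Dep("pitbull", "det", "the"),
    Dep("pitbull", "amod", "aggressive"),
    Dep("chased", "nsubj", "pitbull"),
    Dep("lab", "det", "a"),
    Dep("lab", "amod", "friendly"),
    Dep("chased", "obj", "lab"),
]
pairs = amod_rule(deps, entities={"pitbull", "lab", "golden"}, attributes={"aggressive"})
print(pairs)  # [('pitbull', 'aggressive')]
```

"friendly" is skipped because it is not in the attribute set, which is one way such a rule could suggest new candidates for $\mathcal{A}$.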
      
1. Do you know e?
2. Do you know a? 
3. If you know e and a, then **why**. 
4. If you have such an explanation and it is good, then it should be robust to modeling choices.
5. So you can vary $\lambda$, run the models under each choice, and see what stays the same.
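
One rough way to operationalize "see what stays the same" is to keep only the $(e, a)$ conclusions that recur across extraction strategies. The sketch below assumes each strategy yields a set of $(e, a)$ pairs; the strategy names, the pairs, and the `stable_pairs` helper are hypothetical.

```python
from collections import Counter

def stable_pairs(results_by_strategy, min_share=1.0):
    """Return {(e, a): share} for pairs found under at least min_share of the strategies."""
    counts = Counter(pair
                     for pairs in results_by_strategy.values()
                     for pair in set(pairs))
    n = len(results_by_strategy)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_share}

# Hypothetical output of three extraction strategies
results = {
    "amod": {("pitbull", "aggressive"), ("lab", "friendly")},
    "copula": {("pitbull", "aggressive")},
    "window5": {("pitbull", "aggressive"), ("golden", "aggressive")},
}
robust = stable_pairs(results)
print(robust)
```

Lowering `min_share` trades robustness for recall: pairs found by only some strategies are then kept along with their share.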