# PCFG

## Motivation: ambiguity

One crucial use of probabilistic parsing is disambiguation.

The CKY parsing algorithm can represent ambiguities such as **coordination ambiguity** and **attachment ambiguity** efficiently, but it is not equipped to resolve them.

A general solution is to add probabilities: compute the probability of each interpretation and choose the most probable one.

The most commonly used probabilistic constituency grammar formalism is the **probabilistic context-free grammar (PCFG)**, a probabilistic augmentation of context-free grammars in which each rule is associated with a probability.

## Definition

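A PCFG augments a context-free grammar by attaching a probability to each rule. The probability of a rule $\alpha \to \beta$ is conditioned on its left-hand side, so the probabilities of all expansions of the same non-terminal sum to 1.

* A PCFG is consistent if the probabilities of all sentences in the language sum to 1.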
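
To make the definition concrete: under a PCFG, the probability of a parse tree is the product of the probabilities of the rules used to build it. The sketch below illustrates this with a hypothetical toy grammar and a nested-tuple tree representation. Neither comes from the treebank used later, and the names `RULE_PROB`, `tree_rules`, and `tree_prob` are made up for illustration.

```python
from math import prod

# Hypothetical toy PCFG (an assumption, not the treebank grammar):
# (lhs, rhs) -> probability. The expansions of each lhs sum to 1.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("fish",)): 0.6,
    ("V", ("eats",)): 0.7,
    ("V", ("fish",)): 0.3,
}

def tree_rules(tree):
    """Yield the (lhs, rhs) rule applied at every internal node of a nested-tuple tree."""
    label, *children = tree
    # A child is either a word (str) or a subtree whose first element is its label.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from tree_rules(child)

def tree_prob(tree, rule_prob):
    """Probability of a tree: the product of the probabilities of its rules."""
    return prod(rule_prob[rule] for rule in tree_rules(tree))

tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
p = tree_prob(tree, RULE_PROB)
```

Here `p` is 1.0 × 0.4 × 1.0 × 0.7 × 0.6, roughly 0.168. When a sentence has several parses, comparing these products is what lets a PCFG prefer one interpretation over another.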


## Learning the PCFG Rule Probabilities
The simplest way to estimate the rule probabilities is from a treebank, a corpus of sentences that have already been parsed. Given a treebank, we can compute the probability of each expansion of a non-terminal by counting the number of times that expansion occurs and then normalizing.

$$P(\alpha \to \beta \mid \alpha) = \frac{Count(\alpha \to \beta)}{Count(\alpha)}$$

In [1]:
import nltk
from nltk import Tree
from collections import Counter, defaultdict

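def update_production_cnt(tree_str, c):
    '''Add the productions of one bracketed tree to c, grouped by left-hand side.'''
    tree = Tree.fromstring(tree_str)
    for p in tree.productions():
        c[p.lhs()].update([p])

def calculate_prob(c):
    '''Turn counts {lhs: Counter(rule)} into relative frequencies {lhs: {rule: prob}}.'''
    d = defaultdict(dict)
    for key, cnt in c.items():
        size = sum(cnt.values())  # Count(lhs): total expansions of this non-terminal
        for rule, occ in cnt.items():
            d[key][rule] = occ / size
    return d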

In [2]:
pcfg_cnt = defaultdict(Counter)

treebank_filename = 'data/treebank.txt'

with open(treebank_filename, 'r') as f:
    # One bracketed tree per non-empty line
    for line in f:
        if line.strip():
            update_production_cnt(line.strip(), pcfg_cnt)

pcfg = calculate_prob(pcfg_cnt)

Take a look at some rules and their probabilities:

In [6]:
pcfg[list(pcfg.keys())[0]]

{TOP -> SQ PUNC: 0.048638132295719845,
 TOP -> S PUNC: 0.08949416342412451,
 TOP -> SBARQ PUNC: 0.23540856031128404,
 TOP -> S_VP PUNC: 0.2140077821011673,
 TOP -> FRAG_NP PUNC: 0.2237354085603113,
 TOP -> FRAG PUNC: 0.054474708171206226,
 TOP -> INTJ_UH PUNC: 0.0038910505836575876,
 TOP -> FRAG_NP_NN PUNC: 0.0038910505836575876,
 TOP -> SBAR PUNC: 0.0019455252918287938,
 TOP -> FRAG_VP PUNC: 0.019455252918287938,
 TOP -> NP PUNC: 0.005836575875486381,
 TOP -> FRAG_NP_NNP PUNC: 0.0019455252918287938,
 TOP -> FRAG_WHNP PUNC: 0.07392996108949416,
 TOP -> FRAG_PP PUNC: 0.011673151750972763,
 TOP -> FRAG_ADJP_JJ PUNC: 0.0038910505836575876,
 TOP -> ADJP_JJ PUNC: 0.0019455252918287938,
 TOP -> FRAG_ADJP_JJS PUNC: 0.0019455252918287938,
 TOP -> FRAG_NP_NNS PUNC: 0.0019455252918287938,
 TOP -> X_S_VP PUNC: 0.0019455252918287938}
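
Because the estimates are relative frequencies, the probabilities listed for each left-hand side should sum to 1, up to floating-point error. A quick sanity check along these lines can catch counting mistakes before the grammar is written to disk. This is a minimal sketch: `unnormalized_lhs` and the `toy` table are hypothetical, and the only assumption about the real data is that it has the same `{lhs: {rule: prob}}` shape as the table returned by `calculate_prob`.

```python
from math import isclose

def unnormalized_lhs(prob_table, tol=1e-9):
    """Return the left-hand sides whose rule probabilities do not sum to 1."""
    return [lhs for lhs, rules in prob_table.items()
            if not isclose(sum(rules.values()), 1.0, abs_tol=tol)]

# Hypothetical table, for illustration only.
toy = {
    "TOP": {"TOP -> S PUNC": 0.75, "TOP -> FRAG PUNC": 0.25},
    "NP": {"NP -> DT NN": 0.5, "NP -> NNS": 0.3},  # deliberately sums to 0.8
}
bad = unnormalized_lhs(toy)  # only "NP" fails the check
```

Applied to the real table, `unnormalized_lhs(pcfg)` would be expected to return an empty list.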

In [18]:
with open('data/grammar.pcfg', 'w') as f:
    f.write('%start TOP\n')
    for key, cnt in pcfg.items():
        for rule, p in cnt.items():
            f.write('{} [{}]\n'.format(rule, p))

# PCKY
PCKY (probabilistic CKY) extends the CKY algorithm with rule probabilities: each table entry carries the probability of its subtree, and the most probable parse of the sentence is returned.
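
As a sketch of the underlying recurrence, assume a grammar with only binary and lexical rules, and let $\pi[i,j,A]$ (notation introduced here for illustration) be the probability of the best subtree rooted in $A$ that covers the words $w_i \dots w_j$, with both ends inclusive:

$$\pi[j,j,A] = P(A \to w_j)$$

$$\pi[i,j,A] = \max_{i \le k < j} \; \max_{A \to B\,C} \; P(A \to B\,C) \cdot \pi[i,k,B] \cdot \pi[k+1,j,C]$$

For a sentence of $n$ words indexed from 0, the probability of the most probable parse is then $\pi[0, n-1, \cdot]$ evaluated at the start symbol.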

In [15]:
from nltk import PCFG
from nltk.tokenize import word_tokenize

def parse(sent, grammar):
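    '''
    sent: list of words
    grammar: NLTK PCFG grammar (only lexical and binary rules are used)
    Returns a table in which table[i][j] lists (symbol, tree, probability)
    for every analysis found for the words i..j, both ends inclusive.
    '''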
    table = [[None]*len(sent) for _ in range(len(sent))]
    for j in range(len(sent)):
        table[j][j] = [(rule.lhs(), Tree(str(rule.lhs()), rule.rhs()), rule.prob()) for rule in grammar.productions(rhs=sent[j])]
        for i in range(j)[::-1]:
            for k in range(i, j):
                l1, l2 = table[i][k], table[k+1][j]
                l3 = table[i][j] if table[i][j] else []
                if l1 and l2:
                    for i1 in l1:
                        for i2 in l2:
                            s1, tree1, prob1 = i1
                            s2, tree2, prob2 = i2
                            possible_rule = grammar.productions(rhs=s1)
                            for r in possible_rule:
                                if r.rhs() == (s1, s2):
                                    l3 += [(r.lhs(), Tree(str(r.lhs()), [tree1, tree2]), prob1*prob2*r.prob())]
                    table[i][j] = l3
    return table

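def parse_sentence(sent, grammar):
    table = parse(word_tokenize(sent), grammar)
    # Keep only analyses rooted in the start symbol, then print the most probable one.
    candidates = [entry for entry in (table[0][-1] or []) if entry[0] == grammar.start()]
    if candidates:
        print(max(candidates, key=lambda x: x[2])[1])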
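
The `parse` function above stores every analysis of every span, which can grow quickly for ambiguous sentences. A common alternative is to keep only the most probable analysis per non-terminal and span. The following is a minimal, NLTK-free sketch of that idea under stated assumptions: the toy grammar, the dictionary layout (`LEXICAL`, `BINARY`), and the function name `viterbi_cky` are all hypothetical and are not part of the notebook's data.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (not the treebank grammar).
LEXICAL = {  # word -> [(lhs, prob)]
    "she": [("NP", 0.4)],
    "fish": [("NP", 0.6), ("V", 0.3)],
    "eats": [("V", 0.7)],
}
BINARY = {  # (B, C) -> [(A, prob)] for each rule A -> B C
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
}

def viterbi_cky(words, lexical, binary):
    """Return best[(i, j)][A] = (prob, tree) for the inclusive span i..j."""
    n = len(words)
    best = defaultdict(dict)
    # Base case: lexical rules fill the diagonal.
    for j, word in enumerate(words):
        for lhs, p in lexical.get(word, []):
            best[j, j][lhs] = (p, (lhs, word))
    # Longer spans are built from two adjacent shorter spans.
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            for k in range(i, j):  # left part is i..k, right part is k+1..j
                for b, (pb, tb) in best[i, k].items():
                    for c, (pc, tc) in best[k + 1, j].items():
                        for a, pr in binary.get((b, c), []):
                            p = pr * pb * pc
                            # Keep only the best analysis for (span, A).
                            if p > best[i, j].get(a, (0.0, None))[0]:
                                best[i, j][a] = (p, (a, tb, tc))
    return best

chart = viterbi_cky(["she", "eats", "fish"], LEXICAL, BINARY)
prob, tree = chart[0, 2]["S"]
```

With this toy grammar, `prob` is approximately 0.4 × 0.7 × 0.6 = 0.168 and `tree` nests `("NP", "she")` beside the `VP` subtree. Each cell then holds at most one entry per non-terminal, so the table size no longer depends on the number of parses.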

In [19]:
pcfg = nltk.data.load('data/grammar.pcfg')

text = 'Show me all flights that depart before ten a.m and have first class .'

parse_sentence(text, pcfg)

(TOP
  (S_VP
    (S_VP_PRIME (VB Show) (NP_PRP me))
    (NP
      (NP_PRIME
        (NP
          (NP (DT all) (NNS flights))
          (SBAR
            (WHNP_WDT that)
            (S_VP
              (VBP depart)
              (PP (IN before) (NP (CD ten) (RB a.m))))))
        (CC and))
      (VP (VBP have) (NP (JJ first) (NN class)))))
  (PUNC .))
