# Demos for _Background_

This notebook contains small demos of the measures used in the pipeline, showing how each one ranks candidate terms. The ACL corpus is used as an example.

In [1]:
import utils
import os
os.chdir(utils.ROOT)

In [2]:
from datautils import dataio
from pipeline import components as cm
from stats import conceptstats, ngramcounting

In [3]:
# load the corpus and the n-gram model, and set the thresholds used below
corpus = dataio.load_acl_corpus()
ngram_model = ngramcounting.NgramModel.load_model('acl', '_all')
C_VALUE_THRESHOLD = 2
FREQ_THRESHOLD = 2

Loading ACL 2.0 corpus: 100%|██████████| 300/300 [00:00<00:00, 497.34it/s]


## Extraction

In [4]:
extractor = cm.CandidateExtractor(cm.ExtractionFilters.SIMPLE)
for doc in corpus:
    extractor.extract_candidates(doc)

In [31]:
# documents ordered by number of extracted candidates, largest first
packed_docs = sorted(corpus, key=lambda d: len(extractor.doc_index[d]), reverse=True)

In [58]:
print(*(c.get_covered_text() for c in sorted(extractor.doc_index[packed_docs[5]], key=lambda c: c.span[0])), sep='\n')

Syntax-based statistical machine
Syntax-based statistical machine translation
statistical machine
statistical machine translation
machine translation
statistical models
structured data
syntax-based statistical machine translation
syntax-based statistical machine
syntax-based statistical machine translation system
statistical machine translation
statistical machine
statistical machine translation system
machine translation
machine translation system
translation system
probabilistic synchronous dependency insertion grammar
probabilistic synchronous dependency
probabilistic synchronous dependency insertion
synchronous dependency insertion grammar
synchronous dependency
synchronous dependency insertion
dependency insertion
dependency insertion grammar
insertion grammar
Synchronous dependency insertion
Synchronous dependency
Synchronous dependency insertion grammars
dependency insertion
dependency insertion grammars
insertion grammars
synchronous grammars
dependency trees
parallel corpora
g

In [60]:
print(*[t.get_covered_text() + '/\\textsc{' + t.pos.lower() + '}' for t in packed_docs[5].get_annotations('Token')])

Syntax-based/\textsc{jj} statistical/\textsc{jj} machine/\textsc{nn} translation/\textsc{nn} (/\textsc{-lrb-} MT/\textsc{nn} )/\textsc{-rrb-} aims/\textsc{vbz} at/\textsc{in} applying/\textsc{vbg} statistical/\textsc{jj} models/\textsc{nns} to/\textsc{to} structured/\textsc{jj} data/\textsc{nns} ./\textsc{.} In/\textsc{in} this/\textsc{dt} paper/\textsc{nn} ,/\textsc{,} we/\textsc{prp} present/\textsc{vbp} a/\textsc{dt} syntax-based/\textsc{jj} statistical/\textsc{jj} machine/\textsc{nn} translation/\textsc{nn} system/\textsc{nn} based/\textsc{vbn} on/\textsc{in} a/\textsc{dt} probabilistic/\textsc{jj} synchronous/\textsc{jj} dependency/\textsc{nn} insertion/\textsc{nn} grammar/\textsc{nn} ./\textsc{.} Synchronous/\textsc{jj} dependency/\textsc{nn} insertion/\textsc{nn} grammars/\textsc{nns} are/\textsc{vbp} a/\textsc{dt} version/\textsc{nn} of/\textsc{in} synchronous/\textsc{jj} grammars/\textsc{nns} defined/\textsc{vbn} on/\textsc{in} dependency/\textsc{nn} trees/\textsc{nns} ./\text
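
The multi-word candidates printed earlier can be reproduced from these tags with a simple part-of-speech pattern. The sketch below is an assumption inferred from that output, not the actual definition of `ExtractionFilters.SIMPLE`: it keeps every contiguous run of adjectives and nouns that has at least two tokens and ends in a noun. The name `candidate_spans` and the tag-prefix test are hypothetical.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (surface form, POS tag)


def is_noun(tag: str) -> bool:
    # Matches nn, nns, nnp, nnps.
    return tag.lower().startswith("nn")


def is_adj_or_noun(tag: str) -> bool:
    # Matches jj, jjr, jjs and all noun tags.
    return tag.lower().startswith(("jj", "nn"))


def candidate_spans(tagged: List[TaggedToken], min_len: int = 2) -> List[str]:
    """Return all adjective/noun runs of at least min_len tokens ending in a noun."""
    spans = []
    for start in range(len(tagged)):
        end = start
        # Grow the span while tokens are adjectives or nouns.
        while end < len(tagged) and is_adj_or_noun(tagged[end][1]):
            end += 1
            if end - start >= min_len and is_noun(tagged[end - 1][1]):
                spans.append(" ".join(tok for tok, _ in tagged[start:end]))
    return spans


# First tokens of the document shown above.
tagged = [
    ("Syntax-based", "jj"), ("statistical", "jj"), ("machine", "nn"),
    ("translation", "nn"), ("(", "-lrb-"), ("MT", "nn"), (")", "-rrb-"),
    ("aims", "vbz"),
]
spans = candidate_spans(tagged)
print(*spans, sep="\n")
```

On this fragment the sketch yields the same five candidates, in the same order, as the first five lines of the extractor output.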

In [48]:
print(*packed_docs, sep='\n\n')

At MIT Lincoln Laboratory, we have been developing a Korean-to-English machine translation system CCLINC (Common Coalition Language System at Lincoln Laboratory). The CCLINC Korean-to-English translation system consists of two core modules, language understanding and generation modules mediated by a language neutral meaning representation called a semantic frame. The key features of the system include: (i) Robust efficient parsing of Korean (a verb final language with overt case markers, relatively free word order, and frequent omissions of arguments). (ii) High quality translation via word sense disambiguation and accurate word order generation of the target language. (iii) Rapid system development and porting to new domains via knowledge-based automated acquisition of grammars. Having been trained on Korean newspaper articles on missiles and chemical biological warfare, the system produces the translation output sufficient for content understanding of the original document.

We prese

## Ranking

In [19]:
term_freqs = extractor.term_frequencies()
c_value = cm.CValueRanker(extractor, C_VALUE_THRESHOLD)
rect_freq = cm.RectifiedFreqRanker(extractor)
tf_idf = cm.TfIdfRanker(extractor, n_docs=len(corpus))
glossex = cm.GlossexRanker(extractor, ngram_model)
pmi_nl = cm.PmiNlRanker(extractor, ngram_model)
term_coherence = cm.TermCoherenceRanker(extractor, ngram_model)
voter = cm.VotingRanker(extractor, rect_freq, c_value, tf_idf, glossex, pmi_nl,
                        term_coherence, weights=[2, 1, 1, 1, 1, 1])

Calculating C-values
Calculating Rectified Frequencies
Calculating TF-IDF values
Calculating Glossex values
Calculating length normalized PMI values
Calculating Term Coherence values
Calculating votes between rankers


## Frequency


In [6]:
# top 10
for c, count in term_freqs.most_common(10):
    print(*c, '\t&', count, '\t\\\\')

natural language 	& 63 	\\
machine translation 	& 56 	\\
statistical machine 	& 20 	\\
statistical machine translation 	& 20 	\\
speech recognition 	& 18 	\\
language processing 	& 16 	\\
experimental result 	& 16 	\\
word sense 	& 15 	\\
translation system 	& 15 	\\
language model 	& 15 	\\


In [7]:
# bottom 10 among the terms that meet the frequency threshold
frequent = [(c, count) for c, count in term_freqs.most_common() if count >= FREQ_THRESHOLD]
for c, count in frequent[-10:]:
    print(*c, '\t&', count, '\t\\\\')

time o 	& 2 	\\
single system 	& 2 	\\
finite-state approximation 	& 2 	\\
contrastive accent 	& 2 	\\
construct algebra 	& 2 	\\
dialog motivator 	& 2 	\\
character type 	& 2 	\\
semantic network 	& 2 	\\
calculus notation 	& 2 	\\
path-based inference rule 	& 2 	\\
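
Terms are handled as tuples of tokens (hence `print(*c, ...)`), and `term_freqs` supports `most_common`, which suggests a `collections.Counter`. A minimal sketch of the counting and thresholding step under that assumption follows; the occurrence list is made-up toy data, not taken from the corpus.

```python
from collections import Counter

FREQ_THRESHOLD = 2

# Toy term occurrences (assumed representation: one tuple of tokens per occurrence).
occurrences = [
    ("machine", "translation"),
    ("natural", "language"),
    ("machine", "translation"),
    ("semantic", "network"),
    ("machine", "translation"),
    ("natural", "language"),
]

term_freqs = Counter(occurrences)

# most_common() sorts by descending count, so the threshold cuts off the tail.
frequent = [(term, n) for term, n in term_freqs.most_common() if n >= FREQ_THRESHOLD]

for term, n in frequent:
    print(*term, n)
```

Here `semantic network` occurs once and is dropped by the threshold, while the other two terms are kept in descending order of frequency.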


## TF-IDF

In [8]:
# top 10
doc_freqs = extractor.doc_frequencies()
ranker = tf_idf
for c in [c for c in ranker.keep_proportion(1) if term_freqs[c] >= 5][:10]:
    print(*c, '\t&', ranker.value(c), '\t&', term_freqs[c], '\t&', doc_freqs[c], '\t\\\\')

sophisticated representation 	& 750.0 	& 5 	& 1 	\\
homophone error 	& 750.0 	& 5 	& 1 	\\
user modeling 	& 700.0 	& 7 	& 2 	\\
smt model 	& 600.0 	& 6 	& 2 	\\
user model 	& 600.0 	& 6 	& 2 	\\
polysemous word 	& 600.0 	& 6 	& 2 	\\
elementary structure 	& 500.0 	& 5 	& 2 	\\
word string 	& 500.0 	& 5 	& 2 	\\
sentence planner 	& 500.0 	& 5 	& 2 	\\
parsing strategy 	& 480.0 	& 8 	& 4 	\\


In [9]:
# bottom 10
for c in [c for c in ranker.keep_proportion(1) if term_freqs[c] >= 5][-10:]:
    print(*c, '\t&', ranker.value(c), '\t&', term_freqs[c], '\t&', doc_freqs[c], '\t\\\\')

important role 	& 250.0 	& 5 	& 5 	\\
syntactic analysis 	& 250.0 	& 5 	& 5 	\\
language generation 	& 250.0 	& 5 	& 5 	\\
evaluation method 	& 250.0 	& 5 	& 5 	\\
novel method 	& 250.0 	& 5 	& 5 	\\
decision tree 	& 250.0 	& 5 	& 5 	\\
machine learning 	& 250.0 	& 5 	& 5 	\\
statistical model 	& 250.0 	& 5 	& 5 	\\
different language 	& 250.0 	& 5 	& 5 	\\
context-free grammar 	& 250.0 	& 5 	& 5 	\\
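
The scores in both TF-IDF tables are consistent with `tf * N / (df + 1)`, where `N = 300` is the number of documents reported when the corpus was loaded. This formula is inferred from the printed numbers, not taken from the `TfIdfRanker` source, so treat the sketch below as an assumption; `tf_idf_score` is a hypothetical helper.

```python
def tf_idf_score(term_freq: int, doc_freq: int, n_docs: int = 300) -> float:
    """Assumed scoring: term frequency times a smoothed inverse document frequency."""
    return term_freq * n_docs / (doc_freq + 1)


# (term, term frequency, document frequency, score printed above)
rows = [
    ("sophisticated representation", 5, 1, 750.0),
    ("user modeling", 7, 2, 700.0),
    ("smt model", 6, 2, 600.0),
    ("parsing strategy", 8, 4, 480.0),
    ("machine learning", 5, 5, 250.0),
]

for term, tf, df, printed in rows:
    print(term, tf_idf_score(tf, df), printed)
```

Under this reading, a term that is concentrated in one or two documents outranks an equally frequent term that is spread over five, which matches the ordering in the two tables.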


## C-value

## Rectified frequency

In [10]:
# top 10
ranker = rect_freq
for c in [c for c in ranker.keep_proportion(1) if term_freqs[c] >= 5][:10]:
    print(*c, '\t&', round(ranker.value(c), 2), '\t&', term_freqs[c], '\t\\\\')

machine translation 	& 18 	& 56 	\\
experimental result 	& 15 	& 16 	\\
natural language 	& 14 	& 63 	\\
statistical machine translation 	& 9 	& 20 	\\
dialogue system 	& 9 	& 9 	\\
feature structure 	& 9 	& 11 	\\
natural language processing 	& 8 	& 14 	\\
previous work 	& 8 	& 8 	\\
word sense disambiguation 	& 6 	& 10 	\\
user model 	& 6 	& 6 	\\


In [11]:
# bottom 10
for c in [c for c in ranker.keep_proportion(1) if term_freqs[c] >= 5][-10:]:
    print(*c, '\t&', round(ranker.value(c), 2), '\t&', term_freqs[c], '\t\\\\')

machine learning 	& 1 	& 5 	\\
word segmentation 	& 1 	& 7 	\\
generation system 	& 1 	& 5 	\\
sentence planner 	& 1 	& 5 	\\
translation model 	& 0 	& 5 	\\
language processing 	& 0 	& 16 	\\
phrase structure 	& 0 	& 5 	\\
language interface 	& 0 	& 5 	\\
statistical machine 	& 0 	& 20 	\\
sense disambiguation 	& 0 	& 12 	\\


## Glossex

## Length-normalized PMI

In [12]:
# top 10
ranker = pmi_nl
for c in [c for c in ranker.keep_proportion(1) if term_freqs[c] >= 8][:10]:
    print(*c, '\t&', round(ranker.value(c), 2), '\t&', term_freqs[c], '\t\\\\')

sense disambiguation 	& 6.22 	& 12 	\\
error rate 	& 6.17 	& 8 	\\
speech recognition 	& 5.85 	& 18 	\\
previous work 	& 5.76 	& 8 	\\
word sense disambiguation 	& 5.69 	& 10 	\\
statistical machine translation 	& 5.4 	& 20 	\\
machine translation 	& 5.39 	& 56 	\\
experimental result 	& 5.23 	& 16 	\\
statistical machine 	& 5.22 	& 20 	\\
parallel corpus 	& 5.14 	& 11 	\\


In [13]:
# bottom 10
for c in [c for c in ranker.keep_proportion(1) if term_freqs[c] >= 8][-10:]:
    print(*c, '\t&', round(ranker.value(c), 2), '\t&', term_freqs[c], '\t\\\\')

word alignment 	& 4.6 	& 13 	\\
machine translation system 	& 4.29 	& 10 	\\
language processing 	& 4.18 	& 16 	\\
language pair 	& 4.15 	& 8 	\\
feature structure 	& 4.08 	& 11 	\\
mt system 	& 3.97 	& 12 	\\
dialogue system 	& 3.55 	& 9 	\\
translation system 	& 2.79 	& 15 	\\
language model 	& 2.44 	& 15 	\\
language system 	& 2.22 	& 14 	\\


## Term coherence

In [14]:
import math

In [15]:
# top 10; the printed score is value / log10(frequency), so it need not decrease down the list
ranker = term_coherence
for c in [c for c in ranker.keep_proportion(1) if term_freqs[c] >= 8][:10]:
    print(*c, '\t&', round(ranker.value(c) / math.log10(term_freqs[c]), 2), '\t&', term_freqs[c], '\t\\\\')

machine translation 	& 0.58 	& 56 	\\
natural language 	& 0.46 	& 63 	\\
speech recognition 	& 0.42 	& 18 	\\
sense disambiguation 	& 0.42 	& 12 	\\
statistical machine 	& 0.33 	& 20 	\\
statistical machine translation 	& 0.24 	& 20 	\\
experimental result 	& 0.23 	& 16 	\\
error rate 	& 0.3 	& 8 	\\
previous work 	& 0.25 	& 8 	\\
word sense 	& 0.16 	& 15 	\\


In [16]:
# bottom 10
for c in [c for c in ranker.keep_proportion(1) if term_freqs[c] >= 8][-10:]:
    print(*c, '\t&', round(ranker.value(c) / math.log10(term_freqs[c]), 2), '\t&', term_freqs[c], '\t\\\\')

feature structure 	& 0.13 	& 11 	\\
information retrieval 	& 0.15 	& 8 	\\
translation system 	& 0.08 	& 15 	\\
mt system 	& 0.08 	& 12 	\\
unknown word 	& 0.09 	& 8 	\\
language model 	& 0.07 	& 15 	\\
language system 	& 0.06 	& 14 	\\
machine translation system 	& 0.07 	& 10 	\\
language pair 	& 0.07 	& 8 	\\
dialogue system 	& 0.06 	& 9 	\\


## Voter

In [20]:
# top 10
ranker = voter
for c in [c for c in ranker.keep_proportion(1) if term_freqs[c] >= 5][:10]:
    print(*c, '\t&', round(ranker.value(c), 2), '\t&', term_freqs[c], '\t\\\\')

machine translation 	& 3.52 	& 56 	\\
natural language 	& 2.18 	& 63 	\\
experimental result 	& 1.21 	& 16 	\\
sophisticated representation 	& 1.15 	& 5 	\\
statistical machine translation 	& 0.94 	& 20 	\\
natural language processing 	& 0.56 	& 14 	\\
homophone error 	& 0.56 	& 5 	\\
dialogue system 	& 0.46 	& 9 	\\
feature structure 	& 0.45 	& 11 	\\
user modeling 	& 0.45 	& 7 	\\


In [21]:
# bottom 10
for c in [c for c in ranker.keep_proportion(1) if term_freqs[c] >= 5][-10:]:
    print(*c, '\t&', round(ranker.value(c), 2), '\t&', term_freqs[c], '\t\\\\')

bilingual corpus 	& 0.04 	& 5 	\\
evaluation method 	& 0.03 	& 5 	\\
translation model 	& 0.03 	& 5 	\\
language interface 	& 0.03 	& 5 	\\
language generation 	& 0.02 	& 5 	\\
phrase structure 	& 0.02 	& 5 	\\
machine learning 	& 0.02 	& 5 	\\
markov model 	& 0.02 	& 6 	\\
generation system 	& 0.02 	& 5 	\\
translation output 	& 0.01 	& 5 	\\


## Filtering

In [24]:
metrics = cm.Metrics()
metrics.add(c_value, rect_freq, tf_idf, glossex, pmi_nl, term_coherence, voter)

concept_filter = cm.ConceptFilter(
    lambda c: metrics[c][cm.Metrics.C_VALUE] >= C_VALUE_THRESHOLD,
    lambda c: metrics[c][cm.Metrics.RECT_FREQ] >= FREQ_THRESHOLD,
    lambda c: metrics[c][cm.Metrics.GLOSSEX] >= 1.5,
    # lambda c: metrics[c][cm.Metrics.PMI_NL] >= 2,
    # lambda c: metrics[c][cm.Metrics.TF_IDF] >= 100,
    filtering_method=cm.ConceptFilter.METHODS.ALL
)
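
The filter above combines several per-concept predicates and, with `METHODS.ALL`, presumably keeps a concept only when every predicate holds. A minimal sketch of that logic under this assumption follows; `apply_filters`, the `require_all` flag, and the toy scores are hypothetical and not part of the pipeline's API.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Concept = Tuple[str, ...]


def apply_filters(concepts: Iterable[Concept],
                  predicates: Iterable[Callable[[Concept], bool]],
                  require_all: bool = True) -> Set[Concept]:
    """Keep concepts that satisfy all predicates (or at least one, if require_all is False)."""
    predicates = list(predicates)
    combine = all if require_all else any
    return {c for c in concepts if combine(p(c) for p in predicates)}


# Made-up scores for illustration only.
toy_scores: Dict[Concept, Dict[str, float]] = {
    ("machine", "translation"): {"c_value": 40.0, "glossex": 2.1},
    ("previous", "work"): {"c_value": 8.0, "glossex": 0.9},
    ("san", "diego"): {"c_value": 1.0, "glossex": 1.8},
}

predicates = [
    lambda c: toy_scores[c]["c_value"] >= 2,
    lambda c: toy_scores[c]["glossex"] >= 1.5,
]

kept = apply_filters(toy_scores, predicates)
rejected = set(toy_scores) - kept
print(sorted(kept), sorted(rejected))
```

With `require_all=True` a single failing threshold is enough to reject a concept, so each predicate acts as a hard constraint; passing `require_all=False` would keep anything that clears at least one threshold.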

In [29]:
# concepts among the 100 highest-ranked that are absent from the filter's output
final = set(ranker.keep_n_highest(100))
filtered_concepts = concept_filter.apply(final)
final.difference(filtered_concepts)

{('academic', 'writing'),
 ('air', 'travel'),
 ('bibliographic', 'citation'),
 ('biological', 'warfare'),
 ('cfg', 'parser'),
 ('cfg', 'parsing'),
 ('combinatorial', 'explosion'),
 ('data', 'representation'),
 ('dependency', 'parser'),
 ('disambiguation', 'algorithm'),
 ('extractive', 'mutli-document'),
 ('file', 'card'),
 ('france', 'telecom'),
 ('france', 'telecom', 'rd', 'beijing'),
 ('grammar', 'parsing'),
 ('heuristic', 'parsing'),
 ('hidden', 'markov'),
 ('important', 'role'),
 ('inference', 'type'),
 ('kruseman', 'aretz'),
 ('language', 'processing'),
 ('ngram', 'tm'),
 ('parsing', 'algorithm'),
 ('previous', 'work'),
 ('robust', 'parser'),
 ('robust', 'parsing'),
 ('robust', 'probabilistic', 'parsing'),
 ('san', 'diego'),
 ('sense', 'disambiguation'),
 ('sentence', 'plan'),
 ('smt', 'algorithm'),
 ('statistical', 'machine'),
 ('street', 'journal'),
 ('syntactic', 'disambiguation'),
 ('syntactic', 'parse'),
 ('unmodified', 'subgraphs'),
 ('wall', 'street'),
 ('wall', 'street', '