# Context-Aware Text Analysis

Bag-of-words decomposition techniques
enable us to explore relationships between documents that contain the same
mixture of individual words. This token-frequency approach
can be very effective, particularly when the vocabulary of
a specific discipline or topic is sufficient to distinguish it from, or relate it to, other
text.

However, the context in which words appear plays a huge role in conveying meaning.
Consider the following phrases: “she liked the smell of roses” and “she smelled like roses.”
After applying the text normalization techniques presented in previous chapters, such as stopword
removal and lemmatization, these two utterances would have identical bag-of-words
vectors even though they have completely different meanings.
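
The collision described above can be checked with a minimal sketch. The stopword set and lemma table here are tiny, hand-built stand-ins invented for this illustration (a real pipeline would use a full stopword list and a proper lemmatizer), so treat the result as a demonstration of the idea rather than of any particular library's output.

```python
from collections import Counter

# Hypothetical, hand-built resources for illustration only.
STOPWORDS = {"she", "the", "of"}
LEMMAS = {"liked": "like", "smelled": "smell", "roses": "rose"}

def bag_of_words(text):
    """Lowercase, lemmatize with the toy table, drop stopwords, count tokens."""
    tokens = text.lower().split()
    lemmas = (LEMMAS.get(tok, tok) for tok in tokens)
    return Counter(lem for lem in lemmas if lem not in STOPWORDS)

first = bag_of_words("she liked the smell of roses")
second = bag_of_words("she smelled like roses")

# Both phrases reduce to the same counts for 'like', 'smell', and 'rose'.
print(first == second)  # True
```

Because a bag of words discards order, nothing in the two counters records which word was the verb and which was the object.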

This does not mean that bag-of-words models should be discounted;
in fact, they are usually very useful initial models. Nonetheless,
lower-performing models can often be significantly improved by adding
contextual feature extraction. One simple yet effective approach is to augment models
with grammars to create templates that help us target specific types of phrases,
which capture more nuance than words alone.

## Grammar-Based Feature Extraction

Grammatical features such as parts-of-speech enable us to encode more contextual
information about language. One of the most effective ways of improving model performance
is by combining grammars and parsers, which allow us to build up lightweight
syntactic structures to directly target dynamic collections of text that could be
significant.

To get information about the language in which the sentence is written, we need a set
of grammatical rules that specify the components of well-structured sentences in that
language; this is what a grammar provides. A grammar is a set of rules describing
specifically how syntactic units (sentences, phrases, etc.) in a given language should
be deconstructed into their constituent units.

### Context-Free Grammars
We can use grammars to specify different rules that allow us to build up parts of speech
into phrases or chunks. A context-free grammar is a set of rules for combining
syntactic components to form strings that make sense. For instance, the noun phrase (NP) “the castle”
has a determiner (DT) and a noun (N).
The prepositional phrase (PP) “in the castle” has a preposition (P) and a noun phrase
(NP). The verb phrase (VP) “looks in the castle” has a verb (V) and a prepositional
phrase (PP). The sentence (S) “Gwen looks in the castle” has a proper noun (NNP) and
a verb phrase (VP).

In [1]:
GRAMMAR = """
S -> NNP VP
VP -> V PP
PP -> P NP
NP -> DT N
NNP -> 'Gwen' | 'George'
V -> 'looks' | 'burns'
P -> 'in' | 'for'
DT -> 'the'
N -> 'castle' | 'ocean'
"""

In [2]:
from nltk import CFG
cfg = CFG.fromstring(GRAMMAR)
print(cfg)
print(cfg.start())
print(cfg.productions())

Grammar with 13 productions (start state = S)
    S -> NNP VP
    VP -> V PP
    PP -> P NP
    NP -> DT N
    NNP -> 'Gwen'
    NNP -> 'George'
    V -> 'looks'
    V -> 'burns'
    P -> 'in'
    P -> 'for'
    DT -> 'the'
    N -> 'castle'
    N -> 'ocean'
S
[S -> NNP VP, VP -> V PP, PP -> P NP, NP -> DT N, NNP -> 'Gwen', NNP -> 'George', V -> 'looks', V -> 'burns', P -> 'in', P -> 'for', DT -> 'the', N -> 'castle', N -> 'ocean']


### Syntactic Parsers

Once we have defined a grammar, we need a mechanism to systematically search out
the meaningful syntactic structures from our corpus; this is the role of the parser. If a
grammar defines the search criterion for “meaningfulness” in the context of our language,
the parser executes the search. A syntactic parser is a program that deconstructs
sentences into a parse tree, which consists of hierarchical constituents, or
syntactic categories.

When a parser encounters a sentence, it checks to see if the structure of that sentence
conforms to a known grammar. If so, it parses the sentence according to the rules of
that grammar, producing a parse tree. Parsers are often used to identify important
structures, like the subject and object of verbs in a sentence, or to determine which
sequences of words in a sentence should be grouped together within each syntactic
category.
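
As a rough sketch of the check-then-parse behavior described above, the toy context-free grammar from the previous section can be handed to NLTK's `ChartParser`. The choice of `ChartParser` is an assumption made for this illustration; NLTK ships several other parsers that would serve equally well.

```python
from nltk import CFG
from nltk.parse import ChartParser

# Same toy grammar as in the Context-Free Grammars section.
GRAMMAR = """
S -> NNP VP
VP -> V PP
PP -> P NP
NP -> DT N
NNP -> 'Gwen' | 'George'
V -> 'looks' | 'burns'
P -> 'in' | 'for'
DT -> 'the'
N -> 'castle' | 'ocean'
"""

parser = ChartParser(CFG.fromstring(GRAMMAR))

# A sentence that conforms to the grammar yields a parse tree.
good_tokens = "Gwen looks in the castle".split()
good_trees = list(parser.parse(good_tokens))
for tree in good_trees:
    print(tree)

# Known words in an order the grammar does not allow yield no trees.
bad_tokens = "the castle looks in Gwen".split()
bad_trees = list(parser.parse(bad_tokens))
print(len(bad_trees))
```

The parser returns an iterator of trees because an ambiguous grammar can license more than one parse for the same sentence. This grammar is unambiguous, so the first sentence has a single tree.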

In [3]:
from nltk.chunk.regexp import RegexpParser
GRAMMAR = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
chunker = RegexpParser(GRAMMAR)
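
To see what this chunker produces, the sketch below applies it to a sentence that has already been part-of-speech tagged. The tags are written by hand here (an assumption for the example) so that the result does not depend on any tagger model being installed.

```python
from nltk.chunk.regexp import RegexpParser

GRAMMAR = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
chunker = RegexpParser(GRAMMAR)

# Hand-tagged tokens; in practice these would come from a POS tagger.
tagged = [
    ("the", "DT"), ("beautiful", "JJ"), ("view", "NN"),
    ("of", "IN"), ("mountains", "NNS"), ("was", "VBD"), ("nice", "JJ"),
]

tree = chunker.parse(tagged)

# Collect the words under every subtree labeled KT (our keyterm chunk).
phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "KT")
]
print(phrases)
```

The optional group in the grammar is what lets a noun phrase absorb a following preposition and noun, so "beautiful view of mountains" comes out as one chunk. The trailing adjective "nice" is left out because no noun follows it.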

### Extracting Keyphrases

In [4]:
GRAMMAR = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
GOODTAGS = frozenset(['JJ','JJR','JJS','NN','NNP','NNS','NNPS'])

from sklearn.base import BaseEstimator, TransformerMixin
from unicodedata import category as unicat
from itertools import groupby
from nltk.chunk import tree2conlltags
from nltk import pos_tag

class KeyphraseExtractor(BaseEstimator, TransformerMixin):
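    """
    Extracts keyphrases from raw text documents by part-of-speech
    tagging each one and chunking it with a regular expression grammar.
    """
    def __init__(self, grammar=GRAMMAR):
        # Use the grammar passed in, not always the module-level default.
        self.grammar = grammar
        self.chunker = RegexpParser(self.grammar)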
        
        
    def normalize(self, sent):
        """
        Splits a raw sentence on whitespace, part-of-speech tags it,
        removes punctuation-only tokens, and lowercases the words.
        """
        sent = pos_tag(sent.split())
        is_punct = lambda word: all(unicat(c).startswith('P') for c in word)
        sent = filter(lambda t: not is_punct(t[0]), sent)
        sent = map(lambda t: (t[0].lower(), t[1]), sent)
        return list(sent)      
    
    def extract_keyphrases(self, sent):
        """
        For a document, parse sentences using our chunker created by
        our grammar, converting the parse tree into a tagged sequence.
        Yields extracted phrases.
        """
        sent = self.normalize(sent)
        chunks = tree2conlltags(self.chunker.parse(sent))
        phrases = [
            " ".join(word for word, pos, chunk in group).lower()
            for key, group in groupby(
                chunks, lambda term: term[-1] != 'O'
            ) if key
        ]
        for phrase in phrases:
            yield phrase
            
    def fit(self, documents, y=None):
        return self
    
    def transform(self, documents):
        for document in documents:
            yield self.extract_keyphrases(document)            

In [5]:
kpe = KeyphraseExtractor()
kps = kpe.fit_transform(
        ['Hello how are you', 'Such a beautiful view', 'What a wOnderfUl sight'],
)

for kp in kps:
    print(list(kp))

['hello']
['beautiful view']
['wonderful sight']


In [6]:
GOODLABELS = frozenset(['PERSON', 'ORGANIZATION', 'FACILITY', 'GPE', 'GSP'])
from nltk import ne_chunk

class EntityExtractor(BaseEstimator, TransformerMixin):
    
    def __init__(self, labels=GOODLABELS, **kwargs):
        self.labels = labels
    
    def get_entities(self, sentence):
        entities = []
        trees = ne_chunk(pos_tag(sentence.split()))
        for tree in trees:
            if hasattr(tree, 'label'):
                if tree.label() in self.labels:
                    entities.append(
                        ' '.join([child[0].lower() for child in tree])
                    )
        return entities

    def fit(self, documents, labels=None):
        return self
    
    def transform(self, documents):
        for document in documents:
            yield self.get_entities(document)

In [7]:
ee = EntityExtractor()

entities = ee.fit_transform(['Barak Obama is the President of United States'])

for entity in entities:
    print(list(entity))

['barak', 'obama', 'united states']


## n-Gram Feature Extraction

Grammar-based approaches, while very effective, do not always work.
For one thing, they rely heavily on the success of part-of-speech tagging, meaning we
must be confident that our tagger is correctly labeling nouns, verbs, adjectives, and
other parts of speech.

Grammar-based feature extraction is also somewhat inflexible, because we must
begin by defining a grammar. It is often very difficult to know in advance which
grammar pattern will most effectively capture the high-signal terms and phrases
within a text.
We can address these challenges iteratively, by experimenting with many different
grammars or by training our own custom part-of-speech tagger.

**Choosing the Right n-Gram Window**

Choosing n can also be considered as balancing the trade-off between bias and variance.
A small n leads to a simpler (weaker) model, therefore causing more error due
to bias. A larger n leads to a more complex model (a higher-order model), thus causing
more error due to variance. Just as with all supervised machine learning problems,
we have to strike the right balance between the sensitivity and the specificity of
our model. The more dependent words are on more distant precursors, the greater
the complexity needed for an n-gram model to be predictive.
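
One way to get a feel for the variance side of this trade-off is to look at how sparse the counts become as n grows. The sketch below uses a toy sentence invented for this illustration and measures the fraction of distinct n-grams that occur only once. It is a rough proxy for sparsity under that assumption, not a formal bias-variance estimate.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous n-token window as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams that occur exactly once."""
    counts = Counter(ngrams(tokens, n))
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Toy text invented for this example.
tokens = "the cat sat on the mat and the cat sat on the rug".split()

sizes = (1, 2, 3)
fractions = [singleton_fraction(tokens, n) for n in sizes]
for n, fraction in zip(sizes, fractions):
    print(n, round(fraction, 2))
```

On this text the share of n-grams seen only once rises with n, which is the pattern that makes higher-order models harder to estimate reliably from limited data.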


**Significant Collocations**

Using raw n-grams will produce many,
many candidates, most of which will not be relevant. For example, the sentence “I got
lost in the corn maze during the fall picnic” contains the trigram ('in', 'the',
'corn'), which is not a typical prepositional target, whereas the trigram ('I',
'got', 'lost') seems to make sense on its own.
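
The candidates mentioned above can be generated with a sliding window over the tokens. This is a minimal sketch in plain Python that assumes simple whitespace tokenization; the helper name `ngrams` is our own.

```python
def ngrams(tokens, n):
    """Return every contiguous n-token window as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = "I got lost in the corn maze during the fall picnic"
tokens = sentence.split()

trigrams = ngrams(tokens, 3)
print(len(trigrams))
for trigram in trigrams:
    print(trigram)
```

Every position in the sentence contributes a window, whether or not it forms a meaningful phrase, which is why raw n-grams generate so many candidates.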

In practice, evaluating every raw n-gram is too high a computational cost to be useful in most applications.
The solution is to compute conditional probability. For example, what is the likelihood
that the tokens ('the', 'fall') appear in the text given the token 'during'? We
can compute empirical likelihoods by calculating the frequency of the (n-1)-gram
conditioned on the first token of the n-gram. Using this technique we can value n-grams
that are more often used together, such as ('corn', 'maze'), over rarer compositions
that are less meaningful.
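
The empirical likelihood described above can be sketched by counting. The toy token stream below is invented for the example, and the function name `conditional_probability` is our own: among all n-grams that start with a given token, it reports the share that continue with the given (n-1)-gram.

```python
def conditional_probability(tokens, first, rest):
    """Estimate P(rest | first) from n-gram counts in a token list."""
    rest = tuple(rest)
    n = len(rest) + 1
    grams = list(zip(*(tokens[i:] for i in range(n))))
    # All n-grams whose first token is the conditioning token.
    starting = [gram for gram in grams if gram[0] == first]
    if not starting:
        return 0.0
    matches = sum(1 for gram in starting if gram[1:] == rest)
    return matches / len(starting)

# Toy text invented for this example.
text = (
    "I got lost in the corn maze during the fall picnic "
    "we ate pie during the fall picnic "
    "and slept during the long drive home"
)
tokens = text.split()

probability = conditional_probability(tokens, "during", ("the", "fall"))
print(probability)
```

Here 'during' appears three times and is followed by ('the', 'fall') twice, so the estimate is two thirds.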

The idea of some n-grams having more value than others leads to another tool in the
text analysis toolkit: significant collocations. Collocation is an abstract synonym for n-gram
(without the specificity of the window size) and simply means a sequence of
tokens whose likelihood of co-occurrence is caused by something other than random
chance. Using conditional probability, we can test the hypothesis that a specified collocation
is meaningful.
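
As one illustration of scoring collocations, the sketch below uses pointwise mutual information (PMI), which compares how often two tokens occur together with how often they would if they were independent. PMI is a stand-in chosen for this example, not a formal hypothesis test, and the toy text is invented.

```python
import math
from collections import Counter

def pmi(tokens, bigram):
    """Pointwise mutual information of a bigram within a token list."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    p_xy = bigram_counts[bigram] / sum(bigram_counts.values())
    if p_xy == 0:
        return float("-inf")  # the pair never occurs together
    p_x = unigram_counts[bigram[0]] / len(tokens)
    p_y = unigram_counts[bigram[1]] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))

# Toy text invented for this example.
text = (
    "we walked the corn maze then the dog ran the corn maze "
    "while the sun set over the hill"
)
tokens = text.split()

corn_maze = pmi(tokens, ("corn", "maze"))
the_corn = pmi(tokens, ("the", "corn"))
print(corn_maze > the_corn)
```

Both bigrams occur twice, but 'the' is common everywhere in the text, so ('corn', 'maze') scores higher: its parts rarely appear apart from each other.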


![ngram](../meta/ngram.png)

## n-Gram Language Model

An n-gram model uses the statistical frequency of n-grams to make decisions about text.
To compute an n-gram language model that predicts the next word after a series of
words, we would first count all n-grams in the text and then use those frequencies to
predict the likelihood of the last token in the n-gram given the tokens that precede it.
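
The counting procedure just described can be sketched in a few lines of plain Python. The function names and the toy text are assumptions made for this illustration. The model is an unsmoothed maximum-likelihood estimate, so any continuation it has not seen gets probability zero.

```python
from collections import Counter, defaultdict

def train_ngram_model(tokens, n=2):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def next_word_probability(model, context, word):
    """Relative frequency of `word` among continuations of `context`."""
    following = model.get(tuple(context))
    if not following:
        return 0.0
    return following[word] / sum(following.values())

# Toy text invented for this example.
tokens = "the cat sat on the mat the cat ran".split()
model = train_ngram_model(tokens, n=2)

p_cat = next_word_probability(model, ["the"], "cat")
p_mat = next_word_probability(model, ["the"], "mat")
print(p_cat, p_mat)
```

In this text 'the' is followed by 'cat' twice and 'mat' once, so the bigram model prefers 'cat' as the next word.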