# Chunking and chinking

This notebook explores (briefly) the differences between **chunking** and **chinking** using NLTK.

## Imports

In [1]:
from nltk import pos_tag, RegexpParser, word_tokenize

## Data

For this notebook, we'll use extracts from [*The Gruffalo*](https://en.wikipedia.org/wiki/The_Gruffalo), for no especial reason.

In [2]:
text = "Who is this creature with terrible claws and terrible teeth in his terrible jaws?"
print(text)

Who is this creature with terrible claws and terrible teeth in his terrible jaws?


## Text processing

In order to effectively chunk text, we need to process it first. Chunkers expect tagged & tokenised text as inputs, meaning that our text should be split into individual words (tokens) and each token should have an associated **part of speech**.

We'll work through the process step-by-step for this first sentence, showing how the text is transformed. Further in the notebook, we'll process text all at once, with `pos_tag(word_tokenize(text))`, but it will still be the same transformations, just in a single line.

In [3]:
# Split the text into tokens

tokens = word_tokenize(text)

print(tokens)

['Who', 'is', 'this', 'creature', 'with', 'terrible', 'claws', 'and', 'terrible', 'teeth', 'in', 'his', 'terrible', 'jaws', '?']


In [4]:
# Add part-of-speech tags to the tokens

tagged_tokens = pos_tag(tokens)

print(tagged_tokens)

[('Who', 'WP'), ('is', 'VBZ'), ('this', 'DT'), ('creature', 'NN'), ('with', 'IN'), ('terrible', 'JJ'), ('claws', 'NNS'), ('and', 'CC'), ('terrible', 'JJ'), ('teeth', 'NNS'), ('in', 'IN'), ('his', 'PRP$'), ('terrible', 'JJ'), ('jaws', 'NNS'), ('?', '.')]


## Chunking

Chunking lets us extract related sequences of tokens together as a single unit.

"He has purple prickles all over his back".

In the above sentence, "purple" and "prickles" form a **noun phrase**: it makes sense to keep them together as a single unit, rather than view them as unrelated. In the next few examples, we'll use NLTK to extract noun phrases from our text data.

### Defining a grammar

In order to extract chunks, we need to define a **grammar**: a computer-readable description of what the chunks we're interested in look like. A grammar defines a specific chunk by detailing the tokens within a chunk using (a version of) regex syntax.

In [5]:
np_grammar = "NP: {<DT>?<JJ>*<NNS?>}"

The `np_grammar` variable describes one possible noun phrase chunk definition. The table below shows what each sequence of characters in the grammar means.

| Sequence | Meaning |
| --- | --- |
| `NP:` | The name of a chunk, in this case "NP" |
| `{` | Start of chunk |
| `<DT>` | A **determiner**, such as "a" or "the" |
| `?` | 0 or 1 of the preceding token |
| `<JJ>` | An **adjective**, like "purple" in "purple prickles" |
| `*` | 0 or more of the preceding token |
| `<NNS?>`| A **noun** (singular or plural), like "prickles" in "purple prickles" |
| `}` | End of chunk |

In more human-accessible language, the grammar defines a chunk called "NP". `NP` chunks consist of an (optional) determiner, 0 or more adjectives, and a final noun.

This grammar would find "purple prickles", but also "sharp purple prickles", "the prickles", "those purple prickles" and so on.
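
To make the "regex over tags" idea more concrete, here is a rough pure-Python sketch that matches the same pattern against a string of tags using the standard `re` module. It is a simplified illustration, not NLTK's actual implementation; the `find_chunks` helper and the hand-written tag list are assumptions made for this example.

```python
import re

# Hand-tagged tokens (Penn Treebank style tags, written out by hand)
tagged_tokens = [
    ("Who", "WP"), ("is", "VBZ"), ("this", "DT"), ("creature", "NN"),
    ("with", "IN"), ("terrible", "JJ"), ("claws", "NNS"),
]

# The NP pattern rewritten as an ordinary regex over "<TAG>" strings
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NNS?>")


def find_chunks(tagged, pattern):
    """Hypothetical helper: return the words of each tag sequence that matches."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in pattern.finditer(tag_string):
        # Each "<" marks one tag, so counting them converts character
        # offsets back into token positions
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


chunks = find_chunks(tagged_tokens, np_pattern)
print(chunks)
```

The pattern is applied to the tags only; the words are looked up afterwards from the matching positions.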

### Using a parser

Once we've defined a grammar, we next need to get something that can chunk tokens *using* that grammar. For that, we'll create a **regex parser** using NLTK. This will be able to read through our tokens, combining sequences that match the `NP` grammar into chunks.

In [6]:
# Create a parser that can read the grammar

np_parser = RegexpParser(np_grammar)

In [7]:
# Actually parse the tokens

parsed_tokens = np_parser.parse(tagged_tokens)

print(parsed_tokens)

(S
  Who/WP
  is/VBZ
  (NP this/DT creature/NN)
  with/IN
  (NP terrible/JJ claws/NNS)
  and/CC
  (NP terrible/JJ teeth/NNS)
  in/IN
  his/PRP$
  (NP terrible/JJ jaws/NNS)
  ?/.)


The parser has wrapped everything in a top-level `S` chunk. Within that, there are many independent tokens, but also four `NP` chunks. One chunk is formed by "this" (a determiner) and "creature" (a singular noun), while the others - still valid `NP` chunks - each consist of an adjective and a plural noun.
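
If we only want the chunks themselves, rather than the whole tree, one option is to walk the tree and keep the subtrees labelled `NP`. The sketch below assumes the same tags as the output above, typed in by hand so that it runs without the tokeniser or tagger; `noun_phrases` is just an illustrative name.

```python
from nltk import RegexpParser

# Tagged tokens copied by hand from the tagger output above
tagged_tokens = [
    ("Who", "WP"), ("is", "VBZ"), ("this", "DT"), ("creature", "NN"),
    ("with", "IN"), ("terrible", "JJ"), ("claws", "NNS"), ("and", "CC"),
    ("terrible", "JJ"), ("teeth", "NNS"), ("in", "IN"), ("his", "PRP$"),
    ("terrible", "JJ"), ("jaws", "NNS"), ("?", "."),
]

np_parser = RegexpParser("NP: {<DT>?<JJ>*<NNS?>}")
tree = np_parser.parse(tagged_tokens)

# Keep only the NP subtrees and join their words back into phrases
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]

print(noun_phrases)
```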

We can use the same grammar and parser to get the `NP` chunks in another piece of text:

In [8]:
# Prepare the text

text = "'Here, by these rocks, and his favourite food is… roasted fox!' the mouse answered."

tagged_tokens = pos_tag(word_tokenize(text))

In [9]:
# Chunk it

parsed_tokens = np_parser.parse(tagged_tokens)

print(parsed_tokens)

(S
  'Here/RB
  ,/,
  by/IN
  (NP these/DT rocks/NNS)
  ,/,
  and/CC
  his/PRP$
  (NP favourite/JJ food/NN)
  (NP is…/NN)
  roasted/VBD
  fox/RB
  !/.
  '/''
  (NP the/DT mouse/NN)
  answered/VBD
  ./.)


Again, the parser has identified several `NP` chunks. Note, though, that "roasted fox" has not been found: the tagger labelled "roasted" as a past-tense verb and "fox" as an adverb, and our grammar only picks up sequences that match its precise specification, not every combination we *might* be interested in. The tagger's treatment of "is…" as a noun has also produced a spurious `NP` chunk.
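
As one example of refining the grammar, notice that the possessive "his" is left outside the chunks in both outputs above. A possible tweak, assuming we want possessives inside noun phrases, is to allow `PRP$` wherever a determiner is allowed. The names `possessive_grammar` and `possessive_parser` are made up for this sketch, and the tokens are tagged by hand.

```python
from nltk import RegexpParser

# Hand-tagged tokens, copied from the parser output above
tagged_tokens = [("and", "CC"), ("his", "PRP$"), ("favourite", "JJ"), ("food", "NN")]

# "$" is a regex metacharacter, so it is escaped inside the tag pattern
possessive_grammar = r"NP: {<DT|PRP\$>?<JJ>*<NNS?>}"
possessive_parser = RegexpParser(possessive_grammar)

tree = possessive_parser.parse(tagged_tokens)
print(tree)

# Collect the (word, tag) pairs of each NP chunk
possessive_chunks = [
    subtree.leaves()
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
```

This does nothing for "roasted fox", which is a tagging problem rather than a grammar problem.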

## Chinking

Chinking is a way to simplify your chunks down; once you've defined a chunk that groups a sequence of tokens, you can then use a chink to snip out some of the chunk.

To demonstrate this, we'll need to use a grammar with more complicated rules. We'll start by defining the basic rule, and then we'll build from there.

In [10]:
# Define the grammar

first_rule_grammar = """
    ACTOR: {<DT>?<NN><VBD><DT>?<NN>}
"""

# Create a parser that can read the grammar

fr_parser = RegexpParser(first_rule_grammar)

The `ACTOR` chunk defined above looks for sequences of tokens where something did something to something else. A dog bit a man, for example, or a flower lost a leaf. The table below breaks down the pieces of the pattern.

| Sequence | Meaning |
| --- | --- |
| `ACTOR:` | The name of a chunk, in this case "ACTOR" |
| `{` | Start of chunk |
| `<DT>` | A **determiner**, such as "a" or "the" |
| `?` | 0 or 1 of the preceding token |
| `<NN>` | A singular **noun**, like "mouse" in "the mouse" |
| `<VBD>` | A past-tense **verb**, such as "slid" or "sped" |
| `}` | End of chunk |

In this case, some pieces repeat in the pattern: there is an optional determiner followed by a singular noun at both the start and the end of the pattern. In between the two is a `VBD` - a past-tense verb. This pattern will match sequences such as "the mouse found a nut", or "woman rode horse".

In [11]:
# Prepare some text

text = "A mouse took a stroll through the deep, dark wood. A fox saw the mouse, and the mouse looked good."
tagged_tokens = pos_tag(word_tokenize(text))

In [12]:
# Parse the text

parsed_tokens = fr_parser.parse(tagged_tokens)

print(parsed_tokens)

(S
  (ACTOR A/DT mouse/NN took/VBD a/DT stroll/NN)
  through/IN
  the/DT
  deep/JJ
  ,/,
  dark/JJ
  wood/NN
  ./.
  (ACTOR A/DT fox/NN saw/VBD the/DT mouse/NN)
  ,/,
  and/CC
  the/DT
  mouse/NN
  looked/VBD
  good/JJ
  ./.)


In the output above, you can see that two `ACTOR` chunks have been identified: "a mouse took a stroll" and "a fox saw the mouse". Using just our first rule, we've pulled out phrases following a particular pattern.

But we don't have to stop there! The pattern is called `ACTOR`, but at the moment it's pulling out entire actions. What if we wanted to get just the actors from each action: the things being affected or affecting others?

To do this, we can use a chink, attaching it onto our `ACTOR` rule in the grammar. The combined grammar will follow a two-stage process: first it will chunk actions together, and then it will split each such chunk into two by snipping out the verb and any determiners. The end result will be two `ACTOR` chunks for each action, each one containing a single noun.

In [13]:
# Define the grammar

actor_grammar = """
    ACTOR: {<DT>?<NN><VBD><DT>?<NN>}
           }<DT|VBD>{
"""

# Create a parser that can read the grammar

actor_parser = RegexpParser(actor_grammar)

The second line, `}<DT|VBD>{`, is our chink; note the use of reversed brackets to mark it.

| Sequence | Meaning |
| --- | --- |
| `}` | Start of chink |
| `<DT\|VBD>` | A **determiner** or a past-tense **verb** |
| `{` | End of chink |

When used to parse tokens, this will neatly cut out the non-noun bits of our initial `ACTOR` pattern.

In [14]:
# Parse the text with the new parser

parsed_tokens = actor_parser.parse(tagged_tokens)

print(parsed_tokens)

(S
  A/DT
  (ACTOR mouse/NN)
  took/VBD
  a/DT
  (ACTOR stroll/NN)
  through/IN
  the/DT
  deep/JJ
  ,/,
  dark/JJ
  wood/NN
  ./.
  A/DT
  (ACTOR fox/NN)
  saw/VBD
  the/DT
  (ACTOR mouse/NN)
  ,/,
  and/CC
  the/DT
  mouse/NN
  looked/VBD
  good/JJ
  ./.)


Instead of getting the whole action, we now get two `ACTOR` chunks for each action, with the determiners and verbs snipped out. Chunking let us focus on just one sequence, and then chinking let us pull out just the bit we want to focus on.
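
To use the result rather than just print it, we can collect the words inside each `ACTOR` chunk. The sketch below repeats the chunk-then-chink grammar on a hand-tagged fragment of the text, so it runs without the tokeniser or tagger; the `actors` list is an illustrative name.

```python
from nltk import RegexpParser

# Hand-tagged fragment, with tags copied from the output above
tagged_tokens = [
    ("A", "DT"), ("fox", "NN"), ("saw", "VBD"), ("the", "DT"), ("mouse", "NN"),
    (",", ","), ("and", "CC"), ("the", "DT"), ("mouse", "NN"),
    ("looked", "VBD"), ("good", "JJ"), (".", "."),
]

actor_grammar = """
    ACTOR: {<DT>?<NN><VBD><DT>?<NN>}
           }<DT|VBD>{
"""
actor_parser = RegexpParser(actor_grammar)
tree = actor_parser.parse(tagged_tokens)

# Each ACTOR chunk now holds a single noun; gather the words in order
actors = [
    word
    for subtree in tree.subtrees(filter=lambda t: t.label() == "ACTOR")
    for word, tag in subtree.leaves()
]

print(actors)
```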

As before, this isn't perfect: we might have wanted to pull in "the mouse looked good" too, but our pattern doesn't cover it. We might also prefer not to label "stroll" as an actor, because it's not really a thing in the same way as "fox" and "mouse".
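
A different way to combine the two operations, along the lines of an example in the NLTK book, is to chunk *everything* first and then chink out the tokens we don't want. The sketch below is an illustration under assumptions: the tokens are tagged by hand (with the comma dropped for simplicity), and `broad_grammar` and `broad_parser` are made-up names.

```python
from nltk import RegexpParser

# Hand-tagged tokens adapted from the first sentence of the text above
tagged_tokens = [
    ("A", "DT"), ("mouse", "NN"), ("took", "VBD"), ("a", "DT"), ("stroll", "NN"),
    ("through", "IN"), ("the", "DT"), ("deep", "JJ"), ("dark", "JJ"), ("wood", "NN"),
]

broad_grammar = r"""
    NP: {<.*>+}      # Chunk every token into one big chunk
        }<VBD|IN>+{  # Then chink out past-tense verbs and prepositions
"""
broad_parser = RegexpParser(broad_grammar)
tree = broad_parser.parse(tagged_tokens)

# Whatever survives the chink is split into separate NP chunks
remaining = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]

print(remaining)
```

Here the chink does most of the work: the chunk rule is deliberately loose, and the grammar is really a description of what to leave out.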

## Conclusions

Text data is difficult to work with because language is complex (and beautiful). You're always going to have edge cases, and always going to need to refine your code to catch those edge cases. With careful application of both chunking and chinking, though, you can devise grammars that extract much of the data you're after, within the limits of the tagger's accuracy.

Personally, I think this is really cool. It can also be useful; off the top of my head, here are three potentially interesting/useful applications of chunking & chinking:

- Determining who the most active and passive characters are in a narrative
- Identifying all the adjectives customers use to refer to a specific product
- Highlighting overly-complicated sentence structures
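
As a very rough sketch of the first idea, we could chunk whole actions and count how often each noun appears first (doing something) or last (having something done to it). Everything here is an assumption for illustration: the `ACTION` chunk name, the hand-tagged tokens, and the simple "first noun acts, last noun is acted upon" heuristic.

```python
from collections import Counter

from nltk import RegexpParser

# Hand-tagged tokens adapted from the Gruffalo extract used earlier
tagged_tokens = [
    ("A", "DT"), ("mouse", "NN"), ("took", "VBD"), ("a", "DT"), ("stroll", "NN"),
    (".", "."),
    ("A", "DT"), ("fox", "NN"), ("saw", "VBD"), ("the", "DT"), ("mouse", "NN"),
    (".", "."),
]

# Same pattern as the first ACTOR rule, without the chink
action_parser = RegexpParser("ACTION: {<DT>?<NN><VBD><DT>?<NN>}")
tree = action_parser.parse(tagged_tokens)

active = Counter()   # nouns that start an action
passive = Counter()  # nouns that end an action

for action in tree.subtrees(filter=lambda t: t.label() == "ACTION"):
    nouns = [word for word, tag in action.leaves() if tag == "NN"]
    active[nouns[0]] += 1
    passive[nouns[-1]] += 1

print(active.most_common())
print(passive.most_common())
```

On real text this would need a richer grammar (plural nouns, other tenses, pronouns), but the counting step would stay much the same.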