In this section of the tutorial, we will learn how to <u>chunk</u> text. Chunking divides a text into segments, such as noun phrases and verb phrases.

First off, we will import our raw text, and use the code from the "Part of Speech" section of the tutorial to tag all the words with their corresponding parts of speech.

In [None]:
import nltk

In [None]:
with open("255s.txt", "r") as f: #Python 3 text mode already handles universal newlines; the "U" flag was removed in Python 3.11
    raw = f.read()

In [None]:
sentences = nltk.sent_tokenize(raw)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

Chunks can be represented either as trees or as IOB tags. We are going to use IOB tags, because we will be building our chunks with a tagging algorithm.
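
In IOB notation, a tag starting with `B-` begins a chunk, a tag starting with `I-` continues the chunk that is open, and `O` marks a token outside any chunk. As a rough illustration of how a flat tag sequence encodes chunk boundaries, here is a minimal sketch in plain Python. The helper name `iob_to_chunks` and the example sentence are made up for this illustration and are not part of NLTK.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (chunk_label, [words]) pairs."""
    chunks = []
    current_label = None  # label of the chunk currently open, if any
    for word, pos, iob in tagged:
        if iob.startswith("B-"):
            # B- always opens a new chunk
            current_label = iob[2:]
            chunks.append((current_label, [word]))
        elif iob.startswith("I-") and current_label == iob[2:]:
            # I- extends the chunk that is currently open
            chunks[-1][1].append(word)
        else:
            # "O" or a stray I- tag: the word is left outside any chunk
            current_label = None
    return chunks

# Hypothetical hand-tagged sentence
example = [("the", "DT", "B-NP"), ("little", "JJ", "I-NP"), ("dog", "NN", "I-NP"),
           ("barked", "VBD", "B-VP"), ("at", "IN", "B-PP"),
           ("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), (".", ".", "O")]

example_chunks = iob_to_chunks(example)
print(example_chunks)
```

Because every token carries exactly one tag, chunking becomes an ordinary sequence tagging problem, which is why a tagger can be trained to do it.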

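First, we need to get some data to train our chunker on, and we need to format that data in a way that is useful to us. Let's grab some data from the CoNLL-2000 collection. This data comes in tree form, so we need to convert it into IOB format, which gives us (word, part of speech tag, chunk tag) triples. Afterwards, we drop the word from each triple, leaving (part of speech tag, chunk tag) pairs. Our tagger will then treat the part of speech tag as its input token and the chunk tag as the label to predict.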

In [None]:
trainingSentences = nltk.corpus.conll2000.chunked_sents('train.txt') #grab training data from conll2000
trainingSentences = [nltk.chunk.tree2conlltags(tree) for tree in trainingSentences] #convert from tree to IOB
#drop the word from each triple, keeping (part of speech tag, chunk tag) pairs
trainingSentences = [[(tag,chunk) for (word, tag, chunk) in chunk_tags] for chunk_tags in trainingSentences]

We can now see our correctly formatted training data. You may also want to print out the data at different stages of the reformatting process, to see what is happening to our data.

In [None]:
trainingSentences[:2] #show the first two sentences rather than the whole corpus
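
To follow the reformatting stage by stage without wading through the whole corpus, it can help to run the same steps on one small tree. The sketch below builds a toy tree by hand (the sentence is an assumption for illustration, not taken from CoNLL-2000) and applies the same two conversions as the cell above.

```python
import nltk

# Stage 1: a chunk tree, built by hand here instead of read from the corpus
toy_tree = nltk.Tree("S", [
    nltk.Tree("NP", [("the", "DT"), ("cat", "NN")]),
    nltk.Tree("VP", [("sat", "VBD")]),
    (".", "."),
])

# Stage 2: tree -> (word, POS tag, chunk tag) triples
toy_triples = nltk.chunk.tree2conlltags(toy_tree)

# Stage 3: triples -> (POS tag, chunk tag) pairs, the format the tagger trains on
toy_pairs = [(tag, chunk) for (word, tag, chunk) in toy_triples]

print(toy_tree)
print(toy_triples)
print(toy_pairs)
```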

We can now train our chunker on our formatted data.

In [None]:
bigram_chunker = nltk.tag.BigramTagger(trainingSentences)
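
One thing to watch for: a `BigramTagger` with no backoff returns `None` for any context it did not see during training, and once one tag is `None` the next context is unseen as well, so the rest of the sentence tends to come back as `None` too. A common remedy is to give the bigram tagger a unigram tagger to fall back on. The sketch below shows the difference on a tiny hand-made training set. The toy data and the variable names are assumptions for illustration, and results on the real CoNLL-2000 data will differ.

```python
import nltk

# Hypothetical toy training set: one sentence of (POS tag, chunk tag) pairs
toy_training = [[("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "B-VP"),
                 ("DT", "B-NP"), ("NN", "I-NP")]]

# A sentence that starts with a tag no training sentence started with
unseen = ["NN", "VBD", "DT", "NN"]

# Bigram tagger alone: the first context is unseen, so tagging breaks down
plain_chunker = nltk.tag.BigramTagger(toy_training)
plain_tags = [chunk for (pos, chunk) in plain_chunker.tag(unseen)]

# Bigram tagger that falls back on a unigram tagger for unseen contexts
unigram_chunker = nltk.tag.UnigramTagger(toy_training)
backoff_chunker = nltk.tag.BigramTagger(toy_training, backoff=unigram_chunker)
backoff_tags = [chunk for (pos, chunk) in backoff_chunker.tag(unseen)]

print(plain_tags)
print(backoff_tags)
```

If your own output contains many `None` chunk tags, passing `backoff=` when building `bigram_chunker` is one option worth trying.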

Let's pass our chunker some data that does not yet have IOB tags.
The chunker expects each sentence as a plain list of part of speech tags, so we strip the words out first.

In [None]:
#remove the words from our data, leaving only the part of speech tags
sentencesIOB = [[tag for (word, tag) in sent] for sent in sentences] #store stripped data in new list
sentencesIOB[0] = bigram_chunker.tag(sentencesIOB[0]) #tag the first sentence in our text

Now we can see our new IOB tags.

In [None]:
sentencesIOB[0]

We can add our IOB tags to our original sentences with the following code:

In [None]:
#pair each (word, part of speech tag) with the IOB tag predicted at the same position
sentences[0] = [(word, tag, chunk) for (word, tag), (_, chunk) in zip(sentences[0], sentencesIOB[0])]
sentences[0]
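
Since chunks can also be viewed as trees, the combined triples can be converted back with `nltk.chunk.conlltags2tree`, which is the inverse of the `tree2conlltags` call used on the training data. The sketch below uses a short hand-written sentence (an assumption for illustration) in place of `sentences[0]`, so that it runs without the text file or the trained chunker.

```python
import nltk

# Hypothetical (word, POS tag, IOB tag) triples, in the same shape as sentences[0]
toy_sentence = [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "B-VP")]

# Rebuild the chunk tree from the flat IOB representation
toy_chunk_tree = nltk.chunk.conlltags2tree(toy_sentence)
print(toy_chunk_tree)

# Collect the label of each chunk directly under the root
chunk_labels = [child.label() for child in toy_chunk_tree if isinstance(child, nltk.Tree)]
print(chunk_labels)
```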