# Define a corpus of text

To evaluate the accuracy and performance of the quantum NLP implementation, it is useful to have access to a variety of texts. The Python package [NLTK](https://www.nltk.org/) provides a convenient way to obtain such data.

First, we must ensure that the NLTK package is installed.

In [2]:
!python -m pip install nltk

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/6f/ed/9c755d357d33bc1931e157f537721efb5b88d2c583fe593cc09603076cc3/nltk-3.4.zip (1.4MB)
    100% |████████████████████████████████| 1.4MB 9.7MB/s eta 0:00:01
Collecting singledispatch (from nltk)
  Downloading https://files.pythonhosted.org/packages/c5/10/369f50bcd4621b263927b0a1519987a04383d4a98fb10438042ad410cf88/singledispatch-3.4.0.3-py2.py3-none-any.whl
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /Users/mlxd/Library/Caches/pip/wheels/4b/c8/24/b2343664bcceb7147efeb21c0b23703a05b23fcfeaceaa2a1e
Successfully built nltk
Installing collected packages: singledispatch, nltk
Successfully installed nltk-3.4 singledispatch-3.4.0.3


Next, we download the available data collections, from which we will select our corpora. This installs the data into the user's home directory at `${HOME}/nltk_data`.

In [3]:
!python -m nltk.downloader all

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /Users/mlxd/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/mlxd/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /Users/mlxd/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to /Users/mlxd/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /Users/mlxd/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /Users/mlxd/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nltk_data]    | Downloading package cess_esp to
[nltk_data]   

[nltk_data]    | Downloading package qc to /Users/mlxd/nltk_data...
[nltk_data]    |   Package qc is already up-to-date!
[nltk_data]    | Downloading package reuters to
[nltk_data]    |     /Users/mlxd/nltk_data...
[nltk_data]    |   Package reuters is already up-to-date!
[nltk_data]    | Downloading package rte to /Users/mlxd/nltk_data...
[nltk_data]    |   Package rte is already up-to-date!
[nltk_data]    | Downloading package semcor to
[nltk_data]    |     /Users/mlxd/nltk_data...
[nltk_data]    |   Package semcor is already up-to-date!
[nltk_data]    | Downloading package senseval to
[nltk_data]    |     /Users/mlxd/nltk_data...
[nltk_data]    |   Package senseval is already up-to-date!
[nltk_data]    | Downloading package sentiwordnet to
[nltk_data]    |     /Users/mlxd/nltk_data...
[nltk_data]    |   Package sentiwordnet is already up-to-date!
[nltk_data]    | Downloading package sentence_polarity to
[nltk_data]    |     /Users/mlxd/nltk_data...
[nltk_data]    |   Package sentenc

[nltk_data]    |   Package mwa_ppdb is already up-to-date!
[nltk_data]    | 
[nltk_data]  Done downloading collection all

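To confirm what the download step left on disk, the data directory can be inspected directly. The following is a minimal sketch using only the standard library; the helper name `installed_nltk_packages` is hypothetical, and it assumes the usual layout in which each category folder (for example `corpora` or `tokenizers`) holds one folder and/or `.zip` archive per package.

```python
from pathlib import Path


def installed_nltk_packages(data_dir=None):
    """Map each category folder (e.g. 'corpora') to the package names found in it."""
    # Assumption: the data lives in ~/nltk_data unless another path is given.
    root = Path(data_dir) if data_dir is not None else Path.home() / "nltk_data"
    if not root.is_dir():
        return {}
    packages = {}
    for category in sorted(p for p in root.iterdir() if p.is_dir()):
        names = set()
        for entry in category.iterdir():
            # A package may appear as a folder, a .zip archive, or both;
            # dropping the ".zip" suffix counts each name once.
            name = entry.name[:-4] if entry.suffix == ".zip" else entry.name
            names.add(name)
        packages[category.name] = sorted(names)
    return packages


if __name__ == "__main__":
    for category, names in installed_nltk_packages().items():
        print(f"{category}: {len(names)} package(s)")
```

If the function returns an empty dictionary, the directory was not found at the assumed location, and a different path can be passed as `data_dir`.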

We may now import the NLTK package and examine some of the available texts. Below, as an example, we take the first sentence of each of three paragraphs of Jane Austen's *Emma*, skipping the first three paragraphs of the file.

In [136]:
import nltk
import nltk.corpus as cps

In [142]:
startOffset = 3
numParagraphs = 3
# Take the first sentence (p[0]) of each of the selected paragraphs
Emma_tokens = [p[0] for p in cps.gutenberg.paras('austen-emma.txt')[0+startOffset : numParagraphs+startOffset]]
# Flatten the per-sentence word lists into a single list of tokens
Emma_tokens = [val for p in Emma_tokens for val in p]
print(len(Emma_tokens))

121

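The slicing and flattening above rely on the nesting returned by `paras()`: a list of paragraphs, each a list of sentences, each a list of word tokens. The sketch below reproduces the same two steps on a small hand-made list, so it runs without NLTK. The data and the lower-case variable names are illustrative assumptions, not taken from the corpus.

```python
# Hypothetical toy data mimicking the nesting returned by paras():
# paragraphs -> sentences -> word tokens.
toy_paras = [
    [["VOLUME", "I"]],
    [["Emma", "was", "rich", "."], ["She", "was", "happy", "."]],
    [["Her", "father", "was", "kind", "."]],
]

start_offset = 1    # skip the heading paragraph
num_paragraphs = 2  # number of paragraphs to select

selected = toy_paras[start_offset : start_offset + num_paragraphs]

# Keep only the first sentence of each selected paragraph, then flatten.
first_sentences = [p[0] for p in selected]
tokens = [tok for sent in first_sentences for tok in sent]

# Variant: keep every sentence of each selected paragraph.
all_tokens = [tok for p in selected for sent in p for tok in sent]

print(len(tokens))      # 9
print(len(all_tokens))  # 13
```

The variant with three nested loops shows how to keep whole paragraphs rather than only their first sentences, should that be preferred.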

Alternatively, we may define our own text (here we have chosen the first paragraph of *Peter Pan*), and tokenize it using NLTK.

In [139]:
PanPar1 = """All children, except one, grow up. They soon know that they will grow
up, and the way Wendy knew was this. One day when she was two years old
she was playing in a garden, and she plucked another flower and ran with
it to her mother. I suppose she must have looked rather delightful, for
Mrs. Darling put her hand to her heart and cried, “Oh, why can't you
remain like this for ever!” This was all that passed between them on
the subject, but henceforth Wendy knew that she must grow up. You always
know after you are two. Two is the beginning of the end."""

In [141]:
Pan_tokens = nltk.word_tokenize(PanPar1)
len(Pan_tokens)

127
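
To illustrate what tokenization does, here is a minimal regular-expression tokenizer that separates words from punctuation. This is a sketch only: the function `simple_tokenize` is hypothetical and is not the algorithm used by `nltk.word_tokenize`, which handles contractions and abbreviations differently, so token counts will generally differ.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one non-word, non-space character.
    return re.findall(r"\w+|[^\w\s]", text)


sample = "All children, except one, grow up."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
# ['All', 'children', ',', 'except', 'one', ',', 'grow', 'up', '.']
print(len(sample_tokens))  # 9
```

One visible limitation of this sketch is that it splits `can't` into three tokens (`can`, `'`, `t`).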