# NLTK Chapter 2

## Accessing Text Corpora and Lexical Resources

*The HTML version of this chapter in the NLTK book is available [here](https://www.nltk.org/book/ch02.html#exercises "Ch02 Exercises").*

### 8   Exercises

###### 1. 

☼ Create a variable `phrase` containing a list of words. Review the operations described in the previous chapter, including addition, multiplication, indexing, slicing, and sorting.

In [3]:
phrase1 = ['This', 'is', 'a', 'lovely', 'list', 'of', 'words.']
phrase2 = ['As', 'is', 'this.']

phrase1 + phrase2

['This', 'is', 'a', 'lovely', 'list', 'of', 'words.', 'As', 'is', 'this.']

In [6]:
print(phrase1 * 3, end = '')

['This', 'is', 'a', 'lovely', 'list', 'of', 'words.', 'This', 'is', 'a', 'lovely', 'list', 'of', 'words.', 'This', 'is', 'a', 'lovely', 'list', 'of', 'words.']

In [8]:
phrase1[-1]

'words.'

In [7]:
phrase1[2:]

['a', 'lovely', 'list', 'of', 'words.']

In [11]:
sorted(phrase1 + phrase2)

['As', 'This', 'a', 'is', 'is', 'list', 'lovely', 'of', 'this.', 'words.']
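
*Note that `sorted` compares strings by code point, which is why the capitalized words `'As'` and `'This'` come first. A minimal sketch, assuming a case-insensitive order is what we want, passes `str.lower` as the sort key (the names `combined`, `default_order`, and `caseless_order` are only for illustration):*

```python
phrase1 = ['This', 'is', 'a', 'lovely', 'list', 'of', 'words.']
phrase2 = ['As', 'is', 'this.']

combined = phrase1 + phrase2

# Default sort: uppercase letters sort before lowercase ones
default_order = sorted(combined)

# Case-insensitive sort: compare lowercased copies of each word
caseless_order = sorted(combined, key=str.lower)

print(caseless_order)
```

*The original strings are kept as they are; only the comparison uses the lowercased form.*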

###### 2.

☼ Use the corpus module to explore `austen-persuasion.txt`. How many word tokens does this book have? How many word types?

*As I discussed at length in my notes, I recommend removing non-alphabetic characters and enclitics:*

In [16]:
from nltk.corpus import gutenberg

austen = 'austen-persuasion.txt'
enclitics = ("d", "ll", "m", "re", "s", "t", "ve")
words = gutenberg.words(austen)

# number of tokens
num_tokens = len([w for w in words if w.isalpha() and w not in enclitics])

num_tokens

83617

In [17]:
# number of word types
num_wt = len({w for w in words if w.isalpha() and w not in enclitics})
num_wt

6031

*If we skip removing non-alphabetic characters and enclitics, both counts are higher:*

In [18]:
# number of tokens
num_tokens = len(words)

num_tokens

98171

In [19]:
# number of word types
num_wt = len(set(words))
num_wt

6132
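
*To see how the filtering changes the counts, here is a small self-contained sketch. The helper `count_tokens_and_types` and its `fold_case` option are my own illustration, not part of NLTK, and the `sample` list is made-up toy data standing in for `gutenberg.words(austen)`. Lowercasing before counting types is an extra assumption; the counts above do not do this.*

```python
def count_tokens_and_types(words, enclitics=(), fold_case=False):
    """Return (token count, type count) for alphabetic, non-enclitic words."""
    kept = [w for w in words if w.isalpha() and w not in enclitics]
    if fold_case:
        # Treat 'She' and 'she' as the same word type
        kept = [w.lower() for w in kept]
    return len(kept), len(set(kept))

# Toy tokens, not real corpus data
sample = ['She', 'couldn', "'", 't', 'stay', ';', 'she', 'said', 'so', '.']
enclitics = ("d", "ll", "m", "re", "s", "t", "ve")

tokens, types = count_tokens_and_types(sample, enclitics)
tokens_folded, types_folded = count_tokens_and_types(sample, enclitics, fold_case=True)

print(tokens, types, types_folded)
```

*On the toy list, folding case merges `'She'` and `'she'` into one type while the token count stays the same.*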

###### 3. 

☼ Use the Brown corpus reader `nltk.corpus.brown.words()` or the Web text corpus reader `nltk.corpus.webtext.words()` to access some sample text in two different genres.

*The code below chooses a category at random and then prints five consecutive sentences, starting from a random position in that category.*

In [55]:
from nltk.corpus import brown
import random

random_category = random.choice(brown.categories())
sents = brown.sents(categories = random_category)

# randint is inclusive at both ends, so the latest start still leaves five sentences
random_place = random.randint(0, len(sents) - 5)

print(sents[random_place:random_place + 5], end = '')

[['or', 'allowing', 'survival', 'of', 'a', 'dividend', 'carryover', 'to', 'a', 'personal', 'holding', 'company', '(', 'section', '381(c)(14)', ')', ',', 'but', 'not', 'carryover', 'of', 'excess', 'tax', 'credits', 'for', 'foreign', 'taxes', '?', '?'], ['These', 'items', ',', 'and', 'most', 'of', 'the', 'others', 'listed', 'above', ',', 'seem', 'quite', 'comparable', 'to', 'items', 'whose', 'right', 'of', 'survival', 'is', 'provided', 'for', 'in', 'section', '381', '.'], ['There', 'does', 'not', 'seem', 'to', 'be', 'any', 'reasonable', 'basis', 'for', 'distinction', 'either', 'in', 'terms', 'of', 'the', 'nature', 'of', 'the', 'tax', 'attribute', 'or', 'in', 'terms', 'of', 'tax-avoidance', 'possibilities', '.'], ['With', 'respect', 'to', 'items', 'such', 'as', 'these', 'the', 'provisions', 'of', 'section', '381(c)', ',', 'viewed', 'in', 'historical', 'perspective', ',', 'suggest', 'a', 'rule', 'requiring', 'survival', ',', 'whether', 'the', 'items', 'are', 'beneficial', 'or', 'detrimenta

*As you can see, the output doesn't look very nice.  We could try a list comprehension:*

In [56]:
print([w for w in brown.sents(categories = random_category)[random_place:random_place + 5]], end = '')

[['or', 'allowing', 'survival', 'of', 'a', 'dividend', 'carryover', 'to', 'a', 'personal', 'holding', 'company', '(', 'section', '381(c)(14)', ')', ',', 'but', 'not', 'carryover', 'of', 'excess', 'tax', 'credits', 'for', 'foreign', 'taxes', '?', '?'], ['These', 'items', ',', 'and', 'most', 'of', 'the', 'others', 'listed', 'above', ',', 'seem', 'quite', 'comparable', 'to', 'items', 'whose', 'right', 'of', 'survival', 'is', 'provided', 'for', 'in', 'section', '381', '.'], ['There', 'does', 'not', 'seem', 'to', 'be', 'any', 'reasonable', 'basis', 'for', 'distinction', 'either', 'in', 'terms', 'of', 'the', 'nature', 'of', 'the', 'tax', 'attribute', 'or', 'in', 'terms', 'of', 'tax-avoidance', 'possibilities', '.'], ['With', 'respect', 'to', 'items', 'such', 'as', 'these', 'the', 'provisions', 'of', 'section', '381(c)', ',', 'viewed', 'in', 'historical', 'perspective', ',', 'suggest', 'a', 'rule', 'requiring', 'survival', ',', 'whether', 'the', 'items', 'are', 'beneficial', 'or', 'detrimenta

*That doesn't help much, since the comprehension only copies the nested list.  I recall having this problem in Chapter 1, so I'll place the code I used there into a function:*

In [48]:
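def convert_list_to_text(tokens):
    """Join a list of tokens into a string, attaching punctuation to the preceding word."""
    full_sent = ""
    alpha_text = []

    for token in tokens:

        ## TO DO: find a more general solution for intra-word punctuation
        if token.isalpha() or '-' in token:
            alpha_text.append(token)
        else:
            # Non-word token: flush the pending words, then attach the token
            full_sent += (' '.join(alpha_text) + token + ' ')
            alpha_text = []

    # Flush any words left over when the list does not end in punctuation
    if alpha_text:
        full_sent += ' '.join(alpha_text) + ' '

    return full_sent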

In [60]:
text = [w for w in brown.sents(categories = random_category)[random_place:random_place + 5]]

full_text = []
for t in text:
    full_text.append(convert_list_to_text(t))
print(''.join(full_text))

or allowing survival of a dividend carryover to a personal holding company( section381(c)(14) ) , but not carryover of excess tax credits for foreign taxes? ? These items, and most of the others listed above, seem quite comparable to items whose right of survival is provided for in section381 . There does not seem to be any reasonable basis for distinction either in terms of the nature of the tax attribute or in terms of tax-avoidance possibilities. With respect to items such as these the provisions of section381(c) , viewed in historical perspective, suggest a rule requiring survival, whether the items are beneficial or detrimental to the surviving corporation. To this extent some stretching of the literal meaning of the Committee Report seems justified, since the literal meaning conflicts with the clear implication, if not the language, of the statute. 


*It's not perfect, but it looks much better than it did.  Let's try it again with another random text:*

In [61]:
random_category = random.choice(brown.categories())
random_place = random.randint(0, len(brown.sents(categories = random_category)) - 5)

text = [w for w in brown.sents(categories = random_category)[random_place:random_place + 5]]

full_text = []
for t in text:
    full_text.append(convert_list_to_text(t))
print(''.join(full_text))

But he was not. So what? ? Why should I be spinning just because the goddamn log is spinning? ? ( he asked this out loud, but no one heard it over the other noise in the hut) . Over on the bank, the west bank, a man stood, calling to him. 
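
*One possible direction for the TO DO above is to join every token with a space first and then tidy the spacing with regular expressions. This is only a sketch: `join_tokens` is a hypothetical helper of mine, the punctuation sets are assumptions that cover the cases seen above, and the `sample` list is an abbreviated toy sentence. NLTK also ships a `TreebankWordDetokenizer` in `nltk.tokenize.treebank`, which may be worth trying instead.*

```python
import re

def join_tokens(tokens):
    """Join tokens with spaces, then tidy spacing around punctuation."""
    text = ' '.join(tokens)
    # Remove the space before closing punctuation such as , . ? ! ; : ) ]
    text = re.sub(r'\s+([,.?!;:)\]])', r'\1', text)
    # Remove the space after opening brackets
    text = re.sub(r'([(\[])\s+', r'\1', text)
    return text

# Abbreviated toy sentence in the style of the Brown output above
sample = ['These', 'items', ',', 'and', 'most', 'of', 'the', 'others', ',',
          'seem', 'comparable', '(', 'section', '381', ')', '.']

print(join_tokens(sample))
```

*Because tokens such as `381` are never glued to the word before them, this avoids results like `section381`, though repeated tokens such as the doubled `?` in the Brown corpus would still come through unchanged.*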
