## 23

> Are you able to write a regular expression to tokenize text in such a way that the word _don’t_ is tokenized into _do_ and _n’t_? Explain why this regular expression won’t work: `«n't|\w+»`.

In [2]:
import nltk

In [19]:
nltk.regexp_tokenize('don\'t', r'((?:\w+(?=n\'t))|n\'t)')

['do', "n't"]
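As for why `«n't|\w+»` fails: the regex engine tries the alternatives at the current position before moving on. At the start of `don't` the alternative `n't` cannot match, so `\w+` greedily consumes `don`, and the `n` is gone by the time the scan reaches it. The sketch below shows this with plain `re.findall`, which is assumed to behave like the tokenizer for patterns without capturing groups. The trailing `\w+` alternative in the second pattern is an addition, so that ordinary words are still tokenized.

```python
import re

# At position 0 the alternative n't cannot match, so \w+ greedily takes "don".
broken = re.findall(r"n't|\w+", "don't")

# A lookahead makes \w+ stop just before n't, leaving it for the first alternative.
fixed = re.findall(r"n't|\w+(?=n't)|\w+", "don't")

print(broken)  # ['don', 't']
print(fixed)   # ['do', "n't"]
```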

## 25

> _Pig Latin_ is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append _ay_, e.g., _string_ → _ingstray_, _idle_ → _idleay_ (see http://en.wikipedia.org/wiki/Pig_Latin).
>
> a. Write a function to convert a word to Pig Latin.
>
> b. Write code that converts text, instead of individual words.
>
> c. Extend it further to preserve capitalization, to keep `qu` together (so that `quiet` becomes `ietquay`, for example), and to detect when `y` is used as a consonant (e.g., `yellow`) versus a vowel (e.g., `style`).

In [30]:
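import re

def to_pig_latin(word):
    # Find the first lowercase vowel; everything before it is the leading cluster.
    match = re.search('[aeiou]', word)
    if match is None:
        # No vowel found (e.g. 'by'): leave the word unchanged.
        return word
    i = match.start()
    # Move the leading cluster to the end and append 'ay'.
    return word[i:] + word[:i] + 'ay'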

In [33]:
to_pig_latin('string')

'ingstray'

In [34]:
def to_pig_latin_text(text):
    return [to_pig_latin(word) for word in nltk.word_tokenize(text)]

In [35]:
to_pig_latin_text('The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky')

['eThay',
 'ojectPray',
 'utenbergGay',
 'ookEBay',
 'ofay',
 'imeCray',
 'anday',
 'unishmentPay',
 ',',
 'by',
 'odorFyay',
 'ostoevskyDay']
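Part (c) is not covered by the cells above. The following is a minimal sketch of one way to approach it, not a definitive solution. The names `ONSET`, `pig_latin_word` and `pig_latin_text` are hypothetical, and the treatment of `y` is a rough heuristic: it counts as a consonant only when it is the first letter of the word, and as a vowel everywhere else.

```python
import re

# Leading consonant cluster: an optional initial "y" (treated as a consonant
# only at the very start), then consonants, with "qu" consumed as one unit.
ONSET = re.compile(r"y?(?:qu|[b-df-hj-np-tv-xz])*")

def pig_latin_word(word):
    """Convert one alphabetic word to Pig Latin, keeping an initial capital."""
    lower = word.lower()
    onset = ONSET.match(lower).group()
    rest = lower[len(onset):]
    if not rest:
        # No vowel found: leave the word unchanged.
        return word
    result = rest + onset + "ay"
    # Preserve an initial capital: "The" -> "Ethay".
    return result.capitalize() if word[0].isupper() else result

def pig_latin_text(text):
    """Convert every alphabetic run in the text, leaving punctuation in place."""
    return re.sub(r"[A-Za-z]+", lambda m: pig_latin_word(m.group()), text)

print(pig_latin_text("The quiet yellow style."))
```

Substituting with `re.sub` instead of tokenizing keeps spacing and punctuation intact, so the result is a string and not a list of tokens.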

## 29

> Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define $\mu_w$ to be the average number of letters per word, and $\mu_s$ to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: 4.71 $\mu_w$ + 0.5 $\mu_s$ - 21.43. Compute the ARI score for various sections of the Brown Corpus, including section `f` (popular lore) and `j` (learned). Make use of the fact that `nltk.corpus.brown.words()` produces a sequence of words, whereas `nltk.corpus.brown.sents()` produces a sequence of sentences.

In [40]:
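def cal_ARI():
    # Pools the 'lore' and 'learned' categories into a single score.
    words = nltk.corpus.brown.words(categories=['lore', 'learned'])
    sents = nltk.corpus.brown.sents(categories=['lore', 'learned'])
    w = sum(len(word) for word in words) / len(words)  # mean characters per token (mu_w)
    s = len(words) / len(sents)                        # mean tokens per sentence (mu_s)
    return 4.71 * w + 0.5 * s - 21.43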

In [41]:
cal_ARI()

11.290781451862358
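The exercise asks for a score per section, while `cal_ARI()` pools both categories into one number. A possible variant, sketched here under the assumption that the formula is applied to any pair of word and sentence sequences, takes the sequences as arguments. The function name `ari` and the tiny hand-made sample are illustrative only.

```python
def ari(words, sents):
    """Automated Readability Index from a token sequence and a sentence sequence."""
    mu_w = sum(len(w) for w in words) / len(words)  # mean characters per token
    mu_s = len(words) / len(sents)                  # mean tokens per sentence
    return 4.71 * mu_w + 0.5 * mu_s - 21.43

# Tiny hand-made sample standing in for a corpus section.
sample_sents = [["the", "cat", "sat"], ["dogs", "bark"]]
sample_words = [w for s in sample_sents for w in s]
print(ari(sample_words, sample_sents))
```

With the Brown Corpus available, calling `ari(nltk.corpus.brown.words(categories=c), nltk.corpus.brown.sents(categories=c))` for `c` in `'lore'` and `'learned'` would give one score per section. As in the original cell, punctuation tokens are counted as words.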

## 38

> An interesting challenge for tokenization is words that have been split across a linebreak. E.g., if _long-term_ is split, then we have the string `long-\nterm`.
>
> a. Write a regular expression that identifies words that are hyphenated at a linebreak. The expression will need to include the `\n` character.
>
> b. Use `re.sub()` to remove the `\n` character from these words.
>
> c. How might you identify words that should not remain hyphenated once the newline is removed, e.g., `'encyclo-\npedia'`?

In [44]:
import re
# Require a hyphen directly before the newline so ordinary line breaks do not match.
re.search(r'[-\w]+-\n[-\w]+', 'long-\nterm')

<re.Match object; span=(0, 10), match='long-\nterm'>

In [45]:
# Drop the newline only when the first part ends in a hyphen.
re.sub(r'([-\w]+-)\n([-\w]+)', r'\1\2', 'long-\nterm')

'long-term'
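For part (c), one plausible approach is a dictionary lookup: if the two halves joined without the hyphen form a known word, the hyphen was probably only a line-break artifact. The sketch below assumes a small hypothetical vocabulary `VOCAB`. The helper names `join_linebreak` and `dehyphenate` are likewise illustrative.

```python
import re

# Hypothetical mini-vocabulary; in practice this would be a large word list.
VOCAB = {"encyclopedia", "long", "term"}

def join_linebreak(match):
    """Decide whether to keep the hyphen when rejoining a split word."""
    left, right = match.group(1), match.group(2)
    joined = left + right
    # If the fused form is a known word, drop the hyphen along with the newline.
    if joined.lower() in VOCAB:
        return joined
    # Otherwise keep the hyphen and remove only the newline.
    return left + "-" + right

def dehyphenate(text):
    return re.sub(r"(\w+)-\n(\w+)", join_linebreak, text)

print(dehyphenate("encyclo-\npedia"))  # encyclopedia
print(dehyphenate("long-\nterm"))      # long-term
```

A real word list, such as the one returned by `nltk.corpus.words.words()`, could stand in for `VOCAB`. The lookup will still misjudge compounds that exist in both hyphenated and fused forms.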

In [49]:
words = ['attribution', 'confabulation', 'elocution',
         'sequoia', 'tenacious', 'unidirectional']
vsequences = set([''.join(re.findall(r'[aeiou]', word)) for word in words])
sorted(vsequences)

['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']

In [48]:
re.findall(r'[aeiou]', 'attribution')

['a', 'i', 'u', 'i', 'o']