# NLTK Chapter 3

## Processing Raw Text

*The html version of this chapter in the NLTK book is available [here](https://www.nltk.org/book/ch03.html#exercises "Ch03 Exercises").*

### 8   Exercises

##### 1.

☼ Define a string `s = 'colorless'`. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.

In [4]:
s = 'colorless'
s = s[:4] + 'u' + s[4:]
s

'colourless'

##### 2.

☼ We can use the slice notation to remove morphological endings on words. For example, `'dogs'[:-1]` removes the last character of `dogs`, leaving `dog`. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): `dish-es`, `run-ning`, `nation-ality`, `un-do`, `pre-heat`.

In [15]:
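affixed = [('dishes', None, -2),       # suffix: drop the last 2 characters
           ('running', None, -4),
           ('nationality', None, -5),
           ('undo', 2, None),          # prefix: drop the first 2 characters
           ('preheat', 3, None)]

print([s[start:end] for s, start, end in affixed])

['dish', 'run', 'nation', 'do', 'heat']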


##### 3.

☼ We saw how we can generate an `IndexError` by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

*Yes. Any negative index smaller than `-len(s)` raises an `IndexError`. I'm not going to run the code in my notebook, because the error would stop the cells below this one from running.*

```
>>> trial = "trial"
>>> for i in range(1, len(trial) + 2):
...     print(trial[-i])
...
l
a
i
r
t
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-21-98077138b076> in <module>
      1 trial = "trial"
      2 for i in range(1, len(trial) + 2):
----> 3     print(trial[-i])

IndexError: string index out of range
```
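
*A minimal sketch of how the same experiment could run without halting the notebook: the out-of-range access is wrapped in `try`/`except`, and the error message is stored instead of raised. The names `results` and `clipped` are only for illustration.*

```python
trial = "trial"
results = []

# Walk leftwards one index further than the string allows.
for i in range(1, len(trial) + 2):
    try:
        results.append(trial[-i])
    except IndexError as err:
        # -6 is one position before the start of a 5-character string.
        results.append(f"index {-i}: {err}")

print(results)

# Slices, unlike single indexes, are clipped to the string bounds instead of raising.
clipped = trial[-10:3]
print(clipped)
```

*The first five entries are the letters in reverse order, and the sixth records the `IndexError`. The slice quietly returns `'tri'`.*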

##### 4. 

☼ We can specify a "step" size for the slice. The following returns every second character within the slice: `monty[6:11:2]`. It also works in the reverse direction: `monty[10:5:-2]`. Try these for yourself, then experiment with different step values.

In [22]:
# A Czech tongue twister:
tt = "Třistatřiatřicet stříbrných křepelek přeletělo přes třistatřiatřicet stříbrných střech."

# Every other letter
tt[::2]

'Tittitie tírýhkeee řltl řstittitie tírýhsřc.'

In [23]:
# Every other letter from the end
tt[::-2]

'.cřshýrít eitittitsř ltlř eeekhýrít eitittiT'

In [24]:
# Every third letter
tt[::3]

'Tstaittbý el etoř iaiřesínhtc'

*You get the point...*
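
*For the two slices named in the exercise, here is a short sketch. It assumes `monty` is the string `'Monty Python'`, as in the book's chapter, and shows that a stepped slice visits the same indexes as `range(start, stop, step)`.*

```python
monty = 'Monty Python'  # assumed value, following the NLTK book

forward = monty[6:11:2]     # indexes 6, 8, 10
backward = monty[10:5:-2]   # indexes 10, 8, 6

# A stepped slice visits the same indexes as range(start, stop, step).
via_range = ''.join(monty[i] for i in range(6, 11, 2))

print(forward, backward, via_range)
```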

##### 5. 

☼ What happens if you ask the interpreter to evaluate `monty[::-1]`? Explain why this is a reasonable result.

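*It returns the string reversed. With a step of -1 and no explicit start or end, the slice begins at the last character and walks back to the first, which is exactly what a negative step should do:*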

In [26]:
"redrum"[::-1]

'murder'
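
*To see why this is reasonable, the built-in `slice.indices()` method can report the concrete start, stop, and step that Python fills in for the omitted values. This is a sketch for illustration; `by_hand` is a made-up name.*

```python
word = "redrum"

# slice.indices(n) reports the concrete (start, stop, step) used for a length-n sequence.
start, stop, step = slice(None, None, -1).indices(len(word))
print(start, stop, step)

# Walking those indexes by hand gives the same result as the slice.
by_hand = ''.join(word[i] for i in range(start, stop, step))
print(by_hand == word[::-1] == ''.join(reversed(word)))
```

*The defaults are the last index, one position before index 0, and the step itself, so every character is visited once from right to left.*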

##### 6.

☼ Describe the class of strings matched by the following regular expressions.

a. `[a-zA-Z]+`

b. `[A-Z][a-z]*`

c. `p[aeiou]{,2}t`

d. `\d+(\.\d+)?`

e. `([^aeiou][aeiou][^aeiou])*`

f. `\w+|[^\w\s]+`

Test your answers using `nltk.re_show()`.

*`[a-zA-Z]+` matches any run of one or more ASCII letters, upper or lower case. Digits, spaces, and punctuation break the run:*

In [43]:
import nltk, re

nltk.re_show(r'[a-zA-Z]+', "cAMELCASE 6186258313 hybr1d")

{cAMELCASE} 6186258313 {hybr}1{d}
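
*If NLTK isn't at hand, the brace display can be approximated with the standard `re` module alone. The helper below, `show_matches`, is a hypothetical stand-in written for this note, not NLTK's own code. The second call shows that accented letters fall outside `[a-zA-Z]`.*

```python
import re

def show_matches(pattern, text):
    """Wrap every non-overlapping match of pattern in braces (hypothetical helper)."""
    return re.sub(pattern, lambda m: "{" + m.group(0) + "}", text)

marked = show_matches(r'[a-zA-Z]+', "cAMELCASE 6186258313 hybr1d")
print(marked)

# "na\u00efve" is "naive" written with an i-diaeresis, which is not an ASCII letter.
accented = show_matches(r'[a-zA-Z]+', "na\u00efve")
print(accented)
```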


<i>`[A-Z][a-z]*` matches one uppercase letter followed by zero or more lowercase letters. That covers capitalized words, and also each uppercase letter on its own when it turns up in another position:</i>

In [42]:
test = 'I think words beginning with Uppercase Letters will be matched, ' \
       'or any uppercase letters found in oTHER positions.'

nltk.re_show(r'[A-Z][a-z]*', test)

{I} think words beginning with {Uppercase} {Letters} will be matched, or any uppercase letters found in o{T}{H}{E}{R} positions.


*`p[aeiou]{,2}t` matches a __p__, then zero, one, or two lowercase vowels, then a __t__. With `re.search` the match can sit anywhere inside a word, so this catches a lot of words: in the wordlist we've been using in this chapter, this RegExp returns nearly 7,000 words, since any word containing __pt__ is a match.*

In [51]:
wordlist = [w.lower() for w in nltk.corpus.words.words('en')]
len([w for w in wordlist if re.search(r'p[aeiou]{,2}t', w)])

6978

In [52]:
print([w for w in wordlist if re.search(r'p[aeiou]{,2}t', w)][:20])

['abaptiston', 'abepithymia', 'ableptical', 'ableptically', 'abrupt', 'abruptedly', 'abruption', 'abruptly', 'abruptness', 'absorpt', 'absorptance', 'absorptiometer', 'absorptiometric', 'absorption', 'absorptive', 'absorptively', 'absorptiveness', 'absorptivity', 'absumption', 'acalypterae']


*If we add the `^` and <code>\$</code> anchors, the pattern has to match the whole word. We instead end up with all 3-letter words made of __p__, one vowel, and __t__, plus all 4-letter words made of __p__, two vowels, and __t__. The bare string __pt__ would match too, but it isn't in the wordlist:*

In [49]:
print([w for w in wordlist if re.search(r'^p[aeiou]{,2}t$', w)])

['pat', 'pat', 'paut', 'peat', 'pet', 'piet', 'piet', 'pit', 'poet', 'poot', 'pot', 'pout', 'put']
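
*The difference between the unanchored and anchored versions can be sketched on a handful of made-up strings. Here `re.fullmatch` stands in for wrapping the pattern in `^` and `$`; the `sample` list is hypothetical and not taken from the wordlist.*

```python
import re

pattern = r'p[aeiou]{,2}t'   # {,2} means "between 0 and 2 repetitions"
sample = ['pt', 'pat', 'peat', 'piout', 'apt', 'spot']  # hypothetical test strings

# re.search accepts a match anywhere inside the string.
anywhere = [w for w in sample if re.search(pattern, w)]

# re.fullmatch requires the pattern to cover the whole string.
whole = [w for w in sample if re.fullmatch(pattern, w)]

print(anywhere)
print(whole)
```

*`piout` fails both tests because three vowels separate the __p__ and the __t__.*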


*`\d+(\.\d+)?` matches unsigned integers and decimals: one or more digits, optionally followed by a decimal point and one or more further digits. It will not match dashes, commas, dollar signs, or any other symbol associated with a number, so those end up outside the braces:*

In [62]:
test = ['1234', '12.34', 'example 123.4 in a string', '1-234', '12,4', '$12.34']
for t in test:
    nltk.re_show(r'\d+(\.\d+)?', t) 

{1234}
{12.34}
example {123.4} in a string
{1}-{234}
{12},{4}
${12.34}


*If the string has two decimal points, the match stops before the second one, and the digits after it are matched separately:*

In [67]:
nltk.re_show(r'\d+(\.\d+)?', '1.23.4')

{1.23}.{4}


*We can alter that by changing the `?` to a `+`, at the cost of no longer matching plain integers:*

In [68]:
nltk.re_show(r'\d+(\.\d+)+', '1.23.4')

{1.23.4}
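
*One thing to watch for when moving from `nltk.re_show()` to `re.findall()` with this pattern: the parentheses form a capturing group, and `findall` returns the group rather than the whole match. A small sketch, with a made-up `text` string:*

```python
import re

text = 'example 123.4 in a string, then 1.23.4 and $12.34'

# With a capturing group, re.findall returns the group contents, not the whole match.
groups = re.findall(r'\d+(\.\d+)?', text)

# A non-capturing group (?:...) makes findall return the full matched text.
numbers = re.findall(r'\d+(?:\.\d+)?', text)

print(groups)
print(numbers)
```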


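<i>`([^aeiou][aeiou][^aeiou])*` is a weird one. It matches zero or more repetitions of a non-vowel/vowel/non-vowel triple, where "non-vowel" means any character other than a lowercase vowel, so digits, spaces, and uppercase letters count too. Because zero repetitions are allowed, it also matches the empty string, which is where the trailing `{}` in the output comes from:</i>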

In [83]:
string = "babbabbab"
nltk.re_show(r'([^aeiou][aeiou][^aeiou])*', string)

{babbabbab}{}
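
<i>A quick way to probe this pattern is to test whole strings with `re.fullmatch`. The `candidates` list below is made up for illustration.</i>

```python
import re

pattern = r'([^aeiou][aeiou][^aeiou])*'

candidates = ['', 'bab', 'babbab', '1a ', 'ba', 'aba']  # hypothetical test strings

# True when the entire string is zero or more complete triples.
full = {s: bool(re.fullmatch(pattern, s)) for s in candidates}
print(full)

# An unanchored search never fails: zero repetitions match the empty string at position 0.
empty_hit = re.search(pattern, 'aba').group(0)
print(repr(empty_hit))
```

<i>The empty string and `'1a '` both pass, while `'ba'` and `'aba'` fail as whole strings because they are not made of complete triples.</i>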


In [None]:
# NOT FINISHED
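
*The last pattern in the exercise, `\w+|[^\w\s]+`, is not worked through above. As a tentative sketch: the first alternative matches a run of word characters (letters, digits, underscore), and the second matches a run of characters that are neither word characters nor whitespace, which is to say punctuation and symbols. Used with `re.findall`, it behaves like a simple tokenizer. The sample sentence is made up.*

```python
import re

pattern = r'\w+|[^\w\s]+'

# No capturing groups here, so findall returns each full match in order.
tokens = re.findall(pattern, "Hello, world!! It's 3.5")
print(tokens)
```

*Whitespace is dropped, adjacent punctuation marks stay together as one token, and the apostrophe and decimal point split their neighbours apart.*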
