# [3. Processing Raw Text](https://www.nltk.org/book/ch03.html) - Exercise Solutions

* [NLTK-Book-Resource Repository](https://github.com/BetoBob/NLTK-Book-Resource)
* [NLTK-Book-Resource Table of Contents](https://github.com/BetoBob/NLTK-Book-Resource#table-of-contents)

Run the cell below before running any other code.

In [31]:
import nltk, re, pprint
from nltk import word_tokenize

## 1.

☼ Define a string `s = 'colorless'`. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.

In [1]:
s = 'colorless'

In [9]:
s[:4] + "u" + s[4:]

'colourless'

## 2.

☼ We can use the slice notation to remove morphological endings on words. For example, `'dogs'[:-1]` removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): `dish-es, run-ning, nation-ality, un-do, pre-heat`.

In [11]:
"dishes"[:-2]

'dish'

In [12]:
"running"[:-4]

'run'

In [13]:
"nationality"[:-5]

'nation'

In [14]:
"undo"[2:]

'do'

In [15]:
"preheat"[3:]

'heat'

## 3.

☼ We saw how we can generate an `IndexError` by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

### Solution

* you will receive an `IndexError` for positive indices `>=` the length of the string
* you will receive an `IndexError` for negative indices `<` the negative length of the string

In [16]:
s = "testing"

In [18]:
len(s)

7

In [26]:
s[6]

'g'

In [19]:
s[7]

IndexError: string index out of range

In [27]:
s[-7]

't'

In [23]:
s[-8]

IndexError: string index out of range
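A related point worth checking: unlike indexing, slicing never raises an `IndexError`. The sketch below (reusing the same `s`, with variable names chosen here for illustration) shows that out-of-range slice bounds are simply clamped to the ends of the string.

```python
s = "testing"

# Slices clamp out-of-range bounds instead of raising IndexError.
left_overrun = s[-100:3]    # start is far to the left of the string
right_overrun = s[4:100]    # end is far to the right of the string
empty = s[100:]             # start is already past the end

print(left_overrun, right_overrun, repr(empty))
```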

## 4.

☼ We can specify a "step" size for the slice. The following returns every second character within the slice: `monty[6:11:2]`. It also works in the reverse direction: `monty[10:5:-2]` Try these for yourself, then experiment with different step values.

In [28]:
monty = 'Monty Python'

In [30]:
monty[6:11]

'Pytho'

In [29]:
monty[6:11:2]

'Pto'

In [33]:
monty[10:5:-1]

'ohtyP'

In [31]:
monty[10:5:-2]

'otP'

## 5.

☼ What happens if you ask the interpreter to evaluate `monty[::-1]`? Explain why this is a reasonable result.

### Solution

* this will reverse the entire string

`[<start>:<end>:<increment>]`

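When the start and end are omitted, the slice covers the whole list / string. With a positive increment the slice runs from the beginning to the end; with a negative increment the defaults flip, so the slice starts at the last character and runs back to the first. An increment of -1 therefore visits every character in reverse order.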

In [34]:
monty[::-1]

'nohtyP ytnoM'
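One way to see which defaults Python fills in is the built-in `slice.indices()` method, which resolves a slice against a given length. This is only an illustrative sketch; the variable names are chosen here.

```python
monty = 'Monty Python'

# slice.indices(length) returns the concrete (start, stop, step) that the
# slice resolves to for a sequence of the given length.
forward = slice(None, None, 1).indices(len(monty))
backward = slice(None, None, -1).indices(len(monty))

print(forward)
print(backward)

# Walking the resolved range by hand gives the same result as monty[::-1].
start, stop, step = backward
manual = ''.join(monty[i] for i in range(start, stop, step))
print(manual == monty[::-1])
```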

## 6.

☼ Describe the class of strings matched by the following regular expressions.

1. `[a-zA-Z]+`
2. `[A-Z][a-z]*`
3. `p[aeiou]{,2}t`
4. `\d+(\.\d+)?`
5. `([^aeiou][aeiou][^aeiou])*`
6. `\w+|[^\w\s]+`


#### Solution

1. All strings that:
    * have at least one character
    * use only alphabetical letters

In [90]:
import random

# 1 - 3
wordlist = [w for w in nltk.corpus.words.words('en')]

# 4 - 6
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))

In [75]:
# 1
random.choices([w for w in wordlist if re.search('[a-zA-Z]+', w)], k=20)

['codfishery',
 'Melastomaceae',
 'stillhouse',
 'complot',
 'Anthophora',
 'Loiseleuria',
 'bloodsucker',
 'winberry',
 'bespoke',
 'wobbliness',
 'ported',
 'condylopod',
 'torotoro',
 'maam',
 'diamide',
 'Lodur',
 'Kizil',
 'prizetaker',
 'introvolution',
 'intersex']

2. All strings that:
    * start with a capital letter
    * are followed by zero or more lowercase letters

In [78]:
#2
random.choices([w for w in wordlist if re.search('[A-Z][a-z]*', w)], k=20)

['Papist',
 'Predentata',
 'Zostera',
 'Duchess',
 'Tuckahoe',
 'Asterales',
 'Derby',
 'Musalmani',
 'Lionel',
 'Mendelize',
 'Ponera',
 'Digitaria',
 'Forficulidae',
 'Protoceratops',
 'Hevea',
 'Brabejum',
 'Spirochaetaceae',
 'Leviticalism',
 'Brummagem',
 'Tulalip']

3. All strings that:
    * start with the letter 'p'
    * followed by 0 - 2 vowels
    * ending in 't'

In [83]:
#3
random.choices([w for w in wordlist if re.search('p[aeiou]{,2}t', w)], k=20)

['pterothorax',
 'Ephemeroptera',
 'reptilian',
 'crampet',
 'captivator',
 'Septemberism',
 'prodespotic',
 'apperceptionistic',
 'metropathia',
 'bypath',
 'spatular',
 'plecopterous',
 'stript',
 'angiopathy',
 'aerotherapeutics',
 'peripety',
 'coleopteroid',
 'amputative',
 'uncaptivated',
 'consumption']

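The examples above are words that contain a match for the regular expression as a substring, because `re.search` looks for a match anywhere in the word. To find only the words that match from beginning to end (here, words that start with 'p' and end with 't'), anchor the pattern with the caret `^` and the dollar sign `$` respectively.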

In [92]:
# 3 (continued)
[w for w in wordlist if re.search('^p[aeiou]{,2}t$', w)]

['pat',
 'paut',
 'peat',
 'pet',
 'piet',
 'pit',
 'poet',
 'poot',
 'pot',
 'pout',
 'put']
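The difference between matching a substring and matching the whole word can be checked on a small list without loading a corpus. The sketch below uses a made-up `sample` list (not taken from the Words Corpus) and also shows `re.fullmatch`, which behaves like adding both anchors.

```python
import re

# Hypothetical sample words, not drawn from the NLTK Words Corpus.
sample = ['pat', 'poet', 'reptilian', 'bypath', 'pt', 'paeot']
pattern = r'p[aeiou]{,2}t'

# re.search succeeds if the pattern matches anywhere inside the word.
contains = [w for w in sample if re.search(pattern, w)]

# Anchoring with ^ and $ requires the whole word to match.
anchored = [w for w in sample if re.search('^' + pattern + '$', w)]

# re.fullmatch is equivalent to anchoring at both ends.
whole = [w for w in sample if re.fullmatch(pattern, w)]

print(contains)
print(anchored)
print(whole)
```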

4. All strings that:
    * start with one or more digits
    * optionally contain **one** period followed by more digits

In [95]:
# 4
random.choices([w for w in chat_words if re.search(r'^\d+(\.\d+)?$', w)], k=20)

['28147',
 '92780',
 '20',
 '300',
 '64.8',
 '2006',
 '7',
 '73042',
 '98.6',
 '49',
 '1996',
 '1.99',
 '99703',
 '1930',
 '295',
 '73042',
 '0',
 '73042',
 '39.3',
 '51']

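5. All strings made up of zero or more repetitions of the three-character pattern `(non-vowel)(vowel)(non-vowel)`. The pattern can occur *zero times* (so the empty string matches), *one time*, or *any number of times* in a row, as long as each group of three characters follows the same pattern. Note that `[^aeiou]` matches any character other than a lowercase vowel, so capital letters, digits and punctuation also count as non-vowels.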

In [112]:
# 5
random.choices([w for w in chat_words if re.search('^([^aeiou][aeiou][^aeiou])*$', w)], k=20)

['suggested',
 'New',
 'bumber',
 'dobson',
 'tastes',
 'bar',
 'rey',
 'messenger',
 'san',
 'killed',
 'sec',
 'center',
 'Her',
 'burned',
 'heh',
 'bumber',
 'van',
 'wantin',
 'Dustin',
 'waz']

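6. All strings that consist either entirely of word characters (letters, digits and underscores) or entirely of characters that are neither word characters nor whitespace (ex: `'####'`).

In [107]:
# 6
# the group keeps ^ and $ applied to both alternatives
random.choices([w for w in chat_words if re.search(r'^(\w+|[^\w\s]+)$', w)], k=20)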

['lysol',
 'choco',
 'back',
 'tonight',
 'Looking',
 'ground',
 'min',
 'sores',
 'bio',
 'covered',
 'snowy',
 '32',
 'ACTION',
 '####',
 'pumpkins',
 'U197',
 'Kent',
 'Horace',
 'Jesus',
 'beg']

## 7.

☼ Write regular expressions to match the following classes of strings:

1. A single determiner (assume that a, an, and the are the only determiners).
2. An arithmetic expression using integers, addition, and multiplication, such as `2*3+8`.

### Solution

* use `\b` (for boundary) to match the position at the beginning or end of a word

In [56]:
#1

string = "This an especially important test to the determiners of a sentence."

nltk.re_show(r"\bthe\b|\ban\b|\ba\b", string)

This {an} especially important test to {the} determiners of {a} sentence.


* the solution below assumes that an integer without operators like `+` and `*` is a valid arithmetic expression
* `-` and `/` are excluded, but can easily be added in the `(\+|\*)` portion of the regular expression
* `\d+` is used so that integers with more than one digit are matched as a whole, and `\s*` allows optional whitespace around the operators

In [44]:
#2

string = "2 + 2 is 4 - 1 that's 3 quick maths 2*3+8"

nltk.re_show(r"\d+(\s*(\+|\*)\s*\d+)*", string)

{2 + 2} is {4} - {1} that's {3} quick maths {2*3+8}


## 8.

☼ Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use `from urllib import request` and then `request.urlopen('http://nltk.org/').read().decode('utf8')` to access the contents of the URL.

### Solution

* use the `Beautiful Soup` library to easily get the text content of a web page

In [100]:
from urllib import request
from bs4 import BeautifulSoup

def remove_markup(URL):
    page = request.urlopen(URL).read().decode('utf8')
    soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly to avoid a warning
    return soup.get_text()

In [102]:
print(remove_markup('http://nltk.org/'))




Natural Language Toolkit — NLTK 3.5 documentation













NLTK 3.5 documentation

next |
          modules |
          index










Natural Language Toolkit¶
NLTK is a leading platform for building Python programs to work with human language data.
It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet,
along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning,
wrappers for industrial-strength NLP libraries,
and an active discussion forum.
Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation,
NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike.
NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.
NLTK has been called “a wonderful tool for teaching, and worki

## 9.

☼ Save some text into a file `corpus.txt`. Define a function `load(f)` that reads from the file named in its sole argument, and returns a string containing the text of the file.

In [1]:
def load(f):
    # the with statement closes the file even if reading fails
    with open(f) as infile:
        return infile.read()

In [3]:
print(load("data/example.txt"))

This is a txt file :)
It has a new line character. This looks like a '\n' and creates a new line in the file.
That's all. Thanks for reading!


## 10. 

☼ Rewrite the following loop as a list comprehension:

```python
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = []
for word in sent:
    word_len = (word, len(word))
    result.append(word_len)
```

**Output:**

`[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]`

In [37]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']

In [40]:
result = [(s, len(s)) for s in sent]

In [41]:
result

[('The', 3),
 ('dog', 3),
 ('gave', 4),
 ('John', 4),
 ('the', 3),
 ('newspaper', 9)]

## 11.

☼ Define a string `raw` containing a sentence of your own choosing. Now, split raw on some character other than space, such as `'s'`.

In [42]:
raw = "a string that splits on s"

In [43]:
raw.split('s')

['a ', 'tring that ', 'plit', ' on ', '']

## 12.

☼ Write a for loop to print out the characters of a string, one per line.

In [44]:
s = "for loop magic"

In [46]:
for c in s:
    print(c)

f
o
r
 
l
o
o
p
 
m
a
g
i
c


## 13.

☼ What is the difference between calling split on a string with no argument or with `' '` as the argument, e.g. `sent.split()` versus `sent.split(' ')`? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? 

**Tip:** Type `\t` to write a tab character into a string

### Solution
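* by default, the `.split()` function splits on **all** whitespace characters. This includes the characters
    * ' '
    * '\n'
    * '\t'
    * and others
* with no argument, a run of consecutive whitespace characters counts as a single separator, and leading or trailing whitespace produces no empty strings
* with an explicit argument such as `' '`, the string is split on every occurrence of exactly that separator, so consecutive spaces produce empty strings in the result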

In [19]:
sent = "A string to be split!\n\t With a tab"

In [21]:
# notice the split function splits on \n and \t by default as well
sent.split()

['A', 'string', 'to', 'be', 'split!', 'With', 'a', 'tab']

In [22]:
sent.split(' ')

['A', 'string', 'to', 'be', 'split!\n\t', 'With', 'a', 'tab']

In [24]:
sent.split('\t')

['A string to be split!\n', ' With a tab']
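The exercise also asks about consecutive spaces and mixed tabs and spaces. A small sketch with a made-up string (`text` is chosen here for illustration) makes the difference between the two calls visible:

```python
text = "two  spaces,\ta tab,  \t both "

# No argument: any run of whitespace is one separator, no empty strings.
no_arg = text.split()

# Explicit ' ': every single space is a separator, tabs are left alone.
space_arg = text.split(' ')

# split() followed by ' '.join() normalizes all whitespace to single spaces.
normalized = ' '.join(text.split())

print(no_arg)
print(space_arg)
print(normalized)
```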

## 14.

☼ Create a variable `words` containing a list of words. Experiment with `words.sort()` and `sorted(words)`. What is the difference?

### Solution

* `words.sort()` sorts the `words` list in place (the original order is lost) and returns `None`
* `sorted(words)` returns a new sorted list and doesn't affect the original `words` list

In [8]:
words1 = ["these", "are", "a", "list", "of", "words"]
words2 = ["these", "are", "a", "list", "of", "words"]

In [9]:
words1.sort()

In [10]:
words1

['a', 'are', 'list', 'of', 'these', 'words']

In [11]:
sorted(words2)

['a', 'are', 'list', 'of', 'these', 'words']

In [12]:
words2

['these', 'are', 'a', 'list', 'of', 'words']

## 17.

☼ What happens when the formatting strings `%6s` and `%-6s` are used to display strings that are longer than six characters?

**Note:**

* this question uses the **old** Python `%` string formatting
* the equivalents in the newer `str.format` style are `{:>6}` and `{:6}` respectively

* [more information on old vs. new format](https://pyformat.info/)

### Solution

The width in a format specification is a minimum, not a maximum, so in each case the full string is printed with no additional whitespace and nothing is cut off.

In [69]:
long_word = "mississippi"
print(long_word)

mississippi


### Old Format

In [80]:
print('%6s' % (long_word))

mississippi


In [81]:
print('%-6s' % (long_word))

mississippi


### New Format

In [70]:
print("{:>6}".format(long_word))

mississippi


In [75]:
print("{:6}".format(long_word))

mississippi
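If the goal is to actually shorten a long string, both formatting styles accept a precision after a dot. The sketch below is an aside rather than part of the exercise; the variable names are chosen here.

```python
long_word = "mississippi"

# Width alone never shortens the string; a precision (".6") does.
old_style = '%.6s' % long_word
new_style = '{:.6}'.format(long_word)

# Width and precision together: truncate to 6 characters, then pad to 8.
padded = '%-8.6s|' % long_word

print(old_style, new_style, padded)
```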


## 18.

◑ Read in some text from a corpus, tokenize it, and print the list of all *wh*-word types that occur. (*wh*-words in English are used in questions, relative clauses and exclamations: *who*, *which*, *what*, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

### Solution

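* some 'wh' words start capitalized while others don't, so a word can show up as two different types, such as *while* and *While*
* filtering on the prefix `wh` also picks up words that are not *wh*-words at all, like *wholesale* and *wholeheartedly*
* the cell below shows a slice of the matching tokens; wrapping the list comprehension in `sorted(set(...))` gives the types in order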

In [103]:
sotu_tokens = nltk.corpus.state_union.words()

In [115]:
[s for s in sotu_tokens if s.lower().startswith('wh')][100:120]

['who',
 'wholesale',
 'where',
 'who',
 'who',
 'who',
 'which',
 'While',
 'when',
 'who',
 'which',
 'While',
 'which',
 'which',
 'which',
 'wholeheartedly',
 'which',
 'whole',
 'where',
 'which']
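To list types rather than tokens, the matches can be collected into a set and sorted. The sketch below runs on a short made-up token list instead of the State of the Union corpus, and the `WH_WORDS` set is an assumption about which words to count, not a list taken from the book.

```python
# Hypothetical tokens standing in for a tokenized corpus.
tokens = ['Who', 'said', 'that', '?', 'who', 'knows', 'which', 'one',
          ',', 'When', 'when', 'wholesale', 'What', 'white', 'why']

# A closed list of wh-words avoids false hits such as "wholesale".
WH_WORDS = {'who', 'whom', 'whose', 'which', 'what', 'when', 'where', 'why'}

# Types as they occur: case variants stay separate.
wh_types = sorted(set(t for t in tokens if t.lower() in WH_WORDS))
print(wh_types)

# Lowercasing first merges the case variants into a single type.
wh_normalized = sorted(set(t.lower() for t in tokens if t.lower() in WH_WORDS))
print(wh_normalized)
```

Capitalized forms sort before lowercase ones, so `Who` and `who` end up far apart in the first list even though they are the same word.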

## 19. 

◑ Create a file consisting of words and (made up) frequencies, where each line consists of a word, the space character, and a positive integer, e.g. `fuzzy 53`. Read the file into a Python list using `open(filename).readlines()`. Next, break each line into its two fields using `split()`, and convert the number into an integer using `int()`. The result should be a list of the form: `[['fuzzy', 53], ...]`.

* **Note:** It is a safer programming practice to open a document with a `with` statement, which closes the file automatically:

```python
with open(filename) as f:
    raw = f.readlines()
```

### Solution

* a solution file is provided in `data/sol/freqs.txt`
* after using `readlines()`, use a list comprehension and `split()` method to create the new list

In [33]:
file = 'data/sol/freqs.txt'

with open(file) as f:
    raw = f.readlines()

# split each line once, then convert the second field to an integer
[[word, int(freq)] for word, freq in (line.split() for line in raw)]

[['fuzzy', 53],
 ['cat', 9],
 ['triangle', 3],
 ['square', 4],
 ['octopus', 8],
 ['bob', 808]]

## 20.

◑ Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.

## 21.

◑ Write a function `unknown()` that takes a URL as its argument, and returns a list of unknown words that occur on that webpage. In order to do this, extract all substrings consisting of lowercase letters (using re.findall()) and remove any items from this set that occur in the Words Corpus (nltk.corpus.words). Try to categorize these words manually and discuss your findings.

## 22.

◑ Examine the results of processing the URL http://news.bbc.co.uk/ using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions that improve the extraction of text from this web page.

## 23.

◑ Are you able to write a regular expression to tokenize text in such a way that the word *don't* is tokenized into *do* and *n't*? Explain why this regular expression won't work: `«n't|\w+»`.

### Solution

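The regular expression engine scans the string from left to right and takes the first match it finds at each position. When it reaches the `d` in `don't`, the alternative `n't` cannot match there, so the *greedy* `\w+` matches `don` and consumes the `n`. By the time the engine reaches the apostrophe, `n't` can no longer match, and the remaining `t` is picked up by `\w+`.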

In [117]:
string = "I don't think this will work."

nltk.re_show(r"n't|\w+", string)

{I} {don}'{t} {think} {this} {will} {work}.


In [192]:
# naive solution
re.findall(r"n't|\w+", string)

['I', 'don', 't', 'think', 'this', 'will', 'work']

`n't` can be captured by checking for two substrings:
1. a word using a non-greedy search method: `(\w+?)`
2. the *optional* ending letters `n't`: `(n't)?`

Because a non-greedy approach is used, a `\b` must mark the end of the word; otherwise the regular expression would match the minimum number of characters needed to form a word, and would capture only single characters instead of whole words.

In [182]:
results = re.findall(r"(\w+?)(n't)?\b", string)

results

[('I', ''),
 ('do', "n't"),
 ('think', ''),
 ('this', ''),
 ('will', ''),
 ('work', '')]

The `re.findall` method will return a list of pairs (tuples). To convert it into a list of strings, we can unpack the list of tuples using a list comprehension:

In [178]:
[token for tup in results for token in tup if token] 

['I', 'do', "n't", 'think', 'this', 'will', 'work']
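An alternative that returns a flat list directly is to use a lookahead, so the stem stops just before `n't` without consuming it. This is a sketch of one possible pattern, not the book's solution:

```python
import re

string = "I don't think this will work."

# Try, in order: a word followed by n't (leaving n't unconsumed),
# the clitic n't itself, and finally any other word.
tokens = re.findall(r"\w+(?=n't)|n't|\w+", string)
print(tokens)
```

Because the pattern contains no capturing groups, `re.findall` returns plain strings and no unpacking step is needed.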

## 24.

◑ Try to write code to convert text into *hAck3r*, using regular expressions and substitution, where `e → 3`, `i → 1`, `o → 0`, `l → |`, `s → 5`, `. → 5w33t!`, `ate → 8`. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map s to two different values: `$` for word-initial `s`, and `5` for word-internal `s`.

## 25.

◑ *Pig Latin* is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append *ay*, e.g. *string → ingstray*, *idle → idleay*. http://en.wikipedia.org/wiki/Pig_Latin

1. Write a function to convert a word to Pig Latin.
2. Write code that converts text, instead of individual words.
3. Extend it further to preserve capitalization, to keep `qu` together (i.e. so that `quiet` becomes `ietquay`), and to detect when y is used as a consonant (e.g. `yellow`) vs a vowel (e.g. `style`).
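A minimal sketch of parts 1 and 2 follows. It assumes that only `a, e, i, o, u` count as vowels and it lowercases every word, so the extensions in part 3 (capitalization, `qu`, and `y`) are not handled. The function names are chosen here for illustration.

```python
import re

def pig_latin_word(word):
    """Move the leading consonant cluster to the end and append 'ay'."""
    # Group 1 is the (possibly empty) run of non-vowels, group 2 the rest.
    match = re.match(r'([^aeiou]*)(.*)', word.lower())
    onset, rest = match.groups()
    return rest + onset + 'ay'

def pig_latin_text(text):
    """Convert each run of letters in the text, leaving the rest untouched."""
    return re.sub(r'[A-Za-z]+', lambda m: pig_latin_word(m.group()), text)

print(pig_latin_word('string'))
print(pig_latin_word('idle'))
print(pig_latin_text('the string is idle.'))
```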



## 26. 

◑ Download some text from a language that has vowel harmony (e.g. Hungarian), extract the vowel sequences of words, and create a vowel bigram table.

## 27.

◑ Python's `random` module includes a function `choice()` which randomly chooses an item from a sequence, e.g. `choice("aehh ")` will produce one of four possible characters, with the letter `h` being twice as frequent as the others. Write a generator expression that produces a sequence of 500 randomly chosen letters drawn from the string `"aehh "`, and put this expression inside a call to the `''.join()` function, to concatenate them into one long string. You should get a result that looks like uncontrolled sneezing or maniacal laughter: `he  haha ee  heheeh eha`. Use `split()` and `join()` again to normalize the whitespace in this string.

In [40]:
import random

random.choice("aehh ")

'e'
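The remaining steps of the exercise could be sketched as follows (the variable names `laugh` and `normalized` are chosen here; the output differs on every run because the letters are random):

```python
import random

# A generator expression feeds 500 random characters to ''.join().
laugh = ''.join(random.choice("aehh ") for _ in range(500))

# split() with no argument drops runs of spaces; ' '.join() puts one back.
normalized = ' '.join(laugh.split())

print(normalized[:60])
```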

## 28.

◑ Consider the numeric expressions in the following sentence from the MedLine Corpus: *The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.* Should we say that the numeric expression *4.53 +/- 0.15%* is three words? Or should we say that it's a single compound word? Or should we say that it is actually *nine* words, since it's read "four point five three, plus or minus zero point fifteen percent"? Or should we say that it's not a "real" word at all, since it wouldn't appear in any dictionary? Discuss these different possibilities. Can you think of application domains that motivate at least two of these answers?

## 29.

◑ Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define μ<sub>w</sub> to be the average number of letters per word, and μ<sub>s</sub> to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: 4.71 μ<sub>w</sub> + 0.5 μ<sub>s</sub> - 21.43. Compute the ARI score for various sections of the Brown Corpus, including section `f` (lore) and `j` (learned). Make use of the fact that `nltk.corpus.brown.words()` produces a sequence of words, while `nltk.corpus.brown.sents()` produces a sequence of sentences.
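The formula translates directly into a small function. The sketch below is tested on made-up toy data shaped like the output of `brown.words()` and `brown.sents()`; the function name `ari` is chosen here, and punctuation tokens are counted as words, which is a simplifying assumption.

```python
def ari(words, sents):
    """Automated Readability Index from a word list and a sentence list."""
    mu_w = sum(len(w) for w in words) / len(words)   # average letters per word
    mu_s = len(words) / len(sents)                   # average words per sentence
    return 4.71 * mu_w + 0.5 * mu_s - 21.43

# Hypothetical toy data in the same shape as brown.words() / brown.sents().
toy_sents = [['The', 'dog', 'barked'], ['It', 'ran', 'away', 'fast']]
toy_words = [w for sent in toy_sents for w in sent]

print(round(ari(toy_words, toy_sents), 2))
```

With the Brown Corpus installed, the same function could be called as `ari(nltk.corpus.brown.words(categories='lore'), nltk.corpus.brown.sents(categories='lore'))`, and likewise for `'learned'`.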

## 30.

◑ Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.

### Solution

* the third movie review in the `movie_reviews` corpus will be used

In [22]:
from nltk.corpus import movie_reviews

tokens = movie_reviews.words(movie_reviews.fileids()[2])

In [4]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

In [27]:
porter_list = [porter.stem(t) for t in tokens][0:20]

In [28]:
lancaster_list = [lancaster.stem(t) for t in tokens][0:20]

In [29]:
for i in range(20):
    if porter_list[i] != lancaster_list[i]:
        print(porter_list[i], "vs.", lancaster_list[i])

movi vs. movy
like vs. lik
these vs. thes
make vs. mak
jade vs. jad
movi vs. movy
viewer vs. view
invent vs. inv


## 31.

◑ Define the variable saying to contain the list `['After', 'all', 'is', 'said', 'and', 'done', ',', 'more',
'is', 'said', 'than', 'done', '.']`. Process this list using a `for` loop, and store the length of each word in a new list `lengths`. 

* **Hint:** begin by assigning the empty list to lengths, using `lengths = []`. Then each time through the loop, use `append()` to add another length value to the list. 

Now do the same thing using a list comprehension.

In [42]:
exp = ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']

lengths = []
for word in exp:
    lengths.append(len(word))
    
lengths

[5, 3, 2, 4, 3, 4, 1, 4, 2, 4, 4, 4, 1]

In [43]:
[len(word) for word in exp]

[5, 3, 2, 4, 3, 4, 1, 4, 2, 4, 4, 4, 1]

## 32.

◑ Define a variable `silly` to contain the string: `'newly formed bland ideas are inexpressible in an infuriating way'`. (This happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous nonsense phrase, colorless green ideas sleep furiously according to Wikipedia). Now write code to perform the following tasks:


1. Split `silly` into a list of strings, one per word, using Python's `split()` operation, and save this to a variable called `bland`.
2. Extract the second letter of each word in `silly` and join them into a string, to get `'eoldrnnnna'`.
3. Combine the words in bland back into a single string, using join(). Make sure the words in the resulting string are separated with whitespace.
4. Print the words of silly in alphabetical order, one per line.


In [44]:
silly = 'newly formed bland ideas are inexpressible in an infuriating way'

In [46]:
#1
bland = silly.split()

In [50]:
#2
''.join([word[1] for word in bland])

'eoldrnnnna'

In [51]:
#3 
' '.join(bland)

'newly formed bland ideas are inexpressible in an infuriating way'

In [55]:
#4
for word in sorted(bland):
    print(word)

an
are
bland
formed
ideas
in
inexpressible
infuriating
newly
way


## 33.

◑ The `index()` function can be used to look up items in sequences. For example, `'inexpressible'.index('e')` tells us the index of the first position of the letter e.


1. What happens when you look up a substring, e.g. `'inexpressible'.index('re')`?
2. Define a variable words containing a list of words. Now use `words.index()` to look up the position of an individual word.
3. Define a variable `silly` as in the exercise above. Use the `index()` function in combination with list slicing to build a list phrase consisting of all the words up to (but not including) `in` in `silly`.

### Solution

In [56]:
'inexpressible'.index('e')

2

When looking up a substring, the `.index` function finds the first occurrence of that substring and returns the index of its first character.

In [60]:
#1
'inexpressible'.index('re')

5

In [61]:
#1 (continued)
'rly inexpressible'.index('re')

9

In [63]:
#2
words = "these are a bunch of words".split()
words.index("are")

1

In [65]:
#3
silly = 'newly formed bland ideas are inexpressible in an infuriating way'

bland = silly.split()
phrase = bland[:bland.index('in')]
phrase

['newly', 'formed', 'bland', 'ideas', 'are', 'inexpressible']

## 34.

◑ Write code to convert nationality adjectives like Canadian and Australian to their corresponding nouns Canada and Australia (see http://en.wikipedia.org/wiki/List_of_adjectival_forms_of_place_names).

## 35.

◑ Read the LanguageLog post on phrases of the form as best *as p can and as best p can*, where *p* is a pronoun. Investigate this phenomenon with the help of a corpus and the `findall()` method for searching tokenized text described in [3.5](https://www.nltk.org/book/ch03.html#sec-useful-applications-of-regular-expressions). http://itre.cis.upenn.edu/~myl/languagelog/archives/002733.html

## 36.

◑ Study the *lolcat* version of the book of Genesis, accessible as `nltk.corpus.genesis.words('lolcat.txt')`, and the rules for converting text into *lolspeak* at http://www.lolcatbible.com/index.php?title=How_to_speak_lolcat. Define regular expressions to convert English words into corresponding lolspeak words.

## 37.

◑ Read about the `re.sub()` function for string substitution using regular expressions, using `help(re.sub)` and by consulting the further readings for this chapter. Use `re.sub` in writing code to remove HTML tags from an HTML file, and to normalize whitespace.
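A rough sketch of this idea, working on an HTML string rather than a file, is shown below. The function name `strip_html` and the sample markup are made up for illustration, and the tag pattern is deliberately naive: it does not handle comments, entities such as `&amp;`, or the contents of `<script>` and `<style>` elements.

```python
import re

def strip_html(html):
    """Remove tags with re.sub, then collapse runs of whitespace."""
    text = re.sub(r'<[^>]+>', ' ', html)   # replace each tag with a space
    text = re.sub(r'\s+', ' ', text)       # normalize whitespace
    return text.strip()

sample = "<html><body><h1>Title</h1>\n<p>Some   <b>bold</b> text.</p></body></html>"
print(strip_html(sample))
```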

## 38.

★ An interesting challenge for tokenization is words that have been split across a line-break. E.g. if *long-term* is split, then we have the string `long-\nterm`.

1. Write a regular expression that identifies words that are hyphenated at a line-break. The expression will need to include the `\n` character.
2. Use `re.sub()` to remove the `\n` character from these words.
3. How might you identify words that should not remain hyphenated once the newline is removed, e.g. `'encyclo-\npedia'`?
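One possible approach to all three parts is sketched below. The sample `text` and the tiny `vocabulary` set are made up; in practice the vocabulary could come from a word list such as the Words Corpus. For part 3, the assumption is that a hyphen should be dropped when the two halves joined together form a known word.

```python
import re

text = "a long-\nterm plan in the encyclo-\npedia, line one\nline two"

# Part 1: a word, a hyphen, a newline, then another word.
linebreak_hyphen = re.compile(r'(\w+)-\n(\w+)')
found = linebreak_hyphen.findall(text)

# Part 2: drop only the newline, keeping the hyphen.
joined = linebreak_hyphen.sub(r'\1-\2', text)

# Part 3: drop the hyphen too when the merged form is a known word.
vocabulary = {'encyclopedia', 'long', 'term'}  # hypothetical word list

def rejoin(match):
    first, second = match.groups()
    merged = first + second
    return merged if merged.lower() in vocabulary else first + '-' + second

smart = linebreak_hyphen.sub(rejoin, text)

print(found)
print(joined)
print(smart)
```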



## 39.

★ Read the Wikipedia entry on [*Soundex*](https://en.wikipedia.org/wiki/Soundex). Implement this algorithm in Python.

## 40.

★ Obtain raw texts from two or more genres and compute their respective reading difficulty scores as in the earlier exercise on reading difficulty. E.g. compare ABC Rural News and ABC Science News (`nltk.corpus.abc`). Use Punkt to perform sentence segmentation.

## 41.

★ Rewrite the following nested loop as a nested list comprehension:

```python
words = ['attribution', 'confabulation', 'elocution', 'sequoia', 'tenacious', 'unidirectional']

vsequences = set()
for word in words:
    vowels = []
    for char in word:
        if char in 'aeiou':
            vowels.append(char)
    vsequences.add(''.join(vowels))

sorted(vsequences)
```

**Output:**

`['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']`

#### Solution

* this code collects the sequence of vowels in each word into a set and returns the sequences in sorted order
* this solution will be done in parts

**1.** Create a list of the words with a list comprehension. This will be our starting point:

In [50]:
[w for w in words]

['attribution',
 'confabulation',
 'elocution',
 'sequoia',
 'tenacious',
 'unidirectional']

**2.** Create a list of letters for each word. This will require us to use a nested list comprehension:

In [54]:
[[letter for letter in w] for w in words]

[['a', 't', 't', 'r', 'i', 'b', 'u', 't', 'i', 'o', 'n'],
 ['c', 'o', 'n', 'f', 'a', 'b', 'u', 'l', 'a', 't', 'i', 'o', 'n'],
 ['e', 'l', 'o', 'c', 'u', 't', 'i', 'o', 'n'],
 ['s', 'e', 'q', 'u', 'o', 'i', 'a'],
 ['t', 'e', 'n', 'a', 'c', 'i', 'o', 'u', 's'],
 ['u', 'n', 'i', 'd', 'i', 'r', 'e', 'c', 't', 'i', 'o', 'n', 'a', 'l']]

**3.** Filter out any non-vowels by creating the conditional statement `if letter in "aeiou"` within the nested list comprehension:

In [55]:
[[letter for letter in w if letter in "aeiou"] for w in words]

[['a', 'i', 'u', 'i', 'o'],
 ['o', 'a', 'u', 'a', 'i', 'o'],
 ['e', 'o', 'u', 'i', 'o'],
 ['e', 'u', 'o', 'i', 'a'],
 ['e', 'a', 'i', 'o', 'u'],
 ['u', 'i', 'i', 'e', 'i', 'o', 'a']]

**4.** Use the `''.join()` method on the nested list comprehension to convert each list of vowels into a string:

In [63]:
[''.join([letter for letter in w if letter in "aeiou"]) for w in words]

['aiuio', 'oauaio', 'eouio', 'euoia', 'eaiou', 'uiieioa']

**5.** Convert the entire list into a set to remove duplicates. A set has no guaranteed order, so wrap it in `sorted()` to get the same alphabetical list as the original loop:

In [64]:
sorted(set([''.join([letter for letter in w if letter in "aeiou"]) for w in words]))

['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']
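For comparison, the whole loop can be condensed into a set comprehension with a generator expression inside, wrapped in `sorted()` since a set by itself has no guaranteed order. This is one compact way to write it, offered as a sketch:

```python
words = ['attribution', 'confabulation', 'elocution', 'sequoia', 'tenacious', 'unidirectional']

# Set comprehension on the outside, generator expression on the inside.
vsequences = sorted({''.join(c for c in w if c in 'aeiou') for w in words})
print(vsequences)
```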

## 42.

★ Use WordNet to create a semantic index for a text collection. Extend the concordance search program in [3.6](https://www.nltk.org/book/ch03.html#code-stemmer-indexing), indexing each word using the offset of its first synset, e.g. `wn.synsets('dog')[0].offset` (and optionally the offset of some of its ancestors in the hypernym hierarchy).

## 43.

★ With the help of a multilingual corpus such as the Universal Declaration of Human Rights Corpus (`nltk.corpus.udhr`), and NLTK's frequency distribution and rank correlation functionality (`nltk.FreqDist`, `nltk.spearman_correlation`), develop a system that guesses the language of a previously unseen text. For simplicity, work with a single character encoding and just a few languages.

## 44.

★ Write a program that processes a text and discovers cases where a word has been used with a novel sense. For each word, compute the WordNet similarity between all synsets of the word and all synsets of the words in its context. (Note that this is a crude approach; doing it well is a difficult, open research problem.)

## 45.

★ Read the article on normalization of non-standard words (Sproat et al, 2001), and implement a similar system for text normalization.