# 4. [Writing Structured Programs](https://www.nltk.org/book/ch04.html) - Notes

* [NLTK-Book-Resource Repository](https://github.com/BetoBob/NLTK-Book-Resource)
* [NLTK-Book-Resource Table of Contents](https://github.com/BetoBob/NLTK-Book-Resource#table-of-contents)

Run the cell below before running the examples.

In [84]:
import nltk, pprint, re

from nltk import word_tokenize

## 4.1 - Back to the Basics

### Assignment

In [None]:
foo = 'Monty'
bar = foo 
foo = 'Python'

bar

In [None]:
foo = ['Monty', 'Python']
bar = foo
foo[1] = 'Bodkin'

bar

In [None]:
empty = []
nested = [empty, empty, empty]

nested

In [None]:
nested[1].append('Python')

nested
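
The cells above show that assignment and list construction copy references, not objects. As a minimal sketch of how to avoid the sharing, the example below builds independent inner lists with a comprehension and contrasts a shallow slice copy with `copy.deepcopy`. The variable names `shared`, `independent`, `shallow`, and `deep` are illustrative and do not come from the book.

```python
import copy

empty = []
shared = [empty, empty, empty]          # three references to one list object
independent = [[] for _ in range(3)]    # three distinct list objects

shared[1].append('Python')
independent[1].append('Python')

foo = ['Monty', ['Python']]
shallow = foo[:]              # new outer list, inner list is still shared
deep = copy.deepcopy(foo)     # inner list is copied as well
foo[1].append('Bodkin')

print(shared)        # [['Python'], ['Python'], ['Python']]
print(independent)   # [[], ['Python'], []]
print(shallow[1])    # ['Python', 'Bodkin']
print(deep[1])       # ['Python']
```

A slice copy is enough when the list holds only immutable items such as strings; nested mutable items need a deep copy.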

**Your Turn:** Use multiplication to create a list of lists: `nested = [[]] * 3`. Now modify one of the elements of the list, and observe that all the elements are changed. Use Python's `id()` function to find out the numerical identifier for any object, and verify that `id(nested[0])`, `id(nested[1])`, and `id(nested[2])` are all the same.

In [None]:
nested = [[]] * 3

### Equality

In [None]:
size = 5
python = ['Python']
snake_nest = [python] * size

In [None]:
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

In [None]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

In [None]:
import random
position = random.choice(range(size))
snake_nest[position] = ['Python']
snake_nest 

In [None]:
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

In [None]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

In [None]:
[id(snake) for snake in snake_nest]

### Conditionals

In [None]:
mixed = ['cat', '', ['dog'], []]

for element in mixed:
    if element:
        print(element)

In [None]:
animals = ['cat', 'dog']

if 'cat' in animals:
    print(1)
elif 'dog' in animals:
    print(2)

In [None]:
sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']

In [None]:
all(len(w) > 4 for w in sent)

In [None]:
any(len(w) > 4 for w in sent)

## 4.2 - Sequences

In [None]:
t = 'walk', 'fem', 3
t

In [None]:
t[0]

In [None]:
t[1:]

In [None]:
len(t)

In [None]:
raw = 'I turned off the spectroroute'
text = ['I', 'turned', 'off', 'the', 'spectroroute']
pair = (6, 'turned')

In [None]:
raw[2], text[3], pair[1]

In [None]:
raw[-3:], text[-3:], pair[-3:]

In [None]:
len(raw), len(text), len(pair)

### Operating on Sequence Types

In [None]:
raw = 'Red lorry, yellow lorry, red lorry, yellow lorry.'
text = word_tokenize(raw)
fdist = nltk.FreqDist(text)

In [None]:
sorted(fdist)

In [None]:
for key in fdist:
    print(key + ':', fdist[key], end='; ')

In [None]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']

words[2], words[3], words[4] = words[3], words[4], words[2]

words

In [None]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']

tmp = words[2]
words[2] = words[3]
words[3] = words[4]
words[4] = tmp

words

#### zip / enumerate

In [None]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']
tags = ['noun', 'verb', 'prep', 'det', 'noun']

In [None]:
zip(words, tags)

In [None]:
list(zip(words, tags))

In [None]:
list(enumerate(words))

#### splitting data

In [None]:
text = nltk.corpus.nps_chat.words()
cut = int(0.9 * len(text))
training_data, test_data = text[:cut], text[cut:]

In [None]:
text == training_data + test_data

In [None]:
len(training_data) / len(test_data)
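
The slicing pattern above can be wrapped in a small function so the split ratio is explicit. This is only a sketch: `split_data` and its `train_fraction` parameter are hypothetical names, and a generated token list stands in for the corpus so the cell runs without downloading data.

```python
def split_data(items, train_fraction=0.9):
    """Hypothetical helper: split a sequence into (training, test) slices."""
    cut = int(train_fraction * len(items))
    return items[:cut], items[cut:]

# Stand-in for a real corpus such as nltk.corpus.nps_chat.words()
tokens = ['tok%d' % i for i in range(50)]

training_data, test_data = split_data(tokens)

print(len(training_data), len(test_data))       # 45 5
print(tokens == training_data + test_data)      # True
```

Because the two slices share the same cut point, no item is lost or duplicated.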

### Combining Different Sequence Types

In [None]:
words = 'I turned off the spectroroute'.split()
wordlens = [(len(word), word) for word in words]
wordlens.sort()

In [None]:
' '.join(w for (_, w) in wordlens)

In [None]:
lexicon = [
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f'])
]
lexicon.sort()

In [None]:
lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])

In [None]:
del lexicon[0]

In [None]:
lexicon

**Your Turn:** Convert `lexicon` to a tuple, using `lexicon = tuple(lexicon)`, then try each of the above operations, to confirm that none of them is permitted on tuples.

In [None]:
lexicon = [
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f'])
]

### Generator Expressions

In [None]:
text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean - neither more nor less."'''

In [None]:
[w.lower() for w in word_tokenize(text)]

In [None]:
max([w.lower() for w in word_tokenize(text)]) 

In [None]:
max(w.lower() for w in word_tokenize(text))
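
The last two cells return the same result, but the list comprehension builds the whole list first while the generator expression feeds items to `max()` one at a time. A small sketch of one practical consequence, using a hand-written word list instead of `word_tokenize` so it runs without NLTK data: a generator can be consumed only once.

```python
words = ['When', 'I', 'use', 'a', 'word']

as_list = [w.lower() for w in words]   # every result is stored in memory
as_gen = (w.lower() for w in words)    # results are produced on demand

first_max = max(as_gen)                # this consumes the generator
leftover = list(as_gen)                # nothing is left the second time

print(max(as_list))   # word
print(first_max)      # word
print(leftover)       # []
```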

## 4.3 - Questions of Style

### Python Coding Style

In [None]:
syllables = []

In [None]:
# before (too many characters)

if (len(syllables) > 4 and len(syllables[2]) == 3 and syllables[2][2] in 'aeiou' and syllables[2][3] == syllables[1][3]):
    process(syllables)

In [None]:
# after

if (len(syllables) > 4 and len(syllables[2]) == 3 and
    syllables[2][2] in 'aeiou' and syllables[2][3] == syllables[1][3]):
    process(syllables)

### Procedural vs Declarative Style

#### Average Word Lengths Example

In [None]:
tokens = nltk.corpus.brown.words(categories='news')

In [None]:
# Procedural Style

count = 0
total = 0
for token in tokens:
    count += 1
    total += len(token)
    
total / count

In [None]:
# Declarative Style

total = sum(len(t) for t in tokens)

print(total / len(tokens))

#### Unique Words Example

**Note:** A smaller set of tokens is used in this example because the procedural version takes a long time to run on the Brown corpus tokens.

In [None]:
text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean - neither more nor less."'''

tokens = word_tokenize(text)

In [None]:
# Procedural Style

word_list = []
i = 0
while i < len(tokens):
    j = 0
    while j < len(word_list) and word_list[j] <= tokens[i]:
        j += 1
    if j == 0 or tokens[i] != word_list[j-1]:
        word_list.insert(j, tokens[i])
    i += 1

word_list

In [None]:
# Declarative Style

word_list = sorted(set(tokens))

word_list

#### Word Frequency Example

In [None]:
fd = nltk.FreqDist(nltk.corpus.brown.words())
cumulative = 0.0
most_common_words = [word for (word, count) in fd.most_common()]
for rank, word in enumerate(most_common_words):
    cumulative += fd.freq(word)
    print("%3d %6.2f%% %s" % (rank + 1, cumulative * 100, word))
    if cumulative > 0.25:
        break

#### Longest Length Word Example

In [None]:
# Procedural Style

text = nltk.corpus.gutenberg.words('milton-paradise.txt')
longest = ''
for word in text:
    if len(word) > len(longest):
        longest = word
        
longest

In [None]:
# Declarative Style

maxlen = max(len(word) for word in text)

[word for word in text if len(word) == maxlen]

### Some Legitimate Uses for Counters

In [None]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
n = 3
[sent[i:i+n] for i in range(len(sent)-n+1)]

In [None]:
m, n = 3, 7
array = [[set() for i in range(n)] for j in range(m)]
array[2][5].add('Alice')

pprint.pprint(array)

In [None]:
array = [[set()] * n] * m
array[2][5].add(7)
pprint.pprint(array)
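
The second construction looks equivalent to the nested comprehension, but list multiplication copies references, so every cell is the same `set`. The sketch below makes that visible by counting distinct objects with `id()`. The helper name `distinct_cells` and the names `good` and `bad` are illustrative only.

```python
m, n = 3, 7

good = [[set() for i in range(n)] for j in range(m)]   # m * n separate sets
bad = [[set()] * n] * m                                # one set, referenced m * n times

def distinct_cells(grid):
    """Count how many different set objects the grid actually holds."""
    return len({id(cell) for row in grid for cell in row})

bad[2][5].add(7)

print(distinct_cells(good))   # 21
print(distinct_cells(bad))    # 1
print(bad[0][0])              # {7}
```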

## 4.4 - Functions: The Foundation of Structured Programming

In [85]:
def get_text(file):
    """Read text from a file, normalizing whitespace and stripping HTML markup."""
    text = open(file).read()
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

In [86]:
help(get_text)

Help on function get_text in module __main__:

get_text(file)
    Read text from a file, normalizing whitespace and stripping HTML markup.



### Function Inputs and Outputs

In [89]:
def repeat(msg, num):
    return ' '.join([msg] * num)

monty = 'Monty Python'

repeat(monty, 3)

'Monty Python Monty Python Monty Python'

In [90]:
def monty():
    return "Monty Python"

monty()

'Monty Python'

In [91]:
repeat(monty(), 3)

'Monty Python Monty Python Monty Python'

In [92]:
repeat('Monty Python', 3)

'Monty Python Monty Python Monty Python'

In [101]:
def my_sort1(mylist):      # good: modifies its argument, no return value
    mylist.sort()

def my_sort2(mylist):      # good: doesn't touch its argument, returns value
    return sorted(mylist)

def my_sort3(mylist):      # bad: modifies its argument and also returns it
    mylist.sort()
    return mylist
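
To see why the comments label the third variant as bad, it helps to call all three on the same input. The sketch below repeats the definitions so it is self-contained; the names `data1`, `data2`, `data3` and the `result` variables are illustrative.

```python
def my_sort1(mylist):      # modifies its argument, no return value
    mylist.sort()

def my_sort2(mylist):      # leaves its argument alone, returns a new list
    return sorted(mylist)

def my_sort3(mylist):      # modifies its argument and also returns it
    mylist.sort()
    return mylist

data1 = [3, 1, 2]
result1 = my_sort1(data1)

data2 = [3, 1, 2]
result2 = my_sort2(data2)

data3 = [3, 1, 2]
result3 = my_sort3(data3)

print(result1, data1)      # None [1, 2, 3]
print(result2, data2)      # [1, 2, 3] [3, 1, 2]
print(result3 is data3)    # True
```

With `my_sort3`, a caller who sees a return value may assume the input was left untouched, when in fact the result and the input are the same object.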

### Parameter Passing

In [103]:
# call-by-value: the value passed is a reference to the object

def set_up(word, properties):
    word = 'lolcat'
    properties.append('noun')
    properties = 5

w = ''
p = []
set_up(w, p)

In [104]:
w

''

In [105]:
p

['noun']

In [106]:
w = ''
word = w
word = 'lolcat'
w

''

In [107]:
p = []
properties = p
properties.append('noun')
properties = 5
p

['noun']

### Checking Parameter Types

In [108]:
def tag(word):
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

In [109]:
tag('the')

'det'

In [110]:
tag('knight')

'noun'

In [111]:
tag(["'Tis", 'but', 'a', 'scratch'])

'noun'

In [112]:
def tag(word):
    assert isinstance(word, str), "argument to tag() must be a string"
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'
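
A short sketch of what the assertion buys: calling the guarded function with a list now fails immediately instead of silently returning `'noun'`. The definition is repeated here, using Python 3's `str`, so the cell is self-contained; the variable `message` is illustrative.

```python
def tag(word):
    assert isinstance(word, str), "argument to tag() must be a string"
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

message = None
try:
    tag(["'Tis", 'but', 'a', 'scratch'])
except AssertionError as err:
    message = str(err)

print(tag('the'))   # det
print(message)      # argument to tag() must be a string
```

Note that `assert` statements are skipped when Python runs with the `-O` option, so they suit development-time checks rather than validation of untrusted input.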

### Functional Decomposition

**Note:** The frequency distribution will differ from the book because the webpage has likely changed.

In [147]:
from urllib import request
from bs4 import BeautifulSoup

def freq_words(url, freqdist, n):
    html = request.urlopen(url).read().decode('utf8')
    raw = BeautifulSoup(html, 'html.parser').get_text()
    for word in word_tokenize(raw):
        freqdist[word.lower()] += 1
    result = []
    for word, count in freqdist.most_common(n):
        result = result + [word]
    print(result)

In [148]:
constitution = "http://www.archives.gov/exhibits/charters/constitution_transcript.html"
fd = nltk.FreqDist()

freq_words(constitution, fd, 30)

["''", ',', ':', ':1', ';', 'the', '(', ')', '``', '{', '}', 'of', '?', 'url', 'https', '@', 'import', 'qjv5uu', "'", 'archives', '#', 'and', '.', '[', ']', 'national', 'a', 'documents', 'constitution', 'founding']


In [149]:
from urllib import request
from bs4 import BeautifulSoup

def freq_words(url, n):
    html = request.urlopen(url).read().decode('utf8')
    text = BeautifulSoup(html, 'html.parser').get_text()
    freqdist = nltk.FreqDist(word.lower() for word in word_tokenize(text))
    return [word for (word, _) in freqdist.most_common(n)]

In [150]:
freq_words(constitution, 30)

["''",
 ',',
 ':',
 ':1',
 ';',
 'the',
 '(',
 ')',
 '``',
 '{',
 '}',
 'of',
 '?',
 'url',
 'https',
 '@',
 'import',
 'qjv5uu',
 "'",
 'archives',
 '#',
 'and',
 '.',
 '[',
 ']',
 'national',
 'a',
 'documents',
 'constitution',
 'founding']
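
Because the live page changes over time, the counting step is easier to check when it is separated from the download. The sketch below is an offline variant under stated assumptions: `freq_words_from_text` is a hypothetical name, and `collections.Counter` with a simple regular expression stands in for `nltk.FreqDist` and `word_tokenize` so it runs without network access or NLTK data.

```python
import re
from collections import Counter

def freq_words_from_text(text, n):
    """Hypothetical offline variant: return the n most frequent lowercased words."""
    # Simplified tokenizer (an assumption): runs of letters and apostrophes only
    tokens = re.findall(r"[a-z']+", text.lower())
    return [word for (word, _) in Counter(tokens).most_common(n)]

sample = "We the People of the United States, in Order to form a more perfect Union"

print(freq_words_from_text(sample, 1))   # ['the']
```

Like the refactored `freq_words`, this version builds its own counter and returns a value, so it has no side effects on the caller's data.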

### Documenting Functions

In [151]:
def accuracy(reference, test):
    """
    Calculate the fraction of test items that equal the corresponding reference items.

    Given a list of reference values and a corresponding list of test values,
    return the fraction of corresponding values that are equal.
    In particular, return the fraction of indexes
    {0<=i<len(test)} such that C{test[i] == reference[i]}.

        >>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ'])
        0.5

    :param reference: An ordered list of reference values
    :type reference: list
    :param test: A list of values to compare against the corresponding
        reference values
    :type test: list
    :return: the accuracy score
    :rtype: float
    :raises ValueError: If reference and test do not have the same length
    """

    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    num_correct = 0
    for x, y in zip(reference, test):
        if x == y:
            num_correct += 1
    return float(num_correct) / len(reference)

In [152]:
help(accuracy)

Help on function accuracy in module __main__:

accuracy(reference, test)
    Calculate the fraction of test items that equal the corresponding reference items.
    
    Given a list of reference values and a corresponding list of test values,
    return the fraction of corresponding values that are equal.
    In particular, return the fraction of indexes
    {0<=i<len(test)} such that C{test[i] == reference[i]}.
    
        >>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ'])
        0.5
    
    :param reference: An ordered list of reference values
    :type reference: list
    :param test: A list of values to compare against the corresponding
        reference values
    :type test: list
    :return: the accuracy score
    :rtype: float
    :raises ValueError: If reference and test do not have the same length
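
The `>>>` example inside the docstring is more than documentation: the standard `doctest` module can execute it and compare the result with the expected output. The following is a minimal sketch with a shortened docstring. It drives `doctest.DocTestParser` and `doctest.DocTestRunner` directly, which is one of several ways to run doctests; the variable names `parser`, `example`, and `results` are illustrative.

```python
import doctest

def accuracy(reference, test):
    """
    Return the fraction of corresponding values that are equal.

        >>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ'])
        0.5
    """
    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    num_correct = sum(1 for x, y in zip(reference, test) if x == y)
    return num_correct / len(reference)

# Extract the examples from the docstring and run them
parser = doctest.DocTestParser()
example = parser.get_doctest(accuracy.__doc__, {'accuracy': accuracy},
                             'accuracy', None, 0)
results = doctest.DocTestRunner(verbose=False).run(example)

print(results.failed, results.attempted)   # 0 1
```

If the function's behaviour drifts from the documented example, `results.failed` becomes non-zero and the mismatch is printed.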



## 4.5 - Doing More with Functions

## 4.6 - Program Development

## 4.7 - Algorithm Design

## 4.8 - A Sample of Python Libraries

## Your Turn Solutions

### 4.1

**Your Turn:** Use multiplication to create a list of lists: `nested = [[]] * 3`. Now modify one of the elements of the list, and observe that all the elements are changed. Use Python's `id()` function to find out the numerical identifier for any object, and verify that `id(nested[0])`, `id(nested[1])`, and `id(nested[2])` are all the same.

In [None]:
nested = [[]] * 3

In [None]:
nested[0].append('test')

In [None]:
nested

In [None]:
id(nested[0]) == id(nested[1]) == id(nested[2])

### 4.2

**Your Turn:** Convert `lexicon` to a tuple, using `lexicon = tuple(lexicon)`, then try each of the above operations, to confirm that none of them is permitted on tuples.

In [None]:
lexicon = [
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f'])
]

lexicon = tuple(lexicon)

In [None]:
lexicon

In [None]:
type(lexicon)

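Each of these operations fails on a `tuple` because tuples are immutable: `sort()` raises an `AttributeError` (tuples have no such method), while item assignment and `del` each raise a `TypeError`: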

In [None]:
lexicon.sort()

In [None]:
lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])

In [None]:
del lexicon[0]
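
The three failing cells can also be run as a single cell that catches each exception and records its type. This is only a sketch; the `errors` list and the `ordered` variable are illustrative. It also shows that `sorted()` still works, because it returns a new list rather than changing the tuple.

```python
lexicon = (
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f']),
)

errors = []

try:
    lexicon.sort()                 # tuples have no sort() method
except AttributeError as err:
    errors.append(type(err).__name__)

try:
    lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])   # no item assignment
except TypeError as err:
    errors.append(type(err).__name__)

try:
    del lexicon[0]                 # no item deletion
except TypeError as err:
    errors.append(type(err).__name__)

ordered = sorted(lexicon)          # builds a new list, tuple is unchanged

print(errors)          # ['AttributeError', 'TypeError', 'TypeError']
print(ordered[0][0])   # off
```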