# 4. [Writing Structured Programs](https://www.nltk.org/book/ch04.html) - Notes

* [NLTK-Book-Resource Repository](https://github.com/BetoBob/NLTK-Book-Resource)
* [NLTK-Book-Resource Table of Contents](https://github.com/BetoBob/NLTK-Book-Resource#table-of-contents)

Run the cell below first; the later examples depend on these imports.

In [126]:
import nltk, pprint

from nltk import word_tokenize

## 4.1 - Back to the Basics

### Assignment

In [2]:
foo = 'Monty'
bar = foo 
foo = 'Python'

bar

'Monty'

In [7]:
foo = ['Monty', 'Python']
bar = foo
foo[1] = 'Bodkin'

bar

['Monty', 'Bodkin']

In [12]:
empty = []
nested = [empty, empty, empty]

nested

[[], [], []]

In [13]:
nested[1].append('Python')

nested

[['Python'], ['Python'], ['Python']]

**Your Turn:** Use multiplication to create a list of lists: `nested = [[]] * 3`. Now modify one of the elements of the list, and observe that all the elements are changed. Use Python's `id()` function to find out the numerical identifier for any object, and verify that `id(nested[0])`, `id(nested[1])`, and `id(nested[2])` are all the same.

In [20]:
nested = [[]] * 3

### Equality

In [30]:
size = 5
python = ['Python']
snake_nest = [python] * size

In [29]:
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

True

In [31]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

True

In [33]:
import random
position = random.choice(range(size))
snake_nest[position] = ['Python']
snake_nest 

[['Python'], ['Python'], ['Python'], ['Python'], ['Python']]

In [34]:
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

True

In [35]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

False

In [36]:
[id(snake) for snake in snake_nest]

[1583029968520, 1583029968520, 1583030290120, 1583029968520, 1583029968520]

### Conditionals

In [37]:
mixed = ['cat', '', ['dog'], []]

for element in mixed:
    if element:
        print(element)

cat
['dog']


In [38]:
animals = ['cat', 'dog']

if 'cat' in animals:
    print(1)
elif 'dog' in animals:
    print(2)

1


In [39]:
sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']

In [40]:
all(len(w) > 4 for w in sent)

False

In [41]:
any(len(w) > 4 for w in sent)

True

## 4.2 - Sequences

In [42]:
t = 'walk', 'fem', 3
t

('walk', 'fem', 3)

In [43]:
t[0]

'walk'

In [44]:
t[1:]

('fem', 3)

In [45]:
len(t)

3

In [46]:
raw = 'I turned off the spectroroute'
text = ['I', 'turned', 'off', 'the', 'spectroroute']
pair = (6, 'turned')

In [48]:
raw[2], text[3], pair[1]

('t', 'the', 'turned')

In [47]:
raw[-3:], text[-3:], pair[-3:]

('ute', ['off', 'the', 'spectroroute'], (6, 'turned'))

In [49]:
len(raw), len(text), len(pair)

(29, 5, 2)

### Operating on Sequence Types

In [54]:
raw = 'Red lorry, yellow lorry, red lorry, yellow lorry.'
text = word_tokenize(raw)
fdist = nltk.FreqDist(text)

In [55]:
sorted(fdist)

[',', '.', 'Red', 'lorry', 'red', 'yellow']

In [56]:
for key in fdist:
    print(key + ':', fdist[key], end='; ')

Red: 1; lorry: 4; ,: 3; yellow: 2; red: 1; .: 1; 

In [80]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']

words[2], words[3], words[4] = words[3], words[4], words[2]

words

['I', 'turned', 'the', 'spectroroute', 'off']

In [79]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']

tmp = words[2]
words[2] = words[3]
words[3] = words[4]
words[4] = tmp

words

['I', 'turned', 'the', 'spectroroute', 'off']

In [81]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']
tags = ['noun', 'verb', 'prep', 'det', 'noun']

zip(words, tags)

<zip at 0x17097770588>

In [82]:
list(zip(words, tags))

[('I', 'noun'),
 ('turned', 'verb'),
 ('off', 'prep'),
 ('the', 'det'),
 ('spectroroute', 'noun')]

In [83]:
list(enumerate(words))

[(0, 'I'), (1, 'turned'), (2, 'off'), (3, 'the'), (4, 'spectroroute')]

In [85]:
text = nltk.corpus.nps_chat.words()
cut = int(0.9 * len(text))
training_data, test_data = text[:cut], text[cut:]

text == training_data + test_data

True

In [86]:
len(training_data) / len(test_data)

9.0

### Combining Different Sequence Types

In [87]:
words = 'I turned off the spectroroute'.split()
wordlens = [(len(word), word) for word in words]
wordlens.sort()

' '.join(w for (_, w) in wordlens)

'I off the turned spectroroute'

In [96]:
lexicon = [
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f'])
]

In [97]:
lexicon.sort()

In [98]:
lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])

In [99]:
del lexicon[0]

In [100]:
lexicon

[('turned', 'VBD', ['t3:nd', 't3`nd'])]

**Your Turn:** Convert `lexicon` to a tuple, using `lexicon = tuple(lexicon)`, then try each of the above operations, to confirm that none of them is permitted on tuples.

In [94]:
lexicon = [
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f'])
]

lexicon = tuple(lexicon)

In [95]:
lexicon

(('the', 'det', ['Di:', 'D@']), ('off', 'prep', ['Qf', 'O:f']))

### Generator Expressions

In [109]:
text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean - neither more nor less."'''

In [110]:
[w.lower() for w in word_tokenize(text)]

['``',
 'when',
 'i',
 'use',
 'a',
 'word',
 ',',
 "''",
 'humpty',
 'dumpty',
 'said',
 'in',
 'rather',
 'a',
 'scornful',
 'tone',
 ',',
 "''",
 'it',
 'means',
 'just',
 'what',
 'i',
 'choose',
 'it',
 'to',
 'mean',
 '-',
 'neither',
 'more',
 'nor',
 'less',
 '.',
 "''"]

In [111]:
max([w.lower() for w in word_tokenize(text)]) 

'word'

In [105]:
max(w.lower() for w in word_tokenize(text))

'word'
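Both calls return the same result; the difference is that the generator expression never builds the intermediate list. A minimal sketch of that difference, using a made-up word list in place of the tokenized text (the names `words`, `as_list`, and `as_gen` are illustrative and not from the book):

```python
import sys

# Hypothetical stand-in for word_tokenize(text); any iterable of strings works.
words = ['When', 'I', 'use', 'a', 'word'] * 2000

as_list = [w.lower() for w in words]   # builds the whole list up front
as_gen = (w.lower() for w in words)    # yields items one at a time on demand

# The list object grows with the input; the generator object stays small.
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))

# max() accepts either form and gives the same answer.
print(max(as_list) == max(as_gen))
```

Note that `as_gen` is exhausted after the call to `max()`, so a generator has to be recreated if it is needed again.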

## 4.3 - Questions of Style

### Python Coding Style

In [112]:
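# Shows two ways to break a long condition across lines. `syllables` and
# `process` are not defined in this notebook, so the cell raises NameError.
if (len(syllables) > 4 and len(syllables[2]) == 3 and
    syllables[2][2] in 'aeiou' and syllables[2][3] == syllables[1][3]):
    process(syllables)
    
if len(syllables) > 4 and len(syllables[2]) == 3 and \
   syllables[2][2] in 'aeiou' and syllables[2][3] == syllables[1][3]:
    process(syllables)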

NameError: name 'syllables' is not defined

### Procedural vs Declarative Style

In [113]:
tokens = nltk.corpus.brown.words(categories='news')
count = 0
total = 0
for token in tokens:
    count += 1
    total += len(token)
    
total / count

4.401545438271973

In [114]:
total = sum(len(t) for t in tokens)

In [115]:
print(total / len(tokens))

4.401545438271973


In [117]:
word_list = []
i = 0
while i < len(tokens):
    j = 0
    while j < len(word_list) and word_list[j] <= tokens[i]:
        j += 1
    if j == 0 or tokens[i] != word_list[j-1]:
        word_list.insert(j, tokens[i])
    i += 1

KeyboardInterrupt: 
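The loop above was interrupted because it rescans `word_list` from the start for every token, which is slow on the full Brown news category. A sketch on a small, made-up token list (this `tokens` is a hypothetical stand-in for the corpus) suggests that the loop computes the same sorted vocabulary as the one-liner in the next cell:

```python
# Hypothetical miniature corpus; the real `tokens` has around 100,000 items.
tokens = ['the', 'dog', 'gave', 'the', 'cat', 'a', 'dog']

word_list = []
i = 0
while i < len(tokens):
    j = 0
    # Scan forward to the position where tokens[i] belongs.
    while j < len(word_list) and word_list[j] <= tokens[i]:
        j += 1
    # Insert only if the token is not already present just before that position.
    if j == 0 or tokens[i] != word_list[j-1]:
        word_list.insert(j, tokens[i])
    i += 1

print(word_list)                          # ['a', 'cat', 'dog', 'gave', 'the']
print(word_list == sorted(set(tokens)))   # True
```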

In [118]:
word_list = sorted(set(tokens))

In [119]:
fd = nltk.FreqDist(nltk.corpus.brown.words())
cumulative = 0.0
most_common_words = [word for (word, count) in fd.most_common()]
for rank, word in enumerate(most_common_words):
    cumulative += fd.freq(word)
    print("%3d %6.2f%% %s" % (rank + 1, cumulative * 100, word))
    if cumulative > 0.25:
        break

  1   5.40% the
  2  10.42% ,
  3  14.67% .
  4  17.78% of
  5  20.19% and
  6  22.40% to
  7  24.29% a
  8  25.97% in


In [120]:
text = nltk.corpus.gutenberg.words('milton-paradise.txt')
longest = ''
for word in text:
    if len(word) > len(longest):
        longest = word
        
longest

'unextinguishable'

In [121]:
maxlen = max(len(word) for word in text)

In [122]:
[word for word in text if len(word) == maxlen]

['unextinguishable',
 'transubstantiate',
 'inextinguishable',
 'incomprehensible']

### Some Legitimate Uses for Counters

In [123]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
n = 3
[sent[i:i+n] for i in range(len(sent)-n+1)]

[['The', 'dog', 'gave'],
 ['dog', 'gave', 'John'],
 ['gave', 'John', 'the'],
 ['John', 'the', 'newspaper']]

In [127]:
m, n = 3, 7
array = [[set() for i in range(n)] for j in range(m)]
array[2][5].add('Alice')

pprint.pprint(array)

[[set(), set(), set(), set(), set(), set(), set()],
 [set(), set(), set(), set(), set(), set(), set()],
 [set(), set(), set(), set(), set(), {'Alice'}, set()]]


In [128]:
array = [[set()] * n] * m
array[2][5].add(7)
pprint.pprint(array)

[[{7}, {7}, {7}, {7}, {7}, {7}, {7}],
 [{7}, {7}, {7}, {7}, {7}, {7}, {7}],
 [{7}, {7}, {7}, {7}, {7}, {7}, {7}]]
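The second construction goes wrong because list multiplication copies references, not objects. A quick identity check makes this visible; it is sketched here with smaller dimensions, and the names `independent` and `shared` are illustrative:

```python
m, n = 2, 3

independent = [[set() for i in range(n)] for j in range(m)]
shared = [[set()] * n] * m

# The comprehension creates a distinct set for every cell...
print(len({id(cell) for row in independent for cell in row}))   # 6
# ...while multiplication repeats one set object in every cell.
print(len({id(cell) for row in shared for cell in row}))        # 1
# The rows themselves are also the same list object.
print(shared[0] is shared[1])                                   # True
```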


## 4.4 - Functions: The Foundation of Structured Programming

## 4.5 - Doing More with Functions

## 4.6 - Program Development

## 4.7 - Algorithm Design

## 4.8 - A Sample of Python Libraries

## Your Turn Solutions

### 4.1

**Your Turn:** Use multiplication to create a list of lists: `nested = [[]] * 3`. Now modify one of the elements of the list, and observe that all the elements are changed. Use Python's `id()` function to find out the numerical identifier for any object, and verify that `id(nested[0])`, `id(nested[1])`, and `id(nested[2])` are all the same.

In [22]:
nested = [[]] * 3

In [6]:
nested[0].append('test')

In [7]:
nested

[['test'], ['test'], ['test']]

In [21]:
id(nested[0]) == id(nested[1]) == id(nested[2])

True
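If separate lists are wanted instead, one common approach is a list comprehension, which evaluates `[]` once per element. A minimal sketch (the name `separate` is illustrative):

```python
separate = [[] for _ in range(3)]
separate[0].append('test')

print(separate)                               # [['test'], [], []]
print(len({id(item) for item in separate}))   # 3
```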

### 4.2

**Your Turn:** Convert `lexicon` to a tuple, using `lexicon = tuple(lexicon)`, then try each of the above operations, to confirm that none of them is permitted on tuples.
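
A possible solution, sketched with `try`/`except` so that all three operations can be attempted in one cell (the `errors` list is an illustrative helper). Sorting in place fails with `AttributeError` because tuples have no `sort` method, while item assignment and item deletion fail with `TypeError`:

```python
lexicon = [
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f'])
]
lexicon = tuple(lexicon)

errors = []  # collects the name of each exception raised

try:
    lexicon.sort()
except AttributeError as e:
    errors.append(type(e).__name__)

try:
    lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])
except TypeError as e:
    errors.append(type(e).__name__)

try:
    del lexicon[0]
except TypeError as e:
    errors.append(type(e).__name__)

print(errors)   # ['AttributeError', 'TypeError', 'TypeError']
```

The tuple is unchanged afterwards, although the lists nested inside its entries remain mutable.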