**This notebook covers Ch4.3. of the NLTK book.**

Several changes were made to reflect changes in how tokenization is run and, hopefully, to improve the messaging.

## 4.3   Questions of Style (The Art of Python Programming)


**Python Coding Style**

When writing programs you make many subtle choices about names, spacing, comments, and so on. When you look at code written by other people, needless differences in style make it harder to interpret the code. Therefore, the designers of the Python language have published a style guide for Python code, available at http://www.python.org/dev/peps/pep-0008/. 

The underlying value presented in the style guide is consistency, for the purpose of maximizing the readability of code. 

Code layout should use **four** spaces per indentation level. You should make sure that when you write Python code in a file, you **avoid tabs for indentation**, since these can be misinterpreted by different text editors and the indentation can be messed up. 

Lines should be less than 80 characters long. If necessary, **you can break a line inside parentheses, brackets, or braces**, because Python is able to detect that the line continues over to the next line. If you need to break a line outside parentheses, brackets, or braces, you can often add extra parentheses, and you can always add a backslash at the end of the line that is broken:

In [0]:
%%latex
# process() and syllables are not defined, so this cell is for illustration only.
# The first statement breaks the line inside parentheses;
# the second uses a backslash to break the line without parentheses.

>>> if (len(syllables) > 4 and len(syllables[2]) == 3 and
...    syllables[2][2] in 'aeiou' and syllables[2][3] == syllables[1][3]):
...     process(syllables)
>>> if len(syllables) > 4 and len(syllables[2]) == 3 and \
...    syllables[2][2] in 'aeiou' and syllables[2][3] == syllables[1][3]:
...     process(syllables)
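Both ways of breaking a long condition can be tried on made-up data. The following is a minimal sketch: the `syllables` list and the `process()` function are hypothetical stand-ins, and the condition is a simplified one chosen so that the code runs as written.

```python
# Hypothetical data and helper, defined only so the example runs.
syllables = ["com", "pu", "ta", "tion", "al"]
processed = []

def process(items):
    """Stand-in for real processing: record how many items were seen."""
    processed.append(len(items))

# Form 1: break the line inside parentheses (the form PEP 8 prefers).
if (len(syllables) > 4 and len(syllables[2]) == 2 and
        syllables[2][1] in 'aeiou'):
    process(syllables)

# Form 2: break the line with a trailing backslash.
if len(syllables) > 4 and len(syllables[2]) == 2 and \
        syllables[2][1] in 'aeiou':
    process(syllables)

print(processed)
```

Both conditions are identical, so `process()` is called twice and the two forms behave the same; only the layout differs.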



### Procedural vs Declarative Style

In [80]:
import nltk
nltk.download('brown')


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [81]:
tokens = nltk.corpus.brown.words(categories='news')
count = 0
total = 0
for token in tokens:
    count += 1
    total += len(token)
total / count

4.401545438271973

In [82]:
# Here we specify (i.e. declare) the relationship we want, not the steps to compute it
total = sum(len(t) for t in tokens)
print(total / len(tokens))

4.401545438271973
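The same contrast can be reproduced without downloading a corpus. This is a small sketch on a made-up token list; the variable names are illustrative, and `statistics.mean` from the standard library is added as one more declarative option that the original cells do not use.

```python
import statistics

# A small made-up token list standing in for the Brown corpus.
tokens = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']

# Procedural: spell out each step with explicit counters.
count = 0
total = 0
for token in tokens:
    count += 1
    total += len(token)
procedural_mean = total / count

# Declarative: state what we want (sum of lengths over number of tokens).
declarative_mean = sum(len(t) for t in tokens) / len(tokens)

# Also declarative: let the standard library compute the mean.
library_mean = statistics.mean(len(t) for t in tokens)

print(procedural_mean, declarative_mean, library_mean)
```

All three values agree; the declarative versions simply leave the bookkeeping of `count` and `total` to built-in functions.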


**Another Example**

In [83]:
len(tokens)

100554

In [0]:
tokens1000 = tokens[0:1000]

utokens = tokens1000
# utokens = tokens

word_list = []
i = 0
while i < len(utokens):
    j = 0
    while j < len(word_list) and word_list[j] <= utokens[i]:
        j += 1
    if j == 0 or utokens[i] != word_list[j-1]:
        word_list.insert(j, utokens[i])
    i += 1

In [85]:
print(utokens)
print(tokens1000)
print(word_list)

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
["''", '(', ')', ',', '.', '1', '13', '1913', '1923', '1937', '1962', '2', '637', '71', '74', ':', 'Aj', 'Ala.', 'Allen', 'Alpharetta', 'Ask', 'Association', 'Atlanta', "Atlanta's", 'Attorneys', 'Aug.', 'B.', 'Bar', 'Bellwood', 'Berry', 'Bowden', 'Cheshire', 'City', "Commissioner's", 'Committee', 'County', 'Court', 'Department', "Department's", 'Durwood', 'E.', 'Executive', 'Failure', 'Four', 'Friday', 'Fulton', 'Georgia', "Georgia's", 'Grady', 'Grand', 'Griffin', 'Hartsfield', 'He', 'Health', 'Henry', 'His', 'Hospital', 'However', 'Implementation', 'It', 'Ivan', 'J.', 'Jail', 'Jan.', 'Jr.', 'Judge', 'Jury', 'L.', 'Legislature', 'M.', 'Mayor', 'Mayor-nominate', 'Merger', 'Mrs.', 'Nevertheless', 'Office', 'On', 'Only', 'Opelika', 'Pearl', 'Pelham', 'Police', 'Purchasing', 'Pye', 'Rd.', 'Regarding', 'Republicans', 'Sept.', 'September-October', 'State', 'Superior', 'Tax', 'T

In [0]:
# Now set utokens = tokens, rerun the code above, and measure how long it takes.

In [86]:
# The declarative version is also faster: sorted() is an optimized built-in,
# and set() removes the duplicates.


word_list = sorted(set(tokens))
print(word_list[0:50])

['!', '$1', '$1,000', '$1,000,000,000', '$1,500', '$1,500,000', '$1,600', '$1,800', '$1.1', '$1.4', '$1.5', '$1.80', '$10', '$10,000', '$10,000-per-year', '$100', '$100,000', '$102,285,000', '$109', '$11.50', '$115,000', '$12', '$12,192,865', '$12,500', '$12.50', '$12.7', '$120', '$125', '$135', '$139.3', '$14', '$15', '$15,000', '$15,000,000', '$150', '$157,460', '$16', '$16,000', '$17', '$17,000', '$17.8', '$172,000', '$172,400', '$18', '$18.2', '$18.9', '$2', '$2,000', '$2,170', '$2,330,000']
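One way to measure the difference is with the standard `timeit` module. The sketch below wraps both approaches in functions (the function names and the synthetic `sample` data are assumptions made for illustration, not part of the original notebook) and checks that they give the same result before timing them.

```python
import random
import timeit

def unique_sorted_procedural(tokens):
    """Build a sorted list of unique tokens by inserting one at a time."""
    word_list = []
    for token in tokens:
        j = 0
        # Scan to the insertion point that keeps word_list sorted.
        while j < len(word_list) and word_list[j] <= token:
            j += 1
        # Skip the token if it is already present just before that point.
        if j == 0 or token != word_list[j - 1]:
            word_list.insert(j, token)
    return word_list

def unique_sorted_declarative(tokens):
    """Same result, stated as 'the sorted set of tokens'."""
    return sorted(set(tokens))

# Synthetic stand-in for a token list (not the Brown corpus).
random.seed(0)
sample = [str(random.randrange(500)) for _ in range(2000)]

same_result = unique_sorted_procedural(sample) == unique_sorted_declarative(sample)

slow = timeit.timeit(lambda: unique_sorted_procedural(sample), number=3)
fast = timeit.timeit(lambda: unique_sorted_declarative(sample), number=3)
print(same_result)
print(f"procedural: {slow:.4f}s  declarative: {fast:.4f}s")
```

The exact timings depend on the machine, so none are shown here. The procedural version scans `word_list` for every token, so its running time grows much faster as the input gets longer.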


In [87]:
# other ways of specifying parts of lists 

print(len(tokens))
print(tokens[-15:], len(tokens[-15:]))
print(tokens[:15],  len(tokens[:15]))
print(tokens[:-15], len(tokens[:-15]))

# What do you see?

100554
['its', 'theme', ':', '``', 'For', 'a', 'richer', ',', 'fuller', 'life', ',', 'read', "''", '!', '!'] 15
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced'] 15
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] 100539
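The slice forms are easier to follow on a list short enough to read in full. This is a minimal sketch with a made-up list; the variable names are illustrative.

```python
words = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

last_two = words[-2:]           # the final two items
first_two = words[:2]           # the first two items
all_but_last_two = words[:-2]   # everything except the final two

print(last_two, first_two, all_but_last_two)

# The two complementary slices always reassemble the original list.
reassembled = words[:-2] + words[-2:]
print(reassembled == words)
```

A negative index counts from the end of the list, so `words[-2:]` and `words[:-2]` split the list at the same point.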


In [88]:
print(len(word_list)) 
print(word_list[0:10])
print(word_list[14384:])
print(len(set(tokens)))
print(list(set(tokens))[67:90])  #note the conversion of set to list


14394
['!', '$1', '$1,000', '$1,000,000,000', '$1,500', '$1,500,000', '$1,600', '$1,800', '$1.1', '$1.4']
['your', 'yourself', 'youth', 'youthful', 'youths', 'zeal', 'zeroed', 'zinc', 'zombies', 'zone']
14394
['properly', 'allowing', '50th', "Throneberry's", 'defends', 'World', 'safety', 'realistic', 'safeties', '447', '3211', 'beaming', 'Givers', 'refused', 'Gillis', 'circumstances', 'exaggeration', 'or', 'ribbons', 'Century', '$450', 'heredity', 'countian']


In [89]:
print(set(tokens)[67:90])  # error expected: sets do not support indexing or slicing


TypeError: ignored

In [90]:
print(set(tokens)) 



In [91]:
'fee' in set(tokens) 
# more on sets e.g. https://www.geeksforgeeks.org/sets-in-python/ 

True
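The behaviour of sets seen in the cells above can be summarised on a small made-up vocabulary. This is a sketch only; the names `vocab`, `message`, and `ordered` are illustrative.

```python
vocab = set(['the', 'dog', 'the', 'cat'])   # duplicates are dropped

try:
    vocab[0:2]          # sets have no positions, so this raises TypeError
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# Convert to a sorted list when a stable order or slicing is needed.
ordered = sorted(vocab)
print(ordered[0:2])

# Membership testing is what sets are designed for.
print('dog' in vocab)
```

Because a set has no defined order, `list(set(...))` can list items in a different order from run to run, whereas `sorted(set(...))` always gives the same list.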

**Enumerate**

Another case where a loop variable seems to be necessary is for printing a counter with each line of output. 

Instead, we can use enumerate(), which processes a sequence s and produces a tuple of the form (i, s[i]) for each item in s, starting with (0, s[0]).

Here we enumerate the words of the frequency distribution in decreasing order of frequency, giving tuples (rank, word). We print rank+1 so that the counting appears to start from 1, as required when producing a list of ranked items.

In [92]:
 	
fd = nltk.FreqDist(nltk.corpus.brown.words())
cumulative = 0.0
most_common_words = [word for (word, count) in fd.most_common()]
for rank, word in enumerate(most_common_words):
    cumulative += fd.freq(word)
    print("%3d %6.2f%% %s " % (rank + 1, cumulative * 100, word))
    if cumulative > 0.25:
        break

  1   5.40% the 
  2  10.42% , 
  3  14.67% . 
  4  17.78% of 
  5  20.19% and 
  6  22.40% to 
  7  24.29% a 
  8  25.97% in 
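The tuples that `enumerate()` produces can be inspected directly on a short list. The sketch below uses a made-up word list; it also shows the built-in `start` argument, which is an alternative to printing `rank + 1` that the cell above does not use.

```python
words = ['the', ',', '.', 'of']

# enumerate() pairs each item with its index, starting from 0 by default.
pairs = list(enumerate(words))
print(pairs)

# The optional start argument makes the counting begin at 1.
lines = ["%3d %s" % (rank, word) for rank, word in enumerate(words, start=1)]
for line in lines:
    print(line)
```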


**Method to find the longest word in a text**

In [93]:
#text = nltk.corpus.gutenberg.words('milton-paradise.txt')
text = nltk.corpus.brown.words(categories='romance')
longest = ''
for word in text:
    if len(word) > len(longest):
        longest = word
longest


'yielding-Mediterranian-woman-'

In [95]:
#using comprehension to find other long words
maxlen = max(len(word) for word in text)
print(maxlen)
[word for word in text if len(word) > maxlen-12]

29


['princess-in-a-carriage',
 'semi-professionally',
 'heavily-upholstered',
 'whichever-the-hell',
 'self-consciousness',
 'five-hundred-dollar',
 'yielding-Mediterranian-woman-',
 'socio-archaeological',
 'beautifully-tapered']
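The loop and the comprehension can also be written with `max()` and a `key` function. This is a sketch on a made-up word list rather than the corpus; the names are illustrative.

```python
text = ['It', 'was', 'a', 'semi-professionally', 'run',
        'heavily-upholstered', 'affair']

# max() with key=len returns the first word of maximal length,
# which matches the loop that uses a strict '>' comparison.
longest = max(text, key=len)

# All words that share the maximal length.
maxlen = len(longest)
longest_words = [word for word in text if len(word) == maxlen]

print(longest)
print(longest_words)
```

When several words tie for the maximum, `max()` reports only the first, so the comprehension is the way to see them all.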

### Building Arrays via Counting

In [96]:
#N-grams

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
n = 3
g3 = [sent[i:i+n] for i in range(len(sent)-n+1)]
print(g3)
# It is quite tricky to get the range of the loop variable right.
# Since this is a common operation in NLP, NLTK supports it with functions
# bigrams(text) and trigrams(text), and a general-purpose ngrams(text, n).

[['The', 'dog', 'gave'], ['dog', 'gave', 'John'], ['gave', 'John', 'the'], ['John', 'the', 'newspaper']]
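To avoid the range arithmetic without relying on a library, the n-grams can be built by zipping staggered slices. The helper below, `ngrams_sketch`, is a hypothetical pure-Python function written for illustration; it is not NLTK's implementation.

```python
def ngrams_sketch(tokens, n):
    """Return all n-grams of tokens as tuples (illustrative helper, not NLTK's)."""
    # zip over n staggered views of the list; zip stops at the shortest one,
    # so the end of the range is handled implicitly.
    return list(zip(*(tokens[i:] for i in range(n))))

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
bigrams = ngrams_sketch(sent, 2)
trigrams = ngrams_sketch(sent, 3)
print(bigrams)
print(trigrams)
```

The result holds tuples rather than lists, but the trigrams contain the same words in the same order as the slicing version above.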


In [97]:
# To build an array with m rows and n columns, where each cell is a set, we can use a nested list comprehension:
import pprint

m, n = 3, 7
array = [[set() for i in range(n)] for j in range(m)]
array[2][5].add('Alice')
pprint.pprint(array)

[[set(), set(), set(), set(), set(), set(), set()],
 [set(), set(), set(), set(), set(), set(), set()],
 [set(), set(), set(), set(), set(), {'Alice'}, set()]]


In [98]:
# Note that it would be incorrect to do this work using multiplication:
# multiplying a list copies references to the same set object, so every cell shares one set.

array = [[set()] * n] * m
array[2][5].add('Alice')
pprint.pprint(array)

[[{'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}],
 [{'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}],
 [{'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}, {'Alice'}]]
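The sharing can be confirmed with the `is` operator, which tests whether two names refer to the same object. This is a minimal sketch with smaller dimensions; the variable names are illustrative.

```python
m, n = 2, 3

# Multiplication repeats references to ONE set object.
shared = [[set()] * n] * m
shared_is_same = shared[0][0] is shared[1][2]

# The nested comprehension calls set() once per cell.
independent = [[set() for _ in range(n)] for _ in range(m)]
independent_is_same = independent[0][0] is independent[1][2]

independent[1][2].add('Alice')
print(shared_is_same, independent_is_same)
print(independent)
```

In the multiplied version every cell is the same object, so adding to one cell appears to change them all. In the comprehension version only the cell that was modified contains 'Alice'.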


**Moral:** Arrays, lists and sets can sometimes be tricky.

Strings, lists and tuples are different kinds of sequence object, supporting common operations such as indexing, slicing, len(), sorted(), and membership testing using in.
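A short sketch can show the shared operations side by side. The example values are made up, and the `results` dictionary is only a convenience for collecting the output.

```python
sequences = {
    'string': 'colorless',
    'list': ['green', 'ideas', 'sleep'],
    'tuple': ('furiously', 'today'),
}

results = {}
for kind, seq in sequences.items():
    # The same operations work regardless of the sequence type.
    results[kind] = (seq[0], seq[-2:], len(seq), sorted(seq), seq[0] in seq)
    print(kind, results[kind])
```

Note that slicing returns the same type as the original sequence, while `sorted()` always returns a list.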

