**This notebook covers Sections 4.1 and 4.2 of the NLTK book.**

Several adjustments were made to reflect changes in how tokenization is run.


# **4.1   Back to the Basics** #


*Assignment* would seem to be the most elementary programming concept, not deserving a separate discussion. However, there are some surprising subtleties here. Consider the following code fragment:

In [89]:
foo = 'Monty'
bar = foo 
foo = 'Python' 

print(foo)
print(bar)

#This behaves exactly as expected. 

Python
Monty


Assignment always copies the value of an expression, but *a value is not always what you might expect it to be*.

**In particular, the "value" of a structured object such as a list is actually just a reference to the object.**

In the following example, the line bar = foo assigns the reference held by foo to the new variable bar.

Now when we modify something inside foo with foo[1] = 'Bodkin', we can see that the contents of bar have also been changed.

In [90]:
foo = ['Monty', 'Python']
bar = foo
foo[1] = 'Bodkin' 
print(bar)
print(foo)
# note that bar has changed along with foo

['Monty', 'Bodkin']
['Monty', 'Bodkin']


https://www.nltk.org/images/array-memory.png


The line bar = foo does not copy the contents of the variable, only its "object reference". To understand what is going on here, we need to know how lists are stored in the computer's memory. In Figure 4.1 (the image linked above), we see that a list foo is a reference to an object stored at location 3133 (which is itself a series of pointers to other locations holding strings). When we assign bar = foo, it is just the object reference 3133 that gets copied.

In [91]:
# Let's experiment some more, by creating a variable empty holding the empty list, then using it three times on the next line.

 	
empty = []
nested = [empty, empty, empty]
print('first nested', nested)
nested[1].append('Python')
print('second nested',nested)

# Observe that changing one of the items inside our nested list of lists changed them all. This is because each of the three elements is actually just a reference to one and the same list in memory.

first nested [[], [], []]
second nested [['Python'], ['Python'], ['Python']]


Use multiplication to create a list of lists: 

nested = [[]] * 3. 

Now modify one of the elements of the list, and observe that all the elements are changed. Use Python's id() function to find out the numerical identifier for any object, and verify that id(nested[0]), id(nested[1]), and id(nested[2]) are all the same.

**Now, notice that when we assign a new value to one of the elements of the list, it does not propagate to the others:**

 	


In [92]:
nested = [[]] * 3
print('first', nested)
nested[1].append('Python')
nested[1] = ['Monty']
print('second', nested)

first [[], [], []]
second [['Python'], ['Monty'], ['Python']]


We began with a list containing three references to a single empty list object. Then we modified that object by appending 'Python' to it, resulting in a list containing three references to a single list object ['Python']. Next, we overwrote one of those references with a reference to a new object ['Monty']. This last step modified one of the three object references inside the nested list. However, the ['Python'] object wasn't changed, and is still referenced from two places in our nested list of lists. It is crucial to appreciate this difference between modifying an object via an object reference, and overwriting an object reference.
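The difference between modifying an object and overwriting a reference can be checked directly with the is operator. The following is a minimal sketch; the variable names shared, all_same_before, same_after and middle_is_shared are illustrative and do not come from the NLTK book.

```python
# Three references to one shared list object ("shared" is an illustrative name).
shared = []
nested = [shared] * 3

# Modifying the object through any reference is visible through all of them.
nested[0].append('Python')
all_same_before = nested[0] is nested[1] is nested[2]

# Overwriting one reference rebinds that slot only; the shared object is untouched.
nested[1] = ['Monty']
same_after = nested[0] is nested[2]
middle_is_shared = nested[1] is shared

print(nested)            # [['Python'], ['Monty'], ['Python']]
print(all_same_before)   # True
print(same_after)        # True
print(middle_is_shared)  # False
print(shared)            # ['Python']
```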

**Note** *Important:* To copy the items from a list foo to a new list bar, you can write 

bar = foo[:]. 

This copies the object references inside the list. To copy a structure without copying any object references, use 

copy.deepcopy().
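As a rough illustration of the note above, the sketch below compares a plain assignment, a slice copy and a deep copy of the same nested list. The names alias, shallow and deep are made up for this example.

```python
import copy

foo = [['Monty'], ['Python']]

alias = foo                  # same outer list object
shallow = foo[:]             # new outer list, same inner lists
deep = copy.deepcopy(foo)    # new outer list and new inner lists

foo[0].append('Bodkin')      # modify an inner list in place
foo[1] = ['Holy', 'Grail']   # overwrite a reference in the outer list

print(alias)    # [['Monty', 'Bodkin'], ['Holy', 'Grail']]
print(shallow)  # [['Monty', 'Bodkin'], ['Python']]
print(deep)     # [['Monty'], ['Python']]
```

The slice copy is protected from the overwritten reference but still sees the in-place change to the shared inner list; only the deep copy is fully independent.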

In [0]:
# now you try 3 examples with bar = foo[:]  and copy.deepcopy()
# your code comes here 

## **Equality**

Python provides two ways to check that a pair of items are the same. The == operator tests whether two objects have equal values, while the is operator tests for *object identity*.

We can use it to verify our earlier observations about objects. First we create a list containing several copies of the same object, and demonstrate that they are not only identical according to ==, but also that they are one and the same object:

In [93]:
# checking == first

size = 5
python = ['Python']
snake_nest = [python] * size
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

True

In [94]:
# checking "is" second 
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

True

**Now let's put a new** python **in this nest. We can easily show that the objects are not all identical:**

In [95]:
import random
position = random.choice(range(size))
snake_nest[position] = ['Python']
snake_nest

[['Python'], ['Python'], ['Python'], ['Python'], ['Python']]

In [96]:
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

True

In [97]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

False

In [98]:
#You can do several pairwise tests to discover which position contains the interloper, 
#but the id() function makes detection easier:
#Find the item of the list that has a distinct identifier. 
#If you try running this code snippet yourself, expect to see different numbers in the resulting list, 
#and also the interloper may be in a different position.
 	
[id(snake) for snake in snake_nest]



[140459918877000,
 140459918877000,
 140459918877000,
 140459918876808,
 140459918877000]

## **Conditionals**

In the condition part of an if statement, a nonempty string or list is evaluated as true, while an empty string or list evaluates as false.


In [99]:
mixed = ['cat', '', ['dog'], []]
for element in mixed:
    if element:
        print(element)
#Note the ":" and the indentations 
#Also note we don't need to say   if len(element) > 0:   in the condition

cat
['dog']


**if vs elif**

In [100]:
# Note: Since the if clause of the statement is satisfied, Python never tries to evaluate the elif clause, 
# so we never get to print out 2. 
# By contrast, if we replaced the elif by an if, then we would print out both 1 and 2. 
# So an elif clause potentially gives us more information than a bare if clause; 
# when it evaluates to true, it tells us not only that the condition is satisfied, 
# but also that the condition of the main if clause was not satisfied.

animals = ['cat', 'dog']
if 'cat' in animals:
    print(1)
elif 'dog' in animals:
    print(2)

1


In [0]:
#Exercise: change the elif to a second if and run it

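### **Python has quantifiers**

The built-in functions all() and any() take a sequence (or any iterable) and check whether all or any of its items are true.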

In [102]:
sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']
all(len(w) > 4 for w in sent)

False

In [103]:
any(len(w) > 4 for w in sent)

True
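To make the behaviour of all() and any() concrete, here is a sketch that spells each one out as an explicit loop. The helper names every_word_longer_than and some_word_longer_than are hypothetical and exist only for this comparison.

```python
sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']

def every_word_longer_than(words, n):
    """Hypothetical loop equivalent of all(len(w) > n for w in words)."""
    for w in words:
        if not len(w) > n:
            return False   # one counterexample is enough
    return True

def some_word_longer_than(words, n):
    """Hypothetical loop equivalent of any(len(w) > n for w in words)."""
    for w in words:
        if len(w) > n:
            return True    # one witness is enough
    return False

print(every_word_longer_than(sent, 4) == all(len(w) > 4 for w in sent))  # True
print(some_word_longer_than(sent, 4) == any(len(w) > 4 for w in sent))   # True

# On an empty sequence, all() is true and any() is false.
print(all([]), any([]))  # True False
```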

# **4.2   Sequences**

So far, we have seen two kinds of sequence object: strings and lists. Another kind of sequence is called a tuple. Tuples are formed with the comma operator, and typically enclosed using parentheses. We've actually seen them in the previous chapters, and sometimes referred to them as "pairs", since there were always two members. However, tuples can have any number of members. Like lists and strings, tuples can be indexed and sliced, and have a length, as shown in the next few lines.

In [104]:
t = 'walk', 'fem', 3 # comma is an operator 
t



('walk', 'fem', 3)

In [105]:
t[0] # index



'walk'

In [106]:
t[1:] #slicing 



('fem', 3)

In [107]:
len(t) #length

3

Tuples are constructed using the comma operator. Parentheses are a more general feature of Python syntax, designed for grouping. A tuple containing the single element 'snark' is defined by adding a trailing comma, like this: "'snark',". The empty tuple is a special case, and is defined using empty parentheses ().
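A short sketch of the two special cases just described may help. The variable names single, not_a_tuple and empty are illustrative only.

```python
single = 'snark',        # the trailing comma makes a one-element tuple
not_a_tuple = ('snark')  # parentheses alone only group; this is still a string
empty = ()               # the empty tuple

print(type(single).__name__, len(single))            # tuple 1
print(type(not_a_tuple).__name__, len(not_a_tuple))  # str 5
print(type(empty).__name__, len(empty))              # tuple 0
```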

In [108]:
# Let's compare strings, lists and tuples directly, and do the indexing, slice, and length operation on each type

raw = 'I turned off the spectroroute'
text = ['I', 'turned', 'off', 'the', 'spectroroute']
pair = (6, 'turned')
raw[2], text[3], pair[1]

('t', 'the', 'turned')

In [109]:
raw[-3:], text[-3:], pair[-3:]

('ute', ['off', 'the', 'spectroroute'], (6, 'turned'))

In [110]:
len(raw), len(text), len(pair)

(29, 5, 2)

Notice in this code sample that we computed multiple values on a single line, separated by commas. 

These comma-separated expressions are actually just tuples: Python allows us to omit the parentheses around tuples if there is no ambiguity.

When we print a tuple, the parentheses are always displayed. 

By using tuples in this way, we are implicitly aggregating items together.

### **Operating on Sequence Types** ###

Let's look at Table 4.1 (https://www.nltk.org/book/ch04.html#tab-python-sequence) and the two paragraphs that follow it.

In [111]:
#Some other objects, such as a FreqDist, 
#can be converted into a sequence (using list() or sorted()) and support iteration, e.g.
import nltk
from nltk import wordpunct_tokenize as tokz

raw = 'Red lorry, yellow lorry, red lorry, yellow lorry.'
text = tokz(raw)
fdist = nltk.FreqDist(text)
print(sorted(fdist))

for key in fdist:
    print(key + '::', fdist[key], end='  ;')

[',', '.', 'Red', 'lorry', 'red', 'yellow']
Red:: 1  ;lorry:: 4  ;,:: 3  ;yellow:: 2  ;red:: 1  ;.:: 1  ;

In [112]:
#In the next example, we use tuples to re-arrange the contents of our list. 
#(We can omit the parentheses because the comma has higher precedence than assignment.)

words = ['I', 'turned', 'off', 'the', 'spectroroute']
words[2], words[3], words[4] = words[3], words[4], words[2]
words

['I', 'turned', 'the', 'spectroroute', 'off']

In [114]:
# The above is an idiomatic and readable way to move items inside a list. 
# It is equivalent to the following traditional way of doing such tasks 
# that does not use tuples (notice that this method needs a temporary variable tmp).
words = ['I', 'turned', 'off', 'the', 'spectroroute']
 	
tmp = words[2]
words[2] = words[3]
words[3] = words[4]
words[4] = tmp
words
# Run this cell a couple of times in a row. What do you see?

['I', 'turned', 'the', 'spectroroute', 'off']

**Zipping sequences together**

zip() is a very frequently used function in Python.

As we have seen, Python has sequence functions such as sorted() and reversed() that rearrange the items of a sequence. 

There are also functions that modify the structure of a sequence and which can be handy for language processing. 


For example, zip() takes the items of two or more sequences and "zips" them together into a single sequence of tuples. In Python 3 the result is a lazy zip object rather than a list, so wrap it in list() to see the pairs.

Given a sequence s, enumerate(s) returns pairs consisting of an index and the item at that index.

In [115]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']
tags = ['noun', 'verb', 'prep', 'det', 'noun']
zip(words, tags)



<zip at 0x7fbf5f84df48>

In [116]:
list(zip(words, tags))


[('I', 'noun'),
 ('turned', 'verb'),
 ('off', 'prep'),
 ('the', 'det'),
 ('spectroroute', 'noun')]

In [117]:
enumerate(words)

<enumerate at 0x7fbf5f8ee240>

In [118]:
# note that the result of enumerate has to be list()-ed to see the actual sequence
list(enumerate(words))


[(0, 'I'), (1, 'turned'), (2, 'off'), (3, 'the'), (4, 'spectroroute')]
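In practice zip() and enumerate() are usually consumed directly by a loop or comprehension rather than converted to a list first. The sketch below shows a few common patterns; the names tagged, long_word_positions and tag_of are made up for illustration.

```python
words = ['I', 'turned', 'off', 'the', 'spectroroute']
tags = ['noun', 'verb', 'prep', 'det', 'noun']

# zip() pairs items by position; each pair unpacks into two loop variables.
tagged = [word + '/' + tag for word, tag in zip(words, tags)]
print(tagged)

# enumerate() supplies the index, so no manual counter is needed.
long_word_positions = [i for i, word in enumerate(words) if len(word) > 3]
print(long_word_positions)  # [1, 4]

# A sequence of pairs converts directly into a dictionary.
tag_of = dict(zip(words, tags))
print(tag_of['the'])  # det
```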

**Lazy Evaluation**

It is a widespread feature of Python 3 and NLTK 3 to only perform computation when required (a feature known as "lazy evaluation"). If you ever see a result like <zip object at 0x10d005448> when you expect to see a sequence, you can force the object to be evaluated just by putting it in a context that expects a sequence, like list(x), or for item in x.
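One practical consequence of lazy evaluation, sketched below with made-up variable names, is that a zip object can be traversed only once. If the pairs are needed more than once, it is safer to store the result of list().

```python
words = ['I', 'turned', 'off']
tags = ['noun', 'verb', 'prep']

pairs = zip(words, tags)   # nothing is computed yet
first_pass = list(pairs)   # list() forces the evaluation
second_pass = list(pairs)  # the iterator is now exhausted

print(first_pass)   # [('I', 'noun'), ('turned', 'verb'), ('off', 'prep')]
print(second_pass)  # []
```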

In [119]:
# For some NLP tasks it is necessary to cut up a sequence into two or more parts. 
# For instance, we might want to "train" a system on 90% of the data 
# and test it on the remaining 10%. 

#To do this we decide the location where we want to cut the data [1], 
# then cut the sequence at that location [2].

# We can verify that none of the original data is lost during this process, nor is it duplicated [3]. 
# We can also verify that the ratio of the sizes of the two pieces is what we intended [4].

import nltk
nltk.download('nps_chat') #or use the nltk data from your drive

text = nltk.corpus.nps_chat.words()
cut = int(0.9 * len(text)) #[1]
training_data, test_data = text[:cut], text[cut:] #[2]
print(text == training_data + test_data) #[3]


print(len(training_data) / len(test_data)) #[4]




[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
True
9.0
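The same cut can be tried without downloading a corpus. This is a minimal sketch on a made-up list of 50 placeholder tokens; split_sequence is a hypothetical helper, not an NLTK function.

```python
def split_sequence(seq, train_fraction=0.9):
    """Hypothetical helper: cut seq into (training, test) at a fraction of its length."""
    cut = int(train_fraction * len(seq))
    return seq[:cut], seq[cut:]

tokens = ['tok%d' % i for i in range(50)]   # made-up stand-in data
training_data, test_data = split_sequence(tokens)

print(len(training_data), len(test_data))    # 45 5
print(tokens == training_data + test_data)   # True
```

Because the two slices share the same cut point, every item lands in exactly one of the two pieces.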


**Combining Different Sequence Types**

Let's combine our knowledge of these three sequence types, together with list comprehensions, to perform the task of sorting the words in a string by their length.

In [120]:
words = 'I turned off the spectroroute'.split() #[1]
wordlens = [(len(word), word) for word in words] #[2]
wordlens.sort() #[3]
' '.join(w for (_, w) in wordlens) #[4]

'I off the turned spectroroute'

*  Each of the above lines of code contains a significant feature. A simple string is actually an object with methods defined on it such as split() [1]. 
*  We use a list comprehension to build a list of tuples [2], where each tuple consists of a number (the word length) and the word, e.g. (3, 'the'). 
*  We use the sort() method [3] to sort the list in-place. 
*  Finally, we discard the length information and join the words back into a single string [4]. 
   *  (The underscore in [4] is just a regular Python variable, but by convention we use underscore to indicate that we will not use its value.)
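For comparison, the same ordering can usually be obtained by passing key=len to sorted(). This is an alternative technique, not the one used in the cell above, and the two differ in how ties are broken. The names decorated, by_tuple, by_key, tie_by_key and tie_by_tuple are illustrative.

```python
words = 'I turned off the spectroroute'.split()

# Sorting (length, word) tuples, as in the cell above: ties on length are
# broken by comparing the words themselves.
decorated = sorted((len(word), word) for word in words)
by_tuple = ' '.join(w for (_, w) in decorated)

# key=len sorts on length only; the sort is stable, so ties keep input order.
by_key = ' '.join(sorted(words, key=len))

print(by_tuple)  # I off the turned spectroroute
print(by_key)    # I off the turned spectroroute

# The two approaches differ when equal-length words are not already in order.
tie_by_key = sorted(['the', 'off'], key=len)
tie_by_tuple = [w for (_, w) in sorted((len(w), w) for w in ['the', 'off'])]
print(tie_by_key)    # ['the', 'off']
print(tie_by_tuple)  # ['off', 'the']
```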





**Lists vs Tuples**

We often use lists to hold sequences of words. In contrast, a tuple is typically a collection of objects of different types, of fixed length. We often use a tuple to hold a record, a collection of different fields relating to some entity. This distinction between the use of lists and tuples takes some getting used to, so here is another example:



In [121]:
# a lexicon is represented as a list because it is a collection of objects of a single type 
# — lexical entries — of no predetermined length
lexicon = [
   ('the', 'det', ['Di:', 'D@']),
   ('off', 'prep', ['Qf', 'O:f'])
   ]
lexicon

[('the', 'det', ['Di:', 'D@']), ('off', 'prep', ['Qf', 'O:f'])]

A good way to decide when to use tuples vs lists is to ask whether the interpretation of an item depends on its position. 

For example, a tagged token combines two strings having different interpretation, and we choose to interpret the first item as the token and the second item as the tag. Thus we use tuples like this: ('grail', 'noun'); a tuple of the form ('noun', 'grail') would be nonsensical since it would be a word noun tagged grail. 

In contrast, the elements of a text are all tokens, and position is not significant. 

Thus we use lists like this: ['venetian', 'blind']; a list of the form ['blind', 'venetian'] would be equally valid. 

The linguistic meaning of the words might be different, but the interpretation of list items as tokens is unchanged.

In [122]:
# The distinction between lists and tuples has been described in terms of usage. 
# However, there is a more fundamental difference: in Python, lists are mutable, while tuples are immutable. 
# In other words, lists can be modified, while tuples cannot. Here are some of the operations on lists that 
# do in-place modification of the list.

print(lexicon) #previously defined lexicon 
 	
lexicon.sort() #sorting
lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd']) #modification
del lexicon[0]  #deletion
lexicon

[('the', 'det', ['Di:', 'D@']), ('off', 'prep', ['Qf', 'O:f'])]


[('turned', 'VBD', ['t3:nd', 't3`nd'])]

In [123]:
# Convert lexicon to a tuple, using lexicon = tuple(lexicon), then try each of the above operations, 
# to confirm that none of them is permitted on tuples
lexicon = [
   ('the', 'det', ['Di:', 'D@']),
   ('off', 'prep', ['Qf', 'O:f'])
   ]
lexicon = tuple(lexicon)
print(lexicon)
lexicon.sort()

(('the', 'det', ['Di:', 'D@']), ('off', 'prep', ['Qf', 'O:f']))


AttributeError: 'tuple' object has no attribute 'sort'

In [124]:
lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd']) #modification

TypeError: 'tuple' object does not support item assignment

In [125]:
del lexicon[0]  #deletion

TypeError: 'tuple' object doesn't support item deletion

In [0]:
lexicon=list(lexicon)

In [126]:
del lexicon[0]  #deletion
lexicon

[('off', 'prep', ['Qf', 'O:f'])]

*Moral:*   list() and tuple()  **are inverses of each other**
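The contrast between mutable lists and immutable tuples can also be checked without stopping the notebook on an error, by catching the exception. The following is only a sketch; entry_list, entry_tuple, error_name and round_trip are hypothetical names.

```python
entry_list = ['the', 'det']          # illustrative record stored as a list
entry_tuple = tuple(entry_list)      # the same items as a tuple

entry_list[1] = 'DT'                 # lists can be modified in place

try:
    entry_tuple[1] = 'DT'            # tuples cannot
except TypeError as err:
    error_name = type(err).__name__
    print(error_name)  # TypeError

# Converting back and forth preserves the items and their order.
round_trip = list(tuple(['the', 'det']))
print(round_trip == ['the', 'det'])  # True
```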

## Generator Expressions ##

This is an important topic if you are going to program anything serious in Python.

In [127]:
text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean - neither more nor less."'''

tok_text = [w.lower() for w in tokz(text)] #remember we named our tokenizer tokz
print(tok_text)


['"', 'when', 'i', 'use', 'a', 'word', ',"', 'humpty', 'dumpty', 'said', 'in', 'rather', 'a', 'scornful', 'tone', ',', '"', 'it', 'means', 'just', 'what', 'i', 'choose', 'it', 'to', 'mean', '-', 'neither', 'more', 'nor', 'less', '."']


In [128]:
'word'>'when' #we will be comparing words in the next segment

True

In [129]:
m1 = max([w.lower() for w in tokz(text)])  #[1]
m2 = max(w.lower() for w in tokz(text))    #[2]
print(m1, m2)

word word


The second line uses a **generator expression**. This is more than a notational convenience: in many language processing situations, generator expressions will be more efficient. In [1], storage for the list object must be allocated before the value of max() is computed. If the text is very large, this could be slow. In [2], the data is streamed to the calling function. Since the calling function simply has to find the maximum value — the word which comes latest in lexicographic sort order — it can process the stream of data without having to store anything more than the maximum value seen so far.
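The difference between the two forms can be seen on a small scale without NLTK. In the sketch below, as_list, as_gen and total_chars are illustrative names and the word list is a hand-picked sample from the quotation.

```python
words = ['when', 'i', 'use', 'a', 'word']

as_list = [w.upper() for w in words]   # builds the whole list immediately
as_gen = (w.upper() for w in words)    # builds nothing until items are requested

print(type(as_list).__name__, type(as_gen).__name__)  # list generator
print(max(as_list), max(as_gen))                      # WORD WORD

# When a generator expression is the only argument, the extra parentheses can be dropped.
total_chars = sum(len(w) for w in words)
print(total_chars)  # 13
```

Like other lazy objects, a generator expression can be consumed only once, so as_gen is empty after the call to max().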


**Summary**

*  Python's assignment and parameter passing use object references; e.g. if a is a list and we assign b = a, then any in-place modification made through a is visible through b, and vice versa.
*  The is operator tests if two objects are identical internal objects, while == tests if two objects are equivalent. This distinction parallels the type-token distinction.
*  Strings, lists and tuples are different kinds of sequence object, supporting common operations such as indexing, slicing, len(), sorted(), and membership testing using in.
*  A declarative programming style usually produces more compact, readable code; manually-incremented loop variables are usually unnecessary; when a sequence must be enumerated, use enumerate().