# 4. [Writing Structured Programs](https://www.nltk.org/book/ch04.html) - Notes

* [NLTK-Book-Resource Repository](https://github.com/BetoBob/NLTK-Book-Resource)
* [NLTK-Book-Resource Table of Contents](https://github.com/BetoBob/NLTK-Book-Resource#table-of-contents)

Run the cell below before running the examples.

In [None]:
import nltk, pprint, re

from nltk import word_tokenize

## 4.1 - Back to the Basics

### Assignment

In [None]:
foo = 'Monty'
bar = foo 
foo = 'Python'

bar

In [None]:
foo = ['Monty', 'Python']
bar = foo
foo[1] = 'Bodkin'

bar

In [None]:
empty = []
nested = [empty, empty, empty]

nested

In [None]:
nested[1].append('Python')

nested

**Your Turn:** Use multiplication to create a list of lists: `nested = [[]] * 3`. Now modify one of the elements of the list, and observe that all the elements are changed. Use Python's `id()` function to find out the numerical identifier for any object, and verify that `id(nested[0])`, `id(nested[1])`, and `id(nested[2])` are all the same.

In [None]:
nested = [[]] * 3

### Equality

In [None]:
size = 5
python = ['Python']
snake_nest = [python] * size

In [None]:
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

In [None]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

In [None]:
import random
position = random.choice(range(size))
snake_nest[position] = ['Python']
snake_nest 

In [None]:
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

In [None]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

In [None]:
[id(snake) for snake in snake_nest]

### Conditionals

In [None]:
mixed = ['cat', '', ['dog'], []]

for element in mixed:
    if element:
        print(element)

In [None]:
animals = ['cat', 'dog']

if 'cat' in animals:
    print(1)
elif 'dog' in animals:
    print(2)

In [None]:
sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']

In [None]:
all(len(w) > 4 for w in sent)

In [None]:
any(len(w) > 4 for w in sent)

## 4.2 - Sequences

In [None]:
t = 'walk', 'fem', 3
t

In [None]:
t[0]

In [None]:
t[1:]

In [None]:
len(t)

In [None]:
raw = 'I turned off the spectroroute'
text = ['I', 'turned', 'off', 'the', 'spectroroute']
pair = (6, 'turned')

In [None]:
raw[2], text[3], pair[1]

In [None]:
raw[-3:], text[-3:], pair[-3:]

In [None]:
len(raw), len(text), len(pair)

### Operating on Sequence Types

In [None]:
raw = 'Red lorry, yellow lorry, red lorry, yellow lorry.'
text = word_tokenize(raw)
fdist = nltk.FreqDist(text)

In [None]:
sorted(fdist)

In [None]:
for key in fdist:
    print(key + ':', fdist[key], end='; ')

In [None]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']

words[2], words[3], words[4] = words[3], words[4], words[2]

words

In [None]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']

tmp = words[2]
words[2] = words[3]
words[3] = words[4]
words[4] = tmp

words

#### zip / enumerate

In [None]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']
tags = ['noun', 'verb', 'prep', 'det', 'noun']

In [None]:
zip(words, tags)

In [None]:
list(zip(words, tags))

In [None]:
list(enumerate(words))

#### splitting data

In [None]:
text = nltk.corpus.nps_chat.words()
cut = int(0.9 * len(text))
training_data, test_data = text[:cut], text[cut:]

In [None]:
text == training_data + test_data

In [None]:
len(training_data) / len(test_data)

### Combining Different Sequence Types

In [None]:
words = 'I turned off the spectroroute'.split()
wordlens = [(len(word), word) for word in words]
wordlens.sort()

In [None]:
' '.join(w for (_, w) in wordlens)

In [None]:
lexicon = [
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f'])
]
lexicon.sort()

In [None]:
lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])

In [None]:
del lexicon[0]

In [None]:
lexicon

**Your Turn:** Convert `lexicon` to a tuple, using `lexicon = tuple(lexicon)`, then try each of the above operations, to confirm that none of them is permitted on tuples.

In [None]:
lexicon = [
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f'])
]

### Generator Expressions

In [None]:
text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean - neither more nor less."'''

In [None]:
[w.lower() for w in word_tokenize(text)]

In [None]:
max([w.lower() for w in word_tokenize(text)]) 

In [None]:
max(w.lower() for w in word_tokenize(text))

## 4.3 - Questions of Style

### Python Coding Style

In [None]:
syllables = []

In [None]:
# before (too many characters)

if (len(syllables) > 4 and len(syllables[2]) == 3 and syllables[2][2] in [aeiou] and syllables[2][3] == syllables[1][3]):
    process(syllables)

In [None]:
# after

if (len(syllables) > 4 and len(syllables[2]) == 3 and \
    syllables[2][2] in [aeiou] and syllables[2][3] == syllables[1][3]):
    process(syllables)

### Procedural vs Declarative Style

#### Average Word Lengths Example

In [None]:
tokens = nltk.corpus.brown.words(categories='news')

In [None]:
# Procedural Style

count = 0
total = 0
for token in tokens:
    count += 1
    total += len(token)
    
total / count

In [None]:
# Declarative Style

total = sum(len(t) for t in tokens)

print(total / len(tokens))

#### Unique Words Example

**Note:** 

A smaller set of tokens is used in this example because the procedural version takes a long time to run on the Brown Corpus tokens.

In [None]:
text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean - neither more nor less."'''

tokens = word_tokenize(text)

In [None]:
# Procedural Style

word_list = []
i = 0
while i < len(tokens):
    j = 0
    while j < len(word_list) and word_list[j] <= tokens[i]:
        j += 1
    if j == 0 or tokens[i] != word_list[j-1]:
        word_list.insert(j, tokens[i])
    i += 1

word_list

In [None]:
# Declarative Style

word_list = sorted(set(tokens))

word_list

#### Word Frequency Example

In [None]:
fd = nltk.FreqDist(nltk.corpus.brown.words())
cumulative = 0.0
most_common_words = [word for (word, count) in fd.most_common()]
for rank, word in enumerate(most_common_words):
    cumulative += fd.freq(word)
    print("%3d %6.2f%% %s" % (rank + 1, cumulative * 100, word))
    if cumulative > 0.25:
        break

#### Longest Length Word Example

In [None]:
# Procedural Style

text = nltk.corpus.gutenberg.words('milton-paradise.txt')
longest = ''
for word in text:
    if len(word) > len(longest):
        longest = word
        
longest

In [None]:
# Declarative Style

maxlen = max(len(word) for word in text)

[word for word in text if len(word) == maxlen]

### Some Legitimate Uses for Counters

In [None]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
n = 3
[sent[i:i+n] for i in range(len(sent)-n+1)]

In [None]:
m, n = 3, 7
array = [[set() for i in range(n)] for j in range(m)]
array[2][5].add('Alice')

pprint.pprint(array)

In [None]:
array = [[set()] * n] * m
array[2][5].add(7)
pprint.pprint(array)

## 4.4 - Functions: The Foundation of Structured Programming

In [None]:
def get_text(file):
    """Read text from a file, normalizing whitespace and stripping HTML markup."""
    text = open(file).read()
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

In [None]:
help(get_text)

### Function Inputs and Outputs

In [None]:
def repeat(msg, num):
    return ' '.join([msg] * num)

monty = 'Monty Python'

repeat(monty, 3)

In [None]:
def monty():
    return "Monty Python"

monty()

In [None]:
repeat(monty(), 3)

In [None]:
repeat('Monty Python', 3)

In [None]:
def my_sort1(mylist):      # good: modifies its argument, no return value
    mylist.sort()

def my_sort2(mylist):      # good: doesn't touch its argument, returns value
    return sorted(mylist)

def my_sort3(mylist):      # bad: modifies its argument and also returns it
    mylist.sort()
    return mylist

### Parameter Passing

In [None]:
# call-by-value

def set_up(word, properties):
    word = 'lolcat'
    properties.append('noun')
    properties = 5

w = ''
p = []
set_up(w, p)

In [None]:
w

In [None]:
p

In [None]:
w = ''
word = w
word = 'lolcat'
w

In [None]:
p = []
properties = p
properties.append('noun')
properties = 5
p

### Checking Parameter Types

In [None]:
def tag(word):
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

In [None]:
tag('the')

In [None]:
tag('knight')

In [None]:
tag(["'Tis", 'but', 'a', 'scratch'])

In [None]:
def tag(word):
    assert isinstance(word, str), "argument to tag() must be a string"
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

### Functional Decomposition

**Note:** The frequency distribution will differ from the book because the webpage has likely changed.

In [None]:
from urllib import request
from bs4 import BeautifulSoup

def freq_words(url, freqdist, n):
    html = request.urlopen(url).read().decode('utf8')
    raw = BeautifulSoup(html, 'html.parser').get_text()
    for word in word_tokenize(raw):
        freqdist[word.lower()] += 1
    result = []
    for word, count in freqdist.most_common(n):
        result = result + [word]
    print(result)

In [None]:
constitution = "http://www.archives.gov/exhibits/charters/constitution_transcript.html"
fd = nltk.FreqDist()

freq_words(constitution, fd, 30)

In [None]:
from urllib import request
from bs4 import BeautifulSoup

def freq_words(url, n):
    html = request.urlopen(url).read().decode('utf8')
    text = BeautifulSoup(html, 'html.parser').get_text()
    freqdist = nltk.FreqDist(word.lower() for word in word_tokenize(text))
    return [word for (word, _) in freqdist.most_common(n)]

In [None]:
freq_words(constitution, 30)

### Documenting Functions

In [None]:
def accuracy(reference, test):
    """
    Calculate the fraction of test items that equal the corresponding reference items.

    Given a list of reference values and a corresponding list of test values,
    return the fraction of corresponding values that are equal.
    In particular, return the fraction of indexes
    {0<=i<len(test)} such that C{test[i] == reference[i]}.

        >>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ'])
        0.5

    :param reference: An ordered list of reference values
    :type reference: list
    :param test: A list of values to compare against the corresponding
        reference values
    :type test: list
    :return: the accuracy score
    :rtype: float
    :raises ValueError: If reference and test do not have the same length
    """

    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    num_correct = 0
    for x, y in zip(reference, test):
        if x == y:
            num_correct += 1
    return float(num_correct) / len(reference)

In [None]:
help(accuracy)

## 4.5 - Doing More with Functions

### Functions As Arguments

In [None]:
sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the', \
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

def extract_property(prop):
    return [prop(word) for word in sent]

In [None]:
extract_property(len)

In [None]:
def last_letter(word):
    return word[-1]

extract_property(last_letter)

#### lambda expressions

In [None]:
extract_property(lambda w: w[-1])

In [None]:
sorted(sent)

**Note:** Python 3 removed the `cmp` argument from `sorted` (and the built-in `cmp` function). To control the ordering, pass a `key` function, which is called once on each element to produce the value that is compared.

* [sorted Python 3 documentation](https://docs.python.org/3/howto/sorting.html)
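
If you come across an old `cmp`-style comparison function, the standard library's `functools.cmp_to_key` can wrap it for use as a `key`. The sketch below is only an illustration: the comparison function `by_length_desc` and the short word list are made up for this example.

```python
from functools import cmp_to_key

def by_length_desc(a, b):
    """Old-style comparison: negative if a should sort before b."""
    return len(b) - len(a)

words = ['Take', 'care', 'of', 'the', 'sense']

# Wrap the comparison function so that sorted() can use it as a key.
via_cmp = sorted(words, key=cmp_to_key(by_length_desc))

# The same ordering expressed directly with key and reverse.
via_key = sorted(words, key=len, reverse=True)

print(via_cmp)
print(via_key)
```

Both calls give the same order here because Python's sort is stable, so words of equal length keep their original relative order.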

In [None]:
# solution using lambda functions
sorted(sent, key = lambda x: -len(x))

In [None]:
# cleaner solution
sorted(sent, key = len, reverse = True)

### Accumulative Functions

**Tip:**

* If you would like to explore the efficiency of running certain cells, you can use **magic functions** in your Jupyter Notebook.
* `%%time` reports the time it takes to run the cell *once*.
* `%%timeit` runs the cell repeatedly (by default 7 runs, each of which may loop over the cell several times) and reports the mean and standard deviation of the run time.


* [Built-in magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html)
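
Outside a notebook, the standard-library `timeit` module gives similar measurements. The following is a minimal sketch; the function `build_squares` and the repeat counts are arbitrary choices for illustration.

```python
import timeit

def build_squares():
    """A small piece of work to time."""
    return [i * i for i in range(1000)]

# Run 3 rounds of 100 calls each; each entry is the total time for one round.
timings = timeit.repeat(build_squares, repeat=3, number=100)

# The fastest round is usually the least disturbed by other processes.
best = min(timings)
print("best of 3: %.6f seconds per call" % (best / 100))
```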

In [None]:
def search1(substring, words):
    result = []
    for word in words:
        if substring in word:
            result.append(word)
    return result

def search2(substring, words):
    for word in words:
        if substring in word:
            yield word

In [None]:
%%time

for item in search1('zz', nltk.corpus.brown.words()):
    print(item, end=" ")
    
print("\n")

In [None]:
%%time

for item in search2('zz', nltk.corpus.brown.words()):
    print(item, end=" ")

print("\n")

In [None]:
def permutations(seq):
    if len(seq) <= 1:
        yield seq
    else:
        for perm in permutations(seq[1:]):
            for i in range(len(perm)+1):
                yield perm[:i] + seq[0:1] + perm[i:]

In [None]:
list(permutations(['police', 'fish', 'buffalo']))

### Higher-Order Functions

#### filter

In [None]:
def is_content_word(word):
    return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']

sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the', \
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

In [None]:
list(filter(is_content_word, sent))

In [None]:
[w for w in sent if is_content_word(w)]

#### map

In [None]:
lengths = list(map(len, nltk.corpus.brown.sents(categories='news')))
sum(lengths) / len(lengths)

In [None]:
lengths = [len(sent) for sent in nltk.corpus.brown.sents(categories='news')]
sum(lengths) / len(lengths)

**Note:** 

The result of `filter` must be converted to a list before its length can be checked. In Python 3, `filter` returns a lazy iterator rather than a list, so calling `len()` on it directly raises `TypeError: object of type 'filter' has no len()`.

In [None]:
list(map(lambda w: len(list(filter(lambda c: c.lower() in "aeiou", w))), sent))

In [None]:
[len([c for c in w if c.lower() in "aeiou"]) for w in sent]
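
The point about `len()` in the note above can be checked on a single word. This is a small sketch using the made-up example word `"Sounds"`; it shows the error and two ways around it.

```python
vowels = filter(lambda c: c.lower() in "aeiou", "Sounds")

# filter() returns a lazy iterator, which has no length.
try:
    len(vowels)
except TypeError as err:
    message = str(err)

# Option 1: materialize the iterator as a list, then take its length.
count_from_list = len(list(vowels))

# Option 2: count matching items without building a list at all.
count_from_sum = sum(1 for c in "Sounds" if c.lower() in "aeiou")

print(message)
print(count_from_list, count_from_sum)
```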

### Named Arguments

In [None]:
def repeat(msg='<empty>', num=1):
    return msg * num

In [None]:
repeat(num=3)

In [None]:
repeat(msg='Alice')

In [None]:
repeat(num=5, msg='Alice')

#### keyword arguments

In [None]:
def generic(*args, **kwargs):
    print(args)
    print(kwargs)
    
generic(1, "African swallow", monty="python")

In [None]:
song = [['four', 'calling', 'birds'],
        ['three', 'French', 'hens'],
        ['two', 'turtle', 'doves']]

In [None]:
list(zip(song[0], song[1], song[2]))

In [None]:
list(zip(*song))

#### freq_words

* [Source of Chapter 1 RST file](https://github.com/nltk/nltk_book/blob/master/book/ch01.rst)

In [None]:
def freq_words(file, min=1, num=10):
    text = open(file).read()
    tokens = word_tokenize(text)
    freqdist = nltk.FreqDist(t for t in tokens if len(t) >= min)
    return freqdist.most_common(num)

In [None]:
# download the ch01.rst from Github
!wget --no-check-certificate https://raw.githubusercontent.com/nltk/nltk_book/master/book/ch01.rst

In [None]:
fw = freq_words('ch01.rst', 4, 10)
fw = freq_words('ch01.rst', min=4, num=10)
fw = freq_words('ch01.rst', num=10, min=4)

In [None]:
fw

In [None]:
def freq_words(file, min=1, num=10, verbose=False):
    freqdist = nltk.FreqDist()
    if verbose: print("Opening", file)
    text = open(file).read()
    if verbose: print("Read in %d characters" % len(text))
    for word in word_tokenize(text):
        if len(word) >= min:
            freqdist[word] += 1
        if verbose and freqdist.N() % 100 == 0: print(".", end="")
    if verbose: print()
    return freqdist.most_common(num)

In [None]:
freq_words('ch01.rst', min=4, num=10, verbose=True)

## 4.6 - Program Development

### Structure of a Python Module

**Note:** 

You can find a list of the NLTK Modules here:

* [NLTK 3.5 API](https://www.nltk.org/api/nltk.html)
* [Source code for nltk.metrics.distance](https://www.nltk.org/_modules/nltk/metrics/distance.html#binary_distance)

The NLTK API reference lets you browse NLTK's source code and the documentation for each module's functions. It is especially helpful when debugging code that uses NLTK modules.

NLTK has been packaged differently since the book was written, and the `nltk.metrics.distance` module is no longer accessible. The example code below therefore uses `nltk.metrics` to demonstrate how to find the file location of a module's source code. More information is available here:

* [Missing modules from nltk.metrics in Python 3.5 ??](https://github.com/nltk/nltk/issues/1516)
* [Chapter 4: update Python 2 example for module file location](https://github.com/nltk/nltk_book/issues/229)

In [None]:
nltk.metrics.__file__

### Sources of Error

In [None]:
def find_words(text, wordlength, result=[]):
    for word in text:
        if len(word) == wordlength:
            result.append(word)
    return result

In [None]:
find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3)

In [None]:
find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 2, ['ur'])

In [None]:
# keep running this cell to see the error
find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3)
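
The results keep growing because a default argument value is created once, when the function is defined, and the same list is then reused by every call that omits `result`. A common way to avoid this is to default to `None` and create the list inside the function. The sketch below uses the hypothetical name `find_words_safe` so that it does not overwrite the version above.

```python
def find_words_safe(text, wordlength, result=None):
    """Like find_words, but with a fresh result list on every call."""
    if result is None:      # create the list at call time, not definition time
        result = []
    for word in text:
        if len(word) == wordlength:
            result.append(word)
    return result

sent = ['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat']
first = find_words_safe(sent, 3)
second = find_words_safe(sent, 3)   # same answer, no leftovers from the first call

print(first)
print(second)
```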

### Debugging Techniques

In [None]:
import pdb
find_words(['cat'], 3)

In [None]:
# at the (Pdb) prompt, type `step` to advance into the call
# type `args` to show the arguments of the current function

# keep entering `step` to walk through the code line by line,
# or type `exit` to leave the debugger

pdb.run("find_words(['dog'], 3)")

## 4.7 - Algorithm Design

### Recursion

#### factorial

**Tip:**

Try running `factorial2` with a large number (like `100000`). You will likely get a `RecursionError` saying that the maximum recursion depth was exceeded; this means the calls to `factorial2` were nested more deeply than Python's recursion limit allows.

In [None]:
# factorial without recursion

def factorial1(n):
    result = 1
    for i in range(n):
        result *= (i+1)
    return result

In [None]:
# factorial with recursion

def factorial2(n):
    if n == 1:
        return 1
    else:
        return n * factorial2(n-1)
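
The recursion limit mentioned in the tip can be inspected with `sys.getrecursionlimit()`. The sketch below is self-contained, so it defines its own recursive function (the name `factorial_recursive` is made up here) and catches the error rather than letting the cell fail.

```python
import sys

def factorial_recursive(n):
    """Recursive factorial, equivalent in spirit to factorial2."""
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

limit = sys.getrecursionlimit()   # maximum nesting depth of Python calls

try:
    factorial_recursive(limit + 100)   # needs more nested calls than allowed
    hit_limit = False
except RecursionError:
    hit_limit = True

print("recursion limit:", limit)
print("limit exceeded:", hit_limit)
```

The loop-based `factorial1` has no such restriction, because it never nests function calls.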

#### size

In [None]:
# size with recursion

def size1(s):
    return 1 + sum(size1(child) for child in s.hyponyms())

In [None]:
# size without recursion

def size2(s):
    layer = [s]
    total = 0
    while layer:
        total += len(layer)
        layer = [h for c in layer for h in c.hyponyms()]
    return total

In [None]:
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')

In [None]:
size1(dog)

In [None]:
size2(dog)

#### insert

In [None]:
def insert(trie, key, value):
    if key:
        first, rest = key[0], key[1:]
        if first not in trie:
            trie[first] = {}
        insert(trie[first], rest, value)
    else:
        trie['value'] = value

In [None]:
trie = {}
insert(trie, 'chat', 'cat')
insert(trie, 'chien', 'dog')
insert(trie, 'chair', 'flesh')
insert(trie, 'chic', 'stylish')
trie = dict(trie)               # for nicer printing

In [None]:
trie['c']['h']['a']['t']['value']

In [None]:
pprint.pprint(trie, width=40)

### Space-Time Tradeoffs

#### code_search_documents.py

The program below searches the `movie_reviews` corpus for a query entered by the user and prints a snippet of text around each match. The program stops when `quit` is given as the query.

In [None]:
def raw(file):
    contents = open(file).read()
    contents = re.sub(r'<.*?>', ' ', contents)
    contents = re.sub(r'\s+', ' ', contents)
    return contents

def snippet(doc, term):
    text = ' '*30 + raw(doc) + ' '*30
    pos = text.index(term)
    return text[pos-30:pos+30]

print("Building Index...")
files = nltk.corpus.movie_reviews.abspaths()
idx = nltk.Index((w, f) for f in files for w in raw(f).split())

query = input("query> ")
while query != "quit":    
    if query in idx:
        for doc in idx[query]:
            print(snippet(doc, query))
    else:
        print("Not found")
    query = input("query> ")
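
The space-time tradeoff in this program is that the index is built once, up front, so that each query becomes a dictionary lookup instead of a scan over every file. The sketch below shows the same idea without the corpus or user input. The two documents are invented for illustration, and a plain `collections.defaultdict` stands in for `nltk.Index`.

```python
from collections import defaultdict

# Hypothetical in-memory documents standing in for the corpus files.
docs = {
    'review1.txt': 'a gripping thriller with a clever plot',
    'review2.txt': 'the plot is thin but the acting is superb',
}

# Build the index once: each word maps to the documents that contain it.
index = defaultdict(list)
for name, contents in docs.items():
    for word in set(contents.split()):
        index[word].append(name)

def snippet(contents, term, width=15):
    """Return the text surrounding the first occurrence of term."""
    pos = contents.index(term)
    return contents[max(0, pos - width):pos + len(term) + width]

def search(query):
    """Return (document, snippet) pairs for every document containing query."""
    return [(name, snippet(docs[name], query)) for name in index.get(query, [])]

for name, text in search('plot'):
    print(name, '->', text)
```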

#### code_strings_to_ints.py

In [None]:
def preprocess(tagged_corpus):
    words = set()
    tags = set()
    for sent in tagged_corpus:
        for word, tag in sent:
            words.add(word)
            tags.add(tag)
    wm = dict((w, i) for (i, w) in enumerate(words))
    tm = dict((t, i) for (i, t) in enumerate(tags))
    return [[(wm[w], tm[t]) for (w, t) in sent] for sent in tagged_corpus]

#### set vs. list

In [None]:
from timeit import Timer
vocab_size = 100000
setup_list = "import random; vocab = list(range(%d))" % vocab_size
setup_set = "import random; vocab = set(range(%d))" % vocab_size
statement = "random.randint(0, %d) in vocab" % (vocab_size * 2)

In [None]:
print(Timer(statement, setup_list).timeit(1000))

In [None]:
print(Timer(statement, setup_set).timeit(1000))

### Dynamic Programming

In [None]:
# (i) recursive
def virahanka1(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        return s + l

# (ii) bottom-up dynamic programming
def virahanka2(n):
    lookup = [[""], ["S"]]
    for i in range(n-1):
        s = ["S" + prosody for prosody in lookup[i+1]]
        l = ["L" + prosody for prosody in lookup[i]]
        lookup.append(s + l)
    return lookup[n]

# (iii) top-down dynamic programming
def virahanka3(n, lookup={0:[""], 1:["S"]}):
    if n not in lookup:
        s = ["S" + prosody for prosody in virahanka3(n-1)]
        l = ["L" + prosody for prosody in virahanka3(n-2)]
        lookup[n] = s + l
    return lookup[n]

# (iv) built-in memoization
from nltk import memoize
@memoize
def virahanka4(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka4(n-1)]
        l = ["L" + prosody for prosody in virahanka4(n-2)]
        return s + l

In [None]:
virahanka1(4)

In [None]:
virahanka2(4)

In [None]:
virahanka3(4)

In [None]:
virahanka4(4)
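
The standard library offers memoization as well, through `functools.lru_cache`. The sketch below is an alternative to the `nltk.memoize` version, not a replacement for it. The name `virahanka_cached` is made up, and the function returns tuples instead of lists so that cached results cannot be modified by accident.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def virahanka_cached(n):
    """Return all meters of length n as a tuple of strings."""
    if n == 0:
        return ("",)
    if n == 1:
        return ("S",)
    # Each subproblem is computed once; repeated calls hit the cache.
    s = tuple("S" + prosody for prosody in virahanka_cached(n - 1))
    l = tuple("L" + prosody for prosody in virahanka_cached(n - 2))
    return s + l

meters = virahanka_cached(4)
print(meters)
```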

## 4.8 - A Sample of Python Libraries

### [Matplotlib](https://matplotlib.org/)

In [None]:
from numpy import arange
from matplotlib import pyplot as plt

colors = 'rgbcmyk' # red, green, blue, cyan, magenta, yellow, black

def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    ind = arange(len(words))
    width = 1 / (len(categories) + 1)
    bar_groups = []
    for c in range(len(categories)):
        bars = plt.bar(ind+c*width, counts[categories[c]], width,
                         color=colors[c % len(colors)])
        bar_groups.append(bars)
    plt.xticks(ind+width, words)
    plt.legend([b[0] for b in bar_groups], categories, loc='upper left')
    plt.ylabel('Frequency')
    plt.title('Frequency of Six Modal Verbs by Genre')
    plt.show()

In [None]:
genres = ['news', 'religion', 'hobbies', 'government', 'adventure']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfdist = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in genres
    for word in nltk.corpus.brown.words(categories=genre)
    if word in modals)

counts = {}
for genre in genres:
    counts[genre] = [cfdist[genre][word] for word in modals]
    
bar_chart(genres, modals, counts)

In [None]:
# run this cell to save your graph

from matplotlib import use

use('TkAgg')

bar_chart(genres, modals, counts)

plt.savefig("modals.png", facecolor='w', transparent=False)

In [None]:
# run this to see your saved graph

from IPython.display import Image
Image(filename="modals.png")

### [NetworkX](https://networkx.org/)

In [None]:
import networkx as nx
import matplotlib.pyplot as plt
from nltk.corpus import wordnet as wn

def traverse(graph, start, node):
    # in NLTK 3, name() is a method on Synset objects
    graph.depth[node.name()] = node.shortest_path_distance(start)
    for child in node.hyponyms():
        graph.add_edge(node.name(), child.name())
        traverse(graph, start, child)

def hyponym_graph(start):
    G = nx.Graph()
    G.depth = {}
    traverse(G, start, start)
    return G

def graph_draw(graph):
    nx.draw(graph,
         node_size = [16 * graph.degree(n) for n in graph],
         node_color = [graph.depth[n] for n in graph])
    plt.show()

In [None]:
dog = wn.synset('dog.n.01')
graph = hyponym_graph(dog)
graph_draw(graph)

### [csv](https://docs.python.org/3/library/csv.html)

In [None]:
import csv
input_file = open("data/lexicon.csv", "r")

In [None]:
for row in csv.reader(input_file):
    print(row)

### [pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)

Pandas is another library that can read `csv` files. It creates a useful datatype called a `DataFrame` that allows users to run queries on data similar to spreadsheet formulas. This is a very helpful way to organize tabulated data in Python.

In [None]:
import pandas as pd

lexicon = pd.read_csv("data/lexicon.csv", names=['col1', 'col2', 'col3', 'col4'])

In [None]:
lexicon

In [None]:
type(lexicon)

### [NumPy](https://numpy.org/)

* Understanding NumPy arrays is useful when working with many machine learning models.
* NumPy implements its array operations in compiled C code, which makes calculations on NumPy arrays fast.

In [None]:
from numpy import array

cube = array([ [[0,0,0], [1,1,1], [2,2,2]],
              [[3,3,3], [4,4,4], [5,5,5]],
              [[6,6,6], [7,7,7], [8,8,8]] ])

In [None]:
cube[1,1,1]

In [None]:
cube[2].transpose()

In [None]:
cube[2,1:]

#### latent semantic analysis

In [None]:
from numpy import linalg

a=array([[4,0], [3,-5]])
u,s,vt = linalg.svd(a)

In [None]:
u

In [None]:
s

In [None]:
vt

## Your Turn Solutions

### 4.1

**Your Turn:** Use multiplication to create a list of lists: `nested = [[]] * 3`. Now modify one of the elements of the list, and observe that all the elements are changed. Use Python's `id()` function to find out the numerical identifier for any object, and verify that `id(nested[0])`, `id(nested[1])`, and `id(nested[2])` are all the same.

In [None]:
nested = [[]] * 3

In [None]:
nested[0].append('test')

In [None]:
nested

In [None]:
id(nested[0]) == id(nested[1]) == id(nested[2])
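
For comparison, a list comprehension evaluates `[]` once per element, so it produces three separate lists. The variable names `shared` and `independent` in this short sketch are chosen only for illustration.

```python
shared = [[]] * 3                      # three references to one list
independent = [[] for _ in range(3)]   # three distinct lists

shared[0].append('test')
independent[0].append('test')

print(shared)        # the change shows up in every position
print(independent)   # only the first list changed

# Count how many distinct objects each outer list refers to.
print(len({id(item) for item in shared}), len({id(item) for item in independent}))
```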

### 4.2

**Your Turn:** Convert `lexicon` to a tuple, using `lexicon = tuple(lexicon)`, then try each of the above operations, to confirm that none of them is permitted on tuples.

In [None]:
lexicon = [
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f'])
]

lexicon = tuple(lexicon)

In [None]:
lexicon

In [None]:
type(lexicon)

These list operations fail on a `tuple` because tuples are immutable: they have no `sort()` method and do not support item assignment or item deletion.

In [None]:
lexicon.sort()

In [None]:
lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])

In [None]:
del lexicon[0]
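
The three failures can also be collected in one cell without stopping on the first error. The sketch below records the exception type raised by each operation (the `failures` dictionary is just a helper for this example) and then shows that `sorted()` still works, because it builds a new list and leaves the tuple unchanged.

```python
lexicon = (
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f']),
)

failures = {}

try:
    lexicon.sort()                  # tuples have no sort() method
except AttributeError as err:
    failures['sort'] = type(err).__name__

try:
    lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])   # no item assignment
except TypeError as err:
    failures['assign'] = type(err).__name__

try:
    del lexicon[0]                  # no item deletion
except TypeError as err:
    failures['delete'] = type(err).__name__

# sorted() returns a new list; the original tuple keeps its order.
ordered = sorted(lexicon)

print(failures)
print(ordered)
```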