## Scope and Functions

Take a look at the code below and guess what values are printed. If the answer isn't obvious, we need to talk briefly about scoping.

In [4]:
def my_fcn(val) :
    x = 20
    val = val + x
    return(val)


In [5]:
x = 100
print(my_fcn(15))
print(x)

35
100
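
The assignment `x = 20` inside `my_fcn` creates a new local name, so the module-level `x` is left alone and still prints as 100. Here is a minimal sketch of the same idea; the function name `show_scope` is made up for illustration and returns the local `x` so both values can be inspected.

```python
x = 100  # module-level (global) name

def show_scope(val):
    # Hypothetical helper: mirrors my_fcn but also returns the local x.
    x = 20  # a new local x; it shadows the global only inside this function
    return val + x, x

result, local_x = show_scope(15)
print(result)   # 35
print(local_x)  # 20, the local value
print(x)        # 100, the global was never modified
```

To change the module-level `x` from inside a function you would need a `global x` declaration, which is usually best avoided.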


---


## Sandbox

Throughout the class I'll ask for exercises. This notebook holds my exploration of the results.

In [6]:
%matplotlib inline

import nltk
from nltk.book import *

import re
import numpy as np
import matplotlib 
import matplotlib.pyplot as plt

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908




## Normalization

Questions:

1. Find emojis in the chat corpus.

1. Determine a normalization scheme. (What needs to be normalized, how would you do it?)

1. Count the happy vs sad emojis.

In [17]:
chat = text5 # give it a nice name. 

# Let's find emojis in chat. 
potential_emojis = {w for w in chat if ":" in w or ";" in w or "=" in w}

In [18]:
potential_emojis

{'!=',
 '.:',
 '.;)',
 '//www.wunderground.com/cgi-bin/findweather/getForecast?query=95953#FIR',
 '10:49',
 '2:55',
 '3:45',
 '4:03',
 '6:38',
 '6:41',
 '6:51',
 '6:53',
 '7:45',
 '9:10',
 ':',
 ':(',
 ':)',
 ':):):)',
 ':-(',
 ':-)',
 ':-@',
 ':-o',
 ':.',
 ':/',
 ':@',
 ':D',
 ':O',
 ':P',
 ':]',
 ':beer:',
 ':blush:',
 ':love:',
 ':o *',
 ':p',
 ':tongue:',
 ':|',
 ';',
 '; ..',
 ';)',
 ';-(',
 ';-)',
 ';0',
 ';]',
 ';p',
 '=',
 "='s",
 '=(',
 '=)',
 '=-\\',
 '=/',
 '=D',
 '=O',
 '=[',
 '=]',
 '=p',
 '>:->',
 ']:)',
 'capab;e',
 'd=',
 'http://forums.talkcity.com/tc-adults/start ',
 'http://www.shadowbots.com',
 'n;t',
 'o<|=D'}

In [25]:
# These are all oriented left-to-right, so let's make a regex to find them. 
emoji = re.compile(r"^[:;=]-?[)(\]PD@op|O]$") # misses '>:->' and ']:)' and repeats. Insert shruggie
emoji2 = re.compile(r"^[:;=]-?.$")
emojis = {w for w in chat if emoji2.search(w)}
sorted(emojis)
#len(emojis)
# could normalize by removing hyphens and converting letters to upper case

[':(',
 ':)',
 ':-(',
 ':-)',
 ':-@',
 ':-o',
 ':.',
 ':/',
 ':@',
 ':D',
 ':O',
 ':P',
 ':]',
 ':p',
 ':|',
 ';)',
 ';-(',
 ';-)',
 ';0',
 ';]',
 ';p',
 '=(',
 '=)',
 '=-\\',
 '=/',
 '=D',
 '=O',
 '=[',
 '=]',
 '=p']
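
One possible normalization scheme, following the comment in the cell above, is to drop the hyphen "nose" and upper-case any letter used as a mouth. The sketch below is only one way to do it; the names `EMOTICON` and `normalize_emoticon` are assumptions, and the pattern is a close cousin of `emoji2` with capture groups added.

```python
import re

# Assumed pattern: eyes, optional hyphen nose, exactly one mouth character.
EMOTICON = re.compile(r"^([:;=])-?(.)$")

def normalize_emoticon(token):
    """Return a canonical form for a short emoticon,
    or None if the token does not look like one."""
    match = EMOTICON.match(token)
    if match is None:
        return None
    eyes, mouth = match.groups()
    return eyes + mouth.upper()  # drop the nose, upper-case letter mouths

# A few tokens of the kinds seen in the output above.
for tok in [":-)", ":)", ":p", ";-(", "10:49", ":):):)"]:
    print(repr(tok), "->", normalize_emoticon(tok))
```

Like `emoji2`, this still misses repeats such as `':):):)'` and right-to-left forms such as `']:)'`.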

In [29]:
x = "abcdefg"

x[-3:]

'efg'

In [20]:
# Count happy vs sad
happy = [w for w in chat if w in {":-)",":)",":D",";-)","=)"}]
sad = [w for w in chat if w in {":-(",":(",";-(","=("}]

print(len(happy))
print(len(sad))

159
20
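
The two list comprehensions above each walk the corpus once. A single pass with a `Counter` gives the same tallies and scales to more categories. This is a sketch under assumptions: the sets are copied from the cell above, `count_sentiment` is a made-up name, and the sample tokens are invented rather than taken from the chat corpus.

```python
from collections import Counter

# Sentiment sets copied from the cell above.
HAPPY = {":-)", ":)", ":D", ";-)", "=)"}
SAD = {":-(", ":(", ";-(", "=("}

def count_sentiment(tokens):
    """Tally happy and sad emoticons in one pass over the tokens."""
    counts = Counter()
    for tok in tokens:
        if tok in HAPPY:
            counts["happy"] += 1
        elif tok in SAD:
            counts["sad"] += 1
    return counts

# Made-up sample tokens for illustration.
sample = ["hi", ":)", "lol", ":-(", ":D", ":)", "brb"]
counts = count_sentiment(sample)
print(counts["happy"], counts["sad"])
```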


---

## Stemming

Let's go through some stemming examples from the NLTK.

In [32]:
x = text4[:30]

In [30]:
vowels = re.compile(r'[aeiouyAEIOU]')

len({w for w in nltk.corpus.words.words() if not vowels.search(w[:-3]) and w[-3:] == "ing"})

35

In [31]:
porter = nltk.PorterStemmer() # give it a short name.
start = 30000
distance = 100

print(" ".join(text4[start:(start + distance)]))
print("\n\n")
print(" ".join([porter.stem(w) for w in text4[start:(start + distance)]]))



aid of that Almighty Power which has hitherto protected me and enabled me to bring to favorable issues other important but still greatly inferior trusts heretofore confided to me by my country . The broad foundation upon which our Constitution rests being the people -- a breath of theirs having made , as a breath can unmake , change , or modify it -- it can be assigned to none of the great divisions of government but to that of democracy . If such is its theory , those who are called upon to administer it must recognize as its



aid of that Almighti Power which ha hitherto protect me and enabl me to bring to favor issu other import but still greatli inferior trust heretofor confid to me by my countri . The broad foundat upon which our Constitut rest be the peopl -- a breath of their have made , as a breath can unmak , chang , or modifi it -- it can be assign to none of the great divis of govern but to that of democraci . If such is it theori , those who are call upon to administ it mu

In [None]:
# words in inaugural addresses
print(len(set(text4)))

In [35]:
inaug_stemmed = {porter.stem(w.lower()) for w in text4}

print(len(inaug_stemmed))

print(len(set(text4))/len(inaug_stemmed))

5470
1.783180987202925
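
Note that the ratio above compares case-sensitive types in `text4` with lower-cased, stemmed types, so it mixes the effect of case folding with the effect of stemming. The sketch below separates the two on a made-up token list. `crude_stem` is a toy suffix stripper invented for this illustration; it is not the Porter algorithm.

```python
def crude_stem(word):
    """Toy suffix stripper for illustration only; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Made-up tokens, not drawn from the inaugural corpus.
tokens = ["The", "the", "Trusts", "trust", "trusted",
          "protected", "protecting", "people"]

raw_types = set(tokens)
lower_types = {t.lower() for t in tokens}
stem_types = {crude_stem(t.lower()) for t in tokens}

print(len(raw_types), len(lower_types), len(stem_types))
print(len(raw_types) / len(lower_types))   # reduction from case folding alone
print(len(lower_types) / len(stem_types))  # further reduction from stemming
```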


---

## Language Models
Let's find some common n-grams in *Sense and Sensibility* (`text2`).

In [42]:
fd.freq('a')

0.01443041193422614

In [47]:
nltk.corpus.stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '

In [50]:
# Build the stop word set once instead of reloading the list for every token.
stop_words = set(nltk.corpus.stopwords.words("english"))
fd = FreqDist([w.lower() for w in text2 if w.isalpha() and w.lower() not in stop_words])
total_words = sum(fd.values())

for pairs in fd.most_common(20) :
    print(" : ".join([pairs[0],str(pairs[1]),str(pairs[1]/total_words)]))
    
#4063/3861

elinor : 685 : 0.012688474789760307
could : 578 : 0.010706479457637166
marianne : 566 : 0.010484199607305598
mrs : 530 : 0.009817360056310896
would : 515 : 0.009539510243396436
said : 397 : 0.0073537583818026895
every : 377 : 0.0069832919645834105
one : 331 : 0.006131219204979069
much : 290 : 0.005371763049679547
must : 283 : 0.005242099803652799
sister : 282 : 0.005223576482791835
edward : 263 : 0.00487163338643352
mother : 258 : 0.0047790167821287
dashwood : 252 : 0.004667876856962916
well : 240 : 0.004445597006631349
time : 239 : 0.004427073685770385
know : 232 : 0.004297410439743637
jennings : 230 : 0.004260363798021709
though : 216 : 0.004001037305968214
willoughby : 216 : 0.004001037305968214


In [71]:
fd = FreqDist([" ".join(b) for b in nltk.ngrams(text2,3) if b[0] == "I" and b[1] == "am"]) # could use bigram function instead

In [72]:
fd.most_common(10)

[('I am sure', 72),
 ('I am not', 12),
 ('I am sorry', 11),
 ('I am so', 11),
 ('I am afraid', 11),
 ('I am very', 10),
 ('I am now', 4),
 ('I am glad', 4),
 ('I am monstrous', 4),
 ('I am always', 3)]

In [63]:
for gram, count in fd.items() :
    if gram[0] == "I" :
        print(" ".join(gram) + ": " + str(count))

        

I turn: 1
I walked: 2
I fancy: 3
I known: 1
I happened: 3
I suspect: 1
I travelled: 1
I admired: 1
I profess: 1
I told: 5
I mentioned: 1
I expected: 1
I abhor: 1
I take: 1
I longed: 1
I DO: 1
I sha: 1
I write: 1
I fear: 5
I more: 1
I pity: 1
I stop: 1
I wonder: 11
I only: 7
I know: 56
I assure: 17
I heartily: 1
I give: 1
I directly: 1
I ever: 13
I both: 1
I imagine: 1
I trusted: 1
I did: 25
I tell: 9
I talked: 3
I call: 2
I question: 1
I alluded: 1
I would: 35
I got: 2
I sent: 3
I dare: 36
I see: 13
I wanted: 3
I acquit: 1
I ought: 3
I endured: 1
I to: 5
I allowed: 1
I earnestly: 1
I ask: 3
I speak: 2
I remain: 1
I avoided: 1
I returned: 2
I insist: 1
I guessed: 2
I felt: 18
I might: 12
I stay: 1
I suffered: 2
I learnt: 2
I quitted: 1
I value: 2
I formed: 1
I copied: 1
I dreaded: 1
I admire: 2
I once: 2
I may: 16
I spoken: 1
I not: 1
I distress: 1
I can: 56
I immediately: 1
I tried: 1
I protest: 1
I say: 3
I find: 1
I entreat: 2
I sat: 1
I don: 6
I feared: 1
I made: 3
I feel: 6
I expre

In [60]:
total_words = sum([count for pair, count in fd.items() if pair[0] == "I"])

In [61]:
total_words

2004

In [64]:
for gram, count in sorted(fd.items()) :  # default sort is alphabetical by gram
    if gram[0] == "I" : 
        print(gram)
        print(count)

('I', "'")
6
('I', ',')
13
('I', ',"')
1
('I', '--')
2
('I', '.')
1
('I', '.--')
1
('I', 'AM')
1
('I', 'COULD')
2
('I', 'DID')
3
('I', 'DO')
1
('I', 'KNEW')
1
('I', 'SHOULD')
1
('I', 'TRIED')
1
('I', 'WAS')
1
('I', 'WILL')
4
('I', 'a')
1
('I', 'abhor')
1
('I', 'acknowledge')
1
('I', 'acquit')
1
('I', 'admire')
2
('I', 'admired')
1
('I', 'advise')
3
('I', 'advised')
1
('I', 'allowed')
1
('I', 'alluded')
1
('I', 'always')
7
('I', 'am')
223
('I', 'an')
1
('I', 'approached')
1
('I', 'ask')
3
('I', 'assure')
17
('I', 'avoided')
1
('I', 'been')
3
('I', 'beg')
5
('I', 'began')
1
('I', 'begged')
1
('I', 'believe')
47
('I', 'blundered')
1
('I', 'both')
1
('I', 'bring')
1
('I', 'call')
2
('I', 'called')
1
('I', 'came')
5
('I', 'can')
56
('I', 'cannot')
40
('I', 'care')
2
('I', 'cease')
1
('I', 'certainly')
7
('I', 'chose')
1
('I', 'clearly')
1
('I', 'come')
4
('I', 'compare')
2
('I', 'conceal')
1
('I', 'confess')
9
('I', 'consider')
3
('I', 'considered')
1
('I', 'contradicted')
1
('I', 'convince

In [38]:
for gram,count in sorted(fd.items(), key=lambda pair: pair[1], reverse=True) : 
    if gram[0] == "I" :
        print(" : ".join([str(gram),str(count),str(round(count/total_words,3))]))        

('I', 'am') : 223 : 0.111
('I', 'have') : 192 : 0.096
('I', 'was') : 86 : 0.043
('I', 'should') : 69 : 0.034
('I', 'do') : 68 : 0.034
('I', 'had') : 66 : 0.033
('I', 'shall') : 64 : 0.032
('I', 'know') : 56 : 0.028
('I', 'can') : 56 : 0.028
('I', 'could') : 56 : 0.028
('I', 'think') : 55 : 0.027
('I', 'believe') : 47 : 0.023
('I', 'hope') : 42 : 0.021
('I', 'cannot') : 40 : 0.02
('I', 'dare') : 36 : 0.018
('I', 'would') : 35 : 0.017
('I', 'never') : 35 : 0.017
('I', 'must') : 34 : 0.017
('I', 'will') : 33 : 0.016
('I', 'suppose') : 32 : 0.016
('I', 'thought') : 29 : 0.014
('I', 'did') : 25 : 0.012
('I', 'wish') : 24 : 0.012
('I', 'felt') : 18 : 0.009
('I', 'assure') : 17 : 0.008
('I', 'may') : 16 : 0.008
('I', 'saw') : 14 : 0.007
('I', 'ever') : 13 : 0.006
('I', 'see') : 13 : 0.006
('I', ',') : 13 : 0.006
('I', 'might') : 12 : 0.006
('I', 'wonder') : 11 : 0.005
('I', 'understand') : 11 : 0.005
('I', 'declare') : 10 : 0.005
('I', 'tell') : 9 : 0.004
('I', 'heard') : 9 : 0.004
('I', 'con

In [65]:
?sorted

In [None]:
223/192

In [66]:
fd.most_common(10)

[((',', 'and'), 1598),
 (("'", 's'), 700),
 ((';', 'and'), 605),
 (('Mrs', '.'), 529),
 (('of', 'the'), 430),
 (('."', '"'), 428),
 (('to', 'be'), 428),
 ((',', '"'), 392),
 (('.', '"'), 369),
 (('in', 'the'), 348)]

In [74]:
fd = FreqDist(nltk.ngrams(text2,3))

In [75]:
total_words = 0

for gram,count in sorted(fd.items(), key=lambda pair: pair[1], reverse=True) : 
    if gram[0] == "I" and gram[1] == "am" :
        total_words += count
        print(" : ".join([str(gram),str(count)])) 
        

print(72/total_words)
print(12/total_words)

('I', 'am', 'sure') : 72
('I', 'am', 'not') : 12
('I', 'am', 'so') : 11
('I', 'am', 'sorry') : 11
('I', 'am', 'afraid') : 11
('I', 'am', 'very') : 10
('I', 'am', 'glad') : 4
('I', 'am', 'now') : 4
('I', 'am', 'monstrous') : 4
('I', 'am', ',') : 3
('I', 'am', 'much') : 3
('I', 'am', 'convinced') : 3
('I', 'am', 'always') : 3
('I', 'am', 'in') : 3
('I', 'am', 'perfectly') : 3
('I', 'am', 'alive') : 2
('I', 'am', 'only') : 2
('I', 'am', 'almost') : 2
('I', 'am', 'well') : 2
('I', 'am', 'the') : 2
('I', 'am', 'extremely') : 2
('I', 'am', 'to') : 2
('I', 'am', 'quite') : 2
('I', 'am', 'particularly') : 2
('I', 'am', 'capable') : 2
('I', 'am', 'grown') : 2
('I', 'am', 'unable') : 1
('I', 'am', 'resolved') : 1
('I', 'am', '.') : 1
('I', 'am', 'doing') : 1
('I', 'am', 'disappointed') : 1
('I', 'am', 'writing') : 1
('I', 'am', ';') : 1
('I', 'am', 'happy') : 1
('I', 'am', 'allowed') : 1
('I', 'am', 'right') : 1
('I', 'am', 'wretched') : 1
('I', 'am', 'miserable') : 1
('I', 'am', 'shut') : 1
('I
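
Dividing a trigram count by the total for its two-word prefix, as in `72/total_words` above, gives an estimate of the probability of the third word given the first two. The sketch below does this for every follower at once using only the standard library. The helper names `ngrams` and `next_word_probs` and the toy token list are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-token tuples (a stdlib stand-in for nltk.ngrams)."""
    return zip(*(tokens[i:] for i in range(n)))

def next_word_probs(tokens, prefix):
    """Relative frequency of each word that follows `prefix` in `tokens`."""
    prefix = tuple(prefix)
    n = len(prefix) + 1
    followers = Counter(g[-1] for g in ngrams(tokens, n) if g[:-1] == prefix)
    total = sum(followers.values())
    # An unseen prefix yields an empty dict rather than a division by zero.
    return {word: count / total for word, count in followers.items()}

# Made-up token list for illustration.
toy = "I am sure . I am not sure . I am sure I am".split()
probs = next_word_probs(toy, ("I", "am"))
print(probs)
```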

In [76]:
text2.concordance("I")

Displaying 25 of 2004 matches:
 to me ," replied her husband , " that I should assist his widow and daughters 
 did not know what he was talking of , I dare say ; ten to one but he was light
ly to myself . He could hardly suppose I should neglect them . But as he requir
hem . But as he required the promise , I could not do less than give it ; at le
ld not do less than give it ; at least I thought so at the time . The promise ,
t you have such a generous spirit !" " I would not wish to do any thing mean ,"
little . No one , at least , can think I have not done enough for them : even t
can afford to do ." " Certainly -- and I think I may afford to give them five h
rd to do ." " Certainly -- and I think I may afford to give them five hundred p
 That is very true , and , therefore , I do not know whether , upon the whole ,
 them -- something of the annuity kind I mean .-- My sisters would feel the goo
 are not aware of what you are doing . I have known a great deal of the trouble
such an a

In [77]:
# need this for phrases
from nltk.app import concordance

In [81]:
text2.concordance("sure")

Displaying 25 of 136 matches:
ur poor little boy --" " Why , to be sure ," said her husband , very gravely ,
 very convenient addition ." " To be sure it would ." " Perhaps , then , it wo
rtune for any young woman ." " To be sure it is ; and , indeed , it strikes me
 them . If they marry , they will be sure of doing well , and if they do not ,
g her consent to this plan . " To be sure ," said she , " it is better than pa
 abhorrence of annuities , that I am sure I would not pin myself down to the p
e their style of living if they felt sure of a larger income , and would not b
g my promise to my father ." " To be sure it will . Indeed , to say the truth 
will be ! Five hundred a year ! I am sure I cannot imagine how they will spend
and if THAT were your opinion , I am sure you could never be civil to him ." M
that is worthy and amiable ." " I am sure ," replied Elinor , with a smile , "
n travelling so far to see me , I am sure I will find none in accommodating th
 . " As for the house 

---

## N-gram models

Let's make a function that takes in text, builds a frequency distribution, and generates text from n-grams of various sizes.

In [83]:
import random

def weighted_choice(freq_dist):
    weight_total = sum([count for token,count in freq_dist.items()])
    n = random.uniform(0, weight_total)
    for token, count in freq_dist.items() :
        if n < count:
            return(token)
        n = n - count
    return(token)
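
`weighted_choice` draws a point on the line from 0 to the total count and walks through the tokens until that point falls inside one token's share. The standard library's `random.choices` accepts weights directly, so a shorter equivalent is possible. This is a sketch, not a drop-in replacement; `weighted_choice_alt` and the toy counts are made up for illustration.

```python
import random

def weighted_choice_alt(freq_dist):
    """Pick one key with probability proportional to its count."""
    tokens = list(freq_dist.keys())
    counts = list(freq_dist.values())
    return random.choices(tokens, weights=counts, k=1)[0]

# Made-up counts; 'a' should come up roughly 90% of the time.
toy_counts = {"a": 90, "b": 10}
random.seed(0)  # fixed seed so repeated runs give the same draws
draws = [weighted_choice_alt(toy_counts) for _ in range(1000)]
print(draws.count("a") / len(draws))
```

Because it only uses `keys()` and `values()`, the same function should accept a plain dict or a `FreqDist`, which is a `Counter` subclass.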

In [107]:
weighted_choice(FreqDist(text5))

'lol'

In [108]:
def generate_unigram(text,length=10) :
    fd = FreqDist(text)
    
    results = []
    for i in range(length) :
        results.append(weighted_choice(fd))
        
    return(" ".join(results))


In [112]:
generate_unigram(text1)

', Moby respect had three fights when he their you'

In [110]:
generate_unigram(text2)

'your seeing dislike liberty would I know Jennings ; .'

In [111]:
generate_unigram(text5)

'puff here m is is , .. probably i that'

In [114]:
def weighted_choice_ngram(cur_word,freq_dist) :
    ''' Starts with a current word and randomly chooses 
        a following word based on the bigrams. '''
    
    # First, build a dict of the form
    # {'a_word': count}
    # from every freq_dist entry of the form
    # (('cur_word', 'a_word'), count)
    sub_dist = {}
    
    for bigram, count in freq_dist.items() :
        if bigram[0] == cur_word :
            sub_dist[bigram[1]] = count
    
    return(weighted_choice(sub_dist))

def generate_bigram(text,length=10,start=None) :
    
    if not start :
        uni_fd = FreqDist(text)
        start = weighted_choice(uni_fd)
        
    fd = FreqDist(nltk.bigrams(text))
    
    results = []
    this_word = start
    for i in range(length) :
        this_word = weighted_choice_ngram(this_word,fd)
        results.append(this_word)
        
    return(" ".join(results))


In [116]:
generate_bigram(text1)

'the lone whale are infernal aforethought of angels mobbing thee'

In [120]:
generate_bigram(text2)

', his distress beyond a sore throat .-- When they'

In [118]:
generate_bigram(text5)

'gas is U37 PART ! U156 No ? :) ok'
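
`weighted_choice_ngram` scans the whole bigram distribution for every generated word, and `weighted_choice` fails if a word has no recorded follower (for example, a word that only appears as the final token). One way around both issues is to index followers once and stop at a dead end. The sketch below uses only the standard library and assumes the text is a plain list of tokens; `build_followers` and `generate_bigram_indexed` are hypothetical names.

```python
import random
from collections import Counter, defaultdict

def build_followers(tokens):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        followers[first][second] += 1
    return followers

def generate_bigram_indexed(tokens, length=10, start=None):
    followers = build_followers(tokens)
    # A uniform pick over tokens is already weighted by word frequency.
    word = start if start is not None else random.choice(tokens)
    results = []
    for _ in range(length):
        options = followers.get(word)
        if not options:  # dead end: this word was never followed by anything
            break
        word = random.choices(list(options), weights=list(options.values()))[0]
        results.append(word)
    return " ".join(results)

# Made-up token list for illustration.
toy = "the whale saw the ship and the ship saw the whale".split()
print(generate_bigram_indexed(toy, length=8, start="the"))
```

As in `generate_bigram`, the start word itself is not included in the output.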