# Processing Raw Text

## Accessing Text from the Web and from Disk

### Electronic Books

NLTK ships with a small sample of Project Gutenberg books, but we can also use Python to download any text from the much larger catalogue of free books hosted at http://www.gutenberg.org/catalog/.

In [1]:
import nltk, re, pprint
from nltk import word_tokenize

from urllib import request

In [2]:
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

Data type of `raw`:

In [3]:
type(raw)

str

Total count of characters in `raw`:

In [4]:
len(raw)

1176967

In [5]:
raw[:75]

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

The raw content of the file contains whitespace, line breaks, blank lines and other characters that are usually not of interest for NLP tasks. For language processing, we break the raw string into words and punctuation. This process is called **tokenization**.

In [6]:
tokens = word_tokenize(raw)

In [7]:
type(tokens)

list

In [8]:
len(tokens)

257727

In [9]:
print(tokens[:10])

['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
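The first token above, `'\ufeffThe'`, still carries the byte order mark (U+FEFF) that opens the downloaded file. One way to drop it, sketched here on a small hypothetical byte string rather than the real download, is to decode with Python's `utf-8-sig` codec, which consumes a leading BOM if one is present.

```python
# Hypothetical stand-in for the bytes returned by response.read()
sample_bytes = b'\xef\xbb\xbfThe Project Gutenberg EBook'

with_bom = sample_bytes.decode('utf8')           # keeps the BOM as '\ufeff'
without_bom = sample_bytes.decode('utf-8-sig')   # strips a leading BOM

print(repr(with_bom[:4]))      # the BOM followed by 'The'
print(repr(without_bom[:4]))   # 'The '
```

Note that stripping the BOM shifts every character offset by one, so indices found with `find` on the original string would need to be recomputed.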


To use NLTK's text-analysis methods on this content, we first wrap the list of tokens in an `nltk.Text` object.

In [10]:
text = nltk.Text(tokens)

In [11]:
type(text)

nltk.text.Text

In [12]:
print(text[1024:1062])

['an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.', 'He', 'had', 'successfully']


In [13]:
text.collocation_list()

['Katerina Ivanovna',
 'Pyotr Petrovitch',
 'Pulcheria Alexandrovna',
 'Avdotya Romanovna',
 'Rodion Romanovitch',
 'Marfa Petrovna',
 'Sofya Semyonovna',
 'old woman',
 'Project Gutenberg-tm',
 'Porfiry Petrovitch',
 'Amalia Ivanovna',
 'great deal',
 'young man',
 'Nikodim Fomitch',
 'Ilya Petrovitch',
 'Project Gutenberg',
 'Andrey Semyonovitch',
 'Hay Market',
 'Dmitri Prokofitch',
 'Good heavens']

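The downloaded file also contains Project Gutenberg header and licence text, which is why 'Project Gutenberg' shows up among the collocations. Manual inspection of the file is needed to find where the actual content starts and ends.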

In [14]:
raw.find("PART I")

5336

In [15]:
raw.rfind("End of Project Gutenberg’s Crime and Punishment")

1157812

Now we can use a slice to keep only the content and reassign it to `raw`.

In [16]:
raw = raw[5336:1157812]

In [17]:
raw.find("PART I")

0
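The hard-coded offsets above only fit this particular download. A minimal sketch of the same trimming step, assuming the start and end markers are known, looks the positions up each time and fails loudly when a marker is missing. The helper name `trim_between` and the sample string are hypothetical.

```python
def trim_between(raw, start_marker, end_marker):
    """Return raw from start_marker up to (not including) end_marker."""
    start = raw.find(start_marker)    # first occurrence, or -1
    end = raw.rfind(end_marker)       # last occurrence, or -1
    if start == -1 or end == -1 or end < start:
        raise ValueError("marker not found, or markers out of order")
    return raw[start:end]

# Hypothetical miniature document, not the real Gutenberg file
sample = ("License header\nPART I\nOn an exceptionally hot evening...\n"
          "End of Project\nfooter")
body = trim_between(sample, "PART I", "End of Project")
print(body.find("PART I"))   # 0
```

Checking for `-1` matters because `str.find` does not raise on failure, and a slice starting at `-1` would silently return almost nothing.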

### Dealing with HTML

Much of the text on the web is in the form of HTML documents, so we need to be acquainted with tools for processing them in Python. Below is an example of how to retrieve an HTML document and extract its text using the BeautifulSoup library.

In [18]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [19]:
from bs4 import BeautifulSoup

In [20]:
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)
tokens[:10]

['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in']

Again, we need manual inspection to find the start and end indices of the tokens that make up the actual content of the BBC article.

In [21]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin


### Processing RSS Feeds

The following example fetches the content of a blog post from its feed using Python's Universal Feed Parser library (`feedparser`).

In [22]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

'Language Log'

In [23]:
len(llog.entries)

13

In [24]:
post = llog.entries[4]
post.title

'An odd error'

In [25]:
content = post.content[0].value
content[:100]

'<p>"<a href="https://www.skynews.com.au/details/_6085490018001" rel="noopener noreferrer" target="_b'

In [36]:
raw = BeautifulSoup(content, 'html.parser').get_text()
print(word_tokenize(raw)[:10])

['``', 'Teens', 'charged', 'with', 'Qld', 'arsenal', "'completely", 'despicable', "'", "''"]


### Reading Files from NLTK's Corpus

In [32]:
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
with open(path, 'r') as f:
    raw = f.read()
raw[:100]

'[Moby Dick by Herman Melville 1851]\n\n\nETYMOLOGY.\n\n(Supplied by a Late Consumptive Usher to a Grammar'

### Capturing User Input

In [34]:
s = input("Enter some text: ")

Enter some text: Hello. Is it me you're looking for?


In [35]:
print("You typed", len(word_tokenize(s)), "words.")

You typed 10 words.


### Text Processing with Unicode

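Unicode is a character standard that supports over a million characters. Each character is assigned a number, called a code point. In Python, code points are written in the form \uXXXX, where XXXX is the number in 4-digit hexadecimal form. A file on disk does not store code points directly; it stores bytes produced by an encoding such as Latin-2 or UTF-8, which must be decoded when the file is read.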
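As a small illustration of the difference between a code point and its encoded bytes, the sketch below uses only built-in functions on a single character; the variable names are arbitrary.

```python
import unicodedata

ch = '\u0144'                    # written as a 4-digit hexadecimal escape

print(unicodedata.name(ch))      # LATIN SMALL LETTER N WITH ACUTE
print(hex(ord(ch)))              # 0x144
print(chr(0x144) == ch)          # True

# The same code point becomes different bytes under different encodings
latin2_bytes = ch.encode('latin2')
utf8_bytes = ch.encode('utf8')
print(latin2_bytes)              # b'\xf1'
print(utf8_bytes)                # b'\xc5\x84'

# Decoding with the matching encoding recovers the original character
print(latin2_bytes.decode('latin2') == ch)   # True
```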

Let's try opening a file that is encoded as Latin-2.

In [38]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


We can see the code points of the non-ASCII characters by escaping each line:

In [39]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


To go deeper, we can inspect the properties of the Unicode characters in the third line of the file.

In [40]:
import unicodedata
lines = open(path, encoding='latin2').readlines()
line = lines[2]
print(line.encode('unicode_escape'))

b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'


In [41]:
for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c.encode('utf8'), ord(c), unicodedata.name(c)))

b'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
b'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
b'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
b'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
b'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE


NLTK tokenizers can tokenize Unicode strings as well.

In [43]:
print(word_tokenize(line))

['Niemców', 'pod', 'koniec', 'II', 'wojny', 'światowej', 'na', 'Dolny', 'Śląsk', ',', 'zostały']


### Regular Expressions for Detecting Word Patterns

Regular expressions are a powerful tool for describing character patterns.

In [44]:
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

Finding words that end with 'ed'

In [47]:
print([w for w in wordlist if re.search('ed$',w)][:20])

['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', 'abridged', 'abscessed', 'absconded', 'absorbed', 'abstracted', 'abstricted', 'accelerated', 'accepted', 'accidented', 'accoladed', 'accolated', 'accomplished', 'accosted']


The wildcard symbol '.' can be used to match any single character. To find an 8-letter word with j as its third letter and t as its sixth, we can use the period (wildcard symbol) as follows.

In [50]:
print([w for w in wordlist if re.search('^..j..t..$',w)])

['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']


As seen above, the ^ symbol matches the start of the string and $ matches the end. A pattern that begins with ^ must therefore match from the first character, and a pattern that ends with $ must match up to the last character.
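The effect of the two anchors is easiest to see side by side. The sketch below runs the same two letters through `re.search` with and without anchors on a small hand-made word list (not the NLTK word list used above).

```python
import re

words = ['edit', 'red', 'bedding', 'ed']   # small hand-made list

anywhere = [w for w in words if re.search(r'ed', w)]     # no anchors
at_start = [w for w in words if re.search(r'^ed', w)]    # must begin the word
at_end = [w for w in words if re.search(r'ed$', w)]      # must end the word
whole = [w for w in words if re.search(r'^ed$', w)]      # must be the entire word

print(anywhere)   # ['edit', 'red', 'bedding', 'ed']
print(at_start)   # ['edit', 'ed']
print(at_end)     # ['red', 'ed']
print(whole)      # ['ed']
```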

The + symbol, which means "one or more instances of the preceding item", can be applied to an individual character, a character set or a range. Let's look at some examples.

In [54]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$',w)]

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [55]:
print([w for w in chat_words if re.search('^[ha]+$',w)])

['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh', 'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa', 'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', 'hahahahaaa', 'hahahahahaha', 'hahahahahahaha', 'hahahahahahahahahahahahahahahaha', 'hahahhahah', 'hahhahahaha']
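To isolate what + is doing in each case, here is a small sketch on a hand-made list of strings (an assumption for illustration, not the chat corpus), applying + to single characters, to a set and to a range.

```python
import re

samples = ['mine', 'miine', 'mne', 'haha', 'hah!', '2024', 'a1']

# + after individual characters: each letter must occur at least once, in order
single = [w for w in samples if re.search(r'^m+i+n+e+$', w)]

# + after a set: one or more characters, each of which is 'h' or 'a'
char_set = [w for w in samples if re.search(r'^[ha]+$', w)]

# + after a range: one or more digits
char_range = [w for w in samples if re.search(r'^[0-9]+$', w)]

print(single)       # ['mine', 'miine']
print(char_set)     # ['haha']
print(char_range)   # ['2024']
```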


**More examples**

In [58]:
wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)][:5]

['0.0085', '0.05', '0.1', '0.16', '0.2']

In [59]:
[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]

['C$', 'US$']

In [61]:
[w for w in wsj if re.search('^[0-9]{4}$', w)][:5]

['1614', '1637', '1787', '1901', '1903']

In [63]:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)][:5]

['10-day', '10-lap', '10-year', '100-share', '12-point']

In [64]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

['black-and-white',
 'bread-and-butter',
 'father-in-law',
 'machine-gun-toting',
 'savings-and-loan']

In [65]:
[w for w in wsj if re.search('(ed|ing)$', w)][:5]

['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced']
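The examples above introduce braces for repetition counts and the pipe for alternatives. The following sketch repeats three of those patterns on a short hand-made list so that the matches and non-matches can be checked by eye; the list itself is made up for illustration.

```python
import re

samples = ['10-day', '100-share', '7-up', '1999', '19999',
           'walked', 'walking', 'walks']

# {4}: exactly four digits
four_digits = [w for w in samples if re.search(r'^[0-9]{4}$', w)]

# {3,5}: between three and five lowercase letters after the hyphen
hyphenated = [w for w in samples if re.search(r'^[0-9]+-[a-z]{3,5}$', w)]

# (ed|ing): either suffix at the end of the word
suffixed = [w for w in samples if re.search(r'(ed|ing)$', w)]

print(four_digits)   # ['1999']
print(hyphenated)    # ['10-day', '100-share']
print(suffixed)      # ['walked', 'walking']
```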

### Applications of Regular Expressions

#### Extracting Word Pieces

Finding all vowels in a word and counting them:

In [67]:
word = 'supercalifragilisticexpialidocious'
vowels = re.findall(r'[aeiou]',word)
print(vowels)
print("Number of vowels: ", len(vowels))

['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
Number of vowels:  16


Finding all sequences of two or more vowels and their frequencies:

In [68]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}',word))

In [70]:
fd.most_common(10)

[('io', 549),
 ('ea', 476),
 ('ie', 331),
 ('ou', 329),
 ('ai', 261),
 ('ia', 253),
 ('ee', 217),
 ('oo', 174),
 ('ua', 109),
 ('au', 106)]

Word compression by removing vowel sequences that occur in the middle of strings: 

In [71]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

In [73]:
english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and


#### Finding Word Stems

Examples of a Regex approach for finding word stems.

In [83]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['ing']

In [84]:
# Outputting the whole word
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['processing']

In [85]:
# Splitting the word into stem and suffix
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

[('process', 'ing')]

In [86]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('processe', 's')]

In [87]:
# Avoiding greediness of the star (*) operator to correctly match the stem and suffix in the above example:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('process', 'es')]

In [88]:
# Allowing an empty suffix by making the content of the second parentheses optional
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')

[('language', '')]

In [89]:
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

In [90]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

In [92]:
tokens = word_tokenize(raw)
print([stem(t) for t in tokens])

['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']
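The output above shows some typical failures of this approach: 'is' becomes 'i' and 'lying' becomes 'ly'. One possible mitigation, offered only as a sketch, is to require a minimum number of characters in the stem before a suffix may be stripped. The function name `stem_min` and the threshold of three characters are assumptions made for this example.

```python
import re

SUFFIXES = r'(?:ing|ly|ed|ious|ies|ive|es|s|ment)'

def stem_min(word, min_len=3):
    """Strip a known suffix only if at least min_len characters remain."""
    # .{min_len,}? is a non-greedy stem of at least min_len characters
    pattern = r'^(.{%d,}?)%s?$' % (min_len, SUFFIXES)
    match = re.match(pattern, word)
    # Words shorter than min_len do not match at all and are returned unchanged
    return match.group(1) if match else word

for w in ['is', 'lying', 'ponds', 'processes', 'basis']:
    print(w, '->', stem_min(w))
```

This keeps 'is' and 'lying' intact while still reducing 'ponds' to 'pond' and 'processes' to 'process', but 'basis' is still cut to 'basi', so it remains a heuristic rather than a real stemmer.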


#### Searching Tokenized Text

In [95]:
from nltk.corpus import gutenberg, nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r'<a> (<.*>) <man>')

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave


In [96]:
chat = nltk.Text(nps_chat.words())
chat.findall(r'<.*> <.*> <bro>')

you rule bro; telling you bro; u twizted bro


Discovering hypernyms by searching for expressions of the form "x and other ys":

In [97]:
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies','learned']))
hobbies_learned.findall(r'<\w*> <and> <other> <\w*s>')

speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals
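In the patterns above, each pair of angle brackets stands for one whole token, and the parentheses select which tokens are shown. The idea can be approximated with the standard `re` module by wrapping every token in angle brackets and searching the joined string. This is only a rough illustration on a made-up token list, and is not claimed to be how NLTK implements `findall`.

```python
import re

tokens = ['a', 'monied', 'man', 'met', 'a', 'nervous', 'man', 'and', 'a', 'dog']

# Wrap every token in angle brackets so token boundaries are visible to the regex
tagged = ''.join('<{}>'.format(t) for t in tokens)

# Rough counterpart of the pattern <a> (<.*>) <man>:
# [^>]* keeps the wildcard from running past a single token
hits = re.findall(r'<a><([^>]*)><man>', tagged)
print(hits)   # ['monied', 'nervous']
```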


### Normalizing Text

#### Stemmers

NLTK supplies the Porter and Lancaster stemmers, each of which applies its own set of stemming rules.

In [99]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)

In [101]:
porter = nltk.PorterStemmer()
print([porter.stem(t) for t in tokens])

['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']


In [104]:
lancaster = nltk.LancasterStemmer()
print([lancaster.stem(t) for t in tokens])

['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']


In [111]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

In [112]:
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')

r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


#### Lemmatization

In [114]:
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in tokens])

['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


### Regular Expressions for Tokenizing Text

#### Simple Approaches to Tokenization

In [115]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

Splitting on spaces, tabs and other whitespace characters:

In [118]:
print(re.split(r'\s+', raw))

["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]


Splitting on anything other than a word character:

In [121]:
print(re.split(r'\W+', raw))

['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']


Permitting word-internal hyphens and apostrophes, and keeping punctuation as separate tokens:

In [122]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))

["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']
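The pattern in the last cell packs four alternatives into one line. The sketch below writes the same pattern in verbose mode with one alternative per line; the comments are my reading of each branch, and the test string is a shortened, made-up variant of the passage above.

```python
import re

pattern = re.compile(r"""
    \w+(?:[-']\w+)*   # a word, allowing internal hyphens and apostrophes
  | '                 # a standalone apostrophe or single quote
  | [-.(]+            # a run of hyphens, periods or opening parentheses
  | \S\w*             # any other non-space character plus trailing word characters
""", re.VERBOSE)

sample = "'I won't--Maybe hot-tempered,'..."
result = pattern.findall(sample)
print(result)
# ["'", 'I', "won't", '--', 'Maybe', 'hot-tempered', ',', "'", '...']
```

The order of the alternatives matters: the word branch comes first so that "won't" and "hot-tempered" are consumed whole before the punctuation branches get a chance to split them.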


#### NLTK Regular Expression Tokenizer

In [123]:
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)     # set flag to allow verbose regexps
     (?:[A-Z]\.)+       # abbreviations, e.g. U.S.A.
   | \w+(?:-\w+)*       # words with optional internal hyphens
   | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
   | \.\.\.             # ellipsis
   | [][.,;"'?():_`-]   # these are separate tokens; includes ], [ (hyphen last so it is literal)
 '''
nltk.regexp_tokenize(text, pattern)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

### Segmentation

#### Sentence Segmentation

In [125]:
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
pprint.pprint(sents[79:89])

['"Nonsense!"',
 'said Gregory, who was very rational when anyone else\nattempted paradox.',
 '"Why do all the clerks and navvies in the\n'
 'railway trains look so sad and tired, so very sad and tired?',
 'I will\ntell you.',
 'It is because they know that the train is going right.',
 'It\n'
 'is because they know that whatever place they have taken a ticket\n'
 'for that place they will reach.',
 'It is because after they have\n'
 'passed Sloane Square they know that the next station must be\n'
 'Victoria, and nothing but Victoria.',
 'Oh, their wild rapture!',
 'oh,\n'
 'their eyes like stars and their souls again in Eden, if the next\n'
 'station were unaccountably Baker Street!"',
 '"It is you who are unpoetical," replied the poet Syme.']


#### Word Segmentation

In [126]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"

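Each digit indicates whether or not a word break appears after the corresponding character: 1 marks a break and 0 marks none. The final character always ends a word, so each segmentation string is one character shorter than the text.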

In [127]:
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words

In [128]:
segment(text, seg1)

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

In [130]:
print(segment(text, seg2))

['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']
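Going in the opposite direction can help when checking a segmentation string by hand. The sketch below builds the string of 0s and 1s from a list of words; the function name `words_to_segs` is hypothetical and not part of NLTK.

```python
def words_to_segs(words):
    """Build the boundary string for a list of words (inverse of segmenting)."""
    pieces = []
    for word in words:
        # every character but the last is followed by no break ('0');
        # the last character of the word is followed by a break ('1')
        pieces.append('0' * (len(word) - 1) + '1')
    # the final character of the text has no entry, so drop the trailing '1'
    return ''.join(pieces)[:-1]

segs = words_to_segs(['do', 'you', 'see'])
print(segs)   # 0100100
```

The result has seven characters for the eight-character text 'doyousee', matching the convention used for `seg1` and `seg2`.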


Defining the objective function that the search for the correct segmentation will try to minimize:

In [131]:
def evaluate(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size

In [132]:
seg3 = "0000100100000011001000000110000100010000001100010000001"

In [133]:
evaluate(text, seg3)

47

In [134]:
evaluate(text, seg2)

48

In [135]:
evaluate(text, seg1)

64
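To see where a score such as 48 comes from, the arithmetic inside `evaluate` can be laid out step by step. The sketch below starts from the word list that `seg2` produces; the reading of the `+ 1` as one boundary marker per lexicon entry is my interpretation of the code.

```python
words = ['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy',
         'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']

text_size = len(words)                  # one unit per word token
lexicon = sorted(set(words))            # distinct words only
# each lexicon entry costs its length plus one (read here as a boundary marker)
lexicon_size = sum(len(w) + 1 for w in lexicon)

print(lexicon)        # ['do', 'doggy', 'kitty', 'like', 'see', 'the', 'you']
print(text_size)      # 16
print(lexicon_size)   # 32
print(text_size + lexicon_size)   # 48
```

A segmentation scores well when it reuses a small set of short words many times, which is why the four long chunks of `seg1` score 64 despite containing only four tokens.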

Implementing the search for the pattern of 0s and 1s that minimizes the objective function:

In [136]:
from random import randint

def flip(segs, pos):
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    for i in range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs

def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            guess = flip_n(segs, round(temperature))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        print(evaluate(text, segs), segment(text, segs))
    print()
    return segs

In [137]:
anneal(text, seg1, 5000, 1.2)

64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
60 ['d', 'oyouseethe', 'kittyse', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']
59 ['doy', 'ouse', 'ethekit', 'tys', 'e', 'e', 'thedoggy', 'doy', 'ou', 'lik', 'ethekit', 'tylik', 'e', 'thedoggy']
59 ['doy', 'ouse', 'ethekit', 'tys', 'e', 'e', 'thedoggy', 'doy', 'ou', 'lik', 'ethekit', 'tylik', 'e', 'thedoggy']
59 ['doy', 'ouse', 'ethekit', 'tys', 'e', 'e', 'thedoggy', 'doy', 'ou', 'lik', 'ethekit', 'tylik', 'e', 'thedoggy']
59 ['doy', 'ouse', 'ethe

'0000101000000001010000000010000100100000000100100000000'
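Because the search picks flip positions at random, each run can print a different trace and end on a different segmentation. If repeatable runs are wanted, one option is to draw the positions from a seeded generator. The sketch below shows this for the flipping step only; passing an `rng` argument to `flip_n` is a hypothetical variation on the function defined above, and the seed value is arbitrary.

```python
import random

def flip(segs, pos):
    # toggle the boundary marker at position pos
    return segs[:pos] + str(1 - int(segs[pos])) + segs[pos + 1:]

def flip_n(segs, n, rng):
    # flip n randomly chosen positions, drawing them from the given generator
    for _ in range(n):
        segs = flip(segs, rng.randint(0, len(segs) - 1))
    return segs

# Two generators seeded identically make the same sequence of flips
first = flip_n('0000000000', 3, random.Random(42))
second = flip_n('0000000000', 3, random.Random(42))
print(first == second)   # True
```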

### Formatting: From Lists to Strings

In [138]:
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
' '.join(silly)

'We called him Tortoise because he taught us .'

In [140]:
';'.join(silly)

'We;called;him;Tortoise;because;he;taught;us;.'

In [141]:
''.join(silly)

'WecalledhimTortoisebecausehetaughtus.'

**Using string formatting to print variables in text:**

In [143]:
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in sorted(fdist):
    print('{}->{};'.format(word, fdist[word]), end=' ')

cat->3; dog->4; snake->1; 

**Adding left or right padding to formatted strings**

In [146]:
'{:6}'.format(41)

'    41'

In [148]:
'{:>6}'.format(41)

'    41'
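Both cells above give the same result because numbers are right-aligned by default, so `>` changes nothing for an integer. A short sketch of the alignment flags on both a number and a string makes the difference visible; the variable names are arbitrary.

```python
num_default = '{:6}'.format(41)      # numbers are right-aligned by default
num_left = '{:<6}'.format(41)        # force left alignment
str_default = '{:6}'.format('dog')   # strings are left-aligned by default
str_right = '{:>6}'.format('dog')    # force right alignment
str_center = '{:^7}'.format('dog')   # centre within the field

for value in (num_default, num_left, str_default, str_right, str_center):
    print(repr(value))
```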

**Formatting decimals**

In [149]:
import math
'{:.4f}'.format(math.pi)

'3.1416'

**Using formatting for tabulated data**

In [150]:
def tabulate(cfdist, words, categories):
    print('{:16}'.format('Category'), end=' ')                    
    for word in words:
        print('{:>6}'.format(word), end=' ')
    print()
    for category in categories:
        print('{:16}'.format(category), end=' ')                  
        for word in words:                                        
            print('{:6}'.format(cfdist[category][word]), end=' ') 
        print()                                                   

In [151]:
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
          (genre, word)
          for genre in brown.categories()
          for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
tabulate(cfd, modals, genres)

Category            can  could    may  might   must   will 
news                 93     86     66     38     50    389 
religion             82     59     78     12     54     71 
hobbies             268     58    131     22     83    264 
science_fiction      16     49      4     12      8     16 
romance              74    193     11     51     45     43 
humor                16     30      8      8      9     13 


### Writing Results to a File

In [152]:
# The 'output' directory must already exist; the with-block closes the file when done
words = set(nltk.corpus.genesis.words('english-kjv.txt'))
with open('output/3_output.txt', 'w') as output_file:
    for word in sorted(words):
        print(word, file=output_file)

### Text Wrapping

In [153]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
           'more', 'is', 'said', 'than', 'done', '.']

In [154]:
from textwrap import fill
pieces = ["{} {}".format(word, len(word)) for word in saying]
output = ' '.join(pieces)
wrapped = fill(output)
print(wrapped)

After 5 all 3 is 2 said 4 and 3 done 4 , 1 more 4 is 2 said 4 than 4
done 4 . 1
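The call above relies on the default line width of 70 characters. A narrower column can be requested with the `width` argument of `textwrap.fill`, as in this small sketch, which reuses the same string written out literally.

```python
from textwrap import fill

output = ('After 5 all 3 is 2 said 4 and 3 done 4 , 1 '
          'more 4 is 2 said 4 than 4 done 4 . 1')

narrow = fill(output, width=30)   # wrap so that no line exceeds 30 characters
print(narrow)
```

Wrapping happens only at whitespace, so a word and the number that follows it may land on different lines.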
