# Processing Raw Text

# Required packages for this chapter

In [7]:
import nltk, re, pprint
from nltk import word_tokenize

# 3.1 Accessing Text from the Web and from Disk

**The web is a huge collection of text, so we often want to pull text directly from a URL.**

**The request module (part of urllib) lets you access information on the internet. Its urlopen() function opens a URL and returns a response object from which you can pull the page's text.**
    
**request.urlopen("URL you want to access")**

In [8]:
from urllib import request 
url ="http://www.gutenberg.org/files/2554/2554-0.txt"

response = request.urlopen(url)

print(response)

<http.client.HTTPResponse object at 0x0000018589CE48B0>


**The next step would be to put that information into a readable format for us to use.**

**The .read() and .decode() methods read the response as raw bytes and decode those bytes into a string. We discussed previously that non-Latin-based languages might be a problem, as they contain special characters. Decoding with the correct encoding is one way to deal with them.**

**variable_name.read().decode("format style")**
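To see what decoding does without touching the network, here is a minimal sketch. The byte string is a made-up stand-in for what `response.read()` would return, and the sample sentence is an assumption chosen only because it contains accented characters.

```python
# Hypothetical stand-in for the bytes that response.read() returns.
raw_bytes = "Café naïve résumé".encode("utf8")

# Decoding with the matching codec recovers the accented characters.
text = raw_bytes.decode("utf8")

# Decoding with a mismatched codec does not raise here, but garbles the text.
garbled = raw_bytes.decode("latin-1")

print(type(raw_bytes).__name__, len(raw_bytes), len(text))
print(text)
print(garbled)
```

The byte count is larger than the character count because each accented letter takes two bytes in UTF-8.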

In [9]:
from urllib import request 
url ="https://rest.uniprot.org/uniprotkb/P05067.fasta"
response = request.urlopen(url) 
piddd = response.read()
piddd = piddd.decode("utf8") 

print(piddd)

>sp|P05067|A4_HUMAN Amyloid-beta precursor protein OS=Homo sapiens OX=9606 GN=APP PE=1 SV=3
MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMNVQNGKWDSDPSGTK
TCIDTKEGILQYCQEVYPELQITNVVEANQPVTIQNWCKRGRKQCKTHPHFVIPYRCLVG
EFVSDALLVPDKCKFLHQERMDVCETHLHWHTVAKETCSEKSTNLHDYGMLLPCGIDKFR
GVEFVCCPLAEESDNVDSADAEEDDSDVWWGGADTDYADGSEDKVVEVAEEEEVAEVEEE
EADDDEDDEDGDEVEEEAEEPYEEATERTTSIATTTTTTTESVEEVVREVCSEQAETGPC
RAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSAMSQSLLKTTQEPLARD
PVKLPTTAASTPDAVDKYLETPGDENEHAHFQKAKERLEAKHRERMSQVMREWEEAERQA
KNLPKADKKAVIQHFQEKVESLEQEAANERQQLVETHMARVEAMLNDRRRLALENYITAL
QAVPPRPRHVFNMLKKYVRAEQKDRQHTLKHFEHVRMVDPKKAAQIRSQVMTHLRVIYER
MNQSLSLLYNVPAVAEEIQDEVDELLQKEQNYSDDVLANMISEPRISYGNDALMPSLTET
KTTVELLPVNGEFSLDDLQPWHSFGADSVPANTENEVEPVDARPAADRGLTTRPGSGLTN
IKTEEISEVKMDAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIATVIVITL
VMLKKKQYTSIHHGVVEVDAAVTPEERHLSKMQQNGYENPTYKFFEQMQN



# Dealing with HTML

In [10]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')

In [11]:
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [12]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)
tokens

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 'years',
 "'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'UK',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 'GMT',
 '1

In [13]:
raw

'\n\n\nBBC NEWS | Health | Blondes \'to die out in 200 years\'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNEWS\n\xa0\xa0SPORT\n\xa0\xa0WEATHER\n\xa0\xa0WORLD SERVICE\n\n\xa0\xa0A-Z INDEX\xa0\n\n\xa0\xa0SEARCH\xa0\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n    \xa0You are in:\xa0Health \xa0\r\n    \r\n    \r\n\n\n\n\n\n\n\n\n\n\n\nNews Front Page\n\n\n\n\n\nAfrica\n\n\nAmericas\n\n\nAsia-Pacific\n\n\nEurope\n\n\nMiddle East\n\n\nSouth Asia\n\n\nUK\n\n\nBusiness\n\n\nEntertainment\n\n\nScience/Nature\n\n\nTechnology\n\n\nHealth\n\n\nMedical notes\n\n\n-------------\n\n\nTalking Point\n\n\n-------------\n\n\nCountry Profiles\n\n\nIn Depth\n\n\n-------------\n\n\nProgrammes\n\n\n-------------\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSERVICES\r\n\n\n\n\n\n\n\nDaily E-mail\r\n\n\n\n\n\n\n\nNews Ticker\r\n\n\n\n\n\n\n\nMobile/PDAs\r\n\n\n\n\n\n\n-------------\r\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nText Onl

In [14]:
tokens = tokens[110:390]
text = nltk.Text(tokens)

In [15]:
tokens

['12:51',
 'UK',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 'years',
 "'",
 'Scientists',
 'believe',
 'the',
 'last',
 'blondes',
 'will',
 'be',
 'in',
 'Finland',
 'The',
 'last',
 'natural',
 'blondes',
 'will',
 'die',
 'out',
 'within',
 '200',
 'years',
 ',',
 'scientists',
 'believe',
 '.',
 'A',
 'study',
 'by',
 'experts',
 'in',
 'Germany',
 'suggests',
 'people',
 'with',
 'blonde',
 'hair',
 'are',
 'an',
 'endangered',
 'species',
 'and',
 'will',
 'become',
 'extinct',
 'by',
 '2202',
 '.',
 'Researchers',
 'predict',
 'the',
 'last',
 'truly',
 'natural',
 'blonde',
 'will',
 'be',
 'born',
 'in',
 'Finland',
 '-',
 'the',
 'country',
 'with',
 'the',
 'highest',
 'proportion',
 'of',
 'blondes',
 '.',
 'The',
 'frequency',
 'of',
 'blondes',
 'may',
 'drop',
 'but',
 'they',
 'wo',
 "n't",
 'disappear',
 'Prof',
 'Jonathan',
 'Rees',
 ',',
 'University',
 'of',
 'Edinburgh',
 'But',
 'they',
 'say',
 'too',
 'few',
 'people',
 'now',
 'carry',
 'the',
 'gene',

In [16]:
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin


# NLP Pipeline

In [17]:
# import image module
from IPython.display import Image
  
# get the image
Image(url="NLP_Pipeline.PNG", width=1000, height=1000)

# 3.4 Regular Expressions for Detecting Word Patterns

**REGEX IS EVERYTHING: pattern matching, allowing you to find the specific things you are looking for**

**We used import re earlier to import the package that does regex.**

**Let's pull a wordlist from the Words Corpus in nltk to get started**

In [18]:
import re
wordlist = [w for w in nltk.corpus.words.words('en')]

In [19]:
len(wordlist)

235886

In [20]:
wordlist

['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron',
 'Aaronic',
 'Aaronical',
 'Aaronite',
 'Aaronitic',
 'Aaru',
 'Ab',
 'aba',
 'Ababdeh',
 'Ababua',
 'abac',
 'abaca',
 'abacate',
 'abacay',
 'abacinate',
 'abacination',
 'abaciscus',
 'abacist',
 'aback',
 'abactinal',
 'abactinally',
 'abaction',
 'abactor',
 'abaculus',
 'abacus',
 'Abadite',
 'abaff',
 'abaft',
 'abaisance',
 'abaiser',
 'abaissed',
 'abalienate',
 'abalienation',
 'abalone',
 'Abama',
 'abampere',
 'abandon',
 'abandonable',
 'abandoned',
 'abandonedly',
 'abandonee',
 'abandoner',
 'abandonment',
 'Abanic',
 'Abantes',
 'abaptiston',
 'Abarambo',
 'Abaris',
 'abarthrosis',
 'abarticular',
 'abarticulation',
 'abas',
 'abase',
 'abased',
 'abasedly',
 'abasedness',
 'abasement',
 'abaser',
 'Abasgi',
 'abash',
 'abashed',
 'abashedly',
 'abashedness',
 'abashless',
 'abashlessly',
 'abashment',
 'abasia',
 'abasic',
 'abask',
 'Abassin',
 'abastardize',
 'abatable',
 'abate

In [21]:
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

In [22]:
wordlist

['a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aardvark',
 'aardwolf',
 'aba',
 'abac',
 'abaca',
 'abacate',
 'abacay',
 'abacinate',
 'abacination',
 'abaciscus',
 'abacist',
 'aback',
 'abactinal',
 'abactinally',
 'abaction',
 'abactor',
 'abaculus',
 'abacus',
 'abaff',
 'abaft',
 'abaisance',
 'abaiser',
 'abaissed',
 'abalienate',
 'abalienation',
 'abalone',
 'abampere',
 'abandon',
 'abandonable',
 'abandoned',
 'abandonedly',
 'abandonee',
 'abandoner',
 'abandonment',
 'abaptiston',
 'abarthrosis',
 'abarticular',
 'abarticulation',
 'abas',
 'abase',
 'abased',
 'abasedly',
 'abasedness',
 'abasement',
 'abaser',
 'abash',
 'abashed',
 'abashedly',
 'abashedness',
 'abashless',
 'abashlessly',
 'abashment',
 'abasia',
 'abasic',
 'abask',
 'abastardize',
 'abatable',
 'abate',
 'abatement',
 'abater',
 'abatis',
 'abatised',
 'abaton',
 'abator',
 'abattoir',
 'abature',
 'abave',
 'abaxial',
 'abaxile',
 'abaze',
 'abb',
 'abbacomes',
 'abbacy',
 'abbas',
 'abbasi',
 'abbassi',


# $ anchors the pattern to the end of the word

In [23]:
[w for w in wordlist if re.search('ed$', w)]

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded',
 'absorbed',
 'abstracted',
 'abstricted',
 'accelerated',
 'accepted',
 'accidented',
 'accoladed',
 'accolated',
 'accomplished',
 'accosted',
 'accredited',
 'accursed',
 'accused',
 'accustomed',
 'acetated',
 'acheweed',
 'aciculated',
 'aciliated',
 'acknowledged',
 'acorned',
 'acquainted',
 'acquired',
 'acquisited',
 'acred',
 'aculeated',
 'addebted',
 'added',
 'addicted',
 'addlebrained',
 'addleheaded',
 'addlepated',
 'addorsed',
 'adempted',
 'adfected',
 'adjoined',
 'admired',
 'admitted',
 'adnexed',
 'adopted',
 'adossed',
 'adreamed',
 'adscripted',
 'aduncated',
 'advanced',
 'advised',
 'aeried',
 'aethered',
 'afeared',
 'affected',
 'affectioned',
 'affined',
 'afflicted',
 'affricated',
 'affrighted',
 'affronted',
 'aforenamed',
 'afterfeed',
 'aftershafted',
 'afterthoughted',
 'afterwitted',
 'agazed',
 'aged',
 'agglomerated',
 'aggri
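The effect of the `$` anchor is easier to see on a handful of words. This is a small sketch; `sample_words` is a hypothetical list used in place of the full Words Corpus.

```python
import re

# Hypothetical mini word list standing in for the NLTK Words Corpus.
sample_words = ["abandoned", "editor", "bed", "medal", "needed", "education"]

# 'ed$' matches only when 'ed' sits at the very end of the string.
ends_with_ed = [w for w in sample_words if re.search("ed$", w)]

# Without the anchor, 'ed' may appear anywhere in the word.
contains_ed = [w for w in sample_words if re.search("ed", w)]

print(ends_with_ed)
print(contains_ed)
```

Every word in the sample contains `ed` somewhere, but only three of them end with it.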

# Wildcards are placeholders: the . symbol says that some character must appear in that position, but not which one.

# The ^ indicates the start of the word. So here, we are saying two characters, then a j, two more, then a t, two more, and then the end

# You can also use ? to denote an optional character.

In [24]:
[w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']
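The `?` symbol is not exercised by the cell above, so here is a short sketch of it. The `candidates` list is made up for illustration.

```python
import re

# Hypothetical spellings to test against the pattern.
candidates = ["email", "e-mail", "e--mail", "gmail", "emails"]

# '-?' makes the hyphen optional: zero or one hyphen is allowed.
matches = [w for w in candidates if re.search("^e-?mail$", w)]

print(matches)
```

Because the pattern is anchored with both `^` and `$`, extra hyphens or trailing letters prevent a match.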

# What if we want to consider a range of letters?

**How about words that start with vowels?**

**A set of characters inside [] matches any one of them, and a range such as [a-z] is shorthand for every character in between.**

In [25]:
[w for w in wordlist if re.search('^[aeiou]', w)][:5]

['a', 'aa', 'aal', 'aalii', 'aam']
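A sketch of both forms, an explicit set and a hyphenated range, on a hypothetical word list. The first pattern mimics a phone keypad, where each key stands for a small set of letters.

```python
import re

# Hypothetical word list for illustration.
words = ["gold", "golf", "hold", "hole", "help", "cold", "bold"]

# Each bracketed set matches exactly one character from that set.
keypad = [w for w in words if re.search("^[ghi][mno][jlk][def]$", w)]

# [a-c] is a range: it matches a, b, or c.
starts_a_to_c = [w for w in words if re.search("^[a-c]", w)]

print(keypad)
print(starts_a_to_c)
```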

# Let's explore the + symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters:

In [26]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [27]:
# What does the * symbol do? It matches zero or more of the preceding character.

In [28]:
[w for w in chat_words if re.search('^m*i*n*e*$', w)]

['',
 'e',
 'i',
 'in',
 'm',
 'me',
 'meeeeeeeeeeeee',
 'mi',
 'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'min',
 'mine',
 'mm',
 'mmm',
 'mmmm',
 'mmmmm',
 'mmmmmm',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee',
 'mmmmmmmmmm',
 'mmmmmmmmmmmmm',
 'mmmmmmmmmmmmmm',
 'n',
 'ne']
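The difference between `+` (one or more) and `*` (zero or more) can be checked side by side. This sketch uses a made-up list, `samples`, rather than the chat corpus.

```python
import re

# Hypothetical strings, including the empty string.
samples = ["mine", "miiinne", "me", "mn", "", "mien"]

# '+' requires at least one of each letter, in order.
one_or_more = [w for w in samples if re.search("^m+i+n+e+$", w)]

# '*' lets any of the letters be missing entirely.
zero_or_more = [w for w in samples if re.search("^m*i*n*e*$", w)]

print(one_or_more)
print(zero_or_more)
```

The empty string satisfies the `*` pattern because every letter is allowed to occur zero times, which explains the `''` entry in the chat output. The word `mien` fails both patterns because its letters are out of order.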

# The .findall() function looks for all matches (non-overlapping).

# re.findall(pattern, string) takes a single string; to search a list, loop over its items

In [29]:
word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]', word)

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']
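Two details are worth a quick sketch: what "non-overlapping" means, and how to apply `re.findall` to a list of words. The inputs here are made up for illustration.

```python
import re

# Non-overlapping: once 'ue' is consumed, the scan resumes after it,
# so the overlapping pairs 'eu' and 'ei' are never reported.
pairs = re.findall(r"[aeiou]{2}", "queueing")

# findall works on one string at a time, so loop over a list of words.
word_list = ["sky", "tree", "audio"]
vowels = [v for w in word_list for v in re.findall(r"[aeiou]", w)]

print(pairs)
print(vowels)
```

Passing the list itself to `re.findall` raises a `TypeError`, since the function expects a string or bytes-like object.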

In [30]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                   for vs in re.findall(r'[aeiou]{2,}', word))

fd.most_common(12)

[('io', 549),
 ('ea', 476),
 ('ie', 331),
 ('ou', 329),
 ('ai', 261),
 ('ia', 253),
 ('ee', 217),
 ('oo', 174),
 ('ua', 109),
 ('au', 106),
 ('ue', 105),
 ('ui', 95)]

In [31]:
fd

FreqDist({'io': 549, 'ea': 476, 'ie': 331, 'ou': 329, 'ai': 261, 'ia': 253, 'ee': 217, 'oo': 174, 'ua': 109, 'au': 106, ...})

In [32]:
for i in fd:
    print(i, fd[i])

io 549
ea 476
ie 331
ou 329
ai 261
ia 253
ee 217
oo 174
ua 109
au 106
ue 105
ui 95
ei 86
oi 65
oa 59
eo 39
iou 27
eu 18
oe 15
iu 14
ae 11
eau 10
uo 8
oui 6
ao 6
eou 5
uou 5
uee 4
aa 3
ieu 3
uie 3
eei 2
iai 1
oei 1
uu 1
aii 1
aiia 1
aia 1
iao 1
eea 1
ueui 1
ioa 1
ooi 1
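The same counting idea can be sketched without any corpus. Here `collections.Counter` is used as a stand-in for `nltk.FreqDist`, and `sample_words` is a hypothetical list, so the numbers are not comparable to the Treebank counts above.

```python
import re
from collections import Counter

# Hypothetical words; the notebook uses the Treebank vocabulary instead.
sample_words = ["beautiful", "cooperation", "queue", "audio", "book", "reason"]

# Count every run of two or more vowels across all the words.
vowel_runs = Counter(
    vs for w in sample_words for vs in re.findall(r"[aeiou]{2,}", w)
)

for run, count in vowel_runs.most_common():
    print(run, count)
```

Note that `{2,}` is greedy, so `beautiful` contributes the single run `eau` rather than `ea` or `au`.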


# Next, let's combine regular expressions with conditional frequency distributions. Here we will extract all consonant-vowel sequences from the words of Rotokas, such as ka and si. Since each of these is a pair, it can be used to initialize a conditional frequency distribution. We then tabulate the frequency of each pair:

In [33]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 
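To show what the conditional frequency distribution is doing with those pairs, here is a rough sketch that builds the same kind of table by hand. It uses a `defaultdict` of `Counter` objects as a stand-in for `nltk.ConditionalFreqDist`, and the words are hypothetical strings rather than entries from `rotokas.dic`.

```python
import re
from collections import Counter, defaultdict

# Hypothetical Rotokas-like strings, used only for illustration.
sample_words = ["kasuari", "kaakau", "vira", "tori", "kopi"]

# Each match is a two-character string: one consonant, then one vowel.
pairs = [cv for w in sample_words for cv in re.findall(r"[ptksvr][aeiou]", w)]

# The consonant is the condition; the vowel is the event being counted.
table = defaultdict(Counter)
for consonant, vowel in pairs:
    table[consonant][vowel] += 1

for consonant in sorted(table):
    print(consonant, dict(table[consonant]))
```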


# Another use for regex is stemming, or approximating the lemmas in a text. For example:

**Take off suffixes such as: ing, ly, ed, ious, ion, es, s, ment, etc.**

**Returns the word without that suffix**

**nltk has multiple built in stemmers that can also handle this job for us**
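A minimal sketch of such a regex stemmer follows. The function name `stem` and the exact suffix list are assumptions made for illustration; this is a crude approach and not how the built-in nltk stemmers work.

```python
import re

def stem(word):
    """Strip one common suffix from the end of word, if there is one."""
    # (.*?) is non-greedy, so the stem is as short as possible and the
    # suffix group gets the longest ending it can (e.g. 'es' before 's').
    # The trailing '?' makes the suffix optional for words without one.
    m = re.match(r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$", word)
    return m.group(1)

for w in ["processing", "processes", "ponds", "women", "government", "is"]:
    print(w, "->", stem(w))
```

The last word shows the weakness of the approach: `is` loses its `s`, because the pattern has no idea which endings are real suffixes.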

In [34]:
# in nltk.Text.findall, angle brackets <> mark the boundaries of each token

from nltk.corpus import nps_chat

chat = nltk.Text(nps_chat.words())

chat.findall("<.*><.*><bro>")

# this regex syntax is specific to nltk.Text objects

you rule bro; telling you bro; u twizted bro


# Normalizing Text

**Normalization: converting text to a consistent form (for example, all lowercase) for later use**

**Stemming: stripping the affixes off of a word**

**Lemmatization: mapping a word to its dictionary form (lemma), so that the result is a known word**
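The simplest normalization step is lowercasing. This sketch uses a made-up token list to show how it collapses variants of the same word.

```python
# Hypothetical tokens containing case variants of two words.
sample_tokens = ["The", "THE", "the", "Swords", "swords"]

# Lowercase every token so that case variants become identical.
normalized = [t.lower() for t in sample_tokens]

# Compare vocabulary size before and after normalization.
vocab_before = len(set(sample_tokens))
vocab_after = len(set(normalized))

print(vocab_before, vocab_after)
```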

In [35]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
     is no basis for a system of government.  Supreme executive power derives from
    a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)

In [36]:
tokens

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'lying',
 'in',
 'ponds',
 'distributing',
 'swords',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'masses',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

# Stemmers

**nltk.PorterStemmer(): developed by Martin Porter, popular choice for English**

**nltk.LancasterStemmer(): developed by Chris Paice at Lancaster University**

In [37]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

[porter.stem(t) for t in tokens]

['denni',
 ':',
 'listen',
 ',',
 'strang',
 'women',
 'lie',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandat',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcic',
 'aquat',
 'ceremoni',
 '.']

In [38]:
[lancaster.stem(t) for t in tokens]

['den',
 ':',
 'list',
 ',',
 'strange',
 'wom',
 'lying',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'bas',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'pow',
 'der',
 'from',
 'a',
 'mand',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'som',
 'farc',
 'aqu',
 'ceremony',
 '.']

# Lemmatization

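**We can use the WordNet lemmatizer, but it only changes a word when the result is in its dictionary. It handles some irregular plurals (like women) but not verb forms like lying, because lemmatize() treats every word as a noun unless you pass a part of speech (for example pos='v').**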

In [39]:
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'woman',
 'lying',
 'in',
 'pond',
 'distributing',
 'sword',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

# Tokenizing Text

**Tokenizing: splitting text into language pieces, often words; the simplest approach splits on whitespace**

**You could do this with regular expressions**

**But there are options already built into nltk**

**The regexp_tokenize() function: lets you use a regex to break text apart into tokens.**

In [40]:
text = 'That U.S.A. poster-print costs $12.40...'

# use a raw string so that \w and \s reach the regex engine unchanged
nltk.regexp_tokenize(text, r"\w+\s")


['That ', 'print ', 'costs ']
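The pattern above drops most of the sentence, so a richer pattern is needed. The following is a sketch only: it uses `re.findall` from the standard library rather than `nltk.regexp_tokenize`, and the alternatives in `pattern` are assumptions tuned to this one example sentence.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

pattern = r"""(?x)          # verbose mode: whitespace and comments are ignored
    (?:[A-Z]\.)+            # abbreviations such as U.S.A.
  | \$?\d+(?:\.\d+)?%?      # currency amounts and percentages
  | \w+(?:-\w+)*            # words, optionally with internal hyphens
  | \.\.\.                  # ellipsis
  | [^\w\s]                 # any other single punctuation mark
"""

# The groups are non-capturing, so findall returns whole matches.
tokens = re.findall(pattern, text)
print(tokens)
```

The order of the alternatives matters: the abbreviation and currency patterns come before the general word pattern so that `U.S.A.` and `$12.40` are not split into pieces.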

# Segmentation

**The sent_tokenize() function: enter some text to break it apart into sentences. Also watch for end-of-line characters (\n) carried over from text documents.**

In [41]:
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
pprint.pprint(sents[3:4])

['Life was a fly that faded, and death a drone that stung;\n'
 'The world was very old indeed when you and I were young.']
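The `\n` inside that sentence comes from the line breaks in the source file. One simple way to clean it up, sketched here on a copy of the sentence above, is to split on any whitespace and rejoin with single spaces.

```python
# The sentence as returned above, including its embedded newline.
sentence = ("Life was a fly that faded, and death a drone that stung;\n"
            "The world was very old indeed when you and I were young.")

# split() with no arguments breaks on any run of whitespace,
# so newlines and repeated spaces all collapse to one space.
one_line = " ".join(sentence.split())

print(one_line)
```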


**Thank You**