In [3]:
from __future__ import division
import nltk, re, pprint
from bs4 import BeautifulSoup

In [9]:
from urllib.request import urlopen
url="http://www.gutenberg.org/files/2554/2554-0.txt"
raw = urlopen(url).read()
raw = raw.decode('utf-8')  # decode the bytes; str(raw) would give the "b'...'" repr instead of the text
type(raw)


URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)>

In [None]:
len(raw)

In [None]:
raw[:75]

Break up the string into words and punctuation - Tokenization

In [None]:
tokens = nltk.word_tokenize(raw)
len(tokens)

In [None]:
tokens[:10]

To use NLTK's text-processing methods, we just need to convert the token list to an NLTK Text object.

In [None]:
text=nltk.Text(tokens)
type(text)

In [None]:
text.collocations()

In [None]:
raw.rfind("End of Project")

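Text found on the Web may contain unwanted material, such as headers and footers, and there may not be an automatic way to remove it. Much of the text on the Web is also in the form of HTML documents, so we deal with HTML next.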

### Dealing with HTML

In [None]:
url="http://news.bbc.co.uk/2/hi/health/2284783.stm" # BBC News story: Blondes 'to die out in 200 years'
html=urlopen(url).read()
html[:60]

In [None]:
raw=BeautifulSoup(html,'html.parser').get_text()
tokens=nltk.word_tokenize(raw)
print(tokens)

In [None]:
tokens=tokens[96:399]
print(tokens)

In [None]:
text = nltk.Text(tokens)
text.concordance('gene')

### Processing Search Engine Results

The Web can be thought of as a huge corpus of unannotated text. Web search engines
provide an efficient means of searching this large quantity of text for relevant linguistic
examples. The main advantage of search engines is size: since you are searching such
a large set of documents, you are more likely to find any linguistic pattern you are
interested in. Furthermore, you can make use of very specific patterns, which would
match only one or two examples in a smaller corpus, but which might match tens of
thousands of examples when run on the Web. A second advantage of web search engines
is that they are very easy to use. Thus, they provide a very convenient tool for
quickly checking a theory, to see if it is reasonable.

### Processing RSS Feeds 

In [None]:
import feedparser

In [None]:
llog=feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

In [None]:
len(llog.entries)

In [None]:
llog.entries

In [None]:
for post in llog.entries:
    print(post.title)

In [None]:
post=llog.entries[2]
post.title

In [None]:
content=post.content[0].value
content[:70]

In [None]:
post.content[0]

In [None]:
raw=BeautifulSoup(content, 'html.parser').get_text()
tokens=nltk.word_tokenize(raw)

In [None]:
tokens

### Reading Local Files

In [None]:
import os
os.listdir('.')

In [None]:
f=open('C:/Users/mohit/Documents/doc.txt','r')  # plain 'r'; the 'U' mode flag was removed in Python 3.11
raw=f.read()

In [None]:
raw

In [None]:
type(f)

In [None]:
s=input('enter: ')

In [None]:
s

**NLP Pipeline**
1. Download the webpage and strip the HTML to get the required content (using urlopen and BeautifulSoup).
2. Tokenize the text and convert it to an NLTK Text (using nltk.word_tokenize() and nltk.Text()).
3. Build the vocabulary (using set() and str.lower()).
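
The three steps above can be sketched end to end on a small inline HTML string. This is only an illustration under stated assumptions: the `TextExtractor` class, the `strip_html` helper, and the sample HTML are hypothetical, and the standard-library `html.parser` and a simple regex stand in for BeautifulSoup and `nltk.word_tokenize()` so that the sketch runs without a network connection or NLTK data.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    # Step 1: strip the markup and keep only the text content.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


# Made-up sample page (assumption, not the BBC story used in the notebook).
html = "<html><body><h1>Blondes</h1><p>The gene is recessive. The Gene may vanish.</p></body></html>"

raw = strip_html(html)

# Step 2: tokenize into words and single punctuation marks
# (a simple regex stand-in for nltk.word_tokenize).
tokens = re.findall(r"\w+|[^\w\s]", raw)

# Step 3: build the vocabulary from the lowercased alphabetic tokens.
vocab = sorted(set(t.lower() for t in tokens if t.isalpha()))
print(vocab)
```

Lowercasing before building the set is what merges The and the, and Gene and gene, into single vocabulary entries.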

### Unicode
Unicode supports over a million characters. Each character is assigned a number, called
a code point. In Python, code points are written in the form \uXXXX, where XXXX
is the number in four-digit hexadecimal form.

Text in files will be in a particular encoding, so we need some mechanism for translating
it into Unicode—translation into Unicode is called decoding. Conversely, to write out
Unicode to a file or a terminal, we first need to translate it into a suitable encoding—
this translation out of Unicode is called encoding.
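
As a small illustration of the two directions, the sketch below encodes a Python string into bytes and decodes it back. The sample word and the variable names are chosen for illustration only; `latin2` is Python's alias for ISO-8859-2.

```python
# A string containing the character ń, whose code point is U+0144.
text = 'Pa\u0144stwowa'

# Encoding: Unicode string -> bytes in a particular encoding.
latin2_bytes = text.encode('latin2')   # one byte per character here
utf8_bytes = text.encode('utf-8')      # ń takes two bytes in UTF-8

# Decoding: bytes -> Unicode string. The encoding must match the one
# that was used to produce the bytes.
decoded = latin2_bytes.decode('latin2')

# unicode_escape shows non-ASCII characters as escape sequences.
escaped = text.encode('unicode_escape')

print(len(text), len(latin2_bytes), len(utf8_bytes))
print(decoded == text)
print(escaped)
```

The same string has different byte lengths in the two encodings, which is why a file's encoding has to be known before its contents can be decoded correctly.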

The Python codecs module provides functions to read encoded data into Unicode
strings, and to write out Unicode strings in encoded form.


In [None]:
path=nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [None]:
import codecs
f = codecs.open(path, encoding = 'latin2') #The codecs.open() function
#takes an encoding parameter to specify the encoding of the file being read or written

In order to view this text on a terminal, we need to encode it, using a suitable encoding.
The Python-specific encoding unicode_escape is a dummy encoding that converts all
non-ASCII characters into their \uXXXX representations.


In [None]:
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

The first line in this output illustrates a Unicode escape string preceded by the \u escape
string, namely \u0144. The relevant Unicode character will be displayed on the screen
as the glyph ń. In the third line of the preceding example, we see \xf3, which corresponds
to the glyph ó, and is within the 128–255 range.

### Regular Expressions for Detecting Word Patterns

In [None]:
import re
word_list=[w for w in nltk.corpus.words.words('en') if w.islower()]

###### Using Basic Metacharacters

In [None]:
print([w for w in word_list if re.search('ed$', w)])
# The '$' character marks the end of the word, so this finds words ending with 'ed'

**The '.' (dot) wildcard symbol matches any single character.** To find six-letter words with j as the third letter and t as the sixth letter, we would do the following.

In [None]:
print([w for w in word_list if re.search('^..j..t$',w)]) #The '^'(caret symbol) marks the start of the word

In [None]:
print([w for w in word_list if re.search('..j..t',w)]) # Without the caret and the dollar symbols, the pattern can match anywhere inside a word

**Finally, the ? symbol specifies that the previous character is optional. Thus «^e-?mail$» will match both email and e-mail.**

In [None]:
sum(1 for w in word_list if re.search('^e-?mail$', w))

In [None]:
word_list.index('email')

In [None]:
'email' in word_list

###### Ranges and Closures

The T9 system is used for entering text on mobile phones (see Figure 3-5). Two or more
words that are entered with the same sequence of keystrokes are known as
textonyms. For example, both hole and golf are entered by pressing the sequence 4653.
What other words could be produced with the same sequence? Here we use the regular
expression «^[ghi][mno][jlk][def]$»:

In [None]:
[w for w in word_list if re.search('^[ghi][mno][jlk][def]$', w)]

Square brackets around a set of characters match any one of those characters. The bracketed sets are matched in sequence: the first letter must be 'g', 'h', or 'i', followed by a second letter that is 'm', 'n', or 'o', and so on.

In [None]:
print([w for w in word_list if re.search('^[a-fj-o]+$', w)]) # words made up only of characters from a to f and j to o, occurring one or more times

In [None]:
chat_words=sorted(set(w for w in nltk.corpus.nps_chat.words()))
print([w for w in chat_words if re.search('^[ha]+$', w)])

'+' means one or more instances of the preceding item.<br>
'*' means zero or more instances of the preceding item.<br>
Both are referred to as Kleene closures.<br>
'^' works as negation when it is the first character inside square brackets: [^aeiouAEIOU] matches any character that is not a vowel.<br>
'\' (backslash) is an escape character: the character that follows it loses its special meaning.<br>
'{}' specifies how many times the preceding item is repeated: {4} means exactly 4 times, {3,5} means 3 to 5 times, {5,} means at least 5 times, and {,6} means up to 6 times.<br>
'|' works like an or condition.
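
The operators listed above can be tried on a small hand-made word list. The list and the variable names below are made up for illustration; the sketch uses only the standard `re` module.

```python
import re

# Hypothetical sample words, chosen to exercise each operator.
words = ['ha', 'haha', 'aah', 'hmm', '1987', '3.14', 'e-mail', 'email', 'sky']

# '+' : one or more characters drawn from the set [ha]
only_h_a = [w for w in words if re.search(r'^[ha]+$', w)]

# '^' inside brackets negates the set: words containing no vowels at all
no_vowels = [w for w in words if re.search(r'^[^aeiouAEIOU]+$', w)]

# '\.' : the backslash makes the dot match a literal full stop
decimals = [w for w in words if re.search(r'^[0-9]+\.[0-9]+$', w)]

# '{4}' : exactly four digits
four_digits = [w for w in words if re.search(r'^[0-9]{4}$', w)]

# '?' : the hyphen is optional
emails = [w for w in words if re.search(r'^e-?mail$', w)]

# '|' : either alternative, with parentheses limiting its scope
ha_or_hmm = [w for w in words if re.search(r'^(ha|hmm)$', w)]

print(only_h_a, no_vowels, decimals, four_digits, emails, ha_or_hmm)
```

Note that the negated set also accepts digits and punctuation, since they are not vowels either.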

In [None]:
wsj=sorted(set(nltk.corpus.treebank.words()))
print([w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]) # the backslash escapes the dot so that it matches a literal '.'

In [None]:
print([w for w in wsj if re.search(r'^[A-Z]+\$$', w)])

In [None]:
print([w for w in wsj if re.search('^[0-9]{4}$', w)])#4 digit numbers 

In [None]:
print([w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)])

In [None]:
print([w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)])

In [None]:
print([w for w in wsj if re.search('(ed|ing)$', w)])

In [None]:
print([w for w in wsj if re.search('ed|ing$', w)]) # without the parentheses: 'ed' anywhere, or 'ing' at the end

In [None]:
print(len([w for w in wsj if re.search('(ed|ing)$', w)])) # count with the parentheses

In [None]:
print(len([w for w in wsj if re.search('ed|ing$', w)])) # count without the parentheses
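
The difference in counts comes from the scope of the `|` operator. A minimal sketch on a made-up word list (the words and variable names are assumptions for illustration):

```python
import re

# Hypothetical sample: some words end in 'ed' or 'ing', others merely contain 'ed'.
words = ['walked', 'singing', 'editor', 'red', 'ring', 'media']

# With parentheses, '$' applies to both alternatives:
# the word must END in 'ed' or in 'ing'.
with_parens = [w for w in words if re.search(r'(ed|ing)$', w)]

# Without parentheses, the alternatives are 'ed' (anywhere in the word)
# and 'ing$' (at the end of the word).
without_parens = [w for w in words if re.search(r'ed|ing$', w)]

print(with_parens)
print(without_parens)
```

Here editor and media are picked up only by the second pattern, because they contain ed without ending in it.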

### Useful Applications of Regular Expressions

In [None]:
strings_to_search = ['abc', 'def', 'fgh hello']

complete_list = ['abc abc dsss abc', 'defgj', 'abc fgh hello xabd', 'fgh helloijj']

for col_key in strings_to_search:
    print(list(map(lambda x: re.findall(col_key, x), complete_list)))  # the pattern must be a string, not a list

In [None]:
word='supercalifragilisticexpialidocious'
print(re.findall(r'[aeiou]', word))

In [None]:
fd= nltk.FreqDist(w for word in wsj for w in re.findall('[aeiou]{2,}', word))
print(fd.items())

In [None]:
[int(n) for n in re.findall(r'[0-9]+', '2009-12-31')]

It is sometimes noted that English text is highly redundant, and it is still easy to read
when word-internal vowels are left out. For example, declaration becomes dclrtn, and
inalienable becomes inlnble, retaining any initial or final vowel sequences. The regular
expression in our next example matches initial vowel sequences, final vowel sequences,
and all consonants; everything else is ignored.

In [None]:
def compress_word(word):
    # initial vowel sequence | final vowel sequence | any consonant
    return ''.join(re.findall(r'^[aeiouAEIOU]+|[aeiouAEIOU]+$|[^AEIOUaeiou]', word))


In [None]:
english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress_word(w) for w in english_udhr[:75]))

In [None]:
rotokas_words= nltk.corpus.toolbox.words('rotokas.dic')
cvs=[cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd=nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

In [None]:
cv_word_pairs=[(cv, w) for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index=nltk.Index(cv_word_pairs)

In [None]:
print(cv_index['pa'])

In [None]:
print(cv_index['su'])

In [None]:
print(cv_index['ka'])

In [None]:
print('kasuari' in cv_index['ka'])

### Finding Word Stems 

In [None]:
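re.findall(r'^.*(ing|ly|ed|ious|ies|ive|s|es|ment)$', 'processing')

In [None]:
# Square brackets define a character class rather than a disjunction,
# so this pattern only requires the word to end in one of the listed characters
re.findall(r'^.*[ing|ly|ed|ious|ies|ive|s|es|ment]$', 'processing')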

Here, re.findall() just gave us the suffix even though the regular expression matched
the entire word. This is because the parentheses have a second function, to select substrings
to be extracted. If we want to use the parentheses to specify the scope of the
disjunction, but not to select the material to be output, we have to add ?:, which is just
one of many arcane subtleties of regular expressions.

In [None]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|s|es|ment)$', 'processing')

However, we’d actually like to split the word into stem and suffix. So we should just
parenthesize both parts of the regular expression:

In [None]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|s|es|ment)$', 'processing')

This looks promising, but it still has a problem. Applied to a different word, processes, the same pattern returns [('processe', 's')].

The regular expression incorrectly found an -s suffix instead of an -es suffix. This demonstrates
another subtlety: the star operator is “greedy” and so the .* part of the expression
tries to consume as much of the input as possible. If we use the “non-greedy”
version of the star operator, written *?, we get what we want:

In [None]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|s|es|ment)$', 'processes')
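
To see the two behaviours side by side, the following sketch runs the greedy and the non-greedy version of the pattern on the same word. The variable names are chosen for illustration only.

```python
import re

suffixes = r'(ing|ly|ed|ious|ies|ive|s|es|ment)'

# Greedy: '.*' grabs as much as it can, leaving only 's' for the suffix group.
greedy = re.findall(r'^(.*)' + suffixes + r'$', 'processes')

# Non-greedy: '.*?' grabs as little as it can, so the longer suffix 'es' is found.
non_greedy = re.findall(r'^(.*?)' + suffixes + r'$', 'processes')

print(greedy)      # [('processe', 's')]
print(non_greedy)  # [('process', 'es')]
```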

In [None]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ive|s|es|ment)?$', 'language')

In [None]:
def stem(word):
    # Split the word into a stem and an optional suffix, and keep only the stem
    regexp=r'(.*?)(ing|ly|ed|ious|ive|s|es|ment)?$'
    stem, suffix=re.findall(regexp, word)[0]
    return stem

In [None]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens=nltk.word_tokenize(raw)
print([stem(w) for w in tokens])

Notice that our regular expression removed the s from ponds but also from is and
basis. It produced some non-words, such as distribut and deriv, but these are acceptable
stems in some applications.

### Searching Tokenized Text

You can use a special kind of regular expression for searching across multiple words in
a text (where a text is a list of tokens).

In [None]:
from nltk.corpus import gutenberg, nps_chat
mob= nltk.Text(gutenberg.words('melville-moby_dick.txt'))
mob.findall(r'<a> (<.*>) <man>')

In [None]:
chat = nltk.Text(nps_chat.words())
chat.findall(r'<.*><.*><bro>')

In [None]:
chat.findall(r'<l.*>{3,}')

In [None]:
nltk.re_show(r'(ing)', 'processing and thinking and painting and many ing\'s')

In [None]:
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies','learned']))
hobbies_learned.findall(r'<\w*> <and><other><\w*s>')

### Normalizing Text

In earlier program examples we have often converted text to lowercase before doing
anything with its words, e.g., set(w.lower() for w in text). By using lower(), we have
normalized the text to lowercase so that the distinction between The and the is ignored.
Often we want to go further than this and strip off any affixes, a task known as stemming.
A further step is to make sure that the resulting form is a known word in a
dictionary, a task known as lemmatization.

In [None]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)

In [None]:
print(tokens)

###### Stemmers

In [None]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
print([porter.stem(t) for t in tokens])  # Porter handles 'lying' correctly, mapping it to 'lie'
print([lancaster.stem(t) for t in tokens])  # Lancaster handles 'lying' incorrectly

The Porter stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words.

###### Lemmatization

The WordNet lemmatizer removes affixes only if the resulting word is in its dictionary.

In [None]:
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in tokens])

The WordNet lemmatizer is a good choice if you want to compile the vocabulary of
some texts and want a list of valid lemmas (or lexicon headwords).

### Regular Expressions for Tokenizing Text

Tokenization is the task of cutting a string into identifiable linguistic units that constitute
a piece of language data.

In [None]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my  kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""
print(re.split(r' ', raw))  # notice that the text is not split at newline characters

In [None]:
print(re.split(r'[ \t\n]+', raw))  # split on runs of spaces, tabs, and newlines

In [None]:
print(re.split(r'[ \t\n]', raw))  # without the plus: notice the empty string before the 'kitchen' token

In [None]:
print(re.split(r'\s', raw))#for any white space character split

\w matches word characters (for ASCII text, equivalent to [a-zA-Z0-9_]). <br>
\W is the complement of \w: all characters other than letters, digits, and underscores.

In [None]:
print(re.split(r'\W+', raw))

In [None]:
'xx'.split('x')

In [None]:
print(re.findall(r'\w+', raw))

In [None]:
print(re.findall(r'\w', raw))

The regular expression «\w+|\S\w*» will first
try to match any sequence of word characters. If no match is found, it will try to match
any non-whitespace character (\S is the complement of \s) followed by further word
characters. This means that punctuation is grouped with any following letters
(e.g., ’s) but that sequences of two or more punctuation characters are separated.

In [None]:
print(re.findall(r'\w+|\S\w*', raw))

We can extend \w+ to permit word-internal hyphens and apostrophes: «\w+([-']\w+)*». This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it’s. (We need to include ?: in this expression for reasons discussed earlier.) We’ll also add a pattern to match quote characters so these are kept separate from the text they enclose. The expression in this example also includes «[-.(]+», which causes the double hyphen, ellipsis, and open parenthesis to be tokenized separately.

In [None]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))

In [None]:
text = 'That U.S.A. poster-print costs $12.40...'
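pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():-_`]        # these are separate tokens; includes ], [
'''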
nltk.regexp_tokenize(text, pattern)

The special (?x) “verbose flag” tells Python to strip out the embedded whitespace and comments. When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
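
The effect of the verbose flag, and the idea of describing gaps rather than tokens, can be sketched with the standard `re` module alone. This is an illustration, not NLTK's implementation: `re.findall()` and `re.split()` stand in for `regexp_tokenize()` with and without `gaps=True`, and the pattern uses non-capturing groups so that `re.findall()` returns whole tokens.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# (?x) lets the pattern span several lines and carry comments.
token_pattern = r'''(?x)
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*          # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?    # currency and percentages
  | \.\.\.                # ellipsis
'''
tokens = re.findall(token_pattern, text)
print(tokens)

# Describing the gaps instead of the tokens. In verbose mode a literal
# space in the pattern is ignored, so whitespace is written as \s.
gap_pattern = r'''(?x) \s+   # one or more whitespace characters'''
pieces = re.split(gap_pattern, text)
print(pieces)
```

Splitting on gaps leaves the trailing ellipsis attached to the price, while the token pattern separates the two.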

We can evaluate a tokenizer by comparing the resulting tokens with a
wordlist, and then report any tokens that don’t appear in the wordlist,
using set(tokens).difference(wordlist). You’ll probably want to
lowercase all the tokens first.
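
A minimal sketch of this check, assuming a tiny hand-made wordlist in place of a real one such as nltk.corpus.words, and a simple regex tokenizer chosen for illustration:

```python
import re

# Hypothetical wordlist; a real evaluation would use a much larger one.
wordlist = {'the', 'cat', 'sat', 'on', 'mat'}

raw = "The cat sat on the mat, didn't it?"

# Simple tokenizer: words with optional internal hyphens or apostrophes,
# or single punctuation characters.
tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", raw)

# Lowercase the tokens first, then report those missing from the wordlist.
unknown = sorted(set(t.lower() for t in tokens).difference(wordlist))
print(unknown)
```

Punctuation and contractions show up in the report along with genuinely unknown words, so the result usually needs some filtering before it says much about the tokenizer.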

In [None]:
print('%d %d'% (4,3))


\D Any non-digit character (equivalent to [^0-9])<br>
\s Any whitespace character (equivalent to [ \t\n\r\f\v])<br>
\S Any non-whitespace character (equivalent to [^ \t\n\r\f\v])<br>
\w Any alphanumeric character (equivalent to [a-zA-Z0-9_])<br>
\W Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])<br>
\t The tab character<br>
\n The newline character

In [7]:
strings_to_search = ['abc', 'def', 'fgh hello']

complete_list = ['abc abc dsss abc', 'defgj', 'abc fgh hello xabd', 'fgh helloijj']

for col_key in strings_to_search:
    print(list(map(lambda x: re.findall('[col_key]', x), complete_list)))  # the quoted '[col_key]' is a literal character class, not the variable

[['c', 'c', 'c'], ['e'], ['c', 'e', 'l', 'l', 'o'], ['e', 'l', 'l', 'o']]
[['c', 'c', 'c'], ['e'], ['c', 'e', 'l', 'l', 'o'], ['e', 'l', 'l', 'o']]
[['c', 'c', 'c'], ['e'], ['c', 'e', 'l', 'l', 'o'], ['e', 'l', 'l', 'o']]


In [8]:
strings_to_search = ['abc', 'def', 'fgh hello']
complete_list = ['abc abc dsss abc', 'defgj', 'abc fgh hello xabd', 'fgh helloijj']

for col_key in strings_to_search:
    word = r'\b{}\b'.format(col_key)
    print(list(map(lambda x: re.findall(word, x), complete_list)))

[['abc', 'abc', 'abc'], [], ['abc'], []]
[[], [], [], []]
[[], [], ['fgh hello'], []]
