Source: Hands-On Python Natural Language Processing By Aman Kedia and Mayank Rasu

In [1]:
#Creating tokens with .split
sentence = "The capital of China is Beijing"
sentence.split()

In [2]:
#Tokenization issues with apostrophes
sentence = "Beijing is where we'll go"
sentence.split()

In [3]:
#Tokenization with a period
sentence = "A friend is pursuing his M.S from Beijing"
sentence.split()
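
The three cells above show no output, so here is a minimal sketch of what whitespace splitting returns for the same sentences. The variable names sentences and tokenized are illustrative and not from the source. Note that the contraction we'll and the abbreviation M.S each stay as a single token, which is the limitation the comments above point at.

```python
# Whitespace splitting treats every run of non-space characters as one token.
sentences = [
    "The capital of China is Beijing",
    "Beijing is where we'll go",
    "A friend is pursuing his M.S from Beijing",
]
tokenized = [sentence.split() for sentence in sentences]
for tokens in tokenized:
    print(tokens)
# ['The', 'capital', 'of', 'China', 'is', 'Beijing']
# ['Beijing', 'is', 'where', "we'll", 'go']
# ['A', 'friend', 'is', 'pursuing', 'his', 'M.S', 'from', 'Beijing']
```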

In [4]:
#Regular expression based tokenizers
from nltk.tokenize import RegexpTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."
#Raw string so that the backslashes reach the regex engine unchanged
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)

['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$3000.0',
 '-',
 '$8000.0',
 'in',
 'USA',
 '.']

The \w+|\$[\d\.]+|\S+ regular expression allows three alternative patterns, which are tried from left to right at each position:
First alternative: \w+. Here, \w matches any word character (equal to [a-zA-Z0-9_] for ASCII text; in Python 3 it also matches Unicode letters and digits by default). The + is a greedy quantifier that matches one or more times, as many times as possible.

Second alternative: \$[\d\.]+. Here, \$ matches the character $, and the character class [\d\.] matches either a digit between 0 and 9 or the character . (period). The + again acts as a quantifier matching one or more times.

Third alternative: \S+. Here, \S accepts any non-whitespace character and + again acts the same way as in the preceding two alternatives.
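
To see how the three alternatives interact, the same pattern can be run with the standard re module. This is a sketch that assumes collecting all non-overlapping matches approximates what RegexpTokenizer does with this pattern; the helper name regex_tokenize is hypothetical.

```python
import re

# Same pattern as in the RegexpTokenizer cell; alternatives are tried left to right.
PATTERN = re.compile(r"\w+|\$[\d\.]+|\S+")

def regex_tokenize(text):
    """Return every non-overlapping match of PATTERN, scanning left to right."""
    return PATTERN.findall(text)

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."
print(regex_tokenize(s))

# Without the leading $, \w+ matches first and stops at the period,
# so the remainder ".0" falls through to \S+.
print(regex_tokenize("3000.0"))  # ['3000', '.0']
```

The second call shows why the order of the alternatives matters: a price is only kept whole when the $ prefix lets the second alternative take over.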

In [5]:
#The Treebank tokenizer also uses regular expressions to tokenize text according to the
#Penn Treebank conventions. Here, words are mostly split based on punctuation.
from nltk.tokenize import TreebankWordTokenizer
s = "I'm going to buy a Rolex watch that doesn't cost more than $3000.0"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(s)

['I',
 "'m",
 'going',
 'to',
 'buy',
 'a',
 'Rolex',
 'watch',
 'that',
 'does',
 "n't",
 'cost',
 'more',
 'than',
 '$',
 '3000.0']

In [6]:
#Tweet Tokenizer
from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer()
tokenizer.tokenize(s)

['@amankedia',
 "I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxxxxxxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

In [7]:
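#Strip handles and reduce length
#The parameter strip_handles, when set to True, removes the handles mentioned in a post/tweet.
#reduce_len, when set to True, shortens any run of three or more repeated characters to exactly three.
#preserve_case (not used in this cell), when set to False, converts everything to lower case in order to normalize the vocabulary.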
from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokenizer.tokenize(s)

["I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']
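
The effect of reduce_len on Rolexxxxxxxx can be sketched with a single backreference substitution. This is an approximation written with the standard re module, on the assumption that it mirrors the behavior seen in the output above; it is not NLTK's own code, and the function name is illustrative.

```python
import re

def reduce_lengthening(text):
    """Shorten any run of three or more identical characters to exactly three."""
    # (.) captures one character; \1{2,} requires at least two more copies of it.
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

print(reduce_lengthening("Rolexxxxxxxx watch!!!"))  # Rolexxx watch!!!
```

Runs of exactly three, such as the !!! here, are left unchanged, which matches the three separate ! tokens in the output above.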

In [8]:
#Snowball languages
from nltk.stem.snowball import SnowballStemmer
print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


The Porter stemmer works only with strings, whereas the Snowball stemmer works with both strings and Unicode data (a distinction that matters mainly in Python 2, since every Python 3 str is Unicode). The Snowball stemmer also has a built-in option, ignore_stopwords, that leaves stopwords unstemmed.

In [9]:
plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned',
'humbled', 'sized', 'meeting', 'stating',
 'siezing', 'itemization', 'traditional', 'reference', 'colonizer',
'plotted', 'having', 'generously']

In [10]:
#Porter Stem
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have gener


In [11]:
#Snowball Stemmer
stemmer2 = SnowballStemmer(language='english')
singles = [stemmer2.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have generous


In [12]:
#Lemmatizer
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

s = "We are putting in efforts to enhance our understanding of Lemmatization"
token_list = s.split()
print("The tokens are: ", token_list)

lemmatized_output = ' '.join([lemmatizer.lemmatize(token) for token \
                              in token_list])
print("The lemmatized output is: ", lemmatized_output)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/emanuel.s/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


The tokens are:  ['We', 'are', 'putting', 'in', 'efforts', 'to', 'enhance', 'our', 'understanding', 'of', 'Lemmatization']
The lemmatized output is:  We are putting in effort to enhance our understanding of Lemmatization


In [13]:
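#Using POS to help Lemmatizer
#pos_tag needs a tagger model, e.g. nltk.download('averaged_perceptron_tagger'); the resource name may differ across NLTK versions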
pos_tags = nltk.pos_tag(token_list)
pos_tags

[('We', 'PRP'),
 ('are', 'VBP'),
 ('putting', 'VBG'),
 ('in', 'IN'),
 ('efforts', 'NNS'),
 ('to', 'TO'),
 ('enhance', 'VB'),
 ('our', 'PRP$'),
 ('understanding', 'NN'),
 ('of', 'IN'),
 ('Lemmatization', 'NN')]

In [14]:
from nltk.corpus import wordnet
#This is a common helper that is widely used across the NLP community
def get_part_of_speech_tags(token):
    """Maps POS tags to the first character lemmatize() accepts.
    We are focusing on Verbs, Nouns, Adjectives and Adverbs here."""
    #Keyed by the first letter of the Penn Treebank tag
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    #Tag the token on its own and keep only the first letter of its tag
    tag = nltk.pos_tag([token])[0][1][0].upper()
    #Fall back to noun for any other tag
    return tag_dict.get(tag, wordnet.NOUN)

In [15]:
lemmatized_output_with_POS_information = [lemmatizer.lemmatize(token,
get_part_of_speech_tags(token)) for token in token_list]
print(' '.join(lemmatized_output_with_POS_information))

We be put in effort to enhance our understand of Lemmatization
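
The mapping step relies only on the first letter of each Penn Treebank tag. The sketch below isolates that step so it runs without NLTK: the single-letter values are the strings that wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV hold, and the names TAG_TO_WORDNET and penn_to_wordnet are hypothetical.

```python
# String values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV,
# written out so the sketch runs without NLTK installed.
TAG_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(penn_tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS using its first letter."""
    return TAG_TO_WORDNET.get(penn_tag[:1].upper(), default)

# (token, tag) pairs copied from the pos_tag output shown earlier
tagged = [("putting", "VBG"), ("efforts", "NNS"),
          ("understanding", "NN"), ("We", "PRP")]
mapped = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(mapped)
# [('putting', 'v'), ('efforts', 'n'), ('understanding', 'n'), ('We', 'n')]
```

One design point worth noting: the helper in the cell above tags each token in isolation, while the earlier pos_tag call tagged the whole sentence. In the sentence, understanding was tagged NN, yet the lemmatized output shows understand, which suggests the isolated tagging treated it as a verb. Feeding the sentence-level tags through a mapping like this one is a possible alternative when context should be preserved.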


In [16]:
#Compare to snowball stemmer
stemmer2 = SnowballStemmer(language='english')
stemmed_sentence = [stemmer2.stem(token) for token in token_list]
print(' '.join(stemmed_sentence))

we are put in effort to enhanc our understand of lemmat


In [17]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
", ".join(stop)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/emanuel.s/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


"than, off, whom, aren't, an, mightn, to, weren't, couldn't, themselves, you, isn, both, mustn, nor, don, here, has, you'll, against, when, doesn, all, him, above, just, each, my, not, she, you've, up, can, any, now, only, and, d, himself, shouldn, through, isn't, your, were, ma, where, didn't, herself, mightn't, during, of, in, does, will, who, after, needn't, aren, wouldn, the, how, into, doing, having, she's, do, it's, this, between, own, yourselves, out, theirs, so, other, few, some, before, wouldn't, weren, t, its, if, it, should, about, most, on, for, his, her, should've, y, shouldn't, you're, couldn, won, i, are, over, wasn't, such, was, why, needn, under, shan't, you'd, ll, doesn't, ain, while, myself, below, shan, be, down, which, at, ours, them, did, these, our, they, hers, o, that'll, is, further, me, wasn, there, with, hasn, by, then, very, he, we, hadn't, yourself, what, but, or, am, because, no, as, didn, haven, don't, that, itself, mustn't, once, yours, ourselves, being,

In [18]:
#Keeping Wh- words
wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
stop = set(stopwords.words('english'))

In [19]:
sentence = "how are we putting in efforts to enhance our understanding of Lemmatization"
for word in wh_words:
    stop.discard(word)  #discard, unlike remove, does not fail if the cell is run twice
sentence_after_stopword_removal = [token for token in sentence.split() if token not in stop]
" ".join(sentence_after_stopword_removal)

'how putting efforts enhance understanding Lemmatization'
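
The cell above compares tokens to the stopword set exactly as written, so a capitalized How or We would not match the lowercase entries. A minimal sketch of a case-insensitive variant follows; it uses a small hand-picked subset of the stopword list shown earlier so that it runs without NLTK data, and the names STOPWORDS, WH_WORDS and remove_stopwords are illustrative.

```python
# Hand-picked subset of the English stopword list, for illustration only
STOPWORDS = {"how", "are", "we", "in", "to", "our", "of"}
WH_WORDS = {"who", "what", "when", "why", "how", "which", "where", "whom"}

def remove_stopwords(sentence, stopwords=STOPWORDS, keep=WH_WORDS):
    """Drop stopwords case-insensitively, except for the words listed in keep."""
    effective = stopwords - keep  # set difference leaves the inputs untouched
    return [token for token in sentence.split() if token.lower() not in effective]

sentence = "How are we putting in efforts to enhance our understanding of Lemmatization"
print(" ".join(remove_stopwords(sentence)))
# How putting efforts enhance understanding Lemmatization
```

Using a set difference instead of removing items in a loop also leaves the original stopword set unchanged, so the cell can be re-run safely.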

In [23]:
#Bigrams
from nltk.util import ngrams
s = "Natural Language Processing is the way to go"
tokens = s.split()
bigrams = list(ngrams(tokens, 2))
[" ".join(token) for token in bigrams]

['Natural Language',
 'Language Processing',
 'Processing is',
 'is the',
 'the way',
 'way to',
 'to go']

In [24]:
#Trigrams
s = "Natural Language Processing is the way to go"
tokens = s.split()
trigrams = list(ngrams(tokens, 3))
[" ".join(token) for token in trigrams]

['Natural Language Processing',
 'Language Processing is',
 'Processing is the',
 'is the way',
 'the way to',
 'way to go']
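
What the bigram and trigram cells compute can be sketched in plain Python by zipping shifted copies of the token list. This is an illustrative equivalent for simple lists, not NLTK's implementation, and the name ngrams_sketch is hypothetical.

```python
def ngrams_sketch(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, so len(tokens) - n + 1 tuples are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Natural Language Processing is the way to go".split()
print([" ".join(gram) for gram in ngrams_sketch(tokens, 2)])
print(len(ngrams_sketch(tokens, 3)))  # 6
```

With eight tokens this yields seven bigrams and six trigrams, matching the outputs above.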

In [27]:
#Using Beautiful Soup to help clear HTML tags
from bs4 import BeautifulSoup
html = "<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
#Name the parser explicitly to avoid a warning and parser-dependent results
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

My First HeadingMy first paragraph.
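
In the output above the heading and the paragraph run together, because the text nodes are joined with no separator. Beautiful Soup's get_text accepts a separator argument for this. The sketch below shows the same idea using only the standard library's html.parser rather than Beautiful Soup; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags and the doctype never reach here.
        if data.strip():
            self.chunks.append(data.strip())

def strip_tags(html, separator=" "):
    """Return the document text with the text nodes joined by separator."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)

html = "<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
print(strip_tags(html))  # My First Heading My first paragraph.
```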


In [28]:
s = "Natural Language Processing is the way to go"
tokens = set(s.split())
vocabulary = sorted(tokens)
vocabulary

['Language', 'Natural', 'Processing', 'go', 'is', 'the', 'to', 'way']

Chapter 3 Vocab

In simple terms, a lexicon can be thought of as a dictionary of terms that are called lexemes.

Phonemes can be thought of as the speech sounds (units of sound made by the mouth) that can differentiate one word from another in a language.

Graphemes are groups of one or more letters that can represent these individual sounds or phonemes. The word spoon consists of five letters that actually represent four phonemes, identified by the graphemes s, p, oo, and n.

A morpheme is the smallest meaningful unit in a language. The word unbreakable is composed of three morphemes:
un: a bound morpheme signifying not
break: the root morpheme
able: a free morpheme signifying can be done

Tokenization - In order to build up a vocabulary, the first thing to do is to break the documents or sentences into chunks called tokens.

Regular expressions are sequences of characters that define a search pattern. 

Stemming - a crude attempt to remove the inflectional endings of a word and reduce it to a base form called the stem

Overstemming - words are stemmed to the same root when they should have been stemmed to different roots

Understemming - words that should have been stemmed to the same root are stemmed to different roots
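
To make the two failure modes concrete, here is a deliberately crude suffix stripper. It is a hypothetical toy, not the Porter or Snowball algorithm, and the suffix list is an assumption chosen purely for illustration.

```python
# Toy suffix list, checked in order; NOT the Porter or Snowball rules.
SUFFIXES = ("ing", "ity", "al", "ed", "e", "s")

def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# Overstemming: three words with different meanings collapse to one stem.
print([crude_stem(w) for w in ("universal", "university", "universe")])
# ['univers', 'univers', 'univers']

# Understemming: two forms of the same verb end up with different stems.
print([crude_stem(w) for w in ("running", "ran")])
# ['runn', 'ran']
```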

Lemmatization is a process wherein the context is used to convert a word to its meaningful base form.

Case Folding - turns all letters in a text corpus to lower case
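
A short sketch of how case folding shrinks a vocabulary, in the same spirit as the sorted-set vocabulary cell earlier. It assumes whitespace tokenization and uses str.casefold(), which behaves like lower() for ASCII text; the function name build_vocabulary and the example sentence are illustrative.

```python
def build_vocabulary(text, fold_case=False):
    """Return the sorted set of whitespace tokens, optionally case-folded."""
    tokens = text.split()
    if fold_case:
        tokens = [token.casefold() for token in tokens]
    return sorted(set(tokens))

text = "The capital of China is Beijing and the capital of Japan is Tokyo"
print(len(build_vocabulary(text)))                  # 10, since The and the are distinct
print(len(build_vocabulary(text, fold_case=True)))  # 9, since The and the are merged
```

The trade-off is that folding also erases distinctions that capitalization carries, such as proper nouns, so whether to apply it depends on the task.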