**Getting Started With NLTK:**

Our first step is to install NLTK with pip.

In [3]:
!pip install nltk==3.5



Now we need to install NumPy and Matplotlib.

In [4]:
!pip install numpy matplotlib



**Tokenizing:**
Tokenizing lets us split text by word or by sentence. It is a first step in turning unstructured text into structured data that is easier to analyze.

We shall tokenize both by word and by sentence.

1. **Tokenizing by word:** Words are like the atoms of natural language. These are the smallest units of meaning that still make sense on their own.

2. **Tokenizing by sentence:** When we tokenize by sentence, we can analyze how those words relate to one another and see more context.

Now, we shall import the relevant parts of NLTK so we can tokenize by word and by sentence.

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize

Here we can create a string to tokenize.

In [6]:
example_string = """
Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult."""

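sent_tokenize() and word_tokenize() rely on the punkt resource, so we download it before tokenizing.

In [7]: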
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

We can use sent_tokenize() to split up example_string into sentences.

In [8]:
sent_tokenize(example_string)

["\nMuad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn,\nand how many more believe learning to be difficult."]

Tokenizing example_string by sentence gives us a list of three sentence strings:

1. "Muad'Dib learned rapidly because his first training was in how to learn."
2. 'And the first lesson of all was the basic trust that he could learn.'
3. "It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."
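The raw output keeps the newlines from the original string, while the numbered list above shows the sentences without them. A minimal sketch of one way to tidy the sentences, using only built-in string methods (the name clean_sentences is just illustrative):

```python
# Sentences copied from the sent_tokenize() output above.
sentences = [
    "\nMuad'Dib learned rapidly because his first training was in how to learn.",
    "And the first lesson of all was the basic trust that he could learn.",
    "It's shocking to find how many people do not believe they can learn,\n"
    "and how many more believe learning to be difficult.",
]

# str.split() with no arguments splits on any run of whitespace,
# so re-joining with single spaces removes the stray newlines.
clean_sentences = [" ".join(sentence.split()) for sentence in sentences]

for sentence in clean_sentences:
    print(sentence)
```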

Next, we tokenize example_string by word.

In [9]:
word_tokenize(example_string)

["Muad'Dib",
 'learned',
 'rapidly',
 'because',
 'his',
 'first',
 'training',
 'was',
 'in',
 'how',
 'to',
 'learn',
 '.',
 'And',
 'the',
 'first',
 'lesson',
 'of',
 'all',
 'was',
 'the',
 'basic',
 'trust',
 'that',
 'he',
 'could',
 'learn',
 '.',
 'It',
 "'s",
 'shocking',
 'to',
 'find',
 'how',
 'many',
 'people',
 'do',
 'not',
 'believe',
 'they',
 'can',
 'learn',
 ',',
 'and',
 'how',
 'many',
 'more',
 'believe',
 'learning',
 'to',
 'be',
 'difficult',
 '.']

Here we got a list of strings that NLTK considers to be words, such as "Muad'Dib", 'training', and 'how'.

The following strings were also treated as separate tokens: "'s", ',', and '.'.

Note that "It's" was split at the apostrophe into 'It' and "'s", because "'s" is a recognized contraction of "is". "Muad'Dib" is not an accepted contraction, so it was left intact rather than read as two separate words.
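To look at the contraction handling in isolation, here is a small sketch that uses TreebankWordTokenizer, a rule-based tokenizer in nltk.tokenize that needs no downloaded resources. It is closely related to, but not necessarily identical to, what word_tokenize() does internally, so treat it as an illustration rather than a replacement.

```python
from nltk.tokenize import TreebankWordTokenizer

# Rule-based tokenizer; unlike word_tokenize(), it does not need punkt.
tokenizer = TreebankWordTokenizer()

# A real contraction is split at the apostrophe ...
contraction_tokens = tokenizer.tokenize("It's shocking.")

# ... while an apostrophe inside a name is left alone.
name_tokens = tokenizer.tokenize("Muad'Dib learned rapidly.")

print(contraction_tokens)
print(name_tokens)
```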

**Filtering Stop Words:** Stop words are very common words, such as 'in', 'is', and 'an', that add little meaning on their own. We often filter them out of a text before processing it further.

To filter stop words, we first download the stop word list and then import the relevant parts of NLTK.

In [10]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
from nltk.corpus import stopwords

In [12]:
from nltk.tokenize import word_tokenize

Let us filter the following quote.

In [13]:
worf_quote = "Sir, I protest. I am not a merry man!"

After tokenizing worf_quote by word, we can store the resulting list in words_in_quote.

In [14]:
words_in_quote = word_tokenize(worf_quote)

In [15]:
words_in_quote

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

We will use the stop words for "english", stored in a set so that membership checks are fast.

In [16]:
stop_words = set(stopwords.words("english"))

Next, we create an empty list to hold the words that make it past the filter.

In [17]:
filtered_list = []

Now we can use stop_words to filter words_in_quote.

In [18]:
for word in words_in_quote:
    # Compare case-insensitively, since the stop word list is lowercase
    if word.casefold() not in stop_words:
        filtered_list.append(word)

We iterated over words_in_quote with a for loop and added all the words that were not stop words to filtered_list. We used .casefold() on word so we could ignore whether the letters in the word were uppercase or lowercase. This is worth doing because stopwords.words('english') includes only lowercase versions of stop words.

Alternatively, we could use a list comprehension to make a list of all the words in our text that are not stop words:

In [19]:
filtered_list = [
    word for word in words_in_quote if word.casefold() not in stop_words
]

Finally, we got the following filtered_list.

In [20]:
filtered_list

['Sir', ',', 'protest', '.', 'merry', 'man', '!']

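We filtered out words like 'am' and 'a' from the sentence. The pronoun 'I' and the adverb 'not' were removed as well, because both appear in NLTK's English stop word list. Note that dropping 'not' changes the meaning of the sentence, so depending on the analysis we may want to keep some stop words.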
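If negations matter for the task, one option is to remove them from the stop word set before filtering. The sketch below is self-contained, so it uses a small hand-written stop word set as a stand-in; with NLTK, the same set subtraction could be applied to set(stopwords.words("english")). The names negations and custom_stop_words are illustrative.

```python
# Hypothetical, hand-written subset of stop words for illustration only;
# NLTK's English list is much longer.
stop_words = {"i", "am", "a", "not", "the", "is"}

# Negations flip the meaning of a sentence, so we keep them in the text.
negations = {"not", "no", "nor"}
custom_stop_words = stop_words - negations

words_in_quote = ["Sir", ",", "I", "protest", ".", "I", "am", "not", "a", "merry", "man", "!"]

filtered_list = [
    word for word in words_in_quote if word.casefold() not in custom_stop_words
]
print(filtered_list)
```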

**Stemming:** Stemming is a text processing task in which we reduce words to their root, which is the core part of a word.

First, we import the relevant parts of NLTK in order to start stemming.

In [21]:
from nltk.stem import PorterStemmer

In [22]:
from nltk.tokenize import word_tokenize

We can create a stemmer with PorterStemmer().

In [23]:
stemmer = PorterStemmer()

Here is a string to stem.

In [24]:
string_for_stemming = """
The crew of the USS Discovery discovered many discoveries.
Discovering is what explorers do."""

Before we can stem the words in that string, we need to separate all the words in it.

In [25]:
words = word_tokenize(string_for_stemming)

Now we have a list of all the tokenized words in the string.

In [26]:
words

['The',
 'crew',
 'of',
 'the',
 'USS',
 'Discovery',
 'discovered',
 'many',
 'discoveries',
 '.',
 'Discovering',
 'is',
 'what',
 'explorers',
 'do',
 '.']

We can create a list of the stemmed versions of the words in words by using stemmer.stem() in a list comprehension.

In [27]:
stemmed_words = [stemmer.stem(word) for word in words]

In [28]:
stemmed_words

['the',
 'crew',
 'of',
 'the',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']
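In the output above, the related words do not all end up with the same stem: 'Discovery' and 'discoveries' became 'discoveri', while 'discovered' and 'Discovering' became 'discov'. A short sketch that groups words by their stem makes this easier to see (the name stem_groups is illustrative; PorterStemmer needs no downloaded resources):

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Discovery", "discovered", "discoveries", "Discovering"]

# Map each stem to the original words that produced it.
stem_groups = defaultdict(list)
for word in words:
    stem_groups[stemmer.stem(word)].append(word)

for stem, originals in stem_groups.items():
    print(stem, "<-", originals)
```

Neither stem is a complete English word, which is typical of stemming: it trims suffixes by rule rather than looking words up in a dictionary.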

**Tagging Parts of Speech:** Part of speech is a grammatical term that deals with the roles words play when we use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in our text according to their part of speech.

This is how to import the relevant parts of NLTK in order to tag parts of speech.

In [29]:
from nltk.tokenize import word_tokenize

In [30]:
sagan_quote = """
If you wish to make an apple pie from scratch,
you must first invent the universe."""

We can use word_tokenize to separate the words in that string and store them in a list.

In [31]:
words_in_sagan_quote = word_tokenize(sagan_quote)

Now let us call nltk.pos_tag() on our new list of words.

In [32]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(words_in_sagan_quote)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('If', 'IN'),
 ('you', 'PRP'),
 ('wish', 'VBP'),
 ('to', 'TO'),
 ('make', 'VB'),
 ('an', 'DT'),
 ('apple', 'NN'),
 ('pie', 'NN'),
 ('from', 'IN'),
 ('scratch', 'NN'),
 (',', ','),
 ('you', 'PRP'),
 ('must', 'MD'),
 ('first', 'VB'),
 ('invent', 'VB'),
 ('the', 'DT'),
 ('universe', 'NN'),
 ('.', '.')]

Each word in the quote is now in its own tuple, paired with a tag that represents its part of speech.
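Because the result is a plain list of (word, tag) tuples, it can be filtered with ordinary Python. The sketch below copies the tagged output from above instead of calling the tagger again, and picks out words by tag prefix; the names nouns and verbs are illustrative.

```python
# Tagged tuples copied from the nltk.pos_tag() output above.
tagged_words = [
    ("If", "IN"), ("you", "PRP"), ("wish", "VBP"), ("to", "TO"),
    ("make", "VB"), ("an", "DT"), ("apple", "NN"), ("pie", "NN"),
    ("from", "IN"), ("scratch", "NN"), (",", ","), ("you", "PRP"),
    ("must", "MD"), ("first", "VB"), ("invent", "VB"), ("the", "DT"),
    ("universe", "NN"), (".", "."),
]

# Noun tags in this tagset start with "NN"; verb tags start with "VB".
nouns = [word for word, tag in tagged_words if tag.startswith("NN")]
verbs = [word for word, tag in tagged_words if tag.startswith("VB")]

print(nouns)
print(verbs)
```

Note that 'first' lands in the verb list because the tagger labeled it 'VB' in the output above. Automatic tagging is not always right, so filtered results are worth a quick check.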

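To look up the tags and their meanings, we first download the tagsets resource. After that, calling nltk.help.upenn_tagset() prints each tag with a description and examples.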

In [33]:
nltk.download('tagsets')

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

To see how the tagger handles made-up words, we can make a string to hold an excerpt from the nonsense poem "Jabberwocky".

In [34]:
jabberwocky_excerpt = """
'Twas brillig, and the slithy toves did gyre and gimble in the wabe:
all mimsy were the borogoves, and the mome raths outgrabe."""

We can use word_tokenize to separate the words in the excerpt and store them in a list.

In [35]:
words_in_excerpt = word_tokenize(jabberwocky_excerpt)

Let us call nltk.pos_tag() on our new list of words.

In [36]:
nltk.pos_tag(words_in_excerpt)

[("'Twas", 'CD'),
 ('brillig', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('the', 'DT'),
 ('slithy', 'JJ'),
 ('toves', 'NNS'),
 ('did', 'VBD'),
 ('gyre', 'NN'),
 ('and', 'CC'),
 ('gimble', 'JJ'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('wabe', 'NN'),
 (':', ':'),
 ('all', 'DT'),
 ('mimsy', 'NNS'),
 ('were', 'VBD'),
 ('the', 'DT'),
 ('borogoves', 'NNS'),
 (',', ','),
 ('and', 'CC'),
 ('the', 'DT'),
 ('mome', 'JJ'),
 ('raths', 'NNS'),
 ('outgrabe', 'RB'),
 ('.', '.')]

Here, accepted English words like 'and' and 'the' were correctly tagged as a conjunction and a determiner, respectively. The gibberish word 'slithy' was tagged as an adjective; that is what a human English speaker would probably assume from the context of the poem as well.

**Lemmatizing:** Like stemming, lemmatizing reduces words to their core meaning, but it gives us a complete English word that makes sense on its own instead of just a fragment of a word like 'discoveri'.

Let us see how to import the relevant parts of NLTK in order to start lemmatizing.

In [37]:
from nltk.stem import WordNetLemmatizer

We can create a lemmatizer to use.

In [38]:
lemmatizer = WordNetLemmatizer()

Let us start by lemmatizing a plural noun.

In [39]:
nltk.download('wordnet')
lemmatizer.lemmatize("scarves")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


'scarf'

"scarves" gave us 'scarf', which is more meaningful than 'scarv', the result the Porter stemmer would give. Next, create a string with more than one word to lemmatize:

In [40]:
string_for_lemmatizing = "The friends of DeSoto love scarves."

Now we can tokenize that string by word.

In [41]:
words = word_tokenize(string_for_lemmatizing)

Here is our list of words:

In [42]:
words

['The', 'friends', 'of', 'DeSoto', 'love', 'scarves', '.']

Now we create a list containing all the words in words after they have been lemmatized.

In [43]:
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

This is the list we got.

In [44]:
lemmatized_words

['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']

Let us try lemmatizing "worst".

In [45]:
lemmatizer.lemmatize("worst")

'worst'

We got the result 'worst' because lemmatizer.lemmatize() assumed that "worst" was a noun. We can make it clear that we want "worst" to be an adjective.

In [46]:
lemmatizer.lemmatize("worst", pos="a")

'bad'

The default value of pos is 'n' for noun, but we made sure that "worst" was treated as an adjective by passing pos="a". As a result, we got 'bad', which looks very different from the original word and is nothing like what stemming would produce. This happens because "worst" is the superlative form of the adjective 'bad', and lemmatizing can reduce superlatives as well as comparatives to their lemmas.
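Since lemmatize() assumes a noun unless told otherwise, a common pattern is to translate the Penn Treebank tags produced by nltk.pos_tag() into the single-letter codes that the pos parameter accepts. The helper below is a hypothetical sketch of such a mapping, not part of NLTK, and it deliberately falls back to 'n' for any tag it does not recognize.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet pos code (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verb: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverb: RB, RBR, RBS
    return "n"      # everything else, matching lemmatize()'s default

for tag in ["JJS", "VBD", "RB", "NNS", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With this helper, a call such as lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) could be applied to each (word, tag) pair returned by nltk.pos_tag().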

**Chunking:** While tokenizing allows us to identify words and sentences, chunking allows us to identify phrases.

This is how to import the relevant parts of NLTK in order to chunk.

In [47]:
from nltk.tokenize import word_tokenize

Before we can chunk, we need to make sure that the parts of speech in our text are tagged, so create a string for POS tagging.

In [48]:
lotr_quote = "It's a dangerous business, Frodo, going out your door."

Now we can tokenize that string by word.

In [49]:
words_in_lotr_quote = word_tokenize(lotr_quote)

In [50]:
words_in_lotr_quote

['It',
 "'s",
 'a',
 'dangerous',
 'business',
 ',',
 'Frodo',
 ',',
 'going',
 'out',
 'your',
 'door',
 '.']

In the next step, we can tag those words by part of speech.

In [51]:
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [52]:
lotr_pos_tags = nltk.pos_tag(words_in_lotr_quote)

In [53]:
lotr_pos_tags

[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]

We have a list of tuples of all the words in the quote, along with their POS tag. In order to chunk, we first need to understand a chunk grammar. A **chunk grammar** is a combination of rules on how sentences should be chunked. It often uses regular expressions, or regexes.

Let us create a chunk grammar with one regular expression rule.

In [54]:
grammar = "NP: {<DT>?<JJ>*<NN>}"

Here, NP stands for noun phrase. According to the rule we created, our chunks:

1. Start with an optional (?) determiner ('DT')
2. Can have any number (*) of adjectives ('JJ')
3. End with a noun ('NN')

We can create a chunk parser with this grammar.

In [55]:
chunk_parser = nltk.RegexpParser(grammar)

Now we can try it out with our quote.

In [57]:
tree = chunk_parser.parse(lotr_pos_tags)


We can display a visual representation of the tree with tree.draw(), which opens a separate window and therefore needs a graphical display.
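When no graphical display is available, the tree can also be inspected as text. The sketch below rebuilds the parser from the tags shown earlier so that it runs on its own (RegexpParser needs no downloaded resources), prints the tree, and collects the words in each NP chunk; the name noun_phrases is illustrative.

```python
import nltk

# POS-tagged tokens copied from the lotr_pos_tags output above.
lotr_pos_tags = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(lotr_pos_tags)

# Printing the tree gives a bracketed text representation.
print(tree)

# Collect the words of every subtree labeled NP.
noun_phrases = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

The rule finds two noun phrases, 'a dangerous business' and 'door'. 'Frodo' is tagged 'NNP' rather than 'NN', so this grammar does not chunk it.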

**Chinking:** Chinking is used together with chunking, but while chunking is used to include a pattern, chinking is used to exclude a pattern.

In [58]:
lotr_pos_tags

[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]

In [59]:
grammar = """
Chunk: {<.*>+}
       }<JJ>{"""

The first rule of our grammar is {<.*>+}. This rule has curly braces that face inward ({}) because it is used to determine what patterns we want to include in our chunks. In this case, we want to include everything: <.*>+.

The second rule of our grammar is }<JJ>{. This rule has curly braces that face outward (}{) because it is used to determine what patterns we want to exclude from our chunks. In this case, we want to exclude adjectives: <JJ>.

In [60]:
chunk_parser = nltk.RegexpParser(grammar)

In [64]:
tree = chunk_parser.parse(lotr_pos_tags)


As before, we can display a visual representation of the tree with tree.draw(), which opens a separate window and therefore needs a graphical display.
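The effect of the chink rule can also be checked as text. This self-contained sketch reuses the tags shown earlier, parses them with the chunk-then-chink grammar, and separates the chunked words from the tokens left outside any chunk; the names chunks and excluded are illustrative.

```python
import nltk

# POS-tagged tokens copied from the lotr_pos_tags output above.
lotr_pos_tags = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

# First chunk everything, then chink (exclude) adjectives.
grammar = """
Chunk: {<.*>+}
       }<JJ>{"""
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(lotr_pos_tags)
print(tree)

# Words inside each Chunk subtree.
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]

# Top-level children that are plain (word, tag) tuples were left out of every chunk.
excluded = [child[0] for child in tree if not isinstance(child, nltk.Tree)]

print(chunks)
print(excluded)
```

The adjective 'dangerous' is excluded, which splits the sentence into two chunks: one before it and one after it.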