
Natural Language Toolkit (NLTK) is one of the largest Python libraries for performing various Natural Language Processing tasks.

**Table of Contents**

What is the Natural Language Toolkit (NLTK)?

Tokenization

Stemming and Lemmatization

Stemming

Lemmatization

Part of Speech Tagging





## What is the Natural Language Toolkit (NLTK)?

NLTK is a Python library for working with human language data. It can perform a variety of operations on textual data, such as classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

In [1]:
! pip install nltk



### Accessing Additional Resources:

To use additional resources, such as resources for languages other than English, you can run the following in a Python script. It has to be done only once, the first time you use NLTK on your system. The 'all' collection includes every available package, so the download can take some time.

In [2]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

Now, having installed NLTK successfully in our system, let’s perform some basic operations on text data using NLTK.

## Tokenization

Tokenization refers to breaking text down into smaller units. It entails splitting paragraphs into sentences and sentences into words, and it is one of the first steps of any NLP pipeline.

Let us have a look at the two major kinds of tokenization that NLTK provides:

### Word Tokenization

It involves breaking down the text into words.
"I study Machine Learning." will be word-tokenized as
  ['I', 'study', 'Machine', 'Learning', '.'].

### Sentence Tokenization
It involves breaking down the text into individual sentences.

In Python, both of these tokenizations can be implemented in NLTK as follows:

In [3]:
# Tokenization using NLTK
from nltk import word_tokenize, sent_tokenize

# Adjacent string literals are joined; the space after the first period
# keeps the two sentences from being glued together as "platform.It"
sent = ("Youtube is a great learning platform. "
        "It is one of the best for Data Science students.")
print(word_tokenize(sent))  # words and punctuation marks
print(sent_tokenize(sent))  # individual sentences


['Youtube', 'is', 'a', 'great', 'learning', 'platform', '.', 'It', 'is', 'one', 'of', 'the', 'best', 'for', 'Data', 'Science', 'students', '.']
['Youtube is a great learning platform.', 'It is one of the best for Data Science students.']
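
To make the difference between the two kinds of tokenization concrete, here is a minimal sketch that imitates them with regular expressions from the standard library. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and their rules are far cruder than NLTK's tokenizers; the sketch only illustrates the idea.

```python
import re

def simple_sent_tokenize(text):
    """Split text after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "I study Machine Learning. It is fun!"
sentences = simple_sent_tokenize(text)
words = simple_word_tokenize(sentences[0])
print(sentences)  # ['I study Machine Learning.', 'It is fun!']
print(words)      # ['I', 'study', 'Machine', 'Learning', '.']
```

Because this sketch only splits where whitespace follows the punctuation, a string such as "One.Two" stays a single sentence, so missing spaces between sentences matter.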


## Stemming and Lemmatization
When working with Natural Language, we are not much interested in the form of words – rather, we are concerned with the meaning that the words intend to convey. Thus, we try to map every word of the language to its root/base form. This process is called canonicalization.

E.g. The words ‘play’, ‘plays’, ‘played’, and ‘playing’ convey the same action – hence, we can map them all to their base form i.e. ‘play’.

Now, there are two widely used canonicalization techniques: Stemming and Lemmatization.


### Stemming
Stemming generates the base word from the inflected word by removing the affixes of the word. It has a set of pre-defined rules that govern the dropping of these affixes. It must be noted that stemmers might not always result in semantically meaningful base words.  Stemmers are faster and computationally less expensive than lemmatizers.

In the following code, we will be stemming words using Porter Stemmer – one of the most widely used stemmers:

In [4]:
from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))


play
play
play
play


We can see that all the variations of the word ‘play’ have been reduced to the same word, ‘play’. In this case, the output is a meaningful word. However, this is not always the case. Unlike a lemmatizer, which stores groups of inflected forms, a stemmer only removes affixes by rule and does not check that the result is a real word. Let us take an example.

In [12]:
from nltk.stem import PorterStemmer
# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("communication"))


commun


The stemmer reduces the word ‘communication’ to a base word ‘commun’ which is meaningless in itself.
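
The rule-based idea behind stemming can be sketched in a few lines. The function below is a toy with a made-up suffix list (`toy_stem` and its rules are assumptions for illustration); it is not the Porter algorithm, which applies many more rules and conditions.

```python
def toy_stem(word, suffixes=("ation", "ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when enough of the word remains to act as a stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["play", "plays", "played", "playing", "communication"]:
    print(w, "->", toy_stem(w))
```

Even this toy shows the trade-off: the rules are cheap to apply, but a result such as 'communic' is not a dictionary word.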

### Lemmatization
Lemmatization involves grouping together the inflected forms of the same word. This way, we can reach the base form of any word, and that base form is itself a meaningful word. The base form here is called the lemma.

Lemmatizers are slower and computationally more expensive than stemmers.

Example:
'play', 'plays', 'played', and 'playing' have 'play' as the lemma.

In [6]:
from nltk.stem import WordNetLemmatizer
# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))


play
play
play
play


Please note that with lemmatizers, we should pass the part of speech of the word along with the word as a function argument. In WordNetLemmatizer this argument is optional and defaults to noun ('n'), so verbs need an explicit 'v'.

Also, unlike stemmers, lemmatizers result in meaningful base words.

Let us take the same example as we took in the case of stemmers.

In [7]:
from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("communication", 'n'))


communication
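
The role of the part-of-speech argument can be illustrated with a toy lookup table. This is only a sketch: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names with hand-written entries, whereas WordNetLemmatizer relies on the WordNet database.

```python
# Hand-written (word, part of speech) -> lemma entries, for illustration only
LEMMA_TABLE = {
    ("plays", "v"): "play",
    ("played", "v"): "play",
    ("playing", "v"): "play",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up for the given part of speech; return it unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("playing", "v"))    # play
print(toy_lemmatize("playing"))         # playing (no noun entry in the table)
print(toy_lemmatize("communication"))   # communication
```

The same word can map to different lemmas, or to none, depending on the part of speech, which is why passing the right tag matters.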


## Part of Speech Tagging

Part of Speech (POS) tagging refers to assigning each word of a sentence its part of speech. It is significant as it helps give a better syntactic overview of a sentence.

Example:
"Youtube is a Data Science platform."
Let's see how NLTK's POS tagger will tag this sentence. In Python, this can be implemented in NLTK as follows:

In [8]:
from nltk import pos_tag
from nltk import word_tokenize

text = "Youtube is a Data Science platform."
tokenized_text = word_tokenize(text)
tags = pos_tag(tokenized_text)  # list of (token, tag) pairs
tags


[('Youtube', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('Data', 'NNP'),
 ('Science', 'NNP'),
 ('platform', 'NN'),
 ('.', '.')]
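
The tags follow the Penn Treebank tag set: NN is a singular common noun, NNP a singular proper noun, VBZ a third-person singular present verb, and DT a determiner. As a small sketch of how such output can be used, the snippet below copies the list above as a literal (so it runs without NLTK) and filters it with plain Python; the variable names are illustrative.

```python
from collections import Counter

# Tagger output copied from the cell above
tags = [('Youtube', 'NN'), ('is', 'VBZ'), ('a', 'DT'),
        ('Data', 'NNP'), ('Science', 'NNP'), ('platform', 'NN'), ('.', '.')]

# Every noun tag in this tag set starts with "NN"
nouns = [word for word, tag in tags if tag.startswith("NN")]

# How often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tags)

print(nouns)              # ['Youtube', 'Data', 'Science', 'platform']
print(tag_counts["NNP"])  # 2
```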

## Conclusion
In conclusion, the Natural Language Toolkit (NLTK) is a powerful Python library that offers a wide range of tools for Natural Language Processing (NLP). From fundamental tasks like text pre-processing to more advanced operations such as semantic reasoning, NLTK provides a versatile API that caters to the diverse needs of language-related tasks.