# Various Steps in NLP

We will be using a Python library called NLTK (Natural Language Toolkit).

NLTK is a powerful open-source library that provides methods and algorithms for a wide range of NLP tasks, including tokenization, part-of-speech tagging, stemming, lemmatization, and more.

## Tokenization

Tokenization is the process of splitting a sentence into its constituent parts: the words and punctuation that it is made up of.

NLTK provides a method called word_tokenize(), which tokenizes given text into words.

It splits the text into tokens based on the punctuation and whitespace between words.
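To make the idea of splitting on punctuation and spaces concrete, here is a minimal sketch that uses only the standard `re` module. It is not how word_tokenize() is implemented (NLTK's tokenizer also handles contractions, abbreviations, and other cases), and the helper name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a run of word characters becomes one token,
    # and every other non-space character (punctuation) is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I am reading NLP Fundamentals."))
```

For real text, prefer word_tokenize(), since a single regular expression like this one will split contractions such as "don't" into three tokens.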

Import the necessary libraries and download the required NLTK data packages.

In [None]:
from nltk import word_tokenize, download
download(['punkt','averaged_perceptron_tagger','stopwords'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Pass a sentence to the word_tokenize() method to get its tokens.

In [None]:
def get_tokens(sentence):
    words = word_tokenize(sentence)
    return words

In [None]:
import nltk
nltk.download('punkt_tab')
print(get_tokens("I am reading NLP Fundamentals."))

['I', 'am', 'reading', 'NLP', 'Fundamentals', '.']


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


## PoS Tagging

PoS refers to parts of speech.

PoS tagging refers to the process of tagging words within sentences with their respective PoS.

Import the necessary libraries

In [None]:
from nltk import word_tokenize, pos_tag

Use the word_tokenize() method to find the tokens in the sentence.

In [None]:
def get_tokens(sentence):
    words = word_tokenize(sentence)
    return words

In [None]:
words  = get_tokens("I am reading NLP Fundamentals")
print(words)

['I', 'am', 'reading', 'NLP', 'Fundamentals']


Use the pos_tag() method.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')
def get_pos(words):
    return pos_tag(words)
get_pos(words)

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


[('I', 'PRP'),
 ('am', 'VBP'),
 ('reading', 'VBG'),
 ('NLP', 'NNP'),
 ('Fundamentals', 'NNS')]

PRP stands for personal pronoun.

VBP stands for verb, non-3rd person singular present.

VBG stands for verb, gerund or present participle.

NNP stands for proper noun, singular.

NNS stands for noun, plural.
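The tag meanings listed above can be kept in a small lookup table so that tagged output is easier to read. This is a sketch only: the descriptions follow the Penn Treebank tag set, the table covers just the five tags seen in this example, and the names `TAG_DESCRIPTIONS` and `describe_tags` are hypothetical.

```python
# Only the tags that appear in the example above are listed here.
TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "VBP": "verb, non-3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
}

def describe_tags(tagged_words):
    # Hypothetical helper: pair each word with a readable tag description.
    return [(word, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
            for word, tag in tagged_words]

# Same (word, tag) pairs as the pos_tag() output shown above.
tagged = [("I", "PRP"), ("am", "VBP"), ("reading", "VBG"),
          ("NLP", "NNP"), ("Fundamentals", "NNS")]
for word, description in describe_tags(tagged):
    print(f"{word}: {description}")
```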

## Stop Word Removal

Stop words are the most frequently occurring words in any language.

They mainly support the grammatical construction of sentences and contribute little to the meaning of a sentence.

Removing them will help us clean our data, making its analysis much more efficient.

Import the necessary libraries

In [None]:
from nltk import download
download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


To get the list of stop words provided for English, pass 'english' as a parameter to the words() function.

In [None]:
stop_words = stopwords.words('english')

In [None]:
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

To remove the stop words from a sentence:

Assign a string to the sentence variable.

Tokenize it into words using the word_tokenize() method.

In [None]:
sentence = "I am learning Python. It is one of the "\
"most popular programming languages"
sentence_words = word_tokenize(sentence)

In [None]:
print(sentence_words)

['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one', 'of', 'the', 'most', 'popular', 'programming', 'languages']


To remove the stop words, loop through each word in the sentence, keep only the words that are not in the stop word list, and join the remaining words back into a single string.

In [None]:
def remove_stop_words(sentence_words, stop_words):
    # Keep only the tokens that are not in the stop word list
    return ' '.join([word for word in sentence_words
                     if word not in stop_words])

In [None]:
print(remove_stop_words(sentence_words,stop_words))

I learning Python . It one popular programming languages


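The comparison is case-sensitive and the NLTK list is lowercase, which is why "I" and "It" remain. Add your own stop words to the stop word list to remove them.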

In [None]:
stop_words.extend(['I','It', 'one'])
print(remove_stop_words(sentence_words,stop_words))

learning Python . popular programming languages
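Rather than extending the list with capitalized variants such as 'I' and 'It', one option is to lowercase each token before the lookup, since the NLTK list is all lowercase. The sketch below assumes a small hard-coded subset of stop words as a stand-in for stopwords.words('english') so that it runs without any downloads; `remove_stop_words_ci` is a hypothetical name.

```python
def remove_stop_words_ci(sentence_words, stop_words):
    # Hypothetical variant: lowercase each token before the membership test,
    # and use a set so each lookup takes constant time.
    stop_set = {word.lower() for word in stop_words}
    return ' '.join(word for word in sentence_words
                    if word.lower() not in stop_set)

# Small illustrative subset standing in for stopwords.words('english').
sample_stop_words = ['i', 'am', 'it', 'is', 'of', 'the', 'most']
tokens = ['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one', 'of',
          'the', 'most', 'popular', 'programming', 'languages']
print(remove_stop_words_ci(tokens, sample_stop_words))
```

Note that "one" survives here because it is not in the sample list; domain-specific words still have to be added explicitly.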


## Text Normalization

There are various ways of normalizing text:

1. spelling correction
2. stemming
3. lemmatization

Assign a string to the sentence variable

In [None]:
sentence = "I visited the US from the UK on 22-10-18"

Replace:
1. "US" with "United States"
2. "UK" with "United Kingdom"
3. "18" with "2018"

To do so, use the replace() function and store the updated output in the "normalized_sentence" variable.

In [None]:
def normalize(text):
    # Chain the replacements; each replace() returns a new string
    return (text.replace("US", "United States")
                .replace("UK", "United Kingdom")
                .replace("-18", "-2018"))

Check whether the text has been normalized

In [None]:
normalized_sentence = normalize(sentence)
print(normalized_sentence)

I visited the United States from the United Kingdom on 22-10-2018
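Because replace() matches substrings anywhere in the text, "US" would also be rewritten inside a longer word such as "USB". One way to guard against that, sketched here with the standard `re` module, is to match whole words only. The names `EXPANSIONS` and `normalize_safe` are hypothetical, and the year rule assumes that every two-digit year belongs to the 2000s.

```python
import re

# Hypothetical mapping of abbreviations to their expanded forms.
EXPANSIONS = {"US": "United States", "UK": "United Kingdom"}

def normalize_safe(text):
    # \b anchors the match to whole words, so "USB" is left untouched.
    pattern = r"\b(" + "|".join(map(re.escape, EXPANSIONS)) + r")\b"
    text = re.sub(pattern, lambda match: EXPANSIONS[match.group(1)], text)
    # Expand a two-digit year only at the end of a dd-mm-yy date.
    # Assumption: all such years are in the 2000s.
    return re.sub(r"\b(\d{2}-\d{2})-(\d{2})\b", r"\1-20\2", text)

print(normalize_safe("I visited the US from the UK on 22-10-18 with a USB drive"))
```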


## Spelling Correction

Spelling correction is an important step in many NLP projects, because a misspelled word is otherwise treated as a different token from its correct form.

Install the autocorrect package, then import the necessary libraries

In [None]:
pip install autocorrect

Collecting autocorrect
  Downloading autocorrect-2.6.1.tar.gz (622 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 622.8/622.8 kB 29.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: autocorrect
  Building wheel for autocorrect (setup.py) ... done
  Created wheel for autocorrect: filename=autocorrect-2.6.1-py3-none-any.whl size=622364 sha256=970a0a8cca10e12496d78a980150209f91579c6a7f41c8e22defc5010fd16073
  Stored in directory: /root/.cache/pip/wheels/5e/90/99/807a5ad861ce5d22c3c299a11df8cba9f31524f23ae6e645cb
Successfully built autocorrect
Installing collected packages: autocorrect
Successfully installed autocorrect-2.6.1


In [None]:
from nltk import word_tokenize
from autocorrect import Speller

In [None]:
spell = Speller(lang='en')
spell('Natureal')

'Natural'

In [None]:
sentence = word_tokenize("Ntural Luanguage Processin deals with "\
"the art of extracting insightes from "\
"Natural Languaes")
print(sentence)

['Ntural', 'Luanguage', 'Processin', 'deals', 'with', 'the', 'art', 'of', 'extracting', 'insightes', 'from', 'Natural', 'Languaes']


In [None]:
def correct_spelling(tokens):
    # Correct each token individually, then rejoin into one sentence
    sentence_corrected = ' '.join([spell(word) for word in tokens])
    return sentence_corrected
print(correct_spelling(sentence))

Natural Language Processing deals with the art of extracting insights from Natural Languages
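To give a feel for what a spelling corrector does, the sketch below picks the most similar word from a known vocabulary using the standard library's `difflib`. This is a swapped-in technique for illustration, not how autocorrect works internally; the tiny `VOCABULARY`, the helper name `closest_word`, and the 0.8 similarity cutoff are all assumptions.

```python
import difflib

# Tiny illustrative vocabulary; a real corrector would use a large word list.
VOCABULARY = ["natural", "language", "processing", "insights", "languages"]

def closest_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    # Hypothetical helper: return the most similar vocabulary entry,
    # or the word unchanged when nothing is similar enough.
    matches = difflib.get_close_matches(word.lower(), vocabulary,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else word

for typo in ["Ntural", "Luanguage", "Processin", "deals"]:
    print(typo, "->", closest_word(typo))
```

Unlike Speller, this sketch lowercases its matches and can only suggest words that are already in the vocabulary.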


## Stemming


Import the necessary libraries

In [None]:
from nltk import stem

Pass each word to the stemmer's stem() method, which strips suffixes to reduce the word to its stem.

In [None]:
def get_stems(word,stemmer):
    return stemmer.stem(word)
porterStem = stem.PorterStemmer()
get_stems("production",porterStem)

'product'

In [None]:
get_stems("coming",porterStem)

'come'

In [None]:
get_stems("firing",porterStem)

'fire'

In [None]:
get_stems("battling",porterStem)

'battl'

In [None]:
stemmer = stem.SnowballStemmer("english")
get_stems("battling",stemmer)

'battl'
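The individual calls above can be gathered into one loop to compare the two stemmers side by side. This sketch uses only the PorterStemmer and SnowballStemmer classes already shown, which need no extra NLTK data downloads; the variable names are illustrative.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

words = ["production", "coming", "firing", "battling"]
# Map each word to its (Porter, Snowball) stems for side-by-side comparison.
stems = {word: (porter.stem(word), snowball.stem(word)) for word in words}

for word, (porter_stem, snowball_stem) in stems.items():
    print(f"{word:12} porter={porter_stem:10} snowball={snowball_stem}")
```

As 'battl' shows, a stem is not guaranteed to be a dictionary word; that is the main practical difference from lemmatization.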

## Lemmatization

Import the necessary libraries

In [None]:
from nltk import download
download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...


Create an object of the WordNetLemmatizer class.

In [None]:
lemmatizer = WordNetLemmatizer()

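Reduce the word to its lemma (dictionary form) by using the lemmatize() method of the WordNetLemmatizer class. By default, lemmatize() treats the word as a noun (pos='n'), so verb forms such as 'coming' are returned unchanged unless pos='v' is passed.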

In [None]:
def get_lemma(word):
    return lemmatizer.lemmatize(word)
get_lemma('products')

'product'

In [None]:
get_lemma('production')

'production'

In [None]:
get_lemma('coming')

'coming'

## Named Entity Recognition (NER)

Import the necessary libraries

In [None]:
from nltk import download
from nltk import pos_tag
from nltk import ne_chunk
from nltk import word_tokenize
download('maxent_ne_chunker')
download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

Declare the sentence variable and assign it a string.

In [None]:
sentence = "We are reading a book published by Packt "\
"which is based out of Birmingham."

Find the named entities in the preceding text

In [None]:
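import nltk
from nltk import Tree
nltk.download('maxent_ne_chunker_tab')
def get_ner(text):
    # binary=True labels every named entity as 'NE' instead of a specific type
    chunked = ne_chunk(pos_tag(word_tokenize(text)), binary=True)
    # Named entities are subtrees; ordinary tokens are (word, tag) tuples
    return [node for node in chunked if isinstance(node, Tree)]
get_ner(sentence)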

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


[Tree('NE', [('Packt', 'NNP')]), Tree('NE', [('Birmingham', 'NNP')])]
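The result is a list of Tree objects rather than plain strings. A minimal sketch of turning such chunks into entity names follows; it uses a hand-built list that mirrors the output above, so it runs without downloading any NLTK models, and the helper name `entity_names` is hypothetical.

```python
from nltk import Tree

# Hand-built stand-in for the chunker output shown above.
chunks = [Tree('NE', [('Packt', 'NNP')]), Tree('NE', [('Birmingham', 'NNP')])]

def entity_names(chunks):
    # Hypothetical helper: each leaf is a (word, tag) pair, so join the
    # words of every subtree into a single entity string.
    return [' '.join(word for word, tag in chunk.leaves())
            for chunk in chunks]

print(entity_names(chunks))
```

Joining the leaves also covers entities that span several tokens, such as a two-word place name.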