https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/

1) Tokenization –
In this step, we break the text into its smallest units, called tokens. A dataset usually consists of long paragraphs, which are made up of sentences, which in turn are made up of words. Long paragraphs are hard to analyze directly, so we first split each paragraph into sentences and then split each sentence into words.

In [None]:
import nltk
nltk.download('punkt')
from nltk import tokenize
text = "NLP is a systematic proces that help to do various task. It is used to analyze, organize and summarize the data"
# decomposing the paragraph into sentences
lines=tokenize.sent_tokenize(text)
print(lines)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['NLP is a systematic proces that help to do various task.', 'It is used to analyze, organize and summarize the data']


In [None]:
print("Total sentences in the given paragraph:", len(lines))

Total sentences in the given paragraph: 2

In [None]:
# decomposing the text into word-level tokens (punctuation marks count as tokens)
words = tokenize.word_tokenize(text)
print(words)
print("Total number of tokens in the paragraph:", len(words))

['NLP', 'is', 'a', 'systematic', 'proces', 'that', 'help', 'to', 'do', 'various', 'task', '.', 'It', 'is', 'used', 'to', 'analyze', ',', 'organize', 'and', 'summarize', 'the', 'data']
Total number of tokens in the paragraph: 23
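
To make the two decomposition steps concrete, here is a minimal regex-only sketch of the same idea. It is not how NLTK tokenizes text internally; the helper names `split_sentences` and `split_words` are made up for this illustration, and the naive sentence rule would break on abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    # Naive rule (an assumption for this sketch): a sentence ends at
    # '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(text):
    # A run of word characters is one token; every punctuation mark
    # becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

text = "NLP is a systematic proces that help to do various task. It is used to analyze, organize and summarize the data"

sentences = split_sentences(text)
tokens = split_words(text)

print(sentences)
print("Sentences:", len(sentences))
print("Tokens:", len(tokens))
```

On this particular paragraph the sketch gives 2 sentences and 23 tokens, the same counts as the NLTK calls above. On messier text the two approaches can differ.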

2) Normalization –
Most datasets contain many words that are derived from a single word by adding a suffix or prefix. These variants add redundancy to the dataset and can hurt the results. It is therefore important to convert such words to their root form, which reduces the number of unique words in the dataset and can improve the outcome.

In NLP, two methods are used to normalize a dataset:

a) Stemming –
Stemming strips suffixes from a word and returns what remains as its root. The result is not always a meaningful word, and it may not appear in an English dictionary.

E.g. the words “playful”, “played” and “playing” are all converted to “play” after stemming.

In [None]:
#importing module for stemming
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
words=["playing","playful","played"]
print('Original words-', words)
stem_words=[]
for word in words:
    root_word= ps.stem(word)
    stem_words.append(root_word)
print("After stemming -", stem_words)

Original words- ['playing', 'playful', 'played']
After stemming - ['play', 'play', 'play']
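
The sketch below is a toy suffix stripper, not the Porter algorithm. It only illustrates two points from the text: stripping suffixes shrinks the number of unique words, and the stem that is left over need not be a dictionary word. The suffix list and the `toy_stem` helper are assumptions made for this example.

```python
# Hypothetical suffix list, far smaller than the Porter rule set.
SUFFIXES = ("ing", "ful", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    # Strip the first matching suffix, as long as at least
    # min_stem characters remain afterwards.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["playing", "playful", "played", "increases"]
stems = [toy_stem(w) for w in words]

print(stems)  # ['play', 'play', 'play', 'increas']
print("Unique words before:", len(set(words)))  # 4
print("Unique words after:", len(set(stems)))   # 2
```

Note that "increas" is not an English word, which is the kind of non-dictionary stem described above.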


b) Lemmatization –
Lemmatization is similar to stemming but gives more accurate results. Instead of simply chopping off the suffix, it looks the word up in a dictionary (WordNet in the example below), so when it changes a word, the result is a valid dictionary word. The word produced by lemmatization is called a lemma.

In [None]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

wml = WordNetLemmatizer()
words_orig = ["cries", "crys", "cried"]
print('Original words-', words_orig)
lemma_words = []
for word in words_orig:
    # lemmatize() treats each word as a noun unless a pos argument is given,
    # which is why the verb form "cried" comes back unchanged
    lemma = wml.lemmatize(word)
    lemma_words.append(lemma)
print("After lemmatization", lemma_words)


Original words- ['cries', 'crys', 'cried']
After lemmatization ['cry', 'cry', 'cried']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
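
In the output above, "cried" is returned unchanged. NLTK's `lemmatize` takes an optional `pos` argument that defaults to `"n"` (noun), so a verb form is only reduced when it is looked up as a verb. The sketch below mimics that behaviour with a hypothetical miniature lexicon. It is a toy dictionary lookup, not WordNet, and `LEXICON` and `toy_lemmatize` are names invented for this illustration.

```python
# Hypothetical miniature lexicon keyed by (word form, part of speech).
# A real lemmatizer such as WordNetLemmatizer consults WordNet instead.
LEXICON = {
    ("cries", "n"): "cry",
    ("cries", "v"): "cry",
    ("cried", "v"): "cry",
    ("increases", "n"): "increase",
    ("increases", "v"): "increase",
}

def toy_lemmatize(word, pos="n"):
    # Return the stored lemma, or the word unchanged when there is no entry.
    return LEXICON.get((word, pos), word)

as_noun = toy_lemmatize("cried")           # no noun entry, so unchanged
as_verb = toy_lemmatize("cried", pos="v")  # verb entry found

print(as_noun)  # cried
print(as_verb)  # cry
```

With the real lemmatizer, the equivalent call would be `wml.lemmatize("cried", pos="v")`.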


Difference between lemmatization and stemming –
Lemmatization is usually a better way to recover the base form of a word than stemming, because it returns an actual dictionary word, whereas stemming may return a truncated stem. In the example below, the stemmer reduces “increases” to “increas”, while the lemmatizer returns “increase”.

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
ps = PorterStemmer()
wml = WordNetLemmatizer()
lemma_words=[]
stem_words=[]
words=["increases"]
print('Original words-', words)
for word in words:
    root_word= ps.stem(word)
    stem_words.append(root_word)
    tokens = wml.lemmatize(word)
    lemma_words.append(tokens)
print("After stemming -", stem_words)
print("After lemmatization", lemma_words)

Original words- ['increases']
After stemming - ['increas']
After lemmatization ['increase']


3) Removing Stop-words
Stop-words are the words in a language that hold a sentence together and make it grammatical, without carrying much meaning on their own. In English, for example, words like “I”, “am”, “are”, “is” and “to” are all stop-words. They are of little use to most models, so we remove them from the dataset and focus on the important words rather than the supporting ones.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "NLP consists of a systematic process to organize the massive data and help to solve the numerous automated tasks in various fields like – machine translation, speech recognition, automatic summarization etc."
words = word_tokenize(text)
print("Total words in the paragraph:", len(words))
print(words)

filter_words = []
# use a new name so the imported stopwords module is not overwritten
stop_words = set(stopwords.words('english'))

# Removing the stop-words
for w in words:
    if w not in stop_words:
        filter_words.append(w)

print("Total words after removing stop-words:", len(filter_words))
print(filter_words)


Total words in the paragraph: 34
['NLP', 'consists', 'of', 'a', 'systematic', 'process', 'to', 'organize', 'the', 'massive', 'data', 'and', 'help', 'to', 'solve', 'the', 'numerous', 'automated', 'tasks', 'in', 'various', 'fields', 'like', '–', 'machine', 'translation', ',', 'speech', 'recognition', ',', 'automatic', 'summarization', 'etc', '.']
Total words after removing stop-words: 26
['NLP', 'consists', 'systematic', 'process', 'organize', 'massive', 'data', 'help', 'solve', 'numerous', 'automated', 'tasks', 'various', 'fields', 'like', '–', 'machine', 'translation', ',', 'speech', 'recognition', ',', 'automatic', 'summarization', 'etc', '.']
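
Two details are visible in the outputs above: the NLTK stop-word list is all lowercase, and punctuation tokens survive the filter. A capitalised token such as "It" would therefore not match the list. The sketch below shows one possible way to handle both, assuming a short hand-written stop-word set in place of the NLTK list. `STOP_WORDS` and `remove_stop_words` are names invented for this illustration.

```python
import string

# Hypothetical short stop-word set for illustration only;
# NLTK's English list is much longer.
STOP_WORDS = {"it", "is", "a", "to", "and", "the", "that", "do"}

def remove_stop_words(tokens, drop_punctuation=True):
    kept = []
    for tok in tokens:
        # Compare in lowercase so that "It" matches "it".
        if tok.lower() in STOP_WORDS:
            continue
        # Optionally drop tokens made only of ASCII punctuation.
        if drop_punctuation and all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept

tokens = ['It', 'is', 'used', 'to', 'analyze', ',', 'organize',
          'and', 'summarize', 'the', 'data']
filtered = remove_stop_words(tokens)
print(filtered)  # ['used', 'analyze', 'organize', 'summarize', 'data']
```

Storing the stop-words in a set makes each membership check constant time on average, which matters once the token list gets long.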


4) Part-of-Speech (POS) Tagging
In this process, each token is labelled with its part of speech, i.e. whether it is a noun, adjective, verb, etc.
Some basic tags used for parts of speech –

Label (tag)    Part of speech
NN             Singular noun
NNP            Proper noun
JJ             Adjective
VBD            Past-tense verb
IN             Preposition
DT             Determiner

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.tokenize import word_tokenize

nltk.download('averaged_perceptron_tagger')

text = "NLP consists of a systematic process to organize the massive data and help to solve the numerous automated tasks in various fields like – machine translation, speech recognition, automatic summarization etc."

words = word_tokenize(text)
# use a new name so the imported stopwords module is not overwritten
stop_words = set(stopwords.words('english'))

# Removing the stop-words
filter_words = [w for w in words if w not in stop_words]

# Part-of-speech tagging
pos = pos_tag(filter_words)

print(pos)


[('NLP', 'NNP'), ('consists', 'VBZ'), ('systematic', 'JJ'), ('process', 'NN'), ('organize', 'VBP'), ('massive', 'JJ'), ('data', 'NNS'), ('help', 'NN'), ('solve', 'VBP'), ('numerous', 'JJ'), ('automated', 'VBN'), ('tasks', 'NNS'), ('various', 'JJ'), ('fields', 'NNS'), ('like', 'IN'), ('–', 'NNP'), ('machine', 'NN'), ('translation', 'NN'), (',', ','), ('speech', 'NN'), ('recognition', 'NN'), (',', ','), ('automatic', 'JJ'), ('summarization', 'NN'), ('etc', 'NN'), ('.', '.')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
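
Once tokens are tagged, a common next step is to group them by broad word class. The sketch below does this for a few (token, tag) pairs copied from the tagger output above. It relies on the fact that Penn Treebank tags within a class share a prefix (NN, NNS and NNP are all nouns, for example). The `coarse_category` helper is an assumption made for this example, not part of NLTK.

```python
from collections import defaultdict

# A few (token, tag) pairs copied from the tagger output above.
tagged = [('NLP', 'NNP'), ('consists', 'VBZ'), ('systematic', 'JJ'),
          ('process', 'NN'), ('massive', 'JJ'), ('data', 'NNS'),
          ('tasks', 'NNS'), ('like', 'IN')]

def coarse_category(tag):
    # Map a fine-grained tag to a broad class by its prefix.
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    if tag.startswith("JJ"):
        return "adjective"
    return "other"

by_category = defaultdict(list)
for token, tag in tagged:
    by_category[coarse_category(tag)].append(token)

print(dict(by_category))
```

Here the nouns come out as NLP, process, data and tasks, which is the kind of list one might feed into keyword extraction.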
