<a href="https://colab.research.google.com/github/Rohit-Madhesiya/GenAI_KrishNaik/blob/main/Text_Preprocessing_Stemming_Lemmatization_Stopwords.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Stemming:** \
Stemming is the process of reducing a word to its word stem, the part that affixes such as suffixes and prefixes attach to, or to the root of the word, known as a ***lemma***. \
Stemming is an important preprocessing step in natural language understanding (NLU) and natural language processing (NLP).

In [2]:
# Classification problem:
# decide whether a product comment is a positive or a negative review
# Reviews contain inflected forms that share one stem, e.g. {eating, eat, eaten} -> eat, {going, gone, goes} -> go

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalize"]

**PorterStemmer**

In [3]:
from nltk.stem import PorterStemmer

In [4]:
stemming=PorterStemmer()

In [6]:
for word in words:
  print(word+"--->"+stemming.stem(word))
# Some words are left unchanged (eaten) or reduced to non-words (histori): a major disadvantage of stemming

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program
history--->histori
finally--->final
finalize--->final
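
To see what these stems mean for the review-classification use case, the sketch below groups the words by the stems printed above. It uses only the standard library: `porter_output` is copied by hand from the output above rather than recomputed, and `group_by_stem` is a hypothetical helper name.

```python
from collections import defaultdict

# Stems copied from the PorterStemmer output above (not recomputed here).
porter_output = {
    "eating": "eat", "eats": "eat", "eaten": "eaten",
    "writing": "write", "writes": "write",
    "programming": "program", "programs": "program",
    "history": "histori", "finally": "final", "finalize": "final",
}

def group_by_stem(word_to_stem):
    """Collect the words that collapse to the same stem."""
    groups = defaultdict(list)
    for word, stem in word_to_stem.items():
        groups[stem].append(word)
    return dict(groups)

groups = group_by_stem(porter_output)
for stem, members in groups.items():
    print(stem, "<-", members)
```

"eaten" ends up in a group of its own although it belongs with "eating" and "eats", while "finally" and "finalize" merge into "final" although they mean different things.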


In [10]:
stemming.stem('congratulations')
# The stem 'congratul' is not a real word, so the meaning is lost: another disadvantage of stemming
# Lemmatization addresses these problems

'congratul'

In [9]:
stemming.stem('sitting')

'sit'

**RegexpStemmer Class:** \
NLTK provides the RegexpStemmer class, which makes it easy to implement stemmers based on regular expressions. \
It takes a single regular expression and removes any prefix or suffix that matches the expression.

In [11]:
from nltk.stem import RegexpStemmer

In [15]:
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4) # strip a trailing ing, s, e or able
# min=4: words shorter than 4 characters are returned unchanged

In [14]:
reg_stemmer.stem('eating')

'eat'

In [17]:
reg_stemmer.stem('ingeating') # the leading 'ing' stays: '$' anchors each pattern to the end of the word

'ingeat'
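
The behaviour above can be reproduced with the standard `re` module. This is a minimal sketch of the idea, not NLTK's source: `regexp_stem` and `min_length` are hypothetical names, and the `^ing` alternative is added here only to show how a prefix could be matched.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Strip every match of `pattern` from `word`, skipping short words."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

suffix_only = r"ing$|s$|e$|able$"

# '$' ties each alternative to the end of the word, so only the suffix goes.
print(regexp_stem("eating", suffix_only, min_length=4))
print(regexp_stem("ingeating", suffix_only, min_length=4))

# Adding a '^ing' alternative also strips the leading "ing".
print(regexp_stem("ingeating", r"^ing|" + suffix_only, min_length=4))

# Words shorter than min_length are returned unchanged.
print(regexp_stem("was", suffix_only, min_length=4))
```

The stemmer has no knowledge of the language: it removes whatever the expression matches, so the quality of the stems depends entirely on the pattern.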

**Snowball Stemmer:**


In [19]:
from nltk.stem import SnowballStemmer

In [21]:
snowball_stemmer=SnowballStemmer('english')

In [22]:
for word in words:
  print(word+"--->"+snowball_stemmer.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program
history--->histori
finally--->final
finalize--->final


In [23]:
stemming.stem('fairly'),stemming.stem('sportingly')

('fairli', 'sportingli')

In [25]:
snowball_stemmer.stem('fairly'),snowball_stemmer.stem('sportingly')

('fair', 'sport')

In [26]:
# Snowball Stemmer handles 'fairly' and 'sportingly' better than Porter Stemmer, but it still produces non-words
snowball_stemmer.stem('goes')

'goe'



---


**Lemmatization:**

**Wordnet Lemmatizer:** \
Lemmatization is similar to stemming, but its output is a ***lemma***: a valid *root word* rather than a root stem, which is what stemming produces. \
After lemmatization we get a valid word with the same meaning.

NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus. The class uses the morphy() function of the WordNet CorpusReader class to find a lemma.


---

**Use Cases:** \
Q&A, chatbots, text summarization

In [31]:
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [32]:
lemmatizer=WordNetLemmatizer()

In [37]:
# Values accepted by the pos argument:
# 'n' = noun, 'v' = verb, 'a' = adjective, 'r' = adverb
lemmatizer.lemmatize("going",pos='v')

'go'

In [38]:
for word in words:
  print(word+"---->"+lemmatizer.lemmatize(word,pos='n'))

eating---->eating
eats---->eats
eaten---->eaten
writing---->writing
writes---->writes
programming---->programming
programs---->program
history---->history
finally---->finally
finalize---->finalize


In [39]:
for word in words:
  print(word+"---->"+lemmatizer.lemmatize(word,pos='v'))

eating---->eat
eats---->eat
eaten---->eat
writing---->write
writes---->write
programming---->program
programs---->program
history---->history
finally---->finally
finalize---->finalize


In [40]:
lemmatizer.lemmatize('goes',pos='v')

'go'

In [41]:
lemmatizer.lemmatize('fairly'), lemmatizer.lemmatize('sportingly')

('fairly', 'sportingly')
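
The loops above pass the same `pos` for every word, but in real text each token has its own part of speech. The sketch below maps Penn Treebank tags (the tag set returned by `nltk.pos_tag`) to the one-letter codes listed earlier. `penn_to_wordnet_pos` is a hypothetical helper, and the `tagged` list is written by hand for illustration rather than produced by a tagger.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to the one-letter POS code used by WordNet."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, also used as the fallback for every other tag

# Hand-written (word, tag) pairs standing in for tagger output.
tagged = [("going", "VBG"), ("programs", "NNS"), ("fairly", "RB"), ("curious", "JJ")]
for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet_pos(tag))
```

The returned code can then be passed as the `pos` argument, for example `lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))`.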



---


**Stopwords:**

In [68]:
# Speech on Dr. APJ Abdul Kalam
paragraph="""Greeting everyone. Today, I am here to deliver a speech on APJ Abdul Kalam.
Dr APJ Abdul Kalam’s full name was Avul Pakir Zainuldeben Abdul Kalam,
very few people know him by his full name as he was mostly addressed as ‘Missile Man of India’
and ‘People’s President’. He was born into a very poor family in Rameswaram on October 15, 1931.
Since childhood, he enjoyed flying, and was equally curious to know how birds fly in the air?
He was very intelligent and enjoyed reading, but his family did not have sufficient income for his school fees,
so to support his education, he would wake up early in the morning and ride a bicycle 3 kilometres from home to collect newspapers and sell them.
He was admitted to St. Joseph's College, Tiruchirapalli,
and later he went on to complete a degree in physics in 1954
and then studied at the Madras Institute of Technology and graduated in aeronautical engineering in 1955.
Since his childhood, Dr Abdul Alam wanted to be a pilot but couldn’t make his dream come true.
He learned from his mistakes and accomplished numerous achievements in his life.
After completing his degree, Abdul Kalam entered the Defense Department of India.
He has been one of the key figures in building the nuclear capabilities of India.
APJ Abdul Kalam was appointed to the Indian Ministry of Defense as a Technical Advisor in 1992, after which he served with DRDO and ISRO,
the country's largest organization. Considered a national hero for successful nuclear tests in 1998,
a second successful nuclear test was conducted in Pokhran the same year under his supervision,
after which India was included in the list of nuclear-powered nations.
Abdul Kalam has been active in all space programs and development programs in India as a scientist.
For developing India's Agni missile, Kalam was called 'Missile Man.'Abdul Kalam made a special technological and scientific contribution,
for which, along with Bharat Ratna, India's highest honour, he was awarded the Padma Bhushan, Padam Vibhushan, etc.
He was also awarded an honorary doctorate by more than 30 universities in the world for the same.
In 2002, he was elected President of India and was the country's first scientist and non-political president.
He visited many countries during his tenure as President and
led India's youth through his lectures and encouraged them to move forward.
‘My vision for India’ was a Famous Speech of APJ Abdul Kalam delivered at IIT Hyderabad in 2011,
and is to this day my favourite speech. His far-reaching thinking gave India's growth a fresh path and became the youth's inspiration.
Dr Abdul Kalam died on July 27, 2015, from an apparent cardiac arrest while delivering a lecture at IIM Shillong at the age of 83.
 He spent his entire life in service and inspiration for the nation and the youth, and his death is also while addressing the youth.
 His death is a never-ending loss to the country.
"""

In [44]:
from nltk.corpus import stopwords

In [45]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [46]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [47]:
stopwords.words('german')

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 '



---


**Stopwords and Filter using PorterStemmer**

In [49]:
from nltk.stem import PorterStemmer

In [50]:
stemmer=PorterStemmer()

In [55]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [69]:
documents=nltk.sent_tokenize(paragraph)

In [58]:
type(documents)

list

In [60]:
# Remove stop words, then stem each remaining word
stop_words=set(stopwords.words('english')) # build the set once instead of once per word
for i in range(len(documents)):
  words=nltk.word_tokenize(documents[i])
  # the stop-word list is lowercase, so compare the lowercased token
  words=[stemmer.stem(word) for word in words if word.lower() not in stop_words]
  documents[i]=' '.join(words) # join the stemmed words back into a sentence
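
NLTK's English stop-word list is all lowercase, as the list printed earlier shows, so a case-sensitive membership test lets capitalised tokens such as "He" through. Below is a minimal sketch of a case-insensitive filter that uses only the standard library. `SAMPLE_STOP_WORDS` is a small hand-picked subset for illustration, not the NLTK list, and `remove_stop_words` is a hypothetical helper name.

```python
# Hand-picked subset for illustration only; the notebook itself uses
# stopwords.words('english').
SAMPLE_STOP_WORDS = frozenset({"he", "was", "into", "a", "very", "in", "on"})

def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Drop every token whose lowercase form is a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "He was born into a very poor family in Rameswaram".split()

# Case-sensitive test: "He" is not in the lowercase set, so it survives.
case_sensitive = [token for token in tokens if token not in SAMPLE_STOP_WORDS]
print(case_sensitive)

# Case-insensitive test: "He" is removed as well.
print(remove_stop_words(tokens))
```

Building the set once, outside the loop, also avoids reconstructing it for every token.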

In [61]:
documents

['greet everyon .',
 'today , deliv speech apj abdul kalam .',
 'dr apj abdul kalam ’ full name avul pakir zainuldeben abdul kalam , peopl know full name mostli address ‘ missil man india ’ ‘ peopl ’ presid ’ .',
 'born poor famili rameswaram octob 15 , 1931 .',
 'sinc childhood , enjoy fli , equal curiou know bird fli air ?',
 'intellig enjoy read , famili suffici incom school fee , support educ , would wake earli morn ride bicycl 3 kilometr home collect newspap sell .',
 "admit st. joseph 's colleg , tiruchirap , later went complet degr physic 1954 studi madra institut technolog graduat aeronaut engin 1955 .",
 'sinc childhood , dr abdul alam want pilot ’ make dream come true .',
 'learn mistak accomplish numer achiev life .',
 'complet degr , abdul kalam enter defen depart india .',
 'one key figur build nuclear capabl india .',
 "apj abdul kalam appoint indian ministri defen technic advisor 1992 , serv drdo isro , countri 's largest organ .",
 'consid nation hero success nuclear te



---


**Stopwords and Filter using SnowballStemmer**

In [62]:
from nltk.stem import SnowballStemmer
snowball_stemmer=SnowballStemmer('english')

In [64]:
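# Remove stop words, then stem each remaining word with the Snowball Stemmer
# documents still holds the Porter-stemmed sentences unless it is rebuilt with nltk.sent_tokenize(paragraph) first
stop_words=set(stopwords.words('english')) # build the set once instead of once per word
for i in range(len(documents)):
  words=nltk.word_tokenize(documents[i])
  words=[snowball_stemmer.stem(word) for word in words if word.lower() not in stop_words]
  documents[i]=' '.join(words) # join the stemmed words back into a sentence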
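
Because the loop writes its result back into `documents`, a later cell works on text that an earlier cell has already stemmed unless `documents` is rebuilt from `paragraph` first. One way around this, sketched below under stated assumptions, is a function that returns a new list and takes the normaliser as a parameter. `preprocess` and `strip_plural_s` are hypothetical names, `str.split` stands in for `nltk.word_tokenize`, and the stop-word set is a tiny hand-written sample.

```python
def preprocess(sentences, normalize, stop_words):
    """Return new, cleaned sentences; the input list is left untouched."""
    cleaned = []
    for sentence in sentences:
        tokens = sentence.lower().split()  # stand-in for nltk.word_tokenize
        kept = [normalize(token) for token in tokens if token not in stop_words]
        cleaned.append(" ".join(kept))
    return cleaned

def strip_plural_s(token):
    """Toy normaliser for the demo; swap in a stemmer's or lemmatizer's method."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

raw = ["He enjoyed reading books", "The birds fly in the air"]
stop = {"he", "the", "in"}

result = preprocess(raw, strip_plural_s, stop)
print(result)
print(raw)  # still the original sentences
```

With this shape, `stemmer.stem`, `snowball_stemmer.stem` and `lemmatizer.lemmatize` can each be passed as `normalize` and compared on the same raw sentences.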

In [65]:
documents

['greet everyon .',
 'today , deliv speech apj abdul kalam .',
 'dr apj abdul kalam ’ full name avul pakir zainuldeben abdul kalam , peopl know full name most address ‘ missil man india ’ ‘ peopl ’ presid ’ .',
 'born poor famili rameswaram octob 15 , 1931 .',
 'sinc childhood , enjoy fli , equal curiou know bird fli air ?',
 'intellig enjoy read , famili suffici incom school fee , support educ , would wake ear morn ride bicycl 3 kilometr home collect newspap sell .',
 "admit st. joseph 's colleg , tiruchirap , later went complet degr physic 1954 studi madra institut technolog graduat aeronaut engin 1955 .",
 'sinc childhood , dr abdul alam want pilot ’ make dream come true .',
 'learn mistak accomplish numer achiev life .',
 'complet degr , abdul kalam enter defen depart india .',
 'one key figur build nuclear capabl india .',
 "apj abdul kalam appoint indian ministri defen technic advisor 1992 , serv drdo isro , countri 's largest organ .",
 'consid nation hero success nuclear test 1



---


**Stopwords and Filter using WordNetLemmatizer**

In [66]:
from nltk.stem import WordNetLemmatizer

In [67]:
lemmatizer=WordNetLemmatizer()

In [79]:
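# Remove stop words, then lemmatize each remaining word as a verb
stop_words=set(stopwords.words('english')) # build the set once instead of once per word
for i in range(len(documents)):
  documents[i]=documents[i].lower() # the stop-word list is lowercase
  words=nltk.word_tokenize(documents[i])
  words=[lemmatizer.lemmatize(word,pos='v') for word in words if word not in stop_words]
  documents[i]=' '.join(words) # join the lemmas back into a sentence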

In [80]:
documents

['greet everyone .',
 'today , deliver speech apj abdul kalam .',
 'dr apj abdul kalam ’ full name avul pakir zainuldeben abdul kalam , people know full name mostly address ‘ missile man india ’ ‘ people ’ president ’ .',
 'bear poor family rameswaram october 15 , 1931 .',
 'since childhood , enjoy fly , equally curious know bird fly air ?',
 'intelligent enjoy read , family sufficient income school fee , support education , would wake early morning ride bicycle 3 kilometre home collect newspaper sell .',
 "admit st. joseph 's college , tiruchirapalli , later go complete degree physic 1954 study madras institute technology graduate aeronautical engineer 1955 .",
 'since childhood , dr abdul alam want pilot ’ make dream come true .',
 'learn mistake accomplish numerous achievement life .',
 'complete degree , abdul kalam enter defense department india .',
 'one key figure build nuclear capability india .',
 "apj abdul kalam appoint indian ministry defense technical advisor 1992 , serve 