#***Stemming***
*Stemming is the process of reducing a word to its word stem, usually by stripping suffixes. The stem does not have to be a valid dictionary word.*

**Example**


1. "running" -> "run"
2. "happiness" -> "happi"
3. "caresses" -> "caress"
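
The idea of suffix stripping can be sketched in plain Python. This is not the Porter algorithm: the function `naive_stem`, its suffix list, and its length guard are made up for illustration, and real stemmers apply many more rules and conditions.

```python
# Hypothetical toy stemmer (illustrative only, not a real algorithm).
# It strips the first matching suffix, then undoes consonant doubling.
SUFFIXES = ("ness", "ing", "es", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip only when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "runn" -> "run" (but keep "ss", "ll", "zz").
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
        word = word[:-1]
    return word

for w in ["running", "happiness", "caresses"]:
    print(w, "->", naive_stem(w))
```

On these three words the sketch happens to reproduce the stems listed above, but it will fail on many others, which is why rule-based stemmers such as Porter use several ordered rule phases.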

#***Overstemming***

*Overstemming occurs when a stemming algorithm removes more characters than necessary, leading to stems that are too general or incorrect*.

**Example**

1.  "university" -> "univers" (the same stem as "universe", so two unrelated words are conflated)

2.  "generalization" -> "gener" (a less aggressive stem would be "general")
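
A quick way to see overstemming is to group words by the stem they receive. The sketch below uses a deliberately aggressive, hypothetical stemmer (`aggressive_stem` and its suffix list are assumptions for illustration, not a real algorithm) and shows three words with different meanings collapsing into one group.

```python
from collections import defaultdict

# Hypothetical, deliberately aggressive suffix list (illustrative only).
AGGRESSIVE_SUFFIXES = ("ity", "al", "e")

def aggressive_stem(word):
    # Remove the first suffix that matches, with no safety checks at all.
    for suffix in AGGRESSIVE_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Group the input words by the stem they are reduced to.
groups = defaultdict(list)
for w in ["universe", "university", "universal"]:
    groups[aggressive_stem(w)].append(w)

print(dict(groups))
```

All three words end up under the single stem "univers", so a search index built on these stems could no longer tell them apart.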



In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.stem import PorterStemmer

stemming = PorterStemmer()

# "eating!" is included to show that punctuation is not stripped before stemming.
words=["eating!",'eating',"eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

for word in words:
    print(f"{word}   ---->   {stemming.stem(word)}")

eating!   ---->   eating!
eating   ---->   eat
eats   ---->   eat
eaten   ---->   eaten
writing   ---->   write
writes   ---->   write
programming   ---->   program
programs   ---->   program
history   ---->   histori
finally   ---->   final
finalized   ---->   final


In [None]:
stemming.stem('Congratulation')

'congratul'

In [None]:
stemming.stem("sitting")

'sit'

In [None]:
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()

words = ["running", "happiness", "caresses"]
for word in words:
    print(word+"  ----> "+' '+lancaster.stem(word))

running  ---->  run
happiness  ---->  happy
caresses  ---->  caress


In [None]:
lancaster.stem('Congratulation')

'congrat'

#***RegexpStemmer class***
*NLTK's `RegexpStemmer` class removes any part of a word that matches a given regular expression, which makes it easy to build simple custom stemmers. Words shorter than `min` characters are left unchanged.*

In [None]:
from nltk.stem import RegexpStemmer
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)

print(reg_stemmer.stem('eating'))
print(reg_stemmer.stem('ingeating'))
print(reg_stemmer.stem('eats'))
print(reg_stemmer.stem('vegetable'))

eat
ingeat
eat
veget
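
The behaviour shown above can be approximated with the standard `re` module. This is a minimal sketch, assuming the stemmer simply deletes whatever the pattern matches and skips words shorter than the minimum length. `regexp_stem` and `min_len` are illustrative names, not part of NLTK.

```python
import re

def regexp_stem(word, pattern, min_len=0):
    # Words shorter than min_len are returned unchanged.
    if len(word) < min_len:
        return word
    # Delete every match; "$" anchors each alternative to the end of the word,
    # which is why the leading "ing" in "ingeating" survives.
    return re.sub(pattern, "", word)

pattern = r"ing$|s$|e$|able$"
for w in ["eating", "ingeating", "eats", "vegetable", "is"]:
    print(w, "->", regexp_stem(w, pattern, min_len=4))
```

The extra word "is" has only two characters, so with `min_len=4` it is returned unchanged instead of being cut down to "i".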


***SnowballStemmer*** *supports the following languages: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.*

In [None]:
from nltk.stem import SnowballStemmer
snowballsstemmer=SnowballStemmer('english')

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]
for word in words:
    print(word+"  ----> "+' '+snowballsstemmer.stem(word))

eating  ---->  eat
eats  ---->  eat
eaten  ---->  eaten
writing  ---->  write
writes  ---->  write
programming  ---->  program
programs  ---->  program
history  ---->  histori
finally  ---->  final
finalized  ---->  final


In [None]:
# Run both stemmers on the same input words to compare their output.
compare_words = ["fairly", "sportingly", "goes"]

Porter = [stemming.stem(word) for word in compare_words]
Snowball = [snowballsstemmer.stem(word) for word in compare_words]

print("Porter Stemmer: ", Porter)
print("Snowball Stemmer: ", Snowball)




Porter Stemmer:  ['fairli', 'sportingli', 'goe']
Snowball Stemmer:  ['fair', 'sport', 'goe']


#***Lemmatization***
*Lemmatization is a process in natural language processing (NLP) that reduces words to their base or dictionary form, known as the lemma*.

*Unlike stemming, which often removes suffixes in a crude way, lemmatization uses linguistic knowledge, including vocabulary and morphological analysis, to produce more accurate and meaningful base forms.*

**Example**

Running:
Lemmatized: "run" (verb)

Better:
Lemmatized: "good" (adjective)

Geese:
Lemmatized: "goose" (noun)
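
Irregular forms such as "better" and "geese" cannot be recovered by cutting off suffixes, so a lemmatizer needs a vocabulary to look them up in. The sketch below mimics that lookup with a tiny hand-written table. `LEMMAS` and `toy_lemmatize` are hypothetical, and a real lemmatizer consults a full lexicon such as WordNet.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
LEMMAS = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("geese", "n"): "goose",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the (word, pos) pair is unknown.
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("running", pos="v"))
print(toy_lemmatize("better", pos="a"))
print(toy_lemmatize("geese", pos="n"))
print(toy_lemmatize("better", pos="n"))  # wrong part of speech: no entry found
```

The last call returns "better" unchanged, which illustrates why the part of speech matters for lemmatization.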


In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()

lemmatizer.lemmatize("going")

'going'

POS tags accepted by the `pos` argument of `lemmatize` (the default is `'n'`):

Noun - n

Verb - v

Adjective - a

Adverb - r

In [None]:
print(lemmatizer.lemmatize("going",pos='v'))
print(lemmatizer.lemmatize("going",pos='n'))
print(lemmatizer.lemmatize("going",pos='a'))
print(lemmatizer.lemmatize("going",pos='r'))


go
going
going
going
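
In practice the part of speech usually comes from a tagger that emits Penn Treebank tags (for example `nltk.pos_tag`), which then have to be converted to the single letters listed above. A minimal sketch of that conversion follows. The helper name `to_wordnet_pos` is an assumption, and falling back to noun is a design choice that mirrors the lemmatizer's default.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective tags: JJ, JJR, JJS
    if treebank_tag.startswith("V"):
        return "v"  # verb tags: VB, VBD, VBG, VBN, VBP, VBZ
    if treebank_tag.startswith("RB"):
        return "r"  # adverb tags: RB, RBR, RBS
    return "n"      # everything else is treated as a noun

for tag in ["VBG", "NNS", "JJR", "RB"]:
    print(tag, "->", to_wordnet_pos(tag))
```

The returned letter can be passed straight to the `pos` argument, for example `lemmatizer.lemmatize("going", pos=to_wordnet_pos("VBG"))`.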


In [None]:
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

# pos='n' treats every word as a noun, so verb forms such as "eating" come back unchanged.
for word in words:
    print(word+" ---> "+' '+lemmatizer.lemmatize(word,pos='n'))

eating --->  eating
eats --->  eats
eaten --->  eaten
writing --->  writing
writes --->  writes
programming --->  programming
programs --->  program
history --->  history
finally --->  finally
finalized --->  finalized


#***Lemmatization vs Stemming***

**Stemming**: "running" -> "run", "better" -> "better", "geese" -> "gees"

**Lemmatization**: "running" -> "run", "better" -> "good", "geese" -> "goose"

#***Stopwords***

*Stopwords are common words that usually do not carry significant meaning and are often removed during text preprocessing in natural language processing (NLP) tasks*.

**Examples** of stopwords

"and", "is", "in", "the"

Articles: a, an, the

Conjunctions: and, but, or

Prepositions: in, on, at

Pronouns: he, she, it, they

In [None]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

# Display the full list of English stopwords shipped with NLTK.
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

**if word.lower() not in removing_words**

*Each word in the sentence is converted to lowercase before the check because all the NLTK stopwords are stored in lowercase.*


*A word is kept only when it does not appear in the stopword list, so words that are not stopwords are never removed.*



Checking each word:

"hi" -> kept (not in stopwords)

"this" -> removed (in stopwords)

"is" -> removed (in stopwords)

"a" -> removed (in stopwords)

"apple" -> kept (not in stopwords)

In [None]:
removing_words = stopwords.words('english')
sentence = 'hi this is a apple'

words = sentence.split()
print(words)

filtered_words = []
for word in words:

    if word.lower() not in removing_words:
        filtered_words.append(word)

filtered_sentence = filtered_words  # still a list of words at this point

print(filtered_sentence)

['hi', 'this', 'is', 'a', 'apple']
['hi', 'apple']


***Using join to turn the filtered words back into a sentence***


In [None]:
words = sentence.split()
print(words)

filtered_words = []
for word in words:

    if word.lower() not in removing_words:
        filtered_words.append(word)

filtered_sentence = ' '.join(filtered_words)

print(filtered_sentence)

['hi', 'this', 'is', 'a', 'apple']
hi apple
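
The same filtering can be written more compactly with a set and a generator expression. This is a sketch under stated assumptions: `stop_set` below is a small hand-picked subset used only so the example runs without downloads, and `remove_stopwords` is a hypothetical helper name. With NLTK available, `set(stopwords.words('english'))` would play the role of `stop_set`.

```python
# Small hand-picked subset for illustration (not the full NLTK list).
stop_set = {"this", "is", "a", "an", "the", "and"}

def remove_stopwords(sentence, stop_set):
    # Set membership is O(1) on average, whereas a list is scanned linearly.
    kept = (word for word in sentence.split() if word.lower() not in stop_set)
    return " ".join(kept)

print(remove_stopwords("hi this is a apple", stop_set))
print(remove_stopwords("Hi This Is A Apple", stop_set))  # case-insensitive check
```

Because the comparison uses `word.lower()`, capitalised stopwords are removed as well, while the original casing of the remaining words is preserved.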
