
| Feature  | **Stemming**                             | **Lemmatization**                         |
| -------- | ---------------------------------------- | ----------------------------------------- |
| Output   | Rough root form (may not be a real word) | Proper dictionary word (lemma)            |
| Based on | Heuristic rules (cutting suffixes)       | Linguistic knowledge (WordNet dictionary) |
| Example  | “studies” → “studi”                      | “studies” → “study”                       |
| Accuracy | Fast but crude                           | Slower but accurate                       |


Stemming = chop off endings

Lemmatization = find correct base form
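The contrast above can be sketched in a few lines of plain Python. This is a toy illustration only: the suffix list `SUFFIXES` and the mini-dictionary `LEMMAS` are hypothetical stand-ins, not the rules used by NLTK's stemmers or the contents of WordNet.

```python
# Toy illustration: real stemmers apply many ordered rules, and real
# lemmatizers consult a full dictionary. Both tables below are hypothetical.
SUFFIXES = ("ing", "es", "s")

# Hypothetical mini-dictionary standing in for WordNet.
LEMMAS = {"studies": "study", "running": "run", "went": "go"}


def toy_stem(word):
    """Chop off the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMAS.get(word, word)


for w in ["studies", "running", "went"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The toy stemmer turns "running" into "runn" because it has no rule for doubled consonants, and it cannot relate "went" to "go" at all. The lookup handles both, but only for words it already knows.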

In [None]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# 1. Porter Stemmer

The oldest of the three stemmers covered here (published by Martin Porter in 1980) and still one of the most widely used.

Applies a fixed sequence of suffix-stripping rules.

Balances accuracy and speed.

Commonly used in information retrieval (for example, search engines).

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

words = ['running', 'runs', 'easily', 'fairly']
for w in words:
    print(w, "→", ps.stem(w))


running → run
runs → run
easily → easili
fairly → fairli


In [None]:
ps = PorterStemmer()
ps.stem("programming")

'program'

In [None]:
ps.stem("computing")

'comput'

In [None]:
ps.stem("went")

'went'

In [None]:
ps.stem("easily")

'easili'

# 2. Lancaster Stemmer

More aggressive than Porter (it implements the Paice/Husk algorithm rather than a variant of Porter's).

Strips more characters, sometimes too many.

Often over-stems, producing very short or incorrect roots.

Useful when you want maximum compression of the vocabulary.

In [None]:
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()

for w in ['running', 'runs', 'easily', 'fairly']:
    print(w, "→", ls.stem(w))


running → run
runs → run
easily → easy
fairly → fair


# 3. Snowball Stemmer

An improved version of the Porter algorithm; the English variant is also called Porter2.

More modern and consistent.

Supports multiple languages (English, French, German, etc.).

Often a sensible default among the three.

In [None]:
from nltk.stem import SnowballStemmer
sn = SnowballStemmer("english")

for w in ['running', 'runs', 'easily', 'fairly']:
    print(w, "→", sn.stem(w))


running → run
runs → run
easily → easili
fairly → fair


In [None]:
sn = SnowballStemmer("english")
sn.stem("went")

'went'

In [None]:
sn.stem("easily")

'easili'

In [None]:
ls = LancasterStemmer()
ls.stem("went")

'went'

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer

lm = WordNetLemmatizer()
print(lm.lemmatize('went', pos='v'))


go


In [None]:
lm.lemmatize('programming',pos='v')

'program'

In [None]:
nltk.edit_distance('kitten','sitting')

3

# What are jaccard_distance and edit_distance?

These are text similarity (or dissimilarity) metrics, used to compare how similar or different two strings (or sets) are.

They're useful for:

Spell checking

Duplicate detection

Plagiarism checking

Chatbots (matching user intent)

Fuzzy matching in NLP preprocessing

# edit_distance (a.k.a. Levenshtein Distance)
Definition:

It measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one word into another.
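The definition above can be computed with the standard dynamic-programming recurrence. The sketch below is a plain-Python illustration with a hypothetical function name `levenshtein`; it is not NLTK's implementation, and it assumes every insertion, deletion, and substitution costs 1.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                      # delete ca
                curr[j - 1] + 1,                  # insert cb
                prev[j - 1] + substitution_cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("cat", "cut"))         # 1
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.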

In [None]:
from nltk.metrics import edit_distance

print(edit_distance('cat', 'cut'))


1


In [None]:
words = ['apple', 'apply', 'apples', 'maple']
target = 'appel'

for w in words:
    print(w, "→ distance:", edit_distance(target, w))


apple → distance: 2
apply → distance: 2
apples → distance: 2
maple → distance: 3


# jaccard_distance
Definition:

It measures the dissimilarity between two sets, based on how many elements they share.

It's calculated as:

$$
\text{Jaccard Distance} = 1 - \frac{|A \cap B|}{|A \cup B|}
$$



In [None]:
import nltk
from nltk.metrics import jaccard_distance, edit_distance

In [None]:
jaccard_distance(set('azaming'),set('amazing'))

0.0

In [None]:
jaccard_distance(set('better'), set('amazing'))

1.0
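Both results follow directly from the set definition, which the plain-Python sketch below reproduces. The function name `jaccard` is hypothetical, and returning 0.0 for two empty sets is a choice made here for illustration, not NLTK's documented behavior.

```python
def jaccard(a, b):
    """Jaccard distance between the sets of elements in a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption for this sketch: two empty sets count as identical.
        return 0.0
    intersection = set_a & set_b
    # Algebraically the same as 1 - |A ∩ B| / |A ∪ B|.
    return (len(union) - len(intersection)) / len(union)


print(jaccard("azaming", "amazing"))    # 0.0: same letters, order is ignored
print(jaccard("better", "amazing"))     # 1.0: no letters in common
print(jaccard("happiiiiiii", "happy"))  # 0.4: 3 shared letters out of 5
```

Because a set discards order and repetition, "azaming" and "amazing" look identical at the character level even though one of them is misspelled.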

In [None]:
inc=['amazing','happiiiiiii', 'spellling']

In [None]:
print(jaccard_distance(set('azaming'),set('amazing')))
print(jaccard_distance(set('happiiiiiii'),set('happy')))
print(jaccard_distance(set('spellling'),set('spelling')))

0.0
0.4
0.0


In [None]:
from nltk.util import ngrams

In [None]:
list(ngrams("amazing", 3))

[('a', 'm', 'a'),
 ('m', 'a', 'z'),
 ('a', 'z', 'i'),
 ('z', 'i', 'n'),
 ('i', 'n', 'g')]

In [None]:
import nltk
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
from nltk.corpus import words

In [None]:
correct_words = words.words()

In [None]:
len(correct_words)

236736

In [None]:
inc=['amazing','happiiiiiii', 'spellling']

In [None]:
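for i in inc:
    # Score every dictionary word that shares the first letter, comparing character bigrams
    result = [(jaccard_distance(set(ngrams(i, 2)), set(ngrams(w, 2))), w)
              for w in correct_words if w[0] == i[0]]
    # The candidate with the smallest distance is the suggested correction
    print(sorted(result, key=lambda x: x[0])[0])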

(0.0, 'amazing')
(0.42857142857142855, 'happier')
(0.0, 'spelling')


In [None]:
list(ngrams('azaming',2))

[('a', 'z'), ('z', 'a'), ('a', 'm'), ('m', 'i'), ('i', 'n'), ('n', 'g')]

In [None]:
list(ngrams("amazing",2))

[('a', 'm'), ('m', 'a'), ('a', 'z'), ('z', 'i'), ('i', 'n'), ('n', 'g')]
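The two bigram lists above share only four of their eight distinct bigrams, so comparing bigram sets separates the two words where comparing single characters could not. A minimal plain-Python check follows; the helper names `char_ngrams` and `jaccard_sets` are hypothetical and stand in for `nltk.util.ngrams` and `jaccard_distance`.

```python
def char_ngrams(text, n):
    """Set of all length-n character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_sets(set_a, set_b):
    """Jaccard distance between two non-empty sets."""
    union = set_a | set_b
    return (len(union) - len(set_a & set_b)) / len(union)


wrong = char_ngrams("azaming", 2)
right = char_ngrams("amazing", 2)

print(sorted(wrong & right))       # ['am', 'az', 'in', 'ng']
print(jaccard_sets(wrong, right))  # 0.5

# At the single-character level the two words are indistinguishable.
print(jaccard_sets(set("azaming"), set("amazing")))  # 0.0
```

Bigrams keep some information about letter order, which is why the spelling-correction loop compares n-grams rather than raw character sets.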