<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 46px; font-weight: bold;">
NLP Preprocessing Essentials
</h1>

<h3 style="color:#B0C4DE; font-family: 'Trebuchet MS', sans-serif; font-size: 22px;">
Tokenization • Stop Words • Stemming • Lemmatization
</h3>

</div>

---

<div style="height: 2px; background-color:#6CA6C1; margin: 20px 0;"></div>

<h2 style="color:#FFB84C; font-family: 'Georgia', serif;">Word Tokenization</h2>
<p style=" font-family: Arial, sans-serif;">
Splitting text into individual words. Example:<br>
<i>"NLP is fun!" → ["NLP", "is", "fun", "!"]</i>
</p>
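
To make the idea concrete, here is a minimal sketch of a regex-based word tokenizer. It is not how NLTK's `word_tokenize` works internally; the names `TOKEN_PATTERN` and `simple_word_tokenize` are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical pattern: a run of word characters, or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_word_tokenize("NLP is fun!"))  # ['NLP', 'is', 'fun', '!']
```

A pattern this simple breaks contractions apart (for example, "don't" becomes three tokens), which is one reason a library tokenizer is usually preferred.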

---

<div style="height: 2px; background-color:#6CA6C1; margin: 20px 0;"></div>

<h2 style="color:#FFB84C; font-family: 'Georgia', serif;">Sentence Tokenization</h2>
<p style=" font-family: Arial, sans-serif;">
Dividing text into sentences. Example:<br>
<i>"I love NLP. It is powerful!" → ["I love NLP.", "It is powerful!"]</i>
</p>
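
A naive sentence splitter can be sketched with one regular expression, assuming a sentence ends with `.`, `!` or `?` followed by whitespace and a capital letter. This is only an illustration with hypothetical names (`SENTENCE_BOUNDARY`, `simple_sent_tokenize`), not the approach `sent_tokenize` takes.

```python
import re

# Assumed boundary: whitespace that follows ., ! or ? and precedes a capital.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def simple_sent_tokenize(text):
    """Split text at the assumed sentence boundaries."""
    return [part for part in SENTENCE_BOUNDARY.split(text.strip()) if part]


print(simple_sent_tokenize("I love NLP. It is powerful!"))
# ['I love NLP.', 'It is powerful!']
```

The rule misfires on abbreviations: "Dr. Smith arrived." is split after "Dr.", so real text generally needs a tokenizer that knows about such cases.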

---

<div style="height: 2px; background-color:#6CA6C1; margin: 20px 0;"></div>

<h2 style="color:#FFB84C; font-family: 'Georgia', serif;">Stop Words</h2>
<p style=" font-family: Arial, sans-serif;">
Removing very common words such as <i>"is", "the", "and"</i> that carry little meaning on their own.<br>
Example:<br>
<i>"This is a good example" → ["good", "example"]</i>
</p>
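
Stop word removal amounts to a membership test against a word list. The sketch below uses a tiny hand-picked set (`STOP_WORDS`, an assumption for illustration, far smaller than any real list) and lowercases each token before the lookup so that capitalised words are matched too.

```python
# Tiny illustrative stop list; real lists contain well over a hundred entries.
STOP_WORDS = {"a", "an", "and", "are", "is", "the"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["The", "tools", "are", "fast", "and", "the", "docs", "are", "good"]
print(remove_stop_words(tokens))  # ['tools', 'fast', 'docs', 'good']
```

Using a `set` keeps each lookup constant-time, which matters once the token list grows.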

---

<div style="height: 2px; background-color:#6CA6C1; margin: 20px 0;"></div>

<h2 style="color:#FFB84C; font-family: 'Georgia', serif;">Stemming</h2>
<p style=" font-family: Arial, sans-serif;">
Reducing words to a root form by stripping suffixes; the result is not always a real word. Example:<br>
<i>"running", "runs"</i> → <b>"run"</b>
</p>
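
The suffix-stripping idea behind stemming can be sketched in a few lines. This toy function is not the Porter, Lancaster or Snowball algorithm; the suffix list, the `min_stem_length` parameter and the name `toy_stem` are assumptions made for the example.

```python
# Suffixes are tried in this order; the first one that fits is removed.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem_length=3):
    """Strip one known suffix, provided the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word


for word in ["changing", "changed", "changes", "ran"]:
    print(word, "->", toy_stem(word))
# changing -> chang
# changed -> chang
# changes -> chang
# ran -> ran
```

The stem "chang" is not a dictionary word, and the irregular form "ran" is left untouched: purely rule-based stripping has no knowledge of vocabulary.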

---

<div style="height: 2px; background-color:#6CA6C1; margin: 20px 0;"></div>

<h2 style="color:#FFB84C; font-family: 'Georgia', serif;">Lemmatization</h2>
<p style=" font-family: Arial, sans-serif;">
Lemmatization reduces words to their base or dictionary form, using a vocabulary and the word's part of speech. Example:<br>
<i>"mice"</i> → <b>"mouse"</b>
</p>
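
In contrast to suffix stripping, lemmatization relies on a vocabulary lookup that depends on the part of speech. The sketch below uses a hand-made table (`LEMMA_TABLE`, a hypothetical stand-in for a real lexical database) to show why the part-of-speech tag changes the answer.

```python
# Hand-made (word, part of speech) -> lemma table, for illustration only.
LEMMA_TABLE = {
    ("mice", "n"): "mouse",
    ("geese", "n"): "goose",
    ("ran", "v"): "run",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("mice"))             # mouse
print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("better"))           # better (no noun entry in the table)
```

The last call shows the practical consequence: without the right part of speech, the lookup falls back to the input word.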




In [1]:
s = "GeeksforGeeks is a great learning platform . It is one of the best for Computer Science students."

<div style="text-align: center;">
<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 46px; font-weight: bold;">
Tokenization
</h1>
</div>

In [2]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [3]:
word_tokenize(s)

['GeeksforGeeks',
 'is',
 'a',
 'great',
 'learning',
 'platform',
 '.',
 'It',
 'is',
 'one',
 'of',
 'the',
 'best',
 'for',
 'Computer',
 'Science',
 'students',
 '.']

In [4]:
sent_tokenize(s)

['GeeksforGeeks is a great learning platform .',
 'It is one of the best for Computer Science students.']

In [5]:
from nltk import pos_tag

In [7]:
pos_tag(word_tokenize(s))

[('GeeksforGeeks', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('learning', 'JJ'),
 ('platform', 'NN'),
 ('.', '.'),
 ('It', 'PRP'),
 ('is', 'VBZ'),
 ('one', 'CD'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('best', 'JJS'),
 ('for', 'IN'),
 ('Computer', 'NNP'),
 ('Science', 'NNP'),
 ('students', 'NNS'),
 ('.', '.')]


<div style="text-align: center;">
<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 46px; font-weight: bold;">
Stop Words
</h1>
</div>

In [20]:
from nltk.corpus import stopwords
from string import punctuation

In [21]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [22]:
stop = stopwords.words("english")

`string.punctuation` is the string `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``


In [25]:
stop_words = list(punctuation) + stop

In [36]:
for token in word_tokenize(s):
    # NLTK's stop word list is lowercase, so compare in lowercase
    if token.lower() not in stop_words:
        print(token)

GeeksforGeeks
great
learning
platform
one
best
Computer
Science
students


<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 40px; font-weight: bold;">
Stemming
</h1>

<h3 style="color:#B0C4DE; font-family: 'Trebuchet MS', sans-serif; font-size: 22px;">
Stemming in NLTK refers to the process of reducing words to their base or root form.
</h3>

</div>

---

## 🌟 Types of Stemmers in NLTK

NLTK provides several algorithms for stemming. Here are the most popular ones:

| Stemmer | Description |
|---------|-------------|
| **Porter Stemmer** | One of the oldest and most widely used stemming algorithms. Good for English words. |
| **Lancaster Stemmer** | More aggressive than Porter; may produce shorter stems. |
| **Snowball Stemmer** | An improved version of Porter Stemmer with support for multiple languages. |
| **Regexp Stemmer** | Allows you to define custom regular expressions for stemming. |



In [40]:
from nltk.stem import LancasterStemmer, RegexpStemmer, PorterStemmer, SnowballStemmer

In [59]:
l = LancasterStemmer()
r = RegexpStemmer('ing$')  # anchored so that only a trailing "ing" is removed
p = PorterStemmer()
sb = SnowballStemmer('english')  # named sb so the sentence stored in s is not overwritten

In [54]:
l.stem("changing")

'chang'

In [63]:
r.stem("changing")

'chang'

In [61]:
p.stem("changes")

'chang'

In [60]:
sb.stem("changed")

'chang'

<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 40px; font-weight: bold;">
Lemmatization
</h1>

<h3 style="color:#B0C4DE; font-family: 'Trebuchet MS', sans-serif; font-size: 22px;">
Lemmatization reduces words to their base or dictionary form, considering meaning and context.
</h3>

</div>

---

## WordNet Lemmatizer

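NLTK provides the **WordNet Lemmatizer** to find the base form of words.  
It handles irregular forms, but it does not infer context on its own: `lemmatize()` takes an optional part-of-speech argument (`pos`) and treats every word as a noun by default.

- **Example:** "mice" → "mouse", "flies" → "fly", "better" → "good" (with `pos="a"`)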

---

## 📝 Difference Between Stemming and Lemmatization

| Feature | Stemming | Lemmatization |
|---------|----------|---------------|
| **Definition** | Strips suffixes with rule-based heuristics to reach a root form. | Converts words to their base/dictionary form using a vocabulary and part of speech. |
| **Accuracy** | Less accurate; stems may not be real words. | More accurate; lemmas are dictionary words. |
| **Example** | "running" → "run", "flies" → "fli" | "mice" → "mouse", "better" → "good" (as an adjective) |
| **Speed** | Faster | Slower |
| **Use** | Quick preprocessing. | When meaning matters in NLP tasks. |


In [64]:
from nltk.stem import WordNetLemmatizer

In [66]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...


True

In [79]:
wl = WordNetLemmatizer()
wl.lemmatize("mice")

'mouse'