<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 46px; font-weight: bold;">
NLP: Natural Language Processing
</h1>

</div>





In [1]:
s = "GeeksforGeeks is a great learning platform . It is one of the best for Computer Science students."

<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 40px; font-weight: bold;">
Tokenization
</h1>

<h3 style="color:#B0C4DE; font-family: 'Trebuchet MS', sans-serif; font-size: 22px;">
Tokenization is the process of splitting text into smaller pieces called tokens, such as words or sentences.
</h3>

</div>

---

## What Is Tokenization?

Tokenization breaks raw text into units (words, punctuation marks, or sentences) that later NLP steps can process one at a time.  

**Example Sentence:**  
*"Natural Language Processing is amazing!"*

**Word Tokens:**  
["Natural", "Language", "Processing", "is", "amazing", "!"]

**Sentence Tokens:**  
["Natural Language Processing is amazing!"]

---

## Why Use Tokenization?

- Helps analyze text **word by word or sentence by sentence**  
- Necessary for **text preprocessing** steps like stop word removal, stemming, or lemmatization  
- Forms the foundation for **N-grams, Count Vectorizer, and other NLP techniques**
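
To make the idea concrete, here is a minimal pure-Python sketch of word and sentence splitting with regular expressions. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration; NLTK's `word_tokenize` and `sent_tokenize` use more elaborate rules and handle many cases this sketch does not (abbreviations, contractions, and so on).

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split text after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

word_tokens = simple_word_tokenize("Natural Language Processing is amazing!")
sent_tokens = simple_sent_tokenize("Tokenization is simple. It is also useful!")
print(word_tokens)
print(sent_tokens)
```

The word pattern treats every punctuation character as its own token, which is why "!" appears as a separate item in the word tokens.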


In [42]:
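# Both tokenizers rely on NLTK's Punkt data, e.g. nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize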

In [3]:
word_tokenize(s)

['GeeksforGeeks',
 'is',
 'a',
 'great',
 'learning',
 'platform',
 '.',
 'It',
 'is',
 'one',
 'of',
 'the',
 'best',
 'for',
 'Computer',
 'Science',
 'students',
 '.']

In [4]:
sent_tokenize(s)

['GeeksforGeeks is a great learning platform .',
 'It is one of the best for Computer Science students.']

In [5]:
from nltk import pos_tag

In [7]:
pos_tag(word_tokenize(s))

[('GeeksforGeeks', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('learning', 'JJ'),
 ('platform', 'NN'),
 ('.', '.'),
 ('It', 'PRP'),
 ('is', 'VBZ'),
 ('one', 'CD'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('best', 'JJS'),
 ('for', 'IN'),
 ('Computer', 'NNP'),
 ('Science', 'NNP'),
 ('students', 'NNS'),
 ('.', '.')]

<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 40px; font-weight: bold;">
Stop Words
</h1>

<h3 style="color:#B0C4DE; font-family: 'Trebuchet MS', sans-serif; font-size: 22px;">
Stop Words are common words in a language that do not add significant meaning to text and can be removed for NLP tasks.
</h3>

</div>

---

## What Are Stop Words?

Stop words are words like **"a", "an", "the", "is", "in"** that appear frequently but carry little information.

**Example Sentence:**  
*"This is a simple example of NLP in action."*

**After removing stop words:**  
*"simple example NLP action"*

---

## Why Remove Stop Words?

- Reduces **noise** in the text  
- Speeds up **processing** and **analysis** by shrinking the vocabulary  
- Can improve results in **text classification, sentiment analysis, and search engines**
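
The filtering step itself is small. The sketch below assumes a tiny hand-written stop list (`STOP_WORDS` is hypothetical; NLTK's English list is much longer) and reproduces the example sentence above. It lowercases each token before the lookup so that capitalised words such as "This" are also removed.

```python
from string import punctuation

# Hypothetical mini stop list for illustration only
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "this"}
PUNCTUATION = set(punctuation)

def remove_stop_words(tokens):
    """Drop stop words (case-insensitively) and single punctuation tokens."""
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and t not in PUNCTUATION]

tokens = ["This", "is", "a", "simple", "example", "of", "NLP", "in", "action", "."]
filtered = remove_stop_words(tokens)
print(filtered)
```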


In [20]:
from nltk.corpus import stopwords
from string import punctuation

In [21]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [22]:
stop = stopwords.words("english")

### `punctuation` contains: ``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``


In [25]:
stop_words = list(punctuation) + stop

In [36]:
for i in word_tokenize(s):
    # Lowercase before the lookup: NLTK's stop word list is all lowercase
    if i.lower() not in stop_words:
        print(i)

GeeksforGeeks
great
learning
platform
one
best
Computer
Science
students


<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 40px; font-weight: bold;">
Stemming
</h1>

<h3 style="color:#B0C4DE; font-family: 'Trebuchet MS', sans-serif; font-size: 22px;">
Stemming in NLTK refers to the process of reducing words to their base or root form.
</h3>

</div>

---

## 🌟 Types of Stemmers in NLTK

NLTK provides several algorithms for stemming. Here are the most popular ones:

| Stemmer | Description |
|---------|-------------|
| **Porter Stemmer** | One of the oldest and most widely used stemming algorithms. Good for English words. |
| **Lancaster Stemmer** | More aggressive than Porter; may produce shorter stems. |
| **Snowball Stemmer** | An improved version of Porter Stemmer with support for multiple languages. |
| **Regexp Stemmer** | Allows you to define custom regular expressions for stemming. |
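
As a rough illustration of what suffix stripping looks like, here is a deliberately naive stemmer. The function `naive_stem` and its suffix list are assumptions made for this sketch; it is not the Porter, Lancaster, or Snowball algorithm, each of which applies many more rules.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["changing", "changed", "changes", "change"]:
    print(w, "->", naive_stem(w))
```

The first three forms collapse to "chang", which is not a real word. That is typical of stemming: the goal is a shared key for related forms, not a dictionary entry.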



In [40]:
from nltk.stem import LancasterStemmer, RegexpStemmer, PorterStemmer, SnowballStemmer

In [59]:
l = LancasterStemmer()
r = RegexpStemmer('ing$')  # "$" limits the match to a trailing "ing"; a bare 'ing' would match anywhere in the word
p = PorterStemmer()
s = SnowballStemmer('english')

In [54]:
l.stem("changing")

'chang'

In [63]:
r.stem("changing")

'chang'

In [61]:
p.stem("changes")

'chang'

In [60]:
s.stem("changed")

'chang'

<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 40px; font-weight: bold;">
Lemmatization
</h1>

<h3 style="color:#B0C4DE; font-family: 'Trebuchet MS', sans-serif; font-size: 22px;">
Lemmatization reduces words to their base or dictionary form, considering meaning and context.
</h3>

</div>

---

## WordNet Lemmatizer

NLTK uses the **WordNet Lemmatizer** to find the base form of words.  
It handles irregular forms, and it treats every word as a noun unless a part-of-speech tag is passed through the `pos` argument.

- **Example:** "mice" → "mouse", "flies" → "fly", "better" → "good" (with `pos="a"`)

---

## 📝 Difference Between Stemming and Lemmatization

| Feature | Stemming | Lemmatization |
|---------|----------|---------------|
| **Definition** | Cuts words to their root form. | Converts words to their base/dictionary form using meaning. |
| **Accuracy** | Less accurate; may not be real words. | More accurate; always valid words. |
| **Example** | "running" → "run", "flies" → "fli" | "mice" → "mouse", "better" → "good" |
| **Speed** | Faster | Slower |
| **Use** | Quick preprocessing. | When meaning matters in NLP tasks. |
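
The contrast can be sketched with a toy lookup table. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names for this illustration; a real lemmatizer such as NLTK's `WordNetLemmatizer` consults WordNet rather than a hand-written dictionary. The point is that the lookup depends on the part of speech, not on chopping suffixes.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech)
LEMMA_TABLE = {
    ("mice", "n"): "mouse",
    ("flies", "n"): "fly",
    ("better", "a"): "good",
    ("running", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("mice"))             # looked up as a noun
print(toy_lemmatize("better"))           # no noun entry, returned unchanged
print(toy_lemmatize("better", pos="a"))  # looked up as an adjective
```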


In [64]:
from nltk.stem import WordNetLemmatizer

In [66]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...


True

In [79]:
wl = WordNetLemmatizer()
wl.lemmatize("mice")

'mouse'

<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 40px; font-weight: bold;">
N-Grams
</h1>

<h3 style="color:#B0C4DE; font-family: 'Trebuchet MS', sans-serif; font-size: 22px;">
N-grams are sequences of N consecutive words taken from a sentence.
</h3>

</div>

---

## What Are N-Grams?

N-grams help in understanding word patterns and relationships in text.

| Type | Example (from: "I love natural language processing") |
|------|------------------------------------------------------|
| **Unigram (1-word)** | I, love, natural, language, processing |
| **Bigram (2-words)** | I love, love natural, natural language, language processing |
| **Trigram (3-words)** | I love natural, love natural language, natural language processing |

---

## Why Use N-Grams?

- To analyze **word combinations**
- Useful in **text prediction** and **language modelling**
- Commonly used in **chatbots, autocomplete, and spam detection**
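
The sliding-window idea behind the table above can be sketched in a few lines of plain Python. The helper name `make_ngrams` is made up for this example; `nltk.util.ngrams` provides the same functionality as a lazy iterator.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n copies of the list, each shifted one position further
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I love natural language processing".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so the five-word example gives four bigrams and three trigrams.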


In [29]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

In [34]:
txt = "my name is nitin  i am a good boy i am a bca student."
w = word_tokenize(txt)

In [66]:
b = BigramCollocationFinder.from_words(w)
t = TrigramCollocationFinder.from_words(w)
n = ngrams(w, 4)  # lazy iterator over 4-grams

In [67]:
b.ngram_fd

FreqDist({('i', 'am'): 2, ('am', 'a'): 2, ('my', 'name'): 1, ('name', 'is'): 1, ('is', 'nitin'): 1, ('nitin', 'i'): 1, ('a', 'good'): 1, ('good', 'boy'): 1, ('boy', 'i'): 1, ('a', 'bca'): 1, ...})

In [68]:
t.ngram_fd

FreqDist({('i', 'am', 'a'): 2, ('my', 'name', 'is'): 1, ('name', 'is', 'nitin'): 1, ('is', 'nitin', 'i'): 1, ('nitin', 'i', 'am'): 1, ('am', 'a', 'good'): 1, ('a', 'good', 'boy'): 1, ('good', 'boy', 'i'): 1, ('boy', 'i', 'am'): 1, ('am', 'a', 'bca'): 1, ...})

In [69]:
for i in n:
    print(i)

('my', 'name', 'is', 'nitin')
('name', 'is', 'nitin', 'i')
('is', 'nitin', 'i', 'am')
('nitin', 'i', 'am', 'a')
('i', 'am', 'a', 'good')
('am', 'a', 'good', 'boy')
('a', 'good', 'boy', 'i')
('good', 'boy', 'i', 'am')
('boy', 'i', 'am', 'a')
('i', 'am', 'a', 'bca')
('am', 'a', 'bca', 'student')
('a', 'bca', 'student', '.')


<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 40px; font-weight: bold;">
Count Vectorizer
</h1>

<h3 style="color:#B0C4DE; font-family: 'Trebuchet MS', sans-serif; font-size: 22px;">
Count Vectorizer converts text into numbers by counting how many times each word appears.
</h3>

</div>

---

## Example

Given the sentences:

1. **"One Geek helps Two Geeks"**  
2. **"Two Geeks help Four Geeks"**  
3. **"Each Geek helps many other Geeks at GeeksforGeeks."**

Count Vectorizer will convert them into a table like this:

| Sentence       | one | geek | helps | two | geeks | help | four | each | many | other | at | geeksforgeeks |
|----------------|-----|------|-------|-----|-------|------|------|------|------|-------|----|---------------|
| Sentence 1     | 1   | 1    | 1     | 1   | 1     | 0    | 0    | 0    | 0    | 0     | 0  | 0             |
| Sentence 2     | 0   | 0    | 0     | 1   | 2     | 1    | 1    | 0    | 0    | 0     | 0  | 0             |
| Sentence 3     | 0   | 1    | 1     | 0   | 1     | 0    | 0    | 1    | 1    | 1     | 1  | 1             |

---

## Why Use Count Vectorizer?

- It **turns text into numbers**, which is necessary for **machine learning models**  
- Commonly used in **text classification, spam detection, sentiment analysis**, etc.  
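
To show what happens under the hood, the following sketch builds the same kind of count matrix by hand. `count_vectorize` is a hypothetical helper; it lowercases the text and keeps tokens of two or more word characters, which is meant to mirror scikit-learn's default `CountVectorizer` settings, and it orders the columns alphabetically.

```python
import re
from collections import Counter

def count_vectorize(docs):
    """Build a sorted vocabulary and a per-document count matrix."""
    # Lowercase, then keep tokens made of at least two word characters
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

docs = [
    "One Geek helps Two Geeks",
    "Two Geeks help Four Geeks",
    "Each Geek helps many other Geeks at GeeksforGeeks.",
]
vocab, matrix = count_vectorize(docs)
print(vocab)
for row in matrix:
    print(row)
```

Because the columns follow the sorted vocabulary, they appear in a different order from the table above, where words are listed in order of first appearance.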


In [2]:
import pandas as pd

In [30]:
l = ["One Geek helps Two Geeks", "Two Geeks help Four Geeks", "Each Geek helps many other Geeks at GeeksforGeeks."]

In [31]:
df = pd.DataFrame({"name": l})

In [32]:
df


   name
0  One Geek helps Two Geeks
1  Two Geeks help Four Geeks
2  Each Geek helps many other Geeks at GeeksforGe...


In [33]:
from sklearn.feature_extraction.text import CountVectorizer

In [34]:
cv = CountVectorizer()

In [35]:
new_data = cv.fit_transform(df['name']).toarray()

In [36]:
new_data

array([[0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
       [0, 0, 1, 0, 2, 0, 1, 0, 0, 0, 0, 1],
       [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]])

In [37]:
cv.vocabulary_

{'one': 9,
 'geek': 3,
 'helps': 7,
 'two': 11,
 'geeks': 4,
 'help': 6,
 'four': 2,
 'each': 1,
 'many': 8,
 'other': 10,
 'at': 0,
 'geeksforgeeks': 5}

<div style="text-align: center;">

<h1 style="color:#FFB84C; font-family: 'Georgia', serif; font-size: 40px; font-weight: bold;">
Word Sense Disambiguation (WSD)
</h1>

<h3 style="color:#B0C4DE; font-family: 'Trebuchet MS', sans-serif; font-size: 22px;">
WSD is the process of identifying the correct meaning of a word based on context.
</h3>

</div>

---

## What Is Word Sense Disambiguation?

Some words have multiple meanings. WSD helps the system understand **which meaning is correct** in a given sentence.

| Word | Sentence | Meaning Chosen |
|------|----------|----------------|
| **Bank** | "He sat by the river bank." | River side |
| **Bank** | "She deposited money in the bank." | Financial institution |
| **Bat** | "A bat flew in the night sky." | Animal |
| **Bat** | "He hit the ball with a bat." | Sports equipment |

---

## Why Use WSD?

- Helps machines understand **context and meaning**
- Useful in **translation, question answering, and chatbots**
- Improves results in tasks where an ambiguous word **changes the meaning** of the sentence
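
One classic approach, the Lesk algorithm, picks the sense whose dictionary definition shares the most words with the surrounding context. The sketch below is a simplified version of that idea: the `SENSES` glosses are written by hand for this example and `simple_lesk` is a hypothetical helper, whereas NLTK's `lesk` compares the context against WordNet definitions.

```python
# Hypothetical glosses for illustration only
SENSES = {
    "bank": {
        "river side": "sloping land beside a river or lake",
        "financial institution": "an institution that accepts deposits of money and makes loans",
    }
}

def simple_lesk(context_tokens, word):
    """Pick the sense whose gloss shares the most words with the context."""
    context = {t.lower() for t in context_tokens}

    def overlap(sense):
        return len(context & set(SENSES[word][sense].split()))

    return max(SENSES[word], key=overlap)

print(simple_lesk("He sat by the river bank".split(), "bank"))
print(simple_lesk("She deposited money in the bank".split(), "bank"))
```

Because the choice rests purely on word overlap, short contexts or glosses with little shared vocabulary can lead to an unintuitive sense being selected.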


In [39]:
from nltk.wsd import lesk

In [90]:
x = "sun is glowing"
y = "The girl nodded and brushed the loose strands of mouse brown hair from her face."

In [94]:
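# Context: the tokenized sentence that contains the ambiguous word.
# Lesk picks the sense whose WordNet definition overlaps most with that context,
# so the chosen sense depends on the wording of the sentence.
l = lesk(word_tokenize(x), 'sun')
m = lesk(word_tokenize(y), 'mouse')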

In [95]:
l.definition()

'the star that is the source of light and heat for the planets in the solar system'

In [96]:
m.definition()

'any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails'