Topics covered:
- Tokenization
    - word
    - sentence
- Text Normalization
    - stemming
    - lemmatization
- StopWords
- Named Entity Recognition
***

# Tokenization

Tokenization means breaking text down into smaller units: paragraphs are split into sentences, and sentences into words. It is one of the first steps of any NLP pipeline.

The resulting tokens can be words or sentences.

Key terminology
- Corpus: a large collection of texts (plural: corpora).
- Token: the smallest unit a text is split into, such as a word or a punctuation mark.
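
To make the idea concrete before bringing in NLTK, here is a rough sketch of both kinds of tokenization using only Python's `re` module. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are made up for this notebook, and the rules are deliberately naive (they would trip over abbreviations like "Dr."), so treat it as an illustration and not as how NLTK works internally.

```python
import re

def simple_sent_tokenize(text):
    # Split wherever ., ! or ? is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(sentence):
    # A token is a word (optionally with an internal apostrophe)
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

demo = "Tokens are small. Sentences split first! Don't they?"
demo_sentences = simple_sent_tokenize(demo)
demo_tokens = [simple_word_tokenize(s) for s in demo_sentences]

print(demo_sentences)
print(demo_tokens)
```

Note that this sketch keeps "Don't" as one token, while NLTK's `word_tokenize` splits contractions into two pieces.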

In [None]:
%pip install nltk



In [None]:
import nltk
# nltk.download('punkt')  # For tokenization


try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')


In [2]:
text = """The quick brown fox jumps over the lazy dog. 
Natural Language Processing allows machines to understand human language. 
Isn't that amazing? Nope We must try hard"""

### Sentence Tokenize


In [3]:
from nltk.tokenize import sent_tokenize

In [5]:
sentences=sent_tokenize(text)
print(sentences)
for i in sentences:
    print(i)

['The quick brown fox jumps over the lazy dog.', 'Natural Language Processing allows machines to understand human language.', "Isn't that amazing?", 'Nope We must try hard']
The quick brown fox jumps over the lazy dog.
Natural Language Processing allows machines to understand human language.
Isn't that amazing?
Nope We must try hard


### Word Tokenize

In [None]:
from nltk.tokenize import word_tokenize
words=word_tokenize(text)
print(words)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'Natural', 'Language', 'Processing', 'allows', 'machines', 'to', 'understand', 'human', 'language', '.', 'Is', "n't", 'that', 'amazing', '?', 'Nope', 'We', 'must', 'try', 'hard']


## 🤓 Fun Fact

## WordNet
WordNet is a lexical database that links words through semantic relations such as synonyms, hyponyms (more specific terms), and meronyms (parts of a whole).
***
- Your own personal dictionary, but a SMART one.
- It tells you the meaning of a word, its synonyms, antonyms, etc.
***
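
If the relation names sound abstract, this tiny hand-made lexicon may help. `TOY_LEXICON` and `related` are invented for illustration and the entries are typed in by hand; the real WordNet is far larger and is organised around synsets, not a flat dictionary like this.

```python
# A toy lexicon with one entry, just to show what each relation means.
TOY_LEXICON = {
    "car": {
        "synonyms": ["automobile"],          # same meaning
        "hyponyms": ["taxi", "ambulance"],   # more specific kinds of car
        "meronyms": ["wheel", "engine"],     # parts of a car
    },
}

def related(word, relation):
    # Return the related words, or an empty list if we know nothing.
    return TOY_LEXICON.get(word, {}).get(relation, [])

print(related("car", "synonyms"))
print(related("car", "hyponyms"))
print(related("car", "meronyms"))
```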

In [None]:
# Lexicon Example using WordNet
from nltk.corpus import wordnet
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

syns = wordnet.synsets("learning")
print("\nLexicon Entry for 'learning':")
print(syns[0].definition())

In [None]:
wordnet.synonyms("girl")

***
# Stemming

Stemming reduces a word to a root form by stripping suffixes with fixed rules. 
The result is not always a valid word.

Example: "running" and "runs" are both reduced to "run" by the Porter stemmer, but "studies" becomes "studi", and a cruder stemmer could turn "running" into "runn". Irregular forms such as "ran" are usually left unchanged.
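
Here is a minimal sketch of the suffix-chopping idea, assuming a made-up helper `toy_stem` with a hand-picked suffix list. It is much cruder than PorterStemmer, which is exactly why it shows how stems can end up as non-words.

```python
def toy_stem(word, suffixes=("ing", "ies", "es", "s", "ed")):
    # Strip the first matching suffix, but keep at least 3 characters.
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

toy_results = {w: toy_stem(w) for w in ["running", "runs", "studies", "ran"]}
for original, stem in toy_results.items():
    print(f"{original} --> {stem}")
```

"running" comes out as "runn" and "studies" as "stud", neither of which is a real word, and "ran" is untouched because no rule knows it is related to "run".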

# Lemmatization
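Lemmatization also reduces a word to its base form (the lemma).
However, it looks the word up in a dictionary and uses its part of speech (POS), so the result is a valid word.
Example: "running" becomes "run" (as a verb), "better" becomes "good" (as an adjective). NLTK's `WordNetLemmatizer` treats every word as a noun unless you pass `pos`, e.g. `lemmatize("better", pos="a")`.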
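
To see why the part of speech matters, here is a toy dictionary-based lemmatizer. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names with a handful of hand-written entries; a real lemmatizer relies on a full dictionary such as WordNet.

```python
# Lemmas are looked up by (word, part of speech), not by word alone.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("feet", "n"): "foot",
}

def toy_lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs are returned unchanged.
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("better", pos="a"))
print(toy_lemmatize("better"))
print(toy_lemmatize("ran", pos="v"))
```

With the default noun tag, "better" stays "better"; only the adjective lookup maps it to "good".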


# Then What's The difference? 🤔
1. Stemming is faster but less accurate.
2. Lemmatization is slower but more accurate and meaningful.
3. Stemming may produce non-existent words; lemmatization returns dictionary forms (words it does not know are left unchanged).

In [None]:
import nltk

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')


from nltk.stem import PorterStemmer    # stemming
from nltk.stem import WordNetLemmatizer   #lemmatizing
from nltk.corpus import wordnet

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()





In [None]:
sampleText=" When i am running, i feel better than before. Too much studying just not me."

In [None]:
words = nltk.word_tokenize(sampleText)
print("\nStemming Results:")
for word in words:
    print(f"{word} --> {stemmer.stem(word)}")


In [None]:
print("\nLemmatization Results:")
for word in words:
    # No pos argument is passed, so every word is treated as a noun (the default).
    print(f"{word} --> {lemmatizer.lemmatize(word)}")

In [None]:
# Words that show the difference between stemming and lemmatization
print("\nStrong Examples Showing Differences:")
sample_words = ["running", "flies", "better", "flying", "studies", "feet", "ate"]
for word in sample_words:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word)
    print(f"{word} --> Stem: {stemmed}, Lemma: {lemmatized}")

### 💡💡 Brainy Bites

## RegexpStemmer

RegexpStemmer is a rule-based stemmer in NLTK that uses a regular expression to strip word suffixes. Unlike PorterStemmer or LancasterStemmer, it has no built-in rules: you supply the pattern yourself.

In [None]:
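# Applying RegexpStemmer

from nltk.stem import RegexpStemmer

# Strip a trailing "ing", "s", "e" or "able"; words shorter than 4 characters are left alone.
R_stem = RegexpStemmer("ing$|s$|e$|able$", min=4)
arg = ['eats', 'geting', "disable"]
for i in arg:
    print(R_stem.stem(i))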
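
The cell above can be approximated with plain `re`, which makes it easier to see what the pattern actually does. `regexp_stem` is a hypothetical stand-in written for this notebook and is only a sketch of the behaviour, not NLTK's code.

```python
import re

SUFFIX_PATTERN = re.compile(r"ing$|s$|e$|able$")

def regexp_stem(word, min_length=4):
    # Words shorter than min_length are returned untouched.
    if len(word) < min_length:
        return word
    # Delete whichever suffix alternative matches at the end of the word.
    return SUFFIX_PATTERN.sub("", word)

regexp_results = [regexp_stem(w) for w in ["eats", "geting", "disable", "bus"]]
print(regexp_results)
```

"disable" turns into "dis" because the search reaches "able" before it reaches the final "e", and "bus" keeps its "s" because it is shorter than the minimum length.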



## SnowballStemmer
The English Snowball stemmer is also called the “Porter2” stemmer.
It refines PorterStemmer's rules.
It supports multiple languages, whereas PorterStemmer supports only English.

In [None]:
from nltk.stem import SnowballStemmer
Snowballstem = SnowballStemmer("english")
for i in arg:
    print(i + "....." + Snowballstem.stem(i))

### Still Confused??🥲
### Here you go 😎 Full Comparison Table:
| Method           | Key Feature                                                  | Returns Real Word? | Based on Rules or Dictionary? | Accuracy       | Language Support   |
|------------------|--------------------------------------------------------------|--------------------|-------------------------------|----------------|--------------------|
| Porter Stemmer   | Classic, simple stemmer that chops suffixes                  | ❌ (Not always)    | Rule-based                    | Low to Medium  | English only       |
| Lemmatization    | Uses vocabulary + grammar (POS tagging) to get the base word | ✅ (Always)        | Dictionary-based              | High           | English (WordNet)  |
| RegexpStemmer    | Applies custom regex rules (e.g., remove -ing, -ed, -s)      | ❌ (Usually not)   | Rule-based (regex)            | Low            | Depends on rules   |
| Snowball Stemmer | More advanced than Porter, improved linguistic logic         | ❌ (Not always)    | Rule-based (refined)          | Medium to High | Multiple languages |

*** 
## Stop Words
Stop words are common words that carry little meaning on their own and are often removed from text during preprocessing.
Examples include: "is", "and", "the", "a", "in", etc.
Removing them helps focus on the more meaningful words in NLP tasks.

**How does removing them help?**
- They occur frequently but add little information.
- Removing them simplifies the dataset and can improve model performance.
- It reduces noise in the data.
- It decreases the size of the dataset.
- It can make text classification and search-related tasks more efficient.
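
Here is a quick sketch of the size reduction, using a tiny hand-written stop word set built from the examples above. `TOY_STOP_WORDS` and `remove_stop_words` are made-up names, and the token list is invented for the demo; NLTK's English list is much longer.

```python
# Only the five example stop words mentioned above.
TOY_STOP_WORDS = {"is", "and", "the", "a", "in"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    # Compare in lowercase so "The" is caught as well as "the".
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "cat", "is", "in", "the", "garden", "and", "a", "dog", "barks"]
kept = remove_stop_words(tokens)
removed_count = len(tokens) - len(kept)

print(kept)
print(f"Removed {removed_count} of {len(tokens)} tokens")
```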

In [None]:
# Importing Required Libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download required NLTK data files
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
# Tokenize the text
words = word_tokenize(sampleText)

# Get English stopwords
stop_words = set(stopwords.words('english')) # setting the language

# Filter out stop words
filtered_words = [word for word in words if word.lower() not in stop_words]

print("\nOriginal Words:")
print(words)

print("\nFiltered Words (Stop Words Removed):")
print(filtered_words)

## Named Entity Recognition (NER)

**What is NER?**

Named Entity Recognition (NER) is the process of locating and classifying named entities in text into predefined categories such as:
- Person names
- Organizations
- Locations
- Dates
- Time expressions
- Monetary values

NER helps in understanding a text by picking out the names and expressions in it that refer to specific real-world things.

### Why is NER important?
- Helps extract structured information from unstructured text.
- Useful in applications like chatbots, search engines, recommendation systems, and summarization.
- Makes information retrieval more accurate.
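
As a sketch of what "structured information" can look like, suppose an NER system returned a list of (text, label) pairs. The pairs below are hand-written for illustration, and `group_by_label` is a hypothetical helper that simply groups them by category.

```python
from collections import defaultdict

# Hand-written (entity text, label) pairs, as an NER system might return them.
entity_pairs = [
    ("Apple", "ORG"),
    ("Steve Jobs", "PERSON"),
    ("Cupertino", "GPE"),
    ("1976", "DATE"),
]

def group_by_label(pairs):
    # Collect the entity texts under their label.
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

structured = group_by_label(entity_pairs)
print(structured)
```

Once the entities sit in a dictionary like this, a search engine or chatbot can query them directly without re-reading the raw sentence.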


In [None]:
import spacy

# Load the small English model
# (install it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Apple was founded by Steve Jobs in Cupertino in 1976."
# Try your own!

# Process the text
doc = nlp(text)

# Print named entities
for ent in doc.ents:
    print(ent.text, "-", ent.label_)


### Output:
"""
Apple - ORG
Steve Jobs - PERSON
Cupertino - GPE
1976 - DATE
"""

### Let's Test Your Knowledge








***
# 🥳🥳 Congrats everyone, we did it!!
Now you know a lot more than you did before.
### You should pat yourself on the back. 'Cause I certainly am.

Thank you for being a part of it. 
Happy Creative Coding!!! 
# 🤘🤘

