# **📌 Module 1: Introduction to NLP**

### **Objective:**
- Understand what NLP is and why it's important
- Learn real-world applications of NLP
- Get hands-on experience with basic text processing using Python

## **🚀 1. What is NLP?**
**Theory**

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that helps computers understand, interpret, and generate human language.

**Why is NLP important?**

Humans communicate using natural language (English, French, Yoruba, etc.), but computers operate on numbers. NLP bridges the two by converting text into numerical representations that a computer can process.

**Real-world Applications of NLP**
- Virtual Assistants: Siri, Google Assistant, Alexa
- Chatbots: Customer support bots and conversational assistants (e.g., ChatGPT)
- Sentiment Analysis: Detecting emotions in tweets, reviews, or news
- Machine Translation: Google Translate
- Speech Recognition: Converting speech to text (YouTube captions, Google Voice)

## **🛠 2. Setting Up Your Environment**

Before diving into coding, we need to install some essential libraries. Run the following in a Jupyter Notebook or a Python script:

In [None]:
# !pip install nltk spacy textblob # Install the necessary packages

## **📖 3. Basic Text Processing in NLP**
We'll now introduce fundamental text processing techniques with both theory and practical code examples.



### **📝 3.1 Tokenization**
**Theory:**

Tokenization is the process of breaking a sentence into individual words (word tokenization) or breaking a paragraph into individual sentences (sentence tokenization).

In [2]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # Download required dataset

text = "Hello students! Welcome to Natural Language Processing. NLP is amazing."

# Word Tokenization
words = word_tokenize(text)
print("Word Tokenization:", words)

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:", sentences)


Word Tokenization: ['Hello', 'students', '!', 'Welcome', 'to', 'Natural', 'Language', 'Processing', '.', 'NLP', 'is', 'amazing', '.']
Sentence Tokenization: ['Hello students!', 'Welcome to Natural Language Processing.', 'NLP is amazing.']



**🔹 Explanation:**

- word_tokenize(text): Breaks text into words
- sent_tokenize(text): Breaks text into sentences
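
To make the idea concrete without any downloads, here is a minimal sketch that approximates both kinds of tokenization with regular expressions. The function names `simple_word_tokenize` and `simple_sent_tokenize` and the patterns are illustrative assumptions, not NLTK's actual algorithm, and they will mishandle cases such as abbreviations ("Dr.") that NLTK treats more carefully.

```python
import re

def simple_word_tokenize(text):
    # A run of word characters is one token; any other non-space
    # character (punctuation) becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on the whitespace that follows a sentence-ending mark.
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Hello students! Welcome to Natural Language Processing. NLP is amazing."
print(simple_word_tokenize(text))
print(simple_sent_tokenize(text))
```

For this particular sentence the sketch yields the same 13 word tokens and 3 sentences as the NLTK run; on messier text the two will diverge.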

### **📝 3.2 Removing Stopwords**
**Theory:**

Stopwords are very common words (like "the", "is", "in") that carry little meaning on their own. Removing them reduces noise for tasks such as keyword extraction or topic classification.

In [3]:
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Filtered Words (No Stopwords):", filtered_words)


Filtered Words (No Stopwords): ['Hello', 'students', '!', 'Welcome', 'Natural', 'Language', 'Processing', '.', 'NLP', 'amazing', '.']



**🔹 Explanation:**

- stopwords.words('english') provides a list of common stopwords
- We remove them from our tokenized words
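
Notice that the filtered list still contains "!" and ".", because punctuation is not part of the stopword list. The sketch below shows one way to drop both in a single pass. It uses a tiny hand-picked stopword set and a hypothetical helper `remove_stopwords` so that it runs without NLTK; the real NLTK list is much longer.

```python
# A tiny hand-picked stopword set, for illustration only.
STOP_WORDS = {"the", "is", "in", "to", "a", "an", "of", "and"}

def remove_stopwords(tokens, stop_words=STOP_WORDS, drop_punctuation=True):
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # common function word
        if drop_punctuation and not any(ch.isalnum() for ch in token):
            continue  # token is pure punctuation such as "!" or "."
        kept.append(token)
    return kept

tokens = ['Hello', 'students', '!', 'Welcome', 'to', 'Natural', 'Language',
          'Processing', '.', 'NLP', 'is', 'amazing', '.']
print(remove_stopwords(tokens))
```

Comparing `token.lower()` against the set keeps the original casing in the output while still matching "The" or "Is".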

### **📝 3.3 Stemming and Lemmatization**

**Theory:**
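- Stemming: Strips suffixes with simple rules to reach a root form (e.g., "running" → "run"); the result is not always a real word (e.g., "language" → "languag")
- Lemmatization: Maps a word to its dictionary form (lemma) using vocabulary and part of speech, so the result is always a valid word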

In [14]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "running"

print("Stemmed Word:", stemmer.stem(word))  # Output: run
print("Lemmatized Word:", lemmatizer.lemmatize(word, pos='v'))  # Output: run


Stemmed Word: run
Lemmatized Word: run



**🔹 Explanation:**

- Stemming just removes suffixes (e.g., "ing", "ed")
- Lemmatization considers meaning and part of speech (e.g., "better" → "good" when the word is tagged as an adjective with pos='a')
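
The difference can be sketched with two toy functions. This is not the Porter algorithm or WordNet: `naive_stem`, `naive_lemmatize`, the suffix list, and the lookup table are all hypothetical, chosen only to show that stemming is rule-based string surgery while lemmatization is closer to a dictionary lookup.

```python
SUFFIXES = ("ing", "ed", "es", "s")          # checked in this order
LEMMA_TABLE = {"better": "good", "ran": "run", "mice": "mouse"}

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def naive_lemmatize(word):
    # Look the word up; fall back to the lowercased word itself.
    return LEMMA_TABLE.get(word.lower(), word.lower())

for w in ["jumped", "running", "machines", "better"]:
    print(w, "->", naive_stem(w), "|", naive_lemmatize(w))
```

The naive stemmer turns "running" into "runn", which is exactly the kind of case the Porter stemmer's extra rules exist to handle, and it cannot relate "better" to "good" at all.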

### **📝 3.4 Part-of-Speech (POS) Tagging**
**Theory:**

POS tagging assigns a grammatical category to words (noun, verb, adjective, etc.)

In [15]:
from nltk import pos_tag

nltk.download('averaged_perceptron_tagger')

pos_tags = pos_tag(words)
print("POS Tags:", pos_tags)



POS Tags: [('Hello', 'JJ'), ('students', 'NNS'), ('!', '.'), ('Welcome', 'NNP'), ('to', 'TO'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('.', '.'), ('NLP', 'NNP'), ('is', 'VBZ'), ('amazing', 'JJ'), ('.', '.')]


**🔹 Explanation:**

Tags words as Noun (NN), Verb (VB), Adjective (JJ), etc.
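
Once tokens are tagged, the tags can be used to select words by category. The sketch below reuses the tagged output printed above as plain data and groups words by the first two letters of their tag. The `COARSE` mapping and the `group_by_category` helper are simplifying assumptions, not part of NLTK.

```python
from collections import defaultdict

# Tagged tokens copied from the output above (Penn Treebank tags).
pos_tags = [('Hello', 'JJ'), ('students', 'NNS'), ('!', '.'), ('Welcome', 'NNP'),
            ('to', 'TO'), ('Natural', 'NNP'), ('Language', 'NNP'),
            ('Processing', 'NNP'), ('.', '.'), ('NLP', 'NNP'), ('is', 'VBZ'),
            ('amazing', 'JJ'), ('.', '.')]

# Coarse categories keyed by tag prefix (NN, NNS, NNP all count as nouns).
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective"}

def group_by_category(tagged):
    groups = defaultdict(list)
    for word, tag in tagged:
        category = COARSE.get(tag[:2])
        if category:
            groups[category].append(word)
    return dict(groups)

print(group_by_category(pos_tags))
```

The grouping also makes tagger mistakes visible: "Hello" lands among the adjectives and "Welcome" among the nouns, a reminder that the tagger is a statistical model and not a grammar oracle.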

## **💡 4. Mini Project: Basic NLP Pipeline**

Now, let's put everything together into a simple NLP text pre-processing pipeline.

In [16]:
# Import necessary libraries
import nltk 
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag

# Download required datasets
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')
#nltk.download('averaged_perceptron_tagger')

# Sample text 
text = "NLP is an interesting field of AI. It helps machines understand human language."

# 1. Tokenization
words = word_tokenize(text)
sentences = sent_tokenize(text)
print("Words:", words)
print("Sentences:", sentences)

# 2. Removing Stopwords
stop_words = set(stopwords.words('english'))  # Build the set once instead of reloading the list for every word
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)

# 3. Stemming & Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_words = [stemmer.stem(word) for word in filtered_words]
# pos='v' treats every token as a verb, which is enough for this demo
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in filtered_words]

print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)

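# 4. POS Tagging (tagging the filtered tokens loses sentence context; tag the full `words` list when accuracy matters)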
pos_tags = pos_tag(filtered_words)
print("POS Tags:", pos_tags)


Words: ['NLP', 'is', 'an', 'interesting', 'field', 'of', 'AI', '.', 'It', 'helps', 'machines', 'understand', 'human', 'language', '.']
Sentences: ['NLP is an interesting field of AI.', 'It helps machines understand human language.']
Filtered Words: ['NLP', 'interesting', 'field', 'AI', '.', 'helps', 'machines', 'understand', 'human', 'language', '.']
Stemmed Words: ['nlp', 'interest', 'field', 'ai', '.', 'help', 'machin', 'understand', 'human', 'languag', '.']
Lemmatized Words: ['NLP', 'interest', 'field', 'AI', '.', 'help', 'machine', 'understand', 'human', 'language', '.']
POS Tags: [('NLP', 'NNP'), ('interesting', 'JJ'), ('field', 'NN'), ('AI', 'NNP'), ('.', '.'), ('helps', 'VBZ'), ('machines', 'NNS'), ('understand', 'JJ'), ('human', 'JJ'), ('language', 'NN'), ('.', '.')]


## **🎯 5. Assignment for Students**
To reinforce learning, ask students to:

1️⃣ Modify the text in the pipeline and observe the outputs

2️⃣ Try another NLP library like spaCy for tokenization and POS tagging

3️⃣ Apply stemming and lemmatization to different words

# **📌 Module 2: Text Cleaning and Preprocessing**

**Objective:**

- Understand the importance of text cleaning in NLP
- Learn how to remove noise, special characters, and perform text normalization
- Get hands-on experience with regular expressions (RegEx) and text preprocessing in Python

## **🚀 1. Why is Text Cleaning Important?**

**Theory**

Raw text data is often noisy and unstructured. Before applying NLP models, we need to clean and normalize the text to improve accuracy and efficiency.

**Common Issues in Text Data**

❌ Special characters (@, #, $, &, etc.)

❌ Punctuation (., ,, !, ?, etc.)

❌ Numbers (e.g., 12345)

❌ Extra whitespaces

❌ Case sensitivity (e.g., Apple vs apple)

❌ URLs and Email Addresses

❌ Emojis and emoticons (😊, :-) )

❌ Incorrect spelling



## **🛠 2. Setting Up for Text Cleaning**

Before we start, let’s install additional dependencies:

In [1]:
# !pip install emoji  # "re" and "string" ship with Python and need no install

## **📖 3. Basic Text Cleaning Techniques**

Let’s go through text cleaning step by step.

### **📝 3.1 Lowercasing**
**Theory:**

Converting text to lowercase ensures uniformity (e.g., "Python" and "python" are treated the same).

In [1]:
text = "Natural Language Processing is FUN!"
clean_text = text.lower()
print(clean_text)  # Output: natural language processing is fun!


natural language processing is fun!


### **📝 3.2 Removing Punctuation**

**Theory:**

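Punctuation marks carry little information for many NLP tasks (such as bag-of-words classification), so they are often removed. Keep them when they matter, for example for sentence splitting or when "!" signals sentiment.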


In [2]:
import string

text = "Hello, NLP! Let's remove punctuations: @#&."
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)  # Output: Hello NLP Lets remove punctuations


Hello NLP Lets remove punctuations 


### **📝 3.3 Removing Numbers**

**Theory:**

Numbers can be irrelevant in some NLP tasks (e.g., sentiment analysis).

In [3]:
import re

text = "My phone number is 08100001111 and I have 10 apples."
clean_text = re.sub(r'\d+', '', text)
print(clean_text)  # Output: My phone number is  and I have  apples.


My phone number is  and I have  apples.


### **📝 3.4 Removing Extra Whitespaces**

**Theory:**

Multiple spaces between words are unnecessary and should be reduced to a single space.

In [4]:
text = "This   is   NLP   with    extra    spaces."
clean_text = re.sub(r'\s+', ' ', text).strip()
print(clean_text)  # Output: This is NLP with extra spaces.


This is NLP with extra spaces.


### **📝 3.5 Removing Special Characters**

**Theory:**

Characters like @, #, &, and * are often not useful in NLP tasks.

In [5]:
text = "Follow us @NLP_Project #AI & Machine Learning!"
clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print(clean_text)  # Output: Follow us NLPProject AI  Machine Learning (removing "&" leaves a double space)


Follow us NLPProject AI  Machine Learning


### **📝 3.6 Removing URLs and Email Addresses**

**Theory:**

Links and emails can be noisy and are usually not useful.

In [6]:
text = "Check out our website: https://www.example.com or email me at test@example.com."
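# Match full URLs and full email addresses (not only the part after "@")
clean_text = re.sub(r'https?://\S+|www\.\S+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+', '', text)
print(clean_text)  # Output: Check out our website:  or email me at .


Check out our website:  or email me at .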


### **📝 3.7 Removing Emojis**

**Theory:**

Emojis can add sentiment but may not always be useful for NLP models.

In [7]:
import emoji

text = "I love NLP! 😍💡🔥"
clean_text = emoji.replace_emoji(text, replace="")
print(clean_text)  # Output: I love NLP!


I love NLP! 


### **📝 3.8 Spell Checking and Correction**
**Theory:**

Misspelled words can affect NLP models negatively. We can use TextBlob for spelling correction.

In [8]:
from textblob import TextBlob

text = "Natrual Langage Procesing is amazng!"
corrected_text = TextBlob(text).correct()
print(corrected_text)  # Output: Natural Language Processing is amazing!


Natural Language Processing is amazing!


## **💡 4. Mini Project: Full Text Cleaning Pipeline**

Now, let’s combine all these steps into a single text-cleaning function.

In [None]:
import re
import string
import emoji
from textblob import TextBlob

def clean_text(text):
    text = text.lower()  # Lowercasing
    text = re.sub(r'https?://\S+|www\.\S+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+', '', text)  # Remove URLs and full email addresses
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespaces
    text = emoji.replace_emoji(text, replace="")  # Remove emojis
    text = ' '.join([word for word in text.split() if word.isalpha()])  # Keep only purely alphabetic tokens
    text = str(TextBlob(text).correct())  # Correct spelling
    return text

# Test the function
sample_text = "Natrual Langage Procesing is amazng! Visit https://example.com 🚀🔥"
cleaned_text = clean_text(sample_text)
print(cleaned_text)

natural language processing is amazing visit
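
The order of the steps matters. As a small illustration (standard library only, with hypothetical helper names), the sketch below applies the same two steps in both orders. If punctuation is stripped first, the URL loses its "://" and dots, so the URL pattern no longer recognises it.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def squeeze(text):
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

def urls_then_punctuation(text):
    return squeeze(URL_RE.sub("", text).translate(PUNCT_TABLE))

def punctuation_then_urls(text):
    return squeeze(URL_RE.sub("", text.translate(PUNCT_TABLE)))

sample = "Visit https://example.com today!"
print(urls_then_punctuation(sample))   # Visit today
print(punctuation_then_urls(sample))   # Visit httpsexamplecom today
```

This is why the pipeline removes URLs and email addresses before it removes punctuation.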


## **🎯 5. Assignment for Students**

To reinforce learning, ask students to:

1️⃣ Use the function on different noisy text samples

2️⃣ Modify the function to include additional cleaning steps

3️⃣ Use another spell checker like pyspellchecker instead of TextBlob

# **📌 Module 3: Text Representation & Feature Engineering**

**Objective:**
- Understand how to convert text data into a numerical format for ML models
- Learn different text vectorization techniques (Bag of Words, TF-IDF, Word Embeddings)
- Implement practical examples using Python

## **🚀 1. Why Do We Need Text Representation?**

**Theory**

Machines cannot understand raw text directly. Text representation converts text into numerical data that models can process.

🔹 Example Problem:

"The cat sat on the mat."

A machine cannot interpret this sentence unless we convert it into numbers.

There are different methods for this:

1️⃣ Bag of Words (BoW)

2️⃣ TF-IDF (Term Frequency-Inverse Document Frequency)

3️⃣ Word Embeddings (Word2Vec, GloVe, FastText, BERT, etc.)

## **🛠 2. Bag of Words (BoW)**

### **📝 2.1 How Does It Work?**

- Counts the occurrence of words in a document
- Ignores the order of words
- Creates a sparse matrix representation

**Example:**

**Corpus:**

1️⃣ "The cat sat on the mat."

2️⃣ "The dog barked at the cat."


**🛠 Code Example: Implementing BoW with Scikit-Learn**

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog barked at the cat."
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # Show feature words
print(X.toarray())  # Show numerical representation


['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']
[[0 0 1 0 1 1 1 2]
 [1 1 1 1 0 0 0 2]]


✅ Pros: Simple and interpretable

❌ Cons: Ignores context, produces large sparse matrices
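
To see what the vectorizer does internally, the matrix above can be rebuilt with the standard library. This is a minimal sketch, assuming the default `CountVectorizer` behaviour of lowercasing and keeping tokens of two or more word characters; the `tokenize` helper is an illustrative name.

```python
import re
from collections import Counter

corpus = [
    "The cat sat on the mat.",
    "The dog barked at the cat."
]

def tokenize(doc):
    # Lowercase, then keep tokens with at least two word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter of word frequencies per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is the sorted set of all words seen in the corpus
vocabulary = sorted(set().union(*counts))

# Each row holds the count of every vocabulary word in one document
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each column corresponds to one vocabulary word, which is why the matrix grows wide and sparse as the corpus grows.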

## **📌 3. TF-IDF (Term Frequency-Inverse Document Frequency)**

### **📝 3.1 How Does It Work?**

TF-IDF scores words based on their importance.

**Formula:**

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

**Where:**

- TF (Term Frequency) = Number of times a word appears in a document
- IDF (Inverse Document Frequency) = Measures how rare a word is across all documents

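🔹 Example: A word like "the", which appears in every document, gets the lowest IDF weight, while a word like "barked", which appears in only one document, gets a higher one. The final score also depends on TF, so in a tiny corpus a word that is repeated within a document can still score high.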

**🛠 Code Example: Implementing TF-IDF with Scikit-Learn**

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # Show feature words
print(X.toarray())  # Show numerical representation


['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']
[[0.         0.         0.30253071 0.         0.42519636 0.42519636
  0.42519636 0.60506143]
 [0.42519636 0.42519636 0.30253071 0.42519636 0.         0.
  0.         0.60506143]]


✅ Pros: Reduces importance of common words like "the"

❌ Cons: Still lacks contextual understanding
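
The numbers printed above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` and L2 normalisation of each row. It starts from the Bag of Words counts for the same two sentences and uses only the standard library.

```python
import math

vocabulary = ['at', 'barked', 'cat', 'dog', 'mat', 'on', 'sat', 'the']
counts = [[0, 0, 1, 0, 1, 1, 1, 2],   # "The cat sat on the mat."
          [1, 1, 1, 1, 0, 0, 0, 2]]   # "The dog barked at the cat."

n_docs = len(counts)

# Document frequency: in how many documents does each word occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    return [v / norm for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 8) for v in row])
```

"the" has the lowest IDF (1.0, because it occurs in both documents) yet still ends up with the largest score, because it appears twice in each sentence.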



## **📌 4. Word Embeddings (Word2Vec, GloVe, FastText)**

### **📝 4.1 How Do Word Embeddings Work?**

Unlike BoW and TF-IDF, word embeddings capture meaning. They represent words as dense vectors where words with similar meanings have similar vector representations.


**🔹 Example:**

📝 "King - Man + Woman ≈ Queen"

(Vector arithmetic on word embeddings can recover relationships between words!)
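
The arithmetic behind this analogy can be sketched with toy vectors. The three-dimensional vectors below are hand-made for illustration (the axes loosely mean royalty, male, female); real embeddings have hundreds of learned dimensions, and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors: (royalty, male, female). Not learned from data.
vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    # Compute a - b + c, then return the closest word that is not an input.
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```

Cosine similarity compares the direction of two vectors and ignores their length, which is why it is the usual closeness measure for embeddings.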

**📝 4.2 Types of Word Embeddings**

1️⃣ Word2Vec (Google) – Learns word meanings from context

2️⃣ GloVe (Stanford) – Captures global word relationships

3️⃣ FastText (Facebook) – Works well with rare words

**🛠 Code Example: Implementing Word2Vec**

In [16]:
%pip install --upgrade gensim nltk

In [1]:

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')  # Download required tokenizer data

# Sample corpus
corpus = [
    "The cat sat on the mat.",
    "The dog barked at the cat."
]

# Tokenize sentences
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4, sg=0)

# Get vector representation of a word
print("Vector for 'cat':")
print(model.wv['cat'])  # Prints word embedding for 'cat'

# Find similar words
print("\nSimilar words to 'cat':")
print(model.wv.most_similar('cat', topn=3))


Vector for 'cat':
[ 9.4563962e-05  3.0773198e-03 -6.8126451e-03 -1.3754654e-03
  7.6685809e-03  7.3464094e-03 -3.6732971e-03  2.6427018e-03
 -8.3171297e-03  6.2054861e-03 -4.6373224e-03 -3.1641065e-03
  9.3113566e-03  8.7338570e-04  7.4907029e-03 -6.0740625e-03
  5.1605068e-03  9.9228229e-03 -8.4573915e-03 -5.1356913e-03
 -7.0648370e-03 -4.8626517e-03 -3.7785638e-03 -8.5361991e-03
  7.9556061e-03 -4.8439382e-03  8.4236134e-03  5.2625705e-03
 -6.5500261e-03  3.9578713e-03  5.4701497e-03 -7.4265362e-03
 -7.4057197e-03 -2.4752307e-03 -8.6257253e-03 -1.5815723e-03
 -4.0343284e-04  3.2996845e-03  1.4418805e-03 -8.8142155e-04
 -5.5940580e-03  1.7303658e-03 -8.9737179e-04  6.7936908e-03
  3.9735902e-03  4.5294715e-03  1.4343059e-03 -2.6998555e-03
 -4.3668128e-03 -1.0320747e-03  1.4370275e-03 -2.6460087e-03
 -7.0737829e-03 -7.8053069e-03 -9.1217868e-03 -5.9351693e-03
 -1.8474245e-03 -4.3238713e-03 -6.4606704e-03 -3.7173224e-03
  4.2891586e-03 -3.7390434e-03  8.3781751e-03  1.5339935e-03
 -7.24

✅ Pros: Captures semantic similarity learned from the contexts in which words appear

❌ Cons: Requires a large dataset for better results

## **📌 5. Advanced: Using Pretrained Word Embeddings**

Instead of training our own embeddings, we can use pretrained models like GloVe or FastText.

**🛠 Code Example: Using Pretrained GloVe Embeddings**

In [4]:
import gensim.downloader as api

# Load pretrained word embeddings
glove_vectors = api.load("glove-wiki-gigaword-100")

# Get vector for a word
print(glove_vectors['cat'])  # Prints GloVe vector for 'cat'


[ 0.23088    0.28283    0.6318    -0.59411   -0.58599    0.63255
  0.24402   -0.14108    0.060815  -0.7898    -0.29102    0.14287
  0.72274    0.20428    0.1407     0.98757    0.52533    0.097456
  0.8822     0.51221    0.40204    0.21169   -0.013109  -0.71616
  0.55387    1.1452    -0.88044   -0.50216   -0.22814    0.023885
  0.1072     0.083739   0.55015    0.58479    0.75816    0.45706
 -0.28001    0.25225    0.68965   -0.60972    0.19578    0.044209
 -0.31136   -0.68826   -0.22721    0.46185   -0.77162    0.10208
  0.55636    0.067417  -0.57207    0.23735    0.4717     0.82765
 -0.29263   -1.3422    -0.099277   0.28139    0.41604    0.10583
  0.62203    0.89496   -0.23446    0.51349    0.99379    1.1846
 -0.16364    0.20653    0.73854    0.24059   -0.96473    0.13481
 -0.0072484  0.33016   -0.12365    0.27191   -0.40951    0.021909
 -0.6069     0.40755    0.19566   -0.41802    0.18636   -0.032652
 -0.78571   -0.13847    0.044007  -0.084423   0.04911    0.24104
  0.45273   -0.18682 

💡 Pretrained embeddings usually work better than vectors trained on a small corpus because they were learned from very large datasets!



## **🎯 6. Mini Project: Comparing BoW, TF-IDF, and Word Embeddings**

**📌 Task:**

1️⃣ Convert the following text into numerical representations using BoW, TF-IDF, and Word2Vec

corpus = ["Machine Learning is amazing!", "Deep Learning is a subset of Machine Learning."]

2️⃣ Compare the size and quality of the numerical outputs.

# **📌 Module 4: NLP Pipeline & Named Entity Recognition (NER)**

**Objective:**

- Understand the NLP pipeline and how text is processed step by step
- Learn about Named Entity Recognition (NER) for extracting information
- Implement practical NER models using spaCy and NLTK

## **🚀 1. Understanding the NLP Pipeline**

### **📝 1.1 What is an NLP Pipeline?**

An NLP pipeline is a sequence of preprocessing and analysis steps used to convert raw text into structured data for machine learning.

🔹 Steps in an NLP Pipeline:

1️⃣ Text Cleaning & Preprocessing – Tokenization, Stopword Removal, Lemmatization

2️⃣ Text Representation – Bag of Words, TF-IDF, Word Embeddings (covered in Module 3)

3️⃣ Feature Engineering – N-grams, POS tagging, Named Entity Recognition

4️⃣ Model Training & Prediction – Using ML/DL models for classification, sentiment analysis, etc.

Additional link: https://www.geeksforgeeks.org/natural-language-processing-nlp-pipeline/
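
N-grams, listed under feature engineering above, are simply runs of `n` consecutive tokens. A minimal sketch (the `ngrams` helper is an illustrative name and uses a plain whitespace split instead of a real tokenizer):

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it helps machines understand human language".split()

print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

Bigrams such as ("human", "language") keep a little of the word order that a plain Bag of Words throws away.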



## **🛠 2. Implementing an NLP Pipeline in Python**

Let’s build an NLP pipeline using spaCy.

In [None]:
import spacy

# Load English NLP model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976."

# Process the text
doc = nlp(text)

# Print tokens
print("Tokens:", [token.text for token in doc])

# Print named entities
print("Named Entities:", [(ent.text, ent.label_) for ent in doc.ents])


**Explanation:**

- Apple Inc. is recognized as an organization (ORG)
- Steve Jobs is identified as a person (PERSON)
- 1976 is detected as a date (DATE)

## **📌 3. Named Entity Recognition (NER)**

### **📝 3.1 What is Named Entity Recognition?**

NER is a technique to extract important information from text by identifying names, locations, organizations, dates, and more.

### Rule-based vs ML-based NER

In Named Entity Recognition (NER), a rule-based approach relies on manually defined patterns and linguistic rules to identify entities in text, while an ML-based approach uses machine learning models trained on labeled data to recognize entities, allowing for greater adaptability to new data and complex linguistic variations. 

### Key Differences:

#### Rule Creation:
- **Rule-based NER** requires experts to craft specific rules based on grammar and syntax, like using regular expressions to match patterns.
- **ML-based NER** learns patterns from large datasets through training. 

#### Flexibility:
- **Rule-based systems** are less flexible and struggle with ambiguous or novel language constructs.
- **ML models** can adapt to new situations and variations in language usage. 

#### Data Requirements:
- **Rule-based systems** require less training data as they rely on predefined rules.
- **ML-based systems** need large amounts of annotated data for effective training. 

### Advantages of Rule-based NER:
- **Interpretability**: Easy to understand how the system identifies entities due to the explicit rules.
- **Faster development**: Can be implemented quickly for well-defined tasks with clear patterns.
- **Domain-specific expertise**: Can be tailored to specific domains with custom rules. 

### Advantages of ML-based NER:
- **Accuracy**: Can achieve higher accuracy on complex and diverse data due to learning from large datasets.
- **Generalizability**: Can adapt to new data and language variations without manual rule adjustments.
- **Scalability**: Can handle large volumes of text data efficiently. 

### Common Scenarios for Each Approach:

#### Rule-based:
- Identifying basic entities like phone numbers, email addresses, dates in structured text. 

#### ML-based:
- Recognizing complex entities like people's names, locations, organizations in unstructured text with diverse language usage.
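
A rule-based recogniser for the structured entities mentioned above can be sketched with regular expressions. The patterns and the labels `EMAIL`, `YEAR`, and `PHONE` are illustrative assumptions (the phone rule simply matches 11 consecutive digits); a production system would need far more careful rules.

```python
import re

# Hypothetical patterns, one per entity label.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "YEAR": re.compile(r"\b(?:19|20)\d{2}\b"),
    "PHONE": re.compile(r"\b\d{11}\b"),
}

def rule_based_ner(text):
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((match.group(), label, match.start(), match.end()))
    # Return entities in the order they appear in the text.
    return sorted(entities, key=lambda entity: entity[2])

text = "Call 08100001111 or write to test@example.com before 2025."
for entity_text, label, start, end in rule_based_ner(text):
    print(f"{entity_text} → {label}")
```

Every decision is traceable to one pattern, which is the interpretability advantage described above. The cost is that each new format needs a new hand-written rule.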


### **🛠 3.2 Implementing NER using spaCy**

Let’s extract named entities from a news article.

In [None]:

import spacy

nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Tesla was founded by Elon Musk and is headquartered in Palo Alto, California."

# Process the text
doc = nlp(text)

# Extract named entities
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")


✅ Output:

Tesla → ORG

Elon Musk → PERSON

Palo Alto → GPE

California → GPE

✅ Key Takeaways:

- Tesla is detected as an organization (ORG)
- Elon Musk is recognized as a person (PERSON)
- Palo Alto and California are identified as geopolitical locations (GPE)

## **📌 4. Implementing NER using NLTK**

### **📝 4.1 Using NLTK’s Built-in Named Entity Recognition**

NLTK also supports Named Entity Recognition, but it requires POS tagging first.

In [11]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download necessary NLTK resources
# Newer NLTK releases look for the "_tab"/"_eng" resources (e.g. maxent_ne_chunker_tab);
# older releases use the original names, so both variants are requested here.
for resource in ('punkt', 'punkt_tab',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng',
                 'maxent_ne_chunker', 'maxent_ne_chunker_tab', 'words'):
    nltk.download(resource)

# Sample text
text = "Jeff Bezos founded Amazon in Seattle in 1994."

# Tokenization and POS tagging
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

# Named Entity Recognition
ner_tree = ne_chunk(pos_tags)
print(ner_tree)


✅ Output Example:

(S
  (PERSON Jeff/NNP)

  (PERSON Bezos/NNP)
  
  founded/VBD
  
  (ORGANIZATION Amazon/NNP)
  
  in/IN
  
  (GPE Seattle/NNP)
  
  in/IN
  
  1994/CD)

✅ Key Observations:

Jeff Bezos → PERSON

Amazon → ORGANIZATION

Seattle → GPE (Geopolitical Entity)

1994 → not labeled as an entity (it appears in the tree only with its POS tag, CD)
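
In the example tree, "Jeff" and "Bezos" appear as two separate PERSON chunks. One simple post-processing idea is to merge neighbouring chunks that share a label. The sketch below works on plain `(text, label)` pairs so it runs without NLTK; the `merge_adjacent` helper and the list representation are assumptions for illustration.

```python
def merge_adjacent(chunks):
    """Merge consecutive chunks with the same label; drop unlabeled tokens."""
    merged = []
    for text, label in chunks:
        if label is not None and merged and merged[-1][1] == label:
            # Same label as the previous chunk: extend that entity.
            merged[-1] = (merged[-1][0] + " " + text, label)
        else:
            merged.append((text, label))
    return [(text, label) for text, label in merged if label is not None]

# The example tree above, flattened to (text, label) pairs; None = no entity.
chunks = [("Jeff", "PERSON"), ("Bezos", "PERSON"), ("founded", None),
          ("Amazon", "ORGANIZATION"), ("in", None), ("Seattle", "GPE"),
          ("in", None), ("1994", None)]

print(merge_adjacent(chunks))
```

This heuristic would also glue together two different people mentioned back to back, so treat it as a starting point and not a general solution.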

## **📌 5. Training a Custom NER Model**

Sometimes, the default NER models do not recognize custom entities (e.g., brand names, industry-specific terms). In such cases, we train our own custom NER model.

### **🛠 5.1 Steps to Train a Custom NER Model**

1️⃣ Define training data (annotated text)

2️⃣ Load a pretrained spaCy model

3️⃣ Fine-tune the model with custom entities

4️⃣ Save and test the trained model

🔹 Example Training Data:

In [12]:
TRAIN_DATA = [
    ("TechTreendzz is a popular YouTube channel.", {"entities": [(0, 12, "ORG")]}),
    ("OpenAI created ChatGPT, an AI model.", {"entities": [(0, 6, "ORG"), (15, 22, "PRODUCT")]}),  # "ChatGPT" spans characters 15-22
]
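
Character offsets are easy to get wrong by one, and a misaligned span will not line up with the intended token. A small helper can compute the offsets instead of counting by hand. `find_span` is a hypothetical convenience function, not part of spaCy, and it only locates the first occurrence of the phrase.

```python
def find_span(text, phrase, label):
    # Locate the phrase and return (start, end, label) with an exclusive end.
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)

text = "OpenAI created ChatGPT, an AI model."
entities = [
    find_span(text, "OpenAI", "ORG"),
    find_span(text, "ChatGPT", "PRODUCT"),
]

# Slicing the text with each span shows exactly what the annotation covers.
for start, end, label in entities:
    print(repr(text[start:end]), label, (start, end))
```

Printing `text[start:end]` for every annotation is a quick sanity check before training.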


🔹 Train & Update the Model:

In [13]:
import spacy
from spacy.training.example import Example

# Load a pretrained model
nlp = spacy.load("en_core_web_sm")

# Get the Named Entity Recognizer
ner = nlp.get_pipe("ner")

# Add new label
ner.add_label("PRODUCT")

# Train the model
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    nlp.update([example])

# Save the updated model
nlp.to_disk("custom_ner_model")

# Test the new model
nlp = spacy.load("custom_ner_model")
text = "ChatGPT is a popular AI product developed by OpenAI."
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])


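TypeError: ForwardRef._evaluate() missing 1 required keyword-only argument: 'recursive_guard'

The cell did not complete on this machine, so no output was captured. This `TypeError` is commonly reported when an older spaCy/pydantic installation runs on Python 3.12; upgrading both packages usually resolves it.

✅ Expected Observations (once the cell runs):

- ChatGPT should be identified as a PRODUCT
- OpenAI should be detected as an organization (ORG)

Note that a single pass over two examples is rarely enough for the model to learn a new label reliably; real fine-tuning needs many more annotated examples and several training iterations.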


## **📌 6. Mini Project: Extracting Entities from News Articles**

**📌 Task:**

1️⃣ Scrape a news article using BeautifulSoup

2️⃣ Apply NER to extract people, organizations, and locations (the solution below uses NLTK’s `ne_chunk`; spaCy’s `doc.ents` can be used instead)

3️⃣ Store results in a Pandas DataFrame



In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download necessary NLTK resources
# Newer NLTK releases look for the "_tab"/"_eng" resources (e.g. maxent_ne_chunker_tab);
# older releases use the original names, so both variants are requested here.
for resource in ('punkt', 'punkt_tab',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng',
                 'maxent_ne_chunker', 'maxent_ne_chunker_tab', 'words'):
    nltk.download(resource)

# Step 1: Scrape a news article
url = 'https://www.bbc.com/news/world-us-canada-59944889'  # Example URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early if the page could not be fetched
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the article text from all paragraph tags
article_text = ' '.join(p.get_text() for p in soup.find_all('p'))

# Step 2: Apply NLTK’s NER
tokens = word_tokenize(article_text)
pos_tags = pos_tag(tokens)
ner_tree = ne_chunk(pos_tags)

# Extract entities (ne_chunk wraps each entity in a labeled subtree)
entities = []
for subtree in ner_tree:
    if hasattr(subtree, 'label'):
        entity_name = ' '.join(leaf[0] for leaf in subtree.leaves())
        entity_type = subtree.label()
        if entity_type in ['PERSON', 'ORGANIZATION', 'GPE']:
            entities.append((entity_name, entity_type))

# Step 3: Store results in a Pandas DataFrame
df = pd.DataFrame(entities, columns=['Entity', 'Label'])
print(df)
