
# 🧠 **Named Entity Recognition (NER) 🔍**

> 📘 **Definition**
> NER is a **Natural Language Processing (NLP)** technique used to **identify and classify named entities** (like people, places, organizations, dates, etc.) in text into predefined categories.

---

## ⚙️ **How NER Works**

NER generally works in 3 main steps:

1. 🧹 **Preprocessing**

   * Tokenization ✂️ (splitting text into words/tokens)
   * POS tagging 🏷️ (Part-of-Speech tagging)
   * Lemmatization/Normalization ✨

2. 🧠 **Entity Detection**

   * The algorithm detects chunks of text that could be entities (e.g., "Barack Obama").

3. 🏷️ **Entity Classification**

   * Each detected entity is classified into categories like:

     * 👤 Person
     * 🌍 Location
     * 🏢 Organization
     * 📅 Date
     * 💰 Money
     * 🧪 Miscellaneous
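
The three steps above can be sketched with a toy, dictionary-driven pipeline. This is only an illustration under simplifying assumptions: the `GAZETTEER` lookup table and the "consecutive capitalized tokens" heuristic are hypothetical stand-ins for what a trained model learns, and POS tagging and lemmatization are skipped.

```python
import re

# Hypothetical mini-gazetteer; real systems learn this from annotated data.
GAZETTEER = {
    "Barack Obama": "PERSON",
    "Hawaii": "LOCATION",
    "Harvard University": "ORGANIZATION",
}

def tokenize(text):
    """Step 1: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def detect_candidates(tokens):
    """Step 2: group consecutive capitalized tokens into candidate chunks."""
    chunks, current = [], []
    for tok in tokens:
        if tok[0].isupper():
            current.append(tok)
        else:
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

def classify(chunks):
    """Step 3: assign a category; unknown chunks fall back to MISC."""
    return [(chunk, GAZETTEER.get(chunk, "MISC")) for chunk in chunks]

text = "Barack Obama was born in Hawaii and studied at Harvard University."
entities = classify(detect_candidates(tokenize(text)))
print(entities)
```

A heuristic like this breaks quickly (any capitalized sentence-initial word becomes a candidate), which is one reason statistical and neural models are preferred in practice.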



## 🛠️ **Popular Tools & Libraries for NER**

| Library                      | Language | Highlights 🌟                                |
| ---------------------------- | -------- | -------------------------------------------- |
| **spaCy**                    | Python   | ⚡ Fast, accurate, easy to use                |
| **NLTK**                     | Python   | 📚 Educational, customizable                 |
| **Stanza**                   | Python   | 🧠 Deep learning-based (from Stanford)       |
| **Flair**                    | Python   | 🌈 Supports multilingual & contextual models |
| **HuggingFace Transformers** | Python   | 🤖 Pretrained models like BERT for NER       |



## 🔍 **NER Example in Action (Using spaCy)**

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Elon Musk founded SpaceX in California in 2002."

doc = nlp(text)

# Print each entity's text (padded for alignment) followed by its label
for ent in doc.ents:
    print(f"{ent.text:<14}{ent.label_}")
```
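
💡 **Note:** `en_core_web_sm` is the small English pipeline. spaCy also provides other English pipelines, such as `en_core_web_md`, `en_core_web_lg`, and the transformer-based `en_core_web_trf`.
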
📤 **Output:**

```
Elon Musk     PERSON
SpaceX        ORG
California    GPE
2002          DATE
```



## 🎯 **Use Cases of NER**

| Use Case                   | Description 💬                                        |
| -------------------------- | ----------------------------------------------------- |
| 📰 **News Categorization** | Identify people, places, and events in news articles  |
| 💼 **Resume Screening**    | Extract skills, organizations, and roles from resumes |
| 🛍️ **E-commerce**         | Recognize product names and brands                    |
| 📊 **Market Analysis**     | Identify companies, industries, and trends in reports |
| 🔍 **Search Engines**      | Improve query understanding and result relevance      |
| ⚖️ **Legal & Compliance**  | Extract dates, case numbers, organizations, etc.      |



## 🤔 **Types of NER Models**

* 🏷️ **Rule-Based**: Manually written patterns and dictionaries
* 🤖 **Machine Learning-Based**: Uses statistical sequence models such as Conditional Random Fields (CRFs)
* 🧠 **Deep Learning-Based**: Uses BiLSTM, CNN, BERT, etc., for context-aware detection
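
To make the rule-based approach concrete, here is a minimal sketch that tags dates and money amounts with hand-written regular expressions. The patterns and the `rule_based_ner` helper are illustrative assumptions, not part of any library, and they cover only the two date layouts and a few currency symbols.

```python
import re

# Hand-written patterns; the labels mirror common NER tag names, the rules are illustrative.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{2}-\d{2}-\d{4}\b|\b\d{4}-\d{2}-\d{2}\b")),
    ("MONEY", re.compile(r"[$₹€]\s?\d[\d,]*(?:\.\d+)?")),
]

def rule_based_ner(text):
    """Return (start, matched_text, label) for every pattern hit, in text order."""
    hits = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(), label))
    return sorted(hits)

sample = "The deal worth $1,200.50 was signed on 19-09-2023 and closed on 2023-10-13."
result = rule_based_ner(sample)
for start, value, label in result:
    print(start, value, label)
```

Rules like these are precise and easy to audit, but every new format needs a new pattern, which is the main trade-off against learned models.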


## 🔐 **Challenges in NER**

* 🔄 Ambiguity (e.g., "Apple" the fruit 🍎 vs. "Apple" the company)
* 🌍 Multilingual Texts
* 👩‍💻 Domain Adaptation (e.g., legal vs. medical entities)
* 🆕 Emerging Entities (e.g., new tech brands, slang)
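
The ambiguity problem can be illustrated with a deliberately naive disambiguator. The cue-word sets and the `label_apple` function below are hypothetical; a trained model picks up such context signals from data rather than from hand-made lists.

```python
# Hypothetical cue words; a trained model learns such context signals from data.
ORG_CUES = {"inc", "shares", "ceo", "announced", "iphone"}
FRUIT_CUES = {"ate", "juice", "pie", "tree", "fresh"}

def label_apple(sentence):
    """Label the word 'Apple' from neighbouring cue words in the same sentence."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    org_score = len(words & ORG_CUES)
    fruit_score = len(words & FRUIT_CUES)
    if org_score > fruit_score:
        return "ORG"
    if fruit_score > org_score:
        return "NOT_AN_ENTITY"
    return "AMBIGUOUS"

print(label_apple("Apple announced a new iPhone."))
print(label_apple("She ate an Apple pie."))
print(label_apple("I like Apple."))
```

The last sentence has no cue at all, so even a human reader cannot decide without more context; this is the case that makes ambiguity a hard problem rather than a lookup problem.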



## 📚 **Learning Resources**

* 📘 [spaCy Official Docs](https://spacy.io/usage)
* 🎓 [Coursera NLP Specialization](https://www.coursera.org/specializations/natural-language-processing)
* 📙 *Natural Language Processing with Python* – O’Reilly



In [None]:
# LOADING DATASET:
# use enveNER virtual environment
import pandas as pd
DATA = pd.read_csv(r"C:\Users\Nagesh Agrawal\OneDrive\Desktop\6_MACHINE LEARNING\3__NATURAL LANGUAGE PROCESSING\NLP_DATASETS\NAMED ENTITY RECOGNITION DATA.csv")
DATA

Unnamed: 0,Headline,Content,News Categories,Date
0,Congress leader Baljinder Singh shot dead at h...,Congress leader Baljinder Singh was shot dead ...,['national'],19-09-2023
1,17-year-old girl preparing for NEET dies by su...,Another NEET aspirant died by suicide in Rajas...,['national'],19-09-2023
2,Hampers to welcome MPs in new Parliament tomor...,In order to mark the first-ever working day of...,['national'],19-09-2023
3,"Only 10% women lawmakers in RS, while only 14%...","Congress President Mallikarjun Kharge, while s...",['national'],19-09-2023
4,"Ganesh temple decorated with notes, coins wort...",The Sri Sathya Ganapathi Temple in Bengaluru a...,['national'],19-09-2023
...,...,...,...,...
307691,"Tamil Nadu to open 10,000 'CM's pharmacy store...",Tamil Nadu CM MK Stalin has announced that 'Ch...,"['national', 'Health___Fitness']",2024-08-26
307692,NMC study finds mental health issues prevalent...,One in four MBBS students has a mental disorde...,"['education', 'Health___Fitness', 'national']",2024-08-26
307693,Telangana CM says World Bank will help retire ...,Telangana CM Revanth Reddy said the World Bank...,"['politics', 'Health___Fitness', 'national']",2024-08-26
307694,Dr Gagandeep Kang explores role of AI in vacci...,"Dr Gagandeep Kang, a microbiologist and virolo...","['Health___Fitness', 'national']",2024-08-26


In [4]:
DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307696 entries, 0 to 307695
Data columns (total 4 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   Headline         307696 non-null  object
 1   Content          307696 non-null  object
 2   News Categories  307696 non-null  object
 3   Date             307696 non-null  object
dtypes: object(4)
memory usage: 9.4+ MB


In [5]:
DATA["Date"].unique()

array(['19-09-2023', '20-09-2023', '21-09-2023', '22-09-2023',
       '23-09-2023', '24-09-2023', '25-09-2023', '26-09-2023',
       '27-09-2023', '28-09-2023', '29-09-2023', '30-09-2023',
       '01-10-2023', '02-10-2023', '03-10-2023', '04-10-2023',
       '05-10-2023', '06-10-2023', '07-10-2023', '08-10-2023',
       '09-10-2023', '10-10-2023', '11-10-2023', '12-10-2023',
       '2023-10-13', '2023-10-14', '2023-10-15', '2023-10-16',
       '2023-10-17', '2023-10-18', '2023-10-19', '2023-10-20',
       '2023-10-22', '2023-10-23', '2023-10-24', '2023-10-25',
       '2023-10-28', '2023-10-30', '2023-10-31', '2023-11-01',
       '2023-11-03', '2023-11-04', '2023-11-05', '2023-11-06',
       '2023-11-07', '2023-11-08', '2023-11-09', '2023-11-10',
       '2023-11-11', '2023-11-12', '2023-11-14', '2023-11-16',
       '2023-11-18', '2023-11-25', '2023-11-26', '2023-12-02',
       '2023-12-03', '2023-12-05', '2023-12-06', '2023-12-07',
       '2023-12-08', '2023-12-09', '2023-12-10', '2023-

In [6]:
from datetime import datetime

def FIX_DATE(date_str):
    """Normalize a date string to 'dd-mm-yyyy'; return 'Invalid' if neither format matches."""
    # Try 'yyyy-mm-dd' first, then 'dd-mm-yyyy'
    for fmt in ('%Y-%m-%d', '%d-%m-%Y'):
        try:
            return datetime.strptime(date_str, fmt).strftime('%d-%m-%Y')
        except (ValueError, TypeError):
            # ValueError: wrong format; TypeError: value is not a string (e.g. NaN)
            continue
    return 'Invalid'

In [7]:
DATA["Date"] = DATA["Date"].apply(FIX_DATE)
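
As an alternative to a row-by-row `apply`, the same normalization can be sketched with vectorized parsing. This is a sketch on a small stand-in Series (the `dates` values are made up to mirror the two formats seen in the `unique()` output), not the notebook's own method: each format is parsed separately with `errors="coerce"`, and the two results are then combined.

```python
import pandas as pd

# Small stand-in for the Date column, mixing both formats plus one bad value.
dates = pd.Series(["19-09-2023", "2023-10-13", "not a date"])

# Parse each format separately; values that do not match become NaT.
dmy = pd.to_datetime(dates, format="%d-%m-%Y", errors="coerce")
ymd = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Prefer the day-first parse and fill the gaps with the year-first parse.
parsed = dmy.fillna(ymd)
print(parsed)
```

Unparseable values end up as `NaT` instead of the string `'Invalid'`, so they can be counted with `parsed.isna().sum()` and the column is already of datetime type.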

In [8]:
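# Note: without parentheses this displays the bound method (and the Series), not the unique values.
# Call DATA["Date"].unique() to list them.
DATA["Date"].unique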

<bound method Series.unique of 0         19-09-2023
1         19-09-2023
2         19-09-2023
3         19-09-2023
4         19-09-2023
             ...    
307691    26-08-2024
307692    26-08-2024
307693    26-08-2024
307694    26-08-2024
307695    26-08-2024
Name: Date, Length: 307696, dtype: object>

In [9]:
DATA["Date"] = pd.to_datetime(DATA["Date"], format="%d-%m-%Y")

In [10]:
DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307696 entries, 0 to 307695
Data columns (total 4 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   Headline         307696 non-null  object        
 1   Content          307696 non-null  object        
 2   News Categories  307696 non-null  object        
 3   Date             307696 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(3)
memory usage: 9.4+ MB


In [11]:
DATA.duplicated().sum()

np.int64(52047)

In [12]:
DATA[DATA["Headline"].duplicated(keep=False)]

Unnamed: 0,Headline,Content,News Categories,Date
0,Congress leader Baljinder Singh shot dead at h...,Congress leader Baljinder Singh was shot dead ...,['national'],2023-09-19
1,17-year-old girl preparing for NEET dies by su...,Another NEET aspirant died by suicide in Rajas...,['national'],2023-09-19
3,"Only 10% women lawmakers in RS, while only 14%...","Congress President Mallikarjun Kharge, while s...",['national'],2023-09-19
4,"Ganesh temple decorated with notes, coins wort...",The Sri Sathya Ganapathi Temple in Bengaluru a...,['national'],2023-09-19
5,Bee attack during funeral in K'taka leaves 1 d...,One person died and at least 14 others were in...,['national'],2023-09-19
...,...,...,...,...
307691,"Tamil Nadu to open 10,000 'CM's pharmacy store...",Tamil Nadu CM MK Stalin has announced that 'Ch...,"['national', 'Health___Fitness']",2024-08-26
307692,NMC study finds mental health issues prevalent...,One in four MBBS students has a mental disorde...,"['education', 'Health___Fitness', 'national']",2024-08-26
307693,Telangana CM says World Bank will help retire ...,Telangana CM Revanth Reddy said the World Bank...,"['politics', 'Health___Fitness', 'national']",2024-08-26
307694,Dr Gagandeep Kang explores role of AI in vacci...,"Dr Gagandeep Kang, a microbiologist and virolo...","['Health___Fitness', 'national']",2024-08-26


In [13]:
DATA[DATA['Headline']=="Only 10% women lawmakers in RS, while only 14% in LS: INC's Kharge"]

Unnamed: 0,Headline,Content,News Categories,Date
3,"Only 10% women lawmakers in RS, while only 14%...","Congress President Mallikarjun Kharge, while s...",['national'],2023-09-19
2169,"Only 10% women lawmakers in RS, while only 14%...","Congress President Mallikarjun Kharge, while s...",['national'],2023-09-20


In [14]:
DATA.drop_duplicates(keep='first', inplace=True)
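
Note that a full-row `drop_duplicates` does not remove rows such as 3 and 2169 shown earlier, which share a headline but carry different dates. If such repeats should count as duplicates, one option is to deduplicate on a subset of columns. The sketch below uses a toy frame, and it assumes the repeated rows also share identical content, which the truncated display cannot confirm.

```python
import pandas as pd

# Toy frame mimicking the case above: same headline and content, different dates.
df = pd.DataFrame({
    "Headline": ["A", "A", "B"],
    "Content": ["text a", "text a", "text b"],
    "Date": pd.to_datetime(["2023-09-19", "2023-09-20", "2023-09-21"]),
})

# Rows identical in every column
full_row_dupes = int(df.duplicated().sum())
# Same story regardless of date
story_dupes = int(df.duplicated(subset=["Headline", "Content"]).sum())

# Keep the earliest copy of each story and rebuild a clean 0..n-1 index.
deduped = (
    df.sort_values("Date")
      .drop_duplicates(subset=["Headline", "Content"], keep="first")
      .reset_index(drop=True)
)
print(full_row_dupes, story_dupes, len(deduped))
```

Resetting the index also avoids the gaps that `drop_duplicates` leaves behind in the row labels.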

In [15]:
DATA['News Categories'].value_counts()  # 1,335 distinct category combinations

News Categories
['entertainment']                                                15232
['miscellaneous']                                                12500
['science']                                                      11359
['sports']                                                       10453
['business']                                                      9852
                                                                 ...  
['hatke', 'world', 'entertainment', 'national', 'technology']        1
['politics', 'business', 'FINANCE']                                  1
['automobile', 'business', 'FINANCE']                                1
['sports', 'experiment', 'PARIS_2024_PARALYMPICS']                   1
['world', 'Health___Fitness', 'Israel-Hamas_War']                    1
Name: count, Length: 1335, dtype: int64
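
The 1,335 values above are combinations, because each cell holds the text of a Python list rather than a list itself. To count individual tags, the strings can be parsed first. A minimal sketch on three sample values copied from the output above:

```python
import ast
from collections import Counter

# Sample values copied from the output above; each is a string, not a list.
raw = [
    "['national']",
    "['national', 'Health___Fitness']",
    "['politics', 'Health___Fitness', 'national']",
]

# ast.literal_eval safely turns the string "['a', 'b']" into the list ['a', 'b'].
parsed = [ast.literal_eval(value) for value in raw]

# Count each individual tag instead of each unique combination.
tag_counts = Counter(tag for tags in parsed for tag in tags)
print(tag_counts.most_common())
```

On the full DataFrame, the equivalent would be along the lines of `DATA["News Categories"].apply(ast.literal_eval).explode().value_counts()`.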

In [16]:
DATA

Unnamed: 0,Headline,Content,News Categories,Date
0,Congress leader Baljinder Singh shot dead at h...,Congress leader Baljinder Singh was shot dead ...,['national'],2023-09-19
1,17-year-old girl preparing for NEET dies by su...,Another NEET aspirant died by suicide in Rajas...,['national'],2023-09-19
2,Hampers to welcome MPs in new Parliament tomor...,In order to mark the first-ever working day of...,['national'],2023-09-19
3,"Only 10% women lawmakers in RS, while only 14%...","Congress President Mallikarjun Kharge, while s...",['national'],2023-09-19
4,"Ganesh temple decorated with notes, coins wort...",The Sri Sathya Ganapathi Temple in Bengaluru a...,['national'],2023-09-19
...,...,...,...,...
307690,Why is fruit-only diet not recommended?,"As per Cleveland Clinic, fruit-only diet can b...","['EXPLAINERS', 'Health___Fitness']",2024-08-26
307691,"Tamil Nadu to open 10,000 'CM's pharmacy store...",Tamil Nadu CM MK Stalin has announced that 'Ch...,"['national', 'Health___Fitness']",2024-08-26
307693,Telangana CM says World Bank will help retire ...,Telangana CM Revanth Reddy said the World Bank...,"['politics', 'Health___Fitness', 'national']",2024-08-26
307694,Dr Gagandeep Kang explores role of AI in vacci...,"Dr Gagandeep Kang, a microbiologist and virolo...","['Health___Fitness', 'national']",2024-08-26


### 🕵🏻 **Let's Perform Named Entity Recognition on Content**

In [None]:
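# Shell command: the leading "!" runs it from a notebook cell (or run it in a terminal without the "!")
!python -m spacy download en_core_web_trf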


In [19]:
import spacy
NER = spacy.load("en_core_web_trf")

OSError: [E050] Can't find model 'en_core_web_trf'. It doesn't seem to be a Python package or a valid path to a data directory.
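
Error E050 means the model package is not installed in the environment the kernel is running in, so the download has to be run in that same environment. Until it is, one way to keep the notebook usable is to fall back to a model that is installed. The helper below is a hypothetical sketch: `load_first_available` is not a spaCy function, and `fake_loader` stands in for `spacy.load` so the example runs without spaCy.

```python
def load_first_available(model_names, loader):
    """Try each model name in order and return (name, pipeline) for the first that loads.

    `loader` is any callable that raises OSError for a missing model,
    which is how spacy.load reports error E050.
    """
    missing = []
    for name in model_names:
        try:
            return name, loader(name)
        except OSError:
            missing.append(name)
    raise OSError(
        f"None of these models are installed: {missing}. "
        "Install one with: python -m spacy download <model_name>"
    )

# Stand-in loader so the sketch runs without spaCy; pass spacy.load in the notebook.
def fake_loader(name):
    if name != "en_core_web_sm":
        raise OSError(f"[E050] Can't find model '{name}'.")
    return f"<pipeline {name}>"

chosen, pipeline = load_first_available(["en_core_web_trf", "en_core_web_sm"], fake_loader)
print(chosen)
```

In the notebook this would become `chosen, NER = load_first_available(["en_core_web_trf", "en_core_web_sm"], spacy.load)`, assuming at least one of the two models is installed.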