

# **Named Entity Recognition (NER) on Tech_Intel Dataset**  

## **🔹 Overview**  
This project runs **Named Entity Recognition (NER)** on the Tech_Intel dataset: **20 text files** organized into topic-based folders, one folder per category. The extracted entities are written to a CSV file (`ner_results.csv`) for further analysis.  



## **🔹 Steps in the Code**
1. **Load every `.txt` file** from the category folders under `tech_intel/`.
2. **Perform Named Entity Recognition (NER)** using **spaCy** and its `en_core_web_sm` model.
3. **Extract named entities** such as **Organizations (ORG)**, **Dates (DATE)**, **People (PERSON)**, and any other entity types the model predicts.
4. **Save results to a CSV file** (`ner_results.csv`) for further analysis.



## **🔹 Installation**  
Install the required dependencies using the following commands:  
```bash
pip install spacy pandas
python -m spacy download en_core_web_sm
```
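
To confirm the installation worked before running the notebook, a small check along the following lines may help. It is only a sketch: the helper name `missing_dependencies` is hypothetical, and it relies on the fact that spaCy models such as `en_core_web_sm` are installed as ordinary Python packages.

```python
from importlib.util import find_spec


def missing_dependencies(names):
    """Return the subset of `names` that cannot be found as importable modules."""
    return [name for name in names if find_spec(name) is None]


# The spaCy model is a regular package, so it can be looked up
# the same way as spacy and pandas.
required = ["spacy", "pandas", "en_core_web_sm"]
missing = missing_dependencies(required)

if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All dependencies found.")
```

If anything is reported as missing, rerunning the two commands above should resolve it.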


## **🔹 Expected Output (CSV Example)**  
| **File** | **Category**  | **Entity** | **Entity Type** |
|----------|---------------|------------|-----------------|
| data.txt | AI & ML       | OpenAI     | ORG             |
| data.txt | AI & ML       | 2024       | DATE            |
| data.txt | AI & ML       | Google     | ORG             |
| news.txt | Cybersecurity | Microsoft  | ORG             |
| news.txt | Cybersecurity | 2003       | DATE            |
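
To show roughly how the table above maps onto raw CSV text, the sketch below writes a few of the same rows with Python's standard `csv` module. This is an assumption made to keep the example dependency-free; the notebook itself writes the file with pandas. The rows are the illustrative ones from the table, not real results.

```python
import csv
import io

# Illustrative rows copied from the table: [File, Category, Entity, Entity Type]
rows = [
    ["data.txt", "AI & ML", "OpenAI", "ORG"],
    ["data.txt", "AI & ML", "2024", "DATE"],
    ["news.txt", "Cybersecurity", "Microsoft", "ORG"],
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["File", "Category", "Entity", "Entity Type"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Each entity becomes one line, with the header row naming the four columns.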



In [6]:
import spacy
import os
import pandas as pd

In [7]:
nlp = spacy.load("en_core_web_sm")


In [8]:
dataset_path = "tech_intel"

In [9]:
data = []

In [10]:
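# Each subfolder of dataset_path is treated as one category.
# Sorting keeps the row order in the output reproducible across runs.
for category in sorted(os.listdir(dataset_path)):
    category_path = os.path.join(dataset_path, category)

    # Skip stray files at the top level; only folders are categories
    if not os.path.isdir(category_path):
        continue

    for filename in sorted(os.listdir(category_path)):
        if not filename.endswith(".txt"):
            continue
        file_path = os.path.join(category_path, filename)

        # Read the text content
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()

        # Run the spaCy pipeline on the full document
        doc = nlp(text)

        # Record one row per named entity: file, category, entity text, label
        for ent in doc.ents:
            data.append([filename, category, ent.text, ent.label_])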
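
The folder traversal in the loop above could also be written with `pathlib`. The following is a minimal sketch, not the notebook's own code: `iter_text_files` is a hypothetical helper, and the temporary folder only imitates the `tech_intel` layout so the example runs without the real dataset.

```python
import tempfile
from pathlib import Path


def iter_text_files(root):
    """Yield (category, path) for every .txt file one level below `root`."""
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue  # ignore stray files next to the category folders
        for path in sorted(category_dir.glob("*.txt")):
            yield category_dir.name, path


# Build a throwaway folder tree that imitates the dataset layout
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "AI & ML").mkdir()
    (root / "Cybersecurity").mkdir()
    (root / "AI & ML" / "data.txt").write_text("sample text", encoding="utf-8")
    (root / "AI & ML" / "notes.md").write_text("not a .txt file", encoding="utf-8")
    (root / "Cybersecurity" / "news.txt").write_text("sample text", encoding="utf-8")
    (root / "README.md").write_text("top-level file", encoding="utf-8")

    found = [(category, path.name) for category, path in iter_text_files(root)]

print(found)
```

Separating file discovery from NER processing this way makes it easy to check which files will be read before running the slower spaCy step.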

In [11]:
df = pd.DataFrame(data, columns=["File", "Category", "Entity", "Entity Type"])
df.to_csv("ner_results.csv", index=False)
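
As one possible starting point for the further analysis, the sketch below counts entities per category and entity type using only the standard library. The `sample_csv` string stands in for the contents of `ner_results.csv` and contains illustrative rows only; `count_entity_types` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Stand-in for the contents of ner_results.csv; these rows are illustrative only.
sample_csv = """File,Category,Entity,Entity Type
data.txt,AI & ML,OpenAI,ORG
data.txt,AI & ML,2024,DATE
data.txt,AI & ML,Google,ORG
news.txt,Cybersecurity,Microsoft,ORG
news.txt,Cybersecurity,2003,DATE
"""


def count_entity_types(csv_file):
    """Count rows per (Category, Entity Type) pair in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter((row["Category"], row["Entity Type"]) for row in reader)


counts = count_entity_types(io.StringIO(sample_csv))
for (category, label), n in sorted(counts.items()):
    print(f"{category:15} {label:6} {n}")
```

For the real file, the same function can be called on `open("ner_results.csv", newline="", encoding="utf-8")` in place of the in-memory sample.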