### Performing NER on a collection of news articles

#### Dataset

* Source: https://www.kaggle.com/datasets/tanishqdublish/text-classification-documentation
* License: [Licence Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

This is text document classification dataset which contains 2225 text data and five categories of documents. Five categories are politics, sport, tech, entertainment and business.

Download a copy of the dataset to work with this notebook. 

In [None]:
from haystack.components.extractors import NamedEntityExtractor
import pandas as pd
from haystack.dataclasses import Document

In [None]:
def extract_named_entities_with_ids(documents):
    """This function extracts named entities from a list of
    documents and returns the result in a structured format.

    Args:
        documents (list): List of Haystack Document objects

    Returns:
        extracted_data (list): A list of dictionaries containing the extracted entities
    """
    extracted_data = []

    for document in documents:
        content = document.content
        doc_id = document.id
        named_entities = document.meta.get('named_entities', [])
        
        # Sets to store unique entities by type
        entities_by_type = {
            "LOC": set(),
            "PER": set(),
            "ORG": set()
        }
        
        # Loop through the entities and filter by score and type
        for entity in named_entities:
            if float(entity.score) < 0.8 or entity.entity == "MISC":
                continue
            
            word = content[entity.start:entity.end]
            if entity.entity in entities_by_type:
                entities_by_type[entity.entity].add(word)  # Use set to ensure uniqueness
        
        # Prepare the meta field with comma-separated values
        meta = {
            "LOC": ",".join(entities_by_type["LOC"]),
            "PER": ",".join(entities_by_type["PER"]),
            "ORG": ",".join(entities_by_type["ORG"])
        }
        
        # Append the result for this document
        extracted_data.append({
            'document_id': doc_id,
            'content': content,
            'meta': meta
        })
    

    return extracted_data

### Initialize the Named Entity Extractor


In [16]:
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
df = pd.read_csv("df_file.csv")
sample_docs = df['Text'].to_list()
documents = [Document(id=str(i), content=sample_docs[i]) for i in range(len(sample_docs))]

# Apply extractor to the documents
extractor.run(documents)


In [None]:
# Extract named entities from the documents
extracted_documents = extract_named_entities_with_ids(documents)
df = pd.DataFrame(extracted_documents)
df.to_csv("ner_output.csv", index=False)

In [57]:
df

Unnamed: 0,document_id,content,meta
0,0,Budget to set scene for election\n \n Gordon B...,"{'LOC': 'UK,England,Wales', 'PER': 'George Osb..."
1,1,Army chiefs in regiments decision\n \n Militar...,"{'LOC': 'Scotland,Iraq', 'PER': 'Eric,Joyce,Ge..."
2,2,Howard denies split over ID cards\n \n Michael...,"{'LOC': '', 'PER': 'Davis,Ye,Michael Howard,Ti..."
3,3,Observers to monitor UK election\n \n Minister...,"{'LOC': 'Britain,UK,Northern Ireland', 'PER': ..."
4,4,Kilroy names election seat target\n \n Ex-chat...,"{'LOC': 'UK,Derbyshire,London,Erewash,Nottingh..."
...,...,...,...
2220,2220,India opens skies to competition\n \n India wi...,"{'LOC': 'Saudi Arabia,India,Gulf,US,Kuwait', '..."
2221,2221,Yukos bankruptcy 'not US matter'\n \n Russian ...,"{'LOC': 'Russia,sk,Gibraltar,US,Houston,Europe..."
2222,2222,Survey confirms property slowdown\n \n Governm...,"{'LOC': 'Wales,UK,Greater London,England', 'PE..."
2223,2223,High fuel prices hit BA's profits\n \n British...,"{'LOC': '', 'PER': 'Martin Broughton,Mike Powe..."
