## Data Anonymizer using Named Entity Recognition (NER)

Objective:
Automatically detect and anonymize sensitive information, such as organization names and locations, in unstructured text using NLP.

In [1]:
from transformers import pipeline



## Loading NER Pipeline
This project uses the default model for the "ner" task, dbmdz/bert-large-cased-finetuned-conll03-english, a pre-trained BERT model fine-tuned on the CoNLL-2003 dataset.

In [2]:
ner_pipeline = pipeline(
    task="ner",
    aggregation_strategy="simple"
)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496.
Using a pipeline without specifying a model name and revision in production is not recommended.
Loading weights: 100%|███████████████████████| 391/391 [00:01<00:00, 204.98it/s, Materializing param=classifier.weight]
BertForTokenClassification LOAD REPORT from: dbmdz/bert-large-cased-finetuned-conll03-english
Key                      | Status     |  | 
-------------------------+------------+--+-
bert.pooler.dense.weight | UNEXPECTED |  | 
bert.pooler.dense.bias   | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
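
The warning above notes that relying on the default model is not recommended in production. A minimal sketch of pinning the checkpoint explicitly, using the model name and revision reported in the log. The helper name `load_ner_pipeline` is hypothetical, and passing the short revision hash is assumed to resolve to the same commit the default uses:

```python
# Model name and revision copied from the log output above.
MODEL_NAME = "dbmdz/bert-large-cased-finetuned-conll03-english"
MODEL_REVISION = "4c53496"


def load_ner_pipeline():
    """Hypothetical helper: build the NER pipeline with a pinned checkpoint."""
    # Imported lazily so the constants can be reused without loading transformers.
    from transformers import pipeline

    return pipeline(
        task="ner",
        model=MODEL_NAME,
        revision=MODEL_REVISION,
        aggregation_strategy="simple",
    )
```

Calling `load_ner_pipeline()` would then stand in for the `pipeline(...)` call in cell 2.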


## Sample Input Text

In [3]:
text = "The contract was awarded to AlphaTech Solutions in Singapore."

## Named Entity Recognition Output

In [4]:
entities = ner_pipeline(text)
entities

[{'entity_group': 'ORG',
  'score': np.float32(0.9988667),
  'word': 'AlphaTech Solutions',
  'start': 28,
  'end': 47},
 {'entity_group': 'LOC',
  'score': np.float32(0.9997998),
  'word': 'Singapore',
  'start': 51,
  'end': 60}]
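
The `start` and `end` values are character offsets into the input string, with `end` exclusive. A quick check against the output above, with the entities hard-coded (scores omitted) so the snippet runs without loading the model:

```python
text = "The contract was awarded to AlphaTech Solutions in Singapore."

# Entities copied from the pipeline output above, scores omitted.
entities = [
    {"entity_group": "ORG", "word": "AlphaTech Solutions", "start": 28, "end": 47},
    {"entity_group": "LOC", "word": "Singapore", "start": 51, "end": 60},
]

# Slicing the text with each (start, end) pair recovers the entity string.
spans = [text[ent["start"]:ent["end"]] for ent in entities]
print(spans)  # ['AlphaTech Solutions', 'Singapore']
```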

## Anonymization Process
Each detected entity is replaced in the text with its label in square brackets, for example [ORG] or [LOC].

In [5]:
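anonymized_text = text

# Replace each entity's surface string with its bracketed label, e.g. [ORG].
for ent in entities:
    label = ent["entity_group"]
    word = ent["word"]
    # Note: str.replace substitutes every occurrence of `word` in the text.
    anonymized_text = anonymized_text.replace(word, f"[{label}]")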

print("Original Text:")
print(text)
print("\nAnonymized Text:")
print(anonymized_text)

Original Text:
The contract was awarded to AlphaTech Solutions in Singapore.

Anonymized Text:
The contract was awarded to [ORG] in [LOC].
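
`str.replace` substitutes every occurrence of the entity string, including matches inside longer words, and it relies on `word` matching the original text exactly. A sketch of an alternative that uses the `start`/`end` character offsets instead. The function name `anonymize` is hypothetical, and entity spans are assumed not to overlap:

```python
def anonymize(text, entities):
    """Replace each entity span with its bracketed label, using character offsets."""
    result = text
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = f"[{ent['entity_group']}]"
        result = result[:ent["start"]] + label + result[ent["end"]:]
    return result


sample = "The contract was awarded to AlphaTech Solutions in Singapore."
# Offsets copied from the pipeline output shown earlier.
sample_entities = [
    {"entity_group": "ORG", "start": 28, "end": 47},
    {"entity_group": "LOC", "start": 51, "end": 60},
]
print(anonymize(sample, sample_entities))
# The contract was awarded to [ORG] in [LOC].
```

Because only the detected span is rewritten, a string such as "Ann met Annabel" with a single PER entity at offsets 0 to 3 becomes "[PER] met Annabel", whereas `str.replace` would also alter "Annabel".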


## Observations

- In this example, the model identifies both the organization and the location, each with a confidence score above 0.99.
- The model is fine-tuned on the CoNLL-2003 dataset, which annotates four entity types: PER, ORG, LOC, and MISC.
- CoNLL-2003 has no date or time label, so temporal expressions such as dates are not consistently detected.
- Setting aggregation_strategy="simple" groups adjacent tokens that share an entity label, including subword pieces, into a single span, so "AlphaTech Solutions" is returned as one ORG entity.
- The solution runs entirely on CPU and uses the pre-trained model without any additional training.
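
Since the model has no date label, one possible complement is a rule-based pass alongside NER. A minimal sketch, assuming dates appear in ISO `YYYY-MM-DD` form. The pattern, the `[DATE]` label, and the function name `mask_iso_dates` are illustrative assumptions and not part of the original project:

```python
import re

# Assumption: only ISO-style dates (e.g. 2024-03-15) need masking.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def mask_iso_dates(text):
    """Replace every ISO-formatted date in the text with the [DATE] placeholder."""
    return ISO_DATE.sub("[DATE]", text)


print(mask_iso_dates("The contract was signed on 2024-03-15."))
# The contract was signed on [DATE].
```

Other date formats, such as "15 March 2024", would need additional patterns or a model trained with a date label.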