In [7]:
# load libraries for NER 

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import torch


## NER engine

In [8]:
# init NER engine

model_id = 'dslim/bert-base-NER'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# nlp pipeline

nlp = pipeline('ner',
               model=model,
               tokenizer=tokenizer,
               aggregation_strategy='max',
               device='cpu')
# nlp("Bill Gates is the founder of Microsoft")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


The `aggregation_strategy` parameter in the NER pipeline controls how the model's per-token predictions are merged into whole-word entities. This matters because transformer tokenizers split words into subwords, while entities are defined over whole words.

## Why Aggregation Strategy Is Required

When using transformers for NER (Named Entity Recognition), there's a fundamental mismatch between:
1. How tokenizers split words (into subwords/wordpieces)
2. How NER labels need to be applied (typically at the whole word/entity level)

### The Problem

Let's look at what happens when you process "Bill Gates":

1. The tokenizer might split this into: ["Bill", "Gate", "##s"]
2. Each token gets its own prediction from the model
3. But you want one cohesive entity: "Bill Gates" as a person (`PER`)

Without an aggregation strategy, you'd get a separate prediction for each subtoken, using the model's `B-`/`I-` tags (begin/inside):
- "Bill" → B-PER
- "Gate" → I-PER
- "##s" → I-PER

This creates several issues:
- Entity boundaries become unclear
- Confidence scores may vary across subtokens
- Post-processing becomes necessary to rebuild complete entities
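
The post-processing mentioned in the last point can be sketched in plain Python. This is only an illustration, assuming WordPiece-style tokens where a `##` prefix marks a continuation piece; `merge_wordpieces` is a hypothetical helper, not part of the `transformers` API.

```python
def merge_wordpieces(tokens):
    """Rebuild whole words from WordPiece tokens ('##' marks a continuation)."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += token[2:]
        else:
            # A token without the prefix starts a new word.
            words.append(token)
    return words


print(merge_wordpieces(["Bill", "Gate", "##s"]))  # ['Bill', 'Gates']
```

Rebuilding the text is the easy part; deciding which label and score the merged word should carry is what the aggregation strategy settles.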

### What Aggregation Strategies Do

The `aggregation_strategy` parameter tells the pipeline how to combine these subtoken predictions into meaningful entities:

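- `'none'`: No aggregation; returns the raw prediction for every subtoken
- `'simple'`: Groups adjacent tokens that share an entity type; subtokens of one word can still end up with different labels
- `'first'`: Like `'simple'`, but each word takes the label of its first subtoken
- `'average'`: Like `'simple'`, but scores are averaged across a word's subtokens and the top-scoring label is applied
- `'max'`: Like `'simple'`, but each word takes the label of its highest-scoring subtoken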
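
To make the difference between the word-level strategies concrete, here is a minimal pure-Python sketch. It is not the pipeline's actual code: the scores are made up for illustration, and `best_label` and `resolve_word` are hypothetical helpers.

```python
# Hypothetical per-subtoken label scores for the single word "Gates",
# tokenized as ["Gate", "##s"]. The numbers are invented for illustration.
subtokens = [
    {"token": "Gate", "scores": {"PER": 0.50, "ORG": 0.10, "O": 0.40}},
    {"token": "##s", "scores": {"PER": 0.35, "ORG": 0.60, "O": 0.05}},
]


def best_label(scores):
    """Return the (label, score) pair with the highest score."""
    label = max(scores, key=scores.get)
    return label, scores[label]


def resolve_word(subtokens, strategy):
    """Pick one label for a whole word from its subtoken scores."""
    if strategy == "first":
        # Trust the first subtoken only.
        return best_label(subtokens[0]["scores"])
    if strategy == "max":
        # Trust whichever subtoken made the single most confident prediction.
        return max((best_label(t["scores"]) for t in subtokens),
                   key=lambda pair: pair[1])
    if strategy == "average":
        # Average each label's score over the subtokens, then take the top label.
        labels = subtokens[0]["scores"].keys()
        averaged = {
            label: sum(t["scores"][label] for t in subtokens) / len(subtokens)
            for label in labels
        }
        return best_label(averaged)
    raise ValueError(f"unknown strategy: {strategy}")


for strategy in ("first", "max", "average"):
    label, _ = resolve_word(subtokens, strategy)
    print(strategy, "->", label)
```

With these invented scores, `'first'` and `'average'` settle on `PER` while `'max'` picks `ORG`, because the single most confident subtoken prediction happens to be the `ORG` score on `##s`.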

### In Our Example

```python
nlp = pipeline('ner',
              model=model,
              tokenizer=tokenizer,
              aggregation_strategy='max', 
              device='cpu')
```

Setting `aggregation_strategy='max'` means:
1. The pipeline identifies all subtokens that belong to the same word
2. Each word takes the label of its highest-confidence subtoken, so a word can never carry mixed labels
3. Adjacent words with the same entity type are then grouped, so the result is a single prediction for "Bill Gates" instead of separate predictions for each subtoken
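
The final grouping step can be sketched as follows. This is a simplified illustration, assuming word-level labels in the `B-`/`I-` scheme with `O` for non-entities; the labels below are written by hand rather than produced by the model, and `group_entities` is a hypothetical helper.

```python
def group_entities(words, labels):
    """Merge consecutive words of one entity into (text, type) pairs."""
    entities = []
    current_type = None
    for word, label in zip(words, labels):
        if label == "O":
            # Not part of any entity: close the current group, if any.
            current_type = None
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            text, _ = entities[-1]
            entities[-1] = (text + " " + word, entity_type)
        else:
            # "B-" tag (or a stray "I-" tag) starts a new entity.
            entities.append((word, entity_type))
            current_type = entity_type
    return entities


words = ["Bill", "Gates", "is", "the", "founder", "of", "Microsoft"]
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG"]
print(group_entities(words, labels))
# [('Bill Gates', 'PER'), ('Microsoft', 'ORG')]
```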


## Retriever | embed_model

In [None]:
from sentence_transformers import SentenceTransformer


# https://huggingface.co/flax-sentence-embeddings/all_datasets_v3_mpnet-base
embed_model = SentenceTransformer(
    "flax-sentence-embeddings/all_datasets_v3_mpnet-base")

The MPNet-base model is larger (about 110M parameters) and typically more accurate but slower than `all-MiniLM-L6-v2` (around 23M parameters), the lighter alternative loaded in the next cell.

The MPNet variant (fine-tuned on the "all_datasets_v3" collection) generally achieves higher scores on semantic similarity benchmarks, while MiniLM-L6-v2 is optimized for efficiency with a good balance of speed and accuracy.

> The choice between them depends on your specific needs: MPNet for maximum accuracy when computational resources aren't constrained, or MiniLM for faster processing with still very good performance.

In [None]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

embed_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embed_model.encode(sentences)
print(embeddings)
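
For retrieval, embeddings like these are usually compared by cosine similarity. The sketch below shows the idea in plain Python, assuming tiny made-up 3-dimensional vectors in place of real model output (`all-MiniLM-L6-v2` produces 384-dimensional vectors); `cosine_similarity`, `query_vec`, and `doc_vecs` are illustrative names, not part of `sentence_transformers`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for a query embedding and document embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Rank documents from most to least similar to the query.
ranked = sorted(doc_vecs,
                key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
                reverse=True)
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```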


In [None]:

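def extract_entities(list_of_text):
    """Return one list of entity strings per input document."""
    entities = []
    for doc in list_of_text:
        # nlp(doc) yields one dict per aggregated entity; keep only its text
        entities.append([item['word'] for item in nlp(doc)])
    return entities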

In [16]:
import pandas as pd

def prepare_medium_articles_data(path):
    # Read a small sample while prototyping
    df = pd.read_csv(path, nrows=20)
    df = df.dropna()
    # Keep rows whose subtitle was not truncated; copy so the column
    # assignments below do not trigger SettingWithCopyWarning
    df = df[~df["subtitle_truncated_flag"]].copy()
    # NOTE: title and subtitle are joined without a separator
    df['title_extended'] = df['title'] + df['subtitle']

    df['metadata'] = df.apply(lambda x: {
        'title': x['title'],
        'subtitle': x['subtitle'],
        'category': x['category'],
        # extract_entities returns one list per document; unwrap the only one
        'entities': extract_entities([x['title_extended']])[0]
        }, axis=1)

    # ids = [f"article{i}" for i in range(df.shape[0])]
    # documents = df["title_extended"].to_list()
    # metadatas = df.drop("title_extended", axis=1).to_dict(orient="records")

    return df

    # return {"ids": ids, "documents": documents, "metadatas": metadatas}


df_out = prepare_medium_articles_data(r"C:\development\01 GenAI Dev\dev-genAI\Project01-SemanticSearch\data\medium_post_titles.csv")

df_out



| Unnamed: 0 | category | title | subtitle | subtitle_truncated_flag | title_extended | metadata |
|---|---|---|---|---|---|---|
| 0 | work | "21 Conversations" - A fun (and easy) game for... | A (new?) Icebreaker game to get your team to s... | False | "21 Conversations" - A fun (and easy) game for... | {'title': '"21 Conversations" - A fun (and eas... |
| 1 | spirituality | "Biblical Porn" at Mars Hill | Author and UW lecturer Jessica Johnson talks a... | False | "Biblical Porn" at Mars HillAuthor and UW lect... | {'title': '"Biblical Porn" at Mars Hill', 'sub... |
| 2 | lgbtqia | "CISGENDER?! Is That A Disease?!" | Or, a primer in gender vocabulary for the curi... | False | "CISGENDER?! Is That A Disease?!"Or, a primer ... | {'title': '"CISGENDER?! Is That A Disease?!"',... |
| 4 | artificial-intelligence | "Can I Train my Model on Your Computer?" | How we waste computational resources and how t... | False | "Can I Train my Model on Your Computer?"How we... | {'title': '"Can I Train my Model on Your Compu... |
| 5 | cryptocurrency | "Cypherpunks and Wall Street": The Security To... | Bruce Fenton presents at the World Blockchain ... | False | "Cypherpunks and Wall Street": The Security To... | {'title': '"Cypherpunks and Wall Street": The ... |
| 6 | politics | "Diss" vs. "Piss": The Blue Wave and Yellow Tr... | Michael Gofman & Matthew Wigler explore how bu... | False | "Diss" vs. "Piss": The Blue Wave and Yellow Tr... | {'title': '"Diss" vs. "Piss": The Blue Wave an... |
| 7 | health | "Doctor, he's gone into shock!" | You've seen it in movies and on television. B... | False | "Doctor, he's gone into shock!"You've seen it ... | {'title': '"Doctor, he's gone into shock!"', '... |
| 8 | culture | "Happily Ever After: Fairy Tales for Every Chi... | Television shows have an invaluable opportunit... | False | "Happily Ever After: Fairy Tales for Every Chi... | {'title': '"Happily Ever After: Fairy Tales fo... |
| 9 | poetry | "I Love You" The Dangerous Toxic Truth | The Big, Smelly Heap of Lies Pretending To Be ... | False | "I Love You" The Dangerous Toxic TruthThe Big,... | {'title': '"I Love You" The Dangerous Toxic Tr... |
| 10 | poetry | "I would the gods had made thee poetical": Sha... | Not all bards are created equal. | False | "I would the gods had made thee poetical": Sha... | {'title': '"I would the gods had made thee poe... |


In [5]:
df = pd.read_csv(r"C:\development\01 GenAI Dev\dev-genAI\Project01-SemanticSearch\data\medium_post_titles.csv",nrows=10000)

In [None]:
query = ["What are the most useful Medium articles related to Rust programming"]

where = {'entities': {'$in':''}}

In [17]:
nlp("Elon Musk is a good guy. He has Tesla and SpaceX")

[{'entity_group': 'PER',
  'score': 0.99256885,
  'word': 'Elon Musk',
  'start': 0,
  'end': 9},
 {'entity_group': 'ORG',
  'score': 0.9068927,
  'word': 'Tesla',
  'start': 32,
  'end': 37},
 {'entity_group': 'ORG',
  'score': 0.9687877,
  'word': 'SpaceX',
  'start': 42,
  'end': 48}]

In [15]:
extract_entities(["Elon Musk is a good guy. He has Tesla and SpaceX"])

[['Elon Musk', 'Tesla', 'SpaceX']]
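
One way the extracted entities could be used for filtering is sketched below, assuming the goal is to keep only documents that share at least one entity with the query. The records are invented for illustration, and `filter_by_entities` is a hypothetical helper rather than a vector-store API.

```python
def filter_by_entities(records, query_entities):
    """Keep records whose entity list overlaps the query's entities."""
    wanted = {entity.lower() for entity in query_entities}
    return [
        record for record in records
        # Set intersection is non-empty when at least one entity matches.
        if wanted & {entity.lower() for entity in record["entities"]}
    ]


# Invented records standing in for article metadata.
records = [
    {"title": "Inside Tesla's factory", "entities": ["Tesla"]},
    {"title": "A primer on gender vocabulary", "entities": []},
    {"title": "SpaceX and the new space race", "entities": ["SpaceX"]},
]

# Entities as returned by extract_entities for the example sentence above.
query_entities = ["Elon Musk", "Tesla", "SpaceX"]

matches = filter_by_entities(records, query_entities)
print([record["title"] for record in matches])
# ["Inside Tesla's factory", 'SpaceX and the new space race']
```

Lowercasing both sides makes the match case-insensitive, which is a design choice of this sketch and not something the notebook specifies.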