In [1]:
# load libraries for NER

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import torch


In [2]:

model_id = 'dslim/bert-base-NER'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)



Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
nlp = pipeline('ner',
               model=model,
               tokenizer=tokenizer,
               aggregation_strategy='max',
               device='cpu')




Device set to use cpu


In [8]:
nlp("Arbaaz is a very good guy! He is working for Sandeep")

[{'entity_group': 'PER',
  'score': np.float32(0.9988845),
  'word': 'Arbaaz',
  'start': 0,
  'end': 6},
 {'entity_group': 'PER',
  'score': np.float32(0.9988896),
  'word': 'Sandeep',
  'start': 45,
  'end': 52}]

The `aggregation_strategy` parameter in the NER pipeline addresses a critical issue that arises from how transformer tokenizers work with named entity recognition tasks.

## Why Aggregation Strategy Is Required

When using transformers for NER (Named Entity Recognition), there's a fundamental mismatch between:
1. How tokenizers split words (into subwords/wordpieces)
2. How NER labels need to be applied (typically at the whole word/entity level)

### The Problem

Let's look at what happens when you process "Bill Gates":

1. The tokenizer might split this into: ["Bill", "Gate", "##s"]
2. Each token gets its own prediction from the model
3. But you want one cohesive entity: "Bill Gates" as a person (`PER` in this model's label set)

Without an aggregation strategy, you'd get separate predictions for each subtoken:
- "Bill" → PER
- "Gate" → PER
- "##s" → PER

This creates several issues:
- Entity boundaries become unclear
- Confidence scores may vary across subtokens
- Post-processing becomes necessary to rebuild complete entities

### What Aggregation Strategies Do

The `aggregation_strategy` parameter tells the pipeline how to combine these subtoken predictions into meaningful entities:

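- `'none'`: No aggregation; returns the raw prediction for every subtoken
- `'simple'`: Groups adjacent tokens that share the same entity label; subtokens of one word can still end up with different labels
- `'first'`: Works at the word level; each word takes the label of its first subtoken, then adjacent words are grouped as in `'simple'`
- `'average'`: Works at the word level; scores are averaged across a word's subtokens and the label with the highest average is used
- `'max'`: Works at the word level; each word takes the label of its highest-scoring subtoken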
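
To make the difference between the word-level strategies concrete, here is a minimal pure-Python sketch. It is not the library's implementation; `aggregate_word` and the score values are hypothetical and only mirror the descriptions in the list above.

```python
from collections import defaultdict

# Hypothetical per-subtoken label scores for one word, "Gates",
# split into "Gate" and "##s". The numbers are made up for illustration.
subtoken_scores = [
    {"PER": 0.60, "ORG": 0.40},  # "Gate"
    {"PER": 0.30, "ORG": 0.70},  # "##s"
]

def aggregate_word(scores, strategy):
    """Return (label, score) for one word from its subtoken score dicts."""
    if strategy == "first":
        # Use only the first subtoken's scores.
        first = scores[0]
        label = max(first, key=first.get)
        return label, first[label]
    if strategy == "max":
        # Pick the subtoken whose top score is highest, then use its label.
        best = max(scores, key=lambda s: max(s.values()))
        label = max(best, key=best.get)
        return label, best[label]
    if strategy == "average":
        # Average each label's score over all subtokens, then take the best.
        totals = defaultdict(float)
        for s in scores:
            for label, value in s.items():
                totals[label] += value
        means = {label: total / len(scores) for label, total in totals.items()}
        label = max(means, key=means.get)
        return label, means[label]
    raise ValueError(f"unknown strategy: {strategy}")

for strategy in ("first", "max", "average"):
    print(strategy, aggregate_word(subtoken_scores, strategy))
```

With these made-up scores, `'first'` labels the word `PER` while `'max'` and `'average'` label it `ORG`, which shows why the choice of strategy can change the output when subtokens disagree.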

### In Our Example

```python
nlp = pipeline('ner',
              model=model,
              tokenizer=tokenizer,
              aggregation_strategy='max',
              device='cpu')
```

Setting `aggregation_strategy='max'` means:
1. The pipeline identifies all subtokens that belong to the same word
2. Each word takes the label of its highest-scoring subtoken
3. Adjacent words with the same entity type are then grouped, so the result is a single prediction for "Bill Gates" instead of separate predictions for each subtoken
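
The final grouping step can be sketched in plain Python. This is a simplified illustration that assumes word-level `B-`/`I-` tags as input; `group_entities` is a hypothetical helper, not the pipeline's actual code, and it ignores scores and character offsets.

```python
def group_entities(tagged_words):
    """Merge consecutive (word, tag) pairs with B-/I- tags into entities."""
    entities = []
    current = None  # the entity currently being extended, if any
    for word, tag in tagged_words:
        if tag == "O":
            # A non-entity word ends any entity in progress.
            current = None
            continue
        prefix, label = tag.split("-", 1)
        if current is not None and prefix == "I" and current["entity_group"] == label:
            # Continuation of the same entity: append the word.
            current["word"] += " " + word
        else:
            # A B- tag, a label change, or a stray I- tag starts a new entity.
            current = {"entity_group": label, "word": word}
            entities.append(current)
    return entities

# Hypothetical word-level predictions for a short sentence.
words = [("Bill", "B-PER"), ("Gates", "I-PER"),
         ("founded", "O"), ("Microsoft", "B-ORG")]
print(group_entities(words))
```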


```python
# https://huggingface.co/flax-sentence-embeddings/all_datasets_v3_mpnet-base
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer(
    "flax-sentence-embeddings/all_datasets_v3_mpnet-base")
```

In [9]:
from sentence_transformers import SentenceTransformer
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

embed_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embed_model.encode(sentences)
print(embeddings)


[[ 0.01919576  0.1200854   0.15959835 ... -0.00536288 -0.08109502
   0.05021336]
 [-0.0186904   0.04151872  0.07431543 ...  0.00486598 -0.06190441
   0.03187512]
 [ 0.136502    0.08227322 -0.02526164 ...  0.08762047  0.03045843
  -0.01075751]]


In [11]:
embeddings[0].size

384
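
Each sentence is now a 384-dimensional vector. A common way to compare such vectors is cosine similarity; the sketch below is a minimal pure-Python version, assuming plain sequences of floats. `cosine_similarity` and the toy vectors are hypothetical and stand in for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-dimensional vectors, not real model output.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

The same function could be applied to two rows of `embeddings`, for example `cosine_similarity(embeddings[0], embeddings[1])`.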

In [33]:
nlp = pipeline('ner',
               model=model,
               tokenizer=tokenizer,
               aggregation_strategy='max',
               device='cpu')



def extract_entities(list_of_text):
    entities = []
    for doc in list_of_text:
        entities.append([item['word'] for item in nlp(doc)])
    return entities


Device set to use cpu


In [17]:
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
ner_meta = extract_entities(sentences)
print(ner_meta)

[[], [], []]


In [19]:
for doc in sentences:
    print(doc)
    print(nlp(doc))

The weather is lovely today.
[]
It's so sunny outside!
[]
He drove to the stadium.
[]


In [30]:
[item['word'] for item in nlp('My name is Kirk. I am going to Nehru Stadium with my friends')]

['Kirk', 'Nehru Stadium']

In [36]:
nlp = pipeline('ner',
               model=model,
               tokenizer=tokenizer,
               aggregation_strategy='max',
               device='cpu')



def extract_entities(list_of_text):
    entities = []
    for doc in list_of_text:
        entities.append([item['word'] for item in nlp(doc)])
    return entities


extract_entities(['My name is Kirk. I am going to Nehru Stadium with my friends'])

Device set to use cpu


[['Kirk', 'Nehru Stadium']]
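
The helper above keeps every mention, including repeats and low-confidence ones. The following is a minimal variant, assuming you want unique entity strings above a score cutoff. `extract_unique_entities`, `min_score`, and `fake_ner` are hypothetical names; the stub only imitates the `word` and `score` keys shown in the pipeline output earlier.

```python
def extract_unique_entities(texts, ner, min_score=0.0):
    """Return, per text, the unique entity strings scoring at least min_score."""
    results = []
    for text in texts:
        seen = []  # a list keeps first-seen order
        for item in ner(text):
            word = item["word"]
            if item["score"] >= min_score and word not in seen:
                seen.append(word)
        results.append(seen)
    return results

def fake_ner(text):
    """Stand-in for the NER pipeline so the sketch runs without a model."""
    return [
        {"word": "Kirk", "score": 0.99},
        {"word": "Kirk", "score": 0.97},
        {"word": "Stadium", "score": 0.40},
    ]

print(extract_unique_entities(["some text"], fake_ner, min_score=0.9))
```

With the real pipeline, the call would be `extract_unique_entities(docs, nlp, min_score=0.9)`, where `docs` is your list of strings.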

In [40]:
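import pandas as pd

def prepare_medium_articles_data(path):
    # Load only the first 20 rows to keep the NER pass fast
    df = pd.read_csv(path, nrows=20)
    df = df.dropna()
    # Keep only rows whose subtitle was not truncated
    df = df[~df["subtitle_truncated_flag"]]
    # NOTE: joining without a separator fuses the last title word with the
    # first subtitle word (e.g. 'Mars HillAuthor' in the output below);
    # df['title'] + ' ' + df['subtitle'] would avoid that
    df['title_extended'] = df['title'] + df['subtitle']

    # Build one metadata dict per row. extract_entities returns one list per
    # input document, so 'entities' is a nested list such as [['Icebreaker']]
    df['metadata'] = df.apply(lambda x: {
        'title' : x['title'],
        'subtitle': x['subtitle'],
        'category': x['category'],
        'entities': extract_entities([x['title_extended']])
        }, axis=1)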

    # ids = [f"article{i}" for i in range(df.shape[0])]
    # documents = df["title_extended"].to_list()
    # metadatas = df.drop("title_extended", axis=1).to_dict(orient="records")

    return df

    # return {"ids": ids, "documents": documents, "metadatas": metadatas}


file_path = 'medium_post_titles.csv'

df_out = prepare_medium_articles_data(file_path)

df_out



Unnamed: 0,category,title,subtitle,subtitle_truncated_flag,title_extended,metadata
0,work,"""21 Conversations"" - A fun (and easy) game for...",A (new?) Icebreaker game to get your team to s...,False,"""21 Conversations"" - A fun (and easy) game for...","{'title': '""21 Conversations"" - A fun (and eas..."
1,spirituality,"""Biblical Porn"" at Mars Hill",Author and UW lecturer Jessica Johnson talks a...,False,"""Biblical Porn"" at Mars HillAuthor and UW lect...","{'title': '""Biblical Porn"" at Mars Hill', 'sub..."
2,lgbtqia,"""CISGENDER?! Is That A Disease?!""","Or, a primer in gender vocabulary for the curi...",False,"""CISGENDER?! Is That A Disease?!""Or, a primer ...","{'title': '""CISGENDER?! Is That A Disease?!""',..."
4,artificial-intelligence,"""Can I Train my Model on Your Computer?""",How we waste computational resources and how t...,False,"""Can I Train my Model on Your Computer?""How we...","{'title': '""Can I Train my Model on Your Compu..."
5,cryptocurrency,"""Cypherpunks and Wall Street"": The Security To...",Bruce Fenton presents at the World Blockchain ...,False,"""Cypherpunks and Wall Street"": The Security To...","{'title': '""Cypherpunks and Wall Street"": The ..."
6,politics,"""Diss"" vs. ""Piss"": The Blue Wave and Yellow Tr...",Michael Gofman & Matthew Wigler explore how bu...,False,"""Diss"" vs. ""Piss"": The Blue Wave and Yellow Tr...","{'title': '""Diss"" vs. ""Piss"": The Blue Wave an..."
7,health,"""Doctor, he's gone into shock!""",You've seen it in movies and on television. B...,False,"""Doctor, he's gone into shock!""You've seen it ...","{'title': '""Doctor, he's gone into shock!""', '..."
8,culture,"""Happily Ever After: Fairy Tales for Every Chi...",Television shows have an invaluable opportunit...,False,"""Happily Ever After: Fairy Tales for Every Chi...","{'title': '""Happily Ever After: Fairy Tales fo..."
9,poetry,"""I Love You"" The Dangerous Toxic Truth","The Big, Smelly Heap of Lies Pretending To Be ...",False,"""I Love You"" The Dangerous Toxic TruthThe Big,...","{'title': '""I Love You"" The Dangerous Toxic Tr..."
10,poetry,"""I would the gods had made thee poetical"": Sha...",Not all bards are created equal.,False,"""I would the gods had made thee poetical"": Sha...","{'title': '""I would the gods had made thee poe..."


In [44]:
for item in df_out['metadata']:
    print(item)

{'title': '"21 Conversations" - A fun (and easy) game for teams to get to know each other', 'subtitle': 'A (new?) Icebreaker game to get your team to say all the interesting stuff', 'category': 'work', 'entities': [['Icebreaker']]}
{'title': '"Biblical Porn" at Mars Hill', 'subtitle': "Author and UW lecturer Jessica Johnson talks about her new book on Mars Hill Church's and Mark Driscoll's evangelical masculinity", 'category': 'spirituality', 'entities': [['Biblical Porn', 'Mars HillAuthor', 'UW', 'Jessica Johnson', 'Mars Hill Church', 'Mark Driscoll']]}
{'title': '"CISGENDER?! Is That A Disease?!"', 'subtitle': 'Or, a primer in gender vocabulary for the curious-minded', 'category': 'lgbtqia', 'entities': [[]]}
{'title': '"Can I Train my Model on Your Computer?"', 'subtitle': 'How we waste computational resources and how to stop it.', 'category': 'artificial-intelligence', 'entities': [['Computer']]}
{'title': '"Cypherpunks and Wall Street": The Security Token Revolution & Regulation',

In [None]:
query = ["What are the most useful medium articles related to Rust Programming"]

where = {'entities': {'$in':''}}