
# 🧠 NER Demo: Traditional vs LLM Methods

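This demo compares Named Entity Recognition (NER) on three short sample texts, one each from healthcare, transportation, and finance. It runs a traditional NLP pipeline (spaCy), a transformer token classifier (`dslim/bert-base-NER`), and a prompted text-to-text model (`google/flan-t5-base`) on the same inputs.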

In [2]:
# Sample texts
texts = {
    "Blue Cross": "Patient Jane Doe was diagnosed with hypertension and prescribed lisinopril. ICD-10 code: I10. She will follow up in two weeks.",
    "Union Pacific": "The UPX-2013 locomotive experienced a power failure near Cheyenne. Engineers replaced the faulty alternator and checked coolant levels.",
    "Farm Credit": "Tom Hanks applied for a $200,000 loan to expand his soybean farm. His credit score is 720 and the purpose is equipment purchase."
}

for domain, text in texts.items():
    print(f"\n--- {domain} ---\n{text}")


--- Blue Cross ---
Patient Jane Doe was diagnosed with hypertension and prescribed lisinopril. ICD-10 code: I10. She will follow up in two weeks.

--- Union Pacific ---
The UPX-2013 locomotive experienced a power failure near Cheyenne. Engineers replaced the faulty alternator and checked coolant levels.

--- Farm Credit ---
Tom Hanks applied for a $200,000 loan to expand his soybean farm. His credit score is 720 and the purpose is equipment purchase.


## 🔍 spaCy-based NER (Traditional)


In [3]:
import spacy
nlp_spacy = spacy.load("en_core_web_sm")

for domain, text in texts.items():
    doc = nlp_spacy(text)
    print(f"\n{domain} Entities (spaCy):")
    for ent in doc.ents:
        print(f" - {ent.text} ({ent.label_})")


Blue Cross Entities (spaCy):
 - Jane Doe (PERSON)
 - two weeks (DATE)

Union Pacific Entities (spaCy):
 - UPX-2013 (ORG)
 - Cheyenne (GPE)

Farm Credit Entities (spaCy):
 - Tom Hanks (PERSON)
 - 200,000 (MONEY)
 - 720 (CARDINAL)
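
The spaCy output above has no entry for the diagnosis code `I10`, and the loan amount comes back without its `$`. One lightweight complement to a general-purpose model is a rule-based pass for identifiers with a predictable shape. The sketch below uses only the standard-library `re` module rather than spaCy's own rule components. The label names and all three patterns are illustrative assumptions about the formats, not validated specifications.

```python
import re
from typing import Dict, List, Tuple

# Illustrative patterns only: each is an assumption about the identifier's shape.
RULES: Dict[str, "re.Pattern[str]"] = {
    # Letter, digit, alphanumeric, optional ".xxxx" suffix (rough ICD-10 shape).
    "ICD10_CODE": re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b"),
    # 2-4 capital letters, a hyphen, 3-5 digits (hypothetical locomotive ID shape).
    "LOCOMOTIVE_ID": re.compile(r"\b[A-Z]{2,4}-\d{3,5}\b"),
    # Dollar sign, comma-grouped digits, optional cents.
    "LOAN_AMOUNT": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
}


def rule_entities(text: str) -> List[Tuple[str, str]]:
    """Return (surface_text, label) pairs for every rule match, in text order."""
    hits = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(), label))
    hits.sort()  # order by position in the text
    return [(surface, label) for _, surface, label in hits]


samples = {
    "Blue Cross": "Patient Jane Doe was diagnosed with hypertension and prescribed lisinopril. ICD-10 code: I10. She will follow up in two weeks.",
    "Union Pacific": "The UPX-2013 locomotive experienced a power failure near Cheyenne. Engineers replaced the faulty alternator and checked coolant levels.",
    "Farm Credit": "Tom Hanks applied for a $200,000 loan to expand his soybean farm. His credit score is 720 and the purpose is equipment purchase.",
}

for domain, sample_text in samples.items():
    print(domain, rule_entities(sample_text))
```

Rules like these only cover entities with a fixed surface form. Free-text items such as "hypertension" or "alternator" would still need a domain model or a term list.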


## 🤖 LLM-based NER using Transformers

In [4]:
from transformers import pipeline

ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

for domain, text in texts.items():
    print(f"\n{domain} Entities (LLM):")
    entities = ner_pipeline(text)
    for ent in entities:
        print(f" - {ent['word']} ({ent['entity_group']})")


Blue Cross Entities (LLM):
 - Jane Doe (PER)
 - ICD (MISC)
 - 10 (MISC)

Union Pacific Entities (LLM):
 - UPX - 2013 (MISC)
 - Cheyenne (LOC)

Farm Credit Entities (LLM):
 - Tom Hanks (PER)
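
The `word` field printed above is rebuilt from WordPiece tokens, which is why the locomotive ID appears as `UPX - 2013` with extra spaces. A minimal sketch of recovering the exact surface text follows. It assumes each aggregated entity dict also carries `start` and `end` character offsets, which the token-classification pipeline normally provides when the tokenizer is a fast one. The `stand_in` helper and the sample dicts are hand-written stand-ins for illustration, not captured model output.

```python
from typing import Dict, List, Tuple


def surface_entities(text: str, entities: List[Dict]) -> List[Tuple[str, str]]:
    """Slice the original text with each entity's character offsets."""
    return [(text[e["start"]:e["end"]], e["entity_group"]) for e in entities]


text = "The UPX-2013 locomotive experienced a power failure near Cheyenne."


def stand_in(word: str, label: str, surface: str) -> Dict:
    """Build a hand-written stand-in for one pipeline result (hypothetical helper)."""
    start = text.index(surface)
    return {"word": word, "entity_group": label,
            "start": start, "end": start + len(surface)}


sample = [
    stand_in("UPX - 2013", "MISC", "UPX-2013"),  # 'word' as printed by the pipeline
    stand_in("Cheyenne", "LOC", "Cheyenne"),
]

for surface, label in surface_entities(text, sample):
    print(f" - {surface} ({label})")
```

In the notebook, the same function could be called as `surface_entities(text, ner_pipeline(text))`, provided the offsets are present and not `None`.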


## 📊 Comparison & Discussion
- Which model captured more relevant or domain-specific entities?
- Did either model miss key items such as the ICD-10 code (`I10`), the credit score (`720`), or the locomotive ID (`UPX-2013`)?
- Which is more flexible for real-world use in these domains?

### 🔧 Extension Ideas
- Visualize results with `displacy`
- Fine-tune or use domain-specific LLMs
- Use prompt-based extraction with LLMs for custom entity types

## 🖼️ spaCy NER Visualization with displaCy

In [5]:
from spacy import displacy
from IPython.display import display, HTML

for domain, text in texts.items():
    doc = nlp_spacy(text)
    print(f"\n{domain} Visualization:")
    # jupyter=False returns the markup as a string; jupyter=True renders inline and returns None
    display(HTML(displacy.render(doc, style="ent", jupyter=False)))


Blue Cross Visualization:


<IPython.core.display.HTML object>


Union Pacific Visualization:


<IPython.core.display.HTML object>


Farm Credit Visualization:


<IPython.core.display.HTML object>

## ✨ Prompt-based LLM NER (Domain-Specific Instructions)

In [6]:
from transformers import pipeline

qa_ner = pipeline("text2text-generation", model="google/flan-t5-base")

prompts = {
    "Blue Cross": "Extract all ICD-10 codes, diagnoses, and medications from the following text:",
    "Union Pacific": "Extract all locomotive IDs, part names, and failure types from the following text:",
    "Farm Credit": "Extract applicant names, loan amounts, credit scores, and purposes from the following text:"
}

for domain, text in texts.items():
    prompt = prompts[domain] + " " + text
    print(f"\n{domain} (Prompt-based LLM NER):")
    print(qa_ner(prompt)[0]['generated_text'])


Blue Cross (Prompt-based LLM NER):
I10

Union Pacific (Prompt-based LLM NER):
UPX-2013

Farm Credit (Prompt-based LLM NER):
Tom Hanks
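
Each prompt above asks for several entity types at once, yet each printed answer contains a single item. One variation worth trying is to ask for one field per call and collect the answers in a dict. The sketch below shows only the prompt-building and collection logic. The function names, the prompt wording, and the `generate` callable are assumptions introduced for illustration, and whether this improves the answers from `google/flan-t5-base` is not tested here.

```python
from typing import Callable, Dict, List, Optional


def build_field_prompts(text: str, fields: List[str]) -> Dict[str, str]:
    """Return one prompt per requested field instead of one combined prompt."""
    return {
        field: (
            f"Extract the {field} from the following text. "
            f"Answer 'none' if it is not mentioned.\n\nText: {text}"
        )
        for field in fields
    }


def extract_fields(
    text: str, fields: List[str], generate: Callable[[str], str]
) -> Dict[str, Optional[str]]:
    """Call `generate` once per field and map the answer 'none' to None."""
    results: Dict[str, Optional[str]] = {}
    for field, prompt in build_field_prompts(text, fields).items():
        answer = generate(prompt).strip()
        results[field] = None if answer.lower() == "none" else answer
    return results


def fake_generate(prompt: str) -> str:
    """Stand-in for a model call so the sketch runs without downloading weights."""
    return "I10" if prompt.startswith("Extract the ICD-10 code") else "none"


demo = extract_fields(
    "ICD-10 code: I10. She will follow up in two weeks.",
    ["ICD-10 code", "medication"],
    fake_generate,
)
print(demo)
```

In the notebook, `generate` could be `lambda p: qa_ner(p)[0]['generated_text']`, at the cost of one model call per field.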


## ❗ Missed Entities: Manual Checks & Gaps

In [7]:
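# For each expected domain keyword, report whether it appears as a
# case-insensitive substring of any entity returned by spaCy or by the BERT NER pipeline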
expected_keywords = {
    "Blue Cross": ["ICD-10", "hypertension", "lisinopril"],
    "Union Pacific": ["UPX-2013", "power failure", "alternator"],
    "Farm Credit": ["$200,000", "credit score", "soybean"]
}

for domain, text in texts.items():
    print(f"\n{domain} Missed Entity Checks:")
    spacy_ents = [ent.text.lower() for ent in nlp_spacy(text).ents]
    llm_ents = [e['word'].lower() for e in ner_pipeline(text)]
    for keyword in expected_keywords[domain]:
        present_spacy = any(keyword.lower() in ent for ent in spacy_ents)
        present_llm = any(keyword.lower() in ent for ent in llm_ents)
        print(f" - {keyword}: spaCy {'✅' if present_spacy else '❌'}, LLM {'✅' if present_llm else '❌'}")


Blue Cross Missed Entity Checks:
 - ICD-10: spaCy ❌, LLM ❌
 - hypertension: spaCy ❌, LLM ❌
 - lisinopril: spaCy ❌, LLM ❌

Union Pacific Missed Entity Checks:
 - UPX-2013: spaCy ✅, LLM ❌
 - power failure: spaCy ❌, LLM ❌
 - alternator: spaCy ❌, LLM ❌

Farm Credit Missed Entity Checks:
 - $200,000: spaCy ❌, LLM ❌
 - credit score: spaCy ❌, LLM ❌
 - soybean: spaCy ❌, LLM ❌
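
Two of the ❌ marks above come from formatting rather than from a missing entity. The BERT pipeline printed the locomotive ID as `UPX - 2013`, so the substring `upx-2013` never matches, and spaCy returned `200,000` without the `$`, so `$200,000` never matches either. A minimal sketch of a more forgiving comparison, which strips everything except letters and digits before the substring test, is shown below. The helper names are hypothetical, and the entity lists are copied from the outputs printed earlier.

```python
import re
from typing import Iterable


def normalize(s: str) -> str:
    """Lowercase and drop every character that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", s.lower())


def keyword_found(keyword: str, entity_texts: Iterable[str]) -> bool:
    """True if the normalized keyword is a substring of any normalized entity."""
    key = normalize(keyword)
    if not key:  # avoid the empty string matching everything
        return False
    return any(key in normalize(ent) for ent in entity_texts)


# Entity texts as printed in the earlier cells.
spacy_farm = ["Tom Hanks", "200,000", "720"]
bert_union = ["UPX - 2013", "Cheyenne"]
bert_blue = ["Jane Doe", "ICD", "10"]

print(keyword_found("$200,000", spacy_farm))
print(keyword_found("UPX-2013", bert_union))
print(keyword_found("ICD-10", bert_blue))
```

The last check still fails because the pipeline returned `ICD` and `10` as two separate entities. Normalization fixes spacing and punctuation differences, not split spans.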
