<a href="https://colab.research.google.com/github/2403a52030-sketch/NLP-LAB/blob/main/NLP_LAB_14_2030.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install required libraries (run once in Google Colab)
!pip install -q spacy transformers torch pandas datasets
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 73.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
# spaCy for traditional NLP and NER
import spacy
from spacy import displacy

# Hugging Face libraries for transformer-based NER
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# PyTorch (backend for transformer models)
import torch

# Pandas for tabular data representation
import pandas as pd

In [3]:
# Load English pretrained spaCy model
nlp = spacy.load("en_core_web_sm")

"""
A pipeline in spaCy is a sequence of components applied to text.
spaCy pipeline includes:
- Tokenizer: splits text into tokens
- Tagger: assigns POS tags
- Parser: analyzes sentence structure
- NER: detects named entities like PERSON, ORG, GPE
"""

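The idea that a pipeline is a sequence of components applied to text can be sketched without spaCy at all. The snippet below is a toy illustration, not spaCy's implementation: `ToyDoc`, `toy_tokenizer`, `toy_ner`, `run_pipeline`, and the two-entry gazetteer are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document object that components annotate."""
    text: str
    tokens: list = field(default_factory=list)
    ents: list = field(default_factory=list)


def toy_tokenizer(text):
    # Whitespace split with surrounding '.' and ',' stripped (far cruder than a real tokenizer).
    return ToyDoc(text=text, tokens=[tok.strip(".,") for tok in text.split()])


def toy_ner(doc):
    # Dictionary lookup standing in for a trained NER component (assumption for illustration).
    gazetteer = {"Apple": "ORG", "California": "GPE"}
    doc.ents = [(tok, gazetteer[tok]) for tok in doc.tokens if tok in gazetteer]
    return doc


def run_pipeline(text, components):
    # The tokenizer creates the document; each later component annotates it in order.
    doc = toy_tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc


toy_doc = run_pipeline("Apple announced a new iPhone in California.", [toy_ner])
print(toy_doc.tokens)
print(toy_doc.ents)
```

In spaCy itself, the names of the components loaded with a model can be inspected through `nlp.pipe_names`.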

In [4]:
# List of real-world sentences
sentences = [
    "Apple announced a new iPhone in California.",
    "Virat Kohli scored a century for India in the World Cup.",
    "The Prime Minister of India met Elon Musk in New York.",
    "Google is investing heavily in artificial intelligence.",
    "The Olympics will be held in Paris in 2024."
]

In [5]:
# List to store spaCy NER results
spacy_results = []

for sentence in sentences:
    doc = nlp(sentence)
    for ent in doc.ents:
        spacy_results.append([sentence, ent.text, ent.label_])

# Display results
for row in spacy_results:
    print(f"Sentence: {row[0]}")
    print(f"Entity: {row[1]} | Label: {row[2]}")
    print("-" * 50)

Sentence: Apple announced a new iPhone in California.
Entity: Apple | Label: ORG
--------------------------------------------------
Sentence: Apple announced a new iPhone in California.
Entity: iPhone | Label: ORG
--------------------------------------------------
Sentence: Apple announced a new iPhone in California.
Entity: California | Label: GPE
--------------------------------------------------
Sentence: Virat Kohli scored a century for India in the World Cup.
Entity: Virat Kohli | Label: PERSON
--------------------------------------------------
Sentence: Virat Kohli scored a century for India in the World Cup.
Entity: a century | Label: DATE
--------------------------------------------------
Sentence: Virat Kohli scored a century for India in the World Cup.
Entity: India | Label: GPE
--------------------------------------------------
Sentence: Virat Kohli scored a century for India in the World Cup.
Entity: the World Cup | Label: EVENT
---------------------------------------------

In [6]:
# Create DataFrame
spacy_df = pd.DataFrame(
    spacy_results,
    columns=["Sentence", "Entity", "Label"]
)

spacy_df

   Sentence                                           Entity         Label
0  Apple announced a new iPhone in California.        Apple          ORG
1  Apple announced a new iPhone in California.        iPhone         ORG
2  Apple announced a new iPhone in California.        California     GPE
3  Virat Kohli scored a century for India in the ...  Virat Kohli    PERSON
4  Virat Kohli scored a century for India in the ...  a century      DATE
5  Virat Kohli scored a century for India in the ...  India          GPE
6  Virat Kohli scored a century for India in the ...  the World Cup  EVENT
7  The Prime Minister of India met Elon Musk in N...  India          GPE
8  The Prime Minister of India met Elon Musk in N...  Elon Musk      PERSON
9  The Prime Minister of India met Elon Musk in N...  New York       GPE


In [7]:
# Visualize NER for the first sentence
doc = nlp(sentences[0])
displacy.render(doc, style="ent", jupyter=True)

"""
Different colors represent different entity types.
Labels like ORG, GPE, PERSON indicate entity categories.
"""

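The entity view amounts to marking character spans in the original text with their labels. The sketch below shows that idea in plain text; it is not how displaCy is implemented, and `annotate` is a hypothetical helper. The entities and labels are the ones spaCy reported for the first sentence, and the offsets are computed with `str.find` here, whereas in spaCy they would come from `ent.start_char` and `ent.end_char`.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) character span as [surface|LABEL]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # untouched text before the entity
        pieces.append(f"[{text[start:end]}|{label}]")   # the marked entity
        cursor = end
    pieces.append(text[cursor:])                        # remaining text after the last entity
    return "".join(pieces)


text = "Apple announced a new iPhone in California."
reported = [("Apple", "ORG"), ("iPhone", "ORG"), ("California", "GPE")]

# Build character offsets for each reported entity.
spans = []
for surface, label in reported:
    start = text.find(surface)
    spans.append((start, start + len(surface), label))

annotated = annotate(text, spans)
print(annotated)
```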

In [8]:
# Load tokenizer and model (BERT fine-tuned on CoNLL-2003)
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

# Create NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

"""
Token classification assigns a label to each token.
Transformer models understand context using attention mechanisms,
allowing better handling of ambiguous words.
"""

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


BertForTokenClassification LOAD REPORT from: dbmdz/bert-large-cased-finetuned-conll03-english
Key                      | Status     |  | 
-------------------------+------------+--+-
bert.pooler.dense.weight | UNEXPECTED |  | 
bert.pooler.dense.bias   | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


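Because token classification labels every token separately, multi-token names such as "Virat Kohli" have to be merged back into one entity afterwards. The sketch below shows one simplified way to do that, assuming BIO-style tags; it is not the exact logic behind `aggregation_strategy="simple"`. `group_entities` is a hypothetical helper, and the tags and scores in `tagged` are made-up illustrative values, not model output.

```python
def group_entities(tagged_tokens):
    """Merge (token, tag, score) triples with B-/I- tags into entity groups."""
    groups = []
    current = None
    for token, tag, score in tagged_tokens:
        if tag == "O":
            current = None  # an outside token closes any open entity
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "I" and current is not None and current["label"] == label:
            # Continuation of the entity that is currently open.
            current["words"].append(token)
            current["scores"].append(score)
        else:
            # A B- tag (or an I- tag with nothing to continue) opens a new entity.
            current = {"label": label, "words": [token], "scores": [score]}
            groups.append(current)
    return [
        {
            "entity_group": g["label"],
            "word": " ".join(g["words"]),
            "score": sum(g["scores"]) / len(g["scores"]),  # mean of the token scores
        }
        for g in groups
    ]


# Hypothetical per-token predictions (illustrative values only).
tagged = [
    ("Virat", "B-PER", 0.99),
    ("Kohli", "I-PER", 0.97),
    ("scored", "O", 0.99),
    ("a", "O", 0.99),
    ("century", "O", 0.99),
    ("for", "O", 0.99),
    ("India", "B-LOC", 0.98),
]

grouped = group_entities(tagged)
for entity in grouped:
    print(entity["word"], entity["entity_group"], round(entity["score"], 2))
```

Averaging the token scores is one possible choice for the group score; a real pipeline may also need to rejoin word pieces, which this sketch ignores.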

In [9]:
# List to store transformer NER results
hf_results = []

for sentence in sentences:
    entities = ner_pipeline(sentence)
    for ent in entities:
        hf_results.append([
            sentence,
            ent["word"],
            ent["entity_group"],
            # Cast to a built-in float so the rounded value prints as 0.9977
            # instead of a float32 repr such as 0.9976999759674072
            round(float(ent["score"]), 4)
        ])

# Print results
for row in hf_results:
    print(f"Sentence: {row[0]}")
    print(f"Entity: {row[1]} | Label: {row[2]} | Confidence: {row[3]}")
    print("-" * 50)

Sentence: Apple announced a new iPhone in California.
Entity: Apple | Label: ORG | Confidence: 0.9977
--------------------------------------------------
Sentence: Apple announced a new iPhone in California.
Entity: iPhone | Label: MISC | Confidence: 0.9953
--------------------------------------------------
Sentence: Apple announced a new iPhone in California.
Entity: California | Label: LOC | Confidence: 0.9997
--------------------------------------------------
Sentence: Virat Kohli scored a century for India in the World Cup.
Entity: Virat Kohli | Label: PER | Confidence: 0.9974
--------------------------------------------------
Sentence: Virat Kohli scored a century for India in the World Cup.
Entity: India | Label: LOC | Confidence: 0.9996
--------------------------------------------------
Sentence: Virat Kohli scored a century for India in the World Cup.
Entity: World Cup | Label: MISC | Confidence: 0.9953
----

In [10]:
# Create DataFrame for transformer output
hf_df = pd.DataFrame(
    hf_results,
    columns=["Sentence", "Entity", "Label", "Confidence Score"]
)

hf_df

   Sentence                                           Entity       Label  Confidence Score
0  Apple announced a new iPhone in California.        Apple        ORG    0.9977
1  Apple announced a new iPhone in California.        iPhone       MISC   0.9953
2  Apple announced a new iPhone in California.        California   LOC    0.9997
3  Virat Kohli scored a century for India in the ...  Virat Kohli  PER    0.9974
4  Virat Kohli scored a century for India in the ...  India        LOC    0.9996
5  Virat Kohli scored a century for India in the ...  World Cup    MISC   0.9953
6  The Prime Minister of India met Elon Musk in N...  India        LOC    0.9994
7  The Prime Minister of India met Elon Musk in N...  Elon Musk    PER    0.9961
8  The Prime Minister of India met Elon Musk in N...  New York     LOC    0.9995
9  Google is investing heavily in artificial inte...  Google       ORG    0.9991
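The two models use different label sets (PERSON/GPE/EVENT/DATE on the spaCy side, PER/LOC/MISC on the transformer side), so their outputs cannot be compared label for label without a mapping. The sketch below uses the entities both models reported for the first two sentences. The `SPACY_TO_CONLL` mapping is an assumption made for this comparison, not an official correspondence, and matching is by exact entity text only.

```python
# Assumed mapping from spaCy labels to CoNLL-style labels; labels without a
# reasonable counterpart (such as DATE) are left out on purpose.
SPACY_TO_CONLL = {"PERSON": "PER", "ORG": "ORG", "GPE": "LOC", "LOC": "LOC", "EVENT": "MISC"}

# Entities reported above for the first two sentences.
spacy_ents = {
    "Apple": "ORG", "iPhone": "ORG", "California": "GPE",
    "Virat Kohli": "PERSON", "a century": "DATE", "India": "GPE",
    "the World Cup": "EVENT",
}
hf_ents = {
    "Apple": "ORG", "iPhone": "MISC", "California": "LOC",
    "Virat Kohli": "PER", "India": "LOC", "World Cup": "MISC",
}

agree, disagree, spacy_only = [], [], []
for surface, label in spacy_ents.items():
    if surface not in hf_ents:
        spacy_only.append(surface)
    elif SPACY_TO_CONLL.get(label) == hf_ents[surface]:
        agree.append(surface)
    else:
        disagree.append((surface, label, hf_ents[surface]))
hf_only = [surface for surface in hf_ents if surface not in spacy_ents]

print("Agree:", agree)
print("Disagree:", disagree)
print("spaCy only:", spacy_only)
print("Transformer only:", hf_only)
```

Exact-text matching treats "the World Cup" and "World Cup" as different entities, which highlights a span-boundary difference between the two models rather than a label difference.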


In [11]:
comparison_data = {
    "Feature": [
        "Model Type",
        "Speed",
        "Accuracy (Qualitative)",
        "Context Handling",
        "Confidence Score",
        "GPU Requirement"
    ],
    "spaCy": [
        "Statistical + Rules",
        "Fast",
        "Moderate",
        "Limited",
        "No",
        "No"
    ],
    "Hugging Face": [
        "Transformer-based",
        "Slower",
        "High",
        "Excellent",
        "Yes",
        # The model also runs on CPU; a GPU only speeds up inference
        "Optional (faster with GPU)"
    ]
}

comparison_df = pd.DataFrame(comparison_data)
comparison_df

   Feature                 spaCy                Hugging Face
0  Model Type              Statistical + Rules  Transformer-based
1  Speed                   Fast                 Slower
2  Accuracy (Qualitative)  Moderate             High
3  Context Handling        Limited              Excellent
4  Confidence Score        No                   Yes
5  GPU Requirement         No                   Optional (faster with GPU)
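The "Speed" row is qualitative, but it can be checked empirically. The sketch below is one rough way to do so: `mean_latency` is a hypothetical helper that times any callable taking a string, and `dummy_ner` is a stand-in used only so the example runs without either model loaded.

```python
import time


def mean_latency(ner_fn, texts, repeats=3):
    """Return the average number of seconds ner_fn spends per text."""
    ner_fn(texts[0])  # warm-up call so one-time setup cost is not measured
    start = time.perf_counter()
    for _ in range(repeats):
        for text in texts:
            ner_fn(text)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(texts))


def dummy_ner(text):
    # Stand-in "model": returns capitalized words (illustration only).
    return [word for word in text.split() if word[0].isupper()]


sample_texts = [
    "Apple announced a new iPhone in California.",
    "Google is investing heavily in artificial intelligence.",
]
latency = mean_latency(dummy_ner, sample_texts)
print(f"{latency * 1e6:.1f} microseconds per sentence")
```

In this notebook, `mean_latency(nlp, sentences)` and `mean_latency(ner_pipeline, sentences)` would give a rough per-sentence comparison; the numbers depend heavily on the hardware and on whether a GPU is available.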
