Dataset: CoNLL-2003 (Kaggle)

Identify named entities (such as people, locations, and organizations) in article content

Use rule-based and model-based NER approaches

Highlight and categorize extracted entities in the text

Visualize extracted entities with displaCy

Compare results using two different spaCy models

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from spacy import displacy

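def load_conll_data(file_path):
    """Read a CoNLL-formatted file and return each sentence as a space-joined string.

    Only the first column (the token) is kept; -DOCSTART- lines are not filtered out.
    """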
    sentences = []
    current_sentence = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if current_sentence:
                    sentences.append(" ".join(current_sentence))
                    current_sentence = []
            else:
                parts = line.split()
                if len(parts) >= 1:
                    token = parts[0]
                    current_sentence.append(token)
        if current_sentence:
            sentences.append(" ".join(current_sentence))
    return sentences

train_sentences = load_conll_data("/content/eng.train")
testa_sentences = load_conll_data("/content/eng.testa")
testb_sentences = load_conll_data("/content/eng.testb")

print(f"Loaded {len(train_sentences)} train, {len(testa_sentences)} testa, {len(testb_sentences)} testb sentences.")

Loaded 14987 train, 3466 testa, 3684 testb sentences.
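
The loader above keeps only the first column of each line, so the gold NER tags are discarded. If the tags are needed later (for example, to score the rule-based and spaCy outputs), a minimal sketch along these lines could retain them. It assumes the usual CoNLL-2003 layout, with the token in the first column and the NER tag in the last. The name read_conll_with_tags and the sample lines are hypothetical, written for illustration and not copied from the dataset files.

```python
import io

def read_conll_with_tags(lines):
    """Return sentences as lists of (token, ner_tag) pairs.

    Assumes the token is the first column and the NER tag is the last.
    Blank lines and -DOCSTART- lines end the current sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences

# Hand-written sample in the assumed column layout (not real dataset rows).
sample = io.StringIO(
    "-DOCSTART- -X- -X- O\n"
    "\n"
    "Nadim NNP I-NP I-PER\n"
    "Ladki NNP I-NP I-PER\n"
    "\n"
    "Japan NNP I-NP I-LOC\n"
    "won VBD I-VP O\n"
)
tagged = read_conll_with_tags(sample)
for sentence in tagged:
    print(sentence)
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the StringIO object.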


Rule-based NER

In [6]:
def rule_based_ner(text):
    people = re.findall(r'\b[A-Z][a-z]+ [A-Z][a-z]+\b', text)  # e.g. "John Smith"
    locations = re.findall(r'\b(?:London|Paris|Berlin|Tokyo|New York)\b', text)
    organizations = re.findall(r'\b[A-Z]{2,}\b', text)  # acronyms like "EU", "UN"
    return {
        "PERSON": list(set(people)),
        "GPE": list(set(locations)),
        "ORG": list(set(organizations))
    }
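
The function above deduplicates matches with list(set(...)), which does not guarantee a stable order between runs. The sketch below is one possible variant that keeps first-seen order by using dict.fromkeys; the regular expressions are the same as above, and the name rule_based_ner_ordered is hypothetical. The example sentence is one of the test sentences printed in this notebook, and it shows a known weakness of the PERSON pattern: any two adjacent capitalized words match.

```python
import re

# Same patterns as rule_based_ner, collected in one mapping.
PATTERNS = {
    "PERSON": r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',                  # e.g. "John Smith"
    "GPE": r'\b(?:London|Paris|Berlin|Tokyo|New York)\b',       # fixed city list
    "ORG": r'\b[A-Z]{2,}\b',                                    # acronyms like "EU", "UN"
}

def rule_based_ner_ordered(text):
    """Like rule_based_ner, but duplicates are dropped in first-seen order."""
    return {
        label: list(dict.fromkeys(re.findall(pattern, text)))
        for label, pattern in PATTERNS.items()
    }

result = rule_based_ner_ordered("AL-AIN , United Arab Emirates 1996-12-06")
print(result)
# "United Arab" is reported as a PERSON, and the hyphenated place name
# "AL-AIN" is split into two ORG matches.
```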

Model-based NER

In [8]:
import spacy
print("Loading spaCy models...")
nlp_sm = spacy.load("en_core_web_sm")
nlp_trf = spacy.load("en_core_web_trf")

def extract_entities_spacy(text, nlp_model):
    doc = nlp_model(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

Loading spaCy models...


Compare and visualize

In [9]:
def compare_models(texts):
    results = []
    for t in texts:
        ents_sm = extract_entities_spacy(t, nlp_sm)
        ents_trf = extract_entities_spacy(t, nlp_trf)
        results.append({
            "text": t,
            "rule_based": rule_based_ner(t),
            "sm_count": len(ents_sm),
            "trf_count": len(ents_trf),
            "sm_ents": ents_sm,
            "trf_ents": ents_trf
        })
    return pd.DataFrame(results)
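
Entity counts alone do not show where the two models disagree. As a rough sketch, the (text, label) lists returned by extract_entities_spacy could be compared with set operations. The helper name diff_entities and the input lists below are hypothetical; the lists are hand-written for illustration and are not actual model output.

```python
def diff_entities(ents_a, ents_b):
    """Split two lists of (text, label) pairs into shared and unique entities."""
    set_a, set_b = set(ents_a), set(ents_b)
    return {
        "both": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }

# Illustrative inputs only (not real predictions from either spaCy model).
ents_a = [("Japan", "GPE"), ("Asian Cup", "EVENT"), ("Friday", "DATE")]
ents_b = [("Japan", "GPE"), ("Syria", "GPE"), ("Friday", "DATE")]

diff = diff_entities(ents_a, ents_b)
for key, values in diff.items():
    print(key, values)
```

Comparing whole (text, label) pairs means an entity with the same text but a different label counts as a disagreement, which is usually what a model comparison should surface.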

Comparison on sample sentences

In [10]:
sample_df = compare_models(testb_sentences[:5])
print(sample_df)

                                                text  \
0                                         -DOCSTART-   
1  SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI...   
2                                        Nadim Ladki   
3           AL-AIN , United Arab Emirates 1996-12-06   
4  Japan began the defence of their Asian Cup tit...   

                                          rule_based  sm_count  trf_count  \
0     {'PERSON': [], 'GPE': [], 'ORG': ['DOCSTART']}         0          0   
1  {'PERSON': [], 'GPE': [], 'ORG': ['WIN', 'SURP...         1          2   
2  {'PERSON': ['Nadim Ladki'], 'GPE': [], 'ORG': []}         1          1   
3  {'PERSON': ['United Arab'], 'GPE': [], 'ORG': ...         3          3   
4    {'PERSON': ['Asian Cup'], 'GPE': [], 'ORG': []}         6          6   

                                             sm_ents  \
0                                                 []   
1                                    [(DEFEAT, ORG)]   
2                            [(N

Visualize

In [15]:
from IPython.display import HTML, display
def read_conll_file(file_path):
    sentences = []
    sentence = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("-DOCSTART-"):
                if sentence:
                    sentences.append(" ".join(sentence))
                    sentence = []
                continue
            word = line.split()[0]
            sentence.append(word)
    if sentence:
        sentences.append(" ".join(sentence))
    return sentences


testb_sentences = read_conll_file("/content/eng.testb")
for i, sent in enumerate(testb_sentences[:10], start=1):
    print(f"{i}: {sent}")

html_blocks = []
for sent in testb_sentences[:10]:
    doc = nlp_trf(sent)
    html_blocks.append(displacy.render(doc, style="ent", jupyter=False))

display(HTML("<br><br>".join(html_blocks)))

#  Display in Colab output
# displacy.render(doc_trf, style="ent")



1: SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT .
2: Nadim Ladki
3: AL-AIN , United Arab Emirates 1996-12-06
4: Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
5: But China saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers Uzbekistan .
6: China controlled most of the match and saw several chances missed until the 78th minute when Uzbek striker Igor Shkvyrin took advantage of a misdirected defensive header to lob the ball over the advancing Chinese keeper and into an empty net .
7: Oleg Shatskiku made sure of the win in injury time , hitting an unstoppable left foot shot from just outside the area .
8: The former Soviet republic was playing in an Asian Cup finals tie for the first time .
9: Despite winning the Asian Games title two years ago , Uzbekistan are in the finals as outsiders .
10: Two goals from defensive errors in the last six m