**Assignment 4**

Build a Named Entity Recognition (NER) system for extracting entities from real-world text
such as news articles or social media data. And measure its accuracy, precision, recall, and F1-
score

**Installing Library**

In [1]:
%pip install spacy
print("spaCy library installed.")

spaCy library installed.


In [3]:
import spacy.cli
spacy.cli.download("en_core_web_sm")
print("en_core_web_sm model downloaded.")

[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
en_core_web_sm model downloaded.


**Sample text**

In [6]:
sample_text = """Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976. The company is headquartered in Cupertino, California. In 2023, Apple announced its new Vision Pro headset, attracting significant attention from the tech community. Tim Cook is the current CEO."""
print(sample_text)

Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976. The company is headquartered in Cupertino, California. In 2023, Apple announced its new Vision Pro headset, attracting significant attention from the tech community. Tim Cook is the current CEO.


**Loading Model**

In [8]:
import spacy

nlp = spacy.load("en_core_web_sm")
print("en_core_web_sm model loaded.")

en_core_web_sm model loaded.


In [20]:
doc = nlp(sample_text)
print("Extracted Entities:")

for ent in doc.ents:
    print(f"- Text: {ent.text}, Label: {ent.label_}, Explanation: {spacy.explain(ent.label_)}")

Extracted Entities:
- Text: Apple Inc., Label: ORG, Explanation: Companies, agencies, institutions, etc.
- Text: Steve Jobs, Label: PERSON, Explanation: People, including fictional
- Text: Steve Wozniak, Label: PERSON, Explanation: People, including fictional
- Text: Ronald Wayne, Label: PERSON, Explanation: People, including fictional
- Text: April 1976, Label: DATE, Explanation: Absolute or relative dates or periods
- Text: Cupertino, Label: GPE, Explanation: Countries, cities, states
- Text: California, Label: GPE, Explanation: Countries, cities, states
- Text: 2023, Label: DATE, Explanation: Absolute or relative dates or periods
- Text: Apple, Label: ORG, Explanation: Companies, agencies, institutions, etc.
- Text: Vision Pro, Label: ORG, Explanation: Companies, agencies, institutions, etc.
- Text: Tim Cook, Label: PERSON, Explanation: People, including fictional


In [12]:
gold_standard = [
    {'text': 'Apple Inc.', 'label': 'ORG', 'start_char': 0, 'end_char': 10},
    {'text': 'Steve Jobs', 'label': 'PERSON', 'start_char': 26, 'end_char': 36},
    {'text': 'Steve Wozniak', 'label': 'PERSON', 'start_char': 38, 'end_char': 51},
    {'text': 'Ronald Wayne', 'label': 'PERSON', 'start_char': 57, 'end_char': 69},
    {'text': 'April 1976', 'label': 'DATE', 'start_char': 73, 'end_char': 83},
    {'text': 'Cupertino', 'label': 'GPE', 'start_char': 111, 'end_char': 120},
    {'text': 'California', 'label': 'GPE', 'start_char': 122, 'end_char': 132},
    {'text': '2023', 'label': 'DATE', 'start_char': 138, 'end_char': 142},
    {'text': 'Apple', 'label': 'ORG', 'start_char': 144, 'end_char': 149},
    {'text': 'Vision Pro', 'label': 'ORG', 'start_char': 168, 'end_char': 178},
    {'text': 'Tim Cook', 'label': 'PERSON', 'start_char': 226, 'end_char': 234}
]

print("Gold standard dataset created.")
print(gold_standard)

Gold standard dataset created.
[{'text': 'Apple Inc.', 'label': 'ORG', 'start_char': 0, 'end_char': 10}, {'text': 'Steve Jobs', 'label': 'PERSON', 'start_char': 26, 'end_char': 36}, {'text': 'Steve Wozniak', 'label': 'PERSON', 'start_char': 38, 'end_char': 51}, {'text': 'Ronald Wayne', 'label': 'PERSON', 'start_char': 57, 'end_char': 69}, {'text': 'April 1976', 'label': 'DATE', 'start_char': 73, 'end_char': 83}, {'text': 'Cupertino', 'label': 'GPE', 'start_char': 111, 'end_char': 120}, {'text': 'California', 'label': 'GPE', 'start_char': 122, 'end_char': 132}, {'text': '2023', 'label': 'DATE', 'start_char': 138, 'end_char': 142}, {'text': 'Apple', 'label': 'ORG', 'start_char': 144, 'end_char': 149}, {'text': 'Vision Pro', 'label': 'ORG', 'start_char': 168, 'end_char': 178}, {'text': 'Tim Cook', 'label': 'PERSON', 'start_char': 226, 'end_char': 234}]


In [14]:
predicted_entities = []
for ent in doc.ents:
    predicted_entities.append({
        'text': ent.text,
        'label': ent.label_,
        'start_char': ent.start_char,
        'end_char': ent.end_char
    })

print("Predicted entities extracted and formatted:")
print(predicted_entities)

Predicted entities extracted and formatted:
[{'text': 'Apple Inc.', 'label': 'ORG', 'start_char': 0, 'end_char': 10}, {'text': 'Steve Jobs', 'label': 'PERSON', 'start_char': 26, 'end_char': 36}, {'text': 'Steve Wozniak', 'label': 'PERSON', 'start_char': 38, 'end_char': 51}, {'text': 'Ronald Wayne', 'label': 'PERSON', 'start_char': 57, 'end_char': 69}, {'text': 'April 1976', 'label': 'DATE', 'start_char': 73, 'end_char': 83}, {'text': 'Cupertino', 'label': 'GPE', 'start_char': 117, 'end_char': 126}, {'text': 'California', 'label': 'GPE', 'start_char': 128, 'end_char': 138}, {'text': '2023', 'label': 'DATE', 'start_char': 143, 'end_char': 147}, {'text': 'Apple', 'label': 'ORG', 'start_char': 149, 'end_char': 154}, {'text': 'Vision Pro', 'label': 'ORG', 'start_char': 173, 'end_char': 183}, {'text': 'Tim Cook', 'label': 'PERSON', 'start_char': 251, 'end_char': 259}]


In [19]:
true_positives = 0
false_positives = 0
false_negatives = 0

def match_entity(entity1, entity2):
    return (
        entity1['text'] == entity2['text'] and
        entity1['label'] == entity2['label'] and
        entity1['start_char'] == entity2['start_char'] and
        entity1['end_char'] == entity2['end_char']
    )

matched_predicted_indices = set()
for gold_ent in gold_standard:
    found_match = False
    for i, pred_ent in enumerate(predicted_entities):
        if match_entity(gold_ent, pred_ent):
            true_positives += 1
            found_match = True
            matched_predicted_indices.add(i)
            break
    if not found_match:
        false_negatives += 1


for i, pred_ent in enumerate(predicted_entities):
    if i not in matched_predicted_indices:
        false_positives += 1

print(f"True Positives: {true_positives}")
print(f"False Positives: {false_positives}")
print(f"False Negatives: {false_negatives}")


True Positives: 5
False Positives: 6
False Negatives: 6


**Recall , Precision and f1 score**

In [18]:
precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1_score:.4f}")

Precision: 0.4545
Recall: 0.4545
F1-score: 0.4545
