# Beginner

## Exploring NER Tools

Research and compare two NER tools (e.g., Flair and spaCy). Summarize their features, strengths, and weaknesses, particularly as they relate to the analysis of medieval historical texts. Then choose one tool and use it to extract named entities from a text sample of your choice.

**This script does the following:**

- Imports necessary libraries for Flair and spaCy.
- Defines a sample medieval text.
- Performs NER using both Flair and spaCy.
- Prints the results from both tools.


**Key points for comparison:**

**Flair:**

**Strengths:** High accuracy, good at handling context, flexible and can be fine-tuned.

**Weaknesses:** Slower processing speed, higher memory requirements, may need fine-tuning for specific domains.


**spaCy:**

**Strengths:** Fast processing speed, comprehensive NLP pipeline, good out-of-the-box performance.

**Weaknesses:** Potentially lower accuracy than Flair in some cases, limited pre-trained models for historical texts, general-purpose entity types that are not tailored to historical sources.


In [1]:
# Install flair
!pip install flair --quiet


In [2]:
# Import necessary libraries
import flair
from flair.models import SequenceTagger
from flair.data import Sentence
import spacy
from spacy import displacy

# Sample medieval text
medieval_text = """
In the year of our Lord 1066, William, Duke of Normandy, crossed the English Channel
with his great army. He landed at Pevensey and marched towards Hastings. There, he
met the forces of King Harold Godwinson on the field of battle. The Norman knights,
supported by archers from Brittany, defeated the Saxon fyrd and housecarls of Wessex.
With this victory, William claimed the throne of England and was crowned at
Westminster Abbey.
"""

print("Sample Text:")
print(medieval_text)

# Flair NER
print("\nFlair NER Results:")
flair_tagger = SequenceTagger.load('ner')
flair_sentence = Sentence(medieval_text)
flair_tagger.predict(flair_sentence)

for entity in flair_sentence.get_spans('ner'):
    print(f"{entity.text} - {entity.tag}")

# spaCy NER
print("\nspaCy NER Results:")
nlp = spacy.load("en_core_web_sm")
doc = nlp(medieval_text)

for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")


# Choose one tool (spaCy in this case) for further analysis
print("\nDetailed spaCy Analysis:")
for ent in doc.ents:
    print(f"Entity: {ent.text}")
    print(f"Label: {ent.label_}")
    print(f"Start: {ent.start_char}")
    print(f"End: {ent.end_char}")
    print("---")

# Visualize spaCy results
displacy.render(doc, style="ent", jupyter=True)

Sample Text:

In the year of our Lord 1066, William, Duke of Normandy, crossed the English Channel
with his great army. He landed at Pevensey and marched towards Hastings. There, he
met the forces of King Harold Godwinson on the field of battle. The Norman knights,
supported by archers from Brittany, defeated the Saxon fyrd and housecarls of Wessex.
With this victory, William claimed the throne of England and was crowned at
Westminster Abbey.


Flair NER Results:


2024-07-04 14:03:39,644 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
William - PER
Normandy - LOC
English Channel - LOC
Pevensey - LOC
Hastings - LOC
Harold Godwinson - PER
Norman - MISC
Brittany - LOC
Saxon - MISC
Wessex - LOC
William - PER
England - LOC
Westminster Abbey - LOC

spaCy NER Results:
the year - DATE
William - PERSON
English - NORP
Pevensey - ORG
Hastings - WORK_OF_ART
Harold Godwinson - PERSON
Norman - PERSON
Brittany - GPE
Saxon - PERSON
Wessex - GPE
William - PERSON
England - GPE
Westminster Abbey - PERSON

Detailed spaCy Analysis:
Entity: the year
Label: DATE
Start: 4
End: 12
---
Entity: William
Label: PERSON
Start: 31
End: 38
---
Entity: English
Label: NORP
Start: 70
End: 77
---
Entity: Pevensey
Label: ORG
Start: 120
End: 128
---
Entity: Hastings
Label: WORK_OF_ART
Start: 149
End: 157
---
Entity: Harold Godwinson
Label: PE
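
The `Start` and `End` values printed above are character offsets into the original string (`ent.start_char` and `ent.end_char`), so slicing the text with them recovers each entity. The following minimal sketch checks this with plain Python, using offsets copied from the output above. The `context` helper and its `window` parameter are illustrative names, not part of spaCy.

```python
# Sample text exactly as defined in the notebook. The leading newline
# occupies offset 0, which is why "the year" starts at 4 rather than 3.
medieval_text = """
In the year of our Lord 1066, William, Duke of Normandy, crossed the English Channel
with his great army. He landed at Pevensey and marched towards Hastings. There, he
met the forces of King Harold Godwinson on the field of battle. The Norman knights,
supported by archers from Brittany, defeated the Saxon fyrd and housecarls of Wessex.
With this victory, William claimed the throne of England and was crowned at
Westminster Abbey.
"""

# (start, end, label) triples copied from the "Detailed spaCy Analysis" output.
spacy_offsets = [
    (4, 12, "DATE"),
    (31, 38, "PERSON"),
    (70, 77, "NORP"),
    (120, 128, "ORG"),
    (149, 157, "WORK_OF_ART"),
]


def context(text, start, end, window=15):
    """Return the span plus up to `window` characters on each side (hypothetical helper)."""
    left = max(0, start - window)
    right = min(len(text), end + window)
    return text[left:right].replace("\n", " ")


# Slicing with the offsets gives back the entity text.
recovered = [medieval_text[start:end] for start, end, _ in spacy_offsets]
print(recovered)

# Show each entity with a little surrounding text to make label errors easier to judge.
for start, end, label in spacy_offsets:
    print(f"{label:12} ...{context(medieval_text, start, end)}...")
```

Keeping the offsets rather than only the entity strings makes it possible to link each label back to its exact position in the source, which matters when the same name (such as William) occurs more than once.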

## Solution


**Comparison of Flair and spaCy:**

**Flair:**

- Strengths:
  * Generally high accuracy
  * Good at handling context
  * Flexible, can be fine-tuned for specific domains
- Weaknesses:
  * Slower processing speed
  * Requires more memory
  * May struggle with very domain-specific entities without fine-tuning

**spaCy:**

- Strengths:
  * Fast processing speed
  * Comprehensive NLP pipeline
  * Good out-of-the-box performance
- Weaknesses:
  * May have lower accuracy than Flair in some cases
  * Limited pre-trained models for historical texts
  * Entity types are general-purpose rather than tailored to historical sources

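**For medieval historical text analysis:**

Flair might be preferable because it can be fine-tuned, which could be crucial for handling archaic language and period-specific entities. In the sample above, Flair's default model labeled every place name as LOC, whereas spaCy's `en_core_web_sm` labeled Pevensey as ORG, Hastings as WORK_OF_ART, and Westminster Abbey as PERSON, and did not detect Normandy at all.
However, spaCy could be a good starting point because of its speed and ease of use, especially for larger corpora or when quick results are needed.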
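
To make the comparison concrete, the two result lists printed earlier can be lined up programmatically. The sketch below is plain Python over the printed output, so no models are loaded. The mapping `SPACY_TO_CONLL` is an assumption made for illustration: it projects spaCy's labels onto the four types used by Flair's default model, and a different project might reasonably choose another mapping.

```python
# Entity/label pairs copied from the notebook output.
flair_results = [
    ("William", "PER"), ("Normandy", "LOC"), ("English Channel", "LOC"),
    ("Pevensey", "LOC"), ("Hastings", "LOC"), ("Harold Godwinson", "PER"),
    ("Norman", "MISC"), ("Brittany", "LOC"), ("Saxon", "MISC"),
    ("Wessex", "LOC"), ("William", "PER"), ("England", "LOC"),
    ("Westminster Abbey", "LOC"),
]
spacy_results = [
    ("the year", "DATE"), ("William", "PERSON"), ("English", "NORP"),
    ("Pevensey", "ORG"), ("Hastings", "WORK_OF_ART"),
    ("Harold Godwinson", "PERSON"), ("Norman", "PERSON"), ("Brittany", "GPE"),
    ("Saxon", "PERSON"), ("Wessex", "GPE"), ("William", "PERSON"),
    ("England", "GPE"), ("Westminster Abbey", "PERSON"),
]

# Assumed projection of spaCy labels onto the PER/LOC/ORG/MISC scheme.
# Labels without an entry (e.g. DATE) map to None.
SPACY_TO_CONLL = {
    "PERSON": "PER", "GPE": "LOC", "LOC": "LOC", "FAC": "LOC",
    "ORG": "ORG", "NORP": "MISC", "WORK_OF_ART": "MISC",
}

# Index by entity text; repeated mentions (William) collapse to one entry.
flair_by_text = dict(flair_results)
spacy_by_text = {text: SPACY_TO_CONLL.get(label) for text, label in spacy_results}

shared = [t for t in flair_by_text if t in spacy_by_text]
agree = [t for t in shared if flair_by_text[t] == spacy_by_text[t]]
disagree = [t for t in shared if flair_by_text[t] != spacy_by_text[t]]
only_flair = [t for t in flair_by_text if t not in spacy_by_text]
only_spacy = [t for t in spacy_by_text if t not in flair_by_text]

print("Same label:     ", agree)
print("Different label:", disagree)
print("Only Flair:     ", only_flair)
print("Only spaCy:     ", only_spacy)
```

Matching on exact text is a deliberate simplification: it treats Flair's "English Channel" and spaCy's "English" as unrelated entities, whereas a comparison based on overlapping character offsets would report them as a boundary disagreement.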

The choice between Flair and spaCy (or any other NER tool) depends on the specific requirements of your project, such as the volume of text to be processed, the specificity of the entities you're looking for, and the resources available for training or fine-tuning models.
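
One way to judge whether a tool can keep up with your volume of text is to time it on a representative sample. The helper below is a minimal sketch using only the standard library. `time_tagger` and `dummy_tagger` are hypothetical names introduced here, and the dummy function stands in for a real model so that the block runs without any downloads.

```python
import time
from statistics import median


def time_tagger(tag_fn, text, repeats=5):
    """Return the median wall-clock seconds that tag_fn(text) takes over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        tag_fn(text)
        durations.append(time.perf_counter() - start)
    return median(durations)


def dummy_tagger(text):
    """Stand-in for a real tagger: returns whitespace tokens that start with a capital."""
    return [tok for tok in text.split() if tok[:1].isupper()]


sample = "William claimed the throne of England and was crowned at Westminster Abbey."
seconds = time_tagger(dummy_tagger, sample)
print(f"dummy_tagger: {seconds:.6f} s per call (median of 5 runs)")
```

With the objects from the notebook, `time_tagger(nlp, medieval_text)` would time spaCy and `time_tagger(lambda t: flair_tagger.predict(Sentence(t)), medieval_text)` would time Flair. Model loading is excluded from both measurements, and results will vary with hardware, so treat the numbers as a rough guide rather than a benchmark.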