<a href="https://colab.research.google.com/github/OdysseusPolymetis/enssib_class/blob/main/4_named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>**Named Entity Recognition: basics**</center>

---



**Named entities** are expressions that refer to a specific, culturally identifiable referent, such as a person or a place. They are not necessarily a single word. For example, "Rome" is a named entity, but a NER tool could also associate the expression "the city of seven hills" with it.

# **Named Entity Recognition with `flair`**

In [None]:
!pip install flair
!pip install stanza

In [None]:
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("UGARIT/flair_grc_bert_ner")

2024-04-10 10:17:48,589 SequenceTagger predicts: Dictionary with 15 tags: O, S-PER, B-PER, E-PER, I-PER, S-MISC, B-MISC, E-MISC, I-MISC, S-LOC, B-LOC, E-LOC, I-LOC, <START>, <STOP>
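The tag dictionary in the log above follows the BIOES scheme: `S-` marks a single-token entity, `B-`, `I-` and `E-` mark the beginning, inside and end of a multi-token entity, and `O` marks tokens outside any entity. `flair` groups tags into spans for you when you call `get_spans`. The sketch below is not flair's implementation, only a minimal illustration of the idea; the function name `decode_bioes` and the hand-written tag sequence are assumptions made for the example.

```python
def decode_bioes(tokens, tags):
    """Group tokens into (text, label) spans from a BIOES tag sequence."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        # "O" and special tags such as <START>/<STOP> carry no entity
        if tag == "O" or "-" not in tag:
            current_tokens, current_label = [], None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            # single-token entity
            spans.append((token, label))
            current_tokens, current_label = [], None
        elif prefix == "B":
            # start of a multi-token entity
            current_tokens, current_label = [token], label
        elif prefix in ("I", "E") and current_label == label:
            current_tokens.append(token)
            if prefix == "E":
                # end of the entity: emit the whole span
                spans.append((" ".join(current_tokens), label))
                current_tokens, current_label = [], None
        else:
            # malformed sequence (e.g. I- without a preceding B-): drop it
            current_tokens, current_label = [], None
    return spans


# Hand-written tags for illustration only (not produced by the model)
tokens = ["ὁ", "Ἀλέξανδρος", "παρίζει", "Πέρσῃ"]
tags = ["O", "S-PER", "O", "S-MISC"]
print(decode_bioes(tokens, tags))
```

A multi-token entity tagged `B-LOC`, `I-LOC`, `E-LOC` would come out as one span whose text joins the three tokens.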


# **Test on a sentence**

In [None]:
sentence = Sentence('ταῦτα εἴπας ὁ Ἀλέξανδρος παρίζει Πέρσῃ ἀνδρὶ ἄνδρα Μακεδόνα ὡς γυναῖκα τῷ λόγῳ · οἳ δέ , ἐπείτε σφέων οἱ Πέρσαι ψαύειν ἐπειρῶντο , διεργάζοντο αὐτούς .')
tagger.predict(sentence)
for entity in sentence.get_spans('ner'):
    print(entity)

Span[3:4]: "Ἀλέξανδρος" → PER (0.9974)
Span[5:6]: "Πέρσῃ" → MISC (0.9951)
Span[8:9]: "Μακεδόνα" → MISC (0.9954)
Span[20:21]: "Πέρσαι" → MISC (0.9944)


# **Test on a longer text**

In [None]:
import stanza
import numpy as np

In [None]:
!wget https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/odyssee_integrale.txt

--2024-04-10 10:18:54--  https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/odyssee_integrale.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042364 (1018K) [text/plain]
Saving to: ‘odyssee_integrale.txt’


2024-04-10 10:18:54 (36.7 MB/s) - ‘odyssee_integrale.txt’ saved [1042364/1042364]



In [None]:
with open('/content/odyssee_integrale.txt', 'r', encoding='utf-8') as file:
    text = file.read()

In [None]:
stanza.download('grc')
nlp = stanza.Pipeline(lang='grc', processors='tokenize,lemma')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: grc (Ancient_Greek) ...


Downloading https://huggingface.co/stanfordnlp/stanza-grc/resolve/v1.8.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/grc/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Loading these models for language: grc (Ancient_Greek):
| Processor | Package          |
--------------------------------
| tokenize  | perseus          |
| lemma     | perseus_nocharlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!


In [None]:
window = 50  # maximum distance, in characters, between the start offsets of two entities

In [None]:
doc=nlp(text)

from collections import defaultdict
import numpy as np

# cooccurrence matrix
cooccurrence_dict = defaultdict(lambda: defaultdict(int))

for sentence in doc.sentences:
    sentence_text = sentence.text

    # prediction with flair
    ner_sentence = Sentence(sentence_text)
    tagger.predict(ner_sentence)

    # getting PER NERs
    ner_entities = [(entity.text, entity.start_position, entity.end_position) for entity in ner_sentence.get_spans('ner') if entity.get_label('ner').value == 'PER']

    # For each entity, get the other entities that start within `window` characters
    # (flair's start_position is a character offset, not a word index)
    for i, (entity_text_i, start_i, end_i) in enumerate(ner_entities):
        lemma_i = ' '.join([token.lemma for token in sentence.words if entity_text_i in token.text])

        for j, (entity_text_j, start_j, end_j) in enumerate(ner_entities):
            if i != j and abs(start_i - start_j) <= window:
                lemma_j = ' '.join([token.lemma for token in sentence.words if entity_text_j in token.text])
                cooccurrence_dict[lemma_i][lemma_j] += 1

# matrix
entities = list(cooccurrence_dict.keys())
matrix_size = len(entities)
cooccurrence_matrix = np.zeros((matrix_size, matrix_size), dtype=int)

for i, entity_i in enumerate(entities):
    for j, entity_j in enumerate(entities):
        cooccurrence_matrix[i, j] = cooccurrence_dict[entity_i][entity_j]

# printing the matrix
print(cooccurrence_matrix)


[[0 2 0 ... 0 0 0]
 [2 0 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
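To make the counting logic easier to follow, here is a minimal sketch of the same idea on a toy sentence, using only the standard library (nested lists stand in for the NumPy array). The function name `cooccurrences`, the three mentions and their offsets are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cooccurrences(mentions, window):
    """Count pairs of mentions whose start offsets are at most `window` apart.

    `mentions` is a list of (name, start_offset) tuples for one sentence.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, (name_i, start_i) in enumerate(mentions):
        for j, (name_j, start_j) in enumerate(mentions):
            if i != j and abs(start_i - start_j) <= window:
                counts[name_i][name_j] += 1
    return counts


# Hypothetical mentions: (lemma, character offset in the sentence)
mentions = [("Ὀδυσσεύς", 0), ("Ἀθήνη", 30), ("Ζεύς", 120)]
counts = cooccurrences(mentions, window=50)

# Convert the nested dictionary into a square matrix
names = sorted(counts)
matrix = [[counts[a].get(b, 0) for b in names] for a in names]
print(names)
print(matrix)
```

Because every pair is visited in both directions, the matrix is symmetric. Note also that an entity with no neighbour inside the window (here "Ζεύς") never becomes a key of the dictionary, so it does not appear in the matrix at all.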


In [None]:
import networkx as nx

G = nx.Graph()

for i, entity in enumerate(entities):
    G.add_node(i, label=entity)

for i, row in enumerate(cooccurrence_matrix):
    for j, weight in enumerate(row):
        if weight > 0 and i != j:
            G.add_edge(i, j, weight=weight)

# export for gephi
nx.write_gexf(G, "network_direct.gexf")
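Before opening the file in Gephi, it can help to look at the strongest links directly. The sketch below lists each undirected edge once, sorted by weight, from a symmetric matrix. It uses plain lists and placeholder labels `A`, `B`, `C`; the helper name `weighted_edges` is an assumption for this example, and the values are the top-left corner of the matrix printed earlier.

```python
def weighted_edges(matrix, labels):
    """List each undirected edge once as (label_a, label_b, weight)."""
    edges = []
    for i in range(len(matrix)):
        # the matrix is symmetric, so the upper triangle is enough
        for j in range(i + 1, len(matrix)):
            if matrix[i][j] > 0:
                edges.append((labels[i], labels[j], matrix[i][j]))
    # strongest co-occurrences first
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


labels = ["A", "B", "C"]  # placeholder names, not real entities
matrix = [[0, 2, 0],
          [2, 0, 1],
          [0, 1, 0]]

for a, b, weight in weighted_edges(matrix, labels):
    print(a, b, weight)
```

With the real data, you would pass `cooccurrence_matrix.tolist()` and `entities` instead of the toy values.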

## If you want more entity types (including MISC, for example)

In [None]:
cooccurrence_dict = defaultdict(lambda: defaultdict(int))

for sentence in doc.sentences:
    sentence_text = sentence.text

    # prediction with flair
    ner_sentence = Sentence(sentence_text)
    tagger.predict(ner_sentence)

    # PER and MISC extraction
    ner_entities = [(entity.text, entity.start_position, entity.end_position) for entity in ner_sentence.get_spans('ner') if entity.get_label('ner').value in ['PER', 'MISC']]

    # For each entity, get the other entities that start within `window` characters
    # (flair's start_position is a character offset, not a word index)
    for i, (entity_text_i, start_i, end_i) in enumerate(ner_entities):
        lemma_i = ' '.join([token.lemma for token in sentence.words if entity_text_i in token.text])

        for j, (entity_text_j, start_j, end_j) in enumerate(ner_entities):
            if i != j and abs(start_i - start_j) <= window:
                lemma_j = ' '.join([token.lemma for token in sentence.words if entity_text_j in token.text])
                cooccurrence_dict[lemma_i][lemma_j] += 1

# matrix conversion
entities = list(cooccurrence_dict.keys())
matrix_size = len(entities)
cooccurrence_matrix = np.zeros((matrix_size, matrix_size), dtype=int)

for i, entity_i in enumerate(entities):
    for j, entity_j in enumerate(entities):
        cooccurrence_matrix[i, j] = cooccurrence_dict[entity_i][entity_j]

# printing the matrix
print(cooccurrence_matrix)

# **Named Entity Recognition with `stanza`**

Let's try with `stanza` and define a new Pipeline, this time for French and with the `ner` processor.

In [None]:
stanza_ner = stanza.Pipeline(lang='fr', processors='tokenize,ner')

Let's test it on _Les Misérables_.

In [None]:
!wget https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/miserables.txt

In [None]:
!wget https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/stopwords_fr.txt

In [None]:
with open("/content/stopwords_fr.txt", encoding="utf-8") as f:
    stops = set(f.read().split("\n"))  # a set makes the membership test used later fast

In [None]:
filepath_of_text = "/content/miserables.txt"

In [None]:
with open(filepath_of_text, encoding="utf-8") as f:
    full_text = f.read()

This step may take some time (and you may need a GPU with plenty of memory), as we're running it on the whole of _Les Misérables_, which is very long. With the setup used here, it takes roughly 4 minutes.

In [None]:
ents_stanza = stanza_ner(full_text)
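If the single call above runs out of memory, one option is to split the text into smaller pieces and run the pipeline on each piece in turn. The sketch below only shows the splitting step, in plain Python. It assumes that paragraphs in the file are separated by blank lines, which may not hold for this text; the function name `chunk_paragraphs` and the `max_chars` value are arbitrary choices for the example.

```python
def chunk_paragraphs(text, max_chars=20000):
    """Yield chunks made of whole paragraphs, closing a chunk before it exceeds `max_chars`.

    A single paragraph longer than `max_chars` is kept whole in its own chunk.
    """
    chunk, size = [], 0
    for paragraph in text.split("\n\n"):
        # close the current chunk if adding this paragraph would exceed the limit
        if chunk and size + len(paragraph) > max_chars:
            yield "\n\n".join(chunk)
            chunk, size = [], 0
        chunk.append(paragraph)
        size += len(paragraph) + 2  # +2 for the blank-line separator
    if chunk:
        yield "\n\n".join(chunk)


# Toy text standing in for the novel
sample = "aaa\n\nbbb\n\nccc"
chunks = list(chunk_paragraphs(sample, max_chars=8))
print(chunks)
```

Each chunk could then be passed to `stanza_ner` separately and the resulting entities collected in one list. Since chunks end on paragraph boundaries, no sentence should be cut in half as long as the blank-line assumption holds.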

In [None]:
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in ents_stanza.ents[:100]], sep='\n')

In [None]:
from collections import Counter

per_counter = Counter()
loc_counter = Counter()

for sentence in ents_stanza.sentences:
    for ent in sentence.ents:
        ent_text_lower = ent.text.lower()
        if ent_text_lower not in stops:
            if ent.type == 'PER':
                per_counter[ent.text] += 1
            elif ent.type == 'LOC':
                loc_counter[ent.text] += 1

most_common_per = per_counter.most_common()
print("Most frequent PER entities :")
for ent, freq in most_common_per[:10]:
    print(f"{ent}: {freq}")

most_common_loc = loc_counter.most_common()
print("\nMost frequent LOC entities :")
for ent, freq in most_common_loc[:10]:
    print(f"{ent}: {freq}")
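The loop above lowercases the entity text only for the stopword check and then counts the raw `ent.text`, so spelling variants such as "Jean Valjean" and "jean valjean" are counted separately. A minimal sketch of one way to merge them, assuming that case and extra whitespace are the only differences you want to ignore; the function name `count_entities` and the sample mentions are hypothetical.

```python
from collections import Counter


def count_entities(mentions, stops=frozenset()):
    """Count mentions, merging variants that differ only in case or spacing."""
    counts = Counter()
    display = {}  # first surface form seen for each normalized key
    for text in mentions:
        key = " ".join(text.lower().split())
        if not key or key in stops:
            continue
        counts[key] += 1
        display.setdefault(key, text)
    # report the counts under a readable surface form
    return Counter({display[key]: n for key, n in counts.items()})


# Hypothetical mentions, not taken from the actual output
mentions = ["Jean Valjean", "jean valjean", "Cosette", "Jean  Valjean", "Monsieur"]
per_counts = count_entities(mentions, stops={"monsieur"})
for ent, freq in per_counts.most_common(10):
    print(f"{ent}: {freq}")
```

With the real data, `mentions` would be the list of `ent.text` values for one entity type, and `stops` the stopword collection loaded earlier.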