<a href="https://colab.research.google.com/github/OdysseusPolymetis/ia_et_shs/blob/main/3_named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>**Named Entity Recognition: basics**</center>

---



**Named entities** are elements that are culturally identifiable, such as people or places. They are not necessarily a single word. For example, "Rome" is a named entity, but an NER tool could also associate it with the expression "the city of seven hills".

# **Named Entity Recognition with `flair`**

In [None]:
!pip install flair
!pip install stanza

In [None]:
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("UGARIT/flair_grc_bert_ner")

# **Test on a sentence**

In [None]:
sentence = Sentence('ταῦτα εἴπας ὁ Ἀλέξανδρος παρίζει Πέρσῃ ἀνδρὶ ἄνδρα Μακεδόνα ὡς γυναῖκα τῷ λόγῳ · οἳ δέ , ἐπείτε σφέων οἱ Πέρσαι ψαύειν ἐπειρῶντο , διεργάζοντο αὐτούς .')
tagger.predict(sentence)
for entity in sentence.get_spans('ner'):
    print(entity)

Span[3:4]: "Ἀλέξανδρος" → PER (0.9974)
Span[5:6]: "Πέρσῃ" → MISC (0.9951)
Span[8:9]: "Μακεδόνα" → MISC (0.9954)
Span[20:21]: "Πέρσαι" → MISC (0.9944)
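
Each printed span pairs an entity with a label and a confidence score. Here is a minimal sketch of how such results might be filtered by label and by score. It uses plain tuples copied from the output above rather than live `flair` objects, and the `filter_entities` helper and the `0.995` threshold are illustrative assumptions, not part of `flair`.

```python
# Toy data copied from the tagger output above: (text, label, score).
predicted = [
    ("Ἀλέξανδρος", "PER", 0.9974),
    ("Πέρσῃ", "MISC", 0.9951),
    ("Μακεδόνα", "MISC", 0.9954),
    ("Πέρσαι", "MISC", 0.9944),
]

def filter_entities(entities, labels, min_score=0.0):
    """Keep entities whose label is in `labels` and whose score is >= min_score."""
    return [e for e in entities if e[1] in labels and e[2] >= min_score]

# Hypothetical usage: people only, then MISC entities above an arbitrary threshold.
people = filter_entities(predicted, {"PER"})
confident_misc = filter_entities(predicted, {"MISC"}, min_score=0.995)
print(people)
print(confident_misc)
```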


# **Test on a whole text file**

In [None]:
import stanza
import numpy as np

In [None]:
!wget https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/odyssee_integrale.txt

In [None]:
with open('/content/odyssee_integrale.txt', 'r', encoding='utf-8') as file:
    text = file.read()

In [None]:
stanza.download('grc')
nlp = stanza.Pipeline(lang='grc', processors='tokenize,lemma')

In [None]:
window=50

In [None]:
doc=nlp(text)

from collections import defaultdict
import numpy as np

# cooccurrence matrix
cooccurrence_dict = defaultdict(lambda: defaultdict(int))

for sentence in doc.sentences:
    sentence_text = sentence.text

    # prediction with flair
    ner_sentence = Sentence(sentence_text)
    tagger.predict(ner_sentence)

    # keep only the entities labelled PER
    ner_entities = [(entity.text, entity.start_position, entity.end_position) for entity in ner_sentence.get_spans('ner') if entity.get_label('ner').value == 'PER']

    # For each entity, find the other entities that start within `window` characters
    # (flair's start_position is a character offset, not a word index)
    for i, (entity_text_i, start_i, end_i) in enumerate(ner_entities):
        lemma_i = ' '.join([token.lemma for token in sentence.words if entity_text_i in token.text])

        for j, (entity_text_j, start_j, end_j) in enumerate(ner_entities):
            if i != j and abs(start_i - start_j) <= window:
                lemma_j = ' '.join([token.lemma for token in sentence.words if entity_text_j in token.text])
                cooccurrence_dict[lemma_i][lemma_j] += 1

# matrix
entities = list(cooccurrence_dict.keys())
matrix_size = len(entities)
cooccurrence_matrix = np.zeros((matrix_size, matrix_size), dtype=int)

for i, entity_i in enumerate(entities):
    for j, entity_j in enumerate(entities):
        cooccurrence_matrix[i, j] = cooccurrence_dict[entity_i][entity_j]

# printing the matrix
print(cooccurrence_matrix)


[[0 2 0 ... 0 0 0]
 [2 0 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
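
The loop above compares `start_position` values, which in `flair` are character offsets, so `window=50` means 50 characters rather than 50 words. Here is a minimal sketch of the same counting done on token indices, assuming each mention has already been reduced to a `(lemma, token_index)` pair. The `count_cooccurrences` helper and the sample mentions are hypothetical.

```python
from collections import defaultdict

def count_cooccurrences(entities, window):
    """Count pairs of differently named mentions at most `window` tokens apart.

    `entities` is a list of (name, token_index) pairs for one sentence.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, (name_i, pos_i) in enumerate(entities):
        for name_j, pos_j in entities[i + 1:]:
            if name_i != name_j and abs(pos_i - pos_j) <= window:
                # Record the pair in both directions so the matrix stays symmetric.
                counts[name_i][name_j] += 1
                counts[name_j][name_i] += 1
    return counts

# Hypothetical mentions: (lemma, index of the token in the sentence).
mentions = [("Ὀδυσσεύς", 2), ("Ἀθήνη", 5), ("Τηλέμαχος", 40)]
counts = count_cooccurrences(mentions, window=10)
print({name: dict(neighbours) for name, neighbours in counts.items()})
```

With a word-based window, the value of `window` no longer depends on how long the words are.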


In [None]:
import networkx as nx

G = nx.Graph()

for i, entity in enumerate(entities):
    G.add_node(i, label=entity)

for i, row in enumerate(cooccurrence_matrix):
    for j, weight in enumerate(row):
        if weight > 0 and i != j:
            G.add_edge(i, j, weight=weight)

# export for gephi
nx.write_gexf(G, "network_direct.gexf")
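
Because the co-occurrence matrix is symmetric, each pair appears twice in it, and many pairs co-occur only once. Here is a minimal sketch of reading each pair once and dropping weak links before building the graph. It uses a small toy matrix shaped like the first rows printed above, and the `edges_from_matrix` helper, the placeholder labels and the `min_weight` parameter are assumptions made for illustration.

```python
def edges_from_matrix(matrix, labels, min_weight=1):
    """Return (label_a, label_b, weight) tuples from the upper triangle of a symmetric matrix."""
    edges = []
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):  # upper triangle: each pair is visited once
            weight = matrix[i][j]
            if weight >= min_weight:
                edges.append((labels[i], labels[j], weight))
    return edges

# Toy symmetric matrix; the labels are placeholders, not real entities.
toy_matrix = [
    [0, 2, 0],
    [2, 0, 1],
    [0, 1, 0],
]
toy_labels = ["A", "B", "C"]
print(edges_from_matrix(toy_matrix, toy_labels))
print(edges_from_matrix(toy_matrix, toy_labels, min_weight=2))
```

The resulting tuples could then be passed to `G.add_weighted_edges_from(...)`, which would make the entity names themselves the node identifiers in Gephi.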

## If you want more entity types (including MISC, for example)

In [None]:
cooccurrence_dict = defaultdict(lambda: defaultdict(int))

for sentence in doc.sentences:
    sentence_text = sentence.text

    # prediction with flair
    ner_sentence = Sentence(sentence_text)
    tagger.predict(ner_sentence)

    # PER and MISC extraction
    ner_entities = [(entity.text, entity.start_position, entity.end_position) for entity in ner_sentence.get_spans('ner') if entity.get_label('ner').value in ['PER', 'MISC']]

    # For each entity, find the other entities that start within `window` characters
    # (flair's start_position is a character offset, not a word index)
    for i, (entity_text_i, start_i, end_i) in enumerate(ner_entities):
        lemma_i = ' '.join([token.lemma for token in sentence.words if entity_text_i in token.text])

        for j, (entity_text_j, start_j, end_j) in enumerate(ner_entities):
            if i != j and abs(start_i - start_j) <= window:
                lemma_j = ' '.join([token.lemma for token in sentence.words if entity_text_j in token.text])
                cooccurrence_dict[lemma_i][lemma_j] += 1

# matrix conversion
entities = list(cooccurrence_dict.keys())
matrix_size = len(entities)
cooccurrence_matrix = np.zeros((matrix_size, matrix_size), dtype=int)

for i, entity_i in enumerate(entities):
    for j, entity_j in enumerate(entities):
        cooccurrence_matrix[i, j] = cooccurrence_dict[entity_i][entity_j]

# printing the matrix
print(cooccurrence_matrix)

# **Named Entity Recognition with `stanza`**

Let's try with `stanza` and define a new Pipeline, this time for French and with the `ner` processor.

In [None]:
stanza_ner = stanza.Pipeline(lang='fr', processors='tokenize,ner')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json


INFO:stanza:Loading these models for language: fr (French):
| Processor | Package  |
------------------------
| tokenize  | combined |
| mwt       | combined |
| ner       | wikiner  |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
  checkpoint = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Loading: ner
  checkpoint = torch.load(filename, lambda storage, loc: storage)
  data = torch.load(self.filename, lambda storage, loc: storage)
  state = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Done loading processors!


Let's test it on _Les Misérables_.

In [None]:
!wget https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/miserables.txt

--2024-11-18 15:49:36--  https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/miserables.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3173790 (3.0M) [text/plain]
Saving to: ‘miserables.txt’


2024-11-18 15:49:36 (53.4 MB/s) - ‘miserables.txt’ saved [3173790/3173790]



In [None]:
!wget https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/stopwords_fr.txt

--2024-11-18 15:49:43--  https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/stopwords_fr.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5891 (5.8K) [text/plain]
Saving to: ‘stopwords_fr.txt’


2024-11-18 15:49:43 (85.0 MB/s) - ‘stopwords_fr.txt’ saved [5891/5891]



In [None]:
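# read the stopwords into a set (fast lookups), skipping blank lines
with open("/content/stopwords_fr.txt", encoding="utf-8") as f:
    stops = {line.strip() for line in f if line.strip()}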

In [None]:
filepath_of_text = "/content/miserables.txt"

In [None]:
full_text = open(filepath_of_text, encoding="utf-8").read()

This step may take some time (and may need a large GPU), since we are running it on the whole of _Les Misérables_, which is a very long novel. With the default setup here, it takes roughly 4 minutes.

In [None]:
ents_stanza = stanza_ner(full_text)
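
If memory is tight, one option is to feed the text to the pipeline in pieces instead of all at once. Here is a minimal sketch of splitting at paragraph breaks, assuming paragraphs are separated by blank lines. The `iter_chunks` helper and its `max_chars` parameter are hypothetical, and a single paragraph longer than `max_chars` is still yielded whole.

```python
def iter_chunks(text, max_chars=100_000):
    """Yield pieces of `text` of roughly at most `max_chars`, cut at paragraph breaks."""
    chunk, size = [], 0
    for paragraph in text.split("\n\n"):
        # Close the current chunk before it would grow past the limit.
        if chunk and size + len(paragraph) > max_chars:
            yield "\n\n".join(chunk)
            chunk, size = [], 0
        chunk.append(paragraph)
        size += len(paragraph)
    if chunk:
        yield "\n\n".join(chunk)

# Small illustrative sample, not taken from the novel.
sample = "Premier paragraphe.\n\nDeuxième paragraphe.\n\nTroisième paragraphe."
chunks = list(iter_chunks(sample, max_chars=40))
print(chunks)
```

Each chunk could then be passed to `stanza_ner(chunk)` and the resulting `.ents` lists concatenated. Sentence segmentation near the cut points may differ slightly from a single run over the whole text.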

In [None]:
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in ents_stanza.ents[:50]], sep='\n')

entity: Chapitre I	type: MISC
entity: Monsieur Myriel	type: PER
entity: M. Charles-François-Bienvenu Myriel	type: PER
entity: Digne	type: LOC
entity: Digne	type: LOC
entity: Myriel	type: PER
entity: Aix	type: LOC
entity: Charles Myriel	type: PER
entity: Charles Myriel	type: PER
entity: Italie	type: LOC
entity: M. Myriel	type: PER
entity: Italie	type: LOC
entity: M. Myriel	type: PER
entity: Brignolles	type: LOC
entity: Paris	type: LOC
entity: M. le cardinal Fesch	type: PER
entity: Napoléon	type: PER
entity: M. Myriel	type: PER
entity: M. Myriel	type: PER
entity: Digne	type: LOC
entity: M. Myriel	type: PER
entity: Myriel	type: PER
entity: Myriel	type: PER
entity: Digne	type: LOC
entity: Myriel	type: PER
entity: Digne	type: LOC
entity: Baptistine	type: PER
entity: Baptistine	type: PER
entity: Magloire	type: PER
entity: M. le Curé_	type: PER
entity: Mademoiselle Baptistine	type: PER
entity: Madame Magloire	type: PER
entity: M.	type: PER
entity: Myriel	type: PER
entity: Chapitre II	type: MI

In [None]:
from collections import Counter

per_counter = Counter()
loc_counter = Counter()

for sentence in ents_stanza.sentences:
    for ent in sentence.ents:
        ent_text_lower = ent.text.lower()
        if ent_text_lower not in stops:
            if ent.type == 'PER':
                per_counter[ent.text] += 1
            elif ent.type == 'LOC':
                loc_counter[ent.text] += 1

most_common_per = per_counter.most_common()
print("Most frequent PER entities :")
for ent, freq in most_common_per[:10]:
    print(f"{ent}: {freq}")

most_common_loc = loc_counter.most_common()
print("\nMost frequent LOC entities :")
for ent, freq in most_common_loc[:10]:
    print(f"{ent}: {freq}")

Most frequent PER entities :
Marius: 1301
Jean Valjean: 1035
Cosette: 954
Thénardier: 518
Javert: 447
Gavroche: 305
Enjolras: 242
Fauchelevent: 237
Fantine: 188
Jondrette: 149

Most frequent LOC entities :
Paris: 391
Montfermeil: 73
France: 68
la France: 67
Montparnasse: 64
Angleterre: 56
Luxembourg: 54
Montreuil-sur-mer: 53
Digne: 52
Europe: 51
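
In the LOC list, "France" and "la France" are counted as two separate entities. Here is a minimal sketch of merging such variants by stripping a leading article before counting. The `normalize` helper and the list of articles are illustrative assumptions, and the rule is deliberately naive: it would also turn a name like "Le Havre" into "Havre".

```python
from collections import Counter

# Leading French articles to strip; an illustrative list, not exhaustive.
ARTICLES = ("le ", "la ", "les ", "l'", "l’")

def normalize(name):
    """Drop one leading article so that 'la France' and 'France' share a key."""
    lowered = name.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            return name[len(article):]
    return name

# Counts copied from the LOC output above.
loc_counts = Counter({"Paris": 391, "France": 68, "la France": 67})
merged = Counter()
for name, freq in loc_counts.items():
    merged[normalize(name)] += freq
print(merged.most_common())
```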
