* The goal is to extract named entities (names of people, organizations, companies, places, dates, etc.) from the text of the Wikipedia page for Henri Amand, available at the following link: Henri Amand - Wikipedia.

* The exercise then asks for a semantic summary of the extracted information, focusing on the following:

   - The 5 most frequent entity types.
   - The 5 most frequent entity mentions.
   - The 5 most frequent co-occurrences of entity types, where a co-occurrence is counted when two entity types appear in the same sentence.

* In addition, two bonuses are proposed:

    Bonus 1: Implementing the information extraction with two different NLP libraries, including aymara/lima (https://pypi.org/project/aymara/).
    Bonus 2: Document-level normalization of entity mentions (entity linking), using one of several possible methods, such as a sentence embedding model combined with a similarity computation.
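
The three summaries requested above reduce to counting over per-sentence lists of (mention, type) pairs. The following is a minimal sketch under that assumption: the `summarize` helper and the `toy_sentences` data are hypothetical and are not part of the exercise or of any library.

```python
from collections import Counter
from itertools import combinations

def summarize(sentences, k=5):
    """Return the top-k entity types, mentions, and type co-occurrences.

    `sentences` is a list of sentences, each a list of (mention, type) pairs.
    """
    type_counts = Counter()
    mention_counts = Counter()
    pair_counts = Counter()
    for entities in sentences:
        type_counts.update(label for _, label in entities)
        mention_counts.update(text for text, _ in entities)
        # Two types co-occur when both appear in the same sentence;
        # sorting gives each unordered pair a single canonical key.
        distinct_types = sorted({label for _, label in entities})
        pair_counts.update(combinations(distinct_types, 2))
    return (type_counts.most_common(k),
            mention_counts.most_common(k),
            pair_counts.most_common(k))

# Hypothetical input, for illustration only.
toy_sentences = [
    [("Henri Amand", "PER"), ("Paris", "LOC")],
    [("Henri Amand", "PER"), ("Jules Verne", "PER")],
    [("Paris", "LOC"), ("1906", "DATE"), ("Henri Amand", "PER")],
]
top_types, top_mentions, top_pairs = summarize(toy_sentences)
print(top_types)
print(top_mentions)
print(top_pairs)
```

Here a type pair is counted at most once per sentence, which is one possible reading of the co-occurrence definition given in the exercise.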

In [None]:
import tensorflow as tf

# Name of the first available GPU device (empty string if none is found)
device_name = tf.test.gpu_device_name()
device_name

## Data collection


In [10]:
import wikipedia
wikipedia.set_lang("fr")
ha = wikipedia.page('Henri Amand')
ha.content

'Henri Amand, dit Le capitaine à barbe,, né le 17 septembre 1873 à Paris VIe, mort le 29 septembre 1967 à Villeneuve-sur-Yonne, est un joueur français de rugby à XV évoluant au poste d\'ailier ou de demi d\'ouverture. Il est le capitaine de l\'équipe de France de rugby à XV lors du premier match officiel de son histoire, face à la Nouvelle-Zélande, le 1er janvier 1906 au Parc des Princes. \n\n\n== Biographie ==\nDe métier ingénieur civil, il est le fils d\'un commis d\'agent de change, Antoine Joseph Charles Emmanuel Amand et de sa femme Marie Berthe Garcet et ainsi le petit-fils du mathématicien Henri Garcet, cousin germain de Jules Verne.\nEn 1895, avec le colonel d\'Aigny et Frantz Reichel, il crée l\'équipe de rugby du 115e régiment d\'infanterie d\'Alençon, pour "matcher" celle des étudiants de la ville.\nHenri Amand a la carte no 1 d\'international français, sans avoir joué de rencontre officielle jusqu\'en 1906, mais grâce aux déplacements à Édimbourg (contre le Civil Service) e

# Solution 1

#### In this solution, I used the spaCy library.

In [11]:
!python -m spacy download fr_core_news_lg

Collecting fr-core-news-lg==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.7.0/fr_core_news_lg-3.7.0-py3-none-any.whl (571.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.8/571.8 MB 1.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_lg')


In [12]:
import spacy

# Load the French language model in SpaCy
nlp = spacy.load("fr_core_news_lg")

# Analyze the text
doc = nlp(ha.content)

# Extract named entities
named_entities = [(ent.text, ent.label_) for ent in doc.ents]

# Display named entities
print(named_entities)

[('Henri Amand', 'PER'), ('Paris VIe', 'LOC'), ('Villeneuve-sur-Yonne', 'LOC'), ('de France', 'LOC'), ('Nouvelle-Zélande', 'LOC'), ('Parc des Princes', 'LOC'), ('Antoine Joseph Charles Emmanuel Amand', 'PER'), ('Marie Berthe Garcet', 'PER'), ('Henri Garcet', 'PER'), ('Jules Verne', 'PER'), ('Aigny', 'LOC'), ('Frantz Reichel', 'PER'), ('Alençon', 'LOC'), ('Henri Amand', 'PER'), ('Édimbourg', 'LOC'), ('Civil Service', 'ORG'), ('Richmond', 'LOC'), ('Centre Park House FC', 'LOC'), ('Frantz Reichel', 'PER'), ('Louis Dedet', 'PER'), ('Amand', 'PER'), ('Nouvelle-Zélande', 'LOC'), ('Paris', 'LOC'), ('Parc des Princes', 'LOC'), ('Louis Dedet', 'PER'), ('Stade français', 'LOC'), ('Henri Amand', 'PER'), ('Henri Amand', 'PER'), ('XIe arrondissement de Paris', 'LOC'), ('Berthe Victorine Louisa Marcadet', 'PER'), ('Alfred Amand', 'PER'), ('front de Champagne', 'MISC'), ('Grande Guerre', 'MISC'), ('Georges André', 'PER'), ('championnat de France', 'MISC'), ('Palmarès', 'MISC'), ('Champion de France',

In [13]:
label_entities = set(ent.label_ for ent in doc.ents)
label_entities


# Explanation of each entity label:

for label in label_entities:
  print("label ", label, "  : ", spacy.explain(label))

label  ORG   :  Companies, agencies, institutions, etc.
label  LOC   :  Non-GPE locations, mountain ranges, bodies of water
label  MISC   :  Miscellaneous entities, e.g. events, nationalities, products or works of art
label  PER   :  Named person or family.


In [14]:
from collections import Counter

# 1. The 5 most frequent entity types
types_entites = [entite.label_ for entite in doc.ents]
types_entites_frequents = Counter(types_entites).most_common(5)
print("The 5 most frequent entity types:", types_entites_frequents)

# 2. The 5 most frequent entity mentions
mentions_entites = [entite.text for entite in doc.ents]
mentions_entites_frequentes = Counter(mentions_entites).most_common(5)
print("The 5 most frequent entity mentions:", mentions_entites_frequentes)

# 3. The 5 most frequent entity-type co-occurrences
# Note: this counts ordered pairs of distinct mentions within a sentence,
# so ('PER', 'LOC') and ('LOC', 'PER') are tallied separately.
cooccurrences = Counter(
    (ent1.label_, ent2.label_)
    for ent1 in doc.ents
    for ent2 in doc.ents
    if ent1 != ent2 and ent1.sent == ent2.sent
)
cooccurrences_frequentes = cooccurrences.most_common(5)
print("The 5 most frequent entity-type co-occurrences:", cooccurrences_frequentes)


The 5 most frequent entity types: [('PER', 18), ('LOC', 18), ('MISC', 12), ('ORG', 5)]
The 5 most frequent entity mentions: [('Henri Amand', 5), ('de France', 2), ('Nouvelle-Zélande', 2), ('Parc des Princes', 2), ('Frantz Reichel', 2)]
The 5 most frequent entity-type co-occurrences: [(('PER', 'LOC'), 79), (('LOC', 'PER'), 79), (('LOC', 'LOC'), 66), (('PER', 'PER'), 62), (('MISC', 'MISC'), 42)]
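
The cell above iterates over ordered pairs of distinct mentions, so every mixed-type pair appears under both orders and every same-type pair is visited twice. One way to read the result as unordered pairs is sketched below; `fold_ordered_pairs` is a hypothetical helper, and the input numbers are simply the five counts printed above. The folded values are still counts of mention pairs, not counts of sentences.

```python
from collections import Counter

def fold_ordered_pairs(ordered):
    """Collapse a Counter of ordered label pairs into unordered pairs.

    Assumes `ordered` was built by iterating over all ordered pairs of
    distinct mentions, so count[(a, b)] == count[(b, a)] and every
    same-label pair was visited twice.
    """
    unordered = Counter()
    for (a, b), count in ordered.items():
        if a == b:
            unordered[(a, b)] = count // 2  # each pair was visited twice
        else:
            # Both orders carry the same count; keep it once under a sorted key.
            unordered[tuple(sorted((a, b)))] = count
    return unordered

# The five counts printed by the cell above.
ordered = Counter({("PER", "LOC"): 79, ("LOC", "PER"): 79,
                   ("LOC", "LOC"): 66, ("PER", "PER"): 62,
                   ("MISC", "MISC"): 42})
folded = fold_ordered_pairs(ordered)
print(folded.most_common())
```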


# Solution 2 / Bonus 1

#### Using the LIMA library

In [55]:
!pip install --upgrade pip # IMPORTANT: LIMA needs a recent pip
!pip install aymara==0.4.1

Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 9.1 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.0
Collecting aymara
  Downloading aymara-0.4.1-cp37-abi3-manylinux_2_28_x86_64.whl (276.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 276.1/276.1 MB 1.5 MB/s eta 0:00:00
Collecting pyconll (from aymara)
  Downloading pyconll-3.2.0-py3-none-any.whl.metadata (8.0 kB)
Collecting shiboken2 (from aymara)
  Downloading shiboken2-5.15.2.1-5.15.2-cp35.cp36.cp37.cp38.cp39.cp310-abi3-manylinux1_x86_64.whl (975 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 975.4/975.4 kB 33.8 MB/s eta 0:00:00
Collecting unix-ar (from ay

In [57]:
import aymara.lima
l = aymara.lima.Lima("ud-eng")
r = l.analyzeText("The author wrote a novel.", lang="ud-eng")
print(r)




In [62]:
!lima_models.py -l eng

Language: english, code: eng
Installation dir: /root/.local/share/lima/resources
Downloading https://github.com/aymara/lima-models/releases/download/v0.1.5/lima-deep-models-eng-english_0.1.5_all.deb
100% 678M/678M [00:11<00:00, 56.6MiB/s]


In [65]:
!lima_models.py -l fra

Language: french, code: fra
Installation dir: /root/.local/share/lima/resources
Downloading https://github.com/aymara/lima-models/releases/download/v0.1.5/lima-deep-models-fra-french_0.1.5_all.deb
100% 675M/675M [00:14<00:00, 47.5MiB/s]


In [2]:

import aymara.lima
nlp = aymara.lima.Lima("ud-fra")
sentences = nlp('hello, world!')
print(sentences[0][0].lemma)
print(sentences.conll())


IndexError: list index out of range

# Solution 3

#### In this solution, I used the Flair NLP library.

In [15]:
!pip install flair
!pip install wikipedia
!pip install --upgrade urllib3

Collecting urllib3<2.0.0,>=1.0.0 (from flair)
  Downloading urllib3-1.26.18-py2.py3-none-any.whl.metadata (48 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48.9/48.9 kB 910.7 kB/s eta 0:00:00
Downloading urllib3-1.26.18-py2.py3-none-any.whl (143 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 143.8/143.8 kB 4.5 MB/s eta 0:00:00
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 2.2.0
    Uninstalling urllib3-2.2.0:
      Successfully uninstalled urllib3-2.2.0
Successfully installed urllib3-1.26.18

Collecting urllib3
  Downloading urllib3-2.2.0-py3-none-any.whl.metadata (6.4 kB)
Downloading urllib3-2.2.0-py3-none-any.whl (120 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 120.9/120.9 kB 1.3 MB/s eta 0:00:00
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.18
    Uninstalling urllib3-1.26.18:
      Successfully uninstalled urllib3-1.26.18
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
botocore 1.34.34 requires urllib3<2.1,>=1.25.4; python_version >= "3.10", but you have urllib3 2.2.0 which is incompatible.
flair 0.13.1 requires urllib3<2.0.0,>=1.0.0, but you have urllib3 2.2.0 which is incompatible.
Successfully installed urllib3-2.2.0

In [17]:
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.splitter import SegtokSentenceSplitter


# initialize sentence splitter
splitter = SegtokSentenceSplitter()

# use splitter to split text into list of sentences
sentences = splitter.split(ha.content)


# load the NER tagger
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")  # English OntoNotes tagger, applied here to French text
tagger.predict(sentences)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


2024-02-04 22:42:51,856 SequenceTagger predicts: Dictionary with 76 tags: <unk>, O, B-CARDINAL, E-CARDINAL, S-PERSON, S-CARDINAL, S-PRODUCT, B-PRODUCT, I-PRODUCT, E-PRODUCT, B-WORK_OF_ART, I-WORK_OF_ART, E-WORK_OF_ART, B-PERSON, E-PERSON, S-GPE, B-DATE, I-DATE, E-DATE, S-ORDINAL, S-LANGUAGE, I-PERSON, S-EVENT, S-DATE, B-QUANTITY, E-QUANTITY, S-TIME, B-TIME, I-TIME, E-TIME, B-GPE, E-GPE, S-ORG, I-GPE, S-NORP, B-FAC, I-FAC, E-FAC, B-NORP, E-NORP, S-PERCENT, B-ORG, E-ORG, B-LANGUAGE, E-LANGUAGE, I-CARDINAL, I-ORG, S-WORK_OF_ART, I-QUANTITY, B-MONEY


In [18]:
for sentence in sentences :
  print(sentence.get_spans('ner')[:10]) # doc to check all the NE types : https://huggingface.co/flair/ner-english-ontonotes-large

[Span[0:2]: "Henri Amand" → PERSON (1.0), Span[4:7]: "Le capitaine à" → PERSON (0.6721), Span[11:15]: "le 17 septembre 1873" → DATE (0.9998), Span[16:18]: "Paris VIe" → GPE (1.0), Span[20:24]: "le 29 septembre 1967" → DATE (0.9999), Span[25:26]: "Villeneuve-sur-Yonne" → GPE (0.9957)]
[Span[7:8]: "France" → GPE (0.9596), Span[14:15]: "premier" → ORDINAL (1.0), Span[23:25]: "la Nouvelle-Zélande" → GPE (0.9531), Span[27:30]: "1er janvier 1906" → DATE (0.9843), Span[31:34]: "Parc des Princes" → FAC (1.0)]
[Span[18:23]: "Antoine Joseph Charles Emmanuel Amand" → PERSON (1.0), Span[27:30]: "Marie Berthe Garcet" → PERSON (1.0), Span[36:38]: "Henri Garcet" → PERSON (1.0), Span[42:44]: "Jules Verne" → PERSON (1.0)]
[Span[1:2]: "1895" → DATE (1.0), Span[8:10]: "Frantz Reichel" → PERSON (1.0), Span[17:20]: "115e régiment d'infanterie" → ORG (0.9993)]
[Span[0:2]: "Henri Amand" → PERSON (1.0), Span[6:7]: "1" → CARDINAL (1.0), Span[17:18]: "1906" → DATE (1.0), Span[24:25]: "Édimbourg" → GPE (1.0), Sp

### Semantic summary

In [19]:
from collections import Counter
from itertools import combinations


# Return the set of distinct entity types found in a sentence
def extract_entity_types(sentence):
    entity_types = set()
    for entity in sentence.get_spans("ner"):
        entity_types.add(entity.tag)
    return entity_types


#### 1. The 5 most frequent entity types:

In [20]:
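# Types are deduplicated per sentence, so each count below is the
# number of sentences in which the type appears at least once.
entity_type_counter = Counter()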

for sentence in sentences:
    entity_types = extract_entity_types(sentence)
    entity_type_counter.update(entity_types)

top_entity_types = entity_type_counter.most_common(5)

print("Top 5 Entity Types:")
for entity_type, count in top_entity_types:
    print(f"Entity Type: {entity_type}, Count: {count}")


Top 5 Entity Types:
Entity Type: PERSON, Count: 10
Entity Type: DATE, Count: 10
Entity Type: GPE, Count: 8
Entity Type: ORG, Count: 6
Entity Type: CARDINAL, Count: 4
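
Because `extract_entity_types` returns a set, the figures above are sentence-level counts rather than mention-level counts. The difference between the two can be sketched as follows, using hypothetical tags that are not taken from the page:

```python
from collections import Counter

# Hypothetical tags for two sentences, for illustration only.
tags_per_sentence = [
    ["PERSON", "PERSON", "DATE"],
    ["PERSON", "GPE"],
]

sentence_level = Counter()  # one count per sentence containing the type
mention_level = Counter()   # one count per tagged mention
for tags in tags_per_sentence:
    sentence_level.update(set(tags))
    mention_level.update(tags)

print(sentence_level["PERSON"], mention_level["PERSON"])
```

Which of the two matches "most frequent entity types" depends on how the exercise is read; counting mentions is the closer analogue of the spaCy cell in Solution 1.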


#### 2. The 5 most frequent entity mentions:

In [21]:
# Return the list of entity mentions found in a sentence
def extract_entity_mentions(sentence):
    entity_mentions = [entity.text for entity in sentence.get_spans("ner")]
    return entity_mentions

entity_mention_counter = Counter()

for sentence in sentences:
    entity_mentions = extract_entity_mentions(sentence)
    entity_mention_counter.update(entity_mentions)

top_entity_mentions = entity_mention_counter.most_common(5)

print("Top 5 Entity Mentions:")
for entity_mention, count in top_entity_mentions:
    print(f"Entity Mention: {entity_mention}, Count: {count}")

Top 5 Entity Mentions:
Entity Mention: 1906, Count: 5
Entity Mention: Henri Amand, Count: 4
Entity Mention: 1, Count: 4
Entity Mention: France, Count: 3
Entity Mention: 1893, Count: 3


#### 3. The 5 most frequent entity-type co-occurrences:

In [22]:

cooccurrence_counter = Counter()

for sentence in sentences:
    entity_types = extract_entity_types(sentence)
    cooccurrence_counter.update(combinations(entity_types, 2))

top_cooccurrences = cooccurrence_counter.most_common(5)

print("Top 5 Co-occurrences:")
for cooccurrence, count in top_cooccurrences:
    print(f"Co-occurrence: {cooccurrence}, Count: {count}")

Top 5 Co-occurrences:
Co-occurrence: ('GPE', 'DATE'), Count: 8
Co-occurrence: ('PERSON', 'DATE'), Count: 5
Co-occurrence: ('PERSON', 'GPE'), Count: 4
Co-occurrence: ('GPE', 'FAC'), Count: 3
Co-occurrence: ('DATE', 'FAC'), Count: 3
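
`itertools.combinations` emits pairs in the order of its input, and a Python set has no canonical iteration order, so the same two types could in principle end up under two different keys. A minimal sketch of the effect and of one way to avoid it by sorting first; the `types_per_sentence` lists are hypothetical:

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-sentence type collections listed in different orders.
types_per_sentence = [["GPE", "DATE"], ["DATE", "GPE"], ["PERSON", "DATE"]]

raw = Counter()
canonical = Counter()
for types in types_per_sentence:
    raw.update(combinations(types, 2))                # key depends on input order
    canonical.update(combinations(sorted(types), 2))  # one key per unordered pair

print(raw)
print(canonical)
```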


## Bonus 2

In [23]:
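# Collect the entity mentions of each sentence
list_sentences = []

for sent in sentences:
  list_sentences.append(extract_entity_mentions(sent))

# Flatten the per-sentence lists into a single list of mentions
flattened_list = [item for sublist in list_sentences for item in sublist]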

In [24]:
import torch
from transformers import BertModel, BertTokenizer

# Load the model and the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Encode the entity mentions (no gradients are needed for inference)
embeddings = []
with torch.no_grad():
    for phrase in flattened_list:
        inputs = tokenizer(phrase, return_tensors="pt")
        outputs = model(**inputs)
        embeddings.append(outputs.pooler_output)
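
Once each mention has a vector, one way to turn pairwise similarities into disjoint groups is to link every pair above a threshold and take the connected components. The sketch below is only an illustration of that idea: it swaps in a union-find structure, uses plain Python lists as vectors, and the 2-D `vectors` and the `group_by_similarity` helper are hypothetical stand-ins for real mention embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def group_by_similarity(vectors, threshold):
    """Union-find grouping: i and j share a group if they are linked by a
    chain of pairs whose cosine similarity exceeds `threshold`."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Hypothetical 2-D vectors standing in for mention embeddings.
mentions = ["Henri Amand", "Amand", "Paris", "Paris VIe"]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
result = group_by_similarity(vectors, threshold=0.9)
for group in result:
    print([mentions[i] for i in group])
```

With connected components each mention lands in exactly one group, at the cost of chaining: two dissimilar mentions can be merged through a shared neighbour, so the threshold would need tuning on real embeddings.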

In [25]:
# Initialize the groups
groupes = {}
for i in range(len(embeddings)):
    groupes[i] = []

# Cosine similarity (Jaccard or Manhattan distance would have been alternatives)
for i in range(len(embeddings)):
    for j in range(i + 1, len(embeddings)):
        dot_product = (embeddings[i] * embeddings[j]).sum().item()
        magnitude_product = (embeddings[i].norm() * embeddings[j].norm()).item()
        cosine_similarity = dot_product / magnitude_product

        if cosine_similarity > 0.9:  # Tunable threshold
            groupes[i].append(j)

# Display the groups
for i, groupe in groupes.items():
    print(f"Groupe {i + 1}:")
    for j in groupe:
        print(f"\t{flattened_list[j]}")

Groupe 1:
	Le capitaine à
	Paris VIe
	France
	premier
	la Nouvelle-Zélande
	Marie Berthe Garcet
	1895
	Frantz Reichel
	Henri Amand
	1
	1906
	Édimbourg
	le Civil Service
	le Centre Park House FC
	1893
	Frantz Reichel
	Louis Dedet
	Amand
	premier
	la Nouvelle-Zélande
	Paris
	1er
	français
	1906
	Louis Dedet
	demi
	Henri Amand
	1,63 m
	Henri Amand
	Paris
	1899
	Alfred Amand
	Champagne
	la Grande Guerre
	1915
	Georges André
	1913
	France
	1893
	1894
	1897
	1898
	1901
	1903
	1892
	1896
	1899
	1904
	1905
	1906
	1906
	1
	France
	1
	1906
	demi
	1
	deux
	1893
	Pèlerinage chez Henri Amand
	°1
	Robert Roy
	ESPNscrum
	Fédération française de rugby
Groupe 2:
	le 17 septembre 1873
	Paris VIe
	le 29 septembre 1967
	France
	premier
	la Nouvelle-Zélande
	1er janvier 1906
	Parc des Princes
	Marie Berthe Garcet
	Henri Garcet
	1895
	Frantz Reichel
	Henri Amand
	1
	1906
	Édimbourg
	le Civil Service
	le Centre Park House FC
	1893
	Frantz Reichel
	Louis Dedet
	Amand
	premier
	la Nouvelle-Zélande
	1er janvier