# Knowledge Extraction Project - Part A: Preprocessing and Text Representation

**Université Paris Cité - Master 2 VMI**  
**Course:** IFLCE085 Recherche et extraction sémantique à partir de texte (Prof. Salima Benbernou)

**Team:**
- **Part A (Preprocessing): Jacques Gastebois**
- Part B: Boutayna EL MOUJAOUID
- Part C: Franz Dervis
- Part D: Aya Benkabour

---

## Dataset: NER (Named Entity Recognition)

This notebook processes a dataset of **2221 sentences** annotated for named entity recognition (the `data.csv` file loaded in the run recorded below contains 700 sentences).

**Original columns:**
- `id`: unique sentence identifier
- `words`: list of tokenized words
- `ner_tags`: NER tags (0=O, 1=B-LOC, 2=B-PER, 4=B-ORG)
- `text`: raw text of the sentence

**Columns added by this notebook:**
- `cleaned_text`: cleaned text (lowercased, special characters removed)
- `lemmatized_text`: text lemmatized with spaCy
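
As an illustration of the `ner_tags` encoding listed above, the sketch below maps tag IDs to labels and pairs them with tokens. `TAG_NAMES` covers only the four IDs documented here; the `decode_tags` helper and the `UNK-<id>` fallback for undocumented IDs are hypothetical names introduced for this example.

```python
# Tag IDs as documented in the column description above.
# Any other ID (e.g. 3) is not documented there, so it is reported as unknown.
TAG_NAMES = {0: "O", 1: "B-LOC", 2: "B-PER", 4: "B-ORG"}


def decode_tags(words, tag_ids):
    """Pair each token with its label; hypothetical helper for illustration."""
    if len(words) != len(tag_ids):
        raise ValueError("words and tag_ids must have the same length")
    return [(w, TAG_NAMES.get(t, f"UNK-{t}")) for w, t in zip(words, tag_ids)]


# Sentence en-doc6189-sent73 from the dataset preview.
words = ["France", "then", "postponed", "a", "visit", "by", "Sharon", "."]
tag_ids = [1, 0, 0, 0, 0, 0, 2, 0]

entities = [(w, label) for w, label in decode_tags(words, tag_ids) if label != "O"]
print(entities)  # [('France', 'B-LOC'), ('Sharon', 'B-PER')]
```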

## Step 1: Setup and Imports

In [14]:
import sys
# Install the base dependencies
!{sys.executable} -m pip install -q pandas numpy nltk scikit-learn spacy
!{sys.executable} -m spacy download en_core_web_sm

import os
import re
import pandas as pd
import numpy as np
import spacy

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

# Configure the pandas display
pd.set_option('display.max_colwidth', 100)

print("✅ Environment configured successfully.")

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
✅ Environment configured successfully.


## Step 2: Loading the Data

In [15]:
# Load the dataset
df = pd.read_csv('data.csv')

print(f"📊 Dataset loaded: {len(df)} sentences")
print(f"\nOriginal columns: {list(df.columns)}")
print("\nData preview:")
df.head()

📊 Dataset loaded: 700 sentences

Original columns: ['id', 'words', 'ner_tags', 'text']

Data preview:

|   | id | words | ner_tags | text |
|---|---|---|---|---|
| 0 | en-doc5809-sent11 | `['When' 'Aeneas' 'later' 'traveled' 'to' 'Hades' ',' 'he' 'called' 'to'\n 'her' 'ghost' 'but' 's...` | `[0 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]` | When Aeneas later traveled to Hades , he called to her ghost but she neither spoke to nor acknow... |
| 1 | en-doc6123-sent45 | `['On' '23' 'November' '1969' 'he' 'wrote' 'to' 'The' 'Times' 'newspaper'\n 'saying' 'that' 'the'...` | `[0 0 0 0 0 0 0 4 4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]` | On 23 November 1969 he wrote to The Times newspaper saying that the preparation for show trials ... |
| 2 | en-doc5831-sent40 | `['Stephenson' "'s" 'estimates' 'and' 'organising' 'ability' 'proved' 'to'\n 'be' 'inferior' 'to'...` | `[2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0]` | Stephenson 's estimates and organising ability proved to be inferior to those of Locke and the b... |
| 3 | en-doc6189-sent73 | `['France' 'then' 'postponed' 'a' 'visit' 'by' 'Sharon' '.']` | `[1 0 0 0 0 0 2 0]` | France then postponed a visit by Sharon . |
| 4 | en-doc6139-sent18 | `['Only' 'twenty-seven' 'years' 'old' 'at' 'his' 'death' ',' 'Moseley'\n 'could' 'in' 'many' 'sci...` | `[0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]` | Only twenty-seven years old at his death , Moseley could in many scientists ' opinions have cont... |


## Step 3: Text Cleaning

This step adds the `cleaned_text` column to the dataset.

In [16]:
def clean_text(text):
    """
    Clean the text: lowercase, remove special characters, normalize whitespace.
    """
    if not isinstance(text, str):
        return ""

    # 1. Lowercase
    text = text.lower()

    # 2. Replace special characters with spaces (keeps ASCII letters, digits and whitespace)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)

    # 3. Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning
print("🔄 Cleaning in progress...")
df['cleaned_text'] = df['text'].apply(clean_text)

print("✅ Column 'cleaned_text' added")
print("\nExample:")
print(f"  Original: {df.iloc[0]['text']}")
print(f"  Cleaned : {df.iloc[0]['cleaned_text']}")

🔄 Cleaning in progress...
✅ Column 'cleaned_text' added

Example:
  Original: When Aeneas later traveled to Hades , he called to her ghost but she neither spoke to nor acknowledged him .
  Cleaned : when aeneas later traveled to hades he called to her ghost but she neither spoke to nor acknowledged him
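
To make the effect of the three cleaning steps concrete, here is a self-contained sketch that repeats the `clean_text` logic and applies it to a few inputs. Apart from the dataset sentence, the inputs are made up for illustration. It also shows a side effect worth keeping in mind: punctuation tokens disappear and hyphenated words are split, so the cleaned text no longer lines up token-for-token with `words` and `ner_tags`.

```python
import re


def clean_text(text):
    """Same logic as the notebook cell: lowercase, strip specials, squeeze spaces."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Dataset sentence en-doc6189-sent73: 8 tokens including the final period.
original = "France then postponed a visit by Sharon ."
cleaned = clean_text(original)
print(cleaned)                # france then postponed a visit by sharon
print(len(original.split()))  # 8
print(len(cleaned.split()))   # 7

# Illustrative inputs (not from the dataset).
print(clean_text("Stephenson 's twenty-seven"))  # stephenson s twenty seven
print(clean_text("Café"))                        # caf
print(repr(clean_text(None)))                    # ''
```

Because the pattern keeps only ASCII letters and digits, accented characters are dropped as well. If that matters downstream, a Unicode-aware pattern such as `r"[^\w\s]"` is one possible alternative.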


## Step 4: Lemmatization

This step adds the `lemmatized_text` column to the dataset.

In [17]:
def lemmatize_text(text):
    """
    Lemmatize the text with spaCy.
    """
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc]
    return ' '.join(lemmas)

# Apply the lemmatization
print("🔄 Lemmatization in progress (this may take a few minutes)...")
df['lemmatized_text'] = df['cleaned_text'].apply(lemmatize_text)

print("✅ Column 'lemmatized_text' added")
print("\nExample of lemmatized text:")
print(f"  {df.iloc[0]['lemmatized_text'][:100]}...")

🔄 Lemmatization in progress (this may take a few minutes)...
✅ Column 'lemmatized_text' added

Example of lemmatized text:
  when aeneas later travel to hade he call to her ghost but she neither speak to nor acknowledge he...
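
Calling `nlp(text)` once per row works, but spaCy also provides `nlp.pipe`, which processes texts in batches and is usually faster on a full column. The sketch below is one possible variant, not the method used in this notebook. `lemmatize_many` and its `batch_size` default are assumptions made for this example, and `nlp` is expected to be the pipeline loaded in the setup cell.

```python
def lemmatize_many(texts, nlp, batch_size=64):
    """Lemmatize an iterable of strings in batches with nlp.pipe.

    Hypothetical helper: `nlp` is a loaded spaCy pipeline (for example the
    `en_core_web_sm` model from the setup cell). Returns one space-joined
    lemma string per input text, in the same order as the inputs.
    """
    results = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append(" ".join(token.lemma_ for token in doc))
    return results


# Usage in the notebook (assumes `df` and `nlp` already exist):
# df["lemmatized_text"] = lemmatize_many(df["cleaned_text"].tolist(), nlp)
```

The output should match the row-by-row version, since both run the same pipeline; only the batching differs.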


## Step 5: Saving the Enriched Dataset

In [18]:
# Check the columns
print("📋 Columns of the enriched dataset:")
print(list(df.columns))
print(f"\nNumber of rows: {len(df)}")

# Preview
df.head()

📋 Columns of the enriched dataset:
['id', 'words', 'ner_tags', 'text', 'cleaned_text', 'lemmatized_text']

Number of rows: 700

|   | id | words | ner_tags | text | cleaned_text | lemmatized_text |
|---|---|---|---|---|---|---|
| 0 | en-doc5809-sent11 | `['When' 'Aeneas' 'later' 'traveled' 'to' 'Hades' ',' 'he' 'called' 'to'\n 'her' 'ghost' 'but' 's...` | `[0 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]` | When Aeneas later traveled to Hades , he called to her ghost but she neither spoke to nor acknow... | when aeneas later traveled to hades he called to her ghost but she neither spoke to nor acknowle... | when aeneas later travel to hade he call to her ghost but she neither speak to nor acknowledge he |
| 1 | en-doc6123-sent45 | `['On' '23' 'November' '1969' 'he' 'wrote' 'to' 'The' 'Times' 'newspaper'\n 'saying' 'that' 'the'...` | `[0 0 0 0 0 0 0 4 4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]` | On 23 November 1969 he wrote to The Times newspaper saying that the preparation for show trials ... | on 23 november 1969 he wrote to the times newspaper saying that the preparation for show trials ... | on 23 november 1969 he write to the times newspaper say that the preparation for show trial in c... |
| 2 | en-doc5831-sent40 | `['Stephenson' "'s" 'estimates' 'and' 'organising' 'ability' 'proved' 'to'\n 'be' 'inferior' 'to'...` | `[2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0]` | Stephenson 's estimates and organising ability proved to be inferior to those of Locke and the b... | stephenson s estimates and organising ability proved to be inferior to those of locke and the bo... | stephenson s estimate and organise ability prove to be inferior to those of locke and the board ... |
| 3 | en-doc6189-sent73 | `['France' 'then' 'postponed' 'a' 'visit' 'by' 'Sharon' '.']` | `[1 0 0 0 0 0 2 0]` | France then postponed a visit by Sharon . | france then postponed a visit by sharon | france then postpone a visit by sharon |
| 4 | en-doc6139-sent18 | `['Only' 'twenty-seven' 'years' 'old' 'at' 'his' 'death' ',' 'Moseley'\n 'could' 'in' 'many' 'sci...` | `[0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]` | Only twenty-seven years old at his death , Moseley could in many scientists ' opinions have cont... | only twenty seven years old at his death moseley could in many scientists opinions have contribu... | only twenty seven year old at his death moseley could in many scientist opinion have contribute ... |


In [19]:
# Save the enriched dataset
output_file = 'data_preprocessed.csv'
df.to_csv(output_file, index=False)

print(f"✅ Enriched dataset saved: {output_file}")
print(f"   Columns: {list(df.columns)}")
print(f"   Number of rows: {len(df)}")

✅ Enriched dataset saved: data_preprocessed.csv
   Columns: ['id', 'words', 'ner_tags', 'text', 'cleaned_text', 'lemmatized_text']
   Number of rows: 700


## Summary

### Enriched Dataset

The file `data_preprocessed.csv` now contains **6 columns**:

**Original columns (preserved):**
1. `id`: unique identifier
2. `words`: list of tokenized words
3. `ner_tags`: NER tags
4. `text`: original text

**Added columns:**
5. `cleaned_text`: cleaned text (lowercased, special characters removed)
6. `lemmatized_text`: text lemmatized with spaCy

### Usage (Part B)

```python
import pandas as pd

# Load the enriched dataset
df = pd.read_csv('data_preprocessed.csv')

# Access the different versions of the text
original_texts = df['text']
cleaned_texts = df['cleaned_text']
lemmatized_texts = df['lemmatized_text']

# Access the original NER annotations
ner_tags = df['ner_tags']
```
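
One detail to keep in mind when reloading: the dataset preview suggests that `words` and `ner_tags` are stored in the CSV as NumPy-style strings (for example `[1 0 0 0 0 0 2 0]`), so they come back as text rather than lists. Assuming that format holds for every row, they could be parsed as sketched below. `parse_tags` and `parse_words` are hypothetical helpers, not part of the notebook.

```python
import re


def parse_tags(raw):
    """Parse a string such as '[1 0 0 2 0]' into a list of ints."""
    return [int(tok) for tok in raw.strip().strip("[]").split()]


def parse_words(raw):
    """Parse a string such as "['France' 'then']" into a list of tokens.

    Tokens are assumed to be quoted with ' or " (e.g. "'s") and to contain
    no escaped quotes.
    """
    pairs = re.findall(r"'([^']*)'|\"([^\"]*)\"", raw)
    return [single or double for single, double in pairs]


# Values copied from row 3 of the preview (sentence en-doc6189-sent73).
raw_words = "['France' 'then' 'postponed' 'a' 'visit' 'by' 'Sharon' '.']"
raw_tags = "[1 0 0 0 0 0 2 0]"

words = parse_words(raw_words)
tags = parse_tags(raw_tags)
print(words)  # ['France', 'then', 'postponed', 'a', 'visit', 'by', 'Sharon', '.']
print(tags)   # [1, 0, 0, 0, 0, 0, 2, 0]
print(len(words) == len(tags))  # True
```

With pandas, the helpers could be applied column-wise, for instance `df['ner_tags'].apply(parse_tags)`.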

### Applied Pipeline

1. **Cleaning**: lowercasing + removal of special characters + whitespace normalization
2. **Lemmatization**: spaCy `en_core_web_sm`

**Note:** All original columns are preserved unchanged.