# Knowledge Extraction Project - Part A: Preprocessing and Text Representation

**Université Paris Cité - Master 2 VMI**  
**Course:** IFLCE085 Recherche et extraction sémantique à partir de texte (Prof. Salima Benbernou)

**Team:**
- **Part A (Preprocessing): Jacques Gastebois**
- Part B: Boutayna EL MOUJAOUID
- Part C: Franz Dervis
- Part D: Aya Benkabour

---

## Dataset: NER (Named Entity Recognition)

This notebook processes a dataset of sentences annotated for named entity recognition. The dataset is documented as **2221 sentences**, whereas the `data.csv` file loaded in the run recorded below contains 700.

**Columns:**
- `id`: unique sentence identifier
- `words`: list of tokenized words
- `ner_tags`: NER tags (0=O, 1=B-LOC, 2=B-PER, 4=B-ORG)
- `text`: raw text of the sentence
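Judging from the data preview in Step 2, the `words` and `ner_tags` columns are stored in the CSV as strings that look like printed NumPy arrays (space-separated items, no commas), so they may need parsing before use. A minimal sketch, assuming that string format holds for every row; the helper names `parse_words` and `parse_tags` are hypothetical and not part of the notebook:

```python
import re

# Matches one quoted token: either '...' or "..." (double quotes appear
# when the token itself contains an apostrophe, e.g. "'s").
_TOKEN_RE = re.compile(r"'([^']*)'|\"([^\"]*)\"")

def parse_words(raw: str) -> list[str]:
    """Turn "['France' 'then' ...]" into ['France', 'then', ...]."""
    return [single or double for single, double in _TOKEN_RE.findall(raw)]

def parse_tags(raw: str) -> list[int]:
    """Turn "[1 0 0 2]" into [1, 0, 0, 2]."""
    return [int(tag) for tag in raw.strip("[]").split()]

words = parse_words("['France' 'then' 'postponed' 'a' 'visit' 'by' 'Sharon' '.']")
tags = parse_tags("[1 0 0 0 0 0 2 0]")
assert len(words) == len(tags)  # one tag per token
print(list(zip(words, tags)))
```

With a loaded DataFrame, these helpers could be applied column-wise, for example `df['words'].apply(parse_words)`.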

## Step 1: Setup and Imports

In [1]:
import sys
# Install base dependencies
!{sys.executable} -m pip install -q pandas numpy nltk scikit-learn spacy
!{sys.executable} -m spacy download en_core_web_sm

import os
import json
import re
import pickle
import pandas as pd
import numpy as np
import nltk
import spacy
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from scipy.sparse import save_npz

# Download NLTK resources
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('stopwords', quiet=True)

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

# Configure pandas display
pd.set_option('display.max_colwidth', 100)

print("✅ Environment configured successfully.")

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
✅ Environment configured successfully.


## Step 2: Loading and Exploring the Data

In [2]:
# Load the dataset
df = pd.read_csv('data.csv')

print(f"📊 Dataset loaded: {len(df)} sentences")
print(f"\nColumns: {list(df.columns)}")
print(f"\nData preview:")
df.head()

📊 Dataset loaded: 700 sentences

Columns: ['id', 'words', 'ner_tags', 'text']

Data preview:


| | id | words | ner_tags | text |
|---|---|---|---|---|
| 0 | en-doc5809-sent11 | `['When' 'Aeneas' 'later' 'traveled' 'to' 'Hades' ',' 'he' 'called' 'to'\n 'her' 'ghost' 'but' 's...` | `[0 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]` | When Aeneas later traveled to Hades , he called to her ghost but she neither spoke to nor acknow... |
| 1 | en-doc6123-sent45 | `['On' '23' 'November' '1969' 'he' 'wrote' 'to' 'The' 'Times' 'newspaper'\n 'saying' 'that' 'the'...` | `[0 0 0 0 0 0 0 4 4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]` | On 23 November 1969 he wrote to The Times newspaper saying that the preparation for show trials ... |
| 2 | en-doc5831-sent40 | `['Stephenson' "'s" 'estimates' 'and' 'organising' 'ability' 'proved' 'to'\n 'be' 'inferior' 'to'...` | `[2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0]` | Stephenson 's estimates and organising ability proved to be inferior to those of Locke and the b... |
| 3 | en-doc6189-sent73 | `['France' 'then' 'postponed' 'a' 'visit' 'by' 'Sharon' '.']` | `[1 0 0 0 0 0 2 0]` | France then postponed a visit by Sharon . |
| 4 | en-doc6139-sent18 | `['Only' 'twenty-seven' 'years' 'old' 'at' 'his' 'death' ',' 'Moseley'\n 'could' 'in' 'many' 'sci...` | `[0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]` | Only twenty-seven years old at his death , Moseley could in many scientists ' opinions have cont... |


In [3]:
# Statistics
print("📈 Dataset statistics:")
print(f"  Total number of sentences: {len(df)}")
print(f"  Mean text length: {df['text'].str.len().mean():.1f} characters")
print(f"  Min length: {df['text'].str.len().min()} characters")
print(f"  Max length: {df['text'].str.len().max()} characters")
print(f"\nExample sentence:")
print(f"  ID: {df.iloc[0]['id']}")
print(f"  Text: {df.iloc[0]['text']}")

📈 Dataset statistics:
  Total number of sentences: 700
  Mean text length: 132.6 characters
  Min length: 11 characters
  Max length: 509 characters

Example sentence:
  ID: en-doc5809-sent11
  Text: When Aeneas later traveled to Hades , he called to her ghost but she neither spoke to nor acknowledged him .


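## Step 3: Train/Dev/Test Split

The dataset is split into three sets. The counts in parentheses correspond to a 2221-sentence dataset; the run recorded below, on 700 sentences, yields 490 / 105 / 105.
- **Train**: 70% (1554 sentences)
- **Dev**: 15% (333 sentences)
- **Test**: 15% (334 sentences)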
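The split counts follow from the two-stage procedure used in this step: 30% is held out first, then that held-out part is halved. A small arithmetic sketch, assuming `train_test_split` rounds a fractional `test_size` up (ceiling) and assigns the remainder to the first returned set; `split_sizes` is a hypothetical helper written only for this illustration:

```python
import math

def split_sizes(n_samples: int) -> tuple[int, int, int]:
    """Return (train, dev, test) sizes for the two-stage 70/15/15 split."""
    # Stage 1: hold out 30% (rounded up) as a temporary set.
    n_temp = math.ceil(0.3 * n_samples)
    n_train = n_samples - n_temp
    # Stage 2: half of the temporary set (rounded up) becomes the test set.
    n_test = math.ceil(0.5 * n_temp)
    n_dev = n_temp - n_test
    return n_train, n_dev, n_test

print(split_sizes(2221))  # (1554, 333, 334)
print(split_sizes(700))   # (490, 105, 105)
```

The rounding explains why dev and test differ by one sentence when the held-out part has an odd size.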

In [4]:
# Two-stage random split: 70% train, 15% dev, 15% test
train_df, temp_df = train_test_split(df, test_size=0.3, random_state=42)
dev_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

print("📂 Split completed:")
print(f"  Train : {len(train_df)} sentences ({len(train_df)/len(df)*100:.1f}%)")
print(f"  Dev   : {len(dev_df)} sentences ({len(dev_df)/len(df)*100:.1f}%)")
print(f"  Test  : {len(test_df)} sentences ({len(test_df)/len(df)*100:.1f}%)")
print(f"  Total : {len(train_df) + len(dev_df) + len(test_df)} sentences")

📂 Split completed:
  Train : 490 sentences (70.0%)
  Dev   : 105 sentences (15.0%)
  Test  : 105 sentences (15.0%)
  Total : 700 sentences


## Step 4: Cleaning and Normalization

In [5]:
def clean_text(text):
    """
    Clean the text: lowercase, remove special characters, normalize whitespace.
    """
    if not isinstance(text, str):
        return ""

    # 1. Lowercase
    text = text.lower()

    # 2. Replace special characters with spaces (keeps ASCII letters, digits, and whitespace)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)

    # 3. Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning
train_df['cleaned_text'] = train_df['text'].apply(clean_text)
dev_df['cleaned_text'] = dev_df['text'].apply(clean_text)
test_df['cleaned_text'] = test_df['text'].apply(clean_text)

print("✅ Cleaning completed")
print(f"\nExample:")
print(f"  Original : {train_df.iloc[0]['text']}")
print(f"  Cleaned  : {train_df.iloc[0]['cleaned_text']}")

✅ Cleaning completed

Example:
  Original : According to the conservative think tank Heritage Foundation , Hungary 's economy was 67.2 percent " free " in 2008 , which makes it the world 's 43rd-freest economy .
  Cleaned  : according to the conservative think tank heritage foundation hungary s economy was 67 2 percent free in 2008 which makes it the world s 43rd freest economy
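Because the character filter keeps only `a-z`, digits, and whitespace, it also splits decimals and hyphenated words and drops accented letters. A small self-contained sketch that repeats the `clean_text` logic from this step so it can run on its own; the sample strings are made up for illustration:

```python
import re

def clean_text(text):
    """Same logic as in this step: lowercase, filter characters, collapse spaces."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

# "\u00e9" is e-acute and "\u00fc" is u-umlaut (written as escapes to be unambiguous).
samples = ["67.2 percent", "43rd-freest", "Caf\u00e9 M\u00fcller", None]
for s in samples:
    print(repr(s), "->", repr(clean_text(s)))
# '67.2 percent' -> '67 2 percent'
# '43rd-freest' -> '43rd freest'
# 'Café Müller' -> 'caf m ller'
# None -> ''
```

Lowercasing also removes capitalization cues that can matter for entity recognition; the original `text` column is kept alongside `cleaned_text` in the exported CSV files.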


## Step 5: Lemmatization

In [6]:
def lemmatize_text(text):
    """
    Lemmatize the text with spaCy.
    """
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc]
    return ' '.join(lemmas)

def process_dataset_lemmatization(df, name):
    """Apply lemmatization to a dataset (adds a 'lemmatized_text' column in place)."""
    print(f"🔄 Lemmatizing {len(df)} sentences ({name})...")
    df['lemmatized_text'] = df['cleaned_text'].apply(lemmatize_text)
    print(f"✅ Lemmatization completed for {name}")

# Apply to the 3 sets
process_dataset_lemmatization(train_df, "TRAIN")
process_dataset_lemmatization(dev_df, "DEV")
process_dataset_lemmatization(test_df, "TEST")

print(f"\nExample of lemmatized text:")
print(f"  {train_df.iloc[0]['lemmatized_text'][:100]}...")

🔄 Lemmatizing 490 sentences (TRAIN)...
✅ Lemmatization completed for TRAIN
🔄 Lemmatizing 105 sentences (DEV)...
✅ Lemmatization completed for DEV
🔄 Lemmatizing 105 sentences (TEST)...
✅ Lemmatization completed for TEST

Example of lemmatized text:
  accord to the conservative think tank heritage foundation hungary s economy be 67 2 percent free in ...


## Step 6: TF-IDF Vector Representation

In [7]:
# Prepare the lemmatized texts
train_texts = train_df['lemmatized_text'].tolist()
dev_texts = dev_df['lemmatized_text'].tolist()
test_texts = test_df['lemmatized_text'].tolist()

# Create the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features=3000,  # Cap at 3000 features (smaller dataset)
    min_df=2,
    max_df=0.8,
    ngram_range=(1, 2)
)

# Fit on TRAIN, transform all sets
print("🔄 Computing TF-IDF...")
tfidf_train = tfidf_vectorizer.fit_transform(train_texts)
tfidf_dev = tfidf_vectorizer.transform(dev_texts)
tfidf_test = tfidf_vectorizer.transform(test_texts)

print(f"\n✅ TF-IDF matrices created:")
print(f"  TRAIN : {tfidf_train.shape} (Density: {tfidf_train.nnz / (tfidf_train.shape[0] * tfidf_train.shape[1]):.4f})")
print(f"  DEV   : {tfidf_dev.shape} (Density: {tfidf_dev.nnz / (tfidf_dev.shape[0] * tfidf_dev.shape[1]):.4f})")
print(f"  TEST  : {tfidf_test.shape} (Density: {tfidf_test.nnz / (tfidf_test.shape[0] * tfidf_test.shape[1]):.4f})")

🔄 Computing TF-IDF...

✅ TF-IDF matrices created:
  TRAIN : (490, 1680) (Density: 0.0103)
  DEV   : (105, 1680) (Density: 0.0088)
  TEST  : (105, 1680) (Density: 0.0089)
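The matrices have 1680 columns rather than the 3000 allowed by `max_features`, which suggests the cap was not reached: only 1680 unigrams and bigrams pass the `min_df=2` and `max_df=0.8` document-frequency filters. A pure-Python sketch of that kind of filter on a toy corpus (unigrams only, whitespace tokenization); `filter_vocabulary` is a hypothetical helper for illustration, not scikit-learn's implementation:

```python
from collections import Counter

def filter_vocabulary(docs, min_df=2, max_df=0.8):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each term once per document
    max_count = max_df * n_docs
    return sorted(t for t, c in df.items() if min_df <= c <= max_count)

docs = [
    "the economy be free",
    "the economy grow",
    "the heritage foundation",
    "the visit be postpone",
    "the ghost",
]
# "the" appears in 5 of 5 documents (above the 80% ceiling) and most other
# terms appear only once (below min_df), so only two terms survive.
print(filter_vocabulary(docs))  # ['be', 'economy']
```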


In [8]:
# Top features for the first training sentence
feature_names = tfidf_vectorizer.get_feature_names_out()
doc_0_vector = tfidf_train[0].toarray()[0]
top_indices = doc_0_vector.argsort()[-10:][::-1]

print("🔝 Top 10 TF-IDF features (sentence 0):")
for idx in top_indices:
    if doc_0_vector[idx] > 0:
        print(f"  {feature_names[idx]}: {doc_0_vector[idx]:.4f}")

🔝 Top 10 TF-IDF features (sentence 0):
  economy: 0.4162
  free: 0.3812
  foundation: 0.2271
  heritage: 0.2271
  economy be: 0.2271
  make it: 0.2271
  it the: 0.2271
  in 2008: 0.2271
  think: 0.2164
  hungary: 0.2081

## Step 7: Exporting the Results

In [9]:
# Create the output directory
OUTPUT_DIR = "preprocessed_data"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"📁 Output directory: {OUTPUT_DIR}/")

📁 Output directory: preprocessed_data/


In [10]:
# Export the CSV files
train_df[['id', 'text', 'cleaned_text', 'lemmatized_text']].to_csv(
    os.path.join(OUTPUT_DIR, 'train_preprocessed.csv'), index=False
)
dev_df[['id', 'text', 'cleaned_text', 'lemmatized_text']].to_csv(
    os.path.join(OUTPUT_DIR, 'dev_preprocessed.csv'), index=False
)
test_df[['id', 'text', 'cleaned_text', 'lemmatized_text']].to_csv(
    os.path.join(OUTPUT_DIR, 'test_preprocessed.csv'), index=False
)

print("✅ CSV files exported")
print(f"  train_preprocessed.csv : {len(train_df)} rows")
print(f"  dev_preprocessed.csv   : {len(dev_df)} rows")
print(f"  test_preprocessed.csv  : {len(test_df)} rows")

✅ CSV files exported
  train_preprocessed.csv : 490 rows
  dev_preprocessed.csv   : 105 rows
  test_preprocessed.csv  : 105 rows


In [11]:
# Export the TF-IDF matrices
save_npz(os.path.join(OUTPUT_DIR, 'tfidf_matrix.npz'), tfidf_train)
save_npz(os.path.join(OUTPUT_DIR, 'tfidf_matrix_dev.npz'), tfidf_dev)
save_npz(os.path.join(OUTPUT_DIR, 'tfidf_matrix_test.npz'), tfidf_test)

# Export the vectorizer
with open(os.path.join(OUTPUT_DIR, 'tfidf_vectorizer.pkl'), 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

# Export the feature names
np.save(os.path.join(OUTPUT_DIR, 'tfidf_feature_names.npy'), feature_names)

print("✅ TF-IDF matrices and vectorizer exported")

✅ TF-IDF matrices and vectorizer exported


In [12]:
# Metadata
metadata = {
    'dataset': 'NER Dataset',
    'total_sentences': len(df),
    'train_size': len(train_df),
    'dev_size': len(dev_df),
    'test_size': len(test_df),
    'tfidf_features': len(feature_names),
    'preprocessing_steps': [
        '1. Lowercase',
        '2. Special character removal',
        '3. Whitespace normalization',
        '4. Lemmatization (spaCy)',
        '5. TF-IDF (max_features=3000, ngram_range=(1,2))'
    ]
}

with open(os.path.join(OUTPUT_DIR, 'metadata.json'), 'w') as f:
    json.dump(metadata, f, indent=2)

print("✅ Metadata exported")

✅ Metadata exported


In [13]:
# Summary of exported files
print("\n" + "="*60)
print("📦 SUMMARY OF EXPORTED FILES")
print("="*60)

for filename in sorted(os.listdir(OUTPUT_DIR)):
    filepath = os.path.join(OUTPUT_DIR, filename)
    if os.path.isfile(filepath):
        size_mb = os.path.getsize(filepath) / (1024 * 1024)
        print(f"  {filename:35s} ({size_mb:.2f} MB)")

print("="*60)


📦 SUMMARY OF EXPORTED FILES
  dev_preprocessed.csv                (0.04 MB)
  metadata.json                       (0.00 MB)
  test_preprocessed.csv               (0.04 MB)
  tfidf_feature_names.npy             (0.03 MB)
  tfidf_matrix.npz                    (0.06 MB)
  tfidf_matrix_dev.npz                (0.01 MB)
  tfidf_matrix_test.npz               (0.01 MB)
  tfidf_vectorizer.pkl                (0.20 MB)
  train_preprocessed.csv              (0.18 MB)


## Technical Summary

### Dataset
- **Source**: NER Dataset (Named Entity Recognition)
- **Total**: 2221 sentences (700 in the `data.csv` file used for the run recorded above)
- **Split**: Train (70%), Dev (15%), Test (15%)

### Preprocessing Pipeline
1. **Cleaning**: lowercase, special character removal, whitespace normalization
2. **Lemmatization**: spaCy `en_core_web_sm`
3. **TF-IDF**: up to 3000 features (1680 retained in the recorded run), unigrams and bigrams (`ngram_range=(1, 2)`)

### Exported Files
- `train_preprocessed.csv`, `dev_preprocessed.csv`, `test_preprocessed.csv`
- `tfidf_matrix.npz`, `tfidf_matrix_dev.npz`, `tfidf_matrix_test.npz`
- `tfidf_vectorizer.pkl`, `tfidf_feature_names.npy`
- `metadata.json`
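Before using the exports, a consumer may want a quick consistency check on `metadata.json`. A minimal sketch, assuming the keys written by the metadata cell in Step 7 (`total_sentences`, `train_size`, `dev_size`, `test_size`); `check_metadata` is a hypothetical helper, and the demo writes a temporary file rather than reading the real export:

```python
import json
import os
import tempfile

def check_metadata(path):
    """Load metadata.json and verify that the split sizes add up to the total."""
    with open(path, encoding="utf-8") as f:
        meta = json.load(f)
    parts = meta["train_size"] + meta["dev_size"] + meta["test_size"]
    if parts != meta["total_sentences"]:
        raise ValueError(
            f"split sizes sum to {parts}, expected {meta['total_sentences']}"
        )
    return meta

# Demo on a temporary file that mimics the exported structure
# (values taken from the run recorded in this notebook).
demo = {"total_sentences": 700, "train_size": 490, "dev_size": 105,
        "test_size": 105, "tfidf_features": 1680}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "metadata.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo, f)
    meta = check_metadata(demo_path)

print(meta["tfidf_features"])  # 1680
```

With the real export, the call would be `check_metadata('preprocessed_data/metadata.json')`.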

### Usage (Part B)
```python
import pandas as pd
from scipy.sparse import load_npz
import pickle

# Load the data
df_train = pd.read_csv('preprocessed_data/train_preprocessed.csv')
tfidf_train = load_npz('preprocessed_data/tfidf_matrix.npz')

# Load the vectorizer
with open('preprocessed_data/tfidf_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
```