# Testing the Light Preprocessing Pipeline for Sentiment Analysis

This notebook tests the complete "light" version of the preprocessing pipeline, which is tuned to preserve the emotional cues that matter for sentiment analysis.

## 📋 Pipeline under test:
1. **Cleaning** (`TextCleaner` light)
2. **Tokenization** (improved `TextTokenizer`)
3. **Stopword removal** (`StopwordRemover` light)
4. **Lemmatization** (`TextLemmatizer` light)
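
The notebook does not spell out what "light" stopword removal does internally. One plausible reading, sketched here purely as an assumption, is a filter that drops ordinary function words but keeps negations and intensifiers, since those carry polarity. The word lists and the `remove_stopwords_light` helper below are hypothetical and are not the actual `StopwordRemover` implementation.

```python
# Illustrative stopword list (hypothetical, far smaller than a real one).
BASE_STOPWORDS = {
    "i", "i'm", "this", "is", "it", "the", "a", "with", "than", "before",
    "not", "no", "very", "much", "can't",
}

# Hypothetical set of words kept because they signal polarity or intensity.
SENTIMENT_KEEP = {"not", "no", "very", "much", "can't"}

# Words that are actually filtered out in this sketch.
EFFECTIVE_STOPWORDS = BASE_STOPWORDS - SENTIMENT_KEEP


def remove_stopwords_light(tokens):
    """Drop function words but keep negations and intensifiers."""
    return [t for t in tokens if t.lower() not in EFFECTIVE_STOPWORDS]


print(remove_stopwords_light(["i'm", "not", "very", "happy", "with", "this"]))
```

Subtracting a "keep" set from a standard list is only one way to build such a filter; a hand-curated list would work just as well.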


In [10]:
# Import the light pipeline classes
import sys

from preprocessing.light.text_cleaner_light import TextCleaner
from preprocessing.text_tokenizer import TextTokenizer
from preprocessing.light.stopword_remover_light import StopwordRemover
from preprocessing.light.text_lemmatizer_light import TextLemmatizer


In [11]:
# NLTK configuration
import nltk
import os

nltk_dir = 'nltk_data'
nltk.data.path.append(os.path.abspath(nltk_dir))

# Uncomment on the first run to download the required NLTK resources
# nltk.download('punkt_tab', download_dir=nltk_dir)
# nltk.download('stopwords', download_dir=nltk_dir)
# nltk.download('wordnet', download_dir=nltk_dir)
# nltk.download('omw-1.4', download_dir=nltk_dir)
# nltk.download('averaged_perceptron_tagger_eng', download_dir=nltk_dir)
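
Rather than commenting and uncommenting the download lines by hand, one could check which resources are already installed. The sketch below assumes the standard NLTK resource paths for the five packages listed above and relies on `nltk.data.find`, which raises `LookupError` when a resource is missing. The `missing_resources` helper and the `REQUIRED` list are hypothetical names introduced for illustration.

```python
# Assumed resource paths for the packages used in this notebook.
REQUIRED = [
    "tokenizers/punkt_tab",
    "corpora/stopwords",
    "corpora/wordnet",
    "corpora/omw-1.4",
    "taggers/averaged_perceptron_tagger_eng",
]


def missing_resources(resource_paths, find=None):
    """Return the resource paths that the lookup function cannot locate.

    `find` defaults to nltk.data.find; it is injectable so the helper
    can be exercised without NLTK installed.
    """
    if find is None:
        import nltk  # imported lazily, only when the default lookup is needed
        find = nltk.data.find
    missing = []
    for path in resource_paths:
        try:
            find(path)
        except LookupError:
            missing.append(path)
    return missing
```

In the notebook, a call such as `missing_resources(REQUIRED)` could then decide whether the `nltk.download(...)` lines need to run at all.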


In [12]:
# Initialize the pipeline components
text_cleaner = TextCleaner()
tokenizer = TextTokenizer(nltk_dir=nltk_dir)
stopword_remover = StopwordRemover()
lemmatizer = TextLemmatizer(nltk_dir=nltk_dir)

print("✅ All classes are initialized!")


✅ All classes are initialized!


## 🧪 Varied Test Examples

We test several kinds of text to see how the light pipeline preserves emotional cues:


In [13]:
# Sample texts that pose different challenges for sentiment analysis
test_texts = [
	"I'm soooo HAPPY!!! This is absolutely AMAZING 😍 Much better than before!!!",
	
	"Can't believe how terrible this is... I'm extremely disappointed 😞 Worst experience EVER!",
	
	"<p>This ChatGPT update is @amazing #AI https://example.com BUT I don't think it's perfect yet...</p>",
	
	"Nooooo way! This is incredible!!! I absolutely looove it 💕 10/10 would recommend!",
	
	"It's quite good, but I've seen better products in 2024. Not bad though, just not outstanding.",
	
	"OMG this is HORRIBLE!!! Won't buy again, totally disgusting and revolting 🤮",
	
	"I really, really love this! It's so much more efficient than the old version. Fantastic work!"
]

print(f"📝 {len(test_texts)} test examples ready")


📝 7 test examples ready
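
To make "emotional cues" concrete, the sketch below counts a few surface signals that the test texts rely on: exclamation marks, fully uppercase words, elongated words such as "soooo", and symbol characters such as emoji. The `count_cues` helper is hypothetical and is not part of the pipeline; the emoji check is a rough approximation based on the Unicode category "So" (Symbol, other).

```python
import re
import unicodedata

# A word character repeated three or more times in a row, e.g. "soooo".
ELONGATED = re.compile(r"(\w)\1{2,}")


def count_cues(text):
    """Count simple surface-level emotional cues in a raw text."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "exclamations": text.count("!"),
        # Ignore one-letter words so that "I" is not counted as shouting.
        "uppercase_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "elongated_words": sum(1 for w in words if ELONGATED.search(w)),
        # Approximation: emoji fall in the "So" (Symbol, other) category.
        "symbols": sum(1 for ch in text if unicodedata.category(ch) == "So"),
    }


sample = "I'm soooo HAPPY!!! This is absolutely AMAZING 😍 Much better than before!!!"
print(count_cues(sample))
```

Running the same counter on the raw text and on the joined tokens of a later stage gives a quick indication of which cues a stage removes.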


## 🔄 Complete Pipeline Function


In [14]:
def process_text_light(text):
	"""
	Run the complete light pipeline on a single text.
	
	Args:
		text (str): Text to process
		
	Returns:
		dict: Output of each stage, keyed by stage name
	"""
	results = {
		'original': text,
		'cleaned': '',
		'tokens': [],
		'without_stopwords': [],
		'lemmatized': []
	}
	
	# Step 1: Cleaning
	results['cleaned'] = text_cleaner.clean_text(text)
	
	# Step 2: Tokenization
	results['tokens'] = tokenizer.tokenize(results['cleaned'])
	
	# Step 3: Stopword removal
	results['without_stopwords'] = stopword_remover.remove_stopwords(results['tokens'])
	
	# Step 4: Lemmatization
	results['lemmatized'] = lemmatizer.lemmatize(results['without_stopwords'], conservative_mode=True)
	
	return results
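
Because `process_text_light` returns one entry per stage, the stages can be compared numerically. The sketch below assumes a dictionary with the same keys as the one returned above; the `summarize_stages` and `dropped_tokens` helpers and the `example` data are hypothetical and only illustrate the idea.

```python
from collections import Counter

STAGES = ["tokens", "without_stopwords", "lemmatized"]


def summarize_stages(results):
    """Report the token count of each stage and its ratio to the tokenized input."""
    base = len(results["tokens"])
    summary = {}
    for stage in STAGES:
        count = len(results[stage])
        ratio = round(count / base, 2) if base else 0.0
        summary[stage] = {"count": count, "kept_ratio": ratio}
    return summary


def dropped_tokens(before, after):
    """List the tokens present in `before` but missing from `after` (duplicates respected)."""
    return list((Counter(before) - Counter(after)).elements())


# Hypothetical stage outputs, not produced by the real pipeline.
example = {
    "tokens": ["this", "is", "not", "good", "!"],
    "without_stopwords": ["not", "good", "!"],
    "lemmatized": ["not", "good", "!"],
}

print(summarize_stages(example))
print(dropped_tokens(example["tokens"], example["without_stopwords"]))
```

Looking at the dropped tokens is a direct way to confirm that negations and punctuation survive stopword removal.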


## 📊 Testing and Analyzing the Results


In [15]:
def display_results(results, example_num):
	"""
	Print the output of each pipeline stage in a readable layout.
	"""
	print(f"\n{'='*80}")
	print(f"📝 EXAMPLE {example_num}")
	print(f"{'='*80}")
	
	print("🔸 Original:")
	print(f"   {results['original']}")
	
	print("\n🧹 Cleaned:")
	print(f"   '{results['cleaned']}'")
	
	print(f"\n🔤 Tokens ({len(results['tokens'])}):")
	print(f"   {results['tokens']}")
	
	print(f"\n🚫 Without stopwords ({len(results['without_stopwords'])}):")
	print(f"   {results['without_stopwords']}")
	
	print(f"\n🔄 Lemmatized ({len(results['lemmatized'])}):")
	print(f"   {results['lemmatized']}")

# Run every example through the pipeline
for i, text in enumerate(test_texts, 1):
	results = process_text_light(text)
	display_results(results, i)



📝 EXAMPLE 1
🔸 Original:
   I'm soooo HAPPY!!! This is absolutely AMAZING 😍 Much better than before!!!

🧹 Cleaned:
   'i'm soooo happy!!! this is absolutely AMAZING 😍 much better than before!!!'

🔤 Tokens (18):
   ["i'm", 'soo', 'happy', '!', '!', '!', 'this', 'is', 'absolutely', 'AMAZING', '😍', 'much', 'better', 'than', 'before', '!', '!', '!']

🚫 Without stopwords (13):
   ['soo', 'happy', '!', '!', '!', 'absolutely', 'AMAZING', '😍', 'much', 'better', '!', '!', '!']

🔄 Lemmatized (13):
   ['soo', 'hapy', '!', '!', '!', 'absolutely', 'AMAZING', '😍', 'much', 'better', '!', '!', '!']

📝 EXAMPLE 2
🔸 Original:
   Can't believe how terrible this is... I'm extremely disappointed 😞 Worst experience EVER!

🧹 Cleaned:
   'can't believe how terrible this is i'm extremely disappointed 😞 worst experience ever!'

🔤 Tokens (14):
   ["can't", 'believe', 'how', 'terrible', 'this', 'is', "i'm", 'extremely', 'disappointed', '😞', 'worst', 'experience', 'ever', '!']

🚫 Without stopwords (10):
   ["can't", 