# Text Preprocessing Pipeline for Sentiment Analysis

This notebook contains a complete text preprocessing pipeline tuned for analyzing Reddit comments. The pipeline covers cleaning, tokenization, stopword removal, and lemmatization.

## 📊 Loading the Data

This section loads the ChatGPT Reddit comments dataset and sets the directory used for NLTK data.


In [80]:
import pandas as pd

ntlk_dir = 'nltk_data'
df = pd.read_csv('data/chatgpt-reddit-comments.csv')

## 🧹 TextCleaner Class - Text Cleaning

This class cleans and preprocesses text in several steps:

- **HTML removal**: strips HTML tags (`<p>`, `<div>`, etc.)
- **Lowercasing**: normalizes case
- **URL removal**: strips links that start with `http` or `https`
- **Mention/hashtag removal**: strips @mentions and #hashtags
- **Punctuation removal**: strips every character that is neither a word character nor whitespace, which also removes emojis
- **Digit removal**: strips numbers
- **Whitespace normalization**: collapses runs of whitespace into a single space

The `clean_text()` method applies all of these steps in sequence.
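
The order of the steps matters. The sketch below is a minimal illustration that reuses only the regular expressions from `TextCleaner`; the helper names `strip_mentions_hashtags`, `strip_punctuation`, and `squeeze` are hypothetical and exist only for this example. It shows that stripping punctuation before mentions and hashtags leaves the handle and the tag text behind.

```python
import re

# Hypothetical standalone helpers mirroring three TextCleaner steps.
def strip_mentions_hashtags(text: str) -> str:
    return re.sub(r"@\w+|#\w+", "", text)

def strip_punctuation(text: str) -> str:
    return re.sub(r"[^\w\s]", "", text)

def squeeze(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

sample = "thanks @user123 for the #tip!"

# Order used by clean_text(): mentions/hashtags first, then punctuation.
pipeline_order = squeeze(strip_punctuation(strip_mentions_hashtags(sample)))

# Reversed order: "@" and "#" are stripped as punctuation first, so the
# mention/hashtag pattern no longer matches anything.
reversed_order = squeeze(strip_mentions_hashtags(strip_punctuation(sample)))

print(pipeline_order)   # thanks for the
print(reversed_order)   # thanks user123 for the tip
```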


In [81]:
import re
from bs4 import BeautifulSoup

class TextCleaner:
	"""
	Class for cleaning and preprocessing text data with multiple processing steps.
	"""
	
	def __init__(self) -> None:
		pass
	
	def remove_html(self, text) -> str:
		"""Remove HTML tags from text."""
		if not isinstance(text, str):
			return ""
		return BeautifulSoup(text, "html.parser").get_text()
	
	def convert_to_lowercase(self, text) -> str:
		"""Convert text to lowercase."""
		if not isinstance(text, str):
			return ""
		return text.lower()
	
	def remove_urls(self, text) -> str:
		"""Remove URLs from text."""
		if not isinstance(text, str):
			return ""
		return re.sub(r"http\S+", "", text)
	
	def remove_mentions_hashtags(self, text) -> str:
		"""Remove mentions (@username) and hashtags (#hashtag) from text."""
		if not isinstance(text, str):
			return ""
		return re.sub(r"@\w+|#\w+", "", text)
	
	def remove_punctuation(self, text) -> str:
		"""Remove punctuation from text."""
		if not isinstance(text, str):
			return ""
		return re.sub(r"[^\w\s]", "", text)
	
	def remove_digits(self, text) -> str:
		"""Remove digits from text."""
		if not isinstance(text, str):
			return ""
		return re.sub(r"\d+", "", text)
	
	def normalize_whitespace(self, text) -> str:
		"""Normalize whitespace in text."""
		if not isinstance(text, str):
			return ""
		return re.sub(r"\s+", " ", text).strip()
	
	def clean_text(self, text) -> str:
		"""
		Apply all cleaning steps to the text.
		
		Args:
			text (str): The text to clean
			
		Returns:
			str: The cleaned text
		"""
		if not isinstance(text, str):
			return ""
		
		# Apply all cleaning steps in sequence
		text = self.remove_html(text)
		text = self.convert_to_lowercase(text)
		text = self.remove_urls(text)
		text = self.remove_mentions_hashtags(text)
		text = self.remove_punctuation(text)
		text = self.remove_digits(text)
		text = self.normalize_whitespace(text)
		
		return text

cleaner = TextCleaner()

def clean_text(text) -> str:
	return cleaner.clean_text(text)
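
One limitation is worth keeping in mind: the pattern `http\S+` only matches links that start with `http`, so a bare `www.example.com` survives URL removal and is later fused into one token when punctuation is stripped. The sketch below compares the original pattern with one possible broader pattern; `BROADER_URL` is an assumption made for illustration and is not part of the class above.

```python
import re

ORIGINAL_URL = re.compile(r"http\S+")
# Assumed alternative: also treat a leading "www." as the start of a URL.
BROADER_URL = re.compile(r"(?:https?://|www\.)\S+")

text = "docs at https://example.com/a and www.example.com today"

# Remove the URLs, then collapse the whitespace left behind.
with_original = re.sub(r"\s+", " ", ORIGINAL_URL.sub("", text)).strip()
with_broader = re.sub(r"\s+", " ", BROADER_URL.sub("", text)).strip()

print(with_original)  # docs at and www.example.com today
print(with_broader)   # docs at and today
```

Note that the broader pattern requires `://` after `http` or `https`, so it is also stricter than the original about ordinary words that merely begin with "http".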


## 🔤 TextTokenizer Class - Tokenization

This class splits the cleaned text into tokens (individual words) using the NLTK tokenizer.

**NLTK setup**:
- Adds the local folder to the NLTK data path
- Downloads the `punkt_tab` package required for tokenization

**Behavior**:
- Turns a string into a list of words
- Handles contractions and any remaining punctuation
- Returns an empty list if the input is not a string


In [82]:
import nltk
from nltk.tokenize import word_tokenize
import os

# Configure NLTK data path
nltk.data.path.append(os.path.abspath(ntlk_dir))

# Uncomment for the first run to download the data, then comment out again
# nltk.download('punkt_tab', download_dir=ntlk_dir)

class TextTokenizer:
	"""
	Class for tokenizing cleaned text into word-level tokens.
	"""
	
	def __init__(self) -> None:
		pass
	
	def tokenize(self, text) -> list[str]:
		"""
		Tokenize text into words.
		
		Args:
			text (str): The cleaned input text
		
		Returns:
			List[str]: List of tokens
		"""
		if not isinstance(text, str):
			return []
		return word_tokenize(text)
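
To make the contraction handling concrete: `word_tokenize` splits a form such as "don't" into "do" and "n't", which is why "n't" appears in the negation list of the next section. The sketch below imitates only that behavior with a regular expression. `approx_tokenize` and `TOKEN_RE` are hypothetical names, and this is a rough stand-in, not the NLTK algorithm.

```python
import re

# Alternatives are tried left to right: a stem followed by "n't", then
# "n't" itself, then clitics such as 's or 're, then plain words, and
# finally single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")

def approx_tokenize(text: str) -> list[str]:
    """Rough imitation of Treebank-style contraction splitting."""
    if not isinstance(text, str):
        return []
    return TOKEN_RE.findall(text)

tokens = approx_tokenize("it isn't bad, it's great")
print(tokens)  # ['it', 'is', "n't", 'bad', ',', 'it', "'s", 'great']
```

When the text has already been through `TextCleaner`, apostrophes are gone, so "isn't" reaches the tokenizer as "isnt" and no "n't" token is produced.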

## 🚫 StopwordRemover Class - Stopword Removal

This class removes stopwords, which carry little semantic meaning on their own.

**Features**:
- Uses the NLTK English stopword list
- **Negation preservation**: keeps words such as "not", "no", "never", "n't" because they are crucial for sentiment analysis
- Configurable language parameter (English by default)
- Case-insensitive comparison

**Goal**: reduce noise while preserving important sentiment indicators.


In [83]:
import nltk
from nltk.corpus import stopwords

# Uncomment for the first run to download the data, then comment out again
# nltk.download('stopwords', download_dir=ntlk_dir)

class StopwordRemover:
	"""
	Class for removing stopwords from tokenized text.
	"""
	
	def __init__(self, language="english") -> None:
		"""
		Initialize the stopword remover with a given language.
		
		Args:
			language (str): Language of the stopwords (default is 'english')
		"""
		self.stop_words = set(stopwords.words(language))
	
	def remove_stopwords(self, tokens, keep_negation=True) -> list[str]:
		"""
		Remove stopwords from a list of tokens.
		
		Args:
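			tokens (List[str]): List of word tokens
			keep_negation (bool): If True, keep "not", "no", "never" and "n't"
				even when they appear in the stopword list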
		
		Returns:
			List[str]: Tokens without stopwords
		"""
		if not isinstance(tokens, list):
			return []
		if keep_negation:
			negations = {"not", "no", "never", "n't"}
			return [word for word in tokens if word.lower() not in self.stop_words or word.lower() in negations]
		else:
			return [word for word in tokens if word.lower() not in self.stop_words]
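
The effect of `keep_negation` is easiest to see on a short sentence. The sketch below applies the same filtering rule as `remove_stopwords`, but with a tiny hand-written stopword set so that it runs without NLTK data; `STOP_WORDS` and `drop_stopwords` are hypothetical names used only here.

```python
# Tiny illustrative stopword set; the real class uses NLTK's English list.
STOP_WORDS = {"this", "is", "the", "not", "no"}
NEGATIONS = {"not", "no", "never", "n't"}

def drop_stopwords(tokens: list[str], keep_negation: bool = True) -> list[str]:
    """Apply the same filtering rule as StopwordRemover.remove_stopwords."""
    if keep_negation:
        return [t for t in tokens
                if t.lower() not in STOP_WORDS or t.lower() in NEGATIONS]
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["This", "is", "not", "good"]
kept = drop_stopwords(tokens)
dropped = drop_stopwords(tokens, keep_negation=False)

print(kept)     # ['not', 'good']
print(dropped)  # ['good']
```

Without the negation, "not good" collapses to "good", which flips the apparent sentiment of the sentence.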

## 🔄 TextLemmatizer Class - Lemmatization

This class reduces words to their canonical form (lemma) to normalize morphological variants.

**Required NLTK packages**:
- `wordnet`: lexical database
- `omw-1.4`: Open Multilingual Wordnet
- `averaged_perceptron_tagger_eng`: English part-of-speech tagger

**How it works**:
1. **POS tagging**: determines the grammatical category of each word
2. **Tag conversion**: maps Treebank tags to the WordNet format
3. **Lemmatization**: lemmatizes each word according to its part of speech

**Examples** (when tagged as verb, adjective, and noun respectively): "running" → "run", "better" → "good", "children" → "child"


In [84]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Uncomment for the first run to download the data, then comment out again
# nltk.download('wordnet', download_dir=ntlk_dir)
# nltk.download('omw-1.4', download_dir=ntlk_dir)
# nltk.download('averaged_perceptron_tagger_eng', download_dir=ntlk_dir)

class TextLemmatizer:
	"""
	Class for lemmatizing word tokens using NLTK's WordNetLemmatizer.
	"""
	
	def __init__(self) -> None:
		self.lemmatizer = WordNetLemmatizer()
	
	def get_wordnet_pos(self, treebank_tag) -> str:
		"""
		Convert POS tag from Treebank to WordNet format for better lemmatization.
		"""
		if treebank_tag.startswith('J'):
			return wordnet.ADJ
		elif treebank_tag.startswith('V'):
			return wordnet.VERB
		elif treebank_tag.startswith('N'):
			return wordnet.NOUN
		elif treebank_tag.startswith('R'):
			return wordnet.ADV
		else:
			return wordnet.NOUN  # fallback
	
	def lemmatize(self, tokens) -> list[str]:
		"""
		Lemmatize a list of word tokens.
		
		Args:
			tokens (List[str]): List of word tokens
		
		Returns:
			List[str]: Lemmatized tokens
		"""
		if not isinstance(tokens, list):
			return []

		pos_tags = nltk.pos_tag(tokens)  # POS tagging
		return [
			self.lemmatizer.lemmatize(token, self.get_wordnet_pos(pos))
			for token, pos in pos_tags
		]
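
The tag conversion step can be read as a lookup on the first letter of the Treebank tag. The sketch below is a minimal equivalent of `get_wordnet_pos` that runs without NLTK. It assumes the single-letter values NLTK uses for its WordNet POS constants, and `PREFIX_TO_WORDNET` and `to_wordnet_pos` are hypothetical names.

```python
# WordNet POS constants are single letters: wordnet.ADJ == "a",
# wordnet.VERB == "v", wordnet.NOUN == "n", wordnet.ADV == "r".
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(treebank_tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return PREFIX_TO_WORDNET.get(treebank_tag[:1], "n")

mapped = {tag: to_wordnet_pos(tag) for tag in ["JJR", "VBG", "NNS", "RB", "DT"]}
print(mapped)  # {'JJR': 'a', 'VBG': 'v', 'NNS': 'n', 'RB': 'r', 'DT': 'n'}
```

The noun fallback matters because the lemmatizer only normalizes a word under the part of speech it is given: "better" becomes "good" as an adjective but stays "better" when treated as a noun.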


## 🧪 Full Pipeline Demonstration

This section runs the full pipeline on varied examples to illustrate each preprocessing step.

**Test examples**:
1. **HTML + mentions/URLs**: text with tags, mentions, and links
2. **Informal expressions**: contractions and excessive punctuation
3. **Negations**: checks that negation words are preserved
4. **Web content**: hashtags and website URLs
5. **Grammatical forms**: various conjugations and plurals

**Steps shown**:
- Original text → Cleaned text → Tokens → Without stopwords → Lemmatized

This makes it possible to check how well each pipeline component works.


In [None]:
from typing import Dict, Any
import emoji

class FlexibleTextProcessor:
	"""
	Flexible text processor with configurable cleaning levels.
	"""
	
	def __init__(self, config: Dict[str, Any] = None):
		"""
		Initialize with configuration dictionary.
		
		Args:
			config: Dictionary with processing options
		"""
		# Default configuration (Light cleaning)
		self.default_config = {
			'remove_html': True,
			'to_lowercase': True,
			'remove_urls': True,
			'remove_mentions': True,
			'remove_hashtags': False,  # Keep hashtags for sentiment context
			'remove_punctuation': False,  # Keep punctuation for emphasis
			'remove_digits': False,  # Keep numbers for context
			'remove_emojis': False,  # Keep emojis for sentiment
			'expand_contractions': False,  # Keep natural speech patterns
			'normalize_whitespace': True,
			'remove_stopwords': True,
			'keep_negations': True,
			'apply_lemmatization': False,  # Avoid over-normalization
			'preserve_caps': False,  # False: text is lowercased (ALL-CAPS words are tagged EMPHASIS_ first when punctuation is kept)
			'min_word_length': 1,  # Keep short words like "I", "no"
		}
		
		self.config = self.default_config.copy()
		if config:
			self.config.update(config)
		
		# Initialize components
		self.tokenizer = TextTokenizer()
		self.stopword_remover = StopwordRemover()
		self.lemmatizer = TextLemmatizer()
		
		# Contractions dictionary
		self.contractions = {
			"won't": "will not", "can't": "cannot", "n't": " not",
			"'re": " are", "'ve": " have", "'ll": " will", "'d": " would",
			"'m": " am", "'s": " is"
		}
	
	def get_light_config(self) -> Dict[str, Any]:
		"""Light cleaning configuration - minimal processing."""
		return {
			'remove_html': True,
			'to_lowercase': True,
			'remove_urls': True,
			'remove_mentions': True,
			'remove_hashtags': False,
			'remove_punctuation': False,
			'remove_digits': False,
			'remove_emojis': False,
			'expand_contractions': False,
			'normalize_whitespace': True,
			'remove_stopwords': True,
			'keep_negations': True,
			'apply_lemmatization': False,
			'preserve_caps': False,
			'min_word_length': 1,
		}
	
	def get_medium_config(self) -> Dict[str, Any]:
		"""Medium cleaning configuration - balanced processing."""
		return {
			'remove_html': True,
			'to_lowercase': True,
			'remove_urls': True,
			'remove_mentions': True,
			'remove_hashtags': True,
			'remove_punctuation': True,
			'remove_digits': True,
			'remove_emojis': False,  # Keep emojis
			'expand_contractions': True,
			'normalize_whitespace': True,
			'remove_stopwords': True,
			'keep_negations': True,
			'apply_lemmatization': True,
			'preserve_caps': False,
			'min_word_length': 2,
		}
	
	def get_hard_config(self) -> Dict[str, Any]:
		"""Hard cleaning configuration - aggressive processing."""
		return {
			'remove_html': True,
			'to_lowercase': True,
			'remove_urls': True,
			'remove_mentions': True,
			'remove_hashtags': True,
			'remove_punctuation': True,
			'remove_digits': True,
			'remove_emojis': True,
			'expand_contractions': True,
			'normalize_whitespace': True,
			'remove_stopwords': True,
			'keep_negations': True,
			'apply_lemmatization': True,
			'preserve_caps': False,
			'min_word_length': 3,
		}
	
	def expand_contractions(self, text: str) -> str:
		"""Expand contractions in text."""
		if not isinstance(text, str):
			return ""
		
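		# Match case-insensitively: this step runs before lowercasing, so
		# "Can't" would otherwise be half-expanded to "Ca not".
		for contraction, expansion in self.contractions.items():
			text = re.sub(re.escape(contraction), expansion, text, flags=re.IGNORECASE)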
		return text
	
	def remove_emojis(self, text: str) -> str:
		"""Remove emojis from text."""
		if not isinstance(text, str):
			return ""
		return emoji.replace_emoji(text, replace='')
	
	def preserve_emphasis(self, text: str) -> str:
		"""Convert emphasis patterns to tokens."""
		if not isinstance(text, str):
			return ""
		
		# Convert multiple punctuation to emphasis tokens
		text = re.sub(r'!{2,}', ' VERY_EXCITED ', text)
		text = re.sub(r'\?{2,}', ' VERY_QUESTIONING ', text)
		text = re.sub(r'\.{3,}', ' TRAILING_THOUGHT ', text)
		
		# Convert CAPS words to emphasis tokens
		if not self.config.get('preserve_caps', False):
			text = re.sub(r'\b[A-Z]{2,}\b', lambda m: f'EMPHASIS_{m.group().lower()}', text)
		
		return text
	
	def process_text(self, text: str) -> str:
		"""
		Process text according to configuration.
		
		Args:
			text: Input text to process
			
		Returns:
			Processed text string
		"""
		if not isinstance(text, str):
			return ""
		
		result = text
		
		# Step 1: HTML removal
		if self.config.get('remove_html', True):
			result = BeautifulSoup(result, "html.parser").get_text()
		
		# Step 2: Preserve emphasis before other processing
		if not self.config.get('remove_punctuation', True):
			result = self.preserve_emphasis(result)
		
		# Step 3: Expand contractions
		if self.config.get('expand_contractions', False):
			result = self.expand_contractions(result)
		
		# Step 4: Case handling
		if self.config.get('to_lowercase', True):
			if not self.config.get('preserve_caps', False):
				result = result.lower()
		
		# Step 5: Remove various elements
		if self.config.get('remove_urls', True):
			result = re.sub(r"http\S+", "", result)
		
		if self.config.get('remove_mentions', True):
			result = re.sub(r"@\w+", "", result)
		
		if self.config.get('remove_hashtags', False):
			result = re.sub(r"#\w+", "", result)
		
		if self.config.get('remove_emojis', False):
			result = self.remove_emojis(result)
		
		if self.config.get('remove_punctuation', False):
			result = re.sub(r"[^\w\s]", "", result)
		
		if self.config.get('remove_digits', False):
			result = re.sub(r"\d+", "", result)
		
		# Step 6: Normalize whitespace
		if self.config.get('normalize_whitespace', True):
			result = re.sub(r"\s+", " ", result).strip()
		
		return result
	
	def process_tokens(self, tokens: list[str]) -> list[str]:
		"""
		Process tokens according to configuration.
		
		Args:
			tokens: List of tokens
			
		Returns:
			Processed list of tokens
		"""
		if not isinstance(tokens, list):
			return []
		
		result = tokens.copy()
		
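		# Filter by minimum word length, but do not drop negations that are
		# meant to be kept (e.g. "no" when min_word_length is 3)
		min_length = self.config.get('min_word_length', 1)
		negations = {"not", "no", "never", "n't"} if self.config.get('keep_negations', True) else set()
		result = [token for token in result if len(token) >= min_length or token.lower() in negations]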
		
		# Remove stopwords
		if self.config.get('remove_stopwords', True):
			keep_neg = self.config.get('keep_negations', True)
			result = self.stopword_remover.remove_stopwords(result, keep_negation=keep_neg)
		
		# Apply lemmatization
		if self.config.get('apply_lemmatization', False):
			result = self.lemmatizer.lemmatize(result)
		
		return result
	
	def full_pipeline(self, text: str) -> list[str]:
		"""
		Complete processing pipeline.
		
		Args:
			text: Input text
			
		Returns:
			List of processed tokens
		"""
		# Process text
		cleaned_text = self.process_text(text)
		
		# Tokenize
		tokens = self.tokenizer.tokenize(cleaned_text)
		
		# Process tokens
		final_tokens = self.process_tokens(tokens)
		
		return final_tokens
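
To see what the emphasis step produces in the light configuration, the sketch below replays the substitutions from `preserve_emphasis`, followed by the lowercasing and whitespace normalization from `process_text`, using only `re`. `mark_emphasis` is a hypothetical standalone copy written for this example, not a method of the class.

```python
import re

def mark_emphasis(text: str) -> str:
    """Replay the emphasis substitutions, then lowercase and tidy spaces."""
    text = re.sub(r"!{2,}", " VERY_EXCITED ", text)
    text = re.sub(r"\?{2,}", " VERY_QUESTIONING ", text)
    text = re.sub(r"\.{3,}", " TRAILING_THOUGHT ", text)
    # ALL-CAPS words get an EMPHASIS_ prefix. The placeholders above are not
    # touched because "_" is a word character, so \b never falls inside them.
    text = re.sub(r"\b[A-Z]{2,}\b",
                  lambda m: f"EMPHASIS_{m.group().lower()}", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

marked = mark_emphasis("Makes NO sense to me!!!")
print(marked)  # makes emphasis_no sense to me very_excited
```

Because lowercasing runs after the substitutions, the placeholder tokens also end up in lowercase in the final output.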


In [None]:
from enum import Enum
from abc import ABC, abstractmethod

class CleaningLevel(Enum):
	LIGHT = "light"
	MEDIUM = "medium"
	HARD = "hard"
	CUSTOM = "custom"

class BaseProcessor(ABC):
	"""Abstract base class for text processors."""
	
	@abstractmethod
	def process(self, text: str) -> list[str]:
		pass

class LightProcessor(BaseProcessor):
	"""
	Light cleaning - Preserve sentiment indicators
	- Keep punctuation for emphasis (!!!, ???)
	- Keep emojis for emotional context
	- Keep contractions for natural speech
	- No lemmatization, to preserve word forms and meaning nuances
	"""
	
	def __init__(self):
		self.text_cleaner = TextCleaner()
		self.tokenizer = TextTokenizer()
		self.stopword_remover = StopwordRemover()
	
	def process(self, text: str) -> list[str]:
		# Custom light cleaning
		if not isinstance(text, str):
			return []
		
		# Remove only essential noise
		result = BeautifulSoup(text, "html.parser").get_text()
		result = result.lower()
		result = re.sub(r"http\S+", "", result)  # Remove URLs
		result = re.sub(r"@\w+", "", result)   # Remove mentions
		
		# Keep hashtags, punctuation, emojis, numbers
		result = re.sub(r"\s+", " ", result).strip()
		
		# Tokenize
		tokens = self.tokenizer.tokenize(result)
		
		# Light stopword removal (keep negations)
		tokens = self.stopword_remover.remove_stopwords(tokens, keep_negation=True)
		
		# No lemmatization to preserve word forms
		return tokens

class MediumProcessor(BaseProcessor):
	"""
	Medium cleaning - Balanced approach
	- Standard TextCleaner pass (removes punctuation, digits, hashtags, and emojis)
	- Stopword removal that keeps negations
	- Selective lemmatization: adjectives are left as-is
	"""
	
	def __init__(self):
		self.text_cleaner = TextCleaner()
		self.tokenizer = TextTokenizer()
		self.stopword_remover = StopwordRemover()
		self.lemmatizer = TextLemmatizer()
	
	def process(self, text: str) -> list[str]:
		# Standard cleaning
		cleaned = self.text_cleaner.clean_text(text)
		
		# Tokenize
		tokens = self.tokenizer.tokenize(cleaned)
		
		# Remove stopwords
		tokens = self.stopword_remover.remove_stopwords(tokens, keep_negation=True)
		
		# Selective lemmatization (avoid over-normalization)
		pos_tags = nltk.pos_tag(tokens)
		lemmatized = []
		
		for token, pos in pos_tags:
			# Don't lemmatize adjectives to preserve sentiment intensity
			if pos.startswith('JJ'):  # Adjectives (better, worse, etc.)
				lemmatized.append(token)
			else:
				lemmatized.append(self.lemmatizer.lemmatizer.lemmatize(
					token, self.lemmatizer.get_wordnet_pos(pos)
				))
		
		return lemmatized

class HardProcessor(BaseProcessor):
	"""
	Hard cleaning - Aggressive normalization
	- Remove all non-essential elements
	- Full lemmatization
	- Strict filtering
	"""
	
	def __init__(self):
		self.text_cleaner = TextCleaner()
		self.tokenizer = TextTokenizer()
		self.stopword_remover = StopwordRemover()
		self.lemmatizer = TextLemmatizer()
	
	def process(self, text: str) -> list[str]:
		# Aggressive cleaning
		cleaned = self.text_cleaner.clean_text(text)
		
		# Additional aggressive steps
		cleaned = re.sub(r"[^\w\s]", "", cleaned)  # Remove all punctuation
		cleaned = emoji.replace_emoji(cleaned, replace='')  # Remove emojis
		
		# Tokenize
		tokens = self.tokenizer.tokenize(cleaned)
		
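		# Filter short tokens, but keep the two-letter negation "no",
		# which the stopword step below is meant to preserve
		tokens = [token for token in tokens if len(token) >= 3 or token == "no"]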
		
		# Remove stopwords
		tokens = self.stopword_remover.remove_stopwords(tokens, keep_negation=True)
		
		# Full lemmatization
		tokens = self.lemmatizer.lemmatize(tokens)
		
		return tokens

class ProcessorFactory:
	"""Factory to create appropriate processor based on cleaning level."""
	
	@staticmethod
	def create_processor(level: CleaningLevel) -> BaseProcessor:
		"""Create processor based on cleaning level."""
		if level == CleaningLevel.LIGHT:
			return LightProcessor()
		elif level == CleaningLevel.MEDIUM:
			return MediumProcessor()
		elif level == CleaningLevel.HARD:
			return HardProcessor()
		else:
			raise ValueError(f"Unsupported cleaning level: {level}")
	
	@staticmethod
	def process_text(text: str, level: CleaningLevel) -> list[str]:
		"""Quick processing with specified level."""
		processor = ProcessorFactory.create_processor(level)
		return processor.process(text)
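
The factory dispatches on a `CleaningLevel` member, and `CUSTOM` has no matching processor, so requesting it raises a `ValueError`. The sketch below reproduces that dispatch with stub processors so that it runs without NLTK. `STUB_PROCESSORS` and `process_with_level` are hypothetical, and the stubs only split on whitespace and filter by length; they do not reproduce the real cleaning.

```python
from enum import Enum

class CleaningLevel(Enum):
    LIGHT = "light"
    MEDIUM = "medium"
    HARD = "hard"
    CUSTOM = "custom"

# Hypothetical stand-ins; the real factory returns LightProcessor,
# MediumProcessor, or HardProcessor instances.
STUB_PROCESSORS = {
    CleaningLevel.LIGHT: lambda text: text.lower().split(),
    CleaningLevel.MEDIUM: lambda text: [w for w in text.lower().split() if len(w) >= 2],
    CleaningLevel.HARD: lambda text: [w for w in text.lower().split() if len(w) >= 3],
}

def process_with_level(text: str, level_name: str) -> list[str]:
    """Resolve a level from its string value, then dispatch to a processor."""
    level = CleaningLevel(level_name)  # raises ValueError for unknown names
    if level not in STUB_PROCESSORS:
        raise ValueError(f"Unsupported cleaning level: {level}")
    return STUB_PROCESSORS[level](text)

result = process_with_level("It is a GOOD tool", "hard")
print(result)  # ['good', 'tool']

try:
    process_with_level("anything", "custom")
    custom_failed = False
except ValueError:
    custom_failed = True
print(custom_failed)  # True
```

Looking a level up by value, as in `CleaningLevel("hard")`, is convenient when the level comes from a configuration file or a command-line argument.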


In [85]:
# Example

# Create the instances of the classes
text_cleaner = TextCleaner()
tokenizer = TextTokenizer()
stopword_remover = StopwordRemover()
lemmatizer = TextLemmatizer()

# Examples of texts with different problems
sample_texts = [
	"<p>Hello @user123! Check out this amazing #AI tool: https://example.com/awesome-tool 🚀</p>",
	"I'm loving the new ChatGPT updates!!! It's so much better than before... 😍",
	"Why are people still using OLD technologies in 2024??? Makes NO sense to me!!!",
	"<div>Visit our website www.example.com for more info about #MachineLearning and #DataScience</div>",
	"Running, jumped, better, good, children, mice, feet - testing different word forms"
]

print("=== DEMONSTRATION OF THE PREPROCESSING ===\n")

for i, text in enumerate(sample_texts, 1):
	print(f"📝 Exemple {i}:")
	print(f"Original: {text}")
	
	# Step 1: Cleaning
	cleaned = text_cleaner.clean_text(text)
	print(f"Cleaned: '{cleaned}'")
	
	# Step 2: Tokenization
	tokens = tokenizer.tokenize(cleaned)
	print(f"Tokens: {tokens}")
	
	# Step 3: Stopword Removal
	tokens_no_stop = stopword_remover.remove_stopwords(tokens)
	print(f"Without stopwords: {tokens_no_stop}")
	
	# Step 4: Lemmatization
	lemmatized = lemmatizer.lemmatize(tokens_no_stop)
	print(f"Lemmatized: {lemmatized}")
	
	print("-" * 80 + "\n")


=== DEMONSTRATION OF THE PREPROCESSING ===

📝 Exemple 1:
Original: <p>Hello @user123! Check out this amazing #AI tool: https://example.com/awesome-tool 🚀</p>
Cleaned: 'hello check out this amazing tool'
Tokens: ['hello', 'check', 'out', 'this', 'amazing', 'tool']
Without stopwords: ['hello', 'check', 'amazing', 'tool']
Lemmatized: ['hello', 'check', 'amaze', 'tool']
--------------------------------------------------------------------------------

📝 Exemple 2:
Original: I'm loving the new ChatGPT updates!!! It's so much better than before... 😍
Cleaned: 'im loving the new chatgpt updates its so much better than before'
Tokens: ['im', 'loving', 'the', 'new', 'chatgpt', 'updates', 'its', 'so', 'much', 'better', 'than', 'before']
Without stopwords: ['im', 'loving', 'new', 'chatgpt', 'updates', 'much', 'better']
Lemmatized: ['im', 'love', 'new', 'chatgpt', 'update', 'much', 'good']
--------------------------------------------------------------------------------

📝 Exemple 3:
Original: Why ar