<a href="https://colab.research.google.com/github/AbdoulKidakou/M1SLED/blob/main/Introduction_aux_m%C3%A9thodes_de_traitement_de_texte.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FEATURE ENGINEERING

# Main Objective:

To explore and illustrate how text can be preprocessed and represented as vectors for effective use in predictive models. This notebook serves as a foundation for more advanced textual data analysis.

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import nltk

In [2]:
# Download the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [3]:
# Example corpus (a list of documents)
corpus = [
    "Sous le ciel bleu, les étoiles dansent,",
    "La nature chante en douce cadence.",
    "Les rêves s'élèvent, portés par le vent,",
    "Un monde de paix, simple et charmant."
]

This notebook demonstrates text-processing techniques used in feature engineering to convert text into numerical representations that machine learning models can consume. The techniques covered are:

# **1. Bag-of-Words (BoW): Transforming text into word-frequency vectors.**

In [4]:
# 1. Bag-of-Words
print("\n=== Bag-of-Words ===")
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform(corpus)
print("Feature Names:", vectorizer_bow.get_feature_names_out())
print("Bag-of-Words Matrix:\n", X_bow.toarray())


=== Bag-of-Words ===
Feature Names: ['bleu' 'cadence' 'chante' 'charmant' 'ciel' 'dansent' 'de' 'douce' 'en'
 'et' 'la' 'le' 'les' 'monde' 'nature' 'paix' 'par' 'portés' 'rêves'
 'simple' 'sous' 'un' 'vent' 'élèvent' 'étoiles']
Bag-of-Words Matrix:
 [[1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1]
 [0 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 1 0]
 [0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0]]
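To make the mechanics of the matrix above explicit, here is a minimal from-scratch sketch of a bag-of-words model. It assumes a regular expression that mirrors the default `CountVectorizer` tokenization (lowercasing, then runs of two or more word characters); the names `tokenize`, `vocabulary`, and `bow_matrix` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    "Sous le ciel bleu, les étoiles dansent,",
    "La nature chante en douce cadence.",
    "Les rêves s'élèvent, portés par le vent,",
    "Un monde de paix, simple et charmant.",
]

# Assumption: tokens are runs of 2+ word characters, so the lone "s" in "s'élèvent" is dropped
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    """Lowercase a document and split it into word tokens."""
    return TOKEN_PATTERN.findall(doc.lower())


tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of all distinct tokens in the corpus
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary term, each cell a raw count
bow_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    bow_matrix.append([counts[term] for term in vocabulary])

print(len(vocabulary))
print(bow_matrix[0])
```

Each row is simply the document's token counts laid out in vocabulary order, which is why word order within a document is lost in this representation.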


# **2. n-Grams and Bag-of-n-Grams: Capturing sequences of consecutive words to enrich context.**



In [5]:
# 2. N-grams (example with bigrams)
print("\n=== N-gram (Bi-grams) ===")
vectorizer_ngram = CountVectorizer(ngram_range=(2, 2))
X_ngram = vectorizer_ngram.fit_transform(corpus)
print("Feature Names:", vectorizer_ngram.get_feature_names_out())
print("N-gram Matrix:\n", X_ngram.toarray())


=== N-gram (Bi-grams) ===
Feature Names: ['bleu les' 'chante en' 'ciel bleu' 'de paix' 'douce cadence' 'en douce'
 'et charmant' 'la nature' 'le ciel' 'le vent' 'les rêves' 'les étoiles'
 'monde de' 'nature chante' 'paix simple' 'par le' 'portés par'
 'rêves élèvent' 'simple et' 'sous le' 'un monde' 'élèvent portés'
 'étoiles dansent']
N-gram Matrix:
 [[1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1]
 [0 1 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 1 0]
 [0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0]]


# **3. Frequency-Based and Presence-Based Representations: Comparing representations based on raw word frequencies and binary presence indicators.**



In [6]:
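# 3. Frequency-based vs presence-based
print("\n=== Frequency-based ===")
# Raw term counts per document
X_frequency = vectorizer_bow.fit_transform(corpus).toarray()
print("Frequency Matrix:\n", X_frequency)

print("\n=== Presence-based ===")
# Binarize the counts: 1 if the term occurs at least once, 0 otherwise.
# No word repeats within a line of this corpus, so both matrices come out identical.
X_presence = np.where(X_frequency > 0, 1, 0)
print("Presence Matrix:\n", X_presence)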


=== Frequency-based ===
Frequency Matrix:
 [[1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1]
 [0 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 1 0]
 [0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0]]

=== Presence-based ===
Presence Matrix:
 [[1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1]
 [0 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 1 0]
 [0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0]]
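The two matrices above are identical because no word occurs twice in the same line of the corpus. The sketch below uses two hypothetical documents (not part of the notebook's corpus) in which "le" is repeated, so the difference becomes visible. It also uses the `binary=True` option of `CountVectorizer`, which produces the presence matrix directly.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, chosen so that "le" occurs twice in the first one
docs = ["le vent et le ciel", "le ciel bleu"]

# Frequency-based: raw counts
freq_vectorizer = CountVectorizer()
X_freq = freq_vectorizer.fit_transform(docs).toarray()

# Presence-based: binary=True clips every non-zero count to 1
presence_vectorizer = CountVectorizer(binary=True)
X_pres = presence_vectorizer.fit_transform(docs).toarray()

le_index = freq_vectorizer.vocabulary_["le"]
print("Count of 'le' in document 1:", X_freq[0, le_index])
print("Presence of 'le' in document 1:", X_pres[0, le_index])
```

Presence-based features discard how often a term occurs, which can help when repetition carries little signal, for instance in very short texts.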


# **4. Stemming: Reducing words to their root forms to simplify analysis.**




In [7]:
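# 4. Stemming
print("\n=== Stemming ===")
# Note: PorterStemmer implements an English stemming algorithm; for this French corpus,
# SnowballStemmer('french') (imported above) is the language-appropriate choice.
stemmer = PorterStemmer()
stop_words = set(stopwords.words('french'))
for i, doc in enumerate(corpus):
    tokens = word_tokenize(doc.lower())
    # Keep alphanumeric tokens that are not French stop words
    # (tokens containing an apostrophe are dropped too, which is why "s'élèvent"
    # leaves no stem in document 3)
    tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    print(f"Document {i+1} after stemming: {stemmed_tokens}")


=== Stemming ===
Document 1 after stemming: ['sou', 'ciel', 'bleu', 'étoil', 'dansent']
Document 2 after stemming: ['natur', 'chant', 'douc', 'cadenc']
Document 3 after stemming: ['rêve', 'porté', 'vent']
Document 4 after stemming: ['mond', 'paix', 'simpl', 'charmant']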
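The cell above applies the Porter stemmer, which was designed for English. As a point of comparison, here is a minimal sketch using NLTK's `SnowballStemmer` with the French rules. To stay self-contained it tokenizes with a simple regular expression and skips stop-word removal, both of which are simplifications relative to the cell above. The helper `stem_document` is illustrative.

```python
import re

from nltk.stem import SnowballStemmer

corpus = [
    "Sous le ciel bleu, les étoiles dansent,",
    "La nature chante en douce cadence.",
    "Les rêves s'élèvent, portés par le vent,",
    "Un monde de paix, simple et charmant.",
]

# The Snowball stemmer ships French-specific suffix rules
french_stemmer = SnowballStemmer("french")


def stem_document(doc):
    """Lowercase, tokenize on runs of 2+ word characters, and stem each token."""
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return [french_stemmer.stem(token) for token in tokens]


stemmed_corpus = [stem_document(doc) for doc in corpus]
for i, stems in enumerate(stemmed_corpus, start=1):
    print(f"Document {i} after French stemming: {stems}")
```

Comparing this output with the Porter results is a quick way to see which French inflections an English stemmer leaves untouched.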


# **5. TF-IDF (Term Frequency-Inverse Document Frequency): Weighting words based on their importance within a document relative to the entire corpus.**
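To show where the weights come from, the sketch below recomputes the first document's TF-IDF values by hand. It assumes scikit-learn's default `TfidfVectorizer` settings: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The document frequencies are read off the corpus: "le" and "les" occur in two documents, and the other five terms of document 1 occur only there.

```python
import numpy as np

n_docs = 4  # number of documents in the corpus

# Document 1 has 7 distinct terms, each occurring once (tf = 1).
# Five of them appear in this document only (df = 1); 'le' and 'les' appear in two (df = 2).
df = np.array([1, 1, 1, 1, 1, 2, 2])
tf = np.ones(len(df))

# Smoothed IDF (assumed default: smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Raw TF-IDF, then L2 normalization of the document vector (assumed default: norm='l2')
weights = tf * idf
weights = weights / np.linalg.norm(weights)

print("Weight of a term unique to document 1:", round(weights[0], 8))
print("Weight of 'le' / 'les':", round(weights[-1], 8))
```

Under these assumptions the result is about 0.4002 for the terms unique to document 1 and about 0.3155 for "le" and "les", matching the values printed for the first row of the TF-IDF matrix in this notebook. Terms shared across documents receive a lower weight than terms unique to one document.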

In [8]:
# 5. TF-IDF
print("\n=== TF-IDF ===")
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(corpus)
print("Feature Names:", vectorizer_tfidf.get_feature_names_out())
print("TF-IDF Matrix:\n", X_tfidf.toarray())


=== TF-IDF ===
Feature Names: ['bleu' 'cadence' 'chante' 'charmant' 'ciel' 'dansent' 'de' 'douce' 'en'
 'et' 'la' 'le' 'les' 'monde' 'nature' 'paix' 'par' 'portés' 'rêves'
 'simple' 'sous' 'un' 'vent' 'élèvent' 'étoiles']
TF-IDF Matrix:
 [[0.40021825 0.         0.         0.         0.40021825 0.40021825
  0.         0.         0.         0.         0.         0.31553666
  0.31553666 0.         0.         0.         0.         0.
  0.         0.         0.40021825 0.         0.         0.
  0.40021825]
 [0.         0.40824829 0.40824829 0.         0.         0.
  0.         0.40824829 0.40824829 0.         0.40824829 0.
  0.         0.         0.40824829 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.31553666
  0.31553666 0.         0.         0.         0.40021825 0.40021825
  0.40021825 0.         0.         0.    