In [1]:
documents = [
    "Climate change is an urgent global challenge that affects ecosystems worldwide, disrupting weather patterns, melting polar ice caps, and threatening the survival of countless species. Immediate action to mitigate climate change and reduce greenhouse gas emissions is essential to preserve biodiversity and protect the planet.",
    "Renewable energy and sustainability are key to addressing the growing environmental concerns that we face today, offering viable alternatives to fossil fuels. The transition to renewable energy sources like solar, wind, and hydropower is essential in reducing the environmental impact and mitigating climate change.",
    "Green technologies, such as solar power, wind energy, and electric vehicles, have become essential in reducing carbon footprints, curbing greenhouse gas emissions, and paving the way for a sustainable and environmentally friendly future. The development of these technologies plays a crucial role in combating climate change.",
    "Climate change mitigation includes a wide range of strategies, such as reforestation, carbon capture and storage, and the promotion of renewable energy. These efforts are crucial to reduce carbon emissions and limit global warming, which is a growing threat to biodiversity and ecosystems worldwide.",
    "Environmental protection is closely linked to sustainable practices in agriculture, industry, and urban development. By embracing sustainable agricultural methods and green technologies, we can preserve natural resources, reduce pollution, and mitigate the effects of climate change on ecosystems and biodiversity.",
    "The rise in global temperatures is having a significant impact on biodiversity and ecosystems. Rising temperatures disrupt habitats, contribute to the extinction of vulnerable species, and degrade essential ecological services such as pollination and water purification. Immediate climate action is needed to reverse these impacts and safeguard our future."
]





# Document preprocessing

In [2]:
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('stopwords')
nltk.download('punkt')

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Get NLTK stopwords
nltk_stop_words = set(stopwords.words('english'))

def clean_text_combined(text):
    """
    Combined approach using both NLTK and spaCy for text cleaning
    """
    # NLTK processing
    tokens = word_tokenize(text.lower())
    nltk_cleaned = [word for word in tokens if word.isalnum() and word not in nltk_stop_words]
    
    # spaCy processing
    doc = nlp(text.lower())
    spacy_cleaned = [token.text for token in doc if token.is_alpha and not token.is_stop]
    
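    # Merge both token lists and drop duplicates, preserving first-seen order.
    # Caveat: each word is kept once per document, so term frequencies are lost,
    # and a word survives if either the NLTK or the spaCy filter lets it through.
    combined_tokens = list(dict.fromkeys(nltk_cleaned + spacy_cleaned))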
    
    return combined_tokens

# Clean the documents using the combined approach
cleaned_documents = [clean_text_combined(doc) for doc in documents]
print("Cleaned documents:")
for i, doc in enumerate(cleaned_documents):
    print(f"\nDocument {i+1}:")
    print(doc)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Cleaned documents:

Document 1:
['climate', 'change', 'urgent', 'global', 'challenge', 'affects', 'ecosystems', 'worldwide', 'disrupting', 'weather', 'patterns', 'melting', 'polar', 'ice', 'caps', 'threatening', 'survival', 'countless', 'species', 'immediate', 'action', 'mitigate', 'reduce', 'greenhouse', 'gas', 'emissions', 'essential', 'preserve', 'biodiversity', 'protect', 'planet']

Document 2:
['renewable', 'energy', 'sustainability', 'key', 'addressing', 'growing', 'environmental', 'concerns', 'face', 'today', 'offering', 'viable', 'alternatives', 'fossil', 'fuels', 'transition', 'sources', 'like', 'solar', 'wind', 'hydropower', 'essential', 'reducing', 'impact', 'mitigating', 'climate', 'change']

Document 3:
['green', 'technologies', 'solar', 'power', 'wind', 'energy', 'electric', 'vehicles', 'become', 'essential', 'reducing', 'carbon', 'footprints', 'curbing', 'greenhouse', 'gas', 'emissions', 'paving', 'way', 'sustainable', 'environmentally', 'friendly', 'future', 'developmen
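
One caveat about `clean_text_combined`: `dict.fromkeys` keeps only the first occurrence of each token, so a word repeated within a document is counted once and the term frequencies that BoW and TF-IDF rely on are flattened. Below is a minimal sketch of a variant that keeps repetitions, written in plain Python so it runs without NLTK or spaCy. The tiny `STOP_WORDS` set and the `clean_text_keep_counts` name are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook relies on the much larger NLTK and spaCy lists.
STOP_WORDS = {"is", "an", "that", "the", "to", "and", "of", "a", "in"}

def clean_text_keep_counts(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, keep repetitions."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "Climate change is urgent. Action to mitigate climate change is essential."
tokens = clean_text_keep_counts(sample)
print(tokens)
print(tokens.count("climate"))  # repeated words are still counted
```

Keeping repetitions matters only if the downstream representation uses counts; for a pure presence/absence model the deduplicated version is equivalent.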

# Bag-of-Words (BoW) representation

In [88]:
from gensim.corpora import Dictionary

# Build the dictionary (token -> integer id)
dictionary = Dictionary(cleaned_documents)

# Convert each document to a Bag-of-Words list of (token_id, count) pairs
corpus_bow = [dictionary.doc2bow(doc) for doc in cleaned_documents]
print(corpus_bow)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1)], [(5, 1), (6, 1), (11, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 2), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 2), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1)], [(5, 1), (6, 1), (10, 1), (11, 1), (12, 1), (14, 1), (34, 1), (46, 1), (48, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 2), (72, 1), (73, 1)], [(2, 1), (5, 1), (6, 1), (9, 1), (10, 1), (13, 1), (24, 1), (30, 1), (34, 1), (39, 1), (47, 1), (55, 2), (57, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (8
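
To make the `(token_id, count)` pairs above easier to read, here is a minimal plain-Python sketch of the same idea. The helper names `build_vocabulary` and `to_bow` are hypothetical, and the ids gensim assigns may come out in a different order; only the structure of the result is meant to match.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are ignored."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [["climate", "change", "climate"], ["energy", "change"]]
vocab = build_vocabulary(docs)
print(vocab)
print([to_bow(doc, vocab) for doc in docs])
# A query containing an unseen word ("solar") keeps only the known tokens.
print(to_bow(["solar", "energy"], vocab))
```

The last call shows why query words missing from the dictionary contribute nothing to the similarity scores.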

# TF-IDF representation

In [89]:
from gensim.models import TfidfModel

# Fit the TF-IDF model on the BoW corpus and apply it
tfidf_model = TfidfModel(corpus_bow)
corpus_tfidf = tfidf_model[corpus_bow]
print(tfidf_model)
#print(corpus_tfidf)

TfidfModel<num_docs=6, num_nnz=169>
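
The weighting behind `TfidfModel` can be sketched in a few lines. This assumes gensim's default settings as I understand them (raw term count, base-2 logarithm for the IDF, unit-length normalization of each document); the `tfidf_weights` function is a hypothetical re-implementation for illustration, not the library's code.

```python
import math

def tfidf_weights(corpus_bow):
    """Weight each (term_id, count) pair by count * log2(N / df), then L2-normalize."""
    num_docs = len(corpus_bow)
    doc_freq = {}
    for doc in corpus_bow:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    weighted_corpus = []
    for doc in corpus_bow:
        weighted = [(t, c * math.log2(num_docs / doc_freq[t])) for t, c in doc]
        # Terms present in every document get an IDF of 0 and are dropped.
        weighted = [(t, w) for t, w in weighted if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        weighted_corpus.append([(t, w / norm) for t, w in weighted] if norm else [])
    return weighted_corpus

toy_bow = [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
for doc in tfidf_weights(toy_bow):
    print(doc)
```

In the toy corpus, term `1` occurs in both documents, so it carries no weight; this is the mechanism that downweights words such as "climate" that appear in most of the six documents.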


# LDA representation

In [90]:
from gensim.models import ldamodel

# Fit LDA with 2 topics. No random_state is set, so the topics can differ between runs.
model = ldamodel.LdaModel(corpus_bow, id2word=dictionary, num_topics=2)

model.show_topics()

[(0,
  '0.026*"climate" + 0.019*"change" + 0.018*"energy" + 0.017*"ecosystems" + 0.017*"essential" + 0.016*"environmental" + 0.016*"renewable" + 0.015*"biodiversity" + 0.015*"global" + 0.014*"impact"'),
 (1,
  '0.028*"climate" + 0.027*"change" + 0.019*"sustainable" + 0.018*"technologies" + 0.018*"biodiversity" + 0.017*"essential" + 0.016*"ecosystems" + 0.016*"energy" + 0.015*"emissions" + 0.014*"development"')]

# Similarity between documents

In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# First, we need to join the cleaned tokens into full text strings
cleaned_documents_text = [" ".join(doc) for doc in cleaned_documents]

# Initialize TfidfVectorizer and compute TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned_documents_text)

# Compute cosine similarity between all pairs of documents
cosine_sim = cosine_similarity(tfidf_matrix)

# Display the similarity matrix
print(cosine_sim)

[[1.         0.04878316 0.12747697 0.16211673 0.14662359 0.14978167]
 [0.04878316 1.         0.14794541 0.14149145 0.07279338 0.04396705]
 [0.12747697 0.14794541 1.         0.15183021 0.19455087 0.04717971]
 [0.16211673 0.14149145 0.15183021 1.         0.07498059 0.05621281]
 [0.14662359 0.07279338 0.19455087 0.07498059 1.         0.03594118]
 [0.14978167 0.04396705 0.04717971 0.05621281 0.03594118 1.        ]]
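
For reference, the quantity in each cell of the matrix above is the cosine of the angle between two document vectors. A minimal plain-Python sketch, with hypothetical helper names and toy vectors rather than the notebook's TF-IDF matrix:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pairwise_cosine(vectors):
    """Full similarity matrix: symmetric, with 1.0 on the diagonal for non-zero vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

toy_vectors = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 2.0, 0.0]]
matrix = pairwise_cosine(toy_vectors)
for row in matrix:
    print([round(x, 3) for x in row])
```

Because cosine similarity ignores vector length, a long and a short document with the same term proportions score 1.0; the low off-diagonal values in the notebook reflect how few weighted terms the six documents share.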


# Computing query similarities

In [92]:
# Query document: lowercased and split on whitespace; doc2bow ignores words absent from the dictionary
query_doc = "renewable energy technologies play a crucial role in reducing carbon emissions and mitigating climate change".lower().split()


query_bow = dictionary.doc2bow(query_doc)


In [93]:
from gensim.similarities import MatrixSimilarity

# Similarity with BoW: index the BoW corpus, then score the query against every document
index_bow = MatrixSimilarity(corpus_bow, num_features=len(dictionary))
sims_bow = index_bow[query_bow]

print("BoW similarities (query vs. each document):")
sims_bow = sorted(enumerate(sims_bow), key=lambda item: -item[1])

for doc_id, similarity in sims_bow:
    print(cleaned_documents[doc_id], similarity)

BoW similarities (query vs. each document):
['green', 'technologies', 'solar', 'power', 'wind', 'energy', 'electric', 'vehicles', 'essential', 'reducing', 'carbon', 'footprints', 'curbing', 'greenhouse', 'gas', 'emissions', 'paving', 'way', 'sustainable', 'environmentally', 'friendly', 'future', 'development', 'technologies', 'plays', 'crucial', 'role', 'combating', 'climate', 'change'] 0.5330018
['climate', 'change', 'mitigation', 'includes', 'wide', 'range', 'strategies', 'reforestation', 'carbon', 'capture', 'storage', 'promotion', 'renewable', 'energy', 'efforts', 'crucial', 'reduce', 'carbon', 'emissions', 'limit', 'global', 'warming', 'growing', 'threat', 'biodiversity', 'ecosystems', 'worldwide'] 0.447914
['renewable', 'energy', 'sustainability', 'key', 'addressing', 'growing', 'environmental', 'concerns', 'face', 'today', 'offering', 'viable', 'alternatives', 'fossil', 'fuels', 'transition', 'renewable', 'energy', 'sources', 'like', 'solar', 'wind', 'hydropower', 'essential', 'reducing', '

In [94]:
# Similarity with TF-IDF: index the weighted corpus and weight the query the same way
index_tfidf = MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
sims_tfidf = index_tfidf[tfidf_model[query_bow]]

print("TF-IDF similarities (query vs. each document):")
sims_tfidf = sorted(enumerate(sims_tfidf), key=lambda item: -item[1])

for doc_id, similarity in sims_tfidf:
    print(cleaned_documents[doc_id], similarity)

TF-IDF similarities (query vs. each document):
['green', 'technologies', 'solar', 'power', 'wind', 'energy', 'electric', 'vehicles', 'essential', 'reducing', 'carbon', 'footprints', 'curbing', 'greenhouse', 'gas', 'emissions', 'paving', 'way', 'sustainable', 'environmentally', 'friendly', 'future', 'development', 'technologies', 'plays', 'crucial', 'role', 'combating', 'climate', 'change'] 0.3672793
['renewable', 'energy', 'sustainability', 'key', 'addressing', 'growing', 'environmental', 'concerns', 'face', 'today', 'offering', 'viable', 'alternatives', 'fossil', 'fuels', 'transition', 'renewable', 'energy', 'sources', 'like', 'solar', 'wind', 'hydropower', 'essential', 'reducing', 'environmental', 'impact', 'mitigating', 'climate', 'change'] 0.2565498
['climate', 'change', 'mitigation', 'includes', 'wide', 'range', 'strategies', 'reforestation', 'carbon', 'capture', 'storage', 'promotion', 'renewable', 'energy', 'efforts', 'crucial', 'reduce', 'carbon', 'emissions', 'limit', 'global', 'warming', 

In [95]:
# Similarity with LDA: compare topic distributions instead of word vectors
index_lda = MatrixSimilarity(model[corpus_bow], num_features=model.num_topics)
similarities_lda = index_lda[model[query_bow]]
print("LDA similarities (query vs. each document):")
sims_lda = sorted(enumerate(similarities_lda), key=lambda item: -item[1])

for doc_id, similarity in sims_lda:
    print(cleaned_documents[doc_id], similarity)

LDA similarities (query vs. each document):
['climate', 'change', 'urgent', 'global', 'challenge', 'affects', 'ecosystems', 'worldwide', 'disrupting', 'weather', 'patterns', 'melting', 'polar', 'ice', 'caps', 'threatening', 'survival', 'countless', 'species', 'immediate', 'action', 'mitigate', 'climate', 'change', 'reduce', 'greenhouse', 'gas', 'emissions', 'essential', 'preserve', 'biodiversity', 'protect', 'planet'] 0.9978501
['environmental', 'protection', 'closely', 'linked', 'sustainable', 'practices', 'agriculture', 'industry', 'urban', 'development', 'embracing', 'sustainable', 'agricultural', 'methods', 'green', 'technologies', 'preserve', 'natural', 'resources', 'reduce', 'pollution', 'mitigate', 'effects', 'climate', 'change', 'ecosystems', 'biodiversity'] 0.99783844
['green', 'technologies', 'solar', 'power', 'wind', 'energy', 'electric', 'vehicles', 'essential', 'reducing', 'carbon', 'footprints', 'curbing', 'greenhouse', 'gas', 'emissions', 'paving', 'way', 'sustainable', 'environment

# Advantages and limitations
## Bag of Words (BoW)
### Advantages:

- Easy to understand and implement.

- Represents each document by the frequency of its terms.

- Fast to compute for small and medium-sized corpora.

- Useful for capturing similarity between documents that share exact words.
### Limitations:

- No semantic awareness:
Relationships between synonyms or contextually similar words are not taken into account.
Example: "cat" and "feline" are treated as unrelated terms.
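
A small plain-Python sketch of this limitation, using made-up token lists and a hypothetical `shared_term_score` helper (the dot product of two count vectors):

```python
from collections import Counter

def shared_term_score(tokens_a, tokens_b):
    """Dot product of two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    return sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())

doc_cat = ["cat", "sleeps", "sofa"]
doc_feline = ["feline", "rests", "couch"]  # same meaning, no shared words
doc_cat_2 = ["cat", "sofa", "cat"]

print(shared_term_score(doc_cat, doc_feline))
print(shared_term_score(doc_cat, doc_cat_2))
```

The first pair scores zero even though the two documents say nearly the same thing, because BoW only rewards exact token matches.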


## TF-IDF (Term Frequency-Inverse Document Frequency)
### Advantages:

- Reduces the weight of words that are common across the corpus while boosting those specific to a document.

- Balanced representation:
Combines a term's frequency within a document with its rarity across the corpus.

- Effective for short documents:
Works well for comparing documents that contain distinctive terms.

### Limitations:

- Ignores semantic relationships:
Synonyms and words with similar meanings are treated as different terms.

- Depends on corpus quality:
If the corpus is unbalanced or noisy, TF-IDF scores can be biased.

- Static:
Does not capture relationships between terms or their context beyond frequency.


## Latent Dirichlet Allocation (LDA)
### Advantages:

- Captures semantic relationships:
Groups words into topics based on their co-occurrence in the corpus.
Example: "cat", "feline", and "animal" could belong to the same topic.

- Contextual robustness:
Makes it possible to compare documents that share themes even when they have no terms in common.

- Dimensionality reduction:
Simplifies document representation by describing each document with a small number of topic weights.
### Limitations:

- Computationally expensive:
Requires significant resources for large corpora.

- Loss of local precision:
Document-specific details can be diluted into general topics.

- Sensitive parameters:
Results depend heavily on the number of topics and the hyperparameters.
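
The LDA ranking earlier in the notebook illustrates this sensitivity: with `num_topics=2`, every document is reduced to a two-component vector, and the top scores are nearly identical (0.9978501 and 0.99783844). The sketch below uses made-up topic distributions (assumptions, not values from the notebook) to show how cosine similarity behaves in such a low-dimensional topic space.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical topic distributions over 2 topics (each sums to 1).
query = [0.95, 0.05]
doc_similar = [0.97, 0.03]
doc_mixed = [0.80, 0.20]
doc_other = [0.05, 0.95]

for name, doc in [("similar", doc_similar), ("mixed", doc_mixed), ("other", doc_other)]:
    print(name, round(cosine(query, doc), 4))
```

Documents dominated by the same topic all score close to 1, whatever their actual vocabulary, so the ranking among them carries little information. A larger number of topics, or a larger corpus, would likely be needed before LDA similarities become discriminative here.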


## Conclusion
These approaches are complementary, depending on the needs of the analysis:

- LDA captures implicit semantic relationships through a probabilistic, topic-based approach.
- BoW and TF-IDF remain useful for precise lexical comparisons.
