<a href="https://colab.research.google.com/github/Elmahi92/Exploring-Global-Discourse-Evolution/blob/main/United_Nations_General_Debate_Corpus_(UNGDC)_1946_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Global Discourse Evolution: Applying BERTopic, NMF, and LDA to the United Nations General Debate Corpus in the Pre/Post-Millennium Development Goals Era

---

We studied the United Nations (UN) General Debate Corpus, a collection of 7,314 speeches from 1970 to 2014. We wanted to understand how the focus and emphasis of speeches at the UN General Assembly changed after the adoption of the Millennium Development Goals (MDGs). We employed topic modeling techniques to identify key themes and topics in speeches. Our analysis revealed that BERTopic, a neural topic modeling algorithm, generated the most coherent topics. BERTopic's effectiveness in handling complex datasets was evident. However, we also encountered challenges. BERTopic relies on pre-trained word embeddings, which may not effectively capture domain-specific information. Additionally, BERTopic can struggle with noisy data. Our dataset presented unique challenges, as it included scanned documents and complete text, deviating from the standard format of UN General Assembly speeches. The model's results indicated a degree of semantic similarity, but interpreting the results proved difficult.


In [None]:
!pip install bertopic gensim pyLDAvis

In [None]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/UN Debates data/un-general-debates.csv')

We start by splitting the dataset into two windows around the adoption of the Millennium Development Goals (MDGs) in 2000: the 15 years before adoption (1985 to 1999) and the 15 years from adoption onward (2000 to 2014).

In [None]:

df['year'] = pd.to_numeric(df['year'], errors='coerce')

mdgs_adoption_year = 2000

years_before = 15
years_after = 15

pre_mdgs_mask = (df['year'] >= mdgs_adoption_year - years_before) & (df['year'] < mdgs_adoption_year)
post_mdgs_mask = (df['year'] >= mdgs_adoption_year) & (df['year'] < mdgs_adoption_year + years_after)

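# Copy the slices so that columns added later (e.g. text_length) do not trigger SettingWithCopyWarning
pre_mdgs_df = df[pre_mdgs_mask].copy()
post_mdgs_df = df[post_mdgs_mask].copy()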

print("15 Years Before MDGs Period:")
print(pre_mdgs_df.head())

print("\n15 Years After MDGs Period:")
print(post_mdgs_df.head())


15 Years Before MDGs Period:
   session  year country                                               text
0       44  1989     MDV  ﻿It is indeed a pleasure for me and the member...
1       44  1989     FIN  ﻿\nMay I begin by congratulating you. Sir, on ...
2       44  1989     NER  ﻿\nMr. President, it is a particular pleasure ...
3       44  1989     URY  ﻿\nDuring the debate at the fortieth session o...
4       44  1989     ZWE  ﻿I should like at the outset to express my del...

15 Years After MDGs Period:
     session  year country                                               text
223       68  2013     SUR  Allow me at the outset, on \nbehalf of the Pre...
224       68  2013     KOR  May I first \ncongratulate you, Sir, on your e...
225       68  2013     GNB  I would like to begin \nmy statement by congra...
226       68  2013     BGD  I congratulate \nPresident Ashe very warmly on...
227       68  2013     BIH  First of all, I would like to \ncongratulate y...
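As a quick sanity check on the half-open windows used above, the sketch below applies the same masks to a toy frame. The toy years are made up for illustration; the expectation is that the pre-MDGs window covers 1985 to 1999 and the post-MDGs window covers 2000 to 2014.

```python
import pandas as pd

# Toy data (hypothetical): one row per year, spanning beyond both windows
toy = pd.DataFrame({"year": list(range(1980, 2020))})

mdgs_adoption_year = 2000
years_before = 15
years_after = 15

# Same half-open masks as in the notebook: lower bound inclusive, upper bound exclusive
pre_mask = (toy["year"] >= mdgs_adoption_year - years_before) & (toy["year"] < mdgs_adoption_year)
post_mask = (toy["year"] >= mdgs_adoption_year) & (toy["year"] < mdgs_adoption_year + years_after)

pre_years = toy.loc[pre_mask, "year"]
post_years = toy.loc[post_mask, "year"]

print(pre_years.min(), pre_years.max(), len(pre_years))
print(post_years.min(), post_years.max(), len(post_years))
```

Each window contains exactly 15 distinct years, and the adoption year itself falls in the post-MDGs window.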


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Assuming you have two dataframes: pre_mdgs_df and post_mdgs_df

# Display basic information about the pre-MDGs dataset
print("Pre-MDGs Dataset:")
print(pre_mdgs_df.info())

# Display summary statistics for the pre-MDGs dataset
print(pre_mdgs_df.describe())

# Display the first few rows of the pre-MDGs dataset
print(pre_mdgs_df.head())

# Countplot of sessions for pre-MDGs
plt.figure(figsize=(10, 6))
sns.countplot(x='session', data=pre_mdgs_df)
plt.title('Distribution of Sessions (Pre-MDGs)')
plt.show()

# Countplot of countries for pre-MDGs
plt.figure(figsize=(15, 6))
sns.countplot(x='country', data=pre_mdgs_df)
plt.title('Distribution of Countries (Pre-MDGs)')
plt.xticks(rotation=45, ha='right')
plt.show()

# Distribution of text lengths for pre-MDGs
pre_mdgs_df['text_length'] = pre_mdgs_df['text'].apply(len)
plt.figure(figsize=(10, 6))
sns.histplot(pre_mdgs_df['text_length'], bins=30, kde=True)
plt.title('Distribution of Text Lengths (Pre-MDGs)')
plt.xlabel('Text Length')
plt.show()

# Wordcloud for text for pre-MDGs
text_combined_pre = ' '.join(pre_mdgs_df['text'])
wordcloud_pre = WordCloud(width=800, height=400, max_words=100, background_color='white').generate(text_combined_pre)

plt.figure(figsize=(12, 8))
plt.imshow(wordcloud_pre, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Text (Pre-MDGs)')
plt.show()


# Display basic information about the post-MDGs dataset
print("\nPost-MDGs Dataset:")
print(post_mdgs_df.info())

# Display summary statistics for the post-MDGs dataset
print(post_mdgs_df.describe())

# Display the first few rows of the post-MDGs dataset
print(post_mdgs_df.head())

# Countplot of sessions for post-MDGs
plt.figure(figsize=(10, 6))
sns.countplot(x='session', data=post_mdgs_df)
plt.title('Distribution of Sessions (Post-MDGs)')
plt.show()

# Countplot of countries for post-MDGs
plt.figure(figsize=(15, 6))
sns.countplot(x='country', data=post_mdgs_df)
plt.title('Distribution of Countries (Post-MDGs)')
plt.xticks(rotation=45, ha='right')
plt.show()

# Distribution of text lengths for post-MDGs
post_mdgs_df['text_length'] = post_mdgs_df['text'].apply(len)
plt.figure(figsize=(10, 6))
sns.histplot(post_mdgs_df['text_length'], bins=30, kde=True)
plt.title('Distribution of Text Lengths (Post-MDGs)')
plt.xlabel('Text Length')
plt.show()

# Wordcloud for text for post-MDGs
text_combined_post = ' '.join(post_mdgs_df['text'])
wordcloud_post = WordCloud(width=800, height=400, max_words=100, background_color='white').generate(text_combined_post)

plt.figure(figsize=(12, 8))
plt.imshow(wordcloud_post, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Text (Post-MDGs)')
plt.show()


# LDA Implementation

The code below sets up the environment for topic modeling and text preprocessing. It imports gensim for topic modeling (including the `corpora` module for building document corpora and `CoherenceModel` for evaluating topic models) and spaCy for natural language processing. It then loads NLTK's English stopwords into `stop_words` via `stopwords.words('english')` and extends that list with custom stopwords through `stop_words.extend(...)`, so that these additional words are also removed in later text processing.

In [None]:
import nltk
nltk.download('stopwords')
import sys
# !{sys.executable} -m spacy download en_core_web_sm
import re, numpy as np, pandas as pd
from pprint import pprint

# Gensim
import gensim, spacy, logging, warnings
import gensim.corpora as corpora
from gensim.utils import  simple_preprocess
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt

# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# Note: the entry ' General Assembly ' contains spaces and capitals, so it never matches a single lowercased token
stop_words.extend(['from', 'subject', 're', 'edu', 'use','nation','united','organization', 'development',' General Assembly ','not', 'would', 'say', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'also', 'may', 'take', 'come'])

%matplotlib inline
warnings.filterwarnings("ignore",category=DeprecationWarning)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
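
Because the filter in this notebook compares individual lowercased tokens against the stopword list, an entry only has an effect if it is itself a single lowercased token. The sketch below illustrates this with a small hypothetical list, and with a whitespace split standing in for `simple_preprocess`.

```python
# Hypothetical mini stopword list: one multi-word entry, two single-token entries
stop_words = [" General Assembly ", "united", "nation"]

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not exact members of stop_words."""
    stop_set = set(stop_words)  # set membership is O(1) per token
    return [tok for tok in tokens if tok not in stop_set]

# Whitespace split stands in for gensim's simple_preprocess (an assumption of this sketch)
tokens = "the general assembly of the united nation".split()

# The multi-word entry never equals a single token, so 'general' and 'assembly' survive
kept_raw = remove_stopwords(tokens, stop_words)
print(kept_raw)

# Splitting multi-word entries into lowercased single tokens makes them effective
normalized = [w for entry in stop_words for w in entry.lower().split()]
kept_normalized = remove_stopwords(tokens, normalized)
print(kept_normalized)
```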


Data

Tokenize and clean the text in both data frames

In [None]:
def sent_to_words(sentences):
    for sent in sentences:
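        sent = re.sub(r'\s+', ' ', sent)  # collapse runs of whitespace, including newlines
        sent = re.sub(r"'", "", sent)  # drop apostrophes
        sent = gensim.utils.simple_preprocess(str(sent), deacc=True)  # lowercase, tokenize, strip accents
        yield sent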



In [None]:

pre_mdgs_df_data = pre_mdgs_df.text.values.tolist()
data_words_pre_mdgs = list(sent_to_words(pre_mdgs_df_data))
print(data_words_pre_mdgs[:1])


post_mdgs_df_data = post_mdgs_df.text.values.tolist()
data_words_post_mdgs = list(sent_to_words(post_mdgs_df_data))
print(data_words_post_mdgs[:1])

[['allow', 'me', 'at', 'the', 'outset', 'on', 'behalf', 'of', 'the', 'president', 'of', 'the', 'republic', 'of', 'suriname', 'his', 'excellency', 'mr', 'desire', 'delano', 'bouterse', 'and', 'the', 'people', 'and', 'the', 'government', 'of', 'suriname', 'to', 'congratulate', 'you', 'mr', 'president', 'on', 'your', 'well', 'deserved', 'election', 'with', 'your', 'election', 'you', 'bring', 'honour', 'to', 'your', 'country', 'antigua', 'and', 'barbuda', 'and', 'to', 'the', 'caribbean', 'with', 'your', 'background', 'in', 'sustainable', 'development', 'you', 'are', 'well', 'prepared', 'to', 'lead', 'us', 'in', 'our', 'deliberations', 'on', 'this', 'year', 'theme', 'the', 'post', 'development', 'agenda', 'setting', 'the', 'stage', 'assure', 'you', 'of', 'the', 'support', 'and', 'cooperation', 'of', 'suriname', 'during', 'your', 'presidency', 'should', 'also', 'like', 'to', 'take', 'this', 'opportunity', 'to', 'pay', 'tribute', 'to', 'your', 'predecessor', 'mr', 'vuk', 'jeremi', 'for', 'his

Here we build the bigram and trigram models, lemmatize the tokens, and analyze the topical patterns in the pre- and post-MDGs data by fitting an LDA model to each period and calculating its coherence score.
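
The `threshold` argument passed to `Phrases` in the next cell controls which adjacent word pairs are merged into a single token. With Gensim's default scorer, a pair is scored as (count(a b) - min_count) * N / (count(a) * count(b)), where N is the vocabulary size, and the pair is merged when the score exceeds the threshold. A minimal pure-Python sketch of that formula follows; all counts are made up for illustration.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Default-style phrase score: (count_ab - min_count) * vocab_size / (count_a * count_b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts for illustration only
vocab_size = 20_000
min_count = 5
threshold = 100

# A tight collocation: the two words almost always appear together
tight = bigram_score(count_a=120, count_b=110, count_ab=100,
                     vocab_size=vocab_size, min_count=min_count)

# A loose pair: both words are frequent, but rarely adjacent
loose = bigram_score(count_a=5_000, count_b=4_000, count_ab=30,
                     vocab_size=vocab_size, min_count=min_count)

print(f"tight pair score: {tight:.1f}, merged: {tight > threshold}")
print(f"loose pair score: {loose:.3f}, merged: {loose > threshold}")
```

Raising the threshold therefore keeps only pairs whose co-occurrence is high relative to the individual word frequencies, which is why a higher threshold yields fewer phrases.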

In [None]:
# Build the bigram and trigram models for Pre MDGs Data
bigram = gensim.models.Phrases(data_words_pre_mdgs, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words_pre_mdgs], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Remove Stopwords, Form Bigrams, Trigrams and Lemmatization"""
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
    texts = [bigram_mod[doc] for doc in texts]
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []
    nlp = spacy.load("en_core_web_sm")
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # remove stopwords once more after lemmatization
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]
    return texts_out

data_ready_pre = process_words(data_words_pre_mdgs)

In [None]:
# Filter out short documents Pre
min_doc_length = 10  # You can adjust this threshold
id2word = corpora.Dictionary(data_ready_pre)
filtered_data_ready_pre = [doc for doc in data_ready_pre if len(doc) >= min_doc_length]
corpus_pre = [id2word.doc2bow(text) for text in filtered_data_ready_pre]

In [None]:
# Build LDA model on filtered corpus_pre
lda_model_pre = gensim.models.ldamodel.LdaModel(corpus=corpus_pre,
                                           id2word=id2word,
                                           num_topics=10,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=10,
                                           passes=10,
                                           alpha='symmetric',
                                           iterations=100,
                                           per_word_topics=True)


In [None]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model_pre, texts=filtered_data_ready_pre, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score Pre MDGs: ', coherence_lda)

Coherence Score Pre MDGs:  0.4369305062039631


In [None]:
# Build the bigram and trigram models for post MDGs Data
bigram = gensim.models.Phrases(data_words_post_mdgs, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words_post_mdgs], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Remove Stopwords, Form Bigrams, Trigrams and Lemmatization"""
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
    texts = [bigram_mod[doc] for doc in texts]
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []
    nlp = spacy.load("en_core_web_sm")
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # remove stopwords once more after lemmatization
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]
    return texts_out

data_ready_post = process_words(data_words_post_mdgs)

In [None]:
# Filter out short documents Post
min_doc_length = 10  # You can adjust this threshold
id2word = corpora.Dictionary(data_ready_post)
filtered_data_ready_post = [doc for doc in data_ready_post if len(doc) >= min_doc_length]
corpus_post = [id2word.doc2bow(text) for text in filtered_data_ready_post]

In [None]:
# Build LDA model on filtered corpus_post
lda_model_post = gensim.models.ldamodel.LdaModel(corpus=corpus_post,
                                           id2word=id2word,
                                           num_topics=10,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=10,
                                           passes=10,
                                           alpha='symmetric',
                                           iterations=100,
                                           per_word_topics=True)

In [None]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model_post, texts=filtered_data_ready_post, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score Post MDGs: ', coherence_lda)

Coherence Score Post MDGs:  0.39247713251066063


What are the most discussed topics in the documents?

In [None]:
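import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # this module was named pyLDAvis.gensim in older pyLDAvis releases
pyLDAvis.enable_notebook(local=True)
# Visualize the pre-MDGs model; swap in lda_model_post and corpus_post for the post-MDGs period
vis = gensimvis.prepare(lda_model_pre, corpus_pre, dictionary=lda_model_pre.id2word, mds='mmds')
vis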
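
Besides the interactive view, one simple way to approach the question is to count how often each topic is the dominant one across documents. The sketch below assumes per-document topic distributions given as lists of `(topic_id, probability)` pairs, which is the shape returned by Gensim's `get_document_topics`; the numbers are made up for illustration.

```python
from collections import Counter

def dominant_topic_counts(doc_topics):
    """Count, for each topic id, the number of documents where it has the highest probability."""
    counts = Counter()
    for dist in doc_topics:
        if not dist:  # skip documents with no topic above the probability cutoff
            continue
        top_id, _ = max(dist, key=lambda pair: pair[1])
        counts[top_id] += 1
    return counts

# Hypothetical distributions for four documents
doc_topics = [
    [(0, 0.70), (3, 0.30)],
    [(3, 0.55), (1, 0.45)],
    [(0, 0.90), (2, 0.10)],
    [],
]

counts = dominant_topic_counts(doc_topics)
print(counts.most_common())
```

With the objects from this notebook, the input could plausibly be built as `[lda_model_pre.get_document_topics(bow) for bow in corpus_pre]`.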




In [None]:
data_ready_pre[:1]

# BERTopic

Next we apply BERTopic. We preprocess the text using NLTK and spaCy, build and fit a BERTopic model to identify topics in the preprocessed text, analyze the topics and calculate coherence scores, and finally print the coherence scores and topic information.

# Pre-MDGs Data Set
To identify and extract topics from the pre-MDGs dataset, we first preprocess the data and then train a BERTopic model. The function below takes a text input, converts it to lowercase, removes punctuation and special characters, tokenizes it into individual words, removes stop words, applies the bigram and trigram models, stems the words, optionally removes short words, and finally joins the processed tokens back into a single string.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import gensim
from gensim.utils import simple_preprocess

# data_words holds the tokenized pre-MDGs speeches produced by sent_to_words
data_words = data_words_pre_mdgs

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# Download spaCy model
# !python3 -m spacy download en

import spacy

def preprocess_text(text, stop_words=stopwords.words('english'), bigram_mod=None, trigram_mod=None):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize text into individual words
    tokens = word_tokenize(text)

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Apply bigram and trigram models
    if bigram_mod:
        filtered_tokens = bigram_mod[filtered_tokens]
    if trigram_mod:
        filtered_tokens = trigram_mod[filtered_tokens]

    # Stem words
    stemmer = SnowballStemmer('english')
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

    # Optionally, remove short words
    min_word_length = 1
    long_tokens = [token for token in stemmed_tokens if len(token) >= min_word_length]

    # Join the tokens back into a single string
    processed_text = ' '.join(long_tokens)

    return processed_text


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
import pandas as pd
from bertopic import BERTopic

# Preprocess the raw 'text' column of pre_mdgs_df into one cleaned string per speech
docs  = pre_mdgs_df['text'].apply(lambda x: preprocess_text(x, bigram_mod=bigram_mod, trigram_mod=trigram_mod))

# Create BERTopic model
topic_model = BERTopic(verbose=True, n_gram_range=(1, 5))

topics, _ = topic_model.fit_transform(docs)

# Create a DataFrame with documents, IDs, and topics
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})

# Group by topic and join documents
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})

# Preprocess documents for coherence calculation
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
topic_words = [topic_model.get_topic(topic) for topic in range(len(set(topics))-1)]


2024-01-09 10:53:36,959 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/78 [00:00<?, ?it/s]

2024-01-09 10:58:51,064 - BERTopic - Embedding - Completed ✓
2024-01-09 10:58:51,065 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-01-09 10:59:04,058 - BERTopic - Dimensionality - Completed ✓
2024-01-09 10:59:04,060 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-01-09 10:59:04,138 - BERTopic - Cluster - Completed ✓
2024-01-09 10:59:04,149 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-01-09 11:01:34,105 - BERTopic - Representation - Completed ✓


In [None]:

# Extract features for Topic Coherence evaluation
import gensim.corpora as corpora
from gensim.models import CoherenceModel

words = vectorizer.get_feature_names_out()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics))-1)]

# Evaluate Coherence
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()


In [None]:
print(f'BERTopic - Pre MDGs Data Coherence Score: {coherence}')

BERTopic - Pre MDGs Data Coherence Score: 0.7376672426320015
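
The coherence cell above enumerates topics with `range(len(set(topics))-1)`, which relies on BERTopic having assigned at least one document to the outlier topic `-1`. A slightly more defensive way to list the topic ids, sketched here on made-up assignments, is to filter out `-1` explicitly.

```python
def real_topic_ids(topics):
    """Return the sorted topic ids that occur in `topics`, excluding the outlier id -1."""
    return sorted(t for t in set(topics) if t != -1)

# Hypothetical assignments: with and without outlier documents
with_outliers = [0, 1, -1, 2, 0, -1]
without_outliers = [0, 1, 2, 0]

print(real_topic_ids(with_outliers))
print(real_topic_ids(without_outliers))

# The range-based shortcut drops a real topic when no outlier is present
shortcut = list(range(len(set(without_outliers)) - 1))
print(shortcut)
```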


**Model results for the pre-MDGs data set**

In [None]:
topic_model.visualize_heatmap()

# Post-MDGs Data Set

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import gensim
from gensim.utils import simple_preprocess

# data_words holds the tokenized post-MDGs speeches produced by sent_to_words
data_words = data_words_post_mdgs

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# Download spaCy model
# !python3 -m spacy download en

import spacy

def preprocess_text(text, stop_words=stopwords.words('english'), bigram_mod=None, trigram_mod=None):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize text into individual words
    tokens = word_tokenize(text)

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Apply bigram and trigram models
    if bigram_mod:
        filtered_tokens = bigram_mod[filtered_tokens]
    if trigram_mod:
        filtered_tokens = trigram_mod[filtered_tokens]

    # Stem words
    stemmer = SnowballStemmer('english')
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

    # Optionally, remove short words
    min_word_length = 1
    long_tokens = [token for token in stemmed_tokens if len(token) >= min_word_length]

    # Join the tokens back into a single string
    processed_text = ' '.join(long_tokens)

    return processed_text



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
import pandas as pd
from bertopic import BERTopic

# Preprocess the raw 'text' column of post_mdgs_df into one cleaned string per speech
docs  = post_mdgs_df['text'].apply(lambda x: preprocess_text(x, bigram_mod=bigram_mod, trigram_mod=trigram_mod))

# Create BERTopic model
topic_model = BERTopic(verbose=True, n_gram_range=(1, 5))

topics, _ = topic_model.fit_transform(docs)

# Create a DataFrame with documents, IDs, and topics
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})

# Group by topic and join documents
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})

# Preprocess documents for coherence calculation
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
topic_words = [topic_model.get_topic(topic) for topic in range(len(set(topics))-1)]


2024-01-09 10:14:20,897 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/90 [00:00<?, ?it/s]

2024-01-09 10:20:44,756 - BERTopic - Embedding - Completed ✓
2024-01-09 10:20:44,757 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-01-09 10:20:59,770 - BERTopic - Dimensionality - Completed ✓
2024-01-09 10:20:59,771 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-01-09 10:20:59,889 - BERTopic - Cluster - Completed ✓
2024-01-09 10:20:59,914 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-01-09 10:22:44,908 - BERTopic - Representation - Completed ✓


In [None]:

# Extract features for Topic Coherence evaluation
import gensim.corpora as corpora
from gensim.models import CoherenceModel

words = vectorizer.get_feature_names_out()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics))-1)]

# Evaluate Coherence
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()



In [None]:

print(f'BERTopic - Post MDGs Data Coherence Score: {coherence}')

BERTopic - Post MDGs Data Coherence Score: 0.7133268548076267


In [None]:
topic_model.visualize_heatmap()


# Non-Negative Matrix Factorization (NMF)

Here we apply TF-IDF vectorization with specified parameters such as minimum and maximum document frequency, maximum features, and n-gram range. Subsequently, Gensim's `Dictionary` class is used to create a word-to-ID mapping, and extreme values are filtered out to limit the number of features. The bag-of-words representation of each document is then generated, and Gensim's `Nmf` model is trained on that representation together with the dictionary and additional parameters.
Finally, the coherence score of the trained NMF model is evaluated using Gensim's `CoherenceModel` with the 'c_v' coherence measure.

# Pre-MDGs Data Set

In [None]:
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer, RegexpTokenizer
import nltk
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim.models.nmf import Nmf  # Gensim's NMF implementation, used in the next cell

texts = data_ready_pre

tfidf_vectorizer = TfidfVectorizer(
    min_df=3,
    max_df=0.85,
    max_features=5000,
    ngram_range=(1, 2),
    preprocessor=' '.join
)

tfidf = tfidf_vectorizer.fit_transform(texts)

In [None]:
# Fit Gensim's NMF with 10 topics and evaluate it via the c_v coherence score
texts = data_ready_pre

# Create a dictionary
# In gensim a dictionary is a mapping between words and their integer id
dictionary = Dictionary(texts)

# Filter out extremes to limit the number of features
dictionary.filter_extremes(
    no_below=3,
    no_above=0.85,
    keep_n=5000
)

# Create the bag-of-words format (list of (token_id, token_count))
corpus = [dictionary.doc2bow(text) for text in texts]
nmf = Nmf(
        corpus=corpus,
        num_topics=10,
        id2word=dictionary,
        chunksize=2000,
        passes=5,
        kappa=.1,
        minimum_probability=0.01,
        w_max_iter=300,
        w_stop_condition=0.0001,
        h_max_iter=100,
        h_stop_condition=0.001,
        eval_every=10,
        normalize=True,
        random_state=42)

# Run the coherence model to get the score
cm = CoherenceModel(
    model=nmf,
    texts=texts,
    dictionary=dictionary,
    coherence='c_v'
)

In [None]:

print(f'NMF - Pre MDGs Data Coherence Score: {cm.get_coherence()}')

NMF - Pre MDGs Data Coherence Score: 0.2386311740139692


# Post-MDGs Data Set

In [None]:
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer, RegexpTokenizer
import nltk
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim.models.nmf import Nmf  # Gensim's NMF implementation, used in the next cell

texts = data_ready_post

tfidf_vectorizer = TfidfVectorizer(
    min_df=3,
    max_df=0.85,
    max_features=5000,
    ngram_range=(1, 2),
    preprocessor=' '.join
)

tfidf = tfidf_vectorizer.fit_transform(texts)

In [None]:
# Fit Gensim's NMF with 10 topics and evaluate it via the c_v coherence score
texts = data_ready_post

# Create a dictionary
# In gensim a dictionary is a mapping between words and their integer id
dictionary = Dictionary(texts)

# Filter out extremes to limit the number of features
dictionary.filter_extremes(
    no_below=3,
    no_above=0.85,
    keep_n=5000
)

# Create the bag-of-words format (list of (token_id, token_count))
corpus = [dictionary.doc2bow(text) for text in texts]
nmf = Nmf(
        corpus=corpus,
        num_topics=10,
        id2word=dictionary,
        chunksize=2000,
        passes=5,
        kappa=.1,
        minimum_probability=0.01,
        w_max_iter=300,
        w_stop_condition=0.0001,
        h_max_iter=100,
        h_stop_condition=0.001,
        eval_every=10,
        normalize=True,
        random_state=42)

# Run the coherence model to get the score
cm = CoherenceModel(
    model=nmf,
    texts=texts,
    dictionary=dictionary,
    coherence='c_v'
)

In [None]:
print(f'NMF - Post MDGs Data Coherence Score: {cm.get_coherence()}')

NMF - Post MDGs Data Coherence Score: 0.3037949856225185
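
To compare the three models side by side, the coherence scores printed in this notebook can be gathered into one small table. The values below are copied from the cell outputs above and rounded to four decimals; the layout and the ranking by mean score are only a sketch.

```python
# c_v coherence scores copied from the cell outputs in this notebook (rounded)
scores = {
    "LDA": {"pre": 0.4369, "post": 0.3925},
    "BERTopic": {"pre": 0.7377, "post": 0.7133},
    "NMF": {"pre": 0.2386, "post": 0.3038},
}

# Rank models by their mean coherence across the two periods
ranking = sorted(
    scores,
    key=lambda m: (scores[m]["pre"] + scores[m]["post"]) / 2,
    reverse=True,
)

print(f"{'model':<10}{'pre-MDGs':>10}{'post-MDGs':>11}")
for name in ranking:
    row = scores[name]
    print(f"{name:<10}{row['pre']:>10.4f}{row['post']:>11.4f}")
```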
