
This module is prepared for  SOCIAL MEDIA MINING MADE EASY: PYTHON TEXT ANALYTICS CRASH COURSE WORKSHOP, 25-26 September 2023.

Author: Dr Lailatul Qadri Zakaria, Asian Language Processing Lab (ASLAN), Center For Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology (FTSM), Universiti Kebangsaan Malaysia (UKM).

Email: lailatul.qadri@ukm.edu.my


#TOPIC MODELLING

**Topic modelling** is a **natural language processing (NLP)** approach used in **machine learning and text mining** to identify the underlying topics or themes in a set of texts. It's especially effective for organising, summarising, and comprehending enormous amounts of textual material. The goal of topic modelling algorithms is to automatically detect patterns in text and group similar words and documents together based on their content.

One of the most popular methods for topic modeling is **Latent Dirichlet Allocation (LDA)**. LDA assumes that texts are topic mixtures and that subjects are word mixtures. It attempts to reverse-engineer this process in order to discover the subjects and their distributions across a corpus of documents. This is how it works:


* LDA assumes that each document in the corpus is a combination of subjects. A news piece, for example, may include 30% on politics, 20% about sports, and 50% about technology.

* Topic-Word Distribution: : It is also assumed that each subject is a word combination. A "politics" theme, for example, may include terms like "government," "election," and "policy."

By analyzing the co-occurrence patterns of words across documents, LDA and similar algorithms can estimate these document-topic and topic-word distributions. Researchers and analysts can then interpret the results to understand the primary themes in the data.

In [1]:
#Import all the required libararies
import gensim
import nltk
from gensim import corpora
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from pprint import pprint
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

import string

#Phase 1: Data Collection
documents =[
    "Artificial intelligence is intelligence exhibited by machines, rather than humans or other animals.",
    "The field of AI research defines itself as the study of intelligent agents any device that perceives its environment and takes actions that maximize its chance of success at some goal.",
    "The overall research goal of artificial intelligence is to create technology that allows computers and machines to function in an intelligent manner.",
    "Chemistry is sometimes called the central science because it bridges other natural sciences, including physics, geology and biology.",
    "Chemistry includes topics such as the properties of individual atoms and how atoms form chemical bonds to create chemical compounds.",
    "Chemistry is a branch of physical science that studies the composition, structure of atoms, properties and change of matter."]

#Phase 2: Data Cleaning
# Tokenize the documents and create a dictionary
cleaned = []
for doc in documents:
  doc = doc.translate(str.maketrans('', '', string.punctuation))
  doc = doc.lower()
  cleaned.append(doc)


tokenized_documents = [doc.split() for doc in cleaned]
#clean tokenized document
cleaned_documents=[]
for tokenized_document in tokenized_documents:
  filtered_sentence = [w for w in tokenized_document if not w.lower() in stop_words]
  cleaned_documents.append(filtered_sentence)
print(tokenized_documents)
dictionary = corpora.Dictionary(cleaned_documents)

#Phase 3: Document Representation
# Create a corpus (a bag of words representation of the documents)
corpus = [dictionary.doc2bow(doc) for doc in cleaned_documents]

#Phase 4: Modelling
# Build an LDA model
lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=5)

# Print the topics and their top words
pprint(lda_model.print_topics(num_words=20))





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


[['artificial', 'intelligence', 'is', 'intelligence', 'exhibited', 'by', 'machines', 'rather', 'than', 'humans', 'or', 'other', 'animals'], ['the', 'field', 'of', 'ai', 'research', 'defines', 'itself', 'as', 'the', 'study', 'of', 'intelligent', 'agents', 'any', 'device', 'that', 'perceives', 'its', 'environment', 'and', 'takes', 'actions', 'that', 'maximize', 'its', 'chance', 'of', 'success', 'at', 'some', 'goal'], ['the', 'overall', 'research', 'goal', 'of', 'artificial', 'intelligence', 'is', 'to', 'create', 'technology', 'that', 'allows', 'computers', 'and', 'machines', 'to', 'function', 'in', 'an', 'intelligent', 'manner'], ['chemistry', 'is', 'sometimes', 'called', 'the', 'central', 'science', 'because', 'it', 'bridges', 'other', 'natural', 'sciences', 'including', 'physics', 'geology', 'and', 'biology'], ['chemistry', 'includes', 'topics', 'such', 'as', 'the', 'properties', 'of', 'individual', 'atoms', 'and', 'how', 'atoms', 'form', 'chemical', 'bonds', 'to', 'create', 'chemical'

Lets us test a new document and see if it can be identified in the similar group correctly.

In [2]:
# Phase 5: Evaluation
from gensim.models import CoherenceModel

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=cleaned_documents, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print('Coherence Score:', coherence_lda)

# Calculate and print the perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

Coherence Score: 0.4180324419606567

Perplexity:  -4.589589812984205


In [3]:
# Test the model on a new document
doc1 = "Chemistry is the branch of science that deals with the properties, composition, and structure of elements and compounds"
doc2 ="Machine learning is a area of Artificial intelligence"

new_bow_doc1 = dictionary.doc2bow(doc1.split())
print("\nTopic distribution for the new doc1:")
pprint(lda_model[new_bow_doc1])

new_bow_doc2 = dictionary.doc2bow(doc2.split())
print("\nTopic distribution for the new doc2:")
pprint(lda_model[new_bow_doc2])




Topic distribution for the new doc1:
[(0, 0.2686649), (1, 0.7313351)]

Topic distribution for the new doc2:
[(0, 0.3546966), (1, 0.64530337)]


In [4]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/2.6 MB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.4/2.6 MB[0m [31m10.9 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m2.6/2.6 MB[0m [31m45.7 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.6/2.6 MB[0m [31m32.6 MB/s[0m eta [36m0:00:00[0m
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1


In [5]:
import pyLDAvis.gensim
import matplotlib.pyplot as plt

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

# Save the visualization as an HTML file (optional)
pyLDAvis.save_html(vis, 'lda_visualization.html')

# Show the visualization in a matplotlib figure (optional)
plt.figure(figsize=(8, 4))
pyLDAvis.display(vis)





<Figure size 800x400 with 0 Axes>

## Activity: Lets check on topics in Trip Advisor Dataset.




## Phase 1: Dataset collection

In this activity, we will be using Trip Advisor dataset. Lets try to observe topics in the dataset. How many documents do we have in the dataset?


In [6]:
import chardet
import io
import pandas as pd

file_path = 'trip_advisor_dataset.csv'

# Step 1: Detect encoding
with open(file_path, 'rb') as file:
    raw_data = file.read()

result = chardet.detect(raw_data)
encoding = result['encoding']
print(f"Detected encoding: {encoding}")

# Step 2: Read the file content with error handling
with open(file_path, 'r', encoding=encoding, errors='replace') as file:
    content = file.read()

# Step 3: Read the CSV content using pandas
df = pd.read_csv(io.StringIO(content))
print(df)

  and should_run_async(code)


Detected encoding: MacRoman


ParserError: Error tokenizing data. C error: EOF inside string starting at row 8219

We will observe the data in cleaned_data column and submit the data to our topic modelling code.

In [None]:
documents = df['Review']
print(documents)

In [None]:
print(documents[10])

## Phase 2: Data Preprocessing


In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [None]:
tokenized_documents = [doc.split() for doc in documents]
#clean tokenized document
cleaned_documents=[]
for tokenized_document in tokenized_documents:
  filtered_sentence = [w for w in tokenized_document if not w.lower() in stop_words]
  cleaned_documents.append(filtered_sentence)
print(tokenized_documents)


#print(tokenized_documents)
dictionary = corpora.Dictionary(cleaned_documents)



## Phase 3: Data Representation

In [None]:
# Create a corpus (a bag of words representation of the documents)
corpus = [dictionary.doc2bow(doc) for doc in cleaned_documents]




## Phase 4: Topic Modelling

In [None]:

# Build an LDA model
lda_model = LdaModel(corpus, num_topics=12, id2word=dictionary, iterations=50, passes=8, alpha=1.0)

# Print the topics and their top words
pprint(lda_model.print_topics(num_words=20))

## Phase 5: Evaluation

In [None]:
from gensim.models import CoherenceModel

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=cleaned_documents, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print('Coherence Score:', coherence_lda)

# Calculate and print the perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

## Phase 6: Visualization

In [None]:
import pyLDAvis.gensim
import matplotlib.pyplot as plt

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

# Save the visualization as an HTML file (optional)
pyLDAvis.save_html(vis, 'lda_visualization.html')

# Show the visualization in a matplotlib figure (optional)
plt.figure(figsize=(8, 4))
pyLDAvis.display(vis)




In [None]:
print(documents[0])

In [None]:
from collections import defaultdict

In [None]:
# Initialize topic terms dictionary
topic_terms = defaultdict(list)

# Initialize topic percentage dictionary
topic_percentage_total = defaultdict(float)

# Get top terms for each topic and calculate aggregated topic percentages
for topic_id in range(lda_model.num_topics):
    topic_terms[topic_id] = [term for term, _ in lda_model.show_topic(topic_id)]

    # Calculate aggregated topic percentages
    for doc_id, doc in enumerate(cleaned_documents):
        topic_distribution = lda_model.get_document_topics(corpus[doc_id])
        for topic, percentage in topic_distribution:
            if topic == topic_id:
                topic_percentage_total[topic_id] += percentage / len(cleaned_documents)

# Plot the aggregated topic percentages
topics = list(topic_percentage_total.keys())
percentages = list(topic_percentage_total.values())

plt.figure(figsize=(10, 6))
plt.bar(topics, percentages, color='skyblue')
plt.xlabel('Topic')
plt.ylabel('Average Percentage')
plt.title('Average Topic Percentage Across Documents')

# Display the top terms for each topic
for topic_id, terms in topic_terms.items():
    plt.text(topic_id, topic_percentage_total[topic_id] + 0.5, ', '.join(terms), ha='center', va='bottom', rotation=90)

plt.xticks(range(len(topics)), [f"Topic {topic_id}" for topic_id in topics])
plt.tight_layout()
plt.show()

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
for topic_id, topic in enumerate(lda_model.print_topics(num_topics=10, num_words=20)):
    topic_words = " ".join([word.split("*")[1].strip() for word in topic[1].split(" + ")])
    wordcloud = WordCloud(width=800, height=800, random_state=21, max_font_size=110).generate(topic_words)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title("Topic: {}".format(topic_id))
    plt.show()

In [None]:
print(documents[40])

In [None]:
from collections import defaultdict

# Get topic distribution for the document
topic_distribution = lda_model.get_document_topics(corpus[40])

# Convert topic distribution to percentage
topic_percentage = defaultdict(float)
for topic, percentage in topic_distribution:
    topic_percentage[topic] = percentage * 100

# Plot the topic percentages
topics = list(topic_percentage.keys())
percentages = list(topic_percentage.values())

plt.bar(topics, percentages)
plt.xlabel('Topic')
plt.ylabel('Percentage')
plt.title('Topic Percentage in Document')
plt.xticks(range(len(topics)), topics)
plt.show()

Name: Muhammad Izzul Islam Bin Faisal, Muhammad Ajrul Amin Bin Mohd Zaidi

Student Number: A200363, A194789


# Question

In this activity, you will fine-tune the LDA parameters by adjusting this parameters:

a) alpha : 0.25, 0.5, 0.75, 1.0

b) num_topics = 3, 6, 9, 12

c) iterations = 25, 50, 75, 100

d) passes = 2, 4, 6, 8

1. Run 2 LDA models - each with different set of parameters.

  1a) alpha = 0.25, num_topics = 7, iteration = 100, passes = 6

  1b) alpha = 1.0, num_topics = 7, iteration = 100, passes = 6

  2a) alpha = 1.0, num_topics = 9, iteration = 50, passes = 8

  2b) alpha = 1.0, num_topics = 12, iteration = 50, passes = 8
  


2. Compare the result of the two models

a) Observe the coherence and perplexity score for each model

  1a)  

    - Coherence Score: 0.4945604234638519

    - Perplexity:  -8.2351585916133

  1b)
  
    - Coherence Score: 0.38720069851969047
  
    - Perplexity:  -8.220805528031194

  2a)

    - Coherence Score: 0.3962586175679874

    - Perplexity:  -8.44691534104617

  2b)

    - Coherence Score: 0.3914515838525343

    - Perplexity:  -8.92480406807112

b) Explain how these parameter influenced the topics across documents.

alpha:

    - lower alpha value means each documents are associated with fewer and more distinct topics. This leads to higher coherence scores and slightly better perplexity, indicating more interpretable and well-defined topics.
    - higher alpha value emans that each documents are associated with more overlapping topics. This leads to lower coherence scores and slightly worse perplexity, indicating less distinct and less interpretable topics.

num_topics:

    - When the number of subjects is reduced (as in Model 2a with 9 topics), each topic becomes larger, covering more general issues. This can result in higher coherence ratings since the themes are less fragmented and the high-scoring words within each subject are more likely to be semantically similar. However, it may fail to catch certain nuances in the data.
    - When the number of topics increases (as in Model 2b's 12 topics), the subjects become more detailed, collecting finer features within the data. This can result in poorer coherence scores since the high-scoring words within each subject may be less semantically connected due to the topics' granularity. However, it frequently leads in a better fit (lower perplexity) since the model can detect more particular patterns in the data.
