## Group Members
### Name: Adam Nabil Rasyad Bin Mohd Rafidi
### Student ID: SW01081420
### Name: Muhamad Afnan Nadzran Bin Mohd Fazlee
### Student ID: SW01081169

In [1]:


# Import the necessary libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim import corpora
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

# Read the data (use only the 'text' column)
df = pd.read_csv('news_dataset.csv')
documents = df['text'].dropna().tolist()  # Use only the 'text' column and remove null values

# Text pre-processing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Tokenize the text into words and convert to lowercase
    tokens = [token for token in tokens if token.isalnum()]  # Filter out non-alphanumeric tokens
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords from the tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize each token
    return tokens  # Return the preprocessed tokens

preprocessed_documents = [preprocess_text(doc) for doc in documents]

# Create a dictionary and corpus for LDA
dictionary = corpora.Dictionary(preprocessed_documents)  # Create a Gensim Dictionary object from the preprocessed documents
dictionary.filter_extremes(no_below=15, no_above=0.5)  # Drop tokens that appear in fewer than 15 documents or in more than 50% of the documents
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_documents]  # Convert each preprocessed document into a bag-of-words representation using the dictionary

# Perform LDA using Gensim
lda_model = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=15)  # Train an LDA model on the corpus with 4 topics using Gensim's LdaModel class

# Evaluate the LDA model using Coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=preprocessed_documents, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

# Print the coherence score
print(f'Topic Coherence Score (C_V): {coherence_lda:.4f}')

# Interpret the result
article_labels = []  # Empty list to store dominant topic labels for each document

# Iterate over each processed document
for doc in preprocessed_documents:
    bow = dictionary.doc2bow(doc)  # For each document, convert to BOW representation
    topics = lda_model.get_document_topics(bow)  # Get list of topic probabilities
    dominant_topic = max(topics, key=lambda x: x[1])[0]  # Determine topic with highest probability
    article_labels.append(dominant_topic)  # Append to the list

# Print the top terms for each topic
print("Top Terms for Each Topic with Weight:")
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}:")
    terms = [term.strip() for term in topic.split("+")]
    for term in terms:
        weight, word = term.split("*")
        print(f"- {word.strip()} (weight: {weight.strip()})")
    print()

# Interpretation of the coherence score
"""
The coherence score reflects the interpretability and meaningfulness of the topics produced by the LDA model.
A higher coherence score signifies that the topics are more coherent and well-defined, making them easier to interpret and analyze.
A lower coherence score suggests that the topics are less coherent and may need further refinement.
"""




Topic Coherence Score (C_V): 0.6466
Top Terms for Each Topic with Weight:
Topic 0:
- "year" (weight: 0.008)
- "would" (weight: 0.006)
- "president" (weight: 0.006)
- "game" (weight: 0.005)
- "one" (weight: 0.005)
- "get" (weight: 0.005)
- "new" (weight: 0.005)
- "time" (weight: 0.005)
- "team" (weight: 0.004)
- "think" (weight: 0.004)

Topic 1:
- "1" (weight: 0.058)
- "0" (weight: 0.044)
- "q" (weight: 0.043)
- "x" (weight: 0.043)
- "max" (weight: 0.042)
- "2" (weight: 0.039)
- "7" (weight: 0.027)
- "g" (weight: 0.027)
- "r" (weight: 0.027)
- "3" (weight: 0.024)

Topic 2:
- "people" (weight: 0.012)
- "would" (weight: 0.011)
- "one" (weight: 0.010)
- "say" (weight: 0.006)
- "think" (weight: 0.006)
- "know" (weight: 0.006)
- "government" (weight: 0.005)
- "god" (weight: 0.005)
- "right" (weight: 0.005)
- "u" (weight: 0.005)

Topic 3:
- "key" (weight: 0.011)
- "use" (weight: 0.009)
- "file" (weight: 0.008)
- "system" (weight: 0.008)
- "x" (weight: 0.006)
- "one" (weight: 0.006)
- "chip" (

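Topic 1 in the output above consists almost entirely of digits and single letters ("1", "0", "q", "x"), and Topic 2 contains "u". This suggests that the `token.isalnum()` check in `preprocess_text` lets numeric and one-character tokens through. The sketch below shows one possible stricter filter. It is not part of the notebook: `keep_token` and `min_len` are hypothetical names, and the three-character threshold is an assumption that may be too aggressive for some corpora.

```python
def keep_token(token, min_len=3):
    """Return True for purely alphabetic tokens with at least min_len characters.

    Hypothetical helper: the name and the default threshold are assumptions.
    """
    return token.isalpha() and len(token) >= min_len


# Sample tokens taken from the topic terms printed above
sample_tokens = ["1", "0", "q", "x", "max", "key", "file", "u", "chip", "3"]
filtered = [t for t in sample_tokens if keep_token(t)]
print(filtered)  # ['max', 'key', 'file', 'chip']
```

If this filter were swapped in for the `isalnum()` line, the dictionary, the topics, and the coherence score would all change, so the model would need to be retrained and re-evaluated.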