## <center>CISB5123 Text Analytics</center>
## <center> Lab Assignment 3 - Topic Modeling </center>

#### 1) Amirul Farhan bin Kamaruzaman, SW01082374
#### 2) Maizatul Aufa binti Zamidi, SW01082394

In [4]:
# For text preprocessing 
import nltk 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer, PorterStemmer

# For topic modeling 
from gensim import corpora 
from gensim.models import LdaModel, CoherenceModel
import pandas as pd 
import string
 
# Download NLTK Resources 
nltk.download('stopwords') 
nltk.download('punkt') 
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maiza\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maiza\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\maiza\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# Load the dataset
df = pd.read_csv('news_dataset.csv')
df = df[['text']].dropna()  # Keep only the text column and drop rows with null values

In [8]:
# Initialize tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Text preprocessing function
def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Lowercase and tokenize
    tokens = [token for token in tokens if token.isalpha()]  # Keep only pure words (remove numbers & punctuation)
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords
    tokens = [stemmer.stem(token) for token in tokens]  # Apply stemming
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Apply lemmatization
    return tokens

# Apply preprocessing
preprocessed_documents = [preprocess_text(doc) for doc in df['text'].tolist()]

In [10]:
print(preprocessed_documents[0])

['wonder', 'anyon', 'could', 'enlighten', 'car', 'saw', 'day', 'sport', 'car', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'realli', 'small', 'addit', 'front', 'bumper', 'separ', 'rest', 'bodi', 'know', 'anyon', 'tellm', 'model', 'name', 'engin', 'spec', 'year', 'product', 'car', 'made', 'histori', 'whatev', 'info', 'funki', 'look', 'car', 'plea']


In [12]:
# Create document-term matrix
dictionary = corpora.Dictionary(preprocessed_documents)
dictionary.filter_extremes(no_below=15, no_above=0.5)
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_documents]
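
To make the effect of `filter_extremes(no_below=15, no_above=0.5)` more concrete, here is a minimal pure-Python sketch of a document-frequency filter. It assumes the rule is "keep a token if it appears in at least `no_below` documents and in at most `no_above` (as a fraction) of all documents"; the function name `filter_by_document_frequency` and the toy documents are hypothetical, and gensim's additional `keep_n` cap is left out.

```python
from collections import Counter


def filter_by_document_frequency(documents, no_below=15, no_above=0.5):
    """Return the tokens that survive a document-frequency filter.

    Hypothetical helper that mimics the idea behind filter_extremes:
    drop tokens that are too rare (fewer than `no_below` documents) or
    too common (more than `no_above` of all documents).
    """
    num_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document

    max_docs = int(no_above * num_docs)  # absolute upper bound on document count
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# Toy corpus (made up for illustration, not taken from news_dataset.csv)
toy_docs = [
    ["car", "engin", "would"],
    ["car", "door", "would"],
    ["key", "chip", "would"],
    ["key", "encrypt", "would"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['car', 'key']
```

In the toy corpus, `would` is dropped because it occurs in every document, and the tokens that occur only once are dropped as too rare.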

In [14]:
# Run LDA
lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15, random_state=42)

In [16]:
# Evaluate Model
coherence_model_lda = CoherenceModel(model=lda_model, texts=preprocessed_documents, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(f'Coherence Score: {coherence_lda}\n')

Coherence Score: 0.6419611361396821



The coherence score (c_v) of the LDA model is 0.6419, which suggests that the generated topics are reasonably interpretable. As a rule of thumb, a c_v score above 0.5 is usually considered good, so the model did a decent job of grouping related words into meaningful topics.
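
To give some intuition for what a coherence score rewards, the sketch below computes a simpler, UMass-style measure by hand. This is not the `c_v` measure used above, which is more involved; it only illustrates the shared idea that a topic scores higher when its top words tend to appear in the same documents. The function name `umass_style_coherence` and the toy documents are hypothetical, and the scores are on a different scale from `c_v` (often negative), so the 0.5 rule of thumb does not apply to them.

```python
import math


def umass_style_coherence(top_words, documents):
    """Average log co-occurrence ratio over ordered pairs of top words.

    For each pair (w_i, w_j) with w_j ranked above w_i, the score is
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents.
    Hypothetical helper for illustration only.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents containing all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            d_j = doc_count(w_j)
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            scores.append(math.log((doc_count(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else 0.0


# Toy corpus (made up for illustration)
toy_corpus = [
    ["key", "encrypt", "chip"],
    ["key", "encrypt"],
    ["god", "law"],
    ["god", "state"],
]

coherent = umass_style_coherence(["key", "encrypt"], toy_corpus)
mixed = umass_style_coherence(["key", "god"], toy_corpus)
print(coherent > mixed)  # True
```

Words that co-occur (`key`, `encrypt`) score higher than words that never share a document (`key`, `god`), which is the behaviour a coherence measure is meant to capture.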

In [18]:
# Interpret Results
article_labels = []
for doc in preprocessed_documents:
    bow = dictionary.doc2bow(doc)
    topics = lda_model.get_document_topics(bow)
    dominant_topic = max(topics, key=lambda x: x[1])[0]
    article_labels.append(dominant_topic)
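
The labelling step above relies on `get_document_topics` returning a list of `(topic_id, probability)` pairs and picks the pair with the highest probability. A minimal sketch of that selection on hand-written pairs is shown below; the helper name `dominant_topic`, the `default` fallback, and the example probabilities are assumptions for illustration, not output from the trained model.

```python
def dominant_topic(topic_probs, default=-1):
    """Return the topic id with the highest probability.

    `topic_probs` is a list of (topic_id, probability) pairs. Topics with
    very low probability may be missing from the list, so the list can be
    shorter than the number of topics. An empty list returns `default`.
    """
    if not topic_probs:
        return default
    return max(topic_probs, key=lambda pair: pair[1])[0]


# Hand-written example distribution over three topics
example = [(0, 0.05), (1, 0.62), (2, 0.33)]
print(dominant_topic(example))  # 1
```

Guarding against an empty list avoids the `ValueError` that `max` raises on an empty sequence.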

In [20]:
# Create DataFrame
df_result = pd.DataFrame({"Article": df['text'].tolist(), "Topic": article_labels})

In [22]:
# Print the DataFrame
print("Table with Articles and Topic:")
print(df_result)
print()

Table with Articles and Topic:
                                                 Article  Topic
0      I was wondering if anyone out there could enli...      2
1      I recently posted an article asking what kind ...      2
2      \nIt depends on your priorities.  A lot of peo...      1
3      an excellent automatic can be found in the sub...      1
4      : Ford and his automobile.  I need information...      1
...                                                  ...    ...
11091  Secrecy in Clipper Chip\n\nThe serial number o...      1
11092  Hi !\n\nI am interested in the source of FEAL ...      4
11093  The actual algorithm is classified, however, t...      1
11094  \n\tThis appears to be generic calling upon th...      1
11095  \nProbably keep quiet and take it, lest they g...      2

[11096 rows x 2 columns]



In [24]:
# Print top terms for each topic
for topic_id in range(lda_model.num_topics):
    print(f"Top terms for Topic #{topic_id}:")
    top_terms = lda_model.show_topic(topic_id, topn=10)
    print([term[0] for term in top_terms])
    print()

Top terms for Topic #0:
['peopl', 'would', 'one', 'govern', 'god', 'state', 'law', 'say', 'use', 'right']

Top terms for Topic #1:
['use', 'key', 'encrypt', 'one', 'chip', 'would', 'system', 'get', 'like', 'clipper']

Top terms for Topic #2:
['go', 'would', 'get', 'one', 'know', 'like', 'think', 'year', 'time', 'peopl']

Top terms for Topic #3:
['q', 'max', 'g', 'r', 'p', 'n', 'db', 'k', 'w', 'c']

Top terms for Topic #4:
['x', 'file', 'use', 'program', 'window', 'anonym', 'inform', 'avail', 'post', 'includ']
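
Besides `show_topic`, gensim can return each topic as one formatted string through `print_topics`, in the form `0.020*"use" + 0.018*"key" + ...`. The sketch below shows one way to turn such a string into `(word, weight)` pairs with the surrounding quotes removed. The helper name `parse_topic_string` is hypothetical, and the sample string is assembled from the Topic 1 weights reported in this notebook.

```python
def parse_topic_string(topic):
    """Split a string like '0.020*"use" + 0.018*"key"' into (word, weight) pairs."""
    pairs = []
    for term in topic.split(" + "):
        weight, word = term.split("*", 1)  # weight comes before the asterisk
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs


sample = '0.020*"use" + 0.018*"key" + 0.011*"encrypt"'
print(parse_topic_string(sample))
# [('use', 0.02), ('key', 0.018), ('encrypt', 0.011)]
```

When the weights are needed as numbers rather than text, `show_topic` already returns `(word, probability)` tuples, so string parsing is only necessary when starting from the formatted output.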



In [26]:
# Print the top terms for each topic with weight
print("Top Terms for Each Topic:")
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}:")
    terms = [term.strip() for term in topic.split("+")]
    for term in terms:
        weight, word = term.split("*")
        print(f"- {word.strip()} (weight: {weight.strip()})")
    print()

Top Terms for Each Topic:
Topic 0:
- "peopl" (weight: 0.009)
- "would" (weight: 0.008)
- "one" (weight: 0.007)
- "govern" (weight: 0.007)
- "god" (weight: 0.006)
- "state" (weight: 0.005)
- "law" (weight: 0.005)
- "say" (weight: 0.005)
- "use" (weight: 0.005)
- "right" (weight: 0.005)

Topic 1:
- "use" (weight: 0.020)
- "key" (weight: 0.018)
- "encrypt" (weight: 0.011)
- "one" (weight: 0.010)
- "chip" (weight: 0.010)
- "would" (weight: 0.009)
- "system" (weight: 0.008)
- "get" (weight: 0.008)
- "like" (weight: 0.007)
- "clipper" (weight: 0.006)

Topic 2:
- "go" (weight: 0.011)
- "would" (weight: 0.009)
- "get" (weight: 0.009)
- "one" (weight: 0.008)
- "know" (weight: 0.008)
- "like" (weight: 0.008)
- "think" (weight: 0.007)
- "year" (weight: 0.007)
- "time" (weight: 0.006)
- "peopl" (weight: 0.006)

Topic 3:
- "q" (weight: 0.099)
- "max" (weight: 0.086)
- "g" (weight: 0.056)
- "r" (weight: 0.055)
- "p" (weight: 0.047)
- "n" (weight: 0.044)
- "db" (weight: 0.041)
- "k" (weight: 0.031)
-