## Lab Assignment 3 [Text Analytics - CISB5123]

### <span style='background :yellow' > Student Information (Group Members) </span>

Member 1

**Name     :** Zulfadhli Fakhri bin Johan

**ID No    :** SW01081044

**Section  :** 02

=============================================================

Member 2

**Name     :** Errenbai Yeerhanati

**ID No    :** SW01081538

**Section  :** 02

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim import corpora
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Download NLTK Resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ZeeF\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ZeeF\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ZeeF\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# Load dataset
df = pd.read_csv("news_dataset.csv")

In [3]:
# Display the first 5 rows of the raw dataset
df.head()

Unnamed: 0.1,Unnamed: 0,text,target,title,date
0,0,I was wondering if anyone out there could enli...,7,rec.autos,2022-08-02 13:48:37.251043
1,17,I recently posted an article asking what kind ...,7,rec.autos,2022-08-02 13:48:37.251043
2,29,\nIt depends on your priorities. A lot of peo...,7,rec.autos,2022-08-02 13:48:37.251043
3,56,an excellent automatic can be found in the sub...,7,rec.autos,2022-08-02 13:48:37.251043
4,64,: Ford and his automobile. I need information...,7,rec.autos,2022-08-02 13:48:37.251043


In [4]:
# Use only the 'text' column and remove null values
df = df[['text']].dropna()
df.head()

Unnamed: 0,text
0,I was wondering if anyone out there could enli...
1,I recently posted an article asking what kind ...
2,\nIt depends on your priorities. A lot of peo...
3,an excellent automatic can be found in the sub...
4,: Ford and his automobile. I need information...


In [5]:
# Initialize the set of English stop words and the lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

import re

# Text preprocessing function
def preprocess_text(text):
    # Remove email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)
    # Remove strings in the format <...>
    text = re.sub(r'<.*?>', '', text)
    
    tokens = word_tokenize(text.lower())  # Tokenize the text into words and convert to lowercase
    tokens = [token for token in tokens if token.isalnum()]  # Filter out non-alphanumeric tokens
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords from the tokens
    tokens = [token for token in tokens if not token.isnumeric()]  # Remove numeric tokens
    tokens = [token for token in tokens if len(token) > 1]  # Remove single-letter tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize each token
    return tokens

In [6]:
# Preprocess the text column
df['processed_text'] = df['text'].apply(preprocess_text)
df.head()

Unnamed: 0,text,processed_text
0,I was wondering if anyone out there could enli...,"[wondering, anyone, could, enlighten, car, saw..."
1,I recently posted an article asking what kind ...,"[recently, posted, article, asking, kind, rate..."
2,\nIt depends on your priorities. A lot of peo...,"[depends, priority, lot, people, put, higher, ..."
3,an excellent automatic can be found in the sub...,"[excellent, automatic, found, subaru, legacy, ..."
4,: Ford and his automobile. I need information...,"[ford, automobile, need, information, whether,..."


In [7]:
# Create a Gensim Dictionary object from the preprocessed documents
dictionary = corpora.Dictionary(df['processed_text'])

# Filter out tokens that appear in fewer than 15 documents or in more than 50% of the documents
dictionary.filter_extremes(no_below=15, no_above=0.5)

# Convert each preprocessed document into a bag-of-words representation using the dictionary
corpus = [dictionary.doc2bow(doc) for doc in df['processed_text']]

In [8]:
# Train an LDA model on the corpus with 4 topics
lda_model = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=15)

In [9]:
# Empty list to store the dominant topic label for each document
article_labels = []

# Iterate over the bag-of-words vectors already built in `corpus`
for bow in corpus:
    # Get the list of (topic_id, probability) pairs for this document
    topics = lda_model.get_document_topics(bow)
    # Pick the topic with the highest probability
    dominant_topic = max(topics, key=lambda x: x[1])[0]
    # Append to the list
    article_labels.append(dominant_topic)

# Create DataFrame
df_result = pd.DataFrame({"Article": df['text'], "Topic": article_labels})

# Print the DataFrame
print("Table with Articles and Topic:")
print(df_result)
print()

Table with Articles and Topic:
                                                 Article  Topic
0      I was wondering if anyone out there could enli...      1
1      I recently posted an article asking what kind ...      1
2      \nIt depends on your priorities.  A lot of peo...      1
3      an excellent automatic can be found in the sub...      1
4      : Ford and his automobile.  I need information...      1
...                                                  ...    ...
11309  Secrecy in Clipper Chip\n\nThe serial number o...      2
11310  Hi !\n\nI am interested in the source of FEAL ...      2
11311  The actual algorithm is classified, however, t...      0
11312  \n\tThis appears to be generic calling upon th...      1
11313  \nProbably keep quiet and take it, lest they g...      1

[11096 rows x 2 columns]



In [10]:
# Print the top terms for each topic with weight
print("Top Terms for Each Topic:")
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}:")
    terms = [term.strip() for term in topic.split("+")]
    for term in terms:
        weight, word = term.split("*")
        print(f"- {word.strip()} (weight: {weight.strip()})")
    print()

Top Terms for Each Topic:
Topic 0:
- "key" (weight: 0.015)
- "encryption" (weight: 0.008)
- "information" (weight: 0.007)
- "new" (weight: 0.007)
- "chip" (weight: 0.006)
- "program" (weight: 0.006)
- "system" (weight: 0.006)
- "file" (weight: 0.005)
- "use" (weight: 0.005)
- "clipper" (weight: 0.005)

Topic 1:
- "would" (weight: 0.014)
- "one" (weight: 0.010)
- "get" (weight: 0.010)
- "know" (weight: 0.009)
- "like" (weight: 0.009)
- "think" (weight: 0.008)
- "time" (weight: 0.007)
- "year" (weight: 0.007)
- "good" (weight: 0.007)
- "could" (weight: 0.006)

Topic 2:
- "max" (weight: 0.022)
- "db" (weight: 0.010)
- "use" (weight: 0.009)
- "window" (weight: 0.009)
- "file" (weight: 0.008)
- "system" (weight: 0.008)
- "one" (weight: 0.007)
- "using" (weight: 0.006)
- "version" (weight: 0.006)
- "program" (weight: 0.005)

Topic 3:
- "people" (weight: 0.011)
- "would" (weight: 0.009)
- "one" (weight: 0.009)
- "god" (weight: 0.006)
- "say" (weight: 0.005)
- "government" (weight: 0.005)
- "r

In [11]:
# Calculate the coherence score for the LDA model
coherence_model_lda = CoherenceModel(model=lda_model, texts=df['processed_text'], dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

In [12]:
# Print the coherence score
print(f'Topic Coherence Score (C_V): {coherence_lda:.4f}')

Topic Coherence Score (C_V): 0.5707


# Topic Modeling with LDA
Number of Topics
- The LDA model was trained with `num_topics=4` and 15 passes over the corpus. Each document was then assigned a dominant topic, which is the topic with the highest probability in that document's topic distribution.
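
The dominant-topic rule can be sketched in isolation. This is a minimal illustration, assuming a document's topic distribution is a list of `(topic_id, probability)` pairs as returned in the notebook; the helper name `dominant_topic` and the sample probabilities are made up for the example.

```python
def dominant_topic(topic_probs):
    """Return the topic id with the highest probability.

    `topic_probs` is a list of (topic_id, probability) pairs.
    Hypothetical helper, for illustration only.
    """
    if not topic_probs:
        return None  # no topic was reported for this document
    return max(topic_probs, key=lambda pair: pair[1])[0]


# Made-up distribution for one document over 4 topics
example = [(0, 0.05), (1, 0.72), (2, 0.20), (3, 0.03)]
print(dominant_topic(example))  # prints 1
```

Returning `None` for an empty list is a design choice of this sketch; the notebook loop would raise an error in that case instead.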

# Interpretation of the Top Terms for Each Topic
Top Terms for Each Topic
- Topic 0: Relates to encryption and cryptography, with terms like "key," "encryption," "chip," and "clipper."
- Topic 1: Contains general conversational terms such as "would," "one," "get," and "know." These words carry little topical meaning, so this topic is hard to label; it likely collects opinions, discussions, and inquiries. The sample rec.autos posts shown earlier were assigned to this topic.
- Topic 2: Appears to focus on software and file handling, with terms like "window," "file," "system," and "version." The leading terms "max" and "db" have no obvious meaning on their own and may be artifacts of non-prose content in the posts.
- Topic 3: Seems to cover social or political discussions, indicated by terms like "people," "god," "government," "right," and "state."
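
One quick check on how distinct the topics are is to look for top terms that appear in more than one topic. The sketch below assumes the term lists are copied by hand from the printed output; Topic 3 is cut off in that output, so only its visible terms are included.

```python
from collections import defaultdict

# Top terms copied from the printed output; Topic 3 is truncated there,
# so only its visible terms are listed.
top_terms = {
    0: ["key", "encryption", "information", "new", "chip",
        "program", "system", "file", "use", "clipper"],
    1: ["would", "one", "get", "know", "like",
        "think", "time", "year", "good", "could"],
    2: ["max", "db", "use", "window", "file",
        "system", "one", "using", "version", "program"],
    3: ["people", "would", "one", "god", "say", "government"],
}

# Map each term to the list of topics it appears in
topics_by_term = defaultdict(list)
for topic_id, terms in top_terms.items():
    for term in terms:
        topics_by_term[term].append(topic_id)

# Keep only the terms shared by two or more topics
shared = {t: ids for t, ids in topics_by_term.items() if len(ids) > 1}
for term, ids in sorted(shared.items()):
    print(f"{term}: topics {ids}")
```

Terms shared between topics are not necessarily a problem, but many shared generic words can indicate that further stop-word filtering would help.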

# Interpretation of the Coherence Score
Topic Coherence Score (C_V)
- A C_V coherence score of 0.5707 suggests that the topics are moderately coherent and interpretable. There is still room for improvement: Topic 1 consists mostly of generic words, and several top terms are shared between topics.
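
The intuition behind a coherence score is that the top words of a good topic tend to occur in the same documents. The sketch below is not the C_V measure used in this notebook; it is a simplified, UMass-style score written for illustration, and the function name and the tiny corpus are made up.

```python
import math
from itertools import combinations


def umass_style_coherence(top_words, documents):
    """Average log co-occurrence ratio over pairs of top words.

    Simplified illustration; not the C_V measure used in the notebook.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else 0.0


# Tiny made-up corpus of tokenized documents
docs = [
    ["key", "encryption", "chip"],
    ["key", "encryption", "clipper"],
    ["window", "file", "program"],
    ["window", "file", "version"],
]

coherent = umass_style_coherence(["key", "encryption"], docs)
mixed = umass_style_coherence(["key", "window"], docs)
print(coherent > mixed)  # prints True
```

Words that always appear together score higher than words that never share a document, which is the behavior a coherence measure is meant to capture.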

# Summary
LDA Model
- The model produced 4 topics, and each document was assigned a dominant topic based on its content. Topics 0, 2, and 3 have recognizable themes, while Topic 1 is a catch-all of generic words.

Coherence Score
- The C_V coherence score of 0.5707 gives a quantitative measure of topic quality and indicates that the topics are fairly well defined and interpretable.
- The score can also be used to compare models, for example when trying a different number of topics.