University of North Carolina Charlotte<br>
DSBA-6162 Data Mining<br>
Instructor: Dr. Xi (Sunshine) Niu<br>
<br>
Nathan Schaaf<br>
3/25/2025

# LDA Topic Modeling Exercise
In this exercise we apply LDA topic modeling to the entire corpus and extract k=10 topics. The example corpus contains 180 movie reviews, each a text file written by one of two reviewers. The first 80 were written by Berardinelli and the remaining 100 by Schwartz.

If you want to work through this exercise, you will first need to clone the repository. Please refer to the README.MD file.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd /content/drive/My Drive/Data_Mining/08

/content/drive/My Drive/Data_Mining/08


In [5]:
# !pip install -r requirements.txt

Collecting gensim (from -r requirements.txt (line 1))
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting numpy (from -r requirements.txt (line 3))
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy (from -r requirements.txt (line 6))
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manyli

In [4]:
import os
import gensim
import spacy
import pandas as pd
from gensim import corpora
from gensim.models import LdaModel
from nltk.corpus import stopwords
from collections import defaultdict
import nltk

## Problem 1
List the top 10 words for each topic.

In [5]:
# Download NLTK stopwords if not already downloaded
nltk.download('stopwords')

# Load Spacy model for text preprocessing
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [6]:
# Define path to corpus
corpus_path = "/content/drive/My Drive/Data_Mining/08/MovieReviews"

In [9]:
# Read and preprocess the documents
# Build the stopword set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words("english"))
documents = []
for file in os.listdir(corpus_path):
    file_path = os.path.join(corpus_path, file)
    if os.path.isfile(file_path):  # Ensure it's a file
        with open(file_path, "r", encoding="latin-1", errors="replace") as f:
            text = f.read()
        # Tokenization, lemmatization, stopword removal
        doc = nlp(text.lower())
        # Keep the lemma of each alphabetic token whose surface form is not a stopword
        words = [token.lemma_ for token in doc if token.is_alpha and token.text not in stop_words]
        documents.append(words)
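
The loop above numbers documents in the order `os.listdir` returns them, and that order is not guaranteed. If `Doc_1` through `Doc_80` are meant to be the Berardinelli reviews, the file list needs a fixed order. The following is a minimal sketch, assuming that sorting by file name yields the intended order (the file naming inside `MovieReviews` is not shown here); `list_review_files` is a hypothetical helper, and the demo uses made-up file names in a temporary directory.

```python
import os
import tempfile


def list_review_files(corpus_dir):
    """Return paths of regular files in corpus_dir, sorted by file name."""
    # os.listdir gives no ordering guarantee, so sort explicitly
    names = sorted(os.listdir(corpus_dir))
    paths = [os.path.join(corpus_dir, name) for name in names]
    return [p for p in paths if os.path.isfile(p)]


# Demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["review_b.txt", "review_a.txt", "review_c.txt"]:
        with open(os.path.join(tmp, name), "w", encoding="latin-1") as f:
            f.write("placeholder text")
    ordered_names = [os.path.basename(p) for p in list_review_files(tmp)]

print(ordered_names)
```

Iterating over `list_review_files(corpus_path)` instead of `os.listdir(corpus_path)` would make the `Doc_i` labels reproducible across runs and machines.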

In [10]:
# Create a dictionary and corpus
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]
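
To make the bag-of-words step concrete, here is a standard-library sketch of the idea behind the dictionary and corpus: each distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. This is an illustration only, not gensim's implementation; the helper names and the toy documents are hypothetical, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_to_id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


def to_bag_of_words(tokens, token_to_id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


# Two made-up documents, already tokenized
toy_docs = [["film", "story", "film"], ["story", "war"]]
vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bag_of_words(doc, vocab) for doc in toy_docs]

print(vocab)
print(toy_corpus)
```

Word order is discarded in this representation; only the per-document counts reach the LDA model.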

In [11]:
# Train LDA model
num_topics = 10
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15, random_state=42)

In [12]:
# Print top 10 words for each topic
topics = lda_model.show_topics(num_topics=num_topics, num_words=10, formatted=False)
for i, topic in topics:
    words = [word for word, _ in topic]
    print(f"Topic {i + 1}: {', '.join(words)}")

Topic 1: film, man, one, take, go, war, movie, like, get, even
Topic 2: film, family, leave, go, one, seem, joon, act, tell, like
Topic 3: film, movie, one, make, story, go, character, get, life, lose
Topic 4: film, make, get, ali, see, child, much, life, kid, problem
Topic 5: film, one, see, make, get, take, story, life, go, even
Topic 6: film, movie, one, make, like, get, good, story, character, go
Topic 7: film, get, one, see, story, go, take, give, make, movie
Topic 8: film, one, movie, get, make, go, see, life, story, well
Topic 9: film, one, make, see, story, like, get, life, go, man
Topic 10: film, one, movie, life, even, make, well, get, play, much
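
Words such as *film*, *one*, *make*, and *get* appear in nearly every topic above, which makes the topics hard to tell apart. One common remedy is to drop tokens that occur in a large share of the documents before building the dictionary. The sketch below shows that idea with the standard library only; the helper names, the 0.5 threshold, and the toy documents are assumptions for illustration, and the notebook itself does not apply this step.

```python
from collections import Counter


def document_frequency(tokenized_docs):
    """Count in how many documents each token appears at least once."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def drop_common_tokens(tokenized_docs, max_doc_fraction=0.5):
    """Remove tokens that occur in more than max_doc_fraction of the documents."""
    df = document_frequency(tokenized_docs)
    limit = max_doc_fraction * len(tokenized_docs)
    return [[t for t in tokens if df[t] <= limit] for tokens in tokenized_docs]


# Made-up tokenized documents; "film" occurs in all four
toy_docs = [
    ["film", "war", "man"],
    ["film", "family", "leave"],
    ["film", "child", "kid"],
    ["film", "story", "war"],
]
filtered = drop_common_tokens(toy_docs, max_doc_fraction=0.5)

print(filtered)
```

gensim's `Dictionary.filter_extremes` offers a comparable `no_above` parameter; whether filtering improves the topics for this corpus would need to be checked by retraining.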


## Problem 2
List the topic coverage probabilities for each document in this corpus.

In [13]:
# Get topic distribution for each document
topic_distributions = []
for i, doc in enumerate(corpus):
    topic_probs = lda_model.get_document_topics(doc, minimum_probability=0)
    topic_probs_dict = {f"Topic_{topic_id}": prob for topic_id, prob in topic_probs}
    topic_probs_dict["Document"] = f"Doc_{i+1}"
    topic_distributions.append(topic_probs_dict)

# Convert to DataFrame for better readability
df_topic_dist = pd.DataFrame(topic_distributions)
df_topic_dist.set_index("Document", inplace=True)

In [14]:
# Display topic probabilities for each document
print(df_topic_dist)

           Topic_0   Topic_1   Topic_2   Topic_3   Topic_4   Topic_5  \
Document                                                               
Doc_1     0.000312  0.000312  0.000312  0.000312  0.000312  0.000312   
Doc_2     0.000266  0.000266  0.000266  0.000266  0.997610  0.000266   
Doc_3     0.998476  0.000169  0.000169  0.000169  0.000169  0.000169   
Doc_4     0.000143  0.000143  0.000143  0.000143  0.998713  0.000143   
Doc_5     0.000236  0.000236  0.000236  0.000236  0.000236  0.997879   
...            ...       ...       ...       ...       ...       ...   
Doc_176   0.000256  0.000256  0.000256  0.000256  0.000256  0.408640   
Doc_177   0.000197  0.000197  0.000197  0.000197  0.503617  0.494809   
Doc_178   0.000217  0.000217  0.000217  0.000217  0.000217  0.007227   
Doc_179   0.000208  0.000208  0.000208  0.000208  0.000208  0.000208   
Doc_180   0.000314  0.000314  0.000314  0.000314  0.000314  0.144565   

           Topic_6   Topic_7   Topic_8   Topic_9  
Document    

# Conclusion

## 1. Top 10 Words for Each Topic

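The top 10 words for each topic overlap heavily: *film* appears in every topic, and *one*, *make*, *get*, and *go* appear in most of them. The interpretations below therefore rest on the few words that distinguish each topic and should be read as tentative. With that caveat, the topics center on general themes in film, character development, and emotions associated with film watching.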

- **Topic 1**: Focuses on general movie-related terms such as *film, man, movie, war*, and *get*, which may reflect action or war films.
- **Topic 2**: Shows terms related to family dynamics (*family, leave, act, joon*), possibly indicating themes of family drama or relationships.
- **Topic 3**: Emphasizes emotional or plot elements, such as *story, character, life*, which could signify narratives focused on personal growth or struggle.
- **Topic 4**: Seems to center around children's films or family-friendly genres, evident from words like *child, kid, problem, life*.
- **Topic 5**: Again shows a focus on storytelling, with *story, life, make, even*, highlighting narrative-driven films.
- **Topic 6**: Suggests a focus on quality and story, with words like *good, story, character, movie* pointing to well-developed films.
- **Topic 7**: Includes terms such as *give, make, movie, story*, possibly reflecting themes of contribution or action within film narratives.
- **Topic 8**: Indicates a focus on films with a life-lesson or personal journey aspect, evident from words like *life, story, make, well*.
- **Topic 9**: Appears to focus on dramatic elements, with terms like *life, story, man*, reflecting more intense or emotional narratives.
- **Topic 10**: Combines general movie-related terms with terms like *play* and *much*, which might suggest films involving performance or highly engaging content.

## 2. Topic Coverage Probabilities for Each Document

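The topic coverage probabilities show how strongly each document (movie review) is associated with each of the 10 topics. Some documents are almost entirely assigned to a single topic, while others are split, such as Document 177 (0.503617 for Topic 4 and 0.494809 for Topic 5). Topic indices in this section are zero-based, matching the `Topic_0` to `Topic_9` column names, so Topic 4 here corresponds to Topic 5 in the Problem 1 listing. For instance: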

- **Document 1** is mainly associated with Topic 8, with a probability of 0.997194.
- **Document 2** shows a strong association with Topic 4, with a probability of 0.997610.
- **Document 3** has a strong association with Topic 0, with a probability of 0.998476.
- **Document 4** has a high probability (0.998713) for Topic 4.
- **Document 180** is most strongly associated with Topic 7, with a probability of 0.852925.
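
Readings like the ones above can be extracted programmatically by taking the most probable topic from each document's list of `(topic_id, probability)` pairs, the format returned by `get_document_topics` in the notebook. The sketch below is a minimal illustration; `dominant_topic` is a hypothetical helper, and the probabilities are invented for the example rather than taken from the model output.

```python
def dominant_topic(topic_probs):
    """Return (zero_based_id, one_based_label, probability) of the most probable topic."""
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    # The one-based label matches the "Topic {i + 1}" numbering used when printing top words
    return topic_id, topic_id + 1, prob


# Hypothetical distribution for one document; values are illustrative and sum to 1
example_probs = [(0, 0.01), (1, 0.01), (2, 0.85), (3, 0.13)]
zero_based, one_based, prob = dominant_topic(example_probs)

print(f"Topic_{zero_based} (listed as Topic {one_based} in Problem 1), p={prob}")
```

Returning both numberings keeps the zero-based DataFrame columns and the one-based topic listing from being confused.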

Overall, the topic modeling results provide a clear indication of the themes present in the corpus of movie reviews, with each document showing a varying degree of association with the topics generated by the LDA model.
