1️⃣ Read the Cleaned Corpus

In [4]:
import os

folder_path = "../2_Preprocessing/Corpus_deepseek_cleaned/"
documents = []

for filename in os.listdir(folder_path):
    if filename.endswith(".txt") and filename != "all_texts.txt":  # ignore combined file
        file_path = os.path.join(folder_path, filename)
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            documents.append(file.read())

print(f"Loaded {len(documents)} documents for topic modeling")
print("First document preview:\n", documents[0][:500])


Loaded 115 documents for topic modeling
First document preview:
 cnn surprisingly efficient powerful chinese ai model take technology industry storm call deepseek rattling nerve wall street new ai model develop deepseek startup bear year ago somehow manage breakthrough famed tech investor marc andreessen call ai sputnik moment nearly match capability far famous rival include openai gpt meta llama google gemini fraction cost company say spent million power base ai model compare hundred million billion dollar u company spend ai technology even shock consider un
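
The order returned by os.listdir is not guaranteed, so "the first document" can differ between machines. A minimal sketch of a loader that sorts the files and keeps their names is shown below; the function name `load_corpus` and its `skip` parameter are illustrative and not part of the original notebook.

```python
from pathlib import Path


def load_corpus(folder, skip=("all_texts.txt",)):
    """Return (filenames, texts) for every .txt file in `folder`, sorted by name."""
    names, texts = [], []
    # sorted() makes the document order reproducible,
    # which os.listdir() does not guarantee.
    for path in sorted(Path(folder).glob("*.txt")):
        if path.name in skip:
            continue  # ignore the combined file
        names.append(path.name)
        texts.append(path.read_text(encoding="utf-8", errors="ignore"))
    return names, texts
```

It could be called as `filenames, documents = load_corpus(folder_path)`; keeping `filenames` makes it possible to trace a topic back to the articles that drive it.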


2️⃣ Preprocess the Texts
We need to lowercase, remove punctuation/numbers, tokenize, remove stopwords, and lemmatize.

In [7]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
custom_stopwords = {'deepseek','ai','chinese','say'}  # noise words
lemmatizer = WordNetLemmatizer()

docs_tokens = []

for text in documents:
    text = text.lower()                          # lowercase
    text = re.sub(r'[^a-z\s]', '', text)         # keep only letters
    tokens = nltk.word_tokenize(text)            # tokenize
    tokens = [t for t in tokens if t not in stop_words and t not in custom_stopwords]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]  # lemmatize
    
    if tokens:
        docs_tokens.append(tokens)

print("Example tokens from first doc:", docs_tokens[0][:50])


[nltk_data] Downloading package punkt to /Users/lulu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/lulu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/lulu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Example tokens from first doc: ['cnn', 'surprisingly', 'efficient', 'powerful', 'model', 'take', 'technology', 'industry', 'storm', 'call', 'rattling', 'nerve', 'wall', 'street', 'new', 'model', 'develop', 'startup', 'bear', 'year', 'ago', 'somehow', 'manage', 'breakthrough', 'famed', 'tech', 'investor', 'marc', 'andreessen', 'call', 'sputnik', 'moment', 'nearly', 'match', 'capability', 'far', 'famous', 'rival', 'include', 'openai', 'gpt', 'meta', 'llama', 'google', 'gemini', 'fraction', 'cost', 'company', 'spent', 'million']


Now docs_tokens is a list of lists, where each inner list holds one document’s cleaned tokens. Documents that end up with no tokens are skipped, so docs_tokens can be shorter than documents.
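
One detail of the cleaning step is worth seeing in isolation: deleting every non-letter character glues hyphenated words together. The sketch below compares the notebook's rule with a variant that substitutes a space instead. The variant and the sample sentence are assumptions for illustration, not something the notebook does.

```python
import re


def clean_delete(text):
    """Same rule as the notebook: drop every character that is not a-z or whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())


def clean_space(text):
    """Variant (assumption): replace those characters with a space, then collapse runs."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


sample = "State-of-the-art models don't cost $100."  # hypothetical sentence
print(clean_delete(sample).split())
print(clean_space(sample).split())
```

The first call yields a single fused token for the hyphenated phrase, while the second splits it into separate words (most of which the stopword filter would then remove). Which behaviour is preferable depends on the corpus.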

3️⃣ Create Dictionary and Corpus for LDA

In [10]:
from gensim import corpora

dictionary = corpora.Dictionary(docs_tokens)
dictionary.filter_extremes(no_below=2, no_above=0.5)  # drop tokens in fewer than 2 docs or in more than 50% of docs

corpus = [dictionary.doc2bow(text) for text in docs_tokens]

print("Number of unique tokens:", len(dictionary))
print("Number of documents:", len(corpus))


Number of unique tokens: 2984
Number of documents: 115
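
To make the two steps above concrete, here is a pure-Python sketch of what document-frequency filtering and bag-of-words conversion do. It approximates the behaviour of filter_extremes and doc2bow and is not gensim's implementation (gensim also caps the vocabulary size through its keep_n parameter). The helper names and the toy documents are hypothetical.

```python
from collections import Counter


def document_frequencies(docs_tokens):
    """Count in how many documents each token appears (not how many times)."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return df


def filter_vocab(docs_tokens, no_below=2, no_above=0.5):
    """Keep tokens with no_below <= document frequency <= no_above * n_docs."""
    df = document_frequencies(docs_tokens)
    max_df = no_above * len(docs_tokens)
    return sorted(t for t, n in df.items() if no_below <= n <= max_df)


def to_bow(tokens, token2id):
    """Bag-of-words: sorted (token_id, count) pairs, ignoring filtered-out tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


toy_docs = [  # hypothetical tokenized documents
    ["model", "chip", "export", "chip"],
    ["model", "privacy", "ban"],
    ["model", "chip", "investor"],
    ["model", "privacy", "user"],
]
vocab = filter_vocab(toy_docs)
token2id = {t: i for i, t in enumerate(vocab)}
print(vocab)
print(to_bow(toy_docs[0], token2id))
```

In this toy corpus "model" is removed because it appears in every document, and the words that occur in only one document are removed as too rare, which is the same trade-off the no_above and no_below thresholds make on the real corpus.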


4️⃣ Train LDA Topic Model

In [13]:
from gensim.models import LdaModel

num_topics = 5  # adjust based on your corpus size

lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,
    passes=10,
    alpha='auto',
    per_word_topics=True
)

# Print topics
for idx, topic in lda_model.print_topics(num_words=10):
    print(f"Topic {idx+1}: {topic}")


Topic 1: 0.006*"like" + 0.006*"build" + 0.006*"question" + 0.005*"much" + 0.005*"train" + 0.005*"nvidia" + 0.005*"world" + 0.005*"american" + 0.004*"think" + 0.004*"time"
Topic 2: 0.013*"government" + 0.009*"security" + 0.009*"information" + 0.008*"user" + 0.007*"chatbot" + 0.007*"ban" + 0.007*"privacy" + 0.006*"national" + 0.006*"country" + 0.006*"concern"
Topic 3: 0.011*"trump" + 0.010*"newsletter" + 0.009*"monday" + 0.006*"wakeup" + 0.006*"nvidia" + 0.006*"need" + 0.006*"per" + 0.006*"privacy" + 0.006*"fund" + 0.005*"cent"
Topic 4: 0.007*"nvidia" + 0.007*"bn" + 0.006*"trump" + 0.006*"musk" + 0.004*"power" + 0.004*"monday" + 0.004*"investor" + 0.004*"analyst" + 0.004*"state" + 0.004*"export"
Topic 5: 0.007*"liang" + 0.004*"nvidia" + 0.004*"release" + 0.004*"like" + 0.004*"world" + 0.004*"research" + 0.004*"big" + 0.004*"include" + 0.003*"advance" + 0.003*"report"
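
Several words recur across the printed topics, which is easier to see once the topic strings are parsed. The sketch below parses the output above with a regular expression and lists the words that appear in the top 10 of more than one topic; the helper name `parse_topic` is illustrative.

```python
import re
from collections import defaultdict

# Topic strings copied from the print_topics output above.
topics = {
    1: '0.006*"like" + 0.006*"build" + 0.006*"question" + 0.005*"much" + 0.005*"train" + 0.005*"nvidia" + 0.005*"world" + 0.005*"american" + 0.004*"think" + 0.004*"time"',
    2: '0.013*"government" + 0.009*"security" + 0.009*"information" + 0.008*"user" + 0.007*"chatbot" + 0.007*"ban" + 0.007*"privacy" + 0.006*"national" + 0.006*"country" + 0.006*"concern"',
    3: '0.011*"trump" + 0.010*"newsletter" + 0.009*"monday" + 0.006*"wakeup" + 0.006*"nvidia" + 0.006*"need" + 0.006*"per" + 0.006*"privacy" + 0.006*"fund" + 0.005*"cent"',
    4: '0.007*"nvidia" + 0.007*"bn" + 0.006*"trump" + 0.006*"musk" + 0.004*"power" + 0.004*"monday" + 0.004*"investor" + 0.004*"analyst" + 0.004*"state" + 0.004*"export"',
    5: '0.007*"liang" + 0.004*"nvidia" + 0.004*"release" + 0.004*"like" + 0.004*"world" + 0.004*"research" + 0.004*"big" + 0.004*"include" + 0.003*"advance" + 0.003*"report"',
}


def parse_topic(topic_str):
    """Turn a 'weight*"word" + ...' string into a list of (word, weight) pairs."""
    return [(w, float(p)) for p, w in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)]


# Map each word to the topics whose top-10 list contains it.
word_topics = defaultdict(list)
for topic_id, topic_str in topics.items():
    for word, _ in parse_topic(topic_str):
        word_topics[word].append(topic_id)

shared = {w: ids for w, ids in word_topics.items() if len(ids) > 1}
for word, ids in sorted(shared.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(f"{word}: topics {ids}")
```

Here "nvidia" shows up in four of the five topics, so it does little to distinguish them. Words like that are candidates for the custom stopword list or for a stricter no_above threshold.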


5️⃣ (Optional) Visualize Topics
If you want interactive topic visualization:

In [17]:
import pyLDAvis
import pyLDAvis.gensim  # in recent pyLDAvis releases this module is named pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis


In [18]:
# Save to HTML
pyLDAvis.save_html(vis, 'lda_topics.html')

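Revised topic grouping & interpretation

Note: the topic numbers in this section appear to follow the pyLDAvis map, which by default re-orders topics by prevalence, so they do not match the print_topics output above. For example, the policy and security topic is printed as Topic 2 but is discussed here as Topic 3, and the trump/newsletter topic is printed as Topic 3 but is discussed here as Topic 5.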

Technology Development & Global Influence
(Merged Topic 1 + Topic 2 to reduce redundancy)

Keywords merged from both: build, research, world, nvidia, advance, release, include, train, large, give, month, university, government, silicon, valley, launch, big, billion, investor, people, power, time.

Interpretation: Focuses on global tech advancements — AI training, Nvidia’s role, large-scale research, and the infrastructure needed to build powerful systems. Covers both the innovation side (universities, R&D, Silicon Valley) and the strategic/global competition side.

Example angle: “How Nvidia and other tech players shape the next wave of AI innovation globally.”

Policy & Security Concerns (Topic 3)

Keywords: security, information, user, chatbot, ban, privacy, national, country, concern, taiwan, device, personal, australia, south, policy, state.

Interpretation: Centers on political and regulatory aspects — data privacy, national security, international relations, and government oversight of AI/tech.

Example angle: “Global policy battles over data privacy, security, and AI governance.”

Economic & Market Dynamics (Topic 4)

Keywords: nvidia, bn, trump, musk, power, monday, investor, analyst, export, india, meta, billion, research, control, release, world.

Interpretation: Focused on the financial and strategic market side — investments, stock valuations, trade/export restrictions, and how tech power intersects with national economic agendas.

Example angle: “The financial and geopolitical power plays driving the AI economy.”

Why merge Topic 1 and Topic 2?

Topic 1 is vague — words like think, time, mean, get, well indicate conversational or general news framing rather than a distinct thematic field.

Topic 2 is much sharper, focusing on technology and research. Combining them keeps the core tech theme while removing the filler.

Why keep Topic 3 separate?

It’s semantically distant (in your map) and topically distinct — entirely about policy, governance, and security.

Why keep Topic 4 separate?

Strong economic/market emphasis with overlap in key entities (Nvidia, Musk, Trump) but different framing — money, investment, control.

Why Topic 5’s circle is the smallest
In pyLDAvis, the circle size reflects the proportion of tokens in your corpus assigned to that topic.

A small circle = this topic has fewer total words assigned to it (low prevalence).

That means Topic 5 appears less often in your corpus compared to Topics 1–4.
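
The idea of "proportion of tokens assigned to a topic" can be sketched as a length-weighted average of the per-document topic proportions. This is an illustration of the concept rather than pyLDAvis's own code, and the numbers are hypothetical.

```python
def topic_prevalence(doc_topic, doc_lengths):
    """Share of all tokens attributed to each topic.

    doc_topic[d][k] is the proportion of document d assigned to topic k;
    doc_lengths[d] is the number of tokens in document d.
    """
    n_topics = len(doc_topic[0])
    token_mass = [0.0] * n_topics
    for proportions, length in zip(doc_topic, doc_lengths):
        for k, p in enumerate(proportions):
            token_mass[k] += p * length  # expected tokens of doc d in topic k
    total = sum(token_mass)
    return [m / total for m in token_mass]


doc_topic = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]  # hypothetical proportions
doc_lengths = [100, 200, 100]                      # hypothetical token counts
print(topic_prevalence(doc_topic, doc_lengths))
```

Long documents weigh more than short ones, so a topic that dominates only a few short articles ends up with a small circle.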

If it sits far from the other topics in the map, its word distribution differs strongly from theirs: the topic relies on vocabulary that the other topics rarely use, which is why the model keeps it distinct.
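
By default pyLDAvis places the circles using the Jensen-Shannon divergence between the topics' word distributions, then projects those distances to two dimensions. The sketch below shows the divergence itself on hypothetical distributions over a four-word vocabulary. It uses log base 2 so the value lies between 0 and 1; the base is a choice made for this sketch and only rescales the result.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, 0 for identical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical topic-word distributions over the same 4-word vocabulary.
topic_a = [0.40, 0.40, 0.10, 0.10]
topic_b = [0.35, 0.35, 0.15, 0.15]
topic_c = [0.05, 0.05, 0.45, 0.45]

print("a vs b:", round(js_divergence(topic_a, topic_b), 4))
print("a vs c:", round(js_divergence(topic_a, topic_c), 4))
```

Topics a and b put their weight on the same words and come out close together, while topic c concentrates on different words and lands far away, which is the pattern an isolated circle reflects.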

Interpretation attempt for Topic 5
Right now, it’s vague because it mixes political names (trump), platform/media terms (newsletter, content, sign, online, chatbot), policy/privacy words (privacy, policy, information), business terms (investor, fund, launch), and even casual time markers (monday, tuesday, wakeup).

This could indicate:

Media + event coverage — where political names, tech company names, and platform-related terms co-occur.

A “miscellaneous” or residual topic — when the model collects leftover words that don’t fit neatly into the other topics.
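
If the residual reading is the right one, one option is to check how widespread the boilerplate-looking terms are and drop them before rebuilding the dictionary. This is a suggestion, not something the notebook does. The term list below is taken from the Topic 5 keywords, while the helper names and toy documents are hypothetical.

```python
from collections import Counter


def document_frequency(docs_tokens, terms):
    """Number of documents that contain each term of interest."""
    terms = set(terms)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens) & terms)
    return df


def drop_terms(docs_tokens, extra_stopwords):
    """Remove the given terms from every tokenized document."""
    return [[t for t in tokens if t not in extra_stopwords] for tokens in docs_tokens]


# Terms from the Topic 5 keyword list that look like newsletter or date boilerplate.
boilerplate = {"newsletter", "sign", "monday", "tuesday", "wakeup"}

toy_docs = [  # hypothetical tokenized documents
    ["sign", "newsletter", "monday", "nvidia", "investor"],
    ["newsletter", "wakeup", "privacy", "policy"],
    ["export", "control", "nvidia"],
]
print(document_frequency(toy_docs, boilerplate))
print(drop_terms(toy_docs, boilerplate))
```

On the real corpus, the filtered docs_tokens would be passed through the dictionary, corpus, and LDA steps again to see whether Topic 5 becomes more coherent or disappears.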