# Lab 5 - Topic Modelling

I'm poking at the short answers in `TopicModellingAssignment.txt` to see which prompts they belong to. This is a working notebook rather than a polished report, so I left my own comments/thoughts in place even when they're a bit rambly.


## Rough checklist / plan

- Load the text file (one response per line) and make sure the counts match what the instructor said.
- Jot down quick stats (lengths etc.) so I know how messy the corpus is before vectorizing.
- Take a quick glance at adjective-ish words because earlier feedback asked for that.
- Build a bag-of-words with 1-2 grams; throw away prompt-heavy words so answers drive the model.
- Fit LDA with 4 topics first since that's what the rubric hints at, then peek at the docs behind each topic.
- As a sanity check, try TF–IDF + NMF and a tiny topic-count sweep so I can justify sticking with 4.
- Write a few reflections per section so I remember why I made the choices if asked later.

(If I get stuck I'll fall back to the lecture code snippets.)


## Data loading & quick stats

Just want to confirm the file path and get basic counts. The responses are short, so even a rough word-count average is useful.


In [1]:
import pathlib
from pprint import pprint

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation, NMF

data_path = pathlib.Path('TopicModellingAssignment.txt')
assert data_path.exists(), 'Put this notebook in besinci_odev/ or adjust the path'

# Assumes the file is UTF-8; some answers contain non-ASCII characters.
raw_lines = [ln.strip() for ln in data_path.read_text(encoding='utf-8').splitlines() if ln.strip()]
print(f'Loaded {len(raw_lines)} non-empty lines')
pprint(raw_lines[:5])

lengths = [len(x.split()) for x in raw_lines]
print(f'Avg length: {sum(lengths)/len(lengths):.1f} words | Min={min(lengths)} | Max={max(lengths)}')

Loaded 337 non-empty lines
['Ottomans was good at making map.',
 'Romans are used to theater for cricitising their government.Sometimes they '
 'used to theater for wrong politic decisions,sometimes cricitising bad '
 'politicians.',
 'the calender',
 '"1-Consunls 2-assemblie',
 '12 law of something']
Avg length: 24.5 words | Min=1 | Max=86
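
With a spread from 1 to 86 words, the mean alone hides how many answers are only a word or two. A minimal sketch of a slightly fuller summary, using a made-up `sample_lengths` list in place of the real `lengths` and an arbitrary cutoff of 3 words for "very short" (both are assumptions for illustration):

```python
from statistics import mean, median

# Hypothetical word counts standing in for the notebook's `lengths` list.
sample_lengths = [1, 2, 2, 4, 12, 25, 30, 41, 86]

# Assumed cutoff: answers with at most this many words count as "very short".
short_cutoff = 3

n_short = sum(1 for n in sample_lengths if n <= short_cutoff)
summary = {
    "mean": round(mean(sample_lengths), 1),
    "median": median(sample_lengths),
    "n_short": n_short,
    "share_short": round(n_short / len(sample_lengths), 2),
}
print(summary)
```

The median and the share of very short answers give a better feel for how many documents will end up nearly empty after stop-word removal.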


## Quick look at adjectives / opinion words (extra trail)

Instructor note said "look at adjectives/opinion words", so I hacked together a suffix-based filter instead of pulling in spaCy (it isn't installed here anyway). It's not perfect, but it surfaces the most common descriptive words.


In [2]:
import re
from collections import Counter

tokens = []
for line in raw_lines:
    tokens.extend(re.findall(r"[A-Za-z']+", line.lower()))

# crude adjective guess: words ending with common adjective suffixes
adj_suffixes = ('al','ive','ous','ful','less','ble','ary','ic')
adjs = [t for t in tokens if any(t.endswith(suf) for suf in adj_suffixes) and len(t) > 3]
counts = Counter(adjs).most_common(15)
print('Top adjective-like words:', counts)

Top adjective-like words: [('military', 14), ('social', 10), ('personal', 9), ('legal', 8), ('geocentric', 6), ('table', 6), ('elliptical', 5), ('eliptic', 5), ('real', 5), ('political', 4), ('democratic', 4), ('give', 4), ('scientific', 4), ('equal', 4), ('individual', 3)]


### Side note on the adjective idea

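I briefly considered installing spaCy for proper POS tags, but that sounded like fighting the lab machine. The suffix hack above isn't elegant: it lets through non-adjectives such as `table` and `give`, and it counts spelling variants like `elliptical` and `eliptic` separately. Still, it covers what the instruction asked for, so I left it.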
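
A minimal sketch of how the suffix rule could be tightened with a hand-made exclusion set, so that non-adjectives like `table` and `give` (both in the top-15 output) drop out. The `NOT_ADJECTIVES` set and the sample sentence are my own illustrations, not part of the assignment:

```python
import re
from collections import Counter

ADJ_SUFFIXES = ("al", "ive", "ous", "ful", "less", "ble", "ary", "ic")

# Hand-picked exclusions (an assumption for illustration, not a complete list).
NOT_ADJECTIVES = {"table", "give"}

def adjective_like(text):
    """Return lower-cased tokens that match the suffix rule and are not excluded."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return [
        t for t in tokens
        if len(t) > 3 and t.endswith(ADJ_SUFFIXES) and t not in NOT_ADJECTIVES
    ]

# Made-up sentence that reuses words from the real top-15 list.
sample = "The legal table did give the military a real social advantage."
adj_counts = Counter(adjective_like(sample))
print(adj_counts.most_common())
```

This is still a heuristic; it only removes the false positives someone has already spotted by eye.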


## Pre-processing for topic models

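- Use lower-case unigrams + bigrams so short phrases like `retrograde movement` stay together. (Stop words are dropped before the n-grams are built, so `roman republic` can't form once `roman` is on the stop list.)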
- Combine sklearn's default English stopwords with a mini list of prompt-heavy terms (roman, greek, kepler, etc.). Otherwise every topic just repeats the question text.
- `min_df=3` felt like a good compromise after eyeballing the vocab; it keeps only terms that appear in at least 3 answers, which ditches most one-off typos.


In [3]:
extra_stop = {
    'roman','romans','greeks','greek','rome','government','goverment','governing',
    'senate','senato','consuls','console','assemblies','assembly','consule','consul',
    'king','people','person','persons','individual','individuals','interests','interest',
    'conflict','planet','planets','kepler','earth','sun','universe','orbit','orbits',
    'law','laws'
}
stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop))

bow = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2), min_df=3)
X_bow = bow.fit_transform(raw_lines)
vocab_bow = bow.get_feature_names_out()
print('BoW shape:', X_bow.shape)

BoW shape: (337, 317)


## Baseline: LDA with 4 topics

Tried k = 3–6 quickly off-notebook and 4 gave the clearest split (astronomy vs civics vs dissemination). Running it here with the cleaned BoW matrix.


In [4]:
n_topics = 4
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42, learning_method='batch')
lda.fit(X_bow)

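LatentDirichletAllocation parameters (as displayed by the notebook; blank values were blank in the display):

| Parameter | Value |
|---|---|
| n_components | 4 |
| doc_topic_prior | |
| topic_word_prior | |
| learning_method | 'batch' |
| learning_decay | 0.7 |
| learning_offset | 10.0 |
| max_iter | 10 |
| batch_size | 128 |
| evaluate_every | -1 |
| total_samples | 1000000.0 |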


In [5]:
def show_topics(model, feature_names, topn=12):
    for idx, comp in enumerate(model.components_):
        terms = comp.argsort()[-topn:][::-1]
        labels = ', '.join(feature_names[i] for i in terms)
        print(f'Topic {idx+1}: {labels}\n')

show_topics(lda, vocab_bow)

Topic 1: science, movement, center, used, retrograde, world, make, thought, time, moving, retrograde movement, stars

Topic 2: different, democracy, theater, theatre, like, good, invented, drama, power, use, rules, center

Topic 3: information, dissemination, printing, information dissemination, books, applied, landmark, today, founding, internet, work, senators

Topic 4: humanity, like, important, used, democracy, newspaper, science, roads, 12, military, contributed, calendar
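
The `argsort()[-topn:][::-1]` idiom in `show_topics` just picks the indices of the largest weights in descending order. A small sketch on a made-up weight matrix, which also normalises each row so the weights can be read as a distribution over terms (`toy_components` and `toy_vocab` are hypothetical stand-ins for `lda.components_` and `vocab_bow`):

```python
import numpy as np

# Hypothetical 2-topic x 5-term weight matrix standing in for `lda.components_`.
toy_components = np.array([
    [0.5, 4.0, 1.5, 3.0, 1.0],
    [6.0, 0.5, 2.0, 0.5, 1.0],
])
toy_vocab = np.array(["books", "center", "printing", "retrograde", "stars"])

# Normalise each row so it sums to 1 and can be read as term probabilities.
toy_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

top_terms = []
for idx, row in enumerate(toy_probs):
    top = row.argsort()[-3:][::-1]  # indices of the 3 largest weights, descending
    top_terms.append([str(toy_vocab[i]) for i in top])
    pairs = [(str(toy_vocab[i]), round(float(row[i]), 2)) for i in top]
    print(f"Topic {idx + 1}: {pairs}")
```

Printing the normalised weights next to the terms makes it easier to see when a topic is dominated by one or two words and when it is nearly flat.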



In [6]:
import numpy as np

doc_topic = lda.transform(X_bow)
for t in range(n_topics):
    top_docs = doc_topic[:, t].argsort()[-4:][::-1]
    print(f'\nTop docs for topic {t+1}:')
    for i in top_docs:
        print(f'- {raw_lines[i][:200]}')


Top docs for topic 1:
- Aristotle thought earth is the center of the universe and the other planets and sun spins around the word. Instead of Aristotle, Kepler thought opposite. He thought sun is in the center and planets mo
- Retrograde means that seeing the planets in opposite way that they are moving. Greeks thought planets have stabil motion. They used stars for proving that. Because every January and July they see same
- Greeks thinked that planets moved backwards and called this motion retrograde movement. Copernicus thinked that planets do not move at all, they seem to be moving from our point of view. Kepler thinke
- greeks opinion was that earth is center of üniviverse and sun orbiting to earth so,their opininon is not true for example archiment opinion .but kepler was certainly described movements of planets he 

Top docs for topic 2:
- Communities before the Greeks, improved themself about agriculture and practised the medicine. There are many invention after the Greeks. Th

## Alternative trail: NMF on TF–IDF

Mostly curiosity: TF–IDF + NMF sometimes separates overlapping civics topics better than LDA. Keeping the same stopword list so the comparison stays fair.


In [7]:
tfidf = TfidfVectorizer(stop_words=stop_words, ngram_range=(1, 2), min_df=3)
X_tfidf = tfidf.fit_transform(raw_lines)
vocab_tfidf = tfidf.get_feature_names_out()

nmf = NMF(n_components=4, random_state=42, init='nndsvda', max_iter=400)
W = nmf.fit_transform(X_tfidf)
H = nmf.components_

def show_topics_matrix(matrix, feature_names, topn=12):
    for idx, row in enumerate(matrix):
        terms = row.argsort()[-topn:][::-1]
        labels = ', '.join(feature_names[i] for i in terms)
        print(f'NMF Topic {idx+1}: {labels}\n')

show_topics_matrix(H, vocab_tfidf)

NMF Topic 1: science, used, humanity, like, power, theater, things, did, nature, make, contributions, important

NMF Topic 2: center, world, movement, retrograde, orbiting, retrograde movement, thought, moving, world center, movements, said, stars

NMF Topic 3: information, dissemination, information dissemination, printing, books, landmark, newspaper, achievements, terms information, landmark achievements, founding, today

NMF Topic 4: democracy, theatre, different, drama, invented, theatre democracy, theater, freedom, library, invention, philosophy, newspaper
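
The NMF topic numbers don't line up with the LDA ones, so comparing the two printouts by eye is error-prone. A rough sketch that pairs them by Jaccard overlap of their top-12 terms; the term lists are copied from the two printouts in this notebook, while the `jaccard` helper and the greedy best-match are my own choices rather than anything from the lecture code:

```python
# Top-12 terms copied from the LDA and NMF printouts in this notebook.
lda_top = {
    1: ["science", "movement", "center", "used", "retrograde", "world", "make",
        "thought", "time", "moving", "retrograde movement", "stars"],
    2: ["different", "democracy", "theater", "theatre", "like", "good", "invented",
        "drama", "power", "use", "rules", "center"],
    3: ["information", "dissemination", "printing", "information dissemination",
        "books", "applied", "landmark", "today", "founding", "internet", "work",
        "senators"],
    4: ["humanity", "like", "important", "used", "democracy", "newspaper", "science",
        "roads", "12", "military", "contributed", "calendar"],
}
nmf_top = {
    1: ["science", "used", "humanity", "like", "power", "theater", "things", "did",
        "nature", "make", "contributions", "important"],
    2: ["center", "world", "movement", "retrograde", "orbiting", "retrograde movement",
        "thought", "moving", "world center", "movements", "said", "stars"],
    3: ["information", "dissemination", "information dissemination", "printing",
        "books", "landmark", "newspaper", "achievements", "terms information",
        "landmark achievements", "founding", "today"],
    4: ["democracy", "theatre", "different", "drama", "invented", "theatre democracy",
        "theater", "freedom", "library", "invention", "philosophy", "newspaper"],
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two term lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# For each LDA topic, pick the NMF topic with the highest term overlap.
best_match = {}
for l_id, l_terms in lda_top.items():
    scores = {n_id: jaccard(l_terms, n_terms) for n_id, n_terms in nmf_top.items()}
    n_best = max(scores, key=scores.get)
    best_match[l_id] = n_best
    print(f"LDA {l_id} -> NMF {n_best} (Jaccard {scores[n_best]:.2f})")
```

Top-term overlap ignores the weights, so it is only a quick alignment check, not a measure of how similar the two models really are.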



## Topic count sweep (sanity check)

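Probably overkill, but printing the approximate log-likelihood and perplexity for k=3..6 helps me answer the "why 4 topics?" question. Lower perplexity is better, but both numbers are computed on the same matrix the model was fit on, so they only measure in-sample fit, and I also care about whether the topics stay readable.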


In [8]:
for k in range(3, 7):
    model = LatentDirichletAllocation(n_components=k, random_state=42, learning_method='batch')
    model.fit(X_bow)
    print(f'k={k} | log-lik={model.score(X_bow):.1f} | perp={model.perplexity(X_bow):.1f}')

k=3 | log-lik=-12657.0 | perp=311.9
k=4 | log-lik=-12793.4 | perp=331.8
k=5 | log-lik=-12932.0 | perp=353.4
k=6 | log-lik=-13039.5 | perp=371.0
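
Because `score` and `perplexity` above are evaluated on the training matrix, a held-out split is one way to make the sweep a little more honest. A rough sketch on a made-up toy corpus (the documents, the 75/25 split, and the k values are all assumptions; on the real data `raw_lines` would take the place of `toy_docs`):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Tiny made-up corpus; not taken from the assignment file.
toy_docs = [
    "planets move around the sun in ellipses",
    "the sun is the center and planets move",
    "retrograde motion of planets against the stars",
    "stars and planets show retrograde motion",
    "printing spread books and information",
    "books and newspapers spread information",
    "information spread through printing and newspapers",
    "the senate and assemblies shared power",
    "power was shared between senate and consuls",
    "consuls and assemblies checked the senate",
    "roads and the calendar helped the military",
    "the military used roads and the calendar",
]

X_toy = CountVectorizer(stop_words="english").fit_transform(toy_docs)

# Hold out a quarter of the documents for evaluation only.
X_train, X_test = train_test_split(X_toy, test_size=0.25, random_state=0)

heldout_perplexity = {}
for k in (2, 3):
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(X_train)
    # Perplexity on documents the model has not seen during fitting.
    heldout_perplexity[k] = model.perplexity(X_test)
    print(f"k={k} | held-out perp={heldout_perplexity[k]:.1f}")
```

On a corpus this small a single split is noisy, so the held-out numbers are better treated as a tie-breaker alongside reading the topics than as a verdict.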


## My read of the topics (after runs)

- **LDA (4 topics)**: Topic 1 = astronomy (retrograde + heliocentric debate); Topic 2 = Greek civic/customs bits; Topic 3 = information spreading / printing; Topic 4 = Roman civics and engineering feats. Topics 3 & 4 blur together when documents mix both themes.
- **NMF (4 topics)**: Similar story, but NMF separates dissemination from civics slightly more cleanly; astronomy (NMF Topic 2) stays rock solid.
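- **k sweep**: On the numbers alone k=3 wins, since perplexity rises steadily from 311.9 at k=3 to 371.0 at k=6, so the metric doesn't pick 4 for me. Reading the topics, though, the 3-topic model mashes civics with dissemination; 5 topics starts to overfit short docs; 6 felt noisy.

So I'm sticking with 4 on readability grounds rather than perplexity, and I can mention the NMF angle if someone challenges that choice.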


## What worked / issues / next steps (own notes)

- Worked: removing prompt-heavy words, letting bigrams through, and double-checking with NMF so I wasn't blindly trusting LDA.
- Issues/puzzles: spelling mistakes + ultra-short responses leave some topics half empty, and a few answers mention both astronomy and civics which muddies doc-topic assignment.
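
To put a number on the "half empty topics" and "muddied assignment" issues, one option is to count how many documents each topic wins and to flag documents whose top topic is weak. A minimal sketch on a made-up matrix (`toy_doc_topic` stands in for `doc_topic = lda.transform(X_bow)`, and the 0.5 threshold is an arbitrary assumption):

```python
import numpy as np

# Hypothetical document-topic matrix; each row sums to 1.
toy_doc_topic = np.array([
    [0.85, 0.05, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.40, 0.10, 0.10, 0.40],  # answer that mixes two themes
    [0.25, 0.25, 0.25, 0.25],  # ultra-short answer: close to uniform
    [0.05, 0.05, 0.80, 0.10],
])

# Assumed threshold: below this, the top topic is treated as an unreliable label.
min_confidence = 0.5

dominant = toy_doc_topic.argmax(axis=1)    # index of the strongest topic per document
confidence = toy_doc_topic.max(axis=1)     # weight of that strongest topic
counts = np.bincount(dominant, minlength=toy_doc_topic.shape[1])
ambiguous = np.flatnonzero(confidence < min_confidence)

print("Docs per topic:", counts.tolist())
print("Ambiguous doc indices:", ambiguous.tolist())
```

Note that `argmax` breaks ties by taking the first topic, so mixed documents can inflate the count of whichever topic happens to come first.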