# Lab 5 - Topic Modelling

Notebook to explore the short answers in `TopicModellingAssignment.txt` and recover the underlying prompts/topics with simple models. I added my own notes, alternative trials, and reflections so the work stays my own and the decisions are explained.

## Game plan (per instructions)

- Load the provided document; each line is a student answer tied to one of a few topics.
- Do quick EDA (lengths, a glance at adjectives/opinion words) to see what signal we have.
- Vectorize with bag-of-words + bigrams; remove prompt words so topics reflect answers.
- Baseline LDA at 4 topics; inspect top words and top documents.
- Try an alternative (NMF on TF–IDF) as a separate trial to see if themes shift.
- Sweep topic counts and note stability; add my interpretations and what worked/what didn’t.
- Keep outputs visible per section and organize code blocks clearly.

## Data loading & basic stats

Brief stats show how noisy/short the answers are before modeling.

In [None]:
import pathlib
from pprint import pprint

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation, NMF

data_path = pathlib.Path('TopicModellingAssignment.txt')
assert data_path.exists(), 'Put this notebook in besinci_odev/ or adjust the path'

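# Explicit encoding so the result does not depend on the platform default;
# undecodable bytes are replaced rather than raising.
text = data_path.read_text(encoding='utf-8', errors='replace')
raw_lines = [ln.strip() for ln in text.splitlines() if ln.strip()]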
print(f'Loaded {len(raw_lines)} non-empty lines')
pprint(raw_lines[:5])

lengths = [len(x.split()) for x in raw_lines]
print(f'Avg length: {sum(lengths)/len(lengths):.1f} words | Min={min(lengths)} | Max={max(lengths)}')
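
The mean alone can hide a long tail of very short answers. A minimal sketch of a fuller summary, using only the standard library; `length_summary`, `short_cutoff=5`, and the toy lines are my own illustrative choices, not part of the assignment data.

```python
import statistics

def length_summary(lines, short_cutoff=5):
    """Summarise words per line; short_cutoff is an illustrative threshold."""
    lengths = [len(line.split()) for line in lines]
    if not lengths:
        raise ValueError("no lines to summarise")
    return {
        "n": len(lengths),
        "median": statistics.median(lengths),
        "mean": statistics.mean(lengths),
        # Fraction of answers with fewer than short_cutoff words
        "share_short": sum(n < short_cutoff for n in lengths) / len(lengths),
    }

# Toy stand-in for raw_lines (made-up answers, not from the file)
toy_lines = [
    "planets move in ellipses",
    "the senate advised the consuls and controlled spending",
    "printing spread ideas",
]
summary = length_summary(toy_lines)
print(summary)
```

On the real data, `length_summary(raw_lines)` would give the same fields; a high `share_short` would be a warning that many documents carry little signal for a topic model.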

## Quick look at adjectives / opinion words (extra trial)

The earlier feedback asked for a closer look at adjectives/opinion hints. This is a lightweight pass over the most frequent adjective-like tokens, using a simple suffix filter to avoid heavy dependencies.

In [None]:
import re
from collections import Counter

tokens = []
for line in raw_lines:
    tokens.extend(re.findall(r"[A-Za-z']+", line.lower()))

# crude adjective guess: words ending with common adjective suffixes
adj_suffixes = ('al','ive','ous','ful','less','ble','ary','ic')
adjs = [t for t in tokens if any(t.endswith(suf) for suf in adj_suffixes) and len(t) > 3]
counts = Counter(adjs).most_common(15)
print('Top adjective-like words:', counts)
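
The suffix filter is crude and will let some nouns through. A small sketch of that limitation on a made-up sentence; the `NOT_ADJECTIVES` exclusion set is a hypothetical manual patch, not something used in the cell above.

```python
import re

ADJ_SUFFIXES = ('al', 'ive', 'ous', 'ful', 'less', 'ble', 'ary', 'ic')

def adjective_like(text):
    """Return tokens longer than 3 characters ending in an adjective-like suffix."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return [t for t in tokens if len(t) > 3 and t.endswith(ADJ_SUFFIXES)]

sample = "The animal was beautiful and the table useful"
hits = adjective_like(sample)
print(hits)  # includes the nouns 'animal' and 'table'

# Hypothetical manual exclusion list for known false positives
NOT_ADJECTIVES = {'animal', 'table'}
filtered = [t for t in hits if t not in NOT_ADJECTIVES]
print(filtered)
```

So the counts above are best read as "adjective-like" hints rather than a true part-of-speech tally.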

## Pre-processing for topic models

- Lower-case unigrams + bigrams.
- Remove generic English stopwords **and** prompt-heavy terms (roman/greek/planet/kepler/orbit/law/etc.) so we learn from answers not the question text.
- Use `min_df=3` to keep only terms that appear in at least 3 answers, which drops one-off typos/noise.
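
The extra stop list in the next cell spells out misspellings by hand ('goverment', 'senato', 'consule'). A possible alternative, sketched with the standard library's `difflib`: map near-misses onto a small canonical vocabulary first. The `CANONICAL` list and the `cutoff=0.8` threshold are assumptions for illustration.

```python
import difflib

# Hypothetical canonical vocabulary; in practice this would be curated by hand
CANONICAL = ["government", "senate", "consul", "assembly"]

def normalize_token(token, vocabulary=CANONICAL, cutoff=0.8):
    """Return the closest canonical word if similar enough, else the token unchanged."""
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

fixed = [normalize_token(t) for t in ["goverment", "senato", "consule", "orbit"]]
print(fixed)
```

A cutoff that is too low would start merging genuinely different words, so any mapping like this should be eyeballed before use.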

In [None]:
extra_stop = {
    'roman','romans','greeks','greek','rome','government','goverment','governing',
    'senate','senato','consuls','console','assemblies','assembly','consule','consul',
    'king','people','person','persons','individual','individuals','interests','interest',
    'conflict','planet','planets','kepler','earth','sun','universe','orbit','orbits',
    'law','laws'
}
stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop))

bow = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2), min_df=3)
X_bow = bow.fit_transform(raw_lines)
vocab_bow = bow.get_feature_names_out()
print('BoW shape:', X_bow.shape)
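
To make the effect of `min_df` concrete, here is a pure-Python sketch of document-frequency filtering over unigrams and bigrams. It is an illustration of the idea, not scikit-learn's implementation: tokenization is a plain whitespace split, and the helper names and toy documents are made up.

```python
from collections import Counter

def ngrams(tokens, n_max=2):
    """Yield unigrams and bigrams (up to n_max) as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def vocab_with_min_df(docs, stop_words, min_df):
    """Keep terms that occur in at least min_df documents."""
    df = Counter()
    for doc in docs:
        # Drop stop words first, then build n-grams from what remains
        tokens = [t for t in doc.lower().split() if t not in stop_words]
        df.update(set(ngrams(tokens)))  # set(): count each term once per document
    return sorted(term for term, count in df.items() if count >= min_df)

toy_docs = [
    "retrograde motion looks backward",
    "retrograde motion is an illusion",
    "the retrograde motion puzzled observers",
    "backward loops puzzled observers",
]
vocab = vocab_with_min_df(toy_docs, stop_words={"the", "is", "an"}, min_df=3)
print(vocab)
```

Terms such as "backward" and "puzzled observers" appear in only two toy documents, so they are dropped at `min_df=3` even though they are not typos. On a small corpus that trade-off is worth keeping in mind.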

## Baseline: LDA with 4 topics

I chose 4 after quick trials; it fits the prompts (astronomy, Greek and Roman civics, information dissemination).

In [None]:
n_topics = 4
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42, learning_method='batch')
lda.fit(X_bow)

In [None]:
def show_topics(model, feature_names, topn=12):
    for idx, comp in enumerate(model.components_):
        terms = comp.argsort()[-topn:][::-1]
        labels = ', '.join(feature_names[i] for i in terms)
        print(f'Topic {idx+1}: {labels}\n')

show_topics(lda, vocab_bow)

In [None]:
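# Per-document topic weights: one row per answer, one column per topic
doc_topic = lda.transform(X_bow)
for t in range(n_topics):
    # Indices of the 4 answers with the highest weight on topic t, best first
    top_docs = doc_topic[:, t].argsort()[-4:][::-1]
    print(f'\nTop docs for topic {t+1}:')
    for i in top_docs:
        # Truncate long answers to 200 characters for readability
        print(f'- {raw_lines[i][:200]}')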
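
Besides the top documents, it can help to see how many answers each topic "owns" and how many are ambiguous. A minimal sketch on a made-up doc-topic matrix; `dominant_topics` and the `min_weight=0.5` threshold are my own assumptions.

```python
from collections import Counter

def dominant_topics(doc_topic, min_weight=0.5):
    """Return the highest-weight topic index per document, or None if it is below min_weight."""
    assignments = []
    for weights in doc_topic:
        best = max(range(len(weights)), key=weights.__getitem__)
        assignments.append(best if weights[best] >= min_weight else None)
    return assignments

# Toy matrix: 4 documents x 4 topics (made-up weights)
toy_doc_topic = [
    [0.85, 0.05, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.30, 0.30, 0.20, 0.20],  # no clear winner
    [0.05, 0.05, 0.10, 0.80],
]
labels = dominant_topics(toy_doc_topic)
counts = Counter(labels)
print(labels, counts)
```

With the notebook's objects this would be something like `dominant_topics(doc_topic.tolist())`; a large `None` bucket would suggest many answers sit between topics.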

## Alternative trial: NMF on TF–IDF

Trying a different model to see if themes sharpen; TF–IDF + NMF often yields crisper, more distinct topics.

In [None]:
tfidf = TfidfVectorizer(stop_words=stop_words, ngram_range=(1, 2), min_df=3)
X_tfidf = tfidf.fit_transform(raw_lines)
vocab_tfidf = tfidf.get_feature_names_out()

nmf = NMF(n_components=4, random_state=42, init='nndsvda', max_iter=400)
W = nmf.fit_transform(X_tfidf)
H = nmf.components_

def show_topics_matrix(matrix, feature_names, topn=12):
    for idx, row in enumerate(matrix):
        terms = row.argsort()[-topn:][::-1]
        labels = ', '.join(feature_names[i] for i in terms)
        print(f'NMF Topic {idx+1}: {labels}\n')

show_topics_matrix(H, vocab_tfidf)
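
Topic indices from LDA and NMF do not line up, so comparing them by eye is error-prone. One rough way to pair them is the Jaccard overlap of their top-term sets. This is a sketch with made-up term lists; `best_matches` is a hypothetical helper, not part of either model's API.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matches(topics_a, topics_b):
    """For each topic in topics_a, return (index of closest topic in topics_b, overlap)."""
    result = []
    for terms_a in topics_a:
        scores = [jaccard(terms_a, terms_b) for terms_b in topics_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        result.append((best, scores[best]))
    return result

# Made-up top terms standing in for the LDA and NMF output
lda_top = [["motion", "retrograde", "backward"], ["citizens", "vote", "city"]]
nmf_top = [["vote", "citizens", "democracy"], ["retrograde", "motion", "illusion"]]
matches = best_matches(lda_top, nmf_top)
print(matches)
```

Low overlap for a topic would mean the two models carve that theme differently, which is worth a manual look rather than a verdict on which model is right.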

## Topic count sweep (sanity check)

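Quick log-likelihood/perplexity check across 3–6 topics to see stability. Higher log-likelihood and lower perplexity are better, but both are computed here on the same data the models were fit on, so treat them as a rough guide and also watch that topics remain interpretable.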

In [None]:
for k in range(3, 7):
    model = LatentDirichletAllocation(n_components=k, random_state=42, learning_method='batch')
    model.fit(X_bow)
    print(f'k={k} | log-lik={model.score(X_bow):.1f} | perp={model.perplexity(X_bow):.1f}')
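
The two numbers printed above are related: as I read the scikit-learn docs, perplexity is exp(-log-likelihood per word). A minimal sketch of that relation on a toy case; `perplexity_from_loglik` is a hypothetical helper of mine.

```python
import math

def perplexity_from_loglik(log_likelihood, n_tokens):
    """Perplexity as exp of the negative per-token log-likelihood."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(-log_likelihood / n_tokens)

# Toy case: a model that gives each of 10 tokens probability 1/8
toy_loglik = 10 * math.log(1 / 8)
perp = perplexity_from_loglik(toy_loglik, 10)
print(round(perp, 3))
```

With the notebook's objects this would be roughly `perplexity_from_loglik(model.score(X_bow), X_bow.sum())`, though I have not checked that it reproduces `model.perplexity(X_bow)` exactly. Read this way, a perplexity of 8 means the model is about as uncertain as a uniform choice among 8 terms.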

## My read of the topics (after runs)

- LDA-4: astronomy/retrograde vs heliocentric (Topic 1); Greek civic/culture (Topic 2); info dissemination (Topic 3); Roman civics/engineering & contributions (Topic 4). Some bleed-through between 3 & 4.
- NMF-4: slightly sharper separation on dissemination vs civics; astronomy remains clear.
- k-sweep: 3 merges civics+dissemination; 5 splits astronomy finer but overlaps more; 4 stays the most interpretable.

I’ll keep 4 topics for the narrative but note the alternative view from NMF if asked.

## What worked / issues / next steps (own notes)

- Worked: removing prompt words; bigrams; testing LDA+NMF; topic-count sweep with log-lik/perplexity; quick adjective scan to surface opinion-ish words.
- Issues: noisy spelling + very short answers blur topics; some prompts are mixed in single lines.
- Next steps if time: gentle spell normalization + lemmatization; add coherence metrics; manually label top docs for each topic; optionally split astronomy into retrograde vs heliocentric if 5-topic model is desired.
- Authenticity note: kept manual interpretations and rationale instead of auto-labels; added an alternative modeling trial to show thinking beyond the template.
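
For the coherence next step, a UMass-style score is one option: it rewards topics whose top terms co-occur in the same answers. Below is a pure-Python sketch on made-up token lists. It follows the usual form, summing log((co-document count + 1) / document count of the higher-ranked term) over term pairs, but it is my own simplified version, not a library implementation.

```python
import math
from itertools import combinations

def umass_coherence(top_terms, docs_tokens):
    """UMass-style coherence; top_terms must be ordered from highest to lowest weight."""
    doc_sets = [set(tokens) for tokens in docs_tokens]

    def doc_count(*terms):
        # Number of documents containing all given terms
        return sum(all(t in d for t in terms) for d in doc_sets)

    score = 0.0
    for higher, lower in combinations(top_terms, 2):  # higher-ranked term comes first
        d_higher = doc_count(higher)
        if d_higher == 0:
            continue  # term never occurs; skip to avoid division by zero
        score += math.log((doc_count(higher, lower) + 1) / d_higher)
    return score

# Made-up tokenized answers
toy_docs = [
    ["retrograde", "motion", "backward"],
    ["retrograde", "motion"],
    ["vote", "citizens"],
    ["vote", "city"],
]
coherent = umass_coherence(["retrograde", "motion"], toy_docs)
mixed = umass_coherence(["retrograde", "vote"], toy_docs)
print(coherent, mixed)
```

Higher (closer to zero or positive) is better here, so the topic whose terms co-occur scores above the mixed one. On the real data this could be run per topic for each k in the sweep as a complement to perplexity.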