**PROBLEM 1: Topic Models**

Obtain Topic Models (K=10, 20, 50) for both datasets by running LDA and NMF methods; you can call libraries for both methods and dont have to use the ES index as source. For both LDA and NMF: print out for each topic the top 20 words (with probabilities)

The rest of of topic exercises and results are required only for the LDA topics:
- 20NG: how well the topics align with the 20NG label classes? This is not asking for a measurement, but rather for a visual inspection to determine what topics match well with what classes. Does this change if one increases the topics from 20 to 50?

In [3]:
import os
import zipfile
import numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

In [8]:
# Unzip DUC2001
def unzip_duc(zip_path, extract_to):
    print("Unzipping DUC2001.zip...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)
    print(f"Extracted to {extract_to}")

# Load data
def load_20ng():
    print("Loading 20 Newsgroups dataset...")
    data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
    return data.data, data.target, data.target_names

def load_duc2001(duc_dir):
    print("Loading DUC 2001 dataset...")
    docs = []
    for root, _, files in os.walk(duc_dir):
        for file in files:
            if file.endswith('.txt'):
                with open(os.path.join(root, file), encoding='latin1') as f:
                    docs.append(f.read())
    return docs

In [9]:
# Topic Modeling Functions
def display_topics(model, feature_names, no_top_words=20):
    for topic_idx, topic in enumerate(model.components_):
        print(f"\nTopic #{topic_idx+1}:")
        top_indices = topic.argsort()[::-1][:no_top_words]
        for i in top_indices:
            print(f"{feature_names[i]} ({topic[i]:.4f})")

def fit_and_display(docs, method, k, use_tfidf=False):
    print(f"\n{'='*10} {method.upper()} | K={k} {'(TF-IDF)' if use_tfidf else '(BOW)'} {'='*10}")

    vec = TfidfVectorizer if use_tfidf else CountVectorizer
    vectorizer = vec(max_df=0.95, min_df=2, stop_words='english')
    X = vectorizer.fit_transform(docs)
    feature_names = vectorizer.get_feature_names_out()

    if method == 'lda':
        model = LatentDirichletAllocation(n_components=k, random_state=0)
    else:
        model = NMF(n_components=k, random_state=0, init='nndsvd')

    model.fit(X)
    display_topics(model, feature_names)
    return model, X, feature_names

def analyze_topic_label_alignment(lda_model, X_counts, labels, target_names):
    topic_dists = lda_model.transform(X_counts)
    for topic_idx in range(lda_model.n_components):
        top_docs = topic_dists[:, topic_idx].argsort()[::-1][:30]
        top_topic_labels = [labels[i] for i in top_docs]
        counts = pd.Series(top_topic_labels).value_counts()
        print(f"\nTopic {topic_idx+1} top classes:")
        for idx, count in counts.items():
            print(f"{target_names[idx]}: {count}")

In [10]:
# Run all
# Step 1: Unzip uploaded DUC2001.zip
duc_zip_path = '/content/DUC2001.zip'
duc_extract_path = '/content/DUC2001'
unzip_duc(duc_zip_path, duc_extract_path)

# Step 2: Load datasets
ng_docs, ng_labels, ng_names = load_20ng()
duc_docs = load_duc2001(duc_extract_path)

# Step 3: Run topic modeling
for k in [10, 20, 50]:
    print(f"\n\n======= TOPIC MODELING K={k} =======")

    # 20NG
    lda_20ng, X_20ng_counts, _ = fit_and_display(ng_docs, 'lda', k)
    fit_and_display(ng_docs, 'nmf', k, use_tfidf=True)

    # DUC
    fit_and_display(duc_docs, 'lda', k)
    fit_and_display(duc_docs, 'nmf', k, use_tfidf=True)

    # Label Alignment for LDA on 20NG
    print(f"\n--- 20NG Topic ↔ Label Analysis (K={k}) ---")
    analyze_topic_label_alignment(lda_20ng, X_20ng_counts, ng_labels, ng_names)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
scsi (915.1713)
bus (445.6236)
controller (436.8736)
disk (397.0739)
drives (396.1971)
ide (362.0497)
card (348.0879)
hard (342.9023)
dos (274.5801)
16 (266.3620)
drivers (258.1784)
bit (225.9216)
rom (201.7991)
bios (198.9777)
isa (193.3486)
pc (193.2428)
floppy (174.2381)
use (173.9343)
ibm (173.9251)

Topic #13:
game (1428.0840)
team (900.5774)
year (812.7448)
games (799.5838)
play (627.0043)
season (537.3240)
players (509.3820)
win (493.4317)
don (474.7706)
good (474.0653)
think (446.8276)
time (402.7444)
period (397.4237)
like (392.2665)
player (383.7223)
baseball (377.3061)
just (361.3907)
hockey (361.0203)
league (333.2627)
better (330.5570)

Topic #14:
medical (344.4236)
years (286.1080)
disease (279.4028)
like (274.4640)
cancer (270.6718)
time (259.9945)
patients (250.3332)
food (245.9492)
doctor (223.6805)
treatment (219.9380)
theory (214.5988)
good (208.2342)
information (207.2475)
people (201.8772)
help (200.8




Topic #1:
com (1.7103)
dropbox (1.7103)
attrs (0.8551)
attr (0.8551)
oëìiíl (0.8551)
vêmlîèì (0.8551)
os (0.8551)
ïéonìq²r (0.8551)
hê (0.8551)
úúz (0.8551)
mac (0.8551)
vj (0.8551)
attributes (0.8535)
òt (0.0323)
òü (0.0292)
íp (0.0289)
üt (0.0273)
jó (0.0220)
kf (0.0217)
lc (0.0217)

Topic #2:
police (3.1511)
officers (1.3936)
gates (1.3407)
brutality (0.9725)
commission (0.8572)
angeles (0.7315)
department (0.7278)
los (0.7139)
said (0.6959)
chief (0.5998)
report (0.5007)
city (0.4250)
force (0.4239)
beating (0.4078)
king (0.4064)
officer (0.4046)
mr (0.3854)
complaints (0.3427)
racism (0.3398)
mayor (0.3287)

Topic #3:
oil (0.7662)
exxon (0.7316)
spill (0.5402)
valdez (0.4730)
said (0.2477)
alaska (0.2233)
cleanup (0.2059)
tanker (0.2010)
sound (0.1483)
guard (0.1304)
coast (0.1213)
ship (0.1212)
hazelwood (0.1186)
miles (0.1119)
crude (0.1066)
vessel (0.1064)
prince (0.1041)
million (0.1026)
gallons (0.1016)
reef (0.0992)

Topic #4:
hurricane (0.9666)
hurricanes (0.4277)
sheets (

**LDA with K=20**

LDA with 20 topics produces relatively coherent themes with moderate specificity. Notable topics:

1. Computer Graphics / Hardware – Keywords like graphics, card, display, drivers, dos suggest a focus on hardware or visual computing.

2. Atheism vs Religion – Topics include christian, god, bible, atheists, capturing ideological discourse.

3. Politics / Guns – One topic shows strong presence of guns, firearms, rights, federal, reflecting gun control discussions.

4. Medicine / Science – Keywords like medical, disease, patients, doctor suggest a health-related theme.

5. Cryptography / Security – Presence of encryption, clipper, keys, privacy indicates security-focused conversation.

Overall, LDA-20 gives decently interpretable topics that align loosely with 20NG categories.

______________________________________________________________________
**LDA with K=50**

LDA with 50 topics gives more granular and narrowly focused topics. Standouts:

- Topic with jpeg, image, gif, format likely corresponds to image file formats.

- Another with space, orbit, launch, satellite is clearly aerospace/NASA-related.

- Niche religious or philosophical debates – Words like islam, muslim, christian, bible, god split into finer-grained topics.

- Increased clarity in software/hardware subdomains, like a topic specifically about drivers, windows, irq, modem, likely OS troubleshooting.

Some topics become very specific or hard to interpret without context, indicating a trade-off between coherence and granularity.

______________________________________________________________________
**NMF with K=20**

NMF with 20 topics yields clearer, more distinct clusters than LDA-20, with high separation in top keywords. Highlights:

1. Sports topics are very distinct:

- Baseball (baseball, team, pitcher, league)

- Hockey (hockey, game, players, season)

2. Tech categories appear well-formed:

- Windows-related: windows, dos, files, driver

- Encryption: clipper, key, encryption, security

3. Religion and Philosophy captured well with god, jesus, bible, atheists

NMF-20 seems to form more semantically coherent and less noisy topics than LDA-20.
______________________________________________________________________

**NMF with K=50**

With 50 topics, NMF excels in crisp topic separation, although some topics become almost too fine-grained:

1. Well-isolated topical areas:

- graphics, card, video, monitor

- bike, riding, ride, helmet

- armenian, genocide, turkish, armenians – reflects historical discussions

Some topics look like semantic duplicates, e.g., several overlapping encryption/crypto topics, but with slightly shifted focus.

Still, the top-10 words per topic remain relatively clean, without much cross-topic bleed.
______________________________________________________________________
**Key Takeaways**

- NMF > LDA in topic sharpness and clarity, especially at K=20.

- K=20 gives broad, understandable themes, while K=50 enables deeper dives into niche sub-topics but at the cost of some interpretability.

- Best NMF-50 topics align closely with 20 Newsgroups labels, indicating effective topic discovery.

**PROBLEM 2: Extractive Summarization**

Implement the KL-Sum summarization method for each dataset. Follow the ideas in this paper ; you are allowed to use libraries for text cleaning, segmentation into sentences, etc. In this problem PS stands for "growing" summary distribution, while PD stands for fixed document distribution. Run it twice :
A) PS and PD are over words, proportional to counts of words
B) PD and PS distributions over topics, instead of distributions over words. LDA will give you the PD topic distribution at training; for PS you can call LDA[summary] (without retraining, treat summary as "new_doc")

For DUC dataset evaluate against human gold summaries with ROUGE. ROUGE Perl package. Use the "Abstract" part of the files ins folder "Summaries" as the gold summaries.


In [20]:
!pip install nltk rouge-score gensim

import os
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from rouge_score import rouge_scorer
from collections import Counter, defaultdict
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize, word_tokenize
flat_docs = [" ".join(w for s in sent_tokenize(doc) for w in word_tokenize(s) if w.isalnum()) for doc in duc_docs]



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [21]:
def load_duc_docs_and_summaries(base_path):
    docs, summaries = [], []
    for root, _, files in os.walk(base_path):
        for file in files:
            if file.endswith(".txt") and "Summaries" not in root:
                with open(os.path.join(root, file), encoding='latin1') as f:
                    docs.append(f.read())
            elif "Summaries" in root and file.endswith(".txt"):
                with open(os.path.join(root, file), encoding='latin1') as f:
                    summaries.append(f.read())
    return docs, summaries

duc_path = "/content/DUC2001"
duc_docs, gold_summaries = load_duc_docs_and_summaries(duc_path)

In [22]:
stop_words = set(stopwords.words('english'))

def tokenize_and_filter(text):
    sentences = sent_tokenize(text)
    filtered = []
    for s in sentences:
        words = [w.lower() for w in word_tokenize(s) if w.isalnum() and w.lower() not in stop_words]
        if words:
            filtered.append((s, words))
    return filtered

In [23]:
def compute_distribution(words):
    total = len(words)
    counts = Counter(words)
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, epsilon=1e-12):
    divergence = 0.0
    for w in p:
        p_w = p[w]
        q_w = q.get(w, epsilon)
        divergence += p_w * np.log(p_w / q_w)
    return divergence

def kl_sum(doc_sentences, max_len=250):
    all_words = [w for _, words in doc_sentences for w in words]
    pd = compute_distribution(all_words)
    summary = []
    summary_words = []

    while sum(len(word_tokenize(s)) for s, _ in summary) < max_len:
        best_sent = None
        best_score = float('inf')
        for s, words in doc_sentences:
            if s in [x[0] for x in summary]:
                continue
            temp_summary = summary_words + words
            ps = compute_distribution(temp_summary)
            score = kl_divergence(pd, ps)
            if score < best_score:
                best_score = score
                best_sent = (s, words)
        if best_sent:
            summary.append(best_sent)
            summary_words += best_sent[1]
        else:
            break
    return " ".join([s for s, _ in summary])

In [24]:
def lda_topic_distribution(lda_model, vectorizer, sentences):
    docs = [" ".join(words) for _, words in sentences]
    X = vectorizer.transform(docs)
    return lda_model.transform(X)

def topic_kl_sum(sentences, lda_model, vectorizer, max_len=250):
    topic_matrix = lda_topic_distribution(lda_model, vectorizer, sentences)
    pd = np.mean(topic_matrix, axis=0)
    summary, used = [], []
    summary_matrix = []

    while True:
        best_sent = None
        best_score = float("inf")
        for i, (s, _) in enumerate(sentences):
            if i in used:
                continue
            temp = summary_matrix + [topic_matrix[i]]
            ps = np.mean(temp, axis=0)
            score = np.sum(pd * np.log(pd / (ps + 1e-12)))
            if score < best_score:
                best_score = score
                best_sent = i
        if best_sent is not None:
            used.append(best_sent)
            summary_matrix.append(topic_matrix[best_sent])
            summary.append(sentences[best_sent][0])
            if sum(len(word_tokenize(s)) for s in summary) >= max_len:
                break
        else:
            break
    return " ".join(summary)

In [25]:
def evaluate_rouge(system_summaries, reference_summaries):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2'], use_stemmer=True)
    scores = {'rouge1': [], 'rouge2': []}
    for system, reference in zip(system_summaries, reference_summaries):
        s = scorer.score(reference, system)
        scores['rouge1'].append(s['rouge1'].fmeasure)
        scores['rouge2'].append(s['rouge2'].fmeasure)
    print(f"ROUGE-1: {np.mean(scores['rouge1']):.4f}")
    print(f"ROUGE-2: {np.mean(scores['rouge2']):.4f}")

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Prepare LDA model (Topic-based PD)
flat_docs = [" ".join(w for s in sent_tokenize(doc) for w in word_tokenize(s) if w.isalnum()) for doc in duc_docs]
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(flat_docs)
lda_model = LatentDirichletAllocation(n_components=20, random_state=42)
lda_model.fit(X)

# Run KL-Sum word-based and topic-based
word_summaries = []
topic_summaries = []

for doc in duc_docs:
    sents = tokenize_and_filter(doc)
    word_summaries.append(kl_sum(sents))
    topic_summaries.append(topic_kl_sum(sents, lda_model, vectorizer))

print("Word-based KL-Sum Evaluation:")
evaluate_rouge(word_summaries, gold_summaries)

print("\nTopic-based KL-Sum Evaluation:")
evaluate_rouge(topic_summaries, gold_summaries)

Word-based KL-Sum Evaluation:
ROUGE-1: 0.4582
ROUGE-2: 0.4184

Topic-based KL-Sum Evaluation:
ROUGE-1: 0.4586
ROUGE-2: 0.4184


In this task, I implemented the KL-Sum extractive summarization method as described in the referenced paper. The method selects sentences from a document to build a summary by minimizing the KL-divergence between the document distribution (PD) and the growing summary distribution (PS). I applied this method in two variants:

A. Word-based KL-Sum: PS and PD are distributions over words (based on word frequency).

B. Topic-based KL-Sum: PS and PD are distributions over topics extracted via Latent Dirichlet Allocation (LDA).

Evaluation was done using the DUC2001 dataset with ROUGE scores against the human-written gold summaries.

ROUGE-1:
Measures how many unigrams (individual words) in the gold (human-written) summaries are also found in our system summary.

~0.46 means our summaries captured 46% of the words from the human abstracts — that’s solid for extractive summarization.

ROUGE-2:
Measures bigram recall (how many 2-word sequences overlap).

~0.42 is quite respectable — capturing this many bigrams means our sentences not only contain the right words, but also preserve some word order and coherence.

Topic-based KL-Sum performs nearly identically to word-based KL-Sum, which suggests that topic distributions are effective proxies for capturing sentence importance.

