# 1. Introduction and Data Acquisition

**Starting Point:**  
The article data originates from a Kafka datastream. It is not normalized (so it cannot be analyzed directly) and requires Active Directory login (making collaboration difficult).  
- [View Kafka topic data (AKHQ)](https://akhq.pdp.production.admin.srgssr.ch/ui/strimzi/topic/articles-v2/data?sort=NEWEST&partition=All)

**Processing steps:**  
1. Read article data from the Delta table populated from Kafka.
2. Flatten and transform nested fields (e.g., titles, resources, contributors) using a SQL view.
3. Create a Spark DataFrame from the flattened view and inspect the results.
4. Write the DataFrame to a Delta table for analytics and automation.
5. Export a <25MB Parquet sample with only public data for sharing (e.g., via GitHub).
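
Step 2 can be pictured with a small, self-contained sketch. The real flattening happens in a SQL view whose schema is not shown here, so the record layout and field names below (`title`, `contributors`, `tags`) are hypothetical stand-ins chosen only for illustration.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts; lists of dicts are indexed, other lists are joined."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        elif isinstance(value, list):
            if value and all(isinstance(v, dict) for v in value):
                for i, item in enumerate(value):
                    flat.update(flatten(item, f"{name}{sep}{i}", sep))
            else:
                flat[name] = ", ".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


# Hypothetical nested article record (not the real Kafka schema)
article = {
    "id": "a1",
    "title": {"text": "Hello", "language": "de"},
    "contributors": [{"name": "A. Muster"}, {"name": "B. Beispiel"}],
    "tags": ["news", "sport"],
}
flat_article = flatten(article)
```

Each nested field becomes one flat column (for example `title_text` or `contributors_0_name`), which is the shape a DataFrame or Parquet file needs.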

**Goal:**  
The data should be available as a Parquet file for sharing. Since the dataset is large (5GB), only a public sample is exported for easy distribution.
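
A rough way to size the sample is to scale the row count by the ratio of target size to full size. The sketch below assumes file size grows roughly linearly with the number of rows, which Parquet compression only approximates, so it keeps a safety margin. The function name and the 10% margin are illustrative assumptions, not part of the original pipeline.

```python
def sample_fraction(total_bytes, target_bytes, safety=0.9):
    """Fraction of rows to keep so the export stays below target_bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return min(1.0, safety * target_bytes / total_bytes)


GB = 1000 ** 3
MB = 1000 ** 2

# 5 GB full dataset, 25 MB sharing limit, 10% safety margin (assumed)
fraction = sample_fraction(5 * GB, 25 * MB)
```

With these numbers roughly 0.45% of the rows fit into the public sample.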

**Access Control:**  
To protect sensitive information, data distribution depends on user access rights: entitled users can access the full confidential dataset, while restricted users receive only the public sample.

## 1.1 Installation and Setup

Install required packages for the complete analysis pipeline.

In [None]:
%pip install pandas pyarrow fastparquet
%pip install sentence-transformers tf-keras
%pip install scikit-learn matplotlib umap-learn
%pip install googletrans==4.0.0-rc1
%pip install transformers accelerate torch bertopic hf-transfer seaborn

## 1.2 Load Data

Load article data from the public Parquet file hosted on GitHub.

In [None]:
import pandas as pd

url = "https://github.com/Tao-Pi/CAS-Applied-Data-Science/raw/main/Module-3/01_Module%20Final%20Assignment/export_articles_v2_sample25mb.parquet"
srgssr_article_corpus = pd.read_parquet(url, engine="fastparquet")
has_read_access_udp_articles_v2 = False

# 2. Dataset Overview

In this section, we provide a comprehensive overview of the dataset, including:
- Dataset version (confidential vs public)
- Total number of articles
- Data structure and schema
- Sample data inspection

## 2.1 Check Dataset Version

In [None]:
def format_rowcount(n):
    if n >= 1_000_000:
        return f"more than {n // 1_000_000} million"
    elif n >= 1_000:
        return f"more than {n // 1_000} thousand"
    else:
        return f"{n}"

# len() for pandas DataFrames; Spark DataFrames expose the row count via .count()
if isinstance(srgssr_article_corpus, pd.DataFrame):
    rowcount = len(srgssr_article_corpus)
else:
    rowcount = srgssr_article_corpus.count()

if has_read_access_udp_articles_v2:
    print(f"congrats: you have successfully read the full data set. This contains the full corpus of {format_rowcount(rowcount)} Articles published by SRG-SSR as plain text together with some relevant metadata. You can access the dataframe object by calling 'srgssr_article_corpus' from Python now.")
else:
    print(f"congrats: you have successfully read the publicly available (sampled) data set. This contains an excerpt of {format_rowcount(rowcount)} articles within SRG-SSR as plain text together with some relevant metadata. You can access the dataframe object by calling 'srgssr_article_corpus' from Python now.")

## 2.2 Data Structure and Schema

In [None]:
# Show column information
first_row = srgssr_article_corpus.iloc[0].to_dict() if not srgssr_article_corpus.empty else {}

cols_info = [
    {
        "column": col,
        "type": str(dtype),
        "example": first_row.get(col, None)
    }
    for col, dtype in srgssr_article_corpus.dtypes.items()
]

pd.DataFrame(cols_info).head(20)

## 2.3 Sample Data Inspection

In [None]:
display(srgssr_article_corpus)

## 2.4 Limit Dataset for Analysis

For demonstration purposes, we'll work with the first 1000 articles to ensure reasonable processing time.

In [None]:
srgssr_article_corpus = srgssr_article_corpus.head(1000)
print(f"Working with {len(srgssr_article_corpus)} articles for analysis")

# 3. Semantic Search Implementation

**Use Case:** Quickly search all existing articles without needing Google.

**Goal:** Enable journalists to verify if a story has already been written by colleagues in different branches.

**Approach:**
- Use text embeddings (Sentence Transformers) to represent article content
- Enable similarity-based semantic search
- Return most relevant articles for any query
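
The ranking step rests on one fact: for vectors scaled to unit length, the dot product equals the cosine similarity. The following standard-library sketch shows this on hand-made two-dimensional vectors; the toy vectors and helper names are illustrative and unrelated to the real embedding model.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=2):
    """Return (document index, similarity) for the k most similar documents."""
    q = normalize(query_vec)
    scored = [(dot(normalize(d), q), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, round(score, 3)) for score, i in scored[:k]]


# Toy "embeddings": document 0 points almost the same way as the query
toy_docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
ranking = top_k([1.0, 0.1], toy_docs, k=2)
```

Document 0 ranks first and document 1 second, because their directions are closest to the query.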

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer

TEXT_COL = "content_text_csv"
ID_COL = "id"

# Prepare data
df = srgssr_article_corpus.copy()
df[TEXT_COL] = df[TEXT_COL].fillna("").astype(str)

# Initialize embedding model
_model = None
def get_embedder():
    global _model
    if _model is None:
        _model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return _model

print("Creating embeddings for semantic search...")
model = get_embedder()
emb_matrix = model.encode(
    df[TEXT_COL].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,
)

ids = df[ID_COL].tolist()
texts = df[TEXT_COL].tolist()

def semantic_search(query: str, top_k: int = 10) -> pd.DataFrame:
    """Search for semantically similar articles"""
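    # Use get_embedder() rather than the global `model`, which Section 7 rebinds to a different model
    q = get_embedder().encode([query], convert_to_numpy=True, normalize_embeddings=True)[0]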
    sims = emb_matrix @ q
    top_idx = np.argpartition(-sims, kth=min(top_k, len(sims)-1))[:top_k]
    top_idx = top_idx[np.argsort(-sims[top_idx])]
    return pd.DataFrame({
        "id": [ids[i] for i in top_idx],
        "content_text_csv": [texts[i] for i in top_idx],
        "similarity": [float(sims[i]) for i in top_idx],
    })

# Example search
print("\n" + "="*80)
print("Example: Searching for 'climate change' articles")
print("="*80)
results = semantic_search("climate change", top_k=10)
display(results)

# 4. Topic Clustering Analysis (K-means)

**Use Case:** Discover what topics SRG writes about to improve navigation and content organization.

**Approach:**
- Use K-means clustering on article embeddings
- Extract representative keywords for each cluster
- Visualize topic distribution in 2D space using UMAP
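
For readers unfamiliar with K-means, the loop it runs can be sketched in a few lines of plain Python. This is a toy version of Lloyd's algorithm with fixed starting centres, meant only to show the assign-then-update cycle; the analysis itself uses scikit-learn's `KMeans`.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, n_iter=10):
    """Lloyd's algorithm: assign each point to its nearest centre, then recompute centres."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster ends up empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers


# Two obvious groups in 2D, with hand-picked starting centres
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
toy_labels, toy_centers = kmeans(toy_points, centers=[(0.0, 0.0), (5.0, 5.0)])
```

Every point always receives a label, which is why K-means has no notion of outliers.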

In [None]:
from sklearn.cluster import KMeans
from collections import Counter
import re

# Perform K-means clustering
n_clusters = 10
print(f"Performing K-means clustering with {n_clusters} clusters...")

kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init="auto")
labels = kmeans.fit_predict(emb_matrix)

df_clusters = pd.DataFrame({
    "id": ids,
    "content_text_csv": texts,
    "cluster": labels
})

# Extract topic keywords for each cluster
def get_topic_keywords(cluster_id, df_clusters, top_n=3):
    """Extract most common meaningful words from articles in a cluster"""
    cluster_texts = df_clusters[df_clusters['cluster'] == cluster_id]['content_text_csv'].tolist()
    combined_text = ' '.join(cluster_texts).lower()
    words = re.findall(r'\b[a-zäöüàéèêëïôùû]{4,}\b', combined_text)
    
    stopwords = {'dass', 'sind', 'wird', 'wurden', 'wurde', 'haben', 'sein', 
                 'eine', 'einem', 'einen', 'einer', 'dies', 'diese', 'dieser',
                 'auch', 'mehr', 'beim', 'über', 'nach', 'sich', 'oder', 'kann',
                 'können', 'müssen', 'soll', 'sollen', 'noch', 'bereits', 'aber',
                 'wenn', 'weil', 'denn', 'dann', 'sowie', 'dass', 'damit', 'with',
                 'from', 'have', 'this', 'that', 'will', 'been', 'were', 'their',
                 'what', 'which', 'when', 'where', 'there', 'pour', 'dans', 'avec',
                 'sont', 'être', 'cette', 'mais', 'plus', 'comme', 'fait'}
    
    words = [w for w in words if w not in stopwords]
    word_counts = Counter(words)
    top_words = [word for word, count in word_counts.most_common(top_n)]
    return ', '.join(top_words) if top_words else f"Topic {cluster_id}"

# Generate topic labels
topic_labels = {}
print("\nCluster Topics (K-means, based on most frequent keywords):")
print("="*80)
for cluster_id in range(n_clusters):
    keywords = get_topic_keywords(cluster_id, df_clusters, top_n=3)
    topic_labels[cluster_id] = keywords
    count = len(df_clusters[df_clusters['cluster'] == cluster_id])
    print(f"Cluster {cluster_id}: {keywords} ({count} articles)")

df_clusters['cluster_topic'] = df_clusters['cluster'].map(topic_labels)

print("\n")
display(df_clusters.head(10))

## 4.1 Visualize K-means Clusters

In [None]:
import matplotlib.pyplot as plt
import umap

print("Creating UMAP projection for visualization...")
reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)
embedding_2d = reducer.fit_transform(emb_matrix)

# Create scatter plot
plt.figure(figsize=(16, 12))
scatter = plt.scatter(
    embedding_2d[:, 0], 
    embedding_2d[:, 1], 
    c=labels, 
    cmap='tab10', 
    alpha=0.6, 
    s=50
)

plt.colorbar(scatter, label='Cluster')
plt.title('K-means Topic Clusters Visualization (UMAP Projection)', fontsize=16)
plt.xlabel('UMAP Dimension 1', fontsize=12)
plt.ylabel('UMAP Dimension 2', fontsize=12)
plt.grid(True, alpha=0.3)

# Add cluster centers with labels
kmeans_centers_2d = reducer.transform(kmeans.cluster_centers_)
plt.scatter(
    kmeans_centers_2d[:, 0], 
    kmeans_centers_2d[:, 1], 
    c='red', 
    marker='X', 
    s=200, 
    edgecolors='black', 
    linewidths=2,
    label='Cluster Centers'
)

# Add text labels
for cluster_id in range(n_clusters):
    x, y = kmeans_centers_2d[cluster_id]
    label_text = f"C{cluster_id}: {topic_labels[cluster_id]}"
    plt.annotate(
        label_text,
        xy=(x, y),
        xytext=(10, 10),
        textcoords='offset points',
        fontsize=9,
        bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0', color='black', lw=1)
    )

plt.legend()
plt.tight_layout()
plt.show()

# Print cluster distribution
print("\nK-means Cluster Distribution:")
cluster_counts = df_clusters['cluster'].value_counts().sort_index()
for cluster_id, count in cluster_counts.items():
    print(f"Cluster {cluster_id} ({topic_labels[cluster_id]}): {count} articles")

# 5. Translation Pipeline

**Use Case:** Translate all existing articles into all supported languages.

**Goal:** Multiply content availability by making every article accessible in all 11 languages used by SRG.

**Approach:**
- Use the unofficial `googletrans` library (a client for the Google Translate web service) for demonstration
- Translate articles to English as target language
- Handle rate limiting and errors gracefully
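
One common way to handle rate limiting is to retry with exponentially growing delays. The sketch below is a generic pattern, not the notebook's implementation: `with_retries` and `flaky_translate` are hypothetical names, and the failing call is simulated so the example runs offline.

```python
import time


def with_retries(func, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Simulated translation call that fails twice before succeeding
calls = {"n": 0}


def flaky_translate():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "translated text"


delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky_translate, sleep=delays.append)
```

Raising after the final attempt makes failures visible, whereas returning the untranslated text would silently mix source-language text into the English column.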

**Note:** In a Databricks environment, you would use the `ai_translate()` function instead.

In [None]:
from googletrans import Translator
import time

# Initialize translator
translator = Translator()
df_translated = srgssr_article_corpus.copy()

def translate_text(text, dest='en', max_retries=3):
    """Translate text to target language with retry logic"""
    if pd.isna(text) or text == "":
        return ""
    
    text_str = str(text)[:5000]  # Limit to 5000 characters
    
    for attempt in range(max_retries):
        try:
            result = translator.translate(text_str, dest=dest)
            return result.text
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(1)
                continue
            else:
                print(f"Translation failed after {max_retries} attempts: {str(e)[:100]}")
                return text_str
    
    return text_str

# Translate articles
print("Translating articles to English...")
print(f"Total articles to translate: {len(df_translated)}")
print("Note: This may take several minutes. Adding delays to avoid rate limiting.\n")

translated_texts = []
for idx, text in enumerate(df_translated['content_text_csv']):
    if idx % 10 == 0:
        print(f"Progress: {idx}/{len(df_translated)} articles translated...")
    
    translated = translate_text(text, dest='en')
    translated_texts.append(translated)
    
    if idx % 10 == 0 and idx > 0:
        time.sleep(0.5)

df_translated['content_text_en'] = translated_texts

print(f"\n✅ Translation complete! Translated {len(df_translated)} articles.")
print("\nShowing first 3 translated articles:")
display(df_translated[['id', 'content_text_csv', 'content_text_en']].head(3))

# 6. Enhanced Topic Categorization

After translation, we can perform more sophisticated topic analysis on the English text.
This section maps clusters to higher-level topic categories.
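
Mapping keywords to categories depends on how a match is defined. The sketch below contrasts substring matching with whole-word matching on a made-up cluster label; the label and the two-word keyword list are illustrative only.

```python
def substring_hits(label, keywords):
    """Keywords that occur anywhere inside the label string."""
    return [kw for kw in keywords if kw in label.lower()]


def token_hits(label, keywords):
    """Keywords that equal one of the comma-separated words in the label."""
    tokens = {t.strip() for t in label.lower().split(",")}
    return [kw for kw in keywords if kw in tokens]


# Hypothetical cluster label in the "word, word, word" format used for cluster topics
label = "party, warming, council"
culture_keywords = ["art", "music"]

loose = substring_hits(label, culture_keywords)
strict = token_hits(label, culture_keywords)
```

Substring matching finds "art" inside "party" and would count a political cluster towards culture, while whole-word matching finds nothing. The trade-off is that whole-word matching misses inflected forms such as plurals.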

## 6.1 Cluster Translated Articles

In [None]:
print("Creating embeddings for translated English articles...")

df_en = df_translated.copy()
df_en['content_text_en'] = df_en['content_text_en'].fillna("").astype(str)

# Create embeddings for English text
# Reuse the cached embedder from Section 3 instead of loading the same model again
model_en = get_embedder()
emb_matrix_en = model_en.encode(
    df_en['content_text_en'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,
)

# Perform clustering
n_clusters_en = 10
kmeans_en = KMeans(n_clusters=n_clusters_en, random_state=42, n_init="auto")
labels_en = kmeans_en.fit_predict(emb_matrix_en)

df_clusters_en = pd.DataFrame({
    "id": df_en['id'].tolist(),
    "original_text": df_en['content_text_csv'].tolist(),
    "translated_text_en": df_en['content_text_en'].tolist(),
    "cluster": labels_en
})

# Extract English keywords
def get_topic_keywords_en(cluster_id, df_clusters, top_n=3):
    cluster_texts = df_clusters[df_clusters['cluster'] == cluster_id]['translated_text_en'].tolist()
    combined_text = ' '.join(cluster_texts).lower()
    words = re.findall(r'\b[a-z]{4,}\b', combined_text)
    
    stopwords = {
        'this', 'that', 'with', 'from', 'have', 'been', 'were', 'their',
        'what', 'which', 'when', 'where', 'there', 'will', 'would', 'could',
        'should', 'about', 'after', 'also', 'many', 'more', 'most', 'other',
        'some', 'such', 'than', 'them', 'then', 'these', 'they', 'very',
        'into', 'just', 'like', 'only', 'over', 'said', 'same', 'says',
        'does', 'make', 'made', 'well', 'much', 'even', 'back', 'through',
        'year', 'years', 'being', 'people', 'according', 'since', 'during'
    }
    
    words = [w for w in words if w not in stopwords and len(w) > 3]
    word_counts = Counter(words)
    top_words = [word for word, count in word_counts.most_common(top_n)]
    return ', '.join(top_words) if top_words else f"Topic {cluster_id}"

topic_labels_en = {}
print("\nCluster Topics (based on English translated text):")
for cluster_id in range(n_clusters_en):
    keywords = get_topic_keywords_en(cluster_id, df_clusters_en, top_n=3)
    topic_labels_en[cluster_id] = keywords

df_clusters_en['cluster_topic'] = df_clusters_en['cluster'].map(topic_labels_en)

for cluster_id in range(n_clusters_en):
    count = len(df_clusters_en[df_clusters_en['cluster'] == cluster_id])
    print(f"Cluster {cluster_id}: {topic_labels_en[cluster_id]} ({count} articles)")

## 6.2 Map to Higher-Level Categories

In [None]:
# Define category mappings
topic_categories_enhanced = {
    'Politics': ['government', 'election', 'parliament', 'minister', 'political', 'policy', 'president', 
                 'vote', 'party', 'democrat', 'republican', 'law', 'congress', 'senate', 'council',
                 'federal', 'state', 'referendum', 'campaign', 'diplomat', 'legislative', 'executive',
                 'parliament', 'coalition', 'opposition', 'chancellor', 'mayor', 'governor', 'prime'],
    
    'Sports': ['football', 'soccer', 'tennis', 'basketball', 'hockey', 'olympic', 'champion', 'team',
               'player', 'match', 'game', 'tournament', 'league', 'coach', 'athlete', 'sport',
               'championship', 'victory', 'defeat', 'goal', 'score', 'final', 'world', 'cup',
               'season', 'club', 'training', 'competition', 'medal', 'race', 'swimming', 'skiing'],
    
    'Economy & Business': ['economy', 'economic', 'business', 'market', 'bank', 'finance', 'investment', 'trade',
                           'company', 'stock', 'price', 'inflation', 'currency', 'export', 'import', 'growth',
                           'gdp', 'employment', 'unemployment', 'budget', 'debt', 'profit', 'financial',
                           'corporate', 'industry', 'commercial', 'entrepreneur', 'revenue', 'sales', 'consumer'],
    
    'Science & Technology': ['science', 'technology', 'research', 'study', 'university', 'scientist',
                             'experiment', 'discovery', 'innovation', 'digital', 'computer', 'internet',
                             'software', 'data', 'artificial', 'intelligence', 'robot', 'space', 'energy',
                             'tech', 'innovation', 'laboratory', 'academic', 'technical', 'engineering',
                             'development', 'scientific', 'app', 'online', 'platform', 'system'],
    
    'Health': ['health', 'medical', 'hospital', 'doctor', 'patient', 'disease', 'treatment', 'medicine',
               'virus', 'vaccine', 'pandemic', 'covid', 'care', 'mental', 'clinic', 'drug', 'therapy',
               'healthcare', 'diagnosis', 'symptoms', 'infection', 'prevention', 'nursing', 'surgery',
               'pharmaceutical', 'wellness', 'emergency', 'healthcare'],
    
    'Environment & Climate': ['climate', 'environment', 'environmental', 'weather', 'temperature', 'global',
                              'warming', 'carbon', 'pollution', 'sustainable', 'renewable', 'energy', 'nature',
                              'forest', 'ocean', 'animal', 'species', 'biodiversity', 'ecological', 'green',
                              'conservation', 'wildlife', 'natural', 'emission', 'solar', 'wind', 'water'],
    
    'Culture & Entertainment': ['culture', 'cultural', 'music', 'film', 'movie', 'concert', 'festival',
                                'artist', 'museum', 'exhibition', 'theater', 'performance', 'book',
                                'author', 'literature', 'entertainment', 'celebrity', 'show', 'art',
                                'cinema', 'gallery', 'dance', 'opera', 'creative', 'painting', 'song'],
    
    'Society & Education': ['social', 'society', 'community', 'people', 'family', 'education', 'school', 'student',
                            'teacher', 'child', 'women', 'rights', 'justice', 'police', 'crime', 'court', 'prison',
                            'learning', 'teaching', 'university', 'college', 'children', 'youth', 'citizenship',
                            'welfare', 'public', 'human'],
    
    'International & Foreign Affairs': ['international', 'foreign', 'country', 'countries', 'world', 'global',
                                        'nation', 'diplomatic', 'relations', 'treaty', 'ambassador', 'border',
                                        'crisis', 'conflict', 'peace', 'war', 'alliance', 'united', 'nations'],
    
    'Media & Communication': ['media', 'news', 'press', 'journalist', 'television', 'radio', 'broadcast',
                              'newspaper', 'magazine', 'report', 'reporter', 'channel', 'publishing',
                              'communication', 'interview', 'announcement', 'statement']
}

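def assign_category_enhanced(cluster_keywords):
    """Assign the category whose keyword list shares the most whole words with the cluster label"""
    # Compare whole tokens, not substrings: otherwise 'art' matches 'party' and 'war' matches 'warming'
    tokens = {token.strip() for token in cluster_keywords.lower().split(',')}
    scores = {}
    
    for category, category_keywords in topic_categories_enhanced.items():
        # set() ignores keywords listed twice (e.g. 'parliament', 'healthcare')
        score = len(tokens & set(category_keywords))
        if score > 0:
            scores[category] = score
    
    if scores:
        return max(scores.items(), key=lambda x: x[1])[0]
    else:
        return 'Other'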

# Assign categories
cluster_categories_enhanced = {}
print("\nMapping clusters to higher-level topic categories:\n")
print(f"{'Cluster':<10} {'Keywords':<45} {'Category':<30}")
print("=" * 90)

for cluster_id in range(n_clusters_en):
    keywords = topic_labels_en[cluster_id]
    category = assign_category_enhanced(keywords)
    cluster_categories_enhanced[cluster_id] = category
    print(f"{cluster_id:<10} {keywords:<45} {category:<30}")

df_clusters_en['topic_category_enhanced'] = df_clusters_en['cluster'].map(cluster_categories_enhanced)

# Show distribution
print("\n\nArticle Distribution by Enhanced Topic Category:")
print("=" * 60)
category_counts_enhanced = df_clusters_en['topic_category_enhanced'].value_counts()
for category, count in category_counts_enhanced.items():
    percentage = (count / len(df_clusters_en)) * 100
    print(f"{category:<30} {count:>5} articles ({percentage:>5.1f}%)")

## 6.3 Visualize Enhanced Categorization

In [None]:
# Create UMAP projection for translated text
reducer_en = umap.UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)
embedding_2d_en = reducer_en.fit_transform(emb_matrix_en)

# Color map for categories
unique_categories = sorted(df_clusters_en['topic_category_enhanced'].unique())
category_colors = plt.cm.Set3(np.linspace(0, 1, len(unique_categories)))
category_color_map = {cat: color for cat, color in zip(unique_categories, category_colors)}

# Create visualization
fig = plt.figure(figsize=(22, 10))

# Left: Scatter plot by category
ax1 = plt.subplot(1, 2, 1)
for category in unique_categories:
    mask = df_clusters_en['topic_category_enhanced'] == category
    indices = df_clusters_en[mask].index
    ax1.scatter(
        embedding_2d_en[indices, 0],
        embedding_2d_en[indices, 1],
        c=[category_color_map[category]],
        label=category,
        alpha=0.7,
        s=60,
        edgecolors='black',
        linewidths=0.5
    )

ax1.set_title('Article Clusters with Enhanced Topic Categories\n(UMAP 2D Projection)', 
              fontsize=16, fontweight='bold', pad=20)
ax1.set_xlabel('UMAP Dimension 1', fontsize=12)
ax1.set_ylabel('UMAP Dimension 2', fontsize=12)
ax1.legend(loc='best', fontsize=9, framealpha=0.9, edgecolor='black')
ax1.grid(True, alpha=0.3, linestyle='--')

# Right: Pie chart
ax2 = plt.subplot(1, 2, 2)
category_counts = df_clusters_en['topic_category_enhanced'].value_counts()
colors = [category_color_map[cat] for cat in category_counts.index]

# Avoid the name `texts`: it already holds the article texts used by semantic_search()
wedges, label_texts, autotexts = ax2.pie(
    category_counts.values,
    labels=category_counts.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=colors,
    textprops={'fontsize': 10, 'weight': 'bold'},
    wedgeprops={'edgecolor': 'black', 'linewidth': 1.5}
)

for autotext in autotexts:
    autotext.set_color('black')
    autotext.set_fontsize(9)

ax2.set_title('Distribution of Articles by Enhanced Topic Category\n' + 
              f'(Total: {len(df_clusters_en)} articles)', 
              fontsize=16, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

# 7. BERTopic Analysis (Complete Implementation)

This section implements advanced topic modeling using BERTopic with:
- **Custom embedder**: Nvidia LLaMA-Embed-Nemotron-8B model
- **UMAP**: Dimensionality reduction with configurable parameters
- **HDBSCAN**: Hierarchical density-based clustering
- **Custom topic labels**: Human-readable topic names
- **Visualizations**: Bar charts and time series analysis
- **Comparison**: Results compared with K-means clustering

BERTopic provides more sophisticated topic modeling than K-means by:
1. Using state-of-the-art embeddings
2. Discovering topics hierarchically
3. Handling noise and outliers better
4. Providing interpretable topic representations
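
The interpretable topic representations come from a class-based TF-IDF weighting, which the BERTopic documentation describes as term frequency within a topic multiplied by log(1 + A / f), where A is the average number of words per topic and f is the word's frequency across all topics. The following is a simplified standard-library sketch of that weighting on toy data, not BERTopic's actual implementation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Map {topic: [doc, ...]} to {topic: {word: weight}} with class-based TF-IDF."""
    tf = {t: Counter(" ".join(docs).lower().split()) for t, docs in docs_per_topic.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)  # A: average number of words per topic
    return {
        t: {
            w: (n / sum(counts.values())) * math.log(1 + avg_words / total[w])
            for w, n in counts.items()
        }
        for t, counts in tf.items()
    }


# Toy corpus with two hand-made topics
toy_topics = {
    "sport": ["team won the match", "coach praised the team"],
    "politics": ["the parliament passed the law", "the vote was close"],
}
weights = c_tf_idf(toy_topics)
top_sport = max(weights["sport"], key=weights["sport"].get)
```

Words that are frequent in one topic but rare elsewhere, such as "team", outrank words like "the" that appear in every topic.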

## 7.1 Custom Embedder Class

Define a custom embedder that wraps the Nvidia LLaMA embedding model for use with BERTopic.
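
Turning per-token vectors into one document vector is usually done by mean pooling. When texts are padded to a common length, padding positions should be excluded from the average. The toy sketch below uses plain lists in place of tensors to show the difference; the numbers are made up.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        return [0.0] * dim
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding position
tokens = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0]]
mask = [1, 1, 0]

masked = mean_pool(tokens, mask)        # ignores the padding position
naive = mean_pool(tokens, [1, 1, 1])    # treats padding as a real token
```

The naive average is pulled towards the padding vector, so short texts in a padded batch would get distorted embeddings.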

In [None]:
import torch
import pyarrow.parquet as pq
from transformers import AutoTokenizer, AutoModel
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN
from IPython.display import display

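class CustomEmbedder:
    """Custom embedder using Nvidia LLaMA-Embed-Nemotron-8B model"""
    def __init__(self, model, tokenizer, batch_size=8):
        self.model = model
        self.tokenizer = tokenizer
        self.batch_size = batch_size  # small batches keep an 8B model within device memory
    
    def encode(self, texts, **kwargs):
        texts = list(texts)
        chunks = []
        for start in range(0, len(texts), self.batch_size):
            batch = texts[start:start + self.batch_size]
            inputs = self.tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512)
            inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
            with torch.no_grad():
                hidden = self.model(**inputs).last_hidden_state.float()
            # Mean-pool over real tokens only; padding positions are masked out
            mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.device).float()
            chunks.append(((hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)).cpu())
        return torch.cat(chunks).numpy()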

## 7.2 Load Model and Prepare Data

In [None]:
print("="*80)
print("BERTOPIC ANALYSIS - LOADING MODEL")
print("="*80)

# Load Nvidia LLaMA embedding model
print("\nLoading Nvidia LLaMA-Embed-Nemotron-8B model...")
MODEL_NAME = "nvidia/llama-embed-nemotron-8b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
print(f"✓ Loaded {MODEL_NAME}")

# Prepare data - use translated English text if available, otherwise original
print("\nPreparing data...")
if 'content_text_en' in df_translated.columns:
    df_bertopic = df_translated.copy()
    docs = df_bertopic["content_text_en"].dropna().tolist()
    print(f"✓ Using translated English text: {len(docs)} documents")
else:
    df_bertopic = srgssr_article_corpus.copy()
    docs = df_bertopic["content_text_csv"].dropna().tolist()
    print(f"✓ Using original text: {len(docs)} documents")

## 7.3 Configure UMAP and HDBSCAN Parameters

BERTopic uses UMAP for dimensionality reduction and HDBSCAN for clustering.
These parameters can be tuned to adjust topic granularity and quality.
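
Because HDBSCAN clusters are disjoint and each needs at least `min_cluster_size` members, the corpus size puts a hard ceiling on the number of topics. The sketch below does that arithmetic; `suggest_min_cluster_size` is a hypothetical rule of thumb for picking a starting value, not a documented HDBSCAN feature.

```python
def max_clusters(n_docs, min_cluster_size):
    """Upper bound on the number of clusters, ignoring outliers."""
    return n_docs // min_cluster_size


def suggest_min_cluster_size(n_docs, desired_topics, floor=5):
    """Largest min_cluster_size that still leaves room for desired_topics clusters."""
    return max(floor, n_docs // desired_topics)


ceiling_small = max_clusters(1000, 220)             # 1,000-article subset
starting_value = suggest_min_cluster_size(1000, 13)  # room for 13 topics
```

With the 1,000-article subset, a minimum cluster size of 220 allows at most 4 topics, so a run that should yield 13 topics needs either a larger corpus or a value of about 76 or less.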

In [None]:
print("\nConfiguring UMAP and HDBSCAN parameters...")

# UMAP parameters
umap_model = UMAP(
    n_neighbors=20,        # Higher = more global structure (UMAP and BERTopic default: 15)
    n_components=5,        # Number of dimensions (BERTopic default: 5; UMAP default: 2)
    min_dist=0.0,          # Minimum distance between points (BERTopic default: 0.0; UMAP default: 0.1)
    metric='cosine',       # Distance metric
    random_state=42        # For reproducibility
)

# HDBSCAN parameters
hdbscan_model = HDBSCAN(
    min_cluster_size=220,   # Minimum size of clusters (increase for fewer, larger topics); caps a 1,000-article subset at 4 topics
    min_samples=15,         # Higher = more conservative clustering
    metric='euclidean',     # Distance metric
    cluster_selection_method='eom',  # 'eom' or 'leaf'
    prediction_data=True    # Needed for transform
)

print("✓ UMAP configured: n_neighbors=20, n_components=5, min_dist=0.0")
print("✓ HDBSCAN configured: min_cluster_size=220, min_samples=15")

## 7.4 Setup BERTopic and Run Analysis

In [None]:
print("\n" + "="*80)
print("SETTING UP BERTOPIC")
print("="*80)

# Setup embedding model
embedding_model = CustomEmbedder(model, tokenizer)

# Configure vectorizer with custom stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
custom_stopwords = list(ENGLISH_STOP_WORDS) + ['said', 'efe']
vectorizer_model = CountVectorizer(stop_words=custom_stopwords)

# Initialize BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    verbose=True
)
print("✓ BERTopic configured with custom embedder")

# Run topic modeling
print("\n" + "="*80)
print("RUNNING TOPIC MODELING")
print("="*80)
print("This may take several minutes depending on the dataset size...")

topics, probs = topic_model.fit_transform(docs)

# Add topics to dataframe
df_clean = df_bertopic.dropna(subset=["content_text_en" if 'content_text_en' in df_bertopic.columns else "content_text_csv"]).copy()
df_clean["topic"] = topics

# Display results
print(f"\n{'='*80}")
print(f"RESULTS: Found {len(set(topics))} topics (including outliers)")
print(f"{'='*80}\n")

topic_info = topic_model.get_topic_info()
display(topic_info)

## 7.5 Assign Custom Topic Labels

Replace automatic topic names with human-readable labels based on manual inspection.

In [None]:
print("\nAssigning custom topic labels...")

# Define custom topic labels
custom_topic_labels = {
    -1: "Outliers / Unassigned",
    0: "Latin American Politics",
    1: "Swiss Domestic Affairs",
    2: "Africa & Middle East",
    3: "International Sports News",
    4: "Law, Crime, Public Safety",
    5: "Arts & Culture",
    6: "Business & Economics",
    7: "International Security & Military Affairs",
    8: "International Trade & Geopolitics",
    9: "Natural Disaster & Humanitarian Response",
    10: "US Domestic Affairs",
    11: "Climate Action & Policy",
    12: "Business & Economics in Latin America"
}

# Map topics to custom labels
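# Fall back to a generic name for topic ids without a manual label (avoids NaN labels)
df_clean['topic_label'] = df_clean['topic'].map(lambda t: custom_topic_labels.get(t, f"Topic {t}"))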

print("✓ Custom topic labels assigned")
print("\nSample articles with BERTopic topics:")
display(df_clean[['content_text_en' if 'content_text_en' in df_clean.columns else 'content_text_csv', 
                  'topic', 'topic_label']].head(10))

## 7.6 Visualize Topic Distribution (Bar Chart)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Calculate topic percentages
topic_counts = df_clean['topic_label'].value_counts()
topic_percentages = (topic_counts / len(df_clean) * 100).sort_values(ascending=True)

# Create horizontal bar chart
fig, ax = plt.subplots(figsize=(10, 8))

colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(topic_percentages)))
bars = ax.barh(topic_percentages.index, topic_percentages.values, color=colors)

# Add bar labels
for i, (bar, value) in enumerate(zip(bars, topic_percentages.values)):
    ax.text(value + 0.3, i, f'{value:.1f}%', 
            va='center', fontsize=10, fontweight='bold')

ax.set_xlabel('Percentage of Corpus (%)', fontsize=11)
ax.set_ylabel('')
ax.set_title('BERTopic Distribution Across Corpus', fontsize=14, fontweight='bold')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(axis='x', alpha=0.3, linestyle='--')

plt.tight_layout()
plt.savefig('bertopic_distribution.png', dpi=300, bbox_inches='tight')
plt.show()
print(f"✓ Saved chart to: bertopic_distribution.png")

## 7.7 Time Series Analysis (Small Multiples)

Visualize how different topics evolve over time using small multiples.
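
Small multiples need two ingredients: a count per topic and day, and a grid large enough for one panel per topic. The standard-library sketch below shows both on a few made-up records; the topic names and dates are placeholders, not real results.

```python
import math
from collections import Counter
from datetime import date

# Hypothetical (topic_label, release date) pairs standing in for article rows
records = [
    ("Arts & Culture", date(2025, 10, 27)),
    ("Arts & Culture", date(2025, 10, 27)),
    ("Arts & Culture", date(2025, 10, 28)),
    ("Business & Economics", date(2025, 10, 27)),
]

# One counter entry per (topic, day) combination
daily = Counter(records)

# Grid size: one panel per topic, n_cols panels per row
topics_seen = sorted({topic for topic, _ in records})
n_cols = 4
n_rows = math.ceil(len(topics_seen) / n_cols)
```

Using the same axis limits in every panel is what makes the topics comparable at a glance.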

In [None]:
import matplotlib.dates as mdates
from datetime import datetime

print("\n" + "="*80)
print("CREATING TIME SERIES VISUALIZATION")
print("="*80)

# Parse dates
if 'releaseDate' in df_clean.columns:
    df_clean['releaseDate'] = pd.to_datetime(df_clean['releaseDate'])
    df_clean['date'] = df_clean['releaseDate'].dt.date
    print(f"Date range: {df_clean['releaseDate'].min()} to {df_clean['releaseDate'].max()}")
    
    # Set date limits
    date_min = datetime(2025, 10, 27)
    date_max = datetime(2025, 11, 5)
    
    # Calculate topic totals
    topic_totals = df_clean.groupby('topic_label').size().reset_index(name='total')
    topic_totals = topic_totals.sort_values('total', ascending=False)
    
    # Separate outliers
    outliers_mask = topic_totals['topic_label'] == 'Outliers / Unassigned'
    outliers_topic = topic_totals[outliers_mask]['topic_label'].tolist()
    other_topics = topic_totals[~outliers_mask]['topic_label'].tolist()
    topics_list = other_topics + outliers_topic
    
    # Daily counts
    daily_counts = df_clean.groupby(['topic_label', 'date']).size().reset_index(name='count')
    
    # Create small multiples
    n_topics = len(topics_list)
    n_cols = 4
    n_rows = (n_topics + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 2.5*n_rows))
    axes = axes.flatten() if n_rows > 1 else axes
    
    for idx, topic in enumerate(topics_list):
        ax = axes[idx]
        
        topic_data = daily_counts[daily_counts['topic_label'] == topic].copy()
        topic_data = topic_data.sort_values('date')
        topic_data['date'] = pd.to_datetime(topic_data['date'])
        
        total = topic_data['count'].sum()
        
        ax.plot(topic_data['date'], topic_data['count'], 
                color='steelblue', linewidth=1.5, marker='o', markersize=3, 
                markerfacecolor='steelblue', markeredgecolor='white', markeredgewidth=0.5,
                label=f'Total: {total:,}')
        
        ax.set_title(topic, fontsize=9, fontweight='bold', pad=5)
        
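        # Fixed y-limits keep the panels visually comparable;
        # the outlier panel gets a larger range because it holds more articles per day
        if topic == 'Outliers / Unassigned':
            ax.set_ylim(0, 385)
        else:
            ax.set_ylim(0, 190)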
        
        ax.grid(axis='y', alpha=0.3, linestyle='--')
        ax.set_xlim(date_min, date_max)
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%m-%d'))
        ax.xaxis.set_major_locator(mdates.DayLocator(interval=2))
        plt.setp(ax.xaxis.get_majorticklabels(), fontsize=7)
        
        ax.legend(loc='upper right', fontsize=7, framealpha=0.9, handlelength=0, handletextpad=0, markerscale=0)
    
    for idx in range(n_topics, len(axes)):
        axes[idx].axis('off')
    
    plt.suptitle('BERTopic Distribution Over Time', fontsize=16, fontweight='bold', y=1.00)
    plt.tight_layout(h_pad=1.5, w_pad=1.5)
    plt.savefig('bertopic_time_series.png', dpi=300, bbox_inches='tight')
    print("✓ Saved: bertopic_time_series.png")
    plt.show()
else:
    print("⚠ No releaseDate column found - skipping time series visualization")
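
The fixed y-limits in the cell above (190 and 385) fit this particular sample. As a minimal sketch, the limits could instead be derived from `daily_counts`, keeping one shared limit for regular topics and a separate one for the outlier panel. The helper name `shared_ylims` and the `headroom` parameter are assumptions for illustration and are not part of the original notebook.

```python
import math

import pandas as pd

OUTLIER_LABEL = "Outliers / Unassigned"


def shared_ylims(daily_counts, outlier_label=OUTLIER_LABEL, headroom=1.05):
    """Return y-limits for regular panels and for the outlier panel.

    Expects the columns 'topic_label' and 'count', as in daily_counts above.
    """
    is_outlier = daily_counts["topic_label"] == outlier_label

    def upper(counts):
        # Fall back to 1 so an empty group still yields a valid axis range
        if counts.empty:
            return 1
        return math.ceil(counts.max() * headroom)

    return {
        "regular": (0, upper(daily_counts.loc[~is_outlier, "count"])),
        "outlier": (0, upper(daily_counts.loc[is_outlier, "count"])),
    }


# Toy data standing in for daily_counts (values are invented)
toy_counts = pd.DataFrame({
    "topic_label": ["Sports", "Sports", "Politics", OUTLIER_LABEL],
    "count": [120, 180, 90, 366],
})
limits = shared_ylims(toy_counts)
```

Inside the plotting loop, `ax.set_ylim(*limits["outlier"])` or `ax.set_ylim(*limits["regular"])` would then replace the hard-coded numbers.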

## 7.8 Save BERTopic Results

In [None]:
# Save results
output_file = "articles_with_bertopic.csv"
df_clean.to_csv(output_file, index=False, encoding='utf-8')
print(f"✓ Saved all articles with BERTopic topics to: {output_file}")

# Save model (optional - can be large)
# topic_model.save("bertopic_model")
# print("✓ Saved BERTopic model to: bertopic_model/")

print("\n" + "="*80)
print("BERTOPIC ANALYSIS COMPLETE")
print("="*80)
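
CSV does not store column types, so `releaseDate` comes back as plain strings when the file is read again. The sketch below shows one way to restore the datetime type on load. It uses a small toy frame and an in-memory buffer in place of `df_clean` and the real file, so the data values are invented.

```python
import io

import pandas as pd

# Toy frame standing in for df_clean (column names follow the notebook)
toy = pd.DataFrame({
    "releaseDate": pd.to_datetime(["2025-10-27 08:00", "2025-10-28 09:30"]),
    "topic_label": ["Sports", "Politics"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the column back to datetime64 while reading
restored = pd.read_csv(buffer, parse_dates=["releaseDate"])
```

For the real export, the same `parse_dates` argument would be passed to `pd.read_csv("articles_with_bertopic.csv", ...)`.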

## 7.9 Compare BERTopic vs K-means

Compare the results from BERTopic (hierarchical, density-based) with K-means (centroid-based) clustering.

In [None]:
print("\n" + "="*80)
print("COMPARISON: BERTopic vs K-means")
print("="*80)

print("\n### K-means Results (Section 4):")
print("- Algorithm: Centroid-based clustering")
print(f"- Number of clusters: {n_clusters} (pre-defined)")
print("- All documents assigned to a cluster: Yes")
print("- Outlier handling: No explicit outlier detection")
print("- Topic representation: Based on most frequent keywords")

# Topic -1 is BERTopic's outlier bucket, so it is excluded from the topic count
n_bertopic_topics = len(set(topics) - {-1})
n_outliers = sum(t == -1 for t in topics)

print("\n### BERTopic Results (Section 7):")
print("- Algorithm: Hierarchical density-based (HDBSCAN)")
print(f"- Number of topics: {n_bertopic_topics} (automatically discovered)")
print(f"- Outliers detected: {n_outliers} documents")
print("- Outlier handling: Explicit outlier topic (-1)")
print("- Topic representation: Using c-TF-IDF")

print("\n### Key Differences:")
print("1. **Flexibility**: BERTopic discovers topics automatically; K-means requires pre-defined k")
print("2. **Outliers**: BERTopic identifies noise/outliers; K-means forces all points into clusters")
print("3. **Embeddings**: In this notebook BERTopic ran on llama-embed-nemotron-8b embeddings; K-means ran on the lighter all-MiniLM-L6-v2")
print("4. **Interpretability**: BERTopic provides c-TF-IDF scores; K-means uses raw keyword frequency")
print("5. **Hierarchy**: BERTopic supports hierarchical topics; K-means is flat")

print("\n### When to Use Each:")
print("- **K-means**: Fast, simple, good for exploration with known number of categories")
print("- **BERTopic**: More sophisticated, handles noise, better for discovery of unknown topics")

print("\n" + "="*80)
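
The comparison above is qualitative. A quantitative check of how far the two label sets agree could look like the sketch below, which builds a contingency table and computes the adjusted Rand index on documents that BERTopic did not mark as outliers. The label lists are toy values; in the notebook they would come from the K-means assignments and from `topics`, and the names `kmeans_labels` and `bertopic_labels` are assumptions.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy labels standing in for the notebook's results (values are invented)
kmeans_labels = [0, 0, 1, 1, 2, 2, 2, 0]
bertopic_labels = [3, 3, 5, 5, -1, 7, 7, 3]

labels = pd.DataFrame({"kmeans": kmeans_labels, "bertopic": bertopic_labels})

# Rows: K-means clusters, columns: BERTopic topics (-1 = outliers)
contingency = pd.crosstab(labels["kmeans"], labels["bertopic"])

# Compare only documents that BERTopic assigned to a real topic
assigned = labels[labels["bertopic"] != -1]
ari = adjusted_rand_score(assigned["kmeans"], assigned["bertopic"])

print(contingency)
print(f"Adjusted Rand index (outliers excluded): {ari:.2f}")
```

The adjusted Rand index is 1 for identical partitions and close to 0 for agreement at chance level. Because the two methods here also use different embedding models, a low score would not isolate the effect of the clustering algorithm alone.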

# 8. Summary

This analysis demonstrated several approaches to understanding the SRG SSR article corpus.

## Key Findings

### 1. Data Acquisition & Processing
- Successfully loaded and processed 1,000 articles from the public sample
- Implemented robust data pipeline with translation capabilities

### 2. Semantic Search (Section 3)
- Implemented efficient semantic search using Sentence Transformers
- Enables journalists to quickly find related articles
- Reduces duplicate story creation across branches

### 3. K-means Clustering (Section 4)
- Identified 10 distinct topic clusters
- Visualized topics in 2D space using UMAP
- Extracted representative keywords for each cluster

### 4. Translation Pipeline (Section 5)
- Successfully translated articles to English
- Demonstrated scalability to multiple languages
- Enables multilingual content availability

### 5. Enhanced Categorization (Section 6)
- Mapped clusters to 10 high-level categories:
  - Politics
  - Sports
  - Economy & Business
  - Science & Technology
  - Health
  - Environment & Climate
  - Culture & Entertainment
  - Society & Education
  - International Affairs
  - Media & Communication

### 6. BERTopic Analysis (Section 7)
- Topic modeling on embeddings from `nvidia/llama-embed-nemotron-8b`
- Discovered 13 distinct topics (plus outliers)
- Provided temporal analysis showing topic evolution
- Better handling of outliers compared to K-means

## Methodological Comparison

| Aspect | K-means | BERTopic |
|--------|---------|----------|
| Clusters | 10 (pre-defined) | 13 (discovered) |
| Outliers | None | Explicitly handled |
| Embedding | all-MiniLM-L6-v2 | llama-embed-nemotron-8b |
| Speed | Fast | Slower (8B-parameter embedding model, density-based clustering) |
| Use Case | Quick exploration | Deep analysis |

## Recommendations

1. **For Journalists**: Use semantic search (Section 3) to quickly find related articles
2. **For Content Strategy**: Use BERTopic results (Section 7) for editorial planning
3. **For Navigation**: Implement enhanced categories (Section 6) for user-facing organization
4. **For Multilingual Content**: Scale the translation pipeline (Section 5) to all 11 languages

## Next Steps

1. Apply to the full dataset (not just the 1,000-article sample)
2. Implement real-time topic tracking
3. Build automated translation for all languages
4. Create interactive dashboard for exploration
5. Integrate with CMS for automatic categorization

# 9. Appendix

## Technical Details

### Models Used
- **Semantic Search & K-means**: `sentence-transformers/all-MiniLM-L6-v2`
- **BERTopic**: `nvidia/llama-embed-nemotron-8b`
- **Translation**: Google Translate via the `googletrans` package (version 4.0.0-rc1, an unofficial client)

### Libraries
- pandas, numpy: Data manipulation
- sentence-transformers: Text embeddings
- scikit-learn: K-means clustering
- umap-learn: Dimensionality reduction
- BERTopic: Advanced topic modeling
- hdbscan: Density-based clustering
- matplotlib, seaborn: Visualization
- transformers: HuggingFace models

### Parameters

**UMAP:**
- n_neighbors: 20
- n_components: 5
- min_dist: 0.0
- metric: cosine

**HDBSCAN:**
- min_cluster_size: 220
- min_samples: 15
- metric: euclidean

**K-means:**
- n_clusters: 10
- random_state: 42
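
As a rough sketch, the settings listed above map onto the estimator constructors as follows. The dictionary names and the `build_models` helper are hypothetical, and the notebook may pass further arguments that are not listed in this appendix. The imports sit inside the function so that the settings can be inspected without loading the heavier dependencies.

```python
# Parameter values copied from the lists above
UMAP_PARAMS = {"n_neighbors": 20, "n_components": 5, "min_dist": 0.0, "metric": "cosine"}
HDBSCAN_PARAMS = {"min_cluster_size": 220, "min_samples": 15, "metric": "euclidean"}
KMEANS_PARAMS = {"n_clusters": 10, "random_state": 42}


def build_models():
    """Instantiate the three estimators with the documented settings."""
    # Local imports: umap-learn, hdbscan and scikit-learn are only needed here
    from hdbscan import HDBSCAN
    from sklearn.cluster import KMeans
    from umap import UMAP

    return {
        "umap": UMAP(**UMAP_PARAMS),
        "hdbscan": HDBSCAN(**HDBSCAN_PARAMS),
        "kmeans": KMeans(**KMEANS_PARAMS),
    }
```

The UMAP and HDBSCAN instances can then be handed to BERTopic via `BERTopic(umap_model=models["umap"], hdbscan_model=models["hdbscan"])`, where `models = build_models()`.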

## Data Access

**Public Sample:**
- URL: https://github.com/Tao-Pi/CAS-Applied-Data-Science/raw/main/Module-3/01_Module%20Final%20Assignment/export_articles_v2_sample25mb.parquet
- Size: <25MB
- Articles: 1,000 (sampled)

**Full Dataset (Confidential):**
- Location: Databricks Delta table
- Access: Requires SRG SSR credentials

## Resources

- [RenkuLab](https://renkulab.io) - Cloud-based Jupyter environment
- [BERTopic Documentation](https://maartengr.github.io/BERTopic/)
- [Sentence Transformers](https://www.sbert.net/)

## Contact

For questions about this analysis or data access, contact the SRG SSR data team.

---

**Analysis Date:** November 2025  
**Notebook Version:** 1.0 (Comprehensive Merged)  
**Authors:** CAS ADS Module 3 Project Team