# Data Acquisition (ETL) - test

**Starting Point:**  
The article data originates from a Kafka datastream. It is not normalized (so it cannot be analyzed directly) and requires Active Directory login (making collaboration difficult).  
- [View Kafka topic data (AKHQ)](https://akhq.pdp.production.admin.srgssr.ch/ui/strimzi/topic/articles-v2/data?sort=NEWEST&partition=All)

**Processing steps:**
1. Read article data from the Delta table populated from Kafka.
2. Flatten and transform nested fields (e.g., titles, resources, contributors) using a SQL view.
3. Create a Spark DataFrame from the flattened view and inspect the results.
4. Write the DataFrame to a Delta table for analytics and automation.
5. Export a <25 MB Parquet sample with only public data for sharing (e.g., via GitHub).

**Goal:**  
The data should be available as a Parquet file for sharing. Since the dataset is large (5 GB), only a public sample is exported for easy distribution.

**Access Control:**  
Data distribution depends on user access rights: entitled users can access the full confidential dataset, while restricted users receive only the public sample. This keeps sensitive data limited to authorized users.

## Step-01: Read from Kafka (SQL)

The following steps read article data from the Delta table `udp_prd_atomic.pdp.articles_v2`, which is assumed to be populated from a Kafka stream. The original Kafka data contains nested lists and complex structures (e.g., for multilingual fields or arrays of resources). In the transformation, the SQL view `articles_flat` flattens this nested data by extracting relevant fields and, where multiple values exist (such as for titles in different languages), selects the first available entry—typically prioritizing German (`'de'`) or otherwise the first value. This process prepares the data, along with Kafka metadata (key, topic, partition, offset, timestamp), for further analysis in a flat, tabular format.
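
The selection rule described above can be sketched in plain Python. This only illustrates the logic and is not the SQL view itself: the entry layout (a list of dicts with `lang` and `value` keys) and the function name `pick_preferred` are assumptions, since the actual Kafka schema is not shown here.

```python
from typing import Optional

def pick_preferred(entries: Optional[list], preferred_lang: str = "de") -> Optional[str]:
    """Return the value in the preferred language, else the first value, else None."""
    if not entries:
        return None
    for entry in entries:
        if entry.get("lang") == preferred_lang:
            return entry.get("value")
    # No entry in the preferred language: fall back to the first one.
    return entries[0].get("value")

# Hypothetical nested title field as it might arrive from the stream.
titles = [
    {"lang": "fr", "value": "Titre"},
    {"lang": "de", "value": "Titel"},
]
print(pick_preferred(titles))      # Titel
print(pick_preferred(titles[:1]))  # Titre
print(pick_preferred([]))          # None
```

Flattening this way keeps one value per field and drops the other language variants, which is the trade-off for a flat, tabular format.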

## Step-02: Create DataFrame and Visually Inspect Results

The code below runs a Spark SQL query against the temporary view `articles_flat`, loads the result into a Spark DataFrame named `df`, and then displays the DataFrame for visual inspection. This step materializes the flattened article data so it can be further processed or written to a Delta table.
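
A minimal sketch of this step, assuming a Databricks notebook where `spark` and `display` are predefined. The helper name `load_flat_articles` is hypothetical, and the exact query used in the original notebook is not shown here.

```python
def load_flat_articles(spark, view_name: str = "articles_flat"):
    """Query the temporary view and return the result as a Spark DataFrame."""
    return spark.sql(f"SELECT * FROM {view_name}")

# In a Databricks notebook, where `spark` and `display` are predefined:
# df = load_flat_articles(spark)
# display(df)
```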

## Step-03: Save Data to Delta Table

In the next steps, the data is saved both to a Delta table (for automation and analytics) and to a Parquet file (for sharing).

### Write to (private) Delta Table - all articles

The following code appends the transformed DataFrame `df` to the Delta table `swi_audience_prd.pdp_articles_v2.articles_v2`. It writes in **append** mode, uses the **Delta** format, and enables **schema merging** so that any new columns are automatically added to the target table without overwriting existing data.

- Contains all articles (**confidential**)
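
The write described above could look roughly like the following sketch. The wrapper function `append_to_delta` is a hypothetical name. The writer calls (`format`, `mode`, `option("mergeSchema", ...)`, `saveAsTable`) are the standard Spark DataFrameWriter and Delta options, and running it requires a Spark session with Delta support.

```python
def append_to_delta(df, table_name: str) -> None:
    """Append a Spark DataFrame to a Delta table, adding new columns if needed."""
    (
        df.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")  # schema merging: new columns are added to the target
        .saveAsTable(table_name)
    )

# Usage in the notebook (requires a Spark session with Delta support):
# append_to_delta(df, "swi_audience_prd.pdp_articles_v2.articles_v2")
```

Because the mode is append, running the cell twice on the same input writes the same rows twice, so some form of deduplication may be needed downstream.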


### Write to (public) Parquet File - selected articles, manually upload to GitHub

Export a <25 MB sample of the data with only public data as a Parquet file for easy sharing via GitHub.  
**Note:** The Parquet file must be manually downloaded from Databricks and then uploaded to your GitHub repository.
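
To stay below the size limit, the share of rows to export can be estimated from the two sizes mentioned above. This is a rough sketch under the assumption that file size scales about linearly with row count; the helper `sample_fraction` and the 0.8 safety margin are illustrative choices, not part of the original notebook.

```python
def sample_fraction(full_size_bytes: int, target_size_bytes: int, safety: float = 0.8) -> float:
    """Fraction of rows to keep so the export stays below the target size.

    Assumes file size scales roughly linearly with row count, which is only an
    approximation for compressed Parquet files.
    """
    if full_size_bytes <= 0:
        raise ValueError("full_size_bytes must be positive")
    return min(1.0, safety * target_size_bytes / full_size_bytes)

GB = 1000 ** 3
MB = 1000 ** 2
fraction = sample_fraction(full_size_bytes=5 * GB, target_size_bytes=25 * MB)
print(f"{fraction:.4f}")  # 0.0040
```

The resulting fraction could then be passed to Spark's `DataFrame.sample(fraction=...)`. The size of the written file should still be checked, because compression ratios vary between samples.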

After the export, complete the following steps manually:

1. **Open the Parquet file in Databricks:**
   - Navigate to the [Databricks Volume Browser](https://adb-4119964566130471.11.azuredatabricks.net/explore/data/volumes/swi_audience_prd/pdp_articles_v2/pdp_articles_v2_volume?o=4119964566130471).

2. **Download the file:**
   - Right-click on `export_articles_v2_sample25mb.parquet` and select **"Download"** to save the file to your local machine.

3. **Upload the file to GitHub:**
   - Go to [GitHub Folder](https://github.com/Tao-Pi/CAS-Applied-Data-Science/tree/main/Module-3/01_Module%20Final%20Assignment).
   - Click **"Add file"** > **"Upload files"**.
   - Drag and drop `export_articles_v2_sample25mb.parquet` or use the file picker to select it.
   - Commit the changes to upload the file.


## Step-04: Load Data Based on User Rights

The next step is to load the data, with access determined by user rights:

- **Restricted users** can load only the public data sample (e.g., the Parquet file exported for sharing).
- **Entitled users** can load the full, confidential dataset from the Delta table.

This ensures that sensitive information is only accessible to authorized users, while still allowing broader access to public data for collaboration and analysis.
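
The branching described above can be sketched as a small helper. The function name `load_article_corpus` and the injected loader callables are assumptions made for illustration. In the notebook, the loaders would wrap the Delta table read and the public Parquet download respectively.

```python
from typing import Any, Callable

def load_article_corpus(
    has_full_access: bool,
    load_full: Callable[[], Any],
    load_sample: Callable[[], Any],
) -> Any:
    """Return the full dataset for entitled users, otherwise the public sample."""
    return load_full() if has_full_access else load_sample()

# Stand-in loaders so the sketch runs anywhere; in the notebook these would wrap
# the Delta table read and the public Parquet download respectively.
corpus = load_article_corpus(
    has_full_access=False,
    load_full=lambda: "full confidential dataset",
    load_sample=lambda: "public sample",
)
print(corpus)  # public sample
```

A flag like this only selects the code path. The actual protection has to come from the permissions on the confidential table itself.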

In [3]:
%pip install pandas
%pip install pyarrow
%pip install fastparquet
#%pip install -U sentence-transformers torch safetensors accelerate



In [4]:
import pandas as pd
url = "https://github.com/Tao-Pi/CAS-Applied-Data-Science/raw/main/Module-3/01_Module%20Final%20Assignment/export_articles_v2_sample25mb.parquet"
srgssr_article_corpus = pd.read_parquet(url, engine="fastparquet")

In [5]:
has_read_access_udp_articles_v2 = False

# Dataset Overview

In this chapter, we provide a brief overview of the dataset used for analysis. We indicate whether the loaded dataset is the full confidential version or the public sample, report the total number of articles available, and present a first look at the articles data to understand its structure and content.


## Step 1: Check Dataset Version (Confidential vs Public)

Here we check and indicate whether the loaded dataset is the full confidential version or the public sample.

In [6]:
def format_rowcount(n):
    # Round down to a human-readable magnitude.
    if n >= 1_000_000:
        return f"more than {n // 1_000_000} million"
    elif n >= 1_000:
        return f"more than {n // 1_000} thousand"
    else:
        return f"{n}"

if has_read_access_udp_articles_v2:
    # Full dataset: assumed to be a Spark DataFrame, where count() returns the number of rows.
    rowcount = srgssr_article_corpus.count()
    print(f"congrats: you have successfully read the full data set. This contains the full corpus of {format_rowcount(rowcount)} Articles published by SRG-SSR as plain text together with some relevant metadata. You can access the dataframe object by calling 'srgssr_article_corpus' from Python now.")
else:
    # Public sample: usually a pandas DataFrame, where len() gives the number of rows.
    if isinstance(srgssr_article_corpus, pd.DataFrame):
        rowcount = len(srgssr_article_corpus)
    else:
        rowcount = srgssr_article_corpus.count()
    print(f"congrats: you have successfully read the publicly available (sampled) data set. This contains an excerpt of {format_rowcount(rowcount)} articles within SRG-SSR as plain text together with some relevant metadata. You can access the dataframe object by calling 'srgssr_article_corpus' from Python now.")

congrats: you have successfully read the publicly available (sampled) data set. This contains an excerpt of more than 11 thousand articles within SRG-SSR as plain text together with some relevant metadata. You can access the dataframe object by calling 'srgssr_article_corpus' from Python now.



## Step 2: Overview of the Data

In this step, we provide an overview of the data contained in the loaded dataset. This includes a summary of the available articles and a first look at their structure and content.

In [7]:
# If the DataFrame is empty, use an empty dict
first_row = srgssr_article_corpus.iloc[0].to_dict() if not srgssr_article_corpus.empty else {}

cols_info = [
    {
        "column": col,
        "type": str(dtype),
        "example": first_row.get(col, None)
    }
    for col, dtype in srgssr_article_corpus.dtypes.items()
]

# Display nicely
import pandas as pd
pd.DataFrame(cols_info).head(20)


| | column | type | example |
|---|---|---|---|
| 0 | id | object | urn:pdp:cue_rsi:article:rsi:cue:story:3238385 |
| 1 | publisher | object | RSI |
| 2 | provenance | object | CUE_RSI |
| 3 | modificationDate | datetime64[ns] | 2025-11-17 15:48:25 |
| 4 | releaseDate | datetime64[ns] | 2025-10-29 15:48:31 |
| 5 | title_auto | object | L’amore oltre le sbarre |
| 6 | lead_auto | object | Veronica Barbato fotografa la vita tormentata ... |
| 7 | kicker_auto | object | Fotografia e carceri |
| 8 | id_urn | object | |
| 9 | id_srg | object | |


## Step 3: A Closer Look

In this step, we take a deeper look at the loaded dataset, exploring its structure and content in more detail.

In [8]:
display(srgssr_article_corpus)

Unnamed: 0,id,publisher,provenance,modificationDate,releaseDate,title_auto,lead_auto,kicker_auto,id_urn,id_srg,picture_url,content_text_csv,contributors_csv,resources_locator_urls_csv,keywords_csv,key,topic,partition,offset,timestamp
0,urn:pdp:cue_rsi:article:rsi:cue:story:3238385,RSI,CUE_RSI,2025-11-17 15:48:25.000,2025-10-29 15:48:31,L’amore oltre le sbarre,Veronica Barbato fotografa la vita tormentata ...,Fotografia e carceri,,,,“Buonanotte” era una trasmissione di una radio...,Author,https://www.rsi.ch/s/3238385,CULTURA,urn:pdp:cue_rsi:article:rsi:cue:story:3238385,articles-v2,6,897052,2025-10-29 15:55:20.290
1,urn:pdp:cue_rsi:article:rsi:cue:story:3242447,RSI,CUE_RSI,2025-10-31 08:00:00.000,2025-10-30 20:32:57,“Dietro le quinte tanto lavoro per strutturars...,"In attesa della licenza, l’ACB chiude all’FCL ...",Calcio Svizzero,,,https://il.rsi.ch/rsi-api/resize/image/v2//WEB...,Sono giorni decisamente importanti per il Bell...,Author,https://www.rsi.ch/s/3242447,SPORT,urn:pdp:cue_rsi:article:rsi:cue:story:3242447,articles-v2,3,902310,2025-10-30 20:34:41.815
2,urn:pdp:cms_swi:article:90255795,SWI,CMS_SWI,2025-10-31 00:01:11.708,2025-10-30 23:56:45,EUA reduzirá bruscamente admissão de refugiado...,,,,,https://www.swissinfo.ch/content/wp-content/up...,EUA reduzirá bruscamente admissão de refugiado...,,https://www.swissinfo.ch/por/eua-reduzir%c3%a1...,"Política,Sociedade,sociedade (geral)",urn:pdp:cms_swi:article:90255795,articles-v2,2,909146,2025-10-31 00:01:14.422
3,urn:pdp:cms_swi:article:90255781,SWI,CMS_SWI,2025-10-30 23:56:11.323,2025-10-30 23:50:50,República Dominicana se solidariza con el pres...,,,,,,República Dominicana se solidariza con el pres...,,https://www.swissinfo.ch/spa/rep%c3%bablica-do...,"Política,tratados y organizaciones,democracia",urn:pdp:cms_swi:article:90255781,articles-v2,5,906285,2025-10-30 23:56:13.949
4,urn:pdp:cms_swi:article:90255780,SWI,CMS_SWI,2025-10-30 23:51:12.935,2025-10-30 23:49:07,La española Telefónica se despide de Ecuador t...,,,,,,La española Telefónica se despide de Ecuador t...,,https://www.swissinfo.ch/spa/la-espa%c3%b1ola-...,"servicio de telecomunicaciones,empresas",urn:pdp:cms_swi:article:90255780,articles-v2,6,897411,2025-10-30 23:51:15.547
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11065,urn:pdp:cms_swi:article:90204318,SWI,CMS_SWI,2025-10-21 21:31:05.000,2025-10-21 21:27:11,Dow Jones cierra al alza con nuevo récord apoy...,,,,,,Dow Jones cierra al alza con nuevo récord apoy...,,https://www.swissinfo.ch/spa/dow-jones-cierra-...,valores,urn:pdp:cms_swi:article:90204318,articles-v2,0,897977,2025-10-21 21:31:07.468
11066,urn:pdp:cms_swi:article:90204319,SWI,CMS_SWI,2025-10-21 21:31:03.932,2025-10-21 21:29:15,Condenan en Colombia a 21 años de prisión a ot...,,,,,,Condenan en Colombia a 21 años de prisión a ot...,,https://www.swissinfo.ch/spa/condenan-en-colom...,"criminalidad,administración de corte,investiga...",urn:pdp:cms_swi:article:90204319,articles-v2,2,906813,2025-10-21 21:31:06.370
11067,urn:pdp:cms_swi:article:90204316,SWI,CMS_SWI,2025-10-21 21:26:09.636,2025-10-21 21:20:17,Costa Rica celebra la elección de la primera m...,,,,,,Costa Rica celebra la elección de la primera m...,,https://www.swissinfo.ch/spa/costa-rica-celebr...,gobierno,urn:pdp:cms_swi:article:90204316,articles-v2,7,900307,2025-10-21 21:26:12.101
11068,urn:pdp:cms_swi:article:90204317,SWI,CMS_SWI,2025-10-21 21:26:07.003,2025-10-21 21:20:56,Netflix aumentó su beneficio neto en un 25 % e...,,,,,,Netflix aumentó su beneficio neto en un 25 % e...,,https://www.swissinfo.ch/spa/netflix-aument%c3...,"internet,economía (general),medios informativo...",urn:pdp:cms_swi:article:90204317,articles-v2,2,906811,2025-10-21 21:26:09.480


# Analyses
This is where analyses are performed. This is work in progress. Some ideas:

**Story 1:** I want to quickly search all existing articles without the need to use Google. I want to do that because when I write a story, I want to make sure the same story was not just written by my colleagues working in a different branch.

**Story 2:** I want to find out what topics SRG writes about – this could, for example, be used for navigation (News / Sport / etc.).

**Story 3:** I want to translate all existing articles into all languages used. This way, I can multiply the offer easily. Instead of having some articles in French and some in English, I will have all articles available in all of our 11 languages.

*List other ideas here...*

In [9]:
srgssr_article_corpus = srgssr_article_corpus.head(1000)

## USE CASE: Quickly Search All Existing Articles

I want to quickly search all existing articles without the need to use Google. I want to do that because when I write a story, I want to make sure the same story was not just written by my colleagues working in a different branch.

**Approach:**
- Implement a semantic search feature within Databricks that allows users to search articles by keywords, phrases, or topics.
- Use text embeddings (e.g., with Sentence Transformers) to represent article content and enable similarity-based search.
- Provide a simple search interface where users can enter queries and retrieve the most relevant articles.
- Optionally, add filters for date, author, or branch to refine search results.

In [10]:
# If needed (run once per environment):
# %pip install -U sentence-transformers torch safetensors accelerate

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

# ----- 1) Prepare data (pandas) -----
TEXT_COL = "content_text_csv"
ID_COL = "id"

df = srgssr_article_corpus.copy()
df[TEXT_COL] = df[TEXT_COL].fillna("").astype(str)

# (Optional) downsample for quick prototyping
# df = df.head(1000)

# ----- 2) Load the embedder (cached) -----
_model = None
def get_embedder():
    global _model
    if _model is None:
        _model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return _model

# ----- 3) Compute corpus embeddings (NumPy) -----
model = get_embedder()
# normalize_embeddings=True gives unit vectors → cosine = dot product
emb_matrix = model.encode(
    df[TEXT_COL].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,
)

# Keep references for lookup
ids = df[ID_COL].tolist()
texts = df[TEXT_COL].tolist()

# ----- 4) Semantic search (cosine similarity) -----
def semantic_search(query: str, top_k: int = 10) -> pd.DataFrame:
    q = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)[0]  # shape (d,)
    sims = emb_matrix @ q  # cosine similarity because both are normalized
    # Argpartition for speed, then sort exact top_k
    top_idx = np.argpartition(-sims, kth=min(top_k, len(sims)-1))[:top_k]
    top_idx = top_idx[np.argsort(-sims[top_idx])]

    return pd.DataFrame({
        "id": [ids[i] for i in top_idx],
        "content_text_csv": [texts[i] for i in top_idx],
        "similarity": [float(sims[i]) for i in top_idx],
    })

# ----- 5) Example -----
results = semantic_search("climate change", top_k=10)
results.head(10)

ModuleNotFoundError: No module named 'sentence_transformers'
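
The ranking step of the search cell above can be checked without the embedding model. The following dependency-free sketch uses made-up 3-dimensional vectors in place of real embeddings and shows that, after normalizing to unit length, the dot product equals the cosine similarity. The helper names are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query, corpus, k=2):
    """Rank corpus vectors by cosine similarity to the query (highest first)."""
    q = normalize(query)
    scored = [(dot(normalize(vec), q), doc_id) for doc_id, vec in corpus.items()]
    return sorted(scored, reverse=True)[:k]

# Made-up 3-dimensional "embeddings"; real ones would come from the model.
toy_corpus = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
ranking = top_k([1.0, 0.05, 0.0], toy_corpus, k=2)
print([doc_id for _, doc_id in ranking])  # ['a', 'b']
```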


## USE CASE: Find Out What Topics SRG Writes About

**Approach:**  
- Read the text from the `content_text_csv` column of the articles.
- Compute similarity between article contents by embedding the texts, then group similar articles with a clustering method such as K-means.
- Identify clusters of similar content to reveal common topics or themes.
- Use these clusters to enhance navigation and filtering options for users.

In [None]:
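# For quick prototyping, we sample only 1000 articles here.
# To run on the full dataset, replace the line below with: df_sample = srgssr_article_corpus
# Note: .limit() is Spark API, so this cell assumes srgssr_article_corpus is a Spark DataFrame,
# not the pandas DataFrame loaded from the public sample.
df_sample = srgssr_article_corpus.limit(1000)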

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.clustering import KMeans
import pandas as pd

# Load model once per executor for efficiency
from sentence_transformers import SentenceTransformer

def get_embedder():
    if not hasattr(get_embedder, "model"):
        get_embedder.model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return get_embedder.model

@pandas_udf(ArrayType(FloatType()))
def embed_udf(texts: pd.Series) -> pd.Series:
    model = get_embedder()
    return pd.Series(model.encode(texts.tolist(), show_progress_bar=False, batch_size=64).tolist())

# Add embeddings column (no translation)
df_emb = df_sample.withColumn(
    "content_emb", embed_udf(col("content_text_csv"))
)

# Convert array to Spark Vector (required by ML algorithms).
# array_to_vector (Spark >= 3.1) is used because a pandas_udf cannot return VectorUDT.
from pyspark.ml.functions import array_to_vector
df_vec = df_emb.withColumn("features", array_to_vector(col("content_emb")))

# KMeans clustering (choose number of clusters, e.g., 10)
k = 10
kmeans = KMeans(featuresCol="features", predictionCol="cluster", k=k, seed=42)
model = kmeans.fit(df_vec)

# Assign cluster IDs
df_clustered = model.transform(df_vec).select("id", "cluster", "content_text_csv")

# Show a sample of clustered articles
display(df_clustered.orderBy("cluster"))
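
To turn cluster IDs into readable topics, one option is to count the most frequent keywords per cluster. The sketch below is plain Python over made-up rows. It assumes the `keywords_csv` column is carried along with the cluster assignment (the select above would need to include it) and that keywords are separated by simple commas.

```python
from collections import Counter, defaultdict

def top_keywords_per_cluster(rows, n=3):
    """rows: iterable of (cluster_id, keywords_csv). Returns {cluster_id: [(keyword, count), ...]}."""
    counters = defaultdict(Counter)
    for cluster_id, keywords_csv in rows:
        if not keywords_csv:
            continue
        for kw in keywords_csv.split(","):
            kw = kw.strip().lower()
            if kw:
                counters[cluster_id][kw] += 1
    return {cid: counter.most_common(n) for cid, counter in counters.items()}

# Made-up rows; in the notebook these could be collected from a clustered
# DataFrame that also carries the `keywords_csv` column.
rows = [
    (0, "SPORT,Calcio"),
    (0, "sport"),
    (1, "Política,democracia"),
    (1, "política"),
    (1, None),
]
print(top_keywords_per_cluster(rows, n=1))
# {0: [('sport', 2)], 1: [('política', 2)]}
```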


## USE CASE: Translate All Existing Articles into All Languages Used

Goal: Automatically translate every article into all supported languages, so each article is available in every language used by SRG.

**Approach:**
- For each article, use the `ai_translate` function to generate translations for all target languages.
- Store the translated articles alongside the originals for easy access and analytics.

**Example (SQL):**
```sql
SELECT
  *,
  ai_translate(title, 'fr') AS title_fr,
  ai_translate(title, 'it') AS title_it,
  ai_translate(title, 'en') AS title_en,
  ai_translate(title, 'rm') AS title_rm
FROM articles_flat
```

*(Repeat for all relevant text fields and all required languages.)*
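
Writing out every field and language by hand gets repetitive, so the SELECT list could be generated instead. This is a sketch only: the helper `build_translation_select` is hypothetical, and field names other than `title` are assumptions about the view's columns. The inputs are interpolated directly into the SQL string, so they should come from trusted, hard-coded lists.

```python
def build_translation_select(fields, languages, source="articles_flat"):
    """Build a SELECT that adds one ai_translate column per field and language."""
    columns = [
        f"  ai_translate({field}, '{lang}') AS {field}_{lang}"
        for field in fields
        for lang in languages
    ]
    if not columns:
        raise ValueError("at least one field and one language are required")
    return "SELECT\n  *,\n" + ",\n".join(columns) + f"\nFROM {source}"

# Field names other than `title` are assumptions about the view's columns.
sql = build_translation_select(["title", "lead"], ["fr", "it"])
print(sql)
```

In a Databricks notebook the generated statement could then be run with `spark.sql(sql)`.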

In [None]:
# Assume ai_translate(text: str, target_lang: str) -> str is defined and available

from pyspark.sql.functions import col, struct, udf
from pyspark.sql.types import StringType

target_languages = ["en", "fr", "it", "de"]  # Example target languages

def make_translate_udf(lang):
    @udf(StringType())
    def translate_udf(text):
        return ai_translate(text, lang)
    return translate_udf

df = srgssr_article_corpus

for lang in target_languages:
    df = df.withColumn(f"content_text_csv_{lang}", make_translate_udf(lang)(col("content_text_csv")))

# Store the DataFrame with translations alongside originals
df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("swi_audience_prd.pdp_articles_v2.articles_v2_translated")

display(df)

In [None]:
# If not yet installed and restarted:
# %pip install -U transformers sentencepiece torch safetensors accelerate
# dbutils.library.restartPython()

from transformers import pipeline
import pandas as pd

# 1) Fetch 20 rows (only the required columns)
pdf = (
    srgssr_article_corpus
    .select("id", "content_text_csv")
    .limit(20)
    .toPandas()
)

# 2) Load the pipeline locally (CPU)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-mul-en", device=-1)

# 3) Translate (batch per text chunk; kept simple here, without chunking)
def translate_text(t):
    if t is None or not str(t).strip():
        return None
    res = translator(t, truncation=True, max_length=1000)
    return res[0]["translation_text"]

pdf["content_text_csv_english"] = [translate_text(t) for t in pdf["content_text_csv"]]

# 4) Back to Spark (for display only)
df_test = spark.createDataFrame(pdf)
display(df_test)
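
The cell above truncates long articles instead of chunking them, so text beyond the limit is not translated. A simple chunking helper could look like the sketch below. The function `chunk_text` is hypothetical, and the character limit is only a rough stand-in for the model's token limit.

```python
import re

def chunk_text(text: str, max_chars: int = 1000) -> list:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

sample = "First sentence. Second sentence! Third one?"
print(chunk_text(sample, max_chars=32))
# ['First sentence. Second sentence!', 'Third one?']
```

Each chunk could then be translated separately and the results joined with spaces, at the cost of losing some context across chunk boundaries.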

In [None]:
%skip
# Skipped cell: assumes `from pyspark.sql import functions as F` and a
# `translate_series_to_en` UDF, neither of which is defined in this notebook.
# 🧪 Take only 20 rows
df_sample = srgssr_article_corpus.limit(20)

# 🔄 Add the translation column
df_test = df_sample.withColumn(
    "content_text_csv_english",
    translate_series_to_en(F.col("content_text_csv"))
)

# 👀 Show the result
df_test.select("id", "content_text_csv", "content_text_csv_english").display()

# Hints and Notes

> **Hint:**  
> For a temporary Jupyter environment to experiment or explore data, consider using [RenkuLab](https://renkulab.io/p/snsf-anoxia-project/proxy-proxy/sessions/01JX2TG1RZ9J0PQ53H3RT81BD4/start). RenkuLab offers cloud-based notebook sessions—no local setup required.

> **Note:**  
> RenkuLab sessions may require authentication and have limited resources. Save your work frequently, as sessions can time out or be terminated after inactivity.