## Context and Starting Point

The purpose of this work is to build a scalable pipeline that simplifies multilingual publishing across SRG SSR entities.

Within SRG SSR, several initiatives already address related use cases. This work builds on these existing foundations.

One key initiative is the **Publication Data Platform (PDP)**.  
The PDP aggregates all published articles from SRG SSR and its business units:

- Swissinfo  
- SRF  
- RTS  
- RSI  
- RTR  

All published content is consolidated into a centralized Kafka feed.

- [View Kafka topic data (AKHQ)](https://akhq.pdp.production.admin.srgssr.ch/ui/strimzi/topic/articles-v2/data?sort=NEWEST&partition=All)

---

## Infrastructure

A Databricks infrastructure is in place and operated within SRG SSR.

Workspace:
https://adb-4119964566130471.11.azuredatabricks.net/

The Kafka feed is ingested into Databricks and is available as the following Delta table:

`udp_prd_modeled.pdp.articles_v2`

This table represents the **Bronze layer**, meaning the raw modeled Kafka data.

However, the structure still reflects the original Kafka schema and is therefore:

- Nested (JSON structure)
- Not normalized
- Not directly suited for analytics or downstream automation

The data represents raw publication events and requires structural transformation before it can be used in analytical or AI-driven workflows.
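
To make "nested" concrete, the sketch below flattens a single event into one level of dotted column names. It is plain Python for illustration only, not the notebook's Spark code, and the event and its field names (`publisher`, `titles`, and so on) are assumptions rather than the real `articles_v2` schema.

```python
from __future__ import annotations

from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level with dotted keys; lists are left untouched."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical event: field names are assumed, not taken from the real schema.
event = {
    "id": "a1",
    "publisher": {"name": "SRF", "language": "de"},
    "titles": [{"kind": "main", "text": "Beispiel"}],
}

flat_event = flatten(event)
print(flat_event)
```

Nested objects become ordinary columns such as `publisher.name`, while arrays such as `titles` remain nested. Arrays need a separate step that turns each element into its own row, which is why the data cannot be queried relationally as it stands.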

Access to this infrastructure and its datasets is available only to entitled SRG staff.

---

## Purpose of this Notebook (Silver Layer)

This notebook represents the **Silver layer** of the pipeline.

Its goal is to:

1. Read the raw PDP article data from `udp_prd_modeled.pdp.articles_v2`.
2. Explode nested JSON structures.
3. Separate complex hierarchical structures into structured tables.
4. Persist the results as Delta tables that can be queried relationally by downstream processing.

At this stage:

- No AI logic is implemented.
- No content generation is performed.
- Only structural normalization is applied.

The result is a clean relational foundation for further transformations in Gold or AI application layers.

---

## Processing Steps

1. Read article data from `udp_prd_modeled.pdp.articles_v2`.
2. Flatten and explode nested fields (e.g., titles, resources, contributors).
3. Create structured Spark DataFrames.
4. Persist the transformed data as Delta tables.
5. Prepare relational access patterns for downstream usage.
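
Steps 2, 3, and 5 can be sketched on in-memory data. The example is plain Python standing in for the Spark transformations, and every field name (`id`, `titles`, `contributors`, `article_id`, `position`) is an assumption made for illustration. It splits each nested array into a child table whose rows carry the parent article's identifier, so the tables can be joined again later.

```python
from __future__ import annotations

from typing import Any, Iterable, Iterator


def explode(
    records: Iterable[dict[str, Any]],
    array_field: str,
    id_field: str = "id",
) -> Iterator[dict[str, Any]]:
    """Yield one child row per array element, tagged with the parent id and position."""
    for record in records:
        # A missing, null, or empty array yields no child rows for that parent.
        for position, item in enumerate(record.get(array_field) or []):
            yield {"article_id": record[id_field], "position": position, **item}


# Hypothetical articles: the structure is assumed, not the real articles_v2 schema.
articles = [
    {
        "id": "a1",
        "titles": [{"lang": "de", "text": "Titel"}, {"lang": "fr", "text": "Titre"}],
        "contributors": [{"name": "A. Example", "role": "author"}],
    },
    {"id": "a2", "titles": [{"lang": "it", "text": "Titolo"}], "contributors": []},
]

# One parent table plus one child table per nested array.
article_rows = [{"id": article["id"]} for article in articles]
title_rows = list(explode(articles, "titles"))
contributor_rows = list(explode(articles, "contributors"))

# Relational access: look up all titles of one article through the foreign key.
titles_of_a1 = [row["text"] for row in title_rows if row["article_id"] == "a1"]
print(titles_of_a1)
```

In Spark, `pyspark.sql.functions.explode` plays the same role and likewise produces no rows for empty or null arrays; `explode_outer` keeps such parents with a null child, and `posexplode` also returns the element position. Which variant fits depends on whether articles without titles or contributors must still appear in the child tables.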

---

## Internal Infrastructure Notice

This notebook is developed and executed within the secured SRG SSR Databricks environment.

- The dataset may contain confidential publication data.
- Access is restricted to entitled SRG staff.
- Confidential datasets cannot be exported.
- Code execution is limited to the Databricks (Spark) environment.
- The infrastructure is not accessible to external users.

This ensures compliance with SRG governance, data protection, and infrastructure policies.

---

## Note on AI Assistance

Parts of the transformation logic in this notebook were developed with AI assistance to accelerate structural exploration and flattening of complex nested JSON schemas.

In [0]:
# Load the Bronze-layer article table.
df = spark.table("udp_prd_modeled.pdp.articles_v2")
# display(df)
# Note: some of the required metadata is currently missing from this table.

Direct stream: https://akhq.pdp.production.admin.srgssr.ch/ui/strimzi/topic/articles-v2/data?sort=NEWEST&partition=All

In [0]:
# Attempt to load the article metadata table from the atomic catalog.
df = spark.table("udp_prd_atomic.pdp.article_metadata_kafka")
display(df)
# Does not work yet: to be requested from the database owner, Keller, Pascal (SRF).