# **CONTENTDATA COLLECTION NOTEBOOK**

## Objectives

* Use inputs/metadata/gazette_metadata_jupyter.csv to fetch further information for each publication.

## Inputs

* inputs/metadata/gazette_metadata_jupyter.csv

## Outputs

* The output data is a csv file called inputs/contentdata/gazette_contentdata_jupyter.csv

## Additional Comments

* ...


---

# Install Python packages in the notebook

In [1]:
%pip install requests xmltodict psycopg2-binary pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


---

# Change working directory

* The notebooks are stored in a subfolder, so we change the working directory from the current folder to its parent folder

* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/GazetteAnaliticsTools/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/GazetteAnaliticsTools'
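
Re-running the `os.chdir` cell moves one level further up each time, which would break the relative `inputs/...` paths used later. A minimal sketch of a guard, assuming the notebooks always live in a folder named `jupyter_notebooks` (taken from the path shown above); `project_root` is a hypothetical helper name, not part of this notebook:

```python
import os

def project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent directory only when we are still inside the notebooks folder."""
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebooks_folder:
        return os.path.dirname(normalized)
    # Already at the project root (or somewhere else): leave the path unchanged
    return current_dir

# Safe to run repeatedly: once we are out of the notebooks folder this is a no-op
os.chdir(project_root(os.getcwd()))
```

With this guard the cell can be executed any number of times without drifting above the project folder.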

---

# Fetch data from the Gazette API

### Some imports:
* os: for reading environment variables (e.g. your PG_DSN)
* datetime: to compute today’s date
* uuid: to generate UUIDs for new records
* json: to serialize raw metadata into JSONB
* requests: to call the Gazette API
* xmltodict: to convert the XML response into Python dicts
* psycopg2 / psycopg2.extras.execute_values: to connect and bulk-upsert into PostgreSQL

In [4]:
# 1. Standard library
import os, datetime, uuid, json
# 2. HTTP + XML parsing
import requests, xmltodict
# 3. PostgreSQL driver
import psycopg2
from psycopg2.extras import execute_values
# 4. pandas for data analysis
import pandas as pd
# 5. math, used to split the references into batches
import math

### Load today's metadata CSV

In [5]:
# Load today's metadata
df_meta = pd.read_csv("inputs/metadata/gazette_metadata_jupyter.csv")
df_meta.head(3)

Unnamed: 0,ref,schemaLocation,id,subRubric,publicationDate,legalRemedy,title_en,entryType
0,https://amtsblattportal.ch/api/v1/publications...,https://amtsblattportal.ch/api/v1/schemas/shab...,4f614e43-c3e7-4df5-805f-4574ae3eabdc,HR03,2025-01-15,,Deletion Société coopérative de la Guinguette ...,Deletion
1,https://amtsblattportal.ch/api/v1/publications...,https://amtsblattportal.ch/api/v1/schemas/shab...,e8608e46-d33e-4224-803f-d5d0cb6136bc,HR03,2025-01-15,,"Deletion DENTAMINA AG in Liquidation, Bad Ragaz",Deletion
2,https://amtsblattportal.ch/api/v1/publications...,https://amtsblattportal.ch/api/v1/schemas/shab...,76a76b6a-7cc8-4378-b9f4-4c444315c931,HR03,2025-01-15,,"Deletion schwarz logistics engineering, Reinac...",Deletion


### Fetch all ref.xml for each metadata entry & Transform into flat rows

In [6]:
#1. Paths
input_path  = "inputs/metadata/gazette_metadata_jupyter.csv"

#2. Collect the unique reference URLs from the metadata loaded above
references = df_meta["ref"].dropna().unique().tolist()

#3. Divide into 2 batches
n = len(references)
half = math.ceil(n / 2)

first_batch  = references[:half]
second_batch = references[half:]

#4. Define the mapping for subRubric -> entryType
rubric_map = {
    "HR01": "New entries",
    "HR02": "Change",
    "HR03": "Deletion"
}

#5. Build the rows, and replace subRubric with entryType
def process_batch(batch):
    rows = []
    for ref_url in batch:
        # Fetch & parse XML
        resp = requests.get(ref_url, timeout=30)  # timeout (assumed value) so a stalled request cannot hang the loop
        resp.raise_for_status()
        data = xmltodict.parse(resp.text)
        # Find the <publication> root key regardless of its namespace prefix, so the code does not break if <HR01:publication ...> changes
        root_key = next((k for k in data if k.endswith("publication")), None)
        pub      = data[root_key] if root_key else {}
        
        meta    = pub.get("meta", {})
        content = pub.get("content", {})
        sub     = meta.get("subRubric", "")

        # Build the row
        row = {
            # -- meta fields --
            "id":               meta.get("id", "no data"),
            "entryType":        rubric_map.get(sub, "no data"),
            "language":         meta.get("language", "no data"),
            "publicationDate":  meta.get("publicationDate", "no data"),
            "legalRemedy":      meta.get("legalRemedy", "no data"),
            "cantons":      meta.get("cantons", "no data"),
            "title_en":         meta.get("title", {}).get("en", "no data"),
            # -- content fields --
            "journal_date":     content.get("journalDate", "no data"),
            "publication_text": content.get("publicationText", "no data"),
        }
        
        # -- company: commonsActual for deletions (HR03), commonsNew otherwise --
        commons = content.get("commonsActual", {}) if sub == "HR03" else content.get("commonsNew", {})
        comp    = commons.get("company", {})
        addr    = comp.get("address", {})
        row.update({
            "company_name":             comp.get("name", "no data"),
            "company_uid":              comp.get("uid", "no data"),
            "company_code13":           comp.get("code13", "no data"),
            "company_seat":             comp.get("seat", "no data"),
            "company_legalForm":        comp.get("legalForm", "no data"),
            "company_street_and_number": f"{addr.get('street','no data')} {addr.get('houseNumber','no data')}",
            "company_zip_and_town":      f"{addr.get('swissZipCode','no data')} {addr.get('town','no data')}",
            "company_purpose":          commons.get("purpose", "no data"),
        })
        
        # -- capital & revision --
        cap     = commons.get("capital", {})
        revision= commons.get("revision", {})
        row.update({
            "company_capital_nominal": cap.get("nominal", "no data"),
            "company_capital_paid":    cap.get("paid",    "no data"),
            "company_optingout":       revision.get("optingOut", "no data"),
        })
        
        # -- deletion date from transaction.delete --
        delete  = content.get("transaction", {}).get("delete", {})
        row["company_deletiondate"] = delete.get("deletionDate", "no data")
        
        rows.append(row)
    return rows
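
One caveat with the chained `.get(..., {})` calls in `process_batch`: `dict.get` only falls back to its default when the key is absent. As far as I know, `xmltodict` maps an empty element such as `<title/>` to `None`, in which case the key exists and the next `.get` would raise `AttributeError`. A minimal sketch of a safer lookup, using plain dicts in place of a parsed response; `dig` is a hypothetical helper, not part of this notebook:

```python
def dig(node, *keys, default="no data"):
    """Walk nested dicts key by key; return default if any level is missing or None."""
    for key in keys:
        if not isinstance(node, dict):
            return default
        node = node.get(key)
    return default if node is None else node

# Plain dicts standing in for a parsed publication (illustrative values only)
pub = {"meta": {"title": None, "language": "de"}, "content": {}}

language = dig(pub, "meta", "language")                     # key present
title_en = dig(pub, "meta", "title", "en")                  # "title" exists but is None
uid = dig(pub, "content", "commonsNew", "company", "uid")   # whole branch missing
```

Here `language` is read normally, while `title_en` and `uid` both fall back to the `"no data"` placeholder instead of raising.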

First batch

In [7]:
# First run
rows1 = process_batch(first_batch)
# e.g. save rows1 to CSV or accumulate
print(f"Processed first batch ({len(first_batch)} refs)")

Processed first batch (637 refs)
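
The two-way split generalises to any number of batches, which may help if the reference list grows. A sketch only; `split_batches` is a hypothetical helper and the placeholder references are made up:

```python
import math

def split_batches(items, n_batches):
    """Split items into at most n_batches consecutive slices of near-equal size."""
    if not items:
        return []
    size = math.ceil(len(items) / n_batches)
    return [items[i:i + size] for i in range(0, len(items), size)]

# 1273 placeholder refs, the same total as the two batches processed in this notebook
refs = [f"ref-{i}" for i in range(1273)]
batches = split_batches(refs, 2)
sizes = [len(b) for b in batches]
```

With two batches this reproduces the `math.ceil(n / 2)` split used above, so `sizes` comes out as 637 and 636.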


### Save inside new .csv

In [8]:
#1. Where to write
output_path = "inputs/contentdata/gazette_contentdata_jupyter.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)

In [9]:
#2. Build a DataFrame from the first batch (the CSV itself is written further down)
df_content = pd.DataFrame(rows1)

---

# Inspect content data


In [10]:
df_content.head(10)

Unnamed: 0,id,entryType,language,publicationDate,legalRemedy,cantons,title_en,journal_date,publication_text,company_name,...,company_code13,company_seat,company_legalForm,company_street_and_number,company_zip_and_town,company_purpose,company_capital_nominal,company_capital_paid,company_optingout,company_deletiondate
0,4f614e43-c3e7-4df5-805f-4574ae3eabdc,Deletion,fr,2025-01-15,no data,JU,Deletion Société coopérative de la Guinguette ...,2025-01-10,Société coopérative de la Guinguette en liquid...,Société coopérative de la Guinguette en liquid...,...,CH67050089076,Delémont,108,Route de Bâle 10,2800 Delémont,Soutenir et protéger ses membres; coordonner e...,no data,no data,no data,2025-01-10
1,e8608e46-d33e-4224-803f-d5d0cb6136bc,Deletion,de,2025-01-15,no data,SG,"Deletion DENTAMINA AG in Liquidation, Bad Ragaz",2025-01-10,"DENTAMINA AG in Liquidation, in Bad Ragaz, CHE...",DENTAMINA AG in Liquidation,...,CH32030708454,Bad Ragaz,106,Marausstrasse 3,7310 Bad Ragaz,Führung einer Zahnarztpraxis und Erbringung vo...,100000.00,100000.00,no data,2025-01-10
2,76a76b6a-7cc8-4378-b9f4-4c444315c931,Deletion,de,2025-01-15,no data,AG,"Deletion schwarz logistics engineering, Reinac...",2025-01-10,"schwarz logistics engineering, in Reinach (AG)...",schwarz logistics engineering,...,CH40010362385,Reinach (AG),101,Tannenrain 9,5734 Reinach AG,"Beratung auf dem Gebiet der Logistik, Planung ...",no data,no data,no data,2025-01-10
3,d8b04ace-8253-44e9-aae2-53434adbb228,Deletion,de,2025-01-15,no data,SH,"Deletion Wipf & Co. Immobilien, Lohn (SH)",2025-01-10,"Wipf & Co. Immobilien, in Lohn (SH), CHE-101.3...",Wipf & Co. Immobilien,...,CH29020046673,Lohn (SH),103,Blattenacker 1,8235 Lohn SH,Verwaltung von Immobilien.,no data,no data,no data,2025-01-10
4,4fe15a9e-237b-487a-9334-fd0fc248602e,Deletion,de,2025-01-15,no data,AG,"Deletion Freuler Haustechnik, Oberwil-Lieli",2025-01-10,"Freuler Haustechnik, in Oberwil-Lieli, CHE-136...",Freuler Haustechnik,...,CH40016081529,Oberwil-Lieli,101,Lettenstrasse 29,8966 Oberwil-Lieli,Ausführung von Sanitärarbeiten.,no data,no data,no data,2025-01-10
5,c403d707-6d39-4e86-8796-9ac2a4827adc,Deletion,de,2025-01-15,no data,SG,"Deletion Bourbon-Baron, Lübberstedt, Oberriet ...",2025-01-10,"Bourbon-Baron, Lübberstedt, in Oberriet (SG), ...","Bourbon-Baron, Lübberstedt",...,CH32010897475,Oberriet (SG),101,Montlingerstrasse 1,9463 Oberriet SG,"An- und Verkauf von Getränken, vorrangig Spiri...",no data,no data,no data,2025-01-10
6,c865bf87-244a-4664-9385-cd480afead8b,Deletion,de,2025-01-15,no data,AG,"Deletion LS Adventure KLG, Spreitenbach",2025-01-10,"LS Adventure KLG, in Spreitenbach, CHE-162.062...",LS Adventure KLG,...,CH40026072354,Spreitenbach,103,Limmatstrasse 2,8957 Spreitenbach,"Handel mit Waren aller Art, insbesondere mit L...",no data,no data,no data,2025-01-10
7,40ea91e8-6658-4c15-9933-5f614b360899,Deletion,de,2025-01-15,no data,BS,"Deletion Wallprint Momirovic, Basel",2025-01-10,"Wallprint Momirovic, in Basel, CHE-358.767.617...",Wallprint Momirovic,...,CH27010214548,Basel,101,Oetlingerstrasse 40,4057 Basel,Das Einzelunternehmen bezweckt die Erbringung ...,no data,no data,no data,2025-01-10
8,99314240-16ce-43d8-a758-ce620b10f3aa,Deletion,de,2025-01-15,no data,SG,"Deletion Max Kappler, Malerarbeiten, Zuzwil (SG)",2025-01-10,"Max Kappler, Malerarbeiten, in Zuzwil (SG), CH...","Max Kappler, Malerarbeiten",...,CH32010412521,Zuzwil (SG),101,Weierenstrasse 55,9524 Zuzwil SG,Allgemeine Maler- und Tapeziererarbeiten,no data,no data,no data,2025-01-10
9,ecb817fa-8a57-4cb4-abe0-b360f86df91a,Deletion,de,2025-01-15,no data,SG,"Deletion Egli Trade, Lütisburg",2025-01-10,"Egli Trade, in Lütisburg, CHE-114.803.810, Ein...",Egli Trade,...,CH32010650403,Lütisburg,101,Station 8,9601 Lütisburg Station,"Handel mit Waren aller Art, insbesondere Konsu...",no data,no data,no data,2025-01-10


DataFrame Summary

In [11]:
df_content.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   id                         637 non-null    object
 1   entryType                  637 non-null    object
 2   language                   637 non-null    object
 3   publicationDate            637 non-null    object
 4   legalRemedy                637 non-null    object
 5   cantons                    637 non-null    object
 6   title_en                   637 non-null    object
 7   journal_date               637 non-null    object
 8   publication_text           637 non-null    object
 9   company_name               637 non-null    object
 10  company_uid                637 non-null    object
 11  company_code13             637 non-null    object
 12  company_seat               637 non-null    object
 13  company_legalForm          637 non-null    object
 14  company_st

Check for duplicated `id` values: there are none (the result below is empty).

In [12]:
df_content[df_content.duplicated(subset=['id'])]

Unnamed: 0,id,entryType,language,publicationDate,legalRemedy,cantons,title_en,journal_date,publication_text,company_name,...,company_code13,company_seat,company_legalForm,company_street_and_number,company_zip_and_town,company_purpose,company_capital_nominal,company_capital_paid,company_optingout,company_deletiondate


Find missing values

In [13]:
df_content.isna().sum()

id                           0
entryType                    0
language                     0
publicationDate              0
legalRemedy                  0
cantons                      0
title_en                     0
journal_date                 0
publication_text             0
company_name                 0
company_uid                  0
company_code13               0
company_seat                 0
company_legalForm            0
company_street_and_number    0
company_zip_and_town         0
company_purpose              0
company_capital_nominal      0
company_capital_paid         0
company_optingout            0
company_deletiondate         0
dtype: int64
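
The zero counts are expected rather than informative: `process_batch` fills every absent field with the string `"no data"`, so `isna()` has nothing to find. A minimal sketch of counting the placeholder instead, on a small made-up frame (the values are illustrative, not taken from the real data):

```python
import pandas as pd

# Tiny stand-in for df_content; values are made up for illustration
df_demo = pd.DataFrame({
    "id": ["a", "b", "c"],
    "legalRemedy": ["no data", "no data", "text"],
    "company_capital_nominal": ["100000.00", "no data", "no data"],
})

# Count the placeholder string per column
sentinel_counts = (df_demo == "no data").sum()

# Alternatively, convert the placeholder to a real missing value so isna() becomes meaningful
df_nan = df_demo.replace("no data", pd.NA)
nan_counts = df_nan.isna().sum()
```

Note that the combined address columns would read `no data no data` when both parts are absent, so an exact equality test like the one above would miss them.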

Evaluating distribution and shape of a variable with missing data

In [None]:
#1. Columns containing missing data
# missing_data = df_content.columns[df_content.isna().sum() > 0].to_list()
# missing_data

#2. Example of non-missing values in those columns, to help decide how to handle the missing ones
# for col in missing_data:
#     unique_values = df_content[col].dropna().unique()[:5]
#     print(f"{col}: {unique_values}\n")

#3. Gauge the importance of the missing data: filter rows where 'xxx' is missing and display the first 5
# missing_rows = df_content[df_content['legalRemedy'].isna()]
# print(missing_rows.head(5))

#4. Filter rows where 'xxx' is missing and list the unique 'yyy' values
# unique_yyy = df_content[df_content['legalRemedy'].isna()]['subRubric'].unique()
# print(unique_yyy)

['HR03']


---

# Push files to Repo

If the output .csv file already exists, check which `id` values it already contains and append only the rows that are new.

In [14]:
# Ensure id column is string for reliable comparison
df_content["id"] = df_content["id"].astype(str)
new_count = len(df_content)

if os.path.exists(output_path):
    # 1. Load existing IDs
    df_existing = pd.read_csv(output_path, usecols=["id"])
    df_existing["id"] = df_existing["id"].astype(str)
    existing_ids = set(df_existing["id"])
    existing_count = len(existing_ids)

    # 2. Determine which rows are truly new
    mask_new = ~df_content["id"].isin(existing_ids)
    df_to_append = df_content[mask_new]
    appended_count = len(df_to_append)

    # 3. Compute discarded count
    discarded_count = new_count - appended_count

    # 4. Append new rows if any
    if appended_count > 0:
        df_to_append.to_csv(output_path, mode="a", header=False, index=False)

    # 5. Totals after append
    total_after = existing_count + appended_count

    # 6. Report to user
    print(f"Existing publications before append: {existing_count}")
    print(f"New publications fetched:             {new_count}")
    print(f"Publications discarded (duplicates): {discarded_count}")
    print(f"Publications appended:               {appended_count}")
    print(f"Total publications after append:     {total_after}")

else:
    # No existing file: write all
    df_content.to_csv(output_path, index=False)
    print(f"Existing publications before append: 0")
    print(f"New publications fetched:            {new_count}")
    print(f"Publications discarded (duplicates): 0")
    print(f"Publications appended:               {new_count}")
    print(f"Total publications after append:     {new_count}")

Existing publications before append: 9136
New publications fetched:             637
Publications discarded (duplicates): 0
Publications appended:               637
Total publications after append:     9773
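
The same append-if-new logic is run again for the second batch, so it could be wrapped in a function. A sketch under the assumption that `id` is the only deduplication key; `append_new_rows` is a hypothetical name, and the demo writes to a temporary directory rather than the real output file:

```python
import os
import tempfile
import pandas as pd

def append_new_rows(df_new, path, key="id"):
    """Append rows whose key is not yet in the CSV at path; return the number appended."""
    # Compare keys as strings and drop duplicates within the new batch itself
    df_new = df_new.astype({key: str}).drop_duplicates(subset=[key])
    if not os.path.exists(path):
        df_new.to_csv(path, index=False)
        return len(df_new)
    existing = set(pd.read_csv(path, usecols=[key], dtype={key: str})[key])
    to_append = df_new[~df_new[key].isin(existing)]
    if not to_append.empty:
        to_append.to_csv(path, mode="a", header=False, index=False)
    return len(to_append)

# Demo on throwaway data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    first = append_new_rows(pd.DataFrame({"id": ["a", "b"], "x": [1, 2]}), demo_path)
    second = append_new_rows(pd.DataFrame({"id": ["b", "c"], "x": [2, 3]}), demo_path)
    total = len(pd.read_csv(demo_path))
```

In the demo the first call writes both rows, the second appends only `c`, and the file ends up with three rows. Unlike the cell above, this sketch also removes duplicates inside the incoming batch before comparing.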


Second Batch

In [15]:
# Second run
rows2 = process_batch(second_batch)
# e.g. save rows2 to CSV or accumulate
print(f"Processed second batch ({len(second_batch)} refs)")

Processed second batch (636 refs)


### Save inside new .csv + Data Inspection

In [16]:
#1. Where to write
output_path = "inputs/contentdata/gazette_contentdata_jupyter.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
#2. Build a DataFrame from the second batch (the CSV itself is written further down)
df_content = pd.DataFrame(rows2)
df_content.head(10)

Unnamed: 0,id,entryType,language,publicationDate,legalRemedy,cantons,title_en,journal_date,publication_text,company_name,...,company_code13,company_seat,company_legalForm,company_street_and_number,company_zip_and_town,company_purpose,company_capital_nominal,company_capital_paid,company_optingout,company_deletiondate
0,0cd03fb5-e998-4b75-b16f-35ea26a297ca,Change,de,2025-01-15,Die Mutation der aufgeführten Rechtseinheit wu...,BL,"Change AnalytikPro GmbH, Birsfelden",2025-01-10,"AnalytikPro GmbH, in Birsfelden, CHE-486.084.8...",AnalytikPro GmbH,...,CH28040256608,Birsfelden,107,Zwinglistrasse 8,4127 Birsfelden,Zweck des Unternehmens ist die Erbringung von ...,20000.00,20000.00,False,no data
1,fa94a533-783c-4549-bb85-48f182895c14,Change,de,2025-01-15,Die Mutation der aufgeführten Rechtseinheit wu...,BS,"Change Vector BioPharma AG, Basel, new Vector ...",2025-01-10,"Vector BioPharma AG, in Basel, CHE-471.427.733...",Vector BioPharma AG in Liquidation,...,CH27030163150,Basel,106,Aeschenvorstadt 36,4051 Basel,"Die Gesellschaft bezweckt die Erforschung, die...",400000.00,400000.00,False,no data
2,50869199-7de3-45f8-a5b9-943fc868b887,Change,de,2025-01-15,Die Mutation der aufgeführten Rechtseinheit wu...,GR,"Change Stiftung GRÜN & CHROM, Bergün Filisur",2025-01-10,"Stiftung GRÜN & CHROM, in Bergün Filisur, CHE-...",Stiftung GRÜN & CHROM,...,CH35070011983,Bergün Filisur,110,Veja Stazion 11,7482 Bergün/Bravuogn,Die gemeinnützige Stiftung GRÜN & CHROM bezwec...,no data,no data,False,no data
3,717fdfde-006f-4958-9614-c85ca6840b78,Change,fr,2025-01-15,La mutation de l'entité juridique mentionnée a...,VD,Change CFPS Compagnie Financière de Patrimoine...,2025-01-10,CFPS Compagnie Financière de Patrimoine Suisse...,CFPS Compagnie Financière de Patrimoine Suisse SA,...,CH55000687334,Saint-Sulpice (VD),106,Rue des Jordils 40,1025 St-Sulpice VD,la société a pour but toute activité financièr...,9280000.00,9280000.00,False,no data
4,6ef18c9c-67a2-4c18-9b03-27e611300890,Change,fr,2025-01-15,La mutation de l'entité juridique mentionnée a...,VD,"Change beWell SA, Lausanne, new beWell SA en l...",2025-01-10,"beWell SA, à Lausanne, CHE-166.653.133 (FOSC d...",beWell SA en liquidation,...,CH55011927500,Lausanne,106,Avenue de Rumine 11,1005 Lausanne,"la société a pour but le développement, la cré...",100000.00,100000.00,False,no data
5,6959fcd4-4ef7-48a9-9031-ae9a779ebc7f,Change,fr,2025-01-15,La mutation de l'entité juridique mentionnée a...,VD,Change Gilbert Favre Installations Sanitaires ...,2025-01-10,"Gilbert Favre Installations Sanitaires Sàrl, à...",Jonas Reymond Sàrl,...,CH55011907766,La Tour-de-Peilz,107,Rue du Temple 6,1814 La Tour-de-Peilz,la société a le but suivant: toutes activités ...,30000.00,30000.00,False,no data
6,8d4d6b9b-7715-4cfa-ada8-95ad57de4097,Change,de,2025-01-15,Die Mutation der aufgeführten Rechtseinheit wu...,SG,"Change MaWie Kommunikation GmbH, Wil (SG), new...",2025-01-10,"MaWie Kommunikation GmbH, in Wil (SG), CHE-311...",MaWie Kommunikation GmbH in Liquidation,...,CH32040824128,Wil (SG),107,Erlenstrasse 3,9500 Wil SG,Beratung und Erbringung von Dienstleistungen i...,20000.00,no data,False,no data
7,33a9b23d-98b8-42e5-af9b-3c754d5afb40,Change,fr,2025-01-15,La mutation de l'entité juridique mentionnée a...,VD,"Change Café-Barre Sàrl, Lausanne",2025-01-10,"Café-Barre Sàrl, à Lausanne, CHE-346.085.505 (...",Café-Barre Sàrl,...,CH55011113871,Lausanne,107,Avenue du Tribunal-Fédéral 1,1005 Lausanne,la société a pour but l'exploitation de cafés ...,20000.00,20000.00,True,no data
8,b4ccf559-9a96-4593-9de6-66d17cca6552,Change,de,2025-01-15,Die Mutation der aufgeführten Rechtseinheit wu...,ZG,"Change RedcoMet Resources AG, Zug",2025-01-10,"Berichtigung des im SHAB vom 15.06.2021, Meldu...",RedcoMet Resources AG,...,CH17030279439,Zug,106,Baarerstrasse 82,6302 Zug,Die Gesellschaft bezweckt den internationalen ...,761000.00,761000.00,False,no data
9,38e0f544-dc65-4c40-9dbc-7588043d9a93,Change,de,2025-01-15,Die Mutation der aufgeführten Rechtseinheit wu...,ZG,"Change Rhynea Capital AG, Zug",2025-01-10,"Rhynea Capital AG, in Zug, CHE-357.586.800, Ak...",Rhynea Capital AG,...,CH17030504239,Zug,106,Bahnhofstrasse 20,6300 Zug,"Die Gesellschaft bezweckt den Erwerb, das Halt...",142842.00,142842.00,False,no data


Dataframe summary

In [17]:
df_content.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 636 entries, 0 to 635
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   id                         636 non-null    object
 1   entryType                  636 non-null    object
 2   language                   636 non-null    object
 3   publicationDate            636 non-null    object
 4   legalRemedy                636 non-null    object
 5   cantons                    636 non-null    object
 6   title_en                   636 non-null    object
 7   journal_date               636 non-null    object
 8   publication_text           636 non-null    object
 9   company_name               636 non-null    object
 10  company_uid                636 non-null    object
 11  company_code13             636 non-null    object
 12  company_seat               636 non-null    object
 13  company_legalForm          636 non-null    object
 14  company_st

'id' duplicated?

In [18]:
df_content[df_content.duplicated(subset=['id'])]

Unnamed: 0,id,entryType,language,publicationDate,legalRemedy,cantons,title_en,journal_date,publication_text,company_name,...,company_code13,company_seat,company_legalForm,company_street_and_number,company_zip_and_town,company_purpose,company_capital_nominal,company_capital_paid,company_optingout,company_deletiondate


Missing values?

In [19]:
df_content.isna().sum()

id                           0
entryType                    0
language                     0
publicationDate              0
legalRemedy                  0
cantons                      0
title_en                     0
journal_date                 0
publication_text             0
company_name                 0
company_uid                  0
company_code13               0
company_seat                 0
company_legalForm            0
company_street_and_number    0
company_zip_and_town         0
company_purpose              0
company_capital_nominal      0
company_capital_paid         0
company_optingout            0
company_deletiondate         0
dtype: int64

## Push file to repo

In [20]:
# Ensure id column is string for reliable comparison
df_content["id"] = df_content["id"].astype(str)
new_count = len(df_content)

if os.path.exists(output_path):
    # 1. Load existing IDs
    df_existing = pd.read_csv(output_path, usecols=["id"])
    df_existing["id"] = df_existing["id"].astype(str)
    existing_ids = set(df_existing["id"])
    existing_count = len(existing_ids)

    # 2. Determine which rows are truly new
    mask_new = ~df_content["id"].isin(existing_ids)
    df_to_append = df_content[mask_new]
    appended_count = len(df_to_append)

    # 3. Compute discarded count
    discarded_count = new_count - appended_count

    # 4. Append new rows if any
    if appended_count > 0:
        df_to_append.to_csv(output_path, mode="a", header=False, index=False)

    # 5. Totals after append
    total_after = existing_count + appended_count

    # 6. Report to user
    print(f"Existing publications before append: {existing_count}")
    print(f"New publications fetched:             {new_count}")
    print(f"Publications discarded (duplicates): {discarded_count}")
    print(f"Publications appended:               {appended_count}")
    print(f"Total publications after append:     {total_after}")

else:
    # No existing file: write all
    df_content.to_csv(output_path, index=False)
    print(f"Existing publications before append: 0")
    print(f"New publications fetched:            {new_count}")
    print(f"Publications discarded (duplicates): 0")
    print(f"Publications appended:               {new_count}")
    print(f"Total publications after append:     {new_count}")

Existing publications before append: 9773
New publications fetched:             636
Publications discarded (duplicates): 0
Publications appended:               636
Total publications after append:     10409


---

# Conclusion and next step

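We now have a clean way to retrieve the content of 'New entries', 'Change' and 'Deletion' publications.

The data looks coherent: neither batch (637 and 636 publications) contains a duplicated `id`, and all 1273 rows were appended to the output file.

The next step is to fill the Streamlit app with graphs and analysis.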