# **METADATA COLLECTION NOTEBOOK**

## Objectives

* Fetch data from https://amtsblattportal.ch/api/v1/publications/xml, save as raw data and inspect data.
* Find information about the API here: 
    https://amtsblattportal.ch/docs/api/#_api_reference
    and here:
    https://official-gazettes-portal.ch/#!/publish/info/technical-information

## Inputs

* The data comes from the official gazettes portal.
* The input is an XML response from the API. A comparable query (shown here against the CSV endpoint) looks like: https://amtsblattportal.ch/api/v1/publications/csv?publicationStates=PUBLISHED&title.fr=Nouvelles entrées
* Install the dependencies listed in requirements.txt with the command: pip install -r requirements.txt

## Outputs

* The output data is a csv file called inputs/metadata/gazette_metadata_jupyter.csv

## Additional Comments

* The data is filtered: for our CRM we keep only the subRubric values {'HR01': 'New entries', 'HR02': 'Change', 'HR03': 'Deletion'}.


---

# Install python packages in the notebooks

In [1]:
%pip install requests xmltodict psycopg2-binary pandas

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Collecting pytz>=2020.1
  Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Collecting tzdata>=2022.7
  Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Collecting numpy>=1.22.4
  Downloading numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: pytz, tzdata, numpy, pandas
Successfully installed numpy-2.2.6 pandas-2.2.3 pytz-2025.2

---

# Change working directory

* The notebooks are stored in a subfolder, so we change the working directory from the notebook folder to its parent folder

* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/GazetteAnaliticsTools/jupyter_notebooks'

We want to make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/GazetteAnaliticsTools'
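
Re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks always live in a folder named `jupyter_notebooks` (as in the path printed above); the helper name `project_root` is hypothetical and not part of the notebook:

```python
from pathlib import Path

def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent folder if cwd is the notebooks folder, else cwd unchanged."""
    return cwd.parent if cwd.name == notebooks_dir else cwd

# Usage in the notebook (commented out so the sketch has no side effects):
# import os
# os.chdir(project_root(Path.cwd()))
```

With this guard the cell can be executed repeatedly and the working directory stays at the project root.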

---

# Fetch data from the Gazette API

### Some imports:
* os: for reading environment variables (e.g. your PG_DSN)
* datetime: to compute today’s date
* uuid: to generate UUIDs for new records
* json: to serialize raw metadata into JSONB
* requests: to call the Gazette API
* xmltodict: to convert the XML response into Python dicts
* psycopg2 / psycopg2.extras.execute_values: to connect and bulk-upsert into PostgreSQL

In [4]:
# 1. Imports Standard library
import os, datetime, uuid, json
# 2. Imports HTTP + XML parsing
import requests, xmltodict
# 3. Imports PostgreSQL driver
import psycopg2
from psycopg2.extras import execute_values
# 4. Imports Pandas for data analysis
import pandas as pd

### Configurations

In [5]:
API_BASE   = "https://amtsblattportal.ch/api/v1/publications/xml"
PG_DSN     = os.getenv("PG_DSN", "postgresql://user:pass@localhost:5432/mydb")

### Compute today's date

In [132]:
import datetime
manual = "15.01.2025"
today = datetime.datetime.strptime(manual, "%d.%m.%Y").date().isoformat()
# or
# today = datetime.date.today().isoformat()   # e.g. "2025-05-21"
print(today)

2025-01-15
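
The two alternatives in the cell above (a manual date or today's date) could be folded into one helper. This is only a sketch; the function name `to_iso_date` is an assumption and does not appear in the notebook:

```python
import datetime

def to_iso_date(value=None):
    """Convert 'DD.MM.YYYY' to 'YYYY-MM-DD'; use today's date when value is None."""
    if value is None:
        return datetime.date.today().isoformat()
    return datetime.datetime.strptime(value, "%d.%m.%Y").date().isoformat()

manual_day = to_iso_date("15.01.2025")   # fixed date, as used in this notebook
current_day = to_iso_date()              # today's date
```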


### Fetch list of today's published gazettes

In [133]:
# Request a single page that is large enough to hold all of the day's publications
params = {
    "publicationStates":      "PUBLISHED",
    "publicationDate.start":  today,
    "publicationDate.end":    today,
    "tenant":                 "shab",        # only SHAB entries
    "pageRequest.size":       3000,          # up to 3000 entries in this single page
    "pageRequest.page":       0
}
resp = requests.get(API_BASE, params=params)
resp.raise_for_status()
bulk_xml = resp.text
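
If a day ever holds more entries than one page, the request has to be repeated with an increasing `pageRequest.page`. The following is a rough sketch of such a loop, not the notebook's method. It assumes that a page with fewer than `pageRequest.size` entries marks the last page, which should be checked against the API documentation. `fetch_page` is a hypothetical callable that takes the params dict and returns a list of publications:

```python
def fetch_all_pages(fetch_page, base_params, page_size=1000, max_pages=50):
    """Collect publications page by page until a short (or empty) page is returned.

    fetch_page: callable taking a params dict and returning a list of publications.
    max_pages:  safety limit so the loop always terminates.
    """
    all_items = []
    for page in range(max_pages):
        params = dict(base_params)            # do not mutate the caller's dict
        params["pageRequest.size"] = page_size
        params["pageRequest.page"] = page
        items = fetch_page(params)
        all_items.extend(items)
        if len(items) < page_size:            # assumed signal for the last page
            break
    return all_items
```

In the notebook, `fetch_page` would wrap the `requests.get` call and the XML parsing shown in the following cells.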

### Parse XML -> Python dict

In [120]:
data = xmltodict.parse(bulk_xml)

# 1. Locate the bulk-export root (namespace-prefix agnostic)
root_key = next((k for k in data if k.endswith("bulk-export")), None)
if root_key is None:
    raise KeyError("Could not find the bulk-export root element")

bulk = data[root_key]

# 2. Pull out the raw publication element(s)
pubs = bulk.get("publication", [])
if not isinstance(pubs, list):
    pubs = [pubs]

# 3. Extract all @schemaLocation attributes
schema_locations = [
    pub.get("@schemaLocation")
    for pub in pubs
    if "@schemaLocation" in pub
]
print(f"Retrieved {len(pubs)} SHAB publications today.")

# schema_locations


Retrieved 1530 SHAB publications today.
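
The `isinstance(..., list)` check is needed because xmltodict returns a single dict when an element occurs once and a list when it occurs several times. A small helper can centralise this normalisation. This is a sketch; the name `as_list` is an assumption, and the sample dicts only mimic the shape of parsed output:

```python
def as_list(value):
    """Normalize a parsed XML child to a list: None -> [], item -> [item], list -> list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

# Shapes that can come back from the parser, depending on the number of <publication> elements
no_pubs = as_list({}.get("publication"))
one_pub = as_list({"publication": {"meta": {"id": "a"}}}["publication"])
many_pubs = as_list({"publication": [{"meta": {"id": "a"}}, {"meta": {"id": "b"}}]}["publication"])
```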


In [134]:
data2 = xmltodict.parse(bulk_xml)
# 1. Find the “bulk-export” root (namespace‐prefix agnostic)
root_key = next(
    (k for k in data2.keys() if k.endswith('bulk-export')), 
    None
)
if root_key is None:
    raise ValueError("Couldn't find the bulk-export root in response")

bulk = data2[root_key]

# 2. Extract the list of publications
items = bulk.get("publication", [])

# 3. Normalize to a list if it's a singleton
if not isinstance(items, list):
    items = [items]

items

[{'@ref': 'https://amtsblattportal.ch/api/v1/publications/4f614e43-c3e7-4df5-805f-4574ae3eabdc/xml',
  '@schemaLocation': 'https://amtsblattportal.ch/api/v1/schemas/shab/1.23/HR03-export.xsd',
  'meta': {'id': '4f614e43-c3e7-4df5-805f-4574ae3eabdc',
   'rubric': 'HR',
   'subRubric': 'HR03',
   'language': 'fr',
   'registrationOffice': {'id': 'e15a629a-a08d-11e8-aa11-0050569d3c43',
    'displayName': 'Bundesamt für Justiz (BJ), Eidgenössisches Amt für das Handelsregister',
    'street': 'Bundesrain',
    'streetNumber': '20',
    'swissZipCode': '3003',
    'town': 'Bern',
    'containsPostOfficeBox': 'false'},
   'publicationNumber': 'HR03-1006229436',
   'publicationState': 'PUBLISHED',
   'publicationDate': '2025-01-15',
   'primaryTenantCode': 'shab',
   'cantons': 'JU',
   'title': {'de': 'Löschung Société coopérative de la Guinguette en liquidation, Delémont',
    'en': 'Deletion Société coopérative de la Guinguette en liquidation, Delémont',
    'it': 'Cancellazione Société coo

### Transform into flat rows

In [135]:
#1. Where to write
os.makedirs("inputs/metadata", exist_ok=True)
output_path = "inputs/metadata/gazette_metadata_jupyter.csv"

In [136]:
#2. Define the mapping
rubric_map = {
    "HR01": "New entries",
    "HR02": "Change",
    "HR03": "Deletion"
}

#3. Build the rows with an extra "entryType" column
rows = []
for pub in items:
    meta = pub.get("meta", {})
    sub = meta.get("subRubric")

    #4. Filter to HR01, HR02, HR03
    if sub not in rubric_map:
        continue

    #5. Build the row
    row = {
        "ref":            pub.get("@ref"),
        "schemaLocation": pub.get("@schemaLocation"),
        "id":             meta.get("id"),
        "subRubric":      sub,
        "publicationDate":    meta.get("publicationDate"),
        "legalRemedy":    meta.get("legalRemedy"),
        "title_en":       meta.get("title", {}).get("en"),
        #6. Map the data
        "entryType":      rubric_map[sub]
    }

    rows.append(row)
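
The loop above can also be written as a function that is easy to test on a single record. The sketch below assumes the same field names. The function name `flatten_publication`, the sample record (trimmed from the first publication shown earlier, with `title` deliberately set to `None`) and the code `XX99` are illustrative only:

```python
RUBRIC_MAP = {"HR01": "New entries", "HR02": "Change", "HR03": "Deletion"}

def flatten_publication(pub, rubric_map=RUBRIC_MAP):
    """Return one flat row for a publication dict, or None if its subRubric is not kept."""
    meta = pub.get("meta") or {}
    sub = meta.get("subRubric")
    if sub not in rubric_map:
        return None
    title = meta.get("title") or {}   # tolerate a missing or empty title element
    return {
        "ref": pub.get("@ref"),
        "schemaLocation": pub.get("@schemaLocation"),
        "id": meta.get("id"),
        "subRubric": sub,
        "publicationDate": meta.get("publicationDate"),
        "legalRemedy": meta.get("legalRemedy"),
        "title_en": title.get("en"),
        "entryType": rubric_map[sub],
    }

sample = {
    "@ref": "https://amtsblattportal.ch/api/v1/publications/4f614e43-c3e7-4df5-805f-4574ae3eabdc/xml",
    "meta": {"id": "4f614e43-c3e7-4df5-805f-4574ae3eabdc", "subRubric": "HR03",
             "publicationDate": "2025-01-15", "title": None},
}
row = flatten_publication(sample)                               # kept: HR03 is in the map
skipped = flatten_publication({"meta": {"subRubric": "XX99"}})  # filtered out
```

The full list would then be `rows = [r for r in map(flatten_publication, items) if r is not None]`.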

In [137]:
#7. Write to CSV, replacing any file left over from a previous run
df = pd.DataFrame(rows)

if os.path.exists(output_path):
    os.remove(output_path)

#8. Write afresh, including the header row
df.to_csv(output_path, index=False)

print(f"Wrote {len(df)} rows to {output_path}")

Wrote 1273 rows to inputs/metadata/gazette_metadata_jupyter.csv


---

# Inspect metadata


In [138]:
df = pd.read_csv("inputs/metadata/gazette_metadata_jupyter.csv")
df.head()

|   | ref | schemaLocation | id | subRubric | publicationDate | legalRemedy | title_en | entryType |
|---|-----|----------------|----|-----------|-----------------|-------------|----------|-----------|
| 0 | https://amtsblattportal.ch/api/v1/publications... | https://amtsblattportal.ch/api/v1/schemas/shab... | 4f614e43-c3e7-4df5-805f-4574ae3eabdc | HR03 | 2025-01-15 |  | Deletion Société coopérative de la Guinguette ... | Deletion |
| 1 | https://amtsblattportal.ch/api/v1/publications... | https://amtsblattportal.ch/api/v1/schemas/shab... | e8608e46-d33e-4224-803f-d5d0cb6136bc | HR03 | 2025-01-15 |  | Deletion DENTAMINA AG in Liquidation, Bad Ragaz | Deletion |
| 2 | https://amtsblattportal.ch/api/v1/publications... | https://amtsblattportal.ch/api/v1/schemas/shab... | 76a76b6a-7cc8-4378-b9f4-4c444315c931 | HR03 | 2025-01-15 |  | Deletion schwarz logistics engineering, Reinac... | Deletion |
| 3 | https://amtsblattportal.ch/api/v1/publications... | https://amtsblattportal.ch/api/v1/schemas/shab... | d8b04ace-8253-44e9-aae2-53434adbb228 | HR03 | 2025-01-15 |  | Deletion Wipf & Co. Immobilien, Lohn (SH) | Deletion |
| 4 | https://amtsblattportal.ch/api/v1/publications... | https://amtsblattportal.ch/api/v1/schemas/shab... | 4fe15a9e-237b-487a-9334-fd0fc248602e | HR03 | 2025-01-15 |  | Deletion Freuler Haustechnik, Oberwil-Lieli | Deletion |


DataFrame Summary

In [139]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1273 entries, 0 to 1272
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ref              1273 non-null   object
 1   schemaLocation   1273 non-null   object
 2   id               1273 non-null   object
 3   subRubric        1273 non-null   object
 4   publicationDate  1273 non-null   object
 5   legalRemedy      1148 non-null   object
 6   title_en         1273 non-null   object
 7   entryType        1273 non-null   object
dtypes: object(8)
memory usage: 79.7+ KB


We check whether any `id` is duplicated. The result below is empty, so there are no duplicates.

In [140]:
df[df.duplicated(subset=['id'])]

Unnamed: 0,ref,schemaLocation,id,subRubric,publicationDate,legalRemedy,title_en,entryType
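
The same check can be run on the plain `rows` list before the DataFrame is built, which also reports which ids repeat. This is a sketch using only the standard library; the helper name `duplicated_ids` is an assumption:

```python
from collections import Counter

def duplicated_ids(rows, key="id"):
    """Return the sorted values of `key` that occur in more than one row."""
    counts = Counter(row.get(key) for row in rows)
    return sorted(k for k, n in counts.items() if n > 1 and k is not None)

# Illustrative rows only, not real gazette data
example_rows = [{"id": "a"}, {"id": "b"}, {"id": "a"}]
repeated = duplicated_ids(example_rows)
```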


Find missing values

In [141]:
df.isna().sum()

ref                  0
schemaLocation       0
id                   0
subRubric            0
publicationDate      0
legalRemedy        125
title_en             0
entryType            0
dtype: int64

Evaluating distribution and shape of a variable with missing data

In [142]:
#1. Columns that contain missing data
missing_data = df.columns[df.isna().sum() > 0].to_list()
missing_data

#2. Show a few non-missing example values per column, to judge how to handle the gaps
# for col in missing_data:
#     unique_values = df[col].dropna().unique()[:5]
#     print(f"{col}: {unique_values}\n")

#3. Show the first 5 rows where 'legalRemedy' is missing, to judge how much the gaps matter
# missing_rows = df[df['legalRemedy'].isna()]
# print(missing_rows.head(5))

#4. List the subRubric values of the rows where 'legalRemedy' is missing
subrubrics_missing_remedy = df[df['legalRemedy'].isna()]['subRubric'].unique()
print(subrubrics_missing_remedy)

['HR03']
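
The output shows that `legalRemedy` is missing only for HR03, and the 125 missing values match the 125 HR03 rows counted further down. A grouped count makes that comparison explicit. The sketch below works on a list of row dicts and uses only the standard library; `missing_by_group` is a hypothetical helper name:

```python
from collections import Counter

def missing_by_group(rows, column, group):
    """Count rows where `column` is missing (None or empty string), per value of `group`."""
    return dict(Counter(row.get(group) for row in rows if row.get(column) in (None, "")))

# Illustrative rows only, not real gazette data
example_rows = [
    {"subRubric": "HR03", "legalRemedy": None},
    {"subRubric": "HR01", "legalRemedy": "some text"},
    {"subRubric": "HR03"},
]
missing_counts = missing_by_group(example_rows, "legalRemedy", "subRubric")
```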


Print the unique list of subRubric values to make sure we captured every category we need.

In [143]:
# 1. Make sure there are no missing values
subrubrics = df['subRubric'].dropna()

# 2. Extract unique values as a list
unique_subrubrics = subrubrics.unique().tolist()
print(f"Total {len(unique_subrubrics)} subrubrics")

# 3. Print them
for sr, cnt in df['subRubric'].value_counts().items():
    print(f"{sr}: {cnt}")

Total 3 subrubrics
HR02: 928
HR01: 220
HR03: 125


Another way to understand the meaning of each subRubric is to look at an example title per group.

In [144]:
# 1. Group by subRubric, aggregating count and the first title_en we see
summary = (
    df
    .groupby("subRubric")["title_en"]
    .agg(count="size", example_title_en="first")
    .reset_index()
)

# 2. Print it out
for _, row in summary.iterrows():
    print(f"{row['subRubric']}: {row['count']} — example title: {row['example_title_en']}")

HR01: 220 — example title: New entries Fire Technic SA, Genève
HR02: 928 — example title: Change Fabio FOSSATI - Architectes SA, Chêne-Bougeries
HR03: 125 — example title: Deletion Société coopérative de la Guinguette en liquidation, Delémont


---

# Conclusion and next step

In conclusion, the daily data volume is small: the API returned 1530 SHAB publications for 2025-01-15, of which 1273 remained after filtering.

We filtered and kept only the subRubric values {'HR01': 'New entries', 'HR02': 'Change', 'HR03': 'Deletion'}.

The next step is to call each "ref"/"@ref" URL and fetch further information.
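
One possible shape for that follow-up step is sketched below. It is not the project's implementation: the helper names `fetch_details` and `_http_get`, the UTF-8 decoding, the 30-second timeout and the optional pause between calls are all assumptions. The download function is injectable, so the loop can be tried without network access:

```python
import time
import urllib.request

def _http_get(url, timeout=30):
    """Default downloader: return the response body as text (UTF-8 assumed)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def fetch_details(refs, fetch=_http_get, pause=0.0):
    """Fetch every ref URL; return (results, errors), both dicts keyed by URL."""
    results, errors = {}, {}
    for url in refs:
        try:
            results[url] = fetch(url)
        except Exception as exc:      # record the failure and continue with the next ref
            errors[url] = str(exc)
        if pause:
            time.sleep(pause)         # optional delay between requests
    return results, errors
```

In the notebook this could be called as `fetch_details(df["ref"])`, with each returned XML string parsed the same way as the bulk export.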