# **CONTENTDATA COLLECTION NOTEBOOK**

## Objectives

* Use inputs/metadata/gazette_metadata_jupyter.csv to fetch further information about each publication from the Gazette API.

## Inputs

* inputs/metadata/gazette_metadata_jupyter.csv

## Outputs

* A CSV file: inputs/contentdata/gazette_contentdata_jupyter.csv

## Additional Comments

* ...


---

# Install Python packages in the notebook

In [1]:
%pip install requests xmltodict psycopg2-binary pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


---

# Change working directory

* The notebooks are stored in a subfolder, so we change the working directory from the current folder to its parent folder

* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/GazetteAnaliticsTools/jupyter_notebooks'

We want to make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/GazetteAnaliticsTools'
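
The `os.chdir(os.path.dirname(current_dir))` cell moves one level up every time it runs, so re-running it leaves the project root. A minimal sketch of a guard, assuming the notebooks folder is named `jupyter_notebooks` as in the output above; the helper name `project_root` is hypothetical:

```python
import os

def project_root(path, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `path` if `path` is the notebooks folder, else `path` unchanged."""
    path = os.path.normpath(path)
    if os.path.basename(path) == notebooks_dir:
        return os.path.dirname(path)
    return path

# Safe to run more than once: the second call changes nothing.
os.chdir(project_root(os.getcwd()))
```

With this guard the working directory only changes while the kernel is still inside the notebooks folder.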

---

# Fetch data from the Gazette API

### Some imports:
* os: for reading environment variables (e.g. your PG_DSN)
* datetime: to compute today’s date
* uuid: to generate UUIDs for new records
* json: to serialize raw metadata into JSONB
* requests: to call the Gazette API
* xmltodict: to convert the XML response into Python dicts
* psycopg2 / psycopg2.extras.execute_values: to connect and bulk-upsert into PostgreSQL

In [4]:
# 1. Standard library
import os, datetime, uuid, json
# 2. HTTP + XML parsing
import requests, xmltodict
# 3. PostgreSQL driver
import psycopg2
from psycopg2.extras import execute_values
# 4. Pandas for data analysis
import pandas as pd

### Load today's metadata CSV

In [5]:
# Load today's metadata from the CSV
df_meta = pd.read_csv("inputs/metadata/gazette_metadata_jupyter.csv")
df_meta.head(3)

Unnamed: 0,ref,schemaLocation,id,subRubric,publicationDate,legalRemedy,title_en,entryType
0,https://amtsblattportal.ch/api/v1/publications...,https://amtsblattportal.ch/api/v1/schemas/shab...,84d4f6a6-aa94-4edc-9da6-9c6c8d7b65ed,HR02,2025-01-14,Die Mutation der aufgeführten Rechtseinheit wu...,"Change IST3 Beteiligungs AG, Zürich",Change
1,https://amtsblattportal.ch/api/v1/publications...,https://amtsblattportal.ch/api/v1/schemas/shab...,ebc8e527-d8b6-45f4-bc60-d83ff729f830,HR02,2025-01-14,Die Mutation der aufgeführten Rechtseinheit wu...,"Change East Afrika Top 4 Business Group GmbH, ...",Change
2,https://amtsblattportal.ch/api/v1/publications...,https://amtsblattportal.ch/api/v1/schemas/shab...,e703d801-b0aa-443c-bbea-48f31fb4a277,HR02,2025-01-14,Die Mutation der aufgeführten Rechtseinheit wu...,"Change Mercer Schweiz AG, Zürich",Change


### Fetch all ref.xml for each metadata entry & Transform into flat rows
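
To make the flattening step in this section concrete, here is a minimal sketch on a hand-written dict shaped like what `xmltodict.parse` might return. The `HR02:publication` key and all sample values are illustrative assumptions, not real API output, and `flatten` is a hypothetical helper that covers only a few of the fields.

```python
# Hand-written stand-in for the dict that xmltodict.parse would return.
# The key prefix and all values here are illustrative assumptions.
sample = {
    "HR02:publication": {
        "meta": {"id": "abc-123", "subRubric": "HR02", "title": {"en": "Change Example AG"}},
        "content": {"commonsNew": {"company": {"name": "Example AG"}}},
    }
}

RUBRIC_MAP = {"HR01": "New entries", "HR02": "Change", "HR03": "Deletion"}

def flatten(data):
    """Flatten one parsed publication into a single-level dict."""
    # Match the root by suffix so a different namespace prefix still works.
    root_key = next((k for k in data if k.endswith("publication")), None)
    pub = data[root_key] if root_key else {}
    meta = pub.get("meta") or {}
    content = pub.get("content") or {}
    sub = meta.get("subRubric", "")
    # Deletions carry the company under commonsActual, other entries under commonsNew.
    commons = content.get("commonsActual" if sub == "HR03" else "commonsNew") or {}
    company = commons.get("company") or {}
    return {
        "id": meta.get("id", "no data"),
        "entryType": RUBRIC_MAP.get(sub, "no data"),
        "title_en": (meta.get("title") or {}).get("en", "no data"),
        "company_name": company.get("name", "no data"),
    }

row = flatten(sample)
print(row)
```

The sketch uses `or {}` rather than `.get(key, {})` so that a key which is present but empty (parsed as `None`) is also replaced by an empty dict.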

In [None]:
#1. Path of the metadata CSV (already loaded into df_meta above)
input_path  = "inputs/metadata/gazette_metadata_jupyter.csv"

#2. Collect the unique, non-null ref URLs from the metadata
references = df_meta["ref"].dropna().unique().tolist()

#3. Define the mapping for subRubric -> entryType
rubric_map = {
    "HR01": "New entries",
    "HR02": "Change",
    "HR03": "Deletion"
}

#4. Build the rows, and replace subRubric with entryType
rows = []
for ref_url in references:
    # Fetch & parse XML
    resp = requests.get(ref_url, timeout=30)  # time out instead of hanging on a stalled connection
    resp.raise_for_status()
    data = xmltodict.parse(resp.text)
    # Find the <publication> root key by suffix (namespace-agnostic), so the code
    # doesn't break if the prefix in <HR01:publication ...> changes
    root_key = next((k for k in data if k.endswith("publication")), None)
    pub      = data[root_key] if root_key else {}
    
    meta    = pub.get("meta", {})
    content = pub.get("content", {})
    sub     = meta.get("subRubric", "")

    # Build the row
    row = {
        # -- meta fields --
        "id":               meta.get("id", "no data"),
        "entryType":        rubric_map.get(sub, "no data"),
        "language":         meta.get("language", "no data"),
        "publicationDate":  meta.get("publicationDate") or None,
        "legalRemedy":      meta.get("legalRemedy", "no data"),
        "cantons":          meta.get("cantons", "no data"),
        "title_en":         meta.get("title", {}).get("en", "no data"),
        # -- content fields --
        "journal_date":     content.get("journalDate") or None,
        "publication_text": content.get("publicationText", "no data"),
    }
    
    # -- company / commonsNew --
    commons = content.get("commonsActual", {}) if sub == "HR03" else content.get("commonsNew", {})
    comp    = commons.get("company", {})
    addr    = comp.get("address", {})
    row.update({
        "company_name":             comp.get("name", "no data"),
        "company_uid":              comp.get("uid", "no data"),
        "company_code13":           comp.get("code13", "no data"),
        "company_seat":             comp.get("seat", "no data"),
        "company_legalForm":        comp.get("legalForm", "no data"),
        "company_street_and_number": f"{addr.get('street','no data')} {addr.get('houseNumber','no data')}",
        "company_zip_and_town":      f"{addr.get('swissZipCode','no data')} {addr.get('town','no data')}",
        "company_purpose":          commons.get("purpose", "no data"),
    })
    
    # -- capital & revision --
    cap     = commons.get("capital", {})
    revision= commons.get("revision", {})
    row.update({
        "company_capital_nominal": cap.get("nominal") or None,
        "company_capital_paid":    cap.get("paid") or None,
        "company_optingout":       revision.get("optingOut") or None,
    })
    
    # -- deletion date from transaction.delete --
    delete  = content.get("transaction", {}).get("delete", {})
    # No trailing comma here: it would store a one-element tuple instead of the value
    row["company_deletiondate"] = delete.get("deletionDate") or None
    
    rows.append(row)


HTTPError: 502 Server Error: Proxy Error for url: https://amtsblattportal.ch/api/v1/publications/39145594-23d5-43eb-b403-d029a2d2179e/xml
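
The run above stopped at the first failing URL because `raise_for_status()` raised inside the loop. Below is a minimal sketch of one way to retry and then skip, assuming transient errors such as this 502 are worth retrying. `fetch_with_retry`, its parameters, and the injected `fetch` callable are hypothetical; in the notebook, `fetch` could be a small function that calls `requests.get(url, timeout=30)`, then `raise_for_status()`, and returns `resp.text`.

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url); on an exception wait and retry, return None if all attempts fail."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # in the notebook, narrow this to requests.RequestException
            print(f"attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                sleep(backoff * attempt)  # linear backoff: 1x, 2x, ...
    return None

# Demonstration with a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("502 Server Error: Proxy Error")
    return "<publication/>"

# The sleep is replaced by a no-op so the demonstration runs instantly.
result = fetch_with_retry(flaky_fetch, "https://example.invalid/xml", sleep=lambda s: None)
```

Inside the loop, a `None` result can be recorded in a list of failed URLs and skipped with `continue`, so one bad publication does not discard the rows already collected.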

### Save inside new .csv

In [7]:
#1. Where to write
output_path = "inputs/contentdata/gazette_contentdata_jupyter.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)

In [8]:
#2. Build the DataFrame from the fetched rows (the CSV itself is written in the "Push files to Repo" section)
df_content = pd.DataFrame(rows)

---

# Inspect content data


In [9]:
df_content.head(10)

Unnamed: 0,id,entryType,language,publicationDate,legalRemedy,cantons,title_en,journal_date,publication_text,company_name,...,company_code13,company_seat,company_legalForm,company_street_and_number,company_zip_and_town,company_purpose,company_capital_nominal,company_capital_paid,company_optingout,company_deletiondate
0,40d64a9e-e70c-4f17-b022-59ae0690f155,Deletion,fr,2025-01-13,no data,GE,"Deletion FINDERS SA succursale de Genève, Genève",2025-01-08,"FINDERS SA succursale de Genève, à Genève, CHE...",FINDERS SA succursale de Genève,...,CH66012909977,Genève,151,rue du Port 8-10,1204 Genève,recherche et sélection de personnel administra...,no data,no data,no data,2025-01-08
1,d341fee7-5dd9-4507-ba53-f6b08df9be5b,Deletion,fr,2025-01-13,no data,VS,"Deletion Intermarché Morgins, Didier Touillet,...",2025-01-08,"Intermarché Morgins, Didier Touillet, à Troist...","Intermarché Morgins, Didier Touillet",...,CH62110015528,Troistorrents,101,Route du Village 8,1875 Morgins,exploitation d'un magasin d'alimentation générale,no data,no data,no data,2025-01-08
2,7403ed4c-3dab-4366-922a-de81d6e85d19,Deletion,fr,2025-01-13,no data,VS,"Deletion P.A.D, Phillipe Aubert Distribution, ...",2025-01-08,"P.A.D, Phillipe Aubert Distribution, à Arbaz, ...","P.A.D, Phillipe Aubert Distribution",...,CH55011069836,Arbaz,101,Route de Grand-Pro Barra 19,1974 Arbaz,Commerce et distribution de vêtements et produ...,no data,no data,no data,2025-01-08
3,e4d591b0-1cd8-4284-a55e-624208b3e4c2,Deletion,fr,2025-01-13,no data,GE,"Deletion BIOCAL, titulaire PIOT, Genève",2025-01-08,"BIOCAL, titulaire PIOT, à Genève, CHE-381.997....","BIOCAL, titulaire PIOT",...,CH66018110248,Genève,101,Boulevard Carl-Vogt 53,1205 Genève,"la transformation, le commerce et le service d...",no data,no data,no data,2025-01-08
4,6596eb57-1297-4535-b615-edbc31ac0f43,Change,fr,2025-01-13,La mutation de l'entité juridique mentionnée a...,GE,"Change Christian GUINCHARD SA, Genève, new Bernex",2025-01-08,"Christian GUINCHARD SA, à Genève, CHE-108.016....",Christian GUINCHARD SA,...,CH66014299981,Bernex,106,Chemin de Saule 134,1233 Bernex,"vente, conseil, assistance, expertise, locatio...",400000.00,400000.00,false,no data
5,fc425468-72e5-4898-9181-11e803839542,Change,fr,2025-01-13,La mutation de l'entité juridique mentionnée a...,GE,"Change EGCR SA, Carouge (GE)",2025-01-08,"EGCR SA, à Carouge (GE), CHE-333.372.496 (FOSC...",EGCR SA,...,CH66037890199,Carouge (GE),106,Chemin du Faubourg-de-Cruseilles 14,1227 Carouge GE,"exploitation d'une entreprise générale, rénova...",100000.00,100000.00,false,no data
6,0818ec88-fa52-4087-8444-2210579dcde9,Change,fr,2025-01-13,La mutation de l'entité juridique mentionnée a...,GE,"Change Mavala SA, Carouge (GE)",2025-01-08,"Mavala SA, à Carouge (GE), CHE-107.741.977 (FO...",Mavala SA,...,CH66001739618,Carouge (GE),106,rue Antoine-Jolivet 2,1227 Carouge (GE),commerce et fabrication de produits cosmétique...,1750000.00,1750000.00,false,no data
7,8798dcef-0fda-46d9-a04f-eb3fa7e17cef,Change,fr,2025-01-13,La mutation de l'entité juridique mentionnée a...,GE,"Change Itcos SA, Genève",2025-01-08,"Itcos SA, à Genève, CHE-103.137.191 (FOSC du 2...",Itcos SA,...,CH66002439800,Genève,106,place de Saint-Gervais 1,1201 Genève,toute activité entrant dans le cadre d'une soc...,100000.00,100000.00,false,no data
8,c0ab7e87-e17b-4a8f-a6df-4949af512a37,Deletion,fr,2025-01-13,no data,GE,"Deletion GRISONI, LACROIX, FLEURY SA, en liqui...",2025-01-08,"GRISONI, LACROIX, FLEURY SA, en liquidation, à...","GRISONI, LACROIX, FLEURY SA, en liquidation",...,CH66002149996,Versoix,106,route de Sauverny 230,1290 Versoix,exploitation d'une entreprise générale du bâti...,100000.00,100000.00,no data,2025-01-08
9,89765f67-f915-41d5-8e31-c59992e4cb2d,Deletion,fr,2025-01-13,no data,GE,"Deletion Rodrigues Vicente, Jardinage, Choulex",2025-01-08,"Rodrigues Vicente, Jardinage, à Choulex, CHE-1...","Rodrigues Vicente, Jardinage",...,CH66007450160,Choulex,101,Route des Jurets 39,1244 Choulex,"paysagiste, création et entretien des espaces ...",no data,no data,no data,2025-01-08


DataFrame Summary

In [10]:
df_content.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1266 entries, 0 to 1265
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   id                         1266 non-null   object
 1   entryType                  1266 non-null   object
 2   language                   1266 non-null   object
 3   publicationDate            1266 non-null   object
 4   legalRemedy                1266 non-null   object
 5   cantons                    1266 non-null   object
 6   title_en                   1266 non-null   object
 7   journal_date               1266 non-null   object
 8   publication_text           1266 non-null   object
 9   company_name               1266 non-null   object
 10  company_uid                1266 non-null   object
 11  company_code13             1266 non-null   object
 12  company_seat               1266 non-null   object
 13  company_legalForm          1266 non-null   object
 14  company_

Check for duplicated `id` values. There are none: the result below is empty.

In [11]:
df_content[df_content.duplicated(subset=['id'])]

Unnamed: 0,id,entryType,language,publicationDate,legalRemedy,cantons,title_en,journal_date,publication_text,company_name,...,company_code13,company_seat,company_legalForm,company_street_and_number,company_zip_and_town,company_purpose,company_capital_nominal,company_capital_paid,company_optingout,company_deletiondate


Find missing values

In [12]:
df_content.isna().sum()

id                           0
entryType                    0
language                     0
publicationDate              0
legalRemedy                  0
cantons                      0
title_en                     0
journal_date                 0
publication_text             0
company_name                 0
company_uid                  0
company_code13               0
company_seat                 0
company_legalForm            0
company_street_and_number    0
company_zip_and_town         0
company_purpose              0
company_capital_nominal      0
company_capital_paid         0
company_optingout            0
company_deletiondate         0
dtype: int64
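
The zeros above are consistent with absent fields having been filled with the placeholder string "no data" (visible in the `head` output), which `isna()` does not count as missing. A minimal sketch of counting that placeholder instead, shown on a small made-up frame; the values are illustrative and not taken from the dataset:

```python
import pandas as pd

# Small made-up frame; "no data" is the placeholder used by the flattening step.
df = pd.DataFrame({
    "legalRemedy": ["no data", "text", "no data"],
    "company_capital_nominal": ["no data", "100000.00", "400000.00"],
    "company_name": ["A SA", "B SA", "C SA"],
})

# isna() finds nothing, because the placeholder is an ordinary string.
total_na = int(df.isna().sum().sum())

# Count the placeholder per column instead.
placeholder_counts = df.eq("no data").sum()
print(placeholder_counts)

# Or convert the placeholder to a real missing value first, then use isna().
df_clean = df.replace("no data", pd.NA)
missing_counts = df_clean.isna().sum()
```

Converting the placeholder to a real missing value also lets the commented-out checks in the next cell find the affected columns.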

Evaluate the distribution and shape of the variables with missing data

In [None]:
#1. Columns containing missing data
# missing_data = df_content.columns[df_content.isna().sum() > 0].to_list()
# missing_data

#2. Examples of non-missing values in those columns, to help decide how to handle the missing ones
# for col in missing_data:
#     unique_values = df_content[col].dropna().unique()[:5]
#     print(f"{col}: {unique_values}\n")

#3. Gauge the importance of the missing data: filter rows where 'legalRemedy' is missing and display the first 5
# missing_rows = df_content[df_content['legalRemedy'].isna()]
# print(missing_rows.head(5))

#4. List the unique 'subRubric' values among the rows where 'legalRemedy' is missing
# unique_yyy = df_content[df_content['legalRemedy'].isna()]['subRubric'].unique()
# print(unique_yyy)

['HR03']


---

# Push files to Repo

If the output .csv file already exists, check which `id` values are already saved in it and append only the rows that are new.

In [13]:
# Ensure id column is string for reliable comparison
df_content["id"] = df_content["id"].astype(str)
new_count = len(df_content)

if os.path.exists(output_path):
    # 1. Load existing IDs
    df_existing = pd.read_csv(output_path, usecols=["id"])
    df_existing["id"] = df_existing["id"].astype(str)
    existing_ids = set(df_existing["id"])
    existing_count = len(existing_ids)

    # 2. Determine which rows are truly new
    mask_new = ~df_content["id"].isin(existing_ids)
    df_to_append = df_content[mask_new]
    appended_count = len(df_to_append)

    # 3. Compute discarded count
    discarded_count = new_count - appended_count

    # 4. Append new rows if any
    if appended_count > 0:
        df_to_append.to_csv(output_path, mode="a", header=False, index=False)

    # 5. Totals after append
    total_after = existing_count + appended_count

    # 6. Report to user
    print(f"Existing publications before append: {existing_count}")
    print(f"New publications fetched:             {new_count}")
    print(f"Publications discarded (duplicates): {discarded_count}")
    print(f"Publications appended:               {appended_count}")
    print(f"Total publications after append:     {total_after}")

else:
    # No existing file: write all
    df_content.to_csv(output_path, index=False)
    print("Existing publications before append: 0")
    print(f"New publications fetched:            {new_count}")
    print("Publications discarded (duplicates): 0")
    print(f"Publications appended:               {new_count}")
    print(f"Total publications after append:     {new_count}")

Existing publications before append: 6583
New publications fetched:             1266
Publications discarded (duplicates): 0
Publications appended:               1266
Total publications after append:     7849


---

# Conclusion and next step

We now have a clean way to retrieve 'New entries', 'Change', and 'Deletion' publications.

The data appear coherent: no duplicated `id` values were found.

The next step is to fill the Streamlit app with graphs and analysis.