# Tips
## 1. **Cellar - WEMI codes based on FRBR** 
- (Check https://op.europa.eu/documents/d/cellar/cellar-end-user-manual_eec84490f0b94079960fcf6919271c37-280824-1601-502)
- Work: An abstract intellectual idea
- Expression: A realisation - text version or translation - of Work, in a specific language
- Manifestation: An instantiation - file or format - of Expression
- Item / Content Stream: The actual document in the specified language and format
## 2. **Resource URL**
- **http://publications.europa.eu/resource/{ps-name}/{ps-id}**
- When type = content stream
- example 1: http://publications.europa.eu/resource/cellar/550e8400-e29b-41d4-a716-446655440000.0001.03/DOC_1
- ps-name is cellar
- ps-id = {work-id}.{expr-id}.{man-id}/{cs-id} = 550e8400-e29b-41d4-a716-446655440000.0001.03/DOC_1
- The work-id is a Universally Unique Identifier (UUID)
- example 2: http://publications.europa.eu/resource/celex/32006D0241.FRA.fmx4.L_2006088FR.01006402.xml
- ps-name is not cellar (here celex)
- ps_id = {work-id}.{expr-id}.{man-id}.{cs-id} = 32006D0241.FRA.fmx4.L_2006088FR.01006402.xml
- The work-id is an alphanumeric value (here a celex number)
## 3. **Celex number on EUR-Lex**
- (Check https://eur-lex.europa.eu/content/tools/HowCelexNumbersAreComposed.pdf)
- {sector}{year}{doc-type (descriptors by sector)}{doc-number}
- example: 32006D0241
- sector = 3 (Legislation)
- year = 2006
- doc-type = D (Decisions)
- doc-number = 0060
## 4. **Cellar ontology / Common Data Model (CDM)**
- (Check documentation: https://publications.europa.eu/resource/cellar/fb442510-0826-11ed-b11c-01aa75ed71a1.0001.02/DOC_1)
- http://publications.europa.eu/ontology/cdm#
## 5. **Eurovoc**
- (Check handbook: file:///Users/demouser/Downloads/EuroVoc-Handbook.pdf)
- Download in https://op.europa.eu/en/web/eu-vocabularies/concept-scheme/-/resource?uri=http://eurovoc.europa.eu/100141
- It defines the domains / thesaurus that a document is related to.
- A document can be tagged to multiple eurovoc tags.
- Every domain has a identification code.
- Search in the metadata file of a Work for 
    - <j.0:work_is_about_concept_eurovoc rdf:resource="http://eurovoc.europa.eu/{code}"/>
    - Use {code} to identify the Eurovoc term.
## 6. **SPARQL**
- (Check https://publications.europa.eu/webapi/rdf/sparql)
- Query data from the database

# Metadata
The metadata to keep:
- creation date
- domain: from eurovoc
- author/publisher
- doc-type: from celex number
- version

How to get metadata:
- Collect <meta ...> in html
- Check metadata rdf file


# Library

uk API Key: 4a5e1b6dcb7f818746d6d51ce69f5331

In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON 
import requests 
import pandas as pd
import numpy as np
import os
import json
import rdflib
from tqdm import tqdm
from bs4 import BeautifulSoup
from multiprocessing import Pool
from helper.eurlex import Eurlex
from dotenv import load_dotenv
load_dotenv()

True

In [55]:
pd.set_option('display.max_rows', 100)

# SPARQL Query

In [2]:
sparql = SPARQLWrapper("https://publications.europa.eu/webapi/rdf/sparql")
sparql.setReturnFormat(JSON)

In [25]:
query = """
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT 
  DISTINCT (GROUP_CONCAT(DISTINCT STR(?work); SEPARATOR=",") AS ?cellarURIs)
  (GROUP_CONCAT(DISTINCT STR(?title_); SEPARATOR=",") AS ?title)
  ?langIdentifier
  (GROUP_CONCAT(DISTINCT STR(?mtype); SEPARATOR=",") AS ?mtypes)
  (GROUP_CONCAT(DISTINCT STR(?thumbnail); SEPARATOR=",") AS ?thumbnails)
  (GROUP_CONCAT(DISTINCT STR(?resType); SEPARATOR=",") AS ?workTypes)
  (GROUP_CONCAT(DISTINCT STR(?agentName); SEPARATOR=",") AS ?authors)
  (GROUP_CONCAT(DISTINCT STR(?privateAgentName); SEPARATOR=";") AS ?privateAuthors)
  ?date
  (GROUP_CONCAT(DISTINCT STR(?subjectLabel); SEPARATOR=",") AS ?subjects)
  (GROUP_CONCAT(DISTINCT STR(?workId_); SEPARATOR=",") AS ?workIds)
WHERE {
  GRAPH ?gw {
    ?work rdf:type ?resType ;
          cdm:work_date_document ?date ;
          cdm:work_id_document ?workId_ ;
          cdm:work_is_about_concept_eurovoc ?subject .
    FILTER(?resType = cdm:resource_legal)
    FILTER(xsd:date(?date) < "2023-01-01"^^xsd:date)
    
    GRAPH ?gs {
      ?subject skos:prefLabel ?subjectLabel .
      FILTER(LANG(?subjectLabel) = "en")
    }
  }

  GRAPH ?eg {
    ?exp cdm:expression_belongs_to_work ?work ;
         cdm:expression_title ?title_ ;
         cdm:expression_uses_language ?lg .
    FILTER(LANG(?title_) = "en" || LANG(?title_) = "eng" || LANG(?title_) = "")

    GRAPH ?lgc {
      ?lg dc:identifier ?langIdentifier .
      FILTER(STR(?langIdentifier) = "ENG")
    }
  }

  GRAPH ?gm {
    ?manif cdm:manifestation_manifests_expression ?exp ;
           cdm:manifestation_type ?mtype .
    OPTIONAL { ?manif cdm:manifestation_has_thumbnail ?thumbnail }
  }

  OPTIONAL {
    GRAPH ?gagent {
      {
        ?work cdm:work_contributed_to_by_agent ?agent .
      } UNION {
        ?work cdm:work_created_by_agent ?agent .
      } UNION {
        ?work cdm:work_authored_by_agent ?agent .
      }
    }
    GRAPH ?ga {
      ?agent skos:prefLabel ?agentName .
      FILTER(LANG(?agentName) = "en")
    }
  }

  OPTIONAL {
    GRAPH ?persAuthor {
      {
        ?work cdm:work_contributed_to_by_agent ?privateAgent .
      } UNION {
        ?work cdm:work_authored_by_agent ?privateAgent .
      }
      ?privateAgent rdf:type cdm:person ;
                    cdm:agent_name ?privateAgentName .
    }
  }
}
GROUP BY ?work ?date ?langIdentifier
ORDER BY DESC(?date)
LIMIT 100
"""

In [None]:
query = """
prefix cdm: <http://publications.europa.eu/ontology/cdm#>
prefix purl: <http://purl.org/dc/elements/1.1/>
select distinct ?item ?date ?class
where {
    ?work cdm:date_creation_legacy ?date.
    ?w a ?class .
    ?expr cdm:expression_belongs_to_work ?work ;
        cdm:expression_uses_language ?lang .
    ?lang purl:identifier ?langCode .
    ?manif cdm:manifestation_manifests_expression ?expr;
        cdm:manifestation_type "pdfa1a".
    ?item cdm:item_belongs_to_manifestation ?manif.
    FILTER (
        ?date > "2016-05-23T10:20:13+05:30"^^xsd:dateTime
        AND
        ?date < "2020-05-23T10:20:13+05:30"^^xsd:dateTime
    ).
    FILTER(
        ?class in (
        <http://publications.europa.eu/ontology/cdm#document_cjeu>,
        <http://publications.europa.eu/ontology/cdm#case-law>,
        <http://publications.europa.eu/ontology/cdm#summary_caselaw>,
        <http://publications.europa.eu/ontology/cdm#summary_caselaw_jure>
        )
    )
    FILTER(STR(?langCode) = "ENG")
}
LIMIT 10
"""

In [None]:
### Select format? = html or xhtml


query = """
prefix cdm: <http://publications.europa.eu/ontology/cdm#>
prefix purl: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?item ?expr_title ?celex ?date ?class ?className ?authorName #  ?subject
where {
    ?work cdm:work_date_document ?date ;
        rdf:type cdm:resource_legal .
    ?w a ?class .
    ?class rdfs:subClassOf cdm:resource_legal .
    ?expr cdm:expression_belongs_to_work ?work ;
        cdm:expression_title ?expr_title ;
        cdm:expression_uses_language ?lang .
    ?lang purl:identifier ?langCode .
    ?manif cdm:manifestation_manifests_expression ?expr;
        # cdm:manifestation_type "pdfa1a".
        cdm:manifestation_type "xhtml".
    ?item cdm:item_belongs_to_manifestation ?manif.
    
    FILTER (
        ?date > "2000-01-01T23:59:59+08:00"^^xsd:dateTime
        AND
        ?date < "2025-08-01T23:59:59+08:00"^^xsd:dateTime
    ).

    FILTER(STR(?langCode) = "ENG")

    OPTIONAL { ?work cdm:work_id_celex ?celex . }
    
    BIND(
        IF(CONTAINS(STR(?class), "#"),
            STRAFTER(STR(?class), "#"),
            STRAFTER(STR(?class), "/")
        ) AS ?className
    )
    
    OPTIONAL {
        GRAPH ?gagent {
            {
            ?work cdm:work_contributed_to_by_agent ?author .
            } UNION {
            ?work cdm:work_created_by_agent ?author .
            } UNION {
            ?work cdm:work_authored_by_agent ?author .
            }
        }
    
        
    OPTIONAL {
        GRAPH ?glabel {
        ?author skos:prefLabel ?authorName .
        FILTER(LANG(?authorName) = "en")
        }
    }
    }

}
ORDER BY ?date
"""

In [23]:
sparql.setQuery(query)
results = sparql.query().convert()
results["results"]["bindings"]

[{'item': {'type': 'uri',
   'value': 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1'},
  'expr_title': {'type': 'literal',
   'value': 'Opinion of Advocate General Sharpston delivered on 13 March 2008. # Ecotrade SpA v Agenzia delle Entrate - Ufficio di Genova 3. # Reference for a preliminary ruling: Commissione tributaria provinciale di Genova - Italy. # Sixth VAT Directive - Reverse charge procedure - Right to deduct - Time-bar - Irregularity in accounts and tax returns affecting transactions subject to the reverse charge procedure. # Joined cases C-95/07 and C-96/07.'},
  'date': {'type': 'typed-literal',
   'datatype': 'http://www.w3.org/2001/XMLSchema#date',
   'value': '2008-03-13'},
  'class': {'type': 'uri',
   'value': 'http://publications.europa.eu/ontology/cdm#act_consolidated'},
  'className': {'type': 'literal', 'value': 'act_consolidated'},
  'authorName': {'type': 'literal',
   'xml:lang': 'en',
   'value': 'Court of Ju

In [24]:
for result in results["results"]["bindings"]:
    print(result["celex"]["value"])

KeyError: 'celex'

In [20]:
def clean_results(results):
    results_clean = [
        {k:v["value"] for k,v in item.items()}
        for item in results["results"]["bindings"]
    ]
    results_clean = pd.DataFrame(results_clean)
    return results_clean

In [27]:
items = clean_results(results)
print(items.shape[0])
items.head(3)

32


Unnamed: 0,item,expr_title,date,class,className,authorName
0,http://publications.europa.eu/resource/cellar/...,Opinion of Advocate General Sharpston delivere...,2008-03-13,http://publications.europa.eu/ontology/cdm#act...,act_consolidated,Court of Justice
1,http://publications.europa.eu/resource/cellar/...,Opinion of Advocate General Sharpston delivere...,2008-03-13,http://publications.europa.eu/ontology/cdm#act...,act_preparatory,Court of Justice
2,http://publications.europa.eu/resource/cellar/...,Opinion of Advocate General Sharpston delivere...,2008-03-13,http://publications.europa.eu/ontology/cdm#agr...,agreement_international,Court of Justice


In [31]:
items['item'].to_list()

['http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.0

In [34]:
results["results"]["bindings"]

[{'item': {'type': 'uri',
   'value': 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1'},
  'expr_title': {'type': 'literal',
   'value': 'Opinion of Advocate General Sharpston delivered on 13 March 2008. # Ecotrade SpA v Agenzia delle Entrate - Ufficio di Genova 3. # Reference for a preliminary ruling: Commissione tributaria provinciale di Genova - Italy. # Sixth VAT Directive - Reverse charge procedure - Right to deduct - Time-bar - Irregularity in accounts and tax returns affecting transactions subject to the reverse charge procedure. # Joined cases C-95/07 and C-96/07.'},
  'date': {'type': 'typed-literal',
   'datatype': 'http://www.w3.org/2001/XMLSchema#date',
   'value': '2008-03-13'},
  'class': {'type': 'uri',
   'value': 'http://publications.europa.eu/ontology/cdm#act_consolidated'},
  'className': {'type': 'literal', 'value': 'act_consolidated'},
  'authorName': {'type': 'literal',
   'xml:lang': 'en',
   'value': 'Court of Ju

In [37]:
os.makedirs("data/api_download", exist_ok=True)
# Call to the  API to download the digital file
unique_items = set([r["item"]["value"] for r in results["results"]["bindings"]])
for doc_url in unique_items:
    item = '.'.join(doc_url.split("/")[-2:])
    file_name = "data/api_download/" + item + ".xhtml"
    try:
        response = requests.get(doc_url)
    except Exception as e:
        print(f"Error downloading {doc_url}: {e}")
        continue
    
    with open(file_name, 'w', encoding='utf-8') as file:
        file.write(response.text)
        print(f"Downloaded: {file_name}")

Downloaded: data/api_download/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02.DOC_1.xhtml
Downloaded: data/api_download/3d4572e0-3dcd-432d-85ad-ad918607176b.0020.02.DOC_1.xhtml


# Datadump
User can force downloading all the Legal Acts and metadata in the Datadump Service.
https://datadump.publications.europa.eu/create-download-request

`LEG_EN_HTML` contains all the Legal Acts
    > {work-id}
    > {file}

`LEG_MTD` contains the metadata
    > {work-id}
    > {mtdfile}

Every {work-id} might have multiple files, but only has one metadata file.

In [5]:
def datadump_la_hierarchy(folder):
    wids = os.listdir(folder)
    if '.DS_Store' in wids:
        wids.remove('.DS_Store')
    rows = []
    for wid in wids:
        formats = os.listdir(folder + "/" + wid)
        for f in formats:
            docs = os.listdir(folder + "/" + wid + "/" + f)
            for doc in docs:
                rows.append({"work-id": wid, "format": f, "doc": doc})
    return pd.DataFrame(rows)

def datadump_mtd_hierarchy(folder):
    wids = os.listdir(folder)
    if '.DS_Store' in wids:
        wids.remove('.DS_Store')
    rows = []
    for wid in wids:
        docs = os.listdir(folder + "/" + wid)
        for doc in docs:
            rows.append({"work-id": wid, "mtd": doc})
    return pd.DataFrame(rows)

In [6]:
legal_act_folder = os.getenv("EU_LEGAL_ACT_PATH")
metadata_folder = os.getenv("EU_METADATA_PATH")
legal_act_files = datadump_la_hierarchy(legal_act_folder)
metadata_files = datadump_mtd_hierarchy(metadata_folder)

In [11]:
# Some folders have jpg files in addition to html files
# e.g. work-id = "bb5339f8-f387-4c05-a895-1a64f898413c"
# xhtml L_1998069EN.01000101.doc.html
# html 31998D0181en.html

# Get file format suffix, only keep .html
legal_act_files['suffix'] = legal_act_files['doc'].apply(lambda x: x.split('.')[-1])
print(legal_act_files['suffix'].value_counts())
legal_act_files = legal_act_files[legal_act_files['suffix'] == 'html']

suffix
html    43056
jpg      3604
ent         3
0004        1
0005        1
Name: count, dtype: int64


## Find all sameAs links in metadata

In [None]:
# # <j.0:work_is_about_concept_eurovoc rdf:resource="http://eurovoc.europa.eu/{code}"/>
# import rdflib
# g = rdflib.Graph()
# g.parse(metadata_folder + "/bb5339f8-f387-4c05-a895-1a64f898413c/tree_non_inferred.rdf", format="xml")
# eurovoc_codes = []
# same_as = []
# for idx, (subj, pred, obj) in enumerate(g):
#     if pred == rdflib.URIRef("http://publications.europa.eu/ontology/cdm#work_is_about_concept_eurovoc"):
#         code = obj.split("/")[-1]
#         # print(f"subj: {subj}")
#         # print(f"pred: {pred}")
#         # print(f"code: {code}")
#         eurovoc_codes.append(code)
#     elif pred == rdflib.URIRef("http://www.w3.org/2002/07/owl#sameAs"):
#         same_as.append({"subject": subj, "object": obj})

In [None]:
# same_as_df = pd.DataFrame(same_as).sort_values('subject').reset_index(drop=True)
# same_as_df_sorted = same_as_df['subject'].value_counts().reset_index().sort_values('subject')
# same_as_df_sorted

## Find all eurovoc codes in metadata

In [None]:
# os.makedirs("data/eurovoc/metadata_mapping", exist_ok=True)
# eurovoc_codes = []
# for idx, row in tqdm(metadata_files.iterrows(), total=metadata_files.shape[0], desc="Examine metadata"):
#     work_id = row["work-id"]
#     g = rdflib.Graph()
#     g.parse(metadata_folder + "/" + work_id + "/tree_non_inferred.rdf", format="xml")
#     for i, (subj, pred, obj) in enumerate(g):
#         if pred == rdflib.URIRef("http://publications.europa.eu/ontology/cdm#work_is_about_concept_eurovoc"):
#             code = obj.split("/")[-1]
#             eurovoc_codes.append({"work-id": work_id, "eurovoc-code": code})
#     if idx % 1000 == 999:
#         with open(f"data/eurovoc/metadata_mapping/{idx}.json", "w") as f:
#             json.dump(eurovoc_codes, f, indent=4)
#         print(f"Processed until index {idx}, saved to data/eurovoc/metadata_mapping/{idx}.json")
#         print(f"Length of this file: {len(eurovoc_codes)}")
#         eurovoc_codes = []
# if eurovoc_codes:
#     with open(f"data/eurovoc/metadata_mapping/{idx}.json", 'w') as f:
#         json.dump(eurovoc_codes, f, indent=4)
#     print(f"Processed until index {idx}, saved to data/eurovoc/metadata_mapping/{idx}.json")
#     print(f"Length of this file: {len(eurovoc_codes)}")

# eurovoc_codes = pd.DataFrame(eurovoc_codes)

Examine metadata:   0%|          | 5/55367 [00:02<6:16:55,  2.45it/s]

Processed until index 4, saved to data/eurovoc/metadata_mapping/4.json
Length of this file: 26





## Find all eurovoc codes in metadata (parallel)

In [4]:
def parallel_extract(batch_func, df, num_processes=4):
    chunks = np.array_split(df, num_processes)

    with Pool(processes=num_processes) as pool:
        pool.map(batch_func, chunks)

In [None]:
# from helper.preprocess import get_eurovoc_batch
# output_folder = os.getenv("EU_WORK_EUROVOC_MAPPING_PATH")
# os.makedirs(output_folder, exist_ok=True)
# parallel_extract(get_eurovoc_batch, metadata_files, num_processes=4)

[28496] Saved batch 0 with 5441 records
[28498] Saved batch 0 with 5474 records
[28499] Saved batch 0 with 5573 records
[28497] Saved batch 0 with 5327 records
[28496] Saved batch 1 with 5339 records
[28498] Saved batch 1 with 5496 records
[28499] Saved batch 1 with 5542 records
[28497] Saved batch 1 with 5386 records
[28496] Saved batch 2 with 5477 records
[28499] Saved batch 2 with 5232 records
[28498] Saved batch 2 with 5567 records
[28497] Saved batch 2 with 5324 records
[28498] Saved batch 3 with 5543 records
[28496] Saved batch 3 with 5481 records
[28499] Saved batch 3 with 5485 records
[28497] Saved batch 3 with 5361 records
[28496] Saved batch 4 with 5521 records
[28498] Saved batch 4 with 5467 records
[28499] Saved batch 4 with 5375 records
[28497] Saved batch 4 with 5487 records
[28496] Saved batch 5 with 5459 records
[28499] Saved batch 5 with 5450 records
[28498] Saved batch 5 with 5603 records
[28497] Saved batch 5 with 5494 records
[28496] Saved batch 6 with 5320 records


In [None]:
# Find concepts behind eurovoc codes
# Save work-concepts mappings

output_folder = os.getenv("EU_WORK_EUROVOC_MAPPING_PATH")
eurovoc_mapping_files = os.listdir(output_folder)
eurovoc_mapping = []
for file in eurovoc_mapping_files:
    with open(os.path.join(output_folder, file), 'r') as f:
        eurovoc_mapping.extend(json.load(f))
eurovoc_mapping = pd.DataFrame(eurovoc_mapping).sort_values('work-id').reset_index(drop=True)
eurovoc_mapping.head()

Unnamed: 0,work-id,eurovoc-code
0,00000dbc-76cd-11ed-9887-01aa75ed71a1,2136
1,00000dbc-76cd-11ed-9887-01aa75ed71a1,4828
2,00000dbc-76cd-11ed-9887-01aa75ed71a1,3648
3,00000dbc-76cd-11ed-9887-01aa75ed71a1,2449
4,00000dbc-76cd-11ed-9887-01aa75ed71a1,1474


In [29]:
eurovoc = pd.read_excel("data/eurovoc/EuroVoc_Excel_export/eurovoc_export_en.xlsx", sheet_name="en")
eurovoc["MT-code"] = eurovoc["MT"].apply(lambda x: x[0:5])
eurovoc["MT-term"] = eurovoc["MT"].apply(lambda x: x[5:])

In [None]:
# TERMS (PT-NPT) is the sepecific term for the concept
# e.g. "5485" is "European charter"
# MT is the main term (domain) fot the concept
# e.g. "5485" belongs to "0806 international affairs"
eurovoc[eurovoc["ID"].isin(["5485","2498","5344","5327", "2470","5420"]) & eurovoc["PT"].isna()].sort_values("MT")

Unnamed: 0,ID,TERMS (PT-NPT),RELATIONS,PT,MT
1916,5485,European charter,,,0806 international affairs
4666,5420,accession to an agreement,,,0806 international affairs
1085,5344,EAEC,,,1016 European construction
1218,5327,ECSC,,,1016 European construction
8699,2470,environmental policy,,,5206 environmental policy
8618,2498,energy policy,,,6606 energy policy


In [30]:
eurovoc_simple = eurovoc[eurovoc["PT"].isna()].drop(columns=["RELATIONS", "PT"])
eurovoc_mapping_concepts = eurovoc_mapping.merge(eurovoc_simple, left_on="eurovoc-code", right_on="ID", how="left").drop(columns=["ID"])
# eurovoc_mapping_concepts

In [None]:
legal_act_concepts = legal_act_files.merge(eurovoc_mapping_concepts, on="work-id", how="left")
# legal_act_concepts
la_concepts_file = os.getenv("EU_LEGAL_ACT_EUROVOC_CONCEPTS_FILE")
legal_act_concepts.to_csv(la_concepts_file, index=False)

## Find all celex numbers in metadata (parallel)

In [None]:
# from helper.preprocess import get_celex_batch
# output_folder = os.getenv("EU_WORK_CELEX_MAPPING_PATH")
# os.makedirs(output_folder, exist_ok=True)
# parallel_extract(get_celex_batch, metadata_files, num_processes=4)

  return bound(*args, **kwds)


[64636] Saved batch 0 with 158 records
[64638] Saved batch 0 with 316 records
[64637] Saved batch 0 with 474 records
[64639] Saved batch 0 with 1000 records
[64636] Saved batch 1 with 1000 records
[64638] Saved batch 1 with 1000 records
[64637] Saved batch 1 with 1000 records
[64639] Saved batch 1 with 1000 records
[64636] Saved batch 2 with 1000 records
[64638] Saved batch 2 with 1000 records
[64637] Saved batch 2 with 1000 records
[64639] Saved batch 2 with 1000 records
[64636] Saved batch 3 with 1000 records
[64638] Saved batch 3 with 1000 records
[64637] Saved batch 3 with 1000 records
[64639] Saved batch 3 with 1000 records
[64636] Saved batch 4 with 1000 records
[64638] Saved batch 4 with 1000 records
[64637] Saved batch 4 with 1000 records
[64639] Saved batch 4 with 1000 records
[64636] Saved batch 5 with 1000 records
[64638] Saved batch 5 with 1000 records
[64637] Saved batch 5 with 1000 records
[64639] Saved batch 5 with 1000 records
[64636] Saved batch 6 with 1000 records
[64

In [23]:
output_folder = os.getenv("EU_WORK_CELEX_MAPPING_PATH")
celex_mapping_files = os.listdir(output_folder)
celex_mapping = []
for file in celex_mapping_files:
    with open(os.path.join(output_folder, file), 'r') as f:
        celex_mapping.extend(json.load(f))
celex_mapping = pd.DataFrame(celex_mapping).sort_values('work-id').reset_index(drop=True)
celex_mapping_file = os.getenv("EU_WORK_CELEX_MAPPING_FILE")
celex_mapping.to_csv(celex_mapping_file, index=False)

IsADirectoryError: [Errno 21] Is a directory: 'data/celex/work_celex_mapping'

# Process File
backend/data/LEG_EN_HTML_20250721_04_08/e997cfd6-3626-4170-9ded-da698d3af63c/xhtml/L_2011273EN.01000101.doc.html
- save metadata
- look for useful html tags
- save text
- preprocess text

In [12]:
legal_act_files.head(2)

Unnamed: 0,work-id,format,doc,suffix
0,1a1e8486-a474-11e9-9d01-01aa75ed71a1,xhtml,L_2019187EN.01004101.doc.html,html
1,5fa72f58-9564-4ebe-a5a5-853e206ae2ed,html,32000R0212en.html,html


In [None]:
# Continue from Datadump - get hierarchy
for idx, legal_act in legal_act_files.iterrows():
    work_id = legal_act["work-id"]
    expr_id = "ENG"
    man_id = legal_act["format"]
    cs_id = legal_act["doc"]
    doc_path = f"data/LEG_EN_HTML_20250721_04_08/{work_id}/{man_id}/{cs_id}"
    mtd_path = f"data/LEG_MTD_20250709_22_36/{work_id}/tree_non_inferred.rdf"
    celex_number = None

    # Find Celex number in the metadata file
    g = rdflib.Graph()
    try:
        g.parse(mtd_path, format="xml")
    except Exception as e:
        print(f"Error parsing RDF file for work ID {work_id}: {e}")
        continue

    for subj, pred, obj in g:
        if str(pred) in [
            "http://publications.europa.eu/ontology/cdm#resource_legal_id_celex",
            "http://publications.europa.eu/ontology/cdm#celex_number"
        ]:
            print("Celex number:", obj)
            celex_number = obj
            break
    
    if celex_number is None:
        print(f"No Celex number found for work ID: {work_id}")
        continue

    doc_celex_uri = rdflib.URIRef(f"http://publications.europa.eu/resource/celex/{celex_number}.{expr_id}.{man_id}.{cs_id}")

    from langchain_community.document_loaders import BSHTMLLoader

    loader = BSHTMLLoader(
        file_path=doc_path
    )

    docs = loader.load()

## Extract metadata

In [18]:
from langchain_community.document_loaders import BSHTMLLoader

In [21]:
work_id = "e997cfd6-3626-4170-9ded-da698d3af63c"
expr_id = "ENG"
man_id = "xhtml"
cs_id = "L_2011273EN.01000101.doc.html"
doc_path = f"data/LEG_EN_HTML_20250721_04_08/{work_id}/{man_id}/{cs_id}"
mtd_path = f"data/LEG_MTD_20250709_22_36/{work_id}/tree_non_inferred.rdf"
celex_number = None
eurovoc_code = []

In [22]:
# Find Celex number in the metadata file
g = rdflib.Graph()
if os.path.exists(mtd_path):
    try:
        g.parse(mtd_path, format="xml")
    except Exception as e:
        print(f"Error parsing RDF file for work ID {work_id}: {e}")
        # continue
else:
    print(f"Metadata file not found for work ID {work_id}: {mtd_path}")

In [23]:
for subj, pred, obj in g:
    if not celex_number and str(pred) in [
        "http://publications.europa.eu/ontology/cdm#resource_legal_id_celex",
        "http://publications.europa.eu/ontology/cdm#celex_number"
    ]:
        print("Celex number:", obj)
        celex_number = obj
    elif str(pred) == "http://publications.europa.eu/ontology/cdm#work_is_about_concept_eurovoc":
        code = obj.split("/")[-1]
        eurovoc_code.append(code)
        print("Eurovoc: ", code)

if celex_number is None:
    print(f"No Celex number found for work ID: {work_id}")
    # continue

doc_celex_uri = rdflib.URIRef(f"http://publications.europa.eu/resource/celex/{celex_number}.{expr_id}.{man_id}.{cs_id}")

Eurovoc:  1252
Eurovoc:  5040
Celex number: 32011D0694
Eurovoc:  2901
Eurovoc:  4408
Eurovoc:  5887
Eurovoc:  1474
Eurovoc:  225


In [None]:
# Load the document and its metadata
loader = BSHTMLLoader(file_path=doc_path)
docs = loader.load()
docs[0].metadata
# docs[0].metadata.update(meta)


Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(f, **self.bs_kwargs)


{'source': 'data/LEG_EN_HTML_20250721_04_08/e997cfd6-3626-4170-9ded-da698d3af63c/xhtml/L_2011273EN.01000101.doc.html',
 'title': 'L_2011273EN.01000101.xml'}

++++++++++++++++

In [14]:
# Find Celex number in the metadata file
g = rdflib.Graph()
g.parse(mtd_path, format="xml")

for subj, pred, obj in g:
    if str(pred) in [
        "http://publications.europa.eu/ontology/cdm#resource_legal_id_celex",
        "http://publications.europa.eu/ontology/cdm#celex_number"
    ]:
        print("Celex number:", obj)
        celex_number = obj
        break

FileNotFoundError: [Errno 2] No such file or directory: '/Users/demouser/Library/CloudStorage/OneDrive-NationalUniversityofSingapore/Reg-Guru/data/LEG_MTD_20250709_22_36/5fa72f58-9564-4ebe-a5a5-853e206ae2ed/tree_non_inferred.rdf'

In [None]:
doc_celex_uri = rdflib.URIRef(f"http://publications.europa.eu/resource/celex/{celex_number}.{expr_id}.{man_id}.{cs_id}")
# "http://publications.europa.eu/resource/celex/32011D0694.ENG.xhtml.L_2011273EN.01000101.doc.html"
print("Document Celex URI:", doc_celex_uri)

Document Celex URI: http://publications.europa.eu/resource/celex/32011D0694.ENG.xhtml.L_2011273EN.01000101.doc.html


In [76]:
g.bind("rdfs", rdflib.RDFS)
g.bind("skos", rdflib.Namespace("http://www.w3.org/2004/02/skos/core#"))
from rdflib.namespace import RDFS, SKOS

def get_label(uri):
    label = None
    # Check if label is already in the graph
    for o in g.objects(subject=uri, predicate=RDFS.label):
        return str(o)
    for o in g.objects(subject=uri, predicate=SKOS.prefLabel):
        return str(o)
    # Fallback: extract local name
    return uri.split("#")[-1] if "#" in uri else uri.split("/")[-1]
print("Document Celex URI:", doc_celex_uri, "\n")
print("========== Subject - Predicate ==========")
for subj, pred in g.subject_predicates(object=doc_celex_uri):
    pred_label = get_label(pred)
    print(f"Subject: {subj}")
    print(f"Predicate: {pred_label} ({pred})")
print("========== Predicate - Object ==========")
for pred, obj in g.predicate_objects(subject=doc_celex_uri):
    print(f"Predicate: {pred_label} ({pred})")
    print(f"Object: {obj}")

Document Celex URI: http://publications.europa.eu/resource/celex/32011D0694.ENG.xhtml.L_2011273EN.01000101.doc.html 

Subject: http://publications.europa.eu/resource/cellar/e997cfd6-3626-4170-9ded-da698d3af63c.0021.02/DOC_1
Predicate: sameAs (http://www.w3.org/2002/07/owl#sameAs)


In [None]:
# Extract metadata from XML
with open(doc_path, "r") as f:
    soup = BeautifulSoup(f, "lxml")
# print(soup.prettify())
meta = soup.meta.attrs

# Extract metadata from RDF (Eurovoc)
legal_act_concepts = pd.read_csv("data/eurovoc/legal_act_concepts.csv")
meta_download = legal_act_concepts[(legal_act_concepts["work-id"] == work_id) & (legal_act_concepts["doc"] == cs_id)]

# Mark: If new files are downloaded, look for Eurovoc codes one by one
meta['eurovoc-terms'] = ';'.join(meta_download["TERMS (PT-NPT)"].unique())
meta['eurovoc-mt'] = ';'.join(meta_download["MT"].unique())
meta



Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(f, "lxml")


{'http-equiv': 'content-type',
 'content': 'text/html; charset=utf-8',
 'eurovoc-terms': 'civil aviation;air safety;approval;ratification of an agreement;technical cooperation;Brazil;agreement (EU)',
 'eurovoc-mt': '4826 air and space transport;4806 transport policy;6411 technology and technical regulations;0806 international affairs;0811 cooperation policy;7216 America;7231 economic geography;7236 political geography;1016 European construction'}

In [14]:
text = soup.get_text(separator="\n", strip=True)

## Loader

In [83]:
docs[0].metadata.update(meta)

In [85]:
docs

[Document(metadata={'source': 'data/LEG_EN_HTML_20250721_04_08/e997cfd6-3626-4170-9ded-da698d3af63c/xhtml/L_2011273EN.01000101.doc.html', 'title': 'L_2011273EN.01000101.xml', 'http-equiv': 'content-type', 'content': 'text/html; charset=utf-8', 'eurovoc-terms': 'civil aviation;air safety;approval;ratification of an agreement;technical cooperation;Brazil;agreement (EU)', 'eurovoc-mt': '4826 air and space transport;4806 transport policy;6411 technology and technical regulations;0806 international affairs;0811 cooperation policy;7216 America;7231 economic geography;7236 political geography;1016 European construction'}, page_content='\n\n\n\n\nL_2011273EN.01000101.xml\n\n\n\n\n\n\n\n\n\n\n19.10.2011\xa0\xa0\xa0\n\n\nEN\n\n\nOfficial Journal of the European Union\n\n\nL 273/1\n\n\n\n\n\nCOUNCIL DECISION\nof 26 September 2011\non the conclusion of an Agreement between the European Union and the Government of the Federative Republic of Brazil on civil aviation safety\n(2011/694/EU)\nTHE COUNCI

In [16]:
soup.find("meta")

<meta content="text/html; charset=utf-8" http-equiv="content-type"/>

In [15]:
print(text)

L_2011273EN.01000101.xml
19.10.2011
EN
Official Journal of the European Union
L 273/1
COUNCIL DECISION
of 26 September 2011
on the conclusion of an Agreement between the European Union and the Government of the Federative Republic of Brazil on civil aviation safety
(2011/694/EU)
THE COUNCIL OF THE EUROPEAN UNION,
Having regard to the Treaty on the Functioning of the European Union, and in particular Article 100(2) and the first subparagraph of Article 207(4), in conjunction with Article 218(6)(a) and Article 218(7) and the first subparagraph of Article 218(8), thereof,
Having regard to the proposal from the European Commission,
Having regard to the consent of the European Parliament,
Whereas:
(1)
The Commission has negotiated, on behalf of the European Union, an Agreement on civil aviation safety with the Government of the Federative Republic of Brazil in accordance with the Council Decision authorising the Commission to open negotiations.
(2)
The Agreement between the European Union a

## Split text

In [74]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Slightly larger chunks
    chunk_overlap=200,
    length_function=len
)

In [None]:
from langchain_community.document_loaders import PyPDFLoader, TextLoader, Docx2txtLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from typing import List
import os

class DocumentUploader:
    def __init__(self, vectorstore_directory: str = "Database"):
        """
        Initialize DocumentUploader with the directory for the vector store.
        
        Args:
            vectorstore_directory: Directory to store vector databases (default: "Database")
        """
        self.vectorstore_directory = os.path.abspath(vectorstore_directory)
        os.makedirs(self.vectorstore_directory, exist_ok=True)
        self.embeddings = OpenAIEmbeddings()  # Initialize embeddings once

    def _get_loader(self, file_path: str):
        """Determine the appropriate loader based on file extension"""
        file_extension = os.path.splitext(file_path)[1].lower()
        loader_map = {
            '.pdf': PyPDFLoader,
            '.txt': TextLoader,
            '.docx': Docx2txtLoader
        }
        return loader_map.get(file_extension)

    def upload_documents(self, file_paths: List[str]):
        """
        Process and upload multiple documents to the vector store.
        
        Args:
            file_paths: List of file paths to process
            
        Returns:
            Tuple: (success_count, error_count)
        """
        success_count = 0
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,  # Slightly larger chunks
            chunk_overlap=200,
            length_function=len
        )

        for file_path in file_paths:
            try:
                if not os.path.exists(file_path):
                    print(f"File not found: {file_path}")
                    continue

                Loader = self._get_loader(file_path)
                if not Loader:
                    print(f"Unsupported file type: {file_path}")
                    continue

                # Load and split documents
                documents = Loader(file_path).load()
                chunks = text_splitter.split_documents(documents)

                # Create or update FAISS index
                if os.path.exists(os.path.join(self.vectorstore_directory, "index.faiss")):
                    # Load existing vectorstore and merge new documents
                    vectorstore = FAISS.load_local(
                        folder_path=self.vectorstore_directory,
                        embeddings=self.embeddings,
                        allow_dangerous_deserialization=True
                    )
                    vectorstore.add_documents(chunks)
                else:
                    # Create new vectorstore
                    vectorstore = FAISS.from_documents(chunks, self.embeddings)

                # Save the updated vectorstore
                vectorstore.save_local(folder_path=self.vectorstore_directory)
                success_count += 1
                print(f"Successfully processed: {file_path}")

            except Exception as e:
                print(f"Error processing {file_path}: {str(e)}")

        return success_count, len(file_paths) - success_count

# ChromaDB

`chroma run` runs the HTTP Client at localhost:8000

In [2]:
import chromadb

In [5]:
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(name="legal")

In [None]:
for 

In [None]:
# # Set your OPENAI_API_KEY environment variable
# from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# collection = client.create_collection(
#     name="my_collection",
#     embedding_function=OpenAIEmbeddingFunction(
#         model_name="text-embedding-3-small"
#     )
# )


collection.add(
    ids=["id1", "id2"],
    documents=["doc1", "doc2"]
)

In [None]:
collection.add(
    ids=["1", "2", "3", "4", "5", "6", "7"],
    documents=[
        "apple",
        "banana",
        "pineapple",
        "mango",
        "dragonfruit",
        "passionfruit",
        "raspberry"
    ],
    metadatas=[
        { "color": "red", "weight": 180 },
        { "color": "yellow", "weight": 120 },
        { "color": "brown", "weight": 900 },
        { "color": "yellow", "weight": 200 },
        { "color": "pink", "weight": 600 },
        { "color": "purple", "weight": 18 },
        { "color": "red", "weight": 4 },
    ]
)

# Read RDF

In [None]:
import rdflib
g = rdflib.Graph()
g.parse("data/cdm-4.13.2/euvoc.rdf", format="xml")
for idx, (subj, pred, obj) in enumerate(g):
    print(f"Index: {idx} | subj: {subj}")
    print(f"Index: {idx} | pred: {pred}")
    print(f"Index: {idx} | obj : {obj}")