# Tips
## 1. **Cellar - WEMI codes based on FRBR** 
- (Check https://op.europa.eu/documents/d/cellar/cellar-end-user-manual_eec84490f0b94079960fcf6919271c37-280824-1601-502)
- Work: An abstract intellectual idea
- Expression: A realisation - text version or translation - of Work, in a specific language
- Manifestation: An instantiation - file or format - of Expression
- Item / Content Stream: The actual document in the specified language and format
## 2. **Resource URL**
- **http://publications.europa.eu/resource/{ps-name}/{ps-id}**
- When type = content stream
- example 1: http://publications.europa.eu/resource/cellar/550e8400-e29b-41d4-a716-446655440000.0001.03/DOC_1
- ps-name is cellar
- ps-id = {work-id}.{expr-id}.{man-id}/{cs-id} = 550e8400-e29b-41d4-a716-446655440000.0001.03/DOC_1
- The work-id is a Universally Unique Identifier (UUID)
- example 2: http://publications.europa.eu/resource/celex/32006D0241.FRA.fmx4.L_2006088FR.01006402.xml
- ps-name is not cellar (here celex)
- ps_id = {work-id}.{expr-id}.{man-id}.{cs-id} = 32006D0241.FRA.fmx4.L_2006088FR.01006402.xml
- The work-id is an alphanumeric value (here a celex number)
## 3. **Celex number on EUR-Lex**
- (Check https://eur-lex.europa.eu/content/tools/HowCelexNumbersAreComposed.pdf)
- {sector}{year}{doc-type (descriptors by sector)}{doc-number}
- example: 32006D0241
- sector = 3 (Legislation)
- year = 2006
- doc-type = D (Decisions)
- doc-number = 0060
## 4. **Cellar ontology / Common Data Model (CDM)**
- (Check documentation: https://publications.europa.eu/resource/cellar/fb442510-0826-11ed-b11c-01aa75ed71a1.0001.02/DOC_1)
- http://publications.europa.eu/ontology/cdm#
## 5. **Eurovoc**
- (Check handbook: file:///Users/demouser/Downloads/EuroVoc-Handbook.pdf)
- Download in https://op.europa.eu/en/web/eu-vocabularies/concept-scheme/-/resource?uri=http://eurovoc.europa.eu/100141
- It defines the domains / thesaurus that a document is related to.
- A document can be tagged to multiple eurovoc tags.
- Every domain has a identification code.
- Search in the metadata file of a Work for 
    - <j.0:work_is_about_concept_eurovoc rdf:resource="http://eurovoc.europa.eu/{code}"/>
    - Use {code} to identify the Eurovoc term.
## 6. **SPARQL**
- (Check https://publications.europa.eu/webapi/rdf/sparql)
- Query data from the database

# Metadata
The metadata to keep:
- creation date
- domain: from eurovoc
- author/publisher
- doc-type: from celex number
- version

How to get metadata:
- Collect <meta ...> in html
- Check metadata rdf file


# Library

uk API Key: 4a5e1b6dcb7f818746d6d51ce69f5331

In [1]:
from SPARQLWrapper import SPARQLWrapper, JSON 
import requests 
import pandas as pd
import numpy as np
import os
import json
import rdflib
from rdflib.namespace import RDFS, SKOS

import chromadb
from bs4 import BeautifulSoup
from multiprocessing import Pool
from tqdm import tqdm
from helper.eurlex import Eurlex
from dotenv import load_dotenv
load_dotenv(override=True)

True

In [55]:
pd.set_option('display.max_rows', 100)

# SPARQL Query

In [2]:
sparql = SPARQLWrapper("https://publications.europa.eu/webapi/rdf/sparql")
sparql.setReturnFormat(JSON)

In [25]:
query = """
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT 
  DISTINCT (GROUP_CONCAT(DISTINCT STR(?work); SEPARATOR=",") AS ?cellarURIs)
  (GROUP_CONCAT(DISTINCT STR(?title_); SEPARATOR=",") AS ?title)
  ?langIdentifier
  (GROUP_CONCAT(DISTINCT STR(?mtype); SEPARATOR=",") AS ?mtypes)
  (GROUP_CONCAT(DISTINCT STR(?thumbnail); SEPARATOR=",") AS ?thumbnails)
  (GROUP_CONCAT(DISTINCT STR(?resType); SEPARATOR=",") AS ?workTypes)
  (GROUP_CONCAT(DISTINCT STR(?agentName); SEPARATOR=",") AS ?authors)
  (GROUP_CONCAT(DISTINCT STR(?privateAgentName); SEPARATOR=";") AS ?privateAuthors)
  ?date
  (GROUP_CONCAT(DISTINCT STR(?subjectLabel); SEPARATOR=",") AS ?subjects)
  (GROUP_CONCAT(DISTINCT STR(?workId_); SEPARATOR=",") AS ?workIds)
WHERE {
  GRAPH ?gw {
    ?work rdf:type ?resType ;
          cdm:work_date_document ?date ;
          cdm:work_id_document ?workId_ ;
          cdm:work_is_about_concept_eurovoc ?subject .
    FILTER(?resType = cdm:resource_legal)
    FILTER(xsd:date(?date) < "2023-01-01"^^xsd:date)
    
    GRAPH ?gs {
      ?subject skos:prefLabel ?subjectLabel .
      FILTER(LANG(?subjectLabel) = "en")
    }
  }

  GRAPH ?eg {
    ?exp cdm:expression_belongs_to_work ?work ;
         cdm:expression_title ?title_ ;
         cdm:expression_uses_language ?lg .
    FILTER(LANG(?title_) = "en" || LANG(?title_) = "eng" || LANG(?title_) = "")

    GRAPH ?lgc {
      ?lg dc:identifier ?langIdentifier .
      FILTER(STR(?langIdentifier) = "ENG")
    }
  }

  GRAPH ?gm {
    ?manif cdm:manifestation_manifests_expression ?exp ;
           cdm:manifestation_type ?mtype .
    OPTIONAL { ?manif cdm:manifestation_has_thumbnail ?thumbnail }
  }

  OPTIONAL {
    GRAPH ?gagent {
      {
        ?work cdm:work_contributed_to_by_agent ?agent .
      } UNION {
        ?work cdm:work_created_by_agent ?agent .
      } UNION {
        ?work cdm:work_authored_by_agent ?agent .
      }
    }
    GRAPH ?ga {
      ?agent skos:prefLabel ?agentName .
      FILTER(LANG(?agentName) = "en")
    }
  }

  OPTIONAL {
    GRAPH ?persAuthor {
      {
        ?work cdm:work_contributed_to_by_agent ?privateAgent .
      } UNION {
        ?work cdm:work_authored_by_agent ?privateAgent .
      }
      ?privateAgent rdf:type cdm:person ;
                    cdm:agent_name ?privateAgentName .
    }
  }
}
GROUP BY ?work ?date ?langIdentifier
ORDER BY DESC(?date)
LIMIT 100
"""

In [None]:
query = """
prefix cdm: <http://publications.europa.eu/ontology/cdm#>
prefix purl: <http://purl.org/dc/elements/1.1/>
select distinct ?item ?date ?class
where {
    ?work cdm:date_creation_legacy ?date.
    ?w a ?class .
    ?expr cdm:expression_belongs_to_work ?work ;
        cdm:expression_uses_language ?lang .
    ?lang purl:identifier ?langCode .
    ?manif cdm:manifestation_manifests_expression ?expr;
        cdm:manifestation_type "pdfa1a".
    ?item cdm:item_belongs_to_manifestation ?manif.
    FILTER (
        ?date > "2016-05-23T10:20:13+05:30"^^xsd:dateTime
        AND
        ?date < "2020-05-23T10:20:13+05:30"^^xsd:dateTime
    ).
    FILTER(
        ?class in (
        <http://publications.europa.eu/ontology/cdm#document_cjeu>,
        <http://publications.europa.eu/ontology/cdm#case-law>,
        <http://publications.europa.eu/ontology/cdm#summary_caselaw>,
        <http://publications.europa.eu/ontology/cdm#summary_caselaw_jure>
        )
    )
    FILTER(STR(?langCode) = "ENG")
}
LIMIT 10
"""

In [None]:
### Select format? = html or xhtml


query = """
prefix cdm: <http://publications.europa.eu/ontology/cdm#>
prefix purl: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?item ?expr_title ?celex ?date ?class ?className ?authorName #  ?subject
where {
    ?work cdm:work_date_document ?date ;
        rdf:type cdm:resource_legal .
    ?w a ?class .
    ?class rdfs:subClassOf cdm:resource_legal .
    ?expr cdm:expression_belongs_to_work ?work ;
        cdm:expression_title ?expr_title ;
        cdm:expression_uses_language ?lang .
    ?lang purl:identifier ?langCode .
    ?manif cdm:manifestation_manifests_expression ?expr;
        # cdm:manifestation_type "pdfa1a".
        cdm:manifestation_type "xhtml".
    ?item cdm:item_belongs_to_manifestation ?manif.
    
    FILTER (
        ?date > "2000-01-01T23:59:59+08:00"^^xsd:dateTime
        AND
        ?date < "2025-08-01T23:59:59+08:00"^^xsd:dateTime
    ).

    FILTER(STR(?langCode) = "ENG")

    OPTIONAL { ?work cdm:work_id_celex ?celex . }
    
    BIND(
        IF(CONTAINS(STR(?class), "#"),
            STRAFTER(STR(?class), "#"),
            STRAFTER(STR(?class), "/")
        ) AS ?className
    )
    
    OPTIONAL {
        GRAPH ?gagent {
            {
            ?work cdm:work_contributed_to_by_agent ?author .
            } UNION {
            ?work cdm:work_created_by_agent ?author .
            } UNION {
            ?work cdm:work_authored_by_agent ?author .
            }
        }
    
        
    OPTIONAL {
        GRAPH ?glabel {
        ?author skos:prefLabel ?authorName .
        FILTER(LANG(?authorName) = "en")
        }
    }
    }

}
ORDER BY ?date
"""

In [21]:
from rdflib import Namespace
Namespace("http://publications.europa.eu/ontology/cdm#")

Namespace('http://publications.europa.eu/ontology/cdm#')

In [23]:
sparql.setQuery(query)
results = sparql.query().convert()
results["results"]["bindings"]

[{'item': {'type': 'uri',
   'value': 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1'},
  'expr_title': {'type': 'literal',
   'value': 'Opinion of Advocate General Sharpston delivered on 13 March 2008. # Ecotrade SpA v Agenzia delle Entrate - Ufficio di Genova 3. # Reference for a preliminary ruling: Commissione tributaria provinciale di Genova - Italy. # Sixth VAT Directive - Reverse charge procedure - Right to deduct - Time-bar - Irregularity in accounts and tax returns affecting transactions subject to the reverse charge procedure. # Joined cases C-95/07 and C-96/07.'},
  'date': {'type': 'typed-literal',
   'datatype': 'http://www.w3.org/2001/XMLSchema#date',
   'value': '2008-03-13'},
  'class': {'type': 'uri',
   'value': 'http://publications.europa.eu/ontology/cdm#act_consolidated'},
  'className': {'type': 'literal', 'value': 'act_consolidated'},
  'authorName': {'type': 'literal',
   'xml:lang': 'en',
   'value': 'Court of Ju

In [24]:
for result in results["results"]["bindings"]:
    print(result["celex"]["value"])

KeyError: 'celex'

In [20]:
def clean_results(results):
    results_clean = [
        {k:v["value"] for k,v in item.items()}
        for item in results["results"]["bindings"]
    ]
    results_clean = pd.DataFrame(results_clean)
    return results_clean

In [27]:
items = clean_results(results)
print(items.shape[0])
items.head(3)

32


Unnamed: 0,item,expr_title,date,class,className,authorName
0,http://publications.europa.eu/resource/cellar/...,Opinion of Advocate General Sharpston delivere...,2008-03-13,http://publications.europa.eu/ontology/cdm#act...,act_consolidated,Court of Justice
1,http://publications.europa.eu/resource/cellar/...,Opinion of Advocate General Sharpston delivere...,2008-03-13,http://publications.europa.eu/ontology/cdm#act...,act_preparatory,Court of Justice
2,http://publications.europa.eu/resource/cellar/...,Opinion of Advocate General Sharpston delivere...,2008-03-13,http://publications.europa.eu/ontology/cdm#agr...,agreement_international,Court of Justice


In [31]:
items['item'].to_list()

['http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1',
 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.0

In [34]:
results["results"]["bindings"]

[{'item': {'type': 'uri',
   'value': 'http://publications.europa.eu/resource/cellar/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02/DOC_1'},
  'expr_title': {'type': 'literal',
   'value': 'Opinion of Advocate General Sharpston delivered on 13 March 2008. # Ecotrade SpA v Agenzia delle Entrate - Ufficio di Genova 3. # Reference for a preliminary ruling: Commissione tributaria provinciale di Genova - Italy. # Sixth VAT Directive - Reverse charge procedure - Right to deduct - Time-bar - Irregularity in accounts and tax returns affecting transactions subject to the reverse charge procedure. # Joined cases C-95/07 and C-96/07.'},
  'date': {'type': 'typed-literal',
   'datatype': 'http://www.w3.org/2001/XMLSchema#date',
   'value': '2008-03-13'},
  'class': {'type': 'uri',
   'value': 'http://publications.europa.eu/ontology/cdm#act_consolidated'},
  'className': {'type': 'literal', 'value': 'act_consolidated'},
  'authorName': {'type': 'literal',
   'xml:lang': 'en',
   'value': 'Court of Ju

In [37]:
os.makedirs("data/api_download", exist_ok=True)
# Call to the  API to download the digital file
unique_items = set([r["item"]["value"] for r in results["results"]["bindings"]])
for doc_url in unique_items:
    item = '.'.join(doc_url.split("/")[-2:])
    file_name = "data/api_download/" + item + ".xhtml"
    try:
        response = requests.get(doc_url)
    except Exception as e:
        print(f"Error downloading {doc_url}: {e}")
        continue
    
    with open(file_name, 'w', encoding='utf-8') as file:
        file.write(response.text)
        print(f"Downloaded: {file_name}")

Downloaded: data/api_download/ea3bdcba-e537-4749-b2dc-4457734dbfcf.0002.02.DOC_1.xhtml
Downloaded: data/api_download/3d4572e0-3dcd-432d-85ad-ad918607176b.0020.02.DOC_1.xhtml


# Datadump
User can force downloading all the Legal Acts and metadata in the Datadump Service.
https://datadump.publications.europa.eu/create-download-request

`LEG_EN_HTML` contains all the Legal Acts
    > {work-id}
    > {file}

`LEG_MTD` contains the metadata
    > {work-id}
    > {mtdfile}

Every {work-id} might have multiple files, but only has one metadata file.

## Get hierarchy

In [24]:
def datadump_la_hierarchy(folder):
    wids = os.listdir(folder)
    if '.DS_Store' in wids:
        wids.remove('.DS_Store')
    rows = []
    for wid in wids:
        formats = os.listdir(folder + "/" + wid)
        for f in formats:
            docs = os.listdir(folder + "/" + wid + "/" + f)
            for doc in docs:
                rows.append({"work-id": wid, "celex-expr-id":"ENG", "celex-man-id": f, "celex-cs-id": doc})
    return pd.DataFrame(rows)

def datadump_mtd_hierarchy(folder):
    wids = os.listdir(folder)
    if '.DS_Store' in wids:
        wids.remove('.DS_Store')
    rows = []
    for wid in wids:
        docs = os.listdir(folder + "/" + wid)
        for doc in docs:
            rows.append({"work-id": wid, "mtd": doc})
    return pd.DataFrame(rows)

In [25]:
legal_act_folder = os.getenv("EU_LEGAL_ACT_PATH")
metadata_folder = os.getenv("EU_METADATA_PATH")
legal_act_files = datadump_la_hierarchy(legal_act_folder)
metadata_files = datadump_mtd_hierarchy(metadata_folder)

In [90]:
# Some folders have jpg files in addition to html files
# e.g. work-id = "bb5339f8-f387-4c05-a895-1a64f898413c"
# xhtml L_1998069EN.01000101.doc.html
# html 31998D0181en.html

# Get file format suffix, only keep .html
legal_act_files['suffix'] = legal_act_files['celex-cs-id'].apply(lambda x: x.split('.')[-1])
print(legal_act_files['suffix'].value_counts())
legal_act_files = legal_act_files[legal_act_files['suffix'] == 'html']

suffix
html    43056
jpg      3604
ent         3
0004        1
0005        1
Name: count, dtype: int64


In [None]:
datadump = legal_act_files.merge(metadata_files, on="work-id", how="left")
datadump.to_csv(os.getenv("EU_DATADUMP_FILE"), index=False)

In [92]:
datadump.isna().sum() / datadump.shape[0] * 100

work-id          0.000000
celex-expr-id    0.000000
celex-man-id     0.000000
celex-cs-id      0.000000
suffix           0.000000
mtd              0.104515
dtype: float64

In [93]:
datadump[datadump['work-id'] == "bb5339f8-f387-4c05-a895-1a64f898413c"]

Unnamed: 0,work-id,celex-expr-id,celex-man-id,celex-cs-id,suffix,mtd
13946,bb5339f8-f387-4c05-a895-1a64f898413c,ENG,xhtml,L_1998069EN.01000101.doc.html,html,tree_non_inferred.rdf
13947,bb5339f8-f387-4c05-a895-1a64f898413c,ENG,html,31998D0181en.html,html,tree_non_inferred.rdf


## Process metadata in batch (parallel)

In [48]:
def parallel_extract(batch_func, df, num_processes=4):
    chunks = np.array_split(df, num_processes)

    with Pool(processes=num_processes) as pool:
        pool.map(batch_func, chunks)

In [23]:
datadump = pd.read_csv(os.getenv("EU_DATADUMP_FILE"))

In [29]:
test = datadump.iloc[0:100]

In [28]:
from rdflib import Namespace
CDM = Namespace("http://publications.europa.eu/ontology/cdm#")
CDM.date_entry_into_force

rdflib.term.URIRef('http://publications.europa.eu/ontology/cdm#date_entry_into_force')

### To do

In [31]:
from helper.preprocess import process_metadata_batch
output_folder = os.getenv("EU_WORK_METADATA_MAPPING_PATH")
os.makedirs(output_folder, exist_ok=True)
parallel_extract(process_metadata_batch, datadump, num_processes=6)

  return bound(*args, **kwds)


[77717] Saved batch 0 with 272 records
[77715] Saved batch 0 with 373 records
[77716] Saved batch 0 with 501 records
[77718] Saved batch 0 with 596 records
[77720] Saved batch 0 with 697 records
[77717] Saved batch 1 with 619 records
[77716] Saved batch 1 with 625 records
[77718] Saved batch 1 with 587 records
[77717] Saved batch 2 with 611 records


Process SpawnPoolWorker-35:
Process SpawnPoolWorker-38:
Process SpawnPoolWorker-36:
Process SpawnPoolWorker-37:


KeyboardInterrupt: 

## Find celex number (parallel)

In [None]:
# from helper.preprocess import get_celex_batch
# output_folder = os.getenv("EU_WORK_CELEX_MAPPING_PATH")
# os.makedirs(output_folder, exist_ok=True)
# parallel_extract(get_celex_batch, metadata_files, num_processes=4)

  return bound(*args, **kwds)


[64636] Saved batch 0 with 158 records
[64638] Saved batch 0 with 316 records
[64637] Saved batch 0 with 474 records
[64639] Saved batch 0 with 1000 records
[64636] Saved batch 1 with 1000 records
[64638] Saved batch 1 with 1000 records
[64637] Saved batch 1 with 1000 records
[64639] Saved batch 1 with 1000 records
[64636] Saved batch 2 with 1000 records
[64638] Saved batch 2 with 1000 records
[64637] Saved batch 2 with 1000 records
[64639] Saved batch 2 with 1000 records
[64636] Saved batch 3 with 1000 records
[64638] Saved batch 3 with 1000 records
[64637] Saved batch 3 with 1000 records
[64639] Saved batch 3 with 1000 records
[64636] Saved batch 4 with 1000 records
[64638] Saved batch 4 with 1000 records
[64637] Saved batch 4 with 1000 records
[64639] Saved batch 4 with 1000 records
[64636] Saved batch 5 with 1000 records
[64638] Saved batch 5 with 1000 records
[64637] Saved batch 5 with 1000 records
[64639] Saved batch 5 with 1000 records
[64636] Saved batch 6 with 1000 records
[64

## Find cellar uri, manifestation, item_identifier

<rdf:Description rdf:about="http://publications.europa.eu/resource/cellar/bb5339f8-f387-4c05-a895-1a64f898413c.0006.05/DOC_100">
    <owl:sameAs rdf:resource="http://publications.europa.eu/resource/uriserv/OJ.L_.1998.069.01.0001.01.ENG.xhtml.L_1998069EN.01000101.doc.html"/>
    <owl:sameAs rdf:resource="http://publications.europa.eu/resource/celex/31998D0181.ENG.xhtml.L_1998069EN.01000101.doc.html"/>
    <owl:sameAs rdf:resource="http://publications.europa.eu/resource/oj/JOL_1998_069_R_0001_001.ENG.xhtml.L_1998069EN.01000101.doc.html"/>
    <j.0:item_belongs_to_manifestation rdf:resource="http://publications.europa.eu/resource/cellar/bb5339f8-f387-4c05-a895-1a64f898413c.0006.05"/>
    <j.0:item_identifier>DOC_100</j.0:item_identifier>
    <j.3:manifestationMimeType>application/xhtml+xml</j.3:manifestationMimeType>
    <rdf:type rdf:resource="http://publications.europa.eu/ontology/cdm#item"/>
  </rdf:Description>

In [None]:
uri = "http://publications.europa.eu/resource/cellar/bb5339f8-f387-4c05-a895-1a64f898413c.0006.05/DOC_100"

pattern = r'^http://publications\.europa\.eu/resource/cellar/([^.]+)\.([^.]+)\.([^.]+)/([^/]+)$'

match = re.match(pattern, uri)
if match:
    work_id, expr_id, man_id, cs_id = match.groups()
    print("work-id:", work_id)
    print("expr-id:", expr_id)
    print("man-id:", man_id)
    print("cs-id:", cs_id)
else:
    print("URI does not match expected pattern.")

meta.append(row.to_dict().update({
    "cellar_man_id": cellar_man_id,
    "cellar_cs_id": cellar_cs_id
}))


In [74]:
legal_act_files

Unnamed: 0,work-id,format,doc
0,1a1e8486-a474-11e9-9d01-01aa75ed71a1,xhtml,L_2019187EN.01004101.doc.html
1,5fa72f58-9564-4ebe-a5a5-853e206ae2ed,html,32000R0212en.html
2,65a9ddff-3e78-403b-ab34-98d7973be2b3,xhtml,L_2010117EN.01006001.doc.html
3,39bcdc85-2e3b-4e36-a488-5197ee502afd,html,31999Y0917_02_en.html
4,a4d26b15-882d-11e9-9369-01aa75ed71a1,xhtml,L_2019148EN.01000101.doc.html
...,...,...,...
46660,de5dedc5-78c9-11ec-9136-01aa75ed71a1,xhtml,C_2022027EN.01001601.doc.html
46661,2c28125f-7af6-48d2-9f7a-1d52e0f383dc,html,32000M1950en.html
46662,4205ac4a-806d-11ea-b94a-01aa75ed71a1,xhtml,L_2020119EN.01001501.doc.html
46663,547b1f39-9694-11e6-a9e2-01aa75ed71a1,xhtml,L_2016284EN.01002701.doc.html


In [None]:
# j.0:item_identifier

In [None]:
"http://publications.europa.eu/resource/celex/31998D0181.ENG.xhtml.L_1998069EN.01000101.doc.html"

In [None]:
# # <j.0:work_is_about_concept_eurovoc rdf:resource="http://eurovoc.europa.eu/{code}"/>
# import rdflib
# g = rdflib.Graph()
# g.parse(metadata_folder + "/bb5339f8-f387-4c05-a895-1a64f898413c/tree_non_inferred.rdf", format="xml")
# eurovoc_codes = []
# same_as = []
# for idx, (subj, pred, obj) in enumerate(g):
#     if pred == rdflib.URIRef("http://publications.europa.eu/ontology/cdm#work_is_about_concept_eurovoc"):
#         code = obj.split("/")[-1]
#         # print(f"subj: {subj}")
#         # print(f"pred: {pred}")
#         # print(f"code: {code}")
#         eurovoc_codes.append(code)
#     elif pred == rdflib.URIRef("http://www.w3.org/2002/07/owl#sameAs"):
#         same_as.append({"subject": subj, "object": obj})

In [None]:
# same_as_df = pd.DataFrame(same_as).sort_values('subject').reset_index(drop=True)
# same_as_df_sorted = same_as_df['subject'].value_counts().reset_index().sort_values('subject')
# same_as_df_sorted

## Find eurovoc codes

In [None]:
# os.makedirs("data/eurovoc/metadata_mapping", exist_ok=True)
# eurovoc_codes = []
# for idx, row in tqdm(metadata_files.iterrows(), total=metadata_files.shape[0], desc="Examine metadata"):
#     work_id = row["work-id"]
#     g = rdflib.Graph()
#     g.parse(metadata_folder + "/" + work_id + "/tree_non_inferred.rdf", format="xml")
#     for i, (subj, pred, obj) in enumerate(g):
#         if pred == rdflib.URIRef("http://publications.europa.eu/ontology/cdm#work_is_about_concept_eurovoc"):
#             code = obj.split("/")[-1]
#             eurovoc_codes.append({"work-id": work_id, "eurovoc-code": code})
#     if idx % 1000 == 999:
#         with open(f"data/eurovoc/metadata_mapping/{idx}.json", "w") as f:
#             json.dump(eurovoc_codes, f, indent=4)
#         print(f"Processed until index {idx}, saved to data/eurovoc/metadata_mapping/{idx}.json")
#         print(f"Length of this file: {len(eurovoc_codes)}")
#         eurovoc_codes = []
# if eurovoc_codes:
#     with open(f"data/eurovoc/metadata_mapping/{idx}.json", 'w') as f:
#         json.dump(eurovoc_codes, f, indent=4)
#     print(f"Processed until index {idx}, saved to data/eurovoc/metadata_mapping/{idx}.json")
#     print(f"Length of this file: {len(eurovoc_codes)}")

# eurovoc_codes = pd.DataFrame(eurovoc_codes)

Examine metadata:   0%|          | 5/55367 [00:02<6:16:55,  2.45it/s]

Processed until index 4, saved to data/eurovoc/metadata_mapping/4.json
Length of this file: 26





## Find eurovoc codes (parallel)

In [None]:
# from helper.preprocess import get_eurovoc_batch
# output_folder = os.getenv("EU_WORK_EUROVOC_MAPPING_PATH")
# os.makedirs(output_folder, exist_ok=True)
# parallel_extract(get_eurovoc_batch, metadata_files, num_processes=4)

[28496] Saved batch 0 with 5441 records
[28498] Saved batch 0 with 5474 records
[28499] Saved batch 0 with 5573 records
[28497] Saved batch 0 with 5327 records
[28496] Saved batch 1 with 5339 records
[28498] Saved batch 1 with 5496 records
[28499] Saved batch 1 with 5542 records
[28497] Saved batch 1 with 5386 records
[28496] Saved batch 2 with 5477 records
[28499] Saved batch 2 with 5232 records
[28498] Saved batch 2 with 5567 records
[28497] Saved batch 2 with 5324 records
[28498] Saved batch 3 with 5543 records
[28496] Saved batch 3 with 5481 records
[28499] Saved batch 3 with 5485 records
[28497] Saved batch 3 with 5361 records
[28496] Saved batch 4 with 5521 records
[28498] Saved batch 4 with 5467 records
[28499] Saved batch 4 with 5375 records
[28497] Saved batch 4 with 5487 records
[28496] Saved batch 5 with 5459 records
[28499] Saved batch 5 with 5450 records
[28498] Saved batch 5 with 5603 records
[28497] Saved batch 5 with 5494 records
[28496] Saved batch 6 with 5320 records


## Concatenate eurovoc and celex files

In [58]:
# Find concepts behind eurovoc code
output_folder = os.getenv("EU_WORK_EUROVOC_MAPPING_PATH")
eurovoc_mapping_files = os.listdir(output_folder)
eurovoc_mapping = []
for file in eurovoc_mapping_files:
    with open(os.path.join(output_folder, file), 'r') as f:
        eurovoc_mapping.extend(json.load(f))
eurovoc_mapping = pd.DataFrame(eurovoc_mapping).sort_values('work-id').reset_index(drop=True)
print(eurovoc_mapping.columns)

Index(['work-id', 'eurovoc-code'], dtype='object')


In [None]:
# TERMS (PT-NPT) is the sepecific term for the concept
# e.g. "5485" is "European charter"
# MT is the main term (domain) fot the concept
# e.g. "5485" belongs to "0806 international affairs"
# eurovoc[eurovoc["ID"].isin(["5485","2498","5344","5327", "2470","5420"]) & eurovoc["PT"].isna()].sort_values("MT")

Unnamed: 0,ID,TERMS (PT-NPT),RELATIONS,PT,MT
1916,5485,European charter,,,0806 international affairs
4666,5420,accession to an agreement,,,0806 international affairs
1085,5344,EAEC,,,1016 European construction
1218,5327,ECSC,,,1016 European construction
8699,2470,environmental policy,,,5206 environmental policy
8618,2498,energy policy,,,6606 energy policy


In [None]:
# Map work-id to eurovoc concepts
eurovoc = pd.read_excel("data/eurovoc/EuroVoc_Excel_export/eurovoc_export_en.xlsx", sheet_name="en")
eurovoc_simple = eurovoc[eurovoc["PT"].isna()].drop(columns=["RELATIONS", "PT"])
eurovoc_mapping_concepts = eurovoc_mapping.merge(eurovoc_simple, left_on="eurovoc-code", right_on="ID", how="left").drop(columns=["ID"])
legal_act_concepts = legal_act_files.merge(eurovoc_mapping_concepts, on="work-id", how="left")

In [60]:
# Map work-id to celex number
output_folder = os.getenv("EU_WORK_CELEX_MAPPING_PATH")
celex_mapping_files = os.listdir(output_folder)
celex_mapping = []
for file in celex_mapping_files:
    with open(os.path.join(output_folder, file), 'r') as f:
        celex_mapping.extend(json.load(f))
celex_mapping = pd.DataFrame(celex_mapping).sort_values('work-id').reset_index(drop=True)

In [69]:
legal_act_metadata = legal_act_concepts.merge(celex_mapping, on="work-id", how="left")
la_mtd_file = os.getenv("EU_LEGAL_ACT_METADATA_FILE")
legal_act_metadata.to_csv(la_mtd_file, index=False)

In [70]:
# N.A. rate
legal_act_metadata.isna().sum() / legal_act_metadata.shape[0] * 100

work-id           0.000000
format            0.000000
doc               0.000000
eurovoc-code      2.006617
TERMS (PT-NPT)    2.331216
MT                2.331216
celex             0.014755
dtype: float64

# Process File
backend/data/LEG_EN_HTML_20250721_04_08/e997cfd6-3626-4170-9ded-da698d3af63c/xhtml/L_2011273EN.01000101.doc.html
- save metadata
- look for useful html tags
- save text
- preprocess text

In [2]:
def extract_mt_code(mt):
    if pd.isna(mt):
        return None
    else:
        return mt[0:5]
    
def extract_mt_term(mt):
    if pd.isna(mt):
        return None
    else:
        return mt[5:]

In [3]:
la_mtd_file = os.getenv("EU_LEGAL_ACT_METADATA_FILE")
la_mtd = pd.read_csv(la_mtd_file)
la_mtd["MT-code"] = la_mtd["MT"].apply(extract_mt_code)
la_mtd["MT-term"] = la_mtd["MT"].apply(extract_mt_term)
la_mtd['doc-path'] = "data/LEG_EN_HTML_20250721_04_08/" + la_mtd['work-id'] + "/" + la_mtd['format'] + "/" + la_mtd['doc']
la_mtd['mtd-path'] = "data/LEG_MTD_20250709_22_36/" + la_mtd['work-id'] + "/tree_non_inferred.rdf"

In [4]:
file_paths = la_mtd.drop_duplicates('doc-path')['doc-path'].to_list()
print(len(file_paths))

46665


## ChromaDB

In [None]:
# from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
# collection = client.get_or_create_collection(
#     name="legal",
#     embedding_function=OpenAIEmbeddingFunction(
#         model_name="text-embedding-3-small"
#     )
# )

In [5]:
import hashlib
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import BSHTMLLoader, UnstructuredXMLLoader
from langchain.embeddings import HuggingFaceEmbeddings

In [6]:
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(name="legal") # default embedder: all-MiniLM-L6-v2

In [7]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Slightly larger chunks
    chunk_overlap=200
)

In [8]:
def decompose_celex(celex):
    sector = celex[0]
    year = celex[1:5]
    return sector, year

def make_id(doc_path):
    return hashlib.md5(doc_path[32:].encode("utf-8")).hexdigest()[:12]

In [9]:
test = la_mtd.drop_duplicates('doc-path').iloc[0:5]

In [None]:
# print(test.iloc[0]["doc-path"])
# loader = UnstructuredXMLLoader(test.iloc[0]["doc-path"])
# docs = loader.load()

In [11]:
for idx, row in tqdm(test.iterrows(), desc="Processing documents", total=test.shape[0]):
    # Load document
    # loader = UnstructuredXMLLoader(row["doc-path"])
    loader = BSHTMLLoader(file_path = row["doc-path"])
    docs = loader.load()

    # Retrieve metadata
    meta = docs[0].metadata
    meta_download = la_mtd[(la_mtd["work-id"] == row["work-id"]) & (la_mtd["doc"] == row["doc"])]
    meta['eurovoc-terms'] = ';'.join(meta_download["TERMS (PT-NPT)"].unique())
    meta['eurovoc-mt'] = ';'.join(meta_download["MT"].unique())
    sector, year = decompose_celex(row['celex']) if row['celex'] else (None, None)
    meta['celex'] = row['celex'] if row['celex'] else None
    meta['celex-sector'] = sector
    meta['celex-year'] = year

    # Split, embed, and store
    chunks = text_splitter.split_documents(docs)
    texts = [chunk.page_content for chunk in chunks]
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs={"device": "cpu"})
    collection.add(
        documents=texts,
        metadatas=[meta for _ in range(len(texts))],
        embedding=embeddings.embed_documents(texts),
        ids=[f"{make_id(row['doc-path'])}_{i}" for i in range(len(texts))]
    )


Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(f, **self.bs_kwargs)


  embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs={"device": "cpu"})
  from .autonotebook import tqdm as notebook_tqdm

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.4 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/Users/demouser/anaconda3/envs/regpy10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/demouser/anaconda3/envs/regpy10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/demouser/anaconda3/envs/regpy10/lib/python3.10/site-packages/ipykernel_launcher.py", li

RuntimeError: Numpy is not available

## Flag

## Extract metadata

In [2]:
work_id = "e997cfd6-3626-4170-9ded-da698d3af63c"
expr_id = "ENG"
man_id = "xhtml"
cs_id = "L_2011273EN.01000101.doc.html"
doc_path = f"data/LEG_EN_HTML_20250721_04_08/{work_id}/{man_id}/{cs_id}"
mtd_path = f"data/LEG_MTD_20250709_22_36/{work_id}/tree_non_inferred.rdf"
celex_number = None
eurovoc_code = []

In [3]:
# Load the document and its metadata
loader = BSHTMLLoader(file_path=doc_path)
docs = loader.load()
docs[0].metadata
# docs[0].metadata.update(meta)


Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(f, **self.bs_kwargs)


{'source': 'data/LEG_EN_HTML_20250721_04_08/e997cfd6-3626-4170-9ded-da698d3af63c/xhtml/L_2011273EN.01000101.doc.html',
 'title': 'L_2011273EN.01000101.xml'}

++++++++++++++++

In [4]:
# Find Celex number in the metadata file
g = rdflib.Graph()
g.parse(mtd_path, format="xml")

for subj, pred, obj in g:
    if str(pred) in [
        "http://publications.europa.eu/ontology/cdm#resource_legal_id_celex",
        "http://publications.europa.eu/ontology/cdm#celex_number"
    ]:
        print("Celex number:", obj)
        celex_number = obj
        break

Celex number: 32011D0694


In [5]:
doc_celex_uri = rdflib.URIRef(f"http://publications.europa.eu/resource/celex/{celex_number}.{expr_id}.{man_id}.{cs_id}")
# "http://publications.europa.eu/resource/celex/32011D0694.ENG.xhtml.L_2011273EN.01000101.doc.html"
print("Document Celex URI:", doc_celex_uri)

Document Celex URI: http://publications.europa.eu/resource/celex/32011D0694.ENG.xhtml.L_2011273EN.01000101.doc.html


In [None]:
# g.bind("rdfs", rdflib.RDFS)
# g.bind("skos", rdflib.Namespace("http://www.w3.org/2004/02/skos/core#"))


def get_label(uri):
    label = None
    # Check if label is already in the graph
    for o in g.objects(subject=uri, predicate=RDFS.label):
        return str(o)
    for o in g.objects(subject=uri, predicate=SKOS.prefLabel):
        return str(o)
    # Fallback: extract local name
    return uri.split("#")[-1] if "#" in uri else uri.split("/")[-1]

# Get all triples where the celex uri is subject or object
print("Document Celex URI:", doc_celex_uri, "\n")
print("========== Subject - Predicate ==========")
for subj, pred in g.subject_predicates(object=doc_celex_uri):
    pred_label = get_label(pred)
    print(f"Subject: {subj}")
    print(f"Predicate: {pred_label} ({pred})")
print("========== Predicate - Object ==========")
for pred, obj in g.predicate_objects(subject=doc_celex_uri):
    print(f"Predicate: {pred_label} ({pred})")
    print(f"Object: {obj}")

Document Celex URI: http://publications.europa.eu/resource/celex/32011D0694.ENG.xhtml.L_2011273EN.01000101.doc.html 

Subject: http://publications.europa.eu/resource/cellar/e997cfd6-3626-4170-9ded-da698d3af63c.0021.02/DOC_1
Predicate: -sameAs (http://www.w3.org/2002/07/owl#sameAs)


In [11]:
# Extract metadata from XML
with open(doc_path, "r") as f:
    soup = BeautifulSoup(f, "lxml")
# print(soup.prettify())
meta = soup.meta.attrs

la_mtd_file = os.getenv("EU_LEGAL_ACT_METADATA_FILE")
legal_act_metadata = pd.read_csv(la_mtd_file)

meta_download = legal_act_metadata[(legal_act_metadata["work-id"] == work_id) & (legal_act_metadata["doc"] == cs_id)]

# Mark: If new files are downloaded, look for Eurovoc codes one by one
meta['eurovoc-terms'] = ';'.join(meta_download["TERMS (PT-NPT)"].unique())
meta['eurovoc-mt'] = ';'.join(meta_download["MT"].unique())
meta



Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(f, "lxml")


{'http-equiv': 'content-type',
 'content': 'text/html; charset=utf-8',
 'eurovoc-terms': 'civil aviation;air safety;approval;ratification of an agreement;technical cooperation;Brazil;agreement (EU)',
 'eurovoc-mt': '4826 air and space transport;4806 transport policy;6411 technology and technical regulations;0806 international affairs;0811 cooperation policy;7216 America;7231 economic geography;7236 political geography;1016 European construction'}

In [14]:
text = soup.get_text(separator="\n", strip=True)

## Loader

In [83]:
docs[0].metadata.update(meta)

In [85]:
docs

[Document(metadata={'source': 'data/LEG_EN_HTML_20250721_04_08/e997cfd6-3626-4170-9ded-da698d3af63c/xhtml/L_2011273EN.01000101.doc.html', 'title': 'L_2011273EN.01000101.xml', 'http-equiv': 'content-type', 'content': 'text/html; charset=utf-8', 'eurovoc-terms': 'civil aviation;air safety;approval;ratification of an agreement;technical cooperation;Brazil;agreement (EU)', 'eurovoc-mt': '4826 air and space transport;4806 transport policy;6411 technology and technical regulations;0806 international affairs;0811 cooperation policy;7216 America;7231 economic geography;7236 political geography;1016 European construction'}, page_content='\n\n\n\n\nL_2011273EN.01000101.xml\n\n\n\n\n\n\n\n\n\n\n19.10.2011\xa0\xa0\xa0\n\n\nEN\n\n\nOfficial Journal of the European Union\n\n\nL 273/1\n\n\n\n\n\nCOUNCIL DECISION\nof 26 September 2011\non the conclusion of an Agreement between the European Union and the Government of the Federative Republic of Brazil on civil aviation safety\n(2011/694/EU)\nTHE COUNCI

In [16]:
soup.find("meta")

<meta content="text/html; charset=utf-8" http-equiv="content-type"/>

In [15]:
print(text)

L_2011273EN.01000101.xml
19.10.2011
EN
Official Journal of the European Union
L 273/1
COUNCIL DECISION
of 26 September 2011
on the conclusion of an Agreement between the European Union and the Government of the Federative Republic of Brazil on civil aviation safety
(2011/694/EU)
THE COUNCIL OF THE EUROPEAN UNION,
Having regard to the Treaty on the Functioning of the European Union, and in particular Article 100(2) and the first subparagraph of Article 207(4), in conjunction with Article 218(6)(a) and Article 218(7) and the first subparagraph of Article 218(8), thereof,
Having regard to the proposal from the European Commission,
Having regard to the consent of the European Parliament,
Whereas:
(1)
The Commission has negotiated, on behalf of the European Union, an Agreement on civil aviation safety with the Government of the Federative Republic of Brazil in accordance with the Council Decision authorising the Commission to open negotiations.
(2)
The Agreement between the European Union a

## Split text

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader, Docx2txtLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from typing import List
import os

In [15]:
legal_act_metadata

Unnamed: 0,work-id,format,doc,eurovoc-code,TERMS (PT-NPT),MT,celex
0,1a1e8486-a474-11e9-9d01-01aa75ed71a1,xhtml,L_2019187EN.01004101.doc.html,2435,fish,5641 fisheries,32019D1194
1,1a1e8486-a474-11e9-9d01-01aa75ed71a1,xhtml,L_2019187EN.01004101.doc.html,5720,endocrine disease,2841 health,32019D1194
2,1a1e8486-a474-11e9-9d01-01aa75ed71a1,xhtml,L_2019187EN.01004101.doc.html,6124,fish disease,5631 agricultural activity,32019D1194
3,1a1e8486-a474-11e9-9d01-01aa75ed71a1,xhtml,L_2019187EN.01004101.doc.html,2739,chemical product,6811 chemistry,32019D1194
4,1a1e8486-a474-11e9-9d01-01aa75ed71a1,xhtml,L_2019187EN.01004101.doc.html,c_667f2c16,endocrine disruptor,5216 deterioration of the environment,32019D1194
...,...,...,...,...,...,...,...
304986,125480ac-2957-11e6-b616-01aa75ed71a1,xhtml,L_2016147EN.01000101.doc.html,4347,market supervision,2006 trade policy,32016R0869
304987,125480ac-2957-11e6-b616-01aa75ed71a1,xhtml,L_2016147EN.01000101.doc.html,3942,financial solvency,1211 civil law,32016R0869
304988,125480ac-2957-11e6-b616-01aa75ed71a1,xhtml,L_2016147EN.01000101.doc.html,2906,reinsurance,2431 insurance,32016R0869
304989,125480ac-2957-11e6-b616-01aa75ed71a1,xhtml,L_2016147EN.01000101.doc.html,1422,information,3231 information and information processing,32016R0869


In [None]:
la_mtd_file = os.getenv("EU_LEGAL_ACT_METADATA_FILE")
legal_act_metadata = pd.read_csv(la_mtd_file)
legal_act_metadata['doc-path'] = "data/LEG_EN_HTML_20250721_04_08/" + legal_act_metadata['work-id'] + \
    "/" + legal_act_metadata['format'] + "/" + legal_act_metadata['doc']

In [21]:
file_paths = legal_act_metadata.drop_duplicates('doc-path')['doc-path'].to_list()
print(len(file_paths))

46665


In [None]:
for doc_path in file_paths[0:1]:
    with open(doc_path, "r") as f:
        soup = BeautifulSoup(f, "lxml")
    # print(soup.prettify())
    meta = soup.meta.attrs

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Slightly larger chunks
    chunk_overlap=200,
    length_function=len
)
for file_path in file_paths:
    try:
        if not os.path.exists(file_path):
            print(f"File not found: {file_path}")
            continue

        Loader = self._get_loader(file_path)
        if not Loader:
            print(f"Unsupported file type: {file_path}")
            continue

        # Load and split documents
        documents = Loader(file_path).load()
        chunks = text_splitter.split_documents(documents)

        # Create or update FAISS index
        if os.path.exists(os.path.join(self.vectorstore_directory, "index.faiss")):
            # Load existing vectorstore and merge new documents
            vectorstore = FAISS.load_local(
                folder_path=self.vectorstore_directory,
                embeddings=self.embeddings,
                allow_dangerous_deserialization=True
            )
            vectorstore.add_documents(chunks)
        else:
            # Create new vectorstore
            vectorstore = FAISS.from_documents(chunks, self.embeddings)

        # Save the updated vectorstore
        vectorstore.save_local(folder_path=self.vectorstore_directory)
        success_count += 1
        print(f"Successfully processed: {file_path}")

    except Exception as e:
        print(f"Error processing {file_path}: {str(e)}")

return success_count, len(file_paths) - success_count

In [None]:
from langchain_community.document_loaders import PyPDFLoader, TextLoader, Docx2txtLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from typing import List
import os

class DocumentUploader:
    def __init__(self, vectorstore_directory: str = "Database"):
        """
        Initialize DocumentUploader with the directory for the vector store.
        
        Args:
            vectorstore_directory: Directory to store vector databases (default: "Database")
        """
        self.vectorstore_directory = os.path.abspath(vectorstore_directory)
        os.makedirs(self.vectorstore_directory, exist_ok=True)
        self.embeddings = OpenAIEmbeddings()  # Initialize embeddings once

    def _get_loader(self, file_path: str):
        """Determine the appropriate loader based on file extension"""
        file_extension = os.path.splitext(file_path)[1].lower()
        loader_map = {
            '.pdf': PyPDFLoader,
            '.txt': TextLoader,
            '.docx': Docx2txtLoader
        }
        return loader_map.get(file_extension)

    def upload_documents(self, file_paths: List[str]):
        """
        Process and upload multiple documents to the vector store.
        
        Args:
            file_paths: List of file paths to process
            
        Returns:
            Tuple: (success_count, error_count)
        """
        success_count = 0
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,  # Slightly larger chunks
            chunk_overlap=200,
            length_function=len
        )

        for file_path in file_paths:
            try:
                if not os.path.exists(file_path):
                    print(f"File not found: {file_path}")
                    continue

                Loader = self._get_loader(file_path)
                if not Loader:
                    print(f"Unsupported file type: {file_path}")
                    continue

                # Load and split documents
                documents = Loader(file_path).load()
                chunks = text_splitter.split_documents(documents)

                # Create or update FAISS index
                if os.path.exists(os.path.join(self.vectorstore_directory, "index.faiss")):
                    # Load existing vectorstore and merge new documents
                    vectorstore = FAISS.load_local(
                        folder_path=self.vectorstore_directory,
                        embeddings=self.embeddings,
                        allow_dangerous_deserialization=True
                    )
                    vectorstore.add_documents(chunks)
                else:
                    # Create new vectorstore
                    vectorstore = FAISS.from_documents(chunks, self.embeddings)

                # Save the updated vectorstore
                vectorstore.save_local(folder_path=self.vectorstore_directory)
                success_count += 1
                print(f"Successfully processed: {file_path}")

            except Exception as e:
                print(f"Error processing {file_path}: {str(e)}")

        return success_count, len(file_paths) - success_count

# ChromaDB

`chroma run` runs the HTTP Client at localhost:8000

In [2]:
import chromadb

In [5]:
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(name="legal")

In [None]:
# # Set your OPENAI_API_KEY environment variable
# from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# collection = client.create_collection(
#     name="my_collection",
#     embedding_function=OpenAIEmbeddingFunction(
#         model_name="text-embedding-3-small"
#     )
# )


collection.add(
    ids=["id1", "id2"],
    documents=["doc1", "doc2"]
)

In [None]:
collection.add(
    ids=["1", "2", "3", "4", "5", "6", "7"],
    documents=[
        "apple",
        "banana",
        "pineapple",
        "mango",
        "dragonfruit",
        "passionfruit",
        "raspberry"
    ],
    metadatas=[
        { "color": "red", "weight": 180 },
        { "color": "yellow", "weight": 120 },
        { "color": "brown", "weight": 900 },
        { "color": "yellow", "weight": 200 },
        { "color": "pink", "weight": 600 },
        { "color": "purple", "weight": 18 },
        { "color": "red", "weight": 4 },
    ]
)

# Read RDF

In [None]:
import rdflib
g = rdflib.Graph()
g.parse("data/cdm-4.13.2/euvoc.rdf", format="xml")
for idx, (subj, pred, obj) in enumerate(g):
    print(f"Index: {idx} | subj: {subj}")
    print(f"Index: {idx} | pred: {pred}")
    print(f"Index: {idx} | obj : {obj}")

# Connect to S3

In [None]:
import boto3

s3 = boto3.client('s3')
bucket_name = 'regguru'
file_path = 'path/to/local/file.txt'
s3_key = 'folder/file.txt'

s3.upload_file(file_path, bucket_name, s3_key)

In [2]:
import io
import os
import uuid

import boto3
from boto3.s3.transfer import S3UploadFailedError
from botocore.exceptions import ClientError

In [6]:
load_dotenv(override=True)

True

## Upload Legal Acts

In [None]:
# Replace with your actual credentials and bucket/file names
AWS_ACCESS_KEY_ID = os.getenv("S3_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("S3_SECRET_ACCESS_KEY")
BUCKET_NAME = 'regguru'
LOCAL_FOLDER = 'data/LEG_EN_HTML_20250721_04_08'
S3_OBJECT_NAME = 'eu/LEG_EN_HTML_20250721_04_08' # Optional: if different from local file name

# Create an S3 client
s3 = boto3.resource(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)


In [17]:
bucket=s3.Bucket(BUCKET_NAME)

In [28]:
ls = [object.key for object in bucket.objects.filter(Prefix='eu/LEG_EN_HTML_20250721_04_08/')]
len(ls)

13290

In [29]:
ls[0]

'eu/LEG_EN_HTML_20250721_04_08/00012784-19a8-4f6a-8457-fc6291daf9cd/html/31986D0559en.html'

In [30]:
uploaded = [[sp for sp in key.split('/')[2:]] for key in ls]
df = pd.DataFrame(uploaded, columns=['work-id','celex-man-id','celex-cs-id'])

In [32]:
df['uploaded']=1

In [33]:
check_upload = legal_act_files.merge(df, how='left', on=('work-id', 'celex-man-id', 'celex-cs-id'))

In [65]:
not_uploaded=check_upload[check_upload['uploaded'] != 1]

In [79]:
not_uploaded['sub-path'] = not_uploaded['work-id'] + '/'+not_uploaded['celex-man-id']+'/'+not_uploaded['celex-cs-id']
not_uploaded['path'] = "data/LEG_EN_HTML_20250721_04_08/"+not_uploaded['sub-path']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  not_uploaded['sub-path'] = not_uploaded['work-id'] + '/'+not_uploaded['celex-man-id']+'/'+not_uploaded['celex-cs-id']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  not_uploaded['path'] = "data/LEG_EN_HTML_20250721_04_08/"+not_uploaded['sub-path']


In [80]:
not_uploaded.head(1)

Unnamed: 0,work-id,celex-expr-id,celex-man-id,celex-cs-id,uploaded,sub-path,path
0,1a1e8486-a474-11e9-9d01-01aa75ed71a1,ENG,xhtml,L_2019187EN.01004101.doc.html,,1a1e8486-a474-11e9-9d01-01aa75ed71a1/xhtml/L_2...,data/LEG_EN_HTML_20250721_04_08/1a1e8486-a474-...


In [83]:
la_object = bucket.Object(S3_OBJECT_NAME)

In [71]:
# Create an S3 client
client = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

In [90]:
files_to_upload = not_uploaded['sub-path'].to_list()

In [None]:
exceptions = []
for file in tqdm(files_to_upload):
    local_file = LOCAL_FOLDER + '/' + file
    s3_key = os.path.join(S3_OBJECT_NAME, file).replace("\\", "/")  # S3 expects "/
    try:
        client.upload_file(local_file, BUCKET_NAME, s3_key)
    except Exception as e:
        print(f"Error uploading {file}: {e}")
        exceptions.append((file, e))

 18%|█▊        | 6084/33375 [1:43:25<3:49:39,  1.98it/s]  

Error uploading 1d2af2a3-5303-11e6-89bd-01aa75ed71a1/xhtml/L_2016200EN.01000101.doc.html: SSL validation failed for https://regguru.s3.ap-southeast-1.amazonaws.com/eu/LEG_EN_HTML_20250721_04_08/1d2af2a3-5303-11e6-89bd-01aa75ed71a1/xhtml/L_2016200EN.01000101.doc.html?uploadId=2GFQkYFhycX0RAcXI1nUUt8jqjhOoABVAnU3qPFz3DPl9dq5NK_NnVdLuKnQtcA_Ry5Opji34OJnl5SleBMeNBfTsP8pPwJqqtTegE0E_KRlqLj6YMAEykAPOT9tT1hJUM0r4OBU8iRTZnOoKuAuow--&partNumber=13 [SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)


 18%|█▊        | 6085/33375 [1:47:07<508:16:01, 67.05s/it]

Error uploading cc837096-5c6c-11e9-9c52-01aa75ed71a1/xhtml/L_2019101EN.01000101.doc.html: SSL validation failed for https://regguru.s3.ap-southeast-1.amazonaws.com/eu/LEG_EN_HTML_20250721_04_08/cc837096-5c6c-11e9-9c52-01aa75ed71a1/xhtml/L_2019101EN.01000101.doc.html [SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)


 18%|█▊        | 6086/33375 [1:47:31<409:58:40, 54.08s/it]

Error uploading 769a011a-700a-4b15-ba2a-c7a1e78dc245/html/31988H0041en.html: SSL validation failed for https://regguru.s3.ap-southeast-1.amazonaws.com/eu/LEG_EN_HTML_20250721_04_08/769a011a-700a-4b15-ba2a-c7a1e78dc245/html/31988H0041en.html [SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)


 18%|█▊        | 6087/33375 [1:47:56<343:50:34, 45.36s/it]

Error uploading 36c30467-fed4-41d5-b7ef-afba2bd462fa/xhtml/L_2010125EN.01005201.doc.html: SSL validation failed for https://regguru.s3.ap-southeast-1.amazonaws.com/eu/LEG_EN_HTML_20250721_04_08/36c30467-fed4-41d5-b7ef-afba2bd462fa/xhtml/L_2010125EN.01005201.doc.html [SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)


 35%|███▍      | 11624/33375 [3:32:38<4:42:54,  1.28it/s]   

Error uploading 9191c590-efda-4786-8179-001ea608103e/html/0004: Need to rewind the stream <botocore.httpchecksum.AwsChunkedWrapper object at 0x1249c36a0>, but stream is not seekable.


 70%|██████▉   | 23338/33375 [7:14:14<11:21:31,  4.07s/it]  

Error uploading 6df03d7c-a7af-42cf-9141-623777c70352/xhtml/0005: Need to rewind the stream <botocore.httpchecksum.AwsChunkedWrapper object at 0x1249b0610>, but stream is not seekable.


100%|██████████| 33375/33375 [12:04:50<00:00,  1.30s/it]     


In [103]:
upload_failure = [obj[0][32:] for obj in exceptions]

In [104]:
upload_failure

['1d2af2a3-5303-11e6-89bd-01aa75ed71a1/xhtml/L_2016200EN.01000101.doc.html',
 'cc837096-5c6c-11e9-9c52-01aa75ed71a1/xhtml/L_2019101EN.01000101.doc.html',
 '769a011a-700a-4b15-ba2a-c7a1e78dc245/html/31988H0041en.html',
 '36c30467-fed4-41d5-b7ef-afba2bd462fa/xhtml/L_2010125EN.01005201.doc.html',
 '9191c590-efda-4786-8179-001ea608103e/html/0004',
 '6df03d7c-a7af-42cf-9141-623777c70352/xhtml/0005']

## Upload metadata

In [117]:
# Replace with your actual credentials and bucket/file names
AWS_ACCESS_KEY_ID = os.getenv("S3_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("S3_SECRET_ACCESS_KEY")
BUCKET_NAME = 'regguru'
LOCAL_FOLDER = 'data/LEG_MTD_20250709_22_36'
S3_OBJECT_NAME = 'eu/LEG_MTD_20250709_22_36' # Optional: if different from local file name

# Create an S3 client
s3 = boto3.resource(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

In [106]:
metadata_files.head(1)

Unnamed: 0,work-id,mtd
0,1a1e8486-a474-11e9-9d01-01aa75ed71a1,tree_non_inferred.rdf


In [107]:
metadata_files['sub-path'] = metadata_files['work-id'] + '/'+metadata_files['mtd']
metadata_files['path'] = "data/LEG_MTD_20250709_22_36/"+metadata_files['sub-path']

In [121]:
metadata_to_upload = metadata_files['sub-path'].to_list()[5:]

In [122]:
exceptions = []
for file in tqdm(metadata_to_upload):
    local_file = LOCAL_FOLDER + '/' + file
    s3_key = os.path.join(S3_OBJECT_NAME, file).replace("\\", "/")  # S3 expects "/
    try:
        client.upload_file(local_file, BUCKET_NAME, s3_key)
    except Exception as e:
        print(f"Error uploading {file}: {e}")
        exceptions.append((file, e))

100%|██████████| 55362/55362 [10:08:40<00:00,  1.52it/s]  


In [74]:
files_to_upload

['1a1e8486-a474-11e9-9d01-01aa75ed71a1/xhtml/L_2019187EN.01004101.doc.html',
 '5fa72f58-9564-4ebe-a5a5-853e206ae2ed/html/32000R0212en.html',
 '65a9ddff-3e78-403b-ab34-98d7973be2b3/xhtml/L_2010117EN.01006001.doc.html',
 '39bcdc85-2e3b-4e36-a488-5197ee502afd/html/31999Y0917_02_en.html',
 'a4d26b15-882d-11e9-9369-01aa75ed71a1/xhtml/L_2019148EN.01000101.doc.html']

In [None]:
def upload_files(file_paths, bucket_name, s3_prefix=""):
    """
    Uploads files to an S3 bucket while preserving folder structure.

    :param file_paths: list of local file paths
    :param bucket_name: target S3 bucket
    :param s3_prefix: optional prefix inside the bucket (like "myfolder/")
    """
    # Get common base directory
    base_dir = os.path.commonpath(file_paths)

    for file_path in file_paths:
        # Create relative path
        relative_path = os.path.relpath(file_path, base_dir)
        s3_key = os.path.join(s3_prefix, relative_path).replace("\\", "/")  # S3 expects "/"

        print(f"Uploading {file_path} to s3://{bucket_name}/{s3_key}")
        s3.upload_file(file_path, bucket_name, s3_key)


# Example usage
files_to_upload = [
    "/home/user/project/data/file1.csv",
    "/home/user/project/data/subdir/file2.csv",
    "/home/user/project/data/subdir/nested/file3.csv"
]

upload_files(files_to_upload, "your-bucket-name", "backup/")

data/LEG_EN_HTML_20250721_04_08 ['1a1e8486-a474-11e9-9d01-01aa75ed71a1', '5fa72f58-9564-4ebe-a5a5-853e206ae2ed', '65a9ddff-3e78-403b-ab34-98d7973be2b3', '39bcdc85-2e3b-4e36-a488-5197ee502afd', 'a4d26b15-882d-11e9-9369-01aa75ed71a1', '6048f5e3-3649-4cf3-98af-31346a57a6bf', '20689fab-dca7-4d06-8929-bff85013eec0', '9aabcb1b-76b8-46bd-accb-45ca452a003b', '86912760-c241-11e4-bbe1-01aa75ed71a1', '6f2a584d-0b00-468a-93b2-b1b6faa9bf99', 'c3a578f6-0b9e-45f2-b994-4caa5b25cefa', 'b58832ee-31f1-11e6-b497-01aa75ed71a1', '594755e8-0d1a-11eb-bc07-01aa75ed71a1', '8fc4ebc3-a014-11ec-83e1-01aa75ed71a1', 'acef55ee-2b68-11e9-8d04-01aa75ed71a1', 'c718b2af-da04-456e-be0b-d415594fee7d', 'b985614e-9d78-456f-9436-671e626bcc03', 'c25825b9-59ca-11e4-a0cb-01aa75ed71a1', '481342cd-1747-11ea-8c1f-01aa75ed71a1', 'b9564ba9-d223-4f1e-a5ee-a0e91cea7512', '79689c49-1d1e-11e3-8d1c-01aa75ed71a1', 'f036ebe7-068e-42e1-8aa2-c6392f2cd2ac', '4c059916-7c4b-11ed-9887-01aa75ed71a1', '9f8be269-624e-11ee-9220-01aa75ed71a1', '43898e

In [53]:
for idx, row in tqdm(test.iterrows(), desc="uploading", total=test.shape[0]):
    path = f"data/LEG_EN_HTML_20250721_04_08/{row['work-id']}/{row['celex-man-id']}/{row['celex-cs-id']}"
    with open(path, 'r') as f:
        try:
            la_object.put(f)
            print(f"File '{path}' uploaded successfully to '{BUCKET_NAME}/{S3_OBJECT_NAME}'")
        except Exception as e:
            print(f"Error upload: {e}")

uploading:   0%|          | 0/5 [00:00<?, ?it/s]

Error upload: put_object() only accepts keyword arguments.


uploading:  20%|██        | 1/5 [00:00<00:00,  9.43it/s]

Error upload: put_object() only accepts keyword arguments.
Error upload: put_object() only accepts keyword arguments.
Error upload: put_object() only accepts keyword arguments.
Error upload: put_object() only accepts keyword arguments.


uploading: 100%|██████████| 5/5 [00:00<00:00, 38.43it/s]


In [None]:
import logging
import boto3
from botocore.exceptions import ClientError
import os


def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = os.path.basename(file_name)

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True

In [49]:
# Create an S3 client
client = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)
for idx, row in tqdm(test.iterrows(), desc="uploading", total=test.shape[0]):
    path = f"data/LEG_EN_HTML_20250721_04_08{row['work-id']}/{row['celex-man-id']}/{row['celex-cs-id']}"
    try:
        response = client.upload_fileobj(path, bucket, S3_OBJECT_NAME)
    except ClientError as e:
        print(f"{path}: e")

uploading:   0%|          | 0/5 [00:00<?, ?it/s]


ValueError: Fileobj must implement read

## Amazon Demo

In [None]:
def do_scenario(s3_resource):
    print("-" * 88)
    print("Welcome to the Amazon S3 getting started demo!")
    print("-" * 88)

    bucket_name = f"amzn-s3-demo-bucket-{uuid.uuid4()}"
    bucket = s3_resource.Bucket(bucket_name)
    try:
        bucket.create(
            CreateBucketConfiguration={
                "LocationConstraint": s3_resource.meta.client.meta.region_name
            }
        )
        print(f"Created demo bucket named {bucket.name}.")
    except ClientError as err:
        print(f"Tried and failed to create demo bucket {bucket_name}.")
        print(f"\t{err.response['Error']['Code']}:{err.response['Error']['Message']}")
        print(f"\nCan't continue the demo without a bucket!")
        return

    file_name = None
    while file_name is None:
        file_name = input("\nEnter a file you want to upload to your bucket: ")
        if not os.path.exists(file_name):
            print(f"Couldn't find file {file_name}. Are you sure it exists?")
            file_name = None

    obj = bucket.Object(os.path.basename(file_name))
    try:
        obj.upload_file(file_name)
        print(
            f"Uploaded file {file_name} into bucket {bucket.name} with key {obj.key}."
        )
    except S3UploadFailedError as err:
        print(f"Couldn't upload file {file_name} to {bucket.name}.")
        print(f"\t{err}")

    answer = input(f"\nDo you want to download {obj.key} into memory (y/n)? ")
    if answer.lower() == "y":
        data = io.BytesIO()
        try:
            obj.download_fileobj(data)
            data.seek(0)
            print(f"Got your object. Here are the first 20 bytes:\n")
            print(f"\t{data.read(20)}")
        except ClientError as err:
            print(f"Couldn't download {obj.key}.")
            print(
                f"\t{err.response['Error']['Code']}:{err.response['Error']['Message']}"
            )

    answer = input(
        f"\nDo you want to copy {obj.key} to a subfolder in your bucket (y/n)? "
    )
    if answer.lower() == "y":
        dest_obj = bucket.Object(f"demo-folder/{obj.key}")
        try:
            dest_obj.copy({"Bucket": bucket.name, "Key": obj.key})
            print(f"Copied {obj.key} to {dest_obj.key}.")
        except ClientError as err:
            print(f"Couldn't copy {obj.key} to {dest_obj.key}.")
            print(
                f"\t{err.response['Error']['Code']}:{err.response['Error']['Message']}"
            )

    print("\nYour bucket contains the following objects:")
    try:
        for o in bucket.objects.all():
            print(f"\t{o.key}")
    except ClientError as err:
        print(f"Couldn't list the objects in bucket {bucket.name}.")
        print(f"\t{err.response['Error']['Code']}:{err.response['Error']['Message']}")

    answer = input(
        "\nDo you want to delete all of the objects as well as the bucket (y/n)? "
    )
    if answer.lower() == "y":
        try:
            bucket.objects.delete()
            bucket.delete()
            print(f"Emptied and deleted bucket {bucket.name}.\n")
        except ClientError as err:
            print(f"Couldn't empty and delete bucket {bucket.name}.")
            print(
                f"\t{err.response['Error']['Code']}:{err.response['Error']['Message']}"
            )

    print("Thanks for watching!")
    print("-" * 88)


if __name__ == "__main__":
    do_scenario(boto3.resource("s3"))