## Import the dataset

Louis Brulé Naudet has created an up-to-date set of datasets covering all French legal texts, extracted from Légifrance. We will use the Code Civil dataset to create instances of the ontology we created.

The objective of this project is to provide researchers, professionals and law students with simplified, up-to-date access to all French legal texts, enriched with a wealth of data to facilitate their integration into Community and European projects.

The data is normally refreshed daily across all legal codes. The project aims to simplify the production of training sets and labeling pipelines for developing free, open-source language models based on open data accessible to all.

In [1]:
from ragoon import load_datasets
import datasets
import pandas as pd
import rdfpandas



In [2]:

req = [
    "louisbrulenaudet/code-civil"
    # ...
]

datasets_list = load_datasets(
    req=req,
    streaming=False
)

100%|██████████| 1/1 [00:02<00:00,  2.15s/it]


In [5]:
# Explore the dataset structure
print("datasets_list type:", type(datasets_list))
print("datasets_list content:", datasets_list)

# Since it's a list, get the first dataset
if len(datasets_list) > 0:
    code_civil_dict = datasets_list[0]
    print("Available splits:", list(code_civil_dict.keys()))

    # Get the training data
    code_civil = code_civil_dict['train']
    print(f"\nDataset type: {type(code_civil)}")
    print(f"Dataset features: {code_civil.features}")
    print(f"Number of records: {len(code_civil)}")

    # Convert to pandas and examine structure
    df = code_civil.to_pandas()
    print(f"\nDataFrame shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print("\nFirst record example:")
    print(df.iloc[0])
else:
    print("No datasets found")

datasets_list type: <class 'list'>
datasets_list content: [DatasetDict({
    train: Dataset({
        features: ['ref', 'title_main', 'texte', 'dateDebut', 'dateFin', 'num', 'id', 'cid', 'type', 'etat', 'nota', 'version_article', 'ordre', 'conditionDiffere', 'infosComplementaires', 'surtitre', 'nature', 'texteHtml', 'dateFinExtension', 'versionPrecedente', 'refInjection', 'idTexte', 'idTechInjection', 'origine', 'dateDebutExtension', 'idEliAlias', 'cidTexte', 'sectionParentId', 'multipleVersions', 'comporteLiensSP', 'sectionParentTitre', 'infosRestructurationBranche', 'idEli', 'sectionParentCid', 'numeroBo', 'infosRestructurationBrancheHtml', 'historique', 'infosComplementairesHtml', 'renvoi', 'fullSectionsTitre', 'notaHtml', 'inap', 'lienCitations', 'lienAutres'],
        num_rows: 2891
    })
})]
Available splits: ['train']

Dataset type: <class 'datasets.arrow_dataset.Dataset'>
Dataset features: {'ref': Value('string'), 'title_main': Value('string'), 'texte': Value('string'), 'dateD

In [6]:
# Examine key fields for ontology mapping
print("Key fields analysis:")
print(f"Total articles: {len(df)}")

# Key fields that are likely important for ontology mapping
key_fields = ['ref', 'title_main', 'texte', 'num', 'id', 'type', 'etat', 'nature']
for field in key_fields:
    if field in df.columns:
        print(f"\n{field}:")
        print(f"  - Non-null values: {df[field].notna().sum()}")
        print(f"  - Unique values: {df[field].nunique()}")
        if df[field].dtype == 'object':
            unique_vals = df[field].value_counts().head(5)
            print(f"  - Top 5 values: {unique_vals.to_dict()}")

# Sample a few records to understand structure
print("\n" + "="*50)
print("Sample records (key fields only):")
sample_df = df[key_fields].head(3)
for idx, row in sample_df.iterrows():
    print(f"\nRecord {idx+1}:")
    for field in key_fields:
        if pd.notna(row[field]):
            value = str(row[field])[:100] + "..." if len(str(row[field])) > 100 else str(row[field])
            print(f"  {field}: {value}")

Key fields analysis:
Total articles: 2891

ref:
  - Non-null values: 2891
  - Unique values: 2891
  - Top 5 values: {'Code civil, art. 1': 1, 'Code civil, art. 1399-3': 1, 'Code civil, art. 1399-5': 1, 'Code civil, art. 1399-6': 1, 'Code civil, art. 1400': 1}

title_main:
  - Non-null values: 2891
  - Unique values: 1
  - Top 5 values: {'Code civil': 2891}

texte:
  - Non-null values: 2891
  - Unique values: 2890
  - Top 5 values: {"En cas de fiducie conclue à titre de garantie, le contrat mentionne à peine de nullité, outre les dispositions prévues à l'article 2018 , la dette garantie.": 2, "Les lois et, lorsqu'ils sont publiés au Journal officiel de la République française, les actes administratifs entrent en vigueur à la date qu'ils fixent ou, à défaut, le lendemain de leur publication. Toutefois, l'entrée en vigueur de celles de leurs dispositions dont l'exécution nécessite des mesures d'application est reportée à la date d'entrée en vigueur de ces mesures. En cas d'urgence, entren

## Best Preprocessing Formats for Ontology Instance Creation

Based on the analysis of the Code Civil dataset and the existing ontology structure, here are the recommended preprocessing formats:

### 1. **RDF/Turtle (TTL) - RECOMMENDED**
**Best for**: Direct semantic web integration and compatibility with existing LKIF-Core ontologies

**Advantages**:
- Native semantic web format
- Direct integration with existing `.owl` and `.ttl` files
- Support for complex relationships and properties
- Excellent for querying with SPARQL
- Human-readable syntax
- Good tooling support (rdflib, rdfpandas)

**When to use**: When you want direct semantic web compatibility and plan to use SPARQL queries

### 2. **JSON-LD**
**Best for**: Web applications and APIs with semantic annotation

**Advantages**:
- JSON syntax (familiar to developers)
- Semantic web compatible
- Good for web applications
- Can be processed as both JSON and RDF

**When to use**: When building web applications or APIs that need semantic data
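
As a rough illustration of "processed as both JSON and RDF", the sketch below builds a JSON-LD document for one article using only the standard `json` module. The base IRI, the `ex:` terms, and the sample record values are made up for the example; only the field names (`ref`, `num`, `texte`, `dateDebut`) come from the dataset.

```python
import json

# Hypothetical base IRI; replace with the ontology's real namespace.
BASE = "http://example.org/code-civil/"


def article_to_jsonld(record: dict) -> dict:
    """Wrap one dataset record in a JSON-LD document with an inline context."""
    return {
        "@context": {
            "ex": BASE,
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "xsd": "http://www.w3.org/2001/XMLSchema#",
            "label": "rdfs:label",
            "articleNumber": "ex:articleNumber",
            # French text literal and typed date, declared once in the context.
            "text": {"@id": "ex:text", "@language": "fr"},
            "startDate": {"@id": "ex:startDate", "@type": "xsd:date"},
        },
        "@id": f"ex:article_{record['num']}",
        "@type": "ex:LegalArticle",
        "label": record["ref"],
        "articleNumber": record["num"],
        "text": record["texte"],
        "startDate": record["dateDebut"],
    }


# Placeholder record: the values are invented for illustration.
sample_record = {
    "ref": "Code civil, art. 1",
    "num": "1",
    "texte": "Texte de l'article.",
    "dateDebut": "2000-01-01",
}

serialized = json.dumps(article_to_jsonld(sample_record), ensure_ascii=False, indent=2)
print(serialized)
```

A plain JSON consumer can read `label` or `text` directly, while an RDF-aware consumer can expand the same document through the `@context`.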

### 3. **CSV with RDF mapping**
**Best for**: Data analysis workflows with semantic annotation

**Advantages**:
- Easy to work with in pandas/Excel
- Good for data exploration and cleaning
- Can be mapped to RDF later
- Familiar format for data scientists

**When to use**: When you need extensive data preprocessing and analysis before semantic conversion
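
One way to picture "CSV with RDF mapping" is a small table that maps column names to predicate IRIs, applied row by row. The sketch below assumes a hypothetical mapping and placeholder IRIs, and uses only the standard `csv` module. Empty cells are skipped so that they do not produce triples.

```python
import csv
import io

# Hypothetical column-to-predicate mapping; the IRIs are placeholders.
COLUMN_TO_PREDICATE = {
    "num": "http://example.org/code-civil/articleNumber",
    "texte": "http://example.org/code-civil/text",
    "etat": "http://example.org/code-civil/applicationState",
}


def rows_to_triples(csv_text, subject_column="id"):
    """Yield (subject, predicate, value) for every mapped, non-empty cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        subject = f"http://example.org/code-civil/article/{row[subject_column]}"
        for column, predicate in COLUMN_TO_PREDICATE.items():
            value = row.get(column, "")
            if value:
                yield subject, predicate, value


# Invented sample data: the second row has an empty 'texte' cell.
SAMPLE_CSV = "id,num,texte,etat\nA1,1,Premier texte,VIGUEUR\nA2,2,,ABROGE\n"

triples = list(rows_to_triples(SAMPLE_CSV))
for triple in triples:
    print(triple)
```

Keeping the mapping in a separate dictionary means the CSV can be cleaned and explored freely, and the semantic decisions stay in one place.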

### 4. **N-Triples/N-Quads**
**Best for**: Large-scale data processing and streaming

**Advantages**:
- Simple line-based format
- Easy to parse and stream
- Good for very large datasets
- Machine-optimized

**When to use**: When dealing with very large datasets or when you need streaming processing
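
Because N-Triples puts one complete triple on each line, it can be written incrementally without holding a graph in memory. The following is a minimal hand-written sketch for language-tagged string literals only. The helper names and the example IRIs are hypothetical, and a real pipeline would more likely rely on rdflib's `nt` serializer.

```python
import io


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes, and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def write_ntriples(triples, out, lang="fr"):
    """Write (subject IRI, predicate IRI, text) tuples one line at a time."""
    count = 0
    for subject, predicate, text in triples:
        out.write(f'<{subject}> <{predicate}> "{escape_literal(text)}"@{lang} .\n')
        count += 1
    return count


# Invented sample triples; the second literal contains a newline and quotes.
sample_triples = [
    ("http://example.org/a1", "http://example.org/text", "Premier texte"),
    ("http://example.org/a2", "http://example.org/text", 'Ligne 1\n"fin"'),
]

buffer = io.StringIO()
written = write_ntriples(sample_triples, buffer)
print(buffer.getvalue())
```

Since `triples` can be any iterable, the same function accepts a generator that reads the dataset in streaming mode.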

## Recommended Approach for Code Civil Dataset

### **Primary Recommendation: RDF/Turtle (TTL)**

Given your existing ontology structure and the rich metadata in the Code Civil dataset, I recommend using **Turtle (TTL)** format with the following workflow:

#### Phase 1: Data Preprocessing (CSV/Pandas)
1. Clean and normalize the data using pandas
2. Map dataset fields to ontology properties
3. Create URIs for legal articles
4. Handle temporal information (dateDebut, dateFin)
5. Process hierarchical relationships
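
The first four steps of Phase 1 can be sketched with pandas as follows. This is an illustration under assumptions: the base URI is a placeholder, the sample rows are invented, and the real `dateDebut` values may need a different parsing strategy than the one shown here.

```python
from urllib.parse import quote

import pandas as pd

# Invented sample rows using field names from the dataset.
sample = pd.DataFrame(
    {
        "ref": ["Code civil, art. 1", "Code civil, art. 2"],
        "texte": ["  Premier   texte. ", None],
        "dateDebut": ["2004-06-01", "not a date"],
        "etat": ["VIGUEUR", "ABROGE"],
    }
)


def preprocess(frame, base_uri="http://example.org/code-civil/"):
    """Normalize text, build article URIs, and parse start dates."""
    out = frame.copy()
    # Step 1: collapse repeated whitespace and trim the article text.
    out["texte"] = out["texte"].str.replace(r"\s+", " ", regex=True).str.strip()
    # Step 3: derive a percent-encoded URI from the article reference.
    out["uri"] = base_uri + out["ref"].map(lambda r: quote(str(r).replace(" ", "_")))
    # Step 4: parse dates; unparseable values become NaT instead of raising.
    out["dateDebut"] = pd.to_datetime(out["dateDebut"], errors="coerce").dt.date
    return out


cleaned = preprocess(sample)
print(cleaned[["uri", "texte", "dateDebut"]])
```

Step 2 (field-to-property mapping) and step 5 (hierarchy) are left out here because they depend on the ontology's actual property names and on the section fields such as `sectionParentId`.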

#### Phase 2: RDF Generation (TTL)
Convert the preprocessed data to Turtle format using:
- **rdflib** for Python RDF manipulation
- **rdfpandas** for direct pandas-to-RDF conversion
- Custom mapping functions for complex relationships

### Key Dataset Fields Mapping:
- `ref` → URI identifier for legal articles
- `title_main` → rdfs:label
- `texte` → content property
- `dateDebut`/`dateFin` → temporal properties
- `type`, `nature` → classification properties
- `etat` → application state (en vigueur, abrogé)
- `num` → article number
- Hierarchical relationships from section data

In [None]:
# Example preprocessing workflow
import rdflib
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS, OWL, XSD
import re
from urllib.parse import quote

# Define namespaces
CC = Namespace("http://www.semanticweb.org/pierre/ontologies/2025/5/code-civil-ontology//")
ELI = Namespace("http://data.europa.eu/eli/ontology#")

# Create a sample preprocessing function
def preprocess_article_to_rdf(article_row):
    """Convert a single article row to RDF triples"""
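    g = Graph()
    # Bind a readable prefix so the Turtle output uses cc: instead of an auto-generated one
    g.bind("cc", CC)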
    
    # Create URI for the article
    article_id = quote(str(article_row['ref']).replace(' ', '_'))
    article_uri = CC[f"article_{article_id}"]
    
    # Basic properties
    g.add((article_uri, RDF.type, CC.LegalArticle))
    g.add((article_uri, RDFS.label, Literal(article_row['title_main'], lang='fr')))
    g.add((article_uri, CC.articleNumber, Literal(article_row['num'])))
    g.add((article_uri, CC.reference, Literal(article_row['ref'])))
    
    # Content
    if pd.notna(article_row['texte']):
        g.add((article_uri, CC.text, Literal(article_row['texte'], lang='fr')))
    
    # Temporal information
    if pd.notna(article_row['dateDebut']):
        g.add((article_uri, CC.startDate, Literal(article_row['dateDebut'], datatype=XSD.date)))
    
    # Application state
    if pd.notna(article_row['etat']):
        state_uri = CC[f"state_{article_row['etat'].replace(' ', '_')}"]
        g.add((article_uri, CC.applicationState, state_uri))
    
    return g

# Example: Convert first article
sample_article = df.iloc[0]
sample_rdf = preprocess_article_to_rdf(sample_article)

print("Sample RDF output (Turtle format):")
print(sample_rdf.serialize(format='turtle'))