# Generate Test Set

In this notebook, we generate a test set using model version v0.1.0. The goal is to create a representative and balanced dataset. To achieve this, we classify a large number of patents, shuffle the results, and then sample a few instances from each class. Finally, we manually verify and correct the labels for each class to ensure accuracy.

## Step 1: Generate Initial Predictions

We classify all patents using the model from `02-ClassifyPatent_v0.1.0.ipynb`.  
For each patent description, we store the following information:

- `patent_number`: Patent number
- `description_number`: Description number
- `description_text`: Patent description text
- `sdg`: Predicted SDG (Sustainable Development Goal) class, or `None` when no label reaches the score threshold

The results are saved in `classified_patents_raw.jsonl`.
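
As a rough illustration, each line of the JSONL file holds one description block. The sketch below writes and reads back a single made-up record using the keys that `analyze_patents_and_save_descriptions` emits later in this notebook. The patent number and text are invented placeholders, not real data.

```python
import json

# Hypothetical record: the values are placeholders, only the keys mirror the notebook.
record = {
    "patent_number": "EP0000000A1",
    "description_number": 12,
    "description_text": "A device for filtering water with a replaceable membrane.",
    "sdg": "SDG6",
}

# One JSON object per line; ensure_ascii=False keeps accented characters readable.
line = json.dumps(record, ensure_ascii=False)
restored = json.loads(line)
print(restored["sdg"])  # SDG6
```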


### 🔹 Step 1.1. Load all Patents

In [11]:
from api.config.db_config import get_db_connection

conn = get_db_connection()

def get_all_patents_number():
    # Fetch the patent data from the database
    patents = []
    fetch_patent_query = """
    SELECT patent.number
    FROM patent
    """

    cursor = conn.cursor()
    cursor.execute(fetch_patent_query)
    for row in cursor.fetchall():
        patents.append(row[0])
    cursor.close()
    return patents

all_patents_number = get_all_patents_number()

print(f"Total patent numbers collected: {len(all_patents_number)}")
print(f"Retrieved unique patent numbers: {len(set(all_patents_number))}")

Total patent numbers collected: 23337
Retrieved unique patent numbers: 23337


### 🔹 Step 1.2. Analyse patents


#### 1.2.1. Define model

In [2]:
from tqdm.notebook import tqdm
from transformers import pipeline
from api.config.ai_config import ai_huggingface_token

# Dict of SDG candidate labels
sdg_labels_dict = {
    "SDG1": "End poverty in all its forms everywhere", 
    "SDG2": "End hunger, achieve food security and improved nutrition and promote sustainable agriculture", 
    "SDG3": "Ensure healthy lives and promote well-being for all at all ages", 
    "SDG4": "Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all", 
    "SDG5": "Achieve gender equality and empower all women and girls", 
    "SDG6": "Ensure availability and sustainable management of water and sanitation for all", 
    "SDG7": "Ensure access to affordable, reliable, sustainable and modern energy for all", 
    "SDG8": "Promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all", 
    "SDG9": "Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation", 
    "SDG10": "Reduce inequality within and among countries", 
    "SDG11": "Make cities and human settlements inclusive, safe, resilient and sustainable", 
    "SDG12": "Ensure sustainable consumption and production patterns", 
    "SDG13": "Take urgent action to combat climate change and its impacts", 
    "SDG14": "Conserve and sustainably use the oceans, seas and marine resources for sustainable development", 
    "SDG15": "Protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage forests, combat desertification, and halt and reverse land degradation and halt biodiversity loss", 
    "SDG16": "Promote peaceful and inclusive societies for sustainable development, provide access to justice for all and build effective, accountable and inclusive institutions at all levels", 
    "SDG17": "Strengthen the means of implementation and revitalize the Global Partnership for Sustainable Development"
}

candidate_label_values = list(sdg_labels_dict.values())

# Initialize the classifier
classifier = pipeline(model="facebook/bart-large-mnli", token=ai_huggingface_token)


Device set to use cpu


In [3]:
from api.models.Patent import FullPatent
from api.repositories.patent_repository import update_full_patent

def get_sdg_code_from_label(label: str, label_dict: dict) -> str:
    """Reverse lookup SDG code from full label text."""
    for code, text in label_dict.items():
        if label == text:
            return code
    return "None"


def classify_full_patent_description(patent: FullPatent,
                                     classifier=classifier,
                                     candidate_labels=candidate_label_values,
                                     label_dict=sdg_labels_dict,
                                     threshold: float = 0.18) -> FullPatent:
    """
    Classify all description blocks in a FullPatent and enrich them with SDG labels.

    Args:
        patent (FullPatent): The patent to analyze.
        classifier: HuggingFace zero-shot classification pipeline.
        candidate_labels (list): SDG label texts.
        label_dict (dict): Map from SDG code to SDG label text.
        threshold (float): Minimum top score required to accept a prediction.

    Returns:
        FullPatent: Enriched object.
    """

    # Step 1: Keep only descriptions longer than 20 words
    valid_descriptions = [(desc, desc.description_text)
                          for desc in patent.description
                          if len(desc.description_text.split()) > 20]

    # Step 2: Extract just the text for classification
    texts_to_classify = [text for _, text in valid_descriptions]

    # Step 3: Run classifier on the batch (skip the call when no description qualifies)
    results = (classifier(texts_to_classify, candidate_labels=candidate_labels)
               if texts_to_classify else [])

    # Step 4: Assign results back to descriptions
    for (desc, _), result in zip(valid_descriptions, results):

        try:
            # Labels and scores are sorted from best to worst
            top_score = result["scores"][0]
            if top_score >= threshold:
                label_text = result["labels"][0]
                desc.sdg = get_sdg_code_from_label(label_text, label_dict)
            else:
                desc.sdg = "None"
                top_score = -1

            # print(f"[{desc.description_number}] Label: {desc.sdg} | Score: {top_score:.3f} | Text: {desc.description_text}")
        except Exception as e:
            print(f"Error on description {desc.description_number}: {e}")
            desc.sdg = "Error"

    # Step 5: Handle short descriptions (not classified)
    for desc in patent.description:
        if len(desc.description_text.split()) <= 20:
            desc.sdg = "None"

    patent.is_analyzed = True

    # Update the Patent in Database
    update_full_patent(patent.model_dump())

    return patent
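
The thresholding rule used in `classify_full_patent_description` can be illustrated without loading the model. The sketch below applies it to hand-written result dictionaries shaped like zero-shot pipeline output (labels and scores sorted from best to worst). The helper `pick_sdg`, the shortened labels, and the scores are all assumptions made for illustration, not values produced by the classifier.

```python
def pick_sdg(result: dict, label_to_code: dict, threshold: float = 0.18) -> str:
    """Return the SDG code of the top label, or "None" below the threshold."""
    top_label, top_score = result["labels"][0], result["scores"][0]
    if top_score < threshold:
        return "None"
    # Unknown labels also fall back to "None", as in get_sdg_code_from_label.
    return label_to_code.get(top_label, "None")

# Hypothetical stand-ins for classifier output (truncated to two labels for brevity).
label_to_code = {"Clean water": "SDG6", "Clean energy": "SDG7"}
confident = {"labels": ["Clean water", "Clean energy"], "scores": [0.71, 0.10]}
uncertain = {"labels": ["Clean energy", "Clean water"], "scores": [0.15, 0.12]}

print(pick_sdg(confident, label_to_code))  # SDG6
print(pick_sdg(uncertain, label_to_code))  # None
```

With 17 candidate labels sharing the probability mass, a top score of 0.18 is well above the uniform baseline of about 0.06, which is one plausible reading of why such a low cut-off was chosen.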

#### 1.2.2. Run analysis

In [None]:
import os
import json
import random
from typing import List, Dict
from api.services.patent_service import get_full_patent_by_number

def load_already_classified_patents(file_path: str) -> set:
    """Load already classified patents from the jsonl."""
    if not os.path.exists(file_path):
        return set()

    classified = set()
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                item = json.loads(line.strip())
                classified.add(item["patent_number"])
            except Exception as e:
                print(f"Error reading line: {e}")
    return classified


def save_classified_descriptions(data: List[Dict], file_path: str):
    """Append the classified descriptions to the JSONL file, one JSON object per line."""
    with open(file_path, "a", encoding="utf-8") as f:
        for item in data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def analyze_patents_and_save_descriptions(patent_numbers: List[str], export_file: str = "../src/ai/testsets/raw/classified_patents_raw_test.jsonl"):
    already_classified = load_already_classified_patents(export_file)
    to_process = [pn for pn in patent_numbers if pn not in already_classified]

    print(f"Total patents to process: {len(to_process)}")
    random.shuffle(to_process)

    for patent_number in to_process:
        try:
            print(f"\nProcessing patent {patent_number}...")
            patent = get_full_patent_by_number(patent_number)
            enriched_patent = classify_full_patent_description(patent)

            output_data = []
            for desc in enriched_patent.description:
                output_data.append({
                    "patent_number": patent_number,
                    "description_number": desc.description_number,
                    "description_text": desc.description_text,
                    "sdg": desc.sdg
                })

            save_classified_descriptions(output_data, export_file)

        except Exception as e:
            print(f"Error processing {patent_number}: {e}")

    print("\nProcessing completed and data saved.")

In [6]:
# Analyse all patents
analyze_patents_and_save_descriptions(all_patents_number, export_file="../src/ai/testsets/raw/classified_patents_raw.jsonl" )

Total patents to process: 20447

Processing patent EP4099158A1...


KeyboardInterrupt: 
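
Because results are appended patent by patent, an interrupted run like the one above can be resumed: patents already present in the export file are skipped on the next call. The sketch below reproduces that resume logic on a temporary file. The helper name `load_done` and the patent numbers `P1` to `P3` are placeholders, not part of the project.

```python
import json
import os
import tempfile

def load_done(path: str) -> set:
    """Collect the patent numbers already present in a JSONL export."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return {json.loads(line)["patent_number"] for line in f if line.strip()}

with tempfile.TemporaryDirectory() as tmp:
    export = os.path.join(tmp, "export.jsonl")

    # Simulate a first run that was interrupted after one patent.
    with open(export, "a", encoding="utf-8") as f:
        f.write(json.dumps({"patent_number": "P1", "sdg": "None"}) + "\n")

    # On restart, only the patents missing from the export remain.
    done = load_done(export)
    todo = [pn for pn in ["P1", "P2", "P3"] if pn not in done]

print(todo)  # ['P2', 'P3']
```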

#### 1.2.3. Remove Duplicates

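To ensure the quality of the dataset, we remove duplicate entries from the classified results. Two entries count as duplicates when their `description_text` is identical, even if they belong to different patents.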

In [7]:
import json

def analyze_and_save_clean_jsonl(file_path, output_path):
    seen_pairs = set()
    total_lines = 0
    duplicate_lines = 0
    duplicate_patents = []

    with open(file_path, "r", encoding="utf-8") as infile, \
         open(output_path, "w", encoding="utf-8") as outfile:

        for line in infile:
            try:
                total_lines += 1
                item = json.loads(line.strip())
                pair = (item.get("description_text"))

                if pair in seen_pairs:
                    duplicate_lines += 1
                    duplicate_patents.append(item.get("patent_number"))
                else:
                    seen_pairs.add(pair)
                    outfile.write(json.dumps(item, ensure_ascii=False) + "\n")
            except Exception as e:
                print(f"JSON parsing error: {e}")

    print(f"Total lines: {total_lines}")
    print(f"Duplicate (patent_number, description_text) pairs: {duplicate_lines}")
    print(f"Unique (patent_number, description_text) pairs saved: {len(seen_pairs)}")

    return total_lines, duplicate_lines, duplicate_patents


input_path = "../src/ai/testsets/raw/classified_patents_raw.jsonl"
output_path = "../src/ai/testsets/raw/classified_patents_raw_clean.jsonl"
analyze_and_save_clean_jsonl(input_path, output_path)

Total lines: 291654
Duplicate (patent_number, description_text) pairs: 10664
Unique (patent_number, description_text) pairs saved: 280990


(291654,
 10664,
 ['EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4434779A1',
  'EP4438582A1',
  'EP4443476A1',
  'EP4443476A1',
  'EP4443476A1',
  'EP4443476A1',
  'EP4443476A1',
  'EP4443476A1',
  'EP4462320A2',
  'EP4462320A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2',
  'EP4467184A2

#### 1.2.4. Check for remaining duplicates

In [15]:
def analyze_jsonl_by_patent_and_description(file_path):
    seen_pairs = set()
    total_lines = 0
    duplicate_lines = 0
    duplicate_patents = []


    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                total_lines += 1
                item = json.loads(line.strip())
                pair = (item.get("description_text"))

                if pair in seen_pairs:
                    duplicate_lines += 1
                    duplicate_patents.append(item.get("patent_number"))
                else:
                    seen_pairs.add(pair)
            except Exception as e:
                print(f"JSON parsing error: {e}")

    print(f"Total lines: {total_lines}")
    print(f"Duplicate (patent_number, description_number) pairs: {duplicate_lines}")
    print(f"Unique (patent_number, description_number) pairs: {len(seen_pairs)}")

    return total_lines, duplicate_lines, duplicate_patents

analyze_jsonl_by_patent_and_description('../src/ai/testsets/raw/classified_patents_raw_clean.jsonl')

Total lines: 280990
Duplicate (patent_number, description_number) pairs: 0
Unique (patent_number, description_number) pairs: 280990


(280990, 0, [])
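
Note that both functions above key on `description_text` alone, so the "pairs" wording in their printed messages is looser than what the code checks. A toy sketch of the rule, using made-up records, shows the consequence: identical text found in two different patents is kept only once.

```python
# Made-up records: P1 and P2 share one identical description text.
records = [
    {"patent_number": "P1", "description_text": "same text"},
    {"patent_number": "P2", "description_text": "same text"},
    {"patent_number": "P2", "description_text": "other text"},
]

seen = set()
unique = []
for item in records:
    key = item["description_text"]  # patent_number is ignored, as in the notebook
    if key not in seen:
        seen.add(key)
        unique.append(item)

print(len(unique))  # 2
```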

## Step 2: Create a Balanced Test Set

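To ensure class balance:
- Detect the language of each patent and keep only French, English, and German ones
- Group the descriptions by predicted `sdg`
- Shuffle each group
- Sample up to `max_per_sdg` items per class (10 by default, 300 in the run below)

The selected items are saved in `testset_[version]_[language]_raw.jsonl`.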
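
The group, shuffle, and sample steps can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's implementation: the function name `sample_per_class`, the fixed `seed`, and the toy records are assumptions (the notebook itself does not seed `random`).

```python
import random
from collections import Counter, defaultdict

def sample_per_class(items, key="sdg", max_per_class=10, seed=0):
    """Group items by class, shuffle each group, keep at most max_per_class."""
    rng = random.Random(seed)  # local RNG so the sketch is reproducible
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)

    sample = []
    for label in sorted(groups):
        members = list(groups[label])
        rng.shuffle(members)
        sample.extend(members[:max_per_class])
    return sample

# Toy data: 10 x SDG3, 30 x SDG9, and a rare class with only 2 items.
toy = [{"id": i, "sdg": "SDG9" if i % 4 else "SDG3"} for i in range(40)]
toy += [{"id": 100, "sdg": "SDG1"}, {"id": 101, "sdg": "SDG1"}]

picked = sample_per_class(toy, max_per_class=5)
counts = Counter(item["sdg"] for item in picked)
print(dict(counts))  # {'SDG1': 2, 'SDG3': 5, 'SDG9': 5}
```

The cap only limits large classes. A class with fewer items than the cap contributes everything it has and stays under-represented, which is why manual review of the sampled set remains necessary.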

#### 2.1. Generate

In [9]:
import json
from collections import defaultdict
import random
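from typing import List, Optional
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

# Fix the seed so language detection is deterministic across runs
DetectorFactory.seed = 0

def detect_language(text: str) -> Optional[str]:
    """Return "fr", "en" or "de" if one of them is detected, otherwise None."""
    try:
        lang = detect(text)
        if lang in {"fr", "en", "de"}:
            return lang
    except LangDetectException:
        pass
    return None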

def generate_classified_patents_raw(
    jsonl_file: str,
    version: str = "v0",
    max_per_sdg: int = 10
):
    """
    Optimized: Detect language once per unique patent_number, applied to all its entries.
    Also prints statistics per SDG.
    """
    
    # 1. Group items by patent_number (to ensure uniqueness and batch language detection)
    items_by_patent = defaultdict(list)
    with open(jsonl_file, "r", encoding="utf-8") as f:
        for line in f:
            try:
                item = json.loads(line.strip())
                patent_number = item.get("patent_number")
                if patent_number:
                    items_by_patent[patent_number].append(item)
            except Exception as e:
                print(f"JSON parsing error: {e}")

    # 2. Detect language once per patent_number, then group items by language and SDG
    grouped_by_lang = {
        "fr": defaultdict(list),
        "en": defaultdict(list),
        "de": defaultdict(list)
    }

    for patent_number, items in items_by_patent.items():
        # Find first non-empty description text to detect language
        for item in items:
            text = item.get("description_text", "").strip()
            if text:
                lang = detect_language(text)
                break
        else:
            lang = None

        # If language detected and supported, group items by SDG under that language
        if lang in grouped_by_lang:
            for item in items:
                sdg = item.get("sdg")
                if sdg is not None:
                    grouped_by_lang[lang][sdg].append(item)

    # 3. For each language, shuffle and select up to max_per_sdg items per SDG, then write to file
    for lang, grouped in grouped_by_lang.items():
        testset: List[dict] = []
        per_class_count = {}
        for sdg, items in grouped.items():
            random.shuffle(items)
            selected = items[:max_per_sdg]
            testset.extend(selected)
            per_class_count[sdg] = len(selected)

        output_file = f"../src/ai/testsets/raw/testset_{version}_{lang}_raw.jsonl"
        with open(output_file, "w", encoding="utf-8") as out:
            for item in testset:
                out.write(json.dumps(item, ensure_ascii=False) + "\n")


        print(f"\n{output_file} generated with {len(testset)} items.")
        print("Distribution by SDG:")
        for sdg in sorted(per_class_count):
            print(f"  SDG {sdg:>2}: {per_class_count[sdg]} items")

In [20]:
generate_classified_patents_raw(jsonl_file="../src/ai/testsets/raw/classified_patents_raw_clean.jsonl", version="v1", max_per_sdg=300)


../src/ai/testsets/raw/testset_v1_fr_raw.jsonl generated with 1063 items.
Distribution by SDG:
  SDG None: 300 items
  SDG SDG1: 1 items
  SDG SDG10: 12 items
  SDG SDG12: 300 items
  SDG SDG13: 6 items
  SDG SDG14: 12 items
  SDG SDG15: 275 items
  SDG SDG16: 3 items
  SDG SDG17: 6 items
  SDG SDG3: 20 items
  SDG SDG4: 2 items
  SDG SDG6: 40 items
  SDG SDG9: 86 items

../src/ai/testsets/raw/testset_v1_en_raw.jsonl generated with 1781 items.
Distribution by SDG:
  SDG None: 300 items
  SDG SDG10: 275 items
  SDG SDG11: 12 items
  SDG SDG12: 300 items
  SDG SDG13: 61 items
  SDG SDG14: 8 items
  SDG SDG15: 300 items
  SDG SDG16: 4 items
  SDG SDG17: 7 items
  SDG SDG3: 142 items
  SDG SDG4: 28 items
  SDG SDG6: 41 items
  SDG SDG8: 3 items
  SDG SDG9: 300 items

../src/ai/testsets/raw/testset_v1_de_raw.jsonl generated with 775 items.
Distribution by SDG:
  SDG None: 300 items
  SDG SDG10: 15 items

#### 2.2. Check duplicates

In [21]:
import json
from collections import defaultdict

def find_duplicates_jsonl(jsonl_file):
    """
    Find duplicates in a JSONL file based on 'description_text'.

    Args:
        jsonl_file (str): Path to the JSONL file.

    Returns:
        List[dict]: List of duplicate entries (each as a dictionary).
    """
    seen = defaultdict(list)
    duplicates = []

    with open(jsonl_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                item = json.loads(line.strip())
                # Entries are grouped by their description text only
                key = item.get('description_text')
                seen[key].append(item)
            except json.JSONDecodeError as e:
                print(f"Skipped line due to parsing error: {e}")

    # Any text seen more than once contributes all of its entries
    for items in seen.values():
        if len(items) > 1:
            duplicates.extend(items)

    return duplicates


In [22]:
find_duplicates_jsonl("../src/ai/testsets/raw/testset_v1_en_raw.jsonl")

[]

In [None]:
find_duplicates_jsonl("../src/ai/testsets/raw/testset_v1_fr_raw.jsonl")

[]

In [None]:
find_duplicates_jsonl("../src/ai/testsets/raw/testset_v1_de_raw.jsonl")

[]

## Manual Labeling

We manually label the English raw test set with `src/ai/labelization_app.py`. The cell below plots the distribution of the final test set.

In [3]:
import json
import pandas as pd
import plotly.express as px
from scipy.stats import mannwhitneyu

def testset_distribution(testset_path):
    """
    Interactive distribution plot of SDG labels and boxplot of description_number using Plotly.

    Args:
        testset_path (str): Path to the testset file (JSONL format).
    """
    # --- Load data ---
    with open(testset_path, "r", encoding="utf-8") as f:
        data = [json.loads(line) for line in f]

    df = pd.DataFrame(data)

    # --- Plot 1: Interactive Bar Chart of SDG Labels with Counts ---
    df_sdg_counts = df["sdg"].value_counts().reset_index()
    df_sdg_counts.columns = ["sdg", "count"]

    fig_bar = px.bar(
        df_sdg_counts,
        x="sdg", y="count",
        title="Distribution of SDG Labels",
        labels={"sdg": "SDG", "count": "Count"},
        color="sdg"  # Keep color distinction
    )

    # Add count annotations on top of each bar
    for _, row in df_sdg_counts.iterrows():
        fig_bar.add_annotation(
            x=row["sdg"],
            y=row["count"],
            text=str(row["count"]),
            showarrow=False,
            yshift=10,
            font=dict(size=12, color="black")
        )

    fig_bar.update_layout(yaxis_title="Count")
    fig_bar.show()

    # --- Plot 2: Interactive Boxplot of description_number by SDG ---
    fig_box = px.box(
        df[df["sdg"] != "Error"],
        x="sdg",
        y="description_number",
        title="Description Number by SDG Category",
        points="all",
        color="sdg"
    )
    fig_box.update_layout(
        xaxis_title="SDG",
        yaxis_title="Description Number",
        boxmode="group"
    )
    fig_box.show()

    # --- Statistical test: SDG vs None (Mann-Whitney) ---
    df["has_sdg"] = df["sdg"] != "None"
    group1 = df[~df["has_sdg"]]["description_number"]  # descriptions labeled "None"
    group2 = df[df["has_sdg"]]["description_number"]   # descriptions with an SDG label

    stat, p = mannwhitneyu(group1, group2, alternative='two-sided')
    print(f"\nMann-Whitney U Test p-value: {p:.4f}")
    if p < 0.05:
        print("Significant difference in description_number between 'None' and SDG-assigned.")
    else:
        print("No significant difference in description_number between 'None' and SDG-assigned.")

    # --- Plot 3: Binary boxplot: has SDG or not ---
    fig_binary = px.box(
        df,
        x="has_sdg",
        y="description_number",
        color="has_sdg",
        title="Description Number by SDG Presence",
        points="all",
        labels={"has_sdg": "Has SDG Label", "description_number": "Description Number"}
    )
    fig_binary.update_xaxes(
        tickvals=[True, False],
        ticktext=["SDG Assigned", "None"]
    )
    fig_binary.show()


# Call the function (example path)
testset_distribution("../src/ai/testsets/testset_v1_en_labeled.jsonl")



Mann-Whitney U Test p-value: 0.0000
Significant difference in description_number between 'None' and SDG-assigned.
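
For intuition about what the test measures, the U statistic can be computed by hand: it counts, over every pair made of one value from each group, how often the first group's value is larger (ties count as one half). The sketch below uses made-up description numbers and a hypothetical helper `mann_whitney_u`. The p-value computation is left to SciPy and is not reproduced here.

```python
def mann_whitney_u(xs, ys):
    """U statistic of xs versus ys: pairs where x > y, ties counted as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

# Made-up description numbers: "None" blocks early in the text, SDG blocks later.
none_positions = [1, 2, 3, 5]
sdg_positions = [4, 6, 7, 8]

u_none = mann_whitney_u(none_positions, sdg_positions)
u_sdg = mann_whitney_u(sdg_positions, none_positions)
print(u_none, u_sdg)  # 1.0 15.0
```

The two statistics always sum to the number of pairs (here 4 x 4 = 16). A value far from half of that, as in this toy case, indicates that one group tends to hold larger description numbers than the other.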


## Optional Functions

### Retrieve all patent numbers from the database

If you want to retrieve the labeled data from the database instead of regenerating it, you can use the function below.

In [None]:
from api.config.db_config import get_db_connection

conn = get_db_connection()

def get_all_patents_number_if_analysed():
    """
    Retrieve the numbers of all patents whose `is_analyzed` flag is not False.

    Note: `IS NOT False` also matches rows where `is_analyzed` is NULL.

    Returns:
        list: A list of patent numbers.
    """
    # Fetch the patent data from the database
    patents = []
    fetch_patent_query = """
    SELECT patent.number, is_analyzed
    FROM patent
    WHERE is_analyzed IS NOT False
    """

    cursor = conn.cursor()
    cursor.execute(fetch_patent_query)
    for row in cursor.fetchall():
        patents.append(row[0])
    cursor.close()
    return patents

all_patents_number_if_analysed = get_all_patents_number_if_analysed()

[]

In [21]:
from typing import List

def db_get_patents_and_save_descriptions(patent_numbers: List[str], export_file: str = "../src/ai/testsets/raw/classified_patents_raw_test.jsonl"):
    already_classified = load_already_classified_patents(export_file)
    to_process = [pn for pn in patent_numbers if pn not in already_classified]

    print(f"Total patents to process: {len(to_process)}")


    for patent_number in tqdm(to_process, desc="Patent recovered"):
        try:
            patent = get_full_patent_by_number(patent_number)
            # Save the patent only if it has already been analysed
            if patent.is_analyzed:
                output_data = []
                for desc in patent.description:
                    output_data.append({
                        "patent_number": patent_number,
                        "description_number": desc.description_number,
                        "description_text": desc.description_text,
                        "sdg": desc.sdg
                    })

                save_classified_descriptions(output_data, export_file)

        except Exception as e:
            print(f"Error processing {patent_number}: {e}")

    print("\nProcessing completed and data saved.")

In [22]:
db_get_patents_and_save_descriptions(all_patents_number_if_analysed, export_file="../src/ai/testsets/raw/classified_patents_raw_db.jsonl" )

Total patents to process: 2478


Patent recovered:   0%|          | 0/2478 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [23]:
# Remove duplicates
input_path = "../src/ai/testsets/raw/classified_patents_raw_db.jsonl"
output_path = "../src/ai/testsets/raw/classified_patents_raw_db_clean.jsonl"
analyze_and_save_clean_jsonl(input_path, output_path)

Total lines: 474
Duplicate (patent_number, description_text) pairs: 24
Unique (patent_number, description_text) pairs saved: 450


(474,
 24,
 ['EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4425089A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1',
  'EP4432545A1'])

In [24]:
analyze_jsonl_by_patent_and_description('../src/ai/testsets/raw/classified_patents_raw_clean.jsonl')

Total lines: 42
Duplicate (patent_number, description_number) pairs: 0
Unique (patent_number, description_number) pairs: 42


(42, 0, [])

In [25]:
generate_classified_patents_raw(jsonl_file="../src/ai/testsets/raw/classified_patents_raw_db_clean.jsonl", version="db_v1", max_per_sdg=10)


../src/ai/testsets/raw/testset_db_v1_fr_raw.jsonl generated with 0 items.
Distribution by SDG:

../src/ai/testsets/raw/testset_db_v1_en_raw.jsonl generated with 23 items.
Distribution by SDG:
  SDG None: 10 items
  SDG SDG10: 1 items
  SDG SDG12: 10 items
  SDG SDG4: 1 items
  SDG SDG9: 1 items

../src/ai/testsets/raw/testset_db_v1_de_raw.jsonl generated with 0 items.
Distribution by SDG:


In [29]:
find_duplicates_jsonl("../src/ai/testsets/raw/testset_db_v1_en_raw.jsonl")

[]

In [30]:
find_duplicates_jsonl("../src/ai/testsets/raw/testset_db_v1_fr_raw.jsonl")

[]

In [31]:
find_duplicates_jsonl("../src/ai/testsets/raw/testset_db_v1_de_raw.jsonl")

[]