# Part 5: Projects – Deduplication

### Context
Deduplication is an important step during data migration and when cleaning existing datasets. Because librarians follow different cataloguing practices and do not always describe the same item in the same way, two records for one item rarely match field for field, which makes the task challenging.

### Task
An algorithm for comparing records is already provided, but it only solves part of the problem. The bigger question is: **how do we identify potential duplicate candidates to compare in the first place?**

A common approach is to use the metadata of records to generate several SRU queries. To be thorough, you need to retrieve enough candidates to avoid missing possible duplicates. However, fetching too many candidates can be time-consuming, so you also need to limit the number of results.
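
As an illustration of this idea, the sketch below derives a few query strings from the fields of a brief record, ordered from most to least selective. The function name `build_candidate_queries` is hypothetical, and the index names `alma.isbn`, `alma.title` and `alma.creator` are assumptions that should be checked against the indexes the SRU server actually exposes. The input is a plain dictionary with the same keys as the printed brief records.

```python
from typing import Dict, List


def build_candidate_queries(brief: Dict) -> List[str]:
    """Build SRU query strings from a brief-record dict, most selective first.

    Hypothetical helper: the index names used here are assumptions.
    """
    queries: List[str] = []

    # 1. Standard numbers are usually the most selective search key.
    for num in brief.get("std_nums") or []:
        queries.append(f'alma.isbn="{num}"')

    titles = brief.get("short_titles") or []
    creators = brief.get("creators") or []
    if titles:
        # Drop trailing ISBD punctuation and any double quotes.
        title = titles[0].rstrip(" :/,.").replace('"', "")
        if creators:
            # 2. Title combined with the surname of the first creator.
            surname = creators[0].split(",")[0].strip()
            queries.append(f'alma.title="{title}" and alma.creator="{surname}"')
        # 3. Title alone as a broad fallback.
        queries.append(f'alma.title="{title}"')

    return queries


example = {
    "std_nums": ["9782412077467"],
    "short_titles": ["Python pour la finance :"],
    "creators": ["Hilpisch, Yves"],
}
for query in build_candidate_queries(example):
    print(query)
```

Ordering the queries from narrow to broad makes it possible to stop early once enough candidates have been retrieved.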

In practice, you will take one record from BCUL and retrieve possible duplicate candidates from SLSP. You can then verify the candidates using the provided deduplication function:

```python
# Provided helper (interface may vary in your environment)
# is_duplicated(rec1, rec2) -> (is_duplicate: bool, score: float)
# Use it to validate each candidate pair.
```
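
One way to balance thoroughness against run time is to run the queries one after another, keep only unique record IDs, and stop as soon as a cap is reached. The sketch below shows that loop under stated assumptions: `collect_candidates`, `fetch` and `max_candidates` are hypothetical names, and `fetch` stands in for whatever code sends a query to the SLSP SRU endpoint and returns the MMS IDs of the hits.

```python
from typing import Callable, Iterable, List


def collect_candidates(
    queries: Iterable[str],
    fetch: Callable[[str], List[str]],
    max_candidates: int = 20,
) -> List[str]:
    """Run queries in order and gather unique MMS IDs, up to a cap.

    `fetch` is a placeholder for the real SRU call (an assumption).
    """
    candidates: List[str] = []
    for query in queries:
        for mms_id in fetch(query):
            if mms_id not in candidates:
                candidates.append(mms_id)
            if len(candidates) >= max_candidates:
                # Enough candidates: skip the remaining, broader queries.
                return candidates
    return candidates


# Stand-in for the SRU server, used only to demonstrate the loop.
fake_index = {
    "q_isbn": ["991171135496005501", "991171940910205501"],
    "q_title": ["991171135496005501", "990000000000000001"],
}
found = collect_candidates(["q_isbn", "q_title"], lambda q: fake_index.get(q, []))
print(found)
```

In the notebook, `fetch` would wrap an `SruRequest`; how the MMS ID is read from each returned record depends on the `almasru` API and is not shown here.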

### Result
Implement a function that takes the MMS ID of a BCUL record as input and returns a tuple in the following format:

```python
(<is_match>, <mms_id_of_the_best_match>, <score_of_the_best_match>)
```

* `is_match`: boolean indicating whether a sufficiently good duplicate was found
* `mms_id_of_the_best_match`: MMS ID of the best candidate (or `None` if none found)
* `score_of_the_best_match`: numeric score of the best candidate as produced by your comparison (or `None` if none found)
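
A minimal sketch of how the final tuple could be assembled once each candidate has been scored. The name `best_match` is hypothetical, and the default threshold of 0.5 is taken from the `is_duplicated` docstring; the input is assumed to be an iterable of `(mms_id, score)` pairs.

```python
from typing import Iterable, Optional, Tuple


def best_match(
    scored: Iterable[Tuple[str, float]],
    threshold: float = 0.5,
) -> Tuple[bool, Optional[str], Optional[float]]:
    """Return (is_match, mms_id, score) for the highest-scoring candidate."""
    best_id: Optional[str] = None
    best_score: Optional[float] = None
    for mms_id, score in scored:
        if best_score is None or score > best_score:
            best_id, best_score = mms_id, score

    if best_score is None:
        # No candidate was retrieved at all.
        return False, None, None
    return best_score >= threshold, best_id, best_score


# Scores taken from the two comparisons shown in this notebook.
pairs = [
    ("991171135496005501", 0.8056666666666668),
    ("991171940910205501", 0.3653333333333333),
]
print(best_match(pairs))
print(best_match([]))
```

Returning the best candidate even when it falls below the threshold keeps the near misses visible, which helps when tuning the queries.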

In [2]:
# Import libraries
from almasru.client import SruClient, SruRecord, IzSruRecord, SruRequest
from almasru.utils import check_removable_records, analyse_records
from almasru import dedup
from almasru.briefrecord import BriefRecFactory, BriefRec
from almasru import config_log

import numpy as np
import pandas as pd
from typing import Tuple

from dedupmarcxml.evaluate import evaluate_records_similarity, get_similarity_score
from dedupmarcxml.briefrecord import XmlBriefRec

# Config logs
config_log()

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [3]:
def is_duplicated(rec1: XmlBriefRec, rec2: XmlBriefRec) -> Tuple[bool, float]:
    """
    Determine whether two MARC XML records are considered duplicates.

    Parameters
    ----------
    rec1 : XmlBriefRec
        First record, constructed from an etree.Element representing a MARC XML record.
    rec2 : XmlBriefRec
        Second record to compare, also constructed from an etree.Element.

    Returns
    -------
    Tuple[bool, float]
        A tuple containing:
        - A boolean indicating whether the records are considered duplicates.
        - A float representing the similarity score (≥ 0.5 indicates duplication).
    """
    score_detailed = evaluate_records_similarity(rec1, rec2)
    # Compute the aggregated score once and reuse it for both return values
    score = get_similarity_score(score_detailed, method='random_forest_general')
    return score >= 0.5, score

In [20]:
# Same material type (ISBN: 9782412077467)
SruClient.set_base_url('https://renouvaud.primo.exlibrisgroup.com/view/sru/41BCULAUSA_NETWORK')
r = SruRequest(query='alma.mms_id=991024372153702851')
bcul_rec = r.records[0] 
bcul_briefrec = XmlBriefRec(bcul_rec.data)
SruClient.set_base_url('https://swisscovery.ch/view/sru/41SLSP_NETWORK')
r = SruRequest(query='alma.mms_id=991171135496005501')
slsp_rec = r.records[0] 
slsp_briefrec = XmlBriefRec(slsp_rec.data)
result, score = is_duplicated(slsp_briefrec, bcul_briefrec)
print(f'Is duplicated: {result} ({score})')

2025-09-08 15:27:44,940 - INFO - Records 1 - 1 / 1, "alma.mms_id=991024372153702851": 1
2025-09-08 15:27:44,944 - INFO - Records 1 - 1 / 1, "alma.mms_id=991171135496005501": 1
Is duplicated: True (0.8056666666666668)


In [21]:
# Print version compared to the e-book version (ISBN: 9782412077467)
SruClient.set_base_url('https://renouvaud.primo.exlibrisgroup.com/view/sru/41BCULAUSA_NETWORK')
r = SruRequest(query='alma.mms_id=991024372153702851')
bcul_rec = r.records[0] 
bcul_briefrec = XmlBriefRec(bcul_rec.data)
SruClient.set_base_url('https://swisscovery.ch/view/sru/41SLSP_NETWORK')
r = SruRequest(query='alma.mms_id=991171940910205501')
slsp_rec = r.records[0] 
slsp_briefrec = XmlBriefRec(slsp_rec.data)
result, score = is_duplicated(slsp_briefrec, bcul_briefrec)
print(f'Is duplicated: {result} ({score})')

2025-09-08 15:27:45,179 - INFO - Records 1 - 1 / 1, "alma.mms_id=991024372153702851": 1
2025-09-08 15:27:46,113 - INFO - SRU data fetched: https://swisscovery.ch/view/sru/41SLSP_NETWORK?query=alma.mms_id%3D991171940910205501&version=1.2&operation=searchRetrieve&startRecord=1&maximumRecords=10
2025-09-08 15:27:46,121 - INFO - Records 1 - 1 / 1, "alma.mms_id=991171940910205501": 1
Is duplicated: False (0.3653333333333333)


In [8]:
evaluate_records_similarity(slsp_briefrec, bcul_briefrec)

{'format': 1.0,
 'titles': np.float64(0.7548294315960244),
 'short_titles': np.float64(1.0),
 'creators': np.float64(1.0),
 'corp_creators': 0.0,
 'languages': 1.0,
 'publishers': np.float64(0.9999999924109011),
 'editions': 0.0,
 'extent': np.float64(0.6740657233026162),
 'years': 0.9200000000000002,
 'series': 0.0,
 'parent': 0.0,
 'std_nums': 1.0,
 'sys_nums': 0.1}
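
To see at a glance which fields lower the similarity between two records, the per-field dictionary can be sorted by score. The sketch below copies the values from the output above as plain floats; the helper name `weakest_fields` is hypothetical.

```python
from typing import Dict, List, Tuple


def weakest_fields(scores: Dict[str, float], below: float = 1.0) -> List[Tuple[str, float]]:
    """Return the fields scoring under `below`, lowest first."""
    low = [(field, float(value)) for field, value in scores.items() if value < below]
    # sorted() is stable, so tied fields keep their original order.
    return sorted(low, key=lambda item: item[1])


detailed = {
    "format": 1.0,
    "titles": 0.7548294315960244,
    "short_titles": 1.0,
    "creators": 1.0,
    "corp_creators": 0.0,
    "languages": 1.0,
    "publishers": 0.9999999924109011,
    "editions": 0.0,
    "extent": 0.6740657233026162,
    "years": 0.9200000000000002,
    "series": 0.0,
    "parent": 0.0,
    "std_nums": 1.0,
    "sys_nums": 0.1,
}
for field, value in weakest_fields(detailed):
    print(f"{field:15s} {value:.3f}")
```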

In [10]:
print(slsp_briefrec)

{
    "rec_id": "991171135496005501",
    "format": {
        "type": "Book",
        "access": "Physical",
        "analytical": false,
        "f33x": "txt;n;nc"
    },
    "titles": [
        {
            "m": "Python pour la finance",
            "s": ""
        }
    ],
    "short_titles": [
        "Python pour la finance"
    ],
    "creators": [
        "Hilpisch, Yves"
    ],
    "corp_creators": null,
    "languages": [
        "fre"
    ],
    "extent": {
        "nb": [
            689
        ],
        "txt": "689 pages"
    },
    "editions": null,
    "years": {
        "y1": [
            2022
        ],
        "y2": 2022
    },
    "publishers": [
        "First interactive"
    ],
    "series": null,
    "parent": null,
    "std_nums": [
        "9782412077467"
    ],
    "sys_nums": null
}


In [9]:
print(bcul_briefrec)

{
    "rec_id": "991024372153702851",
    "format": {
        "type": "Book",
        "access": "Physical",
        "analytical": false,
        "f33x": "txt;n;nc"
    },
    "titles": [
        {
            "m": "Python pour la finance :",
            "s": "ma\u00eetriser la finance algorithmique /"
        }
    ],
    "short_titles": [
        "Python pour la finance :"
    ],
    "creators": [
        "Hilpisch, Yves"
    ],
    "corp_creators": null,
    "languages": [
        "fre"
    ],
    "extent": {
        "nb": [
            689,
            23
        ],
        "txt": "xxiii, 689 pages :"
    },
    "editions": null,
    "years": {
        "y1": [
            2022
        ],
        "y2": 2024
    },
    "publishers": [
        "First Interactive,",
        "O'Reilly"
    ],
    "series": null,
    "parent": null,
    "std_nums": [
        "9782412077467"
    ],
    "sys_nums": [
        "(RNV_B)0003003761"
    ]
}
