# Single View of Customer

The `svoc` package performs record linkage between two dataframes.  
The objective is to link each record from a _benchmark dataframe_ to at least one record from a second dataframe, referred to as the _input dataframe_.

The script takes as input:
- Two dataframes, benchmark and input (see [Data](#data));
- A [configuration file](#configuration-file), either in `.yaml` or `.env` format.

The record linkage process consists of four main steps:

1. [Data preparation](#data-preparation)
2. [Features calculation](#features-calculation)
3. [Automatic matching](#automatic-matching)
4. [Supervised matching](#supervised-matching)



## Data

Record linkage is performed by comparing the similarity of fields across dataframes.  
The current implementation requires the data to contain the following fields:

| Field        | Content                                      | Example                              |
|:-------------|:---------------------------------------------|:-------------------------------------|
| ID           | Row ID that uniquely identifies each record  | c9425868-0cd2-48eb-9050-8431d9832838 |
| OUTLET_NAME  | Name of the outlet                           | The Golden Fleece                    |
| ADDRESS      | Address of the outlet                        | 9 Queen St, London                   |
| POSTCODE     | Postal code of the outlet                    | EC4N1SP                              |

The data may use different column names; in that case, the corresponding column mappings must be specified in the [configuration file](#configuration-file).

## Configuration file

The `/config` folder contains the .yaml configuration file.  
In the configuration file, you can set the following parameters:

| Parameter               | Description                                                                 | Default value                               | Example                                   |
|:------------------------|:-----------------------------------------------------------------------------|:--------------------------------------------|:------------------------------------------|
| DATA_DIR                | Directory where the benchmark and input `.csv` files are stored              | `./` (current directory)                    | `./data`                                  |
| BENCHMARK_DATA_FILENAME | Name of the `.csv` file containing the benchmark data                         | —                                          | `benchmark_data.csv`                      |
| INPUT_DATA_FILENAME     | Name of the `.csv` file containing the input data                             | —                                          | `input_data.csv`                          |
| BENCHMARK_DATATABLE     | Name of the SQL table containing the benchmark data                           | —                                          | `data.benchmarkdata`                      |
| INPUT_DATATABLE         | Name of the SQL table containing the input data                               | —                                          | `data.inputdatatable`                    |
| BENCHMARK_COLUMNS       | Mapping of required fields to benchmark data column names                     | `{ID, OUTLET_NAME, ADDRESS, POSTCODE}`     | see example below                         |
| INPUT_COLUMNS           | Mapping of required fields to input data column names                         | `{ID, OUTLET_NAME, ADDRESS, POSTCODE}`     | see example below                         |
| MODELS_DIR              | Directory where supervised models are stored                                  | `./models`                                 | `./new_models`                            |
| N_MATCHES               | Maximum number of matches (from input data) per benchmark record              | `3`                                        | `1`                                      |
| BLOCK_COL               | Column used for blocking (matches must share the same value)                  | `'POSTCODE'`                               | _Other values not currently supported_   |

#### _Example_

We want to find a match for each record in the SAP data stored in the CSV file `./data/HUK_sap_data.csv`.  
Possible matches are searched within the Bowimi data stored in `./data/HUK_bowimi_data.csv`.

Since the column names differ from the default ones (see the table above or the [Data](#data) section), they must be explicitly mapped in the configuration file:

```yaml
DATA_DIR: "./data"

BENCHMARK_DATA_FILENAME: "HUK_sap_data.csv"
BENCHMARK_COLUMNS:
  ID: "SapCode"
  OUTLET_NAME: "OutletName"
  POSTCODE: "OutletPostcode"
  ADDRESS: "OutletAddress"

INPUT_DATA_FILENAME: "HUK_bowimi_data.csv"
INPUT_COLUMNS:
  ID: "BowimiId"
  OUTLET_NAME: "OutletName"
  POSTCODE: "OutletPostCode"
  ADDRESS: "OutletAddress"

```
If the data are stored in SQL tables, they can be imported by specifying the corresponding `*_DATATABLE` parameters instead of the `*_FILENAME` ones.
```yaml
BENCHMARK_DATATABLE: "rl_data.huk_sap_table"
BENCHMARK_COLUMNS:
  ID: 'SapCode'
  OUTLET_NAME: 'OutletName'
  POSTCODE: 'OutletPostcode'
  ADDRESS: 'OutletAddress'

INPUT_DATATABLE: "rl_data.huk_bowimi_table"
INPUT_COLUMNS:
  ID: 'BowimiId'
  OUTLET_NAME: 'OutletName'
  POSTCODE: 'OutletPostCode'
  ADDRESS: 'OutletAddress'
```

If a parameter is not specified in the configuration file, the default value will be used.

### Environment variables

Instead of using a `.yaml` file, parameters can be set as environment variables in a `.env` file.  
In this case:

- Parameter names must be prefixed with `SVOC_`;
- Nested parameters must be specified using a double underscore (`__`) as a separator.

#### _Example_

```env
SVOC_INPUT_DATATABLE="rl_data.huk_bowimi_table"
SVOC_BENCHMARK_COLUMNS__ID="SapCode"
SVOC_BENCHMARK_COLUMNS__OUTLET_NAME="OutletName"
# ...
```
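
As an illustration of the two naming rules above, the sketch below turns flat, prefixed variables into a nested dictionary. The helper `parse_env` and the sample values are hypothetical and are not part of `svoc`; the package's actual loader may work differently.

```python
def parse_env(env, prefix="SVOC_", sep="__"):
    """Build a nested dict from flat, prefixed environment variables."""
    settings = {}
    for name, value in env.items():
        if not name.startswith(prefix):
            continue  # ignore variables that do not belong to the package
        keys = name[len(prefix):].split(sep)
        node = settings
        for key in keys[:-1]:
            # Walk (or create) one nesting level per "__"-separated part
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return settings


# Hypothetical environment, for illustration only
example_env = {
    "SVOC_DATA_DIR": "./data",
    "SVOC_BENCHMARK_COLUMNS__ID": "SapCode",
    "SVOC_BENCHMARK_COLUMNS__OUTLET_NAME": "OutletName",
    "PATH": "/usr/bin",
}
parsed = parse_env(example_env)
```

Here `parsed["BENCHMARK_COLUMNS"]` is a dictionary with the keys `ID` and `OUTLET_NAME`, mirroring the nested structure of the `.yaml` file.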

### Importing the settings

Once the parameters are defined, load them into the script with `get_settings()`:

```python
from svoc.settings import get_settings

settings = get_settings()  # Load from .env
# or
settings = get_settings("./config/settings.yaml")  # Load from a .yaml file
```
 

## Data Preparation

The first step of the algorithm consists of a data preparation phase, whose goal is to harmonize and clean the input datasets before performing the record linkage.

#### Data loading and schema alignment

As a first step, the input datasets are imported and only the columns of interest are retained. These columns are defined in the [configuration file](#configuration-file).

To ensure consistency between the two data sources, the selected columns are then renamed so that field names are aligned across both dataframes. This step guarantees that equivalent information (e.g. outlet name, address) is represented using a common schema.
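
A minimal sketch of this selection and renaming step, using plain Python dictionaries instead of dataframes to keep the example dependency-free. The function `align_schema` and the extra `Region` column are hypothetical; the mapping mirrors the `BENCHMARK_COLUMNS` example in the configuration section.

```python
# Mapping of required field -> source column name, as in the configuration file
benchmark_columns = {
    "ID": "SapCode",
    "OUTLET_NAME": "OutletName",
    "POSTCODE": "OutletPostcode",
    "ADDRESS": "OutletAddress",
}


def align_schema(rows, columns):
    """Keep only the mapped columns and rename them to the required field names."""
    return [
        {field: row[source] for field, source in columns.items()}
        for row in rows
    ]


raw_rows = [
    {
        "SapCode": "1",
        "OutletName": "The Golden Fleece",
        "OutletPostcode": "EC4N1SP",
        "OutletAddress": "9 Queen St, London",
        "Region": "London",  # not a column of interest, so it is dropped
    },
]
aligned = align_schema(raw_rows, benchmark_columns)
```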

#### Case normalization

Since the matching process is case sensitive, all string fields are converted to uppercase.
This normalization step avoids mismatches caused solely by differences in capitalization (e.g. Street vs STREET).

#### Removal of special characters

All strings are then cleaned by removing special characters and symbols.
Only letters, numbers, and spaces are preserved. This step reduces noise introduced by punctuation, formatting differences, or non-standard characters that are not informative for matching purposes.
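
The two normalization steps can be sketched with a regular expression. The function name `normalize` is hypothetical. Collapsing repeated spaces and treating only ASCII letters as "letters" are assumptions of this sketch, not documented behavior of the package.

```python
import re


def normalize(text):
    """Uppercase a string and keep only letters, digits, and spaces."""
    upper = text.upper()
    # Drop everything that is not an ASCII letter, a digit, or a space
    cleaned = re.sub(r"[^A-Z0-9 ]", "", upper)
    # Assumption: collapse runs of spaces left behind by removed symbols
    return re.sub(r" +", " ", cleaned).strip()


normalized = normalize("9 Queen St., London")
```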

#### Creation of standardized clean fields

In addition to the original fields, two new columns are created:

- ADDRESS_CLEAN
- OUTLET_NAME_CLEAN

These columns are derived from ADDRESS and OUTLET_NAME respectively and are specifically designed to improve matching quality.

For addresses, a standardization process is applied to harmonize common variations.
In particular, frequently used abbreviations are expanded to their full form (e.g. RD → ROAD, LN → LANE, ST → STREET). This reduces inconsistencies caused by alternative address representations.

For outlet names, a filtering approach is adopted instead.
Common, non-informative words (such as THE, BAR, PUB, LTD) are removed: they appear frequently across different records and can hurt the matching process without adding discriminative power.
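
The following sketch shows one way to derive the two clean fields from already normalized (uppercase, punctuation-free) strings. The lookup tables contain only the examples mentioned above, and the function names are hypothetical; the lists actually used by the package may be longer or different.

```python
# Illustrative subsets only, taken from the examples in the text
ADDRESS_ABBREVIATIONS = {"RD": "ROAD", "LN": "LANE", "ST": "STREET"}
NAME_STOPWORDS = {"THE", "BAR", "PUB", "LTD"}


def clean_address(address):
    """Expand known abbreviations, word by word."""
    return " ".join(ADDRESS_ABBREVIATIONS.get(word, word) for word in address.split())


def clean_outlet_name(name):
    """Drop common, non-informative words."""
    return " ".join(word for word in name.split() if word not in NAME_STOPWORDS)


address_clean = clean_address("9 QUEEN ST LONDON")
outlet_name_clean = clean_outlet_name("THE GOLDEN FLEECE")
```

Working word by word avoids expanding `ST` inside longer words such as `STATION`.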

#### Final dataset structure

At the end of the data preparation phase, both dataframes retain:

- all the original columns of interest
- the corresponding standardized CLEAN versions

The subsequent matching step will leverage both the original and cleaned fields, allowing the algorithm to balance strict matching with more robust, noise-resistant comparisons.

## Features Calculation

Record linkage is performed by comparing the fields of two different dataframes and matching those records whose field values are most similar.

To assess the similarity between two fields, the algorithm relies on the Python
[recordlinkage](https://recordlinkage.readthedocs.io/en/latest/) package, which provides classes and methods to compute several distance measures, referred to as _features_.

The similarity between two strings is represented by a numerical value between 0 (no similarity) and 1 (maximum similarity).  
The `recordlinkage` package offers multiple string distance metrics; the `svoc` package uses the following:

- _Jaro–Winkler_ distance;
- _Levenshtein_ distance;
- _Q-gram_ distance;
- _Cosine_ distance.
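
To make the 0 to 1 scale concrete, here is a self-contained sketch of a normalized Levenshtein similarity. It uses one common normalization (edit distance divided by the length of the longer string); `svoc` delegates this computation to `recordlinkage`, so this code is an illustration rather than the package's implementation. Returning 1 for two empty strings is an assumption of the sketch.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Similarity in [0, 1]: 1 for identical strings, 0 for fully different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - levenshtein_distance(a, b) / longest


similarity = levenshtein_similarity("ROAD", "ROD")
```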

In addition, the package includes the following custom measures:

- _Exact_: equal to 1 if the strings are exactly the same, 0 otherwise;
- _Substring_: equal to 1 if one string is entirely contained within the other, 0 otherwise;
- _Word inclusion_: equal to 1 if all the words in one string are contained in the other, 0 otherwise.
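
The three custom measures can be sketched as plain functions. The names below are hypothetical, and edge cases such as empty strings may be handled differently in the package.

```python
def exact(a, b):
    """1 if the strings are identical, 0 otherwise."""
    return 1 if a == b else 0


def substring(a, b):
    """1 if one string is entirely contained within the other, 0 otherwise."""
    return 1 if a in b or b in a else 0


def word_inclusion(a, b):
    """1 if all the words of one string appear in the other, 0 otherwise."""
    words_a, words_b = set(a.split()), set(b.split())
    return 1 if words_a <= words_b or words_b <= words_a else 0
```

Word inclusion ignores word order, so `"FLEECE GOLDEN"` and `"THE GOLDEN FLEECE"` score 1 on word inclusion but 0 on substring.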

All available measures are defined in the `DISTANCES` constant:

```python
from svoc.constants import DISTANCES
DISTANCES
```
This is a list of `Distance()` instances, defined as follows:
``` python
from svoc.automatic.enums import DistanceMethod
from svoc.automatic.models import Distance
Distance(
    col_name = 'OUTLET_NAME',          # Field name
    method = DistanceMethod.COSINE,    # Distance metric used
    label = 'outlet_name_cosine'       # Feature label
)
```
These measures are computed only for pairs of records that share the same value of the `BLOCK_COL` field (set in the [configuration file](#configuration-file)). In the current implementation, this means that both records in a pair have the same postal code. This blocking constraint reduces the number of comparisons, and therefore the computational cost, and improves precision by excluding pairs of outlets located in different postcodes.
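
The effect of blocking can be sketched without any dependency. The function `candidate_pairs` and the sample rows are hypothetical; the package itself relies on `recordlinkage` to build the candidate pairs.

```python
from collections import defaultdict


def candidate_pairs(benchmark_rows, input_rows, block_col="POSTCODE"):
    """Return (benchmark ID, input ID) pairs that share the blocking value."""
    by_block = defaultdict(list)
    for row in input_rows:
        by_block[row[block_col]].append(row["ID"])
    return [
        (row["ID"], input_id)
        for row in benchmark_rows
        for input_id in by_block.get(row[block_col], [])
    ]


# Made-up rows, for illustration only
benchmark_rows = [
    {"ID": "b1", "POSTCODE": "EC4N1SP"},
    {"ID": "b2", "POSTCODE": "SW1A1AA"},
]
input_rows = [
    {"ID": "i1", "POSTCODE": "EC4N1SP"},
    {"ID": "i2", "POSTCODE": "EC4N1SP"},
    {"ID": "i3", "POSTCODE": "N19GU"},
]
pairs = candidate_pairs(benchmark_rows, input_rows)
```

Without blocking, these rows would produce 2 × 3 = 6 pairs to compare; with blocking on the postal code, only the two pairs involving `b1` remain.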

The following table contains the features currently calculated.

| Field Name        | Method          | Label                                |
|:------------------|:----------------|:-------------------------------------|
| OUTLET_NAME       | Cosine          | outlet_name_cosine                   |
|                   | Jarowinkler     | outlet_name_jarowinkler              |
|                   | Levenshtein     | outlet_name_levenshtein              |
|                   | QGram           | outlet_name_qgram                    |
|                   | Exact           | outlet_name                          |
|                   | Substring       | outlet_name_in                       |
|                   | Word Inclusion  | outlet_name_in2                      |
| OUTLET_NAME_CLEAN | Cosine          | outlet_name_clean_cosine             |
|                   | Jarowinkler     | outlet_name_clean_jarowinkler        |
|                   | Levenshtein     | outlet_name_clean_levenshtein        |
|                   | QGram           | outlet_name_clean_qgram              |
|                   | Exact           | outlet_name_clean                    |
|                   | Substring       | outlet_name_clean_in                 |
|                   | Word Inclusion  | outlet_name_clean_in2                |
| ADDRESS           | Cosine          | address_cosine                       |
|                   | Jarowinkler     | address_jarowinkler                  |
|                   | Levenshtein     | address_levenshtein                  |
|                   | QGram           | address_qgram                        |
|                   | Exact           | address                              |
|                   | Substring       | address_in                           |
|                   | Word Inclusion  | address_in2                          |
| ADDRESS_CLEAN     | Cosine          | address_clean_cosine                 |
|                   | Jarowinkler     | address_clean_jarowinkler            |
|                   | Levenshtein     | address_clean_levenshtein            |
|                   | QGram           | address_clean_qgram                  |
|                   | Exact           | address_clean                        |
|                   | Substring       | address_clean_in                     |
|                   | Word Inclusion  | address_clean_in2                    |


## Automatic Matching

Automatic matching selects the matching pairs of records through a sequence of **filtering steps**. The filters, which are progressively less restrictive, apply constraints to specific similarity measures. For each record in the benchmark dataframe, the algorithm selects up to `N_MATCHES` matching records from the input dataframe (3 by default). `N_MATCHES` is set in the [configuration file](#configuration-file).

### Filters 

You can print the list of filters with:

```python
from svoc.constants import FILTERS_AUTO
FILTERS_AUTO
```

The output (truncated here) looks like this:

```text
[DistanceFilter(value={'outlet_name': 0.5, 'address': 0.5}),
 DistanceFilter(value={'outlet_name': 0.5, 'address_clean_cosine': 0.7}),
 DistanceFilter(value={'outlet_name': 0.5, 'address_clean_levenshtein': 0.7}),
 DistanceFilter(value={'outlet_name': 0.5, 'address_clean_qgram': 0.65}),
 DistanceFilter(value={'outlet_name_clean_cosine': 0.7, 'address': 0.5}),
 DistanceFilter(value={'outlet_name_clean_levenshtein': 0.7, 'address': 0.5}),
 DistanceFilter(value={'outlet_name_clean_qgram': 0.65, 'address': 0.5}),
 DistanceFilter(value={'outlet_name_cosine': 0.7, 'address': 0.5}),
 DistanceFilter(value={'outlet_name_levenshtein': 0.7, 'address': 0.5}),
 DistanceFilter(value={'outlet_name_qgram': 0.65, 'address': 0.5}),
 DistanceFilter(value={'outlet_name': 0.5, 'address_cosine': 0.7}),
 DistanceFilter(value={'outlet_name': 0.5, 'address_levenshtein': 0.7}),
 DistanceFilter(value={'outlet_name': 0.5, 'address_qgram': 0.5}),
 DistanceFilter(value={'outlet_name_in': 0.5, 'address': 0.5}),
 ...
```


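This is a list of `DistanceFilter()` instances, defined as follows:

```python
from svoc.constants import DistanceFilter

# Keep pairs whose outlet-name cosine measure is >= 0.8
# and whose address Levenshtein measure is >= 0.7
distance_filter = DistanceFilter(
    value={
        'outlet_name_cosine': 0.8,
        'address_levenshtein': 0.7,
    }
)
```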

The `value` parameter is a dictionary: each key is one of the labels of the `Distance()` instances (see [Features](#features-calculation)), and the associated value is the minimum value of the corresponding measure that the filter accepts. In the example above, the filter selects as matching pairs all records whose Cosine similarity between the outlet names is at least 0.8 *and* whose Levenshtein similarity between the addresses is at least 0.7.

Each matching pair is assigned an overall similarity score, given by the arithmetic mean of the similarity measures considered by the filter.
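
The sketch below illustrates how a sequence of filters could be applied to the candidates of a single benchmark record. It is not the package's implementation: the function `apply_filters`, the rule that a candidate is kept by the first filter it passes, the ordering by score within a filter, and the early stop once `n_matches` is reached are all assumptions made for illustration. Only the threshold check and the arithmetic-mean score come from the description above.

```python
def apply_filters(features, filters, n_matches=3):
    """Select up to n_matches input records for one benchmark record.

    features: {input ID: {feature label: similarity}}
    filters:  list of {feature label: minimum value}, most restrictive first
    """
    matches = []
    for filter_id, thresholds in enumerate(filters, start=1):
        selected = {match["ID_2"] for match in matches}
        passed = []
        for input_id, values in features.items():
            if input_id in selected:
                continue  # already matched by an earlier, stricter filter
            if all(values[label] >= minimum for label, minimum in thresholds.items()):
                # Overall score: arithmetic mean of the measures used by the filter
                score = sum(values[label] for label in thresholds) / len(thresholds)
                passed.append({"ID_2": input_id, "ID_filter": filter_id, "score": score})
        passed.sort(key=lambda match: match["score"], reverse=True)
        matches.extend(passed[: n_matches - len(matches)])
        if len(matches) >= n_matches:
            break
    return matches


# Made-up feature values for three candidates of one benchmark record
features = {
    "i1": {"outlet_name": 1, "address": 1, "address_clean_cosine": 1.0},
    "i2": {"outlet_name": 1, "address": 0, "address_clean_cosine": 0.8},
    "i3": {"outlet_name": 0, "address": 0, "address_clean_cosine": 0.9},
}
filters = [
    {"outlet_name": 0.5, "address": 0.5},
    {"outlet_name": 0.5, "address_clean_cosine": 0.7},
]
matches = apply_filters(features, filters)
```

In this example `i1` is selected by the first filter, `i2` only by the second (with score (1 + 0.8) / 2), and `i3` by neither.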

## Supervised Matching

Supervised matching is a probabilistic matching method that follows the automatic one. Its aim is to find a match for the records of the benchmark dataset that remain unmatched after the filtering process. Records for which the previous step found fewer than `N_MATCHES` matches (3 by default) are also included.

Supervised matching relies on trained models: probabilistic decision rules that estimate how likely it is that two records match.

The models are trained during algorithm development and saved in separate pickle files.
The folder containing the pickle files can be set with the `MODELS_DIR` parameter in the [configuration file](#configuration-file).

In the current implementation, three supervised models have been trained (see [recordlinkage documentation](https://recordlinkage.readthedocs.io/en/latest/guides/classifiers.html#supervised-learning) for details):

1. Logistic Regression;
2. Support Vector Machine (SVM);
3. Naive Bayes Classifier.

The logistic regression model and the Naive Bayes classifier provide a score, which is the estimated probability that a pair of records is a match: the pair is considered a match if the score is greater than 0.5. The SVM does not provide a probability; it only distinguishes between matches and non-matches. The score of the matches it finds is set to 0.5 by default.
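
The decision rule described above can be summarized in a few lines. The function `score_pair` is hypothetical, and returning `None` as the score of an SVM non-match is an assumption of this sketch.

```python
def score_pair(probability=None, svm_is_match=None):
    """Return (is_match, score) for one candidate pair.

    Pass `probability` for logistic regression or Naive Bayes,
    or `svm_is_match` for the SVM, which gives no probability.
    """
    if probability is not None:
        # Probabilistic models: match only if the probability exceeds 0.5
        return probability > 0.5, probability
    # SVM: matches receive the default score of 0.5
    return svm_is_match, (0.5 if svm_is_match else None)


logistic_result = score_pair(probability=0.8)
svm_result = score_pair(svm_is_match=True)
```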

### Models Training

The supervised models are **already trained** and provided out of the box with the current implementation.
However, the training procedure can be re-executed if needed (e.g. to change features, distance metrics, or to retrain the models on a different labeled dataset).

The supervised models are trained using a subset of records from the input dataset for which the correct match with the benchmark dataset is known.

Both the benchmark and the input datasets contain a unique identifier column (ID).
To train the models, the input dataset must also contain an additional column that specifies the ID of the matching record in the benchmark dataset (e.g. `benchmark_id`). Only the input records for which this matching ID is available are used for training.
This reference column is specified via the `input_cols_id_benchmark` parameter of the `train_all_models()` function.
Not all input records are required to have a match:
- records where `input_cols_id_benchmark` is NaN are automatically excluded from training;
- only records with a known and valid benchmark match are used to build the training set.

#### Feature Generation and Filtering

During the model training:
1. Similarity features are computed between all candidate pairs of records;
2. Only benchmark records that appear at least once in the known matches are retained.

The final training set consists of:
- feature vectors for candidate pairs
- a label derived from whether the pair corresponds to a known match
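
A minimal sketch of how the labels could be derived from the reference column, using plain Python instead of dataframes. The helper names and the `benchmark_id` column are hypothetical; `train_all_models()` performs the equivalent steps internally and may do so differently.

```python
import math


def is_missing(value):
    """True for None or NaN reference values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def known_matches(input_rows, id_col="ID", benchmark_id_col="benchmark_id"):
    """Set of (benchmark ID, input ID) pairs; rows without a reference are dropped."""
    return {
        (row[benchmark_id_col], row[id_col])
        for row in input_rows
        if not is_missing(row[benchmark_id_col])
    }


def label_pairs(candidates, matches):
    """Keep pairs whose benchmark record has a known match; label each pair 1 or 0."""
    benchmark_ids = {benchmark_id for benchmark_id, _ in matches}
    return {
        pair: int(pair in matches)
        for pair in candidates
        if pair[0] in benchmark_ids
    }


# Made-up rows: only i1 has a known benchmark match
input_rows = [
    {"ID": "i1", "benchmark_id": "b1"},
    {"ID": "i2", "benchmark_id": float("nan")},
    {"ID": "i3", "benchmark_id": None},
]
candidates = [("b1", "i1"), ("b1", "i2"), ("b2", "i3")]
labels = label_pairs(candidates, known_matches(input_rows))
```

The pair `("b2", "i3")` is discarded because `b2` never appears among the known matches, so nothing can be said about whether it is a true non-match.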

#### Model Re-training

Models can be (re)estimated using the `train_all_models()` function from the
`svoc.supervised.match` module. Trained models are saved to the directory specified in `path_models`.

**Important**: if the features used for record linkage change (e.g. input columns, benchmark columns, or distance metrics), all supervised models must be retrained.

```python

from svoc.settings import get_settings
from svoc.utils import read_data
from svoc.supervised.enums import SupervisedModel
from svoc.supervised.match import train_all_models
from svoc.constants import DISTANCES

settings = get_settings("./config/settings.yaml")

# Read Data
df_input, df_benchmark = read_data(settings)

# Models training
models = train_all_models(
    df_input=df_input,
    input_cols_id_benchmark='sapcode',  # <-- must be set: input column holding the benchmark ID
    input_cols=settings.INPUT_COLUMNS_DICT,
    df_benchmark=df_benchmark,
    benchmark_cols=settings.BENCHMARK_COLUMNS_DICT,
    distances=DISTANCES,
    block_col=settings.BLOCK_COL,
    window=5,
    path_models=settings.SUPERVISED_MODELS_PATHS
)

```

## Results

The script returns a dataframe with the following fields:

- `ID_1`: ID of the record from the benchmark data;
- `ID_2`: ID of the record from the input data;
- `Rank`: an ordered ranking of the matches (from 1 to `N_MATCHES`) for each benchmark record, where rank 1 represents the first match;
- `match_type`: indicates whether the match was selected automatically through filters (`auto`) or through supervised probabilistic models (`supervised`);
- `score`: the global similarity score summarizing the quality of the match, according to the features considered or the probabilistic model used;
- `*_score` (e.g. `OUTLET_NAME_score`): individual similarity scores for key attributes such as outlet name and address;
- `*_method` (e.g. `OUTLET_NAME_method`): the string-matching technique used to compute each field-level similarity score. If the match is selected through supervised matching, the method is the Cosine similarity by default;
- `ID_filter`: numeric identifier of the filtering step that selected the match (only if `match_type` is `auto`);
- `model`: the probabilistic model used to select the match (only if `match_type` is `supervised`).
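
As a usage sketch, the snippet below extracts the first-ranked match of each benchmark record. The rows are made up and represented as plain dictionaries to keep the example dependency-free; with the real output, the same selection would be applied to the returned dataframe.

```python
# Made-up result rows using a subset of the fields listed above
results = [
    {"ID_1": "b1", "ID_2": "i1", "Rank": 1, "match_type": "auto", "score": 0.95},
    {"ID_1": "b1", "ID_2": "i2", "Rank": 2, "match_type": "supervised", "score": 0.5},
    {"ID_1": "b2", "ID_2": "i7", "Rank": 1, "match_type": "supervised", "score": 0.72},
]

# Benchmark ID -> input ID of the rank-1 match
first_matches = {row["ID_1"]: row["ID_2"] for row in results if row["Rank"] == 1}

# Benchmark records whose rank-1 match came from the supervised step
supervised_only = [
    row["ID_1"]
    for row in results
    if row["Rank"] == 1 and row["match_type"] == "supervised"
]
```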
