# MatchVader Comparis

A Python application for entity matching using Large Language Models (LLMs). The system processes pairs of entities from two datasets, uses LLMs to determine whether they match, and provides evaluation metrics against ground truth data.
## Overview

This application consists of two main components:
- **Entity Matching** (`main.py`): Processes pairs of entities using HuggingFace LLMs with structured output generation
- **Evaluation** (`evaluate.py`): Evaluates LLM predictions against ground truth and calculates performance metrics
## Installation

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd entity-matching
  ```

- Install dependencies with uv:

  ```bash
  uv sync
  ```

- Set up environment variables (see the token-loading sketch after this list):

  ```bash
  cp example.env .env
  ```

  Then edit `.env` and add your HuggingFace token:

  ```
  HF_TOKEN=your_huggingface_token_here
  ```

- Prepare configuration files. Ensure your configuration files are in the `configs/` directory:

  - `configs/dataset_config.yaml`
  - `configs/task_config.yaml`
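For reference, the application presumably reads the token from the environment. A minimal sketch, assuming `python-dotenv` is used (the actual loading code may differ):

```python
# Sketch: reading HF_TOKEN from .env. Assumes python-dotenv; the project's
# actual loading mechanism may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # loads variables from .env in the working directory
hf_token = os.environ["HF_TOKEN"]
```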
## Configuration

### `configs/dataset_config.yaml`

Defines the file paths for different datasets. Each dataset contains:

- `d1`: First dataset CSV file
- `d2`: Second dataset CSV file
- `pairs`: Candidate pairs CSV file
- `gt`: Ground truth CSV file (for evaluation)
- `pairs_with_ids`: Pairs file with ID mappings
Example:

```yaml
dataset_1:
  d1: "data/d1_data/rest1clean.csv"
  d2: "data/d1_data/rest2clean.csv"
  gt: "data/d1_data/gtclean_evaluator_comma.csv"
  pairs: "data/d1_data/d1_pairs.csv"
  pairs_with_ids: "data/d1_data/d1_pairs_with_ids.csv"

dataset_2:
  d1: "data/d2_data/rest1clean.csv"
  d2: "data/d2_data/rest2clean.csv"
  gt: "data/d2_data/gtclean_evaluator_comma.csv"
  pairs: "data/d2_data/d2_pairs.csv"
  pairs_with_ids: "data/d2_data/d2_pairs_with_ids.csv"
```

**Note:** Configuration paths are hardcoded in the code:
- Dataset config: `configs/dataset_config.yaml`
- Task config: `configs/task_config.yaml`
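For illustration, here is one way these paths might be consumed, assuming standard PyYAML (how the code actually loads the config may differ):

```python
# Sketch: resolving dataset file paths from configs/dataset_config.yaml.
# Assumes PyYAML; main.py's actual loading code may differ.
import yaml

with open("configs/dataset_config.yaml") as f:
    config = yaml.safe_load(f)

paths = config["dataset_1"]
print(paths["d1"])     # data/d1_data/rest1clean.csv
print(paths["pairs"])  # data/d1_data/d1_pairs.csv
```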
### `configs/task_config.yaml`

Defines the tasks for entity matching, including prompts, schemas, and LLM parameters.
Example:

```yaml
Pairs:
  schema: "src.schemas.pairs_schema.PairsSchema"
  entities: ["entity_1", "entity_2"]
  prompt: "prompts/pairs_prompt.txt"
  max_tokens: 200
  temperature: 0.7
  pairs_per_prompt: 1

DualPairs:
  schema: "src.schemas.dual_pairs_schema.DualPairsSchema"
  entities: ["entity_1a", "entity_2a", "entity_1b", "entity_2b"]
  prompt: "prompts/dual_pairs_prompt.txt"
  max_tokens: 300
  temperature: 0.6
  pairs_per_prompt: 2
```

Fields:
- `schema`: Path to the Pydantic schema class defining the output structure
- `entities`: List of entity placeholders used in the prompt template
- `prompt`: Path to the prompt template file
- `temperature`: LLM temperature parameter (0.0-1.0)
- `max_tokens`: Maximum tokens for the LLM response
- `pairs_per_prompt`: Number of entity pairs included in each prompt
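For illustration, a schema class such as `PairsSchema` might look like the following. This is a sketch based on the extra result columns described later (`match`, `confidence`, `reasoning`); the project's actual class may differ:

```python
# Sketch of a schema like src/schemas/pairs_schema.py. Field names mirror the
# extra result columns described below; this is an assumption, not the
# project's actual definition.
from pydantic import BaseModel, Field


class PairsSchema(BaseModel):
    match: bool = Field(description="Whether the two entities refer to the same real-world object")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence in the decision")
    reasoning: str = Field(description="Short justification for the decision")
```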
## Entity Matching

The main script (`main.py`) processes entity pairs using an LLM to determine matches.
```
usage: main.py [-h] [--hf-model HF_MODEL] --task TASK --dataset DATASET
               [--log-file LOG_FILE] [--log-console] [--save SAVE]
               [--partial-save PARTIAL_SAVE] [--start-index START_INDEX]
               [--count COUNT]

Run HuggingFaceLLM with structured prompt output for entity matching.

options:
  -h, --help            show this help message and exit
  --hf-model HF_MODEL   Hugging Face model name or path.
  --task TASK           Task name (e.g., Pairs) from the task config file.
  --dataset DATASET     Dataset name (e.g., dataset_1, dataset_2) from the
                        dataset config file.
  --log-file LOG_FILE   Path to log file.
  --log-console         Enable console logging (default if no log file).
  --save SAVE           Path to save results (optional).
  --partial-save PARTIAL_SAVE
                        Save results every X entries (enables partial saving).
  --start-index START_INDEX
                        Starting pair index for processing.
  --count COUNT         Number of pairs to process (default: all from start-
                        index).
```
Basic usage:

```bash
uv run python main.py --task Pairs --dataset dataset_1 --log-console
```

Process a specific range and save the results:

```bash
uv run python main.py \
--task Pairs \
--dataset dataset_1 \
--start-index 0 \
--count 1000 \
--save results/experiment_1 \
  --log-console
```

Long job with partial saving:

```bash
uv run python main.py \
--task Pairs \
--dataset dataset_2 \
--save results/dataset2_run1 \
--partial-save 100 \
--log-file logs/run1.log \
  --log-console
```

When using `--save`, the script creates:
- `results.csv`: All results including successes and failures
- `successful_results.csv`: Only successfully processed pairs
- `partial_results.csv`: Temporary file for partial saves (deleted upon completion)
Result columns include:

- `row_id`: Sequential row number
- `pair_index`: Original pair index
- Dataset-specific index columns (e.g., `rest1_index`, `rest2_index`)
- `entity1_formatted`, `entity2_formatted`: Formatted entity representations
- `entity1_raw`, `entity2_raw`: Raw entity data
- `prompt`: Generated prompt sent to the LLM
- `response`: Raw LLM response
- `success`: Boolean indicating whether processing succeeded
- Additional columns from the schema (e.g., `match`, `confidence`, `reasoning`)
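To inspect a run programmatically, the result CSVs can be loaded with pandas using the columns above (the file path is illustrative):

```python
# Sketch: quick inspection of a results file using the columns listed above.
import pandas as pd

df = pd.read_csv("results/experiment_1/results.csv")

# Share of pairs that were processed successfully
print(f"success rate: {df['success'].mean():.1%}")

# Distribution of predicted matches among successful rows
ok = df[df["success"]]
print(ok["match"].value_counts())
```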
## Evaluation

The evaluation script (`evaluate.py`) compares LLM predictions against ground truth data.
```
usage: evaluate.py [-h] [--pairs PAIRS] --ground_truth GROUND_TRUTH
                   --response_csv RESPONSE_CSV [--join {inner,outer,left,right}]
                   [--output OUTPUT]

Evaluate candidate pairs against ground truth

options:
  -h, --help            show this help message and exit
  --pairs PAIRS, -p PAIRS
                        Candidate pairs CSV file
  --ground_truth GROUND_TRUTH, -gt GROUND_TRUTH
                        Ground truth CSV file
  --response_csv RESPONSE_CSV, -r RESPONSE_CSV
                        LLM response CSV file
  --join {inner,outer,left,right}, -j {inner,outer,left,right}
                        Join type for comparison (default: inner)
  --output OUTPUT, -o OUTPUT
                        Output CSV file for matches (optional)
```
Basic evaluation:

```bash
uv run python evaluate.py \
-r results/experiment_1/results.csv \
-p data/d1_data/d1_pairs.csv \
  -gt data/d1_data/gtclean_evaluator_comma.csv
```

Evaluation with an output file:

```bash
uv run python evaluate.py \
-r results/experiment_1/results.csv \
-p data/d1_data/d1_pairs.csv \
-gt data/d1_data/gtclean_evaluator_comma.csv \
  -o evaluation_results/exp1_metrics
```

Different join types:

```bash
# Inner join (only pairs present in both ground truth and predictions)
uv run python evaluate.py -r results.csv -p pairs.csv -gt gt.csv --join inner
# Outer join (all pairs from both ground truth and predictions)
uv run python evaluate.py -r results.csv -p pairs.csv -gt gt.csv --join outer
```
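The join type mirrors pandas merge semantics. Conceptually (a sketch, not the script's actual code; the key column names are illustrative):

```python
# Sketch: what the join types mean as a pandas merge between ground-truth
# pairs and predictions. Column names are illustrative, not the script's.
import pandas as pd

gt = pd.read_csv("data/d1_data/gtclean_evaluator_comma.csv")
pred = pd.read_csv("results/experiment_1/results.csv")

# inner: evaluate only pairs present in both frames
# outer: keep all pairs from either frame (missing side becomes NaN)
merged = gt.merge(pred, on=["rest1_index", "rest2_index"], how="inner")
```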
The evaluation script calculates and displays:

- Confusion Matrix: Visual representation of true/false positives/negatives
- Precision: Of the predicted matches, how many are correct?
- Recall: Of the actual matches, how many did we find?
- F1-Score: Harmonic mean of precision and recall
- Accuracy: Overall correctness of predictions
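These follow the standard definitions; for example, they can be recomputed from the labels with scikit-learn (a sketch, not the script's internals):

```python
# Sketch: recomputing the reported metrics with scikit-learn; evaluate.py may
# implement this differently.
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [1, 0, 1, 1, 0]  # ground-truth match labels (illustrative)
y_pred = [1, 0, 1, 0, 0]  # LLM predictions (illustrative)

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(accuracy_score(y_true, y_pred))
```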
Example output:

```json
{
"confusion_matrix": [[1567, 40], [51, 342]],
"precision": 0.8947,
"recall": 0.8502,
"f1_score": 0.8719,
"accuracy": 0.9123
}
```

## Example Workflow

**Step 1: Process entity pairs**

```bash
uv run python main.py \
--task Pairs \
--dataset dataset_1 \
--save results/run_001 \
--partial-save 50 \
  --log-console
```

**Step 2: Evaluate results**

```bash
uv run python evaluate.py \
-r results/run_001/results.csv \
-p data/d1_data/d1_pairs.csv \
-gt data/d1_data/gtclean_evaluator_comma.csv \
  -o evaluation/run_001_metrics
```

**Step 3: Review outputs**

```bash
# View results
cat results/run_001/successful_results.csv
# View metrics
cat evaluation/run_001_metrics.json
```

## Batch Processing

Process in chunks to manage memory and save progress:

```bash
# Process first 1000 pairs
uv run python main.py \
--task Pairs \
--dataset dataset_2 \
--start-index 0 \
--count 1000 \
--save results/batch_1 \
--log-console
# Process next 1000 pairs
uv run python main.py \
--task Pairs \
--dataset dataset_2 \
--start-index 1000 \
--count 1000 \
--save results/batch_2 \
--log-console
# Combine results and evaluate
python -c "
import pandas as pd
batch1 = pd.read_csv('results/batch_1/results.csv')
batch2 = pd.read_csv('results/batch_2/results.csv')
combined = pd.concat([batch1, batch2])
combined.to_csv('results/combined_results.csv', index=False)
"
uv run python evaluate.py \
-r results/combined_results.csv \
-p data/d2_data/d2_pairs.csv \
-gt data/d2_data/gtclean_evaluator_comma.csv \
  -o evaluation/combined_metrics
```
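For more than a couple of batches, a small driver script can automate the loop (a sketch; the batch size, paths, and task/dataset names are illustrative):

```python
# Sketch: automate sequential batch runs of main.py. Batch size and paths
# are illustrative.
import subprocess

BATCH_SIZE = 1000
NUM_BATCHES = 5

for i in range(NUM_BATCHES):
    subprocess.run(
        [
            "uv", "run", "python", "main.py",
            "--task", "Pairs",
            "--dataset", "dataset_2",
            "--start-index", str(i * BATCH_SIZE),
            "--count", str(BATCH_SIZE),
            "--save", f"results/batch_{i + 1}",
            "--log-console",
        ],
        check=True,
    )
```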