# MatchVader Comparis

A Python application for entity matching using Large Language Models (LLMs). The system processes pairs of entities from two datasets, uses LLMs to determine whether they match, and provides evaluation metrics against ground truth data.
## Overview

This application consists of two main components:
- **Entity Matching** (`main.py`): Processes pairs of entities using HuggingFace LLMs with structured output generation
- **Evaluation** (`evaluate.py`): Evaluates LLM predictions against ground truth and calculates performance metrics
## Installation

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd entity-matching
  ```

- Install dependencies with uv:

  ```bash
  uv sync
  ```

- Set up environment variables (see the token-loading sketch after this list):

  ```bash
  cp example.env .env
  ```

  Then edit `.env` and add your HuggingFace token:

  ```
  HF_TOKEN=your_huggingface_token_here
  ```

- Prepare configuration files. Ensure your configuration files are in the `configs/` directory:

  - `configs/dataset_config.yaml`
  - `configs/task_config.yaml`
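For reference, the application presumably reads the token from the environment. A minimal sketch, assuming `python-dotenv` is used (the actual loading code may differ):

```python
# Sketch: reading HF_TOKEN from .env. Assumes python-dotenv; the project's
# actual loading mechanism may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # loads variables from .env in the working directory
hf_token = os.environ["HF_TOKEN"]
```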
## Configuration

### `configs/dataset_config.yaml`

Defines the file paths for different datasets. Each dataset contains:

- `d1`: First dataset CSV file
- `d2`: Second dataset CSV file
- `pairs`: Candidate pairs CSV file
- `gt`: Ground truth CSV file (for evaluation)
- `pairs_with_ids`: Pairs file with ID mappings
Example:

```yaml
dataset_1:
  d1: "data/d1_data/rest1clean.csv"
  d2: "data/d1_data/rest2clean.csv"
  gt: "data/d1_data/gtclean_evaluator_comma.csv"
  pairs: "data/d1_data/d1_pairs.csv"
  pairs_with_ids: "data/d1_data/d1_pairs_with_ids.csv"

dataset_2:
  d1: "data/d2_data/rest1clean.csv"
  d2: "data/d2_data/rest2clean.csv"
  gt: "data/d2_data/gtclean_evaluator_comma.csv"
  pairs: "data/d2_data/d2_pairs.csv"
  pairs_with_ids: "data/d2_data/d2_pairs_with_ids.csv"
```

**Note:** Configuration paths are hardcoded in the code:
- Dataset config: `configs/dataset_config.yaml`
- Task config: `configs/task_config.yaml`
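For illustration, here is one way these paths might be consumed, assuming standard PyYAML (how the code actually loads the config may differ):

```python
# Sketch: resolving dataset file paths from configs/dataset_config.yaml.
# Assumes PyYAML; main.py's actual loading code may differ.
import yaml

with open("configs/dataset_config.yaml") as f:
    config = yaml.safe_load(f)

paths = config["dataset_1"]
print(paths["d1"])     # data/d1_data/rest1clean.csv
print(paths["pairs"])  # data/d1_data/d1_pairs.csv
```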
### `configs/task_config.yaml`

Defines the tasks for entity matching, including prompts, schemas, and LLM parameters.
Example:

```yaml
Pairs:
  schema: "src.schemas.pairs_schema.PairsSchema"
  entities: ["entity_1", "entity_2"]
  prompt: "prompts/pairs_prompt.txt"
  max_tokens: 200
  temperature: 0.7
  pairs_per_prompt: 1

DualPairs:
  schema: "src.schemas.dual_pairs_schema.DualPairsSchema"
  entities: ["entity_1a", "entity_2a", "entity_1b", "entity_2b"]
  prompt: "prompts/dual_pairs_prompt.txt"
  max_tokens: 300
  temperature: 0.6
  pairs_per_prompt: 2
```

Fields:
- `schema`: Path to the Pydantic schema class defining the output structure
- `entities`: List of entity placeholders used in the prompt template
- `prompt`: Path to the prompt template file
- `temperature`: LLM temperature parameter (0.0-1.0)
- `max_tokens`: Maximum tokens for the LLM response
- `pairs_per_prompt`: Number of entity pairs included in each prompt
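For illustration, a schema class such as `PairsSchema` might look like the following. This is a sketch based on the extra result columns described later (`match`, `confidence`, `reasoning`); the project's actual class may differ:

```python
# Sketch of a schema like src/schemas/pairs_schema.py. Field names mirror the
# extra result columns described below; this is an assumption, not the
# project's actual definition.
from pydantic import BaseModel, Field


class PairsSchema(BaseModel):
    match: bool = Field(description="Whether the two entities refer to the same real-world object")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence in the decision")
    reasoning: str = Field(description="Short justification for the decision")
```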
## Entity Matching

The main script (`main.py`) processes entity pairs using an LLM to determine matches.
```
usage: main.py [-h] [--hf-model HF_MODEL] --task TASK --dataset DATASET
               [--log-file LOG_FILE] [--log-console] [--save SAVE]
               [--partial-save PARTIAL_SAVE] [--start-index START_INDEX]
               [--count COUNT]

Run HuggingFaceLLM with structured prompt output for entity matching.

options:
  -h, --help            show this help message and exit
  --hf-model HF_MODEL   Hugging Face model name or path.
  --task TASK           Task name (e.g., Pairs) from the task config file.
  --dataset DATASET     Dataset name (e.g., dataset_1, dataset_2) from the
                        dataset config file.
  --log-file LOG_FILE   Path to log file.
  --log-console         Enable console logging (default if no log file).
  --save SAVE           Path to save results (optional).
  --partial-save PARTIAL_SAVE
                        Save results every X entries (enables partial saving).
  --start-index START_INDEX
                        Starting pair index for processing.
  --count COUNT         Number of pairs to process (default: all from start-
                        index).
```
Basic usage:

```bash
uv run python main.py --task Pairs --dataset dataset_1 --log-console
```

Process a specific range and save the results:

```bash
uv run python main.py \
--task Pairs \
--dataset dataset_1 \
--start-index 0 \
--count 1000 \
--save results/experiment_1 \
  --log-console
```

Long job with partial saving:

```bash
uv run python main.py \
--task Pairs \
--dataset dataset_2 \
--save results/dataset2_run1 \
--partial-save 100 \
--log-file logs/run1.log \
  --log-console
```

When using `--save`, the script creates:
- `results.csv`: All results including successes and failures
- `successful_results.csv`: Only successfully processed pairs
- `partial_results.csv`: Temporary file for partial saves (deleted upon completion)
Result columns include:

- `row_id`: Sequential row number
- `pair_index`: Original pair index
- Dataset-specific index columns (e.g., `rest1_index`, `rest2_index`)
- `entity1_formatted`, `entity2_formatted`: Formatted entity representations
- `entity1_raw`, `entity2_raw`: Raw entity data
- `prompt`: Generated prompt sent to the LLM
- `response`: Raw LLM response
- `success`: Boolean indicating whether processing succeeded
- Additional columns from the schema (e.g., `match`, `confidence`, `reasoning`)
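To inspect a run programmatically, the result CSVs can be loaded with pandas using the columns above (the file path is illustrative):

```python
# Sketch: quick inspection of a results file using the columns listed above.
import pandas as pd

df = pd.read_csv("results/experiment_1/results.csv")

# Share of pairs that were processed successfully
print(f"success rate: {df['success'].mean():.1%}")

# Distribution of predicted matches among successful rows
ok = df[df["success"]]
print(ok["match"].value_counts())
```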
## Evaluation

The evaluation script (`evaluate.py`) compares LLM predictions against ground truth data.
```
usage: evaluate.py [-h] [--pairs PAIRS] --ground_truth GROUND_TRUTH
                   --response_csv RESPONSE_CSV [--join {inner,outer,left,right}]
                   [--output OUTPUT]

Evaluate candidate pairs against ground truth

options:
  -h, --help            show this help message and exit
  --pairs PAIRS, -p PAIRS
                        Candidate pairs CSV file
  --ground_truth GROUND_TRUTH, -gt GROUND_TRUTH
                        Ground truth CSV file
  --response_csv RESPONSE_CSV, -r RESPONSE_CSV
                        LLM response CSV file
  --join {inner,outer,left,right}, -j {inner,outer,left,right}
                        Join type for comparison (default: inner)
  --output OUTPUT, -o OUTPUT
                        Output CSV file for matches (optional)
```
Basic evaluation:

```bash
uv run python evaluate.py \
-r results/experiment_1/results.csv \
-p data/d1_data/d1_pairs.csv \
  -gt data/d1_data/gtclean_evaluator_comma.csv
```

Evaluation with an output file:

```bash
uv run python evaluate.py \
-r results/experiment_1/results.csv \
-p data/d1_data/d1_pairs.csv \
-gt data/d1_data/gtclean_evaluator_comma.csv \
  -o evaluation_results/exp1_metrics
```

Different join types:

```bash
# Inner join (only pairs present in both ground truth and predictions)
uv run python evaluate.py -r results.csv -p pairs.csv -gt gt.csv --join inner
# Outer join (all pairs from both ground truth and predictions)
uv run python evaluate.py -r results.csv -p pairs.csv -gt gt.csv --join outer
```
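The join type mirrors pandas merge semantics. Conceptually (a sketch, not the script's actual code; the key column names are illustrative):

```python
# Sketch: what the join types mean as a pandas merge between ground-truth
# pairs and predictions. Column names are illustrative, not the script's.
import pandas as pd

gt = pd.read_csv("data/d1_data/gtclean_evaluator_comma.csv")
pred = pd.read_csv("results/experiment_1/results.csv")

# inner: evaluate only pairs present in both frames
# outer: keep all pairs from either frame (missing side becomes NaN)
merged = gt.merge(pred, on=["rest1_index", "rest2_index"], how="inner")
```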
The evaluation script calculates and displays:

- Confusion Matrix: Visual representation of true/false positives/negatives
- Precision: Of the predicted matches, how many are correct?
- Recall: Of the actual matches, how many did we find?
- F1-Score: Harmonic mean of precision and recall
- Accuracy: Overall correctness of predictions
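These follow the standard definitions; for example, they can be recomputed from the labels with scikit-learn (a sketch, not the script's internals):

```python
# Sketch: recomputing the reported metrics with scikit-learn; evaluate.py may
# implement this differently.
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [1, 0, 1, 1, 0]  # ground-truth match labels (illustrative)
y_pred = [1, 0, 1, 0, 0]  # LLM predictions (illustrative)

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(accuracy_score(y_true, y_pred))
```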
Example output:

```json
{
"confusion_matrix": [[1567, 40], [51, 342]],
"precision": 0.8947,
"recall": 0.8502,
"f1_score": 0.8719,
"accuracy": 0.9123
}
```

## Example Workflow

**Step 1: Process entity pairs**

```bash
uv run python main.py \
--task Pairs \
--dataset dataset_1 \
--save results/run_001 \
--partial-save 50 \
  --log-console
```

**Step 2: Evaluate results**

```bash
uv run python evaluate.py \
-r results/run_001/results.csv \
-p data/d1_data/d1_pairs.csv \
-gt data/d1_data/gtclean_evaluator_comma.csv \
  -o evaluation/run_001_metrics
```

**Step 3: Review outputs**

```bash
# View results
cat results/run_001/successful_results.csv
# View metrics
cat evaluation/run_001_metrics.json
```

## Batch Processing

Process in chunks to manage memory and save progress:

```bash
# Process first 1000 pairs
uv run python main.py \
--task Pairs \
--dataset dataset_2 \
--start-index 0 \
--count 1000 \
--save results/batch_1 \
--log-console
# Process next 1000 pairs
uv run python main.py \
--task Pairs \
--dataset dataset_2 \
--start-index 1000 \
--count 1000 \
--save results/batch_2 \
--log-console
# Combine results and evaluate
python -c "
import pandas as pd
batch1 = pd.read_csv('results/batch_1/results.csv')
batch2 = pd.read_csv('results/batch_2/results.csv')
combined = pd.concat([batch1, batch2])
combined.to_csv('results/combined_results.csv', index=False)
"
uv run python evaluate.py \
-r results/combined_results.csv \
-p data/d2_data/d2_pairs.csv \
-gt data/d2_data/gtclean_evaluator_comma.csv \
  -o evaluation/combined_metrics
```
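For more than a couple of batches, a small driver script can automate the loop (a sketch; the batch size, paths, and task/dataset names are illustrative):

```python
# Sketch: automate sequential batch runs of main.py. Batch size and paths
# are illustrative.
import subprocess

BATCH_SIZE = 1000
NUM_BATCHES = 5

for i in range(NUM_BATCHES):
    subprocess.run(
        [
            "uv", "run", "python", "main.py",
            "--task", "Pairs",
            "--dataset", "dataset_2",
            "--start-index", str(i * BATCH_SIZE),
            "--count", str(BATCH_SIZE),
            "--save", f"results/batch_{i + 1}",
            "--log-console",
        ],
        check=True,
    )
```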