
Precision over 100% reported if ground truth contains pairs of identical ids #20

Closed · mrckzgl opened this issue Apr 16, 2024 · 4 comments
Assignees: Nikoletos-K
Labels: bug (Something isn't working)

Comments

mrckzgl commented Apr 16, 2024

We have a dirty ER workflow in which the EntityMatching graph is generated with similarity_threshold=0.0 (to keep all compared edges) and the clustering is then optimized for the best similarity_threshold with Optuna. We encountered this:
[Figure_1: metric curves over the similarity_threshold sweep; precision climbs above 100% as the threshold approaches 1.0]

At the top end, where the threshold approaches 1.0 and the clustering consequently produces very few matches, the reported precision goes beyond 100%. I would have to dig deeper into what exactly causes this, but maybe you have an idea; possibly it is only a bug in edge cases where the number of matches is low.

best


mrckzgl commented Apr 16, 2024

Some more data. After calling

eval_obj.calculate_scores(true_positives=true_positives)

I printed eval_obj.__dict__:

{'total_matching_pairs': 76.0, 'data': <pyjedai.datamodel.Data object at 0x7e11d1839db0>, 'true_positives': 102, 'true_negatives': 185456764.0, 'false_positives': -26.0, 'false_negatives': 553360, 'all_gt_ids': {0, 1, 2, [...], 19316}, 'num_of_true_duplicates': 553462, 'precision': 1.3421052631578947, 'recall': 0.00018429449537637633, 'f1': 0.00036853838399531744}

So total_matching_pairs is smaller than true_positives.
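
For reference, here is a minimal sketch (a reconstruction, not pyJedAI code) of how the reported values relate, assuming the evaluator derives false_positives and precision directly from total_matching_pairs and true_positives, as the printed dict suggests:

total_matching_pairs = 76.0  # pairs actually produced by the clustering
true_positives = 102         # GT pairs counted as found, including id1|id1 self-pairs
false_positives = total_matching_pairs - true_positives  # -26.0, as reported
precision = true_positives / total_matching_pairs        # ~1.342, i.e. >100%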


mrckzgl commented Apr 16, 2024

Ah, I got it. We have matching pairs with identical ids in our ground truth, i.e. rows like "id1|id1" in the CSV file. Thinking about it, this is not strictly incorrect: an entity is obviously identical to itself, but I also see that the GT is not as clean as it should be. I will clean up the GT, but an additional approach might be to check whether the two ids are identical here:

if id1 in entity_index and \

and in that case not increment true_positives, to make the evaluation more robust. Of course, one would also need to verify the clean-clean ER case and the other steps' evaluations, so that the calculations remain correct and consistent.
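
A hypothetical sketch of that guard (the function name and variables are illustrative, not the actual pyJedAI implementation; entity_index is assumed to map an entity id to its cluster id, as in the linked snippet):

def count_true_positives(ground_truth_pairs, entity_index):
    # Count GT pairs whose two ids ended up in the same cluster,
    # skipping self-pairs such as ("id1", "id1").
    tp = 0
    for id1, id2 in ground_truth_pairs:
        if id1 == id2:
            continue  # an entity trivially matches itself; do not inflate true positives
        if id1 in entity_index and id2 in entity_index \
                and entity_index[id1] == entity_index[id2]:
            tp += 1
    return tp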

mrckzgl changed the title from "Precision over 100% reported in some edge cases" to "Precision over 100% reported if ground truth contains pairs of identical ids" Apr 16, 2024
Nikoletos-K (Member) commented

We hadn't considered this scenario before. I fully agree that it should be handled, given how common such errors are in real-world data. We will address this by adding a validation check.

Thanks for the detailed trace and feedback!

Nikoletos-K self-assigned this Apr 16, 2024
Nikoletos-K added the bug (Something isn't working) label Apr 16, 2024
Nikoletos-K (Member) commented

We added a drop_duplicates call where we parse the GT file. Here:

self.ground_truth.drop_duplicates(inplace=True)

I think this will work better.

Cheers,
Konstantinos
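
For illustration, a combined sketch of the two cleanups discussed in this thread, assuming the GT is read into a two-column pandas DataFrame (the column names are made up): drop_duplicates removes repeated rows, and an extra filter additionally drops id1|id1 self-pairs.

import pandas as pd

gt = pd.DataFrame({"D1": ["id1", "id1", "id3"],
                   "D2": ["id2", "id2", "id3"]})  # toy GT: one repeated row, one self-pair

gt.drop_duplicates(inplace=True)  # removes the repeated id1|id2 row
gt = gt[gt["D1"] != gt["D2"]]     # removes the id3|id3 self-pair
print(gt)                         # only id1|id2 remains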
