Change duplicate person detection #5758

Open
SORMAS-JanBoehme opened this issue Jun 10, 2021 · 2 comments
Labels
backend (Affects the web backend), change (A change of an existing feature; ticket type), general (Not directly concerned with any particular functional section of the application)

Comments

SORMAS-JanBoehme commented Jun 10, 2021

Summary:
Health departments often find that duplicate detection for persons is not triggered when creating a new case or contact, because a typo was made when the data was entered (e.g. 1991 instead of 1990 as the birth year).

See SORMAS-Foundation/SORMAS-Glossary#23 (comment) for a description of how duplicate detection for persons currently works.

This issue is meant to describe a proposed change to the current duplicate detection with the goal of providing more relevant results to the user.

Basic concept:

The detection should use weighted results when checking for duplicates instead of looking only for perfect matches. If the weighted sum of all checks reaches or exceeds a configured threshold, the result is considered a possible duplicate and presented to the user. The values used in the duplicate detection should be made available to GSA admins via the UI so that they can adjust the granularity of the detection themselves without having to raise a ticket.

Prerequisites:

  • The PostgreSQL database needs to have the pg_trgm module enabled to allow trigram checks during query execution.
  • Run tests on a SORMAS database with 1,000,000+ persons to determine the load put on the database when multiple "similarity" calls are used during query execution. Otherwise this feature may be too much to handle for the international version, where everything is stored in one instance.
  • Identify and create one or more appropriate indexes on the database to speed up the trigram calculation.

Changes to config values

Remove values
namesimilaritythreshhold

Introduce values

| Name | Value Range | Default Value |
| --- | --- | --- |
| DuplicateDetectionPersonMaxResults | 0 - 100 | tbd |
| DuplicateDetectionPersonNameWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonNameThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonSexWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonSexThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonBirthdateDayWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonBirthdateDayThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonBirthdateMonthWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonBirthdateMonthThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonBirthdateYearWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonBirthdateYearThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonPassportNumberWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonPassportNumberThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonHealthIdWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonHealthIdThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonResultThreshhold | 0.0f - 1.0f | tbd |

Process:

The following assumes that all of the weights have a value > 0.0f.
If an admin sets one of the weights to 0.0f, this indicates that the admin does not want that check to affect the duplicate detection. The corresponding check is skipped and must not influence the final result.

  1. Take the first and last name entered by the user and use them for a first evaluation of possible duplicates by applying "similarity" when selecting the data. Compare first name with first name and last name with last name, and also compare them crosswise in case the names were switched by accident. It would look roughly like this (the actual query is probably far more complicated):
SELECT data FROM tables
WHERE
SIMILARITY(firstName, firstNameEntered) > DuplicateDetectionPersonNameThreshhold OR
SIMILARITY(lastName, lastNameEntered)  > DuplicateDetectionPersonNameThreshhold OR
SIMILARITY(firstName, lastNameEntered) > DuplicateDetectionPersonNameThreshhold OR
SIMILARITY(lastName, firstNameEntered) > DuplicateDetectionPersonNameThreshhold
ORDER BY highestValue DESC
LIMIT DuplicateDetectionPersonMaxResults
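To make the candidate selection in step 1 easier to reason about outside the database, here is a minimal Python sketch. It assumes a trigram similarity that approximates pg_trgm (lower-casing, padding each word with two leading and one trailing space, shared trigrams divided by all distinct trigrams). The function names, the dictionary keys `firstName` / `lastName` and the in-memory list of persons are hypothetical and only for illustration; the real check would run inside PostgreSQL.

```python
import re


def trigrams(text):
    """Approximate pg_trgm trigram extraction for a string."""
    grams = set()
    # Split into alphanumeric words; each word is padded with two leading
    # spaces and one trailing space before trigrams are taken.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 - 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def name_score(first, last, entered_first, entered_last):
    """Highest of the four comparisons, including the swapped-name cases."""
    return max(
        similarity(first, entered_first),
        similarity(last, entered_last),
        similarity(first, entered_last),
        similarity(last, entered_first),
    )


def candidates(persons, entered_first, entered_last, threshold, max_results):
    """Keep persons above the name threshold, best matches first."""
    scored = [
        (name_score(p["firstName"], p["lastName"], entered_first, entered_last), p)
        for p in persons
    ]
    hits = [(score, p) for score, p in scored if score > threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:max_results]
```

Taking the maximum of the four comparisons and testing it against the threshold is equivalent to the OR condition in the query, and the same value can serve as the sort key.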
  2. Do the following checks for all results:

2.1 Sex

  • If equal --> score of 1.0f
  • If any of the compared sides has a value of UNKNOWN or null --> score of 1.0f
  • If person.sex is set but not equal --> score of 0.5f
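The three rules for the sex check can be sketched as a small function. This is only an illustration: the string values such as "MALE" and "UNKNOWN" and the function name are assumptions, not taken from the SORMAS code base.

```python
def sex_score(stored, entered):
    """Score the sex comparison according to rules 2.1."""
    unknown_values = (None, "UNKNOWN")
    # A missing or unknown value on either side must not lower the score.
    if stored in unknown_values or entered in unknown_values:
        return 1.0
    # Both sides are set: full score on equality, half score otherwise.
    return 1.0 if stored == entered else 0.5
```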

2.2 Person.birthdateDD

  • If it is an exact match --> score of 1.0f
  • Otherwise the score is reduced by 0.1f for each day the values are apart until it reaches 0.0f. Make sure to include the "wrap around" between 31 and 1. This check does not take the actual month into account and uses 31 as the base for the calculation; checking against the month would make this far more complex while adding little accuracy (e.g. 13 and 14 score 0.9f, and 03 and 27 score 0.3f).

2.3 Person.birthdateMM

  • If it is an exact match --> score of 1.0f
  • The score is reduced by 0.2f for each value they are apart until it reaches 0.0f. Make sure to include "wrap around" between 12 and 1.

2.4 Person.birthdateYYYY

  • If it is an exact match --> score of 1.0f
  • The score is reduced by 0.1f for each value they are apart until it reaches 0.0f.
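The birthdate checks 2.2 to 2.4 share one pattern: a distance between two numbers, optionally wrapped around a cycle, multiplied by a step. A minimal sketch under that reading follows. The function names are hypothetical, and handling of missing birthdate parts is left out because the proposal does not specify it.

```python
def component_score(a, b, step, cycle=None):
    """Score two numeric date parts; 1.0 on exact match, never below 0.0."""
    distance = abs(a - b)
    if cycle is not None:
        # Wrap around, e.g. day 31 and day 1 are one step apart.
        distance = min(distance, cycle - distance)
    # Rounding hides floating-point noise such as 0.30000000000000004.
    return max(0.0, round(1.0 - step * distance, 2))


def day_score(a, b):
    return component_score(a, b, step=0.1, cycle=31)


def month_score(a, b):
    return component_score(a, b, step=0.2, cycle=12)


def year_score(a, b):
    return component_score(a, b, step=0.1)
```

With these definitions, `day_score(13, 14)` gives 0.9 and `day_score(3, 27)` gives 0.3, matching the examples in 2.2.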

2.5 Person.passportNumber

  • Perform a trigram calculation and use the resulting score

2.6 Person.nationalHealthId

  • Perform a trigram calculation and use the resulting score
  3. Apply weights
    Multiply the scores for each test with the corresponding weight and normalize the value to a result of 0.0f - 1.0f.
    Could look something like this:
for (Result result : allResults) {
	float maxScore = 0.0f;
	float achievedScore = 0.0f;

	// Checks with a weight of 0.0f are skipped and do not affect the result
	if (DuplicateDetectionPersonNameWeight > 0.0f) {
		maxScore += DuplicateDetectionPersonNameWeight;
		achievedScore += result.DuplicateDetectionPersonNameScore * DuplicateDetectionPersonNameWeight;
	}

	if (DuplicateDetectionPersonSexWeight > 0.0f) {
		maxScore += DuplicateDetectionPersonSexWeight;
		achievedScore += result.DuplicateDetectionPersonSexScore * DuplicateDetectionPersonSexWeight;
	}

	if (DuplicateDetectionPersonBirthdateDayWeight > 0.0f) {
		maxScore += DuplicateDetectionPersonBirthdateDayWeight;
		achievedScore += result.DuplicateDetectionPersonBirthdateDayScore * DuplicateDetectionPersonBirthdateDayWeight;
	}

	if (DuplicateDetectionPersonBirthdateMonthWeight > 0.0f) {
		maxScore += DuplicateDetectionPersonBirthdateMonthWeight;
		achievedScore += result.DuplicateDetectionPersonBirthdateMonthScore * DuplicateDetectionPersonBirthdateMonthWeight;
	}

	if (DuplicateDetectionPersonBirthdateYearWeight > 0.0f) {
		maxScore += DuplicateDetectionPersonBirthdateYearWeight;
		achievedScore += result.DuplicateDetectionPersonBirthdateYearScore * DuplicateDetectionPersonBirthdateYearWeight;
	}

	if (DuplicateDetectionPersonPassportNumberWeight > 0.0f) {
		maxScore += DuplicateDetectionPersonPassportNumberWeight;
		achievedScore += result.DuplicateDetectionPersonPassportNumberScore * DuplicateDetectionPersonPassportNumberWeight;
	}

	if (DuplicateDetectionPersonHealthIdWeight > 0.0f) {
		maxScore += DuplicateDetectionPersonHealthIdWeight;
		achievedScore += result.DuplicateDetectionPersonHealthIdScore * DuplicateDetectionPersonHealthIdWeight;
	}

	// maxScore is 0.0f if every weight is 0.0f; avoid dividing by zero
	if (maxScore > 0.0f && achievedScore / maxScore >= DuplicateDetectionPersonResultThreshhold) {
		// Is possible duplicate
	}
}
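As a worked example of the normalization, the same logic can be written in a data-driven form. This is a sketch only: the check names, weights and scores below are made-up illustration values, not proposed defaults.

```python
def combined_score(scores, weights):
    """Weighted average of the check scores, normalized to 0.0 - 1.0."""
    max_score = 0.0
    achieved = 0.0
    for check, weight in weights.items():
        if weight <= 0.0:
            continue  # a weight of 0.0 disables the check entirely
        max_score += weight
        achieved += scores[check] * weight
    # With every check disabled there is nothing to compare.
    return achieved / max_score if max_score > 0.0 else 0.0


# Hypothetical values: the passport check is disabled via weight 0.0.
weights = {"name": 3.0, "sex": 1.0, "birthdateYear": 2.0, "passportNumber": 0.0}
scores = {"name": 0.8, "sex": 1.0, "birthdateYear": 0.9, "passportNumber": 0.1}

result = combined_score(scores, weights)
# achieved = 0.8*3 + 1.0*1 + 0.9*2 = 5.2, max = 6.0, so result is about 0.87
is_possible_duplicate = result >= 0.8  # 0.8 stands in for the result threshold
```

Because the disabled passport check adds nothing to either sum, its low score of 0.1 has no effect on the outcome.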

Issues that may be connected to this:
SORMAS-Foundation/SORMAS-Glossary#23
#3560
#5576

SORMAS-JanBoehme added the change label Jun 10, 2021
@marko-arn

@kwa20 @Jan-Boehme While comparing the data between our IfSG application (Octoware) and SORMAS, I have also noticed several duplicate persons.

Quite often people have two first names, one of which is the name they go by. Depending on who reports the contact/case/ep, only one or both names are given.

When a person has more than one first name, the duplicate detection should check each space-separated name individually.
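A minimal sketch of that per-name comparison could look as follows. It uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in for the trigram similarity discussed in the issue, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def token_similarity(a, b):
    """Stand-in string similarity (0.0 - 1.0); not the pg_trgm measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def first_name_score(stored, entered):
    """Best similarity between any pair of space-separated first names."""
    stored_tokens = stored.split()
    entered_tokens = entered.split()
    if not stored_tokens or not entered_tokens:
        return 0.0
    return max(
        token_similarity(s, e)
        for s in stored_tokens
        for e in entered_tokens
    )
```

With this approach "Anna Maria" and "Maria" are a full match on the second name, whereas comparing the complete strings would score noticeably lower.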

MateStrysewske added the general, backend and needs-refinement labels Jul 9, 2021
MateStrysewske added this to Discussion in Refinement Board via automation Jul 9, 2021
MateStrysewske moved this from Discussion to Symeda Refinements in Refinement Board Jul 9, 2021
@bernardsilenou

@Jan-Boehme @MateStrysewske @kwa20

  • As mentioned in the Prerequisites section, I also think it would be good to do some form of testing/simulation before implementation.

  • Two points I can think of are:

    • Performance, especially on mobile devices
    • Sensitivity, i.e. whether the weighted method detects significantly more duplicates than the non-weighted one
  • Users could be given the option to choose the weights for each variable, since not all variables or weights apply to all instances.

MateStrysewske removed the needs-refinement label Mar 2, 2023
Projects
No open projects
Refinement Board
  
Symeda Refinements
Development

No branches or pull requests

4 participants