Labels: backend (affects the web backend), change (a change of an existing feature, ticket type), general (not directly concerned with any particular functional section of the application)
Summary:
The departments of health often face the problem that the duplicate detection for persons does not trigger when creating a new case or contact because a typo was made during data entry (e.g. entering 1991 instead of 1990 as the birth year).
This issue is meant to describe a proposed change to the current duplicate detection with the goal of providing more relevant results to the user.
Basic concept:
The detection should use weighted results when checking for duplicates instead of only looking for perfect matches. If the sum of all checks exceeds a configured threshold, the result is considered a possible duplicate and presented to the user. The values used in the duplicate detection should be made available to GSA admins via the UI, allowing them to adjust the granularity of the detection themselves without having to raise a ticket.
Prerequisites:
The PostgreSQL database needs the pg_trgm module enabled to allow trigram checks during query execution
Execute tests on a SORMAS database with 1,000,000+ persons to determine the load put on the database when using multiple "similarity" calls during query execution. Otherwise this feature may be too much to handle for the international version, where everything is stored in one instance
Identify and create one or more appropriate indexes on the database to speed up the trigram calculation
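As a rough sketch of this setup (the table and column names `person`, `firstname`, `lastname` are illustrative, not the actual SORMAS schema):

```sql
-- Enable the trigram module (once per database)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram indexes speed up trigram matching on the name columns
-- (index-accelerated via the % similarity operator)
CREATE INDEX IF NOT EXISTS idx_person_firstname_trgm
    ON person USING gin (firstname gin_trgm_ops);
CREATE INDEX IF NOT EXISTS idx_person_lastname_trgm
    ON person USING gin (lastname gin_trgm_ops);
```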
Changes to config values
Remove values
namesimilaritythreshhold
Introduce values
| Name | Value Range | Default Value |
| --- | --- | --- |
| DuplicateDetectionPersonMaxResults | 0 - 100 | tbd |
| DuplicateDetectionPersonNameWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonNameThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonSexWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonSexThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonBirthdateDayWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonBirthdateDayThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonBirthdateMonthWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonBirthdateMonthThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonBirthdateYearWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonBirthdateYearThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonPassportNumberWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonPassportNumberThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonHealthIdWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonHealthIdThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonResultThreshhold | 0.0f - 1.0f | tbd |
Process:
The following assumes that all of the weights have a value > 0.0f.
If the admin sets one of the weights to 0.0f, this indicates that the corresponding check should not have an impact on the duplicate detection; that check is skipped and must not influence the final result.
Take the first and last name entered by the user and use them for a first evaluation of possible duplicates by applying "similarity" when selecting the data. Check the similarity of the first name and the last name individually, and also cross-compare them in case the names were switched by accident. This would basically look something like this (the actual query is probably far more complicated):
SELECT data FROM tables
WHERE
SIMILARITY(firstName, firstNameEntered) > DuplicateDetectionPersonNameThreshhold OR
SIMILARITY(lastName, lastNameEntered) > DuplicateDetectionPersonNameThreshhold OR
SIMILARITY(firstName, lastNameEntered) > DuplicateDetectionPersonNameThreshhold OR
SIMILARITY(lastName, firstNameEntered) > DuplicateDetectionPersonNameThreshhold
ORDER BY highestValue DESC
LIMIT DuplicateDetectionPersonMaxResults
Do the following checks for all results:
2.1 Sex
If equal --> score of 1.0f
If any of the compared sides has a value of UNKNOWN or null --> score of 1.0f
If person.sex is set but not equal --> score of 0.5f
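A minimal sketch of this sex check, assuming a simplified Sex enum for illustration (the actual SORMAS enum and constants may differ):

```java
// Hypothetical enum; the real SORMAS model may use different constants.
enum Sex { MALE, FEMALE, OTHER, UNKNOWN }

class SexScore {
    static float score(Sex existing, Sex entered) {
        // Missing or unknown values must not count against a potential duplicate
        if (existing == null || entered == null
                || existing == Sex.UNKNOWN || entered == Sex.UNKNOWN) {
            return 1.0f;
        }
        // Exact match scores 1.0f; a set but different value still scores 0.5f
        return existing == entered ? 1.0f : 0.5f;
    }
}
```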
2.2 Person.birthdateDD
If it is an exact match --> score of 1.0f
The score is reduced by 0.1f for each day of distance until it reaches 0.0f. Make sure to include the "wrap around" between 31 and 1. This check does not take the actual month into account and uses 31 as the base for the calculation; checking against the month would make this far more complex while adding little accuracy (e.g. 13 and 14 would have a score of 0.9f, and 03 and 27 a score of 0.3f)
2.3 Person.birthdateMM
If it is an exact match --> score of 1.0f
The score is reduced by 0.2f for each value they are apart until it reaches 0.0f. Make sure to include "wrap around" between 12 and 1.
2.4 Person.birthdateYYYY
If it is an exact match --> score of 1.0f
The score is reduced by 0.1f for each value they are apart until it reaches 0.0f.
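Checks 2.2-2.4 follow the same pattern and can be sketched with one hypothetical helper, where `step` is the per-unit reduction (0.1f for days and years, 0.2f for months) and `base` is the wrap-around modulus (31 for days, 12 for months, 0 for years, meaning no wrap-around):

```java
class DateComponentScore {
    // Generic form of checks 2.2-2.4: score 1.0f for an exact match,
    // reduced by `step` per unit of distance, floored at 0.0f.
    static float score(int a, int b, float step, int base) {
        int distance = Math.abs(a - b);
        if (base > 0) {
            // "Wrap around": e.g. day 31 vs day 1 is a distance of 1, not 30
            distance = Math.min(distance, base - distance);
        }
        return Math.max(0.0f, 1.0f - step * distance);
    }
}
```

For example, days 13 and 14 score 0.9f, days 03 and 27 score 0.3f (wrap-around distance 7), and months 12 and 01 score 0.8f.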
2.5 Person.passportNumber
Perform a trigram calculation and use the resulting score
2.6 Person.nationalHealthId
Perform a trigram calculation and use the resulting score
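In production these trigram scores would come from the database's SIMILARITY() function; purely for illustration, here is a simplified Java re-implementation of pg_trgm-style similarity (lowercased, blank-padded trigrams, shared/union ratio):

```java
import java.util.HashSet;
import java.util.Set;

class Trigram {
    // Extract pg_trgm-style trigrams: pg_trgm pads the lowercased string
    // with two leading and one trailing blank before slicing.
    static Set<String> trigrams(String s) {
        String padded = "  " + s.toLowerCase() + " ";
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            result.add(padded.substring(i, i + 3));
        }
        return result;
    }

    // Similarity = |shared trigrams| / |union of trigrams|, in 0.0f - 1.0f
    static float similarity(String a, String b) {
        Set<String> ta = trigrams(a);
        Set<String> tb = trigrams(b);
        Set<String> shared = new HashSet<>(ta);
        shared.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0f : (float) shared.size() / union.size();
    }
}
```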
Apply weights
Multiply the score of each check with the corresponding weight and normalize the sum to a result of 0.0f - 1.0f.
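A rough sketch of this combination step (class and method names are illustrative, not existing SORMAS code): dividing the weighted sum by the sum of the weights normalizes the result back into 0.0f - 1.0f, and checks with a weight of 0.0f drop out automatically, matching the skip rule above.

```java
class DuplicateScore {
    // scores[i] is the 0.0f - 1.0f result of check i,
    // weights[i] its configured 0.0f - 5.0f weight.
    static float combine(float[] scores, float[] weights) {
        float weightedSum = 0.0f;
        float weightTotal = 0.0f;
        for (int i = 0; i < scores.length; i++) {
            weightedSum += scores[i] * weights[i];
            weightTotal += weights[i];
        }
        // All weights 0.0f would mean no check contributed anything
        return weightTotal == 0.0f ? 0.0f : weightedSum / weightTotal;
    }
}
```

The normalized result would then be compared against DuplicateDetectionPersonResultThreshhold to decide whether the person is presented as a possible duplicate.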
@kwa20 @Jan-Boehme As I am currently comparing the data between our IfSG application (Octoware) and SORMAS, I have also noticed several duplicates for persons.
Quite often people have two first names, one of which is the call name. Depending on who reports the contact/case/ep, only one or both names are given.
In the case of more than one first name, the duplicate recognition should check each space-separated name individually.
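A minimal sketch of that per-name comparison, using exact matching of lowercased parts for simplicity (the real check would use the trigram similarity described above on each part):

```java
class NameSplit {
    // Compare each space-separated part of the entered name against each
    // part of the stored name, so "Anna Maria" still matches "Maria".
    static boolean anyPartMatches(String entered, String stored) {
        for (String pe : entered.trim().toLowerCase().split("\\s+")) {
            for (String ps : stored.trim().toLowerCase().split("\\s+")) {
                if (pe.equals(ps)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```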
See SORMAS-Foundation/SORMAS-Glossary#23 (comment) for a description of how the duplicate detection for persons currently works.
Issues that may be connected to this:
SORMAS-Foundation/SORMAS-Glossary#23
#3560
#5576