A model to identify unique persons and hence reduce duplication of records using decision trees and classification.

PriyadharshanSaba/NameDuplication

Name Duplication Problem

Variation in names makes it difficult to identify a unique person, so deduplication of records remains an unsolved challenge. The problem becomes more complicated when data comes from multiple sources. The following variations all refer to the same person as Vladimir Frometa:

Vladimir Antonio Frometa Garo
Vladimir A Frometa Garo
Vladimir Frometa
Vladimir Frometa G
Vladimir A Frometa
Vladimir A Frometa G
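To make the kind of variation concrete, here is a minimal sketch of one way to test whether a shorter name is a variant of a fuller reference name. It is not taken from this repository: the function names (`tokens`, `token_matches`, `is_variant`) and the matching rule (single-letter tokens are treated as initials, and tokens of the reference may be skipped) are assumptions made for illustration.

```python
def tokens(name):
    """Lowercase a name and split it into word tokens, dropping periods."""
    return name.replace(".", " ").lower().split()


def token_matches(short, full):
    """A single-letter token is treated as an initial of the reference token."""
    if len(short) == 1:
        return full.startswith(short)
    return short == full


def is_variant(candidate, reference):
    """Return True if candidate's tokens appear, in order, within reference.

    Tokens of the reference may be skipped (for example a dropped middle
    name or second surname), but the first names must match.
    """
    cand, ref = tokens(candidate), tokens(reference)
    if not cand or not ref or not token_matches(cand[0], ref[0]):
        return False
    matched = 0
    for tok in ref:
        # Greedily consume the next candidate token when it fits this one.
        if matched < len(cand) and token_matches(cand[matched], tok):
            matched += 1
    return matched == len(cand)


if __name__ == "__main__":
    full_name = "Vladimir Antonio Frometa Garo"
    for variant in ["Vladimir A Frometa G", "Vladimir Frometa", "Victor Frometa"]:
        print(variant, "->", is_variant(variant, full_name))
```

The check is one-directional: it expects `reference` to be the fullest form of the name, so two abbreviated variants should each be compared against the full name rather than against each other.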

This model is trained to reduce duplication among records whose names appear in various formats.

It uses decision trees and classification to first filter out unique users from the records by their first names, and then makes the remaining decisions using other features. The list of unique records is saved to a separate file called Records.
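The two-stage flow described above can be sketched roughly as follows. This is a hand-written rule cascade for illustration, not the trained decision tree from this repository, and it uses only the standard library rather than pandas. The record layout (dicts with a `name` key and a `dob` key standing in for the "other features") and all function names are hypothetical.

```python
from collections import defaultdict


def name_tokens(name):
    """Lowercase a name and split it into tokens, dropping periods."""
    return name.replace(".", " ").lower().split()


def tokens_agree(a, b):
    """Tokens agree if equal, or if one is an initial of the other."""
    return (a == b
            or (len(a) == 1 and b.startswith(a))
            or (len(b) == 1 and a.startswith(b)))


def compatible(name_a, name_b):
    """True if the shorter token list fits, in order, inside the longer one."""
    short, long_ = sorted((name_tokens(name_a), name_tokens(name_b)), key=len)
    matched = 0
    for tok in long_:
        if matched < len(short) and tokens_agree(short[matched], tok):
            matched += 1
    return matched == len(short)


def deduplicate(records):
    """Stage 1: bucket by first name. Stage 2: compare other features."""
    buckets = defaultdict(list)
    for rec in records:
        toks = name_tokens(rec["name"])
        if toks:
            buckets[toks[0]].append(rec)

    unique = []
    for bucket in buckets.values():
        representatives = []
        # Visit the fullest names first so they become the representatives.
        ordered = sorted(bucket, key=lambda r: len(name_tokens(r["name"])),
                         reverse=True)
        for rec in ordered:
            for rep in representatives:
                # 'dob' is a hypothetical stand-in for the other features.
                if rec["dob"] == rep["dob"] and compatible(rec["name"], rep["name"]):
                    break  # duplicate of an existing representative
            else:
                representatives.append(rec)
        unique.extend(representatives)
    return unique
```

Bucketing by first name keeps the pairwise comparisons small, since only records that share a first name are ever compared. Writing the returned list to the Records file is left out of the sketch.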

Runtime: Python 3
Dependencies: pandas
