Vatshayan/Data-Duplication-Removal-using-Machine-Learning

Data-Duplication-Removal-Project

This project is based on machine learning.

A Salute to Our Veterans

Abstract:

In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. A related and somewhat synonymous term is single-instance (data) storage. The technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent.

In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during analysis. As the analysis continues, other chunks are compared to the stored copy, and whenever a match occurs the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency depends on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.

One of the most common forms of data deduplication works by comparing chunks of data to detect duplicates. For that to happen, each chunk is assigned an identifier, calculated by the software, typically using a cryptographic hash function. Many implementations assume that if two identifiers are identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle. Other implementations do not make this assumption and instead verify that chunks with the same identifier really contain the same data.
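The chunk-and-reference idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names `deduplicate_chunks` and `restore`, the fixed chunk size, and the choice of SHA-256 are all assumptions made for the example. It follows the "verify" variant, comparing the bytes whenever two chunks share a digest instead of trusting the hash alone.

```python
import hashlib


def deduplicate_chunks(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and store each unique chunk once.

    Returns (store, references): store maps a digest to the chunk bytes,
    references is the ordered list of digests needed to rebuild the input.
    """
    store = {}
    references = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        stored = store.get(digest)
        if stored is None:
            # First time this byte pattern is seen: keep the chunk itself.
            store[digest] = chunk
        elif stored != chunk:
            # Same digest but different bytes: a hash collision.
            raise ValueError("hash collision detected")
        # Every chunk, new or repeated, is recorded as a small reference.
        references.append(digest)
    return store, references


def restore(store, references):
    """Rebuild the original bytes by following the references."""
    return b"".join(store[digest] for digest in references)


data = b"ABCDABCDABCDWXYZ"
store, refs = deduplicate_chunks(data)
print(len(refs), "chunks,", len(store), "stored")
```

With a 4-byte chunk size the sample input splits into four chunks, three of which are the same pattern, so only two chunks need to be stored. Real systems often use much larger or variable-size chunks; the fixed size here only keeps the example short.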

Summary:

This work presents a novel approach for detecting duplicate records in digital gazetteers, using state-of-the-art machine learning techniques. It reports a thorough evaluation of alternative machine learning approaches for classifying pairs of gazetteer records as either duplicates or non-duplicates. The classifiers are built with support vector machines or alternating decision trees, using different combinations of similarity scores as feature vectors. Experimental results show that feature vectors combining multiple similarity scores, derived from place names, semantic relationships, place types and geospatial footprints, lead to an increase in accuracy. The paper also discusses how the proposed duplicate detection approach can scale to large collections through the use of filtering or blocking techniques.
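To make the pairwise setup more concrete, here is a hedged sketch of how a feature vector of similarity scores and a simple blocking step might look. Everything in it is an assumption for illustration: the record fields (`name`, `type`, `lat`, `lon`), the helper names, the use of `difflib` for name similarity, and blocking on a one-degree grid cell. Semantic-relationship features and the classifier itself (a support vector machine or an alternating decision tree in the summary above) are left out; the vectors produced here would be the classifier's input.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


def pair_features(a, b):
    """Similarity scores for one pair of records (hypothetical schema)."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_type = 1.0 if a["type"] == b["type"] else 0.0
    distance = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
    return [name_sim, same_type, distance]


def candidate_pairs(records):
    """Blocking: only pair up records that fall in the same grid cell."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        key = (round(record["lat"]), round(record["lon"]))
        blocks[key].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)


records = [
    {"name": "Lisbon", "type": "city", "lat": 38.72, "lon": -9.14},
    {"name": "Lisboa", "type": "city", "lat": 38.71, "lon": -9.13},
    {"name": "Porto", "type": "city", "lat": 41.15, "lon": -8.61},
]
pairs = list(candidate_pairs(records))
features = {(i, j): pair_features(records[i], records[j]) for i, j in pairs}
print(features)
```

Blocking avoids comparing every record with every other record, which is what lets the approach scale. The grid-cell key used here is deliberately crude: two duplicates sitting on either side of a cell boundary would never be compared, so a real system would need overlapping blocks or several blocking keys.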

Some Points:

  • Duplicates are detected and removed using machine learning by calculating the digest of files, which takes less time than other existing methods.

  • The project proposes an efficient method for detecting and removing duplicates using machine learning algorithms.

  • Deduplication optimizes storage.

  • We read the file that contains duplicate data and write all of its unique entries to a new file.

  • The input dataset contains many duplicate entries.

  • The goal of this project is to use a machine learning approach to remove those duplicate entries.
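The digest-based step in the points above (read a file with duplicate entries, write the unique ones to a new file) could look roughly like the sketch below. It is not the project's actual code, which this README does not include. The function name `remove_duplicate_entries`, the one-entry-per-line format, the file names and the use of SHA-256 are assumptions. Note that this step on its own is exact-match deduplication; it involves no machine learning model.

```python
import hashlib
import tempfile
from pathlib import Path


def remove_duplicate_entries(source: Path, target: Path) -> int:
    """Copy unique lines from source to target; return how many were dropped."""
    seen = set()  # digests of entries already written
    removed = 0
    with source.open(encoding="utf-8") as src, \
            target.open("w", encoding="utf-8") as dst:
        for line in src:
            entry = line.rstrip("\n")
            digest = hashlib.sha256(entry.encode("utf-8")).hexdigest()
            if digest in seen:
                removed += 1  # duplicate entry: skip it
                continue
            seen.add(digest)
            dst.write(entry + "\n")
    return removed


# Small demonstration on a temporary file with repeated rows.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input.csv"
    target = Path(tmp) / "unique.csv"
    source.write_text("a,1\nb,2\na,1\nc,3\nb,2\n", encoding="utf-8")
    removed = remove_duplicate_entries(source, target)
    unique_lines = target.read_text(encoding="utf-8").splitlines()

print(removed, unique_lines)
```

Keeping only the digests in memory, not the entries themselves, bounds memory use per entry regardless of how long each line is, and the first occurrence of every entry is preserved in its original order.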

Want the project files?

You can use this project for your college project.

Email me at vatshayan007@gmail.com to get the full project code, PPT, report, synopsis, video presentation and research paper for this project.

💌 Feel free to contact me for any kind of help on any projects.

Need code, documents and an explanation video?

How to reach me:

WhatsApp: +91 9310631437 (available 24/7)

1000 Computer Science Projects : https://www.computer-science-project.in/

New CSE Project : Rainfall Prediction Project