# Data Deduplication using Clustering
**Objective**: Learn and implement data deduplication techniques.

**Task**: DBSCAN for Data Deduplication

**Steps**:
1. Data Set: Obtain a dataset of event registrations that contains duplicate entries (a small inline sample is used here).
2. DBSCAN Clustering: Apply the DBSCAN algorithm to cluster similar registrations.
3. Identify Duplicates: Treat registrations that fall in the same cluster as potential duplicates.
4. Refinement: Validate the clusters, then remove the confirmed duplicates.
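
Steps 2 and 3 depend on how DBSCAN's `eps` and `min_samples` interact with cosine distance. The following is a minimal sketch on hypothetical toy vectors (`X_toy` is not part of the exercise data), showing how `min_samples` decides whether an isolated row becomes its own cluster or is marked as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy vectors: rows 0 and 1 point in almost the same direction,
# row 2 is nearly orthogonal to both.
X_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

labels_by_min_samples = {}
for min_samples in (1, 2):
    # eps is the largest cosine distance at which two rows count as neighbors
    model = DBSCAN(eps=0.1, min_samples=min_samples, metric="cosine")
    labels_by_min_samples[min_samples] = model.fit_predict(X_toy).tolist()
    print(f"min_samples={min_samples}: {labels_by_min_samples[min_samples]}")
```

With `min_samples=1` every row is a core point, so row 2 gets a cluster of its own. With `min_samples=2` it has no neighbor within `eps` and is labeled `-1` (noise). For deduplication, either setting works as long as singleton clusters and noise rows are both read as "no duplicate found".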

# DBSCAN-based deduplication of event registrations
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# 1. Sample event registration dataset with duplicates
data = {
    "registration_id": [1, 2, 3, 4, 5, 6],
    "name": [
        "Emma Watson",
        "Ema Watsn",
        "John Doe",
        "Jon Doe",
        "Liam Smith",
        "L. Smith"
    ],
    "event": [
        "AI Conference",
        "AI Conf",
        "Data Summit",
        "Data Summit 2023",
        "Health Forum",
        "Health Forum 2023"
    ]
}
df = pd.DataFrame(data)

# Combine name and event fields for clustering
df["text"] = df["name"] + " " + df["event"]

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"])

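# Normalize features to unit length (TfidfVectorizer already L2-normalizes
# rows by default, so this step is only a safeguard)
X_normalized = normalize(X)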

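# 2. Apply DBSCAN clustering
# eps is the maximum cosine distance between neighbors; min_samples=1 makes
# every row a core point, so no row is labeled as noise (-1)
dbscan = DBSCAN(eps=0.5, min_samples=1, metric='cosine')
df["cluster"] = dbscan.fit_predict(X_normalized)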

# 3. Identify and display potential duplicates
duplicates = df.sort_values("cluster")[["registration_id", "name", "event", "cluster"]]
print("Potential Duplicate Groups:\n", duplicates)

# 4. Refinement - remove duplicates (keep first entry per cluster)
deduplicated_df = df.drop_duplicates(subset="cluster", keep="first").drop(columns=["text", "cluster"])
print("\nCleaned Dataset:\n", deduplicated_df)

Potential Duplicate Groups:
    registration_id         name              event  cluster
0                1  Emma Watson      AI Conference        0
1                2    Ema Watsn            AI Conf        1
2                3     John Doe        Data Summit        2
3                4      Jon Doe   Data Summit 2023        2
4                5   Liam Smith       Health Forum        3
5                6     L. Smith  Health Forum 2023        3

Cleaned Dataset:
    registration_id         name          event
0                1  Emma Watson  AI Conference
1                2    Ema Watsn        AI Conf
2                3     John Doe    Data Summit
4                5   Liam Smith   Health Forum
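
In the output above, "Emma Watson" and "Ema Watsn" end up in different clusters: word-level TF-IDF sees "emma"/"ema" and "watson"/"watsn" as unrelated tokens, so the two rows share only "ai". One possible remedy, sketched here as an assumption rather than as part of the original exercise, is to build TF-IDF features from character n-grams, which tolerate small spelling differences. The `analyzer="char_wb"` and `ngram_range=(2, 3)` settings are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same combined "name + event" strings as in the sample dataset
texts = [
    "Emma Watson AI Conference",
    "Ema Watsn AI Conf",
    "John Doe Data Summit",
    "Jon Doe Data Summit 2023",
    "Liam Smith Health Forum",
    "L. Smith Health Forum 2023",
]

def pair_similarity(vectorizer, i, j):
    """Cosine similarity between rows i and j under the given vectorizer."""
    X = vectorizer.fit_transform(texts)
    return float(cosine_similarity(X[i], X[j])[0, 0])

# Word tokens (the default) versus character n-grams within word boundaries
word_sim = pair_similarity(TfidfVectorizer(), 0, 1)
char_sim = pair_similarity(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)), 0, 1
)

print(f"word-level similarity: {word_sim:.2f}")
print(f"char-level similarity: {char_sim:.2f}")
```

Character n-grams raise the similarity of the misspelled pair, but they also raise the similarity of unrelated rows that share common letter sequences, so `eps` would need to be retuned before swapping the vectorizer into the DBSCAN step.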

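
Step 4 asks for the clusters to be validated before rows are dropped, whereas the code above keeps the first row of each cluster unconditionally. A minimal sketch of one possible check follows, assuming that name similarity from the standard library's `difflib.SequenceMatcher` is an acceptable proxy. The `MIN_NAME_SIMILARITY` threshold and the `weakest_name_match` helper are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

# Names and cluster labels copied from the output above
clustered = pd.DataFrame({
    "registration_id": [1, 2, 3, 4, 5, 6],
    "name": ["Emma Watson", "Ema Watsn", "John Doe",
             "Jon Doe", "Liam Smith", "L. Smith"],
    "cluster": [0, 1, 2, 2, 3, 3],
})

MIN_NAME_SIMILARITY = 0.7  # hypothetical threshold; tune on labeled pairs

def weakest_name_match(names):
    """Lowest pairwise SequenceMatcher ratio among the given names."""
    return min(
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in combinations(names, 2)
    )

scores = {}
for label, group in clustered.groupby("cluster"):
    if len(group) < 2:
        continue  # singleton cluster: nothing to validate
    scores[int(label)] = weakest_name_match(group["name"])
    verdict = ("merge" if scores[int(label)] >= MIN_NAME_SIMILARITY
               else "review manually")
    print(f"cluster {label}: weakest name similarity "
          f"{scores[int(label)]:.2f} -> {verdict}")
```

Clusters whose weakest pair falls below the threshold can be routed to manual review instead of being collapsed automatically.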