# Data Deduplication using Clustering
**Objective**: Learn and implement data deduplication techniques.

**Task**: DBSCAN for Data Deduplication

**Steps**:
1. Dataset: Obtain a dataset of event registrations that contains duplicate entries (the code below uses a small simulated sample).
2. DBSCAN Clustering: Apply the DBSCAN algorithm to cluster similar registrations.
3. Identify Duplicates: Treat registrations that share a cluster as candidate duplicates; records labeled as noise (-1) have no sufficiently close neighbor.
4. Refinement: Validate the clusters and discard false matches before removing any records.
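
To make step 3 concrete, here is a minimal sketch of how DBSCAN turns density into labels. The one-dimensional points are hypothetical toy data, not the registration dataset, and the `eps` and `min_samples` values are chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight pairs and one isolated point.
points = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])

# With min_samples=2, a point is a core point if at least one other
# point lies within eps of it (the point itself counts as a sample).
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)

print(labels.tolist())  # [0, 0, 1, 1, -1]
```

Each tight pair receives its own cluster label and becomes a candidate duplicate group, while the isolated point is labeled -1 (noise) and is left alone.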

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Step 1: Sample event registration data with duplicates
data = {
    'RegistrationID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Name': ['Alice Smith', 'Alicia Smith', 'Bob Jones', 'Bobby Jones', 'Carol White', 'Caroline White', 'Dave Green', 'David Green'],
    'Age': [28, 29, 35, 34, 42, 43, 31, 31],
    'Email_Score': [0.9, 0.92, 0.85, 0.86, 0.95, 0.96, 0.88, 0.87]  # Simulated similarity score of email or other features
}

df = pd.DataFrame(data)

# Step 2: Preprocess features used for clustering
# Use numerical features Age and Email_Score for clustering
features = df[['Age', 'Email_Score']]

# Standardize features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Step 3: Apply DBSCAN clustering
# eps: maximum distance (in standardized feature units) for two samples to count as neighbors
# min_samples: number of samples within eps (including the point itself) needed for a point to be a core point
dbscan = DBSCAN(eps=0.5, min_samples=2)
df['Cluster'] = dbscan.fit_predict(features_scaled)

# Step 4: Identify duplicates based on clusters
# Cluster label -1 means noise (no cluster), others are clusters
duplicates_indices = []

for cluster_label in sorted(df['Cluster'].unique()):  # sorted for a deterministic order
    if cluster_label == -1:
        continue  # skip noise points
    cluster_indices = df[df['Cluster'] == cluster_label].index.tolist()
    if len(cluster_indices) > 1:
        # Mark all except the first in the cluster as duplicates
        duplicates_indices.extend(cluster_indices[1:])

# Step 5: Remove duplicates
df_deduplicated = df.drop(index=duplicates_indices)

print("Original Data:")
print(df)

print("\nDuplicates detected (indices):", duplicates_indices)

print("\nData after deduplication:")
print(df_deduplicated)


Original Data:
   RegistrationID            Name  Age  Email_Score  Cluster
0               1     Alice Smith   28         0.90       -1
1               2    Alicia Smith   29         0.92       -1
2               3       Bob Jones   35         0.85        0
3               4     Bobby Jones   34         0.86        0
4               5     Carol White   42         0.95        1
5               6  Caroline White   43         0.96        1
6               7      Dave Green   31         0.88        2
7               8     David Green   31         0.87        2

Duplicates detected (indices): [3, 5, 7]

Data after deduplication:
   RegistrationID          Name  Age  Email_Score  Cluster
0               1   Alice Smith   28         0.90       -1
1               2  Alicia Smith   29         0.92       -1
2               3     Bob Jones   35         0.85        0
4               5   Carol White   42         0.95        1
6               7    Dave Green   31         0.88        2
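
The run above does not yet perform the refinement step, and it leaves Alice Smith and Alicia Smith as noise even though they look like the same person. One possible way to validate the clusters is sketched below. It assumes that string similarity between names is an acceptable second check; the `NAME_THRESHOLD` cutoff and the `name_similarity` helper are hypothetical choices, not part of the original exercise. The names and cluster labels are copied from the output above so the block runs on its own.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

# Names and DBSCAN labels copied from the output above.
records = pd.DataFrame({
    'Name': ['Alice Smith', 'Alicia Smith', 'Bob Jones', 'Bobby Jones',
             'Carol White', 'Caroline White', 'Dave Green', 'David Green'],
    'Cluster': [-1, -1, 0, 0, 1, 1, 2, 2],
})

NAME_THRESHOLD = 0.8  # hypothetical cutoff; tune on labeled examples


def name_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (hypothetical helper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similar_pairs(frame):
    """Return index pairs in `frame` whose names meet the threshold."""
    pairs = []
    for i, j in combinations(frame.index, 2):
        if name_similarity(frame.at[i, 'Name'], frame.at[j, 'Name']) >= NAME_THRESHOLD:
            pairs.append((int(i), int(j)))
    return pairs


# Validate: keep only within-cluster pairs whose names also agree.
confirmed = []
clustered = records[records['Cluster'] != -1]
for _, group in clustered.groupby('Cluster'):
    confirmed.extend(similar_pairs(group))

# Recover: look for similar names among the points DBSCAN labeled as noise.
missed = similar_pairs(records[records['Cluster'] == -1])

print("Confirmed duplicate pairs:", confirmed)
print("Candidate pairs among noise points:", missed)
```

With this cutoff, the three clustered pairs pass the name check and the Alice/Alicia pair is surfaced for manual review. Treat that as a property of this small sample rather than a general guarantee; a real dataset would need the threshold checked against known duplicates.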
