# Data Deduplication using Clustering
**Objective**: Learn and implement data deduplication techniques.

**Task**: DBSCAN for Data Deduplication

**Steps**:
1. Data Set: Download a dataset containing duplicate entries for event registrations.
2. DBSCAN Clustering: Apply the DBSCAN algorithm to cluster similar registrations.
3. Identify Duplicates: Detect duplicates based on the density of the clusters.
4. Refinement: Validate clusters and remove any erroneous duplicates.
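
To make step 3 concrete, here is a minimal sketch of how DBSCAN labels translate into duplicate groups. The 4x4 distance matrix and the `eps` / `min_samples` values are made-up illustrations, not taken from the registration data. Records in the same cluster are treated as candidate duplicates, and the label `-1` marks noise, meaning records with no close neighbor.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical pairwise distances: records 0 and 1 are close, records 2 and 3 are isolated.
toy_distances = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.9, 0.8],
    [0.9, 0.9, 0.0, 0.7],
    [0.8, 0.8, 0.7, 0.0],
])

labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(toy_distances)

# Group record indices by cluster label; -1 marks noise (treated as unique records).
duplicate_groups = defaultdict(list)
for index, label in enumerate(labels):
    if label != -1:
        duplicate_groups[int(label)].append(index)

print(labels.tolist())         # [0, 0, -1, -1]
print(dict(duplicate_groups))  # {0: [0, 1]}
```

With `min_samples=2`, a record needs only one other record within `eps` to start a cluster, which is why this setting suits pairwise duplicate detection.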

In [1]:
# write your code from here
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Step 1: Data Set - Create a sample dataset for event registrations with duplicate entries.
# We'll simulate variations in names, emails, and event titles.
data = {
    'RegistrationID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
    'RegistrantName': [
        'Alex Johnson', 'Sarah Miller', 'David Lee', 'Emily Chen', 'Michael Brown',
        'Alex Jonson', 'Sara Miller', 'Dave Lee', 'Emly Chen', 'Mike Brown',
        'Alex Johnson', # Exact duplicate of ID 1
        'Sarah Miler', # Typo in 'Miller'
        'David Lee' # Exact duplicate of ID 3
    ],
    'RegistrantEmail': [
        'alex.j@example.com', 'sarah.m@example.com', 'david.l@example.com',
        'emily.c@example.com', 'michael.b@example.com',
        'alex.johnson@example.com', 's.miller@example.com', 'd.lee@example.com',
        'e.chen@example.com', 'm.brown@example.com',
        'alex.j@example.com', # Exact duplicate of ID 1's email
        'sarah.miller@example.com', # Similar to ID 2
        'david.l@example.com' # Exact duplicate of ID 3's email
    ],
    'EventTitle': [
        'Annual Tech Summit', 'Marketing Workshop', 'Data Science Conference',
        'AI Innovations Day', 'Product Launch Event',
        'Tech Summit', 'Marketing Workshop', 'Data Science Conf.',
        'AI Innovation Day', 'Product Launch',
        'Annual Tech Summit',
        'Marketing Workshp',
        'Data Science Conference'
    ]
}
df = pd.DataFrame(data)
print("Original Event Registrations Dataset:")
print(df)
print("\n" + "="*50 + "\n")

# For deduplication, combine relevant text fields into a single string for similarity calculation.
# Lowercasing and stripping whitespace helps normalize the data.
df['Combined_Info'] = df['RegistrantName'].str.lower().str.strip() + " " + \
                      df['RegistrantEmail'].str.lower().str.strip() + " " + \
                      df['EventTitle'].str.lower().str.strip()

# Step 2: DBSCAN Clustering - Apply the DBSCAN algorithm
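# 2.1 Feature Extraction: Convert text data into numerical features using TF-IDF
tfidf_matrix = TfidfVectorizer().fit_transform(df['Combined_Info'])

# 2.2 Calculate Distance Matrix: Compute cosine similarity between records
# DBSCAN works with distances, so we convert similarity to distance.
similarity_matrix = cosine_similarity(tfidf_matrix)
# Floating-point rounding can push a similarity slightly above 1, which makes
# 1 - similarity slightly negative. DBSCAN rejects negative precomputed distances
# ("ValueError: Negative values in data passed to X"), so clip at 0.
distance_matrix = np.clip(1 - similarity_matrix, 0, None)

# With metric='precomputed', DBSCAN expects a square matrix of non-negative pairwise distances.
# We pass distance_matrix in that form.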

# 2.3 Apply DBSCAN
# eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
# min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point.
# Choosing eps and min_samples is crucial and often requires domain knowledge or experimentation.
# For text similarity (where 1-similarity is distance), a small eps value means high similarity.
# Let's start with a small eps, e.g., 0.3 (meaning similarity >= 0.7) and min_samples=2.
# A point is a core point if it has at least min_samples (including itself) within eps distance.
# Adjust these parameters based on your data and desired cluster density.
eps = 0.3 # Max distance for points to be considered neighbors (corresponds to min similarity of 0.7)
min_samples = 2 # Minimum number of samples in a neighborhood for a point to be a core point

# metric='precomputed' tells DBSCAN to use the distance_matrix directly.
dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed')
clusters = dbscan.fit_predict(distance_matrix)

# Add cluster IDs to the DataFrame
# -1 indicates noise points (outliers) that do not belong to any cluster.
df['ClusterID'] = clusters

print("Dataset with DBSCAN Cluster IDs:")
print(df[['RegistrationID', 'RegistrantName', 'RegistrantEmail', 'EventTitle', 'ClusterID']])
print("\n" + "="*50 + "\n")

# Step 3: Identify Duplicates - Detect duplicates based on density of the clusters.
# Duplicates are typically found in clusters with more than one member (ClusterID != -1).
# Noise points (ClusterID = -1) are considered unique or unclassifiable.
duplicate_clusters_df = df[df['ClusterID'] != -1].groupby('ClusterID').filter(lambda x: len(x) > 1)

print("Identified Duplicate Records (Clusters with more than 1 member, excluding noise):")
print(duplicate_clusters_df[['RegistrationID', 'RegistrantName', 'RegistrantEmail', 'EventTitle', 'ClusterID']])
print("\n" + "="*50 + "\n")

# Step 4: Refinement - Validate clusters and remove any erroneous duplicates.
# For each identified cluster, we'll keep one representative record.
# A common strategy is to keep the record with the lowest RegistrationID.
clean_df_list = []
for cluster_id in df['ClusterID'].unique():
    cluster_records = df[df['ClusterID'] == cluster_id]
    if cluster_id == -1: # Keep all noise points as they are considered unique
        clean_df_list.append(cluster_records)
    else: # For actual clusters, keep only one record
        # Keep the record with the minimum RegistrationID as the representative.
        # Selecting with a list of labels returns a one-row DataFrame and preserves column dtypes.
        representative_record = cluster_records.loc[[cluster_records['RegistrationID'].idxmin()]]
        clean_df_list.append(representative_record)

clean_df = pd.concat(clean_df_list).reset_index(drop=True)

# Drop the 'Combined_Info' column as it was for internal processing
clean_df = clean_df.drop(columns=['Combined_Info'])

print("Cleaned Dataset (Duplicates Removed by DBSCAN):")
print(clean_df[['RegistrationID', 'RegistrantName', 'RegistrantEmail', 'EventTitle']])
print("\n" + "="*50 + "\n")

# Verification: Check if any of the original duplicates are still present in the cleaned data
# Define some expected duplicate pairs based on the sample data
expected_duplicate_pairs = [
    ('Alex Johnson', 'Alex Jonson'),
    ('Alex Johnson', 'Alex Johnson'), # Exact duplicate
    ('Sarah Miller', 'Sara Miller'),
    ('Sarah Miller', 'Sarah Miler'),
    ('David Lee', 'Dave Lee'),
    ('David Lee', 'David Lee') # Exact duplicate
]

print("Verification of Cleaned Data:")
names_kept = clean_df['RegistrantName'].values
for name1, name2 in expected_duplicate_pairs:
    # Check whether both names of a pair survive in the cleaned data (a potential missed duplicate).
    # This check is simplified; a more robust check would compare full records.
    if name1 != name2 and name1 in names_kept and name2 in names_kept:
        # Both variants survived: they were labelled noise or assigned to different clusters.
        print(f"Warning: '{name1}' and '{name2}' are both still present. DBSCAN parameters (eps, min_samples) might need adjustment.")
    elif name1 in names_kept or name2 in names_kept:
        print(f"One of '{name1}' or '{name2}' is present, which is expected after deduplication.")
    else:
        # Neither name survived: the pair was merged into a cluster represented by another record.
        print(f"Neither '{name1}' nor '{name2}' found; the pair was absorbed into a cluster with a different representative, so check for over-merging.")

# To see the actual records that were removed:
# Removed records are those that belonged to a cluster but were not selected as its representative.
# Noise points (ClusterID = -1) are always kept, so near-duplicates that DBSCAN labelled as noise
# remain in clean_df; adjusting eps or min_samples may capture them.
removed_records = df[~df['RegistrationID'].isin(clean_df['RegistrationID'])]
print("\nRecords that were removed as duplicates:")
print(removed_records[['RegistrationID', 'RegistrantName', 'RegistrantEmail', 'ClusterID']])


Original Event Registrations Dataset:
    RegistrationID RegistrantName           RegistrantEmail  \
0                1   Alex Johnson        alex.j@example.com   
1                2   Sarah Miller       sarah.m@example.com   
2                3      David Lee       david.l@example.com   
3                4     Emily Chen       emily.c@example.com   
4                5  Michael Brown     michael.b@example.com   
5                6    Alex Jonson  alex.johnson@example.com   
6                7    Sara Miller      s.miller@example.com   
7                8       Dave Lee         d.lee@example.com   
8                9      Emly Chen        e.chen@example.com   
9               10     Mike Brown       m.brown@example.com   
10              11   Alex Johnson        alex.j@example.com   
11              12    Sarah Miler  sarah.miller@example.com   
12              13      David Lee       david.l@example.com   

                 EventTitle  
0        Annual Tech Summit  
1        Marketing 

ValueError: Negative values in data passed to X.
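
The `ValueError` above is raised when a precomputed distance matrix passed to DBSCAN contains negative entries. A likely cause here is floating-point rounding: the cosine similarity of two identical rows can come out marginally above 1, so `1 - similarity` lands marginally below 0. The sketch below shows one way to guard against this, assuming hypothetical toy vectors rather than the TF-IDF matrix from the cell: clip the distances at 0 and reset the diagonal before clustering.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature vectors: rows 0 and 1 are identical, row 2 is orthogonal to both.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [3.0, 0.0, -1.0],
])

# Convert similarity to distance, then remove any rounding artifacts below zero.
distance = np.clip(1.0 - cosine_similarity(X), 0.0, None)
np.fill_diagonal(distance, 0.0)  # a record is at distance 0 from itself

labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(distance)
print(labels.tolist())  # [0, 0, -1]
```

Clipping changes values only by the size of the rounding error, so it should not affect which records fall within `eps` of each other.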