# Data Deduplication using Clustering
**Objective**: Learn and implement data deduplication techniques.

**Task**: DBSCAN for Data Deduplication

**Steps**:
1. Dataset: Obtain a dataset of event registrations that contains duplicate entries.
2. DBSCAN Clustering: Apply the DBSCAN algorithm to cluster similar registrations.
3. Identify Duplicates: Treat registrations that DBSCAN places in the same dense cluster as likely duplicates.
4. Refinement: Validate the clusters and separate any records that were wrongly grouped as duplicates.
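
The refinement step can be sketched as a small review helper. This is a minimal sketch rather than a definitive implementation: the function name `review_clusters` and its report columns are hypothetical, and the character n-gram TF-IDF features, cosine distance, and `eps=0.4` are assumptions chosen to mirror the settings used in this notebook. It lists every cluster with more than one member, together with the largest pairwise distance inside that cluster, so that loosely connected groups can be checked by hand before any record is dropped.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances


def review_clusters(records, eps=0.4):
    """Cluster strings and report each multi-member cluster for manual review.

    `review_clusters` is a hypothetical helper; eps=0.4 is an assumed default.
    """
    # Character n-grams make the features tolerant of small spelling differences
    X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(records)
    dist = cosine_distances(X)

    # min_samples=1 means every record joins some cluster (no noise label -1)
    labels = DBSCAN(eps=eps, min_samples=1, metric="precomputed").fit_predict(dist)

    rows = []
    for label in sorted(set(labels)):
        idx = np.flatnonzero(labels == label)
        if len(idx) < 2:
            continue  # singletons are not duplicate candidates
        # Largest pairwise distance inside the cluster: a high value suggests
        # the members were chained together rather than being mutually similar
        max_dist = float(dist[np.ix_(idx, idx)].max())
        rows.append({
            "cluster": int(label),
            "members": [records[i] for i in idx],
            "max_distance": max_dist,
        })
    return pd.DataFrame(rows, columns=["cluster", "members", "max_distance"])
```

Calling `review_clusters` on a list of combined name and email strings returns one row per candidate duplicate group. If expected duplicates are missing from the report, `eps` may be too small for the data; if unrelated records appear together, it may be too large.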

In [2]:
%pip install pandas scikit-learn
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.cluster import DBSCAN

# Step 1: Simulated Dataset of Event Registrations with Duplicates
data = {
    'Name': [
        'Anita Sharma', 'Anitha Sharma', 'Brian Lee', 'Bryan Lee',
        'Sophie Zhang', 'Sophy Zhang', 'David Clark', 'Dave Clark'
    ],
    'Email': [
        'anita@gmail.com', 'anitha@gmail.com', 'brian@gmail.com', 'bryan@gmail.com',
        'sophie@gmail.com', 'sophy.z@gmail.com', 'davidc@gmail.com', 'd.clark@gmail.com'
    ]
}

df = pd.DataFrame(data)
print("Original Registrations:\n", df)

# Step 2: Vectorize the combined 'Name + Email' using TF-IDF
combined = df['Name'] + ' ' + df['Email']
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
X = vectorizer.fit_transform(combined)

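# Step 3: Apply DBSCAN on a precomputed cosine-distance matrix
distance_matrix = cosine_distances(X)
# eps is the maximum cosine distance at which two records count as neighbours;
# min_samples=1 makes every record a core point, so none is labelled as noise (-1)
db = DBSCAN(eps=0.4, min_samples=1, metric='precomputed')
df['Cluster'] = db.fit_predict(distance_matrix)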

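# Step 4: Deduplicate by keeping the first record in each cluster.
# drop_duplicates keeps whole rows; groupby(...).first() would instead take the
# first non-null value per column and could mix fields from different records.
deduplicated_df = (
    df.drop_duplicates(subset='Cluster', keep='first')
      .drop(columns='Cluster')
      .reset_index(drop=True)
)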
print("\nDeduplicated Registrations:\n", deduplicated_df)

Original Registrations:
             Name              Email
0   Anita Sharma    anita@gmail.com
1  Anitha Sharma   anitha@gmail.com
2      Brian Lee    brian@gmail.com
3      Bryan Lee    bryan@gmail.com
4   Sophie Zhang   sophie@gmail.com
5    Sophy Zhang  sophy.z@gmail.com
6    David Clark   davidc@gmail.com
7     Dave Clark  d.clark@gmail.com

Deduplicated Registrations:
            Name              Email
0  Anita Sharma    anita@gmail.com
1     Brian Lee    brian@gmail.com
2     Bryan Lee    bryan@gmail.com
3  Sophie Zhang   sophie@gmail.com
4   Sophy Zhang  sophy.z@gmail.com
5   David Clar