# Data Deduplication using Clustering
**Objective**: Learn and implement data deduplication techniques.

**Task**: Deduplication Using K-means Clustering

**Steps**:
1. Data Set: Download a dataset containing duplicate customer records.
2. Preprocess: Standardize the numeric features so that a large-valued feature such as income does not dominate the distance computation.
3. Apply K-means: Use K-means clustering to find and group similar customer records.
4. Identify Duplicates: Identify and remove duplicates within clusters.
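
To illustrate why step 2 matters, here is a minimal sketch comparing Euclidean distances on raw and standardized values. The five records and the names `records` and `dist` are hypothetical, chosen only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical records: columns are [Age, Income]
records = np.array([
    [25, 50000],   # A
    [60, 50100],   # B: very different age, almost the same income as A
    [26, 50500],   # C: almost the same age as A, slightly different income
    [40, 80000],   # D and E only widen the value ranges
    [50, 30000],
], dtype=float)

def dist(x, y):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(x - y))

# On raw values the Income column dominates, so B appears closer to A than C does
raw_ab = dist(records[0], records[1])
raw_ac = dist(records[0], records[2])

# After standardization each column has zero mean and unit variance,
# so Age and Income contribute on comparable scales
scaled = StandardScaler().fit_transform(records)
scaled_ab = dist(scaled[0], scaled[1])
scaled_ac = dist(scaled[0], scaled[2])

print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

On the raw values, B is nearer to A than C is, even though B differs by 35 years in age. After scaling the ordering reverses, which is the behavior a near-duplicate search would usually want.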

# write your code from here
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Sample Data: Simulate customer records with duplicates
data = {
    'CustomerID': [1, 2, 3, 4, 5, 6, 7],
    'Name': ['Alice', 'Alicia', 'Bob', 'Bobby', 'Carol', 'Caroline', 'Dave'],
    'Age': [25, 26, 35, 34, 45, 46, 30],
    'Income': [50000, 51000, 60000, 61000, 80000, 79000, 40000]
}

df = pd.DataFrame(data)

# For deduplication, we consider numerical features Age and Income
features = df[['Age', 'Income']]

# 2. Preprocess: Standardize the data
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# 3. Apply K-means clustering to group similar records
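# n_clusters=3 is an assumption for this toy data; n_init is set explicitly
# so the result does not depend on the scikit-learn version's default
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)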
df['Cluster'] = kmeans.fit_predict(features_scaled)

# 4. Identify duplicates within each cluster
# We consider duplicates as records within the same cluster that have very similar features
# Define a threshold for Euclidean distance within cluster to flag duplicates

from scipy.spatial.distance import cdist

duplicates_indices = set()

# Maximum Euclidean distance (in standardized units) for two records
# to be treated as duplicates; tune this threshold based on the data
threshold = 0.5

for cluster_label in df['Cluster'].unique():
    # A boolean mask selects rows by position, so this also works
    # when df does not have the default 0..n-1 index
    mask = (df['Cluster'] == cluster_label).to_numpy()
    cluster_data = df[mask]
    cluster_features = features_scaled[mask]

    # Calculate pairwise distances within the cluster
    distances = cdist(cluster_features, cluster_features, metric='euclidean')

    # For every pair closer than the threshold, keep the earlier record
    # and mark the later one as a duplicate
    for i in range(len(cluster_data)):
        for j in range(i + 1, len(cluster_data)):
            if distances[i, j] < threshold:
                duplicates_indices.add(cluster_data.index[j])

# Remove duplicates
df_deduplicated = df.drop(index=duplicates_indices)

print("Original Data:")
print(df)
print("\nDuplicates detected at indices:", duplicates_indices)
print("\nData after deduplication:")
print(df_deduplicated)



Original Data:
   CustomerID      Name  Age  Income  Cluster
0           1     Alice   25   50000        2
1           2    Alicia   26   51000        2
2           3       Bob   35   60000        0
3           4     Bobby   34   61000        0
4           5     Carol   45   80000        1
5           6  Caroline   46   79000        1
6           7      Dave   30   40000        2

Duplicates detected at indices: {1, 3, 5}

Data after deduplication:
   CustomerID   Name  Age  Income  Cluster
0           1  Alice   25   50000        2
2           3    Bob   35   60000        0
4           5  Carol   45   80000        1
6           7   Dave   30   40000        2
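
The value `n_clusters=3` used above is an assumption. One common heuristic for comparing candidate values is the silhouette score. The sketch below reuses the Age and Income values from the sample data; the candidate range for `k` and the names `X`, `scores`, and `best_k` are illustrative assumptions, and this is not a definitive selection procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Age and Income columns copied from the sample data
X = np.array([
    [25, 50000], [26, 51000], [35, 60000], [34, 61000],
    [45, 80000], [46, 79000], [30, 40000],
], dtype=float)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
# silhouette_score needs between 2 and n_samples - 1 distinct cluster labels
for k in range(2, len(X_scaled)):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# Candidate with the highest mean silhouette coefficient
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Highest silhouette score at k =", best_k)
```

Whichever `k` scores highest, the distance threshold still decides which records inside a cluster are flagged, so the two settings are best tuned together and checked against records known to be duplicates.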
