# Data Deduplication using Clustering
**Objective**: Learn and implement data deduplication techniques.

**Task**: Deduplication Using K-means Clustering

**Steps**:
1. Data Set: Download a dataset containing duplicate customer records.
2. Preprocess: Standardize the data to ensure better clustering.
3. Apply K-means: Use K-means clustering to find and group similar customer records.
4. Identify Duplicates: Identify and remove duplicates within clusters.
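
As a rough illustration of step 2, text fields can be standardized before vectorizing so that trivially different spellings collapse to the same tokens. The sketch below is one possible approach, not part of the original exercise: the `standardize` helper and the `ABBREVIATIONS` map are hypothetical names, and the map would need extending for a real dataset.

```python
import re

# Hypothetical abbreviation map (an assumption); extend it for the dataset at hand.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def standardize(text):
    """Lowercase, replace punctuation with spaces, and expand known abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


print(standardize("123 Elm St."))     # 123 elm street
print(standardize("123 Elm Street"))  # 123 elm street
```

With this kind of cleanup, "123 Elm St" and "123 Elm Street" produce identical tokens, which should make near-duplicate records easier for a word-level vectorizer to match.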

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
import numpy as np

# 1. Sample customer dataset with duplicates
data = {
    "customer_id": [1, 2, 3, 4, 5],
    "name": [
        "John Smith",
        "Jon Smith",
        "Jane Doe",
        "Jane D.",
        "Jake Johnson"
    ],
    "address": [
        "123 Elm St",
        "123 Elm Street",
        "456 Oak St",
        "456 Oak Street",
        "789 Pine Rd"
    ]
}
df = pd.DataFrame(data)

# 2. Preprocess - combine name and address
df["text"] = df["name"] + " " + df["address"]

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"])

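# L2-normalize each row. TfidfVectorizer already applies L2 normalization by
# default (norm="l2"), so this step is redundant here and kept only for explicitness.
X_normalized = normalize(X)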

# 3. Apply K-means clustering (k is set by hand; 3 matches the number of
#    distinct customers in this sample)
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
df["cluster"] = kmeans.fit_predict(X_normalized)

# 4. Group records by cluster. Records that share a cluster are candidate
#    duplicates to review, not confirmed ones; this cell does not remove any rows.
candidates = df.sort_values("cluster")[["customer_id", "name", "address", "cluster"]]
print(candidates)

   customer_id          name         address  cluster
1            2     Jon Smith  123 Elm Street        0
2            3      Jane Doe      456 Oak St        1
3            4       Jane D.  456 Oak Street        1
0            1    John Smith      123 Elm St        2
4            5  Jake Johnson     789 Pine Rd        2
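
In the output above, "John Smith" and "Jon Smith" fall into different clusters, while "John Smith" shares a cluster with "Jake Johnson", so cluster membership alone is not a reliable duplicate signal. Step 4 (identify and remove duplicates within clusters) could be sketched with a pairwise similarity check inside each cluster. The `flag_duplicates` helper and its `threshold` value are hypothetical, and the toy vectors below are made up for illustration; they are not the notebook's TF-IDF matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def flag_duplicates(features, labels, threshold=0.5):
    """Mark rows that closely match an earlier kept row in the same cluster.

    `threshold` is a hypothetical cosine-similarity cutoff; tune it per dataset.
    """
    labels = np.asarray(labels)
    is_dup = np.zeros(len(labels), dtype=bool)
    for cluster in np.unique(labels):
        idx = np.flatnonzero(labels == cluster)   # row positions in this cluster
        sims = cosine_similarity(features[idx])   # pairwise similarity within the cluster
        kept = []                                 # positions (within idx) kept so far
        for pos in range(len(idx)):
            if kept and sims[pos, kept].max() >= threshold:
                is_dup[idx[pos]] = True           # too close to an earlier kept row
            else:
                kept.append(pos)
    return is_dup


# Toy feature vectors (assumed data, for illustration only)
toy_features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
toy_flags = flag_duplicates(toy_features, [0, 0, 0], threshold=0.9)
print(toy_flags.tolist())  # [False, True, False]
```

Applied to the notebook's variables, something like `df[~flag_duplicates(X_normalized, df["cluster"])]` would keep the first record of each near-identical group. Pairs split across clusters, such as customers 1 and 2 above, would still be missed, so the cluster count and the features matter as much as the threshold.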
