# Duplicate Data Management in Pandas

### What Are Duplicate Values?

Duplicate values in a dataset are rows that are **exact copies** of other rows, meaning every value across all columns is the same. For example, if the Titanic dataset mistakenly includes two identical entries for the same passenger, the extra row inflates row counts and can **skew statistics**, affect model training, and create **data leakage** if the copies end up on both sides of a train/test split.

Duplicates usually happen due to:

- Data entry errors
- Accidental merges or joins
- Sensor or API bugs
- Appending the same data twice

Detecting and removing duplicates is part of every serious **data cleaning pipeline**. Pandas provides `.duplicated()` to identify duplicate rows and `.drop_duplicates()` to remove them.

### Why Managing Duplicates Is Important

Duplicate data may not seem harmful at first glance, but in the context of **machine learning and analytics**, it can have serious consequences:

- **Skewed distributions**: Repeated values distort averages, medians, and standard deviations.
- **Biased models**: Training on repeated examples gives those examples extra weight, which can lead to overfitting.
- **Inaccurate insights**: Aggregations like `.mean()` or `.value_counts()` become misleading.
- **Data integrity issues**: Repeating entities (e.g., duplicate users) leads to false conclusions.

In AI/ML projects, duplicates can **inflate confidence** in evaluation metrics, introduce **data leakage** when the same record lands in both the training and test sets, and **degrade model performance**. Cleaning them early helps our models **learn patterns from true diversity** in the data, not artificial repetition.
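To make the "skewed distributions" point concrete, here is a minimal sketch on a hypothetical toy table (the `fares` frame and its values are invented for illustration and are not taken from the Titanic data). It compares the mean before and after removing repeated rows:

```python
import pandas as pd

# Hypothetical toy data: passenger 3 was accidentally appended two extra times.
fares = pd.DataFrame({
    "PassengerId": [1, 2, 3, 3, 3],
    "Fare": [10.0, 20.0, 90.0, 90.0, 90.0],
})

# The mean is pulled toward the repeated value
mean_with_dupes = fares["Fare"].mean()

# Dropping fully identical rows restores one row per passenger
mean_deduped = fares.drop_duplicates()["Fare"].mean()

print(mean_with_dupes)  # 60.0
print(mean_deduped)     # 40.0
```

Two stray copies of a single row are enough to move the average from 40 to 60 in this small example.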

### Common Methods for Detecting Duplicates

**`.duplicated()`**

Returns a boolean Series that is `True` for every row repeating an earlier row (by default, all columns are compared and the first occurrence is not flagged):

In [1]:
import pandas as pd
df = pd.read_csv("data/train.csv")

# Detect duplicated rows
duplicates = df.duplicated()
print(duplicates.sum())  # Number of rows that repeat an earlier row

0
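Because the Titanic training file contains no fully duplicated rows, the output above is `0`. The following sketch uses a small hypothetical DataFrame (`toy`, invented for illustration) to show what the flags look like when duplicates do exist, and how the `keep` argument of `.duplicated()` changes which rows are marked:

```python
import pandas as pd

# Hypothetical data: the "Ann"/"T1" row appears three times
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Ann"],
    "Ticket": ["T1", "T2", "T1", "T1"],
})

# Default keep='first': the first occurrence is NOT flagged
first_flags = toy.duplicated()

# keep=False: every member of a duplicate group is flagged
all_flags = toy.duplicated(keep=False)

print(first_flags.tolist())  # [False, False, True, True]
print(all_flags.tolist())    # [True, False, True, True]
print(first_flags.sum())     # 2
```

Note that `.duplicated().sum()` counts the extra copies (2 here), not the number of rows involved in duplication (3 here).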


**`.duplicated(subset=...)`**

Check duplicates based on specific columns only:

In [2]:
# Check for duplicates by 'Name' and 'Ticket'
print(df.duplicated(subset=['Name', 'Ticket']).sum())

0
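A subset check can find rows that a full-row check misses. The sketch below, again on a hypothetical toy frame, shows two rows that share `Name` and `Ticket` but differ in `Fare`, and uses `keep=False` to pull out every member of the group for inspection before deciding what to drop:

```python
import pandas as pd

# Hypothetical rows: same Name and Ticket, but different Fare values
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Ticket": ["T1", "T2", "T1"],
    "Fare": [7.25, 8.05, 7.50],
})

# No row is an exact copy of another across all columns
full_row_dupes = toy.duplicated().sum()

# But one row repeats an earlier (Name, Ticket) pair
subset_dupes = toy.duplicated(subset=["Name", "Ticket"]).sum()

# keep=False flags every member of the group so the copies can be compared
suspects = toy[toy.duplicated(subset=["Name", "Ticket"], keep=False)]

print(full_row_dupes)  # 0
print(subset_dupes)    # 1
print(suspects)
```

Looking at `suspects` side by side is one way to judge whether the rows are true duplicates or distinct records that happen to share key fields.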


### Removing Duplicates with `.drop_duplicates()`

Once duplicates are detected, we can remove them with `.drop_duplicates()`, which returns a new DataFrame and leaves the original unchanged unless we reassign the result:

In [3]:
# Drop all completely duplicated rows
df = df.drop_duplicates()
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


Or drop based on certain columns only:

In [4]:
# Drop based on a subset (keep first occurrence)
df = df.drop_duplicates(subset=['Name', 'Ticket'], keep='first')
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


By default, `keep='first'` keeps the first occurrence and drops others. You can also use:

- `keep='last'`: keeps the last occurrence and drops the earlier ones
- `keep=False`: drops every row that has a duplicate, including the first occurrence
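The three `keep` options listed above can be compared on a small hypothetical frame (`toy` is invented for illustration). Printing the surviving index labels shows which rows each option retains:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Ticket": ["T1", "T2", "T1"],
})

kept_first = toy.drop_duplicates(keep="first")  # keeps row 0, drops row 2
kept_last = toy.drop_duplicates(keep="last")    # keeps row 2, drops row 0
kept_none = toy.drop_duplicates(keep=False)     # drops both row 0 and row 2

print(kept_first.index.tolist())  # [0, 1]
print(kept_last.index.tolist())   # [1, 2]
print(kept_none.index.tolist())   # [1]
```

The original index labels are preserved after dropping, so gaps can appear; `reset_index(drop=True)` renumbers them if needed.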

### Duplicates and Indexes

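Sometimes duplicates appear in the **index labels** rather than in the row values, for example after combining several DataFrames. `df.index.duplicated()` flags repeated labels: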

In [5]:
print(df.index.duplicated()[:5])  # Duplicate flags for the first five index labels

[False False False False False]


To replace a duplicated index with a fresh 0-based one (this renumbers the rows but does not remove any of them):

In [6]:
df = df.reset_index(drop=True)
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
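The Titanic index has no repeated labels, so the calls above change nothing visible. As a rough sketch of how duplicated labels arise and the two usual ways to handle them, the example below concatenates two hypothetical frames (`part_a` and `part_b`, with invented values) that each carry their own 0-based index:

```python
import pandas as pd

# Hypothetical frames, each with its own default index 0..1
part_a = pd.DataFrame({"Fare": [7.25, 71.28]})
part_b = pd.DataFrame({"Fare": [7.92, 53.10]})

# Without ignore_index=True, concat keeps the original labels: 0, 1, 0, 1
combined = pd.concat([part_a, part_b])

print(combined.index.is_unique)           # False
print(combined.index.duplicated().sum())  # 2

# Option 1: renumber the rows, keeping all four of them
renumbered = combined.reset_index(drop=True)

# Option 2: keep only the first row seen for each index label
first_per_label = combined[~combined.index.duplicated(keep="first")]

print(renumbered.index.tolist())       # [0, 1, 2, 3]
print(first_per_label.index.tolist())  # [0, 1]
```

Option 1 is appropriate when the rows are distinct records that merely share labels; option 2 discards rows, so it only makes sense when a repeated label really means a repeated record.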


### Exercises

Q1. Count total duplicated rows (fully identical)

In [7]:
print(df.duplicated().sum())

0


Q2. Drop all duplicated rows

In [8]:
df = df.drop_duplicates()
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


Q3. Drop duplicates based on 'Name' and 'Ticket' (keep first)

In [9]:
df = df.drop_duplicates(subset=['Name', 'Ticket'], keep='first')
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


Q4. Drop every row that has a duplicate (no copy kept)

In [10]:
df = df.drop_duplicates(keep=False)
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


Q5. Check for duplicated index labels

In [11]:
print(df.index.duplicated().sum())

0


### Summary

Duplicate data is a **silent data quality killer**. It’s easy to overlook, but in analysis and machine learning, it can seriously mislead results and damage model accuracy. Pandas offers us fast and flexible tools to **detect**, **analyze**, and **remove** these repeated records.

We learned that `.duplicated()` marks repeated rows, while `.drop_duplicates()` removes them, comparing either all columns or only a selected subset. The `keep` parameter controls which occurrence is retained. We also covered how to detect and reset duplicate **index labels**, a common side effect of combining or appending data.

Always analyze **why** duplicates exist before removing them — sometimes repetition is intentional (e.g., transactions), and sometimes it signals data errors.

With clean, deduplicated data, we build models that are more **trustworthy**, **fair**, and **robust**.