# Pandas dataframe.drop_duplicates()

The drop_duplicates() method in Pandas removes duplicate rows from a DataFrame, comparing either all columns or a chosen subset of them. By default, it compares every column, keeps the first occurrence of each row, and drops the repeats that follow. This article shows how to use drop_duplicates() through a series of examples.

Let's start with a basic example to see how drop_duplicates() works.

**Observations:**
- One duplicate row (`Alice`, 25, `NY`) was removed; 3 unique rows remain.

**Interpretation:**
- `drop_duplicates()` drops exact duplicate rows (keeps the first by default).

In [1]:
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"]
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

df_cleaned = df.drop_duplicates()

print("\nModified DataFrame (no duplicates)")
print(df_cleaned)

Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago

Modified DataFrame (no duplicates)
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago


This example shows how duplicate rows are removed while retaining the first occurrence using dataframe.drop_duplicates().
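
Before dropping anything, it can help to see which rows would be treated as duplicates. The sketch below uses the related `DataFrame.duplicated()` method, which this article does not otherwise cover, on the same data as above; the variable names `mask` and `manual` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
})

# duplicated() flags every row that repeats an earlier row (keep='first' by default)
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, False]

# Selecting the unflagged rows gives the same result as drop_duplicates()
manual = df[~mask]
print(manual.equals(df.drop_duplicates()))  # True
```

Only the second `Alice` row is flagged, which is the row drop_duplicates() removes.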

Syntax:

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Parameters:

1. subset: Specifies the columns to check for duplicates. If not provided, all columns are considered.

2. keep: Determines which duplicate to keep:

   - 'first' (default): Keeps the first occurrence and removes subsequent duplicates.
   - 'last': Keeps the last occurrence and removes earlier duplicates.
   - False: Removes every row that has a duplicate, keeping none of them.

3. inplace: If True, modifies the original DataFrame directly and returns None. If False (default), returns a new DataFrame.

4. ignore_index: If True, relabels the result's index 0, 1, ..., n-1. If False (default), the surviving rows keep their original index labels. Available in pandas 1.0 and later.

Return type: A new DataFrame with duplicates removed, or None when inplace=True.
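
The difference in return value is easy to trip over, so here is a minimal sketch of both modes. It uses a small made-up DataFrame and assumes the behavior of current pandas releases, where the in-place call returns None.

```python
import pandas as pd

# Small illustrative DataFrame with one repeated row
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"], "Age": [25, 30, 25]})

# Default (inplace=False): a new DataFrame comes back and df is untouched
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 3 2

# inplace=True: df itself is changed and the call returns None
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2
```

Because of this, writing `df = df.drop_duplicates(inplace=True)` would replace `df` with None.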

Examples
Below are some examples of the dataframe.drop_duplicates() method:

1. Dropping Duplicates Based on Specific Columns
We can target duplicates in specific columns using the subset parameter. This is useful when some columns are more relevant for identifying duplicates.

**Observations:**
- With `subset=["Name"]`, rows are compared on `Name` only. The two `Alice` rows differ in `City` (NY vs. SF) but still count as duplicates; the first is kept and the second is dropped. Output rows: Alice–NY, Bob–LA, David–Chicago.

**Interpretation:**
- `drop_duplicates(subset=["Name"])` keeps the first occurrence per name and drops the rest.

In [2]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'SF', 'Chicago']
})

df_cleaned = df.drop_duplicates(subset=["Name"])

print(df_cleaned)

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
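
The subset and keep parameters can be combined. As a small sketch on the same data, the call below still compares only the Name column but keeps the last row for each name; `last_per_name` is just an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "SF", "Chicago"],
})

# Compare only the Name column, but keep the last row seen for each name
last_per_name = df.drop_duplicates(subset=["Name"], keep="last")
print(last_per_name)
```

This time the Alice–SF row (index 2) survives and the Alice–NY row (index 0) is dropped.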


2. Keeping the Last Occurrence of Duplicates

By default, drop_duplicates() retains the first occurrence of each duplicate. To keep the last occurrence instead, we can pass keep='last'.

**Observations:**
- Duplicate row for `Alice` was dropped; the last occurrence was kept. Output rows: Bob–LA, Alice–NY, David–Chicago.

**Interpretation:**
- `drop_duplicates(keep='last')` removes duplicates and keeps the last occurrence of each duplicate group.

In [3]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

df_cleaned = df.drop_duplicates(keep='last')
print(df_cleaned)

    Name  Age     City
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago


3. Dropping All Duplicates

If we want to remove every row that has a duplicate, including its first occurrence, we can set keep=False.

**Observations:**
- All duplicate rows are removed; only unique rows remain (Bob–LA, David–Chicago).
- Both occurrences of `Alice` were dropped since they were duplicates.

**Interpretation:**
- `drop_duplicates(keep=False)` removes all rows that have duplicates, keeping only completely unique rows.

In [4]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})
df_cleaned = df.drop_duplicates(keep=False)
print(df_cleaned)

    Name  Age     City
1    Bob   30       LA
3  David   40  Chicago
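
With keep=False the dropped rows disappear entirely, so it may be worth looking at them first. The sketch below uses `DataFrame.duplicated(keep=False)`, which is not covered elsewhere in this article, to select the rows that the call above discards; `dupes` and `unique_only` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
})

# keep=False in duplicated() flags every member of a duplicate group
dupes = df[df.duplicated(keep=False)]
print(dupes)  # the two Alice rows (index 0 and 2)

# drop_duplicates(keep=False) returns exactly the rows that were not flagged
unique_only = df.drop_duplicates(keep=False)
print(len(dupes) + len(unique_only) == len(df))  # True
```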


4. Modifying the Original DataFrame Directly

If we want to modify the DataFrame in place instead of creating a new one, we can set inplace=True.

**Observations:**
- Duplicate row for `Alice` was removed; 3 unique rows remain.
- The original DataFrame was modified directly without creating a new variable.

**Interpretation:**
- `drop_duplicates(inplace=True)` modifies the DataFrame in place and returns `None` instead of a new DataFrame.

In [5]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})
df.drop_duplicates(inplace=True)
print(df)

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
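
Note that the surviving rows keep their original index labels (0, 1, 3), which leaves a gap. If a clean 0..n-1 index is wanted, one option is the `ignore_index=True` argument, available in pandas 1.0 and later; the sketch below compares it with the two-step `reset_index(drop=True)` form. `compact` and `same` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
})

# ignore_index=True relabels the surviving rows 0..n-1
compact = df.drop_duplicates(ignore_index=True)
print(compact.index.tolist())  # [0, 1, 2]

# Equivalent two-step form
same = df.drop_duplicates().reset_index(drop=True)
print(compact.equals(same))  # True
```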


5. Dropping Duplicates Based on Partially Identical Columns

Sometimes rows are not exact duplicates but still share identical values in certain columns. For example, after merging datasets, we may want to drop rows that have the same values in a subset of columns.

**Observations:**
- Duplicates based on the `Name` and `City` columns are removed; 3 unique rows remain.
- The first `Alice–NY` and `Bob–LA` rows are kept; the later repeats (index 2 and 4) are dropped.

**Interpretation:**
- `drop_duplicates(subset=["Name", "City"])` checks only specified columns for duplicates and removes rows with identical values in those columns.

In [6]:
data = {
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
    "City": ["NY", "LA", "NY", "Chicago", "LA"]
}

df = pd.DataFrame(data)

df_cleaned = df.drop_duplicates(subset=["Name", "City"])

print(df_cleaned)

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
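
When rows match on the subset columns but differ elsewhere, which row survives depends on row order. One common pattern, sketched here with hypothetical data in which the Age values disagree, is to sort first so that the preferred row comes first in each group. Preferring the highest Age is only an assumption for illustration.

```python
import pandas as pd

# Hypothetical data: same person and city, but the Age value differs between rows
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "Age": [25, 30, 26, 29],
    "City": ["NY", "LA", "NY", "LA"],
})

# Sort so the preferred row (here: highest Age) comes first in each group,
# then keep the first row per (Name, City) and restore the original row order
preferred = (
    df.sort_values("Age", ascending=False)
      .drop_duplicates(subset=["Name", "City"])
      .sort_index()
)
print(preferred)
```

The result keeps Bob with Age 30 (index 1) and Alice with Age 26 (index 2).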
