# Checking duplicates in Pandas

In [16]:
# Importing libraries
import pandas as pd

## 1. Know how many duplicates exist

In [None]:
data = {
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
    "City": ["NY", "LA", "NY", "Chicago", "LA"]
}

dataFrame = pd.DataFrame(data)

# duplicated() marks each row that repeats an earlier row; sum() counts the True values
numberDuplicates = dataFrame.duplicated().sum()
print(numberDuplicates)

2


## 2. See which ones are the duplicates

In [23]:
print(dataFrame.duplicated())
dataFrame[dataFrame.duplicated()]


0    False
1    False
2     True
3    False
4     True
dtype: bool


    Name  Age  City
2  Alice   25    NY
4    Bob   30    LA


## 3. Count duplicates considering specific columns

In [25]:
duplicatesColumnName = dataFrame.duplicated(subset=["Name"]).sum()
print(duplicatesColumnName)

2
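In the data above, rows that share a Name also share Age and City, so the effect of `subset` is hard to see. The sketch below uses a small hypothetical DataFrame (`people`, not part of the original data) in which "Alice" appears twice with different cities, to show how the count can change with the columns being compared.

```python
import pandas as pd

# Hypothetical data: "Alice" appears twice, but in different cities.
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "Boston", "Chicago"],
})

# No row matches another row in every column.
fullRowDuplicates = people.duplicated().sum()

# Comparing only "Name" flags the second "Alice".
nameDuplicates = people.duplicated(subset=["Name"]).sum()

# Comparing "Name" and "City" together finds no repeated pair.
nameCityDuplicates = people.duplicated(subset=["Name", "City"]).sum()

print(fullRowDuplicates, nameDuplicates, nameCityDuplicates)
```

The narrower the subset, the more rows tend to count as duplicates, so it is worth choosing the columns that actually identify a record.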


## 4. Count all repeated occurrences (including the first one)

In [27]:
countAllDuplicates = dataFrame.duplicated(keep=False).sum()
print(countAllDuplicates)

4
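The same `keep=False` mask can also be used to look at the rows themselves rather than just count them. This is a minimal sketch that rebuilds the example data; the name `allCopies` is chosen here for illustration.

```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
    "City": ["NY", "LA", "NY", "Chicago", "LA"],
}
dataFrame = pd.DataFrame(data)

# keep=False flags every member of a duplicate group, including the first one.
allCopies = dataFrame[dataFrame.duplicated(keep=False)]

# Sorting by all columns places identical rows next to each other.
allCopies = allCopies.sort_values(by=list(dataFrame.columns))
print(allCopies)
```

Viewing the copies side by side makes it easier to decide which one to keep before removing anything.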


## 5. See how many times each value appears

In [28]:
dataFrame["Name"].value_counts()

Name
Alice    2
Bob      2
David    1
Name: count, dtype: int64
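Counting a single column does not say whether whole rows repeat. One possible way to count identical rows is `DataFrame.value_counts()`, which tallies each distinct combination of column values. The sketch below rebuilds the example data; `rowCounts` and `repeatedRows` are names introduced here.

```python
import pandas as pd

dataFrame = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
    "City": ["NY", "LA", "NY", "Chicago", "LA"],
})

# Called on the DataFrame, value_counts() counts identical rows across all columns.
rowCounts = dataFrame.value_counts()
print(rowCounts)

# Keep only the combinations that occur more than once.
repeatedRows = rowCounts[rowCounts > 1]
print(repeatedRows)
```

The result is a Series indexed by the (Name, Age, City) combinations, so a single count can be looked up with a tuple, for example `rowCounts.loc[("Alice", 25, "NY")]`.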

# Removing Duplicates using drop_duplicates()

The drop_duplicates() method in Pandas removes duplicate rows from a DataFrame, comparing either all columns or a chosen subset of them. By default, it compares entire rows, keeps the first occurrence of each, and removes any later copies. Below, we look at its syntax and work through several examples.

## Syntax

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Parameters:

1. subset: Column label or list of labels to check for duplicates. If not provided, all columns are considered.

2. keep: Determines which duplicate to keep:

   - 'first' (default): Keeps the first occurrence and removes subsequent duplicates.
   - 'last': Keeps the last occurrence and removes earlier duplicates.
   - False: Removes all occurrences of duplicated rows.

3. inplace: If True, modifies the original DataFrame directly. If False (default), returns a new DataFrame.

Return type: A new DataFrame with duplicates removed, or None when inplace=True.
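The return behaviour described above can be checked with a short sketch. It uses a small throwaway DataFrame (`df`, introduced here) and also shows `ignore_index`, a further keyword of drop_duplicates() in pandas 1.0 and later that renumbers the result from 0.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"], "Age": [25, 30, 25]})

# keep="last" drops index 0 and keeps the original labels 1 and 2.
lastKept = df.drop_duplicates(keep="last")
print(list(lastKept.index))

# ignore_index=True relabels the surviving rows 0, 1, ...
renumbered = df.drop_duplicates(keep="last", ignore_index=True)
print(list(renumbered.index))

# With inplace=True the method returns None and df itself shrinks.
result = df.drop_duplicates(inplace=True)
print(result, len(df))
```

Because the in-place call returns None, writing `df = df.drop_duplicates(inplace=True)` would replace the DataFrame with None, which is a common mistake.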

## Examples

### 1. Basic Example

In [None]:
data = {
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"]
}

dataFrame = pd.DataFrame(data)
print(dataFrame)

# Row 2 repeats row 0 in every column, so it is dropped
cleanDataFrame = dataFrame.drop_duplicates()
print("\nModified DataFrame (No duplicates):")
cleanDataFrame

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago

Modified DataFrame (No duplicates):

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago


### 2. Dropping Duplicates Based on Specific Columns

We can target duplicates in specific columns using the subset parameter. This is useful when some columns are more relevant for identifying duplicates.

In [5]:
cleanDataFrame2 = dataFrame.drop_duplicates(subset=["Name"])
cleanDataFrame2

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
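When the rows that share a `subset` value differ in other columns, it matters which one survives. The sketch below uses hypothetical data (`records`, with two "Alice" rows that differ in Age and City) and shows one common pattern: sort first, then drop, so that the preferred row is the first one seen.

```python
import pandas as pd

# Hypothetical data: two "Alice" rows that differ in Age and City.
records = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [25, 30, 26],
    "City": ["NY", "LA", "Boston"],
})

# Only "Name" is compared, so the first "Alice" row (index 0) is kept.
firstPerName = records.drop_duplicates(subset=["Name"])
print(firstPerName)

# Sorting by Age (descending) first makes the highest Age per Name survive.
oldestPerName = (
    records.sort_values("Age", ascending=False)
    .drop_duplicates(subset=["Name"])
    .sort_index()
)
print(oldestPerName)
```

The final `sort_index()` only restores the original row order; it can be left out if the order does not matter.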


### 3. Keeping the Last Occurrence of Duplicates

By default, **drop_duplicates()** retains the first occurrence of each duplicated row. To keep the last occurrence instead, we can use keep='last'.

In [6]:
cleanDataFrame3 = dataFrame.drop_duplicates(keep="last")
cleanDataFrame3

    Name  Age     City
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago


### 4. Dropping All Duplicates

If we want to remove all rows that are duplicates, we can set keep=False.

In [11]:
cleanDataFrame4 = dataFrame.drop_duplicates(keep=False)
cleanDataFrame4

# With keep=False, both occurrences of Alice are removed, leaving only the rows that are unique across all columns.

    Name  Age     City
1    Bob   30       LA
3  David   40  Chicago
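Dropping with `keep=False` should give the same rows as filtering with the negated `duplicated(keep=False)` mask. The sketch below rebuilds the four-row example data and compares the two; `dropped` and `masked` are names introduced here.

```python
import pandas as pd

dataFrame = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
})

dropped = dataFrame.drop_duplicates(keep=False)

# duplicated(keep=False) marks every row in a duplicate group;
# negating the mask keeps only the rows that occur exactly once.
masked = dataFrame[~dataFrame.duplicated(keep=False)]

print(dropped.equals(masked))
```

The mask form is handy when the same condition is also needed elsewhere, for example to save the removed rows to a separate DataFrame.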


### 5. Modifying the Original DataFrame Directly

If we want to modify the DataFrame in place instead of creating a new one, we can set inplace=True.

In [None]:
dataFrame2 = dataFrame.copy()  # work on a copy so dataFrame stays unchanged
print(dataFrame2)
dataFrame2.drop_duplicates(inplace=True)
dataFrame2

# With inplace=True, dataFrame2 itself is modified and the method returns None, so there is no result to assign to a new variable.

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago


    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago


### 6. Dropping Duplicates Based on Partially Identical Columns

Sometimes duplicates are not exact copies of a row but share identical values in certain columns. For example, after merging datasets, we may want to drop rows that have the same values in a subset of columns.

In [None]:
data = {
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
    "City": ["NY", "LA", "NY", "Chicago", "LA"]
}

dataFrame3 = pd.DataFrame(data)
print(dataFrame3)
cleanDataFrame5 = dataFrame3.drop_duplicates(subset=["Name", "City"])
cleanDataFrame5

# Here, duplicates are removed based on the Name and City columns, leaving only unique combinations of Name and City.

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago
4    Bob   30       LA


    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
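The merging scenario mentioned above can be sketched as follows. The two source DataFrames (`oldRecords`, `newRecords`) are hypothetical, and `pd.concat` stands in for whatever step combines the datasets. The "Alice" rows differ in Age, so a full-row comparison would not catch them.

```python
import pandas as pd

# Hypothetical sources that overlap on one person, with a changed Age.
oldRecords = pd.DataFrame({"Name": ["Alice", "Bob"], "City": ["NY", "LA"], "Age": [25, 30]})
newRecords = pd.DataFrame({"Name": ["Alice", "David"], "City": ["NY", "Chicago"], "Age": [26, 40]})

combined = pd.concat([oldRecords, newRecords], ignore_index=True)

# Comparing whole rows finds nothing, because Age differs for Alice.
exactDuplicates = combined.duplicated().sum()
print(exactDuplicates)

# Comparing only Name and City treats both Alice rows as the same record;
# keep="last" retains the row that came from newRecords.
merged = combined.drop_duplicates(subset=["Name", "City"], keep="last")
print(merged)
```

Whether to keep the first or the last copy depends on which source should win. Here the later source is assumed to be the more current one.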
