### Removing duplicate values

To demonstrate the techniques, we will use the following artificial DataFrame, which contains several duplicated entries:

In [2]:
import pandas as pd

df = pd.DataFrame(
    {"color": ["blue", "blue", "red", "red", "blue"], "value": [2, 1, 3, 3, 2]}
)
df

|   | color | value |
|---|-------|-------|
| 0 | blue  | 2     |
| 1 | blue  | 1     |
| 2 | red   | 3     |
| 3 | red   | 3     |
| 4 | blue  | 2     |


The simplest case of duplicate values is when entire rows are duplicated.

The DataFrame method duplicated() detects such cases. It returns a Boolean Series with one element per row: True if the row repeats an earlier row (meaning that it is not its first appearance), and False otherwise. Let’s test this out:

In [4]:
# index labels are ignored when searching for duplicates; only column values are compared
df.duplicated()

0    False
1    False
2    False
3     True
4     True
dtype: bool

By default, duplicated() uses keep="first", which marks every occurrence of a row except the first one. Passing keep="last" instead marks every occurrence except the last one:

In [5]:
df.duplicated(keep="last")

0     True
1    False
2     True
3    False
4    False
dtype: bool

A third and final option, keep=False, marks all occurrences of a duplicated row as True:

In [6]:
df.duplicated(keep=False)

0     True
1    False
2     True
3     True
4     True
dtype: bool
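
To compare the three keep options at a glance, the Boolean Series can be placed side by side. The following is a minimal sketch that rebuilds the example DataFrame; the column names "first", "last" and "all" and the variable name comparison are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {"color": ["blue", "blue", "red", "red", "blue"], "value": [2, 1, 3, 3, 2]}
)

# One Boolean column per keep option, aligned on the original index
comparison = pd.DataFrame(
    {
        "first": df.duplicated(keep="first"),  # default: flag all but the first occurrence
        "last": df.duplicated(keep="last"),    # flag all but the last occurrence
        "all": df.duplicated(keep=False),      # flag every occurrence of a repeated row
    }
)
print(comparison)

# Number of rows flagged under each option
print(comparison.sum())
```

Row 1 is False in every column because (blue, 1) appears only once in the data.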

In [7]:
# With the default option, we can select the rows flagged True as follows
df.loc[df.duplicated(), :]

|   | color | value |
|---|-------|-------|
| 3 | red   | 3     |
| 4 | blue  | 2     |


In [8]:
df.drop_duplicates()

|   | color | value |
|---|-------|-------|
| 0 | blue  | 2     |
| 1 | blue  | 1     |
| 2 | red   | 3     |


As the output shows, drop_duplicates() returns a new DataFrame with the duplicated rows removed; df itself is left unchanged.

We can also pass the keep parameter to drop_duplicates(), for example:

In [9]:
df.drop_duplicates(keep="last")

|   | color | value |
|---|-------|-------|
| 1 | blue  | 1     |
| 3 | red   | 3     |
| 4 | blue  | 2     |
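
The relationship between duplicated() and drop_duplicates() can be checked directly. The sketch below rebuilds the example DataFrame and assumes pandas 1.0 or later for the ignore_index parameter; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"color": ["blue", "blue", "red", "red", "blue"], "value": [2, 1, 3, 3, 2]}
)

# Dropping duplicates is equivalent to keeping the rows NOT flagged by duplicated()
deduped = df.drop_duplicates()
masked = df.loc[~df.duplicated(), :]
same_result = deduped.equals(masked)
print(same_result)

# The original DataFrame still has all of its rows
original_length = len(df)
print(original_length)

# ignore_index=True relabels the surviving rows 0, 1, 2, ...
# instead of keeping their original index labels
reindexed = df.drop_duplicates(keep="last", ignore_index=True)
print(reindexed)
```

Resetting the index is a matter of preference: keeping the original labels makes it easy to trace which rows survived, while a fresh index is often more convenient for later positional work.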


### Duplicates from a particular column
Sometimes we want to drop duplicates based on a specific column only. We can do this by passing a list of column labels as the first argument (the subset parameter) of drop_duplicates(). Let’s try this out:

In [11]:
df.drop_duplicates(["value"])

|   | color | value |
|---|-------|-------|
| 0 | blue  | 2     |
| 1 | blue  | 1     |
| 2 | red   | 3     |


In [12]:
# But be careful: dropping duplicates based on 'color' keeps only the first row of each color,
# so row 1 (blue, 1) is discarded even though its value appears nowhere else
df.drop_duplicates(["color"])

|   | color | value |
|---|-------|-------|
| 0 | blue  | 2     |
| 2 | red   | 3     |
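
Before dropping on a single column, it can help to look at the rows that would be discarded. The sketch below rebuilds the example DataFrame and uses the subset parameter of duplicated(), which accepts the same column labels as drop_duplicates(); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"color": ["blue", "blue", "red", "red", "blue"], "value": [2, 1, 3, 3, 2]}
)

# Rows that drop_duplicates(["color"]) would discard
lost = df.loc[df.duplicated(subset=["color"]), :]
print(lost)

# Listing every column in subset gives the same result as the default,
# which compares entire rows
by_both = df.drop_duplicates(subset=["color", "value"])
matches_default = by_both.equals(df.drop_duplicates())
print(matches_default)
```

Among the discarded rows, (blue, 1) is the only one whose combination of values does not survive anywhere in the result, which is the information lost by deduplicating on color alone.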
