# Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [1]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

In [4]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})

data

    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4


The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (that is, whether it repeats an earlier row) or not:

In [8]:
data.duplicated()

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame containing only the rows where the duplicated array is False:

In [6]:
data.drop_duplicates()

    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
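
The relationship between the two methods can be checked directly. The following is a minimal sketch, not part of the original session; the names `mask`, `via_mask`, and `via_method` are introduced here purely for illustration. Filtering with the negated duplicated mask should select the same rows as drop_duplicates:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Boolean mask: True for rows that repeat an earlier row
mask = data.duplicated()

# Keep only the rows where the mask is False
via_mask = data[~mask]
via_method = data.drop_duplicates()

# Compare the two results and count the flagged rows
print(via_mask.equals(via_method))
print(mask.sum())
```

Summing the mask is a quick way to see how many rows would be dropped before actually dropping them.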


By default, both of these methods consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates based only on the 'k1' column:

In [10]:
data['v1'] = range(7)

data

    k1  k2  v1
0  one   1   0
1  one   1   1
2  one   2   2
3  two   3   3
4  two   3   4
5  two   4   5
6  two   4   6


In [11]:
data.drop_duplicates(['k1'])

    k1  k2  v1
0  one   1   0
3  two   3   3
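
The same column restriction works for duplicated. As a small sketch (the variable names `flags` and `deduped` are illustrative, and the column list is passed through the `subset` keyword rather than positionally), the rows left unflagged are the ones drop_duplicates keeps:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# Flag rows whose 'k1' value has already appeared in an earlier row
flags = data.duplicated(subset=['k1'])
print(flags.tolist())

# The unflagged rows match drop_duplicates on the same subset
deduped = data.drop_duplicates(subset=['k1'])
print(deduped.index.tolist())
```

Note that the 'k2' and 'v1' values in the result are simply whatever the first row of each 'k1' group happened to contain.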


By default, duplicated and drop_duplicates keep the first observed value combination. Passing keep='last' keeps the last one instead:

In [16]:
data.drop_duplicates(['k1', 'k2'], keep='last')

    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6
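
To make the difference between the two settings concrete, here is a brief side-by-side sketch (the names `first_kept` and `last_kept` are illustrative only). It compares which row labels survive under each value of keep:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# Default behaviour: keep the first row of each (k1, k2) combination
first_kept = data.drop_duplicates(['k1', 'k2'], keep='first')

# keep='last': keep the final row of each combination instead
last_kept = data.drop_duplicates(['k1', 'k2'], keep='last')

print(first_kept.index.tolist())
print(last_kept.index.tolist())
```

The choice matters whenever columns outside the subset (here 'v1') differ between the duplicated rows, since those values come from whichever row is kept.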
