In [1]:
import pandas as pd

### Pandas Duplicated
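Often you'll have a series with duplicate values and you'll want to know where they are.
The pandas `duplicated()` method is the tool for the job. It returns a boolean series with the same index, where True marks a value that is a duplicate.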

First, let's create a series with some data.

In [30]:
my_series = pd.Series(["Kanye West", "Drake", "Mac Miller", "Drake", "Beyonce", "Kanye West", "Drake"], name='artists')
my_series

0    Kanye West
1         Drake
2    Mac Miller
3         Drake
4       Beyonce
5    Kanye West
6         Drake
Name: artists, dtype: object

Kanye West and Drake both appear more than once in the series above (Kanye West twice, Drake three times). This is an easy example, but what if you have 100K data points? You'll need a quicker way to locate duplicates than eyeballing them.

Before you start finding duplicates, you have one decision to make: which occurrence of each repeated value should be left unflagged? The first, the last, or none of them? The `keep` parameter controls this.
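As a quick preview, here is a minimal sketch that puts the three `keep` options side by side on the same data. The column names `keep_first`, `keep_last`, and `keep_false` are illustrative labels chosen for this comparison, not anything pandas produces on its own.

```python
import pandas as pd

my_series = pd.Series(
    ["Kanye West", "Drake", "Mac Miller", "Drake", "Beyonce", "Kanye West", "Drake"],
    name="artists",
)

# One boolean column per keep option (column names are illustrative)
comparison = pd.DataFrame({
    "artists": my_series,
    "keep_first": my_series.duplicated(keep="first"),  # first occurrence stays False
    "keep_last": my_series.duplicated(keep="last"),    # last occurrence stays False
    "keep_false": my_series.duplicated(keep=False),    # every repeated value is True
})
print(comparison)
```

Values that appear only once ("Mac Miller" and "Beyonce") are False in every column; the options only differ in how they treat the repeated values.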

#### Method 1 - keep='first' (default): For when you want to mark all duplicates as True...EXCEPT for the *first* occurrence of each value.

In [31]:
series_duplicates_first = my_series.duplicated(keep='first') # Finding the duplicates
series_duplicates_first.name = 'duplicates' # Giving the series a name to view later
series_duplicates_first # View your duplicates

0    False
1    False
2    False
3     True
4    False
5     True
6     True
Name: duplicates, dtype: bool

Let's combine the two series side by side with pd.concat() (using axis=1) so they are easy to compare.

In [32]:
pd.concat([my_series, series_duplicates_first], axis=1)

      artists  duplicates
0  Kanye West       False
1       Drake       False
2  Mac Miller       False
3       Drake        True
4     Beyonce       False
5  Kanye West        True
6       Drake        True


Notice how every repeated "Kanye West" and "Drake" is marked as True (meaning it is a duplicate), except for the **first** occurrence of each!
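Because the result is a boolean series, it can be counted or used as a filter. The sketch below is one way to do that, assuming the same artist data as above; the variable names `is_repeat`, `n_repeats`, and `first_occurrences` are made up for illustration.

```python
import pandas as pd

my_series = pd.Series(
    ["Kanye West", "Drake", "Mac Miller", "Drake", "Beyonce", "Kanye West", "Drake"],
    name="artists",
)

is_repeat = my_series.duplicated()  # keep='first' is the default

# True counts as 1, so summing the mask counts the repeated rows
n_repeats = int(is_repeat.sum())

# Inverting the mask with ~ keeps only the first occurrence of each value
first_occurrences = my_series[~is_repeat]

print(n_repeats)
print(first_occurrences)
```

For this data, filtering with the inverted mask gives the same result as `my_series.drop_duplicates()`, which is usually the shorter option when you only want the de-duplicated values.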

#### Method 2 - keep='last': For when you want to mark all duplicates as True...EXCEPT for the *last* occurrence of each value.

In [33]:
series_duplicates_last = my_series.duplicated(keep='last') # Finding the duplicates
series_duplicates_last.name = 'duplicates' # Giving the series a name to view later
series_duplicates_last # View your duplicates

pd.concat([my_series, series_duplicates_last], axis=1) # View your duplicates next to your values

      artists  duplicates
0  Kanye West        True
1       Drake        True
2  Mac Miller       False
3       Drake        True
4     Beyonce       False
5  Kanye West       False
6       Drake       False


Notice how every repeated "Kanye West" and "Drake" is marked as True (meaning it is a duplicate), except for the **last** occurrence of each!

#### Method 3 - keep=False: For when you want to mark *all* occurrences of a repeated value as True.

In [29]:
series_duplicates_false = my_series.duplicated(keep=False) # Finding the duplicates
series_duplicates_false.name = 'duplicates' # Giving the series a name to view later
series_duplicates_false # View your duplicates

pd.concat([my_series, series_duplicates_false], axis=1) # View your duplicates next to your values

      artists  duplicates
0  Kanye West        True
1       Drake        True
2  Mac Miller       False
3       Drake        True
4     Beyonce       False
5  Kanye West        True
6       Drake        True


Notice how every "Kanye West" and "Drake" is now marked as True, including the first and last occurrences. Only the values that appear exactly once ("Mac Miller" and "Beyonce") are False.
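A common follow-up with keep=False is to pull out every row that belongs to a repeated value so you can inspect them together. Here is a minimal sketch, assuming the same artist data; `all_copies` and `repeated_values` are illustrative names.

```python
import pandas as pd

my_series = pd.Series(
    ["Kanye West", "Drake", "Mac Miller", "Drake", "Beyonce", "Kanye West", "Drake"],
    name="artists",
)

# Select every row whose value appears more than once
all_copies = my_series[my_series.duplicated(keep=False)]

# The distinct values that are repeated, in order of first appearance
repeated_values = all_copies.unique()

print(all_copies)
print(repeated_values)
```

The original index labels are preserved in `all_copies`, so you can still see where each duplicate sits in the original series.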