# Dealing With Duplicates

First we'll make a dataframe that has duplicated values:

In [3]:
import pandas as pd
import numpy as np
np.random.seed(13)

df = pd.DataFrame({'a': [1, 2, 1, 4, 5, 6], 'b': np.random.randn(6)})
df

,a,b
0,1,-0.712391
1,2,0.753766
2,1,-0.044503
3,4,0.451812
4,5,1.345102
5,6,0.532338


Notice that the value `1` appears twice in the `a` column (rows 0 and 2).

We can use the `.duplicated` method to find where values are duplicated.

`.duplicated` compares entire rows when called on a dataframe, and compares individual values when called on a series (a single column). Let's see an example:

In [9]:
df.a.duplicated()

0    False
1    False
2     True
3    False
4    False
5    False
Name: a, dtype: bool

This gives us a boolean series where each `True` or `False` indicates whether that value is a duplicate. By default, the first occurrence of a value is not flagged; only the later repeats are marked `True`. To flag every occurrence of a duplicated value, we can set the `keep` argument to `False`:

In [10]:
df.a.duplicated(keep=False)

0     True
1    False
2     True
3    False
4    False
5    False
Name: a, dtype: bool

We can then use this boolean series to index into our original dataframe and look at the rows with duplicate values.

In [11]:
df[df.a.duplicated(keep=False)]

,a,b
0,1,-0.712391
2,1,-0.044503
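
To make the effect of the `keep` argument more concrete, here is a minimal sketch that compares its three settings side by side on the same values as the `a` column. The variable names (`s`, `flag_later`, `flag_earlier`, `flag_all`) are only illustrative.

```python
import pandas as pd

# Same values as the `a` column above
s = pd.Series([1, 2, 1, 4, 5, 6], name='a')

# keep='first' (the default): the first occurrence is treated as the original,
# so only later repeats are flagged
flag_later = s.duplicated(keep='first')

# keep='last': the last occurrence is treated as the original,
# so only earlier repeats are flagged
flag_earlier = s.duplicated(keep='last')

# keep=False: every occurrence of a repeated value is flagged
flag_all = s.duplicated(keep=False)

# Line the three results up next to the values for comparison
comparison = pd.DataFrame({
    'a': s,
    'keep_first': flag_later,
    'keep_last': flag_earlier,
    'keep_False': flag_all,
})
print(comparison)
```

With `keep='first'` only row 2 is flagged, with `keep='last'` only row 0 is flagged, and with `keep=False` both are.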


We can remove duplicated values with the `drop_duplicates` dataframe method.

In [13]:
df.drop_duplicates(subset=['a'])

,a,b
0,1,-0.712391
1,2,0.753766
3,4,0.451812
4,5,1.345102
5,6,0.532338


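Here `subset=['a']` specifies that only the `a` column is considered when deciding whether two rows are duplicates. Without it, rows count as duplicates only if they match in every column.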
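
The `subset` argument matters because, when it is omitted, rows are compared across all of their columns. The sketch below uses a small made-up dataframe (not the one above, and with string values in `b` chosen purely for illustration) to show the difference.

```python
import pandas as pd

# Hypothetical data: rows 0 and 3 are identical; row 2 repeats only the `a` value
df = pd.DataFrame({'a': [1, 2, 1, 1], 'b': ['x', 'y', 'z', 'x']})

# No subset: a row is a duplicate only if every column matches an earlier row
whole_row_dupes = df.duplicated()

# subset=['a']: a row is a duplicate if its `a` value appeared earlier
a_dupes = df.duplicated(subset=['a'])

# drop_duplicates follows the same rules
dedup_whole_row = df.drop_duplicates()
dedup_on_a = df.drop_duplicates(subset=['a'])

print(whole_row_dupes.tolist())
print(a_dupes.tolist())
print(dedup_whole_row)
print(dedup_on_a)
```

Comparing whole rows removes only row 3, while comparing on `a` alone removes rows 2 and 3.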

By default (`keep='first'`), the first occurrence of a duplicated value is kept and later occurrences are dropped. We can change this behavior with the `keep` argument:

In [14]:
df.drop_duplicates(subset=['a'], keep='last')

,a,b
1,2,0.753766
2,1,-0.044503
3,4,0.451812
4,5,1.345102
5,6,0.532338


Now the first occurrence (row 0) was dropped and the last one (row 2) was kept.
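
Two related points are worth a short sketch: `drop_duplicates` also accepts `keep=False`, which drops every row whose value is repeated, and it returns a new dataframe rather than modifying the original. The example uses made-up integers in `b` instead of the random values above so that the result is easy to check, and the `reset_index` step is just one optional way to renumber the rows afterwards.

```python
import pandas as pd

# Same `a` column as above; `b` holds illustrative values
df = pd.DataFrame({'a': [1, 2, 1, 4, 5, 6], 'b': [10, 20, 30, 40, 50, 60]})

# keep=False drops *all* rows whose `a` value occurs more than once
unique_only = df.drop_duplicates(subset=['a'], keep=False)

# The surviving rows keep their original index labels (1, 3, 4, 5);
# reset_index(drop=True) renumbers them from 0
renumbered = unique_only.reset_index(drop=True)

print(unique_only)
print(renumbered)

# The original dataframe is unchanged
print(len(df))
```

To keep the result, assign it to a name, as done here with `unique_only`.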