# Remove Duplicate Records

**DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)**

> Return DataFrame with duplicate rows removed.

> Considering certain columns is optional. Indexes, including time indexes, are ignored.

**Parameters**

- **subset** : column label or sequence of labels, optional

  Only consider certain columns for identifying duplicates; by default, use all of the columns.

- **keep** : {'first', 'last', False}, default 'first'

  Determines which duplicates (if any) to keep.
  - 'first' : Drop duplicates except for the first occurrence.
  - 'last' : Drop duplicates except for the last occurrence.
  - False : Drop all duplicates.

- **inplace** : bool, default False

  Whether to drop duplicates in place or to return a copy.

- **ignore_index** : bool, default False

  If True, the resulting axis will be labeled 0, 1, …, n - 1.

  New in version 1.0.0.

**Returns**

DataFrame or None: DataFrame with duplicates removed, or None if inplace=True.
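
The three `keep` options are easiest to compare side by side. The following is a minimal sketch; the `demo` frame is hypothetical toy data chosen for illustration and is not part of the original example.

```python
import pandas as pd

# Hypothetical toy data: the value 1 appears twice and 3 appears three times.
demo = pd.DataFrame({"x": [1, 1, 2, 3, 3, 3]})

kept_first = demo.drop_duplicates(keep="first")  # keep the first row of each group
kept_last = demo.drop_duplicates(keep="last")    # keep the last row of each group
kept_none = demo.drop_duplicates(keep=False)     # drop every row that has a duplicate

print(kept_first.index.tolist())  # [0, 2, 3]
print(kept_last.index.tolist())   # [1, 2, 5]
print(kept_none.index.tolist())   # [2]
```

Note that `keep=False` removes every member of a duplicated group, so only rows that were unique to begin with survive.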



In [1]:
import pandas as pd


Before you can remove duplicates, you need a pandas DataFrame that holds the data.


### Create Pandas DataFrame

Create the DataFrame with the following code:

In [2]:
boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = pd.DataFrame(boxes, columns = ['Color', 'Shape'])

df

|   | Color | Shape     |
|---|-------|-----------|
| 0 | Green | Rectangle |
| 1 | Green | Rectangle |
| 2 | Green | Square    |
| 3 | Blue  | Rectangle |
| 4 | Blue  | Square    |
| 5 | Red   | Square    |
| 6 | Red   | Square    |
| 7 | Red   | Rectangle |


In [7]:
df.drop_duplicates(inplace=True)

In [8]:
df

|   | Color | Shape     |
|---|-------|-----------|
| 0 | Green | Rectangle |
| 2 | Green | Square    |
| 3 | Blue  | Rectangle |
| 4 | Blue  | Square    |
| 5 | Red   | Square    |
| 7 | Red   | Rectangle |
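
Because `inplace=True` modifies the DataFrame and returns None, assigning the result of such a call to a variable is a common slip. The sketch below contrasts the two calling styles on the same data; the names `boxes_df`, `deduped`, and `result` are illustrative only.

```python
import pandas as pd

boxes = {
    "Color": ["Green", "Green", "Green", "Blue", "Blue", "Red", "Red", "Red"],
    "Shape": ["Rectangle", "Rectangle", "Square", "Rectangle",
              "Square", "Square", "Square", "Rectangle"],
}
boxes_df = pd.DataFrame(boxes)

# Reassignment style: the original is untouched and a new DataFrame is returned.
deduped = boxes_df.drop_duplicates()
print(len(boxes_df), len(deduped))  # 8 6

# In-place style: boxes_df itself is modified and the call returns None.
result = boxes_df.drop_duplicates(inplace=True)
print(result)         # None
print(len(boxes_df))  # 6
```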


### Remove Duplicate Records with Respect to Certain Columns/Attributes

Suppose you want to remove rows that are duplicated across the Color and Shape columns only.

To illustrate, we add one more feature, Weight, whose values are all distinct.

As a result, no two records are identical when all three features are considered.

First, rebuild the DataFrame with the new column:

In [23]:
boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle'],
         'Weight': [10,12,5,8,9,7,6,4]         
        }
# No duplicates when all the columns are considered
df = pd.DataFrame(boxes, columns = ['Color', 'Shape', 'Weight'])
df

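|   | Color | Shape     | Weight |
|---|-------|-----------|--------|
| 0 | Green | Rectangle | 10     |
| 1 | Green | Rectangle | 12     |
| 2 | Green | Square    | 5      |
| 3 | Blue  | Rectangle | 8      |
| 4 | Blue  | Square    | 9      |
| 5 | Red   | Square    | 7      |
| 6 | Red   | Square    | 6      |
| 7 | Red   | Rectangle | 4      |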


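### Removing Duplicates Across All Columns

Calling drop_duplicates() with no arguments changes nothing here: no two rows agree in all three columns, so the result is the same as the original DataFrame.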

In [22]:
# The result is the same as the original
df1 = df.drop_duplicates()

df1

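|   | Color | Shape     | Weight |
|---|-------|-----------|--------|
| 0 | Green | Rectangle | 10     |
| 1 | Green | Rectangle | 12     |
| 2 | Green | Square    | 5      |
| 3 | Blue  | Rectangle | 8      |
| 4 | Blue  | Square    | 9      |
| 5 | Red   | Square    | 7      |
| 6 | Red   | Square    | 6      |
| 7 | Red   | Rectangle | 4      |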


In [20]:
# Remove duplicates w.r.t feature columns - Color, Shape.

df_duplicates_removed = df.drop_duplicates(subset=['Color', 'Shape'])

df_duplicates_removed


|   | Color | Shape     | Weight |
|---|-------|-----------|--------|
| 0 | Green | Rectangle | 10     |
| 2 | Green | Square    | 5      |
| 3 | Blue  | Rectangle | 8      |
| 4 | Blue  | Square    | 9      |
| 5 | Red   | Square    | 7      |
| 7 | Red   | Rectangle | 4      |
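
The `subset` parameter also accepts a single column label instead of a list. As a small sketch on the same data (the variable names `boxes_df` and `one_per_color` are illustrative), deduplicating on Color alone keeps the first row seen for each color:

```python
import pandas as pd

boxes_df = pd.DataFrame({
    "Color": ["Green", "Green", "Green", "Blue", "Blue", "Red", "Red", "Red"],
    "Shape": ["Rectangle", "Rectangle", "Square", "Rectangle",
              "Square", "Square", "Square", "Rectangle"],
    "Weight": [10, 12, 5, 8, 9, 7, 6, 4],
})

# A single label is enough when only one column defines a duplicate.
one_per_color = boxes_df.drop_duplicates(subset="Color")

print(one_per_color.index.tolist())      # [0, 3, 5]
print(one_per_color["Weight"].tolist())  # [10, 8, 7]
```

The columns outside `subset` (here Shape and Weight) are carried along from whichever row is kept; they play no part in deciding what counts as a duplicate.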


### Keep the Last Occurrence of the Duplicate Elements

In [24]:
# Remove duplicates w.r.t feature columns - Color, Shape.
# Keep the last occurrence of the duplicate record

df_duplicates_removed = df.drop_duplicates(subset=['Color', 'Shape'], keep='last')

df_duplicates_removed


|   | Color | Shape     | Weight |
|---|-------|-----------|--------|
| 1 | Green | Rectangle | 12     |
| 2 | Green | Square    | 5      |
| 3 | Blue  | Rectangle | 8      |
| 4 | Blue  | Square    | 9      |
| 6 | Red   | Square    | 6      |
| 7 | Red   | Rectangle | 4      |


In [26]:
# Remove duplicates w.r.t feature columns - Color, Shape.
# Reset index

df_duplicates_removed = df.drop_duplicates(subset=['Color', 'Shape'], ignore_index=True)

df_duplicates_removed


|   | Color | Shape     | Weight |
|---|-------|-----------|--------|
| 0 | Green | Rectangle | 10     |
| 1 | Green | Square    | 5      |
| 2 | Blue  | Rectangle | 8      |
| 3 | Blue  | Square    | 9      |
| 4 | Red   | Square    | 7      |
| 5 | Red   | Rectangle | 4      |
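
Since `ignore_index` was only added in pandas 1.0.0, older code often reaches the same result by chaining `reset_index(drop=True)`. The sketch below, which assumes pandas 1.0 or later and uses illustrative variable names, checks that the two spellings agree on this data:

```python
import pandas as pd

boxes_df = pd.DataFrame({
    "Color": ["Green", "Green", "Green", "Blue", "Blue", "Red", "Red", "Red"],
    "Shape": ["Rectangle", "Rectangle", "Square", "Rectangle",
              "Square", "Square", "Square", "Rectangle"],
    "Weight": [10, 12, 5, 8, 9, 7, 6, 4],
})

# Relabel the surviving rows 0..n-1 in a single call.
with_flag = boxes_df.drop_duplicates(subset=["Color", "Shape"], ignore_index=True)

# Equivalent two-step form: deduplicate, then discard the old index.
with_reset = boxes_df.drop_duplicates(subset=["Color", "Shape"]).reset_index(drop=True)

print(with_flag.index.tolist())      # [0, 1, 2, 3, 4, 5]
print(with_flag.equals(with_reset))  # True
```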


# Generate Boolean Mask of the Duplicate Entries

## **pandas.DataFrame.duplicated**

#### **DataFrame.duplicated(subset=None, keep='first')**

**Returns boolean Series denoting duplicate rows.**

Considering certain columns is optional.

**Parameters**

- **subset** : column label or sequence of labels, optional

  Only consider certain columns for identifying duplicates; by default, use all of the columns.

- **keep** : {'first', 'last', False}, default 'first'

  Determines which duplicates (if any) to mark.
  - 'first' : Mark duplicates as True except for the first occurrence.
  - 'last' : Mark duplicates as True except for the last occurrence.
  - False : Mark all duplicates as True.

**Returns**

Series: Boolean Series indicating, for each row, whether it is a duplicate.


In [27]:
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

df

|   | brand   | style | rating |
|---|---------|-------|--------|
| 0 | Yum Yum | cup   | 4.0    |
| 1 | Yum Yum | cup   | 4.0    |
| 2 | Indomie | cup   | 3.5    |
| 3 | Indomie | pack  | 15.0   |
| 4 | Indomie | pack  | 5.0    |


By default, the first occurrence of a record is left unmarked and every later occurrence is marked as a duplicate.

In [28]:
df.duplicated()

0    False
1     True
2    False
3    False
4    False
dtype: bool
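
This mask ties directly back to `drop_duplicates`: selecting the rows where the mask is False gives the same rows that `drop_duplicates` returns with the same arguments. A minimal sketch on the same data (the names `ramen`, `mask`, `via_mask`, and `via_drop` are illustrative):

```python
import pandas as pd

ramen = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

mask = ramen.duplicated()        # True for rows that repeat an earlier row
via_mask = ramen[~mask]          # keep only the rows that are not marked
via_drop = ramen.drop_duplicates()

print(via_mask.index.tolist())   # [0, 2, 3, 4]
print(via_mask.equals(via_drop)) # True
```

Working with the mask is useful when you want to inspect or log the duplicates before discarding them.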

Pass keep='last' to mark every occurrence of a record as a duplicate except the last one.

In [32]:
df.duplicated(keep='last')

0     True
1    False
2    False
3    False
4    False
dtype: bool

In [30]:
df.duplicated(subset=['brand', 'style'])

0    False
1     True
2    False
3    False
4     True
dtype: bool

In [33]:
df.duplicated(subset=['brand', 'style'], keep='last')

0     True
1    False
2    False
3     True
4    False
dtype: bool
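
The remaining option, `keep=False`, marks every member of a duplicated group, which is handy for pulling out all of the offending rows at once. Summing a mask counts the marked rows, because True counts as 1. The following is a sketch on the same data with illustrative variable names:

```python
import pandas as pd

ramen = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

# Mark every row whose (brand, style) pair occurs more than once.
mask_all = ramen.duplicated(subset=["brand", "style"], keep=False)
print(mask_all.tolist())  # [True, True, False, True, True]

# Inspect all rows that belong to a duplicated group.
print(ramen[mask_all])

# Count the rows that would be dropped with the default keep='first'.
n_extra = int(ramen.duplicated(subset=["brand", "style"]).sum())
print(n_extra)  # 2
```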