# Pandas find and remove duplicate rows

This is a notebook for the Medium article [Pandas find and remove duplicate rows](https://bindichen.medium.com/finding-and-removing-duplicate-rows-in-pandas-dataframe-c6117668631f).

Please check out the article for instructions.

**License**: [BSD 2-Clause](https://opensource.org/licenses/BSD-2-Clause)

In [1]:
import pandas as pd

In [2]:
def load_data(): 
    df_all = pd.read_csv('train.csv')
    # Take a subset
    return df_all.loc[:300, ['Survived', 'Pclass', 'Sex', 'Cabin', 'Embarked']].dropna()

# Load a subset
df = load_data()

In [3]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Cabin,Embarked
0,0,1,male,C30,S
1,1,1,female,D33,C
9,1,3,male,E121,S
10,1,1,female,B22,S
14,0,1,male,B51 B53 B55,S


## 1. Finding duplicate rows

In [4]:
# For a single column
df.Cabin.duplicated()

0      False
1      False
9      False
10     False
14     False
       ...  
271    False
278    False
286    False
299    False
300    False
Name: Cabin, Length: 80, dtype: bool

In [5]:
# For a DataFrame as a whole
df.duplicated()

0      False
1      False
9      False
10     False
14     False
       ...  
271    False
278    False
286    False
299    False
300    False
Length: 80, dtype: bool

In [6]:
# To consider certain columns for identifying duplicates
df.duplicated(subset=['Survived', 'Pclass', 'Sex'])

0      False
1      False
9      False
10      True
14      True
       ...  
271     True
278     True
286     True
299     True
300     True
Length: 80, dtype: bool
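
The difference between whole-row and `subset` comparison is easier to see on a few rows. The following is a minimal sketch on a hypothetical toy DataFrame (`toy` and its values are made up for illustration and are not part of `train.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic subset
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 1],
    'Pclass':   [1, 1, 3, 1],
    'Sex':      ['female', 'female', 'male', 'female'],
    'Cabin':    ['B77', 'B77', 'E121', 'D33'],
})

# Whole-row comparison: only row 1 repeats an earlier row (row 0)
full_mask = toy.duplicated()

# Comparing just these three columns: rows 1 and 3 both repeat row 0
subset_mask = toy.duplicated(subset=['Survived', 'Pclass', 'Sex'])

print(full_mask.tolist())    # [False, True, False, False]
print(subset_mask.tolist())  # [False, True, False, True]
```

The first occurrence is never flagged by default, which is why row 0 stays `False` in both masks.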

## 2. Counting duplicates and non-duplicates

In [7]:
df.Cabin.duplicated().sum()

11

In [8]:
df.duplicated().sum()

3

In [9]:
df.duplicated(subset=['Survived', 'Pclass', 'Sex']).sum()

70

In [10]:
# Count the number of non-duplicates
(~df.duplicated()).sum()

77
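
Counting works because `True` is summed as 1 and `False` as 0, so the duplicate and non-duplicate counts always add up to the number of rows (3 + 77 = 80 above). A small sketch of the same arithmetic, using a hypothetical one-column frame named `toy`:

```python
import pandas as pd

# Hypothetical toy data: 'B77' appears three times, 'E121' twice
toy = pd.DataFrame({'Cabin': ['B77', 'B77', 'E121', 'B77', 'E121']})

mask = toy.duplicated()          # [False, True, False, True, True]
n_dup = int(mask.sum())          # rows flagged as duplicates
n_unique = int((~mask).sum())    # rows not flagged (first occurrences)

print(n_dup, n_unique, len(toy))  # 3 2 5
```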

## 3. Extracting duplicate rows using `loc`

In [11]:
# This allows us to see the rows that were identified by duplicated()
df.loc[df.duplicated(), :]

Unnamed: 0,Survived,Pclass,Sex,Cabin,Embarked
138,1,2,female,F33,S
169,1,1,female,B77,S
237,1,1,female,B96 B98,S


## 4. Determining which duplicates to mark using `keep`

In [12]:
# `keep` defaults to `'first'`
df.loc[df.duplicated(keep='first'), :]

Unnamed: 0,Survived,Pclass,Sex,Cabin,Embarked
138,1,2,female,F33,S
169,1,1,female,B77,S
237,1,1,female,B96 B98,S


In [13]:
df.loc[df.duplicated(keep='last'), :]

Unnamed: 0,Survived,Pclass,Sex,Cabin,Embarked
36,1,1,female,B77,S
77,1,1,female,B96 B98,S
134,1,2,female,F33,S


In [14]:
# A third option is keep=False, which marks every occurrence of a duplicated row as True
df.loc[df.duplicated(keep=False), :]

Unnamed: 0,Survived,Pclass,Sex,Cabin,Embarked
36,1,1,female,B77,S
77,1,1,female,B96 B98,S
134,1,2,female,F33,S
138,1,2,female,F33,S
169,1,1,female,B77,S
237,1,1,female,B96 B98,S
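
The three `keep` settings can be compared side by side on a short Series. This is a minimal sketch with made-up values (`s` is hypothetical):

```python
import pandas as pd

# Hypothetical values: 'B77' occurs at positions 0, 2 and 3
s = pd.Series(['B77', 'F33', 'B77', 'B77'])

first = s.duplicated(keep='first').tolist()  # first occurrence is left unmarked
last = s.duplicated(keep='last').tolist()    # last occurrence is left unmarked
every = s.duplicated(keep=False).tolist()    # every occurrence is marked

print(first)  # [False, False, True, True]
print(last)   # [True, False, True, False]
print(every)  # [True, False, True, True]
```

With `keep=False`, the mask is the union of the `'first'` and `'last'` masks, which matches the six rows shown above (three from each earlier cell).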


## 5. Dropping duplicate rows

In [16]:
# Note that we started with 80 rows; 77 remain after dropping duplicates
# The drop does not happen in place by default; pass `inplace=True` to modify df directly
df.drop_duplicates()

Unnamed: 0,Survived,Pclass,Sex,Cabin,Embarked
0,0,1,male,C30,S
1,1,1,female,D33,C
9,1,3,male,E121,S
10,1,1,female,B22,S
14,0,1,male,B51 B53 B55,S
...,...,...,...,...,...
271,1,1,male,C93,S
278,0,1,male,C111,C
286,1,1,male,C148,C
299,1,1,female,D21,S


In [17]:
# Use keep='last' to keep the last occurrence 
df.drop_duplicates(keep='last')

Unnamed: 0,Survived,Pclass,Sex,Cabin,Embarked
0,0,1,male,C30,S
1,1,1,female,D33,C
9,1,3,male,E121,S
10,1,1,female,B22,S
14,0,1,male,B51 B53 B55,S
...,...,...,...,...,...
271,1,1,male,C93,S
278,0,1,male,C111,C
286,1,1,male,C148,C
299,1,1,female,D21,S


In [18]:
# To drop every occurrence of a duplicated row (no copy is kept)
df.drop_duplicates(keep=False)

Unnamed: 0,Survived,Pclass,Sex,Cabin,Embarked
0,0,1,male,C30,S
1,1,1,female,D33,C
9,1,3,male,E121,S
10,1,1,female,B22,S
14,0,1,male,B51 B53 B55,S
...,...,...,...,...,...
271,1,1,male,C93,S
278,0,1,male,C111,C
286,1,1,male,C148,C
299,1,1,female,D21,S


In [19]:
# Similarly, we can consider only certain columns when dropping duplicates
df.drop_duplicates(subset=['Survived', 'Pclass', 'Sex'])

Unnamed: 0,Survived,Pclass,Sex,Cabin,Embarked
0,0,1,male,C30,S
1,1,1,female,D33,C
9,1,3,male,E121,S
25,1,2,female,D,S
38,1,1,male,A6,S
48,0,3,male,F G73,S
63,0,2,male,D,C
113,1,3,female,E121,S
136,1,2,male,F4,S
172,0,1,female,C49,C
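
To summarize the `drop_duplicates()` options used in this section, here is a minimal sketch on a hypothetical toy frame (`toy` is made up for illustration). It also uses `ignore_index=True`, which the cells above do not cover and which assumes pandas 1.0 or later:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 1],
    'Cabin':  ['B77', 'B77', 'F33', 'D33'],
})

deduped = toy.drop_duplicates()                    # returns a new frame; toy is unchanged
by_class = toy.drop_duplicates(subset=['Pclass'])  # keeps the first row per Pclass value
strict = toy.drop_duplicates(keep=False)           # removes every copy of a repeated row
reset = toy.drop_duplicates(ignore_index=True)     # relabels the result 0..n-1

print(len(toy), len(deduped))   # 4 3
print(by_class.index.tolist())  # [0, 2]
print(strict.index.tolist())    # [2, 3]
print(reset.index.tolist())     # [0, 1, 2]
```

Assigning the result to a new name, as done here, leaves the original frame available for comparison, whereas `inplace=True` overwrites it.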
