In [1]:
import pandas as pd
import numpy as np
import warnings 
warnings.filterwarnings("ignore")

## Duplicates

- Sometimes you get a messy dataset. For example, you may have to deal with duplicate rows, which will skew your analysis.

**Checking for duplicates**

- To check for duplicate records, use `duplicated()`. It flags a row when an earlier row has the same values in every column.

`df.duplicated()`

In [2]:
# Create a dataframe from a dictionary

dict1 = {'Gender': ["Male", "Female", "Male", "Female", "Male"],
         'Married': ["Yes", "No", "No", "No", "Yes"],
         'Loan_Status': ["Yes", "No", "No", "No", "Yes"]}

df = pd.DataFrame(dict1)
df

Unnamed: 0,Gender,Married,Loan_Status
0,Male,Yes,Yes
1,Female,No,No
2,Male,No,No
3,Female,No,No
4,Male,Yes,Yes


In [3]:
df.duplicated()   # returns a boolean Series: True marks a row that repeats an earlier row

0    False
1    False
2    False
3     True
4     True
dtype: bool

- In the above dataframe, 2 rows are duplicates: row 3 repeats row 1, and row 4 repeats row 0.
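
- When a dataframe is too long to read the boolean output row by row, the mask can be summarized instead. The cell below is a minimal sketch that rebuilds the same dataframe so it runs on its own; the names `n_duplicates` and `has_duplicates` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Married": ["Yes", "No", "No", "No", "Yes"],
    "Loan_Status": ["Yes", "No", "No", "No", "Yes"],
})

# True counts as 1 when summed, so this is the number of rows flagged as duplicates
n_duplicates = int(df.duplicated().sum())

# any() answers the yes/no question: is there at least one duplicate row?
has_duplicates = bool(df.duplicated().any())

print(n_duplicates, has_duplicates)
```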

In [5]:
df.drop_duplicates()   # drops repeated rows, keeping the first occurrence of each

Unnamed: 0,Gender,Married,Loan_Status
0,Male,Yes,Yes
1,Female,No,No
2,Male,No,No
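
- By default, `drop_duplicates()` returns a new dataframe and leaves `df` untouched, so the result has to be assigned to a name to be kept. A small sketch of that behaviour follows; the name `deduped` is an assumption chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Married": ["Yes", "No", "No", "No", "Yes"],
    "Loan_Status": ["Yes", "No", "No", "No", "Yes"],
})

# Assign the result; the call does not modify df itself
deduped = df.drop_duplicates()

# The original still has all 5 rows, the copy has only the 3 distinct ones
print(len(df), len(deduped))
```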


**Dropping duplicates from a particular column**

- Sometimes you may want to treat rows as duplicates when they match in just one column. The whole row is dropped, not only the value in that column.

In [9]:
df

Unnamed: 0,Gender,Married,Loan_Status
0,Male,Yes,Yes
1,Female,No,No
2,Male,No,No
3,Female,No,No
4,Male,Yes,Yes


In [8]:
df.drop_duplicates(subset=["Married"])      # drops rows whose "Married" value has already appeared

Unnamed: 0,Gender,Married,Loan_Status
0,Male,Yes,Yes
1,Female,No,No


The `keep` argument of `drop_duplicates()` controls which occurrence survives:   
`first`: (default) Drop duplicates except for the first occurrence.   
`last`: Drop duplicates except for the last occurrence.   
`False`: Drop all duplicates.   
       
The two main methods that we will use are 
- duplicated() 
- drop_duplicates()

The former returns a boolean Series, and the latter returns a copy of the dataframe with the duplicate rows removed. For the duplicated() method, the inputs are:   
  
- keep   
    - "first": Mark duplicates as True except for the first occurrence.   
    - "last": Mark duplicates as True except for the last occurrence.   
    - False: Mark all duplicates as True   
      
For the drop_duplicates() method, the keep argument does the following.   
  
- keep  
    - "first": Drop duplicates except for the first occurrence.  
    - "last": Drop duplicates except for the last occurrence.  
    - False: Drop all duplicates 
      
The second argument for both is:   

- `subset`: Only consider certain columns for identifying duplicates. If subset is not specified, all of the columns are used by default.
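
- The two arguments can be combined. The sketch below, which rebuilds the same dataframe so it runs on its own, flags repeats in the `Married` column only and then keeps the last row for each distinct value; `married_mask` and `last_per_status` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Married": ["Yes", "No", "No", "No", "Yes"],
    "Loan_Status": ["Yes", "No", "No", "No", "Yes"],
})

# Flag rows whose "Married" value already appeared in an earlier row;
# the other columns are ignored when comparing
married_mask = df.duplicated(subset=["Married"])
print(married_mask)

# Keep only the last row seen for each distinct "Married" value
last_per_status = df.drop_duplicates(subset=["Married"], keep="last")
print(last_per_status)
```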

In [10]:
# with keep = "first" - mark duplicates as True except for the first one

duplicates_first=df.duplicated(keep = "first")
duplicates_first

0    False
1    False
2    False
3     True
4     True
dtype: bool

In [11]:
df.loc[duplicates_first,:]

Unnamed: 0,Gender,Married,Loan_Status
3,Female,No,No
4,Male,Yes,Yes
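
- Inverting the same mask with `~` selects the rows that are not flagged, which gives the same result as `drop_duplicates()` with the same `keep` value. A minimal sketch of that relationship, with `kept_by_mask` and `kept_by_method` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Married": ["Yes", "No", "No", "No", "Yes"],
    "Loan_Status": ["Yes", "No", "No", "No", "Yes"],
})

mask = df.duplicated(keep="first")

# ~mask flips True and False, so this keeps every row that was not flagged
kept_by_mask = df.loc[~mask, :]

# drop_duplicates performs the same filtering in one call
kept_by_method = df.drop_duplicates(keep="first")

# equals() compares values, index and columns of the two dataframes
print(kept_by_mask.equals(kept_by_method))
```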


In [12]:
# with keep = "last" - mark duplicates as True except for the last one
duplicates_last=df.duplicated(keep = "last")
duplicates_last

0     True
1     True
2    False
3    False
4    False
dtype: bool
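
- With keep = False, duplicated() marks every copy of a repeated row, including the first occurrence. That is handy for inspecting all repeated rows side by side before deciding what to drop. The cell below is a sketch under that reading; `all_copies` is an illustrative name, and sorting by all columns is just one way to place matching rows next to each other.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Married": ["Yes", "No", "No", "No", "Yes"],
    "Loan_Status": ["Yes", "No", "No", "No", "Yes"],
})

# keep=False flags every row that has at least one identical twin
all_copies = df[df.duplicated(keep=False)]

# Sort by all columns so identical rows end up adjacent
all_copies = all_copies.sort_values(by=list(df.columns))
print(all_copies)
```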

In [14]:
df

Unnamed: 0,Gender,Married,Loan_Status
0,Male,Yes,Yes
1,Female,No,No
2,Male,No,No
3,Female,No,No
4,Male,Yes,Yes


**drop_duplicates**

In [13]:
# with keep = False - drop every row that has a duplicate, including its first occurrence

df.drop_duplicates(keep = False)

Unnamed: 0,Gender,Married,Loan_Status
2,Male,No,No


In [15]:
# with keep = "first" - keep the first occurrence of each duplicated row and drop the rest

df1 = df.drop_duplicates(keep = "first")
df1

Unnamed: 0,Gender,Married,Loan_Status
0,Male,Yes,Yes
1,Female,No,No
2,Male,No,No


In [16]:
# with keep = "last" - keep the last occurrence of each duplicated row and drop the rest

df1 = df.drop_duplicates(keep = "last")
df1

Unnamed: 0,Gender,Married,Loan_Status
2,Male,No,No
3,Female,No,No
4,Male,Yes,Yes


In [17]:
#Using drop_duplicates() method with subset

df.drop_duplicates(keep = "first", subset=["Gender"])

Unnamed: 0,Gender,Married,Loan_Status
0,Male,Yes,Yes
1,Female,No,No
