# 🔍 Why are Duplicates a Problem?

- They bias **analysis** (e.g., same person counted twice).

- They **increase computation time**.

- They **mislead machine learning models** (especially with repeated labels).

## 🧭 How Duplicates Appear in Data

- **Multiple Data Sources**: Merging datasets from different systems.

- **Human Error**: Manual data entry (same record entered twice).

- **Web Scraping or APIs**: Duplicate pages or items fetched multiple times.

- **Data Collection Frequency**: Same sensor or transaction recorded more than once.
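The "multiple data sources" case can be sketched as follows. The two frames `crm` and `billing` are hypothetical stand-ins for exports from different systems; stacking them with `pd.concat` carries the shared record over twice.

```python
import pandas as pd

# Hypothetical exports from two systems that both hold the record for id 2.
crm = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
billing = pd.DataFrame({"id": [2, 3], "name": ["b", "c"]})

# Stacking the sources keeps both copies of the shared record.
combined = pd.concat([crm, billing], ignore_index=True)

# duplicated() flags each repeat of an earlier row; the sum counts them.
n_dupes = int(combined.duplicated().sum())
print(len(combined), n_dupes)  # 4 1
```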

In [86]:
import pandas as pd

In [87]:
data = {"name": ["a","b","c","d","a","c","a"], "eng": [8,7,5,8,8,5,8], "hindi": [2,3,4,5,2,6,2]}

In [88]:
df = pd.DataFrame(data)
df

Unnamed: 0,name,eng,hindi
0,a,8,2
1,b,7,3
2,c,5,4
3,d,8,5
4,a,8,2
5,c,5,6
6,a,8,2


In [89]:
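df["isDuplicate"] = df.duplicated()  # True for every repeat of an earlier row
df1 = df.copy()  # copied after adding the flag, so df1 also has the isDuplicate column
df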

Unnamed: 0,name,eng,hindi,isDuplicate
0,a,8,2,False
1,b,7,3,False
2,c,5,4,False
3,d,8,5,False
4,a,8,2,True
5,c,5,6,False
6,a,8,2,True
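`duplicated()` also accepts a `keep` argument that changes which copies get flagged. A small sketch on the same toy data (rebuilt here under the name `scores`, which is just an illustrative variable name):

```python
import pandas as pd

scores = pd.DataFrame({
    "name": ["a", "b", "c", "d", "a", "c", "a"],
    "eng": [8, 7, 5, 8, 8, 5, 8],
    "hindi": [2, 3, 4, 5, 2, 6, 2],
})

# Default keep="first": every repeat after the first occurrence is flagged.
first_mask = scores.duplicated()
# keep="last": every copy except the last occurrence is flagged.
last_mask = scores.duplicated(keep="last")
# keep=False: all copies of a repeated row are flagged.
all_mask = scores.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, False, False, True, False, True]
print(last_mask.tolist())   # [True, False, False, False, True, False, False]
print(all_mask.tolist())    # [True, False, False, False, True, False, True]
```

Rows 2 and 5 are never flagged because their `hindi` values differ (4 vs 6), so they are not full-row duplicates.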


### :star: Note: 
####  The **first occurrence is kept**; this is the **default behaviour** of pandas, **even without explicitly setting `keep="first"`**

In [90]:
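df.drop_duplicates(keep="first", inplace=True)
df
# The first occurrence is kept; this is the default behaviour of pandas even without
# explicitly setting keep="first".
# The isDuplicate column takes part in the comparison, so row 4 (True) no longer matches
# row 0 (False) and survives; only row 6, an exact copy of row 4, is dropped.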

Unnamed: 0,name,eng,hindi,isDuplicate
0,a,8,2,False
1,b,7,3,False
2,c,5,4,False
3,d,8,5,False
4,a,8,2,True
5,c,5,6,False
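Row 4 is still present in the output above because the `isDuplicate` helper column is compared along with the rest. One way around this, sketched below, is to pass `subset` so that only the original columns are compared (`scores` and `value_cols` are illustrative names, not part of the notebook).

```python
import pandas as pd

scores = pd.DataFrame({
    "name": ["a", "b", "c", "d", "a", "c", "a"],
    "eng": [8, 7, 5, 8, 8, 5, 8],
    "hindi": [2, 3, 4, 5, 2, 6, 2],
})
scores["isDuplicate"] = scores.duplicated()

# Comparing all columns: the helper flag makes row 4 differ from row 0.
naive = scores.drop_duplicates()

# Comparing only the original columns: rows 4 and 6 both match row 0.
value_cols = ["name", "eng", "hindi"]
deduped = scores.drop_duplicates(subset=value_cols)

print(naive.index.tolist())    # [0, 1, 2, 3, 4, 5]
print(deduped.index.tolist())  # [0, 1, 2, 3, 5]
```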


### :star: Note: 
### Set `keep=False` to remove every row that has a duplicate, including its first occurrence

In [91]:
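# keep=False drops every copy of a duplicated row: rows 4 and 6 match each other and
# are both removed, while row 0 differs in isDuplicate and stays.
df1.drop_duplicates(keep=False, inplace=True)
df1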

Unnamed: 0,name,eng,hindi,isDuplicate
0,a,8,2,False
1,b,7,3,False
2,c,5,4,False
3,d,8,5,False
5,c,5,6,False
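For comparison, here is a sketch of `keep=False` and `keep="last"` on the same toy data without the helper column (again using the illustrative name `scores`), so that all three copies of the `a` row are treated as duplicates of one another.

```python
import pandas as pd

scores = pd.DataFrame({
    "name": ["a", "b", "c", "d", "a", "c", "a"],
    "eng": [8, 7, 5, 8, 8, 5, 8],
    "hindi": [2, 3, 4, 5, 2, 6, 2],
})

# keep=False: rows 0, 4 and 6 are all removed.
no_copies = scores.drop_duplicates(keep=False)
# keep="last": only the last copy (row 6) of the repeated row survives.
last_only = scores.drop_duplicates(keep="last")

print(no_copies.index.tolist())  # [1, 2, 3, 5]
print(last_only.index.tolist())  # [1, 2, 3, 5, 6]
```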


In [92]:
help(pd.DataFrame.drop_duplicates)

Help on function drop_duplicates in module pandas.core.frame:

drop_duplicates(self, subset: 'Hashable | Sequence[Hashable] | None' = None, *, keep: 'DropKeep' = 'first', inplace: 'bool' = False, ignore_index: 'bool' = False) -> 'DataFrame | None'
    Return DataFrame with duplicate rows removed.

    Considering certain columns is optional. Indexes, including time indexes
    are ignored.

    Parameters
    ----------
    subset : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates, by
        default use all of the columns.
    keep : {'first', 'last', ``False``}, default 'first'
        Determines which duplicates (if any) to keep.

        - 'first' : Drop duplicates except for the first occurrence.
        - 'last' : Drop duplicates except for the last occurrence.
        - ``False`` : Drop all duplicates.

    inplace : bool, default ``False``
        Whether to modify the DataFrame rather than creating a new one.
    ign
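The signature above also lists `inplace` and `ignore_index`. A minimal sketch of both on the toy data (the variable names are illustrative): `ignore_index=True` renumbers the surviving rows from 0, and `inplace=True` modifies the frame and returns `None`.

```python
import pandas as pd

scores = pd.DataFrame({
    "name": ["a", "b", "c", "d", "a", "c", "a"],
    "eng": [8, 7, 5, 8, 8, 5, 8],
    "hindi": [2, 3, 4, 5, 2, 6, 2],
})

# Default: surviving rows keep their original index labels.
kept_labels = scores.drop_duplicates()
# ignore_index=True: the result is relabelled 0..n-1.
renumbered = scores.drop_duplicates(ignore_index=True)

print(kept_labels.index.tolist())  # [0, 1, 2, 3, 5]
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]

# inplace=True changes the frame itself and returns None.
in_place = scores.copy()
result = in_place.drop_duplicates(inplace=True)
print(result, len(in_place))       # None 5
```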

### ➡️ Performing Duplicate Removal on loan_data_set.csv

In [93]:
dataset = pd.read_csv("loan_data_set.csv")
dataset.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


In [94]:
dataset.shape

(614, 13)

In [95]:
dataset.drop_duplicates(inplace = True)

In [96]:
dataset.shape
# the shape is unchanged, so there were no duplicate rows in this dataset

(614, 13)
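Instead of comparing shapes before and after, the duplicates can be counted directly, and an identifier column can be checked on its own. The sketch below uses a tiny hypothetical `loans` frame rather than the real CSV; the third row and its income value are made up for illustration.

```python
import pandas as pd

# Hypothetical mini frame; the repeated Loan_ID in the last row is invented.
loans = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001003"],
    "ApplicantIncome": [5849, 4583, 4600],
})

# Full-row duplicates: none, because the incomes differ.
full_row_dupes = int(loans.duplicated().sum())
# Duplicates on the identifier alone: the second LP001003 is flagged.
id_dupes = int(loans.duplicated(subset="Loan_ID").sum())

print(full_row_dupes, id_dupes)  # 0 1
```

A full-row check reporting zero duplicates therefore does not rule out repeated IDs with differing values.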