# 🧮 Removing Duplicates in Pandas
**Author:** Hamna Munir  
**Repository:** Python-Libraries-for-AI-ML  
**Topic:** 10_drop_duplicates

In real-world datasets, **duplicate rows or values** are common and can **bias analysis, aggregation, or machine learning models**. Pandas provides the `duplicated()` method to flag duplicates and the `drop_duplicates()` method to remove them.

---

## 📘 Why Is Removing Duplicates Important?
- Ensures **accuracy** in statistical analysis.
- Avoids **over-counting** in aggregations.
- Prevents **redundant data** in reports or visualizations.
- Improves **data quality** before machine learning.

## Importing Pandas and Creating a Sample DataFrame
Let's create a DataFrame with some duplicate rows.

In [1]:
import pandas as pd

data = {
    'Name': ['Ali', 'Sara', 'Umar', 'Ali', 'Omar', 'Sara'],
    'Age': [22, 25, 28, 22, 26, 25],
    'Score': [85, 90, 78, 85, 88, 90]
}

df = pd.DataFrame(data)
print("Sample DataFrame with duplicates:\n", df)

Sample DataFrame with duplicates:
    Name  Age  Score
0   Ali   22     85
1  Sara   25     90
2  Umar   28     78
3   Ali   22     85
4  Omar   26     88
5  Sara   25     90
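
Before removing anything, it can help to see which rows pandas treats as duplicates. The snippet below is a minimal sketch using `DataFrame.duplicated()`; the variable names `dup_mask`, `dup_count`, and `all_copies` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Umar', 'Ali', 'Omar', 'Sara'],
    'Age': [22, 25, 28, 22, 26, 25],
    'Score': [85, 90, 78, 85, 88, 90]
})

# Boolean mask: True for each row that repeats an earlier row
dup_mask = df.duplicated()

# Number of rows that drop_duplicates() would remove with default settings
dup_count = int(dup_mask.sum())

# keep=False marks every member of a duplicate group, first occurrences included
all_copies = df[df.duplicated(keep=False)]

print("Duplicate mask:", dup_mask.tolist())
print("Rows that would be dropped:", dup_count)
print(all_copies)
```

Here rows 3 and 5 are flagged, so two rows would be dropped by default.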


## 🧩 Removing Duplicate Rows
- By default, `drop_duplicates()` compares **entire rows** and removes any row that is identical to an earlier one.
- It keeps the **first occurrence** of each row; the surviving rows retain their original index labels.

In [2]:
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print("DataFrame after removing duplicates:\n", df_no_duplicates)

DataFrame after removing duplicates:
    Name  Age  Score
0   Ali   22     85
1  Sara   25     90
2  Umar   28     78
4  Omar   26     88
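
Note that the output above keeps the original index labels (0, 1, 2, 4), leaving a gap where row 3 was. If a clean 0-based index is preferred, one option is the `ignore_index=True` argument (available in pandas 1.0 and later). The names `deduped` and `renumbered` below are illustrative.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Umar', 'Ali', 'Omar', 'Sara'],
    'Age': [22, 25, 28, 22, 26, 25],
    'Score': [85, 90, 78, 85, 88, 90]
})

# Default behaviour: surviving rows keep their original labels
deduped = df.drop_duplicates()

# ignore_index=True relabels the result 0..n-1
renumbered = df.drop_duplicates(ignore_index=True)

print(deduped.index.tolist())
print(renumbered.index.tolist())
```

Calling `reset_index(drop=True)` on the deduplicated result should give the same relabelled frame.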


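## 🧩 Removing Duplicates Based on Specific Columns
- Sometimes rows should count as duplicates when they match on **certain columns** only, even if other columns differ.
- Use the `subset` parameter to name those columns; the first row of each matching group is kept in full.
- In this sample the result is the same as the full-row result, because rows that share `Name` and `Age` also share `Score`.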

In [3]:
# Remove duplicates based on 'Name' and 'Age'
df_no_dup_subset = df.drop_duplicates(subset=['Name', 'Age'])
print("DataFrame after removing duplicates based on Name and Age:\n", df_no_dup_subset)

DataFrame after removing duplicates based on Name and Age:
    Name  Age  Score
0   Ali   22     85
1  Sara   25     90
2  Umar   28     78
4  Omar   26     88
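
The effect of `subset` is easier to see when the other columns differ. The sketch below uses a small hypothetical DataFrame, `df_scores` (not part of the original data), in which Ali appears twice with different scores.

```python
import pandas as pd

# Hypothetical data: Ali appears twice with different scores,
# Sara appears twice with identical values
df_scores = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Ali', 'Sara'],
    'Age': [22, 25, 22, 25],
    'Score': [85, 90, 91, 90]
})

# Full-row comparison: only the second Sara row is an exact repeat
full_row = df_scores.drop_duplicates()

# Compare on Name and Age only: the second Ali row is dropped as well
by_subset = df_scores.drop_duplicates(subset=['Name', 'Age'])

# Same comparison, but keep the last row of each group instead
by_subset_last = df_scores.drop_duplicates(subset=['Name', 'Age'], keep='last')

print(full_row)
print(by_subset)
print(by_subset_last)
```

With `subset`, which row survives decides which `Score` is retained, so the choice of `keep` (or sorting beforehand) matters.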


## 🧩 Keeping Last Occurrence
- By default, the **first occurrence** of each duplicate is kept.
- Use `keep='last'` to retain the **last occurrence** instead.

In [4]:
# Keep last occurrence of duplicates
df_keep_last = df.drop_duplicates(keep='last')
print("DataFrame keeping last occurrence:\n", df_keep_last)

DataFrame keeping last occurrence:
    Name  Age  Score
2  Umar   28     78
3   Ali   22     85
4  Omar   26     88
5  Sara   25     90
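
The `keep` parameter also accepts `False`, which drops every row that has a duplicate and leaves only rows that were unique to begin with. A minimal sketch on the same sample data (the name `unique_only` is illustrative):

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Umar', 'Ali', 'Omar', 'Sara'],
    'Age': [22, 25, 28, 22, 26, 25],
    'Score': [85, 90, 78, 85, 88, 90]
})

# keep=False removes all members of every duplicate group
unique_only = df.drop_duplicates(keep=False)

print(unique_only)
```

Only Umar and Omar remain, since both Ali and Sara appear more than once.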


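## 🧩 In-Place Removal
- Pass `inplace=True` to remove duplicates **directly in the original DataFrame**.
- The call then returns `None`, so there is no new variable to assign. Pandas may still build the deduplicated data internally, so treat this as a convenience rather than a guaranteed memory saving.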

In [5]:
# Remove duplicates in-place
df.drop_duplicates(inplace=True)
print("Original DataFrame after in-place removal of duplicates:\n", df)

Original DataFrame after in-place removal of duplicates:
    Name  Age  Score
0   Ali   22     85
1  Sara   25     90
2  Umar   28     78
4  Omar   26     88
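
A common slip is assigning the result of an in-place call back to a variable. The sketch below shows that the in-place call returns `None`, and that plain reassignment gives the same DataFrame. The names `result` and `df_reassigned` are illustrative.

```python
import pandas as pd

data = {
    'Name': ['Ali', 'Sara', 'Umar', 'Ali', 'Omar', 'Sara'],
    'Age': [22, 25, 28, 22, 26, 25],
    'Score': [85, 90, 78, 85, 88, 90]
}

# In-place: df itself is modified and the call returns None
df = pd.DataFrame(data)
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 4

# Reassignment: same outcome without inplace
df_reassigned = pd.DataFrame(data)
df_reassigned = df_reassigned.drop_duplicates()
print(df.equals(df_reassigned))  # True
```

Writing `df = df.drop_duplicates(inplace=True)` would therefore replace `df` with `None`.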


## 📝 Summary
- `drop_duplicates()` removes **duplicate rows** from a DataFrame (or duplicate values from a Series).
- The `subset` parameter limits the comparison to **specific columns**.
- The `keep` parameter controls which occurrence survives: `'first'` (default), `'last'`, or `False` to **drop all** duplicated rows.
- `inplace=True` modifies the original DataFrame and returns `None`.
- Removing duplicates is essential for **data cleaning, accuracy, and ML preprocessing**.
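
For completeness, here is a brief sketch of the Series case mentioned in the summary. The Series `names` is built from the sample `Name` values purely for illustration.

```python
import pandas as pd

# A Series containing repeated values
names = pd.Series(['Ali', 'Sara', 'Umar', 'Ali', 'Omar', 'Sara'])

# Series.drop_duplicates() keeps the first occurrence of each value by default
unique_names = names.drop_duplicates()

print(unique_names)
```

As with DataFrames, the surviving values keep their original index labels.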

**Next:** `11_GroupBy_Functions.ipynb` → Grouping Data in Pandas