### **Identifying Duplicate Rows**
Duplicate rows in a **Pandas DataFrame** occur when two or more rows contain the **same values across all columns** (or a selected subset of columns). Detecting duplicates is a crucial step in **data cleaning** to ensure data accuracy and reliability.

---
✅ **Key Points**
* `df.duplicated()` checks **all columns** by default.
* `df.duplicated(subset=['column_name'])` limits the check to specific columns.
* Use `df.duplicated(keep=False)` to flag **all** duplicates (including the first occurrence).
* Use the result as a mask, `df[df.duplicated()]`, to **view** the duplicate rows.
---
##### ➡️ **Methods for Identifying Duplicates**

➡️ **1. Using the `duplicated()` Method**

The `duplicated()` method returns a **boolean Series** indicating whether each row is a duplicate of any of the previous rows.
* By default, **all columns** are checked for duplication.
* The **first occurrence** of each duplicate is **not marked** as a duplicate (it is considered unique).
* Only the **subsequent occurrences** are flagged as `True`.

In [2]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['NY', 'LA', 'NY', 'SF', 'LA']
})

# Identify duplicate rows (across all columns)
duplicates = df.duplicated()
print(duplicates)

0    False
1    False
2     True
3    False
4     True
dtype: bool
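
The Key Points mention `keep=False` and boolean indexing, so here is a minimal sketch of both on the same sample data. The variable names (`all_dupes_mask`, `later_copies`, `all_copies`) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['NY', 'LA', 'NY', 'SF', 'LA']
})

# keep=False flags every member of a duplicate group, including the first
all_dupes_mask = df.duplicated(keep=False)

# Boolean indexing returns the rows themselves instead of the mask
later_copies = df[df.duplicated()]   # only the repeated copies (default keep='first')
all_copies = df[all_dupes_mask]      # every row that has at least one twin

print(f"Repeated copies only:\n{later_copies}\n")
print(f"All rows involved in duplication:\n{all_copies}")
```

With this data, `later_copies` holds rows 2 and 4, while `all_copies` holds rows 0, 1, 2 and 4.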


➡️ **2. Using the `subset` Parameter**

You can specify a **subset of columns** to check for duplicates using the `subset` parameter. This is useful when only specific columns need to be checked.

In [1]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['NY', 'LA', 'NY', 'SF', 'LA']
})
# Identify duplicates based on the 'Name' column
duplicates_name = df.duplicated(subset=['Name'])
print(duplicates_name)

0    False
1    False
2     True
3    False
4     True
dtype: bool


In [3]:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 3, 2],
    'B': ['a', 'b', 'c', 'c', 'd', 'e', 'b']
})
duplicates = df.duplicated() # Identify duplicates
duplicates_A = df.duplicated(subset=['A']) # Identify duplicates based only on column 'A'

print(
    f"Original DataFrame:\n{df}\n"
    f"\nDuplicate rows:\n{duplicates}\n" # Identify duplicates
    f"\nDuplicates based on column 'A':\n{duplicates_A}" # Identify duplicates based only on column 'A'
)

Original DataFrame:
   A  B
0  1  a
1  2  b
2  2  c
3  3  c
4  4  d
5  3  e
6  2  b

Duplicate rows:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Duplicates based on column 'A':
0    False
1    False
2     True
3    False
4    False
5     True
6     True
dtype: bool
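
Because `duplicated()` returns a boolean Series, summing it gives the number of flagged rows, which is often quicker to read than the full mask. The sketch below reuses the `A`/`B` frame from the cell above; the counter names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 3, 2],
    'B': ['a', 'b', 'c', 'c', 'd', 'e', 'b']
})

# True counts as 1, so .sum() counts the flagged rows
n_full = int(df.duplicated().sum())                # all columns
n_by_a = int(df.duplicated(subset=['A']).sum())    # column 'A' only
n_by_b = int(df.duplicated(subset=['B']).sum())    # column 'B' only

print(f"Full-row duplicates: {n_full}")
print(f"Duplicates on 'A':   {n_by_a}")
print(f"Duplicates on 'B':   {n_by_b}")

# .any() answers the yes/no question without counting
has_any = bool(df.duplicated().any())
print(f"Any full-row duplicates? {has_any}")
```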


### **Removing Duplicate Rows**

Once **duplicate rows** are identified in a Pandas DataFrame, you can remove them to ensure data quality and consistency.
Pandas provides flexible methods to remove duplicates without altering the original data unless explicitly instructed.

---
✅ **Key Takeaways**
* `drop_duplicates()` helps remove duplicate rows easily.
* The original DataFrame remains unchanged unless you use `inplace=True`.
* Use the **`subset`** parameter for column-specific duplicate removal.
* The **`keep`** parameter provides flexibility in which duplicate entries to retain.
---

➡️ **1. Using `drop_duplicates()`**

The simplest way to remove duplicates is the **`drop_duplicates()`** method. By default it keeps the first occurrence of each row, drops the later repeats, and returns a **new DataFrame**, leaving the original unchanged.

In [4]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x']
})

# Remove duplicate rows
df_unique = df.drop_duplicates()
print(df_unique)

   A  B
0  1  x
1  2  y
3  3  z
4  4  x
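
The Key Takeaways mention `inplace=True`; the sketch below contrasts it with the default copy-returning behaviour. It also uses `ignore_index=True` to renumber the result, which assumes pandas 1.0 or later. The names `df_clean`, `df_copy` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x']
})

# Default: returns a new DataFrame; ignore_index=True relabels rows 0..n-1
# (assumes pandas >= 1.0, where this parameter is available)
df_clean = df.drop_duplicates(ignore_index=True)

# inplace=True modifies the object it is called on and returns None
df_copy = df.copy()
result = df_copy.drop_duplicates(inplace=True)

print(f"Original still has {len(df)} rows")
print(f"df_clean:\n{df_clean}\n")
print(f"Return value with inplace=True: {result}")
print(f"df_copy after inplace drop:\n{df_copy}")
```

Assigning the returned DataFrame is usually easier to follow than `inplace=True`, since the latter returns `None` and cannot be chained.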


➡️ **2. Keeping Specific Duplicates**

You can control which duplicates to retain using the **`keep`** parameter.
* `keep='first'`: Keeps the first occurrence and removes later duplicates.
* `keep='last'`: Keeps the last occurrence and removes earlier duplicates.
* `keep=False`: Removes **all** occurrences of duplicates.

In [7]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x']
})
# Keep first occurrence of duplicates
df_keep_first = df.drop_duplicates(keep='first')
print(f"Keep First Occurrence of Duplicates:\n{df_keep_first}\n")

# Keep last occurrence of duplicates
df_keep_last = df.drop_duplicates(keep='last')
print(f"Keep Last Occurrence of Duplicates:\n{df_keep_last}")

Keep First Occurrence of Duplicates:
   A  B
0  1  x
1  2  y
3  3  z
4  4  x

Keep Last Occurrence of Duplicates:
   A  B
0  1  x
2  2  y
3  3  z
5  4  x
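
The third option, `keep=False`, is listed above but not demonstrated. A minimal sketch on the same sample data (the name `df_no_repeats` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x']
})

# keep=False drops every row that has a duplicate, leaving only
# rows that were unique to begin with
df_no_repeats = df.drop_duplicates(keep=False)
print(f"Rows that never repeat:\n{df_no_repeats}")
```

Only rows 0 and 3 survive here, because `(2, 'y')` and `(4, 'x')` each appear twice.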


➡️ **3. Using the `subset` Parameter**

If you want to consider only specific columns while identifying duplicates, use the **`subset`** parameter.

In [9]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x']
})

# Remove duplicates based only on column 'A'
df_unique_A = df.drop_duplicates(subset=['A'])
print(f"Removing duplicates based only on column 'A':\n{df_unique_A}")

Removing duplicates based only on column 'A':
   A  B
0  1  x
1  2  y
3  3  z
4  4  x
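
In the sample above, rows that share a value in `A` also share `B`, so `subset=['A']` happens to give the same result as checking all columns. The sketch below reuses the `A`/`B` frame from the identification section, where the two checks differ, and also combines `subset` with `keep='last'`. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 3, 2],
    'B': ['a', 'b', 'c', 'c', 'd', 'e', 'b']
})

# All columns: only row 6 repeats an earlier row exactly
full_rows = df.drop_duplicates()

# Column 'A' only: any repeated value in 'A' is dropped, whatever 'B' holds
by_a_first = df.drop_duplicates(subset=['A'])

# Same check, but retain the last row seen for each value of 'A'
by_a_last = df.drop_duplicates(subset=['A'], keep='last')

print(f"All columns:\n{full_rows}\n")
print(f"subset=['A'], keep='first':\n{by_a_first}\n")
print(f"subset=['A'], keep='last':\n{by_a_last}")
```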


🔹 **`Problem:` The DataFrame has duplicate entries for some students with different grades.**

Remove **duplicates based on 'Student_ID'**, **keeping the entry with the highest grade** for each student.

To keep the highest grade for each student, **first sort the DataFrame by `Student_ID` (ascending) and `Grade` (descending)** so that each student's best entry comes first, then drop duplicates with `keep='first'`.

In [16]:
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({
    'Student_ID': [1, 2, 3, 1, 2, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'David'],
    'Grade': [85, 92, 78, 88, 90, 95]
})
# Sort by Student_ID (ascending) and Grade (descending) so the highest grade comes first per student.
df_sorted = df.sort_values(by=['Student_ID', 'Grade'], ascending=[True, False])
print(f"Original DataFrame after Sorting Student_ID [asc] & Grade [desc]:\n{df_sorted}\n")

# Removing duplicates based on 'Student_ID', keeping the entry with the highest grade for each student.
df_removed = df_sorted.drop_duplicates(subset=['Student_ID'], keep='first')
print(f"Removing Duplicates on Student_ID, Keeping the Highest Grade entry [Descending Order]:\n{df_removed}")

Original DataFrame after Sorting Student_ID [asc] & Grade [desc]:
   Student_ID     Name  Grade
3           1    Alice     88
0           1    Alice     85
1           2      Bob     92
4           2      Bob     90
2           3  Charlie     78
5           4    David     95

Removing Duplicates on Student_ID, Keeping the Highest Grade entry [Descending Order]:
   Student_ID     Name  Grade
3           1    Alice     88
1           2      Bob     92
2           3  Charlie     78
5           4    David     95
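
As a cross-check on the sort-then-drop approach, the same rows can be selected with `groupby` and `idxmax`. This is an alternative technique, not the method used above, and the names `best_rows` and `df_best` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Student_ID': [1, 2, 3, 1, 2, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'David'],
    'Grade': [85, 92, 78, 88, 90, 95]
})

# For each student, find the index label of the row holding the highest grade
best_rows = df.groupby('Student_ID')['Grade'].idxmax()

# Select those rows from the original DataFrame
df_best = df.loc[best_rows]
print(df_best)
```

If a student has tied top grades, `idxmax` returns the first such row, so the result still contains one row per student.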


### **Handling Duplicates in Index**
In Pandas, the **index** labels each row, but the labels are not required to be unique.
During data loading, merging, or transformation, **duplicate index values** can appear.
These duplicates can cause problems during operations like **joining**, **reindexing**, or **data alignment**, so detecting and handling them is important.

---
✅ **Key Takeaways**
* **Duplicate indexes** can cause alignment and merge issues in Pandas.
* Use **`index.duplicated()`** to detect them.
* Use boolean masking or **`reset_index()`** to clean or reset the index.
* Always ensure your index is **unique and meaningful** when performing operations like joins or merges.

➡️ **1. Identifying Duplicate Index Values**

You can check for duplicates in the DataFrame index using the **`index.duplicated()`** method, which returns a boolean array with one entry per row.

In [2]:
import pandas as pd

# Sample DataFrame with duplicate index values
df = pd.DataFrame({
    'A': [10, 20, 30, 40],
    'B': [5, 6, 7, 8]
}, index=['a', 'b', 'b', 'c'])

# Identify duplicate index values
print(df.index.duplicated())

[False False  True False]
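
For a quick yes/no check before a join or merge, the index also exposes the `is_unique` and `has_duplicates` attributes. A minimal sketch on the same sample frame (the name `n_repeats` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [10, 20, 30, 40],
    'B': [5, 6, 7, 8]
}, index=['a', 'b', 'b', 'c'])

# Attribute checks: no mask needed when only a yes/no answer is wanted
print(f"Index is unique?      {df.index.is_unique}")
print(f"Index has duplicates? {df.index.has_duplicates}")

# Count how many labels repeat an earlier label
n_repeats = int(df.index.duplicated().sum())
print(f"Repeated labels: {n_repeats}")
```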


➡️ **2. Filtering Out Duplicate Index Rows**

To remove rows with duplicate index values, use boolean indexing. You can also specify:
* `keep='last'` → Keeps the last occurrence.
* `keep=False` → Removes *all* duplicate index entries.

In [4]:
import pandas as pd

# Sample DataFrame with duplicate index values
df = pd.DataFrame({
    'A': [10, 20, 30, 40],
    'B': [5, 6, 7, 8]
}, index=['a', 'b', 'b', 'c'])

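# Keep only the first occurrence of each index label.
# duplicated() marks the repeats as True, so invert the mask with ~ to drop them.
df_no_dup_index = df[~df.index.duplicated(keep='first')]
print(df_no_dup_index)

    A  B
a  10  5
b  20  6
c  40  8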

➡️ **3. Resetting the Index**

If duplicate index values aren't meaningful, you can **reset the index** entirely.
* `drop=True` → Removes the old index instead of adding it as a new column.

In [6]:
import pandas as pd

# Sample DataFrame with duplicate index values
df = pd.DataFrame({
    'A': [10, 20, 30, 40],
    'B': [5, 6, 7, 8]
}, index=['a', 'b', 'b', 'c'])

df_reset = df.reset_index(drop=True)
print(df_reset)

    A  B
0  10  5
1  20  6
2  30  7
3  40  8
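
For comparison, here is a minimal sketch of `reset_index()` without `drop=True`. Since the sample index has no name, the old labels land in a column that pandas calls `index`. The name `df_kept` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [10, 20, 30, 40],
    'B': [5, 6, 7, 8]
}, index=['a', 'b', 'b', 'c'])

# Without drop=True, the old labels are preserved as an ordinary column
df_kept = df.reset_index()
print(df_kept)

# The new index is unique, and the old labels remain available for deduplication
print(f"New index unique? {df_kept.index.is_unique}")
```

Keeping the old labels is useful when they still carry information, for example to run `drop_duplicates(subset=['index'])` afterwards.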


In [8]:
import pandas as pd

# Sample data with duplicate indices
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [25, 30, 35, 40, 45]
}
# Creating a DataFrame with duplicate indices
df = pd.DataFrame(data, index=['a', 'b', 'c', 'a', 'e'])

# Identifying duplicate indices
duplicate_indices = df.index[df.index.duplicated()].unique()

print(f"Displaying the duplicate indices: {duplicate_indices}")

Displaying the duplicate indices: Index(['a'], dtype='object')
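
Once the duplicated labels are known, it usually helps to look at the rows that share them before deciding which to keep. A minimal sketch using `keep=False` on the same data (the name `rows_sharing_label` is illustrative):

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [25, 30, 35, 40, 45]
}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'a', 'e'])

# keep=False flags every row whose label appears more than once
rows_sharing_label = df[df.index.duplicated(keep=False)]
print(f"Rows with a duplicated index label:\n{rows_sharing_label}")
```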
