# Handling Duplicate Data in Pandas - Tutorial

### Scenario 1: Identifying Duplicate Rows
#### To identify duplicate rows in a DataFrame, use the `duplicated()` method. It returns `True` for each row that repeats an earlier row; the first occurrence stays `False`.

In [7]:
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 25, 40, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']
}
df = pd.DataFrame(data)
print("Our Dataframe:")
print(df)

# Identify duplicate rows
duplicates = df.duplicated()
print("\nDuplicate Rows:")
print(duplicates)

# Display rows that are duplicates
print("\nDuplicate Entries:")
print(df[df.duplicated()])




Our Dataframe:
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
2  Alice   25     New York
3  David   40      Chicago
4    Bob   30  Los Angeles

Duplicate Rows:
0    False
1    False
2     True
3    False
4     True
dtype: bool

Duplicate Entries:
    Name  Age         City
2  Alice   25     New York
4    Bob   30  Los Angeles
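
Because `duplicated()` returns a boolean Series, it can also answer "how many duplicates are there?" without printing every row. The following is a minimal sketch that rebuilds the same sample data; the variable names `n_duplicates` and `has_duplicates` are illustrative only.

```python
import pandas as pd

# Same sample data as the DataFrame used in this scenario
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 25, 40, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles'],
})

# True counts as 1, so summing the boolean mask counts the flagged rows
n_duplicates = int(df.duplicated().sum())
print("Number of duplicate rows:", n_duplicates)

# any() answers the yes/no question without listing rows
has_duplicates = bool(df.duplicated().any())
print("Any duplicates?", has_duplicates)
```

This count covers only the repeated rows, not the first occurrence of each group.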


### Scenario 2: Removing Duplicate Rows
#### To remove duplicate rows, use the `drop_duplicates()` method.

In [8]:
# Remove duplicate rows (default keeps the first occurrence)
deduped_df = df.drop_duplicates()
print("DataFrame After Removing Duplicates:")
print(deduped_df)

# Keep the last occurrence of duplicates
deduped_df_last = df.drop_duplicates(keep='last')
print("\nDataFrame Keeping the Last Duplicate:")
print(deduped_df_last)



DataFrame After Removing Duplicates:
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
3  David   40      Chicago

DataFrame Keeping the Last Duplicate:
    Name  Age         City
2  Alice   25     New York
3  David   40      Chicago
4    Bob   30  Los Angeles
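
Note that the outputs above keep the original index labels (0, 1, 3 and 2, 3, 4), which leaves gaps. If a clean 0-based index is preferred, one option is the `ignore_index` parameter. This sketch assumes pandas 1.0 or later, where `ignore_index` is available; on older versions, chaining `reset_index(drop=True)` should give the same result.

```python
import pandas as pd

# Same sample data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 25, 40, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles'],
})

# Drop duplicates and relabel the remaining rows 0..n-1
deduped_clean = df.drop_duplicates(ignore_index=True)
print(deduped_clean)

# Equivalent approach that does not rely on ignore_index
deduped_reset = df.drop_duplicates().reset_index(drop=True)
print(deduped_reset.equals(deduped_clean))
```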


### Scenario 3: Handling Duplicates Based on Specific Columns
#### To check for duplicates based on specific columns, use the `subset` parameter in `duplicated()` or `drop_duplicates()`.

In [9]:
# Check for duplicates based on the 'Name' column
duplicates_name = df.duplicated(subset=['Name'])
print("Duplicates Based on 'Name':")
print(duplicates_name)

# Remove duplicates based on the 'Name' column
deduped_name_df = df.drop_duplicates(subset=['Name'])
print("\nDataFrame After Removing Duplicates Based on 'Name':")
print(deduped_name_df)


Duplicates Based on 'Name':
0    False
1    False
2     True
3    False
4     True
dtype: bool

DataFrame After Removing Duplicates Based on 'Name':
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
3  David   40      Chicago
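
In the sample data, checking only `Name` happens to give the same result as checking whole rows. The difference shows up when rows share a name but differ elsewhere. The sketch below uses a small hypothetical DataFrame (`people` is not part of the tutorial's data) to compare a full-row check with one-column and two-column subsets.

```python
import pandas as pd

# Hypothetical data: two different people happen to share the name 'Alice'
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 41],
    'City': ['New York', 'Los Angeles', 'Boston'],
})

# Full-row comparison: no two rows are identical, so nothing is dropped
full_row = people.drop_duplicates()

# Name-only comparison: the second 'Alice' is treated as a duplicate
by_name = people.drop_duplicates(subset=['Name'])

# Name and City together: the two 'Alice' rows are distinct again
by_name_city = people.drop_duplicates(subset=['Name', 'City'])

print(len(full_row), len(by_name), len(by_name_city))
```

Choosing too narrow a subset can silently discard rows that describe different records, so it is worth picking the columns that actually identify a record.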


### Scenario 4: Marking Duplicate Rows Instead of Dropping Them
#### Sometimes, instead of dropping duplicates, you may want to mark them for further analysis.

In [11]:
# Create a column to indicate if a row is duplicate
df['Is_Duplicate'] = df.duplicated()
print("DataFrame with Duplicate Indicator:")
print(df)



DataFrame with Duplicate Indicator:
    Name  Age         City  Is_Duplicate
0  Alice   25     New York         False
1    Bob   30  Los Angeles         False
2  Alice   25     New York          True
3  David   40      Chicago         False
4    Bob   30  Los Angeles          True
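
One thing to watch for: once `Is_Duplicate` is part of `df`, a later call to `df.duplicated()` compares that column too, and the flagged rows no longer match their originals. A minimal sketch of the effect, and of using `subset` to compare only the data columns, follows; `data_columns` is an illustrative name.

```python
import pandas as pd

# Same sample data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 25, 40, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles'],
})
df['Is_Duplicate'] = df.duplicated()

# The flag column is now included in the comparison, so rows 0 and 2
# differ (False vs True) and no duplicates are reported
recheck_all = df.duplicated()
print("Duplicates found with flag column included:", int(recheck_all.sum()))

# Restricting the check to the original columns restores the expected result
data_columns = ['Name', 'Age', 'City']
recheck_subset = df.duplicated(subset=data_columns)
print("Duplicates found on data columns only:", int(recheck_subset.sum()))
```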


### Scenario 5: Keeping Duplicates for a Specific Task
#### If you want to keep only the rows that are duplicates for a specific task, filter on the `Is_Duplicate` column created in Scenario 4.

In [13]:
df

    Name  Age         City  Is_Duplicate
0  Alice   25     New York         False
1    Bob   30  Los Angeles         False
2  Alice   25     New York          True
3  David   40      Chicago         False
4    Bob   30  Los Angeles          True


In [19]:
# Keep only the rows flagged as duplicates
# (the column is already boolean, so no `== True` comparison is needed)
duplicates_only = df[df['Is_Duplicate']]
print("Only Duplicate Rows:")
print(duplicates_only)




Only Duplicate Rows:
    Name  Age         City  Is_Duplicate
2  Alice   25     New York          True
4    Bob   30  Los Angeles          True
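
The filter above returns only the later copies (rows 2 and 4), not the first occurrences they repeat. If the task needs every member of each duplicate group, `duplicated(keep=False)` marks all of them. This is a minimal sketch on the same sample data; the sort step is optional and only groups matching rows together for easier reading.

```python
import pandas as pd

# Same sample data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 25, 40, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles'],
})

# keep=False flags every row that has at least one identical twin
all_copies = df[df.duplicated(keep=False)]
print(all_copies)

# Sorting places the members of each duplicate group next to each other
grouped_view = all_copies.sort_values(['Name', 'Age', 'City'])
print(grouped_view)
```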


### Notes:
- Always check the shape of your DataFrame before and after removing duplicates to ensure the expected rows were affected.
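- `duplicated()` returns a boolean Series, while `drop_duplicates()` returns a new DataFrame and leaves the original unchanged by default.
- Use `inplace=True` in `drop_duplicates()` if you want to modify the original DataFrame directly; in that case the method returns `None`, so do not assign its result.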
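
The notes above can be checked with a short sketch. It compares row counts before and after deduplication and shows the return value when `inplace=True` is used; the names `removed`, `working`, and `result` are illustrative.

```python
import pandas as pd

# Same sample data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 25, 40, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles'],
})

# Compare row counts before and after to confirm how many rows were dropped
rows_before = df.shape[0]
deduped = df.drop_duplicates()
removed = rows_before - deduped.shape[0]
print(f"Rows before: {rows_before}, after: {deduped.shape[0]}, removed: {removed}")

# Without inplace=True the original DataFrame keeps all of its rows
print("Rows in original df:", df.shape[0])

# With inplace=True the frame is modified and the call returns None
working = df.copy()
result = working.drop_duplicates(inplace=True)
print("Return value with inplace=True:", result)
print("Rows in modified frame:", working.shape[0])
```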

The examples in this tutorial are written as Jupyter Notebook cells. Copy them into your notebook and run them in order, since Scenario 5 relies on the `Is_Duplicate` column that Scenario 4 adds to `df`.
