## Topic: Handling Duplicate Values in a DataFrame

### OUTCOMES

- 1. Detecting Duplicate Rows
    - df.duplicated()

- 2. Remove Duplicate Rows
    - df.drop_duplicates()

- 3. Remove Duplicates Based on Specific Column
    - df.drop_duplicates(subset = ["Column_name"])

- 4. Remove Duplicates Based on Multiple Columns
    - df.drop_duplicates(subset = ["col1_name", "col2_name"], keep = 'first', inplace = False)

- 5. Keep Last Occurrence
    - df.drop_duplicates(subset = ["col_name"], keep = 'last')

In [1]:
import pandas as pd

### 1. Detecting Duplicate Rows

- Identify duplicate rows (rows whose values match an earlier row in every column).

- syntax:
    df.duplicated(subset = None, keep = 'first')

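    - subset => columns to compare when checking for duplicates (default: all columns)

    - keep => which occurrence is left unmarked: 'first' (default) marks every duplicate except the first occurrence, 'last' marks every duplicate except the last, and False marks all of them.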
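The effect of the three `keep` options can be sketched on a tiny frame. The `demo` DataFrame and its values are hypothetical and chosen only so that rows 0, 2 and 3 are identical.

```python
import pandas as pd

# Small illustrative frame (hypothetical values): rows 0, 2 and 3 are identical.
demo = pd.DataFrame({
    "ID": [101, 102, 101, 101],
    "Name": ["Alice", "Bob", "Alice", "Alice"],
})

# keep='first' (default): every repeat after the first occurrence is flagged.
mask_first = demo.duplicated(keep="first")

# keep='last': every repeat before the last occurrence is flagged.
mask_last = demo.duplicated(keep="last")

# keep=False: all members of a duplicate group are flagged.
mask_all = demo.duplicated(keep=False)

print(mask_first.tolist())  # [False, False, True, True]
print(mask_last.tolist())   # [True, False, True, False]
print(mask_all.tolist())    # [True, False, True, True]
```

The unflagged row in each group is the one that `drop_duplicates()` would retain under the same `keep` setting.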

In [12]:
data = {
    'ID': [101, 102, 103, 101, 104, 103],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Charlie'],
    'Salary': [50000, 60000, 55000, 50000, 70000, 55000]
}

df = pd.DataFrame(data)

df

Unnamed: 0,ID,Name,Salary
0,101,Alice,50000
1,102,Bob,60000
2,103,Charlie,55000
3,101,Alice,50000
4,104,David,70000
5,103,Charlie,55000


In [None]:
# Example: identify the duplicate rows

df.duplicated()

# True  -> row is a duplicate of an earlier row
# False -> row is not a duplicate

0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool

In [None]:
# count how many rows are duplicates

int(df.duplicated().sum())

# for the given dataset, 2 rows (records) are duplicates

2
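Beyond counting, it is often useful to look at the duplicated rows themselves. A minimal sketch, rebuilding the same sample data under the assumed name `df_demo` so the notebook's `df` is not touched:

```python
import pandas as pd

# Same sample data as above, under a separate (assumed) name.
df_demo = pd.DataFrame({
    "ID": [101, 102, 103, 101, 104, 103],
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Charlie"],
    "Salary": [50000, 60000, 55000, 50000, 70000, 55000],
})

# keep=False flags every member of a duplicate group, so both copies show up.
dupes = df_demo[df_demo.duplicated(keep=False)]

# Sorting by all columns places identical rows next to each other.
dupes_sorted = dupes.sort_values(by=list(df_demo.columns))
print(dupes_sorted)
```

Using the boolean Series as a row filter keeps the original index labels, which makes it easy to trace each duplicate back to its position in the source data.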

### 2. Remove Duplicate Rows

- df.drop_duplicates() to remove duplicate rows

In [11]:
# remove duplicate rows
# (returns a new DataFrame; df itself is unchanged, so later cells still see all 6 rows)

df.drop_duplicates()

Unnamed: 0,ID,Name,Salary
0,101,Alice,50000
1,102,Bob,60000
2,103,Charlie,55000
4,104,David,70000
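Note that the result above keeps the original index labels (0, 1, 2, 4). The sketch below, using a copy of the sample data under the assumed name `df_demo`, shows that `drop_duplicates()` returns a new DataFrame unless `inplace=True` is passed, and that `ignore_index=True` relabels the result.

```python
import pandas as pd

# Same sample data as above, under a separate (assumed) name.
df_demo = pd.DataFrame({
    "ID": [101, 102, 103, 101, 104, 103],
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Charlie"],
    "Salary": [50000, 60000, 55000, 50000, 70000, 55000],
})

# Default call: returns a new DataFrame; df_demo itself is unchanged.
deduped = df_demo.drop_duplicates()

# ignore_index=True relabels the result 0..n-1 instead of keeping 0, 1, 2, 4.
deduped_reset = df_demo.drop_duplicates(ignore_index=True)

print(len(df_demo), len(deduped))    # 6 4
print(deduped.index.tolist())        # [0, 1, 2, 4]
print(deduped_reset.index.tolist())  # [0, 1, 2, 3]
```

Assigning the result to a new name is usually easier to reason about in a notebook than `inplace=True`, because re-running cells out of order cannot silently shrink the original data.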


### 3. Remove Duplicates Based on Specific (single) Column

- Syntax: 
    - df.drop_duplicates(subset = 'col_name', keep = 'first')

In [14]:
df

Unnamed: 0,ID,Name,Salary
0,101,Alice,50000
1,102,Bob,60000
2,103,Charlie,55000
3,101,Alice,50000
4,104,David,70000
5,103,Charlie,55000


In [15]:
# Remove duplicates based on a specific (single) column
df.drop_duplicates(subset='ID')

Unnamed: 0,ID,Name,Salary
0,101,Alice,50000
1,102,Bob,60000
2,103,Charlie,55000
4,104,David,70000
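In the sample data the rows that share an ID are identical in every column, so `subset='ID'` gives the same result as a full-row check. That is not true in general. A small sketch with hypothetical data (`records`) where one ID appears with two different salaries:

```python
import pandas as pd

# Hypothetical data: ID 101 appears twice with different salaries.
records = pd.DataFrame({
    "ID": [101, 102, 101],
    "Name": ["Alice", "Bob", "Alice"],
    "Salary": [50000, 60000, 52000],
})

# No row is a full duplicate, so nothing is dropped here.
full = records.drop_duplicates()

# Only "ID" is compared, so the second 101 row (Salary 52000) is removed.
by_id = records.drop_duplicates(subset="ID")

print(len(full))                 # 3
print(by_id["Salary"].tolist())  # [50000, 60000]
```

Because `subset` ignores the other columns, it can discard rows that carry different information; it is worth checking which rows will be lost before relying on it.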


### 4. Remove Duplicates Based on Multiple Columns

- Syntax:
    - df.drop_duplicates(subset = ["col1", "col2"], inplace = False)

In [17]:
df.drop_duplicates(subset=['ID', 'Name'])

Unnamed: 0,ID,Name,Salary
0,101,Alice,50000
1,102,Bob,60000
2,103,Charlie,55000
4,104,David,70000
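With a list in `subset`, a row counts as a duplicate only when the combination of all listed columns has been seen before. The sketch below uses hypothetical data (`staff`) and `duplicated()` with the same `subset` argument to preview what would be dropped.

```python
import pandas as pd

# Hypothetical data: two rows share ID 101 but carry different names.
staff = pd.DataFrame({
    "ID": [101, 101, 102, 102],
    "Name": ["Alice", "Alicia", "Bob", "Bob"],
})

# Comparing on "ID" alone flags rows 1 and 3.
by_id = staff.duplicated(subset=["ID"])

# Comparing on the (ID, Name) pair flags only row 3.
by_pair = staff.duplicated(subset=["ID", "Name"])

print(by_id.tolist())    # [False, True, False, True]
print(by_pair.tolist())  # [False, False, False, True]
```

Adding columns to `subset` makes the match stricter, so fewer rows are treated as duplicates.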


### 5. Keep Last Occurrence

- Keep the last occurrence of each duplicate instead of the first.

- Syntax:
    - df.drop_duplicates(subset = 'col_name', keep = 'last')

    - Default: keep = 'first'

In [None]:
# DataFrame
df

Unnamed: 0,ID,Name,Salary
0,101,Alice,50000
1,102,Bob,60000
2,103,Charlie,55000
3,101,Alice,50000
4,104,David,70000
5,103,Charlie,55000


In [None]:
df.drop_duplicates(subset="ID", keep="last")

# with keep='last', the earlier occurrence of each duplicated ID is dropped:
#    -> idx 0 (ID 101) => dropped
#    -> idx 2 (ID 103) => dropped


Unnamed: 0,ID,Name,Salary
1,102,Bob,60000
3,101,Alice,50000
4,104,David,70000
5,103,Charlie,55000
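A common use of `keep='last'` is to sort first, so that the row worth keeping ends up last within each group. A minimal sketch with hypothetical data (`history`), keeping the highest salary per ID:

```python
import pandas as pd

# Hypothetical data: each ID has more than one salary record.
history = pd.DataFrame({
    "ID": [101, 102, 101, 102],
    "Salary": [50000, 60000, 55000, 58000],
})

# Sort so the row to retain comes last within each ID, keep that row,
# then restore the original row order.
highest = (
    history.sort_values("Salary")
           .drop_duplicates(subset="ID", keep="last")
           .sort_index()
)

print(highest["Salary"].tolist())  # [60000, 55000]
```

The same pattern works for "most recent record per key" when the frame is sorted by a date column instead of a salary.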


In [27]:
employees = [
    {"emp_id": 101, "name": "John", "department": "IT", "salary": 85000, "age": 29},
    {"emp_id": 102, "name": "Anna", "department": "HR", "salary": 62000, "age": 31},
    {"emp_id": 103, "name": "David", "department": "Finance", "salary": 72000, "age": 34},
    {"emp_id": 104, "name": "Lisa", "department": "IT", "salary": 85000, "age": 29},
    {"emp_id": 105, "name": "Tom", "department": "Marketing", "salary": 68000, "age": 27},
    {"emp_id": 106, "name": "John", "department": "IT", "salary": 85000, "age": 29},
    {"emp_id": 107, "name": "Susan", "department": "Finance", "salary": 72000, "age": 34},
    {"emp_id": 108, "name": "Anna", "department": "HR", "salary": 62000, "age": 31},
    {"emp_id": 109, "name": "Robert", "department": "IT", "salary": 95000, "age": 36},
    {"emp_id": 110, "name": "Tom", "department": "Marketing", "salary": 68000, "age": 27},
    {"emp_id": 111, "name": "John", "department": "Finance", "salary": 78000, "age": 30},
    {"emp_id": 112, "name": "Anna", "department": "IT", "salary": 70000, "age": 32},
    {"emp_id": 113, "name": "David", "department": "Finance", "salary": 72000, "age": 34},
    {"emp_id": 114, "name": "Lisa", "department": "HR", "salary": 87000, "age": 29},
    {"emp_id": 115, "name": "Tom", "department": "Finance", "salary": 74000, "age": 27},
    {"emp_id": 116, "name": "John", "department": "IT", "salary": 95000, "age": 29},
    {"emp_id": 117, "name": "Anna", "department": "Finance", "salary": 63000, "age": 31},
    {"emp_id": 118, "name": "David", "department": "Marketing", "salary": 76000, "age": 34},
    {"emp_id": 119, "name": "Lisa", "department": "Finance", "salary": 85000, "age": 30},
    {"emp_id": 120, "name": "Tom", "department": "Marketing", "salary": 68000, "age": 27},
    {"emp_id": 121, "name": "Susan", "department": "Finance", "salary": 72000, "age": 34},
    {"emp_id": 122, "name": "Anna", "department": "HR", "salary": 62000, "age": 31},
    {"emp_id": 123, "name": "Robert", "department": "IT", "salary": 95000, "age": 36},
    {"emp_id": 124, "name": "John", "department": "IT", "salary": 85000, "age": 29},
    {"emp_id": 125, "name": "Lisa", "department": "Finance", "salary": 85000, "age": 30}
]


df = pd.DataFrame(employees)
df

Unnamed: 0,emp_id,name,department,salary,age
0,101,John,IT,85000,29
1,102,Anna,HR,62000,31
2,103,David,Finance,72000,34
3,104,Lisa,IT,85000,29
4,105,Tom,Marketing,68000,27
5,106,John,IT,85000,29
6,107,Susan,Finance,72000,34
7,108,Anna,HR,62000,31
8,109,Robert,IT,95000,36
9,110,Tom,Marketing,68000,27


### Task 1

- Find all fully duplicate rows in the dataset.

In [None]:
# Count fully duplicated rows

df.duplicated().sum()

# No row is a full duplicate (every emp_id is unique)

np.int64(0)

In [31]:
# Count the unique values in each column of df

df.nunique()

emp_id        25
name           7
department     4
salary        11
age            7
dtype: int64
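Comparing these counts with the number of rows shows which columns contain repeated values and are therefore candidates for `subset`. A small sketch on a hypothetical miniature table (`emp`):

```python
import pandas as pd

# Hypothetical miniature of an employee table.
emp = pd.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "name": ["John", "Anna", "John", "Lisa"],
    "department": ["IT", "HR", "IT", "IT"],
})

# A column whose unique count is below the row count contains repeated values.
has_repeats = emp.nunique() < len(emp)

print(has_repeats.tolist())  # [False, True, True]
```

A column with as many unique values as rows (here `emp_id`) guarantees that no row can be a full duplicate.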

### Task 2

- Remove duplicates based on "name" only and display the new DataFrame.

In [30]:
df.drop_duplicates(subset='name')

Unnamed: 0,emp_id,name,department,salary,age
0,101,John,IT,85000,29
1,102,Anna,HR,62000,31
2,103,David,Finance,72000,34
3,104,Lisa,IT,85000,29
4,105,Tom,Marketing,68000,27
6,107,Susan,Finance,72000,34
8,109,Robert,IT,95000,36


### Task 3

- Remove duplicates based on both "name" and "department", but keep the last occurrence.

In [38]:
df

Unnamed: 0,emp_id,name,department,salary,age
0,101,John,IT,85000,29
1,102,Anna,HR,62000,31
2,103,David,Finance,72000,34
3,104,Lisa,IT,85000,29
4,105,Tom,Marketing,68000,27
5,106,John,IT,85000,29
6,107,Susan,Finance,72000,34
7,108,Anna,HR,62000,31
8,109,Robert,IT,95000,36
9,110,Tom,Marketing,68000,27


In [None]:
# keep = 'first' (default): keeps the first row of each (name, department) pair
df.drop_duplicates(subset=['name','department'], keep = 'first')

Unnamed: 0,emp_id,name,department,salary,age
0,101,John,IT,85000,29
1,102,Anna,HR,62000,31
2,103,David,Finance,72000,34
3,104,Lisa,IT,85000,29
4,105,Tom,Marketing,68000,27
6,107,Susan,Finance,72000,34
8,109,Robert,IT,95000,36
10,111,John,Finance,78000,30
11,112,Anna,IT,70000,32
13,114,Lisa,HR,87000,29


In [None]:
# keep = 'last': keeps the last row of each (name, department) pair
df.drop_duplicates(subset=['name','department'], keep = 'last')

Unnamed: 0,emp_id,name,department,salary,age
3,104,Lisa,IT,85000,29
10,111,John,Finance,78000,30
11,112,Anna,IT,70000,32
12,113,David,Finance,72000,34
13,114,Lisa,HR,87000,29
14,115,Tom,Finance,74000,27
16,117,Anna,Finance,63000,31
17,118,David,Marketing,76000,34
19,120,Tom,Marketing,68000,27
20,121,Susan,Finance,72000,34
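Both `keep` settings leave one row per (name, department) pair, so the row counts match and only the surviving rows differ. A sketch on a hypothetical miniature of the employee data (`emp`), small enough to check by hand:

```python
import pandas as pd

# Hypothetical miniature of the employee data.
emp = pd.DataFrame({
    "emp_id": [1, 2, 3, 4, 5],
    "name": ["John", "Anna", "John", "Lisa", "Anna"],
    "department": ["IT", "HR", "IT", "IT", "HR"],
})

keys = ["name", "department"]
first = emp.drop_duplicates(subset=keys, keep="first")
last = emp.drop_duplicates(subset=keys, keep="last")

# Same number of rows either way; only the surviving emp_id per pair differs.
print(len(first), len(last))     # 3 3
print(first["emp_id"].tolist())  # [1, 2, 4]
print(last["emp_id"].tolist())   # [3, 4, 5]
```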
