# **Lecture 9C**
# **Duplicated Data**


Duplicated rows in a DataFrame can cause problems when performing joins and can introduce errors into data summaries. In this lecture, we will learn how to identify duplicated rows and remove them from a DataFrame.

In [2]:
# Run the code below to access files in your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# We also need the Pandas module in this lecture
# Import the Pandas module
import pandas as pd

---
**Example 1:** Duplicated rows in a DataFrame can be problematic in data processing. Pandas provides the method **duplicated()** for flagging duplicated rows in a DataFrame. The syntax is **df.duplicated(subset=*list_of_column_names*, keep=*keep_option*)**.
* The method returns a Series of True/False values indicating which rows are duplicated.
* The **subset=** option allows us to specify a list of column names. Only the specified columns are compared when checking whether rows are duplicated. If this option is not specified, all columns are used.
* The **keep=** option can be **"first"**, **"last"** or **False**. If **"first"** is used, all duplicates are marked as True except for the first occurrence. If **"last"** is used, all duplicates are marked as True except for the last occurrence. If **False** is used, all duplicates are marked as True, including the first and last occurrences.
* The returned Series can be used with DataFrame slicing to exclude duplicated rows.
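
The effect of the three **keep=** settings is easiest to see on a tiny table. The sketch below is an illustration only: the DataFrame `toy` and its contents are hypothetical and are not part of the lecture data file.

```python
import pandas as pd

# Hypothetical toy data (not the lecture file): rows 0 and 2 are identical
toy = pd.DataFrame({
    "product_code": ["X1", "X2", "X1"],
    "product_name": ["Apple Juice", "Orange Juice", "Apple Juice"],
})

# keep="first": later occurrences are flagged, the first one is not
first_flags = toy.duplicated(keep="first")

# keep="last": earlier occurrences are flagged, the last one is not
last_flags = toy.duplicated(keep="last")

# keep=False: every row that belongs to a duplicate group is flagged
all_flags = toy.duplicated(keep=False)

# Show the three results side by side, one column per keep= setting
print(pd.DataFrame({"first": first_flags, "last": last_flags, "False": all_flags}))
```

With **keep="first"** only row 2 is flagged, with **keep="last"** only row 0 is flagged, and with **keep=False** both rows 0 and 2 are flagged. Row 1 is never flagged because it has no duplicate.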


In [7]:
# Read inventory_dup.xlsx data file
# This data file has some duplicated rows
inventory = pd.read_excel("/content/drive/MyDrive/Data/inventory_dup.xlsx",sheet_name="data")

# Display the data
print("Original Data:")
display(inventory)

# Screen for duplicated rows by product_code and product_name
result = inventory.duplicated(subset=["product_code","product_name"],keep="first")

# Check which rows are duplicates
# Rows 3, 7 and 11 are all Jacky Cola (same product_code and product_name)
# keep="first" will mark row 3 as False and rows 7 and 11 as True
# Rows 1 and 12 are both Tasty Lucheon Meat
# keep="first" will mark row 1 as False and row 12 as True
print()
print("The resulting Series showing which rows are duplicates:")
print(result)



Original Data:


Unnamed: 0,product_code,product_name,origin,unit_price,quantity
0,A111,ABC Tomato Soup,Japan,12,52
1,B223,Tasty Lucheon Meat,China,25,60
2,A112,ABC Mushroom Soup,Japan,13,34
3,D871,Jacky Cola,Thailand,4,4
4,B201,Tasty Corn Beef,China,16,50
5,C204,Star Chocolate,USA,20,100
6,A342,ABC Chicken Soup,Japan,13,61
7,D871,Jacky Cola,Thailand,4,4
8,B201,Tasty Tuna,China,14,86
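9,C491,Star Jello,USA,18,67
10,D481,Jacky Ginger Beer,Thailand,15,13
11,D871,Jacky Cola,Thailand,3,6
12,B223,Tasty Lucheon Meat,China,25,60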



The resulting Series showing which rows are duplicates:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11     True
12     True
dtype: bool


In [8]:
# To remove the duplicates, use the True/False Series to select rows
# "~" negates a Boolean Series element by element, so ~result is True for the non-duplicated rows
# inventory.loc[~result] keeps the non-duplicated rows; inventory.loc[result] keeps the rows that were removed
inventory_clean = inventory.loc[~result]
inventory_dup = inventory.loc[result]
print()
print("Data after removal of duplicates:")
display(inventory_clean)
print()
print("Data removed:")
display(inventory_dup)


Data after removal of duplicates:


Unnamed: 0,product_code,product_name,origin,unit_price,quantity
0,A111,ABC Tomato Soup,Japan,12,52
1,B223,Tasty Lucheon Meat,China,25,60
2,A112,ABC Mushroom Soup,Japan,13,34
3,D871,Jacky Cola,Thailand,4,4
4,B201,Tasty Corn Beef,China,16,50
5,C204,Star Chocolate,USA,20,100
6,A342,ABC Chicken Soup,Japan,13,61
8,B201,Tasty Tuna,China,14,86
9,C491,Star Jello,USA,18,67
10,D481,Jacky Ginger Beer,Thailand,15,13



Data removed:


Unnamed: 0,product_code,product_name,origin,unit_price,quantity
7,D871,Jacky Cola,Thailand,4,4
11,D871,Jacky Cola,Thailand,3,6
12,B223,Tasty Lucheon Meat,China,25,60
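
Pandas also offers the method **drop_duplicates()**, which accepts the same **subset=** and **keep=** options and returns the de-duplicated DataFrame in one step. The sketch below is not part of the original notebook. As an assumption made so that it runs without the Excel file, it rebuilds the 13 inventory rows in memory from the outputs shown in this lecture, then checks that slicing with the mask and calling **drop_duplicates()** give the same result.

```python
import pandas as pd

# The 13 inventory rows as they appear in the outputs of this lecture,
# rebuilt in memory so the sketch does not depend on Google Drive
inventory = pd.DataFrame(
    [
        ["A111", "ABC Tomato Soup", "Japan", 12, 52],
        ["B223", "Tasty Lucheon Meat", "China", 25, 60],
        ["A112", "ABC Mushroom Soup", "Japan", 13, 34],
        ["D871", "Jacky Cola", "Thailand", 4, 4],
        ["B201", "Tasty Corn Beef", "China", 16, 50],
        ["C204", "Star Chocolate", "USA", 20, 100],
        ["A342", "ABC Chicken Soup", "Japan", 13, 61],
        ["D871", "Jacky Cola", "Thailand", 4, 4],
        ["B201", "Tasty Tuna", "China", 14, 86],
        ["C491", "Star Jello", "USA", 18, 67],
        ["D481", "Jacky Ginger Beer", "Thailand", 15, 13],
        ["D871", "Jacky Cola", "Thailand", 3, 6],
        ["B223", "Tasty Lucheon Meat", "China", 25, 60],
    ],
    columns=["product_code", "product_name", "origin", "unit_price", "quantity"],
)

key_columns = ["product_code", "product_name"]

# Approach used in this lecture: build the mask, then slice with its negation
mask = inventory.duplicated(subset=key_columns, keep="first")
via_mask = inventory.loc[~mask]

# One-step alternative with the same options
via_method = inventory.drop_duplicates(subset=key_columns, keep="first")

# Both approaches keep the same rows with the same index labels
print(via_mask.equals(via_method))
print(list(via_method.index))
```

The two-step approach is still useful when the removed rows need to be inspected, because the mask can be reused to select them, as done with `inventory_dup` above.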


---
**Example 2:** In this example, we will modify some of the options used in the previous example.
* The **subset=** option is omitted, meaning that all columns are used to identify duplicated rows.
* The **keep=** option is changed to **"last"**, meaning that all duplicates are marked as True except the last occurrence.
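
To see how omitting **subset=** changes the result, the sketch below looks only at the three Jacky Cola rows (index 3, 7 and 11), with values copied from the outputs in this lecture. The small DataFrame `cola` is built by hand for illustration and is an assumption of this sketch, not something the notebook creates.

```python
import pandas as pd

# The three Jacky Cola rows from the inventory data, keeping their original index labels
cola = pd.DataFrame(
    {
        "product_code": ["D871", "D871", "D871"],
        "product_name": ["Jacky Cola", "Jacky Cola", "Jacky Cola"],
        "origin": ["Thailand", "Thailand", "Thailand"],
        "unit_price": [4, 4, 3],
        "quantity": [4, 4, 6],
    },
    index=[3, 7, 11],
)

# Compare only product_code and product_name: all three rows form one duplicate group
by_key = cola.duplicated(subset=["product_code", "product_name"], keep="last")

# Compare all columns: row 11 differs in unit_price and quantity, so it stands alone
by_all = cola.duplicated(keep="last")

print(pd.DataFrame({"by_key": by_key, "by_all": by_all}))
```

When only the two key columns are compared, rows 3 and 7 are flagged and row 11 is kept as the last occurrence. When all columns are compared, only row 3 is flagged, because row 7 is the last exact copy and row 11 is not an exact copy at all.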

In [None]:
# Read inventory_dup.xlsx data file
# This data file has some duplicated rows
inventory = pd.read_excel("/content/drive/MyDrive/Data/inventory_dup.xlsx",sheet_name="data")

# Display the data
print("Original Data:")
display(inventory)

# Screen for duplicated rows using all columns
# Rows 1 and 12 are identical; with keep="last", row 1 is marked as a duplicate (True)
# Rows 3 and 7 are identical; with keep="last", row 3 is marked as a duplicate (True)
# Row 11 is not a duplicate of rows 3 and 7 because its unit_price and quantity are different
result = inventory.duplicated(keep="last")

# Check which rows are duplicates
print("The resulting Series showing which rows are duplicates:")
print(result)



Original Data:


Unnamed: 0,product_code,product_name,origin,unit_price,quantity
0,A111,ABC Tomato Soup,Japan,12,52
1,B223,Tasty Lucheon Meat,China,25,60
2,A112,ABC Mushroom Soup,Japan,13,34
3,D871,Jacky Cola,Thailand,4,4
4,B201,Tasty Corn Beef,China,16,50
5,C204,Star Chocolate,USA,20,100
6,A342,ABC Chicken Soup,Japan,13,61
7,D871,Jacky Cola,Thailand,4,4
8,B201,Tasty Tuna,China,14,86
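9,C491,Star Jello,USA,18,67
10,D481,Jacky Ginger Beer,Thailand,15,13
11,D871,Jacky Cola,Thailand,3,6
12,B223,Tasty Lucheon Meat,China,25,60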


The resulting Series showing which rows are duplicates:
0     False
1      True
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
dtype: bool


In [None]:
# To remove the duplicates, use the True/False Series to select rows
# "~" negates a Boolean Series element by element, so ~result is True for the non-duplicated rows
# inventory.loc[~result] keeps the non-duplicated rows; inventory.loc[result] keeps the rows that were removed
inventory_clean = inventory.loc[~result]
inventory_dup = inventory.loc[result]
print()
print("Data after removal of duplicates:")
display(inventory_clean)
print()
print("Data removed:")
display(inventory_dup)


Data after removal of duplicates:


Unnamed: 0,product_code,product_name,origin,unit_price,quantity
0,A111,ABC Tomato Soup,Japan,12,52
2,A112,ABC Mushroom Soup,Japan,13,34
4,B201,Tasty Corn Beef,China,16,50
5,C204,Star Chocolate,USA,20,100
6,A342,ABC Chicken Soup,Japan,13,61
7,D871,Jacky Cola,Thailand,4,4
8,B201,Tasty Tuna,China,14,86
9,C491,Star Jello,USA,18,67
10,D481,Jacky Ginger Beer,Thailand,15,13
11,D871,Jacky Cola,Thailand,3,6
12,B223,Tasty Lucheon Meat,China,25,60



Data removed:


Unnamed: 0,product_code,product_name,origin,unit_price,quantity
1,B223,Tasty Lucheon Meat,China,25,60
3,D871,Jacky Cola,Thailand,4,4
