<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST3512/blob/main/CST3512_Class07.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CST3512 Class #07
**Rationalizing Duplicates in Pandas**

from **Finding and removing duplicate rows in Pandas DataFrame**:    

**Pandas tips and tricks to help you get started** with data analysis
by Bindi Chen. Medium. March 23, 2021.    

Available through paywall at: `https://towardsdatascience.com/finding-and-removing-duplicate-rows-in-pandas-dataframe-c6117668631f`

*Exercise data from the titanic data set on Kaggle at:* `https://www.kaggle.com/c/titanic/data`


This notebook demonstrates two methods, `duplicated()` and `drop_duplicates()`, for finding and removing duplicate rows, and shows how to modify their behavior to suit specific needs.

This notebook is structured as follows:
1. Finding duplicate rows
2. Counting duplicate and non-duplicate rows
3. Extracting duplicate rows with `loc`
4. Determining which duplicates to mark with `keep`
5. Dropping duplicate rows    





---



## 0 - Module and File Housekeeping    

First, import requisite libraries/modules (only pandas)

In [1]:
import pandas as pd

Then read the training file from the Titanic dataset sourced from Kaggle (`https://www.kaggle.com/c/titanic/data`) and keep a subset of that data.

*Note: this example assumes the `train.csv` file has been renamed `titanic_train.csv` and uploaded to the `/content` folder of the current Colab session.*


In [4]:
# define a function `load_data()` to open a file and return a DataFrame
def load_data(): 
    df_all = pd.read_csv('titanic_train.csv')  # renamed and uploaded to /content
    # Select index labels 0 through 300 (.loc slicing includes the end label,
    # so this is 301 rows), then drop records with missing data
    return df_all.loc[:300, ['Survived', 'Pclass', 'Sex', 'Cabin', 'Embarked']].dropna()

# create a DataFrame `df` for use in this exercise
df = load_data()

Have a look at the DataFrame:

In [None]:
df.head()
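
One detail of `load_data()` is worth checking: `.loc` slices by label and includes the end label, so `.loc[:300]` selects 301 rows on a default integer index rather than 300. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not Titanic data) to show that behavior together with `dropna()`.

```python
import pandas as pd

# Hypothetical stand-in data; not taken from the Titanic file
toy = pd.DataFrame({
    'Pclass': [1, 3, 1, 2, 3],
    'Cabin': ['C85', None, 'C123', None, 'G6'],
})

by_label = toy.loc[:2, ['Pclass', 'Cabin']]  # labels 0, 1, 2 -> 3 rows
by_position = toy.iloc[:2]                   # positions 0, 1 -> 2 rows
complete = by_label.dropna()                 # drops the row whose Cabin is missing

print(len(by_label), len(by_position), len(complete))  # 3 2 2
```

If exactly 300 rows are wanted, `.iloc[:300]` is one way to get them, since positional slicing excludes the end.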

## 1 - Find Duplicate Rows     

To find duplicates in a specific column, we can simply call the `duplicated()` method on that column.


In [None]:
# 1-A
# For a single column
df.Cabin.duplicated()

The result is a boolean Series in which the value True denotes a duplicate. In other words, True means the entry is identical to a previous one.    
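
As a minimal sketch of what that boolean Series looks like, the snippet below runs `duplicated()` on a short hand-made Series. The name `cabins` and its values are assumptions chosen for illustration, not rows from the Titanic file.

```python
import pandas as pd

# Made-up cabin labels; 'C85' appears three times
cabins = pd.Series(['C85', 'E46', 'C85', 'G6', 'C85'])

mask = cabins.duplicated()
print(mask.tolist())  # [False, False, True, False, True]
```

The first `'C85'` is False because nothing precedes it; only the later repeats are flagged.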

To take a look at the duplication in the DataFrame as a whole, just call the `duplicated()` method on the DataFrame. It outputs True if an entire row is identical to a previous row.    


In [None]:
# 1-B
# For a DataFrame as a whole
df.duplicated()

To consider only certain columns when identifying duplicates, we can pass a list of columns to the `subset` argument:     


In [None]:
# 1-C
# To consider certain columns for identifying duplicates
df.duplicated(subset=['Survived', 'Pclass', 'Sex'])
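
To see how `subset` changes the answer, here is a small sketch on a hand-made frame. The name `toy` and its values are illustrative assumptions. Whole-row comparison flags only exact repeats, while the three-column subset also flags rows that differ in `Cabin`.

```python
import pandas as pd

# Hypothetical data: rows 0 and 3 are identical; row 1 differs only in Cabin
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 1],
    'Pclass': [1, 1, 3, 1],
    'Sex': ['female', 'female', 'male', 'female'],
    'Cabin': ['C85', 'C123', 'G6', 'C85'],
})

whole_row = toy.duplicated()
on_subset = toy.duplicated(subset=['Survived', 'Pclass', 'Sex'])

print(whole_row.tolist())  # [False, False, False, True]
print(on_subset.tolist())  # [False, True, False, True]
```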

## 2 - Counting Duplicates and Non-Duplicates    



The result of `duplicated()` is a boolean Series, so we can sum it to count the number of duplicates. Behind the scenes, each True is treated as 1 and each False as 0 before the values are added up.

In [None]:
# 2-A
# For a single column
df.Cabin.duplicated().sum()

Just like before, we can count the duplicates in the DataFrame as a whole and on certain columns.

In [None]:
# 2-B
# For a DataFrame as a whole
df.duplicated().sum()

In [None]:
# 2-C
# To consider certain columns for identifying duplicates
df.duplicated(subset=['Survived', 'Pclass', 'Sex']).sum()

If you want to count the number of non-duplicates (the number of False values), you can invert the Series with negation (`~`) and then call `sum()`:

In [None]:
# 2-D
# Count the number of non-duplicates
(~df.duplicated()).sum()
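
The parentheses around `~df.duplicated()` matter. As a sketch on made-up data (the name `cabins` and its values are assumptions), the snippet below compares the two counts and shows what happens when the parentheses are left out: the method call binds first, so `~` is applied to the integer total instead of to the Series.

```python
import pandas as pd

# Made-up cabin labels; 'C85' appears three times
cabins = pd.Series(['C85', 'E46', 'C85', 'G6', 'C85'])
mask = cabins.duplicated()

n_dup = int(mask.sum())         # each True counts as 1
n_non_dup = int((~mask).sum())  # invert the Series first, then sum
wrong = int(~mask.sum())        # inverts the integer 2 bitwise, not the Series

print(n_dup, n_non_dup, wrong)  # 2 3 -3
```

The two correct counts always add up to the length of the Series.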

## 3 - Extracting Duplicate Rows with `loc`    


Pandas `duplicated()` returns a boolean Series. However, a list of True and False values is not practical to read when we need to perform some data analysis.
We can use the Pandas `loc` data selector to extract those duplicate rows:

In [None]:
# 3-A 
# This allows us to see the rows that were identified by duplicated()
df.loc[df.duplicated(), :]

`loc` can take a boolean Series and filter rows based on True and False. The first argument, `df.duplicated()`, selects the rows that `duplicated()` marked as True. The second argument, `:`, selects all columns.
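
A short sketch of the same selection on a hand-made frame follows. The name `toy` and its values are illustrative assumptions. It also shows that the second argument can name a single column, and that plain bracket indexing with the mask gives the same rows as `loc` with `:`.

```python
import pandas as pd

# Hypothetical data: row 3 repeats row 0
toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 1],
    'Cabin': ['C85', 'C123', 'G6', 'C85'],
})

mask = toy.duplicated()
dups = toy.loc[mask, :]              # all columns of the flagged rows
only_cabin = toy.loc[mask, 'Cabin']  # one column of the flagged rows
same_rows = toy[mask]                # bracket indexing with the same mask

print(dups.index.tolist())   # [3]
print(only_cabin.tolist())   # ['C85']
```

The extracted rows keep their original index labels, which makes it easy to trace them back to the source DataFrame.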

## 4 - Determining Which Duplicates to Mark with `keep`    


There is an argument `keep` in Pandas `duplicated()` to determine which duplicates to mark. `keep` defaults to `'first'`, which means the first occurrence is left unmarked (False) and all later occurrences are marked as duplicates.
We can change it to `'last'` to leave the last occurrence unmarked and mark all earlier ones as duplicates.

In [None]:
# 4-A 
# `keep` defaults to `'first'`
df.loc[df.duplicated(keep='first'), :]

In [None]:
# 4-B 
# Set `keep='last'` to mark every occurrence except the last
df.loc[df.duplicated(keep='last'), :]

A third option is `keep=False`. It marks every occurrence of a duplicated row as True, which lets us see all duplicate rows.

In [None]:
# 4-C
# keep=False marks every occurrence of a duplicated row as True
df.loc[df.duplicated(keep=False), :]
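
The three `keep` settings are easiest to compare side by side. The sketch below applies each one to the same short Series. The name `letters` and its values are made up for illustration.

```python
import pandas as pd

# Made-up values; 'a' appears at positions 0, 2 and 4
letters = pd.Series(['a', 'b', 'a', 'c', 'a'])

first = letters.duplicated(keep='first')  # the earliest 'a' stays False
last = letters.duplicated(keep='last')    # the latest 'a' stays False
every = letters.duplicated(keep=False)    # every 'a' becomes True

print(first.tolist())  # [False, False, True, False, True]
print(last.tolist())   # [True, False, True, False, False]
print(every.tolist())  # [True, False, True, False, True]
```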

## 5 - Dropping Duplicate Rows    

We can use the Pandas built-in method `drop_duplicates()` to drop duplicate rows.

In [None]:
# 5-A
# Note the change in the number of rows before and after dropping duplicates
# It is not performed in place by default; a new DataFrame is returned
df.drop_duplicates()

By default, this method returns a new DataFrame with duplicate rows removed. We can set the argument `inplace=True` to remove duplicates from the original DataFrame.

In [None]:
# 5-B
# Note the change in the number of rows before and after dropping duplicates
# Setting `inplace=True` modifies `df` itself instead of returning a new DataFrame
df.drop_duplicates(inplace=True)
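
The difference between the two calls can be checked on a small frame. This is a sketch on made-up data (the name `toy` and its values are assumptions). It also uses `ignore_index=True`, a `drop_duplicates()` parameter available in pandas 1.0 and later, which renumbers the result instead of leaving gaps in the index.

```python
import pandas as pd

# Hypothetical data: row 1 repeats row 0
toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 1],
    'Cabin': ['C85', 'C85', 'G6', 'C123'],
})

deduped = toy.drop_duplicates()                      # new object; toy is untouched
rows_before = len(toy)                               # still 4
renumbered = toy.drop_duplicates(ignore_index=True)  # index becomes 0, 1, 2

toy.drop_duplicates(inplace=True)                    # now toy itself shrinks
rows_after = len(toy)                                # 3

print(rows_before, rows_after)    # 4 3
print(deduped.index.tolist())     # [0, 2, 3]
print(renumbered.index.tolist())  # [0, 1, 2]
```

Assigning the returned DataFrame to a name, as with `deduped` here, is an alternative to `inplace=True` that leaves the original available for comparison.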

**Determining which duplicate to keep**    


The `keep` argument can be set for `drop_duplicates()` as well to determine which duplicates to keep. It defaults to `'first'`, which keeps the first occurrence and drops all other duplicates.    


Similarly, we can set keep to `'last'` to keep the last occurrence and drop other duplicates.

In [None]:
# 5-C
# Use keep='last' to keep the last occurrence 
df.drop_duplicates(keep='last')

And we can set `keep` to **False** to drop every row that has a duplicate, including the first occurrence.

In [None]:
# 5-D 
# To drop all duplicates
df.drop_duplicates(keep=False)
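
Which rows survive under each `keep` setting can be read off the index of the result. The sketch below uses a hand-made frame (the name `toy` and its values are illustrative assumptions) in which rows 0 and 3 are identical.

```python
import pandas as pd

# Hypothetical data: rows 0 and 3 are identical
toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 1],
    'Cabin': ['C85', 'C123', 'G6', 'C85'],
})

keep_first = toy.drop_duplicates(keep='first')  # drops row 3
keep_last = toy.drop_duplicates(keep='last')    # drops row 0
keep_none = toy.drop_duplicates(keep=False)     # drops rows 0 and 3

print(keep_first.index.tolist())  # [0, 1, 2]
print(keep_last.index.tolist())   # [1, 2, 3]
print(keep_none.index.tolist())   # [1, 2]
```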

**Considering certain columns for dropping duplicates**    


Similarly, to consider only certain columns when dropping duplicates, we can pass a list of columns to the `subset` argument:    


In [None]:
# 5-E
# Similarly, we can consider only certain columns when dropping duplicates
df.drop_duplicates(subset=['Survived', 'Pclass', 'Sex'])
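
One consequence of `subset` is that columns outside the subset can differ between the rows being collapsed, so the choice of `keep` decides which of those values survive. A minimal sketch on made-up data (the name `toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical data: rows 0, 1 and 3 match on the subset but have different cabins
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 1],
    'Pclass': [1, 1, 3, 1],
    'Sex': ['female', 'female', 'male', 'female'],
    'Cabin': ['C85', 'C123', 'G6', 'E46'],
})

cols = ['Survived', 'Pclass', 'Sex']
first = toy.drop_duplicates(subset=cols)              # keeps rows 0 and 2
last = toy.drop_duplicates(subset=cols, keep='last')  # keeps rows 2 and 3

print(first['Cabin'].tolist())  # ['C85', 'G6']
print(last['Cabin'].tolist())   # ['G6', 'E46']
```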



---



## Conclusion

Pandas `duplicated()` and `drop_duplicates()` are two quick and convenient methods for finding and removing duplicates. It is important to know them, as we often need them during data preprocessing and analysis.

Hopefully, this article helped save time in learning Pandas.     

Check out the documentation for the `duplicated()` and `drop_duplicates()` APIs to learn about other things you can do.

More tutorials on Pandas by Bindi Chen are available on his [GitHub](https://github.com/BindiChen/machine-learning).

Connect with [Bindi Chen on LinkedIn](https://www.linkedin.com/in/bindi-chen-aa55571a/).




---

