# Question from Irene Sanjay

> How do I load an Excel file and extract duplicates to another Excel file?

# Load an Excel file contained in this folder

⚠️ Besides installing pandas (with `pip install pandas`, or `conda install pandas` if you use conda), you will also have to install `openpyxl` with `pip install openpyxl`.

In [1]:
import pandas as pd

# "." means the current folder (it will work without but it's important to know 🙈)
df = pd.read_excel('./duplicates.xlsx', sheet_name='duplicates')
df

,first_name,last_name,country
0,John,Doe,USA
1,John,Doe,France
2,John,Doe,Belgium
3,John,Rambo,USA
4,John,Travolta,USA
5,Tibaldo,B,France


# Identify duplicates

Create a boolean mask. Note that `keep=False` flags every occurrence of a repeated value as `True`; the default `keep='first'` (or `keep='last'`) leaves the first (or last) occurrence unflagged.

In [2]:
first_name_duplicated = df['first_name'].duplicated(keep=False)
first_name_duplicated

0     True
1     True
2     True
3     True
4     True
5    False
Name: first_name, dtype: bool
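
To make the effect of `keep` concrete, here is a minimal sketch comparing the three options side by side. The Series below is toy data typed in by hand to mirror the `first_name` column (an assumption, so the example runs without the Excel file), and the variable names are only illustrative.

```python
import pandas as pd

# Toy data mirroring the first_name column (typed by hand, not read from Excel)
first_names = pd.Series(
    ["John", "John", "John", "John", "John", "Tibaldo"], name="first_name"
)

# keep="first" (the default): the first occurrence is NOT flagged
mask_first = first_names.duplicated(keep="first")
# keep="last": the last occurrence is NOT flagged
mask_last = first_names.duplicated(keep="last")
# keep=False: every occurrence of a repeated value is flagged
mask_all = first_names.duplicated(keep=False)

comparison = pd.DataFrame(
    {
        "first_name": first_names,
        "keep_first": mask_first,
        "keep_last": mask_last,
        "keep_False": mask_all,
    }
)
print(comparison)
```

Only `keep=False` flags all five "John" rows, which is what we want when the goal is to extract every row involved in a duplication.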

Select the rows where the first name is duplicated and put them in a new DataFrame. We make a copy so that later modifications apply to an independent DataFrame rather than to a view of `df`.

In [3]:
df_duplicates_first_name = df.loc[first_name_duplicated, :].copy()
df_duplicates_first_name

,first_name,last_name,country
0,John,Doe,USA
1,John,Doe,France
2,John,Doe,Belgium
3,John,Rambo,USA
4,John,Travolta,USA
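
As a quick illustration of why the `.copy()` is useful, the sketch below modifies the extracted rows and checks that the original DataFrame is untouched. The data is rebuilt by hand (an assumption, so the example does not need the Excel file), and `df_example` and `subset` are hypothetical names.

```python
import pandas as pd

# Hand-typed stand-in for the content of duplicates.xlsx
df_example = pd.DataFrame(
    {
        "first_name": ["John", "John", "John", "John", "John", "Tibaldo"],
        "last_name": ["Doe", "Doe", "Doe", "Rambo", "Travolta", "B"],
        "country": ["USA", "France", "Belgium", "USA", "USA", "France"],
    }
)

mask = df_example["first_name"].duplicated(keep=False)

# .copy() gives an independent DataFrame, so the edit below is unambiguous
subset = df_example.loc[mask, :].copy()
subset["country"] = subset["country"].str.upper()

print(subset["country"].tolist())      # modified copy
print(df_example["country"].tolist())  # original, unchanged
```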


In many scenarios though we'd want to check multiple columns together. For instance we could check duplicates on the group `first_name+last_name`.

It's the same strategy as before (make a mask and use it to select rows).

In [4]:
names_duplicated = df[['first_name', 'last_name']].duplicated(keep=False)
df_duplicates_all_names = df.loc[names_duplicated, :].copy()
df_duplicates_all_names

,first_name,last_name,country
0,John,Doe,USA
1,John,Doe,France
2,John,Doe,Belgium
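
An equivalent way to write the same check is to call `DataFrame.duplicated` on the whole DataFrame and name the columns with its `subset` parameter. This is only an alternative spelling, sketched here on hand-typed data (an assumption, so it runs without the Excel file); `df_example` is a hypothetical name.

```python
import pandas as pd

# Hand-typed stand-in for the content of duplicates.xlsx
df_example = pd.DataFrame(
    {
        "first_name": ["John", "John", "John", "John", "John", "Tibaldo"],
        "last_name": ["Doe", "Doe", "Doe", "Rambo", "Travolta", "B"],
        "country": ["USA", "France", "Belgium", "USA", "USA", "France"],
    }
)

# Variant 1: restrict the check with the `subset` parameter
mask_subset = df_example.duplicated(subset=["first_name", "last_name"], keep=False)

# Variant 2: select the columns first, as in the cell above
mask_selection = df_example[["first_name", "last_name"]].duplicated(keep=False)

# Both variants produce the same boolean mask
same_mask = mask_subset.equals(mask_selection)

duplicates = df_example.loc[mask_subset, :].copy()
print(same_mask)
print(duplicates)
```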


# Save results to another Excel file

⚠️ If an Excel file with the same name as below already exists, it is replaced: called with a file path like this, `to_excel` won't open the existing file to save the DataFrame to a dedicated sheet or to replace an existing sheet.

See this [Stackoverflow post](https://stackoverflow.com/a/63692307) about appending sheets to an existing Excel file.

In [5]:
df_duplicates_all_names.to_excel('duplicates_results.xlsx', sheet_name='example', index=False)
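
If you do want to add a sheet to an existing workbook, one option is `pd.ExcelWriter` in append mode. The sketch below is a minimal example under a few assumptions: `openpyxl` is installed, pandas is recent enough to support the `if_sheet_exists` argument (1.3 or later), and the file name, sheet names and toy data are made up for the demonstration. It writes to a temporary folder so it does not touch your real files.

```python
import os
import tempfile

import pandas as pd

# Toy results, typed by hand for the demonstration
results = pd.DataFrame(
    {
        "first_name": ["John", "John"],
        "last_name": ["Doe", "Doe"],
        "country": ["USA", "France"],
    }
)

# Hypothetical file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "duplicates_results_demo.xlsx")

# A plain to_excel call creates (or overwrites) the file
results.to_excel(path, sheet_name="example", index=False)

# mode="a" opens the existing workbook instead of replacing it;
# if_sheet_exists="replace" overwrites a sheet that has the same name
with pd.ExcelWriter(
    path, engine="openpyxl", mode="a", if_sheet_exists="replace"
) as writer:
    results.to_excel(writer, sheet_name="second_sheet", index=False)

# sheet_name=None reads every sheet into a dict keyed by sheet name
sheets = pd.read_excel(path, sheet_name=None)
print(list(sheets))
```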

# Note

To keep it simple I did not use an index, but you can of course also check for duplicates within an index, with code like this:

In [6]:
# let's set first name and last name as index for our example
df.set_index(['first_name', 'last_name'], inplace=True)
df

first_name,last_name,country
John,Doe,USA
John,Doe,France
John,Doe,Belgium
John,Rambo,USA
John,Travolta,USA
Tibaldo,B,France


If you export this to Excel, you'll probably want to pass `merge_cells=False` to the `.to_excel` method.

If you don't, the repeated values of the multilevel index are written as merged cells in the resulting Excel sheet.

In [7]:
df.loc[df.index.duplicated(keep=False), :]

first_name,last_name,country
John,Doe,USA
John,Doe,France
John,Doe,Belgium
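
For the Excel export of a DataFrame with a multilevel index, a minimal sketch of the `merge_cells=False` call could look like this. It assumes `openpyxl` is installed; the data is typed by hand, and the file name and the `df_indexed` variable are hypothetical. The file goes to a temporary folder.

```python
import os
import tempfile

import pandas as pd

# Hand-typed stand-in for the data, with first/last name as a MultiIndex
df_indexed = pd.DataFrame(
    {
        "first_name": ["John", "John", "John", "John", "John", "Tibaldo"],
        "last_name": ["Doe", "Doe", "Doe", "Rambo", "Travolta", "B"],
        "country": ["USA", "France", "Belgium", "USA", "USA", "France"],
    }
).set_index(["first_name", "last_name"])

# Hypothetical output file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "indexed_demo.xlsx")

# merge_cells=False writes the index values on every row
# instead of merging repeated values into one cell
df_indexed.to_excel(path, sheet_name="example", merge_cells=False)

# Reading the sheet back shows one filled-in value per row
round_trip = pd.read_excel(path, sheet_name="example")
print(round_trip)
```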
