In [1]:
import pandas as pd

### Pandas Drop Duplicates
We will run through 3 examples:
1. Dropping rows from duplicate rows
2. Dropping rows from duplicate subset of columns
3. Keeping the last duplicate instead of the default first

Let's create our DataFrame

In [2]:
df = pd.DataFrame({
    'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
    'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
    'rating': [3, 3, 5.5, 8.6, 7]
})
df

Unnamed: 0,brand,equipment,rating
0,Jet Boil,Stove,3.0
1,Jet Boil,Stove,3.0
2,Osprey,Backpack,5.5
3,Osprey,Waterbottle,8.6
4,Osprey,Backpack,7.0


### 1. Dropping rows from duplicate rows
When we call drop_duplicates with its defaults, pandas compares entire rows: it finds every row that matches an earlier row in all columns, and then keeps only the first occurrence of each.

Notice below that when we call drop_duplicates, row 2 (index=1) gets dropped because it is the 2nd instance of a duplicate row.

In [3]:
df.drop_duplicates()

Unnamed: 0,brand,equipment,rating
0,Jet Boil,Stove,3.0
2,Osprey,Backpack,5.5
3,Osprey,Waterbottle,8.6
4,Osprey,Backpack,7.0
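
If you want to see which rows pandas treats as duplicates before dropping anything, a small sketch with `DataFrame.duplicated()` can help. It rebuilds the same DataFrame so it runs on its own; the names `mask` and `deduped` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
    'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
    'rating': [3, 3, 5.5, 8.6, 7]
})

# True marks a row that repeats an earlier row in every column
mask = df.duplicated()

# Keeping only the unmarked rows matches the default drop_duplicates() result
deduped = df[~mask]
```

Only index 1 is flagged here. Index 4 has the same brand and equipment as index 2, but its rating is different, so it does not count as a full-row duplicate.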


### 2. Dropping rows from duplicate subset of columns
When we specify a subset of columns, drop_duplicates only compares the values in that column (or those columns) when deciding whether two rows are duplicates. Rows that match an earlier row on the subset get dropped, even if their other columns differ.

Here we are specifying a subset to only look at the column 'brand'. Every row that repeats an earlier brand gets dropped, so only the 1st row for each brand remains (because keep defaults to 'first').

In [4]:
df.drop_duplicates(subset='brand')

Unnamed: 0,brand,equipment,rating
0,Jet Boil,Stove,3.0
2,Osprey,Backpack,5.5


You can also use multiple columns as a subset by passing a list. A row is then dropped only if it matches an earlier row in all of the listed columns.

In [5]:
df.drop_duplicates(subset=['brand', 'equipment'])

Unnamed: 0,brand,equipment,rating
0,Jet Boil,Stove,3.0
2,Osprey,Backpack,5.5
3,Osprey,Waterbottle,8.6
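
To compare how much each subset would remove, here is a minimal sketch that counts the flagged rows with `duplicated(subset=...)`. The DataFrame is rebuilt so the snippet is self-contained, and the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
    'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
    'rating': [3, 3, 5.5, 8.6, 7]
})

# Rows that repeat an earlier value in the 'brand' column
dropped_by_brand = int(df.duplicated(subset='brand').sum())

# Rows that repeat an earlier (brand, equipment) pair
dropped_by_pair = int(df.duplicated(subset=['brand', 'equipment']).sum())

print(dropped_by_brand, dropped_by_pair)
```

The narrower the subset, the more rows count as duplicates: looking at 'brand' alone flags more rows than looking at 'brand' and 'equipment' together.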


### 3. Keeping the last duplicate instead of the default first
By default, .drop_duplicates() keeps the *first* duplicate it finds. However, if you want to switch it up and keep the *last* one, you can specify keep='last'.

Here we are running the same command as the first example, but with keep='last'. Notice how row 1 (index=0) gets dropped this time. We keep the last duplicate only.

In [6]:
df.drop_duplicates(keep='last')

Unnamed: 0,brand,equipment,rating
1,Jet Boil,Stove,3.0
2,Osprey,Backpack,5.5
3,Osprey,Waterbottle,8.6
4,Osprey,Backpack,7.0
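
The choice of keep matters most when it is combined with a subset, because the rows being compared can differ in their other columns. The sketch below rebuilds the same DataFrame and tries keep='last' on the ['brand', 'equipment'] subset. It also shows keep=False, a third option that this walkthrough does not otherwise cover, which drops every row that has a duplicate. The names `last_pair` and `no_dups` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
    'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
    'rating': [3, 3, 5.5, 8.6, 7]
})

# Keep the last row for each (brand, equipment) pair.
# For Osprey/Backpack this keeps the rating from index 4 instead of index 2.
last_pair = df.drop_duplicates(subset=['brand', 'equipment'], keep='last')

# keep=False drops every row that has a full-row duplicate, keeping neither copy
no_dups = df.drop_duplicates(keep=False)
```

With keep='last' on the subset, the Osprey Backpack row that survives is the one rated 7.0 rather than 5.5, so it is worth checking which copy holds the values you want before choosing.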
