## How to manipulate duplicate data using pandas
[tutorial](https://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#26.-How-do-I-find-and-remove-duplicate-rows-in-pandas%3F-%28video%29)
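* **```df.duplicated(subset=..., keep=...)```** flags duplicates; chain **```.tail()```** to inspect the result or **```.sum()```** to count them
* **```df.loc[df.duplicated(), :]```** selects the rows flagged as duplicates
* **```df.drop_duplicates()```** returns the DataFrame with duplicate rows removed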

In [1]:
import pandas as pd

In [2]:
# read a dataset of movie reviewers (modifying the default parameter values for read_csv)
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_csv('D:/u.user.txt', sep='|', header=None, names=user_cols, index_col='user_id')
users.head()

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


In [3]:
users.shape

(943, 4)

In [4]:
# detect duplicate zip codes: True if an item is identical to a previous item
users['zip_code'].duplicated().tail()

user_id
939    False
940     True
941    False
942    False
943    False
Name: zip_code, dtype: bool

In [5]:
# count the duplicate items (True becomes 1, False becomes 0)
users['zip_code'].duplicated().sum()

148
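The counting step above can be checked on a small Series. This is a minimal sketch using made-up zip codes (the `zips` Series is hypothetical, not taken from the dataset); it assumes the Series has no missing values, in which case the duplicate count equals the length minus the number of distinct values.

```python
import pandas as pd

# Hypothetical toy data: '85711' appears three times, the others once
zips = pd.Series(['85711', '94043', '85711', '15213', '85711'])

# True wherever a value has already appeared earlier in the Series
is_repeat = zips.duplicated()

# Summing booleans counts the True values (True -> 1, False -> 0)
n_repeats = int(is_repeat.sum())

# With no missing values, the same number is len minus the distinct count
n_repeats_check = len(zips) - zips.nunique()

print(is_repeat.tolist())  # [False, False, True, False, True]
print(n_repeats, n_repeats_check)  # 2 2
```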

In [6]:
# detect duplicate DataFrame rows: True if an entire row is identical to a previous row
users.duplicated().tail()

user_id
939    False
940    False
941    False
942    False
943    False
dtype: bool

In [7]:
# count the duplicate rows
users.duplicated().sum()

7

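Logic for **```duplicated```**:
* **```keep='first'```** (default): mark duplicates as True except for the first occurrence
* **```keep='last'```**: mark duplicates as True except for the last occurrence
* **```keep=False```**: mark all duplicates as True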
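The three `keep` options are easiest to compare side by side on a tiny frame. The sketch below uses a hypothetical `toy` DataFrame (not the `users` data) in which rows 0, 2 and 4 are identical.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 4 are identical; rows 1 and 3 are unique
toy = pd.DataFrame({
    'age': [24, 53, 24, 33, 24],
    'zip_code': ['85711', '94043', '85711', '15213', '85711'],
})

# keep='first': the first occurrence is not flagged, later copies are
first_mask = toy.duplicated(keep='first')

# keep='last': the last occurrence is not flagged, earlier copies are
last_mask = toy.duplicated(keep='last')

# keep=False: every row that has a copy anywhere is flagged
all_mask = toy.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, False, True]
print(last_mask.tolist())   # [True, False, True, False, False]
print(all_mask.tolist())    # [True, False, True, False, True]
```

The `keep=False` mask is the element-wise OR of the other two, which is why it selects the rows from both of the other results.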

In [8]:
# keep='first' (default): show each duplicate row except its first occurrence
users.loc[users.duplicated(keep='first'), :]

user_id,age,gender,occupation,zip_code
496,21,F,student,55414
572,51,M,educator,20003
621,17,M,student,60402
684,28,M,student,55414
733,44,F,other,60630
805,27,F,other,20009
890,32,M,student,97301


In [20]:
# keep='last': show each duplicate row except its last occurrence
users.loc[users.duplicated(keep='last'), :]

user_id,age,gender,occupation,zip_code
67,17,M,student,60402
85,51,M,educator,20003
198,21,F,student,55414
350,32,M,student,97301
428,28,M,student,55414
437,27,F,other,20009
460,44,F,other,60630


In [9]:
# keep=False: show every row that has an identical copy elsewhere in the DataFrame
users.loc[users.duplicated(keep=False), :]

user_id,age,gender,occupation,zip_code
67,17,M,student,60402
85,51,M,educator,20003
198,21,F,student,55414
350,32,M,student,97301
428,28,M,student,55414
437,27,F,other,20009
460,44,F,other,60630
496,21,F,student,55414
572,51,M,educator,20003
621,17,M,student,60402


In [10]:
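# drop the duplicate rows, keeping the first occurrence (returns a new DataFrame; users is unchanged)
users.drop_duplicates(keep='first').shape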

(936, 4)

In [11]:
# drop the duplicate rows, keeping the last occurrence
users.drop_duplicates(keep='last').shape

(936, 4)

In [12]:
# drop every row that has a duplicate, keeping none of the copies
users.drop_duplicates(keep=False).shape

(929, 4)
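The shapes above follow from the duplicate counts: 943 - 7 = 936 rows remain with `keep='first'` or `keep='last'`, and 943 - 14 = 929 remain with `keep=False`, because both members of each of the 7 duplicated pairs are dropped. The sketch below checks the same relationship on a hypothetical `toy` DataFrame with two duplicated pairs.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical, rows 1 and 4 are identical
toy = pd.DataFrame({
    'age': [24, 53, 24, 33, 53],
    'zip_code': ['85711', '94043', '85711', '15213', '94043'],
})

n_rows = len(toy)                                      # 5
n_flagged_first = int(toy.duplicated().sum())          # 2 (rows 2 and 4)
n_flagged_all = int(toy.duplicated(keep=False).sum())  # 4 (rows 0, 1, 2, 4)

# drop_duplicates removes exactly the rows that duplicated() flags
kept_first = toy.drop_duplicates(keep='first')  # rows 0, 1, 3
kept_last = toy.drop_duplicates(keep='last')    # rows 2, 3, 4
kept_none = toy.drop_duplicates(keep=False)     # row 3 only

print(kept_first.shape, kept_last.shape, kept_none.shape)  # (3, 2) (3, 2) (1, 2)
```

Note that `drop_duplicates` returns a new DataFrame by default, so `toy` itself still has all five rows afterwards.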

In [13]:
# only consider the age and zip_code columns when identifying duplicates
users.duplicated(subset=['age', 'zip_code']).tail()

user_id
939    False
940    False
941    False
942    False
943    False
dtype: bool

In [14]:
# show every row whose (age, zip_code) combination occurs more than once
users.loc[users.duplicated(subset=['age', 'zip_code'], keep=False), :]

user_id,age,gender,occupation,zip_code
28,32,M,writer,55369
31,24,M,artist,10003
67,17,M,student,60402
74,39,M,scientist,T8H1N
84,32,M,executive,55369
85,51,M,educator,20003
178,26,M,other,49512
198,21,F,student,55414
274,20,F,student,55414
350,32,M,student,97301
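The `subset` argument works the same way with `drop_duplicates`. As a minimal sketch on a hypothetical `toy` DataFrame (values invented for illustration), rows that differ in other columns can still count as duplicates when only `age` and `zip_code` are compared.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share age and zip_code but differ in occupation
toy = pd.DataFrame({
    'age': [32, 32, 24, 24],
    'gender': ['M', 'M', 'M', 'F'],
    'occupation': ['writer', 'executive', 'artist', 'artist'],
    'zip_code': ['55369', '55369', '10003', '10004'],
})

# Comparing entire rows finds no duplicates
n_full_row_dups = int(toy.duplicated().sum())  # 0

# Comparing only age and zip_code flags rows 0 and 1
subset_mask = toy.duplicated(subset=['age', 'zip_code'], keep=False)

# Dropping on the same subset keeps the first row of each (age, zip_code) pair
deduped = toy.drop_duplicates(subset=['age', 'zip_code'])

print(subset_mask.tolist())    # [True, True, False, False]
print(deduped.index.tolist())  # [0, 2, 3]
```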
