# Remove Duplicates
In this short tutorial I show how to remove duplicates from a dataframe using the `drop_duplicates()` method provided by the `pandas` library.
Removing duplicates is one of the techniques used to preprocess data. Data preprocessing also includes:
* handling missing values
* standardization
* normalization
* formatting
* binning.

## Data import
Since the CSV file is stored on Google Drive, I first mount the drive in Colab. Then I import the Python `pandas` library and read the CSV file with the `read_csv()` function.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/SSN FDP/Leni.csv')
df.head()

Unnamed: 0,author,channel,cid,heart,photo,text,time,votes
0,Suganya Arumugam,UC_EOwRWhd65qJ5QOZTAlTeA,UgzgWOMHp094OE_ysax4AaABAg,0.0,https://yt3.ggpht.com/P3ibtYF_Seo4QqdrIASBrtuB...,Hai Hai HaI,2 days ago,0
1,Army Public School Dagshai,UC5xN5xDJpIjIzAk7N5BTGJA,UgzyClrerABPxX09Kq54AaABAg,0.0,https://yt3.ggpht.com/ytc/AKedOLRT-EYqtZOWmynM...,Enriching session. Thx,3 days ago,0
2,Magima T.A,UCZ9QVy5nhEsSk_fKtN1CIkA,Ugx9aJ10OEz5P5N7X0t4AaABAg,0.0,https://yt3.ggpht.com/ytc/AKedOLTLjCb9HLiD4tIy...,"Magima Ahamed John, Aiman college of arts and ...",7 days ago,0
3,unnimaya,UCAy8cxcLl-wR6BOB3CWAsuA,Ugw2L8LtEJb5jExHLOF4AaABAg,0.0,https://yt3.ggpht.com/ytc/AKedOLR7AUhSVV1QxBaz...,Unnimaya : Bishop Moore College Mavelikara,7 days ago,0
4,alex philip,UCX62wvhlEwZ9HX5_Rsl15Ig,UgyFCt1Fq_sRoYIrqAd4AaABAg,0.0,https://yt3.ggpht.com/ytc/AKedOLQGBxxk3eV21dw2...,Alex Philip - Assistant professor P. E. S COLL...,7 days ago,0
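
The `Unnamed: 0` column in the output above usually appears when a CSV file was saved together with its index. Assuming that is the case here, the following sketch shows on a small in-memory CSV (hypothetical data, not the tutorial's file) how such a column can hide duplicates, and how `index_col=0` reads it as the index instead.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column is a saved index with an empty header.
csv_text = """,author,text
0,ann,hi
1,bob,hello
2,ann,hi
"""

# Default read: the unnamed first column becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index keeps it out of the comparison.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(raw.duplicated().sum())      # the unique counter makes every row distinct
print(indexed.duplicated().sum())  # the repeated (author, text) pair is found
```

Whether to read the column as an index depends on the data: if the counter carries meaning, it should stay a column and be excluded through the `subset` parameter instead.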


Now I check the number of records contained in the dataframe. I use the `shape` attribute, which returns the number of rows and the number of columns of the dataframe.

In [4]:
df.shape

(45, 8)

## Check for the presence of duplicates
To check whether a record is duplicated, I can use the `duplicated()` method. By default, it returns `True` for every record that repeats an earlier record and `False` otherwise, so the first occurrence of each record is not marked.

In [13]:
df.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
dtype: bool
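
Since the dataframe above contains no fully duplicated rows, the behaviour of `duplicated()` is easier to see on a toy example. The following is a minimal sketch on hypothetical data (the `toy` dataframe is not part of the tutorial's dataset), showing how the `keep` parameter changes which occurrences are marked.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "author": ["ann", "bob", "ann", "ann"],
    "text": ["hi", "hello", "hi", "hi"],
})

first = toy.duplicated()            # default keep="first": first occurrence is not marked
last = toy.duplicated(keep="last")  # last occurrence is not marked
every = toy.duplicated(keep=False)  # every row that has a twin is marked

print(first.tolist())  # [False, False, True, True]
print(last.tolist())   # [True, False, True, False]
print(every.tolist())  # [True, False, True, True]
```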

I can also apply `duplicated()` to a subset of the dataframe's columns. In this case, I use the `subset` parameter, which takes the list of columns to compare.

In [8]:
df.duplicated(subset=['author'])

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27     True
28    False
29    False
30    False
31     True
32     True
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42     True
43    False
44    False
dtype: bool
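
The boolean series returned by `duplicated()` can also be used as a mask to look at the flagged rows themselves. A minimal sketch, again on hypothetical toy data rather than the tutorial's CSV:

```python
import pandas as pd

# Hypothetical toy data: no row is fully identical, but "ann" appears twice.
toy = pd.DataFrame({
    "author": ["ann", "bob", "ann", "cid"],
    "text": ["hi", "hello", "thanks", "hello"],
})

full_mask = toy.duplicated()                     # compares all the columns
author_mask = toy.duplicated(subset=["author"])  # compares only "author"

# Boolean indexing returns the rows flagged as repeats.
repeated_authors = toy[author_mask]
print(repeated_authors)
```

Here `full_mask` contains only `False`, while `author_mask` flags row 2, the second comment written by the same author.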

Now I can calculate the number of duplicates by summing the `True` values. Considering all the columns, the dataframe contains no duplicated records.

In [10]:
df.duplicated().sum()

0
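
The sum works because `True` counts as 1 and `False` as 0. The same count can be computed on a subset of columns, and the two results can differ. A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: two fully repeated rows, three repeated authors.
toy = pd.DataFrame({
    "author": ["ann", "bob", "ann", "ann", "bob"],
    "votes": [0, 1, 2, 0, 1],
})

# Rows 3 and 4 repeat rows 0 and 1 on every column.
n_full = int(toy.duplicated().sum())

# Rows 2, 3 and 4 repeat an author already seen.
n_author = int(toy.duplicated(subset=["author"]).sum())

print(n_full, n_author)  # 2 3
```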

## Drop duplicates
Now I can drop duplicates with the `drop_duplicates()` method. I can use two strategies:
* drop duplicates on the basis of all the columns
* drop duplicates on the basis of some columns

In both strategies, I can decide which copy of the duplicated records to keep, if any. This is done through the `keep` parameter of `drop_duplicates()`: `'first'` (the default) keeps the first occurrence, `'last'` keeps the last one, and `False` drops every duplicated record.
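
The effect of the `keep` parameter can be sketched on a small hypothetical dataframe (the `toy` data below is an assumption for illustration, not the tutorial's dataset):

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "author": ["ann", "bob", "ann", "cid"],
    "text": ["hi", "hello", "hi", "bye"],
})

kept_first = toy.drop_duplicates()            # default keep="first": row 2 is dropped
kept_last = toy.drop_duplicates(keep="last")  # row 0 is dropped
kept_none = toy.drop_duplicates(keep=False)   # rows 0 and 2 are both dropped

print(kept_first.index.tolist())  # [0, 1, 3]
print(kept_last.index.tolist())   # [1, 2, 3]
print(kept_none.index.tolist())   # [1, 3]
```

Note that `drop_duplicates()` returns a new dataframe and leaves `toy` unchanged.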

In [11]:
df1 = df.drop_duplicates()

In [12]:
df1.shape

(45, 8)

In [None]:
df1.head()

Unnamed: 0,ID,Comments,TAG
0,facebook_corpus_msr_1723796,Well said sonu..you have courage to stand agai...,OAG
1,facebook_corpus_msr_466073,"Most of Private Banks ATM's Like HDFC, ICICI e...",NAG
4,facebook_corpus_msr_462570,Wondering why Educated Ambassador is strugglin...,CAG
5,facebook_corpus_msr_465051,How does inflation react to all the after shoc...,NAG
6,facebook_corpus_msr_450994,Not good job.....this guis creating a problem ...,CAG


Drop all the duplicated records, including the first occurrence, by setting `keep=False`.

In [None]:
df2 = df.drop_duplicates(keep=False)
df2.head()

Unnamed: 0,ID,Comments,TAG
20,facebook_corpus_msr_472878,Sir you are right but honestly 2000 note is no...,NAG
21,facebook_corpus_msr_397830,India army should gundown all terrists pak pig...,OAG
22,facebook_corpus_msr_2127013,Heading should be..Owaisi huge loss..pseudo se...,CAG
23,facebook_corpus_msr_2018572,Vrinda Singh I don't know what I dislike her m...,CAG


In [None]:
df2.shape

(4, 3)

Drop duplicates on the basis of a subset of columns, using the `subset` parameter.

In [None]:
df3 = df.drop_duplicates(subset=["Comments"])
df3.shape

(9, 3)

In [None]:
df3.head()

Unnamed: 0,ID,Comments,TAG
0,facebook_corpus_msr_1723796,Well said sonu..you have courage to stand agai...,OAG
1,facebook_corpus_msr_466073,"Most of Private Banks ATM's Like HDFC, ICICI e...",NAG
4,facebook_corpus_msr_462570,Wondering why Educated Ambassador is strugglin...,CAG
5,facebook_corpus_msr_465051,How does inflation react to all the after shoc...,NAG
6,facebook_corpus_msr_450994,Not good job.....this guis creating a problem ...,CAG
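
In the output above the index labels jump from 1 to 4, because `drop_duplicates()` keeps the original labels of the surviving rows. If a contiguous index is preferred, the `ignore_index` parameter can be used. A minimal sketch on hypothetical data, with column names borrowed from the output above:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 share the same comment.
toy = pd.DataFrame({
    "ID": ["a1", "a2", "a3", "a4"],
    "Comments": ["good", "bad", "good", "fine"],
})

# Original labels are preserved, so the index has a gap.
deduped = toy.drop_duplicates(subset=["Comments"])

# ignore_index=True relabels the result 0, 1, 2, ...
renumbered = toy.drop_duplicates(subset=["Comments"], ignore_index=True)

print(deduped.index.tolist())     # [0, 1, 3]
print(renumbered.index.tolist())  # [0, 1, 2]
```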
