## Identify and Delete Rows That Contain Duplicate Data
#### Rows of duplicate data should probably be deleted from your dataset prior to modeling.

#### Rows that have identical data are probably useless, if not dangerously misleading during model evaluation.

#### Here, a duplicate row is a row whose value in every column is identical to the value in the same column of another row.
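
#### As a minimal sketch of this definition, the hypothetical toy DataFrame below (not part of the iris data) has two rows that match in every column. `duplicated()` flags only the later copy by default, while `keep=False` flags every member of a duplicate group.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 match in every column,
# row 3 matches row 0 in column "a" only, so it is not a duplicate.
toy = pd.DataFrame({
    "a": [1, 2, 1, 1],
    "b": ["x", "y", "x", "z"],
})

# Default keep="first": only the later copy of a repeated row is flagged.
later_copies = toy.duplicated()

# keep=False: every row that belongs to a duplicate group is flagged.
all_copies = toy.duplicated(keep=False)

print(later_copies.tolist())
print(all_copies.tolist())
```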

#### From a probabilistic perspective, you can think of duplicate data as adjusting the priors for a class label or data distribution. This may help an algorithm like Naive Bayes if you wish to purposefully bias the priors. Typically, this is not the case, and machine learning algorithms will perform better once rows with duplicate data have been identified and removed.
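
#### The effect on priors can be illustrated with a small sketch. The data here is hypothetical: a balanced two-class table in which one row is assumed to have been recorded three times. The empirical class frequencies, which a model such as Naive Bayes would use as priors, shift toward the duplicated class.

```python
import pandas as pd

# Hypothetical balanced dataset: two rows per class.
labels = pd.DataFrame({
    "feature": [1, 2, 3, 4],
    "label": ["a", "a", "b", "b"],
})

# Assumption for illustration: the last row was recorded two extra times.
with_dups = pd.concat([labels, labels.iloc[[3, 3]]], ignore_index=True)

# Empirical class frequencies before and after the duplicates are added.
prior_clean = labels["label"].value_counts(normalize=True)
prior_dups = with_dups["label"].value_counts(normalize=True)

print(prior_clean.to_dict())
print(prior_dups.to_dict())
```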

#### From an algorithm evaluation perspective, duplicate rows will result in misleading performance estimates. For example, if you are using a train/test split or k-fold cross-validation, a duplicate row or rows can appear in both the train and test datasets, and any evaluation of the model on these rows will be (or should be) correct. This will result in an optimistically biased estimate of performance on unseen data.
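
#### One way to check whether a given split suffers from this overlap is to count the test rows that also occur in the training set. The helper below, `count_leaked_rows`, is a hypothetical name introduced for this sketch, and the toy data is made up; it relies on `DataFrame.merge` joining on all shared columns when no `on=` argument is given.

```python
import pandas as pd

def count_leaked_rows(train: pd.DataFrame, test: pd.DataFrame) -> int:
    """Count test rows whose values in every column also occur in train."""
    # Deduplicate train so each test row can match at most once.
    train_unique = train.drop_duplicates()
    # With no `on=` argument, merge joins on all columns the frames share.
    merged = test.merge(train_unique, how="left", indicator=True)
    return int((merged["_merge"] == "both").sum())

# Hypothetical toy data: the last row repeats the first row.
data = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 1.0],
    "label": ["a", "b", "c", "a"],
})
train = data.iloc[:3]
test = data.iloc[3:]

leaked = count_leaked_rows(train, test)
print(leaked)
```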

#### If you think this is not the case for your dataset or chosen model, design a controlled experiment to test it. This could be achieved by evaluating model skill on the raw dataset and on the dataset with duplicates removed, then comparing performance. Another experiment might involve augmenting the dataset with different numbers of randomly selected duplicate examples.
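
#### A rough sketch of the second experiment follows. Everything in it is an assumption made for illustration: the synthetic dataset, the helper names `add_random_duplicates` and `holdout_accuracy_1nn`, and the choice of a 1-nearest-neighbour classifier on a single hold-out split. Substitute your own dataset, model, and evaluation procedure.

```python
import numpy as np
import pandas as pd

def add_random_duplicates(data, n_copies, seed=0):
    """Append n_copies rows sampled with replacement from data."""
    if n_copies == 0:
        return data.copy()
    extra = data.sample(n=n_copies, replace=True, random_state=seed)
    return pd.concat([data, extra], ignore_index=True)

def holdout_accuracy_1nn(data, label_col, test_frac=0.3, seed=0):
    """Accuracy of a 1-nearest-neighbour classifier on one random split."""
    shuffled = data.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_frac)
    test, train = shuffled.iloc[:n_test], shuffled.iloc[n_test:]
    X_train = train.drop(columns=label_col).to_numpy(dtype=float)
    X_test = test.drop(columns=label_col).to_numpy(dtype=float)
    # Squared Euclidean distance from every test row to every train row.
    dists = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    predicted = train[label_col].to_numpy()[dists.argmin(axis=1)]
    return float((predicted == test[label_col].to_numpy()).mean())

# Hypothetical synthetic data with noisy labels, standing in for a real dataset.
rng = np.random.default_rng(0)
base = pd.DataFrame({
    "f1": rng.normal(size=200),
    "f2": rng.normal(size=200),
})
base["label"] = (base["f1"] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# Evaluate the same model with increasing numbers of injected duplicates.
results = {}
for n_copies in [0, 50, 200]:
    augmented = add_random_duplicates(base, n_copies)
    results[n_copies] = holdout_accuracy_1nn(augmented, "label")
    print(n_copies, round(results[n_copies], 3))
```

#### If the reported accuracy rises as more duplicates are injected, the improvement comes from rows shared between train and test rather than from better generalization. A single split is noisy, so repeat over several seeds before drawing conclusions.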

In [1]:
# importing libraries
import pandas as pd

In [2]:
# importing dataset
df = pd.read_csv('iris.csv')
df.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [5]:
# flag duplicate rows: boolean Series, True for each row that repeats an earlier row
dups = df.duplicated()


In [8]:
# report if there are any duplicates
print(dups.any())

True


In [9]:
# list all duplicate rows
print(df[dups])

     sepal_length  sepal_width  petal_length  petal_width    species
142           5.8          2.7           5.1          1.9  virginica


In [13]:
# deleting duplicate rows
df1 = df.drop_duplicates()
print(df.shape)
print(df1.shape)

(150, 5)
(149, 5)


#### The shape of the DataFrame is reported before and after deletion to confirm the change: one duplicate row was removed, leaving 149 of the original 150 rows.
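
#### `drop_duplicates()` keeps the original row labels, so the index has a gap wherever a row was removed. The sketch below uses a hypothetical three-row table rather than `iris.csv`; it shows one way to keep the first occurrence, renumber the index, and count how many rows were dropped.

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the first row.
toy = pd.DataFrame({
    "sepal_length": [5.8, 5.1, 5.8],
    "species": ["virginica", "setosa", "virginica"],
})

# Keep the first occurrence of each row, then renumber the index from 0.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

# Number of rows removed by deduplication.
n_removed = len(toy) - len(deduped)

print(toy.shape, deduped.shape, n_removed)
```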