# Dealing with duplicates

Dealing with duplicates is tricky because removing or keeping them influences how much weight the model puts on the associated features.

For example, when a training set contains duplicates, the resulting model puts more weight on getting those examples correct. **So** we have to check whether the duplicates are real and represent the domain well, or whether they don't. Duplicates indicate frequency, which is important.

**It basically comes down to whether the dataset you have resembles the real world or not.**

So, if the duplicates don't represent the real world and are due to an error, we should remove them. On the other hand, if the duplicates reflect the real world (e.g., an event that genuinely occurs more often), we should leave them in so the model can assign a higher weight to those examples.
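Before deciding, it helps to look at the duplicated rows themselves. The following is a minimal sketch on a made-up table (the `records` frame and its column names are hypothetical, not taken from the permits dataset); it lists every row that belongs to a duplicated group and counts how often each distinct row occurs.

```python
import pandas as pd

# Hypothetical toy data; column names are made up for illustration.
records = pd.DataFrame({
    "permit_id": [101, 102, 102, 103],
    "street": ["Main", "Oak", "Oak", "Pine"],
})

# keep=False flags every member of a duplicated group, not only the repeats,
# so the original and its copies can be inspected side by side.
dupes = records[records.duplicated(keep=False)]
print(dupes)

# Number of occurrences of each distinct row.
group_sizes = records.groupby(list(records.columns)).size()
print(group_sizes)
```

Whether those repeated rows are data-entry errors or genuine repeated events is a domain question that the code cannot answer on its own.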


**Important Note**
- Duplicated elements in the test data serve no real purpose. You've already tested the model on that particular example once, so why would you do it again when you'd expect the exact same answer?
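As a rough sketch of the note above, a test set can be reduced to its distinct rows before scoring. The `test_df` frame and its columns are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy test set; column names are made up for illustration.
test_df = pd.DataFrame({
    "permit_type": [1, 1, 2, 3, 3],
    "cost": [100, 100, 250, 80, 80],
    "label": [0, 0, 1, 0, 0],
})

# Keep one copy of each distinct row so every test case is scored once.
test_unique = test_df.drop_duplicates().reset_index(drop=True)

print(len(test_df), len(test_unique))
```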

Two possible choices:
- Remove 
- Don't remove

In [5]:
import pandas as pd

df = pd.read_csv('Smaller_Building_Permits.csv')
df.shape

(10372, 44)

#### Remove duplicates

Remove duplicates when they don't represent the real world or the domain:
- Duplicate data adds no value and only makes the algorithm slower.
- Duplicates are an extreme case of nonrandom sampling, and they bias your fitted model. Including them essentially leads the model to overfit that subset of points.

In [7]:
# Count the number of duplicates
df.duplicated().sum()

372

In [8]:
df = df.drop_duplicates()  # pandas built-in method
# arguments:
#     - subset: only consider a subset of columns when identifying duplicates
#     - keep {'first', 'last', False}, default 'first': which copy to keep;
#       False drops every copy of a duplicated row, including the first

In [9]:
# Count the number of duplicates
df.duplicated().sum()

0
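The effect of the `subset` and `keep` arguments is easier to see on a small example. This is a sketch on a hypothetical `toy` frame (the `permit_id` and `status` columns are assumptions for illustration, not columns of the permits file).

```python
import pandas as pd

# Hypothetical toy data; column names are made up for illustration.
toy = pd.DataFrame({
    "permit_id": [1, 1, 2, 3, 3],
    "status": ["issued", "issued", "filed", "filed", "complete"],
})

# Default: rows must match on every column to count as duplicates.
full_row = toy.drop_duplicates()

# subset: compare only permit_id; the first row per id is kept.
by_id_first = toy.drop_duplicates(subset=["permit_id"])

# keep="last": keep the last row per id instead of the first.
by_id_last = toy.drop_duplicates(subset=["permit_id"], keep="last")

# keep=False: drop every row whose id appears more than once.
by_id_none = toy.drop_duplicates(subset=["permit_id"], keep=False)

print(len(full_row), len(by_id_first), len(by_id_last), len(by_id_none))
```

Note that `subset` can discard rows that differ in other columns (here the two rows for `permit_id` 3), so it should only be used when the chosen columns really identify a record.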

#### Don't remove duplicates

In domains where duplicates are known to occur in the real world, we should keep them so that the data represents the real-world distribution well.

For example, if you remove your duplicates, a four-leaf clover becomes as significant as a regular three-leaf clover, since each occurs once, whereas in real life there is roughly one four-leaf clover for every 10,000 regular clovers.

In this case we don't have to do anything.
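The clover example can be reproduced in a few lines. This sketch builds a hypothetical `clovers` frame with 10,000 three-leaf rows and one four-leaf row, then compares the share of four-leaf clovers before and after dropping duplicates.

```python
import pandas as pd

# Hypothetical data matching the clover example: 10,000 regular clovers, 1 four-leaf.
clovers = pd.DataFrame({"leaves": [3] * 10_000 + [4]})

# Relative frequency of each kind in the original data.
share_before = clovers["leaves"].value_counts(normalize=True)

# After deduplication each kind appears exactly once.
deduped = clovers.drop_duplicates()
share_after = deduped["leaves"].value_counts(normalize=True)

print(share_before[4], share_after[4])
```

Dropping duplicates here turns a one-in-ten-thousand event into half of the dataset, which is the distortion described above.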