#### Fixing invalid values

In [2]:
import pandas as pd
import numpy as np

In [17]:
df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 299, 25],
})
df

  Sex  Age
0   M   29
1   F   30
2   F   24
3   D  299
4   ?   25


To find values that might be wrong or out of range, we can inspect the unique values of a column with:
- unique(): returns an array of the unique values in the column
- value_counts(): returns a Series with each unique value and the number of times it appears in the column

In [4]:
df['Sex'].unique()

array(['M', 'F', 'D', '?'], dtype=object)

In [5]:
df['Sex'].value_counts()

Sex
F    2
M    1
D    1
?    1
Name: count, dtype: int64

In this case, the "Sex" column contains incorrect values. This column should contain only "F" and "M", but we also have "D" and "?". The "D" might be a typo, since "F" and "D" are next to each other on the keyboard.
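
As a minimal sketch of the check described above, `isin()` can flag every row whose value falls outside the allowed set. The names `valid_sex`, `invalid_mask` and `invalid_rows` are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 299, 25],
})

# Hypothetical list of accepted values for this column.
valid_sex = ['F', 'M']

# True where the value is NOT one of the accepted values.
invalid_mask = ~df['Sex'].isin(valid_sex)

# Rows that need attention (here, the rows holding 'D' and '?').
invalid_rows = df.loc[invalid_mask]
print(invalid_rows)
```

This only locates the suspicious rows; deciding how to correct them is still a judgment call.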

In [6]:
df['Sex'].replace('D', 'F')

0    M
1    F
2    F
3    F
4    ?
Name: Sex, dtype: object

If there were an "N" in the "Sex" column, it could be a typo from someone trying to type "M". Both corrections can be passed to replace() as a dictionary:

In [7]:
df['Sex'].replace({'N': 'M', 'D': 'F'})

0    M
1    F
2    F
3    F
4    ?
Name: Sex, dtype: object

Another typo might be the 299 in the "Age" column, which is probably 29:

In [None]:
df.replace({
    'Sex': {'N': 'M', 'D': 'F'},
    'Age': {299: 29}
})
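
Note that `replace()` returns a new object and leaves `df` unchanged, so the result has to be assigned to keep it. A minimal sketch, using a hypothetical `cleaned` variable and only the "Sex" mapping:

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 299, 25],
})

# replace() does not modify df; store the result in a new variable
# (or assign it back to df) to make the correction stick.
cleaned = df.replace({'Sex': {'N': 'M', 'D': 'F'}})

print(df['Sex'].tolist())       # the original still contains 'D'
print(cleaned['Sex'].tolist())  # 'D' has become 'F'
```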

If more values in "Age" are out of range, we need to correct them as well. An age will probably be a number below 100, so a number greater than 100 is probably a typo.

In [None]:
too_high = df['Age'] > 100  # rows whose age is implausibly large
df.loc[too_high, 'Age'] = df.loc[too_high, 'Age'] // 10  # integer division keeps Age as an integer, e.g. 299 // 10 = 29
df
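
One way to confirm the correction worked is to re-check the range afterwards. The sketch below assumes a plausible age range of 0 to 100, which is an illustrative choice rather than a rule taken from the data; `in_range` is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 299, 25],
})

# Same correction as above, written with integer division.
too_high = df['Age'] > 100
df.loc[too_high, 'Age'] = df.loc[too_high, 'Age'] // 10

# between() is inclusive on both ends by default.
in_range = df['Age'].between(0, 100)
print(in_range.all())
```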

#### Duplicates

In [20]:
ambassadors = pd.Series([
    'France',
    'United Kingdom',
    'United Kingdom',
    'Italy',
    'Germany',
    'Germany',
    'Germany',
], index=[
    'Gérard Araud',
    'Kim Darroch',
    'Peter Westmacott',
    'Armando Varricchio',
    'Peter Wittig',
    'Peter Ammon',
    'Klaus Scharioth'
])
ambassadors

Gérard Araud                  France
Kim Darroch           United Kingdom
Peter Westmacott      United Kingdom
Armando Varricchio             Italy
Peter Wittig                 Germany
Peter Ammon                  Germany
Klaus Scharioth              Germany
dtype: object

To check whether there are duplicates in a Series or DataFrame, we use duplicated(). Then, to drop those values/rows, we use drop_duplicates(). The duplicated() method has a parameter "keep", with "first" as its default value. This means the first occurrence of a value returns False, and every later occurrence of that value returns True.

In [33]:
ambassadors.duplicated()

Gérard Araud          False
Kim Darroch           False
Peter Westmacott       True
Armando Varricchio    False
Peter Wittig          False
Peter Ammon            True
Klaus Scharioth        True
dtype: bool
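
Because `duplicated()` returns booleans, the mask can be summed to count the repeated entries or used directly to select them. A minimal sketch on a shortened, hypothetical series (the index labels are placeholders):

```python
import pandas as pd

countries = pd.Series(
    ['France', 'United Kingdom', 'United Kingdom',
     'Germany', 'Germany', 'Germany'],
    index=['A', 'B', 'C', 'D', 'E', 'F'],  # placeholder labels
)

mask = countries.duplicated()    # keep='first' by default
n_duplicates = int(mask.sum())   # each True counts as 1
repeated = countries[mask]       # only the repeated occurrences

print(n_duplicates)
print(repeated)
```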

In this case, two values are duplicated: United Kingdom and Germany. With keep='last', only the last occurrence of a value is not considered a duplicate:

In [None]:
ambassadors.duplicated(keep='last')

To flag all the values that have duplicates, use keep=False. It returns True for every occurrence of an item that appears more than once:

In [None]:
a = pd.Series([1, 2, 2, 3, 4, 5, 5, 5, 6, 7])  # keep the values in a, not the boolean mask
a.duplicated(keep=False)

To show the values that have duplicates, we can use the boolean mask with .loc[]:

In [None]:
a.loc[a.duplicated(keep=False)]

In [None]:
ambassadors.duplicated(keep=False)

To drop the duplicated values, use drop_duplicates(), which can also take the keep parameter:

In [None]:
ambassadors.drop_duplicates()

In [None]:
ambassadors.drop_duplicates(keep='last')

In [None]:
ambassadors.drop_duplicates(keep=False)
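
To compare the three `keep` options side by side, the sketch below runs each one on a small hypothetical series `s` and records which index labels survive.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'b', 'c', 'c', 'c'])

kept_first = s.drop_duplicates().index.tolist()            # keep='first' (default)
kept_last = s.drop_duplicates(keep='last').index.tolist()  # keep the last occurrence
kept_none = s.drop_duplicates(keep=False).index.tolist()   # drop every repeated value

print(kept_first)  # [0, 1, 3]
print(kept_last)   # [0, 2, 5]
print(kept_none)   # [0]
```

With keep=False, only values that appear exactly once remain.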

##### Duplicates in DataFrames
Conceptually speaking, duplicates in a DataFrame happen at "row" level. Two rows with exactly the same values are considered to be duplicates.

In [48]:
players = pd.DataFrame({
    'Name': [
        'Kobe Bryant',
        'LeBron James',
        'Kobe Bryant',
        'Carmelo Anthony',
        'Kobe Bryant',
    ],
    'Pos': [
        'SG',
        'SF',
        'SG',
        'SF',
        'SF'
    ]
})
players

              Name Pos
0      Kobe Bryant  SG
1     LeBron James  SF
2      Kobe Bryant  SG
3  Carmelo Anthony  SF
4      Kobe Bryant  SF


In [38]:
players.duplicated()

0    False
1    False
2     True
3    False
4    False
dtype: bool

We can also look for duplicates in just a subset of the columns:

In [39]:
players.duplicated(subset='Name')

0    False
1    False
2     True
3    False
4     True
dtype: bool
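
Combining `subset` with keep=False gives a quick way to inspect every row that shares a name, so the conflicting "Pos" values can be compared. A minimal sketch, with `same_name` and `repeated_players` as hypothetical names:

```python
import pandas as pd

players = pd.DataFrame({
    'Name': ['Kobe Bryant', 'LeBron James', 'Kobe Bryant',
             'Carmelo Anthony', 'Kobe Bryant'],
    'Pos': ['SG', 'SF', 'SG', 'SF', 'SF'],
})

# keep=False marks every row whose Name appears more than once.
same_name = players.duplicated(subset=['Name'], keep=False)
repeated_players = players.loc[same_name]
print(repeated_players)
```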

In [47]:
players.drop_duplicates()

              Name Pos
0      Kobe Bryant  SG
1     LeBron James  SF
3  Carmelo Anthony  SF
4      Kobe Bryant  SF


In [51]:
players.drop_duplicates(keep='last', subset=['Name'])

              Name Pos
1     LeBron James  SF
3  Carmelo Anthony  SF
4      Kobe Bryant  SF


In [52]:
players.drop_duplicates(keep=False, subset=['Name'])

              Name Pos
1     LeBron James  SF
3  Carmelo Anthony  SF
