#### Fixing invalid values

In [142]:
import pandas as pd
import numpy as np

In [143]:
df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 299, 25],
})
df

Unnamed: 0,Sex,Age
0,M,29
1,F,30
2,F,24
3,D,299
4,?,25


To find values that might be wrong or out of range, we can check the unique values of a column with:
- unique(): returns an array of the unique values in the column
- value_counts(): returns a Series with each unique value and its count in the column
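
When the set of valid categories is known in advance, a boolean mask can list the offending rows directly. The sketch below assumes that only "M" and "F" are valid; the names `sample`, `valid_sex` and `invalid_rows` are illustrative and not part of the notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 299, 25],
})

# Assumption: only these two codes are valid for this column
valid_sex = ['M', 'F']

# isin() marks the valid entries; ~ inverts the mask to select the rest
invalid_rows = sample[~sample['Sex'].isin(valid_sex)]
print(invalid_rows)
```

This selects the rows holding "D" and "?", which is handy when a column has too many distinct values to inspect by eye.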

In [144]:
df['Sex'].unique()

array(['M', 'F', 'D', '?'], dtype=object)

In [145]:
df['Sex'].value_counts()

Sex
F    2
M    1
D    1
?    1
Name: count, dtype: int64

In this case, the "Sex" column contains incorrect values. This column should contain only "F" and "M", but we also have "D" and "?". The "D" might be a mistype, since "F" and "D" are next to each other on the keyboard.

In [146]:
df['Sex'].replace('D', 'F')

0    M
1    F
2    F
3    F
4    ?
Name: Sex, dtype: object

If there were an "N" in the "Sex" column, it could be a mistype from someone trying to type "M".

In [147]:
df['Sex'].replace({'N': 'M', 'D': 'F'})

0    M
1    F
2    F
3    F
4    ?
Name: Sex, dtype: object

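Another mistype could be an age such as 290, which is probably 29. Note that our sample data contains 299 rather than 290, so the replacement below leaves the "Age" column unchanged.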

In [148]:
df.replace({
    'Sex': {'N': 'M', 'D': 'F'},
    'Age': {290: 29}
})

Unnamed: 0,Sex,Age
0,M,29
1,F,30
2,F,24
3,F,299
4,?,25


If more values in "Age" are out of range, we need to correct them as well. An age is most likely a number below 100, so a value greater than 100 is probably a mistype with an extra digit.

In [149]:
mask = df['Age'] > 100
df.loc[mask, 'Age'] = df.loc[mask, 'Age'] // 10  # integer division drops the extra digit and keeps integers, e.g. 299 // 10 = 29
df


Unnamed: 0,Sex,Age
0,M,29
1,F,30
2,F,24
3,D,29
4,?,25
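
Note that the "Sex" column in the output above still contains "D": replace() returns a new object and leaves the original untouched unless the result is assigned back. A minimal sketch of the difference, using the hypothetical names `sample` and `fixed`:

```python
import pandas as pd

sample = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 299, 25],
})

# replace() returns a new DataFrame; `sample` itself is not modified
fixed = sample.replace({'Sex': {'N': 'M', 'D': 'F'}})

print(sample['Sex'].tolist())  # the original still has 'D'
print(fixed['Sex'].tolist())   # 'D' has been replaced by 'F'
```

To make the correction permanent, assign the result back, for example to the same variable or to the affected column.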


#### Duplicates

In [150]:
ambassadors = pd.Series([
    'France',
    'United Kingdom',
    'United Kingdom',
    'Italy',
    'Germany',
    'Germany',
    'Germany',
], index=[
    'Gérard Araud',
    'Kim Darroch',
    'Peter Westmacott',
    'Armando Varricchio',
    'Peter Wittig',
    'Peter Ammon',
    'Klaus Scharioth '
])
ambassadors

Gérard Araud                  France
Kim Darroch           United Kingdom
Peter Westmacott      United Kingdom
Armando Varricchio             Italy
Peter Wittig                 Germany
Peter Ammon                  Germany
Klaus Scharioth              Germany
dtype: object

To check for duplicates in a Series or DataFrame, we use duplicated(). To drop those values or rows, we use drop_duplicates(). The duplicated() method has a parameter "keep", with "first" as its default value. This means the first occurrence of a value returns False and every later occurrence returns True.

In [151]:
ambassadors.duplicated()

Gérard Araud          False
Kim Darroch           False
Peter Westmacott       True
Armando Varricchio    False
Peter Wittig          False
Peter Ammon            True
Klaus Scharioth        True
dtype: bool

In this case two values are duplicated: United Kingdom and Germany. With keep='last', only the last occurrence of a value is not considered a duplicate.

In [152]:
ambassadors.duplicated(keep='last')

Gérard Araud          False
Kim Darroch            True
Peter Westmacott      False
Armando Varricchio    False
Peter Wittig           True
Peter Ammon            True
Klaus Scharioth       False
dtype: bool

To flag every value that has duplicates, use keep=False. It returns True for every occurrence of an item that appears more than once.

In [153]:
a = pd.Series([1, 2, 2, 3, 4, 5, 5, 5, 6, 7]).duplicated(keep=False)
a

0    False
1     True
2     True
3    False
4    False
5     True
6     True
7     True
8    False
9    False
dtype: bool

To show the values that have duplicates, we can use .loc[] with the boolean mask. Since a already holds the mask, we apply it to the original values:

In [154]:
s = pd.Series([1, 2, 2, 3, 4, 5, 5, 5, 6, 7])
s.loc[a]

1    2
2    2
5    5
6    5
7    5
dtype: int64

In [155]:
ambassadors.duplicated(keep=False)

Gérard Araud          False
Kim Darroch            True
Peter Westmacott       True
Armando Varricchio    False
Peter Wittig           True
Peter Ammon            True
Klaus Scharioth        True
dtype: bool

To drop the duplicated values, use drop_duplicates(), which can also take the keep parameter:

In [156]:
ambassadors.drop_duplicates()

Gérard Araud                  France
Kim Darroch           United Kingdom
Armando Varricchio             Italy
Peter Wittig                 Germany
dtype: object

In [157]:
ambassadors.drop_duplicates(keep='last')

Gérard Araud                  France
Peter Westmacott      United Kingdom
Armando Varricchio             Italy
Klaus Scharioth              Germany
dtype: object

In [158]:
ambassadors.drop_duplicates(keep=False)

Gérard Araud          France
Armando Varricchio     Italy
dtype: object
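
The effect of the three keep options can be compared side by side by counting how many entries survive each one. This is only a sketch; `countries` and `remaining_counts` are illustrative names, and the index of ambassador names is left out for brevity.

```python
import pandas as pd

countries = pd.Series([
    'France', 'United Kingdom', 'United Kingdom',
    'Italy', 'Germany', 'Germany', 'Germany',
])

# Number of entries left after drop_duplicates() for each keep option
remaining_counts = {
    str(keep): len(countries.drop_duplicates(keep=keep))
    for keep in ['first', 'last', False]
}
print(remaining_counts)
```

Both 'first' and 'last' keep one entry per country, while keep=False removes every country that appears more than once.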

##### Duplicates in DataFrames
Conceptually speaking, duplicates in a DataFrame happen at "row" level. Two rows with exactly the same values are considered to be duplicates.

In [159]:
players = pd.DataFrame({
    'Name': [
        'Kobe Bryant',
        'LeBron James',
        'Kobe Bryant',
        'Carmelo Anthony',
        'Kobe Bryant',
    ],
    'Pos': [
        'SG',
        'SF',
        'SG',
        'SF',
        'SF'
    ]
})
players

Unnamed: 0,Name,Pos
0,Kobe Bryant,SG
1,LeBron James,SF
2,Kobe Bryant,SG
3,Carmelo Anthony,SF
4,Kobe Bryant,SF


In [160]:
players.duplicated()

0    False
1    False
2     True
3    False
4    False
dtype: bool

We can also check for duplicates in a subset of the columns.

In [161]:
players.duplicated(subset='Name')

0    False
1    False
2     True
3    False
4     True
dtype: bool

In [162]:
players.drop_duplicates()

Unnamed: 0,Name,Pos
0,Kobe Bryant,SG
1,LeBron James,SF
3,Carmelo Anthony,SF
4,Kobe Bryant,SF


In [163]:
players.drop_duplicates(keep='last', subset=['Name'])

Unnamed: 0,Name,Pos
1,LeBron James,SF
3,Carmelo Anthony,SF
4,Kobe Bryant,SF


In [164]:
players.drop_duplicates(keep=False, subset=['Name'])

Unnamed: 0,Name,Pos
1,LeBron James,SF
3,Carmelo Anthony,SF
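
Before dropping anything, it can be useful to look at all the rows involved in a duplication. The sketch below combines duplicated(keep=False) with boolean indexing; `roster` and `repeated` are hypothetical names, and the data mirrors the players table.

```python
import pandas as pd

roster = pd.DataFrame({
    'Name': ['Kobe Bryant', 'LeBron James', 'Kobe Bryant',
             'Carmelo Anthony', 'Kobe Bryant'],
    'Pos': ['SG', 'SF', 'SG', 'SF', 'SF'],
})

# keep=False flags every row whose Name appears more than once
repeated = roster[roster.duplicated(subset='Name', keep=False)]
print(repeated)
```

Here the three "Kobe Bryant" rows are returned, which shows that one of them has a different position and may deserve a closer look before it is discarded.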


#### Text Handling
Most of the time, invalid text values involve mistyping. Some of the ways to fix these problems are:

##### Splitting Columns

The result of a survey is loaded and this is what you get:

In [165]:
df = pd.DataFrame({
    'Data': [
        '1987_M_US _1',
        '1990?_M_UK_1',
        '1992_F_US_2',
        '1970?_M_   IT_1',
        '1985_F_I  T_2'
]})
df

Unnamed: 0,Data
0,1987_M_US _1
1,1990?_M_UK_1
2,1992_F_US_2
3,1970?_M_ IT_1
4,1985_F_I T_2


In this case, the values for Year, Sex, Country and Number of Kids are grouped in one column. We can use split() to split the values.

In [166]:
df['Data'].str.split('_')

0       [1987, M, US , 1]
1       [1990?, M, UK, 1]
2        [1992, F, US, 2]
3    [1970?, M,    IT, 1]
4      [1985, F, I  T, 2]
Name: Data, dtype: object
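
Before expanding the lists into columns, one possible sanity check is to confirm that every row yields the same number of fields. A small sketch, assuming four fields are expected; `raw` and `field_counts` are illustrative names.

```python
import pandas as pd

raw = pd.Series(['1987_M_US _1', '1990?_M_UK_1', '1992_F_US_2'])

# Split each string on '_' and count the resulting pieces per row
field_counts = raw.str.split('_').str.len()
print(field_counts.tolist())
```

A row with a count other than four would point to a missing or extra separator in the raw data.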

In [167]:
df['Data'].str.split('_', expand=True)

Unnamed: 0,0,1,2,3
0,1987,M,US,1
1,1990?,M,UK,1
2,1992,F,US,2
3,1970?,M,IT,1
4,1985,F,I T,2


In [168]:
df = df['Data'].str.split('_', expand=True)
df.columns = ['Year', 'Sex', 'Country', 'Number of Kids']
df

Unnamed: 0,Year,Sex,Country,Number of Kids
0,1987,M,US,1
1,1990?,M,UK,1
2,1992,F,US,2
3,1970?,M,IT,1
4,1985,F,I T,2


Now some Year values have a "?" attached to the year, which we have to remove.

In [169]:
df['Year'] = df['Year'].str.replace('?', '', regex=False)
df

Unnamed: 0,Year,Sex,Country,Number of Kids
0,1987,M,US,1
1,1990,M,UK,1
2,1992,F,US,2
3,1970,M,IT,1
4,1985,F,I T,2


There is still a problem with the "Country" column: some values have extra spaces

In [170]:
df['Country'] = df['Country'].str.replace(' ', '', regex=False)
df

Unnamed: 0,Year,Sex,Country,Number of Kids
0,1987,M,US,1
1,1990,M,UK,1
2,1992,F,US,2
3,1970,M,IT,1
4,1985,F,IT,2
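
Removing every space works here because country codes never contain one. For comparison, the sketch below shows how str.strip() differs: it only trims the ends, so an inner space such as the one in "I  T" would remain. The name `codes` is illustrative.

```python
import pandas as pd

codes = pd.Series(['US ', 'UK', 'US', '   IT', 'I  T'])

# str.strip() removes only leading and trailing whitespace
stripped = codes.str.strip()

# str.replace() with regex=False removes every literal space
no_spaces = codes.str.replace(' ', '', regex=False)

print(stripped.tolist())
print(no_spaces.tolist())
```

For text columns where inner spaces are meaningful, such as full names, strip() is usually the safer choice.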


Columns like Year and Number of Kids should be integers.

In [174]:
df['Year'] = df['Year'].apply(lambda x: int(x))
df['Number of Kids'] = df['Number of Kids'].apply(lambda x: int(x))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Year            5 non-null      int64 
 1   Sex             5 non-null      object
 2   Country         5 non-null      object
 3   Number of Kids  5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes
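
The int() conversion above raises an error if any value still contains a stray character. As an alternative sketch, pd.to_numeric with errors='coerce' turns unparseable values into NaN so they can be inspected afterwards; `years` and `parsed` are hypothetical names.

```python
import pandas as pd

years = pd.Series(['1987', '1990?', '1992'])

# errors='coerce' turns values that cannot be parsed into NaN
parsed = pd.to_numeric(years, errors='coerce')
print(parsed)

# Rows that failed to convert can then be located with isna()
print(years[parsed.isna()].tolist())
```

Because NaN is a float, the resulting column is floating point until the missing values are handled.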
