# Data Cleaning

In this segment, we will look at some fundamental types of data cleaning, and at how we can leverage what we've learned so far to get data we can work with.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Missing Values

One of the most common issues in working with real-world data is missing values.  A missing value is any value that denotes that a data point was not recorded.  In Python, these are usually represented by special placeholder values that differ by data type, e.g.:

### Missing Value Types

In [2]:
None # python's null value
np.nan # not a number value from numpy
np.inf # infinite from numpy

inf

In Pandas, the majority of missing values will come in the form of `np.nan`.  Because `np.nan` is literally not a number, any arithmetic operation involving `np.nan` will also result in `np.nan`, i.e.:

In [3]:
1 + np.nan

nan

In [4]:
np.array([1, np.nan, 3, 4]).sum()

nan

In [5]:
np.array([1, np.nan, 3, 4]).mean()

nan
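If the NaNs should simply be skipped rather than propagated, NumPy also ships NaN-aware reductions. A minimal sketch using `np.nansum` and `np.nanmean` on the same array (the variable names are only illustrative):

```python
import numpy as np

values = np.array([1, np.nan, 3, 4])

# The plain reduction propagates NaN ...
plain_sum = values.sum()

# ... while the nan-prefixed variants skip NaN entries.
nan_sum = np.nansum(values)      # 1 + 3 + 4
nan_mean = np.nanmean(values)    # (1 + 3 + 4) / 3

# NaN never compares equal to anything, including itself,
# so use np.isnan rather than == to test for it.
nan_equals_itself = np.nan == np.nan

print(plain_sum, nan_sum, nan_mean, nan_equals_itself)
```

This is the same "skip the NaNs" behavior that pandas reductions such as `pd.Series.sum()` apply by default.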

`None` will behave differently - most operations involving None will result in an error:

In [6]:
1 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In [7]:
np.array([1, None, 3, 4]).sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

However, we deal with `None`s far less in Pandas because, as long as we're dealing with numeric data, `None`s are generally cast into `np.nan`s:

In [8]:
np.array([None], dtype='float')

array([nan])

Lastly, `np.inf` behaves very similarly to `np.nan`, since most operations with infinity will result in infinity.  Indeterminate forms such as `np.inf / np.inf` are the exception and result in `np.nan`:

In [9]:
1 + np.inf

inf

In [10]:
np.array([1, np.inf, 3, 4]).sum()

inf

In [11]:
np.inf / np.inf

nan

### Finding missing values

When we load a data set, we'll need to check for `np.nan`, `np.inf` and in some cases `None` before we do any analysis.  Luckily, `numpy` has a few functions to help us: `np.isnan`, `np.isinf`, and `np.isfinite`, which returns `False` for both `np.nan` and `np.inf`.  Pandas also gives us a helpful function, `pd.isnull`, which covers `None` as well (but does not flag `np.inf`):

In [12]:
np.isnan([1, np.nan, 3, 4])

array([False,  True, False, False])

In [13]:
np.isinf([1, np.inf, 3, 4])

array([False,  True, False, False])

In [14]:
np.isfinite([1, np.inf, np.nan, 3, 4])

array([ True, False, False,  True,  True])

In [15]:
pd.isnull([np.nan, np.inf, None])

array([ True, False,  True])

We can see if there are any missing values by using the `any` method:

In [16]:
pd.isnull([np.nan, np.inf, None]).any()

True

In [17]:
pd.isnull([1, 2, None]).any()

True

In [18]:
pd.isnull([1, 2, 3]).any()

False

### Filtering & Dropping missing values

We can use these functions to help us filter our Series or DataFrames.  For example, if we are looking to filter out all the values that are nan, we can do so easily:

In [19]:
market_caps = pd.Series([954.7, 514.4, np.nan, None, 57.9, 45.7, np.inf, 38.7, 28.8, 25.8])

**note**: `pd.Series.sum()` will ignore `np.nan`, but _not_ `np.inf` as we see below:

In [20]:
market_caps.sum()

inf

In [21]:
market_caps[~np.isnan(market_caps)]

0    954.7
1    514.4
4     57.9
5     45.7
6      inf
7     38.7
8     28.8
9     25.8
dtype: float64

In [22]:
market_caps[np.isfinite(market_caps)]

0    954.7
1    514.4
4     57.9
5     45.7
7     38.7
8     28.8
9     25.8
dtype: float64

In [23]:
market_caps[np.isfinite(market_caps)].sum()

1666.0
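An alternative to filtering with `np.isfinite` is to convert the infinities to `np.nan` first, so that the pandas null-handling tools treat them as missing too. A small sketch, assuming that infinite values in this series really do represent missing data:

```python
import numpy as np
import pandas as pd

market_caps = pd.Series([954.7, 514.4, np.nan, None, 57.9, 45.7, np.inf, 38.7, 28.8, 25.8])

# Map +inf and -inf to NaN so they count as missing values.
cleaned = market_caps.replace([np.inf, -np.inf], np.nan)

n_missing = cleaned.isnull().sum()   # NaN, None and the former inf
total = cleaned.sum()                # sum() skips NaN by default

print(n_missing, round(total, 1))
```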

All the above functions apply to DataFrames as well:

In [24]:
sample_df = pd.DataFrame({
    'A': [1, 2, 3, np.nan],
    'B': [4, 5, None, np.nan],
    'C': [7, np.nan, np.nan, None]
})

In [25]:
np.isfinite(sample_df)

       A      B      C
0   True   True   True
1   True   True  False
2   True  False  False
3  False  False  False


In [26]:
pd.isnull(sample_df)

       A      B      C
0  False  False  False
1  False  False   True
2  False   True   True
3   True   True   True


This also means that we can get a count of nulls by column and by row:

In [27]:
pd.isnull(sample_df).sum()

A    1
B    2
C    3
dtype: int64

In [28]:
pd.isnull(sample_df).sum(axis=1)

0    0
1    1
2    2
3    3
dtype: int64
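Taking the mean of the boolean mask instead of the sum gives the fraction of missing values per column, which is often easier to judge than a raw count. A quick sketch on the same `sample_df`:

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    'A': [1, 2, 3, np.nan],
    'B': [4, 5, None, np.nan],
    'C': [7, np.nan, np.nan, None]
})

# The mean of a boolean mask is the share of True values,
# i.e. the fraction of missing entries in each column.
missing_fraction = sample_df.isnull().mean()

print(missing_fraction)
```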

With this, we can filter on any column.  For example, if we want to filter out every row that has a null in column A, we can do this:

In [29]:
sample_df[sample_df.A.notnull()]

     A    B    C
0  1.0  4.0  7.0
1  2.0  5.0  NaN
2  3.0  NaN  NaN


This removed the last row (we can see this since index `3` is now gone).

Because of how common this operation is, pandas has a method to do this in one step: `.dropna`.

In [30]:
sample_df.dropna(subset=['A'])

     A    B    C
0  1.0  4.0  7.0
1  2.0  5.0  NaN
2  3.0  NaN  NaN


By default, `.dropna` will drop a row if _any_ element in that row has a null value.  We can use `subset` to select which columns we care about for dropping.  E.g. if we just called `.dropna` without any arguments on our DataFrame, we would get back just the first row, since it's the only row with no nulls:

In [31]:
sample_df.dropna()

     A    B    C
0  1.0  4.0  7.0


Lastly, we can also drop by column (i.e. if a column has any null, we drop it).  In our case this would leave no columns, since every column contains at least one null:

In [32]:
sample_df.dropna(axis=1)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


We can also tune how aggressively `.dropna` drops.  The `how` argument lets us drop on `any` nulls (the default) or only when `all` values are null, and the `thresh` argument lets us specify the minimum number of non-null values a row/column must have in order to be kept:

In [33]:
sample_df.dropna(axis=1, how='all')

     A    B    C
0  1.0  4.0  7.0
1  2.0  5.0  NaN
2  3.0  NaN  NaN
3  NaN  NaN  NaN


In [34]:
sample_df.dropna(axis=1, thresh=2)

     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  NaN
3  NaN  NaN
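To make the `thresh` semantics concrete, here is a small sketch on the same `sample_df`. `thresh=N` keeps a row or column only if it has at least `N` non-null values:

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    'A': [1, 2, 3, np.nan],
    'B': [4, 5, None, np.nan],
    'C': [7, np.nan, np.nan, None]
})

# Non-null counts per column: A has 3, B has 2, C has 1.
non_null_per_column = sample_df.notnull().sum()

# Keep columns with at least 2 non-null values (drops C).
cols_kept = list(sample_df.dropna(axis=1, thresh=2).columns)

# Keep rows with at least 2 non-null values (drops rows 2 and 3).
rows_kept = list(sample_df.dropna(thresh=2).index)

print(non_null_per_column.to_dict(), cols_kept, rows_kept)
```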


### Filling missing values

Sometimes we don't want to lose data, and replacing missing values with something else can be more advantageous than dropping entire rows or columns (which are the only two choices for dropping when working with tabular data).  For example, the columns with missing data may be of low importance (so dropping high-value rows because of low-importance features would be wasteful), or a column may exist mainly to detect outliers, in which case it is safe to fill in missing data with the median value rather than drop data points.

To do this we can use the `.fillna` method that's on both Series and DataFrames, which allows us to fill missing data points with a variety of different strategies.

The easiest way to fill in missing values is to use a single value to fill in, e.g. we can fill in with a 0, or with the sample mean:

In [35]:
market_caps = pd.Series([954.7, 514.4, np.nan, None, 57.9, 45.7, np.nan, 38.7, 28.8, 25.8])

In [36]:
market_caps.fillna(0)

0    954.7
1    514.4
2      0.0
3      0.0
4     57.9
5     45.7
6      0.0
7     38.7
8     28.8
9     25.8
dtype: float64

In [37]:
market_caps.fillna(market_caps.mean())

0    954.7
1    514.4
2    238.0
3    238.0
4     57.9
5     45.7
6    238.0
7     38.7
8     28.8
9     25.8
dtype: float64

If you want to fill specific items but not others, this can also be done by passing in a Series, which will fill by its index:

In [38]:
market_caps.fillna(pd.Series({3:100}))

0    954.7
1    514.4
2      NaN
3    100.0
4     57.9
5     45.7
6      NaN
7     38.7
8     28.8
9     25.8
dtype: float64

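One really handy feature is the ability to forward fill or backfill.  Forward fill propagates the last non-null value forward into each null position that follows it, and backfill does the opposite, pulling the next non-null value backward.  Note that recent pandas versions deprecate the `method` argument of `.fillna` in favor of the dedicated `.ffill()` and `.bfill()` methods, which behave the same way.

This is particularly useful for time series, especially time series that are not in regular intervals, since it allows you to create a complete data set by filling down every timestamp from your joins.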

In [39]:
# forward fill; equivalent to the deprecated fillna(method='ffill')
market_caps.ffill()

0    954.7
1    514.4
2    514.4
3    514.4
4     57.9
5     45.7
6     45.7
7     38.7
8     28.8
9     25.8
dtype: float64

In [40]:
# backfill; equivalent to the deprecated fillna(method='bfill')
market_caps.bfill()

0    954.7
1    514.4
2     57.9
3     57.9
4     57.9
5     45.7
6     38.7
7     38.7
8     28.8
9     25.8
dtype: float64
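As a sketch of the time-series use case, the example below takes a few irregularly spaced observations, reindexes them onto a regular daily grid, and forward fills the gaps. The dates and prices are made up for illustration:

```python
import pandas as pd

# Hypothetical prices observed on irregular dates.
prices = pd.Series(
    [100.0, 102.5, 101.0],
    index=pd.to_datetime(['2021-01-01', '2021-01-04', '2021-01-06'])
)

# Build a regular daily index covering the whole period.
daily_index = pd.date_range('2021-01-01', '2021-01-07', freq='D')

# Reindexing introduces NaN on the days with no observation;
# ffill() then carries the last known price forward.
daily_prices = prices.reindex(daily_index).ffill()

print(daily_prices)
```

Forward fill is usually the safer direction for time series, since backfilling would pull future values into earlier timestamps.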

DataFrames have the same method, but it can work on both dimensions (like most DataFrame functions), e.g.:

In [41]:
sample_df.fillna({'A': 4, 'B': sample_df.B.mean()})

     A    B    C
0  1.0  4.0  7.0
1  2.0  5.0  NaN
2  3.0  4.5  NaN
3  4.0  4.5  NaN


In [42]:
# forward fill down each column
sample_df.ffill()

     A    B    C
0  1.0  4.0  7.0
1  2.0  5.0  7.0
2  3.0  5.0  7.0
3  3.0  5.0  7.0


In [43]:
# forward fill across each row (left to right)
sample_df.ffill(axis=1)

     A    B    C
0  1.0  4.0  7.0
1  2.0  5.0  5.0
2  3.0  3.0  3.0
3  NaN  NaN  NaN


In [44]:
# forward fill only column C, leaving A and B untouched
sample_df.assign(C=sample_df['C'].ffill())

     A    B    C
0  1.0  4.0  7.0
1  2.0  5.0  7.0
2  3.0  NaN  7.0
3  NaN  NaN  7.0


## Replace Values

The second most common problem you will likely encounter is the need to replace values.  This could be due to outliers (e.g. your data set reports 120% for a feature that can only be 100% at most), mangled ingestion (e.g. all values that should say "BTC" say "BTT"), or a need for the values to be formatted in a certain way before using the data (e.g. all values that say "Ethereum" need to say "ETH").

In this situation we will still need to take action similar to what we did for missing data, but `.isnull`, `.dropna` and `.fillna` will no longer help us since these values will not be considered nulls.

This problem becomes even more difficult when we have large data sets and the outliers are few or only slightly off:

In [45]:
df = pd.DataFrame({
    'pct_scored': np.random.permutation(
        np.append(
            100 * np.random.rand(980),
            100 + np.random.rand(20)
        )
    ),
    'district': np.random.permutation(
        np.append(
            np.random.choice(['West', 'South', 'East', 'North'], 990), 
            (np.random.choice(['Nt', 'Sth'], 10))
        )
    ),
    'city': np.random.permutation(
        np.append(
            np.random.choice(['Westbrook', 'Southbridge', 'Eastriver', 'Northgate'], 990), 
            (np.random.choice(['Westnook', 'Northgrate'], 10))
        )
    )
})

### Identifying categorical values that need replacement

For categorical values, one of the easiest ways to find whether we have values that need replacing is to identify the distinct set of values and see if there's anything strange.  For example, imagine we had a data set like:

In [46]:
df

     pct_scored  district         city
0     23.443968     North    Westbrook
1     93.707695     South    Northgate
2     36.516693      West    Northgate
3     28.648353     North    Eastriver
4     91.764077      West  Southbridge
...         ...       ...          ...
995   16.670303     South  Southbridge
996   62.784568      West  Southbridge
997   74.403803      West  Southbridge
998   10.947715     North  Southbridge


We can identify the unique set of names for district using `.unique`.  From this we find that there are two values we didn't expect: `Sth` and `Nt`.

In [47]:
df['district'].unique()

array(['North', 'South', 'West', 'East', 'Nt', 'Sth'], dtype=object)

It may not always be clear which values are actually errors.  We can use `.value_counts` to spot outliers:

In [48]:
df['district'].value_counts()

South    261
East     247
West     243
North    239
Nt         6
Sth        4
Name: district, dtype: int64

This shows us that `Nt` and `Sth` are likely errors, since they occur far less often than the other values.
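When the set of valid categories is known up front, the unexpected values can also be flagged programmatically instead of by eye. A minimal sketch, assuming a hypothetical `valid_districts` list and a small stand-in series rather than the full data set:

```python
import pandas as pd

# Small stand-in for df['district'].
district = pd.Series(['North', 'South', 'Nt', 'West', 'Sth', 'East', 'Nt'])

# Assumed list of legitimate category labels.
valid_districts = ['North', 'South', 'East', 'West']

# True wherever the value is NOT one of the valid labels.
is_invalid = ~district.isin(valid_districts)

# Count how often each unexpected label occurs.
invalid_counts = district[is_invalid].value_counts()

print(invalid_counts)
```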

Let's say we know that `Sth` is shorthand for South, and `Nt` is shorthand for North (this error could be due to the source data being manual entry).  We can use `.replace` to swap an incorrect value for the correct one:

In [49]:
df['district'].replace('Nt', 'North').value_counts()

South    261
East     247
North    245
West     243
Sth        4
Name: district, dtype: int64

However, if we have multiple values to replace, doing this one value at a time is tedious.  Instead, we can pass in a dict and replace everything at once:

In [50]:
df['district'].replace({'Nt': 'North', 'Sth': 'South'}).value_counts()

South    265
East     247
North    245
West     243
Name: district, dtype: int64

With a DataFrame, we can do this across multiple columns as well with a nested dict:

In [51]:
df.replace({
    'district': {'Nt': 'North', 'Sth': 'South'},
    'city': {'Westnook': 'Westbrook', 'Northgrate': 'Northgate'}
})

     pct_scored  district         city
0     23.443968     North    Westbrook
1     93.707695     South    Northgate
2     36.516693      West    Northgate
3     28.648353     North    Eastriver
4     91.764077      West  Southbridge
...         ...       ...          ...
995   16.670303     South  Southbridge
996   62.784568      West  Southbridge
997   74.403803      West  Southbridge
998   10.947715     North  Southbridge


In [52]:
df.replace({
    'district': {'Nt': 'North', 'Sth': 'South'},
    'city': {'Westnook': 'Westbrook', 'Northgrate': 'Northgate'}
})['city'].value_counts()

Westbrook      263
Northgate      257
Eastriver      243
Southbridge    237
Name: city, dtype: int64

### Identifying numerical values that need replacement

We found out how we can easily replace categorical values, but what about numerical outliers?  This is even easier.  We can first identify how many rows we need to replace, then replace them by assigning through `.loc`:

In [53]:
df.loc[df.pct_scored > 100].shape

(20, 3)

In [54]:
df.loc[df.pct_scored > 100, 'pct_scored'] = 100
df.loc[df.pct_scored > 100].shape

(0, 3)
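For a simple cap like this one, `.clip` gives the same result without boolean indexing. A short sketch on made-up values:

```python
import pandas as pd

# Made-up percentages, two of which exceed the 100% ceiling.
pct_scored = pd.Series([23.4, 100.7, 91.8, 100.2, 64.0])

# clip(upper=100) caps everything above 100 at exactly 100
# and leaves the remaining values untouched.
capped = pct_scored.clip(upper=100)

print(capped.tolist())
```

Unlike the `.loc` assignment, `.clip` returns a new Series, so the result has to be assigned back if the original column should change.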

## Duplicate Values

Duplicate values are super simple to detect and resolve - the standard methodology is to remove all but one of the duplicated values, or to remove all duplicates (as they may indicate something wrong with the upstream data).  We can use `.duplicated` to detect duplicates, and `.drop_duplicates` to remove them.

Let's see an example:

In [55]:
df = pd.DataFrame({
    'A': ['A', 'B', 'C', 'B', 'D', 'D', 'D'],
    'B': [1, 2, 3, 4, 5, 5, 7]
})

In [56]:
df.duplicated()

0    False
1    False
2    False
3    False
4    False
5     True
6    False
dtype: bool

Currently, `.duplicated` shows that only one row is duplicated.  This is because, by default, `.duplicated` checks for fully duplicated rows - i.e. _every_ element in the row is duplicated.  If we want to only check for duplicates in column A, we can do this:

In [57]:
df.duplicated(subset=['A'])

0    False
1    False
2    False
3     True
4    False
5     True
6     True
dtype: bool

Also notice that by default only occurrences beyond the first are considered duplicates (i.e. the first occurrence sequentially is not a duplicate).  We can change this behavior with `keep`: `keep='last'` treats the last occurrence as the non-duplicate, and `keep=False` marks every occurrence as a duplicate:

In [58]:
df.duplicated(subset=['A'], keep='last')

0    False
1     True
2    False
3    False
4     True
5     True
6    False
dtype: bool

In [59]:
df.duplicated(subset=['A'], keep=False)

0    False
1     True
2    False
3     True
4     True
5     True
6     True
dtype: bool

Once we know which items are duplicated and our strategy for keeping records, we can simply call `.drop_duplicates` with the exact same parameters to return a DataFrame with only our unique rows:

In [60]:
df.drop_duplicates(subset=['A'], keep='first')

   A  B
0  A  1
1  B  2
2  C  3
4  D  5
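If duplicated keys signal an upstream problem and none of the copies can be trusted, `keep=False` drops every row that has a duplicate. A sketch on the same data (named `dup_df` here so it does not clobber `df`):

```python
import pandas as pd

dup_df = pd.DataFrame({
    'A': ['A', 'B', 'C', 'B', 'D', 'D', 'D'],
    'B': [1, 2, 3, 4, 5, 5, 7]
})

# keep=False marks all occurrences as duplicates, so only keys
# that appear exactly once in column A survive.
only_unique = dup_df.drop_duplicates(subset=['A'], keep=False)

# keep='last' retains the final occurrence of each key instead.
last_kept = dup_df.drop_duplicates(subset=['A'], keep='last')

print(only_unique)
print(last_kept)
```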


## Badly formatted values

The last case we will go over is handling malformed text.  This can happen, for example, if you're ingesting data from an HTML source where markup is attached to the data, or if the data is formatted in a weird or unintuitive way.

Luckily, pandas gives us some pretty easy methods to split, truncate and parse data.

### Splitting data

If we have a data set like below, we can do the "text to column" operation using `.str.split`, as below:

In [61]:
df = pd.DataFrame({
    'records': [
        '<b>A</b>_59_D    @',
        '<b>B</b>_92_L    L',
        '<b>C</b>_7_O  Q',
        '<b>D</b>_43_V  O',
        '<b>E</b>_50_J    C',
        '<b>F</b>_53_@   U',
        '<b>G</b>_17_K  C',
        '<b>H</b>_24_K  T',
        '<b>I</b>_58_K    T',
        '<b>J</b>_94_L M',
        '<b>K</b>_60_H M',
        '<b>L</b>_65_Q E',
        '<b>M</b>_23_A C',
        '<b>N</b>_62_PS',
        '<b>O</b>_90_P    X',
        '<b>P</b>_34_O    D',
        '<b>Q</b>_26_T  D',
        '<b>R</b>_78_P   T',
        '<b>S</b>_94_@   C',
        '<b>T</b>_69_?  E',
        '<b>U</b>_50_P T',
        '<b>V</b>_99_T  C',
        '<b>W</b>_20_V Q',
        '<b>X</b>_88_E    O',
        '<b>Y</b>_7_RF',
        '<b>Z</b>_47_EN',
    ]
})

df.head(10)

           records
0  <b>A</b>_59_D @
1  <b>B</b>_92_L L
2   <b>C</b>_7_O Q
3  <b>D</b>_43_V O
4  <b>E</b>_50_J C
5  <b>F</b>_53_@ U
6  <b>G</b>_17_K C
7  <b>H</b>_24_K T
8  <b>I</b>_58_K T
9  <b>J</b>_94_L M


In [62]:
df_res = df.records.str.split('_', expand=True)
df_res.columns = ['column_A', 'column_B', 'column_C']
df_res

   column_A  column_B  column_C
0  <b>A</b>        59       D @
1  <b>B</b>        92       L L
2  <b>C</b>         7       O Q
3  <b>D</b>        43       V O
4  <b>E</b>        50       J C
5  <b>F</b>        53       @ U
6  <b>G</b>        17       K C
7  <b>H</b>        24       K T
8  <b>I</b>        58       K T
9  <b>J</b>        94       L M


Next we can strip leading and trailing whitespace from column C using `.str.strip`, and remove the whitespace in the middle using `.str.replace`:

In [63]:
df_res['column_C'] = df_res.column_C.str.strip()

In [64]:
df_res['column_C'] = df_res.column_C.str.replace(' ', '')

In [65]:
df_res

   column_A  column_B  column_C
0  <b>A</b>        59        D@
1  <b>B</b>        92        LL
2  <b>C</b>         7        OQ
3  <b>D</b>        43        VO
4  <b>E</b>        50        JC
5  <b>F</b>        53        @U
6  <b>G</b>        17        KC
7  <b>H</b>        24        KT
8  <b>I</b>        58        KT
9  <b>J</b>        94        LM


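Lastly, the `.str.extract` method (like the `.str.replace` method above) is very powerful in that it can use any regex.  It returns whatever the capture group in the pattern matches:

In [66]:
# capture the text between <b> and </b>; expand=False returns a Series
df_res['column_A'] = df_res['column_A'].str.extract(r'<b>(.*?)</b>', expand=False)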

In [67]:
df_res

   column_A  column_B  column_C
0         A        59        D@
1         B        92        LL
2         C         7        OQ
3         D        43        VO
4         E        50        JC
5         F        53        @U
6         G        17        KC
7         H        24        KT
8         I        58        KT
9         J        94        LM
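The split, strip and extract steps can also be collapsed into a single regular expression with named groups, with the numeric column converted from strings to integers afterwards. This is only a sketch: the pattern and group names are assumptions based on the sample records above, and only three of the records are used.

```python
import pandas as pd

records = pd.Series([
    '<b>A</b>_59_D    @',
    '<b>C</b>_7_O  Q',
    '<b>N</b>_62_PS',
])

# One named group per output column:
#   column_A - text inside the <b> tag
#   column_B - the digits between the underscores
#   column_C - everything after the second underscore
pattern = r'<b>(?P<column_A>.*?)</b>_(?P<column_B>\d+)_(?P<column_C>.*)'
parsed = records.str.extract(pattern)

# Remove all whitespace from column C, then make column B numeric.
parsed['column_C'] = parsed['column_C'].str.replace(r'\s+', '', regex=True)
parsed['column_B'] = pd.to_numeric(parsed['column_B'])

print(parsed)
```

Records that do not match the pattern come back as NaN in every column, so the missing-value tools from earlier in this segment can be used to find them.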
