# Cleaning Data

## Datatypes

To discover the datatype of each column, use the `.dtypes` attribute of the dataframe or its `.info()` method. Columns that hold strings are typically reported with the `object` dtype.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data2/tips.csv')
df.dtypes

total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB


In [3]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


It is possible to convert columns of one dtype to another. 

It can be useful to convert some string columns to the **categorical** dtype, e.g. gender. This can have a number of benefits:

- dataframe occupies less memory space
- some python packages can use the **categorical** type in analysis

In [4]:
df['sex'] = df['sex'].astype('category')
df['smoker'] = df['smoker'].astype('category')
df.dtypes

total_bill     float64
tip            float64
sex           category
smoker        category
day             object
time            object
size             int64
dtype: object
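
As a rough illustration of the memory benefit, the sketch below compares the footprint of a string column before and after conversion. The series is a hypothetical stand-in for a low-cardinality column such as `sex`, not the tips data itself, and the exact byte counts will vary with the pandas version.

```python
import pandas as pd

# Hypothetical stand-in for a low-cardinality string column like `sex`
sex = pd.Series(['Female', 'Male'] * 1000)

# deep=True also counts the memory used by the Python strings themselves
as_strings = sex.memory_usage(deep=True)
as_category = sex.astype('category').memory_usage(deep=True)

print(as_strings, as_category)
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it shrinks as the number of distinct values grows.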

We can also convert **numerics to strings**:

In [5]:
df['size'] = df['size'].astype(str)
df.dtypes

total_bill     float64
tip            float64
sex           category
smoker        category
day             object
time            object
size            object
dtype: object

And strings to numerics. Columns that should be numeric are often stored as strings because missing fields hold non-numeric placeholders, such as `-`. We can convert these columns with the `pd.to_numeric()` function. Don't forget to pass the `errors='coerce'` argument, otherwise the placeholders will cause the conversion to fail. With `errors='coerce'`, any value that cannot be parsed, e.g. `-`, is set to `NaN`.

In [6]:
df['size'] = pd.to_numeric(df['size'], errors='coerce')
df.dtypes

total_bill     float64
tip            float64
sex           category
smoker        category
day             object
time            object
size             int64
dtype: object
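
The `size` column above has no placeholders, so it converts cleanly back to `int64`. The sketch below uses a small hypothetical series containing a `-` to show what `errors='coerce'` does when a value cannot be parsed.

```python
import pandas as pd

# Hypothetical column where '-' marks a missing entry
raw = pd.Series(['2', '3', '-', '4'])

# '-' cannot be parsed as a number, so it becomes NaN
clean = pd.to_numeric(raw, errors='coerce')

print(clean.dtype)         # float64, because NaN is a float
print(clean.isna().sum())  # 1
```

Note that the presence of `NaN` makes the result `float64` rather than `int64`.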

## Regular Expressions

Much data cleaning involves string manipulation, since a lot of data arrives as unstructured text. Python includes the built-in `re` library for **regular expressions**. Regular expressions provide a way of matching a specific sequence of characters within a string.

**Typical examples**
```txt
- 12        --> \d* (zero or more digits)  

- $12       --> \$\d*  

- $12.00    --> \$\d*\.\d* or \$\d*\.\d{2} or ^\$\d*\.\d{2}$ (match exactly)
```

### Using regular expressions

1. compile a pattern with the `re` library's `compile()` function and assign it to a variable
2. use the `match()` method to match the pattern, passing in the string we want to match. This returns a **match** object, or `None` if the string does not match.
3. use the Python `bool()` function to convert the result into a boolean.

In [11]:
import re

pattern = re.compile(r'^\$\d*\.\d{2}$')  # raw string keeps the backslashes intact
result = pattern.match('$12.124')
bool(result)

False

In [12]:
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

result = prog.match('123-456-7890')
print(bool(result))

result2 = prog.match('1123-456-7890')
print(bool(result2))

True
False


Extracting numbers from strings is a common task, particularly when working with unstructured data or log files. When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the `re.findall()` function. You pass in a pattern and a string to `re.findall()`, and it will return a list of the matches.

`\d` is the pattern required to find digits. This should be followed with a `+` so that the previous element is matched one or more times.

In [13]:
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
matches

['10', '1']

In [14]:
bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))

True

In [15]:
bool(re.match(pattern=r'\w*', string='Australia'))

True
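
One detail worth keeping in mind is that `match()` only anchors at the start of the string, so trailing characters are ignored unless the pattern ends with `$`. The sketch below contrasts `match()`, `fullmatch()` and `search()` from the standard `re` module, using example strings chosen for illustration.

```python
import re

pattern = re.compile(r'\$\d*\.\d{2}')

# match() anchors at the start only, so the trailing '6' is ignored
print(bool(pattern.match('$123.456')))          # True

# fullmatch() requires the whole string to fit the pattern
print(bool(pattern.fullmatch('$123.456')))      # False
print(bool(pattern.fullmatch('$123.45')))       # True

# search() scans for the first position where the pattern matches
print(bool(pattern.search('total: $123.45')))   # True
print(bool(pattern.match('total: $123.45')))    # False
```

For validation, `fullmatch()` or a pattern wrapped in `^...$` is usually the safer choice.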

### Using functions to clean data

Cleaning often involves performing more than one operation on a field, e.g. extracting digits from a string and converting them to a numerical value. To do so we use the pandas `.apply()` method, which allows you to execute a Python function on the data.

Using `apply()` with the argument `axis=0` passes each **column** to the function.

Using `apply()` with the argument `axis=1` passes each **row** to the function.
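
A minimal sketch of the difference between the two axes, using a small hypothetical frame (`df_demo` is not one of the datasets in this section):

```python
import pandas as pd

# Hypothetical two-column frame, used only to show what apply() passes in
df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_totals = df_demo.apply(lambda col: col.sum(), axis=0)  # called once per column
row_totals = df_demo.apply(lambda row: row.sum(), axis=1)  # called once per row

print(col_totals.to_dict())  # {'a': 6, 'b': 60}
print(row_totals.tolist())   # [11, 22, 33]
```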

In [18]:
# import and load the data
import pandas as pd
from numpy import nan as NaN  # the `NaN` alias was removed in NumPy 2.0
import re

df = pd.read_csv('data2/dob_job_application_filings_subset.csv')
df_subset = df[['Job #', 'Doc #', 'Borough', 'Initial Cost', 'Total Est. Fee',]]
df_subset.head()

Unnamed: 0,Job #,Doc #,Borough,Initial Cost,Total Est. Fee
0,121577873,2,MANHATTAN,$75000.00,$986.00
1,520129502,1,STATEN ISLAND,$0.00,$1144.00
2,121601560,1,MANHATTAN,$30000.00,$522.50
3,121601203,1,MANHATTAN,$1500.00,$225.00
4,121601338,1,MANHATTAN,$19500.00,$389.50


**Outline**: 

1. extract the monetary value from 'Initial Cost' and 'Total Est. Fee' columns
2. strip the '$' symbol, convert the strings to numerics, and calculate the difference
3. create a new column with the calculated difference between the two columns

In [19]:
# regular expression pattern
pattern = re.compile(r'^\$\d*\.\d{2}$')

When a function is applied across the rows of a dataframe, the whole row is passed in to that function, even though only one (or more) of its fields is required. The function will thus take the **row** of data and the **pattern** used to match monetary values.

In [21]:
def diff_money(row, pattern):
    icost = row['Initial Cost']
    tef = row['Total Est. Fee']
    # check that we have two valid monetary values, otherwise return 'NaN'
    if bool(pattern.match(icost)) and bool(pattern.match(tef)):
        icost = icost.replace("$", "")
        tef = tef.replace("$", "")
        icost = float(icost)
        tef = float(tef)
        return icost - tef
    else:
        return NaN

To use the function, call the `apply()` method, passing in the function, the pattern and `axis=1`, so the function works **row-wise** (by default `axis=0`, and apply works **column-wise**).

In [22]:
df_subset = df_subset.copy()  # explicit copy, so adding a column does not trigger SettingWithCopyWarning
df_subset['diff'] = df_subset.apply(diff_money, pattern=pattern, axis=1)
df_subset.head()

Unnamed: 0,Job #,Doc #,Borough,Initial Cost,Total Est. Fee,diff
0,121577873,2,MANHATTAN,$75000.00,$986.00,74014.0
1,520129502,1,STATEN ISLAND,$0.00,$1144.00,-1144.0
2,121601560,1,MANHATTAN,$30000.00,$522.50,29477.5
3,121601203,1,MANHATTAN,$1500.00,$225.00,1275.0
4,121601338,1,MANHATTAN,$19500.00,$389.50,19110.5
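
A row-wise `apply()` is easy to read but calls a Python function once per row. As a possible alternative, the sketch below does the same subtraction with column-wise string methods and `pd.to_numeric()`. The frame `df_money` and the helper `to_amount` are hypothetical names introduced for illustration. Unlike the regular expression above, this version does not insist on exactly two decimal places; it only rejects values that cannot be parsed as numbers.

```python
import pandas as pd

# Hypothetical rows mimicking the two money columns; '-' stands in for a bad value
df_money = pd.DataFrame({
    'Initial Cost': ['$75000.00', '$0.00', '-'],
    'Total Est. Fee': ['$986.00', '$1144.00', '$522.50'],
})

def to_amount(col):
    # Strip the literal '$' and coerce anything unparseable to NaN
    return pd.to_numeric(col.str.replace('$', '', regex=False), errors='coerce')

df_money['diff'] = to_amount(df_money['Initial Cost']) - to_amount(df_money['Total Est. Fee'])
print(df_money['diff'].tolist())  # [74014.0, -1144.0, nan]
```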


The tips dataset has a `sex` column that contains the values `Male` or `Female`. Write a function that will recode `Female` to `0`, `Male` to `1`, and return `np.nan` for all entries of `sex` that are neither `Female` nor `Male`.

Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code.

You can use the `.apply()` method to apply a function across entire rows or columns of DataFrames. However, note that each column of a DataFrame is a pandas **Series**. Functions can also be applied across **Series**. Here, you will apply your function over the `sex` column.


**NOTE**:

You can also convert the `sex` column into a categorical type.

In [26]:
import pandas as pd
import numpy as np
import re

df_tips = pd.read_csv('data2/tips.csv')
df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [27]:
df_tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB


In [28]:
def recode_gender(gender):

    # Return 1 if gender is 'Male'
    if gender == 'Male':
        return 1

    # Return 0 if gender is 'Female'
    elif gender == 'Female':
        return 0

    # Return np.nan for any other value
    else:
        return np.nan

In [29]:
df_tips['recode'] = df_tips['sex'].apply(recode_gender)
df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,recode
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,1
2,21.01,3.5,Male,No,Sun,Dinner,3,1
3,23.68,3.31,Male,No,Sun,Dinner,2,1
4,24.59,3.61,Female,No,Sun,Dinner,4,0
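
For a simple one-to-one recoding like this, `Series.map()` with a dictionary may be a shorter alternative to a custom function. The sketch below uses a hypothetical series that includes a value that is neither `Female` nor `Male`.

```python
import pandas as pd

# Hypothetical values, including one that is neither 'Female' nor 'Male'
sex = pd.Series(['Female', 'Male', 'Unknown'])

# Values missing from the dictionary become NaN
recoded = sex.map({'Female': 0, 'Male': 1})
print(recoded.tolist())  # [0.0, 1.0, nan]
```

Because `NaN` is a float, the recoded column is `float64` whenever an unmapped value is present.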


### Lambda functions

Lambda functions allow you to write one-line functions that can be passed to `apply()`, avoiding the need for the `def` syntax.

We'll load the job application dataset and clean its `Total Est. Fee` column by removing the '$' sign using the `.replace()` method and a lambda.

We'll then repeat the procedure but use `re.findall()`, regex and lambda to retrieve the numerals within the string value.

In [31]:
# import and load the data
import pandas as pd
from numpy import nan as NaN  # the `NaN` alias was removed in NumPy 2.0
import re

df = pd.read_csv('data2/dob_job_application_filings_subset.csv')
df_subset = df[['Job #', 'Doc #', 'Borough', 'Initial Cost', 'Total Est. Fee',]].copy()  # explicit copy avoids SettingWithCopyWarning

df_subset['total_fee'] = df_subset['Total Est. Fee'].apply(lambda x: x.replace('$', ''))
df_subset['total_fee_2'] = df_subset['Total Est. Fee'].apply(lambda x: re.findall(r'\d+\.\d+', x)[0])
df_subset.head()

Unnamed: 0,Job #,Doc #,Borough,Initial Cost,Total Est. Fee,total_fee,total_fee_2
0,121577873,2,MANHATTAN,$75000.00,$986.00,986.0,986.0
1,520129502,1,STATEN ISLAND,$0.00,$1144.00,1144.0,1144.0
2,121601560,1,MANHATTAN,$30000.00,$522.50,522.5,522.5
3,121601203,1,MANHATTAN,$1500.00,$225.00,225.0,225.0
4,121601338,1,MANHATTAN,$19500.00,$389.50,389.5,389.5
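
Both lambdas return strings, so a conversion to a numeric dtype is still needed before doing arithmetic. As a sketch of one vectorized alternative, `Series.str.extract()` can pull out the number and `astype(float)` can convert it in one chain. The sample values below are hypothetical. A row with no match yields `NaN` here, whereas `re.findall(...)[0]` would raise an `IndexError` on an empty list.

```python
import pandas as pd

fees = pd.Series(['$986.00', '$1144.00', '$522.50'])  # hypothetical sample values

# expand=False returns a Series when there is a single capture group
amounts = fees.str.extract(r'(\d+\.\d+)', expand=False).astype(float)
print(amounts.tolist())  # [986.0, 1144.0, 522.5]
```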


### Dealing with duplicate data

Duplicate data can skew results. We can drop duplicates with the `.drop_duplicates()` method on the dataframe, which removes any row that repeats an earlier row.

The operation returns a new dataframe, the original is unchanged.

In [32]:
import pandas as pd

df_dup = pd.read_csv('data2/tips_dup.csv')
df_dup.head(10)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,16.99,1.01,Female,No,Sun,Dinner,2
4,10.34,1.66,Male,No,Sun,Dinner,3
5,21.01,3.5,Male,No,Sun,Dinner,3
6,16.99,1.01,Female,No,Sun,Dinner,2
7,10.34,1.66,Male,No,Sun,Dinner,3
8,21.01,3.5,Male,No,Sun,Dinner,3


In [36]:
df_dup.shape

(9, 7)

In [35]:
df = df_dup.drop_duplicates()
df.head(10)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


In [37]:
df_dup.shape

(9, 7)
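
Before dropping anything it can help to count the duplicates, and sometimes only a few columns should decide what counts as a duplicate. The sketch below shows `duplicated()` together with the `subset` and `keep` arguments of `drop_duplicates()`, using a small hypothetical frame rather than `tips_dup.csv`.

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
df_demo = pd.DataFrame({
    'total_bill': [16.99, 10.34, 16.99],
    'tip': [1.01, 1.66, 1.01],
    'day': ['Sun', 'Sun', 'Sun'],
})

print(df_demo.duplicated().sum())  # 1 row repeats an earlier row

deduped = df_demo.drop_duplicates()                             # compare all columns
by_day = df_demo.drop_duplicates(subset=['day'], keep='last')   # compare `day` only, keep the last row

print(deduped.shape)          # (2, 3)
print(by_day.index.tolist())  # [2]
```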

### Dealing with missing values

It's rare to have a (real-world) dataset without any missing values, and it's important to deal with them because some calculations cannot handle missing values while others will, by default, skip over them. You can

- drop the row or drop the column
- fill the field with a value, for example one generated by various pandas methods

Use the pandas `.info()` method to see which columns have missing values.

To drop a row from the dataframe that has one or more missing values, use the `.dropna()` method. The method returns a new dataframe, the original is untouched.
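
As a quick sketch of both steps, the example below counts missing values per column with `isnull().sum()` and then compares a plain `dropna()` with one restricted to a single column via `subset`. The frame `df_demo` is hypothetical and only shaped like the air quality data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the air quality data
df_demo = pd.DataFrame({
    'Ozone': [41.0, np.nan, 12.0, np.nan],
    'Solar.R': [190.0, 118.0, np.nan, 313.0],
    'Wind': [7.4, 8.0, 12.6, 11.5],
})

print(df_demo.isnull().sum().to_dict())  # {'Ozone': 2, 'Solar.R': 1, 'Wind': 0}

all_complete = df_demo.dropna()                 # drop rows with any NaN
ozone_only = df_demo.dropna(subset=['Ozone'])   # drop rows only where Ozone is NaN

print(len(all_complete), len(ozone_only))  # 1 2
```

Restricting the check with `subset` keeps rows whose only gaps are in columns you do not need.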

In [41]:
# the `Ozone` and `Solar.R` columns have a number of missing values
import pandas as pd
import numpy as np

df_air = pd.read_csv('data2/airquality.csv')
df_air.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB


In [42]:
df = df_air.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 111 entries, 0 to 152
Data columns (total 6 columns):
Ozone      111 non-null float64
Solar.R    111 non-null float64
Wind       111 non-null float64
Temp       111 non-null int64
Month      111 non-null int64
Day        111 non-null int64
dtypes: float64(3), int64(3)
memory usage: 6.1 KB


In this particular example we lost 42 of the 153 rows (about 27% of the data), which may be unacceptable. Another approach is to fill missing values with the `.fillna()` method.

There are a number of possible values that could be used, e.g. a user-provided value, or some summary statistic, such as the mean or median calculated on the entire column.

**Using a user provided value**

In [49]:
# the `Ozone` column is automatically converted to object dtype
# to accommodate the 'missing' string
import pandas as pd
import numpy as np

df_air = pd.read_csv('data2/airquality.csv')
df_air['Ozone'] = df_air['Ozone'].fillna('missing') # on a single column
df_air.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null object
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(2), int64(3), object(1)
memory usage: 7.2+ KB


In [55]:
# on multiple columns - replace NaN values with 0
import pandas as pd
import numpy as np

df_air = pd.read_csv('data2/airquality.csv')
df_air[['Ozone', 'Solar.R']] = df_air[['Ozone', 'Solar.R']].fillna(0)
df_air.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    153 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB


In [56]:
df_air.head(10)

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,0.0,0.0,14.3,56,5,5
5,28.0,0.0,14.9,66,5,6
6,23.0,299.0,8.6,65,5,7
7,19.0,99.0,13.8,59,5,8
8,8.0,19.0,20.1,61,5,9
9,0.0,194.0,8.6,69,5,10


Another option is to use a summary statistic, e.g. the mean or median. When the data contains outliers, prefer the median over the mean, because the median is less affected by extreme values.

In [57]:
import pandas as pd
import numpy as np

df_air = pd.read_csv('data2/airquality.csv')
oz_mean = df_air['Ozone'].mean()

# Replace all the missing values in the Ozone column with the mean
df_air['Ozone'] = df_air['Ozone'].fillna(oz_mean)

# Print the info of airquality
df_air.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB


In [58]:
df_air.head(10)

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,42.12931,,14.3,56,5,5
5,28.0,,14.9,66,5,6
6,23.0,299.0,8.6,65,5,7
7,19.0,99.0,13.8,59,5,8
8,8.0,19.0,20.1,61,5,9
9,42.12931,194.0,8.6,69,5,10
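
To illustrate why the median can be the safer fill value, the sketch below uses a hypothetical column with one extreme reading. The numbers are made up for the example and are not taken from the air quality data.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value and one missing value
ozone = pd.Series([10.0, 12.0, 14.0, 500.0, np.nan])

print(ozone.mean())    # 134.0, pulled up by the outlier
print(ozone.median())  # 13.0

filled = ozone.fillna(ozone.median())
print(filled.tolist())  # [10.0, 12.0, 14.0, 500.0, 13.0]
```

Both `mean()` and `median()` skip `NaN` by default, so they can be computed before the gaps are filled.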


### Test with asserts

After dropping or filling missing values, we can programmatically check that no `NaN` values remain. We would expect to find `0` missing values.

We can write an assert to verify this.

An `assert` statement does nothing if its condition is `True`, otherwise it raises an `AssertionError`.

The `.all()` method returns `True` if all values are `True`. When used on a DataFrame, it returns a Series of booleans, one for each column in the DataFrame. So if you are using it on a DataFrame, you need to chain another `.all()` method, e.g. `.all().all()`, so that you get a single `True` or `False` value. The first `.all()` returns a `True` or `False` for each column, while the second `.all()` reduces those to a single `True` or `False`.
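
A small sketch of the two levels of reduction, on a hypothetical frame with one missing value:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, np.nan]})  # hypothetical data

per_column = df_demo.notnull().all()      # one boolean per column
overall = df_demo.notnull().all().all()   # a single boolean for the whole frame

print(per_column.to_dict())  # {'a': True, 'b': False}
print(overall)               # False
```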

In [59]:
import pandas as pd
import numpy as np

df = pd.read_csv('data2/airquality.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB


In [60]:
# check for missing values in the `Ozone` column 
# - it should FAIL - there are missing values
# the 'all()' returns 'True' if all values in the column are 'True'
assert df.Ozone.notnull().all()

AssertionError: 

In [62]:
# Calculate the mean & replace any missing values in the Ozone column with the mean
oz_mean = df['Ozone'].mean()
df['Ozone'] = df['Ozone'].fillna(oz_mean)
assert df.Ozone.notnull().all() # check for missing

Returns nothing since the assertion passed.

In [63]:
# all 'Ozone' fields are filled
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB


Assert that there are no missing values in the Ebola dataset:

In [68]:
import pandas as pd
import numpy as np

df = pd.read_csv('data2/ebola.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):
Date                   122 non-null object
Day                    122 non-null int64
Cases_Guinea           93 non-null float64
Cases_Liberia          83 non-null float64
Cases_SierraLeone      87 non-null float64
Cases_Nigeria          38 non-null float64
Cases_Senegal          25 non-null float64
Cases_UnitedStates     18 non-null float64
Cases_Spain            16 non-null float64
Cases_Mali             12 non-null float64
Deaths_Guinea          92 non-null float64
Deaths_Liberia         81 non-null float64
Deaths_SierraLeone     87 non-null float64
Deaths_Nigeria         38 non-null float64
Deaths_Senegal         22 non-null float64
Deaths_UnitedStates    18 non-null float64
Deaths_Spain           16 non-null float64
Deaths_Mali            12 non-null float64
dtypes: float64(16), int64(1), object(1)
memory usage: 17.2+ KB


In [70]:
df_drop = df.dropna()
df_drop.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 19 to 19
Data columns (total 18 columns):
Date                   1 non-null object
Day                    1 non-null int64
Cases_Guinea           1 non-null float64
Cases_Liberia          1 non-null float64
Cases_SierraLeone      1 non-null float64
Cases_Nigeria          1 non-null float64
Cases_Senegal          1 non-null float64
Cases_UnitedStates     1 non-null float64
Cases_Spain            1 non-null float64
Cases_Mali             1 non-null float64
Deaths_Guinea          1 non-null float64
Deaths_Liberia         1 non-null float64
Deaths_SierraLeone     1 non-null float64
Deaths_Nigeria         1 non-null float64
Deaths_Senegal         1 non-null float64
Deaths_UnitedStates    1 non-null float64
Deaths_Spain           1 non-null float64
Deaths_Mali            1 non-null float64
dtypes: float64(16), int64(1), object(1)
memory usage: 152.0+ bytes


In [None]:
# Assert that there are no missing values
assert df_drop.notnull().all().all()

In [66]:
# Assert that all values are >= 0
assert (df_drop >= 0).all().all()

AssertionError: 
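
The comparison above is applied to every column, including the text `Date` column. It is an assumption, not something the output confirms, that this non-numeric column is what makes the check fail. Under that assumption, one way to restrict the check to numeric columns is `select_dtypes()`. The frame below is hypothetical and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a text column alongside numeric ones
df_demo = pd.DataFrame({
    'Date': ['1/5/2015', '1/4/2015'],
    'Day': [289, 288],
    'Cases_Guinea': [2776.0, 2775.0],
})

# Keep only the numeric columns before comparing against 0
numeric = df_demo.select_dtypes(include='number')
assert (numeric >= 0).all().all()
```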