# Cleaning Data for Analysis

## 1. Converting data types
In this exercise, we'll see how converting every categorical variable in a DataFrame to type `category` reduces memory usage.

The `tips` dataset has been loaded into a DataFrame called `tips`. This data contains information about how much a customer tipped, whether the customer was male or female, a smoker or not, etc.

You'll note that two columns that should be categorical, `sex` and `smoker`, are instead of type `object`, which is pandas' way of storing arbitrary strings.

Convert these two columns to type `category` and note the reduced memory usage.

In [1]:
import pandas as pd

In [2]:
tips = pd.read_csv("datasets/tips.csv")
tips.head()

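| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |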


In [3]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB


In [4]:
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype("category")

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype("category")

In [5]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB


By converting `sex` and `smoker` to categorical variables, the reported memory usage of the DataFrame went down from 13.4+ KB to 10.3+ KB. This may seem like a small difference here, but when dealing with large datasets, the reduction in memory usage can be very significant!
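
To see where the savings come from, here is a minimal sketch on a synthetic column. The `demo` frame and its values are made up for illustration and are not taken from `tips`. It uses `memory_usage(deep=True)`, which also counts the string payloads that the `+` in the `info()` output leaves out.

```python
import pandas as pd

# Hypothetical data: a low-cardinality string column stored as object
demo = pd.DataFrame(
    {"sex": pd.Series(["Female", "Male"] * 5000, dtype="object")}
)

# deep=True measures the Python string objects, not just the pointers to them
object_bytes = demo["sex"].memory_usage(deep=True)

# A category column stores each distinct label once plus small integer codes
category_bytes = demo["sex"].astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The exact byte counts depend on the pandas version and platform, so none are quoted here. The gap grows with the number of rows because each extra row costs only one small integer code.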

## 2. Working with numeric data
If we expect the data type of a column to be numeric (`int` or `float`), but instead it is of type `object`, this typically means that the column contains a non-numeric value, which is also a sign of bad data.

We can use the `pd.to_numeric()` function to convert a column into a numeric data type. If the function raises an error, we can be sure that the column contains a value that cannot be parsed as a number. We can either use exploratory data analysis to find the bad value, or we can choose to ignore it or, with `errors="coerce"`, convert it into a missing value, `NaN`.
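
The two behaviours can be sketched on a small, made-up Series (the `raw` values below are hypothetical and only illustrate the idea). By default `pd.to_numeric()` raises on an unparseable string, while `errors="coerce"` replaces it with `NaN`.

```python
import pandas as pd

# Hypothetical column where the string 'missing' marks an absent value
raw = pd.Series(["16.99", "missing", "21.01"])

# Default behaviour: an unparseable value raises a ValueError
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors="coerce" turns every unparseable value into NaN instead
clean = pd.to_numeric(raw, errors="coerce")

print(raised)              # True
print(clean.dtype)         # float64
print(clean.isna().sum())  # 1
```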

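The exercise assumes that the `total_bill` and `tip` columns, which should be numeric, are of type `object`, and asks you to fix this. In the copy loaded above, `tips.info()` already reports both columns as `float64`, so the conversion below leaves them unchanged.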

In [6]:
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips["total_bill"], errors="coerce")

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips["tip"], errors="coerce")

In [7]:
# Print the info of tips
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB


Where the `'total_bill'` and `'tip'` columns are stored as `object`, it is because the string `'missing'` is used in them to encode missing values. Coercing the columns into a numeric type turns those strings into proper `NaN` values.

## 3. String parsing with regular expressions
Regular expressions are powerful ways of defining patterns to match strings.

When working with data, it is sometimes necessary to write a regular expression to check that values were entered properly. Phone numbers are a common field that needs to be checked for validity. Define a regular expression to match US phone numbers that fit the pattern `xxx-xxx-xxxx`.

The [regular expression module](https://docs.python.org/3/library/re.html) in Python is `re`. Because the same pattern will be matched against many rows, it's better to compile it once with `re.compile()` and then use the compiled pattern to match values.

In [8]:
# Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')
prog

re.compile(r'\d{3}-\d{3}-\d{4}', re.UNICODE)

In [9]:
# See if the pattern matches
result = prog.match('123-456-7890')
result

<re.Match object; span=(0, 12), match='123-456-7890'>

In [10]:
bool(result)

True

In [11]:
# See if the pattern matches
result2 = prog.match("1123-456-7890")
print(type(result2))

<class 'NoneType'>


In [12]:
bool(result2)

False

Here, as expected, the pattern matches the first string, but not the second.
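
One caveat is worth a sketch: `match()` only anchors the pattern at the start of the string, so a value with extra trailing digits would still pass. The example below reuses the compiled pattern on a made-up `phones` Series (hypothetical values) and compares `match()` with `fullmatch()`, which requires the whole string to fit.

```python
import re
import pandas as pd

# Same pattern as above, compiled once and reused for every row
prog = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical phone column for illustration
phones = pd.Series(["123-456-7890", "1123-456-7890", "123-456-78901"])

# match() anchors only at the start, so trailing characters are accepted
starts_ok = phones.apply(lambda s: bool(prog.match(s)))

# fullmatch() requires the entire string to fit the pattern
fully_ok = phones.apply(lambda s: bool(prog.fullmatch(s)))

print(starts_ok.tolist())  # [True, False, True]
print(fully_ok.tolist())   # [True, False, False]
```

Whether the stricter `fullmatch()` is appropriate depends on how clean the field is expected to be.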

## 4. Extracting numerical values from strings
Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say we have the following string: `'the recipe calls for 10 strawberries and 1 banana'`.

It would be useful to extract the `10` and the `1` from this string so they can be saved for later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), we can use the `re.findall()` function. Pass in a `pattern` and a `string` to `re.findall()`, and it will return a list of the `matches`.

In [13]:
# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
matches

['10', '1']

The regular expression successfully extracted the numeric values `10` and `1` from the string! `\d` is the pattern that matches a single digit. Following it with a `+` matches the previous element one or more times, which ensures that `10` is treated as one number and not as `1` and `0`.
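
The effect of the `+` can be checked directly. This short sketch runs both patterns on the same string and then converts the matches to integers, since `re.findall()` returns strings. The variable names are illustrative only.

```python
import re

text = "the recipe calls for 10 strawberries and 1 banana"

# Without '+', each digit is matched on its own
single_digits = re.findall(r"\d", text)

# With '+', consecutive digits are grouped into one match
whole_numbers = re.findall(r"\d+", text)

# findall() returns strings, so convert before doing arithmetic
strawberries, bananas = (int(m) for m in whole_numbers)
ratio = strawberries / bananas

print(single_digits)  # ['1', '0', '1']
print(whole_numbers)  # ['10', '1']
print(ratio)          # 10.0
```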

## 5. Pattern matching
For each provided string, write the appropriate `pattern` to match it.

Write patterns to match:
- A telephone number of the format `xxx-xxx-xxxx`.
- A string of the format: a dollar sign, an arbitrary number of digits, a decimal point, 2 digits.
  Use `\$` to match the dollar sign, `\d*` to match an arbitrary number of digits, `\.` to match the decimal point, and `\d{x}` to match `x` number of digits.
- A capital letter, followed by an arbitrary number of alphanumeric characters.
  Use `[A-Z]` to match any capital letter followed by `\w*` to match an arbitrary number of alphanumeric characters.

In [14]:
# Write the first pattern
pattern1 = re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890')
pattern1

<re.Match object; span=(0, 12), match='123-456-7890'>

In [15]:
# Write the second pattern
pattern2 = re.match(pattern=r'\$\d*\.\d{2}', string='$123.45')
pattern2

<re.Match object; span=(0, 7), match='$123.45'>

In [16]:
# Write the third pattern
pattern3 = re.match(pattern=r'[A-Z]\w*', string='Australia')
pattern3

<re.Match object; span=(0, 9), match='Australia'>

Great work mastering the fundamentals of writing regular expressions!
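
As a quick check of what the quantifiers in these patterns accept and reject, here is a small sketch using `fullmatch()` on a few hand-picked strings. The test strings are made up for illustration. It shows that `\d*` also allows zero digits and that `\w` accepts digits and underscores as well as letters.

```python
import re

money = re.compile(r"\$\d*\.\d{2}")
word = re.compile(r"[A-Z]\w*")

# \d* means "zero or more digits", so an amount with no dollars still matches
dollars_ok = bool(money.fullmatch("$123.45"))
no_dollars_ok = bool(money.fullmatch("$.45"))

# \d{2} requires exactly two digits after the decimal point
one_cent_digit_ok = bool(money.fullmatch("$123.4"))

# [A-Z] requires a leading capital; \w also accepts digits and underscores
capital_ok = bool(word.fullmatch("Australia"))
mixed_ok = bool(word.fullmatch("Area_51"))
lowercase_ok = bool(word.fullmatch("australia"))

print(dollars_ok, no_dollars_ok, one_cent_digit_ok)  # True True False
print(capital_ok, mixed_ok, lowercase_ok)            # True True False
```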

## 6. Custom functions to clean data
The `tips` dataset has a `'sex'` column that contains the values `'Male'` or `'Female'`. Write a function that will recode `'Female'` to `0`, `'Male'` to `1`, and return `np.nan` for all entries of `'sex'` that are neither `'Female'` nor `'Male'`.

Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error prone.

Use the `.apply()` method to apply a function across entire rows or columns of DataFrames. However, note that each column of a DataFrame is a pandas `Series`. Functions can also be applied across `Series`. Here, apply a function over the `'sex'` column.

In [17]:
# Import NumPy, which provides np.nan
import numpy as np

# Define recode_gender()
def recode_gender(gender):

    # Return 0 if gender is 'Female'
    if gender == "Female":
        return 0
    
    # Return 1 if gender is 'Male'    
    elif gender == "Male":
        return 1
    
    # Return np.nan    
    else:
        return np.nan

In [18]:
# Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)

# Print the first five rows of tips
tips.head()

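| | total_bill | tip | sex | smoker | day | time | size | recode |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 1 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 1 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 1 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 |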


For simple recodes, we can also use the `.replace()` method. We can also convert the column into a `categorical` type.
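
As one possible shortcut for this kind of recode, the sketch below uses `Series.map()` with a dictionary rather than `.replace()`. That is a deliberate substitution: `.map()` returns `NaN` for any value missing from the dictionary, which mirrors the `else` branch of `recode_gender()`. The `sex` Series and its `'Unknown'` entry are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column including an unexpected value
sex = pd.Series(["Female", "Male", "Unknown"])

# Values not present in the dict become NaN
recoded = sex.map({"Female": 0, "Male": 1})

print(recoded.tolist())  # [0.0, 1.0, nan]
```

Because of the `NaN`, the result is stored as `float64` rather than as integers.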

## 7. Lambda functions
A powerful Python feature that will help us clean our data more effectively is the `lambda` function. Instead of using the `def` syntax that we used previously, `lambda` functions let us write simple, one-line functions.

For example, here's a function that squares a variable used in an `.apply()` method:
```python
def my_square(x):
    return x ** 2

df.apply(my_square)
```
The equivalent code using a `lambda` function is:
```python
df.apply(lambda x: x ** 2)
```
The `lambda` function takes one parameter - the variable `x`. The function itself just squares `x` and returns the result, which is whatever the one line of code evaluates to. In this way, `lambda` functions can make our code concise and Pythonic.
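
The snippets above assume a DataFrame named `df` already exists. A self-contained sketch, with a small made-up `df`, confirms that the two forms give the same result:

```python
import pandas as pd

# Hypothetical numeric DataFrame for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

def my_square(x):
    return x ** 2

# DataFrame.apply() passes each column to the function as a Series
squared_def = df.apply(my_square)
squared_lambda = df.apply(lambda x: x ** 2)

print(squared_def.equals(squared_lambda))  # True
print(squared_lambda["b"].tolist())        # [16, 25, 36]
```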

Clean the `'total_dollar'` column of the `tips` dataset by removing the dollar sign. Use two different methods: the `.replace()` method and regular expressions.

In [19]:
tips["total_dollar"] = tips.total_bill.apply(lambda x: '$' + str(x))
tips.head()

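| | total_bill | tip | sex | smoker | day | time | size | recode | total_dollar |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 | $16.99 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 1 | $10.34 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 1 | $21.01 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 1 | $23.68 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 | $24.59 |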


In [20]:
# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

# Print the head of tips
tips.head()

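| | total_bill | tip | sex | smoker | day | time | size | recode | total_dollar | total_dollar_replace | total_dollar_re |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 | $16.99 | 16.99 | 16.99 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 1 | $10.34 | 10.34 | 10.34 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 1 | $21.01 | 21.01 | 21.01 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 1 | $23.68 | 23.68 | 23.68 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 | $24.59 | 24.59 | 24.59 |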


Notice how the `'total_dollar_re'` and `'total_dollar_replace'` columns are identical.
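
Both new columns still hold strings, not numbers. As an alternative to `.apply()` with a `lambda`, the sketch below uses pandas' vectorized `.str` methods on a made-up `total_dollar` Series (hypothetical values) and then converts the result with `pd.to_numeric()`.

```python
import pandas as pd

# Hypothetical column of dollar amounts stored as strings
total_dollar = pd.Series(["$16.99", "$10.34", "$21.01"])

# regex=False treats '$' as a literal character, not an end-of-string anchor
stripped = total_dollar.str.replace("$", "", regex=False)

# str.extract() returns the first capture group of the pattern for each row
extracted = total_dollar.str.extract(r"(\d+\.\d+)", expand=False)

# Convert the cleaned strings to numbers
amounts = pd.to_numeric(stripped)

print(stripped.tolist())   # ['16.99', '10.34', '21.01']
print(extracted.tolist())  # ['16.99', '10.34', '21.01']
print(amounts.tolist())    # [16.99, 10.34, 21.01]
```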

## 8. Dropping duplicate data
Duplicate data causes a variety of problems. From a performance point of view, duplicates use up unnecessary memory and cause unneeded calculations when processing data. They can also bias analysis results.

Drop all duplicate rows from the `airquality` dataset.

In [21]:
airquality = pd.read_csv("datasets/airquality.csv")
airquality.head(10)

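| | Ozone | Solar.R | Wind | Temp | Month | Day |
|---|---|---|---|---|---|---|
| 0 | 41.0 | 190.0 | 7.4 | 67 | 5 | 1 |
| 1 | 41.0 | 190.0 | 7.4 | 67 | 5 | 1 |
| 2 | 36.0 | 118.0 | 8.0 | 72 | 5 | 2 |
| 3 | 36.0 | 118.0 | 8.0 | 72 | 5 | 2 |
| 4 | 12.0 | 149.0 | 12.6 | 74 | 5 | 3 |
| 5 | 12.0 | 149.0 | 12.6 | 74 | 5 | 3 |
| 6 | 18.0 | 313.0 | 11.5 | 62 | 5 | 4 |
| 7 | 18.0 | 313.0 | 11.5 | 62 | 5 | 4 |
| 8 | NaN | NaN | 14.3 | 56 | 5 | 5 |
| 9 | 28.0 | NaN | 14.9 | 66 | 5 | 6 |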


In [22]:
airquality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 6 columns):
Ozone      120 non-null float64
Solar.R    150 non-null float64
Wind       157 non-null float64
Temp       157 non-null int64
Month      157 non-null int64
Day        157 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.4 KB


In [23]:
# Drop the duplicates: airquality_no_duplicates
airquality_no_duplicates = airquality.drop_duplicates()

# Print info of airquality_no_duplicates
airquality_no_duplicates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 153 entries, 0 to 156
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 8.4 KB
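
Before dropping anything, it can help to count how many rows are duplicates. The following sketch uses a small made-up frame (the `demo` values are hypothetical): `.duplicated()` flags every row that repeats an earlier one, and `.drop_duplicates()` keeps the first occurrence of each.

```python
import pandas as pd

# Hypothetical frame with two repeated rows
demo = pd.DataFrame({"Ozone": [41.0, 41.0, 36.0, 36.0, 12.0],
                     "Day": [1, 1, 2, 2, 3]})

# True for each row that is identical to an earlier row
n_dupes = int(demo.duplicated().sum())

# The surviving rows keep their original index labels
deduped = demo.drop_duplicates()

# Optionally renumber the index from 0
renumbered = deduped.reset_index(drop=True)

print(n_dupes)                    # 2
print(deduped.index.tolist())     # [0, 2, 4]
print(renumbered.index.tolist())  # [0, 1, 2]
```

Keeping the original labels is why the output above reports 153 entries with an index that still runs from 0 to 156.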


## 9. Filling missing data
It's rare to have a real-world dataset without any missing values. It's important to deal with them because some calculations cannot handle missing values, while others skip over them by default.

Also, understanding how much missing data we have, and thinking about where it comes from is crucial to making unbiased interpretations of data.
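
A minimal way to measure how much data is missing, sketched on a made-up frame (the `demo` values are hypothetical), is to count `NaN`s per column with `.isnull()`. The same example shows `.mean()` skipping missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing Ozone readings
demo = pd.DataFrame({"Ozone": [41.0, np.nan, 12.0, np.nan],
                     "Wind": [7.4, 8.0, 12.6, 11.5]})

# Number and share of missing values per column
missing_counts = demo.isnull().sum()
missing_share = demo.isnull().mean()

# mean() skips NaN by default; skipna=False propagates it instead
mean_default = demo["Ozone"].mean()
mean_strict = demo["Ozone"].mean(skipna=False)

print(missing_counts.to_dict())  # {'Ozone': 2, 'Wind': 0}
print(mean_default)              # 26.5
print(mean_strict)               # nan
```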

In [24]:
airquality_no_duplicates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 153 entries, 0 to 156
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 8.4 KB


In [26]:
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality_no_duplicates.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality_no_duplicates.loc[:, ['Ozone']] = airquality_no_duplicates.Ozone.fillna(oz_mean)

# Print the info of airquality_no_duplicates
airquality_no_duplicates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 153 entries, 0 to 156
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 8.4 KB


In [27]:
airquality_no_duplicates.head(10)

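| | Ozone | Solar.R | Wind | Temp | Month | Day |
|---|---|---|---|---|---|---|
| 0 | 41.0 | 190.0 | 7.4 | 67 | 5 | 1 |
| 2 | 36.0 | 118.0 | 8.0 | 72 | 5 | 2 |
| 4 | 12.0 | 149.0 | 12.6 | 74 | 5 | 3 |
| 6 | 18.0 | 313.0 | 11.5 | 62 | 5 | 4 |
| 8 | 42.12931 | NaN | 14.3 | 56 | 5 | 5 |
| 9 | 28.0 | NaN | 14.9 | 66 | 5 | 6 |
| 10 | 23.0 | 299.0 | 8.6 | 65 | 5 | 7 |
| 11 | 19.0 | 99.0 | 13.8 | 59 | 5 | 8 |
| 12 | 8.0 | 19.0 | 20.1 | 61 | 5 | 9 |
| 13 | 42.12931 | 194.0 | 8.6 | 69 | 5 | 10 |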


There are no longer any missing values in the `Ozone` column of this DataFrame!
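
The same mean-fill can be sketched end to end on a tiny made-up frame (the `demo` values are hypothetical). The explicit `.copy()` and the plain column assignment are assumptions of this sketch, not part of the exercise. They are one common way to make clear that the de-duplicated frame is independent of the original.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one duplicate row and one missing value
demo = pd.DataFrame({"Ozone": [41.0, 41.0, np.nan, 12.0]})

# Work on an independent copy of the de-duplicated data
no_dupes = demo.drop_duplicates().copy()

# The mean is computed from the non-missing values only
oz_mean = no_dupes["Ozone"].mean()

# Fill the gaps with that mean
no_dupes["Ozone"] = no_dupes["Ozone"].fillna(oz_mean)

print(oz_mean)                      # 26.5
print(no_dupes["Ozone"].tolist())   # [41.0, 26.5, 12.0]
```

Note that filling with the mean keeps the column mean unchanged but reduces its variance, so whether it is appropriate depends on the analysis.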

## 10. Testing your data with asserts
Practice writing `assert` statements using the `ebola` dataset to programmatically check for missing values and to confirm that all values are positive. 

Use the `.all()` method together with the `.notnull()` DataFrame method to check for missing values. The `.all()` method returns `True` if all values are `True`. When used on a DataFrame, it returns a `Series` of Booleans, one for each column in the DataFrame. So on a DataFrame, we need to chain a second `.all()` method to get a single `True` or `False` value. Inside an `assert` statement, nothing is returned if the condition is `True`, and an `AssertionError` is raised otherwise: this is how we can confirm that the data we are checking are valid.

> Note: We can use `pd.notnull(df)` as an alternative to `df.notnull()`.

Write an `assert` statement to confirm that there are no missing values in ebola.
- Use the `pd.notnull()` function on ebola (or the `.notnull()` method of ebola) and chain two `.all()` methods (that is, `.all().all()`). The first `.all()` method will return a `True` or `False` for each column, while the second `.all()` method will return a single `True` or `False`.

Write an `assert` statement to confirm that all values in ebola are greater than or equal to `0`.
- Chain two `.all()` methods to the Boolean condition `(ebola >= 0)`.
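
The chaining can be sketched on a small made-up frame (the `demo` columns and values are hypothetical and unrelated to `ebola`). It shows the intermediate per-column result, a failing `assert` caught so the script keeps running, and the checks passing once rows with missing values are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
demo = pd.DataFrame({"cases": [3.0, np.nan], "deaths": [1.0, 0.0]})

# First .all(): one Boolean per column; second .all(): a single Boolean
per_column = demo.notnull().all()
overall = bool(per_column.all())

# A failing assert raises AssertionError with the optional message
try:
    assert demo.notnull().all().all(), "demo contains missing values"
    message = None
except AssertionError as err:
    message = str(err)

# After dropping incomplete rows, both checks pass silently
cleaned = demo.dropna()
assert cleaned.notnull().all().all()
assert (cleaned >= 0).all().all()

print(per_column.to_dict())  # {'cases': False, 'deaths': True}
print(overall)               # False
print(message)               # demo contains missing values
```

Comparisons with `NaN` evaluate to `False`, so a check such as `(demo >= 0).all().all()` also fails while missing values are present.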

In [28]:
ebola = pd.read_csv("datasets/ebola.csv")
print(ebola.head())

         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
0    1/5/2015  289        2776.0            NaN            10030.0   
1    1/4/2015  288        2775.0            NaN             9780.0   
2    1/3/2015  287        2769.0         8166.0             9722.0   
3    1/2/2015  286           NaN         8157.0                NaN   
4  12/31/2014  284        2730.0         8115.0             9633.0   

   Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
0            NaN            NaN                 NaN          NaN         NaN   
1            NaN            NaN                 NaN          NaN         NaN   
2            NaN            NaN                 NaN          NaN         NaN   
3            NaN            NaN                 NaN          NaN         NaN   
4            NaN            NaN                 NaN          NaN         NaN   

   Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
0         1786.0          

In [29]:
# Assert that there are no missing values
assert ebola.notnull().all().all()

AssertionError: 

In [43]:
ebola_melt = ebola.melt(id_vars = ["Date", "Day"], var_name= "Country", value_name="Cases")
ebola_melt.head()

| | Date | Day | Country | Cases |
|---|---|---|---|---|
| 0 | 1/5/2015 | 289 | Cases_Guinea | 2776.0 |
| 1 | 1/4/2015 | 288 | Cases_Guinea | 2775.0 |
| 2 | 1/3/2015 | 287 | Cases_Guinea | 2769.0 |
| 3 | 1/2/2015 | 286 | Cases_Guinea | NaN |
| 4 | 12/31/2014 | 284 | Cases_Guinea | 2730.0 |


In [44]:
# Convert the cases column to float
ebola_melt.loc[:, ["Cases"]] = ebola_melt.Cases.astype(float)

# Drop the "Date" and "Country" Columns
ebola_melt = ebola_melt.drop(columns=["Date", "Country"])

In [45]:
ebola_melt.tail()

| | Day | Cases |
|---|---|---|
| 1947 | 5 | NaN |
| 1948 | 4 | NaN |
| 1949 | 3 | NaN |
| 1950 | 2 | NaN |
| 1951 | 0 | NaN |


In [46]:
ebola_melt.shape

(1952, 2)

In [47]:
# Drop all rows that have NaN values
ebola_melt.dropna(inplace=True)

In [48]:
ebola_melt.shape

(738, 2)

In [49]:
# Assert that all values are >= 0
assert (ebola_melt >= 0).all().all()

The first `assert` raised an `AssertionError` because `ebola` contains missing values. After melting the data and dropping every row with a `NaN`, the final `assert` ran without error, so we can be sure that all remaining values are `>= 0`!