# Lesson I 

## Uniformity

In this chapter, we look at more advanced data cleaning problems, such as:

* Uniformity
* Cross field validation
* Dealing with missing data

In chapter 1, we saw that out of range values are a common problem when cleaning data and that, when left untouched, they can skew your analysis, for example:

* Out of range movie ratings
* Subscription dates in the future

**Uniformity:**

| **Column** | **Unit** |
|--------|------|
| Temperature | ``32C`` **is also** ``89.6F`` |
| Weight | ``70 kg`` **is also** ``11 st`` |
| Date | ``26-11-2019`` **is also** ``26, November, 2019`` |
| Money | ``100$`` **is also** ``10763,90 Yen`` |


**Example**

```python
import pandas as pd

temperatures = pd.read_csv('temperature.csv')
temperatures.head()
```

<img src='pictures/temperature.jpg' />

Here's a dataset with average temperature data throughout the month of March in New York City. The dataset was collected from different sources, with temperature data in Celsius and Fahrenheit merged together. Unless a major climate event occurred, the final value here (``62.6``) is most likely Fahrenheit, not Celsius.

Let's confirm the presence of these values visually:

```python
# Import matplotlib
import matplotlib.pyplot as plt
# Create scatter plot
plt.scatter(x= 'Date', y= 'Temperature', data= temperatures)
# Create title, xlabel and ylabel
plt.title('Temperature in Celsius March 2019 - NYC')
plt.xlabel('Dates')
plt.ylabel('Temperature in Celsius')
# Show plot
plt.show()
```

<img src='pictures/temperature1.jpg' />

Notice the values that sit well above the rest in the plot? They must all be in Fahrenheit!

### Treating temperature data

A simple web search returns the formula for converting Fahrenheit to Celsius: subtract 32, then multiply by 5/9. To convert our temperature data, we isolate all rows of the ``Temperature`` column where the value is above **40** using the ``.loc[]`` accessor. We chose **40** because it's a common sense maximum for Celsius temperatures in New York City.

We then convert these values to Celsius using the formula and assign them back to the same rows in ``temperatures``, replacing the original Fahrenheit values.

We can check that the conversion worked with an ``assert`` statement that makes sure the maximum value of ``Temperature`` is less than **40**.

```python
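# Isolate the readings above 40, which must be in Fahrenheit
temp_fh = temperatures.loc[temperatures['Temperature'] > 40, 'Temperature']
# Convert them to Celsius
temp_cels = (temp_fh - 32) * (5/9)
# Write the converted values back into the same rows
temperatures.loc[temperatures['Temperature'] > 40, 'Temperature'] = temp_cels

# Assert conversion is correct
assert temperatures['Temperature'].max() < 40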
```
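The snippet above depends on ``temperature.csv``. As a minimal self-contained sketch of the same idea, the example below uses a small made-up DataFrame (the dates and readings are hypothetical) and builds the boolean mask once, so the same rows are read and written.

```python
import pandas as pd

# Hypothetical sample: mostly Celsius readings with two Fahrenheit values mixed in
temperatures = pd.DataFrame({
    'Date': ['03.03.19', '04.03.19', '05.03.19', '06.03.19'],
    'Temperature': [14.0, 59.0, 16.0, 62.6],
})

# Build the mask once so the same rows are selected for reading and writing
is_fahrenheit = temperatures['Temperature'] > 40

# Convert only the flagged rows from Fahrenheit to Celsius
temperatures.loc[is_fahrenheit, 'Temperature'] = (
    temperatures.loc[is_fahrenheit, 'Temperature'] - 32
) * 5 / 9

print(temperatures)

# Assert that no reading above the threshold remains
assert temperatures['Temperature'].max() < 40
```

Storing the mask in a variable avoids recomputing the condition and makes it harder to accidentally select different rows on each side of the assignment.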

### Treating date data

Here's another common uniformity problem, this time with date data. This is a DataFrame called ``birthdays`` containing birth dates for a variety of individuals. It has been collected from a variety of sources and merged into one.

```python
birthdays.head()
```

<img src='pictures/birthdays.jpg' />

Notice the dates here? The one in blue has the month, day, year format, whereas the one in orange has the month written out. The one in red is obviously an error, with what looks like a day day year format. We'll learn how to deal with that one as well.

### Datetime Formatting

``datetime`` is useful for representing dates, and each layout is described by a format string:

| **Date** | **datetime format** |
|------|-----------------|
| 25-12-2019 | ``%d-%m-%Y`` |
| December 25 2019 | ``%B %d %Y`` |
| 12-25-2019 | ``%m-%d-%Y`` |
|... | .... |

``pandas.to_datetime()``:
* Can recognize most formats automatically
* Sometimes fails with erroneous or unrecognizable formats
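To make the format codes in the table concrete, here is a small sketch using the standard library's ``datetime.strptime``. The sample strings are illustrative, and ``%B`` assumes English month names (the default locale).

```python
from datetime import datetime

# Each string is paired with the format that describes its layout
samples = [
    ('25-12-2019', '%d-%m-%Y'),
    ('12-25-2019', '%m-%d-%Y'),
    ('December 25 2019', '%B %d %Y'),
]

# Parse every string with its matching format
parsed = [datetime.strptime(text, fmt) for text, fmt in samples]

# All three strings describe the same calendar day
for (text, fmt), value in zip(samples, parsed):
    print(f'{text!r} with {fmt!r} -> {value.date()}')
```

Parsing a string with a format that does not match its layout raises a ``ValueError``, which is the kind of failure the second bullet above refers to.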

You can treat these date inconsistencies by converting your date column to ``datetime`` with the *pandas* ``to_datetime()`` function. However, a plain call isn't enough here and will most likely raise an error.

```python
# Converts to datetime - but won't work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'])
```

This fails because the column holds dates in multiple formats, in particular the odd *day day year* entry, whose month value is out of range. Instead, we set the ``infer_datetime_format`` argument to ``True`` and set ``errors='coerce'``. pandas then attempts to infer the format and returns a missing value for dates that couldn't be identified and converted, instead of raising a ``ValueError``.

```python
# Will work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'],
                                       # Attempt to infer format of each date
                                       infer_datetime_format=True,
                                       # Return NA for rows where conversion failed
                                       errors='coerce')
```

This returns the ``Birthday`` column with aligned formats, with the initially ambiguous *day day year* entry set to ``NaT``, which is how *pandas* represents missing values in ``datetime`` columns.
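The effect of ``errors='coerce'`` can be seen in isolation with a short sketch. The sample values are made up. In pandas 2.0 and later, ``infer_datetime_format`` is deprecated and passing it has no effect, so this sketch passes an explicit ``format`` instead, assuming the valid entries all share one layout.

```python
import pandas as pd

# Hypothetical sample: the last entry cannot be a real day-month-year date
raw = pd.Series(['25-12-2019', '09-02-2018', '45-45-2019'])

# With errors='coerce', entries that do not fit the format become NaT
parsed = pd.to_datetime(raw, format='%d-%m-%Y', errors='coerce')

print(parsed)
print('Rows that failed to parse:', parsed.isna().sum())
```

Counting the ``NaT`` values after conversion is a quick way to see how many rows need a closer look.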

We can also change how a datetime column is displayed using the ``.dt.strftime()`` method, which accepts a datetime format of your choice. For example, here we convert the ``Birthday`` column to day-month-year instead of year-month-day.

```python
birthdays['Birthday'] = birthdays['Birthday'].dt.strftime("%d-%m-%Y")
```
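One thing to keep in mind is that ``.dt.strftime()`` returns strings, so the result no longer supports datetime operations. A possible alternative, sketched below with made-up dates and hypothetical column names ``Birthday_text`` and ``Birth_year``, is to keep the datetime column and store the formatted text separately.

```python
import pandas as pd

# Hypothetical sample with a proper datetime column
birthdays = pd.DataFrame({
    'Birthday': pd.to_datetime(['2019-12-25', '2018-02-09']),
})

# Keep the datetime column and store the formatted text in a new column
birthdays['Birthday_text'] = birthdays['Birthday'].dt.strftime('%d-%m-%Y')

# The original column still supports datetime operations such as .dt.year
birthdays['Birth_year'] = birthdays['Birthday'].dt.year

print(birthdays)
```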

### Treating ambiguous date data

However, a common problem is having ambiguous dates with vague formats. For example, is this date value set in March or August?

***Is ``2019-03-08`` in August or March?***

Unfortunately, there's no clear-cut way to spot this inconsistency or to treat it. Depending on the size of the dataset and the suspected ambiguities, we have a few options:

* Convert these dates to NAs and deal with them accordingly.
* If you have additional context on the source of your data, you can probably infer the format.
* If the majority of subsequent or previous data is of one format, you can probably infer the format as well.

All in all, it is essential to properly understand where your data comes from before trying to treat it, as this makes these decisions much easier.
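The ambiguity is easy to demonstrate with the standard library: the same string parses without error under two different layouts, so the parser alone cannot tell us which one was intended.

```python
from datetime import datetime

ambiguous = '2019-03-08'

# Same string, two plausible layouts
as_year_month_day = datetime.strptime(ambiguous, '%Y-%m-%d')
as_year_day_month = datetime.strptime(ambiguous, '%Y-%d-%m')

print(as_year_month_day.strftime('%d %B %Y'))  # 08 March 2019
print(as_year_day_month.strftime('%d %B %Y'))  # 03 August 2019
```

Both calls succeed, which is why context about the data source is needed to choose between them.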

## Exercise

### Uniform currencies

In this exercise and throughout this chapter, you will be working with a retail banking dataset stored in the ``banking`` DataFrame. The dataset contains data on the amount of money stored in accounts (``acct_amount``), their currency (``acct_cur``), amount invested (``inv_amount``), account opening date (``account_opened``), and last transaction date (``last_transaction``) that were consolidated from American and European branches.

You are tasked with understanding the average account size and how investments vary by the size of account. To produce this analysis accurately, you first need to unify the account amounts into dollars.

```python
# Import packages
import pandas as pd
# Banking dataset
banking = pd.read_csv('datasets/banking_dirty.csv')

# WILL ONLY WORK ON DATACAMP WEBSITE

# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'

# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1

# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[acct_eu, 'acct_cur'] = 'dollar'

# Assert that only dollar currency remains
assert (banking['acct_cur'] == 'dollar').all()
```

### Uniform dates

After having unified the currencies of your different account amounts, you want to add a temporal dimension to your analysis and see how customers have been investing their money given the size of their account over each year. The ``account_opened`` column represents when customers opened their accounts and is a good proxy for segmenting customer activity and investment over time.

However, since this data was consolidated from multiple sources, you need to make sure that all dates are of the same format. You will do so by converting this column into a ``datetime`` object, while making sure that the format is inferred and potentially incorrect formats are set to missing. 

```python
# Print the header of account_opened
print(banking['account_opened'].head())

# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                           # Infer datetime format
                                           infer_datetime_format = True,
                                           # Return missing value for error
                                           errors = 'coerce')

# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')

# Print acct_year
print(banking['acct_year'])
```

```
0   2018-02-09
1   2019-02-28
2   2018-04-25
3   2017-07-11
4   2018-05-14
Name: account_opened, dtype: datetime64[ns]
0     2018
1     2019
2     2018
3     2017
4     2018
      ... 
95    2018
96    2017
97    2017
98    2017
99    2017
Name: acct_year, Length: 100, dtype: object
```


# Lesson II

## Cross field validation

In this lesson we'll talk about cross field validation for diagnosing dirty data.

```python
import pandas as pd

flights = pd.read_csv('flights.csv')
flights.head()
```

<img src='pictures/flights.jpg' />

Let's take a look at the following dataset. It contains flight statistics on the total number of passengers in economy, business and first class as well as the total passengers for each flight. We know that these columns have been collected and merged from different data sources, and a common challenge when merging data from different sources is data integrity, or more broadly making sure that our data is correct.

**Cross Field Validation:**

* *The use of **multiple** fields in a dataset to sanity check data integrity*

For example, in our flights dataset, this could mean summing the *economy, business and first class* values and making sure they equal the *total passengers* on the plane. This is easily done in pandas:

```python
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis=1)
passenger_equ = sum_classes == flights['total_passengers']
# Find and filter out rows with inconsistent passenger totals
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]
```

We first subset on the columns to sum, then call ``.sum()`` with the ``axis`` argument set to ``1`` to indicate row-wise summing.

We then find instances where the total passengers column equals the sum of the classes, and filter out rows with inconsistent passenger totals by subsetting on that equality with brackets and the tilde (``~``) symbol.

<img src='pictures/crossfield.jpg' />
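Since ``flights.csv`` is not included here, the following sketch reproduces the check on a small made-up DataFrame. The flight numbers and passenger counts are hypothetical; only the column names follow the snippet above.

```python
import pandas as pd

# Hypothetical sample: the last row's classes do not add up to the total
flights = pd.DataFrame({
    'flight_number': ['DL14', 'BA248', 'MEA124'],
    'economy_class': [100, 130, 100],
    'business_class': [60, 100, 50],
    'first_class': [40, 70, 50],
    'total_passengers': [200, 300, 250],
})

# Row-wise sum of the class columns compared against the reported total
class_columns = ['economy_class', 'business_class', 'first_class']
passenger_equ = flights[class_columns].sum(axis=1) == flights['total_passengers']

# Split the data on the result of the check
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]

print(inconsistent_pass)
```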

Here's another example containing *user IDs, birthdays and age* values for a set of *users*. We can, for example, make sure that the age and birthday columns agree by calculating the number of years between today's date and each birthday.

```python
import pandas as pd
import datetime as dt

# Convert to datetime and get today's date
users['Birthday'] = pd.to_datetime(users['Birthday'])
today = dt.date.today()
# For each row in the Birthday column, calculate year difference
age_manual = today.year - users['Birthday'].dt.year
# Find instances where ages match
age_equ = age_manual == users['Age']
# Find and filter out rows with inconsistent age
inconsistent_age = users[~age_equ]
consistent_age = users[age_equ]
```

We first make sure the ``Birthday`` column is converted to ``datetime`` with the *pandas* ``to_datetime()`` function.

We then create an object storing today's date using the ``datetime`` package's ``date.today()`` function.

Next, we calculate the difference between the current year and the year of each birthday using the ``.dt.year`` attribute of the ``Birthday`` column.

We then find instances where the calculated ages are equal to the actual ``Age`` column in the ``users`` DataFrame.

Finally, we filter out the rows with inconsistencies by subsetting with brackets and the tilde symbol on the equality we created.
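Note that a plain difference in years overstates the age by one for anyone whose birthday has not yet occurred in the current year. A possible refinement is sketched below on made-up data. The fixed reference date and the ``age_manual`` column name are assumptions chosen to keep the example reproducible.

```python
import datetime as dt
import pandas as pd

# Hypothetical sample: two users born in the same year, early and late
users = pd.DataFrame({
    'user_id': [1, 2],
    'Birthday': pd.to_datetime(['1990-01-15', '1990-12-20']),
})

# Fixed reference date (an assumption) so the result is reproducible
today = dt.date(2020, 6, 1)

# Start from the plain difference in calendar years
age_manual = today.year - users['Birthday'].dt.year

# Flag users whose birthday has not yet occurred in the reference year
birthday_pending = (
    (users['Birthday'].dt.month > today.month)
    | ((users['Birthday'].dt.month == today.month)
       & (users['Birthday'].dt.day > today.day))
)

# Subtract one year for those users
users['age_manual'] = age_manual - birthday_pending.astype(int)
print(users)
```

With this reference date, the first user is 30 and the second is still 29, whereas the year difference alone would report 30 for both.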

### What to do when we catch inconsistencies?

So what should we do when cross field validation reveals inconsistencies? Just like with other data cleaning problems, there is no one-size-fits-all solution, as the best option often requires an in-depth understanding of our dataset. Common options are:

* Drop the inconsistent data
* Set it to missing and impute
* Apply rules from domain knowledge

All these routes and assumptions can be decided upon only when you have a good understanding of where your dataset comes from and the different sources feeding into it.
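The three options can be sketched as follows on a small made-up ``users`` table. The values, the ``age_manual`` column, the median imputation, and the rule of trusting the birthday-derived age are all illustrative assumptions rather than recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second user's recorded age disagrees with the computed one
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'Age': [30.0, 45.0, 22.0, 61.0],
    'age_manual': [30, 41, 22, 61],
})
age_equ = users['Age'] == users['age_manual']

# Option 1: drop the inconsistent rows
dropped = users[age_equ]

# Option 2: set inconsistent ages to missing, then impute with the median
imputed = users.copy()
imputed.loc[~age_equ, 'Age'] = np.nan
imputed['Age'] = imputed['Age'].fillna(imputed['Age'].median())

# Option 3: apply a domain rule, here trusting the birthday-derived age
corrected = users.copy()
corrected.loc[~age_equ, 'Age'] = corrected.loc[~age_equ, 'age_manual']

print(len(dropped), imputed['Age'].tolist(), corrected['Age'].tolist())
```

Which option is appropriate depends on how much data you can afford to lose and which of the conflicting fields you trust more.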

## Exercise

### How's our data integrity?

New data has been merged into the ``banking`` DataFrame that contains details on how investments in the ``inv_amount`` column are allocated across four different funds *A, B, C and D*.

Furthermore, the age and birthdays of customers are now stored in the ``age`` and ``birth_date`` columns respectively.

You want to understand how customers of different age groups invest. However, you want to first make sure the data you're analyzing is correct. You will do so by cross field checking values of ``inv_amount`` and ``age`` against the amount invested in different funds and customers' birthdays. 

```python
# Import packages
import datetime as dt

# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis=1) == banking['inv_amount']

# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]

# Print the number of inconsistent investments
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])
```

```
Number of inconsistent investments:  8
```
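If the amounts are stored as floating-point numbers, an exact ``==`` comparison can flag rows that differ only by rounding error. A hedged alternative is to compare within a tolerance using ``numpy.isclose``. The sketch below uses made-up values and only two fund columns for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 0 differs only by rounding, row 1 is truly inconsistent
banking = pd.DataFrame({
    'fund_A': [0.1, 100.0],
    'fund_B': [0.2, 200.0],
    'inv_amount': [0.3, 350.0],
})
fund_columns = ['fund_A', 'fund_B']
fund_sum = banking[fund_columns].sum(axis=1)

# Exact comparison flags both rows as inconsistent
exact_equ = fund_sum == banking['inv_amount']

# Tolerance-based comparison only flags the genuinely inconsistent row
close_equ = np.isclose(fund_sum, banking['inv_amount'])

print(exact_equ.tolist())
print(close_equ.tolist())
```

Whether a tolerance is appropriate depends on how the amounts were recorded, so treat this as an option to consider rather than a required step.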


```python
# Store today's date and find ages
today = dt.date.today()
banking['birth_date'] = pd.to_datetime(banking['birth_date'])
ages_manual = today.year - banking['birth_date'].dt.year

# Find rows where age column == ages_manual
age_equ = banking['Age'] == ages_manual

# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]

# Print the number of inconsistent ages
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])
```

```
Number of inconsistent ages:  100
```
