# 🛠 IFQ718 Module 06 Exercises 03

## 🔍  Context: Working with missing data

This notebook briefly covers the mechanics of identifying and filling missing data. In IFQ719, you will learn how to properly fill missing data.

Let's use fuel prices as an example.

Perhaps you have hourly data for all the fuel stations in Brisbane, but there are some missing points at random intervals. 
* What caused them to be missing?
   * Perhaps connectivity to the server was lost
   * Perhaps the fuel station had to close for an unexpected reason so they were not trading, hence no report
* How to fill the missing points?
   * Would a fuel price of zero make sense?
   * Use the average from the prices either side?
   * Reuse the previous value?
   
This last question is particularly important but it will be answered in IFQ719. Here, we will explore the Pandas-based mechanisms for actually filling the value.
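
To make the three options above concrete, here is a minimal sketch on a made-up price series. The values and the name `prices` are assumptions for illustration only, not real fuel data, and the sketch does not say which option is appropriate.

```python
import pandas as pd

# Hypothetical hourly prices (cents per litre) with one missing reading
prices = pd.Series([180.0, 182.0, None, 186.0], name='price')

filled_zero = prices.fillna(0)      # option 1: a price of zero
filled_prev = prices.ffill()        # option 3: reuse the previous value
filled_mean = prices.interpolate()  # option 2: linear interpolation, which for a
                                    # single gap is the average of its neighbours

print(pd.DataFrame({
    'zero': filled_zero,
    'previous': filled_prev,
    'average': filled_mean,
}))
```

Each option gives a different value for the same gap (0.0, 182.0 and 184.0), which is why the choice of strategy matters.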

In [None]:
import pandas as pd

### Specifying a field as `NaN` 

**For a new DataFrame**

Using the native `None` object will indicate to Pandas that a field is unknown:

In [None]:
data = {
    'DayOfWeek' : [f'{d}day' for d in ['Sun', 'Mon', 'Tues', 'Wednes', 'Thurs', 'Fri', 'Sat']],
    'Distance' : [12.5, 13.0, None, None, 13.2, 11.0, 4.3]
}
df = pd.DataFrame(data)

In [None]:
df
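
One way to confirm what Pandas did with `None` is to check the column's dtype and the result of `.isna()`. The sketch below uses hypothetical `distances` and `labels` Series rather than the DataFrame above.

```python
import pandas as pd

# In a numeric column, None is stored as NaN and the column stays float64
distances = pd.Series([12.5, 13.0, None, 4.3])
print(distances.dtype)            # float64
print(distances.isna().tolist())  # [False, False, True, False]

# In a text column, the missing entry is still detected by .isna()
labels = pd.Series(['a', None])
print(labels.isna().tolist())     # [False, True]
```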

**For an existing DataFrame**

In [None]:
# Set distance of Monday to NaN
df.at[1,'Distance'] = None

In [None]:
df
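
`.at` sets a single cell. If you need to mark several rows as missing based on a condition, `.loc` with a boolean mask is one alternative. This is a sketch on a hypothetical `trips` DataFrame, using `np.nan` explicitly instead of `None`.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the DataFrame used above
trips = pd.DataFrame({
    'DayOfWeek': ['Sunday', 'Monday', 'Tuesday'],
    'Distance': [12.5, 13.0, 9.1],
})

# Select rows by condition and mark their distance as missing
trips.loc[trips['DayOfWeek'] == 'Monday', 'Distance'] = np.nan
print(trips)
```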

### Inspecting for `NaN`s

In [None]:
df = pd.read_csv('data/penguins.csv')

In [None]:
# the entire DataFrame
df

**How to find rows that contain `NaN`**

In [None]:
df[df.isna().any(axis=1)]

In [None]:
df.iloc[3]
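
Besides listing the affected rows, it is often useful to count how many values are missing. The sketch below uses a small hypothetical `birds` DataFrame (its column names are assumptions and may not match `penguins.csv`) so that it runs without the data file.

```python
import pandas as pd

# Hypothetical stand-in for a dataset with gaps
birds = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'bill_length_mm': [39.1, None, 47.5],
    'sex': ['Male', None, None],
})

# Number of missing values in each column
missing_per_column = birds.isna().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value
rows_with_missing = birds.isna().any(axis=1).sum()
print(rows_with_missing)
```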

### Deleting missing data

In [None]:
df = df.dropna()

In [None]:
df[df.isna().any(axis=1)]
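
By default, `.dropna()` removes every row that has at least one `NaN`. Its `how` and `subset` parameters give finer control. The following is a sketch on a hypothetical `frame`, not the penguins data.

```python
import pandas as pd

# Hypothetical data: row 1 and row 3 are partly missing, row 2 is entirely missing
frame = pd.DataFrame({
    'a': [1.0, None, None, 4.0],
    'b': [1.0, 2.0, None, None],
})

any_missing = frame.dropna()              # drop rows with any NaN
all_missing = frame.dropna(how='all')     # drop rows only if every value is NaN
subset_a = frame.dropna(subset=['a'])     # only consider column 'a'

print(any_missing.index.tolist())  # [0]
print(all_missing.index.tolist())  # [0, 1, 3]
print(subset_a.index.tolist())     # [0, 3]
```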

### Calculations with missing data

When summing data containing `NaN`, Pandas skips the `NaN` values by default, which has the same effect as treating them as zero. Therefore, if every item in the sum is `NaN`, the result is zero.

In [None]:
_df = pd.DataFrame({'columnA' : [3, 2, None], 'columnB' : [None] * 3})
print(_df)
print('\n---\n')
print(_df.sum())

When calculating the product of data containing `NaN`, Pandas likewise skips the `NaN` values by default, which has the same effect as treating them as one.

In [None]:
_df = pd.DataFrame({'columnA' : [3, 2, None], 'columnB' : [None] * 3})
print(_df)
print('\n---\n')
print(_df.product())
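
If a result of zero (or one) for all-missing data is not what you want, the `skipna` and `min_count` parameters of `.sum()` and `.prod()` change that behaviour. This is a minimal sketch on hypothetical `values` and `empty` Series.

```python
import pandas as pd

values = pd.Series([3.0, 2.0, float('nan')])
empty = pd.Series([float('nan')] * 3)

print(values.sum())               # 5.0, NaN is skipped
print(values.sum(skipna=False))   # nan, NaN propagates into the result
print(empty.sum())                # 0.0
print(empty.sum(min_count=1))     # nan, at least one valid value is required
print(empty.prod())               # 1.0
print(empty.prod(min_count=1))    # nan
```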

### Filling missing data

The `.fillna()` function will replace `NaN` with the value you specify:

In [None]:
_df.fillna('this was empty')
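
A single fill value is rarely right for every column. `.fillna()` also accepts a dictionary that maps column names to fill values. The sketch below uses a hypothetical `readings` DataFrame, and filling with the column mean is only an illustration, not a recommendation.

```python
import pandas as pd

# Hypothetical data with one gap in each column
readings = pd.DataFrame({
    'distance': [12.5, None, 13.5],
    'note': ['ok', None, 'ok'],
})

# Fill each column with its own value: the mean for numbers, a label for text
filled = readings.fillna({
    'distance': readings['distance'].mean(),
    'note': 'unknown',
})
print(filled)
```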

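You can also specify the `method` parameter:
   * `pad` or `ffill` propagates the last valid observation forward until the next valid one
   * `backfill` or `bfill` uses the next valid observation to fill the gap

Note that from Pandas 2.1 the `method` parameter is deprecated; the dedicated `.ffill()` and `.bfill()` methods do the same job.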

In [None]:
_df = pd.DataFrame({'columnA' : [3, 2, None], 'columnB' : [None] * 3})

In [None]:
# Forward fill; equivalent to fillna(method='ffill'), which is deprecated from Pandas 2.1
_df.ffill()
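
To compare the two fill directions side by side, here is a small sketch using the dedicated `.ffill()` and `.bfill()` methods, which correspond to the `ffill` and `bfill` options described above. The `series` values are made up for illustration.

```python
import pandas as pd

# Hypothetical series with gaps at the start, in the middle and at the end
series = pd.Series([None, 3.0, None, None, 7.0, None])

forward = series.ffill()   # a leading gap has no previous value, so it stays NaN
backward = series.bfill()  # a trailing gap has no next value, so it stays NaN

print(pd.DataFrame({'original': series, 'ffill': forward, 'bfill': backward}))
```

Note that neither direction fills every gap: forward fill leaves the first value missing and backward fill leaves the last one missing.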

### Replacing alternative values known to represent `NaN`

Returning to fuel prices: perhaps you know that fuel stations report a price of 999.999c when there is an error.

The Pandas `.replace()` function lets you swap such an erroneous value for a proper `NaN`. The cell below shows the mechanics on a simple Series, replacing the value 13 with 10.

In [None]:
pd.Series(list(range(10,15)), index=['a', 'b', 'c', 'd', 'e']).replace(13, 10)
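
Applied to the fuel price scenario, the same call can turn the error code into `NaN`. This is a sketch with made-up prices, where 999.999 is the assumed error value from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical prices in cents per litre; 999.999 marks a reporting error
prices = pd.Series([181.9, 999.999, 183.5, 999.999])

# Replace the sentinel with NaN so later calculations skip it
cleaned = prices.replace(999.999, np.nan)

print(cleaned)
print(prices.mean())   # inflated by the error codes
print(cleaned.mean())  # mean of the two valid prices only
```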

### ✍ Activity 1: identify records in the taxi dataset that have no associated payment

In [None]:
df = pd.read_csv('data/taxis.csv')

In [None]:
df

In [None]:
# Write your code here

### ✍ Activity 2: drop records where there is no payment and no passengers

In [None]:
# Write your code here

### ✍ Activity 3: discuss, is a toll of $0.00 better than a toll of `NaN`?

### ✍ Activity 4: replace `NaN`s

Replace `NaN` in the pickup/dropoff zone/borough columns with `Unknown`

In [None]:
# Write your code here