In [None]:
import os
import pandas as pd

# Data cleaning

Before using this notebook, you will need to download data with the `nem-data` library - see the instructions for this here

Cleaning data requires two tasks:
- identification
- cleaning

See what files you have downloaded:

In [None]:
home = os.path.expanduser('~')
nem_files = os.listdir(os.path.join(home, 'nem-data', 'demand'))

nem_files

If we look at a specific month, we see that `nem-data` has downloaded a few files for each month:

In [None]:
nem_files = os.listdir(os.path.join(home, 'nem-data', 'demand', '2018-01'))

nem_files

Let's grab `clean.csv` and extract a potential target.

The data provided for the NEM is large, which makes NEM problems great for data scientists to work on:

In [None]:
raw = pd.read_csv(os.path.join(home, 'nem-data', 'demand', '2018-01', 'clean.csv'), index_col=4, parse_dates=True)
raw.head()

In [None]:
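# Boolean mask selecting the South Australia region
region = raw['REGIONID'] == 'SA1'
cols = ['REGIONID', 'TOTALDEMAND']

# Select rows and columns in a single .loc call
target = raw.loc[region, cols]
target.head()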

## Exercise - raw dataset health check

On the `raw` dataframe, check
- how many missing values we have in each column
- whether there are any duplicates

On the `target` dataframe, check
- the integrity of the time stamps (do we have any gaps?)

In [None]:
raw.isnull().sum()

Some of the columns in the raw dataset contain no information at all (100% null values), but we are going to focus our analysis on `TOTALDEMAND`, which thankfully has no missing values.
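
One way to act on the null counts is to express them as a fraction per column and list the columns that are entirely empty. The sketch below uses a small made-up frame (`toy`) instead of the downloaded data; its values and the `EMPTYCOL` column are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `raw`; the values and EMPTYCOL are made up
toy = pd.DataFrame({
    'REGIONID': ['SA1', 'NSW1', 'SA1'],
    'TOTALDEMAND': [1500.0, 8000.0, 1450.0],
    'EMPTYCOL': [np.nan, np.nan, np.nan],
})

# Fraction of missing values in each column (mean of a boolean mask)
null_fraction = toy.isnull().mean()

# Names of the columns where every value is missing
all_null_cols = null_fraction[null_fraction == 1.0].index.tolist()

# Drop the columns that carry no information at all
toy_trimmed = toy.dropna(axis=1, how='all')
```

Whether dropping the empty columns is worthwhile depends on the analysis; here only `TOTALDEMAND` is used later, so it is optional.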

In [None]:
raw.index

In [None]:
raw.head()

The raw dataset has no duplicated rows.

*Does it? Doesn't the below suggest the opposite?*

In [None]:
# Count index values (time stamps) that appear more than once
raw.index.duplicated().sum()

*The below suggests we don't have duplicates, because it compares entire rows, not just the index.*

In [None]:
# Count rows whose column values all match an earlier row
raw.duplicated().sum()
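
The difference between the two checks is easier to see on a small made-up frame. In this sketch (the time stamps and demand values are invented), two regions share each time stamp, so the index repeats while every full row stays unique. This is one plausible reason for the mismatch, not a confirmed diagnosis of the downloaded data.

```python
import pandas as pd

# Invented time stamps: each one appears once per region
idx = pd.to_datetime([
    '2018-01-01 00:05', '2018-01-01 00:05',
    '2018-01-01 00:10', '2018-01-01 00:10',
])
toy = pd.DataFrame(
    {
        'REGIONID': ['SA1', 'NSW1', 'SA1', 'NSW1'],
        'TOTALDEMAND': [1500.0, 8000.0, 1450.0, 7900.0],
    },
    index=idx,
)

# Index-level check: counts repeated time stamps
index_dupes = toy.index.duplicated().sum()

# Row-level check: compares column values only and ignores the index
row_dupes = toy.duplicated().sum()
```

Here `index_dupes` is 2 and `row_dupes` is 0, so a repeated index does not by itself mean the frame holds duplicated records.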

In [None]:
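# Expected spacing between consecutive time stamps
td = pd.Timedelta(minutes=5)

# Count rows whose gap to the previous row is not 5 minutes
(target.index.to_series().diff() != td).sum()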

Some rows in the target dataset appear to have time deltas that differ from the typical 5 minutes. Let's drop any duplicated index entries and check again.

In [None]:
# Keep only the first occurrence of each time stamp
target = target[~target.index.duplicated(keep='first')]

(target.index.to_series().diff() != td).sum()

After dropping the duplicated rows, only a single row has a time delta that isn't 5 minutes. This is expected: the first row has no previous row to compare against, so its difference is `NaT`.
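
The first-row effect can be reproduced on a small invented index. In this sketch, `diff()` yields `NaT` for the first time stamp, and `NaT != td` evaluates to `True`, so a perfectly regular index still reports one mismatch:

```python
import pandas as pd

td = pd.Timedelta(minutes=5)

# A perfectly regular, made-up 5-minute index
idx = pd.date_range('2018-01-01 00:00', periods=4, freq='5min')

# The first element has no predecessor, so its diff is NaT
deltas = idx.to_series().diff()

# Counting over all rows includes the NaT from the first row
mismatches = (deltas != td).sum()

# Skipping the first row leaves only genuine irregularities
mismatches_ignoring_first = (deltas.iloc[1:] != td).sum()
```

Slicing off the first element with `.iloc[1:]` is one way to make the check return zero for a clean index.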

*Your solution is an interesting one - I'll show you how I do this*

*I make a datetime index at a 5min freq. using the start and end index:*

In [None]:
dt = pd.date_range(target.index[0], target.index[-1], freq='5min')

assert dt.shape[0] == target.shape[0]

In [None]:
dt.shape[0]

In [None]:
target.shape[0]

In [None]:
set(target.index) - set(dt)
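
The set difference above lists time stamps that are in `target` but not on the regular 5-minute grid. The opposite direction, grid time stamps missing from the data, is what reveals gaps. A minimal sketch on an invented index with one time stamp removed, using `DatetimeIndex.difference` (the names `observed`, `expected`, `missing` and `unexpected` are illustrative):

```python
import pandas as pd

# Made-up regular index with the 00:15 time stamp removed to simulate a gap
full = pd.date_range('2018-01-01 00:00', periods=6, freq='5min')
observed = full.delete(3)

# Regular grid spanning the observed start and end
expected = pd.date_range(observed[0], observed[-1], freq='5min')

# Time stamps on the grid that are absent from the data (the gaps)
missing = expected.difference(observed)

# Time stamps in the data that fall off the grid
unexpected = observed.difference(expected)
```

In this toy case `missing` holds only `2018-01-01 00:15` and `unexpected` is empty. Checking both directions distinguishes gaps from off-grid time stamps.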