## Setup

Load libraries

In [1]:
import os
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
from pmdarima import arima
from sklearn import set_config
from matplotlib import pyplot as plt
from statsmodels.tsa.seasonal import STL


set_config(
    display='diagram',
    transform_output="pandas"
)

## Load Data

Data Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption).

In [3]:
## create data directory

os.makedirs('./data', exist_ok=True)

## set data path
path = os.path.join('data', 'individual+household+electric+power+consumption.zip')

power_consumption_data = (
    pd.read_csv(
        filepath_or_buffer=path,
        compression='zip',
        header='infer',
        sep=';',
        na_values=['?', 'nan'],
        low_memory=False
    )
    .assign(datetime=lambda x: pd.to_datetime(x['Date'] + ' ' + x['Time'], dayfirst=True))
    .set_index('datetime')
    .drop(columns=['Date', 'Time'])
)

power_consumption_data.info()
power_consumption_data.head()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2075259 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 7 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Global_active_power    float64
 1   Global_reactive_power  float64
 2   Voltage                float64
 3   Global_intensity       float64
 4   Sub_metering_1         float64
 5   Sub_metering_2         float64
 6   Sub_metering_3         float64
dtypes: float64(7)
memory usage: 126.7 MB


datetime,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
2006-12-16 17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
2006-12-16 17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2006-12-16 17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
2006-12-16 17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
2006-12-16 17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0


## Data Exploration

* Check Missing Observations

In [4]:
power_consumption_data.isna().sum()

Global_active_power      25979
Global_reactive_power    25979
Voltage                  25979
Global_intensity         25979
Sub_metering_1           25979
Sub_metering_2           25979
Sub_metering_3           25979
dtype: int64

In [11]:
n_missing = power_consumption_data.isna().any(axis=1).sum()
print(f"Number of rows with at least one missing value: {n_missing}")

Number of rows with at least one missing value: 25979


All seven variables have missing observations. The per-column count equals the number of incomplete rows (25,979, about 1.25% of the data), so the same rows are missing across every column. In data analytics, there are two options for dealing with missing data:

1. We can drop all rows with at least one missing entry.
2. We can impute the missing values, so that we do not throw away valuable instances.

Because our data is a time series, dropping rows with missing entries is not advisable: it would leave gaps in the one-minute index and distort the autocorrelation between instances. There are a number of ways to fill missing values in time series data:

* Last observation carried forward (LOCF).
* Next observation carried backward (NOCB).
* Rolling statistics (simple moving average, weighted moving average, exponential moving average).
* K-Nearest Neighbors (KNN) imputer.
* Interpolation.
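To make the differences between these options concrete, here is a minimal sketch on a hypothetical six-point series with a two-step gap. The series `toy` and its values are invented for illustration and are not taken from the dataset; the rolling-mean variant is only one of several ways such a fill could be set up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series (not from the dataset): one-minute index, two-step gap
idx = pd.date_range("2020-01-01 00:00", periods=6, freq="min")
toy = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0], index=idx)

filled = pd.DataFrame({
    "original": toy,
    # Last observation carried forward
    "locf": toy.ffill(),
    # Next observation carried backward
    "nocb": toy.bfill(),
    # Trailing 3-point mean of the available (non-missing) values
    "rolling_mean": toy.fillna(toy.rolling(window=3, min_periods=1).mean()),
    # Straight line between the neighbouring observations, weighted by time
    "interpolated": toy.interpolate(method="time"),
})

print(filled)
```

LOCF and NOCB produce flat steps across the gap, while interpolation ramps between the two neighbouring observations.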

For this work, we will use interpolation, weighted by the timestamps in the index.

In [12]:
## fill missing observations with time-weighted interpolated values

no_na_df = power_consumption_data.interpolate(method='time')

n_missing = no_na_df.isna().any(axis=1).sum()
print(f"Number of rows with at least one missing value after interpolation: {n_missing}")

Number of rows with at least one missing value after interpolation: 0
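
As a side note on the `method='time'` argument used above: it weights the fill by the actual time elapsed between index entries, whereas `method='linear'` treats rows as equally spaced. The sketch below uses a hypothetical, irregularly spaced three-point series (invented for illustration) to show the difference. On a regular one-minute index the two methods should give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical irregular index: the missing point sits 1 minute after the
# first observation and 9 minutes before the last one
idx = pd.to_datetime([
    "2020-01-01 00:00",
    "2020-01-01 00:01",
    "2020-01-01 00:10",
])
s = pd.Series([0.0, np.nan, 10.0], index=idx)

# Ignores the index and places the value halfway between its neighbours
by_position = s.interpolate(method="linear")

# Uses the timestamps: 1 minute out of 10 along the line from 0 to 10
by_time = s.interpolate(method="time")

print(pd.DataFrame({"linear": by_position, "time": by_time}))
```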
