## Time Series Index - Processing

Modeling and forecasting require a continuous time series: missing or duplicated index values can distort the results. This notebook demonstrates three methods of the `TimeindexProcessing` class on the hourly energy consumption dataset from Kaggle (https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption).


1. Convert column to time index
    A column of the dataframe is converted into the time index.

2. Duplicate time index
    A time series may contain duplicate time index values, and some pandas methods do not work on a duplicate index. Duplicates therefore need to be identified, and the follow-up action (dropping, averaging, etc.) should be chosen based on the data. The method developed here keeps the first row of each set of duplicates.

3. Missing time index
    A time series may have missing time index values. The method developed here returns a list of the missing time index values and adds the corresponding rows to the dataframe, with null values in all columns.

In [4]:
import pandas as pd

from TimeindexProcessing import TimeindexProcessing 

In [2]:
df = pd.read_csv('./Data/Kaggle_PJME/PJME_hourly.csv')
df.head()

Unnamed: 0,Datetime,PJME_MW
0,2002-12-31 01:00:00,26498.0
1,2002-12-31 02:00:00,25147.0
2,2002-12-31 03:00:00,24574.0
3,2002-12-31 04:00:00,24393.0
4,2002-12-31 05:00:00,24860.0


**Converting Datetime column to Timeindex**

In [5]:
index_processing = TimeindexProcessing()
indexed_df = index_processing.convert_column_to_timeindex(df, column_name='Datetime')
indexed_df.head()

Unnamed: 0_level_0,PJME_MW
Datetime,Unnamed: 1_level_1
2002-12-31 01:00:00,26498.0
2002-12-31 02:00:00,25147.0
2002-12-31 03:00:00,24574.0
2002-12-31 04:00:00,24393.0
2002-12-31 05:00:00,24860.0
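
The source of `TimeindexProcessing` is not shown in this notebook, so the following is only a minimal sketch of how such a conversion could be written with plain pandas. The function name `column_to_timeindex` and the toy rows are hypothetical. The sketch keeps the original row order, as the output above does; call `sort_index()` afterwards if a chronological order is needed.

```python
import pandas as pd


def column_to_timeindex(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Hypothetical stand-in: parse a column as datetimes and use it as the index."""
    out = df.copy()  # leave the caller's dataframe untouched
    out[column_name] = pd.to_datetime(out[column_name])  # strings -> Timestamps
    return out.set_index(column_name)  # result has a DatetimeIndex


# Toy data (illustrative values, deliberately not in chronological order)
toy = pd.DataFrame({
    "Datetime": ["2002-12-31 03:00:00", "2002-12-31 01:00:00", "2002-12-31 02:00:00"],
    "PJME_MW": [30.0, 10.0, 20.0],
})
toy_indexed = column_to_timeindex(toy, "Datetime")
print(type(toy_indexed.index).__name__)
```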


**Duplicate Timeindex**

In [6]:
duplicated_index_list, duplicates_corrected_df = index_processing.duplicate_timeindex(indexed_df)
duplicated_index_list

There are 4 duplicate index in the time series. 


[Timestamp('2014-11-02 02:00:00'),
 Timestamp('2015-11-01 02:00:00'),
 Timestamp('2016-11-06 02:00:00'),
 Timestamp('2017-11-05 02:00:00')]

All identified duplicate time index values fall on dates when daylight saving time ends (early November). Next, let's look at the duplicated index values in the raw data and in the duplicate-corrected dataframe (which keeps the first row of each set of duplicates).

In [7]:
indexed_df.loc[duplicated_index_list]

Unnamed: 0_level_0,PJME_MW
Datetime,Unnamed: 1_level_1
2014-11-02 02:00:00,22935.0
2014-11-02 02:00:00,23755.0
2015-11-01 02:00:00,21567.0
2015-11-01 02:00:00,21171.0
2016-11-06 02:00:00,20795.0
2016-11-06 02:00:00,21692.0
2017-11-05 02:00:00,21236.0
2017-11-05 02:00:00,20666.0


In [8]:
duplicates_corrected_df.loc[duplicated_index_list]

Unnamed: 0_level_0,PJME_MW
Datetime,Unnamed: 1_level_1
2014-11-02 02:00:00,22935.0
2015-11-01 02:00:00,21567.0
2016-11-06 02:00:00,20795.0
2017-11-05 02:00:00,21236.0
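
The keep-first behaviour shown above can be reproduced with `Index.duplicated`. This is a sketch under the assumption that `duplicate_timeindex` works along these lines; its actual implementation is not shown here. The toy values are illustrative, and the averaging variant is included only as one of the alternatives mentioned in the introduction.

```python
import pandas as pd

# Toy series with one duplicated timestamp (illustrative values)
idx = pd.to_datetime([
    "2014-11-02 01:00:00",
    "2014-11-02 02:00:00",
    "2014-11-02 02:00:00",
    "2014-11-02 03:00:00",
])
toy = pd.DataFrame({"PJME_MW": [10.0, 20.0, 30.0, 40.0]}, index=idx)

# True for the second and later occurrences of an index value
dup_mask = toy.index.duplicated(keep="first")
duplicated_stamps = toy.index[dup_mask].unique().tolist()

# Option 1: keep the first row of each set of duplicates
keep_first = toy[~dup_mask]

# Option 2: average the rows that share a timestamp
averaged = toy.groupby(level=0).mean()

print(duplicated_stamps)
```

Which option is appropriate depends on the data: keeping the first row preserves an observed value, while averaging blends the two readings.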


**Missing Timeindex**

Next, let's investigate missing time index values in the corrected dataframe above, which holds data at a 1-hour frequency.

In [9]:
missing_index_list, rows_added_df = index_processing.missing_timeindex(duplicates_corrected_df, '1H')

There are 30 missing index in the time series
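
A common way to find gaps is to build the complete expected index with `pd.date_range` and compare it with the actual one. The sketch below assumes `missing_timeindex` does something similar, spanning from the first to the last timestamp; that is an assumption, not a description of the class. It uses the lowercase alias `"h"` because pandas 2.2 and later deprecate the uppercase `"H"`.

```python
import pandas as pd

# Toy hourly series with the 03:00 row missing (illustrative values)
idx = pd.to_datetime([
    "2002-04-07 01:00:00",
    "2002-04-07 02:00:00",
    "2002-04-07 04:00:00",
])
toy = pd.DataFrame({"PJME_MW": [10.0, 20.0, 40.0]}, index=idx)

# Complete hourly index between the first and last observed timestamps
full_index = pd.date_range(toy.index.min(), toy.index.max(), freq="h")

# Timestamps that are expected but absent
missing = full_index.difference(toy.index)

# Reindexing adds the absent rows with NaN in every column
rows_added = toy.reindex(full_index)

print(list(missing))
```

Note that `reindex` requires a unique index, which is why duplicates are handled before this step.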


In [10]:
missing_index_list

[Timestamp('2002-04-07 03:00:00', freq='H'),
 Timestamp('2002-10-27 02:00:00', freq='H'),
 Timestamp('2003-04-06 03:00:00', freq='H'),
 Timestamp('2003-10-26 02:00:00', freq='H'),
 Timestamp('2004-04-04 03:00:00', freq='H'),
 Timestamp('2004-10-31 02:00:00', freq='H'),
 Timestamp('2005-04-03 03:00:00', freq='H'),
 Timestamp('2005-10-30 02:00:00', freq='H'),
 Timestamp('2006-04-02 03:00:00', freq='H'),
 Timestamp('2006-10-29 02:00:00', freq='H'),
 Timestamp('2007-03-11 03:00:00', freq='H'),
 Timestamp('2007-11-04 02:00:00', freq='H'),
 Timestamp('2008-03-09 03:00:00', freq='H'),
 Timestamp('2008-11-02 02:00:00', freq='H'),
 Timestamp('2009-03-08 03:00:00', freq='H'),
 Timestamp('2009-11-01 02:00:00', freq='H'),
 Timestamp('2010-03-14 03:00:00', freq='H'),
 Timestamp('2010-11-07 02:00:00', freq='H'),
 Timestamp('2010-12-10 00:00:00', freq='H'),
 Timestamp('2011-03-13 03:00:00', freq='H'),
 Timestamp('2011-11-06 02:00:00', freq='H'),
 Timestamp('2012-03-11 03:00:00', freq='H'),
 Timestamp

Almost all of the time index values above fall on dates when the clocks go forward or backward; 2010-12-10 00:00:00 is the exception and does not correspond to a clock change. Let's confirm that the missing index rows were added to the dataframe correctly.

In [11]:
rows_added_df.loc[missing_index_list]

Unnamed: 0,PJME_MW
2002-04-07 03:00:00,
2002-10-27 02:00:00,
2003-04-06 03:00:00,
2003-10-26 02:00:00,
2004-04-04 03:00:00,
2004-10-31 02:00:00,
2005-04-03 03:00:00,
2005-10-30 02:00:00,
2006-04-02 03:00:00,
2006-10-29 02:00:00,
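
One way to check which timestamps fall on a clock-change date is to compare the UTC offset at the start of that day with the offset at the start of the next day. The sketch below assumes the data follow US Eastern time (`America/New_York`); the notebook does not state the time zone, so treat that as an assumption. The helper `is_clock_change_day` is hypothetical, and the sample timestamps are taken from the lists printed above.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

import pandas as pd


def is_clock_change_day(ts: pd.Timestamp, tz_name: str = "America/New_York") -> bool:
    """Return True if the UTC offset changes during the calendar day of `ts`."""
    tz = ZoneInfo(tz_name)  # assumed time zone, not stated in the notebook
    day_start = datetime(ts.year, ts.month, ts.day, tzinfo=tz)
    next_day = day_start + timedelta(days=1)  # wall-clock midnight of the next day
    return day_start.utcoffset() != next_day.utcoffset()


stamps = [
    pd.Timestamp("2002-04-07 03:00:00"),
    pd.Timestamp("2010-11-07 02:00:00"),
    pd.Timestamp("2010-12-10 00:00:00"),
    pd.Timestamp("2014-11-02 02:00:00"),
]
not_clock_change = [ts for ts in stamps if not is_clock_change_day(ts)]
print(not_clock_change)
```

Applied to the full `missing_index_list`, a check like this would separate gaps explained by clock changes from gaps that need a closer look.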


The processed dataframe now has a clean time index and is ready for further analysis or modeling.
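
Before moving on, it can be worth asserting the properties a clean time index should have. The helper below is a hypothetical sketch and not part of `TimeindexProcessing`. It checks that the index is a `DatetimeIndex`, has no duplicates, is sorted, and has no gaps at the given frequency.

```python
import pandas as pd


def check_timeindex(df: pd.DataFrame, freq: str = "h") -> dict:
    """Hypothetical helper: report basic health checks for a time index."""
    idx = df.index
    expected = pd.date_range(idx.min(), idx.max(), freq=freq)
    return {
        "is_datetime": isinstance(idx, pd.DatetimeIndex),
        "is_unique": bool(idx.is_unique),
        "is_sorted": bool(idx.is_monotonic_increasing),
        "is_complete": len(expected.difference(idx)) == 0,
    }


# Toy data: a complete hourly index, then the same data with one row removed
clean = pd.DataFrame(
    {"PJME_MW": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2002-01-01 01:00:00", periods=4, freq="h"),
)
gappy = clean.drop(clean.index[2])

clean_report = check_timeindex(clean)
gappy_report = check_timeindex(gappy)
print(clean_report)
print(gappy_report)
```

The rows added for missing timestamps still hold null values, so a check like this covers the index only, not the data columns.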