# **Electricity Consumption Using Time Series Analysis**

Time series analysis is a statistical method that analyses past data over a given period of time in order to forecast the future. A time series is an ordered sequence of data points at equally spaced intervals. A typical example is airline passenger data, which records the count of passengers over a period of time.

![](https://image.freepik.com/free-photo/distribution-electric-substation-with-power-lines-transformers_156373-17.jpg)

Here the **Objective** is to build a model to forecast electricity power consumption (value). The data consists of a date/time and the value of consumption. The goal is to predict electricity consumption for the next 6 years, i.e. until 2024.

**Time Series:**<br>
A time series is a series of observations taken at particular time intervals (usually equal intervals). Analysing the series helps us predict future values based on previously observed values. In a time series we have only 2 variables: time and the variable we want to forecast.

**Why & where Time Series is used?**<br>
Time series data can be analysed to extract meaningful statistics and other characteristics. It is used in at least these 4 scenarios:

1. Business Forecasting
2. Understanding past behavior
3. Planning the future
4. Evaluating current accomplishments

**Importance of Time Series Analysis:**<br>
A large amount of time series data is generated in a variety of fields, so time series analysis has many applications. Some of the areas where it matters are:

1. Economics
2. Finance
3. Healthcare
4. Environmental Science
5. Sales Forecasting
6. Weather forecasting
7. Earthquake prediction
8. Astronomy
9. Signal processing

**Loading the basic libraries**

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf, pacf
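# Note: ARMA/ARIMA in statsmodels.tsa.arima_model were removed in statsmodels 0.13
# (replaced by statsmodels.tsa.arima.model.ARIMA), so this notebook needs an older release.
from statsmodels.tsa.arima_model import ARIMA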

**Loading Electric Production data set**

In [None]:
elecom = pd.read_csv('../input/electric-production/Electric_Production.csv')

**Let's check first 5 and last 5 records of data set**

In [None]:
elecom.head(5)

In [None]:
elecom.tail(5)

In [None]:
elecom.shape

In [None]:
elecom.info()

**There are 397 records and 2 columns in the data set. There are no null records. But look at the DATE column: we need to convert it to the datetime data type.**

In [None]:
# pd.to_datetime infers the date format on its own; no extra import is needed
elecom['DATE'] = pd.to_datetime(elecom['DATE'])

In [None]:
elecom.info()

**Now we set the DATE column as the index.**

In [None]:
elecomind = elecom.set_index('DATE',inplace=False)

In [None]:
elecomind.head()
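
The two steps above (parsing the dates, then indexing by them) can be tried on a tiny stand-in frame. This is only a sketch: the three rows and the month/day/year format are made-up assumptions, not values taken from the real CSV.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV.
sample = pd.DataFrame({
    "DATE": ["1/1/1985", "2/1/1985", "3/1/1985"],
    "Value": [72.5, 70.7, 62.5],
})

# Parse the strings into timestamps (month/day/year is assumed here).
sample["DATE"] = pd.to_datetime(sample["DATE"], format="%m/%d/%Y")

# Use the timestamps as the index so plots and rolling windows follow time.
sample_ind = sample.set_index("DATE")

print(type(sample_ind.index).__name__)
print(sample_ind.loc["1985-02"])  # partial-string selection of February 1985
```

With a datetime index in place, slices such as `sample_ind.loc["1985-02"]` select by calendar period instead of by row number.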

**Let's plot the data**

In [None]:
plt.figure(figsize=(10,5))
plt.xlabel('Date')
plt.ylabel('Electric Power Consumption')
plt.plot(elecomind)

**From the above plot, we can see that there is a Trend component in the series. Hence, we now check the stationarity of the data.**

**Let's write one function that plots the rolling statistics and runs the ADF (Augmented Dickey-Fuller) test. We will need to repeat these steps many times, so a function will be very handy.**

In [None]:
def test_stationarity(timeseries):
    
    #Determine rolling statistics
    movingAverage = timeseries.rolling(window=12).mean()
    movingSTD = timeseries.rolling(window=12).std()
    
    #Plot rolling statistics
    plt.figure(figsize=(10,5))
    plt.plot(timeseries, color='blue', label='Original')
    plt.plot(movingAverage, color='red', label='Rolling Mean')
    plt.plot(movingSTD, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    
    #Perform Dickey–Fuller test:
    print('Results of Dickey Fuller Test:')
    elecom_test = adfuller(timeseries['Value'], autolag='AIC')
    dfoutput = pd.Series(elecom_test[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in elecom_test[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)

**Let's determine & plot rolling statistics.**

In [None]:
test_stationarity(elecomind)

**From the above plot, we can see that the Rolling Mean itself has a trend component, even though the Rolling Standard Deviation is fairly constant over time.**

**For a time series to be stationary, both the Rolling Mean and the Rolling Standard Deviation must remain fairly constant with respect to time.**

**Both curves need to be parallel to the X-axis; in our case they are not.**
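
To make the "flat rolling mean" criterion concrete, here is a small sketch on synthetic data (not the electricity series). The helper `rolling_mean_range` is a hypothetical name introduced only for this illustration; it measures how far the 12-month rolling mean wanders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=120, freq="MS")

# Synthetic data: one flat noisy series, and the same noise with a trend added.
noise = rng.normal(0.0, 1.0, size=120)
flat = pd.Series(noise, index=idx)
trending = pd.Series(noise + 0.5 * np.arange(120), index=idx)

def rolling_mean_range(series, window=12):
    """Spread (max - min) of the rolling mean; small values suggest a flat mean."""
    rolling_mean = series.rolling(window=window).mean().dropna()
    return rolling_mean.max() - rolling_mean.min()

print("flat series:    ", round(rolling_mean_range(flat), 2))
print("trending series:", round(rolling_mean_range(trending), 2))
```

The trending series shows a much larger spread, which is the numeric counterpart of a rolling-mean curve that is not parallel to the X-axis.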

**We have also run the ADF (Augmented Dickey-Fuller) test, whose null hypothesis is that the time series is non-stationary.**

For a time series to be considered stationary, the ADF test should show:

1. A low p-value (commonly below 0.05), so that the null hypothesis can be rejected
2. A Test Statistic that is lower (more negative) than the critical values at the 1%, 5% and 10% levels

From the above ADF test result, we can see that the p-value (near 0.18) is very large. Also, the critical values are lower than the Test Statistic. Hence, we cannot reject the null hypothesis, and our Time Series at the moment is **NOT STATIONARY**.
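
The decision rule can be written down as a few lines of plain Python. This is a sketch under assumptions: `adf_verdict` is a hypothetical helper, the 0.05 threshold is a common convention rather than something fixed by the notebook, and the numbers passed in are illustrative, not output from this data set.

```python
def adf_verdict(test_statistic, p_value, critical_values, alpha=0.05, level="5%"):
    """Summarise an ADF result: can the unit-root null hypothesis be rejected?"""
    rejects_by_p = p_value < alpha
    rejects_by_stat = test_statistic < critical_values[level]
    if rejects_by_p and rejects_by_stat:
        return "stationary (null hypothesis rejected)"
    return "non-stationary (null hypothesis not rejected)"

# Illustrative numbers only; the dict keys mirror those returned by adfuller.
crit = {"1%": -3.45, "5%": -2.87, "10%": -2.57}
print(adf_verdict(-2.26, 0.18, crit))
print(adf_verdict(-5.21, 0.00001, crit))
```

The first call mimics a large p-value with a test statistic above the critical values, the second a strongly negative statistic with a tiny p-value.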

### **Data Transformation To Achieve Stationarity**

Now we have to transform the data to achieve stationarity. Possible transformations include log scale, square, square root, cube, cube root, time shift, exponential decay, etc.

Let's perform Log Transformation.

Basically we need to remove the trend component.

In [None]:
elecom_log = np.log(elecomind)
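
As a side note on what the log scale does, the sketch below uses a synthetic series that grows by 5% per step (an assumption chosen purely for illustration). The log turns ever-larger steps into equal steps; it does not remove the trend by itself, which is why further steps follow.

```python
import numpy as np

t = np.arange(10)
series = 100.0 * 1.05 ** t  # synthetic series growing 5% per step

raw_steps = np.diff(series)          # steps get larger as the level rises
log_steps = np.diff(np.log(series))  # steps are constant on the log scale

print(raw_steps.round(2))
print(log_steps.round(4))
```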

In [None]:
plt.figure(figsize=(10,5))
plt.xlabel('Date')
plt.ylabel('Electric Power Consumption')
plt.plot(elecom_log)

**We compute the rolling statistics separately here (without the function) because we need them as separate series for the next calculation.**

In [None]:
rollmean_log = elecom_log.rolling(window=12).mean()
rollstd_log = elecom_log.rolling(window=12).std()

In [None]:
plt.figure(figsize=(10,5))
plt.plot(elecom_log, color='blue', label='Original')
plt.plot(rollmean_log, color='red', label='Rolling Mean')
plt.plot(rollstd_log, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation (Logarithmic Scale)')

From the above graph, we can say that the result is slightly better than before, so we are heading in the right direction.

In the above graph, both the log-scale time series and its Rolling Mean (moving average) have a trend component. Thus, subtracting one from the other should remove the trend component.

**R (result) = Time Series Log Scale - Rolling Mean Log Scale -> this can be our final non trend curve.**

In [None]:
elecom_new = elecom_log - rollmean_log

In [None]:
elecom_new.head()

In [None]:
elecom_new.dropna(inplace=True)

In [None]:
elecom_new.head()
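
Why the subtraction removes the trend, and why `dropna` is needed, can be seen on a synthetic log-scale series that is nothing but a straight line. The series and its slope are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=36, freq="MS")
# Synthetic log-scale series: a straight-line trend and nothing else.
log_series = pd.Series(4.0 + 0.01 * np.arange(36), index=idx)

rolling_mean = log_series.rolling(window=12).mean()
detrended = log_series - rolling_mean

# The first window - 1 rows have no rolling mean, so they come out as NaN.
print(detrended.isna().sum())

detrended = detrended.dropna()
print(detrended.round(4).unique())
```

After the subtraction only a constant offset remains, so the upward slope is gone; the leading NaN rows are exactly the ones the notebook drops.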

**Let's determine & plot rolling statistics.**

In [None]:
test_stationarity(elecom_new)

**The above plot shows that subtracting two related series with similar trend components did remove the trend and made the data set stationary.**

The ADF test results confirm this: we can now say that the series is **STATIONARY**.

### **Time Shift Transformation**

In [None]:
elecom_log_diff = elecom_log - elecom_log.shift()
plt.figure(figsize=(10,5))
plt.plot(elecom_log_diff)

In [None]:
elecom_log_diff.dropna(inplace=True)
plt.figure(figsize=(10,5))
plt.plot(elecom_log_diff)
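
The time shift used above is first-order differencing of the log series. A short sketch on made-up values (a series growing 10% per step) shows that `x - x.shift()` matches the built-in `diff()` and equals the log of the step-to-step ratio.

```python
import numpy as np
import pandas as pd

values = pd.Series([100.0, 110.0, 121.0, 133.1])  # synthetic, +10% per step
log_values = np.log(values)

by_shift = log_values - log_values.shift()  # same pattern as in the notebook
by_diff = log_values.diff()                 # built-in equivalent
growth = np.log(values / values.shift())    # log of the step-to-step ratio

print(by_shift.round(4).tolist())
```

The first entry is NaN because there is no earlier value to subtract, which is why the notebook calls `dropna` before testing stationarity.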

**Let's determine & plot rolling statistics.**

In [None]:
test_stationarity(elecom_log_diff)

From the above plot, we can see that, visually, this is the best result so far: the series and its rolling mean and rolling standard deviation are very flat and stationary.

**Let us now break the log-scale series down into its 3 components using a library function. Once we have separated out the components, we can ignore trend and seasonality and check the nature of the residual part.**

In [None]:
decomposition = seasonal_decompose(elecom_log)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.figure(figsize=(10,5))
plt.subplot(411)
plt.plot(elecom_log, label='Original')
plt.legend(loc='best')

plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')

plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')

plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()

**The residual component contains NaN values wherever the trend could not be estimated: the centred moving average used for the trend leaves the first and last few observations undefined. We drop those rows before checking the residual.**

In [None]:
elecom_decompose = residual
elecom_decompose.dropna(inplace=True)
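
To see where the NaN values in the residual come from, here is a simplified hand-rolled additive decomposition on a synthetic, noise-free series. It is a sketch of the general idea (centred moving average for the trend, per-month averages for the seasonal part), not a reproduction of the `seasonal_decompose` internals.

```python
import numpy as np
import pandas as pd

period = 12
n = 48
idx = pd.date_range("2000-01-01", periods=n, freq="MS")
true_trend = 0.02 * np.arange(n)
true_seasonal = 0.1 * np.sin(2 * np.pi * np.arange(n) / period)
series = pd.Series(true_trend + true_seasonal, index=idx)  # synthetic, no noise

# Centred moving average for an even period: half weights on the two end points.
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
core = np.convolve(series.to_numpy(), weights, mode="valid")  # length n - period
half = period // 2
trend = pd.Series(np.r_[np.full(half, np.nan), core, np.full(half, np.nan)], index=idx)

# Seasonal effect: average detrended value per calendar month, centred on zero.
detrended = series - trend
seasonal_means = detrended.groupby(detrended.index.month).mean()
seasonal_means -= seasonal_means.mean()
seasonal = pd.Series(seasonal_means.loc[idx.month].to_numpy(), index=idx)

residual = series - trend - seasonal
print(residual.isna().sum())  # NaN where the moving average is undefined
```

Half a period is lost at each end of the trend, and the residual inherits those gaps, so dropping the NaN rows discards no real information.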

In [None]:
rollmean_decompose = elecom_decompose.rolling(window=12).mean()
rollstd_decompose = elecom_decompose.rolling(window=12).std()

plt.figure(figsize=(10,5))
plt.plot(elecom_decompose, color='blue', label='Original')
plt.plot(rollmean_decompose, color='red', label='Rolling Mean')
plt.plot(rollstd_decompose, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')

### **Plotting ACF & PACF**

In [None]:
lag_acf = acf(elecom_log_diff, nlags=20)
lag_pacf = pacf(elecom_log_diff, nlags=20, method='ols')

In [None]:
#Plot ACF:
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0, linestyle='--', color='gray')
plt.axhline(y=-1.96/np.sqrt(len(elecom_log_diff)), linestyle='--', color='gray')
plt.axhline(y=1.96/np.sqrt(len(elecom_log_diff)), linestyle='--', color='gray')
plt.title('Autocorrelation Function')            

#Plot PACF
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0, linestyle='--', color='gray')
plt.axhline(y=-1.96/np.sqrt(len(elecom_log_diff)), linestyle='--', color='gray')
plt.axhline(y=1.96/np.sqrt(len(elecom_log_diff)), linestyle='--', color='gray')
plt.title('Partial Autocorrelation Function')
            
plt.tight_layout()

From the ACF graph, we can see that the curve touches the y=0.0 line at x=2. Thus, from theory, Q = 3.

From the PACF graph, we see that the curve touches the y=0.0 line at x=2. Thus, from theory, P = 3.

(In the graphs above, the curves cut the zero line very close to 3, which is why both p and q are taken as 3.)
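
For readers who want to see what the ACF values and the dashed bands represent, the following sketch computes the sample autocorrelation by hand with NumPy. `sample_acf` is a hypothetical helper written for this illustration, and the alternating input series is synthetic.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    denom = np.sum(centred ** 2)
    return np.array([
        np.sum(centred[: len(x) - k] * centred[k:]) / denom
        for k in range(nlags + 1)
    ])

# Synthetic series that flips sign every step, so lag 1 is strongly negative.
x = np.array([1.0, -1.0] * 50)
acf_values = sample_acf(x, nlags=3)
band = 1.96 / np.sqrt(len(x))  # approximate 95% band, as drawn in the plots

print(acf_values)
print(band)
```

Lags whose autocorrelation lies outside plus or minus `band` are the ones usually treated as significant when reading p and q off the plots.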

**ARIMA is AR + I + MA.** Before we look at an ARIMA model, let us check the results of the individual AR and MA models. Note that these models report an RSS (residual sum of squares) value. A lower RSS indicates a better fit.
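
The RSS used for the comparison is simply the sum of squared differences between fitted and observed values. A minimal sketch with made-up numbers (none of them come from the models in this notebook):

```python
import numpy as np

def rss(fitted, observed):
    """Residual sum of squares between fitted and observed values."""
    fitted = np.asarray(fitted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sum((fitted - observed) ** 2))

observed = [0.10, -0.20, 0.05, 0.00]  # made-up differenced values
fit_a = [0.08, -0.15, 0.05, 0.02]     # hypothetical model that tracks the data
fit_b = [0.00, 0.00, 0.00, 0.00]      # hypothetical model that predicts no change

print(rss(fit_a, observed))
print(rss(fit_b, observed))
```

The fit that follows the observations more closely ends up with the smaller RSS.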

### **AR Model**
Making order = (3,1,0)

In [None]:
model1 = ARIMA(elecom_log, order=(3,1,0))
results_AR = model1.fit(disp=-1)
plt.figure(figsize=(10,5))
plt.plot(elecom_log_diff)
plt.plot(results_AR.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((results_AR.fittedvalues - elecom_log_diff['Value'])**2))
print('Plotting AR model')

### **MA Model**
Making order = (0,1,3)

In [None]:
model2 = ARIMA(elecom_log, order=(0,1,3))
plt.figure(figsize=(10,5))
results_MA = model2.fit(disp=-1)
plt.plot(elecom_log_diff)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((results_MA.fittedvalues - elecom_log_diff['Value'])**2))
print('Plotting MA model')

### **AR+I+MA = ARIMA Model**
Making order = (3,1,3)

In [None]:
model = ARIMA(elecom_log, order=(3,1,3))
plt.figure(figsize=(10,5))
results_ARIMA = model.fit(disp=-1)
plt.plot(elecom_log_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((results_ARIMA.fittedvalues - elecom_log_diff['Value'])**2))
print('Plotting ARIMA model')

**RSS value for:**

1. AR Model - 0.8695
2. MA Model - 1.2793
3. ARIMA Model - 0.5227

By combining AR & MA into ARIMA, the RSS drops to 0.5227, lower than for either individual model, indicating that ARIMA fits better than its component models.

With the ARIMA model built, we will now generate predictions. But before plotting any predictions, we need to convert them back to the original scale, because the model was built on log-transformed data.

### **Prediction & Reverse Transformation**

In [None]:
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff.head()

In [None]:
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_diff_cumsum.head()

In [None]:
predictions_ARIMA_log = pd.Series(elecom_log['Value'].iloc[0], index=elecom_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA_log.head()

### **Inverse of log is exp**

In [None]:
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.figure(figsize=(10,5))
plt.plot(elecomind)
plt.plot(predictions_ARIMA)
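
The reverse transformation (cumulative sum, add the first log value, then exponentiate) can be checked as a round trip on a small synthetic series. Here the exact differences are used in place of model output, so the rebuilt series should match the original; with fitted values from a model it would only approximate it.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=5, freq="MS")
original = pd.Series([100.0, 104.0, 101.0, 108.0, 112.0], index=idx)  # synthetic

log_series = np.log(original)
log_diff = log_series.diff().dropna()  # what a d=1 model works with

# Undo the differencing: running sum of the steps, added to the first log value.
rebuilt_log = pd.Series(log_series.iloc[0], index=idx)
rebuilt_log = rebuilt_log.add(log_diff.cumsum(), fill_value=0)

rebuilt = np.exp(rebuilt_log)  # undo the log
print(rebuilt.round(2).tolist())
```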

**From the above plot, we can see that the predicted values are very close to the real time series values, which indicates a fairly accurate model.**

In [None]:
elecom_log.head()

In [None]:
elecom_log.shape

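**The data set holds about 33 years of monthly data (33 x 12 = 396 data points, plus one extra record, which gives the 397 rows seen earlier). Now we want to forecast an additional 6 years (6 x 12 months = 72 data points).**

**396 + 72 = 468, which is used as the end point of the prediction below.**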

In [None]:
results_ARIMA.plot_predict(1,468)
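
The 72-month horizon can also be laid out as explicit dates. This is a sketch under an assumption: the last observed month is taken to be January 2018 (consistent with forecasting until 2024, but not read from the data here).

```python
import pandas as pd

last_observed = pd.Timestamp("2018-01-01")  # assumed last month in the data
horizon = 6 * 12                            # 6 years of monthly steps

future_index = pd.date_range(
    start=last_observed + pd.DateOffset(months=1),
    periods=horizon,
    freq="MS",
)
print(len(future_index), future_index[0].date(), future_index[-1].date())
```

An index like this is handy for labelling forecast values when they are extracted as numbers rather than only plotted.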

My other time series notebook(Air Passenger): https://www.kaggle.com/sunaysawant/air-passengers-time-series-arima

# **THANK YOU ;)**