Time series analysis is a statistical method for analysing past data collected over a given period in order to forecast future values. A time series is an ordered sequence of observations recorded at equally spaced intervals. To understand time series data and its analysis, consider the Airline Passenger dataset, which records the number of passengers over a period of time.

Loading the basic libraries

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf, pacf
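# Legacy ARIMA API: this class was removed in statsmodels 0.13, so the
# fit(disp=...) and plot_predict() calls below need an older statsmodels release.
from statsmodels.tsa.arima_model import ARIMA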

In [None]:
a1=pd.read_csv('../input/airpassengers/AirPassengers.csv')

In [None]:
a1.head()

In [None]:
a1.tail()

Let's rename the "#Passengers" column, since the "#" makes the name awkward to work with.

In [None]:
a1.rename(columns={'#Passengers':'Passengers'},inplace=True)

In [None]:
a1.head()

In [None]:
a1.shape

In [None]:
a1.info()

The dataset has 144 records and 2 columns, with no null values. However, look at the Month column: it needs to be converted to the datetime data type.

In [None]:
# Parse the Month column into pandas timestamps
# (infer_datetime_format is deprecated in pandas 2.0 and is not needed here)
a1['Month'] = pd.to_datetime(a1['Month'])

In [None]:
a1.info()

Next, we set the Month column as the index.

In [None]:
airpass = a1.set_index('Month',inplace=False)

In [None]:
airpass.head()
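
As a small illustration of the conversion and indexing steps, the sketch below uses a three-row toy frame with made-up values. It assumes the month labels are stored as `YYYY-MM` strings, which may not match the actual CSV.

```python
import pandas as pd

# Toy frame with made-up values; the 'YYYY-MM' string format is an assumption.
toy = pd.DataFrame({
    "Month": ["2000-01", "2000-02", "2000-03"],
    "Passengers": [10, 12, 15],
})

# Parse the strings into timestamps (each maps to the first day of the month).
toy["Month"] = pd.to_datetime(toy["Month"], format="%Y-%m")

# Use the timestamps as the index so that time-based selection works.
toy_indexed = toy.set_index("Month")

# Partial-string selection: all rows that fall in February 2000.
february = toy_indexed.loc["2000-02"]
print(february)
```

With a datetime index in place, rolling windows and lag-based operations work directly on the time axis.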

Let's plot the data

In [None]:
plt.xlabel('Date')
plt.ylabel('Number Of Air Passengers')
plt.plot(airpass)

From the plot above, we can see that the series has a trend component. Hence, we now check the data for stationarity.



Let's write one function that plots the rolling statistics and runs the ADF (Augmented Dickey-Fuller) test. We will need to repeat these steps several times, so a function will be handy.

In [None]:
def test_stationarity(timeseries):
    
    #Determine rolling statistics
    movingAverage = timeseries.rolling(window=12).mean()
    movingSTD = timeseries.rolling(window=12).std()
    
    #Plot rolling statistics
    plt.plot(timeseries, color='blue', label='Original')
    plt.plot(movingAverage, color='red', label='Rolling Mean')
    plt.plot(movingSTD, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    
    #Perform Dickey–Fuller test:
    print('Results of Dickey Fuller Test:')
    airpass_test = adfuller(timeseries['Passengers'], autolag='AIC')
    dfoutput = pd.Series(airpass_test[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in airpass_test[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)

Let's determine & plot rolling statistics.

In [None]:
test_stationarity(airpass)

From the plot above, we can see that the rolling mean itself has a trend component, even though the rolling standard deviation is fairly constant over time.

For a time series to be stationary, both the rolling mean and the rolling standard deviation must remain fairly constant with respect to time.

Both curves need to be parallel to the x-axis; in our case they are not.
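
The rolling-mean criterion can be illustrated on synthetic data. The sketch below builds two made-up monthly series (one with a linear trend, one without) and compares how far their 12-month rolling means move. The series and their coefficients are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (made-up): a 12-month cycle with and without a trend.
t = np.arange(120)
index = pd.date_range("2000-01-01", periods=120, freq="MS")
seasonal_part = 3.0 * np.sin(2 * np.pi * t / 12)
trending = pd.Series(0.5 * t + seasonal_part, index=index)
flat = pd.Series(seasonal_part, index=index)

# A 12-month window averages out the seasonal cycle and leaves the trend.
roll_trending = trending.rolling(window=12).mean().dropna()
roll_flat = flat.rolling(window=12).mean().dropna()

# Spread of the rolling mean: large for the trending series, near 0 otherwise.
spread_trending = roll_trending.max() - roll_trending.min()
spread_flat = roll_flat.max() - roll_flat.min()
print(spread_trending, spread_flat)
```

A rolling mean that keeps climbing, as in the first series, is the visual signature of a non-stationary mean.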

We have also run the ADF (Augmented Dickey-Fuller) test, whose null hypothesis is that the time series is non-stationary. A large p-value therefore means we cannot reject non-stationarity.

Data Transformation To Achieve Stationarity
Now we need to transform the data to achieve stationarity. We can apply transformations such as log, square, square root, cube, cube root, time shift, or exponential decay.

Let's perform Log Transformation.

Basically we need to remove the trend component.
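
A brief sketch of what the log does, using a made-up series that grows 2% per step. The log does not remove a trend by itself; it turns proportional growth into roughly linear growth, which is easier to remove in a later step.

```python
import numpy as np

# Made-up series growing 2% per step: the level rises faster and faster.
t = np.arange(48)
level = 100.0 * 1.02 ** t

# On the raw scale the step-to-step increase keeps growing...
raw_steps = np.diff(level)

# ...but on the log scale every step has the same size, log(1.02).
log_steps = np.diff(np.log(level))

print(raw_steps[0], raw_steps[-1])
print(log_steps[0], log_steps[-1])
```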

In [None]:
airpass_log = np.log(airpass)

In [None]:
plt.plot(airpass_log)

We compute the rolling statistics separately here (without the function) because we need the rolling mean itself for the next computation.

In [None]:
rollmean_log = airpass_log.rolling(window=12).mean()
rollstd_log = airpass_log.rolling(window=12).std()


In [None]:
plt.plot(airpass_log, color='blue', label='Original')
plt.plot(rollmean_log, color='red', label='Rolling Mean')
plt.plot(rollstd_log, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation (Logarithmic Scale)')

From the graph above, the result is slightly better than before, so we are heading in the right direction.

Both the log-scale time series and its rolling mean (moving average) still have a trend component. Subtracting one from the other should therefore remove the trend.

R (result) = time series (log scale) - rolling mean (log scale) -> this can be our final de-trended curve
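
The subtraction can be checked on a synthetic log-scale series (made-up trend and seasonal coefficients). The sketch shows two things: the first 11 values are NaN because a 12-month trailing window is not yet full, and after removing them the average level no longer drifts.

```python
import numpy as np
import pandas as pd

# Made-up log-scale series: linear trend plus a 12-month seasonal cycle.
t = np.arange(60)
index = pd.date_range("2000-01-01", periods=60, freq="MS")
log_series = pd.Series(0.01 * t + 0.1 * np.sin(2 * np.pi * t / 12), index=index)

# Trailing 12-month mean: undefined (NaN) until 12 observations are available.
rolling_mean = log_series.rolling(window=12).mean()

# Subtracting the rolling mean removes the trend but keeps the seasonal swing.
detrended = log_series - rolling_mean
n_missing = int(detrended.isna().sum())
print(n_missing)

# Compare the average level of two consecutive two-year stretches.
detrended = detrended.dropna()
first_half = detrended.iloc[:24].mean()
second_half = detrended.iloc[24:48].mean()
print(first_half, second_half)
```

The leading NaN values are why the notebook calls `dropna` before testing the new series for stationarity.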

In [None]:
airpass_new = airpass_log - rollmean_log

In [None]:
airpass_new.head()

In [None]:
airpass_new.dropna(inplace=True)

In [None]:
airpass_new.head()

In [None]:
test_stationarity(airpass_new)

The plot above confirms that subtracting two related series with similar trend components removed the trend and made the data stationary.

The ADF test results support this:

p-value has dropped from 0.99 to 0.022, which is below the usual 0.05 threshold
Test Statistic is now close to the critical values at the 1%, 5% and 10% significance levels
So we can now say that the series is STATIONARY

Time Shift Transformation

In [None]:
airpass_log_diff = airpass_log - airpass_log.shift()
plt.plot(airpass_log_diff)

In [None]:
airpass_log_diff.dropna(inplace=True)
plt.plot(airpass_log_diff)
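
The shift-and-subtract step is a first-order difference. The sketch below, on four made-up counts, checks that it matches pandas' `diff()` and that a log difference equals the log of the growth ratio between neighbouring values.

```python
import numpy as np
import pandas as pd

# Made-up passenger counts.
counts = pd.Series([100.0, 110.0, 121.0, 115.0])
log_counts = np.log(counts)

# Subtracting the series shifted by one step is a first-order difference.
by_shift = log_counts - log_counts.shift()
by_diff = log_counts.diff()

# A log difference equals the log of the ratio between neighbouring values.
log_ratio = np.log(counts / counts.shift())

print(by_shift.tolist())
print(log_ratio.tolist())
```

The first element is NaN because the first observation has no predecessor, which is why the notebook drops it.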

In [None]:
test_stationarity(airpass_log_diff)

From the plot above, this is visually the best result so far: the series, its rolling mean and its rolling standard deviation are all flat.

However, the ADF test shows that:

p-value of 0.07 is not as good as the 0.02 of the previous transformation.
Test Statistic is not as close to the critical values as in the previous transformation.
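
To make the comparison of p-values explicit, here is a small sketch with a hypothetical helper, `adf_decision`, which is not part of statsmodels. It applies the usual rule (reject the null hypothesis of non-stationarity when the p-value is below a chosen significance level) to the p-values quoted in the text.

```python
def adf_decision(p_value, alpha=0.05):
    """Hypothetical helper: turn an ADF p-value into a plain-language verdict.

    The null hypothesis of the ADF test is that the series is non-stationary,
    so only a small p-value counts as evidence of stationarity.
    """
    if p_value < alpha:
        return "reject H0: evidence of stationarity"
    return "fail to reject H0: non-stationarity cannot be ruled out"


# p-values quoted in the text for the three versions of the series.
reported = {
    "original": 0.99,
    "log minus rolling mean": 0.022,
    "log difference": 0.07,
}

for name, p in reported.items():
    print(f"{name}: p = {p} -> {adf_decision(p)}")
```

At the 5% level only the log-minus-rolling-mean series passes; the log-differenced series would pass at the looser 10% level, for example `adf_decision(0.07, alpha=0.10)`.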

Let us now break the log-scale series into its 3 components (trend, seasonality and residual) using a library function. Once the components are separated, we can set aside trend and seasonality and examine the nature of the residual part.

In [None]:
decomposition = seasonal_decompose(airpass_log)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.subplot(411)
plt.plot(airpass_log, label='Original')
plt.legend(loc='best')

plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')

plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')

plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()

The decomposition estimates the trend with a centred moving average, so the first and last few observations have no trend estimate and their residual is NaN. Hence, we remove these rows.
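
One common source of NaN values in a decomposition is the trend filter itself. Classical decomposition typically estimates the trend with a centred moving average, which cannot be computed near the ends of the series. The sketch below reproduces that effect with NumPy on a made-up series; treating it as a model of what `seasonal_decompose` does by default is an assumption.

```python
import numpy as np

# Made-up monthly series: linear trend plus a 12-month seasonal cycle.
n, period = 48, 12
t = np.arange(n)
series = 0.5 * t + 3.0 * np.sin(2 * np.pi * t / period)

# Centred "2x12" moving average: 13 weights, half weight on the two end points.
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period

# 'valid' mode only returns positions where the full window fits.
trend_core = np.convolve(series, weights, mode="valid")

# Pad with NaN so the trend lines up with the original series.
half = period // 2
trend = np.r_[np.full(half, np.nan), trend_core, np.full(half, np.nan)]

n_missing = int(np.isnan(trend).sum())
print(n_missing)   # 6 missing values at the start plus 6 at the end
```

Wherever the trend is NaN, the residual (observed minus trend minus seasonal) is NaN as well.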

In [None]:
airpass_decompose = residual
airpass_decompose.dropna(inplace=True)

In [None]:
rollmean_decompose = airpass_decompose.rolling(window=12).mean()
rollstd_decompose = airpass_decompose.rolling(window=12).std()

plt.plot(airpass_decompose, color='blue', label='Original')
plt.plot(rollmean_decompose, color='red', label='Rolling Mean')
plt.plot(rollstd_decompose, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')

Plotting ACF & PACF

In [None]:
lag_acf = acf(airpass_log_diff, nlags=20)
lag_pacf = pacf(airpass_log_diff, nlags=20, method='ols')

In [None]:
#Plot ACF:
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0, linestyle='--', color='gray')
plt.axhline(y=-1.96/np.sqrt(len(airpass_log_diff)), linestyle='--', color='gray')
plt.axhline(y=1.96/np.sqrt(len(airpass_log_diff)), linestyle='--', color='gray')
plt.title('Autocorrelation Function')            

#Plot PACF
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0, linestyle='--', color='gray')
plt.axhline(y=-1.96/np.sqrt(len(airpass_log_diff)), linestyle='--', color='gray')
plt.axhline(y=1.96/np.sqrt(len(airpass_log_diff)), linestyle='--', color='gray')
plt.title('Partial Autocorrelation Function')
            
plt.tight_layout()

From the ACF graph, we can see that the curve first touches the y=0.0 line at about x=2. Thus, from theory, the MA order is Q = 2. From the PACF graph, the curve also first touches the y=0.0 line at about x=2. Thus, from theory, the AR order is P = 2.
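
For readers who want to see what the ACF curve and the dashed bands represent, here is a minimal NumPy sketch. `sample_acf` is a hypothetical helper, not the statsmodels function, and the AR(1) test series is made up. It also locates the first lag at which the ACF falls inside the approximate 95% band, which is a commonly used alternative to reading off where the curve reaches zero.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    return np.array([
        np.dot(centered[: len(x) - k], centered[k:]) / denom
        for k in range(nlags + 1)
    ])

# Made-up AR(1) series with coefficient 0.8, generated from seeded noise.
rng = np.random.default_rng(0)
noise = rng.standard_normal(5000)
ar1 = np.zeros_like(noise)
for i in range(1, len(noise)):
    ar1[i] = 0.8 * ar1[i - 1] + noise[i]

acf_values = sample_acf(ar1, nlags=20)

# Approximate 95% band for a white-noise series of the same length.
band = 1.96 / np.sqrt(len(ar1))

# First lag (if any) at which the ACF falls inside the band.
inside = np.flatnonzero(np.abs(acf_values) < band)
first_inside = int(inside[0]) if inside.size else None
print(acf_values[1], band, first_inside)
```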

ARIMA combines AR (autoregressive), I (integrated) and MA (moving average) parts. Before fitting the full ARIMA model, let us check the results of the individual AR and MA models. Each model is scored by its RSS (residual sum of squares); a lower RSS indicates a better fit.
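
The RSS is the sum of squared differences between fitted and observed values. The sketch below uses a hypothetical helper, `rss`, on made-up numbers; it aligns the two series on their shared index labels first, since a fit on differenced data has no value for the first date.

```python
import pandas as pd

def rss(fitted, observed):
    """Residual sum of squares over the index labels shared by both series."""
    fitted, observed = fitted.align(observed, join="inner")
    return float(((fitted - observed) ** 2).sum())

# Made-up values; 'fitted' lacks the first label, as a differenced fit would.
index = pd.date_range("2000-01-01", periods=4, freq="MS")
observed = pd.Series([0.10, -0.05, 0.20, 0.00], index=index)
fitted = pd.Series([-0.02, 0.15, 0.03], index=index[1:])

score = rss(fitted, observed)
print(score)
```

RSS values are only comparable between models fitted to the same data on the same scale.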

AR Model
Making order = (2,1,0)

In [None]:
model1 = ARIMA(airpass_log, order=(2,1,0))
results_AR = model1.fit(disp=-1)
plt.plot(airpass_log_diff)
plt.plot(results_AR.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((results_AR.fittedvalues - airpass_log_diff['Passengers'])**2))
print('Plotting AR model')

MA Model
Making order = (0,1,2)

In [None]:
model2 = ARIMA(airpass_log, order=(0,1,2))
results_MA = model2.fit(disp=-1)
plt.plot(airpass_log_diff)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((results_MA.fittedvalues - airpass_log_diff['Passengers'])**2))
print('Plotting MA model')

AR+I+MA = ARIMA Model
Making order = (2,1,2)

In [None]:
model = ARIMA(airpass_log, order=(2,1,2))
results_ARIMA = model.fit(disp=-1)
plt.plot(airpass_log_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((results_ARIMA.fittedvalues - airpass_log_diff['Passengers'])**2))
print('Plotting ARIMA model')

RSS values:
AR Model - 1.5023
MA Model - 1.4721
ARIMA Model - 1.0292

Combining AR and MA into ARIMA lowers the RSS to 1.0292, below either individual model, which indicates that ARIMA fits better than its component models.

With the ARIMA model built, we can now generate predictions. Before plotting them, we need to convert the predictions back to the original scale, because the model was built on log-transformed data.



Prediction & Reverse Transformation

In [None]:
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff.head()

In [None]:
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_diff_cumsum.head()

In [None]:
predictions_ARIMA_log = pd.Series(airpass_log['Passengers'].iloc[0], index=airpass_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA_log.head()

Inverse of log is exp

In [None]:
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.plot(airpass)
plt.plot(predictions_ARIMA)

From the plot above, we can see that the fitted values are very close to the actual time series values, which indicates a fairly accurate model.
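
The reverse transformation (cumulative sum, add back the first log value, exponentiate) can be sanity-checked on a toy series. The sketch below uses made-up counts and exact differences rather than model output, so the round trip should reproduce the input.

```python
import numpy as np
import pandas as pd

# Made-up counts on a monthly index.
index = pd.date_range("2000-01-01", periods=5, freq="MS")
counts = pd.Series([100.0, 110.0, 121.0, 115.0, 130.0], index=index)

log_counts = np.log(counts)
log_diff = log_counts.diff().dropna()   # the scale the model is fitted on

# Step 1: cumulative sum turns differences back into offsets from the start.
offsets = log_diff.cumsum()

# Step 2: add the first log value; the first date has no offset, so use 0.
base = pd.Series(log_counts.iloc[0], index=log_counts.index)
rebuilt_log = base.add(offsets, fill_value=0)

# Step 3: exponentiate to return to the original scale.
rebuilt = np.exp(rebuilt_log)
print(np.allclose(rebuilt, counts))
```

With model output in place of exact differences, each fitted difference carries an error and the cumulative sum accumulates those errors, so the rebuilt curve can drift away from the data over time.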

In [None]:
airpass_log.head()

We have 144 data points (12 years of monthly data). Now we want to forecast an additional 10 years (10 x 12 months = 120 data points).

144 + 120 = 264 records/data points
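
The horizon arithmetic can be checked by building the matching monthly date index. The start month used below (January 1949) is an assumption about the dataset and should be adjusted if the CSV begins elsewhere.

```python
import pandas as pd

n_observed = 12 * 12        # 144 monthly observations
n_future = 10 * 12          # 120 months to forecast
n_total = n_observed + n_future

# Assumed start month of the dataset; adjust if the CSV begins elsewhere.
full_index = pd.date_range("1949-01-01", periods=n_total, freq="MS")
future_index = full_index[n_observed:]

print(n_total)
print(future_index[0], future_index[-1])
```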

In [None]:
results_ARIMA.plot_predict(1,264)