
#**Time Series Project Paper**

---
## Student Name: Lee Bundi
## Student No: 102586

The purpose of this paper is to model a time series.
A time series is a set of observations, each recorded at a specific time.
Time series analysis is the study of such data.
This paper focuses on a univariate time series, meaning a single variable observed and indexed over time.
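
As a minimal illustration of what "univariate" means here, the sketch below builds a pandas `Series` holding one value per year. The values and the name `toy_series` are made up for illustration and are not taken from the Kenya temperature data.

```python
import pandas as pd

# Hypothetical annual readings (illustrative values only, not the Kenya data)
years = pd.to_datetime([f"{year}-12-31" for year in range(2000, 2005)])
toy_series = pd.Series([24.1, 24.3, 24.0, 24.6, 24.8], index=years, name="reading")

# One variable, one observation per time stamp: a univariate time series
print(toy_series)
print(toy_series.index.year.tolist())
```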

The **basic objective**  is to determine a model that describes the pattern of the time series. Uses for such a model are:

1. To describe the important features of the time series pattern.
2. To explain how the past affects the future.
3. To forecast future values of the series.


The time series data I am using in this paper consists of the annual average temperature in Kenya for the period 1981 to 2020. The source of the data is https://africaopendata.org/dataset


##Steps to Time Series Analysis.


*   Visualize the time series plot.
*   Check for stationarity.
*   Make the time series stationary.
*   Plot the ACF and PACF.
*   Select the model and train it on the data.
*   Choose the best-performing model.
*   Forecast on the test data using the chosen model.
*   Analyze the forecast performance, iterating through the steps until you find the best forecast performance.
*   Perform the future forecast.











In [None]:
# Import Packages

%matplotlib inline
from bokeh.io import output_notebook
from bokeh import models, palettes, transform
from bokeh.plotting import figure, show
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn import cluster, decomposition, pipeline, preprocessing
import statsmodels
import missingno as mn
import plotly.offline as py
import plotly.express as px
import datetime
import plotly.graph_objects as go
from statsmodels.tsa.stattools import adfuller,acf,pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf ,plot_pacf
from statsmodels.tsa.arima_model import ARIMA
import statsmodels.api as sm
from pylab import rcParams
from math import sqrt
from sklearn.metrics import mean_squared_error
# from pmdarima import auto_arima
import warnings
warnings.filterwarnings('ignore')
# from statsmodels.tsa.ar_model import AutoReg, ar_select_order


Read the data set and load it into a pandas DataFrame.

In [None]:
df=pd.read_csv("https://raw.githubusercontent.com/NUELBUNDI/PDS_PROJECT/main/tempkenyadata.csv",index_col=0,parse_dates=True)

df.head()

##**Plot the time series data to visualize and analyze it.**





In [None]:
df.plot(figsize=(12,6))
plt.xlabel('Year')
plt.ylabel('Annual average temperature')
plt.title('Trend of the Time Series')
plt.show()

Decompose the time series.

This enables us to visualize the components of the time series, namely:


*   Trend: an increasing or decreasing value in the series.
*   Seasonality: any repeating cycle.
*   Noise: random variation in the series.
*   Level: the average value in the series.
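
To make the four components concrete, the following sketch assembles a synthetic series under the additive model, where the observed value is the sum of level, trend, seasonality, and noise. All numbers are hypothetical and chosen only for illustration; they do not describe the temperature data.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48)                               # 48 time steps

level = 20.0                                    # average value of the series
trend = 0.05 * t                                # steadily increasing component
seasonal = 1.5 * np.sin(2 * np.pi * t / 12)     # cycle repeating every 12 steps
noise = rng.normal(scale=0.2, size=t.size)      # random variation

# Additive model: the observed series is the sum of its components
observed = level + trend + seasonal + noise

# Subtracting the known components leaves only the noise
residual = observed - level - trend - seasonal
print(np.allclose(residual, noise))
```

A decomposition routine works in the opposite direction: it starts from the observed series alone and estimates the components.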



In [None]:
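# `period` replaces the deprecated `freq` argument of seasonal_decompose
result = seasonal_decompose(df, model='additive', period=12)
fig = result.plot()
fig.set_size_inches(15, 5)  # resize the figure that result.plot() creates
plt.show()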


# The series has no seasonality

### **Checking For Stationarity**

In [None]:
# Function to check for stationarity

def stationarity_test(timeseries):

    print('RESULTS OF DICKEY-FULLER TEST\n')
    df_test = adfuller(timeseries.iloc[:,0].values, autolag='AIC' )
    # df_test = adfuller(timeseries,autolag='AIC')
    df_output = pd.Series(df_test[0:4], index = ['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in df_test[4].items():
        df_output['Critical Value (%s)' %key] = value
    print(df_output)
    print("****************************************************")
    print(f'INFERENCE:         THE TIME SERIES IS {"NON-" if df_test[1]>=0.05 else ""}STATIONARY')

stationarity_test(df)

### **Make the non-stationary time series stationary by differencing**

Since the previous results show that the time series is non-stationary, we first have to make it stationary, because the ARMA part of an ARIMA model assumes a stationary series.
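
First-order differencing replaces each value with its change from the previous step, y'_t = y_t - y_{t-1}. The sketch below applies it to a hypothetical series with a linear trend (not the temperature data) to show how the trend disappears.

```python
import numpy as np

# Hypothetical series with a linear trend: its mean changes over time
t = np.arange(10)
trending = 3.0 + 0.5 * t

# First-order differencing: y'_t = y_t - y_{t-1}
first_diff = np.diff(trending)

print(first_diff)   # every difference equals the slope of the trend
```

On a pandas object, `diff()` followed by `dropna()` computes the same quantity while keeping the time index.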



In [None]:
# Check the First Order Difference
df_diff=df.diff().dropna()

stationarity_test(df_diff)
# df_diff.plot()

##**Plot ACF and PACF**

Plotting the ACF and PACF of the stationary data helps us choose starting values for the parameters (p, d, q) of our ARIMA model:


*   p: the number of lag observations included in the autoregressive (AR) part of the model.
*   d: the degree of differencing.
*   q: the order of the moving-average (MA) part of the model.
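
For intuition about what the ACF plot displays, the sketch below computes the sample autocorrelation by hand with NumPy. The helper `sample_acf` and the toy series are hypothetical and for illustration only; a common rule of thumb is to read a candidate q from the lag where the ACF cuts off and a candidate p from the lag where the PACF cuts off.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation of x at lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[lag:] * centered[:x.size - lag]) / denom
        for lag in range(nlags + 1)
    ])

# Hypothetical alternating series: strong negative correlation at lag 1
toy = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
acf_values = sample_acf(toy, nlags=3)
print(acf_values)
```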

In [None]:
# Determine p and q by plotting the ACF and PACF

def plot_acf_pcf(ts_data):
  # Use the series passed in, so the plots reflect the differenced data
  plot_pacf(ts_data['readig'],lags=10)
  plot_acf(ts_data['readig'],lags=10)

plot_acf_pcf(df_diff)



Split the time series data set into two parts: the train and test data sets.

In [None]:
df.shape

In [None]:
# Split the data into train and test
train=df.iloc[:30]
test=df.iloc[30:]

### **Build an ARIMA model.**

Run auto-ARIMA to choose the best-performing combination of (p, d, q) parameters, selecting the one with the lowest AIC.

auto-ARIMA returns the best model it finds, already fitted to the training data.
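
The idea of picking the order with the lowest AIC can be sketched without pmdarima. The example below is a simplified stand-in, not the search that `auto_arima` performs: it fits pure AR(p) models by least squares on simulated data and computes AIC = 2k - 2 ln L under a Gaussian likelihood. The helper `ar_aic`, the simulated series `ar1`, and the range of orders are all assumptions made for illustration.

```python
import numpy as np

def ar_aic(series, p, max_p):
    """AIC of an AR(p) model fitted by least squares on a common sample."""
    y = series[max_p:]
    # Design matrix: intercept plus the p most recent lags of the series
    columns = [np.ones(y.size)]
    columns += [series[max_p - lag:series.size - lag] for lag in range(1, p + 1)]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    n = y.size
    sigma2 = np.sum(resid ** 2) / n
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = p + 2   # lag coefficients + intercept + noise variance
    return 2 * k - 2 * log_lik

# Hypothetical AR(1) data: y_t = 0.7 * y_{t-1} + noise
rng = np.random.default_rng(42)
ar1 = np.zeros(300)
for i in range(1, ar1.size):
    ar1[i] = 0.7 * ar1[i - 1] + rng.normal()

max_p = 4
aic_by_order = {p: ar_aic(ar1, p, max_p) for p in range(max_p + 1)}
best_p = min(aic_by_order, key=aic_by_order.get)
print(aic_by_order)
print("order with the lowest AIC:", best_p)
```

Every candidate is scored on the same observations (those after the first `max_p`), so the AIC values are comparable across orders.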

In [None]:
import pmdarima as pmd

def arimamodel(timeseriesarray):
    autoarima_model = pmd.auto_arima(timeseriesarray, 
                              start_p=1, 
                              start_q=1,
                              test="adf",
                              trace=True)
    return autoarima_model

    
arima_model = arimamodel(train)
arima_model.summary()

From the result above, the AIC is relatively small at 13.585. The next step is to make predictions over the range of the test data and compare them with the actual test data.

In [None]:
# Make predictions on the test data

# A pmdarima model forecasts the periods that follow the training data via n_periods
pred=arima_model.predict(n_periods=len(test))
print(pred)

In [None]:
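# Plot predicted vs actual

# Align the predictions with the test index so both series share one axis
pd.Series(np.asarray(pred), index=test.index, name='Predicted').plot(legend=True)
test['readig'].plot(legend=True)
rmse=sqrt(mean_squared_error(test['readig'],pred))
plt.title(f'The RMSE is {rmse}')
print(test['readig'].mean())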

In [None]:
# Calculate the performance error

rmse=sqrt(mean_squared_error(pred,test['readig']))
print(rmse)


The RMSE is 0.215, which indicates a good fit on the test data, so we can now fit the model to the whole data set and make future predictions.
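
For reference, the RMSE is the square root of the mean squared difference between actual and predicted values, and dividing it by the mean of the actual values puts the error in proportion to the scale of the series. The sketch below uses hypothetical numbers, not the notebook's results.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical values for illustration only
actual = [25.0, 25.5, 26.0, 25.0]
predicted = [25.0, 25.0, 26.0, 26.0]

error = rmse(actual, predicted)
relative_error = error / np.mean(actual)   # RMSE as a fraction of the mean level
print(error, relative_error)
```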

In [None]:
# If the model is okay then train the model on the whole data set and make future predictions

arima_model = sm.tsa.ARIMA(df['readig'], order=(4,2,0))
arima_model = arima_model.fit()
df.tail()



In [None]:
# For Future Dates

# index_future_dates=pd.date_range(start='2020-12-31', end ='2030-12-31')
index_future_dates=pd.date_range(start=pd.Timestamp(2020, 12, 31), periods=11, freq=pd.DateOffset(years=1))
# print(index_future_dates)
pred2=arima_model.predict(start=len(df),end=len(df)+10,typ='levels').rename('ARIMA PREDICTIONS')
# print(pred2)
# pred2.index=index_future_dates
print(pred2)


In [None]:
pred2.plot(figsize=(10,6),legend=True)

###References

In [None]:
# https://www.machinelearningplus.com/resources/arima/arima-forecast-test-results/
# https://www.machinelearningplus.com/resources/arima/implement-arima-model/
# https://www.machinelearningplus.com/time-series/time-series-analysis-python/
# https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/
# https://www.kaggle.com/satishgunjal/tutorial-time-series-analysis-and-forecasting
# https://www.geeksforgeeks.org/python-arima-model-for-time-series-forecasting/

# Install the statsmodels release pinned for this notebook
%pip install statsmodels==0.11.0