In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Introduction

Welcome to my time series forecasting project. In this project, I will be exploring the fascinating world of time series analysis and leveraging forecasting models.

Objective:
The primary objective of this project is to forecast sales and revenue by fitting forecasting models that predict future values of the time series dataset I chose.

Methodology:
My approach involves several key steps:

Data Exploration and Preprocessing: I will start by exploring the dataset, identifying trends, seasonality, and any anomalies. Data preprocessing techniques such as handling missing values, scaling, and removing outliers will be applied to ensure data quality.

Model Selection: I will implement ARIMA and Holt models, which are well established and widely used for time series forecasting. Both capture temporal dependencies and trend. Holt's linear method has no seasonal component, and an ARIMA model captures seasonality only when seasonal terms are included.

Model Training and Evaluation: The models will be trained on historical data and evaluated using appropriate performance metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). Cross-validation techniques will be employed to assess model generalization.

Forecasting: Finally, I will use the trained models to generate forecasts for future time periods, providing insight into the expected trends and patterns in the data.
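
As a minimal sketch of the evaluation step, the block below defines MAE and RMSE and runs a rolling-origin evaluation, which is one common form of cross-validation for time series. The helper names (`mae`, `rmse`, `rolling_origin`, `naive_forecast`) and the toy values are my own assumptions for illustration, and the naive forecast only stands in for a real model.

```r
# Error metrics: both take actual values and forecasts of equal length.
mae  <- function(actual, predicted) mean(abs(actual - predicted))
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Rolling-origin evaluation (hypothetical helper): fit on y[1:i], forecast
# one step ahead, compare with y[i + 1], then move the origin forward.
rolling_origin <- function(y, initial, forecast_fn) {
  origins   <- initial:(length(y) - 1)
  predicted <- vapply(origins, function(i) forecast_fn(y[1:i]), numeric(1))
  actual    <- y[origins + 1]
  c(MAE = mae(actual, predicted), RMSE = rmse(actual, predicted))
}

# Placeholder model: the naive forecast repeats the last observed value.
naive_forecast <- function(train) train[length(train)]

toy <- c(10, 12, 11, 13, 15, 14, 16, 18)  # made-up values
cv_errors <- rolling_origin(toy, initial = 4, forecast_fn = naive_forecast)
print(cv_errors)
```

Any function that takes a training vector and returns a single one-step forecast could be passed as `forecast_fn` in place of the naive forecast.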

I welcome corrections, questions and collaborations. 

In [None]:
# tidyverse already attaches dplyr and tidyr, so one library() call is enough.
# install.packages() is only needed if a package is missing from the environment.
library(tidyverse)

In [None]:
# read.csv() is base R, so readr does not need to be installed or loaded here
sales_data <- read.csv('/kaggle/input/time-series-starter-dataset/Month_Value_1.csv')

# Data Exploration and Preprocessing

In [None]:
# Formatting the Period column as a Date
sales_data$Period <- as.Date(sales_data$Period, format = "%d.%m.%Y")

# Since this data is very small and contains floats, converting to numeric data type works best

# Column names are case-sensitive: the column is Sales_quantity
data.class(sales_data$Sales_quantity)
sales_data$Sales_quantity <- as.numeric(sales_data$Sales_quantity)

str(sales_data)


In [None]:
#checking for null data
null_values <- is.na(sales_data)
num_na_values <- sum(null_values)

print(num_na_values)

#removing null values
sales_data <- na.omit(sales_data)
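
The total above does not show where the missing values sit. As a rough sketch on a made-up data frame (the values below are assumptions, not the Kaggle data), counting per column and counting incomplete rows shows what `na.omit()` will drop. This matters for a monthly series because `ts()` assumes evenly spaced observations, so dropping rows from the middle of the data would shift later months.

```r
# Toy data frame standing in for sales_data (values are made up).
toy <- data.frame(
  Period         = c("01.01.2015", "01.02.2015", "01.03.2015", "01.04.2015"),
  Revenue        = c(100, NA, 120, NA),
  Sales_quantity = c(10, 11, NA, NA)
)

# Missing values per column rather than one grand total.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that na.omit() would drop (any row with at least one NA).
rows_dropped <- sum(!complete.cases(toy))
print(rows_dropped)

# Positions of the incomplete rows, to check whether they are all at the end.
print(which(!complete.cases(toy)))
```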

In [None]:
# Converting Revenue to a monthly time series (passing the column as a plain vector)
tsrev = ts(sales_data$Revenue, frequency = 12, start = c(2015, 1))
plot(tsrev, ylab = "Revenue")


The plot suggests that the data is non-stationary, since the mean and variance do not appear constant over time, so I test for stationarity using the ADF test.

# Stationarity Testing

A stationary time series is one whose statistical properties, such as its mean and variance, remain stable over time.
To check whether the time series is stationary, I will use the Augmented Dickey-Fuller (ADF) test.
The ADF is a hypothesis test whose null hypothesis is that the time series is non-stationary (it has a unit root). If we can reject the null hypothesis, we treat the time series as stationary.
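
To make the idea concrete, here is a simplified Dickey-Fuller regression in base R on two simulated series. This is only a sketch under my own assumptions: it leaves out the lagged differences and the trend term that `adf.test()` includes, `df_statistic` is a hypothetical helper, and the critical value is the approximate large-sample 5% value for the constant-only case.

```r
set.seed(42)  # arbitrary seed so the simulated series are reproducible
n <- 500
white_noise <- rnorm(n)          # stationary by construction
random_walk <- cumsum(rnorm(n))  # has a unit root, so non-stationary

# t-statistic on the lagged level in the regression
#   diff(y)_t = a + g * y_{t-1} + e_t
# A strongly negative value is evidence against a unit root.
df_statistic <- function(y) {
  dy    <- diff(y)
  y_lag <- y[-length(y)]
  fit   <- lm(dy ~ y_lag)
  summary(fit)$coefficients["y_lag", "t value"]
}

critical_5pct <- -2.86  # approximate large-sample value, constant-only case

stat_wn <- df_statistic(white_noise)
stat_rw <- df_statistic(random_walk)
cat("white noise:", stat_wn, "reject unit root:", stat_wn < critical_5pct, "\n")
cat("random walk:", stat_rw, "reject unit root:", stat_rw < critical_5pct, "\n")
```

The statistic is compared with Dickey-Fuller critical values rather than the usual t-distribution, which is why a dedicated test function is normally used instead of reading the p-value from `lm()`.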


In [None]:
install.packages("tseries")
library('tseries')

# Performing the Augmented Dickey-Fuller (ADF) test for the Revenue time series
adf1_result <- adf.test(tsrev)

# Printing the results
print(adf1_result) 


I am very surprised at these results. I need to confirm them using another statistical test, the Phillips-Perron (PP) test.

In [None]:
pp_result <- pp.test(tsrev)
print(pp_result)

# Since the p-value for both tests is below 0.05, we reject the null hypothesis of non-stationarity
# (a more negative test statistic is stronger evidence against the null)

# Forecasting with Holt

In [None]:
# Forecasting Revenue for the next 2 years using the Holt Model
install.packages("forecast")
library('forecast')

tsrevtrend = holt(tsrev, h = 24)
summary(tsrevtrend)
plot(tsrevtrend)

Will Revenue continue to have an upward trend? Let's find out by setting damped to TRUE.

In [None]:
# Fit the damped model once and reuse it for the summary and the plot
tsrevdamped24 = holt(tsrev, h = 24, damped = TRUE)
summary(tsrevdamped24)
plot(tsrevdamped24)

Yes, Revenue seems to continue its upward trend, but the damped model shows it may flatten out.
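
To show what damping changes, here is a minimal base R sketch of the Holt recursions with a damping factor `phi`. The helper `holt_sketch`, the fixed smoothing weights and the toy series are my own assumptions for illustration; `holt()` estimates its parameters from the data and also produces prediction intervals, which this sketch does not.

```r
# Hypothetical helper: Holt's linear trend with an optional damping factor phi.
# alpha and beta are fixed smoothing weights here; holt() estimates them.
holt_sketch <- function(y, alpha = 0.8, beta = 0.2, phi = 1, h = 3) {
  level <- y[1]
  trend <- y[2] - y[1]
  for (t in 2:length(y)) {
    prev_level <- level
    level <- alpha * y[t] + (1 - alpha) * (prev_level + phi * trend)
    trend <- beta * (level - prev_level) + (1 - beta) * phi * trend
  }
  # h-step forecast: level plus the (damped) sum of future trend increments.
  level + cumsum(phi^(1:h)) * trend
}

y <- 1:10  # toy series with a perfectly linear trend
fc_linear <- holt_sketch(y, phi = 1)    # trend continues unchanged
fc_damped <- holt_sketch(y, phi = 0.9)  # each step adds less than the last
print(fc_linear)
print(fc_damped)
```

With `phi = 1` every forecast step adds the full trend, while with `phi < 1` each step adds `phi` times the previous increment, which is why the damped forecast flattens out over a long horizon.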

# ARIMA MODEL

In [None]:
# Automatic ARIMA model selection (full search, exact likelihood)
revarima = auto.arima(tsrev,
                     stepwise = FALSE,
                     approximation = FALSE)
summary(revarima)
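
As a rough sketch of the kind of search `auto.arima()` automates, the block below fits a small grid of ARMA orders to a simulated series with base R's `arima()` and keeps the one with the lowest AIC. The simulated series, the grid and the fixed `d = 0` are assumptions for illustration. `auto.arima()` differs in several ways: by default it ranks models by AICc, chooses the differencing order with unit root tests, and can include seasonal terms.

```r
set.seed(1)  # arbitrary seed for a reproducible toy series
toy <- arima.sim(model = list(ar = 0.6), n = 200)

# Candidate (p, q) orders; d is fixed at 0 because the toy series is stationary.
candidates <- expand.grid(p = 0:2, q = 0:2)

# Fit each candidate and record its AIC (NA if the fit fails).
candidates$aic <- apply(candidates, 1, function(row) {
  fit <- tryCatch(
    arima(toy, order = c(row[["p"]], 0, row[["q"]])),
    error = function(e) NULL
  )
  if (is.null(fit)) NA_real_ else AIC(fit)
})

# The candidate with the smallest AIC wins.
best <- candidates[which.min(candidates$aic), ]
print(candidates)
print(best)
```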

In [None]:
# plotting using the chosen model
plot(forecast(revarima, h = 24))

I have successfully used two models to forecast revenue for 24 months. I would now like to compare the models in a way that lets a user with little statistical knowledge see how their forecasts differ over 7 years.

# MODEL COMPARISON

In [None]:
# we bring all our models in one place, forecasting 84 months (7 years) ahead

tsrevtrend = holt(tsrev, h = 84)
tsrevdamped = holt(tsrev, h = 84, damped = TRUE, phi = 0.98)
arimarev = forecast(revarima, h = 84)

In [None]:
library(ggplot2)

autoplot(tsrev) +
    forecast::autolayer(tsrevtrend$mean, series = "Holt Linear Trend") +
    forecast::autolayer(tsrevdamped$mean, series = "Holt Damped Trend") +
    forecast::autolayer(arimarev$mean, series = "ARIMA") +
    xlab("Year") +
    ylab("Revenue") +
    guides(colour = guide_legend(title = "Forecast Method")) +
    theme(legend.position = c(0.8, 0.2)) +
    ggtitle("Revenue") +
    theme(plot.title = element_text(family = "Times", hjust = 0.5,
                                   color = "blue", face = "bold", size = 15))

From this plot, the ARIMA model appears best suited to this time series and tells a story of what revenue expectations may look like 7 years into the future. This is a visual judgment only; it is not backed by an accuracy metric such as MAE or RMSE.
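
One way to put numbers behind a comparison like this is a holdout test: fit each model on the earlier part of the series and score its forecasts against the months that were held back. The block below is a sketch on a simulated series, not the Kaggle data. It uses base R's `HoltWinters()` with `gamma = FALSE` as a stand-in for `holt()`, and a fixed ARIMA(0, 1, 1) as a stand-in for whatever `auto.arima()` selects; both substitutions are my own assumptions.

```r
set.seed(7)  # arbitrary seed; the series below is simulated, not the Kaggle data
toy <- ts(100 + 2 * (1:60) + rnorm(60, sd = 5), frequency = 12, start = c(2015, 1))

# Hold out the final 12 months for testing.
train <- window(toy, end = c(2018, 12))
test  <- window(toy, start = c(2019, 1))

# Holt's linear trend via base R (gamma = FALSE switches off seasonality).
holt_fit <- HoltWinters(train, gamma = FALSE)
holt_fc  <- predict(holt_fit, n.ahead = length(test))

# A fixed ARIMA(0, 1, 1) as a stand-in for an automatically selected model.
arima_fit <- arima(train, order = c(0, 1, 1))
arima_fc  <- predict(arima_fit, n.ahead = length(test))$pred

# Root mean squared error on the held-out months.
rmse <- function(actual, predicted) {
  sqrt(mean((as.numeric(actual) - as.numeric(predicted))^2))
}
scores <- c(Holt = rmse(test, holt_fc), ARIMA = rmse(test, arima_fc))
print(scores)
```

The model with the lower RMSE on the held-out months would be the better forecaster on that split; which one wins depends on the data.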
