<a href="https://colab.research.google.com/github/SriRamK345/time-series-analysis/blob/main/Time_Series_Forecast.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Importing Necessary Libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# 2. Loading Dataset

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Time series/Dataset- Superstore (2015-2018).csv.zip")
df.head()

In [None]:
df["Category"].value_counts()

In [None]:
# Take an explicit copy so later column edits do not act on a view of df
off_sdf = df.loc[df['Category'] == 'Office Supplies'].copy()
off_sdf.head(5)

In [None]:
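# Parse the dates first; min/max on raw strings would compare them alphabetically
order_dates = pd.to_datetime(off_sdf['Order Date'])
print('Starting date:', order_dates.min())
print('Ending date:', order_dates.max())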

We have four years of Office Supplies data.

# 3. Data Processing

In [None]:
# Drop irrelevant variables, keeping only 'Order Date' and 'Sales':
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']

off_sdf = off_sdf.drop(columns=cols)
off_sdf

In [None]:
off_sdf.info()

In [None]:
# Convert the date column of the working frame (off_sdf), not the original df
off_sdf["Order Date"] = pd.to_datetime(off_sdf["Order Date"])

In [None]:
off_sdf.isnull().sum()

# 4. Indexing time series data

In [None]:
off_sdf.set_index("Order Date",inplace=True)

In [None]:
off_sdf

## Visualizing Office Supplies sales time series data

In [None]:
off_sdf.plot(figsize=(15, 6))
plt.show()

The plot above is too busy to interpret. We should resample the time series by month and use the average monthly values instead.
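As a minimal sketch of what monthly resampling does, the example below uses a hypothetical toy series (`toy`, one value per day, not the Superstore data). The `'MS'` alias groups rows by calendar month and labels each group with the first day of that month.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one value per day for three months (2020 is a leap year)
idx = pd.date_range("2020-01-01", "2020-03-31", freq="D")
toy = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="Sales")

# 'MS' = month start: one row per calendar month, holding that month's mean
toy_monthly = toy.resample("MS").mean()
print(toy_monthly)
```

Averaging per month smooths out day-to-day noise, which is why the monthly series is easier to read than the raw one.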

In [None]:
# Ensure 'Order Date' is of datetime type
off_sdf.index = pd.to_datetime(off_sdf.index)

In [None]:
# creating new DataFrame
monthly_OS = pd.DataFrame()

monthly_OS['Sales'] = off_sdf['Sales'].resample('MS').mean()

In [None]:
# Plot monthly average sales
plt.figure(figsize=(15, 6))
plt.plot(monthly_OS.index, monthly_OS.Sales, linewidth=3)
plt.show()

Since all values are positive, you can show this on both sides of the Y axis to emphasize the growth.

In [None]:
x= monthly_OS.index
y1= monthly_OS['Sales'].values

fig, ax = plt.subplots(1, 1, figsize=(16,5), dpi= 120)
plt.fill_between(x, y1=y1, y2=-y1, alpha=0.5, linewidth=2, color='seagreen')
plt.ylim(-800, 800)
plt.title('Sales (Two Side View)', fontsize=16)
plt.hlines(y=0, xmin=np.min(monthly_OS.index), xmax=np.max(monthly_OS.index), linewidth=.5)
plt.show()

In [None]:
cpy_df = off_sdf.copy()
cpy_df.reset_index(inplace=True)
cpy_df.head()

In [None]:
# Separate month and year into their own columns
cpy_df['month'] = cpy_df['Order Date'].dt.month
cpy_df['year'] = cpy_df['Order Date'].dt.year
cpy_df.head()

In [None]:
# Draw Plot
fig, axes = plt.subplots(1, 2, figsize=(20,7), dpi= 80)
sns.boxplot(x='year', y='Sales', data=cpy_df, ax=axes[0])
# Draw the month-wise plot on the second axis; 2917 was a typo for 2017
sns.boxplot(x='month', y='Sales', data=cpy_df.loc[~cpy_df.year.isin([2014, 2017]), :], ax=axes[1])

# Set Title
axes[0].set_title('Year-wise Box Plot\n(The Trend)', fontsize=18);
axes[1].set_title('Month-wise Box Plot\n(The Seasonality)', fontsize=18)
plt.show()

In [None]:
from pylab import rcParams
import statsmodels.api as sm

rcParams['figure.figsize'] = 18, 8

decomposition = sm.tsa.seasonal_decompose(monthly_OS['Sales'], model='additive')
fig = decomposition.plot()
plt.show()

The decomposition plots show that the data has a seasonal component.

# 6. Check Stationarity of the Dataset

In [None]:
# Determine rolling statistics over a 12-month window
moving_avg = monthly_OS.rolling(12).mean()
moving_std= monthly_OS.rolling(12).std()

In [None]:
#Plot rolling statistics:
orig = plt.plot(monthly_OS, color='blue',label='Original')
mean = plt.plot(moving_avg, color='red', label='Rolling Mean')
std = plt.plot(moving_std, color='black', label = 'Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)

##  Conduct the Dickey-Fuller test:

In [None]:
from statsmodels.tsa.stattools import adfuller
print ('Results of Dickey-Fuller Test:')
# adfuller expects a one-dimensional series, so pass the 'Sales' column
dftest = adfuller(monthly_OS['Sales'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])

for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print (dfoutput)

Here’s how to interpret the most important values in the output:

Test statistic: -1.630238

P-value: 0.467366

Since the p-value is not less than .05, we fail to reject the null hypothesis that the series has a unit root.

This means we treat the time series as non-stationary.

In other words, it has some time-dependent structure: its statistical properties, such as the mean or variance, are not constant over time.

# 7. Make a Time Series Stationary

There are several methods to make a time series stationary:

1. Take a log transform
2. Moving average
3. Exponentially weighted moving average
4. Difference
5. Decomposition

Some might work well in this case and others might not. But the idea is to get the hang of all the methods and not focus on just the problem at hand.
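The first four methods in the list can be sketched with pandas alone. The series below is hypothetical (`sales`, 5% growth per month, not the Superstore data), and the window and half-life of 12 are assumptions chosen to match a yearly cycle.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series growing 5% per month
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
sales = pd.Series(100 * 1.05 ** np.arange(48), index=idx, name="Sales")

# 1. Log transform: turns exponential growth into a straight line
log_sales = np.log(sales)

# 2. Moving average: subtract a 12-month rolling mean to remove the trend
ma_detrended = (log_sales - log_sales.rolling(12).mean()).dropna()

# 3. Exponentially weighted moving average: same idea, recent months weigh more
ewm_detrended = log_sales - log_sales.ewm(halflife=12).mean()

# 4. Difference: month-over-month change of the logged series
diffed = log_sales.diff().dropna()

print(diffed.head())
```

For this toy series the differenced log values are constant (equal to log 1.05), which illustrates how differencing removes a steady growth trend. Real data would then be re-checked with the Dickey-Fuller test.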

In [None]:
import pandas as pd

# Sample DataFrame
data = {'Category': ['A', 'B', 'A', 'C', 'B'],
        'Product': ['X', 'Y', 'X', 'Z', 'Y'],
        'Sales': [100, 200, 150, 50, 300]}
df = pd.DataFrame(data)

# Create a pivot table summarizing sales by category and product
pivot_table = df.pivot_table(index='Category', columns='Product', values='Sales', aggfunc='sum')

print(pivot_table)

In [None]:
# Rename column 'Category' to 'cat'

df.rename(columns={'Category': 'cat'}, inplace=True)

In [None]:
df

In [None]:
from numpy import random

z = random.randint(100, size = (5,5))
z

In [None]:
arr = np.array([1,2,3,4,3,4,4,5])

np.where(arr%2 == 0)

In [None]:
np.where(arr == 1)