<img height=100 width=1000 src="https://cdn.educba.com/academy/wp-content/uploads/2020/05/Time-Series-Analysis.jpg" />

> # **1. Introduction to Time Series**

### Every organization, whether in finance, healthcare, retail, or any other industry, faces significant challenges such as market volatility, technological disruptions, economic recessions, inflation, labor unrest, and shifts in regulatory policies. These factors introduce a level of risk and uncertainty that companies must navigate daily. To manage and mitigate these risks, organizations rely on forecasting methods to predict potential future events and trends, allowing them to make informed decisions and prepare for potential adverse outcomes.

### There are several methods of forecasting, each suited to different types of data and scenarios. Among the most commonly used methods are:

* ### Regression Models: Utilize the relationship between dependent and independent variables to make predictions.
* ### Data Mining Methods: Involve extracting patterns and insights from large datasets to predict future events.
* ### Time Series Analysis: Focuses on analyzing data points collected or recorded at specific time intervals to identify patterns and forecast future trends.

> # **2. What is Time Series?**

### A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Time series data is unique in that it is time-dependent, meaning the order of the data points is crucial. Examples of time series data include stock prices, daily temperatures, monthly sales data, and more. Analyzing time series data involves various techniques like decomposition, smoothing, and forecasting.

### Intervals of the Time Series Data

#### 1. Yearly:- GDP, macroeconomic series
#### 2. Quarterly:- Revenue of a company
#### 3. Monthly:- Sales, expenditure, salary
#### 4. Weekly:- Demand, price of petrol and diesel
#### 5. Daily:- Closing price of a stock, Sensex value, daily transactions of an ATM
#### 6. Hourly:- AAQI


> # **3. What is Not a Time Series?**

### Not all datasets that contain a time component are considered time series data. A dataset is not a time series if the order of the data does not matter or if there is no dependency on time. For example, a dataset containing survey responses from different days is not a time series if the responses themselves are independent of time. Similarly, a collection of images taken at different times but with no temporal relationship is not a time series.

> # **4. Features of Time Series Data**

### Time series data exhibits several unique features:

* ### Trend: The general direction in which the data is moving over time.
* ### Seasonality: Patterns that repeat at regular intervals due to seasonal factors.
* ### Cyclicity: Patterns that occur at irregular intervals, often due to economic or business cycles.
* ### Noise: Random variations in the data that do not have any underlying pattern.
* ### Stationarity: A property where statistical parameters (mean, variance) do not change over time.

<img height=100 width=1000 src="https://images.prismic.io/turing/6596deb5531ac2845a271fbe_Components_of_time_series_analysis_11zon_62bdcb0ac7.webp?auto=format,compress" />

> # **5. Time Series Assumptions**

### Some of the most common assumptions made for time series are based on common sense. But always keep one thing in mind:

> ### Very long-range forecasts do not work well!

* ### Forecasts assume that the market and other conditions will not change in the future.
* ### Where change does occur, it is assumed to be gradual rather than drastic.
* ### Situations like the 2008 recession in the US market will send the forecasts into a tizzy.
* ### Events like demonetization would throw the forecasts into disarray.
* ### Given the data available, we should not try to forecast more than a few periods ahead.

> # **6. Time Series Types**

## Time series data can be categorized into different types based on its properties:

* ### Univariate Time Series: A series with a single variable, such as daily temperature.
* ### Multivariate Time Series: A series with multiple interdependent variables, like weather data containing temperature, humidity, and pressure.
* ### Regular vs. Irregular Time Series: Regular time series data is recorded at consistent time intervals, whereas irregular time series data is recorded at inconsistent intervals.
* ### Stationary vs. Non-Stationary Time Series: A stationary time series has constant mean and variance over time, whereas a non-stationary series does not.

In [None]:
#Libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
import datetime
import plotly.express as px
import plotly.figure_factory as ff
import plotly.offline as offline
import plotly.graph_objs as go
import plotly.io as pio 
from plotly.subplots import make_subplots
from learntools.time_series.style import *
from pathlib import Path

from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from statsmodels.tsa.stattools import adfuller

from sklearn.metrics import mean_absolute_error, confusion_matrix
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler, OrdinalEncoder
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

import keras 
from keras.layers import Dense, Dropout, LSTM 
from keras.callbacks import EarlyStopping 
from keras.models import Sequential 

> # **7. How to import data?**

### First, we import all the datasets needed for this kernel. The required time series column is imported as a datetime column using **parse_dates** parameter and is also selected as index of the dataframe using **index_col** parameter.

In [None]:
df = pd.read_csv("/kaggle/input/time-series-data/daily-min-temperatures.csv", parse_dates=True , index_col="Date")
df

In [None]:
temp = go.Scatter(x=df.index, y=df['Temp'], mode='lines', name='Temperature')

# Layout settings
layout = go.Layout(
    title='Temperature',
    xaxis_title='date',
    yaxis_title='Temp',
    legend=dict(x=0, y=1.0),
    margin=dict(l=80, r=80, t=40, b=40),
    height=500,
    width=1100,
)

# Combine the traces and layout
fig = go.Figure(data=temp, layout=layout)

# Display the plot
pio.show(fig)

### If we want to predict the temperature for the next few months, we will try to look at the past values and try to gauge and extract the pattern. Here we observe a pattern within each year indicating a seasonal effect. Such observations will help us in predicting future values.

### **Note: We have used only one variable here, Temp (the temperature of the past 19 years).**

### Hence this is called univariate time series analysis/forecasting.

In [None]:
df2 = pd.read_csv("/kaggle/input/daily-climate-time-series-data/DailyDelhiClimateTrain.csv", parse_dates=True , index_col="date")
df2

In [None]:
fig = make_subplots(rows=2, cols=2, subplot_titles=("Temperature", "Humidity", "Pressure", "Wind Speed"))

# Temperature plot
fig.add_trace(
    go.Scatter(x=df2.index, y=df2['meantemp'], mode='lines', name='Temperature', line=dict(color='red')),
    row=1, col=1
)

# Humidity plot
fig.add_trace(
    go.Scatter(x=df2.index, y=df2['humidity'], mode='lines', name='Humidity', line=dict(color='blue')),
    row=1, col=2
)

# Pressure plot
fig.add_trace(
    go.Scatter(x=df2.index, y=df2['meanpressure'], mode='lines', name='Pressure', line=dict(color='green')),
    row=2, col=1
)

# Wind Speed plot
fig.add_trace(
    go.Scatter(x=df2.index, y=df2['wind_speed'], mode='lines', name='Wind Speed', line=dict(color='orange')),
    row=2, col=2
)

# Update layout
fig.update_layout(
    title='Weather Data Subplots',
    height=800,
    width=1100,
    showlegend=False  # Set to False if you don't want a shared legend
)

# Display the plot
pio.show(fig)

### **Note: We have used four variables here (meantemp, humidity, wind_speed, meanpressure).**
### Hence this is called multivariate time series analysis/forecasting.



> # **8. Missing Values**

### Missing Data
#### 1. Gaps are a problem in a time series because the data is ordered and most models expect a value at every time step.
#### 2. We cannot simply shift the remaining observations to close a gap, because that would change their timestamps.

### Reasons for missing data :
#### 1) Data is not collected or recorded
#### 2) Data never existed
#### 3) Data corruption

### Mark missing values:
### * NaN is the default missing value marker for reasons of computational speed and convenience.
### * We can easily detect this value with data of different types: floating point, integer, Boolean and general object.
### * However, Python's None can also appear in the data, and we want to treat it as missing too.
### * To make detecting missing values easier across different array dtypes, pandas provides functions, isna() and notna(), which are also methods on Series and DataFrame objects.
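#### A minimal sketch of the detection functions mentioned above, using small made-up Series (the values and dates are assumptions for illustration). It shows that both NaN and None are reported as missing, for numeric as well as object data.

```python
import numpy as np
import pandas as pd

# Numeric series: None is converted to NaN when the Series is built
temps = pd.Series(
    [21.5, np.nan, None, 19.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
missing_mask = temps.isna()
n_missing = int(missing_mask.sum())
n_present = int(temps.notna().sum())
print(missing_mask)
print("missing:", n_missing, "present:", n_present)  # missing: 2 present: 2

# Object series: both None and NaN are flagged as missing
labels = pd.Series(["sunny", None, np.nan], dtype=object)
label_mask = labels.isna()
print(label_mask.tolist())  # [False, True, True]
```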

> # **9. Handling Missing Values**

### **1. Understanding Missing Data in Time Series**
 Nature of Time Series Data: Time series data is sequential and ordered, meaning missing values can disrupt patterns and dependencies between observations.
 Challenges: Unlike other types of data, you cannot simply reorder or remove time series data without potentially losing important temporal information.

### **2. Strategies for Handling Missing Values in Time Series**
1. Forward Fill (Propagation of the Last Observation)
Method: Replace NaN values with the last available non-missing value.
Usage: Suitable when the missing values are assumed to be the same as the previous recorded values.
<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>df['column_name'] = df['column_name'].ffill()</b></p>
</div>

2. Backward Fill (Propagation of the Next Observation)
Method: Replace NaN values with the next available non-missing value.
Usage: Used when the missing values are assumed to be the same as the subsequent recorded values.
<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>df['column_name'] = df['column_name'].bfill()</b></p>
</div>
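A small sketch of forward and backward fill on a made-up daily series (values and dates are assumptions for illustration). It also shows the `limit` argument, which caps how many consecutive gaps are filled, and that backward fill cannot fill a gap at the very end of the series.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan], index=idx)

forward = s.ffill()          # carry the last observation forward
backward = s.bfill()         # pull the next observation backward
limited = s.ffill(limit=1)   # fill at most one consecutive gap

print(pd.DataFrame({"original": s, "ffill": forward,
                    "bfill": backward, "ffill(limit=1)": limited}))
```

The last value stays NaN under backward fill because there is no later observation to copy from; forward fill has the same limitation at the start of a series.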


### **3. Interpolate Missing Values**
Method: Interpolation involves estimating the missing values based on nearby data points. Common methods include linear, polynomial, and spline interpolation.
Usage: Useful when the data is expected to have a smooth trend between missing points.
<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>df['column_name'] = df['column_name'].interpolate(method='linear')  # Linear interpolation</b></p>
</div>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>df['column_name'] = df['column_name'].interpolate(method='time') # Time-based interpolation</b></p>
</div>
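The difference between the two methods only shows up when observations are unevenly spaced. The sketch below uses three made-up points (an assumption for illustration): `method='linear'` treats the rows as equally spaced, while `method='time'` weights by the elapsed time in the DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Irregular spacing: 1 day between the first two stamps, 8 days before the third
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-10"])
s = pd.Series([0.0, np.nan, 9.0], index=idx)

linear = s.interpolate(method="linear")   # midpoint by position: 4.5
by_time = s.interpolate(method="time")    # 1 of 9 days elapsed: 1.0

print(pd.DataFrame({"original": s, "linear": linear, "time": by_time}))
```

On a regularly spaced index both methods give the same result, so the choice matters mainly for irregular time series.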

### **4. Mean, Median, or Mode Imputation**

Method: Replace NaN values with the mean, median, or mode of the entire series or a rolling window.
Usage: Best for data with low variance and no strong trends or seasonality.
<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>df['column_name'] = df['column_name'].fillna(df['column_name'].mean())  # Mean Imputation</b></p>
</div>


<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>df['column_name'] = df['column_name'].fillna(df['column_name'].median())  # Median Imputation</b></p>
</div>


### **5. Using Moving Averages**
Method: Replace NaN values with the average of neighboring values over a fixed window.
Usage: Effective for smoothing out short-term fluctuations and highlighting longer-term trends. 
 <div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>df['column_name'] = df['column_name'].fillna(df['column_name'].rolling(window=3, min_periods=1).mean())</b></p>
</div>
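A short sketch of what the rolling-window fill does, on a made-up series (an assumption for illustration). By default the pandas window is trailing, so a gap is filled from the observations before it; passing `center=True` uses neighbours on both sides instead.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 12.0, np.nan, 16.0, 18.0])

# Trailing window: the gap at position 2 sees [10, 12, NaN] -> mean 11
trailing_mean = s.rolling(window=3, min_periods=1).mean()
filled_trailing = s.fillna(trailing_mean)

# Centered window: the gap at position 2 sees [12, NaN, 16] -> mean 14
centered_mean = s.rolling(window=3, min_periods=1, center=True).mean()
filled_centered = s.fillna(centered_mean)

print(pd.DataFrame({"original": s, "trailing": filled_trailing,
                    "centered": filled_centered}))
```

A centered window uses future values, which is fine for cleaning historical data but should be avoided when the filled series feeds a forecasting model evaluated on those same future points.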
 
 
### **6. Model-Based Imputation (e.g., KNN, Regression, ARIMA)**
Method: Use statistical or machine learning models to predict missing values based on other available data points. Models like K-Nearest Neighbors (KNN), regression, or even time series models like ARIMA can be used.
Usage: Suitable for complex datasets where relationships between variables can help predict missing values.
Implementation: Requires more advanced techniques and libraries such as scikit-learn.
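As one possible illustration of model-based imputation, the sketch below uses `KNNImputer` from scikit-learn on a made-up two-column frame (the column names `temp` and `humidity` and all values are assumptions). The missing temperature is estimated from the rows whose humidity is closest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

idx = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame(
    {
        "temp": [20.0, 21.0, np.nan, 23.0, 24.0, 25.0],
        "humidity": [60.0, 58.0, 56.0, 54.0, 52.0, 50.0],
    },
    index=idx,
)

# Average the 2 nearest rows, measured on the columns that are present
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), index=df.index, columns=df.columns)

print(imputed)
```

Note that KNN treats rows as unordered points, so it ignores the time order unless time-derived features (for example, day of year) are added as extra columns.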


### **7. Dropping Missing Values**
Method: Simply remove rows or columns with missing values.
Usage: Only suitable when missing data is minimal and does not significantly affect the dataset.
<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>df.dropna(inplace=True)</b></p>
</div>

> # **10. Time Series Accuracy and Frequency**

#### Time Series forecast models can both make predictions and provide a confidence interval for those predictions.

#### Forecast Range

### Confidence intervals provide an upper and lower expectation for the real observation.
#### These are useful for assessing the range of real possible outcomes for a prediction and for better understanding the skill of the model.
#### For example, the ARIMA implementation in the statsmodels Python library can be used to fit an ARIMA model. Fitting returns an ARIMAResults object.



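### In older statsmodels releases, the forecast() function of this object returned three values:

#### 1) Forecast: The forecasted value
#### 2) Standard Error: The standard error of the forecast
#### 3) Confidence Interval: The 95% confidence interval for the forecast

#### In current releases, forecast() returns only the forecasted values; the standard error and the confidence interval are available from the object returned by get_forecast().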

#### The error in the forecast is the difference between the actual value and the forecast.
#### Two popular accuracy measures are RMSE and MAPE.


### Forecast Requirements: 
#### A time series model must contain a key time column that contains unique values, input columns, and at least one predictable column.

#### Time series data often requires cleaning, scaling, and even transformation

#### Frequency: Data may be provided at a frequency that is too high to model, or may be unevenly spaced through time, requiring resampling before use in models.

#### Outliers: Data may contain corrupt or extreme outlier values that need to be identified and handled.

#### Frequency: 
*  ####  Frequencies may be too granular or not granular enough to get insights.
*  #### The pandas library in Python provides the capability to increase or decrease the sampling frequency of time series data.


#### Resampling:
* #### Resampling may be required if the data is not available at the same frequency that you want to make predictions.
* #### Resampling may be required to provide additional structure or insight into the learning problem for supervised learning models.


#### Up-sampling:
* #### Increase the frequency of the samples, for example from months to days.
* #### Care may be needed in deciding how the fine-grained observations are calculated using interpolation.


#### **The function, resample() available in the pandas library works on the Series and DataFrame objects.**
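#### resample() can also decrease the sampling frequency (down-sampling). The sketch below aggregates a made-up daily series to weekly values (the data is an assumption for illustration). Unlike up-sampling, no new rows need to be filled; instead an aggregation such as mean or sum has to be chosen.

```python
import numpy as np
import pandas as pd

# 60 daily observations starting on Monday, 2024-01-01
idx = pd.date_range("2024-01-01", periods=60, freq="D")
daily = pd.Series(np.arange(60, dtype=float), index=idx)

# Down-sample to weekly bins (each bin ends on a Sunday)
weekly_mean = daily.resample("W").mean()
weekly_total = daily.resample("W").sum()

print(weekly_mean.head())
print(weekly_total.head())
```

#### The mean suits level-type quantities such as temperature, while the sum suits counts or sales that accumulate over the period.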


In [None]:
shampoo = pd.read_csv("/kaggle/input/time-series-data/shampoo.csv", parse_dates= True, index_col="Month")
shampoo.head(20)

In [None]:
import pandas as pd

# Read the CSV file; the first column is kept as a string index for now
shampoo_df = pd.read_csv(
    '/kaggle/input/time-series-data/shampoo.csv',
    header=0,
    index_col=0
)

# Assumption: the Month column looks like "1-01" (year 1, month 01), as in the
# widely used shampoo sales dataset. Adjust the prefix and format if necessary.
shampoo_df.index = pd.to_datetime('190' + shampoo_df.index.astype(str), format='%Y-%m')

# Resample the data to daily frequency (the new rows are filled with NaN)
upsampled_ts = shampoo_df.resample('D').mean()

# Display the first 20 rows
print(upsampled_ts.head(20))

### Inference:
#### We observe that the resample() function has created the rows by putting NaN values as new values for dates other than day 01.

#### Next we can interpolate the missing values at this new frequency. The function, interpolate() of pandas library is used to interpolate the missing values. We use a linear interpolation which draws a straight line between available data, on the first day of the month and fills in values at the chosen frequency from this line.

In [None]:
interpolated = upsampled_ts.interpolate(method = 'linear')
interpolated.plot()
plt.show()

### Accuracy measures 

#### In practice, we would try several models, such as moving average and exponential smoothing, before selecting the best one.

#### The model selection may depend on the chosen forecasting accuracy measure, such as:

#### Mean Absolute Error: $\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} |Y_t - F_t|$
#### Mean Absolute Percentage Error: $\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{Y_t - F_t}{Y_t} \right|$
#### Mean Squared Error: $\mathrm{MSE} = \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_t)^2$
#### Root Mean Squared Error: $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
#### where $n$ is the number of observations, $Y_t$ is the actual value of Y at time $t$, and $F_t$ is the corresponding forecasted value. RMSE and MAPE are the two most popular accuracy measures in forecasting.
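#### A worked example of the four measures on three made-up observations (the numbers are assumptions chosen so the arithmetic is easy to check by hand). The absolute errors are 10, 20 and 0.

```python
import numpy as np

Y = np.array([100.0, 200.0, 400.0])   # actual values
F = np.array([110.0, 180.0, 400.0])   # forecasted values

errors = Y - F                                   # [-10, 20, 0]
mae = np.mean(np.abs(errors))                    # (10 + 20 + 0) / 3 = 10
mape = np.mean(np.abs(errors / Y)) * 100         # (0.10 + 0.10 + 0) / 3 * 100
mse = np.mean(errors ** 2)                       # (100 + 400 + 0) / 3
rmse = np.sqrt(mse)

print(f"MAE  = {mae:.3f}")    # MAE  = 10.000
print(f"MAPE = {mape:.3f}%")  # MAPE = 6.667%
print(f"MSE  = {mse:.3f}")    # MSE  = 166.667
print(f"RMSE = {rmse:.3f}")   # RMSE = 12.910
```

#### RMSE is larger than MAE here because squaring gives the single larger error (20) more weight. MAPE is undefined when any actual value is zero.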

### **Define functions to calculate MAE and MAPE**

In [None]:
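def MAE(y, yhat):
    """Mean absolute error between observed values y and forecasts yhat."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    if y.size == 0:
        print("Observed values are empty")
        return np.nan
    return round(float(np.mean(np.abs(y - yhat))), 3)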

In [None]:
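def MAPE(y, yhat):
    """Mean absolute percentage error, in percent (undefined if y contains zeros)."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    if y.size == 0:
        print("Observed values are empty")
        return np.nan
    return round(float(np.mean(np.abs((y - yhat) / y))) * 100, 2)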

In [None]:
# Read the CSV file without the 'squeeze' parameter
female_birth_series = pd.read_csv(
    '/kaggle/input/time-series-data/daily-total-female-births.csv', 
    header=0, 
    index_col=0, 
    parse_dates=True
)

# Convert to Series if necessary (if only one column)
if female_birth_series.shape[1] == 1:
    female_birth_series = female_birth_series.squeeze()

# Compute rolling mean with a window of 3
rolling = female_birth_series.rolling(window=3)  # Arbitrarily chosen window size
rolling_mean = rolling.mean()

# Plot original series and rolling mean
female_birth_series.plot(label='Original Series')
rolling_mean.plot(color='red', label='Rolling Mean')
plt.legend()
plt.show()

# Zoomed plot of original and rolling mean dataset
female_birth_series[:100].plot(label='Original Series')
rolling_mean[:100].plot(color='red', label='Rolling Mean')
plt.legend()
plt.show()

## **Why Compare Rolling Mean and Original Series?**

### **Trend Analysis:**
#### Original Series: Can be noisy and difficult to interpret due to short-term fluctuations.
#### Rolling Mean: Helps in identifying the underlying trend by filtering out the noise. It reveals the general direction or pattern of the data over time.

### **Smoothing Effect:**
#### Original Series: Displays the raw data, which might show daily variations and irregularities.
#### Rolling Mean: Smooths out these variations by averaging data points, making it easier to observe trends without the interference of random fluctuations.

### **Pattern Recognition:**
#### Original Series: Shows the immediate changes and patterns in the data.
#### Rolling Mean: Helps in understanding the broader patterns and trends by averaging out short-term volatility. This can be useful for identifying long-term trends or seasonal patterns.

In [None]:
from sklearn.metrics import mean_squared_error

# MAE and MAPE are the helper functions defined earlier in this notebook.
# Importing the sklearn metrics under the same names would shadow them, and
# sklearn's MAPE returns a fraction rather than a percentage.

# The rolling mean stands in for the prediction here
y_df = pd.DataFrame({'Observed': female_birth_series.values, 'Predicted': rolling_mean})
y_df.dropna(axis=0, inplace=True)
print(y_df.tail())

# Compute RMSE
rmse = np.sqrt(mean_squared_error(y_df.Observed, y_df.Predicted))
print("\n\n Accuracy measures ")
print('RMSE: %.3f' % rmse)

# Compute MAE
mae = MAE(y_df.Observed, y_df.Predicted)
print('MAE: %.3f' % mae)

# Compute MAPE (in percent)
mape = MAPE(y_df.Observed, y_df.Predicted)
print('MAPE: %.2f%%' % mape)