# Problem Statement + Goal + Notes

- For code, see [2.4.1-withDarian.ipynb](https://github.com/Brinkley97/book-forecasting_and_control/blob/main/part_1/2-autocorrelation_func_and_spectrum_of_stationary_process/exercises/2.4.1-withDarian.ipynb) and [2.4.2-calculateAndPlotAcovAcor.ipynb](https://github.com/Brinkley97/book-forecasting_and_control/blob/main/part_1/2-autocorrelation_func_and_spectrum_of_stationary_process/exercises/2.4.2-calculateAndPlotAcovAcor.ipynb)
- [Notion notes](https://detraviousjbrinkley.notion.site/CH-2-Autocorrelation-Function-and-Spectrum-of-Stationary-Processes-9597792538c54da3ac4eb45e9ca6790f)

# Problem Statement + Goal

1. Calculate $ c_{0}, c_{1}, c_{2}, c_{3}, r_{1}, r_{2}, r_{3} $ for the series given in Exercise 2.1.
2. Make a graph of $ r_{k} $, k = 0, 1, 2, 3

3. My interpretation : 
    - Of the TS in 2.1, find the estimates of both 
        1. $c_k$ is the estimated autocovariance (ACov) $(\hat{\gamma}_{k})$ at lag k
        2. $r_k$ is the estimated autocorrelation (ACor) $(\hat{\rho}_{k})$ at lag k
        3. where k = 0, 1, 2, 3
    - Graph the autocorrelations at k
        - x-axis represents the k lags
        - y-axis represents the ACor values

# Questions + Further Explore
- [pandas.DataFrame.shift](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html)
    - See example with simpler code in [5-basic_feature_engineering.ipynb](https://github.com/Brinkley97/book-intro_to_tsf_with_python/blob/main/5-basic_feature_engineering.ipynb) by Detravious Jamari Brinkley based on [Introduction to Time Series Forecasting With Python](https://machinelearningmastery.com/introduction-to-time-series-forecasting-with-python/) by Jason Brownlee
    - Attempting to use the `shift()` method (of a DataFrame) in [2.4.3-withPandasShiftMethod.ipynb](https://github.com/Brinkley97/book-forecasting_and_control/blob/main/part_1/2-autocorrelation_func_and_spectrum_of_stationary_process/exercises/2.4.3-withPandasShiftMethod.ipynb)
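
As a rough sketch of how `shift()` could be used for this exercise: shifting a Series by `-k` lines up $z_{t+k}$ next to $z_t$, so the lag-k autocovariance becomes a column-wise product. The function name `autocov_with_shift` and the toy series are assumptions made up for illustration; this is not necessarily how the linked notebook does it.

```python
import pandas as pd


def autocov_with_shift(series: pd.Series, k: int) -> float:
    """Estimate the lag-k autocovariance c_k using Series.shift (hypothetical helper)."""
    n = len(series)
    z_bar = series.mean()
    # shift(-k) moves values up by k rows, so row t holds z_{t+k};
    # the last k rows become NaN.
    lead = series.shift(-k)
    # .sum() skips NaN by default, leaving only the N - k valid products.
    return float(((series - z_bar) * (lead - z_bar)).sum() / n)


# Made-up toy series, NOT the data from Exercise 2.1.
z = pd.Series([4.0, 2.0, 5.0, 1.0, 3.0])
print(z.shift(-1).tolist())        # the last entry is NaN
print(autocov_with_shift(z, 1))    # lag-1 estimate
```

Because the shifted column ends in NaN, the sum automatically runs over only the first N - k rows, while the divisor stays N.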


# Display Datasets
1. Book dataset
2. [LEW dataset](https://www.itl.nist.gov/div898/handbook/eda/section3/eda35c.htm)
3. [UCI Original Diabetes dataset](https://archive.ics.uci.edu/ml/datasets/diabetes)

## Terminology

- **Variance** is the spread ($\sigma^{2}$) of a single variable - single distribution
    - $\sigma^{2} = E[(z_t - \mu)^{2}] = \int_{-\infty}^{\infty} (z - \mu)^{2} \, p(z) \, dz$
    - $ \underline{\textbf{Why (importance)}} $: To determine the confidence of our distribution

- **Sample Variance**: when the probability distribution p(z) of the stochastic process is the same for all times t, the variance $\sigma_{z}^{2}$ can be estimated from the observed series
    - $\hat{\sigma}_{z}^{2} = \frac{1}{N} \sum_{t=1}^{N} (z_t - \bar{z})^{2}$, where $\bar{z}$ is the sample mean
    - ***Think :*** 
        - Why (importance): estimate the spread of our distribution bc (1) it's the same for all times t and (2) the actual variance $\sigma_{z}^{2}$ is generally unknown, so it has to be estimated from the data
    
- **Covariance** is the joint variability of two variables - joint distributions (JD)
    - $\operatorname{cov}[X, Y] = E[(X - \mu_X)(Y - \mu_Y)]$
    - (JD) [Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution on all possible pairs of outputs.] [Wiki](https://en.wikipedia.org/wiki/Joint_probability_distribution)
    - ***Think:*** 
        - V[X] is the special case cov[X, X]
        - Why (importance): So, of these two variables in the same probability space, how do they vary together? If (+), they tend to move in the same direction; if (-), in opposite directions; if near 0, there is no linear relationship.
        
- **AutoCovariance** is the spread at lag k ($\gamma_{k}$) of two variables - joint distributions
    - $\gamma_{k} = \operatorname{cov}[z_t, z_{t+k}] = E[(z_t - \mu)(z_{t+k} - \mu)]$
    - ***Think:***
        - covariance but w/ the same variable and different time steps
        - So, of these two time steps, spaced k intervals apart, in the same probability space, how do they vary/differ?
        - Why (importance) : gather insights (ie : spread or confidence) on the same variable but at different times t and t + k
        
- **Estimated AutoCov** is the spread at lag k ($\hat{\gamma}_{k}$) of two variables - joint distributions
    - $ \hat{\gamma}_{k} $ (gamma hat) $ = c_k = \frac{1}{N} \sum_{t=1}^{N - k} (z_{t} - \bar{z})(z_{t+k} - \bar{z}) $, for $ k = 0, 1, 2, \ldots, K $
    - ***Think:*** 
        - We want to find the estimated confidence in our single variable (z) at two different times t and t + k
            - Estimated bc we may not be able to compute an exact value due to the complexity (of data, time it'll take, computations, etc)
            - If large in magnitude (relative to $c_0$), then values k steps apart are strongly related; if near 0, then there is little linear relationship between them
            - (?) Can the estimate be (-)? If so, does this mean that I'm not confident at all?
                - Yes, can have a (-) Est AutoCov. It means values k steps apart tend to fall on opposite sides of the mean (one above $\bar{z}$, the other below); it says nothing about a lack of confidence.
        - ***Analogy - Points in a basketball game of a single player***
            - Single variable : points (p)
            - $p_t$ : points when t = end of 1st quarter
            - $p_{t+k}$ : points when t + k = end of another quarter
            - Find the estimated confidence/spread at the end of the 1st quarter compared to the end of another quarter
                - Confidence/spread here can be pr() of scoring x points
        - Why (importance): same as AutoCov except here it's an approx bc we may not be able to get the true value
- P 31 in book: to obtain a useful estimate of the ACor func, need 
    - at least 50 observations so $ N \ge 50 $
    - the estimated ACor ($ r_k $) would be calculated for $ k = 0, 1, ..., K $, where $ K \le N/4 $ (K caps the number of lags, not the number of observations, so e.g. `K = N // 4` rather than taking a subset of the data)
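
The guideline above (at least 50 observations, and lags only up to about N/4) can be sketched as a small check. The helper name `choose_max_lag` and its `min_obs` parameter are made up for illustration.

```python
def choose_max_lag(n_obs, min_obs=50):
    """Return (enough_data, K) for the 'N >= 50, K <= N/4' guideline (hypothetical helper)."""
    enough_data = n_obs >= min_obs
    max_lag = n_obs // 4  # largest integer K with K <= N/4
    return enough_data, max_lag


print(choose_max_lag(200))  # (True, 50)
print(choose_max_lag(36))   # (False, 9)
```

A series shorter than 50 points can still be processed (the exercise only asks for lags 0 to 3), but the resulting $r_k$ should then be read as rough estimates.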

## Estimate the ACov @ lag k


\begin{align}
\hat{\gamma}_{k} = \frac{1}{N} \times \sum_{t=1}^{N - k} (z_{t} - \bar{z})(z_{t+k} - \bar{z}),
\space where 
\end{align}

\begin{align}
k = 0, 1, 2,..., K
\end{align}

\begin{align}
\bar{z} = \sum_{t=1}^{N} \frac{z_t}{N}
\end{align}
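
A minimal pure-Python sketch of the estimator above, assuming the series is a plain list of numbers. The function name `autocovariance` and the toy series are made up for illustration; the series is not the data from Exercise 2.1.

```python
from statistics import pvariance


def autocovariance(z, k):
    """Estimate c_k = (1/N) * sum_{t=1}^{N-k} (z_t - z_bar)(z_{t+k} - z_bar)."""
    n = len(z)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < N")
    z_bar = sum(z) / n
    # zip pairs z_t with z_{t+k} and stops after the N - k valid pairs.
    total = sum((a - z_bar) * (b - z_bar) for a, b in zip(z, z[k:]))
    return total / n  # the divisor is N for every lag, not N - k


z = [4, 2, 5, 1, 3]  # made-up toy series
c = [autocovariance(z, k) for k in range(4)]
print(c)                 # c_0, c_1, c_2, c_3
print(pvariance(z))      # c_0 is the sample variance with divisor N
```

For this toy series the deviations from the mean are 1, -1, 2, -2, 0, so by hand $c_0 = 10/5 = 2$ and $c_1 = -7/5 = -1.4$.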



## Estimate AutoCor @ lag k

$ r_k = \hat{\rho}_{k} $ (rho hat) $ = c_k / c_0 $
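
The ratio above is a one-line normalization once the autocovariances are known. The sketch below assumes they are already available as a list `[c_0, c_1, ..., c_K]`; the function name `autocorrelations` and the input values are hypothetical, chosen only for illustration.

```python
def autocorrelations(c):
    """Convert autocovariances [c_0, c_1, ..., c_K] into r_k = c_k / c_0."""
    c0 = c[0]
    if c0 == 0:
        # A constant series has zero variance, so r_k is undefined.
        raise ValueError("c_0 is zero, so r_k is undefined")
    return [ck / c0 for ck in c]


c = [2.0, -1.4, 0.8, -0.4]  # made-up autocovariances c_0..c_3
r = autocorrelations(c)
print(r)  # r_0 is always 1 because c_0 / c_0 = 1
```

Dividing by $c_0$ removes the units of the series, which is why every $r_k$ lies between -1 and 1.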


- [Correlation, AutoCor, and Correlogram](https://www.youtube.com/watch?v=Aft25mI1ffw)
- **Correlation** is the measure of the relationship between two items or vars
    - ie : 
        - (+) #gallons and $ amount spent
        - (-) miles traveled and amount gas left in tank

- **AutoCorrelation** is the measure of the relationship between the same var but at different times (t and t + k)
    - Why : 
        - To determine if data is random. Where do t and t + k stop correlating? That's when $ \exists $ no relationship/ no significance so can disregard
        - To determine the trend. If a trend $ \exists $, then nearby values are highly correlated. The autocorrelation is significantly different from 0 at small lags and declines only slowly towards 0. The series may be said to be non-stationary

- **Correlogram** is the plot of the autocorrelation vs time lag
    - x-axis : lag
    - y-axis : autocorrelation

# Graph AutoCor
- also known as correlogram
- xcorr : Plot the cross correlation between x and y; returns the lag vector
    - A measure of not only the strength of a relation between two stochastic processes but also its direction [Book : TSA by Henrik Madsen | P 107]
        - How does this differ from a histogram?
    - Not symmetric
    - Also known as cross correlogram
- acorr : Plot the autocorrelation of x; returns the lag vector
    - Comparison of $z_t$ and $z_{t+k}$ at each lag, so it follows as k = 0, k = 1, ..., k = K
        - Why $c_0$ every time? Bc $r_k = \hat{\rho}_k = c_k / c_0$, where k in the numerator $c_k$ updates with the lag and the denominator stays at k = 0. $c_0$ is the lag-0 autocovariance (the sample variance), not the first time step t, so $r_0 = 1$
- The autocorrelation function is characterized by correlations that alternate in sign and tend to damp out with increasing lag [P 31]
- [ ] Draw line of significance
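
One possible way to draw the correlogram with a line of significance, as a sketch only. It assumes the common large-sample approximation that, for a purely random series, $r_k$ at $k > 0$ lies within about $\pm 2/\sqrt{N}$ roughly 95% of the time. The function names `significance_bounds` and `plot_correlogram` are made up for illustration, and the plot uses `vlines`/`axhline` rather than `acorr`.

```python
import math


def significance_bounds(n_obs, z_crit=2.0):
    """Approximate white-noise bounds -/+ z_crit / sqrt(N) for r_k, k > 0 (assumption)."""
    bound = z_crit / math.sqrt(n_obs)
    return -bound, bound


def plot_correlogram(r, n_obs, ax=None):
    """Draw r_k against lag k with dashed significance lines."""
    # Imported here so significance_bounds works without matplotlib installed.
    import matplotlib.pyplot as plt

    if ax is None:
        _, ax = plt.subplots()
    lags = list(range(len(r)))
    ax.vlines(lags, 0, r)                        # one vertical bar per lag
    ax.plot(lags, r, "o")                        # marker at the top of each bar
    ax.axhline(0, color="black", linewidth=0.8)  # zero reference line
    for bound in significance_bounds(n_obs):
        ax.axhline(bound, color="gray", linestyle="--")
    ax.set_xlabel("lag k")
    ax.set_ylabel("autocorrelation $r_k$")
    return ax


print(significance_bounds(100))  # bounds for N = 100
```

Calling `plot_correlogram(r, n_obs=N)` with the computed $r_0, \ldots, r_3$ and the series length would then give the graph asked for in the problem statement. Bars that stay inside the dashed lines are not clearly distinguishable from 0.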