## Purpose
This notebook helps with comparing distributions of data, focusing on the Kolmogorov-Smirnov (KS) test. The KS test is non-parametric, which makes it particularly useful for data whose underlying distribution is not known.

### References
[1] Zeimbekakis, A., Schifano, E. D., & Yan, J. (2024). On Misuses of the Kolmogorov–Smirnov Test for One-Sample Goodness-of-Fit. The American Statistician, 78(4), 481-487.
https://www.tandfonline.com/doi/full/10.1080/00031305.2024.2356095 \
[2] Büning, H. (2002). Robustness and power of modified Lepage, Kolmogorov-Smirnov and Cramér-von Mises two-sample tests. Journal of Applied Statistics, 29(6), 907-924.
https://www.tandfonline.com/doi/abs/10.1080/02664760220136212

### Tutorials
1. How the KS test looks for differences between two distributions: \
https://towardsdatascience.com/understanding-kolmogorov-smirnov-ks-tests-for-data-drift-on-profiled-data-5c8317796f78/
2. Dealing with auto-correlated data (MD data often is): \
   https://engineering.atspotify.com/2023/09/how-to-accurately-test-significance-with-difference-in-difference-models
3. Potential pitfalls of the K-S test! \
   https://asaip.psu.edu/articles/beware-the-kolmogorov-smirnov-test/#:~:text=We%20recommend%20that%20the%20distribution%20of%20the,KS%20test%20in%20two%20or%20more%20dimensions.
5. Why Anderson-Darling may be better, and a concise explanation of K-S's strengths and limitations: \
https://asaip.psu.edu/articles/beware-the-kolmogorov-smirnov-test/#:~:text=We%20recommend%20that%20the%20distribution%20of%20the,KS%20test%20in%20two%20or%20more%20dimensions.
6. Explaining reference [1]: \
https://www.reddit.com/r/statistics/comments/7j273q/still_dont_understand_why_the_pvalue_distribution/
7. Explaining uniform distribution of p-values: \
https://www.reddit.com/r/statistics/comments/7j273q/still_dont_understand_why_the_pvalue_distribution/ 

## Common pitfalls for the Kolmogorov-Smirnov test
* Data must be continuous [1] (consider the chi-square test if the data are categorical).
    - Rounding continuous data can invalidate the test.
    - "Ties", where two data points have the same value, indicate that the underlying data generator does not produce continuous data.
* When comparing your data to a reference distribution (e.g. to see if your data is normal), the reference distribution's parameters should not be estimated from the same dataset (e.g. from its mean and standard deviation); they should be specified independently of the data [1].
    - See the tutorial explaining reference [1], or the AI summary in the parametric bootstrap cell below.
* Data should be independent. This is a major problem for MD data, which usually comes from a time series and is autocorrelated.
    - Autocorrelation means that the data at time x-i has some predictive value for the data at time x. For a protein moving in time this is true: as i goes to zero, the position at time x-i approaches the position at time x.
    - Since the test assumes independence, autocorrelation contributes to an over-estimation of significance, along with the overpowering from excessive sampling.
    - For more on methods to deal with this, see tutorial 2.
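
The independence pitfall can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not an analysis of MD trajectories: the AR(1) process, the helper names `ar1_series` and `false_positive_rate`, and all parameter values are assumptions chosen for illustration. Both samples in every trial come from the same process, so any rejection is a false positive.

```python
import numpy as np
from scipy import stats


def ar1_series(n, phi, rng):
    """Stationary AR(1) series with unit marginal variance (illustrative helper)."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    noise = rng.standard_normal(n) * np.sqrt(1.0 - phi**2)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x


def false_positive_rate(phi, n=500, n_trials=200, alpha=0.05, seed=0):
    """Fraction of two-sample KS tests that reject although both samples share one process."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        a = ar1_series(n, phi, rng)
        b = ar1_series(n, phi, rng)
        if stats.ks_2samp(a, b).pvalue < alpha:
            rejections += 1
    return rejections / n_trials


iid_rate = false_positive_rate(phi=0.0)  # independent samples
ar_rate = false_positive_rate(phi=0.9)   # strongly autocorrelated samples
print(f"false-positive rate, iid: {iid_rate:.2f}, AR(1) with phi=0.9: {ar_rate:.2f}")
```

With independent data the rejection rate should stay near the nominal level, while the autocorrelated series are rejected far more often even though nothing differs between them.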

## Special considerations
* Consider using the Anderson-Darling (A-D) test if the important differences between the distributions are in the tails. It may even be advisable to always use the A-D test. (https://asaip.psu.edu/articles/beware-the-kolmogorov-smirnov-test/#:~:text=We%20recommend%20that%20the%20distribution%20of%20the,KS%20test%20in%20two%20or%20more%20dimensions.)
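
As a rough sketch of how the two tests can be run side by side, the snippet below compares a normal sample with a heavier-tailed Student-t sample. The synthetic data and sample sizes are assumptions for illustration; `scipy.stats.ks_2samp` and `scipy.stats.anderson_ksamp` are the SciPy two-sample (k-sample) implementations.

```python
import warnings

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.standard_normal(400)
# Student-t with 3 degrees of freedom: similar center, heavier tails
b = rng.standard_t(df=3, size=400)

ks = stats.ks_2samp(a, b)
with warnings.catch_warnings():
    # anderson_ksamp may warn (e.g. when its approximate p-value is capped);
    # silenced here to keep the demo output clean
    warnings.simplefilter("ignore")
    ad = stats.anderson_ksamp([a, b])

print(f"K-S statistic = {ks.statistic:.3f}, p = {ks.pvalue:.3g}")
# critical_values are ordered by significance level; index 2 is the 5% level
print(f"A-D statistic = {ad.statistic:.3f}, 5% critical value = {ad.critical_values[2]:.3f}")
```

The A-D null hypothesis is rejected at the 5% level when the statistic exceeds the printed critical value. Which test flags a given tail difference depends on the data, so this is a way to compare them rather than a demonstration that one always wins.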

# So what do we do?

According to the blog post in tutorial [3], bootstrapping IS advisable. Nonetheless, opinions on online forums are mixed, largely because it is unclear what the sample size should be. There is consensus that confidence intervals are more important than using a p-value for hypothesis testing alone. At all times, it is important to keep in mind that statistics are a way to characterize the data rather than to prove a claim.

It is unclear whether bootstrapping would help resolve the autocorrelation issue. My thought right now is that the raw data may need to be analyzed for an autocorrelation period, and that the sample size should be based on drawing roughly one sample per period. Arguably, this would result in many instances where a single period is sampled multiple times, so some autocorrelation would remain, but it is the only reason I can conjure to justify a given sample size.
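
One way to make the "one sample per period" idea concrete is to estimate a decorrelation lag from the sample autocorrelation function and thin the series by that stride. The sketch below is only an illustration under assumptions: the 1/e threshold is one common convention among several, the helper names are made up here, and the AR(1) series stands in for a real MD observable.

```python
import numpy as np


def autocorrelation(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[: x.size - k], x[k:]) / var for k in range(max_lag + 1)])


def decorrelation_lag(x, max_lag=200, threshold=np.exp(-1)):
    """First lag at which the autocorrelation drops below the threshold (1/e by default)."""
    acf = autocorrelation(x, max_lag)
    below = np.nonzero(acf < threshold)[0]
    return int(below[0]) if below.size else max_lag


# Synthetic AR(1) series standing in for an MD observable (assumption)
rng = np.random.default_rng(2)
phi, n = 0.9, 20000
series = np.empty(n)
series[0] = rng.standard_normal()
for t in range(1, n):
    series[t] = phi * series[t - 1] + rng.standard_normal()

lag = decorrelation_lag(series)
thinned = series[::lag]  # roughly one sample per decorrelation period
print(f"decorrelation lag = {lag}, samples kept = {thinned.size} of {series.size}")
```

Thinning by this lag reduces the autocorrelation but does not remove it entirely, which matches the caveat above.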

It may also be possible to do a parameter sweep over sample size. But there's the question of what is the "right" answer, even if we did this -- we'd just be looking at a series of plots and picking whatever we like best. 

There seems to be no reason not to use either A-D or Cramér-von Mises (C-M) in place of K-S. C-M may be preferable because it integrates the squared difference between the distribution functions across the whole distribution rather than relying on the maximum difference alone. Any of these can be reweighted to emphasize the upper or lower tails of the distributions.
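
For reference, the one-sample forms of the three statistics can be written side by side. The notation is introduced here and is not from the notes above: $F_n$ is the empirical CDF of the $n$ observations and $F$ is the reference CDF.

$$
D_n = \sup_x \left| F_n(x) - F(x) \right|,
\qquad
W^2 = n \int \left( F_n(x) - F(x) \right)^2 \, dF(x),
\qquad
A^2 = n \int \frac{\left( F_n(x) - F(x) \right)^2}{F(x)\,\left( 1 - F(x) \right)} \, dF(x)
$$

K-S ($D_n$) uses only the largest gap, C-M ($W^2$) accumulates the squared gap over the whole range, and A-D ($A^2$) is the C-M integral with the weight $1 / \left( F(1-F) \right)$, which grows near both tails.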

It may be worthwhile to bootstrap on Z and look at the distribution of p-values. It should be flat (uniform), i.e. linear when plotted cumulatively. Then bootstrap for X and Y, where it should be non-linear (non-uniform).
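
The expected flat shape can be checked on synthetic data where the null hypothesis is known to be true. This sketch does not use the notebook's X, Y, or Z; the normal samples, sample size, and trial count are assumptions. It also shows what happens to the p-values when the reference parameters are estimated from the same sample, the misuse discussed in reference [1].

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_trials = 100, 1000
p_known = np.empty(n_trials)
p_estimated = np.empty(n_trials)

for i in range(n_trials):
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    # Null is true and the parameters are specified in advance
    p_known[i] = stats.kstest(x, "norm", args=(0.0, 1.0)).pvalue
    # Null is true, but the parameters are estimated from the same sample
    p_estimated[i] = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))).pvalue

for label, p in [("known parameters", p_known), ("estimated parameters", p_estimated)]:
    frac_rejected = np.mean(p < 0.05)
    print(f"{label}: mean p = {p.mean():.2f}, fraction with p < 0.05 = {frac_rejected:.3f}")
```

With known parameters the p-values should look uniform (mean near 0.5, about 5% below 0.05). With estimated parameters they pile up toward 1, which is the conservative behavior described in the section below.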

## Parametric Bootstrap for Kolmogorov-Smirnov Test in Python
---
The Kolmogorov-Smirnov (KS) test is a non-parametric test used to assess if a sample comes from a specific distribution (one-sample KS test) or if two samples come from the same distribution (two-sample KS test). While it's non-parametric, you can use the parametric bootstrap to approximate the null distribution of the KS test statistic when working with a parametric family of distributions. 

### Why use Parametric Bootstrap for KS Test?

The standard KS test assumes that the parameters of the hypothesized distribution are known. However, in practice, these parameters are often estimated from the sample data. When estimated parameters are used in the KS test, the test statistic's null distribution can change, leading to a conservative test (meaning you're less likely to reject the null hypothesis than you should be). 
The parametric bootstrap helps address this by providing a more accurate approximation of the null distribution of the KS test statistic when parameters are estimated. 

### Steps for Parametric Bootstrap of KS Test in Python

1. Fit the parametric distribution: Choose a parametric distribution that you believe the data follows, estimate its parameters from your data, and compute the KS statistic of the data against the fitted distribution.
2. Generate bootstrap samples: Repeatedly (e.g., 1000 times) draw random samples of the same size as your original data from the fitted distribution (using the estimated parameters).
3. Calculate the KS test statistic for each bootstrap sample: For each bootstrap sample, re-estimate the parameters from that sample, perform a KS test against the refitted distribution, and record the KS test statistic (D). Refitting is what lets the bootstrap mimic the effect of parameter estimation; testing every sample against the original estimates would only reproduce the known-parameter null distribution.
4. Approximate the null distribution: The collection of KS test statistics from the bootstrap samples forms an empirical distribution that approximates the null distribution of the KS test statistic when parameters are estimated.
5. Determine the p-value: Compare your original data's KS test statistic to the bootstrap distribution of KS statistics (the fraction of bootstrap statistics at least as large as the observed one) to get a more accurate p-value for your hypothesis test.

### Python Implementation

You can implement this using libraries like NumPy and SciPy in Python. 

* Use NumPy to generate random data and calculate statistics.
* SciPy's scipy.stats.kstest function can be used to perform the KS test.
* You'll need to define a function that performs the parametric bootstrap steps, including fitting the distribution, generating bootstrap samples, and calculating the KS statistic for each sample.

### Example using scipy.stats.kstest

The scipy.stats.kstest function can be used for the KS test in Python. It takes the data, the hypothesized distribution's CDF (either a string name or a callable function), and optional parameters. See the SciPy documentation for details.
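
A minimal sketch of the two calling styles, on synthetic data with reference parameters fixed in advance (the data and the values of `mu0` and `sigma0` are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.normal(loc=2.0, scale=0.5, size=200)

# Reference parameters fixed in advance (not estimated from `data`)
mu0, sigma0 = 2.0, 0.5

# 1) Distribution given by name, with its parameters passed through `args`
res_name = stats.kstest(data, "norm", args=(mu0, sigma0))
# 2) Distribution given as a callable CDF
res_callable = stats.kstest(data, stats.norm(loc=mu0, scale=sigma0).cdf)

print(f"by name:     D = {res_name.statistic:.4f}, p = {res_name.pvalue:.3f}")
print(f"by callable: D = {res_callable.statistic:.4f}, p = {res_callable.pvalue:.3f}")
```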
You can implement the parametric bootstrap by:

1. Estimating parameters: Use methods like Maximum Likelihood Estimation (MLE) or Method of Moments to estimate the parameters of the chosen distribution from your data.
2. Creating a custom CDF function: Define a Python function that calculates the cumulative distribution function (CDF) of your chosen parametric distribution for a given set of parameters.
3. Generating bootstrap samples and KS statistics: Inside a loop, generate random samples from the fitted distribution using the estimated parameters. For each sample, re-estimate the parameters from that sample, call scipy.stats.kstest with the CDF evaluated at the re-estimated parameters, and store the resulting KS statistic.
4. Analyzing bootstrap results: After the loop, analyze the distribution of the stored KS statistics to get a bootstrap-based p-value for your test.
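
The steps above can be sketched as follows. This is one possible implementation under assumptions, not a reference one: the function name `ks_parametric_bootstrap` is made up here, fitting uses the SciPy distribution's `fit` method (maximum likelihood), and the `+1` in the p-value is a common convention that keeps it from being exactly zero. Each bootstrap sample is refitted before its KS statistic is computed.

```python
import numpy as np
from scipy import stats


def ks_parametric_bootstrap(data, dist=stats.norm, n_boot=1000, seed=None):
    """Parametric-bootstrap KS goodness-of-fit test with estimated parameters.

    Returns the observed KS statistic and the bootstrap p-value.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.size

    # Fit the distribution to the data and compute the observed statistic
    params = dist.fit(data)
    d_obs = stats.kstest(data, dist.cdf, args=params).statistic

    # Simulate from the fitted model, refit each sample, record its statistic
    d_boot = np.empty(n_boot)
    for b in range(n_boot):
        sample = dist.rvs(*params, size=n, random_state=rng)
        boot_params = dist.fit(sample)
        d_boot[b] = stats.kstest(sample, dist.cdf, args=boot_params).statistic

    # Fraction of bootstrap statistics at least as large as the observed one
    p_value = (1 + np.sum(d_boot >= d_obs)) / (n_boot + 1)
    return d_obs, p_value


# Usage on synthetic data: one sample that is normal, one that is clearly not
rng = np.random.default_rng(5)
normal_data = rng.normal(10.0, 2.0, size=200)
skewed_data = rng.exponential(2.0, size=200)

d_norm, p_norm = ks_parametric_bootstrap(normal_data, n_boot=500, seed=0)
d_exp, p_exp = ks_parametric_bootstrap(skewed_data, n_boot=500, seed=0)
print(f"normal data:      D = {d_norm:.3f}, bootstrap p = {p_norm:.3f}")
print(f"exponential data: D = {d_exp:.3f}, bootstrap p = {p_exp:.3f}")
```

Distributions whose `fit` method uses a numerical optimizer will make the loop much slower than the normal case, where the estimates are available in closed form.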

Important Notes:

* Choosing the appropriate parametric distribution is crucial for the parametric bootstrap to provide accurate results.
* The number of bootstrap samples (B) should be sufficiently large to accurately approximate the null distribution (e.g., B ≥ 1000).
* If your underlying data doesn't closely follow the assumed parametric distribution, the parametric bootstrap results might not be reliable.