
# Kolmogorov-Smirnov test (KS-test)


### Table of Contents
- Setup
- Introduction
    - Kolmogorov's distribution
    - Kolmogorov's theorem
    - Kolmogorov-Smirnov statistic
    - ECDF
    - Definition
- Motivation
- Data loading
- Application
    - Play with distributions
- Exercises
    - Build  ecdf from sample
    - Build two sample KS-test
    - Compare one sample KS-test and Student's t-test
    - Example for two sample ks-test and Student's t-test
- Conclusion
- References / Acknowledgements

---
## Setup

In [340]:
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np


import ipywidgets as widgets
from ipywidgets import interact, interact_manual, fixed

from scipy.stats import ks_2samp, kstest, norm

from sklearn.preprocessing import minmax_scale

from statsmodels.distributions.empirical_distribution import ECDF

%matplotlib inline

---
## Introduction

In this notebook we consider the Kolmogorov-Smirnov test and its properties.

### Kolmogorov's distribution  
CDF of Kolmogorov's distribution [[1](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test)]:
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/9d065517d5558ecb7a63ec4a6fd589eef06e6552)
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/fd9357312760787714f07f6fb53e04a0728d63d7)
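
As a rough illustration, the CDF of Kolmogorov's distribution can be evaluated numerically from its standard series form, Pr(K ≤ x) = 1 − 2 Σ (−1)^(k−1) exp(−2k²x²). The sketch below uses only the standard library; the function name `kolmogorov_cdf` and its `tol` and `max_terms` parameters are illustrative choices, not part of the notebook.

```python
import math

def kolmogorov_cdf(x, tol=1e-16, max_terms=100_000):
    """Pr(K <= x) from the series 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 x^2)."""
    if x <= 0:
        return 0.0
    total = 0.0
    for k in range(1, max_terms + 1):
        term = math.exp(-2.0 * k * k * x * x)
        # Terms alternate in sign: positive for odd k, negative for even k.
        total += term if k % 2 == 1 else -term
        if term < tol:
            break
    # Clamp tiny excursions outside [0, 1] caused by floating-point cancellation.
    return min(1.0, max(0.0, 1.0 - 2.0 * total))

# 1.358 is the commonly tabulated 5% critical value of K,
# so the CDF evaluated there should be close to 0.95.
print(round(kolmogorov_cdf(1.358), 3))
```

The early exit on `tol` matters for small x, where many terms are needed before the series settles.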

---
### Kolmogorov's theorem  
Let X1, ..., Xn, ... be an infinite sample from a continuous distribution F(x), and let Fn(x) be the empirical CDF (ECDF) built from the first n elements of the sample.
Then
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/9da8bb11ae439ae2d35c150ffc81c19d7b4e9b78)
as n → ∞,
where K is a random variable that follows the Kolmogorov distribution [[1](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test)].
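
The theorem can be checked empirically with a small simulation. The sketch below assumes F is Uniform(0, 1), so F(x) = x, and estimates how often √n · sup|Fn − F| falls below 1.358, the tabulated 95% quantile of K. The helper names, the sample size `n=200` and the number of repetitions are arbitrary illustrative choices.

```python
import math
import random

def ks_distance_uniform(sample):
    """sup_x |F_n(x) - F(x)| for F = Uniform(0, 1), where F(x) = x on [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained at a jump of F_n: either just after it or just before it.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

def simulate_scaled_statistic(n=200, reps=2000, seed=0):
    """Draw `reps` uniform samples of size n and return sqrt(n) * D_n for each."""
    rng = random.Random(seed)
    return [
        math.sqrt(n) * ks_distance_uniform([rng.random() for _ in range(n)])
        for _ in range(reps)
    ]

values = simulate_scaled_statistic()
share_below = sum(v <= 1.358 for v in values) / len(values)
print(f"share of sqrt(n) * D_n below 1.358: {share_below:.3f}")
```

If the limiting distribution is indeed Kolmogorov's, the printed share should be near 0.95, up to simulation noise and a small finite-n effect.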

---
### Kolmogorov-Smirnov statistic
**The Kolmogorov–Smirnov test** (KS-test) is a *nonparametric* test of the equality of one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (*one-sample KS-test*), or to compare two samples (*two-sample KS-test*) [[1](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test)].

---
### ECDF
**The Kolmogorov-Smirnov test** is based on the empirical distribution function (**ECDF**). The ECDF Fn for n independent and identically distributed ordered observations Xi is defined as
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/aacca85bf28da15cbba66ea7c456cf7ad9784047)
This is a step function that increases by 1/n at the value of each ordered data point [[2](https://en.wikipedia.org/wiki/Empirical_distribution_function)].
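
A toy example may make the step behaviour concrete. The sketch below applies the definition literally, counting the observations that do not exceed t and dividing by n; the helper name `ecdf` and the four-point sample are made up for illustration.

```python
def ecdf(sample):
    """Return F_n as a function: F_n(t) = (number of observations <= t) / n."""
    n = len(sample)

    def f_n(t):
        return sum(1 for x in sample if x <= t) / n

    return f_n

f_n = ecdf([3.0, 1.0, 2.0, 2.0])
for t in (0.5, 1.0, 2.0, 2.5, 3.0):
    # F_n stays flat between observations and jumps by 1/n at each one;
    # the repeated value 2.0 produces a jump of 2/n.
    print(t, f_n(t))
```

This literal form costs O(n) per evaluation; sorting the sample once is the usual way to speed it up.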

---
### Definition
*The Kolmogorov-Smirnov test* is defined by:  
**H0** : The data follow a specified distribution  
**H1** : The data do not follow the specified distribution
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/2a8f25b438394d87d3e53a003cc3cc751d418b9c)
where *F(x)* is the theoretical cumulative distribution function of the distribution being tested, which must be continuous.  
**H0** is rejected if D is greater than the critical value obtained from a table.
![](https://ars.els-cdn.com/content/image/3-s2.0-B9780128054277000014-f01-42-9780128054277.jpg)
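
The decision rule can be sketched on a sample small enough to check by hand. The code below computes D for three points against the Uniform(0, 1) CDF and compares it with the large-sample approximation 1.358/√n of the 5% critical value. The names `ks_statistic` and `uniform_cdf` are illustrative. For a sample this small the approximation is crude, and a table of exact critical values should be used in practice.

```python
import math

def ks_statistic(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)| for a continuous reference CDF `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    # F_n jumps at each observation, so the supremum is reached either just
    # after a jump (i/n - F(x_i)) or just before it (F(x_i) - (i-1)/n).
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

def uniform_cdf(x):
    """CDF of Uniform(0, 1)."""
    return min(1.0, max(0.0, x))

sample = [0.1, 0.4, 0.7]
d = ks_statistic(sample, uniform_cdf)
# Large-n approximation of the critical value at alpha = 0.05 (an assumption here).
critical = 1.358 / math.sqrt(len(sample))
print(f"D = {d:.3f}, critical = {critical:.3f}, reject H0: {d > critical}")
```

Here the largest gap is at the last point, where Fn reaches 1 while F(0.7) = 0.7, so D is about 0.3 and H0 is not rejected.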

---
## Motivation
  
  
**The Kolmogorov–Smirnov test** is similar in application to Student's t-test, but the KS-test is somewhat more complex and can detect patterns that Student's t-test cannot [[3](https://towardsdatascience.com/kolmogorov-smirnov-test-84c92fb4158d)].  
Here is an example that shows the difference between Student’s t-test and the KS-test.
![](https://miro.medium.com/max/934/0*_zFg_-LPurj7FbPL.)
Because the sample means and standard deviations are highly similar, Student’s t-test gives a very high p-value. The KS-test, in contrast, is sensitive to differences in the shape of the distributions: here the *red* distribution is slightly binomial, which the KS-test detects. In other words:
- Student's t-test gives a p-value of **79.3%**, so it finds no evidence that the two samples come from different distributions.
- The KS-test gives a p-value of **1.6%**, so the hypothesis that the two samples come from the same distribution is rejected at the 5% level.

---
## Data loading

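Load the Boston dataset [[manual](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html)]. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell below requires an older scikit-learn version.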

In [303]:
from sklearn.datasets import load_boston

data = load_boston()

In [304]:
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df.head()

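| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |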


---
## Application

### Play with distributions

#### Sampling

Experiment with sampling different columns of the Boston dataset.

In [312]:
def create_cdf(norm_x):
    ecdf = ECDF(norm_x)
    # ECDF stores -inf as its first x (with y = 0); drop it so the curve can be plotted.
    # The index holds the sample values, the values hold the cumulative probabilities.
    cdf_df = pd.Series(ecdf.y[1:], index=ecdf.x[1:])
    return cdf_df

In [348]:
@interact(size=widgets.IntSlider(min=1, max=len(df.index), step=1, value=100))
def plot_dist(size, column_1=df.columns, column_2=df.columns):

    # load data
    x_1 = df[column_1][:size]
    x_2 = df[column_2][:size]

    # normalize data to [0, 1]
    norm_x_1 = minmax_scale(x_1)
    norm_x_2 = minmax_scale(x_2)

    # grid on [0, 1] for the reference normal distribution
    norm_dist = np.linspace(0, 1, size)
    # reference normal: mean 0.5 and std 0.5/3, so that +-3 sigma spans [0, 1]
    loc, scale = 0.5, 0.5 / 3

    # create cdf
    cdf_norm_x_1 = create_cdf(norm_x_1)
    cdf_norm_x_2 = create_cdf(norm_x_2)
    cdf_norm_dist = pd.Series(norm.cdf(norm_dist, loc=loc, scale=scale), index=norm_dist)

    # calculate ks-tests on the samples themselves (not on their CDFs),
    # against the same reference normal that is plotted
    ks_stat1, _ = kstest(norm_x_1, 'norm', args=(loc, scale))
    ks_stat2, _ = kstest(norm_x_2, 'norm', args=(loc, scale))
    ks_stat1_2, _ = ks_2samp(norm_x_1, norm_x_2)

    # visualization of results
    f = plt.figure(figsize=(20, 20))
    gs = f.add_gridspec(3, 1)

    ax = f.add_subplot(gs[0, 0])
    ax.set_title(f"CDF {column_1} and norm\n"+\
                    f"One sample KS-test={str(round(ks_stat1,2))}", 
                 fontsize='large', fontweight='bold')
    cdf_norm_x_1.plot.line(ax=ax)
    cdf_norm_dist.plot.line(ax=ax)
    ax.legend([column_1, "norm"], fontsize=20)

    ax = f.add_subplot(gs[1, 0])
    ax.set_title(f"CDF {column_2} and norm\n"+\
                    f"One sample KS-test={str(round(ks_stat2,2))}", 
                 fontsize='large', fontweight='bold')
    cdf_norm_x_2.plot.line(ax=ax)
    cdf_norm_dist.plot.line(ax=ax)
    ax.legend([column_2, "norm"], fontsize=20)

    ax = f.add_subplot(gs[2, 0])
    ax.set_title(f"CDF {column_1} and {column_2}\n"+\
                    f"Two sample KS-test={str(round(ks_stat1_2,2))}", 
                 fontsize='large', fontweight='bold')
    cdf_norm_x_1.plot.line(ax=ax)
    cdf_norm_x_2.plot.line(ax=ax)
    ax.legend([column_1, column_2], fontsize=20)

    f.tight_layout()


1. What does the *computed value* of the **KS-test** statistic mean?
2. What is the *range* of values for **KS-test**?
3. What does it mean if the **KS-test** value is *0*?

---
## Exercises

### Build  ecdf from sample
Build an empirical distribution function using any column from the Boston dataset.

In [306]:
# TODO: build ecdf from norm_x
def build_ecdf(norm_x):
    
    pass

In [307]:
@interact(size=widgets.IntSlider(min=1, max=len(df.index), step=1, value=100))
def plot_ecdf(size, column_name=df.columns):
    
    #load data
    x = df[column_name][:size]
    
    # normalize data
    norm_x = minmax_scale(x)
    
    # build ecdf 
    ecdf_norm_x = build_ecdf(norm_x)
    
    # visualization of results
    f = plt.figure(figsize=(20, 20))
    gs = f.add_gridspec(1, 1)
    
    ax = f.add_subplot(gs[0, 0])    
    ax.set_title(f"ECDF {column_name}", fontsize='large', fontweight='bold')
    
    # TODO: Visualize the resulting ecdf. 
    # HINT: It can be in the form of a line or a set of points.
    
    f.tight_layout()


---
### Build two sample KS-test
Build your implementation of two sample KS-test using any columns from the Boston dataset.
*HINT*: ![](https://wikimedia.org/api/rest_v1/media/math/render/svg/2a8f25b438394d87d3e53a003cc3cc751d418b9c)
![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/KS_Example.png/300px-KS_Example.png)

In [309]:
def create_cdf(norm_x):
    ecdf = build_ecdf(norm_x)
    
    # TODO: create cdf pandas.Series from ecdf
    cdf_df = None
    
    return cdf_df

def ks_2_sample_test(cdf1: np.ndarray, cdf2: np.ndarray):
    # TODO: calculate two sample KS-statistic
    
    pass

In [None]:
@interact(size=widgets.IntSlider(min=1, max=len(df.index), step=1, value=100))
def plot_dist(size, column_1=df.columns, column_2=df.columns):
    
    #load data
    x_1 = df[column_1][:size]
    x_2 = df[column_2][:size]
    
    # normalize data
    norm_x_1 = minmax_scale(x_1)
    norm_x_2 = minmax_scale(x_2)
    
    # create cdf
    cdf_norm_x_1 = create_cdf(norm_x_1)
    cdf_norm_x_2 = create_cdf(norm_x_2)
    
    # calculate ks-test
    ks_stat1_2 = ks_2_sample_test(cdf_norm_x_1.tolist(), cdf_norm_x_2.tolist())
    
    # visualization of results
    f = plt.figure(figsize=(20, 20))
    gs = f.add_gridspec(1, 1)
        
    ax = f.add_subplot(gs[0, 0])    
    ax.set_title(f"CDF {column_1} and {column_2}\n"+\
                    f"Two sample KS-test={str(round(ks_stat1_2,2))}", 
                 fontsize='large', fontweight='bold')
    cdf_norm_x_1.plot.line(ax=ax)
    cdf_norm_x_2.plot.line(ax=ax)
    
    f.tight_layout()

1. What happens if one of the samples consists of the same elements?

---
### Compare one sample KS-test and Student's t-test
Add a Student's t-test calculation and compare its result with that of the KS-test.

In [None]:
def t_2_sample_test(df1: np.ndarray, df2: np.ndarray):
    # TODO: calculate two sample Student's t-statistic
    
    pass

In [None]:
@interact(size=widgets.IntSlider(min=1, max=len(df.index), step=1, value=100))
def plot_dist(size, column_1=df.columns, column_2=df.columns):
    
    #load data
    x_1 = df[column_1][:size]
    x_2 = df[column_2][:size]
    
    # normalize data
    norm_x_1 = minmax_scale(x_1)
    norm_x_2 = minmax_scale(x_2)
    
    # create cdf
    cdf_norm_x_1 = create_cdf(norm_x_1)
    cdf_norm_x_2 = create_cdf(norm_x_2)
    
    # calculate ks-test
    ks_stat1_2 = ks_2_sample_test(cdf_norm_x_1.tolist(), cdf_norm_x_2.tolist())
    
    # calculate t-test
    # TODO: add parameters for t-test
    t_stat1_2 = t_2_sample_test()
    
    # visualization of results
    f = plt.figure(figsize=(20, 20))
    gs = f.add_gridspec(1, 1)
        
    ax = f.add_subplot(gs[0, 0])    
    ax.set_title(f"CDF {column_1} and {column_2}\n"+\
                 f"Two sample KS-test={str(round(ks_stat1_2,2))}\n"+\
                 f"Two sample Student's t-test={str(round(t_stat1_2,2))}", 
                 fontsize='large', fontweight='bold')
    cdf_norm_x_1.plot.line(ax=ax)
    cdf_norm_x_2.plot.line(ax=ax)
    
    f.tight_layout()

1. What are the **differences** between the tests?
2. In which cases can we **not use** the KS-test but **can use** Student's t-test?

---
### Example for two sample ks-test and Student's t-test [[4](http://www.physics.csbsju.edu/stats/KS-test.html)]
Two near-by apple trees are in bloom in an otherwise empty field. One is a Whitney Crab the other is a Redwell. Do bees prefer one tree to the other? We collect data by using a stop watch to time how long a bee stays near a particular tree. We begin to time when the bee touches the tree; we stop timing when the bee is more than a meter from the tree. (As a result all our times are at least 1 second long: it takes a touch-and-go bee that long to get one meter from the tree.) We wanted to time exactly the same number of bees for each tree, but it started to rain. Unequal dataset size is not a problem for the KS-test.

In [343]:
size = 80
redwell = pd.Series([23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3, 14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5])
whitney = pd.Series([16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7])

This example is based on data distributed according to the *Cauchy distribution*: **a particularly abnormal case**.

Compute the KS-test and the t-test for **redwell, whitney**, using everything you have built so far.

In [None]:
## TODO compute ks-test and t-test for [redwell, whitney]

1. Which test is best applied in this case? Why?

---
## Conclusion

**Advantage:** 
An attractive feature of this test is that the distribution of the KS-test statistic itself does not depend on the underlying cumulative distribution function being tested.

The KS-test has several important **limitations**:
- It only applies to continuous distributions.
- It tends to be more sensitive near the center of the distribution than at the tails.
- The distribution must be fully specified. That is, if location, scale, and shape parameters are estimated from the data, the critical region of the KS-test is no longer valid. It typically must be determined by simulation.
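
The last limitation can be illustrated by simulation. The sketch below repeatedly draws normal samples, fits the mean and standard deviation to each sample, and records √n · D against the *fitted* normal. The function names, `n=50` and `reps=1000` are arbitrary illustrative choices, and only the standard library is used so that the cell runs without the notebook's other dependencies.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def ks_statistic(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)| for a continuous reference CDF `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

def simulated_critical_value(n=50, reps=1000, alpha=0.05, seed=0):
    """Monte Carlo (1 - alpha) quantile of sqrt(n) * D_n when the normal's mean
    and standard deviation are estimated from the sample being tested."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = fmean(sample)
        fitted = NormalDist(xbar, stdev(sample, xbar))
        stats.append(math.sqrt(n) * ks_statistic(sample, fitted.cdf))
    stats.sort()
    return stats[round((1 - alpha) * reps) - 1]

critical = simulated_critical_value()
print(f"simulated: {critical:.3f}  vs. about 1.358 for a fully specified F")
```

The simulated value should come out well below 1.358, so using the standard table with estimated parameters makes the test reject too rarely.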

---
## References / Acknowledgements

1. https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
2. https://en.wikipedia.org/wiki/Empirical_distribution_function
3. https://towardsdatascience.com/kolmogorov-smirnov-test-84c92fb4158d
4. http://www.physics.csbsju.edu/stats/KS-test.html