<p>&nbsp;</p>
<h1 style="text-align: center;"><strong>Kolmogorov–Smirnov Test</strong></h1>
<h2 style="text-align: center;"><strong>Nonparametric Hypothesis Testing for Data Science</strong></h2>
<p>&nbsp;</p><p>&nbsp;</p><p>&nbsp;</p>

# Introduction

The Kolmogorov–Smirnov test (KS test) can detect differences between distributions, such as differences in shape or spread, that the Student's t test, which compares only means, cannot detect.

**According to Wikipedia:**

> The Kolmogorov–Smirnov statistic quantifies a distance between
> the empirical distribution function of the sample and the cumulative
> distribution function of the reference distribution, or between the
> empirical distribution functions of two samples. The null distribution
> of this statistic is calculated under the null hypothesis that the
> sample is drawn from the reference distribution (in the one-sample
> case) or that the samples are drawn from the same distribution (in
> the two-sample case). In each case, the distributions considered
> under the null hypothesis are continuous distributions but are
> otherwise unrestricted.

The test checks whether a sample can plausibly have been drawn from a population with a given distribution. It is designed for continuous distributions and, being nonparametric, makes no assumption about the shape of the data distribution.
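To make the "distance between the empirical distribution function and the reference CDF" concrete, the following minimal sketch computes the one-sample statistic by hand and checks it against `scipy.stats.kstest`. The sample, the seed, and the helper name `ks_statistic` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy import stats


def ks_statistic(sample, cdf):
    """Two-sided one-sample KS statistic (hypothetical helper for illustration)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    f = cdf(x)                                    # reference CDF at the sorted points
    d_plus = np.max(np.arange(1, n + 1) / n - f)  # largest gap with the ECDF above the CDF
    d_minus = np.max(f - np.arange(0, n) / n)     # largest gap with the ECDF below the CDF
    return max(d_plus, d_minus)


rng = np.random.default_rng(0)                    # arbitrary seed
sample = rng.normal(size=200)                     # illustrative data

d_manual = ks_statistic(sample, stats.norm.cdf)
d_scipy = stats.kstest(sample, 'norm').statistic
print(d_manual, d_scipy)                          # the two values should agree
```

The statistic is simply the largest vertical gap between the two curves, which is why the test reacts to any kind of difference between distributions and not only to a shift in the mean.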

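In the example discussed in the references:

- The Student's t test gives a p-value of 0.793, so it finds no evidence of a difference between the two samples.
- The KS test gives a p-value of 0.016, so it rejects the hypothesis that the two samples come from the same distribution at the 5% level.

Note that a p-value is not the probability that the two samples share a distribution; it is the probability of observing a test statistic at least this extreme if they do.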
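The contrast between the two tests can be reproduced with synthetic data. The sketch below is an illustration under assumed data (two normal samples with the same mean but different spread, arbitrary seed); it is not the dataset behind the figures quoted above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)                  # arbitrary seed
a = rng.normal(loc=0.0, scale=1.0, size=1000)    # narrow sample
b = rng.normal(loc=0.0, scale=3.0, size=1000)    # same mean, wider spread

t_res = stats.ttest_ind(a, b, equal_var=False)   # Welch t test: compares means only
ks_res = stats.ks_2samp(a, b)                    # KS test: compares whole distributions

print("t test p-value: ", t_res.pvalue)
print("KS test p-value:", ks_res.pvalue)
```

Because both samples share the same mean, the t test has nothing to detect, while the KS test picks up the difference in spread.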

**Imports and Parameters:**

In [79]:
import pandas as pd
import numpy as np

***

# scipy.stats.kstest

Performs the Kolmogorov–Smirnov test for goodness of fit. The KS test is valid only for continuous distributions.

**Imports and Parameters:**

In [112]:
from scipy import stats # if you want to import everything
from scipy.stats import kstest # specific import

**Examples:**

In [113]:
x = np.linspace(-15, 15, 9)
stats.kstest(x, 'norm')

KstestResult(statistic=0.4443560271592436, pvalue=0.038850142705171065)

In [114]:
np.random.seed(987654321) # set random seed to get the same result
stats.kstest('norm', False, N=100)

KstestResult(statistic=0.058352892479417884, pvalue=0.8853119094415126)

The above lines are equivalent to:

In [115]:
np.random.seed(987654321)
stats.kstest(stats.norm.rvs(size=100), 'norm')

KstestResult(statistic=0.058352892479417884, pvalue=0.8853119094415126)

Test against one-sided alternative hypothesis

Shift distribution to larger values, so that cdf_dgp(x) < norm.cdf(x):

In [116]:
np.random.seed(987654321)
x = stats.norm.rvs(loc=0.2, size=100)
stats.kstest(x,'norm', alternative = 'less')

KstestResult(statistic=0.12464329735846891, pvalue=0.04098916407764175)

Reject equal distribution against alternative hypothesis: less

In [117]:
stats.kstest(x,'norm', alternative = 'greater')

KstestResult(statistic=0.007211523321631108, pvalue=0.985311585903964)

Don’t reject equal distribution against alternative hypothesis: greater
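The one-sided statistics can also be computed by hand, which may help to see what `alternative='less'` and `alternative='greater'` measure. This is a minimal sketch under assumptions: the seed and sample are illustrative and differ from the cells above, so the numbers will not match the outputs shown there.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)                   # arbitrary seed
x = np.sort(rng.normal(loc=0.2, size=100))       # sample shifted to larger values
n = x.size
f = stats.norm.cdf(x)                            # standard normal CDF at the data

d_plus = np.max(np.arange(1, n + 1) / n - f)     # statistic for alternative='greater'
d_minus = np.max(f - np.arange(0, n) / n)        # statistic for alternative='less'

print(d_plus, stats.kstest(x, 'norm', alternative='greater').statistic)
print(d_minus, stats.kstest(x, 'norm', alternative='less').statistic)
```

The two-sided statistic is the larger of `d_plus` and `d_minus`.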

In [118]:
stats.kstest(x,'norm', mode='asymp')

KstestResult(statistic=0.12464329735846891, pvalue=0.08944488871182082)

Testing t distributed random variables against normal distribution

With 100 degrees of freedom the t distribution looks close to the normal distribution, and the K-S test does not reject the hypothesis that the sample came from the normal distribution:

In [119]:
np.random.seed(987654321)
stats.kstest(stats.t.rvs(100,size=100),'norm')

KstestResult(statistic=0.07201892916547126, pvalue=0.6763006286247917)

With 3 degrees of freedom the t distribution looks sufficiently different from the normal distribution, that we can reject the hypothesis that the sample came from the normal distribution at the 10% level:

In [120]:
np.random.seed(987654321)
stats.kstest(stats.t.rvs(3,size=100),'norm')

KstestResult(statistic=0.131016895759829, pvalue=0.058826222555312224)

***

# scipy.stats.ks_2samp

Computes the two-sided KS test for the null hypothesis that two independent samples are drawn from the same continuous distribution.
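In the two-sample case the statistic is the largest gap between the two empirical distribution functions. The following sketch computes it directly and compares it with `stats.ks_2samp`; the helper name `ks_2samp_statistic`, the seed, and the samples are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def ks_2samp_statistic(a, b):
    """Largest gap between the two empirical CDFs (hypothetical helper)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    pooled = np.concatenate([a, b])
    # Empirical CDF of each sample, evaluated at every pooled observation
    ecdf_a = np.searchsorted(a, pooled, side='right') / a.size
    ecdf_b = np.searchsorted(b, pooled, side='right') / b.size
    return np.max(np.abs(ecdf_a - ecdf_b))


rng = np.random.default_rng(2)                   # arbitrary seed
s1 = rng.normal(loc=0.0, scale=1.0, size=200)
s2 = rng.normal(loc=0.5, scale=1.5, size=300)

d_manual = ks_2samp_statistic(s1, s2)
d_scipy = stats.ks_2samp(s1, s2).statistic
print(d_manual, d_scipy)                         # the two values should agree
```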

**Imports and Parameters:**

In [121]:
from scipy import stats # if you want to import everything
from scipy.stats import ks_2samp # specific import

**Examples:**

In [122]:
np.random.seed(12345678)  #fix random seed to get the same result

In [123]:
n1 = 200  # size of first sample

In [124]:
n2 = 300  # size of second sample

For a different distribution, we can reject the null hypothesis since the pvalue is below 1%:

In [125]:
rvs1 = stats.norm.rvs(size=n1, loc=0., scale=1)

In [126]:
rvs2 = stats.norm.rvs(size=n2, loc=0.5, scale=1.5)

In [127]:
stats.ks_2samp(rvs1, rvs2)

Ks_2sampResult(statistic=0.20833333333333337, pvalue=4.667497551580699e-05)

For a slightly different distribution, we cannot reject the null hypothesis at a 10% or lower alpha, since the p-value of 0.144 is higher than 10%:

In [128]:
rvs3 = stats.norm.rvs(size=n2, loc=0.01, scale=1.0)
stats.ks_2samp(rvs1, rvs3)  # comparison behind the p-value quoted above

For an identical distribution, we cannot reject the null hypothesis since the p-value is high, 41%:

In [129]:
rvs4 = stats.norm.rvs(size=n2, loc=0.0, scale=1.0)

In [130]:
stats.ks_2samp(rvs1, rvs4)

Ks_2sampResult(statistic=0.07999999999999996, pvalue=0.4112694972985972)

***

# Example I

**Imports and Parameters:**

In [134]:
from scipy.stats import kstest

from plotly.offline import init_notebook_mode, iplot  # offline rendering inside the notebook
import plotly.figure_factory as ff  # table figures

init_notebook_mode(connected=True)

**Import Data:**

Load a dataset of average wind speeds sampled every 10 minutes:

In [135]:
data = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/wind_speed_laurel_nebraska.csv')
df = data[0:10]

table = ff.create_table(df)
iplot(table)

**Testing Normality:**

The Kolmogorov–Smirnov test can compare any two distributions with each other, not only a sample against a normal distribution. The test may be one-sided or two-sided, but the latter applies only if both distributions are continuous.

In [137]:
x = data['10 Min Sampled Avg']

ks_results = kstest(x, cdf='norm')

matrix_ks = [['', 'DF', 'Test Statistic', 'p-value'],
             ['Sample Data', len(x) - 1, ks_results[0], ks_results[1]]]

ks_table = ff.create_table(matrix_ks, index=True)
iplot(ks_table, filename='ks-table')

Since our p-value is read as 0.0, we have strong evidence against the null hypothesis: we reject the hypothesis that the data follow the reference normal distribution.
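One caveat: with `cdf='norm'` and no extra arguments, `kstest` compares the data against the standard normal distribution (mean 0, standard deviation 1), so raw measurements in physical units are rejected almost by construction. The sketch below uses synthetic data standing in for the wind speeds (an assumption, to keep the example self-contained) and passes the location and scale through `args`. Estimating these parameters from the same sample means the standard KS p-value is no longer exact (it tends to be too large), so this is an informal check rather than a strict normality test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)                       # arbitrary seed
speeds = rng.normal(loc=8.0, scale=3.0, size=500)    # synthetic stand-in data

# Against the standard normal N(0, 1): location and scale of the data are ignored
res_std = stats.kstest(speeds, 'norm')

# Against a normal with the sample's own mean and standard deviation
mu, sigma = speeds.mean(), speeds.std(ddof=1)
res_fit = stats.kstest(speeds, 'norm', args=(mu, sigma))

print(res_std.statistic, res_std.pvalue)
print(res_fit.statistic, res_fit.pvalue)
```

The first test rejects even though the data are normal by construction, because they are not standard normal; the second test measures only the departure in shape.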

***

# Example II

Example using the two-sample KS test with ks_2samp:

In [138]:
from scipy.stats import ks_2samp
np.random.seed(12345678)

In [139]:
x = np.random.normal(0, 1, 1000)

In [140]:
y = np.random.normal(0, 1, 1000)

In [141]:
z = np.random.normal(1.1, 0.9, 1000)

In [142]:
ks_2samp(x, y)

Ks_2sampResult(statistic=0.02299999999999991, pvalue=0.9518901680484965)

In [143]:
ks_2samp(x, z)

Ks_2sampResult(statistic=0.41800000000000004, pvalue=3.708149411924217e-77)

***

<p>&nbsp;</p>
<h1 style="text-align: center;"><strong><span lang="pt">CONCLUSION</span></strong></h1>
<p>&nbsp;</p><p>&nbsp;</p><p>&nbsp;</p>

In this study kernel, working through the reference readings, I found that the KS test is an effective way of automatically telling apart samples drawn from different distributions. The references show that when the means and standard deviations of two samples are very similar, the Student's t test returns a very high p-value and does not detect the difference, whereas the KS test does. This study prepares the ground for future use of the KS test and shows that it can be applied easily in data science contexts.
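As one possible data science use, the two-sample test can be run column by column to check whether a training set and a test set look alike. The sketch below is only an illustration: the helper `ks_drift_report`, the column names, the synthetic data, and the 5% threshold are all assumptions, not an existing library API.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ks_drift_report(train, test, alpha=0.05):
    """Per-column two-sample KS test (hypothetical helper, not a library API)."""
    rows = []
    for col in train.columns.intersection(test.columns):
        res = stats.ks_2samp(train[col].dropna(), test[col].dropna())
        rows.append({'column': col,
                     'statistic': res.statistic,
                     'pvalue': res.pvalue,
                     'differs': res.pvalue < alpha})
    return pd.DataFrame(rows)


rng = np.random.default_rng(4)                   # arbitrary seed
train = pd.DataFrame({'stable': rng.normal(0, 1, 1000),
                      'shifted': rng.normal(0, 1, 1000)})
test = pd.DataFrame({'stable': rng.normal(0, 1, 1000),
                     'shifted': rng.normal(1.1, 0.9, 1000)})

report = ks_drift_report(train, test)
print(report)
```

When many columns are tested at once, some will fall below the threshold by chance, so the flags are best read as candidates for closer inspection.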

**References:**

- [Kolmogorov–Smirnov Test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test)
- [scipy.stats.kstest](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html)
- [scipy.stats.ks_2samp](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html)
- [Kolmogorov–Smirnov Test](https://towardsdatascience.com/kolmogorov-smirnov-test-84c92fb4158d)
- [Kolmogorov-Smirnov Test](http://www.physics.csbsju.edu/stats/KS-test.html)
- [6.2 - Teste de Kolmogorov-Smirnov](http://www.portalaction.com.br/inferencia/62-teste-de-kolmogorov-smirnov)
- [Normality Test in Python](https://plot.ly/python/normality-test/)
- [Kolmogorov-Smirnov train/test - Porto Seguro](https://www.kaggle.com/rspadim/kolmogorov-smirnov-train-test-porto-seguro)
- [Two-sample Kolmogorov-Smirnov Test in Python Scipy](https://stackoverflow.com/questions/10884668/two-sample-kolmogorov-smirnov-test-in-python-scipy)

***

##### INSTALLED VERSIONS

In [144]:
pd.show_versions()


INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.9.1
pip: 18.1
setuptools: 40.4.3
Cython: 0.29.1
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.1.1
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


***