# Introduction

The objective of this notebook is to explore an implementation of the Kolmogorov-Smirnov Two Sample Test.

Also called the KS 2 Sample Test, the Kolmogorov-Smirnov Two Sample Test compares two samples to see whether they come from the same underlying distribution. In such a test, you have:
1. Two samples
2. A null and alternate hypothesis
   <ul>
       <li>The null hypothesis is written in negative form: <b>There is no difference in distributions between the two samples provided</b></li>
       <li>The alternate hypothesis is written in positive form: <b>There is a difference in distributions between the two samples provided</b></li>
   </ul>
3. An alpha value (the significance level): the probability of wrongly rejecting the null hypothesis when it is actually true
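
The hypotheses and the alpha value listed above can be written compactly. The symbols $F_A$ and $F_B$ (the distributions behind the two samples) are notation introduced here for illustration; they do not appear in the original notebook.

$$
H_0:\; F_A(x) = F_B(x) \ \text{for all } x
\qquad\text{versus}\qquad
H_1:\; F_A(x) \neq F_B(x) \ \text{for some } x
$$

$$
\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})
$$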

<b>From my readings, I observed that the KS 2 Sample Test follows a five-step process:</b>
1. Sorting each sample
2. Computing the empirical CDF (ECDF) for each sample
3. Computing the maximum absolute difference between the two ECDFs (this is the KS statistic)
4. Computing the p-value
5. Rejecting the null hypothesis if the p-value is less than the alpha level
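
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code inside `ks_2_samp_test`: the function name `ks_2samp_sketch` is hypothetical, and the p-value uses the asymptotic Kolmogorov series, which is an assumption about step 4 and may differ from what the notebook's class or SciPy actually computes.

```python
import math

import numpy as np


def ks_2samp_sketch(sample_a, sample_b):
    """Illustrative two-sample KS test (hypothetical helper, asymptotic p-value)."""
    # Step 1: sort each sample.
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    n, m = len(a), len(b)

    # Step 2: evaluate both empirical CDFs at every observed value.
    # searchsorted(..., side="right") counts how many sorted values are <= x.
    grid = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, grid, side="right") / n
    ecdf_b = np.searchsorted(b, grid, side="right") / m

    # Step 3: the KS statistic is the largest absolute gap between the ECDFs.
    d = float(np.max(np.abs(ecdf_a - ecdf_b)))

    # Step 4: asymptotic p-value from the Kolmogorov series (an assumption).
    z = d * math.sqrt(n * m / (n + m))
    if z < 0.2:
        # The series converges poorly near zero; the p-value is ~1 there.
        p = 1.0
    else:
        p = 2.0 * sum(
            (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
            for k in range(1, 101)
        )
        p = min(max(p, 0.0), 1.0)
    return d, p


# Step 5: compare the p-value with alpha.
rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
d, p = ks_2samp_sketch(rng.integers(0, 1000, 200), rng.integers(0, 1000, 200))
alpha = 0.05
print("reject H0" if p < alpha else "fail to reject H0", d, p)
```

Evaluating the ECDFs only at the observed values is enough, because both are step functions that change only at those points.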

<b>What libraries do I need for this exploration?</b>
1. NumPy
2. SciPy

### An example where the distributions are similar

In [10]:
# import the libraries needed and our implementation of the KS two-sample test

from ks_2_samp_test import KS_2_sample_test
import numpy as np
import scipy.stats  # import the stats submodule explicitly so scipy.stats.ks_2samp resolves

In [5]:
A = np.random.randint(0, 1000, 200)
B = np.random.randint(0, 1000, 200)

A, B

(array([876, 532, 597, 343, 645, 119, 957,  88, 444,  99, 892, 327, 659,
        434, 534, 584, 775, 484, 133, 233, 733, 679, 633, 175, 533, 123,
        560, 772, 506, 257, 604, 649, 818, 693, 765,  80, 524, 234, 778,
        602, 869, 263, 142, 714, 919, 259, 585, 803,  92, 824, 672, 267,
        848,  25, 483, 616, 297, 941, 530, 656, 569,  94, 838, 830, 658,
        814, 962, 219, 183, 895, 862, 394, 883, 738, 696, 293, 920, 451,
        490, 883, 837, 290, 417, 736, 548, 837, 188, 894, 906, 386,  89,
        969, 559, 598,   0, 362, 409, 265, 528, 979, 348, 627, 563, 307,
        401, 737, 533, 424, 676, 244, 863, 672, 309, 982, 105, 828, 161,
        828,  63, 694, 361, 577, 540, 614, 978, 207, 423, 404,  83, 440,
        856, 381, 378, 515, 955, 370, 151, 746, 478, 496, 755, 746, 659,
        235, 346, 441, 810, 770,  38, 601, 785, 410, 588, 748, 301, 818,
        467, 897, 470, 807, 961, 898, 941, 789, 378, 668, 108, 216, 832,
        227, 974, 855, 311, 595, 383, 672, 342, 150
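
Because `np.random.randint` is called without a seed here, the arrays and the resulting statistics change on every run. A seeded generator makes the example repeatable. In the sketch below, the helper name `make_samples` and the seed value 42 are arbitrary choices for illustration, and the numbers it yields will not match the output shown above.

```python
import numpy as np


def make_samples(seed=42, size=200):
    """Draw two integer samples from the same range with a fixed seed."""
    rng = np.random.default_rng(seed)
    # As with np.random.randint, the upper bound is exclusive.
    a = rng.integers(0, 1000, size)
    b = rng.integers(0, 1000, size)
    return a, b


A_seeded, B_seeded = make_samples()
```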

In [6]:
KS_2_sample_test().run_test(array_a=A, array_b=B, alpha=0.05)

Null Hypothesis: There is NO difference between Sample 1 (array_a) and Sample 2 (array_b)
Alternate Hypothesis: There is a difference between Sample 1 (array_a) and Sample 2 (array_b)


{'ks-statistic': 0.08000000000000002, 'p-value': 0.5182193645480672}


Null Hypothesis Accepted, Reason: 0.5182193645480672 > 0.05


In [12]:
# scipy's implementation

scipy.stats.ks_2samp(A, B)

KstestResult(statistic=0.08, pvalue=0.5452713464323318, statistic_location=335, statistic_sign=-1)
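
Both implementations report the same statistic (0.08) but slightly different p-values (about 0.518 versus 0.545). The notebook does not show how `KS_2_sample_test` computes its p-value, so the cause is not confirmed; one plausible explanation is that the two use different approximations. To see how much that choice can matter, the sketch below calls `scipy.stats.ks_2samp` with its `method` argument set to `"exact"` and to `"asymp"` on freshly generated data. The seed is arbitrary, and older SciPy releases name this argument `mode`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
x = rng.integers(0, 1000, 200)
y = rng.integers(0, 1000, 200)

# Same data, two ways of turning the statistic into a p-value.
exact = stats.ks_2samp(x, y, method="exact")
asymp = stats.ks_2samp(x, y, method="asymp")

print("statistic:", exact.statistic, asymp.statistic)
print("p-value (exact):", exact.pvalue)
print("p-value (asymp):", asymp.pvalue)
```

The statistic does not depend on `method`; only the p-value calculation does.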

### An example where the distributions are clearly different

In [16]:
A = np.random.randint(0, 1000, 200)
B = np.random.randint(500, 2000, 200)

A, B

(array([ 25, 362, 796, 998, 155, 816, 761, 228, 125, 977,  15, 654, 385,
        668, 675, 563, 211, 886, 826, 984, 460, 785, 852, 854, 522, 908,
         19, 916,   3, 658, 648, 168,  24,  94, 572, 465, 210, 273, 718,
        878, 509, 105, 807, 925, 876, 978, 914, 370, 593, 518, 394,  91,
         62, 776, 836, 799, 573, 396, 483, 788, 633, 668, 135, 932, 401,
        193, 952, 415, 376, 787, 160, 912,  89, 608, 998, 671, 486, 166,
        355, 922, 187, 154, 316, 862, 147, 366, 880, 170, 604, 114, 662,
        825, 492, 352, 391, 769, 911, 109,  13, 393, 233, 128,   1, 254,
        792, 355, 638, 882, 873, 862, 634, 439, 951,  62, 905, 883, 481,
        285,  76, 270,  96, 618, 436,  22, 973, 700, 845, 150, 928, 889,
        363, 188, 717, 546, 832, 528, 639, 792, 666, 160, 657, 971, 552,
        132, 983, 302, 961,  98, 311, 680, 759, 354, 203, 137, 643, 135,
        718, 398, 720, 964, 237,  44, 569,  93, 798, 342, 608, 878, 314,
        698, 296, 227, 923, 480, 872, 772, 449, 964

In [17]:
KS_2_sample_test().run_test(array_a=A, array_b=B, alpha=0.05)

Null Hypothesis: There is NO difference between Sample 1 (array_a) and Sample 2 (array_b)
Alternate Hypothesis: There is a difference between Sample 1 (array_a) and Sample 2 (array_b)


{'ks-statistic': 0.685, 'p-value': 3.5685974177585134e-47}


Null Hypothesis Rejected, Reason: 3.5685974177585134e-47 < 0.05


In [18]:
scipy.stats.ks_2samp(A, B)

KstestResult(statistic=0.685, pvalue=4.792055985422256e-45, statistic_location=998, statistic_sign=1)