# Calculate KS Statistics

The Kolmogorov-Smirnov (KS) statistic compares the cumulative distributions of events and non-events; KS is the maximum difference between these two cumulative distributions. In simple terms, it tells us how well our predictive model discriminates between events and non-events.
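
To make the definition concrete, here is a minimal sketch that computes KS row by row rather than by decile. The function name `ks_statistic` and the toy scores and labels are made up for illustration and are not part of this notebook's data.

```python
def ks_statistic(scores, labels):
    """Return the largest gap between cumulative event and non-event rates.

    scores: predicted probabilities; labels: 1 for an event, 0 for a non-event.
    """
    n_events = sum(labels)
    n_nonevents = len(labels) - n_events
    if n_events == 0 or n_nonevents == 0:
        raise ValueError("need at least one event and one non-event")

    # Rank observations from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    cum_events = 0
    cum_nonevents = 0
    best_gap = 0.0
    for i, (score, label) in enumerate(ranked):
        cum_events += label
        cum_nonevents += 1 - label
        # Compare the two distributions only after all tied scores are counted.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        gap = abs(cum_events / n_events - cum_nonevents / n_nonevents)
        best_gap = max(best_gap, gap)
    return best_gap


# Hypothetical toy data: three events and three non-events.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
print(ks_statistic(toy_scores, toy_labels))
```

On the toy data the largest gap is two thirds: after the top two scores, two of the three events have been captured but none of the non-events. A model that separates the classes perfectly would give a KS of 1, and one that ranks them at random would give a KS close to 0.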

In [20]:
#Get the raw data

import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/NickDSShare/Model-Development/main/KS_Test_Raw_Data.csv")

print(df)

            p  y
0    0.063817  0
1    0.004442  0
2    0.150929  0
3    0.223016  0
4    0.067777  0
..        ... ..
995  0.002950  0
996  0.014218  0
997  0.076319  0
998  0.006082  0
999  0.255054  0

[1000 rows x 2 columns]


In [16]:
#Create a function for calculating KS

def ks(data, target, prob, decile):
    #Work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()
    data['target0'] = 1 - data[target]
    #Split the predicted probabilities into equal-sized buckets
    data['bucket'] = pd.qcut(data[prob], decile)
    grouped = data.groupby('bucket', as_index=False, observed=True)
    kstable = pd.DataFrame()
    kstable['min_prob'] = grouped.min()[prob]
    kstable['max_prob'] = grouped.max()[prob]
    kstable['events'] = grouped.sum()[target]
    kstable['nonevents'] = grouped.sum()['target0']
    #Put the highest-probability bucket first
    kstable = kstable.sort_values(by="min_prob", ascending=False).reset_index(drop=True)
    kstable['event_rate'] = (kstable.events / data[target].sum()).apply('{0:.2%}'.format)
    kstable['nonevent_rate'] = (kstable.nonevents / data['target0'].sum()).apply('{0:.2%}'.format)
    kstable['cum_eventrate'] = (kstable.events / data[target].sum()).cumsum()
    kstable['cum_noneventrate'] = (kstable.nonevents / data['target0'].sum()).cumsum()
    #Convert to percent before rounding to avoid floating-point noise such as 43.300000000000004
    kstable['KS'] = np.round((kstable['cum_eventrate'] - kstable['cum_noneventrate']) * 100, 1)

    #Formatting
    kstable['cum_eventrate'] = kstable['cum_eventrate'].apply('{0:.2%}'.format)
    kstable['cum_noneventrate'] = kstable['cum_noneventrate'].apply('{0:.2%}'.format)
    #Number the buckets from 1 up to however many buckets were requested
    kstable.index = range(1, len(kstable) + 1)
    kstable.index.rename('Decile', inplace=True)
    # pd.set_option('display.max_columns', 9)
    print(kstable)

    #Display KS
    print("")
    print("KS is " + str(kstable['KS'].max()) + "%" + " at decile " + str(kstable['KS'].idxmax()))
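
As a rough cross-check of the cumulative-rate arithmetic inside `ks`, the sketch below redoes the calculation by hand using only the standard library. The event and non-event counts are copied from the decile table that this notebook prints for `df`. The variable names are illustrative, and exact fractions are an assumption made here to keep tied deciles from being split by floating-point noise.

```python
from fractions import Fraction
from itertools import accumulate

# Counts per decile, highest-probability decile first (copied from the notebook's output table).
events = [49, 19, 14, 10, 5, 1, 1, 1, 0, 0]
nonevents = [51, 81, 86, 90, 95, 99, 99, 99, 100, 100]

total_events = sum(events)
total_nonevents = sum(nonevents)

# Cumulative share of all events / non-events captured down to each decile.
cum_event_rate = [Fraction(c, total_events) for c in accumulate(events)]
cum_nonevent_rate = [Fraction(c, total_nonevents) for c in accumulate(nonevents)]

# KS per decile is the gap between the two cumulative shares.
gaps = [e - n for e, n in zip(cum_event_rate, cum_nonevent_rate)]

ks_value = max(gaps)
ks_decile = gaps.index(ks_value) + 1  # first decile that reaches the maximum

print(f"KS is {float(ks_value):.1%} at decile {ks_decile}")
```

With these counts, deciles 3 and 4 have exactly the same gap (520/900, about 57.8%), so the first of the two is reported.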

In [17]:
#Call the function and check the highest KS value

ks(data=df,target="y", prob="p",decile = 10)

        min_prob  max_prob  events  nonevents event_rate nonevent_rate  \
Decile                                                                   
1       0.298894  0.975404      49         51     49.00%         5.67%   
2       0.135598  0.298687      19         81     19.00%         9.00%   
3       0.082170  0.135089      14         86     14.00%         9.56%   
4       0.050369  0.082003      10         90     10.00%        10.00%   
5       0.029415  0.050337       5         95      5.00%        10.56%   
6       0.018343  0.029384       1         99      1.00%        11.00%   
7       0.011504  0.018291       1         99      1.00%        11.00%   
8       0.006976  0.011364       1         99      1.00%        11.00%   
9       0.002929  0.006964       0        100      0.00%        11.11%   
10      0.000073  0.002918       0        100      0.00%        11.11%   

       cum_eventrate cum_noneventrate    KS  
Decile                                       
1             49.00