
Implement Kolmogorov-Smirnov Test in SQL-only #28

Merged

merged 32 commits into main from kstest-sql-only on Jun 30, 2022

Conversation

YYYasin19
Contributor

@YYYasin19 YYYasin19 commented Jun 24, 2022

Hey!

In this PR I'd like to refactor the work of #23 and implement the Kolmogorov-Smirnov test in SQL only.
This has the benefit that for large datasets, we won't have to load in all the values to pass them to scipy.
Additionally, this would remove scipy as a dependency again! 🥳

The core work of this PR consists of creating an SQL query that is able to calculate the Kolmogorov-Smirnov test.

The idea behind this implementation, as well as some theoretical background, is explained in the comments below.

The following issues will definitely need to be addressed:

  • Create more tests, especially with very large datasets
  • Refactor tests, as p-values are not always available now
  • Remove scipy dependency needs to be done on conda-feedstock
  • Test if the query is really database-agnostic, i.e., follows some SQL standard
  • Transform query to sqlalchemy's query API (follow-up issue opened: Re-implement KS test in sqlalchemy expression language API #29)
  • Validate p-value approximation

@YYYasin19 YYYasin19 added enhancement New feature or request dependencies Pull requests that update a dependency file labels Jun 24, 2022
@YYYasin19 YYYasin19 self-assigned this Jun 24, 2022
@YYYasin19
Contributor Author

Kolmogorov-Smirnov in SQL

Our sample table table1 has the following CDF

| val | cdf          |
|-----|--------------|
| 1   | 0.2352941176 |
| 2   | 0.3529411765 |
| 3   | 0.4705882353 |
| 4   | 0.5882352941 |
| 5   | 0.7058823529 |
| 6   | 0.8235294118 |
| 7   | 0.9411764706 |
| 8   | 1            |

which is calculated by the first query and stored in tab1. The same applies to tab2.
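The semantics of `cume_dist()` used here can be sketched in plain Python (a hypothetical helper, not the PR's code): for each value v, the CDF is the share of rows with `val <= v`. The 17-row sample below is one sample consistent with the table above.

```python
from fractions import Fraction

def cume_dist(values):
    """Emulate SQL's cume_dist(): for each distinct value v, return
    the fraction of rows with val <= v (exact arithmetic via Fraction)."""
    n = len(values)
    return {v: Fraction(sum(1 for x in values if x <= v), n)
            for v in set(values)}

# A 17-row sample whose distinct values 1..8 reproduce the CDF table above:
# four 1s, two each of 2..7, and a single 8.
sample = [1] * 4 + [v for v in range(2, 8) for _ in range(2)] + [8]
cdf = cume_dist(sample)  # cdf[1] == 4/17 ≈ 0.2352941176, cdf[8] == 1
```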
Since the range of the values in table1 might differ from the range of values in table2, we perform a FULL OUTER JOIN between the CDFs based on their index (here: val, i.e., the x-axis).

SELECT coalesce(tab1.val, tab2.val) as v, tab1.cdf as cdf1, tab2.cdf as cdf2
FROM tab1 FULL OUTER JOIN tab2 ON tab1.val = tab2.val
ORDER BY v

Using coalesce we combine the indices, which results in the following table:

| v  | cdf1         | cdf2         |
|----|--------------|--------------|
| 1  | 0.1666666667 | 0.2352941176 |
| 2  | 0.2222222222 | 0.3529411765 |
| 3  | 0.2777777778 | 0.4705882353 |
| 4  | 0.3333333333 | 0.5882352941 |
| 5  | 0.3888888889 | 0.7058823529 |
| 6  | 0.4444444444 | 0.8235294118 |
| 7  | 0.5          | 0.9411764706 |
| 8  |              | 1            |
| 16 | 0.5555555556 |              |
| 17 | 0.6111111111 |              |
| 18 | 0.6666666667 |              |
| 19 | 0.7222222222 |              |
| 20 | 0.7777777778 |              |
| 21 | 0.8333333333 |              |
| 22 | 0.8888888889 |              |
| 23 | 0.9444444444 |              |
| 24 | 1            |              |

Filling missing values

Since our discrete values result in a step-function-like CDF, we want to forward-fill missing values in the table.
This is done by first assigning a grouper_id to each (index, cdf_value) combination with

grouped_table AS ( -- Step 2: Create a grouper attribute to fill values forward
    SELECT v,
        COUNT(cdf1) over (order by v) as _grp1, cdf1,
        COUNT(cdf2) over (order by v) as _grp2, cdf2
    FROM cdf_unfilled

and then retrieving the first value per group here

filled_cdf AS ( -- Step 3: Select first non-null value per group (defined by grouper)
    SELECT v, first_value(cdf1) over (partition by _grp1 order by v) as cdf1_filled,
            first_value(cdf2) over (partition by _grp2 order by v) as cdf2_filled
    FROM grouped_table
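The grouper trick can be mirrored in plain Python (an illustrative sketch, not the PR's code; `None` stands for SQL NULL): the running count of non-NULL values gives every row the id of the last non-NULL entry seen so far, so taking each group's first value fills the gaps forward.

```python
def forward_fill(column):
    """Emulate Steps 2 + 3: COUNT(col) OVER (ORDER BY v) assigns each row
    the id of the last non-NULL value seen so far; first_value() per
    group then carries that value forward."""
    filled, last, grp = [], None, 0
    for x in column:
        if x is not None:    # a non-NULL row opens a new group ...
            grp += 1
            last = x
        filled.append(last)  # ... and is carried into the NULL rows after it
    return filled

# cdf2 column of the joined table: values 16..24 had no entry in table2
col = [0.235, 0.353, None, None]
assert forward_fill(col) == [0.235, 0.353, 0.353, 0.353]
```

Leading NULL rows (group id 0) stay `None`, matching the SQL, which is why Step 4 still has to replace them with 0.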

Replacing NULLs

As long as the smallest value of the two data samples differs, e.g., column1 starts with 1 and column2 starts with 5, one of the CDF tables will have NULL as its first value. The following query replaces those NULL values with 0 so that the distance can be calculated correctly.

Final Query

The final query currently has the following form.
Some of the subqueries can still be optimized. The current structure allows us to insert datajudge's get_clause into the query string to obtain a filtered table, and to use a val alias on the target column.

WITH tab1 AS (
  SELECT val, MAX(cdf) as cdf FROM (
      SELECT val, cume_dist() over (order by val) as cdf
      FROM (SELECT {column1} as val FROM {table1}) -- Change TABLE and COL here
      ORDER BY val
  ) GROUP BY val

), tab2 AS (
  SELECT val, MAX(cdf) as cdf FROM (
      SELECT val, cume_dist() over (order by val) as cdf
      FROM (SELECT {column2} as val FROM {table2}) -- Change TABLE and COL here
      ORDER BY val
  ) GROUP BY val

), cdf_unfilled AS (
  SELECT coalesce(tab1.val, tab2.val) as v, tab1.cdf as cdf1, tab2.cdf as cdf2
  FROM tab1 FULL OUTER JOIN tab2 ON tab1.val = tab2.val
  ORDER BY v

), grouped_table AS ( -- Step 2: Create a grouper attribute to fill values forward
    SELECT v,
        COUNT(cdf1) over (order by v) as _grp1, cdf1,
        COUNT(cdf2) over (order by v) as _grp2, cdf2
    FROM cdf_unfilled

), filled_cdf AS ( -- Step 3: Select first non-null value per group (defined by grouper)
    SELECT v, first_value(cdf1) over (partition by _grp1 order by v) as cdf1_filled,
            first_value(cdf2) over (partition by _grp2 order by v) as cdf2_filled
    FROM grouped_table

), replaced_nulls AS ( -- Step 4: Replace NULL values (often at the beginning) with 0 to calculate the difference
  SELECT coalesce(cdf1_filled, 0) as cdf1, coalesce(cdf2_filled, 0) as cdf2
  FROM filled_cdf
  )

SELECT MAX(ABS(cdf1-cdf2)) FROM replaced_nulls; -- Step 5: Calculate final statistic
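As a sanity check, the five steps can be mirrored in plain Python (an illustrative sketch, not the PR's code; evaluating each CDF directly at every point of the coalesced index subsumes the forward-fill and NULL-replacement steps):

```python
def ks_statistic(sample1, sample2):
    """Mirror the SQL pipeline: build both empirical CDFs, evaluate them
    on the union of observed values (the coalesced index column), and
    take the maximum absolute difference."""
    def ecdf(sample):
        n, s = len(sample), sorted(sample)
        return lambda v: sum(1 for x in s if x <= v) / n
    f1, f2 = ecdf(sample1), ecdf(sample2)
    grid = sorted(set(sample1) | set(sample2))  # FULL OUTER JOIN + coalesce
    return max(abs(f1(v) - f2(v)) for v in grid)

# Two samples with shifted ranges, as in the example tables:
d = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])  # d == 0.5
```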

@YYYasin19
Contributor Author

YYYasin19 commented Jun 24, 2022

Theoretical foundation for p-value

The query above only calculates the statistic of the Kolmogorov-Smirnov test, which is defined as
$$\sup_{x} \lvert F_1(x) - F_2(x) \rvert$$
where $F_1(x)$ and $F_2(x)$ are the CDFs of the two samples $s_1$ and $s_2$ respectively.

Determining acceptance

We can now accept or reject the null hypothesis $H_0$ of our test based on our test statistic $d$ and our sample sizes $n$ and $m$: $H_0$ is accepted if
$$ d \le \sqrt{-\frac{1}{2}\ln{\frac{\alpha}{2}}} \cdot \sqrt{\frac{n+m}{n \cdot m}} $$
In this case, $H_0$ states that the two samples stem from the same distribution.
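A sketch of this acceptance check in Python (illustrative names, not necessarily the PR's exact code; it implements the standard two-sample critical value):

```python
import math

def check_acceptance(d: float, n: int, m: int, alpha: float) -> bool:
    """Accept H0 iff d <= c(alpha) * sqrt((n + m) / (n * m)),
    with c(alpha) = sqrt(-0.5 * ln(alpha / 2))."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return d <= c_alpha * math.sqrt((n + m) / (n * m))
```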

Approximating the p-value

Since this value is test-specific and harder to interpret across applications, it would be beneficial to have a p-value at hand!
Unfortunately, the distribution of the KS statistic is unwieldy, making exact calculations hard. (Facchinetti, 2007) describes the following formula as an approximation of the p-value given $d$ and sample size $n$ for samples with $n \gt 35$:
$$ p \approx 2\exp\left(-2\left(d\sqrt{n}\right)^2\right) $$

After experimenting with a few test datasets, both very small (~50 samples) and very large (~100,000,000 samples), I can confirm that this approximation is mostly correct for the first few decimal places. (Very unscientific, I know).

Implementation:

For samples with $n \lt 35$ we won't provide a p-value for now. Accepting or rejecting the test will still be based on the above inequality; the approximate p-value is solely for reporting purposes.
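Put together, the approximation could look like this in Python (an illustrative sketch, not the PR's exact code; using the effective sample size $n \cdot m / (n + m)$ for the two-sample case and clamping the result to 1 are assumptions of this sketch):

```python
import math
from typing import Optional

def approx_p_value(d: float, n: int, m: int) -> Optional[float]:
    """Leading-order approximation p ~ 2 * exp(-2 * (d * sqrt(n_eff))^2)
    with effective sample size n_eff = n * m / (n + m).
    Returns None below n_eff = 35, where the approximation is unreliable."""
    n_eff = n * m / (n + m)
    if n_eff < 35:
        return None
    return min(1.0, 2.0 * math.exp(-2.0 * d * d * n_eff))
```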

@ivergara
Collaborator

ivergara commented Jun 24, 2022

Great job!

The following issues will definitely need to be addressed:

  • Create more tests, especially with very large datasets

I'm not sure how to deal with tests of large datasets. Maybe those will need to be executed manually on demand.

  • Test if the query is really database-agnostic, i.e., follows some SQL standard

Well, if you write the tests and they pass on the three databases we test against, then you're done.

  • Transform query to sqlalchemy's query API

I don't think it'll be very feasible or advisable to attempt that. ORMs have some limitations and sometimes going straight to SQL is necessary and advisable.

Anyhow, besides some minor issues that I'll point out later when this PR leaves draft mode, all looks good. Perhaps try to be more consistent in the type annotations including return values.

@ivergara
Collaborator

Some of the errors you see in the tests are typing-related: float | None is Python 3.10 syntax, and we're still supporting some older versions. You'll have to go with Optional[float] for a while.
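For illustration, the backwards-compatible spelling (hypothetical signature, body elided):

```python
from typing import Optional

# Python 3.10+ only:  def approximate_p_value(...) -> float | None
# Works on 3.8/3.9 as well:
def approximate_p_value(d: float, n_samples: int) -> Optional[float]:
    ...  # body elided; returns a float or None
```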

@kklein
Collaborator

kklein commented Jun 24, 2022

I don't think it'll be very feasible or advisable to attempt that. ORMs have some limitations and sometimes going straight to SQL is necessary and advisable.

Maybe @YYYasin19 meant the sqlalchemy language expression api? We use that in the rest of db_access. We don't use the ORM api.

@ivergara
Collaborator

ivergara commented Jun 24, 2022

I don't think it'll be very feasible or advisable to attempt that. ORMs have some limitations and sometimes going straight to SQL is necessary and advisable.

Maybe @YYYasin19 meant the sqlalchemy language expression api? We use that in the rest of db_access. We don't use the ORM api.

Duh, sorry. Still, in this case, I'd advise against it. I'm not opposed to it, I just don't think it's worth the effort. Maybe the only reason to do it would be to approach multi-backend support by leveraging sqlalchemy.

@kklein
Collaborator

kklein commented Jun 24, 2022

Duh, sorry. Still, in this case, I'd advise against it. I'm not opposed to it, I just don't think it's worth the effort.

Okay, interesting. I am strongly in favour of it - though would be fine with postponing it until after this PR.

@YYYasin19 I would suggest that if you are motivated to translate the raw SQL query to the language expression api, or see a point in doing so, you do it - either in this PR already or right after it. If you don't, I would suggest opening an issue to ensure that someone else does it eventually.

Some of my reasons:

  • A priori I see a lot of value in consistency. As of now the rest of db_access uses the language expression api.
  • I don't think the effort is actually high once one is somewhat familiar with the api. If one isn't, one might either want to become familiar, or delegate the task to someone who is.
  • The language expression api has (subjectively) helped us reduce the lines of code written - by allowing us to reuse code, isolate code and modularize it. As a consequence, we have basically never encountered bugs on the query level.

@YYYasin19 YYYasin19 marked this pull request as ready for review June 24, 2022 16:42
@YYYasin19 YYYasin19 marked this pull request as draft June 24, 2022 16:43
@YYYasin19
Contributor Author

Is there a reason why our MSSQL code converts the table specification in ref.get_selection() into a string with quotation marks, like "tempdb.dbo".int_table1.col_int?
This breaks the query for now; other than that, everything seems to work 🥳

@kklein
Collaborator

kklein commented Jun 24, 2022

why our MSSQL code is converting the table specification in ref.get_selection() into a string with quotation marks, like this

I'm not perfectly sure - I would've thought that get_selection returns a selection rather than a string but I might be missing something here.

Independently of that, I know that we sometimes ran into trouble with mssql when it came to referring to objects. E.g.:

Table names can contain any valid characters (for example, spaces). If table names contain any characters except letters, numbers, and underscores, the name must be delimited by enclosing it in back quotes (`).

https://docs.microsoft.com/en-us/sql/odbc/microsoft/table-name-limitations?view=sql-server-ver16

@YYYasin19
Contributor Author

I'm not perfectly sure - I would've thought that get_selection returns a selection rather than a string but I might be missing something here.

Yes, it does, I just had wrapped a str(...) around it 😬 .
I now have a solution that is very sub-optimal as I:
(1) replace " in MS-SQL table definitions manually

    if is_mssql(engine):
        table1_selection = str(table1_selection).replace('"', "")
        table2_selection = str(table2_selection).replace('"', "")

Note: sa.select(...) is the culprit here and returns "tempdb.dbo".table_name instead of tempdb.dbo.table_name.

(2) rely on data_source.table_name, which is not available for RawQueryDataSources.
Also, this would not include any WHERE clauses, which makes it straight-up wrong in many cases.

) -> Tuple[Any, OptionalSelections]:
return db_access.get_column(engine, ref)
@staticmethod
def check_acceptance(d: float, n: int, m: int, accepted_level):
Contributor

Unless d, n, m are well-known shortcuts in statistics, I'd go with proper variable names here and add a note about the equivalents on that Wikipedia page, e.g.

def ... sample1_size: int, sample2_size: int ...
"""
See [wikipedia url] for reference with n = sample1_size, m = sample2_size ...
"""

Contributor

Also a docstring for this function would be great

Collaborator

I do agree with both comments; however, since check_acceptance is an internal function I'm not too fussed about the variable naming IF you improve the docstring. Still, improving the argument names would be a great benefit.. we're not in the 70's anymore, where economy of names was a concern.

Contributor Author

Haha, I get your points; I just didn't think about it during the initial setup since all variables in the formulas are called n and m as well. n is very common in statistics for the sample size.

@staticmethod
def check_acceptance(d: float, n: int, m: int, accepted_level):
def c(alpha: float):
if alpha == 0.0:
Contributor

== comparisons with floating point numbers are dangerous, generally speaking.

If you want to avoid log(0) I think it would be better to just do log(alpha / 2.0 + 1e-10). This also covers the problem of very small alpha. Could add an assert alpha > 0.

Contributor Author

A very very small alpha is entirely feasible, since sometimes p-values can be in the range of 1e-50 for very large datasets.
I wanted to support alpha = 0 because that would be like stating that the test should pass every time (since the p-value is in range (0, 1]). But I guess in practice nobody would create a test that should pass every time.
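The reviewer's suggestion could be sketched like this (illustrative, not the PR's code; the epsilon value is an assumption):

```python
import math

def critical_value(alpha: float, n: int, m: int) -> float:
    """Critical distance for the two-sample KS test. Clamps alpha away
    from 0 instead of comparing floats with ==, so alpha == 0.0 (a test
    that should always pass) stays finite."""
    assert 0.0 <= alpha <= 1.0
    safe_alpha = max(alpha, 1e-10)   # avoids log(0); epsilon is illustrative
    c = math.sqrt(-0.5 * math.log(safe_alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))
```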

src/datajudge/db_access.py
@YYYasin19
Contributor Author

YYYasin19 commented Jun 29, 2022

How do I find out if my approximation of the p-value is good enough?
Left is the approximation, right is the "correct" one from scipy
0.00035221594346540835 vs 0.00035250250849476917
The difference is 2.8656502936081404e-07

Is it feasible to check whether the user has scipy installed and if so, just use scipy?

@YYYasin19 YYYasin19 marked this pull request as ready for review June 29, 2022 09:48
Collaborator

@ivergara ivergara left a comment

Once again, good job!

tests/integration/test_integration.py
@ivergara
Collaborator

ivergara commented Jun 29, 2022

How do I find out if my approximation of the p-value is good enough? Left is the approximation, right is the "correct" one from scipy 0.00035221594346540835 vs 0.00035250250849476917 The difference is 2.8656502936081404e-07

Usually, the level of "precision" shown by calculators and computers is somewhat artificial. I'm not a statistician, but I'd say that any difference beyond 1e-06, in this case, is irrelevant. Usually one uses a p-value of 0.05, or 0.01 if you want to be very picky. A difference between the values calculated via SQL and scipy like the one you found will not make anyone accept or reject the null hypothesis.

Along these lines, in your comparisons in the tests, like assert 5.551115123125783e-17 <= 1e-50, the 1e-50 is too extreme a value.

Is it feasible to check whether the user has scipy installed and if so, just use scipy?

Well, no... this is only executed in test. Users would never use this, so it's irrelevant if users have scipy installed or not.

@YYYasin19
Contributor Author

YYYasin19 commented Jun 29, 2022

Usually one uses the p value of 0.05 or if you want to be very picky 0.01. A difference in the calculated value via SQL and scipy like you find will not make anyone accept or reject the null hypothesis.

The problem is that in practice, for one of our tests, the reported p-value is in the magnitude of 1e-53 because the table has 100+ million rows. Hence my idea to use scipy in very scientific environments where accuracy is required.

Well, no... this is only executed in test.
No, I mean in production.

In the end, it would just be an extra statement within the approximation method

try:
    import scipy.stats
except ModuleNotFoundError:
    return 2 * math.exp(...)  # old method of approximation

# accurate method
return scipy.stats.kstwo.sf(d, n_samples)

I think that's feasible 👍

src/datajudge/constraints/stats.py
src/datajudge/db_access.py
@ivergara
Collaborator

Usually one uses the p value of 0.05 or if you want to be very picky 0.01. A difference in the calculated value via SQL and scipy like you find will not make anyone accept or reject the null hypothesis.

The problem is that in practice for one of our tests the reported p-value in the magnitude of 1e-53 because the table has 100+ million rows. Therefore, my idea to use scipy in very scientific environments where accuracy is required.

Mh, that's a problem... and it's a rather general issue with p-values.

Although I did try to look for possible solutions, I didn't find any.

One idea would be to base the data test around the value D, the maximum distance between the samples. The test would then receive a threshold for D that needs to be fulfilled. Now, I agree that this solution is not very satisfactory, as it just pushes the problem to defining a sensible threshold, and I personally have no idea how to interpret D well enough to choose a number for that threshold.

@YYYasin19
Contributor Author

YYYasin19 commented Jun 29, 2022

One idea would be to base the data test around the value D, the maximum distance between the samples. Then the test receives a threshold for D that needs to be fulfilled. Now, I agree that this solution is not very satisfactory as it just kicks the problem to define a sensible threshold and at least I personally have no idea how to interpret D enough to be able to choose a number for the threshold.

My experience is only anecdotal as well, but from what I've seen for various sample sizes (10, 10_000 and 100_000_000), the test statistic d doesn't get too small. Therefore, I chose 1e-10 as the allowed deviation.

@YYYasin19
Contributor Author

@ivergara Can you remove scipy as a prod dependency from the conda-feedstock after the merge of this?

@YYYasin19 YYYasin19 merged commit e08267a into main Jun 30, 2022
@jonashaag
Contributor

Great work!

@ivergara ivergara deleted the kstest-sql-only branch June 30, 2022 18:37
kklein pushed a commit that referenced this pull request Jul 25, 2022
## Change
This PR implements the Kolmogorov Smirnov test in pure SQL which is then run directly on the database.

## Commits
* first integration: ks-test in database functionality

* integrate sql query with data refs

* formatting

* formatting

* refactoring: call `test` directly to access sql-result of KS test

* fix row count retrieval

* fix acceptance level domain error

* fix alpha adjustment

* fix type hints for python<3.10

* update sql query for postgres: all tables need to have an alias assigned to them

* fix: typo

* update query for mssql server

* add check for column names

* alternative way of getting table name, incl. hot fix for mssql quotation marks in table reference

* don't accept zero alphas since in practice they don't make much sense

* update variable naming and doc-strings

* update data retrieval

* include query nesting brackets

* better formatting for understandability

* better formatting for understandability

* update query for better readability with more WITH statements

* new option of passing values to the TestResult to compare these

* separate implementation testing from use case testing

* make independent of numpy

* update tests: new distributions, no scipy and numpy dependency, random numbers generated from seed for reproducibility

* update comment

* optional accuracy through scipy

* refactoring, clean up and formatting

* update comment and type hints

* update type hints for older python versions

* fix type hint: Tuple instead of tuple

* update changelog and include comment about scipy calculation