
Implement Kolmogorov-Smirnov Test in SQL-only #28

Merged

merged 32 commits into main from kstest-sql-only on Jun 30, 2022

Conversation

YYYasin19
Contributor

@YYYasin19 YYYasin19 commented Jun 24, 2022

Hey!

In this PR I'd like to refactor the work of #23 and implement the Kolmogorov-Smirnov test in SQL only.
This has the benefit that for large datasets, we won't have to load in all the values to pass them to scipy.
Additionally, this would remove scipy as a dependency again! 🥳

The core work of this PR consists of creating an SQL query that is able to calculate the Kolmogorov-Smirnov test.

The idea behind this implementation, as well as some theoretical background, is explained in the comments below.

The following issues will definitely need to be addressed:

  • Create more tests, especially with very large datasets
  • Refactor tests, as p-values are not always available now
  • Remove scipy dependency needs to be done on conda-feedstock
  • Test if the query is really database-agnostic, i.e., follows some SQL standard
  • Transform query to sqlalchemy's query API (follow-up issue opened: Re-implement KS test in sqlalchemy expression language API #29)
  • Validate p-value approximation

@YYYasin19 YYYasin19 added enhancement New feature or request dependencies Pull requests that update a dependency file labels Jun 24, 2022
@YYYasin19 YYYasin19 self-assigned this Jun 24, 2022
@YYYasin19
Contributor Author

Kolmogorov-Smirnov in SQL

Our sample table table1 has the following CDF

| val | cdf          |
|-----|--------------|
| 1   | 0.2352941176 |
| 2   | 0.3529411765 |
| 3   | 0.4705882353 |
| 4   | 0.5882352941 |
| 5   | 0.7058823529 |
| 6   | 0.8235294118 |
| 7   | 0.9411764706 |
| 8   | 1            |

which is calculated by the first query and stored in tab1. The same applies to tab2.
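The semantics of `cume_dist()` used here can be sketched in plain Python (a hypothetical helper, not the PR's code): for each value v, the CDF is the share of rows with `val <= v`. The 17-row sample below is one sample consistent with the table above.

```python
from fractions import Fraction

def cume_dist(values):
    """Emulate SQL's cume_dist(): for each distinct value v, return
    the fraction of rows with val <= v (exact arithmetic via Fraction)."""
    n = len(values)
    return {v: Fraction(sum(1 for x in values if x <= v), n)
            for v in set(values)}

# A 17-row sample whose distinct values 1..8 reproduce the CDF table above:
# four 1s, two each of 2..7, and a single 8.
sample = [1] * 4 + [v for v in range(2, 8) for _ in range(2)] + [8]
cdf = cume_dist(sample)  # cdf[1] == 4/17 ≈ 0.2352941176, cdf[8] == 1
```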
Since the range of the values in table1 might differ from the range of values in table2, we perform a FULL OUTER JOIN between the CDFs based on their index (here: val, i.e., the x-axis).

SELECT coalesce(tab1.val, tab2.val) as v, tab1.cdf as cdf1, tab2.cdf as cdf2
FROM tab1 FULL OUTER JOIN tab2 ON tab1.val = tab2.val
ORDER BY v

Using coalesce we combine the indices, which results in the following table:

| v  | cdf1         | cdf2         |
|----|--------------|--------------|
| 1  | 0.1666666667 | 0.2352941176 |
| 2  | 0.2222222222 | 0.3529411765 |
| 3  | 0.2777777778 | 0.4705882353 |
| 4  | 0.3333333333 | 0.5882352941 |
| 5  | 0.3888888889 | 0.7058823529 |
| 6  | 0.4444444444 | 0.8235294118 |
| 7  | 0.5          | 0.9411764706 |
| 8  |              | 1            |
| 16 | 0.5555555556 |              |
| 17 | 0.6111111111 |              |
| 18 | 0.6666666667 |              |
| 19 | 0.7222222222 |              |
| 20 | 0.7777777778 |              |
| 21 | 0.8333333333 |              |
| 22 | 0.8888888889 |              |
| 23 | 0.9444444444 |              |
| 24 | 1            |              |

Filling missing values

Since our discrete values result in a step-function-like CDF, we want to forward-fill missing values in the table.
This is done by first assigning a grouper_id to each (index, cdf_value) combination with

grouped_table AS ( -- Step 2: Create a grouper attribute to fill values forward
    SELECT v,
        COUNT(cdf1) over (order by v) as _grp1, cdf1,
        COUNT(cdf2) over (order by v) as _grp2, cdf2
    FROM cdf_unfilled

and then retrieving the first value per group here

filled_cdf AS ( -- Step 3: Select first non-null value per group (defined by grouper)
    SELECT v, first_value(cdf1) over (partition by _grp1 order by v) as cdf1_filled,
            first_value(cdf2) over (partition by _grp2 order by v) as cdf2_filled
    FROM grouped_table
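The grouper trick can be mirrored in plain Python (an illustrative sketch, not the PR's code; `None` stands for SQL NULL): the running count of non-NULL values gives every row the id of the last non-NULL entry seen so far, so taking each group's first value fills the gaps forward.

```python
def forward_fill(column):
    """Emulate Steps 2 + 3: COUNT(col) OVER (ORDER BY v) assigns each row
    the id of the last non-NULL value seen so far; first_value() per
    group then carries that value forward."""
    filled, last, grp = [], None, 0
    for x in column:
        if x is not None:    # a non-NULL row opens a new group ...
            grp += 1
            last = x
        filled.append(last)  # ... and is carried into the NULL rows after it
    return filled

# cdf2 column of the joined table: values 16..24 had no entry in table2
col = [0.235, 0.353, None, None]
assert forward_fill(col) == [0.235, 0.353, 0.353, 0.353]
```

Leading NULL rows (group id 0) stay `None`, matching the SQL, which is why Step 4 still has to replace them with 0.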

Replacing NULLs

As long as the smallest value of the two data samples differs, e.g., column1 starts with 1 and column2 starts with 5, one of the CDF tables will have NULL as its first value. The following query replaces those NULL values with 0 so that the distance can be calculated correctly.

Final Query

The final query currently has the following form.
Some of the subqueries can still be optimized. The current structure allows us to insert datajudge's get_clause into the query string to obtain a filtered table, and to use a val alias on the target column.

WITH tab1 AS (
  SELECT val, MAX(cdf) as cdf FROM (
      SELECT val, cume_dist() over (order by val) as cdf
      FROM (SELECT {column1} as val FROM {table1}) -- Change TABLE and COL here
      ORDER BY val
  ) GROUP BY val

), tab2 AS (
  SELECT val, MAX(cdf) as cdf FROM (
      SELECT val, cume_dist() over (order by val) as cdf
      FROM (SELECT {column2} as val FROM {table2}) -- Change TABLE and COL here
      ORDER BY val
  ) GROUP BY val

), cdf_unfilled AS (
  SELECT coalesce(tab1.val, tab2.val) as v, tab1.cdf as cdf1, tab2.cdf as cdf2
  FROM tab1 FULL OUTER JOIN tab2 ON tab1.val = tab2.val
  ORDER BY v

), grouped_table AS ( -- Step 2: Create a grouper attribute to fill values forward
    SELECT v,
        COUNT(cdf1) over (order by v) as _grp1, cdf1,
        COUNT(cdf2) over (order by v) as _grp2, cdf2
    FROM cdf_unfilled

), filled_cdf AS ( -- Step 3: Select first non-null value per group (defined by grouper)
    SELECT v, first_value(cdf1) over (partition by _grp1 order by v) as cdf1_filled,
            first_value(cdf2) over (partition by _grp2 order by v) as cdf2_filled
    FROM grouped_table

), replaced_nulls AS ( -- Step 4: Replace NULL values (often at the beginning) with 0 to calculate the difference
  SELECT coalesce(cdf1_filled, 0) as cdf1, coalesce(cdf2_filled, 0) as cdf2
  FROM filled_cdf
  )

SELECT MAX(ABS(cdf1-cdf2)) FROM replaced_nulls; -- Step 5: Calculate final statistic
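As a sanity check, the five steps can be mirrored in plain Python (an illustrative sketch, not the PR's code; evaluating each CDF directly at every point of the coalesced index subsumes the forward-fill and NULL-replacement steps):

```python
def ks_statistic(sample1, sample2):
    """Mirror the SQL pipeline: build both empirical CDFs, evaluate them
    on the union of observed values (the coalesced index column), and
    take the maximum absolute difference."""
    def ecdf(sample):
        n, s = len(sample), sorted(sample)
        return lambda v: sum(1 for x in s if x <= v) / n
    f1, f2 = ecdf(sample1), ecdf(sample2)
    grid = sorted(set(sample1) | set(sample2))  # FULL OUTER JOIN + coalesce
    return max(abs(f1(v) - f2(v)) for v in grid)

# Two samples with shifted ranges, as in the example tables:
d = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])  # d == 0.5
```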

@YYYasin19
Contributor Author

YYYasin19 commented Jun 24, 2022

Theoretical foundation for p-value

The query above only calculates the statistic of the Kolmogorov-Smirnov test, which is defined as
$$\sup_{x} \lvert F_1(x) - F_2(x) \rvert$$
where $F_1(x)$ and $F_2(x)$ are the CDFs of the two samples $s_1$ and $s_2$ respectively.

Determining acceptance

We can now accept or reject the null hypothesis $H_0$ of our test based on our test statistic $d$ and our sample sizes $n$ and $m$: $H_0$ is accepted if
$$ d \le \sqrt{-\frac{1}{2}\ln{\frac{\alpha}{2}}} \cdot \sqrt{\frac{n+m}{n \cdot m}} $$
In this case, $H_0$ states that the two samples stem from the same distribution.
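A sketch of this acceptance check in Python (illustrative names, not necessarily the PR's exact code; it implements the standard two-sample critical value):

```python
import math

def check_acceptance(d: float, n: int, m: int, alpha: float) -> bool:
    """Accept H0 iff d <= c(alpha) * sqrt((n + m) / (n * m)),
    with c(alpha) = sqrt(-0.5 * ln(alpha / 2))."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return d <= c_alpha * math.sqrt((n + m) / (n * m))
```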

Approximating the p-value

Since this value is test-specific and harder to interpret across applications, it would be beneficial to have a p-value at hand!
Unfortunately, the distribution of the KS statistic is unwieldy, making exact calculations hard. (Facchinetti, 2007) describes the following formula as an approximation of the p-value given $d$ and sample size $n$ for samples with $n \gt 35$:
$$ p \approx 2\exp\left(-2\left(d\sqrt{n}\right)^2\right) $$

After experimenting with a few test datasets, both very small (~50 samples) and very large (~100,000,000 samples), I can confirm that this approximation is mostly correct for the first few decimal places. (Very unscientific, I know).

Implementation:

For samples with $n \lt 35$ we won't provide a p-value for now. Accepting or rejecting the test will still be based on the above inequality; the approximate p-value is solely for reporting purposes.
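Put together, the approximation could look like this in Python (an illustrative sketch, not the PR's exact code; using the effective sample size $n \cdot m / (n + m)$ for the two-sample case and clamping the result to 1 are assumptions of this sketch):

```python
import math
from typing import Optional

def approx_p_value(d: float, n: int, m: int) -> Optional[float]:
    """Leading-order approximation p ~ 2 * exp(-2 * (d * sqrt(n_eff))^2)
    with effective sample size n_eff = n * m / (n + m).
    Returns None below n_eff = 35, where the approximation is unreliable."""
    n_eff = n * m / (n + m)
    if n_eff < 35:
        return None
    return min(1.0, 2.0 * math.exp(-2.0 * d * d * n_eff))
```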

@ivergara
Collaborator

ivergara commented Jun 24, 2022

Great job!

The following issues will definitely need to be addressed:

  • Create more tests, especially with very large datasets

I'm not sure how to deal with tests of large datasets. Maybe those will need to be executed manually on demand.

  • Test if the query is really database-agnostic, i.e., follows some SQL standard

Well, if you write the tests and they pass on the three databases we test against, then you're done.

  • Transform query to sqlalchemy's query API

I don't think it'll be very feasible or advisable to attempt that. ORMs have some limitations and sometimes going straight to SQL is necessary and advisable.

Anyhow, besides some minor issues that I'll point out later when this PR leaves draft mode, all looks good. Perhaps try to be more consistent in the type annotations including return values.

@ivergara
Collaborator

Some of the errors you see in the tests are typing-related: float | None is Python 3.10 syntax, and we're still supporting some older versions. You'll have to go with Optional[float] for a while.
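For illustration, the backwards-compatible spelling (hypothetical signature, body elided):

```python
from typing import Optional

# Python 3.10+ only:  def approximate_p_value(...) -> float | None
# Works on 3.8/3.9 as well:
def approximate_p_value(d: float, n_samples: int) -> Optional[float]:
    ...  # body elided; returns a float or None
```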

@kklein
Collaborator

kklein commented Jun 24, 2022

I don't think it'll be very feasible or advisable to attempt that. ORMs have some limitations and sometimes going straight to SQL is necessary and advisable.

Maybe @YYYasin19 meant the sqlalchemy language expression api? We use that in the rest of db_access. We don't use the ORM api.

@ivergara
Collaborator

ivergara commented Jun 24, 2022

I don't think it'll be very feasible or advisable to attempt that. ORMs have some limitations and sometimes going straight to SQL is necessary and advisable.

Maybe @YYYasin19 meant the sqlalchemy language expression api? We use that in the rest of db_access. We don't use the ORM api.

Duh, sorry. Still, in this case, I'd advise against it. I'm not opposed to it, I just don't think it's worth the effort. Maybe the only reason to do it would be to approach multi-backend support by leveraging sqlalchemy.

@kklein
Collaborator

kklein commented Jun 24, 2022

Duh, sorry. Still, in this case, I'd advise against it. I'm not opposed to it, I just don't think it's worth the effort.

Okay, interesting. I am strongly in favour of it - though would be fine with postponing it until after this PR.

@YYYasin19 I would suggest that if you are motivated to translate the raw SQL query to the language expression api, or see a point in doing so, you do it - either in this PR already or right after it. If you don't, I would suggest opening an issue to ensure that someone else does it eventually.

Some of my reasons:

  • A priori I see a lot of value in consistency. As of now the rest of db_access uses the language expression api.
  • I don't think the effort is actually high once one is somewhat familiar with the api. If one isn't, one might either want to become familiar, or delegate the task to someone who is.
  • The language expression api has (subjectively) helped us reduce the lines of code written - by allowing us to reuse code, isolate code and modularize it. As a consequence, we have basically never encountered bugs on the query level.

@YYYasin19 YYYasin19 marked this pull request as ready for review June 24, 2022 16:42
@YYYasin19 YYYasin19 marked this pull request as draft June 24, 2022 16:43
@YYYasin19
Contributor Author

Is there a reason why our MSSQL code converts the table specification in ref.get_selection() into a string with quotation marks, like "tempdb.dbo".int_table1.col_int?
This breaks the query for now; other than that, everything seems to work 🥳

@kklein
Collaborator

kklein commented Jun 24, 2022

why our MSSQL code is converting the table specification in ref.get_selection() into a string with quotation marks, like this

I'm not perfectly sure - I would've thought that get_selection returns a selection rather than a string but I might be missing something here.

Independently of that, I know that we sometimes ran into trouble with mssql when it came to referring to objects. E.g.:

Table names can contain any valid characters (for example, spaces). If table names contain any characters except letters, numbers, and underscores, the name must be delimited by enclosing it in back quotes (`).

https://docs.microsoft.com/en-us/sql/odbc/microsoft/table-name-limitations?view=sql-server-ver16

@YYYasin19
Contributor Author

I'm not perfectly sure - I would've thought that get_selection returns a selection rather than a string but I might be missing something here.

Yes, it does, I just had wrapped a str(...) around it 😬 .
I now have a solution that is very sub-optimal as I:
(1) replace " in MS-SQL table definitions manually

    if is_mssql(engine):
        table1_selection = str(table1_selection).replace('"', "")
        table2_selection = str(table2_selection).replace('"', "")

Note: sa.select(...) is the culprit here and returns "tempdb.dbo".table_name instead of tempdb.dbo.table_name.

(2) rely on data_source.table_name, which is not available for RawQueryDataSources.
Also, this would not include any WHERE clauses, which makes it straight-up wrong in many cases.

) -> Tuple[Any, OptionalSelections]:
return db_access.get_column(engine, ref)
@staticmethod
def check_acceptance(d: float, n: int, m: int, accepted_level):
Contributor

Unless d, n, m are well-known shortcuts in statistics, I'd go with proper variable names here and add a note about the equivalents on that Wikipedia page, e.g.

def ... sample1_size: int, sample2_size: int ...
"""
See [wikipedia url] for reference with n = sample1_size, m = sample2_size ...
"""

Contributor

Also a docstring for this function would be great

Collaborator

I do agree with both comments; however, since check_acceptance is an internal function I'm not too fussed about the variable naming IF you improve the docstring. Still, improving the argument names would be a great benefit.. we're not in the 70's anymore, where economy of names was a concern.

Contributor Author

Haha, I get your points; I just didn't think about it during the initial setup since all variables in the formulas are called n and m as well. n is very common in statistics for the sample size.

@staticmethod
def check_acceptance(d: float, n: int, m: int, accepted_level):
def c(alpha: float):
if alpha == 0.0:
Contributor

== comparisons with floating point numbers are dangerous, generally speaking.

If you want to avoid log(0) I think it would be better to just do log(alpha / 2.0 + 1e-10). This also covers the problem of very small alpha. Could add an assert alpha > 0.

Contributor Author

A very very small alpha is entirely feasible, since sometimes p-values can be in the range of 1e-50 for very large datasets.
I wanted to support alpha = 0 because that would be like stating that the test should pass every time (since the p-value is in range (0, 1]). But I guess in practice nobody would create a test that should pass every time.
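The reviewer's suggestion could be sketched like this (illustrative, not the PR's code; the epsilon value is an assumption):

```python
import math

def critical_value(alpha: float, n: int, m: int) -> float:
    """Critical distance for the two-sample KS test. Clamps alpha away
    from 0 instead of comparing floats with ==, so alpha == 0.0 (a test
    that should always pass) stays finite."""
    assert 0.0 <= alpha <= 1.0
    safe_alpha = max(alpha, 1e-10)   # avoids log(0); epsilon is illustrative
    c = math.sqrt(-0.5 * math.log(safe_alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))
```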

src/datajudge/db_access.py
@YYYasin19
Contributor Author

YYYasin19 commented Jun 29, 2022

How do I find out if my approximation of the p-value is good enough?
Left is the approximation, right is the "correct" one from scipy
0.00035221594346540835 vs 0.00035250250849476917
The difference is 2.8656502936081404e-07

Is it feasible to check whether the user has scipy installed and if so, just use scipy?

@YYYasin19 YYYasin19 marked this pull request as ready for review June 29, 2022 09:48
Collaborator

@ivergara ivergara left a comment

Once again, good job!

tests/integration/test_integration.py
@ivergara
Collaborator

ivergara commented Jun 29, 2022

How do I find out if my approximation of the p-value is good enough? Left is the approximation, right is the "correct" one from scipy 0.00035221594346540835 vs 0.00035250250849476917 The difference is 2.8656502936081404e-07

Usually, the level of "precision" shown by calculators and computers is somewhat artificial. I'm not a statistician, but I'd say that any difference beyond 1e-06, in this case, is irrelevant. Usually one uses a p-value of 0.05, or 0.01 if you want to be very picky. A difference between the values calculated via SQL and scipy like the one you found will not make anyone accept or reject the null hypothesis.

Along these lines, in your comparisons in the tests, like assert 5.551115123125783e-17 <= 1e-50, the 1e-50 is too extreme a value.

Is it feasible to check whether the user has scipy installed and if so, just use scipy?

Well, no... this is only executed in test. Users would never use this, so it's irrelevant if users have scipy installed or not.

@YYYasin19
Contributor Author

YYYasin19 commented Jun 29, 2022

Usually one uses the p value of 0.05 or if you want to be very picky 0.01. A difference in the calculated value via SQL and scipy like you find will not make anyone accept or reject the null hypothesis.

The problem is that in practice, for one of our tests, the reported p-value is in the magnitude of 1e-53 because the table has 100+ million rows. Hence my idea to use scipy in very scientific environments where accuracy is required.

Well, no... this is only executed in test.
No, I mean in production.

In the end, it would just be an extra statement within the approximation method

try:
    import scipy.stats
except ModuleNotFoundError:
    return 2 * math.exp(...)  # old method of approximation

# accurate method
return scipy.stats.kstwo.sf(d, n_samples)

I think that's feasible 👍

src/datajudge/constraints/stats.py
src/datajudge/db_access.py
@ivergara
Collaborator

Usually one uses the p value of 0.05 or if you want to be very picky 0.01. A difference in the calculated value via SQL and scipy like you find will not make anyone accept or reject the null hypothesis.

The problem is that in practice for one of our tests the reported p-value in the magnitude of 1e-53 because the table has 100+ million rows. Therefore, my idea to use scipy in very scientific environments where accuracy is required.

Mh, that's a problem... and it's a rather general issue with p-values.

Although I did try to look for possible solutions, I didn't find any.

One idea would be to base the data test around the value D, the maximum distance between the samples. The test would then receive a threshold for D that needs to be fulfilled. Now, I agree that this solution is not very satisfactory, as it just pushes the problem to defining a sensible threshold, and I personally have no idea how to interpret D well enough to choose a number for that threshold.

@YYYasin19
Contributor Author

YYYasin19 commented Jun 29, 2022

One idea would be to base the data test around the value D, the maximum distance between the samples. Then the test receives a threshold for D that needs to be fulfilled. Now, I agree that this solution is not very satisfactory as it just kicks the problem to define a sensible threshold and at least I personally have no idea how to interpret D enough to be able to choose a number for the threshold.

My experience is only anecdotal as well, but from what I've seen for various sample sizes (10, 10_000 and 100_000_000), the test statistic d doesn't get too small. Therefore, I chose 1e-10 as the allowed deviation.

@YYYasin19
Contributor Author

@ivergara Can you remove scipy as a prod dependency from the conda-feedstock after the merge of this?

@YYYasin19 YYYasin19 merged commit e08267a into main Jun 30, 2022
@jonashaag
Contributor

Great work!

@ivergara ivergara deleted the kstest-sql-only branch June 30, 2022 18:37
kklein pushed a commit that referenced this pull request Jul 25, 2022
## Change
This PR implements the Kolmogorov Smirnov test in pure SQL which is then run directly on the database.

## Commits
* first integration: ks-test in database functionality

* integrate sql query with data refs

* formatting

* formatting

* refactoring: call `test` directly to access sql-result of KS test

* fix row count retrieval

* fix acceptance level domain error

* fix alpha adjustment

* fix type hints for python<3.10

* update sql query for postgres: all tables need to have an alias assigned to them

* fix: typo

* update query for mssql server

* add check for column names

* alternative way of getting table name, incl. hot fix for mssql quotation marks in table reference

* don't accept zero alphas since in practice they don't make much sense

* update variable naming and doc-strings

* update data retrieval

* include query nesting brackets

* better formatting for understandability

* better formatting for understandability

* update query for better readability with more WITH statements

* new option of passing values to the TestResult to compare these

* separate implementation testing from use case testing

* make independent of numpy

* update tests: new distributions, no scipy and numpy dependency, random numbers generated from seed for reproducibility

* update comment

* optional accuracy through scipy

* refactoring, clean up and formatting

* update comment and type hints

* update type hints for older python versions

* fix type hint: Tuple instead of tuple

* update changelog and include comment about scipy calculation