Implement Kolmogorov Smirnov Test in SQL-only #28
Conversation
**Kolmogorov-Smirnov in SQL**

Our sample table is calculated by the first query and stored in `cdf_unfilled`:

```sql
SELECT coalesce(tab1.val, tab2.val) as v, tab1.cdf as cdf1, tab2.cdf as cdf2
FROM tab1 FULL OUTER JOIN tab2 ON tab1.val = tab2.val
ORDER BY v
```

**Filling missing values.** Since our discrete values result in a step-function-like CDF, we want to forward-fill missing values in the table:

```sql
grouped_table AS ( -- Step 2: Create a grouper attribute to fill values forward
    SELECT v,
        COUNT(cdf1) over (order by v) as _grp1, cdf1,
        COUNT(cdf2) over (order by v) as _grp2, cdf2
    FROM cdf_unfilled
)
```

and then retrieving the first value per group:

```sql
filled_cdf AS ( -- Step 3: Select first non-null value per group (defined by grouper)
    SELECT v, first_value(cdf1) over (partition by _grp1 order by v) as cdf1_filled,
        first_value(cdf2) over (partition by _grp2 order by v) as cdf2_filled
    FROM grouped_table
)
```

**Replacing NULLs.** As long as the smallest value of both data samples is not the same, e.g., `column1` starts with 1 and `column2` starts with 5, one of the CDF tables will have NULL as its first value. The following query replaces those resulting NULL values with 0, s.t. the distance can be effectively calculated.

**Final Query**

The final query currently has the following form:

```sql
WITH tab1 AS (
    SELECT val, MAX(cdf) as cdf FROM (
        SELECT val, cume_dist() over (order by val) as cdf
        FROM (SELECT {column1} as val FROM {table1}) -- Change TABLE and COL here
        ORDER BY val
    ) GROUP BY val
), tab2 AS (
    SELECT val, MAX(cdf) as cdf FROM (
        SELECT val, cume_dist() over (order by val) as cdf
        FROM (SELECT {column2} as val FROM {table2}) -- Change TABLE and COL here
        ORDER BY val
    ) GROUP BY val
), cdf_unfilled AS (
    SELECT coalesce(tab1.val, tab2.val) as v, tab1.cdf as cdf1, tab2.cdf as cdf2
    FROM tab1 FULL OUTER JOIN tab2 ON tab1.val = tab2.val
    ORDER BY v
), grouped_table AS ( -- Step 2: Create a grouper attribute to fill values forward
    SELECT v,
        COUNT(cdf1) over (order by v) as _grp1, cdf1,
        COUNT(cdf2) over (order by v) as _grp2, cdf2
    FROM cdf_unfilled
), filled_cdf AS ( -- Step 3: Select first non-null value per group (defined by grouper)
    SELECT v, first_value(cdf1) over (partition by _grp1 order by v) as cdf1_filled,
        first_value(cdf2) over (partition by _grp2 order by v) as cdf2_filled
    FROM grouped_table
), replaced_nulls AS ( -- Step 4: Replace NULL values (often at the beginning) with 0 to calculate the difference
    SELECT coalesce(cdf1_filled, 0) as cdf1, coalesce(cdf2_filled, 0) as cdf2
    FROM filled_cdf
)
SELECT MAX(ABS(cdf1-cdf2)) FROM replaced_nulls; -- Step 5: Calculate final statistic
```
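For intuition, the logic of this query can be mirrored in plain Python (an illustrative sketch, not code from this PR; evaluating both ECDFs on the union of observed values plays the role of the forward fill):

```python
import bisect

def ks_statistic(sample1, sample2):
    """Two-sample Kolmogorov-Smirnov statistic, mirroring the SQL pipeline:
    per-sample ECDF (cume_dist), full outer join on observed values,
    missing entries taking the last known CDF value (0 before the first
    observation), then the maximum absolute distance."""
    s1, s2 = sorted(sample1), sorted(sample2)
    values = sorted(set(s1) | set(s2))  # like the FULL OUTER JOIN on val

    def ecdf(sorted_sample, v):
        # fraction of observations <= v; evaluating at every v in the union
        # has the same effect as the forward fill in the SQL query
        return bisect.bisect_right(sorted_sample, v) / len(sorted_sample)

    return max(abs(ecdf(s1, v) - ecdf(s2, v)) for v in values)
```

For example, `ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])` yields 0.5, the largest gap between the two step functions.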
**Theoretical foundation for the p-value**

The query above only calculates the statistic of the Kolmogorov-Smirnov test, which is defined as

$$D_{n,m} = \sup_x |F_1(x) - F_2(x)|$$

**Determining acceptance.** We can now accept or reject the null hypothesis by checking whether

$$D_{n,m} > c(\alpha) \sqrt{\frac{n + m}{n \cdot m}}, \qquad c(\alpha) = \sqrt{-\ln\!\left(\frac{\alpha}{2}\right) \cdot \frac{1}{2}}$$

**Approximating the p-value.** Since this value is test-specific and harder to interpret across applications, it would be beneficial to have a p-value at hand! The tail can be approximated as

$$p \approx 2 e^{-2 d^2 \frac{n \cdot m}{n + m}}$$

After experimenting with a few test datasets, both very small (~50 samples) and very large (~100,000,000 samples), I can confirm that this approximation is mostly correct for the first few decimal places. (Very unscientific, I know.)

**Implementation:** For samples with
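In code form, the acceptance check and the tail approximation could be sketched as follows (variable names `d`, `n`, `m` as used in the thread; this illustrates the standard two-sample formulas, not the PR's actual implementation):

```python
import math

def c(alpha: float) -> float:
    # critical coefficient c(alpha) = sqrt(-ln(alpha / 2) / 2)
    return math.sqrt(-math.log(alpha / 2.0) / 2.0)

def reject_null(d: float, n: int, m: int, alpha: float) -> bool:
    # reject H0 (same distribution) iff d exceeds the critical distance
    return d > c(alpha) * math.sqrt((n + m) / (n * m))

def approx_p_value(d: float, n: int, m: int) -> float:
    # asymptotic two-sided tail: p ~ 2 * exp(-2 * d^2 * n*m / (n + m)), capped at 1
    return min(1.0, 2.0 * math.exp(-2.0 * d * d * n * m / (n + m)))
```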
Great job!
I'm not sure how to deal with tests of large datasets. Maybe those will need to be executed manually on demand.
Well, if you write the tests and it passes on the three databases we test against, then you're done.
I don't think it'll be very feasible or advisable to attempt that. ORMs have some limitations, and sometimes going straight to SQL is necessary and advisable. Anyhow, besides some minor issues that I'll point out later when this PR leaves draft mode, all looks good. Perhaps try to be more consistent in the type annotations, including return values.
Some of the errors you see in tests are typing-related.
Maybe @YYYasin19 meant the sqlalchemy language expression API? We use that in the rest of
Duh, sorry. Still, in this case, I'd advise against it. I'm not opposed to it, I just don't think it's worth the effort. Maybe the only reason to do it would be to approach multi-backend support by leveraging sqlalchemy.
Okay, interesting. I am strongly in favour of it - though I would be fine with postponing it until after this PR. @YYYasin19 I would suggest that if you are motivated to translate the raw sql query to the language expression API, or see a point in doing so, that you do so - either in this PR already or right after it. If you don't, I would suggest opening an issue to ensure that someone else does it eventually. Some of my reasons:
Is there a reason why our MSSQL code is converting the table specification in
I'm not perfectly sure - I would've thought that

Independently of that, I know that we sometimes ran into trouble with mssql when it came to referring to objects. E.g.:
https://docs.microsoft.com/en-us/sql/odbc/microsoft/table-name-limitations?view=sql-server-ver16
Yes, it does, I just had wrapped a

```python
if is_mssql(engine):
    table1_selection = str(table1_selection).replace('"', "")
    table2_selection = str(table2_selection).replace('"', "")
```

Note: (2) rely on
src/datajudge/constraints/stats.py
```python
    ) -> Tuple[Any, OptionalSelections]:
        return db_access.get_column(engine, ref)

    @staticmethod
    def check_acceptance(d: float, n: int, m: int, accepted_level):
```
Unless `d`, `n`, `m` are well-known shortcuts in statistics, I'd go with proper variable names here and have a note on what the equivalent is on that Wikipedia page, e.g.

```python
def ... sample1_size: int, sample2_size: int ...
    """
    See [wikipedia url] for reference with n = sample1_size, m = sample2_size ...
    """
```
Also a docstring for this function would be great
I do agree with both comments; however, since `check_acceptance` is an internal function, I'm not too fussed about the variable naming IF you improve the docstring. That said, improving the argument names would be a great benefit… we're not in the '70s anymore, where economy of names was a concern.
Haha I get your points, I just didn't think about it during the initial setup since all variables in the formulas are called `n` and `m` as well. `n` is very common in statistics for the sample size.
src/datajudge/constraints/stats.py
```python
    @staticmethod
    def check_acceptance(d: float, n: int, m: int, accepted_level):
        def c(alpha: float):
            if alpha == 0.0:
```
`==` comparisons with floating point numbers are dangerous, generally speaking. If you want to avoid log(0), I think it would be better to just do `log(alpha / 2.0 + 1e-10)`. This also covers the problem of very small `alpha`. Could add an `assert alpha > 0`.
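A minimal sketch of that suggestion (the `1e-10` epsilon and the assert are the reviewer's proposal here, not the merged code):

```python
import math

def c(alpha: float) -> float:
    # epsilon guard avoids log(0) without an exact floating-point comparison
    assert alpha >= 0.0, "alpha must be non-negative"
    return math.sqrt(-math.log(alpha / 2.0 + 1e-10) / 2.0)
```

With this guard, `c(0.0)` stays finite while typical values such as `c(0.05)` are essentially unchanged.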
A very, very small alpha is entirely feasible, since sometimes p-values can be in the range of 1e-50 for very large datasets.
I wanted to support `alpha = 0` because that would be like stating that the test should pass every time (since the p-value is in range `(0, 1]`). But I guess in practice nobody would create a test that should pass every time.
How do I find out if my approximation of the p-value is good enough? Is it feasible to check whether the user has scipy installed and, if so, just use scipy?
Once again, good job!
Usually, the level of "precision" shown by calculators and computers is somewhat artificial. I'm not a statistician, but I'd say that any difference beyond 1e-06, in this case, is irrelevant. Usually one uses a p-value of 0.05 or, if you want to be very picky, 0.01. A difference like the one you find between the value calculated via SQL and scipy will not make anyone accept or reject the null hypothesis. Along these lines, your comparisons in the tests, like
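For instance, such a tolerance could be expressed with `math.isclose` (a sketch; the `1e-6` bound is just the rough threshold mentioned above, and the function name is illustrative):

```python
import math

def p_values_agree(p_sql: float, p_scipy: float, tol: float = 1e-6) -> bool:
    # differences below tol will not flip an accept/reject decision at 0.05 or 0.01
    return math.isclose(p_sql, p_scipy, abs_tol=tol)
```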
Well, no... this is only executed in tests. Users would never use this, so it's irrelevant if users have `scipy` installed.
The problem is that in practice, for one of our tests, the reported p-value in the magnitude of
In the end, it would just be an extra statement within the approximation method:

```python
try:
    import scipy
except ModuleNotFoundError:
    return 2*math.exp(...)  # old method of approximation

# accurate method
return scipy.stats.kstwo.sf(d, n_samples)
```

I think that's feasible 👍
Mh, that's a problem... and it's a rather general issue with p-values. Although I did try to look for possible solutions, I didn't find any. One idea would be to base the data test around the value
My experience is only anecdotal as well, but from what I've seen for various sample sizes (
@ivergara Can you remove scipy as a prod dependency from the conda-feedstock after the merge of this?
Great work!
## Change

This PR implements the Kolmogorov Smirnov test in pure SQL, which is then run directly on the database.

## Commits

* first integration: ks-test in database functionality
* integrate sql query with data refs
* formatting
* formatting
* refactoring: call `test` directly to access sql-result of KS test
* fix row count retrieval
* fix acceptance level domain error
* fix alpha adjustment
* fix type hints for python<3.10
* update sql query for postgres: all tables need to have an alias assigned to them
* fix: typo
* update query for mssql server
* add check for column names
* alternative way of getting table name, incl. hot fix for mssql quotation marks in table reference
* don't accept zero alphas since in practice they don't make much sense
* update variable naming and doc-strings
* update data retrieval
* include query nesting brackets
* better formatting for understandibility
* better formatting for understandibility
* update query for better readibility with more WITH statements
* new option of passing values to the TestResult to compare these
* seperate implementation testing from use case testing
* make independent of numpy
* update tests: new distributions, no scipy and numpy dependency, random numbers generated from seed for reproducability
* update comment
* optional accuracy through scipy
* refactoring, clean up and formatting
* update comment and type hints
* update tpye hints for older python versions
* fix type hint: Tuple instead of tuple
* update changelog and include comment about scipy calculation
Hey!
In this PR I'd like to refactor the work of #23 and implement the Kolmogorov-Smirnov test in SQL-only.
This has the benefit that for large datasets, we won't have to load in all the values to pass them to scipy.
Additionally, this would remove scipy as a dependency again! 🥳
The core work of this PR consists of creating an SQL query that is able to calculate the Kolmogorov-Smirnov test.
The idea behind this implementation, as well as some theoretical background, is explained in the comments below.
The following issues will definitely need to be addressed:
* Remove scipy dependency - needs to be done on conda-feedstock
* Transform query to `sqlalchemy`'s query API - Re-implement KS test in sqlalchemy expression language API #29 issue opened