For reproduction:
```python
from datachecker import check_and_export
import json
import pandas as pd
from sklearn.datasets import load_wine

# Load the wine toy dataset and rename the one column whose name contains a slash.
test_df = load_wine(as_frame=True).data
test_df = test_df.rename(columns={"od280/od315_of_diluted_wines": "od280_od315_of_diluted_wines"})

schema = {
    "check_duplicates": True,
    "check_completeness": True,
    "columns": {},
}

# Add one schema entry per column, using the column's dtype as its type.
for name, value in test_df.items():
    schema["columns"].update({name: {
        "type": str(value.dtype),
        "optional": False,
        "allow_na": True,
    }})

with open("schema.json", "w", encoding="utf-8") as f:
    json.dump(schema, f, ensure_ascii=False, indent=4)

# Running the checks with check_completeness enabled triggers the MemoryError.
check_and_export(
    data=test_df,
    schema="schema.json",
    format="html",
    file="data_checker_default_log.html"
)
```
We get a `MemoryError` when the `DataValidator._check_completeness` method is called, caused specifically by this line:

```python
combinations = set(product(*unique_values))
```
In this case `unique_values` is an array of 13 arrays, one per column; some of the sub-array sizes appear in the call below.
We then take the cartesian product of this nested array, e.g. `set(product([13 x 126], [13 x 133], [13 x 79], ... , [13 x 121]))`, so memory use explodes very quickly.
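To put rough numbers on this, here is a minimal sketch that multiplies the four sub-array sizes quoted above without building any tuples. The name `unique_counts` is an illustrative stand-in rather than a variable from datachecker, and the other nine columns are left out because their sizes are not listed here.

```python
from itertools import accumulate
from operator import mul

# Unique-value counts for the four columns quoted in the issue text.
# The remaining nine columns are elided there, so they are omitted here too.
unique_counts = [126, 133, 79, 121]

# Size of product(*unique_values) after each additional column is included.
running_sizes = list(accumulate(unique_counts, mul))

for n_cols, size in enumerate(running_sizes, start=1):
    print(f"{n_cols} column(s): {size:,} combinations")
```

Each combination is stored as a tuple inside a set, so even these four columns alone would likely need memory on the order of gigabytes, before the other nine are multiplied in.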
A quick screengrab shows the rapid exponential growth in output when we start creating cartesian products of entire columns:
After just 4 columns we're pretty much cooked for memory, and these data sizes are tiny because it's just a toy dataset - with production-size data we'd be done even sooner.
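One possible way around this, assuming the intent of the check is to confirm that every combination of per-column unique values actually occurs in the data, is to compare counts instead of materialising the product. The sketch below is not datachecker code: `is_complete` is a hypothetical helper, it works on plain tuples rather than a DataFrame, and it does not treat missing values specially.

```python
from math import prod


def is_complete(rows):
    """Return True if every combination of per-column unique values occurs in rows.

    Hypothetical helper for illustration; not part of datachecker.
    """
    rows = [tuple(row) for row in rows]
    if not rows:
        return True  # nothing to combine, so nothing can be missing

    # Transpose the rows into columns to count unique values per column.
    columns = list(zip(*rows))

    # Size of the full cartesian product, computed without building it.
    expected = prod(len(set(column)) for column in columns)

    # Observed combinations are always a subset of the full product,
    # so the data is complete exactly when the two counts match.
    observed = len(set(rows))
    return observed == expected


print(is_complete([(1, "a"), (1, "b"), (2, "a"), (2, "b")]))
print(is_complete([(1, "a"), (2, "b")]))
```

Memory use here grows with the number of rows rather than with the product of the unique-value counts.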
I can't replicate the issue on the R version because it fails somewhere else first, but from looking at the implementation of the completeness checks I believe we have the same issue there:
https://github.com/ONSdigital/data.checker/blob/a0113e4604ee3763499cdb96b4423dcebdabd090/R/checks.R#L43
For reference, the Python line quoted above is in datachecker/datachecker/data_checkers/pandas_validator.py, line 42 in c7bb5cb.