DataValidator class method obliterates memory

To reproduce:

```py
from datachecker import check_and_export
import json
import pandas as pd
from sklearn.datasets import load_wine

test_df = load_wine(as_frame=True).data
test_df = test_df.rename(columns={"od280/od315_of_diluted_wines": "od280_od315_of_diluted_wines"})

schema = {
    "check_duplicates": True,
    "check_completeness": True,
    "columns": {},
}

# Describe every column in the schema using its pandas dtype
for name, value in test_df.items():
    schema["columns"][name] = {
        "type": str(value.dtype),
        "optional": False,
        "allow_na": True,
    }

with open("schema.json", "w", encoding="utf-8") as f:
    json.dump(schema, f, ensure_ascii=False, indent=4)

check_and_export(
    data=test_df,
    schema="schema.json",
    format="html",
    file="data_checker_default_log.html"
)
```

We get a MemoryError when the `DataValidator._check_completeness` method is called, specifically caused by this line:
https://github.com/ONSdigital/datachecker/blob/c7bb5cbcd69c5737fe95a1783cfa07f78d9e9176/datachecker/data_checkers/pandas_validator.py#L42

`unique_values` in this case will be an array of 13 arrays, with the size of the sub arrays shown below:

<img width="249" height="282" alt="Image" src="https://github.com/user-attachments/assets/9c88ae74-e4f3-46cb-bd01-b1cca3d891a3" />

We then take the cartesian product of this nested array, e.g. `set(product([13 x 126], [13 x 133], [13 x 79], ... , [13 x 121]))`, so memory use explodes very quickly.
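To put a rough number on this, the size of a cartesian product can be computed without building it: it is just the product of the sub-array lengths. The sketch below is illustrative only; `product_size` is a hypothetical helper (not part of datachecker), and the four sizes are simply the ones quoted in the line above, not necessarily the first four columns.

```python
from itertools import product
from math import prod


def product_size(unique_values):
    """Number of tuples in the cartesian product, without materialising it.

    Hypothetical helper for illustration; not part of datachecker.
    """
    return prod(len(values) for values in unique_values)


# Tiny made-up columns (not the wine data) to check the arithmetic
# against an actual materialised product.
small = [[1, 2, 3], ["a", "b"], [True, False]]
assert product_size(small) == len(set(product(*small)))

# The four sub-array sizes quoted in this issue.
quoted_sizes = [126, 133, 79, 121]
four_column_tuples = prod(quoted_sizes)
print(f"{four_column_tuples:,} tuples from only four of the 13 columns")
```

Each of those tuples is a separate Python object held in the `set`, so the count alone understates the memory actually needed.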

A quick screengrab shows the rapid exponential growth in output when we start creating cartesian products of entire columns:

<img width="667" height="290" alt="Image" src="https://github.com/user-attachments/assets/b17589e7-d1cb-4206-8bf0-b5de3a947ddd" />

After just 4 columns we've pretty much run out of memory, and these data sizes are tiny because it's only toy data. With production-size data we'd run out even sooner.

I can't replicate the issue on the R version because it fails somewhere else first, but from looking at the implementation of the completeness checks I believe it has the same issue:
https://github.com/ONSdigital/data.checker/blob/a0113e4604ee3763499cdb96b4423dcebdabd090/R/checks.R#L43
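
One possible direction, sketched here under the assumption that the completeness check is meant to detect whether every combination of column values appears in the data: the observed rows are always a subset of the full cartesian product, so counting distinct observed rows and comparing that to the expected product size gives the number of missing combinations without ever building the product. `missing_combination_count` is a hypothetical name, and this is not how either package currently works.

```python
from math import prod


def missing_combination_count(rows):
    """Count value combinations absent from `rows`.

    `rows` is an iterable of equal-length tuples. Memory use scales with the
    number of distinct rows, not with the size of the cartesian product.
    Hypothetical sketch; assumes the check only needs a count.
    """
    observed = set(rows)
    if not observed:
        return 0

    # Collect the distinct values seen in each column position.
    n_cols = len(next(iter(observed)))
    uniques = [set() for _ in range(n_cols)]
    for row in observed:
        for i, value in enumerate(row):
            uniques[i].add(value)

    # Size of the full product, computed arithmetically.
    expected = prod(len(u) for u in uniques)
    return expected - len(observed)


complete = [(1, "a"), (1, "b"), (2, "a"), (2, "b")]
incomplete = complete[:-1]  # drop (2, "b")
print(missing_combination_count(complete), missing_combination_count(incomplete))
```

This only reports how many combinations are missing, not which ones; listing them would need a lazy iteration over `itertools.product` with a cap on how many are reported.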
