Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support tests against entire datasets #217

Open
kdallmeyer-sr opened this issue Jan 26, 2021 · 1 comment
Open

Support tests against entire datasets #217

kdallmeyer-sr opened this issue Jan 26, 2021 · 1 comment
Labels
enhancement New feature or request

Comments

@kdallmeyer-sr
Copy link
Contributor

A common data validation task is to compare summary statistics. For example, whether the expectation is that the average price of a product is near certain defined mean or in a defined std dev range. Likewise someone might want to compare averages across two attributes.

Typically this would be done in SQL after the data is at rest. It would be great if Baleen supported both out-of-box statistical aggregations like sum,mean,variance, stddev, min, max, and count. Likewise, Baleen should support custom aggregations that can be done across large datasets.

"Product".describeAs {
    "sku".type(StringType())

    "price".type(FloatType(min = 0.0f))

    "salePrice".type(FloatType(min = 0.0f)).describe {
        // Add warning when sale discount > $5
        it.test { dataTrace, data -> 
            if (data["price"].toFloat() - data["salePrice"].toFloat() > 5f) {
                sequenceOf(ValidationWarning(dataTrace, "sale price is more than 5 dollar difference")) 
            } else {
                sequenceOf()
            }
    }

     // Test Mean is higher that 50.
     dataSetTest { dataTrace, dataSet ->
          if(dataSet["price"].mapNotNull { it.toFloatOrNull() }.average() > 50.0f) {
                sequenceOf(ValidationError(dataTrace, "Department ($department) is not a valid value.", value))
          } else {
               sequenceOf()
          }
     }

     // Test Difference between Sale Price and Price is on average less than $5. 
     // This is different than the attribute test since we expect it to happen sometimes but not too much.
     dataSetTest { dataTrace, dataSet ->
          if(dataSet.map { dataRow -> dataRow["price"].toFloat() - dataRow["salePrice"].toFloat() }.average() > 5.0f) {
                sequenceOf(ValidationError(dataTrace, "sale discount is on average higher than $5 ", value))
          } else {
               sequenceOf()
          }
     }
}

Additionally I think it would be interesting to provide a utility function that calculates summary statistics across the entire dataset. Something like

dataSetTest { dataTrace, dataSet ->
    val summary = dataSet.summary()
    summary.forEach { attrSummary ->
        println("""
         # Numeric Summaries when attribute is a numeric value
        Average: ${attrSummary.average}
        Std Dev: ${attrSummary.stdDev}
        Count: ${attrSummary.count}
        Sum: ${attrSummary.sum}
        Min: ${attrSummary.min}
        Max: ${attrSummary.max}

        # String Summaries when attribute is a string
        MinLength: ${attrSummary.minLength}
        MaxLength: ${attrSummary.maxLength}
        AvgLength: ${attrSummary.avgLength}
        CountDistinct: ${attrSummary.countDistinct}

        # Occurence Summaries when attribute is an occurence
        MinLength: ${attrSummary.minLength}
        MaxLength: ${attrSummary.maxLength}
        AvgLength: ${attrSummary.avgLength}
        """.trimIndent())
    }
}
@kdallmeyer-sr kdallmeyer-sr added the enhancement New feature or request label Jan 26, 2021
@kdallmeyer-sr
Copy link
Contributor Author

An idea to solve this:

Collect these tests in it's own sequence or list that are returned to the topmost data description. Then if present, as it reads each data element it creates an internal map of datatrace to summary statistic for that value. For simplicity the summary map would be one giant flat one.
Then apply the dataset tests to the summary map at the end.

This would mean that the CSV validator would have to be modified so that row number is no longer a part of the datatrace. Instead it should be a tag like Json and Xml validators.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant