Support tests against entire datasets #217

kdallmeyer-sr · 2021-01-26T16:39:35Z

A common data validation task is to compare summary statistics. For example, whether the expectation is that the average price of a product is near certain defined mean or in a defined std dev range. Likewise someone might want to compare averages across two attributes.

Typically this would be done in SQL after the data is at rest. It would be great if Baleen supported both out-of-box statistical aggregations like sum,mean,variance, stddev, min, max, and count. Likewise, Baleen should support custom aggregations that can be done across large datasets.

"Product".describeAs {
    "sku".type(StringType())

    "price".type(FloatType(min = 0.0f))

    "salePrice".type(FloatType(min = 0.0f)).describe {
        // Add warning when sale discount > $5
        it.test { dataTrace, data -> 
            if (data["price"].toFloat() - data["salePrice"].toFloat() > 5f) {
                sequenceOf(ValidationWarning(dataTrace, "sale price is more than 5 dollar difference")) 
            } else {
                sequenceOf()
            }
    }

     // Test Mean is higher that 50.
     dataSetTest { dataTrace, dataSet ->
          if(dataSet["price"].mapNotNull { it.toFloatOrNull() }.average() > 50.0f) {
                sequenceOf(ValidationError(dataTrace, "Department ($department) is not a valid value.", value))
          } else {
               sequenceOf()
          }
     }

     // Test Difference between Sale Price and Price is on average less than $5. 
     // This is different than the attribute test since we expect it to happen sometimes but not too much.
     dataSetTest { dataTrace, dataSet ->
          if(dataSet.map { dataRow -> dataRow["price"].toFloat() - dataRow["salePrice"].toFloat() }.average() > 5.0f) {
                sequenceOf(ValidationError(dataTrace, "sale discount is on average higher than $5 ", value))
          } else {
               sequenceOf()
          }
     }
}

Additionally I think it would be interesting to provide a utility function that calculates summary statistics across the entire dataset. Something like

dataSetTest { dataTrace, dataSet ->
    val summary = dataSet.summary()
    summary.forEach { attrSummary ->
        println("""
         # Numeric Summaries when attribute is a numeric value
        Average: ${attrSummary.average}
        Std Dev: ${attrSummary.stdDev}
        Count: ${attrSummary.count}
        Sum: ${attrSummary.sum}
        Min: ${attrSummary.min}
        Max: ${attrSummary.max}

        # String Summaries when attribute is a string
        MinLength: ${attrSummary.minLength}
        MaxLength: ${attrSummary.maxLength}
        AvgLength: ${attrSummary.avgLength}
        CountDistinct: ${attrSummary.countDistinct}

        # Occurence Summaries when attribute is an occurence
        MinLength: ${attrSummary.minLength}
        MaxLength: ${attrSummary.maxLength}
        AvgLength: ${attrSummary.avgLength}
        """.trimIndent())
    }
}

The text was updated successfully, but these errors were encountered:

kdallmeyer-sr · 2021-06-17T03:05:10Z

An idea to solve this:

Collect these tests in it's own sequence or list that are returned to the topmost data description. Then if present, as it reads each data element it creates an internal map of datatrace to summary statistic for that value. For simplicity the summary map would be one giant flat one.
Then apply the dataset tests to the summary map at the end.

This would mean that the CSV validator would have to be modified so that row number is no longer a part of the datatrace. Instead it should be a tag like Json and Xml validators.

kdallmeyer-sr added the enhancement New feature or request label Jan 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support tests against entire datasets #217

Support tests against entire datasets #217

kdallmeyer-sr commented Jan 26, 2021

kdallmeyer-sr commented Jun 17, 2021

Support tests against entire datasets #217

Support tests against entire datasets #217

Comments

kdallmeyer-sr commented Jan 26, 2021

kdallmeyer-sr commented Jun 17, 2021