Skip to content

Latest commit

 

History

History
109 lines (83 loc) · 4.54 KB

missing.rst

File metadata and controls

109 lines (83 loc) · 4.54 KB

Missing Values Detection

Just The Code

Walkthrough

NannyML's approach to missing values detection is quite straightforward. For each chunk<Data Chunk> NannyML calculates the number of missing values. There is an option, called normalize, to convert the count of values to a relative ratio if needed. The resulting values from the reference data<data-drift-periods> chunks are used to calculate the alert thresholds<Threshold>. The missing values results from the analysis chunks are compared against those thresholds and generate alerts if applicable.

We begin by loading the titanic dataset<dataset-titanic> provided by the NannyML package.

The ~nannyml.data_quality.missing.calculator.MissingValuesCalculator class implements the functionality needed for missing values calculations. We need to instantiate it with appropriate parameters:

  • The names of the columns to be evaluated.
  • Optionally, a boolean option indicating whether we want the absolute count of the missing value instances or their relative ratio. By default it is set to true.
  • Optionally, the name of the column containing the observation timestamps.
  • Optionally, a chunking approach or a predefined chunker. If neither is provided, the default chunker creating 10 chunks will be used.
  • Optionally, a threshold strategy to modify the default one. See available threshold options here<thresholds>.

Next, the ~nannyml.data_quality.missing.calculator.MissingValuesCalculator.fit method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with for alert<Alert> generation. Then the ~nannyml.data_quality.missing.calculator.MissingValuesCalculator.calculate method will calculate the data quality results on the data provided to it.

The results can be filtered to only include a certain data period, method or column by using the filter method. You can evaluate the result data by converting the results into a DataFrame, by calling the ~nannyml.data_quality.missing.result.Result.to_df method. By default this will return a DataFrame with a multi-level index. The first level represents the column, the second level represents resulting information such as the data quality metric values, the alert thresholds or the associated sampling error.

More information on accessing the information contained in the ~nannyml.data_quality.missing.result.Result can be found on the working_with_results page.

The next step is visualizing the results, which is done using the ~nannyml.data_quality.missing.result.Result.plot method. It is recommended to filter results for each column and plot separately.

image

image

image

image

image

image

image

image

image

image

Insights

We see that most of the dataset columns don't have missing values. The Age and Cabin columns are the most interesting with regards to missing values.

What Next

We can also inspect the dataset for Unseen Values in the Unseen Values Tutorial<unseen_values>. Then we can look for any Data Drift present in the dataset using data-drift functionality of NannyML.