# How to use `DataAssistants`

* A `DataAssistant` lets you quickly profile your data by providing a thin API over a pre-constructed `RuleBasedProfiler` configuration.
* Profiling returns a result object consisting of
    * `Metrics` that describe the current state of the data
    * `Expectations` that can alert you if the data deviates from the expected state in the future.

* `DataAssistant` results can also be plotted to help you understand your data visually.
* There are multiple `DataAssistants`, each centered around a theme (volume, nullity, etc.). This notebook walks through an example using the `VolumeDataAssistant` to show the capabilities and potential of this new interface.

### What is a `VolumeDataAssistant`?
* The `VolumeDataAssistant` allows you to automatically build a set of `Expectations` that alert you if the volume of records deviates significantly from the norm.

More specifically, the `VolumeDataAssistant` profiles the data and outputs an `ExpectationSuite` containing 2 `Expectation` types

* `expect_table_row_count_to_be_between`
* `expect_column_unique_value_count_to_be_between`

with automatically selected values for the upper and lower bounds. The ranges are selected using a bootstrapping step on the sample `Batches`. This lets the `DataAssistant` account for outliers and obtain a more accurate estimate of the true ranges by taking the underlying distribution into account.
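
To make the bootstrapping idea concrete, here is a minimal standard-library sketch. It is not the `RuleBasedProfiler`'s actual implementation: the row counts, the quantile levels (`0.05` and `0.95`), the number of resamples, and the helper names `quantile` and `bootstrap_range` are all assumptions chosen for illustration. The sketch resamples per-batch row counts with replacement, computes a lower and an upper quantile for each resample, and averages them to obtain a range.

```python
import random
import statistics


def quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def bootstrap_range(values, n_resamples=1000, lower_q=0.05, upper_q=0.95, seed=0):
    """Estimate a (min_value, max_value) range by bootstrapping quantiles."""
    rng = random.Random(seed)
    lower_estimates = []
    upper_estimates = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(values, k=len(values))
        lower_estimates.append(quantile(resample, lower_q))
        upper_estimates.append(quantile(resample, upper_q))
    # Average the per-resample quantiles to get a stable estimate.
    return statistics.mean(lower_estimates), statistics.mean(upper_estimates)


# Hypothetical per-batch row counts; the last value is an outlier.
row_counts = [10000, 10050, 9980, 10020, 9990, 10010, 10030, 9970, 10040, 25000]
min_value, max_value = bootstrap_range(row_counts)
print(f"estimated range: [{min_value:.0f}, {max_value:.0f}]")
```

Fixing the seed makes the estimate reproducible. In this sketch, changing the quantile levels is the main way to make the range tighter or looser.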

In [1]:
import great_expectations as ge
from great_expectations.core.yaml_handler import YAMLHandler
from great_expectations.core.batch import BatchRequest
from great_expectations.core import ExpectationSuite
from great_expectations.validator.validator import Validator
from great_expectations.rule_based_profiler.data_assistant import (
    DataAssistant,
    VolumeDataAssistant,
)
from great_expectations.rule_based_profiler.types.data_assistant_result import (
    DataAssistantResult,
)
yaml = YAMLHandler()


## Set-up: Adding `taxi_data` `Datasource`
* Add `taxi_data` as a new `Datasource`
* We are using an `InferredAssetFilesystemDataConnector` to connect to data in the `test_sets/taxi_yellow_tripdata_samples` folder and get one `DataAsset` (`yellow_tripdata_sample`) that has 36 `Batches`, one per month from 2018 through 2020.

In [2]:
data_context: ge.DataContext = ge.get_context()

In [3]:
data_path: str = "../../../../test_sets/taxi_yellow_tripdata_samples"

datasource_config: dict = {
    "name": "taxi_data_all_years",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "inferred_data_connector_all_years": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": data_path,
            "default_regex": {
                "group_names": ["data_asset_name", "year", "month"],
                "pattern": "(yellow_tripdata_sample)_(2018|2019|2020)-(\\d.*)\\.csv",
            },
        },
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	inferred_data_connector_all_years : InferredAssetFilesystemDataConnector

	Available data_asset_names (1 of 1):
		yellow_tripdata_sample (3 of 36): ['yellow_tripdata_sample_2018-01.csv', 'yellow_tripdata_sample_2018-02.csv', 'yellow_tripdata_sample_2018-03.csv']

	Unmatched data_references (3 of 6):['.DS_Store', 'first_3_files', 'random_subsamples']



<great_expectations.datasource.new_datasource.Datasource at 0x7fe7ab3a5a00>
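
The `default_regex` in the configuration can also be checked outside Great Expectations. The sketch below uses only Python's `re` module and assumes the pattern is matched against the bare file name; how the data connector applies it internally is not shown here. The helper `parse` is hypothetical, and the file names are taken from, or modeled on, the output above.

```python
import re

# Same pattern as in `default_regex`, written as a raw string.
pattern = re.compile(r"(yellow_tripdata_sample)_(2018|2019|2020)-(\d.*)\.csv")
group_names = ["data_asset_name", "year", "month"]

file_names = [
    "yellow_tripdata_sample_2018-01.csv",
    "yellow_tripdata_sample_2020-12.csv",
    ".DS_Store",
    "first_3_files",
]


def parse(file_name):
    """Map each capture group to its group name, or return None if unmatched."""
    match = pattern.match(file_name)
    if match is None:
        return None
    return dict(zip(group_names, match.groups()))


for file_name in file_names:
    print(file_name, "->", parse(file_name))
```

File names that yield `None` here play the same role as the entries listed under `Unmatched data_references` in the output.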

In [4]:
# add_datasource only if it doesn't already exist in our configuration
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)

# Configure `BatchRequest`

In this example, we use a `BatchRequest` that returns all 36 batches of data from the `taxi_data` dataset. It refers to the `Datasource` and `DataConnector` configured in the previous step.

In [5]:
multi_batch_all_years_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_data_all_years",
    data_connector_name="inferred_data_connector_all_years",
    data_asset_name="yellow_tripdata_sample",
)

In [6]:
batch_request: BatchRequest = multi_batch_all_years_batch_request

# Run the `VolumeDataAssistant`

* The `VolumeDataAssistant` can be run directly from the `DataContext` by specifying `assistants` and `volume`, and passing in the `BatchRequest` from the previous step.

In [7]:
result: DataAssistantResult = data_context.assistants.volume.run(batch_request=batch_request)

Created ExpectationSuite "tmp.volume_data_assistant.suite.2f416de9".


Created ExpectationSuite "tmp.test.suite.9353a6f0".


You will see that the `DataAssistant` created a temporary `ExpectationSuite`, which is part of the `DataAssistantResult`. The `run()` method also accepts the following optional parameters:

* `expectation_suite`: An existing `ExpectationSuite` to update.
* `expectation_suite_name`: A name for the returned `ExpectationSuite`.
* `include_citation`: Flag that controls whether the effective `RuleBasedProfiler` configuration is included as a citation in the metadata of the `ExpectationSuite` that is part of the `DataAssistantResult`.
* `save_updated_expectation_suite`: Flag that controls whether the updated `ExpectationSuite` is saved.

Although the `DataAssistant` automatically creates a temporary `ExpectationSuite` when it runs the profiling, we recommend that you specify an `expectation_suite_name` (`taxi_data_suite` in our case) so that the results can be saved more easily later on.

In [9]:
result: DataAssistantResult = data_context.assistants.volume.run(batch_request=batch_request, expectation_suite_name="taxi_data_suite")


# Explore `DataAssistantResult` by plotting

The resulting `DataAssistantResult` is best explored by plotting. For each `Domain` considered (`Table` and `Column` in our case), the plots display the value for each `Batch` (36 in total).

In [10]:
result.plot()

An additional layer of information that can be retrieved from the `DataAssistantResult` is the `prescriptive` information, which corresponds to the range values of the `Expectations` that result from the `DataAssistant` run. 

For example, the `vendor_id` plot shows that the number of distinct `vendor_id` values ranged from 2 to 3 across all of our `Batches`, as indicated by the blue band around the plotted values. These bounds correspond to the `min_value` and `max_value` of the resulting `Expectation`, `expect_column_unique_value_count_to_be_between`.

In [11]:
result.plot(prescriptive=True)
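
To illustrate what the band in the prescriptive plot means, here is a minimal plain-Python sketch of the check behind `expect_column_unique_value_count_to_be_between`. It is not Great Expectations' implementation and treats both bounds as inclusive. The bounds 2 and 3 come from the `vendor_id` example; the helper `unique_value_count_in_range` and the two sample batches are hypothetical.

```python
def unique_value_count_in_range(values, min_value, max_value):
    """Check whether the number of distinct values lies in [min_value, max_value]."""
    observed = len(set(values))
    return {"success": min_value <= observed <= max_value, "observed_value": observed}


# Bounds taken from the `vendor_id` example.
min_value, max_value = 2, 3

# Hypothetical `vendor_id` columns from two future batches.
typical_batch = [1, 2, 2, 1, 2, 1]
unusual_batch = [1, 2, 4, 5, 2, 1]

print(unique_value_count_in_range(typical_batch, min_value, max_value))
print(unique_value_count_in_range(unusual_batch, min_value, max_value))
```

The first batch has two distinct values and falls inside the band. The second has four and would trigger an alert.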

# Save `ExpectationSuite`

Finally, we can save the `ExpectationSuite` produced by the `DataAssistant` by passing it to the `DataContext`'s `save_expectation_suite()` method. The `ExpectationSuite` is saved under the name we specified as `expectation_suite_name` when we ran the `DataAssistant`, which was `taxi_data_suite`.

In [12]:
data_context.save_expectation_suite(expectation_suite=result.expectation_suite)

'/Users/work/Development/great_expectations/tests/test_fixtures/rule_based_profiler/example_notebooks/great_expectations/expectations/taxi_data_suite.json'

## Optional: Clean-up Directory


As part of running this notebook, the `DataAssistant` creates a number of `ExpectationSuite` configurations in the `great_expectations/expectations/tmp` directory. Optionally, run the following cell to clean up the directory.

In [13]:
# import shutil, os
# shutil.rmtree("great_expectations/expectations/tmp")
# os.remove("great_expectations/expectations/.ge_store_backend_id")
# os.remove("great_expectations/expectations/taxi_data_suite.json")
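
The commented-out calls above raise an error if a path is already gone, for example when the cell is run twice. A more forgiving variant is sketched below. The helper name `clean_up` is hypothetical, the paths are the ones from the cell above, and `Path.unlink(missing_ok=True)` assumes Python 3.8 or later.

```python
import shutil
from pathlib import Path


def clean_up(expectations_dir):
    """Remove the temporary suites and generated files; skip anything missing."""
    root = Path(expectations_dir)
    # Temporary suites created by the DataAssistant runs.
    shutil.rmtree(root / "tmp", ignore_errors=True)
    # Files generated while running this notebook.
    for file_name in (".ge_store_backend_id", "taxi_data_suite.json"):
        (root / file_name).unlink(missing_ok=True)


# Uncomment to run against the directory used in this notebook.
# clean_up("great_expectations/expectations")
```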