# How to use `DataAssistants`

* A `DataAssistant` enables you to quickly profile your data by providing a thin API over a pre-constructed `RuleBasedProfiler` configuration.
* Profiling returns a result object consisting of:
    * `Metrics` that describe the current state of the data
    * `Expectations` that can alert you if the data deviates from the expected state in the future.

* `DataAssistant` results can also be plotted to help you understand your data visually.
* There are multiple `DataAssistants`, each centered around a theme (volume, nullity, etc.). This notebook walks you through an example using the `Onboarding DataAssistant` to show the capabilities and potential of this new interface.

### What is an `Onboarding DataAssistant`?
* The `Onboarding DataAssistant` is what we get when we use the CLI and select `profiling`.
* It is considered to be the "starting point", and is generally applicable to numeric data.
* In our example we are using `taxi_data`.
* It calculates the following 7 Expectations against your data.

#### These are the `Table` Expectations.
* `expect_table_columns_to_match_set`
   * The set of columns is unlikely to change, so this Expectation flags the cases where it does.
* `expect_table_row_count_to_be_between`
   * This could be useful for tracking the number of "events" per month.

#### These are the `Column` Expectations. 
* `expect_column_min_to_be_between`
* `expect_column_max_to_be_between`
* `expect_column_mean_to_be_between`
* `expect_column_median_to_be_between`
* `expect_column_stdev_to_be_between`: the bounds are given by `min_value` and `max_value`.

## Unanswered questions
- Can I do it with multi-batch?
- Do I need to cover Spark and SQL as well? Probably.

The `min_value` and `max_value` of each `..._to_be_between` Expectation are selected automatically as the lower and upper bounds. The ranges are selected using a bootstrapping step on the sample `Batches`. This allows the `DataAssistant` to account for outliers and to obtain a more accurate estimate of the true ranges by taking into account the underlying distribution.
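The bootstrapping idea can be illustrated with a small standard-library sketch. This is not the `DataAssistant` implementation: the function names, the default `false_positive_rate`, the number of resamples, and the sample row counts are all assumptions made for illustration. The sketch resamples the per-`Batch` metric values with replacement, takes a lower and an upper quantile of each resample, and averages those quantiles to obtain a range.

```python
import random
import statistics


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * fraction


def bootstrap_range(values, n_resamples=1000, false_positive_rate=0.05, seed=0):
    """Estimate (lower, upper) bounds for a per-Batch metric by bootstrapping.

    Hypothetical helper: each resample draws len(values) items with replacement,
    and the bounds are the averages of the resamples' lower and upper quantiles.
    """
    rng = random.Random(seed)
    lower_estimates, upper_estimates = [], []
    for _ in range(n_resamples):
        resample = sorted(rng.choices(values, k=len(values)))
        lower_estimates.append(quantile(resample, false_positive_rate / 2))
        upper_estimates.append(quantile(resample, 1 - false_positive_rate / 2))
    return statistics.fmean(lower_estimates), statistics.fmean(upper_estimates)


# Hypothetical row counts for eight Batches, including one outlier.
row_counts = [10000, 10120, 9980, 10050, 9900, 10210, 10010, 15000]
lower, upper = bootstrap_range(row_counts)
print(f"estimated range: [{lower:.0f}, {upper:.0f}]")
```

Because the bounds are averaged over many resamples, a single extreme `Batch` pulls the upper bound less than it would if the raw maximum were used directly.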

In [1]:
import great_expectations as ge
from great_expectations.core.yaml_handler import YAMLHandler
from great_expectations.core.batch import BatchRequest
from great_expectations.core import ExpectationSuite
from great_expectations.core.expectation_configuration import ExpectationConfiguration
from great_expectations.validator.validator import Validator
from great_expectations.rule_based_profiler.data_assistant import (
    DataAssistant,
    VolumeDataAssistant,
)
from great_expectations.rule_based_profiler.data_assistant_result import (
    VolumeDataAssistantResult,
)
from typing import List
yaml = YAMLHandler()

## Set-up: Adding `taxi_data` `Datasource`
* Add `taxi_data` as a new `Datasource`
* We are using an `InferredAssetFilesystemDataConnector` to connect to data in the `test_sets/taxi_yellow_tripdata_samples` folder and get one `DataAsset` (`yellow_tripdata_sample`) that has 36 Batches, corresponding to one batch per month from 2018-2020.

In [2]:
data_context: ge.DataContext = ge.get_context()

In [3]:
data_path: str = "../../../../test_sets/taxi_yellow_tripdata_samples"

datasource_config: dict = {
    "name": "taxi_data_all_years",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "inferred_data_connector_all_years": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": data_path,
            "default_regex": {
                "group_names": ["data_asset_name", "year", "month"],
                "pattern": "(yellow_tripdata_sample)_(2018|2019|2020)-(\\d.*)\\.csv",
            },
        },
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	inferred_data_connector_all_years : InferredAssetFilesystemDataConnector

	Available data_asset_names (1 of 1):
		yellow_tripdata_sample (3 of 36): ['yellow_tripdata_sample_2018-01.csv', 'yellow_tripdata_sample_2018-02.csv', 'yellow_tripdata_sample_2018-03.csv']

	Unmatched data_references (3 of 6):['.DS_Store', 'first_3_files', 'random_subsamples']



<great_expectations.datasource.new_datasource.Datasource at 0x7f92cf47ce80>

In [4]:
# add_datasource only if it doesn't already exist in our configuration
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)
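To see how the `default_regex` in the `Datasource` configuration above splits a file name into `group_names`, the pattern can be tried with Python's `re` module directly. This is a minimal sketch: `parse_data_reference` is a hypothetical helper, and the use of `re.match` is an assumption rather than a description of how the `DataConnector` applies the pattern internally.

```python
import re

# Pattern and group names copied from the `default_regex` in the Datasource configuration.
PATTERN = re.compile(r"(yellow_tripdata_sample)_(2018|2019|2020)-(\d.*)\.csv")
GROUP_NAMES = ["data_asset_name", "year", "month"]


def parse_data_reference(filename):
    """Return a dict of group name -> captured value, or None if the name does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return dict(zip(GROUP_NAMES, match.groups()))


for name in ["yellow_tripdata_sample_2018-01.csv", ".DS_Store", "first_3_files"]:
    print(name, "->", parse_data_reference(name))
```

File names that return `None` here are the kind of entries listed under `Unmatched data_references` in the `test_yaml_config` output.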

# Configure `BatchRequest`

In this example, we will use a `BatchRequest` that returns all 36 batches of data from the `taxi_data` dataset. We refer to the `Datasource` and `DataConnector` configured in the previous step.

In [5]:
multi_batch_all_years_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_data_all_years",
    data_connector_name="inferred_data_connector_all_years",
    data_asset_name="yellow_tripdata_sample",
)

In [6]:
batch_request: BatchRequest = multi_batch_all_years_batch_request

# Run the `Onboarding DataAssistant`

* The `Onboarding DataAssistant` can be run directly from the `DataContext` by accessing `assistants` and then `onboarding`, and passing in the `BatchRequest` from the previous step.

In [8]:
result = data_context.assistants.onboarding.run(batch_request=batch_request)





# Explore `DataAssistantResult` by plotting

The resulting `DataAssistantResult` is best explored by plotting. For each `Domain` considered (`Table` and `Column` in our case), the plots display the metric value for each `Batch` (36 in total).

In [9]:
result.plot_metrics()

[Output: an interactive `Select Plot:` dropdown widget and a `PlotResult` whose `charts` list holds Altair `LayerChart` and `Chart` objects.]

An additional layer of information that can be retrieved from the `DataAssistantResult` is the `prescriptive` information, which corresponds to the range values of the `Expectations` that result from the `DataAssistant` run. 

For example, the `vendor_id` plot shows that the number of distinct `vendor_id` values ranged from 2 to 3 across all of our `Batches`, as indicated by the blue band around the plotted values. These bounds correspond to the `min_value` and `max_value` of the resulting `Expectation`, `expect_column_unique_value_count_to_be_between`.

In [13]:
result.plot_expectations_and_metrics()

[Output: an interactive `Select Plot:` dropdown widget and a `PlotResult` whose `charts` list holds Altair `LayerChart` and `Chart` objects.]
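The band drawn around the plotted values can be read as a simple range check on each `Batch`. The sketch below is only an illustration of that reading: `batches_outside_range` is a hypothetical helper, and the per-`Batch` counts are made-up values, not results from `taxi_data`.

```python
def batches_outside_range(metric_by_batch, min_value, max_value):
    """Return the Batch labels whose metric value falls outside [min_value, max_value]."""
    return [
        batch
        for batch, value in metric_by_batch.items()
        if not (min_value <= value <= max_value)
    ]


# Hypothetical distinct-value counts per Batch; the last one falls outside the band.
distinct_vendor_ids = {"2018-01": 2, "2018-02": 3, "2018-03": 2, "2018-04": 5}
print(batches_outside_range(distinct_vendor_ids, min_value=2, max_value=3))
# prints ['2018-04']
```

A future `Batch` whose metric lands outside the band is the situation in which the corresponding `Expectation` would be expected to fail.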

# Save `ExpectationSuite`

Finally, we can add the `ExpectationConfiguration` objects resulting from the `DataAssistant` run to our `ExpectationSuite`, and then pass the updated `ExpectationSuite` to the `DataContext`'s `save_expectation_suite()` method.

In [10]:
suite: ExpectationSuite = ExpectationSuite(expectation_suite_name="taxi_data_suite")

In [11]:
resulting_configurations: List[ExpectationConfiguration] = (
    suite.add_expectation_configurations(
        expectation_configurations=result.expectation_configurations
    )
)

In [12]:
data_context.save_expectation_suite(expectation_suite=suite)

## Optional: Clean-up Directory


As part of running this notebook, the `DataAssistant` will create a number of `ExpectationSuite` configurations in the `great_expectations/expectations/tmp` directory. Optionally, uncomment and run the following cell to clean up the directory.

In [14]:
#import shutil, os
#shutil.rmtree("great_expectations/expectations/tmp")
#os.remove("great_expectations/expectations/.ge_store_backend_id")
#os.remove("great_expectations/expectations/taxi_data_suite.json")
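If the clean-up cell is run more than once, or before every file exists, `shutil.rmtree` and `os.remove` raise errors for missing paths. A more forgiving variant might look like the sketch below. `clean_up` is a hypothetical helper, and the file names are taken from the commented cell above; `Path.unlink(missing_ok=True)` requires Python 3.8 or later.

```python
import shutil
from pathlib import Path


def clean_up(expectations_dir, suite_name="taxi_data_suite"):
    """Remove the artifacts created by this notebook, skipping any that are absent."""
    expectations_dir = Path(expectations_dir)
    # Temporary ExpectationSuite configurations written during the DataAssistant run.
    shutil.rmtree(expectations_dir / "tmp", ignore_errors=True)
    # Store backend id file and the saved suite.
    (expectations_dir / ".ge_store_backend_id").unlink(missing_ok=True)
    (expectations_dir / f"{suite_name}.json").unlink(missing_ok=True)


# Uncomment to run against the directory used in this notebook:
# clean_up("great_expectations/expectations")
```

Calling the function a second time is a no-op, which makes it safe to leave in a notebook that is re-run from the top.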