## Great Expectations
A simple demonstration of how to use the basic functions of the Great Expectations library with PySpark.

In [0]:
# if you don't want to install great_expectations from the Clusters menu, you can install it directly like this
dbutils.library.installPyPI("great_expectations")

In [0]:
import great_expectations as ge
import pandas as pd

In [0]:
# first, let's create a simple dataframe

data = {
  "String": ["one", "two", "two",],
  "Value": [1, 2, 2,],
}

# let's create a pandas dataframe
pd_df = pd.DataFrame.from_dict(data)

# building the Spark dataframe from the pandas one avoids having to define a schema
df = spark.createDataFrame(
  pd_df
)

## Creating Great Expectations Datasets for Pandas and PySpark

In [0]:
# now let us create the appropriate Great Expectations Dataset objects

# for pandas, we create a Great Expectations object like this
pd_df_ge = ge.from_pandas(pd_df)

# while for PySpark we wrap the Spark dataframe like this
df_ge = ge.dataset.SparkDFDataset(df)

## Running Great Expectations tests

Each expectation returns a dictionary of metadata, including a boolean "success" value.

In [0]:
# this works the same for both pandas and PySpark Great Expectations datasets
print(pd_df_ge.expect_table_row_count_to_be_between(1,10))

print(df_ge.expect_table_row_count_to_be_between(1,10))
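
Because every expectation result carries a boolean "success" value, several results can be collected and checked in one place. The following is a minimal sketch that assumes each result can be read like a dictionary with a "success" key. The helper name `failed_results`, the "expectation" key, and the sample results are hypothetical and used only for illustration; they are not part of the Great Expectations API.

```python
def failed_results(results):
    """Return the results whose "success" flag is not True."""
    return [r for r in results if not r["success"]]


# Hypothetical results, shaped like the metadata described above.
results = [
    {"expectation": "row_count_between_1_and_10", "success": True},
    {"expectation": "row_count_between_5_and_10", "success": False},
]

failures = failed_results(results)
for failure in failures:
    print("failed:", failure["expectation"])
```

In a notebook, the same idea could be applied to the values returned by calls such as `expect_table_row_count_to_be_between`, for example to stop a pipeline when the list of failures is not empty.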

### Differences between Great Expectations Pandas and PySpark Datasets

In [0]:
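# pandas datasets inherit all the pandas dataframe methods
print(pd_df_ge.count())

# GE PySpark datasets do not, so the following call raises an error
# (caught here so that the rest of the notebook can still run)
try:
    print(df_ge.count())
except AttributeError as err:
    print("error:", err)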

In [0]:
# however, you can access the original PySpark dataframe using df_ge.spark_df

df_ge.spark_df.count()
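
The difference comes down to inheritance versus wrapping: the pandas dataset is itself a dataframe, while the PySpark dataset keeps the dataframe on an attribute. The toy sketch below illustrates that pattern with plain Python classes. `Frame`, `InheritingDataset`, `WrappingDataset`, and `expect_row_count_between` are hypothetical stand-ins and not Great Expectations, pandas, or PySpark classes.

```python
class Frame:
    """Stand-in for a dataframe class (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows

    def count(self):
        return len(self.rows)


class InheritingDataset(Frame):
    """Subclass style: every Frame method is available directly."""

    def expect_row_count_between(self, low, high):
        return {"success": low <= self.count() <= high}


class WrappingDataset:
    """Wrapper style: the frame lives on an attribute and is not proxied."""

    def __init__(self, frame):
        self.frame = frame

    def expect_row_count_between(self, low, high):
        return {"success": low <= self.frame.count() <= high}


rows = ["one", "two", "two"]
inheriting = InheritingDataset(rows)
wrapping = WrappingDataset(Frame(rows))

print(inheriting.count())          # method inherited from Frame
print(wrapping.frame.count())      # reached through the stored attribute
print(hasattr(wrapping, "count"))  # the wrapper itself has no count()
```

Both styles can expose the same expectation methods; only the way the underlying dataframe methods are reached differs.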

## Taking Great Expectations further

If you want to use Great Expectations data context features, you will need to instantiate a data context. Details can be found here: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_contexts/how_to_instantiate_a_data_context_on_a_databricks_spark_cluster.html