# Data Quality and Testing
© Explore Data Science Academy 

## Learning objectives

By the end of this train, you will: 
* understand the need for data quality validation and testing, including:
   * data quality validation as an assurance of incoming data; and
   * data testing as integration testing;
* define the six dimensions of data quality;
* be able to use Spark to calculate each of the six dimensions; and
* understand the tools available for data quality validation and testing.

## Understand the need for data quality validation and testing

Data quality is a crucial part of data engineering. After ingesting data into your system, the next important thing is to ensure that the data is of sufficient quality to allow for high-quality and robust science and analytics. 

Imagine you build the most efficient, robust, and elegant data pipelines to process your data. All this work would be for nothing if your data is not an accurate representation of real-world events. Therefore, you must prioritise data quality from the start of your project. 

There are a few cases that may occur when your data quality is not up to scratch: 

- Your code checks the input data with expectations based on data that you know is correct. If the data is not up to scratch, then you initiate a plan B. This could include re-running the API to retrieve the missing data or alerting users and data engineers to the problem.

- The code runs into an error while processing data and crashes. This may occur when your data pipelines expect a specific schema or data type.

- Lastly, your data pipelines could continue to run regardless of the poor data quality, producing outputs that look correct but are based on substandard data.

Of these three cases, the first one is the best, as we are at least being alerted to changes in data quality. This is the only case where we have data quality checking in place. The second case is not the end of the world – with a failure you are at least aware that something has failed and can look for a solution or re-run your data pipeline. The last case is very dangerous as you could remain unaware of these data quality issues for months. 

Today's applications and businesses, in general, are gravitating towards data-driven products and decision-making. As you know, there are huge amounts of data available and this is only increasing. Businesses are becoming more dependent on using data and making informed decisions. Therefore, substandard data is not an option, as it can lead to incorrect modelling or analytics.

In the sections below, we are going to take a deeper look at data quality and dive into some of the use cases for data quality checks. We'll also introduce six dimensions of data quality that can be evaluated. 

### Data quality validation as an assurance of incoming data

Data quality validation means making sure that the incoming data is of sufficient quality.

Here we will expand on how to validate the quality of the incoming data and show which mitigation should take place when there is a change in data quality. Two important factors need to be monitored here. The first is the change in data quality, such as missing values or duplicates. The second is a change in data distributions where values are outside the expected range.

Let's take a look at how to approach data quality validation. The first step in tracking data quality is deciding on the *data requirements*:
* Defining the schema and acceptable data types of a dataset ensures that changes to either of these do not negatively affect downstream processes, such as modelling and analysis.
* Data requirements can also include the six dimensions of data quality, which can direct decisions around data requirements.
* It could also include defining the bounds of what qualifies as a “normal” value, for example, one standard deviation from the mean historical value.

Once you have decided on the data requirements, you must automatically check that these requirements are met. This may look like a data quality pipeline that runs once a day, or every time new data are ingested, and tests for the defined data quality requirements. This process could also include an alerting system that lets you (as the data engineer) know about violations of the data requirements.
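As a rough illustration of such an automated check, the sketch below tests a batch of incoming values against two requirements: an expected data type and a "normal" range of one standard deviation around the historical mean. The function name `check_requirements`, the mock values, and the choice to print violations instead of sending alerts are all assumptions made for this example.

```python
from statistics import mean, stdev


def check_requirements(history, incoming, expected_type=float, n_std=1.0):
    """Return a list of human-readable violations for a batch of incoming values."""
    centre = mean(history)
    spread = stdev(history)
    lower, upper = centre - n_std * spread, centre + n_std * spread

    violations = []
    for position, value in enumerate(incoming):
        if value is None:
            violations.append(f"row {position}: missing value")
        elif not isinstance(value, expected_type):
            violations.append(
                f"row {position}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif not lower <= value <= upper:
            violations.append(
                f"row {position}: {value} outside [{lower:.2f}, {upper:.2f}]"
            )
    return violations


# Mock data (hypothetical values, not taken from a real source system).
history = [10.0, 11.0, 9.0, 10.0, 10.0]
incoming = [10.2, None, "10.1", 25.0]

violations = check_requirements(history, incoming)
for violation in violations:
    print(violation)  # a real pipeline might send these to an alerting system
```

In practice, the bounds and types would come from the data requirements agreed for each dataset, and the check would be scheduled to run alongside ingestion.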

Data quality is likely to vary depending on the source system. It cannot necessarily be changed, but you can ensure that the correct systems are in place to continuously mitigate the risks associated with substandard data. This may be in the form of a data quality rating displayed on the front-end of your application so that users know when data are unreliable.

Another option is creating a system that checks incoming data based on requirements and activates a recovery pipeline that re-ingests data if the faults are due to the ingestion step. These mitigation options aim to ensure awareness around data quality and that you are not falling into the third case mentioned above, where your data pipelines continue to run regardless of substandard data.

> Remember that mitigation plans for poor data quality are highly dependent on the requirements. \
> Therefore, you should treat each new dataset as a new problem to ensure no additional faults come into play.

### Data testing as integration testing

Testing is the second use for data quality. Here we use data quality as a method for integration testing. Thus, monitoring data quality through a processing pipeline on a mock dataset can tell us if the pipelines are performing as they should.

This approach takes on a more rigorous design than ensuring that incoming data is accurate. Once we are sure that incoming data is up to the correct standard, there are more places where data testing and validation can be included. Following the logic of ETL, once we have extracted the data and ensured its quality, there are likely many transformations that need to be built into the data pipelines based on various business requirements. 

*Consider the following example:* \
A business receives data every hour for the number of times a pump within a water distribution network has started in its lifetime. This metric is useful as it can help the business determine how close a pump is to the end of its life (water pumps only effectively pump water for a set amount of pump starts). This data can provide a further benefit by calculating the number of times a pump has started in a day, which can alert the business to where pressure on their system is.

To calculate the number of times a pump has started in a day, it is necessary to keep track of the previous day's cumulative number of starts and use this as a baseline for that day's number of starts. This is where data testing can be taken to a new level.

If we know the method of calculating the number of starts per day, and we have a mock dataset for which we know the number of starts that occurred, we can create a testing script to validate that the calculation is performing as expected. This involves providing the function that performs the calculation with known inputs and outputs, and testing if it passes or fails the test, so we can know with certainty that our function performs correctly. It is very similar to how we would test general software functionality.
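The pump example could be tested along these lines. This is a minimal sketch: the function `daily_starts`, the mock readings, and the expected results are hypothetical, and it assumes one cumulative reading per day rather than the hourly feed described above.

```python
def daily_starts(cumulative_by_day):
    """Turn end-of-day lifetime start counts into starts per day.

    cumulative_by_day: list of (day, lifetime_starts) tuples, ordered by day.
    The first day only provides the baseline, so it has no result.
    """
    results = {}
    pairs = zip(cumulative_by_day, cumulative_by_day[1:])
    for (_, previous_total), (day, total) in pairs:
        results[day] = total - previous_total
    return results


# Mock dataset for which the true number of daily starts is known.
mock_readings = [
    ("2023-01-01", 1000),
    ("2023-01-02", 1012),
    ("2023-01-03", 1012),
    ("2023-01-04", 1030),
]
expected = {"2023-01-02": 12, "2023-01-03": 0, "2023-01-04": 18}


def test_daily_starts():
    # Known input, known output: the test fails if the calculation changes.
    assert daily_starts(mock_readings) == expected


test_daily_starts()
```

A test runner such as `pytest` would normally collect `test_daily_starts` automatically; it is called directly here only to keep the sketch self-contained.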

> This is an example of integration testing for one function or transformation, but these types of checks should ideally be made at every point that your data changes form. \
> By employing this type of testing, you can safely make changes and refactor your code while still having the tests in place to validate that you haven't broken anything.

## The six dimensions of data quality

The six dimensions of data quality are generalised attributes that can be used to measure data quality. These dimensions provide a framework within which you can explore data and understand where pitfalls may be, as well as how to address data quality problems. 

<p align="center">
<img src="https://github.com/Explore-AI/Pictures/blob/421d8c55ebe6caa30836ba3c5785232d3eab84ad/data_engineering/transform/predict/DataQuality.jpg?raw=True"
     alt="six dimensions of data quality"
     style="padding-bottom: 0.5em"
     width=800px/>
     <br>
     <em>The six dimensions of data quality. </em>
</p>

In some cases, specific dimensions will not apply to a dataset and, therefore, may not be useful. However, it is still important to go through all six dimensions for each use case, so it helps to have a standard approach to apply to new datasets. If there are additional requirements, we can build these on top of the six dimensions.

### An overview of the six dimensions of data quality 

Each of the six dimensions should give you an idea of how to assess your data's quality. 

#### Accuracy 

The accuracy of data is the degree to which the data represent a real-world event or object.

Sometimes IoT devices record data measurements that are incorrect but may look like a real value. An example of this is monitors that measure the level of fluid in a tank or pipe. These monitors often use an ultrasonic pulse that bounces off the fluid to measure the level. However, if the sensor is underwater or obstructed, it can give skewed readings. Another example is an API sending incorrect data or ingesting the data from the source incorrectly, such as a rounding error.

Accuracy issues can occur at the field level (one incorrect entry) or the row level (multiple incorrect rows).

Another issue that may occur is default values. A typical example of this is where a logger sends back a 0 instead of a null value, which can greatly skew any attempts at modelling. This is where it is instrumental to employ domain knowledge when assessing a dataset. 

- *Measured by*: how correct a data point is relative to a real-world event
- *Units*: specific case-related constraints
- *Related to*: validity and completeness

#### Consistency 

Consistency is the absence of difference when comparing two or more representations of something against a reference. If data are recorded or captured in multiple places, consistency becomes very important. One cannot have the same data point recorded in various ways. 

Data entries that refer to the same record or entity have to be consistent across all of the entries. An example of a consistency error is measuring data in different units but using the same column. For example, temperature data that are recorded in Kelvin and degrees Celsius but in the same column. This could greatly impact data users if they are unaware of what normal temperatures should look like.

This is not just within a single table. It becomes more important if you are dealing with relational data, in which case the mappings between tables and systems must be consistent. If not, the relationships will be completely lost between the tables and referential integrity compromised. 

- *Measured by*: analysis of pattern and/or value frequency
- *Units*: percentage
- *Related to*: accuracy, validity, and uniqueness
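One way to express consistency between related tables as a percentage is to count the records whose key has a match in the reference table. The sketch below assumes two hypothetical tables, `pumps` and `readings`, represented as lists of dictionaries.

```python
# Reference table: the assets that are known to exist (hypothetical IDs).
pumps = [{"asset_id": "P-001"}, {"asset_id": "P-002"}]

# Dependent table: every reading should refer to a known asset.
readings = [
    {"asset_id": "P-001", "flow": 12.5},
    {"asset_id": "P-002", "flow": 9.1},
    {"asset_id": "P-003", "flow": 7.7},  # no matching pump
    {"asset_id": "P-001", "flow": 11.9},
]

known_ids = {pump["asset_id"] for pump in pumps}

# Readings whose key has no match break referential integrity.
orphans = [row for row in readings if row["asset_id"] not in known_ids]

consistency_pct = 100 * (len(readings) - len(orphans)) / len(readings)
print(f"{consistency_pct:.1f}% of readings refer to a known pump")
```

The same idea carries over to larger systems, where the check is typically an anti-join between the dependent table and the reference table.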

#### Timeliness 

Timeliness is the degree to which data represent reality from the required point in time. Timeliness expects that the data within your dataset is sufficiently up to date. What are the delays between an event happening and the data point being recorded?

If you are trying to answer questions that relate to recent problems, having timely data is extremely important. For example, you cannot use current flight patterns to model how many aeroplanes will be needed by a large aeronautics company in the next five to ten years. 

Similarly, when answering questions that require real-time answers (for example, predicting when a pipe will burst in a manufacturing plant), you have to be set up to receive real-time data from sensors and loggers. 

An example of untimely data causing problems is where a temperature measuring device on a nuclear reactor only communicates its reading two hours after the temperature is recorded. If the temperature exceeds the safety threshold, it could lead to a nuclear meltdown in the reactor. Therefore, it is crucial that the data are sent as soon as they are recorded and that the data pipelines can communicate each data point as fast as possible. 

Another important point is that timeliness is sometimes determined by the source system, which may only synchronise once a day. In this case, we may have a fast method of ingesting the data and displaying it, but we are limited by the source system.

- *Measured by*: time difference
- *Units*: time
- *Related to*: accuracy, because it will decay as time progresses
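Measuring timeliness as a time difference can be sketched as follows. The function name `delays_over_threshold`, the five-minute threshold, and the mock timestamps are assumptions for illustration; the second record mirrors the two-hour delay described above.

```python
from datetime import datetime, timedelta


def delays_over_threshold(records, max_delay):
    """Return (event_time, delay) for every record that arrived too late.

    records: iterable of (event_time, received_time) datetime pairs.
    """
    late = []
    for event_time, received_time in records:
        delay = received_time - event_time
        if delay > max_delay:
            late.append((event_time, delay))
    return late


# Mock records: when the event happened and when the data point was received.
records = [
    (datetime(2023, 1, 1, 12, 0), datetime(2023, 1, 1, 12, 1)),
    (datetime(2023, 1, 1, 12, 5), datetime(2023, 1, 1, 14, 5)),  # two hours late
]

late = delays_over_threshold(records, max_delay=timedelta(minutes=5))
for event_time, delay in late:
    print(f"Reading from {event_time} arrived {delay} after the event")
```

The acceptable delay depends entirely on the use case: a daily report may tolerate hours, while a safety alarm may only tolerate seconds.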

#### Validity 

The validity of a dataset is specific to a certain field. In other words, data is valid if it conforms to the syntax (format, type, range) of its definition. Each field will have a property that makes it valid, such as an "@" symbol for an email address.

Certain values within a field may have criteria required to make them valid. For example, numerical columns cannot contain alphabetical characters, which can be the result of incorrectly parsed scientific notation. This can be more difficult to determine in strings, in which case you may have to check using regex. This could be the inclusion of a symbol, as mentioned above, or it could be that a number should not exceed certain limits. An example of this is in a gas separation process that uses extremely cold temperatures. If the system records temperatures below -273.15 degrees Celsius, which is absolute zero, the data would not be valid.

- *Measured by*: comparison between the data and the metadata or documentation for the data item
- *Units*: percentage of data items deemed valid or invalid
- *Related to*: accuracy, completeness, consistency, and uniqueness
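Both kinds of validity rule mentioned above, a pattern for strings and a limit for numbers, can be expressed as small predicates. In this sketch, the email pattern is deliberately simplistic, and the helper `percent_valid` and the mock values are hypothetical.

```python
import re

# Deliberately simple pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
ABSOLUTE_ZERO_C = -273.15


def percent_valid(values, is_valid):
    """Percentage of values for which the is_valid predicate returns True."""
    if not values:
        return 0.0
    valid = sum(1 for value in values if is_valid(value))
    return 100 * valid / len(values)


# Mock data (hypothetical values).
emails = ["ana@example.com", "bob.example.com", "cy@example.org", "dee@"]
temperatures_c = [-190.0, -195.5, -300.0, -188.2]

email_validity = percent_valid(emails, lambda v: bool(EMAIL_PATTERN.match(v)))
temperature_validity = percent_valid(
    temperatures_c, lambda t: t >= ABSOLUTE_ZERO_C
)

print(f"Valid emails: {email_validity}%")
print(f"Valid temperatures: {temperature_validity}%")
```

Keeping each rule as a separate predicate makes it straightforward to report validity per field, which matches the unit suggested above.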

#### Completeness

The completeness of data relates to how many values may be missing in your dataset. 

The degree of completeness can be measured by seeing how many theoretical data points should be present, based on timestamps. Once a theoretical maximum of how complete a dataset should be is established, you can determine how complete the dataset is from the ratio of records actually present to that theoretical maximum.

An example of incomplete data is receiving pump flow readings every two minutes but not having the expected 720 data points per day. If you have only 600 data points for a specific day, you are missing 120 data points/timestamps. Being aware of this is particularly important in cases where you are aggregating the data to a less frequent interval, because you can still get an hourly mean with fewer than 30 two-minute data points, but it will not be as accurate as with all 30 data points.

The questions one could ask to determine the completeness of a dataset are, 'Does the dataset have missing values, or if it is time-series data, does it have time period gaps? Has a bias been introduced that may change your assumptions or affect your results?'

Completeness issues can occur at the field level (one entry missing) or at the row level (gaps within the dataset). They can also occur at the column level, with entire fields being empty or more than 80% of a field's data being missing. Another issue that may occur is default values. A typical example of this is where a logger sends back a 0 instead of a null value, which can greatly skew any attempts at modelling. This is where it is instrumental to employ domain knowledge when assessing a dataset. 

- *Measured by*: a measure of the absence of blank (null) values or the presence of non-blank values
- *Units*: percentage
- *Related to*: validity and accuracy
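The two-minute pump example can be worked through in code. This is a minimal sketch assuming readings are expected exactly on a two-minute grid; the function name `completeness` and the mock day of data (with the first four hours missing) are hypothetical.

```python
from datetime import datetime, timedelta


def completeness(timestamps, start, end, interval):
    """Return (percentage complete, list of missing timestamps)."""
    # Build the theoretical maximum: every timestamp that should exist.
    expected = []
    current = start
    while current < end:
        expected.append(current)
        current += interval

    received = set(timestamps)
    missing = [stamp for stamp in expected if stamp not in received]
    pct = 100 * (len(expected) - len(missing)) / len(expected)
    return pct, missing


day_start = datetime(2023, 1, 1)
day_end = day_start + timedelta(days=1)
two_minutes = timedelta(minutes=2)

# Mock data: 720 possible readings, of which the first 120 never arrived.
all_slots = [day_start + i * two_minutes for i in range(720)]
received = all_slots[120:]

pct, missing = completeness(received, day_start, day_end, two_minutes)
print(f"{len(missing)} readings missing, dataset is {pct:.2f}% complete")
```

Returning the missing timestamps as well as the percentage shows where the gaps are, which matters when the data are later aggregated to hourly values.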

#### Uniqueness

This dimension relates to having a real-world object or event represented only once in a particular dataset. The same object cannot be duplicated. In other words, uniqueness specifies that nothing will be recorded more than once based upon how that thing is identified. It is the inverse of an assessment of the level of duplication.

Each entry within the datasets should only relate to one single event which has occurred and, thus, should not be duplicated. This is largely mediated by having the appropriate primary key, which is ensured if you stick to the requirements of a good primary key. All fields in the tables should be non-transitively dependent on the primary key. As such, deduplication of the dataset may be required. 

This dimension of data quality can lead to very frustrating situations. An example is having a replication of a set of asset IDs that relate to pumps. Each asset has a flow measurement device, and each has a theoretical maximum that it can pump. If an asset's name is recorded as both ADZZ-001 and adzz-001, data could be recorded multiple times for the same pump. The theoretical maximum pumping limit could also be reached for an asset that then kicks off a process to alert users to the problem. This could lead to users receiving double the number of alarms or alerts and potentially causing more stress than is necessary.

- *Measured by*: analysis of the number of things assessed in the “real world” compared to the number of records of things in the dataset – it requires a reference dataset which is the ground truth
- *Units*: percentage
- *Related to*: consistency
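The asset ID problem described above can be detected by normalising identifiers before counting them. The sketch below assumes that upper-casing and trimming whitespace is the right normalisation rule, which would need to be confirmed with domain knowledge; the list of IDs is mock data.

```python
from collections import Counter

# Mock asset IDs as they might appear in a dataset (hypothetical values).
asset_ids = ["ADZZ-001", "adzz-001", "ADZZ-002", "ADZZ-003 ", "ADZZ-003"]

# Assumed normalisation rule: ignore case and surrounding whitespace.
normalised = [asset_id.strip().upper() for asset_id in asset_ids]

counts = Counter(normalised)
duplicated = {asset_id: n for asset_id, n in counts.items() if n > 1}

# Distinct real-world assets compared with the number of records.
uniqueness_pct = 100 * len(counts) / len(normalised)

print(f"Duplicated IDs: {duplicated}")
print(f"Uniqueness: {uniqueness_pct}%")
```

Without the normalisation step, a plain distinct count would treat `ADZZ-001` and `adzz-001` as different pumps and report no duplicates at all.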

## Using Spark to calculate the six dimensions 

In this section, we will look at how we can apply some of the dimensions of data quality on a dataset using Apache Spark.


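The dataset we have chosen for this analysis is based on the daily stock market fluctuation for eight different indexes. The Istanbul index is reported in both Turkish lira and US dollars, so the dataset has nine index columns. Below we have an explanation of the column names and their relevant indexes.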

* TL BASED ISE: Turkish Lira: Istanbul stock exchange national 100 index
* USD BASED ISE: USD Istanbul stock exchange national 100 index
* SP: Standard & Poor's 500 return index
* DAX: Stock market return index of Germany
* FTSE: Stock market return index of UK
* NIKKEI: Stock market return index of Japan
* BOVESPA: Stock market return index of Brazil
* EU: MSCI European index
* EM: MSCI emerging markets index

To get you started, below are the required imports, SparkSession setup, schema definition, and reading of the `csv` file.

In [None]:
#Uncomment the below line and execute the cell if you do not have the 'requests' module installed in your environment.
#! pip install requests

In [3]:
# Import Spark and some auxiliary functions.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *

from datetime import datetime
from pyspark.sql import DataFrame

In [4]:
# Set up a SparkSession.

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

In [5]:
# Define the schema.

schema = StructType([
    StructField("date", StringType(), True),
    StructField("TL BASED ISE", DoubleType(), True),
    StructField("USD BASED ISE", DoubleType(), True),
    StructField("SP", DoubleType(), True),
    StructField("DAX", DoubleType(), True),
    StructField("FTSE", DoubleType(), True),
    StructField("NIKKEI", DoubleType(), True),
    StructField("BOVESPA", DoubleType(), True),
    StructField("EU", DoubleType(), True),
    StructField("EM", DoubleType(), True)
])

In [6]:
# Read in the data from a csv file.
file_url = 'https://raw.githubusercontent.com/Explore-AI/Public-Data/master/data-engineering/procesing-big-data/istanbul_stock_exchange.csv'
response = requests.get(file_url)
response.raise_for_status()  # Stop early if the download failed.

# Save the file locally; the `with` block closes the file handle.
with open('istanbul_stock_exchange.csv', 'wb') as csv_file:
    csv_file.write(response.content)
df = spark.read.csv('istanbul_stock_exchange.csv', schema=schema, header=True)

### Data accuracy

Let's start by looking into *data accuracy*, which is the degree to which a piece of data accurately describes a real-world event.

To determine the accuracy of the dataset, let's have a look at the following:
1. View the DataFrame to get a general idea of how the dataset looks.
2. Summarise the DataFrame to further explore the data, including identifying if we're dealing with missing data.
3. Investigate the distribution of the different columns.
4. Filter specific columns to explore the outliers in the distributions.

> ℹ️ **Instructions** ℹ️
>
> View the DataFrame using `show()`.

In [7]:
#TODO: Write your code here.

Did you also see that null and zero values are included in the dataset?

> ℹ️ **Instructions** ℹ️
>
> Summarise the DataFrame by applying the `summary()` function, and using the `toPandas()` function to see the summary as a Pandas DataFrame.

In [None]:
#TODO: Write your code here.

By viewing the DataFrame, we already identified a few null and zero values. In the summary you should have seen that the count of some columns is below 539, indicating that we do have missing values.

Next, we investigate the distribution of the different columns.

> ℹ️ **Instructions** ℹ️
>
> Convert the DataFrame to a Pandas DataFrame using the `toPandas()` function so that we can make use of the plotting functions of matplotlib. Plot each of the nine columns using the `hist()` function.

In [None]:
#TODO: Write your code here.

From the above plots, you should see that four of the nine histogram distributions appear to be approximately normal. However, the five other variables have a concentrated distribution with some extreme outliers. Since the dataset is based on stock market fluctuations in a day, it is possible to have large negative and positive values. However, it is unlikely for one variable to get a value indicating either positive or negative shifts of more than 100% (1 on the index). Therefore, we potentially have some questionable data in the above dataset.

Next, we filter the variables that are not normally distributed, which are columns `SP`, `FTSE`, `EM`, `NIKKEI`, and `BOVESPA`.

> ℹ️ **Instructions** ℹ️
>
> Filter the five columns that are not normally distributed by using `where()` and the Spark SQL function `F.col()` to inspect the outliers. Use `<-0.2` to find these outliers in each column.

In [None]:
#TODO: Write your code here.

For variable `SP`, for example, you should have seen that some rows have a value associated with approximately 100% and 300% losses. This is uncommon and would require validation from other data sources to confirm if it is accurate.

For the other four variables, you should also have seen values associated with more than 100% change in the stock market prices. These values are indicated by indexes over 1 and will also need to be validated by other sources. In these cases, additional research through validation with other data is required to determine if the large gains and losses are legitimate. If not possible, it would be wise to drop or filter out these abnormal values since they can significantly impact the results of building a machine learning model. 

### Completeness

Let's use the same dataset and determine its completeness, in other words, what the proportion of data against the potential of “100% complete” is.

To determine the completeness of the dataset, let's have a look at the following:
1. Determine the percentage of missing (null) and zero values within each column.
2. Explore the columns that contain missing and zero values.
3. Remedy the incomplete data. 

We already know that null values are included within the dataset, but let's explore further.

> ℹ️ **Instructions** ℹ️
>
> Check for missing values within the columns. \
> We've given you a `for` loop below to loop through each of the columns:
> * `where()`, `isNull()`, and `count()` are used to count the number of null values in the column.
> * `select()` and `count()` are used to count the total number of values in the column.
> * The number of null values is divided by the total number of values and multiplied by 100 to calculate the percentage of missing values.

In [None]:
# Uncomment, to check for missing values within the columns.

#missing_count = {}  # Dictionary to keep track of the results
#for column in df.columns:   # loop through each column
    #_count = df.where(df[column].isNull()).count()  # null count in column x
    #_total_count = df.select(df[column]).count()    # total count of column x 
    #print(f'There are {_count} ({round(_count/_total_count*100, 3)}%) null values in {column} column')  # print out and calculate results
    #missing_count[f'{column}'] = _count # recording results in missing_count dictionary 

From the above summary, you should see that only the variable `DAX` has missing (null) values. Let's explore further.

> ℹ️ **Instructions** ℹ️
>
> Filter and display the rows of variable `DAX` that have a value of null using `where()`, `isNull()`, and `show()`.

In [None]:
#TODO: Write your code here.

Next, we need to decide how to deal with this incomplete data.

In general, we can remedy incomplete data by:
* imputing the missing values;
* dropping missing values;
* discarding the incomplete column; or
* discarding the rows containing missing data.

Because the missing data in variable `DAX` seem random, imputation will be difficult. So, in this case, we will simply remove the values.
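To make the difference between dropping and imputing concrete, here is a small sketch in plain Python on mock rows (the values are hypothetical and not taken from the dataset). The Spark exercise that follows applies the dropping approach to the real DataFrame.

```python
from statistics import mean

# Mock rows with one missing DAX value (hypothetical values).
rows = [
    {"date": "day 1", "DAX": 0.02},
    {"date": "day 2", "DAX": None},
    {"date": "day 3", "DAX": -0.04},
]

# Option 1: discard the rows that contain missing data.
dropped = [row for row in rows if row["DAX"] is not None]

# Option 2: impute the missing value, here with the mean of the known values.
fill_value = mean(row["DAX"] for row in dropped)
imputed = [
    {**row, "DAX": fill_value if row["DAX"] is None else row["DAX"]}
    for row in rows
]

print(f"{len(dropped)} rows remain after dropping")
print(f"{len(imputed)} rows remain after imputing")
```

Dropping keeps only observed values but shortens the dataset, while imputing keeps every row at the cost of introducing values that were never measured.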

> ℹ️ **Instructions** ℹ️
>
> Invert the filter for variable `DAX` above using `isNotNull()`, assign the result to a new DataFrame, and `show()` this new DataFrame, which should not contain null values. 

In [None]:
#TODO: Write your code here.

The above DataFrame should not contain any null values. If that is the case then you've succeeded in remedying the incomplete data from null values!

Next, we follow the same instructions, but now, we're looking for zeros.

> ℹ️ **Instructions** ℹ️
>
> Check for zeros within the columns. \
> You can use the previous loop for missing values as a guide, and uncomment the below code block to get going:
> * Use `where()`, `F.lit(0)`, and `count()` to count the number of zeros in the column.
> * Use `select()` and `count()` to count the total number of values in the column.
> * Divide the number of zeros by the total number of values, and multiply that by 100 to calculate the percentage of zero values.

In [None]:
#TODO: Uncomment the given code lines, and complete the descriptions.
# Check for zeroes within the columns.

# Define a dictionary to keep track of the results.

#for column in df.columns: # loop through each column
#    try:
        # count the number of zeros in column
        # count the total number of values in column
        # calculate and print the results
        # recording results in the dictionary 
#    except: # catch exceptions
#        print(f'Column is not numeric: {[dtype for name, dtype in df.dtypes if name == column][0]}')

You should find that columns `SP`, `DAX`, `FTSE`, and `NIKKEI` contain zeros.

Let's have a further look at each of these columns and drop the rows containing zeros.

> ℹ️ **Instructions** ℹ️
>
> Filter, show, and drop the rows in each of the columns `SP`, `DAX`, `FTSE`, and `NIKKEI` that have a value equal to zero:
> * Filter the column using the `where()` function and `show()` all the rows for that specific column that are equal to zero.
> * You can also use `where()`, `groupby()`, `count()`, and `show()` to summarise the number of zeros per date for each column.
> * Filter the column using `where()` for rows that are not equal to zero and `show()` to determine if all zeros from that specific column have been dropped successfully.

*SP:*

In [None]:
#TODO: Write your code here.

*DAX:*

In [None]:
#TODO: Write your code here.

*FTSE:*

In [None]:
#TODO: Write your code here.

*NIKKEI:*

In [None]:
#TODO: Write your code here.

In [None]:
#TODO: Uncomment to count the number of zero values for variable `NIKKEI` per date.
#df.where(df['NIKKEI'] == 0).groupby('date').count().show()

For variable `NIKKEI`, it seems like quite a few zeros are included. Having no change in the index does not make sense unless no trading has happened that day. Therefore, it requires further investigation to know if these values can be dropped.

It is possible to have no trading on a day another index was recorded, for example, if there is a public holiday and the stock markets did not open that day. This requires investigation into what happened in the country of origin for that specific index. Once you know what happened on that particular date, it becomes a question of whether you want to keep the zeros or drop them. This decision can only be made based on your case-specific application.

### Validity

Let's look at the data's validity and if it conforms to the syntax (format, type, range) of its definition.

To determine the validity, let us have a look at the following:
1. Define what we expect from the dataset and compare it to the current schema.
2. Change the `date` column from string to `DateType()`.
3. Find the first and last date within the dataset.
4. Determine if all dates included in the dataset are in the past.
5. Determine if all numeric values fall within the expected range.


We need to first define what we expect from our dataset:

- date: date (nullable = true)
- TL BASED ISE: double (nullable = true) 
- USD BASED ISE: double (nullable = true) 
- SP: double (nullable = true) 
- DAX: double (nullable = true)
- FTSE: double (nullable = true)
- NIKKEI: double (nullable = true)
- BOVESPA: double (nullable = true)
- EU: double (nullable = true)
- EM: double (nullable = true)

The date column should conform to a date format and be in the past, while in the other columns the value should mostly be between -1 and 1 (because over +100% or -100% gain or loss is improbable in the market).

> ℹ️ **Instructions** ℹ️
>
> Display the data schema of the DataFrame using `printSchema()`.

In [None]:
#TODO: Write your code here.

You should note here that the date column is still a string, and not a date type as required. Let's see if we can convert it to a date.

Because the dates within the dataset are formatted as the day of the month, the locale's abbreviated month name, and the last two digits of the year (`%d-%b-%y`), we use a Spark user-defined function (UDF) to convert `StringType()` into `DateType()`.
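Before wrapping the conversion in a UDF, it can help to try the format string on a few values in plain Python. The sketch below uses hypothetical sample strings, not rows from the dataset, and a hypothetical helper, `parse_or_none`, that returns `None` instead of raising an error. Note that `%b` depends on the locale; English month abbreviations are assumed here.

```python
from datetime import datetime

DATE_FORMAT = "%d-%b-%y"  # day, abbreviated month name, two-digit year


def parse_or_none(text):
    """Parse a date string, returning None for missing or malformed values."""
    try:
        return datetime.strptime(text, DATE_FORMAT).date()
    except (TypeError, ValueError):
        return None


# Hypothetical sample values: one well-formed, one in another format, one null.
samples = ["5-Jan-09", "2009-01-05", None]
parsed = [parse_or_none(sample) for sample in samples]

for sample, result in zip(samples, parsed):
    print(f"{sample!r} -> {result}")
```

Counting how many values come back as `None` is itself a simple validity metric for the date column.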

In [None]:
# Define a function that translates a string date into date type.

def format_date(date_column: str) -> datetime:
  """
  Converts a single value of the date column from a
  string to a datetime object that Spark can store
  as a date.
  
  Params
  ======
  date_column: The date string to be formatted,
  in the format day-month abbreviation-year.
  
  Returns
  ======
  date_object: The correctly formatted datetime object
  """
  return datetime.strptime(date_column, "%d-%b-%y")

In [None]:
# Uncomment.

# Instantiate the UDF.
#date_format = F.udf(format_date, DateType())

# Apply the UDF to the `date` column of `df`, your current DataFrame.
#df_date_format = df.withColumn('date', date_format('date'))

Let's see the format of our dates:

In [None]:
# Uncomment.

#df_date_format.show(5)

We can see that the date column has now successfully become a date type!

> ℹ️ **Instructions** ℹ️
>
> Display the first ('min') and last ('max') dates of the dataset using the new DataFrame with column `date` in `DateType()`. Use the `agg()` and `show()` functions.

In [None]:
#TODO: Write your code here.

> ℹ️ **Instructions** ℹ️
>
> Use `where()`, `F.col()`, `datetime.now()`, and `count()` to determine if all dates within the dataset are in the past.

In [None]:
#TODO: Write your code here.

You should get a value of zero because all the dates within the dataset have occurred in the past.

Next, we check that all values in the numerical columns are within the expected range.

> ℹ️ **Instructions** ℹ️
>
> Complete the below code block to confirm the number of numeric outliers, using `filter()`, `F.col()`, and `count()`. \
> We suggest bounds of ±0.2, as most indexes do not show more than 20% change, based on the summary statistics completed above.

In [None]:
#TODO: Uncomment the provided code.
 
# Define the boundary values within which the values are expected to fall.
lower_bound = -0.2
upper_bound = 0.2

# Add all the columns except the `date` column to a list.
#numerics = [x for x in df.columns if x != 'date']

#for numeric in numerics: #iterate through the numeric columns
    # count the number of values less than the lower bound
    # count the number of values greater than the upper bound
    #print(f'Column {numeric} has {count_lower} less than {lower_bound} and {count_greater} greater than {upper_bound}') #calculate and print the results

As we can see, this investigation confirms what we initially saw in the distribution plots. The values are numeric, so we can work with the dataset in its current form, but there are outliers that should potentially be removed.

### Try it on your own

We've guided you through the application of three of the six dimensions using Spark. We strongly encourage you to investigate and try applying all six dimensions in Spark.


## List of tools available for data quality validation and testing

### Manual testing

Manual testing can be the first step in the journey of having robust data quality tests in place. This involves running manual checks, for example, calculating means and missing values using built-in Python, Spark, or Pandas methods. This can help to familiarise you with your data to decide what type of constraints may be useful.
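
As a minimal illustration of such a manual check, the sketch below uses only built-in Python to count missing values and calculate a mean. The `daily_change` list is a hypothetical column invented for this example; in practice you would run the equivalent Spark or Pandas methods on your own data.

```python
from statistics import mean

# Hypothetical column with two missing readings (None).
daily_change = [0.01, -0.03, None, 0.05, None, 0.02]

# Keep only the values that are actually present.
present = [v for v in daily_change if v is not None]

missing_count = len(daily_change) - len(present)
completeness = len(present) / len(daily_change)
column_mean = mean(present)

print(f"missing values: {missing_count}")
print(f"completeness: {completeness:.2f}")
print(f"mean of present values: {column_mean:.4f}")
```

Numbers like these give you a first feel for what a reasonable constraint might be, for example a minimum acceptable completeness.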

*Pros:*

- It is the easiest to implement: there is no barrier to entry because you are already familiar with the toolset.
- It works from first principles, so you can implement exactly the tests you plan to do.
- This method can be useful in informing constraints for automated testing.

*Cons:*

- This approach may be the least efficient since the checks are not computationally optimised.
- It may be time-consuming, and the checks will have to be implemented from scratch for any new dataset.
- You do not get defined metrics out of the box.

### Automated testing

#### Deequ

Deequ is a library that has been built on top of Apache Spark. Its purpose is to define "unit tests for data" so that you can measure the quality of large datasets. 

If you are using Deequ within a Python environment, you can use a library called [PyDeequ](https://pydeequ.readthedocs.io/en/latest/README.html). There are four main components of Deequ:

- Metrics Computations:
   * `Profilers` leverage Analyzers to inspect the columns of your dataset.
   * `Analyzers` are foundational modules used to compute metrics for the profiling and validation of data at scale.

- Constraint Suggestion:
   * This is where the rules and constraints for the groups of Analyzers are defined. These are run over the datasets and returned to a Verification Suite.

- Constraint Verification:
   * Here data are validated based on the constraints that have been set.

- Metrics Repository:
   * The Metrics Repository allows you to save Deequ metrics and runs over time.

So, if we take a look at how these four components fit together into a flow, we will get something like this:
1. Analyzers are used to compute metrics for the data.
2. The Profiler is run on each column of the data to get a better understanding of the data.
3. The Constraint Suggestion step occurs where PyDeequ gives you a set of default constraints that it thinks will be useful. Here you would also modify the constraints to your specific use case as the defaults are likely to be very basic.
4. The Constraint Verification step happens when the constraints set in the previous step that make up the Verification Suite are validated. This allows you to see if the data quality passes or fails.
5. Once the Constraint Verification step has occurred, you can save the results of various runs to the Metrics Repository. This can be very helpful as you get a historical view of all the data quality checks that have been done on your data. 
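
To make the flow above concrete, here is a minimal plain-Python sketch of the same five steps. It deliberately does not use the PyDeequ API: the rows, the `completeness` and `maximum` helpers, the constraint thresholds, and the `metrics_repository` list are all hypothetical stand-ins chosen for illustration.

```python
from datetime import date

# Hypothetical rows standing in for a Spark DataFrame.
rows = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": None},
    {"id": 3, "price": 12.5},
]

def completeness(rows, column):
    """Analyzer stand-in: fraction of rows where `column` is not None."""
    return sum(r[column] is not None for r in rows) / len(rows)

def maximum(rows, column):
    """Analyzer stand-in: largest non-missing value in `column`."""
    return max(r[column] for r in rows if r[column] is not None)

# Steps 1-2: compute metrics for each column of interest.
metrics = {
    "completeness(id)": completeness(rows, "id"),
    "completeness(price)": completeness(rows, "price"),
    "maximum(price)": maximum(rows, "price"),
}

# Step 3: constraints, written as (metric name, predicate) pairs.
constraints = [
    ("completeness(id)", lambda v: v == 1.0),
    ("completeness(price)", lambda v: v >= 0.9),
    ("maximum(price)", lambda v: v <= 100.0),
]

# Step 4: verification; each constraint either passes or fails.
results = {name: check(metrics[name]) for name, check in constraints}

# Step 5: metrics repository stand-in; one entry is appended per run.
metrics_repository = []
metrics_repository.append(
    {"run_date": date.today(), "metrics": metrics, "results": results}
)

for name, passed in results.items():
    print(f"{name}: {'Success' if passed else 'Failure'}")
```

In this sketch the `price` completeness constraint fails because one of the three values is missing, while the other two constraints pass. With Deequ itself, the same ideas are expressed through its own Analyzers, Checks, and Verification Suite, running on Spark rather than on Python lists.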

This whole process can be integrated within your ETL pipeline at multiple steps. You could have a system that kicks off to check the source data ingested as well as a system that ensures the transformations are occurring correctly.

Deequ has a rich set of checks that are available to choose from, and if you ever find yourself having to implement data quality testing with Deequ, you can find all the methods available [here](https://pydeequ.readthedocs.io/en/latest/pydeequ.html#module-pydeequ.checks). Some useful examples include defining the minimum and maximum allowed values for a column as well as checking for data completeness. You can even do checks for correlation between two columns using the `hasCorrelation()` or `hasMutualInformation()` methods.

> To start using PyDeequ, all you need to do is run:
>
> `$ pip install pydeequ`

For a closer look at how to use PyDeequ, take a look at the [quick start](https://pydeequ.readthedocs.io/en/latest/README.html#quickstart).

*Pros:*

- This tool is specifically made for data quality testing and tracking.
- Seamless integration with Spark as it is built on top of Spark.
- Wide variety of built-in functions for data quality testing.
- Deequ provides a Profiler to automatically determine tests for your dataset using historical data.

*Cons:*

- Requires an Apache Spark environment (which is usually excessive for small datasets).

> 🚩️**Student instructions**🚩️
>
> Install `pydeequ` and run some of the data quality checks that were performed in the above sections. PyDeequ is very sensitive to the PySpark version being used. To run `pydeequ` successfully, you are advised to [create a new environment](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-environments) and install PySpark version 3.0.
>
> When creating a new environment, make sure the version of Python you are installing matches the version you are running on your machine to avoid version conflicts between your Spark driver and executors. Once you have created a new environment you can execute the below command to install the appropriate version of PySpark:
>
>`pip install pyspark==3.0`

#### Great Expectations

Great Expectations is one of the leading tools that can be used for validating, documenting, and profiling your data. 

It is a Python package that can be installed using the following commands:

> `$ pip install great_expectations`
>
> `$ great_expectations init` 

The first command installs Great Expectations and the second command initialises your Great Expectations deployment for a new project. This means that Great Expectations will create a new directory with the following structure:

great_expectations/
   - great_expectations.yml
   - expectations
   - checkpoints
   - notebooks
   - plugins
   - .gitignore
   - uncommitted/
      - config_variables.yml
      - documentations
      - validations


The Great Expectations team recommends deploying this setup within a virtual environment. Details on how to set up this virtual environment can be found [here](https://legacy.docs.greatexpectations.io/en/latest/reference/supporting_resources.html).

Great Expectations has the goal of helping you understand your data better so that communicating what you have built and expect becomes effortless. Three features are delivered to help you achieve this goal. Firstly, data are automatically profiled to build the Expectations. Next, Expectations are used to validate the data quality, and then this gets pushed automatically to Data Docs to give you an idea of your data's characteristics. 

Let's clarify how this works in a bit more detail. What exactly is an Expectation? An Expectation is what you will use to define how your data should look. Expectations are built using Profilers which learn about your data. A Profiler is what Great Expectations uses to analyse your dataset and generate a first pass set of Expectations. Once the Expectations have been built, one can validate new data using the Expectations that have been generated. You also have the ability to update the Expectations that have been auto-generated to home in on your dataset's requirements. Data Docs use Expectations to describe your data and diagnose problems by giving you a visual report on which constraints have been met. 


*Pros:*

- Auto-generated Expectations out of the box that can be modified as you learn about your data.
- High configurability (allows for very accurate studies of your data).
- Configurable data source connections.
- Can be run externally in a virtual environment, operating directly on the data and not requiring a Spark environment.

*Cons:*

- Can be difficult to set up due to the high level of configurability. 

#### Delta Expectations (Delta Live Tables)

Delta Expectations is a data quality tool built specifically for Delta Live Tables. Delta Live Tables is a feature of the latest Databricks Runtime and can be used to build reliable, maintainable, and testable data pipelines. The crucial difference here is that, when using Delta Live Tables, you will not be using a series of separate Apache Spark tasks. All you need to do is define the output schema, and Delta Live Tables will manage how the processing steps occur. 

Delta Expectations allows you to define the expected data quality as well as specify what to do in situations where the records do not meet the specifications. For more information on how to use Delta Expectations, have a look at the [Delta Live Tables quick start](https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-quickstart.html).

An example of how you would go about building Expectations with Delta Live Tables would start with creating a new notebook and adding code that would implement a data pipeline. Following this step, you will need to create a new pipeline job, and once the job has been completed, you can view the results of this pipeline job. 

The Expectations would be created within the notebook that makes up a part of your pipeline and will be used to read in the new data and clean it to the desired level. 

An Expectation may be defined in a similar way to the example below and includes a description, an invariant, and some sort of action to take when the record fails the invariant. 

```python
@dlt.expect("valid timestamp", "timestamp > '2012-01-01'")
```

Here we are expecting the values in the timestamp column to be later than 1 January 2012. There are many other options available for Expectations, such as dropping the records that do not meet the criteria using the `@dlt.expect_or_drop` decorator. You can also include multiple Expectations, as in this example:

```python
@dlt.expect_all({"valid_count": "count > 0", "valid_current_page": "current_page_id IS NOT NULL AND current_page_title IS NOT NULL"})
```
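
The difference between the possible actions can be sketched in plain Python. This is not the Delta Live Tables API: `apply_expectation`, its `action` values, and the sample records are hypothetical names used only to illustrate the behaviour of keeping, dropping, or failing on records that violate an invariant (in Delta Live Tables these correspond to the `expect`, `expect_or_drop`, and `expect_or_fail` decorators).

```python
def apply_expectation(records, predicate, action="warn"):
    """Return (kept_records, violation_count) for one expectation.

    action: "warn" keeps failing records, "drop" removes them,
    and "fail" raises an error if any record fails.
    """
    failed = [r for r in records if not predicate(r)]
    if action == "fail" and failed:
        raise ValueError(f"{len(failed)} record(s) violated the expectation")
    if action == "drop":
        kept = [r for r in records if predicate(r)]
    else:
        kept = list(records)
    return kept, len(failed)

# Hypothetical records; ISO date strings compare correctly as text.
records = [
    {"timestamp": "2011-06-30"},
    {"timestamp": "2015-03-14"},
]

def is_valid(record):
    """Invariant matching the timestamp rule shown earlier."""
    return record["timestamp"] > "2012-01-01"

kept, violations = apply_expectation(records, is_valid, action="warn")
print(len(kept), violations)  # 2 1

kept, violations = apply_expectation(records, is_valid, action="drop")
print(len(kept), violations)  # 1 1
```

In both cases the violation is counted, which mirrors how failed Expectations are reported as data quality metrics for the pipeline; only the treatment of the offending record differs.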

*Pros:*

- Leverages new [Lakehouse](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html) architecture.
- Seamless integration with Spark.
- Data pipelines can be integrated easily with Expectations.

*Cons:*

- Delta Expectations is a new technology and is currently not well known in the data engineering community.
- Requires an Apache Spark environment and Databricks.

#### Repurposed tools – MLflow

Another option is using a client like MLflow to log the metrics for the tests. This approach allows for logging the data as it comes into storage, effectively creating a very robust data lineage process. However, MLflow is not technically the correct tool for the process and, therefore, there will be certain instances where it does not work as expected. This is because MLflow is traditionally used to add structure and reproducibility to the machine learning lifecycle. If logging is your main purpose, other tools can also be used, such as Splunk and Logstash.

*Pros:*

- One tool for data quality and model quality tracking.
- Very commonly used, which means there is a large development community to help when you are trying to solve issues.

*Cons:*

- The tool is not optimised for storing data quality metrics.
- You do not get defined metrics out of the box.
- Using packages with specific requirements may complicate environment setup.

## Conclusion

Well done on completing this train! We introduced the concept of data quality as well as methods of handling your data with automated data quality validation and testing. Including these data quality tests can make or break your project.

You should now be familiar with the following topics:
- Data quality testing and validation
- The six dimensions of data quality
- How to calculate the six dimensions of data quality
- Some of the tools available for automated data quality testing

## Resources:

* [Data Quality Management](https://towardsdatascience.com/a-comprehensive-framework-for-data-quality-management-b110a0465e83)
* [The 6 Dimensions of Data Quality](https://towardsdatascience.com/the-six-dimensions-of-data-quality-and-how-to-deal-with-them-bdcf9a3dba71)
* [Automating DQ Checks with Great Expectations](https://urban-institute.medium.com/automating-data-quality-checks-with-great-expectations-f6b7a8e51201)
* [Data Quality Libraries](https://medium.com/datamindedbe/data-quality-libraries-the-right-fit-a6564641dfad)
* [Apache Griffin](https://griffin.apache.org/)
* [Trust Driven Development](https://medium.com/yotpoengineering/tdd-trust-driven-development-in-data-engineering-84b6e680d6ea)
* [LakeFS](https://lakefs.io/)
* [Understanding Great Expectations](https://medium.com/hashmapinc/understanding-great-expectations-and-how-to-use-it-7754c78962f4)