# Processing Big Data - Deequ Analysis

© Explore Data Science Academy

## Honour Code
I {**SIPHO**, **SHIMANGE**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).
    Non-compliance with the honour code constitutes a material breach of contract.


## Context

Having completed manual data quality checks, you will have seen how cumbersome the process can become. As the Data Engineer in the team, you have researched some tools that could spare the team this manual work. In your research, you came across a tool called [Deequ](https://github.com/awslabs/deequ), which is a library for measuring the data quality of large datasets.

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://github.com/Explore-AI/Pictures/raw/master/data_engineering/transform/predict/DataQuality.jpg"
     alt="Data Quality"
     style="float: center; padding-bottom: 0.5em"
     width=100%/>
     <p><em>Figure 1. Six dimensions of data quality</em></p>
</div>

You present this tool to your manager; he is quite impressed and gives you the go-ahead to use this in your implementation. You are now required to perform some data quality tests using this automated data testing tool.
 

> ## 🚩️ Important Notice 🚩️
>
>To successfully run `pydeequ` without any errors, please make sure that you have an environment that is running pyspark version 3.0.
> You are advised to **create a new conda environment** and install this specific version of pyspark to avoid any technical issues:
>
> `pip install pyspark==3.0`

<br>

## Import dependencies

If you do not have `pydeequ` already installed, install it using the following command:
- `pip install pydeequ`
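
The pyspark version requirement from the notice above can be checked before a Spark session is created. The following is a minimal sketch, assuming the installed version string has the usual `major.minor.patch` form; the helper name `is_supported_pyspark` is hypothetical and not part of `pyspark` or `pydeequ`.

```python
def is_supported_pyspark(version, required=(3, 0)):
    """Return True when `version` (e.g. "3.0.1") matches the required major.minor."""
    parts = version.split(".")
    try:
        major_minor = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        # Version strings that cannot be parsed are treated as unsupported
        return False
    return major_minor == required


# Usage (requires pyspark to be installed in the active environment):
# import pyspark
# assert is_supported_pyspark(pyspark.__version__), pyspark.__version__
```

Failing early on a mismatched version is usually easier to diagnose than an error raised later from inside the JVM.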

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pydeequ
from pydeequ.analyzers import *
from pydeequ.profiles import *
from pydeequ.suggestions import *
from pydeequ.checks import *
from pydeequ.verification import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, DoubleType, IntegerType, DateType, NumericType, StructType, StringType, StructField, BooleanType

In [3]:
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

## Read data into spark dataframe

In this notebook, we set out to run some data quality tests, with the possibility of running them end to end on the years 1963, 1974, 1985, 1996, 2007, and 2018.

> ℹ️ **Instructions** ℹ️
>
>1. Make use of the `Data_ingestion_student_version.ipynb` notebook to create the parquet files for the following years:
>       - 1963
>       - 1974
>       - 1985
>       - 1996
>       - 2007
>       - 2018
>
>2. Ingest the data for the years given above. You should only do it one year at a time.
>3. Ingest the metadata file.


When developing your code, it will be sufficient to focus on a single year. However, after your development is done, you will need to run this notebook for all of the given years above so that you can answer all the questions given in the Data Testing MCQ.
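
One way to prepare for the full run is to build the input path for each year up front. This is a minimal sketch, assuming the parquet files live under `../output/<year>` (the layout used by the `path` variable in this notebook); the helper name `parquet_path` is hypothetical.

```python
YEARS = [1963, 1974, 1985, 1996, 2007, 2018]


def parquet_path(year, base_dir="../output"):
    """Build the folder path that holds the parquet files for one year."""
    return f"{base_dir}/{year}"


# One path per year; each year would still be ingested and tested one at a time
paths_by_year = {year: parquet_path(year) for year in YEARS}
```

Setting `year` to each key in turn and re-running the notebook keeps the one-year-at-a-time rule intact.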

In [4]:
#TODO: Write your code here
# Use this variable (year) to determine which year you are focusing on
year = 1963

## **Run tests on the dataset**

## Test 1 - Null values ⛔️
For the first test, you are required to check the data for completeness.
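
As a point of reference, completeness can be read as the fraction of non-null values in a column, and a completeness constraint passes only when that fraction is 1.0. The plain-Python sketch below illustrates the metric on a small hypothetical column; it is not Deequ's implementation, and the handling of an empty column is an assumption made for this sketch.

```python
def completeness(values):
    """Fraction of entries that are not None (1.0 means no missing values)."""
    if not values:
        # Assumption for this sketch only: an empty column counts as complete
        return 1.0
    non_null = sum(1 for value in values if value is not None)
    return non_null / len(values)


# Hypothetical column with one missing value out of four
sample_close = [19.06, None, 18.94, 19.17]
sample_completeness = completeness(sample_close)
```

Here `sample_completeness` is 0.75, so a check that requires full completeness would fail for this column.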

> ℹ️ **Instructions** ℹ️
>
>1. Make use of the `Verification Suite` and write code to check for missing values in the data. 
>2. Display the results of your test.
>
> *You may use as many cells as necessary*


In [5]:
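from pyspark.sql.utils import AnalysisException

def file_merger(path):
    """Read every parquet file under `path` into a single Spark DataFrame."""
    try:
        # `header` is a CSV option and has no effect on parquet, so it is dropped
        return spark.read.parquet(path, recursiveFileLookup=True)
    except AnalysisException as e:
        # Spark reports a missing path as AnalysisException, not FileNotFoundError
        print(f'Folder not found: Error {e}')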

In [6]:
path = f'../output/{year}'
df = file_merger(path)
df.describe().show(truncate=False)


+-------+------------------+-------------------+-------------------+-------------------+-------------------+-----------------+-----+
|summary|open              |high               |low                |close              |adj_close          |volume           |stock|
+-------+------------------+-------------------+-------------------+-------------------+-------------------+-----------------+-----+
|count  |5020              |5020               |5020               |5020               |5020               |5020             |5020 |
|mean   |1.1365598366658882|19.17383405059962  |18.943191553180316 |19.06198000288253  |7.208458180424869  |525223.0677290837|NULL |
|stddev |4.757480282400388 |61.95001257800802  |61.21457985181493  |61.60101770760707  |29.681165595553527 |910433.9606092003|NULL |
|min    |0.0               |0.06785380095243454|0.06563635170459747|0.06607984006404878|4.89296041905618E-7|0                |AA   |
|max    |303.125           |315.625            |311.875            |3

In [7]:
# Initialize the Verification Suite
verification_suite = VerificationSuite(spark)

In [8]:
# Create a single Check object for completeness
check = Check(spark, CheckLevel.Warning, 'Completeness Check')

for column in df.columns:
    check = check.isComplete(column)

In [9]:
# Run the verification suite
results = (verification_suite
          .onData(df)
          .addCheck(check)
          .run())

In [10]:
# Convert result to a DataFrame and display it
result_df = VerificationResult.checkResultsAsDataFrame(spark, results)
result_df.show(truncate=False)

+------------------+-----------+------------+---------------------------------------------------------+-----------------+------------------+
|check             |check_level|check_status|constraint                                               |constraint_status|constraint_message|
+------------------+-----------+------------+---------------------------------------------------------+-----------------+------------------+
+------------------+-----------+------------+---------------------------------------------------------+-----------------+------------------+





## Test 2 - Zero Values 🅾️

For the second test, you are required to check for zero values within the dataset.
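
A zero-value check is not the same as a non-negativity check: a condition of the form `value >= 0` accepts zeros, whereas `value != 0` rejects them. The plain-Python sketch below makes the difference concrete on hypothetical sample values; the helper name `compliance` is illustrative and simply computes the fraction of values that satisfy a condition.

```python
def compliance(values, predicate):
    """Fraction of values for which `predicate` returns True."""
    return sum(1 for value in values if predicate(value)) / len(values)


# Hypothetical volume values, two of which are zero
volumes = [0, 1200, 0, 5300]

non_negative = compliance(volumes, lambda v: v >= 0)  # zeros satisfy this condition
non_zero = compliance(volumes, lambda v: v != 0)      # zeros violate this condition
```

With these values `non_negative` is 1.0 while `non_zero` is 0.5, so only the second condition reveals the zeros.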

> ℹ️ **Instructions** ℹ️
>
>1. Make use of the `Verification Suite` and write code to check for zero values within the data. 
>2. Display the results of your test.
>
> *You may use as many cells as necessary*

In [56]:
# List of numerical columns to check for zero values
numerical_columns = ["open", "high", "low", "close", "adj_close", "volume"]

# Note: isNonNegative only asserts value >= 0, so zeros pass this check;
# it guards against negative values and does not detect zero values
check = Check(spark, CheckLevel.Error, "Check for zero values")

for column in numerical_columns:
    # Add a non-negativity constraint (value >= 0) for the column
    check = check.isNonNegative(column)

# Create the Verification Suite
verification_suite = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(check)

# Run the Verification Suite
verification_result = verification_suite.run()

# Extract and show the check results
for result in verification_result.checkResults:
    print(result)

{'check_status': 'Success', 'check_level': 'Error', 'constraint_status': 'Success', 'check': 'Check for zero values', 'constraint_message': '', 'constraint': 'ComplianceConstraint(Compliance(open is non-negative,COALESCE(CAST(open AS DECIMAL(20,10)), 0.0) >= 0,None,List(open),None))'}
{'check_status': 'Success', 'check_level': 'Error', 'constraint_status': 'Success', 'check': 'Check for zero values', 'constraint_message': '', 'constraint': 'ComplianceConstraint(Compliance(high is non-negative,COALESCE(CAST(high AS DECIMAL(20,10)), 0.0) >= 0,None,List(high),None))'}
{'check_status': 'Success', 'check_level': 'Error', 'constraint_status': 'Success', 'check': 'Check for zero values', 'constraint_message': '', 'constraint': 'ComplianceConstraint(Compliance(low is non-negative,COALESCE(CAST(low AS DECIMAL(20,10)), 0.0) >= 0,None,List(low),None))'}
{'check_status': 'Success', 'check_level': 'Error', 'constraint_status': 'Success', 'check': 'Check for zero values', 'constraint_message': '', '

In [73]:
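# Select the numeric columns (everything except 'stock' and 'date')
columns = [column for column in df.columns if column not in ('stock', 'date')]

# Create a Check object that flags zero values
check = Check(spark, CheckLevel.Warning, 'Zero Values Check')
for column in columns:
    # The assertion requires every row to satisfy the condition
    check = check.satisfies(f'{column} != 0', f'{column} should not contain zero values', lambda x: x == 1.0)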

In [74]:
# Run the verification suite
# A fresh VerificationSuite is built here because `verification_suite` was
# rebound to a VerificationRunBuilder in an earlier cell, and that builder
# has no `onData` method
results = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

In [71]:
# Convert result to a DataFrame and display it
result_df = VerificationResult.checkResultsAsDataFrame(spark, results)
result_df.show()

## Test 3 - Negative values ➖️
The third test requires you to check that all values in the data are positive.

> ℹ️ **Instructions** ℹ️
>
>1. Make use of the `Verification Suite` and write code to check negative values within the dataset. 
>2. Display the results of your test.
>
> *You may use as many cells as necessary*

In [14]:
# Create a Check object for negative values
# 'stock' and 'date' are excluded because non-negativity applies only to numeric columns
columns = [column for column in df.columns if column not in ('stock', 'date')]

check = Check(spark, CheckLevel.Warning, 'Negative Values Check')
for column in columns:
    check = check.isNonNegative(column)

In [15]:
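# Run the verification suite
# Build a fresh VerificationSuite so the run does not depend on a
# `verification_suite` variable that another cell may have rebound
results = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())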

In [16]:
# Convert result to a DataFrame and display it
result_df = VerificationResult.checkResultsAsDataFrame(spark, results)
result_df.show()

+--------------------+-----------+------------+--------------------+-----------------+------------------+
|               check|check_level|check_status|          constraint|constraint_status|constraint_message|
+--------------------+-----------+------------+--------------------+-----------------+------------------+
+--------------------+-----------+------------+--------------------+-----------------+------------------+



## Test 4 - Determine Maximum Values ⚠️

For the fourth test, we want to find the maximum values of the numerical fields in the dataset. Extreme values can often be used to define an upper bound for a column, so we can adopt them as threshold values.

> ℹ️ **Instructions** ℹ️
>
>1. Make use of the `Column Profiler Runner` to generate summary statistics for all the available columns. 
>2. Extract the maximum values for all the numeric columns in the data.
>
> *You may use as many cells as necessary*

In [98]:
# Profile the data using ColumnProfilerRunner
results = ColumnProfilerRunner(spark) \
    .onData(df) \
    .run()

Unable to map type DateType


In [113]:
# Extract summary statistics for all columns
profiled_columns = results.profiles
row_count = df.count()

In [123]:
# Display summary statistics
for column, profile in profiled_columns.items():
    if column != 'stock' and column != 'date':
        print(f"Column '{column}':")
        print(f"  Count: {row_count}")
        print(f"  Maximum value: {profile.maximum}")
        print(f"  Minimum value: {profile.minimum}")
        print(f"  Mean value: {profile.mean}")
        print(f"  Standard deviation: {profile.stdDev}")
        print()

Column 'open':
  Count: 5020
  Maximum value: 303.125
  Minimum value: 0.0
  Mean value: 1.1365598366658882
  Standard deviation: 4.75700640618203

Column 'low':
  Count: 5020
  Maximum value: 311.875
  Minimum value: 0.06563635170459747
  Mean value: 18.943191553180316
  Standard deviation: 61.2084824784394

Column 'close':
  Count: 5020
  Maximum value: 313.75
  Minimum value: 0.06607984006404878
  Mean value: 19.06198000288253
  Standard deviation: 61.59488184248825

Column 'volume':
  Count: 5020
  Maximum value: 20692800.0
  Minimum value: 0.0
  Mean value: 525223.0677290837
  Standard deviation: 910343.2754194132

Column 'adj_close':
  Count: 5020
  Maximum value: 148.7704620361328
  Minimum value: 4.89296041905618e-07
  Mean value: 7.208458180424869
  Standard deviation: 29.678209156919213

Column 'high':
  Count: 5020
  Maximum value: 315.625
  Minimum value: 0.06785380095243454
  Mean value: 19.17383405059962
  Standard deviation: 61.943841950712674
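
The maximum values reported above for 1963 can be turned into upper-bound thresholds for later checks. The sketch below is a plain-Python illustration under that assumption; the helper `find_threshold_violations` and the sample row are hypothetical, and whether the 1963 maxima are suitable bounds for other years is a judgment call.

```python
# Maximum values taken from the profiler output for the 1963 data
MAX_THRESHOLDS = {
    "open": 303.125,
    "high": 315.625,
    "low": 311.875,
    "close": 313.75,
    "adj_close": 148.7704620361328,
    "volume": 20692800.0,
}


def find_threshold_violations(row, thresholds):
    """Return the columns in `row` whose value exceeds its threshold."""
    return [
        column
        for column, limit in thresholds.items()
        if row.get(column) is not None and row[column] > limit
    ]


# Hypothetical record in which only `high` exceeds its threshold
sample_row = {"open": 10.0, "high": 400.0, "low": 9.5, "volume": 1000.0}
violations = find_threshold_violations(sample_row, MAX_THRESHOLDS)
```

For the sample row, `violations` contains only `high`; columns missing from the row are skipped rather than flagged.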



## Test 5 - Stock Tickers 💹️

For the fifth test, we want to determine if the stock tickers contained in our dataset are consistent. To do this, you will need to make use of the metadata file to check that the stock names used in the dataframe are valid.

> ℹ️ **Instructions** ℹ️
>
>1. Make use of the `Verification Suite` and write code to determine if the stock tickers contained in the dataset appear in the metadata file.
>2. Display the results of your test.
>
> *You may use as many cells as necessary*

In [7]:
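from pyspark.sql.utils import AnalysisException

def read_metadata(path):
    """Read the metadata CSV file into a Spark DataFrame."""
    try:
        return spark.read.csv(path, header=True)
    except AnalysisException as e:
        # Spark reports a missing path as AnalysisException, not FileNotFoundError
        print(f'File not found: Error {e}')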

In [20]:
metadata_path = '../symbols_valid_meta.csv'
metadata_df = read_metadata(metadata_path)

In [22]:
# Extract valid tickers
valid_tickers = metadata_df.select('NASDAQ Symbol').rdd.flatMap(lambda x: x).collect()
valid_tickers_set = set(valid_tickers)

                                                                                

In [24]:
# Broadcast the valid tickers set
broadcast_valid_tickers = spark.sparkContext.broadcast(valid_tickers_set)

In [25]:
# Check if a ticker is valid
def is_valid_ticker(ticker):
    return ticker in broadcast_valid_tickers.value

In [28]:
# Register the function as a UDF
valid_ticker_udf = F.udf(is_valid_ticker, BooleanType())

In [33]:
# Check for valid tickers
df_ticker = df.withColumn('is_valid_ticker', valid_ticker_udf(F.col('stock')))

In [34]:
# Create the Verification Suite
verification_suite = VerificationSuite(spark) \
    .onData(df_ticker) \
    .addCheck(
        Check(spark, CheckLevel.Error, 'Checking stock tickers')
        .isComplete('stock')
        .isContainedIn('stock', list(valid_tickers_set))
    )

verification_result = verification_suite.run()

In [35]:
# Extract the check results as a data frame
check_results = VerificationResult.checkResultsAsDataFrame(spark, verification_result)
check_results.show()

+--------------------+-----------+------------+--------------------+-----------------+------------------+
|               check|check_level|check_status|          constraint|constraint_status|constraint_message|
+--------------------+-----------+------------+--------------------+-----------------+------------------+
|Checking stock ti...|      Error|     Success|CompletenessConst...|          Success|                  |
|Checking stock ti...|      Error|     Success|ComplianceConstra...|          Success|                  |
+--------------------+-----------+------------+--------------------+-----------------+------------------+
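
A containment check reports only whether the constraint held. If it ever fails, it helps to see which tickers are missing from the metadata. The sketch below shows one way to do that with a set difference on collected values; the helper name `find_unknown_tickers` and the sample tickers are hypothetical. For data too large to collect to the driver, a left anti join in Spark would serve the same purpose.

```python
def find_unknown_tickers(tickers, valid_tickers):
    """Return the sorted tickers that do not appear in the valid set."""
    return sorted(set(tickers) - set(valid_tickers))


# Hypothetical values: "ZZZZ" stands in for a ticker absent from the metadata
sample_tickers = ["AA", "BA", "ZZZZ", "AA"]
sample_valid = {"AA", "BA", "GE"}
unknown = find_unknown_tickers(sample_tickers, sample_valid)
```

Here `unknown` holds the single entry `ZZZZ`; an empty list would correspond to a passing containment check.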



## Test 6 - Duplication 👥️
Lastly, we want to determine the uniqueness of the items found in the dataframe. You need to make use of the Verification Suite to check for duplicate entries.

Similar to the previous notebook - `Data_profiling_student_version.ipynb`, the first thing to check will be if the primary key values within the dataset are unique - in our case, that will be a combination of the stock name and the date. Secondly, we want to check if the entries are all unique, which is done by checking for duplicates across the whole dataset.
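
Two related metrics are useful here. As Deequ's documentation describes them, distinctness is the fraction of distinct values over all values, and uniqueness is the fraction of values that occur exactly once. The plain-Python sketch below illustrates both on hypothetical `(stock, date)` keys; it is an illustration of the definitions, not Deequ's implementation.

```python
from collections import Counter


def distinctness(keys):
    """Number of distinct keys divided by the total number of keys."""
    return len(set(keys)) / len(keys)


def uniqueness(keys):
    """Number of keys that occur exactly once divided by the total number of keys."""
    counts = Counter(keys)
    return sum(1 for count in counts.values() if count == 1) / len(keys)


# Hypothetical primary keys containing one duplicated (stock, date) pair
sample_keys = [("AA", "1963-01-02"), ("AA", "1963-01-02"), ("BA", "1963-01-02")]
sample_distinctness = distinctness(sample_keys)  # 2 distinct keys out of 3
sample_uniqueness = uniqueness(sample_keys)      # 1 key occurs exactly once out of 3
```

Both metrics equal 1.0 only when no key is repeated, which is why asserting a value of 1.0 on either one amounts to a duplicate check.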

> ℹ️ **Instructions** ℹ️
>
>1. Make use of the `Verification Suite` and write code to determine the uniqueness of entries contained within the dataset.
>2. Display the results of your test.
>
> *You may use as many cells as necessary*



In [36]:
# Verify unique combination of stock and date
verification_suite = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        Check(spark, CheckLevel.Error, 'Checking stock and date uniqueness')
        .isComplete('stock')
        .isComplete('date')
        .hasDistinctness(['stock', 'date'], lambda x: x == 1.0)  # Ensures each (stock, date) pair is unique
    )

verification_result = verification_suite.run()

In [40]:
# Extract the check results as a data frame
check_results = VerificationResult.checkResultsAsDataFrame(spark, verification_result)
check_results.show(truncate=False)

+----------------------------------+-----------+------------+-----------------------------------------------------------+-----------------+------------------+
|check                             |check_level|check_status|constraint                                                 |constraint_status|constraint_message|
+----------------------------------+-----------+------------+-----------------------------------------------------------+-----------------+------------------+
|Checking stock and date uniqueness|Error      |Success     |CompletenessConstraint(Completeness(stock,None,None))      |Success          |                  |
|Checking stock and date uniqueness|Error      |Success     |CompletenessConstraint(Completeness(date,None,None))       |Success          |                  |
|Checking stock and date uniqueness|Error      |Success     |DistinctnessConstraint(Distinctness(Stream(stock, ?),None))|Success          |                  |
+----------------------------------+----------

In [38]:
# Check for duplicates across the entire dataset
verification_suite_all = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        Check(spark, CheckLevel.Error, 'Checking for duplicates across the entire dataset')
        .hasDistinctness(df.columns, lambda x: x == 1.0)  # Ensures all rows are unique
    )

verification_result_all = verification_suite_all.run()

In [39]:
# Extract the check results as a data frame
check_results_all = VerificationResult.checkResultsAsDataFrame(spark, verification_result_all)
check_results_all.show(truncate=False)

+-------------------------------------------------+-----------+------------+----------------------------------------------------------+-----------------+------------------+
|check                                            |check_level|check_status|constraint                                                |constraint_status|constraint_message|
+-------------------------------------------------+-----------+------------+----------------------------------------------------------+-----------------+------------------+
|Checking for duplicates across the entire dataset|Error      |Success     |DistinctnessConstraint(Distinctness(Stream(date, ?),None))|Success          |                  |
+-------------------------------------------------+-----------+------------+----------------------------------------------------------+-----------------+------------------+

