In [1]:
# run this to shorten the data import from the files
import os
cwd = os.path.dirname(os.getcwd())+'/'
path_data = os.path.join(os.path.dirname(os.getcwd()), 'datasets/')


# Testing data pipelines

Validating a data pipeline is one of the most important measures that a Data Engineer can take to ensure that a pipeline will perform as expected when deployed to production.

Select all the benefits of validating a pipeline during and after development.

### Possible Answers

    Improves reliability and trust in pipelined data{Answer}


    Validate that data is extracted, transformed, and loaded as expected{Answer}


    Reduces need for thorough documentation


    Helps to identify and avoid data quality issues{Answer}

In [None]:
# exercise 01

"""
Validating a data pipeline at "checkpoints"

In this exercise, you'll be working with a data pipeline that extracts tax data from a CSV file, creates a new column, filters out rows based on average taxable income, and persists the data to a parquet file.

pandas has been loaded as pd, and the extract(), transform(), and load() functions have already been defined. You'll use these functions to validate the data pipeline at various checkpoints throughout its execution.
"""

# Instructions

"""

    Print the shape of the raw_tax_data and clean_tax_data DataFrames and observe the difference in dimensions.
---

    Read the parquet file at "clean_tax_data.parquet" into a DataFrame called to_validate, then inspect the .head() of both to_validate and clean_tax_data.
---

    Check that the to_validate and clean_tax_data DataFrames are equal.
"""

# solution

raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)
load(clean_tax_data, "clean_tax_data.parquet")

print(f"Shape of raw_tax_data: {raw_tax_data.shape}")
print(f"Shape of clean_tax_data: {clean_tax_data.shape}")

to_validate = pd.read_parquet("clean_tax_data.parquet")
print(clean_tax_data.head(3))
print(to_validate.head(3))

# Check that the DataFrames are equal
print(to_validate.equals(clean_tax_data))


#----------------------------------#

# Conclusion

"""
Fantastic validation! Validating data as it flows through a pipeline ensures that the pipeline performs as it should, and can help catch bugs or faulty logic before the solution is deployed into production.
"""

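The checkpoint idea from exercise 01 can be sketched in a self-contained form. This is a minimal sketch under stated assumptions, not the course's implementation: the input rows are made up, the `transform()` body is a stand-in based on the exercise description, and CSV is used instead of parquet so that no parquet engine is required.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the course's raw tax data.
raw_tax_data = pd.DataFrame({
    "industry_name": ["Farming", "Mining", "Retail"],
    "number_of_firms": [10, 4, 20],
    "total_taxable_income": [2000, 200, 5000],
})


def transform(raw_data):
    """Add average_taxable_income and keep rows where it exceeds 100."""
    data = raw_data.copy()  # leave the caller's DataFrame untouched
    data["average_taxable_income"] = (
        data["total_taxable_income"] / data["number_of_firms"]
    )
    filtered = data.loc[data["average_taxable_income"] > 100, :]
    return filtered.set_index("industry_name")


clean_tax_data = transform(raw_tax_data)

# Checkpoint 1: compare dimensions before and after the transformation.
print(f"raw: {raw_tax_data.shape}, clean: {clean_tax_data.shape}")

# Checkpoint 2: persist, read back, and compare with what was written.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean_tax_data.csv")
    clean_tax_data.to_csv(path)
    to_validate = pd.read_csv(path, index_col="industry_name")

# assert_frame_equal raises with a readable diff instead of returning False.
pd.testing.assert_frame_equal(to_validate, clean_tax_data)
```

`DataFrame.equals()` answers the same question with a plain boolean; `pd.testing.assert_frame_equal()` is used here because its error message points at the first mismatch, which tends to be more helpful when a checkpoint fails.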

In [1]:
# exercise 02

"""
Testing a data pipeline end-to-end

In this exercise, you'll be working with the same data pipeline as before, which extracts, transforms, and loads tax data. You'll practice testing this pipeline end-to-end to ensure the solution can be run multiple times, without duplicating the transformed data in the parquet file.

pandas has been loaded as pd, and the extract(), transform(), and load() functions have already been defined.
"""

# Instructions

"""
    
    Run the ETL pipeline three times, using a for-loop.
    
    Print the shape of the clean_tax_data in each iteration of the pipeline run.
    
    Read the DataFrame stored in the "clean_tax_data.parquet" file into the to_validate variable.
    
    Output the shape of the to_validate DataFrame, comparing it to the shape of clean_tax_data to ensure data wasn't duplicated upon each pipeline run.
"""

# solution

# Trigger the data pipeline to run three times
for attempt in range(3):
    print(f"Attempt: {attempt}")
    raw_tax_data = extract("raw_tax_data.csv")
    clean_tax_data = transform(raw_tax_data)
    load(clean_tax_data, "clean_tax_data.parquet")

    # Print the shape of the clean_tax_data DataFrame
    print(f"Shape of clean_tax_data: {clean_tax_data.shape}")

# Read in the loaded data, check the shape
to_validate = pd.read_parquet("clean_tax_data.parquet")
print(f"Final shape of cleaned data: {to_validate.shape}")


#----------------------------------#

# Conclusion

"""
Great work! By testing this pipeline end-to-end, you've validated that the pipeline can be run multiple times, with data being made available to downstream consumers without duplication.
"""

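Whether repeated runs duplicate data depends on how the load step writes. The sketch below contrasts an overwriting load with an appending one. Both `load_overwrite()` and `load_append()` are hypothetical helpers, the rows are made up, and CSV stands in for parquet to keep the example dependency-free.

```python
import os
import tempfile

import pandas as pd

# Hypothetical transformed data.
clean_tax_data = pd.DataFrame({
    "industry_name": ["Farming", "Retail"],
    "tax_rate": [0.2, 0.3],
})


def load_overwrite(data, path):
    # Default mode="w" replaces the file, so reruns leave one copy of the data.
    data.to_csv(path, index=False)


def load_append(data, path):
    # mode="a" adds rows on every run; the header is written only once.
    data.to_csv(path, mode="a", header=not os.path.exists(path), index=False)


with tempfile.TemporaryDirectory() as tmp_dir:
    overwrite_path = os.path.join(tmp_dir, "overwrite.csv")
    append_path = os.path.join(tmp_dir, "append.csv")

    # Run both load strategies three times, as in the exercise.
    for attempt in range(3):
        load_overwrite(clean_tax_data, overwrite_path)
        load_append(clean_tax_data, append_path)

    overwritten = pd.read_csv(overwrite_path)
    appended = pd.read_csv(append_path)

# The overwriting load is safe to rerun; the appending load is not.
assert overwritten.shape == clean_tax_data.shape
print(f"overwrite: {overwritten.shape}, append: {appended.shape}")
```

Comparing the final row count against a single run's row count, as the exercise does, is what exposes the appending variant.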

In [2]:
# exercise 03

"""
Validating a data pipeline with assert and isinstance

To build unit tests for data pipelines, it's important to get familiar with the assert keyword, and the isinstance() function. In this exercise, you'll practice using these two tools to validate components of a data pipeline.

The functions extract() and transform() have been made available for you, along with pandas, which has been imported as pd. Both extract() and transform() return a DataFrame. Good luck!
"""

# Instructions

"""


    Assert that the clean_tax_data DataFrame has five columns.

---
    Validate that the object stored in the clean_tax_data variable is an instance of a pd.DataFrame.

---
    Assert that the value stored in the clean_tax_data variable is an instance of pd.DataFrame.

---
    Try asserting that clean_tax_data takes the type str, and observe the exception.

"""

# solution

raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)

# Validate the number of columns in the DataFrame
assert len(clean_tax_data.columns) == 5


#----------------------------------#

raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)

# Determine whether clean_tax_data is an instance of pd.DataFrame
isinstance(clean_tax_data, pd.DataFrame)


#----------------------------------#

raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)

# Assert that clean_tax_data is an instance of a pd.DataFrame
assert isinstance(clean_tax_data, pd.DataFrame)


#----------------------------------#

raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)

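# Assert that clean_tax_data is an instance of str, and observe the exception
try:
    assert isinstance(clean_tax_data, str)
except AssertionError as e:
    # A bare assert carries no message, so print the repr to make it visible
    print(repr(e))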


#----------------------------------#

# Conclusion

"""
Super work! You leveraged isinstance() to validate data types, and assert to ensure that boolean expressions evaluate to True. Getting comfortable with these tools will help when writing unit tests!
"""

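An `assert` can carry a message after a comma, which makes a failure easier to diagnose than an empty `AssertionError`. A minimal sketch, assuming a made-up DataFrame and a hypothetical helper named `check_is_dataframe()`:

```python
import pandas as pd

# Hypothetical transformed data.
clean_tax_data = pd.DataFrame({"tax_rate": [0.1, 0.2]})


def check_is_dataframe(obj):
    # The expression after the comma becomes the AssertionError message.
    assert isinstance(obj, pd.DataFrame), (
        f"expected pd.DataFrame, got {type(obj).__name__}"
    )


check_is_dataframe(clean_tax_data)  # passes silently

try:
    check_is_dataframe("not a DataFrame")
except AssertionError as error:
    message = str(error)
    print(f"AssertionError: {message}")
```

Note that Python skips `assert` statements when run with the `-O` flag, so they suit tests and development checks rather than validation that must always run in production.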

In [3]:
# exercise 04

"""
Writing unit tests with pytest

In this exercise, you'll practice writing a unit test to validate a data pipeline. You'll use assert and other tools to build the tests, and determine if the data pipeline performs as it should.

The functions extract() and transform() have been made available for you, along with pandas, which has been imported as pd. You'll be testing the transform() function, which is shown below.

def transform(raw_data):
    raw_data["average_taxable_income"] = raw_data["total_taxable_income"] / raw_data["number_of_firms"]
    clean_data = raw_data.loc[raw_data["average_taxable_income"] > 100, :]
    clean_data.set_index("industry_name", inplace=True)
    return clean_data

"""

# Instructions

"""


    Import the pytest library.

    Assert that the value stored in the clean_tax_data variable is an instance of a pd.DataFrame.

    Validate that the number of columns in the clean_tax_data DataFrame is greater than the columns stored in the raw_tax_data DataFrame.

"""

# solution

import pytest

def test_transformed_data():
    raw_tax_data = extract("raw_tax_data.csv")
    clean_tax_data = transform(raw_tax_data)
    
    # Assert that the transform function returns a pd.DataFrame
    assert isinstance(clean_tax_data, pd.DataFrame)
    
    # Assert that the clean_tax_data DataFrame has more columns than the raw_tax_data DataFrame
    assert len(clean_tax_data.columns) > len(raw_tax_data.columns)


#----------------------------------#

# Conclusion

"""
There you go! Building unit tests with pytest is as easy as creating and evaluating basic boolean statements with the help of the assert keyword. Keep up the great work!
"""

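One detail worth keeping in mind when testing the `transform()` from exercise 04: it assigns the new column to its input, so the raw DataFrame is modified as well. Any comparison against the raw columns made after the call therefore sees the already-modified input. The sketch below takes a snapshot of the columns before the call. The input rows are made up, and `transform()` mirrors the one in the exercise except that `set_index()` is called without `inplace=True`.

```python
import pandas as pd


def transform(raw_data):
    # Same steps as the transform() in exercise 04, except that set_index()
    # is called without inplace=True so the filtered slice is not modified.
    raw_data["average_taxable_income"] = (
        raw_data["total_taxable_income"] / raw_data["number_of_firms"]
    )
    clean_data = raw_data.loc[raw_data["average_taxable_income"] > 100, :]
    return clean_data.set_index("industry_name")


def test_transform_adds_average_column():
    # Hypothetical input rows standing in for raw_tax_data.csv.
    raw_tax_data = pd.DataFrame({
        "industry_name": ["Farming", "Mining"],
        "number_of_firms": [10, 4],
        "total_taxable_income": [2000, 200],
    })
    columns_before = list(raw_tax_data.columns)  # snapshot before the call

    clean_tax_data = transform(raw_tax_data)

    assert isinstance(clean_tax_data, pd.DataFrame)
    assert "average_taxable_income" not in columns_before
    assert "average_taxable_income" in clean_tax_data.columns
    # The input was modified too: it now carries the new column.
    assert len(raw_tax_data.columns) == len(columns_before) + 1
    # Only rows above the threshold survive.
    assert (clean_tax_data["average_taxable_income"] > 100).all()


test_transform_adds_average_column()  # pytest would normally discover and run this
```

Asserting on specific column names rather than on column counts keeps the test meaningful even when a step such as `set_index()` removes a column.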

In [4]:
# exercise 05

"""
Creating fixtures with pytest

When building unit tests, you'll sometimes have to do a bit of setup before testing can begin. Doing this setup within a unit test can make the tests more difficult to read, and may have to be repeated several times. Luckily, pytest offers a way to solve these problems, with fixtures.

For this exercise, pandas has been imported as pd, and the extract() function shown below is available for use!

def extract(file_path):
    return pd.read_csv(file_path)

"""

# Instructions

"""

    Import the pytest library.

    Create a pytest fixture called raw_tax_data.

    Return the raw_data DataFrame.

"""

# solution

# Import pytest
import pytest

# Create a pytest fixture
@pytest.fixture()
def raw_tax_data():
    raw_data = extract("raw_tax_data.csv")

    # Return the raw DataFrame
    return raw_data


#----------------------------------#

# Conclusion

"""
Fantastic fixtures! Creating pytest fixtures helps to keep unit tests more concise, and helps to separate test setup from actual testing logic.
"""

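A fixture becomes useful once a test requests it. pytest matches the test's parameter name to the fixture name and passes in the fixture's return value. The following is a minimal sketch, assuming in-memory CSV text (`RAW_CSV`, made up here) in place of the `raw_tax_data.csv` file and a hypothetical test name.

```python
import io

import pandas as pd
import pytest

# Hypothetical CSV content standing in for raw_tax_data.csv.
RAW_CSV = (
    "industry_name,number_of_firms,total_taxable_income\n"
    "Farming,10,2000\n"
    "Mining,4,200\n"
)


def extract(source):
    # Accepts a file path or a file-like object, as pd.read_csv does.
    return pd.read_csv(source)


@pytest.fixture()
def raw_tax_data():
    # Setup lives here, so tests that request the fixture stay short.
    return extract(io.StringIO(RAW_CSV))


def test_raw_tax_data_has_rows(raw_tax_data):
    # pytest supplies the fixture's return value, matched by parameter name.
    assert isinstance(raw_tax_data, pd.DataFrame)
    assert len(raw_tax_data) == 2
```

Saved in a file whose name starts with `test_`, this would be run with the `pytest` command. Fixture functions are not meant to be called directly from test code.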

In [5]:
# exercise 06

"""
Unit testing a data pipeline with fixtures

You've learned in the last video that unit testing can help to instill more trust in your data pipeline, and can even help to catch bugs throughout development. In this exercise, you'll practice writing both fixtures and unit tests, using the pytest library and assert.

The transform function that you'll be building unit tests around is shown below for reference. pandas has been imported as pd, and the pytest library is loaded and ready for use.

def transform(raw_data):
    raw_data["tax_rate"] = raw_data["total_taxes_paid"] / raw_data["total_taxable_income"]
    raw_data.set_index("industry_name", inplace=True)
    return raw_data

"""

# Instructions

"""


    Create a pytest fixture called clean_tax_data.
    Apply the transform() function to the raw_data dataset, and save the result in the clean_data variable and return it.
---

    Create a unit test using the fixture defined from the last step.
    Complete the statement that ensures all values in the "tax_rate" column lie within the values 0 and 1.

"""

# solution

@pytest.fixture()
def clean_tax_data():
    raw_data = pd.read_csv("raw_tax_data.csv")
    clean_data = transform(raw_data)
    return clean_data

# Pass the fixture to the function
def test_tax_rate(clean_tax_data):
    # Assert values are within the expected range
    assert clean_tax_data["tax_rate"].max() <= 1 and clean_tax_data["tax_rate"].min() >= 0


#----------------------------------#

# Conclusion

"""
Awesome work! Using fixtures and unit tests together helps to make tests both easy to read, and easy to write.
"""

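A range check built on `max()` and `min()` ignores missing values, because both skip NaN by default. If missing tax rates should also fail the test (an assumption about the requirements, not something the exercise states), `Series.between()` is one alternative. The values below are made up for illustration.

```python
import pandas as pd

# Hypothetical tax rates; the last value is missing.
tax_rate = pd.Series([0.10, 0.25, float("nan")], name="tax_rate")

# max() and min() skip NaN by default, so this check passes despite the gap.
min_max_ok = tax_rate.max() <= 1 and tax_rate.min() >= 0

# between() is evaluated per row and is False for NaN, so the gap is caught.
all_in_range = tax_rate.between(0, 1).all()

print(min_max_ok, all_in_range)
```

A missing rate can arise, for example, when both the taxes paid and the taxable income are zero, since the division then yields NaN.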

# Orchestration and ETL tools

When deploying data pipelines to production, Data Engineers need to make sure that their pipelines can run consistently on a schedule, have access to a flexible quantity of resources, and alert on failure. To do this, Data Engineers will often look outside of a Python script to an orchestration and ETL tool.

What is the most popular orchestration tool for building, deploying, and monitoring data pipelines?

### Possible Answers


    Custom-built tools
    
    
    Airflow{Answer}
    
    
    Prefect
    
    
    Dagster

In [6]:
# exercise 07

"""
Data pipeline architecture patterns

When building data pipelines, it's best to separate the files where functions are being defined from where they are being run.

In this exercise, you'll practice importing components of a pipeline into memory before using these functions to run the pipeline end-to-end. The project takes the following format, where pipeline_utils stores the extract(), transform(), and load() functions that will be used to run the pipeline.

> ls
 etl_pipeline.py
 pipeline_utils.py

"""

# Instructions

"""

    Import the extract, transform, and load functions from the pipeline_utils module.
    Use the functions imported to run the data pipeline end-to-end.

"""

# solution

# Import the extract, transform, and load functions from utils
from pipeline_utils import extract, transform, load

# Run the pipeline end to end by extracting, transforming and loading the data
raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)
load(clean_tax_data, "clean_tax_data.parquet")


#----------------------------------#

# Conclusion

"""
Great job! You've successfully imported data pipeline components from a utils file, and run the data pipeline end-to-end.
"""

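The separation between definitions and execution can be sketched in a single runnable file. In the two-file layout from exercise 07, the three functions would live in `pipeline_utils.py` and be imported. Here they are inlined with hypothetical bodies, `run_pipeline()` is an illustrative name, and in-memory buffers replace the CSV and parquet files.

```python
import io

import pandas as pd


# These three functions would live in pipeline_utils.py; the bodies are
# hypothetical stand-ins so that the sketch runs on its own.
def extract(source):
    return pd.read_csv(source)


def transform(raw_data):
    data = raw_data.copy()
    data["tax_rate"] = data["total_taxes_paid"] / data["total_taxable_income"]
    return data.set_index("industry_name")


def load(clean_data, target):
    clean_data.to_csv(target)


# etl_pipeline.py would then only orchestrate the calls.
def run_pipeline(source, target):
    raw_tax_data = extract(source)
    clean_tax_data = transform(raw_tax_data)
    load(clean_tax_data, target)
    return clean_tax_data


# The guard keeps the pipeline from running when this file is merely imported.
if __name__ == "__main__":
    source = io.StringIO(
        "industry_name,total_taxes_paid,total_taxable_income\nFarming,50,200\n"
    )
    target = io.StringIO()
    run_pipeline(source, target)
    print(target.getvalue())
```

Keeping the run logic behind the `__name__` guard means tests can import `run_pipeline()` without triggering a pipeline run as a side effect.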

In [7]:
# exercise 08

"""
Running a data pipeline end-to-end

It's important to monitor the performance of a pipeline when running in production. Earlier in the course, you explored tools such as exception handling and logging. In this last exercise, we'll practice running a pipeline end-to-end, while monitoring for exceptions and logging performance.
"""

# Instructions

"""


    From the pipeline_utils.py file, import the extract(), transform(), and load() functions.
---
    Use the extract(), transform(), and load() functions to run the tax data pipeline end-to-end, within the try-except block.
---
    Use the logging module to log an info-level success message if the pipeline executes as expected.
    Create an error-level log if an exception occurs within the pipeline. Be sure to include the name of the exception in the log output.

"""

# solution

import logging
from pipeline_utils import extract, transform, load

logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)

try:
    raw_tax_data = extract("raw_tax_data.csv")
    clean_tax_data = transform(raw_tax_data)
    load(clean_tax_data, "clean_tax_data.parquet")

    logging.info("Successfully extracted, transformed and loaded data.")  # Log a success message.

except Exception as e:
    logging.error(f"Pipeline failed with error: {e}")  # Log failure message


#----------------------------------#

# Conclusion

"""
Incredible! Using the logging module, try-except logic, and previously built ETL functionality, you've created an environment to run a pipeline end-to-end. Congrats!
"""

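Formatting an exception with `{e}` yields only its message. The class name, such as `FileNotFoundError`, comes from `type(e).__name__`. The sketch below logs both. The logger name `tax_pipeline`, the `run_pipeline()` helper, and the failing step are hypothetical, and log output is captured in a string buffer so that it can be inspected.

```python
import io
import logging

# Capture log output in memory so it can be inspected afterwards.
log_stream = io.StringIO()
logger = logging.getLogger("tax_pipeline")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)
logger.propagate = False  # keep records out of the root logger


def run_pipeline(steps):
    """Run callables in order; log success or the exception's class name."""
    try:
        for step in steps:
            step()
        logger.info("Successfully extracted, transformed and loaded data.")
        return True
    except Exception as error:
        # type(error).__name__ is the class name; str(error) is the message.
        logger.error(f"Pipeline failed with {type(error).__name__}: {error}")
        return False


def failing_extract():
    # Hypothetical step that simulates a missing input file.
    raise FileNotFoundError("raw_tax_data.csv not found")


run_pipeline([lambda: None])      # succeeds, logs at INFO level
run_pipeline([failing_extract])   # fails, logs at ERROR level
print(log_stream.getvalue())
```

When the full traceback is wanted as well, `logger.exception()` can be called inside the `except` block; it logs at error level and appends the traceback.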