# 1. Introduction to Data Pipelines

Get ready to discover how data is collected, processed, and moved using data pipelines. You will explore the qualities of the best data pipelines, and prepare to design and build your own.

## Libraries

In [30]:
# Common
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas SQL
import pandasql as ps

## User Variables

In [13]:
raw_data_csv_path = "../Datasets/raw_data.csv"
raw_data = pd.read_csv(raw_data_csv_path)

# Exercises

## 1. Running an ETL Pipeline

### Description

Ready to run your first ETL pipeline? Let's get to it!

Here, the functions ``extract()``, ``transform()``, and ``load()`` have been defined for you. To run this data ETL pipeline, you're going to execute each of these functions. If you're curious, take a peek at what the ``extract()`` function looks like.

```py
def extract(file_name):
    print(f"Extracting data from {file_name}")
    return pd.read_csv(file_name)

def transform(data_frame):
    print(f"Transforming {data_frame.shape[0]} rows of raw data.")
    # Return the DataFrame so the next step receives it
    return data_frame

def load(data_frame, target_table):
    print(f"Loading cleaned data to {target_table}.")
```

### Instructions

* Use the ``extract()`` function to extract data from the ``raw_data.csv`` file.
* Transform the ``extracted_data`` DataFrame using the ``transform()`` function.
* Finally, load the ``transformed_data`` DataFrame to the ``cleaned_data`` SQL table.

In [14]:
def extract(file_name):
    print(f"Extracting data from {file_name}")
    return pd.read_csv(file_name)

In [15]:
def transform(data_frame):
    print(f"Transforming {data_frame.shape[0]} rows of raw data.")
    # Return the DataFrame so that load() does not receive None
    return data_frame

In [16]:
def load(data_frame, target_table):
    print(f"Loading cleaned data to {target_table}.")

In [17]:
# Extract data from the raw_data.csv file
extracted_data = extract(file_name=raw_data_csv_path)

# Transform the extracted_data
transformed_data = transform(data_frame=extracted_data)

# Load the transformed_data to the cleaned_data table
load(data_frame=transformed_data, target_table="cleaned_data")


Extracting data from ../Datasets/raw_data.csv
Transforming 96 rows of raw data.
Loading cleaned data to cleaned_data.


Look at that! Using the ``extract()``, ``transform()``, and ``load()`` functions, you were able to run your first ETL pipeline. Congrats!
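
The ``transform()`` function in this exercise only prints a message. As a rough sketch of what a real transformation step might do, the example below runs the same three steps on a small in-memory CSV. The column names (``product_id``, ``quantity``, ``price``) are borrowed from the SQL query used later in this notebook; the sample rows, the ``total`` column, and the ``tables`` dict standing in for a SQL database are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for raw_data.csv
RAW_CSV = """product_id,quantity,price
A1,2,10.0
B2,,4.5
C3,3,2.0
"""


def extract(source):
    # Read CSV data (a path or a file-like object) into a DataFrame
    return pd.read_csv(source)


def transform(data_frame):
    # Drop rows with missing values, then derive a total per row
    cleaned = data_frame.dropna().copy()
    cleaned["total"] = cleaned["quantity"] * cleaned["price"]
    return cleaned


def load(data_frame, target_table, tables):
    # Store the cleaned DataFrame under target_table in a dict
    # that stands in for a SQL database
    tables[target_table] = data_frame


tables = {}
extracted_data = extract(io.StringIO(RAW_CSV))
transformed_data = transform(extracted_data)
load(transformed_data, "cleaned_data", tables)
print(tables["cleaned_data"])
```

Because each function returns its result, the output of one step can be passed directly to the next.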

## 2. ELT in Action

### Description

Feeling pretty good about running ETL processes? Well, it's time to give ELT pipelines a try. Like before, the ``extract()``, ``load()``, and ``transform()`` functions have been defined for you; all you'll have to worry about is running these functions. Good luck!

### Instructions

* Use the appropriate function to extract data from the ``raw_data.csv`` file.
* Load the ``raw_data`` DataFrame into the ``raw_data`` table in a data warehouse.
* Call the ``transform()`` function to transform the data in the ``raw_data`` source table.

In [40]:
class MockDataWarehouse:
    def __init__(self):
        self.tables = {}

    def load_table(self, raw_data, table_name):
        # Keep the raw data under its table name
        self.tables[table_name] = raw_data

    def run_sql(self, query):
        # Only report the query; nothing is executed in this mock
        print(f"Transformed data with the query: {query}")


In [41]:
def extract(file_name):
    print(f"Extracting data from {file_name}.")
    return pd.read_csv(file_name)

def transform(source_table, target_table):
    data_warehouse.run_sql(f"""
  CREATE TABLE {target_table} AS
      SELECT
          CONCAT("Product ID: ", product_id),
          quantity * price
      FROM {source_table};
  """)

def load(data_frame, table_name):
    print(f"Loading cleaned data to {table_name}.")
    data_warehouse.load_table(data_frame, table_name)

In [42]:
data_warehouse = MockDataWarehouse()

In [43]:
# Extract data from the raw_data.csv file
raw_data = extract(file_name=raw_data_csv_path)

# Load the raw_data DataFrame to the raw_data table
load(data_frame=raw_data, table_name="raw_data")

# Transform data in the raw_data table
transform(
  source_table="raw_data", 
  target_table="cleaned_data"
)

Extracting data from ../Datasets/raw_data.csv.
Loading cleaned data to raw_data.
Transformed data with the query: 
	CREATE TABLE cleaned_data AS
      SELECT
          CONCAT("Product ID: ", product_id),
          quantity * price
      FROM raw_data;
  


Just like that, you've successfully run your first ELT pipeline. Keep up the good work!
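
``MockDataWarehouse`` only prints the query it receives. As a hedged sketch of the same ELT flow against something that actually runs SQL, the example below swaps in an in-memory SQLite database. The ``SQLiteWarehouse`` class, the sample rows, and the column aliases are hypothetical; the query uses SQLite's ``||`` operator and single-quoted literals in place of ``CONCAT`` with double quotes.

```python
import sqlite3

import pandas as pd


class SQLiteWarehouse:
    """Hypothetical stand-in for a data warehouse, backed by in-memory SQLite."""

    def __init__(self):
        self.connection = sqlite3.connect(":memory:")

    def load_table(self, data_frame, table_name):
        # Write the raw DataFrame to a table, replacing it if it exists
        data_frame.to_sql(
            table_name, self.connection, index=False, if_exists="replace"
        )

    def run_sql(self, query):
        # Execute a statement inside the warehouse and commit it
        self.connection.execute(query)
        self.connection.commit()


warehouse = SQLiteWarehouse()

# Hypothetical rows standing in for the extracted raw data
raw_data = pd.DataFrame(
    {"product_id": ["A1", "B2"], "quantity": [2, 3], "price": [10.0, 1.5]}
)

# Load first, with no cleaning applied...
warehouse.load_table(raw_data, "raw_data")

# ...then transform inside the warehouse using SQL
warehouse.run_sql("""
    CREATE TABLE cleaned_data AS
    SELECT
        'Product ID: ' || product_id AS product_label,
        quantity * price AS total
    FROM raw_data;
""")

cleaned_data = pd.read_sql("SELECT * FROM cleaned_data", warehouse.connection)
print(cleaned_data)
```

The raw table stays in the database after the transformation, so the same source data can be transformed again later with a different query.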

## 3. ETL and ELT Pipelines

### Description

Select each of the statements about ETL and ELT pipelines that are true.

### Instructions

* Answer the question

### Answers

* Processes that extract, transform then load data are known as ETL pipelines.
* ELT processes are typically found in a data platform that leverages a data warehouse.

Fantastic! In the course, you'll predominantly spend time building ETL pipelines, but it's important to understand how (and why) ELT pipelines are being used in the modern data stack.
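
The difference between the two patterns comes down to where the transformation runs. The sketch below, which assumes a tiny hypothetical dataset and uses in-memory SQLite as a stand-in for a warehouse, computes the same totals both ways: ``run_etl()`` transforms in pandas before loading, while ``run_elt()`` loads the raw rows and transforms them with SQL afterwards. The function names and table contents are illustrative only.

```python
import sqlite3

import pandas as pd

# Hypothetical raw rows used by both pipelines
raw = pd.DataFrame(
    {"product_id": ["A1", "B2"], "quantity": [2, 3], "price": [10.0, 1.5]}
)


def run_etl(raw_data, connection):
    # ETL: transform in pandas first, then load only the result
    transformed = pd.DataFrame(
        {
            "product_id": raw_data["product_id"],
            "total": raw_data["quantity"] * raw_data["price"],
        }
    )
    transformed.to_sql("total_sales", connection, index=False)


def run_elt(raw_data, connection):
    # ELT: load the raw rows as-is, then transform with SQL in the database
    raw_data.to_sql("raw_sales_data", connection, index=False)
    connection.execute(
        "CREATE TABLE total_sales AS "
        "SELECT product_id, quantity * price AS total FROM raw_sales_data"
    )


etl_db = sqlite3.connect(":memory:")
elt_db = sqlite3.connect(":memory:")
run_etl(raw, etl_db)
run_elt(raw, elt_db)

etl_result = pd.read_sql("SELECT * FROM total_sales", etl_db)
elt_result = pd.read_sql("SELECT * FROM total_sales", elt_db)
print(etl_result)
print(elt_result)
```

Both databases end up with the same ``total_sales`` table, but only the ELT database also holds the untransformed ``raw_sales_data`` table.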

## 4. Building an ETL Pipeline

### Description

Ready to ratchet up the fun? In this exercise, you'll be responsible for building the rest of the ``load()`` function before running each step in the ETL process. The ``extract()`` and ``transform()`` functions have been defined for you. Good luck!

### Instructions

* Complete the ``load()`` function by writing the ``transformed_data`` DataFrame to a ``.csv`` file, using ``file_name``.
* Use the ``transform()`` function to clean the ``extracted_data`` DataFrame.
* Load ``transformed_data`` to the ``transformed_data.csv`` file using the ``load()`` function.

In [46]:
def transform(data_frame):
    print("Transformed the raw data returned from extract()")
    return data_frame

In [47]:
def load(data_frame, file_name):
  # Write data_frame to a CSV using file_name
  data_frame.to_csv(file_name)
  print(f"Successfully loaded data to {file_name}")

extracted_data = extract(file_name=raw_data_csv_path)

# Transform extracted_data using transform() function
transformed_data = transform(data_frame=extracted_data)

# Load transformed_data to the file transformed_data.csv
load(data_frame=transformed_data, file_name="transformed_data.csv")

Extracting data from ../Datasets/raw_data.csv.
Transformed the raw data returned from extract()
Successfully loaded data to transformed_data.csv


See, building an ETL pipeline doesn't need to be tricky! Using Python functions, it's easy to define and execute logic to extract, transform, and load data.
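
One detail worth knowing about the ``load()`` step: by default, ``DataFrame.to_csv()`` also writes the row index as an unnamed first column, which shows up as ``Unnamed: 0`` when the file is read back. The sketch below is a variation on ``load()`` that passes ``index=False`` and checks the round trip; the sample data and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def load(data_frame, file_name):
    # index=False keeps the row index out of the CSV file
    data_frame.to_csv(file_name, index=False)


# Hypothetical cleaned data for the round trip
transformed_data = pd.DataFrame(
    {"product_id": ["A1", "B2"], "total": [20.0, 4.5]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "transformed_data.csv")
    load(transformed_data, path)
    # Read the file back to confirm what was written
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
```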

## 5. The "T" in ELT

### Description

Let's not forget about ELT! Here, the ``extract()`` and ``load()`` functions have been defined for you. Now, all that's left is to finish defining the ``transform()`` function and run the pipeline. Go get 'em!

### Instructions

* Update the ``transform()`` function to call the ``.execute()`` method on the ``data_warehouse`` object.
* Use the newly-updated ``transform()`` function to populate data in the ``total_sales`` target table by transforming data in the ``raw_sales_data`` source table.

In [50]:
def load(data_frame, table_name):
    print("Loading extracted data to sale_items.")

In [52]:
class MockDataWarehouse:
    def __init__(self):
        self.tables = {}

    def load_table(self, raw_data, table_name):
        # Keep the raw data under its table name
        self.tables[table_name] = raw_data

    def run_sql(self, query):
        print(f"Transformed data with the query: {query}")

    def execute(self, query):
        # Only report the query; nothing is executed in this mock
        print(f"Ran the query: {query}")

In [53]:
data_warehouse = MockDataWarehouse()

In [54]:
# Complete building the transform() function
def transform(source_table, target_table):
  data_warehouse.execute(f"""
  CREATE TABLE {target_table} AS
      SELECT
          CONCAT("Product ID: ", product_id),
          quantity * price
      FROM {source_table};
  """)

extracted_data = extract(file_name=raw_data_csv_path)
load(data_frame=extracted_data, table_name="raw_sales_data")

# Populate total_sales by transforming raw_sales_data
transform(source_table="raw_sales_data", target_table="total_sales")

Extracting data from ../Datasets/raw_data.csv.
Loading extracted data to sale_items.
Ran the query: 
  CREATE TABLE total_sales AS
      SELECT
          CONCAT("Product ID: ", product_id),
          quantity * price
      FROM raw_sales_data;
  


You crushed it! Like with ETL, building an ELT pipeline can be as easy as defining and executing Python functions.

## 6. Extracting, Transforming, and Loading Student Scores Data

### Description

Alright, it's time to build your own ETL pipeline from scratch. In this exercise, you'll build three functions: ``extract()``, ``transform()``, and ``load()``. Then, you'll use these functions to run your pipeline.

The ``pandas`` library has been imported as ``pd``. Enjoy!

### Instructions

* In the ``extract()`` function, use the appropriate pandas function to read a CSV into memory.
* Update the ``transform()`` function to filter the ``data_frame`` to only include the columns ``industry_name`` and ``number_of_firms``.
* In the ``load()`` function, write the ``data_frame`` to the path stored in the ``file_name`` parameter.
* Pass the ``transformed_data`` DataFrame to the ``load()`` function, and run the ETL pipeline.

In [55]:
def extract(file_name):
  # Read a CSV with a path stored using file_name into memory
  return pd.read_csv(file_name)

def transform(data_frame):
  # Filter the data_frame to only include a subset of columns
  return data_frame.loc[:, ["industry_name", "number_of_firms"]]

def load(data_frame, file_name):
  # Write the data_frame to a CSV
  data_frame.to_csv(file_name)

# Pass the transformed_data DataFrame to the load() function
load(data_frame=transformed_data, file_name="number_of_firms.csv")
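
The cell above calls ``load()`` on a ``transformed_data`` variable created earlier in the notebook, so the new ``extract()`` and ``transform()`` functions are never actually run. As a minimal sketch of running all three steps together, the example below wraps them in a hypothetical ``run_pipeline()`` helper. Only the column names ``industry_name`` and ``number_of_firms`` come from the exercise; the sample rows, the extra ``region`` column, and the in-memory buffers are assumptions.

```python
import io

import pandas as pd

# Hypothetical sample; only the first two column names come from the exercise
RAW_CSV = """industry_name,number_of_firms,region
Retail,120,North
Mining,15,South
"""


def extract(source):
    # Read a CSV (a path or a file-like object) into memory
    return pd.read_csv(source)


def transform(data_frame):
    # Keep only the two columns of interest
    return data_frame.loc[:, ["industry_name", "number_of_firms"]]


def load(data_frame, target):
    # Write the DataFrame to a CSV without the row index
    data_frame.to_csv(target, index=False)


def run_pipeline(source, target):
    # Run the three steps in order so load() never receives stale data
    extracted_data = extract(source)
    transformed_data = transform(extracted_data)
    load(transformed_data, target)
    return transformed_data


output = io.StringIO()
result = run_pipeline(io.StringIO(RAW_CSV), output)
print(output.getvalue())
```

Passing file paths instead of the ``io.StringIO`` buffers would read from and write to disk in the same way.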

Congrats, you just built a data pipeline from the ground up. See you in the next chapter!