### Simplified Notes on ETL and ELT Pipelines

#### 1. **Introduction**
- The course is about building ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) data pipelines.
- Jake, the instructor, uses tools like Python, SQL, and Airflow to build these pipelines.

#### 2. **BI, ML, and AI**
- **Business Intelligence (BI)**, **Machine Learning (ML)**, and **AI** require data to be well-prepared.
- Data pipelines ensure data arrives in the right place, in the right shape, and at the right time for analysis or machine learning models.

#### 3. **Data Pipelines**
- Data pipelines move data from source systems (e.g., CSV, APIs, databases) to a destination.
- During this process, data is **extracted**, **transformed** (cleaned and modified), and then **loaded** into systems like data warehouses or BI tools.

#### 4. **ETL vs. ELT Pipelines**
- **ETL (Extract, Transform, Load)**:
  - Data is extracted, transformed (cleaned, structured), and then loaded into a destination.
  - ETL is commonly used with tools like Python and pandas for data manipulation.
  
- **ELT (Extract, Load, Transform)**:
  - Data is extracted, loaded into the destination (e.g., data warehouse), and then transformed after loading.
  - ELT is often used with large-scale data in data warehouses.

#### 5. **ETL Pipeline Example**
- The ETL pipeline uses three functions:
  1. **Extract**: Pull data from a source.
  2. **Transform**: Clean and modify data using Python.
  3. **Load**: Write data to a SQL database.
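
The three steps above can be sketched as follows. This is a minimal illustration rather than the course's own code: the sample rows, the `cleaned_sales` table name, and the in-memory SQLite database (standing in for a SQL database) are all assumptions.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical raw data; a real pipeline would read a file or call an API.
RAW_CSV = io.StringIO(
    "product_id,quantity,price\n"
    "1,2,9.99\n"
    "2,,4.50\n"
    "3,1,20.00\n"
)


def extract(source):
    # Pull data from a source into a DataFrame
    return pd.read_csv(source)


def transform(data_frame):
    # Clean in Python: drop rows with a missing quantity, then add a derived column
    cleaned = data_frame.dropna(subset=["quantity"]).copy()
    cleaned["total"] = cleaned["quantity"] * cleaned["price"]
    return cleaned


def load(data_frame, connection, table_name):
    # Write the cleaned rows to a SQL table
    data_frame.to_sql(table_name, connection, if_exists="replace", index=False)


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection, "cleaned_sales")

row_count = connection.execute("SELECT COUNT(*) FROM cleaned_sales").fetchone()[0]
print(row_count)
```

Only the cleaned rows reach the database; the row with the missing quantity is dropped in Python before loading.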

#### 6. **ELT Pipeline Example**
- The ELT pipeline also uses three functions:
  1. **Extract**: Pull data from a source.
  2. **Load**: Load data into a destination (like a data warehouse).
  3. **Transform**: Perform data transformation after loading, usually with SQL queries.
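
For contrast, here is a rough sketch of the same idea in ELT order. It assumes an in-memory SQLite database as a stand-in for a data warehouse, and the table names and sample rows are made up for illustration.

```python
import io
import sqlite3

import pandas as pd

# In-memory SQLite stands in for a data warehouse (an assumption for this sketch).
warehouse = sqlite3.connect(":memory:")


def extract(source):
    # Pull data from a source into a DataFrame
    return pd.read_csv(source)


def load(data_frame, table_name):
    # Load the raw, untransformed rows first
    data_frame.to_sql(table_name, warehouse, if_exists="replace", index=False)


def transform(source_table, target_table):
    # Transform inside the warehouse with SQL, after loading
    warehouse.execute(f"DROP TABLE IF EXISTS {target_table}")
    warehouse.execute(
        f"""
        CREATE TABLE {target_table} AS
            SELECT product_id, quantity * price AS total
            FROM {source_table}
            WHERE quantity IS NOT NULL
        """
    )


raw = extract(io.StringIO("product_id,quantity,price\n1,2,10.0\n2,,4.5\n3,1,20.0\n"))
load(raw, "raw_sales")
transform("raw_sales", "cleaned_sales")

rows = warehouse.execute(
    "SELECT product_id, total FROM cleaned_sales ORDER BY product_id"
).fetchall()
print(rows)
```

The raw table stays in the warehouse untouched, so the transformation can be rerun or changed later without extracting the data again.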

#### 7. **Additional Learning**
- The course will cover both **tabular** (structured) and **non-tabular** (semi-structured or unstructured) data.
- **Pandas** is used for data transformation and writing data to SQL databases.
- Topics like **unit testing**, **monitoring**, and **deploying pipelines to production** will also be covered.

#### 8. **Hands-On Practice**
- The course encourages building real-world ETL and ELT pipelines through practical exercises.

### **Running an ETL Pipeline**
Ready to run your first ETL pipeline? Let's get to it!

Here, the functions extract(), transform(), and load() have been defined for you. To run this data ETL pipeline, you're going to execute each of these functions. If you're curious, take a peek at what the extract() function looks like.
```
def extract(file_name):
    print(f"Extracting data from {file_name}")
    return pd.read_csv(file_name)
```

```
# Extract data from the raw_data.csv file
extracted_data = extract(file_name="raw_data.csv")

# Transform the extracted_data
transformed_data = transform(data_frame=extracted_data)

# Load the transformed_data to cleaned_data.csv
load(data_frame=transformed_data, target_table="cleaned_data")
```


### **ELT in Action**
Feeling pretty good about running ETL processes? Well, it's time to give ELT pipelines a try. Like before, the extract(), load(), and transform() functions have been defined for you; all you'll have to worry about is running these functions. Good luck!

```
# Extract data from the raw_data.csv file
raw_data = extract(file_name="raw_data.csv")

# Load raw_data into the raw_data table
load(data_frame=raw_data, table_name="raw_data")

# Transform data in the raw_data table
transform(
  source_table="raw_data",
  target_table="cleaned_data"
)
```


### Simplified Notes on Building ETL and ELT Pipelines – Part 2

#### 1. **Overview**
- We’re now diving deeper into how ETL and ELT pipelines are built and run in practice.
- The focus is on using Python and pandas to extract, transform, and load data.

#### 2. **Extracting Data from CSV**
- Use the `pandas` library to read data from a CSV file.
- Function used: `pd.read_csv("file_path")`
- Optional parameters include:
  - `delimiter`: the character that separates fields (an alias for `sep`; the default is a comma)
  - `header`: the row number that contains the column names
  - `engine`: the parser to use (`"c"`, `"python"`, or `"pyarrow"`)
- Use `.head()` to preview the first few rows of the DataFrame (default is 5).
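
As a small illustration of these parameters, the sketch below reads a made-up, semicolon-delimited file from an in-memory string. The sample text and its values are assumptions; only `pd.read_csv` and `.head()` come from the notes.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited file with a title line above the column names.
raw_text = (
    "Industry report\n"
    "name;num_firms\n"
    "Apparel;39\n"
    "Advertising;58\n"
    "Aerospace;77\n"
)

df = pd.read_csv(
    io.StringIO(raw_text),
    delimiter=";",    # fields are separated by semicolons, not commas
    header=1,         # the second line (index 1) holds the column names
    engine="python",  # use the pure-Python parser instead of the default C parser
)

print(df.head(2))  # preview only the first two rows
```

Passing a file path instead of `io.StringIO(...)` works the same way.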

#### 3. **Filtering a DataFrame**
- After extracting, the data can be filtered using `.loc[]`.
- Example usage:
  - `df.loc[df["name"] == "Apparel", :]` → filters rows where `name` is "Apparel"
  - `df.loc[:, ["name", "num_firms"]]` → selects only the specified columns
- `loc` is powerful for filtering rows and columns at the same time.
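
A quick sketch of these `.loc[]` patterns on a small, made-up DataFrame (the `region` column and all values are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Apparel", "Advertising", "Aerospace"],
        "num_firms": [39, 58, 77],
        "region": ["US", "US", "EU"],
    }
)

# Rows only: keep rows where name is "Apparel", with all columns
apparel_rows = df.loc[df["name"] == "Apparel", :]

# Columns only: keep every row, but just two columns
name_and_count = df.loc[:, ["name", "num_firms"]]

# Rows and columns at once: a boolean mask plus a column list
large_industries = df.loc[df["num_firms"] > 50, ["name", "num_firms"]]

print(large_industries)
```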

#### 4. **Writing a DataFrame to CSV**
- Use `.to_csv("file_path")` to write a DataFrame back to a CSV file.
- Just like `read_csv`, `to_csv` also has optional settings to control the output format.
- Other formats for exporting data:
  - `.to_json()`
  - `.to_excel()`
  - `.to_sql()` (for databases)
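
To see a couple of these writers side by side, here is a short sketch. The DataFrame contents are made up, and the output is kept in memory as strings so nothing is written to disk.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Apparel", "Advertising"], "num_firms": [39, 58]})

# With no path, to_csv returns the CSV text instead of writing a file.
# sep changes the delimiter; index=False leaves out the row index.
csv_text = df.to_csv(sep="|", index=False)

# orient="records" produces one JSON object per row
json_text = df.to_json(orient="records")

print(csv_text)
print(json_text)
```

Passing a file path as the first argument, as in `.to_csv("file_path")`, writes the same content to disk.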

#### 5. **Using SQL for Transformation**
- You can run SQL queries directly from Python using connectors like:
  - SQLAlchemy
  - Snowflake Connector
- SQL queries are often written as multi-line strings.
- Use methods like `.execute()` to run queries and create tables in the database.
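
The sketch below shows the general pattern of a multi-line SQL string passed to `.execute()`. It uses Python's built-in `sqlite3` module as a stand-in, which is an assumption made so the example runs without a warehouse; SQLAlchemy and the Snowflake connector set up their connections differently. The table and column names are also hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a real warehouse connection
connection = sqlite3.connect(":memory:")

# Seed a small raw table so the query below has something to read
connection.execute(
    "CREATE TABLE raw_sales (product_id INTEGER, quantity INTEGER, price REAL)"
)
connection.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [(1, 2, 10.0), (2, 3, 5.0)],
)

# The query is written as a multi-line string for readability
create_total_sales = """
CREATE TABLE total_sales AS
    SELECT
        product_id,
        quantity * price AS total
    FROM raw_sales;
"""
connection.execute(create_total_sales)

totals = connection.execute(
    "SELECT product_id, total FROM total_sales ORDER BY product_id"
).fetchall()
print(totals)
```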

#### 6. **Combining ETL Steps**
- Each step (extract, transform, load) can be defined in separate Python functions.
- Example ETL flow:
  1. `extract()` reads data from a CSV.
  2. `transform()` filters rows with `name == "Apparel"`.
  3. `load()` saves the result to `cleaned_data.csv`.
- This structure will be used throughout the course for building pipelines.
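
Put together, the flow above might look like the sketch below. The temporary directory and the sample rows are assumptions so the example can run on its own; `index=False` is an optional choice that keeps the row index out of the output file.

```python
import os
import tempfile

import pandas as pd


def extract(file_name):
    # Read the raw CSV into a DataFrame
    return pd.read_csv(file_name)


def transform(data_frame):
    # Keep only the rows where name is "Apparel"
    return data_frame.loc[data_frame["name"] == "Apparel", :]


def load(data_frame, file_name):
    # Write the result to a CSV file without the row index
    data_frame.to_csv(file_name, index=False)


# Hypothetical input file, created here so the sketch is self-contained
work_dir = tempfile.mkdtemp()
raw_path = os.path.join(work_dir, "raw_data.csv")
cleaned_path = os.path.join(work_dir, "cleaned_data.csv")
pd.DataFrame(
    {"name": ["Apparel", "Advertising"], "num_firms": [39, 58]}
).to_csv(raw_path, index=False)

# Run the pipeline: extract, then transform, then load
load(transform(extract(raw_path)), cleaned_path)

result = pd.read_csv(cleaned_path)
print(result)
```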

#### 7. **Practice Time**
- You’re encouraged to try out these tools and build a small ETL/ELT pipeline on your own now.

### **Building an ETL Pipeline**
Ready to ratchet up the fun? In this exercise, you'll be responsible for building the rest of the load() function before running each step in the ETL process. The extract() and transform() functions have been defined for you. Good luck!

```
def load(data_frame, file_name):
  # Write data_frame to a CSV using file_name
  data_frame.to_csv(file_name)
  print(f"Successfully loaded data to {file_name}")

# Extract data from the raw_data.csv file
extracted_data = extract(file_name="raw_data.csv")

# Transform extracted_data using transform() function
transformed_data = transform(data_frame=extracted_data)

# Load transformed_data to the file transformed_data.csv
load(data_frame=transformed_data, file_name="transformed_data.csv")
```
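
One detail worth knowing about `to_csv`: by default it also writes the DataFrame's row index as an unnamed first column. The sketch below, using made-up data, compares the default with `index=False` by reading both outputs back in.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Apparel", "Advertising"], "num_firms": [39, 58]})

# Default: the row index is written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the DataFrame's own columns are written
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index depends on the data; for a default integer index, `index=False` usually gives a cleaner file.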


### **The "T" in ELT**
Let's not forget about ELT! Here, the extract() and load() functions have been defined for you. Now, all that's left is to finish defining the transform() function and run the pipeline. Go get 'em!

```
# Complete building the transform() function
def transform(source_table, target_table):
  # SQL string literals use single quotes; double quotes denote identifiers in standard SQL
  data_warehouse.execute(f"""
  CREATE TABLE {target_table} AS
      SELECT
          CONCAT('Product ID: ', product_id),
          quantity * price
      FROM {source_table};
  """)

extracted_data = extract(file_name="raw_sales_data.csv")
load(data_frame=extracted_data, table_name="raw_sales_data")

# Populate total_sales by transforming raw_sales_data
transform(source_table="raw_sales_data", target_table="total_sales")
```

### **Extracting, Transforming, and Loading Industry Data**
Alright, it's time to build your own ETL pipeline from scratch. In this exercise, you'll build three functions: extract(), transform(), and load(). Then, you'll use these functions to run your pipeline.

The pandas library has been imported as pd. Enjoy!

```
def extract(file_name):
  # Read a CSV with a path stored using file_name into memory
  return pd.read_csv(file_name)


def transform(data_frame):
  # Filter the data_frame to only include a subset of columns
  return data_frame.loc[:, ["industry_name", "number_of_firms"]]


def load(data_frame, file_name):
  # Write the data_frame to a CSV
  data_frame.to_csv(file_name)


extracted_data = extract(file_name="raw_industry_data.csv")
transformed_data = transform(data_frame=extracted_data)

# Pass the transformed_data DataFrame to the load() function
load(data_frame=transformed_data, file_name="number_of_firms.csv")
```
