# **Note_2 is Continued from Note 1**


## **Intro to ETL**

ETL stands for Extract, Transform, Load. It is a process used in data warehousing and data integration to move data from various sources into a centralized repository, such as a data warehouse or a database. The ETL process consists of three main steps:

1. **Extract**: This step involves retrieving data from various source systems, which can include databases, flat files, APIs, or other data sources. The goal is to gather all relevant data needed for analysis.

2. **Transform**: In this step, the extracted data is cleaned, transformed, and prepared for analysis. This may involve filtering, aggregating, joining, or reshaping the data to ensure it is in the right format and structure for the target system.

3. **Load**: The final step is to load the transformed data into the target system, such as a data warehouse or a database. This makes the data available for querying and analysis by business intelligence tools or other applications.
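The three steps above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the CSV content, the `avg_temp` table, and the function names are all made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for a flat-file export.
RAW_CSV = """city,temp_c
Kathmandu,21.5
Pokhara,
Kathmandu,23.5
"""


def extract(raw_text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows with missing readings, then average per city."""
    readings = {}
    for row in rows:
        if not row["temp_c"]:
            continue  # cleaning step: skip incomplete records
        readings.setdefault(row["city"], []).append(float(row["temp_c"]))
    return [(city, sum(vals) / len(vals)) for city, vals in readings.items()]


def load(records, conn):
    """Load: write the aggregated records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS avg_temp (city TEXT, avg_c REAL)")
    conn.executemany("INSERT INTO avg_temp VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT city, avg_c FROM avg_temp").fetchall()
print(result)  # [('Kathmandu', 22.5)]
```

Keeping each stage as its own function mirrors how the stages later become separate Airflow tasks.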

## **ETL Process Overview**

The ETL process can be visualized as a pipeline where data flows through each of the three stages. Here’s a high-level overview of the ETL process:

```mermaid
graph TD
    A[Extract] --> B[Transform]
    B --> C[Load]
    C --> D[Data Warehouse]
    D --> E[Business Intelligence Tools]
```

## **Key Components of ETL**

- **Data Sources**: These can be databases, flat files, APIs, or any other systems where data resides.

- **ETL Tools**: Software applications that facilitate the ETL process, such as Apache Nifi, Talend, Informatica, or custom scripts.

- **Data Warehouse**: A centralized repository where transformed data is stored for analysis and reporting.

- **Data Quality**: Ensuring the accuracy, consistency, and reliability of the data throughout the ETL process.

- **Scheduling and Automation**: ETL processes can be scheduled to run at specific intervals or triggered by events to ensure data is always up-to-date.

## **Benefits of ETL**

- **Centralized Data**: ETL allows organizations to consolidate data from multiple sources into a single repository, making it easier to access and analyze.

- **Improved Data Quality**: The transformation step helps clean and standardize data, improving its quality for analysis.

- **Enhanced Decision Making**: By providing a unified view of data, ETL supports better decision-making processes within organizations.

- **Scalability**: ETL processes can be designed to handle large volumes of data, making them suitable for growing datasets.

## **ETL vs. ELT**

While ETL is a traditional approach, there is also a modern variant known as ELT (Extract, Load, Transform). In ELT, data is first extracted and loaded into the target system (like a data lake or data warehouse) before the transformation occurs. This approach leverages the processing power of modern databases to perform transformations after loading, allowing for more flexibility and scalability.
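To make the difference in ordering concrete, here is a small ELT-style sketch. It assumes an in-memory SQLite database as the target system, and the `raw_apod` and `media_counts` tables and their rows are invented for the example. The raw rows are loaded first and the transformation runs afterwards as SQL inside the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the target warehouse

# Extract + Load: raw records land in a staging table untouched.
raw_rows = [
    ("2025-05-27", "image"),
    ("2025-05-28", "video"),
    ("2025-05-29", "image"),
]
conn.execute("CREATE TABLE raw_apod (date TEXT, media_type TEXT)")
conn.executemany("INSERT INTO raw_apod VALUES (?, ?)", raw_rows)

# Transform: performed afterwards, inside the database, with SQL.
conn.execute(
    """
    CREATE TABLE media_counts AS
    SELECT media_type, COUNT(*) AS n
    FROM raw_apod
    GROUP BY media_type
    """
)
counts = dict(conn.execute("SELECT media_type, n FROM media_counts"))
print(counts)
```

Because the raw table is kept, a different transformation can be added later without extracting the data again.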

## **Conclusion**

ETL is a crucial process in data management that enables organizations to extract valuable insights from their data by transforming and loading it into a centralized repository. Understanding the ETL process is essential for anyone involved in data warehousing, business intelligence, or data integration projects. By implementing effective ETL practices, organizations can enhance their data quality, streamline their analytics processes, and make informed decisions based on accurate and timely information.


## **End to End ETL Pipeline with Airflow**

In this section, we will explore how to implement an end-to-end ETL pipeline using Apache Airflow, a powerful open-source tool for orchestrating complex workflows. Airflow allows you to define, schedule, and monitor ETL tasks efficiently.

### **Problem Statement**

We will create an ETL pipeline that extracts data from a public API, transforms it by cleaning and aggregating the data, and then loads it into a PostgreSQL database. The pipeline will be scheduled to run daily.

We will use the NASA Astronomy Picture of the Day (APOD) API to extract the metadata of the day's picture. The transform step keeps only the fields we need (title, explanation, URL, date, and media type), and the result is loaded into a PostgreSQL database for further analysis.

We will dockerize the Airflow environment to ensure consistency and ease of deployment. The pipeline will consist of the following steps:

1. **Extract**: Fetch data from the NASA APOD API.
2. **Transform**: Select and clean the required fields and prepare them for loading.
3. **Load**: Insert the transformed data into a PostgreSQL database.

Both the Airflow and PostgreSQL services will be run in Docker containers, allowing for easy setup and management of the ETL pipeline. Here we will learn how the communication between the Airflow and PostgreSQL containers is established using Docker networking.

We will also use Airflow hooks, which wrap connections to external systems such as PostgreSQL, so that tasks can reach the database with credentials stored in Airflow.

We will use different Airflow operators to perform the ETL tasks, including the `PythonOperator` for custom Python functions, the `PostgresOperator` for executing SQL commands, the `DockerOperator` for running tasks in Docker containers, and the HTTP operator for making HTTP requests to the NASA API.


## **Project Begins**

Refer to `Airflow_ETL_Pipeline_Astro_Postgres`

Note that the folder name should not contain special characters such as `(), ; -`.

By default, `Airflow` and `Postgres` run in separate Docker containers. Therefore, we need to establish communication between the two.

We create a `docker-compose.yml` file that starts a `Postgres` container with the database name, username, and password set through environment variables.

### **Docker-Compose**

```yml
version: "3"
services:
  postgres:
    image: postgres:13
    container_name: postgres_db
    environment:
      POSTGRES_USER: birat
      POSTGRES_PASSWORD: admin
      POSTGRES_DB: postgres
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - airflow_network

networks:
  airflow_network:
    external: false

volumes:
  postgres_data:
```

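In the above `yml` file, we use the `postgres` image and pass the database name, user, and password as environment variables.

The named volume `postgres_data` persists the database files across container restarts, and the `airflow_network` network is what the containers use to reach each other.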

We will need to have a common network so that the containers can talk.
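A quick way to confirm that two containers can actually talk is to try opening a TCP connection from one to the other. The helper below is a sketch using only the standard library, and `can_connect` is a name chosen for this example. From inside the Airflow container, the host would be the Postgres container name from the compose file (`postgres_db`) and the port `5432`.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


# Example call from inside the Airflow container (not executed here):
#   can_connect("postgres_db", 5432)
```

A `False` result usually means the containers are not on the same Docker network or the container name is misspelled.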

### **ETL DAG**

```Python

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook
import pendulum

# Define the DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # Replacement for the deprecated days_ago(1)
    'start_date': pendulum.today('UTC').add(days=-1),
    'retries': 1,
}

with DAG(
    'etl_dag',
    default_args=default_args,
    description='A simple ETL DAG',
    schedule='@daily',
) as dag:

    # Step 1: Create the table if it does not exist

    @task
    def create_table():
        # Same connection ID as the load task; it must exist in the Airflow UI
        pg_hook = PostgresHook(
            postgres_conn_id='my_postgres_connection',
        )

        # SQL command to create the table for the API data
        create_table_sql = """
        CREATE TABLE IF NOT EXISTS apod_data (
            id SERIAL PRIMARY KEY,
            title VARCHAR(255),
            explanation TEXT,
            url TEXT,
            date DATE,
            media_type VARCHAR(50)
        );
        """

        # Execute the SQL command
        pg_hook.run(create_table_sql)

    # Step 2: Extract data from the API

    # Step 3: Transform the data

    # Step 4: Load the data into PostgreSQL

```

In the above code, we use `PostgresHook` to interact with `Postgres`, and `pg_hook.run(create_table_sql)` executes the SQL statement.

**Setting Up the API**

Below is the format of the API response:

```json
{
  "copyright": "\nIreneusz Nowak\n",
  "date": "2025-05-27",
  "explanation": "Behold one of the most photogenic regions of the night sky, captured impressively.  Featured, the band of our Milky Way Galaxy runs diagonally along the bottom-left corner, while the colorful Rho Ophiuchi cloud complex is visible just right of center and the large red circular Zeta Ophiuchi Nebula appears near the top.  In general, red emanates from nebulas glowing in the light of excited hydrogen gas, while blue marks interstellar dust preferentially reflecting the light of bright young stars.  Thick dust usually appears dark brown.  Many iconic objects of the night sky appear, including (can you find them?) the bright star Antares, the globular star cluster M4, and the Blue Horsehead nebula. This wide field composite, taken over 17 hours, was captured from South Africa last June.    Explore Your Universe: Random APOD Generator",
  "hdurl": "https://apod.nasa.gov/apod/image/2505/RhoZeta_Nowak_2560.jpg",
  "media_type": "image",
  "service_version": "v1",
  "title": "Zeta and Rho Ophiuchi with Milky Way",
  "url": "https://apod.nasa.gov/apod/image/2505/RhoZeta_Nowak_960.jpg"
}

```

```Py

    # Step 2: Extract data from the API
    # Full request: https://api.nasa.gov/planetary/apod?api_key=<YOUR_API_KEY>
    # Note: SimpleHttpOperator is deprecated in newer HTTP provider releases in favor of HttpOperator
    extract_apod = SimpleHttpOperator(
        task_id='extract_apod',
        http_conn_id='apod_api',  # Connection ID for the NASA API
        endpoint='planetary/apod',  # Endpoint for the Astronomy Picture of the Day
        method='GET',
        # API key read from the Extra field of the same connection
        data={"api_key": "{{ conn.apod_api.extra_dejson.api_key }}"},
        response_filter=lambda response: response.json(),  # Return the parsed JSON body
    )

```

We use `SimpleHttpOperator` to send the HTTP request. `http_conn_id='apod_api'` points to an `Airflow` connection created in the `UI`, which supplies the base URL. `endpoint` is the path appended to that base URL to hit the `API`, and `data` carries the `api_key`, which is also read from the `Airflow` connection.

Finally, `response_filter` converts the response to `Json` so that downstream tasks receive a dictionary.
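To see how the base URL, the endpoint, and the `api_key` parameter fit together, the sketch below assembles the final request URL by hand. `build_apod_url` is a helper invented for this note, and `DEMO_KEY` is only a placeholder value; in the DAG these pieces come from the Airflow connection instead.

```python
from urllib.parse import urlencode, urljoin


def build_apod_url(base_url, endpoint, api_key):
    """Combine the connection's base URL, the endpoint, and the query string."""
    query = urlencode({"api_key": api_key})
    return f"{urljoin(base_url, endpoint)}?{query}"


url = build_apod_url("https://api.nasa.gov/", "planetary/apod", "DEMO_KEY")
print(url)  # https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY
```

Keeping the key out of the code and in the connection means the DAG file can be committed without exposing it.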

### **Transforming the Data**

```Py

    @task
    def transform_data(response):
        apod_data = {
            "title": response.get("title", ""),
            "explanation": response.get("explanation", ""),
            "url": response.get("url", ""),
            "date": response.get("date", ""),
            "media_type": response.get("media_type", ""),
        }
        return apod_data

```
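The transform logic can be checked locally without Airflow by writing it as a plain function. The sketch below assumes the same field selection as the task above; `transform_apod` and the trimmed `sample` dictionary (taken from the sample response in these notes, with `explanation` left out on purpose) are for illustration only.

```python
def transform_apod(response):
    """Plain-function version of the transform step, for local testing."""
    fields = ("title", "explanation", "url", "date", "media_type")
    return {field: response.get(field, "") for field in fields}


# Trimmed copy of the sample API response; "explanation" is omitted
# to show the empty-string default.
sample = {
    "date": "2025-05-27",
    "media_type": "image",
    "service_version": "v1",
    "title": "Zeta and Rho Ophiuchi with Milky Way",
    "url": "https://apod.nasa.gov/apod/image/2505/RhoZeta_Nowak_960.jpg",
}

row = transform_apod(sample)
print(sorted(row))
print(repr(row["explanation"]))
```

Fields that are not needed, such as `service_version`, are dropped, and missing fields fall back to an empty string instead of raising a `KeyError`.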

### **Loading the Data**

```Py

    @task
    def load_data_to_postgres(apod_data):
        # Initialize PostgresHook with the connection defined in the Airflow UI
        pg_hook = PostgresHook(postgres_conn_id="my_postgres_connection")

        # Define SQL Query
        insert_sql = """
        INSERT INTO apod_data (title, explanation, url, date, media_type)
        VALUES (%s, %s, %s, %s, %s);
        """

        # Execute the insert query
        pg_hook.run(
            insert_sql,
            parameters=(
                apod_data["title"],
                apod_data["explanation"],
                apod_data["url"],
                apod_data["date"],
                apod_data["media_type"],
            ),
        )

    # Define the task dependencies
    # Create the table first, then call the API
    create_table() >> extract_apod
    # The operator's XCom output is the JSON returned by the API
    api_response = extract_apod.output
    # Transform the response into the columns of apod_data
    transformed_data = transform_data(api_response)
    # Load the transformed data into PostgreSQL
    load_data_to_postgres(transformed_data)

```
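The insert above passes the values separately from the SQL text. The sketch below shows the same idea with an in-memory SQLite database as a stand-in for Postgres, so it runs without any containers. The column types are adapted for SQLite, the placeholder style is `?` instead of `%s`, and the shortened `explanation` text is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database

# Same shape as apod_data; SQLite types replace SERIAL/VARCHAR here.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS apod_data (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT, explanation TEXT, url TEXT, date TEXT, media_type TEXT
    )
    """
)

apod = {
    "title": "Zeta and Rho Ophiuchi with Milky Way",
    "explanation": "It's a wide field composite.",  # note the apostrophe
    "url": "https://apod.nasa.gov/apod/image/2505/RhoZeta_Nowak_960.jpg",
    "date": "2025-05-27",
    "media_type": "image",
}

# Values travel separately from the SQL text, so quotes in the data
# cannot break the statement.
conn.execute(
    "INSERT INTO apod_data (title, explanation, url, date, media_type) "
    "VALUES (?, ?, ?, ?, ?)",
    (apod["title"], apod["explanation"], apod["url"], apod["date"], apod["media_type"]),
)
stored = conn.execute("SELECT id, explanation FROM apod_data").fetchall()
print(stored)
```

Building the statement with string formatting instead would fail on the apostrophe in the explanation text, which is common in real APOD descriptions.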


**Get Inside the Astro Container** : `astro dev bash`

**Start Without Cache** : `astro dev restart --no-cache`

### **Deprecated Things in Airflow**

`SimpleHttpOperator`, `days_ago`

### **Important Steps Before Running the Astro**

Because Astro runs entirely in multiple Docker containers, it installs the core packages such as `Airflow` and `Postgres` by default. It does not install any other dependencies.

Therefore, we need to list each additional package in the `requirements.txt` of the `Astro` project.

For example, I was getting a lot of errors because I had installed the `apache-airflow-providers-http` package in my local environment, when it actually needs to be installed inside the `container`.

Therefore, always add the `dependencies` to the `requirements.txt` file.

Also, `SimpleHttpOperator` is already deprecated. We need to use `HttpOperator` instead, which is provided by the `apache-airflow-providers-http` provider package.

[Providers_Package_Ref](https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html#id49)
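To check whether a package is really available inside the container rather than only in the local environment, a small script can be run from the container shell (`astro dev bash`). This is a sketch: `missing_modules` is a helper written for this note, it only handles top-level module names, and the bogus module name is there purely to show the failure case.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# "not_a_real_module_xyz" is a deliberately bogus name for illustration.
print(missing_modules(["json", "not_a_real_module_xyz"]))
```

Any name printed by this check is a candidate for `requirements.txt`, keeping in mind that the package name on PyPI can differ from the import name.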

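In my setup, the `username` and `password` for Postgres in `docker-compose` had to be `postgres`; other values caused authentication errors. The Postgres image applies `POSTGRES_USER` and `POSTGRES_PASSWORD` only when it initializes an empty data directory, so credentials changed after the volume was first created are ignored until the volume is removed.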

### **Setting the Important Connection (API and DB)**

Now, for the program to run properly, we need to set up two connections: one for `Postgres` and one for the `API`.

**API**

In the `Airflow` UI, go to Connections.

<img src='./Notes_Img/a1.png'>

<img src='./Notes_Img/a2.png'>

We need to fill in these details; when the `DAG` is executed, the information is retrieved from here. The connection ID must be the same as the one used in the `DAG`.

**Postgres**

Click Add Connection.

Enter the `conn_id` used in the `DAG`. When `Astro` starts, it runs the `Postgres` container with the details provided in the `docker-compose.yml` file. This file contains the `username` and `password`.

For the `host`, look up the running `postgres` container, copy its name, and paste it into `Host`.

<img src='./Notes_Img/a3.png'>
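The same details can also be written as a single connection URI, which Airflow can read from an environment variable named `AIRFLOW_CONN_<CONN_ID>`. The helper below is a sketch of how such a URI is assembled. `postgres_conn_uri` is a name made up for this note, and the values are the ones from the `docker-compose.yml` in these notes, which may differ from what the running setup actually uses.

```python
from urllib.parse import quote


def postgres_conn_uri(user, password, host, port, database):
    """Build a postgres:// URI; credentials are percent-encoded."""
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# The host is the container name rather than localhost.
uri = postgres_conn_uri("birat", "admin", "postgres_db", 5432, "postgres")
print(uri)  # postgres://birat:admin@postgres_db:5432/postgres
```

Percent-encoding matters when a password contains characters such as `@` or `/`, which would otherwise be read as part of the URI structure.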

Then execute the `DAG` from the `Airflow` UI.

## **Possible Errors While Running DAG**

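If the `Postgres` volume was created with different credentials, the container keeps using the old ones, and the `Username` and `Password` from the current configuration are rejected. In that case we have to remove the `Volume` so that it is recreated. The actual volume name includes a project-specific suffix, so list the volumes first: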

```bash

(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete/Airflow_ETL_Pipeline_Astro_Postgres$ docker volume ls
DRIVER    VOLUME NAME
local     4d2e2ee6bd280a5cb2f26d998389a0fd0aef5270ce1d2814d9c22617e27b6d02
local     59e773b558bc859c8819420ab883efbbd80ba0f820be1ef86ad10981f870f657
local     airflow-etl-pipeline-astro-postgres_54d206_airflow_logs
local     airflow-etl-pipeline-astro-postgres_54d206_postgres_data
local     airflow-practice_d1986e_airflow_logs
local     airflow-practice_d1986e_postgres_data
local     mariadb_data
(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete/Airflow_ETL_Pipeline_Astro_Postgres$ docker volume rm airflow_etl_pipeline_astro_postgres_postgres_data
Error response from daemon: get airflow_etl_pipeline_astro_postgres_postgres_data: no such volume
(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete/Airflow_ETL_Pipeline_Astro_Postgres$ docker volume rm airflow-etl-pipeline-astro-postgres_54d206_airflow_logs
airflow-etl-pipeline-astro-postgres_54d206_airflow_logs
(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete/Airflow_ETL_Pipeline_Astro_Postgres$ docker volume rm airflow-etl-pipeline-astro-postgres_54d206_postgres_data
airflow-etl-pipeline-astro-postgres_54d206_postgres_data
(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete/Airflow_ETL_Pipeline_Astro_Postgres$ docker volume rm airflow-practice_d1986e_postgres_data
airflow-practice_d1986e_postgres_data
(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete/Airflow_ETL_Pipeline_Astro_Postgres$

```
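As the failed `docker volume rm` attempts show, guessing the volume name is error-prone because of the generated suffix. The sketch below filters a `docker volume ls` listing by suffix instead. `find_volumes` is a helper written for this note, and the listing is an excerpt of the session above.

```python
def find_volumes(listing, suffix):
    """Return volume names from `docker volume ls` output ending with suffix."""
    names = []
    for line in listing.splitlines()[1:]:  # skip the DRIVER/VOLUME NAME header
        parts = line.split()
        if len(parts) == 2 and parts[1].endswith(suffix):
            names.append(parts[1])
    return names


# Excerpt of the listing from the terminal session.
listing = """DRIVER    VOLUME NAME
local     airflow-etl-pipeline-astro-postgres_54d206_airflow_logs
local     airflow-etl-pipeline-astro-postgres_54d206_postgres_data
local     mariadb_data
"""

print(find_volumes(listing, "_postgres_data"))
```

The returned names can then be passed to `docker volume rm` exactly as printed.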


Once all the DAG tasks have completed, we have successfully implemented and automated the complete `ETL` pipeline project.

Because `Postgres` runs in a container, we can't directly view the tables and rows. For that, we install `DBeaver Community`.

<img src='./Notes_Img/a4.png'>

**DBeaver**

<img src='./Notes_Img/a5.png'>

Go to the Database, look for `apod_data`

<img src='./Notes_Img/a6.png'>


### **Advantage of Providing the Connection Variables from Airflow UI**

Credentials, remote host addresses, and API keys can all be supplied through connections, so none of them need to be hard-coded in the DAG.


## **Deploying the Astro Project in the Astro Cloud and AWS**

[Video_Link](https://www.udemy.com/course/complete-mlops-bootcamp-with-10-end-to-end-ml-projects/learn/lecture/46199315#overview)

**Note**

There was a problem with the deployment to `Astro Cloud`. To be fixed another day.

```Bash

(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete/Airflow_ETL_Pipeline_Astro_Postgres$ astro deploy
Authenticated to Astro

Error: This command is not yet supported on Airflow 3 deployments

```

**Astro.io**

We will host the Airflow application in Astro Cloud. Create an account and log in.

Create the organization and name the project.

**AWS Database**

Open `RDS` and create a database.

Choose PostgreSQL as the engine.

Set the security group rules.

Copy the endpoint address.

<hr>

We will use `Astro CLI` for the deployment.

`astro login`

`astro deploy` : Choose the project.

Once deployed, go to `DAGs` in the `Astro UI`.

Open the `Airflow` UI and set the connections. Once the connections are set, run the DAGs.

Now you can also browse the remote database using `DBeaver`; paste the host endpoint from `AWS`.


## **Project Status Update - May 28, 2025**

### **✅ Completed Successfully:**

1. **Local Development Environment:**

   - Astro project setup complete
   - Docker containers running properly
   - Dependencies installed (HTTP & PostgreSQL providers)
   - ETL DAG created and functional

2. **ETL Pipeline Components:**

   - ✅ Extract: NASA APOD API integration
   - ✅ Transform: Data cleaning and structuring
   - ✅ Load: PostgreSQL database insertion
   - ✅ Scheduling: Daily execution configured

3. **Infrastructure:**
   - ✅ Docker containerization
   - ✅ PostgreSQL database setup
   - ✅ Airflow connections configuration
   - ✅ Volume management and networking

### **🔧 Current Challenge:**

**Deployment Issue:** Airflow 3.0 runtime not supported by `astro deploy` command

### **📋 Next Steps:**

#### **Immediate Actions:**

1. **Runtime Downgrade (Recommended):**

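   ```dockerfile
   # Update the Dockerfile to use an Airflow 2.x based runtime.
   # Verify the exact image tag against the Astro Runtime release
   # listing before using it; the tag below is unconfirmed.
   FROM astrocrpublic.azurecr.io/runtime:2.9.0
   ```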

2. **Test Deployment:**

   ```bash
   astro dev stop
   astro dev start --no-cache
   astro deploy
   ```

3. **AWS RDS Setup:**
   - Create PostgreSQL instance
   - Configure security groups
   - Update Airflow connections

#### **Alternative Approaches:**

1. **GitHub Actions CI/CD Pipeline:**

   - Automated deployment
   - Version control integration
   - Production-ready workflow

2. **Manual Docker Deployment:**
   - Container registry push
   - Kubernetes/ECS deployment
   - Custom orchestration

#### **Future Enhancements:**

1. **Pipeline Improvements:**

   - Error handling and retry logic
   - Data validation and quality checks
   - Monitoring and alerting
   - Multiple data sources integration

2. **Infrastructure Scaling:**

   - Load balancing
   - Auto-scaling configurations
   - Multi-environment setup (dev/staging/prod)

3. **Advanced Features:**
   - Real-time data streaming
   - Machine learning integration
   - Dashboard and visualization
   - Data lineage tracking

### **🎯 Learning Objectives Achieved:**

- ✅ End-to-end ETL pipeline design
- ✅ Apache Airflow workflow orchestration
- ✅ Docker containerization for data applications
- ✅ API integration and error handling
- ✅ Database design for analytical workloads
- ✅ Production deployment considerations

### **📚 Key Takeaways:**

1. **Version Compatibility:** Always check runtime compatibility before deployment
2. **Local Testing:** Ensure thorough local testing before cloud deployment
3. **Dependency Management:** Proper requirements.txt configuration is crucial
4. **Connection Management:** Centralized connection configuration in Airflow UI
5. **Volume Management:** Proper Docker volume handling for data persistence

**Project successfully demonstrates modern data engineering practices and provides a solid foundation for production-ready ETL pipelines.**


## **GitHub Actions and CI/CD Pipeline**

In this section, we will explore how to set up a CI/CD pipeline using GitHub Actions for our Airflow ETL project. This will allow us to automate the deployment of our Airflow DAGs whenever we push changes to our GitHub repository.

GitHub Actions allows us to create custom workflows that can be triggered by various events, such as pushing code to a repository, creating a pull request, or a schedule that runs the workflow at specific intervals.

### **Continuous Integration and Continuous Deployment (CI/CD)**

CI/CD is a software development practice that automates the process of integrating code changes, testing them, and deploying them to production. It helps ensure that code changes are reliable, consistent, and can be deployed quickly.

**Continuous Integration (CI)**: This involves automatically building and testing code changes whenever they are pushed to a repository. It helps catch bugs early in the development process. With GitHub Actions, you can define workflows that run tests when code is pushed or pull requests are created.

With many developers working on the same project, it is essential to have a system that automatically tests and deploys code changes so that the project remains stable and functional. If some tests fail, the CI process notifies the developers, allowing them to fix issues before merging code into the main branch.

**Continuous Deployment (CD)**: This involves automatically deploying code changes to production after they have passed the CI tests. It ensures that the latest code is always available in production without manual intervention. GitHub Actions can be configured to deploy applications to various environments, such as staging or production, based on the success of the CI tests. This practice reduces the time between code changes and their deployment, allowing for faster delivery of new features and bug fixes.

### **Setting Up GitHub Actions for Airflow ETL Project**

To set up GitHub Actions for our Airflow ETL project, we will create a workflow file that defines the steps to be executed whenever changes are pushed to the repository. This workflow will include steps to:

1. **Check out the code**: Use the `actions/checkout` action to check out the code from the repository.
2. **Set up Python**: Use the `actions/setup-python` action to set up the Python environment with the required version.
3. **Install dependencies**: Use `pip` to install the required Python packages, including Airflow and any other dependencies specified in the `requirements.txt` file.
4. **Run tests**: Execute any tests defined in the project to ensure that the code is functioning correctly.
5. **Deploy to Airflow**: Use the `astro deploy` command to deploy the Airflow DAGs to the Astro Cloud or any other Airflow environment.

### **Creating the Developers Workflow File**

A developer workflow file is a `YAML file` that defines the steps to be executed in a GitHub Actions workflow. It specifies the events that trigger the workflow, the jobs to be run, and the individual steps within each job.

To create the workflow file, we will create a directory called `.github/workflows` in the root of our repository and add a file named `ci-cd.yml`. This file will define the workflow for our Airflow ETL project.

```yaml
name: CI/CD Pipeline for Airflow ETL Project
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.8"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run tests
        run: |
          # Add your test command here, e.g., pytest or unittest
          echo "Running tests..."

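      - name: Install the Astro CLI
        run: curl -sSL install.astronomer.io | sudo bash -s

      - name: Deploy to Airflow
        # The Astro CLI takes the Deployment ID as a positional argument
        run: |
          astro deploy <your_deployment_id>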
```

### **Explanation of the Workflow File**

- **name**: The name of the workflow, which will be displayed in the GitHub Actions UI.

- **on**: Specifies the events that will trigger the workflow. In this case, it will run on pushes and pull requests to the `main` branch.

- **jobs**: Defines the jobs that will be executed in the workflow. Each job runs in a separate environment.

- **runs-on**: Specifies the type of virtual machine to use for the job. In this case, we are using the latest version of Ubuntu.

- **steps**: Defines the individual steps to be executed in the job.

  - **Check out code**: Uses the `actions/checkout` action to check out the code from the repository.

  - **Set up Python**: Uses the `actions/setup-python` action to set up the Python environment with the specified version.

  - **Install dependencies**: Installs the required Python packages using `pip`.

  - **Run tests**: Placeholder for running tests. You can replace this with your actual test command.

  - **Deploy to Airflow**: Uses the `astro deploy` command to deploy the Airflow DAGs to the Astro Cloud or any other Airflow environment.
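For the "Run tests" placeholder, one lightweight check that needs no Airflow installation is to make sure every DAG file at least parses as valid Python. The sketch below uses only the standard library. `find_syntax_errors` is a name chosen for this note, and the throwaway files exist only to demonstrate it; in a real test the folder would be the project's `dags` directory, which is an assumption about the layout.

```python
import ast
import pathlib
import tempfile


def find_syntax_errors(folder):
    """Map each .py file under folder that fails to parse to its error message."""
    errors = {}
    for path in sorted(pathlib.Path(folder).rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[path.name] = exc.msg
    return errors


# Demonstration on a throwaway folder with one valid and one broken file.
with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "good_dag.py").write_text("x = 1\n", encoding="utf-8")
    pathlib.Path(tmp, "bad_dag.py").write_text("def broken(:\n", encoding="utf-8")
    problems = find_syntax_errors(tmp)

print(sorted(problems))  # ['bad_dag.py']
```

This only catches syntax errors. Import errors or invalid DAG definitions still need a check that loads the DAGs with Airflow installed.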

### **Configuring Secrets for Deployment**

To securely store sensitive information such as API keys, database credentials, or deployment tokens, you can use GitHub Secrets. These secrets can be accessed in your workflow file without exposing them in the code.

To add secrets to your GitHub repository:

1. Go to your repository on GitHub.

2. Click on "Settings" in the top menu.

3. In the left sidebar, click on "Secrets and variables" and then "Actions".

4. Click on "New repository secret" to add a new secret.

5. Enter a name for the secret (e.g., `ASTRO_DEPLOY_TOKEN`) and its value (e.g., your Astro deployment token).

6. Click "Add secret" to save it.

You can then access these secrets in your workflow file using the `secrets` context. For example, to use the `ASTRO_DEPLOY_TOKEN` secret, expose it to the deploy step as the `ASTRO_API_TOKEN` environment variable, which the Astro CLI reads for authentication:

```yaml
- name: Deploy to Airflow
  env:
    ASTRO_API_TOKEN: ${{ secrets.ASTRO_DEPLOY_TOKEN }}
  run: |
    astro deploy <your_deployment_id>
```
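When a step exposes a secret through `env:`, a Python script run by that step sees it as an ordinary environment variable. The sketch below shows a defensive way to read it. `require_env` is a helper written for this note, and the token value is a dummy set in the script only to simulate the runner.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Simulate the runner mapping the secret to a variable; the value is a dummy.
os.environ["ASTRO_DEPLOY_TOKEN"] = "dummy-token"
token = require_env("ASTRO_DEPLOY_TOKEN")
print(len(token))  # 11
```

Failing early with the variable name in the message makes a missing or misnamed secret much easier to spot in the workflow logs, and printing only the length avoids leaking the value.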

### **Running the Workflow**

Once you have created the workflow file and committed it to your repository, GitHub Actions will automatically trigger the workflow whenever changes are pushed to the `main` branch or a pull request is created.

You can monitor the progress of the workflow in the "Actions" tab of your GitHub repository. If any step fails, you can view the logs to diagnose and fix the issue.

### **Conclusion**

Once the automated CI pipeline is set up, it runs the defined steps whenever changes are pushed to the repository or a pull request is opened. This ensures that the Airflow ETL project is always tested, and changes are merged into the main branch only if they pass the tests.

Upon merging, CD is triggered and the latest code is deployed to the production environment.


## **First Github Action Project**

Refer to `First_Github_Action_Project`

We will create a simple GitHub Action that runs a Python script to print "Hello, World!" whenever code is pushed to the repository.

For this, we will create a new repository on GitHub and set up a workflow file to define the action. This repo is a submodule of the main repo.

[Link_Repo](https://github.com/ToniBirat7/Github_Action_Project)

```Bash

(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete/First_Github_Action_Project$ git submodule add git@github.com:ToniBirat7/Github_Action_Project.git First_Github_Action_Project

git add .gitmodules First_Github_Action_Project

(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete/First_Github_Action_Project$ git commit -m "Added First Github Action Project as submodule"

```

We will use `pandas` and `pytest` for this project. We will create a simple Python script that prints "Hello, World!" and a test to verify its functionality.

Below is the structure of the project:

```
First_Github_Action_Project/
├── .github/
│   └── workflows/
│       ├── ci.yml                 # Main CI/CD pipeline
│       ├── code-quality.yml       # Code quality checks
│       └── release.yml            # Release automation
├── src/
│   ├── __init__.py
│   ├── math_operations.py         # Core Python functions
├── tests/
│   ├── __init__.py
│   ├── test_math_operations.py    # Unit tests
├── requirements.txt               # Python dependencies
├── requirements-dev.txt           # Development dependencies
├── .gitignore                     # Git ignore rules
├── Dockerfile                     # Container configuration
├── docker-compose.yml             # Multi-service setup
└── README.md                      # This file
```
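The notes do not show the contents of `math_operations.py` or its tests, so the sketch below is a hypothetical example of what the two files could contain. The function names `add` and `divide` are invented. Both parts are shown in one block so the example runs on its own; in the real layout the test file would import from `src.math_operations`.

```python
# Hypothetical contents for src/math_operations.py
def add(a, b):
    """Return the sum of two numbers."""
    return a + b


def divide(a, b):
    """Return a / b, rejecting division by zero explicitly."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b


# Hypothetical contents for tests/test_math_operations.py
# (in the real layout this file would start with
# `from src.math_operations import add, divide`)
def test_add():
    assert add(2, 3) == 5


def test_divide_by_zero():
    try:
        divide(1, 0)
    except ValueError:
        pass  # expected: the function refuses to divide by zero
    else:
        raise AssertionError("expected ValueError")
```

With `pytest` installed, the second test is usually written with `pytest.raises(ValueError)`; the plain `try`/`except` form is used here so the sketch has no dependencies.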

Once we push the code to the repository, we use `Actions` to create the workflow file. First, open the `Actions` tab in the GitHub repository.

Select and `configure` the `Python application` template. This creates a basic workflow file for us.

The workflow file lives in the `.github/workflows` directory of the repository and defines the steps to be executed whenever code is pushed or a pull request is created. The default name is `python-app.yml`, but we can rename it to `unit_test.yml` for clarity.

It is simplest to copy the template from `python-app.yml` and paste it into `unit_test.yml`.

A workflow is written in YAML as key-value pairs. Top-level keys such as `name`, `on`, and `jobs` hold the workflow name, its triggers, and the jobs with their steps.

```yaml
# Name: Python application
name: Python application

# Triggers the workflow on push or pull request events to the main branch
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

# Action to run the workflow
jobs:
  # Name of the job
  build:
    runs-on: ubuntu-latest # The job will run on the latest version of Ubuntu

    steps:
      - name: Checkout code
        # This step checks out the repository so that the workflow can access its contents
        uses: actions/checkout@v4

      # Set up Python environment
      # This step sets up Python 3.10 for the workflow
      - name: Set up Python 3.10
        uses: actions/setup-python@v3
        with:
          python-version: "3.10"

      # Install dependencies
      # This step upgrades pip and installs pytest plus the packages in requirements.txt
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pytest
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

      # Run the tests
      - name: Run tests
        run: pytest
```

This workflow file defines the following steps:

1. **Triggers**: The workflow is triggered on pushes and pull requests to the `main` branch.

2. **Jobs**: The workflow consists of a single job named `build`, which runs on the latest version of Ubuntu.

3. **Steps**:
   - **Checkout code**: Uses the `actions/checkout` action to check out the code from the repository.
   - **Set up Python**: Uses the `actions/setup-python` action to set up Python 3.10 for the workflow.
   - **Install dependencies**: Installs the required Python packages from `requirements.txt`.
   - **Run tests**: Executes the tests using `pytest`.

`pytest` by default collects files named `test_*.py` or `*_test.py`, searching recursively from the directory it is run in (here that includes the `tests` directory). It automatically discovers and runs the test functions, whose names start with `test`, defined in those files.
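The file name rule can be tried out with the standard library. The sketch below applies the two default patterns to a few example names. `looks_like_test_file` is a helper made up for this note and only mimics the name matching, not pytest's full collection logic.

```python
from fnmatch import fnmatch


def looks_like_test_file(filename):
    """Apply pytest's default file name patterns to a bare file name."""
    return fnmatch(filename, "test_*.py") or fnmatch(filename, "*_test.py")


names = ["test_math_operations.py", "math_operations_test.py", "math_operations.py"]
matched = [name for name in names if looks_like_test_file(name)]
print(matched)
```

A file such as `math_operations.py` is skipped, which is why source modules and test modules can sit in the same repository without the source being collected as tests.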

The next step is to copy the YAML file into the `.github/workflows` directory of our repository.

Then we commit the workflow file, either locally followed by a push or directly in the GitHub UI. As soon as the commit reaches the `main` branch, the workflow is triggered and the tests are executed.


## **End to End Github Action Workflow with DockerHub**

In this section, we will extend our GitHub Actions workflow to include Docker integration. This will allow us to build a Docker image for our Python application and push it to Docker Hub whenever changes are pushed to the repository.

### **Project Description**

We will have a simple Flask App, for which we will have `Unit Tests` and `Integration Tests`.

After each push, we will run the tests and, if they pass, build a Docker image and push it to Docker Hub. This ensures that our application is always in a deployable state.

Because the `CD` stage pushes to Docker Hub, the `GitHub` repository needs `Secrets` that hold the `Docker Hub` credentials. The workflow uses them to authenticate and push the Docker image to our Docker Hub account.

The `Secrets` are the `Docker Hub Username` and the `Docker Hub Password`. We will need to add these secrets in the `GitHub` repository settings.

### **Setting Up the Project Structure**

```plaintext
End_to_End_Github_Action_Workflow/
├── .github/
│   └── workflows/
│       └── docker-ci-cd.yml        # GitHub Actions workflow
├── app.py
├── Dockerfile                      # Dockerfile for building the image
├── requirements.txt                # Python dependencies
└── tests/
    ├── __init__.py
    └── test_app.py                 # Unit tests for the Flask app
```

First we'll create a simple Flask application in the `app.py` file. This application will have a single endpoint that returns "Hello, World!".

```python

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello, World! This is a Flask app running in a Docker container.'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

```

Next, we will create a `requirements.txt` file to specify the dependencies for our Flask application. This file will include Flask and any other required packages.

Then we'll write the test cases for our Flask application in the `tests/test_app.py` file. We'll use `pytest` for testing.
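As a sketch of what `tests/test_app.py` could contain, the example below exercises the `/` endpoint through Flask's built-in test client. To stay self-contained it redefines the app inline; in the project the test file would import it instead (for example `from app import app`, assuming `app.py` is importable from the repository root). The test function name is hypothetical.

```python
from flask import Flask

# Inline copy of the app from app.py so that the sketch is self-contained
app = Flask(__name__)


@app.route('/')
def hello_world():
    return 'Hello, World! This is a Flask app running in a Docker container.'


def test_hello_world_endpoint():
    # The test client sends requests to the app without starting a server
    client = app.test_client()
    response = client.get('/')

    assert response.status_code == 200
    assert b'Hello, World!' in response.data
```

For the workflow's `pytest` step to run this, `pytest` would need to be listed in `requirements.txt` alongside Flask.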

Then we will need to create a `Dockerfile` to build the Docker image for our Flask application. The Dockerfile will define the base image, copy the application code, install dependencies, and expose the necessary port.

```dockerfile

FROM python:3.8-slim

WORKDIR /app

COPY . /app/

RUN pip install -r requirements.txt

EXPOSE 5000

CMD ["python", "app.py"]

```

This Dockerfile does the following:

1. **FROM python:3.8-slim**: Uses the official Python 3.8 slim image as the base image.

2. **WORKDIR /app**: Sets the working directory to `/app` inside the container.

3. **COPY . /app/**: Copies the entire application code into the `/app` directory inside the container.

4. **RUN pip install -r requirements.txt**: Installs the Python dependencies specified in the `requirements.txt` file.

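5. **EXPOSE 5000**: Documents that the container listens on port 5000, which is the default port for Flask applications and the one `app.py` binds to. The port still has to be published with `docker run -p` to be reachable from the host.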

6. **CMD ["python", "app.py"]**: Specifies the command to run the application when the container starts.

Now we will write the CI/CD workflow file in the `.github/workflows` directory. This file will define the steps to build the Docker image, run tests, and push the image to Docker Hub.

```yaml
name: CI/CD for Dockerized Flask App

on:
  push:
    branches: ["main"]
  pull_request:
    branches: ["main"]

jobs:
  build-and-test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        # This step checks out the repository so that the workflow can access its contents
        uses: actions/checkout@v4

      - name: Set Up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

      - name: Run Tests
        run: pytest

  # Build and push Docker image
  build-and-push:
    runs-on: ubuntu-latest
    needs: build-and-test

    steps:
      - name: Checkout code
        # This step checks out the repository so that the workflow can access its contents
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        # This step sets up Docker Buildx, which is a Docker CLI plugin for extended build capabilities with BuildKit
        uses: docker/setup-buildx-action@v2

      - name: Log in to Docker Hub
        # This step logs in to Docker Hub using the credentials stored in GitHub Secrets
        uses: docker/login-action@v2
        with:
          # Replace with your Docker Hub username and password stored in GitHub Secrets
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      # Build and push Docker image
      - name: Build Docker image
        # The id lets the next step read this step's outputs
        id: build-and-push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ${{ secrets.DOCKER_USERNAME }}/flask-app:latest

      - name: Image digest
        run: echo ${{ steps.build-and-push.outputs.digest }}
```

### **Explanation of the Workflow File**

- **name**: The name of the workflow, which will be displayed in the GitHub Actions UI.

- **on**: Specifies the events that will trigger the workflow. In this case, it will run on pushes and pull requests to the `main` branch.

- **jobs**: Defines the jobs that will be executed in the workflow. Each job runs in a separate environment.

- **build-and-test**: This job runs on the latest version of Ubuntu and performs the following steps:

  - **Checkout code**: Uses the `actions/checkout` action to check out the code from the repository.
  - **Set Up Python**: Uses the `actions/setup-python` action to set up Python 3.10 for the workflow.
  - **Install Dependencies**: Installs the required Python packages from `requirements.txt`.
  - **Run Tests**: Executes the tests using `pytest`.

- **build-and-push**: Because of `needs: build-and-test`, this job starts only after the `build-and-test` job has succeeded, and performs the following steps:

  - **Checkout code**: Uses the `actions/checkout` action to check out the code from the repository.
  - **Set up Docker Buildx**: Sets up Docker Buildx, which is a Docker CLI plugin for extended build capabilities with BuildKit.
  - **Log in to Docker Hub**: Uses the `docker/login-action` action to log in to Docker Hub using the credentials stored in GitHub Secrets. You will need to create these secrets in your GitHub repository settings.
  - **Build Docker image**: Uses the `docker/build-push-action` action to build and push the Docker image to Docker Hub. The image will be tagged with the username and `flask-app:latest`.
  - **Image digest**: Prints the image digest after building and pushing the image.

**Accessing the Docker Hub Username and Password**

The username is your Docker Hub username. For the password, create a Personal Access Token (PAT) in Docker Hub and use it in place of your account password. This token will be used to authenticate with Docker Hub when pushing the image.

Copy the token, then in the GitHub repository go to `Settings` > `Secrets and variables` > `Actions` and, under `Repository secrets`, add it as `DOCKER_PASSWORD`. Add your Docker Hub username as a second secret named `DOCKER_USERNAME`. These names must match the `secrets.DOCKER_USERNAME` and `secrets.DOCKER_PASSWORD` references in the workflow file.

Now, the GitHub Actions workflow can access these secrets to authenticate with Docker Hub.

Once all the tests pass, the Docker image will be built and pushed to Docker Hub automatically.

Now, try to pull the Docker image from Docker Hub using the following command:

```bash
docker pull <your-dockerhub-username>/flask-app:latest
```

## **Special Notes**

The Dockerfile should be named `Dockerfile` and placed in the root directory of the repository. The `docker/build-push-action` action takes its build context from the `context` parameter, which is the root directory of the repository (`.`) in this case, and by default it looks for a file named `Dockerfile` inside that context. A Dockerfile with a different name or location has to be specified through the action's `file` input.


# **First Complete End to End Project with Airflow, Postgres, and GitHub Actions**

<br>

Refer to `First_Complete_End_to_End_ML_Project`.

[Github](git@github.com:ToniBirat7/First_Complete_End_to_End_ML_Project.git)

<br>

In this project, we will create a complete production-ready end-to-end machine learning project using Apache Airflow, PostgreSQL, and GitHub Actions.

We will learn how to write production-ready code, set up a CI/CD pipeline, and deploy our machine learning model using Airflow and PostgreSQL.

**Creating Project Structure**

Real-world projects follow a consistent project structure, which keeps the code organized and easier to maintain.

For that, we will create a `template.py` file that lists the files and directories of the project. Running this file creates the project structure.

```python

# Automated Script for a complete end-to-end ML project

from pathlib import Path
import logging
import os

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

project_name = "First_Complete_End_To_End_ML_Project"

list_of_files = [
  ".github/workflows/.gitkeep",
  f"src/{project_name}/__init__.py",
  f"src/{project_name}/components/__init__.py",
  f"src/{project_name}/components/data_ingestion.py",
  f"src/{project_name}/components/data_validation.py",
  f"src/{project_name}/components/data_transformation.py",
  f"src/{project_name}/components/model_trainer.py",
  f"src/{project_name}/components/model_evaluation.py",
  f"src/{project_name}/components/model_pusher.py",
  f"src/{project_name}/utils/__init__.py",
  f"src/{project_name}/utils/common.py",
  f"src/{project_name}/utils/logger.py",
  f"src/{project_name}/config/__init__.py",
  f"src/{project_name}/config/configuration.py",
  f"src/{project_name}/pipeline/__init__.py",
  f"src/{project_name}/entity/__init__.py",
  f"src/{project_name}/entity/config_entity.py",
  f"src/{project_name}/constants/__init__.py",
  "config/config.yaml",
  "params.yaml",
  "schema.yaml",
  "main.py",
  "Dockerfile",
  "setup.py",
  "research/research.ipynb",
  "templates/index.html"
]

# Placeholder text for well-known files; other .py/.yaml files get a generic comment
placeholders = {
    "__init__.py": "# This is an init file for the package\n",
    "config.yaml": "# Configuration file for the project\n",
    "params.yaml": "# Parameters for the project\n",
    "schema.yaml": "# Schema for the project\n",
    "main.py": "# Main entry point for the project\n",
    "Dockerfile": "# Dockerfile for the project\n",
    "setup.py": "# Setup script for the project\n",
}

for filepath in list_of_files:
    filepath = Path(filepath)
    filedir, filename = os.path.split(filepath)

    # Create the parent directory unless the file lives in the current directory
    if filedir and not os.path.exists(filedir):
        logging.info(f"Creating directory: {filedir}")
        os.makedirs(filedir, exist_ok=True)

    # Only write when the file is missing or empty, so existing work is never overwritten
    if (not os.path.exists(filepath)) or (os.path.getsize(filepath) == 0):
        logging.info(f"Creating file: {filepath}")
        with open(filepath, 'w') as f:
            if filename in placeholders:
                f.write(placeholders[filename])
            elif filepath.suffix in {".py", ".yaml"}:
                f.write(f"# {filename} file\n")
            # Other files (.gitkeep, .ipynb, .html) stay empty: a "#" line is not valid content there
    else:
        logging.info(f"File already exists and is not empty: {filepath}")

```


### **Logging Implementation and Exception Handling**

In this section, we will implement logging in our project. Logging is an essential part of any production-ready application. It helps us track the flow of the application, debug issues, and monitor performance.

We will create a log for each `package` by writing the logging code in the `__init__.py` file of that package. Whenever we import the package, the logging code is executed and the log file is created. Other modules can then import the initialized `logger` object from the package.

```python

# This is an init file for the package
# __init__.py file

import os
import sys
import logging

logging_format = "[%(asctime)s] - %(levelname)s - %(message)s"

log_dirs = 'logs'
log_filepath = os.path.join(log_dirs, 'logging.log')

if not os.path.exists(log_dirs):
    os.makedirs(log_dirs)

logging.basicConfig(
    level=logging.INFO,
    format=logging_format,

    handlers=[
        logging.FileHandler(log_filepath), # Log to a file
        logging.StreamHandler(sys.stdout) # Log to console
    ]
)

src_logger = logging.getLogger("First_Complete_End_To_End_ML_Project") # Create a logger for the package

```

Then try to import the package in the `main.py` file. This will execute the logging code and create the log file. Every log will be written to the log file and also printed to the console.

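```python

# main.py
# Assumption: this file is run from the repository root so that `src` is importable;
# adjust the import path if the package is installed differently.
from src.First_Complete_End_To_End_ML_Project import src_logger

src_logger.info("Logging is set up for the project.")

```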


## **Setting Up the Utility Functions and Exception Handling**

**Utility Functions**

Utility functions are reusable functions that can be used across different parts of the project. We will create a `utils` package and add the utility functions in it.

The `utils` package contains a `common.py` file, and we will add the utility functions to this file.
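As a rough sketch of what `common.py` might contain, the helpers below create directories and read and write JSON files, logging each action and raising a clear error when a file is missing. The function names (`create_directories`, `save_json`, `load_json`) are hypothetical and use only the standard library; the project's actual utilities may differ.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def create_directories(paths: list) -> None:
    """Create every directory in `paths`, including missing parents."""
    for path in paths:
        Path(path).mkdir(parents=True, exist_ok=True)
        logger.info("Created directory: %s", path)


def save_json(path: Path, data: dict) -> None:
    """Write `data` to `path` as indented JSON."""
    with open(path, "w") as f:
        json.dump(data, f, indent=4)
    logger.info("Saved JSON file: %s", path)


def load_json(path: Path) -> dict:
    """Read a JSON file and return its content; fail loudly if the file is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"JSON file not found: {path}")
    with open(path) as f:
        return json.load(f)
```

Keeping such helpers in one module means the pipeline components can share them instead of repeating file-handling code.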


## **Config Box and Annotations**

In this section, we will set up the configuration for our project using `ConfigBox`. `ConfigBox` is a class from the `python-box` library (imported as `box`) that wraps a dictionary so that its keys can be accessed and modified as attributes.

```python
example = {
  "key1": "value1",
  "key2": "value2",
  "key3": "value3",
}

# Accessing the value of "key1" with square brackets works without error
example["key1"]
```

```python
# But every time we access a key, we have to use square brackets.
# To make it more readable, we can use . just like accessing attributes of an object.
# A plain dictionary does not support this, so we need to wrap it in ConfigBox.

# Raises AttributeError: 'dict' object has no attribute 'key1'

example.key1
```

```python

# Import ConfigBox

from box import ConfigBox

# Bind the dictionary with ConfigBox

example = ConfigBox(example)

# Now we can access the value of "key1" using dot notation

example.key1

# Output: value1
```

**Why Is ConfigBox Useful?**

We have many `yaml` files in our project, and we need to load them into our modules. By default, `yaml` files are loaded as dictionaries, which can be cumbersome to work with when the configuration is nested. Wrapping the loaded dictionary in `ConfigBox` lets us read the settings with dot notation, which keeps the configuration code short and readable.

`ConfigBox` on its own does not check the types of the values it holds. To catch wrong types early, we pair it with type annotations on the functions that consume the configuration and enforce them at runtime with `ensure_annotations`.
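To make the dot-notation idea concrete, here is a minimal standard-library sketch of a dictionary with attribute access. The `DotDict` class and the `data_ingestion` / `root_dir` keys are hypothetical, and this is not how `python-box` implements `ConfigBox`; it only illustrates the behavior on a nested configuration.

```python
class DotDict(dict):
    """Hypothetical minimal stand-in for ConfigBox: a dict with attribute access."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dictionaries so that chained access such as cfg.a.b works
        return DotDict(value) if isinstance(value, dict) else value


# A nested dictionary, as it might come out of a YAML loader
config = DotDict({"data_ingestion": {"root_dir": "artifacts/data_ingestion"}})

print(config.data_ingestion.root_dir)        # artifacts/data_ingestion
print(config["data_ingestion"]["root_dir"])  # bracket access still works
```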

### **Ensure Annotations**

`ensure_annotations` is a decorator from the `ensure` library that enforces a function's type annotations at runtime. Each time the function is called, it checks the arguments and the return value against the annotated types.

Let's say we have a function that adds two numbers, and we want to ensure that the inputs are integers. We annotate the parameters with `int` when defining the function and decorate it with `ensure_annotations` to validate the inputs.

If a value does not match its annotation, the call raises an `EnsureError` instead of silently continuing with the wrong type.

```python

# Without ensure_annotations

def add(a: int, b: int):
    return a + b

# Python does not enforce annotations, so passing strings raises no error
result = add("1", "2")
result

# Output: "12" (string concatenation)

# Our function returns "12" instead of 3

# To enforce the annotated types at call time, we can use ensure_annotations
```

```python

from ensure import ensure_annotations

@ensure_annotations
def add(a: int, b: int):
    return a + b

# If we pass a string to the function, it will raise an error

add("1", "2")

# Raises EnsureError: argument a does not match the annotation type int

```
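To show roughly what such a decorator does at call time, here is a minimal standard-library sketch. The name `check_annotations` is hypothetical, the sketch only handles plain classes such as `int` or `str`, and it raises `TypeError` rather than the library's `EnsureError`; it is not the implementation used by `ensure`.

```python
import functools
import inspect


def check_annotations(func):
    """Hypothetical stand-in for ensure_annotations, limited to plain-class annotations."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map the positional and keyword arguments onto the parameter names
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = signature.parameters[name].annotation
            # Skip parameters without an annotation or with a non-class annotation
            if expected is inspect.Parameter.empty or not isinstance(expected, type):
                continue
            if not isinstance(value, expected):
                raise TypeError(
                    f"Argument {name!r} must be {expected.__name__}, got {type(value).__name__}"
                )

        result = func(*args, **kwargs)

        # Apply the same check to the return value
        expected_return = signature.return_annotation
        if (
            expected_return is not inspect.Signature.empty
            and isinstance(expected_return, type)
            and not isinstance(result, expected_return)
        ):
            raise TypeError(
                f"Return value must be {expected_return.__name__}, got {type(result).__name__}"
            )
        return result

    return wrapper


@check_annotations
def add(a: int, b: int) -> int:
    return a + b


print(add(1, 2))  # 3

try:
    add("1", "2")
except TypeError as error:
    print(error)  # Argument 'a' must be int, got str
```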

### **ConfigBox with Ensure Annotations**

```python

# We can combine ConfigBox and ensure_annotations in our code:
# ConfigBox provides dot-notation access to the values,
# and ensure_annotations checks their types when the function is called

from box import ConfigBox
from ensure import ensure_annotations

example = ConfigBox({"a": 1, "b": 2})

@ensure_annotations
def add(a: int, b: int) -> int:
    """
    Function to add two numbers.
    Args:
        a (int): First number.
        b (int): Second number.
    Returns:
        int: Sum of a and b.
    """
    return a + b

# Now we can pass the values from the ConfigBox object to the function
result = add(example.a, example.b)

result

# Output: 3

```


## **Next Day**

**First Complete End to End Project**

[Link](https://www.udemy.com/course/complete-mlops-bootcamp-with-10-end-to-end-ml-projects/learn/lecture/46209327#overview)
