### Architecture Overview

This diagram shows how data flows between the components of the system:

* **Hugging Face GAIA Dataset**: The GAIA dataset is downloaded using Python scripts within the Airflow container.
* **Data Storage**:
  - The dataset files are stored in their original, unstructured formats (e.g., PDF, JSON, TXT) in an S3 bucket.
  - The dataset metadata is loaded into a metadata table in RDS.
* **Data Processing**:
  - The unstructured files (e.g., PDFs) are processed with PyMuPDF4LLM and the Unstructured API.
  - The processed output (e.g., JSON, TXT files) is written back to S3, in a separate location for each tool, for later retrieval.
* **Airflow Pipeline**:
  - The entire process (downloading, processing, storing data) is orchestrated and automated using Airflow tasks.
  - Airflow itself runs in a Docker container and manages each step of the process as a task.
* **Streamlit User Interface**:
  - End-users interact with the system through a Streamlit-based user interface where they can provide inputs, select questions, and initiate data queries.
  - The interface connects to the FastAPI backend to retrieve stored files and relevant metadata for user-selected questions.
* **Data Retrieval**:
  - Files and metadata are retrieved from the S3 storage and metadata table based on the user's selections in the Streamlit interface.
* **OpenAI API**:
  - The user inputs, selected questions, and retrieved files are sent to the OpenAI API for language model-based analysis.
  - The API returns large language model (LLM) responses based on the processed data and user inputs.
* **Result Display**:
  - The LLM responses from the OpenAI API are returned to the Streamlit interface, where the results are displayed for the user.
  - Users can interact with the results and update their inputs as needed for further queries.
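
The flow described above can be sketched in plain Python, with in-memory stand-ins for the external services. This is a minimal sketch under stated assumptions, not the project's implementation: `FakeBucket`, `ingest`, `process`, and `retrieve`, the `raw/` and `processed/` key prefixes, and the record fields are all hypothetical names chosen for illustration. A real pipeline would call S3, RDS, and PyMuPDF4LLM or the Unstructured API, and would run each stage as an Airflow task.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeBucket:
    """In-memory stand-in for an S3 bucket (hypothetical, illustration only)."""
    objects: dict[str, bytes] = field(default_factory=dict)

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body

    def get(self, key: str) -> bytes:
        return self.objects[key]


def ingest(bucket: FakeBucket, metadata_table: list[dict], task_id: str,
           question: str, file_name: str, raw: bytes) -> None:
    """Stage 1: store the raw file and record a metadata row for it."""
    raw_key = f"raw/{file_name}"
    bucket.put(raw_key, raw)
    metadata_table.append(
        {"task_id": task_id, "question": question, "raw_key": raw_key}
    )


def process(bucket: FakeBucket, metadata_table: list[dict],
            extractor_name: str, extract: Callable[[bytes], str]) -> None:
    """Stage 2: run one extractor over every raw file and store its text output."""
    for row in metadata_table:
        text = extract(bucket.get(row["raw_key"]))
        out_key = f"processed/{extractor_name}/{row['task_id']}.txt"
        bucket.put(out_key, text.encode("utf-8"))
        row[f"{extractor_name}_key"] = out_key  # remember where the output lives


def retrieve(bucket: FakeBucket, metadata_table: list[dict],
             task_id: str, extractor_name: str) -> tuple[str, str]:
    """Stage 3: fetch a question and its processed text for the backend."""
    row = next(r for r in metadata_table if r["task_id"] == task_id)
    text = bucket.get(row[f"{extractor_name}_key"]).decode("utf-8")
    return row["question"], text
```

Keeping one output prefix per extractor mirrors the two "processed" S3 nodes in the diagram, so the backend can serve either tool's output for the same question.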


In [3]:
# Import libraries
from diagrams import Diagram, Cluster, Edge
from diagrams.programming.language import Python
from diagrams.custom import Custom
from diagrams.aws.storage import S3
from diagrams.aws.database import RDS
from diagrams.onprem.workflow import Airflow
from diagrams.programming.framework import FastAPI
from diagrams.digitalocean.compute import Docker

In [4]:
# Define the visual context
graph_attr = {
    "fontsize": "10",   # Font size for the text
    "size": "10,10",    # Maximum diagram width and height (inches)
    "nodesep": "0.3",   # Minimum spacing between nodes in the same rank (Graphviz default is 0.25)
    "ranksep": "0.5",   # Minimum spacing between ranks (Graphviz default is 0.5)
    "dpi": "120",       # Dots per inch (increase this to improve resolution)
}

filename = "flow_diagram"


In [5]:
# Define the function that builds the flow diagram
def create_flow_diagram():
    try:
        # Diagram with left-to-right flow between clusters
        with Diagram("Flow Diagram", filename=filename, show=False, direction="LR", graph_attr=graph_attr) as diag:
            
            # Airflow Container, with top-to-bottom flow inside the cluster
            with Cluster("Airflow Container", direction="TB", graph_attr={"fontsize": "16", "fontname": "bold"}):  # Cluster with increased font size and bold
                airflow = Airflow("Airflow\ntrigger")
                hugging_face = Custom("GAIA Dataset", "./input_icons/HugingFace.png")
                python = Python("Downloading Metadata\nand Files")
                s3 = S3("PDF files storage")
                PyMuPDF4LLM = Custom("PyMuPDF", "./input_icons/PyMuPDF.jpeg")
                Unstructured_API = Custom("Unstructured API", "./input_icons/Unstructured.png")
                docker_airflow = Docker("Docker")

            # Connecting the elements inside the Airflow container
            airflow >> hugging_face >> python
            python >> s3
            # One-way edges from S3 to both parsers (PyMuPDF4LLM and the Unstructured API)
            s3 >> PyMuPDF4LLM
            s3 >> Unstructured_API

            rds_auth = RDS("User Auth")

            # FAST API Endpoints cluster on the right side
            with Cluster("FAST API Endpoints", direction="TB", graph_attr={"fontsize": "16", "fontname": "bold"}):  # Cluster with increased font size and bold
                user_auth = Custom("User Registration\n & Authentication", "./input_icons/user-authentication.png")
                fastapi_unstructured = S3("Unstructured\nProcessed")
                fastapi_opensource = S3("Open Source\nProcessed")
                rds = RDS("Metadata Table")
                openai = Custom("Open AI API", "./input_icons/OpenAI.png")

            # Containerized Streamlit and FastAPI cluster with right-to-left flow, placed on the right
            with Cluster("Containerized Streamlit and FastAPI", direction="RL", graph_attr={"fontsize": "16", "fontname": "bold"}):  # Cluster with increased font size and bold
                fast_api = FastAPI("FastAPI")
                streamlit = Custom("Streamlit", "./input_icons/streamlit.png")
                docker = Docker("Docker")

            # The user connects to Streamlit, which then calls FastAPI
            user = Custom("User", "./input_icons/user.png")
            user >> streamlit >> fast_api

            # Connecting FastAPI to FastAPI Endpoints elements
            fast_api >> user_auth >> rds_auth
            fast_api >> fastapi_unstructured
            fast_api >> fastapi_opensource
            fast_api >> openai
            openai >> fast_api

            python >> rds >> fast_api

            # Connecting Unstructured API and PyMuPDF4LLM in Airflow Container to S3 buckets in FAST API Endpoints
            Unstructured_API >> fastapi_unstructured
            PyMuPDF4LLM >> fastapi_opensource

    except Exception as e:
        print("Exception: ", e)


In [6]:
create_flow_diagram()
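
Because `create_flow_diagram()` catches every exception and only prints it, a failed render (for example, a missing icon file or Graphviz binary) does not stop the notebook. A quick way to confirm the render worked is to check for the output file. The sketch below assumes the `diagrams` default output format (`png`) and that the file is written to the current working directory; `expected_output` and `describe_output` are hypothetical helper names, not part of the `diagrams` API.

```python
from pathlib import Path


def expected_output(filename: str, outformat: str = "png", base_dir: str = ".") -> Path:
    """Build the path where the rendered diagram is expected to be written."""
    return Path(base_dir) / f"{filename}.{outformat}"


def describe_output(path: Path) -> str:
    """Return a short status message about the rendered diagram file."""
    if path.is_file():
        return f"Rendered {path} ({path.stat().st_size} bytes)"
    return f"{path} not found; check the exception message printed by the render step"


# Same base name as the `filename` variable defined earlier in the notebook
print(describe_output(expected_output("flow_diagram")))
```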