This project outlines the architecture and workflow of an AI application that processes and standardizes data from various sources (PDF files, scraped web pages, and enterprise services) and stores it in an AWS S3 bucket. The application is built using a combination of Python libraries, FastAPI for the backend, and Streamlit for the frontend.
Below is the workflow diagram for the AI Application:
- User: The end-user interacts with the application via the Streamlit frontend.
- Streamlit App: The frontend built using Streamlit.
- FastAPI Backend: The backend server that handles data processing.
- Data Extraction:
  - PyMuPDF / Camelot: Open-source tools for extracting data from PDF files (see the extraction sketch after this list).
  - Azure Document Intelligence and Adobe PDF Extract API: Enterprise tools for extracting data from PDF files.
  - BeautifulSoup / Scrapy: Open-source tools for web scraping.
  - Apify: Enterprise tool for web scraping.
- Standardization Tools:
  - Docling: An open-source tool for standardizing PDF-to-Markdown conversion.
  - MarkItDown: Another open-source tool for further data standardization.
- AWS S3 Bucket: Used for storing processed data.
- Google Cloud Run: Used for deploying the FastAPI application.
- Streamlit In-built Deployment: Used for deploying the Streamlit application for the UI/UX.
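To make the open-source extraction path concrete, here is a minimal sketch; the file name, URL, and helper functions are illustrative, not part of the repository:

```python
# Minimal sketch of the open-source extraction path (illustrative only).
# Assumes: pip install pymupdf beautifulsoup4 requests
import fitz  # PyMuPDF
import requests
from bs4 import BeautifulSoup

def extract_pdf_text(path: str) -> str:
    """Pull plain text from every page of a PDF with PyMuPDF."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

def scrape_page_text(url: str) -> str:
    """Fetch a page and reduce it to visible text with BeautifulSoup."""
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
```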
The data flows through the application as follows:
- The User uploads data via the Streamlit Frontend.
- The Frontend sends the data to the FastAPI Backend.
- The Backend processes the data using one or more of the following:
  - PyMuPDF / Camelot for open-source PDF extraction.
  - BeautifulSoup / Scrapy for open-source web scraping.
  - Azure Document Intelligence for enterprise document processing.
  - Apify for enterprise web scraping.
- The extracted data is standardized using Docling in the pdf_process_pipeline and MarkItDown in the open-source webscraping_pipeline (note: Apify parses pages and generates Markdown itself); see the sketch after these steps.
- The processed data is stored in an AWS S3 Bucket.
- The Frontend retrieves the processed data from the S3 Bucket and displays it to the User.
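As a rough sketch of the standardization-and-storage step: the function and bucket/key names below are hypothetical, and the Docling and MarkItDown calls follow their documented Python APIs, so verify them against the versions you install:

```python
# Sketch of the standardization + storage step (names are illustrative).
# Assumes: pip install docling markitdown boto3
import boto3
from docling.document_converter import DocumentConverter
from markitdown import MarkItDown

s3 = boto3.client("s3")  # credentials come from the .env / environment

def pdf_process_pipeline(pdf_path: str, bucket: str, key: str) -> None:
    """Convert a PDF to Markdown with Docling and store it in S3."""
    markdown = DocumentConverter().convert(pdf_path).document.export_to_markdown()
    s3.put_object(Bucket=bucket, Key=key, Body=markdown.encode("utf-8"))

def webscraping_pipeline(html_path: str, bucket: str, key: str) -> None:
    """Standardize scraped HTML to Markdown with MarkItDown and store it in S3."""
    markdown = MarkItDown().convert(html_path).text_content
    s3.put_object(Bucket=bucket, Key=key, Body=markdown.encode("utf-8"))
```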
To run this project you need:
- Python 3.7+
- The Diagrams library, for generating the workflow diagram.
- An AWS account with S3 bucket access.
- Streamlit and FastAPI installed for frontend and backend development, along with the other libraries listed in each requirements file (a sample is sketched below).
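As an illustration only, an api/requirements.txt consistent with the tools named in this README might contain the following; this is an assumption, not the repository's actual file, so pin versions to whatever you test against:

```
fastapi
uvicorn
boto3
python-dotenv
pymupdf
camelot-py
beautifulsoup4
requests
docling
markitdown
```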
1. Clone the repository:
git clone https://github.com/yourusername/ai-application-workflow.git
cd ai-application-workflow
2. Create a .env file and add the required credentials:
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=your_aws_region
AZURE_DOCUMENT_INTELLIGENCE_KEY=your_azure_key
AZURE_FORM_RECOGNIZER_ENDPOINT=your_azure_form_recognizer_endpoint
APIFY_TOKEN=your_apify_token
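These variables can be loaded in the backend at startup; a minimal sketch assuming python-dotenv is used (boto3 picks the AWS variables up from the environment automatically):

```python
# Sketch: load .env credentials at backend startup (assumes python-dotenv).
import os
import boto3
from dotenv import load_dotenv

load_dotenv()  # reads AWS_*, AZURE_*, and APIFY_TOKEN from .env

# boto3 uses AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.
s3 = boto3.client("s3", region_name=os.environ["AWS_REGION"])
azure_key = os.environ["AZURE_DOCUMENT_INTELLIGENCE_KEY"]
azure_endpoint = os.environ["AZURE_FORM_RECOGNIZER_ENDPOINT"]
apify_token = os.environ["APIFY_TOKEN"]
```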
3. Ensure you have the custom icons (microsoft.png, docling.png, markitdown.png, streamlit.png) in the ./icons/ directory, then generate the workflow diagram:
python generate_diagram.py
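The repository's generate_diagram.py is not shown in this README; a minimal sketch of what it might contain, using the Diagrams library with the custom icons above (the node layout and the FastAPI icon path are assumptions):

```python
# Sketch of generate_diagram.py (illustrative; assumes `pip install diagrams`
# and Graphviz installed on the system).
from diagrams import Cluster, Diagram
from diagrams.aws.storage import S3
from diagrams.custom import Custom
from diagrams.onprem.client import User

with Diagram("AI Application Workflow", show=False):
    user = User("User")
    frontend = Custom("Streamlit App", "./icons/streamlit.png")
    backend = Custom("FastAPI Backend", "./icons/fastapi.png")  # hypothetical icon

    with Cluster("Data Extraction"):
        docintel = Custom("Azure Document Intelligence", "./icons/microsoft.png")

    with Cluster("Standardization"):
        docling = Custom("Docling", "./icons/docling.png")
        markitdown = Custom("MarkItDown", "./icons/markitdown.png")

    bucket = S3("Processed Data")

    user >> frontend >> backend >> docintel >> docling >> bucket
    docintel >> markitdown >> bucket
```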
4. Install dependencies: create a venv inside the api and frontend folders (the activate command below is for Windows; on macOS/Linux use source venv/bin/activate):
cd api
python -m venv venv
venv/Scripts/activate
pip install -r requirements.txt
Then, in a new terminal:
cd frontend
python -m venv venv
venv/Scripts/activate
pip install -r requirements.txt
5. Run the FastAPI backend:
uvicorn main:app --reload --host 0.0.0.0 --port 8080
6. In another terminal, run the Streamlit frontend:
streamlit run frontend.py
7. Open your browser and navigate to http://localhost:8501 to interact with the application.
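For orientation, a minimal main.py compatible with the uvicorn command above might look like the following; the /process endpoint name and response shape are hypothetical, not the repository's actual API:

```python
# Minimal sketch of api/main.py (endpoint name and response are hypothetical).
# File uploads require: pip install fastapi uvicorn python-multipart
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/process")
async def process(file: UploadFile):
    """Accept a file uploaded from the Streamlit frontend."""
    data = await file.read()
    # ...hand off to the extraction / standardization pipeline here...
    return {"filename": file.filename, "size_bytes": len(data)}
```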
Deployment:
1. Dockerize:
   a. There are two Dockerfiles, one in the frontend (Streamlit) folder and one in the api (FastAPI) folder.
   b. After installing Docker Desktop, create the Docker images. Before building the FastAPI image, run commands to authorize the Google Cloud SDK (a sketch follows), and make sure you run these commands from the root directory of the project.
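The exact commands are not included in this README; a plausible sequence, assuming the images are pushed to Google Container Registry and the backend is deployed to Cloud Run (PROJECT_ID, the image names, and the region are placeholders):

```bash
# Illustrative only: authorize the gcloud SDK, then build and push the
# FastAPI image (replace PROJECT_ID with your Google Cloud project ID).
gcloud auth login
gcloud auth configure-docker

docker build -t gcr.io/PROJECT_ID/fastapi-backend ./api
docker push gcr.io/PROJECT_ID/fastapi-backend

# Deploy the pushed image to Cloud Run.
gcloud run deploy fastapi-backend \
  --image gcr.io/PROJECT_ID/fastapi-backend \
  --region us-central1 \
  --allow-unauthenticated
```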
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.
Notes:
- Replace yourusername in the repository URL with your actual GitHub username.
- Ensure the generate_diagram.py script is created to generate the workflow diagram.
- Update the LICENSE file if you choose a different license.