This project outlines the architecture and workflow of an AI application that processes and standardizes data from various sources (PDF files, scraped web pages, and enterprise services) and stores it in an AWS S3 bucket. The application is built using a combination of Python libraries, FastAPI for the backend, and Streamlit for the frontend.
Below is the workflow diagram for the AI Application:
- User: The end-user interacts with the application via the Streamlit frontend.
- Streamlit App: The frontend built using Streamlit.
- FastAPI Backend: The backend server that handles data processing.
- Data Extraction:
  - PyMuPDF / Camelot: Open-source tools for extracting data from PDF files (see the extraction sketch after this list).
  - Azure Document Intelligence and Adobe PDF Extract API: Enterprise tools for extracting data from PDF files.
  - BeautifulSoup / Scrapy: Open-source tools for web scraping.
  - Apify: Enterprise tool for web scraping.
- Standardization Tools:
  - Docling: An open-source tool for standardizing PDF-to-Markdown conversion.
  - MarkItDown: Another open-source tool for further data standardization.
- AWS S3 Bucket: Used for storing processed data.
- Google Cloud Run: Used for deploying the FastAPI application.
- Streamlit In-built Deployment: Used for deploying the Streamlit application for the UI/UX.
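To make the open-source extraction path concrete, here is a minimal sketch; the file name, URL, and helper functions are illustrative, not part of the repository:

```python
# Minimal sketch of the open-source extraction path (illustrative only).
# Assumes: pip install pymupdf beautifulsoup4 requests
import fitz  # PyMuPDF
import requests
from bs4 import BeautifulSoup

def extract_pdf_text(path: str) -> str:
    """Pull plain text from every page of a PDF with PyMuPDF."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

def scrape_page_text(url: str) -> str:
    """Fetch a page and reduce it to visible text with BeautifulSoup."""
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
```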
The data flows through the application as follows:
- The User uploads data via the Streamlit Frontend.
- The Frontend sends the data to the FastAPI Backend.
- The Backend processes the data using one or more of the following:
  - PyMuPDF / Camelot for open-source PDF extraction.
  - BeautifulSoup / Scrapy for open-source web scraping.
  - Azure Document Intelligence for enterprise document processing.
  - Apify for enterprise web scraping.
- The extracted data is standardized using Docling in the pdf_process_pipeline and MarkItDown in the open-source webscraping_pipeline (note: Apify parses pages and generates Markdown itself); see the sketch after these steps.
- The processed data is stored in an AWS S3 Bucket.
- The Frontend retrieves the processed data from the S3 Bucket and displays it to the User.
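As a rough sketch of the standardization-and-storage step: the function and bucket/key names below are hypothetical, and the Docling and MarkItDown calls follow their documented Python APIs, so verify them against the versions you install:

```python
# Sketch of the standardization + storage step (names are illustrative).
# Assumes: pip install docling markitdown boto3
import boto3
from docling.document_converter import DocumentConverter
from markitdown import MarkItDown

s3 = boto3.client("s3")  # credentials come from the .env / environment

def pdf_process_pipeline(pdf_path: str, bucket: str, key: str) -> None:
    """Convert a PDF to Markdown with Docling and store it in S3."""
    markdown = DocumentConverter().convert(pdf_path).document.export_to_markdown()
    s3.put_object(Bucket=bucket, Key=key, Body=markdown.encode("utf-8"))

def webscraping_pipeline(html_path: str, bucket: str, key: str) -> None:
    """Standardize scraped HTML to Markdown with MarkItDown and store it in S3."""
    markdown = MarkItDown().convert(html_path).text_content
    s3.put_object(Bucket=bucket, Key=key, Body=markdown.encode("utf-8"))
```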
To run this project you need:
- Python 3.7+
- The Diagrams library, for generating the workflow diagram.
- An AWS account with S3 bucket access.
- Streamlit and FastAPI installed for frontend and backend development, along with the other libraries listed in each requirements file (a sample is sketched below).
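As an illustration only, an api/requirements.txt consistent with the tools named in this README might contain the following; this is an assumption, not the repository's actual file, so pin versions to whatever you test against:

```
fastapi
uvicorn
boto3
python-dotenv
pymupdf
camelot-py
beautifulsoup4
requests
docling
markitdown
```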
1. Clone the repository:
git clone https://github.com/yourusername/ai-application-workflow.git
cd ai-application-workflow
2. Create a .env file and add the required credentials:
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=your_aws_region
AZURE_DOCUMENT_INTELLIGENCE_KEY=your_azure_key
AZURE_FORM_RECOGNIZER_ENDPOINT=your_azure_form_recognizer_endpoint
APIFY_TOKEN=your_apify_token
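These variables can be loaded in the backend at startup; a minimal sketch assuming python-dotenv is used (boto3 picks the AWS variables up from the environment automatically):

```python
# Sketch: load .env credentials at backend startup (assumes python-dotenv).
import os
import boto3
from dotenv import load_dotenv

load_dotenv()  # reads AWS_*, AZURE_*, and APIFY_TOKEN from .env

# boto3 uses AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.
s3 = boto3.client("s3", region_name=os.environ["AWS_REGION"])
azure_key = os.environ["AZURE_DOCUMENT_INTELLIGENCE_KEY"]
azure_endpoint = os.environ["AZURE_FORM_RECOGNIZER_ENDPOINT"]
apify_token = os.environ["APIFY_TOKEN"]
```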
3. Ensure you have the custom icons (microsoft.png, docling.png, markitdown.png, streamlit.png) in the ./icons/ directory, then generate the workflow diagram:
python generate_diagram.py
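The repository's generate_diagram.py is not shown in this README; a minimal sketch of what it might contain, using the Diagrams library with the custom icons above (the node layout and the FastAPI icon path are assumptions):

```python
# Sketch of generate_diagram.py (illustrative; assumes `pip install diagrams`
# and Graphviz installed on the system).
from diagrams import Cluster, Diagram
from diagrams.aws.storage import S3
from diagrams.custom import Custom
from diagrams.onprem.client import User

with Diagram("AI Application Workflow", show=False):
    user = User("User")
    frontend = Custom("Streamlit App", "./icons/streamlit.png")
    backend = Custom("FastAPI Backend", "./icons/fastapi.png")  # hypothetical icon

    with Cluster("Data Extraction"):
        docintel = Custom("Azure Document Intelligence", "./icons/microsoft.png")

    with Cluster("Standardization"):
        docling = Custom("Docling", "./icons/docling.png")
        markitdown = Custom("MarkItDown", "./icons/markitdown.png")

    bucket = S3("Processed Data")

    user >> frontend >> backend >> docintel >> docling >> bucket
    docintel >> markitdown >> bucket
```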
4. Install dependencies: create a venv inside the api and frontend folders (the activate command below is for Windows; on macOS/Linux use source venv/bin/activate):
cd api
python -m venv venv
venv/Scripts/activate
pip install -r requirements.txt
Then, in a new terminal:
cd frontend
python -m venv venv
venv/Scripts/activate
pip install -r requirements.txt
5. Run the FastAPI backend:
uvicorn main:app --reload --host 0.0.0.0 --port 8080
6. In another terminal, run the Streamlit frontend:
streamlit run frontend.py
7. Open your browser and navigate to http://localhost:8501 to interact with the application.
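For orientation, a minimal main.py compatible with the uvicorn command above might look like the following; the /process endpoint name and response shape are hypothetical, not the repository's actual API:

```python
# Minimal sketch of api/main.py (endpoint name and response are hypothetical).
# File uploads require: pip install fastapi uvicorn python-multipart
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/process")
async def process(file: UploadFile):
    """Accept a file uploaded from the Streamlit frontend."""
    data = await file.read()
    # ...hand off to the extraction / standardization pipeline here...
    return {"filename": file.filename, "size_bytes": len(data)}
```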
Deployment:
1. Dockerize:
   a. There are two Dockerfiles, one in the frontend (Streamlit) folder and one in the api (FastAPI) folder.
   b. After installing Docker Desktop, create the Docker images. Before building the FastAPI image, run commands to authorize the Google Cloud SDK (a sketch follows), and make sure you run these commands from the root directory of the project.
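The exact commands are not included in this README; a plausible sequence, assuming the images are pushed to Google Container Registry and the backend is deployed to Cloud Run (PROJECT_ID, the image names, and the region are placeholders):

```bash
# Illustrative only: authorize the gcloud SDK, then build and push the
# FastAPI image (replace PROJECT_ID with your Google Cloud project ID).
gcloud auth login
gcloud auth configure-docker

docker build -t gcr.io/PROJECT_ID/fastapi-backend ./api
docker push gcr.io/PROJECT_ID/fastapi-backend

# Deploy the pushed image to Cloud Run.
gcloud run deploy fastapi-backend \
  --image gcr.io/PROJECT_ID/fastapi-backend \
  --region us-central1 \
  --allow-unauthenticated
```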
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.
Notes:
- Replace yourusername in the repository URL with your actual GitHub username.
- Ensure the generate_diagram.py script is created to generate the workflow diagram.
- Update the LICENSE file if you choose a different license.