GitHub - MichaelFowler1/Cost_Extraction: AI Defense Pipeline: Automates extraction and analysis of GAO reports. It uses GenAI and data engineering to turn unstructured PDFs into a structured, searchable, and actionable knowledge base.

Automated AI Cost Extraction Pipeline

Executive Summary

I developed this system to handle the tedious task of digging through complex, unstructured PDF documents to find specific cost data. Instead of wasting hours on manual data entry, this pipeline uses Python and Google's Gemini AI to automatically find, clean, and store financial information. It captures both text and visual data (like tables and charts), analyzes the context, and saves the final results into a structured database. This project effectively turns a manual bottleneck into an automated, scalable workflow.

The Technical Workflow

The system is broken down into a five-step modular pipeline to ensure data accuracy and efficiency:

Extracting Text (01_extract_pdf.py): This is the ingestor that reads raw PDF files and converts them into text that the computer can process.

Data Cleaning (02_cleaner.py): Raw PDF text is usually messy. This script scrubs the data, removes formatting errors, and prepares the text for the AI.

AI Analysis (03_ai_analyst.py): This is the brain of the operation. It feeds the cleaned data to the Gemini API to identify and extract specific cost metrics.

Visual Extraction (04_visual_extractor.py): Since important data is often buried in charts or graphs, this script uses vision-based analysis to interpret images within the documents.

Database Management (05_library_manager.py): The final step takes all the findings and logs them into a local SQLite database while also generating a CSV report for easy viewing.
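As a rough illustration of how the cleaning, analysis, and storage steps above could fit together, here is a minimal standard-library sketch. It is not the code in the repository's scripts. The function names (`clean_text`, `analyze`, `store`), the cleaning rules, the JSON reply format, and the `costs` table schema are all assumptions made for this example. The model call is passed in as a plain callable (`ask_model`), so no particular Gemini client API is assumed.

```python
import csv
import json
import re
import sqlite3
from typing import Callable


def clean_text(raw: str) -> str:
    """Cleaning step (assumed rules): repair common PDF extraction damage."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # rejoin words split by a line-end hyphen
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


def analyze(text: str, ask_model: Callable[[str], str]) -> list[dict]:
    """Analysis step (assumed contract): request cost records as a JSON list."""
    prompt = (
        "Extract every cost figure from the text below. Reply with a JSON list "
        'of objects with keys "item", "amount_usd", and "fiscal_year".\n\n' + text
    )
    reply = ask_model(prompt)
    # Replies may carry extra prose or code fences, so keep only the outermost list.
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        return []
    return json.loads(reply[start:end + 1])


def store(records: list[dict], source: str, db_path: str, csv_path: str) -> int:
    """Storage step (assumed schema): append to SQLite, then rewrite the CSV report."""
    columns = ["source", "item", "amount_usd", "fiscal_year"]
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS costs ("
                "source TEXT, item TEXT, amount_usd REAL, fiscal_year INTEGER)"
            )
            conn.executemany(
                "INSERT INTO costs VALUES (?, ?, ?, ?)",
                [(source, r.get("item"), r.get("amount_usd"), r.get("fiscal_year"))
                 for r in records],
            )
        rows = conn.execute(
            "SELECT source, item, amount_usd, fiscal_year FROM costs ORDER BY rowid"
        ).fetchall()
    finally:
        conn.close()
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(columns)
        writer.writerows(rows)
    return len(rows)  # total rows now in the database
```

Passing the model call in as a callable keeps the sketch testable offline: a stub that returns a canned JSON string can stand in for the real API while the cleaning and storage logic is checked. Note that the hyphen rule in `clean_text` would also merge genuinely hyphenated words that happen to break across lines, so a real cleaner would likely need more care.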

Setup and Installation

Clone the project: Download the repository to your local machine.

Install the environment: I used a requirements file to keep the project lightweight. You can install all necessary libraries with: pip install -r requirements.txt

Configure your API key: You will need a Gemini API key to run the analyst scripts. Create a file named .env in the root folder and add your key: GEMINI_API_KEY=your_actual_key_here
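For illustration, here is one way a script could read that key at startup. This is a standard-library sketch, not the repository's own loader (which may well use a package such as python-dotenv instead). The helper names `load_env_file` and `get_gemini_key` are hypothetical, and the parser handles only simple `KEY=value` lines.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=value lines and copy them into os.environ.

    Variables that are already set in the environment are left untouched.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def get_gemini_key() -> str:
    """Return the key, failing early if it is missing or still the placeholder."""
    load_env_file()
    key = os.environ.get("GEMINI_API_KEY", "")
    if not key or key == "your_actual_key_here":
        raise RuntimeError("Set GEMINI_API_KEY in .env or in the environment.")
    return key
```

Failing early with a clear message tends to be friendlier than letting the first API request fail with an authentication error partway through a run.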

Security and Project Hygiene

I built this repository with security in mind. I implemented a strict .gitignore file so that sensitive files like the .env file holding the API key, large virtual environments, and local data exports are never committed to the public history. This keeps the repository clean, professional, and secure for deployment.

Repository files (4 commits):
.gitignore
01_extract_pdf.py
02_cleaner.py
03_ai_analyst.py
04_visual_extractor.py
05_library_manager.py
Models_Gemini
README.md
