This project demonstrates a simple yet powerful Retrieval Augmented Generation (RAG) system using Java, leveraging Google Cloud's Vertex AI for embeddings and the Gemini API for generative AI capabilities. It allows you to ingest PDF documents, create a knowledge base from their content, and then query that knowledge base to get answers augmented by the retrieved context.
The application serves as a foundational example for building enterprise-grade RAG solutions to enable chatbots, document summarization, and intelligent search over private or domain-specific data.
PDF Document Ingestion: Extracts text content from PDF files.
Vector Embeddings: Utilizes Vertex AI's text-embedding-004 model to create vector representations of document chunks.
In-Memory Vector Store: Stores document embeddings and their corresponding text chunks in a simple, in-memory vector database.
Semantic Search: Performs cosine similarity search to retrieve the most relevant document chunks based on a user's query.
Retrieval Augmented Generation (RAG): Augments user queries with retrieved context from documents before sending to the Gemini LLM for more accurate and grounded responses.
Generative AI: Integrates with the Google Gemini API (gemini-2.5-flash) for natural language understanding and generation.
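To make the in-memory vector store and cosine similarity search described above concrete, here is a minimal sketch. The class and method names (InMemoryVectorStoreSketch, add, topK) are hypothetical and chosen for illustration; the project's own VectorStore.java may be organized differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; not the project's actual VectorStore.java.
final class InMemoryVectorStoreSketch {

    /** One stored chunk: its text and its embedding vector. */
    static final class Entry {
        final String text;
        final double[] embedding;

        Entry(String text, double[] embedding) {
            this.text = text;
            this.embedding = embedding;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String text, double[] embedding) {
        entries.add(new Entry(text, embedding));
    }

    /** Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros. */
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the texts of the k chunks most similar to the query, best match first. */
    List<String> topK(double[] queryEmbedding, int k) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble(
                (Entry e) -> cosineSimilarity(queryEmbedding, e.embedding)).reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, sorted.size()); i++) {
            result.add(sorted.get(i).text);
        }
        return result;
    }
}
```

A linear scan like this is adequate for a handful of documents. It recomputes every similarity on each query, which is one reason a dedicated vector database becomes attractive as the corpus grows.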
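The RAG pipeline operates in two main phases: ingestion, which builds the knowledge base from the PDFs, and querying, which retrieves context and generates an answer: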
PDF documents are read and text content is extracted.
The extracted text is split into manageable chunks.
Each chunk is sent to Vertex AI's text-embedding-004 model to generate a vector embedding.
The chunk text and its embedding are stored in an in-memory vector store.
A user's natural language query is received.
The query is sent to Vertex AI's text-embedding-004 model to generate its embedding.
This query embedding is used to find the most semantically similar chunks in the in-memory vector store.
The original query is then augmented with the retrieved context (relevant chunks).
The augmented prompt is sent to the Gemini model (gemini-2.5-flash).
Gemini generates a response based on the query and the provided context.
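The augmentation step above (combining the query with the retrieved chunks) can be sketched as plain string assembly. The class name PromptBuilderSketch and the instruction wording are assumptions made for illustration; the prompt template the project actually sends to Gemini is not shown in this README.

```java
import java.util.List;

// Hypothetical sketch of the prompt-augmentation step; the wording is an assumption.
final class PromptBuilderSketch {

    /** Builds one prompt containing numbered context chunks followed by the user's question. */
    static String buildAugmentedPrompt(String query, List<String> retrievedChunks) {
        StringBuilder sb = new StringBuilder();
        sb.append("Answer the question using only the context below. ")
          .append("If the context does not contain the answer, say so.\n\n")
          .append("Context:\n");
        for (int i = 0; i < retrievedChunks.size(); i++) {
            // Number each chunk so the model (and a human reader) can refer to it.
            sb.append("[").append(i + 1).append("] ")
              .append(retrievedChunks.get(i).trim())
              .append("\n");
        }
        sb.append("\nQuestion: ").append(query.trim());
        return sb.toString();
    }
}
```

The resulting string is what would be passed to the Gemini model as the augmented prompt. Asking the model to admit when the context is insufficient is one common way to keep answers grounded.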
Java 11+
Apache Maven
Google Cloud Platform (GCP)
Vertex AI: For text-embedding-004 and Gemini models.
Google Gen AI SDK for Java (com.google.genai:google-genai)
Google Cloud Client Libraries for Java (com.google.cloud:google-cloud-aiplatform)
Apache PDFBox: For PDF text extraction.
slf4j-simple: For basic logging.
Before you begin, ensure you have the following installed:
Java Development Kit (JDK) 11 or higher
Apache Maven 3.x
Google Cloud SDK (gcloud CLI): Authenticated and configured for your GCP project.
Install: Google Cloud SDK
Authenticate: gcloud auth application-default login
Create a Google Cloud Project: If you don't have one, create a new project in the Google Cloud Console.
Enable APIs:
Navigate to "APIs & Services" > "Enabled APIs & Services".
Enable the following APIs:
Vertex AI API
Set Environment Variables: The application reads your Google Cloud project ID and region from the GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_LOCATION environment variables. gcloud does not set these for you, so export them in your shell:
Bash
export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
export GOOGLE_CLOUD_LOCATION="us-central1" # Or your preferred region (e.g., "asia-south1")
Replace "your-gcp-project-id" with your actual project ID. us-central1 is a common region for Vertex AI models.
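As an illustration of how these two variables might be read and validated at startup, here is a small sketch. The class name GcpConfigSketch and the fallback to us-central1 when the location is unset are assumptions for this example, not confirmed behavior of the application.

```java
import java.util.Map;

// Hypothetical sketch; the application's real configuration code may differ.
final class GcpConfigSketch {
    final String projectId;
    final String location;

    private GcpConfigSketch(String projectId, String location) {
        this.projectId = projectId;
        this.location = location;
    }

    /** Reads settings from an environment map, e.g. System.getenv(). */
    static GcpConfigSketch fromEnv(Map<String, String> env) {
        String project = env.get("GOOGLE_CLOUD_PROJECT");
        if (project == null || project.isBlank()) {
            // Fail early with a clear message instead of a later API error.
            throw new IllegalStateException("GOOGLE_CLOUD_PROJECT is not set");
        }
        String location = env.get("GOOGLE_CLOUD_LOCATION");
        if (location == null || location.isBlank()) {
            location = "us-central1"; // Assumed default for this sketch only.
        }
        return new GcpConfigSketch(project.trim(), location.trim());
    }
}
```

In a real entry point this would be called as GcpConfigSketch.fromEnv(System.getenv()). Taking the map as a parameter keeps the method easy to test without touching the process environment.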
Clone the Repository:
Bash
git clone https://github.com/Swastik466/document-intelligence-rag-system.git
cd document-intelligence-rag-system
Place PDF Documents: Create a pdfs directory in the root of the project and place your PDF files inside it.
├── pom.xml
├── src/
│   └── main/
│       └── java/
│           └── com/
│               └── example/
│                   └── rag/
│                       ├── App.java
│                       ├── ChunkingUtil.java
│                       ├── EmbeddingService.java
│                       ├── GeminiService.java
│                       └── VectorStore.java
└── pdfs/
    ├── document1.pdf
    └── document2.pdf
The application is configured to look for PDF files in this pdfs/ directory.
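For illustration, discovering the PDF files in that directory could look like the sketch below, which uses only the JDK. The class name PdfFinderSketch is hypothetical, and the text extraction itself (done with Apache PDFBox in this project) is deliberately left out.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; the project's own file discovery may differ.
final class PdfFinderSketch {

    /** Returns the regular files directly under dir whose names end in .pdf (any case), sorted. */
    static List<Path> listPdfs(Path dir) throws IOException {
        List<Path> pdfs = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return pdfs; // A missing directory simply yields no documents.
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                if (Files.isRegularFile(p) && name.endsWith(".pdf")) {
                    pdfs.add(p);
                }
            }
        }
        Collections.sort(pdfs); // Stable order makes ingestion runs reproducible.
        return pdfs;
    }
}
```

Calling PdfFinderSketch.listPdfs(Path.of("pdfs")) from the project root would return the files to ingest. This sketch does not descend into subdirectories.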
Build the Project: Navigate to the project root directory in your terminal and build the executable JAR:
Bash
mvn clean install
This command compiles the code, runs tests, and packages the application into a single executable JAR file (named document-rag-1.0-SNAPSHOT-jar-with-dependencies.jar) in the target/ directory.
After a successful build, run the application from the project root; the JAR is located in the target/ directory.
Bash
java -jar target/document-rag-1.0-SNAPSHOT-jar-with-dependencies.jar
The application will:
Ingest and process all PDF files found in the pdfs/ directory.
Build the in-memory vector store.
Prompt you to enter questions.
Persistent Vector Database: Replace the in-memory vector store with a more robust solution like Cloud SQL with pgvector or Vertex AI Vector Search for scalability and persistence.
Google Cloud Storage Integration: Allow ingesting PDFs directly from a GCS bucket instead of local files.
REST API: Wrap the RAG functionality in a Spring Boot (or similar) REST API for easier integration with front-end applications.
User Interface: Develop a simple web UI for uploading PDFs and asking questions.
Error Handling and Robustness: Enhance error handling, retry mechanisms, and logging for production readiness.
Advanced Chunking: Implement more sophisticated chunking strategies (e.g., using overlap, semantic chunking).
Multi-modal Input: Extend to handle other document types (images, presentations) using Gemini's multi-modal capabilities.
Deployment: Containerize the application (Docker) and deploy to platforms like Cloud Run or GKE.
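As one possible starting point for the Advanced Chunking item above, the sketch below splits text into fixed-size character windows that overlap, so a sentence cut at a chunk boundary still appears whole in a neighboring chunk. The class name OverlapChunkerSketch and the character-based sizing are assumptions for illustration; this is not the existing ChunkingUtil.java.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of overlap chunking; sizes are measured in characters.
final class OverlapChunkerSketch {

    /**
     * Splits text into chunks of at most chunkSize characters, where each chunk
     * starts (chunkSize - overlap) characters after the previous one.
     */
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        if (overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("overlap must be in [0, chunkSize)");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // The last window reached the end; avoid a redundant tail chunk.
            }
        }
        return chunks;
    }
}
```

For example, chunk("abcdefghij", 4, 2) yields "abcd", "cdef", "efgh", "ghij". Splitting on sentence or paragraph boundaries, or by token count instead of characters, would be natural refinements.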
This project is licensed under the MIT License.
For any questions or collaborations, feel free to reach out:
Email: swasbits@gmail.com