
Commit

Merge pull request #12 from Nelsonlin0321/dev/v2
Updated readme
Nelsonlin0321 committed Mar 16, 2024
2 parents f0b7644 + 084ea34 commit 8c1a714
Showing 3 changed files with 34 additions and 2 deletions.
36 changes: 34 additions & 2 deletions README.md
@@ -1,4 +1,36 @@
# FastAPI PDF Retrieval Augmented Generation (RAG)
# FastAPI PDF Retrieval Augmented Generation (RAG) APIs

## Introduction
This FastAPI server acts as the main backend API for the full-stack Chat PDF AI Assistant.

## Highlights:
- A custom "Read & Chunk" component preprocesses PDF files, splitting pages into sentences and then merging them into chunks for embedding and indexing.
- MongoDB performs vector similarity and keyword search operations on the ingested vectors.
- The Jina framework is used to embed text data and to rerank retrieved results.
- The "api/hybrid_search" route handles hybrid search queries, combining traditional keyword search (BM25) with vector similarity search, followed by Jina AI reranking (sketched below).

## Preview

[`www.chat-pdf-ai.com`](https://www.chat-pdf-ai.com)


## Overall Architecture

Frontend GitHub: https://github.com/Nelsonlin0321/chat-pdf-ai-assistant

<img src="images/architecture.png">

## Details about PDF ingestion

- The PDF file is first split into individual pages using PyPDF.
- Each page is then processed with the TextBlob library to split its content into sentences.
- The sentences from each page are merged into larger chunks with overlapping text between consecutive chunks. This overlap helps maintain context during search and retrieval.
- The chunked text is then passed through an embedding pipeline using the Jina Embedding API, which converts the text into high-dimensional vector representations.
- The resulting vector embeddings, along with their corresponding text chunks, are ingested into a vector database, in this case MongoDB.
- The original PDF file is also uploaded to S3 storage so the frontend can display it.
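
As a companion to the hybrid-search sketch above, the following hypothetical snippet walks through these ingestion steps (split pages with pypdf, sentence-split with TextBlob, merge into overlapping chunks, embed, store in MongoDB, upload to S3). It reuses the `embed()` helper and `collection` object defined in the earlier sketch; the chunk size, overlap, and S3 bucket name are illustrative assumptions, not the repository's actual values.

```python
# Hypothetical ingestion sketch (not this repo's code); reuses embed() and
# `collection` from the hybrid-search sketch, and the bucket name is made up.
import boto3
from pypdf import PdfReader
from textblob import TextBlob


def chunk_sentences(sentences: list[str], size: int = 5, overlap: int = 1) -> list[str]:
    """Merge consecutive sentences into chunks that share `overlap` sentences."""
    chunks, step = [], size - overlap
    for start in range(0, len(sentences), step):
        chunk = " ".join(sentences[start:start + size])
        if chunk:
            chunks.append(chunk)
    return chunks


def ingest_pdf(path: str, bucket: str = "chat-pdf-ai-uploads") -> None:
    # Split the PDF into pages, then each page into sentences with TextBlob.
    sentences: list[str] = []
    for page in PdfReader(path).pages:
        text = page.extract_text() or ""
        sentences.extend(str(s) for s in TextBlob(text).sentences)

    # Merge sentences into overlapping chunks, embed each chunk, store in MongoDB.
    documents = [
        {"source": path, "text": chunk, "embedding": embed(chunk)}
        for chunk in chunk_sentences(sentences)
    ]
    if documents:
        collection.insert_many(documents)

    # Upload the original PDF to S3 so the frontend can render it.
    boto3.client("s3").upload_file(path, bucket, path.split("/")[-1])
```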


<img src="images/ingest_file.png">

## Run Backend API Locally

@@ -23,7 +55,7 @@ image_name=rag-backend-api
docker build -t ${image_name}:latest -f ./Dockerfile .
docker run --env-file docker.env -p 8000:8000 -it --rm --name ${image_name} ${image_name}:latest
```
## API Description and RAG Pipeline
## API Descriptions

<img src="images/openAPI.png">

Binary file added images/architecture.png
Binary file added images/ingest_file.png
