Merged
2 changes: 1 addition & 1 deletion Dockerfile
@@ -14,7 +14,7 @@ COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Create the notion_pages and chroma_db directories inside the container
-RUN mkdir -p /app/notion_pages /app/chroma_db
+RUN mkdir -p /app/notion_pages /app/faiss

# Copy the notion2vector directory into the container
COPY notion2vector/ .
26 changes: 13 additions & 13 deletions README.md
@@ -1,4 +1,4 @@
-# Notion to Vector Database Loader
+# Notion to Vector Store Loader

## Description

@@ -15,7 +15,7 @@ This project is currently under development and not suitable for production use.
- **Markdown Export:** Convert Notion pages to Markdown files.
- **Text Splitting:** Break down documents into manageable sections.
- **Embedding Generation:** Generate embeddings for textual content.
-- **Vector Database Management:** Process and save documents in the vector database (Chroma).
+- **Vector Store Management:** Process and save documents in the vector store (FAISS).
- **API Endpoint:** Trigger ingestion through the `/ingest` API endpoint.

## Installation
@@ -78,7 +78,7 @@ curl -X POST http://localhost:4000/ingest

## Using with Other Applications

-To use the vector database with other applications, you need to create a Docker persistent volume for the `chroma_db` directory. This ensures that the data remains available between container restarts and can be shared with other containers.
+To use the vector store with other applications, you need to create a Docker persistent volume for the `faiss` directory. This ensures that the data remains available between container restarts and can be shared with other containers.

Here's how to set it up:

@@ -87,43 +87,43 @@ Here's how to set it up:
Create a volume using Docker:

```bash
-docker volume create --name=chroma_db_volume
+docker volume create --name=faiss_volume
```

### Step 2: Run the Container with Volume Mounted

-When running the container, you need to mount the volume to the `chroma_db` directory inside the container:
+When running the container, you need to mount the volume to the `faiss` directory inside the container:

```bash
-docker run -p 4000:80 --env-file .env -v chroma_db_volume:/app/chroma_db notion2vector
+docker run -p 4000:80 --env-file .env -v faiss_volume:/app/faiss notion2vector
```

### Step 3: Mount the Volume with Other Containers

-You can now mount this volume in other containers that need to query the Chroma DB. Simply use the same volume name and mount it to the appropriate path within the other container:
+You can now mount this volume in other containers that need to query the FAISS index. Simply use the same volume name and mount it to the appropriate path within the other container:

```bash
-docker run -v chroma_db_volume:/path/in/other/container other-image-name
+docker run -v faiss_volume:/path/in/other/container other-image-name
```

-Replace `/path/in/other/container` with the appropriate path inside the other container where you want the Chroma DB data to be accessible.
+Replace `/path/in/other/container` with the appropriate path inside the other container where you want the FAISS data to be accessible.

### Note

-Make sure the application is configured correctly to use the `chroma_db` directory for storing the vector data, and the directory permissions are set appropriately for the container user.
+Make sure the application is configured correctly to use the `faiss` directory for storing the vector data, and the directory permissions are set appropriately for the container user.
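
The permission requirement in the note above can be checked at container startup. The following is a minimal sketch, assuming the volume is mounted at `/app/faiss` as in the `docker run` command; the helper name `check_store_dir` is hypothetical and not part of this project.

```python
import os


def check_store_dir(path):
    """Return a list of human-readable problems with the store directory."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
        return problems
    # os.access tests permissions for the user the process runs as,
    # which inside the container is the container user.
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"{path} is not readable")
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable")
    return problems


if __name__ == "__main__":
    # Assumed mount point, taken from the docker run command above.
    for problem in check_store_dir("/app/faiss"):
        print("WARNING:", problem)
```

An empty list means the directory exists and is readable and writable by the current user. The same check could be run in any other container that mounts the volume, with the path adjusted to that container's mount point.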

## To Do

-- Pagges content clean up.
+- Pages content clean up.
- Process only updated pages.
- Remove Langchain dependency.
- Add more settings for MD spliting.
-- Expose Chroma DB API.
+- More vector stores

## Support & Contribution

Feel free to open issues or submit pull requests if you find any bugs or have suggestions for improvements.

## License

-[MIT]
+MIT
7 changes: 4 additions & 3 deletions notion2vector/main.py
@@ -10,7 +10,8 @@
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter
-from langchain.vectorstores import Chroma
+from langchain.vectorstores import FAISS
+

def load_env_variables():

@@ -71,7 +72,7 @@ def export_pages_to_markdown(page_ids):

def process_documents_and_save_to_db():
     loader = NotionDirectoryLoader("notion_pages")
-    persist_directory = 'chroma_db'
+    persist_directory = 'faiss'

     docs = loader.load()

@@ -94,7 +95,7 @@ def process_documents_and_save_to_db():

     embeddings = OpenAIEmbeddings()

-    vectorstore = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
+    vectorstore = FAISS(persist_directory=persist_directory, embedding_function=embeddings)

     try:
         # Fetch all document IDs from the collection
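
LangChain's `FAISS` wrapper is persisted differently from `Chroma`. To my knowledge its constructor does not accept a `persist_directory` argument; an index is usually built with `FAISS.from_documents`, written to disk with `save_local`, and read back with `FAISS.load_local`. The following is a minimal sketch under those assumptions. The helper names (`save_documents`, `load_store`, `index_exists`) are hypothetical and not part of this PR, and newer LangChain releases may additionally require `allow_dangerous_deserialization=True` when loading.

```python
import os


def index_exists(folder="faiss"):
    """True if a saved index is present in folder."""
    # save_local writes index.faiss and index.pkl with the default index name.
    return os.path.isfile(os.path.join(folder, "index.faiss"))


def save_documents(docs, embeddings, folder="faiss"):
    """Embed docs, build a FAISS index, and write it to folder."""
    # Imported lazily so this module loads without LangChain installed.
    from langchain.vectorstores import FAISS

    # from_documents expects a non-empty list of documents.
    vectorstore = FAISS.from_documents(docs, embeddings)
    vectorstore.save_local(folder)
    return vectorstore


def load_store(embeddings, folder="faiss"):
    """Load a previously saved index, or return None if none exists."""
    from langchain.vectorstores import FAISS

    if not index_exists(folder):
        return None
    return FAISS.load_local(folder, embeddings)
```

In `process_documents_and_save_to_db`, the split documents and the `OpenAIEmbeddings` instance would be passed to `save_documents`, with `folder` set to the mounted `faiss` directory so the index lands on the shared volume.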
3 changes: 2 additions & 1 deletion requirements.txt
@@ -14,6 +14,7 @@ coloredlogs==15.0.1
crashtest==0.4.1
dataclasses-json==0.5.14
fastapi==0.99.1
+faiss-cpu==1.7.4
flatbuffers==23.5.26
frozenlist==1.4.0
greenlet==2.0.2
@@ -67,4 +68,4 @@ uvicorn==0.23.2
uvloop==0.17.0
watchfiles==0.19.0
websockets==11.0.3
-yarl==1.9.2
+yarl==1.9.2