<div style="background-color: #d54f2b;padding: 1em; color: white;">
<b>Part I</b>: Setup
</div>

### **Install requirements**

In [1]:
%pip install llama-index-readers-file pymupdf
%pip install llama-index-vector-stores-postgres
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-llama-cpp
%pip install llama-cpp-python

### **Import utils**

In [11]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from tqdm import tqdm
import os

### **Sentence transformers**

In [13]:
embed_model = HuggingFaceEmbedding(model_name="manu/bge-m3-custom-fr")

### **Initialize PostgreSQL**

- Here we create a directory and a Docker Compose file that define the database, its user, and its password

In [1]:
mkdir database

In [1]:
%%writefile database/docker-compose.yml
version: '3.8'
services:
  RAG_DB:
    image: ankane/pgvector
    container_name: rag_vector_db
    environment:
      POSTGRES_DB: rag_vector_db
      POSTGRES_USER: rag_user
      POSTGRES_PASSWORD: rag_password
    ports:
      - "5433:5432"
    volumes:
      - ./pgdata:/var/lib/postgresql/data

Overwriting database/docker-compose.yml


In [3]:
!cd database ; docker-compose up -d RAG_DB

[+] Running 1/1
 ✔ Container rag_vector_db  Started                                        0.4s

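- We use ```psycopg2``` to connect to the Postgres server and recreate the database. The ```pgvector``` extension itself comes with the ```ankane/pgvector``` image.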

In [5]:
import psycopg2

# DB Parameters
db_name = "rag_vector_db"
host = "localhost"
password = "rag_password"
port = "5433"
user = "rag_user"

# Connect and create db
conn = psycopg2.connect(
    dbname="postgres",
    host=host,
    password=password,
    port=port,
    user=user,
)
conn.autocommit = True
with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")
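
The connection above assumes the container is already accepting connections. Right after `docker-compose up -d` that may not be the case yet, and `psycopg2.connect` can then fail. A minimal sketch of a readiness check, assuming that a successful TCP connect is a good enough signal (the helper name `wait_for_port` and its parameters are hypothetical, not part of any library used in this notebook):

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only proves that something is listening on the port;
            # it does not prove that Postgres has finished initializing.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the parameters defined above, this could be called as `wait_for_port(host, int(port))` before opening the `psycopg2` connection (`port` is a string in this notebook, hence the `int(...)`).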

- We set up **PGVectorStore**, which provides support for writing and querying vector data in Postgres.

In [6]:
from llama_index.vector_stores.postgres import PGVectorStore

vector_store = PGVectorStore.from_params(
    database=db_name,
    host=host,
    password=password,
    port=port,
    user=user,
    table_name="rag_paper_fr",
    embed_dim=1024,  # must match the embedding model's output dimension (1024 here)
)
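
To give an intuition for what "querying vector data" means, here is a conceptual sketch in plain Python: rank stored vectors by cosine similarity to a query vector and keep the best `k`. This is only an illustration with toy 3-dimensional vectors. It is not how `PGVectorStore` or pgvector run the search (that happens inside Postgres), and the names `cosine_similarity`, `top_k`, and `toy_rows` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k (row_id, score) pairs with the highest cosine similarity."""
    scored = [(row_id, cosine_similarity(query, vec)) for row_id, vec in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy data: three stored "embeddings" and one query
toy_rows = [("a", [1.0, 0.0, 0.0]), ("b", [0.0, 1.0, 0.0]), ("c", [0.9, 0.1, 0.0])]
result = top_k([1.0, 0.0, 0.0], toy_rows, k=2)
print([row_id for row_id, _ in result])  # ['a', 'c']
```

The dimension check in `cosine_similarity` mirrors why `embed_dim` has to agree with the embedding model: vectors of different lengths cannot be compared.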

<div style="background-color: #d54f2b;padding: 1em; color: white;">
<b>Part II</b>: Build an Ingestion Pipeline from Scratch
</div>

1. Load Data

In [7]:
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader

# Utils
loader = PyMuPDFReader()
directory_path = Path("./documents")
pdf_files = directory_path.glob("*.pdf")

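# Load every PDF in the directory
documents = []
for file_path in pdf_files:
    loaded_docs = loader.load(file_path=str(file_path))
    documents.extend(loaded_docs)
    # Rebuild the name as "<stem>.pdf"; for a file already ending in ".pdf"
    # this is the same name, so the rename leaves it in place.
    treated_file_path = file_path.with_name(f"{file_path.stem}.pdf")
    file_path.rename(treated_file_path)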

print(f"Processed {len(documents)} documents.")

Processed 59 documents.
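
The count above is a number of `Document` objects, not necessarily a number of files. The metadata printed later in this notebook (`total_pages`, `source`) suggests the reader emits one document per PDF page. A small sketch to check this by counting documents per `file_path`, assuming each metadata dictionary has that key (the helper `pages_per_file` and the sample data are hypothetical):

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def pages_per_file(metadatas: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Count how many documents share each file_path value."""
    counts = Counter(str(meta.get("file_path", "<unknown>")) for meta in metadatas)
    return dict(counts)


# Hypothetical metadata in the same shape as the notebook's output
sample = [
    {"total_pages": 2, "file_path": "documents/a.pdf", "source": "1"},
    {"total_pages": 2, "file_path": "documents/a.pdf", "source": "2"},
    {"total_pages": 1, "file_path": "documents/b.pdf", "source": "1"},
]
summary = pages_per_file(sample)
print(summary)  # {'documents/a.pdf': 2, 'documents/b.pdf': 1}
```

In the notebook this could be called as `pages_per_file(doc.metadata for doc in documents)`.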


2. Create document chunks

In [8]:
from llama_index.core.node_parser import SentenceSplitter
text_parser = SentenceSplitter(
    chunk_size=1024,
)

text_chunks = []
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
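
The loop above keeps two parallel lists: the chunks themselves and, for each chunk, the index of the document it came from. The sketch below reproduces that bookkeeping with a deliberately simplified splitter that packs whole sentences up to a character budget. It is not the `SentenceSplitter` algorithm (which works with token counts); `split_into_chunks`, `chunk_documents`, and `max_chars` are hypothetical names used only for illustration.

```python
import re
from typing import List, Sequence, Tuple


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_documents(texts: Sequence[str], max_chars: int = 200) -> Tuple[List[str], List[int]]:
    """Return all chunks plus, for each chunk, the index of its source text."""
    all_chunks: List[str] = []
    doc_idxs: List[int] = []
    for doc_idx, text in enumerate(texts):
        chunks = split_into_chunks(text, max_chars)
        all_chunks.extend(chunks)
        doc_idxs.extend([doc_idx] * len(chunks))
    return all_chunks, doc_idxs


toy_texts = ["First sentence. Second sentence. Third sentence.", "Only one sentence here."]
toy_chunks, toy_idxs = chunk_documents(toy_texts, max_chars=35)
print(toy_idxs)  # [0, 0, 1]
```

Because `toy_idxs[i]` always refers to the source of `toy_chunks[i]`, the two lists must be extended together, exactly as in the cell above.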

- Let's link each chunk to the metadata of its source document (Node Chunk)

In [9]:
from llama_index.core.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)
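
One detail worth keeping in mind: assigning `src_doc.metadata` directly means every chunk from the same document may share one dictionary object, assuming the assignment stores it by reference as ordinary Python attribute assignment does. The sketch below uses a hypothetical `ToyNode` dataclass (not the real `TextNode`) to show the difference between sharing and copying.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToyNode:
    """Stand-in for a node; only the fields needed for this illustration."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


source_metadata = {"file_path": "documents/example.pdf", "source": "1"}

shared = ToyNode(text="chunk A")
shared.metadata = source_metadata  # same dict object as the source

independent = ToyNode(text="chunk B")
independent.metadata = copy.deepcopy(source_metadata)  # private copy

# Adding a per-chunk key through the shared node also changes the source dict
shared.metadata["chunk_index"] = 0
print("chunk_index" in source_metadata)       # True
print("chunk_index" in independent.metadata)  # False
```

This makes no difference as long as the metadata is only read. If per-chunk keys are added later, copying the dictionary first avoids surprising cross-talk between nodes.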

- Generate embeddings for each Node

In [14]:
for node in tqdm(nodes, ncols=100, desc="Generating embedding: "):
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

Generating embedding: 100%|█████████████████████████████████████████| 59/59 [00:19<00:00,  3.05it/s]


In [15]:
nodes[0].dict().keys()

dict_keys(['id_', 'embedding', 'metadata', 'excluded_embed_metadata_keys', 'excluded_llm_metadata_keys', 'relationships', 'text', 'mimetype', 'start_char_idx', 'end_char_idx', 'text_template', 'metadata_template', 'metadata_seperator', 'class_name'])

In [16]:
nodes[0].dict()['id_']

'1721594f-2065-4c34-8668-43c89b59fefa'

In [17]:
nodes[0].dict()['metadata']

{'total_pages': 6,
 'file_path': 'documents/UM6P-Disciplinary_Council.pdf',
 'source': '1'}

In [18]:
nodes[0].dict()['embedding']

[-0.030031858012080193,
 -0.038829609751701355,
 -0.0006665196851827204,
 -0.06863262504339218,
 -0.033711548894643784,
 0.020719563588500023,
 0.014222200959920883,
 -0.0631137564778328,
 -0.029535768553614616,
 -0.0323270708322525,
 0.002012217417359352,
 -0.012179107405245304,
 -0.011910955421626568,
 0.005191161297261715,
 0.038605254143476486,
 -0.035155296325683594,
 -0.010624169372022152,
 0.0028558941558003426,
 0.035635657608509064,
 -0.02485690824687481,
 0.02031904086470604,
 -0.023701844736933708,
 0.03805794566869736,
 0.011104177683591843,
 0.009796789847314358,
 0.008566536009311676,
 -0.007071482948958874,
 -0.02015155367553234,
 -0.016855310648679733,
 -0.00047774414997547865,
 -0.007029697299003601,
 0.0043077003210783005,
 0.06044435501098633,
 -0.01899896189570427,
 -0.0012316714273765683,
 -0.05496344342827797,
 0.0025561547372490168,
 -0.033543191850185394,
 -0.07758679240942001,
 -0.0517333559691906,
 -0.01633394882082939,
 -0.0047893282026052475,
 0.023800553753

In [19]:
len(nodes[0].dict()['embedding'])

1024
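
The check above looks at the first node only. Since the table was created with a fixed `embed_dim`, it can be worth verifying every embedding before inserting. A minimal sketch, assuming embeddings are plain lists of floats or `None` when missing (the helper `find_bad_embeddings` is hypothetical):

```python
from typing import List, Optional, Sequence


def find_bad_embeddings(
    embeddings: Sequence[Optional[Sequence[float]]],
    expected_dim: int,
) -> List[int]:
    """Return the positions whose embedding is missing or has the wrong length."""
    return [
        i
        for i, emb in enumerate(embeddings)
        if emb is None or len(emb) != expected_dim
    ]


# Toy data: one valid vector, one missing, one too short
toy_embeddings = [[0.1, 0.2, 0.3], None, [0.1, 0.2]]
bad = find_bad_embeddings(toy_embeddings, expected_dim=3)
print(bad)  # [1, 2]
```

In the notebook this could be called as `find_bad_embeddings([n.embedding for n in nodes], expected_dim=1024)`; an empty list means every node is ready to be stored.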

- Let's store the embedding vectors in the PostgreSQL DB

In [20]:
vector_store.add(nodes)

['1721594f-2065-4c34-8668-43c89b59fefa',
 'c54afe2e-f75a-4026-b5ca-c8697a439f92',
 'abd61da3-7c14-4dc2-b27b-73d834ab550c',
 'eb36bff8-8488-4c93-b06d-b6b6e8cb24bc',
 '2c553b84-2779-486e-bf49-2a4118eb1c2b',
 '30a9ce03-b530-436b-a62f-fd396ebc7f61',
 '0d63541a-69ae-4918-88cc-b66b0da09d0a',
 '9a770d7f-d012-4697-9a70-d8769f848a9b',
 '908e61aa-afee-49ef-a97c-7d26b0db8c1c',
 'd4ac1908-e1b1-4d12-8dec-851e6bcea6da',
 'e1be257b-f945-4491-b445-d80a40728703',
 '0d538711-cf99-46b1-af72-1aff9f98474b',
 'c4e5c5b5-9134-4c25-a84f-76c89ab4aff9',
 'f07cfa67-91e3-40d3-9969-2018c8701bda',
 'ebb04b55-304b-4713-a34a-d752b2978b55',
 'dd6945a5-e779-4836-8105-f46d94755bb0',
 'b5075969-2528-42d6-bdcb-fdb7e634c5a8',
 '50e88c5c-ea4b-4171-930d-2305a57ee367',
 '4624470f-ec42-4b97-8d86-9b15bf6dfdf0',
 '2ed3a543-8453-4f61-affd-07b5c22ef8c3',
 '0b4d962e-2ef8-4908-9a96-03e3a8f0bd8f',
 'c789476b-c202-4bdb-bd06-149595eff79f',
 '263e3589-f322-4fdb-8af0-e651cc85649b',
 '337fbebf-91c9-4365-8090-9ec13f67bee4',
 '27e8199b-39f5-