Build Your First RAG System From Scratch

Forget expensive APIs and proprietary databases. We’re building this entirely with free, open-source tools that run on your own machine. Here is what we will be using:

1. transformers (Hugging Face): To get our powerful, free LLM.

2. sentence-transformers: The easiest way to get a top-tier embedding model.

3. faiss-cpu: Facebook AI’s blazing-fast, free vector search library. It’s our vector store.

4. langchain: We’ll only use its text splitter (imported from the langchain_text_splitters package), a shortcut that saves us hours of regex pain.
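
Before running any cell, it can help to confirm that all four packages are importable. The snippet below is a minimal sketch, not part of the original notebook: the helper name `missing_packages` and the `REQUIRED` mapping are made up for illustration, and the pip names on the left are assumptions about how each package was installed.

```python
import importlib.util

# Maps an assumed pip package name to the module name it installs.
# Note that faiss-cpu is imported as "faiss".
REQUIRED = {
    "transformers": "transformers",
    "sentence-transformers": "sentence_transformers",
    "faiss-cpu": "faiss",
    "langchain-text-splitters": "langchain_text_splitters",
}

def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]

print("Missing:", missing_packages() or "none")
```

Anything listed as missing can then be installed with pip before moving on to the chunking step.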

Chunking

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load our document (explicit encoding so the read behaves the same on every OS)
with open("my_knowledge.txt", encoding="utf-8") as f:
    knowledge_text = f.read()

# 1. Initialize the Text Splitter
# This splitter is smart. It tries to split on paragraphs ("\n\n"),
# then newlines ("\n"), then spaces (" "), to keep semantically
# related text together as much as possible.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,      # Max characters per chunk (measured by length_function)
    chunk_overlap=20,    # Characters shared between neighboring chunks to preserve context
    length_function=len,
)

# 2. Create the chunks
chunks = text_splitter.split_text(knowledge_text)

print(f"We have {len(chunks)} chunks:")
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")

We have 0 chunks:
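
To make the splitting strategy described in the cell comments concrete, here is a simplified pure-Python sketch of the same idea: split on the coarsest separator first and only fall back to finer ones for pieces that are still too long. This is not LangChain's implementation. The function name `recursive_split` is hypothetical, and the sketch ignores `chunk_overlap` entirely. It also shows why the cell above reports 0 chunks: an empty input string produces an empty list.

```python
def recursive_split(text, chunk_size=150, separators=("\n\n", "\n", " ")):
    """Simplified sketch of recursive splitting (no overlap handling)."""
    text = text.strip()
    if not text:
        return []                      # empty input -> zero chunks
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue                   # skip empties from repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate        # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split(""))
print(recursive_split("aaaa bbbb\n\ncccc dddd", chunk_size=9))
```

The first call prints an empty list and the second keeps each paragraph as its own chunk, since each one fits within the 9-character limit.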


Embeddings

In [15]:
with open("my_knowledge.txt", encoding="utf-8") as f:
    data = f.read()

print("Text length:", len(data))

Text length: 0


In [16]:
import os

print("Current Folder:", os.getcwd())
print("Files in folder:", os.listdir())

Current Folder: c:\Coding\Desktop\Projects\projects\RAG From Scratch
Files in folder: ['my_knowledge.txt', 'rag.ipynb']


In [17]:
import os

file_path = "my_knowledge.txt"

print("File size:", os.path.getsize(file_path), "bytes")

File size: 0 bytes


In [18]:
import os

file_path = os.path.abspath("my_knowledge.txt")

print("Absolute Path:", file_path)
print("Exists:", os.path.exists(file_path))
print("File Size:", os.path.getsize(file_path), "bytes")

with open(file_path, "r", encoding="utf-8") as f:
    content = f.read()

print("Length:", len(content))
print("Raw Content:", repr(content))

Absolute Path: c:\Coding\Desktop\Projects\projects\RAG From Scratch\my_knowledge.txt
Exists: True
File Size: 0 bytes
Length: 0
Raw Content: ''
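
The checks above all agree: my_knowledge.txt exists in the working folder but contains 0 bytes, so the splitter has nothing to chunk. One way to catch this earlier is to load the file through a small helper that fails loudly on a missing or empty file. The sketch below is an illustration only; the name `load_knowledge` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def load_knowledge(path):
    """Read a UTF-8 text file and raise if it is missing or has no content."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"No such file: {p.resolve()}")
    text = p.read_text(encoding="utf-8")
    if not text.strip():
        # Covers both a 0-byte file and a file holding only whitespace.
        raise ValueError(f"{p.resolve()} is empty; add some text before chunking.")
    return text

# Usage in the notebook (raises ValueError while the file is still empty):
# knowledge_text = load_knowledge("my_knowledge.txt")
```

Once the file actually contains text and is saved, the chunking cell should report a non-zero number of chunks.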
