### Data Ingestion

In [9]:
### Document data structure
from langchain_core.documents import Document

In [10]:
doc = Document(
    page_content="This is the content of the document.",
    metadata={
        "source": "example.txt",
        "pages": 1,
        "author": "Ilham",
        "date_created": "2025-10-15"
    }
)
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'Ilham', 'date_created': '2025-10-15'}, page_content='This is the content of the document.')
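
The `metadata` dictionary is what later pipeline stages usually rely on to filter documents or cite their origin. As a rough, dependency-free illustration of that idea, the sketch below uses a hypothetical `SimpleDoc` dataclass and a hypothetical `filter_by_metadata` helper. Both are stand-ins invented for this example, not LangChain's `Document` class or any LangChain API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in with the same two fields used in the cell above."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDoc], **criteria: Any) -> List[SimpleDoc]:
    """Keep only the docs whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    SimpleDoc("This is the content of the document.",
              {"source": "example.txt", "author": "Ilham"}),
    SimpleDoc("Another document.",
              {"source": "other.txt", "author": "Someone"}),
]

# Select documents by author and list where they came from.
matches = filter_by_metadata(docs, author="Ilham")
print([d.metadata["source"] for d in matches])
```

The same comparison would work on real `Document` objects, since they expose `page_content` and `metadata` attributes as shown in the cell above.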

In [11]:
## Create simple text file
import os
os.makedirs("E:/RAG/data/text_files", exist_ok=True)

In [12]:
sample_text = {
    "E:/RAG/data/text_files/langchain_intro.txt": """
   RAG stands for Retrieval-Augmented Generation. It is an AI technique that combines retrieval-based methods with generative models to produce more accurate and context-aware outputs.

1. Problem it solves: Generative AI models can produce fluent text but often hallucinate or give wrong answers if the information is not in their training data. RAG fixes this by retrieving relevant information first and then generating answers based on it.

2. How it works:

   Retriever: Searches a database of documents or knowledge to find relevant information. Can use sparse methods (TF-IDF, BM25) or dense embeddings (FAISS, Pinecone).
   Generator: A language model (like GPT) that takes the retrieved documents as context and generates the final answer.

3. Types of RAG:

   RAG-Sequence: Generates an answer for each retrieved document and then combines them.
   RAG-Token: Generates an answer token by token, attending over all retrieved documents.

4. Benefits:

   Reduces hallucinations.
   Can work with up-to-date or private knowledge.
   Flexible, works with any retrieval source.

5. Example use case:
   If you have a collection of Sri Lankan tea export reports and ask “What was the total tea export in 2024?”, RAG will:

   1. Retrieve relevant report(s) containing 2024 export data.
   2. Feed it to the language model.
   3. Generate an accurate answer based on the retrieved information.

     """

}

# Write each sample string to its target path as UTF-8
for file_path, content in sample_text.items():
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)

print("Sample text files created.")

Sample text files created.
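
The cells above hard-code a Windows path (`E:/RAG/...`), which only works on a machine with that drive. A minimal sketch of a more portable variant, assuming a placeholder `BASE_DIR` (here a temporary directory so the example runs anywhere) and a shortened sample string rather than the full text above:

```python
import tempfile
from pathlib import Path

# Assumption: BASE_DIR is a placeholder. In the notebook it would point at
# the project's own data folder instead of a temporary directory.
BASE_DIR = Path(tempfile.mkdtemp())
text_dir = BASE_DIR / "data" / "text_files"
text_dir.mkdir(parents=True, exist_ok=True)

# Shortened sample content, keyed by file name only.
sample_texts = {
    "langchain_intro.txt": "RAG stands for Retrieval-Augmented Generation.",
}

for name, content in sample_texts.items():
    (text_dir / name).write_text(content, encoding="utf-8")

written = sorted(p.name for p in text_dir.glob("*.txt"))
print(written)
```

Keeping only file names in the dictionary and joining them to one base directory means the location changes in a single place.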


In [13]:
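### TextLoader
# Older code imported this from langchain.document_loaders; the loader now lives in langchain_community
from langchain_community.document_loaders import TextLoader

loader = TextLoader("E:/RAG/data/text_files/langchain_intro.txt", encoding="utf-8")
document = loader.load()  # load() returns a list of Document objects (one per file for TextLoader)
print(document)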

[Document(metadata={'source': 'E:/RAG/data/text_files/langchain_intro.txt'}, page_content='\n   RAG stands for Retrieval-Augmented Generation. It is an AI technique that combines retrieval-based methods with generative models to produce more accurate and context-aware outputs.\n\n1. Problem it solves: Generative AI models can produce fluent text but often hallucinate or give wrong answers if the information is not in their training data. RAG fixes this by retrieving relevant information first and then generating answers based on it.\n\n2. How it works:\n\n   Retriever: Searches a database of documents or knowledge to find relevant information. Can use sparse methods (TF-IDF, BM25) or dense embeddings (FAISS, Pinecone).\n   Generator: A language model (like GPT) that takes the retrieved documents as context and generates the final answer.\n\n3. Types of RAG:\n\n   RAG-Sequence: Generates an answer for each retrieved document and then combines them.\n   RAG-Token: Generates an answer tok
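
The printed result is a one-element list whose `metadata` holds the file path under `source` and whose `page_content` is the raw file text, leading whitespace included. Printing the whole list is hard to read, so a short preview is often more useful. The sketch below imitates that shape with plain dictionaries; `load_text` is a hypothetical stand-in written for this example, not `TextLoader`'s implementation.

```python
import tempfile
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Hypothetical stand-in: read one file into a single-record list."""
    text = Path(path).read_text(encoding=encoding)
    return [{"page_content": text, "metadata": {"source": str(path)}}]


# Write a small temporary file so the sketch runs on any machine.
tmp_dir = Path(tempfile.mkdtemp())
tmp_file = tmp_dir / "langchain_intro.txt"
tmp_file.write_text(
    "\n   RAG stands for Retrieval-Augmented Generation.\n", encoding="utf-8"
)

records = load_text(tmp_file)
first = records[0]

# Strip surrounding whitespace and show only the first 40 characters.
preview = first["page_content"].strip()[:40]
print(len(records), first["metadata"]["source"], preview)
```

With the real loader output, the equivalent preview would read `document[0].page_content` and `document[0].metadata` instead of dictionary keys.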

In [14]:
### DirectoryLoader
from langchain_community.document_loaders import DirectoryLoader

## Load all .txt files from the specified directory
dir_loader = DirectoryLoader(
    "E:/RAG/data/text_files",
    glob="*.txt",  # Pattern to match files
    loader_cls=TextLoader,  # Loader class applied to each matched file
    loader_kwargs={"encoding": "utf-8"},  # Keyword arguments passed to the loader class
    show_progress=False,  # Set to True to show a progress bar
)

documents = dir_loader.load()
documents

[Document(metadata={'source': 'E:\\RAG\\data\\text_files\\langchain_intro.txt'}, page_content='\n   RAG stands for Retrieval-Augmented Generation. It is an AI technique that combines retrieval-based methods with generative models to produce more accurate and context-aware outputs.\n\n1. Problem it solves: Generative AI models can produce fluent text but often hallucinate or give wrong answers if the information is not in their training data. RAG fixes this by retrieving relevant information first and then generating answers based on it.\n\n2. How it works:\n\n   Retriever: Searches a database of documents or knowledge to find relevant information. Can use sparse methods (TF-IDF, BM25) or dense embeddings (FAISS, Pinecone).\n   Generator: A language model (like GPT) that takes the retrieved documents as context and generates the final answer.\n\n3. Types of RAG:\n\n   RAG-Sequence: Generates an answer for each retrieved document and then combines them.\n   RAG-Token: Generates an answer
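
The `glob` argument decides which files are picked up. The text cell uses `*.txt`, while a pattern beginning with `**/` also descends into subfolders. The sketch below shows the difference using `pathlib` directly, on the assumption that the loader's pattern follows the same glob semantics. The folder layout and file names are made up for the example.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with one top-level and one nested .txt file.
root = Path(tempfile.mkdtemp())
(root / "nested").mkdir()
(root / "top.txt").write_text("top-level file", encoding="utf-8")
(root / "nested" / "deep.txt").write_text("nested file", encoding="utf-8")
(root / "notes.md").write_text("not a .txt file", encoding="utf-8")

# "*.txt" matches only files directly inside root.
flat = sorted(p.relative_to(root).as_posix() for p in root.glob("*.txt"))

# "**/*.txt" matches files in root and in every subfolder.
recursive = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(flat)
print(recursive)
```

Checking the pattern this way before loading can catch the case where a loader silently returns fewer files than expected.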

In [15]:
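# Load PDF files
# PyPDFLoader is imported as an alternative; this cell uses PyMuPDFLoader
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

# Load all PDF files from the directory, including subdirectories
dir_loader = DirectoryLoader(
    "E:/RAG/data/pdf",
    glob="**/*.pdf",  # Pattern to match files; "**/" also matches nested folders
    loader_cls=PyMuPDFLoader,  # Loader class applied to each matched PDF
    show_progress=False,  # Set to True to show a progress bar
)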

pdf_documents = dir_loader.load()
pdf_documents

[Document(metadata={'producer': 'MiKTeX pdfTeX-1.40.23', 'creator': 'LaTeX with hyperref', 'creationdate': '2022-03-04T18:53:07+05:30', 'source': 'E:\\RAG\\data\\pdf\\HBook_FAS-2019-2020.pdf', 'file_path': 'E:\\RAG\\data\\pdf\\HBook_FAS-2019-2020.pdf', 'total_pages': 252, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2022-03-04T18:53:07+05:30', 'trapped': '', 'modDate': "D:20220304185307+05'30'", 'creationDate': "D:20220304185307+05'30'", 'page': 0}, page_content='UNIVERSITY OF VAVUNIYA, SRI LANKA\nFACULTY OF APPLIED SCIENCE\nHANDBOOK\n2019/2020'),
 Document(metadata={'producer': 'MiKTeX pdfTeX-1.40.23', 'creator': 'LaTeX with hyperref', 'creationdate': '2022-03-04T18:53:07+05:30', 'source': 'E:\\RAG\\data\\pdf\\HBook_FAS-2019-2020.pdf', 'file_path': 'E:\\RAG\\data\\pdf\\HBook_FAS-2019-2020.pdf', 'total_pages': 252, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2022-03-04T18:53:07+05:30', 'trapped
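
The PDF loader returns one `Document` per page, each carrying `source`, `page`, and `total_pages` in its metadata. A quick sanity check is to count the loaded pages per source and compare that against `total_pages`. The sketch below does this on hypothetical, shortened metadata records shaped like the output above. The file names and page counts are invented for illustration.

```python
from collections import Counter

# Hypothetical metadata records shaped like the loader output (values shortened).
page_metadata = [
    {"source": "a.pdf", "page": 0, "total_pages": 2},
    {"source": "a.pdf", "page": 1, "total_pages": 2},
    {"source": "b.pdf", "page": 0, "total_pages": 1},
]

# Number of page-level records actually loaded for each file.
pages_loaded = Counter(m["source"] for m in page_metadata)

# Page count each file reports about itself.
expected = {m["source"]: m["total_pages"] for m in page_metadata}

# Files where the loaded count differs from the reported count.
missing = {
    src: expected[src] - count
    for src, count in pages_loaded.items()
    if expected[src] != count
}

print(dict(pages_loaded), missing)
```

On the real result, the input list could be built as `[d.metadata for d in pdf_documents]`.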