# Notebook 1: Data Ingestion and Preprocessing for RAG

This notebook loads PDF documents, extracts their text, and splits that text into smaller chunks so it is ready for embedding.

## 1. Import the Required Libraries

In [1]:
import os

from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Load variables from a .env file, if one is found, into os.environ.
load_dotenv()

True

In [2]:
# Print only whether the key is set, never the key itself.
openai_api_key = os.getenv("OPENAI_API_KEY")
print(f"OpenAI API Key: {bool(openai_api_key)}")

OpenAI API Key: True
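
The check above prints only a boolean, so the key itself never appears in the notebook output. The same idea can be extended to several variables at once. The following is a minimal sketch, assuming a hypothetical helper `missing_env_vars` that is not part of the original notebook:

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or blank, without exposing any values."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


# Hypothetical usage with an explicit mapping instead of the real environment.
missing = missing_env_vars(["OPENAI_API_KEY"], env={"OPENAI_API_KEY": "sk-example"})
print(f"Missing variables: {missing}")
```

Calling `missing_env_vars(["OPENAI_API_KEY"])` without the `env` argument checks the real process environment, which is what the cell above does for a single key.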


## 2. Set the Data Path and Find All PDF Files


In [3]:
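# Assumes the notebook runs from a subdirectory of the project (e.g. notebooks/),
# so the project root is the parent of the current working directory.
current_dir = os.getcwd()
project_root = os.path.dirname(current_dir)
data_dir = os.path.join(project_root, "data")

# Collect every PDF in the data directory (extension matched case-insensitively).
pdf_file_names = [file for file in os.listdir(data_dir) if file.lower().endswith(".pdf")]
pdf_file_paths = [os.path.join(data_dir, file) for file in pdf_file_names]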
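
The same discovery step can also be written with `pathlib`. The sketch below is one possible variant, not the notebook's own method: the helper name `find_pdfs` is hypothetical, it assumes a flat data directory, and it returns the paths sorted by name so that repeated runs process files in the same order.

```python
import tempfile
from pathlib import Path
from typing import List


def find_pdfs(data_dir: Path) -> List[Path]:
    """Return the PDF files directly inside data_dir, sorted by name."""
    return sorted(
        path
        for path in data_dir.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Demonstration on a throwaway directory instead of the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("b.PDF", "a.pdf", "notes.txt"):
        (root / name).touch()
    found_names = [path.name for path in find_pdfs(root)]

print(found_names)
```

The `is_file()` check skips directories whose names happen to end in `.pdf`, which a plain string match on `os.listdir` output would include.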

## 3. Load All Discovered PDF Documents

In [4]:
all_loaded_documents = []

if pdf_file_paths:
    for pdf_path in pdf_file_paths:
        try:
            print(f"\nLoading document from: {pdf_path}...")
            # PyPDFLoader returns one Document per page.
            loader = PyPDFLoader(pdf_path)
            documents_from_single_pdf = loader.load()

            # Store only the file name as the source instead of the full path.
            for doc in documents_from_single_pdf:
                doc.metadata["source"] = os.path.basename(pdf_path)

            all_loaded_documents.extend(documents_from_single_pdf)
            print(f"Loaded {len(documents_from_single_pdf)} pages/sections from {os.path.basename(pdf_path)}.")
        except Exception as e:
            # Report the failure and continue with the remaining files.
            print(f"Error while loading {pdf_path}: {e}")

    if all_loaded_documents:
        print(f"\nLoaded {len(all_loaded_documents)} pages/sections in total from all PDF files.")
        print("\nSample content from the first page of the first loaded document:")
        print(all_loaded_documents[0].page_content[:500] + "...")
        print(f"Sample metadata: {all_loaded_documents[0].metadata}")
        print("-" * 30)
else:
    print("No PDF files to load.")


Loading document from: d:\Zulfi\CodeLabs\SeiZen\SeiZen-RAG\data\ojsadmin,+207.pdf...
Loaded 6 pages/sections from ojsadmin,+207.pdf.

Loading document from: d:\Zulfi\CodeLabs\SeiZen\SeiZen-RAG\data\Self-superviced Learning.pdf...
Loaded 23 pages/sections from Self-superviced Learning.pdf.

Loaded 29 pages/sections in total from all PDF files.

Sample content from the first page of the first loaded document:
Indian Journal of Public Health Research & Development, March 2020, Vol. 11, No. 03  1107
The Correlation Factors on 
Epilepsy Stigma amongst People in Indonesia
Saniya Ashilah Rabbani1, Joseph Ekowahono R2, Viskasari P. Kalanjati3
1Clinical Students at Faculty of Medicine, 2Lecturer at Department of Neurology, 3Lecturer at Department of 
Anatomy and Histology, General Hospital of dr. Soetomo, Faculty of Medicine, Universitas Airlangga, 
Surabaya, Indonesia
Abstract
Background: Epilepsy is a rec...
Sample metadata: {'producer': 'Adobe PDF Library 9.0', 'creator'

## 4. Extract Text from the Documents


In [5]:
if all_loaded_documents:
    print(f"\nTotal documents (pages/sections) loaded from all PDFs: {len(all_loaded_documents)}")
    # Show the metadata of the first three pages; slicing is safe on shorter lists.
    for i, doc in enumerate(all_loaded_documents[:3], start=1):
        print(f"\n--- Combined Document {i} ---")
        print(f"Metadata: {doc.metadata}")
else:
    print("No documents were loaded.")


Total documents (pages/sections) loaded from all PDFs: 29

--- Combined Document 1 ---
Metadata: {'producer': 'Adobe PDF Library 9.0', 'creator': 'Adobe InDesign CS4 (6.0)', 'creationdate': '2020-06-03T12:10:59+05:30', 'moddate': '2020-06-03T12:11:10+05:30', 'source': 'ojsadmin,+207.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}

--- Combined Document 2 ---
Metadata: {'producer': 'Adobe PDF Library 9.0', 'creator': 'Adobe InDesign CS4 (6.0)', 'creationdate': '2020-06-03T12:10:59+05:30', 'moddate': '2020-06-03T12:11:10+05:30', 'source': 'ojsadmin,+207.pdf', 'total_pages': 6, 'page': 1, 'page_label': '2'}

--- Combined Document 3 ---
Metadata: {'producer': 'Adobe PDF Library 9.0', 'creator': 'Adobe InDesign CS4 (6.0)', 'creationdate': '2020-06-03T12:10:59+05:30', 'moddate': '2020-06-03T12:11:10+05:30', 'source': 'ojsadmin,+207.pdf', 'total_pages': 6, 'page': 2, 'page_label': '3'}
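
Each loaded page carries a `source` entry in its metadata, so the per-file page counts reported during loading can be recomputed from the documents themselves. The following is a minimal sketch under that assumption. The helper name `pages_per_source` is hypothetical, and the `SimpleNamespace` objects are stand-ins for LangChain `Document` instances so the example runs without any PDF files:

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def pages_per_source(documents: Iterable[Any]) -> Dict[str, int]:
    """Count how many pages each source file contributed."""
    counts = Counter(doc.metadata.get("source", "<unknown>") for doc in documents)
    return dict(counts)


# Stand-in objects that expose a .metadata dict like a Document does.
sample_docs = [
    SimpleNamespace(metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "a.pdf", "page": 1}),
    SimpleNamespace(metadata={"source": "b.pdf", "page": 0}),
]

page_counts = pages_per_source(sample_docs)
print(page_counts)
```

In the notebook, `pages_per_source(all_loaded_documents)` would be the corresponding call.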


## 5. Split the Text into Chunks (Text Splitting)

In [6]:
all_chunks = []

if all_loaded_documents:
    # Chunks of at most 1000 characters, with up to 200 characters of overlap.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )

    for doc_idx, doc_content_obj in enumerate(all_loaded_documents):
        if not isinstance(doc_content_obj, Document):
            print(f"Warning: item {doc_idx} is not a LangChain Document but {type(doc_content_obj)}. Skipping.")
            continue

        chunks_from_doc = text_splitter.split_text(doc_content_obj.page_content)

        # Give every chunk its own copy of the page metadata.
        for chunk_text in chunks_from_doc:
            chunk_doc = Document(page_content=chunk_text, metadata=doc_content_obj.metadata.copy())
            all_chunks.append(chunk_doc)

    print(f"\nTotal chunks produced from all PDFs: {len(all_chunks)}")

    if all_chunks:
        print("\nFirst few chunks as examples (note the metadata):")
        for i, chunk in enumerate(all_chunks[:3], start=1):
            print(f"\n--- Chunk {i} ---")
            print(f"Metadata: {chunk.metadata}")
            print(f"Content: {chunk.page_content[:200]}...")
            print(f"Content length: {len(chunk.page_content)} characters")
else:
    print("No documents to split into chunks.")


Total chunks produced from all PDFs: 206

First few chunks as examples (note the metadata):

--- Chunk 1 ---
Metadata: {'producer': 'Adobe PDF Library 9.0', 'creator': 'Adobe InDesign CS4 (6.0)', 'creationdate': '2020-06-03T12:10:59+05:30', 'moddate': '2020-06-03T12:11:10+05:30', 'source': 'ojsadmin,+207.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}
Content: Indian Journal of Public Health Research & Development, March 2020, Vol. 11, No. 03  1107
The Correlation Factors on 
Epilepsy Stigma amongst People in Indonesia
Saniya Ashilah Rabbani1, Joseph Ekowah...
Content length: 998 characters

--- Chunk 2 ---
Metadata: {'producer': 'Adobe PDF Library 9.0', 'creator': 'Adobe InDesign CS4 (6.0)', 'creationdate': '2020-06-03T12:10:59+05:30', 'moddate': '2020-06-03T12:11:10+05:30', 'source': 'ojsadmin,+207.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}
Content: many wrong assumptions and views of some Indonesian people about epilepsy disease. The study was 
conducted 
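
With `chunk_size=1000` and `chunk_overlap=200`, consecutive chunks taken from the same page can share up to 200 characters. The sketch below illustrates that overlap with a plain fixed-size sliding window. It is a deliberate simplification and not the library's algorithm: `RecursiveCharacterTextSplitter` prefers to break at separators such as paragraph and line breaks, so its chunk boundaries will generally differ. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 2500 characters of synthetic text, split with the notebook's settings.
demo_text = "".join(str(i % 10) for i in range(2500))
demo_chunks = sliding_window_chunks(demo_text)
print([len(chunk) for chunk in demo_chunks])
```

The last 200 characters of one window equal the first 200 characters of the next, which is the property the overlap setting is meant to provide: a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.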

## 6. Save the Chunks for Use in Other Notebooks

In [7]:
import pickle

if all_chunks:
    # Persist the chunks so a later notebook can load them without re-parsing the PDFs.
    notebooks_dir = os.path.join(project_root, "notebooks", "chunk_files")
    os.makedirs(notebooks_dir, exist_ok=True)
    chunks_file_path = os.path.join(notebooks_dir, "processed_chunks_multi_pdf.pkl")
    try:
        with open(chunks_file_path, "wb") as f:
            pickle.dump(all_chunks, f)
        print(f"\nChunks saved to: {chunks_file_path}")
    except Exception as e:
        print(f"Error while saving chunks: {e}")


Chunks saved to: d:\Zulfi\CodeLabs\SeiZen\SeiZen-RAG\notebooks\chunk_files\processed_chunks_multi_pdf.pkl
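
Loading a pickle file can execute arbitrary code, so such files should only be read from a trusted source. Where portability matters, the chunks could instead be written as JSON Lines. The sketch below is one possible alternative, not what the notebook does: it assumes each chunk has been reduced to a plain dict with `page_content` and `metadata` keys, that the metadata is JSON-serializable, and the helper names and file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def save_chunks_jsonl(chunks: List[Dict[str, Any]], path: Path) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")


def load_chunks_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read the chunks back, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in chunks; real ones would come from the splitting step.
sample_chunks = [
    {"page_content": "first chunk\nwith a line break", "metadata": {"source": "a.pdf", "page": 0}},
    {"page_content": "second chunk", "metadata": {"source": "a.pdf", "page": 1}},
]

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "processed_chunks.jsonl"
    save_chunks_jsonl(sample_chunks, jsonl_path)
    restored_chunks = load_chunks_jsonl(jsonl_path)

print(restored_chunks == sample_chunks)
```

A chunk from the notebook could be reduced to this shape with `{"page_content": chunk.page_content, "metadata": chunk.metadata}`, and rebuilt in the next notebook by passing the same two fields back to `Document`.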
