**<h1 style = 'text-align: center'>Build a semantic search engine</h1>**

## **1. Documents and Document Loaders**

LangChain implements a `Document` abstraction, which represents a unit of text and its associated metadata. It has three attributes:
- `page_content`: a string containing the text of the document
- `metadata`: a dict containing arbitrary metadata
- `id`: (optional) a string identifier for the document


In [2]:
# generate sample documents
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

documents

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.')]

In [4]:
# load a PDF into a list of Document objects (one per page)

from langchain_community.document_loaders import PyPDFLoader

file_path = '../data/nke-10k-2023.pdf'
loader = PyPDFLoader(file_path)

docs = loader.load()

len(docs)

107

In [7]:
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F

{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': '../data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}


## **2. Splitting**

- Splitting the PDF text into smaller chunks helps ensure that the meaning of relevant portions of the document is not "washed out" (blurred) by surrounding text.
- Use a **text splitter** for this purpose:
    + Split the documents into chunks of 1000 characters, with 200 characters of overlap between consecutive chunks.
    + Use `RecursiveCharacterTextSplitter`, which tries to split on common separators such as paragraph breaks and new lines. If a chunk is still too long, it recursively splits it further until each chunk is an appropriate size.
    + Set `add_start_index=True` so that the character index at which each split `Document` starts within the original `Document` is preserved in the metadata attribute `start_index`.
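
To make the roles of chunk size, overlap, and `start_index` concrete, here is a minimal pure-Python sketch. It is an assumption-laden simplification, not how `RecursiveCharacterTextSplitter` works internally: it cuts fixed-width windows and ignores separators entirely. The function name `chunk_with_overlap` and the sample text are hypothetical, introduced only for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[dict]:
    """Cut `text` into fixed-width windows that share `chunk_overlap` characters.

    Simplified illustration only: the real splitter prefers to break on
    separators such as new lines, which this sketch does not attempt.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        # Record where the chunk begins in the original text, analogous to
        # the "start_index" metadata attribute described above.
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sample: 2500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(sample)

for chunk in chunks:
    print(chunk["start_index"], len(chunk["text"]))
```

With a 1000-character window and 200 characters of overlap, each window advances by 800 characters, so the last 200 characters of one chunk reappear at the start of the next. That shared text is what keeps a sentence that straddles a boundary retrievable from either side.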