# full text search

In [1]:
!pip install whoosh PyMuPDF python-docx

Collecting whoosh
  Downloading Whoosh-2.7.4-py2.py3-none-any.whl.metadata (3.1 kB)
Collecting PyMuPDF
  Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading Whoosh-2.7.4-py2.py3-none-any.whl (468 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m468.8/468.8 kB[0m [31m17.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.0/20.0 MB[0m [31m72.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading python_docx-1.1.2-py3-none-any.whl (244 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.3/244.3 kB[0m [31m25.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: whoosh, python-docx, PyMuPDF
Successfully installed PyMuPDF-1.25.3 python-docx-1.1.2 who

In [2]:
import os
import fitz  # PyMuPDF
import docx
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID, STORED
from whoosh.qparser import QueryParser

# Thư mục chứa file
DOCS_FOLDER = "/content/drive/MyDrive/IELTS"

# Cấu trúc dữ liệu tìm kiếm
schema = Schema(title=ID(stored=True), paragraph=TEXT(stored=True), file_name=STORED)

# Khởi tạo thư mục chỉ mục
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

# Tạo chỉ mục tìm kiếm
index = create_in("indexdir", schema)
writer = index.writer()

# Hàm đọc file và chia thành đoạn văn
def extract_text_segments(file_path):
    text_segments = []
    if file_path.endswith(".pdf"):
        doc = fitz.open(file_path)
        for page in doc:
            text_segments.extend(page.get_text("text").split("\n\n"))  # Chia đoạn theo khoảng trắng
    elif file_path.endswith(".docx"):
        doc = docx.Document(file_path)
        text_segments = [para.text for para in doc.paragraphs if para.text.strip()]
    elif file_path.endswith(".txt"):
        with open(file_path, "r", encoding="utf-8") as f:
            text_segments = f.read().split("\n\n")
    return text_segments

# Lưu từng đoạn vào Whoosh
for file_name in os.listdir(DOCS_FOLDER):
    file_path = os.path.join(DOCS_FOLDER, file_name)
    paragraphs = extract_text_segments(file_path)
    for para in paragraphs:
        writer.add_document(title=file_name, paragraph=para, file_name=file_name)

writer.commit()
print("✅ Chỉ mục đã được cập nhật với các đoạn văn!")

✅ Chỉ mục đã được cập nhật với các đoạn văn!


In [3]:
from whoosh.index import open_dir

def search_paragraphs(query):
    index = open_dir("indexdir")
    with index.searcher() as searcher:
        query_parser = QueryParser("paragraph", index.schema)
        query = query_parser.parse(query)
        results = searcher.search(query, limit=5)  # Giới hạn 5 đoạn văn

        for result in results:
            print(f"📄 File: {result['file_name']}")
            print(f"📜 Đoạn văn: {result['paragraph']}\n")

# Tìm kiếm từ khóa
query_text = "Naming Conventions"
search_paragraphs(query_text)

📄 File: code.pdf
📜 Đoạn văn:  
Verilog Coding Standards  
ATVN  
 
 
 
Copyright © 2006. Arrive Technologies Inc. 
 
Page ii 
Internal Doc. Subject to Change 
 
Contents 
1. 
Abstract .................................................................................................................................... 1 
2. 
Acronyms ................................................................................................................................. 1 
3. 
Related Documents.................................................................................................................. 1 
4. 
Requirements ........................................................................................................................... 1 
5. 
Revision Changes.................................................................................................................... 1 
6. 
Verilog Programming Conventions ...........................................................................

In [5]:
query_text = "RTL"
search_paragraphs(query_text)

📄 File: code.pdf
📜 Đoạn văn:  
Verilog Coding Standards  
ATVN  
 
 
 
Copyright © 2006. Arrive Technologies Inc. 
 
Page 8 
Internal Doc. Subject to Change 
 
6.3. 
Complex Conditionals 
 Complex conditional statements can be cryptic and hard to comprehend but inevitable 
sometimes. Proper indentation can improve this difficult situation quite a bit as shown below.  
- before - 
 
if ((((signal1 == 2’b10) &&  
(signal2 != c_value1)) ||  
(signal3 == 2’b10)) &&  
(signal4 == p_value2)) 
 
 
begin 
 
 
... 
 
 
end 
 
 
- after - 
 
if(((signal1 == 2’b10) && (signal2 != c_value1))   
 
 
 ||(signal3 == 2’b10) 
 
 
) 
 
    &&(signal4 == p_value2) 
 
   ) 
 
 
begin 
 
 
... 
 
 
end 
 
 
- or - 
 
if ((((signal1 == 2’b10) &&  
   (signal2 != c_value1)) ||  
   (signal3 == 2’b10)) &&  
   (signal4 == p_value2)) 
 
 
begin 
 
 
... 
 
 
end 
 
 
Better yet, the complexity of the above conditional should be avoided whenever possible! 
 The conditional statement is allowed as long as it d

In [6]:
query_text = "Quy trình mua hàng"
search_paragraphs(query_text)

📄 File: VERON- Purchasing process.pdf
📜 Đoạn văn: PURCHASE REQUEST AND PURCHASE ORDERS 
 
SOP No. 
 
SOP No. 
 
Effective date 
23 Jan 2025 
Revision 
 
Prepared by 
Accounting Department 
Approved by 
Peter Nguyen 
 
Subject 
PURCHASE REQUESTS AND PURCHASE ORDERS 
YÊU CẦU MUA HÀNG VÀ QUY TRÌNH ĐẶT HÀNG 
 
I. 
MỤC TIÊU/ OBJECTTIVES: 
 
Thiết lập quy trình hướng dẫn yêu cầu mua hàng và quy trình đặt hàng, nhằm: 
To established guidelines relating to purchase request and purchase orders. The purpose of this policy is to: 
- 
Mua được vật tư tốt với giá thành thấp nhất, phù hợp với những chi tiết, đặc điểm yêu cầu cho từng sản phẩm  
Obtain the best quality materials at the lowest price, in accordance with the established purchasing 
specification. 
- 
Tổ chức quy trình đặt mua hàng hợp lý và tuân theo các quy định thu mua được thực hiện trên hệ thống nội bộ 
Streamline the way of handling of comply with the purchase policies when working via internal system.  
- 
Đảm bảo các giao dịch mu

In [11]:
query_text = "THE BENEFITS OF BEING BILINGUAL"
search_paragraphs(query_text)

📄 File: Cambridge 12 Reading Test 2 Vocabulary to Topic.pdf
📜 Đoạn văn: SECTION 3: THE BENEFITS OF BEING BILINGUAL
Collocation
Vietnamese Definition
bilingual or multilingual
biết hai hoặc nhiều ngôn ngữ
cognitive systems
hệthống nhận thức
neurological systems
hệthống thần kinh
language co-activation
sựkích hoạt ngôn ngữđồng thời
auditory input
thông tin âm thanh
linguistic competition
sựcạnh tranh ngôn ngữ
tip-of-the-tongue states
trạng thái "đứng trên đầu lưỡi"
conflict management
quản lý xung đột
Stroop Task
bài kiểm tra Stroop
cognitive control
kiểm soát nhận thức
sensory processing
xửlý cảm giác
pitch perception
nhận thức âm điệu
cognitive mechanisms
cơ chếnhận thức
memory improvement
cải thiện trí nhớ
Alzheimer's disease
bệnh Alzheimer
physical signs of disease
dấu hiệu vật lý của bệnh
navigating a multilingual
environment
điều hướng trong môi trường đa ngôn ngữ
transfer advantages
lợi ích chuyển giao
5




In [12]:
query_text = "Rewrite Sentence Activity"
search_paragraphs(query_text)

📄 File: Cambridge 12 Reading Test 2 Vocabulary to Topic.pdf
📜 Đoạn văn: Rewrite Sentence Activity
1. People who can speak two or more languages are considered to have ______
(bilingual or multilingual) skills.
2. The brain’s ______ (neurological systems) are essential for processing information
and coordinating responses.
3. When learning a new language, ______ (language co-activation) can help
reinforce understanding by engaging multiple languages simultaneously.
4. The ______ (auditory input) we receive plays a crucial role in language
acquisition and comprehension.
5. In a multilingual setting, speakers often face ______ (linguistic competition) as
they choose which language to use.
6. People sometimes experience ______ (tip-of-the-tongue states) when they know a
word but cannot recall it at that moment.
7. Effective ______ (conflict management) strategies are vital for resolving
misunderstandings in conversations involving multiple languages.
8. The ______ (Stroop Task) is a psycho

#Vector Search


In [4]:
! pip install faiss-cpu sentence-transformers PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.0/20.0 MB[0m [31m34.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.3


In [89]:
import fitz  # PyMuPDF
import os

def extract_text_segments_from_pdf(pdf_path, chunk_size=100):
    """Trích xuất nội dung từ PDF và chia nhỏ thành các đoạn văn"""
    pdf_doc = fitz.open(pdf_path)
    text = ""
    segments = []

    for page in pdf_doc:
        text += page.get_text("text") + "\n"

    words = text.split()

    # Chia nhỏ văn bản thành các đoạn chunk_size từ
    for i in range(0, len(words), chunk_size):
        segment = " ".join(words[i : i + chunk_size])
        segments.append(segment)

    return segments

# Kiểm tra với 1 file PDF
pdf_path = "/content/drive/MyDrive/IELTS/code.pdf"
segments = extract_text_segments_from_pdf(pdf_path)
print(f"📌 Sách đã chia thành {len(segments)} đoạn văn bản!")

📌 Sách đã chia thành 30 đoạn văn bản!


In [90]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load model để tạo embedding
model = SentenceTransformer("all-MiniLM-L6-v2")

# Thư mục chứa PDF
pdf_folder = "/content/drive/MyDrive/IELTS"

# FAISS Index
index_path = "document_index.faiss"

# Khởi tạo FAISS nếu chưa có
if os.path.exists(index_path):
    index = faiss.read_index(index_path)
else:
    index = None

segment_data = []  # Lưu thông tin (file, nội dung đoạn văn)

for file in os.listdir(pdf_folder):
    if file.endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder, file)
        segments = extract_text_segments_from_pdf(pdf_path)

        for segment in segments:
            segment_data.append((file, segment))

# Chuyển đoạn văn bản thành vector
segment_texts = [seg[1] for seg in segment_data]
segment_embeddings = model.encode(segment_texts, convert_to_numpy=True)

# Nếu chưa có index, tạo mới
if index is None:
    dimension = segment_embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)

# Thêm vào FAISS
index.add(segment_embeddings)

# Lưu FAISS Index
faiss.write_index(index, index_path)

# Lưu thông tin đoạn văn bản
with open("segments.txt", "w", encoding="utf-8") as f:
    for file, text in segment_data:
        f.write(f"{file}|||{text}\n")

print("✅ Đã lưu FAISS Index với từng đoạn văn bản!")

✅ Đã lưu FAISS Index với từng đoạn văn bản!


In [92]:
# Load FAISS Index
index = faiss.read_index("document_index.faiss")

# Load thông tin đoạn văn bản
segment_data = []
with open("segments.txt", "r", encoding="utf-8") as f:
    for line in f:
        file, text = line.strip().split("|||")
        segment_data.append((file, text))

def search_pdf_segment(query, top_k=3):
    query_embedding = model.encode([query], convert_to_numpy=True)

    if index.ntotal == 0:
        print("⚠️ FAISS chưa có dữ liệu!")
        return []

    distances, closest_indices = index.search(query_embedding, top_k)

    print(f"🔍 Distances: {distances[0]}")  # In ra khoảng cách của các kết quả tìm được

    if distances[0][0] > 1.0:  # Nếu khoảng cách quá lớn, có thể là kết quả không phù hợp
        print("⚠️ Không tìm thấy nội dung phù hợp!")
        return []

    results = [(segment_data[idx][0], segment_data[idx][1]) for idx in closest_indices[0] if idx < len(segment_data)]
    return results

# Ví dụ câu hỏi
query = "Naming Conventions"
results = search_pdf_segment(query)

print("\n🔍 Kết quả tìm kiếm:")
for filename, content in results:
    print(f"📄 File: {filename}")
    print(f"📜 Nội dung đoạn phù hợp: {content}...\n")  # Hiển thị 500 ký tự đầu

🔍 Distances: [0.7470192 1.0927835 1.2032735]

🔍 Kết quả tìm kiếm:
📄 File: code.pdf
📜 Nội dung đoạn phù hợp: once or twice, but read many times. Make names meaningful and readable, and avoid obscure abbreviations. In general, the following guidelines must be followed.  In general, names should be lowercase and be composed of words, abbreviations, and acronyms combined together with underscores (as rarely as possible). The composed word should form a close description of the object named. The name should be long enough for the reader to be able to determine what the object is for. Besides, the name should be shorter than 10 characters for good synthesis procedure.  Minimize the use of abbreviations.  Use abbreviations...

📄 File: code.pdf
📜 Nội dung đoạn phù hợp: consistently. For example: input prs; // it means processor write strobe input [4:0] framecnt; // it means frame count, do not use frame_count // because frame_count is so long. input t_run; // it means time run, do not use t

In [95]:
query = "Input and Output Names"
results = search_pdf_segment(query)

print("\n🔍 Kết quả tìm kiếm:")
for filename, content in results:
    print(f"📄 File: {filename}")
    print(f"📜 Nội dung đoạn phù hợp: {content}...\n")  # Hiển thị 500 ký tự đầu

🔍 Distances: [0.9508181 1.1173918 1.31916  ]

🔍 Kết quả tìm kiếm:
📄 File: code.pdf
📜 Nội dung đoạn phù hợp: consistently. For example: input prs; // it means processor write strobe input [4:0] framecnt; // it means frame count, do not use frame_count // because frame_count is so long. input t_run; // it means time run, do not use trun // because trun is easily misunderstanding . 6.2.1. Source, Testbench, and Testcase File Names The prevailing theory on naming is that the name be meaningful. Commonly this means that each module name in a subsystem has a short prefix (two to five characters) which implies the subsystem. Only lower case letters should be used when naming source and header files....

📄 File: code.pdf
📜 Nội dung đoạn phù hợp: Names of constant (defined via `define or parameter) must be uppercase. For example: `define CMAX 8’d5 parameter CMIN = 4’d1; 6.2.5. Input and Output Names  Names of input and output must have a comment, which indicates this signal changed at which cl