### Step 1: Load Documents

We define a function, `load_documents`, that loads .pdf, .docx, .csv, and .xlsx files using LangChain's document loaders. Each loader returns a list of `Document` objects, so every file format ends up in the same structure for further processing. Unsupported extensions are reported and skipped, and an error in one file does not stop the others from loading. This keeps the pipeline flexible, easy to extend, and ready for text chunking and embedding in the next steps.

In [1]:
from langchain_community.document_loaders import (
    PyMuPDFLoader, UnstructuredWordDocumentLoader, CSVLoader, UnstructuredExcelLoader
)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

from sentence_transformers import SentenceTransformer
import os

# ---- Load all documents ----
def load_documents(file_paths):
    docs = []
    for path in file_paths:
        ext = os.path.splitext(path)[1].lower()
        try:
            if ext == '.pdf':
                loader = PyMuPDFLoader(path)
            elif ext == '.docx':
                loader = UnstructuredWordDocumentLoader(path)
            elif ext == '.csv':
                loader = CSVLoader(file_path=path)
            elif ext == '.xlsx':
                loader = UnstructuredExcelLoader(path)
            else:
                print(f"Unsupported file type: {ext}")
                continue
            docs.extend(loader.load())
        except Exception as e:
            print(f"Error loading {path}: {e}")
    return docs
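
The `if`/`elif` chain above can also be written as a lookup table, which makes adding a new format a one-line change. The following is only a sketch of that idea: `LOADER_NAMES`, `plan_loading`, and the file `notes.txt` are hypothetical names introduced for illustration, and the loader classes are represented by their names as plain strings so the example runs without LangChain installed.

```python
import os

# Hypothetical registry: file extension -> name of the loader class to use.
# In the real pipeline the values would be the loader classes themselves.
LOADER_NAMES = {
    ".pdf": "PyMuPDFLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".csv": "CSVLoader",
    ".xlsx": "UnstructuredExcelLoader",
}


def plan_loading(file_paths):
    """Return (path, loader name) pairs and a list of unsupported paths."""
    supported, unsupported = [], []
    for path in file_paths:
        # Same normalization as load_documents: lower-cased extension.
        ext = os.path.splitext(path)[1].lower()
        if ext in LOADER_NAMES:
            supported.append((path, LOADER_NAMES[ext]))
        else:
            unsupported.append(path)
    return supported, unsupported


supported, unsupported = plan_loading(
    ["Stats.docx", "The-Alchemist.PDF", "notes.txt"]
)
print(supported)
print(unsupported)
```

Because the extension is lower-cased before the lookup, `The-Alchemist.PDF` is treated the same as a `.pdf` file, which matches the behavior of `load_documents`.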




In [2]:
word_doc = ['Dataset summaries and citations.docx','M.Sc. Applied Psychology.docx','Stats.docx']
xlsx_doc = ['Loan amortisation schedule1.xlsx','Loan analysis.xlsx','party budget1.xlsx']
pdf_doc = ['Ocean_ecogeochemistry_A_review.pdf','The-Alchemist.pdf','The_Plan_of_the_Giza_Pyramids.pdf']

### Step 2: Read and Organize Documents
We loop through all the files grouped by type and use appropriate loaders to extract their content. Each document's full text is stored along with its filename. This prepares raw text data for splitting and embedding, making it easier to trace chunks back to their sources.

In [4]:
from langchain_community.document_loaders import (
    PyMuPDFLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredExcelLoader,
)
import os

# Combine all your files into a list with types
files_by_type = {
    "docx": word_doc,
    "xlsx": xlsx_doc,
    "pdf": pdf_doc
}

document_texts = []

for ext, file_list in files_by_type.items():
    for path in file_list:
        try:
            if ext == "pdf":
                loader = PyMuPDFLoader(path)
            elif ext == "docx":
                loader = UnstructuredWordDocumentLoader(path)
            elif ext == "xlsx":
                loader = UnstructuredExcelLoader(path)
            else:
                continue

            docs = loader.load()
            full_text = "\n".join([doc.page_content for doc in docs])
            document_texts.append({"filename": path, "content": full_text})

        except Exception as e:
            document_texts.append({"filename": path, "content": f"[Error: {e}]"})
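
Note that a file that fails to load is still appended to `document_texts`, with the error message as its content. If those entries are passed on unchanged, the error text would be chunked and embedded like any other document. A minimal sketch of one way to separate them first is shown below; `split_loaded_and_failed` is a hypothetical helper and the two sample entries are toy data, not output from the actual run.

```python
def split_loaded_and_failed(document_texts):
    """Separate successfully loaded entries from '[Error: ...]' placeholders."""
    loaded, failed = [], []
    for entry in document_texts:
        content = entry["content"]
        # The loading loop stores failures as "[Error: <message>]".
        if content.startswith("[Error:") and content.endswith("]"):
            failed.append(entry)
        else:
            loaded.append(entry)
    return loaded, failed


# Toy entries for illustration only.
sample = [
    {"filename": "ok.docx", "content": "some extracted text"},
    {"filename": "broken.xlsx", "content": "[Error: could not open file]"},
]

loaded, failed = split_loaded_and_failed(sample)
print([entry["filename"] for entry in loaded])
print([entry["filename"] for entry in failed])
```

This check relies on the placeholder format used in the loop above, so a document whose real text happens to start with `[Error:` and end with `]` would be misclassified. Storing a separate `error` key per entry would be a more robust design.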




In [5]:
document_texts[0]

{'filename': 'Dataset summaries and citations.docx',
 'content': 'Table 1. Description of studies included in the meta-analysis. Full article citations are listed after the table.\n\nReference ID Turfgrass Use Location Year since establishment Function for SOC vs. years Depths evaluated Climate Description Prior Land use Dominant Species Seasonality Data source Acuna et al., 2017 Acuna2017_Bingo Small plots Pirque, Chile 0 - 2 Linear 0-10, 10-20, 20-30 Mediterranean Cropland Tall fescue Cool No response. Imputed SE. Acuna et al., 2017 Acuna2017_C.dactylon Small plots Pirque, Chile 0 - 2 Linear 0-10, 10-20, 20-30 Mediterranean Cropland Bermuda (Cynodon dactylon) Warm No response. Imputed SE. Acuna et al., 2017 Acuna2017_CindyLou Small plots Pirque, Chile 0 - 2 Linear 0-10, 10-20, 20-30 Mediterranean Cropland Red fescue (Festuca rubra L. ssp. Rubra) Cool No response. Imputed SE. Acuna et al., 2017 Acuna2017_Cochise Small plots Pirque, Chile 0 - 2 Linear 0-10, 10-20, 20-30 Mediterranean C

### Step 3: Chunking the Documents
We split each document into smaller, overlapping chunks using `RecursiveCharacterTextSplitter`, with a chunk size of at most 1,000 characters and an overlap of 200. Smaller chunks improve retrieval precision and keep the retrieved context within model limits, while the overlap preserves continuity between neighboring chunks.

Each chunk is tagged with metadata (the source filename and a per-document chunk ID), which makes it easy to trace a retrieved chunk back to its file.

The overlap also reduces the risk of losing information that straddles a chunk boundary, for example in a long paragraph.

The resulting chunks are uniform `Document` objects, ready to be vectorized in the next step.
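
To see what chunk size and overlap mean in isolation, here is a deliberately simplified sketch that cuts a string into fixed-width windows. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical helper used only to illustrate the overlap.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=4)
for i, chunk in enumerate(chunks):
    print(i, chunk)
# 0 abcdefghij
# 1 ghijklmnop
# 2 mnopqrstuv
# 3 stuvwxyz
```

Each window starts with the last four characters of the previous one, so text near a boundary appears in both chunks.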

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Initialize the text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Output list of LangChain Documents with metadata
chunked_documents = []

for doc in document_texts:
    content = doc["content"].strip()
    if content:
        # Create LangChain Documents with filename metadata
        base_doc = Document(
            page_content=content,
            metadata={"source": doc["filename"]}
        )

        # Split into chunks
        chunks = splitter.split_documents([base_doc])

        # Append with chunk_id
        for i, chunk in enumerate(chunks):
            chunk.metadata["chunk_id"] = i
            chunked_documents.append(chunk)

In [6]:
chunked_documents

[Document(metadata={'source': 'Dataset summaries and citations.docx', 'chunk_id': 0}, page_content='Table 1. Description of studies included in the meta-analysis. Full article citations are listed after the table.'),
 Document(metadata={'source': 'Dataset summaries and citations.docx', 'chunk_id': 1}, page_content='Reference ID Turfgrass Use Location Year since establishment Function for SOC vs. years Depths evaluated Climate Description Prior Land use Dominant Species Seasonality Data source Acuna et al., 2017 Acuna2017_Bingo Small plots Pirque, Chile 0 - 2 Linear 0-10, 10-20, 20-30 Mediterranean Cropland Tall fescue Cool No response. Imputed SE. Acuna et al., 2017 Acuna2017_C.dactylon Small plots Pirque, Chile 0 - 2 Linear 0-10, 10-20, 20-30 Mediterranean Cropland Bermuda (Cynodon dactylon) Warm No response. Imputed SE. Acuna et al., 2017 Acuna2017_CindyLou Small plots Pirque, Chile 0 - 2 Linear 0-10, 10-20, 20-30 Mediterranean Cropland Red fescue (Festuca rubra L. ssp. Rubra) Cool N

### Step 4: Embedding and Storing in Vector Database
We use the `nomic-ai/nomic-embed-text-v1` embedding model, loaded through LangChain's `HuggingFaceEmbeddings` wrapper, to convert text chunks into dense vector representations suitable for semantic search.

These vectors are stored in a FAISS vector database, allowing fast and efficient similarity searches based on user queries.

Saving the vector index locally (in the `faiss_nomic_index` folder) lets us reuse it without recomputing embeddings every time.

This step is key to enabling Retrieval-Augmented Generation (RAG), where the most relevant document chunks are retrieved during inference.

The quality of semantic matching depends largely on the embedding model; a general-purpose text embedder such as Nomic's is a reasonable choice for a corpus as varied as this one.
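
Whether embeddings are normalized affects how "similar" is measured. The sketch below uses toy two-dimensional vectors (not real embeddings) to show that ranking by Euclidean distance and ranking by cosine similarity can disagree on raw vectors but agree once the vectors are scaled to unit length. It assumes the index ranks by Euclidean (L2) distance, which, as far as we know, is the default for LangChain's FAISS wrapper; all function and variable names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors for illustration only.
query_vec = [1.0, 0.0]
chunk_vecs = {
    "same direction, long": [10.0, 0.0],
    "slightly off, short": [1.0, 0.5],
}

best_by_cosine = max(
    chunk_vecs, key=lambda k: cosine_similarity(query_vec, chunk_vecs[k])
)
best_by_raw_l2 = min(
    chunk_vecs, key=lambda k: squared_l2(query_vec, chunk_vecs[k])
)
best_by_normalized_l2 = min(
    chunk_vecs,
    key=lambda k: squared_l2(l2_normalize(query_vec), l2_normalize(chunk_vecs[k])),
)

print(best_by_cosine)         # same direction, long
print(best_by_raw_l2)         # slightly off, short
print(best_by_normalized_l2)  # same direction, long
```

For unit-length vectors the squared L2 distance equals 2 - 2 * (cosine similarity), so the two rankings coincide. If cosine-style ranking is what you want, setting `normalize_embeddings` to `True` is one option worth testing.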

In [7]:
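from langchain_community.vectorstores import FAISS

# Newer LangChain releases also provide this class in the separate
# langchain_huggingface package:
# from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

model_name = "nomic-ai/nomic-embed-text-v1"
# trust_remote_code is needed because this model ships its own modeling code
model_kwargs = {'device': 'cpu', 'trust_remote_code': True}  # or 'cuda' if you have a GPU
encode_kwargs = {'normalize_embeddings': False}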
 
embedding_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)


vectorstore = FAISS.from_documents(chunked_documents, embedding_model)

vectorstore.save_local("faiss_nomic_index")
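
In a later session the saved index can be reloaded instead of rebuilt. The sketch below wraps `FAISS.load_local` in a hypothetical helper, `load_saved_index`, that first checks that the folder exists. It assumes a recent `langchain_community` release, where `allow_dangerous_deserialization=True` must be passed because part of the index is stored as a pickle file; only enable it for index folders you created yourself.

```python
import os


def load_saved_index(index_dir, embedding_model):
    """Reload a FAISS index that was written earlier with save_local."""
    if not os.path.isdir(index_dir):
        raise FileNotFoundError(f"No saved index found at {index_dir!r}")
    # Imported here so the helper can be defined without LangChain installed.
    from langchain_community.vectorstores import FAISS

    return FAISS.load_local(
        index_dir,
        embedding_model,
        # Required by recent versions; the folder contains a pickle file.
        allow_dangerous_deserialization=True,
    )


# Usage in the notebook, once embedding_model has been created:
# vectorstore = load_saved_index("faiss_nomic_index", embedding_model)
```

The same embedding model must be used for loading as for building the index, since queries are embedded with it at search time.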



### Testing the Vector Store with a Sample Query

In this step, we test our vector database by running a sample query. Using similarity_search, we retrieve the top 5 most relevant document chunks based on the semantic similarity of the query. This demonstrates that our FAISS index is correctly built and can efficiently find contextually relevant information from the ingested documents.

Why it's useful:

- Validates that document embeddings and indexing are working as expected.
- Provides immediate feedback on the relevance and quality of results.
- Helps in debugging or tuning embedding models and chunking strategies.
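
When judging relevance, it helps to see each hit's score and source file rather than the text alone. LangChain's FAISS wrapper offers `similarity_search_with_score`, which returns `(Document, score)` pairs. The sketch below formats such pairs; `FakeDocument` and `summarize_hits` are hypothetical stand-ins, and the hits are toy values so the example runs without the index.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document, used only to make the sketch runnable."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_hits(hits, preview_chars=60):
    """Format (document, score) pairs as one line per hit."""
    lines = []
    for rank, (doc, score) in enumerate(hits, start=1):
        source = doc.metadata.get("source", "unknown")
        chunk_id = doc.metadata.get("chunk_id", "?")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(
            f"{rank}. {source} (chunk {chunk_id}, score {score:.3f}): {preview}"
        )
    return lines


# With the real index, the hits would come from:
# hits = vectorstore.similarity_search_with_score(query, k=5)
hits = [
    (FakeDocument("toy text A", {"source": "a.pdf", "chunk_id": 3}), 0.25),
    (FakeDocument("toy text B", {"source": "b.docx", "chunk_id": 0}), 0.5),
]

for line in summarize_hits(hits):
    print(line)
```

With an L2-based index the score is a distance, so lower values mean closer matches.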

In [8]:
query = "What does the shepherd boy do?"
docs = vectorstore.similarity_search(query, k=5)

for i, doc in enumerate(docs):
    print(f"Doc {i+1}:\n{doc.page_content}\n{'-'*40}")


Doc 1:
the language of the soul, it is only you who can understand. But,
whichever it is, I’m going to charge you for the consultation.”
Another trick, the boy thought. But he decided to take a chance. A
shepherd always takes his chances with wolves and with drought,
and that’s what makes a shepherd’s life exciting.
“I have had the same dream twice,” he said. “I dreamed that I was
in a field with my sheep, when a child appeared and began to play
with the animals. I don’t like people to do that, because the sheep
are afraid of strangers. But children always seem to be able to play
with them without frightening them. I don’t know why. I don’t know
how animals know the age of human beings.”
“Tell me more about your dream,” said the woman. “I have to get
back to my cooking, and, since you don’t have much money, I can’t
give you a lot of time.”
“The child went on playing with my sheep for quite a while,”
continued the boy, a bit upset. “And suddenly, the child took me by
-------------------