## Multi-Agent GenAI Knowledge Navigator for Workforce Upskilling

### Business Problem

Organizations face a critical challenge: ensuring their workforce keeps pace with rapidly evolving domains such as Data Analytics, Data Science, Generative AI, Computer Vision, and Project Management. Traditional upskilling methods rely heavily on static training manuals, lengthy books, or outdated PDFs, which:

* Delay the learning cycle and overwhelm employees with information overload.
* Fail to connect foundational knowledge with the latest industry practices.
* Limit scalability in delivering targeted, role-specific learning paths.

This results in:

* Slower workforce readiness and skill adoption.
* Increased training costs with lower ROI.
* Reduced competitiveness in industries driven by fast-paced innovation.

### Use Case

A Retrieval-Augmented Generation (RAG) platform, powered by multiple cooperating AI agents, that can:

* Ingest technical books, knowledge PDFs, and compliance guides across upskilling domains.
* Summarize dense material into **role-specific microlearning modules**.
* Cross-reference knowledge with **live internet sources** to ensure freshness and regulatory alignment.
* Auto-generate **interview preparation content, study aids, and practice questions** tailored to workforce roles.

### Example Use Case Scenarios

* Transforming a **300-page Data Science guide** into structured modules for analysts, engineers, and managers.
* Building **GenAI and Computer Vision upskilling roadmaps** that merge foundational book knowledge with current real-world advancements.
* Preparing employees for **project management certifications** by combining official manuals with live case studies and scenario-based assessments.

### Who Benefits

* **Employees & Job Seekers:** Gain faster, validated, and targeted knowledge aligned with role expectations.
* **Upskilling Programs:** Deliver scalable, domain-specific learning paths for workforce readiness.
* **Organizations:** Improve training ROI, accelerate employee onboarding, and ensure skill competitiveness.
* **Lifelong Learners:** Access both curated book knowledge and real-time industry updates.

### In Summary

This solution delivers an **intelligent knowledge navigator** that extracts, synthesizes, and validates learning material, helping organizations build a future-ready workforce through faster, continuously updated upskilling programs.

### 🎯 Project Objective

* Develop a production-ready, cloud-native, multi-agent GenAI platform that ingests books and knowledge PDFs, indexes them in a scalable vector database (ChromaDB), and enables hybrid retrieval with real-time internet search.
* Enable end-to-end corporate learning workflows: book ingestion, chapter-level summarization, role-based learning paths, interview prep, and knowledge validation.
* Incorporate analytics, reporting, and feedback loops to measure learning impact and optimize upskilling strategies.

### Key Objectives

* Provide rapid summaries and knowledge drill-downs mapped to workforce roles.
* Generate validated, role-specific interview and certification prep content.
* Track knowledge gaps across teams and recommend targeted resources.
* Customize upskilling across domains like AI, cloud, data, and project management.

### ❓ Core Questions Addressed

* How can large books and technical PDFs be transformed into **role-specific microlearning modules**?
* How can organizations ensure employees’ knowledge stays **aligned with current industry trends and regulations**?
* What **domain-specific interview and certification content** should employees prepare for today?
* How can corporate upskilling be made more **scalable, measurable, and adaptive**?


### Import Required Libraries
Before we begin, it’s essential to load all necessary libraries for PDF processing, text splitting, and general data handling.

These libraries will help us read PDFs, manage large documents, and organize data efficiently.

In [1]:
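# Note: newer LangChain releases expose these classes from
# `langchain_community.document_loaders` and `langchain_text_splitters`;
# the imports below match the LangChain version used in this notebook.
import os

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter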

In [2]:
# Move up one level (assumed to be the project root) so relative paths like "data" resolve
os.chdir("../")

### Define Data Source and File Paths

Here, we specify the folder containing our PDF collection.

Ensuring we point to the right directory is critical for the automation and scalability of ingestion.

By organizing files in a dedicated folder, the system becomes maintainable and easily extensible.

In [3]:
# Define the folder path containing the PDFs
data_folder = "data"

# List all PDF files available in this folder
pdf_files = [f for f in os.listdir(data_folder) if f.lower().endswith(".pdf")]

print(f"Found {len(pdf_files)} PDF files for ingestion.")
pdf_files

Found 15 PDF files for ingestion.


["Generative AI_ A Beginner's Guide.pdf",
 'SQL Book.pdf',
 'Maths$Stats_NOTES.docx.pdf',
 'JIRASOFTWARESERVER071-290216.pdf',
 'DAX Functions Cheat Sheet.pdf',
 'Artificial Intelligence, Machine Learning, and Deep Learning.pdf',
 'Project_Management_15694.pdf',
 'Book_Power BI from Rookie to Rock Star_Book01_Power_BI_Essentials_Reza Rad_RADACAD.pdf',
 'Probability_&_Stats_Q&A.pdf',
 'Stats_Q&A.pdf',
 'Computer Vision.pdf',
 'excel_Basic_Microsoft_Excel_2010_YASHADA_Ver 1.pdf',
 'Introduction_to_Python_Programming.pdf',
 'SQL_Queries.pdf',
 'How_to_explain_quantification.pdf']

### Load PDFs and Extract Raw Text

In this step, we load each PDF document, extract its textual content, and store it for processing.

Extracting text accurately from PDFs is a crucial challenge because documents may vary in structure and formatting.

We use LangChain's PyPDFLoader, which returns one Document per PDF page and records the file path and page number in each Document's metadata.

In [4]:
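# Load each PDF; PyPDFLoader returns a list with one Document per page

all_docs = []  # one entry per PDF file, each entry being that file's list of page Documents

for pdf_file in pdf_files:
    file_path = os.path.join(data_folder, pdf_file)
    print(f"Loading file: {pdf_file}")
    loader = PyPDFLoader(file_path=file_path)
    documents = loader.load()
    all_docs.append(documents)

# len(all_docs) counts PDF files, not individual pages
print(f"Loaded a total of {len(all_docs)} raw documents from PDFs.")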

Loading file: Generative AI_ A Beginner's Guide.pdf
Loading file: SQL Book.pdf
Loading file: Maths$Stats_NOTES.docx.pdf
Loading file: JIRASOFTWARESERVER071-290216.pdf
Loading file: DAX Functions Cheat Sheet.pdf
Loading file: Artificial Intelligence, Machine Learning, and Deep Learning.pdf
Loading file: Project_Management_15694.pdf
Loading file: Book_Power BI from Rookie to Rock Star_Book01_Power_BI_Essentials_Reza Rad_RADACAD.pdf
Loading file: Probability_&_Stats_Q&A.pdf
Loading file: Stats_Q&A.pdf
Loading file: Computer Vision.pdf
Loading file: excel_Basic_Microsoft_Excel_2010_YASHADA_Ver 1.pdf
Loading file: Introduction_to_Python_Programming.pdf
Loading file: SQL_Queries.pdf
Loading file: How_to_explain_quantification.pdf
Loaded a total of 15 raw documents from PDFs.


### Split Documents into Chunks

Long documents are split into smaller, digestible chunks to allow efficient embedding generation and search.

The chunk size and overlap are tunable parameters that balance context retention with retrieval performance.

Effective chunking prevents context loss during semantic retrieval.

The choice of chunk_size=1000 and chunk_overlap=100 in RecursiveCharacterTextSplitter balances several important factors for handling large PDFs (e.g., books over 400 pages):

- Chunk Size (~1000 chars):
    This size keeps text chunks small enough to fit comfortably within typical LLM context windows (e.g., the original GPT-3.5-turbo has a limit of roughly 4,000 tokens). For English text, 1000 characters correspond to roughly 250 tokens (about 4 characters per token), so several retrieved chunks fit alongside the prompt and response tokens during retrieval-augmented generation (RAG). It also preserves meaningful semantic units, usually fitting one or a few paragraphs.

- Overlap (~100 chars):
    Overlapping chunks by about 10% ensures that context spanning chunk boundaries is not lost. This helps the LLM capture continuous meaning when two chunks share important information at their edges (e.g., partial sentences or concepts). Overlap mitigates information fragmentation without creating excessive redundancy.

- Recursive Splitting Logic:
    By default, the RecursiveCharacterTextSplitter tries to split on natural boundaries in hierarchical order: paragraph breaks ("\n\n"), line breaks ("\n"), spaces (" "), and finally individual characters if needed. Sentence-level separators such as ". " are not part of the default list, but they can be supplied through the `separators` argument. This helps maintain semantic coherence inside chunks rather than making arbitrary cuts. If a piece is still too large after splitting on one separator, the splitter moves to the next, finer separator recursively until the chunk size is met.

- Handling Large Documents:
    For very long books with many pages, this chunk size provides manageable retrieval units for vector search, trading off granularity against retrieval efficiency. Chunks that are too large dilute semantic precision and consume more of the context window; chunks that are too small lose context and increase the index size.
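
The recursive strategy described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, the separator tuple is assumed to mirror the library defaults, and overlap handling is left out for brevity.

```python
from typing import List, Sequence

# Assumed to mirror the library's default separators, coarsest first.
DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = DEFAULT_SEPARATORS,
) -> List[str]:
    """Split text into pieces of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring parts while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too large: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the first one.\n\nThree."
chunks = recursive_split(sample, chunk_size=40)
for c in chunks:
    print(len(c), repr(c))
```

With a 40-character limit, the short paragraphs stay whole, while the middle paragraph is too long and falls through to word-level splitting. The real splitter additionally merges small pieces and applies the configured overlap.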


#### In summary

Using chunk_size=1000 and chunk_overlap=100 with the recursive splitter targets natural semantic boundaries, fits well within typical context window limits, and ensures good context preservation across chunks in long PDF documents like books.

This configuration is a practical balance for RAG systems ingesting large, complex documents.
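
To make the context-window reasoning concrete, here is a rough budget calculation. It is a back-of-the-envelope sketch: the ratio of 4 characters per token is a common rule of thumb for English text rather than a measurement, and the 1,000 tokens reserved for the prompt and the answer is a hypothetical figure chosen for illustration.

```python
CHARS_PER_TOKEN = 4.0  # rule-of-thumb assumption for English text


def estimate_tokens(num_chars: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Approximate the token count of a text from its character length."""
    return max(1, round(num_chars / chars_per_token))


def max_chunks_in_context(context_window: int, reserved_tokens: int, chunk_chars: int) -> int:
    """Number of full chunks that fit after reserving room for prompt and answer."""
    available = context_window - reserved_tokens
    return max(0, available // estimate_tokens(chunk_chars))


tokens_per_chunk = estimate_tokens(1000)
chunks_that_fit = max_chunks_in_context(
    context_window=4096,   # original GPT-3.5-turbo limit
    reserved_tokens=1000,  # hypothetical allowance for prompt + answer
    chunk_chars=1000,
)
print(f"~{tokens_per_chunk} tokens per chunk, up to {chunks_that_fit} chunks per request")
```

Under these assumptions a 1000-character chunk costs about 250 tokens, so roughly a dozen chunks could be passed to the model in one request. An actual tokenizer should be used when exact counts matter.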

In [5]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Split documents into manageable chunks
all_chunks = []

for doc in all_docs:
    chunks = splitter.split_documents(doc)
    all_chunks.extend(chunks)

print(f"Split raw documents into {len(all_chunks)} chunks")

Split raw documents into 8929 chunks


### Add Metadata for Traceability

#### Why Add Source Filename Metadata to Each Chunk?

We add metadata to each chunk, such as the source filename.

This metadata helps us track information provenance, supports debugging, and enables targeted queries by source.

When large documents (like PDFs) are split into smaller pieces called chunks for processing and retrieval, it’s important to keep track of where each chunk originally came from.

This metadata serves multiple important purposes:

- Traceability: It helps you know which book, interview guide, or document a particular chunk belongs to. This is useful for debugging and validation.

- Filtering: You can filter or prioritize search results by source if needed (for example, only search chunks from a specific book).

- Context: It gives context to answers generated from these chunks, so you can tell users exactly which document an answer came from.

- Organization: Helps maintain a well-organized index for the vector database, improving retrieval quality.
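
The filtering use case can be illustrated with a small sketch. The `Chunk` dataclass is a hypothetical stand-in for a LangChain Document (text plus a metadata dictionary), and `filter_by_source` is an illustrative helper name, not part of any library.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def filter_by_source(chunks: List[Chunk], filename: str) -> List[Chunk]:
    """Keep only the chunks whose 'source' path ends in the given file name."""
    return [
        c for c in chunks
        if os.path.basename(str(c.metadata.get("source", ""))) == filename
    ]


sample_chunks = [
    Chunk("SELECT basics ...", {"source": "data/SQL Book.pdf", "page": 3}),
    Chunk("JOIN types ...", {"source": "data/SQL Book.pdf", "page": 41}),
    Chunk("Edge detection ...", {"source": "data/Computer Vision.pdf", "page": 12}),
    Chunk("No provenance recorded"),
]

sql_only = filter_by_source(sample_chunks, "SQL Book.pdf")
print([c.metadata["page"] for c in sql_only])
```

In a vector database the same idea is usually expressed as a metadata filter attached to the similarity query, so only chunks from the chosen source are scored.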

In [6]:
# Ensure every chunk carries a "source" entry for traceability
for chunk in all_chunks:
    # PyPDFLoader stores the file path under "source" and chunks inherit it;
    # fall back to "unknown" only when the key is missing
    chunk.metadata.setdefault("source", "unknown")

# Confirm metadata assignment for first few chunks
all_chunks[:3]

[Document(page_content="Generative  AI  for  Everyone  [Free  E-Book]   Natural  Language  Processing  3 Key  areas  of  NLP  include:  3 How  is  NLP  Useful  in  Generative  AI?  3 Examples  of  NLP  in  Generative  AI  Tools  4 AI  systems  and  tools  5 The  AI  layers  6 1.  Supervised  learning  8 2.  Unsupervised  learning  9 3.  Semi-supervised  learning  9 Deep  learning  10 Here's  how  ANNs  work:  10 Introduction  to  generative  AI  13 Generative  AI:  Some  fascinating  metrics  15 How  generative  AI  works  18 ML  model  vs.  gen  AI  model  19 Journey  from  traditional  programming  to  neural  networks  to  generative  AI  20 Large  Language  Models:  Powering  generative  AI  22 How  do  LLMs  work?  22 Different  Large  Language  Models  23 Components  of  an  LLM  25 How  do  LLMs  learn?  26 Building  an  LLM  application  27 LLMs  use  cases  28 Content  creation  28 Education  29 Customer  service  and  support  29 Research  and  development.  29 Entertainment 

### Summarize Dataset Statistics

At this point, we can review key statistics such as total number of documents ingested, total chunks generated, and average chunk size.

These insights inform us about dataset scale and potential enhancements in chunking strategies.

In [7]:
# Calculate and display basic dataset stats
num_docs = len(pdf_files)
num_chunks = len(all_chunks)
avg_chunk_len = sum(len(chunk.page_content) for chunk in all_chunks)/num_chunks

print(f"Number of source PDF documents: {num_docs}")
print(f"Total text chunks created: {num_chunks}")
print(f"Average chunk length (characters): {avg_chunk_len:.2f}")

Number of source PDF documents: 15
Total text chunks created: 8929
Average chunk length (characters): 744.92
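
An average alone can hide many very short chunks (for example, near-empty pages). As a possible extension, the sketch below summarizes the length distribution; `length_summary` and the 100-character threshold are hypothetical choices for illustration, and the sample texts are made up.

```python
from statistics import median
from typing import Dict, List


def length_summary(texts: List[str], short_threshold: int = 100) -> Dict[str, float]:
    """Return min, median, and max length plus the share of very short texts."""
    if not texts:
        raise ValueError("texts must not be empty")
    lengths = [len(t) for t in texts]
    short = sum(1 for n in lengths if n < short_threshold)
    return {
        "min": min(lengths),
        "median": median(lengths),
        "max": max(lengths),
        "short_share": short / len(lengths),
    }


# Made-up sample; in the notebook this would be
# [chunk.page_content for chunk in all_chunks].
sample_texts = ["a" * 50, "b" * 800, "c" * 1000, "d" * 950]
summary = length_summary(sample_texts)
print(summary)
```

A large share of short chunks would suggest filtering them out or revisiting the splitter settings before embedding.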


### Next Steps and Integration

After ingestion and chunking, these text chunks will be embedded and stored in our ChromaDB vector database for hybrid semantic retrieval.

In subsequent notebooks, we will focus on embedding generation, vector store management, multi-agent orchestration, and building a user-facing query interface.

This will help us build a fast, accurate RAG system powered by our curated document collection.

### Save Chunk Data to Disk After Chunking

We serialize the list of chunk objects with pickle so that both texts and metadata can be reloaded later for embedding. JSON or a lightweight custom format would work as well.

In [8]:
import os
import pickle

os.makedirs("research_notebooks/processed", exist_ok=True)

with open("research_notebooks/processed/chunks.pkl", "wb") as f:
    pickle.dump(all_chunks, f)

print("Chunks saved to research_notebooks/processed/chunks.pkl")

Chunks saved to research_notebooks/processed/chunks.pkl
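
Pickle files should only be loaded from trusted sources, and they tie the saved data to the library version that produced the objects. As a portable alternative, the chunks could be written as JSON Lines. The sketch below assumes each chunk has been converted to a plain dictionary whose metadata values are JSON-serializable; the helper names and sample records are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, List


def save_jsonl(records: List[Dict], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path: str) -> List[Dict]:
    """Read the records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records; in the notebook they could be built with
# {"page_content": c.page_content, "metadata": c.metadata} for each chunk c.
records = [
    {"page_content": "Intro to SQL", "metadata": {"source": "data/SQL Book.pdf", "page": 0}},
    {"page_content": "Edge detection", "metadata": {"source": "data/Computer Vision.pdf", "page": 7}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "chunks.jsonl")
    save_jsonl(records, out_path)
    restored = load_jsonl(out_path)

print(f"Round trip preserved {len(restored)} records: {restored == records}")
```

JSON Lines keeps the file human-readable and lets later steps stream chunks one line at a time instead of loading everything at once.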
