##### Step 1: Write a function to extract text from the PDF

In [1]:
import fitz #PyMuPDF

def extract_text_from_pdf(pdf_path):
    text = ""
    with fitz.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf):
            page_text = page.get_text()
            text += page_text + "\n"
            print(f"✅ Extracted Page {page_num + 1}")
    return text


In [2]:
pdf_path = r"media\DBMS_Syllabus.pdf"
extracted_text = extract_text_from_pdf(pdf_path)

✅ Extracted Page 1
✅ Extracted Page 2
✅ Extracted Page 3


In [3]:
with open("syllabus.txt", "w", encoding="utf-8") as f:
    f.write(extracted_text)

##### Step 2: Chunking the extracted text

In [4]:
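def chunk_text(text, max_length=500):
    """Greedily pack sentences into chunks of roughly max_length characters."""
    # Naive sentence split; a piece containing no '. ' is never split further,
    # so a chunk can end up longer than max_length.
    sentences = text.split('. ')
    chunks, chunk = [], ""
    for sentence in sentences:
        if len(chunk) + len(sentence) < max_length:
            chunk += sentence + '. '
        else:
            # Skip the append when chunk is still empty (first sentence too long).
            if chunk:
                chunks.append(chunk.strip())
            chunk = sentence + '. '
    if chunk:
        chunks.append(chunk.strip())
    return chunks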

chunks = chunk_text(extracted_text, max_length=500)


In [5]:
print(f"Total Chunks created: {len(chunks)}")
print(f"This is a sample chunk: {chunks[4]}")

Total Chunks created: 8
This is a sample chunk: Review the fundamental view on unstructured data and describe other emerging
database technologies.
Module:1 Database 
Systems 
Concepts 
and 
Architecture 
4 hours 
Need  for  database  systems  – Characteristics  of  Database Approach – Advantages of 
using DBMS approach -  Actors on the Database Management Scene: Database 
Administrator - Classification  of database management systems  -  Data Models -  Schemas 
and Instances - Three-Schema Architecture  -   The  Database  System  Environment - 
Centralized  and  Client/Server  Architectures  for  DBMSs – Overall Architecture of 
Database Management Systems  
Module:2  Relational Model and E-R Modeling 
6 hours 
Relational Model:  Candidate Keys, Primary Keys, Foreign Keys -  Integrity Constraints - 
Handling of Nulls - Entity  Relationship  Model: Types  of  Attributes, Relationships, 
Structural Constraints, Relational model Constraints – Mapping ER model to a relational 
schema – Ex
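
The sample chunk above runs well past 500 characters: `chunk_text` only splits on `'. '`, and text extracted from this syllabus has long stretches without that separator. One possible workaround, sketched below, is to break any oversized piece on whitespace before packing. `split_long` and `chunk_text_capped` are hypothetical names that are not part of the notebook, and the sketch assumes it is acceptable to collapse line breaks into single spaces.

```python
def split_long(piece, max_length):
    """Break one piece into parts of at most max_length characters,
    splitting on whitespace where possible."""
    parts, current = [], ""
    for word in piece.split():
        # A single word longer than max_length is cut into fixed-size slices.
        while len(word) > max_length:
            if current:
                parts.append(current)
                current = ""
            parts.append(word[:max_length])
            word = word[max_length:]
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_length:
            current = candidate
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts


def chunk_text_capped(text, max_length=500):
    """Pack sentences into chunks so that every chunk has len <= max_length."""
    sentences = [s.strip() for s in text.split(". ")]
    chunks, chunk = [], ""
    for i, sentence in enumerate(sentences):
        if not sentence:
            continue
        # Restore the period that split() removed, except after the final piece.
        if i < len(sentences) - 1:
            sentence += "."
        for piece in split_long(sentence, max_length):
            candidate = f"{chunk} {piece}".strip()
            if len(candidate) <= max_length:
                chunk = candidate
            else:
                chunks.append(chunk)
                chunk = piece
    if chunk:
        chunks.append(chunk)
    return chunks
```

Calling `chunk_text_capped(extracted_text, max_length=500)` would be a drop-in replacement for the call above, though the number of chunks (and therefore the outputs shown in later cells) would differ.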

##### Step 3: Creating embeddings for the chunks

In [6]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


In [7]:
embeddings = model.encode(chunks, show_progress_bar=True)

Batches: 100%|██████████| 1/1 [00:00<00:00,  3.49it/s]


In [8]:
import numpy as np

print(f"Embeddings Shape: \n {np.array(embeddings).shape}")

Embeddings Shape: 
 (8, 384)


##### Step 4: Adding the embeddings to a FAISS index for retrieval

First, check whether the index has already been saved to disk:

In [9]:
import faiss
import pickle

# Load FAISS index
index = faiss.read_index("syllabus_index.faiss")
print("✅ FAISS index loaded.")

# Load chunks list
with open("syllabus_chunks.pkl", "rb") as f:
    chunks = pickle.load(f)

print(f"✅ Loaded {len(chunks)} chunks.")

✅ FAISS index loaded.
✅ Loaded 8 chunks.
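
The load in cell `In [9]` raises an error if either file is missing, for example on a first run before Step 5 has saved anything. A small guard like the one below could decide whether to load or rebuild. `artifacts_exist` is a hypothetical helper; the two file names are the ones used in this notebook.

```python
from pathlib import Path


def artifacts_exist(*paths):
    """Return True only if every given path is an existing, non-empty file."""
    return all(Path(p).is_file() and Path(p).stat().st_size > 0 for p in paths)


if artifacts_exist("syllabus_index.faiss", "syllabus_chunks.pkl"):
    print("Saved index and chunks found; load them as in the cell above.")
else:
    print("No saved index yet; build it with the embedding and indexing cells first.")
```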


In [11]:
import faiss

# FAISS requires data in numpy float32 format to build its index.
embeddings_np = np.array(embeddings).astype("float32")

# IndexFlatL2 creates a flat (brute-force) index using L2 (Euclidean) distance for similarity search.
index = faiss.IndexFlatL2(embeddings_np.shape[1])

index.add(embeddings_np)

print(f"Index created and {index.ntotal} items added.")

Index created and 8 items added.
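
To make the "flat (brute-force)" comment concrete, here is a plain-Python sketch of the same idea: compare the query against every stored vector and keep the `k` smallest distances. `brute_force_l2` is a hypothetical name and the toy vectors are made up for illustration. As far as I know, `IndexFlatL2` reports squared L2 distances, so the sketch skips the square root as well; FAISS does the same work far more efficiently.

```python
def brute_force_l2(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to query,
    using squared Euclidean distance and an exhaustive scan."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first; ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Made-up 2-D vectors standing in for the 384-D embeddings.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_distances, toy_indices = brute_force_l2(toy_vectors, query=[1.0, 1.0], k=2)
print(toy_indices, toy_distances)  # prints: [1, 0] [1.0, 2.0]
```

With only 8 chunks an exhaustive scan like this is instant, which is why a flat index is a reasonable choice at this scale.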


##### Step 5: Saving the index and chunks so the same PDF loads faster next time

In [18]:
faiss.write_index(index, "syllabus_index.faiss")
print("✅ FAISS index saved to syllabus_index.faiss")


import pickle

with open("syllabus_chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)

print("✅ Chunks saved to syllabus_chunks.pkl")

✅ FAISS index saved to syllabus_index.faiss
✅ Chunks saved to syllabus_chunks.pkl
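
Since `chunks` is just a list of strings, JSON is one alternative to pickle for the chunk file: it is human-readable, and loading it does not run arbitrary code the way unpickling an untrusted file can. A minimal sketch, where `save_chunks_json`, `load_chunks_json`, and the `syllabus_chunks.json` file name are assumptions rather than part of the notebook:

```python
import json


def save_chunks_json(chunks, path):
    """Write the list of chunk strings to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as the en dash readable.
        json.dump(chunks, f, ensure_ascii=False, indent=2)


def load_chunks_json(path):
    """Read the list of chunk strings back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Example usage (hypothetical file name):
# save_chunks_json(chunks, "syllabus_chunks.json")
# chunks = load_chunks_json("syllabus_chunks.json")
```

The FAISS index itself still needs `faiss.write_index`; this only replaces the pickle half.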


In [19]:
print(f"Index Dimensions: {index.d}")
print(f"Number of vectors stored: {index.ntotal}")

Index Dimensions: 384
Number of vectors stored: 8


##### Step 6: Time to retrieve the relevant chunks

In [10]:
# Retrieve the top-k chunks most similar to the query.
def retrieve(query, k=3):
    # Encode the query into an embedding (FAISS expects float32)
    query_embedding = model.encode([query]).astype("float32")

    # Search the FAISS index for the k nearest chunks
    distances, indices = index.search(query_embedding, k)

    # FAISS pads with -1 when fewer than k vectors exist; skip those slots
    result = [chunks[i] for i in indices[0] if i != -1]

    return result

In [12]:
#Time to test it with a sample query

query = "What is NoSQL"
k = 2

results = retrieve(query, k)

print(f"Top {len(results)} results for your query.\n")
for i, res in enumerate(results, 1):
    # Single quotes inside the f-string keep this valid on Python < 3.12
    print(f"Result {i}: \n{res}\n{'-' * 50}")

Top 2 results for your query.

Result 1: 
Gerardus Blokdyk, NoSQL Databases A Complete Guide, 5STARCooks, 2021 
Mode of Evaluation: CAT, Written assignments, Quiz and FAT. 
Recommended by Board of Studies 
04-03-2022 
Approved by Academic Council 
No. 65 
Date 
17-03-2022 
 
 
 
Agenda Item 65/39 - Annexure - 35
Proceedings of the 65th Academic Council (17.03.2022)
985

.
--------------------------------------------------
Result 2: 
BCSE302L 
Database Systems (3-0-0-3) 
Introduction to Data Models - Various architecture of DBMS - Different Relational Models - 
Entity and relations model - Different types of Normalization – Types of indexing - Hashing 
Techniques -  Query processing - Query optimization techniques - Transaction processing -  
Concurrency control - Introduction to NoSQL databases.
--------------------------------------------------


##### Step 7: Connecting to a local Ollama model

In [13]:
from ollama import Client

client = Client(host='http://localhost:11434')

def generate_answer_ollama(query, retrieved_chunks):
    context = "\n\n".join(retrieved_chunks)

    prompt = f"""
    You are an academic assistant.

    Question: {query}

    You have been provided with the following context, use this to generate a better answer.
    
    Context:
    {context}

    Answer:
    """

    response = client.chat(
        model='llama2:7b',
        messages=[
            {'role': 'user', 'content': prompt}
        ]
    )
    return response['message']['content']
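
Two things about the prompt above are worth a second look: the triple-quoted f-string sends its leading indentation to the model, and nothing tells the model what to do when the context does not cover the question. The sketch below shows one way to build the prompt separately so it can be inspected before calling Ollama. `build_prompt` is a hypothetical helper, and the wording of the fallback instruction is my assumption, not something this notebook tested.

```python
def build_prompt(query, retrieved_chunks):
    """Assemble the prompt text from the question and the retrieved chunks."""
    # Number each chunk so an answer can be traced back to its source.
    context = "\n\n".join(
        f"[{n}] {chunk.strip()}" for n, chunk in enumerate(retrieved_chunks, 1)
    )
    lines = [
        "You are an academic assistant.",
        "Answer the question using the context below.",
        "If the context does not contain the answer, say so.",
        "",
        "Context:",
        context,
        "",
        f"Question: {query}",
        "",
        "Answer:",
    ]
    return "\n".join(lines)


print(build_prompt("What is NoSQL", ["Introduction to NoSQL databases."]))
```

The returned string could be passed as the `content` of the user message in `generate_answer_ollama` in place of the inline f-string.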

In [15]:
query = "What is CAP Theorem in NoSQL"
results = retrieve(query, k=1)
answer = generate_answer_ollama(query, results)

print(answer)


The CAP Theorem, also known as the Consistency, Availability, and Partition Tolerance theorem, is a fundamental concept in NoSQL databases. It was first introduced by Jim Gray and Michael Stonebraker in 1978 and has since become a cornerstone of modern distributed systems.

The CAP Theorem states that in a distributed database system, it is impossible to simultaneously achieve all three of the following properties:

1. Consistency: Every client request receives a consistent view of the data.
2. Availability: The system is always available to process requests, even in the presence of failures.
3. Partition Tolerance: The system continues to function and serve requests even when there are network partitions (i.e., some nodes in the system cannot communicate with each other).

In other words, the CAP Theorem dictates that a distributed database system can only choose two out of these three properties to prioritize. For example, a system might prioritize consistency over availability, or 