# 1. Sourcing PDF

PDF link: https://www.ashtoncentralmosque.com/app/uploads/2014/07/the-quran-with-annotated-interpretation-in-modern-english-ali-unal.pdf


In [1]:
import os
import requests

pdf_path = "the-quran-with-annotated-interpretation-in-modern-english-ali-unal.pdf"

if not os.path.exists(pdf_path):
    print("File does not exist, downloading...")

    url = "https://www.ashtoncentralmosque.com/app/uploads/2014/07/the-quran-with-annotated-interpretation-in-modern-english-ali-unal.pdf"

    response = requests.get(url, timeout=60)  # response.content holds the file as bytes

    if response.status_code == 200:
        with open(pdf_path, "wb") as file:
            file.write(response.content)
        print("File downloaded successfully.")
    else:
        print(f"Failed to download the file (status code {response.status_code}).")
else:
    print("File already exists.")

File already exists.


# 2. Extracting text from PDF

In [2]:
!pip install pdfplumber

import re
import pdfplumber
import nltk
import pandas as pd
import torch
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab')

pdf_path = "/content/the-quran-with-annotated-interpretation-in-modern-english-ali-unal.pdf"

def remove_inline_numbers(text: str) -> str:
    """
    Remove citation-like numbers that follow a period or ')',
    including any whitespace between the punctuation and the digits.
    Example: 'repentance).22 God' -> 'repentance). God'
    """
    pattern = r'([.)])\s*\d+'
    return re.sub(pattern, r'\1', text)

chunks = []  # list of dicts: {"page_number": int, "chunks_text": list[str]}

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        page_number = page.page_number
        raw_text = page.extract_text() or ""

        cleaned_text = remove_inline_numbers(raw_text)
        sentences = sent_tokenize(cleaned_text)

        page_chunks = []
        current_chunk = []

        for i, sent in enumerate(sentences, start=1):
            current_chunk.append(sent)
            if i % 10 == 0:
                page_chunks.append(" ".join(current_chunk))
                current_chunk = []

        if current_chunk:
            page_chunks.append(" ".join(current_chunk))

        chunks.append({
            "page_number": page_number,
            "chunks_text": page_chunks
        })

chunks_df = pd.DataFrame(chunks)

chunks_df = chunks_df.explode("chunks_text", ignore_index=True)

chunks_df["chunk_id_in_page"] = (
    chunks_df.groupby("page_number").cumcount() + 1
)



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
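The pattern in `remove_inline_numbers` allows whitespace between the punctuation and the digits, so it can also strip decimals and numbers that open the next sentence. The sketch below reproduces the notebook's pattern and compares it with a stricter variant. `remove_inline_numbers_strict` and the sample strings are hypothetical, added only for illustration; whether the stricter pattern suits this PDF is untested.

```python
import re

def remove_inline_numbers(text: str) -> str:
    # Same pattern as the notebook: digits after '.' or ')', whitespace allowed in between.
    return re.sub(r'([.)])\s*\d+', r'\1', text)

def remove_inline_numbers_strict(text: str) -> str:
    # Hypothetical stricter variant: the digits must touch the '.' or ')', and the
    # character before that punctuation must be neither a digit nor whitespace.
    return re.sub(r'(?<=[^\d\s][.)])\d+', '', text)

samples = [
    "repentance).22 God",        # footnote marker
    "about 3.5 percent",         # decimal number
    "See verse 4. 12 men came",  # number that starts the next sentence
]
for s in samples:
    print(repr(remove_inline_numbers(s)), "|", repr(remove_inline_numbers_strict(s)))
```

Both variants remove the footnote marker in the first sample; only the notebook's pattern alters the other two.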


In [3]:
print(chunks_df.tail(15))

      page_number                                        chunks_text  \
2935         1317  1317 Glossary of Terms\n(ash-)Sharī‘at at-Takw...   
2936         1317  If this Prophetic description is figurative, i...   
2937         1318  Glossary of Terms 1318\nSubhānallāh: All-Glori...   
2938         1318  It has two aspects, one for the life of all cr...   
2939         1319  1319 Glossary of Terms\n(as-)Sūrah: An indepen...   
2940         1320  Glossary of Terms 1320\ncan be accomplished th...   
2941         1321  1321 Glossary of Terms\nservice\n[J14] at-)Tas...   
2942         1322  Glossary of Terms 1322\nable meanings. (at-)Ta...   
2943         1323  1323 Glossary of Terms\nand at-tawāf, and as-s...   
2944         1324  Glossary of Terms 1324\n(al-)Yahūd: The Jews. ...   
2945         1324  The Qur’ān uses the word “day” not only in the...   
2946         1325  1325 Glossary of Terms\nback to me,” meaning h...   
2947         1325  Having a very wide area of usage, in the term
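The ten-sentence grouping used in the extraction loop can be factored into a small helper, which also makes it easy to experiment with overlapping chunks. This is a minimal sketch: `chunk_sentences` and its `overlap` parameter are assumptions for illustration and are not part of the notebook, which uses no overlap.

```python
def chunk_sentences(sentences, size=10, overlap=0):
    """Group sentences into chunks of `size`, sharing `overlap` sentences between neighbours."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last window already reached the end of the page
    return chunks

sentences = [f"S{i}." for i in range(1, 24)]   # 23 dummy sentences
plain = chunk_sentences(sentences)               # same grouping as the notebook loop
overlapped = chunk_sentences(sentences, size=10, overlap=2)
print(len(plain), len(overlapped))
print(overlapped[1])
```

With `overlap=0` the helper yields the same chunks as the original loop, including an empty list for a page with no sentences.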

- Embeddings model: https://huggingface.co/sentence-transformers/all-mpnet-base-v2
- Base transformer: microsoft/mpnet-base (MPNet encoder)
- Maps sentences and paragraphs to a 768-dimensional dense vector space
- Model size: about 0.1B params
- Fine-tuned on more than 1 billion sentence pairs drawn from many datasets, including these three:
- Dataset 1 link: https://huggingface.co/datasets/mandarjoshi/trivia_qa
- Dataset 2 link: https://huggingface.co/datasets/stanfordnlp/snli
- Dataset 3 link: https://huggingface.co/datasets/google-research-datasets/natural_questions

In [4]:
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"
embedding_model = SentenceTransformer("all-mpnet-base-v2", device=device)

In [5]:
!nvidia-smi

Tue Dec  9 10:36:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   48C    P0             26W /   70W |     574MiB /  15360MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [6]:
# encode all chunk texts as a list
chunk_texts = chunks_df["chunks_text"].tolist()
embeddings = embedding_model.encode(chunk_texts, convert_to_numpy=True)  # or convert_to_tensor=True

# add as new column (store as list so DataFrame can handle it)
chunks_df["embeddings"] = embeddings.tolist()

print(chunks_df.head())
print(chunks_df.iloc[0]["embeddings"][:10])  # first 10 dims of first chunk vector

   page_number                                        chunks_text  \
0            1                                                NaN   
1            2  THE QUR’AN\nwith\nAnnotated Interpretation in ...   
2            3                www.mquran.org\nwww.theholybook.org   
3            4  Contents\nForeword ..............................   
4            4  Yunus (Jonah) ...................................   

   chunk_id_in_page                                         embeddings  
0                 1  [-0.02318713814020157, 0.05149746313691139, -0...  
1                 1  [0.027168413624167442, 0.0320991687476635, 0.0...  
2                 1  [0.017434922978281975, 0.06314095854759216, -0...  
3                 1  [0.01651417650282383, -0.013219342567026615, -...  
4                 2  [0.022808723151683807, 0.05383909493684769, -0...  
[-0.02318713814020157, 0.05149746313691139, -0.002392231021076441, -0.008844197727739811, -0.01957680843770504, 0.024297600612044334, 0.02342279069
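Row 0 in the output above has `NaN` as its text (page 1 yielded no extractable text, and `explode` turns an empty list into `NaN`), yet it still receives an embedding that represents no page content. One option is to drop such rows before encoding. The sketch below shows the filter on plain dictionaries; `is_empty_chunk` and the sample rows are hypothetical.

```python
import math

def is_empty_chunk(text) -> bool:
    """True for None, NaN, or whitespace-only text."""
    if text is None:
        return True
    if isinstance(text, float) and math.isnan(text):
        return True
    return not str(text).strip()

rows = [
    {"page_number": 1, "chunks_text": float("nan")},  # like row 0 above
    {"page_number": 3, "chunks_text": "www.mquran.org\nwww.theholybook.org"},
    {"page_number": 5, "chunks_text": "   "},          # made-up blank chunk
]
kept = [r for r in rows if not is_empty_chunk(r["chunks_text"])]
print([r["page_number"] for r in kept])
```

In the notebook this could be applied as `chunks_df = chunks_df[~chunks_df["chunks_text"].map(is_empty_chunk)].reset_index(drop=True)` before calling `encode`; the chunk count and index size would then be smaller than the figures shown in the outputs.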

# 3. FAISS vector database

- GitHub link: https://github.com/facebookresearch/faiss
- FAISS documentation: https://faiss.ai
- For deeper understanding of FAISS: https://www.datacamp.com/blog/faiss-facebook-ai-similarity-search

In [7]:
!pip install faiss-cpu

import numpy as np
import faiss

emb_matrix = np.vstack(chunks_df["embeddings"].values).astype("float32")
print(emb_matrix.shape)  # (num_chunks, 768)

faiss.normalize_L2(emb_matrix)

d = emb_matrix.shape[1]  # embedding dimension (768 for all-mpnet-base-v2)

index = faiss.IndexFlatIP(d)  # exact search, inner product
print("Is trained:", index.is_trained)

index.add(emb_matrix)
print("Index size (ntotal):")
print(index.ntotal)

(2950, 768)
Is trained: True
Index size (ntotal):
2950
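Because the vectors are L2-normalized before being added to `IndexFlatIP`, the inner product that the index computes equals cosine similarity. The pure-Python sketch below illustrates that idea with a brute-force search over toy vectors. The helper names and the 3-dimensional vectors are made up for illustration; this is not how FAISS is implemented internally.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k_inner_product(query, matrix, k):
    """Exact search: score every row by inner product, return the k best (score, row_index)."""
    scores = [(sum(q * x for q, x in zip(query, row)), i) for i, row in enumerate(matrix)]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return scores[:k]

# Toy 3-dimensional "embeddings"; the real ones have 768 dimensions.
docs = [[1.0, 0.5, 0.0], [3.0, 3.0, 0.0], [0.0, 0.0, 5.0]]
query = [2.0, 1.0, 0.0]

raw = top_k_inner_product(query, docs, k=2)
cosine = top_k_inner_product(l2_normalize(query), [l2_normalize(d) for d in docs], k=2)

print("raw inner product ranking:", [i for _, i in raw])
print("cosine ranking:           ", [i for _, i in cosine])
```

Without normalization the longer vector (row 1) wins on magnitude alone; after normalization row 0, which points in the same direction as the query, ranks first.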


In [8]:
!nvidia-smi

Tue Dec  9 10:37:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   69C    P0             29W /   70W |    1540MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                


In [None]:
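from getpass import getpass
from huggingface_hub import login
import os

# Read the token from the environment instead of hard-coding it in the notebook;
# fall back to an interactive prompt if it is not set.
HF_TOKEN = os.environ.get("HF_TOKEN") or getpass("Hugging Face token: ")
os.environ["HF_TOKEN"] = HF_TOKEN
login(token=HF_TOKEN)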

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


# 4. Model (Google Gemma 2 2B-it)
- Model link: https://huggingface.co/google/gemma-2-2b-it
- Base model "google/gemma-2-2b" link: https://huggingface.co/google/gemma-2-2b
- Model size: about 2.6B params

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

HF_TOKEN = os.environ["HF_TOKEN"]  # reuse the token set in the login cell instead of hard-coding it again

model_id = "google/gemma-2-2b-it"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    token=HF_TOKEN,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=HF_TOKEN,
    quantization_config=quant_config,
    device_map="auto",          # send weights directly to GPU
    low_cpu_mem_usage=True,     # avoid big CPU fp32 copy
)

model

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear4bit(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2304, bias=False)
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear4bit(in_features=9216, out_features=2304, bias=False)
          (act_fn): GELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (post_attention_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (pre_feedfor

In [11]:
!nvidia-smi

Tue Dec  9 10:38:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   67C    P0             29W /   70W |    4696MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
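The three `nvidia-smi` outputs in this notebook can be compared to see roughly how much GPU memory each stage adds. The sketch below does that arithmetic and also estimates the size of Gemma's embedding table, which the model printout shows as a plain `Embedding` rather than a `Linear4bit` layer. The stage labels assume the cells ran in the order shown, and the float16 dtype of the embedding table is an assumption; the reported figures also include CUDA context and allocator caching, so the deltas are only indicative.

```python
# Memory-Usage readings (MiB) copied from the three nvidia-smi outputs in this notebook.
readings = [
    ("after loading the embedding model", 574),
    ("after encoding the chunks", 1540),
    ("after loading Gemma in 4-bit", 4696),
]

# Difference between each reading and the one before it.
deltas = [(label, cur - prev) for (_, prev), (label, cur) in zip(readings, readings[1:])]
for label, delta in deltas:
    print(f"{label}: +{delta} MiB")

vocab_size, hidden_size = 256000, 2304  # from Embedding(256000, 2304) in the model printout
bytes_per_param = 2                     # assumption: embedding weights kept in float16
embed_mib = vocab_size * hidden_size * bytes_per_param / 2**20
print(f"embedding table estimate: {embed_mib:.0f} MiB")
```

Under these assumptions, the unquantized embedding table alone would account for a sizeable share of the increase seen after loading the model.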
                                                

In [12]:
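def retrieve_top_k(query, k=8):
    """Return the texts of the k chunks most similar to the query (cosine similarity)."""
    q_emb = embedding_model.encode([query], convert_to_numpy=True).astype("float32")

    # Normalize the query so that inner product equals cosine similarity, matching the index.
    faiss.normalize_L2(q_emb)

    scores, idxs = index.search(q_emb, k)

    contexts = []
    for i in idxs[0]:
        if i == -1:  # FAISS pads with -1 when fewer than k results are available
            continue
        contexts.append(chunks_df.iloc[i]["chunks_text"])

    return contexts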

In [13]:
def build_rag_prompt(question, contexts):
    context_block = "\n\n".join(contexts)
    return (
        "You are an assistant answering questions about the Qur'an.\n"
        "Use ONLY the context below. If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer clearly in English."
    )
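The prompt above passes the retrieved chunks without any indication of where they came from. A possible variation labels each chunk with its page number and caps the total context length. This is a sketch under assumptions: `build_rag_prompt_with_sources`, the `(page_number, text)` input shape, the `max_chars` budget, and the extra citation instruction are all hypothetical and not part of the notebook.

```python
def build_rag_prompt_with_sources(question, sourced_contexts, max_chars=6000):
    """sourced_contexts: list of (page_number, text) pairs, best match first."""
    blocks, used = [], 0
    for page_number, text in sourced_contexts:
        block = f"[page {page_number}]\n{text}"
        if blocks and used + len(block) > max_chars:
            break  # keep the best matches, drop the rest once the budget is spent
        blocks.append(block)
        used += len(block)
    context_block = "\n\n".join(blocks)
    return (
        "You are an assistant answering questions about the Qur'an.\n"
        "Use ONLY the context below and cite the page labels you relied on. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer clearly in English."
    )

demo = build_rag_prompt_with_sources(
    "What is a surah?",
    [(1319, "first retrieved chunk (placeholder text)"), (1317, "x" * 7000)],
)
print(demo)
```

The first chunk is always kept, even if it alone exceeds the budget, so the prompt never ends up with an empty context.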

In [16]:
def answer_question(question, k=8, max_new_tokens=256):
    contexts = retrieve_top_k(question, k=k)
    user_prompt = build_rag_prompt(question, contexts)

    messages = [
        {"role": "user", "content": user_prompt},
    ]

    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
        )

    generated = outputs[0][input_ids.shape[-1]:]
    text = tokenizer.decode(generated, skip_special_tokens=True)
    return text.strip()

In [45]:
question = "What does the Qur'an say about the Day of Judgement?"
print(answer_question(question))

The Quran describes the Day of Judgment as a day of ultimate reckoning and consequence. 

Here are some key points from the verses you provided:

* **Silence and No Excuses:**  On Judgment Day, people will be unable to speak or offer excuses (verse 35).
* **Diverse Manifestations:** The Day of Judgment will not be a single, uniform event.  People will experience different types of punishment and reward, including Hell and Paradise (verses 36, 45, 46, 47). 
* **Unseen Judgment:** The Quran emphasizes that the Day of Judgment is not a spectacle but a time for individual accountability.  People will face their deeds and be judged by God (verses 36, 45, 46, 47).
* **Universal Gathering:** The Day of Judgment will see all humanity, including the jinn, gathered together (verse 36, 47).
* **The Trumpet and the Last Hour:** The Day of Judgment will be heralded by the sound of the Trumpet, and the Last Hour will be a time of intense struggle and punishment for the disbelievers (verse 753).
* **

In [39]:
question = "What if someone does not fast in the month of Ramadan?"
print(answer_question(question))

The context provides several options for those who cannot fast in Ramadan. 

* **They can feed a person in need.** This is a form of redemption (penance) for those who are unable to fast. 
* **They can make up for the missed days.** The text states they must make up the number of days they missed during Ramadan.
* **They can continue with supplicatory prayer.** This refers to prayers during the month of Ramadan.
* **They must not be on a journey or be ill.** These are exceptions to the general rule.


The context emphasizes that it's important to make up for missed days and that fasting is beneficial.
