# Chatbot Q&A Quranic Reasoning

## Business Understanding

- How much potential does the QRQA Dataset have for building AI-based Islamic digital education products (such as Q&A chatbots, learning apps, or a virtual mufti)?

  _To identify opportunities for derivative products and potential market segments (students, academics, digital pesantren, etc.)._

- Which language model (such as LLaMA, Mistral, DeepSeek, etc.) is best suited for fine-tuning on the QRQA Dataset in terms of speed, accuracy, and cost efficiency?

  _This will be tested in this notebook._

- How can we measure how effectively the model reasons about the complex questions in QRQA?

  _Using evaluation metrics such as BLEU, ROUGE, or a human-evaluated Islamic consistency score._

## Data and Tools Acquisition

In [None]:
!pip install transformers
!pip install kaggle
!pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=e75b1373ec4e16a3905d664f5d034cc5fd22306fbba8e4843eaadb66e49a28ec
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import kagglehub
from kagglehub import KaggleDatasetAdapter
from google.colab import files
import os
import pathlib
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

In [None]:
! mkdir -p ~/.kaggle

In [None]:
!cp /content/drive/MyDrive/CollabData/kaggle_API/kaggle.json ~/.kaggle/kaggle.json

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle datasets download lazer999/quranic-reasoning-synthetic-dataset

Dataset URL: https://www.kaggle.com/datasets/lazer999/quranic-reasoning-synthetic-dataset
License(s): CC0-1.0


In [None]:
! kaggle datasets download alizahidraja/quran-english

Dataset URL: https://www.kaggle.com/datasets/alizahidraja/quran-english
License(s): GPL-2.0


In [None]:
! unzip quranic-reasoning-synthetic-dataset.zip

Archive:  quranic-reasoning-synthetic-dataset.zip
  inflating: Quran_R1_excel.xlsx     


In [None]:
! unzip quran-english.zip

Archive:  quran-english.zip
  inflating: Quran_English.csv       
  inflating: Quran_English_with_Tafseer.csv  


## Data Preparation

In [None]:
file_path = "/content/Quran_R1_excel.xlsx"
df = pd.read_excel(file_path)
df.head()

| | Unnamed: 0 | Question | Complex_CoT | Response |
|---|---|---|---|---|
| 0 | 0 | What is the significance of patience (sabr) in... | Patience (sabr) is a key virtue emphasized in ... | The Quran highlights patience as a sign of str... |
| 1 | 1 | Why do we have to pray five times a day? Would... | The five daily prayers are a fundamental pilla... | The five daily prayers maintain spiritual conn... |
| 2 | 2 | What does the Quran say about friendships? How... | Friendship plays a crucial role in shaping a b... | The Quran advises selecting righteous friends ... |
| 3 | 3 | Why does the Quran emphasize so much on gratit... | Gratitude (shukr) is vital in Islam as it fost... | The Quran underscores gratitude, promising inc... |
| 4 | 4 | How should we deal with disagreements among si... | The Quran encourages resolving sibling dispute... | Sibling disagreements should be resolved with ... |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857 entries, 0 to 856
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   857 non-null    int64 
 1   Question     857 non-null    object
 2   Complex_CoT  857 non-null    object
 3   Response     857 non-null    object
dtypes: int64(1), object(3)
memory usage: 26.9+ KB


The `Unnamed: 0` column is a leftover row index and carries no information, so we drop it.

In [None]:
df = df.drop(columns=['Unnamed: 0'])
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857 entries, 0 to 856
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Question     857 non-null    object
 1   Complex_CoT  857 non-null    object
 2   Response     857 non-null    object
dtypes: object(3)
memory usage: 20.2+ KB


In [None]:
file_path = "/content/Quran_English_with_Tafseer.csv"
df_quran = pd.read_csv(file_path)
df_quran.head()

| | Name | Surah | Ayat | Verse | Tafseer |
|---|---|---|---|---|---|
| 0 | The Opening | 1 | 1 | In the name of Allah, the Beneficent, the Merc... | In the Name of God the Compassionate the Merciful |
| 1 | The Opening | 1 | 2 | Praise be to Allah, Lord of the Worlds, | In the Name of God the name of a thing is that... |
| 2 | The Opening | 1 | 3 | The Beneficent, the Merciful. | The Compassionate the Merciful that is to say ... |
| 3 | The Opening | 1 | 4 | Owner of the Day of Judgment, | Master of the Day of Judgement that is the day... |
| 4 | The Opening | 1 | 5 | Thee (alone) we worship; Thee (alone) we ask f... | You alone we worship and You alone we ask for ... |


In [None]:
df_quran.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6236 entries, 0 to 6235
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Name     6236 non-null   object
 1   Surah    6236 non-null   int64 
 2   Ayat     6236 non-null   int64 
 3   Verse    6236 non-null   object
 4   Tafseer  6235 non-null   object
dtypes: int64(2), object(3)
memory usage: 243.7+ KB


In [None]:
display(df_quran[df_quran['Tafseer'].isnull()])

| | Name | Surah | Ayat | Verse | Tafseer |
|---|---|---|---|---|---|
| 4555 | Muhammad | 47 | 11 | That is because Allah is patron of those who b... | NaN |


One row (Surah 47, Ayat 11) has an empty tafsir. We fill this empty value with synthetic text.

In [None]:
# Fill the single empty 'Tafseer' value with synthetic text
df_quran['Tafseer'] = df_quran['Tafseer'].fillna("This surah emphasizes that Allah is the protector and ally (Mawlā) of those who believe, offering them divine support, guidance, and victory, while the disbelievers are left without any true protector. This verse reassures the believers that despite external challenges or opposition, they are never alone—Allah stands by them in both worldly and spiritual affairs. Conversely, disbelievers, no matter their apparent power or alliances, lack divine backing and are ultimately vulnerable. Revealed in the context of struggle between faith and disbelief, particularly in times of conflict, this verse highlights the importance of trusting in Allah, as real strength and success come through His support, not mere worldly means.")
print(df_quran[df_quran['Tafseer'].isnull()])

Empty DataFrame
Columns: [Name, Surah, Ayat, Verse, Tafseer]
Index: []


## Model Development

We will use the T5 model; see an explanation of it [here](https://medium.com/@gagangupta_82781/understanding-the-t5-model-a-comprehensive-guide-b4d5c02c234b).

For now, we will use only the `df` dataset.

In [None]:
inputt=df['Question'].tolist()
labelt=df['Response'].tolist()

Train-test split (here we split the data 9:1)

In [None]:
train_inputs, test_inputs, train_labels, test_labels = train_test_split(
    inputt, labelt, test_size=0.1, random_state=42
)
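
To make the 9:1 split concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-hold-out split. The helper name `shuffled_split` and the placeholder strings are assumptions for illustration. It does not reproduce the exact permutation that `train_test_split` draws, only the resulting sizes for 857 rows.

```python
import math
import random


def shuffled_split(inputs, labels, test_size=0.1, seed=42):
    """Shuffle paired examples and hold out a `test_size` fraction for testing."""
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels must have the same length")
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = math.ceil(len(indices) * test_size)  # test size is rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([inputs[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([inputs[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# 857 placeholder question/answer pairs, matching the row count of `df`.
questions = [f"question {i}" for i in range(857)]
answers = [f"answer {i}" for i in range(857)]
(train_q, train_a), (test_q, test_a) = shuffled_split(questions, answers)
print(len(train_q), len(test_q))  # 771 86
```

With 857 rows and `test_size=0.1`, the held-out set has 86 examples and the training set 771.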

Load the pretrained tokenizer and model

In [None]:
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



Before training the model, let's tokenize the data.

In [None]:
def tokenize_data(inputs, labels, tokenizer, max_length=128):
    input_encodings = tokenizer(
        list(inputs), max_length=max_length, padding=True, truncation=True, return_tensors="pt"
    )
    label_encodings = tokenizer(
        list(labels), max_length=max_length, padding=True, truncation=True, return_tensors="pt"
    )
    return input_encodings, label_encodings

train_inputs_enc, train_labels_enc = tokenize_data(train_inputs, train_labels, tokenizer)
test_inputs_enc, test_labels_enc = tokenize_data(test_inputs, test_labels, tokenizer)
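
The tokenizer call above pads the label sequences, and those padded positions are later passed to the model as labels. A common practice when fine-tuning with Hugging Face models is to replace padding in the labels with `-100`, which the cross-entropy loss ignores. The sketch below shows the idea on plain Python lists. The helper name `mask_padding` and the token ids are hypothetical, and the pad id `0` is an assumption that matches T5's tokenizer. This notebook does not apply the masking, so the reported results were obtained without it.

```python
PAD_TOKEN_ID = 0      # assumption: the pad token id used by the T5 tokenizer
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips


def mask_padding(label_ids, pad_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Return a copy of the label batch with padding positions set to ignore_index."""
    return [
        [ignore_index if token == pad_id else token for token in row]
        for row in label_ids
    ]


# Two hypothetical label sequences padded to length 5 (1 marks end-of-sequence).
batch = [
    [37, 1566, 19, 1, 0],
    [71, 1, 0, 0, 0],
]
masked = mask_padding(batch)
print(masked)  # [[37, 1566, 19, 1, -100], [71, 1, -100, -100, -100]]
```

On tensors the same step can be written as `labels.masked_fill(labels == tokenizer.pad_token_id, -100)`. Masked labels cannot be decoded directly, so any `-100` values would need to be mapped back to the pad id before calling `tokenizer.decode`.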

In [None]:
class CustomDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels["input_ids"])

    def __getitem__(self, idx):
        return {
            "input_ids": self.encodings["input_ids"][idx],
            "attention_mask": self.encodings["attention_mask"][idx],
            "labels": self.labels["input_ids"][idx],
        }

train_dataset = CustomDataset(train_inputs_enc, train_labels_enc)
test_dataset = CustomDataset(test_inputs_enc, test_labels_enc)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8)

Now let's train the model, using the AdamW optimizer to update its weights.

In [None]:
# AdamW with a small learning rate for fine-tuning
optimizer = AdamW(model.parameters(), lr=5e-6)

# Train on the GPU when one is available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

epochs = 300
for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()

        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Passing `labels` makes the model return the cross-entropy loss
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    # Note: this prints the loss of the last batch in the epoch, not the epoch average
    print(f"Epoch {epoch + 1} Loss: {loss.item()}")

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch 1 Loss: 3.8859550952911377
Epoch 2 Loss: 1.9943588972091675
Epoch 3 Loss: 1.0364328622817993
Epoch 4 Loss: 1.08381187915802
Epoch 5 Loss: 0.8080282211303711
Epoch 6 Loss: 0.7284589409828186
Epoch 7 Loss: 0.7101523876190186
Epoch 8 Loss: 1.8257259130477905
Epoch 9 Loss: 1.4991545677185059
Epoch 10 Loss: 0.5647551417350769
Epoch 11 Loss: 1.475122332572937
Epoch 12 Loss: 0.5308416485786438
Epoch 13 Loss: 1.9766144752502441
Epoch 14 Loss: 0.34073492884635925
Epoch 15 Loss: 0.5684186816215515
Epoch 16 Loss: 1.457788109779358
Epoch 17 Loss: 2.3355212211608887
Epoch 18 Loss: 0.5158972144126892
Epoch 19 Loss: 1.5391343832015991
Epoch 20 Loss: 1.1571787595748901
Epoch 21 Loss: 0.3925011157989502
Epoch 22 Loss: 1.393946647644043
Epoch 23 Loss: 1.120505452156067
Epoch 24 Loss: 0.43720436096191406
Epoch 25 Loss: 1.24997878074646
Epoch 26 Loss: 1.2107274532318115
Epoch 27 Loss: 0.4278765022754669
Epoch 28 Loss: 0.5620697140693665
Epoch 29 Loss: 0.5768057107925415
Epoch 30 Loss: 1.026095390319

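The final epoch reports a loss of 0.13. The log above appears to be cut off after epoch 30 of the 300 configured epochs, so that value is not shown.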
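
The training loop prints `loss.item()` from the last batch of each epoch, which is one reason the logged values jump up and down between epochs. Averaging over all batches in an epoch gives a steadier signal. A minimal sketch, where the helper name `epoch_mean_loss` and the numbers are hypothetical and not taken from the training run:

```python
def epoch_mean_loss(batch_losses):
    """Average the per-batch loss values collected during one epoch."""
    if not batch_losses:
        raise ValueError("no batch losses recorded")
    return sum(batch_losses) / len(batch_losses)


# Illustrative numbers only, not taken from the training run.
batch_losses = [0.9, 0.5, 1.6, 0.4]
print(f"last batch: {batch_losses[-1]:.2f}")
print(f"epoch mean: {epoch_mean_loss(batch_losses):.2f}")
```

In the loop, this would mean appending `loss.item()` to a list for every batch and printing the mean once the epoch ends.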

## Model Testing

In [None]:
model.eval()
for batch in test_loader:
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)

    input_texts = [tokenizer.decode(ids, skip_special_tokens=True) for ids in input_ids]
    true_labels = [tokenizer.decode(label, skip_special_tokens=True) for label in labels]

    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=50
    )
    predictions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

    for input_text, true_label, pred in zip(input_texts, true_labels, predictions):
        print("-" * 50)
        print(f"input_txt: {input_text}")
        print(f"true_label: {true_label}")
        print(f"true_pred: {pred}")

    break

--------------------------------------------------
input_txt: Why does Allah allow bad people to succeed in this world?
true_label: The Quran warns that worldly success is a test: 'Do not be deceived by the prosperity of those who disbelieve' (3:196). True success lies in righteousness.
true_pred: Allah allows bad people to succeed in this world by fostering moral and spiritual growth.
--------------------------------------------------
input_txt: What does the Quran teach about the responsibility of using reason to safeguard one’s faith?
true_label: It teaches that using reason is a fundamental responsibility that protects and strengthens one’s faith.
true_pred: It teaches that using reason to safeguard one’s faith is a duty that requires continual improvement.
--------------------------------------------------
input_txt: What does the Quran teach about handling criticism within the family?
true_label: The Quran encourages using constructive criticism as an opportunity for growth, resp

## Model Evaluation

In [None]:
# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

# `predictions` and `true_labels` come from the previous cell, which stops after
# the first test batch, so the averages below cover only that batch (up to 8 examples)

bleu_scores = []
rouge1_scores = []
rougeL_scores = []

for prediction, true_label in zip(predictions, true_labels):
    # Calculate BLEU score on whitespace tokens (no smoothing applied)
    reference = [true_label.split()]
    candidate = prediction.split()
    bleu_score = sentence_bleu(reference, candidate)
    bleu_scores.append(bleu_score)

    # Calculate ROUGE scores
    scores = scorer.score(true_label, prediction)
    rouge1_scores.append(scores['rouge1'].fmeasure)
    rougeL_scores.append(scores['rougeL'].fmeasure)

# Calculate average scores
avg_bleu = np.mean(bleu_scores)
avg_rouge1 = np.mean(rouge1_scores)
avg_rougeL = np.mean(rougeL_scores)

print(f"Average BLEU Score: {avg_bleu}")
print(f"Average ROUGE-1 Score: {avg_rouge1}")
print(f"Average ROUGE-L Score: {avg_rougeL}")


Average BLEU Score: 0.050027511624029866
Average ROUGE-1 Score: 0.40502670538075736
Average ROUGE-L Score: 0.31920398605095696


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
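
The warnings above appear because BLEU combines the n-gram precisions for n = 1 to 4 as a geometric mean, so a single order with zero overlap drives the whole score to zero unless smoothing (such as the `SmoothingFunction` the warning mentions) is applied. The sketch below counts the clipped overlaps per order in plain Python. The function names and the two sentences are made up for illustration, and this is not NLTK's implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-token window in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_overlaps(reference, candidate, max_n=4):
    """For n = 1..max_n, count candidate n-grams that also occur in the reference."""
    overlaps = []
    for n in range(1, max_n + 1):
        ref_counts = ngrams(reference, n)
        cand_counts = ngrams(candidate, n)
        # Each candidate n-gram is credited at most as often as it appears in the reference.
        overlaps.append(
            sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        )
    return overlaps


reference = "true success lies in righteousness".split()
candidate = "success comes from righteousness".split()
print(clipped_overlaps(reference, candidate))  # [2, 0, 0, 0]
```

Two words match, but no pair of adjacent words does, so an unsmoothed BLEU score for this pair is zero even though the answers are related.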


## Explanation of Each Metric

---

- **BLEU (Bilingual Evaluation Understudy)**

  > Value: 0.050

  BLEU measures how similar the model's generated output is to the reference answer, based on shared n-grams.

  A BLEU score below 0.1 is common in QnA settings, especially for text that is long, reasoning-heavy, or religious in tone, because the structure of an answer can vary widely depending on the question.

  For this model, the BLEU score is relatively **Low**, which indicates that the model's answers are worded very differently from the reference answers, even though their meaning may still be correct.

---

- **ROUGE-1**
  > Value: 0.405

  Measures direct word overlap (unigram overlap) between the model's answer and the reference answer.

  A score above 0.4 is considered **Fairly Good** for a generative QnA task.

---

- **ROUGE-L**
  > Value: 0.319

  Measures similarity in structure or word order (longest common subsequence).

  A score above 0.3 indicates that the model is **Fairly Good** at reproducing part of the sentence structure of the reference answers.
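
To show what the two ROUGE variants measure, here is a small pure-Python sketch of their F-measures on whitespace tokens. It is a simplification, not the `rouge_score` implementation: that package also normalizes text and, with `use_stemmer=True`, stems words. The function names and example sentences are assumptions for illustration.

```python
from collections import Counter


def f_measure(matches, ref_len, cand_len):
    """Harmonic mean of precision (matches/cand_len) and recall (matches/ref_len)."""
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f(reference, candidate):
    """ROUGE-1 F-measure: clipped unigram overlap between the two token lists."""
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    return f_measure(overlap, len(reference), len(candidate))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rougeL_f(reference, candidate):
    """ROUGE-L F-measure: based on the longest common subsequence."""
    return f_measure(lcs_length(reference, candidate), len(reference), len(candidate))


reference = "the quran encourages patience in hardship".split()
candidate = "patience in hardship the quran teaches".split()
print(round(rouge1_f(reference, candidate), 3))  # 0.833
print(round(rougeL_f(reference, candidate), 3))  # 0.5
```

Five of the six words overlap, so ROUGE-1 is high, but the longest run in matching order is only three words, so ROUGE-L is lower. The same pattern, with ROUGE-L below ROUGE-1, shows up in the scores reported above.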

## Model Saving

In [None]:
# Save the model
model_path = "/content/drive/MyDrive/CollabData/QuranicReasoningModel/Model1"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

print(f"Model saved to {model_path}")


Model saved to /content/drive/MyDrive/CollabData/QuranicReasoningModel/Model1
