# Chatbot Q&A Quranic Reasoning

## Business Understanding

- How can the QRQA Dataset be used to develop AI-based Islamic digital education products (such as Q&A chatbots, learning applications, or a virtual mufti)?

  _To identify opportunities for derivative products and potential market segments (students, academics, digital pesantren, etc.)._

- Which language model (such as LLaMA, Mistral, DeepSeek, etc.) is best suited for fine-tuning on the QRQA Dataset in terms of speed, accuracy, and cost efficiency?

  _To be tested in this notebook._

- How can the effectiveness of the model's reasoning be measured on the complex questions in QRQA?

  _Using evaluation metrics such as BLEU, ROUGE, or a human-evaluated Islamic consistency score._

## Data and Tools Acquisition

In [1]:
!pip install transformers
!pip install kaggle
!pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=c41b3d182d4b3ec82721f4d7e0721bcb64c0b2b5bcfda287f626782d6c119ef7
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import kagglehub
from kagglehub import KaggleDatasetAdapter
from google.colab import files
import os
import pathlib
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

In [3]:
# ! mkdir ~/.kaggle

In [4]:
# !cp /content/drive/MyDrive/CollabData/kaggle_API/kaggle.json ~/.kaggle/kaggle.json

In [5]:
# ! chmod 600 ~/.kaggle/kaggle.json

In [6]:
# ! kaggle datasets download lazer999/quranic-reasoning-synthetic-dataset

In [7]:
# ! kaggle datasets download alizahidraja/quran-english

In [8]:
# ! unzip quranic-reasoning-synthetic-dataset.zip

In [9]:
# ! unzip quran-english.zip

## Data Preparation

In [10]:
file_path = "/kaggle/input/quranic-reasoning-synthetic-dataset/Quran_R1_excel.xlsx"
df = pd.read_excel(file_path)
df.head()

Unnamed: 0.1,Unnamed: 0,Question,Complex_CoT,Response
0,0,What is the significance of patience (sabr) in...,Patience (sabr) is a key virtue emphasized in ...,The Quran highlights patience as a sign of str...
1,1,Why do we have to pray five times a day? Would...,The five daily prayers are a fundamental pilla...,The five daily prayers maintain spiritual conn...
2,2,What does the Quran say about friendships? How...,Friendship plays a crucial role in shaping a b...,The Quran advises selecting righteous friends ...
3,3,Why does the Quran emphasize so much on gratit...,Gratitude (shukr) is vital in Islam as it fost...,"The Quran underscores gratitude, promising inc..."
4,4,How should we deal with disagreements among si...,The Quran encourages resolving sibling dispute...,Sibling disagreements should be resolved with ...


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857 entries, 0 to 856
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   857 non-null    int64 
 1   Question     857 non-null    object
 2   Complex_CoT  857 non-null    object
 3   Response     857 non-null    object
dtypes: int64(1), object(3)
memory usage: 26.9+ KB


The `Unnamed: 0` column appears to be a leftover row index and carries no useful information, so we drop it.

In [12]:
df = df.drop(columns=['Unnamed: 0'])
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857 entries, 0 to 856
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Question     857 non-null    object
 1   Complex_CoT  857 non-null    object
 2   Response     857 non-null    object
dtypes: object(3)
memory usage: 20.2+ KB


Next, we load the second dataset: the English Quran translation with tafseer.

In [13]:
file_path = "/kaggle/input/quran-english/Quran_English_with_Tafseer.csv"
df_quran = pd.read_csv(file_path)
df_quran.head()

Unnamed: 0,Name,Surah,Ayat,Verse,Tafseer
0,The Opening,1,1,"In the name of Allah, the Beneficent, the Merc...",In the Name of God the Compassionate the Merciful
1,The Opening,1,2,"Praise be to Allah, Lord of the Worlds,",In the Name of God the name of a thing is that...
2,The Opening,1,3,"The Beneficent, the Merciful.",The Compassionate the Merciful that is to say ...
3,The Opening,1,4,"Owner of the Day of Judgment,",Master of the Day of Judgement that is the day...
4,The Opening,1,5,Thee (alone) we worship; Thee (alone) we ask f...,You alone we worship and You alone we ask for ...


In [14]:
df_quran.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6236 entries, 0 to 6235
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Name     6236 non-null   object
 1   Surah    6236 non-null   int64 
 2   Ayat     6236 non-null   int64 
 3   Verse    6236 non-null   object
 4   Tafseer  6235 non-null   object
dtypes: int64(2), object(3)
memory usage: 243.7+ KB


In [15]:
display(df_quran[df_quran['Tafseer'].isnull()])

Unnamed: 0,Name,Surah,Ayat,Verse,Tafseer
4555,Muhammad,47,11,That is because Allah is patron of those who b...,


One row (Surah 47, Ayat 11) has no tafseer. We fill this missing value with synthetic text.

In [16]:
# Fill the missing 'Tafseer' value with synthetic text
df_quran['Tafseer'] = df_quran['Tafseer'].fillna("This surah emphasizes that Allah is the protector and ally (Mawlā) of those who believe, offering them divine support, guidance, and victory, while the disbelievers are left without any true protector. This verse reassures the believers that despite external challenges or opposition, they are never alone—Allah stands by them in both worldly and spiritual affairs. Conversely, disbelievers, no matter their apparent power or alliances, lack divine backing and are ultimately vulnerable. Revealed in the context of struggle between faith and disbelief, particularly in times of conflict, this verse highlights the importance of trusting in Allah, as real strength and success come through His support, not mere worldly means.")
print(df_quran[df_quran['Tafseer'].isnull()])

Empty DataFrame
Columns: [Name, Surah, Ayat, Verse, Tafseer]
Index: []


### Data Merging

Before developing the model, let's merge `df_quran` with `df`.

In [17]:
# Create the first template
df_quran['Question'] = "Question: What is the meaning of Surah " + df_quran['Surah'].astype(str) + ":" + df_quran['Ayat'].astype(str) + "?"
df_quran['Response'] = "Response: \nVerse:\n" + df_quran['Verse'] + ", " + df_quran['Tafseer']

# Create the second template and append it to the first dataframe
df_quran_2 = pd.DataFrame()
df_quran_2['Question'] = "Question: What is the meaning of Surah " + df_quran['Name'] + ":" + df_quran['Ayat'].astype(str) + "?"
df_quran_2['Response'] = "Response: \nVerse:\n" + df_quran['Verse'] + ", " + df_quran['Tafseer']

df_quran = pd.concat([df_quran, df_quran_2], ignore_index=True)

# Select only the relevant columns for merging
df_quran = df_quran[['Question', 'Response']]

# Concatenate the two dataframes
merged_df = pd.concat([df, df_quran], ignore_index=True)
merged_df.head()


Unnamed: 0,Question,Complex_CoT,Response
0,What is the significance of patience (sabr) in...,Patience (sabr) is a key virtue emphasized in ...,The Quran highlights patience as a sign of str...
1,Why do we have to pray five times a day? Would...,The five daily prayers are a fundamental pilla...,The five daily prayers maintain spiritual conn...
2,What does the Quran say about friendships? How...,Friendship plays a crucial role in shaping a b...,The Quran advises selecting righteous friends ...
3,Why does the Quran emphasize so much on gratit...,Gratitude (shukr) is vital in Islam as it fost...,"The Quran underscores gratitude, promising inc..."
4,How should we deal with disagreements among si...,The Quran encourages resolving sibling dispute...,Sibling disagreements should be resolved with ...
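
To make the two question templates concrete, here is a minimal, pandas-free sketch that builds both variants for a single verse. The function name `make_pairs` and the dictionary layout are illustrative assumptions, not part of the notebook; the strings follow the templates used in the cell above.

```python
def make_pairs(row):
    """Build the two (question, response) pairs for one verse record."""
    response = "Response: \nVerse:\n" + row["Verse"] + ", " + row["Tafseer"]
    by_number = f"Question: What is the meaning of Surah {row['Surah']}:{row['Ayat']}?"
    by_name = f"Question: What is the meaning of Surah {row['Name']}:{row['Ayat']}?"
    return [(by_number, response), (by_name, response)]

# Hypothetical record with placeholder tafseer text, mirroring the df_quran columns.
verse = {"Name": "The Opening", "Surah": 1, "Ayat": 2,
         "Verse": "Praise be to Allah, Lord of the Worlds,",
         "Tafseer": "(tafseer text)"}
pairs = make_pairs(verse)
for question, _ in pairs:
    print(question)
```

Each verse therefore contributes two rows with the same response, so the 6,236 verses yield 12,472 question-response pairs, and `merged_df` holds 857 + 12,472 = 13,329 rows.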


## Model Development

We will use the T5 model; for background on the architecture, see the explanation [here](https://medium.com/@gagangupta_82781/understanding-the-t5-model-a-comprehensive-guide-b4d5c02c234b).

In [18]:
inputt = merged_df['Question'].tolist()
labelt = merged_df['Response'].tolist()

Train-test split: we split the data 9:1 and use only the rows that come from `df` (the first 857 entries of the merged data).

In [19]:
train_inputs, test_inputs, train_labels, test_labels = train_test_split(inputt[:857], labelt[:857], test_size=0.1, random_state=42)
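
As a sanity check on the 9:1 split, the resulting sizes can be worked out by hand. The sketch below assumes that scikit-learn rounds the test share up for a float `test_size` (an assumption about its implementation, worth confirming against the installed version) and uses the batch size of 8 that this notebook passes to its `DataLoader`.

```python
import math

n_rows = 857       # rows taken from df
test_size = 0.1
batch_size = 8     # value used for the DataLoader in this notebook

n_test = math.ceil(n_rows * test_size)   # assumed: test share is rounded up
n_train = n_rows - n_test
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)

print(n_train, n_test, train_batches, test_batches)  # 771 86 97 11
```

Under these assumptions, one epoch is 97 optimizer steps, and the held-out set has 86 questions spread over 11 batches.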


Let's load the tokenizer and the pretrained model we will use, in this case T5 (`t5-base`).

In [20]:
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



Before training the model, let's tokenize the data.

In [21]:
def tokenize_data(inputs, labels, tokenizer, max_length=128):
    input_encodings = tokenizer(
        list(inputs), max_length=max_length, padding=True, truncation=True, return_tensors="pt"
    )
    label_encodings = tokenizer(
        list(labels), max_length=max_length, padding=True, truncation=True, return_tensors="pt"
    )
    return input_encodings, label_encodings

train_inputs_enc, train_labels_enc = tokenize_data(train_inputs, train_labels, tokenizer)
test_inputs_enc, test_labels_enc = tokenize_data(test_inputs, test_labels, tokenizer)

In [22]:
class CustomDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels["input_ids"])

    def __getitem__(self, idx):
        return {
            "input_ids": self.encodings["input_ids"][idx],
            "attention_mask": self.encodings["attention_mask"][idx],
            "labels": self.labels["input_ids"][idx],
        }

train_dataset = CustomDataset(train_inputs_enc, train_labels_enc)
test_dataset = CustomDataset(test_inputs_enc, test_labels_enc)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8)
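
One detail worth checking: with `padding=True`, the label tensors contain pad tokens, and the training loss is computed over those positions as well. A common practice with Hugging Face seq2seq models is to replace pad positions in the labels with `-100`, which the cross-entropy loss ignores. The sketch below shows the idea on plain Python lists; the helper name `mask_pad_tokens` is hypothetical, and `pad_id=0` is T5's pad token id. On tensors, the equivalent would be `labels[labels == tokenizer.pad_token_id] = -100`. This is an optional change, not what the notebook ran.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_pad_tokens(label_ids, pad_id=0):
    """Replace pad token ids with IGNORE_INDEX so they do not count toward the loss."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in label_ids]

# Two hypothetical padded label sequences (token ids are made up; 1 is T5's end-of-sequence id).
padded = [[37, 1712, 1, 0, 0],
          [8, 3, 9, 55, 1]]
masked = mask_pad_tokens(padded)
print(masked)
```

Only the trailing pad positions of the shorter sequence change; real tokens and the end-of-sequence token are kept.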

Now let's train the model, using the AdamW optimizer to update its weights.

In [23]:
optimizer = AdamW(model.parameters(), lr=5e-6)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

epochs = 500
for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()

        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1} Loss: {loss.item()}")

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch 1 Loss: 4.026292324066162
Epoch 2 Loss: 1.4067541360855103
Epoch 3 Loss: 1.6838067770004272
Epoch 4 Loss: 1.6578739881515503
Epoch 5 Loss: 0.8528079986572266
Epoch 6 Loss: 1.7591357231140137
Epoch 7 Loss: 0.721689760684967
Epoch 8 Loss: 2.1780614852905273
Epoch 9 Loss: 0.6551007628440857
Epoch 10 Loss: 0.4917037785053253
Epoch 11 Loss: 0.5474380254745483
Epoch 12 Loss: 1.54666268825531
Epoch 13 Loss: 0.4773276150226593
Epoch 14 Loss: 0.5948296189308167
Epoch 15 Loss: 0.5525073409080505
Epoch 16 Loss: 0.5228738188743591
Epoch 17 Loss: 0.39165809750556946
Epoch 18 Loss: 2.482361316680908
Epoch 19 Loss: 0.5824633836746216
Epoch 20 Loss: 1.4919103384017944
Epoch 21 Loss: 1.4694604873657227
Epoch 22 Loss: 0.7610399127006531
Epoch 23 Loss: 1.3244988918304443
Epoch 24 Loss: 0.42754021286964417
Epoch 25 Loss: 0.4092494547367096
Epoch 26 Loss: 1.1170042753219604
Epoch 27 Loss: 0.630314290523529
Epoch 28 Loss: 0.6510558128356934
Epoch 29 Loss: 0.42253318428993225
Epoch 30 Loss: 0.527846395

The final epoch reports a loss of 0.13. Note that the printed value is the loss of the last batch in each epoch, not an epoch average.
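
Since each printed value comes from a single batch, the log fluctuates from epoch to epoch. A minimal sketch of reporting the mean loss per epoch instead; the helper name `mean_epoch_loss` is hypothetical, and the batch losses below are made-up placeholders standing in for `loss.item()` values.

```python
def mean_epoch_loss(batch_losses):
    """Average the per-batch losses collected during one epoch."""
    return sum(batch_losses) / len(batch_losses)

# Placeholder values; in the training loop, append loss.item() after each batch
# and reset the list at the start of every epoch.
batch_losses = [0.9, 0.5, 0.7, 0.3]
epoch_loss = mean_epoch_loss(batch_losses)
print(f"Epoch mean loss: {epoch_loss:.4f}")
```

An averaged value makes it easier to judge whether training is still improving or has plateaued.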

## Model Testing

In [24]:
model.eval()
for batch in test_loader:
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)

    input_texts = [tokenizer.decode(ids, skip_special_tokens=True) for ids in input_ids]
    true_labels = [tokenizer.decode(label, skip_special_tokens=True) for label in labels]

    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=50  # generated answers are capped at 50 tokens (labels were tokenized up to 128)
    )
    predictions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

    for input_text, true_label, pred in zip(input_texts, true_labels, predictions):
        print("-" * 50)
        print(f"input_txt: {input_text}")
        print(f"true_label: {true_label}")
        print(f"true_pred: {pred}")

    break

--------------------------------------------------
input_txt: Why does Allah allow bad people to succeed in this world?
true_label: The Quran warns that worldly success is a test: 'Do not be deceived by the prosperity of those who disbelieve' (3:196). True success lies in righteousness.
true_pred: Allah allows bad people to succeed in this world by fostering moral integrity and a just society.
--------------------------------------------------
input_txt: What does the Quran teach about the responsibility of using reason to safeguard one’s faith?
true_label: It teaches that using reason is a fundamental responsibility that protects and strengthens one’s faith.
true_pred: It teaches that using reason to safeguard one’s faith is a duty that bestows integrity and justice.
--------------------------------------------------
input_txt: What does the Quran teach about handling criticism within the family?
true_label: The Quran encourages using constructive criticism as an opportunity for growt

## Model Evaluation

In [25]:
# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

# `predictions` and `true_labels` come from the previous cell, which stops after the first batch,
# so the scores below cover only that batch (at most 8 examples), not the full test set

bleu_scores = []
rouge1_scores = []
rougeL_scores = []

for prediction, true_label in zip(predictions, true_labels):
  # Calculate BLEU score
  reference = [true_label.split()]
  candidate = prediction.split()
  bleu_score = sentence_bleu(reference, candidate)
  bleu_scores.append(bleu_score)

  # Calculate ROUGE scores
  scores = scorer.score(true_label, prediction)
  rouge1_scores.append(scores['rouge1'].fmeasure)
  rougeL_scores.append(scores['rougeL'].fmeasure)

# Calculate average scores
avg_bleu = np.mean(bleu_scores)
avg_rouge1 = np.mean(rouge1_scores)
avg_rougeL = np.mean(rougeL_scores)

print(f"Average BLEU Score: {avg_bleu}")
print(f"Average ROUGE-1 Score: {avg_rouge1}")
print(f"Average ROUGE-L Score: {avg_rougeL}")


Average BLEU Score: 0.3020465450462989
Average ROUGE-1 Score: 0.4184689669090548
Average ROUGE-L Score: 0.3278212112872705


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
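
The warnings above arise because BLEU is a geometric mean of 1- to 4-gram precisions, so a single order with zero overlap drives the score toward zero. The sketch below is a simplified re-implementation for illustration only (single reference, whitespace tokens); `ngram_stats` and `bleu` are hypothetical helper names, not NLTK functions. Setting `epsilon=0.1` replaces a zero match count with a small constant, which is the idea behind NLTK's `SmoothingFunction().method1`.

```python
import math
from collections import Counter

def ngram_stats(reference, candidate, max_n=4):
    """Return (clipped matches, candidate n-gram count) for n = 1..max_n."""
    stats = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        stats.append((matches, sum(cand.values())))
    return stats

def bleu(reference, candidate, epsilon=0.0):
    """Sentence-level BLEU; epsilon > 0 smooths n-gram orders with zero matches."""
    stats = ngram_stats(reference, candidate)
    precisions = [(m or epsilon) / t if t else 0.0 for m, t in stats]
    if min(precisions) <= 0:
        return 0.0  # one empty order zeroes the geometric mean
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))

reference = "the quran encourages patience in hardship".split()
candidate = "the quran teaches patience in hardship".split()
unsmoothed = bleu(reference, candidate)
smoothed = bleu(reference, candidate, epsilon=0.1)
print(unsmoothed, smoothed)
```

In this example the candidate shares no 4-gram with the reference, so the unsmoothed score is zero even though five of its six words match; the smoothed score stays positive. Short answers like those in this dataset hit this case often, which is why smoothing is worth considering before averaging sentence-level BLEU.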


## Explanation of Each Metric

---

- **BLEU (Bilingual Evaluation Understudy)**

  > Value: ≈ 0.30 (> 0.2)

  BLEU measures how similar the model's generated output is to the reference answer, based on n-gram overlap.

  A BLEU score above 0.2 can be considered **moderate** for generative Q&A, especially for reasoning or narrative text, where valid answers can be phrased in many different ways.

  For this model, the BLEU score indicates that the generated answers share n-grams with the references, while leaving room for improvement in structure and word choice.

---

- **ROUGE-1**
  > Value: ≈ 0.42 (> 0.4)

  Measures direct word overlap (unigram overlap) between the model's answer and the reference answer.

  A score above 0.4 can be considered **fairly good** for generative Q&A, indicating that the model captures a fair share of the keywords in the reference answer.

---

- **ROUGE-L**
  > Value: ≈ 0.33 (> 0.3)

  Measures similarity in word order, based on the longest common subsequence (LCS).

  A score above 0.3 indicates that the model is **fairly good** at reproducing part of the sentence structure of the reference answer, although it is not yet syntactically precise.
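
To make the two ROUGE definitions concrete, here is a simplified sketch of both F-measures. It lowercases and splits on whitespace, whereas the `rouge_score` package used in this notebook applies its own tokenizer and optional stemming, so the numbers will not match exactly. All function names are hypothetical.

```python
from collections import Counter

def f_measure(overlap, ref_len, pred_len):
    """Harmonic mean of precision (overlap/pred_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge1_f(reference, prediction):
    ref, pred = reference.lower().split(), prediction.lower().split()
    overlap = sum((Counter(ref) & Counter(pred)).values())  # clipped unigram matches
    return f_measure(overlap, len(ref), len(pred))

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rougeL_f(reference, prediction):
    ref, pred = reference.lower().split(), prediction.lower().split()
    return f_measure(lcs_length(ref, pred), len(ref), len(pred))

reference = "using reason protects faith"
prediction = "faith protects using reason"
print(rouge1_f(reference, prediction), rougeL_f(reference, prediction))  # 1.0 0.5
```

The prediction contains exactly the reference's words in a different order, so ROUGE-1 is 1.0 while ROUGE-L drops to 0.5: only "using reason" survives as a common subsequence. This is the sense in which ROUGE-L is sensitive to word order and ROUGE-1 is not.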

## Model Saving

In [26]:
# Save the model
model_path = "/kaggle/working/Model"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

print(f"Model saved to {model_path}")

Model saved to /kaggle/working/Model
