### Environment Setup and Dependencies

Importing necessary libraries for document loading, text splitting, vector embeddings, retrieval-augmented generation (RAG), and conversational memory. Configuring the environment to suppress warnings and prepare for processing the PDF document with LangChain and Google Generative AI.

In [1]:
!pip install -U langchain-community langchain-google-genai pypdf langchain_huggingface chromadb arabic-reshaper python-bidi



In [2]:
import os
import time
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path
from langchain.chains import create_history_aware_retriever, create_retrieval_chain, RetrievalQA
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.memory import ConversationBufferMemory
from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, FAISS
import arabic_reshaper
from bidi.algorithm import get_display

import warnings
warnings.filterwarnings("ignore")

In [3]:
import torch
print("Using device:", torch.device("cuda" if torch.cuda.is_available() else "cpu"))

Using device: cuda


### Load PDF Document

Loading the internal regulations document of the Faculty of Computers and Artificial Intelligence (October 2019) using PyPDFLoader.

In [4]:
loader = PyPDFLoader("/content/اللائحة الداخلية لكلية الحاسبات والذكاء الاصطناعي أكتوبر 2019.pdf")
docs = loader.load()

In [5]:
# Each doc corresponds to a page
print(len(docs), "pages loaded")

74 pages loaded


### Fix Arabic Text Rendering

Reshaping and reordering Arabic text in each document page to ensure proper visual display (connected characters and right-to-left layout). The corrected text is saved back into the `Document` objects.  
Then, previewing the first three pages to verify the fix.

In [7]:
# Fix Arabic text inside each Document object
for doc in docs:
    # Extract raw text from the page
    text = doc.page_content

    # Reshape Arabic characters to connect properly
    reshaped_text = arabic_reshaper.reshape(text)

    # Correct the right-to-left display order
    bidi_text = get_display(reshaped_text)

    # Save the fixed Arabic text back into the Document
    doc.page_content = bidi_text

print("✅ Arabic text has been reshaped, reordered, and stored back into docs.\n")

# Display the content of the first few pages (example: first 3)
for i, page in enumerate(docs[:3]):
    print(f"--- Page {i + 1} ---")
    print(page.page_content)
    print("\n" + "=" * 80 + "\n")

✅ Arabic text has been reshaped, reordered, and stored back into docs.

--- Page 1 ---
ﺍﻟﻼﺋﺤﺔ ﺍﻟﺪﺍﺧﻠﻴﺔ 
ﻟﻜﻠﻴﺔ ﺍﻟﺤﺎﺳﺒﺎﺕ ﻭﺍﻟﺬﻛﺎﺀ ﺍﻻﺻﻄﻨﺎﻋﻲ 
Faculty of Computers and 
 Artificial Intelligence
(ﺑﻨﻈﺎﻡ ﺍﻟﺴﺎﻋﺎﺕ ﺍﻟﻤﻌﺘﻤﺪﺓ 
 
ﺟﺎﻣﻌﺔ ﻣﺪﻳﻨﺔ ﺍﻟﺴﺎﺩﺍﺕ 
 
ﺃﻛﺘﻮﺑﺮ 2019


--- Page 2 ---
ﺟﺎﻣﻌﺔ ﻣﺪﻳﻨﺔ ﺍﻟﺴﺎﺩﺍﺕ 
ﻛﻠﻴﺔ ﺍﻟﺤﺎﺳﺒﺎﺕ ﻭﺍﻟﺬﻛﺎﺀ ﺍﻻﺻﻄﻨﺎﻋﻲ 
  
Page 1 of 73        ﺍﻟﻼﺋﺤﺔ ﺍﻟﺪﺍﺧﻠﻴﺔ ﻟﻜﻠﻴﺔ ﺍﻟﺤﺎﺳﺒﺎﺕ ﻭﺍﻟﺬﻛﺎﺀ ﺍﻻﺻﻄﻨﺎﻋﻲ ﺑﻨﻈﺎﻡ ﺍﻟﺴﺎﻋﺎﺕ ﺍﻟﻤﻌﺘﻤﺪﺓ  2019          
ﻗﺎﺋﻤﺔ ﺍﻟﻤﺤﺘﻮﻳﺎﺕ 
ﻣﺴﻠﺴﻞ  ﺍﻟﻤﻮﺿﻮﻉ  ﺭﻗﻢ ﺍﻟﺼﻔﺤﺔ  
1  ﺗﻤﻬﻴﺪ 2 
2  ﺭﺅﻳﺔ ﻭﺭﺳﺎﻟﺔ ﻭﺃﻫﺪﺍﻑ ﺍﻟﻜﻠﻴﺔ   3 
3  ﻣﺎﺩﺓ1: ﻗﻮﺍﻋﺪ ﺍﻟﻘﺒﻮﻝ ﺑﺎﻟﻜﻠﻴﺔ 4 
4  ﻣﺎﺩﺓ2: ﺃﻗﺴﺎﻡ ﺍﻟﻜﻠﻴﺔ ﻭﺍﻟﺪﺭﺟﺎﺕ ﺍﻟﻌﻠﻤﻴﺔ 4 
5  ﻣﺎﺩﺓ3: ﻟﻐﺔ ﺍﻟﺘﺪﺭﻳﺲ 5 
6  ﻣﺎﺩﺓ4: ﻧﻈﺎﻡ ﺍﻟﺴﺎﻋﺎﺗﺎﻟﻤﻌﺘﻤﺪﺓ 5 
7  ﻣﺎﺩﺓ5: ﺍﻹﺭﺷﺎﺩ ﺍﻷﻛﺎﺩﻳﻤﻰ 6 
8  ﻣﺎﺩﺓ6: ﺍﻟﺘﺴﺠﻴﻞ ﻭﺍﻟﺤﺬﻑ ﻭﺍﻹﺿﺎﻓﺔ 7 
9  ﻣﺎﺩﺓ7: ﺍﻻﻧﺴﺤﺎﺏ ﻣﻦ ﺍﻟﻤﻘﺮﺭ 8 
10  ﻣﺎﺩﺓ8 :ﻗﻮﺍﻋﺪﺍﻟﻤﻮﺍﻇﺒﺔ ﻭﺍﻟﻐﻴﺎﺏ 8 
11  ﻣﺎﺩﺓ9: ﺍﻻﻧﺘﻘﺎﻝ ﺑﻴﻦ ﺍﻟﻤﺴﺘﻮﻳﺎﺕ 9 
12  ﻣﺎﺩﺓ10: ﺍﻹﻧﻘﻄﺎﻉ ﻋﻦ ﺍﻟﺪﺭﺍﺳﺔ 9 
13  ﻣﺎﺩﺓ11: ﺍﻟﻔﺼﻞ ﻣﻦ ﺍﻟﻜﻠﻴﺔ 9 
14  ﻣﺎﺩﺓ12 :ﺍﻟﺘﺤﻮﻳﻞ ﻭﻧﻘﻞ ﺍﻟﻘﻴﺪ ﻣﻦ ﺍﻟﻜﻠﻴﺎﺕ ﺍﻷﺧﺮﻯ 10 
15  ﻣﺎﺩﺓ13: ﻧﻈﺎﻡ ﺍﻹﻣﺘﺤﺎﻧﺎﺕ 10 
16  ﻣﺎﺩﺓ14: ﻧﻈﺎﻡ ﺍﻟﺘﻘﻮﻳﻢ 1
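
The reshaped output above is made of Arabic presentation-form code points (contextual glyph shapes), not the base letters used in ordinary Arabic text such as the PDF file name. As a minimal sketch using only Python's standard `unicodedata` module, the hypothetical helpers `to_base_letters` and `describe` below show how such glyphs can be inspected and folded back to base letters with NFKC normalization. This is an illustrative assumption, not a step in the original notebook, and it does not undo any reordering applied by `get_display`.

```python
import unicodedata


def to_base_letters(text: str) -> str:
    """Fold Arabic presentation forms back to base letters via NFKC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


def describe(text: str) -> list[str]:
    """Return 'U+XXXX NAME' for every character, to make glyph forms visible."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# U+FE8D is the isolated presentation form of ALEF;
# U+FEFC is the final form of the LAM-ALEF ligature.
reshaped_alef = "\uFE8D"
lam_alef_ligature = "\uFEFC"

print(describe(reshaped_alef))
print(describe(to_base_letters(reshaped_alef)))
print(describe(to_base_letters(lam_alef_ligature)))
```

Printing the code point names this way is a quick check of whether a string holds presentation forms or base letters before it is passed to later steps.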

### Initialize Google Generative AI (Gemini)

Configuring the `gemini-2.5-flash` model via LangChain with a low temperature (0.1) for near-deterministic responses, up to two retries on failed requests, and the Google API key read from Colab secrets.

In [9]:
google_llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0.1,
    max_tokens=None, # Max output tokens
    timeout=None,
    max_retries=2,
    convert_system_message_to_human=True,
    google_api_key=userdata.get('GOOGLE_API_KEY')
)
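
To illustrate what a retry budget such as `max_retries=2` means in practice, here is a generic sketch of retrying a failing call with exponential backoff. The helper `call_with_retries`, its `base_delay` parameter, and the `flaky` function are hypothetical and written for illustration only; this is not how `ChatGoogleGenerativeAI` implements retries internally.

```python
import time


def call_with_retries(fn, max_retries=2, base_delay=1.0):
    """Call fn(); on failure retry up to max_retries times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # Retry budget exhausted: surface the last error
            time.sleep(base_delay * (2 ** attempt))
            attempt += 1


# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


# base_delay=0.0 keeps the demonstration instant.
result = call_with_retries(flaky, max_retries=2, base_delay=0.0)
print(result, "after", calls["count"], "calls")
```

With a budget of two retries, the call is attempted at most three times in total before the error is raised to the caller.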

### Split Documents into Chunks

Using `RecursiveCharacterTextSplitter` to break the reshaped Arabic documents into manageable chunks (at most 800 characters each, with a 100-character overlap) while preferring natural text boundaries: paragraph breaks, line breaks, periods, then spaces. This prepares the text for effective embedding and retrieval. A sample chunk is displayed for verification.

In [10]:
# ⚙️ Configure the text splitter for balanced chunking
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,          # Maximum number of characters per chunk
    chunk_overlap=100,        # Overlap to preserve context between chunks
    separators=["\n\n", "\n", ".", " "]  # Priority of text break points
)

# ✂️ Split the loaded and fixed documents into smaller chunks
chunks = splitter.split_documents(docs)

print(f"✅ Successfully created {len(chunks)} chunks.\n")

# 🧾 Optional: Preview a sample chunk
sample = chunks[0]
print("Sample Chunk Metadata:", sample.metadata)
print("Sample Chunk Content:\n", sample.page_content)

✅ Successfully created 258 chunks.

Sample Chunk Metadata: {'producer': 'Microsoft® Word لبرنامج Office 365', 'creator': 'Microsoft® Word لبرنامج Office 365', 'creationdate': '2019-10-27T13:59:27+02:00', 'author': 'Magda Abou El Safa', 'moddate': '2019-10-27T14:01:24+02:00', 'title': 'كلية الحاسبات والذكاء الاصطناعى FCaI-USC', 'source': '/content/اللائحة الداخلية لكلية الحاسبات والذكاء الاصطناعي أكتوبر 2019.pdf', 'total_pages': 74, 'page': 0, 'page_label': '1'}
Sample Chunk Content:
 ﺍﻟﻼﺋﺤﺔ ﺍﻟﺪﺍﺧﻠﻴﺔ 
ﻟﻜﻠﻴﺔ ﺍﻟﺤﺎﺳﺒﺎﺕ ﻭﺍﻟﺬﻛﺎﺀ ﺍﻻﺻﻄﻨﺎﻋﻲ 
Faculty of Computers and 
 Artificial Intelligence
(ﺑﻨﻈﺎﻡ ﺍﻟﺴﺎﻋﺎﺕ ﺍﻟﻤﻌﺘﻤﺪﺓ 
 
ﺟﺎﻣﻌﺔ ﻣﺪﻳﻨﺔ ﺍﻟﺴﺎﺩﺍﺕ 
 
ﺃﻛﺘﻮﺑﺮ 2019
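
To make the roles of `chunk_size` and `chunk_overlap` concrete, the following is a simplified fixed-window sketch. The function `chunk_with_overlap` is hypothetical and deliberately ignores separators, so it is not the algorithm `RecursiveCharacterTextSplitter` uses; it only shows how consecutive chunks share a tail of overlapping characters.

```python
def chunk_with_overlap(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # Reached the end; avoid emitting a redundant tail chunk
        start = end - chunk_overlap  # Step back to create the overlap
    return chunks


# Synthetic 2000-character text made of repeating digits.
sample_text = "".join(str(i % 10) for i in range(2000))
pieces = chunk_with_overlap(sample_text)
print([len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.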


### Initialize Multilingual Embedding Model

Loading the `paraphrase-multilingual-MiniLM-L12-v2` sentence transformer model to generate embeddings for both Arabic and English text. Configured to run on GPU (`cuda`) with normalized embeddings for improved semantic similarity and retrieval performance.

In [11]:
# Multilingual model (supports Arabic, English, etc.)
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

# Configuration for GPU and embedding normalization
model_kwargs = {'device': 'cuda'}  # Assumes a GPU is available; change to 'cpu' when running without one
encode_kwargs = {'normalize_embeddings': True}  # Normalization improves semantic search accuracy

# Initialize multilingual embedding model
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

print(f"Loaded multilingual embedding model: {model_name} (GPU enabled)")


Loaded multilingual embedding model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (GPU enabled)
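
One effect of `normalize_embeddings=True` is that every vector has unit length, so a plain dot product between two embeddings equals their cosine similarity. The sketch below checks this with small hand-written vectors using only the standard library; the helper names and the toy vectors are assumptions for illustration, not output of the embedding model.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        return list(v)
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-D vectors standing in for real embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]

print("cosine similarity:       ", cosine(a, b))
print("dot of normalized vectors:", dot(l2_normalize(a), l2_normalize(b)))
```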


### Create and Persist Chroma Vector Store

Building a local Chroma vector database from the document chunks using multilingual embeddings. The vector store is persisted to disk (`./chroma_fcai_regulations_db`) for future reuse, enabling efficient semantic search over the FCAI internal regulations without re-embedding. Total embedded chunks: 258.

In [13]:
# Create a local Chroma vector store for the FCAI internal regulations document
vectorstore = Chroma.from_documents(
    documents=chunks,                               # The list of split document chunks
    embedding=hf,                                   # HuggingFaceEmbeddings instance
    persist_directory="./chroma_fcai_regulations_db"  # Local storage directory
)

# Persist the embeddings to disk
# (Chroma 0.4+ saves automatically when persist_directory is set, so this call is redundant there)
vectorstore.persist()

print("Chroma vector store for FCAI internal regulations created and saved successfully!")
print("Location: ./chroma_fcai_regulations_db")
print(f"Total embedded chunks: {len(chunks)}\n")

# To reload the vector store later (without re-embedding):
# vectorstore = Chroma(
#     persist_directory="./chroma_fcai_regulations_db",
#     embedding_function=hf
# )

# Optional: Display confirmation
print("Vector store instance:\n", vectorstore)


Chroma vector store for FCAI internal regulations created and saved successfully!
Location: ./chroma_fcai_regulations_db
Total embedded chunks: 258

Vector store instance:
 <langchain_community.vectorstores.chroma.Chroma object at 0x788045315760>


### Set Up Retrieval-Augmented Generation (RAG) Pipeline

Configuring a similarity-based retriever over the Chroma vector store (returning the 6 most similar chunks per query) and integrating it with the Gemini LLM via a `RetrievalQA` chain. The `"stuff"` chain type concatenates all retrieved chunks into a single prompt, and source documents are returned alongside each answer for transparency and verification.

In [27]:
# Create retriever interface from your persisted Chroma vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",     # or "mmr" (diverse search)
    search_kwargs={"k": 6}  # k is the number of chunks returned per query
)

# Create a RetrievalQA chain (LLM + Retriever)
qa_chain = RetrievalQA.from_chain_type(
    llm=google_llm,
    retriever=retriever,
    chain_type="stuff",               # Simple concatenation of retrieved texts
    return_source_documents=True      # Include retrieved chunks in results
)
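
Conceptually, a similarity retriever with `k` set to 6 scores every stored chunk against the query embedding and keeps the best `k`. The sketch below shows that idea with cosine similarity over a tiny in-memory list; the `top_k` helper, the toy vectors, and the chunk labels are hypothetical, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=6):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "vector store": (chunk text, embedding) pairs with made-up 2-D embeddings.
indexed_chunks = [
    ("admission rules", [1.0, 0.0]),
    ("exam system", [0.0, 1.0]),
    ("grading system", [0.6, 0.8]),
]

query_vec = [0.0, 1.0]  # Pretend embedding of a question about exams
for score, text in top_k(query_vec, indexed_chunks, k=2):
    print(f"{score:.2f}  {text}")
```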

In [26]:
print(qa_chain.input_keys)

['query']
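
The "stuff" strategy can be pictured as plain string assembly: all retrieved chunks are joined into one context block and placed in a single prompt together with the question. The sketch below is a minimal illustration; `build_stuff_prompt` and its prompt wording are hypothetical and do not reproduce the template that `RetrievalQA` actually uses.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one context block and
    place it, with the question, into a single prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for retrieved document text.
example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt("What does the second chunk say?", example_chunks)
print(prompt)
```

Because everything goes into one prompt, the combined length of the `k` retrieved chunks has to fit within the model's context window, which is one reason chunk size and `k` are chosen together.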


### 🧪 Evaluate RAG Chatbot on FCAI Regulations

Testing the Retrieval-Augmented Generation (RAG) system with a mix of Arabic and English questions about the Faculty of Computers and Artificial Intelligence (FCAI) internal regulations. The system retrieves relevant document chunks and generates answers using the Gemini LLM, while citing source pages for verification.
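
The chunk metadata shown earlier carries a zero-based `page` field (`'page': 0` with `'page_label': '1'`), and several retrieved chunks can come from the same page. As a hedged sketch, a small helper could deduplicate the citations and convert them to one-based page numbers before printing. The function `format_sources` and the file name `regulations.pdf` are hypothetical and are not part of the original notebook.

```python
def format_sources(metadatas):
    """Return unique 'source | page N' lines with one-based page numbers,
    preserving the order in which sources first appear."""
    seen = set()
    lines = []
    for meta in metadatas:
        source = meta.get("source", "Unknown")
        page = meta.get("page")
        # PyPDFLoader-style metadata counts pages from 0; shift to 1-based for readers
        label = page + 1 if isinstance(page, int) else "N/A"
        key = (source, label)
        if key in seen:
            continue  # Skip repeated citations of the same page
        seen.add(key)
        lines.append(f"{source} | page {label}")
    return lines


# Hypothetical metadata resembling what retrieved chunks might carry.
example_metadatas = [
    {"source": "regulations.pdf", "page": 0},
    {"source": "regulations.pdf", "page": 0},
    {"source": "regulations.pdf", "page": 5},
    {},
]
for line in format_sources(example_metadatas):
    print("  -", line)
```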

In [32]:
# 🔍 Bilingual (Arabic/English) RAG Chatbot Test – FCAI Internal Regulations

questions = [
    "ما هي التخصصات الأربعة المتاحة في كلية الحاسبات والذكاء الاصطناعي وفقًا للائحة الداخلية؟",
    "كم عدد الساعات المعتمدة الإجمالية المطلوبة للتخرج من برنامج بكالوريوس علوم الحاسب؟",
    "ما هو الحد الأدنى للمعدل التراكمي (CGPA) المطلوب للبقاء في الكلية وفقًا للائحة؟",
    "What is the percentage allocated for the final exam in a regular course according to the evaluation system?",
    "What is the minimum cumulative GPA (CGPA) required to remain in the faculty according to the regulations?"
]

print("\n🧠 FCAI Internal Regulations RAG Evaluation\n")

for i, q in enumerate(questions, 1):
    print(f"🔹 Q{i}: {q}")
    result = qa_chain.invoke({"query": q})
    print(f"🧩 الإجابة: {result['result']}\n📚 المصادر:")
    for doc in result["source_documents"]:
        print(f"  - {doc.metadata.get('source', 'Unknown')} | صفحة {doc.metadata.get('page', 'N/A')}")
    print("=" * 80)
    time.sleep(2)

print("✅ Evaluation completed.")


🧠 FCAI Internal Regulations RAG Evaluation

🔹 Q1: ما هي التخصصات الأربعة المتاحة في كلية الحاسبات والذكاء الاصطناعي وفقًا للائحة الداخلية؟
🧩 الإجابة: وفقًا للائحة الداخلية، التخصصات الأربعة المتاحة في كلية الحاسبات والذكاء الاصطناعي هي:
1.  علوم الحاسب
2.  نظم المعلومات
3.  الذكاء الاصطناعي
4.  المعلوماتية الحيوية
📚 المصادر:
  - /content/اللائحة الداخلية لكلية الحاسبات والذكاء الاصطناعي أكتوبر 2019.pdf | صفحة 0
  - /content/اللائحة الداخلية لكلية الحاسبات والذكاء الاصطناعي أكتوبر 2019.pdf | صفحة 0
  - /content/اللائحة الداخلية لكلية الحاسبات والذكاء الاصطناعي أكتوبر 2019.pdf | صفحة 0
  - /content/اللائحة الداخلية لكلية الحاسبات والذكاء الاصطناعي أكتوبر 2019.pdf | صفحة 0
  - /content/اللائحة الداخلية لكلية الحاسبات والذكاء الاصطناعي أكتوبر 2019.pdf | صفحة 5
  - /content/اللائحة الداخلية لكلية الحاسبات والذكاء الاصطناعي أكتوبر 2019.pdf | صفحة 5
🔹 Q2: كم عدد الساعات المعتمدة الإجمالية المطلوبة للتخرج من برنامج بكالوريوس علوم الحاسب؟
🧩 الإجابة: للحصول على درجة البكالوريوس، يجب أن يجتاز ال