### Environment Setup and Dependencies

Importing necessary libraries for document loading, text splitting, vector embeddings, retrieval-augmented generation (RAG), and conversational memory. Configuring the environment to suppress warnings and prepare for processing the PDF document with LangChain and Google Generative AI.

In [1]:
!pip install -U langchain-community langchain-google-genai pypdf langchain_huggingface chromadb arabic-reshaper python-bidi unstructured pdfminer.six Pillow pi_heif "unstructured[pdf]" sentence-transformers



In [2]:
import os
import time
from langchain_community.document_loaders import PyPDFLoader, UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path
from langchain.chains import create_history_aware_retriever, create_retrieval_chain, RetrievalQA
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.memory import ConversationBufferMemory
from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, FAISS
import arabic_reshaper
from bidi.algorithm import get_display

import warnings
warnings.filterwarnings("ignore")

In [3]:
import torch
print("Using device:", torch.device("cuda" if torch.cuda.is_available() else "cpu"))

Using device: cpu


### Load PDF Document

Loading the internal regulations document of the Faculty of Computers and Artificial Intelligence (October 2019) using PyPDFLoader.

In [4]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("ahmedayman7/fcai-usc-internal-regulations")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'fcai-usc-internal-regulations' dataset.
Path to dataset files: /kaggle/input/fcai-usc-internal-regulations


In [5]:
pdf_path = os.path.join(path, "2019.pdf")  # `path` is the dataset directory returned by kagglehub above

loader = PyPDFLoader(pdf_path, mode="page")
docs = loader.load()

# Each doc corresponds to a page
print(len(docs), "pages loaded")

74 pages loaded


### Fix Arabic Text Rendering

Reshaping and reordering Arabic text in each document page to ensure proper visual display (connected characters and right-to-left layout). The corrected text is saved back into the `Document` objects.  
Then, previewing the first three pages to verify the fix.

In [7]:
# Fix Arabic text inside each Document object
for doc in docs:
    # Extract raw text from the page
    text = doc.page_content

    # Reshape Arabic characters to connect properly
    reshaped_text = arabic_reshaper.reshape(text)

    # Correct the right-to-left display order
    bidi_text = get_display(reshaped_text)

    # Save the fixed Arabic text back into the Document
    doc.page_content = bidi_text

print("✅ Arabic text has been reshaped, reordered, and stored back into docs.\n")

# Display the content of the first few pages (example: first 3)
for i, page in enumerate(docs[:3]):
    print(f"--- Page {i + 1} ---")
    print(page.page_content)
    print("\n" + "=" * 80 + "\n")

✅ Arabic text has been reshaped, reordered, and stored back into docs.

--- Page 1 ---
ﺍﻟﻼﺋﺤﺔ ﺍﻟﺪﺍﺧﻠﻴﺔ 
ﻟﻜﻠﻴﺔ ﺍﻟﺤﺎﺳﺒﺎﺕ ﻭﺍﻟﺬﻛﺎﺀ ﺍﻻﺻﻄﻨﺎﻋﻲ 
Faculty of Computers and 
 Artificial Intelligence
(ﺑﻨﻈﺎﻡ ﺍﻟﺴﺎﻋﺎﺕ ﺍﻟﻤﻌﺘﻤﺪﺓ 
 
ﺟﺎﻣﻌﺔ ﻣﺪﻳﻨﺔ ﺍﻟﺴﺎﺩﺍﺕ 
 
ﺃﻛﺘﻮﺑﺮ 2019


--- Page 2 ---
ﺟﺎﻣﻌﺔ ﻣﺪﻳﻨﺔ ﺍﻟﺴﺎﺩﺍﺕ 
ﻛﻠﻴﺔ ﺍﻟﺤﺎﺳﺒﺎﺕ ﻭﺍﻟﺬﻛﺎﺀ ﺍﻻﺻﻄﻨﺎﻋﻲ 
  
Page 1 of 73        ﺍﻟﻼﺋﺤﺔ ﺍﻟﺪﺍﺧﻠﻴﺔ ﻟﻜﻠﻴﺔ ﺍﻟﺤﺎﺳﺒﺎﺕ ﻭﺍﻟﺬﻛﺎﺀ ﺍﻻﺻﻄﻨﺎﻋﻲ ﺑﻨﻈﺎﻡ ﺍﻟﺴﺎﻋﺎﺕ ﺍﻟﻤﻌﺘﻤﺪﺓ  2019          
ﻗﺎﺋﻤﺔ ﺍﻟﻤﺤﺘﻮﻳﺎﺕ 
ﻣﺴﻠﺴﻞ  ﺍﻟﻤﻮﺿﻮﻉ  ﺭﻗﻢ ﺍﻟﺼﻔﺤﺔ  
1  ﺗﻤﻬﻴﺪ 2 
2  ﺭﺅﻳﺔ ﻭﺭﺳﺎﻟﺔ ﻭﺃﻫﺪﺍﻑ ﺍﻟﻜﻠﻴﺔ   3 
3  ﻣﺎﺩﺓ1: ﻗﻮﺍﻋﺪ ﺍﻟﻘﺒﻮﻝ ﺑﺎﻟﻜﻠﻴﺔ 4 
4  ﻣﺎﺩﺓ2: ﺃﻗﺴﺎﻡ ﺍﻟﻜﻠﻴﺔ ﻭﺍﻟﺪﺭﺟﺎﺕ ﺍﻟﻌﻠﻤﻴﺔ 4 
5  ﻣﺎﺩﺓ3: ﻟﻐﺔ ﺍﻟﺘﺪﺭﻳﺲ 5 
6  ﻣﺎﺩﺓ4: ﻧﻈﺎﻡ ﺍﻟﺴﺎﻋﺎﺗﺎﻟﻤﻌﺘﻤﺪﺓ 5 
7  ﻣﺎﺩﺓ5: ﺍﻹﺭﺷﺎﺩ ﺍﻷﻛﺎﺩﻳﻤﻰ 6 
8  ﻣﺎﺩﺓ6: ﺍﻟﺘﺴﺠﻴﻞ ﻭﺍﻟﺤﺬﻑ ﻭﺍﻹﺿﺎﻓﺔ 7 
9  ﻣﺎﺩﺓ7: ﺍﻻﻧﺴﺤﺎﺏ ﻣﻦ ﺍﻟﻤﻘﺮﺭ 8 
10  ﻣﺎﺩﺓ8 :ﻗﻮﺍﻋﺪﺍﻟﻤﻮﺍﻇﺒﺔ ﻭﺍﻟﻐﻴﺎﺏ 8 
11  ﻣﺎﺩﺓ9: ﺍﻻﻧﺘﻘﺎﻝ ﺑﻴﻦ ﺍﻟﻤﺴﺘﻮﻳﺎﺕ 9 
12  ﻣﺎﺩﺓ10: ﺍﻹﻧﻘﻄﺎﻉ ﻋﻦ ﺍﻟﺪﺭﺍﺳﺔ 9 
13  ﻣﺎﺩﺓ11: ﺍﻟﻔﺼﻞ ﻣﻦ ﺍﻟﻜﻠﻴﺔ 9 
14  ﻣﺎﺩﺓ12 :ﺍﻟﺘﺤﻮﻳﻞ ﻭﻧﻘﻞ ﺍﻟﻘﻴﺪ ﻣﻦ ﺍﻟﻜﻠﻴﺎﺕ ﺍﻷﺧﺮﻯ 10 
15  ﻣﺎﺩﺓ13: ﻧﻈﺎﻡ ﺍﻹﻣﺘﺤﺎﻧﺎﺕ 10 
16  ﻣﺎﺩﺓ14: ﻧﻈﺎﻡ ﺍﻟﺘﻘﻮﻳﻢ 1
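
The preview above consists of characters from the Unicode Arabic Presentation Forms blocks (for example U+FE8D, the isolated form of alef) rather than the base letters a user would normally type in a question (U+0627). The sketch below uses only the standard library to detect such characters and map them back to base letters with NFKC normalization. The helper names are illustrative and not part of the notebook. NFKC changes only the code points, not the character order produced by `get_display`, and whether any of this affects retrieval quality here has not been measured.

```python
import unicodedata

# Code point ranges of Arabic Presentation Forms-A and -B (shaped glyphs).
# The upper bound stops at U+FEFC so the byte order mark U+FEFF is excluded.
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFC))


def has_presentation_forms(text: str) -> bool:
    """Return True if any character is a shaped (positional) Arabic glyph."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in PRESENTATION_RANGES)


def to_base_letters(text: str) -> str:
    """Map shaped glyphs back to base Arabic letters via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


shaped = "\uFE8D\uFEDF"  # alef (isolated form) + lam (initial form)
print(has_presentation_forms(shaped))                   # shaped glyphs present
print(has_presentation_forms(to_base_letters(shaped)))  # base letters only
```

A check like `has_presentation_forms(docs[0].page_content)` could be run before and after the reshaping cell to see which form ends up in the vector store.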

### Initialize Google Generative AI (Gemini)

Configuring the `gemini-2.5-flash` model via LangChain with a low temperature (0.1) for near-deterministic responses, up to two retries, and the Google API key read from Colab secrets.

In [8]:
google_llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0.1,
    max_tokens=None, # Max output tokens
    timeout=None,
    max_retries=2,
    convert_system_message_to_human=True,
    google_api_key=userdata.get('GOOGLE_API_KEY')
)

### Split Documents into Chunks

Using `RecursiveCharacterTextSplitter` to break the cleaned Arabic documents into manageable chunks (up to 3000 characters with a 1000-character overlap) while respecting natural text boundaries. This prepares the text for effective embedding and retrieval. A sample chunk is displayed for verification.

In [9]:
# ⚙️ Configure the text splitter for balanced chunking
splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,          # Maximum number of characters per chunk
    chunk_overlap=1000,        # Overlap to preserve context between chunks
    separators=["\n\n", "\n", ".", " "]  # Priority of text break points
)

# ✂️ Split the loaded and fixed documents into smaller chunks
chunks = splitter.split_documents(docs)

print(f"✅ Successfully created {len(chunks)} chunks.\n")

# 🧾 Optional: Preview a sample chunk
sample = chunks[50]
print("Sample Chunk Metadata:", sample.metadata)
print("Sample Chunk Content:\n", sample.page_content)

✅ Successfully created 81 chunks.

Sample Chunk Metadata: {'producer': 'Microsoft® Word لبرنامج Office 365', 'creator': 'Microsoft® Word لبرنامج Office 365', 'creationdate': '2019-10-27T13:59:27+02:00', 'author': 'Magda Abou El Safa', 'moddate': '2019-10-27T14:01:24+02:00', 'title': 'كلية الحاسبات والذكاء الاصطناعى FCaI-USC', 'source': '/kaggle/input/fcai-usc-internal-regulations/2019.pdf', 'total_pages': 74, 'page': 46, 'page_label': '47'}
Sample Chunk Content:
 ﺟﺎﻣﻌﺔ ﻣﺪﻳﻨﺔ ﺍﻟﺴﺎﺩﺍﺕ 
ﻛﻠﻴﺔ ﺍﻟﺤﺎﺳﺒﺎﺕ ﻭﺍﻟﺬﻛﺎﺀ ﺍﻻﺻﻄﻨﺎﻋﻲ 
  
Page 46 of 73        ﺍﻟﻼﺋﺤﺔ ﺍﻟﺪﺍﺧﻠﻴﺔ ﻟﻜﻠﻴﺔ ﺍﻟﺤﺎﺳﺒﺎﺕ ﻭﺍﻟﺬﻛﺎﺀ ﺍﻻﺻﻄﻨﺎﻋﻲ ﺑﻨﻈﺎﻡ ﺍﻟﺴﺎﻋﺎﺕ ﺍﻟﻤﻌﺘﻤﺪﺓ  2019          
 for Robotic systems, Topics include how robots move, sense, and
 perceive the world around them. The course introduces also constructing,
 planning and programming robots ability to Sensing, controlling, remote
 control and testing using computer languages for communication and
 advanced Input / Output programming for system practical programming
 and harmonious pr
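
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries the separators in priority order; the function name `sliding_chunks` is hypothetical and only illustrates how consecutive chunks share text.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap              # how far each window advances
    last_start = max(len(text) - chunk_overlap, 1)  # avoid a trailing overlap-only chunk
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# A 7000-character page with the notebook's settings gives three windows
# starting at offsets 0, 2000 and 4000.
page = "x" * 7000
chunks_demo = sliding_chunks(page, chunk_size=3000, chunk_overlap=1000)
print([len(c) for c in chunks_demo])
```

With a 1000-character overlap on 3000-character chunks, roughly a third of each chunk repeats its neighbour, which trades index size for a lower chance of cutting a rule in half.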

### Initialize Multilingual Embedding Model (E5 Base)

Switching to the `intfloat/multilingual-e5-base` embedding model, a multilingual encoder that supports Arabic, English, and many other languages. It is configured to run on CPU here, since no GPU was detected above (set the device to `cuda` when one is available), with normalized embeddings for semantic search over the FCAI regulations document.

In [10]:
# Multilingual model (supports Arabic, English, etc.)
model_name = "intfloat/multilingual-e5-base"

# Device and embedding normalization settings
model_kwargs = {'device': 'cpu'}  # Switch to 'cuda' when a GPU is available
encode_kwargs = {'normalize_embeddings': True}  # Unit-length vectors, so dot product equals cosine similarity

# Initialize multilingual embedding model
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

print(f"Loaded multilingual embedding model: {model_name} (device: {model_kwargs['device']})")

Loaded multilingual embedding model: intfloat/multilingual-e5-base (device: cpu)
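
The effect of `normalize_embeddings=True` can be shown without the model: once vectors are scaled to unit length, their dot product equals their cosine similarity, so a store that ranks by dot product or Euclidean distance orders results the same way as cosine similarity. The vectors below are toy values, not real E5 embeddings, and the helper names are illustrative.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from the raw, unnormalized vectors."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


a, b = [3.0, 4.0], [4.0, 3.0]
a_n, b_n = l2_normalize(a), l2_normalize(b)

# Both values agree up to floating-point rounding.
print(round(cosine(a, b), 6), round(dot(a_n, b_n), 6))
```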


### Create and Persist Chroma Vector Store

Building a local Chroma vector database from the document chunks using multilingual embeddings. The vector store is persisted to disk (`./chroma_fcai_regulations_db`) for future reuse, enabling efficient semantic search over the FCAI internal regulations without re-embedding. Total embedded chunks: 81.

In [11]:
# Create a local Chroma vector store for the FCAI internal regulations document
vectorstore = Chroma.from_documents(
    documents=chunks,                               # The list of split document chunks
    embedding=hf,                                   # HuggingFaceEmbeddings instance
    persist_directory="./chroma_fcai_regulations_db"  # Local storage directory
)

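# Persist the embeddings to disk
# (Chroma 0.4 and later write to persist_directory automatically, so this call may be redundant there)
vectorstore.persist()

print("Chroma vector store for FCAI internal regulations created and saved successfully!")
print("Location: ./chroma_fcai_regulations_db")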
print(f"Total embedded chunks: {len(chunks)}\n")

# To reload the vector store later (without re-embedding):
# vectorstore = Chroma(
#     persist_directory="./chroma_fcai_regulations_db",
#     embedding_function=hf
# )

# Optional: Display confirmation
print("Vector store instance:\n", vectorstore)


Chroma vector store for FCAI internal regulations created and saved successfully!
Location: ./chroma_fcai_regulations_db
Total embedded chunks: 81

Vector store instance:
 <langchain_community.vectorstores.chroma.Chroma object at 0x7da526004710>
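
Because the store is persisted, a rerun of the notebook does not need to embed the 81 chunks again. One possible pattern, sketched with the standard library only, is to check whether the persist directory already has content and pick between a load callable and a build callable. The name `load_or_build` and its parameters are hypothetical; the callables would wrap the `Chroma(...)` and `Chroma.from_documents(...)` calls shown in the cell above.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(persist_dir: str,
                  build: Callable[[str], T],
                  load: Callable[[str], T]) -> T:
    """Call `load` if the directory exists and is non-empty, else `build`."""
    path = Path(persist_dir)
    if path.is_dir() and any(path.iterdir()):
        return load(persist_dir)
    return build(persist_dir)


# Intended use inside the notebook (commented out because it needs the
# objects created in earlier cells):
# vectorstore = load_or_build(
#     "./chroma_fcai_regulations_db",
#     build=lambda d: Chroma.from_documents(documents=chunks, embedding=hf, persist_directory=d),
#     load=lambda d: Chroma(persist_directory=d, embedding_function=hf),
# )
```

A non-empty directory is only a heuristic: if the chunking or embedding settings change, the old directory should be deleted so the index is rebuilt.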


### Set Up Retrieval-Augmented Generation (RAG) Pipeline

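Configuring a Maximal Marginal Relevance (MMR) retriever over the Chroma vector store: it considers the 20 most similar chunks (`fetch_k`), returns 10 of them (`k`), and balances relevance against diversity with `lambda_mult=0.5`. The retriever is integrated with the Gemini LLM via a `RetrievalQA` chain. The `"stuff"` chain type combines all retrieved context into a single prompt, and source documents are returned for transparency and verification. A plain similarity retriever (top 6 chunks) is kept as a commented-out alternative.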

In [20]:
# Create a retriever interface from the Chroma vector store

# Alternative: plain similarity search returning the top 6 chunks
# retriever = vectorstore.as_retriever(
#     search_type="similarity",
#     search_kwargs={"k": 6}  # k = number of chunks returned
# )

retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance
    search_kwargs={"k": 10, "fetch_k": 20, "lambda_mult": 0.5}
)


# Create a RetrievalQA chain (LLM + Retriever)
qa_chain = RetrievalQA.from_chain_type(
    llm=google_llm,
    retriever=retriever,
    chain_type="stuff",               # Concatenate all chunks into one prompt
    return_source_documents=True      # Include retrieved chunks in results
)
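
How `k`, `fetch_k`, and `lambda_mult` interact can be sketched in plain Python. This is a simplified illustration of the MMR idea (shortlist the `fetch_k` most similar candidates, then greedily pick `k` of them while penalizing similarity to what is already selected), not the code that LangChain or Chroma run internally. The function name `mmr_select` and the toy vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=10, fetch_k=20, lambda_mult=0.5):
    """Return indices of up to k documents chosen by a greedy MMR rule."""
    # Step 1: shortlist the fetch_k candidates most similar to the query.
    candidates = sorted(range(len(doc_vecs)),
                        key=lambda i: cosine(query_vec, doc_vecs[i]),
                        reverse=True)[:fetch_k]
    selected = []
    # Step 2: repeatedly take the candidate with the best trade-off between
    # relevance to the query and redundancy with already selected documents.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs_demo = [[1.0, 0.0], [1.0, 0.01], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates

print(mmr_select(query, docs_demo, k=2, fetch_k=3, lambda_mult=1.0))  # pure relevance
print(mmr_select(query, docs_demo, k=2, fetch_k=3, lambda_mult=0.3))  # favors diversity
```

With `lambda_mult=1.0` the near-duplicate is picked second; lowering it makes the rule skip the duplicate in favor of the less similar but more diverse document.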

In [16]:
print(qa_chain.input_keys)

['query']
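
The `"stuff"` strategy amounts to concatenating every retrieved chunk into one context block placed in front of the question. The sketch below shows that idea with a hypothetical prompt wording; it is not the template LangChain uses by default, and `stuff_prompt` is an illustrative name.

```python
def stuff_prompt(question: str, chunk_texts: list[str], separator: str = "\n\n") -> str:
    """Build a single prompt that contains all retrieved chunks verbatim."""
    context = separator.join(chunk_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


demo_prompt = stuff_prompt(
    "What is the language of instruction?",
    ["Article 3: Language of instruction ...", "Article 4: Credit hour system ..."],
)
print(demo_prompt)
```

Since nothing is summarized, prompt size grows with the retriever settings: 10 chunks of up to 3000 characters each can add up to about 30,000 characters of context per question.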


### 🧪 Evaluate RAG Chatbot on FCAI Regulations

Testing the Retrieval-Augmented Generation (RAG) system with a mix of Arabic and English questions about the Faculty of Computers and Artificial Intelligence (FCAI) internal regulations. The system retrieves relevant document chunks and generates answers using the Gemini LLM, then lists the source file and the zero-based `page` index of each retrieved chunk for verification.

In [18]:
# 🔍 Arabic RAG Chatbot Test – FCAI Internal Regulations

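questions = [
    # "What are the four specializations available in the Faculty of Computers and
    #  Artificial Intelligence according to the internal regulations?"
    "ما هي التخصصات الأربعة المتاحة في كلية الحاسبات والذكاء الاصطناعي وفقًا للائحة الداخلية؟",
    # "How many total credit hours are required to graduate from the Computer Science bachelor's program?"
    "كم عدد الساعات المعتمدة الإجمالية المطلوبة للتخرج من برنامج بكالوريوس علوم الحاسب؟",
    # "What is the minimum cumulative GPA (CGPA) required to remain in the faculty according to the regulations?"
    "ما هو الحد الأدنى للمعدل التراكمي (CGPA) المطلوب للبقاء في الكلية وفقًا للائحة؟",
    "What is the percentage allocated for the final exam in a regular course according to the evaluation system?",
    "What is the minimum cumulative GPA (CGPA) required to remain in the faculty according to the regulations?"
]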

print("\n🧠 FCAI Internal Regulations RAG Evaluation\n")

for i, q in enumerate(questions, 1):
    print(f"🔹 Q{i}: {q}")
    result = qa_chain.invoke({"query": q})
    print(f"🧩 الإجابة: {result['result']}\n📚 المصادر:")
    for doc in result["source_documents"]:
        print(f"  - {doc.metadata.get('source', 'Unknown')} | صفحة {doc.metadata.get('page', 'N/A')}")
    print("=" * 80)
    time.sleep(2)

print("✅ Evaluation completed.")


🧠 FCAI Internal Regulations RAG Evaluation

🔹 Q1: ما هي التخصصات الأربعة المتاحة في كلية الحاسبات والذكاء الاصطناعي وفقًا للائحة الداخلية؟
🧩 الإجابة: وفقًا للائحة الداخلية لكلية الحاسبات والذكاء الاصطناعي، التخصصات المتاحة هي:

1.  **علوم الحاسب** (Computer Science) - مذكور كقسم في المادة 2.
2.  **نظم المعلومات** (Information Systems) - مذكور كقسم في المادة 2.
3.  **المعلوماتية الحيوية** (Bioinformatics) - مذكور كبرنامج في المادة 1 (قواعد القبول) والمادة 31 (المقررات التأهيلية).
4.  **الذكاء الاصطناعي** (Artificial Intelligence) - مذكور في اسم الكلية نفسها، وفي المجالات العلمية لقسم علوم الحاسب (المادة 2)، وهناك مقررات اختيارية مخصصة له (صفحة 29).
📚 المصادر:
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 0
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 16
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 3
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 2
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 1
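
The page values printed above come from the `page` metadata field, which appears to be zero-based (the sample chunk earlier carries `'page': 46` next to `'page_label': '47'`). A small helper can turn the retrieved metadata into a sorted, de-duplicated list of one-based page numbers that match the PDF viewer. `source_pages` is a hypothetical name, and the zero-based assumption should be rechecked if a different loader is used.

```python
def source_pages(metadatas: list[dict]) -> list[int]:
    """Return sorted, unique one-based page numbers from chunk metadata.

    Assumes each metadata dict may hold a zero-based integer under "page";
    entries without that key are skipped.
    """
    pages = {meta["page"] + 1 for meta in metadatas if isinstance(meta.get("page"), int)}
    return sorted(pages)


# Metadata shaped like the sources listed for Q1 above, plus one duplicate
# and one entry without a page.
demo_metadata = [{"page": 0}, {"page": 16}, {"page": 3}, {"page": 2}, {"page": 1},
                 {"page": 0}, {"source": "2019.pdf"}]
print(source_pages(demo_metadata))
```

In the evaluation loop this could be called as `source_pages([d.metadata for d in result["source_documents"]])`.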
 

In [19]:
# 🔍 Arabic RAG Chatbot Test – FCAI Internal Regulations

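questions = [
    # Colloquial Egyptian Arabic: "I want to register for the course software engineering 2.
    #  What is required to take it, or which courses should I take before it in the
    #  Information Systems department?"
    "عاوز اسجل ماده software engineering 2 اي المطلوب عشان اخدها او اي المواد الل المفروض اخدها قبلها في قسم نظم المعلومات"
]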

print("\n🧠 FCAI Internal Regulations RAG Evaluation\n")

for i, q in enumerate(questions, 1):
    print(f"🔹 Q{i}: {q}")
    result = qa_chain.invoke({"query": q})
    print(f"🧩 الإجابة: {result['result']}\n📚 المصادر:")
    for doc in result["source_documents"]:
        print(f"  - {doc.metadata.get('source', 'Unknown')} | صفحة {doc.metadata.get('page', 'N/A')}")
    print("=" * 80)
    time.sleep(2)

print("✅ Evaluation completed.")


🧠 FCAI Internal Regulations RAG Evaluation

🔹 Q1: عاوز اسجل ماده software engineering 2 اي المطلوب عشان اخدها او اي المواد الل المفروض اخدها قبلها في قسم نظم المعلومات
🧩 الإجابة: لا توجد معلومات عن مقرر "Software Engineering 2" (هندسة البرمجيات-2) في سياق النص المقدم.

المعلومات المتاحة هي عن مقرر "Software Engineering-1" (هندسة البرمجيات-1) الذي يحمل الكود CS301، ومتطلبه السابق هو "Computer Programming-2" (CS203).
📚 المصادر:
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 5
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 36
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 35
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 10
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 20
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 8
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 19
  - /kaggle/input/fcai-usc-internal-regulations/2019.pdf | صفحة 50
  - /kaggle/input/fcai-usc-internal