# Egyptian Smart Legal Advisor

## Overview

The **Egyptian Smart Legal Advisor** is an AI-powered tool that processes Egyptian legal documents, builds a semantic search index over them, and generates legal advice grounded in Egyptian law sources. It combines Natural Language Processing (NLP) techniques, embedding models, a FAISS vector index, and a Neo4j graph database to return context-aware legal answers. It is tailored to Arabic legal text and handles Arabic-specific issues such as diacritics, letter variants, and right-to-left rendering.

---

## Key Features

1. **Arabic Text Processing**:
   - Normalization of Arabic text (e.g., removing diacritics, standardizing characters).
   - Extraction of key legal terms using predefined patterns and stemming.
   - Handling of Arabic-specific text issues (e.g., reshaping, bidirectional text rendering).

2. **Document Processing**:
   - Extraction of text from PDF documents using `pdfplumber`.
   - Splitting of legal text into structured chunks for processing.
   - Handling of large legal documents (e.g., constitutions, civil codes).

3. **Semantic Search**:
   - Use of FAISS (Facebook AI Similarity Search) for efficient vector-based semantic search.
   - Arabic-optimized embeddings for accurate retrieval of relevant legal documents.

4. **Graph-Based Search**:
   - Integration with Neo4j for structured legal knowledge representation.
   - Retrieval of articles based on legal terms and their relationships in the knowledge graph.

5. **Hybrid Retrieval**:
   - Combination of semantic search (FAISS) and graph-based search (Neo4j) for comprehensive results.
   - Contextual expansion to include related articles for better coverage.

6. **Query Classification**:
   - Classification of user queries into legal or general categories using zero-shot classification.
   - Support for Arabic queries with a multilingual model (`joeddav/xlm-roberta-large-xnli`).

7. **Legal Advice Generation**:
   - Integration with large language models (LLMs) like Google's Gemini for generating legal advice.
   - Structured prompts to ensure accurate and context-aware responses.

8. **User Interaction**:
   - Support for Arabic conversational queries.
   - Clear and formatted output for easy consumption by users or LLMs.

---

## Workflow

1. **Input**:
   - The user submits a query in Arabic (e.g., "ما هي حقوق المرأة في الدستور المصري؟").

2. **Query Processing**:
   - The query is normalized and key legal terms are extracted.
   - The query is classified as legal or general.

3. **Document Retrieval**:
   - Semantic search is performed using FAISS to find relevant documents.
   - Graph-based search is performed using Neo4j to find articles related to the extracted legal terms.

4. **Result Combination**:
   - Results from FAISS and Neo4j are combined and ranked.
   - Contextual expansion is performed to include related articles.

5. **Output**:
   - The top results are formatted and presented to the user or LLM.
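
The steps above can be sketched as a small routing function. This is an illustrative outline only: `answer_query`, `fake_classify`, and the handler callables are hypothetical stand-ins passed in as plain functions, not the notebook's actual objects.

```python
from typing import Callable, Dict


def answer_query(
    query: str,
    classify: Callable[[str], Dict[str, object]],
    legal_chain: Callable[[str], str],
    general_chain: Callable[[str], str],
) -> str:
    """Route a query to the legal or the general handler based on its class."""
    result = classify(query)
    if result["query_type"] == "legal_query":
        return legal_chain(query)
    return general_chain(query)


# Stub classifier used only to show the control flow (keyword check, not a model).
def fake_classify(q: str) -> Dict[str, object]:
    is_legal = "قانون" in q or "الدستور" in q
    return {"query_type": "legal_query" if is_legal else "general_query"}


print(answer_query("ما هي حقوق المرأة في الدستور المصري؟",
                   fake_classify, lambda q: "legal", lambda q: "general"))
```

In the real system the two handlers would be the retrieval-backed legal chain and the general conversation chain.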


## Install the Necessary Libraries

In [None]:
!pip install langchain_google_genai pdfplumber langchain-community langchain-huggingface
!pip install arabic-reshaper python-bidi pyarabic chromadb
!pip install bitsandbytes
!pip install neo4j
!pip install langchain_groq
!pip install faiss-cpu
!pip install tashaphyne
!pip install --upgrade tensorflow
!pip install --upgrade tf_keras


## Log in to Hugging Face

In [None]:
!huggingface-cli login



## Import the Necessary Libraries

In [None]:
import re
import torch
import pdfplumber
import arabic_reshaper
import unicodedata
from bidi.algorithm import get_display
import os
import json
import numpy as np
from typing import List, Dict, Any, Literal

# NLP and ML imports
from transformers import pipeline
from tashaphyne.stemming import ArabicLightStemmer

# LangChain imports
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableBranch, RunnableLambda
from langchain_groq import ChatGroq
from langchain_google_genai import ChatGoogleGenerativeAI

# Neo4j imports
from neo4j import GraphDatabase

## Arabic Text Processing
### This class is responsible for processing Arabic text, including normalization, stemming, and extracting key legal terms. It is essential for preparing the text for further analysis and retrieval.



In [None]:
class ArabicTextProcessor:
    """
    Handles Arabic text processing tasks including normalization,
    stemming, and legal term extraction.
    """

    def __init__(self):
        """Initialize the processor with Arabic stopwords and stemmer."""
        self.stopwords = self.load_arabic_stopwords()
        self.stemmer = ArabicLightStemmer()  # Light stemming for Arabic

    def load_arabic_stopwords(self):
        """Load common Arabic stopwords."""
        return set([
            "في", "من", "على", "و", "ال", "الى", "إلى", "عن", "مع",
            "هذا", "هذه", "ذلك", "تلك", "أو", "ثم", "لكن", "و", "ف", "ب", "ل"
        ])

    def normalize_arabic_text(self, text):
        """
        Normalize Arabic text by removing diacritics and standardizing letter forms.

        Args:
            text: The Arabic text to normalize

        Returns:
            Normalized text
        """
        # Remove diacritics (tashkeel)
        text = re.sub(r'[\u064B-\u065F\u0670]', '', text)

        # Normalize various forms of alef
        text = re.sub(r'[إأآا]', 'ا', text)

        # Normalize teh marbuta and yeh
        text = text.replace('ة', 'ه')
        text = text.replace('ى', 'ي')

        # Remove kashida (tatweel)
        text = text.replace('ـ', '')

        return text

    def extract_key_legal_terms(self, text):
        """
        Extract key legal terms from text to improve retrieval.

        Args:
            text: The Arabic text to extract terms from

        Returns:
            List of unique legal terms found in the text
        """
        # Legal term patterns in Arabic
        legal_term_patterns = [
            # Crime and court related terms
            # (the longer "التزام قانوني" comes before "التزام" so the shorter alternative does not shadow it)
            r'(جريمة|تهمة|قضية|دعوى|محكمة|قانون|حكم|التزام قانوني|التزام|مسؤولية|عقد|إجراءات|حقوق|سجن|غرامة|تعويض|مخالفة|جنحة|جناية)',

            # Legal roles and processes
            r'(متهم|شاهد|محام|قاضي|مدعي|مدعى عليه|نيابة|تحقيق|محضر|أدلة|إثبات|براءة|إدانة)',

            # Legal procedures
            r'(إجراءات|استئناف|نقض|طعن|دفاع|اتهام|محاكمة|جلسة|حبس|توقيف|احتجاز|تفتيش|ضبط)',

            # Statute of limitations
            # (multi-word alternatives come before the bare "تقادم" so they can match)
            r'(تقادم مكسب|تقادم مسقط|مدة التقادم|سقوط الحق بالتقادم|تقادم|مرور الزمن|انقضاء الدعوى)',

            # Rights usage
            r'(استعمال الحق|سوء استعمال الحق|تجاوز حدود الحق|حق مشروع|حق غير مشروع|إساءة استعمال الحق)',

            # Legal roles and procedures
            r'(مدعي|مدعى عليه|محام|قاضي|تحقيق|إثبات|براءة|إدانة|إجراءات قانونية|عقوبة|دعوى مدنية)',

            # Women's rights
            r'(حقوق المرأة|المساواة بين الجنسين|تمكين المرأة|حماية المرأة|التمثيل السياسي للمرأة)',

            # Non-discrimination and constitutional rights
            r'(عدم التمييز|الحقوق الدستورية|المشاركة السياسية|العمل|الأجور|الأحوال الشخصية)',

            # Egyptian legal sources
            r'(الدستور المصري|المجلس القومي للمرأة|القانون المدني|القانون الجنائي|الأحوال الشخصية)',

            # Civil law concepts
            r'(المسؤولية التقصيرية|المسؤولية العقدية|التعويض المدني)',

            # Contracts and obligations
            r'(العقد|الالتزام|الإرادة المنفردة|الإثراء بلا سبب|الفعل الضار|الضرر)'
        ]

        key_terms = []
        for pattern in legal_term_patterns:
            matches = re.findall(pattern, text)
            # Apply stemming to each matched term
            key_terms.extend([self.stemmer.light_stem(word) for word in matches])

        # Return unique terms only
        return list(set(key_terms))
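
To make the normalization rules concrete, here is a minimal standalone sketch that mirrors the steps of `normalize_arabic_text` using only the standard library. The function name `normalize` and the sample words are chosen for illustration.

```python
import re

_DIACRITICS = re.compile(r"[\u064B-\u065F\u0670]")  # tashkeel marks
_ALEF_FORMS = re.compile(r"[إأآا]")                  # alef variants


def normalize(text: str) -> str:
    """Strip diacritics, unify alef forms, map teh marbuta/alef maqsura, drop tatweel."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_FORMS.sub("ا", text)
    text = text.replace("ة", "ه").replace("ى", "ي")
    return text.replace("ـ", "")


for word in ["الْقَانُونُ", "المرأة", "إلى", "قـانـون"]:
    print(word, "->", normalize(word))
```

Normalization of this kind only helps retrieval if the same function is applied to both the stored text and the incoming query, since it changes the surface form of words.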

## PDF Processing Functions
### These functions handle the extraction of text from PDF files, splitting the text into manageable chunks, and processing multiple PDFs. They ensure the text is ready for further analysis and storage.



In [None]:
def extract_text_from_pdf(pdf_path):
    """
    Extract Arabic text from a PDF file and fix text direction issues.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Normalized and fixed Arabic text content
    """
    extracted_text = ""

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                # Fix Arabic text rendering issues
                reshaped_text = arabic_reshaper.reshape(text)
                fixed_text = get_display(reshaped_text)  # Correct text order
                extracted_text += fixed_text + "\n"

    # Check if extraction was successful
    if not extracted_text.strip():
        print(f"⚠️ No text extracted from {pdf_path}! Check if the PDF is scanned or uses embedded fonts.")

    # Post-process and normalize the extracted text
    normalized_text = unicodedata.normalize("NFKC", extracted_text)

    # Fix common Arabic text issues
    normalized_text = re.sub(r'\bلا(?=[ا-ي])', 'ال', normalized_text)  # Fix "la" at beginning of words
    normalized_text = re.sub(r'(?<=[ا-ي])لا\b', 'ال', normalized_text)  # Fix "la" at end of words
    normalized_text = re.sub(r"\)([^\)]+)\(", r"(\1)", normalized_text)  # Fix reversed parentheses

    return normalized_text


def split_legal_text(text, source_name):
    """
    Split legal text into structured chunks for processing.

    Args:
        text: The extracted text to split
        source_name: Name of the legal source (e.g., "Constitution")

    Returns:
        List of chunks with metadata
    """
    print(f"📜 Processing text from {source_name}, Preview:\n{text[:500]}\n{'='*50}")

    # Set up text splitter with appropriate parameters for Arabic legal text
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=[".", "؟", "\n\n"]  # Arabic-appropriate separators
    )

    # Split text into pages
    pages = splitter.split_text(text)

    # Create structured chunks with metadata
    chunks = []
    for i, page in enumerate(pages, start=1):
        chunks.append({
            "content": page.strip(),
            "source": source_name,
            "article_id": str(i)  # Using chunk number as article_id
        })

    return chunks


def load_and_process_pdfs(legal_files):
    """
    Load and process multiple Arabic legal PDFs.

    Args:
        legal_files: List of tuples containing (pdf_path, source_name)

    Returns:
        List of all processed text chunks with metadata
    """
    all_chunks = []

    for pdf_path, source_name in legal_files:
        print(f"\n🔍 Processing {source_name} from {pdf_path}...")

        # Extract text from PDF
        extracted_text = extract_text_from_pdf(pdf_path)

        if not extracted_text.strip():
            print(f"❌ Failed to extract text from {pdf_path}")
            continue

        # Split text into chunks with metadata
        chunks = split_legal_text(extracted_text, source_name)
        all_chunks.extend(chunks)

    # Verify processing results
    if not all_chunks:
        print("❌ No valid chunks found in any document!")
        return []

    print(f"✅ Total chunks across all documents: {len(all_chunks)}")

    # Print sample chunks for verification (limited to first 120)
    for chunk_index, chunk in enumerate(all_chunks[:120]):
        print(f"Chunk {chunk_index}: {chunk['content'][:100]}...")

    return all_chunks
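
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is a sketch under stated assumptions, not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators first and then merges pieces); `window_chunks` is a hypothetical helper that only shows how consecutive chunks share text and how the chunk number becomes `article_id`.

```python
from typing import Dict, List


def window_chunks(text: str, source: str, size: int = 1000, overlap: int = 200) -> List[Dict[str, str]]:
    """Cut text into fixed-size windows where neighbours share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for number, start in enumerate(range(0, len(text), step), start=1):
        piece = text[start:start + size].strip()
        if piece:
            chunks.append({"content": piece, "source": source, "article_id": str(number)})
    return chunks


# Tiny sizes so the overlap is visible by eye.
demo = window_chunks("abcdefghij", "demo", size=4, overlap=2)
print([c["content"] for c in demo])
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.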

## Neo4j Graph Database Handler
### This class manages interactions with the Neo4j graph database, including setting up the schema, adding legal text chunks, and performing searches. It is crucial for storing and retrieving structured legal knowledge.



In [None]:
class LegalGraphDB:
    """
    Handles interactions with Neo4j graph database for legal knowledge representation.
    """

    def __init__(self, uri, username, password):
        """
        Initialize connection to Neo4j.

        Args:
            uri: Neo4j URI
            username: Neo4j username
            password: Neo4j password
        """
        self.driver = GraphDatabase.driver(uri, auth=(username, password))
        self.setup_schema()

    def close(self):
        """Close the Neo4j connection."""
        self.driver.close()

    def setup_schema(self):
        """Set up Neo4j schema with constraints and indexes for performance."""
        with self.driver.session() as session:
            # Create constraints for uniqueness
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (l:LegalArticle) REQUIRE l.id IS UNIQUE")
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (s:Source) REQUIRE s.name IS UNIQUE")
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (t:Term) REQUIRE t.name IS UNIQUE")

            # Create indexes for faster retrieval
            session.run("CREATE INDEX IF NOT EXISTS FOR (l:LegalArticle) ON (l.article_id)")
            session.run("CREATE INDEX IF NOT EXISTS FOR (t:Term) ON (t.name)")

    def add_legal_chunks(self, chunks):
        """
        Add legal text chunks to Neo4j with metadata and relationships.

        Args:
            chunks: List of text chunks with metadata
        """
        import hashlib  # local import: used only to build stable chunk IDs

        # Create the processor once instead of once per chunk
        processor = ArabicTextProcessor()

        with self.driver.session() as session:
            for chunk in chunks:
                # Generate a stable unique ID for the chunk. The built-in hash() is
                # salted per process for strings, so it would give different IDs on each run.
                content_hash = hashlib.sha256(chunk["content"].encode("utf-8")).hexdigest()[:16]
                chunk_id = f"{chunk['source']}-{chunk['article_id']}-{content_hash}"

                # Create the legal article node (matched on its unique id) with metadata
                session.run("""
                    MERGE (s:Source {name: $source})
                    MERGE (l:LegalArticle {id: $id})
                    SET l.content = $content,
                        l.article_id = $article_id
                    MERGE (l)-[:FROM_SOURCE]->(s)
                """, {
                    "id": chunk_id,
                    "content": chunk["content"],
                    "article_id": chunk["article_id"],
                    "source": chunk["source"]
                })

                # Extract key terms and create relationships
                key_terms = processor.extract_key_legal_terms(chunk["content"])

                # Link articles to terms; MERGE avoids duplicate relationships on re-runs
                for term in key_terms:
                    session.run("""
                        MATCH (l:LegalArticle {id: $chunk_id})
                        MERGE (t:Term {name: $term})
                        MERGE (l)-[:MENTIONS]->(t)
                    """, {
                        "chunk_id": chunk_id,
                        "term": term
                    })

    def search_related_articles(self, terms, limit=5):
        """
        Search for articles related to given legal terms.

        Args:
            terms: List of legal terms to search for
            limit: Maximum number of results to return

        Returns:
            List of related article dictionaries
        """
        with self.driver.session() as session:
            result = session.run("""
                UNWIND $terms AS term
                MATCH (t:Term {name: term})<-[:MENTIONS]-(l:LegalArticle)-[:FROM_SOURCE]->(s:Source)
                RETURN l.id AS id, l.content AS content, l.article_id AS article_id,
                       s.name AS source, count(t) AS relevance
                ORDER BY relevance DESC
                LIMIT $limit
            """, {"terms": terms, "limit": limit})

            return [record.data() for record in result]

    def search_contextual_articles(self, article_id, source, limit=3):
        """
        Find contextually related articles based on proximity in the law.

        Args:
            article_id: ID of the article to find context for
            source: Legal source name
            limit: Maximum number of results to return

        Returns:
            List of contextually related article dictionaries
        """
        with self.driver.session() as session:
            result = session.run("""
                MATCH (l:LegalArticle)-[:FROM_SOURCE]->(s:Source {name: $source})
                WHERE abs(toInteger(l.article_id) - toInteger($article_id)) <= 5
                AND l.article_id <> $article_id
                RETURN l.id AS id, l.content AS content, l.article_id AS article_id,
                      s.name AS source
                ORDER BY abs(toInteger(l.article_id) - toInteger($article_id))
                LIMIT $limit
            """, {"article_id": article_id, "source": source, "limit": limit})

            return [record.data() for record in result]
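
The ranking that `search_related_articles` asks Neo4j to compute (count how many of the query terms each article mentions, highest first) can be mimicked in memory. The sketch below is an illustration with made-up article IDs and a hypothetical `rank_by_shared_terms` helper; it does not talk to a database.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def rank_by_shared_terms(mentions: Dict[str, Set[str]], terms: List[str], limit: int = 5) -> List[Tuple[str, int]]:
    """Count, per article, how many distinct query terms it mentions; highest first."""
    counts: Counter = Counter()
    for term in set(terms):
        for article_id, article_terms in mentions.items():
            if term in article_terms:
                counts[article_id] += 1
    return counts.most_common(limit)


# Toy "MENTIONS" relationships: article id -> terms it mentions.
mentions = {
    "constitution-11": {"حقوق", "المرأة"},
    "constitution-53": {"حقوق"},
    "civil-code-5": {"عقد"},
}
print(rank_by_shared_terms(mentions, ["حقوق", "المرأة"]))
```

Articles that mention none of the terms never appear in the result, which matches the behaviour of a `MATCH` on the `MENTIONS` relationship.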

## Vector Search with FAISS
### This function creates vector embeddings for the text chunks using FAISS, enabling semantic search capabilities. It uses an Arabic-optimized embedding model to ensure accurate results.



In [None]:
def create_faiss_embeddings(chunks):
    """
    Create FAISS vector embeddings for text chunks using an Arabic-optimized model.

    Args:
        chunks: List of text chunks with metadata

    Returns:
        FAISS vector database or None if no chunks
    """
    if not chunks:
        print("⚠️ No text chunks for embedding. Skipping embedding process!")
        return None

    # Extract text content and metadata
    texts = [chunk["content"] for chunk in chunks]
    metadatas = [{"source": chunk["source"], "article_id": chunk["article_id"]} for chunk in chunks]

    # Initialize Arabic-optimized embedding model
    embedding_model = HuggingFaceEmbeddings(
        model_name="silma-ai/silma-embeddding-sts-v0.1",  # Arabic-optimized embedding model
        model_kwargs={'device': 'cpu'},
        encode_kwargs={'normalize_embeddings': True}
    )

    # Create FAISS index from texts
    vectordb = FAISS.from_texts(
        texts=texts,
        embedding=embedding_model,
        metadatas=metadatas
    )

    # Save the FAISS index for future use
    vectordb.save_local("./arabic_law_faiss")

    return vectordb
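
Conceptually, a flat vector index compares the query embedding with every stored embedding and returns the closest ones. The sketch below shows that idea with toy 2-D vectors and the standard library only. The squared Euclidean metric is an assumption made for illustration, and `unit` and `nearest` are hypothetical helpers; check the index configuration for the metric actually in use.

```python
import math
from typing import List, Sequence, Tuple


def unit(v: Sequence[float]) -> List[float]:
    """Scale a vector to length 1 (what normalize_embeddings=True does to embeddings)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nearest(query: Sequence[float], vectors: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, squared L2 distance) for the k stored vectors closest to the query."""
    q = unit(query)
    scored = []
    for i, v in enumerate(vectors):
        u = unit(v)
        dist = sum((a - b) ** 2 for a, b in zip(q, u))
        scored.append((i, dist))
    return sorted(scored, key=lambda pair: pair[1])[:k]


toy_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
hits = nearest((1.0, 0.1), toy_vectors, k=2)
print(hits)
```

For unit-length vectors the squared distance equals `2 - 2 * cosine_similarity`, so ranking by smallest distance and ranking by largest cosine similarity give the same order.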

## Query Classification
### This function sets up a classifier to determine the type of legal query (e.g., specific law question, legal advice request). It uses a multilingual model compatible with Arabic.



In [None]:
def setup_query_classifier():
    """
    Set up a classifier for legal query types using zero-shot classification.

    Returns:
        Function that classifies Arabic queries
    """
    # Use multilingual model compatible with Arabic
    classifier = pipeline(
        "zero-shot-classification",
        model="joeddav/xlm-roberta-large-xnli"
    )

    def classify_query(query):
        """
        Classify an Arabic query into appropriate category.

        Args:
            query: Arabic text query

        Returns:
            Dictionary with classification details
        """
        # Define classes in Arabic
        classes = [
            "سؤال عن قانون محدد",     # Question about specific law
            "استفسار عن حالة قانونية", # Legal scenario inquiry
            "طلب استشارة قانونية",     # Request for legal advice
            "سؤال عام",                # General question
            "دردشة عامة"               # General chat
        ]

        # Perform zero-shot classification
        result = classifier(
            query,
            candidate_labels=classes,
            hypothesis_template="هذا النص يتعلق بـ {}."  # "This text is about {}"
        )

        # Get top class and confidence
        top_class = result["labels"][0]
        confidence = result["scores"][0]

        # Map to simplified categories
        if top_class in ["سؤال عن قانون محدد", "استفسار عن حالة قانونية", "طلب استشارة قانونية"]:
            query_type = "legal_query"
        else:
            query_type = "general_query"

        print(f"✅ Query classified as: {top_class} → {query_type} (confidence: {confidence:.2f})")

        return {
            "detailed_type": top_class,
            "query_type": query_type,
            "confidence": confidence
        }

    return classify_query
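
The mapping from the detailed label to the two coarse categories can be exercised without downloading the model. The sketch below applies the same rule to a hand-written result dictionary shaped like a zero-shot pipeline output (labels sorted by descending score); the scores are made up and `to_query_type` is a hypothetical helper.

```python
LEGAL_LABELS = {
    "سؤال عن قانون محدد",       # Question about a specific law
    "استفسار عن حالة قانونية",  # Legal scenario inquiry
    "طلب استشارة قانونية",      # Request for legal advice
}


def to_query_type(result: dict) -> dict:
    """Map a zero-shot result to legal_query / general_query using its top label."""
    top_label = result["labels"][0]
    return {
        "detailed_type": top_label,
        "query_type": "legal_query" if top_label in LEGAL_LABELS else "general_query",
        "confidence": result["scores"][0],
    }


# Hand-written stand-in for a pipeline result; not real model output.
mock_result = {
    "sequence": "أحتاج إلى استشارة بخصوص عقد إيجار",
    "labels": ["طلب استشارة قانونية", "سؤال عام"],
    "scores": [0.81, 0.19],
}
print(to_query_type(mock_result))
```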


## Hybrid Retrieval System Overview
The `hybrid_retrieval` function combines **vector search** (FAISS) and **graph-based search** (Neo4j) to retrieve relevant legal documents. This hybrid approach ensures both **semantic relevance** and **structured legal relationships** are considered.


### **How It Works**  

1. **Preprocess the query**  
   - Normalizes Arabic text.  
   - Extracts key legal terms for Neo4j search.  

2. **Perform Retrieval**  
   - **FAISS Search**: Finds semantically similar documents based on vector embeddings.  
   - **Neo4j Search**: Retrieves legal articles related to extracted key terms.  

3. **Merge & Rank Results**  
   - FAISS results are scored based on similarity.  
   - Neo4j results are weighted and combined with FAISS results.  
   - Results are sorted by total relevance.  

4. **Expand Context (Neo4j)**  
   - For top results, finds related legal articles to provide more context.  

5. **Format Results**  
   - Returns a structured list of the most relevant legal articles.  


| **Step** | **Description** |
|----------|---------------|
| **1. Query Processing** | Normalizes Arabic text and extracts key legal terms. |
| **2. FAISS Vector Search** | Finds top-K semantically similar documents. |
| **3. Neo4j Graph Search** | Retrieves related articles based on structured legal knowledge. |
| **4. Merge & Rank Results** | Combines FAISS and Neo4j results using a hybrid scoring method. |
| **5. Sort by Relevance** | Sorts results by total score. |
| **6. Context Expansion** | Adds related legal articles for better context. |
| **7. Format & Return** | Outputs the results in a readable format. |

---


### **Why This Approach?**  
- **Improves accuracy** by combining unstructured and structured search.  
- **Enhances legal context** by linking related laws.  
- **Balances semantic similarity & legal relevance** for better retrieval.  
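
The scoring used by `hybrid_retrieval` can be worked through by hand. A FAISS distance `d` becomes a similarity `1 / (1 + d)`, and each matched graph term adds `0.2`. The sketch below reproduces that arithmetic; `fused_score` is a hypothetical helper name, and `distance=None` stands for an article found only through the graph.

```python
from typing import Optional


def fused_score(distance: Optional[float], matched_terms: int, graph_weight: float = 0.2) -> float:
    """Combine a vector distance and a graph term count into one relevance score."""
    vector_part = 1.0 / (1.0 + distance) if distance is not None else 0.0
    return vector_part + graph_weight * matched_terms


print(fused_score(0.25, 2))   # found by both searches
print(fused_score(1.0, 0))    # found by vector search only
print(fused_score(None, 3))   # found by graph search only
```

One property worth noting: the vector part is bounded by 1, while the graph part grows with the number of matched terms, so an article matching many terms can outrank a very close vector hit.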




In [None]:
def hybrid_retrieval(query, faiss_db, graph_db, top_k=10):
    """
    Perform hybrid retrieval combining vector search and graph traversal.

    Args:
        query: User's query text
        faiss_db: FAISS vector database
        graph_db: Neo4j graph database
        top_k: Maximum number of results to retrieve

    Returns:
        String of formatted retrieved documents
    """
    processor = ArabicTextProcessor()

    # Step 1: Extract key legal terms for Neo4j search
    legal_terms = processor.extract_key_legal_terms(query)

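    # Step 2: Perform FAISS vector similarity search.
    # Note: the raw query is used here, which matches the index (chunks are embedded
    # without normalize_arabic_text); normalized_query is currently unused.
    normalized_query = processor.normalize_arabic_text(query)
    vector_results = faiss_db.similarity_search_with_score(query, k=top_k)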

    # Step 3: Perform Neo4j graph-based search
    graph_results = []
    if legal_terms:
        graph_results = graph_db.search_related_articles(legal_terms, limit=top_k)

    # Step 4: Combine and rank results
    combined_results = []

    # Process vector results
    for doc, score in vector_results:
        retrieved_doc = {
            "content": doc.page_content,
            "source": doc.metadata.get("source", "Unknown"),
            "article_id": doc.metadata.get("article_id", "Unknown"),
            "vector_score": score,
            "graph_score": 0,
            "total_score": 1 / (1 + score)  # Convert distance to similarity score
        }
        combined_results.append(retrieved_doc)

    # Process graph results and merge with vector results if they exist
    for result in graph_results:
        # Check if this result already exists in combined_results
        existing = next((item for item in combined_results if
                         item["content"] == result["content"] and
                         item["source"] == result["source"]), None)

        if existing:
            # Update the existing entry with graph score
            existing["graph_score"] = result["relevance"]
            existing["total_score"] = existing["total_score"] + (result["relevance"] * 0.2)  # Weight graph results
        else:
            # Add new entry
            retrieved_doc = {
                "content": result["content"],
                "source": result["source"],
                "article_id": result["article_id"],
                "vector_score": 0,
                "graph_score": result["relevance"],
                "total_score": result["relevance"] * 0.2  # Weight graph results
            }
            combined_results.append(retrieved_doc)

    # Step 5: Sort by total score and get top results
    combined_results.sort(key=lambda x: x["total_score"], reverse=True)

    # Step 6: Context expansion - for top results, find contextually related articles
    top_results = combined_results[:min(5, len(combined_results))]
    expanded_results = list(top_results)

    for result in top_results:
        if result["article_id"] not in ("Unknown", "غير محدد"):  # Skip placeholder IDs ("غير محدد" = "Unknown")
            related = graph_db.search_contextual_articles(
                result["article_id"],
                result["source"],
                limit=2
            )

            for rel in related:
                # Check if already in results
                if not any(r["content"] == rel["content"] for r in expanded_results):
                    expanded_results.append({
                        "content": rel["content"],
                        "source": rel["source"],
                        "article_id": rel["article_id"],
                        "vector_score": 0,
                        "graph_score": 0.3,
                        "total_score": 0.3,
                        "context_relation": f"Related to Article {result['article_id']}"
                    })

    # Get final top results (limit to reasonable number)
    final_results = expanded_results[:min(10, len(expanded_results))]

    # Format for LLM consumption
    formatted_results = []
    for i, res in enumerate(final_results):
        formatted_results.append(
            f"【{i+1}】 المصدر: {res['source']} | المادة: {res['article_id']}\n{res['content']}\n"
        )

    return "\n\n".join(formatted_results)
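
Step 6 relies on `search_contextual_articles`, which selects articles whose numeric ID lies within 5 of the hit. Because `split_legal_text` uses the chunk number as `article_id`, this proximity means neighbouring chunks rather than statutory article numbers. A minimal in-memory sketch of the same selection rule, with a hypothetical `neighbours` helper:

```python
from typing import List


def neighbours(article_id: str, known_ids: List[str], window: int = 5, limit: int = 3) -> List[str]:
    """Return up to `limit` other IDs whose number is within `window`, closest first."""
    target = int(article_id)
    candidates = [
        other for other in known_ids
        if other != article_id and abs(int(other) - target) <= window
    ]
    # sorted() is stable, so equally distant IDs keep their input order.
    return sorted(candidates, key=lambda other: abs(int(other) - target))[:limit]


print(neighbours("10", ["3", "8", "9", "10", "12", "16"]))
```

The Cypher query does not specify an order among equally distant articles, so the database may break ties differently from this sketch.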



## LLM Chain Building
### These functions build LangChain chains for legal advice and general conversation. They integrate the retrieval system with the LLM to provide accurate and context-aware responses.



In [None]:
def build_legal_advisor_chain(faiss_db, graph_db, api_key):
    """
    Build an enhanced legal advisor chain with context handling.

    Args:
        faiss_db: FAISS vector database
        graph_db: Neo4j graph database
        api_key: API key for the LLM service

    Returns:
        LangChain chain for legal advice
    """
    # Set Google API key for LLM
    os.environ["GOOGLE_API_KEY"] = api_key  # Key is supplied by the caller

    # Create LLM with appropriate parameters for legal reasoning
    llm = ChatGoogleGenerativeAI(
        model="gemini-1.5-flash",
        google_api_key=os.environ["GOOGLE_API_KEY"],
        temperature=0.2  # Lower temperature for more consistent legal advice
    )

    # Create comprehensive prompt template with history and context support
    prompt_template = """
    أنت مستشار قانوني مصري متخصص في القانون المصري، مهمتك تقديم مشورة قانونية دقيقة استناداً إلى النصوص القانونية.

    # الاستفسار الحالي:
    {query}

    # النصوص القانونية ذات الصلة:
    {context}

    # التاريخ السابق للاستفسارات (إن وجد):
    {history}

    # مهمتك:
    1. حلل الاستفسار بدقة وحدد العناصر القانونية الرئيسية فيه.
    2. استخرج المواد القانونية المناسبة من النصوص المقدمة.
    3. اشرح كيف تنطبق هذه المواد على الحالة المعروضة.
    4. قدم المشورة القانونية بناءً على النصوص القانونية فقط، وليس رأيك الشخصي.
    5. اذكر المصدر القانوني والمواد التي استندت إليها بوضوح.
    6. تجنب الاستنتاجات غير المدعمة بنصوص قانونية.

    # المشورة القانونية:
    """

    # Query history handler for context
    query_history = []

    def process_with_history(query):
        """Process query with history context"""
        # Add to history (limiting to last 3 queries for context)
        query_history.append(query)
        if len(query_history) > 3:
            query_history.pop(0)

        # Format history
        history_text = ""
        if len(query_history) > 1:
            history_text = "الاستفسارات السابقة:\n" + "\n".join([
                f"- {q}" for q in query_history[:-1]
            ])

        # Get context from retrieval
        context = retrieve_docs(query)

        return {
            "query": query,
            "context": context,
            "history": history_text
        }

    # Function to retrieve documents
    def retrieve_docs(query):
        return hybrid_retrieval(query, faiss_db, graph_db)

    # Create the legal advisor chain. process_with_history builds the
    # {query, context, history} dict that the prompt template expects;
    # passing the raw query to every key would put the query text in {history}.
    legal_chain = (
        RunnableLambda(process_with_history)
        | PromptTemplate.from_template(prompt_template)
        | llm
        | StrOutputParser()
    )

    return legal_chain


def build_general_chain(api_key):
    """
    Build a chain for general conversation in Arabic.

    Args:
        api_key: API key for the LLM service

    Returns:
        LangChain chain for general conversation
    """
    # Set Google API key
    os.environ["GOOGLE_API_KEY"] = api_key  # Expose the supplied key to the Google client

    # Create LLM with parameters suitable for conversation
    llm = ChatGoogleGenerativeAI(
        model="gemini-1.5-flash",
        google_api_key=os.environ["GOOGLE_API_KEY"],
        temperature=0.7  # Higher temperature for more natural conversation
    )

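    # Create simple prompt for general conversation. In English, the Arabic text
    # reads: "You are a friendly assistant who speaks Arabic fluently. Please reply
    # to the user's message naturally and kindly."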
    prompt = PromptTemplate.from_template("""
    أنت مساعد ودود يتحدث العربية بطلاقة. يرجى الرد على رسالة المستخدم بشكل طبيعي ولطيف:

    المستخدم: {query}
    """)

    # Create and return the chain
    return prompt | llm | StrOutputParser()
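
The legal advisor chain keeps a short window of recent queries and formats the earlier ones as history for the prompt. The following is a minimal, standalone sketch of that windowing idea, assuming a `collections.deque` with `maxlen` in place of the manual `pop(0)`. The name `make_history_tracker` is hypothetical and is not part of the system above.

```python
from collections import deque


def make_history_tracker(max_queries=3):
    """Return a function that records queries and formats the earlier ones.

    Hypothetical helper for illustration; it mirrors the three-query window
    used by the legal advisor chain but performs no retrieval.
    """
    recent = deque(maxlen=max_queries)  # the oldest query is dropped automatically

    def track(query):
        recent.append(query)
        previous = list(recent)[:-1]  # everything except the current query
        history_text = ""
        if previous:
            # Same Arabic header as the chain: "Previous queries:"
            history_text = "الاستفسارات السابقة:\n" + "\n".join(
                f"- {q}" for q in previous
            )
        return {"query": query, "history": history_text}

    return track


track = make_history_tracker()
for q in ["a", "b", "c", "d"]:
    result = track(q)

# After four queries only the last three are kept, so the history lists "b" and "c".
print(result["history"])
```

Because the tracker holds its state in a closure, each call to `make_history_tracker` gives an independent history, which may be useful if several users share one process.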


## Main System Setup
### This function sets up the entire legal advisory system by processing legal documents, creating databases, and building the necessary chains for query handling.


In [None]:
def build_complete_system():
    """
    Build the complete legal advisory system with all components.

    Returns:
        Function that processes user queries
    """
    # Define legal files to process
    legal_files = [
        ("/content/الدستور المصري المعدل 2019.pdf", "الدستور"),
        ("/content/المدني.pdf", "القانون المدني"),
        ("/content/قانون_الاجراءات_الجنائية.pdf", "الإجراءات الجنائية")
    ]

    # Neo4j configuration
    neo4j_config = {
        "uri": "Your_URI",
        "username": "Your_Username",
        "password": "Your_Password"
    }

    # API key for LLM
    api_key = "Your_API_Key"

    print("🔄 Starting integrated legal system setup...")

    # Step 1: Process PDFs and create chunks
    all_chunks = load_and_process_pdfs(legal_files)

    if not all_chunks:
        print("❌ Failed to load and process legal texts!")
        return None

    # Step 2: Create FAISS vector database
    print("🔄 Creating FAISS vector database for semantic search...")
    faiss_db = create_faiss_embeddings(all_chunks)

    # Step 3: Create Neo4j graph database
    print("🔄 Creating Neo4j graph database for legal relationships...")
    graph_db = LegalGraphDB(
        neo4j_config["uri"],
        neo4j_config["username"],
        neo4j_config["password"]
    )
    graph_db.add_legal_chunks(all_chunks)

    # Step 4: Set up query classifier
    print("🔄 Setting up query classifier...")
    query_classifier = setup_query_classifier()

    # Step 5: Build advisor chains
    print("🔄 Creating processing chains...")
    legal_chain = build_legal_advisor_chain(faiss_db, graph_db, api_key)
    general_chain = build_general_chain(api_key)

    # Step 6: Create routing system
    def route_query(query):
        """Route queries to appropriate chain based on classification"""
        # Classify the query
        classification = query_classifier(query)
        query_type = classification["query_type"]

        # Route to appropriate chain based on classification
        if query_type == "legal_query":
            return legal_chain.invoke(query)
        else:
            return general_chain.invoke(query)

    print("✅ System setup completed successfully!")
    return route_query
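
The routing step depends only on the `query_type` field returned by the classifier, so it can be exercised without loading any models or credentials. The sketch below shows one way to do that, assuming stand-in callables; `make_router`, `fake_classifier`, and the two lambda handlers are hypothetical and only imitate the real classifier and chains.

```python
def make_router(classify, legal_handler, general_handler):
    """Return a function that dispatches a query based on its classification."""

    def route(query):
        query_type = classify(query)["query_type"]
        # Anything not classified as legal falls through to the general handler,
        # matching the if/else in route_query.
        handler = legal_handler if query_type == "legal_query" else general_handler
        return handler(query)

    return route


def fake_classifier(query):
    """Keyword-based stand-in for the zero-shot classifier (illustration only)."""
    legal_markers = ("قانون", "دستور")  # "law", "constitution"
    is_legal = any(marker in query for marker in legal_markers)
    return {"query_type": "legal_query" if is_legal else "general_query"}


route = make_router(
    fake_classifier,
    legal_handler=lambda q: f"[legal] {q}",
    general_handler=lambda q: f"[general] {q}",
)

print(route("كيف الحال؟"))
print(route("ما هي حقوق المرأة في الدستور المصري؟"))
```

In the real system the handlers would be `legal_chain.invoke` and `general_chain.invoke`; keeping the router free of model code makes the dispatch rule easy to check in isolation.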

## Main Execution
### This cell creates the complete system and tests it with example queries. It demonstrates how the system processes both general and legal queries and returns an appropriate response for each.

In [None]:
# Create the system
system = build_complete_system()

# Example usage and testing
if system:
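    # Test queries in both Arabic conversational and legal domains
    test_queries = [
        # General conversation: "How are you?"
        "كيف الحال؟",

        # Legal query: "What are women's rights in the Egyptian Constitution?"
        "ما هي حقوق المرأة في الدستور المصري؟",

        # Complex legal scenario: citizens apply to found a political party and are
        # refused on the grounds that Egypt's political system does not allow
        # multiple parties. Can they challenge the decision, and on what
        # constitutional basis?
        """قرر مجموعة من المواطنين تأسيس حزب سياسي جديد يهدف إلى تعزيز حقوق الإنسان والتنمية المستدامة.
        بعد تقديم الأوراق اللازمة، تم رفض الطلب بحجة أن النظام السياسي في مصر لا يسمح بتعدد الأحزاب.
        هل يحق لهم الطعن على هذا القرار؟ وما هو الأساس الدستوري لذلك؟""",

        # Criminal law scenario: a government employee accused of embezzling public
        # funds argues the case was filed by an unauthorized person. May anyone
        # file the criminal case here?
        """تعرض موظف حكومي لاتهام باختلاس أموال عامة، لكنه أنكر التهمة، وطلب محاميه بطلان الدعوى
        بحجة أنها رفعت من قبل شخص غير مختص. هل يجوز لأي شخص رفع الدعوى الجنائية في هذه الحالة؟""",

        # Civil law scenario: a neighbor builds a high wall that blocks light and
        # air, claiming the right to build on his own land. Can the injured party
        # sue to have the wall removed?
        """قام أحد الجيران ببناء جدار مرتفع أمام منزل جاره، مما تسبب في حجب الضوء والهواء عنه،
        وذلك بدعوى أنه يستخدم حقه في البناء على أرضه الخاصة. هل يمكن للمتضرر رفع دعوى لإزالة الجدار؟""",

        # Specific law questions: the rules of prescription (limitation periods) in
        # the Egyptian Civil Code, and the cases in which exercising a right is
        # considered unlawful under the Civil Code
        """ما هي أحكام التقادم في القانون المدني المصري؟""",

        """ما هي الحالات التي يعتبر فيها استعمال الحق غير مشروع في القانون المدني؟"""
    ]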

    # Process each test query and display results
    for query in test_queries:
        print(f"\n📝 الاستفسار:\n{query}")
        print(f"\n🔍 الرد:\n{system(query)}\n{'='*80}")

🔄 Starting integrated legal system setup...

🔍 Processing الدستور from /content/الدستور المصري المعدل 2019.pdf...
📜 Processing text from الدستور, Preview:
الدستور المصري المعدل 2019
الباب الأول
الدول ة
المادة 1
جمهورية مصر العربية دولة ذات سيادة، موحدة لا تقبل التجزئة، وال ينزل عن شيء منها، نظامها جمهوري ديمقراطي، يقوم
.على أساس المواطنة وسيادة القانون
الشعب المصري جزء من الأمة العربية يعمل على تكاملها ووحدتها، ومصر جزء من العلام الإسلامي، تنتمي لإى القارة
.الإفريقية، وتعتز بامتدادها الآسيوي، وتسهم في بناء الحضارة الإنسانية
المادة 2
.الإسلام دين الدولة، ولالغة العربية لغتها الرسمية، ومبادئ الشريعة الإسلامية المصدر الرئيسي للتشريع
المادة 3
مبادئ شر

🔍 Processing القانون المدني from /content/المدني.pdf...
📜 Processing text from القانون المدني, Preview:
القانون المدني
المادة (1) : 1- تسري النصوص التشريعية على جميع المسائل التي تتناولها هذه النصوص في لفظها أو في فحواها. 2- فإذا
لم يوجد نص تشريعي يمكن تطبيقه، حكم القاضي بمقتضى العرف، فإذا لم يوجد، فبمقتضى مبادئ الشريعة الإسلامية، فإذا لم
.ت

🔄 Creating Neo4j graph database for legal relationships...
🔄 Setting up query classifier...


Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


🔄 Creating processing chains...
✅ System setup completed successfully!

📝 الاستفسار:
كيف الحال؟
✅ Query classified as: سؤال عام → general_query (confidence: 0.82)

🔍 الرد:
أهلًا بك! الحمد لله، بخير، وأنتَ كيف حالك؟  أتمنى أن تكون بخير.

📝 الاستفسار:
ما هي حقوق المرأة في الدستور المصري؟
✅ Query classified as: سؤال عن قانون محدد → legal_query (confidence: 0.48)

🔍 الرد:
المشورة القانونية:

ينص الدستور المصري على حقوق متعددة للمرأة،  وتكفل هذه المواد المساواة بين المرأة والرجل في جميع الحقوق المدنية والسياسية والاقتصادية والاجتماعية والثقافية.  وتجدر الإشارة إلى أن هذه الحقوق مُقيدة  بـ "أحكام الدستور"  و "على النحو الذي يحدده القانون" في بعض الحالات، مما يعني أن  التشريعات الفرعية  تلعب دورًا في تحديد آليات تطبيق هذه الحقوق.

**المواد الدستورية التي تضمن حقوق المرأة:**

* **المادة 11:**  تكفل الدولة تحقيق المساواة بين المرأة والرجل في جميع الحقوق المدنية والسياسية والاقتصادية والاجتماعية والثقافية وفقًا لأحكام الدستور.  وتعمل الدولة على ضمان تمثيل المرأة تمثيلاً مناسباً في المجالس النياب