## Steps for Processing Hadith Dataset

1. **Define Validation Models and Utilities**
   - Define a Pydantic model (`QuestionAnswer`) for validating question-answer pairs.
   - Implement a function (`estimate_tokens`) to estimate the number of tokens in Arabic text.

2. **Prepare the Prompt Template**
   - Create a prompt template (`PROMPT_TEMPLATE`) that instructs the language model to answer four fixed questions about each hadith, using Modern Standard Arabic with full diacritics.

3. **Process a Single Hadith Dataset File**
   - Load the input JSON file containing hadith entries.
   - For each entry:
     - Skip entries without a valid explanation (`sharh`).
     - Format the prompt with the hadith and its explanation.
     - Estimate the number of tokens in the prompt.
     - Send the prompt to the Together API and stream the response.
     - Collect and clean the model's output, ensuring it is a JSON array of four answers.
     - Map each answer to its corresponding question and update the entry with:
       - `FT_Pairs`: List of question-answer pairs.
       - `hadith_lessons`: First answer as a lesson.
       - `hadith_application`: Second answer as an application.
     - Handle errors by setting empty values if processing fails.
   - Save the enriched data to the output JSON file.
   - Print statistics about the processing.

4. **Process All Hadith Files in a Directory**
   - Create the output directory if it does not exist.
   - Find all JSON files in the input directory.
   - For each file:
     - Process the file using the above function.
     - Aggregate statistics across all files.
     - Add a delay between files to avoid rate limiting.
   - Print an overall summary of the processing results.
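
The output-cleaning part of step 3 can be sketched in isolation. The helper below, `parse_answers`, is a hypothetical name and not a function defined in this notebook. It assumes the model returns either a bare JSON array or one wrapped in a Markdown code fence, and it pads or truncates the result to four answers, roughly as the processing cell does inline.

```python
import json
import re
from typing import List

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example fence-safe
# Matches an optional opening fence (with or without a language tag) or a closing fence.
_FENCE_RE = re.compile(rf"^{FENCE}[a-zA-Z]*\s*|\s*{FENCE}$")

def parse_answers(raw: str, expected: int = 4) -> List[str]:
    """Hypothetical helper: turn raw model output into exactly `expected` answers."""
    cleaned = _FENCE_RE.sub("", raw.strip()).strip()
    answers = json.loads(cleaned)  # raises json.JSONDecodeError on malformed output
    if not isinstance(answers, list):
        raise ValueError(f"Expected a JSON array, got {type(answers).__name__}")
    # Coerce items to strings, truncate to `expected`, then pad with empty strings.
    answers = [str(a) for a in answers[:expected]]
    answers += [""] * (expected - len(answers))
    return answers
```

Keeping this logic in one small function makes it possible to test the cleaning rules without calling the API.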

In [1]:
import json
import re
import time
import os
import glob
from typing import List, Dict, Any
from pydantic import BaseModel, Field
from together import Together
from tqdm import tqdm

ImportError: cannot import name 'Together' from 'together' (/home/mohamed/.local/lib/python3.12/site-packages/together/__init__.py)

In [None]:
# Initialize Together client; the API key is read from the TOGETHER_API_KEY environment variable
client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# Define fixed questions with diacritics
def get_fixed_questions() -> List[Dict[str, str]]:
    """Return the fixed questions with their full diacritics"""
    return [
        {
            "question": "مَا هِيَ الرَّسَائِلُ الرَّئِيسِيَّةُ وَالدُّرُوسُ المُسْتَفَادَةُ والفَوَائِد المُستَخلصَة مِنَ الحَدِيثِ؟",
            "answer": "{answer1}"
        },
        {
            "question": "كَيْفَ يُمْكِنُ تَطْبِيقُ الحَدِيثِ فِي الحَيَاةِ اليَوْمِيَّةِ؟",
            "answer": "{answer2}"
        },
        {
            "question": "مَا أَهَمِّيَّةُ الحَدِيثِ فِي الفِقْهِ الإِسْلَامِيِّ؟",
            "answer": "{answer3}"
        },
        {
            "question": "مَا سَبَبُ وُرُودِ الحَدِيثِ؟",
            "answer": "{answer4}"
        }
    ]

In [None]:
# Define Pydantic models for validation
class QuestionAnswer(BaseModel):
    """Model for question-answer pairs"""
    question: str
    answer: str

# Function to estimate token count (rough approximation)
def estimate_tokens(text: str) -> int:
    """Estimate token count in a string (rough approximation)"""
    # For Arabic text, assume roughly 1 token per 2 characters (a coarse heuristic)
    return len(text) // 2

# Create a simplified prompt template that only asks for answers, not questions
PROMPT_TEMPLATE = """
You are an expert in analyzing Prophetic Hadiths and Islamic jurisprudence.
Your task is to analyze an Arabic Hadith using the provided explanation to extract knowledge, jurisprudential rulings, and practical applications.

== INPUT ==
You will receive:
- 'hadith': Text of the Prophetic Hadith in Arabic (with diacritics).
- 'explanation': Detailed explanation of the Hadith (Arabic text).

== TASK ==
Based on the provided Hadith, explanation, and your knowledge, answer the following questions in JSON format.

== FIXED QUESTIONS ==
Answer ONLY these questions without repeating the question text:
1. What are the main messages, lessons learned, and benefits derived from the Hadith?
2. How can the Hadith be applied in daily life?
3. What is the importance of the Hadith in Islamic jurisprudence?
4. What is the reason or context behind the narration of the Hadith?

== RULES ==
- Use Modern Standard Arabic with full diacritics in your answers.
- Base your answers on the provided explanation and your knowledge of authentic Hadiths.
- Ensure each answer accurately reflects the content and meaning of the Hadith without incorporating personal interpretations or conclusions.
- Verify the authenticity of all information before preparing your response.
- Return ONLY a JSON array with your answers as shown below, without any additional comments or explanations.

== Expected Input ==
{{
  "hadith": "{hadith}",
  "sharh": "{sharh}"
}}

== Expected Output (JSON Array) ==
[
  "الإجابة الأولى مع التشكيل الكامل...",
  "الإجابة الثانية مع التشكيل الكامل...",
  "الإجابة الثالثة مع التشكيل الكامل...",
  "الإجابة الرابعة مع التشكيل الكامل..."
]
"""

def process_hadith_dataset(input_file: str, output_file: str) -> tuple[int, int, int]:
    """
    Process the hadith dataset, enriching it with QA pairs
    
    Args:
        input_file: Path to the input JSON file
        output_file: Path to save the output JSON file
    """
    # Load hadith dataset
    with open(input_file, "r", encoding="utf-8") as f:
        data = json.load(f)
    
    # Get fixed questions template
    fixed_questions = get_fixed_questions()
    
    # Process each hadith with a progress bar
    total_tokens_before = 0
    total_tokens_after = 0
    processed_count = 0
    
    print(f"Processing {len(data)} hadith entries from {input_file}...")
    for entry in tqdm(data, desc=f"Processing hadiths from {os.path.basename(input_file)}"):
        # Skip if no sharh is available
        if not entry.get("sharh") or entry["sharh"] == ".":
            print(f"Skipping hadith ID {entry.get('hadith_id', 'unknown')} - no sharh available")
            continue
        
        # Build the LLM prompt from this entry's hadith and sharh
        prompt = PROMPT_TEMPLATE.format(
            hadith=entry["hadith"],
            sharh=entry["sharh"]
        )
        
        # Estimate tokens before generation
        tokens_before = estimate_tokens(prompt)
        total_tokens_before += tokens_before
        
        # Stream response from Together
        try:
            response = client.chat.completions.create(
                model="deepseek-ai/DeepSeek-V3",
                messages=[{"role": "user", "content": prompt}],
                stream=True
            )
            
            # Collect response
            output_text = ""
            for token in response:
                if getattr(token, "choices", None) and token.choices[0].delta.content:
                    output_text += token.choices[0].delta.content
            
            # Estimate tokens after generation
            tokens_after = estimate_tokens(output_text)
            total_tokens_after += tokens_after
            
            # Process the output - directly match keys and values
            try:
                # Clean the output text - handle potential formatting issues
                output_text = output_text.strip()
                if output_text.startswith("```json"):
                    output_text = output_text[7:]
                if output_text.endswith("```"):
                    output_text = output_text[:-3]
                output_text = output_text.strip()
                
                # Parse as regular JSON - expecting an array of 4 strings
                answer_list = json.loads(output_text)
                
                if not isinstance(answer_list, list):
                    raise ValueError(f"Expected list of answers, got: {type(answer_list)}")
                
                # Ensure we have at least 4 answers, pad with empty strings if needed
                while len(answer_list) < 4:
                    answer_list.append("")
                
                # Use direct key matching instead of creating new objects
                qa_pairs = []
                for i, answer in enumerate(answer_list[:4]):  # Limit to first 4 answers
                    qa_pair = {
                        "question": fixed_questions[i]["question"],
                        "answer": answer
                    }
                    qa_pairs.append(qa_pair)
                
                # Update the existing entry directly
                entry["FT_Pairs"] = qa_pairs
                
                # Extract first two answers directly into lessons and application
                entry["hadith_lessons"] = [answer_list[0]] if answer_list[0] else []
                entry["hadith_application"] = [answer_list[1]] if answer_list[1] else []
                
                processed_count += 1
                
            except (json.JSONDecodeError, ValueError) as e:
                print(f"\n[!] Processing error for hadith ID {entry.get('hadith_id', 'unknown')}: {str(e)}")
                # Set empty values for failed processing
                entry["FT_Pairs"] = []
                entry["hadith_lessons"] = []
                entry["hadith_application"] = []
        
        except Exception as e:
            print(f"\n[!] API error for hadith ID {entry.get('hadith_id', 'unknown')}: {str(e)}")
            entry["FT_Pairs"] = []
            entry["hadith_lessons"] = []
            entry["hadith_application"] = []
    
    # Save the enriched data
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    
    # Print statistics
    print(f"\nProcessing completed for file {input_file}:")
    print(f"- Successfully processed: {processed_count}/{len(data)} entries")
    print(f"- Total tokens before generation: {total_tokens_before}")
    print(f"- Total tokens after generation: {total_tokens_after}")
    if total_tokens_after > 0:
        print(f"- Input/output token ratio: {total_tokens_before/total_tokens_after:.2f}x")
    print(f"- Dataset saved to '{output_file}'")
    
    return processed_count, total_tokens_before, total_tokens_after

def process_all_hadith_files(input_dir: str, output_dir: str = None) -> None:
    """
    Process all hadith JSON files in a directory
    
    Args:
        input_dir: Path to the directory containing input JSON files
        output_dir: Path to save the output JSON files (if None, will use input_dir with '_processed' suffix)
    """
    # Create output directory if it doesn't exist
    if output_dir is None:
        output_dir = input_dir + "_processed"
    
    os.makedirs(output_dir, exist_ok=True)
    
    # Get all JSON files in the input directory
    json_files = glob.glob(os.path.join(input_dir, "*.json"))
    
    if not json_files:
        print(f"No JSON files found in {input_dir}")
        return
    
    print(f"Found {len(json_files)} JSON files to process")
    
    # Process each file
    total_processed = 0
    total_tokens_before = 0
    total_tokens_after = 0
    
    for input_file in json_files:
        file_name = os.path.basename(input_file)
        output_file = os.path.join(output_dir, file_name)
        
        print(f"\nProcessing file: {file_name}")
        processed, tokens_before, tokens_after = process_hadith_dataset(input_file, output_file)
        
        total_processed += processed
        total_tokens_before += tokens_before
        total_tokens_after += tokens_after
        
        # Add a small delay between files to avoid rate limiting
        time.sleep(2)
    
    # Print overall statistics
    print("\n===== OVERALL PROCESSING SUMMARY =====")
    print(f"Total files processed: {len(json_files)}")
    print(f"Total hadiths processed: {total_processed}")
    print(f"Total tokens before generation: {total_tokens_before}")
    print(f"Total tokens after generation: {total_tokens_after}")
    if total_tokens_after > 0:
        print(f"Overall input/output token ratio: {total_tokens_before/total_tokens_after:.2f}x")
    print(f"All processed files saved to '{output_dir}'")


In [None]:

if __name__ == "__main__":
    # Process all JSON files in the Sahih_muslim directory
    input_dir = r"d:\زينب\Sahih_muslim"
    output_dir = r"d:\زينب\Sahih_muslim_processed"
    
    process_all_hadith_files(input_dir, output_dir)

# Add the Hadith Matn to the Question-Answer Pairs
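
As a minimal sketch of what the cell below does to each string, the hypothetical helper `embed_matn` (not part of the notebook) appends the quoted hadith text after the word it searches for. It also adds a guard, which the notebook's cell does not have, so that running it twice on the same string does not nest the text. The `<matn>` value is a placeholder, not real data.

```python
HADITH_WORD = "الحَدِيثِ"  # the diacritized word the notebook's cell searches for

def embed_matn(text: str, matn: str) -> str:
    """Hypothetical helper: append the quoted matn after each occurrence of HADITH_WORD."""
    quoted = f"{HADITH_WORD}'{matn}'"
    # Guard (an addition, not in the notebook): leave text that already carries the matn.
    if not matn or quoted in text:
        return text
    return text.replace(HADITH_WORD, quoted)

# Build a sample question around HADITH_WORD; "<matn>" is a placeholder hadith text.
question = f"كَيْفَ يُمْكِنُ تَطْبِيقُ {HADITH_WORD} فِي الحَيَاةِ اليَوْمِيَّةِ؟"
once = embed_matn(question, "<matn>")
twice = embed_matn(once, "<matn>")  # unchanged, thanks to the guard
```

Without such a guard, rerunning the cell on already-updated files would insert the hadith text a second time.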

In [None]:
import json
import os
from tqdm import tqdm  # progress bar library

# Path to the folder containing the JSON files
folder_path = "Sahih_muslim_processed"

# Get the list of JSON files
json_files = [f for f in os.listdir(folder_path) if f.endswith(".json")]

# Iterate with tqdm to show a progress bar
for filename in tqdm(json_files, desc="Processing files"):
    file_path = os.path.join(folder_path, filename)
    
    # Open the data file for reading
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Process each entry in the data
    for entry in data:
        hadith_text = entry.get("hadith", "")

        # Update the questions in FT_Pairs
        for pair in entry.get("FT_Pairs", []):
            if "question" in pair:
                pair["question"] = pair["question"].replace("الحَدِيثِ", f"الحَدِيثِ'{hadith_text}'")

        # Update hadith_lessons and hadith_application if present
        if "hadith_lessons" in entry and entry["hadith_lessons"]:
            for i, lesson in enumerate(entry["hadith_lessons"]):
                entry["hadith_lessons"][i] = lesson.replace("الحَدِيثِ", f"الحَدِيثِ'{hadith_text}'")
        
        if "hadith_application" in entry and entry["hadith_application"]:
            for i, app in enumerate(entry["hadith_application"]):
                entry["hadith_application"][i] = app.replace("الحَدِيثِ", f"الحَدِيثِ'{hadith_text}'")

    # Save the modified data
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

print("✅ All hadith files were updated successfully!")


# Check for Empty JSON Files to Delete
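
The emptiness test used by the cell below can be expressed as a small predicate, which makes it easy to check before any file is deleted. `is_all_empty` and `ENRICHED_FIELDS` are hypothetical names introduced for this sketch, and it assumes every entry in the list is a dict, as in the processed files.

```python
from typing import Any

# The three fields the pipeline fills in; an entry counts as enriched if any is non-empty.
ENRICHED_FIELDS = ("hadith_lessons", "hadith_application", "FT_Pairs")

def is_all_empty(data: Any) -> bool:
    """Hypothetical helper: True if `data` is a list in which no entry has enrichment."""
    if not isinstance(data, list):
        return False  # mirror the notebook: files that are not a list are left alone
    return not any(
        any(entry.get(field) for field in ENRICHED_FIELDS)
        for entry in data
    )
```

Note that an empty list also counts as all empty, which matches the behavior of the deletion cell.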

In [None]:
import os
import json

# Folder containing the JSON files
folder_path = "Sahih_muslim_processed"

# Iterate over all JSON files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
        
        try:
            # Open and load the file
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)

            # Make sure the file contains a list of entries
            if isinstance(data, list):
                all_empty = True
                for entry in data:
                    # Check the three enrichment fields
                    if (entry.get("hadith_lessons") or 
                        entry.get("hadith_application") or 
                        entry.get("FT_Pairs")):
                        all_empty = False
                        break

                # Delete the file if all three fields are empty in every entry
                if all_empty:
                    os.remove(file_path)
                    print(f"🗑️ Deleted file because all fields are empty: {filename}")

        except Exception as e:
            print(f"⚠️ Error reading file {filename}: {e}")


# Delete Processed JSON Files from Original Directory
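
Because the cell below deletes files permanently, a dry run that only lists the matching paths may be useful first. The function `files_to_delete` is a hypothetical helper, not part of the notebook. It applies the same matching rule (a `.json` file in the original folder whose name also exists in the processed folder) without removing anything.

```python
import os
from typing import List

def files_to_delete(processed_folder: str, original_folder: str) -> List[str]:
    """Hypothetical dry run: return the paths in original_folder that would be deleted."""
    # Names of the JSON files that already exist in the processed folder.
    processed = {f for f in os.listdir(processed_folder) if f.endswith(".json")}
    # A file in the original folder matches when its name is in that set.
    return sorted(
        os.path.join(original_folder, f)
        for f in os.listdir(original_folder)
        if f in processed
    )
```

Printing the returned list and reviewing it before running the deletion cell is one way to avoid removing inputs that were not actually processed.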

In [None]:
import os

# Folder containing the processed JSON files
processed_folder = r"D:\زينب\Sahih_muslim_processed"

# Sahih Muslim folder from which the files will be deleted
original_folder = r"D:\زينب\Sahih_muslim"

# Get the JSON file names from the processed folder (without the extension)
json_names = [os.path.splitext(f)[0] for f in os.listdir(processed_folder) if f.endswith(".json")]

# Iterate over the files in the Sahih Muslim folder
for filename in os.listdir(original_folder):
    file_path = os.path.join(original_folder, filename)

    # Check whether this is a JSON file whose name is in the list of names
    if filename.endswith(".json"):
        name_without_ext = os.path.splitext(filename)[0]
        if name_without_ext in json_names:
            os.remove(file_path)
            print(f"🗑️ Deleted file: {filename}")

print("✅ All matching files were deleted from the Sahih_muslim folder.")
