# **Introduction**
This notebook implements an end-to-end pipeline for summarizing English and Arabic books. It extracts text from a PDF, divides the text into chunks, summarizes each chunk, and writes the combined summary to a plain-text file. The pipeline detects the language of the book automatically and applies the matching summarization model. All models are loaded onto a GPU, which keeps processing time manageable for book-length texts.

The steps for the pipeline include:

1. **Text Extraction**: Extract the raw text from the PDF file.
2. **Language Detection**: Detect whether the text is in English or Arabic.
3. **Semantic Chunking**: Break the text into chunks. English text is split where the Sentence-BERT embedding similarity between adjacent sentences drops; Arabic text is split on Stanza sentence boundaries under word-count limits.
4. **Text Summarization**: Summarize each chunk using the appropriate model (BART for English, mT5 for Arabic).
5. **Text Generation**: Write the summarized text to a `.txt` file.


# **Pipeline Steps**


## Step 1: Install the Required Libraries


In [1]:
!pip install PyMuPDF pdfplumber transformers arabic_reshaper python-bidi matplotlib reportlab fpdf2
!pip install spacy camel-tools sentence-transformers fpdf PyPDF2 stanza langdetect



## Step 2: Import Required Libraries


In [2]:
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline
import re
from fpdf import FPDF
import shutil
import pdfplumber
import arabic_reshaper
from bidi.algorithm import get_display
import stanza
from langdetect import detect




## Step 3: Download the Arabic Language Model for Stanza


In [3]:
stanza.download('ar')



INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: ar (Arabic) ...



INFO:stanza:Downloaded file to /root/stanza_resources/ar/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources


## Step 4: Define Helper Functions for Text Extraction and Cleaning


In [4]:
# Extract text from the PDF file using pdfplumber
def extract_text_from_pdf(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer, so fall back to ''
        text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
    return text

# Function: Arabic Text Reshaping and Bidi Fix
def fix_arabic_text(text):
    reshaped_text = arabic_reshaper.reshape(text)
    return get_display(reshaped_text)


# Clean the text by removing URLs, standalone numbers, isolated single Latin letters, and extra whitespace
def clean_text(text):
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\b\d+\b', '', text)
    text = re.sub(r'\b[A-Za-z]\b', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
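
To see what `clean_text` actually removes, the sketch below repeats the function from this cell and runs it on a made-up sample string (the sample is an assumption for illustration only). Note that the single-letter rule also drops legitimate English words such as "a" and "I".

```python
import re

def clean_text(text):
    text = re.sub(r'http\S+', '', text)        # drop URLs
    text = re.sub(r'\b\d+\b', '', text)        # drop standalone numbers
    text = re.sub(r'\b[A-Za-z]\b', '', text)   # drop isolated single Latin letters
    text = re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace
    return text

# Hypothetical sample input, not taken from any book
sample = "See http://example.com for 42 tips in chapter 7, part b.   I read a book."
cleaned = clean_text(sample)
print(cleaned)
```

Punctuation left behind by a removed token (for example the comma after a deleted number) stays in the text, so the cleaned string may contain stray punctuation.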


## Step 5: Set Up the Sentence-BERT Model and Summarizer Models


In [5]:
# Load the Sentence-BERT embedding model, both summarizers, and the Arabic tokenizer
# (device='cuda' / device=0 assume that a CUDA GPU is available)
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')
summarizer_en = pipeline("summarization", model="facebook/bart-large-cnn", device=0)
summarizer_ar = pipeline('summarization', model='csebuetnlp/mT5_multilingual_XLSum', device=0)
nlp_ar = stanza.Pipeline('ar', processors='tokenize')


INFO:stanza:Loading these models for language: ar (Arabic):
| Processor | Package |
-----------------------
| tokenize  | padt    |
| mwt       | padt    |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Done loading processors!


## Step 6: Define the Function for Semantic Chunking in English


In [6]:
def divide_by_semantics_with_length(text, threshold=0.6, max_words=800, min_words=400):
    sentences = text.split('. ')
    embeddings = model.encode(sentences, convert_to_tensor=True)
    chunks = []
    current_chunk = sentences[0]

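    for i in range(1, len(sentences)):
        # Cosine similarity between this sentence and the previous one (1x1 tensor -> float)
        similarity = util.pytorch_cos_sim(embeddings[i], embeddings[i - 1]).item()
        current_word_count = len(current_chunk.split())

        # Start a new chunk on a topic shift or when max_words would be exceeded,
        # but only once the current chunk has reached min_words
        if similarity < threshold or current_word_count + len(sentences[i].split()) > max_words: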
            if current_word_count >= min_words:
                chunks.append(current_chunk.strip())
                current_chunk = sentences[i]
            else:
                current_chunk += '. ' + sentences[i]
        else:
            current_chunk += '. ' + sentences[i]

    # Keep the final chunk only if it reaches min_words; shorter trailing text is discarded
    if len(current_chunk.split()) >= min_words:
        chunks.append(current_chunk.strip())

    return chunks
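
The splitting rule in `divide_by_semantics_with_length` can be tried without loading Sentence-BERT. The sketch below is a simplified stand-in, not the notebook's function: the helper name `chunk_by_similarity` is hypothetical, and the similarity scores are hand-picked numbers supplied as a list instead of being computed from embeddings.

```python
def chunk_by_similarity(sentences, similarities, threshold=0.6, max_words=800, min_words=400):
    """similarities[i] is the similarity between sentences[i] and sentences[i - 1].

    similarities[0] is never read; it is only there to keep the indices aligned.
    """
    chunks = []
    current = sentences[0]
    for i in range(1, len(sentences)):
        count = len(current.split())
        too_long = count + len(sentences[i].split()) > max_words
        # Split on a topic shift or a length overflow, but only after min_words is reached
        if (similarities[i] < threshold or too_long) and count >= min_words:
            chunks.append(current.strip())
            current = sentences[i]
        else:
            current += '. ' + sentences[i]
    # The trailing chunk is kept only if it reaches min_words
    if len(current.split()) >= min_words:
        chunks.append(current.strip())
    return chunks

# Toy data: two sentences about cats followed by two about markets
sentences = ["Cats purr softly", "Cats nap often", "Stocks fell sharply", "Markets closed lower"]
similarities = [1.0, 0.9, 0.1, 0.8]  # hand-picked, not model output
chunks = chunk_by_similarity(sentences, similarities, min_words=4, max_words=20)
print(chunks)

# With a min_words larger than the whole text, nothing is returned
dropped = chunk_by_similarity(["One two", "three four"], [1.0, 0.9], min_words=10)
print(dropped)
```

The second call shows a consequence of the final `min_words` check: a text shorter than `min_words` yields no chunks at all, so very short documents would produce an empty summary.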


## Step 7: Define the Function for Semantic Chunking in Arabic

In [7]:
# Function: Semantic Chunking (Arabic)
def chunk_arabic_text(text, min_words=300, max_words=500):
    """Break the Arabic text into semantically meaningful chunks."""
    doc = nlp_ar(text)
    chunks = []
    current_chunk = []
    current_chunk_word_count = 0

    for sentence in doc.sentences:
        sentence_text = sentence.text
        sentence_word_count = len(sentence_text.split())

        # If the sentence is too long, split it into smaller sentences
        if sentence_word_count > max_words:
            split_sentences = split_long_sentence(sentence_text, max_words)
        else:
            split_sentences = [sentence_text]

        # Add the split sentences to the current chunk
        for split_sentence in split_sentences:
            split_sentence_word_count = len(split_sentence.split())
            if current_chunk_word_count + split_sentence_word_count > max_words and current_chunk_word_count >= min_words:
                chunks.append(' '.join(current_chunk))
                current_chunk = []
                current_chunk_word_count = 0

            current_chunk.append(split_sentence)
            current_chunk_word_count += split_sentence_word_count

    # Add the last chunk if it meets the minimum word requirement
    if current_chunk_word_count >= min_words:
        chunks.append(' '.join(current_chunk))

    return chunks

# Helper function to split long Arabic sentences
def split_long_sentence(sentence_text, max_words):
    words = sentence_text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
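
The packing logic of `chunk_arabic_text` depends only on whitespace word counts, so it can be exercised without Stanza. The sketch below assumes the sentences have already been segmented and uses a hypothetical helper, `pack_sentences`, that mirrors the loop above; the placeholder tokens stand in for real Arabic words.

```python
def split_long_sentence(sentence_text, max_words):
    words = sentence_text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def pack_sentences(sentences, min_words, max_words):
    """Greedily pack pre-segmented sentences into chunks of roughly min_words..max_words."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Over-long sentences are cut into max_words-sized pieces first
        pieces = split_long_sentence(sentence, max_words) if n_words > max_words else [sentence]
        for piece in pieces:
            n = len(piece.split())
            # Close the current chunk if adding this piece would overflow it
            if count + n > max_words and count >= min_words:
                chunks.append(' '.join(current))
                current, count = [], 0
            current.append(piece)
            count += n
    # Keep the last chunk only if it reaches min_words
    if count >= min_words:
        chunks.append(' '.join(current))
    return chunks

# Placeholder tokens; small limits make the behaviour easy to follow
sentences = ["a b c", "d e", "f g h i j k l", "m n"]
chunks = pack_sentences(sentences, min_words=3, max_words=5)
print(chunks)
```

The seven-word sentence is cut into a five-word piece and a two-word piece, and the two-word remainder is merged with the following sentence.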


## Step 8: Define the Summarization and Text Generation Functions


In [55]:
# Summarize text chunks
def summarize_chunks(chunks, summarizer, min_chunk_length=50, max_summary_length=300, min_summary_length=80):
    summaries = []
    for chunk in chunks:
        if len(chunk.split()) > min_chunk_length:
            try:
                summary = summarizer(chunk, max_length=max_summary_length, min_length=min_summary_length, do_sample=False)[0]['summary_text']
                summaries.append(summary)
            except Exception as e:
                print(f"Error summarizing chunk: {e}")
                summaries.append(chunk)
        else:
            summaries.append(chunk)
    return summaries


# Process the title based on the language
def get_title(language):
    if language == 'ar':
        title = "ملخص الكتاب"
    else:
        title = 'Book Summary'
    return title


# Generate the text file
def generate_txt(summary_text, txt_output_path, language='en'):
    # Process the title
    title = get_title(language)

    # Process the body text
    if language == 'ar':
        reshaped_text = arabic_reshaper.reshape(summary_text)
        body_text = get_display(reshaped_text)
    else:
        body_text = summary_text

    # Define A4 page parameters
    characters_per_line = 80  # approximate line width for an A4 page
    effective_line_width = characters_per_line

    # Adjust alignment based on language
    if language == 'ar':
        # For Arabic, define a function to right-align text
        def align_line(line):
            return line.rjust(effective_line_width)
    else:
        # For English, define a function to left-align text
        def align_line(line):
            return line.ljust(effective_line_width)

    # Center the title considering alignment
    centered_title = title.center(effective_line_width)

    # Format the body text with alignment
    formatted_body = ''
    for paragraph in body_text.split('\n'):
        words = paragraph.split()
        line = ''
        for word in words:
            if len(line) + len(word) + 1 <= effective_line_width:
                line += word + ' '
            else:
                # Strip extra space and align the line
                line = line.strip()
                formatted_line = align_line(line)
                formatted_body += formatted_line + '\n'
                line = word + ' '
        if line:
            line = line.strip()
            formatted_line = align_line(line)
            formatted_body += formatted_line + '\n'
        formatted_body += '\n'  # add a blank line between paragraphs

    # Write the title and body to a text file
    with open(txt_output_path, 'w', encoding='utf-8') as f:
        f.write(centered_title + '\n\n')
        f.write(formatted_body)
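
The fallback behaviour of `summarize_chunks` can be checked without downloading a model. The sketch below repeats the function from this cell and passes it a hypothetical stand-in, `fake_summarizer`, that imitates the return shape of a Hugging Face summarization pipeline (a list containing a dict with a `summary_text` key). The stub and the sample chunks are assumptions for illustration.

```python
def summarize_chunks(chunks, summarizer, min_chunk_length=50, max_summary_length=300, min_summary_length=80):
    summaries = []
    for chunk in chunks:
        if len(chunk.split()) > min_chunk_length:
            try:
                summary = summarizer(chunk, max_length=max_summary_length,
                                     min_length=min_summary_length, do_sample=False)[0]['summary_text']
                summaries.append(summary)
            except Exception as e:
                print(f"Error summarizing chunk: {e}")
                summaries.append(chunk)  # fall back to the original chunk
        else:
            summaries.append(chunk)  # too short to be worth summarizing
    return summaries

def fake_summarizer(text, max_length, min_length, do_sample):
    """Hypothetical stub: returns the first five words, or fails on a marker word."""
    if "FAIL" in text:
        raise ValueError("simulated model error")
    return [{"summary_text": " ".join(text.split()[:5])}]

short_chunk = "too short to summarize"
long_chunk = " ".join(f"word{i}" for i in range(60))
bad_chunk = "FAIL " + long_chunk

results = summarize_chunks([short_chunk, long_chunk, bad_chunk], fake_summarizer)
print(results[1])
```

Short chunks and chunks that raise an error are passed through unchanged, so the final summary can contain unsummarized text of up to a full chunk in length.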




## Step 9: Define the Summarization Pipelines for English and Arabic


### **English Summarization Pipeline**


In [56]:
def summarize_english(book_text, text_output_path="english_summary.txt"):
    # Step 1: Divide text into semantic chunks
    semantic_chunks = divide_by_semantics_with_length(book_text)

    # Step 2: Clean the chunks
    cleaned_chunks = [clean_text(chunk) for chunk in semantic_chunks]

    # Step 3: Summarize the chunks
    summarized_chunks = summarize_chunks(cleaned_chunks, summarizer_en)

    # Step 4: Write the summary to a text file
    final_summary = '\n\n'.join(summarized_chunks)
    generate_txt(final_summary, text_output_path, language='en')

    print(f"Summarization completed!, saved to {text_output_path}")

    return final_summary


### **Arabic Summarization Pipeline**


In [57]:
def summarize_arabic(pdf_path, text_output_path="arabic_summary.txt"):
    # Step 1: Extract text from PDF and fix Arabic text direction
    text = extract_text_from_pdf(pdf_path)
    fixed_text = fix_arabic_text(text)  # Fixing the text direction

    # Step 2: Chunk the text semantically
    chunks = chunk_arabic_text(fixed_text)  # Split on sentence boundaries under word-count limits

    # Step 3: Summarize the chunks
    summarized_chunks = summarize_chunks(chunks, summarizer_ar)

    # Step 4: Clean and generate the final summary
    cleaned_summaries = [clean_text(chunk) for chunk in summarized_chunks]
    final_summary = '\n\n'.join(cleaned_summaries)

    # Step 5: Generate txt
    final_summary_arabic = fix_arabic_text(final_summary)
    generate_txt(final_summary_arabic, text_output_path, language='ar')

    # Notify the user that the txt has been created
    print(f"Summarization completed!, saved to {text_output_path}")

    return final_summary


## Step 10: Language Detection and Pipeline Execution


In [58]:
def detect_language_and_summarize(pdf_path, text_output_path_ar="arabic_summary.txt", text_output_path_en="english_summary.txt"):
    text = extract_text_from_pdf(pdf_path)
    language = detect(text)

    if language == 'ar':
        print("Detected Arabic. Running Arabic summarization pipeline...")
        return summarize_arabic(pdf_path, text_output_path=text_output_path_ar)
    else:
        # Any detected language other than Arabic falls through to the English pipeline
        print("Detected English. Running English summarization pipeline...")
        return summarize_english(text, text_output_path=text_output_path_en)
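
`langdetect.detect` raises a `LangDetectException` when the text has no usable features (for example, an empty extraction from a scanned PDF), and its results are not deterministic unless a seed is set. As a rough cross-check, the sketch below counts characters from the basic Arabic Unicode block (U+0600 to U+06FF). The helpers `arabic_ratio` and `guess_language` and the 0.5 threshold are hypothetical and not part of the notebook.

```python
def arabic_ratio(text):
    """Fraction of alphabetic characters that fall in the basic Arabic block (U+0600-U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if '\u0600' <= ch <= '\u06FF')
    return arabic / len(letters)

def guess_language(text, threshold=0.5):
    """Return 'ar' if at least `threshold` of the letters are Arabic, else 'en'."""
    return 'ar' if arabic_ratio(text) >= threshold else 'en'

print(guess_language("ملخص الكتاب"))
print(guess_language("Book Summary"))
print(arabic_ratio(""))
```

This check only makes sense on raw extracted text: after `arabic_reshaper.reshape`, letters are mapped to presentation forms outside the basic block and would not be counted.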


## Step 11: Run the Pipeline on Your PDF


In [60]:
pdf_path = "/content/arabic_summary (14).pdf"  # Update this to the correct PDF path
final_summary = detect_language_and_summarize(pdf_path)


Detected Arabic. Running Arabic summarization pipeline...
Summarization completed!, saved to arabic_summary.txt
