# **Introduction**
This notebook provides an end-to-end pipeline for paraphrasing English and Arabic books. It extracts text from a PDF, divides the text into chunks, paraphrases each chunk, and writes the paraphrased text to a formatted plain-text file. The pipeline detects the language of the book automatically and applies the matching paraphrasing model. The models are configured to run on a GPU, which keeps processing time manageable for book-length texts.

The steps for the pipeline include:

1. **Text Extraction**: Extract the raw text from the PDF file.
2. **Language Detection**: Detect whether the text is in English or Arabic.
3. **Semantic Chunking**: Break the text into chunks. English text is split where the similarity between consecutive sentence embeddings drops; Arabic text is split by sentence under a token budget.
4. **Text Paraphrasing**: Paraphrase each chunk using the appropriate model (T5 for English, mT5 for Arabic).
5. **Text File Generation**: Write the paraphrased text to a formatted plain-text file.


# **Pipeline Steps**


## Step 1: Install the Required Libraries


In [1]:
!pip install PyMuPDF pdfplumber transformers arabic_reshaper python-bidi matplotlib reportlab fpdf2
!pip install spacy camel-tools sentence-transformers fpdf PyPDF2 stanza langdetect datasets
!pip install nltk



## Step 2: Import Required Libraries


In [2]:
import re
import shutil
import math
import nltk
from nltk.tokenize import sent_tokenize
from langdetect import detect
from bidi.algorithm import get_display
import pdfplumber
import arabic_reshaper
from fpdf import FPDF
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
import stanza

# Download NLTK data
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Step 3: Download the Arabic Language Model for Stanza


In [3]:
# Download and initialize Stanza pipeline for Arabic
stanza.download('ar')
nlp_ar = stanza.Pipeline('ar', processors='tokenize')

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: ar (Arabic) ...


INFO:stanza:Downloaded file to /root/stanza_resources/ar/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: ar (Arabic):
| Processor | Package |
-----------------------
| tokenize  | padt    |
| mwt       | padt    |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Done loading processors!


## Step 4: Define Helper Functions for Text Extraction and Cleaning


In [4]:
# Extract text from the PDF file using pdfplumber
def extract_text_from_pdf(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
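        # Join pages with a newline so words at page boundaries do not run together;
        # "or ''" guards against pages with no extractable text
        text = '\n'.join(page.extract_text() or '' for page in pdf.pages)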
    return text

# Function: Arabic Text Reshaping and Bidi Fix
def fix_arabic_text(text):
    reshaped_text = arabic_reshaper.reshape(text)
    return get_display(reshaped_text)

# Clean the text by removing URLs, standalone numbers, single Latin letters, and extra spaces
def clean_text(text):
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\b\d+\b', '', text)
    text = re.sub(r'\b[A-Za-z]\b', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


def clean_arabic_text(text):
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\b\d+\b', '', text)
    text = re.sub(r'\b[ء-ي]\b', '', text)
    text = re.sub(r'\b[A-Za-z]\b', '', text)
    text = re.sub(r'[?!؟"()«»:\-]', '', text)
    text = re.sub(r'\s+([,.،])', r'\1', text)
    text = re.sub(r'([,.،])([^\s])', r'\1 \2', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Clean every chunk with the Arabic cleaning function
def clean_chunks(chunks):
    return [clean_arabic_text(chunk) for chunk in chunks]
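
The effect of the cleaning rules is easiest to see on a short sample. The sketch below repeats the `clean_text` function from this step so it runs on its own; the sample sentence is made up for illustration. Note that the single-letter rule also removes the English words "I" and "a", which may or may not be desirable for a given book.

```python
import re

def clean_text(text):
    text = re.sub(r'http\S+', '', text)        # drop URLs
    text = re.sub(r'\b\d+\b', '', text)        # drop standalone numbers
    text = re.sub(r'\b[A-Za-z]\b', '', text)   # drop single Latin letters
    text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace
    return text

# Hypothetical sample input, not taken from any book
sample = "Visit https://example.com now. I bought a book in 2019."
cleaned = clean_text(sample)
print(cleaned)  # Visit now. bought book in .
```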


## Step 5: Set Up the Sentence-BERT Model and Paraphrasing Models


In [5]:
import torch

# Use the first GPU when one is available; otherwise fall back to the CPU
use_gpu = torch.cuda.is_available()

# Load the pre-trained Sentence-BERT model for semantic embeddings
sbert_model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda' if use_gpu else 'cpu')

# Define paraphrasing pipelines (device=0 is the first GPU, device=-1 is the CPU)
pipeline_device = 0 if use_gpu else -1
paraphraser_en = pipeline("text2text-generation", model="t5-base", device=pipeline_device)

model_name = "google/mt5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

paraphraser_ar = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=pipeline_device)

# The Stanza pipeline for Arabic (nlp_ar) was already initialized in Step 3




## Step 6: Define the Function for Semantic Chunking in English


In [30]:
def divide_by_semantics_with_length(text, threshold=0.6, max_words=400, min_words=300):
    sentences = nltk.sent_tokenize(text)  # Use NLTK for sentence tokenization
    if not sentences:
        return []  # Nothing to chunk (for example, a PDF with no extractable text)
    embeddings = sbert_model.encode(sentences, convert_to_tensor=True)
    chunks = []
    current_chunk = sentences[0]

    for i in range(1, len(sentences)):
        similarity = util.pytorch_cos_sim(embeddings[i], embeddings[i-1])
        current_word_count = len(current_chunk.split())

        # If the next sentence makes the chunk exceed the max word limit
        if current_word_count + len(sentences[i].split()) > max_words:
            # Ensure the current chunk has at least min_words before breaking
            if current_word_count >= min_words:
                chunks.append(current_chunk.strip())  # Finalize the current chunk
                current_chunk = sentences[i]  # Start a new chunk
            else:
                # If the chunk is below min_words, add the sentence even if it exceeds max_words
                current_chunk += ' ' + sentences[i]
        elif similarity < threshold:
            # Break chunk if semantic similarity is low and the chunk meets the minimum word count
            if current_word_count >= min_words:
                chunks.append(current_chunk.strip())  # Finalize the current chunk
                current_chunk = sentences[i]  # Start a new chunk
            else:
                # If the chunk is too small, continue adding sentences
                current_chunk += ' ' + sentences[i]
        else:
            # Continue adding sentences to the current chunk
            current_chunk += ' ' + sentences[i]

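    # Keep the trailing text so nothing is dropped: append it as its own chunk
    # if it is long enough (or is the only chunk); otherwise merge it into the
    # previous chunk, which may then exceed max_words
    if len(current_chunk.split()) >= min_words or not chunks:
        chunks.append(current_chunk.strip())
    else:
        chunks[-1] += ' ' + current_chunk.strip()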

    return chunks
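
The core idea of the function above, starting a new chunk when two consecutive sentences are dissimilar, can be sketched without downloading a model. The example below is a simplified illustration, not the notebook's implementation: it uses hand-made two-dimensional vectors as stand-ins for Sentence-BERT embeddings, ignores the `min_words` and `max_words` limits, and the helper names `cosine` and `split_on_low_similarity` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def split_on_low_similarity(sentences, vectors, threshold=0.6):
    """Start a new chunk whenever a sentence is dissimilar to the previous one."""
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i], vectors[i - 1]) < threshold:
            chunks.append([sentences[i]])      # topic shift: open a new chunk
        else:
            chunks[-1].append(sentences[i])    # same topic: extend the chunk
    return [' '.join(chunk) for chunk in chunks]

# Toy data: the vectors are made up so that the topic changes after sentence 2
sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_chunks = split_on_low_similarity(sentences, vectors)
print(toy_chunks)  # ['Cats purr. Cats nap.', 'Stocks fell. Bonds rose.']
```

In the notebook itself the word-count limits take priority: a low similarity score only ends a chunk once the chunk has reached `min_words`.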


## Step 7: Define the Function for Token-Based Chunking in Arabic

In [31]:
def chunk_arabic_text(text, tokenizer, max_tokens=500):
    sentences = nltk.sent_tokenize(text)
    chunks = []
    current_chunk = ''
    current_tokens = 0

    for sentence in sentences:
        sentence_tokens = len(tokenizer.encode(sentence, add_special_tokens=False))

        if current_tokens + sentence_tokens > max_tokens:
            if current_chunk:
                chunks.append(current_chunk.strip())
                current_chunk = sentence
                current_tokens = sentence_tokens
            else:
                # If the sentence itself exceeds the limit, split it into words
                words = sentence.split()
                sub_chunk = ''
                sub_tokens = 0
                for word in words:
                    word_tokens = len(tokenizer.encode(word, add_special_tokens=False))
                    if sub_tokens + word_tokens > max_tokens:
                        if sub_chunk:
                            chunks.append(sub_chunk.strip())
                            sub_chunk = word
                            sub_tokens = word_tokens
                        else:
                            sub_chunk = ''
                            sub_tokens = 0
                    else:
                        sub_chunk += ' ' + word
                        sub_tokens += word_tokens
                if sub_chunk:
                    chunks.append(sub_chunk.strip())
        else:
            current_chunk += ' ' + sentence
            current_tokens += sentence_tokens

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks
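
The packing strategy used for Arabic can be illustrated with a smaller sketch. It assumes a stand-in token counter (whitespace word count) instead of the mT5 tokenizer, and it leaves out the word-level fallback for sentences that exceed the budget on their own. The names `pack_sentences` and `count_words` are hypothetical.

```python
def pack_sentences(sentences, count_tokens, max_tokens):
    """Greedily pack whole sentences into chunks that stay within max_tokens."""
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would exceed the budget
        if current and used + n > max_tokens:
            chunks.append(' '.join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(' '.join(current))
    return chunks

def count_words(sentence):
    """Stand-in for a tokenizer: counts whitespace-separated words."""
    return len(sentence.split())

sentences = ["one two three.", "four five.", "six seven eight nine.", "ten."]
packed = pack_sentences(sentences, count_words, max_tokens=5)
print(packed)  # ['one two three. four five.', 'six seven eight nine. ten.']
```

Counting real tokenizer tokens rather than words matters in practice, because a single Arabic word is often split into several subword tokens.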

## Step 8: Define the Paraphrasing and Text File Generation Functions


In [32]:
# Paraphrase text chunks
def paraphrase_chunks_en(chunks, min_words=350, max_words=400, num_return_sequences=1):
    paraphrased_chunks = []
    for chunk in chunks:
        chunk_length = len(chunk.split())  # Get the word count of the original chunk

        try:
            # Use the paraphraser to generate paraphrases
            paraphrases = paraphraser_en(chunk, max_length=chunk_length, num_return_sequences=num_return_sequences, do_sample=False)
            paraphrased_text = paraphrases[0]['generated_text']  # Extract the paraphrased text

            # Ensure that the paraphrased text has between min_words and max_words
            paraphrased_words = paraphrased_text.split()
            if len(paraphrased_words) > max_words:
                paraphrased_text = ' '.join(paraphrased_words[:max_words])  # Trim if longer
            elif len(paraphrased_words) < min_words:
                paraphrased_text = paraphrased_text + ' ' + ' '.join(paraphrased_words[:min_words-len(paraphrased_words)])  # Repeat words if shorter

            paraphrased_chunks.append(paraphrased_text)
        except Exception as e:
            # Report the failure instead of hiding it, then fall back to the original chunk
            print(f"Error paraphrasing chunk: {e}")
            paraphrased_chunks.append(chunk)

    return paraphrased_chunks


def paraphrase_chunks_ar(chunks, min_words=350, max_words=400, num_return_sequences=1):

    paraphrased_chunks = []
    for chunk in chunks:
        chunk_length = len(chunk.split())  # Get the word count of the original chunk

        try:
            paraphrases = paraphraser_ar(chunk, max_length=chunk_length, num_return_sequences=num_return_sequences, do_sample=False)
            paraphrased_text = paraphrases[0]['generated_text']  # Extract the paraphrased text

            # Ensure that the paraphrased text has between min_words and max_words
            paraphrased_words = paraphrased_text.split()
            if len(paraphrased_words) > max_words:
                paraphrased_text = ' '.join(paraphrased_words[:max_words])  # Trim if longer
            elif len(paraphrased_words) < min_words:
                paraphrased_text = paraphrased_text + ' ' + ' '.join(paraphrased_words[:min_words-len(paraphrased_words)])  # Repeat words if shorter

            paraphrased_chunks.append(paraphrased_text)
        except Exception as e:
            # Report the failure instead of hiding it, then fall back to the original chunk
            print(f"Error paraphrasing chunk: {e}")
            paraphrased_chunks.append(chunk)

    return paraphrased_chunks


# Process the title based on the language
def get_title(language):
    if language == 'ar':
        title = "إعادة صياغة الكتاب"  # "Book Paraphrased" in Arabic
    else:
        title = 'Book Paraphrased'
    return title


# Generate the text file
def generate_txt(summary_text, txt_output_path, language='en'):
    # Process the title
    title = get_title(language)

    # Process the body text
    if language == 'ar':
        reshaped_text = arabic_reshaper.reshape(summary_text)
        body_text = get_display(reshaped_text)
    else:
        body_text = summary_text

    # Define A4 page parameters
    characters_per_line = 80  # Approximate line width for an A4 page
    effective_line_width = characters_per_line

    # Adjust alignment based on language
    if language == 'ar':
        # For Arabic, define a function to right-align text
        def align_line(line):
            return line.rjust(effective_line_width)
    else:
        # For English, define a function to left-align text
        def align_line(line):
            return line.ljust(effective_line_width)

    # Center the title considering alignment
    centered_title = title.center(effective_line_width)

    # Format the body text with alignment
    formatted_body = ''
    for paragraph in body_text.split('\n'):
        words = paragraph.split()
        line = ''
        for word in words:
            if len(line) + len(word) + 1 <= effective_line_width:
                line += word + ' '
            else:
                # Strip extra space and align the line
                line = line.strip()
                formatted_line = align_line(line)
                formatted_body += formatted_line + '\n'
                line = word + ' '
        if line:
            line = line.strip()
            formatted_line = align_line(line)
            formatted_body += formatted_line + '\n'
        formatted_body += '\n'  # Add a blank line between paragraphs

    # Write the title and body to a text file
    with open(txt_output_path, 'w', encoding='utf-8') as f:
        f.write(centered_title + '\n\n')
        f.write(formatted_body)
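
The wrapping and alignment logic in `generate_txt` follows a common greedy pattern: keep adding words to a line until the next word would not fit, then start a new line. The sketch below shows that pattern in isolation with a narrow width so the result is easy to inspect. It is an approximation of the loop above rather than a copy of it (the original tracks a trailing space, so its line breaks can differ by one character), and `wrap_words` is a hypothetical helper name.

```python
def wrap_words(text, width):
    """Greedy word wrap: fill each line up to `width` characters."""
    lines, line = [], ''
    for word in text.split():
        candidate = word if not line else line + ' ' + word
        # Accept the word if it fits, or if the line is empty (overlong word)
        if len(candidate) <= width or not line:
            line = candidate
        else:
            lines.append(line)
            line = word
    if line:
        lines.append(line)
    return lines

lines = wrap_words("the quick brown fox jumps over the lazy dog", width=15)
right_aligned = [line.rjust(15) for line in lines]  # as done for Arabic
for line in right_aligned:
    print(repr(line))
```

Python's standard `textwrap.wrap` offers the same kind of wrapping and could replace the hand-written loop if exact control over spacing is not needed.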


## Step 9: Define the Paraphrasing Pipelines for English and Arabic


### **English Paraphrasing Pipeline**


In [33]:
def paraphrase_english(book_text, text_output_path="english_paraphrase.txt"):
    # Step 1: Divide text into semantic chunks
    semantic_chunks = divide_by_semantics_with_length(book_text)

    # Step 2: Clean the chunks
    cleaned_chunks = [clean_text(chunk) for chunk in semantic_chunks]

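    # Step 3: Paraphrase the chunks (the function uses the global paraphraser_en pipeline;
    # passing it as a second argument would overwrite min_words)
    paraphrased_chunks = paraphrase_chunks_en(cleaned_chunks)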

    # Step 4: Generate text
    final_paraphrase = '\n\n'.join(paraphrased_chunks)
    generate_txt(final_paraphrase, text_output_path, language='en')

    print(f"Paraphrasing completed! Saved to {text_output_path}")

    return final_paraphrase


### **Arabic Paraphrasing Pipeline**


In [34]:
def paraphrase_arabic(pdf_path, text_output_path="arabic_paraphrase.txt"):
    # Step 1: Extract text from PDF and fix Arabic text direction
    text = extract_text_from_pdf(pdf_path)
    fixed_text = fix_arabic_text(text)  # Fixing the text direction

    # Step 2: Split the text into chunks that fit the token budget
    chunks = chunk_arabic_text(fixed_text, tokenizer, max_tokens=400)

    # Step 3: Paraphrase the chunks (the function uses the global paraphraser_ar pipeline;
    # passing it as a second argument would overwrite min_words)
    paraphrased_chunks = paraphrase_chunks_ar(chunks)

    # Step 4: Clean the paraphrased chunks using the custom Arabic cleaning function
    cleaned_paraphrase = clean_chunks(paraphrased_chunks)

    # Step 5: Join the cleaned chunks into the final paraphrased text
    final_paraphrase = '\n\n'.join(cleaned_paraphrase)

    # Step 6: Fix the Arabic text direction before generating the text
    final_paraphrase_arabic = fix_arabic_text(final_paraphrase)
    generate_txt(final_paraphrase_arabic, text_output_path, language='ar')

    # Notify the user that the text has been created
    print(f"Paraphrasing completed! Saved to {text_output_path}")

    return final_paraphrase_arabic


## Step 10: Language Detection and Pipeline Execution


In [35]:
def detect_language_and_paraphrase(pdf_path, text_output_path_ar="arabic_paraphrase.txt", text_output_path_en="english_paraphrase.txt"):
    text = extract_text_from_pdf(pdf_path)
    language = detect(text)
    print(f"Detected language: {language}")

    if language == 'ar':
        print("Detected Arabic. Running Arabic paraphrasing pipeline...")
        return paraphrase_arabic(pdf_path, text_output_path=text_output_path_ar)
    else:
        print("Detected English. Running English paraphrasing pipeline...")
        return paraphrase_english(text, text_output_path=text_output_path_en)
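
The dispatcher above treats every non-Arabic result from `langdetect` as English. In addition, `detect` raises an exception on text with no detectable features, and its results can vary between runs unless `DetectorFactory.seed` is set. As a cross-check, a script-based heuristic can confirm whether a text is predominantly Arabic. The sketch below is not part of the notebook: `arabic_ratio` and `guess_language` are hypothetical helpers, and the 0.5 threshold is an assumption.

```python
def arabic_ratio(text):
    """Fraction of alphabetic characters that fall in the Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if '\u0600' <= ch <= '\u06FF')
    return arabic / len(letters)

def guess_language(text, threshold=0.5):
    """Return 'ar' when most letters are Arabic, otherwise 'en'."""
    return 'ar' if arabic_ratio(text) >= threshold else 'en'

print(guess_language("هذا كتاب جديد"))      # ar
print(guess_language("This is a new book"))  # en
```

A heuristic like this only separates scripts; it cannot tell English from French, so it complements rather than replaces `langdetect`.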


## Step 11: Run the Pipeline on Your PDF


In [37]:
pdf_path = "/content/english_summary (2) (1).pdf"  # Update this to the correct PDF path
final_paraphrase = detect_language_and_paraphrase(pdf_path)


Detected language: en
Detected English. Running English paraphrasing pipeline...
Paraphrasing completed! Saved to english_paraphrase.txt
