
In [2]:
from transformers import TFBartForConditionalGeneration, BartTokenizer
from string import punctuation
import warnings
import fitz  # PyMuPDF

# Filter out the specific UserWarning
warnings.filterwarnings("ignore", message="The secret `HF_TOKEN` does not exist in your Colab secrets.*")




# Initialize the BART tokenizer. BART uses byte-level Byte-Pair Encoding (BPE) by default,
# so no extra argument is needed to select it.
bart_tokenizer_bpe = BartTokenizer.from_pretrained('facebook/bart-large-cnn')



# Load the pre-trained BART model (the TensorFlow weights are initialized from the PyTorch checkpoint)
model = TFBartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')





def read_pdf(file_path):
    """
    Read text from a PDF file.

    Args:
        file_path (str): The path to the PDF file.

    Returns:
        str: The text extracted from the PDF.
    """
    text = ""
    with fitz.open(file_path) as pdf_file:
        for page_num in range(len(pdf_file)):
            page = pdf_file.load_page(page_num)
            text += page.get_text()
    return text


file_path = '/content/eng.pdf'
text = read_pdf(file_path)


# Function for text preprocessing
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join([c for c in text if c not in punctuation])
    return text



# Text preprocessing for BART.
# Note: the "summarize: " task prefix is a T5 convention; facebook/bart-large-cnn was
# fine-tuned without task prefixes, so the prefix is optional here.
text_to_summarize = "summarize: " + preprocess_text(text)

# Tokenize with BART's byte-level BPE tokenizer (not WordPiece);
# input beyond 1024 tokens is truncated.
input_ids_bart = bart_tokenizer_bpe.encode(text_to_summarize, return_tensors='tf', max_length=1024, truncation=True, add_special_tokens=True)

# Generate the summary

summary_ids = model.generate(input_ids_bart, max_length=500)

# Decode the summary
bart_summary = bart_tokenizer_bpe.decode(summary_ids[0], skip_special_tokens=True)

print("BART Summary:")
print(bart_summary)


All PyTorch model weights were used when initializing TFBartForConditionalGeneration.

All the weights of TFBartForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.


BART Summary:
Human rights is the foundation of freedom justice and peace in the world. The peoples of the united nations have reaffirmed their faith in fundamental human rights in the dignity and worth of the human person. Human rights should be protected by the rule of law. A common understanding of these rights and freedoms is of the greatest importance.
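
The `encode` call in the BART cell truncates its input at 1024 tokens, so any text in the PDF beyond that point never reaches the model. One common workaround is to split the text into chunks, summarize each chunk, and join the partial summaries. The sketch below covers only the splitting step. `chunk_words` and its parameters are hypothetical names, not part of the notebook, and it counts whitespace-separated words as a rough stand-in for tokens, which undercounts real BPE tokens.

```python
def chunk_words(text, max_words=600, overlap=50):
    """Split text into overlapping chunks of at most max_words words.

    Words are only a rough proxy for tokens (an assumption): a BPE
    tokenizer usually yields more tokens than words, so keep max_words
    well below the model's token limit.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk could then go through the same encode, generate, and decode steps used in the cell, with the overlap reducing the chance that a sentence is cut in half at a chunk boundary.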


In [26]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
from string import punctuation
import fitz  # PyMuPDF
# Load pre-trained T5 model and tokenizer
model = T5ForConditionalGeneration.from_pretrained('t5-small')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')


def read_pdf(file_path):
    """
    Read text from a PDF file.

    Args:
        file_path (str): The path to the PDF file.

    Returns:
        str: The text extracted from the PDF.
    """
    text = ""
    with fitz.open(file_path) as pdf_file:
        for page_num in range(len(pdf_file)):
            page = pdf_file.load_page(page_num)
            text += page.get_text()
    return text


file_path = '/content/eng.pdf'
text = read_pdf(file_path)



# Function for text preprocessing
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join([c for c in text if c not in punctuation])
    return text

# Text preprocessing for T5: prepend the "summarize: " task prefix that T5 expects
text_to_summarize = "summarize: " + preprocess_text(text)

# Encode the input text. t5-small was pre-trained on inputs of up to 512 tokens,
# so quality may degrade on the longer inputs that max_length=2000 allows.
input_ids = t5_tokenizer.encode(text_to_summarize, return_tensors='pt', max_length=2000, truncation=True, add_special_tokens=True)

# Generate the summary
summary_ids = model.generate(input_ids, max_length = 500 )

# Decode the summary
t5_summary = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)


# Post-processing steps
# preprocess_text() lowercased the input, so the summary comes out mostly in lowercase;
# restore sentence case here.
# Split the summary into sentences
sentences = t5_summary.split('. ')

# Capitalize the first letter of each sentence and join them back together
# (note: str.capitalize() also lowercases the rest of each sentence)
proper_case_summary = '. '.join([sentence.capitalize() for sentence in sentences])

print("T5 Summary (Proper Case):")
print(proper_case_summary)








Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


T5 Summary (Proper Case):
Universal declaration of human rights is a common standard of achievement for all peoples and all nations. Despite the u.n.'s commitment to promoting universal respect for and observance of human rights and fundamental freedoms, the general assembly proclaims this universal declaration of human rights as a common standard of achievement for all peoples and all nations.
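
`str.capitalize()` upper-cases the first character but also lower-cases everything after it, so it would flatten acronyms and proper nouns if the summary contained any. A minimal alternative is sketched below, assuming sentences end in `.`, `!`, or `?` followed by whitespace. `sentence_case` is a hypothetical helper, not part of the notebook.

```python
import re


def sentence_case(text):
    """Upper-case the first letter of each sentence, leaving the rest untouched."""
    # Match a lowercase letter at the start of the text, or one that follows
    # sentence-ending punctuation plus whitespace.
    pattern = re.compile(r"(^\s*|[.!?]\s+)([a-z])")
    return pattern.sub(lambda m: m.group(1) + m.group(2).upper(), text)


print(sentence_case("the UN adopted it. it has 30 articles."))
# The UN adopted it. It has 30 articles.
print("the UN adopted it".capitalize())
# The un adopted it
```

The pattern is deliberately simple: it will also capitalize the word after an abbreviation such as "e.g. ", so it is a starting point rather than a full sentence splitter.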


In [5]:
# Import the LSA summarizer (install sumy first if needed)
#!pip install sumy
import nltk
nltk.download('punkt')
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
import fitz

def read_pdf(file_path):
    """
    Read text from a PDF file.

    Args:
        file_path (str): The path to the PDF file.

    Returns:
        str: The text extracted from the PDF.
    """
    text = ""
    with fitz.open(file_path) as pdf_file:
        for page_num in range(len(pdf_file)):
            page = pdf_file.load_page(page_num)
            text += page.get_text()
    return text


file_path = '/content/eng.pdf'
text = read_pdf(file_path)



# LSA is extractive, so no task prefix is needed; the raw text is parsed directly

# Parsing the text string using PlaintextParser
parser = PlaintextParser.from_string(text, Tokenizer('english'))

# Creating the summarizer
lsa_summarizer = LsaSummarizer()
lsa_summary = lsa_summarizer(parser.document, 25)  # 25 is the number of sentences in the summary

# Printing the summary
for sentence_lsa in lsa_summary:
    print(sentence_lsa)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Universal Declaration of Human Rights Preamble Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world, Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people, Whereas it is essential, if man is not to be compelled to have recourse, as a last resort, to rebellion against tyranny and oppression, that human rights should be protected by the rule of law, Whereas it is essential to promote the development of friendly relations between nations, Whereas the peoples of the United Nations have in the Charter reaffirmed their faith in fundamental human rights, in the dignity and worth of the human person and in the equ
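
Text extracted from a PDF typically keeps the page's hard line breaks, and words split across lines may keep their hyphen. Sentence tokenizers generally cope better with normalized text, so a small cleanup step before `PlaintextParser.from_string` can help. The sketch below is one way to do that. `normalize_pdf_text` is a hypothetical helper, and whether it changes the summary for this particular PDF has not been checked.

```python
import re


def normalize_pdf_text(text):
    """Undo line-break hyphenation and collapse whitespace in extracted text."""
    # Rejoin words hyphenated across a line break: "founda-\ntion" -> "foundation".
    # Caveat: this also joins genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse every run of whitespace (including newlines) into a single space.
    return re.sub(r"\s+", " ", text).strip()


raw = "Whereas recognition of the\ninherent dignity is the founda-\ntion of freedom."
print(normalize_pdf_text(raw))
# Whereas recognition of the inherent dignity is the foundation of freedom.
```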

In [6]:
# Importing the parser and tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# Import the LexRank summarizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# PyMuPDF, used to extract the text to summarize
import fitz

def read_pdf(file_path):
    """
    Read text from a PDF file.

    Args:
        file_path (str): The path to the PDF file.

    Returns:
        str: The text extracted from the PDF.
    """
    text = ""
    with fitz.open(file_path) as pdf_file:
        for page_num in range(len(pdf_file)):
            page = pdf_file.load_page(page_num)
            text += page.get_text()
    return text


file_path = '/content/eng.pdf'
text = read_pdf(file_path)



# LexRank is extractive, so no task prefix is needed; the raw text is parsed directly

# Parsing the text string using PlaintextParser
parser = PlaintextParser.from_string(text, Tokenizer('english'))

# Creating a summary of 25 sentences
lex_rank_summarizer = LexRankSummarizer()
lexrank_summary = lex_rank_summarizer(parser.document, sentences_count=25)

# Printing the summary
for sentence_lexrank in lexrank_summary:
    print(sentence_lexrank)


Universal Declaration of Human Rights Preamble Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world, Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people, Whereas it is essential, if man is not to be compelled to have recourse, as a last resort, to rebellion against tyranny and oppression, that human rights should be protected by the rule of law, Whereas it is essential to promote the development of friendly relations between nations, Whereas the peoples of the United Nations have in the Charter reaffirmed their faith in fundamental human rights, in the dignity and worth of the human person and in the equ
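
To give a feel for what a LexRank-style summarizer does, the sketch below scores sentences by their centrality in a sentence-similarity graph and runs a PageRank-style power iteration over it. It is a simplified illustration and not sumy's implementation: it uses raw term-frequency vectors instead of TF-IDF, applies no similarity threshold, and the names `cosine` and `lexrank_scores` and the damping value are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by centrality in a cosine-similarity graph."""
    vectors = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(vectors)
    if n == 0:
        return []
    # Similarity matrix, normalized so that each row sums to 1.
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    for row in sim:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n
    # Power iteration, as in PageRank; the scores keep summing to 1.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * sim[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


sentences = [
    "Everyone has the right to life, liberty and security of person.",
    "Everyone has the right to freedom of thought and religion.",
    "The weather was cold.",
]
scores = lexrank_scores(sentences)
for sentence, score in zip(sentences, scores):
    print(f"{score:.3f}  {sentence}")
```

The two sentences that share most of their vocabulary reinforce each other and score higher than the unrelated one. A summary would then keep the top-scoring sentences in their original order.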