<a href="https://colab.research.google.com/github/EmreYY20/ToS-Simplification/blob/dev/Emre_Summarization_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Imports

In [None]:
!pip install sumy PyPDF2 textstat transformers

In [None]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

#import transformers
#from transformers import T5ForConditionalGeneration, T5Tokenizer

import textstat
import nltk
import PyPDF2

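nltk.download('punkt')
# Recent NLTK releases look up 'punkt_tab' instead of 'punkt'; fetching both
# keeps the sumy Tokenizer working across versions
nltk.download('punkt_tab')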

# 1. Automated Extractive Summarization

## 1.1. Using sumy

In [45]:
# Path to the PDF document (a local file in the Colab runtime, not a URL)
pdf_path = "/content/Bolt.pdf"

# Extract text from every page of the PDF
pdf_text = ""
with open(pdf_path, "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for page in pdf_reader.pages:
        # extract_text() can yield nothing for image-only pages
        pdf_text += page.extract_text() or ""

# Create a parser for the extracted ToS text
parser = PlaintextParser.from_string(pdf_text, Tokenizer("english"))

# Use LSA (Latent Semantic Analysis) for summarization
lsa_summarizer = LsaSummarizer()
lsa_summary = lsa_summarizer(parser.document, sentences_count=10)  # Change the number of sentences as needed

# Use LexRank for summarization
lex_rank_summarizer = LexRankSummarizer()
lex_rank_summary = lex_rank_summarizer(parser.document, sentences_count=10)  # Change the number of sentences as needed

# Use TextRank for summarization
text_rank_summarizer = TextRankSummarizer()
text_rank_summary = text_rank_summarizer(parser.document, sentences_count=10)  # Change the number of sentences as needed

# Print summaries
print("LSA Summary:")
for sentence in lsa_summary:
    print(sentence)

print("\nLexRank Summary:")
for sentence in lex_rank_summary:
    print(sentence)

print("\nTextRank Summary:")
for sentence in text_rank_summary:
    print(sentence)


LSA Summary:
The list of Bolt group companies and partners is available at https://bolt.eu/cities/ In order to use Bolt app you must agree to the terms and conditions that are set out below: 1.
Transport services are provided by drivers under a contract (with you) for the carriage of passengers.
Drivers provide transport services on an independent basis (either in person or via a company) as economic and professional service providers.
These service providers may charge you additional fees when processing payments in connection with the Bolt in-App Payment.
Bolt is not responsible for any such fees and disclaims all liability in this regard.
The resolution of disputes related to Bolt in-App Payment also takes place through us.
Inquiries submitted by e-mail or Bolt App will receive a response within one business day.
Sometimes driver may decide to cancel your request, please note that Bolt is not responsible for such situations.
Amendments to the General Terms and Conditions 8.1 If any 

In [50]:
# Convert summaries to strings
lsa_summary_text = ' '.join(map(str, lsa_summary))
lex_rank_summary_text = ' '.join(map(str, lex_rank_summary))
text_rank_summary_text = ' '.join(map(str, text_rank_summary))

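# Content-based evaluation
# Splitting on '.' only approximates sentence boundaries: it also breaks on
# abbreviations, URLs, and clause numbers such as "8.1"
original_sentences = pdf_text.split('.')
lsa_summary_sentences = lsa_summary_text.split('.')
lex_rank_summary_sentences = lex_rank_summary_text.split('.')
text_rank_summary_sentences = text_rank_summary_text.split('.')

# Calculate overlap between original and summary sentences
def calculate_overlap(summary_sentences):
    """Count summary fragments found verbatim in the original, as a percentage of the original's fragment count."""
    overlap_count = sum(1 for sentence in summary_sentences if sentence in original_sentences)
    # The denominator is the size of the original, so this measures how much of
    # the source the summary reproduces, not how much of the summary is verbatim
    overlap_percentage = (overlap_count / len(original_sentences)) * 100
    return overlap_percentage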

print("Overlap with Original Text (LSA): {:.2f}%".format(calculate_overlap(lsa_summary_sentences)))
print("Overlap with Original Text (LexRank): {:.2f}%".format(calculate_overlap(lex_rank_summary_sentences)))
print("Overlap with Original Text (TextRank): {:.2f}%".format(calculate_overlap(text_rank_summary_sentences)))

Overlap with Original Text (LSA): 0.59%
Overlap with Original Text (LexRank): 2.37%
Overlap with Original Text (TextRank): 7.69%
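
The percentages above are relative to the size of the original document, so they stay small by construction for a ten-sentence summary. A complementary view is the share of summary sentences that occur verbatim in the source, which should be close to 100% for a purely extractive method. The following is a minimal sketch under stated assumptions: `split_sentences` and `overlap_stats` are hypothetical helper names (not part of sumy), the sample strings are made up, and the regex splitter is only a rough stand-in for a real sentence tokenizer.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Normalize internal whitespace so PDF line breaks do not block matches
    return [' '.join(p.split()) for p in parts if p.strip()]

def overlap_stats(original_text, summary_text):
    """Return (precision, coverage) in percent."""
    original = set(split_sentences(original_text))
    summary = split_sentences(summary_text)
    if not original or not summary:
        return 0.0, 0.0
    matched = sum(1 for s in summary if s in original)
    precision = 100.0 * matched / len(summary)   # share of the summary found verbatim
    coverage = 100.0 * matched / len(original)   # share of the original reproduced
    return precision, coverage

# Made-up sample text, for illustration only
original = "Bolt connects riders and drivers. Drivers act independently. Fees may apply."
summary = "Drivers act independently. Fees may apply."
precision, coverage = overlap_stats(original, summary)
print(f"precision={precision:.1f}% coverage={coverage:.1f}%")
# prints: precision=100.0% coverage=66.7%
```

Reporting both numbers separates "is the summary faithful to the source wording" from "how much of the source does it keep", which the single percentage cannot distinguish.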


In [56]:
# Calculate readability scores for the original text
flesch_reading_original = textstat.flesch_reading_ease(pdf_text)
flesch_kincaid_original = textstat.flesch_kincaid_grade(pdf_text)

print("Readability Scores for Original Text:")
print(f"Flesch Reading Ease: {flesch_reading_original}")
print(f"Flesch-Kincaid Grade Level: {flesch_kincaid_original}\n")

# Calculate readability scores for the generated summaries (LSA, LexRank, TextRank)
summaries = {
    "LSA Summary": lsa_summary_text,
    "LexRank Summary": lex_rank_summary_text,
    "TextRank Summary": text_rank_summary_text
}

for summary_name, summary_text in summaries.items():
    flesch_reading = textstat.flesch_reading_ease(summary_text)
    flesch_kincaid = textstat.flesch_kincaid_grade(summary_text)

    print(f"Readability Scores for {summary_name}:")
    print(f"Flesch Reading Ease: {flesch_reading}")
    print(f"Flesch-Kincaid Grade Level: {flesch_kincaid}\n")

Readability Scores for Original Text:
Flesch Reading Ease: 49.35
Flesch-Kincaid Grade Level: 11.8

Readability Scores for LSA Summary:
Flesch Reading Ease: 54.32
Flesch-Kincaid Grade Level: 9.9

Readability Scores for LexRank Summary:
Flesch Reading Ease: 55.47
Flesch-Kincaid Grade Level: 11.5

Readability Scores for TextRank Summary:
Flesch Reading Ease: 49.28
Flesch-Kincaid Grade Level: 13.9
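
For context on how to read these numbers: a higher Flesch Reading Ease means easier text, while the Flesch-Kincaid Grade Level approximates a US school grade. Both are computed from words per sentence and syllables per word. The sketch below applies the standard formulas with a deliberately crude syllable counter; `count_syllables` and `flesch_scores` are hypothetical helpers, and because textstat counts syllables and sentences in its own way, the values will not match its output exactly.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))

def flesch_scores(text):
    """Return (reading_ease, grade_level) using the standard Flesch formulas."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

# Made-up sample sentences, for illustration only
easy = "The cat sat on the mat. The dog ran to the park."
hard = ("Notwithstanding any contrary provision, the aforementioned "
        "liability limitations remain applicable.")
for label, sample in (("easy", easy), ("hard", hard)):
    ease, grade = flesch_scores(sample)
    print(f"{label}: reading ease {ease:.1f}, grade level {grade:.1f}")
```

Since both scores depend heavily on sentence length, a summarizer that favors long sentences can end up with a higher grade level than the source, as the TextRank result above shows.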



# 2. Abstractive Summarization

In [60]:
from transformers import pipeline

# Function to extract text from a PDF file
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ''
        # PyPDF2 3.x removed getPage()/extractText(); use pages and extract_text()
        for page in pdf_reader.pages:
            text += page.extract_text() or ''
        return text

# Load the summarization pipeline with its default model
# (a DistilBART checkpoint at the time of writing, not GPT-2)
summarizer = pipeline("summarization")

# Replace with your PDF file path
pdf_file_path = 'path_to_your_pdf_file.pdf'

# Extract text from the PDF
pdf_text = extract_text_from_pdf(pdf_file_path)

# Generate abstractive summary; truncation=True cuts the input to the model's
# maximum length, so only the beginning of a long document is summarized
generated_summary = summarizer(pdf_text, max_length=150, min_length=30, do_sample=False, truncation=True)

# Print the generated summary
print("Generated Abstractive Summary:")
print(generated_summary[0]['summary_text'])
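
Summarization models accept only a limited number of input tokens, so a full ToS document generally has to be summarized in pieces rather than in a single call. One possible approach, sketched below, splits the text into word-based chunks and summarizes each chunk separately. `chunk_words` and `summarize_long_text` are hypothetical helpers, the 400-word default is an arbitrary assumption, and word count is only a rough proxy for the model's token count. `summarize_fn` stands in for any callable that maps a chunk to its summary.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long_text(text, summarize_fn, max_words=400):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    return ' '.join(summarize_fn(chunk) for chunk in chunk_words(text, max_words))

# Stand-in summarizer for demonstration: keeps the first five words of a chunk.
# With transformers, summarize_fn could instead wrap the pipeline, e.g.
#   lambda chunk: summarizer(chunk, max_length=150, min_length=30,
#                            do_sample=False, truncation=True)[0]['summary_text']
def first_five_words(chunk):
    return ' '.join(chunk.split()[:5])

sample = ' '.join(f"word{i}" for i in range(12))
print(chunk_words(sample, max_words=5))
print(summarize_long_text(sample, first_five_words, max_words=5))
```

Chunking on word boundaries can split a clause in the middle of a sentence; splitting on sentence or section boundaries would likely give the model cleaner input at the cost of uneven chunk sizes.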