# Text Summarization 

### Text Summarization Definition
Text summarization is the process of distilling the most important information from a piece of text while retaining its key meaning and essence. It involves condensing a larger body of text, such as an article, document, or webpage, into a shorter version that captures the main points and essential details.

### The two main approaches to text summarization

1. **Extractive Summarization:** In extractive summarization, key sentences or phrases are selected directly from the original text to form the summary. These selected sentences are typically the ones that contain the most important information or convey the main ideas of the text. Extractive summarization does not generate new sentences but rather reuses existing ones.

2. **Abstractive Summarization:** Abstractive summarization, on the other hand, generates new sentences that convey the main ideas of the text. This approach relies on natural language generation to paraphrase the original, often combining information from multiple sentences into a coherent summary. Abstractive summaries can be more concise than extractive ones, but because the wording is generated rather than copied, they can also introduce interpretations or details that the source does not support.

### Use Case
Text summarization is widely used in various applications, including information retrieval, document summarization, news aggregation, and content recommendation systems. It helps users quickly grasp the main points of lengthy texts, saves time, and aids in decision-making and information processing.

# 5 techniques for text summarization in Python

### 1. Gensim

Gensim is a popular Python library used primarily for topic modeling and related natural language processing tasks. Versions before 4.0 also shipped a `gensim.summarization` module with a TextRank-based summarizer; that module was removed in Gensim 4.0, so on current releases Gensim contributes building blocks (TF-IDF, LSA) rather than a ready-made summarizer.

Here's how Gensim can be utilized for text summarization:

1. **TextRank Algorithm**: TextRank is an extractive summarization technique based on the PageRank algorithm used by Google. Gensim 3.x provided a TextRank-style implementation in `gensim.summarization`, which ranked the sentences of a document by importance and returned the top ones as the summary. Because the module was removed in Gensim 4.0, using it today means pinning an older Gensim release or implementing the ranking yourself.

2. **LSA (Latent Semantic Analysis)**: LSA is a statistical technique that uses dimensionality reduction to uncover latent topics in text. Gensim implements it as `LsiModel`; the resulting topic vectors can support summarization by scoring each sentence on how strongly it relates to the document's main topics, although Gensim does not perform the sentence selection itself.

3. **TF-IDF (Term Frequency-Inverse Document Frequency)**: Gensim provides a `TfidfModel` along with utilities for building dictionaries and bag-of-words corpora. These can be used to weight terms and score sentences in a TF-IDF-based extractive summarizer.


Overall, Gensim is a versatile library for natural language processing, and its TF-IDF and LSA models are useful components for building extractive summarizers.
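To make the TextRank idea concrete, here is a minimal pure-Python sketch. It does not use Gensim: the sentence splitter, the word-overlap similarity, and the names `split_sentences`, `similarity`, `textrank`, and `summarize` are all assumptions chosen for illustration, and a library implementation will differ in tokenization and similarity measure.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    # Number of shared words, normalised by the log of each sentence's size.
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # keeps the denominator strictly positive
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, damping=0.85, iterations=50):
    # Build a weighted sentence graph and run PageRank-style power iteration.
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, num_sentences=2):
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Return the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(top[:num_sentences]))


example = (
    "Currency trading is complex. AI can help currency trading decisions. "
    "Trading bots use AI for currency decisions. Bananas are yellow."
)
print(summarize(example, num_sentences=2))
```

Sentences that share vocabulary with many other sentences accumulate score, while an unrelated sentence (the last one in the example) keeps only the baseline `1 - damping`.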

### 2. Sumy
Sumy is a Python library designed specifically for text summarization. It provides a simple interface for extracting summaries from documents using various algorithms, including techniques mentioned earlier such as TextRank and LSA. Sumy aims to make text summarization accessible and easy to implement for Python developers.

Key features of Sumy include:

1. **Multiple Summarization Algorithms**: Sumy supports several algorithms for text summarization, including TextRank, LSA (Latent Semantic Analysis), LexRank, KL-Sum, and more. This allows users to experiment with different approaches and select the one that best fits their needs.

2. **Ease of Use**: Sumy offers a straightforward API, making it easy to integrate into Python projects. Users can simply import the library and use its methods to generate summaries from text.

3. **Customization Options**: While Sumy provides default settings for summarization algorithms, it also allows users to customize parameters such as the number of sentences in the summary or the language of the input text.

4. **Support for Multiple Languages**: Sumy supports text summarization in various languages, making it suitable for international applications where multilingual support is required.

5. **Extensibility**: Sumy is designed to be extensible, allowing developers to add custom summarization algorithms or modify existing ones to better suit their specific use cases.

Overall, Sumy is a useful tool for implementing text summarization in Python, particularly for developers who want a quick and easy way to integrate summarization capabilities into their applications without delving into the intricacies of the algorithms themselves.

### 3. NLTK

NLTK (Natural Language Toolkit) is a comprehensive Python library for natural language processing (NLP). It is best known for tokenization, part-of-speech tagging, and parsing, and it does not include a ready-made summarizer, but its tools and resources can be used to build one.

Here's how NLTK can be used for text summarization:

1. **Extractive Summarization using NLTK**:
   - NLTK provides functionalities for processing text data and computing various metrics that can be used for extractive summarization.
   - One common approach is to use NLTK for tasks such as sentence tokenization, word tokenization, part-of-speech tagging, and calculating sentence importance scores based on metrics like term frequency, inverse document frequency, and sentence length.
   - These metrics can then be used to rank sentences and select the most important ones to form the summary.


While NLTK provides basic tools for text summarization, more advanced summarization techniques like TextRank or LSA may require additional implementations or integration with other libraries. However, NLTK serves as a valuable resource for building foundational components of text summarization systems in Python.
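The scoring approach described above can be sketched as follows. To keep the example self-contained it uses regular expressions in place of NLTK's sentence and word tokenizers and a tiny hand-written stopword set; those substitutions, and the name `summarize_by_frequency`, are assumptions for illustration rather than NLTK's API.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; NLTK ships a much fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "for", "on", "with", "as"}


def summarize_by_frequency(text, num_sentences=2):
    # 1. Sentence tokenization (naive: split after ., ! or ?).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # 2. Word tokenization with stopword removal.
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    # 3. Term frequencies over the whole document.
    freq = Counter(w for words in tokenized for w in words)

    # 4. Sentence score: average term frequency, so long sentences are not favoured.
    def score(words):
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(tokenized[i]), reverse=True)
    # 5. Keep the top sentences, restored to document order.
    return " ".join(sentences[i] for i in sorted(ranked[:num_sentences]))


doc = (
    "Forex trading is risky. Forex trading bots automate forex trading. "
    "The weather was pleasant yesterday."
)
print(summarize_by_frequency(doc, num_sentences=2))
```

Dividing by sentence length is one simple way to account for the "sentence length" metric mentioned above; an inverse-document-frequency weight could be added in step 3 when a larger corpus is available.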

### 4. T5 

T5 (Text-To-Text Transfer Transformer) is a state-of-the-art natural language processing (NLP) model developed by Google. It belongs to the Transformer architecture family and is particularly known for its versatility in handling various NLP tasks using a unified "text-to-text" framework. This means that both the inputs and outputs of the model are in the form of text, allowing it to be trained and applied to a wide range of tasks without task-specific modifications.

Summarization fits this framework directly: the input text is the document to be summarized, prefixed with a task marker such as `summarize: `, and the output text is the summary itself. The released T5 checkpoints were trained on a mixture of tasks that included translation, question answering, and abstractive summarization, so they can produce summaries out of the box; fine-tuning on your own data adapts them to a particular domain or style.

Here's a general outline of how T5 can be used for text summarization:

1. **Data Preparation**: Prepare a dataset consisting of pairs of documents and corresponding summaries. Each document-summary pair serves as a training example for the model.

2. **Model Fine-Tuning**: Fine-tune the pre-trained T5 model on the summarization dataset using a sequence-to-sequence learning objective. During fine-tuning, the model learns to generate summaries from input documents.

3. **Inference**: After fine-tuning, the model can be used to generate summaries for new documents. Given a document as input, the model generates the corresponding summary by decoding the output sequence.

Python libraries such as Hugging Face's `transformers` provide easy-to-use interfaces for loading pre-trained T5 models and fine-tuning them on custom datasets. The section "Using T5 for text summarization" later in this notebook walks through a complete example.

T5's strength lies in its ability to handle various NLP tasks using a unified framework, making it a powerful tool for text summarization when fine-tuned on summarization-specific datasets.
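As a small illustration of the data preparation step, the sketch below turns document-summary pairs into text-to-text training examples. The function name `to_text_to_text` and the dictionary keys `input_text` and `target_text` are hypothetical; a real fine-tuning pipeline would go on to tokenize these strings.

```python
def to_text_to_text(pairs, prefix="summarize: "):
    """Format (document, summary) pairs as T5-style input and target strings."""
    examples = []
    for document, summary in pairs:
        examples.append(
            {
                # The task prefix tells the model which task to perform.
                "input_text": prefix + " ".join(document.split()),
                # Whitespace is collapsed so stray line breaks do not leak in.
                "target_text": " ".join(summary.split()),
            }
        )
    return examples


pairs = [
    ("Currency trading is one of the most\nintricate parts of finance.", "Currency trading is intricate."),
]
examples = to_text_to_text(pairs)
print(examples[0]["input_text"])
```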

### 5. GPT-3

GPT-3 (Generative Pre-trained Transformer 3) is a state-of-the-art language model developed by OpenAI. It belongs to the Transformer architecture family, known for its ability to generate human-like text based on a given prompt. While GPT-3 is not specifically designed for text summarization, it can be adapted for this task by framing it as a generative modeling problem.

Here's a high-level overview of how GPT-3 can be used for text summarization:

1. **Prompting**: Provide GPT-3 with a prompt that includes the input document and a directive to summarize it. The prompt serves as the initial context for GPT-3 to generate the summary.

2. **Generation**: GPT-3 generates the summary based on the provided prompt. It does so by predicting the most likely continuation of the prompt, which in this case is the summary of the input document.

3. **Filtering**: Optionally, you can post-process the generated summary to filter out any irrelevant or redundant information and ensure that the output is coherent and concise.


While GPT-3 can produce high-quality summaries, it may not always generate concise or coherent summaries, especially for longer documents or when the input prompt is ambiguous. Post-processing and filtering of the generated summary may be necessary to improve its quality.

Overall, GPT-3 offers a powerful and versatile approach to text summarization, leveraging its ability to generate human-like text based on a given prompt.
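The prompting and filtering steps can be sketched without calling any API. The prompt wording and the helper names `build_prompt` and `filter_summary` below are assumptions for illustration; the generation step itself (sending the prompt to the model) is deliberately left out.

```python
import re


def build_prompt(document, max_sentences=3):
    # Prompting: combine a directive with the document to be summarized.
    return (
        f"Summarize the following document in at most {max_sentences} sentences.\n\n"
        f"Document:\n{document.strip()}\n\nSummary:"
    )


def filter_summary(summary, max_sentences=3):
    # Filtering: drop repeated sentences, then cap the number of sentences.
    seen, kept = set(), []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        key = re.sub(r"\W+", " ", sentence.lower()).strip()  # normalised form
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept[:max_sentences])


prompt = build_prompt("Currency trading is complex. AI can support trading decisions.")
raw = "AI supports trading. Losses can grow too. AI supports trading."
print(filter_summary(raw))
```

The same filter applies to the output of any generative summarizer, not only GPT-3.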

### Installing Necessary Libraries

In [7]:
! pip install PyPDF2 transformers
! pip install sentencepiece


## Using T5 for text summarization

### Importing necessary libraries

In [2]:
import PyPDF2
from transformers import T5ForConditionalGeneration, T5Tokenizer

### Extracting text from a PDF (graphs and images are not extracted)

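In [6]:
# Path to the PDF to summarize (defined before it is used)
pdf_path = r"C:\Users\MD\Downloads\“AI OPTIMIZED QUANTUM FOREX BOT TO ENHANCE PHILANTHROPY Evolution and Future Developments”  .pdf"

def extract_text_from_pdf(pdf_path):
    # Concatenate the extractable text of every page in the PDF
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            # Guard against pages that yield no text (e.g. image-only pages)
            text += page.extract_text() or ""
    return text

text = extract_text_from_pdf(pdf_path)
print(text)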

WHITE PAPER  
“AI OPTIMIZED QUANTUM FOREX BOT  TO ENHANCE PHILANTHROPY  Evolution and Future Developments ”   
By Daniel Rosendahl   
 
Introduction  
In the ever -evolving world of finance, currency trading stands as one of its most intricate and dynamic components. 
Traditionally, currency trading —or forex trading (although other trading platforms do exist) —has been the realm of 
institutional investors, financial experts, and a select group of individuals with a deep understanding of global 
economies. Yet,  as with many sectors of the economy, the digital revolution has greatly democratized access, with a 
myriad of platforms now enabling anyone to try their hand at predicting the rise and fall of currency values.  
 
While the barriers to entry have lowered, th e complexities associated with forex trading have concurrently increased. 
Economic events, geopolitical tensions, and a host of other factors can introduce unpredictable volatility. The sheer 
volume of global transactio

### Combined code (extracting text from a PDF and summarizing it) (model = t5-small)

In [15]:
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

def generate_summary(text):
    model_name = "t5-small"
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)

    summary_ids = model.generate(inputs, max_length=150, num_beams=4, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

pdf_path = r"C:\Users\MD\Downloads\“AI OPTIMIZED QUANTUM FOREX BOT TO ENHANCE PHILANTHROPY Evolution and Future Developments”  .pdf"

text = extract_text_from_pdf(pdf_path)

summary = generate_summary(text)

print("Summary : ",summary)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Summary :  currency trading stands as one of its most intricate and dynamic components. the digital revolution has greatly democratized access, with a myriad of platforms now enabling anyone to try their hand at predicting the rise and fall of currency values. integrating AI into currency trading isn't just about automating trades, but about leveraging the power of AI to make more informed decisions.


### Breaking down the code for better understanding

1. **Tokenizer Initialization**: The `T5Tokenizer.from_pretrained(model_name)` line initializes a tokenizer for the T5 model specified by `model_name`. A tokenizer is responsible for converting raw text input into tokenized format that can be understood by the model. It handles tasks such as splitting words into subwords, adding special tokens, and converting tokens to numerical IDs.

2. **Model Initialization**: The `T5ForConditionalGeneration.from_pretrained(model_name)` line loads the pre-trained T5 checkpoint named by `model_name`. The checkpoint was pre-trained on a large text corpus together with a mixture of supervised tasks, including summarization, which is why it can summarize without further fine-tuning. "Conditional generation" means the model generates an output text sequence conditioned on an input sequence, here the document to be summarized.

Overall, these lines set up the tokenizer and the T5 model for conditional generation, allowing you to tokenize input text and generate summaries using the T5 model architecture.

The next line, `inputs = tokenizer.encode(...)`, prepares the model input:

1. **Text Encoding**: The `tokenizer.encode()` method encodes the input text into numerical token IDs suitable for the T5 model. The input text is prefixed with "summarize: " to tell the model that it should generate a summary of the provided text.

2. **PyTorch Tensors**: The `return_tensors="pt"` argument specifies that the output should be returned as PyTorch tensors. PyTorch tensors are multi-dimensional arrays that can be processed by the T5 model. This allows the input data to be compatible with the T5 model's processing requirements.

3. **Maximum Length and Truncation**: The `max_length=512` argument sets the maximum length of the encoded input in tokens, and `truncation=True` tells the tokenizer to cut off anything beyond that limit. As a consequence, only roughly the first 512 tokens of a long PDF reach the model, so the summary reflects the beginning of the document rather than the whole of it.

Overall, this line encodes the input text into numerical tokens, converts it into PyTorch tensors, and applies truncation if needed to prepare it for input to the T5 model for summary generation.
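The effect of the prefix and the truncation can be illustrated with a toy stand-in for the tokenizer. The sketch below splits on whitespace, which is an assumption made for readability (the real T5 tokenizer uses SentencePiece subwords and returns integer IDs), and `toy_encode` is a hypothetical name.

```python
def toy_encode(text, max_length=512, prefix="summarize: "):
    # Whitespace "tokens" stand in for SentencePiece subword tokens.
    tokens = (prefix + text).split()
    # Truncate so that the end-of-sequence token still fits within max_length.
    if len(tokens) + 1 > max_length:
        tokens = tokens[: max_length - 1]
    return tokens + ["</s>"]


long_document = " ".join(f"w{i}" for i in range(1000))  # 1000 "words"
encoded = toy_encode(long_document)
kept_words = len(encoded) - 2  # minus the prefix token and </s>
print(f"{kept_words} of 1000 words reach the model")
```

Everything past the cut-off is simply never seen by the model, which is why long documents need a different strategy if the later pages matter.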

The `model.generate(...)` line generates the summary using the T5 model. Let's break down its arguments:

1. **inputs**: This is the input data provided to the model for generating the summary. It typically consists of token IDs representing the input text. In this case, it's the tokenized representation of the input text obtained using the tokenizer.

2. **max_length**: This parameter specifies the maximum length of the generated summary in terms of tokens. In this code, the maximum length of the summary is set to 150 tokens. This means that the generated summary will contain at most 150 tokens.

3. **num_beams**: This parameter controls the number of beams used in beam search decoding. Beam search keeps the `num_beams` most probable partial sequences at each decoding step instead of committing to the single most probable next token. In this code, `num_beams` is set to 4, so the model tracks 4 candidate sequences during decoding.

4. **length_penalty**: This parameter controls how sequence length affects the ranking of beam-search candidates. In `transformers`, each candidate's log-probability is divided by its length raised to the power `length_penalty`; because log-probabilities are negative, values above 0 favor longer sequences and values below 0 favor shorter ones. In this code, `length_penalty` is set to 2.0, which biases the search toward longer summaries (within the `max_length` limit).

5. **early_stopping**: This parameter controls when beam search stops. With `early_stopping=True`, generation ends as soon as `num_beams` complete candidates (sequences that have reached the end-of-sequence token) have been found, rather than continuing to look for better ones. In this code, `early_stopping` is set to `True`.

Overall, this line of code generates the summary using the T5 model with specified parameters for maximum length, beam search, length penalty, and early stopping. The generated summary is represented as a sequence of token IDs (`summary_ids`). Later, this sequence is decoded into human-readable text using the tokenizer.
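To see what `num_beams` and `length_penalty` do, here is a toy sketch. The lookup table `NEXT_TOKEN_PROBS` is a hypothetical stand-in for a model, and `beam_search` and `normalised_score` are illustrative names; the scoring formula (log-probability divided by length raised to `length_penalty`) follows the description above, not the library's actual code path.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams, max_steps=5):
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_steps):
        # Expand every live beam by every possible next token.
        candidates = [
            (tokens + [tok], logp + math.log(p))
            for tokens, logp in beams
            for tok, p in NEXT_TOKEN_PROBS[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep only the num_beams best; completed ones move to `finished`.
        for hyp in candidates[:num_beams]:
            (finished if hyp[0][-1] == "</s>" else beams).append(hyp)
        if not beams:
            break
    return finished


def normalised_score(logp, length, length_penalty):
    # Log-probabilities are negative, so dividing by a larger number
    # moves the score toward zero, i.e. makes it better.
    return logp / (length ** length_penalty)


greedy = max(beam_search(num_beams=1), key=lambda h: h[1])
wider = max(beam_search(num_beams=2), key=lambda h: h[1])
print(" ".join(greedy[0]), "vs", " ".join(wider[0]))

short, long_ = (-2.0, 2), (-4.0, 6)  # (log-probability, length)
for lp in (0.0, 2.0):
    print(lp, normalised_score(*short, lp), normalised_score(*long_, lp))
```

With one beam the search commits to the locally best first token and misses the more probable overall sequence that two beams find. In the second part, the short candidate wins when `length_penalty` is 0 and the long one wins when it is 2.0.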

The final line, `summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)`, turns the generated token IDs back into text:

1. **summary_ids[0]**: `summary_ids` is a tensor containing one sequence of token IDs per input in the batch. Since only one document is passed in, `summary_ids[0]` selects the first (and only) sequence.

2. **skip_special_tokens=True**: This parameter instructs the tokenizer to omit special tokens during decoding. For T5 these include the padding token `<pad>` (which also serves as the decoder start token) and the end-of-sequence token `</s>`; they mark structure rather than content. Setting `skip_special_tokens=True` ensures that they do not appear in the final decoded text.

Overall, this line of code takes the sequence of token IDs representing the generated summary (`summary_ids[0]`) and converts it into human-readable text, excluding any special tokens, using the tokenizer. The resulting text is assigned to the variable `summary`, which represents the final generated summary in natural language form.
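A toy version of the decoding step makes the role of `skip_special_tokens` visible. The six-entry vocabulary and the name `toy_decode` are assumptions for illustration; only the IDs 0 for `<pad>` and 1 for `</s>` mirror the real T5 vocabulary.

```python
# Hypothetical miniature vocabulary; a real T5 vocabulary has about 32,000 entries.
TOY_VOCAB = {0: "<pad>", 1: "</s>", 2: "currency", 3: "trading", 4: "is", 5: "complex"}
SPECIAL_IDS = {0, 1}


def toy_decode(ids, skip_special_tokens=True):
    # Map each ID back to its token, optionally dropping special tokens.
    tokens = [
        TOY_VOCAB[i]
        for i in ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return " ".join(tokens)


# A batch containing one generated sequence: decoder start token, content, end of sequence.
summary_ids = [[0, 2, 3, 4, 5, 1]]
print(toy_decode(summary_ids[0]))
print(toy_decode(summary_ids[0], skip_special_tokens=False))
```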

### t5-base

In [29]:
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

def generate_summary(text):
    model_name = "t5-base"
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)

    summary_ids = model.generate(inputs, max_length=150, num_beams=4, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

pdf_path = r"C:\Users\MD\Downloads\“AI OPTIMIZED QUANTUM FOREX BOT TO ENHANCE PHILANTHROPY Evolution and Future Developments”  .pdf"

text = extract_text_from_pdf(pdf_path)

summary = generate_summary(text)

print("Summary : ",summary)


Summary :  integrating AI into currency trading isn't just about automating trades. it's about leveraging the power of AI to make more informed decisions. as potential rewards grow, so too do potential losses. this white paper delves into the creation, development and maturation of an AI -driven currency trading software bot.


### t5-large

In [30]:
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

def generate_summary(text):
    model_name = "t5-large"
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)

    summary_ids = model.generate(inputs, max_length=150, num_beams=4, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

pdf_path = r"C:\Users\MD\Downloads\“AI OPTIMIZED QUANTUM FOREX BOT TO ENHANCE PHILANTHROPY Evolution and Future Developments”  .pdf"

text = extract_text_from_pdf(pdf_path)

summary = generate_summary(text)

print("Summary : ",summary)

Summary :  integrating AI into currency trading isn't just about automating trades. it's about leveraging the power of AI to make more informed decisions. the objective isn't merely profitability; it's sustainable and responsible profitability. integrating AI into currency trading isn't just about automating trades.


## Which T5 model should you use?

The quality of the summaries a T5 model produces depends on several factors, including the size of the model, the quality and quantity of its training data, and the characteristics of the text being summarized. Larger T5 models generally produce better summaries, but they also require more memory and compute.

Here's a general guideline for choosing a T5 model variant for text summarization:

1. **t5-small** (about 60 million parameters): This variant is the smallest and fastest but may sacrifice some summary quality compared to larger models. It's suitable for quick experimentation or when computational resources are limited.

2. **t5-base** (about 220 million parameters; the checkpoint downloaded above is 892 MB): This is a mid-sized variant that balances quality and computational cost. It's a good choice for many text summarization tasks and is widely used in practice.

3. **t5-large** (about 770 million parameters; a 2.95 GB download above): This variant can potentially produce better summaries, especially for more complex summarization tasks. However, it requires more memory and computational resources to train and use.

4. **t5-3b**, **t5-11b** (about 3 billion and 11 billion parameters): These are even larger variants. They are suitable when the highest summary quality is required and the hardware is available, but they are the most resource-intensive and may not be practical for all use cases.

When choosing a T5 model variant, weigh summary quality against computational resources and speed. Larger is not automatically better for a given document: in the runs above, the t5-large summary repeated its opening sentence, while the t5-base summary did not. It is worth experimenting with different variants to find the best balance for your particular text summarization task.
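As a rough aid for the resource side of this trade-off, the sketch below estimates the memory needed just to hold each variant's weights and picks the largest variant that fits a budget. The parameter counts are approximate, 4 bytes per parameter assumes 32-bit floats, and the helper names `weights_gb` and `largest_model_that_fits` are hypothetical. Actual memory use during generation is higher, because activations and beam-search state are not counted.

```python
# Approximate parameter counts for the original T5 checkpoints.
APPROX_PARAMS = {
    "t5-small": 60e6,
    "t5-base": 220e6,
    "t5-large": 770e6,
    "t5-3b": 3e9,
    "t5-11b": 11e9,
}


def weights_gb(model_name, bytes_per_param=4):
    # Memory for the weights alone, in gigabytes (10**9 bytes).
    return APPROX_PARAMS[model_name] * bytes_per_param / 1e9


def largest_model_that_fits(budget_gb, bytes_per_param=4):
    # Largest variant whose weights fit in the budget, or None if none fits.
    fitting = [m for m in APPROX_PARAMS if weights_gb(m, bytes_per_param) <= budget_gb]
    return max(fitting, key=APPROX_PARAMS.get) if fitting else None


for name in APPROX_PARAMS:
    print(f"{name}: ~{weights_gb(name):.2f} GB of weights")
print(largest_model_that_fits(budget_gb=2.0))
```

Treat the result as an upper bound on what is worth trying, then compare the actual summaries, since output quality on a specific document still has to be judged by reading it.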