<a href="https://colab.research.google.com/github/MichaelSomma94/Generative_AI/blob/main/Large_Document_Summarization_task_with_BART_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using a BART model for summarization of large documents

In this tutorial, I walk you through how to use the BART model from the Huggingface model hub. This model is particularly suited for the summariztation of large documents

We are going to do roughly the following steps:
- installation
- library imports
- load the dataset (crawled web page content)
- load the pre-trained BART model and tokenizer
- do a summary of an entire web page


I am using Google Colab with a GPU runtime for this tutorial. If you are not sure how to use a GPU Runtime take a look here.

In [3]:
!pip3 install torch torchvision
!pip install sentencepiece
!pip install transformers
!pip install accelerate -U


Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m13.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99
Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m45.8 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m37.3 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [None]:
!nvidia-smi

Fri Aug 18 09:46:11 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8    10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

After we uploaded the file with use `unzip` to extract the recipes.json.

# Necessary Imports

In [4]:
from transformers import AutoTokenizer
from transformers import TextDataset,DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments,AutoModelWithLMHead
from transformers import AutoConfig
import torch
import glob
from transformers import T5Tokenizer, T5ForConditionalGeneration



In [15]:
with open("test_Bart_2.txt") as f:
  test_text = f.read()




In [16]:
print(len(test_text))

8183


In [13]:
from transformers import BartForConditionalGeneration, BartTokenizer

# Load pre-trained BART model and tokenizer
model_name = "facebook/bart-large"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)


Original Text:
Your large document goes here...

Generated Summary:
8 AI Trends To Look Out For in 2023 & 2024 Ema Lukan Updated: July 12, 2023 Contents Text Link Easily scale your video production in 120+ languages. Try Synthesia Did you know that one of the first applications of AI was handwriting recognition? Postal services used it to read addresses on envelopes. 📩 Well, we’ve come a long way since then. Remember when we thought artificial intelligence would first impact manual, blue-collar work? Well, turns out we were collectively wrong. 😬 In late 2022, we witnessed the rise of generative AI. The boom began with AI image generators and reached its peak with the release of ChatGPT. Since then,


- Some Sample tasks: this is a summarization of 150 tokens of a website talking about the advances and trends in GenAI






In [17]:
# Tokenize and generate summary
input_ids = tokenizer.encode(test_text, return_tensors="pt", max_length=1024, truncation=True)
summary_ids = model.generate(input_ids, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Original Text:")
print(input_text)
print("\nGenerated Summary:")
print(summary)

Original Text:
Your large document goes here...

Generated Summary:
10 Prevalent Trends in Generative AI                As we face 2023, here are a few key AI trends we can witness this year. They can be adapted for multiple sectors globally. A startup called Alethea AI uses generative AI models and blockchain to create interactive AI characters that can be traded as intelligent NFTs. It can create 3D avatars with custom templates in forty different languages. It allows the user to control the tone, pitch, and style of the voice.                                 Improved Natural Language Generation                A large amount of unstructured language data has fuelled the need to develop natural language processing (NLP) technology applications. For example, chatbots cannot completely replace human customer service representatives as they are
