In [1]:
"""The Transformers library, developed by Hugging Face, 
is a popular open-source library for natural language processing (NLP) tasks, 
including pre-training and fine-tuning state-of-the-art NLP models such as BERT, GPT, RoBERTa, etc."""


!pip install transformers[sentencepiece]





In [2]:
# Read the input text; the context manager closes the file automatically
with open('file', 'r') as file:
  content = file.read().strip()

In [3]:
content

'number of words in The rapid advancement of technology in recent decades has had a profound impact on nearly every aspect of human life. From communication and transportation to healthcare, entertainment, education, and business, the influence of technology is ubiquitous and transformative. This remarkable progress has brought about numerous benefits, but it has also raised concerns and challenges that society must navigate.\n\nOne area that has been profoundly influenced by technological advancements is communication. The advent of the internet, social media platforms, and mobile devices has revolutionized the way people connect and interact globally. Communication has become instantaneous, allowing individuals from different corners of the world to connect effortlessly and exchange information in real-time. Social media platforms have enabled people to share their thoughts, opinions, and experiences on a global scale, fostering connections and creating virtual communities. Additiona

In [4]:
"""We use the Transformers library to load a pre-trained sequence-to-sequence model and its tokenizer. 
Specifically, we use the "sshleifer/distilbart-cnn-12-6" model checkpoint."""

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "sshleifer/distilbart-cnn-12-6"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [5]:
"""The attribute tokenizer.model_max_length gives the maximum input length, in tokens, 
that the associated model accepts."""


tokenizer.model_max_length 

1024

In [6]:
#This attribute gives the maximum number of tokens a single input sequence can contain once the special tokens added by the tokenizer are accounted for.
#Here it is 1022, i.e. model_max_length (1024) minus the special tokens.
tokenizer.max_len_single_sentence 

1022

In [7]:
#The punkt package provides a pre-trained model for dividing text into sentences.
#The sent_tokenize function applies that model to the input text and splits it into individual sentences (English by default).

import nltk
nltk.download('punkt')
sentences = nltk.tokenize.sent_tokenize(content)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Roopal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
#Number of tokens in the longest sentence

max([len(tokenizer.tokenize(sentence)) for sentence in sentences])


38

In [9]:
"""Group the sentences into chunks that respect the tokenizer's length limit. 
Sentences are appended to the current chunk until adding the next one would exceed 
tokenizer.max_len_single_sentence tokens; that sentence then starts a new chunk."""

length = 0
chunk = ""
chunks = []
for sentence in sentences:
  sentence_length = len(tokenizer.tokenize(sentence)) # no. of tokens in this sentence

  if length + sentence_length <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length += sentence_length # update the length counter

  else:
    if chunk: # avoid saving an empty chunk
      chunks.append(chunk.strip()) # save the chunk

    # start a new chunk with the overflow sentence
    chunk = sentence + " "
    length = sentence_length

# save the final chunk (also covers the case where the last sentence overflowed)
if chunk:
  chunks.append(chunk.strip())
len(chunks)

2
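
The same greedy packing can be written as a reusable function. This is a minimal sketch rather than the notebook's own code: `chunk_sentences` and `count_words` are hypothetical names, and `count_words` (a whitespace word count) only stands in for `len(tokenizer.tokenize(sentence))` so that the example runs without downloading a model.

```python
from typing import Callable, List


def chunk_sentences(
    sentences: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Greedily pack consecutive sentences into chunks of at most max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    length = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and length + n > max_tokens:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += n
    # Flush whatever is left, including a final overflow sentence.
    if current:
        chunks.append(" ".join(current))
    return chunks


def count_words(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


demo_sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
demo_chunks = chunk_sentences(demo_sentences, max_tokens=5, count_tokens=count_words)
print(demo_chunks)
```

With the objects defined in this notebook, the call would presumably look like `chunk_sentences(sentences, tokenizer.max_len_single_sentence, lambda s: len(tokenizer.tokenize(s)))`. Note that a single sentence longer than `max_tokens` is still emitted as its own chunk, so it would need separate handling (for example truncation).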

In [10]:
#Number of tokens in each chunk

[len(tokenizer.tokenize(c)) for c in chunks]

[1013, 535]

In [11]:
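#Number of tokens in the full text, including special tokens; it exceeds the model limit, hence the warning
len(tokenizer(content).input_ids)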

Token indices sequence length is longer than the specified maximum sequence length for this model (1562 > 1024). Running this sequence through the model will result in indexing errors


1562
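
The numbers reported so far can be checked with simple arithmetic. The sketch below uses only the values printed in this notebook; `fits_in_model` is a hypothetical helper, and the special-token count is an assumption inferred from the gap between the two tokenizer limits (1024 - 1022).

```python
MODEL_MAX_LENGTH = 1024       # tokenizer.model_max_length, as reported in the notebook
MAX_SINGLE_SENTENCE = 1022    # tokenizer.max_len_single_sentence, as reported in the notebook
# Assumption: the gap between the two limits is the number of special tokens
# the tokenizer adds around a single input sequence.
NUM_SPECIAL_TOKENS = MODEL_MAX_LENGTH - MAX_SINGLE_SENTENCE


def fits_in_model(num_tokens: int) -> bool:
    """Return True if num_tokens plus the special tokens fits the model limit."""
    return num_tokens + NUM_SPECIAL_TOKENS <= MODEL_MAX_LENGTH


chunk_token_counts = [1013, 535]   # per-chunk counts reported in the notebook
full_text_tokens = 1562            # full-text count, already including special tokens

print([fits_in_model(n) for n in chunk_token_counts])
print(full_text_tokens <= MODEL_MAX_LENGTH)
```

Both chunks stay within the limit once special tokens are added, while the full text does not, which is why it is summarized chunk by chunk.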

In [12]:
"""Tokenize each chunk separately and collect the results in a list. 
Each result is a dictionary-like object; 
return_tensors="pt" makes the tokenizer return its values as PyTorch tensors."""

inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

In [13]:
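#Generate a summary for each chunk and print the decoded text
for chunk_inputs in inputs: # avoid shadowing the built-in input()
  output = model.generate(**chunk_inputs)
  print(tokenizer.decode(output[0], skip_special_tokens=True)) # one sequence per chunk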



 The rapid advancement of technology in recent decades has had a profound impact on nearly every aspect of human life. From communication and transportation to healthcare, entertainment, education, and business, the influence of technology is ubiquitous and transformative. The advent of the internet, social media platforms, and mobile devices has revolutionized the way people connect and interact globally.




 Self-driving cars have the potential to revolutionize transportation by reducing accidents caused by human error and optimizing traffic flow. Autonomous vehicles promising enhanced safety, efficiency, and convenience. Ride-sharing platforms like Uber and Lyft have disrupted the transportation industry. The field of education has also been greatly impacted by technology.
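
If a single summary string is wanted instead of printed output, the per-chunk summaries can be collected and joined. This is only a sketch under stated assumptions: `summarize_chunks` and `first_sentence` are hypothetical names, and `first_sentence` is a placeholder for the real model call so that the example runs on its own.

```python
from typing import Callable, Iterable, List


def summarize_chunks(chunks: Iterable[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk and join the non-empty results into one string."""
    summaries: List[str] = [summarize(chunk).strip() for chunk in chunks]
    return " ".join(s for s in summaries if s)


def first_sentence(text: str) -> str:
    """Placeholder summarizer (assumption): returns text up to the first period."""
    head, sep, _ = text.partition(".")
    return head + sep


demo = ["Alpha is first. More detail here.", "Beta is second. Even more detail."]
combined = summarize_chunks(demo, first_sentence)
print(combined)
```

In this notebook, `summarize` could be a small function wrapping the `tokenizer(...)`, `model.generate(...)`, and `tokenizer.decode(...)` calls used in the loop above.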
