# Install Dependencies

In [1]:
!pip install torch==1.10.1+cu102 torchvision==0.11.2+cu102 torchaudio===0.10.1+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html

Looking in links: https://download.pytorch.org/whl/cu102/torch_stable.html


In [2]:
# PegasusTokenizer depends on the sentencepiece package; Pegasus would not run without it
!pip install sentencepiece



In [3]:
!pip install transformers



## What is PyTorch?
PyTorch is an open source machine learning framework that accelerates the path from research prototyping to production deployment.

# Import and Load Model

In [4]:
# Importing dependencies from transformers
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

## What is Pegasus?
After some digging around, I found that instead of using NLTK and other methods of sentence summarization, I could use PyTorch and Pegasus. Pegasus is a model that was pre-trained on gap sentences: whole sentences are removed from a body of text (corpus), and the model learns to generate them from the rest of the document.
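To make the gap-sentence idea concrete, here is a minimal sketch in plain Python. It is not the Pegasus pre-training code: Pegasus scores how important each sentence is, whereas this toy swaps in a "longest sentence" heuristic, and the helper names `split_sentences` and `mask_gap_sentence` are hypothetical.

```python
import re

MASK = "<mask_1>"  # placeholder that stands in for the removed sentence

def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def mask_gap_sentence(text):
    sentences = split_sentences(text)
    # Assumption: treat the longest sentence as the "important" one
    gap_index = max(range(len(sentences)), key=lambda i: len(sentences[i]))
    target = sentences[gap_index]
    masked = sentences[:gap_index] + [MASK] + sentences[gap_index + 1:]
    return " ".join(masked), target

doc = "Frodo left the Shire. He carried the One Ring all the way to Mordor. Sam went with him."
model_input, target = mask_gap_sentence(doc)
print(model_input)  # what the model would read
print(target)       # what the model would be trained to write
```

The model sees the text with a hole in it and has to produce the missing sentence, which is close to what a summarizer does: write a sentence that fits the whole document.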

In [5]:
# Creating Tokenizer
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")

In [6]:
type(tokenizer)

transformers.models.pegasus.tokenization_pegasus.PegasusTokenizer

A tokenizer takes paragraphs or sentences and turns them into tokens: each word or piece of a word is mapped to a number, and the model works on those numbers one by one.
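As a rough illustration of text becoming numbers, here is a toy whitespace tokenizer. It is only a sketch: the real `PegasusTokenizer` uses a SentencePiece subword vocabulary, and the `build_vocab` and `encode` helpers below are hypothetical.

```python
PAD, EOS = "<pad>", "</s>"

def build_vocab(text):
    # Reserve ID 0 for padding and ID 1 for end-of-sequence
    vocab = {PAD: 0, EOS: 1}
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    # One integer per word, followed by the end-of-sequence ID
    return [vocab[word] for word in text.lower().split()] + [vocab[EOS]]

sentence = "the ring rules the other rings"
vocab = build_vocab(sentence)
ids = encode(sentence, vocab)
print(ids)
```

Note that the repeated word "the" gets the same ID both times, which is exactly what lets a model recognise it as the same word.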

In [7]:
# Loading in pre-trained model 
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

# Pegasus model
As noted above, Pegasus is a pre-trained model, so to get it onto our system we download it. It is a huge model: the download was 2.12 GB.
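A back-of-the-envelope calculation gives a feel for what that size means. This is only an estimate under two assumptions that the notebook does not state: that the 2.12 figure is in binary gigabytes (GiB), and that every weight is stored as a 32-bit float.

```python
size_gib = 2.12        # download size reported above
bytes_per_param = 4    # assumption: float32 weights, 4 bytes each

# Convert GiB to bytes, then divide by the bytes used per weight
approx_params = size_gib * 2**30 / bytes_per_param
print(f"about {approx_params / 1e6:.0f} million parameters")
```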

# Performing Abstractive Summarization

In [8]:
text = """
The Lord of the Rings is an epic[1] high-fantasy novel[a] by English author and scholar J. R. R. Tolkien. Set in Middle-earth, intended to be Earth at some distant time in the past, the story began as a sequel to Tolkien's 1937 children's book The Hobbit, but eventually developed into a much larger work. Written in stages between 1937 and 1949, The Lord of the Rings is one of the best-selling books ever written, with over 150 million copies sold.[2]

The title refers to the story's main antagonist, the Dark Lord Sauron, who in an earlier age created the One Ring to rule the other Rings of Power given to Men, Dwarves, and Elves, in his campaign to conquer all of Middle-earth. From homely beginnings in the Shire, a hobbit land reminiscent of the English countryside, the story ranges across Middle-earth, following the quest to destroy the One Ring mainly through the eyes of the hobbits Frodo, Sam, Merry and Pippin.

Although often called a trilogy, the work was intended by Tolkien to be one volume of a two-volume set along with The Silmarillion.[3][T 2] For economic reasons, The Lord of the Rings was published over the course of a year from 29 July 1954 to 20 October 1955, in three volumes[3][4] titled The Fellowship of the Ring, The Two Towers, and The Return of the King. The work is divided internally into six books, two per volume, with several appendices of background material. Some later editions print the entire work in a single volume, following the author's original intent.

Tolkien's work, after an initially mixed reception by the literary establishment, has been the subject of extensive analysis of its themes and origins. Influences on this earlier work, and on the story of The Lord of the Rings, include philology, mythology, Christianity, earlier fantasy works, and his own experiences in the First World War.

The Lord of the Rings has since been reprinted many times and translated into at least 38 languages.[b] Its enduring popularity has led to numerous references in popular culture, the founding of many societies by fans of Tolkien's works,[5] and the publication of many books about Tolkien and his works. It has inspired numerous derivative works, including paintings, music, films, television, video games, and board games, helping create and shape the modern fantasy genre, within which it is considered one of the greatest books of all time.

Award-winning adaptations of The Lord of the Rings have been made for radio, theatre, and film. It has been named Britain's best novel of all time in the BBC's The Big Read."""

In [9]:
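# Converting the text into its token representation (numbers)
# truncation=True cuts the input to the model's maximum length; return_tensors="pt" returns PyTorch tensors
tokens = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")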

In [10]:
# Displaying our tokens
tokens

{'input_ids': tensor([[  139,  2346,   113,   109, 17557,   117,   142,  7277, 65077,  1100,
           281,   121, 72074,  2794,  4101,   304,  1100,   141,  1188,  1782,
           111, 15461,   907,   107,   840,   107,   840,   107, 40900,   107,
          3089,   115,  3396,   121, 21019,   108,  2685,   112,   129,  2774,
           134,   181,  9234,   166,   115,   109,   555,   108,   109,   584,
          1219,   130,   114, 12677,   112, 40900,   131,   116, 21120,   404,
           131,   116,   410,   139, 38844,   108,   155,  2435,  1184,   190,
           114,   249,  1599,   201,   107, 16550,   115,  4208,   317, 21120,
           111, 20322,   108,   139,  2346,   113,   109, 17557,   117,   156,
           113,   109,   229,   121, 10346,  1031,   521,  1158,   108,   122,
           204,  3968,   604,  4862,  1575,   107,  4101, 50558,   139,  1560,
          6335,   112,   109,   584,   131,   116,   674, 45629,   108,   109,
          5715,  2346, 29609,  7465,  

### What we are doing is unpacking our input IDs and our attention mask
The attention mask marks which positions hold real tokens (1) and which are padding (0), so the model knows where to direct its attention while it generates the summary.
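The sketch below shows, in plain Python, how padding produces an attention mask and how `**` hands both pieces to a function as keyword arguments. `pad_batch` and `fake_generate` are hypothetical stand-ins, not part of `transformers`, and the token IDs are just sample numbers.

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence to the length of the longest one
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

def fake_generate(input_ids, attention_mask):
    # Hypothetical stand-in for a model call: counts the real tokens in each row
    return [sum(mask) for mask in attention_mask]

batch = pad_batch([[139, 2346, 113], [139, 1]])
print(batch["input_ids"])
print(batch["attention_mask"])
# **batch is the same as fake_generate(input_ids=..., attention_mask=...)
print(fake_generate(**batch))
```

With a single input text, as in this notebook, there is nothing to pad, so the mask is all ones.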

In [11]:
# Summarizing the text we input
# The double asterisk (**) unpacks tokens into keyword arguments (input_ids, attention_mask)
summary = model.generate(**tokens)

In [13]:
# Summary in tokens
summary

tensor([[    0,   139,  2346,   113,   109, 17557,   117,   156,   113,   109,
           229,   121, 10346,  1031,   521,  1158,   108,   122,   204,  3968,
           604,  4862,  1575,   107,     1]])

We did it! But we don't know what these numbers mean until we decode them back into words.

In [14]:
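# Decode summary
# generate() returns one row per input, so the summary is nested and we grab the first row
# skip_special_tokens=True leaves markers such as padding and end-of-sequence out of the text
tokenizer.decode(summary[0], skip_special_tokens=True)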

'The Lord of the Rings is one of the best-selling books ever written, with over 150 million copies sold.'
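Decoding is the tokenizer lookup run in reverse. The toy below uses a made-up five-entry vocabulary (an assumption for illustration, not Pegasus's real one) and borrows the `skip_special_tokens` parameter name from the real `tokenizer.decode` to show why the leading 0 and trailing 1 in the summary tensor need not appear in the text.

```python
# Made-up vocabulary for illustration only
id_to_token = {0: "<pad>", 1: "</s>", 2: "the", 3: "ring", 4: "rules"}
SPECIAL = {"<pad>", "</s>"}

def decode(ids, skip_special_tokens=True):
    tokens = [id_to_token[i] for i in ids]
    if skip_special_tokens:
        # Drop markers that carry no readable text
        tokens = [t for t in tokens if t not in SPECIAL]
    return " ".join(tokens)

summary_ids = [[0, 2, 3, 4, 1]]  # nested: one row per input text
print(decode(summary_ids[0]))
print(decode(summary_ids[0], skip_special_tokens=False))
```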