# ***Abstractive Text Summarization using Pegasus***

### Pre-training with Extracted Gap-sentences for Abstractive Summarization.

Pegasus is designed specifically for abstractive text summarization, where the goal is to generate a concise, coherent summary that captures the main points of the input text, possibly using new sentences that do not appear in the original.
_____________________________________________________________________________

### **Install PyTorch**

In [1]:
# https://pytorch.org/get-started/locally/

### **Install Transformers from Hugging Face**

We'll be using the Pegasus-xsum model (`google/pegasus-xsum`) from Hugging Face.

In [3]:
# !pip install transformers

### **Importing and Loading the Model**

We'll be bringing in two imports:

- *PegasusForConditionalGeneration*: This will allow us to use the deep learning model.

- *PegasusTokenizer*: This class converts sentences into a sequence of tokens, which we then pass to our model `(NOTE: ensure you have the 'sentencepiece' library installed - you can do so through 'pip install sentencepiece' if using pip)`

In [3]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer



In [4]:
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')

model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### **Perform Some Abstractive Summarization**

In [6]:
#Let's introduce some text

text = """
Bidirectional Encoder Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state of the art models. It was introduced in October 2018 by researchers at Google.[1][2] A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments counting over 150 research publications analyzing and improving the model."[3]

BERT was originally implemented in the English language at two model sizes:[1] (1) BERTBASE: 12 encoders with 12 bidirectional self-attention heads totaling 110 million parameters, and (2) BERTLARGE: 24 encoders with 16 bidirectional self-attention heads totaling 340 million parameters. Both models were pre-trained on the Toronto BookCorpus[4] (800M words) and English Wikipedia (2,500M words).

BERT is an "encoder-only" transformer architecture.

On a high level, BERT consists of three modules:

embedding. This module converts an array of one-hot encoded tokens into an array of vectors representing the tokens.
a stack of encoders. These encoders are the Transformer encoders. They perform transformations over the array of representation vectors.
un-embedding. This module converts the final representation vectors into one-hot encoded tokens again.
The un-embedding module is necessary for pretraining, but it is often unnecessary for downstream tasks. Instead, one would take the representation vectors output at the end of the stack of encoders, and use those as a vector representation of the text input, and train a smaller model on top of that.

BERT uses WordPiece to convert each English word into an integer code. Its vocabulary has size 30,000. Any token not appearing in its vocabulary is replaced by [UNK] for "unknown".
"""

In [14]:
# Next we'll need to convert this text into its tokenized (numeric) representation.
# truncation=True shortens the text to the maximum input length the model accepts,
# padding='longest' pads to the longest sequence in the batch (it changes nothing for a single text),
# and return_tensors='pt' returns PyTorch tensors.

tokens = tokenizer(text, truncation=True, padding='longest', return_tensors='pt')

print(tokens)

print(type(tokens))

{'input_ids': tensor([[ 7671, 37390, 93789, 37955,   116,   135, 38979,   143, 62613,   158,
           117,   114,  1261,   861,   451,   124,   109, 22470,  3105,   108,
          7913,   118,   203,  5110,  2757,   204,  1331,   449,   113,   109,
           691,  1581,   107,   168,   140,  2454,   115,  1350,   931,   141,
          2995,   134,  1058,   107, 65077, 32887, 50558,   202,  7149,  4413,
          2629,  7111,   120,   198,   386,   114,   332,   204,   114,   232,
           108,   110, 62613,   148,   460,   114, 20410, 13757,   115,  4284,
          7148, 11430,   143, 72237,   158,  8026,  8742,   204,  3968,   473,
          6185, 10850,   111,  3024,   109,   861,   107, 54151, 59740,   110,
         62613,   140,  3273,  4440,   115,   109,  1188,  1261,   134,   228,
           861,  2568,   151, 65077,  1100,  6806,   110, 62613, 51534,   151,
           665, 40753,   116,   122,   665, 79050,   813,   121, 65167,  4082,
           916,   273,  8558,   604,  
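
To make the `truncation` and `padding` arguments concrete, here is a toy sketch in plain Python of what they do to a batch of token ids. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the pad id of 0 and the tiny `max_length` are assumptions chosen for illustration, and the real tokenizer also takes care of special tokens such as the end-of-sequence marker.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy version of truncation=True plus padding='longest' (illustrative only)."""
    # Truncation: cut every sequence down to at most max_length ids.
    truncated = [ids[:max_length] for ids in batch]
    # Padding to 'longest': pad up to the longest sequence left in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[5, 6, 7, 8, 9], [3, 4]]  # made-up token ids
print(pad_and_truncate(batch, max_length=4))
```

With a single input, as in this notebook, padding to the longest sequence changes nothing; it only matters when several texts are tokenized together.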

In [15]:
# Let's try summarizing the text now

summary = model.generate(**tokens) # '**tokens' unpacks the dict above into keyword arguments (input_ids, attention_mask)

# Summary in tokens
summary

tensor([[    0,  7671, 37390, 37955,   116,   135, 38979,   143, 62613,   158,
           117,   114,  1261,   861,   451,   124,   109, 22470,  3105,   108,
          7913,   118,   203,  5110,  2757,   204,  1331,   449,   113,   109,
           691,  1581,   107,     1]])

In [19]:
# The above tensor doesn't mean much on its own; however, we can use our tokenizer again to decode it and extract our summary.

# What we need is actually nested, so let's grab the first element of our result

tokenizer.decode(summary[0], skip_special_tokens=True) # skip_special_tokens=True drops special tokens such as '<pad>' and '</s>' that appear at the start and end


# We can roughly validate our result by copying the output below and searching for it on this page: "https://en.wikipedia.org/wiki/BERT_(language_model)", where the text above was taken from.
# An exact search returns 0 results, but only because the model dropped the word "Encoder": the output is otherwise the first sentence of the input almost word for word, so this summary is closer to extraction than to abstraction.

'Bidirectional Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state of the art models.'
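
A string search only detects exact copies. A rough way to quantify how much of a summary is new is the share of its word n-grams that never occur in the source. The sketch below is a minimal illustration in plain Python; `ngrams` and `novel_ngram_ratio` are hypothetical helper names, and the choice of trigrams (`n=3`) is an assumption, not something the notebook prescribes.

```python
import re


def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def novel_ngram_ratio(source, summary, n=3):
    """Fraction of the summary's n-grams that do not occur in the source."""
    summary_grams = ngrams(summary, n)
    if not summary_grams:
        return 0.0
    return len(summary_grams - ngrams(source, n)) / len(summary_grams)


# First sentence of the input text and the summary produced above.
source = (
    "Bidirectional Encoder Representations from Transformers (BERT) is a language "
    "model based on the transformer architecture, notable for its dramatic "
    "improvement over previous state of the art models."
)
summary_text = (
    "Bidirectional Representations from Transformers (BERT) is a language model "
    "based on the transformer architecture, notable for its dramatic improvement "
    "over previous state of the art models."
)

print(summary_text in source)                   # exact-match check
print(novel_ngram_ratio(source, summary_text))  # share of new trigrams
```

A ratio close to 0 means the summary is mostly copied, while a ratio close to 1 means it is mostly new wording.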

### **Let's do another piece of text**

Source: https://en.wikipedia.org/wiki/Machine_learning

In [21]:
text_2 = """
The term machine learning was coined in 1959 by Arthur Samuel, an IBM employee and pioneer in the field of computer gaming and artificial intelligence.[9][10] The synonym self-teaching computers was also used in this time period.[11][12]

Although the earliest machine learning model was introduced in the 1950s when Arthur Samuel invented a program that calculated the winning chance in checkers for each side, the history of machine learning roots back to decades of human desire and effort to study human cognitive processes.[13] In 1949, Canadian psychologist Donald Hebb published the book The Organization of Behavior, in which he introduced a theoretical neural structure formed by certain interactions among nerve cells.[14] Hebb's model of neurons interacting with one another set a groundwork for how AIs and machine learning algorithms work under nodes, or artificial neurons used by computers to communicate data.[13] Other researchers who have studied human cognitive systems contributed to the modern machine learning technologies as well, including logician Walter Pitts and Warren McCulloch, who proposed the early mathematical models of neural networks to come up with algorithms that mirror human thought processes.[13]

By the early 1960s an experimental "learning machine" with punched tape memory, called Cybertron, had been developed by Raytheon Company to analyze sonar signals, electrocardiograms, and speech patterns using rudimentary reinforcement learning. It was repetitively "trained" by a human operator/teacher to recognize patterns and equipped with a "goof" button to cause it to re-evaluate incorrect decisions.[15] A representative book on research into machine learning during the 1960s was Nilsson's book on Learning Machines, dealing mostly with machine learning for pattern classification.[16] Interest related to pattern recognition continued into the 1970s, as described by Duda and Hart in 1973.[17] In 1981 a report was given on using teaching strategies so that an artificial neural network learns to recognize 40 characters (26 letters, 10 digits, and 4 special symbols) from a computer terminal.[18]

Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."[19] This definition of the tasks in which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms. This follows Alan Turing's proposal in his paper "Computing Machinery and Intelligence", in which the question "Can machines think?" is replaced with the question "Can machines do what we (as thinking entities) can do?".[20]

Modern-day machine learning has two objectives. One is to classify data based on models which have been developed; the other purpose is to make predictions for future outcomes based on these models. A hypothetical algorithm specific to classifying data may use computer vision of moles coupled with supervised learning in order to train it to classify the cancerous moles. A machine learning algorithm for stock trading may inform the trader of future potential predictions
"""

In [23]:
tokens_2 = tokenizer(text_2, truncation=True, padding='longest', return_tensors='pt')

tokens_2

{'input_ids': tensor([[  139,  1286,  1157,   761,   140, 27775,   115, 22570,   141,  9429,
         10070,   108,   142,  7313,  2307,   111, 12649,   115,   109,   764,
           113,   958,  3982,   111,  4958,  3941,   107,  4101,  2507, 32887,
          2449,  1100,   139, 47248,   813,   121, 46490,  4328,   140,   163,
           263,   115,   136,   166,   908,   107,  4101,  4363, 32887,  3602,
          1100,  2113,   109,  9441,  1157,   761,   861,   140,  2454,   115,
           109,  7765,   116,   173,  9429, 10070, 11553,   114,   431,   120,
          7123,   109,  2269,  1012,   115, 67338,   118,   276,   477,   108,
           109,   689,   113,  1157,   761,  4663,   247,   112,  2701,   113,
           883,  2524,   111,  1441,   112,   692,   883,  7842,  1994,   107,
         65077, 59740,   222, 20322,   108,  3066, 17518,  5502,   285, 12750,
          1299,   109,   410,   139,  7235,   113, 20786,   108,   115,   162,
           178,  2454,   114,  9637, 1

In [24]:
summary_2 = model.generate(**tokens_2)

summary_2

tensor([[   0, 3838,  761,  117,  109,  692,  113,  199, 4328,  543,  135,  306,
          107,    1]])

In [25]:
tokenizer.decode(summary_2[0], skip_special_tokens=True)

'Machine learning is the study of how computers learn from experience.'
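
The tokenize, generate, and decode steps are the same for every text, so they can be wrapped in one function. This is a minimal sketch: `summarize` is a hypothetical helper name, and it assumes `tokenizer` and `model` behave like the `PegasusTokenizer` and `PegasusForConditionalGeneration` objects loaded earlier.

```python
def summarize(text, tokenizer, model, **generate_kwargs):
    """Tokenize text, generate a summary, and decode it back to a string.

    tokenizer and model are assumed to behave like the Pegasus objects
    loaded earlier in this notebook; extra keyword arguments are passed
    straight through to model.generate.
    """
    tokens = tokenizer(text, truncation=True, padding='longest', return_tensors='pt')
    summary_ids = model.generate(**tokens, **generate_kwargs)
    # generate returns a batch, so take the first (and only) sequence.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

For example, `summarize(text_2, tokenizer, model)` performs the same three steps as the cells above, and generation options such as `num_beams=4` can be passed as keyword arguments.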

### **Let's do a recent news article on Nvidia**

Source: https://finance.yahoo.com/news/nvidia-stock-buy-143000315.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAAB7TYDBbY4owo_Iy0uwhI8_mBjSfZwwiQQguLUxyJ1dU6ZNhNGsLWLJpUqv4oQFLpsbVdNcXOCZRHK-OepxiB5c_yr_D6f6fUWNkaMk1r_rQHbYTU1edHQm9584mDmywrMVxI0Ff_z528hifW-1jESyDeRfOXxqzSKG8m1k_gy5S

In [26]:
text_3 = """
Nvidia (NASDAQ: NVDA) won over Wall Street last year, illustrated by its more than 280% stock growth since March 2023. The company's years of dominance in graphics processing units (GPUs) perfectly positioned it to profit significantly from a boom in artificial intelligence (AI) as demand for the chips skyrocketed. As a result, Nvidia's quarterly revenue and free cash flow are up 207% and 430%, respectively, in the last 12 months.

The company's meteoric rise has some analysts questioning whether the company has much more to offer investors in 2024. However, trends in the chip market indicate Nvidia will have little problem retaining its leading market share in AI GPUs, despite new offerings from Advanced Micro Devices and Intel.

Meanwhile, the AI market is nowhere near hitting its ceiling. It's projected to expand at a compound annual growth rate of 37% until at least 2030. The sector's potential indicates GPU demand is likely to continue rising for the foreseeable future, with Nvidia well-equipped to continue enjoying significant gains from AI.

Here's why Nvidia remains an attractive buy in March.

Nvidia will likely retain its AI dominance despite rising competition
Nvidia's success in the AI chip market has led to countless tech companies announcing ventures into the industry. Leading chipmakers AMD and Intel plan to begin shipping new GPUs soon in an attempt to challenge Nvidia's market share. Meanwhile, companies new to the sector are also joining in, as Amazon and Microsoft announced new AI chips last year.

However, market trends suggest Nvidia's supremacy will be challenging for competitors to overcome. The company has held an over 80% market share in desktop GPUs for years, despite AMD's and Intel's presence in the sector.

Intel only entered the industry last year, while AMD's history in desktop GPUs spans decades. Still, AMD's GPUs only account for about 10% of the market.

A similar situation has occurred in another area of the chip market. Intel was a king in central processing units (CPUs) for years, with an 82% market share at the start of 2017 when AMD landed on the scene with its Ryzen line of CPUs. AMD has managed to steal a significant share from Intel since then. However, Intel is still responsible for most of the CPU market; its share is above 60% and AMD's is at 36%.

Nvidia's estimated 80% to 95% market share in AI GPUs could falter slightly as competition heats up. However, history indicates the company will retain its overall lead and continue to see major gains from AI for years.

Projections show Nvidia's stock should continue beating the S&P 500
Nvidia has stunned Wall Street over the last year, posting multiple quarters of record earnings. In the fourth quarter of 2024 (ended in January), the company's revenue increased by 265% year over year to $22 billion. Meanwhile, operating income jumped 983% to nearly $14 billion. The monster growth was primarily from a 409% increase in data center revenue, reflecting increased chip sales.

While a spike in AI GPU sales is mainly responsible for Nvidia's stellar financial growth, the chipmaker is also profiting from an improving PC market. Spikes in inflation prompted steep declines in PC sales, with shipments dipping 16% in 2022 and continuing to fall for most of 2023. However, recent reports indicate the market is finally showing signs of recovery.

According to Gartner, PC shipments popped 0.3% in Q4 2023, marking the first such increase in over a year. Market improvements have been reflected in Nvidia's sales, with its PC-centered gaming segment reporting an 81% rise in revenue in Q3 2024 (which ended October 2023).

A leading role in AI and a recovering PC market suggests Nvidia has a strong outlook in the coming years. Earnings-per-share (EPS) estimates seem to support this.

NVDA EPS Estimates for 2 Fiscal Years Ahead Chart
NVDA EPS Estimates for 2 Fiscal Years Ahead Chart
The above chart shows Nvidia's earnings could hit $34 per share by fiscal 2026. Multiplying that figure by its forward price-to-earnings ratio of 38 yields a stock price of $1,292.

Considering the company's current position, that projection would see Nvidia's stock rise 40% over the next two years. The company may not replicate last year's growth but would still beat the S&P 500's 22% growth since 2022.

As a result, Nvidia still has much to offer new investors and is an exciting buy right now.

Should you invest $1,000 in Nvidia right now?

Before you buy stock in Nvidia, consider this:

The Motley Fool Stock Advisor analyst team just identified what they believe are the 10 best stocks for investors to buy now… and Nvidia wasn’t one of them. The 10 stocks that made the cut could produce monster returns in the coming years.

Stock Advisor provides investors with an easy-to-follow blueprint for success, including guidance on building a portfolio, regular updates from analysts, and two new stock picks each month. The Stock Advisor service has more than tripled the return of S&P 500 since 2002*.

See the 10 stocks

*Stock Advisor returns as of March 11, 2024

John Mackey, former CEO of Whole Foods Market, an Amazon subsidiary, is a member of The Motley Fool’s board of directors. Dani Cook has no position in any of the stocks mentioned. The Motley Fool has positions in and recommends Advanced Micro Devices, Amazon, Microsoft, and Nvidia. The Motley Fool recommends Gartner and Intel and recommends the following options: long January 2023 $57.50 calls on Intel, long January 2025 $45 calls on Intel, long January 2026 $395 calls on Microsoft, short January 2026 $405 calls on Microsoft, and short May 2024 $47 calls on Intel. The Motley Fool has a disclosure policy.

Is Nvidia Stock a Buy? was originally published by The Motley Fool
"""

In [27]:
tokens_3 = tokenizer(text_3, truncation=True, padding='longest', return_tensors='pt')

tokens_3

{'input_ids': tensor([[30859,   143, 18482,   151, 15047, 15690,   158,   576,   204,  2948,
          1411,   289,   232,   108,  9789,   141,   203,   154,   197,   280,
         34939,  1279,   874,   381,  1051, 30249,   107,   139,   301,   131,
           116,   231,   113, 19224,   115,  3647,  2196,  2022,   143, 58454,
           116,   158,  2475,  8523,   126,   112,  3508,  2838,   135,   114,
          9862,   115,  4958,  3941,   143, 13901,   158,   130,  1806,   118,
           109,  5162, 52082,   107,   398,   114,   711,   108, 30859,   131,
           116, 10337,  2563,   111,   294,  1325,  1971,   127,   164,   599,
         22436,   111,   384, 41074,   108,  4802,   108,   115,   109,   289,
           665,   590,   107,   139,   301,   131,   116, 77039,  2423,   148,
           181,  8067, 12817,   682,   109,   301,   148,   249,   154,   112,
           369,  2714,   115, 34074,   107,   611,   108,  2994,   115,   109,
          6263,   407,  4298, 30859,  
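
Because `truncation=True` silently drops everything past the model's input limit, a long article like this one may only be partially seen by the model. Here is a minimal sketch for checking this, assuming the tokenizer exposes its limit as `model_max_length` (as Hugging Face tokenizers generally do); `truncation_report` is a hypothetical helper name.

```python
def truncation_report(text, tokenizer):
    """Compare the full token count of text with the tokenizer's length limit."""
    # Tokenize without truncation to count every token in the text.
    full_length = len(tokenizer(text, truncation=False)["input_ids"])
    limit = tokenizer.model_max_length
    return {
        "full_length": full_length,
        "limit": limit,
        # Number of tokens the model never sees when truncation=True.
        "tokens_dropped": max(0, full_length - limit),
    }
```

Calling `truncation_report(text_3, tokenizer)` with the tokenizer loaded earlier shows whether the end of the article was cut off; a `tokens_dropped` value above 0 means the summary is based only on the beginning of the text.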

In [28]:
summary_3 = model.generate(**tokens_3)

summary_3

tensor([[    0,   240,   119,   131,   261,   174,   124,   109,  6662,   118,
           114,  3278,  1279,   112,   631,   115,  1051,   108,   119,   382,
           245,   112,  1037, 30859,   107, 30859,   138,   770,  5515,   203,
          5344, 19224,  2409,  4220,  1702, 30859,   131,   116,   924,   115,
           109,  5344,  6263,   407,   148,  1358,   112,  6150,  3278,   524,
         13501, 15488,   190,   109,   503,   107,     1]])

In [30]:
tokenizer.decode(summary_3[0], skip_special_tokens=True)

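# The first sentence below cannot be found in the news article, so the model did generate new text. The rest, however, is copied from the article: a subheading followed by the sentence that comes after it.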

"If you've been on the hunt for a tech stock to buy in March, you might want to consider Nvidia. Nvidia will likely retain its AI dominance despite rising competition Nvidia's success in the AI chip market has led to countless tech companies announcing ventures into the industry."