# Natural Language Generation 
Abstractive Text Summarization using Google Pegasus & other implementations

In [None]:
!pip install transformers sentencepiece
import transformers
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from IPython.display import clear_output
clear_output()

In [None]:
# DeepAI summarization API (extractive); disabled, kept for comparison
'''
import requests
r = requests.post(
    "https://api.deepai.org/api/summarization", # extractive
    files={
        'text': open('test_summ.txt', 'rb'),
    },
    headers={'api-key': 'quickstart-QUdJIGlzIGNvbWluZy4uLi4K'}
)
r = requests.post(
    "https://api.deepai.org/api/summarization",
    data={
        'text': 'Data science is a concept to unify statistics, data analysis, informatics, and their related methods in order to understand and analyse actual phenomena with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a fourth paradigm of science (empirical, theoretical, computational, and now data-driven) and asserted that everything about science is changing because of the impact of information technology and the data deluge. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains. Data science is related to data mining, machine learning and big data. A data scientist is someone who creates programming code and combines it with statistical knowledge to create insights from data.',
    },
    headers={'api-key': 'quickstart-QUdJIGlzIGNvbWluZy4uLi4K'}
)
print(r.json())'''
clear_output()

**Note:**

Refer to https://huggingface.co/models?pipeline_tag=summarization&sort=downloads for pre-trained models to use with `pipeline`.

Supported text summarisation models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M100ForConditionalGeneration', 'MarianMTModel', 'MBartForConditionalGeneration', 'MT5ForConditionalGeneration', 'PegasusForConditionalGeneration', 'PLBartForConditionalGeneration', 'ProphetNetForConditionalGeneration', 'T5ForConditionalGeneration', 'XLMProphetNetForConditionalGeneration'].

In [None]:
# PEGASUS (fine-tuned on XSum)
a_summarizer = pipeline("summarization", model="google/pegasus-xsum")

# BART (fine-tuned on CNN/DailyMail)
b_summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# T5 family: mT5 fine-tuned on XL-Sum, and T5 v1.1 (pre-trained on C4 only, not fine-tuned for summarization)
c_summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")
d_summarizer = pipeline("summarization", model="google/t5-v1_1-base")

# 'OpenAIGPTLMHeadModel' is a text-generation model and is not supported by the summarization pipeline, although it still produces output.
#e_summarizer = pipeline("summarization", model="openai-gpt")

#f_summarizer = pipeline("summarization", model="mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization") # not working
clear_output()

In [None]:
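# AutoTokenizer / AutoModelForSeq2SeqLM pick the matching classes from the checkpoint name
tokenizer = AutoTokenizer.from_pretrained('google/pegasus-xsum')
model = AutoModelForSeq2SeqLM.from_pretrained('google/pegasus-xsum')

def auto_pegasus(text_example):
    """Summarize text_example with google/pegasus-xsum and print the result."""
    # Note: the "summarize: " prefix is a T5 convention; Pegasus was not trained with task prefixes.
    # Input beyond 512 tokens is silently truncated.
    tokens_input = tokenizer.encode("summarize: " + text_example, return_tensors='pt', max_length=512, truncation=True)
    # min_length=80 forces at least 80 tokens from a model fine-tuned on one-sentence XSum summaries,
    # which may explain padded or repetitive output.
    ids = model.generate(tokens_input, min_length=80, max_length=120)
    summary = tokenizer.decode(ids[0], skip_special_tokens=True)
    print(summary)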
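
**Note:** `auto_pegasus` truncates its input at 512 tokens, so anything past that point never reaches the model. One possible workaround is to split a long document into chunks of whole sentences and summarize each chunk separately. The sketch below is only an illustration: `chunk_text` and `max_words` are hypothetical names, the sentence splitter is a naive regular expression, and word count is used as a rough stand-in for token count.

```python
import re

def chunk_text(text, max_words=300):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept as its own chunk.
    Word count is only an approximation of the tokenizer's token count.
    """
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be passed to one of the summarizers in turn; how the per-chunk summaries are best combined is left open here.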

In [None]:
# Direct Pegasus usage without pipeline() (disabled; kept for reference)
'''from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

def regular_pegasus1(src_text):
    batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt").to(device)
    translated = model.generate(**batch)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    print(tgt_text)

# Older variant: prepare_seq2seq_batch is deprecated in recent transformers releases
def regular_pegasus2(src_text):
    batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest', return_tensors='pt').to(device)
    translated = model.generate(**batch)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    print(tgt_text)'''
clear_output()

## Example #1

In [None]:
text_example = '''
The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey 
building, and the tallest structure in Paris. Its base is square, measuring 
125 meters (410 ft) on each side. During its construction, the Eiffel Tower 
surpassed the Washington Monument to become the tallest man-made structure in 
the world, a title it held for 41 years until the Chrysler Building in New York 
City was finished in 1930. It was the first structure to reach a height of 300 
meters. Due to the addition of a broadcasting aerial at the top of the tower in 
1957, it is now taller than the Chrysler Building by 5.2 meters (17 ft). 
Excluding transmitters, the Eiffel Tower is the second tallest free-standing 
structure in France after the Millau Viaduct.'''
# No reference summaries are available, so the outputs must be judged manually.
# Outputs are printed in order: Pegasus (XSum), BART (CNN/DailyMail), mT5 (XL-Sum), T5 v1.1
summarizers = [a_summarizer, b_summarizer, c_summarizer, d_summarizer]
print("\n\n".join(s(text_example)[0]['summary_text'] for s in summarizers))

The Eiffel Tower is a free-standing structure in Paris, France.

The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building. Its base is square,                 measuring 125 meters (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world.

The Eiffel Tower has become the tallest free-standing building in the world.

. It is the tallest structure in Paris. It is the tallest structure in Paris


In [None]:
auto_pegasus(text_example) 

The Eiffel Tower, also known as the Arc de Triomphe, was built in 1889 in Paris, France, by Gustave Eiffel, the architect of the Champs-Elysees, the Champs-lysées, the Arc de Triomphe and the Arc de Triomphe in Paris, as well as the Arc de Triomphe and the Arc de Triomphe in London, England, and the Arc de Triomphe in Paris, France.
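
**Note:** The output above repeats "the Arc de Triomphe" many times, a typical sign of degenerate generation. A simple way to flag this automatically is to measure how many word n-grams in a summary are repeats. The helper below is a minimal sketch; `repeated_ngram_fraction` is a hypothetical name and the choice of trigrams is an assumption, not part of any library.

```python
from collections import Counter

def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that repeat an earlier n-gram in the text.

    0.0 means every n-gram is unique; values near 1.0 indicate heavy repetition.
    """
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)
```

A high score suggests the generation settings are worth revisiting. For example, `model.generate` accepts a `no_repeat_ngram_size` argument that could be tried; whether it improves these particular summaries has not been checked here.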


## Example #2

In [None]:
text_example = '''Data science is a concept to unify statistics, data analysis, informatics, and their related methods in order to understand and analyse actual phenomena with data. 
It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. 
However, data science is different from computer science and information science. 
Turing Award winner Jim Gray imagined data science as a fourth paradigm of science (empirical, theoretical, computational, and now data-driven) and asserted that everything about science is changing because of the impact of information technology and the data deluge. 
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains. 
Data science is related to data mining, machine learning and big data. A data scientist is someone who creates programming code and combines it with statistical knowledge to create insights from data.'''
# No reference summaries are available, so the outputs must be judged manually.
# Outputs are printed in order: Pegasus (XSum), BART (CNN/DailyMail), mT5 (XL-Sum), T5 v1.1
summarizers = [a_summarizer, b_summarizer, c_summarizer, d_summarizer]
print("\n\n".join(s(text_example)[0]['summary_text'] for s in summarizers))

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains.

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data. It is related to data mining, machine learning and big data. A data scientist is someone who creates programming code and combines it with statistical knowledge to create insights from data.

Data science is one of the most important fields in science.

. Data science is a interdisciplinary field of science and engineering. It is related


In [None]:
auto_pegasus(text_example) 

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains, such as finance, healthcare, energy, manufacturing, education, and the media and entertainment industries. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains.
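
**Note:** Without reference summaries, metrics such as ROUGE are not available, but some reference-free statistics can still support the manual comparison. Several outputs above copy source sentences almost verbatim, which is extractive rather than abstractive behaviour. The sketch below computes a compression ratio and the share of summary n-grams that do not occur in the source. Both function names and the bigram default are illustrative assumptions, and neither number measures factual correctness.

```python
def ngram_set(text, n):
    """Set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def compression_ratio(source, summary):
    """Summary length divided by source length, both in words."""
    src_len = len(source.split())
    return len(summary.split()) / src_len if src_len else 0.0

def novel_ngram_fraction(source, summary, n=2):
    """Fraction of the summary's distinct n-grams that never appear in the source.

    0.0 means the summary is built entirely from source n-grams (extractive);
    higher values mean more new wording (abstractive, or possibly hallucinated).
    """
    summary_ngrams = ngram_set(summary, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngram_set(source, n)
    return len(novel) / len(summary_ngrams)
```

Applied to the examples in this notebook, a score of 0.0 would confirm a copied sentence, while a high score on a short summary is a prompt to check the claims against the source by hand.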
