<a href="https://colab.research.google.com/github/Emzee88/ISYS5002_2024_S1/blob/main/8.1_text_summariser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a summariser

This section is based on the YouTube video [AI Text Summarization with Hugging Face Transformers in 4 Lines of Python](https://youtu.be/TsfLm5iiYb4).

As Information Systems professionals, we should be aware of advanced concepts and think about how they can meet organisational needs. Using *Hugging Face Transformers*, you can leverage a pre-trained summarisation pipeline to start summarising content. In this section, we will:
1. Install Hugging Face Transformers
2. Build a summarisation pipeline
3. Run the pipeline to summarise text
4. **Investigate ways to reuse the pipeline**

> [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) provides free, state-of-the-art pre-trained machine learning models for processing text, images, audio and video. See the project website for more information.

In [10]:
# Install Hugging Face Transformers and Dependencies
!pip install transformers



In [11]:
# Import libraries
from transformers import pipeline

# The pipeline function from the transformers library
# is used to create a summarisation pipeline object.

# Load the summarisation pipeline
summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")




In [13]:
# Once the pipeline is created, it can be used to summarise text
# by passing a string of text to the summary_pipeline object.
# Let us copy-n-paste some text
article = """
Around the world, as regulators look to rein in Big Tech, like the ongoing digital platforms inquiry in Australia, online platforms will face a raft of new rules in the EU.
Known as the Digital Services Act, it’s a comprehensive set of regulations for digital services and content in the Eurozone.
Like GDPR, the Digital Services Act is expected to lead the way for other countries to provide some rules around how digital services function,
with everything from algorithms to online marketplaces, social networks, content-sharing platforms, app stores and online travel and accommodation platforms included.
The Digital Services Act sets out clear due diligence obligations for digital platforms and other online intermediaries with measures for cooperation with trusted flaggers and
competent authorities on content moderation, and measures to deter rogue traders from reaching consumers.
"""

# Run the summariser pipeline
summary = summary_pipeline(article, max_length=50, min_length=20)

# What does a summary look like?
print("summary is: ", summary)

# By inspection of output, 'summary' is a list.  The first element of the list is a dictionary.
# The key to the dictionary is 'summary_text'.

# Extract and display the summarised text
text = summary[0]['summary_text']  # get the first element, then extract the value for the key 'summary_text'
print("\nExtracted text: ", text)

summary is:  [{'summary_text': 'The Digital Services Act is a comprehensive set of regulations for digital services and content in the Eurozone. It is expected to lead the way for other countries to provide some rules around how digital services function.'}]

Extracted text:  The Digital Services Act is a comprehensive set of regulations for digital services and content in the Eurozone. It is expected to lead the way for other countries to provide some rules around how digital services function.


In [14]:

# splits the summarised text into a list of sentences using .split('.')
summary[0]['summary_text'].split('.')

['The Digital Services Act is a comprehensive set of regulations for digital services and content in the Eurozone',
 ' It is expected to lead the way for other countries to provide some rules around how digital services function',
 '']

In [17]:
text = summary[0]['summary_text'] # get first element, then extract the value for key 'summary_text'
sentences = text.split('. ') # split the text into sentences

print("\nExtracted sentences: ")
for i, sentence in enumerate(sentences):
  print(f"Sentence {i+1}: {sentence}")




Extracted sentences: 
Sentence 1: The Digital Services Act is a comprehensive set of regulations for digital services and content in the Eurozone
Sentence 2: It is expected to lead the way for other countries to provide some rules around how digital services function.
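
Splitting on `'. '` leaves the full stop on the last sentence but strips it from the others, and splitting on `'.'` leaves an empty string at the end. As a minimal sketch of one alternative, the regular expression below splits on the whitespace that follows sentence-ending punctuation, so every sentence keeps its punctuation. The helper name `split_sentences` is hypothetical, and the summary text is copied from the output above. Note the assumption that every `.`, `!` or `?` followed by whitespace ends a sentence, so abbreviations such as "e.g. " would be split incorrectly.

```python
import re


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation on each one."""
    # Split on any run of whitespace that comes directly after '.', '!' or '?'.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty strings (for example, when the input itself is empty).
    return [part for part in parts if part]


# Summary text copied from the pipeline output shown earlier.
summary_text = (
    "The Digital Services Act is a comprehensive set of regulations for "
    "digital services and content in the Eurozone. It is expected to lead "
    "the way for other countries to provide some rules around how digital "
    "services function."
)

for i, sentence in enumerate(split_sentences(summary_text), start=1):
    print(f"Sentence {i}: {sentence}")
```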


**Let's make it a function**

In [18]:
from transformers import pipeline

def summarise(article):
  summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")
  summary = summary_pipeline(article, max_length=50, min_length=20)
  text = summary[0]['summary_text']  # get the first element, then extract the value for the key 'summary_text'
  return text
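
The function above builds a new pipeline on every call, which means the model is loaded again each time. One possible way to reuse the pipeline is sketched below; it is an illustration under stated assumptions, not the approach taken in the video. The names `get_summary_pipeline` and `summarise_cached`, and the optional `summariser` parameter, are hypothetical. `functools.lru_cache` is used so that the pipeline is created on the first call only.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_summary_pipeline():
    """Build the summarisation pipeline once; later calls return the same object."""
    # Imported here so the library and model are only loaded when first needed.
    from transformers import pipeline
    return pipeline("summarization", model="facebook/bart-large-cnn")


def summarise_cached(article, summariser=None):
    """Summarise an article, reusing one pipeline across calls.

    `summariser` is a hypothetical optional argument: any callable that
    behaves like the pipeline can be passed in, which is handy for testing.
    """
    if summariser is None:
        summariser = get_summary_pipeline()
    summary = summariser(article, max_length=50, min_length=20)
    return summary[0]['summary_text']
```

Calling `summarise_cached(article)` several times should then load the model only once, because `get_summary_pipeline()` returns the cached pipeline after its first call.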


**A quick test.**

In [19]:
some_text = '''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction
'''

print(summarise(some_text))

Your max_length is set to 50, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread concerns relating to reproduction.


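Hmm... it worked, but with a warning about `max_length`. We could reduce the maximum length, or add a check that the input has at least 50 words. Our reasoning (a design decision) is that it doesn't really make sense to summarise, say, a single sentence. We could pick any minimum size, but 50 seems like a good number. Note that `max_length` is measured in tokens rather than words, so a word count is only a rough proxy for the input length the warning reports.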

But first, how do we count the words in a string? We could search the internet for some code snippets. We can use the string method `split()`.

In [20]:
help(str.split)

Help on method_descriptor:

split(self, /, sep=None, maxsplit=-1)
    Return a list of the substrings in the string, using sep as the separator string.
    
      sep
        The separator used to split the string.
    
        When set to None (the default value), will split on any whitespace
        character (including \\n \\r \\t \\f and spaces) and will discard
        empty strings from the result.
      maxsplit
        Maximum number of splits (starting from the left).
        -1 (the default value) means no limit.
    
    Note, str.split() is mainly useful for data that has been intentionally
    delimited.  With natural text that includes punctuation, consider using
    the regular expression module.



So `split()` returns a list of words.  The `len()` of the list will be the word count.  Let us try it.


In [21]:
some_text = '''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction
'''

count = len(some_text.split())
print(count)

21


**Let us update the function to include this (word length) check.**

We will also add a docstring. I have chosen to use an `assert` statement, but you could do something similar with an `if` statement.

In [22]:
from transformers import pipeline

def summarise(article):
  '''
  Returns a summary of a text.
  The text must contain at least 50 words.
  '''
  assert len(article.split()) >= 50, 'Please make sure your text has at least 50 words'

  summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")
  summary = summary_pipeline(article, max_length=50, min_length=20)
  text = summary[0]['summary_text']  # get the first element, then extract the value for the key 'summary_text'
  return text

In [23]:
some_text = '''A lack of transparency and reporting standards in the scientifc
community has led to increasing and widespread concerns relating to reproduction
'''

print(summarise(some_text))

AssertionError: Please make sure your text has at least 50 words

Great, the assertion worked.
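
For comparison, here is a minimal sketch of the same check written with an `if` statement. Unlike `assert`, which Python skips when run with the `-O` option, an explicit `raise` always runs. The helper name `check_length` and the constant `MIN_WORDS` are hypothetical, chosen only for this illustration.

```python
MIN_WORDS = 50  # same threshold as the assert-based check


def check_length(article, min_words=MIN_WORDS):
    """Return the word count, or raise ValueError if the text is too short."""
    word_count = len(article.split())
    if word_count < min_words:
        raise ValueError(
            f"Please make sure your text has at least {min_words} words "
            f"(got {word_count})"
        )
    return word_count


# A short text triggers the error, which the caller can catch and report.
try:
    check_length("A lack of transparency and reporting standards")
except ValueError as error:
    print(error)
```

A function like `summarise` could call `check_length(article)` as its first step in place of the `assert` line.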

In [24]:
bigger_text='''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and
relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The
metabolomics community has made substantial eforts to align with FAIR data standards by promoting open data formats,
data repositories, online spectral libraries, and metabolite databases.
'''

print(summarise(bigger_text))

Metabolomics generates vast amounts of data andrelies heavily on data science for deriving biological meaning. Metabolomics community has made substantial eforts to align with FAIR data standards by promoting open data formats.


Okay, that is working well.

Let us start to put our hard work to use:

*   Summarise a PDF text
*   Summarise a webpage text


