# T5 Model Demo

https://arxiv.org/pdf/1910.10683.pdf

In [1]:
!pip install transformers --upgrade
#transformers >= 2.8.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.2/7.2 MB[0m [31m44.4 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m236.8/236.8 kB[0m [31m18.1 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m79.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m

In [2]:
!pip install sentencepiece
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
import sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m15.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99
--2023-06-27 05:09:29--  https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278779 (272K) [text/plain]
Saving to: ‘botchan.txt’


2023-06-27 05:09:29 (8.57 MB/s) - ‘botchan.txt’ saved [278779/278779]



# Using Main Class

In [3]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import time

model = T5ForConditionalGeneration.from_pretrained('t5-small')  #tiny-base
tokenizer = T5Tokenizer.from_pretrained('t5-small')  #tiny-base

start_time = time.time()

def summarization_infer(text, max=50):
  preprocess_text = text.replace("\n", " ").strip()
  t5_prepared_Text = "summarize: "+preprocess_text
  tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt")

  summary_ids = model.generate(tokenized_text, min_length=30, max_length=max, top_k=100, top_p=0.8) #top-k top-p sampling strategy
  output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
  end_time = time.time()
  print (f'Time taken : {end_time-start_time}')
  return output

def translation_infer(text, max=50):
  preprocess_text = text.replace("\n", " ").strip()
  t5_prepared_Text = "translate English to German: "+preprocess_text
  tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt")

  translation_ids = model.generate(tokenized_text, min_length=10, max_length=50, early_stopping=True, num_beams=2)
  output = tokenizer.decode(translation_ids[0], skip_special_tokens=True)
  end_time = time.time()
  print (f'Time taken : {end_time-start_time}')
  return output

def grammatical_acceptibility_infer(text):
  preprocess_text = text.replace("\n", " ").strip()
  t5_prepared_Text = "cola sentence: "+preprocess_text
  tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt")

  grammar_ids = model.generate(tokenized_text, min_length=1, max_length=3)
  output = tokenizer.decode(grammar_ids[0], skip_special_tokens=True)
  end_time = time.time()
  print (f'Time taken : {end_time-start_time}')
  return output

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

# Using PyTorch Pipelines
Newly introduced in transformers v2.3.0, pipelines provides a high-level, easy to use, API for doing inference over a variety of downstream-tasks. Read [more practical documentation](https://www.kaggle.com/funtowiczmo/hugging-face-transformers-how-to-use-pipelines)

In [9]:
text = """
In recent years, people are seeking for a solution to improve text
summarization for Thai language. Although several solutions such
as PageRank, Graph Rank, Latent Semantic Analysis (LSA)
models, etc., have been proposed, research results in Thai text
summarization were restricted due to limited corpus in Thai
language with complex grammar. This paper applied a text
summarization system for Thai travel news based on keyword
scored in Thai language by extracting the most relevant sentences
from the original document. We compared LSA and Non-negative
Matrix Factorization (NMF) to find the algorithm that is suitable
with Thai travel news. The suitable compression rates for Generic
Sentence Relevance score (GRS) and K-means clustering were also
evaluated. From these experiments, we concluded that keyword
scored calculation by LSA with sentence selection by GRS is the
best algorithm for summarizing Thai Travel News, compared with
human with the best compression rate of 20%.

Daily newspaper has abundant of data that users do not have
enough time for reading them. It is difficult to identify the relevant
information to satisfy the information needed by users. Automatic
summarization can reduce the problem of information overloading
and it has been proposed previously in English and other languages.
However, there were only a few research results in Thai text
summarization due to the lack of corpus in Thai language and the
complicated grammar.
Text Summarization [1] is a technique for summarizing the content
of the documents. It consists of three steps: 1) create an
intermediate representation of the input text, 2) calculate score for
the sentences based on the concepts, and 3) choose important sentences
to be included in the summary. Text summarization can
be divided into 2 approaches. The first approach is the extractive
summarization, which relies on a method for extracting words and
searching for keywords from the original document. The second
approach is the abstractive summarization, which analyzes words
by linguistic principles with transcription or interpretation from the
original document. This approach implies more effective and
accurate summary than the extractive methods. However, with the
lack of Thai corpus, we chose to apply an extractive summarization
method for Thai text summarization.
"""

In [10]:
from transformers import pipeline

summarization_pipeline = pipeline(task='summarization', model="t5-small")
output = summarization_pipeline(text, min_length=30, max_length=50, top_k=100, top_p=0.8)
print (output)

[{'summary_text': 'research results in Thai text summarization were restricted due to limited corpus in Thai language with complex grammar . we compared LSA and Non-negative Matrix Factorization to find the algorithm that is suitable with Thai travel news'}]


# Making Flask API

In [6]:
from flask import Flask, request

app = Flask(__name__)

@app.route('/infer', methods=['POST'])
def infer():
  args = request.args['task']
  text = request.args['text']
  if args=='summarize':
    return summarization_infer(text)
  elif args=='translation':
    return translation_infer(text)
  else:
    return grammatical_acceptibility_infer(text)

if __name__=='__main__':
  app.run(host='0.0.0.0', port=5555, debug=False, threaded=True)


 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5555
 * Running on http://172.28.0.12:5555
INFO:werkzeug:[33mPress CTRL+C to quit[0m


In [7]:
text ="""
With 1,229 fresh cases in the last 24 hours, India's novel coronavirus count has increased to 21,700, according to the latest Ministry of Health and Family Welfare data. Along with that, the death toll due to the virus has increased to 686 after 34 more patients succumbed to the highly contagious disease since yesterday, it said. So far, India has 16,689 active cases. There are also 77 foreign nationals who are affected by the virus, the ministry said. Apart from that, there are at least 4,324 patients who have been discharged or cured from the highly contagious disease and one has migrated from the country. Speaking at the press briefing today, Lav Agarwal, Joint Secretary, Health Ministry, said, "As on today, we have 12 districts that did not have a fresh case in the last 28 days or more. There are now 78 districts (23 States/UTs) that has not reported any fresh cases during the last 14 Days." However, he also said that the increase in the number of coronavirus cases in the country is "more or less linear, not exponential." According to the Thursday morning data of health ministry, 4,257 Covid-19 patients have been cured so far, bring the recovery rate to is 19.89% as of now, said Aggrawal. He also added, "We have been able to cut virus transmission, minimise spread of COVID-19 in 30 days of lockdown."
"""

In [11]:
summarization_infer(text)

Time taken : 143.0182855129242


'text summarization is a technique for summarizing the content of the documents. it relies on a method for extracting words and searching for keywords from the original document. the extractive summarization is a'