In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
!pip install optimum quanto onnxruntime onnxruntime-tools onnxconverter_common -q

## Importing necessary dependencies

In [4]:
import os
import onnx
import torch
import transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
from optimum.onnxruntime import ORTQuantizer, ORTModelForSeq2SeqLM
from pathlib import Path
from transformers.onnx import FeaturesManager
from onnxruntime.quantization import quantize_dynamic, QuantType
from onnxconverter_common import float16

## Loading the fine-tuned model and running inference

In [6]:
fine_tuned_checkpoint = "/content/drive/MyDrive/news_summarizer_seq2seq/finetuned_model"

In [7]:
fine_tuned_model = AutoModelForSeq2SeqLM.from_pretrained(fine_tuned_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_checkpoint)

In [8]:
fine_tuned_pipeline = pipeline("summarization", model=fine_tuned_model, tokenizer=tokenizer)

In [9]:
fine_tuned_pipeline("""In 2013, Kohli was ranked number one in the ICC rankings for ODI batsmen. In 2015, he achieved the summit of T20I rankings.[7] In 2018, he was ranked top Test batsman, making him the only Indian cricketer to hold the number one spot in all three formats of the game. He is the first player to score 20,000 runs in a decade. In 2020, the International Cricket Council named him the male cricketer of the decade.""")[0]['summary_text']

Your max_length is set to 200, but your input_length is only 108. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)


'Kohli is the only Indian cricketer to hold the number one spot in all three formats of the game . In 2020, the International Cricket Council named him the male cricketer of the decade .'
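
The warning above appears because the pipeline's `max_length` (200) exceeds the input length (108 tokens). As a minimal sketch, the helper below derives a `max_length` from the input token count using the same one-half ratio the warning suggests. The function name, the `ratio` default and the `floor` value are illustrative assumptions, not part of the original notebook.

```python
def suggest_max_length(input_length: int, ratio: float = 0.5, floor: int = 8) -> int:
    """Return a summary max_length of roughly `ratio` times the input token count.

    `floor` (an assumed lower bound) keeps very short inputs from getting
    an unusably small limit.
    """
    return max(floor, int(input_length * ratio))


# 108 is the input_length reported in the warning above.
print(suggest_max_length(108))

# Possible usage with the objects defined in this notebook (not run here):
#   n_tokens = len(tokenizer(text)["input_ids"])
#   fine_tuned_pipeline(text, max_length=suggest_max_length(n_tokens))
```

Passing `max_length` at call time leaves the model's saved generation settings untouched, which keeps the later comparisons between model variants on equal footing.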

## Converting the model to ONNX format

## The command below exports the seq2seq model to ONNX files

In [10]:
!optimum-cli export onnx --model /content/drive/MyDrive/news_summarizer_seq2seq/finetuned_model --task seq2seq-lm-with-past --for-ort /content/drive/MyDrive/news_summarizer_seq2seq/onnx_model

2024-06-20 13:45:48.668157: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-20 13:45:48.668251: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-20 13:45:48.670249: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
The option --for-ort was passed, but its behavior is now the default in the ONNX exporter and passing it is not required anymore.
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/3: T5Stack *****
Using framework PyTorch: 2.3.0+cu121
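
Before loading the exported model, it can help to confirm that the export wrote the files the later cells expect. The sketch below checks for the three file names used in the quantization cells of this notebook. The helper name is hypothetical, and other Optimum versions may produce a different file layout, so treat the expected names as an assumption.

```python
from pathlib import Path

# File names taken from the quantization cells later in this notebook.
EXPECTED_ONNX_FILES = (
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
)


def missing_onnx_files(directory, expected=EXPECTED_ONNX_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


# Possible usage (not run here):
#   missing = missing_onnx_files("/content/drive/MyDrive/news_summarizer_seq2seq/onnx_model")
#   if missing:
#       print("Export incomplete, missing:", missing)
```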

In [11]:
model = ORTModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/news_summarizer_seq2seq/onnx_model")

In [12]:
onnx_translation = pipeline("summarization", model=model, tokenizer=tokenizer)

In [13]:
onnx_translation("""In 2013, Kohli was ranked number one in the ICC rankings for ODI batsmen. In 2015, he achieved the summit of T20I rankings.[7] In 2018, he was ranked top Test batsman, making him the only Indian cricketer to hold the number one spot in all three formats of the game. He is the first player to score 20,000 runs in a decade. In 2020, the International Cricket Council named him the male cricketer of the decade.""")[0]['summary_text']

Your max_length is set to 200, but your input_length is only 108. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)


'Kohli is the only Indian cricketer to hold the number one spot in all three formats of the game . In 2020, the International Cricket Council named him the male cricketer of the decade .'

## Quantization to float16

In [14]:
def quantize_float(source_directory, target_directory, files_to_check):
    # Create the output directory if needed; onnx.save does not create it.
    os.makedirs(target_directory, exist_ok=True)
    for file_name in files_to_check:
        source_path = os.path.join(source_directory, file_name)
        target_path = os.path.join(target_directory, file_name)
        if not os.path.isfile(source_path):
            print(f"{source_path} not found, skipping")
            continue
        # Convert float32 tensors in the graph to float16.
        model = onnx.load(source_path)
        model_fp16 = float16.convert_float_to_float16(model)
        onnx.save(model_fp16, target_path)
        print("\n\n")
        print(f"{target_path}=======>Done")

In [15]:
quantize_float("/content/drive/MyDrive/news_summarizer_seq2seq/onnx_model", "/content/drive/MyDrive/news_summarizer_seq2seq/quantfloat_model", ["encoder_model.onnx", "decoder_model.onnx", "decoder_with_past_model.onnx"])

In [16]:
fine_tuned_model.config.to_json_file("/content/drive/MyDrive/news_summarizer_seq2seq/quantfloat_model/config.json")

## Running inference with the float16 model

In [17]:
fp16_quantized_checkpoint = "/content/drive/MyDrive/news_summarizer_seq2seq/quantfloat_model"

In [18]:
quantfloat_model = ORTModelForSeq2SeqLM.from_pretrained(fp16_quantized_checkpoint)

Generation config file not found, using a generation config created from the model config.


In [19]:
quantfloat_pipeline = pipeline("summarization", model=quantfloat_model, tokenizer=tokenizer)

In [20]:
quantfloat_pipeline("""In 2013, Kohli was ranked number one in the ICC rankings for ODI batsmen. In 2015, he achieved the summit of T20I rankings.[7] In 2018, he was ranked top Test batsman, making him the only Indian cricketer to hold the number one spot in all three formats of the game. He is the first player to score 20,000 runs in a decade. In 2020, the International Cricket Council named him the male cricketer of the decade.""")[0]['summary_text']

Your max_length is set to 200, but your input_length is only 108. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)


'Kohli is the only Indian cricketer to hold the number one spot in all three formats of the game . In 2020, the International Cricket Council named him the male cricketer of the decade .'

## Quantization to int8

In [21]:
def quantize_int8(source_directory, target_directory, files_to_check):
    # Named quantize_int8 so the `quantint_model` variable assigned later does not shadow it.
    os.makedirs(target_directory, exist_ok=True)
    for file_name in files_to_check:
        source_path = os.path.join(source_directory, file_name)
        target_path = os.path.join(target_directory, file_name)
        if not os.path.isfile(source_path):
            print(f"{source_path} not found, skipping")
            continue
        # Dynamic quantization: weights are stored as int8, activations are quantized at run time.
        quantize_dynamic(source_path, target_path, weight_type=QuantType.QInt8)
        print("\n\n")
        print(f"{target_path}=======>Done")

In [22]:
quantize_int8("/content/drive/MyDrive/news_summarizer_seq2seq/onnx_model", "/content/drive/MyDrive/news_summarizer_seq2seq/quantint_model", ["encoder_model.onnx", "decoder_model.onnx", "decoder_with_past_model.onnx"])

In [23]:
# Save the model configuration
fine_tuned_model.config.to_json_file("/content/drive/MyDrive/news_summarizer_seq2seq/quantint_model/config.json")

In [24]:
quantint_checkpoint = "/content/drive/MyDrive/news_summarizer_seq2seq/quantint_model"
tokenizer_checkpoint = "/content/drive/MyDrive/news_summarizer_seq2seq/finetuned_model"
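
One motivation for float16 and int8 variants is a smaller footprint on disk. The sketch below sums the size of the `.onnx` files in a directory so the three model directories can be compared. The helper name is hypothetical, and it assumes the weights live inside the `.onnx` files themselves rather than in separate external-data files.

```python
from pathlib import Path


def onnx_size_mb(directory):
    """Return the total size, in MiB, of the .onnx files directly inside `directory`."""
    total_bytes = sum(path.stat().st_size for path in Path(directory).glob("*.onnx"))
    return total_bytes / (1024 * 1024)


# Possible usage with the directories from this notebook (not run here):
#   base = "/content/drive/MyDrive/news_summarizer_seq2seq"
#   for name in ("onnx_model", "quantfloat_model", "quantint_model"):
#       print(name, round(onnx_size_mb(f"{base}/{name}"), 1), "MiB")
```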

## Running inference with the quantized int8 model

In [25]:
quantint_model = ORTModelForSeq2SeqLM.from_pretrained(quantint_checkpoint)

Generation config file not found, using a generation config created from the model config.


In [26]:
quantint_pipeline = pipeline("summarization", model=quantint_model, tokenizer=tokenizer)

In [27]:
quantint_pipeline("""In 2013, Kohli was ranked number one in the ICC rankings for ODI batsmen. In 2015, he achieved the summit of T20I rankings.[7] In 2018, he was ranked top Test batsman, making him the only Indian cricketer to hold the number one spot in all three formats of the game. He is the first player to score 20,000 runs in a decade. In 2020, the International Cricket Council named him the male cricketer of the decade.""")[0]['summary_text']

Your max_length is set to 200, but your input_length is only 108. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)


'Kohli was ranked number one in the ICC rankings for ODI batsmen . He is the first Indian player to score 20,000 runs in a decade .'
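
The int8 summary above differs from the one produced by the fine-tuned, ONNX and float16 models. As a rough way to quantify that drift, the sketch below computes a unigram-overlap F1 score between two summaries. This is a simplified stand-in for a proper metric such as ROUGE, and the function name is an illustrative assumption.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Return the F1 score of whitespace-token overlap between two strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Summaries copied from the cell outputs in this notebook.
fp32_summary = ("Kohli is the only Indian cricketer to hold the number one spot in all "
                "three formats of the game . In 2020, the International Cricket Council "
                "named him the male cricketer of the decade .")
int8_summary = ("Kohli was ranked number one in the ICC rankings for ODI batsmen . "
                "He is the first Indian player to score 20,000 runs in a decade .")

print(round(unigram_f1(fp32_summary, int8_summary), 3))
```

A score of 1.0 means the two summaries use exactly the same tokens; lower values indicate that quantization changed what the model generates. A single example is only a spot check, so a held-out evaluation set would be needed to judge quality loss.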

## Inference time comparison between the fine-tuned, ONNX, float16 and int8 models

In [28]:
input_text = """The rapid advancements in artificial intelligence (AI) technology
                are revolutionizing various industries, from healthcare to finance.
                In healthcare, AI-powered diagnostic tools are enhancing the accuracy of disease detection,
                enabling early intervention and improving patient outcomes.
                For instance, AI algorithms can analyze medical images
                with greater precision than human doctors,
                identifying abnormalities that might be missed during manual examination.
                In finance, AI-driven algorithms are optimizing trading strategies,
                predicting market trends, and managing risk more effectively.
                These technologies not only increase efficiency but also reduce operational costs.
                However, the widespread adoption of AI also raises ethical concerns,
                such as data privacy and the potential for job displacement.
                As AI continues to evolve, it is crucial to address these issues through thoughtful
                regulation and by ensuring that AI systems are developed and deployed responsibly."""

In [29]:
%%time
fine_tuned_pipeline(input_text)[0]['summary_text']

Your max_length is set to 200, but your input_length is only 168. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=84)


CPU times: user 2.41 s, sys: 20.2 ms, total: 2.43 s
Wall time: 2.43 s


'In healthcare, AI-powered diagnostic tools are enhancing the accuracy of disease detection . For example, AI algorithms can analyze medical images with greater precision than human doctors .'

In [32]:
%%time
# Note: this cell times the shorter Kohli passage, not input_text, so its timing is not directly comparable with the other three cells.
onnx_translation("""In 2013, Kohli was ranked number one in the ICC rankings for ODI batsmen. In 2015, he achieved the summit of T20I rankings.[7] In 2018, he was ranked top Test batsman, making him the only Indian cricketer to hold the number one spot in all three formats of the game. He is the first player to score 20,000 runs in a decade. In 2020, the International Cricket Council named him the male cricketer of the decade.""")[0]['summary_text']

Your max_length is set to 200, but your input_length is only 108. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)


CPU times: user 1.54 s, sys: 2.25 ms, total: 1.54 s
Wall time: 1.62 s


'Kohli is the only Indian cricketer to hold the number one spot in all three formats of the game . In 2020, the International Cricket Council named him the male cricketer of the decade .'

In [30]:
%%time
quantfloat_pipeline(input_text)[0]['summary_text']

Your max_length is set to 200, but your input_length is only 168. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=84)


CPU times: user 7.85 s, sys: 98.9 ms, total: 7.95 s
Wall time: 8.11 s


'In healthcare, AI-powered diagnostic tools are enhancing the accuracy of disease detection . For example, AI algorithms can analyze medical images with greater precision than human doctors .'

In [31]:
%%time
quantint_pipeline(input_text)[0]['summary_text']

Your max_length is set to 200, but your input_length is only 168. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=84)


CPU times: user 902 ms, sys: 2.79 ms, total: 905 ms
Wall time: 905 ms


'AI-powered diagnostic tools are revolutionizing various industries, from healthcare to finance . For example, AI algorithms can analyze medical images with greater precision than human doctors .'
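
Each `%%time` cell above measures a single run, which can be skewed by one-off costs such as session warm-up. The sketch below times any callable over several runs after a warm-up call and reports the median. The helper name and the `warmup` and `repeats` defaults are illustrative assumptions.

```python
import statistics
import time


def time_callable(fn, *args, warmup=1, repeats=5, **kwargs):
    """Return the median wall-clock time in seconds of `fn(*args, **kwargs)`."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn(*args, **kwargs)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Possible usage with the pipelines from this notebook (not run here):
#   for name, pipe in [("fine-tuned", fine_tuned_pipeline),
#                      ("float16", quantfloat_pipeline),
#                      ("int8", quantint_pipeline)]:
#       print(name, round(time_callable(pipe, input_text), 3), "s")
```

Timing every variant on the same `input_text` keeps the comparison like for like.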