<a href="https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization">Optimized and Quantized deepset/roberta-base-squad2</a>
    

Baseline Performance

In [1]:
from transformers import pipeline

qa = pipeline("question-answering",model="deepset/roberta-base-squad2")




In [2]:
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 

payload = {"inputs": {"question": question, "context": context}}


In [3]:
from time import perf_counter
import numpy as np 

def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ =  pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(qa,payload)}")
# Earlier reference figure (not from the run below):
#     Vanilla model Average latency (ms) - 64.15 +/- 2.44


Vanilla model Average latency (ms) - 452.64 +/- 25.56
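
The timing loop above could also be written once for any zero-argument callable, so the same helper would serve both the pipeline and a handler. The following is a minimal sketch and not part of the original notebook: the name `time_callable`, its parameters, and the added 95th-percentile statistic are assumptions made for illustration. It uses only the standard library.

```python
from statistics import mean, pstdev, quantiles
from time import perf_counter


def time_callable(fn, warmup=10, runs=50):
    """Time a zero-argument callable and return latency statistics in milliseconds.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Warm-up calls are executed but not timed
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(runs):
        start = perf_counter()
        fn()
        latencies_ms.append(1000 * (perf_counter() - start))
    return {
        "mean_ms": mean(latencies_ms),
        "std_ms": pstdev(latencies_ms),  # population std, like np.std's default
        "p95_ms": quantiles(latencies_ms, n=100)[94],  # 95th percentile
    }


# Cheap stand-in workload so the sketch runs anywhere; for the pipeline above
# one could pass: lambda: qa(question=question, context=context)
stats = time_callable(lambda: sum(range(1000)), warmup=2, runs=20)
print(sorted(stats))
```

Reporting a tail percentile next to the mean can help when a few slow calls inflate the standard deviation.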


1. Convert model to ONNX


In [4]:
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path


model_id="deepset/roberta-base-squad2"
onnx_path = Path(".\\ONNX_and_Quantizaion")

# load vanilla transformers and convert to onnx
# (`export=True` replaces the deprecated `from_transformers=True`)
model = ORTModelForQuestionAnswering.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.0.1+cu117
Overriding 1 configuration item(s)
	- use_cache -> False


verbose: False, log level: Level.ERROR



('ONNX_and_Quantizaion\\tokenizer_config.json',
 'ONNX_and_Quantizaion\\special_tokens_map.json',
 'ONNX_and_Quantizaion\\vocab.json',
 'ONNX_and_Quantizaion\\merges.txt',
 'ONNX_and_Quantizaion\\added_tokens.json',
 'ONNX_and_Quantizaion\\tokenizer.json')

2. Optimize & quantize model with Optimum

In [5]:
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# Create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# Define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations

# Optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)


Optimizing model...
Configuration saved in ONNX_and_Quantizaion\ort_config.json
Optimized model saved at: ONNX_and_Quantizaion (external data format: False; saved all tensor to one file: True)


WindowsPath('ONNX_and_Quantizaion')

In [6]:
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)


Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: False)
Quantizing model...
Saving quantized model at: ONNX_and_Quantizaion (external data format: False)
Configuration saved in ONNX_and_Quantizaion\ort_config.json
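
After the export, optimization, and quantization steps, it can be useful to compare the size of the ONNX files on disk. The following is a minimal sketch under stated assumptions: `onnx_sizes_mb` is a hypothetical helper that is not part of the original notebook, and the demo uses dummy files in a temporary directory so that it runs without the real models.

```python
from pathlib import Path
import tempfile


def onnx_sizes_mb(directory):
    """Return {file name: size in MB} for every .onnx file directly inside `directory`."""
    return {
        p.name: p.stat().st_size / (1024 * 1024)
        for p in sorted(Path(directory).glob("*.onnx"))
    }


# Demo with dummy files (names and sizes are made up for illustration);
# in the notebook one would call onnx_sizes_mb(onnx_path) instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dummy_fp32.onnx").write_bytes(b"\0" * 4 * 1024 * 1024)
    (Path(tmp) / "dummy_int8.onnx").write_bytes(b"\0" * 1024 * 1024)
    sizes = onnx_sizes_mb(tmp)

for name, size in sizes.items():
    print(f"{name}: {size:.2f} MB")
```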


3. Create Custom Handler for Inference Endpoints

In [7]:
%%writefile handler.py
from typing import Any, Dict
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler():
    def __init__(self, path=""):
        # load the optimized and quantized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Args:
            data (:obj:`dict`):
                request payload; the question and context are expected under the "inputs" key.
        Return:
            A :obj:`dict` with the keys "score", "start", "end" and "answer".
        """
        # fall back to the payload itself when there is no "inputs" key
        inputs = data.get("inputs", data)
        # run the model
        prediction = self.pipeline(**inputs)
        # return prediction
        return prediction


Overwriting handler.py


4. Test Custom Handler Locally


In [8]:
from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".\\ONNX_and_Quantizaion")

# prepare sample payload
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 

payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)


The ONNX file model_optimized_quantized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.


{'score': 0.2847447991371155,
 'start': 88,
 'end': 102,
 'answer': 'Technical Lead'}

In [9]:
from time import perf_counter
import numpy as np 

def measure_latency(handler,payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ =  handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Optimized & Quantized model {measure_latency(my_handler,payload)}")

Optimized & Quantized model Average latency (ms) - 175.72 +/- 14.34
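
The two mean latencies recorded in this notebook can be turned into a speed-up factor. The following is a small worked calculation and not part of the original notebook. It only reuses the two averages printed above and assumes they are comparable: both runs used the same payload, but the baseline timed the pipeline directly while the optimized run went through the handler, and each figure comes from a single run on one machine.

```python
# Mean latencies copied from the two runs recorded in this notebook (milliseconds)
vanilla_ms = 452.64
quantized_ms = 175.72

speedup = vanilla_ms / quantized_ms
reduction_pct = 100 * (1 - quantized_ms / vanilla_ms)

print(f"Speed-up: {speedup:.2f}x, latency reduction: {reduction_pct:.1f}%")
# prints: Speed-up: 2.58x, latency reduction: 61.2%
```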


Test the quantized model

In [96]:
import fitz  # PyMuPDF

path = r'F:\NLP Apple Course\NLP Project\Ammar Abdelhady CV.pdf'

def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of every page in the PDF."""
    text = ''
    # open the document as a context manager so the file is closed afterwards
    with fitz.open(pdf_path) as pdf_document:
        for page in pdf_document:
            text += page.get_text()
    return text


text_data = extract_text_from_pdf(path)
# NOTE: anas_cv is created in a later cell (In [72]) that was executed before this one
print(anas_cv)

Profile
Software engineer Expertise with project management approaches like Agile and waterfall methodologies, and have
the ability to build and website from A to Z.
Seeking an opportunity as a Full-stack web developer with Asp.net core or Blazor.
Professional Experience
Website Administrator, Elhramain Company
•Manage email communications related to the website.
•Collect and analyze data from competitor stores to determine optimal pricing and 
discounts.
11/2023 – present
Mansoura, Egypt
•Execute edits and updates to enhance website features and pages.
•Ensure a seamless and user-friendly online experience for customers.
•Contribute to the optimization of website performance to support business 
objectives.
Full Stack Developer, Appyinnovate
•Developed and maintained high-performance web applications using ASP.NET Core 
MVC, resulting in a 20% increase in user satisfaction.
•Utilized front-end technologies such as Angular7 and Bootstrap to create intuitive 
and user-friendly interface

In [11]:
from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".\\ONNX_and_Quantizaion")

The ONNX file model_optimized_quantized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.


In [12]:
# prepare sample payload
question = "What is the my all skill?"
context = text_data

payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)['answer']

'Artificial Intelligence'

In [13]:
text_data

'Ammar Abdelhady Raafat\n \nFrom  : Cairo, Egypt    \nPhone : 010-262-073-13 \nEmail : ammarabdelhady8@gmail.com  \nGitHub   : Ammar abdelhady \nLinkedin : Ammar abdelhady \nPROFILE \nTo give you a brief overview of my skill set, I have a solid understanding of computer science, worked in Back end \ndevelopment for 1 year and worked in the AI field for 3 years, I have great knowledge working in: \n● Data science field : including tasks like communicating stakeholders, data engineering pipeline, data collecting (web \nscraping), Data structures & Algorithms, Statistical Analysis, databases, analysis, visualization, data preprocessing, machine \nlearning, Auto ML, deep learning and deployment on the cloud with web micro services & API. \n \n● Computer vision field : including tasks like classification, object detection, segmentation,  \ntracking, generative models, style transfer, Image similarity \n \nI am a fast, flexible code agnostic, and a hard worker learner. \n \n \nEXPERIENCE \nT

In [15]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline


model_name = "deepset/roberta-base-squad2"

model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)

In [100]:
QA_input = {
    'question': 'what is the my job?'.lower(),
    'context': text_data
}
res = nlp(QA_input)

KeyboardInterrupt: 

In [None]:
res["answer"]

In [20]:
print(text_data)

Ammar Abdelhady Raafat
 
From  : Cairo, Egypt    
Phone : 010-262-073-13 
Email : ammarabdelhady8@gmail.com  
GitHub   : Ammar abdelhady 
Linkedin : Ammar abdelhady 
PROFILE 
To give you a brief overview of my skill set, I have a solid understanding of computer science, worked in Back end 
development for 1 year and worked in the AI field for 3 years, I have great knowledge working in: 
● Data science field : including tasks like communicating stakeholders, data engineering pipeline, data collecting (web 
scraping), Data structures & Algorithms, Statistical Analysis, databases, analysis, visualization, data preprocessing, machine 
learning, Auto ML, deep learning and deployment on the cloud with web micro services & API. 
 
● Computer vision field : including tasks like classification, object detection, segmentation,  
tracking, generative models, style transfer, Image similarity 
 
I am a fast, flexible code agnostic, and a hard worker learner. 
 
 
EXPERIENCE 
Training Program Descri

In [72]:
anas_cv = extract_text_from_pdf(r"F:\NLP Apple Course\NLP Project\Anas Amin Resume.pdf")

In [44]:
QA_input = {
    'question': 'what is the my shills?'.lower(),
    'context': anas_cv[1282:]
}
res = nlp(QA_input)

res["answer"]

'\nanasamin2002'

In [35]:
import re

In [36]:
re.search("skill", anas_cv.lower())

<re.Match object; span=(1282, 1287), match='skill'>
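
The hard-coded offset in `anas_cv[1282:]` equals the start of the match found by `re.search` above. The following is a minimal sketch of deriving that slice from the keyword instead of typing the number by hand. `context_from_keyword` is a hypothetical helper that is not part of the original notebook, and the sample string is a made-up stand-in for the CV text.

```python
import re


def context_from_keyword(text, keyword):
    """Return `text` starting at the first case-insensitive match of `keyword`.

    Falls back to the full text when the keyword is absent.
    Hypothetical helper for illustration; not part of the original notebook.
    """
    match = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
    return text[match.start():] if match else text


sample = "PROFILE\nBackend developer\nSKILLS\nPython, SQL"  # stand-in for anas_cv
print(context_from_keyword(sample, "skill"))
```

Matching with `re.IGNORECASE` on the original string avoids computing the offset on a lowercased copy and then applying it to the unmodified text.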