<a href="https://www.kaggle.com/code/aisuko/inference-with-cpu?scriptVersionId=164530532" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# DeepSparse

DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML (an optimization library for pruning and quantizing models), DeepSparse delivers strong inference performance on CPU hardware.


# Features
* Sparse kernels for speedups and memory savings from unstructured sparse weights
* 8-bit weight and activation quantization support
* Efficient usage of cached attention keys and values for minimal memory movement
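
To make the first two features concrete, here is a minimal pure-Python sketch of what "unstructured sparse weights" and "8-bit quantization" mean for a flat list of weights. It is an illustration only: the helper names (`magnitude_prune`, `sparsity_of`, `quantize_int8`) are hypothetical, and this is not how DeepSparse or SparseML implement pruning or quantization internally.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude entries so that `sparsity` of them become zero."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127
    return [round(w / scale) for w in weights], scale


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9]
pruned = magnitude_prune(weights, 0.5)   # zeros can land anywhere: "unstructured"
q, scale = quantize_int8(pruned)

print(pruned)
print(sparsity_of(pruned))
print(q)
```

A sparse runtime can skip the zeroed entries, and the integer values take a quarter of the space of 32-bit floats, which is where the speed and memory savings listed above come from.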


# Checking whether the CPU supports AVX and AVX2

In [1]:
!grep avx /proc/cpuinfo

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabiliti

In [2]:
!grep avx2 /proc/cpuinfo

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabiliti
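
The `grep` output above is hard to scan, and `grep avx` also matches any flag that merely contains the substring `avx`. As a rough alternative, the sketch below parses the `flags` line and checks for exact flag names. The function names are hypothetical, and it assumes a Linux system where `/proc/cpuinfo` exists.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the flags on the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(flags):
    """Report AVX and AVX2 support using exact flag-name matches."""
    return {name: name in flags for name in ("avx", "avx2")}


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # only present on Linux
    print(simd_support(parse_cpu_flags(cpuinfo.read_text())))
```

Only the first `flags` line is read, since the lines for each logical CPU are normally identical, as in the output above.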

# Installation

There are two ways to install DeepSparse:
* `pip install -U deepsparse-nightly[llm]`
* `pip install deepsparse`

However, installing the latest stable version in the Kaggle environment reports a dependency conflict: `ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.`

The likely cause is that the stable release requires a specific version of `transformers`, so this notebook installs the nightly build instead.
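
One way to investigate a conflict like this is to list which versions of the relevant packages are already installed. The sketch below uses only the standard library. The helper name `installed_version` and the list of package names are choices made for this example, not something the installer prescribes.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages that plausibly take part in the conflict (an assumption).
for name in ("deepsparse", "deepsparse-nightly", "transformers", "onnx"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```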

In [3]:
pip install -U deepsparse-nightly[llm]

Collecting deepsparse-nightly[llm]
  Obtaining dependency information for deepsparse-nightly[llm] from https://files.pythonhosted.org/packages/9c/8e/4087b342e07536edc09bac5abac6b748bef0f9aed6fd068f08e0f8e545d2/deepsparse_nightly-1.7.0.20240131-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading deepsparse_nightly-1.7.0.20240131-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (23 kB)
Collecting sparsezoo-nightly~=1.7.0 (from deepsparse-nightly[llm])
  Obtaining dependency information for sparsezoo-nightly~=1.7.0 from https://files.pythonhosted.org/packages/5f/23/42f0298c587914186ef490c0975ed8f2ee8c9bb0a1a0686f47f56d570661/sparsezoo_nightly-1.7.0.20240131-py3-none-any.whl.metadata
  Downloading sparsezoo_nightly-1.7.0.20240131-py3-none-any.whl.metadata (21 kB)
Collecting onnx<1.15.0,>=1.5.0 (from deepsparse-nightly[llm])
  Obtaining dependency information for onnx<1.15.0,>=1.5.0 from https://files.pythonhosted.org/packages/47/d4/f2

# Demo of Inference

In [4]:
from deepsparse import TextGeneration

pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt="""
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""

print(pipeline(prompt, max_new_tokens=75).generations[0].text)

Downloading (…)ed/deployment.tar.gz:   0%|          | 0.00/3.04G [00:00<?, ?B/s]

DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240131 COMMUNITY | (1ddb9f31) (release) (optimized) (system=avx2, binary=avx2)


Sparsity is the number of non-zero elements in a matrix. For example, in the matrix A = [1, 2, 3, 4, 5, 6, 7, 8, 9] the number of non-zero elements is 3.
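
The prompt in the cell above follows an instruction/response template. When trying several questions, it can help to build that string with a small helper instead of editing it by hand. This is a sketch: `build_prompt` is a hypothetical name, and the template text is copied from the prompt used in this notebook.

```python
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. "
    "### Instruction: {instruction} ### Response:"
)


def build_prompt(instruction):
    """Insert a single instruction into the instruction/response template."""
    return TEMPLATE.format(instruction=instruction.strip())


prompt = build_prompt("what is sparsity?")
print(prompt)

# With the pipeline created in the cell above, the prompt would be used as:
# print(pipeline(prompt, max_new_tokens=75).generations[0].text)
```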


# Computer Vision and NLP Models

DeepSparse supports many variants of CNN and Transformer models, such as BERT, ViT, ResNet, YOLOv5/8, and many more. DeepSparse includes three deployment APIs:


## Engine 

It is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.


## Pipeline

It wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.


## Server

It wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.
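
This notebook does not start a server, so the following is only a sketch of what "send raw data over HTTP" could look like from the client side, using the standard library. The host, port, route, and the `sequences` payload field are all assumptions made for illustration. Check them against the server documentation for your installed DeepSparse version before relying on them.

```python
import json
from urllib import request

# Hypothetical address: host, port, and route are assumptions, not from this notebook.
SERVER_URL = "http://localhost:5543/predict"


def build_request(text, url=SERVER_URL):
    """Build a JSON POST request carrying the raw input text."""
    payload = json.dumps({"sequences": text}).encode("utf-8")  # field name is assumed
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, url=SERVER_URL, timeout=10):
    """Send the request and decode the JSON response (needs a running server)."""
    with request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_request("Cool")
print(req.get_method(), req.full_url)
```

`predict` is defined but not called here, because it only works once a server is actually listening at the given URL.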

# Engine

This example downloads a 90% pruned and quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can also provide their own ONNX models, whether dense or sparse.

In [5]:
from deepsparse import Engine

model_address ="zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model=Engine(model=model_address, batch_size=1)

Downloading (…)ed/deployment.tar.gz:   0%|          | 0.00/29.8M [00:00<?, ?B/s]

In [6]:
inputs=compiled_model.generate_random_inputs()
output=compiled_model(inputs)
print(output)

2024-02-27 11:48:59 deepsparse.utils.onnx INFO     Generating input 'input_ids', type = int64, shape = [1, 128]
2024-02-27 11:48:59 deepsparse.utils.onnx INFO     Generating input 'attention_mask', type = int64, shape = [1, 128]
2024-02-27 11:48:59 deepsparse.utils.onnx INFO     Generating input 'token_type_ids', type = int64, shape = [1, 128]


[array([[-0.34614536,  0.09025408]], dtype=float32)]
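
The Engine returns raw logits, so turning them into a label and score is left to the caller. A minimal sketch of that step is a softmax over the two values printed above. The label order (`negative` first, `positive` second) is an assumption, and because the inputs were random, the resulting prediction carries no meaning.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-0.34614536, 0.09025408]   # values copied from the output above
labels = ["negative", "positive"]    # assumed label order

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 4))
```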


# Pipeline

Pipeline wraps the Engine with pre- and post-processing, so you can pass raw data and receive the post-processed prediction.

In [7]:
from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"

sentiment_analysis_pipeline = Pipeline.create(
  task="sentiment-analysis",    # name of the task
  model_path=zoo_stub,          # zoo stub or path to local onnx file
)



In [8]:
prediction = sentiment_analysis_pipeline("Cool")
print(prediction)

labels=['positive'] scores=[0.9995297193527222]
