Cformers

SoTA Transformers with C-backend for fast inference on your CPU.

Introduction

We identify three pillars that enable fast inference of SoTA AI models on your CPU:

Fast C/C++ LLM inference kernels for CPU.
Machine learning research and exploration: compression through quantization and sparsification, training on more data, collecting data, and training instruction and chat models.
An easy-to-use API for fast AI inference in a dynamically typed language like Python.

This project aims to address the third pillar, the easy-to-use API, using LLaMa.cpp and GGML.

Guiding Principles

Inference Speed! Focus on inference, not training.
Precompressed models.
Minimal setup required. Soon, pip install cformers should be all you need to get started.
Easily switch between models and quantization types.
Support a variety of prompts.

And most importantly:

You, the users, get to decide which direction we take this project.

Usage

Setup

pip install transformers wget
git clone https://github.com/nolanoOrg/cformers.git
cd cformers/cformers/cpp && make && cd ..

Usage:

from interface import AutoInference as AI
ai = AI('EleutherAI/gpt-j-6B')
x = ai.generate('def parse_html(html_doc):', num_tokens_to_generate=500)
print(x['token_str'])

OR

from interface import AutoInference as AI
ai = AI('OpenAssistant/oasst-sft-1-pythia-12b')
x = ai.generate("<|prompter|>What's the Earth total population<|endoftext|><|assistant|>", num_tokens_to_generate=100)
print(x['token_str'])

OR

python chat.py

chat.py accepts the following parameters:

-t 100 Number of tokens to generate
-p Tell me a joke For a single prompt interaction
-m pythia Load one of the available models (bloom, pythia or gptj)

We are working on adding support for pip install cformers.
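
The Open-Assistant example above wraps the user text in <|prompter|>, <|endoftext|> and <|assistant|> markers. A minimal sketch of a helper that builds such a prompt is shown below. The function name build_oasst_prompt is hypothetical and not part of Cformers; only the marker strings are taken from the example above.

```python
def build_oasst_prompt(user_message: str) -> str:
    """Wrap a user message in the Open-Assistant markers from the usage example.

    Hypothetical helper: Cformers itself does not ship this function.
    """
    return f"<|prompter|>{user_message}<|endoftext|><|assistant|>"


prompt = build_oasst_prompt("What's the Earth total population")
print(prompt)
# The resulting string can be passed as the first argument to ai.generate(...)
# exactly like the hand-written prompt in the usage example.
```

Keeping the markers in one place avoids typos when you try several prompts against the same chat model.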

The following architectures are supported:

GPT-J
BLOOM
GPT-NeoX/Pythia/Open-Assistant/Open-Chat-Kit
CodeGen

The following Hugging Face models are currently supported:

EleutherAI/gpt-j-6B
bigscience/bloom-560m
bigscience/bloom-1b1
bigscience/bloom-1b7
bigscience/bloom-3b
bigscience/bloom-7b1
OpenAssistant/oasst-sft-1-pythia-12b
togethercomputer/GPT-NeoXT-Chat-Base-20B (thanks to @HCBlackFox)
Salesforce/codegen-350M-mono
Salesforce/codegen-2B-mono
Salesforce/codegen-6B-mono
Salesforce/codegen-16B-mono
gpt2 (thanks to @kamalojasv181)

We still need to quantize and upload the remaining Hugging Face models that are based on the supported architectures. We would appreciate your help with this.

Coming Soon:

Features:

Switch between models
Chat-mode (interactive mode)
Various tools to support Prompt-engineering, chaining, saving and sharing.

Code-base restructuring:

Switch to Pybind11 rather than Subprocess - expected speedup: 3-4x
Restructure the codebase for reuse.
Include llama.cpp as a git submodule or dependency.
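
To illustrate the subprocess pattern that the Pybind11 switch would replace, here is a minimal sketch that spawns a fresh process for every request. It is not the Cformers implementation: sys.executable stands in for a compiled inference binary, and run_once is a hypothetical name. One plausible source of overhead (an assumption, not a measurement) is that every call pays for process startup, and a real binary would also reload the model weights each time.

```python
import subprocess
import sys
import time


def run_once(prompt: str) -> str:
    """Spawn a new process per request, as a subprocess-based wrapper would.

    sys.executable is a stand-in for a compiled inference binary; the child
    simply upper-cases the prompt so the example stays self-contained.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", prompt],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


start = time.perf_counter()
outputs = [run_once(p) for p in ["hello", "world"]]
elapsed = time.perf_counter() - start
print(outputs, f"{elapsed:.3f}s for 2 calls")
```

With in-process bindings, the model would stay loaded between calls and only the generation itself would be repeated.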

Models

For now, we are focusing on autoregressive generative models.

GPT-J
BLOOM
GPT-NeoX/Pythia/Open-Assistant/Open-Chat-Kit (Architecture supported, need to quantize and upload models.)
CodeGen
LLaMa & Alpaca
OPT & Galactica
T5
RWKV
GPT-2
And more (including multimodal)...

Quantization types:

Int4 with fixed zero-offset
Int4 with variable zero-offset
GPTQ-Int4 with fixed zero-offset
GPTQ-Int4 with variable zero-offset
Int3 quantization, proxy quantization and binning.
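
The difference between a fixed and a variable zero-offset can be sketched in a few lines. The code below is an illustration under stated assumptions, not the project's kernels: it quantizes one block of weights at a time (a block of 32 values is assumed here, in the spirit of GGML's Q4_0 and Q4_1 formats), stores scales as Python floats, and uses hypothetical function names.

```python
def quantize_fixed_offset(block):
    """Int4 with a fixed zero-offset: code c represents (c - 8) * scale."""
    amax = max(abs(v) for v in block)
    scale = amax / 7 if amax > 0 else 1.0
    codes = [max(0, min(15, round(v / scale) + 8)) for v in block]
    return scale, codes


def dequantize_fixed_offset(scale, codes):
    return [(c - 8) * scale for c in codes]


def quantize_variable_offset(block):
    """Int4 with a per-block zero-offset: code c represents lo + c * scale."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [max(0, min(15, round((v - lo) / scale))) for v in block]
    return scale, lo, codes


def dequantize_variable_offset(scale, lo, codes):
    return [lo + c * scale for c in codes]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# A block whose values are all far from zero: the case where a stored
# per-block offset helps most.
block = [1.0 + 0.01 * i for i in range(32)]

scale_f, codes_f = quantize_fixed_offset(block)
err_fixed = max_abs_error(block, dequantize_fixed_offset(scale_f, codes_f))

scale_v, lo_v, codes_v = quantize_variable_offset(block)
err_variable = max_abs_error(block, dequantize_variable_offset(scale_v, lo_v, codes_v))

print(f"fixed offset max error:    {err_fixed:.4f}")
print(f"variable offset max error: {err_variable:.4f}")
```

The variable offset costs one extra stored number per block, but it spends all 16 levels on the range the block actually covers instead of a range centered on zero.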

Contributions

We encourage contributions from the community.

Providing feedback:

Let us know what features you want, what models you want to use.
Reporting bugs, raising issues and sending Pull Requests.

Easy first issues:

The following are some easy first issues through which you can help improve Cformers:

Pick an existing HF model, quantize it, upload it to HF and add it to the mapping in ctransformers/map_model_to_url.py
Add support for new models.
Add support for new quantization types.

Issues on Machine Learning side (some are exploratory):

Try out GPTQ on these models and upload the resulting models to HF.
Benchmark the quantized models. #2
Can we merge Query and Key Matrices for GPT-J/LLaMa? #3
Explore CALM (Confident Adaptive Language Modelling) with 4-bit precision models #4
Saving Keys and Values in memory at lower precision (see FlexGen) #6
Try out other quantization techniques like proxy quantization, etc.
Explore SparseGPT #5
Explore Quantization of Multimodal Models

Non-Python side

If you are allergic to Python, you can:

Port support for fast loading here: ggerganov/llama.cpp#91 (comment)

You can also contribute to LLaMa.cpp and we will port those niceties here.

Add support for int4 quantized weights with a bin/group size greater than 32 in GGML/LLaMa.cpp (potential pitfalls: the intermediate representation may not be losslessly groupable to a bin size above 32, only the weight matrix may be groupable to a bin size above 32, etc.)
Speed up quantized matrix multiplication in GGML/LLaMa.cpp
Add Int3 and Int2 quantization support to GGML/LLaMa.cpp
Add fast Ampere-sparse quantized matrix multiplication functions in GGML/LLaMa.cpp
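
The motivation for group sizes above 32 is storage overhead: each group carries its own metadata, so larger groups amortize it over more weights. A rough sketch of the arithmetic follows, assuming one 32-bit scale per group and no other per-group metadata (both are assumptions for illustration; the actual on-disk layout may differ).

```python
def bits_per_weight(group_size: int, weight_bits: int = 4, metadata_bits: int = 32) -> float:
    """Average storage cost per weight when each group shares one metadata field."""
    return weight_bits + metadata_bits / group_size


for group_size in (32, 64, 128):
    print(group_size, bits_per_weight(group_size))
```

Under these assumptions, moving from groups of 32 to groups of 128 lowers the cost from 5.0 to 4.25 bits per weight, at the price of one scale having to fit a wider range of values.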

Misc. Notes

Our interface is still limited to generation. We are working to support other features:

Allow stopping-generation midway
Anti-prompt for conversation models
Returning embeddings and/or logits.
[Dev] Switch to Python bindings over the C++ kernels and call them from Python.
[Dev] Re-use the code in main.cpp
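
As an illustration of the first two items, stopping midway and anti-prompts can both be expressed as a check on the text generated so far. The sketch below is hypothetical and not part of the current interface: generate_until is an invented name, and the token stream is faked with a list of strings rather than a real model.

```python
from typing import Iterable, List


def generate_until(token_stream: Iterable[str], anti_prompts: List[str]) -> str:
    """Accumulate streamed token strings and stop once an anti-prompt appears.

    The returned text is cut just before the first anti-prompt found.
    """
    text = ""
    for piece in token_stream:
        text += piece
        for stop in anti_prompts:
            idx = text.find(stop)
            if idx != -1:
                return text[:idx]
    return text


# The anti-prompt is split across two tokens on purpose: checking the
# accumulated text, not single tokens, is what catches it.
fake_stream = iter(["The ", "answer ", "is 42.", "<|prom", "pter|>", "Next question"])
reply = generate_until(fake_stream, ["<|prompter|>"])
print(reply)
```

For a conversation model, the anti-prompt would typically be the marker that starts the next user turn, so the model does not go on to write the user's side of the dialogue.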

We would love to hear your ideas on how we can speed up and improve the interface.

License

MIT License

Communication and Support

Discord: https://discord.gg/HGujTPQtR6
