Documentation - API Reference - Changelog - Bug reports - Discord
⚠️ cortex-cpp is currently in Development: Expect breaking changes and bugs!
Cortex-cpp is a streamlined, stateless C++ server engineered to be fully compatible with OpenAI's API, particularly its stateless endpoints. Built on the Drogon server framework for request handling, it includes features such as model orchestration and hardware telemetry that are essential in production environments.
Remarkably compact, the binary size of cortex-cpp is around 3 MB when compressed, with minimal dependencies. This lightweight and efficient design makes cortex-cpp an excellent choice for deployments in both edge computing and server contexts.
GPU acceleration requires CUDA.
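If you plan to run in GPU mode, you can check that a CUDA-capable driver is visible before starting the server; `nvidia-smi` ships with the NVIDIA driver and reports the GPU and installed CUDA version:

```bash
# Verify an NVIDIA GPU and CUDA driver are available (GPU mode only)
nvidia-smi
```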
Ensure that your system meets the following requirements to run Cortex:
- OS:
  - macOS 13.6 or higher.
  - Windows 10 or higher.
  - Ubuntu 18.04 and later.
- RAM (CPU Mode):
  - 8GB for running up to 3B models.
  - 16GB for running up to 7B models.
  - 32GB for running up to 13B models.
- VRAM (GPU Mode):
  - 6GB can load the 3B model (int4) with `ngl` at 120 ~ full speed on CPU/GPU.
  - 8GB can load the 7B model (int4) with `ngl` at 120 ~ full speed on CPU/GPU.
  - 12GB can load the 13B model (int4) with `ngl` at 120 ~ full speed on CPU/GPU.
- Disk: At least 10GB for app and model downloads.
To install Cortex CLI, follow the steps below:

- Download cortex-cpp here: https://github.com/menloresearch/cortex/releases

- Install cortex-cpp by running the downloaded file.

- Download a model:

  ```bash
  mkdir model && cd model
  wget -O llama-2-7b-model.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"
  ```
- Run the cortex-cpp server:

  ```bash
  cortex-cpp
  ```
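  To confirm the server is up before loading a model, you can probe it; the `/healthz` health-check route shown here is an assumption carried over from cortex-cpp's Nitro lineage, so adjust it if your build exposes a different endpoint:

  ```bash
  # Assumed health-check endpoint; expects an HTTP 200 once the server is ready
  curl http://localhost:3928/healthz
  ```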
- Load a model:

  ```bash
  curl http://localhost:3928/inferences/server/loadmodel \
    -H 'Content-Type: application/json' \
    -d '{
      "llama_model_path": "/model/llama-2-7b-model.gguf",
      "ctx_len": 512,
      "ngl": 100
    }'
  ```
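  When you are done with a model, you can free its memory; the `unloadmodel` route below mirrors the `loadmodel` path and is an assumption — verify it against your build's route list:

  ```bash
  # Assumed companion endpoint to loadmodel; unloads the model from memory
  curl http://localhost:3928/inferences/server/unloadmodel
  ```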
- Make an inference:

  ```bash
  curl http://localhost:3928/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {
          "role": "user",
          "content": "Who won the world series in 2020?"
        }
      ]
    }'
  ```
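Because the endpoint follows OpenAI's chat-completions schema, standard OpenAI request options should carry over; for example, setting `stream` to `true` is assumed to return the reply incrementally as server-sent event chunks, as it does with OpenAI's API:

```bash
# Assumes OpenAI-style streaming is supported on this endpoint
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a haiku about GPUs."}
    ]
  }'
```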
Below is the list of model parameters you can set when loading a model in cortex-cpp:

| Parameter | Type | Description |
|---|---|---|
| `llama_model_path` | String | The file path to the LLaMA model. |
| `ngl` | Integer | The number of GPU layers to use. |
| `ctx_len` | Integer | The context length for model operations. |
| `embedding` | Boolean | Whether to use embedding in the model. |
| `n_parallel` | Integer | The number of parallel operations. |
| `cont_batching` | Boolean | Whether to use continuous batching. |
| `user_prompt` | String | The prompt to use for the user. |
| `ai_prompt` | String | The prompt to use for the AI assistant. |
| `system_prompt` | String | The prompt to use for system rules. |
| `pre_prompt` | String | The prompt to use for internal configuration. |
| `cpu_threads` | Integer | The number of threads to use for inferencing (CPU mode only). |
| `n_batch` | Integer | The batch size for the prompt evaluation step. |
| `caching_enabled` | Boolean | Whether to enable prompt caching. |
| `clean_cache_threshold` | Integer | The number of chats that triggers a clean-cache action. |
| `grp_attn_n` | Integer | Group attention factor in self-extend. |
| `grp_attn_w` | Integer | Group attention width in self-extend. |
| `mlock` | Boolean | Prevents the system from swapping the model to disk (macOS). |
| `grammar_file` | String | Path to a GBNF grammar file used to constrain sampling. |
| `model_type` | String | Model type to use: `llm` or `embedding`; defaults to `llm`. |
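As an illustration, here is a `loadmodel` request that combines several of these parameters; the specific values are arbitrary and only show the shape of the payload:

```bash
# Illustrative loadmodel payload; tune values for your hardware and model
curl http://localhost:3928/inferences/server/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "ctx_len": 2048,
    "ngl": 100,
    "n_parallel": 2,
    "cont_batching": true,
    "caching_enabled": true,
    "cpu_threads": 8,
    "model_type": "llm"
  }'
```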
Stable (recommended) builds are available for Windows, macOS, and Linux.
Download the latest or earlier versions of cortex-cpp from the GitHub Releases page.
A manual build is one the developers perform themselves, usually to pick up a newly implemented feature or a bug fix before a release. The build process for this project is defined in `.github/workflows/cortex-build.yml`.
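If you want to reproduce the build locally, a rough sketch follows; the clone URL comes from the releases link above, but the subdirectory layout and CMake invocation are assumptions — treat the workflow file as the source of truth:

```bash
# Sketch only: consult .github/workflows/cortex-build.yml for the exact steps
git clone https://github.com/menloresearch/cortex
cd cortex/cortex-cpp                  # subdirectory name is an assumption
cmake -S . -B build                   # generic CMake configure; flags may differ
cmake --build build --config Release  # generic CMake build step
```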
- For support, please file a GitHub ticket.
- For questions, join our Discord here.
- For long-form inquiries, please email hello@jan.ai.