## ExLlama Fast Inference
Supports the ExLlama WebUI as well as an [Oobabooga WebUI API imitation](https://gist.github.com/BlankParenthesis/4f490630b6307ec441364ab64f3ce900).

Runs models up to 33B at 4-bit on 2x T4, and up to 13B at 4-bit on 1x P100 or 1x T4.

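
As a rough sanity check on these hardware pairings, the weight footprint of a quantized model can be estimated from its parameter count. The sketch below is back-of-the-envelope arithmetic only: it assumes exactly 4 bits per weight and about 16 GB of VRAM per card, and it ignores quantization metadata (group scales and zero points), the KV cache, and activations, all of which add to the real figure. `approx_weight_gb` is a hypothetical helper, not part of ExLlama.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumed VRAM per card; usable memory is somewhat lower in practice.
GPU_VRAM_GB = 16

for size in (13, 33):
    weights = approx_weight_gb(size)
    fits_one = weights < GPU_VRAM_GB
    print(f"{size}B @ 4-bit: ~{weights:.1f} GB of weights, fits on one card: {fits_one}")
```

By this estimate a 33B model needs about 16.5 GB for weights alone, which is consistent with splitting it across two T4s, while a 13B model needs about 6.5 GB and leaves room for context on a single card.
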
### Installation

In [None]:
# Kaggle
%cd /kaggle/

# Colab
# %cd /content/

# Install ExLlama and deps
!pip install -q --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118
!pip install -q safetensors sentencepiece ninja
!pip install -q huggingface_hub

!git clone https://github.com/turboderp/exllama
%cd exllama

# Install WebUI deps
!pip install -q flask waitress

# Install deps for Oobabooga WebUI API imitation
!wget "https://gist.githubusercontent.com/BlankParenthesis/4f490630b6307ec441364ab64f3ce900/raw/38f4feb8ea2c023907eaacf4a98c645bca2dfe3a/api.py"
!pip install -q flask_sock

# Install localtunnel to access Flask/API
!npm install localtunnel

### Model download
Download a model by its Hugging Face repo ID.

In [None]:
# Full repo download method

# Select model
repo_id = "TheBloke/Chronoboros-33B-GPTQ"
#repo_id = "TheBloke/chronos-33b-GPTQ"
#repo_id = "ausboss/llama-30b-supercot-4bit"
#repo_id = "CalderaAI/30B-Lazarus-GPTQ4bit"

#repo_id = "TheBloke/Llama-2-13B-GPTQ"
#repo_id = "TheBloke/chronos-hermes-13B-GPTQ"
#repo_id = "TehVenom/Metharme-13b-4bit-GPTQ"
#repo_id = "TheBloke/Nous-Hermes-13B-GPTQ"

# Select branch
revision="main"
#revision="gptq-4bit-128g-actorder_True"
#revision="gptq-8bit-128g-actorder_True"

# Download model
from huggingface_hub import snapshot_download
model_dir = repo_id.replace("/", "_")  # "owner/name" -> "owner_name"
snapshot_download(repo_id=repo_id, revision=revision, local_dir=f"./{model_dir}")

# Export the directory name so shell commands in later cells can read $MODEL_DIR
import os
os.environ["MODEL_DIR"] = model_dir

print(f"Model dir: './{model_dir}'")

In [None]:
# Old download method - for repos where multiple versions are in the same branch

# Select model
repo_id = "reeducator/bluemoonrp-30b"
model_filename = "bluemoonrp-30b-4bit-128g.safetensors" # From the model repo

# Select branch
revision="main"

# Download the config, the tokenizer, and the selected weights file
from huggingface_hub import hf_hub_download
model_dir = repo_id.replace("/", "_")  # "owner/name" -> "owner_name"
for filename in ("config.json", "tokenizer.model", model_filename):
    hf_hub_download(repo_id=repo_id, revision=revision, filename=filename, local_dir=f"./{model_dir}")

# Export the directory name so shell commands in later cells can read $MODEL_DIR
import os
os.environ["MODEL_DIR"] = model_dir

print(f"Model dir: './{model_dir}'")
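
The per-file cell above fetches exactly three things: `config.json`, `tokenizer.model`, and one `.safetensors` weights file. Checking that a downloaded directory contains them can catch an incomplete download before launching inference. This is a minimal sketch; `missing_model_files` is a hypothetical helper, and treating these three files as the required set is an assumption drawn from the cell above rather than from ExLlama's documentation.

```python
from pathlib import Path


def missing_model_files(model_dir: str) -> list[str]:
    """Return the names of expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in ("config.json", "tokenizer.model")
               if not (root / name).is_file()]
    # Any single .safetensors file counts as the weights.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


# Example: an empty list means all three expected pieces were found.
print(missing_model_files("."))
```

In the notebook, passing `os.environ["MODEL_DIR"]` instead of `"."` checks the directory that the download cells exported.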

In [None]:
# Full repo download for a LoRA adapter

# Select LoRA
repo_id = "Ruaif/Kimiko_13B"

# Select branch
revision="main"

# Download LoRA
from huggingface_hub import snapshot_download
lora_dir = repo_id.replace("/", "_")  # "owner/name" -> "owner_name"
snapshot_download(repo_id=repo_id, revision=revision, local_dir=f"./{lora_dir}")

# Export the directory name so shell commands in later cells can read $LORA_DIR
import os
os.environ["LORA_DIR"] = lora_dir

print(f"Lora dir: './{lora_dir}'")

In [None]:
# Delete downloaded model
!rm -r $MODEL_DIR
!dir
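
A Python version of the cleanup step can skip quietly when the environment variable is unset or the directory is already gone, which makes the cell safe to re-run. This is a minimal sketch; `remove_model_dir` is a hypothetical helper and not part of the notebook's original workflow.

```python
import os
import shutil
from pathlib import Path


def remove_model_dir(env_var: str = "MODEL_DIR") -> bool:
    """Delete the directory named by env_var; return True if something was removed."""
    name = os.environ.get(env_var, "").strip()
    if not name:
        return False  # variable unset or empty: nothing to do
    path = Path(name)
    if not path.is_dir():
        return False  # already deleted, or not a directory
    shutil.rmtree(path)
    return True
```

Calling `remove_model_dir()` targets `$MODEL_DIR`; passing `"LORA_DIR"` removes the downloaded LoRA instead.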

### Run inference
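Run one of the two cells below: either the ExLlama WebUI or the Oobabooga WebUI API imitation. In each cell, uncomment the single command that matches your GPU setup and model.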

In [None]:
# ExLlama WebUI
# Open the localtunnel URL printed below and enter the IP printed by curl as the tunnel password
!curl ipv4.icanhazip.com

# 2x T4
!python ./webui/app.py -d $MODEL_DIR --host "127.0.0.1:5000" -gs 8,11 & npx localtunnel --port 5000
#!python ./webui/app.py -d $MODEL_DIR --host "127.0.0.1:5000" -gs 8,11 -l 4096 & npx localtunnel --port 5000 # 4k context for Llama 2
#!python ./webui/app.py -d $MODEL_DIR --lora $LORA_DIR --host "127.0.0.1:5000" -gs 8,11 -l 4096 & npx localtunnel --port 5000 # 4k context for Llama 2 + lora
#!python ./webui/app.py -d $MODEL_DIR --host "127.0.0.1:5000" -gs 8,11 -l 4096 -a 2.5 & npx localtunnel --port 5000 # 4k NTK scaling for Llama 1

# 1x P100 or 1x T4
#!python ./webui/app.py -d $MODEL_DIR --host "127.0.0.1:5000" & npx localtunnel --port 5000
#!python ./webui/app.py -d $MODEL_DIR --host "127.0.0.1:5000" -l 4096 & npx localtunnel --port 5000 # 4k context for Llama 2
#!python ./webui/app.py -d $MODEL_DIR --lora $LORA_DIR --host "127.0.0.1:5000" -l 4096 & npx localtunnel --port 5000 # 4k context for Llama 2 + lora
#!python ./webui/app.py -d $MODEL_DIR --host "127.0.0.1:5000" -l 4096 -a 2.5 & npx localtunnel --port 5000 # 4k NTK scaling for Llama 1
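
The `-a 2.5` variants above are labeled as NTK scaling. A common formulation of NTK-aware scaling multiplies the rotary embedding base by alpha^(d/(d-2)), where d is the attention head dimension, which stretches the usable context without retraining. The sketch below only illustrates that formula; the base of 10000 and head dimension of 128 are assumed LLaMA defaults, and whether ExLlama applies exactly this expression should be checked against its source.

```python
def ntk_scaled_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    """Rotary base after NTK-aware scaling: base * alpha ** (d / (d - 2))."""
    return base * alpha ** (head_dim / (head_dim - 2))


print(ntk_scaled_rope_base(1.0))  # alpha = 1 leaves the base unchanged
print(ntk_scaled_rope_base(2.5))  # larger alpha -> larger base -> lower rotary frequencies
```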

In [None]:
# Oobabooga WebUI API imitation
# Open the localtunnel URL printed below and enter the IP printed by curl as the tunnel password
# X below stands for the localtunnel hostname
# Standard API: https://X/api
# Streaming API: ws://X/api/v1/stream
!curl ipv4.icanhazip.com

# 2x T4
!python api.py -d $MODEL_DIR -gs 8,11 & npx localtunnel --port 5000
#!python api.py -d $MODEL_DIR -gs 8,11 -l 4096 & npx localtunnel --port 5000 # 4k context for Llama 2
#!python api.py -d $MODEL_DIR --lora $LORA_DIR -gs 8,11 -l 4096 & npx localtunnel --port 5000 # 4k context for Llama 2 + lora
#!python api.py -d $MODEL_DIR -gs 8,11 -l 4096 -a 2.5 & npx localtunnel --port 5000 # 4k NTK scaling for Llama 1

# 1x P100 or 1x T4
#!python api.py -d $MODEL_DIR & npx localtunnel --port 5000
#!python api.py -d $MODEL_DIR -l 4096 & npx localtunnel --port 5000 # 4k context for Llama 2
#!python api.py -d $MODEL_DIR --lora $LORA_DIR -l 4096 & npx localtunnel --port 5000 # 4k context for Llama 2 + lora
#!python api.py -d $MODEL_DIR -l 4096 -a 2.5 & npx localtunnel --port 5000 # 4k NTK scaling for Llama 1
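
The command variants in both cells differ only in a handful of flags: `-d` (model directory), `--lora` (LoRA directory), `--host`, `-gs`, `-l` (context length), and `-a` (NTK alpha). Assembling them programmatically avoids juggling commented-out lines. This is a minimal sketch using only the flags that appear above; `build_command` is a hypothetical helper, and reading `-gs` as a per-GPU memory split is an assumption based on the "2x T4" examples.

```python
import shlex


def build_command(script, model_dir, gpu_split=None, length=None,
                  alpha=None, lora_dir=None, host=None):
    """Return the argument list for launching `script` with the chosen options."""
    args = ["python", script, "-d", model_dir]
    if lora_dir:
        args += ["--lora", lora_dir]
    if host:
        args += ["--host", host]
    if gpu_split:
        args += ["-gs", ",".join(str(g) for g in gpu_split)]
    if length:
        args += ["-l", str(length)]
    if alpha:
        args += ["-a", str(alpha)]
    return args


# 2x T4, 4k context, API imitation
print(shlex.join(build_command("api.py", "my_model", gpu_split=(8, 11), length=4096)))
```

The returned list can be passed to `subprocess.Popen` directly, or joined with `shlex.join` as shown when a shell string is needed.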

### Misc tests

In [None]:
# Benchmarking speeds

# 2x T4
!python test_benchmark_inference.py -d $MODEL_DIR -p -gs 8,11

# 1x P100 or 1x T4
#!python test_benchmark_inference.py -d $MODEL_DIR -p

In [None]:
# Benchmarking perplexity

# 2x T4
!python test_benchmark_inference.py -d $MODEL_DIR -ppl -ppl_ds "./datasets/wikitext2_val_sample.jsonl" -gs 8,11

# 1x P100 or 1x T4
#!python test_benchmark_inference.py -d $MODEL_DIR -ppl -ppl_ds "./datasets/wikitext2_val_sample.jsonl"
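
For reading the perplexity benchmark's result: perplexity is conventionally the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below shows that standard definition on hand-made log-probabilities; it is not ExLlama's evaluation code, and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.5 to every token has a perplexity of 2.
print(perplexity([math.log(0.5)] * 4))
```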