<a href="https://colab.research.google.com/github/AlpinDale/misc-scripts/blob/main/Aphrodite.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the Aphrodite Engine Demo!

You can play around with the API here, or scroll down to see how you can interact with the engine using Python.

By default, a few models have been included. You can load models for a variety of quantization methods, including EXL2, GPTQ, AWQ, GGUF, Marlin, AQLM, and SqueezeLLM. Simply enter their Hugging Face ID in the `Model` field and configure the other options as necessary. Colab only has 16GB of VRAM, so you may be unable to run larger models reliably. A good rule of thumb is to take the model's size in GB and add 3 to it; if the result is less than ~15, you can run it here. If you're unsure what a model ID is, it's essentially the `username/model-name` on Hugging Face. For example, the ID for the URL `https://huggingface.co/Kooten/Kunoichi-DPO-v2-7B-8bpw-exl2` is the URL with the `https://huggingface.co/` part stripped out: `Kooten/Kunoichi-DPO-v2-7B-8bpw-exl2`.
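
As a rough illustration of the two rules above, the following sketch extracts a model ID from a Hugging Face URL and applies the "size plus 3" check. The helper names, the constants, and the example sizes are hypothetical and only mirror the numbers quoted in this notebook; they are not part of Aphrodite.

```python
from urllib.parse import urlparse

COLAB_VRAM_BUDGET_GB = 15  # approximate limit from the rule of thumb above
OVERHEAD_GB = 3            # headroom added to the model's size


def model_id_from_url(url: str) -> str:
    """Return the `username/model-name` part of a Hugging Face model URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"Not a model URL: {url!r}")
    return "/".join(parts[:2])


def fits_on_colab(model_size_gb: float) -> bool:
    """Apply the rule of thumb: model size + 3 should stay below ~15."""
    return model_size_gb + OVERHEAD_GB < COLAB_VRAM_BUDGET_GB


print(model_id_from_url("https://huggingface.co/Kooten/Kunoichi-DPO-v2-7B-8bpw-exl2"))
print(fits_on_colab(7.5))   # hypothetical 7.5 GB model
print(fits_on_colab(13.0))  # hypothetical 13 GB model
```

The check is only a heuristic: actual memory use also depends on the context length and the other options you set in the launch cell.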

If you're on mobile, please tap on the play button below. If you're not, you can safely skip it and go to the next cell.

If you run into any problems, open an issue [here](https://github.com/AlpinDale/misc-scripts/issues).

For **20B models**, make sure the model is quantized and set `GPU_Memory_Utilization` to 0.9. Don't request more than 200 tokens per reply, or the engine may run out of memory.
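
To get a feel for why the context length and the KV cache data type affect memory, here is a back-of-the-envelope estimate. It assumes a standard attention layout where each layer caches one key and one value vector per KV head for every token. The layer count, head count, and head size below are illustrative assumptions rather than values read from any model config, and the sketch ignores how the engine actually pre-allocates its cache.

```python
def kv_cache_bytes(context_length: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: one key vector and one value vector per layer per token.
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * context_length)


# Assumed, illustrative architecture values (hypothetical model shape).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for name, width in (("fp16", 2), ("fp8", 1)):
    gib = kv_cache_bytes(8192, bytes_per_value=width, **shape) / 2**30
    print(f"{name}: {gib:.2f} GiB for an 8192-token context")
```

Under these assumptions, storing the cache in FP8 needs half the memory of FP16, and the cost grows linearly with the context length.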

In [None]:
#@title <b>v-- Tap this if you play on Mobile</b> { display-mode: "form" }
%%html
<b>Press play on the music player to keep the tab alive, then start the engine below (you can ignore this step if you used "Run all" and are on PC)</b><br/>
<audio src="https://raw.githubusercontent.com/KoboldAI/KoboldAI-Client/main/colab/silence.m4a" controls></audio>

In [None]:
#@title <b>v-- Run this cell to start the engine.</b>
#@markdown **The free plan on Google Colab only supports up to 13B (quantized).**
#@markdown **You can also enter a custom model in addition to the default ones. Supported model types are:**
#@markdown ****

Model = "Kooten/Kunoichi-DPO-v2-7B-8bpw-exl2" #@param ["Kooten/Kunoichi-DPO-v2-7B-8bpw-exl2", "TheBloke/UNA-TheBeagle-7B-v1-GPTQ", "LoneStriker/Fimbulvetr-11B-v2-GPTQ", "TheBloke/OpenHermes-2.5-Mistral-7B-AWQ"]{allow-input: true}
#@markdown **The specific model branch to download. Useful for exl2 models where every bpw is on separate branches.**
Revision = "main" #@param []{allow-input: true}
#@markdown **Should be auto-recognized for most models. If you receive a KeyError, or unexpectedly run out of memory for small models, please use this to specify the correct quant format. Most exl2 models have this issue, so configure this for exl2 models.**
Quantization = "exl2" #@param ["None", "exl2", "gptq", "awq", "aqlm", "quip", "marlin"]
#@markdown **Adjust this and the Context Length slider if you're running into COOM (CUDA Out Of Memory) issues!**
GPU_Memory_Utilization = 0.95 #@param {type:"slider", min:0, max:1, step:0.01}
#@markdown **The free Colab GPU may not have enough memory to accommodate a context length above 8192 for most models.**
Context_Length = 8192 #@param {type:"slider", min:1024, max:32768, step:1024}
#@markdown **Use FP8 KV Cache to reduce memory usage and allow higher context lengths.**
FP8_KV_Cache = False #@param {type:"boolean"}
#@markdown **Check this to launch a Kobold-compatible API in addition to the OpenAI one. Keep in mind that the API key does not protect Kobold routes.**
launch_kobold_api = False #@param {type:"boolean"}
#@markdown **[OPTIONAL] Enter an API key to secure your API.**
OpenAI_API_Key = "" #@param []{allow-input: true}


%pip show aphrodite-engine &> /dev/null && echo "Existing Aphrodite Engine installation found. Updating..." && pip uninstall aphrodite-engine -q -y
!echo "Installing/Updating the Aphrodite Engine, this may take a while..."
%pip install aphrodite-engine --extra-index-url https://downloads.pygmalion.chat/whl > /dev/null 2>&1
!echo "Installation successful! Starting the engine now."


!wget -q -c https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
!chmod +x cloudflared-linux-amd64
!echo "Creating a Cloudflare URL..."
!nohup ./cloudflared-linux-amd64 tunnel --url http://127.0.0.1:2242 &> nohup.out &
!sleep 10
!echo "============================================================"
!echo "Please copy this URL:"
!grep -o 'https://[^ ]*\.trycloudflare\.com' nohup.out
!echo "============================================================"

model = Model
gpu_memory_utilization = GPU_Memory_Utilization
context_length = Context_Length
api_key = OpenAI_API_Key
quant = Quantization
fp8_kv = FP8_KV_Cache
kobold = launch_kobold_api
revision = Revision

command = [
    "aphrodite run",
    model,
    "--dtype", "float16",
    "--host", "127.0.0.1",
    "--gpu-memory-utilization", str(gpu_memory_utilization),
    "--max-model-len", str(context_length),
    "--max-log-len", "0",
    "--revision", revision
]

if kobold:
    command.append("--launch-kobold-api")

if quant != "None":
    command.extend(["-q", quant])

if fp8_kv:
    command.extend(["--kv-cache-dtype", "fp8"])

if api_key != "":
    command.extend(["--api-keys", api_key])

!{" ".join(command)} &
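
Once the engine is up, you can send requests to the URL printed above. The sketch below uses only the standard library and assumes the server exposes an OpenAI-style `/v1/completions` route that accepts the API key as a bearer token. The function names are hypothetical, and the URL in the commented example is a placeholder, so check the route and response shape against your running server.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, model, api_key="", max_tokens=200):
    """Build a POST request for an OpenAI-style completions endpoint."""
    url = base_url.rstrip("/") + "/v1/completions"  # assumed route
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers=headers, method="POST",
    )


def complete(base_url, prompt, model, api_key=""):
    """Send the request and return the first completion's text."""
    request = build_completion_request(base_url, prompt, model, api_key)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example call (replace the placeholder with the URL printed by the cell above):
# print(complete("https://example.trycloudflare.com", "Once upon a time,",
#                "Kooten/Kunoichi-DPO-v2-7B-8bpw-exl2"))
```

The default of 200 for `max_tokens` follows the per-reply limit suggested earlier for larger models.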

## For Developers

You can use the `aphrodite` library to load an LLM and generate text with it. To get started, install the package using pip.

In [None]:
%pip install aphrodite-engine --extra-index-url https://downloads.pygmalion.chat/whl

Now, let's define some prompts. We'll pass these to the model and have it generate completions for them.

In [None]:
prompts = [
    "Describe a serene and peaceful forest clearing on a warm summer day. Include details about the sights, sounds, and smells that one might experience in this tranquil setting.",
    "Write a short story about a curious robot named Zephyr who discovers an ancient, mysterious artifact hidden deep within an abandoned factory. The artifact holds a secret that could change the course of robot history.",
    "Create a dialogue between two friends discussing their dreams and aspirations for the future. One friend is an optimist, while the other is more pragmatic and cautious.",
    "Compose a haiku about the beauty and simplicity of a single cherry blossom falling from a tree in the springtime.",
]

Next, we'll import the `LLM` and `SamplingParams` classes from Aphrodite. The `LLM` class handles model loading and all the configuration related to it, while `SamplingParams` handles the sampler settings.
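
To give a sense of what sampler settings such as `temperature` and `min_p` do, here is a simplified, pure-Python sketch of temperature scaling followed by a min-p filter. It illustrates the general technique only and is not Aphrodite's implementation; the function name, the order of the two steps, and the toy logits are assumptions made for this example.

```python
import math
import random


def sample_min_p(logits, temperature=0.9, min_p=0.1, rng=random):
    """Pick a token index using temperature scaling and a min-p filter."""
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]

    # Min-p: drop tokens whose probability is below min_p times the top one.
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    indices, kept_probs = zip(*kept)
    return rng.choices(indices, weights=kept_probs, k=1)[0]


toy_logits = [4.0, 3.5, 1.0, -2.0]  # hypothetical scores for four tokens
print(sample_min_p(toy_logits))
```

With these toy logits, the last two tokens fall below the min-p threshold, so only the first two can ever be selected.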

In [None]:
from aphrodite import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.9, min_p=0.1, max_tokens=256)

llm = LLM(model="Kooten/Kunoichi-DPO-v2-7B-8bpw-exl2", quantization="exl2",
          enforce_eager=True)

outputs = llm.generate(prompts, sampling_params)

Great! That should've generated some text for our prompts. Now, let's view the results.

In [None]:
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated text: {output.outputs[0].text}\n")

To kill the engine and free up memory, run this.

In [None]:
import gc

import torch
from aphrodite.modeling.megatron.parallel_state import destroy_model_parallel

destroy_model_parallel()
del llm.llm_engine.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()