# **Hosting Llama 2 with Free GPU via Google Colab**

**Before getting started, if running on Google Colab, check that the runtime is set to T4 GPU**
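
One quick way to confirm the runtime type is to ask `nvidia-smi` for the GPU name from Python. The sketch below is only a convenience check; the helper name `gpu_name` is made up for this example, and it assumes `nvidia-smi` is on the `PATH` whenever a GPU runtime is attached.

```python
import shutil
import subprocess


def gpu_name():
    """Return the name of the first visible NVIDIA GPU, or None if there is none."""
    # nvidia-smi is only present when an NVIDIA driver is installed
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


name = gpu_name()
print(name or "No GPU detected: switch the runtime type to T4 GPU")
```

If this prints the "No GPU detected" message, change the runtime type before running the cells below.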

## Install Dependencies
- Requirements for running FastAPI Server
- Requirements for creating a public model serving URL via Ngrok
- Requirements for running Llama2 7B (including Quantization)


In [1]:
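# Build llama-cpp-python with CUDA (cuBLAS) support
# Note: newer llama-cpp-python releases renamed this flag to -DGGML_CUDA=on
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python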



In [2]:
# If pip complains about the dependency resolver, the warning is safe to ignore
!pip install fastapi[all] uvicorn python-multipart transformers pydantic tensorflow



In [3]:
# This downloads and sets up the Ngrok executable in the Google Colab instance
# Import the ngrok GPG key
!curl -s https://ngrok-agent.s3.amazonaws.com/ngrok.asc | gpg --import -

# Add the ngrok repository to the apt sources list
!echo "deb https://ngrok-agent.s3.amazonaws.com buster main" | sudo tee /etc/apt/sources.list.d/ngrok.list

# Fetch the public key associated with the ngrok repository
!sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 0E61D3BBAAEE37FE

# Update the apt package lists
!sudo apt-get update

# Install ngrok
!sudo apt-get install ngrok


gpg: key 0E61D3BBAAEE37FE: "ngrok agent apt repo release bot <release-bot@ngrok.com>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1
deb https://ngrok-agent.s3.amazonaws.com buster main
Executing: /tmp/apt-key-gpghome.IUuzq2oee8/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 0E61D3BBAAEE37FE
gpg: key 0E61D3BBAAEE37FE: "ngrok agent apt repo release bot <release-bot@ngrok.com>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1
Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://ngrok-agent.s3.amazonaws.com buster InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net

Ngrok is used to make the FastAPI server accessible via a public URL.

Users are required to make a free account and provide their auth token to use Ngrok. The free version only allows 1 local tunnel and the auth token is used to track this usage limit.
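
Because the auth token is a secret, it may be preferable to keep it out of the notebook itself. The following is a minimal sketch that reads the token from an environment variable and builds the same `ngrok authtoken` command used in the next cell. The variable name `NGROK_AUTHTOKEN` and the helper `authtoken_command` are assumptions made for this example.

```python
import os


def authtoken_command(env_var="NGROK_AUTHTOKEN"):
    """Build the ngrok auth command from a token stored in the environment."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your ngrok auth token first"
        )
    # Same command as the notebook cell, with the token supplied at run time
    return ["ngrok", "authtoken", token]
```

The returned list can be passed to `subprocess.run(..., check=True)`, so the token never appears in the saved notebook.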

In [4]:
# https://dashboard.ngrok.com/signup
# Replace the placeholder with your own token; do not share or commit a real token
!ngrok authtoken YOUR_NGROK_AUTHTOKEN

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


## Create FastAPI App
This provides an API to the Llama 2 model. The model version can be changed in the code below as desired.

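For this demo we use the 7 billion parameter base model in a 4-bit quantized GGUF file (`llama-2-7b.Q4_K_M.gguf`). It is not the instruction-tuned (chat) variant, so it continues the prompt rather than following instructions.

Quantization costs some output quality, but it shrinks the model enough to load quickly and fit comfortably on a T4 GPU.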
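
To get a rough feel for why quantization matters on a T4 (16 GB of GPU memory), the weight footprint can be estimated as parameters times bits per weight. This is a back-of-the-envelope sketch, not a measurement: the parameter counts are rounded, real 4-bit GGUF schemes such as Q4_K_M use somewhat more than 4 bits per weight on average, and the KV cache needs additional memory on top.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Rounded parameter counts, for illustration only
for model, n_params in [("7B", 7e9), ("13B", 13e9)]:
    for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
        size = weight_size_gb(n_params, bits)
        print(f"{model:>3} {label:>5}: ~{size:.1f} GB")
```

Under these assumptions, unquantized fp16 weights alone come close to or exceed the T4's memory, while the 4-bit versions leave room for the context.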

In [5]:
%%writefile app.py
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import tensorflow as tf
from transformers import AutoModel, AutoTokenizer

from fastapi import FastAPI
from fastapi import HTTPException
from pydantic import BaseModel
from typing import Any

# Quantized GGUF model so that Llama 2 7B fits on a T4 GPU

GENERATIVE_AI_MODEL_REPO = "TheBloke/Llama-2-7B-GGUF"
GENERATIVE_AI_MODEL_FILE = "llama-2-7b.Q4_K_M.gguf"

model_path = hf_hub_download(
    repo_id=GENERATIVE_AI_MODEL_REPO,
    filename=GENERATIVE_AI_MODEL_FILE
)

llama2_model = Llama(
    model_path=model_path,
    n_gpu_layers=64,
    n_ctx=2000
)

# Test an inference
print(llama2_model(prompt="Hello ", max_tokens=1))




# Load personal model: Uncomment/Modify this to use your model
# model_id = "EleutherAI/gpt-neox-20b"
# my_model = AutoModel.from_pretrained("PsyDak-Meng/GPT-neo-x-20B-arxiv")
# tokenizer = AutoTokenizer.from_pretrained(model_id)



app = FastAPI()

# This defines the data json format expected for the endpoint, change as needed
class TextInput(BaseModel):
    inputs: str
    parameters: dict[str, Any] | None = None  # optional; defaults to no extra parameters

@app.get("/")
def status_gpu_check() -> dict[str, str]:
    # tf.test.is_gpu_available() is deprecated; list the physical GPU devices instead
    gpu_msg = "Available" if tf.config.list_physical_devices("GPU") else "Unavailable"
    return {
        "status": "I am ALIVE!",
        "gpu": gpu_msg
    }

@app.post("/generate/")
async def generate_text(data: TextInput) -> dict[str, str]:
    try:
        print(type(data))
        print(data)
        params = data.parameters or {}

        # Llama 2 7B for faster inference
        response = llama2_model(prompt=data.inputs, **params)
        model_out = response['choices'][0]['text']

        # Personal model: Uncomment/Modify this to use your model
        # inputs = tokenizer(data.inputs, return_tensors="pt")
        # response = my_model.generate(**inputs, max_new_tokens=100)
        # model_out = tokenizer.decode(response[0], skip_special_tokens=True)

        return {"generated_text": model_out}

    except Exception as e:
        print(e)
        print(type(data))
        print(data)
        # Return the error message itself (not its length) to the client
        raise HTTPException(status_code=500, detail=str(e))


Overwriting app.py
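
The `/generate/` endpoint passes the client's `parameters` straight into the model call via `**params`. On a public URL it may be worth filtering them first; for example, `stream=True` would make the call return an iterator and break the `response['choices']` lookup. The sketch below shows one possible filter. The helper name, the allow-list, and the cap of 512 tokens are all choices made for this example, not part of the original app.

```python
from typing import Any, Optional

# Completion parameters this example chooses to expose (an assumption, adjust as needed)
ALLOWED_PARAMS = {"max_tokens", "temperature", "top_p", "top_k", "stop", "repeat_penalty"}


def sanitize_params(params: Optional[dict], max_tokens_cap: int = 512) -> dict:
    """Keep only allow-listed keys and clamp max_tokens to a sane range."""
    clean: dict[str, Any] = {
        key: value for key, value in (params or {}).items() if key in ALLOWED_PARAMS
    }
    if "max_tokens" in clean:
        clean["max_tokens"] = max(1, min(int(clean["max_tokens"]), max_tokens_cap))
    return clean
```

In `app.py`, this would replace `params = data.parameters or {}` with `params = sanitize_params(data.parameters)`.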


## Start FastAPI Server
The initial run will take a long time because the model has to be downloaded and loaded onto the GPU.

Note: interrupting the Google Colab runtime will send a SIGINT and stop the server.

Check the logs at server.log to see progress.

When successful, it should report that the FastAPI server is alive and that the GPU is available.

In [6]:
# The server will start the model download and will take a while to start up

import subprocess
import time

from ipywidgets import HTML
from IPython.display import display

t = HTML(
    value="0 Seconds",
    description = 'Server is Starting Up... Elapsed Time:' ,
    style={'description_width': 'initial'},
)
display(t)

flag = True
timer = 0

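try:
    # If the server already responds, skip the startup
    subprocess.check_output(['curl',"localhost:8000"])
    flag = False
except Exception:
    # Otherwise launch uvicorn in the background and log to server.log
    get_ipython().system_raw('uvicorn app:app --host 0.0.0.0 --port 8000 > server.log 2>&1 &')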

res = ""

while(flag and timer < 600):
  try:
    subprocess.check_output(['curl',"localhost:8000"])
  except Exception as error:
    # print(error)
    time.sleep(1)
    timer+= 1
    t.value = str(timer) + " Seconds"
    pass
  else:
    flag = False

if(timer >= 600):
  print("Error: timed out! Took more than 10 minutes :(")
subprocess.check_output(['curl',"localhost:8000"])

b'{"status":"I am ALIVE!","gpu":"Available"}'
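
The startup cell above polls the server by shelling out to `curl`. The same wait-until-ready pattern can be written with the standard library only, as in this sketch. The function names `is_up` and `wait_until` are invented for the example, and the 600 second default mirrors the notebook's timeout.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout_s=2.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: the server is not ready yet
        return False


def wait_until(probe, timeout_s=600.0, interval_s=1.0):
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Usage would look like `wait_until(lambda: is_up("http://localhost:8000"))`, which returns `False` on timeout instead of raising.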

## Use Ngrok to create a public URL for the FastAPI server.
**IMPORTANT:** If you created an account via email, please verify your email or the next 2 cells won't work.

If you signed up via Google or GitHub account, you're good to go.

To hit the model endpoint, append `/generate/` to the URL.

In [7]:
# This starts Ngrok and creates the public URL
import subprocess
import time
import sys
import json

from IPython import get_ipython
get_ipython().system_raw('ngrok http 8000 &')
time.sleep(1)
curlOut = subprocess.check_output(['curl',"http://localhost:4040/api/tunnels"],universal_newlines=True)
time.sleep(1)
ngrokURL = json.loads(curlOut)['tunnels'][0]['public_url']
%store ngrokURL
print(ngrokURL)

Stored 'ngrokURL' (str)
https://3482-35-185-209-85.ngrok-free.app
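
The cell above takes `tunnels[0]` right after a one-second sleep, which fails with an `IndexError` if ngrok has not registered the tunnel yet. A slightly more defensive way to read the same JSON is sketched below. The helper `pick_public_url` is hypothetical, and it relies only on the `tunnels` and `public_url` fields already used above.

```python
import json


def pick_public_url(tunnels_json, prefer_scheme="https"):
    """Extract a public URL from the ngrok tunnels response, preferring HTTPS."""
    tunnels = json.loads(tunnels_json).get("tunnels", [])
    if not tunnels:
        raise RuntimeError("ngrok reported no tunnels yet; wait a moment and retry")
    for tunnel in tunnels:
        url = tunnel.get("public_url", "")
        if url.startswith(prefer_scheme + "://"):
            return url
    # Fall back to the first tunnel if none matches the preferred scheme
    return tunnels[0]["public_url"]
```

With this helper, the lookup becomes `ngrokURL = pick_public_url(curlOut)`, and an empty response produces a readable error.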


## Testing API
The URL from the previous cell is stored and referenced in this driver code. You can change the prompt under *inputs*. Let it run.

In [13]:
import requests
# Define the URL for the FastAPI endpoint
%store -r ngrokURL

# Define the data to send in the POST request
data = {
  "inputs": '''
    What is a pad thai?
''',
  # Parameters can be found here: https://abetlen.github.io/llama-cpp-python/#llama_cpp.llama.Llama.create_completion
  "parameters": {"temperature":0.1,
                 "max_tokens":200}
  # A higher temperature gives more creative responses; a lower one gives more precise responses
  # max_tokens is the maximum number of tokens (roughly, "words") to generate
}


# Send the POST request
response = requests.post(ngrokURL + "/generate/", json=data)

# Check the response
if response.status_code == 200:
    result = response.json()
    print("Generated Text:\n", data["inputs"], result["generated_text"].strip())
else:
    print("Request failed with status code:", response.status_code)

Generated Text:
 
    What is a pad thai?
 It's a dish of stir-fried rice noodles with eggs, vegetables and protein.

    What are the ingredients for a pad thai?

    - Rice noodles
    - Eggs
    - Vegetables (usually bean sprouts, green onions, cilantro)
    - Protein (usually chicken or shrimp)

    What is the history of a pad thai?

    It's unclear exactly when it was invented. The dish has been around for centuries and there are many different versions of it in Thailand, but it wasn't until 1938 that it became popularized by a street vendor named Kham Jai. He started selling the dish on the streets of Bangkok and it quickly became a hit with locals and tourists alike.

    What is the process
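
The `temperature` parameter in the request above controls how sharply the model favors its most likely next token. As a simplified illustration (the actual sampler in llama.cpp applies further steps such as top-k and top-p filtering), temperature divides the logits before the softmax. The logits below are made-up numbers, not model output.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature almost all probability mass lands on the top token, while higher temperatures flatten the distribution and let less likely tokens through.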


## Shutting Down
To shut down the processes, uncomment and run the following commands.

In [9]:
# !pkill uvicorn

In [10]:
# !pkill ngrok