# Project: Hosting a Llama model as an API using ngrok and FastAPI

## Step 1: Installing the project dependencies

1. Dependencies for FastAPI
2. ngrok (via pyngrok), to make the local server publicly accessible
3. Dependencies for downloading the model from Hugging Face that the API will serve, so that API requests can be used for predictions, classifications, chat completions, and so on

In [None]:
!pip install llama-cpp-python
!pip install pyngrok fastapi[all] uvicorn python-multipart transformers pydantic tensorflow requests

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.90.tar.gz (63.8 MB)
     63.8/63.8 MB 12.9 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   45.5/45.5 kB 4.0 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.90-cp310-cp310-linux_x86_64.whl size=3414591 sha256=2651b865ee7b3a4ee8
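
Since the install output is long, a quick check of what actually got installed can help before moving on. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# Distribution names as passed to pip in the install cell.
REQUIRED = [
    "llama-cpp-python", "pyngrok", "fastapi", "uvicorn",
    "python-multipart", "transformers", "pydantic", "tensorflow", "requests",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` would need the corresponding `pip install` line to be re-run.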

In [None]:
# Alternative to pyngrok (left commented out): download the ngrok binary directly
# !wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip -q
# !unzip -q ngrok-stable-linux-amd64.zip
import getpass
from pyngrok import ngrok, conf

print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth")
conf.get_default().auth_token = getpass.getpass()



Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth
··········


In [None]:
auth_token = 'paste the authtoken from your ngrok dashboard here'
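
To avoid pasting the token into a notebook cell, one option is to read it from an environment variable and only prompt when it is missing. This is a sketch under assumptions: the `resolve_authtoken` helper is hypothetical, and using `NGROK_AUTHTOKEN` as the variable name is a choice made here, not something the notebook sets up.

```python
import getpass
import os


def resolve_authtoken(env=os.environ, prompt=getpass.getpass):
    """Return the ngrok authtoken from NGROK_AUTHTOKEN, else prompt for it."""
    token = env.get("NGROK_AUTHTOKEN", "").strip()
    if not token:
        # Fall back to a hidden prompt, as in the getpass cell earlier
        token = prompt("ngrok authtoken: ").strip()
    if not token:
        raise ValueError("No ngrok authtoken provided")
    return token


# In the notebook, the result would then be handed to pyngrok, for example:
# conf.get_default().auth_token = resolve_authtoken()
```

The `env` and `prompt` arguments exist only so the helper can be exercised without a real environment variable or an interactive prompt.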

In [40]:
%%writefile app.py
# FastAPI app that defines the endpoints for calling the model

from typing import Any

from fastapi import FastAPI
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
from pydantic import BaseModel


class CompletionRequest(BaseModel):
  inputs: str  # prompt text to complete
  parameters: dict[str, Any] | None = None  # optional keyword arguments for the model call


GENERATIVE_AI_MODEL_REPO = "TheBloke/Llama-2-7B-GGUF"
GENERATIVE_AI_MODEL_FILE = "llama-2-7b.Q4_K_M.gguf"

# Download the GGUF file (cached after the first run) and load the model once at startup
model_path = hf_hub_download(filename=GENERATIVE_AI_MODEL_FILE, repo_id=GENERATIVE_AI_MODEL_REPO)
llama2_model = Llama(model_path=model_path, n_gpu_layers=64, n_ctx=2000)

app = FastAPI()


@app.get('/')
def index():
  return "Welcome to Llama2 assistant Page"


@app.post('/complete_text')
def complete_text(body: CompletionRequest):
  # Forward any optional parameters (e.g. temperature, max_tokens) to the model
  params = body.parameters or {}
  response = llama2_model(prompt=body.inputs, **params)
  output = response['choices'][0]['text']
  return {"generated_text": output}







Overwriting app.py
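
Because `complete_text` forwards whatever is in `parameters` straight to the model call, a misspelled or unexpected key only fails inside the model code. One way to catch this earlier is an allow-list check. The sketch below is illustrative: `ALLOWED_PARAMS` and `filter_params` are hypothetical names, and the set is a small, non-exhaustive selection of `Llama.create_completion` arguments.

```python
from typing import Any, Dict, Optional

# Hypothetical allow-list: a subset of Llama.create_completion keyword arguments
ALLOWED_PARAMS = {"max_tokens", "temperature", "top_p", "top_k", "stop", "repeat_penalty"}


def filter_params(params: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of params, rejecting any key that is not on the allow-list."""
    if not params:
        return {}
    unknown = set(params) - ALLOWED_PARAMS
    if unknown:
        raise ValueError(f"Unsupported parameters: {sorted(unknown)}")
    return dict(params)


print(filter_params({"temperature": 0.1, "max_tokens": 200}))
```

Inside the endpoint, the `ValueError` could be turned into a client error, for example with `raise HTTPException(status_code=400, detail=str(err))`.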


In [41]:
## Use uvicorn to serve the FastAPI app. Since we are in Colab, get_ipython().system_raw runs the command in the background of the interactive session.
## At this point the app is served inside the Colab environment, which is not publicly accessible.
get_ipython().system_raw('uvicorn app:app --host 0.0.0.0 --port 8000 --reload > server.log 2>&1 &')
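
The server is started in the background, and `app.py` downloads and loads the model at import time, so the endpoints may not respond right away. A minimal sketch of waiting until the index route answers is shown below; `wait_for_http` is a hypothetical helper and the timeout values are guesses, not measured numbers.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=600.0, interval=2.0):
    """Poll url until it returns HTTP 200 or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not up yet (connection refused, timeout, or error status)
            pass
        time.sleep(interval)
    return False


# Example usage in the notebook (port 8000 matches the uvicorn command):
# if not wait_for_http("http://localhost:8000/"):
#     print("Server did not start; check server.log")
```

If the wait times out, `server.log` is the place to look, since the uvicorn command redirects its output there.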

In [42]:
## ngrok connects port 8000 of the Colab environment to a public URL

from pyngrok import ngrok, conf
public_url = ngrok.connect(8000)
print(f"Public URL: {public_url}")

Public URL: NgrokTunnel: "https://bbbe-34-125-210-28.ngrok-free.app" -> "http://localhost:8000"


Open `<Public URL>/docs` to access the endpoints defined in our FastAPI
app, which include:

1. `/`: index page that returns a welcome message
2. `/complete_text`: calls our Llama model to complete the given text



In [44]:
## We can also test the endpoints using the requests library
import requests
url = public_url.public_url

print(requests.get(f'{url}/').json())
data = {
  "inputs": '''
Tell me how to make a chocolate cake?
''',
  # Parameters are documented at https://abetlen.github.io/llama-cpp-python/#llama_cpp.llama.Llama.create_completion
  "parameters": {"temperature":0.1,
                 "max_tokens":200}
  # A higher temperature gives more varied output; a lower one gives more deterministic output.
  # max_tokens caps the number of tokens (roughly, word pieces) that can be generated.
}

response = requests.post(f'{url}/complete_text',json = data)
print("Generated text",response.json())

Welcome to Llama2 assistant Page
Generated text {'generated_text': "\nI'm not sure if this is the right place to ask this question, but I'm not sure where else to go.\n\nI'm trying to make a chocolate cake, but I'm not sure how to do it. I've tried looking online, but I can't find anything that's really helpful.\n\nI'm not sure if I need to use a mixer or not, and I'm not sure what kind of chocolate to use.\n\nI'm not sure if I need to use a mixer or not, and I'm not sure what kind of chocolate to use.\n\nI'm not sure if I need to use a mixer or not, and I'm not sure what kind of chocolate to use.\n\nI'm not sure if I need to use a mixer or not, and I'm not sure what kind of"}
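
The generated text above repeats the same sentence several times. `create_completion` also accepts `repeat_penalty` and `stop`, which may help with this. The sketch below only builds the request payload; the `build_payload` helper, the question/answer prompt framing, and the specific values are assumptions that have not been tried against this model.

```python
import json


def build_payload(prompt, max_tokens=200, temperature=0.1, repeat_penalty=1.2, stop=None):
    """Build the JSON body expected by the /complete_text endpoint."""
    params = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "repeat_penalty": repeat_penalty,  # values above 1.0 penalize repeated tokens
    }
    if stop:
        params["stop"] = list(stop)  # generation halts when any of these strings appears
    return {"inputs": prompt, "parameters": params}


payload = build_payload("Q: Tell me how to make a chocolate cake?\nA:", stop=["\nQ:"])
print(json.dumps(payload, indent=2))

# In the notebook this would be sent as:
# response = requests.post(f"{url}/complete_text", json=payload)
```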
