# Project Radar: LLM Solution on Databricks

#### Persona 
You are an ML Engineer for a mid-size technology consultancy (Company).

#### Scenario
Project managers in the Company tried ChatGPT. They went to the Managing Director of Professional Services for the Company and said, "We want ChatGPT for our engagements."

#### Use Case
Project managers in the professional services division of the Company want to enable their customers to ask questions about their specific project using data from the project status report. However, the leadership of the Company doesn't want the project data sent to OpenAI (ChatGPT). As an ML Engineer, you have to create a Large Language Model (LLM) solution within Azure and Databricks that avoids the use of third-party LLM services.

#### Approach


1. Use the open-source Llama 2 model instead of OpenAI's GPT models.
2. Serve the model for inference with Databricks [Model Serving](https://docs.databricks.com/machine-learning/model-serving/index.html), with some MLflow goodness.


##### About Llama2
*_Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks tested, and in human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM._*

# I. Prerequisites

Environment for this notebook:
- Runtime: 13.2 GPU ML Runtime
- Instance: Recommended - `Standard_NC6s_v3` on Azure. *_This was also tested on `Standard_DS3_v2` (non-GPU)_*
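
As a rough guide to sizing the instance, the memory needed for the weights alone can be estimated from the parameter count and the dtype. The sketch below assumes a nominal 7 billion parameters and 2 bytes per parameter (the notebook later loads the model in `bfloat16`); the function name is illustrative. It ignores activations and the generation cache, so treat the result as a lower bound to compare against your GPU's memory.

```python
# Rough weight-memory estimate for a model loaded in a 16-bit dtype.
# Assumption: ~7 billion parameters (the nominal size of the 7B model).
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# bfloat16 and float16 both use 2 bytes per parameter
llama2_7b_gib = weight_memory_gib(7e9)
print(f"Llama-2-7b weights in 16-bit: ~{llama2_7b_gib:.1f} GiB")
```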

LLM needs
1. Request access from [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads)
2. Accept TOS on Hugging Face [repo](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

Estimated time to run: ~6 minutes

## A. Install some stuff

*_Time to run: ~30 seconds_*

In [0]:
%pip install --upgrade "mlflow-skinny[databricks]>=2.4.1" --quiet

In [0]:
dbutils.library.restartPython()

In [0]:
from huggingface_hub import notebook_login

# Enter your Hugging Face token. Get it from https://huggingface.co/settings/tokens
notebook_login()

## B. Download Llama2

*_Time to run: ~3 - 5 minutes (depending on internet connection)._*

In [0]:
# Pin the revision to a commit hash for reproducibility, because the uploader might change the model afterwards.
# The commit history of Llama-2-7b-chat-hf is at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/commits/main
model = "meta-llama/Llama-2-7b-chat-hf"
revision = "0ede8dd71e923db6258295621d817ca8714516d4"

from huggingface_hub import snapshot_download

# If the model was already downloaded, this does not download the large model files again; it only fetches the files that are still missing
snapshot_location = snapshot_download(repo_id=model, revision=revision)
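
Before logging the model, it can help to confirm that the snapshot directory actually contains what you expect. The following is a minimal sketch using only the standard library; `summarize_snapshot`, `missing_files`, and the list of expected file names are illustrative assumptions, not part of `huggingface_hub`.

```python
from pathlib import Path


def summarize_snapshot(snapshot_dir):
    """Return {relative_path: size_in_bytes} for every file under snapshot_dir."""
    root = Path(snapshot_dir)
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def missing_files(snapshot_dir, expected=("config.json", "tokenizer_config.json")):
    """Return the expected file names (an assumed list) that are absent from the snapshot."""
    present = summarize_snapshot(snapshot_dir)
    return [name for name in expected if name not in present]


# In the notebook you would call, for example:
#   print(missing_files(snapshot_location))
```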

# II. MLflow

## A. Define the Model
*_Time to run: ~ 30 seconds_*

In [0]:
import mlflow
import torch
import transformers

# Define prompt template. This is straight from the Meta repo https://github.com/facebookresearch/llama/blob/main/llama/generation.py#L212

DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

# Define PythonModel to log with mlflow.pyfunc.log_model https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html

class Llama2(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        """
        This method initializes the tokenizer and language model
        using the specified model repository.
        """
        # Initialize tokenizer and language model
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            context.artifacts['repository'], padding_side="left")
        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            context.artifacts['repository'], 
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True, 
            trust_remote_code=True,
            device_map="auto",
            pad_token_id=self.tokenizer.eos_token_id)
        self.model.eval()

    def _build_prompt(self, instruction):
        """
        This method generates the prompt for the model.
        """
        return f"""<s>[INST]<<SYS>>\n{DEFAULT_SYSTEM_PROMPT}\n<</SYS>>\n\n\n{instruction}[/INST]\n"""

    def _generate_response(self, prompt, temperature, max_new_tokens):
        """
        This method generates prediction for a single input.
        """
        # Build the prompt
        prompt = self._build_prompt(prompt)

        # Encode the input on the same device as the model and generate a prediction
        encoded_input = self.tokenizer.encode(prompt, return_tensors='pt').to(self.model.device)
        output = self.model.generate(encoded_input, do_sample=True, temperature=temperature, max_new_tokens=max_new_tokens)

        # Drop the prompt tokens so that only the newly generated text is decoded
        prompt_length = encoded_input.shape[1]
        generated_response = self.tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)

        return generated_response
      
    def predict(self, context, model_input):
        """
        This method generates prediction for the given input.
        """

        outputs = []
        n_rows = len(model_input)

        # Fall back to a per-row default when an optional column is missing
        temperatures = model_input.get("temperature", [1.0] * n_rows)
        max_tokens = model_input.get("max_new_tokens", [100] * n_rows)

        for i in range(n_rows):
            prompt = model_input["prompt"][i]
            outputs.append(
                self._generate_response(prompt, float(temperatures[i]), int(max_tokens[i])))

        return outputs
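
To see exactly what string the model receives, the prompt template from `_build_prompt` can be reproduced outside the class. This is a standalone sketch for inspection only: `build_prompt` is an illustrative name, and the short system prompt is a placeholder for `DEFAULT_SYSTEM_PROMPT`.

```python
# Placeholder system prompt; the notebook uses DEFAULT_SYSTEM_PROMPT instead.
SYSTEM_PROMPT = "You are a helpful assistant."


def build_prompt(instruction, system_prompt=SYSTEM_PROMPT):
    """Wrap an instruction in the same template the notebook's _build_prompt uses."""
    return f"<s>[INST]<<SYS>>\n{system_prompt}\n<</SYS>>\n\n\n{instruction}[/INST]\n"


# repr() makes the newlines in the template visible
print(repr(build_prompt("What is Databricks?")))
```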

## B. Run experiment
Outcomes 
1. Check out Experiments...you'll see this notebook logged as an experiment run.
2. Check out Models...you'll see a model has been registered named "llama2-4-u"

NOTE: Depending on your cluster environment, model registration can take longer than the maximum time MLflow waits for it to complete. If the cell times out, monitor the model's status under Models.

*_Time to run: ~5 minutes_*

In [0]:
from mlflow.models.signature import ModelSignature
from mlflow.types import DataType, Schema, ColSpec

import pandas as pd

# Define input and output schema
input_schema = Schema([
    ColSpec(DataType.string, "prompt"), 
    ColSpec(DataType.double, "temperature"), 
    ColSpec(DataType.long, "max_new_tokens")])
output_schema = Schema([ColSpec(DataType.string)])
signature = ModelSignature(inputs=input_schema, outputs=output_schema)
registered_model_name="llama2-4-u"

# Define input example
input_example=pd.DataFrame({
            "prompt":["what is Databricks?"], 
            "temperature": [0.5],
            "max_new_tokens": [100]})

# Log the model
with mlflow.start_run() as run:  
    mlflow.pyfunc.log_model(
        "model",
        python_model=Llama2(),
        artifacts={'repository' : snapshot_location},
        pip_requirements=["torch", "transformers", "accelerate"],
        input_example=input_example,
        signature=signature,
        registered_model_name=registered_model_name
    )
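
The signature above expects three columns with matching row counts: `prompt` (string), `temperature` (double), and `max_new_tokens` (long). As an illustration of what that contract means for callers, here is a small pre-check written with plain Python types. `validate_payload` is a hypothetical helper, stricter than strictly necessary, and MLflow's own schema enforcement remains the authority at prediction time.

```python
def validate_payload(payload):
    """Raise ValueError if payload does not match the three-column signature."""
    expected = {"prompt": str, "temperature": float, "max_new_tokens": int}

    missing = set(expected) - set(payload)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    # Every column must describe the same number of requests
    lengths = {len(payload[name]) for name in expected}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of rows")

    for name, typ in expected.items():
        for value in payload[name]:
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, typ):
                raise ValueError(f"{name!r} expects {typ.__name__}, got {value!r}")
    return payload


validate_payload({
    "prompt": ["what is Databricks?"],
    "temperature": [0.5],
    "max_new_tokens": [100],
})
```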

# III. CHOOSE YOUR OWN ADVENTURE

## Path 1. Register Model with Unity Catalog Then Serve

By default, MLflow registers models in the Databricks workspace Models registry. However, we are going to register Llama2 in Unity Catalog instead. Models in Unity Catalog extends the benefits of Unity Catalog to ML models, including centralized access control, auditing, lineage, and model discovery across workspaces.

Key features of Models in Unity Catalog include:

- Namespacing and governance for models, so you can group and govern models at the environment, project, or team level.

- Chronological model lineage (which MLflow experiment and run produced the model at a given time).

- Model versioning.

- Model deployment via aliases.

*_Time to run: ~55 minutes_*

### A. Prerequisites
Create a new catalog named `models` in Unity Catalog.

In [0]:
# Configure MLflow Python client to register model in Unity Catalog
import mlflow
mlflow.set_registry_uri("databricks-uc")

In [0]:
# The UC registered model name follows the pattern <catalog_name>.<schema_name>.<model_name>
registered_name = "models.default.llamav2_7b_chat_model"
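
Because the three-level name is easy to mistype, a quick check of the `<catalog_name>.<schema_name>.<model_name>` pattern can catch mistakes before the long registration step. This is a minimal sketch; `parse_uc_model_name` is an illustrative helper, not an MLflow function, and it only checks the shape of the name, not whether the catalog or schema exists.

```python
def parse_uc_model_name(full_name):
    """Split '<catalog>.<schema>.<model>' into its three parts, or raise ValueError."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected <catalog_name>.<schema_name>.<model_name>, got {full_name!r}")
    catalog, schema, model = parts
    return catalog, schema, model


catalog, schema, model_name = parse_uc_model_name("models.default.llamav2_7b_chat_model")
print(catalog, schema, model_name)
```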

### B. Register model
*_Time to run: 5 - 20 minutes (based on your cluster)_*

In [0]:
result = mlflow.register_model(
    "runs:/"+run.info.run_id+"/model",
    registered_name,
)

In [0]:
from mlflow import MlflowClient
client = MlflowClient()

# The model is already registered; give the newly registered version an alias
client.set_registered_model_alias(name=registered_name, alias="Champion", version=result.version)

### C. Test the model

In [0]:
import mlflow
import pandas as pd

loaded_model = mlflow.pyfunc.load_model(f"models:/{registered_name}@Champion")

# Make a prediction using the loaded model
loaded_model.predict(
    {
        "prompt": ["What is ML?", "What is large language model?"],
        "temperature": [0.1, 0.5],
        "max_new_tokens": [100, 100],
    }
)

### D. Databricks Model Serving from Unity Catalog model
Create a Databricks GPU Model Serving Endpoint that serves the model.

In [0]:
# Provide a name to the serving endpoint
endpoint_name = 'llama2-7b-chat'

In [0]:
databricks_url = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().getOrElse(None)
token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().getOrElse(None)

# Print only the workspace URL; avoid printing the API token, because notebook output may be shared
print('URL:', databricks_url)
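
If you want to confirm that a token was retrieved without exposing it in the notebook output, one option is to print a masked version. The helper below is a hypothetical sketch (`mask_secret` is not a Databricks utility) that keeps only the last few characters visible.

```python
def mask_secret(secret, visible=4):
    """Return the secret with all but the last `visible` characters replaced by '*'."""
    if not secret:
        return "<missing>"
    if len(secret) <= visible:
        # Too short to reveal anything safely
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# In the notebook you would call, for example:
#   print('token:', mask_secret(token))
print(mask_secret("dapi1234567890"))
```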

In [0]:
import requests
import json

deploy_headers = {'Authorization': f'Bearer {token}', 'Content-Type': 'application/json'}
deploy_url = f'{databricks_url}/api/2.0/serving-endpoints'

model_version = result  # the returned result of mlflow.register_model
endpoint_config = {
  "name": endpoint_name,
  "config": {
    "served_models": [{
      "name": f'{model_version.name.replace(".", "_")}_{model_version.version}',
      "model_name": model_version.name,
      "model_version": model_version.version,
      "workload_type": "GPU_MEDIUM",
      "workload_size": "Small",
      "scale_to_zero_enabled": False  # serialized as a JSON boolean rather than the string "False"
    }]
  }
}
endpoint_json = json.dumps(endpoint_config, indent='  ')

# Send a POST request to the API
deploy_response = requests.request(method='POST', headers=deploy_headers, url=deploy_url, data=endpoint_json)

if deploy_response.status_code != 200:
  raise Exception(f'Request failed with status {deploy_response.status_code}, {deploy_response.text}')

# Show the response of the POST request
# When first creating the serving endpoint, it should show that the state 'ready' is 'NOT_READY'
# You can check the status on the Databricks model serving endpoint page, it is expected to take ~35 min for the serving endpoint to become ready
print(deploy_response.json())
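
Since the endpoint starts out as `NOT_READY` and can take a long time to come up, you may prefer to poll for readiness rather than refresh the serving page. The sketch below is one way to do that. `wait_until_ready` is an illustrative helper, the `get_state` callable is injected so the loop can be tried without a workspace, and the GET path and the `UPDATE_FAILED` value shown are assumptions to verify against the Databricks REST API reference.

```python
import time


def wait_until_ready(get_state, timeout_s=3600, poll_s=30, sleep=time.sleep):
    """Poll get_state() until it reports READY. Return True, or False on timeout.

    get_state is any callable that returns the endpoint's `state` dict,
    e.g. {"ready": "NOT_READY", "config_update": "IN_PROGRESS"}.
    """
    waited = 0
    while True:
        state = get_state()
        if state.get("ready") == "READY":
            return True
        # Assumed failure marker; stop early instead of waiting for the timeout
        if state.get("config_update") == "UPDATE_FAILED":
            raise RuntimeError(f"endpoint update failed: {state}")
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s


# In the notebook, get_state could look like this (assumed GET path):
#   def get_state():
#       resp = requests.get(f"{deploy_url}/{endpoint_name}", headers=deploy_headers)
#       resp.raise_for_status()
#       return resp.json().get("state", {})
#   wait_until_ready(get_state)
```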

## Path 2: Straight to Serve

Model has previously been registered in the Models registry. Let's get straight to creating a Serving endpoint for model inference.

*_Time to run: ~1 minute (registration time ~45 minutes)_*

In [0]:
# Provide a name to the serving endpoint
endpoint_name = 'llama2-7b-chat'
model_version = 1

In [0]:
databricks_url = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().getOrElse(None)
token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().getOrElse(None)

In [0]:
import requests
import json

deploy_headers = {'Authorization': f'Bearer {token}', 'Content-Type': 'application/json'}
deploy_url = f'{databricks_url}/api/2.0/serving-endpoints'

endpoint_config = {
  "name": endpoint_name,
  "config": {
    "served_models": [{
      "name": registered_model_name,
      "model_name": registered_model_name,
      "model_version": model_version,
      "workload_size": "Small",
      "scale_to_zero_enabled": False  # serialized as a JSON boolean rather than the string "False"
    }]
  }
}
endpoint_json = json.dumps(endpoint_config, indent='  ')

# Register with Model Serving
deploy_response = requests.request(method='POST', headers=deploy_headers, url=deploy_url, data=endpoint_json)

if deploy_response.status_code != 200:
  raise Exception(f'Request failed with status {deploy_response.status_code}, {deploy_response.text}')

# Show response
print(deploy_response.json())
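
Once the endpoint from either path is ready, it can be queried over REST. The sketch below only builds the request body, using the `dataframe_split` format that MLflow model servers accept, with the same three columns as the model signature. `build_invocation_payload` is an illustrative helper, and the invocation URL in the comment is an assumption to check against your workspace's serving page.

```python
import json


def build_invocation_payload(prompts, temperature=0.5, max_new_tokens=100):
    """Build a `dataframe_split` request body with one row per prompt."""
    return {
        "dataframe_split": {
            "columns": ["prompt", "temperature", "max_new_tokens"],
            "data": [[p, temperature, max_new_tokens] for p in prompts],
        }
    }


payload = build_invocation_payload(["What is the status of my project?"])
print(json.dumps(payload, indent=2))

# In the notebook, the request could then be sent like this (assumed URL pattern):
#   response = requests.post(
#       f"{databricks_url}/serving-endpoints/{endpoint_name}/invocations",
#       headers=deploy_headers,
#       data=json.dumps(payload),
#   )
#   print(response.json())
```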