![image](https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/notebooks/headers/watsonx-Prompt_Lab-Notebook.png)
# AI Service Deployment Notebook
This notebook contains the steps and code to test, promote, and deploy an AI Service
that implements the retrieval-augmented generation (RAG) pattern for grounded chats.

**Note:** Notebook code generated using Prompt Lab will execute successfully.
If code is modified or reordered, there is no guarantee it will successfully execute.
For details, see: <a href="/docs/content/wsj/analyze-data/fm-prompt-save.html?context=wx" target="_blank">Saving your work in Prompt Lab as a notebook.</a>


Some familiarity with Python is helpful. This notebook uses Python 3.11.

## Contents
This notebook contains the following parts:

1. Set up the environment
2. Create the AI service function
3. Store and deploy the AI Service
4. Test AI Service

## 1. Set up the environment

Before you can run this notebook, you must perform the following setup tasks:

### Connection to WML
This cell defines the credentials required to work with the watsonx API, both for execution in the project
and for the deployment and runtime execution of the function.

**Action:** Provide the IBM Cloud personal API key. For details, see
<a href="https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui" target="_blank">documentation</a>.
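
If you prefer not to type the key on every run, one option is to read it from an environment variable and prompt only as a fallback. The sketch below is a minimal illustration, not part of the generated notebook; the helper name `read_api_key` and the variable name `WATSONX_APIKEY` are assumptions chosen for this example.

```python
import getpass
import os


def read_api_key(env_var="WATSONX_APIKEY"):
    """Return the API key from the environment, prompting only as a fallback.

    Both the function name and the default variable name are hypothetical.
    """
    api_key = os.environ.get(env_var)
    if api_key:
        return api_key
    # No usable value in the environment: ask interactively, without echoing
    return getpass.getpass("Please enter your api key (hit enter): ")
```

The returned value could then be used as the `"apikey"` entry of the credentials dictionary.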


In [None]:
import os
import getpass
import requests

def get_credentials():
    return {
        "url" : "https://us-south.ml.cloud.ibm.com",
        "apikey" : getpass.getpass("Please enter your api key (hit enter): ")
    }

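def get_bearer_token():
    # Exchange the API key held in the module-level `credentials` for an IAM access token
    url = "https://iam.cloud.ibm.com/identity/token"
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    # Passing a dict lets requests URL-encode the form fields
    data = {
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": credentials["apikey"]
    }

    response = requests.post(url, headers=headers, data=data)
    response.raise_for_status()
    return response.json().get("access_token")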

credentials = get_credentials()

In [None]:
from ibm_watsonx_ai import APIClient

client = APIClient(credentials)

### Connecting to a space
A space will be used to host the promoted AI Service.


In [None]:
space_id = "044ec471-2604-4d4a-80c2-1e0c4a1d6105"
client.set.default_space(space_id)


### Promote asset(s) to space
Next, promote the assets that the AI service needs from the project to the space, so that the service can access their data at runtime.


In [None]:
source_project_id = "c07e23d7-13de-4702-96ff-f1f5b5a2457d"
vector_index_id = client.spaces.promote("6b62d768-fb69-4833-acbf-daa84178f8ba", source_project_id, space_id)
print(vector_index_id)


## 2. Create the AI service function
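First, define the AI service function. It is an outer function that performs one-time setup and returns two inner functions: `generate` for complete responses and `generate_stream` for streamed responses.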
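
The shape of such a function can be sketched without any watsonx dependencies. The example below is a simplified illustration only: `make_service`, the `prefix` option, and `FakeContext` are hypothetical names, and the fake context models nothing beyond a `get_json` method.

```python
def make_service(context, **custom):
    # Setup placed here runs once, when the service is created
    prefix = custom.get("prefix", "Echo")

    def generate(context):
        # Non-streaming handler: returns one complete response
        question = context.get_json()["messages"][-1]["content"]
        return {
            "headers": {"Content-Type": "application/json"},
            "body": {
                "choices": [{
                    "index": 0,
                    "message": {"role": "assistant", "content": f"{prefix}: {question}"}
                }]
            }
        }

    def generate_stream(context):
        # Streaming handler: yields the response piece by piece
        question = context.get_json()["messages"][-1]["content"]
        for word in question.split():
            yield {"choices": [{"index": 0, "delta": {"role": "assistant", "content": word}}]}

    return generate, generate_stream


class FakeContext:
    """Hypothetical stand-in for the runtime context; only get_json is modelled."""

    def __init__(self, payload):
        self._payload = payload

    def get_json(self):
        return self._payload


ctx = FakeContext({"messages": [{"role": "user", "content": "hello world"}]})
generate, generate_stream = make_service(ctx)
answer = generate(ctx)["body"]["choices"][0]["message"]["content"]
streamed = [c["choices"][0]["delta"]["content"] for c in generate_stream(ctx)]
```

Note that the complete response nests its text under `message`, while each streamed chunk uses `delta`.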

### 2.1 Define the function

In [None]:
params = {
    "space_id": space_id, 
    "vector_index_id": vector_index_id
}

def gen_ai_service(context, params = params, **custom):
    # import dependencies
    import json
    from ibm_watsonx_ai.foundation_models import ModelInference
    from ibm_watsonx_ai import APIClient
    import gzip
    import chromadb
    import random
    import string
    from ibm_watsonx_ai.foundation_models.embeddings.sentence_transformer_embeddings import SentenceTransformerEmbeddings

    vector_index_id = params.get("vector_index_id")

    # Get credentials token
    credentials = {
        "url": "https://us-south.ml.cloud.ibm.com",
        "token": context.generate_token()
    }

    # Setup client
    client = APIClient(credentials)
    space_id = params.get("space_id")
    client.set.default_space(space_id)

    # Get vector index details
    vector_index_details = client.data_assets.get_details(vector_index_id)
    vector_index_properties = vector_index_details["entity"]["vector_index"]
    top_n = 20 if vector_index_properties["settings"].get("rerank") else int(vector_index_properties["settings"]["top_k"])

    def rerank( client, documents, query, top_n ):
        from ibm_watsonx_ai.foundation_models import Rerank

        reranker = Rerank(
            model_id="cross-encoder/ms-marco-minilm-l-12-v2",
            api_client=client,
            params={
                "return_options": {
                    "top_n": top_n
                },
                "truncate_input_tokens": 512
            }
        )

        reranked_results = reranker.generate(query=query, inputs=documents)["results"]

        new_documents = []
        
        for result in reranked_results:
            result_index = result["index"]
            new_documents.append(documents[result_index])
            
        return new_documents

    # Embedding model used to embed the user's question at query time
    emb = SentenceTransformerEmbeddings('sentence-transformers/all-MiniLM-L6-v2')


    def hydrate_chromadb():
        data = client.data_assets.get_content(vector_index_id)
        content = gzip.decompress(data)
        stringified_vectors = str(content, "utf-8")
        vectors = json.loads(stringified_vectors)
        
        chroma_client = chromadb.Client()
        
        # make sure collection is empty if it already existed
        collection_name = "my_collection"
        try:
            chroma_client.delete_collection(name=collection_name)
        except Exception:
            print("Collection didn't exist - nothing to do.")
        collection = chroma_client.create_collection(name=collection_name)
        
        vector_embeddings = []
        vector_documents = []
        vector_metadatas = []
        vector_ids = []
        
        for vector in vectors:
            vector_embeddings.append(vector["embedding"])
            vector_documents.append(vector["content"])
            metadata = vector["metadata"]
            lines = metadata["loc"]["lines"]
            clean_metadata = {}
            clean_metadata["asset_id"] = metadata["asset_id"]
            clean_metadata["asset_name"] = metadata["asset_name"]
            clean_metadata["url"] = metadata["url"]
            clean_metadata["from"] = lines["from"]
            clean_metadata["to"] = lines["to"]
            vector_metadatas.append(clean_metadata)
            asset_id = vector["metadata"]["asset_id"]
            random_string = ''.join(random.choices(string.ascii_uppercase + string.digits, k=10))
            # Avoid shadowing the built-in id()
            vector_id = "{}:{}-{}-{}".format(asset_id, lines["from"], lines["to"], random_string)
            vector_ids.append(vector_id)

        collection.add(
            embeddings=vector_embeddings,
            documents=vector_documents,
            metadatas=vector_metadatas,
            ids=vector_ids
        )
        return collection
    
    chroma_collection = hydrate_chromadb()

    def proximity_search( question, emb ):
        query_vectors = emb.embed_query(question)
        query_result = chroma_collection.query(
            query_embeddings=query_vectors,
            n_results=top_n,
            include=["documents", "metadatas", "distances"]
        )
        
        documents = list(reversed(query_result["documents"][0]))
        metadatas = reversed(query_result["metadatas"][0])
        distances = reversed(query_result["distances"][0])
        results = []
        for metadata, distance in zip(metadatas, distances):
            results.append({
                "metadata": metadata,
                "score": distance
            })

        return {
            "results": results,
            "documents": documents
        }

    # Functions used for inferencing
    def format_input(messages, documents):
        context = "\n".join(documents)
        system = f"""<|start_of_role|>system<|end_of_role|>You are Granite, an AI language model developed by IBM in 2024. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior. You are a AI language model designed to function as a specialized Retrieval Augmented Generation (RAG) assistant. When generating responses, prioritize correctness, i.e., ensure that your response is correct given the context and user query, and that it is grounded in the context. Furthermore, make sure that the response is supported by the given document or context. Always make sure that your response is relevant to the question. If an explanation is needed, first provide the explanation or reasoning, and then give the final answer. Avoid repeating information unless asked.<|end_of_text|>
"""
        messages_section = []

        for index,value in enumerate(messages, start=0):
            content = value["content"]
            user_template = f"""<|start_of_role|>user<|end_of_role|>{content}<|end_of_text|>
"""
            assistant_template = f"""<|start_of_role|>assistant<|end_of_role|>{content}<|end_of_text|>
"""
            grounded_user_template = f"""<|start_of_role|>user<|end_of_role|>Use the following pieces of context to answer the question.

{context}

Question: {content}<|end_of_text|>
"""

            formatted_entry = user_template if value["role"] == "user" else assistant_template
            if (index == len(messages)-1):
                formatted_entry = grounded_user_template
            
            messages_section.append(formatted_entry)

        messages_section = "".join(messages_section)
        prompt = f"""{system}{messages_section}<|start_of_role|>assistant<|end_of_role|>"""
        return prompt
    
    def inference_model( messages, documents, access_token, stream ):
        prompt = format_input(messages, documents)
        model_id = "ibm/granite-3-8b-instruct"
        parameters =  {
            "decoding_method": "greedy",
            "max_new_tokens": 900,
            "min_new_tokens": 0,
            "repetition_penalty": 1
        }
        inference_credentials = {
            "url": credentials.get("url"),
            "token": access_token
        }
        model = ModelInference(
            model_id = model_id,
            params = parameters,
            credentials = inference_credentials,
            space_id = space_id
        )
        # Generate grounded response
        if stream:
            generated_response = model.generate_text_stream(prompt=prompt, guardrails=False)
        else:
            generated_response = model.generate_text(prompt=prompt, guardrails=False)

        return generated_response

    def get_internal_client(access_token):
        # Setup client
        internal_credentials = {
            "url": credentials.get("url"),
            "token": access_token
        }
        internal_client = APIClient(internal_credentials)
        space_id = params.get("space_id")
        internal_client.set.default_space(space_id)
        return internal_client

    def generate(context):
        payload = context.get_json()
        messages = payload.get("messages")
        access_token = context.get_token()
        last_question = messages[-1].get("content")
 
        internal_client = get_internal_client(access_token)


        # Proximity search
        documents = proximity_search(last_question, emb)["documents"]

        if vector_index_properties["settings"].get("rerank"):
            documents = rerank(internal_client, documents, last_question, vector_index_properties["settings"]["top_k"])
    
        
        # Grounded inferencing
        generated_response = inference_model(messages, documents, access_token, False)
        
        execute_response = {
            "headers": {
                "Content-Type": "application/json"
            },
            "body": {
                "choices": [{
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": generated_response
                    }
                }]
            }
        }

        return execute_response

    def generate_stream(context):
        payload = context.get_json()
        messages = payload.get("messages")
        access_token = context.get_token()
        last_question = messages[-1].get("content")

        internal_client = get_internal_client(access_token)


        # Proximity search
        documents = proximity_search(last_question, emb)["documents"]

        if vector_index_properties["settings"].get("rerank"):
            documents = rerank(internal_client, documents, last_question, vector_index_properties["settings"]["top_k"])
    
        
        # Grounded inferencing
        response_stream = inference_model(messages, documents, access_token, True)

        for chunk in response_stream:
            chunk_response = {
                "choices": [{
                    "index": 0,
                    "delta": {
                        "role": "assistant",
                        "content": chunk
                    }
                    
                }]
            }
            yield chunk_response

    return generate, generate_stream
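
The prompt assembly in `format_input` can be hard to follow inside the larger function, so here is a simplified, standalone sketch of the same idea: earlier turns are passed through unchanged, and only the final message is wrapped with the retrieved context. The helper name `build_prompt` and the short system text are assumptions for illustration; the role tokens are the ones used above.

```python
ROLE_OPEN = "<|start_of_role|>"
ROLE_CLOSE = "<|end_of_role|>"
END = "<|end_of_text|>"


def build_prompt(system_text, messages, documents):
    """Assemble a chat prompt in which only the last message is grounded."""
    context = "\n".join(documents)
    parts = [f"{ROLE_OPEN}system{ROLE_CLOSE}{system_text}{END}\n"]
    last = len(messages) - 1
    for index, message in enumerate(messages):
        content = message["content"]
        if index == last:
            # The final message is always sent as a grounded user turn
            role = "user"
            content = (
                "Use the following pieces of context to answer the question.\n\n"
                f"{context}\n\nQuestion: {content}"
            )
        else:
            role = "user" if message["role"] == "user" else "assistant"
        parts.append(f"{ROLE_OPEN}{role}{ROLE_CLOSE}{content}{END}\n")
    # Leave the assistant role open so the model writes the next turn
    parts.append(f"{ROLE_OPEN}assistant{ROLE_CLOSE}")
    return "".join(parts)


messages = [
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "A retrieval pattern."},
    {"role": "user", "content": "Why use it?"},
]
prompt = build_prompt("You are a helpful assistant.", messages, ["doc A", "doc B"])
```

Because the context is attached only to the last turn, the prompt grows with the conversation history but carries the retrieved documents once.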


### 2.2 Test locally

In [None]:
# Initialize AI Service function locally
from ibm_watsonx_ai.deployments import RuntimeContext

context = RuntimeContext(api_client=client)

streaming = False
findex = 1 if streaming else 0
local_function = gen_ai_service(context, vector_index_id=vector_index_id, space_id=space_id)[findex]
messages = []

In [None]:
local_question = "Change this question to test your function"

messages.append({ "role" : "user", "content": local_question })

context = RuntimeContext(api_client=client, request_payload_json={"messages": messages})

response = local_function(context)

if streaming:
    for chunk in response:
        # Streamed chunks carry the text under "delta", not "message"
        print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
else:
    print(response)
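
When streaming, you may want the full answer as a single string, for example to append it to `messages` as the assistant turn. A minimal sketch, assuming chunks shaped like the ones yielded by `generate_stream`; `fake_stream` and `collect_stream` are hypothetical helpers and no model is called.

```python
def fake_stream():
    # Stand-in for a streamed response; yields chunks in the generate_stream shape
    for piece in ["Grounded ", "answer"]:
        yield {"choices": [{"index": 0, "delta": {"role": "assistant", "content": piece}}]}


def collect_stream(chunks):
    """Join the text of every streamed chunk into one string."""
    parts = []
    for chunk in chunks:
        parts.append(chunk["choices"][0]["delta"].get("content", ""))
    return "".join(parts)


full_text = collect_stream(fake_stream())
```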


## 3. Store and deploy the AI Service
Before you can deploy the AI Service, you must store it in your watsonx.ai repository.

In [None]:
# Look up software specification for the AI service
software_spec_id_in_project = "4e852902-d936-42e1-bdba-70472ad5d47e"
software_spec_id = ""

try:
    software_spec_id = client.software_specifications.get_id_by_name("ai-service-v4-software-specification")
except Exception:
    # Fall back to promoting the project's software specification if the lookup fails
    software_spec_id = client.spaces.promote(software_spec_id_in_project, source_project_id, space_id)

In [None]:
# Define the request and response schemas for the AI service
request_schema = {
    "application/json": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "messages": {
                "title": "The messages for this chat session.",
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "role": {
                            "title": "The role of the message author.",
                            "type": "string",
                            "enum": ["user","assistant"]
                        },
                        "content": {
                            "title": "The contents of the message.",
                            "type": "string"
                        }
                    },
                    "required": ["role","content"]
                }
            }
        },
        "required": ["messages"]
    }
}

response_schema = {
    "application/json": {
        "oneOf": [{"$schema":"http://json-schema.org/draft-07/schema#","type":"object","description":"AI Service response for /ai_service_stream","properties":{"choices":{"description":"A list of chat completion choices.","type":"array","items":{"type":"object","properties":{"index":{"type":"integer","title":"The index of this result."},"delta":{"description":"A message result.","type":"object","properties":{"content":{"description":"The contents of the message.","type":"string"},"role":{"description":"The role of the author of this message.","type":"string"}},"required":["role"]}}}}},"required":["choices"]},{"$schema":"http://json-schema.org/draft-07/schema#","type":"object","description":"AI Service response for /ai_service","properties":{"choices":{"description":"A list of chat completion choices","type":"array","items":{"type":"object","properties":{"index":{"type":"integer","description":"The index of this result."},"message":{"description":"A message result.","type":"object","properties":{"role":{"description":"The role of the author of this message.","type":"string"},"content":{"title":"Message content.","type":"string"}},"required":["role"]}}}}},"required":["choices"]}]
    }
}
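
It can help to check a payload against the request schema before sending it. The sketch below hand-codes the same rules (a `messages` array whose items have a `role` of `user` or `assistant` and a string `content`) using only the standard library; `validate_payload` is a hypothetical helper, not part of the watsonx SDK, and it does not interpret the schema dictionary itself.

```python
ALLOWED_ROLES = ("user", "assistant")


def validate_payload(payload):
    """Return a list of problems; an empty list means the payload looks valid."""
    if not isinstance(payload, dict) or "messages" not in payload:
        return ["payload must be an object with a 'messages' field"]
    messages = payload["messages"]
    if not isinstance(messages, list):
        return ["'messages' must be an array"]
    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}] must be an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"messages[{i}].role must be 'user' or 'assistant'")
        if not isinstance(message.get("content"), str):
            problems.append(f"messages[{i}].content must be a string")
    return problems
```

For example, `validate_payload({"messages": [{"role": "user", "content": "hi"}]})` returns an empty list.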

In [None]:
# Store the AI service in the repository
ai_service_metadata = {
    client.repository.AIServiceMetaNames.NAME: "lablab IBM watsonx challenge",
    client.repository.AIServiceMetaNames.DESCRIPTION: "",
    client.repository.AIServiceMetaNames.SOFTWARE_SPEC_ID: software_spec_id,
    client.repository.AIServiceMetaNames.CUSTOM: {},
    client.repository.AIServiceMetaNames.REQUEST_DOCUMENTATION: request_schema,
    client.repository.AIServiceMetaNames.RESPONSE_DOCUMENTATION: response_schema,
    client.repository.AIServiceMetaNames.TAGS: ["wx-vector-index"]
}

ai_service_details = client.repository.store_ai_service(meta_props=ai_service_metadata, ai_service=gen_ai_service)

In [None]:
# Get the AI Service ID

ai_service_id = client.repository.get_ai_service_id(ai_service_details)

In [None]:
# Deploy the stored AI Service
deployment_custom = {}
deployment_metadata = {
    client.deployments.ConfigurationMetaNames.NAME: "lablab IBM watsonx challenge",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
    client.deployments.ConfigurationMetaNames.CUSTOM: deployment_custom,
    client.deployments.ConfigurationMetaNames.DESCRIPTION: "",
    client.deployments.ConfigurationMetaNames.TAGS: ["wx-vector-index"]
}

function_deployment_details = client.deployments.create(ai_service_id, meta_props=deployment_metadata, space_id=space_id)


## 4. Test AI Service

In [None]:
# Get the ID of the AI Service deployment just created

deployment_id = client.deployments.get_id(function_deployment_details)
print(deployment_id)

In [None]:
messages = []
remote_question = "Change this question to test your function"
messages.append({ "role" : "user", "content": remote_question })
payload = { "messages": messages }

In [None]:
result = client.deployments.run_ai_service(deployment_id, payload)
if "error" in result:
    print(result["error"])
else:
    print(result)

## Next steps
You successfully deployed and tested the AI Service! You can now view
your deployment and test it as a REST API endpoint.
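
As a rough sketch of what a REST call might look like, the example below builds (but does not send) a POST request with the standard library. The URL path is an assumption inferred from the `/ai_service` and `/ai_service_stream` names in the response schema, and the `version` value is a placeholder; check your deployment's details page for the exact endpoint. `build_ai_service_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_ai_service_request(base_url, deployment_id, token, messages,
                             version="2021-05-01", stream=False):
    """Build a POST request for a deployed AI service (assumed URL layout)."""
    path = "ai_service_stream" if stream else "ai_service"
    # Assumed endpoint shape; confirm against your deployment before use
    url = f"{base_url}/ml/v4/deployments/{deployment_id}/{path}?version={version}"
    body = json.dumps({"messages": messages}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_ai_service_request(
    "https://us-south.ml.cloud.ibm.com",
    "dep-123",  # placeholder deployment ID
    "tok",      # placeholder bearer token
    [{"role": "user", "content": "hi"}],
)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` with a real deployment ID and a valid bearer token.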

<a id="copyrights"></a>
### Copyrights

Licensed Materials - Copyright © 2024 IBM. This notebook and its source code are released under the terms of the ILAN License.
Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

**Note:** The auto-generated notebooks are subject to the International License Agreement for Non-Warranted Programs (or equivalent) and License Information document for watsonx.ai Auto-generated Notebook (License Terms), such agreements located in the link below. Specifically, the Source Components and Sample Materials clause included in the License Information document for watsonx.ai Studio Auto-generated Notebook applies to the auto-generated notebooks.  

By downloading, copying, accessing, or otherwise using the materials, you agree to the <a href="https://www14.software.ibm.com/cgi-bin/weblap/lap.pl?li_formnum=L-AMCU-BYC7LF" target="_blank">License Terms</a>  