![image](https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/notebooks/headers/watsonx-Prompt_Lab-Notebook.png)
# AutoAI RAG experiment with a custom foundation model

#### Disclaimers

- Use only Projects and Spaces that are available in the watsonx context.


## Notebook content

This notebook demonstrates how to deploy a custom foundation model and use it in an AutoAI RAG experiment.
The data used in this notebook is from the [Granite Code Models paper](https://arxiv.org/pdf/2405.04324).

Some familiarity with Python is helpful. This notebook uses Python 3.11.


## Learning goal

The learning goals of this notebook are:

- How to deploy your own foundation model obtained from the Hugging Face Hub
- How to create an AutoAI RAG job that finds the best RAG pattern for the custom foundation model used in the experiment


## Contents

This notebook contains the following parts:
- [Set up the environment](#Set-up-the-environment)
- [Prerequisites](#Prerequisites)
- [Create API Client instance.](#Create-API-Client-instance.)
- [Download custom model from hugging face](#Download-custom-model-from-hugging-face)
- [Deploy the model](#Deploy-the-model)
- [Prepare the data for the AutoAI RAG experiment](#Prepare-the-data-for-the-AutoAI-RAG-experiment)
- [Run the AutoAI RAG experiment](#Run-the-AutoAI-RAG-experiment)
- [Query generated pattern locally](#Query-generated-pattern-locally)
- [Summary](#Summary)

## Set up the environment

In [None]:
%pip install -U wget | tail -n 1
%pip install -U 'ibm-watsonx-ai[rag]>=1.3.26' | tail -n 1

Note: you may need to restart the kernel to use updated packages.


<a id="prerequisites"></a>

## Prerequisites
Fill in the values below before moving forward:
- API_KEY - your IBM Cloud API key. More information about API keys can be found [here](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui).
- WML_ENDPOINT - the endpoint URL associated with your API key. For the list of available endpoints, refer to this [documentation](https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-endpoints).
- PROJECT_ID - the ID of the project associated with your API key and endpoint. To find your project ID, refer to this [documentation](https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/projects.html?context=wx&audience=wdp&locale=en).
- DATASOURCE_CONNECTION_ASSET_ID - the ID of the connection asset for the data source that stores your custom foundation model files. Refer to this [documentation](https://dataplatform.cloud.ibm.com/docs/content/wsj/manage-data/create-conn.html?context=cpdaas) to learn how to create this kind of asset. The example below uses a connection to `S3 Cloud Object Storage`.
- BUCKET_NAME - the bucket that holds your custom foundation model files in `.safetensors` format.
- BUCKET_MODEL_DIR_NAME - the directory inside the bucket that stores your custom model files.
- BUCKET_BENCHMARK_JSON_FILE_PATH - the path inside the bucket where your `benchmark.json` file is stored.

In [None]:
API_KEY = "PUT YOUR API KEY HERE" # API key to your IBM Cloud or Cloud Pak for Data instance
WML_ENDPOINT = "https://us-south.ml.cloud.ibm.com" # endpoint associated with your API key
PROJECT_ID = "PUT YOUR PROJECT ID HERE" # project ID associated with your API key and endpoint

DATASOURCE_CONNECTION_ASSET_ID = "PUT YOUR DATASOURCE CONNECTION ASSET ID" # datasource connection inside your project
BUCKET_NAME = "PUT BUCKET NAME WHICH STORES YOUR CUSTOM MODEL FILES HERE" # bucket name in your Cloud Object Storage
BUCKET_MODEL_DIR_NAME = "PUT PATH TO YOUR CUSTOM MODEL FILES IN YOUR BUCKET" # dir name inside the bucket which will store your custom model files

BUCKET_BENCHMARK_JSON_FILE_PATH = "benchmark.json" # path inside bucket where your benchmark.json file is stored
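
Before creating the client, it can help to confirm that no placeholder strings were left in place. The following is a minimal sketch and not part of the SDK: `find_unset_settings` is a hypothetical helper, and the `"PUT "` prefix check is an assumption based on the placeholder values used in the cell above.

```python
def find_unset_settings(settings: dict[str, str], placeholder_prefix: str = "PUT ") -> list[str]:
    """Return the names of settings that are empty or still hold a placeholder."""
    return [
        name
        for name, value in settings.items()
        if not value.strip() or value.strip().startswith(placeholder_prefix)
    ]


# Example values only; in the notebook you would pass the real variables.
example_settings = {
    "API_KEY": "PUT YOUR API KEY HERE",
    "WML_ENDPOINT": "https://us-south.ml.cloud.ibm.com",
    "PROJECT_ID": "",
}
unset = find_unset_settings(example_settings)
print(unset)  # ['API_KEY', 'PROJECT_ID']
```

Failing early on an unset value is usually easier to diagnose than an authentication error raised later by the service.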

## Create API Client instance.
This client lets you connect to IBM services.

In [12]:
from ibm_watsonx_ai import APIClient, Credentials

credentials = Credentials(
    api_key=API_KEY,
    url=WML_ENDPOINT,
)

client = APIClient(credentials=credentials, project_id=PROJECT_ID)

## Deploy the model
Before deploying, check the [custom foundation models documentation](https://ibm.github.io/watsonx-ai-python-sdk/fm_custom_models.html) to avoid problems during model deployment.

### Create custom model repository

In [13]:
software_spec = client.software_specifications.get_id_by_name('watsonx-cfm-caikit-1.1')

In [14]:
metadata = {
    client.repository.ModelMetaNames.NAME: "My deployment",
    client.repository.ModelMetaNames.SOFTWARE_SPEC_ID: software_spec,
    client.repository.ModelMetaNames.TYPE: client.repository.ModelAssetTypes.CUSTOM_FOUNDATION_MODEL_1_0,
    client.repository.ModelMetaNames.MODEL_LOCATION: {
        "file_path": BUCKET_MODEL_DIR_NAME,
        "bucket": BUCKET_NAME,
        "connection_id": DATASOURCE_CONNECTION_ASSET_ID,
    },
}

In [15]:
stored_model_details = client.repository.store_model(model=BUCKET_MODEL_DIR_NAME, meta_props=metadata)
stored_model_asset_id = client.repository.get_model_id(stored_model_details)

In [16]:
client.repository.list(framework_filter='custom_foundation_model_1.0')[0:1]

Unnamed: 0,ID,NAME,CREATED,FRAMEWORK,TYPE,SPEC_STATE,SPEC_REPLACEMENT
0,3a9346ec-bd38-4965-b3d9-b19fb545dc92,My deployment,2025-06-27T12:29:20.158Z,custom_foundation_model_1.0,model,supported,


### Store client task credentials

In [17]:
try:
    client.task_credentials.store()
except Exception:
    print("Client task credentials have already been stored.")

Task Credentials have already been stored. Use old or delete them.


Client task credentials have already been stored.


### Perform custom model deployment

In [18]:
MAX_SEQUENCE_LENGTH = 32_000
MAX_NEW_TOKENS = 1000
MIN_NEW_TOKENS = 1
MAX_BATCH_SIZE = 1024

meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: "My custom foundation model deployment",
    client.deployments.ConfigurationMetaNames.DESCRIPTION: "My custom foundation model deployment",
    client.deployments.ConfigurationMetaNames.HARDWARE_REQUEST: {
        'size': client.deployments.HardwareRequestSizes.Small,
        'num_nodes': 1
    },
    # optionally overwrite model parameters here
    client.deployments.ConfigurationMetaNames.FOUNDATION_MODEL: {
        "max_sequence_length": MAX_SEQUENCE_LENGTH, 
        "max_new_tokens": MAX_NEW_TOKENS, 
        "max_batch_size": MAX_BATCH_SIZE,
    },
    client.deployments.ConfigurationMetaNames.SERVING_NAME: "custom_foundation_model" # must be unique
}
deployment_details = client.deployments.create(stored_model_asset_id, meta_props)
deployment_id = client.deployments.get_id(deployment_details=deployment_details)



######################################################################################

Synchronous deployment creation for id: '3a9346ec-bd38-4965-b3d9-b19fb545dc92' started

######################################################################################


initializing
Note: This model is missing a chat template. To use chat-related functionality, enable a chat template in the tokenizer_config.json file and try again.
.......................................................................................................
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='085784e7-75d5-452d-ae2d-873aa6e20075'
-----------------------------------------------------------------------------------------------
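
The limits set in the deployment cell interact with each other. The sketch below works through the arithmetic under the assumption that `max_sequence_length` covers the prompt and the generated tokens together; this notebook does not confirm how the serving runtime counts tokens, so treat it as an illustration. `prompt_token_budget` is a hypothetical helper, not an SDK function.

```python
def prompt_token_budget(max_sequence_length: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt if the sequence limit covers prompt + output."""
    if max_new_tokens >= max_sequence_length:
        raise ValueError("max_new_tokens must be smaller than max_sequence_length")
    return max_sequence_length - max_new_tokens


# Values taken from the deployment cell in this notebook.
budget = prompt_token_budget(max_sequence_length=32_000, max_new_tokens=1000)
print(budget)  # 31000
```

Under that assumption, the retrieved chunks plus the question and template text would need to stay within roughly 31,000 tokens.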




## Prepare the data for the AutoAI RAG experiment

### Download `granite_code_models.pdf` document

In [20]:
import wget

data_url = "https://arxiv.org/pdf/2405.04324"
byom_input_filename = "granite_code_models.pdf"
wget.download(data_url, byom_input_filename)

'granite_code_models.pdf'

### Create data asset with your training data

In [23]:
document_asset_details = client.data_assets.create(name=byom_input_filename, file_path=byom_input_filename)

document_asset_id = client.data_assets.get_id(document_asset_details)
document_asset_id

Creating data asset...
SUCCESS


'f99e0fc9-9170-44b2-bf36-1f32e8384df1'

In [24]:
from ibm_watsonx_ai.helpers import DataConnection

input_data_references = [DataConnection(data_asset_id=document_asset_id)]

### Create your own benchmark.json file to ask the questions related to the document

In [26]:
import json 

local_benchmark_json_filename = "benchmark.json"

benchmarking_data = [
     {
        "question": "What are the two main variants of Granite Code models?",
        "correct_answer": "The two main variants are Granite Code Base and Granite Code Instruct.",
        "correct_answer_document_ids": [byom_input_filename]
     },
     {
        "question": "What is the purpose of Granite Code Instruct models?",
        "correct_answer": "Granite Code Instruct models are finetuned for instruction-following tasks using datasets like CommitPack, OASST, HelpSteer, and synthetic code instruction datasets, aiming to improve reasoning and instruction-following capabilities.",
        "correct_answer_document_ids": [byom_input_filename]
     },
     {
        "question": "What is the licensing model for Granite Code models?",
        "correct_answer": "Granite Code models are released under the Apache 2.0 license, ensuring permissive and enterprise-friendly usage.",
        "correct_answer_document_ids": [byom_input_filename]
     },
]

with open(local_benchmark_json_filename, mode="w", encoding="utf-8") as fp:
    json.dump(benchmarking_data, fp, indent=4)
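
A malformed benchmark record is easier to catch before the file is uploaded. The following is a minimal sketch: the three required keys are taken from the records above, while `validate_benchmark` and the specific checks are illustrative assumptions rather than the validation the service performs.

```python
REQUIRED_KEYS = {"question", "correct_answer", "correct_answer_document_ids"}


def validate_benchmark(records: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append(f"record {index}: missing {sorted(missing)}")
            continue
        if not record["correct_answer_document_ids"]:
            problems.append(f"record {index}: no document ids")
    return problems


# Example records only: the first is well formed, the second lacks an answer.
sample_records = [
    {
        "question": "What is the licensing model for Granite Code models?",
        "correct_answer": "Apache 2.0.",
        "correct_answer_document_ids": ["granite_code_models.pdf"],
    },
    {"question": "Incomplete record", "correct_answer_document_ids": []},
]
problems = validate_benchmark(sample_records)
print(problems)  # ["record 1: missing ['correct_answer']"]
```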

### Create data asset with benchmark.json file

In [27]:
test_asset_details = client.data_assets.create(name=local_benchmark_json_filename, file_path=local_benchmark_json_filename)

test_asset_id = client.data_assets.get_id(test_asset_details)
test_asset_id

Creating data asset...
SUCCESS


'ab864a7e-7b75-4e19-a833-936e1d24ed3c'

In [28]:
test_data_references = [DataConnection(data_asset_id=test_asset_id)]

## Run the AutoAI RAG experiment

Provide the input information for the AutoAI RAG optimizer:
- `custom_prompt_template_text` - prompt template text used to query your foundation model; the example below uses the `{question}` and `{reference_documents}` placeholders
- `custom_context_template_text` - context template text used to format each retrieved document; the example below uses the `{document}` placeholder
- `name` - experiment name
- `description` - experiment description
- `max_number_of_rag_patterns` - maximum number of RAG patterns to create
- `optimization_metrics` - target optimization metrics
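
The templates in this notebook use the `{question}`, `{reference_documents}`, and `{document}` placeholders. The sketch below checks that a template contains the placeholders it is expected to have and then fills the templates with plain `str.format`. How the service actually composes the final prompt is an assumption here, and `placeholders` is a hypothetical helper.

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Collect the named replacement fields that appear in a format string."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# Template strings copied from this notebook.
prompt_template = "Answer my question {question} related to these documents {reference_documents}."
context_template = "My document {document}"

missing_in_prompt = {"question", "reference_documents"} - placeholders(prompt_template)
missing_in_context = {"document"} - placeholders(context_template)
print(missing_in_prompt, missing_in_context)  # set() set()

# Illustrative composition: each retrieved chunk is wrapped, then inserted into the prompt.
context = context_template.format(document="Granite Code models come in Base and Instruct variants.")
prompt = prompt_template.format(
    question="What are the two main variants?",
    reference_documents=context,
)
print(prompt)
```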

In [30]:
from ibm_watsonx_ai.experiment import AutoAI
from ibm_watsonx_ai.helpers.connections import DataConnection, ContainerLocation
from ibm_watsonx_ai.foundation_models.schema import (
        AutoAIRAGCustomModelConfig,
        AutoAIRAGModelParams
)

experiment = AutoAI(credentials, project_id=PROJECT_ID)

custom_prompt_template_text = "Answer my question {question} related to these documents {reference_documents}."
custom_context_template_text = "My document {document}"

parameters = AutoAIRAGModelParams(max_sequence_length=32_000)
custom_foundation_model_config = AutoAIRAGCustomModelConfig(
    deployment_id=deployment_id, 
    project_id=PROJECT_ID, 
    prompt_template_text=custom_prompt_template_text, 
    context_template_text=custom_context_template_text, 
    parameters=parameters
)

rag_optimizer = experiment.rag_optimizer(
    name='AutoAI RAG - Custom foundation model experiment',
    description="AutoAI RAG experiment using custom foundation model.",
    max_number_of_rag_patterns=4,
    optimization_metrics=['faithfulness'],
    foundation_models=[custom_foundation_model_config]
)


container_data_location = DataConnection(
    type="container",
    location=ContainerLocation(path="autorag/results"),
)

container_data_location.set_client(api_client=client)

rag_optimizer.run(
    test_data_references=test_data_references,
    input_data_references=input_data_references,
    results_reference=container_data_location,
    background_mode=False
)



##############################################

Running 'cc4a3c81-a058-4376-9d94-e5e14d58e36c'

##############################################


pending....
running.......................................................................
completed
Training of 'cc4a3c81-a058-4376-9d94-e5e14d58e36c' finished successfully.


{'entity': {'hardware_spec': {'id': 'a6c4923b-b8e4-444c-9f43-8a7ec3020110',
   'name': 'L'},
  'input_data_references': [{'connection': {'id': '37589eb2-ed80-4174-a33b-adf7d9dcf727'},
    'location': {'bucket': 'autorag-byom',
     'file_name': 'granite_code_models.pdf'},
    'type': 'connection_asset'}],
  'parameters': {'constraints': {'generation': {'foundation_models': [{'context_template_text': 'My document {document}',
       'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075',
       'parameters': {'max_sequence_length': 32000},
       'project_id': '74cee487-8422-49ef-b61f-db92d8ce7b12',
       'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.'}]},
    'max_number_of_rag_patterns': 4},
   'optimization': {'metrics': ['faithfulness']},
   'output_logs': True},
  'results': [{'context': {'iteration': 0,
     'max_combinations': 80,
     'rag_pattern': {'composition_steps': ['model_selection',
       'chunking',
       'em

In [None]:
rag_optimizer.get_run_details()

{'entity': {'hardware_spec': {'id': 'a6c4923b-b8e4-444c-9f43-8a7ec3020110',
   'name': 'L'},
  'input_data_references': [{'connection': {'id': '37589eb2-ed80-4174-a33b-adf7d9dcf727'},
    'location': {'bucket': 'autorag-byom',
     'file_name': 'granite_code_models.pdf'},
    'type': 'connection_asset'}],
  'parameters': {'constraints': {'generation': {'foundation_models': [{'context_template_text': 'My document {document}',
       'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075',
       'parameters': {'max_sequence_length': 32000},
       'project_id': '74cee487-8422-49ef-b61f-db92d8ce7b12',
       'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.'}]},
    'max_number_of_rag_patterns': 4},
   'optimization': {'metrics': ['faithfulness']},
   'output_logs': True},
  'results': [{'context': {'iteration': 0,
     'max_combinations': 80,
     'rag_pattern': {'composition_steps': ['model_selection',
       'chunking',
       'em

In [None]:
summary = rag_optimizer.summary()
summary

Unnamed: 0_level_0,mean_faithfulness,mean_answer_correctness,mean_context_correctness,chunking.method,chunking.chunk_size,chunking.chunk_overlap,embeddings.model_id,vector_store.distance_metric,retrieval.method,retrieval.number_of_chunks,retrieval.hybrid_ranker,generation.model_id
Pattern_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Pattern4,0.4715,0.9153,1.0,recursive,1024,256,intfloat/multilingual-e5-large,cosine,window,3,,085784e7-75d5-452d-ae2d-873aa6e20075
Pattern1,0.3519,0.649,1.0,recursive,512,128,intfloat/multilingual-e5-large,cosine,window,3,,085784e7-75d5-452d-ae2d-873aa6e20075
Pattern2,0.0945,0.4841,1.0,recursive,512,128,intfloat/multilingual-e5-large,cosine,simple,5,,085784e7-75d5-452d-ae2d-873aa6e20075
Pattern3,0.0942,0.2469,1.0,recursive,1024,256,intfloat/multilingual-e5-large,cosine,simple,3,,085784e7-75d5-452d-ae2d-873aa6e20075


In [None]:
best_pattern_name = summary.index.values[0]
print('Best pattern is:', best_pattern_name)

best_pattern = rag_optimizer.get_pattern()

Best pattern is: Pattern4
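
Taking the first index entry relies on the summary being ordered by the optimization metric, which is how it appears in the output above. If you prefer not to depend on that ordering, the selection can be made explicit. The sketch below uses plain Python with the `mean_faithfulness` values copied from the summary table; the row layout is an illustrative assumption and not the structure returned by the SDK.

```python
# mean_faithfulness values copied from the summary output in this notebook.
rows = [
    {"pattern": "Pattern1", "mean_faithfulness": 0.3519},
    {"pattern": "Pattern2", "mean_faithfulness": 0.0945},
    {"pattern": "Pattern3", "mean_faithfulness": 0.0942},
    {"pattern": "Pattern4", "mean_faithfulness": 0.4715},
]

# Rank patterns from highest to lowest score on the chosen metric.
ranked = sorted(rows, key=lambda row: row["mean_faithfulness"], reverse=True)
best = ranked[0]["pattern"]
print(best)  # Pattern4
```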


In [None]:
rag_optimizer.get_pattern_details(pattern_name=best_pattern_name)

{'composition_steps': ['model_selection',
  'chunking',
  'embeddings',
  'retrieval',
  'generation'],
 'duration_seconds': 15,
 'location': {'evaluation_results': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/evaluation_results.json',
  'indexing_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/indexing_inference_notebook.ipynb',
  'inference_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/indexing_inference_notebook.ipynb',
  'inference_service_code': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/inference_ai_service.gz',
  'inference_service_metadata': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/inference_service_metadata.json'},
 'name': 'Pattern4',
 'settings': {'chunking': {'chunk_overlap': 256,
   'chunk_size': 1024,
   'method': 'recursive'},
  'embeddings': {'model_id': 'intfloat/multilingual-e5-large',
   'truncate_input_tokens': 512,
   'truncate_strategy': 'left'},
  'ge
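
The `chunk_size` and `chunk_overlap` settings control how much text each chunk holds and how much neighbouring chunks share. The sketch below illustrates the idea with a simplified fixed-window splitter. It is not the `recursive` method the pattern uses, and it assumes both values are measured in characters; `window_chunks` is a hypothetical helper.

```python
def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Settings copied from the best pattern: chunk_size=1024, chunk_overlap=256.
sample_text = "".join(chr(97 + i % 26) for i in range(2000))
chunks = window_chunks(sample_text, chunk_size=1024, chunk_overlap=256)
print([len(chunk) for chunk in chunks])  # [1024, 1024, 464]
```

With these settings each new chunk starts 768 characters after the previous one, so the last 256 characters of one chunk are repeated at the start of the next.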

## Query generated pattern locally

In [None]:
from ibm_watsonx_ai.deployments import RuntimeContext

runtime_context = RuntimeContext(api_client=client)
inference_service_function = best_pattern.inference_service(runtime_context)[0]



In [None]:
question = "What training objectives are used for the granite models?"

context = RuntimeContext(
    api_client=client,
    request_payload_json={"messages": [{"role": "user", "content": question}]},
)


resp = inference_service_function(context)
resp

{'body': {'choices': [{'index': 0,
    'message': {'role': 'assistant',
     'content': ' The\nclusters are equipped with 100Gbps and 200Gbps HDR InfiniBand links, respectively.\nWe utilize NVIDIA’s Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) for\ndistributed training, which is optimized for large language models. We use the same Megatron\nLM framework for all our models, ensuring consistency in training infrastructure.\n4.5 Model Architecture\nThe architecture of the Granite Code models is based on the original transformer architecture\n(Douglas & Smith, 2019) with modifications for code modeling. The base model has 16\nlayers, 8 attention heads, and a token embedding dimension of 512. For the 3B model, we\nuse the standard transformer architecture with a multi-head attention mechanism. The 8B model\nincorporates Grouped-Query Attention (GQA) (Ainslie et al., 2023) to improve inference\nefficiency. The 20B model uses learned absolute position embeddings and Multi-Query\

In [None]:
print(inference_service_function(context)["body"]["choices"][0]["message"]["content"])

 The
clusters are equipped with 100Gbps and 200Gbps HDR InfiniBand links, respectively.
We utilize NVIDIA’s Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) for
distributed training, which is optimized for large language models. We use the same Megatron
LM framework for all our models, ensuring consistency in training infrastructure.
4.5 Model Architecture
The architecture of the Granite Code models is based on the original transformer architecture
(Douglas & Smith, 2019) with modifications for code modeling. The base model has 16
layers, 8 attention heads, and a token embedding dimension of 512. For the 3B model, we
use the standard transformer architecture with a multi-head attention mechanism. The 8B model
incorporates Grouped-Query Attention (GQA) (Ainslie et al., 2023) to improve inference
efficiency. The 20B model uses learned absolute position embeddings and Multi-Query
Attention (Shazeer, 2019). The 34B model is built upon the 20B model with depth
upscaling (Kim et al
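
The request and response shapes used above can be wrapped in two small helpers. This is a minimal sketch based only on the payload and response structure shown in this notebook; `build_payload` and `extract_answer` are hypothetical names, and the mock response stands in for a real call to the inference function.

```python
def build_payload(question: str) -> dict:
    """Build the chat-style request body used in this notebook."""
    return {"messages": [{"role": "user", "content": question}]}


def extract_answer(response: dict) -> str:
    """Return the assistant text from the first choice, stripped of surrounding whitespace."""
    choices = response.get("body", {}).get("choices", [])
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"].strip()


payload = build_payload("What training objectives are used for the granite models?")

# Mock response with the same structure as the output above; the content is a placeholder.
mock_response = {
    "body": {
        "choices": [
            {"index": 0, "message": {"role": "assistant", "content": " Example answer text.\n"}}
        ]
    }
}
answer = extract_answer(mock_response)
print(answer)  # Example answer text.
```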

## Summary

You successfully completed this notebook!

You learned how to use AutoAI RAG with your own foundation model.

Check out our _<a href="https://ibm.github.io/watsonx-ai-python-sdk/samples.html" target="_blank" rel="noopener noreferrer">Online Documentation</a>_ for more samples, tutorials, documentation, how-tos, and blog posts.

### Author:
 **Michał Steczko**, Software Engineer at watsonx.ai.

Copyright © 2025 IBM. This notebook and its source code are released under the terms of the MIT License.