# Step 0

## - Click on the menu to the right

## - "Insert Project Token"

## - Use the downward arrow icon on the top menu to move the cell down and begin running the notebook

# Load Data into Milvus for RAG


# 1. Set up the environment

## Install libraries

We need to install the pymilvus package in the watsonx.ai Python environment.

In [2]:
!pip install grpcio==1.60.0 
!pip install pymilvus



# !!RESTART THE KERNEL AFTER pymilvus install!! 

Certain dependencies need to be persisted. Restarting the kernel allows this to occur.
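
After the restart, one way to confirm that the installs took effect is to query the package metadata. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper that is not part of the original notebook.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the install cell above.
for name in ("grpcio", "pymilvus"):
    print(name, installed_version(name))
```

A `None` next to a package name suggests the install did not land in the environment the kernel is using.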

In [3]:
!pip install ipython-sql==0.4.1
!pip install sqlalchemy==1.4.46 "pyhive[presto]"
!pip install python-dotenv
!pip install sentence_transformers
!pip install langchain-community
!pip install PyMuPDF

# clean up the libraries not required



# Step 1: Document Ingestion

## Load the PDF version of the watsonx.data documentation

Load the PDF version as an asset in the project. The cell below starts a Spark session, although the download itself uses `requests`.

In [4]:
import requests
from pyspark.sql import SparkSession
import os

spark = SparkSession.builder \
    .appName("Download watsonx.data PDF documentation") \
    .getOrCreate()

def download_pdf(url, local_path):
    """Download the PDF from a URL and save it locally"""
    response = requests.get(url, timeout=60)  # time out instead of hanging on a stalled connection
    if response.status_code == 200:
        with open(local_path, 'wb') as file:
            file.write(response.content)
        print(f"PDF downloaded successfully to {local_path}")
    else:
        print(f"Failed to download PDF. Status code: {response.status_code}")

pdf_url = "https://www.ibm.com/support/pages/system/files/inline-files/IBM%20watsonx.data%20version%202.0.3.pdf"  
local_file_path = "wxd_doc_pdf.pdf"  

download_pdf(pdf_url, local_file_path)

spark.stop()


PDF downloaded successfully to wxd_doc_pdf.pdf
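
A status code of 200 does not guarantee that the saved file is really a PDF (a server can return an HTML page instead). As a hedged sanity check, the sketch below looks for the `%PDF-` signature at the start of the file. `looks_like_pdf` is a hypothetical helper, and the demo writes throwaway files rather than touching `wxd_doc_pdf.pdf`.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature '%PDF-'."""
    with open(path, "rb") as file:
        return file.read(5) == b"%PDF-"


# Demo on two throwaway files: one with a PDF header, one with HTML content.
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.pdf")
    bad_path = os.path.join(tmp_dir, "bad.pdf")
    with open(good_path, "wb") as file:
        file.write(b"%PDF-1.7\n")
    with open(bad_path, "wb") as file:
        file.write(b"<html>Not found</html>")
    good_result = looks_like_pdf(good_path)
    bad_result = looks_like_pdf(bad_path)

print(good_result, bad_result)
```

In the notebook, the same check could be run as `looks_like_pdf(local_file_path)` right after the download.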


## Code to extract the text from the PDF for embeddings

In [5]:

# from langchain_community.document_loaders import DirectoryLoader
# from langchain_community.document_loaders import PyPDFLoader

# wslib is expected to come from the project token cell inserted in Step 0
asset_li = wslib.assets.list_assets("data_asset")
print(asset_li)

wslib.download_file("wxd_doc_pdf")

import fitz # PyMuPDF

doc = fitz.open("wxd_doc_pdf")

pdf_text = ""

for page in doc:
    pdf_text += page.get_text()

# print(pdf_text)

[{'name': 'wxd_doc_pdf', 'description': None, 'asset_id': '1377b784-7528-4933-ade1-096dafbf2d7a', 'asset_type': 'data_asset', 'tags': None}]


# Step 2: Document Chunking

To manage large texts, we can divide the data into manageable chunks by logical units such as paragraphs, sentences, or fixed token lengths.

### Option 1: Chunking by Paragraphs

In [6]:
def chunk_by_paragraphs(text):
    paragraphs = text.split("\n\n") # Assuming paragraphs are separated by two newlines
    return [p.strip() for p in paragraphs if p.strip()]

chunks = chunk_by_paragraphs(pdf_text)

# print(chunks[:1]) # Preview of the first chunk
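
The exact `"\n\n"` split above misses paragraph breaks whose blank line contains stray spaces or tabs. A hedged variant, assuming that such lines occur in the extracted text, splits on any whitespace-only gap with a regular expression. `chunk_by_blank_lines` is a hypothetical name.

```python
import re


def chunk_by_blank_lines(text):
    """Split on one or more blank lines, tolerating spaces or tabs on them."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


sample = "First paragraph.\n \nSecond paragraph.\n\n\nThird paragraph."
print(chunk_by_blank_lines(sample))
```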

### Option 2: Chunking by Sentences

If the document structure is more fluid and paragraphs are not clearly defined, you could break the text into sentences. You can use _nltk_ or a similar library for sentence tokenization.

In [7]:
# import nltk

# nltk.download('punkt')

# def chunk_by_sentence(text):
#     sentences = nltk.sent_tokenize(text)
#     return sentences

# chunks = chunk_by_sentence(pdf_text)

# print(chunks[:5]) # Preview of the first 5 chunks
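
If installing _nltk_ is not an option, a rough approximation can be written with the standard library. The sketch below is not the punkt tokenizer used above: it simply breaks after `.`, `!`, or `?` when whitespace follows, so it will mis-split abbreviations such as "e.g.". `split_sentences_simple` is a hypothetical name.

```python
import re


def split_sentences_simple(text):
    """Crude splitter: break after '.', '!' or '?' when followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if sentence]


print(split_sentences_simple("Open the console. Select a catalog! Is it listed?"))
```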

### Option 3: Chunking by Token length

For more control over the chunk size, you can split the text into chunks of a fixed number of tokens.

In [8]:
def chunk_by_tokens(text, max_tokens=512):
    tokens = text.split() # Tokenize on whitespace
    chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    return [' '.join(chunk) for chunk in chunks]

chunks = chunk_by_tokens(pdf_text)

# print(chunks[:1]) # Preview of the first chunk
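
Fixed-size chunks can cut a sentence in half at a chunk boundary. A common mitigation, shown here only as a sketch, is to let consecutive chunks share some tokens. The function name and the default `overlap=64` are assumptions, not values from the original notebook.

```python
def chunk_by_tokens_with_overlap(text, max_tokens=512, overlap=64):
    """Split whitespace tokens into windows that share `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_by_tokens_with_overlap("a b c d e", max_tokens=3, overlap=1))
```

With `overlap=0` the windows do not share tokens, which matches the behavior of `chunk_by_tokens` above.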

# Step 3: Embedding Generation

In [9]:
wslib.list_connections()  # call the method; without the parentheses only the bound method object is shown

In [10]:
# Note: if you named your Milvus connection something other than "custom-service", replace the name below

milvus_credentials = wslib.get_connection("custom-service")
print(milvus_credentials['host'])
# replace the milvus connection asset in the project

6c0c63ab-ecd7-45bd-bea8-c4d2d6fe976a.cie9nt2d0bngcm5pd3og.lakehouse.dev.appdomain.cloud
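
The connection cell further down reads `host`, `port`, and `password` from `milvus_credentials`. Before connecting, it can help to confirm that those keys are present and non-empty. The sketch below uses a plain dictionary with made-up values; `missing_credential_keys` is a hypothetical helper.

```python
def missing_credential_keys(credentials, required=("host", "port", "password")):
    """Return the required keys that are absent or empty in the credentials."""
    return [key for key in required if not credentials.get(key)]


# Hypothetical example values, not real connection details.
example_credentials = {"host": "milvus.example.com", "port": "8080"}
print(missing_credential_keys(example_credentials))
```

In the notebook, `missing_credential_keys(milvus_credentials)` should return an empty list before the connection is attempted.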


In [11]:
#milvus_credentials

In [12]:
from pymilvus import (
    connections,
    FieldSchema,
    DataType,
    Collection,
    CollectionSchema,
)


url = milvus_credentials['host']
port = milvus_credentials['port']
apikey = milvus_credentials['password']
apiuser = 'ibmlhapikey'


connections.connect(alias="default", 
                    host=url, 
                    port=port, 
                    user=apiuser, 
                    password=apikey, 
                    secure=True)

In [13]:
# Create a new collection
collection_description = 'wxd docs pdf'
collection_name = 'wxd_documentation2'

In [14]:
# Create collection - define fields + schema

fields = [
    FieldSchema(name="document_id", dtype=DataType.INT64), # Document Id
    FieldSchema(name="chunk_id",  dtype=DataType.VARCHAR, is_primary=True, max_length=20000), # Primary key; the load step stores the chunk text itself here
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384), # Embedding vector; 384 matches the model used in the load step
]

# Create a schema
schema = CollectionSchema(fields, collection_description)

# Create a collection
collection = Collection(collection_name, schema)

# Create index
index_params = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 2048}
}

collection.create_index(field_name="embedding", index_params=index_params)



Status(code=0, message=)
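
The schema above fixes two limits that every inserted row has to respect: a 384-dimensional embedding and a `chunk_id` of at most 20000 in length. The sketch below checks a row against those limits before it is sent to Milvus. It measures the chunk in UTF-8 bytes, which is the stricter reading of `max_length` and is an assumption here; `validate_row` and the two constants are hypothetical names.

```python
EMBEDDING_DIM = 384      # dim of the "embedding" field in the schema
MAX_CHUNK_BYTES = 20000  # max_length of the "chunk_id" field in the schema


def validate_row(row):
    """Return a list of problems found in one row; an empty list means OK."""
    problems = []
    if not isinstance(row.get("document_id"), int):
        problems.append("document_id must be an integer")
    chunk = row.get("chunk_id", "")
    if len(chunk.encode("utf-8")) > MAX_CHUNK_BYTES:
        problems.append("chunk_id is longer than the schema allows")
    if len(row.get("embedding", [])) != EMBEDDING_DIM:
        problems.append("embedding does not have 384 dimensions")
    return problems


good_row = {"document_id": 0, "chunk_id": "some text", "embedding": [0.0] * 384}
bad_row = {"document_id": 1, "chunk_id": "some text", "embedding": [0.0] * 3}
print(validate_row(good_row))
print(validate_row(bad_row))
```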

In [15]:
# We can list the collections in our Milvus instance to confirm that the new collection has been created

from pymilvus import utility
utility.list_collections()

['wxd_documentation1',
 'wxd_documentation2',
 'test_collection',
 'wxd_documentation']

In [16]:
# load data into Milvus
import pandas as pd
from sentence_transformers import SentenceTransformer
from pymilvus import Collection, connections
import warnings
warnings.filterwarnings('ignore')

data = []
print("len of chunks", len(chunks))
model = SentenceTransformer('sentence-transformers/all-minilm-l12-v2') # 384 dim

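# for i in range(len(chunks)):  # use this line instead to load every chunk
for i in range(10):  # load only the first 10 chunks as a quick test
    # Create vector embeddings + data
    passage_embeddings = model.encode(chunks[i])
    document_id = i
    data.append({"document_id":document_id, "chunk_id":chunks[i],"embedding": passage_embeddings})
    print(i)
out = collection.insert(data)  # insert all rows in a single call

# Note: chunks[i][0] is only the first character of the last chunk processed
print("wxd chunk: \'" + chunks[i][0] + "\' has been loaded.")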

len of chunks 302
0
1
2
3
4
5
6
7
8
9
wxd chunk: 'q' has been loaded.
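
The cell above sends every row in one `collection.insert(data)` call. If the commented-out loop over all chunks is enabled, sending the rows in smaller batches is one way to keep each request small. The sketch below only shows the batching logic on dummy rows; `batched` and the batch size are assumptions, and the `print` stands in for the insert call.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


rows = [{"document_id": i} for i in range(5)]  # dummy rows for illustration
for batch in batched(rows, 2):
    # In the notebook this line would be: collection.insert(batch)
    print(len(batch))
```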


In [17]:
## check to ensure entities have been loaded into the collection

basic_collection = Collection(collection_name) 

basic_collection.flush()  # flush first so that the count reflects the rows just inserted
print(basic_collection.num_entities)

# Step 4: Searching with Milvus

In [18]:
from sentence_transformers import SentenceTransformer
from pymilvus import (
    connections,
    Collection,
)

url = milvus_credentials['host']
port = milvus_credentials['port']
apikey = milvus_credentials['password']
apiuser = 'ibmlhapikey'


connections.connect(alias="default", 
                    host=url, 
                    port=port, 
                    user=apiuser, 
                    password=apikey, 
                    secure=True)


# Load collection

basic_collection = Collection(collection_name)      
basic_collection.load()

# Query function
def query_milvus(query, num_results):
    
    # Vectorize query
    model = SentenceTransformer('sentence-transformers/all-minilm-l12-v2') # 384 dim
    query_embeddings = model.encode([query])

    # Search
    search_params = {
        "metric_type": "L2", 
        "params": {"nprobe": 5}
    }
    results = basic_collection.search(
        data=query_embeddings, 
        anns_field="embedding", 
        param=search_params,
        limit=num_results,
        expr=None, 
        output_fields=['document_id'],
    )
    return results
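
Conceptually, a search with the `L2` metric ranks stored vectors by their distance to the query vector, smallest first. The pure-Python sketch below does this exhaustively on toy two-dimensional vectors to show the ranking idea. It is not how Milvus implements the search: an `IVF_FLAT` index only scans the `nprobe` closest clusters, so its results are approximate. The function names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query, vectors, limit):
    """Return (index, distance) pairs for the `limit` closest vectors."""
    scored = [(index, squared_l2(query, vector)) for index, vector in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:limit]


toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
hits = brute_force_search([0.9, 1.2], toy_vectors, limit=2)
print(hits)
```

The vector at index 1 ranks first because it lies closest to the query.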

# Prompt with LLM

In [19]:
## Consider some questions to ask regarding the topic you have chosen 

#question_text = "OTHER QUESTION TEXT"

question_text = "How to add a new catalog?"

In [20]:
# Query Milvus 

num_results = 3
results = query_milvus(question_text, num_results)

relevant_chunks = []
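for hit in results[0]:  # iterate over the hits actually returned, which may be fewer than num_results
    text = hit.id  # the primary key (chunk_id) holds the chunk text
    relevant_chunks.append(text)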
    
print(relevant_chunks)

['the left pane, go to Catalogs > All catalogs to view the available catalogs. 3. Select the catalog to open the catalog details page. 4. Click the catalog name and go to the Access control tab. 5. Go to Add collaborators > Add user and select a user role (Admin, Editor, or Viewer). 6. Search and select one or more users from the list and click Add. The user addition is successful. 7. Go to the Assets tab of the catalog details page, click the asset name, and go to the Access tab of the asset. 8. Click Add members, search for the added user, and click Add. Changing the owner of the asset Procedure 1. Go to the Assets tab of the catalog details page, click the asset name to open the asset details. 2. Click on the edit icon beside Asset owner and select a new user from the list. 3. Click, Apply. The asset owner is changed. Configure IBM Knowledge Catalog Do the following steps to associate a user to the table asset in IKC and assign the ownership. Procedure 1. Login to the IKC Cloud Pack

In [21]:
def make_prompt(context, question_text):
    return (f"{context}\n\nPlease answer a question using this text. "
          + f"If the question is unanswerable, say \"unanswerable\"."
          + f"\n\nQuestion: {question_text}")


# Build prompt w/ Milvus results
# Embed the retrieved passages (context) and the user question into the prompt text

context = "\n\n".join(relevant_chunks)
prompt = make_prompt(context, question_text)

print(prompt)

the left pane, go to Catalogs > All catalogs to view the available catalogs. 3. Select the catalog to open the catalog details page. 4. Click the catalog name and go to the Access control tab. 5. Go to Add collaborators > Add user and select a user role (Admin, Editor, or Viewer). 6. Search and select one or more users from the list and click Add. The user addition is successful. 7. Go to the Assets tab of the catalog details page, click the asset name, and go to the Access tab of the asset. 8. Click Add members, search for the added user, and click Add. Changing the owner of the asset Procedure 1. Go to the Assets tab of the catalog details page, click the asset name to open the asset details. 2. Click on the edit icon beside Asset owner and select a new user from the list. 3. Click, Apply. The asset owner is changed. Configure IBM Knowledge Catalog Do the following steps to associate a user to the table asset in IKC and assign the ownership. Procedure 1. Login to the IKC Cloud Pack f
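
Retrieved chunks can be long, and joining all of them may exceed what a model accepts as input. One hedged approach is to add chunks in ranked order until a character budget is used up. The budget of 4000 characters and the name `build_context` are assumptions for illustration, not limits taken from the notebook or from any specific model.

```python
def build_context(chunks, max_chars=4000, separator="\n\n"):
    """Join chunks in order, stopping before the character budget is exceeded."""
    selected = []
    used = 0
    for chunk in chunks:
        # Account for the separator that precedes every chunk but the first.
        cost = len(chunk) + (len(separator) if selected else 0)
        if used + cost > max_chars:
            break
        selected.append(chunk)
        used += cost
    return separator.join(selected)


example_chunks = ["aaaa", "bbbb", "cccc"]
print(repr(build_context(example_chunks, max_chars=10)))
```

In the notebook, `context = build_context(relevant_chunks)` could replace the plain `"\n\n".join(relevant_chunks)`.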