![Top <](./images/watsonxdata.png "watsonxdata")

# Watsonx and Milvus

Watsonx is IBM's platform for bringing generative AI to services that span a customer's data lifecycle. Each service offers a unique experience, but when they are combined, the business value is even stronger. This demonstration features:

  1. Scraping data from Wikipedia and other web articles into a Jupyter Notebook
  2. Inserting web data into watsonx.data
  3. Vectorizing data in watsonx.data and inserting it into the Milvus vector database
  4. Retrieving text from Milvus that can be embedded into a prompt for a Large Language Model

#### Credits

This material has been adapted from material originally produced by Katherine Ciaravalli and Ken Bailey.

## Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation is an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs' generative process. This can improve the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information. Implementing RAG in an LLM-based question answering system has two main benefits: It ensures that the model has access to the most current, reliable facts, and that users have access to the model’s sources, ensuring that its claims can be checked for accuracy and ultimately trusted.

In this example we will use Wikipedia articles on a specific topic, Climate Change. We want to explore answering business questions related to this topic. As the environment continues to change, businesses will need to consider how these changes will impact their operations. Combining climate change data with business-specific data would allow companies to pose meaningful questions and consider alternative outcomes when determining effective business strategies. Although Wikipedia is not the most trusted source, this is an introductory demo, intended to highlight how easily new information can be incorporated into Large Language Models.
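
The retrieve-then-prompt flow described in this section can be sketched in a few lines. This is a toy illustration only: it ranks passages by shared words rather than by vector similarity, and the function names and the sample knowledge base are hypothetical, not part of the notebook.

```python
def retrieve(question, passages, top_k=2):
    """Rank passages by how many question words they share (a stand-in for vector search)."""
    question_words = set(question.lower().split())
    scored = [
        (len(question_words & set(passage.lower().split())), passage)
        for passage in passages
    ]
    # Highest overlap first; the sort is stable, so ties keep their input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored[:top_k] if score > 0]


def build_grounded_prompt(question, retrieved):
    """Place the retrieved passages ahead of the question so the model answers from them."""
    context = "\n\n".join(retrieved)
    return f"{context}\n\nQuestion: {question}"


# Hypothetical knowledge base, for illustration only
knowledge_base = [
    "Solar panels convert sunlight into electricity.",
    "Wind turbines generate electricity from wind.",
    "Bananas are rich in potassium.",
]
question = "How do solar panels generate electricity?"
top_passages = retrieve(question, knowledge_base)
print(build_grounded_prompt(question, top_passages))
```

The rest of this notebook follows the same two steps, with Milvus and sentence embeddings taking the place of the word-overlap ranking.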

## Load Wikipedia Data

This notebook walks through the process of loading a Wikipedia article into a watsonx.data relational database table. We use the [Wikipedia Python library](https://pypi.org/project/wikipedia/) to retrieve Wikipedia articles. We then create a table in the database to store the articles. Finally, we load the articles into the database.

For details on the copyright issues when extracting data, please refer to the [Wikipedia Copyrights](https://en.wikipedia.org/wiki/Wikipedia:Copyrights) page.

### Install required libraries

A few additional libraries need to be installed in the notebook environment in order to query Wikipedia articles.

In [None]:
!pip install python-dotenv
!pip install wikipedia
!pip install pymilvus
!pip install sentence_transformers
!pip install grpcio==1.60.0 

### Fetch Wikipedia Articles

The following code will search Wikipedia articles and display a list of the articles by title. The initial search will return a list of up to 10 titles, while the subsequent call will retrieve the summary of the article. The two results are combined into one dataframe for easy scrolling.

Update the next field to include what you are searching for.

In [None]:
topic = "climate"

### Retrieve 10 Articles
The next call will retrieve a maximum of 10 titles and display the list.

In [None]:
import wikipedia
search_results = wikipedia.search(topic)
print("Article Title")
print("-------------------------------------------------")
for result in search_results: print(result)

### Retrieve Article Summary
Now that we have a list of articles, we can request a summary of each article and display them. Note that if an article is ambiguous, the program will not attempt to retrieve the article. An ambiguous article is an article which could refer to multiple topics. The summary output from an ambiguous article will display possible searches that you may want to try. Since we are only interested in direct articles, the ambiguous titles will be ignored.

In [None]:
import wikipedia
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# search
search_results = wikipedia.search(topic)  # reuse the topic set earlier

display_articles = []
for i in range (0,len(search_results)):
    try:
        summary = wikipedia.summary(search_results[i])
    except Exception as err:
        # Most failures here are ambiguous titles; report the actual error type
        print(f"Skipped article '{search_results[i]}' ({type(err).__name__}).")
        continue
        
    display_articles.append({
        "title"   : search_results[i],
        "summary" : summary
    })

#print(display_articles)

df = pd.DataFrame.from_dict(display_articles)
df.style.set_properties(**{'text-align': 'left'})
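
The cell above skips an article on any exception. A narrower approach is to skip only on the exception types you expect and record why each title was dropped. The sketch below assumes nothing about the network: `collect_summaries`, `fake_fetch`, and `FakeAmbiguityError` are hypothetical names used for illustration, and the fetch function is passed in so the logic can be tried offline.

```python
def collect_summaries(titles, fetch_summary, skip_on=(Exception,)):
    """Fetch a summary per title, recording which titles were skipped and why."""
    fetched, skipped = [], []
    for title in titles:
        try:
            summary = fetch_summary(title)
        except skip_on as err:
            # Only the exception types listed in skip_on are swallowed
            skipped.append({"title": title, "reason": type(err).__name__})
            continue
        fetched.append({"title": title, "summary": summary})
    return fetched, skipped


class FakeAmbiguityError(Exception):
    """Hypothetical stand-in for the library's ambiguity exception."""


def fake_fetch(title):
    # Simulates a lookup that fails for an ambiguous title
    if title == "Climate (disambiguation)":
        raise FakeAmbiguityError(title)
    return f"Summary of {title}"


fetched, skipped = collect_summaries(
    ["Climate", "Climate (disambiguation)"],
    fake_fetch,
    skip_on=(FakeAmbiguityError,),
)
print(fetched)
print(skipped)
```

With the real library, you could pass `wikipedia.summary` as `fetch_summary` and `(wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError)` as `skip_on`, so that unexpected errors such as network failures still surface.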

## Load a Wikipedia Article into watsonx.data 
This step will load selected articles into watsonx.data. Since we are only interested in climate change, we will select the first two articles in the list. You can change the documents loaded by changing the document indexes in the variable found in the next cell.

In [None]:
documents = [0,1]

In [None]:
import wikipedia

# fetch wikipedia articles

articles = {}
for document in documents:
    articles.update({display_articles[document]["title"] : None})

for k,v in articles.items():
    article = wikipedia.page(k)
    articles[k] = article.content
    print(f"Successfully fetched article {k}")

print(f"Successfully fetched {len(articles)} articles ")

### Connect to watsonx.data
The following code will use the Presto magic commands to load data into watsonx.data.

In [None]:
%run presto.ipynb

The connection details should not change unless you are attempting to run this script from a Jupyter environment that is outside of the developer system.

In [None]:
%%sql
   connect
   userid=ibmlhadmin
   password=password
   hostname=watsonxdata
   port=8443
   catalog=tpch
   schema=tiny
   certfile=/certs/lh-ssl-ts.crt

## Create Schema in watsonx.data
We need a new schema, backed by its own bucket location, to store the Wikipedia data. Start by dropping any existing copy of the table and schema.

In [None]:
%%sql
DROP TABLE IF EXISTS hive_data.watsonxai.wikipedia;
DROP SCHEMA IF EXISTS hive_data.watsonxai;

The next step will delete any existing data in the watsonxai bucket. A DROP table command does not remove the files in the bucket. You may see error messages displayed if no data or bucket exists.

In [None]:
import warnings
warnings.filterwarnings('ignore')

minio_host    = "watsonxdata"
minio_port    = "9000"
hive_host     = "watsonxdata"
hive_port     = "9083"

hive_id           = None
hive_password     = None
minio_access_key  = None
minio_secret_key  = None
keystore_password = None

try:
    with open('/certs/passwords') as fd:
        certs = fd.readlines()
    for line in certs:
        args = line.split()
        if (len(args) >= 3):
            system   = args[0].strip()
            user     = args[1].strip()
            password = args[2].strip()
            if (system == "Minio"):
                minio_access_key = user
                minio_secret_key = password
            elif (system == "Thrift"):
                hive_id = user
                hive_password = password
            elif (system == "Keystore"):
                keystore_password = password
            else:
                pass
except Exception as e:
    print("Certificate file with passwords could not be found")

%system mc alias set watsonxdata http://{minio_host}:{minio_port} {minio_access_key} {minio_secret_key}

%system mc rm --recursive --force watsonxdata/hive-bucket/watsonxai

### Create the Watsonxai Schema and Table

In [None]:
%%sql
CREATE SCHEMA IF NOT EXISTS 
  hive_data.watsonxai 
WITH ( location = 's3a://hive-bucket/watsonxai')

### Create the Wikipedia Table

In [None]:
%%sql
CREATE TABLE hive_data.watsonxai.wikipedia
  (
  "id" varchar,
  "text" varchar,
  "title" varchar  
  )
WITH 
  (
  format = 'PARQUET',
  external_location = 's3a://hive-bucket/watsonxai' 
  )

## Load the Data
The Wikipedia article is written into the watsonx.data database in chunks of approximately 225 words. Chunking the data makes it more efficient to populate the Milvus system from watsonx.data.

In [None]:
# Chunk data
def split_into_chunks(text, chunk_size):
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

split_articles = {}
for k,v in articles.items():
    split_articles[k] = split_into_chunks(v, 225)

# Insert data

for article_title, article_chunks in split_articles.items():

    for i, chunk in enumerate(article_chunks):
            
        escaped_chunk = chunk.replace("'", "''").replace("%", "%%")
        insert_stmt = f"insert into hive_data.watsonxai.wikipedia values ('{i+1}', '{escaped_chunk}', '{article_title}')"
        %sql --quiet {insert_stmt}
        print(f"{article_title} {i+1}/{len(article_chunks)} inserted",end="\r")
            
    print(f"\n{article_title} Insertion complete")
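
The chunker above cuts on fixed word boundaries, so a sentence can be split between two chunks. One common variation, shown here as a sketch rather than as part of the original notebook, is to let consecutive chunks share a few words. The function name `split_with_overlap` and the `overlap` parameter are assumptions introduced for illustration.

```python
def split_with_overlap(text, chunk_size, overlap):
    """Split text into word chunks; each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the remaining words are already covered by this chunk
    return chunks


sample = "one two three four five six seven eight nine ten"
print(split_with_overlap(sample, chunk_size=4, overlap=1))
```

With `overlap=0` this produces the same chunks as `split_into_chunks`. A larger overlap stores more rows but makes it less likely that a relevant sentence is lost at a chunk boundary.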

### Confirm that the Data has been Loaded

In [None]:
%%sql
   select * from hive_data.watsonxai.wikipedia

## Load Vector Embeddings to Milvus

Here we will take the data we loaded into watsonx.data from the previous step and load it into the vector database Milvus. This data was previously chunked and stored in a watsonx.data hive table, so we'll pull from here, vectorize the text chunks and load them into Milvus.

Before we can start loading the data, though, we need to create a collection in Milvus to hold the data. We'll call this collection `wiki_articles`. This collection holds the vector embeddings for each chunk of text, as well as the original text itself and additional context.

In [None]:
!rm -f /tmp/presto.crt
!echo QUIT | openssl s_client -showcerts -connect localhost:8443 | awk '/-----BEGIN CERTIFICATE-----/ {p=1}; p; /-----END CERTIFICATE-----/ {p=0}' > /tmp/presto.crt

### Milvus Connection Settings

In [None]:
host            = 'watsonxdata'
port            = 19530
user            = 'ibmlhadmin'
password        = 'password'
server_pem_path = '/tmp/presto.crt'

#### Generate a Connection to Milvus

In [None]:
from pymilvus import(
    Milvus,
    IndexType,
    Status,
    connections,
    FieldSchema,
    DataType,
    Collection,
    CollectionSchema,
)

connections.connect(alias='default',
                   host=host,
                   port=port,
                   user=user,
                   password=password,
                   server_pem_path=server_pem_path,
                   server_name='watsonxdata',
                   secure=True)

#### Create a Collection in Milvus
This code will drop the wiki_articles collection if it exists, and then recreate it. This script should return the following text.
```
Status(code=0, message=)
```

In [None]:
from pymilvus import utility

utility.drop_collection("wiki_articles")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True), # Primary key
    FieldSchema(name="article_text", dtype=DataType.VARCHAR, max_length=2500,),
    FieldSchema(name="article_title", dtype=DataType.VARCHAR, max_length=200,),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=384),
]

schema = CollectionSchema(fields, "wikipedia article collection schema")

wiki_collection = Collection("wiki_articles", schema)

# Create index
index_params = {
        'metric_type':'L2',
        'index_type':"IVF_FLAT",
        'params':{"nlist":2048}
}

wiki_collection.create_index(field_name="vector", index_params=index_params)

#### Double Check that the Schema Exists

In [None]:
from pymilvus import utility
utility.list_collections()

### Insert Vectors into Milvus

Here we read data from the watsonx.data table. We pull text chunks and titles from the database, being sure to separate them out into separate lists. We then vectorize using the `sentence-transformers/all-MiniLM-L6-v2` sentence transformer model. Learn more about Hugging Face sentence transformers here: [Sentence Transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

It is important we assemble the article text, article titles and vector embeddings into a `data` object. This object will be used to load the data into Milvus.

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from pymilvus import Collection, connections
import warnings
import os
#os.environ["TOKENIZERS_PARALLELISM"] = "false"

warnings.filterwarnings('ignore')

# Download Wikipedia articles from watsonx.data using the Presto connection created earlier

articles_df = %sql --pandas SELECT * from hive_data.watsonxai.wikipedia

# extract text + titles

passages = articles_df['text'].tolist()
passage_titles = articles_df['title'].tolist()

# Create vector embeddings + data

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # 384 dim
passage_embeddings = model.encode(passages)

basic_collection = Collection("wiki_articles") 
data = [
    passages,
    passage_titles,
    passage_embeddings
]
out = basic_collection.insert(data)
basic_collection.flush()  # Ensures data persistence
print("Done")
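
Before inserting, it can help to confirm that the three columns line up and respect the limits declared in the collection schema (384 dimensions, 2500 for `article_text`, 200 for `article_title`). The helper below is a minimal sketch with a hypothetical name, `validate_rows`, and it assumes the VARCHAR limits are measured in UTF-8 bytes.

```python
def validate_rows(texts, titles, vectors, dim=384, max_text=2500, max_title=200):
    """Return a list of problems found; an empty list means the rows fit the schema limits."""
    problems = []
    if not (len(texts) == len(titles) == len(vectors)):
        problems.append("column lengths differ")
        return problems
    for row, (text, title, vector) in enumerate(zip(texts, titles, vectors)):
        # Assumption: VARCHAR limits are measured in UTF-8 bytes
        if len(text.encode("utf-8")) > max_text:
            problems.append(f"row {row}: article_text exceeds {max_text} bytes")
        if len(title.encode("utf-8")) > max_title:
            problems.append(f"row {row}: article_title exceeds {max_title} bytes")
        if len(vector) != dim:
            problems.append(f"row {row}: vector has {len(vector)} dimensions, expected {dim}")
    return problems


# Illustrative rows: one valid, one with oversized text and a short vector
good = validate_rows(["short text"], ["Climate"], [[0.0] * 384])
bad = validate_rows(["x" * 3000], ["Climate"], [[0.0] * 10])
print(good)
print(bad)
```

In the notebook, the same check could be run as `validate_rows(passages, passage_titles, passage_embeddings)` just before the insert.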

#### Check that the Collection has been Loaded

In [None]:
basic_collection = Collection("wiki_articles") 
basic_collection.load()
basic_collection.num_entities 

## Query Milvus & Prompt LLM
After gathering the data from Wikipedia and then vectorizing it and inserting into Milvus, we are now ready to perform queries against the vector database. We will use the `sentence-transformers/all-MiniLM-L6-v2` model to generate the query vector and then use Milvus to find the most similar vectors in the database.

### Create a Query Function
The following function will be used to query the Milvus database.

In [None]:
from sentence_transformers import SentenceTransformer
from pymilvus import(
    Milvus,
    IndexType,
    Status,
    connections,
    FieldSchema,
    DataType,
    Collection,
    CollectionSchema,
)

def query_milvus(query, num_results=5):
    
    # Vectorize query
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # 384 dim
    query_embeddings = model.encode([query])

    # Search
    search_params = {
        "metric_type": "L2", 
        "params": {"nprobe": 5}
    }
    results = basic_collection.search(
        data=query_embeddings, 
        anns_field="vector", 
        param=search_params,
        limit=num_results,
        expr=None, 
        output_fields=['article_text'],
    )
    return results
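
To see what an `L2` search is ranking by, here is a brute-force sketch in plain Python. It is only an illustration of nearest-neighbor ranking by Euclidean distance: Milvus uses the `IVF_FLAT` index rather than scanning every vector, and the function names and sample vectors below are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vector, stored_vectors, limit=2):
    """Return (index, distance) pairs for the closest stored vectors, nearest first."""
    scored = [(i, l2_distance(query_vector, v)) for i, v in enumerate(stored_vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:limit]


stored = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
closest = nearest([0.0, 0.0], stored)
print(closest)
```

The key point for reading the results later is that with a distance metric such as `L2`, a smaller value indicates a closer match.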

### Prompt LLM with Query Results
Consider how climate change may relate to other industries and processes related to your business. Select one of the questions below to feed into the Milvus query.

In [None]:
question_text = "What can my company do to help fight climate change?"
#question_text = "How do businesses negatively affect climate change?"
#question_text = "What can a business do to have a positive effect on climate change?"
#question_text = "How can a business reduce their carbon footprint?"

### Search a Question in Milvus
An embedding is made for the question being asked. It is then used to search for the most relevant chunks in Milvus. The top 3 related chunks are retrieved below and can be used in a large language model prompt.

The documents that best match the question are found in the list below.

In [None]:
import re
num_results = 3
results = query_milvus(question_text, num_results)

display_articles = []
relevant_chunks  = []
for i in range(min(num_results, len(results[0]))):  # guard against fewer hits than requested
    display_articles.append({
        "ID"      : results[0].ids[i],
        "Distance": results[0].distances[i],
        # "Article" : re.sub(r"^.*?\. (.*$)",r"\1",results[0][i].entity.get('article_text'))
        "Article" : re.sub(r"^.*?\. (.*\.).*$",r"\1",results[0][i].entity.get('article_text'))        
    })
    relevant_chunks.append(re.sub(r"^.*?\. (.*\.).*$",r"\1",results[0][i].entity.get('article_text')))

# With the L2 metric a smaller distance is a closer match, so list the best match first
df = pd.DataFrame.from_dict(display_articles).sort_values("Distance",ascending=True)
df.style.set_properties(**{'text-align': 'left'}).set_caption(question_text).set_table_styles([{
    'selector': 'caption',
    'props': [
        ('color', 'blue'),
        ('font-size', '20px')
    ]
}])
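
The regular expression used in the cell above trims each chunk to whole sentences: it drops everything up to the first period followed by a space, and everything after the last period. Because chunks are cut on word counts, they usually begin and end mid-sentence. The sketch below applies the same pattern to a made-up sample string; the function name `trim_to_whole_sentences` is hypothetical.

```python
import re

# Same pattern as in the cell above
TRIM_PATTERN = r"^.*?\. (.*\.).*$"


def trim_to_whole_sentences(chunk):
    """Drop the partial sentence at the start and at the end of a chunk."""
    return re.sub(TRIM_PATTERN, r"\1", chunk)


sample_chunk = (
    "end of a cut sentence. Emissions rose sharply. "
    "Policies vary by region. A trailing frag"
)
trimmed = trim_to_whole_sentences(sample_chunk)
print(trimmed)

# A chunk with no sentence boundary does not match and is returned unchanged
unchanged = trim_to_whole_sentences("no sentence boundary here")
print(unchanged)
```

Note that the pattern also discards a complete first sentence if the chunk happens to start on a sentence boundary, which is a trade-off of this simple approach.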

### Generate a Prompt
The data retrieved from Milvus can be used to generate a prompt for watsonx.ai.

In [None]:
def make_prompt(context, question_text):
    return (f"{context}\n\nPlease answer a question using this text. "
          + f"If the question is unanswerable, say \"unanswerable\"."
          + f"\n\nQuestion: {question_text}")


# Build prompt w/ Milvus results
# Embed retrieved passages (context) and user question into prompt text

context = "\n\n".join(relevant_chunks)
prompt = make_prompt(context, question_text)
print(prompt)
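
Language models accept a limited amount of input, so it may be useful to cap how much retrieved text goes into the prompt. The sketch below keeps chunks in ranked order until a word budget is reached. The function name `fit_context` and the `max_words` budget are assumptions for illustration; real model limits are expressed in tokens, not words.

```python
def fit_context(chunks, max_words):
    """Keep chunks in order until adding the next one would exceed the word budget."""
    kept, used = [], 0
    for chunk in chunks:
        word_count = len(chunk.split())
        if used + word_count > max_words:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += word_count
    return "\n\n".join(kept)


# Illustrative chunks of 3, 2, and 4 words
sample_chunks = ["a b c", "d e", "f g h i"]
limited_context = fit_context(sample_chunks, max_words=5)
print(limited_context)
```

In the notebook, the result of `fit_context(relevant_chunks, max_words)` could be passed to `make_prompt` in place of the full `context` string.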

#### Credits: IBM 2025, Katherine Ciaravalli, George Baklarz [baklarz@ca.ibm.com]