![image](https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/notebooks/headers/watsonx-Prompt_Lab-Notebook.png)
# Milvus collection build for the grounded collection
This notebook contains steps and code to populate a Milvus collection using the documents defined in
the associated vector index asset. If the collection is not empty, it will be emptied before being re-populated.

**Note:** Notebook code generated using the Vector Index tool will execute successfully.
If code is modified or reordered, there is no guarantee it will successfully execute.

Some familiarity with Python is helpful. This notebook uses Python 3.11.

## Contents
This notebook contains the following parts:

1. Setup
2. Create vector store
3. Process the documents
4. Load chunked documents into the vector store

## Setup
Install the libraries you need for this notebook.

In [None]:
!pip install 'ibm-watsonx-ai>=1.3.6,<1.4.0'
!pip install python-pptx


In [None]:
import warnings
warnings.filterwarnings('ignore')
import os
import getpass

import ibm_boto3
from ibm_botocore.client import Config
from ibm_watsonx_ai.client import APIClient
from ibm_watsonx_ai.foundation_models.extensions.rag.vector_stores import VectorStore
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document
from ibm_watsonx_ai.foundation_models.extensions.rag.chunker import LangChainChunker
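# Loaders live in langchain_community; importing them from langchain.document_loaders is deprecated
from langchain_community.document_loaders import (
  CSVLoader,
  UnstructuredExcelLoader
)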

from ibm_watsonx_ai.data_loaders.datasets.documents import DocumentsIterableDataset
from ibm_watsonx_ai.data_loaders.experiment import ExperimentDataLoader
from ibm_watsonx_ai.helpers import FSLocation, AssetLocation
from ibm_watsonx_ai.helpers.connections import DataConnection

from ibm_watsonx_ai.foundation_models import Embeddings
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes


### Connection to WML
This cell defines the credentials required to work with the watsonx API during the execution of the build.

**Action:** Provide the IBM Cloud personal API key. For details, see
<a href="https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui" target="_blank">documentation</a>.


In [None]:
# NOTE: For job runs, this cell reads a bearer token from the VECTOR_INDEX_BEARER_TOKEN environment variable.
# If the variable is not set, the cell prompts for the user's API key instead.

wml_credentials = {
    "url": "https://ca-tor.ml.cloud.ibm.com",
}
if (os.getenv("VECTOR_INDEX_BEARER_TOKEN")):
    wml_credentials["token"] = os.getenv("VECTOR_INDEX_BEARER_TOKEN")
else:
    wml_credentials["apikey"] = getpass.getpass("Enter your API key:")

project_id = os.getenv("PROJECT_ID")
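
The choice between the bearer token and the API key can be sketched as a small standalone function. This is a minimal illustration only: `build_credentials`, its `env` and `prompt` parameters, and the example token and key values are hypothetical and not part of the generated notebook.

```python
from typing import Callable, Mapping


def build_credentials(env: Mapping[str, str], url: str, prompt: Callable[[str], str]) -> dict:
    """Prefer a bearer token from the environment; otherwise ask for an API key."""
    credentials = {"url": url}
    token = env.get("VECTOR_INDEX_BEARER_TOKEN")
    if token:
        # Job runs: the token is supplied through the environment
        credentials["token"] = token
    else:
        # Interactive runs: fall back to prompting the user
        credentials["apikey"] = prompt("Enter your API key:")
    return credentials


# Hypothetical values, used only to show both branches
job_creds = build_credentials(
    {"VECTOR_INDEX_BEARER_TOKEN": "example-token"},
    "https://ca-tor.ml.cloud.ibm.com",
    lambda _message: "unused",
)
interactive_creds = build_credentials(
    {},
    "https://ca-tor.ml.cloud.ibm.com",
    lambda _message: "example-api-key",
)
print(sorted(job_creds))
print(sorted(interactive_creds))
```

Passing the environment and the prompt function as arguments keeps the sketch testable without real credentials; in the notebook itself, `os.getenv` and `getpass.getpass` play those roles.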


In [None]:
client = APIClient(wml_credentials=wml_credentials)
client.set.default_project(project_id)

## Create vector store
Create an instance of the vector store wrapper class.

In [None]:
vector_index_details = client.data_assets.get_details("c421728b-6e37-43aa-b5d8-612e74154323")
vector_index_properties = vector_index_details["entity"]["vector_index"]
print(vector_index_properties)
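
The rest of the notebook reads several keys from `vector_index_properties`. The sketch below shows the shape those reads imply and a small check for missing keys. Only the key names come from the notebook code; every value, the `REQUIRED_PATHS` list, and the `find_missing_paths` helper are hypothetical.

```python
# Hypothetical example values; only the key names are taken from the notebook code.
example_vector_index_properties = {
    "settings": {
        "embedding_model_id": "example-embedding-model",
        "chunk_size": 1000,
        "chunk_overlap": 200,
        # Optional: field names are stored in dot notation
        "schema_fields": {
            "text": "text",
            "page_number": "metadata.page_number",
            "document_name": "metadata.document_name",
        },
    },
    "store": {
        "index": "example_collection",
        "database": "default",
        "connection_id": "example-connection-id",
    },
    "data_assets": ["example-asset-id-1", "example-asset-id-2"],
}

# Keys that the notebook accesses with square brackets, so they must exist
REQUIRED_PATHS = [
    "settings.embedding_model_id",
    "settings.chunk_size",
    "settings.chunk_overlap",
    "store.index",
    "store.database",
    "store.connection_id",
    "data_assets",
]


def find_missing_paths(properties: dict) -> list[str]:
    """Return the required dotted paths that are absent from `properties`."""
    missing = []
    for path in REQUIRED_PATHS:
        current = properties
        for key in path.split("."):
            if not isinstance(current, dict) or key not in current:
                missing.append(path)
                break
            current = current[key]
    return missing


print(find_missing_paths(example_vector_index_properties))
print(find_missing_paths({"settings": {"chunk_size": 1000}}))
```

Running a check like this before building can surface a `KeyError` early, before any documents are downloaded.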

In [None]:
emb = Embeddings(
    model_id=vector_index_properties["settings"]["embedding_model_id"],
    credentials=wml_credentials,
    project_id=project_id,
    params={
        "truncate_input_tokens": 512
    }
)

index_name = vector_index_properties["store"]["index"]
database_name = vector_index_properties["store"]["database"]

text_field = None

if ("schema_fields" in vector_index_properties["settings"]):
    vector_store_schema = vector_index_properties["settings"]["schema_fields"]
    text_field = vector_store_schema.get("text")

vector_store = VectorStore(
    client=client,
    connection_id=vector_index_properties["store"]["connection_id"],
    embeddings=emb,
    index_name=index_name,
    drop_old=True,
    database=database_name,
    consistency_level='Strong',
    connection_args={'secure': True},
    text_field=text_field
)





## Process the documents
We will now loop through the list of documents in the vector index and process them one by one according to their MIME type.

In [None]:
mime_type_mappings = {
    'text/csv': CSVLoader
}
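
The dispatch used later in `load_document` can be sketched as follows: Excel workbooks and unmapped MIME types go to the generic documents dataset, while mapped types use their dedicated loader. The function `pick_loader_name` and the use of plain strings instead of loader classes are simplifications made for this illustration.

```python
XLSX_MIME_TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def pick_loader_name(
    mime_type: str,
    mappings: dict[str, str],
    fallback: str = "DocumentsIterableDataset",
) -> str:
    """Return the name of the loader that would handle `mime_type`."""
    if mime_type == XLSX_MIME_TYPE:
        # Excel workbooks always use the generic documents dataset
        return fallback
    # Mapped MIME types use their dedicated loader; everything else falls back
    return mappings.get(mime_type, fallback)


# Strings stand in for the loader classes in this sketch
example_mappings = {"text/csv": "CSVLoader"}
for mime in ("text/csv", "application/pdf", XLSX_MIME_TYPE):
    print(mime, "->", pick_loader_name(mime, example_mappings))
```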

In [None]:
chunk_size = vector_index_properties["settings"]["chunk_size"]
chunk_overlap = vector_index_properties["settings"]["chunk_overlap"]

text_splitter = LangChainChunker(
    method="recursive",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
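
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window chunker. It is not the recursive method used by `LangChainChunker`, which also tries to split on natural separators; `sliding_window_chunks` is a hypothetical helper that only shows the overlap arithmetic.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is what an overlap of 1 means. Larger overlaps preserve more context across chunk boundaries at the cost of storing more chunks.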

In [None]:
def load_document( document_properties ):
    # Download the data asset to a local file
    asset_id = document_properties["metadata"]["asset_id"]
    filename = document_properties["metadata"]["name"]
    file_path = client.data_assets.download(asset_id, filename)
    mime_type = document_properties["entity"]["data_asset"]["mime_type"]
    # Excel workbooks are read through the watsonx.ai documents dataset
    if (mime_type == "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"):
        dataset = DocumentsIterableDataset(connections=[DataConnection(location=FSLocation(path=file_path))], error_callback=error_callback)
        data_loader = ExperimentDataLoader(dataset=dataset)
        return data_loader
    # MIME types with a dedicated LangChain loader (see mime_type_mappings)
    if (mime_type_mappings.get(mime_type)):
        loader = mime_type_mappings[mime_type](file_path)
        return loader.load_and_split()
    # All other MIME types fall back to the watsonx.ai documents dataset
    dataset = DocumentsIterableDataset(connections=[DataConnection(location=FSLocation(path=file_path))], error_callback=error_callback)
    data_loader = ExperimentDataLoader(dataset=dataset)
    return data_loader

def error_callback(document_name: str, exception: Exception):
    # Re-raise loader failures with the document name attached
    raise Exception(f"Loading document {document_name} failed with exception:\n{exception}")

In [None]:
# The vector store schema is stored using dot notation.
# This converts the dot notation to a nested dictionary consumable by the Document.
def dot_notation_to_dict(dot_notation, value, metadata):
    if dot_notation is None:
        return metadata
    parts = dot_notation.split('.')
    current = metadata
    for part in parts[:-1]:
        if part not in current:
            current[part] = {}
        current = current[part]
    current[parts[-1]] = value
    return metadata

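def compute_documents_metadata( document_name, loaded_documents ):
    # NOTE: documents are only returned when the vector index defines "schema_fields";
    # without a schema, this function returns an empty list.
    filtered_documents = []
    if ("schema_fields" in vector_index_properties["settings"]):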
        vector_store_schema = vector_index_properties["settings"]["schema_fields"]
        for document in loaded_documents:
            computed_document_data = {
                "metadata": {}
            }
            computed_document_data["metadata"] = dot_notation_to_dict(vector_store_schema.get("page_number"), document.dict()["metadata"].get("page", 0), computed_document_data["metadata"])
            computed_document_data["metadata"] = dot_notation_to_dict(vector_store_schema.get("document_name"), document_name, computed_document_data["metadata"])
            computed_document_data["metadata"] = dot_notation_to_dict(vector_store_schema.get("page_content"), document.dict()["page_content"], computed_document_data["metadata"])
            filtered_documents.append(document.copy(update=computed_document_data))
    return filtered_documents
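
As a usage illustration, the sketch below repeats `dot_notation_to_dict` so that it runs on its own and applies it to two hypothetical schema paths, `source.page_number` and `source.document_name`. These paths and values are assumptions; the real ones come from the vector index schema.

```python
def dot_notation_to_dict(dot_notation, value, metadata):
    """Store `value` in `metadata` under the nested path given in dot notation."""
    if dot_notation is None:
        return metadata
    parts = dot_notation.split('.')
    current = metadata
    for part in parts[:-1]:
        if part not in current:
            current[part] = {}
        current = current[part]
    current[parts[-1]] = value
    return metadata


metadata = {}
metadata = dot_notation_to_dict("source.page_number", 3, metadata)
metadata = dot_notation_to_dict("source.document_name", "report.csv", metadata)
# A missing schema field (None) leaves the metadata untouched
metadata = dot_notation_to_dict(None, "ignored", metadata)
print(metadata)  # {'source': {'page_number': 3, 'document_name': 'report.csv'}}
```

Paths that share a prefix end up in the same nested dictionary, which is why the metadata is passed back in on every call.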


In [None]:
def process_document( document_properties ):
    print("Processing", document_properties["metadata"]["name"])
    # Parse the document into raw text
    loaded_documents = load_document(document_properties)
    filtered_documents = compute_documents_metadata(document_properties["metadata"]["name"], loaded_documents)
    # Split the document text into chunks
    return text_splitter.split_documents(filtered_documents)

In [None]:
def process_documents_from_folder( connected_folder_details ):
    connection_id = connected_folder_details["entity"]["folder_asset"]["connection_id"]
    connection_path = connected_folder_details["entity"]["folder_asset"]["connection_path"]
    connection_details = client.connections.get_details(connection_id)
    connection_props = connection_details['entity']['properties']
    connection_name = connection_details['entity']['name']
    bucket_name=connection_props.get("bucket")

    # Do not process documents if there is no bucket defined in the connection
    if bucket_name is None:
        return []
    
    print(f"Initializing client for {connection_name} connection")

    cos_client = ibm_boto3.client(
        "s3",
        ibm_api_key_id=connection_props['api_key'],
        ibm_service_instance_id=connection_props['resource_instance_id'],
        ibm_auth_endpoint=connection_props['iam_url'],
        config=Config(signature_version="oauth"),
        endpoint_url=f"https://{connection_props['url']}"
    )
    
    prefix=connection_path.removeprefix(f"/{bucket_name}/")
    prefix=f"{prefix}/"

    files = cos_client.list_objects(Bucket=bucket_name, Prefix=prefix)
    filenames = [f.get('Key') for f in files.get("Contents", [])]
    all_documents = []
    for filename in filenames:
        loaded_documents = []
        try:
            file_path = f"{connection_name}{filename.removeprefix(prefix)}"
            cos_client.download_file(Bucket=bucket_name, Filename=file_path, Key=filename)
            file_props = cos_client.get_object(Bucket=bucket_name, Key=filename)
            mime_type = file_props["ContentType"]

            # Parse the document into raw text
            if (mime_type == "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"):
                dataset = DocumentsIterableDataset(connections=[DataConnection(location=FSLocation(path=file_path))], error_callback=error_callback)
                data_loader = ExperimentDataLoader(dataset=dataset)
                loaded_documents += data_loader
            elif (mime_type_mappings.get(mime_type)):
                loader = mime_type_mappings[mime_type](file_path)
                loaded_documents += loader.load_and_split()
            else:
                dataset = DocumentsIterableDataset(connections=[DataConnection(location=FSLocation(path=file_path))], error_callback=error_callback)
                data_loader = ExperimentDataLoader(dataset=dataset)
                loaded_documents += data_loader
            
            all_documents += compute_documents_metadata(f"{connection_name}-{filename.removeprefix(prefix)}", loaded_documents)
            
            print(f"Loaded '{filename}' from '{bucket_name}' bucket.")
        except Exception as e:
            print(f"Could not load '{filename}' from '{bucket_name}' bucket.", e)

    # Split the document text into chunks
    return text_splitter.split_documents(all_documents)

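
The string handling in `process_documents_from_folder` is easier to follow in isolation. The sketch below reproduces how the object prefix, the local file path, and the document name are derived; the helper names and the example bucket, path, and connection name are hypothetical.

```python
def object_prefix(connection_path: str, bucket_name: str) -> str:
    """Turn a connection path such as '/bucket/folder' into the key prefix 'folder/'."""
    prefix = connection_path.removeprefix(f"/{bucket_name}/")
    return f"{prefix}/"


def local_file_path(connection_name: str, key: str, prefix: str) -> str:
    """Local download path: the connection name followed directly by the relative key."""
    return f"{connection_name}{key.removeprefix(prefix)}"


def document_name(connection_name: str, key: str, prefix: str) -> str:
    """Name recorded in the metadata: connection name, a hyphen, then the relative key."""
    return f"{connection_name}-{key.removeprefix(prefix)}"


# Hypothetical connection and object key
prefix = object_prefix("/my-bucket/docs/reports", "my-bucket")
key = "docs/reports/q1.csv"
print(prefix)                                 # docs/reports/
print(local_file_path("my-cos", key, prefix)) # my-cosq1.csv
print(document_name("my-cos", key, prefix))   # my-cos-q1.csv
```

Note that the local path and the document name differ by a hyphen, mirroring the two f-strings in the function above.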

In [None]:
def process_documents( document_ids ):
    documents = []
    for document_id in document_ids:
        document_properties = client.data_assets.get_details(document_id)
        
        if (document_properties["metadata"]["asset_type"] == "folder_asset"):
            document = process_documents_from_folder(document_properties)
            documents += document
        else:
            document = process_document(document_properties)
            documents += document
    
    return documents

In [None]:
documents = process_documents(vector_index_properties["data_assets"])

## Load chunked documents into the vector store
We can now add the chunked documents to the vector store. Because the build notebook can be run
repeatedly, the vector store was created with `drop_old=True`, so the collection is emptied before the documents are added.

In [None]:
vector_store.add_documents(content=documents, batch_size=20)
print("Documents were added.")
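
As a generic illustration of what a batch size of 20 means, the sketch below splits a list into consecutive batches. It is not necessarily how `add_documents` batches internally; `batched` is a hypothetical helper and the integers stand in for document chunks.

```python
from typing import Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` containing at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 45 placeholder chunks split into batches of 20
batch_lengths = [len(batch) for batch in batched(list(range(45)), 20)]
print(batch_lengths)  # [20, 20, 5]
```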

# Next steps
You have reached the end of this notebook. If there were no errors, the vector index has been loaded
with document chunks and is now ready for executing proximity search queries.

<a id="copyrights"></a>
### Copyrights

Licensed Materials - Copyright © 2024 IBM. This notebook and its source code are released under the terms of the ILAN License.
Use, duplication, or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

**Note:** The auto-generated notebooks are subject to the International License Agreement for Non-Warranted Programs (or equivalent) and License Information document for watsonx.ai Auto-generated Notebook (License Terms), such agreements located in the link below. Specifically, the Source Components and Sample Materials clause included in the License Information document for watsonx.ai Studio Auto-generated Notebook applies to the auto-generated notebooks.  

By downloading, copying, accessing, or otherwise using the materials, you agree to the <a href="https://www14.software.ibm.com/cgi-bin/weblap/lap.pl?li_formnum=L-AMCU-BYC7LF" target="_blank">License Terms</a>  