#### -------------------------------------------------------------------------
####
#### Licensed Materials - Property of IBM
#### 
#### Copyright IBM Corporation 2024, 2025. All Rights Reserved.
#### 
#### Note to U.S. Government Users Restricted Rights:
#### Use, duplication or disclosure restricted by GSA ADP Schedule
#### Contract with IBM Corp.
####
#### -------------------------------------------------------------------------

# Document Vectorization for Retrieval Augmented Generation

This Notebook is part of watsonx Code Assistant on IBM Cloudpak for Data. It loads code and documentation content from a GitHub repository into a vector store (hereafter referred to as indexing) for use with the Retrieval Augmented Generation (hereafter referred to as RAG) feature in IBM watsonx Code Assistant (hereafter referred to as WCA). It pulls content from the repository, chunks the content based on its type, enhances the code chunks with explanations, vectorizes the chunks, and stores them in the vector store. It uses the [wca_rag_lib Python library](https://github.com/IBMDataScience/sample-notebooks/blob/master/Files/wca_rag_lib-0.0.0.tar.gz) to perform these operations. You must download this library before running this Notebook.

Content that is vectorized through this Notebook is then used by watsonx Code Assistant to extract context for chat queries using the @repo and @docs references. 
For details on how to configure RAG in WCA, refer to: https://www.ibm.com/docs/en/software-hub/5.2.x?topic=services-watsonx-code-assistant 

>**Please note <br><br> This Notebook copies code into your Opensearch vector store on IBM Cloudpak for Data. Make sure you have all the approvals and permissions needed to make a copy of your code outside the GitHub repository.**

## How to use this Notebook 

**Indexing a repository for the first time:**<BR>

1. Decide which content type to index  
   This Notebook can index two types of content:
    - source code
    - documentation

   The two types cannot be indexed in the same run. Set the ```content_type``` variable to either ```code``` or ```docs```. (More details: [**Variables that control scope of the content to be indexed**](#scope) section below).

2. Make sure the prerequisites are met.<br>
3. Review all the environment variables listed in this Notebook and make sure the mandatory ones are set.<br>
4. Follow the instructions and run the rest of the cells.

As a best practice, if your repository contains over 10,000 files, index specific folders and build the index incrementally. This can be done by setting the `included_folders` variable.
   

**Keeping the indexed content up to date**<br>
For accurate results with RAG it is essential that the contents of the vector store are updated periodically to keep the contents current with respect to changes in the Github repository.

To update files in an existing index:
 1. Set the `content_type` to either `code` or `docs` based on the content.
 2. Set the ```reindex``` flag to ```True```.
 3. Make sure the REPOSITORY_BRANCH is set to the same branch that was used to create the index.
 4. Make sure you enter the ```index_name```. This is typically the name of the GitHub repository that was indexed.
 5. Follow the instructions and run the cells.

***Important Note:*** The REPOSITORY_BRANCH used for re-indexing must match the branch that was used to create the index. Using a different branch for re-indexing causes content from two different branches to be indexed, leaving the vector store in an inconsistent state.

## How Source Code Indexing Works<br>

Source code is downloaded from the configured GitHub repository and chunked into functions. An explanation is generated for each function and attached to the respective chunk. The chunks are then vectorized using an embedding model. Both the vectorized chunk and the plain-text code are stored in the vector store.

Specialized function extractors are used for code in Python, Java, C, C++, JavaScript, TypeScript and Go. Chunking is currently not supported for other languages through this Notebook. 

If you wish to index JSON, XML or other such files which are not documentation, you can use the file chunking strategy with ```content_type``` set to ```code```. Please refer to the environment variables section for more details.
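
To make the idea of function-level chunking concrete, here is a minimal sketch for Python sources only, using the standard `ast` module. It is not the extractor used by `wca_rag_lib`; the helper name `chunk_python_functions` and the chunk fields are assumptions made for illustration.

```python
import ast


def chunk_python_functions(source: str) -> list:
    """Return one chunk per function or method found in `source` (illustrative only)."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                # Exact source text of the function, as it appears in the file
                "text": ast.get_source_segment(source, node),
            })
    # ast.walk does not guarantee an order, so sort by position in the file
    chunks.sort(key=lambda chunk: chunk["start_line"])
    return chunks


sample = '''
def add(a, b):
    return a + b

class Greeter:
    def greet(self, name):
        return "hello " + name
'''

chunks = chunk_python_functions(sample)
for chunk in chunks:
    print(chunk["name"], chunk["start_line"])
```

Each chunk keeps the function text together with its location, which is the kind of unit an explanation and an embedding can then be attached to.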

## How documentation indexing works<br>
Documentation chunking is supported for Markdown, PDF, docx and pptx formats. The documentation content is downloaded from the configured GitHub repository and chunked based on a fixed chunk size. The chunks are vectorized and saved to the vector store along with the plain text of the chunk.
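
Fixed-size chunking can be sketched in a few lines. The helper below is hypothetical: the chunk size the library actually uses, and whether it overlaps neighbouring chunks, are not stated in this Notebook, so `chunk_size` and `overlap` are illustrative parameters only.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 0) -> list:
    """Split `text` into pieces of at most `chunk_size` characters.

    `overlap` is an assumed, optional number of characters shared by
    consecutive chunks; 0 gives plain non-overlapping chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


plain_chunks = chunk_text("abcdefghij", chunk_size=4)
overlapping_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(plain_chunks)
print(overlapping_chunks)
```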


## Prerequisites : 

Before you start running this Notebook, please make sure you have met the following requirements: 

 - you have the latest instance of WCA set up on IBM Cloudpak for Data (5.2.2) and have followed the instructions for setting up Retrieval Augmented Generation (RAG). More details: https://www.ibm.com/docs/en/software-hub/5.2.x?topic=services-watsonx-code-assistant
 - you have an API key to access WCA.  More Details:  https://www.ibm.com/docs/en/software-hub/5.2.x?topic=wca-getting-started#getting-started__api-key__title__1
 
In addition to the above, you will also need the following : 
 - a Personal Access Token (PAT) from your GitHub account. You can obtain this from https://github.com/settings/tokens. If you wish to use code from your organization repositories, use the appropriate Enterprise GitHub URL. The generated PAT needs read permissions on your repositories, so make sure to select the repo permissions while generating the token.


## Setting up the environment variables

**Secrets and Vector Store Details (required)**<br>
The following environment variables are part of the parameter set that comes with the watsonx Code Assistant RAG project. These parameters usually need to be set only once and are then used for all the repositories that need to be indexed. They are:
  - Vector store details (Note: Opensearch is the only vector store supported by WCA on IBM Cloudpak for Data)
  - WCA API Key
  - WCA Username
  - Github PAT

Locate the parameter set RAG_INDEX_PARAMETERS in the Watson Studio project and update the above parameters.


The rest of the variables are to be set in the Notebook. The values of these variables may change based on the content that you index. 

**Github repository details (required)**<br>
Provide the details of the repository that you want to index through the following variables:  
- GITHUB_HOST
- REPOSITORY_ORGANIZATION_NAME
- REPOSITORY_NAME
- REPOSITORY_BRANCH

This Notebook pulls content from Github based on the values of these variables. Note that the default branch is set to `main`. If you wish to use a different branch, set the REPOSITORY_BRANCH value accordingly.

<a name="scope"></a>  
**Variables that control scope of the content to be indexed**<br>
The following optional variables can be set in the Notebook to control the scope of the content that gets indexed. Their values may change based on which content of each repository you wish to process and how. It is recommended to review these values before you begin indexing any repository. No change is required if the default values suit your repository.

***content_type***<br>
Indicates whether the content that is being indexed is code or documentation.<br>
Valid values : ```code```  or ```docs``` <br>
Default value : ```code```<br>
Change this to ```docs``` if you are indexing documentation content such as markdown files, pdfs, docx or pptx.

***included_folders***<br>
This is an optional parameter that limits the folders that are to be indexed. If not all the folders in the repository need to be indexed, specify the paths to the folders to be included in this list. This is useful for large applications which have all their code in a single repository. Indexing specific folders incrementally is the recommended best practice for such repositories.<br>
Valid values : path to the folders to be included. Example : ```["src/java/app_server" , "src/java/messaging"]``` <br>
Default value: empty list.<br>

***excluded_folders***<br>
This is an optional parameter that excludes folders that are not to be indexed. Specify the paths to the folders to be excluded in this list. This is useful when you need one or two folders to be excluded from indexing. For example, a test folder containing unit tests or functional tests is a typical candidate. Not indexing test content prevents it from being used as context instead of the actual code.<br>
Valid values: path to the folders to be excluded. Example : ```["src/tests" , "src/doc"]``` <br>
Default value: empty list.<br>

***ignore_file_keywords***<br>
This is an optional parameter that helps skip files with specific keywords in the file name from being indexed.<br>
Valid values: Any string that could be part of a file name<br>
Default value: empty list.<br>
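
The three scope variables above can be read as a single filter over repository paths. The following is a minimal sketch of that reading, not the library's implementation: the helper names `is_under` and `should_index` are hypothetical, and the exact matching rules (whole path segments for folders, substring match on the file name for keywords) are assumptions.

```python
from pathlib import PurePosixPath


def is_under(path: str, folder: str) -> bool:
    """True if `path` is `folder` itself or lies somewhere below it."""
    candidate, parent = PurePosixPath(path), PurePosixPath(folder)
    return parent == candidate or parent in candidate.parents


def should_index(path, included_folders=(), excluded_folders=(), ignore_file_keywords=()):
    """Decide whether a repository-relative file path falls inside the indexing scope."""
    # An empty include list means "everything is included"
    if included_folders and not any(is_under(path, f) for f in included_folders):
        return False
    if any(is_under(path, f) for f in excluded_folders):
        return False
    # Keywords are matched against the file name only, not the folders
    file_name = PurePosixPath(path).name
    return not any(keyword in file_name for keyword in ignore_file_keywords)


print(should_index("src/java/app_server/Main.java", included_folders=["src/java/app_server"]))
print(should_index("src/tests/test_login.py", excluded_folders=["src/tests"]))
```

Matching on whole path segments means that excluding `src/tests` does not accidentally exclude a sibling folder such as `src/testsuite`.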

***chunking_strategy***<br>
This is an optional parameter that can be used when the ```content_type``` is ```code```. Use it when a file should not be chunked but indexed as-is. This is useful for file types such as XML or JSON, which contain neither functions like code nor natural language text like Markdown or PDF.<br>
Valid values: ```function``` or ```file```<br>
Default value: ```function```

***file_types_for_file_strategy***<br>
This parameter is used in conjunction with ```chunking_strategy``` . If `chunking_strategy` is set to `file`, ```file_types_for_file_strategy``` indicates what file types should be processed with file strategy.<br>
Valid values: List of valid file extensions. Example : ```["json" , "xml"]``` <br>
Default value: Empty list. No specific files are processed with file chunking strategy.
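
One way to read the interaction between `chunking_strategy` and `file_types_for_file_strategy` is sketched below. The helper `pick_strategy` is hypothetical, and the case-insensitive extension matching is an assumption; the library may resolve this differently.

```python
from pathlib import PurePosixPath


def pick_strategy(path, chunking_strategy="function", file_types_for_file_strategy=()):
    """Return the strategy ('function' or 'file') that would apply to one file."""
    if chunking_strategy not in ("function", "file"):
        raise ValueError("chunking_strategy must be 'function' or 'file'")
    extension = PurePosixPath(path).suffix.lstrip(".").lower()
    # Assumption: extensions are compared without the dot and ignoring case
    file_types = {t.lstrip(".").lower() for t in file_types_for_file_strategy}
    if chunking_strategy == "file" and extension in file_types:
        return "file"
    return "function"


print(pick_strategy("config/settings.json", "file", ["json", "xml"]))
print(pick_strategy("src/app.py", "file", ["json", "xml"]))
```

Under this reading, an empty `file_types_for_file_strategy` list means no file is indexed as-is, which matches the default described above.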

**Indexing Variables**<br>

***index_name (required)***<br>
This is a mandatory variable that needs to be set by the user. <br>
This indicates the name of the index in the vector store for the content that is being indexed. Typically this can be the name of the repository that is being indexed. However, when multiple repositories are being indexed into the same index (for example different documentation repositories grouped into one index), a more meaningful name can be chosen for the index.<br>
Note: If your vector store is Milvus, the index name should not contain any hyphens. Replace each hyphen with an underscore.
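
If index names are derived from repository names, the hyphen rule can be applied mechanically. The helper below is a hypothetical convenience, not part of the Notebook or the library.

```python
def milvus_safe_index_name(name: str) -> str:
    """Replace hyphens with underscores, as the note above asks for Milvus."""
    return name.replace("-", "_")


safe_name = milvus_safe_index_name("my-docs-repo")
print(safe_name)
```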

***create_new_index***<br>
This flag indicates if the index mentioned in the `index_name` variable needs to be created in the vector store or is an existing index. Set this value to `False` if you wish to reuse an existing index for the content that is being indexed in the current run.<br>
Valid values: `True/False`<br>
Default value: `True`

***reindex***<br>
This flag indicates if the current run is for re-indexing an existing index in the vector store. When this is set to `True`, the changes in REPOSITORY_BRANCH since the previous push to the vector store are pulled from GitHub and pushed to the vector store. The same index name that was used to index the repository previously must be set in the `index_name` variable.<br>
Valid values: `True/False`<br>
Default value: `False`

**Other variables**<br>

***root_url***<br>
This is the host that is used to generate explanations for code chunks. Please refer to the RAG setup instructions in the WCA product documentation for instructions on how to obtain this URL: https://www.ibm.com/docs/en/software-hub/5.2.x?topic=services-watsonx-code-assistant

## Troubleshooting

- When something fails, it is generally because an environment variable is not set correctly. This is the first thing to check if you see errors, especially in the initial steps.

- When the explanation generation step fails with an error but the notebook kernel is still running, try resuming it by re-running the cell. [Step 5b]

- When the indexing step (6) fails but the notebook kernel is still running, it can be resumed by running the subsequent cell after setting the resume flag to True. [Step 6b]

- If processing large repositories, especially those with PDF files, is slow with the Notebook running on the default XXS configuration, use a larger environment configuration such as XS or S for faster processing.
This can be done in the Watson Studio project by clicking the vertical ellipsis menu of the Notebook in the assets tab and selecting the Change environment option.

- If you face issues while inserting the project token, you can create a new project token by going to your Project -> Manage -> Switch to Access tokens tab -> Create new access token

Please try resuming the failed steps especially for large repositories so the entire process is not repeated once again.

## Cleanup
This notebook creates intermediate jsonl files to help with resuming the steps if there are failures. These files can be found in the Watson Studio project. You can delete them manually after the indexing has been done.

In [1]:
parameters_retrieved = {
    "GITHUB_API_TOKEN": "",
    "WCA_APIKEY": "",
    "WCA_USERNAME": "",
    "DB_TYPE": "opensearch",    
    "OS_URL": "",
    "OS_USERNAME": "",
    "OS_PASSWORD": ""
}

In [None]:
# Set Required variables

#########################################################################################################################
# Please refer to the documentation in the beginning of this notebook and populate the following variables if required  #
#########################################################################################################################

content_type = "code"

create_new_index = True

index_name = "" 

# set the boolean variable according to usecase
reindex = False 
if reindex:
    create_new_index =  False

# These will be used to get explanations for code from WCA
root_url = "https://<cpd hostname>"
default_base_url = f"{root_url}/v2/wca/core/chat/text/generation"

# Repository information - this is the repository that will be processed and stored to the vector store.
# provide the values for the following variables.
GITHUB_HOST = "" # e.g. github.com or github.ibm.com
REPOSITORY_ORGANIZATION_NAME = ""  #code-assistant
REPOSITORY_NAME = ""
REPOSITORY_BRANCH = "main"  # "main" # "master"

excluded_folders = []   # "src/test"

included_folders = []   # "docs"

ignore_file_keywords = []

chunking_strategy = "function"  # function # file

file_types_for_file_strategy = []   # 'XML',"HTML"


In [29]:
if not index_name:
    raise Exception("Index name not populated. Set index_name and try again.")
if set(excluded_folders) & set(included_folders):
    raise Exception("`excluded_folders` and `included_folders` must not contain the same folder path.")

### 2. Install the Required Dependencies

In [None]:
%pip install wca_rag_lib-0.0.0.tar.gz

In [30]:
if GITHUB_HOST == 'github.com':
    GITHUB_API_TOKEN = ""    
else:
    GITHUB_API_TOKEN = parameters_retrieved['GITHUB_API_TOKEN']

apikey = parameters_retrieved['WCA_APIKEY']
username = parameters_retrieved['WCA_USERNAME']
db_type = parameters_retrieved['DB_TYPE']

if db_type == "opensearch":    # Open Search Config
    config = {
        "OS_URL":parameters_retrieved["OS_URL"],
        "OS_PASSWORD":parameters_retrieved["OS_PASSWORD"],
        "OS_USERNAME":parameters_retrieved["OS_USERNAME"]
    }


### 3. Extract Repository Contents
#### This step clones the repository contents and converts them into a parquet dataset.
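
The clone URL in this step embeds the access token when one is available. A minimal sketch of that construction follows, together with a redaction helper that can be useful before printing such a URL. The function names and the example host, organization and token values are hypothetical.

```python
def build_clone_url(host: str, organization: str, repository: str, token: str = "") -> str:
    """Build an HTTPS clone URL, embedding the token only when one is provided."""
    base = f"https://{token}@{host}" if token else f"https://{host}"
    return f"{base}/{organization}/{repository}"


def redact(url: str, token: str) -> str:
    """Hide the token so the URL can be logged safely."""
    return url.replace(token, "***") if token else url


# Example values are placeholders, not real credentials or hosts
private_url = build_clone_url("github.example.com", "my-org", "my-repo", token="s3cr3t")
public_url = build_clone_url("github.com", "my-org", "my-repo")
print(redact(private_url, "s3cr3t"))
print(public_url)
```

Because the token is part of the URL, avoid printing the unredacted value in a shared Notebook.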

In [None]:
import dataclasses
import tempfile
from pathlib import Path
from typing import Callable, List

from git import Repo

from wca_rag_lib.extract_code.commands import (
    extract_class_aware_functions_from_parquet,
    extract_non_code_from_parquet,
    extract_parquet
)
from wca_rag_lib.store_data.commands import encode_store
from wca_rag_lib.store_data.es_reindex import update_es_index
from wca_rag_lib.store_data.milvus_reindex import update_milvus_index
import wca_rag_lib.store_data.os_reindex as os_reindex
import importlib
importlib.reload(os_reindex)
from wca_rag_lib.store_data.os_reindex import update_os_index

def get_latest_commit_id(file_dir):
    try:
        repo = Repo(file_dir, search_parent_directories=True)
        commit_id = repo.head.commit.hexsha
        print("Commit id found:", commit_id)
        return commit_id
    except Exception as e:
        print("Error retrieving commit ID:", e)
        return ""

# Clone a git repository and extract a parquet dataset.
# Repo Details
if GITHUB_API_TOKEN:
    BASE_URL = f"https://{GITHUB_API_TOKEN}@{GITHUB_HOST}"
else:
    BASE_URL = f"https://{GITHUB_HOST}"

REPOSITORY_URL = f"{BASE_URL}/{REPOSITORY_ORGANIZATION_NAME}/{REPOSITORY_NAME}"
OUTPUT_DIR = f"{REPOSITORY_NAME}_ibm"
with tempfile.TemporaryDirectory() as tmpdir:
    dest = Path(tmpdir) / REPOSITORY_NAME
    repo = Repo.clone_from(REPOSITORY_URL, dest)
    repo.git.checkout(REPOSITORY_BRANCH)
    latest_commit_id = get_latest_commit_id(dest)
    repo_url = f"https://{GITHUB_HOST}/{REPOSITORY_ORGANIZATION_NAME}/{REPOSITORY_NAME}/blob/{REPOSITORY_BRANCH}"
    new_entries_path = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_new_update_ibm.jsonl"
    extract_parquet(
        input_path=tmpdir,
        output_path=f"{OUTPUT_DIR}/{REPOSITORY_NAME}_ibm.parquet",
        repo_url=repo_url,
        repo_name=REPOSITORY_NAME,
        included_folders=included_folders,
        excluded_folders=excluded_folders,
        ignore_file_keywords=ignore_file_keywords,
    )
    
    print("Parquet dataset output complete")

    if reindex:
        content_type_lower = content_type.lower()
        content_type_lower = 'noncode' if content_type_lower == 'docs' else content_type_lower
        if db_type == "elasticsearch":
            update_es_index(index_name,dest,repo_url,new_entries_path,config=config,db_type=db_type, chunking_strategy=chunking_strategy, file_types_for_file_strategy=file_types_for_file_strategy, content_type=content_type_lower)
        elif db_type == "opensearch":
            update_os_index(index_name,dest,repo_url,new_entries_path,connection_params=config, chunking_strategy=chunking_strategy, file_types_for_file_strategy=file_types_for_file_strategy, content_type=content_type_lower)
        else:
            update_milvus_index(index_name,dest,repo_url,new_entries_path,config=config, chunking_strategy=chunking_strategy, file_types_for_file_strategy=file_types_for_file_strategy, content_type=content_type_lower)

### 4. Extract Functions from Repository Content
#### Transform the parquet dataset into a JSONL file for further processing.

In [None]:
# Extract functions (for code) or text chunks (for docs) from the parquet dataset.

if not reindex:
    FUNCTIONS_DATASET_PATH = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_functions_ibm.jsonl"

    # reuse_input_path_parquet = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_ibm.parquet".replace('-', '_').replace('/', '__')
    if content_type.lower() == "code":
        extract_class_aware_functions_from_parquet(
            input_path=f"{OUTPUT_DIR}/{REPOSITORY_NAME}_ibm.parquet",  # input_path = reuse_input_path_parquet,
            output_path=FUNCTIONS_DATASET_PATH,
            path_field='path',
            code_field='contents',
            language_field='language',
            chunking_strategy = chunking_strategy,
            file_types_for_file_strategy = file_types_for_file_strategy,
        )

        with open(FUNCTIONS_DATASET_PATH, 'r') as f:
            num_lines = sum(1 for line in f)
        print(f"Total number of records to be processed : {num_lines}")
    else:
        extract_non_code_from_parquet(
            input_path=f"{OUTPUT_DIR}/{REPOSITORY_NAME}_ibm.parquet",  # input_path = reuse_input_path_parquet,
            output_path=FUNCTIONS_DATASET_PATH,
            path_field='path',
            code_field='contents',
            language_field='language',
        )

        with open(FUNCTIONS_DATASET_PATH, 'r') as f:
            num_lines = sum(1 for line in f)
        print(f"Total number of records to be processed : {num_lines}")

### 5. Enhance functions with Explanations (CODE)

#### In this step, we send the previously extracted functions to WCA to generate detailed explanations for each function. The generated explanations are then stored in the `explanation` column of the enhanced dataset, which will later be used for embedding. 
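
The prompt sent for each function wraps the chunk text in a fenced block tagged with the chunk's language. The sketch below mirrors the prompt template used in this step; the helper name `build_prompt` and the sample record are illustrative assumptions, and the fence is built programmatically only to keep the example readable.

```python
FENCE = "`" * 3  # three backticks, built in code to keep this example readable

# Mirrors the shape of the prompt template defined in this step
PROMPT_TEMPLATE = "\nExplain this code : \n" + FENCE + "{language}\n{text}\n" + FENCE + "\n"


def build_prompt(record: dict) -> str:
    """Fill the template with one record's language and code text."""
    return PROMPT_TEMPLATE.format(text=record["text"], language=record["language"])


# Hypothetical record with the two fields the template needs
record = {"language": "python", "text": "def one():\n    return 1"}
prompt = build_prompt(record)
print(prompt)
```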

5a. **Prepare for explanation generation**

In [33]:
# This cell prepares the file paths used to retrieve explanations for code snippets from a functions JSONL file and store them in an enhanced-functions JSONL file.

import json
import os

explanation_payload = """
Explain this code : 
```{language}
{text}
```
"""

input_file_path = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_functions_ibm.jsonl" if not reindex else new_entries_path

# use the following file path if the kernel was restarted and the extracted jsonl file for the functions is lost
# reuse_input_file_path_jsonl = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_functions_ibm.jsonl".replace('-', '_').replace('/', '__')

output_file_path = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_enhanced_functions_ibm.jsonl"  # location for the updated file

# location for file which stores errored records. This contains records for which explanation could not be generated due to some error during explanation generation
log_file_path = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_error_functions_ibm.jsonl" 

index_file_path = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_index_ibm.jsonl"

if os.path.exists(output_file_path):
    os.remove(output_file_path)

if os.path.exists(log_file_path):
    os.remove(log_file_path)

if os.path.exists(index_file_path):
    os.remove(index_file_path)

5b. **Explanation generation**<br><br>
    **If the following step fails, simply re-run this cell to resume the enhancement process.**
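
Resuming works by counting how many records are already present in the output file and skipping that many input records. A simplified, synchronous sketch of the idea follows; the helper names and file names are hypothetical. It assumes every processed input record produced exactly one output line, in order, which may not hold exactly when records fail or complete out of order.

```python
import json
import os
import tempfile


def count_lines(path: str) -> int:
    """Number of lines in a file, or 0 if the file does not exist yet."""
    if not os.path.exists(path):
        return 0
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def pending_records(input_path: str, output_path: str):
    """Yield (index, record) for input records that are not yet in the output."""
    already_done = count_lines(output_path)
    with open(input_path, "r", encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index >= already_done:
                yield index, json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    input_path = os.path.join(tmp, "functions.jsonl")
    output_path = os.path.join(tmp, "enhanced.jsonl")
    with open(input_path, "w", encoding="utf-8") as handle:
        for i in range(5):
            handle.write(json.dumps({"id": i}) + "\n")
    # Simulate a previous run that stopped after two records
    with open(output_path, "w", encoding="utf-8") as handle:
        for i in range(2):
            handle.write(json.dumps({"id": i}) + "\n")
    remaining = [index for index, _ in pending_records(input_path, output_path)]

print(remaining)  # prints [2, 3, 4]
```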

In [None]:
if content_type.lower() == "code":    
    import asyncio
    import aiofiles
    import json
    import os
    import sys
    import datetime
    from tenacity import retry, wait_exponential, stop_after_attempt, RetryError
    from tqdm.asyncio import tqdm
    from wca_rag_lib.external.watson import get_code_chunks_explanation, explain_response_processor

    try:
        from IPython.display import display, HTML, clear_output
        IS_JUPYTER = True
    except ImportError:
        IS_JUPYTER = False


    # please do not increase this number, it may cause performance issues
    MAX_CONCURRENT_TASKS = 10


    RETRY_SETTINGS = {
        "wait": wait_exponential(min=1, max=10),
        "stop": stop_after_attempt(3)
    }


    LOG_BATCH_SIZE = 100
    IO_FLUSH_INTERVAL = 5


    stats = {
        "success": 0,
        "no_exp": 0,        
        "no_exp_e": 0,      
        "skipped": 0,
        "errors": 0,
        "total_processed": 0,
        "total_input": 0
    }

    stats_lock = asyncio.Lock()

    log_queue = asyncio.Queue()

    error_messages = set()


    @retry(**RETRY_SETTINGS)
    async def explain_code(formatted_text, get_code_chunks_explanation, api_key):
        """
        Calls the external function to get code explanations with retry logic.
        This function assumes `get_code_chunks_explanation` is a synchronous or
        asynchronous function that takes formatted_text and api_key and returns
        (is_error: bool, response: str).
        """
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, get_code_chunks_explanation, formatted_text, api_key, username, root_url, default_base_url)

    async def count_processed_chunks(output_file_path):
        """Counts lines in the output file to determine how many chunks are already processed."""
        count = 0
        if os.path.exists(output_file_path):
            async with aiofiles.open(output_file_path, 'r') as outf:
                async for _ in outf:
                    count += 1
        return count

    async def count_input_records(input_file_path):
        """Counts total lines in the input file."""
        count = 0
        async with aiofiles.open(input_file_path, 'r') as infile:
            async for _ in infile:
                count += 1
        return count

    async def read_input_records(input_file_path, start_index):
        """
        Asynchronously reads and yields JSON records from the input file,
        starting from a specified index. Handles JSON decoding errors.
        """
        i = 0
        async with aiofiles.open(input_file_path, 'r') as infile:
            async for line in infile:
                if i < start_index:
                    i += 1
                    continue
                try:
                    yield i, json.loads(line)
                except json.JSONDecodeError as e:
                    await log_error(f"JSONDecodeError at line {i}: {str(e)} - Skipping record.")
                    async with stats_lock:
                        stats["errors"] += 1
                        stats["total_processed"] += 1
                finally:
                    i += 1
            else:
                if i == 0:
                    raise Exception(f"`{input_file_path}` has no data.")

    async def log_message(message: str, level: str):
        """Puts a log message into the log_queue."""
        timestamp = datetime.datetime.now().isoformat()
        log_entry = f"[{level}] {timestamp} - {message}"
        if level == "ERROR":
            print(log_entry, flush=True)
            global error_messages
            error_messages.add(message.split(' - ')[0])
        await log_queue.put(log_entry)

    async def log_error(message: str):
        await log_message(message, "ERROR")

    async def log_info(message: str):
        await log_message(message, "INFO")

    def get_stats_html():
        """Generates an HTML string for displaying live statistics."""
        return f"""
        <div style="font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; font-size: 14px; margin-top: 10px; padding: 10px; border: 1px solid #e0e0e0; border-radius: 8px; background-color: #f9f9f9;">
            <h4 style="margin-top: 0; color: #333;">Processing Statistics:</h4>
            <table style="width: 100%; border-collapse: collapse;">
                <tr><td style="padding: 4px 0;">Total Input:</td><td style="text-align: right; font-weight: bold;">{stats['total_input']}</td></tr>
                <tr><td style="padding: 4px 0;">Processed:</td><td style="text-align: right; font-weight: bold;">{stats['total_processed']}</td></tr>
                <tr><td style="padding: 4px 0;">Success:</td><td style="text-align: right; color: #28a745; font-weight: bold;">{stats['success']}</td></tr>
                <tr><td style="padding: 4px 0;">No Explanation (Lang):</td><td style="text-align: right; color: #ffc107; font-weight: bold;">{stats['no_exp']}</td></tr>
                <tr><td style="padding: 4px 0;">No Explanation (API Err):</td><td style="text-align: right; color: #dc3545; font-weight: bold;">{stats['no_exp_e']}</td></tr>
                <tr><td style="padding: 4px 0;">Skipped (Dir):</td><td style="text-align: right; color: #6c757d; font-weight: bold;">{stats['skipped']}</td></tr>
                <tr><td style="padding: 4px 0;">Other Errors:</td><td style="text-align: right; color: #dc3545; font-weight: bold;">{stats['errors']}</td></tr>
            </table>
        </div>
        """

    async def log_writer_task(log_file_path: str, queue: asyncio.Queue, flush_interval: int):
        """
        Dedicated task to write log messages to the log file in batches.
        """
        buffer = []
        last_flush_time = asyncio.get_event_loop().time()
        async with aiofiles.open(log_file_path, "a") as logf:
            while True:
                try:
                    item = await asyncio.wait_for(queue.get(), timeout=flush_interval)
                    buffer.append(item + "\n")
                    queue.task_done()

                    if len(buffer) >= LOG_BATCH_SIZE:
                        await logf.write("".join(buffer))
                        buffer.clear()
                        last_flush_time = asyncio.get_event_loop().time()
                except asyncio.TimeoutError:
                    if buffer:
                        await logf.write("".join(buffer))
                        buffer.clear()
                        last_flush_time = asyncio.get_event_loop().time()
                except asyncio.CancelledError:
                    if buffer:
                        await logf.write("".join(buffer))
                    break

    # Shared semaphore. A semaphore created inside process_chunk would be a new
    # object on every call and would therefore never limit concurrency.
    explain_semaphore = asyncio.Semaphore(MAX_CONCURRENT_TASKS)

    async def process_chunk(index, record, explanation_payload, output_file_path: str,
                            get_code_chunks_explanation, api_key, pbar,
                            explain_response_processor, stats_display_handle=None):
        """
        Processes a single code chunk: checks for exclusion/support, calls API,
        writes result to output file directly, and updates global statistics and progress bar.
        """
        try:
            path = record.get("path", "")
            lang = record.get("language", "").lower()

            formatted_text = explanation_payload.format(text=record['text'], language=record['language'])

            async with explain_semaphore:
                is_error, response = await explain_code(formatted_text, get_code_chunks_explanation, api_key)

            if is_error:
                record['explanation'] = explain_response_processor(lang, path, record.get("full_code", ""), "")
                async with stats_lock:
                    stats["no_exp_e"] += 1
                await log_error(f"Record {index}: No explanation - API error during explanation.")
            else:
                record['explanation'] = explain_response_processor(lang, path, record.get("full_code", ""), response)
                async with stats_lock:
                    stats["success"] += 1
                await log_info(f"Record {index}: Successfully explained.")

            async with aiofiles.open(output_file_path, "a") as outf: # Direct write to output
                await outf.write(json.dumps(record) + "\n")

        except RetryError as re:
            async with stats_lock:
                stats["errors"] += 1
            await log_error(f"Record {index}: RetryError - {str(re)}")
        except Exception as e:
            import traceback
            async with stats_lock:
                stats["errors"] += 1
            tb = traceback.format_exc()
            await log_error(f"Record {index}: Unhandled Exception - {str(e)}\n{tb}")
        finally:
            async with stats_lock:
                stats["total_processed"] += 1
                pbar.set_description(f"Processing ({stats['total_processed']}/{stats['total_input']})")
                if IS_JUPYTER and stats_display_handle:
                    stats_display_handle.update(HTML(get_stats_html()))
            pbar.update(1)


    async def run_all(
        input_file_path: str,
        output_file_path: str,
        log_file_path: str,
        explanation_payload: str,
        get_code_chunks_explanation,
        api_key: str,
        explain_response_processor
    ):
        """
        Main function to orchestrate the code chunk processing.
        """
        await log_info(f"Starting code explanation process at {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        await log_info(f"Input file: {input_file_path}")
        await log_info(f"Output will be written to: {output_file_path}")
        await log_info(f"Log file: {log_file_path}")
        await log_info(f"Max concurrent tasks: {MAX_CONCURRENT_TASKS}")
        await log_info(f"Log batch size: {LOG_BATCH_SIZE}")

        if not os.path.exists(input_file_path):
            print(f"[ERROR] Input file not found: {input_file_path}", flush=True)
            return

        processed_chunks_on_start = await count_processed_chunks(output_file_path)
        total_chunks = await count_input_records(input_file_path)

        async with stats_lock:
            stats["total_input"] = total_chunks
            stats["total_processed"] = processed_chunks_on_start

        remaining_chunks = total_chunks - processed_chunks_on_start

        print(f"\nResuming from chunk {processed_chunks_on_start}...", flush=True)
        print(f"Total chunks in input: {total_chunks}, Remaining to process: {remaining_chunks}\n", flush=True)

        stats_display_handle = None
        if IS_JUPYTER:
            clear_output(wait=True)
            stats_display_handle = display(HTML(get_stats_html()), display_id=True)
            print("\n", flush=True)

        log_writer_task_obj = asyncio.create_task(log_writer_task(log_file_path, log_queue, IO_FLUSH_INTERVAL))

        pbar = None
        try:
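            # Size the bar to the full input so the already-processed offset applied next does not overshoot the total.
            pbar = tqdm(total=total_chunks, desc="Processing", unit=" chunk", position=0, leave=True)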
            pbar.update(processed_chunks_on_start)

            processing_tasks = []
            async for i, record in read_input_records(input_file_path, processed_chunks_on_start):
                task = process_chunk(
                    i, record, explanation_payload, output_file_path,
                    get_code_chunks_explanation, api_key, pbar,
                    explain_response_processor, stats_display_handle
                )
                processing_tasks.append(asyncio.create_task(task))

                if len(processing_tasks) >= MAX_CONCURRENT_TASKS * 2:
                    done, pending = await asyncio.wait(processing_tasks, return_when=asyncio.FIRST_COMPLETED)
                    processing_tasks = list(pending)

            if processing_tasks:
                await asyncio.gather(*processing_tasks)

        finally:
            if pbar:
                pbar.close()

        await log_queue.join()
        log_writer_task_obj.cancel()
        try:
            await log_writer_task_obj
        except asyncio.CancelledError:
            pass

        if IS_JUPYTER and stats_display_handle:
            stats_display_handle.update(HTML(get_stats_html()))
        elif not IS_JUPYTER:
            print(get_stats_html())

        print("\n" + "=" * 40, flush=True)
        print("        Processing Complete!", flush=True)
        print("=" * 40 + "\n", flush=True)

        print(f"Total chunks processed: {stats['total_processed']}/{stats['total_input']}", flush=True)
        print(f"  - Successful explanations: {stats['success']}", flush=True)
        print(f"  - No explanation (Unsupported Language): {stats['no_exp']}", flush=True)
        print(f"  - No explanation (API Error after retries): {stats['no_exp_e']}", flush=True)
        print(f"  - Skipped (Excluded Directory): {stats['skipped']}", flush=True)
        print(f"  - Other Errors (e.g., JSON parsing): {stats['errors']}", flush=True)
        print(f"\nOutput file: {output_file_path}", flush=True)
        print(f"Log file: {log_file_path}", flush=True)

        if error_messages:
            print("\n--- Summary of Unique Errors Encountered ---", flush=True)
            for msg in sorted(error_messages):
                print(f"- {msg}", flush=True)
        else:
            print("\nNo significant errors encountered during processing.", flush=True)

        print("\n" + "=" * 40, flush=True)
        await log_info(f"Process finished at {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


    await run_all(
        input_file_path=input_file_path,
        output_file_path=output_file_path,
        log_file_path=log_file_path,
        explanation_payload=explanation_payload,
        get_code_chunks_explanation=get_code_chunks_explanation,
        api_key=apikey,
        explain_response_processor=explain_response_processor
    )
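
The `run_all` loop above caps the number of outstanding tasks by waiting for the first completion whenever the task list reaches `MAX_CONCURRENT_TASKS * 2`. The same pattern can be sketched in isolation as follows. The names `run_bounded`, `demo`, and `worker` are hypothetical and exist only for this illustration; they are not part of this notebook or of wca_rag_lib.

```python
import asyncio


async def run_bounded(items, worker, max_in_flight):
    """Run worker(item) for every item, keeping at most max_in_flight tasks outstanding."""
    in_flight = set()
    results = []
    for item in items:
        in_flight.add(asyncio.create_task(worker(item)))
        if len(in_flight) >= max_in_flight:
            # Block until at least one task finishes, then keep only the unfinished ones.
            done, in_flight = await asyncio.wait(
                in_flight, return_when=asyncio.FIRST_COMPLETED
            )
            results.extend(task.result() for task in done)
    if in_flight:
        # Drain whatever is still running once the input is exhausted.
        done, _ = await asyncio.wait(in_flight)
        results.extend(task.result() for task in done)
    return results


async def demo(n_items=10, max_in_flight=4):
    """Illustrative driver: doubles each number and records the peak number of active workers."""
    active = 0
    peak = 0

    async def worker(n):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for a slow API call
        active -= 1
        return n * 2

    results = await run_bounded(range(n_items), worker, max_in_flight)
    return sorted(results), peak
```

Inside a notebook cell, where an event loop is already running, the sketch would be driven with `await demo()`; in a plain script, `asyncio.run(demo())` does the same.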

### 6. Create Index and save enhanced chunks to vector store
#### When content_type is "code", this step creates the index from the enhanced JSONL file, and embeddings are generated from the explanation column.
#### When content_type is "docs", this step creates the index from the functions JSONL file, and embeddings are generated from the text column.
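
Because embeddings are generated from a single field, it can be useful to check how many records actually carry that field before indexing. The following is a minimal sketch; `count_missing_field` is a hypothetical helper and not part of wca_rag_lib, and it assumes that each non-empty line of the JSONL file is a JSON object.

```python
import json
from typing import Iterable, Tuple


def count_missing_field(lines: Iterable[str], field: str) -> Tuple[int, int]:
    """Return (total_records, records_missing_field) for an iterable of JSONL lines."""
    total = 0
    missing = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        total += 1
        value = json.loads(line).get(field)
        # Treat absent, non-string, and whitespace-only values as missing.
        if not isinstance(value, str) or not value.strip():
            missing += 1
    return total, missing
```

A file can be checked by passing the open file object, for example `with open(dataset_file_path, encoding="utf-8") as f: print(count_missing_field(f, "explanation"))`.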

6a. **Create a new index or update an existing one** <br><br>
 - **Set the index-name below to a valid string.**<br><br>
 - **The recommended index-name is REPOSITORY_NAME.lower().**<br><br>
 - **The index-name must not contain hyphens or other special characters; use underscores as word separators if needed.**
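
The naming rules above can be applied mechanically to a repository name. The following is a minimal sketch; `to_index_name` is a hypothetical helper and not part of wca_rag_lib. It covers only the rules listed here (lowercase, no hyphens or special characters, underscores as separators), so any further constraints imposed by the vector store are not checked.

```python
import re


def to_index_name(repository_name: str) -> str:
    """Lowercase the name and replace every run of disallowed characters with one underscore."""
    name = repository_name.lower()
    name = re.sub(r"[^a-z0-9_]+", "_", name)
    # Drop underscores left over at either end, e.g. from a trailing ".git" separator.
    return name.strip("_")
```

For example, `to_index_name("sample-notebooks")` gives `sample_notebooks`.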

In [None]:
# Creates or updates index for a given dataset.

import os

EMBEDDING_MODEL = "msmarco-MiniLM-L-6-v3"

# the JSON field name used to generate embeddings
if content_type.lower() == "code":
    EMBEDDING_FIELD = ["explanation"]  
    dataset_file_path = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_enhanced_functions_ibm.jsonl"
else:
    EMBEDDING_FIELD = ["text"]  
    dataset_file_path = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_functions_ibm.jsonl"

try:
    index = encode_store(
        model_name=EMBEDDING_MODEL,
        dataset_file_path=dataset_file_path,
        index_file_path=index_file_path,
        dataset_name=index_name,
        content_key=EMBEDDING_FIELD,
        db_type=db_type, 
        config=config,
        create_new_index=create_new_index,  # True: create a new index; False: update the existing one
        resume=False,
    )
    print(f"INDEX NAME : {index}")
except KeyboardInterrupt:
    print("Process interrupted by user.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

6b. **Retry a failed indexing run**<br><br>
- **Run the step below only if the previous step failed.**<br><br>
- **Before running it, set the resume flag to True and keep the index name the same as in the previous step.**

In [None]:
# Resume indexing for a given dataset.
# Set resume to True if the previous indexing step failed and you want to resume it.

import os

resume = False

EMBEDDING_MODEL = "msmarco-MiniLM-L-6-v3"

# the JSON field name used to generate embeddings
if content_type.lower() == "code":
    EMBEDDING_FIELD = ["explanation"]  
    dataset_file_path = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_enhanced_functions_ibm.jsonl"
else:
    EMBEDDING_FIELD = ["text"]  
    dataset_file_path = f"{OUTPUT_DIR}/{REPOSITORY_NAME}_functions_ibm.jsonl"

if resume:
    try:
        index = encode_store(
            model_name=EMBEDDING_MODEL,
            dataset_file_path=dataset_file_path,
            index_file_path=index_file_path,
            dataset_name=index_name,
            content_key=EMBEDDING_FIELD,
            db_type=db_type,
            config=config,
            create_new_index=False,  # update existing index
            resume=resume,
        )
        print(f"INDEX NAME : {index}")
    except KeyboardInterrupt:
        print("Process interrupted by user.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

In [None]:
# ADD MASTER RECORD

import importlib
import wca_rag_lib.external.records as records

# Reload the whole module
importlib.reload(records)

# Import fresh functions
from wca_rag_lib.external.records import (
    add_custom_record_to_milvus,
    add_custom_record_to_es,
    add_custom_record_to_os,
)

# Save the latest commit ID in the Milvus master record for use in future updates.
if db_type.lower() == "milvus":
    add_custom_record_to_milvus(
        texts=["WCA_MASTER_RECORD"],
        commit_ids=[latest_commit_id],
        languages=["WCA_MASTER_RECORD"],
        urls=[""],
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        config=config,
        index=index
    )
# Save the latest commit ID in the Elasticsearch master record for use in future updates.
elif db_type.lower() == "elasticsearch":
    add_custom_record_to_es(
        texts=["WCA_MASTER_RECORD"],
        commit_ids=[latest_commit_id],
        languages=["WCA_MASTER_RECORD"],
        urls=[""],
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        config=config,
        index_name=index
    )
# Save the latest commit ID in the OpenSearch master record for use in future updates.
elif db_type.lower() == "opensearch":
    add_custom_record_to_os(
        texts=["WCA_MASTER_RECORD"],
        commit_ids=[latest_commit_id],
        languages=["WCA_MASTER_RECORD"],
        urls=[""],
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        config=config,
        index_name=index
    )
print(f"Latest commit ID updated in MASTER record: {latest_commit_id}")

Copyright © 2024, 2025. This notebook and its source code are released under the terms of the MIT License.

Author: Rishi S ribalaji@in.ibm.com