<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       CSAE Bot: Quickly find demos of interest just by typing
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Install required libraries</b></p>

In [1]:
%%capture

!pip install openai langchain langchain-openai langchain-community faiss-cpu tiktoken panel==1.3.4

<div class="alert alert-block alert-info">
    <p style='font-size:16px;font-family:Arial;color:#00233C'><b>Note:</b> Restart the kernel so that the newly installed packages are picked up. The simplest way is to type <b>0 0</b> (zero zero) and then press <i><b>Enter</b></i>.</p>
</div>

In [1]:
import os
import json
import re

# Loaders and vector stores live in langchain_community; the langchain.* paths are deprecated
from langchain_community.document_loaders import DirectoryLoader, NotebookLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
import openai

from teradataml import *

<hr style="height:1px;border:none;background-color:#00233C;">

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>1.1 Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted for the password. Enter it, press the Enter key, and then use the down arrow to move to the next cell.</p>

In [86]:
%run -i ./UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
execute_sql('''SET query_band='DEMO= CSAE_Bot.ipynb;' UPDATE FOR SESSION;''')

Performing setup ...
Setup complete



Enter password:  ·········


... Logon successful
Connected as: teradatasql://demo_user:xxxxx@host.docker.internal/dbc
Engine(teradatasql://demo_user:***@host.docker.internal)


TeradataCursor uRowsHandle=37 bClosed=False

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Extract contents (code, text) from notebook</b></p>

In [2]:
# Function to extract content from a Jupyter notebook
def extract_notebook_content(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        notebook_data = json.load(f)

    content = ""
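    for cell in notebook_data.get('cells', []):
        # 'source' is usually a list of lines that already end in '\n',
        # but the notebook format also allows a single string
        source = cell.get('source', [])
        text = source if isinstance(source, str) else ''.join(source)
        if cell['cell_type'] == 'markdown':
            # Keep markdown as-is; HTML tags are stripped later in split_ipynb_content
            content += text + '\n\n'
        elif cell['cell_type'] == 'code':
            # Wrap code in a fenced block so it can be separated from markdown later
            content += '```python\n' + text + '\n```\n\n'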
    return content

# Function to remove HTML tags
def remove_html_tags(text):
    """Remove HTML tags from a string"""
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

# Function to split the notebook content into markdown and code
def split_ipynb_content(content):
    # Regular expression to match code blocks
    code_pattern = re.compile(r'```python(.*?)```', re.DOTALL)

    # Split the content on code blocks; because the pattern has a capturing
    # group, the result alternates markdown, code, markdown, ...
    parts = code_pattern.split(content)

    # Combine markdown and code blocks
    result = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # This is a markdown part, remove HTML tags
            clean_part = remove_html_tags(part)
            result.append(('markdown', clean_part))
        else:
            # This is a code part
            result.append(('code', part))

    return result

# Function to clean and split notebook content
def clean_and_split_notebook_content(file_path):
    """Extract markdown content and clean up the notebook's information."""
    # Extract the content from the notebook file
    content = extract_notebook_content(file_path)
    
    # Split content into markdown and code cells
    split_content = split_ipynb_content(content)

    # Initialize a list to hold combined documents
    combined_documents = []
    current_markdown = ""
    current_code = ""

    # Iterate through the split content to group each markdown block with the code that follows it
    for part_type, part in split_content:
        if part_type == 'markdown':
            # Whitespace-only markdown is just the gap between two consecutive
            # code cells; skip it so their code stays grouped together
            if not part.strip():
                continue
            # A new markdown block starts a new document; store the previous one first
            if current_markdown or current_code:
                combined_documents.append({"markdown": current_markdown, "code": current_code})
            current_markdown = part
            current_code = ""  # Reset code, ready for the next code block
        elif part_type == 'code':
            # Append the code to the current code block
            current_code += part
    
    # Add the last document (markdown + code)
    if current_markdown or current_code:
        combined_documents.append({"markdown": current_markdown, "code": current_code})

    return combined_documents
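
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The splitting step above relies on <code>re.split</code> with a capturing group, which returns markdown and code in alternating positions. The following is a minimal, self-contained sketch of that behaviour on a made-up, already flattened notebook string; the <code>sample</code> text and the <code>labelled</code> list are illustrative and not part of the notebook.</p>

```python
import re

# Build the fence marker programmatically so this example can sit inside a fenced block
FENCE = "`" * 3

# Same idea as the notebook's pattern: capture everything between python fences
code_pattern = re.compile(FENCE + r"python(.*?)" + FENCE, re.DOTALL)

# Hypothetical flattened notebook: one markdown cell followed by one code cell
sample = "<p>Load the data</p>\n\n" + FENCE + "python\nimport pandas as pd\n" + FENCE + "\n\n"

parts = code_pattern.split(sample)

labelled = []
for i, part in enumerate(parts):
    if i % 2 == 0:
        # Even positions are markdown: strip HTML tags, as remove_html_tags does
        labelled.append(("markdown", re.sub(r"<.*?>", "", part).strip()))
    else:
        # Odd positions are the captured code
        labelled.append(("code", part.strip()))

for kind, text in labelled:
    print(kind, repr(text))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The trailing markdown entry is empty because nothing follows the last code block, which is why downstream code should ignore empty parts.</p>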

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Create FAISS vector database</b></p>
<div class="alert alert-block alert-warning">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: You do not need to regenerate the embeddings on every run. Each generation embeds more than 1M tokens, and OpenAI bills embeddings per token (typically from $0.02 USD per 1M tokens, depending on the embedding model)</b>.</p>
</div>



In [3]:
import getpass
import os

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key:  ························································


In [8]:
def generate_emb():
    # Load notebooks and clean them
    path = '/home/jovyan/JupyterLabRoot/'
    loader = DirectoryLoader(path, glob="**/*.ipynb", loader_cls=NotebookLoader)
    notebooks = loader.load()

    # Clean each notebook before processing it
    cleaned_documents = []
    for notebook in notebooks:
        # Assuming notebook metadata contains file path
        file_path = notebook.metadata.get("source", "Unknown")  # Adjust this as needed
        cleaned_data = clean_and_split_notebook_content(file_path)

        # Convert cleaned data to documents, including the source file path
        for data in cleaned_data:
            if data['markdown'] or data['code']:
                doc = Document(
                    page_content=f"Markdown:\n{data['markdown']} \n\nCode:\n{data['code']}",
                    metadata={"source": file_path}  # Ensure the source file path is added
                )
                cleaned_documents.append(doc)


    # Split text into manageable chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = text_splitter.split_documents(cleaned_documents)


    # Count the tokens across all notebooks to estimate the embedding cost
    import tiktoken

    # cl100k_base is the encoding used by the text-embedding-3 models
    encoder = tiktoken.get_encoding("cl100k_base")
    total_tokens = sum(len(encoder.encode(doc.page_content)) for doc in cleaned_documents)

    print("Total tokens from all the notebooks: ", total_tokens)

    # Create vector store using embeddings
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large") 
    vector_store = FAISS.from_documents(docs, embeddings)

    # Save the index for reuse
    vector_store.save_local(".notebooks_index")
    return vector_store
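
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the effect of <code>chunk_size=1000</code> and <code>chunk_overlap=100</code> concrete, here is a simplified sketch of overlapping, fixed-size chunking. It is not how <code>RecursiveCharacterTextSplitter</code> is implemented (that splitter prefers to break on separators such as paragraph and line breaks, so its chunks vary in length); the <code>chunk_text</code> helper is hypothetical and only illustrates the sliding window.</p>

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Made-up 2,500-character document
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
print([len(c) for c in chunks])
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The overlap means the last 100 characters of one chunk are repeated at the start of the next, so a sentence that straddles a boundary is still retrievable as a whole.</p>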

In [9]:
def load_emb():
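    # Use the same embedding model that was used to build the index
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

    # The saved index includes a pickle file, so LangChain requires an explicit
    # opt-in to deserialize it; only load an index that you created yourself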
    return FAISS.load_local(
        ".notebooks_index", embeddings, allow_dangerous_deserialization=True
    )
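
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Conceptually, a similarity search over the loaded index finds the stored vectors closest to the query vector. The sketch below shows a brute-force version of that idea in plain Python, assuming Euclidean (L2) distance, which to our understanding is the default for a LangChain FAISS store. The three-dimensional vectors and the <code>nearest</code> helper are made up for illustration.</p>

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, vectors, k=2):
    """Return the indexes of the k vectors closest to query."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]


# Hypothetical 3-dimensional "embeddings"; real embeddings have far more dimensions
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]
print(nearest(query, vectors))
```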

In [10]:
# Request user's input
generate = input("Do you want to generate embeddings? ('yes'/'no'): ").strip().lower()

# Any answer other than 'yes' is treated as 'no', so vector_store is always defined
if generate == "yes":
    vector_store = generate_emb()
else:
    try:
        print('Loading existing embeddings...')
        vector_store = load_emb()
    except Exception:
        print('Embeddings not found, generating now...')
        vector_store = generate_emb()

Do you want to generate embeddings? ('yes'/'no'):  no


Loading existing embeddings...


<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Load existing vector database and define RAG</b></p>

In [11]:
# The vector store is already loaded by the previous cell. To load it manually instead,
# uncomment the lines below (the path must match the one passed to save_local):
# embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
# vector_store = FAISS.load_local(
#     ".notebooks_index", embeddings, allow_dangerous_deserialization=True
# )

In [12]:
# Custom Prompt Template
CUSTOM_PROMPT = """
You are a helpful assistant. Use the following information retrieved from Jupyter notebooks to provide:
1. A **clean and concise textual explanation** based on the question and the notebook markdown.
2. **Relevant, clean Python code** extracted from the notebooks' code cells. Include only code that is related to the question.
3. The source documents the answer is based on.
If no relevant information is found, politely say so.

Context:
{context}

Question:
{question}

Your response should follow the format below:
## Answer:
## Relevant Code:
## Source documents:
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=CUSTOM_PROMPT
)

# Use a chat model (not a completion model), for example 'gpt-4o-mini'
chat_model = ChatOpenAI(model="gpt-4o-mini")

# Retrieval QA Chain
qa_chain = RetrievalQA.from_chain_type(
    llm=chat_model,
    retriever=vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 10}),  # Assuming vector_store is your vector database
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)
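
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The retriever above uses <code>search_type="mmr"</code> (maximal marginal relevance), which balances relevance to the question against redundancy among the chunks already selected. The following is a small, self-contained sketch of the selection rule, not LangChain's implementation; the <code>mmr</code> and <code>cosine</code> helpers and the two-dimensional vectors are hypothetical, and <code>lambda_mult=0.5</code> is assumed as the weighting.</p>

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Pick k candidate indexes, trading relevance to the query against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already selected
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up vectors: the first two are near-duplicates, the third is different
candidates = [
    [1.0, 0.12],
    [1.0, 0.1],
    [0.5, 0.9],
]
query = [1.0, 0.2]

print(mmr(query, candidates, k=2))                   # diversity-aware selection
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # pure relevance ranking
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With equal weighting, the near-duplicate is skipped in favour of the more distinct vector; with <code>lambda_mult=1.0</code> the rule reduces to plain similarity ranking. For this chatbot, that diversity helps the answer draw on several notebooks rather than ten chunks of the same one.</p>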

In [13]:
# Function to Query Chatbot
def query_chatbot(question):
    # Query the chatbot using the chain
    result = qa_chain.invoke(question)
    answer = result["result"]

    # Collect the unique source paths from the source documents (sorted for stable output)
    source_docs = result.get("source_documents", [])
    sources = "\n".join(sorted({doc.metadata.get("source", "Unknown") for doc in source_docs}))

    return f"""
{answer}

Reference Notebook(s):
{sources if sources else "No source notebooks found."}
"""

In [14]:
import re

def extract_answer_code_references(input_string):
    """Turn notebook paths found in input_string into JupyterLab-relative HTML links."""
    # Find absolute notebook paths (this pattern stops at the first whitespace character)
    references = re.findall(r'(/home/[^\s]+)', input_string)

    # Strip the JupyterLab root so that each link is relative to it
    html_output = [
        f'<a href="{ref.replace("/home/jovyan/JupyterLabRoot/", "")}"> -> {ref.split("/")[-1]} </a>'
        for ref in references
    ]

    # Keep every link on its own line; wrapping the text could break a long href
    return "\n\n".join(html_output)
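
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The pattern above stops at the first whitespace character, so a path that contains a space would be cut short. As a hedged alternative for a single, already known path, the sketch below builds the link directly and URL-encodes the relative part. The <code>notebook_link</code> helper and the example path are hypothetical and not part of the notebook.</p>

```python
import html
import posixpath
from urllib.parse import quote

ROOT = "/home/jovyan/JupyterLabRoot/"  # same root that the notebook strips from paths


def notebook_link(path, root=ROOT):
    """Build an HTML anchor with a root-relative, URL-encoded href."""
    relative = path[len(root):] if path.startswith(root) else path
    label = html.escape(posixpath.basename(path))
    # quote() keeps '/' intact and encodes characters such as spaces
    return f'<a href="{quote(relative)}"> -> {label} </a>'


# Hypothetical notebook path containing a space
link = notebook_link("/home/jovyan/JupyterLabRoot/UseCases/Fraud Detection/demo.ipynb")
print(link)
```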

In [16]:
import panel as pn
import textwrap

pn.extension(design="material")

# panel callback function
def callback(contents, user, instance):
    response = qa_chain.invoke(contents)
    result = response['result']
    
    source_docs = response.get("source_documents", [])
    sources = "\n".join(sorted({doc.metadata.get("source", "Unknown") for doc in source_docs}))
    html_output = extract_answer_code_references(sources)

    # Append the notebook links below the answer without re-wrapping them,
    # because wrapping could break a long href
    result = result + "\n\n" + html_output
    return result


pn.chat.ChatInterface(
    callback=callback,
    show_rerun=False,
    show_undo=False,
    show_clear=False,
    width=1400,
    height=400,
)

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Try your own questions</b></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here are some sample questions that you can try out:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>How does VectorDistance work?</li>
    <li>What is the Script table operator?</li>
    <li>Give me demos that use AWS Bedrock.</li>
    <li>What is GEOSEQUENCE? Show me some examples.</li>
    <li>Which notebooks use OpenAI?</li>
    <li>Which notebooks are about fraud detection?</li>
    <li>How do I use TDApiClient to generate embeddings?</li>
    <li>Show me a demo for Broken Digital Journey.</li>
</ol>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>