# RAG-CE with Synthetic Data Manipulation

In this notebook, we will introduce the concepts of RAG-CE with synthetic data manipulation.


### Python Imports


In [1]:
%load_ext autoreload
%autoreload 2


import sys
sys.path.append('../code')  # forward slashes work on Windows, macOS, and Linux


import os
from dotenv import load_dotenv
load_dotenv()

from IPython.display import display, Markdown, HTML
from PIL import Image
from doc_utils import *


def show_img(img_path, width = None):
    if width is not None:
        display(HTML(f'<img src="{img_path}" width={width}>'))
    else:
        display(Image.open(img_path))


### Make sure we have the OpenAI Models information

We will need the GPT-4-Turbo and GPT-4-Vision models for this notebook.

When you run the cell below, the values should reflect the OpenAI resource you configured in the `.env` file.

In [None]:
model_info = {
        'AZURE_OPENAI_RESOURCE': os.environ.get('AZURE_OPENAI_RESOURCE'),
        'AZURE_OPENAI_KEY': os.environ.get('AZURE_OPENAI_KEY'),
        'AZURE_OPENAI_MODEL_VISION': os.environ.get('AZURE_OPENAI_MODEL_VISION'),
        'AZURE_OPENAI_MODEL': os.environ.get('AZURE_OPENAI_MODEL'),
}

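# NOTE: AzureOpenAI and the endpoint, key, and API-version constants used below are not
# defined in this notebook; they are assumed to be provided by `from doc_utils import *`.
oai_client = AzureOpenAI(
    azure_endpoint=OPENAI_API_BASE,
    api_key=AZURE_OPENAI_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
)

oai_emb_client = AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_EMBEDDING_API_BASE,
    api_key=AZURE_OPENAI_EMBEDDING_MODEL_RESOURCE_KEY,
    api_version=AZURE_OPENAI_EMBEDDING_MODEL_API_VERSION,
)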

model_info
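
Displaying `model_info` directly also prints the API key. As a hedged alternative, the sketch below reports which settings are missing and masks secret values before showing them. The helper names `find_missing_settings` and `mask_secret` are hypothetical and not part of `doc_utils`, and the dictionary values are made up for illustration.

```python
def find_missing_settings(settings):
    """Return the names whose value is None or an empty string."""
    return sorted(name for name, value in settings.items() if not value)


def mask_secret(value, visible=4):
    """Hide all but the last `visible` characters of a secret."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Made-up values standing in for what os.environ.get(...) would return.
example = {
    "AZURE_OPENAI_RESOURCE": "my-resource",
    "AZURE_OPENAI_KEY": "0123456789abcdef",
    "AZURE_OPENAI_MODEL": None,  # simulates an entry missing from .env
}

missing = find_missing_settings(example)
masked_key = mask_secret(example["AZURE_OPENAI_KEY"])
print("Missing settings:", missing)
print("Key:", masked_key)
```

With the notebook's own dictionary, `find_missing_settings(model_info)` would list any entries that `load_dotenv()` did not populate.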

## Build the Index

In this section, we will build the vector index that will be used in the remainder of the notebook.

The dataset consists of 76 facts about the Tesla Model S, generated by GPT-4.

In [15]:
data = './sample_data/Tesla_Model_S.txt'

text = read_asset_file(data)[0]

facts = text.split('\n\n')
print(f"This dataset has {len(facts)} fact items in it")
facts[0]

This dataset has 76 fact items in it


'The Tesla Model S Plaid version boasts an extraordinary acceleration speed, sprinting from 0 to 60 mph in under 2 seconds precisely at 1.99 seconds, making it one of the fastest accelerating production cars available. Its power comes from a tri-motor setup that produces over 1,020 horsepower, enabling it to complete a quarter mile in just 9.23 seconds. '
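
Splitting on exactly `'\n\n'` produces empty strings when the file ends with blank lines or contains three or more newlines in a row, and it keeps trailing whitespace (note the trailing space in the fact printed above). A more tolerant splitter could look like the sketch below; `split_facts` is a hypothetical helper, not part of `doc_utils`. Stripping changes the text slightly, so any ID derived from the text would change as well.

```python
import re


def split_facts(text):
    """Split on runs of blank lines and drop empty or whitespace-only chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Toy input with irregular spacing and trailing blank lines.
sample = "Fact one. \n\nFact two.\n\n\n\nFact three.\n\n"
sample_facts = split_facts(sample)
print(f"This sample has {len(sample_facts)} fact items in it")
```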

### Make sure we have the AI Search information

When you run the cell below, the values should reflect the AI Search resource you configured in the `.env` file, specifically the `COG_SEARCH_ENDPOINT` and `COG_SEARCH_ADMIN_KEY` entries.

In [None]:
os.environ.get('COG_SEARCH_ENDPOINT'), os.environ.get('COG_SEARCH_ADMIN_KEY')

### Instantiating the Index Object

In [4]:
from utils.cogsearch_rest import *

index_name = 'tesla_facts'

fields = [
            {"name": "id", "type": "Edm.String", "key": True, "searchable": True, "filterable": True, "retrievable": True, "sortable": True},
            {"name": "vector", "type": "Collection(Edm.Single)", "searchable": True,"retrievable": True, "dimensions": 1536,"vectorSearchProfile": "my-vector-profile"},
            {"name": "tags", "type": "Edm.String","searchable": True, "filterable": False, "retrievable": True, "sortable": False, "facetable": False},
            {"name": "text", "type": "Edm.String","searchable": True, "filterable": False, "retrievable": True, "sortable": False, "facetable": False},
]

index = CogSearchRestAPI(index_name, fields=fields)
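
Two properties of the field list are easy to get wrong: an Azure AI Search index needs exactly one key field, and the `dimensions` of the vector field must match the length of the embeddings you upload (1536 in this notebook). The sketch below checks both before the index is created. `check_index_fields` is a hypothetical helper written for this illustration, and the field list is a trimmed copy of the one above.

```python
def check_index_fields(fields, embedding_dim):
    """Return a list of human-readable problems found in the field list."""
    problems = []
    key_fields = [f["name"] for f in fields if f.get("key")]
    if len(key_fields) != 1:
        problems.append(f"expected exactly one key field, found {len(key_fields)}")
    for f in fields:
        if f["type"] == "Collection(Edm.Single)":
            if f.get("dimensions") != embedding_dim:
                problems.append(
                    f"field '{f['name']}' has dimensions={f.get('dimensions')}, "
                    f"but embeddings have length {embedding_dim}"
                )
            if not f.get("vectorSearchProfile"):
                problems.append(f"field '{f['name']}' has no vectorSearchProfile")
    return problems


example_fields = [
    {"name": "id", "type": "Edm.String", "key": True},
    {"name": "vector", "type": "Collection(Edm.Single)", "dimensions": 1536,
     "vectorSearchProfile": "my-vector-profile"},
    {"name": "text", "type": "Edm.String"},
]

ok_problems = check_index_fields(example_fields, embedding_dim=1536)
bad_problems = check_index_fields(example_fields, embedding_dim=3072)
print(ok_problems)
print(bad_problems)
```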


### Creating the AI Search Index

In [None]:
# Uncomment if needed
# index.delete()

## Creating the AI Search Index
if index.get_index() is None:
    print(f"No index {index_name} detected, creating one ... ")
    index.create_index()

### Populating the Index

In this cell, we build a simple metadata record for each fact (text, embedding vector, tags, and ID), save the records to a pickle file, and upload them to the AI Search vector index.

In [None]:
metadatas = []

for fact in facts:
    metadata = {
        "text": fact, 
        "vector": get_embeddings(fact, client=oai_emb_client),
        "tags": generate_tag_list(fact, client=oai_client),
        "id": generate_uuid_from_string(fact)
    }

    metadatas.append(metadata)
    
    if len(metadatas) % 10 == 0:
        print(f"Processed {len(metadatas)} items")

# Save the data to a pickle file
save_to_pickle(metadatas, './sample_data/tesla_facts.pkl')

# Upload to the AI Search index
upload_output = index.upload_documents(metadatas)
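
The `id` above comes from `generate_uuid_from_string`, whose implementation lives in `doc_utils` and is not shown here. One plausible way to derive a stable ID from text, offered as an assumption rather than as the actual helper, is a name-based UUID (`uuid.uuid5`): the same fact always maps to the same ID, so re-processing the dataset yields the same document keys instead of new ones. The function name `stable_id` is hypothetical.

```python
import uuid


def stable_id(text, namespace=uuid.NAMESPACE_URL):
    """Derive a deterministic UUID string from the text (SHA-1 based, UUID version 5)."""
    return str(uuid.uuid5(namespace, text))


first = stable_id("The Model S has 11 speakers.")
again = stable_id("The Model S has 11 speakers.")
other = stable_id("The Model S has a tri-motor setup.")
print(first)
print(first == again, first == other)
```

The resulting string contains only hexadecimal digits and dashes, which keeps it within the characters commonly accepted for document keys.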

### Querying the Index

Searching the vector index with Hybrid Search enabled.

In [5]:
query = "How many speakers are in the Tesla Model S, and what are their characteristics?"

def query_search(query, top=3):
    context = ""
    results = index.search_documents(query, top=top)
    for r in results['value']:
        print(f"Distance Score: {bc.OKBLUE}{r['@search.score']:.5f}{bc.ENDC}\nText: {bc.OKGREEN}{r['text']}{bc.ENDC}\n\n")
        context += r['text'] + '\n\n'

    return context

context = query_search(query)


Distance Score: 0.03110
Text: The Model S is equipped with a high-fidelity sound system, including up to 11 speakers with neodymium magnets and a specially designed acoustic architecture. This setup ensures an immersive listening experience for all passengers, with crystal-clear highs and deep, resonant lows.


Distance Score: 0.03182
Text: Tesla's Model S comes with a premium audio system that includes 11 speakers with neodymium magnets, providing an immersive listening experience. This high-fidelity system is meticulously tuned to the car's interior acoustics, enhancing the enjoyment of music and media.


Distance Score: 0.03200
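
Although the output labels it a distance, `@search.score` is a relevance score: higher means a better match. For hybrid queries, Azure AI Search documents Reciprocal Rank Fusion (RRF) as the method for merging the keyword and vector rankings, which would explain why the values are small fractions. The sketch below shows the textbook RRF formula; the constant `k=60`, the 1-based ranks, and the document names are assumptions for illustration, and the service's exact computation may differ.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs; returns (doc_id, score) sorted by score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up rankings from a keyword query and a vector query.
keyword_ranking = ["speakers", "audio", "warranty"]
vector_ranking = ["speakers", "cargo", "audio"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

A document ranked highly by both queries accumulates two contributions, so it outranks one that appears in only a single list.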




## RAG-CE

We first retrieve the relevant context from AI Search, then pass it together with the question to the Assistants API, which produces the final answer.

In [17]:
query_template = """

## START OF CONTEXT
{context}
## END OF CONTEXT

Based on the Context above, please answer the below question. You **MUST** use the context to answer the question. If the answer is not in the context, please respond with "I don't know". If the question requires calculations, make sure to break down your methodology step by step in the final answer, walk the user through your process, and show intermediate calculation results.

Question: {question}

"""

question = "How much time will it take for the Tesla Model S to cover three quarters of a mile? Assume the continuing acceleration is twice as much for the next quarter-mile as it is for the initial quarter-mile, and then double that for the last quarter mile. Make a chart of the time it takes to cover 1/4, 1/2, and 3/4 of a mile."

# Get context from AI Search
print("The Search Context:\n")
query = query_template.format(question=question, context=query_search(question))

# Use Assistants API
assistant, thread = create_assistant(oai_client)    
messages = query_assistant(query, assistant, thread, client = oai_client)
response, files = process_assistants_api_response(messages, client = oai_client)

print(f"Assistants API generated {len(files)} files for this answer.\n")
print(f"The final response from Assistants-API:\n{bc.OKBLUE}{response}{bc.ENDC}\n")

for f in files:
    if f['type'] == 'assistant_image':
        show_img(f['asset'], width=700)

The Search Context:

Distance Score: 0.03280
Text: The Tesla Model S Plaid version boasts an extraordinary acceleration speed, sprinting from 0 to 60 mph in under 2 seconds precisely at 1.99 seconds, making it one of the fastest accelerating production cars available. Its power comes from a tri-motor setup that produces over 1,020 horsepower, enabling it to complete a quarter mile in just 9.23 seconds.


Distance Score: 0.01667
Text: Tesla offers a comprehensive warranty for the Model S, including an 8-year or unlimited mile battery and drive unit warranty and a 4-year or 50,000-mile limited warranty covering the rest of the vehicle. This coverage underscores Tesla's confidence in the reliability and longevity of their electric vehicles.


Distance Score: 0.01020
Text: The rear seats of the Model S can be folded down, providing a flat loading area that expands the cargo space significantly. This versatility makes the Model S practical 
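
The question leaves room for interpretation. One reading, sketched here as an assumption rather than as what the Assistants API actually computed, is: the car starts from rest, acceleration is constant within each quarter-mile segment, the first segment's acceleration follows from the 9.23 s quarter-mile time in the retrieved context, the second and third segments use 2x and 4x that acceleration, and speed carries over between segments.

```python
import math

QUARTER_MILE_M = 402.336  # one quarter of a mile, in metres
T_FIRST = 9.23            # seconds for the first quarter mile (from the retrieved fact)


def segment_time(v0, a, d):
    """Time to cover distance d starting at speed v0 with constant acceleration a.

    Solves d = v0*t + a*t^2/2 for the positive root t.
    """
    return (-v0 + math.sqrt(v0 * v0 + 2.0 * a * d)) / a


# From d = a*t^2/2 with a standing start.
a1 = 2.0 * QUARTER_MILE_M / T_FIRST ** 2

times = []      # cumulative time at 1/4, 1/2, and 3/4 mile
speed = 0.0
elapsed = 0.0
for multiplier in (1, 2, 4):
    a = multiplier * a1
    t = segment_time(speed, a, QUARTER_MILE_M)
    speed += a * t      # speed at the end of this segment
    elapsed += t
    times.append(elapsed)

for label, total in zip(("1/4", "1/2", "3/4"), times):
    print(f"{label} mile: {total:.2f} s")
```

Under these assumptions the cumulative times come out at roughly 9.23 s, 12.61 s, and 14.72 s. A different reading of "twice as much" would give different numbers.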

### Plot the Embeddings

Out of curiosity, the cell below projects the embeddings into 2-D with PCA and t-SNE and clusters them with HDBSCAN, to see whether the dataset forms any semantic clusters.

In the generated plots below, you can hover over the dots to see the tags associated with each entry. Dots that sit close together should correspond to texts on closely related subjects.
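
A 2-D projection can distort distances, so it can help to confirm closeness in the original embedding space. A common measure for that is cosine similarity. The sketch below uses toy 3-D vectors in place of the 1536-dimensional embeddings; `nearest_neighbours` is a hypothetical helper written for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(vectors, index, top=3):
    """Return (position, similarity) pairs for the vectors most similar to vectors[index]."""
    sims = [
        (i, cosine_similarity(vectors[index], v))
        for i, v in enumerate(vectors)
        if i != index
    ]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:top]


# Toy vectors: the first two point in nearly the same direction.
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
neighbours = nearest_neighbours(toy, 0, top=2)
print(neighbours)
```

With the notebook's data, the input would be `[m['vector'] for m in metadatas]`, and the returned positions can be mapped back to `metadatas[i]['tags']`.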

In [None]:
## Run this cell if these libraries are not installed
%pip install scikit-learn matplotlib plotly nbformat hdbscan

In [2]:
## In case metadatas was lost during kernel restart
metadatas = load_from_pickle('./sample_data/tesla_facts.pkl')

In [10]:
import numpy as np
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import hdbscan
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px


embeddings = np.array([np.array(m['vector']) for m in metadatas])

# Perform dimensionality reduction using PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(embeddings)

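# Perform dimensionality reduction using t-SNE
# Note: newer scikit-learn releases rename `n_iter` to `max_iter`; switch the
# argument name if this call raises a deprecation warning or a TypeError.
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)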
tsne_result = tsne.fit_transform(embeddings)

# Perform HDBSCAN clustering on the PCA reduced data
hdbscan_pca = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
pca_clusters = hdbscan_pca.fit_predict(pca_result)

# Perform HDBSCAN clustering on the t-SNE reduced data
hdbscan_tsne = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
tsne_clusters = hdbscan_tsne.fit_predict(tsne_result)

# Convert clusters and results to a format suitable for Plotly
df_pca = pd.DataFrame(pca_result, columns=['PC1', 'PC2'])
df_pca['cluster'] = pca_clusters
df_pca['label'] = [f'Tags: {m["tags"][:200]}' for m in metadatas]  # Customize with desired metadata

df_tsne = pd.DataFrame(tsne_result, columns=['Dim1', 'Dim2'])
df_tsne['cluster'] = tsne_clusters
df_tsne['label'] = [f'Tags: {m["tags"][:200]}' for m in metadatas]   # Customize with desired metadata

# PCA Plot with Plotly
fig_pca = px.scatter(df_pca, x='PC1', y='PC2', color='cluster', hover_data=['label'])
fig_pca.update_layout(title='PCA of Embeddings with HDBSCAN Clusters', title_x=0.5)
fig_pca.show()

# t-SNE Plot with Plotly
fig_tsne = px.scatter(df_tsne, x='Dim1', y='Dim2', color='cluster', hover_data=['label'])
fig_tsne.update_layout(title='t-SNE of Embeddings with HDBSCAN Clusters', title_x=0.5)
fig_tsne.show()

[t-SNE] Computing 75 nearest neighbors...
[t-SNE] Indexed 76 samples in 0.001s...
[t-SNE] Computed neighbors for 76 samples in 0.003s...
[t-SNE] Computed conditional probabilities for sample 76 / 76
[t-SNE] Mean sigma: 0.210971
[t-SNE] KL divergence after 250 iterations with early exaggeration: 47.934624
[t-SNE] KL divergence after 300 iterations: 0.349468
