<a href="https://colab.research.google.com/github/RedPlunder/CS450-Winter/blob/main/test_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**RAG version: 2.0 (with tags optimizing retrieval)**

**Tags method: similarity**

**Prompt version: 4.1**

**GPT version: 4o-mini**

**Key improvements: new RAG design with tags**

*Last modification: Jinwei 3/12*

**Input:** `./dataset/kd_withTags_fromFinetune.csv`, `./dataset/test.csv`, `./dataset/doc_embeddings.npy` (optional)

**Output:** `./dataset/test.csv` (overwritten)


In [1]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0


In [2]:
import pandas as pd
import openai
from transformers import AutoTokenizer, AutoModel
import faiss
import numpy as np
import os
import torch
import asyncio
from google.colab import userdata

### Setup

In [3]:
llm_choice = "gpt"
openai.api_key = userdata.get("OPENAI_API_KEY")
client = openai.AsyncClient(api_key=userdata.get("OPENAI_API_KEY"))

# Set environment variable to prevent runtime issues
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load tokenizer and model for embedding
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L12-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L12-v2")

# Run the embedding model on the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Using device: {device}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Using device: cuda





### Load knowledge data

In [4]:
# Load knowledge database
rag_db = pd.read_csv("./dataset/kd_withTags_fromFinetune.csv")
rag_data = rag_db[['ID', 'Content', 'tags']].dropna()

# Lowercase the content so it matches the lowercased queries
rag_data['Content'] = rag_data['Content'].str.lower()
documents = rag_data['Content'].tolist()
doc_ids = rag_data['ID'].tolist()
print(rag_db.columns.tolist())
###### Tags ############
# Tag strings are assumed to be formatted like "<K8s><helm>...<Ingress>"
doc_tags = rag_data['tags'].tolist()
#######################

['Category', 'Topic', 'Concept', 'Content', 'Link_to', 'URL', 'Tag', 'ID', 'tags']


### Embed knowledge text

In [5]:


def embed(documents, batch_size=20):
    """Generate embeddings in batches to prevent memory overflow."""
    all_embeddings = []
    for i in range(0, len(documents), batch_size):
        print(f"embedding {i} / {len(documents)}")
        batch = documents[i:i+batch_size]
        # Tokenize the batch with dynamic padding
        inputs = tokenizer(batch, padding="longest", truncation=True, max_length=512, return_tensors="pt")
        # move input to GPU
        inputs = {key: value.to(device) for key, value in inputs.items()}
        # Disable gradient calculation for inference
        with torch.no_grad():
            outputs = model(**inputs)
        # Use the hidden state of the first ([CLS]) token as the sentence embedding
        embeddings = outputs.last_hidden_state[:, 0, :].detach().cpu().numpy()
        all_embeddings.append(embeddings)
    return np.vstack(all_embeddings)

# Define the path for saving/loading embeddings
embeddings_path = './dataset/doc_embeddings.npy'
if os.path.exists(embeddings_path):
    print("Loading embeddings from file...")
    # Load embeddings using memory mapping to avoid loading the entire file into RAM
    doc_embeddings = np.load(embeddings_path, mmap_mode='r')
else:
    print("Generating embeddings in batches...")
    doc_embeddings = embed(documents)
    np.save(embeddings_path, doc_embeddings)  # Save embeddings to disk

# Build the FAISS index using L2 distance
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings)
print("FAISS indexing complete!")

Loading embeddings from file...
FAISS indexing complete!

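For intuition about what `index.search` returns in the cells that follow, here is a minimal pure-Python sketch of an exact nearest-neighbour search. `IndexFlatL2` performs a brute-force comparison against every stored vector and reports squared L2 distances. The function `l2_search` and the toy vectors are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
def l2_search(queries, docs, k):
    """Brute-force k-nearest-neighbour search by squared L2 distance.

    Returns (distances, indices), one row per query, closest first.
    """
    all_distances, all_indices = [], []
    for q in queries:
        # Score every document, then keep the k smallest distances
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(q, d)), i)
            for i, d in enumerate(docs)
        )[:k]
        all_distances.append([dist for dist, _ in scored])
        all_indices.append([i for _, i in scored])
    return all_distances, all_indices


# Toy 2-D "embeddings" (made-up values, for illustration only)
docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
queries = [[0.9, 0.1]]

distances, indices = l2_search(queries, docs, k=2)
print(indices)  # [[1, 0]]
```

Each row of `indices` holds row positions into the stored vectors, which is why the notebook keeps `documents`, `doc_ids`, and `doc_tags` in the same order as the embeddings.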

### Retrieve considering tags
The main modifications are in the code below.

The parts wrapped with `#############` are the modifications or additions Lianjie made.

#### Lianjie's Approach to Utilizing Tags

Because the retrieval step calls the API directly, it is difficult to incorporate tag information at that stage. Lianjie therefore uses the tags in the **re-ranking** phase, which is also a common practice in RAG projects.

In short, after the initial ranking, Lianjie refines the top-k results using tags to select the chunks that best match the query. Lianjie tried the following two methods:

1. **Counting Tag Matches**  
   For example, if the query tag is `<K8s><Helm>`, and the initial retrieval returns the following results:
   ```json
   {
       "tags": ["<K8s><Json>", "<K8s><Ingress>", "<Kubectl><google-cloud>"]
   }
   ```
   The first two chunks each contain `<K8s>`, which appears once in the query tags, while the last chunk shares no tag with the query. This approach therefore prioritizes the first two chunks.
   This is implemented in `re_ranking_tags()`.

2. **Using Cosine Similarity**  
   Different tags can still be closely related in practice. For instance, "cat," "dog," and "goldfish" are distinct tags, but "cat" and "dog" are more semantically related. Cosine similarity between tag embeddings captures such relationships, so it is used to improve the re-ranking. This is implemented in `re_ranking_similarity()`.

Only one of these functions needs to be called in `retrieve_documents()`.
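
The two re-ranking ideas can be sketched on toy data as follows. This is a simplified illustration, not the notebook's implementation. The helper names `count_matches` and `cosine` are hypothetical, the tag comparison here is exact and case-insensitive (an assumption; `re_ranking_tags()` uses a substring test), and the 2-D tag vectors are made up rather than produced by the embedding model.

```python
import math
import re


def extract_tags(tag_string):
    """Return the tag names inside <...>, lower-cased."""
    return [t.lower() for t in re.findall(r"<([^>]+)>", tag_string)]


def count_matches(query_tags, chunk_tags):
    """Method 1: number of query tags that also appear on the chunk."""
    return len(set(extract_tags(query_tags)) & set(extract_tags(chunk_tags)))


def cosine(u, v):
    """Method 2: cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


query = "<K8s><Helm>"
chunks = ["<Kubectl><google-cloud>", "<K8s><Json>", "<K8s><Ingress>"]

# Method 1: stable sort by match count, highest first
by_count = sorted(chunks, key=lambda c: count_matches(query, c), reverse=True)
print(by_count)  # ['<K8s><Json>', '<K8s><Ingress>', '<Kubectl><google-cloud>']

# Method 2: made-up 2-D vectors standing in for tag embeddings
tag_vectors = {"cat": [0.9, 0.1], "dog": [0.8, 0.3], "goldfish": [0.1, 0.9]}
sim_dog = cosine(tag_vectors["cat"], tag_vectors["dog"])
sim_fish = cosine(tag_vectors["cat"], tag_vectors["goldfish"])
print(sim_dog > sim_fish)  # True
```

Counting matches only rewards identical tags, while cosine similarity can also rank related but non-identical tags above unrelated ones, provided the embeddings place related tags close together.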

In [6]:
#######################
import re

### Method 1: Re-ranking by frequency of the tags.
def extract_tags(input_string):
    # Use Regex to get content in <...>
    matches = re.findall(r'<([^>]+)>', input_string)
    return matches


"""
Re-rank the results by frequency of the tags.
The chunks with more identical tags will be given higher priority.
We will choose the top k chunks from {results}.
@Args:
    queries_tags: A list, every query has a tag string.
    results: query results RAG.
    k: The top **k** result you want to get after re-ranking.
"""
def re_ranking_tags(queries_tags, results, k=3):

    def filter_result_by_tag_list(tag_list, result, k=3):
        # Count how many tags in tag_list appear in the chunk's tag string
        def count_matching_tags(tag_str):
            return sum(1 for tag in tag_list if tag in tag_str)
        # Combine ids, contents, and tags so they can be sorted together
        combined = list(zip(result["ids"], result["contents"], result["tags"]))
        # Sort by the number of matching tags, in descending order
        combined.sort(key=lambda x: count_matching_tags(x[2]), reverse=True)
        # Keep the top k items
        top_k = combined[:k]
        # Build the new result
        new_result = {
            "ids": [item[0] for item in top_k],
            "contents": [item[1] for item in top_k],
            "tags": [item[2] for item in top_k]
        }
        return new_result

    new_results = list()
    for query_tag, result in zip(queries_tags, results):
        query_tag_list = extract_tags(query_tag)
        # The helper runs with its default k=3, so three chunks are kept per query
        new_result = filter_result_by_tag_list(query_tag_list, result)
        new_results.append(new_result)

    return new_results


### Method 2: Re-ranking by tag similarity.
from sklearn.metrics.pairwise import cosine_similarity

tag_embedding = embed(doc_tags)   # This Step could be placed into Section "Load knowledge data".
"""
Re-rank the results by cosine similarity.
The chunks whose tags have higher cosine similarity with query_tag will be given higher priority.
We will choose the top k chunks from {results}.
@Args:
    queries_tags: A list, every query has a tag string.
    results: query results RAG.
    k: The top **k** result you want to get after re-ranking.
"""
def re_ranking_similarity(queries_tags, results, k=3):

    # Map each document ID to its row position in documents / doc_tags / tag_embedding.
    # IDs are not guaranteed to equal row positions once dropna() has removed rows.
    id_to_row = {int(doc_id): row for row, doc_id in enumerate(doc_ids)}

    def filter_result_by_cosine_similarity(query_tag_embed, result, k=3):

        query_tag_embed = np.array(query_tag_embed).reshape(1, -1)

        # Keep only the IDs that can be resolved to a row
        valid_ids = [doc_id for doc_id in result["ids"] if doc_id in id_to_row]
        if len(valid_ids) == 0:
            return {"ids": [], "contents": [], "tags": []}

        rows = [id_to_row[doc_id] for doc_id in valid_ids]
        result_embeddings = np.array([tag_embedding[row] for row in rows])

        # Compute the cosine similarity between the query tags and each candidate's tags
        cosine_similarities = cosine_similarity(query_tag_embed, result_embeddings).flatten()
        # Combine ids, row positions, and similarities so they stay aligned
        combined = list(zip(valid_ids, rows, cosine_similarities))
        # Sort by cosine similarity, in descending order
        combined.sort(key=lambda x: x[2], reverse=True)
        # Keep the top k items
        top_k = combined[:k]
        # Build the new result, looking up content and tags by row position
        new_result = {
            "ids": [item[0] for item in top_k],
            "contents": [documents[item[1]] for item in top_k],
            "tags": [doc_tags[item[1]] for item in top_k]
        }
        return new_result

    new_results = list()
    queries_tags_embedding = embed([q_tag.strip().lower() for q_tag in queries_tags])

    for query_tag_embed, result in zip(queries_tags_embedding, results):
        # The helper runs with its default k=3, so three chunks are kept per query
        new_result = filter_result_by_cosine_similarity(query_tag_embed, result)
        new_results.append(new_result)

    return new_results

#############################

embedding 0 / 4304
embedding 20 / 4304
embedding 40 / 4304
embedding 60 / 4304
embedding 80 / 4304
embedding 100 / 4304
embedding 120 / 4304
embedding 140 / 4304
embedding 160 / 4304
embedding 180 / 4304
embedding 200 / 4304
embedding 220 / 4304
embedding 240 / 4304
embedding 260 / 4304
embedding 280 / 4304
embedding 300 / 4304
embedding 320 / 4304
embedding 340 / 4304
embedding 360 / 4304
embedding 380 / 4304
embedding 400 / 4304
embedding 420 / 4304
embedding 440 / 4304
embedding 460 / 4304
embedding 480 / 4304
embedding 500 / 4304
embedding 520 / 4304
embedding 540 / 4304
embedding 560 / 4304
embedding 580 / 4304
embedding 600 / 4304
embedding 620 / 4304
embedding 640 / 4304
embedding 660 / 4304
embedding 680 / 4304
embedding 700 / 4304
embedding 720 / 4304
embedding 740 / 4304
embedding 760 / 4304
embedding 780 / 4304
embedding 800 / 4304
embedding 820 / 4304
embedding 840 / 4304
embedding 860 / 4304
embedding 880 / 4304
embedding 900 / 4304
embedding 920 / 4304
embedding 940 / 430

### Batch Inference
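
The cell below sends each batch of prompts concurrently and retries rate-limited calls with exponential backoff. A minimal, self-contained sketch of that pattern is shown here. `RateLimited`, `call_with_retry`, `make_fake_call`, and `run_batch` are hypothetical stand-ins for illustration and do not call the OpenAI client.

```python
import asyncio


class RateLimited(Exception):
    """Stand-in for a rate-limit error raised by an API client."""


async def call_with_retry(make_call, retries=3, base_delay=0.0):
    """Await make_call(); on RateLimited, back off exponentially and retry."""
    for attempt in range(retries):
        try:
            return await make_call()
        except RateLimited:
            # Delay grows as base_delay * 1, 2, 4, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    return "API Error: Max retries exceeded."


def make_fake_call(prompt, failures):
    """Build a fake call that fails `failures` times, then answers."""
    state = {"left": failures}

    async def fake_call():
        if state["left"] > 0:
            state["left"] -= 1
            raise RateLimited()
        return f"answer to {prompt}"

    return fake_call


async def run_batch(prompts):
    # gather() preserves input order, so responses line up with prompts
    tasks = [
        call_with_retry(make_fake_call(p, failures=i))
        for i, p in enumerate(prompts)
    ]
    return await asyncio.gather(*tasks)


responses = asyncio.run(run_batch(["q0", "q1", "q2", "q3"]))
print(responses)
```

Inside a notebook an event loop is already running, so `await run_batch(...)` would be used there instead of `asyncio.run(...)`. Because `asyncio.gather` returns results in the order the tasks were passed, the responses can be written back to the same rows as their prompts.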

In [7]:
# Retrieve and validate documents
def retrieve_documents(queries, queries_tags, rerank_method="freq", k=5):
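    """
    Efficiently retrieves relevant documents for a batch of queries.

    Args:
        queries (list of str): A list of query strings.
        queries_tags (list of str): The tag string of each query, e.g. "<K8s><Helm>".
        rerank_method (str): "freq" (tag-match count) or "similarity" (cosine similarity of tag embeddings).
        k (int): The number of documents to retrieve per query before re-ranking.

    Returns:
        list of dict: A list where each element corresponds to a query and contains
                      the keys 'ids', 'contents', and 'tags' for the re-ranked documents.
    """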
    query_embeddings = np.array(embed([q.strip().lower() for q in queries]))

    if query_embeddings.ndim == 1:
        query_embeddings = query_embeddings.reshape(1, -1)

    # Perform batch search in the index
    distances, indices = index.search(np.array(query_embeddings), k)

    indices = np.array([[i if 0 <= i < len(doc_ids) else 0 for i in row] for row in indices])
    # Retrieve and return documents and their IDs for each query
    ##########Tags################
    results = []
    for query_indices in indices:
        result = {
            "ids": [int(doc_ids[i]) for i in query_indices],
            "contents": [documents[i] for i in query_indices],
            "tags": [doc_tags[i] for i in query_indices]
        }
        results.append(result)

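    if rerank_method == "freq":
        return re_ranking_tags(queries_tags, results, k)
    elif rerank_method == "similarity":
        return re_ranking_similarity(queries_tags, results, k)
    else:
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown rerank_method: {rerank_method!r}")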

    ###############################

# Roughly truncates the input text to a maximum number of characters.
def truncate_text(text, max_chars=4096):
    return text[:max_chars] if len(text) > max_chars else text


async def generate_responses(title_queries, body_queries, contexts):
    """
    Efficiently generates responses for a batch of queries asynchronously.

    Args:
        title_queries (list of str): A list of question titles.
        body_queries (list of str): A list of question bodies.
        contexts (list of str): A list of retrieved knowledge corresponding to each query.

    Returns:
        list: A list of generated responses.
    """

    async def call_openai(prompt):
        """Helper function to call OpenAI API with retries."""
        for attempt in range(3):  # Retry up to 3 times if needed
            try:
                response = await client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=4090,
                    temperature=0
                )
                return response.choices[0].message.content  # Successful response
            except openai.RateLimitError:
                print(f"Rate limit error, retrying ({attempt+1}/3)...")
                await asyncio.sleep(2**attempt)  # Exponential backoff
            except Exception as e:
                print(f"OpenAI API Error: {e}")
                return f"API Error: {e}"  # Store the exact error message
        return "API Error: Max retries exceeded."

    # Prepare prompts using the new structured format
    prompts = [
        f"""
        <prompt>
            <retrieved_knowledge>
                <![CDATA[ {context} ]]>
            </retrieved_knowledge>

            <user_query>
                <title>{title_query}</title>
                <body>{body_query}</body>
            </user_query>

            <instructions>
                <summary>
                    You are a Kubernetes expert and troubleshooting assistant. Your task is to diagnose and resolve Kubernetes-related issues using only the retrieved knowledge provided above and established Kubernetes best practices. Do not add any personal ideas or thoughts.
                </summary>

                <structured_debugging_approach>
                    <step1>Localization: Identify the exact YAML field, CLI flag, or Kubernetes object causing the issue.</step1>
                    <step2>Reasoning: Based solely on the retrieved knowledge and Kubernetes internals, explain the root cause of the issue.</step2>
                    <step3>Remediation: Provide a verified fix as a Kubernetes YAML configuration with proper justifications.</step3>
                    <step4>Verification: Ensure that the YAML is syntactically correct, schema-compliant, and valid for Kubernetes.</step4>
                </structured_debugging_approach>

                <problem_solving_strategy>
                    <step1>
                        <title>Reference Retrieved Knowledge First</title>
                        <description>
                            - Analyze the retrieved knowledge above before generating any answer.
                            - Rely exclusively on this information and standard Kubernetes best practices.
                        </description>
                    </step1>

                    <step2>
                        <title>Apply Localization, Reasoning, and Remediation</title>
                        <description>
                            - First, localize the issue within the YAML or Kubernetes component.
                            - Next, provide reasoning for the problem based on the retrieved data.
                            - Finally, offer an exact, verified YAML fix.
                        </description>
                    </step2>

                    <step3>
                        <title>Fallback for Insufficient Retrieved Knowledge</title>
                        <description>
                            - If the retrieved knowledge is insufficient to directly solve the issue, explicitly state so.
                            - Then, apply Kubernetes best practices to suggest a possible fix.
                        </description>
                    </step3>

                    <step4>
                        <title>YAML Validation Before Output</title>
                        <description>
                            - Confirm that the YAML fix has valid syntax, complies with the Kubernetes API schema, and uses the correct apiVersion.
                            - Correct any issues automatically before output.
                        </description>
                    </step4>

                    <step5>
                        <title>No Speculation Allowed</title>
                        <description>
                            - Do not introduce any new ideas or assumptions beyond the retrieved knowledge.
                            - Answer strictly based on the provided information and verified Kubernetes guidelines.
                        </description>
                    </step5>

                    <step6>
                        <title>Output Format</title>
                        <description>
                            - If a YAML fix is applicable, output only the corrected YAML.
                            - Keep any additional explanation concise and directly relevant to the YAML fix.
                        </description>
                    </step6>
                </problem_solving_strategy>

                <output_format>
                    <description>
                        Return only the YAML configuration fix if applicable. If an explanation is necessary for edge cases, keep it minimal and directly tied to the provided Kubernetes YAML solution.
                    </description>
                </output_format>
            </instructions>
        </prompt>
        """
        for title_query, body_query, context in zip(title_queries, body_queries, contexts)
    ]

    # Run all API calls asynchronously
    tasks = [call_openai(prompt) for prompt in prompts]
    responses = await asyncio.gather(*tasks)

    return responses



async def process_questions(file_path, batch_size=40):
    # Load CSV file into a Pandas DataFrame
    df = pd.read_csv(file_path, encoding="utf-8")

    # Ensure necessary columns exist
    top_1_col = f"{llm_choice}_Top_1_Context"
    top_2_col = f"{llm_choice}_Top_2_Context"
    top_3_col = f"{llm_choice}_Top_3_Context"
    context_merged_col = f"{llm_choice}_Merged_Contexts"  # Merged for GPT input
    response_col_name = f"{llm_choice}_Generated_Response"
    context_id_col_name = f"{llm_choice}_Context_IDs"

    for col in [top_1_col, top_2_col, top_3_col, context_merged_col, response_col_name, context_id_col_name]:
        if col not in df.columns:
            df[col] = ""  # Initialize with empty strings

    # Get list of unanswered questions (empty cells are read back from CSV as NaN)
    unanswered_mask = df[response_col_name].isna() | (df[response_col_name] == "")
    unanswered_df = df[unanswered_mask]

    # Process questions in batches
    for start in range(0, len(unanswered_df), batch_size):

        batch = unanswered_df.iloc[start : start + batch_size]

        # Since test 2, retrieve body and title separately.
        title_queries = batch["Question Title"].tolist()
        body_queries = batch["Question Body"].tolist()
        queries_tags = batch["Question Tags"].tolist()

        # Retrieve context in batch (each result is a dict with 'ids' and 'contents')
        ## rerank_method can be "freq" or "similarity"
        results = retrieve_documents(body_queries, queries_tags=queries_tags, rerank_method="similarity", k=10)

        # Log the re-ranked results for inspection
        print(f"Retrieved results: {results}")

        # Extract top 3 contexts individually
        top_1_contexts = [truncate_text(result["contents"][0]) if len(result["contents"]) > 0 else "" for result in results]
        top_2_contexts = [truncate_text(result["contents"][1]) if len(result["contents"]) > 1 else "" for result in results]
        top_3_contexts = [truncate_text(result["contents"][2]) if len(result["contents"]) > 2 else "" for result in results]

        # Create merged context string for GPT response generation
        merged_contexts = ["\n\n".join(filter(None, [c1, c2, c3])) for c1, c2, c3 in zip(top_1_contexts, top_2_contexts, top_3_contexts)]

        # Create context_id strings (concatenated document IDs)
        context_ids = [", ".join([str(doc_id) for doc_id in result['ids']]) for result in results]

        # Generate responses asynchronously
        responses = await generate_responses(title_queries, body_queries, merged_contexts)

        # Update the DataFrame with responses
        df.loc[batch.index, response_col_name] = responses
        df.loc[batch.index, context_merged_col] = merged_contexts  # Keep merged for reference
        df.loc[batch.index, top_1_col] = top_1_contexts
        df.loc[batch.index, top_2_col] = top_2_contexts
        df.loc[batch.index, top_3_col] = top_3_contexts
        df.loc[batch.index, context_id_col_name] = context_ids
        print(f"Processed {start + len(batch)} / {len(unanswered_df)} questions")


    # Save the updated CSV file with responses
    df.to_csv(file_path, index=False, encoding="utf-8")
    print(f"All questions processed and saved back to {file_path}")

if __name__ == "__main__":
    file_path = "./dataset/test.csv"
    await process_questions(file_path)


embedding 0 / 40
embedding 20 / 40
embedding 0 / 40
embedding 20 / 40
Processed 40 / 783 questions
embedding 0 / 40
embedding 20 / 40
embedding 0 / 40
embedding 20 / 40
Processed 80 / 783 questions
embedding 0 / 40
embedding 20 / 40
embedding 0 / 40
embedding 20 / 40
Processed 120 / 783 questions
embedding 0 / 40
embedding 20 / 40
embedding 0 / 40
embedding 20 / 40
Processed 160 / 783 questions
embedding 0 / 40
embedding 20 / 40
embedding 0 / 40
embedding 20 / 40
Processed 200 / 783 questions
embedding 0 / 40
embedding 20 / 40
embedding 0 / 40
embedding 20 / 40
Processed 240 / 783 questions
embedding 0 / 40
embedding 20 / 40
embedding 0 / 40
embedding 20 / 40
Processed 280 / 783 questions
embedding 0 / 40
embedding 20 / 40
embedding 0 / 40
embedding 20 / 40
Processed 320 / 783 questions
embedding 0 / 40
embedding 20 / 40
embedding 0 / 40
embedding 20 / 40
Processed 360 / 783 questions
embedding 0 / 40
embedding 20 / 40
embedding 0 / 40
embedding 20 / 40
Processed 400 / 783 questions
em