<a href="https://colab.research.google.com/github/Mzluci9/Complaint-Analysis-RAG/blob/task_1/notebooks/task4_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Task 4: RAG Chatbot Evaluation and Optimization
**Participant**: Michael Zewdu Lemma
**Cohort**: KAIM 5/6
**Date**: July 7, 2025

This notebook evaluates the RAG chatbot from Task 3. It reviews the baseline responses and retrieved chunks, tests additional queries, and tries three optimizations: a smaller `top_k`, filtering out very short chunks, and a more specific system prompt.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Load Task 3 results
results_path = '/content/drive/My Drive/Colab Notebooks/rag_results.json'
try:
    results = pd.read_json(results_path, orient='records', lines=True)
    logging.info(f"Loaded results with shape: {results.shape}")
    print("Sample Results:")
    print(results[['query', 'response']].head())
except Exception as e:
    logging.error(f"Failed to load results: {e}")
    raise

Sample Results:
                                              query  \
0                 Why are people unhappy with BNPL?   
1    What are common issues with Credit Card fraud?   
2         Why do Savings Account complaints happen?   
3  What issues do people face with Money Transfers?   
4          Why are Personal Loan complaints common?   

                                            response  
0  <|system|>\nYou are a financial complaint anal...  
1  <|system|>\nYou are a financial complaint anal...  
2  <|system|>\nYou are a financial complaint anal...  
3  <|system|>\nYou are a financial complaint anal...  
4  <|system|>\nYou are a financial complaint anal...  


# Evaluate Baseline Performance

In [4]:
# Manual evaluation of response quality
for idx, row in results.iterrows():
    print(f"\nQuery: {row['query']}")
    print(f"Response: {row['response']}")
    print("Retrieved Chunks:")
    for i, chunk in enumerate(row['retrieved_chunks']):
        print(f"Chunk {i+1} (Product: {chunk['product']}, Distance: {row['distances'][i]:.4f}):")
        print(chunk['chunk_text'])


Query: Why are people unhappy with BNPL?
Response: <|system|>
You are a financial complaint analysis assistant.
<|user|>
Why are people unhappy with BNPL?
<|retrieved|>
Complaint (Product: Buy Now, Pay Later (BNPL)): practices of bnpl companies reporting only negative data creates an incomplete and potentially damaging picture of a consumer s creditworthiness it is my understanding that the cfpb has been looking into the bnpl sector and the unfair practices that are being used difficulty accessing assistance during financial hardship furthermore affirm does not provide easily accessible avenues for customers to seek assistance during periods of financial hardship navigating their customer service channels to request payment arrangements or other forms of support is unnecessarily difficult and frustrating this lack of transparency and accessibility exacerbates the negative impact of late payments particularly during unforeseen financial challenges i have attempted to contact them and h
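
The response above starts by echoing the full prompt. By default the `text-generation` pipeline returns the prompt plus its continuation, and the prompt here never opens an assistant turn. The sketch below shows one way to separate the answer from the prompt, assuming the chat layout described on the Zephyr model card (`<|system|>`, `<|user|>`, and `<|assistant|>` turns closed by `</s>`). `build_prompt` and `extract_answer` are hypothetical helpers, not part of the Task 3 code, and the complaint text and model output are made-up placeholders.

```python
# Assumed Zephyr-style turn markers; tokenizer.apply_chat_template is the
# authoritative way to produce this layout for a given model.
ASSISTANT_TAG = "<|assistant|>"
EOS = "</s>"


def build_prompt(system_prompt, query, context):
    """Place the retrieved context inside the user turn and open an assistant turn."""
    user_turn = f"Context:\n{context}\n\nQuestion: {query}"
    return (
        f"<|system|>\n{system_prompt}{EOS}\n"
        f"<|user|>\n{user_turn}{EOS}\n"
        f"{ASSISTANT_TAG}\n"
    )


def extract_answer(generated_text):
    """Return only the text after the last assistant tag (or the input if no tag is found)."""
    _, sep, answer = generated_text.rpartition(ASSISTANT_TAG)
    return answer.strip() if sep else generated_text.strip()


prompt = build_prompt(
    "You are a financial complaint analysis assistant.",
    "Why are people unhappy with BNPL?",
    "Complaint (Product: BNPL): placeholder complaint text",
)
# Simulate the pipeline's default output: the prompt followed by the continuation.
fake_output = prompt + "Placeholder answer from the model."
print(extract_answer(fake_output))
```

Passing `return_full_text=False` to the pipeline call is another option; it asks the pipeline to return only the newly generated text.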

In [1]:
!pip install -U bitsandbytes
!pip install -U accelerate transformers




# Test Additional Queries

In [5]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load data and index
data_path = '/content/drive/My Drive/Colab Notebooks/chunked_complaints.csv'
index_path = '/content/drive/My Drive/Colab Notebooks/complaint_index.faiss'
df_chunks = pd.read_csv(data_path)
index = faiss.read_index(index_path)

# Load LLM
model_name = "HuggingFaceH4/zephyr-7b-beta"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True
)
llm = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Retriever function
embedder = SentenceTransformer('all-MiniLM-L6-v2')
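def retrieve_chunks(query, index, df_chunks, embedder, top_k=5):
    """Return the top_k nearest chunks (as dicts) and their FAISS distances."""
    query_embedding = embedder.encode([query], show_progress_bar=False)
    distances, indices = index.search(np.array(query_embedding, dtype=np.float32), top_k)
    # FAISS pads with -1 when fewer than top_k neighbours exist; drop those slots
    valid = indices[0] >= 0
    retrieved_chunks = df_chunks.iloc[indices[0][valid]][['complaint_id', 'product', 'chunk_idx', 'chunk_text']].to_dict('records')
    return retrieved_chunks, distances[0][valid]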

# RAG pipeline
def rag_pipeline(query, index, df_chunks, embedder, llm, top_k=5):
    chunks, distances = retrieve_chunks(query, index, df_chunks, embedder, top_k)
    if not chunks:
        return "No relevant complaints found.", [], []
    system_prompt = "You are a financial complaint analysis assistant."
    context = "\n".join([f"Complaint (Product: {chunk['product']}): {chunk['chunk_text']}" for chunk in chunks])
    full_prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{query}\n<|retrieved|>\n{context}"
    response = llm(full_prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
    return response[0]["generated_text"], chunks, distances

# Test new queries
new_queries = [
    "What are the main reasons for Credit Card dissatisfaction?",
    "Why do people complain about Personal Loan interest rates?",
    "What problems occur with Money Transfer delays?"
]
new_results = []
for query in new_queries:
    response, chunks, distances = rag_pipeline(query, index, df_chunks, embedder, llm)
    print(f"\nQuery: {query}")
    print(f"Response: {response}")
    print("Retrieved Chunks:")
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1} (Product: {chunk['product']}, Distance: {distances[i]:.4f}):")
        print(chunk['chunk_text'])
    new_results.append({
        'query': query,
        'response': response,
        'retrieved_chunks': chunks,
        'distances': distances.tolist()
    })

Device set to use cuda:0



Query: What are the main reasons for Credit Card dissatisfaction?
Response: <|system|>
You are a financial complaint analysis assistant.
<|user|>
What are the main reasons for Credit Card dissatisfaction?
<|retrieved|>
Complaint (Product: Credit Card): credit cards
Complaint (Product: Credit Card): edge of a recession or possibly even a depression people are under a lot of stress and the thoughtless actions of this credit card company are not helpful i find it hard to comprehend how they failed and continue to fail their customers and their behavior especially at this time is hard to comprehend
Complaint (Product: Credit Card): harms consumers who depend on the assumption that if their credit line is sufficient they ll be able to buy things what is the point of having a credit card that can decline charges randomly on no basis with no notice
Complaint (Product: Credit Card): in the first place their customer service has led me astray multiple times and its completely demoralizing for 
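
The first retrieved chunk above is the two-word fragment "credit cards", which adds nothing to the context. Manual inspection like this can be backed by a few per-query numbers. The sketch below assumes each chunk is a dict with `product` and `chunk_text` keys, as returned by `retrieve_chunks`. `retrieval_summary`, `expected_product`, and `min_words` are hypothetical names; the sample texts are abridged from the output above, and the distances are placeholders because the real values are not shown.

```python
def retrieval_summary(chunks, distances, expected_product, min_words=5):
    """Summarize one query's retrieval: product match rate, short-chunk rate, mean distance."""
    n = len(chunks)
    if n == 0:
        return {"product_precision": 0.0, "short_chunk_rate": 0.0, "mean_distance": None}
    # Chunks whose product label matches the product the query asks about
    on_product = sum(c["product"] == expected_product for c in chunks)
    # Chunks too short (in whitespace-separated words) to carry useful context
    short = sum(len(c["chunk_text"].split()) < min_words for c in chunks)
    return {
        "product_precision": on_product / n,
        "short_chunk_rate": short / n,
        "mean_distance": sum(distances) / n,
    }


sample_chunks = [
    {"product": "Credit Card", "chunk_text": "credit cards"},
    {"product": "Credit Card",
     "chunk_text": "harms consumers who depend on the assumption that if their credit line is sufficient"},
]
sample_distances = [0.5, 1.5]  # placeholder values, not taken from the notebook run

summary = retrieval_summary(sample_chunks, sample_distances, "Credit Card")
print(summary)
```

A high `short_chunk_rate` across queries would support filtering short chunks, while a low `product_precision` would point at the retriever rather than the prompt.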

# Optimize the Pipeline

In [6]:
# Optimization 1: Adjust top_k
print("\nTesting top_k=3")
for query in new_queries[:2]:  # Test on 2 queries
    response, chunks, distances = rag_pipeline(query, index, df_chunks, embedder, llm, top_k=3)
    print(f"\nQuery: {query}")
    print(f"Response (top_k=3): {response}")
    print("Retrieved Chunks:")
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1} (Product: {chunk['product']}, Distance: {distances[i]:.4f}):")
        print(chunk['chunk_text'])

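# Optimization 2: Filter short chunks (<5 words)
# The FAISS index returns row positions in the original df_chunks, so filtering only
# the DataFrame would misalign the results. Build a matching index from the kept vectors.
# Assumes len(df_chunks) == index.ntotal, L2 distance, and an index that stores raw
# vectors (e.g. IndexFlatL2) so that reconstruct_n is available.
keep = (df_chunks['chunk_length'] >= 5).to_numpy()
df_filtered = df_chunks[keep].reset_index(drop=True)
filtered_index = faiss.IndexFlatL2(index.d)
filtered_index.add(index.reconstruct_n(0, index.ntotal)[keep])
print("\nFiltered Chunked Dataset Shape:", df_filtered.shape)
for query in new_queries[:2]:
    response, chunks, distances = rag_pipeline(query, filtered_index, df_filtered, embedder, llm)
    print(f"\nQuery (Filtered): {query}")
    print(f"Response: {response}")
    print("Retrieved Chunks:")
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1} (Product: {chunk['product']}, Distance: {distances[i]:.4f}):")
        print(chunk['chunk_text'])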

# Optimization 3: Tweak prompt
def rag_pipeline_optimized(query, index, df_chunks, embedder, llm, top_k=5):
    chunks, distances = retrieve_chunks(query, index, df_chunks, embedder, top_k)
    if not chunks:
        return "No relevant complaints found.", [], []
    system_prompt = "You are a financial complaint analysis assistant. Provide a concise, bullet-point summary of the main issues based on the complaints."
    context = "\n".join([f"Complaint (Product: {chunk['product']}): {chunk['chunk_text']}" for chunk in chunks])
    full_prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{query}\n<|retrieved|>\n{context}"
    response = llm(full_prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
    return response[0]["generated_text"], chunks, distances

print("\nTesting optimized prompt")
for query in new_queries[:2]:
    response, chunks, distances = rag_pipeline_optimized(query, index, df_chunks, embedder, llm)
    print(f"\nQuery (Optimized Prompt): {query}")
    print(f"Response: {response}")
    print("Retrieved Chunks:")
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1} (Product: {chunk['product']}, Distance: {distances[i]:.4f}):")
        print(chunk['chunk_text'])


Testing top_k=3

Query: What are the main reasons for Credit Card dissatisfaction?
Response (top_k=3): <|system|>
You are a financial complaint analysis assistant.
<|user|>
What are the main reasons for Credit Card dissatisfaction?
<|retrieved|>
Complaint (Product: Credit Card): credit cards
Complaint (Product: Credit Card): edge of a recession or possibly even a depression people are under a lot of stress and the thoughtless actions of this credit card company are not helpful i find it hard to comprehend how they failed and continue to fail their customers and their behavior especially at this time is hard to comprehend
Complaint (Product: Credit Card): harms consumers who depend on the assumption that if their credit line is sufficient they ll be able to buy things what is the point of having a credit card that can decline charges randomly on no basis with no notice? it is a useless product.
Complaint (Product: Credit Card): I have a credit card, in which I have used only for emerge
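
With `top_k=3` the context still opens with the two-word chunk "credit cards". One option that leaves the FAISS index untouched is to over-fetch candidates and drop short chunks after retrieval. The sketch below is a minimal version of that idea. `filter_retrieved` is a hypothetical helper, `min_words=5` mirrors the threshold named in Optimization 2, and the sample chunks and distances are placeholders.

```python
def filter_retrieved(chunks, distances, top_k=3, min_words=5):
    """Keep the first top_k chunks that have at least min_words words.

    chunks and distances are expected in ranked order, as a retriever returns them.
    """
    kept = [
        (chunk, dist)
        for chunk, dist in zip(chunks, distances)
        if len(chunk["chunk_text"].split()) >= min_words
    ]
    kept = kept[:top_k]
    return [chunk for chunk, _ in kept], [dist for _, dist in kept]


# Placeholder candidates standing in for an over-fetched result list
candidates = [
    {"product": "Credit Card", "chunk_text": "credit cards"},
    {"product": "Credit Card", "chunk_text": "the card was declined at random with no notice"},
    {"product": "Credit Card", "chunk_text": "customer service has led me astray multiple times"},
    {"product": "Credit Card", "chunk_text": "card declined"},
]
candidate_distances = [0.4, 0.6, 0.7, 0.9]

filtered_chunks, filtered_distances = filter_retrieved(candidates, candidate_distances, top_k=3)
for chunk, dist in zip(filtered_chunks, filtered_distances):
    print(f"{dist:.2f}  {chunk['chunk_text']}")
```

In the notebook this would be applied to the output of `retrieve_chunks` called with a larger `top_k` (for example three times the number of chunks wanted), so that enough candidates survive the filter.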

# Save Results and Update Report

In [8]:
from google.colab import files

# Save new results
all_results = results.to_dict('records') + new_results
pd.DataFrame(all_results).to_json('/content/drive/My Drive/Colab Notebooks/rag_results_updated.json', orient='records', lines=True)
files.download('/content/drive/My Drive/Colab Notebooks/rag_results_updated.json')

# Download the notebook copy stored on Drive.
# Do not write to this path here: overwriting the .ipynb with placeholder text
# would replace the saved notebook. Save it first via File > Save in Colab.
notebook_path = '/content/drive/My Drive/Colab Notebooks/Task4_Evaluation.ipynb'
files.download(notebook_path)

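
Before relying on the updated results file, it can help to read it back and confirm that every record has the expected fields. The sketch below uses the standard `json` module on a temporary file with placeholder records. `write_jsonl`, `read_jsonl`, and `find_incomplete` are hypothetical helpers; the required keys match the fields stored in `new_results`.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("query", "response", "retrieved_chunks", "distances")


def write_jsonl(records, path):
    """Write one JSON object per line (the layout produced by to_json(..., lines=True))."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def find_incomplete(records, required=REQUIRED_KEYS):
    """Return the positions of records that lack any required key."""
    return [i for i, r in enumerate(records) if any(k not in r for k in required)]


# Placeholder records; the second one is deliberately missing 'distances'
records = [
    {"query": "q1", "response": "r1", "retrieved_chunks": [], "distances": []},
    {"query": "q2", "response": "r2", "retrieved_chunks": []},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rag_results_check.json")
    write_jsonl(records, path)
    loaded = read_jsonl(path)

incomplete = find_incomplete(loaded)
print(f"{len(loaded)} records loaded, incomplete at positions: {incomplete}")
```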