**Intelligent Complaint Analysis for Financial Services: Building a RAG-Powered Chatbot**

<h5><b>Task 1: Exploratory Data Analysis and Data Preprocessing</b></h5>

In [1]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


**1) Loading the Dataset**

The dataset is loaded with pandas, the standard Python library for data manipulation.

In [3]:
# Load the dataset
try:
    df = pd.read_csv(r'E:/KAIM/phase 2/Week 6/Intelligent-Complaint-Analysis-for-Financial-Services/data/complaints.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'complaints.csv' not found. Please ensure the file is in the correct directory.")
    raise  # Stop here: every later cell needs df to exist

MemoryError: Unable to allocate 1.22 GiB for an array with shape (17, 9609797) and data type object

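The try-except block reports a missing file with a clear message instead of a raw traceback. It does not catch the MemoryError shown above: the file has about 9.6 million rows, and loading 17 text (object) columns for all of them at once exceeded the available memory. A file of this size needs to be read in chunks, or with only the columns the analysis requires.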
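One way to reduce memory use when loading a file of this size is to read only the needed columns and process the file in chunks. The sketch below is a minimal illustration, not the notebook's original method: the helper name `load_with_narratives`, the chunk size, and the in-memory demo rows are assumptions, while the column names come from the cells in this notebook.

```python
import io
import pandas as pd

def load_with_narratives(source, usecols, chunksize=100_000):
    """Read a large CSV in chunks, keeping only the needed columns
    and the rows that actually have a complaint narrative."""
    kept = []
    for chunk in pd.read_csv(source, usecols=usecols, chunksize=chunksize):
        kept.append(chunk.dropna(subset=["Consumer complaint narrative"]))
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for complaints.csv (hypothetical rows, demo only)
demo_csv = io.StringIO(
    "Complaint ID,Product,Consumer complaint narrative,State\n"
    "1,Credit card,I was charged twice,CA\n"
    "2,Personal loan,,NY\n"
    "3,Money transfer,Transfer never arrived,TX\n"
)
cols = ["Complaint ID", "Product", "Consumer complaint narrative"]
demo_df = load_with_narratives(demo_csv, usecols=cols, chunksize=2)
print(demo_df.shape)  # (2, 3): the row without a narrative is dropped
```

Dropping rows while loading means the with/without-narrative counts in step 5 would have to be tallied per chunk before the `dropna` call.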

**2) Initial EDA and Data Understanding**

This step gives a high-level overview of the data, including column names, data types, and initial insights into data quality.

In [None]:
print("\n--- Initial Data Overview ---")
df.info() # info() prints its report directly and returns None
print("\n--- First 5 rows ---")
print(df.head())
print("\n--- Descriptive Statistics for All Columns ---")
print(df.describe(include='all')) # Use include='all' for non-numerical columns too

df.info() provides non-null counts and data types, immediately highlighting missing values. df.head() gives a quick glance at the data structure. df.describe(include='all') offers statistics for all columns, including unique counts for categorical data.

**3) Analyzing Complaint Distribution Across Products**

This step shows which financial products generate the most complaints, which informs the business objective and potential areas of focus for CrediTrust.

In [None]:
# Analyzing Complaint Distribution Across Products
print("\n--- Distribution of Complaints by Product ---")
product_distribution = df['Product'].value_counts()
print(product_distribution)

plt.figure(figsize=(10, 6))
sns.barplot(x=product_distribution.index, y=product_distribution.values, palette='viridis')
plt.title('Number of Complaints per Product')
plt.xlabel('Product')
plt.ylabel('Number of Complaints')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

- value_counts() is effective for categorical distribution. Visualization using matplotlib and seaborn makes the insights easily digestible, crucial for reporting to stakeholders like Asha. The rotation and tight_layout ensure readability of product names.

**4) Analyzing Consumer Complaint Narrative Length**

This step identifies very short or very long narratives, which might impact embedding quality and chunking strategies later.

In [None]:
# Calculate word count for each narrative (missing narratives count as 0 words)
df['narrative_word_count'] = df['Consumer complaint narrative'].fillna('').astype(str).apply(lambda x: len(x.split()))

print("\n--- Narrative Length Statistics (Word Count) ---")
print(df['narrative_word_count'].describe())

plt.figure(figsize=(10, 6))
sns.histplot(df['narrative_word_count'], bins=50, kde=True, color='skyblue')
plt.title('Distribution of Consumer Complaint Narrative Length (Word Count)')
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.show()

# Identify very short/long narratives (example thresholds)
very_short_narratives = df[df['narrative_word_count'] < 10].shape[0]
very_long_narratives = df[df['narrative_word_count'] > df['narrative_word_count'].quantile(0.99)].shape[0]
print(f"\nNumber of very short narratives (<10 words): {very_short_narratives}")
print(f"Number of very long narratives (>99th percentile): {very_long_narratives}")

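Histograms provide a visual understanding of the distribution, while describe() offers statistical summaries. One caveat on missing values: astype(str) alone turns NaN into the string 'nan', which counts as one word, so filling missing narratives with an empty string first (fillna('')) is what gives them a word count of 0. Identifying outliers (very short or very long narratives) is important for subsequent text processing steps.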
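The effect of missing values on the word count can be checked in isolation. The helper below is a small sketch; the name `word_count` is hypothetical and not part of the notebook.

```python
def word_count(text):
    """Count whitespace-separated words, treating missing values as empty text."""
    if text is None or text != text:  # NaN is the only value not equal to itself
        return 0
    return len(str(text).split())

missing = float("nan")
print(len(str(missing).split()))               # 1, because str(nan) is the string "nan"
print(word_count(missing))                     # 0
print(word_count("I was charged a late fee"))  # 6
```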



**5) Identifying Complaints With/Without Narratives**

This count is essential for filtering, as the RAG system relies on these narratives.

In [None]:
complaints_with_narrative = df['Consumer complaint narrative'].dropna().shape[0]
complaints_without_narrative = df['Consumer complaint narrative'].isnull().sum()
print(f"\nNumber of complaints with narratives: {complaints_with_narrative}")
print(f"Number of complaints without narratives: {complaints_without_narrative}")

dropna() and isnull().sum() are direct methods for this check. This step directly informs the filtering requirement to remove records without narratives.

**6) Filtering the Dataset**

This step restricts the dataset to the relevant products and ensures all records have a narrative.

In [None]:
# Define the five specified products
specified_products = [
    'Credit card',
    'Personal loan',
    'Buy Now, Pay Later (BNPL)',
    'Savings account',
    'Money transfer' # Assuming this is the correct exact name from the dataset
]

# Filter for specified products
df_filtered_products = df[df['Product'].isin(specified_products)].copy()
print(f"\nShape after filtering for specified products: {df_filtered_products.shape}")

# Remove records with empty Consumer complaint narrative fields
df_cleaned = df_filtered_products.dropna(subset=['Consumer complaint narrative']).copy()
print(f"Shape after removing empty narratives: {df_cleaned.shape}")

Using isin() for product filtering is efficient. dropna(subset=['Consumer complaint narrative']) specifically targets the narrative column for null values. Using .copy() after filtering prevents SettingWithCopyWarning in future operations.

**7) Cleaning Text Narratives**

This step improves the quality of embeddings by removing noise and normalizing text.

In [None]:
import re
import string

def clean_text(text):
    text = str(text).lower() # Lowercasing
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text) # Remove punctuation
    # Remove common boilerplate text (example)
    text = re.sub(r'i am writing to file a complaint', '', text)
    text = re.sub(r'this is a complaint regarding', '', text)
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
    return text

df_cleaned['Consumer complaint narrative_cleaned'] = df_cleaned['Consumer complaint narrative'].apply(clean_text)
print("\n--- Sample of cleaned narratives ---")
print(df_cleaned[['Consumer complaint narrative', 'Consumer complaint narrative_cleaned']].head())

Lowercasing is standard practice. Removing punctuation helps focus embeddings on word meaning. Identifying and removing boilerplate text can significantly improve signal-to-noise ratio. Regular expressions (re) are powerful for these tasks. More advanced techniques (e.g., stemming, lemmatization, stop-word removal) could be considered based on further EDA and performance testing, as optional steps.
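Running the cleaning function on a couple of sample strings makes its behavior concrete. The function below is a copy of clean_text from the cell above; the two input sentences are made up for illustration.

```python
import re
import string

def clean_text(text):
    text = str(text).lower()                                        # lowercase
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)   # remove punctuation
    text = re.sub(r'i am writing to file a complaint', '', text)    # boilerplate
    text = re.sub(r'this is a complaint regarding', '', text)       # boilerplate
    text = re.sub(r'\s+', ' ', text).strip()                        # collapse whitespace
    return text

print(clean_text("I am writing to file a complaint. My card was CHARGED twice!!"))
# my card was charged twice
print(clean_text("They billed me $1,200.50 on 03/15."))
# they billed me 120050 on 0315
```

The second example shows a side effect worth checking during EDA: stripping all punctuation also fuses amounts and dates into single digit strings.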



**8) Saving the Cleaned Dataset**

This step creates an intermediate artifact for the next tasks.

In [None]:
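output_path = 'data/filtered_complaints.csv'
os.makedirs(os.path.dirname(output_path), exist_ok=True) # Create the data/ folder if it is missing
df_cleaned.to_csv(output_path, index=False)
print(f"\nCleaned and filtered dataset saved to {output_path}")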

The cleaned and filtered dataset is saved to data/filtered_complaints.csv.

index=False prevents Pandas from writing the DataFrame index as a column in the CSV, which is generally desired for clean data files.

<h5><b>Task 2: Text Chunking, Embedding, and Vector Store Indexing</b></h5>

In this task we shall convert the cleaned text narratives into a format suitable for efficient semantic search.

**1) Text Chunking Strategy**

- Long narratives can dilute the semantic meaning when embedded as a single vector. Chunking breaks them into smaller, more semantically coherent units.

- LangChain's RecursiveCharacterTextSplitter is used here because it prefers to split at natural separators, which helps preserve semantic units.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

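# Reload the cleaned data saved in Task 1
df_cleaned = pd.read_csv('data/filtered_complaints.csv')
# Narratives that became empty after cleaning are read back as NaN, so drop them
df_cleaned = df_cleaned.dropna(subset=['Consumer complaint narrative_cleaned']).reset_index(drop=True)
narratives = df_cleaned['Consumer complaint narrative_cleaned'].tolist()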

# Experiment with chunk_size and chunk_overlap
# chunk_size: maximum number of characters in a chunk
# chunk_overlap: number of characters to overlap between chunks to maintain context
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Example: 500 characters
    chunk_overlap=100, # Example: 100 characters overlap
    length_function=len,
    separators=["\n\n", "\n", " ", ""] # Prioritized separators
)

all_chunks = []
# Iterate over the columns directly so chunk metadata never depends on index labels
for complaint_id, product, narrative in zip(
    df_cleaned['Complaint ID'], df_cleaned['Product'], narratives # Assuming 'Complaint ID' exists
):
    for chunk in text_splitter.split_text(narrative):
        all_chunks.append({
            'original_complaint_id': complaint_id,
            'product': product,
            'chunk_text': chunk
        })

chunks_df = pd.DataFrame(all_chunks)
print(f"\nTotal chunks created: {len(chunks_df)}")
print(chunks_df.head())

The choice of chunk_size and chunk_overlap is crucial. Smaller chunks might miss broader context, while larger ones might include irrelevant information. RecursiveCharacterTextSplitter tries the separators in order, falling back to the next one when a piece is still too long. Because clean_text has already collapsed all whitespace to single spaces, the "\n\n" and "\n" separators never match here, so splits effectively fall on word boundaries. The report should justify the chosen values by describing the experiments behind them, ideally with example chunks.
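To see how chunk_size and chunk_overlap interact, a plain sliding window over characters is enough. This is a simplified sketch, not LangChain's algorithm (which prefers to cut at separators); the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, which is the context that chunk_overlap is meant to carry across chunk boundaries.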



**2) Choosing an Embedding Model**

- An embedding model converts text chunks into numerical vector representations.
- Here, sentence-transformers/all-MiniLM-L6-v2 offers a good balance of performance and efficiency for semantic similarity tasks.

In [None]:
from sentence_transformers import SentenceTransformer

# Load the embedding model
embedding_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(embedding_model_name)
print(f"\nEmbedding model '{embedding_model_name}' loaded.")

all-MiniLM-L6-v2 is a compact MiniLM transformer (a distilled, BERT-style model) fine-tuned for sentence similarity; it maps each text to a 384-dimensional vector. In the report, explain that it's chosen for its effectiveness in capturing semantic meaning in a computationally efficient manner, suitable for real-time querying. Other options exist (e.g., larger Sentence Transformers models, OpenAI embeddings), but all-MiniLM-L6-v2 is a solid starting point for its balance.
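Semantic similarity between two embeddings is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and the vector values here are assumptions, not model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values)
late_fee = [0.9, 0.1, 0.0]
overdue_charge = [0.8, 0.2, 0.1]
wire_delay = [0.0, 0.2, 0.9]

# Texts about related topics should score higher than unrelated ones
print(cosine_similarity(late_fee, overdue_charge) > cosine_similarity(late_fee, wire_delay))  # True
```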

**3) Embedding and Indexing (Vector Store Creation)**

- This step generates embeddings for each chunk and stores them in a vector database for efficient similarity search.

- FAISS (Facebook AI Similarity Search) and ChromaDB are both common choices. ChromaDB is often easier to get started with thanks to its Python-native interface and built-in persistence; FAISS is known for high performance on large datasets. ChromaDB is used below.

In [None]:
import chromadb
from chromadb.utils import embedding_functions

# Initialize ChromaDB client (persistent)
client = chromadb.PersistentClient(path="./vector_store")

# Embeddings are generated explicitly with the SentenceTransformer model loaded above
# and passed to ChromaDB as vectors, so the model used for indexing is unambiguous.
print("\nGenerating embeddings for chunks...")
chunk_texts = chunks_df['chunk_text'].tolist()
chunk_embeddings = model.encode(chunk_texts, show_progress_bar=True)
print("Embeddings generated.")

# Prepare metadata for ChromaDB
metadatas = chunks_df[['original_complaint_id', 'product']].to_dict(orient='records')
ids = [f"chunk_{i}" for i in range(len(chunks_df))] # Unique IDs for each chunk

collection_name = "customer_complaints_rag"
try:
    collection = client.get_or_create_collection(
        name=collection_name,
        embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(model_name=embedding_model_name)
    )
    # Add chunks and their embeddings to the collection
    collection.add(
        embeddings=chunk_embeddings.tolist(), # Convert numpy array to list
        documents=chunk_texts,
        metadatas=metadatas,
        ids=ids
    )
    print(f"Vector store '{collection_name}' created and indexed successfully with {len(ids)} chunks.")
except Exception as e:
    print(f"Error creating/adding to ChromaDB collection: {e}")

# The vector store will be persisted at './vector_store/'

- It's crucial to store metadata alongside each vector (e.g., original_complaint_id, product). This metadata will be vital for tracing retrieved chunks back to their source complaints and for enabling multi-product querying.
- The get_or_create_collection method in ChromaDB is convenient for development. For FAISS, you would typically build an IndexFlatL2 or similar, add vectors, and then use faiss.write_index to save it.
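With a large number of chunks, a single collection.add call may exceed the client's maximum batch size or hold too much in memory at once. A possible workaround is to add records in fixed-size batches. The sketch below is an assumption-laden illustration: `add_in_batches`, the batch size of 5000, and the `RecordingCollection` stand-in are all hypothetical, and only the keyword arguments of `add` mirror the cell above.

```python
def add_in_batches(collection, ids, embeddings, documents, metadatas, batch_size=5000):
    """Add records to a collection in fixed-size batches; returns the batch count."""
    n_batches = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            embeddings=embeddings[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end],
        )
        n_batches += 1
    return n_batches

class RecordingCollection:
    """Stand-in for a ChromaDB collection that only records batch sizes (demo only)."""
    def __init__(self):
        self.batch_sizes = []

    def add(self, ids, embeddings, documents, metadatas):
        self.batch_sizes.append(len(ids))

demo = RecordingCollection()
n = 12
batches = add_in_batches(
    demo,
    ids=[f"chunk_{i}" for i in range(n)],
    embeddings=[[0.0]] * n,
    documents=["text"] * n,
    metadatas=[{"product": "Credit card"}] * n,
    batch_size=5,
)
print(batches, demo.batch_sizes)  # 3 [5, 5, 2]
```

In the notebook, the real collection object would take the place of the stand-in, with chunk_embeddings.tolist(), chunk_texts, metadatas, and ids as inputs.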

**Conclusion** <br>

- A script (src/vector_store_creation.py or within a notebook) that performs chunking, embedding, and indexing.

- The persisted vector store saved in the vector_store/ directory.

- A section in the report detailing the chunking strategy (justification for chunk_size and chunk_overlap) and the embedding model choice (why all-MiniLM-L6-v2 was selected).



<h5><b>Task 3: Building the RAG Core Logic and Evaluation</b></h5>

In this task we will build the retrieval and generation pipeline and, most importantly, evaluate its effectiveness.



**1) Retriever Implementation**

The retriever fetches the most relevant text chunks for a user's question.

In [None]:
# In a new script or function, load the persisted ChromaDB
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="./vector_store")
collection_name = "customer_complaints_rag"
collection = client.get_collection(
    name=collection_name,
    embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(model_name='sentence-transformers/all-MiniLM-L6-v2')
)
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def retrieve_chunks(question: str, k: int = 5):
    # Embed the question
    question_embedding = embedding_model.encode(question).tolist()

    # Perform similarity search
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=k, # top-k results
        include=['documents', 'metadatas', 'distances']
    )

    retrieved_chunks_info = []
    if results and results['documents']:
        for i in range(len(results['documents'][0])):
            chunk_text = results['documents'][0][i]
            metadata = results['metadatas'][0][i]
            distance = results['distances'][0][i]
            retrieved_chunks_info.append({
                'text': chunk_text,
                'metadata': metadata,
                'distance': distance
            })
    return retrieved_chunks_info

# Example usage:
# retrieved_chunks = retrieve_chunks("Why are people unhappy with BNPL?", k=5)
# for chunk in retrieved_chunks:
#     print(f"Chunk: {chunk['text']}\nMetadata: {chunk['metadata']}\nDistance: {chunk['distance']}\n---")

- The k parameter (number of retrieved chunks) is important. Starting with k=5 is a good heuristic. The retriever should return not only the text but also the associated metadata to display sources later in the UI and for evaluation.
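Because each chunk is stored with a product field in its metadata, retrieval can be restricted to one or more products through the `where` argument of collection.query. The sketch below is a hedged illustration: the helper names and the `EchoCollection` stand-in are hypothetical, and the `$in` operator is assumed to be available in the installed ChromaDB version.

```python
def build_product_filter(products):
    """Build a ChromaDB-style metadata filter for zero, one, or several products."""
    if not products:
        return None
    if len(products) == 1:
        return {"product": products[0]}          # simple equality match
    return {"product": {"$in": list(products)}}  # assumes $in is supported

def query_by_product(collection, question_embedding, products=None, k=5):
    """Query the collection, optionally restricted to the given products."""
    kwargs = {
        "query_embeddings": [question_embedding],
        "n_results": k,
        "include": ["documents", "metadatas", "distances"],
    }
    where = build_product_filter(products)
    if where is not None:
        kwargs["where"] = where
    return collection.query(**kwargs)

class EchoCollection:
    """Stand-in that returns the query arguments it received (demo only)."""
    def query(self, **kwargs):
        return kwargs

sent = query_by_product(EchoCollection(), [0.1, 0.2], products=["Credit card", "Personal loan"])
print(sent["where"])  # {'product': {'$in': ['Credit card', 'Personal loan']}}
```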

**2) Prompt Engineering**

The prompt guides the Large Language Model (LLM) to generate relevant and grounded answers.

In [None]:
SYSTEM_PROMPT_TEMPLATE = """You are a financial analyst assistant for CrediTrust. Your task is to answer questions about customer complaints.
Use the following retrieved complaint excerpts to formulate your answer.
If the context doesn't contain the answer, state that you don't have enough information.
Context: {context}
Question: {question}
Answer:"""

- The template sets the persona, defines the task, directs the model to the retrieved excerpts, and says how to handle insufficient information. This last instruction (If the context doesn't contain the answer, state that you don't have enough information) is crucial for reducing hallucination.
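Filling the template is a plain string-formatting step. The sketch below copies the template from the cell above; numbering the excerpts and the helper name `build_prompt` are assumptions added for illustration, as are the two sample excerpts.

```python
SYSTEM_PROMPT_TEMPLATE = """You are a financial analyst assistant for CrediTrust. Your task is to answer questions about customer complaints.
Use the following retrieved complaint excerpts to formulate your answer.
If the context doesn't contain the answer, state that you don't have enough information.
Context: {context}
Question: {question}
Answer:"""

def build_prompt(question, chunks):
    """Number each retrieved excerpt and insert it into the prompt template."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return SYSTEM_PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "Why are people unhappy with BNPL?",
    ["i was charged a late fee", "the payment schedule changed without notice"],
)
print(prompt)
```

Numbered excerpts make it easier to match a statement in the answer to a specific source during evaluation.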

**3) Generator Implementation**

- The generator combines the retrieved chunks and the user's question with the prompt and produces an answer using an LLM.
- Options include Hugging Face's transformers library for local LLMs, or LangChain for integration with various LLMs (Mistral, Llama, etc.). For a self-contained solution, a smaller local LLM (e.g., Llama 3 8B Instruct, Mistral 7B Instruct) might be chosen if computational resources allow.



In [None]:
import torch
from transformers import pipeline
# from your_rag_module import retrieve_chunks, SYSTEM_PROMPT_TEMPLATE # Assuming these are imported

# Placeholder for LLM setup. In a real scenario, you'd load a specific LLM
# For demonstration, a text generation pipeline
# NOTE: This requires a suitable LLM to be downloaded/available, e.g., via `transformers` library
# For actual use, consider models like 'mistralai/Mistral-7B-Instruct-v0.2'
# or 'meta-llama/Llama-2-7b-chat-hf' if you have access and resources.
try:
    device = 0 if torch.cuda.is_available() else -1 # First GPU if available, otherwise CPU
    llm_pipeline = pipeline("text-generation", model="distilbert/distilgpt2", device=device)
    # Adjust max_new_tokens for generation length
    # Some LLMs require specific chat templates/tokenizers
    print("LLM pipeline loaded (using distilgpt2 as placeholder).")
except Exception as e:
    print(f"Could not load specified LLM. Please check model availability and resources. Error: {e}")
    llm_pipeline = None # Fallback if LLM can't be loaded

def generate_answer(question: str):
    # Always return an (answer, sources) tuple so callers can unpack the result safely
    if not llm_pipeline:
        return "Error: LLM not loaded or available.", []

    retrieved_info = retrieve_chunks(question, k=5)
    context = "\n".join([item['text'] for item in retrieved_info])

    if not context:
        return "I don't have enough information in the retrieved context to answer this question.", []

    prompt = SYSTEM_PROMPT_TEMPLATE.format(context=context, question=question)

    # Generate response
    # Note: For actual LLMs, you often need to handle generation parameters carefully
    # like max_new_tokens, do_sample, temperature, etc.
    # And potentially use a chat template if it's a chat-tuned model.
    try:
        response = llm_pipeline(prompt, max_new_tokens=250, do_sample=True, temperature=0.7, top_p=0.9)[0]['generated_text']
        # Post-process to extract only the answer part, if the LLM repeats the prompt
        answer_start_tag = "Answer:"
        if answer_start_tag in response:
            generated_answer = response.split(answer_start_tag, 1)[1].strip()
        else:
            generated_answer = response.strip() # If LLM doesn't repeat the prompt

        # Return the answer and the sources for display
        return generated_answer, [item['text'] for item in retrieved_info]
    except Exception as e:
        return f"Error during LLM generation: {e}", []

# Example usage:
# answer, sources = generate_answer("What are the common issues with personal loans?")
# print(f"\nAnswer: {answer}")
# print(f"\nSources: {sources}")

- The generate_answer function combines the retriever's output with the prompt and feeds it to the LLM. Careful attention to LLM generation parameters (max_new_tokens, temperature, top_p) is important for controlling output quality and creativity. 
- Post-processing the LLM's raw output to extract just the answer is often necessary as LLMs might repeat parts of the prompt.

**4) Qualitative Evaluation**

- Qualitative evaluation is crucial for understanding the system's performance and identifying areas for improvement. This is a manual, human-in-the-loop process.
<br> <br>
**Process:**
- Create 5-10 representative questions (e.g., "Why are customers complaining about credit card fees?", "What common fraud issues are reported with money transfers?", "Are there recurring problems with BNPL customer support?").

- Run each question through the generate_answer function.

- Manually analyze the generated answer against the retrieved sources and the ground truth (if known).

- Assign a quality score (1-5) and provide detailed comments.



| Question                                  | Generated Answer                                                                          | Retrieved Sources (first 1-2 relevant)        | Quality Score (1-5) | Comments/Analysis                                                                                                  |
|-------------------------------------------|-------------------------------------------------------------------------------------------|------------------------------------------------|---------------------|--------------------------------------------------------------------------------------------------------------------|
| Why are people unhappy with BNPL?         | Customers are frequently complaining about unexpected fees and difficulty with payment schedules based on the provided context. | "Complaint about BNPL fees..." "I was charged late fee..." | 4                   | Answer is concise and directly supported by sources. Could be slightly more detailed if more context allowed.       |
| What are the common issues with personal loans? | I don't have enough information from the provided context to answer this question comprehensively. | (Empty or irrelevant sources)                  | 1                   | Retriever failed to find relevant chunks. Likely an issue with chunking or embedding for this specific query.       |

<h5><b>Task 4: Creating an Interactive Chat Interface</b></h5>

Here we are going to build a user-friendly interface that allows non-technical users to interact with the RAG system.

**1) Choosing the Framework**

- Gradio and Streamlit are both good choices for rapid prototyping and deployment of machine learning applications. Both are easy to learn and allow quick creation of interactive web apps without extensive web development knowledge.
- Gradio is generally the simpler option for chat interfaces because it ships with dedicated chat components.

**2) Core Functionality**

- Text Input Box: For users to type their questions.

- Submit/Ask Button: To trigger the RAG pipeline.

- Display Area: To show the AI-generated answer.

**3) Enhancing Trust and Usability (Key Requirements)**

- **Display Sources**: Crucial for transparency and user trust. Below the answer, display the exact text chunks that the LLM used to generate the response.
- **Streaming**: Improve user experience by displaying the answer token-by-token, making it feel more responsive. This often requires asynchronous handling or specific LLM library features.
- **Clear Button**: To reset the chat conversation.
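These requirements could be wired together with Gradio roughly as follows. This is a minimal sketch under stated assumptions: it expects an answer function returning an (answer, sources) pair, the helper names and app title are hypothetical, and the streaming is simulated word by word rather than coming from the LLM's own token stream. Gradio is imported inside `build_demo` so the helpers can be tried without it installed.

```python
def format_response(answer, sources, max_sources=3):
    """Append a numbered list of source chunks below the answer."""
    if not sources:
        return answer
    lines = [answer, "", "Sources:"]
    for i, source in enumerate(sources[:max_sources], start=1):
        lines.append(f"{i}. {source}")
    return "\n".join(lines)

def stream_words(text):
    """Yield progressively longer prefixes of text, one word at a time."""
    parts = text.split(" ")
    for i in range(1, len(parts) + 1):
        yield " ".join(parts[:i])

def build_demo(answer_fn):
    """Create a Gradio chat UI around answer_fn(question) -> (answer, sources)."""
    import gradio as gr  # lazy import: only needed when the UI is built

    def respond(message, history):
        answer, sources = answer_fn(message)
        # Yielding partial strings lets the chat window update as text arrives
        yield from stream_words(format_response(answer, sources))

    return gr.ChatInterface(fn=respond, title="CrediTrust Complaint Assistant")

print(format_response("Fees are a common issue.", ["i was charged a late fee"]))
```

With the pipeline from Task 3 available, the app would be started with build_demo(generate_answer).launch(); the chat component supplies the text box and submit control, and how the conversation is cleared depends on the Gradio version.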

