- Fine-tuning:
Like putting an already-trained employee through further training until they memorize your company's rules.
- RAG:
Like giving an employee a handbook and letting them look up answers whenever asked.
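
The handbook analogy can be made concrete with a toy lookup. This is only a minimal sketch: the `HANDBOOK` passages, the word-overlap scoring, and the helper names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical and stand in for the embedding model and vector index that a real RAG pipeline would use.

```python
# Toy illustration of the RAG "handbook lookup" idea.
# All passages and helper names here are hypothetical, for illustration only.
HANDBOOK = [
    "Late fees are charged when a credit card payment arrives after the due date.",
    "Prepaid cards can be reloaded at participating retail locations.",
    "Student loan payments may be deferred while the borrower is enrolled.",
]


def tokenize(text):
    """Lowercase, strip basic punctuation, and return the set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())


def retrieve(question, passages, top_k=1):
    """Rank passages by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        passages,
        key=lambda passage: len(question_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, passages):
    """Place the retrieved passage in front of the question, as RAG does."""
    context = "\n".join(retrieve(question, passages))
    return f"Context:\n{context}\n\nQuestion: {question}"


question = "Why was I charged late fees on my credit card?"
prompt = build_prompt(question, HANDBOOK)
print(prompt)
```

The model itself is never retrained here; only the text placed in front of the question changes, which is the key difference from fine-tuning.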

In [1]:
import sys
sys.path.append("..") 

In [2]:
import sys
import os

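# Resolve the project root (one level above the notebook's working directory)
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Put the project root on the Python path so the `from src...` imports below resolve,
# and the src folder too in case its modules import each other directly
SRC_PATH = os.path.join(PROJECT_ROOT, 'src')
for path in (PROJECT_ROOT, SRC_PATH):
    if path not in sys.path:
        sys.path.append(path)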

from src.evaluate_rag import evaluate_rag
from src.evaluate_and_report import evaluate_and_generate_report_auto

INDEX_PATH = os.path.join(PROJECT_ROOT, "vector_store", "faiss_index.idx")
METADATA_PATH = os.path.join(PROJECT_ROOT, "vector_store", "metadata.pkl")
OUTPUT_PATH = os.path.join(PROJECT_ROOT, "outputs", "evaluation_report.md")

sample_questions = [
    "How do customers feel about credit card late fees?",
    "Are there complaints about buy now, pay later?",
    "Is there dissatisfaction with student loans?",
    "What are the most common problems with prepaid cards?"
]

# Run full evaluation + auto scoring + report generation
df_results = evaluate_and_generate_report_auto(
    sample_questions, INDEX_PATH, METADATA_PATH, OUTPUT_PATH, top_k=5
)
print("Full evaluation with auto scoring:")
print(df_results.head())


Device set to use cuda:0


Chunks sample keys: dict_keys(['complaint_id', 'product', 'chunk_id', 'chunk_text'])
Chunks sample content: {'complaint_id': 3729558, 'product': 'Credit card', 'chunk_id': 0, 'chunk_text': 'these past few months have been very difficult for everyone despite the fact that people are doing all they can to pay their monthly bills credit card companies are taking advantage of the situation to add late fees where it should be forbidden i paid mine late a few days late but i paid therefore when i requested there should be helping people when didnt want to remove my late fee even though i paid 10000 in interest this is racket credit card companies should be barred from assessing late fees in these times'}


Token indices sequence length is longer than the specified maximum sequence length for this model (557 > 512). Running this sequence through the model will result in indexing errors


Chunks sample keys: dict_keys(['complaint_id', 'product', 'chunk_id', 'chunk_text'])
Chunks sample content: {'complaint_id': 7422836, 'product': 'Personal loan', 'chunk_id': 0, 'chunk_text': 'buy now pay later payment for one expense  affirm did not make reporting of any good standing loans to credit bureaus a part of the initial agreement agreement stated that late payments and defaults may be reported but there are no late payments company reported payment as an open installment loan account on xxxx credit report negatively affecting credit score and credit worthiness company refused to rectify issue ive used affirm 8 times previously with no reporting to credit bureaus agreement and loan reporting practices are intentionally deceptive and harmful'}
Chunks sample keys: dict_keys(['complaint_id', 'product', 'chunk_id', 'chunk_text'])
Chunks sample content: {'complaint_id': 4339332, 'product': 'Personal loan', 'chunk_id': 0, 'chunk_text': 'i never took out a student loan under my name 

| Principle                      | Description                                                                        | Did We Address It?           | How?                                                                                                                                                       |
| ------------------------------ | ---------------------------------------------------------------------------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Text Relevance**             | Are the text fields useful for the downstream task (e.g., complaint narratives)?   |  Yes                        | I kept only complaints with a non-empty `narrative` and cleaned the text so that only relevant content remains.                                           |
| **Semantic Integrity**         | Does the text still preserve meaning after cleaning/chunking?                      |  Yes (with room to improve) | Used chunking with overlap (`chunk_size`, `chunk_overlap`) to preserve context; could experiment with sentence-based splits for higher integrity.          |
| **Metadata Quality**           | Is the metadata complete, consistent, and informative?                             |  Yes                        | Kept important fields like `complaint_id`, `product`, `company`, and linked them to each chunk.                                                            |
| **Embedding Quality**          | Are embeddings meaningful and relevant to query matching?                          |  Yes                        | Used `sentence-transformers/all-MiniLM-L6-v2`, a lightweight model well suited to semantic search; spot-checked retrieval using FAISS similarity scores.   |
| **Storage & Format**           | Is the data stored in a way that is fast and accessible?                           |  Yes                        | Used `FAISS` for fast retrieval and saved the metadata as a `pickle` file, a common pattern in RAG pipelines.                                              |
| **Performance Readiness**      | Can the data pipeline handle user queries efficiently?                             |  Yes                        | The vector store is indexed, embeddings are precomputed, and each query retrieves only the 5 nearest chunks (`top_k=5`).                                   |
| **Encoding/Language Handling** | Are encodings (e.g., UTF-8), special characters, and multilingual support handled? |  Partial                    | I normalized the English text; multilingual text (e.g., Amharic) would need a multilingual embedding model and tokenizer.                                  |
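
The overlap-based chunking mentioned under Semantic Integrity can be sketched as follows. This is an illustrative sketch, not the project's actual splitter: `chunk_words` is a hypothetical helper, and it assumes `chunk_size` and `chunk_overlap` are measured in words, whereas the real pipeline may count characters or tokens.

```python
def chunk_words(text, chunk_size=5, chunk_overlap=2):
    """Split text into word windows that share `chunk_overlap` words.

    Hypothetical helper: sizes are counted in words here as an assumption.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


narrative = "i paid a few days late but the late fee was not removed"
chunks = chunk_words(narrative, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last `chunk_overlap` words of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk. Sentence-based splitting would replace the fixed word window with sentence boundaries.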
