# Part C — Mini-RAG Notebook (Full Implementation)

This notebook builds a small Retrieval-Augmented Generation (RAG) pipeline using synthetic Hiver KB articles. It includes:

- KB creation / loading
- Embedding generation with SentenceTransformers
- FAISS vector index for retrieval
- Retriever and RAG answer synthesis
- Optional Gemini generation (use Colab Secrets / `GEMINI_API_KEY`)
- Two required queries with retrieved articles, answers and confidence

**Local files referenced:** `/mnt/data/Hiver – AI Intern Evaluation Assignment.pdf`.

## 0. Install requirements (run once)

If you're in Colab or a fresh environment, run the pip install line.

In [None]:
# Uncomment and run if needed (Colab / fresh env)
# !pip install -q sentence-transformers faiss-cpu google-generativeai

print('If running in a fresh environment, uncomment the pip install line and run this cell.')

If running in a fresh environment, uncomment the pip install line and run this cell.


## 1. Create synthetic KB articles

In [22]:
import os, json, textwrap

kb_dir = '/mnt/data/hiver_kb_articles'
os.makedirs(kb_dir, exist_ok=True)

kb_articles = [
    {
        'id': 'kb1',
        'title': "Unable to access shared mailbox",
        'tags': ['access_issue'],
        'content': """I am getting a permission denied message when trying to access our shared mailbox."""
    },
    {
        'id': 'kb2',
        'title': "Rule not triggering",
        'tags': ['workflow_issue'],
        'content': """Our auto-assignment rule is no longer firing for emails with the word Refund."""
    },
    {
        'id': 'kb3',
        'title': "Email threads not merging",
        'tags': ['threading_issue'],
        'content': """Two replies from the same customer created different threads in Hiver."""
    },
    {
        'id': 'kb4',
        'title': "Tag suggestions incorrect",
        'tags': ['tagging_accuracy'],
        'content': """Tag suggestions are showing irrelevant tags like billing for product related emails."""
    },
    {
        'id': 'kb5',
        'title': "Drafts disappearing",
        'tags': ['ui_bug'],
        'content': """Draft replies disappear when switching between conversations."""
    },
    {
        'id': 'kb6',
        'title': "Automation delay",
        'tags': ['automation_delay'],
        'content': """Our automation to mark emails as pending is taking 2–3 minutes to run."""
    },
    {
        'id': 'kb7',
        'title': "Login issues",
        'tags': ['auth_issue'],
        'content': """Several users are unable to log in today. Getting 'invalid session'."""
    },
    {
        'id': 'kb8',
        'title': "Export not downloading",
        'tags': ['export_issue'],
        'content': """Exporting conversations as CSV fails silently."""
    },
    {
        'id': 'kb9',
        'title': "Notification spam",
        'tags': ['notification_bug'],
        'content': """Agents are receiving duplicate desktop notifications."""
    },
    {
        'id': 'kb10',
        'title': "Feature request: bulk tagging",
        'tags': ['feature_request'],
        'content': """We want an option to bulk-apply tags to multiple emails."""
    },

    # ---------------------------- CUST B ----------------------------
    {
        'id': 'kb11',
        'title': "Billing mismatch",
        'tags': ['billing_error'],
        'content': """We were charged for 20 users though we only have 12 active."""
    },
    {
        'id': 'kb12',
        'title': "CSAT not calculated",
        'tags': ['analytics_issue'],
        'content': """CSAT scores stopped generating since last week."""
    },
    {
        'id': 'kb13',
        'title': "Slow email load",
        'tags': ['performance'],
        'content': """Opening emails takes 10–12 seconds."""
    },
    {
        'id': 'kb14',
        'title': "Mobile app crash",
        'tags': ['mobile_bug'],
        'content': """Hiver app crashes whenever loading an attachment."""
    },
    {
        'id': 'kb15',
        'title': "SLA not applying",
        'tags': ['sla_issue'],
        'content': """Our SLA rules aren't applied to emails from our VIP customers."""
    },
    {
        'id': 'kb16',
        'title': "Tags not saved",
        'tags': ['tagging_issue'],
        'content': """Adding tags doesn't save unless we refresh."""
    },
    {
        'id': 'kb17',
        'title': "Automation duplication",
        'tags': ['automation_bug'],
        'content': """Our workflow to create tasks is creating duplicates again."""
    },
    {
        'id': 'kb18',
        'title': "Incorrect user assignments",
        'tags': ['assignment_bug'],
        'content': """Emails are being assigned to the wrong agent."""
    },
    {
        'id': 'kb19',
        'title': "IMAP sync failure",
        'tags': ['sync_issue'],
        'content': """IMAP sync halted unexpectedly."""
    },
    {
        'id': 'kb20',
        'title': "Feature request: unified analytics",
        'tags': ['feature_request'],
        'content': """We want analytics from multiple teams in one dashboard."""
    },

    # ---------------------------- CUST C ----------------------------
    {
        'id': 'kb21',
        'title': "Mail merge stuck",
        'tags': ['mail_merge_issue'],
        'content': """Mail merge gets stuck processing at 0%."""
    },
    {
        'id': 'kb22',
        'title': "Search not returning results",
        'tags': ['search_issue'],
        'content': """Searching for customer names yields no results."""
    },
    {
        'id': 'kb23',
        'title': "Deleted emails reappearing",
        'tags': ['sync_bug'],
        'content': """Emails deleted yesterday reappear today."""
    },
    {
        'id': 'kb24',
        'title': "Tables broken in composer",
        'tags': ['editor_bug'],
        'content': """Tables inserted in the editor lose formatting."""
    },
    {
        'id': 'kb25',
        'title': "Attachments corrupted",
        'tags': ['attachment_issue'],
        'content': """Downloaded attachments appear corrupted."""
    },
    {
        'id': 'kb26',
        'title': "Auto-close not working",
        'tags': ['automation_issue'],
        'content': """Emails older than SLA should auto-close but remain open."""
    },
    {
        'id': 'kb27',
        'title': "CSAT survey not sent",
        'tags': ['csat_issue'],
        'content': """Customers aren’t receiving CSAT surveys."""
    },
    {
        'id': 'kb28',
        'title': "UI freeze",
        'tags': ['ui_performance'],
        'content': """UI freezes when scrolling fast."""
    },
    {
        'id': 'kb29',
        'title': "Delay in notifications",
        'tags': ['notification_delay'],
        'content': """Users get notifications 5–7 minutes late."""
    },
    {
        'id': 'kb30',
        'title': "Dark mode request",
        'tags': ['feature_request'],
        'content': """Dark mode would help our night shift team."""
    },

    # ---------------------------- CUST D ----------------------------
    {
        'id': 'kb31',
        'title': "Archived emails missing",
        'tags': ['analytics_bug'],
        'content': """Archived emails do not show up in Analytics."""
    },
    {
        'id': 'kb32',
        'title': "Kanban view glitch",
        'tags': ['ui_bug'],
        'content': """Cards overlap in Kanban mode."""
    },
    {
        'id': 'kb33',
        'title': "Unable to add user",
        'tags': ['user_management'],
        'content': """Error: 'Authorization missing' when adding a new agent."""
    },
    {
        'id': 'kb34',
        'title': "Forwarding fails",
        'tags': ['forwarding_issue'],
        'content': """Forwarding an email gives a server timeout."""
    },
    {
        'id': 'kb35',
        'title': "Signature duplication",
        'tags': ['signature_bug'],
        'content': """Our signatures duplicate twice when replying."""
    },
    {
        'id': 'kb36',
        'title': "Custom fields lost",
        'tags': ['ui_state_bug'],
        'content': """Custom fields disappear after switching tabs."""
    },
    {
        'id': 'kb37',
        'title': "Report export incorrect",
        'tags': ['analytics_accuracy'],
        'content': """SLAs look incorrect in exported reports."""
    },
    {
        'id': 'kb38',
        'title': "Smart suggestions irrelevant",
        'tags': ['suggestion_accuracy'],
        'content': """Smart suggestions propose the wrong KB articles."""
    },
    {
        'id': 'kb39',
        'title': "Confetti animation stuck",
        'tags': ['ui_bug'],
        'content': """The confetti animation plays repeatedly after resolving."""
    },
    {
        'id': 'kb40',
        'title': "Need API documentation",
        'tags': ['information_request'],
        'content': """We want to build an integration; need updated API docs."""
    },

    # ---------------------------- CUST E ----------------------------
    {
        'id': 'kb41',
        'title': "Email lag",
        'tags': ['sync_delay'],
        'content': """Email timestamps lag by 3–5 minutes."""
    },
    {
        'id': 'kb42',
        'title': "Assignee reset",
        'tags': ['assignment_issue'],
        'content': """Emails revert to unassigned randomly."""
    },
    {
        'id': 'kb43',
        'title': "Tag creation blocked",
        'tags': ['admin_ui_bug'],
        'content': """Cannot create new tags in admin panel."""
    },
    {
        'id': 'kb44',
        'title': "Workflow errors",
        'tags': ['workflow_bug'],
        'content': """Testing workflows shows red error banner."""
    },
    {
        'id': 'kb45',
        'title': "Draft lost",
        'tags': ['draft_issue'],
        'content': """Draft disappears when switching tabs."""
    },
    {
        'id': 'kb46',
        'title': "CSAT report mismatch",
        'tags': ['analytics_issue'],
        'content': """CSAT count in dashboard doesn’t match total responses."""
    },
    {
        'id': 'kb47',
        'title': "Attachment preview broken",
        'tags': ['attachment_preview_bug'],
        'content': """Preview pane fails to load PDFs."""
    },
    {
        'id': 'kb48',
        'title': "Auto-assign slow",
        'tags': ['automation_delay'],
        'content': """Incoming emails remain unassigned for up to 2 minutes."""
    },
    {
        'id': 'kb49',
        'title': "Mobile push notifications not received",
        'tags': ['mobile_notification_issue'],
        'content': """Mobile users don’t receive push notifications."""
    },
    {
        'id': 'kb50',
        'title': "Request: custom priority field",
        'tags': ['feature_request'],
        'content': """We want a custom priority field for tickets."""
    },

    # ---------------------------- CUST F ----------------------------
    {
        'id': 'kb51',
        'title': "Email duplication",
        'tags': ['duplication_bug'],
        'content': """Some incoming emails appear twice."""
    },
    {
        'id': 'kb52',
        'title': "BCC not recorded",
        'tags': ['logging_issue'],
        'content': """BCC participants do not show up in activity logs."""
    },
    {
        'id': 'kb53',
        'title': "Random sign-outs",
        'tags': ['session_issue'],
        'content': """Users get signed out every 30 minutes."""
    },
    {
        'id': 'kb54',
        'title': "Composer slow",
        'tags': ['editor_performance'],
        'content': """Typing in the composer is extremely slow."""
    },
    {
        'id': 'kb55',
        'title': "Keyboard shortcuts broken",
        'tags': ['shortcut_bug'],
        'content': """Shortcuts like 'R' to reply aren’t working."""
    },
    {
        'id': 'kb56',
        'title': "Global search frozen",
        'tags': ['search_issue'],
        'content': """Global search gets stuck loading."""
    },
    {
        'id': 'kb57',
        'title': "Rules not saving",
        'tags': ['workflow_bug'],
        'content': """Workflow rules don't save after clicking Submit."""
    },
    {
        'id': 'kb58',
        'title': "Analytics sync delay",
        'tags': ['analytics_latency'],
        'content': """Stats update only once per hour."""
    },
    {
        'id': 'kb59',
        'title': "Outbox stuck",
        'tags': ['send_issue'],
        'content': """Emails remain in the outbox indefinitely."""
    },
    {
        'id': 'kb60',
        'title': "Need help setting SLA",
        'tags': ['setup_help'],
        'content': """Not sure how to set SLA targets for different teams."""
    }
]

# Save all KB articles as JSON
for a in kb_articles:
    path = os.path.join(kb_dir, a['id'] + '.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(a, f, indent=2)
print('Saved', len(kb_articles), 'articles to', kb_dir)

Saved 60 articles to /mnt/data/hiver_kb_articles


## 2. Load KB articles and inspect

In [23]:
import os, json
kb_dir = '/mnt/data/hiver_kb_articles'
docs = []
for fname in sorted(os.listdir(kb_dir)):
    if fname.endswith('.json'):
        with open(os.path.join(kb_dir, fname), 'r', encoding='utf-8') as f:
            a = json.load(f)
            docs.append({'id': a['id'], 'title': a['title'], 'text': a['title'] + '\n\n' + a['content']})
len(docs), docs[0]

(60,
 {'id': 'kb1',
  'title': 'Unable to access shared mailbox',
  'text': 'Unable to access shared mailbox\n\nI am getting a permission denied message when trying to access our shared mailbox.'})

## 3. Create embeddings and FAISS index (sentence-transformers)

In [24]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

model_name = 'all-MiniLM-L6-v2'  # fast and small; swap in 'all-mpnet-base-v2' for better quality
embed_model = SentenceTransformer(model_name)

texts = [d['text'] for d in docs]
embs = embed_model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)

d = embs.shape[1]
index = faiss.IndexFlatIP(d)  # inner product on normalized vectors == cosine similarity
index.add(embs)
print('Built FAISS index: vectors=', index.ntotal, 'dim=', d)

Built FAISS index: vectors= 60 dim= 384
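
The comment on `IndexFlatIP` relies on the fact that, for unit-length vectors, the inner product equals cosine similarity. A minimal NumPy check of that identity is sketched here; the vectors are made up for illustration and are not the notebook's embeddings, and the helper names are hypothetical:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative vectors only; not taken from the notebook's embeddings.
raw = np.array([[3.0, 4.0], [1.0, 0.0]], dtype=np.float32)
unit = l2_normalize(raw)

inner_product = float(unit[0] @ unit[1])    # the quantity an inner-product index scores
cosine = cosine_similarity(raw[0], raw[1])  # cosine on the raw vectors
print(round(inner_product, 3), round(cosine, 3))
```

Both printed values agree, which is why passing `normalize_embeddings=True` to `encode` lets an inner-product index behave like a cosine-similarity index.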


## 4. Retriever function

In [25]:
import numpy as np

def retrieve(query, k=3):
    q_emb = embed_model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    D, I = index.search(q_emb, k)
    results = []
    for score, idx in zip(D[0], I[0]):
        if idx < 0:
            continue
        results.append({'doc_id': docs[idx]['id'], 'title': docs[idx]['title'], 'text': docs[idx]['text'], 'score': float(score)})
    return results

# quick sanity test
retrieve('How do I configure automations in Hiver?', k=3)

[{'doc_id': 'kb3',
  'title': 'Email threads not merging',
  'text': 'Email threads not merging\n\nTwo replies from the same customer created different threads in Hiver.',
  'score': 0.3861601948738098},
 {'doc_id': 'kb6',
  'title': 'Automation delay',
  'text': 'Automation delay\n\nOur automation to mark emails as pending is taking 2–3 minutes to run.',
  'score': 0.31393590569496155},
 {'doc_id': 'kb14',
  'title': 'Mobile app crash',
  'text': 'Mobile app crash\n\nHiver app crashes whenever loading an attachment.',
  'score': 0.308933824300766}]
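
The sanity test shows that the top hit ("Email threads not merging", 0.386) is unrelated to the query, so it may help to drop weak matches before they reach the generator. A minimal sketch, assuming a hypothetical `filter_by_score` helper and a `min_score` cutoff of 0.5 (an illustrative value, not one tuned on this KB), applied to the kind of list that `retrieve` returns:

```python
from typing import Dict, List

def filter_by_score(results: List[Dict], min_score: float = 0.5) -> List[Dict]:
    """Keep only hits whose cosine score reaches min_score (hypothetical cutoff)."""
    return [r for r in results if r['score'] >= min_score]

# Hits copied from the sanity test output (text fields omitted for brevity).
sample_hits = [
    {'doc_id': 'kb3', 'title': 'Email threads not merging', 'score': 0.3861601948738098},
    {'doc_id': 'kb6', 'title': 'Automation delay', 'score': 0.31393590569496155},
    {'doc_id': 'kb14', 'title': 'Mobile app crash', 'score': 0.308933824300766},
]

kept = filter_by_score(sample_hits, min_score=0.5)
print(len(kept))  # 0: none of the three hits clears the cutoff
```

With a cutoff like this, a query that the KB does not cover would yield an empty list rather than three loosely related articles. The right threshold depends on the embedding model and the data, so it would need to be calibrated on real queries.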

In [26]:
from google.colab import userdata
api_key = userdata.get("GEMINI_API_KEY")

## 5. Simple RAG answer generator and confidence scoring

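This cell synthesizes a short answer from the retrieved docs. If `GEMINI_API_KEY` is available (Colab Secrets or an environment variable), Gemini generates the answer; otherwise a simple extractive fallback joins text from the retrieved articles. The confidence is the average retrieval score mapped from [-1, 1] to [0, 1].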

In [27]:
from textwrap import dedent
import os, json

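# configure Gemini only when the library imports and an API key is available
gemini_available = False
try:
    import google.generativeai as genai
    # reuse the key loaded from Colab Secrets, else fall back to the environment
    api_key = globals().get('api_key') or os.environ.get('GEMINI_API_KEY')
    if api_key:
        genai.configure(api_key=api_key)
        gemini_available = True
except Exception:
    gemini_available = False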

def generate_answer_rag(query, k=3, use_gemini=True, model_name='gemini-2.5-flash'):
    retrieved = retrieve(query, k=k)
    if len(retrieved)==0:
        return {'answer':'No relevant documents found.', 'confidence':0.0, 'retrieved':[]}
    avg_score = sum(r['score'] for r in retrieved)/len(retrieved)
    conf = float((avg_score + 1)/2)  # map cosine [-1,1] -> [0,1]
    context = '\n\n---\n\n'.join([f"Title: {r['title']}\n{r['text']}" for r in retrieved])
    prompt = dedent(f"""
You are an assistant that MUST answer the user's question using ONLY the provided knowledge base excerpts. If the answer is not present, say 'I am not sure based on the KB provided.'\n\nQuestion: {query}\n\nContext:\n{context}\n\nProvide a concise factual answer and list the source titles used.
""")
    if use_gemini and gemini_available:
        try:
            model = genai.GenerativeModel(model_name)
            resp = model.generate_content(prompt)
            out = resp.text.strip()
            return {'answer': out, 'confidence': conf, 'retrieved': retrieved}
        except Exception as e:
            return {'answer': f'LLM call failed: {e}', 'confidence': conf, 'retrieved': retrieved}
    else:
        # simple synthesis: take each doc's body (the text after the title line) and join
        paras = []
        for r in retrieved:
            p = r['text'].split('\n\n', 1)[-1].strip()
            paras.append(p)
        concise = ' '.join(paras)
        final = f"{concise}\n\nSources: {', '.join([r['title'] for r in retrieved])}"
        return {'answer': final, 'confidence': conf, 'retrieved': retrieved}

## 6. Run required queries and show retrieved articles + answers

In [28]:
queries = [
    ('How do I configure automations in Hiver?', 'configure_automations'),
    ('Why is CSAT not appearing?', 'why_csat_missing')
]

results = {}
for q, key in queries:
    res = generate_answer_rag(q, k=4, use_gemini=True)
    results[key] = res

for key, res in results.items():
    print('\n' + '='*60)
    print('Query key:', key)
    print('Answer:\n', res['answer'])
    print('\nConfidence:', round(res['confidence'],3))
    print('\nRetrieved articles:')
    for r in res['retrieved']:
        print('-', r['title'], '(score=', round(r['score'],3),')')


Query key: configure_automations
Answer:
 I am not sure based on the KB provided.

Confidence: 0.664

Retrieved articles:
- Email threads not merging (score= 0.386 )
- Automation delay (score= 0.314 )
- Mobile app crash (score= 0.309 )
- Automation duplication (score= 0.307 )

Query key: why_csat_missing
Answer:
 Customers are not receiving CSAT surveys.

Source titles:
CSAT survey not sent

Confidence: 0.737

Retrieved articles:
- CSAT not calculated (score= 0.601 )
- CSAT survey not sent (score= 0.53 )
- CSAT report mismatch (score= 0.505 )
- BCC not recorded (score= 0.261 )
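
The confidence values can be reproduced by hand from the printed scores: `generate_answer_rag` averages the cosine scores and maps the result from [-1, 1] to [0, 1]. The following is a short worked check using the rounded scores printed above. The `top_score_confidence` helper is a hypothetical alternative added only for comparison and is not part of the notebook:

```python
def mean_mapped_confidence(scores):
    """Notebook formula: average cosine score mapped from [-1, 1] to [0, 1]."""
    avg = sum(scores) / len(scores)
    return (avg + 1) / 2

def top_score_confidence(scores):
    """Hypothetical alternative: use the best raw score, clipped to [0, 1]."""
    return max(0.0, min(1.0, max(scores)))

automation_scores = [0.386, 0.314, 0.309, 0.307]  # configure_automations
csat_scores = [0.601, 0.53, 0.505, 0.261]         # why_csat_missing

for name, scores in [('configure_automations', automation_scores),
                     ('why_csat_missing', csat_scores)]:
    print(name,
          round(mean_mapped_confidence(scores), 2),
          round(top_score_confidence(scores), 2))
```

Because every score for these queries is non-negative, the mapped average cannot fall below 0.5, which is why the "I am not sure" answer still reports a confidence of 0.664. A score based on the raw top hit separates the two queries more clearly, though whether it is a better signal would need checking on more queries.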


## 7. Improvements & Failure Case

### 5 Ways to improve retrieval
1. Use a stronger embedding model (e.g., all-mpnet-base-v2) for semantically richer embeddings.
2. Chunk long documents into passages and index passages instead of whole docs to increase precision.
3. Use a cross-encoder re-ranker (e.g., SBERT cross-encoder) to re-score top-k results.
4. Add query expansion / paraphrase generation to improve recall.
5. Maintain domain-specific stopwords and normalize technical terms.
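
Item 2 can be sketched without any extra libraries. The following is a minimal word-window chunker, assuming hypothetical `chunk_text` and `chunk_docs` helpers with `chunk_size` and `overlap` parameters measured in words. The synthetic KB articles here are single sentences, so this would only matter for longer documents:

```python
from typing import Dict, List

def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into overlapping word windows."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError('need chunk_size > 0 and 0 <= overlap < chunk_size')
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

def chunk_docs(docs: List[Dict], **kwargs) -> List[Dict]:
    """Expand each doc into passage records that remember their parent id."""
    passages = []
    for doc in docs:
        for i, chunk in enumerate(chunk_text(doc['text'], **kwargs)):
            passages.append({'id': f"{doc['id']}#{i}", 'parent_id': doc['id'],
                             'title': doc['title'], 'text': chunk})
    return passages

# Illustrative document; the word count is chosen to show the overlap.
demo = [{'id': 'kbX', 'title': 'Demo', 'text': ' '.join(f'w{i}' for i in range(12))}]
passages = chunk_docs(demo, chunk_size=5, overlap=2)
for p in passages:
    print(p['id'], '->', p['text'])
```

Each passage keeps a `parent_id`, so hits on individual passages can be mapped back to the source article when building the answer context.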

### Example failure case & debugging steps
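**Failure case:** for "How do I configure automations in Hiver?" the top hit is the unrelated "Email threads not merging" (score 0.386), and none of the retrieved articles explains configuration, so the answer is "I am not sure based on the KB provided." even though the reported confidence is 0.664.

**Debugging:**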
- Inspect embedding similarity scores for query vs docs.
- Increase k and inspect candidates; then re-rank.
- Verify that the KB contains automation how-to content (the synthetic articles are issue reports, not setup guides); if not, augment the KB.
- Try a stronger embedding model and re-index.
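
The third debugging step (checking whether the KB covers the query's vocabulary) can be approximated with a plain keyword scan before touching the embedding model. A minimal sketch, assuming a hypothetical `keyword_coverage` helper and a small hand-picked stopword list; the sample documents are copied from the KB in section 1:

```python
import re
from typing import Dict, List

STOPWORDS = {'how', 'do', 'i', 'in', 'the', 'a', 'is', 'why', 'to'}  # illustrative only

def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r'[a-z0-9]+', text.lower())

def keyword_coverage(query: str, docs: List[Dict]) -> Dict[str, List[str]]:
    """Map each non-stopword query term to the ids of docs whose text contains it."""
    coverage = {}
    for term in tokenize(query):
        if term in STOPWORDS:
            continue
        coverage[term] = [d['id'] for d in docs if term in tokenize(d['text'])]
    return coverage

sample_docs = [
    {'id': 'kb3', 'text': 'Email threads not merging\n\nTwo replies from the same customer created different threads in Hiver.'},
    {'id': 'kb6', 'text': 'Automation delay\n\nOur automation to mark emails as pending is taking 2–3 minutes to run.'},
    {'id': 'kb17', 'text': 'Automation duplication\n\nOur workflow to create tasks is creating duplicates again.'},
]

report = keyword_coverage('How do I configure automations in Hiver?', sample_docs)
for term, ids in report.items():
    print(term, ids)
```

In this sample, `configure` matches nothing, `automations` misses the singular "automation" used in the articles, and only `hiver` matches (in kb3). Embedding search does not require exact word matches, so this is only a rough signal, but it hints at why kb3 ranks first and at the kind of gap that term normalization or KB augmentation would address.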


## 8. Artifacts & Notes
- KB files saved to `/mnt/data/hiver_kb_articles/`.
- Assignment PDF referenced at `/mnt/data/Hiver – AI Intern Evaluation Assignment.pdf`.
- To enable Gemini generation, add your key to Colab secrets as `GEMINI_API_KEY` or set it in environment variables.