# Jupiter Money FAQ Pipeline Notebook

This notebook implements the end-to-end pipeline for building an interactive FAQ assistant using an open-source LLM.

## 1. Setup & Imports
Install required packages and import dependencies.

In [9]:
# Install (run once):
# !pip install playwright nest_asyncio sentence-transformers faiss-cpu langdetect googletrans==4.0.0-rc1 python-dotenv requests
# !playwright install

# Standard library
import asyncio
import html
import json
import os
import re
import time

# Third-party
import faiss
import nest_asyncio
import numpy as np
import requests
from dotenv import load_dotenv, find_dotenv
from googletrans import Translator
from langdetect import detect
from playwright.async_api import async_playwright
from sentence_transformers import SentenceTransformer

nest_asyncio.apply()  # allow nested event loops inside Jupyter


dotenv_path = find_dotenv()
load_dotenv(dotenv_path)

# ── Configure Your Hosted Open-Source LLM API ────────────────────────────────
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
MODEL_NAME = "mistralai/Mixtral-8x7B-Instruct-v0.1"
MULTILINGUAL_MODEL = "paraphrase-multilingual-MiniLM-L12-v2" 

translator = Translator()


## 2. Data Cleaning & Preparation
Load raw JSON, clean text, and prepare question/metadata lists.

In [18]:
RAW_JSON       = r"D:\srisurya\Ml_projects\Faq chat bot\faq_data_raw.json"
INDEX_FILE     = r"D:\srisurya\Ml_projects\Faq chat bot\faq.index"
QUESTIONS_FILE = r"D:\srisurya\Ml_projects\Faq chat bot\faq_questions.json"
METADATA_FILE  = r"D:\srisurya\Ml_projects\Faq chat bot\faq_metadata.json"
EMBEDDINGS_FILE = r"D:\srisurya\Ml_projects\Faq chat bot\faq_embeddings.npy"

### 🧹 Data Cleaning and Preparation

This section defines two key functions to clean and structure the raw FAQ dataset:

---

#### **`clean_text(text: str) -> str`**

Cleans raw text data by:
- Decoding HTML entities (e.g., `&amp;` → `&`)
- Replacing multiple spaces/newlines with a single space
- Removing mentions (e.g., `@username`)
- Removing emoji shortcodes (e.g., `:smile:`)
- Removing URLs
- Stripping leading/trailing non-word characters

> Returns a simplified, cleaned version of the input string.
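
The steps listed above can be traced one at a time on a sample string. This is a minimal sketch: the `STEPS` table, `trace_cleaning`, and the sample text are illustrative names and data, not part of the notebook or the dataset.

```python
import html
import re

# Cleaning steps in the order listed above; each entry is (label, function).
STEPS = [
    ("decode HTML entities", html.unescape),
    ("collapse whitespace", lambda s: re.sub(r"\s+", " ", s)),
    ("drop @mentions", lambda s: re.sub(r"@[\w\-]+", "", s)),
    ("drop :shortcodes:", lambda s: re.sub(r":[\w\-]+:", "", s)),
    ("drop URLs", lambda s: re.sub(r"https?://\S+", "", s)),
    ("trim non-word edges", lambda s: re.sub(r"^\W+|\W+$", "", s)),
]

def trace_cleaning(text: str) -> str:
    """Apply each step in turn and print the intermediate string."""
    for label, step in STEPS:
        text = step(text)
        print(f"{label:<22} -> {text!r}")
    return text.strip()

# Made-up forum text, not taken from the dataset.
sample = "@jupiter_team How do I close a Pot &amp; withdraw?\n\n:smile: https://example.com/faq"
cleaned = trace_cleaning(sample)
```

Note that the final trim also removes the trailing `?`, because it is a non-word character.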

---

#### **`load_and_prepare_data()`**

Loads the raw scraped JSON (`RAW_JSON`) and prepares it for embedding and retrieval:
- Cleans the question text and skips duplicates
- Cleans each post (question + replies)
- Collects structured metadata for each topic, including:
  - Cleaned title
  - URL
  - Tags
  - Cleaned posts list (`text`, `user`, etc.)

> Prints the number of unique, cleaned questions loaded.  
> Returns two lists:
> - `questions`: List of cleaned question strings  
> - `metadatas`: Corresponding metadata dictionaries

In [11]:
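def clean_text(text: str) -> str:
    """Normalize raw forum text before embedding and display."""
    text = html.unescape(text)                 # &amp; -> &
    text = re.sub(r'@[\w\-]+', '', text)       # @mentions
    text = re.sub(r':[\w\-]+:', '', text)      # :shortcode: emoji
    text = re.sub(r'https?://\S+', '', text)   # URLs
    # Collapse whitespace after the removals so they leave no double spaces.
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'^\W+|\W+$', '', text)      # trim non-word characters at both ends
    return text.strip()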

def load_and_prepare_data():
    with open(RAW_JSON, "r", encoding="utf-8") as f:
        data = json.load(f)

    questions, metadatas = [], []
    seen = set()
    for topic in data:
        q = clean_text(topic.get("text", ""))
        if not q or q in seen:
            continue
        seen.add(q)

        # clean posts
        cleaned_posts = []
        for p in topic.get("posts", []):
            cp = p.copy()
            cp["text"] = clean_text(cp.get("text", ""))
            cleaned_posts.append(cp)

        questions.append(q)
        metadatas.append({
            "tags":  topic.get("tags", []),
            "title": clean_text(topic.get("title", "")),
            "url":   topic.get("url", ""),
            "posts": cleaned_posts,
        })

    print(f"Loaded {len(questions)} unique questions.")
    return questions, metadatas
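
The loader above reads the keys `text`, `title`, `url`, `tags`, and `posts` from each topic. A minimal sketch of the de-duplication on an in-memory list follows; the sample topics are invented, and `prepare` uses a simplified whitespace-only cleaner as a stand-in for `clean_text`.

```python
def prepare(topics, clean=lambda s: " ".join(s.split())):
    """Mirror the loader's logic on an in-memory list (hypothetical helper)."""
    questions, metadatas, seen = [], [], set()
    for topic in topics:
        q = clean(topic.get("text", ""))
        if not q or q in seen:  # skip empty and duplicate questions
            continue
        seen.add(q)
        posts = [{**p, "text": clean(p.get("text", ""))} for p in topic.get("posts", [])]
        questions.append(q)
        metadatas.append({
            "tags": topic.get("tags", []),
            "title": clean(topic.get("title", "")),
            "url": topic.get("url", ""),
            "posts": posts,
        })
    return questions, metadatas

# Invented topics that follow the assumed shape of the raw JSON.
raw_topics = [
    {"text": "How do I  close a Pot?", "title": "Closing Pots",
     "url": "https://example.com/t/1", "tags": ["pots"],
     "posts": [{"user": {"name": "asha"}, "text": "Open the Pots tab\nand choose withdraw."}]},
    {"text": "How do I close a Pot?", "title": "Duplicate",
     "url": "https://example.com/t/2", "tags": [], "posts": []},
    {"text": "   ", "title": "Empty", "url": "https://example.com/t/3", "tags": [], "posts": []},
]
questions, metadatas = prepare(raw_topics)
```

Only the first topic survives: the second cleans to the same question string and the third cleans to an empty string.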


### 🌐 Language Normalization

#### `ensure_english(text: str) -> str`

This function ensures that input text is in **English**, translating it if necessary.

---

**Functionality:**
- Detects the language of the input string using `langdetect`.
- If the detected language is not English (`'en'`), it uses **Google Translate** (`googletrans`) to convert it to English.
- If detection fails or the text is already in English, it returns the original text unchanged.

---

**Purpose:**
To standardize multilingual user content by translating all input text into English before further processing (e.g., indexing, embedding, or retrieval).

---

**Notes:**
- The function gracefully handles errors—if detection or translation fails, it returns the input text as-is.
- Requires internet access for translation via Google Translate.
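
The fallback behaviour described in these notes can be exercised offline by injecting the detector and translator. In this sketch, `ensure_english_with`, `fake_detect`, and `fake_translate` are hypothetical stand-ins, not `langdetect` or `googletrans` calls.

```python
def ensure_english_with(text, detect_lang, translate):
    """Translate only when detection succeeds and reports a non-English language."""
    try:
        if detect_lang(text) != "en":
            return translate(text)
    except Exception:
        # Detection or translation failed: fall through and return the input.
        pass
    return text

def fake_detect(text):
    """Toy detector: raises on blank input, flags one hard-coded prefix as Hindi."""
    if not text.strip():
        raise ValueError("no features in text")
    return "hi" if text.startswith("mera") else "en"

def fake_translate(text):
    """Toy translator that only marks the text as translated."""
    return "[translated] " + text

english = ensure_english_with("How do I add money?", fake_detect, fake_translate)
foreign = ensure_english_with("mera paisa kahan hai", fake_detect, fake_translate)
blank = ensure_english_with("   ", fake_detect, fake_translate)
```

Passing the two callables in keeps the error path testable without network access.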

---



In [12]:
def ensure_english(text: str) -> str:
    try:
        lang = detect(text)
        if lang != 'en':
            return translator.translate(text, dest='en').text
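    except Exception:
        # Detection or translation failed (e.g. empty text, network error);
        # fall through and return the input unchanged.
        pass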
    return text

### 🧠 Building or Loading FAISS Index & Embeddings

#### `build_or_load_index_and_embeddings(questions, metadatas)`

This function prepares the semantic search backend by either:
- **Loading** a precomputed FAISS index and embeddings from disk (if available), or
- **Building** a new index from scratch using the provided questions.

---

####  Function Responsibilities:
- Initializes a `SentenceTransformer` model (`MULTILINGUAL_MODEL`).
- Checks if existing files (`INDEX_FILE`, `EMBEDDINGS_FILE`, `QUESTIONS_FILE`, `METADATA_FILE`) exist:
  - If yes, loads and returns saved index, embeddings, questions, and metadata.
  - If no, computes new embeddings using the model:
    - Normalizes them for cosine similarity.
    - Builds a **FAISS index** using inner product (IP) search.
    - Saves index, embeddings, questions, and metadata to disk for reuse.
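
Normalizing before an inner-product search works because the inner product of two unit-length vectors equals their cosine similarity. A small pure-Python check of that identity, using toy 2-D vectors rather than real embeddings:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
ip_after_norm = dot(l2_normalize(a), l2_normalize(b))
print(ip_after_norm, cosine(a, b))
```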

---

####  Returns:
- `idx`: FAISS index (`IndexFlatIP`)
- `embs`: Numpy array of question embeddings
- `questions`: List of question texts
- `metadatas`: Associated metadata for each question
- `model_multi`: The multilingual SentenceTransformer model

---

####  Caching:
This function allows your notebook to resume quickly between runs by persisting data to disk—saving significant time for large datasets.
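
The build-or-load pattern can be sketched with a single JSON file. `build_or_load` and `expensive_build` are illustrative names, and a temporary directory replaces the notebook's cache paths.

```python
import json
import os
import tempfile

def build_or_load(cache_path, build):
    """Return (value, source): load from cache if present, else build and save."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f), "loaded"
    value = build()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(value, f, indent=2)
    return value, "built"

calls = []

def expensive_build():
    """Stand-in for the embedding step; records how often it runs."""
    calls.append(1)
    return {"questions": ["How do I close a Pot?"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.json")
    first = build_or_load(path, expensive_build)   # builds and writes the file
    second = build_or_load(path, expensive_build)  # reads the file back
```

The second call never invokes the builder, which is where the time is saved between runs.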

---

#### Tip:
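You can change the model by editing the `MULTILINGUAL_MODEL` constant, e.g., to `"all-MiniLM-L6-v2"` for English-only datasets. Delete the cached index and embeddings files afterwards, since vectors produced by different models are not comparable.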



In [13]:

def build_or_load_index_and_embeddings(questions, metadatas):
    model_multi = SentenceTransformer(MULTILINGUAL_MODEL)
    cache_files = (INDEX_FILE, EMBEDDINGS_FILE, QUESTIONS_FILE, METADATA_FILE)

    # Reuse the cache only when every artifact is present.
    if all(os.path.exists(p) for p in cache_files):
        idx = faiss.read_index(INDEX_FILE)
        embs = np.load(EMBEDDINGS_FILE)
        with open(QUESTIONS_FILE, "r", encoding="utf-8") as f:
            qs = json.load(f)
        with open(METADATA_FILE, "r", encoding="utf-8") as f:
            ms = json.load(f)
        print("Loaded existing index and embeddings.")
        return idx, embs, qs, ms, model_multi

    # Otherwise embed the questions and build an exact inner-product index.
    embs = model_multi.encode(questions, convert_to_numpy=True, batch_size=32)
    faiss.normalize_L2(embs)  # unit vectors: inner product == cosine similarity
    idx = faiss.IndexFlatIP(embs.shape[1])
    idx.add(embs)

    faiss.write_index(idx, INDEX_FILE)
    np.save(EMBEDDINGS_FILE, embs)
    with open(QUESTIONS_FILE, "w", encoding="utf-8") as f:
        json.dump(questions, f, indent=2)
    with open(METADATA_FILE, "w", encoding="utf-8") as f:
        json.dump(metadatas, f, indent=2)
    print(f"Built index with {idx.ntotal} vectors.")
    return idx, embs, questions, metadatas, model_multi

###  Semantic Similarity Suggestion

#### `suggest_related(query, questions, idx, model, embeddings, top_k=5)`

This function retrieves past questions that are semantically similar to the user’s current query, using FAISS for efficient nearest neighbor search.

---

#### 🧠 Function Workflow:

1. **Track the query**  
   Appends the incoming query to a `query_history` list for later reference or analytics.

2. **Encode the query**  
   Transforms the query into a vector using a pretrained `SentenceTransformer` model (`model`), and normalizes it for cosine similarity search.

3. **Search the FAISS index**  
   Runs an inner-product (IP) search on the full FAISS index (`idx`) of all past embeddings to find the top-`k` most similar questions.

4. **Return suggestions**  
   Maps the result indices back to original question strings.
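
The workflow above amounts to an exact nearest-neighbour search over normalized vectors. The following pure-Python sketch imitates it with toy 2-D "embeddings"; `top_k` is a hypothetical helper and does not use FAISS.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query_vec, corpus_vecs, k):
    """Return up to k (score, index) pairs, best inner product first."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), i)
        for i, v in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Invented questions and vectors, for illustration only.
questions = [
    "How do I close a Pot?",
    "How do I order a debit card?",
    "How do I withdraw from a Pot?",
]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

hits = top_k([1.0, 0.05], vectors, k=2)
suggestions = [questions[i] for _, i in hits]
print(suggestions)
```

Unlike this sketch, which simply returns fewer results, a FAISS search that asks for more neighbours than the index holds pads the returned ids with `-1`, so those entries need filtering before they are used as list indices.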

---

####  Parameters:
- `query`: New input string for which related past questions are to be found.
- `questions`: List of all indexed questions.
- `idx`: FAISS index (built with `IndexFlatIP` and normalized embeddings).
- `model`: SentenceTransformer used to encode the query.
- `embeddings`: Numpy array of all embedded questions.
- `top_k`: Number of suggestions to return (default = 5).

---

####  Returns:
- A list of `top_k` most semantically similar past questions from the corpus.


In [14]:
query_history = []

def suggest_related(
    query: str,
    questions: list,
    idx: faiss.IndexFlatIP,
    model: SentenceTransformer,
    embeddings: np.ndarray,
    top_k: int = 5
) -> list:
    """
    Suggest past questions semantically similar to `query`.
    Uses the full FAISS index `idx` over `embeddings`.
    """
    query_history.append(query)
    q_emb = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    _, I = idx.search(q_emb, top_k)
    # FAISS pads the ids with -1 when the index holds fewer than top_k vectors.
    return [questions[i] for i in I[0] if i >= 0]

### 🏷️ Tag-Based Filtering with LLM-Aided Classification

This section implements intelligent retrieval based on tag prediction using an open-source LLM and tag-filtered FAISS indexing.

---

#### `classify_tags_with_open_llm(query: str, possible_tags: list) -> list`

Uses an **Open LLM API (Together.xyz)** to classify a user query into 2–3 relevant tags.

**How it works:**
- Sends a prompt to the LLM with a list of possible tags and a user query.
- The model returns a comma-separated list of predicted tags.
- If the query is off-topic or ambiguous, it returns `"none"`.

**Returns:**
- A list of tags (max 3) or an empty list (`[]`) if none apply.

> 📌 This enables dynamic filtering of FAQs based on content themes.
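
Model replies are not guaranteed to follow the requested format, so it may help to validate them against the known tag list. A possible parser is sketched below; `parse_tag_response` and its `max_tags` parameter are hypothetical, and the notebook's own function does not filter against `possible_tags`.

```python
def parse_tag_response(content, possible_tags, max_tags=3):
    """Keep only known tags from a comma-separated reply, in order, without repeats."""
    allowed = {t.lower() for t in possible_tags}
    tags = []
    for part in content.strip().lower().split(","):
        tag = part.strip().strip("\"'. ")  # drop stray quotes and full stops
        if tag in allowed and tag not in tags:
            tags.append(tag)
    return tags[:max_tags]

known = ["pots", "savings-account", "transfer", "upi"]
clean_reply = parse_tag_response("Pots, savings-account, transfer", known)
none_reply = parse_tag_response('"none"', known)
noisy_reply = parse_tag_response("pots, upi, made-up-tag, transfer, savings-account", known)
```

A reply of `"none"` yields an empty list here simply because `none` is not a known tag.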

---

#### `retrieve_with_prefilter(...) -> list`

Performs **semantic search with tag filtering**. Only questions whose metadata shares at least one tag with the predicted tags are searched; when no tags are predicted, the search is limited to untagged questions.

**Inputs:**
- `query`: The user’s search string
- `predicted_tags`: Tags predicted by the LLM
- `questions`: List of all questions
- `metadatas`: Metadata per question (tags, title, posts, etc.)
- `model`: SentenceTransformer used for encoding
- `embeddings`: FAISS-compatible numpy embeddings
- `top_k`: Number of results to retrieve (default = 10)

---

**Workflow:**

1. **Filter candidate questions** using the predicted tags.
2. **Build a temporary FAISS index** using only the filtered subset.
3. **Encode the query** and perform semantic search on the sub-index.
4. **Return results** with question text, score, tags, title, URL, and full post content.
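
The index bookkeeping in steps 1 and 2 is easy to get wrong: positions returned by the sub-index refer to the filtered subset and must be mapped back to the full corpus. A small sketch with invented metadata, where `prefilter_ids` is a hypothetical helper mirroring the filtering rule:

```python
def prefilter_ids(metadatas, predicted_tags):
    """Corpus indices sharing a predicted tag, or untagged ones if none are predicted."""
    if predicted_tags:
        return [
            i for i, meta in enumerate(metadatas)
            if any(tag in meta["tags"] for tag in predicted_tags)
        ]
    return [i for i, meta in enumerate(metadatas) if not meta["tags"]]

# Invented metadata, tags only.
metadatas = [
    {"tags": ["pots"]},
    {"tags": ["upi"]},
    {"tags": []},
    {"tags": ["pots", "transfer"]},
]

allowed_ids = prefilter_ids(metadatas, ["pots"])
sub_hits = [1, 0]  # positions a search over the two-item subset might return
original_ids = [allowed_ids[i] for i in sub_hits]  # map back to corpus indices
```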


In [15]:
def classify_tags_with_open_llm(query: str, possible_tags: list) -> list:
    prompt = f"""
You are a helpful assistant that assigns relevant tags to user questions about Jupiter.money. If the question is unrelated or ambiguous, respond with "none".

Possible tags: {', '.join(possible_tags)}

Classify this question into the most relevant 2–3 tags. If it is unrelated, say "none".

Question:
"{query}"

Respond with a comma-separated list of tags or "none".
"""

    response = requests.post(
        "https://api.together.xyz/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {TOGETHER_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": MODEL_NAME,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.3,
            "max_tokens": 100,
        },
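        timeout=30,  # assumed limit; avoids hanging on a stalled request
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below

    content = response.json()["choices"][0]["message"]["content"]
    content = content.strip().lower()
    if "none" in content:
        return []
    # Keep at most three tags, matching the documented return value.
    return [tag.strip() for tag in content.split(",") if tag.strip()][:3]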

def retrieve_with_prefilter(
    query: str,
    predicted_tags: list,
    questions: list,
    metadatas: list,
    model: SentenceTransformer,
    embeddings: np.ndarray,
    top_k: int = 10
):
    
    if predicted_tags:
        allowed_ids = [
            i for i, meta in enumerate(metadatas)
            if any(tag in meta["tags"] for tag in predicted_tags)
        ]
    else:
        allowed_ids = [i for i, meta in enumerate(metadatas) if not meta["tags"]]

    if not allowed_ids:
        return []

  
    sub_embs = embeddings[allowed_ids]
    faiss.normalize_L2(sub_embs)
    dim = sub_embs.shape[1]
    sub_index = faiss.IndexFlatIP(dim)
    sub_index.add(sub_embs)


    q_emb = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)

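    # Never request more neighbours than the subset holds; FAISS would pad
    # the ids with -1, which would index allowed_ids from the end.
    k = min(top_k, len(allowed_ids))
    D, I = sub_index.search(q_emb, k)
    scores, idxs = D[0], I[0]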


    results = []
    for score, sub_i in zip(scores, idxs):
        orig_i = allowed_ids[sub_i]
        md = metadatas[orig_i]
        results.append({
            "question": questions[orig_i],
            "score":    float(score),
            "tags":     md["tags"],
            "title":    md["title"],
            "url":      md["url"],
            "posts":    md["posts"],
        })
    return results


### 🤖 Answer Generation with Open LLM

#### `generate_answer_with_open_llm(query: str, context_posts: list) -> str`

This function uses an **open-source LLM (via Together.xyz)** to generate a helpful, natural-language answer based on selected forum discussion posts.

---

#### 💡 Function Purpose:
To synthesize a concise and friendly response to the user's query using real community discussions as context.

---

#### 🧠 How it works:

1. **Context Building:**
   - Takes the first three posts from `context_posts` (thread order, not relevance-ranked).
   - Formats them as `"username: post text"` to simulate a forum thread.

2. **Prompt Construction:**
   - Uses a **system prompt** that guides the model to behave like a polite and clear Jupiter assistant.
   - Embeds the user’s question and the forum context into a **user prompt**.

3. **LLM Call:**
   - Sends the prompt to Together.xyz's chat API using the specified `MODEL_NAME`.
   - Uses a moderate temperature (`0.5`) to keep responses helpful and creative, but grounded.

4. **Fallback:**
   - If no context is available, returns a default message encouraging the user to rephrase or visit the community.
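
Steps 1 and 2 can be checked without calling the API by building the message list on its own. `build_messages` is a hypothetical helper, the system prompt is shortened, and the posts are invented.

```python
def build_messages(query, context_posts, max_posts=3):
    """Return chat messages that use the first `max_posts` posts as forum context."""
    context = "\n\n".join(
        f"{post['user']['name']}: {post['text']}"
        for post in context_posts[:max_posts]
    )
    system_prompt = (
        "You are a helpful assistant for Jupiter.money.\n"
        "Use the provided forum discussion to answer the user's question clearly and politely."
    )
    user_prompt = (
        f"User question: {query}\n\n"
        f"Forum context:\n{context}\n\n"
        "Answer the user based on the forum context:"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

# Four invented posts; only the first three should reach the prompt.
posts = [{"user": {"name": f"user{i}"}, "text": f"reply {i}"} for i in range(1, 5)]
messages = build_messages("How do I close a Pot?", posts)
print(messages[1]["content"])
```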

---

#### Parameters:
- `query`: The user’s input question.
- `context_posts`: A list of relevant forum posts with structure `{ "user": { "name": ... }, "text": ... }`.

#### Returns:
- A single string: the model-generated answer to the user’s question, grounded in the forum content.

---

> ⚠️ Requires `TOGETHER_API_KEY` and internet access. Make sure the Together API is accessible and the model you're using is available.


In [16]:
def generate_answer_with_open_llm(query: str, context_posts: list) -> str:
    if not context_posts:
        return ("I'm not sure about that yet. You can try rephrasing your question "
                "or visit the Jupiter Community for more help.")

    # Build context: top 3 posts
    context = "\n\n".join(
        f"{post['user']['name']}: {post['text']}"
        for post in context_posts[:3]
    )

    system_prompt = (
        "You are a helpful assistant for Jupiter.money.\n"
        "Use the provided forum discussion to answer the user's question clearly and politely.\n"
        "Rephrase responses in friendly, natural language.\n"
        "If you're not sure or the context is unrelated, say so gracefully."
    )
    user_prompt = (
        f"User question: {query}\n\n"
        f"Forum context:\n{context}\n\n"
        "Answer the user based on the forum context:"
    )

    resp = requests.post(
        "https://api.together.xyz/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {TOGETHER_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user",   "content": user_prompt}
            ],
            "temperature": 0.5,
            "max_tokens": 512,
        },
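        timeout=60,  # assumed limit; avoids hanging the loop on a stalled request
    )
    resp.raise_for_status()  # fail loudly on HTTP errors
    return resp.json()["choices"][0]["message"]["content"].strip()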

### 🧪 Interactive QA Loop with Retrieval + Generation

This section ties everything together into an interactive, end-to-end retrieval-augmented question-answering system for Jupiter.money forum data.

---

#### 🧱 Step-by-Step Breakdown:

1. **Data & Index Setup**
   - `load_and_prepare_data()` loads and cleans the questions and metadata.
   - `build_or_load_index_and_embeddings()` loads or creates a FAISS index and sentence embeddings.
   - `all_tags` is created by extracting all unique tags from the metadata.

2. **User Loop**
   Continuously prompts the user to ask a question:

   **a. Tag Prediction:**
   - Uses `classify_tags_with_open_llm()` to assign the query to 2–3 relevant forum tags.
   - Helps narrow down the search space to only relevant discussions.

   **b. Contextual Retrieval:**
   - `retrieve_with_prefilter()` fetches semantically similar posts from the predicted tag space.

   **c. Answer Generation:**
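   - `generate_answer_with_open_llm()` is called twice:
     1. **Retrieval-based answer**: Uses the posts of the top hit as context.
     2. **LLM-only answer**: Called with an empty context list. As written, the function returns its fallback message for an empty context without calling the API, so this branch does not yet produce an ungrounded model answer.
   - Latency for both calls is measured and printed.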

   **d. Related Questions:**
   - `suggest_related()` retrieves other semantically similar past questions from the full corpus.
   - These are printed as additional suggestions.

---

#### 🖥️ Output:
- ✅ Predicted tags
- ✅ Two answer versions (with and without retrieval grounding)
- ✅ Related questions list
- ✅ Latency for both approaches (retrieval-augmented vs LLM-only)
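
The latency figures could be collected with a small wrapper instead of repeated `start`/`stop` pairs. This is a sketch only: `timed` and `stub_answer` are hypothetical, and `time.perf_counter()` is used because it is a monotonic clock intended for measuring intervals.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def stub_answer(query, context_posts):
    """Stand-in for the answer function; sleeps briefly instead of calling an API."""
    time.sleep(0.01)
    return f"(stub) {len(context_posts)} post(s) used for: {query}"

answer, latency = timed(stub_answer, "How do I close a Pot?", [{"text": "..."}])
print(f"{latency:.2f}s: {answer}")
```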

---

> 💬 This setup helps evaluate the benefits of retrieval-augmented generation (RAG) and guides users to similar existing content.

> 🔁 Type `"exit"` to end the loop.


In [None]:
questions, metadatas = load_and_prepare_data()
idx, embs, questions, metadatas, model = build_or_load_index_and_embeddings(questions, metadatas)
all_tags = sorted({t for m in metadatas for t in m["tags"]})
while True:
    q = input("Ask a question (exit to quit): ").strip()
    if q.lower() == "exit":
        break
    print(f"\n You asked:\n  {q}\n")

    tags = classify_tags_with_open_llm(q, all_tags)
    print(f"Predicted tags: {tags or 'none'}")
    hits = retrieve_with_prefilter(q, tags, questions, metadatas, model, embs, top_k=8)

    # 1) Retrieval-based
    start = time.time()
    ret_answer = generate_answer_with_open_llm(q, hits[0]["posts"] if hits else [])
    ret_latency = time.time() - start

    # 2) LLM-only (no retrieval context)
    start = time.time()
    llm_only_answer = generate_answer_with_open_llm(q, [])
    llm_latency = time.time() - start

    # 3) Display comparison
    print(f"\nRetrieval-based ({ret_latency:.2f}s):\n{ret_answer}\n")
    print(f" LLM-only    ({llm_latency:.2f}s):\n{llm_only_answer}\n")

    suggestions = suggest_related(q, questions, idx, model, embs, top_k=5)
    print("You might also be interested in these related questions:")
    for s in suggestions:
        print("  •", s)


Loaded 1226 unique questions.
Loaded existing index and embeddings.

 You asked:
  Hi, as we know Jupiter has two accounts in the app with Federal Bank.  I don’t have any ongoing pot but somehow I transferred money to my second account using account details.  Now how can I transfer that money to my main savings account?

Predicted tags: ['pots', 'savings-account', 'transfer']

⏱ Retrieval-based (2.13s):
Hello! It's great that you've reached out for help regarding transferring money between your accounts on Jupiter.money.

Based on the discussion in the forum, you can follow these steps:

1. Go to the "POTS" section in the Jupiter app.
2. Look for the "Withdraw money/delete Pots" option.

Alternatively, user Siddharth B R suggested using UPI (Unified Payments Interface) to transfer the funds. You can give that a try as well if the first method doesn't work for you.

Please remember to double-check the account details before confirming the transfer to ensure the money goes to your desire