In [None]:
from search_engine import SemanticSearchEngine

searcher = SemanticSearchEngine('./data/faq_dataset_10000.txt')

while True:
    query = input("\n🔍 Enter your question (or 'exit'): ")
    if query.strip().lower() == 'exit':
        break
    results = searcher.search(query)
    print(f"🔍 Question: {query}")
    print('📌 Top Matches:')
    for i, (text, score) in enumerate(results, 1):
        print(f"{i}. {text} (score: {score: .4f})")
    print("")

🔍 Question: When will the train leave?
📌 Top Matches:
1. What is your return policy? (score:  0.0991)
2. Where can I track my order? (score:  0.0915)
3. How can I reset my password? (score: -0.0149)

🔍 Question: What method could I do to reset my login information?
📌 Top Matches:
1. Waht should I do if I forget my login information? (score:  0.8048)
2. How can I reset my password? (score:  0.7104)
3. How do I upate my billing details? (score:  0.3776)

🔍 Question: What is the weather?
📌 Top Matches:
1. What is your return policy? (score:  0.1370)
2. How can I reset my password? (score:  0.0521)
3. Where can I track my order? (score:  0.0374)



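💯 **Yes, a bigger `faq_corpus` usually makes your semantic search engine better**, as long as it’s clean and relevant. In the runs above, the off-topic questions about trains and weather score below 0.15 because nothing in the corpus is close to them, while the login question finds two paraphrases scoring above 0.7. Let’s break it down: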

---

## 🧠 Why Bigger Is Better (Usually)

### ✅ 1. **More Coverage = Better Matches**
With more FAQs or documents, the system is more likely to contain:
- Close paraphrases
- Related intents
- Answers that match different user phrasing

Example:
> User query: *"How do I cancel my subscription?"*  
If your corpus only has:
- *"How do I reset my password?"* → ❌ poor match  
But if you add:
- *"How do I stop being billed?"*  
- *"Can I cancel my monthly plan?"*  
✅ Now the search engine returns useful results!

---

### ✅ 2. **Better Generalization**
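With more samples, especially diverse phrasing:
- The corpus covers more of the ways users word the same intent (the embedding model itself does not change)
- Cosine-similarity ranking has **more candidates** to choose from, so the top match is more likely to be a close paraphrase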
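
To make the ranking step concrete, here is a minimal sketch of cosine-similarity top-k search with NumPy. The function name `top_k_cosine` and the 3-dimensional toy vectors are assumptions for illustration; they stand in for real sentence embeddings and are not taken from `SemanticSearchEngine`.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus rows most similar to query_vec."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                    # one score per corpus entry, in [-1, 1]
    order = np.argsort(-scores)[:k]   # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for real sentence embeddings (illustration only).
corpus_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([1.0, 0.2, 0.0])

ranked = top_k_cosine(query_vec, corpus_vecs, k=2)
for idx, score in ranked:
    print(f"entry {idx} (score: {score: .4f})")
```

Every extra corpus row adds one more score to rank, which is why a new close paraphrase can push its way to the top of the list.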

---

### ✅ 3. **Improved User Experience**
- Users don’t need to phrase things *perfectly*
- The system feels smarter, more human-like
- Example: similar to how a search engine or chatbot matches a loosely worded question to the right answer

---

## 🔎 But… Bigger Isn’t Always Better IF:

### ⚠️ 1. You Have **Duplicate or Noisy Data**
- Repeated entries fill the top results with copies of the same answer, and vague entries match many queries weakly
- Tip: Deduplicate your corpus and remove overly generic lines like `"contact us"` or `"more info"`.
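
The cleanup tip could be sketched like this, using only the standard library. The `GENERIC_LINES` set and the helper names are hypothetical; adjust them to your own data. Note that this only catches duplicates that are identical after normalization, not paraphrases.

```python
import re

# Hypothetical set of lines to drop because they carry no specific intent.
GENERIC_LINES = {"contact us", "more info"}

def normalize(line):
    """Lowercase, collapse whitespace, and strip trailing punctuation for comparison."""
    text = re.sub(r"\s+", " ", line.strip().lower())
    return text.rstrip("?.! ")

def clean_corpus(lines):
    """Keep the first occurrence of each distinct line, skipping empty and generic ones."""
    seen = set()
    cleaned = []
    for line in lines:
        key = normalize(line)
        if not key or key in GENERIC_LINES or key in seen:
            continue
        seen.add(key)
        cleaned.append(line.strip())
    return cleaned

raw = [
    "How can I reset my password?",
    "how can I reset my  password",
    "Contact us",
    "",
    "Where can I track my order?",
]
cleaned = clean_corpus(raw)
print(cleaned)
```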

---

### ⚠️ 2. It’s **Too Domain-Diverse**
If your FAQ file mixes banking, health care, and e-commerce questions together, results from the wrong domain can outrank the right one unless you add **category filters** or **clustering**.
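
A category filter can be as simple as narrowing the candidate list before ranking. This is a sketch under assumptions: the `category` labels and the `filter_by_category` helper are hypothetical, and the current `faq_dataset_10000.txt` file is not known to carry such labels.

```python
# Hypothetical corpus entries: each FAQ carries a category label.
faq_entries = [
    {"text": "How do I update my billing details?", "category": "payment"},
    {"text": "How can I reset my password?", "category": "account"},
    {"text": "Where can I track my order?", "category": "delivery"},
    {"text": "What should I do if I forget my login information?", "category": "account"},
]

def filter_by_category(entries, category):
    """Return only entries in the given category; with category=None, return all."""
    if category is None:
        return list(entries)
    return [e for e in entries if e["category"] == category]

# Only these texts would be embedded and ranked for an account-related query.
candidates = filter_by_category(faq_entries, "account")
candidate_texts = [e["text"] for e in candidates]
print(candidate_texts)
```

Filtering first keeps unrelated domains out of the ranking entirely, at the cost of needing a reliable category for each query.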

---

## 🧰 Best Practices for Bigger Corpora

- ✅ Use **clear, question-like sentences**
- ✅ Remove duplicates or low-content lines
- ✅ Group by topic (e.g., payment, account, delivery)
- ✅ Precompute embeddings once instead of re-encoding the corpus on every startup (this matters more as the corpus grows past ~10k entries)
- ✅ Save your `.pt` or `.npy` vectors for faster startup
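
Here is one way the precompute-and-save tip might look with NumPy's `.npy` format. The `fake_embed` function is a placeholder for whatever embedding model you use, and `load_or_build_embeddings` is a hypothetical helper, not part of `search_engine`.

```python
import os
import tempfile
import numpy as np

def fake_embed(texts, dim=8):
    """Placeholder for a real embedding model: one deterministic vector per text."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(texts), dim)).astype(np.float32)

def load_or_build_embeddings(texts, cache_path, embed_fn):
    """Load cached vectors if the row count matches; otherwise compute and save them."""
    if os.path.exists(cache_path):
        cached = np.load(cache_path)
        if cached.shape[0] == len(texts):
            return cached
    vectors = embed_fn(texts)
    np.save(cache_path, vectors)
    return vectors

texts = ["How can I reset my password?", "Where can I track my order?"]
cache_path = os.path.join(tempfile.mkdtemp(), "faq_embeddings.npy")

first = load_or_build_embeddings(texts, cache_path, fake_embed)   # computes and saves
second = load_or_build_embeddings(texts, cache_path, fake_embed)  # loads from disk
```

The row-count check is a deliberately weak staleness test. If entries can be edited in place, storing a hash of the corpus next to the vectors would be a safer way to decide when to rebuild.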

---

## 🚀 Want to Scale Further?

If your dataset becomes much larger (hundreds of thousands of entries or more) and exact search gets slow, try:
- **FAISS** (Facebook’s library) for fast vector search with GPU support
- **Annoy** (Spotify) or **ScaNN** (Google) for fast approximate matching
