# Virtual Customer Assistant: Question Answering over Internal Knowledge Bases

**Chapter Alignment**: Chapter 9.4.1 of *Dialogue Systems*. This notebook demonstrates a hybrid SLM/LLM Virtual Customer Assistant that answers customer and employee queries by retrieving and synthesizing information from internal legal, HR, and IT knowledge sources.

**Goals:**  
- Provide fast extractive answers using a Small Language Model (SLM) (DistilBERT).  
- Provide richer, generative answers using Large Language Models (LLMs) (e.g., Gemma and BitNet).  
- Maintain a vectorized retrieval layer for context grounding via LangChain + Chroma.  
- Offer a simple querying interface that can be embedded into a virtual assistant pipeline.


In [None]:
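# ====== Installation (run once) ======
# Uncomment and run these if the packages are not already installed in your environment.
# Note: the imports in this notebook (langchain.schema, langchain.embeddings, langchain.vectorstores,
# langchain.llms) follow the older langchain package layout. Newer releases moved these pieces to
# langchain_core and langchain_community, so pin a compatible langchain version or adjust the imports.
# !pip install --upgrade pip
# !pip install langchain transformers chromadb sentence-transformers accelerate torch gradio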

# ====== Environment setup ======
import os
from getpass import getpass

# Hugging Face API token (required for HuggingFaceHub/Gemma/BitNet access)
# It is expected that the user sets HUGGINGFACEHUB_API_TOKEN in the environment or inputs it here.
if "HUGGINGFACEHUB_API_TOKEN" not in os.environ:
    print('HUGGINGFACEHUB_API_TOKEN not found in environment. Prompting for token (won\'t be stored persistently).')
    token = getpass('Enter your HuggingFace Hub API token: ')
    os.environ['HUGGINGFACEHUB_API_TOKEN'] = token

# Basic version info (for reproducibility)
import sys
print('Python version:', sys.version.split()[0])
try:
    # Import every dependency so a missing package is reported up front
    import langchain, transformers, chromadb, torch, sentence_transformers
    print('langchain version:', langchain.__version__)
    print('transformers version:', transformers.__version__)
except ImportError as e:
    print('Some libraries are not installed. Please install requirements as shown above.', str(e))


In [None]:
# ====== Sample Knowledge Documents ======
from langchain.schema import Document

# Simulated internal knowledge base: legal, HR, and IT policies
legal_doc = """**Legal Knowledge Base**\n
1. Confidentiality agreements must be signed before sharing client data.\n
2. All contracts over $50,000 require a secondary legal review.\n
3. GDPR compliance demands a data access audit every 6 months.\n"""

hr_doc = """**HR Knowledge Base**\n
1. Employees are entitled to 25 paid leave days per calendar year.\n
2. The onboarding process includes compliance training, software access setup, and orientation.\n
3. Performance reviews are conducted bi-annually.\n"""

it_doc = """**IT Knowledge Base**\n
1. Passwords must be rotated every 90 days.\n
2. Multi-factor authentication (MFA) is required for VPN access.\n
3. Report any suspicious emails to the security team immediately.\n"""

documents = [
    Document(page_content=legal_doc, metadata={'source': 'legal'}),
    Document(page_content=hr_doc, metadata={'source': 'hr'}),
    Document(page_content=it_doc, metadata={'source': 'it'}),
]

print('Created sample documents for Legal, HR, and IT knowledge bases.')

In [None]:
# ====== Chunking and Vector Store Construction ======
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Split documents into chunks to serve as retrievable context
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)

print(f'Number of document chunks: {len(docs)}')

# Embeddings - use a relatively small embedding model for demo (semantic search)
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

# Build Chroma vector store (in-memory)
vector_store = Chroma.from_documents(docs, embeddings, collection_name="virtual_customer_assistant")

# Retriever for downstream QA
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={'k': 3})
print('Vector store and retriever initialized.')

In [None]:
# ====== Extractive QA with a Small Language Model (SLM) - DistilBERT ======
from transformers import pipeline

# Use a pre-trained DistilBERT model fine-tuned on SQuAD (extractive QA)
extractive_pipeline = pipeline('question-answering', model='distilbert-base-uncased-distilled-squad', tokenizer='distilbert-base-uncased-distilled-squad')

def ask_extractive(question: str):
    # Retrieve top contexts
    retrieved_docs = retriever.get_relevant_documents(question)
    combined_context = "\n\n".join([doc.page_content for doc in retrieved_docs])
    # Query the extractive QA pipeline with concatenated context
    answer = extractive_pipeline(question=question, context=combined_context)
    return {
        'answer': answer.get('answer'),
        'score': answer.get('score'),
        'source_chunks': [doc.metadata for doc in retrieved_docs],
        'context': combined_context
    }

# Demo extractive QA
q1 = "What is the leave policy for employees?"
res1 = ask_extractive(q1)
print('Question:', q1)
print('Extractive Answer:', res1['answer'])
print('Score:', res1['score'])

In [None]:
# ====== Generative QA with LLMs (Gemma & BitNet) ======
from langchain.llms import HuggingFaceHub
from langchain.chains import RetrievalQA

def build_qa_chain(repo_id: str, label: str, fallback_repo_id: str = 'google/flan-t5-small'):
    """Build a RetrievalQA chain for repo_id, falling back to a small model if setup fails."""
    model_kwargs = {'temperature': 0.2, 'max_length': 256}
    try:
        llm = HuggingFaceHub(repo_id=repo_id, model_kwargs=model_kwargs)
        chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever)
        print(f'{label}-based generative QA chain initialized.')
    except Exception as e:
        print(f'Failed to initialize {label} LLM, falling back to {fallback_repo_id}. Error:', str(e))
        llm = HuggingFaceHub(repo_id=fallback_repo_id, model_kwargs=model_kwargs)
        chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever)
    return llm, chain

# Initialize Gemma and BitNet (generative LLMs) via HuggingFaceHub.
# Note: depending on the library version, constructing the client may not query the model itself,
# so some failures (e.g., a gated or unavailable model) may only surface on the first question.
gemma, qa_gemma = build_qa_chain('google/gemma-7b-it', 'Gemma')
bitnet, qa_bitnet = build_qa_chain('microsoft/bitnet-b1.58-2B-4T', 'BitNet')

# Functions to ask questions
def ask_question_gemma(question: str):
    result = qa_gemma.run(question)
    return result

def ask_question_bitnet(question: str):
    result = qa_bitnet.run(question)
    return result

# Demo generative QA
q2 = "What are the requirements for GDPR compliance?"
print('Question:', q2)
print('Gemma Answer:', ask_question_gemma(q2))
print('BitNet Answer:', ask_question_bitnet(q2))

In [None]:
# ====== Simple Virtual Customer Assistant Interface ======
def virtual_assistant_loop():
    print('Starting Virtual Customer Assistant. Type "exit" to quit.')
    while True:
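        user_q = input('User: ').strip()
        if not user_q:
            # Skip blank input; an empty question would make the extractive pipeline fail
            continue
        if user_q.lower() in ('exit', 'quit'):
            print('Assistant: Goodbye!')
            break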
        # First try extractive for concise/fast answer
        extractive = ask_extractive(user_q)
        print('\n[SLM Extractive Answer]')
        print(f"Answer: {extractive['answer']} (score: {extractive['score']:.3f})")
        # Then generative augmentation
        print('\n[LLM Generative Answer - Gemma]')
        try:
            print(ask_question_gemma(user_q))
        except Exception as e:
            print('Gemma error:', str(e))
        print('\n[LLM Generative Answer - BitNet]')
        try:
            print(ask_question_bitnet(user_q))
        except Exception as e:
            print('BitNet error:', str(e))

# Note: To run interactive mode, uncomment the next line
# virtual_assistant_loop()
print('Virtual assistant interface defined. You can call virtual_assistant_loop().')

## Deployment Options and Scalability

This prototype can be extended and deployed in several ways:

1. **Local Deployment:** Run the notebook code as a backend service (e.g., Flask/FastAPI) within a private corporate network for internal virtual customer assistants. Expose `ask_extractive` and `ask_question_gemma` as REST endpoints.  
2. **Cloud Deployment:** Host on a platform such as AWS, using SageMaker endpoints for fine-tuned models or serverless containers for the QA service. Amazon Lex can sit in front to handle intent recognition and route user queries to this QA backend.  
3. **Containerization:** Package the system as a Docker container to ensure a consistent environment across development, staging, and production.  
4. **Hybrid SLM/LLM Strategy:** Use the extractive SLM (DistilBERT) for low-latency quick answers; fall back to the LLMs only when the query requires synthesis or explanation, or when the SLM's confidence score is low.  
5. **Conversation State & Context:** Extend the interface to maintain multi-turn context, injecting previous user turns into retrieval or prompt engineering for more coherent dialogues.  
6. **Monitoring & Feedback Loop:** Log user questions and model answers; collect feedback to re-rank or fine-tune models periodically to improve accuracy in the domain-specific knowledge base.
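
The hybrid strategy in item 4 can be sketched as a small routing function. This is a minimal illustration, not part of the notebook above: `route_answer`, `CONFIDENCE_THRESHOLD`, and the value 0.5 are assumptions introduced here, and the two `fake_*` functions are stand-ins so the sketch runs without loading any model. The extractive pipeline's score is not guaranteed to be a calibrated probability, so any threshold would need tuning on real queries.

```python
from typing import Any, Callable, Dict

# Hypothetical cut-off (an assumption, not from the notebook); tune on real queries.
CONFIDENCE_THRESHOLD = 0.5


def route_answer(
    question: str,
    extractive_fn: Callable[[str], Dict[str, Any]],
    generative_fn: Callable[[str], str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Dict[str, Any]:
    """Return the SLM span when its score clears the threshold, else call the LLM."""
    extractive = extractive_fn(question)
    score = extractive.get('score') or 0.0
    if extractive.get('answer') and score >= threshold:
        return {'answer': extractive['answer'], 'route': 'slm', 'score': score}
    # Low confidence or empty span: accept the extra latency of the generative model.
    return {'answer': generative_fn(question), 'route': 'llm', 'score': score}


# Stand-in functions so the sketch runs without models. In the notebook,
# ask_extractive and ask_question_gemma would be passed instead.
def fake_extractive(question: str) -> Dict[str, Any]:
    score = 0.9 if 'leave' in question.lower() else 0.1
    return {'answer': '25 paid leave days', 'score': score}


def fake_generative(question: str) -> str:
    return 'Generated answer for: ' + question


print(route_answer('What is the leave policy?', fake_extractive, fake_generative))
print(route_answer('Why do we rotate passwords?', fake_extractive, fake_generative))
```

Passing the two answer functions as arguments keeps the router independent of any particular model, so the same function could wrap either LLM chain.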

**Next Steps:**  
- Add authentication and access control for sensitive knowledge.  
- Plug into chat/voice channels (e.g., Slack bots, web chat widgets, Amazon Connect with Lex).  
- Fine-tune the extractive model on internal corporate Q&A pairs for domain adaptation.  
- Implement answer validation with guardrails (e.g., detect conflicting policy answers).
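
As one possible starting point for the answer-validation step, the sketch below flags cases where the extractive and generative answers cite different numbers (for example, 25 versus 30 leave days). It is a deliberately simple heuristic under stated assumptions: `extract_numbers` and `answers_conflict` are hypothetical helpers introduced here, and comparing numeric tokens will miss conflicts expressed in words.

```python
import re
from typing import Set


def extract_numbers(text: str) -> Set[str]:
    """Collect numeric tokens, normalizing thousands separators ('50,000' -> '50000')."""
    matches = re.findall(r'\d+(?:,\d{3})*(?:\.\d+)?', text)
    return {m.replace(',', '') for m in matches}


def answers_conflict(extractive_answer: str, generative_answer: str) -> bool:
    """Flag a conflict only when both answers cite numbers and share none of them."""
    extractive_numbers = extract_numbers(extractive_answer)
    generative_numbers = extract_numbers(generative_answer)
    return (
        bool(extractive_numbers)
        and bool(generative_numbers)
        and extractive_numbers.isdisjoint(generative_numbers)
    )


# Example with made-up answers: the two models disagree on the number of days.
print(answers_conflict('25 paid leave days', 'Employees receive 30 days of leave.'))
```

A flagged conflict could be logged for review or used to suppress the generative answer; which action is appropriate depends on the deployment.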

## Summary for Virtual Customer Assistant Use Case

This notebook provides a working skeleton of a Virtual Customer Assistant that answers internal queries over legal, HR, and IT knowledge. It combines:

- **Extractive QA (SLM):** DistilBERT gives fast, grounded spans from retrieved documents.  
- **Generative QA (LLMs):** Gemma and BitNet synthesize more natural, explanatory responses while being grounded via retrieval.  
- **Vector Retrieval Layer:** LangChain + Chroma handle semantic search across chunked knowledge documents.  
- **Simple Interface:** A console loop simulates conversation and shows the SLM and LLM answers side by side, as a starting point for a fuller assistant.

The architecture is suitable for embedding into larger dialogue systems (e.g., with intent understanding via Lex or Rasa), and can be scaled across local, cloud, or hybrid deployments as described above.