<a href="https://colab.research.google.com/github/Ajeeetsingh/document-query-chatbot/blob/main/Query_based_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 Cell 1: Install Required Libraries and ngrok
-  Purpose: Installs the Python libraries needed to run the chatbot in Google Colab: Streamlit (web app), sentence-transformers (NLP model), scikit-learn (clustering), torch (model backend), nltk (text processing), and pyngrok (public URL tunneling through ngrok). This is similar to installing diffusers and torch in the Text-to-Image project (Cell 1).
-  Inputs: None (runs pip commands automatically).
-  Outputs: Confirmation of installed libraries or error messages if installation fails.
-  Context: Prepares the Colab environment for the chatbot, ensuring all dependencies are available before processing text or running the Streamlit app.

In [None]:
# Purpose: Install Streamlit, sentence-transformers, NLTK, and ngrok for Colab.

# Install dependencies
!pip install streamlit sentence-transformers scikit-learn torch nltk pyngrok

# Verify installations
try:
    import streamlit
    import sentence_transformers
    import sklearn
    import torch
    import nltk
    import pyngrok
    print("Libraries and ngrok installed successfully!")
except ImportError as e:
    print(f"Installation error: {e}. Please rerun this cell.")

Libraries and ngrok installed successfully!
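
The verification step above stops at the first failed import. As a minimal sketch (not part of the original notebook), the check could instead list every missing module by name. The helper name `find_missing_modules` and the `REQUIRED_MODULES` list are hypothetical; the list simply mirrors the imports in Cell 1.

```python
import importlib.util

# Import names (not pip names) of the packages installed in Cell 1.
REQUIRED_MODULES = ["streamlit", "sentence_transformers", "sklearn", "torch", "nltk", "pyngrok"]

def find_missing_modules(module_names):
    """Return the top-level module names that cannot be located (hypothetical helper)."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing

missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {', '.join(missing)}. Please rerun the install cell.")
else:
    print("All required modules were found.")
```

Because `find_spec` only locates a module without importing it, this check is quick but will not catch a package that is installed yet broken at import time.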


 Cell 2: Mount Google Drive and Download NLTK Resources
-  Purpose: Mounts Google Drive to access text files (e.g., complex_sample.txt) and downloads the NLTK tokenizer resources (punkt, punkt_tab) used for sentence splitting. The wordnet resource needed for lemmatization is downloaded by app.py in Cell 4. This mirrors mounting Drive in the Text-to-Image project (Cell 2) to save outputs.
-  Inputs: User authentication for Google Drive (follow Colab prompt).
-  Outputs: Mounted Drive at /content/drive and confirmation of NLTK resources.
-  Context: Ensures text files are accessible and NLTK is ready for sentence splitting, which is critical for processing user-uploaded documents.

In [None]:
# Cell 2: Mount Google Drive and Download NLTK Resources
# Purpose: Mount Google Drive for sample.txt/sample2.txt and download NLTK resources.

from google.colab import drive
drive.mount('/content/drive')  # Mount Google Drive; follow authentication prompt

# Download NLTK resources with verification
import nltk
import os
try:
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.data.find('tokenizers/punkt_tab')
    print("NLTK resources (punkt, punkt_tab) downloaded successfully!")
except LookupError:
    print("Failed to download NLTK resources. Retrying...")
    nltk.download('punkt', force=True)
    nltk.download('punkt_tab', force=True)
    try:
        nltk.data.find('tokenizers/punkt_tab')
        print("NLTK resources downloaded successfully!")
    except LookupError:
        print("NLTK download failed. Please check your network and rerun.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
NLTK resources (punkt, punkt_tab) downloaded successfully!
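
Once Drive is mounted, a sample document can be read with a small guard against missing or empty files. The following is a sketch only: the helper name `read_text_document` is hypothetical, and the example path in the comment assumes the file sits at the top level of My Drive, which the notebook does not state.

```python
from pathlib import Path

def read_text_document(path):
    """Read a text file as UTF-8 and return its content (hypothetical helper)."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"No file at {file_path}. Is Google Drive mounted?")
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    text = file_path.read_text(encoding="utf-8", errors="replace")
    if not text.strip():
        raise ValueError(f"{file_path} is empty.")
    return text

# Example call (assumed location, adjust to where your file actually lives):
# text = read_text_document("/content/drive/MyDrive/complex_sample.txt")
```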


 Cell 3: Set Up ngrok for Public Access
-  Purpose: Downloads and configures ngrok to expose the Streamlit app via a public URL, allowing users to access the chatbot outside Colab. Uses your ngrok authtoken for authentication. This is identical to the ngrok setup in the Text-to-Image project (Cell 9).
-  Inputs: Your ngrok authtoken (hardcoded below; keep it out of shared or public copies of the notebook).
-  Outputs: Confirmation of ngrok setup and installation.
-  Context: Enables testing the Streamlit app in a browser, critical for user interaction and portfolio demos.

In [None]:
# Cell 3: Set Up ngrok
# Purpose: Download and configure ngrok with your authtoken.

# Download and install ngrok
!wget https://bin.equinox.io/c/bNyj1mQVY4c/ngrok-v3-stable-linux-amd64.tgz
!tar -xvzf ngrok-v3-stable-linux-amd64.tgz
!mv ngrok /usr/local/bin/

# Set up ngrok authtoken (replace the placeholder with your own token; never commit a real one)
!ngrok authtoken YOUR_NGROK_AUTHTOKEN

print("ngrok configured successfully!")

--2025-05-12 23:46:19--  https://bin.equinox.io/c/bNyj1mQVY4c/ngrok-v3-stable-linux-amd64.tgz
Resolving bin.equinox.io (bin.equinox.io)... 99.83.220.108, 35.71.179.82, 13.248.244.96, ...
Connecting to bin.equinox.io (bin.equinox.io)|99.83.220.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9395172 (9.0M) [application/octet-stream]
Saving to: ‘ngrok-v3-stable-linux-amd64.tgz.2’


2025-05-12 23:46:20 (10.8 MB/s) - ‘ngrok-v3-stable-linux-amd64.tgz.2’ saved [9395172/9395172]

ngrok
Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml
ngrok configured successfully!
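
As an alternative to typing the authtoken into the cell itself, the token could be read from an environment variable, with an interactive prompt as a fallback. This is a minimal sketch under assumptions: the function names are hypothetical, and the variable name `NGROK_AUTHTOKEN` is a choice made for this example. The `ngrok.set_auth_token` call is part of pyngrok, which Cell 1 installs.

```python
import os
from getpass import getpass

def load_ngrok_token(env_var="NGROK_AUTHTOKEN"):
    """Return the ngrok authtoken from the environment, or prompt for it (hypothetical helper)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed value, so it does not end up in the cell output.
        token = getpass("ngrok authtoken: ").strip()
    if not token:
        raise ValueError("No ngrok authtoken provided.")
    return token

def configure_ngrok(token):
    """Register the token with pyngrok (imported lazily so the helper above has no dependencies)."""
    from pyngrok import ngrok
    ngrok.set_auth_token(token)

# Example usage in a Colab cell:
# configure_ngrok(load_ngrok_token())
```

Keeping the token out of the notebook source means the file can be shared or pushed to a public repository without exposing the credential.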


 Cell 4: Save Streamlit App
-  Purpose: Saves the Streamlit app code as app.py, defining the chatbot's logic and UI. Includes query preprocessing (lemmatization, abbreviation expansion), topic clustering (K-means), and sentence matching (all-MiniLM-L6-v2). Uses manual file writing to avoid %%writefile issues, as in the Text-to-Image project (Cell 9).
-  Inputs: None (writes app.py to /content/).
-  Outputs: Confirmation that app.py was saved.
-  Context: Creates the core application, integrating NLP, clustering, and a web interface for users to upload files and query documents.
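
To make the query preprocessing step concrete, here is a simplified, standard-library-only sketch. It is an approximation rather than the app's `preprocess_query`: a regular expression stands in for NLTK's `word_tokenize`, lemmatization is omitted, and the function name `simple_preprocess` is hypothetical. The abbreviation map and stop-word list are the ones used in app.py.

```python
import re

# Same abbreviation map and stop-word list as app.py.
ABBREVIATIONS = {
    "ai": "artificial intelligence",
    "ml": "machine learning",
    "re": "renewable energy",
}
STOP_WORDS = {"is", "are", "what", "how", "in", "to", "for", "and", "or"}

def simple_preprocess(query):
    """Lowercase, tokenize, expand abbreviations, and drop stop words (simplified sketch)."""
    # Regex tokenization is a stand-in for nltk.word_tokenize; no lemmatization here.
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    kept = [token for token in expanded if token not in STOP_WORDS]
    # Fall back to the raw query if everything was filtered out.
    return " ".join(kept) if kept else query

print(simple_preprocess("How is ML used in RE?"))
# prints: machine learning used renewable energy
```

Expanding abbreviations before embedding gives the sentence model the full phrase to match against, which is the motivation for this step in the app.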

In [None]:
# Cell 4: Save Streamlit App
# Purpose: Save the Streamlit app as app.py using manual file writing, avoiding %%writefile.

print("Saving app.py manually to avoid %%writefile issues.")
with open("/content/app.py", "w") as f:
    f.write('''
import streamlit as st
from sentence_transformers import SentenceTransformer, util
import nltk
import os
import numpy as np
from sklearn.cluster import KMeans
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
try:
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.download('wordnet', quiet=True)
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    st.error("Failed to download NLTK resources. Please refresh the app.")

# Load pre-trained model once and reuse it across Streamlit reruns
@st.cache_resource
def load_model():
    return SentenceTransformer('all-MiniLM-L6-v2')

model = load_model()

def preprocess_query(query):
    """
    Preprocess the query by tokenizing, lemmatizing, and expanding abbreviations.
    Args:
        query (str): Raw user query.
    Returns:
        str: Preprocessed query.
    """
    if not query.strip():
        return query
    # Lowercase and tokenize
    tokens = word_tokenize(query.lower())
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    # Expand common abbreviations
    expansions = {
        "ai": "artificial intelligence",
        "ml": "machine learning",
        "re": "renewable energy"
    }
    expanded_tokens = []
    for token in tokens:
        expanded_tokens.append(expansions.get(token, token))
    # Remove stop words (simple list for brevity)
    stop_words = {"is", "are", "what", "how", "in", "to", "for", "and", "or"}
    tokens = [token for token in expanded_tokens if token not in stop_words]
    # Join tokens
    processed_query = " ".join(tokens)
    return processed_query if processed_query else query

def process_text_file(file_content, query, num_clusters=2, top_k=3):
    """
    Process a text file's content, cluster sentences by topic, and find top-k relevant sentences.
    Args:
        file_content (str): Text content of the uploaded file.
        query (str): User query.
        num_clusters (int): Number of topic clusters (default: 2).
        top_k (int): Number of top sentences to return (default: 3).
    Returns:
        tuple: (results, all_sentences, all_scores, cluster_labels) or (error_message, None, None, None).
    """
    if not file_content.strip():
        return "Error: File is empty!", None, None, None
    if not query.strip():
        return "Error: Please enter a query!", None, None, None
    try:
        sentences = nltk.sent_tokenize(file_content)
    except LookupError as e:
        return f"Tokenization error: {e}.", None, None, None
    if not sentences:
        return "Error: No sentences found in the file!", None, None, None
    if len(sentences) < 2:
        return "Error: File must contain at least two sentences.", None, None, None

    # Preprocess query
    processed_query = preprocess_query(query)

    # Generate sentence and query embeddings
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    query_embedding = model.encode(processed_query, convert_to_tensor=True)

    # Cluster sentences by topic
    num_clusters = min(num_clusters, len(sentences))
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(sentence_embeddings.cpu().numpy())

    # Compute cosine similarities
    cos_scores = util.cos_sim(query_embedding, sentence_embeddings)[0].cpu().numpy()

    # Find the most relevant cluster
    cluster_scores = []
    for cluster_id in range(num_clusters):
        cluster_indices = np.where(cluster_labels == cluster_id)[0]
        if len(cluster_indices) > 0:
            cluster_score = np.mean(cos_scores[cluster_indices])
            cluster_scores.append((cluster_id, cluster_score))
    if not cluster_scores:
        return "Error: No valid clusters found.", None, None, None
    relevant_cluster = max(cluster_scores, key=lambda x: x[1])[0]

    # Get top-k sentences from the relevant cluster
    cluster_indices = np.where(cluster_labels == relevant_cluster)[0]
    if len(cluster_indices) == 0:
        return "Error: No sentences in the relevant cluster.", None, None, None
    # Use a separate name so the per-cluster mean scores above are not shadowed
    relevant_scores = cos_scores[cluster_indices]
    cluster_sentences = [sentences[i] for i in cluster_indices]
    top_k_indices = np.argsort(relevant_scores)[::-1][:min(top_k, len(relevant_scores))]
    results = []
    for idx in top_k_indices:
        sentence = cluster_sentences[idx]
        score = float(relevant_scores[idx])
        explanation = f"This sentence was selected because it closely matches the processed query '{processed_query}' (original: '{query}') with a similarity score of {score:.4f} and belongs to topic cluster {relevant_cluster + 1}, which is most relevant to your query."
        results.append({"sentence": sentence, "score": score, "topic": f"Topic {relevant_cluster + 1}", "explanation": explanation})

    return results, sentences, cos_scores.tolist(), cluster_labels.tolist()

# Streamlit UI
st.title("Document Query Chatbot")
st.markdown("""
Upload a text file (.txt, 100–1000 words) containing multiple topics (e.g., AI and renewable energy).
Enter a query to find the most relevant sentences, grouped by topic.
- **Example file**: A document discussing AI and renewable energy.
- **Example query**: "What's AI?" or "What is renewable energy?"
""")

# File uploader
uploaded_file = st.file_uploader("Upload a text file", type="txt")

# Query input
query = st.text_input("Enter your query", placeholder="e.g., What's AI?")

# Number of clusters
num_clusters = st.slider("Number of topic clusters", min_value=2, max_value=5, value=2)

# Number of sentences to return
top_k = st.slider("Number of sentences to return", min_value=1, max_value=5, value=3)

# Process and display results
if uploaded_file and query:
    with st.spinner("Processing query and clustering sentences..."):
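        # Replace undecodable bytes instead of crashing on files that are not valid UTF-8
        file_content = uploaded_file.read().decode("utf-8", errors="replace")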
        results, all_sentences, all_scores, cluster_labels = process_text_file(file_content, query, num_clusters, top_k)
        if isinstance(results, list) and results:
            st.markdown("**Top Matching Sentences**")
            for result in results:
                st.markdown(f"- **Sentence**: {result['sentence']}")
                st.markdown(f"  - **Topic**: {result['topic']}")
                st.markdown(f"  - **Score**: {result['score']:.4f}")
                st.markdown(f"  - **Explanation**: {result['explanation']}")
            with st.expander("View all sentences and their topics"):
                for sent, score, cluster in zip(all_sentences, all_scores, cluster_labels):
                    st.write(f"- Sentence: {sent} (Score: {score:.4f}, Topic: {cluster + 1})")
        else:
            st.error(results)
elif uploaded_file and not query:
    st.warning("Please enter a query to proceed.")
elif query and not uploaded_file:
    st.warning("Please upload a text file to proceed.")
''')
print("app.py saved successfully!")

Saving app.py manually to avoid %%writefile issues.
app.py saved successfully!
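
The selection logic in `process_text_file` (score every sentence against the query, pick the cluster with the highest mean score, then return the top-k sentences from that cluster) can be illustrated without the model. The sketch below uses pure Python and toy 2-D vectors; the vectors, the labels, and the function name `pick_top_sentences` are assumptions made for illustration. In the app, the embeddings come from all-MiniLM-L6-v2 and the labels from K-means.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pick_top_sentences(query_vec, sentence_vecs, labels, top_k=3):
    """Return (best_cluster, [(sentence_index, score), ...]) for the most relevant cluster."""
    scores = [cosine_similarity(query_vec, vec) for vec in sentence_vecs]
    # Group sentence indices by their cluster label.
    by_cluster = {}
    for index, label in enumerate(labels):
        by_cluster.setdefault(label, []).append(index)
    # The most relevant cluster has the highest mean similarity to the query.
    def mean_score(label):
        members = by_cluster[label]
        return sum(scores[i] for i in members) / len(members)
    best = max(by_cluster, key=mean_score)
    # Rank only the sentences inside that cluster.
    ranked = sorted(by_cluster[best], key=lambda i: scores[i], reverse=True)
    return best, [(i, scores[i]) for i in ranked[:top_k]]

# Toy data: two sentences point roughly along the query, two point away from it.
query = [1.0, 0.0]
sentences = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
labels = [0, 0, 1, 1]

best_cluster, top = pick_top_sentences(query, sentences, labels, top_k=2)
print(best_cluster, [index for index, _ in top])
```

One consequence of this design is that a highly relevant sentence assigned to a different cluster is never returned, since ranking happens only inside the winning cluster.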


 Cell 5: Run Streamlit App with ngrok
-  Purpose: Launches the Streamlit app (app.py) in Colab and exposes it via a public ngrok URL for browser access. Includes a clean shutdown mechanism for interrupting the process. This mirrors the Streamlit/ngrok setup in the Text-to-Image project (Cell 9).
-  Inputs: app.py (from Cell 4) and ngrok configuration (from Cell 3).
-  Outputs: Public ngrok URL (e.g., https://abc123.ngrok.io) to access the app.
-  Context: Allows users to interact with the chatbot’s web interface, testing file uploads and queries in a real-world setting.

In [None]:
# Cell 5: Run Streamlit App with ngrok
# Purpose: Run app.py in Colab and expose it via a public ngrok URL.

import subprocess
import signal
import os
from pyngrok import ngrok

# Terminate existing ngrok tunnels
ngrok.kill()

# Start ngrok tunnel
public_url = ngrok.connect(8501)
print(f"Streamlit app running at: {public_url}")

# Start Streamlit server
streamlit_cmd = ["streamlit", "run", "app.py", "--server.port", "8501", "--server.fileWatcherType", "none"]
streamlit_proc = subprocess.Popen(streamlit_cmd)

# Handle shutdown (matches Text-to-Image Cell 9)
def signal_handler(sig, frame):
    print("Shutting down Streamlit and ngrok...")
    streamlit_proc.terminate()
    ngrok.kill()
    print("Shutdown complete.")
    os._exit(0)

signal.signal(signal.SIGINT, signal_handler)

# Keep the cell running
try:
    streamlit_proc.wait()
except KeyboardInterrupt:
    signal_handler(None, None)

Streamlit app running at: NgrokTunnel: "https://3303-34-125-114-107.ngrok-free.app" -> "http://localhost:8501"
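
Cell 5 opens the tunnel and prints the URL before the Streamlit server has started, so a visitor who opens the link immediately may briefly see a connection error. One possible refinement, sketched here with only the standard library, is to poll the local port until it accepts connections. The helper name `wait_for_port` and its default timeout values are assumptions, not part of the notebook.

```python
import socket
import time

def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False

# Example usage after starting Streamlit with subprocess.Popen:
# if wait_for_port(8501):
#     print("Streamlit is ready.")
# else:
#     print("Streamlit did not start within the timeout.")
```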
