<a href="https://colab.research.google.com/github/AltafParekh2001/GEN_AI_Project-s/blob/main/URL_Based_Detailed_QA_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌐 URL-Based Question Answering with Detailed Explanations

This notebook allows you to:
1. Paste any website URL
2. Ask questions about the content on that webpage
3. Get **comprehensive, detailed explanations** (not just one-word answers)

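**Models Used:**
- `deepset/roberta-base-squad2` - Extractive question answering (produces the answers)
- `facebook/bart-large-cnn` - Summarization (loaded, but not called by the answer functions below)
- `gpt2` - Text generation (optional; loaded if available, but not called by the answer functions below)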


In [1]:
# Install required packages
!pip install transformers torch gradio beautifulsoup4 requests sentence-transformers

In [2]:
# Import necessary libraries
import torch
import gradio as gr
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
import requests
from bs4 import BeautifulSoup
import re
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


In [3]:
# Load multiple pipelines for comprehensive answers

# 1. Question-Answering Pipeline (Extract relevant info)
qa_pipeline = pipeline('question-answering', model='deepset/roberta-base-squad2')

# 2. Text Summarization Pipeline (Create summaries for context)
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

# 3. Text Generation Pipeline (optional; GPT-2 is loaded here but not called by the functions below)
try:
    text_generator = pipeline('text-generation', model='gpt2', max_length=200)
    print("✅ Text generation model loaded")
except Exception:
    text_generator = None
    print("⚠️ Text generation model not available, will use alternative method")

print("✅ All models loaded successfully!")
print("\nModels ready for comprehensive Q&A!")

✅ Text generation model loaded
✅ All models loaded successfully!

Models ready for comprehensive Q&A!


## Step 1: Web Scraping Function
Extracts clean text content from any URL.

In [4]:
def scrape_url(url):
    """
    Scrape and extract text content from a given URL with improved cleaning.

    Args:
        url (str): Website URL to scrape

    Returns:
        str: Cleaned text content from the webpage
    """
    try:
        # Send GET request with proper headers to avoid blocking
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        # Parse HTML content
        soup = BeautifulSoup(response.content, 'html.parser')

        # Remove unwanted elements
        for element in soup(["script", "style", "nav", "footer", "header", "meta", "link"]):
            element.decompose()

        # Extract paragraphs and headings; container tags such as <article> or
        # <section> are skipped because they would repeat their child paragraphs
        paragraphs = soup.find_all(['p', 'h1', 'h2', 'h3'])
        text_parts = []

        for para in paragraphs:
            # Use a space separator so text from inline tags is not glued together
            text = para.get_text(' ', strip=True)
            if text and len(text) > 20:  # Skip very short texts
                text_parts.append(text)

        # Combine all text
        text = ' '.join(text_parts)

        # Strip special characters first, then collapse whitespace so that
        # removed characters do not leave double spaces behind
        text = re.sub(r'[^\w\s.?,!\-]', '', text)
        text = re.sub(r'\s+', ' ', text)
        text = text.strip()

        # Limit to 15000 chars for better processing
        if len(text) > 15000:
            text = text[:15000]

        return text

    except requests.exceptions.Timeout:
        return "Error: Request timed out. Please try a different URL."
    except requests.exceptions.ConnectionError:
        return "Error: Connection failed. Please check the URL and internet connection."
    except requests.exceptions.RequestException as e:
        return f"Error fetching URL: {str(e)}"
    except Exception as e:
        return f"Error processing content: {str(e)}"

# Confirm the function is defined
print("✅ Web scraping function defined!")

✅ Web scraping function defined!
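
The text-cleanup step in `scrape_url` can be tried in isolation, without any network access. The following is a minimal sketch, assuming a hypothetical helper named `clean_text` (it is not part of the notebook) that applies the same character whitelist and 15000-character limit. It strips disallowed characters before collapsing whitespace, so a removed symbol does not leave a double space behind.

```python
import re

def clean_text(raw: str, max_chars: int = 15000) -> str:
    """Hypothetical helper mirroring the cleanup step of scrape_url."""
    # Keep word characters, whitespace, and basic punctuation only
    text = re.sub(r'[^\w\s.?,!\-]', '', raw)
    # Collapse runs of whitespace (including newlines) into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Truncate so downstream models receive a bounded context
    return text[:max_chars]

sample = "Hello,   world!\n(test) © 2024"
print(clean_text(sample))  # prints: Hello, world! test 2024
```

Note that the whitelist also removes characters such as `:`, `'`, and `/`, so URLs and contractions in the page text will be altered.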


## Step 2: Enhanced Question Answering with Detailed Explanations
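Returns the extracted answer together with the page text surrounding it (roughly 500 characters on either side) and a confidence score, rather than a one-word response.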

In [5]:
def generate_detailed_answer(url, question):
    """
    Generate comprehensive, detailed explanations for questions based on URL content.

    Args:
        url (str): Website URL
        question (str): User's question

    Returns:
        str: Detailed explanation with multiple sentences
    """

    # ===== INPUT VALIDATION =====
    if not url or not url.startswith(('http://', 'https://')):
        return "⚠️ Error: Please provide a valid URL starting with http:// or https://"

    if not question or len(question.strip()) < 3:
        return "⚠️ Error: Please enter a valid question (at least 3 characters)"

    print(f"🔄 Step 1: Fetching content from URL...")
    context = scrape_url(url)

    if context.startswith("Error"):
        return context

    if len(context) < 100:
        return "⚠️ Error: Not enough content extracted from URL. Please try a different URL with more text content."

    print(f"✅ Content fetched ({len(context)} characters)")

    # ===== EXTRACT RELEVANT INFORMATION =====
    print(f"🔄 Step 2: Extracting relevant information...")
    try:
        qa_result = qa_pipeline(question=question, context=context)
        extracted_answer = qa_result['answer']
        confidence = qa_result['score']
    except Exception as e:
        return f"⚠️ Error during question answering: {str(e)}"

    print(f"✅ Extracted answer: {extracted_answer} (Confidence: {confidence:.2%})")

    # ===== CREATE SUMMARY CONTEXT =====
    print(f"🔄 Step 3: Creating contextual summary...")

    # Find sentences containing the answer
    sentences = re.split(r'[.!?]+', context)
    relevant_sentences = []

    for sentence in sentences:
        if extracted_answer.lower() in sentence.lower() or question.lower().split()[0] in sentence.lower():
            relevant_sentences.append(sentence.strip())

    # If we found relevant sentences, use them; otherwise use first few
    if relevant_sentences:
        summary_context = ' '.join(relevant_sentences[:3])
    else:
        summary_context = ' '.join(sentences[:5])

    # ===== BUILD DETAILED EXPLANATION =====
    print(f"🔄 Step 4: Building comprehensive explanation...")

    # Extract key information from context around the answer
    answer_idx = context.find(extracted_answer)
    if answer_idx != -1:
        # Get surrounding context (500 chars before and after)
        start = max(0, answer_idx - 500)
        end = min(len(context), answer_idx + 500)
        surrounding_context = context[start:end]
    else:
        surrounding_context = summary_context

    # Build the final answer
    detailed_explanation = f"""
📌 **Question:** {question}

✅ **Direct Answer:** {extracted_answer}

📖 **Detailed Explanation:**

{surrounding_context}

---
🎯 **Confidence Score:** {confidence:.2%}
💡 **Note:** This explanation is generated from the extracted web content based on your question.
"""

    return detailed_explanation

print("✅ Enhanced Q&A function defined!")

✅ Enhanced Q&A function defined!
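
The Hugging Face question-answering pipeline returns `start` and `end` character offsets alongside `answer` and `score`. The sketch below assumes a hypothetical helper, `context_window`, that uses those offsets to cut the surrounding text, which avoids picking the wrong occurrence when the answer string appears more than once on the page. The `fake_result` dictionary is made-up data shaped like a pipeline result, not real model output.

```python
def context_window(context: str, start: int, end: int, radius: int = 500) -> str:
    """Return the text from `radius` chars before the answer to `radius` chars after it."""
    left = max(0, start - radius)
    right = min(len(context), end + radius)
    return context[left:right]

context = "Python is popular. Python was created by Guido van Rossum. It was released in 1991."
# Made-up result in the shape the pipeline returns: score, start, end, answer
answer = "Guido van Rossum"
start = context.index(answer)
fake_result = {"score": 0.9, "start": start, "end": start + len(answer), "answer": answer}

window = context_window(context, fake_result["start"], fake_result["end"], radius=10)
print(window)
```

With a real result, the call would presumably look like `context_window(context, qa_result['start'], qa_result['end'])`.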


## Step 2b: Alternative Method - Structured Multi-Part Explanation
For even more detailed and structured answers.

In [6]:
def generate_structured_explanation(url, question):
    """
    Generate structured, multi-part explanations with:
    1. Direct Answer
    2. Context and Background
    3. Detailed Explanation
    4. Related Information

    Args:
        url (str): Website URL
        question (str): User's question

    Returns:
        str: Structured multi-part explanation
    """

    # ===== INPUT VALIDATION =====
    if not url or not url.startswith(('http://', 'https://')):
        return "⚠️ Error: Please provide a valid URL"

    if not question or len(question.strip()) < 3:
        return "⚠️ Error: Please enter a valid question"

    print(f"🔄 Fetching and processing content...")
    context = scrape_url(url)

    if context.startswith("Error"):
        return context

    if len(context) < 100:
        return "⚠️ Not enough content in the webpage"

    # Get the direct answer
    try:
        qa_result = qa_pipeline(question=question, context=context)
        direct_answer = qa_result['answer']
        confidence = qa_result['score']
    except Exception:
        return "⚠️ Could not process the question"

    # Extract full sentences containing the answer
    sentences = re.split(r'(?<=[.!?])\s+', context)

    # Find sentences with the answer
    answer_sentences = []
    for sent in sentences:
        if direct_answer.lower() in sent.lower():
            answer_sentences.append(sent.strip())

    # Find context sentences (sentences before and after)
    context_sentences = []
    for i, sent in enumerate(sentences):
        if direct_answer.lower() in sent.lower():
            if i > 0:
                context_sentences.append(sentences[i-1].strip())
            context_sentences.append(sent.strip())
            if i < len(sentences) - 1:
                context_sentences.append(sentences[i+1].strip())
            break

    # Remove duplicates while preserving order
    context_sentences = list(dict.fromkeys(context_sentences))

    # Build structured response
    structured_response = f"""
{'='*60}
❓ QUESTION: {question}
{'='*60}

1️⃣ **DIRECT ANSWER:**
   → {direct_answer}

2️⃣ **CONTEXT & BACKGROUND:**
   {' '.join(context_sentences[:2]) if context_sentences else 'Context not available'}

3️⃣ **DETAILED EXPLANATION:**
   {' '.join(context_sentences) if context_sentences else 'Full context not available'}

4️⃣ **KEY INFORMATION:**
   • Answer Source: Web Content
   • Confidence Level: {confidence:.2%}
   • Answer Type: Extracted from Article

{'='*60}
"""

    return structured_response

print("✅ Structured explanation function defined!")

✅ Structured explanation function defined!
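
The neighbor-sentence lookup used by `generate_structured_explanation` can be exercised on a plain string. This is a minimal sketch under the assumption of a hypothetical helper, `neighbor_sentences`; it uses the same lookbehind split, so each sentence keeps its closing punctuation, and it returns the first matching sentence together with the one before and the one after it.

```python
import re

def neighbor_sentences(context: str, answer: str) -> list[str]:
    """Return the first sentence containing `answer`, plus its immediate neighbors."""
    # Split after ., !, or ? while keeping the punctuation attached
    sentences = re.split(r'(?<=[.!?])\s+', context.strip())
    for i, sent in enumerate(sentences):
        if answer.lower() in sent.lower():
            # Slice bounds are clamped so the first and last sentences work too
            return sentences[max(0, i - 1): i + 2]
    return []

text = "Cats are mammals. They purr when content. Dogs bark! Birds can fly."
print(neighbor_sentences(text, "purr"))
```

A punctuation-based split like this one also breaks after abbreviations such as "Dr." or "e.g.", so the returned neighbors may be sentence fragments on some pages.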


## Step 3: Interactive Gradio Interface
Launch the advanced question-answering interface with detailed explanations.

In [7]:
# Create advanced Gradio interface with two methods

with gr.Blocks(title="URL-Based Detailed Q&A System") as demo:
    gr.Markdown("""
    # 🌐 URL-Based Question Answering with Detailed Explanations

    Ask questions about any website and get **comprehensive, detailed explanations** instead of one-word answers!

    **How it works:**
    1. Paste a URL (news article, Wikipedia, blog, documentation, etc.)
    2. Ask your question
    3. Get a detailed explanation with full context
    """)

    with gr.Row():
        with gr.Column():
            url_input = gr.Textbox(
                label="🔗 Website URL",
                placeholder="https://example.com/article",
                lines=1,
                info="Enter any public website URL"
            )

            question_input = gr.Textbox(
                label="❓ Your Question",
                placeholder="Ask a detailed question about the content...",
                lines=2,
                info="Ask specific questions to get detailed explanations"
            )

            with gr.Row():
                btn_detailed = gr.Button("📖 Get Detailed Answer", variant="primary")
                btn_structured = gr.Button("📋 Get Structured Answer", variant="secondary")

        with gr.Column():
            output = gr.Textbox(
                label="💡 Detailed Explanation",
                lines=10,
                interactive=False
            )

    # Examples with proper explanations
    gr.Examples(
        examples=[
            ["https://en.wikipedia.org/wiki/Machine_learning", "What is machine learning and how does it work?"],
            ["https://en.wikipedia.org/wiki/Python_(programming_language)", "Who created Python and when was it first released?"],
            ["https://en.wikipedia.org/wiki/Artificial_intelligence", "What is artificial intelligence and its applications?"],
            ["https://en.wikipedia.org/wiki/Deep_learning", "What is deep learning and how is it different from machine learning?"]
        ],
        inputs=[url_input, question_input],
        outputs=output,
        fn=generate_detailed_answer,
        cache_examples=False
    )

    # Button click handlers
    btn_detailed.click(
        fn=generate_detailed_answer,
        inputs=[url_input, question_input],
        outputs=output
    )

    btn_structured.click(
        fn=generate_structured_explanation,
        inputs=[url_input, question_input],
        outputs=output
    )

    gr.Markdown("""
    ### ℹ️ Tips for Best Results:

    - **Be specific** with your questions for better explanations
    - **Use article-rich URLs** (Wikipedia, news sites, blogs) for best results
    - **Detailed Question Example:** Instead of "What is AI?", ask "What is artificial intelligence and what are its real-world applications?"
    - **Try both buttons** (Detailed vs Structured) to see different explanation formats

    ### Supported Content Types:
    - ✅ Wikipedia articles
    - ✅ News articles
    - ✅ Blog posts
    - ✅ Documentation pages
    - ✅ Educational websites
    - ⚠️ JavaScript-heavy sites may not work well
    """)

# Launch interface
import sys
print("\n🚀 Launching Gradio Interface...")
if "google.colab" in sys.modules:
    demo.launch(share=True, debug=True)
else:
    demo.launch()


🚀 Launching Gradio Interface...


## 📚 Usage Guide & Tips

### Two Answer Formats Available:

#### 1. **Detailed Answer** 📖
- Quotes roughly 500 characters of page text on either side of the answer
- Shows the direct answer and its confidence score
- Best for comprehensive understanding

#### 2. **Structured Answer** 📋
- Organized in 4 sections
- Direct Answer → Context & Background → Detailed Explanation → Key Information
- Quotes only the answer sentence and its immediate neighbors
- Best for quick reference

### Example Questions That Generate Detailed Answers:

| Question | Type | Expected Answer Style |
|----------|------|--------|
| "What is machine learning?" | Concept | Full explanation with examples |
| "How does photosynthesis work?" | Process | Step-by-step explanation |
| "When was this founded?" | Factual | Context + relevant information |
| "Compare X and Y" | Comparison | Multiple sentences comparing both |

### Troubleshooting:

❌ **"Not enough content extracted"**
- Try a URL with more text (avoid short pages)

❌ **"Answer seems incomplete"**
- Try rephrasing your question to be more specific
- Try a different section of the same website

❌ **"Connection timeout"**
- Check your internet connection
- Try a different URL

### Technical Details:

- **Models Used:**
  - RoBERTa Base SQuAD2 (Question Answering)
  - BART Large CNN (Summarization; loaded but not called by the answer functions)

- **Processing:**
  - Extracts text from HTML
  - Removes navigation/styling
  - Identifies relevant sentences
  - Provides confidence scores
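
The confidence score reported by the question-answering model is a value between 0 and 1. A minimal sketch of how a caller might flag weak answers is shown below; the helper name `describe_confidence` and the 0.30 cutoff are assumptions chosen for illustration, not values used by the notebook.

```python
def describe_confidence(score: float, cutoff: float = 0.30) -> str:
    """Format a score as a percentage and flag it when below the (hypothetical) cutoff."""
    label = f"{score:.2%}"
    if score < cutoff:
        return f"{label} (low confidence; consider rephrasing the question)"
    return label

print(describe_confidence(0.8712))  # prints: 87.12%
print(describe_confidence(0.05))    # prints: 5.00% (low confidence; consider rephrasing the question)
```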

---
