# README for RAG Tool with Optimized Chunking for URL Content

## Table of Contents
1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Usage](#usage)
4. [Project Structure](#project-structure)
5. [Content Retrieval](#content-retrieval)
6. [Chunking Algorithms](#chunking-algorithms)
7. [Firework API Integration](#firework-api-integration)
8. [Testing and Evaluation](#testing-and-evaluation)

## Introduction
The **Retrieval-Augmented Generation (RAG) Tool** answers questions from web content retrieved from a URL. It splits the page into chunks using one of several chunking strategies, sends each chunk to the Fireworks AI API to generate a candidate answer, and then asks the model to pick the best candidate.

## Installation
To run this project, ensure you have the following Python libraries installed:
- `requests`
- `beautifulsoup4`
- `nltk`
- `fireworks-ai` (imported as `fireworks`)
- `scikit-learn`

You can install these dependencies using pip:
```bash
pip install requests beautifulsoup4 nltk fireworks-ai scikit-learn
```

To upgrade an existing Fireworks client:

```bash
pip install --upgrade fireworks-ai
```

## Usage

- Import the necessary modules from your project files.
- Set your API keys for both the Firework API and SerpAPI (if used).
- Call the main function with the URL and the question you want to answer.

```python
url = "https://example.com"
question = "What is the topic about?"
api_key_firework = "your_api_key_here"
final_answer = generate_answer(url, question, api_key_firework)
print("Final Answer:", final_answer)
```
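One way to handle the API-key step without hardcoding the key is to read it from an environment variable. This is a minimal sketch: the helper `load_api_key` and the variable name `FIREWORKS_API_KEY` are assumptions made for illustration, not part of the project files.

```python
import os

def load_api_key(env_var="FIREWORKS_API_KEY"):
    """Return the API key stored in the given environment variable.

    Both this helper and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

The result can then be passed wherever the README uses `api_key_firework`, which keeps real keys out of notebooks and version control.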

## Project Structure

- `serp_api.py`: Retrieves content from URLs (using BeautifulSoup for direct scraping).
- `content_preprocessor.py`: Uses BeautifulSoup to extract and preprocess HTML data.
- `chunking_algorithms.py`: Implements the chunking methods (fixed-size, semantic, question-based).
- `firework_api.py`: Contains the Fireworks API calls that generate answers and select the best one.
- `testing_suite.py`: Contains functions for testing and evaluating chunking methods.
- `main.py`: The main script to run the RAG tool, integrating all components.

## Content Retrieval

The `get_url_content` function fetches the raw HTML of the specified URL using `requests` and raises an exception if the response status is not 200. The `clean_content` function (defined further down) then uses BeautifulSoup to extract the page text and collapse whitespace.

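```python
import requests

def get_url_content(url):
    """Fetch the raw HTML for a URL, raising on any non-200 response."""
    # A timeout keeps the call from hanging on an unresponsive server
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        raise Exception(f"Failed to retrieve content. Status code: {response.status_code}")
    return response.text
```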

## Chunking Algorithms

This project implements three chunking methods:

1. Fixed-Size Chunking: Divides content into chunks of a specified number of words (500 by default).
2. Semantic Chunking: Splits content into sentences with NLTK and groups whole sentences into chunks of up to the specified number of words, so no sentence is cut in half.
3. Question-Based Chunking: Builds semantic chunks, then ranks them against the question using TF-IDF and cosine similarity and keeps the `top_n` most relevant ones.

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')      # Ensure the tokenizer is available
nltk.download('punkt_tab')  # Recent NLTK releases load this resource instead

# Fixed-Size Chunking
def fixed_size_chunking(content, chunk_size=500):
    words = content.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Semantic Chunking
def semantic_chunking(content, chunk_size=500):
    sentences = nltk.sent_tokenize(content)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        sentence_length = len(sentence.split())
        # Only flush a non-empty chunk; otherwise an oversized first
        # sentence would produce an empty string as the first chunk
        if current_chunk and current_length + sentence_length > chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentence]
            current_length = sentence_length
        else:
            current_chunk.append(sentence)
            current_length += sentence_length

    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

# Question-Based Chunking
def question_based_chunking(content, question, top_n=5):
    chunks = semantic_chunking(content)
    if not chunks:
        return []
    # Row 0 is the question; the remaining rows are the chunks
    tfidf = TfidfVectorizer().fit_transform([question] + chunks)
    cosine_similarities = cosine_similarity(tfidf[0:1], tfidf[1:]).flatten()
    # Indices of the most similar chunks, best first
    relevant_indices = cosine_similarities.argsort()[::-1][:top_n]
    return [chunks[i] for i in relevant_indices]
```
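To make the ranking step concrete, here is a dependency-free sketch of the same idea. It is not the project's implementation: raw term counts with a tiny stop-word list stand in for TF-IDF weighting, and the names `tokenize`, `cosine`, `rank_chunks`, and `STOPWORDS` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the project)
STOPWORDS = {"what", "are", "the", "of", "a", "an", "and", "is", "to", "in"}

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_chunks(chunks, question, top_n=5):
    """Return the top_n chunks most similar to the question, best first."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(c))), i) for i, c in enumerate(chunks)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties keep document order
    return [chunks[i] for _, i in scored[:top_n]]

chunks = [
    "Climate scientists track glaciers.",
    "The recipe needs two cups of flour.",
    "Climate change impacts include heat waves and rising sea levels.",
]
top = rank_chunks(chunks, "What are the impacts of climate change?", top_n=2)
print(top)
```

The third chunk shares three terms with the question and ranks first; the first chunk shares only "climate" and ranks second; the recipe chunk shares nothing and is dropped.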



```python
from bs4 import BeautifulSoup

def clean_content(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    text = soup.get_text()
    cleaned_text = ' '.join(text.split())
    return cleaned_text
```
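Note that `clean_content` keeps every text node, including scripts, styles, and navigation menus. A possible refinement is to skip those elements before joining the text. The sketch below illustrates the idea with the standard library's `html.parser` rather than BeautifulSoup, so it runs without extra dependencies; the class `VisibleTextParser`, the function `visible_text`, and the choice of `SKIP_TAGS` are assumptions, and the depth counter assumes skipped tags are properly closed.

```python
from html.parser import HTMLParser

# Elements whose text is usually not part of the article body (an assumption)
SKIP_TAGS = {"script", "style", "nav", "header", "footer"}

class VisibleTextParser(HTMLParser):
    """Collect text nodes that are not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def visible_text(html_content):
    """Return the page text with skipped elements removed and whitespace collapsed."""
    parser = VisibleTextParser()
    parser.feed(html_content)
    parser.close()
    # Joining with a space keeps adjacent elements from fusing into one word
    return " ".join(" ".join(parser.parts).split())
```

Fewer boilerplate words per page also means fewer chunks, and therefore fewer API calls, for the fixed-size and semantic methods.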

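## Firework API Integration

The `generate_answer_api` function sends one chunk of content and the user's question to the Fireworks API. It instructs the model to answer using only that chunk and to respond with 'Not found' when the chunk does not contain the answer. The `select_best_answer_api` function then passes the collected answers back to the model and asks it to return the best one for the question.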

```python
from fireworks.client import Fireworks

def generate_answer_api(chunk, question, api_key):
    client = Fireworks(api_key=api_key)

    # Ask the model to answer from this chunk only
    response = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",
        messages=[{
            "role": "user",
            "content": f"Answer this question without any newlines: \"{question}\", "
                       f"and based only on the following content: \"{chunk}\". "
                       "If the answer cannot be found in the content, respond with 'Not found'."
        }]
    )

    return response.choices[0].message.content

def select_best_answer_api(answers, question, api_key):
    client = Fireworks(api_key=api_key)

    # Ask the model to pick the best of the per-chunk answers
    response = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",
        messages=[{
            "role": "user",
            "content": f"For this question: \"{question}\" "
                       f"just return the best answer in this list: \"{answers}\". "
                       "If the answer cannot be found in the list, respond with 'Not found.'"
        }]
    )
    return response.choices[0].message.content
```

```python
def generate_answer(url, question, api_key_firework, chunking_method='fixed'):
    # Retrieve and clean content
    try:
        raw_content = get_url_content(url)
    except Exception:
        return "Can't read url data"

    clean_text = clean_content(raw_content)

    # Apply the chosen chunking method (fixed-size, semantic, or question-based)
    if chunking_method == 'fixed':
        chunks = fixed_size_chunking(clean_text)
    elif chunking_method == 'semantic':
        chunks = semantic_chunking(clean_text)
    elif chunking_method == 'question':
        chunks = question_based_chunking(clean_text, question)
    else:
        raise ValueError("Invalid chunking method")

    # Collect one answer per chunk
    answers = []
    for chunk in chunks:
        print("chunk:", chunk)
        answer = generate_answer_api(chunk, question, api_key_firework)
        print('answer:', answer)
        answers.append(answer)

    if not answers:
        return "No answer found"

    # Let the model choose the best of the per-chunk answers
    return select_best_answer_api(answers, question, api_key_firework)
```
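Because each chunk that lacks the answer yields a 'Not found' response, the list passed to the selection step can be dominated by non-answers. One optional refinement, sketched here under the assumption that the model follows the prompt's 'Not found' wording, is to drop those entries and duplicates first. The helper `filter_answers` is hypothetical and not part of the project.

```python
def filter_answers(answers):
    """Drop empty and 'Not found' responses and remove duplicates, keeping order."""
    kept, seen = [], set()
    for answer in answers:
        # Normalize away surrounding whitespace, quotes, and periods before comparing
        normalized = answer.strip().strip(".'\"").lower()
        if not normalized or normalized == "not found":
            continue
        if normalized in seen:
            continue
        seen.add(normalized)
        kept.append(answer.strip())
    return kept
```

If the filtered list is empty, the caller can return "No answer found" directly and skip the final API call.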

## Testing and Evaluation

The project includes an evaluation framework to test different chunking methods. It tracks:

- Accuracy: How well the generated answers match expected outcomes.
- Efficiency: Time taken and number of chunks processed.
- Context Retention: Assessment of whether relevant context is preserved in the responses.

```python
import time
from sklearn.metrics import accuracy_score

def evaluate_chunking_methods(test_cases, api_key_firework,
                              chunking_methods=('fixed', 'semantic', 'question')):
    results = []

    for test in test_cases:
        url = test["url"]
        question = test["question"]
        print('url:', url)
        print('question:', question)

        # Initialize metrics storage for this test case
        method_results = {"url": url, "question": question}

        for method in chunking_methods:
            print("method:", method)

            # Track start time for efficiency
            start_time = time.time()

            # Generate the answer based on the chosen chunking method
            answer = generate_answer(url, question, api_key_firework, chunking_method=method)
            print("best answer:", answer)
            elapsed_time = time.time() - start_time

            # With a single pair, accuracy_score is an exact-match check (1.0 or 0.0)
            expected_answer = test.get("expected_answer", None)
            accuracy = None
            if expected_answer:
                accuracy = accuracy_score([expected_answer], [answer])

            # Store the results for this method
            method_results[method] = {
                "answer": answer,
                "time": elapsed_time,  # Efficiency metric
                "accuracy": accuracy if accuracy is not None else "Not available"
            }

        results.append(method_results)

    return results

def report_results(results):
    for result in results:
        print(f"URL: {result['url']}")
        print(f"Question: {result['question']}")

        for method, metrics in result.items():
            if method not in ['url', 'question']:
                print(f"  Method: {method.capitalize()}")
                print(f"    Answer: {metrics['answer']}")
                print(f"    Time: {metrics['time']} seconds")
                print(f"    Accuracy: {metrics['accuracy']}")
                print()
```
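Exact matching is strict for free-text answers: a correct answer phrased differently from `expected_answer` scores 0. As a possible complement, the sketch below computes a token-overlap F1 score between the two strings. The function `token_f1` is hypothetical and not part of the project's test suite.

```python
import re
from collections import Counter

def token_f1(expected, predicted):
    """Token-level F1 between an expected and a predicted answer (0.0 to 1.0)."""
    exp = re.findall(r"\w+", expected.lower())
    pred = re.findall(r"\w+", predicted.lower())
    if not exp or not pred:
        # Two empty answers match; an empty answer never matches a non-empty one
        return float(exp == pred)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(exp) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)
```

For example, comparing the expected answer "rising sea levels" with "sea levels are rising fast" gives precision 3/5 and recall 3/3, so the score is about 0.75 where exact matching would give 0.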

```python
# Replace the placeholder with your own key; never commit real keys
api_key_firework = "your_api_key_here"

test_cases = [
    {
        "url": "https://en.wikipedia.org/wiki/Climate_change",
        "question": "What are the impacts of climate change?"
    },
    {
        "url": "https://www.cdc.gov/physicalactivity/basics/pa-health/index.htm",
        "question": "What are the health benefits of physical activity?"
    }
]

results = evaluate_chunking_methods(test_cases, api_key_firework)
report_results(results)
```

url: https://en.wikipedia.org/wiki/Climate_change
question: What are the impacts of climate change?
method: fixed
chunk: Climate change - Wikipedia Jump to content Main menu Main menu move to sidebar hide Navigation Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us Contribute HelpLearn to editCommunity portalRecent changesUpload file Search Search Donate Appearance Create account Log in Personal tools Create account Log in Pages for logged out editors learn more ContributionsTalk Contents move to sidebar hide (Top) 1 Terminology 2 Global temperature rise Toggle Global temperature rise subsection 2.1 Temperature records prior to global warming 2.2 Warming since the Industrial Revolution 2.2.1 Differences by region 2.3 Future global temperatures 3 Causes of recent global temperature rise Toggle Causes of recent global temperature rise subsection 3.1 Greenhouse gases 3.2 Land surface changes 3.3 Other factors 3.3.1 Aerosols and clouds 3.3.2 Solar and volcanic activity