## Downloading Dependencies


In [None]:
!pip install -U langchain-community
!pip install tavily-python



In [None]:
!pip install langchain
!pip install chromadb
!pip install sentence_transformers
!pip install huggingface_hub


### Model Selection Rationale ###

For this project, I selected `sentence-transformers/multi-qa-mpnet-base-dot-v1` for embeddings and a lightweight LLM (T5-small) for three reasons:

   - Compute efficiency: both models are small enough to run in resource-constrained environments.
   - Scalability: the system runs without requiring high-end GPUs.
   - Accessibility: open-source models make the project easy to replicate and improve without licensing restrictions.

However, the system is modular: the embedding model and the LLM are each loaded in a single cell, so stronger embeddings (e.g., text-embedding-ada-002) or stronger LLMs (e.g., GPT-4 or Mistral) can be swapped in if additional resources become available. The lightweight defaults keep the system usable today without ruling out later upgrades.
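
One way to keep the model choices swappable is to collect them in a single configuration object. The sketch below is only an assumption about how that could look: `ModelConfig` and its field names are hypothetical and are not used elsewhere in this notebook, and `t5-base` is just an example of a larger checkpoint.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for the model names used by the pipeline."""
    embedding_model: str = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
    llm_model: str = "t5-small"


# Default, lightweight setup described above
default_config = ModelConfig()

# Swapping in a larger LLM changes one field; the embedding choice is untouched
larger_config = replace(default_config, llm_model="t5-base")

print(default_config)
print(larger_config)
```

The loading cells could then read `config.llm_model` and `config.embedding_model` instead of hard-coded strings.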

In [None]:
# (Part of Agent B)
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the T5-small model and tokenizer
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


# **Agent A (Web Crawling Agent)**


This section reads stored credentials from Google Colab secrets and initializes a Tavily API client:

   - `userdata.get('HF_TOKEN')` reads the Hugging Face token. The return value is not assigned to a variable, so the call only checks that the secret is readable.
   - `api_key = userdata.get('TAVILY_API_KEY')` fetches the Tavily API key.
   - `tavily_client = TavilyClient(api_key=api_key)` initializes a Tavily client with that key.

After this cell, `tavily_client` can make authenticated requests to Tavily. The Hugging Face token is not passed explicitly to any library in the cells shown here.
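
Outside Colab, `google.colab` is not importable, so the cell below would fail. As a minimal sketch, assuming the same secret names are also exported as environment variables, a fallback could look like this. `load_secret` is a hypothetical helper, not part of the notebook.

```python
import os


def load_secret(name):
    """Return a secret from Colab if available, else from the environment.

    Hypothetical helper: illustrates a fallback, not the notebook's own code.
    """
    try:
        # google.colab only exists inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            value = userdata.get(name)
            if value:
                return value
        except Exception:
            # Secret missing or notebook access not granted: fall through
            pass

    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set")
    return value
```

It could then be used as `api_key = load_secret('TAVILY_API_KEY')` before constructing the client.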

In [None]:
from google.colab import userdata
from tavily import TavilyClient

userdata.get('HF_TOKEN')
api_key = userdata.get('TAVILY_API_KEY')
tavily_client = TavilyClient(api_key=api_key)

### Initializing Hugging Face Embeddings

This code loads a pre-trained embedding model from Hugging Face for text similarity tasks:

- **Library:** `langchain.embeddings`
- **Model Used:** `sentence-transformers/multi-qa-mpnet-base-dot-v1`
- **Purpose:** Converts text into vector embeddings for similarity search and retrieval.
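
The `dot` in the model name refers to dot-product scoring, which, unlike cosine similarity, is sensitive to vector length. The toy example below illustrates the difference with hand-written 3-dimensional vectors; real embeddings from this model have far more dimensions, and the numbers here are made up.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for real embeddings
query = [1.0, 0.0, 1.0]
doc_short = [1.0, 0.0, 1.0]
doc_long = [2.0, 0.0, 2.0]  # same direction, twice the magnitude

# Dot product rewards the longer vector; cosine treats both the same
print(dot(query, doc_short), dot(query, doc_long))
print(cosine(query, doc_short), cosine(query, doc_long))
```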



In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

# Initialize the embedding model (sentence-transformers/multi-qa-mpnet-base-dot-v1)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/multi-qa-mpnet-base-dot-v1")


### User Input and Knowledge Source URL Generation

#### `get_inputs()`
- Prompts the user to enter **topics or URLs** separated by commas.
- Identifies **URLs** (if they contain `"http://"` or `"https://"`) and classifies others as **topics**.
- Returns a structured list with each item labeled as either `"url"` or `"topic"`.
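
For testing without an interactive prompt, the same classification can be sketched as a pure function. `classify_items` is a hypothetical name, and this variant is slightly stricter than the description above: it uses `urllib.parse.urlparse` and requires the item to start with an `http` or `https` scheme rather than merely contain one.

```python
from urllib.parse import urlparse


def classify_items(raw):
    """Split a comma-separated string and label each item as 'url' or 'topic'.

    Hypothetical, non-interactive variant of get_inputs().
    """
    data = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries such as a trailing comma
        parsed = urlparse(item)
        is_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
        data.append({"type": "url" if is_url else "topic", "value": item})
    return data


print(classify_items("france, https://example.com/page, "))
```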

#### `get_knowledge_urls(topic)`
- Builds one URL per **knowledge source** for a given topic.
- Uses **Wikipedia, Simple Wikipedia, Britannica, National Geographic, PubMed, ArXiv, and DuckDuckGo News**.
- Inserts the topic into each source's article or search URL pattern.


In [None]:
def get_inputs():
    """
    Prompts the user to input topics or URLs separated by commas.
    Items that include 'http://' or 'https://' are treated as URLs,
    and all others are considered topics.
    """
    user_input = input("Enter topics or URLs separated by commas: ")
    # Strip whitespace and skip empty entries (e.g. from a trailing comma)
    items = [item.strip() for item in user_input.split(',') if item.strip()]
    data = []
    for item in items:
        if "http://" in item or "https://" in item:
            data.append({"type": "url", "value": item})
        else:
            data.append({"type": "topic", "value": item})
    return data

from urllib.parse import quote, quote_plus

def get_knowledge_urls(topic):
    """
    Generates URLs for a given topic from multiple trusted knowledge sources.
    """
    # Wikipedia article paths use underscores for spaces
    wiki_title = quote(topic.replace(' ', '_'))
    # Search queries are URL-encoded (spaces become '+')
    query = quote_plus(topic)
    sources = {
        "Wikipedia": f"https://en.wikipedia.org/wiki/{wiki_title}",
        "Simple Wikipedia": f"https://simple.wikipedia.org/wiki/{wiki_title}",
        "Britannica": f"https://www.britannica.com/search?query={query}",
        "National Geographic": f"https://www.nationalgeographic.com/search?q={query}",
        "PubMed": f"https://pubmed.ncbi.nlm.nih.gov/?term={query}",
        "ArXiv": f"https://arxiv.org/search/?query={query}&searchtype=all&abstracts=show&order=-announced_date_first&size=50",
        "DuckDuckGo_fornews": f"https://duckduckgo.com/?t=h_&q={query}&iar=news&ia=news"
    }
    return sources
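
Topics with spaces or reserved characters need encoding before they are placed in a URL. The sketch below uses only the standard library to show how a path segment and a query-string value are typically encoded; the sample topic is made up for illustration.

```python
from urllib.parse import quote, quote_plus

topic = "machine learning & AI"

# Article paths: underscores for spaces, other reserved characters percent-encoded
wiki_path = quote(topic.replace(" ", "_"))

# Query-string values: spaces become '+', reserved characters percent-encoded
query_value = quote_plus(topic)

print(f"https://en.wikipedia.org/wiki/{wiki_path}")
print(f"https://pubmed.ncbi.nlm.nih.gov/?term={query_value}")
```

Without encoding, the `&` in this topic would be read as a separator between query parameters.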



### Processing User Input into a List of URLs

- Calls `get_inputs()` to retrieve **topics and URLs** from user input.
- Initializes an empty list `result` to store URLs.
- Iterates through each item:
  - If the item is a **topic**, it fetches **relevant knowledge source URLs** using `get_knowledge_urls()` and adds them to `result`.
  - If the item is already a **URL**, it is directly added to `result`.
- Prints and stores the final list of URLs in the variable `urls`.


In [None]:

input_data = get_inputs()
result = []

for item in input_data:
    if item["type"] == "topic":
        knowledge_urls = get_knowledge_urls(item["value"])
        # Extend the result list with all URLs from the knowledge sources
        result.extend(knowledge_urls.values())
    else:
        # Append the URL directly
        result.append(item["value"])

# Output the result list, which now contains only URLs
print(result)
urls = result


Enter topics or URLs separated by commas: france
['https://en.wikipedia.org/wiki/france', 'https://simple.wikipedia.org/wiki/france', 'https://www.britannica.com/search?query=france', 'https://www.nationalgeographic.com/search?q=france', 'https://pubmed.ncbi.nlm.nih.gov/?term=france', 'https://arxiv.org/search/?query=france&searchtype=all&abstracts=show&order=-announced_date_first&size=50', 'https://duckduckgo.com/?t=h_&q=france&iar=news&ia=news']
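
If a topic and an explicit URL overlap, or the same item is entered twice, the list can contain duplicates that would be scraped more than once. A minimal sketch of order-preserving de-duplication follows; `dedupe_preserving_order` is a hypothetical helper, not part of the notebook.

```python
def dedupe_preserving_order(urls):
    """Remove repeated URLs while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


example = [
    "https://en.wikipedia.org/wiki/france",
    "https://www.britannica.com/search?query=france",
    "https://en.wikipedia.org/wiki/france",  # entered again as an explicit URL
]
print(dedupe_preserving_order(example))
```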


### Web Scraping and Text Extraction

#### **1. Extracting Text from Webpages**
- Uses **Tavily API** (`tavily_client.extract(url)`) to fetch webpage content.
- Processes the HTML with **BeautifulSoup** to remove unwanted elements (`table`, `script`, `style`, etc.).
- Extracts meaningful text:
  - **Prefers:** `<article>` content.
  - **Fallbacks:** Main content div or all paragraph tags.

#### **2. Cleaning Extracted Text**
- **Removes:** HTML tags, scripts, styles, and URLs.
- **Keeps:** Alphanumeric text with basic punctuation.
- **Normalizes:** Extra spaces and special characters.
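
The order of these steps matters: script and style blocks have to be removed while their tags are still present, otherwise their contents survive as plain text. The worked example below is a simplified sketch of the rules above, applied to a made-up input string; `clean_sample` is a hypothetical name.

```python
import re


def clean_sample(text):
    """Simplified version of the cleaning rules listed above."""
    # Drop script/style blocks together with their contents
    text = re.sub(r"<(script|style)\b.*?</\1>", " ", text,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop any remaining tags, then URLs
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    # Keep letters, digits, spaces, and basic punctuation
    text = re.sub(r"[^A-Za-z0-9.,!?'\" ]+", " ", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()


sample = "<p>Paris is the capital.</p><script>var x = 1;</script> See https://example.com [1]"
print(clean_sample(sample))
```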

#### **3. Storing Scraped Data**
- Extracted and cleaned text is stored in `documents[]`, with each entry containing:
  - `"url"` – The source webpage.
  - `"text"` – The cleaned content.

This ensures a structured and readable dataset for further processing.


In [None]:
from bs4 import BeautifulSoup

# `urls` is the list of URLs built in the previous cell


def extract_text(soup):
    # Removing unnecessary elements
    for tag in soup(["table", "nav", "footer", "aside", "form", "script", "style"]):
        tag.decompose()  # Remove from DOM

    # Extracting article content first
    article = soup.find("article")
    if article:
        return article.get_text(separator=" ")

    # Extracting main content div
    main_div = soup.find("div", {"id": "content"}) or soup.find("div", {"class": "main-content"})
    if main_div:
        return main_div.get_text(separator=" ")

    # Fallback: Extract text from all paragraphs
    paragraphs = [p.get_text() for p in soup.find_all("p")]
    if paragraphs:
        return " ".join(paragraphs)

    # Fallback: Extract everything else
    return soup.get_text(separator=" ")


documents = []


import re

def clean_text_advanced(text):
    # Remove JS and CSS blocks first, while their tags are still present
    text = re.sub(r'<script.*?</script>', '', text, flags=re.DOTALL)
    text = re.sub(r'<style.*?</style>', '', text, flags=re.DOTALL)

    # Remove the remaining HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)

    # Remove URLs
    text = re.sub(r'http[s]?://\S+', '', text)

    # Remove special characters except basic punctuation
    text = re.sub(r'[^A-Za-z0-9.,!?\'" ]+', ' ', text)

    # Normalize spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

for url in urls:
    response = tavily_client.extract(url)

    # raw_content may still contain markup, so pass it through BeautifulSoup
    results = response.get('results')
    if results:
        extracted_text = results[0]['raw_content']

        soup = BeautifulSoup(extracted_text, 'html.parser')
        raw_text = extract_text(soup)
        text = clean_text_advanced(raw_text)

        print(text)  # Preview of the cleaned text (tables removed)
        documents.append({
            "url": url,
            "text": text
        })
    else:
        print(f"No results found for URL: {url}")

France France, X officially the French Republic, XI is a country located primarily in Western Europe. Its overseas regions and territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean, giving it one of the largest discontiguous exclusive economic zones in the world. Metropolitan France shares borders with Belgium and Luxembourg to the north, Germany to the northeast, Switzerland to the east, Italy and Monaco to the southeast, Andorra and Spain to the south, and a maritime border with the United Kingdom to the northwest. Its metropolitan area extends from the Rhine to the Atlantic Ocean and from the Mediterranean Sea to the English Channel and the North Sea. Its eighteen integral regions five of which are overseas span a combined area of 643,801 km2 248,573 sq mi and have a total population of nearly 68.4 million as of January 2024 update . France is a semi presidential r
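
The loop above stops at the first URL whose extraction raises an exception. A hedged sketch of a more defensive wrapper is shown below. `extract_raw_content` and `FakeClient` are hypothetical names; the stand-in client mimics only the `{"results": [{"raw_content": ...}]}` shape that the loop above relies on, so the example runs without an API key.

```python
def extract_raw_content(client, url):
    """Return the raw content for `url`, or None if extraction fails.

    `client` is any object with an `extract(url)` method returning a dict
    shaped like {"results": [{"raw_content": ...}]}.
    """
    try:
        response = client.extract(url)
    except Exception as exc:  # network errors, invalid key, rate limits, ...
        print(f"Extraction failed for {url}: {exc}")
        return None

    results = response.get("results") or []
    if not results:
        print(f"No results found for URL: {url}")
        return None
    return results[0].get("raw_content")


class FakeClient:
    """Stand-in for TavilyClient so the sketch runs offline."""

    def extract(self, url):
        if "bad" in url:
            raise RuntimeError("simulated failure")
        if "empty" in url:
            return {"results": []}
        return {"results": [{"raw_content": f"content of {url}"}]}


client = FakeClient()
for url in ["https://example.com/good", "https://example.com/bad", "https://example.com/empty"]:
    print(url, "->", extract_raw_content(client, url))
```

With this wrapper, one failing source does not prevent the remaining URLs from being scraped.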


# **Agent B (QnA Agent)**

### Text Chunking and Document Preparation for Embeddings

#### **1. Splitting Text into Chunks**
- Uses `CharacterTextSplitter` to break large text into **manageable chunks**.
- **Settings:**
  - Splits at `"."` (an approximate sentence boundary).
  - Each chunk targets **at most 1000 characters**, with a **50-character overlap** between neighbouring chunks to preserve context.
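
To make the two settings concrete, here is a simplified fixed-window chunker. It is not LangChain's algorithm: `CharacterTextSplitter` first splits on the separator and then merges the pieces up to the size limit, whereas this sketch ignores separators entirely and only illustrates what chunk size and overlap mean. `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=50):
    """Fixed-window chunking.

    Each chunk starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The last three characters of each chunk reappear at the start of the next one, which is how a sentence cut at a chunk boundary keeps some of its context.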

#### **2. Processing Scraped Data**
- Iterates through `documents[]` (scraped webpages).
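- Passes the stored text through `BeautifulSoup` again. The text was already cleaned during scraping, so this mainly guards against leftover markup.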
- **Splits the text** into smaller chunks for better embedding quality.

#### **3. Creating Document Objects**
- Each chunk is wrapped in a `Document` object from **LangChain**.
- **Metadata:** Stores the source URL for reference.
- All processed chunks are stored in `all_documents[]`.

This step optimizes the data for vector embeddings and retrieval.


In [None]:
from bs4 import BeautifulSoup
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Initializing a text splitter for breaking text into chunks
text_splitter = CharacterTextSplitter(separator=".", chunk_size=1000, chunk_overlap=50)
all_documents = []

# Processing each scraped page
for page in documents:
    url = page["url"]
    html_content = page["text"]

    # Parsing the HTML to extract plain text using BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")
    text = soup.get_text(separator=".")

    # Splitting the extracted text into smaller chunks for better embedding quality
    chunks = text_splitter.split_text(text)
    print("Number of chunks:", len(chunks))


    # Creating Document objects for each chunk and include the URL as metadata
    for chunk in chunks:
        all_documents.append(Document(page_content=chunk, metadata={"source": url}))



Number of chunks: 106
Number of chunks: 26
Number of chunks: 7
Number of chunks: 11
Number of chunks: 106


### Creating and Storing Embeddings in ChromaDB

#### **1. Purpose**
- Converts text chunks into **vector embeddings** for similarity search.
- Stores these embeddings in **ChromaDB**, a vector database.

#### **2. Process**
- Uses `Chroma.from_documents()` to:
  - **Embed** all `all_documents[]` chunks.
  - **Store** them in the Chroma vector store.
  
This enables fast and efficient retrieval of relevant information based on user queries.
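
Conceptually, a vector store answers the question "which stored vectors are closest to this query vector?". The brute-force sketch below shows that idea with toy 2-dimensional vectors. It is an illustration under assumptions: Chroma uses an index rather than a full scan, and its distance metric depends on configuration; squared Euclidean distance is chosen here only for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, stored, k=3):
    """Return the k (text, distance) pairs closest to query_vec.

    `stored` is a list of (text, vector) pairs; lower distance means closer.
    """
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings" for illustration only
stored = [
    ("chunk about Paris", [0.9, 0.1]),
    ("chunk about cheese", [0.1, 0.9]),
    ("chunk about Lyon", [0.8, 0.2]),
]
for text, distance in top_k([1.0, 0.0], stored, k=2):
    print(text, round(distance, 3))
```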


In [None]:
# Create and store the embeddings in a vector database (Chroma)
vector_store = Chroma.from_documents(all_documents, embeddings)

### Query Processing and LLM Response Generation

#### **1. Handling User Queries**
- Continuously accepts user input.
- Allows exiting the program with `"exit"`.

#### **2. Retrieving Relevant Context**
- Performs **vector similarity search** on `vector_store` using the query.
- Retrieves the **top 3 most relevant text chunks**.

#### **3. Constructing the LLM Prompt**
- Merges retrieved chunks into a **single context block**.
- Formats the final prompt.


#### **4. Generating the Answer**
- Encodes the prompt using `tokenizer.encode()`.
- Uses the **LLM model** to generate a response.
- Decodes and prints the final answer.

This setup enables **context-aware question answering** using scraped and stored knowledge.



In [None]:
def process_query(query):
    """Retrieve the top-3 chunks for the query and generate an answer with the LLM."""
    print(f"Processing query: {query}")

    results = vector_store.similarity_search_with_score(query, k=3)
    
    # Extracting the top-k contexts from the search results
    retrieved_contexts = [doc.page_content for doc, score in results]

    # Joining them into a single context block
    context = "\n".join(retrieved_contexts)

    # Formatting the final prompt for LLM
    prompt = f"Based on the context, answer the question: Question: {query}\nContext: {context}\n"

    # Tokenizing the prompt into input IDs
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Generate the answer (running on CPU)
    outputs = model.generate(input_ids)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print("Answer:", answer)



while True:
    query = input("Enter your query (or type 'exit' to quit): ")
    if query.lower() == 'exit':
        print("Exiting program.")
        break
    process_query(query)


Enter your query (or type 'exit' to quit): Name french authors
Processing query: Name french authors
Answer: Victor Hugo, Alexandre Dumas and Jules Verne
Enter your query (or type 'exit' to quit): Name french architects.
Processing query: Name french architects.
Answer: Jean Nouvel, Dominique Perrault, Christian de Portzamparc and Paul Andreu
Enter your query (or type 'exit' to quit): exit
Exiting program.
