# Task 2: Chat with Website Using RAG Pipeline

## Installing Required Libraries

In [None]:
!pip install requests
!pip install beautifulsoup4
!pip install newspaper3k
!pip install langchain
!pip install sentence-transformers
!pip install chromadb
!pip install transformers
!pip install lxml[html_clean]




## 1. Data Ingestion
• Input: URLs or list of websites to crawl/scrape.
• Process:
  o Crawl and scrape content from target websites.
  o Extract key data fields, metadata, and textual content.
  o Segment content into chunks for better granularity.
  o Convert chunks into vector embeddings using a pre-trained embedding model.
  o Store embeddings in a vector database with associated metadata for efficient retrieval.
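
The segmentation step above can be sketched with overlapping windows, so a sentence that straddles a chunk boundary still appears whole in one chunk. This is a minimal illustration, not the notebook's own method: the function name `chunk_words` and the `overlap` parameter are hypothetical, and the sizes are arbitrary example values.

```python
def chunk_words(text, chunk_size=150, overlap=30):
    """Split text into word-based chunks that share `overlap` words with the next chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

With `overlap=0` this reduces to plain fixed-size splitting. A larger overlap costs more embeddings but makes it less likely that an answer is cut in half at a boundary.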





In [None]:
import requests
from bs4 import BeautifulSoup
from newspaper import Article  # Provided by the newspaper3k package; used by scrape_website
from sentence_transformers import SentenceTransformer


In [None]:
# Step 1: Scrape the main article text of a page with newspaper3k
def scrape_website(url):
    try:
        article = Article(url)
        article.download()
        article.parse()
        return article.text
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None


In [None]:
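# Step 2: Segment text into chunks for embedding
# chunk_size counts whitespace-separated words. all-MiniLM-L6-v2 truncates input
# beyond 256 word pieces, so the default leaves headroom below that limit.
def segment_text(text, chunk_size=150):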
    if not isinstance(text, str):
        raise ValueError("Input to segment_text must be a string.")
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Step 3: Initialize embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")  # Replace with your preferred model

# Step 4: Scrape and process websites
urls = [
    "https://www.uchicago.edu/",
    "https://www.washington.edu/",
    "https://www.stanford.edu/",
    "https://und.edu/"
]

website_data = {url: scrape_website(url) for url in urls}

# Step 5: Process content into chunks and embeddings
embedding_data = {}
for url, content in website_data.items():
    if content:  # Ensure there's valid text content
        chunks = segment_text(content)
        embeddings = model.encode(chunks)  # Generate embeddings
        embedding_data[url] = {"chunks": chunks, "embeddings": embeddings}

# Step 6: Display or save the processed data
for url, data in embedding_data.items():
    print(f"Processed {len(data['chunks'])} chunks for {url}.")


Processed 1 chunks for https://www.uchicago.edu/.
Processed 1 chunks for https://www.washington.edu/.
Processed 1 chunks for https://www.stanford.edu/.
Processed 1 chunks for https://und.edu/.


In [None]:
website_data

{'https://www.uchicago.edu/': 'We value rigorous inquiry\n\nA diversity of people and ideas, coupled with free and open discourse, lays the foundation for students and scholars to bring forth original ideas that define fields and enrich human life.\n\nLEARN MORE',
 'https://www.washington.edu/': 'Husky Football Huskies are bowl-bound Capping a big — and BIG TEN — year, the Huskies are headed for the Tony the Tiger Sun Bowl! Join fellow fans in cheering on our favorite Dawgs against Louisville in El Paso, TX on December 31. Bowl Central\n\nHonors and Awards UW professor among Nobel laureates honored in Stockholm David Baker, professor of biochemistry at the UW School of Medicine in Seattle, received the 2024 Nobel Prize in Chemistry. Nobel Week wove stately traditions with imaginative recognitions. Read story',
 'https://www.stanford.edu/': 'Stanford was founded almost 150 years ago on a bedrock of societal purpose. Our mission is to contribute to the world by educating students for liv

In [None]:
!pip install faiss-cpu


Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0.post1


In [None]:
import faiss
import numpy as np
import pickle

# Step 7: Create a FAISS index and store embeddings
# Ask the model for its output size instead of relying on the `embeddings`
# variable left over from the loop in the previous cell
dimension = model.get_sentence_embedding_dimension()
index = faiss.IndexFlatL2(dimension)  # Exact (brute-force) search on L2 distance

# Store URL and chunk metadata for retrieval; row i of the index matches metadata[i]
metadata = []
for url, data in embedding_data.items():
    index.add(np.asarray(data["embeddings"], dtype="float32"))  # FAISS expects float32
    metadata.extend([(url, chunk) for chunk in data["chunks"]])

# Save FAISS index and metadata
faiss.write_index(index, "faiss_index.bin")
with open("metadata.pkl", "wb") as f:
    pickle.dump(metadata, f)

print("FAISS index and metadata saved.")



FAISS index and metadata saved.
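
The index above ranks by L2 distance. For unit-length vectors, squared L2 distance and cosine similarity are tied by the identity ||a - b||^2 = 2 - 2·cos(a, b), so both give the same ranking. The pure-Python sketch below checks this on made-up 3-dimensional vectors; the helper names and the data are illustrative assumptions, not part of the notebook. Whether the stored embeddings are unit length depends on the model and on options such as `normalize_embeddings=True` in `SentenceTransformer.encode`.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # A plain dot product equals cosine similarity only for unit-length inputs
    return sum(x * y for x, y in zip(a, b))


query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 2.1, 0.4], [0.2, -1.0, 3.0], [2.0, 0.1, 0.1])]

# Rank documents both ways: smallest distance first, largest similarity first
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
print(by_l2, by_cos)
```

If the embeddings are not normalized, the two rankings can differ, and the choice of metric becomes a design decision rather than a detail.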


## 2. Query Handling
• Input: User's natural language question.
• Process:
  o Convert the user's query into vector embeddings using the same embedding model.
  o Perform a similarity search in the vector database to retrieve the most relevant chunks.
  o Pass the retrieved chunks to the LLM along with a prompt or agentic context to generate a detailed response.
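
To make the similarity-search step concrete, here is a brute-force sketch of what an exact L2 index computes: the squared distance from the query to every stored vector, followed by a sort. The function `top_k_chunks`, the 2-dimensional vectors, and the example.org URLs are hypothetical stand-ins for real embeddings and metadata.

```python
def top_k_chunks(query_vec, vectors, metadata, k=3):
    """Return the k metadata entries whose vectors are closest to query_vec."""
    scored = []
    for row, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query_vec, vec))  # squared L2 distance
        scored.append((dist, row))
    scored.sort()  # smallest distance first
    return [
        {"url": metadata[row][0], "chunk": metadata[row][1], "distance": dist}
        for dist, row in scored[:k]
    ]


vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
metadata = [
    ("https://example.org/a", "chunk A"),
    ("https://example.org/b", "chunk B"),
    ("https://example.org/c", "chunk C"),
]
hits = top_k_chunks([1.0, 0.0], vectors, metadata, k=2)
for hit in hits:
    print(hit["url"], hit["chunk"], round(hit["distance"], 4))
```

The lookup depends on row `i` of the vector store lining up with `metadata[i]`, which is why embeddings and metadata must be appended in the same order.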

In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import pickle
import numpy as np

# Load the SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')  # Ensure the correct model is loaded

def retrieve_similar_chunks(query, top_k=3):
    """
    Retrieve chunks similar to the query from the FAISS index.

    Args:
        query (str): The query string for which similar chunks need to be retrieved.
        top_k (int): The number of top similar chunks to retrieve.

    Returns:
        list: A list of dictionaries containing URLs and chunks.
    """
    try:
        # Generate query embedding
        query_embedding = model.encode([query])

        # Load FAISS index
        index = faiss.read_index("faiss_index.bin")

        # Load metadata
        with open("metadata.pkl", "rb") as f:
            metadata = pickle.load(f)

        # Search for the most similar embeddings
        distances, indices = index.search(np.array(query_embedding).astype('float32'), top_k)

        # Retrieve corresponding chunks and metadata
        results = []
        for idx in indices[0]:
            # FAISS pads results with -1 when the index holds fewer than top_k vectors
            if 0 <= idx < len(metadata):
                url, chunk = metadata[idx]
                results.append({"url": url, "chunk": chunk})

        return results

    except Exception as e:
        print(f"Error retrieving similar chunks: {e}")
        return []

# Example query
if __name__ == "__main__":
    query = "When was Stanford University founded?"
    results = retrieve_similar_chunks(query)

    for result in results:
        print(f"URL: {result['url']}\nChunk: {result['chunk']}\n")


URL: https://www.stanford.edu/
Chunk: Stanford was founded almost 150 years ago on a bedrock of societal purpose. Our mission is to contribute to the world by educating students for lives of leadership and contribution with integrity; advancing fundamental knowledge and cultivating creativity; leading in pioneering research for effective clinical therapies; and accelerating solutions and amplifying their impact.

URL: https://www.washington.edu/
Chunk: Husky Football Huskies are bowl-bound Capping a big — and BIG TEN — year, the Huskies are headed for the Tony the Tiger Sun Bowl! Join fellow fans in cheering on our favorite Dawgs against Louisville in El Paso, TX on December 31. Bowl Central Honors and Awards UW professor among Nobel laureates honored in Stockholm David Baker, professor of biochemistry at the UW School of Medicine in Seattle, received the 2024 Nobel Prize in Chemistry. Nobel Week wove stately traditions with imaginative recognitions. Read story

URL: https://und.edu/
C





## 3. Response Generation
• Input: Relevant information retrieved from the vector database and the user query.
• Process:
  o Use the LLM with retrieval-augmented prompts to produce responses with exact values and context.
  o Ensure factuality by incorporating retrieved data directly into the response.
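
One way to incorporate retrieved data directly into the prompt is to number each chunk, attach its source URL, and instruct the model to answer only from that context. The sketch below is an assumption-laden illustration: `build_prompt`, the instruction wording, and the character budget `max_context_chars` are hypothetical. The chunk dictionaries use the same `url` and `chunk` keys as the retrieval step.

```python
def build_prompt(query, retrieved_chunks, max_context_chars=2000):
    """Assemble a retrieval-augmented prompt with numbered, source-tagged chunks."""
    parts = []
    used = 0
    for number, item in enumerate(retrieved_chunks, start=1):
        entry = f"[{number}] ({item['url']}) {item['chunk']}"
        if used + len(entry) > max_context_chars:
            break  # stop before exceeding the context budget
        parts.append(entry)
        used += len(entry)
    context = "\n".join(parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


example_chunks = [
    {"url": "https://www.stanford.edu/", "chunk": "Stanford was founded almost 150 years ago."}
]
prompt = build_prompt("When was Stanford University founded?", example_chunks)
print(prompt)
```

An instruction-tuned model is more likely to follow such an instruction than a base model like `gpt2`, which only continues the text it is given.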

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the generator model and tokenizer. The generator is named `llm` so it does
# not overwrite the SentenceTransformer `model` that retrieve_similar_chunks() uses.
model_name = "gpt2"  # Replace with your preferred model
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 has no padding token; reuse the end-of-sequence token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Ensure computation runs on GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
llm = llm.to(device)

# Function to build the prompt; only the context is truncated, so the question survives
def truncate_context(context, query, max_tokens=1024, max_new_tokens=256):
    prefix, suffix = "Context:\n", f"\n\nQuestion: {query}\nAnswer:"
    # Tokens left for the context after reserving room for the fixed text and the answer.
    # The extra 8 tokens are a safety margin, since re-tokenizing decoded text can differ slightly.
    fixed_len = len(tokenizer(prefix + suffix)["input_ids"])
    budget = max(max_tokens - max_new_tokens - fixed_len - 8, 0)
    context_ids = tokenizer(context)["input_ids"][:budget]
    return prefix + tokenizer.decode(context_ids, skip_special_tokens=True) + suffix

# Function to remove repeated sentences
def remove_repetitions(text):
    sentences = text.split(". ")
    seen = set()
    filtered_sentences = []
    for sentence in sentences:
        if sentence not in seen:
            filtered_sentences.append(sentence)
            seen.add(sentence)
    return ". ".join(filtered_sentences)

# Function to generate answer
def generate_answer_with_huggingface(query, retrieved_chunks, max_tokens=1024, max_new_tokens=256):
    try:
        # Combine retrieved chunks into a single context
        context = "\n".join(chunk["chunk"] for chunk in retrieved_chunks)

        # Build a prompt that fits the model's context window with room left for the answer
        prompt = truncate_context(context, query, max_tokens, max_new_tokens)

        # Tokenize input; the result includes the attention_mask
        inputs = tokenizer(prompt, return_tensors="pt").to(device)

        # Generate output with controlled decoding
        outputs = llm.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,  # Limits generated tokens, not prompt + output
            no_repeat_ngram_size=3,  # Prevent n-gram repetition
            temperature=0.7,         # Balance between randomness and determinism
            top_k=50,                # Consider the top 50 tokens
            top_p=0.9,               # Use nucleus sampling with p=0.9
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id  # Use the defined padding token
        )

        # Decode only the newly generated tokens, skipping the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
        return remove_repetitions(answer)  # Remove any repeated sentences
    except Exception as e:
        print(f"Error during answer generation: {e}")
        return "Sorry, I couldn't generate an answer due to a runtime issue."

# Example usage
query = "When was Stanford University founded?"
# retrieved_chunks = [
#     {"chunk": "Stanford University is known for its focus on technology, innovation, and interdisciplinary research."}
# ]  # Example retrieved context

retrieved_chunks = [
    {"chunk": "Stanford was founded almost 150 years ago on a bedrock of societal purpose. Our mission is to contribute to the world by educating students for lives of leadership and contribution with integrity; advancing fundamental knowledge and cultivating creativity; leading in pioneering research for effective clinical therapies; and accelerating solutions and amplifying their impact."}
]
# Generate and print the answer
answer = generate_answer_with_huggingface(query, retrieved_chunks)
print(f"Answer: {answer}")


Answer: Stanley and his wife, Mary, founded Stanford in 1867. Mary was a graduate of Stanford University, where she became a professor of chemistry, and later a professor at Stanford University. In 1877, she became the first female professor of the college. In 1886, she was elected dean of the College of Pharmacy. In 1893, she married her husband, Richard, and they have three children.
.
 and her children. In the early 1900s, she started a new research institute, Stanford Hospital, in Palo Alto, California. The institute, called Stanford Medical Center, is a pioneering center for the study of medicine, and for the development of new technologies that will advance our health, wellness and health care. Stanford Medical Centers is one of the oldest and most important medical centers in the United States.
 (The Stanford Hospital is located in the heart of the Stanford campus, about 6 miles from the University of California. Stanford Hospital was founded in 1868 by Mary Stanford and her hus