
In [5]:
!pip install requests beautifulsoup4 sentence-transformers faiss-cpu langchain




In [6]:
import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Step 1: Scrape content from websites
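def scrape_website(url):
    # A timeout keeps one unresponsive site from hanging the whole run
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Only text inside <p> tags is kept; navigation, headings and scripts are ignored
    paragraphs = soup.find_all('p')
    content = ' '.join([para.get_text() for para in paragraphs])
    return content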

# List of example websites to scrape
websites = [
    "https://www.uchicago.edu/",
    "https://www.washington.edu/",
    "https://www.stanford.edu/",
    "https://und.edu/"
]

# Scrape content from each website
website_contents = {url: scrape_website(url) for url in websites}

# Step 2: Convert website content to embeddings using SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_embeddings(content):
    return model.encode([content])

# Convert content of each website into embeddings
embeddings = {url: get_embeddings(content) for url, content in website_contents.items()}

# Step 3: Store embeddings in FAISS
# Read the dimensionality from any stored embedding instead of hard-coding one URL
dimension = next(iter(embeddings.values())).shape[1]
index = faiss.IndexFlatL2(dimension)  # Initialize FAISS index

# Convert embeddings to numpy array and add to FAISS index
all_embeddings = np.array([embedding[0] for embedding in embeddings.values()])
index.add(all_embeddings)

# Add metadata (URLs) for retrieval
metadata = list(website_contents.keys())

# Step 4: Handle user query, convert to embedding, and retrieve relevant content from FAISS
def query_to_embeddings(query):
    return model.encode([query])

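def retrieve_relevant_chunks(query, top_k=3):
    # Each site is stored as a single vector, so this ranks whole pages, not passages
    query_embedding = query_to_embeddings(query)
    # Never ask for more neighbours than the index holds
    top_k = min(top_k, index.ntotal)
    distances, indices = index.search(np.array(query_embedding), top_k)
    # FAISS pads missing results with -1; skip them so metadata[-1] is never returned
    relevant_urls = [metadata[idx] for idx in indices[0] if idx >= 0]
    return relevant_urls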
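
To make Steps 3 and 4 more concrete, here is a minimal pure-Python sketch of the exhaustive search that a flat L2 index performs: compare the query with every stored vector and keep the closest ones. It is an illustration only, not how FAISS is implemented internally. The names `squared_l2`, `brute_force_search`, `toy_vectors` and `toy_urls` are hypothetical and do not appear elsewhere in this notebook.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(
    vectors: List[Sequence[float]], query: Sequence[float], top_k: int = 3
) -> List[Tuple[int, float]]:
    """Return (position, squared distance) pairs for the top_k closest vectors."""
    scored = [(i, squared_l2(vec, query)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_k]  # slicing copes with top_k > len(vectors)


# Toy 2-D "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
toy_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
toy_urls = ["site-a", "site-b", "site-c", "site-d"]  # hypothetical labels

hits = brute_force_search(toy_vectors, query=(0.9, 0.1), top_k=2)
# Map positions back to labels, the same role `metadata` plays above
nearest_urls = [toy_urls[i] for i, _ in hits]
print(nearest_urls)
```

The positions returned by the search only mean something because the label list is kept in the same order as the vectors, which is why `metadata` above is built from the same dictionary as `all_embeddings`.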



In [7]:
# Ask the user for their query
user_query = input("Please enter your query: ")

# Retrieve relevant content based on the user's query
relevant_urls = retrieve_relevant_chunks(user_query)

# Concatenate the scraped text of the retrieved websites, in ranked order
relevant_content = ' '.join([website_contents[url] for url in relevant_urls])
# Truncate to max_length characters as a crude stand-in for summarization
def summarize_text(text, max_length=300):
    return text[:max_length] + "..." if len(text) > max_length else text

summarized_content = summarize_text(relevant_content)
print("\nSummarized Relevant Content for Query:", user_query)
print(summarized_content)

Please enter your query: What are the research programs at Stanford University?

Summarized Relevant Content for Query: What are the research programs at Stanford University?
UW astronomy undergrads are using cutting-edge coding skills to help scientists make the most of discoveries from a revolutionary new telescope. Read story Chris Mantegna, ’21, is studying how pollutants affect shellfish in our food web — and training a new generation of marine scientists.  Read sto...
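
The text printed above reads like University of Washington content even though the query asks about Stanford. One likely contributor is that each site is embedded as a single vector, so retrieval can only rank whole pages and the 300-character preview shows whichever page ranked first. A possible refinement, sketched below under stated assumptions, is to split each page into overlapping word windows and embed those instead. The helper `chunk_text` and its `chunk_size` and `overlap` parameters are hypothetical and not part of the original notebook.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Small demonstration on twelve placeholder words
sample = " ".join(f"w{i}" for i in range(12))
pieces = chunk_text(sample, chunk_size=5, overlap=2)
for piece in pieces:
    print(piece)
```

With this approach each chunk would be passed to `model.encode`, and `metadata` would need to hold one `(url, chunk)` pair per vector, in the same order the vectors are added to the index.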
