# Analyzing AI-Governance with GraphRAG


## Agenda

1. Webscraping 
2. Setup ChromaDB
3. Pre-processing 
4. Set up GraphRAG
5. Retrieval-Augmented Generation
6. Risk Assessment

### Workflow of the project

![Workflow of the model](Workflow_.jpeg)

## Importing Libraries

In [1]:
import os
import requests
from bs4 import BeautifulSoup
import PyPDF2
import chromadb
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever
from sentence_transformers import SentenceTransformer
import uuid
import networkx as nx  # For GraphRAG






- `import os`: 
    - Provides functions to interact with the operating system, such as file and directory manipulation.

- `import requests`:
    - A library for sending HTTP requests in Python, used to fetch content from the web.
    
- `from bs4 import BeautifulSoup`:
    - Imports the BeautifulSoup class from the bs4 library, which is used for parsing and navigating HTML and XML documents.

- `import PyPDF2`:
    - A Python library used to work with PDF files, allowing for reading and manipulation of PDFs.

- `import chromadb`:
    - A library for interacting with Chroma, an open-source embedding database used for storing and retrieving high-dimensional vector embeddings.

- `from langchain_community.embeddings import OllamaEmbeddings`:
    - Imports the OllamaEmbeddings class from the langchain_community.embeddings module, which facilitates generating embeddings using Ollama models.
    
- `from langchain_text_splitters import RecursiveCharacterTextSplitter`:
    - Imports the RecursiveCharacterTextSplitter class from langchain_text_splitters, used for dividing large text documents into smaller segments based on character count, useful for processing large texts.

- `from langchain_community.vectorstores import Chroma`:
    - Imports the Chroma class from langchain_community.vectorstores, allowing interaction with the Chroma vector database for efficient vector storage and retrieval.

- `from langchain.docstore.document import Document`:
    - Imports the Document class from langchain.docstore.document, used to represent and handle documents for storage and retrieval tasks.
    
- `from langchain.prompts import ChatPromptTemplate, PromptTemplate`:
    - Imports ChatPromptTemplate and PromptTemplate classes for generating customizable prompts for chat-based language models.

- `from langchain_core.output_parsers import StrOutputParser`:
    - Imports the StrOutputParser from langchain_core.output_parsers, which converts the output of language models into strings for easier handling and interpretation.
    
- `from langchain_community.chat_models import ChatOllama`:
    - Imports the ChatOllama model, allowing interaction with the Ollama chat model for conversational AI tasks.

- `from langchain_core.runnables import RunnablePassthrough`:
    - Imports the RunnablePassthrough class from langchain_core.runnables, which passes inputs directly to the output without any modifications or processing.

- `from langchain.retrievers.multi_query import MultiQueryRetriever`:
    - Imports the MultiQueryRetriever class for performing multi-query retrieval from a document store, allowing for efficient and flexible search capabilities.

- `from sentence_transformers import SentenceTransformer`:
    - Imports SentenceTransformer, a library for generating dense vector representations (embeddings) from sentences using pre-trained transformer models.

- `import uuid`:
    - A library for generating unique identifiers (UUIDs), useful for creating unique keys or IDs for documents and processes.

- `import networkx as nx`:
    - Imports the networkx library, which is used for creating and manipulating complex networks (graphs); here it holds the graph used for GraphRAG.

## Initialize ChromaDB client
- `What is ChromaDB`:
    - ChromaDB is a vector database designed for handling large-scale AI and machine learning workloads. In a vector database, data is stored as mathematical vectors (or embeddings), which represent complex, high-dimensional information such as images, text, or other unstructured data. This type of database allows efficient similarity searches, where similar pieces of data are grouped or retrieved based on their proximity in vector space.

- `Working`:
    - `client = chromadb.Client()`
    - `chromadb`: The Python package for ChromaDB, which provides an API to interact with the vector database.
    - `Client()`: Creates the client object that connects to ChromaDB. This object serves as the main entry point for interacting with the database.
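
To make "proximity in vector space" concrete, here is a minimal sketch in plain Python that does not use ChromaDB at all. The three-dimensional vectors and the `cosine_similarity` and `nearest` helpers are illustrative assumptions; real embeddings have hundreds of dimensions, and ChromaDB relies on its own index and distance settings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, store):
    """Return the key in `store` whose vector is most similar to `query`."""
    return max(store, key=lambda key: cosine_similarity(query, store[key]))

# Toy "embeddings" with made-up values, for illustration only
store = {
    "ai_regulation": [0.9, 0.1, 0.0],
    "cooking_recipe": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
best_match = nearest(query, store)
```

The query vector points in nearly the same direction as the `ai_regulation` vector, so that key is returned as the closest entry.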

In [2]:
client = chromadb.Client()

## Webscraping

- **`What is BeautifulSoup`:** 
     - BeautifulSoup is a Python library used to parse HTML and XML documents. It provides tools to navigate the HTML structure, extract specific elements, and modify or clean up the data. It is commonly used alongside requests (which fetches the HTML content from a webpage).

- **`Working`:**
    - 1. HTTP Requests: The requests library is used to send an HTTP GET request to the webpage, fetching the raw HTML.
    - 2. HTML Parsing: BeautifulSoup parses the fetched HTML into a structured format (like a tree) that makes it easier to search for specific HTML elements (e.g., headings, paragraphs).
    - 3. Navigating the HTML Tree: The code searches for specific tags and extracts their content based on class names or HTML structure.
    

Here we are using BeautifulSoup to scrape AI regulatory information from dynamically constructed URLs for one country. It sends an HTTP GET request via the requests library to retrieve webpage content. If successful (status code 200), the HTML is parsed to find specific elements containing article content.

Text from headings and paragraphs is extracted in a loop. Headings create new sections in the country data, while paragraphs are added under the latest heading. If no heading exists, content is placed under an "Introduction" section.

The scraped data is structured into a dictionary format for each country and returned. If the request fails, an error message is printed.
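
The grouping rule described above can be sketched without any network access. This is a simplified illustration, not the scraper itself: `group_into_sections` is a hypothetical helper, and the `(tag, text)` pairs stand in for the elements BeautifulSoup would return.

```python
def group_into_sections(elements):
    """Group (tag, text) pairs into sections keyed by the latest heading."""
    sections = []
    current_heading = None
    for tag, text in elements:
        if tag in ("h2", "h3"):
            # A heading opens a new section
            current_heading = text
            sections.append({"Heading": current_heading, "Content": []})
        else:
            # Text before any heading goes into an "Introduction" section
            if current_heading is None and not sections:
                sections.append({"Heading": "Introduction", "Content": []})
            sections[-1]["Content"].append(text)
    return sections

# Placeholder elements, not real scraped content
elements = [
    ("p", "Intro paragraph"),
    ("h2", "Laws"),
    ("p", "First paragraph"),
    ("li", "List item"),
    ("h3", "Regulators"),
    ("p", "Second paragraph"),
]
sections = group_into_sections(elements)
```

With these placeholder elements, the result has three sections: "Introduction", "Laws" (holding two content items), and "Regulators".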

In [3]:
def scrape_articles():
    all_countries = ['India']
    urls = [f"https://www.whitecase.com/insight-our-thinking/ai-watch-global-regulatory-tracker-{each}" for each in all_countries]

    # Initialize an empty list to store all the scraped data
    scraped_data = []

    for country, url in zip(all_countries, urls):
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            content = soup.find_all('div', class_='field field--name-body field--type-text-with-summary field--label-hidden field--item')
            current_heading = None
            country_data = {'Country': country, 'Sections': []}
            for section in content:
                paragraphs = section.find_all(['p', 'h2', 'h3', 'li'])
                for paragraph in paragraphs:
                    text = paragraph.get_text(strip=True)
                    if paragraph.name in ['h2', 'h3']:
                        current_heading = text
                        country_data['Sections'].append({'Heading': current_heading, 'Content': []})
                    else:
                        if current_heading:
                            country_data['Sections'][-1]['Content'].append(text)
                        else:
                            if not country_data['Sections']:
                                country_data['Sections'].append({'Heading': 'Introduction', 'Content': []})
                            country_data['Sections'][0]['Content'].append(text)
            scraped_data.append(country_data)
        else:
            print(f"Failed to retrieve the webpage for {country}. Status code: {response.status_code}")
    return scraped_data


## Setup ChromaDB

- **`Access or Create Collection`:** 
     - The function checks whether a ChromaDB collection named "articles1" exists using client.get_collection(). If it doesn't, it creates a new one with client.create_collection().

- **`Load Embedding Model`:**
    - It loads the SentenceTransformer model 'all-MiniLM-L6-v2' to generate embeddings for the article text.

- **`Process and Embed Articles`:**
    - For each article in scraped_data, it concatenates headings and content into a single string, full_text. The model then converts this text into an embedding (a vector representation of the text).

- **`Store in ChromaDB`:**
    - Next, the function adds each article to ChromaDB with the document text, its embedding, metadata (country name), and a unique ID.

- **`Return`:**
    - It prints how many articles were stored and returns the collection name "articles1".
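
As a small illustration of the concatenation step, the sketch below builds `full_text` for a placeholder article and shows the shape of the record handed to the collection, leaving out the embedding. `build_full_text` is a hypothetical helper name, and the section contents are made up.

```python
def build_full_text(article):
    """Join each section's heading and content lines into one string."""
    return "\n".join(
        f"{section['Heading']}\n" + "\n".join(section["Content"])
        for section in article["Sections"]
    )

# Placeholder article in the structure returned by the scraper
article = {
    "Country": "India",
    "Sections": [
        {"Heading": "Introduction", "Content": ["Intro paragraph"]},
        {"Heading": "Laws", "Content": ["First paragraph", "List item"]},
    ],
}
full_text = build_full_text(article)

# Shape of the data stored per article (the embedding is omitted here)
record = {
    "documents": [full_text],
    "metadatas": [{"title": article["Country"]}],
    "ids": ["0"],
}
```

Each heading is followed by its content lines, and sections are separated by a single newline.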

In [4]:
def add_to_chromadb(scraped_data):
    # Try to get the existing collection or create a new one
    collection_name = "articles1"
    try:
        collection = client.get_collection(name=collection_name)
    except Exception as e:
        print(f"Failed to get collection '{collection_name}': {e}")
        collection = client.create_collection(name=collection_name)

    # Load Sentence Transformer for embedding
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Add articles to ChromaDB
    for i, article in enumerate(scraped_data):
        full_text = "\n".join([f"{section['Heading']}\n" + "\n".join(section['Content']) for section in article['Sections']])
        
        embedding = model.encode(full_text).tolist()

        collection.add(
            documents=[full_text],
            embeddings=[embedding],
            metadatas=[{'title': article['Country']}],
            ids=[str(i)]
        )

    print(f"Stored {len(scraped_data)} articles in ChromaDB.")
    return collection_name


## Pre-Processing the data

- **`The process_articles function pre-processes the scraped data, splitting the text into manageable chunks and embedding them for efficient storage and retrieval in ChromaDB. It performs the following steps`:**

    - `1. Text Splitting`:
        - It uses the RecursiveCharacterTextSplitter to break large article text into smaller chunks. Each chunk is at most 800 characters long, with a 100-character overlap between consecutive chunks for better context continuity.

    - `2. Document Creation`: 
        - For each article, the text (combining headings and content) is transformed into a Document object, which is then split into chunks using the splitter.

    - `3. Generate Unique IDs`:
        - A unique ID is created for each chunk using the uuid4() function to identify it in the database.

    - `4. Embedding and Storing`: 
        - The chunks are embedded with the nomic-embed-text model through OllamaEmbeddings and stored in the Chroma vector database (local-rag collection) along with their unique IDs.
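
The effect of chunk size and overlap can be seen in a simplified sliding-window splitter. This is a stand-in under stated assumptions, not the RecursiveCharacterTextSplitter algorithm: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will differ. `split_with_overlap` is a hypothetical helper, and the tiny sizes are chosen only to keep the example readable.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
```

Here the last four characters of each chunk reappear at the start of the next one, which is what keeps context from being cut off at a chunk boundary.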

In [11]:
def process_articles(scraped_data):
    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    
    # Process and split the document text
    chunks = []
    for article in scraped_data:
        document_text = "\n".join([f"{section['Heading']}\n" + "\n".join(section['Content']) for section in article['Sections']])
        document = Document(page_content=document_text)
        chunks += text_splitter.split_documents([document])
    
    # Generate unique IDs for each chunk
    chunk_ids = [str(uuid.uuid4()) for _ in range(len(chunks))]

    # Initialize the vector database with the split chunks and the generated IDs
    vector_db = Chroma.from_documents(
        documents=chunks, 
        embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
        ids=chunk_ids,
        collection_name="local-rag"
    )

    return vector_db


## Defining Model 

- **`Model used: Llama 3`:**
    - **`1. Model Initialization`:**

        - The local language model llama3 is set as the model to use. An instance of ChatOllama, the interface to the large language model, is initialized with this model.

    - **`2. Query Prompt Creation`:**

        - A PromptTemplate named QUERY_PROMPT is created. It instructs the model to retrieve relevant information only from the vector database without using external knowledge, assumptions, or opinions. If the answer is not found in the retrieved documents, the model should explicitly state that the information is not available.

    - **`3. Retriever Initialization`:**

        - A MultiQueryRetriever is created using the language model and the vector database. This retriever handles the retrieval of relevant content from the vector database based on user questions. It uses the query prompt to reformulate the question and enhance the retrieval accuracy.

    - **`4. Answer Prompt Template`:**

        - A ChatPromptTemplate is created to structure the model's response. It provides a template that requires the model to answer the question based only on the context retrieved from the vector database.

    - **`5. Chain Setup`:**

        - A processing chain is created:
            - First, the retriever is used to retrieve context from the vector database.
            - The question is passed through using RunnablePassthrough() to maintain its structure.
            - The prompt template formats the input for the language model.
            - The language model (llm) generates a response based on the retrieved context.
            - The StrOutputParser processes the model's output into a readable format.

    - **`6. Return Chain`:**
        - The function returns the complete chain that processes the question, retrieves relevant context from the vector database, and provides an answer based only on the retrieved content.
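
The data flow through the chain can be traced with plain functions. This is a sketch under stated assumptions: `fake_retriever` and `fake_llm` are hypothetical stubs that replace the MultiQueryRetriever and ChatOllama, so that the example runs without Ollama or a vector database.

```python
TEMPLATE = (
    "Answer the question based ONLY on the following context:\n"
    "{context}\n"
    "Question: {question}\n"
)

def fake_retriever(question):
    """Hypothetical retriever: returns stored snippets that share a word with the question."""
    store = ["alpha policy text", "beta guidance text"]
    words = set(question.lower().split())
    return [doc for doc in store if words & set(doc.split())]

def fake_llm(prompt_text):
    """Hypothetical model: reports the prompt length instead of generating an answer."""
    return {"content": f"(model saw {len(prompt_text)} characters)"}

def run_chain(question):
    # Step 1: send the question to the retriever and pass it through unchanged
    inputs = {"context": fake_retriever(question), "question": question}
    # Step 2: fill the prompt template with both values
    prompt_text = TEMPLATE.format(**inputs)
    # Step 3: call the model with the formatted prompt
    message = fake_llm(prompt_text)
    # Step 4: reduce the model message to a plain string
    return prompt_text, message["content"]

prompt_text, answer = run_chain("what is the alpha policy")
```

Only the snippet that matches the question ends up in the prompt, which mirrors how the real chain restricts the model to retrieved context.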

In [12]:
def initialize_chain(vector_db):
    local_model = "llama3"
    llm = ChatOllama(model=local_model)
    
    QUERY_PROMPT = PromptTemplate(
        input_variables=["question"],
        template="""You are an AI language model assistant. Your sole task is to retrieve the most relevant information 
        strictly from a vector database containing stored articles and content. Your response must be based **only** on 
        the information retrieved from the provided documents. You are NOT allowed to use any external knowledge, personal 
        opinions, or make assumptions. Use the retrieved information strictly to answer the question in the most accurate 
        way possible. 

        If the relevant information is not present in the retrieved content, say explicitly: "The answer is not found in 
        the documents provided."

        Here are ten alternative ways to phrase the original question to improve retrieval performance and ensure the highest 
        likelihood of retrieving the correct answer.

        Original question: {question}
        """
    )

    retriever = MultiQueryRetriever.from_llm(
        vector_db.as_retriever(), 
        llm,
        prompt=QUERY_PROMPT
    )
    
    template = """Answer the question based ONLY on the following context:
    {context}
    Question: {question}
    """
    
    prompt = ChatPromptTemplate.from_template(template)
    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    
    return chain


## Setup GraphRAG

- **`What is GraphRAG`:**
     - GraphRAG (Graph-based Retrieval-Augmented Generation) is a technique that combines knowledge graphs and neural language models to enhance the process of answering queries. 
     - It uses a graph structure to represent relationships between different pieces of data or documents, allowing for more structured and contextually rich information retrieval. 
     - Each node in the graph represents a document or piece of knowledge, while edges between nodes signify the relationships or connections between them. 
     - This structure is used to improve the generation of responses by combining both retrieval-based information and graph-based reasoning.

### Process
- **`Initialize Graph and Data Lists`:**
     - The script initializes lists for storing documents, metadatas, and ids for 15 text entries. It also initializes a directed graph (DiGraph) for GraphRAG.

- **`Generate Text and IDs`:**
    - A loop generates text1, text2, ..., text15 and corresponding id1, id2, ..., id15. Each document also has associated metadata (e.g., "source": "text_info1").

- **`Add Nodes to Graph`:**
     - Each generated text and metadata is added to the graph as nodes with document_id as the identifier.

- **`Add Documents to ChromaDB`:**
    - The generated documents, along with their metadata and IDs, are added to a ChromaDB collection named "laws".

- **`Create Graph Connections`:**
    - Consecutive nodes (id1 to id15) are linked with directed edges in the graph to show relationships between them (e.g., consecutive text).
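
The resulting graph is a simple chain, which can be pictured with an adjacency dictionary. The sketch below is a plain-Python stand-in for the networkx DiGraph, with `build_chain_graph` as a hypothetical helper; it only illustrates which node points to which.

```python
def build_chain_graph(n):
    """Build the directed chain id1 -> id2 -> ... -> idn as plain dictionaries."""
    nodes = {
        f"id{i}": {"text": f"text{i}", "metadata": {"source": f"text_info{i}"}}
        for i in range(1, n + 1)
    }
    # Every node starts with no outgoing edges
    successors = {node_id: [] for node_id in nodes}
    # Link each node to the next one in the sequence
    for i in range(1, n):
        successors[f"id{i}"].append(f"id{i + 1}")
    return nodes, successors

nodes, successors = build_chain_graph(15)
```

Every node except the last has exactly one successor, so 15 nodes produce 14 edges and `id15` has no outgoing edge.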

In [13]:
def initialize_graph_rag():
    # Initialize lists for documents, metadatas, and ids
    documents = []
    metadatas = []
    ids = []

    # Initialize GraphRAG
    graph = nx.DiGraph()  # Use DiGraph for a directed graph, or Graph for an undirected graph

    # Loop to generate text1, text2, ..., text15 and id1, id2, ..., id15
    for i in range(1, 16):
        text = f"text{i}"  # This generates text1, text2, ..., text15
        document_id = f"id{i}"  # This generates id1, id2, ..., id15
        metadata = {"source": f"text_info{i}"}  # This generates source metadata for each text

        # Append to the respective lists
        documents.append(text)
        metadatas.append(metadata)
        ids.append(document_id)

        # Add nodes to the GraphRAG
        graph.add_node(document_id, text=text, metadata=metadata)

    # Add the generated data to ChromaDB collection
    collection = client.get_or_create_collection("laws")
    collection.add(
        documents=documents,
        metadatas=metadatas,
        ids=ids
    )

    print("Documents added successfully to ChromaDB!")

    # Link nodes as edges (for simplicity, connecting consecutive nodes)
    for i in range(1, 15):
        graph.add_edge(f"id{i}", f"id{i+1}", relation="consecutive")

    print("Nodes and edges added to GraphRAG!")
    return graph, collection


## Retrieval-Augmented Generation

- **`Query ChromaDB`:**
    - It queries the ChromaDB collection for documents relevant to the provided question. The query retrieves the top 5 documents based on their relevance.

- **`Display Top Document`:**
     - The function reads `results['documents'][0]`, the ranked list of up to five matches for the question, and prints it. The first entry in that list is the document considered most relevant to the question.

- **`Fetch Connected Nodes from GraphRAG`:**
    - It retrieves the document ID from the query results. If the ID is a list, it selects the first element.
    - It checks if the document ID exists as a node in the GraphRAG. If it does, the function retrieves and prints the connected nodes (successors) from the graph, indicating relationships or related documents.
    - If no connected nodes are found or the document ID is not in the graph, it prints an appropriate message.
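
Query results come back as nested lists, one inner list per query text, which is why the ID has to be unwrapped. The sketch below uses a mocked `results` dictionary in that shape and a plain adjacency dictionary in place of the graph; `top_hit` and `related_ids` are hypothetical helper names.

```python
# Mocked result in the nested-list shape returned for a single query text
results = {
    "ids": [["id8", "id14", "id5", "id2", "id3"]],
    "documents": [["text8", "text14", "text5", "text2", "text3"]],
}

def top_hit(results):
    """Return (id, document) of the best match for the first query text."""
    return results["ids"][0][0], results["documents"][0][0]

def related_ids(doc_id, successors):
    """Look up outgoing neighbours; unknown ids yield an empty list."""
    return successors.get(doc_id, [])

# Plain-dict stand-in for the directed chain id1 -> id2 -> ... -> id15
successors = {f"id{i}": [f"id{i + 1}"] for i in range(1, 15)}
successors["id15"] = []

doc_id, document = top_hit(results)
neighbours = related_ids(doc_id, successors)
```

With this mocked result the top hit is `id8`, and its only successor in the chain is `id9`.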

In [14]:
def query_chromadb_and_graph(question, collection, graph):
    # Query ChromaDB for top document
    results = collection.query(
        query_texts=[question],
        n_results=5
    )
    
    top_document = results['documents'][0]
    
    # Display top document
    print(f"Top document found: {top_document}")
    
    # Fetch connected nodes from GraphRAG
    doc_id = results['ids'][0]  # Get the id of the top document (extracting from the list)
    if isinstance(doc_id, list):  # In case doc_id is a list, get the first element
        doc_id = doc_id[0]
    
    if doc_id in graph.nodes:
        connected_nodes = list(graph.successors(doc_id))  # Get connected nodes
        if connected_nodes:
            print(f"Related nodes from GraphRAG: {connected_nodes}")
        else:
            print("No related nodes found.")
    else:
        print("Document ID not found in GraphRAG.")


## Main Function

In [15]:
def main():
    print("AI Governance Tracker Chatbot")
    
    # Scrape and process articles
    print("Scraping articles...")
    scraped_data = scrape_articles()
    print("Scraping complete. Adding to ChromaDB...")
    
    collection_name = add_to_chromadb(scraped_data)
    
    # Process articles into chunks for ChromaDB
    vector_db = process_articles(scraped_data)
    
    # Initialize the LLM retrieval chain
    chain = initialize_chain(vector_db)
    
    # Initialize GraphRAG
    graph, graph_collection = initialize_graph_rag()
    
    question = input("Enter your question: ")
    
    if question:
        # Query ChromaDB and GraphRAG for related content
        query_chromadb_and_graph(question, graph_collection, graph)
        response = chain.invoke(question)
        print("Response:", response)
    else:
        print("Please enter a question.")
    
    cleanup = input("Would you like to cleanup the ChromaDB? (yes/no): ")
    if cleanup.lower() == 'yes':
        client.delete_collection(name=collection_name)
        print("ChromaDB cleaned up.")

if __name__ == "__main__":
    main()


AI Governance Tracker Chatbot
Scraping articles...
Scraping complete. Adding to ChromaDB...


Add of existing embedding ID: 0
Insert of existing embedding ID: 0


Stored 1 articles in ChromaDB.


OllamaEmbeddings: 100%|██████████| 14/14 [00:32<00:00,  2.35s/it]
Insert of existing embedding ID: id1
Insert of existing embedding ID: id2
Insert of existing embedding ID: id3
Insert of existing embedding ID: id4
Insert of existing embedding ID: id5
Insert of existing embedding ID: id6
Insert of existing embedding ID: id7
Insert of existing embedding ID: id8
Insert of existing embedding ID: id9
Insert of existing embedding ID: id10
Insert of existing embedding ID: id11
Insert of existing embedding ID: id12
Insert of existing embedding ID: id13
Insert of existing embedding ID: id14
Insert of existing embedding ID: id15
Add of existing embedding ID: id1
Add of existing embedding ID: id2
Add of existing embedding ID: id3
Add of existing embedding ID: id4
Add of existing embedding ID: id5
Add of existing embedding ID: id6
Add of existing embedding ID: id7
Add of existing embedding ID: id8
Add of existing embedding ID: id9
Add of existing embedding ID: id10
Add of existing embedding ID: id

Documents added successfully to ChromaDB!
Nodes and edges added to GraphRAG!
Top document found: ['text8', 'text14', 'text5', 'text2', 'text3']
Related nodes from GraphRAG: ['id9']


OllamaEmbeddings: 100%|██████████| 1/1 [00:04<00:00,  4.67s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.07s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.11s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.13s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.13s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.16s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.11s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.13s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.12s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.11s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [

Response: Based on the provided context, it can be answered as follows:

Currently, there are no specific codified laws, statutory rules or regulations in India that directly regulate AI.



## llama3 LLM Model card

- **`Model Overview`:**
    - Model Name: llama3, accessed through the ChatOllama interface
    - Model Type: Large Language Model (LLM)
    - Usage: The model generates responses from articles retrieved from a vector database (ChromaDB); a graph structure (GraphRAG) is queried alongside it for related documents.

### Intended Use

- **`Primary Use Case`:**
    - To answer user queries related to AI governance by referencing a curated set of articles and documents.
    - To act as an informational tool, offering insights into AI regulations and policies based on pre-scraped data.

- **`Intended Users`:**
    - Researchers and policy makers interested in AI governance and regulatory developments.
    - Organizations and individuals looking to stay informed about AI regulations and best practices.
    - Anyone needing a reliable source for AI governance information that is based on a specific set of documents.

### Performance

- **`Accuracy`:**
    - The model's accuracy is tied to its ability to extract and generate responses based on the documents stored in ChromaDB.
    - Its responses depend heavily on the relevance and completeness of the scraped articles, as it doesn't have access to information beyond what is stored in the database.

### Limitations
- **`Model Limitations`:**
    - Contextual Understanding: The model may struggle with understanding nuanced or context-specific questions if the required information is not available in the retrieved documents.
    - Training Data Dependency: Since the model’s responses are solely based on the retrieved articles, its performance is directly tied to the quality and scope of the data in ChromaDB.
- **`Performance Variability`:**
    - The model's accuracy may vary based on the comprehensiveness of the documents in ChromaDB. If the documents are biased or incomplete, this can affect the model's responses.
    - The model is limited to the scope of the pre-scraped articles; it cannot provide information outside of this dataset.


### Biases
- **`Document Bias`:**
    - If the scraped documents are biased, the model's answers may reflect that bias.

- **`Query Bias`:**
    - The phrasing of questions can influence the model’s ability to retrieve relevant information, potentially leading to biased or incomplete responses.


### Security & Privacy
- **`Data Security`:**
    - Ensure that the scraped documents do not contain sensitive or personal information. User queries should be handled securely, with no storage of personal data.
    
- **`Model Attacks`:**
    - The system could be exposed to adversarial attacks through crafted queries designed to exploit the retrieval or response generation process. Measures should be in place to identify and mitigate such risks.

### Ethical Considerations
- **`Ethical Use`:**
    - The system should be used ethically, respecting user privacy and adhering to ethical standards in AI. This includes providing accurate, unbiased information and not using the system to disseminate misinformation.
    - Users should be informed about the limitations and scope of the system to set correct expectations.

- **`Future Work`:**
    - 1. Enhancing Explainability: Further work is needed to make the model's decision-making process more transparent and understandable to end-users.
    - 2. Addressing Biases: Implementing methods to identify and mitigate biases in the source documents and responses, ensuring a more balanced and fair output.
    - 3. Expanding Document Coverage: Continually update the document set in ChromaDB to include the latest developments in AI governance, ensuring the model stays relevant and comprehensive.
    - 4. Exploring Additional XAI Methods: Investigate additional explainable AI methods to better understand the model's processing and improve user trust.

### Risk Assessment

- **`Data Bias`:**

    - **Issue**: Embedding models can inherit and amplify biases present in their training data, potentially leading to skewed or unfair outcomes in the responses.
    - **Impact on Model**: If the articles and documents in ChromaDB contain biases, the model may provide biased information, reflecting a narrow viewpoint. The outcome also depends on how the prompt is phrased and how precisely the question is asked.

- **`Adversarial Attacks`:**
    - **Issue**: The model might be vulnerable to adversarial attacks, where slightly altered input can lead to significantly different outputs, undermining the model's reliability.
    - **Impact on Model**: Users could craft queries to exploit the model's weaknesses, causing it to retrieve irrelevant or misleading information.

- **`Privacy Risks`:**
    - **Issue**: If the model processes sensitive data, embeddings might inadvertently reveal personal information.
    - **Impact on Model**: There's a risk of exposing sensitive information if the scraped content includes identifiable data or if user queries are not handled securely.

- **`Overcoming Challenges`:**
    - **Bias Mitigation**: Employ bias detection and mitigation strategies during model training and before deployment. Regularly update the model with more diverse and representative training data.
    - **Robustness Testing**: Implement adversarial testing and robustness checks to identify and mitigate vulnerabilities to attacks, ensuring consistent and reliable outputs.
    - **Privacy Preservation**: Use techniques such as differential privacy during training to minimize the risk of exposing sensitive information through the model's outputs.