# University Presence - Retrieval-Augmented Generation Model Workshop

### TODO: Intro into Business Case

# Introduction to Retrieval-Augmented Generation (RAG)

## Overview
Retrieval-Augmented Generation (RAG) combines a retrieval step with a generative language model: relevant passages are first fetched from an external corpus, and the model then writes its answer with those passages as context. Grounding generation in retrieved text tends to make responses more accurate, relevant, and contextually appropriate than generation alone. This hybrid approach is particularly useful when large amounts of information need to be accessed and summarized efficiently, such as in question answering systems, customer support, and knowledge management.

## Components of RAG
1. **Document Retrieval:**
   - The first step involves retrieving relevant documents or passages from a large corpus of text. This is typically done using vector databases and similarity search techniques. The aim is to narrow down the vast information to a few relevant pieces that can be further processed.

2. **Embedding Models:**
   - Embedding models transform text into numerical vectors that capture semantic meaning. Embeddings can be computed at several levels of granularity: character, word, sentence, or document. The right level depends on the use case and on how fine-grained retrieval needs to be.

3. **Vector Database (e.g., Pinecone):**
   - A vector database stores and indexes these embeddings, enabling efficient similarity searches. Pinecone, for instance, is a scalable vector database that supports high-dimensional vector search, making it ideal for real-time retrieval tasks in RAG systems.

4. **Language Model (LLM) Prompting:**
   - Once the relevant documents are retrieved, a language model (such as GPT-3.5 or similar) generates a response based on the retrieved context and the user's query. This step involves prompt engineering to guide the model in producing high-quality outputs.
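
The similarity search performed by the vector database comes down to comparing vectors with a metric. The following is a minimal sketch, assuming tiny hand-made 3-dimensional vectors in place of real embeddings. The function names and example vectors are illustrative only and are not part of Pinecone or any other library used in this workshop.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical "embeddings" (real ones have hundreds of dimensions)
query = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]  # same direction as the query, twice the length
doc_orthogonal = [0.0, 3.0, 0.0]      # shares no direction with the query

for name, doc in [("same direction", doc_same_direction), ("orthogonal", doc_orthogonal)]:
    print(
        name,
        round(cosine_similarity(query, doc), 3),
        round(dot_product(query, doc), 3),
        round(euclidean_distance(query, doc), 3),
    )
```

Cosine similarity rates the first document as a near-perfect match (approximately 1.0) even though its length differs from the query, while Euclidean distance still reports a gap between them. Which metric fits best depends on how the embedding model was trained.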

## How RAG Works
1. **Query Processing:**
   - The user inputs a query. This query is embedded using an embedding model to create a vector representation.
   
2. **Retrieval Step:**
   - The query vector is used to search the vector database, retrieving the most similar documents or passages. This narrows down the information to the most relevant pieces.

3. **Generation Step:**
   - The retrieved documents are fed into a language model along with the original query. The language model uses this context to generate a coherent and relevant response.

4. **Response Delivery:**
   - The generated response is presented to the user, providing a comprehensive answer or summary based on the combined knowledge of the retrieved documents and the language model's generative capabilities.
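
The four steps above can be sketched end to end without any external services. This is a toy illustration under stated assumptions: `toy_embed` replaces a real embedding model with simple word counts, an in-memory dictionary replaces the vector database, and the generation step stops at building the prompt that would be sent to a language model. All names and the two-document corpus are hypothetical.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, corpus: dict, top_k: int = 1) -> list:
    """Steps 1 and 2: embed the query, then rank documents by similarity."""
    q = toy_embed(query)
    ranked = sorted(corpus, key=lambda doc_id: cosine(q, toy_embed(corpus[doc_id])), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, corpus: dict, doc_ids: list) -> str:
    """Step 3 (input side): combine retrieved text and the query into one prompt."""
    context = "\n".join(corpus[doc_id] for doc_id in doc_ids)
    return f"Context: {context}\nQuestion: {query}\nAnswer: "

# Hypothetical corpus: file name -> content
corpus = {
    "pinecone.txt": "Pinecone is a vector database for similarity search.",
    "bread.txt": "Sourdough bread needs flour, water, salt and time.",
}

query = "What is a vector database?"
top_ids = retrieve(query, corpus, top_k=1)
prompt = build_prompt(query, corpus, top_ids)
print(top_ids)
print(prompt)
```

In a real system the prompt would be passed to a language model, and its output would be returned to the user as step 4.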


In [None]:
### Idea: Create Multiple Datasets  -> Different topics (i.e. Tech documentation, Finance, Law, Ethics and Compliance, Health, etc). Each group gets to pick a dataset to present a business case (How would you sell it to a client?)
### Idea: More Complex - Evaluate different similarity metrics, and parameters. -> Maybe different metrics https://www.pinecone.io/learn/vector-similarity/

In [None]:
import pinecone

# Initialize Pinecone
pinecone.init(api_key='your_pinecone_api_key', environment='your_pinecone_environment')

# Model Loading


In [None]:
#### Global Variables that should not be modified once run.
group_name = "YOUR_GROUP_NAME"

### Load the corpus/data we want to create a vector database from

In [None]:
import os

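def load_documents_from_directory(directory_path, file_extensions : str | list = '.txt'):
    # Accept a single extension or a list; str.endswith takes a tuple of suffixes
    if isinstance(file_extensions, str):
        file_extensions = [file_extensions]
    suffixes = tuple(file_extensions)

    documents = {}
    for filename in os.listdir(directory_path):
        if filename.endswith(suffixes):
            # Files are read as plain UTF-8 text
            with open(os.path.join(directory_path, filename), 'r', encoding='utf-8') as file:
                documents[filename] = file.read()
    return documents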

# Example usage
directory_path = './sources/data/{topic}'.format(topic="")
file_extensions = [".txt"]  # files are read as UTF-8 text, so formats such as PDF would need their own parser
documents = load_documents_from_directory(directory_path, file_extensions)
print(f"Loaded {len(documents)} documents.")

### Create the embeddings for the documents we want to find information in

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the embedding model once and reuse it for every call
model_name  = 'sentence-transformers/all-MiniLM-L6-v2'
embedding_model = SentenceTransformer(model_name)

# Function to create embeddings
# documents maps file name -> file content, as returned by load_documents_from_directory
def create_embeddings(documents: dict = None, model = None, embedding_type: str = 'sentence'):

    if not documents:
        raise ValueError('You have provided an empty document collection')
    
    if (model is None):
        raise TypeError('You have provided an invalid model or it has not been loaded correctly')
    
    embeddings = {}

    for doc_name, doc_content in documents.items():
        if embedding_type == 'character':
            embeddings[doc_name] = model.encode([c for c in doc_content]) # Modify this
        elif embedding_type == 'word':
            embeddings[doc_name] = model.encode(doc_content.split()) # Modify this
        elif embedding_type == 'sentence':
            # Naive split on '.'; empty fragments are dropped
            sentences = [s.strip() for s in doc_content.split('.') if s.strip()]
            embeddings[doc_name] = model.encode(sentences) # Modify this
        elif embedding_type == 'document':
            embeddings[doc_name] = model.encode([doc_content]) # Modify this
        else:
            raise ValueError(f"Unknown embedding type: {embedding_type}")
    
    return embeddings


# Pass the loaded model explicitly; each value is an array of shape (n_units, dimension)
document_embeddings = create_embeddings(documents, embedding_model, embedding_type='document')
sentence_embeddings = create_embeddings(documents, embedding_model, embedding_type='sentence')
word_embeddings = create_embeddings(documents, embedding_model, embedding_type='word')
character_embeddings = create_embeddings(documents, embedding_model, embedding_type='character')


### Create the Vector Database with the vector embeddings

In [None]:

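chunk_size = 256  # kept for splitting text into chunks; this is not the vector dimension

# Create Pinecone index
def create_pinecone_index(group_name, dimension):
    # Index names are restricted to lowercase letters, digits, and hyphens
    index_name = 'rag-workshop-' + group_name.lower().replace('_', '-')
    if index_name not in pinecone.list_indexes():
        pinecone.create_index(index_name, dimension=dimension)  # Must match the embedding model's output size
    index = pinecone.Index(index_name)
    return index

# Function to add embeddings to Pinecone
def add_embeddings_to_pinecone(index, embeddings):
    for doc_name, embedding in embeddings.items():
        # Document-level embeddings have shape (1, dimension); upsert the single row
        index.upsert([(doc_name, embedding[0].tolist())])
    

# Ask the model for its output size instead of hard-coding it (384 for all-MiniLM-L6-v2)
embedding_dimension = embedding_model.get_sentence_embedding_dimension()
pinecone_index = create_pinecone_index(group_name, embedding_dimension)
add_embeddings_to_pinecone(pinecone_index, document_embeddings)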
print("Embeddings added to Pinecone.")

### Fetch the documents which are the most semantically similar to the query

In [None]:
from transformers import pipeline

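# Initialize a language model pipeline
# 'gpt-3.5-turbo' is served only through the OpenAI API and cannot be loaded by transformers;
# 'gpt2' is used here as a small open placeholder.
llm_pipeline = pipeline('text-generation', model='gpt2')  # Replace with appropriate model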

# Function to query the vector database and return the IDs of the most similar documents
def retrieve_similar_documents(query, top_k=5):
    # Reuse the embedding model loaded earlier; encoding a single string yields a 1-D vector
    query_embedding = embedding_model.encode(query)
    search_results = pinecone_index.query(vector=query_embedding.tolist(), top_k=top_k)
    
    # Document IDs are the file names used as keys when upserting
    retrieved_docs = [result['id'] for result in search_results['matches']]
    
    return retrieved_docs



### Generate a response using a LLM.

In [None]:
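def query_rag_system(documents: dict, query: str, top_k: int = 5):

    # Retrieve the IDs (file names) of the documents most similar to the query
    retrieved_ids = retrieve_similar_documents(query, top_k=top_k)

    # Generate response using LLM, with only the retrieved documents as context
    context = " ".join([documents[doc_id] for doc_id in retrieved_ids if doc_id in documents])
    response = llm_pipeline(f"Context: {context}\nQuestion: {query}\nAnswer: ")
    
    return response[0]['generated_text']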

Great! Now that we have a working RAG (Retrieval-Augmented Generation) Model, we will want our customers/users to be able to retrieve documents freely using natural language.

However, it is often the case that these RAG Models perform better when given prompts that provide additional context and guidance in the role they are trying to fulfill.

# Understanding Prompt Engineering
Prompt engineering is the practice of designing and refining the input prompts given to a language model to achieve the desired output. Essentially, it involves crafting questions or statements in a way that leverages the strengths of the language model, guiding it to produce more accurate, relevant, and contextually appropriate responses. This is crucial for improving the performance of RAG systems, where the quality of retrieved documents and generated responses can directly impact user satisfaction and correctness.

We will now be looking at the benefits of prompt engineering and how we can modify the following query to enhance our RAG Model performance.

## Benefits of Prompt Engineering

Consider the query: "Tell me about climate change."

It is an open-ended and ambiguous prompt, which leaves room for the RAG model to hallucinate and generate misleading, useless, or incorrect responses. We can _mitigate_ this by employing certain techniques in our prompts.

### Contextual Understanding: 

Prompts that provide specific context can help the language model better understand the user's intent. This reduces ambiguity and ensures that the model retrieves and generates content that is closely aligned with the user's needs.

__Modified Query__:
"Tell me about climate change and its impact on coastal cities."

### Enhanced Relevance: 

By framing prompts to include relevant details, users can improve the relevance of the documents retrieved from the vector database. This ensures that the information presented is more pertinent to the query.

__Modified Query__:
"Tell me about the effects of climate change on agriculture in North America."

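The effect of added detail on retrieval can be illustrated with a small sketch. It assumes a toy bag-of-words "embedding" and a hypothetical two-document corpus, so the numbers only mimic what a real embedding model and vector database would do. All function names are illustrative.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical corpus: document name -> content
docs = {
    "coastal": "Climate change raises sea levels and threatens coastal cities.",
    "agriculture": "Drought and heat linked to climate change are reducing crop yields for agriculture in North America.",
}

def best_match(query: str) -> str:
    """Return the name of the document most similar to the query."""
    q = bag_of_words(query)
    return max(docs, key=lambda name: cosine(q, bag_of_words(docs[name])))

vague = "Tell me about climate change."
specific = "Tell me about the effects of climate change on agriculture in North America."

print(best_match(vague))     # coastal
print(best_match(specific))  # agriculture
```

The vague query shares only "climate" and "change" with both documents, so the shorter document wins by accident. The specific query shares additional terms with the agriculture document and retrieves it reliably.
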
### Role Specification: 

Defining the role of the model within the prompt (e.g., "As an expert in history, summarize the events of World War II") can help the model generate responses that are more authoritative and tailored to the specified role.

__Modified Query__:
"As an environmental scientist, explain the causes and effects of climate change."

### Guidance and Structure: 

Structured prompts (e.g., "Given the following context, provide a summary: [context]") guide the model on how to approach the response, which can lead to more coherent and well-organized outputs.

__Modified Query__:
"Given the following context, provide a summary of the main points about climate change: [context]"

### Bias Mitigation: 

Thoughtfully crafted prompts can help mitigate model biases by steering the model towards neutral and objective language, particularly in sensitive or controversial topics.

__Modified Query__:
"Provide an objective overview of climate change, including its causes, effects, and potential solutions."

# Adding Prompt Engineering to our RAG Model

It would be too much to ask of our users/customers to apply all these techniques themselves whenever they query the system for information. Therefore, many applications that employ RAG models preprocess user prompts before sending them to the model, so that every query benefits from Prompt Engineering.
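
One possible shape for such a preprocessing step is sketched below. It is a minimal example, assuming a single template that applies role specification, structure, and an objectivity instruction to every query. The function name `engineer_prompt`, its parameters, and the template wording are hypothetical choices rather than a fixed recipe.

```python
def engineer_prompt(user_query: str, context: str, role: str = "a helpful assistant") -> str:
    """Wrap a raw user query with a role, retrieved context, and answer guidance."""
    return (
        f"You are {role}.\n"
        "Answer objectively, using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question: {user_query.strip()}\n"
        "Answer:"
    )

# The user types only the raw query; the application supplies everything else.
prompt = engineer_prompt(
    "Tell me about climate change.",
    context="[context]",  # placeholder for the text of the retrieved documents
    role="an environmental scientist",
)
print(prompt)
```

The resulting string can be passed to the language model in place of the plain `Context: ... Question: ... Answer:` prompt used earlier in the notebook.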