# University Presence - Retrieval Augmented Generation Model Workshop

### TODO: Small Intro
### TODO: Intro into RAGs
### TODO: Intro into Business Case

In [None]:
### TODO: Optional Part for Advanced students, to do some data wrangling
### AKA: Correct Document loading, cleansing, transforming intermediate steps -> 
### Maybe make it replicate a small BRONZE-SILVER-GOLD set up -> 
#   bronze is raw documents
#   silver is prefiltered documents based on type
#   gold is documents filtered into the correct data type.
###

In [None]:
import os

def load_documents_from_directory(directory_path):
    documents = {}
    for filename in os.listdir(directory_path):
        if filename.endswith('.txt'):
            with open(os.path.join(directory_path, filename), 'r', encoding='utf-8') as file:
                documents[filename] = file.read()
    return documents

# Example usage
directory_path = [...] # W
documents = load_documents_from_directory(directory_path)
print(f"Loaded {len(documents)} documents.")

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Function to create embeddings
def create_embeddings(documents, model_name='sentence-transformers/all-MiniLM-L6-v2', embedding_type='sentence'):
    model = SentenceTransformer(model_name)
    embeddings = {}
    
    for doc_name, doc_content in documents.items():
        if embedding_type == 'character':
            embeddings[doc_name] = model.encode([c for c in doc_content]) # Modify this
        elif embedding_type == 'word':
            embeddings[doc_name] = model.encode(doc_content.split()) # Modify this
        elif embedding_type == 'sentence':
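            # Drop empty fragments (e.g. after the final period) before encoding
            embeddings[doc_name] = model.encode([s.strip() for s in doc_content.split('.') if s.strip()]) # Modify this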
        elif embedding_type == 'document':
            embeddings[doc_name] = model.encode([doc_content]) # Modify this
        else:
            raise ValueError(f"Unknown embedding type: {embedding_type}")
    
    return embeddings

# Example usage
embeddings = create_embeddings(documents, embedding_type='document')
print(f"Created embeddings for {len(embeddings)} documents.")

In [None]:
import pinecone

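# Initialize Pinecone
# Note: pinecone.init() belongs to the pinecone-client v2 API; newer client releases use a Pinecone class instead.
pinecone.init(api_key='your_pinecone_api_key', environment='your_pinecone_environment')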

# Create Pinecone index
index_name = 'rag-workshop'
if index_name not in pinecone.list_indexes():
    # TODO: Modify this such that students might have to add their own dimensions and the index name should change.
    pinecone.create_index(index_name, dimension=384)  # Adjust dimension according to your model
index = pinecone.Index(index_name)

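# Function to add embeddings to Pinecone
def add_embeddings_to_pinecone(embeddings):
    for doc_name, embedding in embeddings.items():
        # model.encode() received a list, so each value is a 2-D array of shape (n_chunks, dim).
        # Pinecone expects one flat vector per ID; this assumes embedding_type='document' (n_chunks == 1).
        index.upsert([(doc_name, embedding[0].tolist())])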
    
# Example usage
add_embeddings_to_pinecone(embeddings)
print("Embeddings added to Pinecone.")

In [None]:
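from transformers import pipeline

# Initialize a language model pipeline.
# 'gpt-3.5-turbo' is only served through the OpenAI API and cannot be loaded by transformers,
# so 'gpt2' is used here as a small placeholder. Replace with an appropriate model.
llm_pipeline = pipeline('text-generation', model='gpt2')

# Load the embedding model once; it must be the same model that embedded the documents
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Function to query the vector database and generate a response
def query_rag_system(query, top_k=5):
    # Encoding a single string returns a flat 1-D vector, which is what the index expects
    query_embedding = embedding_model.encode(query)
    search_results = index.query(vector=query_embedding.tolist(), top_k=top_k)
    
    # Retrieve the IDs (file names) of the best-matching documents
    retrieved_docs = [result['id'] for result in search_results['matches']]
    
    # Generate response using LLM
    context = " ".join([documents[doc_id] for doc_id in retrieved_docs])
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer: "
    # return_full_text=False returns only the generated continuation, not the prompt itself
    response = llm_pipeline(prompt, max_new_tokens=200, return_full_text=False)
    
    return response[0]['generated_text']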

# Example usage
query = "What is the content of document X?"
response = query_rag_system(query)
print(response)


Great! Now that we have a working RAG (Retrieval-Augmented Generation) model, we want our customers/users to be able to retrieve documents freely using natural language.

However, RAG models often perform better when the prompt provides additional context and guidance about the role the model is expected to fulfill.

# Understanding Prompt Engineering
Prompt engineering is the practice of designing and refining the input prompts given to a language model to achieve the desired output. Essentially, it involves crafting questions or statements in a way that leverages the strengths of the language model, guiding it to produce more accurate, relevant, and contextually appropriate responses. This is crucial for improving the performance of RAG systems, where the quality of retrieved documents and generated responses can directly impact user satisfaction and correctness.

We will now look at the benefits of prompt engineering and at how a single example query can be modified to improve the performance of our RAG model.

## Benefits of Prompt Engineering

Consider the query: "Tell me about climate change."

It is an open-ended and ambiguous prompt, which leaves room for the RAG model to hallucinate and generate potentially misleading, useless or incorrect responses. We can _mitigate_ this by employing certain techniques in our prompts.

### Contextual Understanding: 

Prompts that provide specific context can help the language model better understand the user's intent. This reduces ambiguity and ensures that the model retrieves and generates content that is closely aligned with the user's needs.

__Modified Query__:
"Tell me about climate change and its impact on coastal cities."

### Enhanced Relevance: 

By framing prompts to include relevant details, users can improve the relevance of the documents retrieved from the vector database. This ensures that the information presented is more pertinent to the query.

__Modified Query__:
"Tell me about the effects of climate change on agriculture in North America."

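To make the effect of added detail on retrieval concrete, here is a toy sketch. It uses a simple bag-of-words cosine similarity as a stand-in for the embedding similarity that a vector database computes. The `bow_cosine` helper and the two example documents are hypothetical and only serve as an illustration; real embedding scores will differ.

```python
import math
import re
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors (toy stand-in for embeddings)."""
    words_a = Counter(re.findall(r"[a-z]+", text_a.lower()))
    words_b = Counter(re.findall(r"[a-z]+", text_b.lower()))
    dot = sum(words_a[w] * words_b[w] for w in words_a)
    norm_a = math.sqrt(sum(c * c for c in words_a.values()))
    norm_b = math.sqrt(sum(c * c for c in words_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical mini corpus, standing in for the documents in the vector database
corpus = {
    "agriculture.txt": "Climate change reduces crop yields and shifts agriculture in North America.",
    "glaciers.txt": "Climate change accelerates glacier melting across Arctic regions.",
}

vague_query = "Tell me about climate change."
detailed_query = "Tell me about the effects of climate change on agriculture in North America."

# Print the score of each document for the vague and for the detailed query
for name, text in corpus.items():
    print(name, round(bow_cosine(vague_query, text), 2), round(bow_cosine(detailed_query, text), 2))
```

In this toy setting the vague query cannot separate the two documents well, because it shares only the words "climate" and "change" with each of them, whereas the detailed query scores the agriculture document markedly higher.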
### Role Specification: 

Defining the role of the model within the prompt (e.g., "As an expert in history, summarize the events of World War II") can help the model generate responses that are more authoritative and tailored to the specified role.

__Modified Query__:
"As an environmental scientist, explain the causes and effects of climate change."

### Guidance and Structure: 

Structured prompts (e.g., "Given the following context, provide a summary: [context]") guide the model on how to approach the response, which can lead to more coherent and well-organized outputs.

__Modified Query__:
"Given the following context, provide a summary of the main points about climate change: [context]"

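A structured prompt like the one above can be kept as a reusable template. The following is a minimal sketch; `SUMMARY_TEMPLATE`, `build_summary_prompt` and the two example passages are hypothetical names and data chosen for illustration.

```python
# Hypothetical template; the wording mirrors the modified query above
SUMMARY_TEMPLATE = (
    "Given the following context, provide a summary of the main points about {topic}:\n\n"
    "{context}"
)

def build_summary_prompt(topic, retrieved_texts):
    """Fill the template with the topic and the retrieved document texts."""
    # Number each passage so the model can tell the sources apart
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(retrieved_texts, start=1))
    return SUMMARY_TEMPLATE.format(topic=topic, context=context)

prompt = build_summary_prompt(
    "climate change",
    ["Sea levels are rising.", "Heat waves are becoming more frequent."],
)
print(prompt)
```

Keeping the template separate from the code that fills it makes it easy to experiment with different wordings without touching the retrieval logic.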
### Bias Mitigation: 

Thoughtfully crafted prompts can help mitigate model biases by steering the model towards neutral and objective language, particularly in sensitive or controversial topics.

__Modified Query__:
"Provide an objective overview of climate change, including its causes, effects, and potential solutions."

# Adding Prompt Engineering to our RAG Model

It would be too much to ask of our users/customers to apply all these techniques themselves every time they query the system for information. Therefore, many applications that employ RAG models preprocess user prompts before passing them to the language model, so that the benefits of prompt engineering are applied automatically.
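One possible shape for such a preprocessing step is sketched below. The function name `engineer_prompt`, the default role and the exact instruction wording are assumptions made for illustration, not a fixed recipe; the sketch combines role specification, structure and a request for objective language around the raw user query.

```python
def engineer_prompt(user_query, context, role="an environmental scientist"):
    """Wrap a raw user query with role, context, and guidance before it reaches the LLM."""
    return (
        f"As {role}, answer the question using only the context below.\n"
        "Use neutral, objective language, and say so if the context does not contain the answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query.strip()}\n"
        "Answer:"
    )

# Example with a hypothetical retrieved passage
print(engineer_prompt("Tell me about climate change.", "Sea levels are rising."))
```

Inside `query_rag_system`, the string returned by such a function could replace the plain `Context: ... Question: ... Answer:` prompt that is passed to `llm_pipeline`, so users keep typing simple questions while the model receives the engineered version.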