In [1]:
import sys
sys.path.append('../scripts')

In [2]:
from evaluation_methods import EvaluationMethods

In [3]:
eval_methods = EvaluationMethods()

In [4]:
# Define prompts and initial ratings
prompts = [
    "What is the primary business objective of PromptlyTech?",
    "Name the key services provided by PromptlyTech.",
    "Explain the importance of Automatic Evaluation Data Generation in PromptlyTech's offerings.",
    "How does the ELO Rating System work in the context of prompt testing and ranking?",
    "What are some innovative approaches suggested for prompt evaluation in Task 5?",
    "How does the user interface contribute to the overall prompt engineering system in Task 5?"
]
elo_ratings = {prompt: 1500 for prompt in prompts}  # Initial ratings

# Conduct multiple rounds of evaluation
for _ in range(10):  # Number of rounds
    elo_ratings = eval_methods.elo_ratings_func(prompts, elo_ratings)

# Sort prompts by their final Elo ratings
sorted_prompts = sorted(prompts, key=lambda x: elo_ratings[x], reverse=True)

# Print the ranked prompts
for prompt in sorted_prompts:
    print(f"{prompt}: {elo_ratings[prompt]}")

How does the ELO Rating System work in the context of prompt testing and ranking?: 1588.4422577440132
Explain the importance of Automatic Evaluation Data Generation in PromptlyTech's offerings.: 1543.6798996529228
What are some innovative approaches suggested for prompt evaluation in Task 5?: 1532.3104658608177
How does the user interface contribute to the overall prompt engineering system in Task 5?: 1525.6051803007426
What is the primary business objective of PromptlyTech?: 1515.1359449372885
Name the key services provided by PromptlyTech.: 1508.719376595432


### Prompts evaluation
The ranking below follows the output above, with ratings rounded to two decimals. Every prompt started at 1500.

1. **How does the ELO Rating System work in the context of prompt testing and ranking?:** 1588.44
   - This prompt has the **highest rating**, suggesting it was evaluated as the most valuable in explaining how the ELO Rating System works in prompt testing and ranking.

2. **Explain the importance of Automatic Evaluation Data Generation in PromptlyTech's offerings.:** 1543.68
   - This prompt also received a **high rating**, indicating that it was considered valuable in understanding the importance of Automatic Evaluation Data Generation in PromptlyTech's offerings.

3. **What are some innovative approaches suggested for prompt evaluation in Task 5?:** 1532.31
   - The prompt received a **solid rating**, indicating it was perceived as relevant in exploring innovative approaches for prompt evaluation in Task 5.

4. **How does the user interface contribute to the overall prompt engineering system in Task 5?:** 1525.61
   - This prompt has a **good rating**, suggesting it was seen as useful in understanding the contribution of the user interface to the prompt engineering system in Task 5.

5. **What is the primary business objective of PromptlyTech?:** 1515.14
   - The prompt received a **reasonable rating**, indicating it was considered relevant in understanding the primary business objective of PromptlyTech.

6. **Name the key services provided by PromptlyTech.:** 1508.72
   - This prompt received the **lowest rating**, suggesting it may be perceived as less critical than the other prompts, although it still finished above the starting rating of 1500.
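
For reference, the standard Elo update behind rankings like this can be sketched as follows. This is a minimal sketch of the textbook formula, not the implementation of `eval_methods.elo_ratings_func`, whose internals are not shown in this notebook. The function names `expected_score` and `update_elo` and the K-factor of 32 are assumptions made for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one matchup.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a draw.
    k (assumed to be 32 here) caps how far one result can move a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The winner gains exactly what the loser gives up.
    return rating_a + delta, rating_b - delta


# Two prompts start level at 1500; the first one wins the matchup.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

Because the expected score depends on the rating gap, beating a higher-rated prompt moves a rating more than beating a lower-rated one.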


In [5]:
def evaluate_prompt(main_prompt, test_cases):
    evaluations = {}

    # Evaluate the main prompt using Monte Carlo and Elo methods
    evaluations['main_prompt'] = {
        'Monte Carlo Evaluation': eval_methods.monte_carlo_eval(main_prompt),
        'Elo Rating Evaluation': eval_methods.elo_eval(main_prompt)
    }

    # Evaluate each test case
    for idx, test_case in enumerate(test_cases):
        evaluations[f'test_case_{idx+1}'] = {
            'Monte Carlo Evaluation': eval_methods.monte_carlo_eval(test_case),
            'Elo Rating Evaluation': eval_methods.elo_eval(test_case)
        }

    return evaluations

In [6]:
main_prompt = "How does effective prompt engineering contribute to the success of AI-driven solutions, especially in optimizing the use of Language Models (LLMs) in various industries?"
test_cases = [
    "What role does prompt engineering play in enhancing decision-making, operational efficiency, and customer experience in various industries?",
    "How does Automatic Prompt Generation by PromptlyTech streamline the process of creating effective prompts for businesses?",
    "In what ways does Automatic Evaluation Data Generation by PromptlyTech contribute to enhancing the reliability and performance of LLM applications?",
    "Explain the significance of PromptlyTech's Prompt Testing and Ranking Service in ensuring accurate and contextually relevant responses from chatbots and virtual assistants.",
    "Can you elaborate on the innovative approaches mentioned for prompt evaluation in Task 5 of the challenge?",
    "How does the user interface developed for prompt engineering contribute to the overall user experience in Task 5?"
]
result = evaluate_prompt(main_prompt, test_cases)
print(result)



{'main_prompt': {'Monte Carlo Evaluation': 1.9, 'Elo Rating Evaluation': 1504.2019499940866}, 'test_case_1': {'Monte Carlo Evaluation': 1.97, 'Elo Rating Evaluation': 1489.2019499940866}, 'test_case_2': {'Monte Carlo Evaluation': 1.98, 'Elo Rating Evaluation': 1489.2019499940866}, 'test_case_3': {'Monte Carlo Evaluation': 2.02, 'Elo Rating Evaluation': 1504.2019499940866}, 'test_case_4': {'Monte Carlo Evaluation': 2.04, 'Elo Rating Evaluation': 1489.2019499940866}, 'test_case_5': {'Monte Carlo Evaluation': 2.07, 'Elo Rating Evaluation': 1504.2019499940866}, 'test_case_6': {'Monte Carlo Evaluation': 1.96, 'Elo Rating Evaluation': 1489.2019499940866}}


### Interpretation

#### 1. Monte Carlo Evaluation:
   - **Scores Range:** From 1 to 3, with higher scores indicating greater relevance or quality of the prompt.
   ###### Interpretation:
   - **1.9 (Main Prompt):** Slightly below the scale midpoint of 2, suggesting slightly below-average relevance or quality.
   - **1.96 to 2.07 (Test Cases):** Test Cases 1, 2, and 6 score 1.97, 1.98, and 1.96; Test Cases 3, 4, and 5 score 2.02, 2.04, and 2.07. Scores this close to 2 suggest moderate relevance or quality, with only small differences between test cases.
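
A Monte Carlo score of this kind is an average over repeated random trials. The sketch below shows that idea only. It is not the implementation of `eval_methods.monte_carlo_eval`, which is not shown in this notebook; `monte_carlo_score`, `toy_scorer`, the 100 trials, and the random 1-3 rating are all assumptions made for illustration.

```python
import random


def monte_carlo_score(score_once, trials=100, seed=None):
    """Average a noisy per-trial score over many random trials."""
    rng = random.Random(seed)
    total = sum(score_once(rng) for _ in range(trials))
    return round(total / trials, 2)


def toy_scorer(rng):
    # Hypothetical stand-in for a real judgement: a random 1-3 rating.
    return rng.choice([1, 2, 3])


estimate = monte_carlo_score(toy_scorer, trials=100, seed=0)
print(estimate)  # a value between 1 and 3; near 2 for this toy scorer
```

With a scorer that carries no real signal, the average settles near the midpoint of the scale, so small differences between estimates mostly reflect sampling noise.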

#### 2. Elo Rating Evaluation:
   - **Base Rating:** Usually starts at 1500, with changes based on the 'performance' of the prompt against a set of standards.
   - **Higher than 1500:** Indicates the prompt performed better than average.
   - **Lower than 1500:** Indicates the prompt performed worse than average.
   ###### Interpretation:
   - **1504.20 (Main Prompt):** Slightly above the 1500 baseline.
   - **1489.20 (Test Cases 1, 2, 4, 6):** Slightly below the baseline, suggesting weaker performance.
   - **1504.20 (Test Cases 3, 5):** Slightly above the baseline, level with the main prompt.

#### Overall Interpretation:
   - **Main Prompt:** The two evaluations disagree slightly. The Monte Carlo score (1.9) is just below the midpoint of its scale, while the Elo rating (1504.20) is just above the baseline. Both place the main prompt close to average.
   - **Test Cases:** All test cases sit close to average on both measures. Monte Carlo scores vary only from 1.96 to 2.07. In the Elo evaluation, Test Cases 3 and 5 match the main prompt at 1504.20, while Test Cases 1, 2, 4, and 6 fall slightly below the baseline at 1489.20.
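
Reading each score against its reference point is easy to get wrong by eye, so the comparison can be done in code. This is a minimal sketch: `side` and `summarize` are hypothetical helper names, the midpoint of 2.0 comes from the 1 to 3 scale described above, and the sample holds two entries copied from the printed result.

```python
MC_MIDPOINT = 2.0      # midpoint of the 1-3 Monte Carlo scale
ELO_BASELINE = 1500.0  # starting Elo rating


def side(value, reference):
    """Label a value as above, below, or at a reference point."""
    if value > reference:
        return "above"
    if value < reference:
        return "below"
    return "at"


def summarize(evaluations):
    """Map each evaluated prompt to its position on both scales."""
    return {
        name: (
            side(scores["Monte Carlo Evaluation"], MC_MIDPOINT),
            side(scores["Elo Rating Evaluation"], ELO_BASELINE),
        )
        for name, scores in evaluations.items()
    }


# Two entries copied from the printed result of evaluate_prompt.
sample = {
    "main_prompt": {"Monte Carlo Evaluation": 1.9,
                    "Elo Rating Evaluation": 1504.2019499940866},
    "test_case_1": {"Monte Carlo Evaluation": 1.97,
                    "Elo Rating Evaluation": 1489.2019499940866},
}
for name, (mc, elo) in summarize(sample).items():
    print(f"{name}: Monte Carlo {mc} midpoint, Elo {elo} baseline")
```

Passing the full `result` dictionary to `summarize` would label every test case the same way.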


## RAGAS Evaluation 

In [7]:
import os

import requests
import weaviate
from dotenv import load_dotenv, find_dotenv
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Weaviate
from weaviate.embedded import EmbeddedOptions

In [8]:
# Data loader
def data_loader(file_path='../prompts/context.txt'):
    """Load a text file and split it into chunks for embedding."""
    loader = TextLoader(file_path)
    documents = loader.load()

    # Chunk the data
    text_splitter = CharacterTextSplitter(chunk_size=15773, chunk_overlap=200)
    chunks = text_splitter.split_documents(documents)
    return chunks

In [9]:
def create_retriever(chunks):
    # Load environment variables (such as the OpenAI API key) from a .env file
    load_dotenv(find_dotenv())

    # Set up the vector database
    client = weaviate.Client(
        embedded_options=EmbeddedOptions()
    )

    # Populate the vector database
    vectorstore = Weaviate.from_documents(
        client=client,
        documents=chunks,
        embedding=OpenAIEmbeddings(),
        by_text=False
    )

    # Define vectorstore as retriever to enable semantic search
    retriever = vectorstore.as_retriever()
    return retriever

In [10]:
# chunks

In [11]:
chunks =  data_loader()
retriever = create_retriever(chunks)

Started /Users/azizamed/.cache/weaviate-embedded: process ID 37482


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-01-19T02:53:06+03:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-01-19T02:53:06+03:00"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-01-19T02:53:06+03:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50060","time":"2024-01-19T02:53:06+03:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:8079","time":"2024-01-19T02:53:06+03:00"}
{"level":"info","msg":"Completed loading shard langchain_17bbdaa4279c488db6082b6609d17e65_Wo3w0e0EQ5AM in 14.079152ms","time":"2024-01-19T02:53:07+03:00"

AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-HgabC***************************************N2zH. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

In [12]:

# Define LLM
llm = ChatOpenAI(model_name="gpt-4-1106-preview", temperature=0)

# Define prompt template
template = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use two sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:
"""

prompt = ChatPromptTemplate.from_template(template)

# Setup RAG pipeline
rag_chain = (
    {"context": retriever,  "question": RunnablePassthrough()} 
    | prompt 
    | llm
    | StrOutputParser() 
)

  warn_deprecated(


NameError: name 'retriever' is not defined

In [19]:
from datasets import Dataset

questions = [
    "What is the primary business objective of PromptlyTech?",
    "Name the key services provided by PromptlyTech.",
    "Explain the importance of Automatic Evaluation Data Generation in PromptlyTech's offerings.",
    "How does the ELO Rating System work in the context of prompt testing and ranking?",
    "What are some innovative approaches suggested for prompt evaluation in Task 5?",
    "How does the user interface contribute to the overall prompt engineering system in Task 5?"
]

ground_truths = [
    ["PromptlyTech is an innovative e-business specializing in providing AI-driven solutions for optimizing the use of Language Models (LLMs) in various industries. The company aims to revolutionize how businesses interact with LLMs, making the technology more accessible, efficient, and effective. By addressing the challenges of prompt engineering, the company plays a pivotal role in enhancing decision-making, operational efficiency, and customer experience across various industries."],
    ["PromptlyTech focuses on key services: Automatic Prompt Generation, Automatic Evaluation Data Generation, and Prompt Testing and Ranking."],
    ["Automatic Evaluation Data Generation is a crucial service offered by PromptlyTech. This service automates the generation of diverse test cases, ensuring comprehensive coverage and identifying potential issues. By creating a set of test cases that serve as evaluation benchmarks for prompt candidates, PromptlyTech enhances the reliability and performance of LLM applications. This, in turn, saves significant time in the Quality Assurance (QA) process."],
    ["The ELO Rating System, commonly used in chess and other competitive games, rates prompts based on their performance in battles. Each prompt candidate is assigned a rating that reflects its success in previous matchups. The system takes into account not just the number of wins but also the strength of the opponents each prompt has defeated. This rating helps in objectively ranking the prompts based on their effectiveness in generating desired outcomes."],
    ["Task 5 emphasizes adopting innovative approaches to prompt evaluation, including utilizing Monte Carlo matchmaking and ELO rating systems. Additionally, alternative methods such as TrueSkill Rating System, Glicko Rating System, Bayesian Rating Systems, Pairwise Comparison Methods, Categorical Ranking, Adaptive Ranking Algorithms, and Semantic Similarity Matching are mentioned. These methods provide a dynamic and adaptive framework for evaluating prompts in various contexts."],
    ["The user interface plays a crucial role in Task 5 by providing a user-friendly platform for interacting with the prompt engineering system. It allows users to easily input data, receive prompts, and view evaluation results. The design and implementation of the user interface aim to enhance the overall user experience, making it intuitive and efficient for users to engage with the automated prompt generation, evaluation data generation, and prompt testing components."]
]
answers = []
contexts = []

# Inference
for query in questions:
    answers.append(rag_chain.invoke(query))
    contexts.append([doc.page_content for doc in retriever.get_relevant_documents(query)])

# To dict
data = {
    "question": questions,          # list of str
    "answer": answers,              # list of str
    "contexts": contexts,           # list of lists of str
    "ground_truths": ground_truths  # list of lists of str
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)

/usr/local/lib/python3.11/site-packages/langchain_community/embeddings/openai.py:500: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
  response = response.dict()
/usr/local/lib/python3.11/site-packages/pydantic/main.py:979: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
/usr/local/lib/python3.11/site-packages/langchain_community/chat_models/openai.py:458: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
  response = response.dict()
/usr/local/lib/python3.11/site-packages/pydantic/main.py:979: 

In [20]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

result = evaluate(
    dataset = dataset, 
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

df = result.to_pandas()

evaluating with [context_precision]


100%|██████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.46s/it]


evaluating with [context_recall]


100%|██████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.79s/it]


evaluating with [faithfulness]


100%|██████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.16s/it]


evaluating with [answer_relevancy]


  0%|                                                                          | 0/1 [00:00<?, ?it/s]/usr/local/lib/python3.11/site-packages/langchain_community/embeddings/openai.py:500: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
  response = response.dict()
/usr/local/lib/python3.11/site-packages/pydantic/main.py:979: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
100%|██████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.72s/it]


In [15]:
df

Unnamed: 0,question,answer,contexts,ground_truths,context_precision,context_recall,faithfulness,answer_relevancy
0,Who founded OpenAI?,"OpenAI was founded by Sam Altman, Elon Musk, I...",[OpenAI was initially founded in 2015 by Sam A...,"[Sam Altman, Elon Musk, Ilya Sutskever and Gre...",1.0,1.0,1.0,0.959185
1,What was the initial goal of OpenAI?,The initial goal of OpenAI was to advance digi...,[OpenAI was initially founded in 2015 by Sam A...,[To advance digital intelligence in a way that...,1.0,1.0,1.0,0.999999
2,What did OpenAI release in 2016?,"OpenAI released 'OpenAI Gym' in 2016, a toolki...",[The early years of OpenAI were marked with ra...,"[OpenAI Gym, a toolkit for developing and comp...",1.0,1.0,1.0,0.899221


#### Integration with Retrieval-Augmented Generation Assessment:
- **Monte Carlo for Robustness Testing:** Use Monte Carlo simulations to test the robustness of the RAG system across a wide range of possible retrieval scenarios. This helps in understanding how different types of retrieved information can impact the quality of the generated content.
- **Elo Rating for Continuous Improvement:** Utilize the Elo rating system to continuously assess and improve the RAG model. By comparing new outputs with previous ones and adjusting ratings accordingly, the system can learn which types of retrieval-augmented generations work best.
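
The Monte Carlo robustness idea can be sketched by scoring many randomly sampled retrievals and looking at the spread of the scores. This is an illustrative sketch under stated assumptions, not part of the pipeline above: `robustness_check` and `toy_score` are hypothetical names, the chunk texts are made-up examples, and the keyword-based score stands in for a real metric such as a RAGAS score.

```python
import random
import statistics


def robustness_check(chunks, score_fn, k=2, trials=200, seed=0):
    """Score many random k-chunk retrievals; return (mean, stdev)."""
    rng = random.Random(seed)
    scores = [score_fn(rng.sample(chunks, k)) for _ in range(trials)]
    return statistics.mean(scores), statistics.pstdev(scores)


def toy_score(contexts):
    # Hypothetical stand-in for a real quality metric: the fraction
    # of sampled chunks that mention the keyword "Elo".
    return sum("Elo" in c for c in contexts) / len(contexts)


# Made-up chunks standing in for the retriever's candidate documents.
chunks = [
    "Elo ratings rank prompts by matchup results.",
    "Monte Carlo matchmaking pairs prompts at random.",
    "The user interface displays evaluation results.",
    "Elo updates depend on opponent strength.",
]
mean, spread = robustness_check(chunks, toy_score)
print(f"mean={mean:.2f} spread={spread:.2f}")
```

A large spread relative to the mean would suggest that answer quality depends heavily on which chunks happen to be retrieved.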