## LLM App/Project - RAG testing / Text Classification

### Steps: 

1. ~~Create a chat classifier in LlamaIndex that classifies a provided BBC headline into one of the five news categories.~~
2. Using the chat classifier from the first task, add output evaluation. Consider whether the evaluation results could be fed back to teach or improve the chat output.
3. Build a RAG pipeline: add some wiki documents (~20) and evaluate the responses
4. Experiment with different evaluation metrics
5. Experiment with different RAG methodologies and use the evaluation metrics to compare the results they produce

Possible problems:
- Gemini API restriction
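
If the restriction shows up as intermittent request failures (for example rate limiting, which is an assumption here), one common mitigation is to retry with exponential backoff. The sketch below is a minimal illustration; `call_with_backoff` and its parameters are hypothetical helpers, not part of LlamaIndex or the Gemini client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with delays of base_delay * 2**attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up and re-raise once the last attempt has failed
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

In the notebook this could wrap a chat call, e.g. `call_with_backoff(lambda: llm.chat(messages))`. In practice it is better to catch only the specific exception type the API raises for rate limits rather than every exception.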

## Chat Classifier

In [38]:
from llama_index.llms.gemini import Gemini
from llama_index.core.llms import ChatMessage
from dotenv import load_dotenv
import os

load_dotenv()

GOOGLE_API_KEY = os.getenv("GEMINI_API_KEY")

llm = Gemini(
    model="models/gemini-1.5-flash",
    api_key=GOOGLE_API_KEY  # passed explicitly; if omitted, the GOOGLE_API_KEY env var is used
)

In [39]:
messages = [
    ChatMessage(
        role="system",
        content=(
            "You are a text classifier specializing in BBC headlines. "
            "I will provide you with a headline, and you will classify it into one of the following five categories: "
            "business, entertainment, politics, sport, or tech. "
            "Choose the category that best fits the headline. If a headline fits multiple categories, select the one most relevant. "
            "Provide only the category name as the response, without any additional text."
        )
    )
]

resp = llm.chat(messages)
print(f"System Response: {resp}")

# Chat loop 
while True:
    text_input = input("User: ")
    if text_input.lower() == "exit":
        print("Exiting classifier. Goodbye!")
        break
    
    messages.append(ChatMessage(role="user", content=text_input))

    # Take the message content directly; str() on the chat response prepends "assistant: "
    response = llm.chat(messages).message.content
    messages.append(ChatMessage(role="assistant", content=response))
    print(f"\nChat: {response}\n")

System Response: assistant: Okay, I'm ready. Please provide the headline.
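
As the output above shows, printing the chat response yields a string prefixed with `assistant: `. Before comparing a reply against a label, it helps to normalize it to one of the five categories. A minimal sketch, assuming the reply is a single word possibly wrapped in whitespace, quotes or a trailing period; `normalize_label` is a hypothetical helper, not a LlamaIndex function.

```python
from typing import Optional

CATEGORIES = ("business", "entertainment", "politics", "sport", "tech")


def normalize_label(raw: str) -> Optional[str]:
    """Map a raw model reply to one of the five categories, or None if it does not match."""
    text = raw.strip().lower()
    # Drop the role prefix that appears when the chat response is converted with str()
    if text.startswith("assistant:"):
        text = text[len("assistant:"):].strip()
    # Remove surrounding whitespace, quotes and trailing periods
    text = text.strip(" .\n\t\"'")
    return text if text in CATEGORIES else None
```

Returning `None` for anything outside the category list makes malformed replies easy to count separately from wrong classifications.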



## Correctness Evaluator

In [56]:
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.gemini import Gemini
from llama_index.core.llms import ChatMessage
from dotenv import load_dotenv
import pandas as pd
import random
import os

load_dotenv()

GOOGLE_API_KEY = os.getenv("GEMINI_API_KEY")

llm = Gemini(
    model="models/gemini-1.5-flash",
    api_key=GOOGLE_API_KEY  # passed explicitly; if omitted, the GOOGLE_API_KEY env var is used
)

evaluator = CorrectnessEvaluator(llm=llm)

# Read Data Frame
df = pd.read_csv(os.path.join("data", "bbc_data.csv"))  # portable path instead of a Windows-only backslash

In [57]:
# Choose a random row; randrange excludes len(df), whereas randint(0, len(df)) could index out of bounds
random_index = random.randrange(len(df))
entity = df.iloc[random_index]

news = entity.data
label = entity.labels

### Playing around with the DeepEval library

DeepEval problem: I don't have access to the GPT models that `GEval` (useful for correctness measurement) requires. Passing a Gemini model name raises:

ValueError: Invalid model. Available GPT models: gpt-4o-mini, gpt-4o, gpt-4-turbo, gpt-4-turbo-preview, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4, gpt-4-32k, gpt-4-0613, gpt-4-32k-0613, gpt-3.5-turbo-1106, gpt-3.5-turbo, gpt-3.5-turbo-16k, gpt-3.5-turbo-0125

In [84]:
from deepeval.metrics import GEval

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import EvaluationDataset

In [85]:
entity = df.iloc[4]

news = entity.data
label = entity.labels

messages = [
    ChatMessage(
        role="system",
        content=(
            "You are a text classifier specializing in BBC headlines. "
            "I will provide you with a headline, and you will classify it into one of the following five categories: "
            "business, entertainment, politics, sport, or tech. "
            "Choose the category that best fits the headline. If a headline fits multiple categories, select the one most relevant. "
            "Provide only the category name as the response, without any additional text."
        )
    )
]

messages.append(ChatMessage(role="user", content=news))
# Take the message content directly; str() on the chat response prepends "assistant: "
output = llm.chat(messages).message.content

In [86]:
# NOTE: all three test cases below currently reuse the same input, output and label,
# so they cannot yet produce the different scores described in the comments.

# Test case intended to score 1 (complete alignment with expected output)
first_test_case = LLMTestCase(input=news,
                              actual_output=output,
                              expected_output=label)

# Test case intended to score 0.5 (partial alignment with expected output)
second_test_case = LLMTestCase(input=news,
                               actual_output=output,
                               expected_output=label)

# Test case intended to score 0 (no meaningful alignment with expected output)
third_test_case = LLMTestCase(input=news,
                              actual_output=output,
                              expected_output=label)

test_cases = [first_test_case, second_test_case, third_test_case]

dataset = EvaluationDataset(test_cases=test_cases)

In [87]:
correctness_metric = GEval(
    name="Correctness",
    model="gemini-1.5-flash",
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Determine whether the actual output is factually correct based on the expected output."
    ],
)

ValueError: Invalid model. Available GPT models: gpt-4o-mini, gpt-4o, gpt-4-turbo, gpt-4-turbo-preview, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4, gpt-4-32k, gpt-4-0613, gpt-4-32k-0613, gpt-3.5-turbo-1106, gpt-3.5-turbo, gpt-3.5-turbo-16k, gpt-3.5-turbo-0125
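
Because each expected output in this task is a single category label, correctness can also be measured without an LLM judge by exact match. The sketch below is a simple fallback under that assumption, not a replacement for `GEval`; `exact_match_accuracy` and the sample pairs are hypothetical and are not results from this notebook.

```python
from collections import Counter
from typing import Iterable, Tuple


def exact_match_accuracy(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, Counter]:
    """Return (accuracy, mismatch counts keyed by (expected, predicted))."""
    total = 0
    correct = 0
    errors: Counter = Counter()
    for predicted, expected in pairs:
        total += 1
        # Compare case-insensitively and ignore surrounding whitespace
        p, e = predicted.strip().lower(), expected.strip().lower()
        if p == e:
            correct += 1
        else:
            errors[(e, p)] += 1
    accuracy = correct / total if total else 0.0
    return accuracy, errors


# Made-up (predicted, expected) pairs for illustration only
pairs = [
    ("sport", "sport"),
    ("Tech", "tech"),
    ("business", "politics"),
    ("politics", "politics"),
]
accuracy, errors = exact_match_accuracy(pairs)
print(accuracy, dict(errors))
```

The mismatch counter shows which categories get confused with each other, which is useful when deciding how to adjust the system prompt.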

### RAGAS Framework playground

Ragas can be used if you have access to the Azure OpenAI API:

- https://docs.ragas.io/en/v0.1.21/howtos/customisations/azure-openai.html

In [115]:
from ragas.dataset_schema import MultiTurnSample, SingleTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import AgentGoalAccuracyWithReference

In [116]:
entity = df.iloc[4]

news = entity.data
label = entity.labels

messages = [
    ChatMessage(
        role="system",
        content=(
            "You are a text classifier specializing in BBC headlines. "
            "I will provide you with a headline, and you will classify it into one of the following five categories: "
            "business, entertainment, politics, sport, or tech. "
            "Choose the category that best fits the headline. If a headline fits multiple categories, select the one most relevant. "
            "Provide only the category name as the response, without any additional text."
        )
    )
]

messages.append(ChatMessage(role="user", content=news))
output = str(llm.chat(messages)).replace('\t','')

In [117]:
user_input = news

# AI's response
response = output

# Reference answer (ground truth)
reference = label

# Evaluation rubric
rubric = {
    "accuracy": "Correct",
    "completeness": "High",
    "fluency": "Excellent"
}

# Create the SingleTurnSample instance
sample = SingleTurnSample(
    user_input=user_input,
    response=response,
    reference=reference,
    rubrics=rubric  # the SingleTurnSample field is named `rubrics`
)

In [None]:
score = AgentGoalAccuracyWithReference()
score.llm = llm
await score.ascore(sample)

In [None]:
sample = MultiTurnSample(
    
    user_input=[
    HumanMessage(content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"),
    AIMessage(content="Sure, let me find the best options for you.", tool_calls=[
        ToolCall(name="restaurant_search", args={"cuisine": "Chinese", "time": "8:00pm"})
    ]),
    ToolMessage(content="Found a few options: 1. Golden Dragon, 2. Jade Palace"),
    AIMessage(content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"),
    HumanMessage(content="Let's go with Golden Dragon."),
    AIMessage(content="Great choice! I'll book a table for 8:00pm at Golden Dragon.", tool_calls=[
        ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8:00pm"})
    ]),
    ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
    AIMessage(content="Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!"),
    HumanMessage(content="thanks"),
],
    reference="Table booked at one of the chinese restaurants at 8 pm")

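from ragas.llms import LlamaIndexLLMWrapper

# Ragas metrics expect a Ragas LLM wrapper rather than a raw LlamaIndex LLM
scorer = AgentGoalAccuracyWithReference()
scorer.llm = LlamaIndexLLMWrapper(llm)
await scorer.multi_turn_ascore(sample)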

In [126]:
from ragas.llms import LlamaIndexLLMWrapper

# Wrap the Gemini LLM defined earlier so Ragas metrics can use it
evaluator_llm = LlamaIndexLLMWrapper(llm)



In [125]:
from ragas import SingleTurnSample
from ragas.metrics import BleuScore

test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
    "reference": "The company reported an 8% growth in Q3 2024, primarily driven by strong sales in the Asian market, attributed to strategic marketing and localized products, with continued growth anticipated in the next quarter."
}
metric = BleuScore()
test_data = SingleTurnSample(**test_data)
metric.single_turn_score(test_data)

0.13718598426177148
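
The score is low because BLEU measures n-gram overlap with the reference rather than meaning, so a paraphrased summary is penalized. The sketch below is a simplified single-reference BLEU (clipped n-gram precision, geometric mean, brevity penalty) with whitespace tokenization and no smoothing. It is meant for intuition only and is not expected to reproduce the value returned by Ragas's `BleuScore`; `simple_bleu` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(response: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU for one response against one reference."""
    cand = response.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes responses shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Calling `simple_bleu(response, reference)` with the two summaries from the cell above shows how quickly the higher-order n-gram precisions drop once the wording diverges, even when the content is similar.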