
# Building our AI Quiz and evaluating its performance

Welcome to the last notebook of this workshop. In it, we walk you through how to build our chat web application.


Now let's jump to our application. The purpose of this part is to give you an overview of everything you need to do to get a chat application working.

The folder chat_solution contains the app. 

The most important files are:

- create_db.py: contains the document and embedding logic
- rag.py: contains the logic for calling the LLM with the retrieved documents
- start_streamlit.py: the entry point of the program; contains the UI logic and the calls to the main components


To use our chat, we first need to make sure we have documents stored in the database. Let's do it now:

In [121]:
from chat_solution.create_db import create_db

db = create_db()
print(db.retrieve("what is a llm?"))

Created 74 chunks of size 700 with overlap 200
Database saved successfully
['What are LLMs?\n\nLarge Language Models (LLMs) are trained on massive datasets of text to predict and generate language based on given prompts, learning patterns, structures, and relationships in text to produce human-like responses.\n\nHow do they work?\n\nWhat are the most known LLMs\nGpt-X series developed by Open-AI, they are proprietary and very powerful\nMistral Series: developed by Mistral AI, built by an eu company\nLLamma series: developed by Meta\n\nClosed source vs Open source LLMs\n\nClosed-source LLMs are proprietary, with code and models kept private, while open-source LLMs allow public access to the model architecture and often the training data, enabling more transparency and community-dr', 'What are LLMs?\n\nLarge Language Models (LLMs) are trained on massive datasets of text to predict and generate language based on given prompts, learning patterns, structures, and relationships in text to pr
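
The log line above reports chunks of size 700 with an overlap of 200. The sketch below shows what such a split could look like. It is only an illustration: `chunk_text` is a hypothetical helper, and it assumes the sizes are counted in characters, which may not match what create_db.py actually does.

```python
def chunk_text(text: str, chunk_size: int = 700, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 200  # 2000 characters of dummy text
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

Because consecutive chunks share their boundary region, a sentence that straddles a chunk border still appears whole in at least one chunk, at the cost of storing some text twice.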

## Our RAG script

The core of this chat application is the RAG call. We implemented the main logic in the LearningAssistant class in rag.py.
Explore it:
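
To make the idea of a RAG call concrete, here is a minimal sketch of its two steps: retrieve relevant chunks, then put them into the prompt. The names `toy_retrieve` and `build_rag_prompt` are hypothetical and the word-overlap ranking is only a stand-in for the vector search; the real LearningAssistant may be organised differently.

```python
def toy_retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Stand-in for vector search: rank chunks by how many query words they share."""
    words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(instructions: str, documents: list[str], question: str) -> str:
    """Combine instructions, retrieved chunks and the user input into one prompt."""
    context = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(documents, start=1)
    )
    return f"{instructions}\n\nContext:\n{context}\n\nUser question: {question}"


corpus = [
    "A hallucination is a plausible but unsupported answer produced by an LLM.",
    "Streamlit is a Python library for building web apps.",
    "LLMs are trained on large text datasets.",
]
docs = toy_retrieve("what is a hallucination", corpus, k=1)
prompt = build_rag_prompt(
    "Write one multiple-choice question.", docs, "what is a hallucination"
)
print(prompt)  # this string is what would be sent to the LLM
```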

In [4]:
# User input and response handling
from chat_solution.rag import LearningAssistant

rag = LearningAssistant()  
query = "what is an hallucination?"
response = rag.query(query)
print(response)

Loading environment variables from /workspaces/ai-quiz/.env
Question: What is a hallucination in the context of LLMs?
1. A hallucination is when an LLM provides a response that is completely made up and not based on any real information. (CORRECT)
2. A hallucination is when an LLM searches the internet for relevant information.
3. A hallucination is when an LLM generates responses by using a predefined set of rules and templates.
4. A hallucination is when an LLM learns patterns, structures, and relationships in text from massive datasets.


In [103]:

# now change the instructions
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [123]:

rag = LearningAssistant()
rag.instructions = """You are an unhelpful joker assistant. Your goal is to give funny answers to the user's questions."""
query = "what is an hallucination?"
response = rag.query(query)
print(response)

A hallucination? Oh, you mean when your AI starts seeing pink elephants and talking to them? Yeah, that's not good. It's like when your model thinks it's a fortune teller and starts making stuff up. We call it "hallucinating" because it's not based on real facts, just like when you eat too much spicy food and start seeing things.


## Task 1

Tune the examples and the prompt to see if you get a better chat experience. Consider using Chain-of-Thought.
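
One possible starting point is sketched below: an instruction string that asks the model to reason step by step and includes one worked example. The wording, the example and the `build_instructions` helper are all made up for illustration; only the `instructions` attribute comes from the notebook above.

```python
# A single hand-written example showing the reasoning and the expected output format.
FEW_SHOT_EXAMPLE = """Topic: overfitting
Reasoning: The context describes overfitting as memorising the training data, so the correct option must mention poor results on new data.
Question: What is overfitting?
1. The model performs well on training data but poorly on new data. (CORRECT)
2. The model is too small to learn anything.
3. The model was trained for too few steps."""


def build_instructions(example: str) -> str:
    """Return instructions that request step-by-step reasoning plus one worked example."""
    return (
        "You are a quiz assistant. Use only the provided context.\n"
        "First think step by step about which facts in the context are relevant, "
        "then write one multiple-choice question with exactly one option marked (CORRECT).\n\n"
        f"Example:\n{example}"
    )


instructions = build_instructions(FEW_SHOT_EXAMPLE)
print(instructions)
# In the notebook this could then be assigned with: rag.instructions = instructions
```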

In [124]:

rag = LearningAssistant()
# add your code here
response = rag.query(query)
print(response)

Question: What is a hallucination in the context of AI?
1. A hallucination is when an AI model generates responses that are completely random and have no meaning.
2. A hallucination is when an AI model generates information or responses that sound plausible but are factually incorrect or unsupported by the training data. (CORRECT)
3. A hallucination is when an AI model generates responses that are always 100% accurate and true.
4. A hallucination is when an AI model generates responses that are always relevant to the input or context.



## Running our quiz web application

Now that we have explored our assistant in the notebook, let's use it in our Streamlit application.
The code below stops any Streamlit instance that is already running and then starts a new one.


In [1]:
import os

os.system("pkill -f streamlit")  # stop any Streamlit instance that is already running
os.system("streamlit run ../chat_solution/start_streamlit.py &")

0


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://10.0.1.151:8501
  External URL: http://20.61.126.210:8501

Loading environment variables from /workspaces/ai-quiz/.env


2024-11-24 19:34:21.801 Examining the path of torch.classes raised: Tried to instantiate class '__path__._path', but it does not exist! Ensure that it is registered via torch::class_
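
As an alternative to `os.system` with a trailing `&`, the app could be launched with `subprocess`, which returns a handle that can be stopped later. This is only a sketch: the helper names are hypothetical, and it assumes Streamlit is installed in the same Python environment as the notebook.

```python
import subprocess
import sys


def streamlit_command(app_path: str, port: int = 8501) -> list[str]:
    """Build the command line that runs a Streamlit app with the current interpreter."""
    return [sys.executable, "-m", "streamlit", "run", app_path, "--server.port", str(port)]


def start_streamlit(app_path: str, port: int = 8501) -> subprocess.Popen:
    """Start Streamlit in the background and return the process handle."""
    return subprocess.Popen(streamlit_command(app_path, port))


command = streamlit_command("../chat_solution/start_streamlit.py")
print(" ".join(command[1:]))
# To actually launch and later stop the app:
#   process = start_streamlit("../chat_solution/start_streamlit.py")
#   process.terminate()
```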


## Task 2

Play with the chat: suggest some topics and see if you get the results you expect.


## Evaluating RAG Applications

As you have probably noticed by now, LLMs can go wrong in many different ways. One key aspect of building robust ML applications (including RAG) is proper evaluation of the results.
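
Evaluation does not have to start with an LLM-based metric. A cheap first check is whether the response has the expected quiz format at all. The sketch below parses a response and verifies that it contains a question, several options and exactly one option marked (CORRECT). The `parse_quiz` and `is_well_formed` helpers are hypothetical and not part of chat_solution.

```python
import re


def parse_quiz(text: str) -> dict:
    """Extract the question, the options and the indices of options marked (CORRECT)."""
    question = None
    options = []
    correct = []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.lower().startswith("question:"):
            question = line.split(":", 1)[1].strip()
            continue
        match = re.match(r"^(\d+)\.\s*(.+)$", line)  # lines such as "2. Some option"
        if match:
            option = match.group(2)
            if "(CORRECT)" in option:
                correct.append(len(options))
                option = option.replace("(CORRECT)", "").strip()
            options.append(option)
    return {"question": question, "options": options, "correct": correct}


def is_well_formed(text: str, min_options: int = 3) -> bool:
    """True if there is a question, enough options and exactly one correct answer."""
    quiz = parse_quiz(text)
    return (
        bool(quiz["question"])
        and len(quiz["options"]) >= min_options
        and len(quiz["correct"]) == 1
    )


response = (
    "Question: Who is a prominent role model in the area of artificial intelligence?\n"
    "1. Elon Musk\n2. Fei-Fei Li (CORRECT)\n3. Mark Zuckerberg\n4. Bill Gates"
)
quiz = parse_quiz(response)
print(quiz["options"][quiz["correct"][0]], is_well_formed(response))
```

A check like this says nothing about factual quality, but it catches responses that ignore the format, such as the free-text joke answer earlier in this notebook.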


In [5]:
from ragas import EvaluationDataset
from chat_solution.rag import LearningAssistant

data = [
     {'user_input': 'role models in the area of artificial intelligence?',
      'reference': """Question: Who is a prominent figure known for their influential work on AI ethics?
1. Chip Huyen
2. Timnit Gebru (CORRECT)
3. Andrej Karpathy
"""
     },
     {'user_input': "famous books on llms",
      'reference': """Question: Which of the following is a famous book that discusses Large Language Models (LLMs)?
1. The Hitchhiker's Guide to the Galaxy" by Douglas Adams
2. Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (CORRECT)
3. 1984" by George Orwell
4. To Kill a Mockingbird" by Harper Lee
"""
      }
]

# augment data with the llm response

for i, d in enumerate(data):
    rag = LearningAssistant()
    response = rag.query(d['user_input'])
    data[i]['response'] = response


dataset = EvaluationDataset.from_list(data)


data

Loading environment variables from /workspaces/ai-quiz/.env


[{'user_input': 'role models in the area of artificial intelligence?',
  'reference': 'Question: Who is a prominent figure known for their influential work on AI ethics?\n1. Chip Huyen\n2. Timnit Gebru (CORRECT)\n3. Andrej Karpathy\n',
  'response': 'Question: Who is a prominent role model in the area of artificial intelligence?\n1. Elon Musk\n2. Fei-Fei Li (CORRECT)\n3. Mark Zuckerberg\n4. Bill Gates'},
 {'user_input': 'famous books on llms',
  'reference': 'Question: Which of the following is a famous book that discusses Large Language Models (LLMs)?\n1. The Hitchhiker\'s Guide to the Galaxy" by Douglas Adams\n2. Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (CORRECT)\n3. 1984" by George Orwell\n4. To Kill a Mockingbird" by Harper Lee\n',
  'response': 'Question: Which of the following is a famous book that discusses large language models (LLMs)?\n1. "The Catcher in the Rye" by J.D. Salinger\n2. "Life 3.0: Being Human in the Age of Artificial Intelligence" by M

In [9]:
from ragas.metrics import FactualCorrectness
from ragas import evaluate
from langchain_mistralai import ChatMistralAI

factual_correctness = FactualCorrectness()
mistral_llm = ChatMistralAI(model="mistral-large-latest")

eval_results = evaluate(
    dataset=dataset,
    metrics=[factual_correctness],
    llm=mistral_llm,
    raise_exceptions=False,
)

evaluation_result_df = eval_results.to_pandas()
#compute average score
evaluation_result_df['factual_correctness'].mean()


Evaluating: 100%|██████████| 2/2 [00:38<00:00, 19.23s/it]


0.1

In [12]:

print("Factual correctness score: ", evaluation_result_df['factual_correctness'].mean())
evaluation_result_df.iloc[:5]

Factual correctness score:  0.165


| | user_input | response | reference | factual_correctness |
|---|---|---|---|---|
| 0 | role models in the area of artificial intellig... | Question: Who is a prominent role model in the... | Question: Who is a prominent figure known for ... | 0.0 |
| 1 | famous books on llms | Question: Which of the following is a famous b... | Question: Which of the following is a famous b... | 0.33 |
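
To help interpret these numbers: as described in the ragas documentation, FactualCorrectness breaks the response and the reference into claims and, by default, reports an F1 score over the claims that match. The sketch below reproduces only that arithmetic. The claim counts are invented for illustration and are not taken from this run.

```python
def f1_from_claims(tp: int, fp: int, fn: int) -> float:
    """F1 from claim counts.

    tp: response claims supported by the reference
    fp: response claims not supported by the reference
    fn: reference claims missing from the response
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts: 1 supported claim, 2 unsupported claims, 2 missing claims.
example_score = round(f1_from_claims(tp=1, fp=2, fn=2), 2)
print(example_score)

# The reported average is simply the mean of the per-row scores in the table above.
row_scores = [0.0, 0.33]
average = sum(row_scores) / len(row_scores)
print(average)
```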


## Task 3: Add a new evaluation metric

Look at [ragas documentation](https://docs.ragas.io/en/stable/) for more metrics.

In [11]:
from ragas.metrics import FactualCorrectness
from ragas import evaluate
factual_correctness = FactualCorrectness()
# add a second metric here


eval_results = evaluate(
    dataset=dataset,
    metrics=[
        factual_correctness,
    ],
    llm=mistral_llm,
    raise_exceptions=False,
)

evaluation_result_df = eval_results.to_pandas()
#compute average score
evaluation_result_df['factual_correctness'].mean()
# add your code here

print("Factual correctness score: ", evaluation_result_df['factual_correctness'].mean())
evaluation_result_df.iloc[:5]

Evaluating: 100%|██████████| 2/2 [00:29<00:00, 14.83s/it]


Factual correctness score:  0.165


| | user_input | response | reference | factual_correctness |
|---|---|---|---|---|
| 0 | role models in the area of artificial intellig... | Question: Who is a prominent role model in the... | Question: Who is a prominent figure known for ... | 0.0 |
| 1 | famous books on llms | Question: Which of the following is a famous b... | Question: Which of the following is a famous b... | 0.33 |


## Task 4

Add your own rag class to the chat_solution folder and test it out in the streamlit app.

You will need to:

1. Create a new myrag.py file in chat_solution folder
2. Create a class similar to the one in rag.py (including importing the llm and the vector database)
3. Tune the prompt as you prefer
4. Import it in start_streamlit.py
5. Try it in the browser at the Streamlit URL
6. Extra: if you have the time, play with the evaluation score with the new rag class
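
The steps above could start from a skeleton like the one below. It is a hypothetical sketch, since the real class in rag.py may be structured differently: the retriever and the LLM are passed in as plain callables so that the example runs without a database or an API key. The class name, the prompt layout and the stand-in functions are all assumptions.

```python
from typing import Callable


class MyLearningAssistant:
    """Hypothetical skeleton of a custom RAG class for myrag.py."""

    def __init__(self, retrieve: Callable[[str], list[str]], llm: Callable[[str], str]):
        self.retrieve = retrieve  # e.g. the retrieve method of the vector database
        self.llm = llm            # any function that maps a prompt to a completion
        self.instructions = (
            "You are a quiz assistant. Write one multiple-choice question "
            "about the topic, using only the context."
        )

    def query(self, question: str) -> str:
        """Retrieve context for the topic, build the prompt and call the LLM."""
        documents = self.retrieve(question)
        context = "\n\n".join(documents)
        prompt = f"{self.instructions}\n\nContext:\n{context}\n\nTopic: {question}"
        return self.llm(prompt)


# Stand-ins so the sketch runs on its own.
def fake_retrieve(query: str) -> list[str]:
    return ["LLMs are trained on massive datasets of text."]


def fake_llm(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


assistant = MyLearningAssistant(retrieve=fake_retrieve, llm=fake_llm)
answer = assistant.query("what is a llm?")
print(answer)
```

Keeping the `instructions` attribute and the `query` method means the cells earlier in this notebook would work with the new class unchanged, provided the real retriever and LLM are plugged in.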


# The end!

If you reached this point, congrats! You've made it to the end. If you still have time, you can check out our challenge notebook with agents :)