<a href="https://colab.research.google.com/github/AlexUmnov/genai_course/blob/main/week2_llm_application/homework_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1. Question answering

In this task you will practice using LangChain for question answering.

We will work with the dataset from the [Measuring Massive Multitask Language Understanding](https://arxiv.org/pdf/2009.03300) paper by Hendrycks et al. It contains questions from fields as diverse as International Law, Nutrition and Higher Algebra. For each question, 4 answers are given (labeled A-D) and one of them is marked as correct. We'll go for High School Mathematics.

You can download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar and then extract it with your system's archive tool (7-Zip, for example). However, we suggest downloading the data with the Hugging Face [Datasets](https://huggingface.co/docs/datasets/index) library.

In [None]:
!pip install langchain langchain-openai tqdm openai datasets --quiet

In [None]:
from datasets import load_dataset

dataset = load_dataset("cais/mmlu", "high_school_mathematics", split="test")

Let's explore the dataset. What does it have for us?

In [None]:
len(dataset)

To save time and API call costs, we suggest evaluating only 50 examples from the dataset.

In [None]:
dataset = dataset[:50]

In [None]:
import pandas as pd
dataset = pd.DataFrame(dataset)
dataset.head()

Here the answers are not labeled by letters A-D, so we'll do it manually.

In [None]:
questions = dataset["question"]
choices = pd.DataFrame(
    data=dataset["choices"].tolist(), columns=["A", "B", "C", "D"]
    )
answers = dataset["answer"].map(lambda ans: {0: "A", 1: "B", 2: "C", 3: "D"}[ans])

Let's use Generative AI to predict the correct answer. We suggest using GPT-4o-mini, because it's both cheap and proficient.

In [None]:
import os
from google.colab import userdata
from langchain_openai import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

open_ai_api_key = open(".open-ai-api-key").read().strip()
# open_ai_api_key = userdata.get("open_ai_api_key")
os.environ['OPENAI_API_KEY'] = open_ai_api_key


example_id = 0
chat = ChatOpenAI(model="gpt-4o-mini", temperature=0, api_key=open_ai_api_key)
result = chat.invoke([
    HumanMessage(
        content=f"{questions[example_id]} " \
        f"A) {choices['A'][example_id]} " \
        f"B) {choices['B'][example_id]} " \
        f"C) {choices['C'][example_id]} " \
        f"D) {choices['D'][example_id]}"
        )
])
result

You can observe that ChatGPT uses *chain-of-thought reasoning* to tackle this problem (see [Wei et al.](https://arxiv.org/pdf/2201.11903.pdf)). This generally helps a lot with math problems.

**Note**. Even if the model avoids chain-of-thought reasoning, you can persuade it with prompts like: `"Break down the question in multiple steps, write them down and then give the answer"`.

But the thing is that we only need an answer. So, we need a way to extract the right letter from this lengthy response.

## Task 1.1. Zero-shot use

Let's start by trying to suppress chain-of-thought reasoning. We will ask the LLM to output just one letter A-D.

Write a function that does this. Your solution should rely only on well-chosen prompts, without any post-parsing of the output. Throughout this task, please make sure that you're using the "gpt-4o-mini" model. Otherwise your comparison will not be correct, and your accuracy or wrong-format rate may turn out worse than you would expect.

**Hint 1**. You can use `SystemMessage` or just a well chosen prompt template. If you use `SystemMessage`, ensure that you are using a chat model.

**Hint 2**. Don't forget to set temperature to zero. We need truthfulness, not creativity. Note, however, that even setting temperature to zero doesn't necessarily mean that the completions will be completely reproducible. See [this discussion](https://community.openai.com/t/a-question-on-determinism/8185/2) for some hints.

**Hint 3**. Don't forget to look at the outputs. It may greatly help you to create better prompts.

In [None]:
def chatgpt_answer(question: str, a: str, b: str, c: str, d: str) -> str:
    pass

We also provide you with a function that calculates accuracy. It also lets you debug your answers by passing `verbose=True`.

In [None]:
def check_answers(answers, model_answers, verbose=False):
    wrong_format = 0
    correct = 0
    wrong_answers = []
    for correct_answer, model_answer in zip(answers, model_answers):
        if correct_answer == model_answer[0]:
            correct += 1
        else:
            wrong_answers.append(f"Expected answer: {correct_answer} given answer {model_answer}")
        if (model_answer[0] not in ["A", "B", "C", "D"]) or len(model_answer) > 1:
            wrong_format += 1

    result = {
        "accuracy": correct / len(answers),
        "wrong_format": wrong_format / len(answers),
    }

    if verbose:
        result['wrong_answers'] = wrong_answers

    return result

Here we calculate two things:

- Accuracy rate. Note that an answer is considered accurate whenever it starts with the correct letter.
- Wrong format rate. It penalizes all the answers which are not 'A', 'B', 'C', or 'D'.
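
To make the two metrics concrete, here is a small illustration on made-up answers. The lists below are invented for this example and are not model outputs; the two expressions apply the same rules as `check_answers` to non-empty answers.

```python
# Toy data, invented for illustration only.
gold = ["A", "B", "C", "D"]
predicted = ["A", "B) 4", "D", "The answer is D"]

# Accuracy rule: the prediction only has to start with the correct letter.
accuracy = sum(p[:1] == g for g, p in zip(gold, predicted)) / len(gold)

# Format rule: anything other than exactly one of the four letters is penalized.
wrong_format = sum(p not in ("A", "B", "C", "D") for p in predicted) / len(gold)

print(accuracy, wrong_format)
```

Note that `"B) 4"` counts as accurate (it starts with the right letter) and as wrongly formatted at the same time, so the two rates are not complementary.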

In [None]:
chatgpt_answer(
    questions[0],
    choices.A[0],
    choices.B[0],
    choices.C[0],
    choices.D[0],
)

You may also experiment with other subjects, not only school math. The dataset has other subjects, you can see all of them [here](https://huggingface.co/datasets/cais/mmlu). You can pick the subject you like the most and evaluate your functions on it. However, you will need to submit your school math results for evaluation.

In [None]:
from tqdm.auto import tqdm

In [None]:
model_answers = []
for example_id in tqdm(range(len(dataset))):
    model_answers.append(chatgpt_answer(
        questions[example_id],
        choices.A[example_id],
        choices.B[example_id],
        choices.C[example_id],
        choices.D[example_id]
    ))

check_answers(answers, model_answers, verbose=True)

Depending on the subject the accuracy may vary, but generally it can be rather poor. It seems that getting rid of chain-of-thought wasn't a good idea.

*You should aim at getting at least 20% of the answers in the correct format.*

**To submit**. You will need to submit a csv file with model answers. Please launch the code below and submit the `zero_shot_answers.csv` file to grading.

In [None]:
import pandas as pd

df = pd.DataFrame(model_answers, columns=['answer'])
df.to_csv('zero_shot_answers.csv', index=True)

## Task 1.2. Ensuring format with few-shot examples

In the previous task we tried to make the LLM obey a particular format by explaining this format. This time, we will do it by showing the LLM how it's done with few-shot examples.

**Note:** You can implement Few-Shot in two ways:

1. You can combine `SystemMessage`, `HumanMessage` and `AIMessage` to create a fake chat history, like this:

```{python}
chat.invoke([
    SystemMessage(content="""Answer every request only with formulas, using no single word"""),
    HumanMessage(content="""You gave me 3 apples and then took 2 apples from me. How many apples do I have now?"""),
    AIMessage(content="""3 - 2 = 1"""),
    HumanMessage(content="""How can I convert Celsius to Fahrenheit?"""),
    AIMessage(content="""F=9/5*C+32""")
])

```

The next `HumanMessage` will be the user's real request.

2. You can just write all the chat history in a single user message:

```{python}
chat.invoke([
    HumanMessage(content="""Answer every request only with formulas, using no single word.
    
    User: You gave me 3 apples and then took 2 apples from me. How many apples do I have now?
    
    Assistant: 3 - 2 = 1

    ###

    User: How can I convert Celsius to Fahrenheit?

    Assistant: F=9/5*C+32

    ###

    User: {real user's message}

    Assistant:""")
])
```

You will need to use few-shot examples in the math Q&A task to ensure that GPT-4o-mini outputs only answer codes (A, B, C, or D). You will likely observe that showing the right format to an LLM can be more efficient than explaining it.

Try to retain as much of your previous prompt as possible. This will help us to understand the significance of this particular change.

Evaluate the same subject you used earlier with Few-Shot prompt and compare the results.
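
As a minimal sketch of the first approach, the helper below assembles a fake chat history from a list of examples. The function name `build_few_shot_messages` and its arguments are hypothetical. It returns plain `(role, content)` pairs; we assume here that your LangChain version accepts such tuples as message shorthand in `chat.invoke`, and if it doesn't, you can map them to `SystemMessage`, `HumanMessage` and `AIMessage` yourself.

```python
def build_few_shot_messages(system_prompt, examples, user_request):
    """Return a chat history as (role, content) pairs.

    `examples` is a list of (request, ideal_reply) pairs. All names here
    are hypothetical helpers, not part of LangChain.
    """
    messages = [("system", system_prompt)]
    for request, ideal_reply in examples:
        messages.append(("human", request))
        messages.append(("ai", ideal_reply))
    # The real request goes last, so the model continues the pattern.
    messages.append(("human", user_request))
    return messages


messages = build_few_shot_messages(
    "Answer every request only with formulas, using no single word",
    [("How can I convert Celsius to Fahrenheit?", "F = 9/5 * C + 32")],
    "How can I convert kilometers to miles?",
)
for role, content in messages:
    print(f"{role}: {content}")
```

Keeping the examples in a list makes it easy to vary their number and see how the format rate reacts.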

In [None]:
def chatgpt_few_shot_answer(question: str, a: str, b: str, c: str, d: str) -> str:
    pass

In [None]:
chatgpt_few_shot_answer(
    questions[0],
    choices.A[0],
    choices.B[0],
    choices.C[0],
    choices.D[0],
)

In [None]:
model_answers = []
for example_id in tqdm(range(len(dataset))):
    model_answers.append(chatgpt_few_shot_answer(
        questions[example_id],
        choices.A[example_id],
        choices.B[example_id],
        choices.C[example_id],
        choices.D[example_id]
    ))

check_answers(answers, model_answers, verbose=True)

*You should aim at getting at least 50% of the answers in the correct format.*

**To submit**. You will need to submit a csv file with model answers. Please launch the code below and submit the `few_shot_answers.csv` file to grading.

In [None]:
import pandas as pd

df = pd.DataFrame(model_answers, columns=['answer'])
df.to_csv('few_shot_answers.csv', index=True)

## Task 1.3. Chain-of-thought

Okay, let's confess that, even though we were able to do a decent job with the format, without chain-of-thought reasoning the accuracy is not good. Now, let's allow the LLM to "think out loud" and then use it again to extract the answer from the chain-of-thought output (as one letter).

Implement these two LLM calls in one function.

**Note:** Don't forget to feed the answer of the first LLM to the second LLM.

**Note:** If your prompt gets too long due to few-shot examples, it's usually a good idea to repeat the question at the end. A model might "forget" what the question was.

Try to retain as much of your previous prompt as possible. This will help us to understand the significance of this particular change.

In [None]:
def chatgpt_step_by_step_answer(question: str, a: str, b: str, c: str, d: str) -> str:
    # Use the same model as in the previous tasks to keep the comparison fair
    chat = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    messages = ...
    step_by_step_response = ...
    messages.append(AIMessage(content=step_by_step_response))
    parsed_response = ...
    return parsed_response

**Note**. This function is not a LangChain chain, just a chat. But in a sense a chat works like a chain. The main difference is that proper chains are better structured:

- In a proper chain we construct prompt templates to facilitate putting together different inputs and outputs. We can instruct an LLM about the relations between them.
- In a chat we have all the inputs and outputs piled together as messages, and we rely on ability of an LLM to extract information from discussions.

**Note**. If you don't get the desired quality, it's a good idea to look at both the reasoning and the answers. This will help you debug the whole process. So, we advise you to keep all the outputs.

In [None]:
chatgpt_step_by_step_answer(
    questions[0],
    choices.A[0],
    choices.B[0],
    choices.C[0],
    choices.D[0],
)

In [None]:
model_answers = []
for example_id in tqdm(range(len(dataset))):
    model_answers.append(chatgpt_step_by_step_answer(
        questions[example_id],
        choices.A[example_id],
        choices.B[example_id],
        choices.C[example_id],
        choices.D[example_id]
    ))

check_answers(answers, model_answers, verbose=True)


You should aim at getting at least 60% of your answers in the correct format and an accuracy of at least 40%.

**To submit**. You will need to submit a csv file with model answers. Please launch the code below and submit the `cot_answers.csv` file to grading.

In [None]:
import pandas as pd

df = pd.DataFrame(model_answers, columns=['answer'])
df.to_csv('cot_answers.csv', index=True)

## Bonus task 1.4*

Rewrite `chatgpt_step_by_step_answer` with chains. Compare the quality.

In [None]:
chatgpt_step_by_step_answer(
    questions[0],
    choices.A[0],
    choices.B[0],
    choices.C[0],
    choices.D[0],
)

## Task 1.5. Self-consistency: an ensemble of Chains of Thoughts

There's another popular and effective method of getting a better answer from your LLM.

We already know that adding a chain-of-thought answer improves the quality, because the model has some "space" to "think" about the answer.
But we can make it even better by allowing the model to generate multiple chains of thought to obtain candidates for the answer and then choose the best of those.

This is essentially the method introduced by this paper [Self-Consistency Improves Chain of Thought Reasoning in Language Models
](https://arxiv.org/abs/2203.11171).

In practice you need to make a function `chatgpt_self_consistency_answer` , which will do the following:
- Generate a diverse set of answers (for this it's good to set the [sampling temperature](https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature) greater than 0). You might need to try a bunch of different values to get a better result;
- Extract the final answer from those step-by-step explanations;
- Select the best answer based on majority vote.
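
The last step, the majority vote, can be sketched with the standard library alone. The helper name `majority_vote` and the `fallback` value are hypothetical choices made for this illustration; it assumes the candidate answers have already been extracted as strings.

```python
from collections import Counter


def majority_vote(candidates, fallback="?"):
    """Return the most frequent well-formed answer among the candidates."""
    valid = [c.strip() for c in candidates if c.strip() in ("A", "B", "C", "D")]
    if not valid:
        # Non-empty placeholder, so code that reads the first character still works.
        return fallback
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(valid).most_common(1)[0][0]


print(majority_vote(["B", "C", "B", "The answer is B", "B "]))
```

Discarding malformed candidates before voting is one possible design choice; it keeps a stray long response from ever winning the vote.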

To make a fair comparison try to retain as much of your previous step-by-step prompt as possible.

However, you might want to do a bit more to ensure that your answers are in the same (right) format, because otherwise the majority vote doesn't make a ton of sense.

**Note:** If you were to run it on the full dataset or a big part of it, be aware that this takes much longer, because it's essentially `num_runs` times more calls.

**Bonus:** If you want to make your code faster and closer to how it would be handled in a real use case, take a look at [asyncio](https://docs.python.org/3/library/asyncio.html). You can launch different calls asynchronously.

**Bonus:** We also encourage you to implement your function with LangChain chains.
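
For the asyncio bonus, here is a minimal sketch of launching several runs concurrently. The coroutine `fake_chain_of_thought` is a hypothetical stand-in that returns a random letter instead of calling an LLM; in a real solution you would replace its body with an asynchronous model call (LangChain runnables provide `ainvoke` for this purpose).

```python
import asyncio
import random


async def fake_chain_of_thought(question: str) -> str:
    """Stand-in for one LLM call; a real version would await the model."""
    await asyncio.sleep(0.01)  # simulates network latency
    return random.choice(["A", "B", "C", "D"])


async def sample_answers(question: str, num_runs: int = 5) -> list:
    # gather schedules all calls at once and returns results in call order.
    return await asyncio.gather(
        *(fake_chain_of_thought(question) for _ in range(num_runs))
    )


# In a notebook, use `await sample_answers(...)` instead of asyncio.run,
# because the notebook already has a running event loop.
candidates = asyncio.run(sample_answers("What is 2 + 2?", num_runs=5))
print(candidates)
```

With real API calls, the total time is then roughly that of the slowest single call rather than the sum of all calls, subject to rate limits.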

In [None]:
def chatgpt_self_consistency_answer(
        question: str,
        a: str,
        b: str,
        c: str,
        d: str,
        num_runs: int = 5,
        sampling_temperature=0.7
    ) -> str:
    answers = []
    for _ in range(num_runs):
        '''Run several Chains of Thoughts + answer extraction'''

    '''Choose the most popular answer and return it'''

In [None]:
chatgpt_self_consistency_answer(
    questions[0],
    choices.A[0],
    choices.B[0],
    choices.C[0],
    choices.D[0],
)

In [None]:
model_answers = []
for example_id in tqdm(range(len(dataset))):
    model_answers.append(chatgpt_self_consistency_answer(
        questions[example_id],
        choices.A[example_id],
        choices.B[example_id],
        choices.C[example_id],
        choices.D[example_id]
    ))

check_answers(answers, model_answers, verbose=True)


**To submit**. You will need to submit a csv file with model answers. Please launch the code below and submit the `self_consistency_answers.csv` file to grading.

In [None]:
import pandas as pd

df = pd.DataFrame(model_answers, columns=['answer'])
df.to_csv('self_consistency_answers.csv', index=True)

You should aim at getting at least 75% of your answers in the correct format and accuracy at least 50%

## Task 1.6 Structured Output

Finally, we'll focus on format more than quality. Even though with the previous techniques you might already be getting quite good results for the `wrong_format` metric, it's still important to practice using **structured output**. It's consistent, and it matters much more in real-world applications than in toy problems.

Your task is to finish the following function (use whichever prompt you like from the previous sub-tasks). You need to design the structure yourself, in the way you think is most appropriate for this task.

For some inspiration we advise you to take a look at supported schemas https://platform.openai.com/docs/guides/structured-outputs/supported-schemas.

To submit the solution, convert your model's predictions to the same format as before.

Note: To use structured output with langchain, you can use the [following](https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output):

```
from langchain_core.pydantic_v1 import BaseModel, Field

from langchain_openai import ChatOpenAI


class Joke(BaseModel):
    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline to the joke")

model = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm = model.with_structured_output(Joke)

structured_llm.invoke("Tell me a joke about cats")

```

In [None]:
from pydantic import BaseModel

def chatgpt_structured_answer(
        question: str,
        a: str,
        b: str,
        c: str,
        d: str,
        output_model: type[BaseModel]  # the model class itself, not an instance
    ) -> str:
    pass

In [None]:
model_answers = []
for example_id in tqdm(range(len(dataset))):
    model_answers.append(chatgpt_structured_answer(
        questions[example_id],
        choices.A[example_id],
        choices.B[example_id],
        choices.C[example_id],
        choices.D[example_id],
        MMLUAnswerModel  # the output model class you designed
    ))

check_answers(answers, model_answers, verbose=True)

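**To submit**. You will need to submit a csv file with model answers. Please launch the code below and submit the `structured_answer.csv` file to grading.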

In [None]:
import pandas as pd

df = pd.DataFrame(model_answers, columns=['answer'])
df.to_csv('structured_answer.csv', index=True)

In [None]:
chatgpt_structured_answer(
    questions[0],
    choices.A[0],
    choices.B[0],
    choices.C[0],
    choices.D[0],
    MMLUAnswerModel
)

In [None]:
model_answers = []
for example_id in tqdm(range(len(dataset))):
    model_answers.append(chatgpt_structured_answer(
        questions[example_id],
        choices.A[example_id],
        choices.B[example_id],
        choices.C[example_id],
        choices.D[example_id],
        MMLUAnswerModel
    ))

check_answers(answers, model_answers, verbose=True)

# Task 2. Introducing vector database search

*3 points*

In the previous task we solved a Q&A task with an LLM using only what the LLM has "learnt" during its training. However, this doesn't always work perfectly. Often, you need to introduce specific knowledge to the LLM to get adequate generation quality. This is usually done by allowing the LLM to search for answers on the web or in a database.

In this task you'll learn to query vector databases with LLMs. We will mainly follow a tutorial of `lancedb`.

Let's install prerequisites.

In [None]:
!pip install lancedb datasets tqdm openai langchain langchain_community -q

In [None]:
from datasets import load_dataset
from tqdm.auto import tqdm
import openai

from langchain.vectorstores import LanceDB
from langchain.schema import Document

import lancedb
from lancedb.embeddings import with_embeddings

For the experiments we'll use the `truthful_qa` dataset, which provides both popular misconceptions and correct answers to a number of questions. This dataset is used in research to test generative AI's *truthfulness*.

In [None]:
dataset = load_dataset("truthful_qa", "generation", split='validation')
dataset

In [None]:
dataset[0]

We are going to search by questions.

In [None]:
dataset_df = dataset.to_pandas()
dataset_df['text'] = dataset_df['question']

Let's create our database.

In [None]:
# This line clears the db dir in case you've run this cell before
!rm -rf /tmp/lancedb

db = lancedb.connect("/tmp/lancedb")

Now we can choose our embeddings and populate LanceDB tables.

In [None]:
from lancedb.embeddings import with_embeddings
import os
from google.colab import userdata

# open_ai_key = open(".open-ai-api-key").read().strip()
open_ai_key = userdata.get('open_ai_api_key')
openai.api_key = open_ai_key

os.environ["OPENAI_API_KEY"] = open_ai_key

def embed_func(c):
    rs = openai.embeddings.create(input=c, model="text-embedding-ada-002")
    return [record.embedding for record in rs.data]

data = with_embeddings(embed_func, dataset_df, show_progress=True)

In [None]:
truthful_qa_table = db.create_table('truthful_qa', data=data)

In [None]:
def search_table(query, limit=5, table=truthful_qa_table):
    query_embedding = embed_func(query)[0]
    return table.search(query_embedding).limit(limit).to_pandas()

def create_prompt(query, context):
    return f"Using this information: {context}\n\n\n{query}"

Write a function `search_result_to_context` which takes an output from db and returns textual context, which we'll feed to our LLM.
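
To show one possible shape of such a context, here is a hedged sketch that works on plain dictionaries rather than on the search output itself. The helper name `rows_to_context` is hypothetical, and the choice of the `question` and `best_answer` columns is an assumption about which fields are useful; the example rows are made up.

```python
def rows_to_context(rows):
    """Format retrieved rows as numbered question/answer pairs."""
    lines = []
    for i, row in enumerate(rows, start=1):
        lines.append(f"{i}. Q: {row['question']}\n   A: {row['best_answer']}")
    return "\n".join(lines)


# Made-up rows for illustration; real ones could come from
# search_table(...).to_dict("records").
rows = [
    {"question": "Example question one?", "best_answer": "Example answer one."},
    {"question": "Example question two?", "best_answer": "Example answer two."},
]
print(rows_to_context(rows))
```

Numbering the entries is a small design choice that later makes it easier to ask the model to cite which entry it used.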

In [None]:
def search_result_to_context(search_result) -> str:
    pass

In [None]:
result = search_table("pigs", limit=2)
print(search_result_to_context(result))

Now write two functions:

- one providing ChatGPT's answer given a query, but without accessing our database;
- another which also uses the database to incorporate the context.

Make sure that the second function accepts `prompt_func`, a function, which creates a contextualised prompt.

In [None]:
def raw_answer(query) -> str:
    pass

def answer_with_db(query, prompt_func=create_prompt) -> str:
    pass

In [None]:
from IPython.display import display

prompt = "Can pigs fly?"

print("Raw answer")
display(raw_answer(prompt))

print("\n\nAnswer using the database")
display(answer_with_db(prompt))


## Bonus task

*1 point*

Now you need to write two new `prompt_func`. They should achieve the following goals:


1.   Only give false information when answering the user's query. (Keep in mind that ChatGPT would be very reluctant to do so, so you should somehow persuade it.)
2.   For any answer the model gives, make it cite a source from the context received.



In [None]:
def create_false_information_prompt(query, context) -> str:
    pass

In [None]:
display(answer_with_db(prompt, prompt_func=create_false_information_prompt))

In [None]:
def create_with_source_prompt(query, context) -> str:
    pass

In [None]:
display(answer_with_db(prompt, prompt_func=create_with_source_prompt))

## Task 2.2

*1 point*

In this task you will write your own plugin for ChatGPT.

The `langchain` library has a `Tool.from_function` method, which allows you to turn your `str->str` function into a tool for your LLM. You will need to write this function, `db_tool_function`.

Based on the description of our tool, the LLM agent will generate a string, which will be passed to this function. The output string is the result that the agent will see and try to use in answering your query.

In the end it should be used like this:

```
tools = [
    Tool.from_function(
        func=db_tool_function,
        name=..., # a fitting name
        description=... # a description to help the agent use it
    ),
]
agent = initialize_agent(
    tools=tools, llm=llm, agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run(
    "What are the common misconceptions about food? List them all"
)
# Agent goes to search the database
```

In [None]:
!pip install -q langchain langchain-openai langchainhub openai

In [None]:
def db_tool_function(query: str) -> str:
    pass

In [None]:
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import  Tool, tool
from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate

tools = [
    Tool.from_function(
        func=db_tool_function,
        name="Search misconceptions database",
        description="Useful when you are talking about misconceptions, gives you hard questions and correct answers"
    ),
]

In [None]:
import os

os.environ["OPENAI_API_KEY"] = openai.api_key
llm = OpenAI()

prompt = hub.pull("hwchase17/react")

agent = create_react_agent(llm, tools, prompt)

In [None]:
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

In [None]:
agent_executor.invoke({
    "input": "What are the common misconceptions about food? List them all",
})

## Task 2.3

Let's take a closer look at the output that our database search returns:

In [None]:
search_table(query="pigs fly")

As you can see, one of the columns is actually `distance`. In this task, we suggest that you implement the following system:

For each query we search the database. We check whether at least one of the results has a distance below 0.2. If so, we answer with gpt-4o-mini using the database information. Otherwise, the output is generated by gpt-4o.

This will emulate a real scenario where "harder" queries are processed by a bigger "stronger" model.
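
The routing decision itself can be sketched separately from the LLM calls. The helper `choose_model` below is hypothetical; it takes the list of distances from the search result (check the exact name of the distance column in your own output) and returns which model to use and whether to include the database context.

```python
def choose_model(distances, threshold=0.2,
                 simple_model="gpt-4o-mini", complex_model="gpt-4o"):
    """Return (model_name, use_context) for a list of search distances."""
    # A close match means the database likely covers the query.
    if distances and min(distances) < threshold:
        return simple_model, True
    # No close match (or no results): fall back to the stronger model.
    return complex_model, False


print(choose_model([0.15, 0.31, 0.40]))
print(choose_model([0.35, 0.42]))
```

Keeping this decision in its own small function makes the threshold easy to tune and test without spending API calls.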

In [None]:
def answer_with_smart_routing(query, simple_model='gpt-4o-mini', complex_model='gpt-4o') -> str:
    pass

In [None]:
answer_with_smart_routing("Can pigs fly?")

In [None]:
answer_with_smart_routing("What is the theorem of Pythagoras?")