# Mudocun: prompt exploration

In this notebook we run a few sample queries to explore the idea of generating quiz questions that require understanding images, charts, or formulas.

First, let's set up the system path.

In [6]:
import pathlib
import sys

ROOT_DIR = str(pathlib.Path.cwd().parent)
sys.path.append(ROOT_DIR) # Add parent folder to the path

## Generate quiz from a locally available pdf article

Let's generate a sample quiz based on the paper "Towards Autotuning by Alternating Communication Methods", locally available at `./docs/autotuning_cscs.pdf`.

In [7]:
from vertexai_utils import generate_quiz
from docs import local_documents
from models import Article

response, input_tokens, output_tokens, gen_time = generate_quiz(Article(local_documents[0]))
response

Generating quiz for document "Towards Autotuning by Alternating Communication Methods"...


'## Multiple Choice Quiz Questions:\n\n**Figure 1: Autotuning framework**\n\n**Question 1:** What is the purpose of code refactoring in the autotuning framework?\n\nA. To optimize the code for a specific hardware architecture.\nB. To expose the communication pattern and enable the use of one-sided communication primitives.\nC. To generate multiple versions of the kernel with different communication methods.\nD. To profile the communication performance of the kernel.\n\n**Correct Answer: B**\n\n**Question 2:** Which of the following is NOT a communication method tested in the platform profiling stage?\n\nA. UPC\nB. MPI\nC. OpenMP\nD. DMAPP\n\n**Correct Answer: C**\n\n**Question 3:** What is the role of the platform specific knowledge base in the autotuning framework?\n\nA. To store the refactored code for different kernels.\nB. To provide information about the performance of different communication methods on the target platform.\nC. To generate the final optimized kernel code.\nD. To e

In [8]:
input_tokens, output_tokens, gen_time

(538, 303, 8.027336835861206)
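As a rough sanity check, the tuple above can be turned into a throughput figure. This is a minimal sketch that assumes `gen_time` is wall-clock time in seconds; the helper name `tokens_per_second` is ours for illustration and is not part of `vertexai_utils`.

```python
def tokens_per_second(output_tokens: int, gen_time: float) -> float:
    """Return generated tokens per second of wall-clock time."""
    if gen_time <= 0:
        raise ValueError("gen_time must be positive")
    return output_tokens / gen_time


# Values copied from the cell output above (assumed: tokens, tokens, seconds).
run_input_tokens, run_output_tokens, run_gen_time = 538, 303, 8.027336835861206

rate = tokens_per_second(run_output_tokens, run_gen_time)
print(f"{rate:.1f} output tokens/s")
```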

In [9]:
from IPython.display import display, Markdown
display(Markdown(response))

## Multiple Choice Quiz Questions:

**Figure 1: Autotuning framework**

**Question 1:** What is the purpose of code refactoring in the autotuning framework?

A. To optimize the code for a specific hardware architecture.
B. To expose the communication pattern and enable the use of one-sided communication primitives.
C. To generate multiple versions of the kernel with different communication methods.
D. To profile the communication performance of the kernel.

**Correct Answer: B**

**Question 2:** Which of the following is NOT a communication method tested in the platform profiling stage?

A. UPC
B. MPI
C. OpenMP
D. DMAPP

**Correct Answer: C**

**Question 3:** What is the role of the platform specific knowledge base in the autotuning framework?

A. To store the refactored code for different kernels.
B. To provide information about the performance of different communication methods on the target platform.
C. To generate the final optimized kernel code.
D. To execute and evaluate the performance of different kernel versions.

**Correct Answer: B**

**Question 4:**  What is the final output of the code generation stage?

A. A single optimized kernel version.
B. Multiple kernel versions with different communication methods.
C. A detailed performance report of the original kernel.
D. A platform specific knowledge base.

**Correct Answer: B** 


We see the model produces a quiz based on the figure included in the paper.

Model responses are currently not reproducible because of a known limitation (see [here](https://www.googlecloudcommunity.com/gc/AI-ML/Unexpected-Behavior-Gemini-1-0-Pro-002-Returns-Different-Outputs/m-p/757351/highlight/true#M7495)).
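One way to get a feel for how much two generations differ is a plain string-similarity ratio. The sketch below uses the standard library's `difflib`; the helper name `similarity` is hypothetical, and the two inputs are the wordings of Question 1 from two different runs on the same paper.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Question 1 as worded in two different generations for the same paper.
run_a = "What is the purpose of code refactoring in the autotuning framework?"
run_b = "What is the purpose of code refactoring in this autotuning framework?"

print(similarity(run_a, run_a))            # identical strings
print(round(similarity(run_a, run_b), 3))  # near-identical wording
```

A character-level ratio only captures surface differences; two runs can be worded very differently and still ask the same thing.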

Next, we analyze in detail one of the generations.



## Analysis of a quiz generation

The following quiz was generated in an earlier request (though it can't be reproduced exactly).

In [10]:
cscs_paper_response = '## Multiple Choice Quiz Questions based on the Scientific Article:\n\n**Figure 1: Autotuning framework**\n\n**Question 1:** What is the purpose of code refactoring in this autotuning framework?\n\nA. To benchmark different communication methods on the target platform.\nB. To generate multiple kernel versions with varying communication strategies.\nC. To expose the communication pattern and enable the use of one-sided primitives.\nD. To profile the performance of various communication methods for different message sizes.\n\n**Correct Answer: C**\n\n**Question 2:** Which of the following is NOT a stage in the presented autotuning framework?\n\nA. Code refactoring\nB. Platform profiling\nC. Performance prediction\nD. Code generation\n\n**Correct Answer: C**\n\n**Question 3:** What does the "Domain knowledge" arrow in Figure 1 represent?\n\nA. Information about the specific scientific problem being solved.\nB. Knowledge about the underlying hardware architecture of the platform.\nC. Understanding of the communication patterns within the application code.\nD. Expertise in optimizing communication using different programming models.\n\n**Correct Answer: C**\n\n**Question 4:**  What is the main advantage of using one-sided communication primitives, as shown in the refactored kernel?\n\nA. Reduced code complexity compared to message-passing approaches.\nB. Improved portability of the code across different HPC platforms.\nC. Increased flexibility for the hardware and runtime to optimize communication.\nD. Elimination of the need for synchronization constructs like barriers and fences.\n\n**Correct Answer: C**\n'
display(Markdown(cscs_paper_response))

## Multiple Choice Quiz Questions based on the Scientific Article:

**Figure 1: Autotuning framework**

**Question 1:** What is the purpose of code refactoring in this autotuning framework?

A. To benchmark different communication methods on the target platform.
B. To generate multiple kernel versions with varying communication strategies.
C. To expose the communication pattern and enable the use of one-sided primitives.
D. To profile the performance of various communication methods for different message sizes.

**Correct Answer: C**

**Question 2:** Which of the following is NOT a stage in the presented autotuning framework?

A. Code refactoring
B. Platform profiling
C. Performance prediction
D. Code generation

**Correct Answer: C**

**Question 3:** What does the "Domain knowledge" arrow in Figure 1 represent?

A. Information about the specific scientific problem being solved.
B. Knowledge about the underlying hardware architecture of the platform.
C. Understanding of the communication patterns within the application code.
D. Expertise in optimizing communication using different programming models.

**Correct Answer: C**

**Question 4:**  What is the main advantage of using one-sided communication primitives, as shown in the refactored kernel?

A. Reduced code complexity compared to message-passing approaches.
B. Improved portability of the code across different HPC platforms.
C. Increased flexibility for the hardware and runtime to optimize communication.
D. Elimination of the need for synchronization constructs like barriers and fences.

**Correct Answer: C**


The results are pretty impressive. We can highlight the following aspects:
* All generated questions correctly match the content of the figure.
* There are multiple and diverse questions, involving several parts of the figure.
* The multiple choice options are relevant, and use scientific vocabulary consistent with the paper's content.
* All answers are correct except Q3, where the correct answer would be A. Arguably, this was not sufficiently explained in the paper, so the model assumed a different kind of domain knowledge. (Interestingly, this hints at a weakness in the figure.)
* We observe sophisticated behavior in Q4, where the correct answer draws on information provided in the surrounding text, not in the figure itself. In particular, the correct answer is derived from the following extract on page 1:
> Our autotuning framework is composed of three basic stages as shown in Fig. 1. First is code refactoring, which exposes the communication pattern so that it can be expressed with one-sided communication primitives. When possible, out-of-order message delivery is tolerated too. This transformation allows for maximum flexibility for the hardware and runtime to schedule the communication in the most efficient manner.
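The manual review above can also be expressed as a number. The following sketch extracts the `**Correct Answer: X**` markers with a regular expression and compares them with a human-reviewed key. The key `["C", "C", "A", "C"]` encodes our review (Q3 should be A); the helper name `extract_answers` and the abridged `sample` string are for illustration only.

```python
import re

# Abridged copy of the generation above: question bodies are elided,
# only the answer markers are kept.
sample = (
    "**Question 1:** ...\n\n**Correct Answer: C**\n\n"
    "**Question 2:** ...\n\n**Correct Answer: C**\n\n"
    "**Question 3:** ...\n\n**Correct Answer: C**\n\n"
    "**Question 4:** ...\n\n**Correct Answer: C**\n"
)


def extract_answers(markdown_quiz: str) -> list[str]:
    """Return the letters the model marked as correct, in question order."""
    return re.findall(r"\*\*Correct Answer:\s*([A-D])\*\*", markdown_quiz)


model_answers = extract_answers(sample)
reviewed_key = ["C", "C", "A", "C"]  # manual review: Q3 should be A

matches = [m == k for m, k in zip(model_answers, reviewed_key)]
accuracy = sum(matches) / len(reviewed_key)
print(model_answers, accuracy)
```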

## Generate quiz from an internet-hosted article

Let's now try generating a quiz from a larger paper, "Mixtral of Experts" by Mistral.ai, fetched on demand:

In [11]:
from docs import online_documents

response, input_tokens, output_tokens, gen_time = generate_quiz(Article(online_documents[0]))
response

Generating quiz for document "Mixtral of Experts"...


'## Mixtral of Experts Quiz\n\n**Figure 1: Mixture of Experts Layer**\n\n**Question:**  What is the role of the "router" in the Mixture of Experts Layer?\n\n**(A)** It processes the input vector and generates the output vector.\n**(B)** It contains a set of 8 expert networks, each with its own weights.\n**(C)** It determines the weight assigned to the output of each expert.\n**(D)** **It selects two out of the eight experts to process the input vector.**\n\n**Answer: (D)**\n\n**Table 1: Model architecture**\n\n**Question:** What is the context length used for training the Mixtral model?\n\n**(A)** 4096 tokens \n**(B)** 14336 tokens\n**(C)** **32768 tokens**\n**(D)** 32000 tokens \n\n**Answer: (C)**\n\n**Figure 2: Performance of Mixtral and different Llama models on a wide range of benchmarks.**\n\n**Question:** In which task category does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?\n\n**(A)** Knowledge\n**(B)** Comprehension\n**(C)** **Math and Cod

In [12]:
input_tokens, output_tokens, gen_time

(3376, 375, 12.879631996154785)

In [13]:
display(Markdown(response))

## Mixtral of Experts Quiz

**Figure 1: Mixture of Experts Layer**

**Question:**  What is the role of the "router" in the Mixture of Experts Layer?

**(A)** It processes the input vector and generates the output vector.
**(B)** It contains a set of 8 expert networks, each with its own weights.
**(C)** It determines the weight assigned to the output of each expert.
**(D)** **It selects two out of the eight experts to process the input vector.**

**Answer: (D)**

**Table 1: Model architecture**

**Question:** What is the context length used for training the Mixtral model?

**(A)** 4096 tokens 
**(B)** 14336 tokens
**(C)** **32768 tokens**
**(D)** 32000 tokens 

**Answer: (C)**

**Figure 2: Performance of Mixtral and different Llama models on a wide range of benchmarks.**

**Question:** In which task category does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?

**(A)** Knowledge
**(B)** Comprehension
**(C)** **Math and Code**
**(D)** AGI Eval

**Answer: (C)**

**Table 3: Comparison of Mixtral with Llama 2 70B and GPT-3.5.**

**Question:** On the MBPP benchmark, which model demonstrates the highest pass@1 accuracy? 

**(A)** Llama 2 70B
**(B)** GPT-3.5
**(C)** **Mixtral 8x7B**
**(D)** The information is not provided in the table.

**Answer: (C)** 


## Analysis of a larger quiz generation

As in the previous example, we next analyze an earlier generation in detail.

In [14]:
mixtral_response = '## Multiple Choice Quiz Questions based on "Mixtral of Experts"\n\n**Figure 1: Mixture of Experts Layer**\n\n**Question 1:** In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?\n\nA) 1\nB) 2  **[CORRECT]**\nC) 4\nD) 8\n\n**Figure 2: Performance of Mixtral and different Llama models**\n\n**Question 2:**  According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?\n\nA) Comprehension\nB) Knowledge\nC) Code  **[CORRECT]**\nD) AGI Eval\n\n**Table 2: Comparison of Mixtral with Llama**\n\n**Question 3:**  According to Table 2, how many active parameters does Mixtral 8x7B use during inference?\n\nA) 7B\nB) 13B  **[CORRECT]**\nC) 33B\nD) 70B\n\n**Figure 3: Results on MMLU, commonsense reasoning, etc.**\n\n**Question 4:**  Based on Figure 3, in which area does Mixtral **not** consistently outperform Llama 2 70B while using fewer active parameters?\n\nA) Math\nB) Code\nC) Reasoning\nD) Comprehension  **[CORRECT]**\n\n**Table 3: Comparison of Mixtral with Llama 2 70B and GPT-3.5**\n\n**Question 5:** On the MBPP benchmark, what is the performance difference between Mixtral 8x7B and Llama 2 70B?\n\nA) 10.9%  **[CORRECT]**\nB) 8.4%\nC) 4.9%\nD) 2.3%\n\n**Figure 4: Long range performance of Mixtral**\n\n**Question 6:** What does the left part of Figure 4 demonstrate about Mixtral\'s ability on the Passkey task?\n\nA) Mixtral\'s perplexity decreases with longer context.\nB) Mixtral achieves perfect retrieval accuracy regardless of context length.  **[CORRECT]**\nC) Mixtral struggles with retrieving passkeys in long sequences.\nD) Mixtral\'s performance is comparable to other models on this task.\n\n**Figure 5: Bias Benchmarks**\n\n**Question 7:**  Compared to Llama 2 70B, what does Figure 5 suggest about Mixtral\'s performance on bias benchmarks?\n\nA) Mixtral exhibits higher bias.\nB) Mixtral exhibits similar bias.\nC) Mixtral exhibits lower bias.  **[CORRECT]**\nD) The figure does not provide information about bias.\n\n**Figure 6: LMSys Leaderboard**\n\n**Question 8:** As of December 2023, what is the ranking of Mixtral 8x7B Instruct v0.1 on the LMSys Leaderboard among open-weight models? \n\nA) First  **[CORRECT]**\nB) Second\nC) Third\nD) Fourth\n\n**Figure 7: Proportion of tokens assigned to each expert**\n\n**Question 9:** According to Figure 7, is there a clear pattern of expert specialization based on the topic of the text across different layers?\n\nA) Yes, experts clearly specialize in specific domains.\nB) No, there is no clear pattern of specialization.  **[CORRECT]**\nC) The figure does not provide information about expert specialization.\nD) Only the first layer shows a clear pattern of specialization.\n\n**Table 5: Percentage of expert assignment repetitions**\n\n**Question 10:**  What trend does Table 5 reveal about expert assignment repetitions at higher layers (15 and 31) compared to the first layer (0)?\n\nA) Repetitions are significantly lower at higher layers.\nB) Repetitions are similar across all layers.\nC) Repetitions are significantly higher at higher layers.  **[CORRECT]**\nD) The table does not provide information about expert assignment repetitions. \n'
display(Markdown(mixtral_response))

## Multiple Choice Quiz Questions based on "Mixtral of Experts"

**Figure 1: Mixture of Experts Layer**

**Question 1:** In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?

A) 1
B) 2  **[CORRECT]**
C) 4
D) 8

**Figure 2: Performance of Mixtral and different Llama models**

**Question 2:**  According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?

A) Comprehension
B) Knowledge
C) Code  **[CORRECT]**
D) AGI Eval

**Table 2: Comparison of Mixtral with Llama**

**Question 3:**  According to Table 2, how many active parameters does Mixtral 8x7B use during inference?

A) 7B
B) 13B  **[CORRECT]**
C) 33B
D) 70B

**Figure 3: Results on MMLU, commonsense reasoning, etc.**

**Question 4:**  Based on Figure 3, in which area does Mixtral **not** consistently outperform Llama 2 70B while using fewer active parameters?

A) Math
B) Code
C) Reasoning
D) Comprehension  **[CORRECT]**

**Table 3: Comparison of Mixtral with Llama 2 70B and GPT-3.5**

**Question 5:** On the MBPP benchmark, what is the performance difference between Mixtral 8x7B and Llama 2 70B?

A) 10.9%  **[CORRECT]**
B) 8.4%
C) 4.9%
D) 2.3%

**Figure 4: Long range performance of Mixtral**

**Question 6:** What does the left part of Figure 4 demonstrate about Mixtral's ability on the Passkey task?

A) Mixtral's perplexity decreases with longer context.
B) Mixtral achieves perfect retrieval accuracy regardless of context length.  **[CORRECT]**
C) Mixtral struggles with retrieving passkeys in long sequences.
D) Mixtral's performance is comparable to other models on this task.

**Figure 5: Bias Benchmarks**

**Question 7:**  Compared to Llama 2 70B, what does Figure 5 suggest about Mixtral's performance on bias benchmarks?

A) Mixtral exhibits higher bias.
B) Mixtral exhibits similar bias.
C) Mixtral exhibits lower bias.  **[CORRECT]**
D) The figure does not provide information about bias.

**Figure 6: LMSys Leaderboard**

**Question 8:** As of December 2023, what is the ranking of Mixtral 8x7B Instruct v0.1 on the LMSys Leaderboard among open-weight models? 

A) First  **[CORRECT]**
B) Second
C) Third
D) Fourth

**Figure 7: Proportion of tokens assigned to each expert**

**Question 9:** According to Figure 7, is there a clear pattern of expert specialization based on the topic of the text across different layers?

A) Yes, experts clearly specialize in specific domains.
B) No, there is no clear pattern of specialization.  **[CORRECT]**
C) The figure does not provide information about expert specialization.
D) Only the first layer shows a clear pattern of specialization.

**Table 5: Percentage of expert assignment repetitions**

**Question 10:**  What trend does Table 5 reveal about expert assignment repetitions at higher layers (15 and 31) compared to the first layer (0)?

A) Repetitions are significantly lower at higher layers.
B) Repetitions are similar across all layers.
C) Repetitions are significantly higher at higher layers.  **[CORRECT]**
D) The table does not provide information about expert assignment repetitions. 


Again, the results are remarkably good. We observe the following:
* Every question correctly refers to one figure or table, even in the case where the authors inconsistently label a table as Figure 5.
* All multiple choice options are consistent with the content of the paper.
* All answers marked as correct are accurate.
* The quiz exhibits comprehension of a broad range of diagrams and figures. It also clearly takes into account at least the text accompanying each figure or table (caption).
* In one case, for question 5, the model even performs a subtraction, adding a layer of reasoning beyond simply extracting or reading information.
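The subtraction behind question 5 can be checked by hand. The two MBPP pass@1 values below are the ones we recall from Table 3 of the Mixtral paper; they are not shown in this notebook, so treat them as an assumption and verify them against the PDF.

```python
# MBPP pass@1 in percent, as recalled from Table 3 of the paper (assumption:
# verify against the PDF before relying on these values).
mbpp_pass_at_1 = {"Llama 2 70B": 49.8, "Mixtral 8x7B": 60.7}

# Round to one decimal to avoid floating-point noise in the difference.
difference = round(
    mbpp_pass_at_1["Mixtral 8x7B"] - mbpp_pass_at_1["Llama 2 70B"], 1
)
print(f"{difference}%")
```

Under that assumption the difference matches option A, which neither table cell states directly.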

We observe the following limitations:
* The format differs from the one produced previously, so it would be hard to post-process the output to extract the question options and correct answer, among other aspects of interest.
* The model stops at 10 questions, even though there are more figures and tables in the document. Other runs produce even fewer questions.
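To illustrate the format problem, here is a minimal sketch of a parser for the inline `**[CORRECT]**` style used in this generation. The function name `parse_inline_correct` and the output keys are our own choice, and the input is a shortened excerpt of the quiz above.

```python
import re


def parse_inline_correct(markdown_quiz: str) -> list[dict]:
    """Parse questions whose correct option is tagged inline with **[CORRECT]**."""
    questions = []
    current = None
    for line in markdown_quiz.splitlines():
        line = line.strip()
        q = re.match(r"\*\*Question \d+:\*\*\s*(.+)", line)
        opt = re.match(r"([A-D])\)\s*(.+)", line)
        if q:
            current = {"text": q.group(1), "choices": [], "correct_answer_idx": None}
            questions.append(current)
        elif opt and current is not None:
            text = opt.group(2)
            if "**[CORRECT]**" in text:
                # The index of the option about to be appended is the answer.
                current["correct_answer_idx"] = len(current["choices"])
                text = text.replace("**[CORRECT]**", "").strip()
            current["choices"].append(text)
    return questions


sample = (
    "**Figure 1: Mixture of Experts Layer**\n\n"
    "**Question 1:** In the Mixture of Experts Layer, how many experts are "
    "assigned to each input vector by the router?\n\n"
    "A) 1\nB) 2  **[CORRECT]**\nC) 4\nD) 8\n"
)

parsed = parse_inline_correct(sample)
print(parsed[0]["choices"], parsed[0]["correct_answer_idx"])
```

This parser would not handle the `A.` plus `**Correct Answer: B**` style or the `**(A)**` plus `**Answer: (D)**` style seen in the other generations, so each new format would need its own rules.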

## Structured generation

In order to be able to create a dataset by automated means, it'd be desirable to have a consistent output format that we can easily parse. In other words, we need "structured generation".
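One common way to obtain structured output is to describe the target shape as a function-call schema. The declaration actually used inside `generate_structured_quiz` is not shown in this notebook, so the following is only a sketch of what such a schema might look like, using the field names `text`, `choices` and `correct_answer_idx`; the constant name `CREATE_QUIZ_SCHEMA` and the description strings are hypothetical.

```python
import json

# Hypothetical JSON-schema-style declaration for a `create_quiz` function call.
# This is an assumption for illustration, not the schema used by vertexai_utils.
CREATE_QUIZ_SCHEMA = {
    "name": "create_quiz",
    "description": "Create a multiple choice quiz from free-form quiz text.",
    "parameters": {
        "type": "object",
        "properties": {
            "questions": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "text": {"type": "string"},
                        "choices": {"type": "array", "items": {"type": "string"}},
                        "correct_answer_idx": {"type": "integer"},
                    },
                    "required": ["text", "choices", "correct_answer_idx"],
                },
            }
        },
        "required": ["questions"],
    },
}

print(json.dumps(CREATE_QUIZ_SCHEMA, indent=2))
```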

In [15]:
from vertexai_utils import generate_structured_quiz
from models import Quiz

questions, metadata = generate_structured_quiz(mixtral_response)
quiz = Quiz(article=Article(online_documents[0]), questions=questions, metadata=metadata)

Generating structured quiz...
func_dict={'name': 'create_quiz', 'args': {'questions': [{'choices': ['1', '2', '4', '8'], 'correct_answer_idx': 1.0, 'text': 'In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?'}, {'text': 'According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?', 'correct_answer_idx': 2.0, 'choices': ['Comprehension', 'Knowledge', 'Code', 'AGI Eval']}, {'choices': ['7B', '13B', '33B', '70B'], 'correct_answer_idx': 1.0, 'text': 'According to Table 2, how many active parameters does Mixtral 8x7B use during inference?'}, {'text': 'Based on Figure 3, in which area does Mixtral **not** consistently outperform Llama 2 70B while using fewer active parameters?', 'correct_answer_idx': 3.0, 'choices': ['Math', 'Code', 'Reasoning', 'Comprehension']}, {'choices': ['10.9%', '8.4%', '4.9%', '2.3%'], 'correct_answer_idx': 0.0, 'text': 'On the MBPP benchmark, what is the p

In [16]:
quiz

{'article': {'title': 'Mixtral of Experts',
  'uri': 'https://arxiv.org/pdf/2401.04088'},
 'questions': [{'choices': ['1', '2', '4', '8'],
   'correct_answer_idx': 1.0,
   'text': 'In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?'},
  {'text': 'According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?',
   'correct_answer_idx': 2.0,
   'choices': ['Comprehension', 'Knowledge', 'Code', 'AGI Eval']},
  {'choices': ['7B', '13B', '33B', '70B'],
   'correct_answer_idx': 1.0,
   'text': 'According to Table 2, how many active parameters does Mixtral 8x7B use during inference?'},
  {'text': 'Based on Figure 3, in which area does Mixtral **not** consistently outperform Llama 2 70B while using fewer active parameters?',
   'correct_answer_idx': 3.0,
   'choices': ['Math', 'Code', 'Reasoning', 'Comprehension']},
  {'choices': ['10.9%', '8.4%', '4.9%', '2.3%'],
   'correct_answer_idx

This shows that the free-form quiz can be converted into a structured format, which is then easy to parse into the desired output format.
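Note that `correct_answer_idx` comes back as a float (`1.0`) in the output above. Before building a dataset it may be worth normalizing and validating each question. The sketch below is one possible approach; the `Question` dataclass and `normalize_question` helper are hypothetical and separate from the `models` module used in this notebook.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    choices: list[str]
    correct_answer_idx: int


def normalize_question(raw: dict) -> Question:
    """Convert one raw question dict into a validated Question."""
    idx = raw["correct_answer_idx"]
    if float(idx) != int(idx):
        raise ValueError(f"non-integer answer index: {idx!r}")
    idx = int(idx)
    choices = [str(choice) for choice in raw["choices"]]
    if not 0 <= idx < len(choices):
        raise ValueError(f"answer index {idx} out of range for {len(choices)} choices")
    return Question(text=raw["text"], choices=choices, correct_answer_idx=idx)


# First question from the structured output above.
raw = {
    "choices": ["1", "2", "4", "8"],
    "correct_answer_idx": 1.0,
    "text": "In the Mixture of Experts Layer, how many experts are assigned "
            "to each input vector by the router?",
}

question = normalize_question(raw)
print(question.choices[question.correct_answer_idx])
```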