# Mudocun: prompt exploration

In this notebook we run a few sample queries to explore generating quiz questions that require understanding images, charts, or formulas.

In [1]:
import pathlib
import sys

ROOT_DIR = str(pathlib.Path.cwd().parent)
sys.path.append(ROOT_DIR) # Add parent folder to the path

In [2]:
from vertexai_utils import generate_quiz
from docs import local_documents

response = generate_quiz(local_documents[0])
response

Generating quiz for document "Towards Autotuning by Alternating Communication Methods"...

'## Multiple Choice Quiz Questions based on the Scientific Article:\n\n**Figure 1: Autotuning framework**\n\n**Question 1:** What is the purpose of code refactoring in this autotuning framework?\n\nA. To benchmark different communication methods on the target platform.\nB. To generate multiple kernel versions with varying communication strategies.\nC. To expose the communication pattern and enable the use of one-sided primitives.\nD. To profile the performance of various communication methods for different message sizes.\n\n**Correct Answer: C**\n\n**Question 2:** Which of the following is NOT a stage in the presented autotuning framework?\n\nA. Code refactoring\nB. Platform profiling\nC. Performance prediction\nD. Code generation\n\n**Correct Answer: C**\n\n**Question 3:** What does the "Domain knowledge" arrow in Figure 1 represent?\n\nA. Information about the specific scientific problem being solved.\nB. Knowledge about the underlying hardware architecture of the platform.\nC. Under

In [3]:
from IPython.display import display, Markdown
display(Markdown(response))

## Multiple Choice Quiz Questions based on the Scientific Article:

**Figure 1: Autotuning framework**

**Question 1:** What is the purpose of code refactoring in this autotuning framework?

A. To benchmark different communication methods on the target platform.
B. To generate multiple kernel versions with varying communication strategies.
C. To expose the communication pattern and enable the use of one-sided primitives.
D. To profile the performance of various communication methods for different message sizes.

**Correct Answer: C**

**Question 2:** Which of the following is NOT a stage in the presented autotuning framework?

A. Code refactoring
B. Platform profiling
C. Performance prediction
D. Code generation

**Correct Answer: C**

**Question 3:** What does the "Domain knowledge" arrow in Figure 1 represent?

A. Information about the specific scientific problem being solved.
B. Knowledge about the underlying hardware architecture of the platform.
C. Understanding of the communication patterns within the application code.
D. Expertise in optimizing communication using different programming models.

**Correct Answer: C**

**Question 4:**  What is the main advantage of using one-sided communication primitives, as shown in the refactored kernel?

A. Reduced code complexity compared to message-passing approaches.
B. Improved portability of the code across different HPC platforms.
C. Increased flexibility for the hardware and runtime to optimize communication.
D. Elimination of the need for synchronization constructs like barriers and fences.

**Correct Answer: C**


The results are impressive. We can highlight the following aspects:
* All generated questions correctly match the content of the figure.
* The questions are numerous and diverse, covering several parts of the figure.
* The multiple choice options are relevant and use scientific vocabulary consistent with the paper's content.
* All answers are correct except for Q3, where the correct answer is A. Arguably, the paper does not explain this point sufficiently, so the model assumed a different kind of domain knowledge. (Interestingly, this hints at a weakness in the figure.)
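
The manual review above can be expressed as a comparison between the model's answer key and a reviewer's key. This is a minimal sketch: `model_key` copies the answers from the quiz output, `reviewed_key` encodes the judgment stated in the last bullet (Q3 should be A), and `compare_keys` is a hypothetical helper, not part of `vertexai_utils`.

```python
# Answer key as marked by the model in the first quiz (Q1 to Q4).
model_key = {1: "C", 2: "C", 3: "C", 4: "C"}

# Reviewer's key (an assumption encoding the manual review):
# identical to the model's key except for Q3, judged to be A.
reviewed_key = {1: "C", 2: "C", 3: "A", 4: "C"}


def compare_keys(model, reviewed):
    """Return the fraction of agreeing answers and the question numbers that differ."""
    mismatches = sorted(q for q in reviewed if model.get(q) != reviewed[q])
    accuracy = 1 - len(mismatches) / len(reviewed)
    return accuracy, mismatches


accuracy, mismatches = compare_keys(model_key, reviewed_key)
print(f"{accuracy:.0%} of marked answers agree with the review; mismatches: {mismatches}")
```

With only four questions the number is anecdotal, but the same comparison would scale to a larger reviewed sample.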

Let's now try generating a quiz from a longer paper, fetched on demand.

In [2]:
from vertexai_utils import generate_quiz
from docs import online_documents

response = generate_quiz(online_documents[0])
response

Generating quiz for document "Mixtral of Experts"...
candidates {
  content {
    role: "model"
    parts {
      text: "## Multiple Choice Quiz Questions based on \"Mixtral of Experts\"\n\n**Figure 1: Mixture of Experts Layer**\n\n**Question 1:** In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?\n\nA) 1\nB) 2  **[CORRECT]**\nC) 4\nD) 8\n\n**Figure 2: Performance of Mixtral and different Llama models**\n\n**Question 2:**  According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?\n\nA) Comprehension\nB) Knowledge\nC) Code  **[CORRECT]**\nD) AGI Eval\n\n**Table 2: Comparison of Mixtral with Llama**\n\n**Question 3:**  According to Table 2, how many active parameters does Mixtral 8x7B use during inference?\n\nA) 7B\nB) 13B  **[CORRECT]**\nC) 33B\nD) 70B\n\n**Figure 3: Results on MMLU, commonsense reasoning, etc.**\n\n**Question 4:**  Based on Figure 3, in which area does Mixt

'## Multiple Choice Quiz Questions based on "Mixtral of Experts"\n\n**Figure 1: Mixture of Experts Layer**\n\n**Question 1:** In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?\n\nA) 1\nB) 2  **[CORRECT]**\nC) 4\nD) 8\n\n**Figure 2: Performance of Mixtral and different Llama models**\n\n**Question 2:**  According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?\n\nA) Comprehension\nB) Knowledge\nC) Code  **[CORRECT]**\nD) AGI Eval\n\n**Table 2: Comparison of Mixtral with Llama**\n\n**Question 3:**  According to Table 2, how many active parameters does Mixtral 8x7B use during inference?\n\nA) 7B\nB) 13B  **[CORRECT]**\nC) 33B\nD) 70B\n\n**Figure 3: Results on MMLU, commonsense reasoning, etc.**\n\n**Question 4:**  Based on Figure 3, in which area does Mixtral **not** consistently outperform Llama 2 70B while using fewer active parameters?\n\nA) Math\nB) Code\nC) Reasoning\nD)

In [3]:
from IPython.display import display, Markdown
display(Markdown(response))

## Multiple Choice Quiz Questions based on "Mixtral of Experts"

**Figure 1: Mixture of Experts Layer**

**Question 1:** In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?

A) 1
B) 2  **[CORRECT]**
C) 4
D) 8

**Figure 2: Performance of Mixtral and different Llama models**

**Question 2:**  According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?

A) Comprehension
B) Knowledge
C) Code  **[CORRECT]**
D) AGI Eval

**Table 2: Comparison of Mixtral with Llama**

**Question 3:**  According to Table 2, how many active parameters does Mixtral 8x7B use during inference?

A) 7B
B) 13B  **[CORRECT]**
C) 33B
D) 70B

**Figure 3: Results on MMLU, commonsense reasoning, etc.**

**Question 4:**  Based on Figure 3, in which area does Mixtral **not** consistently outperform Llama 2 70B while using fewer active parameters?

A) Math
B) Code
C) Reasoning
D) Comprehension  **[CORRECT]**

**Table 3: Comparison of Mixtral with Llama 2 70B and GPT-3.5**

**Question 5:** On the MBPP benchmark, what is the performance difference between Mixtral 8x7B and Llama 2 70B?

A) 10.9%  **[CORRECT]**
B) 8.4%
C) 4.9%
D) 2.3%

**Figure 4: Long range performance of Mixtral**

**Question 6:** What does the left part of Figure 4 demonstrate about Mixtral's ability on the Passkey task?

A) Mixtral's perplexity decreases with longer context.
B) Mixtral achieves perfect retrieval accuracy regardless of context length.  **[CORRECT]**
C) Mixtral struggles with retrieving passkeys in long sequences.
D) Mixtral's performance is comparable to other models on this task.

**Figure 5: Bias Benchmarks**

**Question 7:**  Compared to Llama 2 70B, what does Figure 5 suggest about Mixtral's performance on bias benchmarks?

A) Mixtral exhibits higher bias.
B) Mixtral exhibits similar bias.
C) Mixtral exhibits lower bias.  **[CORRECT]**
D) The figure does not provide information about bias.

**Figure 6: LMSys Leaderboard**

**Question 8:** As of December 2023, what is the ranking of Mixtral 8x7B Instruct v0.1 on the LMSys Leaderboard among open-weight models? 

A) First  **[CORRECT]**
B) Second
C) Third
D) Fourth

**Figure 7: Proportion of tokens assigned to each expert**

**Question 9:** According to Figure 7, is there a clear pattern of expert specialization based on the topic of the text across different layers?

A) Yes, experts clearly specialize in specific domains.
B) No, there is no clear pattern of specialization.  **[CORRECT]**
C) The figure does not provide information about expert specialization.
D) Only the first layer shows a clear pattern of specialization.

**Table 5: Percentage of expert assignment repetitions**

**Question 10:**  What trend does Table 5 reveal about expert assignment repetitions at higher layers (15 and 31) compared to the first layer (0)?

A) Repetitions are significantly lower at higher layers.
B) Repetitions are similar across all layers.
C) Repetitions are significantly higher at higher layers.  **[CORRECT]**
D) The table does not provide information about expert assignment repetitions. 


Again, the results are remarkably good. We observe the following:
* Every question correctly refers to a figure or table, even where the authors inconsistently label a table as Figure 5.
* All multiple choice options are consistent with the content of the paper.
* All answers marked as correct are accurate.
* The quiz exhibits comprehension of a broad range of diagrams and figures. It also clearly takes into account the text accompanying each figure or table.
* In one case, question 5, the model even performs a subtraction, adding a layer of reasoning beyond simply extracting or reading information.
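
The subtraction behind question 5 can be written out explicitly. The notebook output does not show the underlying scores; assuming the MBPP pass@1 values reported in Table 3 of the Mixtral paper (60.7% for Mixtral 8x7B and 49.8% for Llama 2 70B), the difference is:

$$
60.7\% - 49.8\% = 10.9 \text{ percentage points}
$$

Strictly speaking, this is a difference in percentage points rather than a relative improvement of 10.9%.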

We observe the following limitations:
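* The format differs from the one produced for the first quiz: options are labeled `A)` rather than `A.`, and the correct answer is marked inline with `**[CORRECT]**` rather than on a separate `**Correct Answer: ...**` line. This makes it hard to post-process the output to extract the question options and the correct answer, among other aspects of interest.
* The model stops at 10 questions, even though the document contains more figures and tables. It has probably been conditioned to prefer that number when generating answers.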
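
One way to cope with the format drift is a tolerant parser that accepts both layouts. The following is a minimal sketch that assumes only the two layouts seen in this notebook; `parse_quiz` and the regular expressions are hypothetical helpers, not part of `vertexai_utils`, and the sample strings are excerpts from the outputs above.

```python
import re

# Patterns for the two layouts observed in this notebook (assumed, not exhaustive).
QUESTION_RE = re.compile(r"^\*\*Question (\d+):\*\*\s*(.+)$")
OPTION_RE = re.compile(r"^([A-D])[.)]\s+(.*)$")  # matches "A. text" and "A) text"
ANSWER_RE = re.compile(r"^\*\*Correct Answer:\s*([A-D])\*\*$")
INLINE_MARK = "**[CORRECT]**"


def parse_quiz(markdown_text):
    """Parse either observed quiz layout into a list of question dicts."""
    questions = []
    current = None
    for raw_line in markdown_text.splitlines():
        line = raw_line.strip()
        if match := QUESTION_RE.match(line):
            current = {
                "number": int(match.group(1)),
                "text": match.group(2).strip(),
                "options": {},
                "answer": None,
            }
            questions.append(current)
        elif current is None:
            continue  # skip titles and figure headers before the first question
        elif match := OPTION_RE.match(line):
            letter, text = match.groups()
            if INLINE_MARK in text:  # second layout: answer marked inline
                current["answer"] = letter
                text = text.replace(INLINE_MARK, "")
            current["options"][letter] = text.strip()
        elif match := ANSWER_RE.match(line):  # first layout: separate answer line
            current["answer"] = match.group(1)
    return questions


sample_a = """**Question 2:** Which of the following is NOT a stage in the presented autotuning framework?

A. Code refactoring
B. Platform profiling
C. Performance prediction
D. Code generation

**Correct Answer: C**
"""

sample_b = """**Question 1:** In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?

A) 1
B) 2  **[CORRECT]**
C) 4
D) 8
"""

for sample in (sample_a, sample_b):
    for question in parse_quiz(sample):
        print(question["number"], question["answer"], question["options"])
```

Calling `len(parse_quiz(response))` would also give the number of generated questions, which could be compared against the number of figures and tables in a document. A parser like this is only a fallback; constraining the output format in the prompt would likely be more reliable.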
