# Mudocun: feasibility study

In this notebook we run a few sample queries to explore whether a model can generate quiz questions that require understanding images, charts, or formulas.

First, let's set up the system path.

In [1]:
import pathlib
import sys
import logging

ROOT_DIR = str(pathlib.Path.cwd().parent)
sys.path.append(ROOT_DIR) # Add parent folder to the path

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(levelname)s - %(message)s",
    stream=sys.stdout,
)

## Generate quiz from a locally available pdf article

Let's generate a sample quiz based on the paper "Towards Autotuning by Alternating Communication Methods", locally available at `./docs/autotuning_cscs.pdf`.

In [2]:
from vertexai_utils import generate_questions
from docs import local_documents
from models import Article

questions, creation_metadata = generate_questions(Article(**local_documents[0]))
questions

2024-05-30 18:37:39,128 - DEBUG - Generating quiz for document "Towards Autotuning by Alternating Communication Methods"...





2024-05-30 18:37:41,416 - DEBUG - Making request: POST https://oauth2.googleapis.com/token
2024-05-30 18:37:43,521 - DEBUG - Starting new HTTPS connection (1): oauth2.googleapis.com:443
2024-05-30 18:37:45,351 - DEBUG - https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None


'## Quiz Questions\n\n**Figure 1: Autotuning framework**\n\n**Question 1:** Which of the following statements accurately describes the "code refactoring" stage of the autotuning framework?\n\nA. It involves benchmarking different communication methods and creating a knowledge database.\nB. It generates different kernels with varying combinations of communication methods.\nC. It transforms the code to explicitly expose the communication pattern for one-sided primitives.\nD. It evaluates the performance of generated kernels under different running conditions.\n\n**Answer: C**\n\n**Question 2:**  In the "Platform profile" stage, what is the purpose of running microbenchmarks?\n\nA. To refactor the code and expose the communication pattern.\nB. To generate different kernels with varying communication methods.\nC. To determine the optimal kernel version for specific running conditions.\nD. To gather platform-specific performance data for different communication operations.\n\n**Answer: D**\

In [3]:
creation_metadata.model_dump()

{'model': 'publishers/google/models/gemini-1.5-pro-001',
 'region': 'europe-west9',
 'num_input_tokens': 538,
 'num_output_tokens': 359,
 'generation_time': 17.34649682044983,
 'timestamp': '2024-05-30 16:37:56.477483+00:00'}

In [4]:
from IPython.display import display, Markdown
display(Markdown(questions))

## Quiz Questions

**Figure 1: Autotuning framework**

**Question 1:** Which of the following statements accurately describes the "code refactoring" stage of the autotuning framework?

A. It involves benchmarking different communication methods and creating a knowledge database.
B. It generates different kernels with varying combinations of communication methods.
C. It transforms the code to explicitly expose the communication pattern for one-sided primitives.
D. It evaluates the performance of generated kernels under different running conditions.

**Answer: C**

**Question 2:**  In the "Platform profile" stage, what is the purpose of running microbenchmarks?

A. To refactor the code and expose the communication pattern.
B. To generate different kernels with varying communication methods.
C. To determine the optimal kernel version for specific running conditions.
D. To gather platform-specific performance data for different communication operations.

**Answer: D**

**Question 3:** What is the key advantage of allowing "out-of-order message delivery" in the refactored code?

A. It simplifies the code and makes it easier to understand.
B. It reduces the amount of data that needs to be communicated.
C. It enables the hardware and runtime to optimize communication scheduling.
D. It eliminates the need for synchronization mechanisms like barriers and fences. 

**Answer: C**

**Question 4:**  Which of the following is NOT a communication method tested within the autotuning framework?

A. UPC (Unified Parallel C)
B. MPI (Message Passing Interface)
C. OpenMP (Open Multi-Processing)
D. DMAPP (Distributed Memory Access Programming Protocol)

**Answer: C** 


We see the model produces a quiz based on the figure included in the paper.

A current limitation prevents us from getting reproducible model responses (see [here](https://www.googlecloudcommunity.com/gc/AI-ML/Unexpected-Behavior-Gemini-1-0-Pro-002-Returns-Different-Outputs/m-p/757351/highlight/true#M7495)).
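
Since identical requests may return different quizzes, it can help to quantify how far two generations drift apart. The following is a minimal sketch, assuming both responses are available as plain strings. The helper name `generation_similarity` is hypothetical, and the two strings are the first question of each of the two generations shown in this notebook.

```python
import difflib


def generation_similarity(first: str, second: str) -> float:
    """Return a similarity ratio in [0, 1] between two model responses."""
    return difflib.SequenceMatcher(None, first, second).ratio()


# First question of two different generations for the same paper.
run_a = 'Which of the following statements accurately describes the "code refactoring" stage of the autotuning framework?'
run_b = "What is the purpose of code refactoring in this autotuning framework?"

similarity = generation_similarity(run_a, run_b)
print(f"Similarity between runs: {similarity:.2f}")
```

A ratio of 1.0 means the two responses are identical; lower values indicate that the wording has diverged.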

Next, we analyze one of the generations in detail.



## Analysis of a quiz generation

The following quiz was generated in an earlier request (though it can't be reproduced exactly).

In [5]:
cscs_paper_questions = '## Multiple Choice Quiz Questions based on the Scientific Article:\n\n**Figure 1: Autotuning framework**\n\n**Question 1:** What is the purpose of code refactoring in this autotuning framework?\n\nA. To benchmark different communication methods on the target platform.\nB. To generate multiple kernel versions with varying communication strategies.\nC. To expose the communication pattern and enable the use of one-sided primitives.\nD. To profile the performance of various communication methods for different message sizes.\n\n**Correct Answer: C**\n\n**Question 2:** Which of the following is NOT a stage in the presented autotuning framework?\n\nA. Code refactoring\nB. Platform profiling\nC. Performance prediction\nD. Code generation\n\n**Correct Answer: C**\n\n**Question 3:** What does the "Domain knowledge" arrow in Figure 1 represent?\n\nA. Information about the specific scientific problem being solved.\nB. Knowledge about the underlying hardware architecture of the platform.\nC. Understanding of the communication patterns within the application code.\nD. Expertise in optimizing communication using different programming models.\n\n**Correct Answer: C**\n\n**Question 4:**  What is the main advantage of using one-sided communication primitives, as shown in the refactored kernel?\n\nA. Reduced code complexity compared to message-passing approaches.\nB. Improved portability of the code across different HPC platforms.\nC. Increased flexibility for the hardware and runtime to optimize communication.\nD. Elimination of the need for synchronization constructs like barriers and fences.\n\n**Correct Answer: C**\n'
display(Markdown(cscs_paper_questions))

## Multiple Choice Quiz Questions based on the Scientific Article:

**Figure 1: Autotuning framework**

**Question 1:** What is the purpose of code refactoring in this autotuning framework?

A. To benchmark different communication methods on the target platform.
B. To generate multiple kernel versions with varying communication strategies.
C. To expose the communication pattern and enable the use of one-sided primitives.
D. To profile the performance of various communication methods for different message sizes.

**Correct Answer: C**

**Question 2:** Which of the following is NOT a stage in the presented autotuning framework?

A. Code refactoring
B. Platform profiling
C. Performance prediction
D. Code generation

**Correct Answer: C**

**Question 3:** What does the "Domain knowledge" arrow in Figure 1 represent?

A. Information about the specific scientific problem being solved.
B. Knowledge about the underlying hardware architecture of the platform.
C. Understanding of the communication patterns within the application code.
D. Expertise in optimizing communication using different programming models.

**Correct Answer: C**

**Question 4:**  What is the main advantage of using one-sided communication primitives, as shown in the refactored kernel?

A. Reduced code complexity compared to message-passing approaches.
B. Improved portability of the code across different HPC platforms.
C. Increased flexibility for the hardware and runtime to optimize communication.
D. Elimination of the need for synchronization constructs like barriers and fences.

**Correct Answer: C**


The results are pretty impressive. We can highlight the following aspects:
* All generated questions correctly match the content of the figure.
* There are multiple and diverse questions, involving several parts of the figure.
* The multiple choice options are relevant, and use scientific vocabulary consistent with the paper's content.
* All answers are correct except Q3, where the correct answer would be A. Arguably, this was not sufficiently explained in the paper, so the model assumed a different kind of domain knowledge. (Interestingly, this hints at a weakness in the figure.)
* We observe sophisticated behavior in Q4, where the correct answer relies on information provided in the surrounding text, not in the figure itself. In particular, the correct answer is derived from the following extract on page 1:
> Our autotuning framework is composed of three basic stages as shown in Fig. 1. First is code refactoring, which exposes the communication pattern so that it can be expressed with one-sided communication primitives. When possible, out-of-order message delivery is tolerated too. This transformation allows for maximum flexibility for the hardware and runtime to schedule the communication in the most efficient manner.
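
The manual review above can be partly automated once a reviewed answer key exists. The following is a minimal sketch, not the project's code: `extract_answers` and `reviewed_key` are hypothetical names, the regular expression assumes the `**Question N:**` and `**Correct Answer: X**` layout of this particular generation, and the sample string is an abbreviated copy of the quiz with the options omitted.

```python
import re

# Matches one question block: its number, then the letter given as the answer.
# Accepts both "**Answer: C**" and "**Correct Answer: C**", as seen in this notebook.
QUESTION_BLOCK = re.compile(
    r"\*\*Question (\d+):\*\*.*?\*\*(?:Correct )?Answer:\s*([A-D])\*\*",
    re.DOTALL,
)


def extract_answers(markdown_quiz: str) -> dict:
    """Map each question number to the answer letter chosen by the model."""
    return {int(num): letter for num, letter in QUESTION_BLOCK.findall(markdown_quiz)}


# Abbreviated copy of the quiz above (options omitted for brevity).
sample_quiz = """
**Question 1:** What is the purpose of code refactoring in this autotuning framework?

**Correct Answer: C**

**Question 2:** Which of the following is NOT a stage in the presented autotuning framework?

**Correct Answer: C**

**Question 3:** What does the "Domain knowledge" arrow in Figure 1 represent?

**Correct Answer: C**

**Question 4:**  What is the main advantage of using one-sided communication primitives, as shown in the refactored kernel?

**Correct Answer: C**
"""

# Answer key from the manual review: only Q3 differs (A instead of C).
reviewed_key = {1: "C", 2: "C", 3: "A", 4: "C"}

model_answers = extract_answers(sample_quiz)
disagreements = [q for q, letter in model_answers.items() if reviewed_key[q] != letter]
accuracy = 1 - len(disagreements) / len(model_answers)
print(f"Disagreements: {disagreements}, accuracy: {accuracy:.2f}")
```

With the key from the review, the only disagreement is Q3, which gives three correct answers out of four.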

## Generate quiz from an internet-hosted article

Let's now try generating a quiz from a larger paper, "Mixtral of Experts" by Mistral.ai, fetched on demand:

In [6]:
from docs import online_documents

questions, creation_metadata = generate_questions(Article(**online_documents[0]))
questions

2024-05-30 18:37:56,522 - DEBUG - Generating quiz for document "Mixtral of Experts"...
2024-05-30 18:37:56,525 - DEBUG - Starting new HTTPS connection (1): arxiv.org:443
2024-05-30 18:37:57,621 - DEBUG - https://arxiv.org:443 "GET /pdf/2401.04088 HTTP/1.1" 200 2475990


"## Figure 1: Mixture of Experts Layer\n\n**Question:** How does the Mixture of Experts Layer determine its output?\n\nA) It averages the outputs from all expert blocks.\nB) It selects the output from the expert block with the highest confidence. \nC) It calculates a weighted sum of outputs from two expert blocks chosen by a router.\nD) It passes the input through all expert blocks sequentially and combines the results.\n\n**Correct Answer:** C) It calculates a weighted sum of outputs from two expert blocks chosen by a router.\n\n\n## Table 1: Model Architecture\n\n**Question:**  What is the context length used in the Mixtral model architecture?\n\nA) 128 tokens\nB) 32k tokens\nC) 4096 tokens\nD) 32000 tokens\n\n**Correct Answer:** B) 32k tokens\n\n\n## Figure 2: Performance Comparison on Benchmarks\n\n**Question:** What do the results presented in Figure 2 generally indicate about Mixtral's performance compared to Llama 2 models?\n\nA) Mixtral consistently underperforms Llama 2 across

In [7]:
creation_metadata.model_dump()

{'model': 'publishers/google/models/gemini-1.5-pro-001',
 'region': 'europe-west9',
 'num_input_tokens': 3376,
 'num_output_tokens': 503,
 'generation_time': 30.29361319541931,
 'timestamp': '2024-05-30 16:38:46.675494+00:00'}

In [8]:
display(Markdown(questions))

## Figure 1: Mixture of Experts Layer

**Question:** How does the Mixture of Experts Layer determine its output?

A) It averages the outputs from all expert blocks.
B) It selects the output from the expert block with the highest confidence. 
C) It calculates a weighted sum of outputs from two expert blocks chosen by a router.
D) It passes the input through all expert blocks sequentially and combines the results.

**Correct Answer:** C) It calculates a weighted sum of outputs from two expert blocks chosen by a router.


## Table 1: Model Architecture

**Question:**  What is the context length used in the Mixtral model architecture?

A) 128 tokens
B) 32k tokens
C) 4096 tokens
D) 32000 tokens

**Correct Answer:** B) 32k tokens


## Figure 2: Performance Comparison on Benchmarks

**Question:** What do the results presented in Figure 2 generally indicate about Mixtral's performance compared to Llama 2 models?

A) Mixtral consistently underperforms Llama 2 across all benchmark categories.
B) Mixtral shows comparable or superior performance to Llama 2 70B despite using fewer active parameters.
C) Mixtral's performance is only better than the smallest Llama 2 models, not the larger ones.
D) There is no discernible pattern in the performance comparison between Mixtral and Llama 2.

**Correct Answer:** B) Mixtral shows comparable or superior performance to Llama 2 70B despite using fewer active parameters.


## Figure 4: Long Range Performance of Mixtral

**Question:** What conclusion can be drawn from the right graph in Figure 4, regarding Mixtral's perplexity on the proof-pile dataset?

A) Perplexity remains constant regardless of context length, suggesting limited long-range understanding. 
B) Perplexity fluctuates unpredictably with increasing context length, indicating instability in handling long sequences.
C) Perplexity decreases with longer contexts, demonstrating Mixtral's ability to leverage information from extended sequences.
D) Perplexity increases with longer contexts, highlighting the model's difficulty in processing and understanding large amounts of text.

**Correct Answer:** C) Perplexity decreases with longer contexts, demonstrating Mixtral's ability to leverage information from extended sequences. 


## Analysis of a larger quiz generation

As in the previous example, we next analyze an earlier generation in detail.

In [9]:
mixtral_questions = '## Multiple Choice Quiz Questions based on "Mixtral of Experts"\n\n**Figure 1: Mixture of Experts Layer**\n\n**Question 1:** In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?\n\nA) 1\nB) 2  **[CORRECT]**\nC) 4\nD) 8\n\n**Figure 2: Performance of Mixtral and different Llama models**\n\n**Question 2:**  According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?\n\nA) Comprehension\nB) Knowledge\nC) Code  **[CORRECT]**\nD) AGI Eval\n\n**Table 2: Comparison of Mixtral with Llama**\n\n**Question 3:**  According to Table 2, how many active parameters does Mixtral 8x7B use during inference?\n\nA) 7B\nB) 13B  **[CORRECT]**\nC) 33B\nD) 70B\n\n**Figure 3: Results on MMLU, commonsense reasoning, etc.**\n\n**Question 4:**  Based on Figure 3, in which area does Mixtral **not** consistently outperform Llama 2 70B while using fewer active parameters?\n\nA) Math\nB) Code\nC) Reasoning\nD) Comprehension  **[CORRECT]**\n\n**Table 3: Comparison of Mixtral with Llama 2 70B and GPT-3.5**\n\n**Question 5:** On the MBPP benchmark, what is the performance difference between Mixtral 8x7B and Llama 2 70B?\n\nA) 10.9%  **[CORRECT]**\nB) 8.4%\nC) 4.9%\nD) 2.3%\n\n**Figure 4: Long range performance of Mixtral**\n\n**Question 6:** What does the left part of Figure 4 demonstrate about Mixtral\'s ability on the Passkey task?\n\nA) Mixtral\'s perplexity decreases with longer context.\nB) Mixtral achieves perfect retrieval accuracy regardless of context length.  **[CORRECT]**\nC) Mixtral struggles with retrieving passkeys in long sequences.\nD) Mixtral\'s performance is comparable to other models on this task.\n\n**Figure 5: Bias Benchmarks**\n\n**Question 7:**  Compared to Llama 2 70B, what does Figure 5 suggest about Mixtral\'s performance on bias benchmarks?\n\nA) Mixtral exhibits higher bias.\nB) Mixtral exhibits similar bias.\nC) Mixtral exhibits lower bias.  **[CORRECT]**\nD) The figure does not provide information about bias.\n\n**Figure 6: LMSys Leaderboard**\n\n**Question 8:** As of December 2023, what is the ranking of Mixtral 8x7B Instruct v0.1 on the LMSys Leaderboard among open-weight models? \n\nA) First  **[CORRECT]**\nB) Second\nC) Third\nD) Fourth\n\n**Figure 7: Proportion of tokens assigned to each expert**\n\n**Question 9:** According to Figure 7, is there a clear pattern of expert specialization based on the topic of the text across different layers?\n\nA) Yes, experts clearly specialize in specific domains.\nB) No, there is no clear pattern of specialization.  **[CORRECT]**\nC) The figure does not provide information about expert specialization.\nD) Only the first layer shows a clear pattern of specialization.\n\n**Table 5: Percentage of expert assignment repetitions**\n\n**Question 10:**  What trend does Table 5 reveal about expert assignment repetitions at higher layers (15 and 31) compared to the first layer (0)?\n\nA) Repetitions are significantly lower at higher layers.\nB) Repetitions are similar across all layers.\nC) Repetitions are significantly higher at higher layers.  **[CORRECT]**\nD) The table does not provide information about expert assignment repetitions. \n'
display(Markdown(mixtral_questions))

## Multiple Choice Quiz Questions based on "Mixtral of Experts"

**Figure 1: Mixture of Experts Layer**

**Question 1:** In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?

A) 1
B) 2  **[CORRECT]**
C) 4
D) 8

**Figure 2: Performance of Mixtral and different Llama models**

**Question 2:**  According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?

A) Comprehension
B) Knowledge
C) Code  **[CORRECT]**
D) AGI Eval

**Table 2: Comparison of Mixtral with Llama**

**Question 3:**  According to Table 2, how many active parameters does Mixtral 8x7B use during inference?

A) 7B
B) 13B  **[CORRECT]**
C) 33B
D) 70B

**Figure 3: Results on MMLU, commonsense reasoning, etc.**

**Question 4:**  Based on Figure 3, in which area does Mixtral **not** consistently outperform Llama 2 70B while using fewer active parameters?

A) Math
B) Code
C) Reasoning
D) Comprehension  **[CORRECT]**

**Table 3: Comparison of Mixtral with Llama 2 70B and GPT-3.5**

**Question 5:** On the MBPP benchmark, what is the performance difference between Mixtral 8x7B and Llama 2 70B?

A) 10.9%  **[CORRECT]**
B) 8.4%
C) 4.9%
D) 2.3%

**Figure 4: Long range performance of Mixtral**

**Question 6:** What does the left part of Figure 4 demonstrate about Mixtral's ability on the Passkey task?

A) Mixtral's perplexity decreases with longer context.
B) Mixtral achieves perfect retrieval accuracy regardless of context length.  **[CORRECT]**
C) Mixtral struggles with retrieving passkeys in long sequences.
D) Mixtral's performance is comparable to other models on this task.

**Figure 5: Bias Benchmarks**

**Question 7:**  Compared to Llama 2 70B, what does Figure 5 suggest about Mixtral's performance on bias benchmarks?

A) Mixtral exhibits higher bias.
B) Mixtral exhibits similar bias.
C) Mixtral exhibits lower bias.  **[CORRECT]**
D) The figure does not provide information about bias.

**Figure 6: LMSys Leaderboard**

**Question 8:** As of December 2023, what is the ranking of Mixtral 8x7B Instruct v0.1 on the LMSys Leaderboard among open-weight models? 

A) First  **[CORRECT]**
B) Second
C) Third
D) Fourth

**Figure 7: Proportion of tokens assigned to each expert**

**Question 9:** According to Figure 7, is there a clear pattern of expert specialization based on the topic of the text across different layers?

A) Yes, experts clearly specialize in specific domains.
B) No, there is no clear pattern of specialization.  **[CORRECT]**
C) The figure does not provide information about expert specialization.
D) Only the first layer shows a clear pattern of specialization.

**Table 5: Percentage of expert assignment repetitions**

**Question 10:**  What trend does Table 5 reveal about expert assignment repetitions at higher layers (15 and 31) compared to the first layer (0)?

A) Repetitions are significantly lower at higher layers.
B) Repetitions are similar across all layers.
C) Repetitions are significantly higher at higher layers.  **[CORRECT]**
D) The table does not provide information about expert assignment repetitions. 


Again, the results are remarkably good. We observe the following:
* Every question correctly refers to one figure or table, even in the case where the authors inconsistently label a table as Figure 5.
* All multiple choice options are consistent with the content of the paper.
* All answers marked as correct are accurate.
* The quiz exhibits comprehension of a broad range of diagrams and figures. It also clearly takes into account at least the text accompanying each figure or table (caption).
* In one case, for question 5, the model even performs a subtraction, adding a layer of reasoning beyond simply extracting or reading information.
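
The subtraction behind question 5 can be checked by hand. The sketch below assumes that Table 3 of the paper reports MBPP pass@1 scores of 60.7% for Mixtral 8x7B and 49.8% for Llama 2 70B; these two values are not shown anywhere in this notebook and should be verified against the PDF. `Decimal` is used so that the result is not blurred by floating-point rounding.

```python
from decimal import Decimal

# Assumed MBPP pass@1 scores from Table 3 of the paper (verify against the PDF).
mixtral_mbpp = Decimal("60.7")
llama2_70b_mbpp = Decimal("49.8")

# The quantity the model had to compute to pick option A.
difference = mixtral_mbpp - llama2_70b_mbpp
print(f"MBPP difference: {difference} percentage points")
```

Under that assumption the difference matches option A (10.9), a number the model had to compute rather than read directly from the table.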

We observe the following limitations:
* The format is different from the one chosen previously, so it'd be hard to post-process to extract question options and correct answer, among other aspects of interest.
* The model stops at 10 questions, even though there are more figures and tables in the document. Other runs produce even fewer questions.
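
To illustrate the first limitation, here is a minimal sketch of a parser for the inline `**[CORRECT]**` layout used in this generation. It is a hypothetical helper (`parse_inline_marked_quiz` is not part of the project) and it would not handle the `**Correct Answer: C**` layout of the earlier quiz, so every new layout would need its own parsing rules. The sample string is a copy of the first two questions above.

```python
import re

QUESTION_HEADER = re.compile(r"^\*\*Question \d+:\*\*\s*(.*)$")
# Option line such as "B) 2  **[CORRECT]**": letter, text, optional marker.
OPTION = re.compile(r"^([A-D])\)\s*(.*?)\s*(\*\*\[CORRECT\]\*\*)?\s*$")


def parse_inline_marked_quiz(markdown_quiz: str) -> list:
    """Parse questions whose correct option carries an inline [CORRECT] marker."""
    questions = []
    current = None
    for line in markdown_quiz.splitlines():
        header = QUESTION_HEADER.match(line)
        if header:
            current = {"text": header.group(1).strip(), "choices": [], "correct_answer_idx": None}
            questions.append(current)
            continue
        option = OPTION.match(line)
        if option and current is not None:
            if option.group(3):
                # The marked option is the one about to be appended.
                current["correct_answer_idx"] = len(current["choices"])
            current["choices"].append(option.group(2))
    return questions


sample_quiz = (
    "**Figure 1: Mixture of Experts Layer**\n\n"
    "**Question 1:** In the Mixture of Experts Layer, how many experts are assigned "
    "to each input vector by the router?\n\n"
    "A) 1\nB) 2  **[CORRECT]**\nC) 4\nD) 8\n\n"
    "**Figure 2: Performance of Mixtral and different Llama models**\n\n"
    "**Question 2:**  According to Figure 2, in which area does Mixtral demonstrate a "
    "significantly superior performance compared to Llama 2 70B?\n\n"
    "A) Comprehension\nB) Knowledge\nC) Code  **[CORRECT]**\nD) AGI Eval\n"
)

parsed = parse_inline_marked_quiz(sample_quiz)
for question in parsed:
    print(question["correct_answer_idx"], question["choices"])
```

The dictionary keys (`text`, `choices`, `correct_answer_idx`) were chosen to mirror the structured output shown later in this notebook.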

## Structured generation

To create a dataset automatically, we need a consistent output format that is easy to parse. In other words, we need "structured generation".

In [10]:
from vertexai_utils import generate_structured_questions

structured_questions, structuring_metadata = generate_structured_questions(mixtral_questions)
structured_questions

2024-05-30 18:38:46,737 - DEBUG - Generating structured quiz...


[{'text': 'In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?',
  'correct_answer_idx': 1.0,
  'choices': ['1', '2', '4', '8']},
 {'text': 'According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?',
  'correct_answer_idx': 2.0,
  'choices': ['Comprehension', 'Knowledge', 'Code', 'AGI Eval']},
 {'choices': ['7B', '13B', '33B', '70B'],
  'correct_answer_idx': 1.0,
  'text': 'According to Table 2, how many active parameters does Mixtral 8x7B use during inference?'},
 {'choices': ['Math', 'Code', 'Reasoning', 'Comprehension'],
  'correct_answer_idx': 3.0,
  'text': 'Based on Figure 3, in which area does Mixtral **not** consistently outperform Llama 2 70B while using fewer active parameters?'},
 {'choices': ['10.9%', '8.4%', '4.9%', '2.3%'],
  'correct_answer_idx': 0.0,
  'text': 'On the MBPP benchmark, what is the performance difference between Mixtral 8x7B and Llama 2 70B?'},

We see that the generated questions can be converted into a structured format, which is much more useful downstream.
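
Note that the raw output reports `correct_answer_idx` as a float (`1.0`) and that the key order varies between entries. Before using such records downstream, it may be worth normalizing and validating them. The following is a minimal sketch of such a check; `normalize_question` is a hypothetical helper, independent of whatever validation the project's `models` module performs.

```python
def normalize_question(raw: dict) -> dict:
    """Return a copy with a fixed key order and an integer, in-range answer index."""
    idx = raw["correct_answer_idx"]
    if float(idx) != int(idx):
        raise ValueError(f"correct_answer_idx is not a whole number: {idx!r}")
    idx = int(idx)
    choices = list(raw["choices"])
    if not 0 <= idx < len(choices):
        raise ValueError(f"correct_answer_idx {idx} is out of range for {len(choices)} choices")
    return {"text": raw["text"].strip(), "choices": choices, "correct_answer_idx": idx}


# One record copied from the structured output above.
raw_question = {
    "choices": ["7B", "13B", "33B", "70B"],
    "correct_answer_idx": 1.0,
    "text": "According to Table 2, how many active parameters does Mixtral 8x7B use during inference?",
}

clean_question = normalize_question(raw_question)
print(clean_question["correct_answer_idx"], clean_question["choices"])
```

Rejecting out-of-range or fractional indices early keeps malformed model output from silently entering the dataset.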

In [11]:
structuring_metadata.model_dump()

{'model': 'publishers/google/models/gemini-1.5-pro-001',
 'region': 'europe-west9',
 'num_input_tokens': 982,
 'num_output_tokens': 541,
 'generation_time': 16.000974893569946,
 'timestamp': '2024-05-30 16:39:02.739906+00:00'}

We can now create a full Quiz object.

In [13]:
from models import QuizMetadata, QuizQuestion, QuestionMetadata, Quiz

quiz_metadata = QuizMetadata(creation_metadata=creation_metadata, structuring_metadata=structuring_metadata)
quiz_questions = [
    QuizQuestion(multiple_choice_question=q, metadata=QuestionMetadata(is_validated=False))
    for q in structured_questions
]
quiz = Quiz(article=Article(**online_documents[0]), questions=quiz_questions, metadata=quiz_metadata)

In [14]:
quiz.model_dump()

{'article': {'title': 'Mixtral of Experts',
  'uri': 'https://arxiv.org/pdf/2401.04088'},
 'questions': [{'multiple_choice_question': {'text': 'In the Mixture of Experts Layer, how many experts are assigned to each input vector by the router?',
    'choices': ['1', '2', '4', '8'],
    'correct_answer_idx': 1},
   'metadata': {'is_validated': False,
    'validator': None,
    'explanation': None}},
  {'multiple_choice_question': {'text': 'According to Figure 2, in which area does Mixtral demonstrate a significantly superior performance compared to Llama 2 70B?',
    'choices': ['Comprehension', 'Knowledge', 'Code', 'AGI Eval'],
    'correct_answer_idx': 2},
   'metadata': {'is_validated': False,
    'validator': None,
    'explanation': None}},
  {'multiple_choice_question': {'text': 'According to Table 2, how many active parameters does Mixtral 8x7B use during inference?',
    'choices': ['7B', '13B', '33B', '70B'],
    'correct_answer_idx': 1},
   'metadata': {'is_validated': False,
 