# Measuring Flan-T5 performance on MMLU Data where CoT outperforms Direct Prompting

In [20]:
import datasets
import torch
import re
import csv
import json

import pandas as pd
import numpy as np

from tqdm import tqdm
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [12]:
def read_data(filename):
    """Read a CSV file and return its rows as a list of lists of strings."""
    # newline='' lets the csv module handle quoted fields that contain line breaks
    with open(filename, newline='') as csv_file:
        return list(csv.reader(csv_file, delimiter=','))

In [25]:
business_ethics_val = read_data('data/mmlu/val/business_ethics_val.csv')

In [27]:
business_ethics_val[0]

['Disqualification of directors may result from breaches under the',
 'Sale of Goods Act 1979',
 'Financial Services Act 1986',
 'Companies Act 2006 and Insolvency Act 1986',
 'Health and Safety at Work Act 1974',
 'C']
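
Each row printed above has six fields: the question, four answer options, and the gold answer letter. A small check along the following lines could catch malformed rows before any evaluation is run. This is only a sketch; `validate_rows` is a hypothetical helper that does not appear in the notebook, and the six-field layout is assumed from the printed rows.

```python
def validate_rows(rows):
    """Return the indices of rows that do not match the assumed MMLU layout.

    Assumed layout (taken from the rows printed in this notebook):
    question, four options, and an answer letter in A-D.
    """
    bad = []
    for i, row in enumerate(rows):
        if len(row) != 6 or row[5] not in {"A", "B", "C", "D"}:
            bad.append(i)
    return bad


sample = [
    ["Disqualification of directors may result from breaches under the",
     "Sale of Goods Act 1979",
     "Financial Services Act 1986",
     "Companies Act 2006 and Insolvency Act 1986",
     "Health and Safety at Work Act 1974",
     "C"],
    # Deliberately malformed toy row: one option is missing.
    ["A row with a missing option", "x", "y", "z", "A"],
]
print(validate_rows(sample))
```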

In [17]:
abstract_algebra_test = read_data('data/mmlu/test/abstract_algebra_test.csv')

In [18]:
abstract_algebra_test

[['Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.',
  '0',
  '4',
  '2',
  '6',
  'B'],
 ['Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.',
  '8',
  '2',
  '24',
  '120',
  'C'],
 ['Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5',
  '0',
  '1',
  '0,1',
  '0,4',
  'D'],
 ['Statement 1 | A factor group of a non-Abelian group is non-Abelian. Statement 2 | If K is a normal subgroup of H and H is a normal subgroup of G, then K is a normal subgroup of G.',
  'True, True',
  'False, False',
  'True, False',
  'False, True',
  'B'],
 ['Find the product of the given polynomials in the given polynomial ring. f(x) = 4x - 5, g(x) = 2x^2 - 4x + 2 in Z_8[x].',
  '2x^2 + 5',
  '6x^2 + 4x + 6',
  '0',
  'x^2 + 1',
  'B'],
 ['Statement 1 | If a group has an element of order 15 it must have at least 8 elements of order 15. Statement 2 | If a group has more than 8 e

In [21]:
# Load the few-shot chain-of-thought prompts, keyed by MMLU subject name
with open('lib_prompt/mmlu-cot.json') as f:
    mmlu_cot = json.load(f)

In [28]:
print(mmlu_cot['business_ethics'])

The following are multiple choice questions (with answers) about business ethics.

Q: In contrast to _______, _______ aim to reward favourable behaviour by companies. The success of such campaigns have been heightened through the use of ___________, which allow campaigns to facilitate the company in achieving _________ .
(A) Buycotts, Boycotts, Blockchain technology, Charitable donations (B) Buycotts, Boycotts, Digital technology, Increased Sales (C) Boycotts, Buyalls, Blockchain technology, Charitable donations (D) Boycotts, Buycotts, Digital technology, Increased Sales
A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The sentence that best uses the possible options above is “In contrast to *boycotts*, *buycotts* aim to reward favourable behavior by companies. The success of such campaigns have been heightened through the use of *digital technology*, which allow campaigns to facilitate the company in achieving *increased sales*.” The answer is (

In [29]:
computer_security_val = read_data('data/mmlu/val/computer_security_val.csv')

In [30]:
computer_security_val[0]

['What is penetration testing?',
 'A procedure for testing libraries or other program components for vulnerabilities',
 'Whole-system testing for security flaws and bugs',
 'A security-minded form of unit testing that applies early in the development process',
 'All of the above',
 'B']
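
A validation row like the one above can be rendered in the same `Q: ... (A) ... A: Let's think step by step.` layout that the few-shot prompts in `mmlu_cot` use, and then appended to the prompt for its subject. The notebook does not show how prompts are assembled, so the sketch below is an assumption: `format_question` and `build_cot_prompt` are hypothetical helpers, and the blank-line separator and trailing cue are inferred from the printed prompts.

```python
LETTERS = ["A", "B", "C", "D"]


def format_question(row):
    """Render one CSV row as a question in the few-shot prompt layout."""
    question, options = row[0], row[1:5]
    choices = " ".join(f"({letter}) {option}" for letter, option in zip(LETTERS, options))
    return f"Q: {question}\n{choices}\nA: Let's think step by step."


def build_cot_prompt(few_shot_prompt, row):
    """Append the formatted question to a subject's few-shot prompt."""
    return few_shot_prompt.rstrip() + "\n\n" + format_question(row)


row = [
    "What is penetration testing?",
    "A procedure for testing libraries or other program components for vulnerabilities",
    "Whole-system testing for security flaws and bugs",
    "A security-minded form of unit testing that applies early in the development process",
    "All of the above",
    "B",
]
print(format_question(row))
```

In the notebook this would be called as `build_cot_prompt(mmlu_cot['computer_security'], computer_security_val[0])`. The gold letter in `row[5]` is intentionally left out of the prompt.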

In [31]:
print(mmlu_cot['computer_security'])

The following are multiple choice questions (with answers) about computer security.

Q: SHA-1 has a message digest of
(A) 160 bits (B) 512 bits (C) 628 bits (D) 820 bits
A: Let's think step by step. Since SHA-1 is a hash function which takes an input and produces a 160-bit (20-byte) hash value, its message digest is 160 bits. The answer is (A).

Q: _____________ can modify data on your system – so that your system doesn’t run correctly or you can no longer access specific data, or it may even ask for ransom in order to give your access.
(A) IM – Trojans (B) Backdoor Trojans (C) Trojan-Downloader (D) Ransom Trojan
A: Let's think step by step. The system is asking for trojans, which are for ransom, which means ransom trojan. The answer is (D).

Q: What is ethical hacking?
(A) "Hacking" ethics so they justify unintended selfish behavior (B) Hacking systems (e.g., during penetration testing) to expose vulnerabilities so they can be fixed, rather than exploited (C) Hacking into systems run 
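
Every exemplar in these prompts ends with `The answer is (X).`, so a model that follows the pattern should finish its generation the same way. One possible way to recover the predicted letter is a regular expression over the generated text. The sketch below is an assumption about how scoring might be done; `extract_answer` is a hypothetical helper, and taking the last match is a design choice, not something the notebook specifies.

```python
import re

# Matches "The answer is (A)" and "The answer is A", but not "The answer is Definitely".
ANSWER_RE = re.compile(r"The answer is \(?([A-D])\b")


def extract_answer(generation):
    """Return the last answer letter found in the text, or None if there is none."""
    matches = ANSWER_RE.findall(generation)
    return matches[-1] if matches else None


print(extract_answer(
    "Since SHA-1 is a hash function which takes an input and produces a "
    "160-bit (20-byte) hash value, its message digest is 160 bits. The answer is (A)."
))
```

Using the last match guards against generations that restate an earlier exemplar before answering; returning `None` lets unparseable outputs be counted as wrong rather than raising.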

In [32]:
high_school_chemistry_val = read_data('data/mmlu/val/high_school_chemistry_val.csv')

In [33]:
high_school_chemistry_val[0]

['Consider the Lewis structures for the following molecules: CO2, CO32-, NO2-, and NO3-. Which molecule would have the smallest bond angle between terminal atoms?',
 'CO2',
 'CO32-',
 'NO2-',
 'NO3-',
 'C']

In [34]:
print(mmlu_cot['high_school_chemistry'])

The following are multiple choice questions (with answers) about high school chemistry.

Q: Which of the following is considered an acid anhydride?
(A) HCl (B) H2SO3 (C) SO2 (D) Al(NO3)3
A: Let's think step by step. An acid anhydride is a compound that is derived by removing water from an acid. The chemical formula for water is H2O, which means that we need to determine which of these options, when combined with H2O, forms an acid. SO2, or Sulfur dioxide, when combined with H2O, makes H2SO4, or sulfuric acid. The answer is (C).

Q: Which of the following is expected to be a polar molecule?
(A) PCl4F (B) BF3 (C) CO2 (D) Si(CH3)4
A: Let's think step by step. A polar molecule is one that has a slightly positive charge on one end of the molecule and a slightly negative charge on the other end. Boron trifluoride (BF3) has Boron as the center atom and three fluorine atoms attached to it; it is trigonal planar and symmetric, so it is nonpolar. Carbon Dioxide (CO2) has Carbon as the central at
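
The printed prompts consist of a header line followed by `Q:`/`A:` exemplars separated by blank lines. If the full prompt turns out to be too long for the model's input budget, one option is to keep only the first few exemplars. The sketch below assumes the blank-line separator seen in the printed output; `split_exemplars` and `truncate_prompt` are hypothetical helpers, and the toy prompt is made up for illustration.

```python
def split_exemplars(prompt):
    """Split a few-shot prompt into (header, [exemplar, ...]).

    Assumes exemplars are separated by a blank line and start with "Q: ".
    """
    header, *exemplars = prompt.strip().split("\n\nQ: ")
    return header, ["Q: " + exemplar for exemplar in exemplars]


def truncate_prompt(prompt, n_shots):
    """Keep the header and only the first n_shots exemplars."""
    header, exemplars = split_exemplars(prompt)
    return "\n\n".join([header] + exemplars[:n_shots])


# Toy prompt in the same layout as the printed ones (content is illustrative only).
toy_prompt = (
    "The following are multiple choice questions (with answers) about a toy subject.\n\n"
    "Q: First question?\n(A) a (B) b (C) c (D) d\n"
    "A: Let's think step by step. The answer is (C).\n\n"
    "Q: Second question?\n(A) a (B) b (C) c (D) d\n"
    "A: Let's think step by step. The answer is (A)."
)
print(truncate_prompt(toy_prompt, 1))
```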

In [35]:
high_school_macroeconomics_val = read_data('data/mmlu/val/high_school_macroeconomics_val.csv')

In [36]:
high_school_macroeconomics_val[0]

['Which of the following would lead to an expansion of the money supply?',
 'The FED raises the discount rate.',
 'The FED buys government securities in the secondary market.',
 'The federal government deficit-spends.',
 'The FED raises reserve requirements.',
 'B']
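
The cells above repeat the same `read_data` call with a path of the form `data/mmlu/val/<subject>_val.csv`. A loop over subject names could load them all into one dictionary instead. This is a sketch under that path assumption; `load_subjects` is a hypothetical helper and its `split` and `data_dir` parameters are illustrative defaults.

```python
import csv
import os


def load_subjects(subjects, split="val", data_dir="data/mmlu"):
    """Read <data_dir>/<split>/<subject>_<split>.csv for each subject.

    Returns a dict mapping subject name to its list of rows.
    """
    data = {}
    for subject in subjects:
        path = os.path.join(data_dir, split, f"{subject}_{split}.csv")
        with open(path, newline="") as csv_file:
            data[subject] = list(csv.reader(csv_file))
    return data


# Example call, assuming the files used in this notebook exist on disk:
# val = load_subjects(["business_ethics", "computer_security", "high_school_chemistry"])
# val["business_ethics"][0]
```

Keying the rows by subject name also lines them up with the keys of `mmlu_cot`, so a prompt and its questions can be looked up with the same string.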

In [37]:
print(mmlu_cot['high_school_macroeconomics'])

The following are multiple choice questions (with answers) about high school macroeconomics.

Q: Which of the following policies best describes supply-side fiscal policy?
(A) An increase in the money supply (B) Increased government spending (C) Lower taxes on research and development of new technology (D) Higher taxes on household income
A: Let's think step by step. We refer to Wikipedia articles on macroeconomics for help. Supply-side fiscal policy stimulates the economy by encouraging more production of goods and services through reduction in taxes and deregulation. The answer is (C).

Q: The short-run Phillips curve indicates a
(A) direct relation between unemployment and inflation (B) direct relation between price and quantity demanded (C) inverse relation between price and quantity demanded (D) inverse relation between unemployment and inflation
A: Let's think step by step. We refer to Wikipedia articles on macroeconomics for help. The short-run Phillips curve shows that whenever 

In [38]:
high_school_geography_val = read_data('data/mmlu/val/high_school_geography_val.csv')

In [39]:
high_school_geography_val[0]

['Which of the following situations does NOT occur in a federal state?',
 'Central government possesses a two-level system of government.',
 'Central government governs country as a single unit.',
 'It often possesses a written constitution.',
 'Lower-level divisions have unique powers.',
 'B']

In [40]:
print(mmlu_cot['high_school_geography'])

The following are multiple choice questions (with answers) about high school geography.

Q: Which one of the following items is an example of nonmaterial culture?
(A) Dove soap (B) Dove candy bar (C) Dove symbol (D) A dove (bird).
A: Let's think step by step. We refer to Wikipedia articles on geography for help. Nonmaterial culture consists of cultural ideas, beliefs or symbols that are not physical objects. The answer is (C).

Q: During the third stage of the demographic transition model, which of the following is true?
(A) Birth rates increase and population growth rate is less rapid. (B) Birth rates decline and population growth rate is less rapid. (C) Birth rates increase and population growth rate increases. (D) Birth rates decrease and population growth rate increases.
A: Let's think step by step. We refer to Wikipedia articles on geography for help. The demographic transition model models the five different stages of population growth as a country goes through economic developme

In [41]:
human_aging_val = read_data('data/mmlu/val/human_aging_val.csv')

In [42]:
human_aging_val[0]

['Comparisons of gay and lesbian couples with heterosexual married couples show that gay and lesbian couples',
 'Are far less satisfied with their relationship',
 'Show many of the same characteristics',
 'Always attempt to conceal their relationship',
 'Are usually more satisfied with their relationship',
 'B']

In [43]:
print(mmlu_cot['human_aging'])

The following are multiple choice questions (with answers) about human aging.

Q: All other things being equal, which of the following persons is more likely to show osteoporosis?
(A) An older Hispanic American woman (B) An older African American woman (C) An older Asian American woman (D) An older Native American woman
A: Let's think step by step. We refer to Wikipedia articles on human aging for help. Although osteoporosis can occur at any age, the risk is higher for older people. It is most common in Asian and non-Hispanic white women. The answer is (C).

Q: The finding that adults tend to remember events from their adolescence better than from other periods in their lives is referred to as the
(A) Adolescence advantage (B) Reminiscence bump (C) Memorial memorial (D) Quadratic retrieval spike
A: Let's think step by step. We refer to Wikipedia articles on human aging for help. Reminiscence bump is a phenomenon that older adults tend to recollect events during their young ages. People

In [45]:
philosophy_val = read_data('data/mmlu/val/philosophy_val.csv')

In [46]:
philosophy_val[0]

['One of the aims of philosophy is to think critically about whether there are good reasons for adopting our beliefs.  Reasons are considered "good reasons" if they are consistent with everyday experience and:',
 'are part of a set of religious, moral, or political beliefs that an individual feels deeply about.',
 'are considered good by at least one culture, sub-culture, or individual.',
 'cannot be interpreted in different ways by different people or cultures.',
 'take into account objections, are acceptable to impartial third parties, and avoid undesirable consequences.',
 'D']
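
Since the last field of each row is the gold answer letter, per-subject accuracy can be computed by comparing it with the letter predicted for that row. The sketch below shows one way this might look; `accuracy` is a hypothetical helper, and the predictions in the example are made up to exercise the function, not model outputs.

```python
def accuracy(predictions, rows):
    """Fraction of rows whose gold letter (last field) equals the prediction.

    A prediction of None (e.g. no answer could be parsed) counts as wrong.
    """
    if not rows:
        return 0.0
    correct = sum(pred == row[-1] for pred, row in zip(predictions, rows))
    return correct / len(rows)


# Toy rows in the notebook's layout; the predictions are illustrative only.
toy_rows = [
    ["Question 1", "a", "b", "c", "d", "D"],
    ["Question 2", "a", "b", "c", "d", "B"],
]
print(accuracy(["D", None], toy_rows))
```

Running the same function on predictions from a direct prompt and from a CoT prompt would give the two numbers being compared in this notebook.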

In [47]:
print(mmlu_cot['philosophy'])

The following are multiple choice questions (with answers) about philosophy.

Q: The study of reality in the broadest sense, an inquiry into the elemental nature of the universe and the things in it, is known as _____.
(A) metaphysics (B) epistemology (C) quantum physics (D) axiology
A: Let's think step by step. We refer to Wikipedia articles on philosophy for help. Among the options, only metaphysics studies the nature of reality and existence. The answer is (A).

Q: According to Moore’s “ideal utilitarianism,” the right action is the one that brings about the greatest amount of:
(A) pleasure. (B) happiness. (C) good. (D) virtue.
A: Let's think step by step. We refer to Wikipedia articles on philosophy for help. Moore's "ideal utilitarianism" states that one's actions should maximize intrinsic goods. The answer is (C).

Q: Before Tolstoy's Christian conversion, what was his perspective on the meaning of life?
(A) optimist (B) satisfied (C) nominally religious (D) pessimist
A: Let's th

In [None]:
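# The variable name now matches the subject being loaded, and the path matches the geography file read earlier
high_school_geography_val = read_data('data/mmlu/val/high_school_geography_val.csv')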