## Installation and Imports
Please follow the installation guide in the [ThoughtSource Readme file](https://github.com/OpenBioLink/ThoughtSource) before using this notebook.

After cloning the repo, install the library in editable mode from your local clone:

In [10]:
!pip install -e ../libs/cot

In [11]:
import os
from cot import Collection
from cot.generate import FRAGMENTS
from pprint import pprint
import json

## Quick intro
The ThoughtSource library offers functionality for: 
* Loading datasets
* Creating random sub-samples
* Generating novel chain-of-thought reasoning data and answers by connecting to external AI services
* Evaluating results

Below is a quick intro to the library, followed by more detailed examples.

In [13]:
# 1) Dataset loading and selecting a random sample
collection = Collection(["worldtree"], verbose=False)
collection = collection.select(split="train", number_samples=10)


# 2) Language Model generates chains of thought and then extracts answers

# os.environ["HUGGINGFACEHUB_API_TOKEN"] = "<token>"   # <--- set token (can be found in your Hugging Face settings page)

config={
    "instruction_keys": ['qa-01'], # "Answer the following question through step-by-step reasoning."
    "cot_trigger_keys": ['kojima-01'], # "Answer: Let's think step by step."
    "answer_extraction_keys": ['kojima-A-D'], # "Therefore, among A through D, the answer is"
    "api_service": "huggingface_hub",
    "engine": "google/flan-t5-xl",
    "warn": False,
    "verbose": False,
}
pprint(collection.generate(config=config))

# 3) Evaluating answers generated by the model
pprint(collection.evaluate())

Loading worldtree...
None
{'accuracy': {'qa-01_kojima-01_kojima-A-D': 0.6}}
None
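
The accuracy key in the output above, `qa-01_kojima-01_kojima-A-D`, appears to join the three prompt fragment keys from the config with underscores. The sketch below reproduces that naming pattern for any combination of keys. `result_keys` is a hypothetical helper written for illustration; it is not part of the ThoughtSource API, and the naming rule is inferred from the printed output only.

```python
from itertools import product


def result_keys(instruction_keys, cot_trigger_keys, answer_extraction_keys):
    """Hypothetical helper: one label per combination of prompt fragment keys."""
    return [
        "_".join(combo)
        for combo in product(instruction_keys, cot_trigger_keys, answer_extraction_keys)
    ]


# Keys taken from the config in the quick intro
keys = result_keys(["qa-01"], ["kojima-01"], ["kojima-A-D"])
print(keys)  # ['qa-01_kojima-01_kojima-A-D']
```

With several keys per list, this pattern yields one label for every combination, which is presumably why the accuracy result is a dictionary rather than a single number.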


## 1. Loading, sampling and saving a dataset

In [17]:
# load a dataset to sample from 
collection_worldtree = Collection(["worldtree"], verbose=False)
print(collection_worldtree)

Loading worldtree...
| Name      |   Train |   Valid |   Test |
|-----------|---------|---------|--------|
| worldtree |    2207 |     496 |   1664 |

Not loaded: ['aqua', 'asdiv', 'commonsense_qa', 'entailment_bank', 'gsm8k', 'mawps', 'med_qa', 'medmc_qa', 'open_book_qa', 'pubmed_qa', 'qed', 'strategy_qa', 'svamp']


In [18]:
# Randomly select 100 rows from train split
collection_worldtree_100 = collection_worldtree.select(split="train", number_samples=100, random_samples=True, seed=0)
collection_worldtree_100


| Name      |   Train | Valid   | Test   |
|-----------|---------|---------|--------|
| worldtree |     100 | -       | -      |

Not loaded: ['aqua', 'asdiv', 'commonsense_qa', 'entailment_bank', 'gsm8k', 'mawps', 'med_qa', 'medmc_qa', 'open_book_qa', 'pubmed_qa', 'qed', 'strategy_qa', 'svamp']
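
To see why passing `seed=0` makes the sample reproducible, here is a minimal standard-library sketch of seeded sub-sampling. It is not ThoughtSource's implementation: `select_rows` is a hypothetical stand-in, and returning the first rows when `random_samples=False` is an assumption made for this example.

```python
import random


def select_rows(rows, number_samples, random_samples=True, seed=0):
    """Hypothetical stand-in for a seeded sub-sample (not the library's code)."""
    if not random_samples:
        # Assumption for this sketch: non-random selection takes the first rows
        return rows[:number_samples]
    rng = random.Random(seed)  # local generator, global random state is untouched
    return rng.sample(rows, number_samples)  # sampling without replacement


rows = list(range(2207))  # stand-in for the 2207 rows of the train split
first = select_rows(rows, 100, seed=0)
second = select_rows(rows, 100, seed=0)
print(first == second)  # True: same seed, same sample
```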

In [16]:
# Note that you can also sample from all datasets in a collection like this:
# collection_train_random_100 = collection.select(split="train", number_samples=100, random_samples=True, seed=0)

# Write to JSON file (will be saved to the same folder as this notebook)
collection_worldtree_100.dump("worldtree_100_dataset.json")

## 2. Generating novel reasoning chains and answers

ThoughtSource comes pre-loaded with a large [collection of text snippets ('prompt fragments')](https://github.com/OpenBioLink/ThoughtSource/blob/main/libs/cot/cot/fragments.json) to elicit chain-of-thought reasoning in large language models and to extract answers from chains-of-thought. Let's see what prompt fragments look like:

In [19]:
# Show first two cot_trigger prompts
print(json.dumps(list(FRAGMENTS["cot_triggers"].items())[:2], sort_keys=True, indent=2)) 

[
  [
    "kojima-01",
    "Answer: Let's think step by step."
  ],
  [
    "kojima-02",
    "Answer: We should think about this step by step."
  ]
]
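
To make the role of these fragments concrete, the sketch below assembles a prompt from an instruction, a question with answer choices, and a chain-of-thought trigger. The fragment texts are the ones quoted in this notebook, and the question is an item from the Worldtree train split. The layout (blank lines, `A)` style option labels) and the `build_cot_prompt` helper are assumptions for illustration; the template ThoughtSource actually uses may differ.

```python
from string import ascii_uppercase


def build_cot_prompt(instruction, question, choices, cot_trigger):
    """Hypothetical prompt layout: instruction, question, lettered options, trigger."""
    options = "\n".join(
        f"{letter}) {choice}" for letter, choice in zip(ascii_uppercase, choices)
    )
    return f"{instruction}\n\n{question}\n{options}\n\n{cot_trigger}"


prompt = build_cot_prompt(
    instruction="Answer the following question through step-by-step reasoning.",  # qa-01
    question="The length of a year is equivalent to the time it takes for one",
    choices=[
        "rotation of Earth",
        "rotation of the Sun",
        "revolution of Earth around the Sun",
        "revolution of the Sun around Earth",
    ],
    cot_trigger="Answer: Let's think step by step.",  # kojima-01
)
print(prompt)
```

In a two-step setup like the one configured in this notebook, the model's reasoning would then be followed by an answer extraction fragment such as "Therefore, among A through D, the answer is" in a second call.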


### Generating chain-of-thought examples

ThoughtSource can connect to external AI service providers such as the [OpenAI API](https://openai.com/api/) or the [Hugging Face Hub](https://huggingface.co/docs/hub/index). Set your token, 'api_service' and 'engine' parameters accordingly. 

In [20]:
# Sample 100 items from the Worldtree v2 dataset
collection = Collection(["worldtree"], verbose=False)
worldtree_100_random = collection.select(split="train", number_samples=100, random_samples=True, seed=0)

# os.environ["HUGGINGFACEHUB_API_TOKEN"] = "<token>"  # <--- SET ACCORDINGLY
# os.environ["OPENAI_API_KEY"] = "<token>"  # <--- SET ACCORDINGLY

# Configuration for calling AI service. 
config={
    "idx_range": "all", # Determines which indices the generate_and_extract routine is applied to, Default: "all" (All items are used)
    "instruction_keys": ["qa-01"], # Determines which instructions are used from fragments.json, Default: None (no instructions are used)
    "cot_trigger_keys": ["kojima-01"], # Determines which cot triggers are used from fragments.json, Default: ["kojima-01"] (only the first trigger is used)
    "answer_extraction_keys": ["kojima-A-D"], # Determines which answer extraction prompts are used from fragments.json, Default: ["kojima-01"] (only the first prompt is used)
    "author" : "your_name", # Name of the person responsible for generation, Default: ""
    "api_service": "mock_api", # Name of the API called ("openai", "huggingface_hub", or a mock for testing: "mock_api"), Default: "huggingface_hub"  # <--- SET ACCORDINGLY
    "engine": "", # Name of the engine used (for "huggingface_hub" use for example "google/flan-t5-xl"), Default: "google/flan-t5-xl"  # <--- SET ACCORDINGLY
    "temperature": 0, # Sampling temperature passed to the model, Default: 0
    "max_tokens": 512, # Maximum length of output generated by the model, Default: 128
    "api_time_interval": 1.0, # Pause between two api calls in seconds, Default: 1.0
    "verbose": False, # Determines whether the progress of the generation is printed, Default: True
    "warn": True, # Determines whether a warning is printed before external APIs are called, Default: True
}

Loading worldtree...


In [21]:
# Generating chains-of-thought and answer extractions (This is in Mock-API mode, not calling model over API)
worldtree_100_random.generate(name="worldtree", config=config)  # If you cannot enter 'y', set "warn" to False in config


        You are about to call an external API in total 200 times, which may produce costs.
        Number API calls for CoT generation: n_samples 100 * n_instruction_keys 1 * n_cot_trigger_keys 1
        Number API calls for answer extraction: n_samples 100 * n_instruction_keys 1 * n_cot_trigger_keys 1 * n_answer_extraction_keys 1
        Do you want to continue? y/n
         Note: You are using a mock api. When entering 'y', a test run without API calls is made.
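
The call counts in the warning above follow directly from the two formulas it prints. The short sketch below reproduces that arithmetic so you can estimate the number of calls before running a larger configuration; `count_api_calls` is a hypothetical helper, not a ThoughtSource function.

```python
def count_api_calls(n_samples, n_instruction_keys, n_cot_trigger_keys, n_answer_extraction_keys):
    """Hypothetical helper mirroring the formulas printed in the warning."""
    # One CoT generation call per sample, instruction, and trigger combination
    cot_calls = n_samples * n_instruction_keys * n_cot_trigger_keys
    # One answer extraction call per generated CoT and extraction prompt
    extraction_calls = cot_calls * n_answer_extraction_keys
    return cot_calls, extraction_calls, cot_calls + extraction_calls


cot_calls, extraction_calls, total = count_api_calls(100, 1, 1, 1)
print(cot_calls, extraction_calls, total)  # 100 100 200
```

Adding a second trigger and a second extraction prompt to the same 100 samples would give 200 generation calls and 400 extraction calls, so the total grows multiplicatively with each added key.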


In [22]:
# If you did not change the config above, the above was a fake call to the mock API -- Now loading a prepared dataset with real model answers.
collection = Collection.from_json("worldtree_100_generate.json")

#### Display a question, answer choices and gold-standard answer

In [23]:
# Extract from prepared dataset
pprint("Question: "+ collection["worldtree"]["train"][1]["question"])
pprint("Answer Options:")
pprint(collection["worldtree"]["train"][1]["choices"])
pprint("Answer: "+ "".join(collection["worldtree"]["train"][1]["answer"]))

'Question: The length of a year is equivalent to the time it takes for one'
'Answer Options:'
['rotation of Earth',
 'rotation of the Sun',
 'revolution of Earth around the Sun',
 'revolution of the Sun around Earth']
'Answer: revolution of Earth around the Sun'


#### Display model-generated chain-of-thought and extracted answer

In [24]:
pprint(worldtree_100_random["worldtree"]["train"][1]["generated_cot"][0]["cot"])
pprint(worldtree_100_random["worldtree"]["train"][1]["generated_cot"][0]['answers'][0]['answer'])

' Test mock chain of thought.'
' Test mock chain of thought.'


The output above comes from the mock API run, so it shows placeholder text rather than a real model answer. To evaluate model answers automatically, ThoughtSource has a built-in evaluate function.

## 3. Evaluating model answers

In [25]:
# Loading collection with model answers
collection = Collection.from_json("worldtree_100_generate.json")

In [26]:
# Note that before the evaluation function is run, the 'correct_answer' boolean field is not set.
collection["worldtree"]["train"][0]['generated_cot'][0]["answers"][0]['correct_answer']

In [27]:
# Now, let's evaluate the answers, set the 'correct_answer' field by comparing to gold-standard answers, and calculate the accuracy of the model predictions
collection.evaluate("worldtree","train")

  0%|          | 0/100 [00:00<?, ?ex/s]

{'accuracy': {'qa-01_kojima-01_kojima-A-D': 0.86}}
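
Conceptually, the accuracy reported above is the fraction of items whose `correct_answer` flag is true. The sketch below computes that fraction over plain dictionaries shaped like the access path used in this notebook (`item['generated_cot'][0]['answers'][0]['correct_answer']`). It is an illustration only, not the library's `evaluate` implementation, and the toy items are constructed here to mirror 86 correct answers out of 100.

```python
def accuracy(items):
    """Illustrative accuracy: share of items whose first answer is marked correct."""
    flags = [
        item["generated_cot"][0]["answers"][0]["correct_answer"] for item in items
    ]
    return sum(flags) / len(flags)


# Toy data for illustration: 86 correct and 14 incorrect answers
toy_items = [
    {"generated_cot": [{"answers": [{"correct_answer": flag}]}]}
    for flag in [True] * 86 + [False] * 14
]
print(accuracy(toy_items))  # 0.86
```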


In [7]:
# Now the 'correct_answer' fields are set
collection["worldtree"]["train"][0]['generated_cot'][0]["answers"][0]['correct_answer']

True

In [8]:
# Save the file that now also includes data in the 'correct_answer' fields 
collection.dump("worldtree_100_evaluate.json")

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]