## Installation and Imports
Please follow the **installation guide** in the [ThoughtSource Readme file](https://github.com/OpenBioLink/ThoughtSource) before using this notebook.

In [10]:
# Only execute this if you are using the notebook in Google Colab:
# !pip install -e ../libs/cot

In [1]:
import os
from cot import Collection
from cot.generate import FRAGMENTS
from rich.pretty import pprint
import json

## Quick intro
The ThoughtSource library offers functionality for: 
* Loading datasets
* Creating random sub-samples
* Generating novel chain-of-thought reasoning data and answers by connecting to external AI services
* Evaluating results

Below we give a quick intro to the library, followed by more detailed examples.

To use external APIs you need an API key. This tutorial uses the [Hugging Face API](https://huggingface.co/), which is free. Set the environment variable `HUGGINGFACEHUB_API_TOKEN` to your API token, which you can find on your Hugging Face settings page. For now, you can set the environment variable as follows:

In [4]:
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = "<token>"   # <--- set token (can be found in your Hugging Face settings page)
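A token that is hard-coded in a notebook is easy to leak when the notebook is shared. As a minimal sketch using only the Python standard library, the helper below reads the variable and fails with a clear message if it is missing. The name `require_env` is hypothetical and not part of ThoughtSource.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Usage (uncomment once the token has been set in your shell or notebook):
# token = require_env("HUGGINGFACEHUB_API_TOKEN")
```

Calling the helper once at the top of the notebook surfaces a missing token before any generation step starts.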

In [11]:
# 1) Dataset loading and selecting a random sample
collection = Collection(["worldtree"], verbose=False)
collection = collection.select(split="train", number_samples=1)

# 2) Language Model generates chains of thought and then extracts answers

config={
    "instruction_keys": ['qa-01'], # "Answer the following question through step-by-step reasoning."
    "cot_trigger_keys": ['kojima-01'], # "Answer: Let's think step by step."
    "answer_extraction_keys": ['kojima-A-D'], # "Therefore, among A through D, the answer is"
    "api_service": "huggingface_hub",
    "engine": "google/flan-t5-xl",
    "warn": False,
    "verbose": False,
}
collection.generate(config=config)

# 3) Evaluating answers generated by the model
print(collection.evaluate())

# 4) Saving the generated outputs and evaluation results
# collection.dump("worldtree_10.json")

Evaluating worldtree train...



{'worldtree': {'train': {'accuracy': {'google/flan-t5-xl': {'qa-01_kojima-01_kojima-A-D': 0.6}}}}}



## 1. Loading, sampling and saving a dataset

In [16]:
# load a dataset to sample from 
worldtree = Collection(["worldtree"], verbose=False)
print(worldtree)

Loading worldtree...
| Name      |   Train |   Valid |   Test |
|-----------|---------|---------|--------|
| worldtree |    2207 |     496 |   1664 |

Not loaded: ['aqua', 'asdiv', 'commonsense_qa', 'entailment_bank', 'gsm8k', 'mawps', 'med_qa', 'medmc_qa', 'open_book_qa', 'pubmed_qa', 'qed', 'strategy_qa', 'svamp']


In [14]:
# Randomly select 10 rows from the train split
worldtree_10 = worldtree.select(split="train", number_samples=10, random_samples=True, seed=0)
worldtree_10

| Name      |   Train | Valid   | Test   |
|-----------|---------|---------|--------|
| worldtree |      10 | -       | -      |

Not loaded: ['aqua', 'asdiv', 'commonsense_qa', 'entailment_bank', 'gsm8k', 'mawps', 'med_qa', 'medmc_qa', 'open_book_qa', 'pubmed_qa', 'qed', 'strategy_qa', 'svamp']
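The `seed` argument is what makes a random sub-sample reproducible. The sketch below illustrates the general idea with the standard library only; it is not ThoughtSource's implementation, and the function name `sample_indices` is hypothetical. The row count 2207 is the size of the worldtree train split shown above.

```python
import random


def sample_indices(n_rows, number_samples, seed=0):
    """Pick `number_samples` distinct row indices, reproducibly for a given seed."""
    rng = random.Random(seed)  # a private generator, so global random state is untouched
    return sorted(rng.sample(range(n_rows), number_samples))


first = sample_indices(2207, 10, seed=0)
second = sample_indices(2207, 10, seed=0)
print(first == second)  # the same seed selects the same rows
```

Using a different seed generally yields a different sub-sample, which is useful when you want several independent samples of the same size.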

In [19]:
# Note that you could also sample from multiple datasets into one collection like this:
collection_medical = Collection(["med_qa", "medmc_qa", "pubmed_qa"], verbose=False)
collection_medical_100 = collection_medical.select(split="train", number_samples=100)
collection_medical_100

Loading med_qa...
Loading medmc_qa...
Loading pubmed_qa...


| Name      |   Train | Valid   | Test   |
|-----------|---------|---------|--------|
| med_qa    |     100 | -       | -      |
| medmc_qa  |     100 | -       | -      |
| pubmed_qa |     100 | -       | -      |

Not loaded: ['aqua', 'asdiv', 'commonsense_qa', 'entailment_bank', 'gsm8k', 'mawps', 'open_book_qa', 'qed', 'strategy_qa', 'svamp', 'worldtree']

## 2. Generating novel reasoning chains and answers

ThoughtSource comes pre-loaded with a large [collection of text snippets ('prompt fragments')](https://github.com/OpenBioLink/ThoughtSource/blob/main/libs/cot/cot/fragments.json) that elicit chain-of-thought reasoning in large language models and extract answers from chains of thought. Let's see what prompt fragments look like:

In [23]:
# Chain of thought prompts
pprint(list(FRAGMENTS["cot_triggers"].items())[:5])

In [24]:
# Answer extraction prompts
pprint(list(FRAGMENTS["answer_extractions"].items())[2:6])
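To make the role of the fragments concrete, here is a rough sketch of how an instruction, a question, and a chain-of-thought trigger could be combined into a prompt, and how an answer-extraction fragment could be appended afterwards. The fragment texts are the ones quoted in the config comments of this notebook and the question is taken from the prepared dataset. The layout, the `A)` lettering of the choices, and the helper names are assumptions made for illustration and may differ from the templates ThoughtSource actually uses.

```python
import string

# Fragment texts as quoted in the config comments of this notebook.
instruction = "Answer the following question through step-by-step reasoning."
cot_trigger = "Answer: Let's think step by step."
answer_extraction = "Therefore, among A through D, the answer is"

question = (
    "Animals may fight, make threatening sounds, and act aggressively toward "
    "members of the same species. These behaviors usually occur as the result of"
)
choices = ["competition", "conservation", "decomposition", "pollution"]


def format_choices(choices):
    """Render answer choices as 'A) ...' lines (assumed layout)."""
    return "\n".join(
        f"{letter}) {text}" for letter, text in zip(string.ascii_uppercase, choices)
    )


def build_cot_prompt(instruction, question, choices, cot_trigger):
    """Step 1: the prompt that asks the model for a chain of thought."""
    return f"{instruction}\n\n{question}\n{format_choices(choices)}\n\n{cot_trigger}"


def build_extraction_prompt(cot_prompt, cot, answer_extraction):
    """Step 2: append the generated chain of thought and the extraction fragment."""
    return f"{cot_prompt} {cot}\n{answer_extraction}"


prompt = build_cot_prompt(instruction, question, choices, cot_trigger)
print(prompt)
```

Splitting generation into two steps is why one sample costs two model calls: one for the chain of thought and one for the answer extraction.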

### Generating chain-of-thought examples

ThoughtSource can connect to external AI service providers such as the [OpenAI API](https://openai.com/api/) or the [Hugging Face Hub](https://huggingface.co/docs/hub/index). Set your token, 'api_service' and 'engine' parameters accordingly. 

In [2]:
from cot.config import Config as config_overview
print('\033[94m' + config_overview.__doc__[48:])

    "instruction_keys": list(str) - Determines which instruction_keys are used from fragments.json,
        the corresponding string will be inserted under "instruction" in the fragments. Default: [None] (No instruction)
    "cot_trigger_keys": list(str) - Determines which cot triggers are used from fragments.json,
        the corresponding string will be inserted under "cot_trigger" in the fragments. Default: ["kojima-01"]
    "answer_extraction_keys": list(str) - Determines which answer extraction prompts are used from fragments.json,
        the corresponding string will be inserted under "answer" in the fragments. Default: ["kojima-01"]
    "template_cot_generation": string - is the model input in the text generation step, variables in brackets.
        Only variables of this list are allowed: "instruction", "question", "answer_choices", "cot_trigger"
        Default: {instruction}

{question}
{answer_choices}

{cot_trigger}{cot}
{answer_extraction}
    "template_answer_extra

In [23]:
# Sample 10 items from the Worldtree v2 dataset
collection = Collection(["worldtree"], verbose=False)
worldtree_10 = collection.select(split="train", number_samples=10)

# os.environ["HUGGINGFACEHUB_API_TOKEN"] = "<token>"  # <--- SET ACCORDINGLY
# os.environ["OPENAI_API_KEY"] = "<token>"  # <--- SET ACCORDINGLY

# Configuration for calling AI service. 
config={
    "instruction_keys": ['qa-01'], # "Answer the following question through step-by-step reasoning."
    "cot_trigger_keys": ['kojima-01'], # "Answer: Let's think step by step."
    "answer_extraction_keys": ['kojima-A-D'], # "Therefore, among A through D, the answer is"
    "author" : "your_name",
    "api_service": "mock_api", # <--- SET ACCORDINGLY
    "engine": "", # <--- SET ACCORDINGLY
    "temperature": 0,
    "max_tokens": 512,
    "verbose": False,
    "warn": True,
}

Loading worldtree...


In [24]:
# Generate chains of thought and answer extractions (mock-API mode: no model is called over the API)
worldtree_10.generate(config=config)  # if you cannot enter 'y', set "warn" to False in the config


        You are about to call an external API in total 20 times, which may produce costs.
        Number API calls for CoT generation: n_samples 10 * n_instruction_keys 1 * n_cot_trigger_keys 1
        Number API calls for answer extraction: n_samples 10 * n_instruction_keys 1 * n_cot_trigger_keys 1 * n_answer_extraction_keys 1
        Do you want to continue? y/n
        Note: You are using a mock api. When entering 'y', a test run without API calls is made.


Loading cached processed dataset at /tmp/tmpa3b68m9k/cache-37ebdcd9e87a1613.arrow
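The confirmation message states how the number of API calls is derived from the sample size and the number of keys in the config. The sketch below reproduces that arithmetic so you can estimate costs before calling `generate`. The function `count_api_calls` is a hypothetical helper written for this tutorial, not part of ThoughtSource.

```python
def count_api_calls(n_samples, config):
    """Estimate API calls using the formula printed in the confirmation message."""
    n_instruction = len(config["instruction_keys"])
    n_trigger = len(config["cot_trigger_keys"])
    n_extraction = len(config["answer_extraction_keys"])

    # One call per sample and per (instruction, trigger) combination.
    cot_calls = n_samples * n_instruction * n_trigger
    # Each generated chain of thought is queried once per extraction prompt.
    extraction_calls = cot_calls * n_extraction
    return cot_calls, extraction_calls, cot_calls + extraction_calls


example_config = {
    "instruction_keys": ["qa-01"],
    "cot_trigger_keys": ["kojima-01"],
    "answer_extraction_keys": ["kojima-A-D"],
}
print(count_api_calls(10, example_config))  # (10, 10, 20)
```

Because the counts multiply, adding a second trigger and a second extraction prompt to the same 10 samples would already require 20 generation calls and 40 extraction calls.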


The `generate` call above used the mock API, so no real model was queried.
For the **purpose of the tutorial**, we now load a prepared dataset with real model answers:

In [25]:
worldtree_10 = Collection.from_json("worldtree_10.json")

#### Display a question, answer choices and gold-standard answer

In [92]:
# Extract from prepared dataset
from pprint import pprint
pprint("Question: "+ worldtree_10["worldtree"]["train"][1]["question"])
pprint("Answer Options:")
pprint(worldtree_10["worldtree"]["train"][1]["choices"])
pprint("Answer: "+ "".join(worldtree_10["worldtree"]["train"][1]["answer"]))

('Question: Animals may fight, make threatening sounds, and act aggressively '
 'toward members of the same species. These behaviors usually occur as the '
 'result of')
'Answer Options:'
['competition', 'conservation', 'decomposition', 'pollution']
'Answer: competition'


#### Display model-generated chain-of-thought and extracted answer

In [27]:
pprint(worldtree_10["worldtree"]["train"][1]["generated_cot"][0]["cot"])
pprint(worldtree_10["worldtree"]["train"][1]["generated_cot"][0]['answers'][0]['answer'])

('Aggressive behaviors are often the result of competition. Competition is the '
 'result of animals fighting, making threatening sounds, and acting '
 'aggressively toward members of the same species. So, the final answer is A.')
'A.'


The answer generated by the model was correct! To evaluate model answers automatically, ThoughtSource has a built-in `evaluate` function.
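Note that the model returned the letter `'A.'` while the gold-standard answer is the text `'competition'`, so any comparison first has to map one form onto the other. The following sketch shows one simple way to do that for single-letter answers. It is an illustration only, not ThoughtSource's evaluation logic, and `letter_to_choice` is a hypothetical helper.

```python
import string


def letter_to_choice(model_answer, choices):
    """Map an answer such as 'A.' to the corresponding choice text, or None."""
    cleaned = model_answer.strip().rstrip(".").upper()
    if len(cleaned) != 1 or cleaned not in string.ascii_uppercase:
        return None  # not a single-letter answer
    index = string.ascii_uppercase.index(cleaned)
    return choices[index] if index < len(choices) else None


choices = ["competition", "conservation", "decomposition", "pollution"]
gold_answer = "competition"
print(letter_to_choice("A.", choices) == gold_answer)  # True
```

Real model outputs are messier than a single letter (for example, full sentences or restated choices), which is one reason to rely on the library's own evaluation rather than an ad hoc check like this one.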

## 3. Evaluate: Evaluation of model answers

In [28]:
worldtree_10.evaluate()

Evaluating worldtree train...



{'worldtree': {'train': {'accuracy': {'google/flan-t5-xl': {'qa-01_kojima-01_kojima-A-D': 0.6}}}}}
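The result is a nested dictionary. Judging from the output above, the levels appear to be dataset, split, metric, model, and the combination of prompt keys; treat that reading as an assumption. Under it, the sketch below flattens the structure into rows that are easier to compare or load into a table. The dictionary is copied from the output above, and `flatten_results` is a hypothetical helper.

```python
# Copied from the evaluation output above.
results = {
    "worldtree": {
        "train": {
            "accuracy": {
                "google/flan-t5-xl": {"qa-01_kojima-01_kojima-A-D": 0.6}
            }
        }
    }
}


def flatten_results(results):
    """Turn the nested evaluation dict into (dataset, split, metric, model, prompt_keys, score) rows."""
    rows = []
    for dataset, splits in results.items():
        for split, metrics in splits.items():
            for metric, models in metrics.items():
                for model, prompts in models.items():
                    for prompt_keys, score in prompts.items():
                        rows.append((dataset, split, metric, model, prompt_keys, score))
    return rows


for row in flatten_results(results):
    print(row)
```

With 10 samples, an accuracy of 0.6 corresponds to 6 correctly answered questions.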

In [29]:
# Save the file that now also includes data in the 'correct_answer' fields 
worldtree_10.dump("worldtree_10.json")


## 4. Examine: Use the ThoughtSource Annotation Web Tool

Use our online tool to **see an overview of the model's output**. You can also use it to manually annotate the data. Everything is saved to the same JSON file.

Just download the json file and then open it in the **[ThoughtSource Annotator](http://thought.samwald.info:3000/)**.