# Causation (Chain of Thoughts Self Consistency)

Chain of Thoughts is a prompting technique that uses examples containing reasoning steps to improve the reasoning ability of a large language model (LLM).

Self Consistency is a layer added on top that leverages the probabilistic outputs of LLMs: it samples several completions and takes the majority vote as the final answer.

This notebook implements Chain of Thoughts with Self Consistency (CoT-SC) for the GEF causation project.

## Paper: Chain of Thoughts (expand to read)
Chain of Thought (CoT): https://arxiv.org/abs/2201.11903<br>
CoT Self Consistency (CoT-SC): https://arxiv.org/abs/2203.11171

In [None]:
from IPython.display import IFrame
display(IFrame(src='https://arxiv.org/pdf/2201.11903.pdf', width=1600, height=700))

In [None]:
from IPython.display import IFrame
# CoT - Self Consistency
display(IFrame(src='https://arxiv.org/pdf/2203.11171.pdf', width=1600, height=700))

## Overview
1. Supply your OpenAI API Key
2. Choose a sampling scheme and number of completions
3. Upload your file with the Chain of Thoughts examples.
4. Enter a query (i.e. the sentence) and run it for a classification.

## 1. Enter your OpenAI API Key.
Replace `<OPENAI_API_KEY>` below and retain the double quotes.

In [2]:
OPENAI_API_KEY = "<OPENAI_API_KEY>"

# You can ignore this: it stores your key in an environment variable (cleared when you close or restart the notebook)
import os
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

## 2. Choose a sampling scheme and the number of completions.

__Sampling Scheme__:
+ `top_p` - **the range of output words within a cumulative probability threshold.** (0 < top_p <= 1.0)<br>
e.g. <br>
= 0.3 means sample only from the most likely outputs whose probabilities add up to 30% of the total.

+ `temperature` - **the higher the temperature, the more spread out the probabilities are across the output words.** (0 <= temperature <= 2) <br>
e.g. <br>
= 0 means output probabilities are *not* spread out, i.e. only 1 output token has a probability of 1.0 at each step while the LLM is generating the output. This means it will always choose the same token during decoding and end up with the same completion.<br>
\> 0, ~0 means output probabilities are a little spread out, i.e. more than 1 output token has a non-zero probability, although most of the probability still sits on the top token. This means that the LLM has a chance of sampling different output tokens.<br>
= 2 means output probabilities are *most* spread out, i.e. output tokens have more similar probabilities and the LLM has a more similar chance of sampling each token.

`top_p` and `temperature` go hand in hand. A really low temperature means fewer output tokens fall within the probability threshold.
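To make the interaction between the two settings concrete, here is a minimal sketch in plain Python. It is an illustration only, not how OpenAI or `llm_experiments` implements sampling: the function names `softmax_with_temperature` and `top_p_filter` and the four scores are made up for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalise the survivors."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Made-up scores for four candidate tokens (assumption for illustration).
logits = [2.0, 1.0, 0.5, 0.0]
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    survivors = top_p_filter(probs, top_p=0.9)
    print(temperature, [round(p, 3) for p in probs], sorted(survivors))
```

With these scores, the number of tokens that survive the `top_p=0.9` cut grows as the temperature rises, which is the "hand in hand" effect described above.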

__Number of completions__ (relates to: self-consistency):
+ `n_completions` - **this is the number of responses you're asking the LLM to generate.**<br>(increases cost but directly relates to the number of votes used by Self Consistency)

<br>
👼 Experiment with different sampling schemes and increase the number of completions for more confidence in your votes (beware of costs!)
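The voting step itself is simple. The sketch below shows one way to take a majority vote over the final answers of several completions. It is an assumption about the general idea, not the code inside `CoTSC`: `majority_vote` and the sample answers are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and its share of the votes."""
    counts = Counter(a.strip().lower() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Made-up final answers parsed from three sampled completions.
sampled_answers = ["determinism", "specific_aetiology", "determinism"]
winner, share = majority_vote(sampled_answers)
print(winner, share)
```

With only a few completions, ties are likely; `Counter.most_common` then returns the answer that was seen first, so a larger `n_completions` gives a more meaningful vote.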

In [3]:
from llm_experiments import SamplingScheme

sampling_scheme = SamplingScheme(top_k=None, top_p=0.2, temperature=1)    # top_k is None because OpenAI does not support it.
n_completions = 3

## 3. Upload your Chain of Thoughts examples

Chain of thoughts uses examples to help the LLM reason about your queries better by breaking it down into steps.

The idea is to give it more context and prompt it to break down its reasoning process into steps.

👼 Experiment with different 'steps' for each example. I would recommend using a diverse set of reasons and being specific. <br> 
👼 More examples usually help, but they also increase your API cost (see TikDollar later). Starting with 3-5 examples per class should be sufficient.

In [4]:
from tempfile import mkdtemp
from pathlib import Path
import panel as pn
pn.extension()

MARKDOWN = str

uploader_data = dict(saved_path=None)

def cb_save_to_file(fbytes: bytes, fname: str) -> MARKDOWN:
    # Nothing has been uploaded yet, so there is nothing to save or report.
    if fbytes is None or len(fbytes) <= 0 or fname is None:
        return ""
    dir_ = Path(mkdtemp())
    path = dir_.joinpath(fname)
    with open(path, 'wb') as h:
        h.write(fbytes)  # write the bytes passed in, not the global widget's value
    uploader_data['saved_path'] = path
    return f"Received: **{fname}**\t\tSaved temporarily to **{uploader_data.get('saved_path')}**"

finput = pn.widgets.FileInput(accept='.toml')
iobject = pn.bind(cb_save_to_file, finput, finput.param.filename)
pn.Row(finput, pn.pane.Markdown(iobject))

👼 Currently supported models: text-davinci-003, text-davinci-002, text-curie-001, text-babbage-001, text-ada-001
(In decreasing performance and cost.)

Note: If you enter the wrong model name, it'll tell you what's available in the error message.

In [5]:
from llm_experiments import CoTSC

assert uploader_data.get('saved_path'), "Have you uploaded your Chain of Thoughts examples .toml configuration file?"

cotsc = CoTSC.from_toml(model='text-davinci-002',
                        prompt_toml=uploader_data.get('saved_path'),
                        sampling_scheme=sampling_scheme, 
                        n_completions=n_completions)

🧑‍💻 Side Note: TOML is a modern format often used to define program configurations in a file. It works well with Python and is often preferred over YAML for configuration.

### Here's your prompt

In [7]:
print(cotsc.dryrun(query="Canberra immunologist Carola Vinuesa who discovered a gene responsible for the autoimmune diseases lupus and diabetes."))

The following are 5 classes with a description of each. 
    Please classify each 'query' as one of the 5 classes.
determinism: when a trait is believed to be present due to a fixed set of underlying process that are inevitable and beyond control. People will view the genetic trait as inevitable, immutable, determined, and destined to develop. The core of this bias is that if a trait has a genetic cause, it will develop no matter what.

specific_aetiology: when a trait has a genetic cause, therefore the genes are the ultimate or most important cause of that trait. When this bias occurs, people downplay other causes and just focus on genes and the sole or most important causes. This is often seen in claims about the “gene for” things like obesity, intelligence etc.

homogeneity: When groups of people share the same genes they are identified as being the same ‘kinds’ of ‘types’ of people. People are grouped into homogenous and discrete categories based on genetic membership. This is the 

### Here's an example

In [None]:
results = cotsc.run(query="Canberra immunologist Carola Vinuesa who discovered a gene responsible for the autoimmune diseases lupus and diabetes.")
results  # raw output

We'll format the raw output a little bit by putting it in a table.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 250)

pd.DataFrame.from_dict(results, orient='index')

### TikDollar
**TikDollar was created by SIH to track your OpenAI spending and cut it off at a specified threshold.<br>
The code will keep running your calls until the next call would exceed this threshold.**

> OpenAI charges for both your input and output tokens. Each token can be thought of as a word in the normal sense but tokens for LLMs are actually *subwords*.

For more information on subwords here's a cool video: https://www.youtube.com/watch?v=zHvTiHr506c

**Now, we're going to bind TikDollar with our CoTSC calling function.**

The parameters you need to care about:
+ `cost_threshold`  - this is the cut-off USD. You'll need to define this. (e.g. 0.1, 1.0, 20)
+ `raise_err` - if True, stop with an error when the next call would exceed the cut-off; if False, print a message and then continue.
+ `verbose` - whether to print messages per call in terms of your spending.
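The behaviour of these parameters can be pictured with the small stand-in below. It is a hedged sketch of the idea only: `BudgetTracker`, `BudgetExceeded` and `charge` are hypothetical names and do not reflect how TikDollar is actually implemented.

```python
class BudgetExceeded(Exception):
    """Raised when the next call would push spending past the threshold."""

class BudgetTracker:
    """Hypothetical stand-in for a cost tracker (not the TikDollar API)."""

    def __init__(self, cost_threshold, raise_err=True, verbose=False):
        self.cost_threshold = cost_threshold
        self.raise_err = raise_err
        self.verbose = verbose
        self.spent = 0.0

    def charge(self, estimated_cost):
        """Record a call's cost, checking the threshold before spending."""
        if self.spent + estimated_cost > self.cost_threshold:
            message = (f"Next call (${estimated_cost:.4f}) would exceed "
                       f"the ${self.cost_threshold:.2f} threshold.")
            if self.raise_err:
                raise BudgetExceeded(message)
            print(message)  # raise_err=False: warn, then carry on
        self.spent += estimated_cost
        if self.verbose:
            print(f"Spent so far: ${self.spent:.4f}")

tracker = BudgetTracker(cost_threshold=0.10, raise_err=True, verbose=True)
tracker.charge(0.04)
tracker.charge(0.04)
try:
    tracker.charge(0.04)  # would bring the total to 0.12, so it is refused
except BudgetExceeded as err:
    print(err)
```

The check happens before the cost is recorded, so a refused call leaves the running total unchanged.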

In [None]:
from llm_experiments.utils import TikDollar as td

# ⚠️ Caveat: When you rerun this cell, tikdollar is reset to 0!
tikdollar = td.track(cotsc, cotsc._tikdollar_run, cost_threshold=0.1, raise_err=True, verbose=True)
tikdollar

In [None]:
cotsc.run(query="Professor Lawford says the institute, based at the Greenslopes Private Hospital, in Brisbane's south, in conjunction with overseas collaborators, is analysing the DNA of Australian Vietnam veterans in a bid to better understand the causes of PTSD.")

In [None]:
tikdollar

## Classification

Now that you have **CoTSC** and **TikDollar** you're equipped to run a classification task on your list of sentences!

At the end of this section there'll be a link you can click to download all the generated results as an Excel sheet.

#### 1. Upload your dataset (requires: 'sentence' column)

In [None]:
uploader_data.clear()
finput = pn.widgets.FileInput(accept='.xlsx')
iobject = pn.bind(cb_save_to_file, finput, finput.param.filename)
pn.Row(finput, pn.pane.Markdown(iobject))

In [None]:
import pandas as pd
assert uploader_data.get('saved_path'), "Did you upload your dataset?"
df = pd.read_excel(uploader_data.get('saved_path'))
df.head(1)

In [None]:
df[['det', 'se', 'nat', 'hom']].hist()

In [None]:
len(df)

#### 2. Set up TikDollar

In [None]:
# Setup TikDollar
# copy of prior cell for task separation and easy access.
from llm_experiments.utils import TikDollar as td

# ⚠️ Caveat: When you rerun this cell, tikdollar is reset to 0!
tikdollar = td.track(cotsc, cotsc._tikdollar_run, cost_threshold=5, raise_err=True, verbose=True)
tikdollar

#### 3. Run classification on uploaded dataset

In [None]:
from llm_experiments.utils.tikdollar import CostThresholdReachedException
from collections import namedtuple
ROW = namedtuple('ROW', ['query', 'clazz', 'votes', 'steps', 'determinism', 'specific_aetiology', 'naturalness', 'homogeneity', 'is_biased', 'completions'])

Rows = list()
for i, row in enumerate(df.sample(10).itertuples()):
    query = row.sentence
    det, se, nat, hom, pos = row.det, row.se, row.nat, row.hom, row.pos
    try:
        results = cotsc.run(query=query)
        for clazz, clz_res in results.items():
            Row = ROW(query=query, clazz=clazz, votes=clz_res.get('votes'), steps=clz_res.get('steps'), 
                      determinism=det, specific_aetiology=se, naturalness=nat, homogeneity=hom, is_biased=pos, completions=clz_res.get('completions'))
            Rows.append(Row)
    except CostThresholdReachedException as ctre:
        print(ctre)
        print(f"Number of queries sent: {i}.")
        break
              
results_df = pd.DataFrame(Rows)

In [None]:
results_df = pd.DataFrame(Rows)

#### 4. Analyse classification results

In [None]:
results_df.head(1)

In [None]:
import sys
import numpy as np

clazz_to_idx = {
    "determinism": 0,
    "specific_aetiology": 1,
    "naturalness": 2,
    "homogeneity": 3,
}

def clazz_to_row(clazz: str) -> list[int]:
    """Convert a predicted class name into a one-hot row over the four bias classes."""
    # Clean up the LLM output: keep only the text before the first comma, if any.
    clazz = clazz.split(',', 1)[0].strip()
    row = [0, 0, 0, 0]
    if clazz == 'not_biased':
        return row
    idx = clazz_to_idx.get(clazz)
    if idx is None:
        # Unknown labels are reported and treated as 'not_biased'.
        print(f"{clazz} is not one of {', '.join(clazz_to_idx.keys())}. Continued as 'not_biased'", file=sys.stderr)
        return row
    row[idx] = 1
    return row

preds, targets = list(), list()
for query, group in results_df.groupby(by='query'):
    best_idx = group.loc[:, 'votes'].idxmax()
    best = group.loc[best_idx]
    prediction = clazz_to_row(best.clazz)
    target = group[['determinism', 'specific_aetiology', 'naturalness', 'homogeneity']].values[0]
    preds.append(prediction)
    targets.append(target)

preds, targets = np.array(preds), np.array(targets)
assert preds.shape == targets.shape, "Mismatched shape between prediction and targets. This should not happen"

In [None]:
from sklearn.metrics import classification_report

clazzes = list(clazz_to_idx.keys())
num_clazzes = preds.shape[1]
for i in range(num_clazzes):
    print(f"=== {clazzes[i]} ===")
    print(classification_report(y_true=targets[:, i], y_pred=preds[:, i]))
    print("\n")

#### 5. Download results as Excel

In [None]:
# reformat df for download
formatted_results_df = results_df.copy()
formatted_results_df['steps'] = formatted_results_df['steps'].apply(lambda steps: "\n+++++++\n".join(steps))
formatted_results_df['completions'] = formatted_results_df['completions'].apply(lambda completions: "\n+++++\n".join(completions))
formatted_results_df.head(1)

In [None]:
import panel as pn

path = 'cotsc-results.xlsx'
formatted_results_df.to_excel(path, index=False)
pn.widgets.FileDownload(file=path, filename=path)