# Causation (Chain of Thoughts Self Consistency)

Chain of Thoughts is a prompting technique that uses examples containing reasoning steps to improve the reasoning ability of a large language model (LLM).

Self consistency is a layer added on top that leverages the probabilistic outputs of LLMs: it samples several completions and takes the majority vote as the final answer.
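To make the voting step concrete, here is a minimal sketch of a majority vote over sampled answers. The function name `majority_vote` and the sample answers are illustrative assumptions; they are not taken from the `llm_experiments` package used later in this notebook.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and its share of the votes."""
    counts = Counter(a.strip().lower() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical final answers parsed from three sampled completions.
sampled = ["determinism", "not_biased", "determinism"]
label, share = majority_vote(sampled)
print(label, round(share, 2))  # prints: determinism 0.67
```

The vote share gives a rough sense of how strongly the completions agree, which is why more completions tend to give a more trustworthy vote.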

This notebook is an implementation of this for the GEF causation project.

## Paper: Chain of Thoughts (expand to read)
Chain of Thought (CoT): https://arxiv.org/abs/2201.11903<br>
CoT Self Consistency (CoT-SC): https://arxiv.org/abs/2203.11171

In [None]:
from IPython.display import IFrame
display(IFrame(src='https://arxiv.org/pdf/2201.11903.pdf', width=1600, height=700))

In [None]:
from IPython.display import IFrame
# CoT - Self Consistency
display(IFrame(src='https://arxiv.org/pdf/2203.11171.pdf', width=1600, height=700))

## Overview
1. Supply your OpenAI API Key
2. Choose a sampling scheme and number of completions
3. Upload your file with the Chain of Thoughts examples.
4. Enter a query (i.e. the sentence) and run it for a classification.

## OpenAI Privacy Policy
This notebook uses OpenAI's API.

For concerns about how your data will be handled, please read through the Privacy Policy here: https://openai.com/policies/api-data-usage-policies.

## 1. Enter your OpenAI API Key.

To create an API Key, please go to: https://platform.openai.com/account/api-keys

In [None]:
# Enter your API Key via a redacted input box.
import panel as pn
pn.extension()

password_input = pn.widgets.PasswordInput(name='Enter your OpenAI API key then run the next cell:', placeholder='<OpenAI API Key>')
password_input

In [None]:
# run this cell once you've entered your API key.
import os
os.environ['OPENAI_API_KEY'] = password_input.value

In [None]:
# Validate your API key.
import re, os
assert len(os.environ['OPENAI_API_KEY']) == 51, "OpenAI API keys are expected to be 51 characters long."
os.environ['OPENAI_API_KEY'][:3] + re.sub('.', '*', os.environ['OPENAI_API_KEY'][3:])

## 2. Choose a sampling scheme and the number of completions.

__Sampling Scheme__:
+ `top_p` - **the probability mass that the candidate output tokens must fall within.** (0 < top_p <= 1.0)<br>
e.g. <br>
= 0.3 means sample only from the most likely tokens that together make up the top 30% of the probability mass.

+ `temperature` - **the higher the temperature, the more spread out the probabilities are across the output tokens.** (0 <= temperature <= 2) <br>
e.g. <br>
= 0 means output probabilities are *not* spread out, i.e. a single output token has a probability of 1.0 at each generation step. The LLM will therefore always choose the same token during decoding and end up with the same completion.<br>
\> 0, ~0 means output probabilities are a little spread out, i.e. one token keeps a probability close to 1.0 and a few others share the remainder. The LLM therefore has a small chance of sampling different output tokens.<br>
= 2 means output probabilities are *most* spread out, i.e. output tokens have more similar probabilities and the LLM has a more similar chance of sampling each of them.

`top_p` and `temperature` go hand in hand. A really low temperature means there are fewer output tokens within the probability threshold.
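The interaction between the two parameters can be sketched with a toy example. This uses the textbook formulation (a temperature-scaled softmax followed by nucleus filtering) and made-up logits; it is an assumption about how such sampling is commonly done, not a description of OpenAI's internals. The sketch requires `temperature > 0`, since a temperature of 0 is usually treated as always picking the most likely token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature must be > 0."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    survivors = sorted(top_p_filter(probs, top_p=0.8))
    print(f"temperature={t}: probs={[round(p, 3) for p in probs]}, kept token ids={survivors}")
```

With these logits, a low temperature leaves a single token inside the `top_p=0.8` threshold, while a higher temperature lets more tokens through.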

__Number of completions__ (relates to: self-consistency):
+ `n_completions` - **this is the number of responses you're asking the LLM to generate.**<br>(increases cost, but it is also the number of votes available to Self Consistency)

__Penalties__:
+ `presence_penalty` - **this pushes the model towards more varied word choices within each completion, i.e. it penalises tokens that the model has already produced.** (-2.0 <= presence_penalty <= 2.0)<br>
Empirically, increasing it has led to more output tokens.
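As a rough illustration, the sketch below assumes the penalty is a flat amount subtracted from the score (logit) of every token that has already appeared in the completion. The function name and the logits are made up for this example.

```python
def apply_presence_penalty(logits, generated_tokens, presence_penalty):
    """Subtract a flat penalty from the logit of each token already generated."""
    seen = set(generated_tokens)
    return {tok: (score - presence_penalty) if tok in seen else score
            for tok, score in logits.items()}

# Made-up logits for three candidate next tokens.
logits = {"gene": 2.0, "causes": 1.5, "disease": 1.0}
adjusted = apply_presence_penalty(logits, generated_tokens=["gene"], presence_penalty=1.0)
print(adjusted)  # prints: {'gene': 1.0, 'causes': 1.5, 'disease': 1.0}
```

A positive penalty makes the already-used token less likely to be chosen again; a negative value would do the opposite.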

<br>
👼 Experiment with different sampling schemes and increase number of completions for more confidence in your votes (beware of costs!)

In [None]:
from llm_experiments import SamplingScheme

sampling_scheme = SamplingScheme(top_p=0.8, temperature=1, presence_penalty=0.0)
n_completions = 3

## 3. Upload your Chain of Thoughts examples

Chain of thoughts uses examples to help the LLM reason about your queries better by breaking it down into steps.

The idea is to give it more context and prompt it to break down its reasoning process into steps.

👼 Experiment with different 'steps' for each example. I would recommend a diverse set of reasons, each as specific as possible. <br> 
👼 More examples usually help, but they also increase your API cost (see TikDollar later). Starting out with 3-5 examples per class should be sufficient.
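For intuition, here is a minimal sketch of how worked examples might be stitched into a few-shot prompt. The field names (`query`, `steps`, `answer`), the prompt layout, and the example sentence are all hypothetical; the actual `.toml` schema and prompt template are defined by `llm_experiments` and may differ.

```python
def build_cot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt where every example shows its reasoning steps."""
    blocks = [instruction]
    for ex in examples:
        steps = "\n".join(f"{i}. {step}" for i, step in enumerate(ex["steps"], start=1))
        blocks.append(f"Sentence: {ex['query']}\nReasoning:\n{steps}\nAnswer: {ex['answer']}")
    # The query ends with an open "Reasoning:" so the model continues with its own steps.
    blocks.append(f"Sentence: {query}\nReasoning:")
    return "\n\n".join(blocks)

# A made-up example, for illustration only.
examples = [{
    "query": "Scientists found the gene that makes people anxious.",
    "steps": [
        "The sentence attributes anxiety to a single gene.",
        "It leaves no room for environmental or other factors.",
    ],
    "answer": "determinism",
}]
prompt = build_cot_prompt(
    instruction="Classify the sentence and explain your reasoning step by step.",
    examples=examples,
    query="Researchers are studying how diet and genes interact.",
)
print(prompt)
```

Each extra example lengthens the prompt, which is where the additional API cost comes from.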

In [None]:
from tempfile import mkdtemp
from pathlib import Path
import panel as pn
pn.extension()

MARKDOWN = str

uploader_data = dict(saved_path=None)

def cb_save_to_file(fbytes: bytes, fname: str) -> MARKDOWN:
    if fbytes is None or len(fbytes) <= 0 or fname is None: return ""
    dir_ = Path(mkdtemp())
    path = dir_.joinpath(fname)
    with open(path, 'wb') as h:
        h.write(fbytes)  # write the bytes passed to the callback rather than reading the global widget
    uploader_data['toml'] = path
    return f"Received: **{fname}**\t\tSaved temporarily to **{uploader_data.get('toml')}**"

finput = pn.widgets.FileInput(accept='.toml')
iobject = pn.bind(cb_save_to_file, finput, finput.param.filename)
pn.Row(finput, pn.pane.Markdown(iobject))

👼 Currently supported models: text-davinci-003, text-davinci-002, text-curie-001, text-babbage-001, text-ada-001
(In decreasing performance and cost.)<br>
👼 Newly Added: gpt-3.5-turbo at 10% of the cost of text-davinci-003 but similar performance!

Note: If you enter the wrong model name, it'll tell you what's available in the error message.

In [None]:
from llm_experiments import CoTSC

assert uploader_data.get('toml'), "Have you uploaded your Chain of Thoughts examples .toml configuration file?"

cotsc = CoTSC.from_toml(model='gpt-4',
                        prompt_toml=uploader_data.get('toml'),
                        sampling_scheme=sampling_scheme, 
                        n_completions=n_completions,
                        shuffle_examples=True,
                        shuffle_seed=42)

### Here's your prompt (Not sent to OpenAI)

In [None]:
print(cotsc.dryrun(query="Canberra immunologist Carola Vinuesa who discovered a gene responsible for the autoimmune diseases lupus and diabetes."))

### Here's an example (Sent to OpenAI)

In [None]:
results = cotsc.run(query="Canberra immunologist Carola Vinuesa who discovered a gene responsible for the autoimmune diseases lupus and diabetes.")
results  # raw output

We'll format the raw output a little bit by putting it in a table.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 250)

pd.DataFrame.from_dict(results, orient='index')

### TikDollar
**TikDollar was created by SIH to track your OpenAI expenses and cut them off at a specified threshold.<br>
The code will keep running your calls until the next call would exceed this threshold.**

> OpenAI charges for both your input and output tokens. Each token can be thought of as a word in the normal sense but tokens for LLMs are actually *subwords*.

For more information on subwords here's a cool video: https://www.youtube.com/watch?v=zHvTiHr506c

**Now, we're going to bind TikDollar with our CoTSC calling function.**

The parameters you need to care about:
+ `cost_threshold`  - this is the cut-off USD. You'll need to define this. (e.g. 0.1, 1.0, 20)
+ `raise_err` - if True, stop with an error when the next call would exceed the cut-off. (If False, print a message and continue.)
+ `verbose` - whether to print messages per call in terms of your spending.
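The behaviour of these parameters can be sketched with a small stand-alone tracker. `BudgetTracker`, `BudgetExceeded`, and the per-1k-token prices below are hypothetical placeholders for illustration; they are not TikDollar's real implementation or OpenAI's actual pricing.

```python
class BudgetExceeded(Exception):
    """Raised when the next call would push spending past the threshold."""

class BudgetTracker:
    def __init__(self, cost_threshold, raise_err=True):
        self.cost_threshold = cost_threshold
        self.raise_err = raise_err
        self.spent = 0.0

    def charge(self, prompt_tokens, completion_tokens, usd_per_1k_prompt, usd_per_1k_completion):
        """Record one call's cost, refusing it first if it would exceed the threshold."""
        cost = (prompt_tokens * usd_per_1k_prompt
                + completion_tokens * usd_per_1k_completion) / 1000
        if self.spent + cost > self.cost_threshold:
            message = f"Next call (${cost:.4f}) would exceed ${self.cost_threshold}"
            if self.raise_err:
                raise BudgetExceeded(message)
            print(message)  # raise_err=False: warn, then carry on
        self.spent += cost
        return cost

tracker = BudgetTracker(cost_threshold=0.012)
for call in range(1, 4):
    try:
        # Placeholder prices: $0.002 per 1k tokens for both input and output.
        tracker.charge(2000, 500, usd_per_1k_prompt=0.002, usd_per_1k_completion=0.002)
        print(f"call {call} ok, spent so far: ${tracker.spent:.4f}")
    except BudgetExceeded as err:
        print(f"call {call} blocked: {err}")
        break
```

Both input and output tokens count towards the cost, which is why long prompts with many examples and a high number of completions add up quickly.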

In [None]:
from llm_experiments.utils import TikDollar as td

# ⚠️ Caveat: When you rerun this cell, tikdollar is reset to 0!
tikdollar = td.track(cotsc, cotsc._tikdollar_run, cost_threshold=0.1, raise_err=True, verbose=True)
tikdollar

In [None]:
cotsc.run(query="Professor Lawford says the institute, based at the Greenslopes Private Hospital, in Brisbane's south, in conjunction with overseas collaborators, is analysing the DNA of Australian Vietnam veterans in a bid to better understand the causes of PTSD.")

In [None]:
tikdollar

## 4. Batch Classification

Now that you have **CoTSC** and **TikDollar** you're equipped to run a classification task on your list of sentences!

At the end, there will be a link you can click to download all the generated results as an Excel sheet.

#### 1. Upload your dataset (requires: 'sentence' column)

In [None]:
def cb_save_to_file(fbytes: bytes, fname: str) -> MARKDOWN:
    if fbytes is None or len(fbytes) <= 0 or fname is None: return ""
    dir_ = Path(mkdtemp())
    path = dir_.joinpath(fname)
    with open(path, 'wb') as h:
        h.write(fbytes)  # write the bytes passed to the callback rather than reading the global widget
    uploader_data['dataset'] = path
    return f"Received: **{fname}**\t\tSaved temporarily to **{uploader_data.get('dataset')}**"

finput = pn.widgets.FileInput(accept='.xlsx')
iobject = pn.bind(cb_save_to_file, finput, finput.param.filename)
pn.Row(finput, pn.pane.Markdown(iobject))

In [None]:
import pandas as pd
assert uploader_data.get('dataset'), "Did you upload your dataset?"
df = pd.read_excel(uploader_data.get('dataset'))
df.head(1)

In [None]:
# plot class distribution
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd

fig = make_subplots(rows=1, cols=4)

fig.add_trace(go.Histogram(x=df['det']), row=1, col=1)
fig.add_trace(go.Histogram(x=df['hom']), row=1, col=2)
fig.add_trace(go.Histogram(x=df['se']), row=1, col=3)
fig.add_trace(go.Histogram(x=df['nat']), row=1, col=4)

fig.update_xaxes(title_text="determinism", dtick=1, row=1, col=1)
fig.update_xaxes(title_text="homogeneity", dtick=1, row=1, col=2)
fig.update_xaxes(title_text="specific aetiology", dtick=1, row=1, col=3)
fig.update_xaxes(title_text="naturalism", dtick=1, row=1, col=4)

fig.update_yaxes(title_text="Count", range=[0, len(df)])

# Update layout
fig.update_layout(height=300, width=1400, title_text="Distribution of Classes", showlegend=False)

fig.show()

#### 2. Set up TikDollar

In [None]:
# Setup TikDollar
# copy of prior cell for task separation and easy access.
from llm_experiments.utils import TikDollar as td

# ⚠️ Caveat: When you rerun this cell, tikdollar is reset to 0!
tikdollar = td.track(cotsc, cotsc._tikdollar_run, cost_threshold=20, raise_err=True, verbose=True)
tikdollar

#### 3. Run classification on uploaded dataset

🧑‍💻 If you find the TikDollar messages cluttering your screen, set `verbose=False` in the previous cell, run the cell and then run the cell below.

In [None]:
from llm_experiments.utils.tikdollar import CostThresholdReachedException
from llm_experiments.cot import CoTDataLeakException
from collections import namedtuple
from tqdm.auto import tqdm

ROW = namedtuple('ROW', ['query', 'clazz', 'votes', 'steps', 'determinism', 'specific_aetiology', 'naturalness', 'homogeneity', 'is_biased', 'completions'])

Rows = list()
dleak_counter = 0
for i, row in tqdm(enumerate(df.itertuples()), total=len(df)):
    query = row.sentence
    det, se, nat, hom, pos = row.det, row.se, row.nat, row.hom, row.pos
    try:
        results = cotsc.run(query=query)
        for clazz, clz_res in results.items():
            Row = ROW(query=query, clazz=clazz, votes=clz_res.get('votes'), steps=clz_res.get('steps'), 
                      determinism=det, specific_aetiology=se, naturalness=nat, homogeneity=hom, is_biased=pos, completions=clz_res.get('completions'))
            Rows.append(Row)
    except CostThresholdReachedException as ctre:
        print(ctre)
        print(f"Number of queries sent: {i}.")
        break
    except CoTDataLeakException as cotdle:
        print(cotdle)
        print("Data leak detected. Skipped.")
        dleak_counter += 1
        continue

print(f"Number of examples leaked: {dleak_counter}")
results_df = pd.DataFrame(Rows)

In [None]:
results_df = pd.DataFrame(Rows)
len(results_df)

#### 4. Analyse classification results

In [None]:
results_df.head(1)

In [None]:
results_df.clazz.value_counts()

In [None]:
import sys
import numpy as np

clazz_to_idx = {
    "determinism": 0,
    "specific_aetiology": 1,
    "naturalness": 2,
    "homogeneity": 3,
}

def clazz_to_row(clazz: str) -> list[int]:
    # clean up llm output: keep only the text before the first comma
    clazz = clazz.split(',')[0].strip()
    row = [0, 0, 0, 0]
    if clazz == 'not_biased': 
        return row
    elif clazz_to_idx.get(clazz, None) is None:
        print(f"{clazz} is not one of {', '.join(clazz_to_idx.keys())}. Continued as 'not_biased'", file=sys.stderr)
        return row
    else:
        row[clazz_to_idx.get(clazz)] = 1
        return row

preds, targets = list(), list()
for query, group in results_df.groupby(by='query'):
    best_idx = group.loc[:, 'votes'].idxmax()
    best = group.loc[best_idx]
    prediction = clazz_to_row(best.clazz)
    target = group[['determinism', 'specific_aetiology', 'naturalness', 'homogeneity']].values[0]
    preds.append(prediction)
    targets.append(target)

preds, targets = np.array(preds), np.array(targets)
assert preds.shape == targets.shape, "Mismatched shape between prediction and targets. This should not happen"

In [None]:
from sklearn.metrics import classification_report
import sys

def report(clazzes, targets, preds, file=sys.stdout):
    for i in range(len(clazzes)):
        print(f"=== {clazzes[i]} ===", file=file)
        print(classification_report(y_true=targets[:, i], y_pred=preds[:, i]), file=file)
        print("\n", file=file)
    
clazzes = list(clazz_to_idx.keys())
report(clazzes, targets, preds)

In [None]:
# confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

def cm(clazzes, targets, preds):
    fig, axes = plt.subplots(2, 2, figsize=(10, 10))
    for i in range(len(clazzes)):
        y_true, y_pred = targets[:, i], preds[:, i]
        clazz = clazzes[i]
        # Always build a 2x2 matrix so the display labels match the matrix
        # even when a class is absent from the targets or the predictions.
        labels = ['neutral', clazz]
        matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1])
        disp = ConfusionMatrixDisplay(matrix, display_labels=labels)
        ax = axes[i//2, i%2]
        disp.plot(ax=ax)
        ax.set_title(f"{clazz}")
    return fig
fig = cm(clazzes, targets, preds)

In [None]:
# convert to single-target-multi-class confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
def to_labels(onehot):
    """ This converts the [det, se, nat, hom] to [neutral, det, se, nat, hom] """
    num_classes_plus_neutral = onehot.shape[1] + 1
    labels = np.zeros((onehot.shape[0], num_classes_plus_neutral))
    for idx in range(len(onehot)):
        if np.sum(onehot[idx]) == 0:
            continue
        else:
            clz_idx = np.argmax(onehot[idx])
            clz_idx = clz_idx + 1
            labels[idx][clz_idx] = 1
    return labels

# dynamic display_labels: show only the classes present in the targets or predictions
# (e.g. SE only, or SE and DET only, where the targets are a subset of all biases)
all_labels = ['neutral'] + clazzes
y_true = np.argmax(to_labels(targets), axis=1)
y_pred = np.argmax(to_labels(preds), axis=1)
present = np.unique(np.concatenate([y_true, y_pred]))
display_labels = [all_labels[u] for u in present]

fig_st, axes = plt.subplots(figsize=(8, 8))
disp = ConfusionMatrixDisplay(confusion_matrix(
    y_true=y_true,
    y_pred=y_pred,
    labels=present
), display_labels=display_labels)
disp.plot(ax=axes)

## 5. Download Results
1. Full COTSC outputs. (cotsc-outputs.xlsx)
2. Classification results. (cotsc-results.txt)
3. COTSC model configuration (cotsc-config.json)
4. COTSC prompt toml (your uploaded toml file)
5. Dataset used for classification.

👼 These are all packaged into `cotsc-results_{timestamp}.zip`, which you'll be able to click to download.

In [None]:
# create the temporary results directory
import os
import shutil
from pathlib import Path
dir_: Path = Path('./.causation-cotsc-results')   # note: hidden folder.
if dir_.exists(): shutil.rmtree(dir_)
os.makedirs(dir_, exist_ok=True)
assert dir_.exists(), "Temporary directory did not get created."
f"Temporary directory: {dir_}"

In [None]:
# reformat df for readability.
formatted_results_df = results_df.copy()
formatted_results_df['steps'] = formatted_results_df['steps'].apply(lambda steps: "\n".join((f"{i+1}. {s}" for i, s in enumerate(steps))))
formatted_results_df['completions'] = formatted_results_df['completions'].apply(lambda completions: "\n\n".join((c for c in completions)))
formatted_results_df.head(1)

In [None]:
# classification outputs
path = dir_.joinpath('cotsc-outputs.xlsx')
formatted_results_df.to_excel(path, index=False)

assert path.exists(), f"Failed to save to {path}"

In [None]:
# classification evaluation results
path = dir_.joinpath('cotsc-results.txt')
with open(path, 'w') as f:
    clazzes = list(clazz_to_idx.keys())
    num_clazzes = preds.shape[1]
    report(clazzes, targets, preds, file=f)
    
assert path.exists(), f"Failed to save to {path}"

In [None]:
# classification evaluation confusion matrix
path = dir_.joinpath('cotsc-confusion-matrix.png')
path_st = dir_.joinpath('cotsc-confusion-matrix-st.png')
assert fig, "Did you run the confusion matrix cell earlier?"
assert fig_st, "Did you run the confusion matrix (single target) cell earlier?"
_ = fig.savefig(path)
_ = fig_st.savefig(path_st)

In [None]:
# cotsc configuration
import srsly

path = dir_.joinpath('cotsc-config.json')
cotsc_config = {
    'sampling_scheme': sampling_scheme.openai(),
    'n_completions': cotsc.n_completions,
    'model': cotsc.model,
    'classes': cotsc.classes,
}
srsly.write_json(path, cotsc_config)
assert path.exists(), f"Failed to save to {path}"

In [None]:
# List of names of files to be added to the zip
file_names = [uploader_data['toml'], uploader_data['dataset']]  # toml & dataset
file_names += list(dir_.glob('*'))

[f.name for f in file_names]

In [None]:
# Open a zip file in write mode
# zip results and causation config
import zipfile
import os
from datetime import datetime
from pathlib import Path

now = datetime.now().strftime(format="%Y-%m-%d_%H-%M-%S")
zfname = Path(f'cotsc-results_{now}.zip')
with zipfile.ZipFile(zfname, 'w') as zipf:
    # Loop through the list of files
    for file_name in file_names:
        # Add each file to the zip
        zipf.write(file_name, arcname=os.path.basename(file_name))
print(f"Saved to {zfname}")

# download link for the zip.
pn.widgets.FileDownload(file=str(zfname), filename=zfname.name)