# Diverse Dataset

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/main/notebooks/data-augmentation/unified-qa/unified-qa.ipynb)

This notebook downloads the DIVERSE dataset and converts it into a format that can be used to train OpenAssistant.

The DIVERSE repo can be found here: https://github.com/microsoft/CodeT/tree/main/DIVERSE

If you extend or use this work, please cite the relevant paper:
```
@article{li2022advance,
  title={On the Advance of Making Language Models Better Reasoners},
  author={Li, Yifei and Lin, Zeqi and Zhang, Shizhuo and Fu, Qiang and Chen, Bei and Lou, Jian-Guang and Chen, Weizhu},
  journal={arXiv preprint arXiv:2206.02336},
  year={2022}
}

```

# OpenAssistant Data Schema

We will use the data schema described in the Open-Assistant docs. This code is taken from the StackExchange notebook.

In [1]:
from typing import TypeVar, List, Dict, Any, Literal
from json import JSONEncoder

T = TypeVar("T", bound="ConversationTreeNode")


class ConversationTreeNode:
    text: str  # The text of the node
    role: Literal["prompter", "assistant"]  # Whether the node is a user prompt/follow-up or an assistant response
    children: List[T]  # The children of the node (if you have a linear conversation, this will be of length 0 or 1)
    metadata: Dict[str, Any]  # Node metadata (see below)

    def __init__(
        self, text: str, role: Literal["prompter", "assistant"], children: List[T], metadata: Dict[str, Any]
    ) -> None:
        self.text = text
        self.role = role
        self.children = children
        self.metadata = metadata


class ConversationTree:
    root: ConversationTreeNode  # The node containing the initial prompt
    metadata: Dict[str, Any]  # Tree metadata, different from root node metadata.

    def __init__(self, root: ConversationTreeNode, metadata: Dict[str, Any]) -> None:
        self.root = root
        self.metadata = metadata


# subclass JSONEncoder
class TreeEncoder(JSONEncoder):
    def default(self, o):
        return o.__dict__
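
As a quick check of how these classes serialize, the sketch below builds a single question/answer tree and encodes it with `TreeEncoder`. The classes are repeated in condensed form (without type annotations) only so the snippet runs on its own, and the question and answer text are made up for illustration rather than taken from the dataset.

```python
import json
from json import JSONEncoder


# Condensed copies of the classes above so this snippet runs on its own.
class ConversationTreeNode:
    def __init__(self, text, role, children, metadata):
        self.text = text
        self.role = role
        self.children = children
        self.metadata = metadata


class ConversationTree:
    def __init__(self, root, metadata):
        self.root = root
        self.metadata = metadata


class TreeEncoder(JSONEncoder):
    def default(self, o):
        # Fall back to the instance attributes for objects json cannot encode natively.
        return o.__dict__


# Hypothetical question/answer pair, not taken from DIVERSE.
answer = ConversationTreeNode(text="4", role="assistant", children=[], metadata=None)
question = ConversationTreeNode(text="What is 2 + 2?", role="prompter", children=[answer], metadata=None)
tree = ConversationTree(root=question, metadata={"dataset": "example"})

# Encode to a JSON string, then load it back into plain dicts and lists.
encoded = json.loads(TreeEncoder().encode(tree))
print(json.dumps(encoded, indent=4))
```

The encoder calls `default` recursively, so nested nodes end up as nested dictionaries under `root` and `children`.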

# Download and convert

We first import pandas and the other libraries that we'll need.

In [4]:
import pandas as pd
import json

The following is a simple function that takes the data (which has two columns), extracts the question/answer pairs embedded in the `context` column, and converts each pair into a tree with a root node (the question) and one child (the answer).

In [77]:
import re

def convert_diverse(dataset_json_path):
    # read files using pandas
    ds = pd.read_json(dataset_json_path, lines=True)

    # create dataset name from path
    ds_name = "diverse" + dataset_json_path.split("data")[-1].replace("/", "_").split(".")[0]

    # create conversation forest
    conversation_forest = []
    for item in ds["context"]:
        # build nodes and tree
        # Find all answers:

        answers = re.findall(r'Answer:?(.*?)#', item.replace("\n", " "))
        questions = re.findall(r'Question:?(.*?) Answer:', item.replace("\n", " "))

        if len(answers) < len(questions):
            questions.pop(-1)

        # pair each question with its answer (questions first, so the root node holds the question)
        for (question, answer) in zip(questions, answers):
            root = ConversationTreeNode(text=question, role="prompter", children=[], metadata=None)
            child = ConversationTreeNode(text=answer, role="assistant", children=[], metadata=None)
            root.children.append(child)
            conversation_tree = ConversationTree(root=root, metadata={"dataset": ds_name})
            conversation_forest.append(conversation_tree)

    conversation_forest_json = [
        json.loads(TreeEncoder().encode(conversation_tree)) for conversation_tree in conversation_forest
    ]

    # write the forest to disk; the context manager closes the file handle
    with open(f"./{ds_name}.json", "w") as f:
        json.dump(conversation_forest_json, f, indent=4)
    print("*****", ds_name, "****")
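
To make the extraction step concrete, here is a minimal sketch that runs the same two regular expressions on a single context string. The string is a hypothetical example written to match the `Question: ... Answer: ... #` shape the expressions assume; the real DIVERSE contexts may be laid out differently. The `.strip()` call is an addition for readability and is not part of the function above.

```python
import re

# Hypothetical few-shot context; the last question has no answer yet.
context = (
    "Question: Is the sky blue?\n"
    "Answer: Yes, on a clear day.#\n\n"
    "Question: Do fish fly?\n"
    "Answer:"
)

# Same preprocessing and patterns as in convert_diverse.
flat = context.replace("\n", " ")
answers = re.findall(r"Answer:?(.*?)#", flat)
questions = re.findall(r"Question:?(.*?) Answer:", flat)

# The trailing question has no answer, so it is dropped before pairing.
if len(answers) < len(questions):
    questions.pop(-1)

# Strip surrounding whitespace (an addition, not done in convert_diverse).
pairs = [(q.strip(), a.strip()) for q, a in zip(questions, answers)]
print(pairs)
```

Only the first question survives here, because the second has no `#`-terminated answer to pair with.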

We now clone the repository containing the dataset.

In [5]:
!git clone https://github.com/microsoft/CodeT.git

Cloning into 'CodeT'...
remote: Enumerating objects: 144, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 144 (delta 1), reused 0 (delta 0), pack-reused 128
Receiving objects: 100% (144/144), 56.76 MiB | 8.89 MiB/s, done.
Resolving deltas: 100% (33/33), done.
Updating files: 100% (64/64), done.


In [78]:
diverse_files = [
    "CodeT/DIVERSE/data/sqa/split1/test.jsonl",
    "CodeT/DIVERSE/data/sqa/split1/train.jsonl",
    "CodeT/DIVERSE/data/sqa/split2/test.jsonl",
    "CodeT/DIVERSE/data/sqa/split2/train.jsonl",
    "CodeT/DIVERSE/data/gsm8k/test.jsonl",
    "CodeT/DIVERSE/data/gsm8k/train.jsonl",
]
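
Each of these paths determines the name of the output file. The sketch below applies the same string expression used for `ds_name` in `convert_diverse` to two of the paths; the helper name `dataset_name` is hypothetical and exists only for this illustration.

```python
def dataset_name(path: str) -> str:
    # Keep everything after the last "data", turn slashes into underscores,
    # and drop the file extension.
    return "diverse" + path.split("data")[-1].replace("/", "_").split(".")[0]


sqa_name = dataset_name("CodeT/DIVERSE/data/sqa/split1/test.jsonl")
gsm8k_name = dataset_name("CodeT/DIVERSE/data/gsm8k/train.jsonl")
print(sqa_name)    # diverse_sqa_split1_test
print(gsm8k_name)  # diverse_gsm8k_train
```

With these names, the converted files would be written as `diverse_sqa_split1_test.json`, `diverse_gsm8k_train.json`, and so on, in the current working directory.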

In [79]:
for file in diverse_files:
    convert_diverse(file)