# Diverse Dataset

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/main/notebooks/data-augmentation/unified-qa/unified-qa.ipynb)

This notebook downloads the DIVERSE dataset and converts it into a format that can be used to train OpenAssistant.

The DIVERSE repo can be found here: https://github.com/microsoft/CodeT/tree/main/DIVERSE

If you extend or use this work, please cite the relevant paper:
```
@article{li2022advance,
  title={On the Advance of Making Language Models Better Reasoners},
  author={Li, Yifei and Lin, Zeqi and Zhang, Shizhuo and Fu, Qiang and Chen, Bei and Lou, Jian-Guang and Chen, Weizhu},
  journal={arXiv preprint arXiv:2206.02336},
  year={2022}
}

```

# OpenAssistant Data Scheme

We will use the data scheme that can be found in the docs for Open-Assistant. This code is taken from the StackExchange notebook.

In [1]:
from typing import TypeVar, List, Dict, Any, Literal
from json import JSONEncoder

T = TypeVar("T", bound="ConversationTreeNode")


class ConversationTreeNode:
    text: str  # The text of the node
    role: Literal["prompter", "assistant"]  # Whether the node is a user prompt/follow-up or an assistant response
    children: List[T]  # The children of the node (if you have a linear conversation, this will be of length 0 or 1)
    metadata: Dict[str, Any]  # Node metadata (see below)

    def __init__(
        self, text: str, role: Literal["prompter", "assistant"], children: List[T], metadata: Dict[str, Any]
    ) -> None:
        self.text = text
        self.role = role
        self.children = children
        self.metadata = metadata


class ConversationTree:
    root: ConversationTreeNode  # The node containing the initial prompt
    metadata: Dict[str, Any]  # Tree metadata, different from root node metadata.

    def __init__(self, root: ConversationTreeNode, metadata: Dict[str, Any]) -> None:
        self.root = root
        self.metadata = metadata


# subclass JSONEncoder
class TreeEncoder(JSONEncoder):
    def default(self, o):
        return o.__dict__
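
As a minimal sketch of how these classes fit together, the snippet below builds a one-turn tree and serializes it with `TreeEncoder`. The classes are condensed stand-ins for the definitions above (type hints omitted) so that the snippet runs on its own, and the question/answer text and the `"example"` dataset name are hypothetical.

```python
import json
from json import JSONEncoder


# Condensed stand-ins for the classes defined above, repeated so the snippet runs alone.
class ConversationTreeNode:
    def __init__(self, text, role, children, metadata):
        self.text = text
        self.role = role
        self.children = children
        self.metadata = metadata


class ConversationTree:
    def __init__(self, root, metadata):
        self.root = root
        self.metadata = metadata


class TreeEncoder(JSONEncoder):
    def default(self, o):
        # Fall back to the instance attributes for objects json cannot serialize.
        return o.__dict__


# Hypothetical question/answer pair, used only for illustration.
answer = ConversationTreeNode("2 + 2 = 4", "assistant", [], {})
question = ConversationTreeNode("What is 2 + 2?", "prompter", [answer], {})
tree = ConversationTree(root=question, metadata={"dataset": "example"})

# Encode to a JSON string, then load it back into plain dicts and lists.
encoded = json.loads(TreeEncoder().encode(tree))
print(json.dumps(encoded, indent=4))
```

Because `default` is called recursively for every object the encoder cannot handle, nested nodes in `children` are serialized the same way as the root.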

# Download and convert

First, we import pandas and the other libraries that we'll need.

In [4]:
import pandas as pd
import json

The following function reads the `context` column of a DIVERSE file, extracts the question/answer pairs it contains, and converts each pair into a tree with a root node (the question) and one child (the answer).

In [77]:
import re


def convert_diverse(dataset_json_path):
    # read the JSON Lines file using pandas
    ds = pd.read_json(dataset_json_path, lines=True)

    # create dataset name from the path argument (not from a global variable)
    ds_name = "diverse" + dataset_json_path.split("data")[-1].replace("/", "_").split(".")[0]

    # create conversation forest
    conversation_forest = []
    for item in ds["context"]:
        # flatten the context onto a single line before matching
        flat_item = item.replace("\n", " ")

        # an answer runs from "Answer:" up to the first "#" of the "####" marker
        answers = re.findall(r"Answer:?(.*?)#", flat_item)
        questions = re.findall(r"Question:?(.*?) Answer:", flat_item)

        # drop a trailing question that has no worked answer
        if len(answers) < len(questions):
            questions.pop(-1)

        # build nodes and tree; questions come first so that they become the root
        for question, answer in zip(questions, answers):
            root = ConversationTreeNode(text=question.strip(), role="prompter", children=[], metadata=None)
            child = ConversationTreeNode(text=answer.strip(), role="assistant", children=[], metadata=None)
            root.children.append(child)
            conversation_tree = ConversationTree(root=root, metadata={"dataset": ds_name})
            conversation_forest.append(conversation_tree)

    conversation_forest_json = [
        json.loads(TreeEncoder().encode(conversation_tree)) for conversation_tree in conversation_forest
    ]

    # write the forest to disk and close the file afterwards
    with open(f"./{ds_name}.json", "w") as f:
        json.dump(conversation_forest_json, f, indent=4)
    print("*****", ds_name, "****")
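
The extraction step can be checked in isolation. The sketch below applies the same two regular expressions to a hypothetical, shortened context string laid out like the DIVERSE prompts. It assumes that the final question in a context has no worked answer, which is why it is dropped before pairing.

```python
import re

# Hypothetical, shortened context laid out like the DIVERSE few-shot prompts.
sample_context = (
    "Question:\nTom has 2 apples and buys 3 more. How many apples does he have?\n"
    "Answer:Tom has 2 + 3 = <<2+3=5>>5 apples.\n#### 5\n\n"
    "Question:\nA box holds 4 pens. How many pens are in 3 boxes?\n"
    "Answer:"
)

# Flatten onto one line, then pull out answers and questions separately.
flat = sample_context.replace("\n", " ")
answers = [a.strip() for a in re.findall(r"Answer:?(.*?)#", flat)]
questions = [q.strip() for q in re.findall(r"Question:?(.*?) Answer:", flat)]

# The last question has no worked answer, so it is dropped before pairing.
if len(answers) < len(questions):
    questions.pop(-1)

pairs = list(zip(questions, answers))
print(pairs)
```

Only the first question survives, paired with its worked solution. The `#### 5` marker is excluded because the answer pattern stops at the first `#`.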

We now clone the repository containing the dataset.

In [5]:
!git clone https://github.com/microsoft/CodeT.git

Cloning into 'CodeT'...
remote: Enumerating objects: 144, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 144 (delta 1), reused 0 (delta 0), pack-reused 128
Receiving objects: 100% (144/144), 56.76 MiB | 8.89 MiB/s, done.
Resolving deltas: 100% (33/33), done.
Updating files: 100% (64/64), done.


In [78]:
diverse_files = [
    "CodeT/DIVERSE/data/sqa/split1/test.jsonl",
    "CodeT/DIVERSE/data/sqa/split1/train.jsonl",
    "CodeT/DIVERSE/data/sqa/split2/test.jsonl",
    "CodeT/DIVERSE/data/sqa/split2/train.jsonl",
    "CodeT/DIVERSE/data/gsm8k/test.jsonl",
    "CodeT/DIVERSE/data/gsm8k/train.jsonl",
]

In [79]:
for file in diverse_files:
    convert_diverse(file)
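
Each call writes one JSON file whose name is derived from the input path. As a rough sketch, the naming expression from `convert_diverse` is wrapped below in a hypothetical helper, `dataset_name`, which is not part of the notebook, to show which names result for a few of the paths above.

```python
def dataset_name(dataset_json_path: str) -> str:
    # Keep the part of the path after "data", turn "/" into "_", and drop the extension.
    return "diverse" + dataset_json_path.split("data")[-1].replace("/", "_").split(".")[0]


paths = [
    "CodeT/DIVERSE/data/sqa/split1/test.jsonl",
    "CodeT/DIVERSE/data/sqa/split2/train.jsonl",
    "CodeT/DIVERSE/data/gsm8k/train.jsonl",
]
names = [dataset_name(p) for p in paths]
print(names)
# → ['diverse_sqa_split1_test', 'diverse_sqa_split2_train', 'diverse_gsm8k_train']
```

The converted trees for `sqa/split1/test.jsonl` would therefore end up in `./diverse_sqa_split1_test.json`.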

In [55]:
ds = pd.read_json("CodeT/DIVERSE/data/gsm8k/test.jsonl", lines=True)

In [60]:
ds

|  | context | samples | metadata |
| --- | --- | --- | --- |
| 0 | Question:\nA community is building a metal fen... | [She eats 3 and uses 4, so that is 7 eggs.\n16... | {'question': 'Janet’s ducks lay 16 eggs per da... |
| 1 | Question:\nThe white rabbit can hop 15 meters ... | [She eats 3 eggs for breakfast and uses 4 in m... | {'question': 'Janet’s ducks lay 16 eggs per da... |
| 2 | Question:\nA magician has a top hat with 20 re... | [Janet eats 3 eggs every morning and bakes 4 e... | {'question': 'Janet’s ducks lay 16 eggs per da... |
| 3 | Question:\nHilton had a box of 26 marbles that... | [Janet has 16 eggs per day.\nShe eats three fo... | {'question': 'Janet’s ducks lay 16 eggs per da... |
| 4 | Question:\nFor every sandwich that he eats, Sa... | [The robe takes 2 bolts of blue fiber and half... | {'question': 'A robe takes 2 bolts of blue fib... |
| ... | ... | ... | ... |
| 6590 | Question:\nDonna is catering for a party. She ... | [If the pizza is cut into 8 slices, then 7 piz... | {'question': 'Henry and 3 of his friends order... |
| 6591 | Question:\nMegan has read 32 books this year. ... | [The number of slices of pizza ordered is 7*8 ... | {'question': 'Henry and 3 of his friends order... |
| 6592 | Question:\nMr. Wells has a garden of flowers w... | [The number of slices per pizza is 8.\nThere a... | {'question': 'Henry and 3 of his friends order... |
| 6593 | Question:\nMelody planted sunflowers from two ... | [Each pizza has 8 slices.\nThe 4 friends want ... | {'question': 'Henry and 3 of his friends order... |


In [64]:
re.findall(r'Question: (.*?) Answer:', ds["context"].iloc[0].replace("\n", " "))

['A community is building a metal fence. Each fence panel is made of 3 metal sheets, and 2 metal beams. The fence is made of 10 fence panels. If each sheet is made of 10 metal rods and each metal beam is made of 4 metal rods, how many metal rods does the community need for the fence?',
 'John buys 3 dress shirts.  They sell for $20 each.  He also has to pay 10% tax on everything.  How much did he pay in total?',
 "Bob gets rent assistance because he's low-income. If he gets a raise of $0.50/hour and works 40 hours a week, how much more will he actually earn a week if his housing benefit is reduced by $60/month?",
 'Annie plants 3 pots of basil, 9 pots of rosemary, and 6 pots of thyme. Each basil plant has 4 leaves, each rosemary plant has 18 leaves, and each thyme plant has 30 leaves. How many leaves are there total?',
 'There are 7 mL of solution in each of 6 test tubes. Dr. Igor takes all of the solution and then evenly distributes it into 3 beakers. How many mL of solution are in ea

In [73]:
re.findall(r'Answer:(.*?)#', ds["context"].iloc[0].replace("\n", " "))


['In each panel, the metal sheets use 3 metal sheets * 10 metal rods = <<3*10=30>>30 metal rods. In each panel, the metal beams use 2 metal beams * 4 metal rods = <<2*4=8>>8 metal rods. So each panel uses 30 + 8 = <<30+8=38>>38 metal rods. The entire fence therefore needs 38 metal rods * 10 fence panels = <<38*10=380>>380 metal rods. ',
 'The shirts cost 3*$20=$<<3*20=60>>60 before tax The tax cost $60*.1=$<<60*.1=6>>6 So in total they paid $60+$6=$<<60+6=66>>66 ',
 "First find the total increase in Bob's earnings: $0.50/hour * 40 hours/week = $<<0.50*40=20>>20/week Then find the weekly decrease in Bob's housing assistance: $60/month / 4 weeks/month = $<<60/4=15>>15/week Then subtract the lost assistance from the increased wages to find Bob's net increase in money: $20/week - $15/week = $<<20-15=5>>5/week ",
 'First find the total number of basil leaves: 3 pots * 4 leaves/pot = <<3*4=12>>12 leaves Then find the total number of rosemary leaves: 9 pots * 18 leaves/pot = <<9*18=162>>162 l

In [72]:
ds["context"].iloc[0].replace("\n", " ")

"Question: A community is building a metal fence. Each fence panel is made of 3 metal sheets, and 2 metal beams. The fence is made of 10 fence panels. If each sheet is made of 10 metal rods and each metal beam is made of 4 metal rods, how many metal rods does the community need for the fence? Answer:In each panel, the metal sheets use 3 metal sheets * 10 metal rods = <<3*10=30>>30 metal rods. In each panel, the metal beams use 2 metal beams * 4 metal rods = <<2*4=8>>8 metal rods. So each panel uses 30 + 8 = <<30+8=38>>38 metal rods. The entire fence therefore needs 38 metal rods * 10 fence panels = <<38*10=380>>380 metal rods. #### 380  Question: John buys 3 dress shirts.  They sell for $20 each.  He also has to pay 10% tax on everything.  How much did he pay in total? Answer:The shirts cost 3*$20=$<<3*20=60>>60 before tax The tax cost $60*.1=$<<60*.1=6>>6 So in total they paid $60+$6=$<<60+6=66>>66 #### 66  Question: Bob gets rent assistance because he's low-income. If he gets a raise
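
In the flattened context, each worked solution is followed by `#### ` and the final result, which the `Answer:(.*?)#` pattern discards. If the final result is wanted as well, a sketch along the following lines may help. The sample string is hypothetical, and the pattern assumes that a solution never contains `####` itself.

```python
import re

# Hypothetical flattened context in the same "Answer:... #### result" layout.
flat = (
    "Question: Tom has 2 apples and buys 3 more. How many apples does he have? "
    "Answer:Tom has 2 + 3 = <<2+3=5>>5 apples. #### 5  "
    "Question: A box holds 4 pens. How many pens are in 3 boxes? "
    "Answer:There are 4 * 3 = <<4*3=12>>12 pens. #### 12  "
)

# Capture the worked solution and the token that follows the "####" marker.
pairs = re.findall(r"Answer:(.*?)#### (\S+)", flat)
solutions = [(text.strip(), final) for text, final in pairs]
print(solutions)
```

Each tuple holds the solution text and its final result as a string, so numeric conversion is left to the caller.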