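Prepare instruction fine-tuning (IFT) datasets as records of the form `{"input_text": ..., "output_text": ...}`. The `input_text` field is rendered with the model's chat template so the data can be used to fine-tune chat models.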

In [1]:
import re
import pandas as pd
from transformers import AutoTokenizer
from datasets import concatenate_datasets, load_dataset, load_from_disk

In [2]:
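def create_chat_text(tokenizer, user_prompt, system="You are a helpful AI assistant."):
    """Render a system + user exchange with the tokenizer's chat template.

    Returns the message list and the rendered text, which ends with the
    assistant header so the reply can be appended directly.
    """
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return messages, text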
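
To make the target format concrete without downloading the tokenizer, here is a hand-written approximation of the rendered prompt, reconstructed from the printed examples in this notebook. The function name `format_llama3_prompt` is hypothetical, and this is only a sketch: the real template lives in the tokenizer and can differ between versions (some outputs further down also contain "Cutting Knowledge Date" and "Today Date" lines in the system block).

```python
def format_llama3_prompt(user_prompt, system="You are a helpful AI assistant."):
    """Approximate the rendered prompt; NOT the tokenizer's actual template."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_prompt}<|eot_id|>"
        # The trailing assistant header is what add_generation_prompt=True adds.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = format_llama3_prompt("What is 2 + 2?")
print(prompt)
```

Because the prompt already ends with the assistant header, `output_text` can hold the bare reply with no extra markup.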

In [3]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

### oasst

In [4]:
def process_example(example):
    t = example["text"]

    # Each record is a transcript of "### Human: ..." and "### Assistant: ..." turns.
    human_prefix, asst_prefix = "### Human: ", "### Assistant: "
    human_matches = list(re.finditer(re.escape(human_prefix), t))
    asst_matches = list(re.finditer(re.escape(asst_prefix), t))
    human_start_idx, _ = human_matches[0].span(0)
    asst_start_idx, _ = asst_matches[0].span(0)
    # Keep only the first exchange: the reply ends where the second human turn starts.
    if len(human_matches) > 1:
        asst_end_idx = human_matches[1].span(0)[0]
    else:
        asst_end_idx = len(t)

    human_prompt = t[human_start_idx+len(human_prefix):asst_start_idx]
    asst_prompt = t[asst_start_idx+len(asst_prefix):asst_end_idx]

    _,input_text = create_chat_text(tokenizer, human_prompt)
    output_text = asst_prompt
    return {"input_text":input_text, "output_text":output_text}
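
The slicing logic above can be checked in isolation on a made-up transcript. This is a minimal sketch: `split_first_turn` and the sample string are illustrative and not part of the notebook, and the tokenizer step is left out.

```python
import re

HUMAN_PREFIX, ASST_PREFIX = "### Human: ", "### Assistant: "

def split_first_turn(text):
    """Return (human_prompt, assistant_reply) for the first exchange only."""
    human_starts = [m.start() for m in re.finditer(re.escape(HUMAN_PREFIX), text)]
    asst_starts = [m.start() for m in re.finditer(re.escape(ASST_PREFIX), text)]
    # The first reply ends where the second human turn begins, if there is one.
    reply_end = human_starts[1] if len(human_starts) > 1 else len(text)
    prompt = text[human_starts[0] + len(HUMAN_PREFIX):asst_starts[0]]
    reply = text[asst_starts[0] + len(ASST_PREFIX):reply_end]
    return prompt, reply

# Made-up two-turn transcript in the same layout as the dataset's "text" field.
sample = "### Human: Hi there### Assistant: Hello!### Human: Bye### Assistant: Goodbye"
first_prompt, first_reply = split_first_turn(sample)
print(repr(first_prompt), repr(first_reply))
```

Later turns are discarded, so multi-turn conversations contribute only their opening exchange.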

In [5]:
train_ds1 = load_dataset("timdettmers/openassistant-guanaco")['train']

Downloading readme: 100%|██████████| 395/395 [00:00<00:00, 1.53MB/s]
Repo card metadata block was not found. Setting CardData to empty.
Downloading data: 100%|██████████| 20.9M/20.9M [00:00<00:00, 54.2MB/s]
Downloading data: 100%|██████████| 1.11M/1.11M [00:00<00:00, 8.81MB/s]
Generating train split: 100%|██████████| 9846/9846 [00:00<00:00, 142261.10 examples/s]
Generating test split: 100%|██████████| 518/518 [00:00<00:00, 89560.55 examples/s]


In [6]:
train_ds1_processed = train_ds1.map(process_example)

Map: 100%|██████████| 9846/9846 [00:01<00:00, 7352.54 examples/s]


In [7]:
len(train_ds1_processed)

9846

In [8]:
print(train_ds1_processed[0]['input_text'])
print(train_ds1_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


"Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.

Recent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies

### orca-math

In [9]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['question'])
    return {"input_text":input_text, "output_text":example['answer']}

In [10]:
train_ds2 = load_dataset("microsoft/orca-math-word-problems-200k")
train_ds2 = train_ds2['train'].shuffle(42).select(range(10000))

Downloading readme: 100%|██████████| 6.91k/6.91k [00:00<00:00, 14.5MB/s]
Downloading data: 100%|██████████| 84.2M/84.2M [00:01<00:00, 69.3MB/s]
Generating train split: 100%|██████████| 200035/200035 [00:01<00:00, 193137.93 examples/s]


In [11]:
train_ds2_processed = train_ds2.map(process_example)

Map: 100%|██████████| 10000/10000 [00:01<00:00, 5633.59 examples/s]


In [12]:
train_ds2_processed[3]

{'question': 'Quentin, Skylar, and Colten have a total of 383 chickens. Quentin has 25 more than double the chickens that Skylar has. Skylar has 4 less than a certain multiple of the number of chickens that Colten has. Colten has 37 chickens. What is the multiple of the number of chickens Colten has that Skylar has 4 less than?',
 'answer': "Let's denote the number of chickens that Skylar has as S and the number of chickens that Quentin has as Q. We know that Colten has 37 chickens.\n\nAccording to the information given:\n\n1) Quentin has 25 more than double the chickens that Skylar has:\nQ = 2S + 25\n\n2) The total number of chickens is 383:\nQ + S + 37 = 383\n\nNow, let's substitute the expression for Q from the first equation into the second equation:\n\n(2S + 25) + S + 37 = 383\n3S + 62 = 383\n3S = 383 - 62\n3S = 321\nS = 321 / 3\nS = 107\n\nNow we know Skylar has 107 chickens.\n\nThe problem states that Skylar has 4 less than a certain multiple of the number of chickens that Colte

In [13]:
len(train_ds2_processed)

10000

In [14]:
print(train_ds2_processed[0]['input_text'])
print(train_ds2_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Sally had 13 peaches at her roadside fruit dish.  She went to the orchard and picked peaches to stock up. She picked 55 peaches. There are _____ peaches now.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Sally originally had 13 peaches. She picked 55 more peaches. To find out the total number of peaches she has now, we add the two amounts together:

13 (original peaches) + 55 (picked peaches) = 68 peaches

So, there are 68 peaches now.


### meta-math-qa

In [15]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['query'])
    return {"input_text":input_text, "output_text":example['response']}

In [16]:
train_ds3 = load_dataset("meta-math/MetaMathQA")
train_ds3 = train_ds3['train'].shuffle(42).select(range(10000))

Downloading readme: 100%|██████████| 4.45k/4.45k [00:00<00:00, 11.9MB/s]
Downloading data: 100%|██████████| 396M/396M [00:10<00:00, 36.2MB/s] 
Generating train split: 100%|██████████| 395000/395000 [00:16<00:00, 24205.24 examples/s]


In [17]:
train_ds3_processed = train_ds3.map(process_example)

Map: 100%|██████████| 10000/10000 [00:01<00:00, 5123.66 examples/s]


In [18]:
train_ds3_processed[3]

{'type': 'MATH_SV',
 'query': 'Compute the sum of the squares of the roots of the equation \\[x^{2018} +44x^{2015} + 3x^3 + 404 = X The answer is 0. What is the value of unknown variable X?',
 'original_question': 'Compute the sum of the squares of the roots of the equation \\[x^{2018} +44x^{2015} + 3x^3 + 404 = 0.\\]',
 'response': 'To solve this problem, we need to determine the value of x that will make the sum of the squares of the roots of the equation equal to 0.\nThe equation is given as:\n\\[x^{2018} + 44x^{2015} + 3x^3 + 404 = X\\]\nTo find the sum of the squares of the roots, we need to find the roots of the equation and then square them.\nSince the answer is 0, it means that the sum of the squares of the roots must be equal to 0.\nTo make the sum of squares 0, we need to find the values of x that will make each square equal to 0.\nIf a square is equal to 0, it means that the value inside the square must be 0.\nIn this case, we need to set each term inside the square to 0 and

In [19]:
len(train_ds3_processed)

10000

In [20]:
print(train_ds3_processed[0]['input_text'])
print(train_ds3_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

If Anna wants to create a smiley face shape using red and yellow tulips, she requires 8 red tulips for each eye and 18 red tulips for the smile. Additionally, she needs 9 times the number of tulips in the smile to create the yellow background of the face. What is the total number of tulips that Anna needs?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Anna needs 8 red tulips for each eye, so for both eyes she needs 8 * 2 = 16 red tulips.
She also needs 18 red tulips for the smile.
The total number of red tulips she needs is 16 + 18 = 34 red tulips.
For the yellow background, she needs 9 times the number of tulips in the smile, so she needs 9 * 18 = 162 yellow tulips.
The total number of tulips she needs is 34 + 162 = 196 tulips.
#### 196
The answer is: 196


### ultrafeedback

In [21]:
def process_example(example):
    chosen = example['chosen']
    _,input_text = create_chat_text(tokenizer, chosen[0]['content'])
    return {"input_text":input_text, "output_text":chosen[1]['content']}

In [22]:
train_ds4 = load_dataset('HuggingFaceH4/ultrafeedback_binarized')
train_ds4 = train_ds4['train_sft'].shuffle(42).select(range(10000))
train_ds4

Downloading readme: 100%|██████████| 6.77k/6.77k [00:00<00:00, 18.5MB/s]
Downloading data: 100%|██████████| 226M/226M [00:02<00:00, 98.6MB/s] 
Downloading data: 100%|██████████| 226M/226M [00:02<00:00, 109MB/s]  
Downloading data: 100%|██████████| 7.29M/7.29M [00:00<00:00, 15.4MB/s]
Downloading data: 100%|██████████| 3.72M/3.72M [00:00<00:00, 13.2MB/s]
Downloading data: 100%|██████████| 184M/184M [00:01<00:00, 97.1MB/s] 
Downloading data: 100%|██████████| 3.02M/3.02M [00:00<00:00, 12.9MB/s]
Generating train_prefs split: 100%|██████████| 61135/61135 [00:01<00:00, 52191.18 examples/s]
Generating train_sft split: 100%|██████████| 61135/61135 [00:01<00:00, 52801.08 examples/s]
Generating test_prefs split: 100%|██████████| 2000/2000 [00:00<00:00, 52181.27 examples/s]
Generating test_sft split: 100%|██████████| 1000/1000 [00:00<00:00, 51043.60 examples/s]
Generating train_gen split: 100%|██████████| 61135/61135 [00:01<00:00, 58880.05 examples/s]
Generating test_gen split: 100%|██████████| 10

Dataset({
    features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
    num_rows: 10000
})

In [23]:
train_ds4_processed = train_ds4.map(process_example)

Map: 100%|██████████| 10000/10000 [00:03<00:00, 3315.64 examples/s]


In [24]:
len(train_ds4_processed)

10000

In [25]:
print(train_ds4_processed[0]['input_text'])
print(train_ds4_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Do you know something about crystallography and structure factor?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Crystallography is the science of the arrangement of atoms in solids. It is a vast and interdisciplinary field that has applications in physics, chemistry, materials science, biology, and engineering.

The structure factor is a mathematical function that is used to describe the diffraction of waves by a crystal. It is a complex number that is related to the atomic positions in the crystal.

The structure factor can be used to calculate the intensity of the diffracted waves. This information can be used to determine the atomic positions in the crystal and to study the structure of materials.

Crystallography is a powerful tool for understanding the structure of materials. It has been used to determine the structures of many 

### ultrachat-200k

In [26]:
def process_example(example):
    messages = example['messages']
    _,input_text = create_chat_text(tokenizer, messages[0]['content'])
    return {"input_text":input_text, "output_text":messages[1]['content']}

In [27]:
train_ds5 = load_dataset('HuggingFaceH4/ultrachat_200k', name='default')
train_ds5 = train_ds5['train_sft'].shuffle(42).select(range(10000))
train_ds5

Downloading readme: 100%|██████████| 4.44k/4.44k [00:00<00:00, 6.47MB/s]
Downloading data: 100%|██████████| 244M/244M [00:02<00:00, 114MB/s]  
Downloading data: 100%|██████████| 244M/244M [00:01<00:00, 127MB/s]  
Downloading data: 100%|██████████| 244M/244M [00:02<00:00, 97.3MB/s] 
Downloading data: 100%|██████████| 81.2M/81.2M [00:01<00:00, 66.8MB/s]
Downloading data: 100%|██████████| 244M/244M [00:02<00:00, 109MB/s]  
Downloading data: 100%|██████████| 243M/243M [00:02<00:00, 105MB/s]  
Downloading data: 100%|██████████| 243M/243M [00:02<00:00, 115MB/s]  
Downloading data: 100%|██████████| 80.4M/80.4M [00:00<00:00, 83.1MB/s]
Generating train_sft split: 100%|██████████| 207865/207865 [00:07<00:00, 28681.99 examples/s]
Generating test_sft split: 100%|██████████| 23110/23110 [00:00<00:00, 28862.87 examples/s]
Generating train_gen split: 100%|██████████| 256032/256032 [00:06<00:00, 38169.04 examples/s]
Generating test_gen split: 100%|██████████| 28304/28304 [00:00<00:00, 37343.84 example

Dataset({
    features: ['prompt', 'prompt_id', 'messages'],
    num_rows: 10000
})

In [28]:
train_ds5_processed = train_ds5.map(process_example)

Map: 100%|██████████| 10000/10000 [00:03<00:00, 3304.87 examples/s]


In [29]:
len(train_ds5_processed)

10000

In [30]:
print(train_ds5_processed[0]['input_text'])
print(train_ds5_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

How does the location of the Sydney Conservatorium of Music impact the academic and professional opportunities available to music students, and how does the conservatorium support student engagement with the music industry in Australia?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


The location of the Sydney Conservatorium of Music, which is situated in the heart of Sydney's cultural precinct, impacts both the academic and professional opportunities available to music students. The conservatorium is located near several major performing arts venues and organizations, including the Sydney Opera House, the Australian Broadcasting Corporation, and the Sydney Symphony Orchestra, providing students with easy access to performances, rehearsals, and networking opportunities.

One of the primary ways the conservatorium supports student engag

### open hermes 2.5

In [31]:
def process_example(example):
    messages = example['conversations']
    _,input_text = create_chat_text(tokenizer, messages[0]['value'])
    return {"input_text":input_text, "output_text":messages[1]['value']}
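
Indexing `messages[0]` and `messages[1]` assumes the conversation opens with a user turn followed by the assistant reply. If a conversation starts with a system turn, those indices would pair the system prompt with the user message instead. A role-aware variant could look like the sketch below. It assumes each turn carries a `from` field with labels such as `system`, `human`, and `gpt`; those labels, and the helper name `first_exchange`, are assumptions to check against the dataset card, and whether any sampled rows actually begin with a system turn is not verified here.

```python
def first_exchange(conversation, role_key="from", text_key="value",
                   user_roles=("human", "user"), assistant_roles=("gpt", "assistant")):
    """Return (system, prompt, reply) for the first user/assistant pair, or None."""
    system = None
    for i, turn in enumerate(conversation[:-1]):
        role = turn[role_key]
        if role == "system" and system is None:
            system = turn[text_key]
        elif role in user_roles and conversation[i + 1][role_key] in assistant_roles:
            return system, turn[text_key], conversation[i + 1][text_key]
    return None

# Made-up conversation that opens with a system turn.
conv = [
    {"from": "system", "value": "Be terse."},
    {"from": "human", "value": "2+2?"},
    {"from": "gpt", "value": "4"},
]
result = first_exchange(conv)
print(result)
```

Rows for which the helper returns `None` could be dropped with a filter before mapping.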

In [32]:
train_ds6 = load_dataset('teknium/OpenHermes-2.5')
train_ds6 = train_ds6['train'].shuffle(42).select(range(10000))
train_ds6

Downloading readme: 100%|██████████| 4.75k/4.75k [00:00<00:00, 11.0MB/s]
Downloading data: 100%|██████████| 1.94G/1.94G [00:15<00:00, 127MB/s] 
Generating train split: 100%|██████████| 1001551/1001551 [00:36<00:00, 27747.60 examples/s]


Dataset({
    features: ['custom_instruction', 'topic', 'model_name', 'model', 'skip_prompt_formatting', 'category', 'conversations', 'views', 'language', 'id', 'title', 'idx', 'hash', 'avatarUrl', 'system_prompt', 'source'],
    num_rows: 10000
})

In [33]:
train_ds6_processed = train_ds6.map(process_example)

Map: 100%|██████████| 10000/10000 [00:03<00:00, 2668.03 examples/s]


In [34]:
len(train_ds6_processed)

10000

In [35]:
print(train_ds6_processed[0]['input_text'])
print(train_ds6_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

A Sierpinski arrowhead curve is created by dividing an equilateral triangle into three smaller equilateral triangles and removing the central one, then doing the same for each remaining smaller triangle. If the length of the original triangle is 6cm, what is the perimeter of the Sierpinski arrowhead curve after the third iteration?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


After each iteration, the perimeter of the Sierpinski arrowhead curve increases by a factor of 2. Let's calculate the perimeter after the third iteration:

1st iteration:
Perimeter = 6 cm * 2 = 12 cm

2nd iteration:
Perimeter = 12 cm * 2 = 24 cm

3rd iteration:
Perimeter = 24 cm * 2 = 48 cm

So, the perimeter of the Sierpinski arrowhead curve after the third iteration is 48 cm.


### starcoder-2

In [36]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['prompt'])
    return {"input_text":input_text, "output_text":example['response']}

In [37]:
train_ds7 = load_dataset('bigcode/self-oss-instruct-sc2-exec-filter-50k')
train_ds7 = train_ds7['train'].shuffle(42).select(range(10000))
train_ds7

Downloading readme: 100%|██████████| 922/922 [00:00<00:00, 3.53MB/s]
Downloading data: 100%|██████████| 90.1M/90.1M [00:02<00:00, 42.3MB/s]
Generating train split: 100%|██████████| 50661/50661 [00:00<00:00, 57192.55 examples/s]


Dataset({
    features: ['fingerprint', 'sha1', 'seed', 'response', 'concepts', 'prompt', 'instruction', 'id'],
    num_rows: 10000
})

In [38]:
train_ds7_processed = train_ds7.map(process_example)

Map: 100%|██████████| 10000/10000 [00:02<00:00, 3742.79 examples/s]


In [39]:
len(train_ds7_processed)

10000

In [40]:
print(train_ds7_processed[0]['input_text'])
print(train_ds7_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Provide the best response to a given instruction. Follow the following steps to craft your response:
1. reason about the given instruction
2. provide a high-quality solution
3. offer a concise explanation
4. write tests to verify the correctness your solution

## Example 1
### Instruction
Design a Python function that takes a sorted array and a target value, and return a valid index where target can be inserted to maintain the array's sorted order. Optimize the function to run in logarithmic time complexity.

For example, given `array = [1, 3, 5, 5, 6]` and `target = 5`, the function should return either 2 or 3 because 5 presents at both indices 2 and 3.

### Response
[Reasoning]
To solve this problem efficiently and ensure logarithmic time complexity, we can use a binary search algorithm. Compared with a standard binary search that looks for a

### The Tome

In [41]:
def process_example(example):
    messages = example['conversations']
    _,input_text = create_chat_text(tokenizer, messages[0]['value'])
    return {"input_text":input_text, "output_text":messages[1]['value']}

In [42]:
train_ds8 = load_dataset('arcee-ai/The-Tome')
train_ds8 = train_ds8['train'].shuffle(42).select(range(10000))
train_ds8

Downloading readme: 100%|██████████| 1.75k/1.75k [00:00<00:00, 7.14MB/s]
Downloading data: 100%|██████████| 4.58G/4.58G [02:03<00:00, 37.0MB/s]
Generating train split: 100%|██████████| 1752473/1752473 [00:15<00:00, 114745.85 examples/s]


Dataset({
    features: ['conversations', 'dataset'],
    num_rows: 10000
})

In [43]:
train_ds8_processed = train_ds8.map(process_example)

Map: 100%|██████████| 10000/10000 [00:02<00:00, 3990.70 examples/s]


In [44]:
len(train_ds8_processed)

10000

In [45]:
print(train_ds8_processed[0]['input_text'])
print(train_ds8_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Google’s recent announcement to create 10,000 new jobs in Europe, 6,000 of which will be in the UK, has been welcomed by the Business Secretary. What is the main focus of Google's expansion in Europe and how does it impact the economies of the countries it operates in?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Google's recent announcement to create 10,000 new jobs in Europe, with 6,000 of those positions located in the UK, is a significant development that has been welcomed by the Business Secretary. This expansion is primarily focused on bolstering Google's presence in key areas such as engineering, sales, and marketing, as well as in new initiatives like artificial intelligence and cloud computing.

The impact of this expansion on the economies of the countries it operates in is multifaceted:

1. **Job Creation**: The creation 

### openhermes 200k unfiltered

In [46]:
def process_example(example):
    messages = example['conversations']
    _,input_text = create_chat_text(tokenizer, messages[0]['value'])
    return {"input_text":input_text, "output_text":messages[1]['value']}

In [47]:
train_ds9 = load_dataset("Crystalcareai/openhermes_200k_unfiltered")
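# Note: sort() is ascending by default, so this keeps the 10,000 rows with the
# lowest relevance_score; pass reverse=True to keep the highest-scoring rows.
train_ds9 = train_ds9['train'].sort("relevance_score").select(range(10000))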
train_ds9

Downloading readme: 100%|██████████| 31.0/31.0 [00:00<00:00, 143kB/s]
Downloading data: 100%|██████████| 335M/335M [00:10<00:00, 30.8MB/s] 
Generating train split: 100%|██████████| 184892/184892 [00:01<00:00, 159088.72 examples/s]


Dataset({
    features: ['conversations', 'relevance_score', 'id'],
    num_rows: 10000
})

In [48]:
train_ds9_processed = train_ds9.map(process_example)

Map: 100%|██████████| 10000/10000 [00:02<00:00, 4153.00 examples/s]


In [49]:
len(train_ds9_processed)

10000

In [50]:
print(train_ds9_processed[0]['input_text'])
print(train_ds9_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

If a company produces 5000 units of a product per day and sells them for $20 each, how much additional revenue would they generate if they increased their production by 10%?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


First, let's calculate the current daily revenue generated by selling 5000 units at $20 each:

Current daily revenue = Units sold * Price per unit
                                  = 5000 * $20
                                  = $100,000

Now, we need to find out how many additional units will be produced if the company increases its production by 10%. To do this, we multiply the current production (5000 units) by 10%:

Additional units = Current production * 10%
                            = 5000 * 0.1
                            = 500 units

So, the company will produce an additional 500 units per day.

Next, we ne

### Synth reasoning alpaca combined

In [51]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['instruction'])
    return {"input_text":input_text, "output_text":example['output']}

In [52]:
train_ds10 = load_dataset("Crystalcareai/synthetic_reasoning_natural_Alpaca_Combined")
train_ds10 = train_ds10['train'].shuffle(42).select(range(1000))
train_ds10

Downloading data: 100%|██████████| 4.20M/4.20M [00:00<00:00, 19.2MB/s]
Generating train split: 100%|██████████| 11000/11000 [00:00<00:00, 428459.20 examples/s]


Dataset({
    features: ['instruction', 'output'],
    num_rows: 1000
})

In [53]:
train_ds10_processed = train_ds10.map(process_example)

Map: 100%|██████████| 1000/1000 [00:00<00:00, 5268.39 examples/s]


In [54]:
len(train_ds10_processed)

1000

In [55]:
print(train_ds10_processed[0]['input_text'])
print(train_ds10_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

If Bob is hot, then Bob is big.
If Bob is cold, then Bob is bad.
If Bob is slow and blue, then Bob is beautiful.
If person is round, then person is clean.
If Bob is strong, then Bob is happy.
Fact:
Bob is nice and cool.
The following can be determined about Bob:<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Bob is bad.


### Concatenate

In [56]:
processed_datasets = [
    train_ds1_processed, train_ds2_processed, train_ds3_processed,
    train_ds4_processed, train_ds5_processed, train_ds6_processed,
    train_ds7_processed, train_ds8_processed, train_ds9_processed,
    train_ds10_processed,
]
# Keep only the two shared columns so the schemas match before concatenation.
concat_train_ds = concatenate_datasets(
    [ds.select_columns(['input_text', 'output_text']) for ds in processed_datasets]
)

In [57]:
concat_train_ds = concat_train_ds.shuffle(42)

In [58]:
concat_train_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v0")

Saving the dataset (1/1 shards): 100%|██████████| 90846/90846 [00:01<00:00, 89638.51 examples/s]
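
As a sanity check, the saved row count should equal the sum of the per-source sizes reported by the `len(...)` cells above. The sketch below simply redoes that arithmetic; the dictionary keys reuse the section headings as labels.

```python
# Row counts as printed by the len(...) cells for each processed dataset.
mixture_sizes = {
    "oasst": 9846,
    "orca-math": 10000,
    "meta-math-qa": 10000,
    "ultrafeedback": 10000,
    "ultrachat-200k": 10000,
    "open hermes 2.5": 10000,
    "starcoder-2": 10000,
    "The Tome": 10000,
    "openhermes 200k unfiltered": 10000,
    "Synth reasoning alpaca combined": 1000,
}
total = sum(mixture_sizes.values())
print(total)  # 90846, matching the "Saving the dataset" progress line
```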


In [11]:
concat_train_ds = load_from_disk("/workspace/data/llama_large_mix_dataset_v0")

In [9]:
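def tokenize(exs):
    # The tokenizer adds special tokens by default, so these counts may include
    # an extra BOS token per field on top of the one already in input_text.
    return {"input_token_ids" : tokenizer(exs['input_text'])['input_ids'],
            "output_token_ids" : tokenizer(exs['output_text'])['input_ids']}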

In [12]:
concat_train_ds = concat_train_ds.map(tokenize, batched=True)

Map: 100%|██████████| 90846/90846 [00:23<00:00, 3921.58 examples/s]


In [13]:
def token_lengths(ex):
    return {"token_length": len(ex['input_token_ids']) + len(ex['output_token_ids'])}

In [14]:
concat_train_ds = concat_train_ds.map(token_lengths)

Map: 100%|██████████| 90846/90846 [00:30<00:00, 2994.03 examples/s]


In [7]:
concat_train_ds = load_from_disk("/workspace/data/llama_large_mix_dataset_v0_1536")

In [8]:
pd.Series(concat_train_ds['token_length']).value_counts(normalize=True, bins=[0,512,1024,1536,2048,4096]).sort_index().cumsum()


(-0.001, 512.0]     0.687703
(512.0, 1024.0]     0.926319
(1024.0, 1536.0]    1.000000
(1536.0, 2048.0]    1.000000
(2048.0, 4096.0]    1.000000
Name: proportion, dtype: float64
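
The cumulative proportions above answer the question "what fraction of examples fits within each length budget?". The same quantity can be computed without pandas; the sketch below uses made-up token counts and a hypothetical helper `cumulative_share`.

```python
from bisect import bisect_right

def cumulative_share(lengths, edges):
    """Map each edge to the fraction of lengths that are <= that edge."""
    ordered = sorted(lengths)
    return {edge: bisect_right(ordered, edge) / len(ordered) for edge in edges}

toy_lengths = [100, 300, 400, 512, 600, 900, 1200, 1500]  # made-up token counts
shares = cumulative_share(toy_lengths, [512, 1024, 1536])
print(shares)
```

The bins are right-inclusive, whereas the filters in this notebook use a strict `<`, so an example of exactly 1024 tokens falls in the `(512.0, 1024.0]` bin but is dropped by the `< 1024` filter.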

In [17]:
concat_train_filtered_ds = concat_train_ds.filter(lambda ex: ex["token_length"] < 1024)

Filter: 100%|██████████| 90846/90846 [00:17<00:00, 5123.26 examples/s]


In [18]:
concat_train_filtered_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v0_1024")

Saving the dataset (1/1 shards): 100%|██████████| 84289/84289 [00:00<00:00, 84994.93 examples/s]


In [19]:
concat_train_filtered_ds = concat_train_ds.filter(lambda ex: ex["token_length"] < 1536)

Filter: 100%|██████████| 90846/90846 [00:17<00:00, 5145.87 examples/s]


In [20]:
concat_train_filtered_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v0_1536")

Saving the dataset (1/1 shards): 100%|██████████| 90456/90456 [00:01<00:00, 74953.06 examples/s]


### Dataset Mixture v1

- Dedup at prompt level.
- Add training sets of academic benchmarks.

In [4]:
from datasets import Dataset

In [5]:
concat_train_ds = load_from_disk("/workspace/data/llama_large_mix_dataset_v0")

In [6]:
concat_train_ds

Dataset({
    features: ['input_text', 'output_text'],
    num_rows: 90846
})

In [7]:
import hashlib

def compute_hash(text):
    return hashlib.md5(text.encode('utf-8')).hexdigest()

# Hashes of the prompts seen so far
unique_hashes = set()

# Keep a row only the first time its prompt hash appears
def should_keep(row):
    hash_value = compute_hash(row['input_text'])
    if hash_value in unique_hashes:
        return False
    unique_hashes.add(hash_value)
    return True

# Apply the deduplication
deduplicated_dataset = concat_train_ds.filter(should_keep)

In [8]:
deduplicated_dataset

Dataset({
    features: ['input_text', 'output_text'],
    num_rows: 84078
})
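
Prompt-level deduplication drops 6,768 of the 90,846 rows here. The behaviour is easy to verify on toy records: the first row for each prompt is kept, and later rows with the same prompt are dropped even when their answers differ. The sketch below is a plain-Python stand-in for the filter; `dedup_by_prompt` and the sample rows are illustrative.

```python
import hashlib

def dedup_by_prompt(rows, key="input_text"):
    """Keep the first row for each distinct prompt, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        digest = hashlib.md5(row[key].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept

rows = [
    {"input_text": "a", "output_text": "1"},
    {"input_text": "b", "output_text": "2"},
    {"input_text": "a", "output_text": "3"},  # duplicate prompt, different answer
]
kept = dedup_by_prompt(rows)
print(kept)
```

Because the filter function relies on shared state, it is probably safest to run it in a single process; with multiple workers each process would likely keep its own copy of the seen set.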

In [11]:
print(deduplicated_dataset[0]['input_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Provide the best response to a given instruction. Follow the following steps to craft your response:
1. reason about the given instruction
2. provide a high-quality solution
3. offer a concise explanation
4. write tests to verify the correctness your solution

## Example 1
### Instruction
Here are two special formulas:

$$
f_1(a, b) = (a + 1) \cdot (b + 1) - 1
$$

$$
f_2(k) = \begin{cases}
    \frac{(k + 1)^2}{2} + k + 1 & \text{if } k \text{ is odd} \\
    \frac{k^2}{2} + 2k + 1 & \text{if } k \text{ is even}
\end{cases}
$$

Write a Python function to return $f_2(f_1(a, b))$ for given `a` and `b`.

### Response
[Reasoning]
Based on the formulas you provided, we can define two Python functions, `f1(a, b)` and `f2(k)`, respectively, and then combine them to calculate $f2(f1(a, b))$ f

#### ARC-E

In [44]:
import numpy as np

In [90]:
ds = load_dataset("ai2_arc", name="ARC-Easy")
arc_e_train_samples = ds['train'].shuffle(42).select(range(25,len(ds['train'])))

In [93]:
arc_e_train_samples

Dataset({
    features: ['id', 'question', 'choices', 'answerKey'],
    num_rows: 2226
})

In [92]:
def process_arc_example(ex):
    choice_texts = ex['choices']['text']
    choice_labels = ex['choices']['label']
    # Render each option as "<label>. <text>", e.g. "A. produced in vacuoles."
    choices_list = [f"{l}. {c}" for c, l in zip(choice_texts, choice_labels)]

    choices = "\n".join(choices_list)
    question = ex["question"] + f"\n\n{choices}"

    # The target is the full text of the correct option, not just its label.
    answer_idx = choice_labels.index(ex['answerKey'])
    answer = choices_list[answer_idx]
    _,input_text = create_chat_text(tokenizer, question)
    return {"input_text":input_text, "output_text":answer}
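
The multiple-choice formatting can be tried without the tokenizer. The sketch below mirrors the same layout with a hypothetical helper `format_multiple_choice`, using the question that appears in the printed ARC-E example in this notebook.

```python
def format_multiple_choice(question, labels, texts, answer_key):
    """Return (prompt, answer) with options rendered as '<label>. <text>'."""
    options = [f"{label}. {text}" for label, text in zip(labels, texts)]
    prompt = question + "\n\n" + "\n".join(options)
    # Look the answer up by label rather than assuming a fixed A-D ordering.
    answer = options[labels.index(answer_key)]
    return prompt, answer

mc_prompt, mc_answer = format_multiple_choice(
    "Living organisms require energy for biological processes. "
    "Chemical energy in a plant cell is",
    ["A", "B", "C", "D"],
    ["produced in vacuoles.", "converted from solar energy.",
     "developed by centrioles.", "stored as kinetic energy."],
    "B",
)
print(mc_prompt)
print(mc_answer)
```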

In [94]:
arc_e_train_ds_processed = (arc_e_train_samples.shuffle(42).select(range(1000)).map(process_arc_example)
                            .select_columns(['input_text', 'output_text']))

In [95]:
print(arc_e_train_ds_processed[0]['input_text'])
print(arc_e_train_ds_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Living organisms require energy for biological processes. Chemical energy in a plant cell is

A. produced in vacuoles.
B. converted from solar energy.
C. developed by centrioles.
D. stored as kinetic energy.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


B. converted from solar energy.


#### ARC-C

In [96]:
ds = load_dataset("ai2_arc", name="ARC-Challenge")
arc_c_train_samples = ds['train'].shuffle(42).select(range(25,len(ds['train'])))

In [99]:
arc_c_train_samples

Dataset({
    features: ['id', 'question', 'choices', 'answerKey'],
    num_rows: 1094
})

In [97]:
arc_c_train_ds_processed = (arc_c_train_samples.shuffle(42).select(range(1000)).map(process_arc_example)
                            .select_columns(['input_text', 'output_text']))

In [98]:
print(arc_c_train_ds_processed[0]['input_text'])
print(arc_c_train_ds_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Carbon on Earth is found in both living and nonliving matter. In order for carbon to be continuously available, it must be recycled. Through which process is carbon made available in the atmosphere?

A. formation of fossil fuels
B. layering of soil
C. plant photosynthesis
D. forest fires<|eot_id|><|start_header_id|>assistant<|end_header_id|>


D. forest fires


#### BoolQ

In [100]:
ds = load_dataset("boolq")
boolq_train_samples = ds['train']
boolq_train_samples

Dataset({
    features: ['question', 'answer', 'passage'],
    num_rows: 9427
})

In [101]:
def process_boolq_example(ex):
	question = "Passage: " + ex["passage"] +"\n\nQuestion: According to the passage, " + ex["question"] + "?"
	answer = "Yes" if ex['answer'] else "No"
	_,input_text = create_chat_text(tokenizer, question)
	return {"input_text":input_text, "output_text":answer}

In [102]:
boolq_train_ds_processed = (boolq_train_samples.shuffle(42).select(range(1000)).map(process_boolq_example)
                            .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [103]:
print(boolq_train_ds_processed[0]['input_text'])
print(boolq_train_ds_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Passage: Henry Daniel Mills is a fictional character in ABC's television series Once Upon a Time. Henry is the boy Emma Swan gave up to adoption; Regina Mills adopted him. Henry was originally portrayed as a child by Jared S. Gilmore, who won the Young Artist Award for Best Performance in a TV Series -- Leading Young Actor in 2012. For the show's seventh and final season, Andrew J. West later took over the role of Henry as an adult and father to a eight-year-old girl named Lucy, with Gilmore also making three appearances as Henry during the season.

Question: According to the passage, did henry die in once upon a time?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


No


### Commonsense QA

In [104]:
ds = load_dataset("commonsense_qa")
commonsenseqa_train_samples = ds['train'].shuffle(42).select(range(7,len(ds['train'])))
commonsenseqa_train_samples

Dataset({
    features: ['id', 'question', 'question_concept', 'choices', 'answerKey'],
    num_rows: 9734
})

In [105]:
def process_commonsenseqa_example(ex):
	
	choice_texts = ex['choices']['text']
	choice_labels = ex['choices']['label']
	# One line per option, e.g. "C. descend".
	choices_list = [f"{l}. {c}" for c, l in zip(choice_texts, choice_labels)]

	choices = "\n".join(choices_list)
	question = ex["question"] + f"\n\n{choices}"

	# The target is the full option line whose label matches answerKey.
	answer = choices_list[choice_labels.index(ex['answerKey'])]
	_,input_text = create_chat_text(tokenizer, question)
	return {"input_text":input_text, "output_text":answer}

In [106]:
commonsenseqa_train_ds_processed = (commonsenseqa_train_samples.shuffle(42).select(range(1000)).map(process_commonsenseqa_example)
                                    .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [107]:
print(commonsenseqa_train_ds_processed[0]['input_text'])
print(commonsenseqa_train_ds_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

As the counterweight began to rise the elevator began to what into the mine?

A. park
B. reduce
C. descend
D. fall
E. set<|eot_id|><|start_header_id|>assistant<|end_header_id|>


C. descend
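
The multiple-choice sections all follow the same recipe: render each option as `"<label>. <text>"`, append the options to the question, and use the full option line of the gold label as the target. A minimal tokenizer-free sketch of that recipe, using the example printed above; `format_multiple_choice` is a hypothetical helper, not a function from the notebook.

```python
def format_multiple_choice(question, labels, texts, answer_key):
    """Return (prompt, answer) for one multiple-choice item."""
    option_lines = [f"{label}. {text}" for label, text in zip(labels, texts)]
    prompt = question + "\n\n" + "\n".join(option_lines)
    # The target is the whole option line, not just the label.
    answer = option_lines[labels.index(answer_key)]
    return prompt, answer

prompt, answer = format_multiple_choice(
    "As the counterweight began to rise the elevator began to what into the mine?",
    ["A", "B", "C", "D", "E"],
    ["park", "reduce", "descend", "fall", "set"],
    "C",
)
print(prompt)
print(answer)
```

Looking the answer up by label rather than by position keeps the mapping correct even if a dataset uses labels other than consecutive letters.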


#### HellaSwag

In [108]:
ds = load_dataset("hellaswag", name=None)
hellaswag_train_samples = ds['train']
hellaswag_train_samples

Dataset({
    features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings', 'source_id', 'split', 'split_type', 'label'],
    num_rows: 39905
})

In [109]:
hellaswag_train_samples[0]

{'ind': 4,
 'activity_label': 'Removing ice from car',
 'ctx_a': 'Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles.',
 'ctx_b': 'then',
 'ctx': 'Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles. then',
 'endings': [', the man adds wax to the windshield and cuts it.',
  ', a person board a ski lift, while two men supporting the head of the person wearing winter clothes snow as the we girls sled.',
  ', the man puts on a christmas coat, knitted with netting.',
  ', the man continues removing the snow on his car.'],
 'source_id': 'activitynet~v_-1IBHYS3L-Y',
 'split': 'train',
 'split_type': 'indomain',
 'label': '3'}

In [110]:
import string
def preprocess_hellaswag(text):
	text = text.strip()
	# NOTE: Brackets are artifacts of the WikiHow dataset portion of HellaSwag.
	text = text.replace(" [title]", ". ")
	text = re.sub("\\[.*?\\]", "", text)
	text = text.replace("  ", " ")
	return text
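
To see what the cleanup does, the function can be run on a short string in the bracket-tagged WikiHow style. The sample below is made up for illustration (it is not a row from the dataset), and the function body is repeated only so the block runs on its own.

```python
import re

def preprocess_hellaswag(text):
    text = text.strip()
    # " [title]" marks a step title; turn it into a sentence break.
    text = text.replace(" [title]", ". ")
    # Drop any remaining bracketed tags such as [header] or [step].
    text = re.sub(r"\[.*?\]", "", text)
    # Removing tags leaves double spaces behind; collapse them.
    text = text.replace("  ", " ")
    return text

sample = "Health. [header] How to choose superfoods [title] Eat dark, leafy greens. [step] Kale is a green."
cleaned = preprocess_hellaswag(sample)
print(cleaned)
```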

In [111]:
def process_hellaswag_example(ex):
	
	ctx = ex["ctx_a"] + " " + ex["ctx_b"].capitalize()
	query = preprocess_hellaswag(ex['activity_label'] + ". " + ctx)
	choices = [preprocess_hellaswag(ending) for ending in ex['endings']]
	choices_list = [f"{l}. {c}" for c, l in zip(choices, string.ascii_uppercase)]
 
	choices = "\n".join(choices_list)
	question = query + f"\n\n{choices}"

	answer = choices_list[int(ex['label'])]
	_,input_text = create_chat_text(tokenizer, question)
	return {"input_text":input_text, "output_text":answer}

In [112]:
hellaswag_train_ds_processed = (hellaswag_train_samples.shuffle(42).select(range(1000)).map(process_hellaswag_example)
                                .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [113]:
print(hellaswag_train_ds_processed[4]['input_text'])
print(hellaswag_train_ds_processed[4]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Health. How to choose superfoods good for your heart. Eat dark, leafy greens. Kale is a green that can be used in a variety of ways and paired with almost any food. Dark, leafy greens like swiss chard or collard, mustard, and turnip greens are superfoods that are good for your heart.

A. However, those with a higher cholesterol content may use some other vegetables. One ingredient that has a higher cholesterol content is nuts, seeds, chickpeas, and fruit.
B. You can eat these greens whole whole, cut in half, or feed them whole. Leafy greens tend to be full of fiber and can calm the stomach, so try to eat steamed greens cooked whole.
C. They are full of antioxidants and have anti-inflammatory properties, which helps promote heart health. Use kale and other dark leafy greens to make s

#### Winogrande

In [114]:
ds = load_dataset("winogrande", name="winogrande_xl")
winogrande_train_samples = ds['train'].shuffle(42).select(range(5,len(ds['train'])))
winogrande_train_samples

Dataset({
    features: ['sentence', 'option1', 'option2', 'answer'],
    num_rows: 40393
})

In [138]:
import string

def process_winogrande_example(ex):
	idx = ex['sentence'].index("_")
	query = ex['sentence'][:idx].strip()
	remaining = ex['sentence'][idx+1:].strip()

	choices = [ex['option1'] + " " + remaining, ex['option2'] + " " + remaining]
	choices_list = [f"{l}. {c}" for c, l in zip(choices, string.ascii_uppercase)]
	choices = "\n".join(choices_list)

	question = f"Sentence: {query}\n\nWhich one is the most likely continuation?\n\n{choices}"
	
	answer = choices_list[int(ex['answer'])-1]
	_,input_text = create_chat_text(tokenizer, question)
	return {"input_text":input_text, "output_text":answer}

In [139]:
wino_train_ds_processed = (winogrande_train_samples.shuffle(42).select(range(1000)).map(process_winogrande_example)
                           .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [140]:
print(wino_train_ds_processed[1]['input_text'])
print(wino_train_ds_processed[1]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Sentence: Paying the electricity bill is an adult responsibility for Joseph but not for Jeffrey because

Which one is the most likely continuation?

A. Joseph lives on his own.
B. Jeffrey lives on his own.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


A. Joseph lives on his own.
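
The Winogrande conversion splits each sentence at the `_` blank: the text before the blank becomes the prompt, and each option is joined with the text after the blank to form a candidate continuation. A small tokenizer-free sketch of that split; `split_winogrande` is a hypothetical helper, and the input sentence is reconstructed from the printed example above rather than copied from the dataset.

```python
import string

def split_winogrande(sentence, option1, option2, answer):
    """Split at the blank and build lettered continuations."""
    idx = sentence.index("_")
    prefix = sentence[:idx].strip()
    remaining = sentence[idx + 1:].strip()
    completions = [f"{option1} {remaining}", f"{option2} {remaining}"]
    option_lines = [f"{letter}. {text}"
                    for letter, text in zip(string.ascii_uppercase, completions)]
    # `answer` is the string "1" or "2", so shift to a zero-based index.
    return prefix, option_lines, option_lines[int(answer) - 1]

prefix, option_lines, gold = split_winogrande(
    "Paying the electricity bill is an adult responsibility for Joseph "
    "but not for Jeffrey because _ lives on his own.",
    "Joseph", "Jeffrey", "1",
)
print(prefix)
print(option_lines)
print(gold)
```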


#### GSM8K

In [150]:
ds = load_dataset("gsm8k", name="main")
gsm8k_train_samples = ds['train'].shuffle(42).select(range(8, len(ds['train'])))	
gsm8k_train_samples

Dataset({
    features: ['question', 'answer'],
    num_rows: 7465
})

In [151]:
def process_gsm8k_example(ex):
	question = ex['question']
	answer = ex['answer']
	_,input_text = create_chat_text(tokenizer, question)
	return {"input_text":input_text, "output_text":answer}

In [152]:
gsm8k_train_ds_processed = (gsm8k_train_samples.shuffle(42).select(range(1000)).map(process_gsm8k_example)
                           .select_columns(['input_text', 'output_text']))

In [153]:
print(gsm8k_train_ds_processed[1]['input_text'])
print(gsm8k_train_ds_processed[1]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Each member of Greg’s softball team needs to buy one uniform made up of a shirt, a pair of pants, and socks. A shirt costs $7.50, a pair of pants cost $15, and socks cost $4.50 each if each team member buys the uniform items on their own. If they buy the items as a group, they are given a discount. A discounted shirt cost $6.75, a discounted pair of pants cost $13.50, and discounted socks cost $3.75. How much would their team of 12 save with the group discount?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


The cost of a regularly-priced uniform is $7.50 for the shirt + $15 for pants + $4.50 for socks = $<<7.5+15+4.5=27>>27.
The cost of a discounted uniform is $6.75 for the shirt + $13.50 for pants + $3.75 for socks = $<<6.75+13.5+3.75=24>>24.
By purchasing the discounted 

#### Note

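HumanEval, MMLU, MMLU-Pro, BIG-Bench Hard, and AGIEval don't have train sets, so they are left out of the mix. The loading cells below are kept commented out for reference.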

In [154]:
# only test
# ds = load_dataset("openai_humaneval")
# ds  

In [27]:
# MMLU_SUBJECTS = [
# 	"abstract_algebra",
# 	"anatomy",
# 	"astronomy",
# 	"business_ethics",
# 	"clinical_knowledge",
# 	"college_biology",
# 	"college_chemistry",
# 	"college_computer_science",
# 	"college_mathematics",
# 	"college_medicine",
# 	"college_physics",
# 	"computer_security",
# 	"conceptual_physics",
# 	"econometrics",
# 	"electrical_engineering",
# 	"elementary_mathematics",
# 	"formal_logic",
# 	"global_facts",
# 	"high_school_biology",
# 	"high_school_chemistry",
# 	"high_school_computer_science",
# 	"high_school_european_history",
# 	"high_school_geography",
# 	"high_school_government_and_politics",
# 	"high_school_macroeconomics",
# 	"high_school_mathematics",
# 	"high_school_microeconomics",
# 	"high_school_physics",
# 	"high_school_psychology",
# 	"high_school_statistics",
# 	"high_school_us_history",
# 	"high_school_world_history",
# 	"human_aging",
# 	"human_sexuality",
# 	"international_law",
# 	"jurisprudence",
# 	"logical_fallacies",
# 	"machine_learning",
# 	"management",
# 	"marketing",
# 	"medical_genetics",
# 	"miscellaneous",
# 	"moral_disputes",
# 	"moral_scenarios",
# 	"nutrition",
# 	"philosophy",
# 	"prehistory",
# 	"professional_accounting",
# 	"professional_law",
# 	"professional_medicine",
# 	"professional_psychology",
# 	"public_relations",
# 	"security_studies",
# 	"sociology",
# 	"us_foreign_policy",
# 	"virology",
# 	"world_religions",
# ]

In [155]:
# ds = load_dataset("cais/mmlu", name=MMLU_SUBJECTS[0])
# ds

In [156]:
# ds = load_dataset("TIGER-Lab/MMLU-Pro")
# MMLU_PRO_SUBJECTS = set(ds['validation']['category'])

In [157]:
# ds

In [158]:
# AGIEVAL_DATASETS = ["dmayhem93/agieval-aqua-rat",
# 					"dmayhem93/agieval-gaokao-english",
# 					"dmayhem93/agieval-logiqa-en",
# 					"dmayhem93/agieval-lsat-ar",
# 					"dmayhem93/agieval-lsat-lr",
# 					"dmayhem93/agieval-lsat-rc",
# 					"dmayhem93/agieval-sat-en-without-passage",
# 					"dmayhem93/agieval-sat-en",
# 					"dmayhem93/agieval-sat-math",
# 					"hails/agieval-math"
# ]

# ds = load_dataset(AGIEVAL_DATASETS[0])
# ds

In [159]:
# BIGBENCH_HARD_TASK_NAMES =  ['boolean_expressions', 
# 						'causal_judgement', 
# 						'date_understanding', 
# 						'disambiguation_qa', 
# 						'dyck_languages',
# 						'formal_fallacies',
# 						'geometric_shapes',
# 						'hyperbaton',
# 						'logical_deduction_three_objects',
# 						'logical_deduction_five_objects',
# 						'logical_deduction_seven_objects',
# 						'movie_recommendation',
# 						'multistep_arithmetic_two',
# 						'navigate',
# 						'object_counting',
# 						'penguins_in_a_table',
# 						'reasoning_about_colored_objects',
# 						'ruin_names',
# 						'salient_translation_error_detection',
# 						'snarks', 
# 						'sports_understanding', 
# 						'temporal_sequences',
# 						'tracking_shuffled_objects_three_objects',					
# 						'tracking_shuffled_objects_five_objects',
# 						'tracking_shuffled_objects_seven_objects',
# 						'web_of_lies',
# 						'word_sorting']

# ds = load_dataset("maveriq/bigbenchhard", name=BIGBENCH_HARD_TASK_NAMES[0])
# ds

#### SQuAD v2

In [179]:
ds = load_dataset("rajpurkar/squad_v2")
squad_train_samples = ds['train']
squad_train_samples

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 130319
})

In [193]:
def process_squad_example(ex):
	context = ex['context']
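	question = ex['question']
	question = f"Context: {context}" + "\n\n" + f"Question: {question}"
	answer = ex['answers']['text']
	if answer:
		# Upper-case only the first character; str.capitalize() would also
		# lower-case the rest of the span (e.g. proper nouns).
		answer = answer[0][:1].upper() + answer[0][1:]
	else:
		# SQuAD v2 contains unanswerable questions with an empty answer list.
		answer = 'No answer can be found in the context.'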
 
	return {"input_text":question, "output_text":answer}

In [194]:
squad_train_ds_processed = (squad_train_samples.shuffle(42).select(range(1000)).map(process_squad_example)
                           .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [195]:
print(squad_train_ds_processed[1]['input_text'])
print(squad_train_ds_processed[1]['output_text'])

Context: Alexandria was the most important trade center in the whole empire during Athanasius's boyhood. Intellectually, morally, and politically—it epitomized the ethnically diverse Graeco-Roman world, even more than Rome or Constantinople, Antioch or Marseilles. Its famous catechetical school, while sacrificing none of its famous passion for orthodoxy since the days of Pantaenus, Clement of Alexandria, Origen of Alexandria, Dionysius and Theognostus, had begun to take on an almost secular character in the comprehensiveness of its interests, and had counted influential pagans among its serious auditors.

Question: What was Alexandria known for?
Important trade center
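
Unlike the earlier sections, the `input_text` printed above is the bare prompt: `process_squad_example` returns `question` directly instead of passing it through `create_chat_text`, so no chat headers are added. If chat-formatted inputs are wanted here too, one option is to wrap the prompt the same way the other processors do. The sketch below assumes that is the intent. `fake_chat_text` is a hypothetical, simplified stand-in for `create_chat_text(tokenizer, ...)[1]` so the block runs without downloading a tokenizer; it imitates the header layout seen in the printed examples but omits the date lines the real template inserts.

```python
def fake_chat_text(user_prompt, system="You are a helpful AI assistant."):
    # Hypothetical stand-in for the real chat template (assumption).
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

def build_reading_comprehension_input(context, question, to_chat=fake_chat_text):
    """Build the Context/Question prompt, then wrap it in chat format."""
    prompt = f"Context: {context}\n\nQuestion: {question}"
    return to_chat(prompt)

input_text = build_reading_comprehension_input(
    "Alexandria was the most important trade center in the whole empire.",
    "What was Alexandria known for?",
)
print(input_text)
```

In the notebook itself, `to_chat` would be something like `lambda p: create_chat_text(tokenizer, p)[1]`.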


### DROP

In [197]:
ds = load_dataset("ucinlp/drop")
drop_train_samples = ds['train']
drop_train_samples

Dataset({
    features: ['section_id', 'query_id', 'passage', 'question', 'answers_spans'],
    num_rows: 77400
})

In [211]:
def process_drop_example(ex):
	context = ex['passage']
	question = ex['question']
	question = f"Context: {context}" + "\n\n" + f"Question: {question}"
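	answer = ex['answers_spans']['spans']
	if answer:
		# Upper-case only the first character; str.capitalize() would also
		# lower-case the rest of the span (e.g. proper nouns).
		answer = answer[0][:1].upper() + answer[0][1:]
	else:
		answer = 'No answer can be found in the context.'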
 
	return {"input_text":question, "output_text":answer}

In [212]:
drop_train_ds_processed = (drop_train_samples.shuffle(42).select(range(1000)).map(process_drop_example)
                           .select_columns(['input_text', 'output_text']))

In [214]:
print(drop_train_ds_processed[4]['input_text'])
print(drop_train_ds_processed[4]['output_text'])

Context: There are 1,068,573 households in the municipality, giving an average household size of 3.3 people. Of those households, 78.4% are in formal structures (houses or apartment), while 20.5% are in informal structures (Shanty town). 94.0% of households use mains electricity for lighting. 87.3% of households have water supply to the dwelling, while 12.0% have piped water through a communal tap. 94.9% of households have regular refuse collection service. 91.4% of households have a flush toilet or chemical toilet, while 4.5% still use a bucket toilet. 82.1% of households have a refrigerator, 87.3% have a television and 70.1% have a radio. Only 34.0% have a landline telephone, but 91.3% have a cellphone. 37.9% have a computer, and 49.3% have access to the Internet (either through a computer or a cellphone).

Question: How many more percent of residents lived in formal structures than informal structures?
57.9


### Create Dataset v1

In [215]:
concat_train_ds = concatenate_datasets([deduplicated_dataset.select_columns(['input_text', 'output_text']), 
                                        arc_e_train_ds_processed,
                                        arc_c_train_ds_processed,
                                        boolq_train_ds_processed,
                                        commonsenseqa_train_ds_processed,
                                        hellaswag_train_ds_processed,
                                        wino_train_ds_processed,
                                        gsm8k_train_ds_processed,
                                        squad_train_ds_processed,
                                        drop_train_ds_processed]).shuffle(42)

In [216]:
concat_train_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v1")

Saving the dataset (0/1 shards):   0%|          | 0/93078 [00:00<?, ? examples/s]

In [217]:
concat_train_ds

Dataset({
    features: ['input_text', 'output_text'],
    num_rows: 93078
})

In [218]:
def tokenize(exs):
    return {"input_token_ids" : tokenizer(exs['input_text'])['input_ids'],
            "output_token_ids" : tokenizer(exs['output_text'])['input_ids']}
    
def token_lengths(ex):
    return {"token_length": len(ex['input_token_ids']) + len(ex['output_token_ids'])}

In [219]:
concat_train_ds = concat_train_ds.map(tokenize, batched=True)
concat_train_ds = concat_train_ds.map(token_lengths)

Map:   0%|          | 0/93078 [00:00<?, ? examples/s]

Map:   0%|          | 0/93078 [00:00<?, ? examples/s]

In [220]:
pd.Series(concat_train_ds['token_length']).value_counts(normalize=True, bins=[0,512,1024,1536,2048,4096]).sort_index().cumsum()


(-0.001, 512.0]     0.699199
(512.0, 1024.0]     0.925278
(1024.0, 1536.0]    0.995885
(1536.0, 2048.0]    0.999194
(2048.0, 4096.0]    0.999914
Name: proportion, dtype: float64
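
The table reads as cumulative shares: about 92.5% of examples have at most 1024 tokens and about 99.6% have at most 1536. The bins are right-inclusive, whereas the filters below use a strict `<`, so a row that lands exactly on a boundary is counted in the table but dropped by the filter. A small pure-Python sketch of both computations; `cumulative_share` is a hypothetical helper and the lengths are made up for illustration.

```python
def cumulative_share(lengths, edges):
    """Fraction of `lengths` that are <= each edge (right-inclusive)."""
    n = len(lengths)
    return {edge: sum(1 for x in lengths if x <= edge) / n for edge in edges}

# Made-up token lengths, including one exactly on the 1024 boundary.
lengths = [100, 512, 700, 1024, 1500, 3000]

shares = cumulative_share(lengths, [512, 1024, 1536])
kept = [x for x in lengths if x < 1024]  # strict, like the filter

print(shares)
print(len(kept), "of", len(lengths), "kept by the strict < 1024 filter")
```

Here four of six lengths fall in the `<= 1024` share, but only three survive the strict filter, because the row with exactly 1024 tokens is excluded.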

In [222]:
concat_train_filtered_ds = concat_train_ds.filter(lambda ex: ex["token_length"] < 1024)

Filter:   0%|          | 0/93078 [00:00<?, ? examples/s]

In [223]:
concat_train_filtered_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v1_1024")

Saving the dataset (0/1 shards):   0%|          | 0/86090 [00:00<?, ? examples/s]

In [224]:
concat_train_filtered_ds = concat_train_ds.filter(lambda ex: ex["token_length"] < 1536)

Filter:   0%|          | 0/93078 [00:00<?, ? examples/s]

In [225]:
concat_train_filtered_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v1_1536")

Saving the dataset (0/2 shards):   0%|          | 0/92693 [00:00<?, ? examples/s]