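Prepare instruction fine-tuning (IFT) datasets as records of the form `{"input_text": ..., "output_text": ...}`. The `input_text` is rendered with the model's chat template so the data can be used to fine-tune chat models.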
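
As a minimal sketch of the target record shape, the snippet below builds one example by hand. The `to_chatml` helper is hypothetical: it hard-codes the ChatML layout that shows up in the printed Qwen samples in this notebook, whereas the notebook itself relies on `tokenizer.apply_chat_template`.

```python
def to_chatml(user_prompt, system="You are a helpful AI assistant."):
    """Hypothetical stand-in for tokenizer.apply_chat_template (ChatML layout assumed)."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

# One record: the prompt ends with the assistant header, the target is the reply.
record = {
    "input_text": to_chatml("What is 2 + 2?"),
    "output_text": "2 + 2 = 4",
}
print(record["input_text"] + record["output_text"])
```

Concatenating `input_text` and `output_text` gives the full training sequence, which is why the prompt ends with the open assistant turn.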

In [None]:
import os
import re
import pandas as pd
from transformers import AutoTokenizer
from datasets import concatenate_datasets, load_dataset, load_from_disk

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
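def create_chat_text(tokenizer, user_prompt, system="You are a helpful AI assistant."):
    """Return (messages, prompt text) for a single-turn chat with a system message."""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt}
    ]
    # add_generation_prompt=True appends the assistant header, so the reply can follow directly
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return messages, text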

In [None]:
model_name = "Qwen/Qwen2.5-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

### oasst

In [None]:
def process_example(example):
    # Keep only the first human/assistant exchange of each conversation.
    t = example["text"]

    human_prefix, asst_prefix = "### Human: ", "### Assistant: "
    # re.escape keeps the prefixes literal even if they ever contain regex metacharacters
    human_matches = list(re.finditer(re.escape(human_prefix), t))
    asst_matches = list(re.finditer(re.escape(asst_prefix), t))
    human_start_idx = human_matches[0].start()
    asst_start_idx = asst_matches[0].start()
    # The first reply ends where the second human turn starts, if there is one
    if len(human_matches) > 1:
        asst_end_idx = human_matches[1].start()
    else:
        asst_end_idx = len(t)

    human_prompt = t[human_start_idx + len(human_prefix):asst_start_idx]
    asst_prompt = t[asst_start_idx + len(asst_prefix):asst_end_idx]

    _, input_text = create_chat_text(tokenizer, human_prompt)
    output_text = asst_prompt
    return {"input_text": input_text, "output_text": output_text}
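
To make the slicing concrete, here is a small self-contained sketch of the same first-turn extraction on a made-up string. The `first_turn` name and the sample text are illustrative only, and the tokenizer step is left out.

```python
import re

def first_turn(text, human_prefix="### Human: ", asst_prefix="### Assistant: "):
    """Return (prompt, reply) for the first human/assistant exchange."""
    human = [m.start() for m in re.finditer(re.escape(human_prefix), text)]
    asst = [m.start() for m in re.finditer(re.escape(asst_prefix), text)]
    # The first reply ends where the second human turn begins, if there is one.
    reply_end = human[1] if len(human) > 1 else len(text)
    prompt = text[human[0] + len(human_prefix):asst[0]]
    reply = text[asst[0] + len(asst_prefix):reply_end]
    return prompt, reply

# Made-up two-turn conversation; only the first turn is kept.
sample = "### Human: Hi### Assistant: Hello!### Human: Bye### Assistant: See you."
prompt, reply = first_turn(sample)
print(repr(prompt), repr(reply))
```

Later turns are discarded, so each conversation contributes exactly one prompt/response pair.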

In [None]:
train_ds1 = load_dataset("timdettmers/openassistant-guanaco")['train']

Repo card metadata block was not found. Setting CardData to empty.
Generating train split: 100%|█████████████████████████████████████████████████████████| 9846/9846 [00:00<00:00, 159085.93 examples/s]
Generating test split: 100%|█████████████████████████████████████████████████████████████| 518/518 [00:00<00:00, 75183.39 examples/s]


In [None]:
train_ds1_processed = train_ds1.map(process_example)

Map: 100%|█████████████████████████████████████████████████████████████████████████████| 9846/9846 [00:00<00:00, 10268.73 examples/s]


In [None]:
len(train_ds1_processed)

9846

In [None]:
idx = 105
print(train_ds1_processed[idx]['input_text'])
print(train_ds1_processed[idx]['output_text'])

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
Siento un dolor punzante en la zona superior izquierda de mi abdomen, ¿qué podría estar causándolo?<|im_end|>
<|im_start|>assistant

Lo siento, no puedo contestar a esa pregunta, ya que no tengo la suficiente información cómo para dictaminar un resultado. Si de verdad padece de algún problema físico, recomiendo ir inmediatamente a un hospital cercano.


### orca-math

In [None]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['question'])
    return {"input_text":input_text, "output_text":example['answer']}

In [None]:
train_ds2 = load_dataset("microsoft/orca-math-word-problems-200k")
train_ds2, eval_ds2 = train_ds2['train'].shuffle(42).select(range(10000)), train_ds2['train'].shuffle(42).select(range(10000, 10100))

Generating train split: 100%|█████████████████████████████████████████████████████| 200035/200035 [00:01<00:00, 148710.43 examples/s]
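
The cell above shuffles the split twice with the same seed and takes two different index ranges. The two slices are disjoint only because the same seed reproduces the same permutation. The sketch below illustrates that idea with Python's `random` module as a stand-in; `datasets` uses its own generator, so the actual order differs, and `shuffled_indices` is a hypothetical helper.

```python
import random

def shuffled_indices(n, seed):
    """Return a deterministic permutation of range(n) for the given seed."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx

n_rows = 1000
perm_a = shuffled_indices(n_rows, seed=42)
perm_b = shuffled_indices(n_rows, seed=42)  # same seed, same order

train_idx = perm_a[:100]     # analogous to .select(range(10000))
eval_idx = perm_b[100:110]   # analogous to .select(range(10000, 10100))
print(len(train_idx), len(eval_idx), set(train_idx) & set(eval_idx))
```

Shuffling once and slicing the result twice would give the same split with half the work.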


In [None]:
train_ds2_processed = train_ds2.map(process_example)
eval_ds2_processed = eval_ds2.map(process_example)

Map: 100%|████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 7449.35 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 5816.29 examples/s]


In [None]:
len(train_ds2_processed)

10000

In [None]:
print(train_ds2_processed[0]['input_text'])
print(train_ds2_processed[0]['output_text'])

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
Sally had 13 peaches at her roadside fruit dish.  She went to the orchard and picked peaches to stock up. She picked 55 peaches. There are _____ peaches now.<|im_end|>
<|im_start|>assistant

Sally originally had 13 peaches. She picked 55 more peaches. To find out the total number of peaches she has now, we add the two amounts together:

13 (original peaches) + 55 (picked peaches) = 68 peaches

So, there are 68 peaches now.


In [None]:
print(eval_ds2_processed[0]['input_text'])
print(eval_ds2_processed[0]['output_text'])

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
cara leaves her home and walks toward don ' s house . two hours later , don leaves his home and walks toward cara ' s house . the distance between their homes is 45 kilometers , cara ' s walking speed is 6 km / h , and don ' s walking speed is 5 km / h . how many kilometers will cara walk before she meets don ?<|im_end|>
<|im_start|>assistant

Let's denote the distance Cara walks before meeting Don as \( x \) kilometers.

Since Cara walks at a speed of 6 km/h, she will have walked for \( \frac{x}{6} \) hours before meeting Don.

Don starts walking 2 hours after Cara, so by the time they meet, Don will have been walking for \( \frac{x}{6} - 2 \) hours.

Don walks at a speed of 5 km/h, so the distance Don walks before meeting Cara is \( 5 \times \left( \frac{x}{6} - 2 \right) \) kilometers.

The sum of the distances walked by Cara and Don before they meet should equal the total distance between their homes, whi

### meta-math-qa

In [None]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['query'])
    return {"input_text":input_text, "output_text":example['response']}

In [None]:
train_ds3 = load_dataset("meta-math/MetaMathQA")
train_ds3, eval_ds3 = train_ds3['train'].shuffle(42).select(range(10000)), train_ds3['train'].shuffle(42).select(range(10000, 10100))

Generating train split: 100%|██████████████████████████████████████████████████████| 395000/395000 [00:09<00:00, 40243.25 examples/s]


In [None]:
train_ds3_processed = train_ds3.map(process_example)
eval_ds3_processed = eval_ds3.map(process_example)

Map: 100%|████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 6783.65 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 5691.51 examples/s]


In [None]:
train_ds3_processed[3]

{'type': 'MATH_SV',
 'query': 'Compute the sum of the squares of the roots of the equation \\[x^{2018} +44x^{2015} + 3x^3 + 404 = X The answer is 0. What is the value of unknown variable X?',
 'original_question': 'Compute the sum of the squares of the roots of the equation \\[x^{2018} +44x^{2015} + 3x^3 + 404 = 0.\\]',
 'response': 'To solve this problem, we need to determine the value of x that will make the sum of the squares of the roots of the equation equal to 0.\nThe equation is given as:\n\\[x^{2018} + 44x^{2015} + 3x^3 + 404 = X\\]\nTo find the sum of the squares of the roots, we need to find the roots of the equation and then square them.\nSince the answer is 0, it means that the sum of the squares of the roots must be equal to 0.\nTo make the sum of squares 0, we need to find the values of x that will make each square equal to 0.\nIf a square is equal to 0, it means that the value inside the square must be 0.\nIn this case, we need to set each term inside the square to 0 and

In [None]:
len(train_ds3_processed), len(eval_ds3_processed)

(10000, 100)

In [None]:
print(train_ds3_processed[0]['input_text'])
print(train_ds3_processed[0]['output_text'])

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
If Anna wants to create a smiley face shape using red and yellow tulips, she requires 8 red tulips for each eye and 18 red tulips for the smile. Additionally, she needs 9 times the number of tulips in the smile to create the yellow background of the face. What is the total number of tulips that Anna needs?<|im_end|>
<|im_start|>assistant

Anna needs 8 red tulips for each eye, so for both eyes she needs 8 * 2 = 16 red tulips.
She also needs 18 red tulips for the smile.
The total number of red tulips she needs is 16 + 18 = 34 red tulips.
For the yellow background, she needs 9 times the number of tulips in the smile, so she needs 9 * 18 = 162 yellow tulips.
The total number of tulips she needs is 34 + 162 = 196 tulips.
#### 196
The answer is: 196


In [None]:
print(eval_ds3_processed[0]['input_text'])
print(eval_ds3_processed[0]['output_text'])

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
After waking up to find the thermostat set at 40 degrees, Jerry doubles the initial temperature. However, his dad reduces the temperature by 30 degrees after noticing. Later, Jerry's mother further reduces the temperature by 30%, and then his sister increases it by 24 degrees. What is the final temperature?<|im_end|>
<|im_start|>assistant


The initial temperature was 40 degrees. 
Jerry doubles the initial temperature, so it becomes 40 * 2 = 80 degrees. 
His dad reduces the temperature by 30 degrees, so it becomes 80 - 30 = 50 degrees. 
Jerry's mother reduces the temperature by 30%, so she reduces it by 50 * 0.30 = 15 degrees. 
After his mother's reduction, the temperature becomes 50 - 15 = 35 degrees. 
Jerry's sister increases the temperature by 24 degrees, so it becomes 35 + 24 = 59 degrees. 
The final temperature is 59 degrees. 
#### 59
The answer is: 59


### ultrafeedback

In [None]:
def process_example(example):
    chosen = example['chosen']
    _,input_text = create_chat_text(tokenizer, chosen[0]['content'])
    return {"input_text":input_text, "output_text":chosen[1]['content']}

In [None]:
train_ds4 = load_dataset('HuggingFaceH4/ultrafeedback_binarized')
train_ds4, eval_ds4 = train_ds4['train_sft'].shuffle(42).select(range(10000)), train_ds4['train_sft'].shuffle(42).select(range(10000, 10100))
train_ds4_processed = train_ds4.map(process_example)
eval_ds4_processed = eval_ds4.map(process_example)

Generating train_prefs split: 100%|██████████████████████████████████████████████████| 61135/61135 [00:01<00:00, 35980.10 examples/s]
Generating train_sft split: 100%|████████████████████████████████████████████████████| 61135/61135 [00:01<00:00, 33563.17 examples/s]
Generating test_prefs split: 100%|█████████████████████████████████████████████████████| 2000/2000 [00:00<00:00, 31326.61 examples/s]
Generating test_sft split: 100%|███████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 33861.08 examples/s]
Generating train_gen split: 100%|████████████████████████████████████████████████████| 61135/61135 [00:01<00:00, 37584.17 examples/s]
Generating test_gen split: 100%|███████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 33052.56 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:02<00:00, 4156.69 examples/s]
Map: 100%|████████████████████████████████████████████████████

In [None]:
len(train_ds4_processed), len(eval_ds4_processed)

(10000, 100)

In [None]:
print(train_ds4_processed[0]['input_text'])
print(train_ds4_processed[0]['output_text'])

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
Do you know something about crystallography and structure factor?<|im_end|>
<|im_start|>assistant

Crystallography is the science of the arrangement of atoms in solids. It is a vast and interdisciplinary field that has applications in physics, chemistry, materials science, biology, and engineering.

The structure factor is a mathematical function that is used to describe the diffraction of waves by a crystal. It is a complex number that is related to the atomic positions in the crystal.

The structure factor can be used to calculate the intensity of the diffracted waves. This information can be used to determine the atomic positions in the crystal and to study the structure of materials.

Crystallography is a powerful tool for understanding the structure of materials. It has been used to determine the structures of many important materials, including metals, semiconductors, and pharmaceuticals. It is also use

In [None]:
print(eval_ds4_processed[0]['input_text'])
print(eval_ds4_processed[0]['output_text'])

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
instruction:
The task is to read a passage and convert numbers in their text forms into digit forms. The text numbers can be used to represent Quantity, Dates and Time, Addresses or Positions.
question:
was the case of the forgeries upon the Bank of England discovered in eighteen sixty three,
answer:
was the case of the forgeries upon the Bank of England discovered in 1863,


question:
The Chronicles of Newgate, Volume two. By Arthur Griffiths. Section twenty: Newgate Notorieties, part one.
answer:
The Chronicles of Newgate, Volume 2. By Arthur Griffiths. Section 20: Newgate Notorieties, part 1.


question:
If Oswald left the Depository Building at twelve:thirty three p.m.,
answer:
<|im_end|>
<|im_start|>assistant

If Oswald left the Depository Building at 12:33 p.m.,


### ultrachat-200k

In [None]:
def process_example(example):
    messages = example['messages']
    _,input_text = create_chat_text(tokenizer, messages[0]['content'])
    return {"input_text":input_text, "output_text":messages[1]['content']}

In [None]:
train_ds5 = load_dataset('HuggingFaceH4/ultrachat_200k', name='default')
train_ds5, eval_ds5 = train_ds5['train_sft'].shuffle(42).select(range(10000)), train_ds5['train_sft'].shuffle(42).select(range(10000, 10100))
train_ds5_processed = train_ds5.map(process_example)
eval_ds5_processed = eval_ds5.map(process_example)

Generating train_sft split: 100%|██████████████████████████████████████████████████| 207865/207865 [00:10<00:00, 20130.02 examples/s]
Generating test_sft split: 100%|█████████████████████████████████████████████████████| 23110/23110 [00:01<00:00, 20678.60 examples/s]
Generating train_gen split: 100%|██████████████████████████████████████████████████| 256032/256032 [00:09<00:00, 26269.86 examples/s]
Generating test_gen split: 100%|█████████████████████████████████████████████████████| 28304/28304 [00:01<00:00, 24686.44 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:02<00:00, 4279.21 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 4067.28 examples/s]


In [None]:
len(train_ds5_processed), len(eval_ds5_processed)

(10000, 100)

In [None]:
print(train_ds5_processed[0]['input_text'])
print(train_ds5_processed[0]['output_text'])

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
How does the location of the Sydney Conservatorium of Music impact the academic and professional opportunities available to music students, and how does the conservatorium support student engagement with the music industry in Australia?<|im_end|>
<|im_start|>assistant

The location of the Sydney Conservatorium of Music, which is situated in the heart of Sydney's cultural precinct, impacts both the academic and professional opportunities available to music students. The conservatorium is located near several major performing arts venues and organizations, including the Sydney Opera House, the Australian Broadcasting Corporation, and the Sydney Symphony Orchestra, providing students with easy access to performances, rehearsals, and networking opportunities.

One of the primary ways the conservatorium supports student engagement with the music industry in Australia is through its strong industry connections. The

In [None]:
print(eval_ds5_processed[0]['input_text'])
print(eval_ds5_processed[0]['output_text'])

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
Could you paraphrase the line "Oh mark the words that I do say, Who know? Tomorrow could be your day"?: From Those Around, I hear a cry, a mouthful, a hopeless sigh. I hear their footsteps leaving slow and then I know my soul must fly!
Oh mark the words that I do say, Who know? Tomorrow could be your day. At last, it comes to Heaven or Hell. Decide which now, do not delay! Come on my brother let's pray. Decide which now, do not delay!
Oh God! Oh God: I can not see! My eyes are blind! Am I still me?! Or has my soul been led astray. And forced to pay a priceless Fee?!
Alas to dust we all return, some shall rejoice, while others burn. If only I knew that before, the line grew short, and came my turn!
Oh mark my words That I don't say, who knows? Tomorrow could be your day. At last, it come to Heaven and Hell. Decide which now, do not delay. Come on my brothers, let's pray. Decide which now, do not delay.<|im_end

### open hermes 2.5

In [None]:
def process_example(example):
    messages = example['conversations']
    _,input_text = create_chat_text(tokenizer, messages[0]['value'])
    return {"input_text":input_text, "output_text":messages[1]['value']}

In [None]:
train_ds6 = load_dataset('teknium/OpenHermes-2.5')
train_ds6, eval_ds6 = train_ds6['train'].shuffle(42).select(range(10000)), train_ds6['train'].shuffle(42).select(range(10000, 10100))
train_ds6_processed = train_ds6.map(process_example)
eval_ds6_processed = eval_ds6.map(process_example)

Generating train split:   0%|                                                                     | 0/1001551 [00:00<?, ? examples/s]

In [None]:
len(train_ds6_processed), len(eval_ds6_processed)

In [None]:
print(train_ds6_processed[0]['input_text'])
print(train_ds6_processed[0]['output_text'])

In [None]:
print(eval_ds6_processed[0]['input_text'])
print(eval_ds6_processed[0]['output_text'])

### starcoder-2

In [None]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['prompt'])
    return {"input_text":input_text, "output_text":example['response']}

In [None]:
train_ds7 = load_dataset('bigcode/self-oss-instruct-sc2-exec-filter-50k')
train_ds7, eval_ds7 = train_ds7['train'].shuffle(42).select(range(10000)), train_ds7['train'].shuffle(42).select(range(10000, 10100))
train_ds7_processed = train_ds7.map(process_example)
eval_ds7_processed = eval_ds7.map(process_example)

In [None]:
len(train_ds7_processed), len(eval_ds7_processed)

In [None]:
print(train_ds7_processed[0]['input_text'])
print(train_ds7_processed[0]['output_text'])

In [None]:
print(eval_ds7_processed[0]['input_text'])
print(eval_ds7_processed[0]['output_text'])

### The Tome

In [None]:
def process_example(example):
    messages = example['conversations']
    _,input_text = create_chat_text(tokenizer, messages[0]['value'])
    return {"input_text":input_text, "output_text":messages[1]['value']}

In [None]:
train_ds8 = load_dataset('arcee-ai/The-Tome')
train_ds8, eval_ds8 = train_ds8['train'].shuffle(42).select(range(10000)), train_ds8['train'].shuffle(42).select(range(10000, 10100))
train_ds8_processed = train_ds8.map(process_example)
eval_ds8_processed = eval_ds8.map(process_example)
len(train_ds8_processed), len(eval_ds8_processed)

In [None]:
print(train_ds8_processed[0]['input_text'])
print(train_ds8_processed[0]['output_text'])

In [None]:
print(eval_ds8_processed[0]['input_text'])
print(eval_ds8_processed[0]['output_text'])

### openhermes 200k unfiltered

In [None]:
def process_example(example):
    messages = example['conversations']
    _,input_text = create_chat_text(tokenizer, messages[0]['value'])
    return {"input_text":input_text, "output_text":messages[1]['value']}

In [None]:
train_ds9 = load_dataset("Crystalcareai/openhermes_200k_unfiltered")
train_ds9, eval_ds9 = train_ds9['train'].sort("relevance_score").select(range(10000)), train_ds9['train'].sort("relevance_score").select(range(10000, 10100))
train_ds9_processed = train_ds9.map(process_example)
eval_ds9_processed = eval_ds9.map(process_example)
len(train_ds9_processed), len(eval_ds9_processed)

In [None]:
print(train_ds9_processed[0]['input_text'])
print(train_ds9_processed[0]['output_text'])

In [None]:
print(eval_ds9_processed[0]['input_text'])
print(eval_ds9_processed[0]['output_text'])

### Synth reasoning alpaca combined

In [None]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['instruction'])
    return {"input_text":input_text, "output_text":example['output']}

In [None]:
train_ds10 = load_dataset("Crystalcareai/synthetic_reasoning_natural_Alpaca_Combined")
train_ds10, eval_ds10 = train_ds10['train'].shuffle(42).select(range(1000)), train_ds10['train'].shuffle(42).select(range(1000, 1010))
train_ds10_processed = train_ds10.map(process_example)
eval_ds10_processed = eval_ds10.map(process_example)
len(train_ds10_processed), len(eval_ds10_processed)

In [None]:
print(train_ds10_processed[0]['input_text'])
print(train_ds10_processed[0]['output_text'])

In [None]:
print(eval_ds10_processed[0]['input_text'])
print(eval_ds10_processed[0]['output_text'])

### Concatenate

In [None]:
concat_train_ds = concatenate_datasets([train_ds1_processed.select_columns(['input_text', 'output_text']), 
                                        train_ds2_processed.select_columns(['input_text', 'output_text']), 
                                        train_ds3_processed.select_columns(['input_text', 'output_text']),
                                        train_ds4_processed.select_columns(['input_text', 'output_text']),
                                        train_ds5_processed.select_columns(['input_text', 'output_text']), 
                                        train_ds6_processed.select_columns(['input_text', 'output_text']),
                                        train_ds7_processed.select_columns(['input_text', 'output_text']),
                                        train_ds8_processed.select_columns(['input_text', 'output_text']), 
                                        train_ds9_processed.select_columns(['input_text', 'output_text']),
                                        train_ds10_processed.select_columns(['input_text', 'output_text'])])

In [None]:
concat_eval_ds = concatenate_datasets([eval_ds2_processed.select_columns(['input_text', 'output_text']),
                                       eval_ds3_processed.select_columns(['input_text', 'output_text']),
                                       eval_ds4_processed.select_columns(['input_text', 'output_text']),
                                       eval_ds5_processed.select_columns(['input_text', 'output_text']),
                                       eval_ds6_processed.select_columns(['input_text', 'output_text']),
                                       eval_ds7_processed.select_columns(['input_text', 'output_text']),
                                       eval_ds8_processed.select_columns(['input_text', 'output_text']),
                                       eval_ds9_processed.select_columns(['input_text', 'output_text']),
                                       eval_ds10_processed.select_columns(['input_text', 'output_text'])])

In [None]:
len(concat_train_ds), len(concat_eval_ds)

In [None]:
concat_train_ds = concat_train_ds.shuffle(42)

In [None]:
concat_train_ds.save_to_disk("/home/k/data/qwen_large_mix_dataset_v0")
concat_train_ds = load_from_disk("/home/k/data/qwen_large_mix_dataset_v0")

Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████| 90846/90846 [00:01<00:00, 48230.00 examples/s]


In [None]:
concat_eval_ds.save_to_disk("/home/k/data/qwen_large_mix_dataset_v0_eval")
concat_eval_ds = load_from_disk("/home/k/data/qwen_large_mix_dataset_v0_eval")

Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████████| 810/810 [00:00<00:00, 58053.15 examples/s]


In [None]:
def tokenize(exs):
    # Batched: tokenize prompts and responses separately
    return {"inp_ids": tokenizer(exs['input_text'])['input_ids'],
            "out_ids": tokenizer(exs['output_text'])['input_ids']}

def token_lengths(ex):
    # Per-example token counts for the prompt, the response, and their sum
    return {"input_length": len(ex['inp_ids']),
            "output_length": len(ex['out_ids']),
            "token_length": len(ex['inp_ids']) + len(ex['out_ids'])}

In [None]:
concat_train_ds = concat_train_ds.map(tokenize, batched=True).map(token_lengths)
concat_eval_ds  = concat_eval_ds.map(tokenize, batched=True).map(token_lengths)

Map: 100%|████████████████████████████████████████████████████████████████████████████| 90846/90846 [00:21<00:00, 4143.99 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████| 90846/90846 [00:29<00:00, 3100.60 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████| 810/810 [00:00<00:00, 3850.00 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████| 810/810 [00:00<00:00, 3034.39 examples/s]


In [None]:
concat_train_filtered_ds = concat_train_ds.filter(lambda ex: ex["token_length"] < 1024)
concat_train_filtered_ds.save_to_disk("/home/k/data/qwen_large_mix_dataset_v0_1024")

concat_train_filtered_ds = concat_train_ds.filter(lambda ex: ex["token_length"] < 1536)
concat_train_filtered_ds.save_to_disk("/home/k/data/qwen_large_mix_dataset_v0_1536")

Filter: 100%|█████████████████████████████████████████████████████████████████████████| 90846/90846 [00:16<00:00, 5549.03 examples/s]
Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████| 84037/84037 [00:01<00:00, 75841.31 examples/s]
Filter: 100%|█████████████████████████████████████████████████████████████████████████| 90846/90846 [00:16<00:00, 5557.69 examples/s]
Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████| 90421/90421 [00:01<00:00, 71518.26 examples/s]


In [None]:
# Series.value_counts replaces the deprecated top-level pd.value_counts
bins = [0,512,1024,1536,2048,4096]

input_length_counts = pd.Series(list(concat_train_ds['input_length'])).value_counts(normalize=True, bins=bins).sort_index().cumsum()
input_length_counts.index.name = 'Input Length Bins'

output_length_counts = pd.Series(list(concat_train_ds['output_length'])).value_counts(normalize=True, bins=bins).sort_index().cumsum()
output_length_counts.index.name = 'Output Length Bins'

token_length_counts = pd.Series(list(concat_train_ds['token_length'])).value_counts(normalize=True, bins=bins).sort_index().cumsum()
token_length_counts.index.name = 'Total Length Bins'

print(input_length_counts, "\n") 
print(output_length_counts, "\n") 
print(token_length_counts, "\n") 


Input Length Bins
(-0.001, 512.0]     0.867820
(512.0, 1024.0]     0.984710
(1024.0, 1536.0]    0.999318
(1536.0, 2048.0]    0.999758
(2048.0, 4096.0]    1.000000
Name: proportion, dtype: float64 

Output Length Bins
(-0.001, 512.0]     0.898730
(512.0, 1024.0]     0.993329
(1024.0, 1536.0]    0.998888
(1536.0, 2048.0]    0.999538
(2048.0, 4096.0]    0.999989
Name: proportion, dtype: float64 

Total Length Bins
(-0.001, 512.0]     0.695165
(512.0, 1024.0]     0.925357
(1024.0, 1536.0]    0.995366
(1536.0, 2048.0]    0.998888
(2048.0, 4096.0]    0.999824
Name: proportion, dtype: float64 
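
For readers who want to see what these cumulative tables compute, here is a minimal standard-library sketch of the same idea: the share of examples whose length is at or below each bin edge. The `cumulative_share` helper and the toy lengths are illustrative assumptions, not part of the notebook. Lengths above the last edge fall in no bin, which is consistent with the last row of the printed tables staying slightly below 1.

```python
from bisect import bisect_left

def cumulative_share(lengths, edges=(512, 1024, 1536, 2048, 4096)):
    """Fraction of all lengths that are <= each edge (right-closed bins)."""
    counts = [0] * len(edges)
    for n in lengths:
        i = bisect_left(edges, n)  # index of the first edge >= n
        if i < len(edges):         # lengths above the last edge are not counted
            counts[i] += 1
    total, running, shares = len(lengths), 0, {}
    for edge, c in zip(edges, counts):
        running += c
        shares[edge] = running / total
    return shares

toy_lengths = [100, 512, 600, 1500, 5000]  # made-up token counts
print(cumulative_share(toy_lengths))
```

Reading the table at a given edge tells you how much data survives a length filter at that cutoff, up to the boundary case: the notebook filters with a strict `<`.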



In [None]:
# Series.value_counts replaces the deprecated top-level pd.value_counts
bins = [0,512,1024,1536,2048,4096]

input_length_counts = pd.Series(list(concat_eval_ds['input_length'])).value_counts(normalize=True, bins=bins).sort_index().cumsum()
input_length_counts.index.name = 'Input Length Bins'

output_length_counts = pd.Series(list(concat_eval_ds['output_length'])).value_counts(normalize=True, bins=bins).sort_index().cumsum()
output_length_counts.index.name = 'Output Length Bins'

token_length_counts = pd.Series(list(concat_eval_ds['token_length'])).value_counts(normalize=True, bins=bins).sort_index().cumsum()
token_length_counts.index.name = 'Total Length Bins'

print(input_length_counts, "\n") 
print(output_length_counts, "\n") 
print(token_length_counts, "\n") 

Input Length Bins
(-0.001, 512.0]     0.845679
(512.0, 1024.0]     0.986420
(1024.0, 1536.0]    1.000000
(1536.0, 2048.0]    1.000000
(2048.0, 4096.0]    1.000000
Name: proportion, dtype: float64 

Output Length Bins
(-0.001, 512.0]     0.872840
(512.0, 1024.0]     0.988889
(1024.0, 1536.0]    0.998765
(1536.0, 2048.0]    0.998765
(2048.0, 4096.0]    1.000000
Name: proportion, dtype: float64 

Total Length Bins
(-0.001, 512.0]     0.654321
(512.0, 1024.0]     0.922222
(1024.0, 1536.0]    0.993827
(1536.0, 2048.0]    0.998765
(2048.0, 4096.0]    1.000000
Name: proportion, dtype: float64 



In [None]:
import hashlib
from datasets import Dataset

def compute_hash(text):
    return hashlib.md5(text.encode('utf-8')).hexdigest()

# Collect the hashes of every prompt in the training dataset
unique_hashes = set(concat_train_ds.map(lambda row: {"hash": compute_hash(row['input_text'])})['hash'])

Map: 100%|███████████████████████████████████████████████████████████████████████████| 90846/90846 [00:08<00:00, 10449.61 examples/s]


In [None]:
# Apply the deduplication to the eval dataset using the filter function
deduplicated_eval_dataset = concat_eval_ds.filter(lambda row: compute_hash(row['input_text']) not in unique_hashes)

Filter: 100%|█████████████████████████████████████████████████████████████████████████████| 810/810 [00:00<00:00, 5076.40 examples/s]


In [None]:
# save 
deduplicated_eval_dataset.save_to_disk("/home/k/data/qwen_large_mix_dataset_v0_dedup_eval")

Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████████| 736/736 [00:00<00:00, 25996.51 examples/s]
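
The three cells above remove eval rows whose prompt also appears in the training set (810 rows before, 736 after). A minimal sketch of that filter on plain Python lists, with made-up rows and a hypothetical `prompt_hash` helper, looks like this:

```python
import hashlib

def prompt_hash(text):
    """MD5 hex digest of the prompt text, used as an exact-match key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Made-up rows standing in for the train and eval datasets.
train_rows = [{"input_text": "What is 2 + 2?"}, {"input_text": "Name a prime."}]
eval_rows = [{"input_text": "What is 2 + 2?"}, {"input_text": "Define entropy."}]

train_hashes = {prompt_hash(r["input_text"]) for r in train_rows}
clean_eval = [r for r in eval_rows if prompt_hash(r["input_text"]) not in train_hashes]
print(len(eval_rows), "->", len(clean_eval))
```

Hash matching only catches exact duplicates of the rendered prompt; paraphrased or re-templated prompts pass through.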


### Dataset Mixture v1

- Dedup at prompt level.
- Add training sets of academic benchmarks.
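
Prompt-level deduplication here means keeping one row per distinct `input_text`, regardless of the response. A minimal sketch on plain Python dicts, assuming first-occurrence-wins and using a hypothetical `dedup_by_prompt` helper:

```python
import hashlib

def dedup_by_prompt(rows, key="input_text"):
    """Keep the first row for each distinct prompt, preserving order."""
    seen, kept = set(), []
    for row in rows:
        h = hashlib.md5(row[key].encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(row)
    return kept

rows = [
    {"input_text": "Q1", "output_text": "a"},
    {"input_text": "Q2", "output_text": "b"},
    {"input_text": "Q1", "output_text": "c"},  # same prompt, different answer
]
print(dedup_by_prompt(rows))
```

Because the key is the prompt alone, a repeated question with a different answer is still dropped.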

In [None]:
from datasets import Dataset

In [None]:
concat_train_ds = load_from_disk("/home/k/data/qwen_large_mix_dataset_v0")

In [None]:
concat_train_ds

Dataset({
    features: ['input_text', 'output_text', 'inp_ids', 'out_ids', 'input_length', 'output_length', 'token_length'],
    num_rows: 90846
})

In [None]:
import hashlib
def compute_hash(text):
    return hashlib.md5(text.encode('utf-8')).hexdigest()

# Set of prompt hashes seen so far
unique_hashes = set()

# Keep a row only the first time its prompt hash is seen.
# This relies on filter running in a single process (the default), so the set is shared.
def should_keep(row):
    hash_value = compute_hash(row['input_text'])
    if hash_value not in unique_hashes:
        unique_hashes.add(hash_value)
        return True
    return False

# Apply the deduplication
deduplicated_dataset = concat_train_ds.filter(should_keep)

Filter: 100%|█████████████████████████████████████████████████████████████████████████| 90846/90846 [00:16<00:00, 5519.22 examples/s]


In [None]:
deduplicated_dataset

Dataset({
    features: ['input_text', 'output_text', 'inp_ids', 'out_ids', 'input_length', 'output_length', 'token_length'],
    num_rows: 84088
})

In [None]:
dedup_ds = deduplicated_dataset.filter(lambda ex: ex["token_length"] < 1024)
dedup_ds.save_to_disk("/home/k/data/qwen_large_mix_dataset_v0_dedup_1024")

dedup_ds = deduplicated_dataset.filter(lambda ex: ex["token_length"] < 1536)
dedup_ds.save_to_disk("/home/k/data/qwen_large_mix_dataset_v0_dedup_1536")

Filter: 100%|█████████████████████████████████████████████████████████████████████████| 84088/84088 [00:16<00:00, 4964.11 examples/s]
Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████| 77421/77421 [00:01<00:00, 76641.38 examples/s]
Filter: 100%|█████████████████████████████████████████████████████████████████████████| 84088/84088 [00:17<00:00, 4871.63 examples/s]
Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████| 83706/83706 [00:01<00:00, 60328.43 examples/s]


### ARC-E

In [None]:
import numpy as np

In [None]:
ds = load_dataset("ai2_arc", name="ARC-Easy")
arc_e_train_samples = ds['train'].shuffle(42).select(range(25,len(ds['train'])))

In [None]:
arc_e_train_samples

Dataset({
    features: ['id', 'question', 'choices', 'answerKey'],
    num_rows: 2226
})

In [None]:
def process_arc_example(ex):
    choice_texts = ex['choices']['text']
    choice_labels = ex['choices']['label']
    # Render each option as "<label>. <text>"
    choices_list = [f"{l}. {c}" for c, l in zip(choice_texts, choice_labels)]

    choices = "\n".join(choices_list)
    question = ex["question"] + f"\n\n{choices}"

    # The target is the full text of the correct option
    answer_idx = {l: i for i, l in enumerate(choice_labels)}[ex['answerKey']]
    answer = choices_list[answer_idx]
    _, input_text = create_chat_text(tokenizer, question)
    return {"input_text": input_text, "output_text": answer}

In [None]:
arc_e_train_ds_processed = (arc_e_train_samples.shuffle(42).select(range(1000)).map(process_arc_example)
                            .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
print(arc_e_train_ds_processed[0]['input_text'])
print(arc_e_train_ds_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Living organisms require energy for biological processes. Chemical energy in a plant cell is

A. produced in vacuoles.
B. converted from solar energy.
C. developed by centrioles.
D. stored as kinetic energy.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


B. converted from solar energy.
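
The choice formatting in `process_arc_example` does not depend on the tokenizer, so it can be checked in isolation. A minimal sketch, assuming a hypothetical helper `format_multiple_choice` and a made-up question (neither is part of the notebook):

```python
def format_multiple_choice(question, texts, labels, answer_key):
    # Build "A. choice" style lines from parallel label/text lists.
    choices_list = [f"{l}. {c}" for c, l in zip(texts, labels)]
    prompt = question + "\n\n" + "\n".join(choices_list)
    # The answer key is a label, so look up its position among the labels.
    answer_idx = {l: i for i, l in enumerate(labels)}[answer_key]
    return prompt, choices_list[answer_idx]

# Toy example, made up for illustration (not taken from ARC).
prompt, answer = format_multiple_choice(
    "Which is a source of light?",
    ["rock", "the Sun", "water"],
    ["A", "B", "C"],
    "B",
)
```

Looking the answer up by label rather than by `ord(key) - ord("A")` also covers ARC items whose labels are numeric (`"1"`, `"2"`, ...).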


### ARC-C

In [None]:
ds = load_dataset("ai2_arc", name="ARC-Challenge")
arc_c_train_samples = ds['train'].shuffle(42).select(range(25,len(ds['train'])))

In [None]:
arc_c_train_samples

Dataset({
    features: ['id', 'question', 'choices', 'answerKey'],
    num_rows: 1094
})

In [None]:
arc_c_train_ds_processed = (arc_c_train_samples.shuffle(42).select(range(1000)).map(process_arc_example)
                            .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
print(arc_c_train_ds_processed[0]['input_text'])
print(arc_c_train_ds_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Carbon on Earth is found in both living and nonliving matter. In order for carbon to be continuously available, it must be recycled. Through which process is carbon made available in the atmosphere?

A. formation of fossil fuels
B. layering of soil
C. plant photosynthesis
D. forest fires<|eot_id|><|start_header_id|>assistant<|end_header_id|>


D. forest fires


### BoolQ

In [None]:
ds = load_dataset("boolq")
boolq_train_samples = ds['train']
boolq_train_samples

Dataset({
    features: ['question', 'answer', 'passage'],
    num_rows: 9427
})

In [None]:
def process_boolq_example(ex):
	question = "Passage: " + ex["passage"] +"\n\nQuestion: According to the passage, " + ex["question"] + "?"
	answer = "Yes" if ex['answer'] else "No"
	_,input_text = create_chat_text(tokenizer, question)
	return {"input_text":input_text, "output_text":answer}

In [None]:
boolq_train_ds_processed = (boolq_train_samples.shuffle(42).select(range(1000)).map(process_boolq_example)
                            .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
print(boolq_train_ds_processed[0]['input_text'])
print(boolq_train_ds_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Passage: Henry Daniel Mills is a fictional character in ABC's television series Once Upon a Time. Henry is the boy Emma Swan gave up to adoption; Regina Mills adopted him. Henry was originally portrayed as a child by Jared S. Gilmore, who won the Young Artist Award for Best Performance in a TV Series -- Leading Young Actor in 2012. For the show's seventh and final season, Andrew J. West later took over the role of Henry as an adult and father to a eight-year-old girl named Lucy, with Gilmore also making three appearances as Henry during the season.

Question: According to the passage, did henry die in once upon a time?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


No


### Commonsense QA

In [None]:
ds = load_dataset("commonsense_qa")
commonsenseqa_train_samples = ds['train'].shuffle(42).select(range(7,len(ds['train'])))
commonsenseqa_train_samples

Dataset({
    features: ['id', 'question', 'question_concept', 'choices', 'answerKey'],
    num_rows: 9734
})

In [None]:
def process_commonsenseqa_example(ex):
	# Same layout as ARC: parallel lists of choice texts and labels.
	choice_texts = ex['choices']['text']
	choice_labels = ex['choices']['label']
	choices_list = [f"{l}. {c}" for c,l in zip(choice_texts, choice_labels)]

	choices = "\n".join(choices_list)
	question = ex["question"] + f"\n\n{choices}"

	# answerKey is a label, so map it back to its position in the choice list.
	answer_idx = {l:i for i,l in enumerate(choice_labels)}[ex['answerKey']]
	answer = choices_list[answer_idx]
	_,input_text = create_chat_text(tokenizer, question)
	return {"input_text":input_text, "output_text":answer}

In [None]:
commonsenseqa_train_ds_processed = (commonsenseqa_train_samples.shuffle(42).select(range(1000)).map(process_commonsenseqa_example)
                                    .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
print(commonsenseqa_train_ds_processed[0]['input_text'])
print(commonsenseqa_train_ds_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

As the counterweight began to rise the elevator began to what into the mine?

A. park
B. reduce
C. descend
D. fall
E. set<|eot_id|><|start_header_id|>assistant<|end_header_id|>


C. descend


### HellaSwag

In [None]:
ds = load_dataset("hellaswag", name=None)
hellaswag_train_samples = ds['train']
hellaswag_train_samples

Dataset({
    features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings', 'source_id', 'split', 'split_type', 'label'],
    num_rows: 39905
})

In [None]:
hellaswag_train_samples[0]

{'ind': 4,
 'activity_label': 'Removing ice from car',
 'ctx_a': 'Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles.',
 'ctx_b': 'then',
 'ctx': 'Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles. then',
 'endings': [', the man adds wax to the windshield and cuts it.',
  ', a person board a ski lift, while two men supporting the head of the person wearing winter clothes snow as the we girls sled.',
  ', the man puts on a christmas coat, knitted with netting.',
  ', the man continues removing the snow on his car.'],
 'source_id': 'activitynet~v_-1IBHYS3L-Y',
 'split': 'train',
 'split_type': 'indomain',
 'label': '3'}

In [None]:
import string
def preprocess_hellaswag(text):
	text = text.strip()
	# NOTE: Brackets are artifacts of the WikiHow dataset portion of HellaSwag.
	text = text.replace(" [title]", ". ")
	text = re.sub(r"\[.*?\]", "", text)
	text = text.replace("  ", " ")
	return text
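
To see what the cleanup does, here is the same function applied to a made-up string in the WikiHow style (`[header]`, `[title]`, `[step]` markers). The sample text is an assumption for illustration, not a row from HellaSwag.

```python
import re

def preprocess_hellaswag(text):
    text = text.strip()
    # Turn the title marker into a sentence break before dropping other tags.
    text = text.replace(" [title]", ". ")
    # Remove any remaining bracketed markers such as [header] or [step].
    text = re.sub(r"\[.*?\]", "", text)
    # Collapse the double spaces left behind by the removals.
    text = text.replace("  ", " ")
    return text

sample = "Cleaning. [header] How to wash a car [title] Rinse first. [step] Use a hose."
cleaned = preprocess_hellaswag(sample)
print(cleaned)
```

The order matters: `[title]` has to be replaced before the generic bracket removal, otherwise the sentence break after the header would be lost.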

In [None]:
def process_hellaswag_example(ex):
	
	ctx = ex["ctx_a"] + " " + ex["ctx_b"].capitalize()
	query = preprocess_hellaswag(ex['activity_label'] + ". " + ctx)
	choices = [preprocess_hellaswag(ending) for ending in ex['endings']]
	choices_list = [f"{l}. {c}" for c, l in zip(choices, string.ascii_uppercase)]
 
	choices = "\n".join(choices_list)
	question = query + f"\n\n{choices}"

	answer = choices_list[int(ex['label'])]
	_,input_text = create_chat_text(tokenizer, question)
	return {"input_text":input_text, "output_text":answer}

In [None]:
hellaswag_train_ds_processed = (hellaswag_train_samples.shuffle(42).select(range(1000)).map(process_hellaswag_example)
                                .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
print(hellaswag_train_ds_processed[4]['input_text'])
print(hellaswag_train_ds_processed[4]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Health. How to choose superfoods good for your heart. Eat dark, leafy greens. Kale is a green that can be used in a variety of ways and paired with almost any food. Dark, leafy greens like swiss chard or collard, mustard, and turnip greens are superfoods that are good for your heart.

A. However, those with a higher cholesterol content may use some other vegetables. One ingredient that has a higher cholesterol content is nuts, seeds, chickpeas, and fruit.
B. You can eat these greens whole whole, cut in half, or feed them whole. Leafy greens tend to be full of fiber and can calm the stomach, so try to eat steamed greens cooked whole.
C. They are full of antioxidants and have anti-inflammatory properties, which helps promote heart health. Use kale and other dark leafy greens to make s

### Winogrande

In [None]:
ds = load_dataset("winogrande", name="winogrande_xl")
winogrande_train_samples = ds['train'].shuffle(42).select(range(5,len(ds['train'])))
winogrande_train_samples

Dataset({
    features: ['sentence', 'option1', 'option2', 'answer'],
    num_rows: 40393
})

In [None]:
import string

def process_winogrande_example(ex):
	idx = ex['sentence'].index("_")
	query = ex['sentence'][:idx].strip()
	remaining = ex['sentence'][idx+1:].strip()

	choices = [ex['option1'] + " " + remaining, ex['option2'] + " " + remaining]
	choices_list = [f"{l}. {c}" for c, l in zip(choices, string.ascii_uppercase)]
	choices = "\n".join(choices_list)

	question = f"Sentence: {query}\n\nWhich one is the most likely continuation?\n\n{choices}"

	answer = choices_list[int(ex['answer'])-1]
	_,input_text = create_chat_text(tokenizer, question)
	return {"input_text":input_text, "output_text":answer}

In [None]:
wino_train_ds_processed = (winogrande_train_samples.shuffle(42).select(range(1000)).map(process_winogrande_example)
                           .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
print(wino_train_ds_processed[1]['input_text'])
print(wino_train_ds_processed[1]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Sentence: Paying the electricity bill is an adult responsibility for Joseph but not for Jeffrey because

Which one is the most likely continuation?

A. Joseph lives on his own.
B. Jeffrey lives on his own.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


A. Joseph lives on his own.
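
The blank-splitting step in `process_winogrande_example` can be tried without the tokenizer. A minimal sketch, assuming a hypothetical helper `split_on_blank` and a made-up Winograd-style sentence (not a row from the dataset):

```python
import string

def split_on_blank(sentence, option1, option2):
    # Everything before "_" is the query; everything after completes each option.
    idx = sentence.index("_")
    query = sentence[:idx].strip()
    remaining = sentence[idx + 1:].strip()
    choices = [f"{option1} {remaining}", f"{option2} {remaining}"]
    labeled = [f"{l}. {c}" for c, l in zip(choices, string.ascii_uppercase)]
    return query, labeled

sentence = "The trophy did not fit in the suitcase because the _ was too big."
query, labeled = split_on_blank(sentence, "trophy", "suitcase")

# The dataset stores the answer as the string "1" or "2", hence the -1 offset.
answer = labeled[int("1") - 1]
```

`str.index` raises `ValueError` when the sentence has no `_`, so a malformed row fails loudly instead of producing an empty query.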


### GSM8K

In [None]:
ds = load_dataset("gsm8k", name="main")
gsm8k_train_samples = ds['train'].shuffle(42).select(range(8, len(ds['train'])))	
gsm8k_train_samples

Dataset({
    features: ['question', 'answer'],
    num_rows: 7465
})

In [None]:
def process_gsm8k_example(ex):
	question = ex['question']
	answer = ex['answer']
	_,input_text = create_chat_text(tokenizer, question)
	return {"input_text":input_text, "output_text":answer}

In [None]:
gsm8k_train_ds_processed = (gsm8k_train_samples.shuffle(42).select(range(1000)).map(process_gsm8k_example)
                           .select_columns(['input_text', 'output_text']))

In [None]:
print(gsm8k_train_ds_processed[1]['input_text'])
print(gsm8k_train_ds_processed[1]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Each member of Greg’s softball team needs to buy one uniform made up of a shirt, a pair of pants, and socks. A shirt costs $7.50, a pair of pants cost $15, and socks cost $4.50 each if each team member buys the uniform items on their own. If they buy the items as a group, they are given a discount. A discounted shirt cost $6.75, a discounted pair of pants cost $13.50, and discounted socks cost $3.75. How much would their team of 12 save with the group discount?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


The cost of a regularly-priced uniform is $7.50 for the shirt + $15 for pants + $4.50 for socks = $<<7.5+15+4.5=27>>27.
The cost of a discounted uniform is $6.75 for the shirt + $13.50 for pants + $3.75 for socks = $<<6.75+13.5+3.75=24>>24.
By purchasing the discounted 

### Note

HumanEval, MMLU, MMLU-Pro, BIG-Bench Hard, and AGIEval don't have train sets, so they are not added to the mix.

In [None]:
# only test
# ds = load_dataset("openai_humaneval")
# ds  

In [None]:
# MMLU_SUBJECTS = [
# 	"abstract_algebra",
# 	"anatomy",
# 	"astronomy",
# 	"business_ethics",
# 	"clinical_knowledge",
# 	"college_biology",
# 	"college_chemistry",
# 	"college_computer_science",
# 	"college_mathematics",
# 	"college_medicine",
# 	"college_physics",
# 	"computer_security",
# 	"conceptual_physics",
# 	"econometrics",
# 	"electrical_engineering",
# 	"elementary_mathematics",
# 	"formal_logic",
# 	"global_facts",
# 	"high_school_biology",
# 	"high_school_chemistry",
# 	"high_school_computer_science",
# 	"high_school_european_history",
# 	"high_school_geography",
# 	"high_school_government_and_politics",
# 	"high_school_macroeconomics",
# 	"high_school_mathematics",
# 	"high_school_microeconomics",
# 	"high_school_physics",
# 	"high_school_psychology",
# 	"high_school_statistics",
# 	"high_school_us_history",
# 	"high_school_world_history",
# 	"human_aging",
# 	"human_sexuality",
# 	"international_law",
# 	"jurisprudence",
# 	"logical_fallacies",
# 	"machine_learning",
# 	"management",
# 	"marketing",
# 	"medical_genetics",
# 	"miscellaneous",
# 	"moral_disputes",
# 	"moral_scenarios",
# 	"nutrition",
# 	"philosophy",
# 	"prehistory",
# 	"professional_accounting",
# 	"professional_law",
# 	"professional_medicine",
# 	"professional_psychology",
# 	"public_relations",
# 	"security_studies",
# 	"sociology",
# 	"us_foreign_policy",
# 	"virology",
# 	"world_religions",
# ]

In [None]:
# ds = load_dataset("cais/mmlu", name=MMLU_SUBJECTS[0])
# ds

In [None]:
# ds = load_dataset("TIGER-Lab/MMLU-Pro")
# MMLU_PRO_SUBJECTS = set(ds['validation']['category'])

In [None]:
# ds

In [None]:
# AGIEVAL_DATASETS = ["dmayhem93/agieval-aqua-rat",
# 					"dmayhem93/agieval-gaokao-english",
# 					"dmayhem93/agieval-logiqa-en",
# 					"dmayhem93/agieval-lsat-ar",
# 					"dmayhem93/agieval-lsat-lr",
# 					"dmayhem93/agieval-lsat-rc",
# 					"dmayhem93/agieval-sat-en-without-passage",
# 					"dmayhem93/agieval-sat-en",
# 					"dmayhem93/agieval-sat-math",
# 					"hails/agieval-math"
# ]

# ds = load_dataset(AGIEVAL_DATASETS[0])
# ds

In [None]:
# BIGBENCH_HARD_TASK_NAMES =  ['boolean_expressions', 
# 						'causal_judgement', 
# 						'date_understanding', 
# 						'disambiguation_qa', 
# 						'dyck_languages',
# 						'formal_fallacies',
# 						'geometric_shapes',
# 						'hyperbaton',
# 						'logical_deduction_three_objects',
# 						'logical_deduction_five_objects',
# 						'logical_deduction_seven_objects',
# 						'movie_recommendation',
# 						'multistep_arithmetic_two',
# 						'navigate',
# 						'object_counting',
# 						'penguins_in_a_table',
# 						'reasoning_about_colored_objects',
# 						'ruin_names',
# 						'salient_translation_error_detection',
# 						'snarks', 
# 						'sports_understanding', 
# 						'temporal_sequences',
# 						'tracking_shuffled_objects_three_objects',					
# 						'tracking_shuffled_objects_five_objects',
# 						'tracking_shuffled_objects_seven_objects',
# 						'web_of_lies',
# 						'word_sorting']

# ds = load_dataset("maveriq/bigbenchhard", name=BIGBENCH_HARD_TASK_NAMES[0])
# ds

### SQuAD

In [None]:
ds = load_dataset("rajpurkar/squad_v2")
squad_train_samples = ds['train']
squad_train_samples

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 130319
})

In [None]:
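def process_squad_example(ex):
	context = ex['context']
	question = ex['question']
	question = f"Context: {context}" + "\n\n" + f"Question: {question}"
	# SQuAD v2 includes unanswerable questions, whose answer list is empty.
	answer = ex['answers']['text']
	if answer:
		answer = answer[0].capitalize()
	else:
		answer = 'No answer can be found in the context.'

	# NOTE: unlike the processors above, input_text is the raw prompt here;
	# it is not wrapped with create_chat_text.
	return {"input_text":question, "output_text":answer}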

In [None]:
squad_train_ds_processed = (squad_train_samples.shuffle(42).select(range(1000)).map(process_squad_example)
                           .select_columns(['input_text', 'output_text']))

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
print(squad_train_ds_processed[1]['input_text'])
print(squad_train_ds_processed[1]['output_text'])

Context: Alexandria was the most important trade center in the whole empire during Athanasius's boyhood. Intellectually, morally, and politically—it epitomized the ethnically diverse Graeco-Roman world, even more than Rome or Constantinople, Antioch or Marseilles. Its famous catechetical school, while sacrificing none of its famous passion for orthodoxy since the days of Pantaenus, Clement of Alexandria, Origen of Alexandria, Dionysius and Theognostus, had begun to take on an almost secular character in the comprehensiveness of its interests, and had counted influential pagans among its serious auditors.

Question: What was Alexandria known for?
Important trade center
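
One detail of `answer[0].capitalize()` is worth checking: `str.capitalize` upper-cases the first character and lower-cases the rest, so answers containing proper nouns are altered. A small sketch, where `pick_answer` mirrors the notebook's logic and `upper_first` is a hypothetical alternative that only touches the first character (it is not what the notebook uses):

```python
FALLBACK = "No answer can be found in the context."

def pick_answer(answers, fallback=FALLBACK):
    # First listed answer, capitalized; fallback when the list is empty.
    return answers[0].capitalize() if answers else fallback

def upper_first(text):
    # Hypothetical alternative: upper-case only the first character.
    return text[:1].upper() + text[1:]

as_in_notebook = pick_answer(["the New York Times"])
preserving_case = upper_first("the New York Times")
unanswerable = pick_answer([])
```

Whether the lower-casing matters depends on how the answers are scored downstream; for exact-match style targets, preserving the original casing may be preferable.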


### DROP

In [None]:
ds = load_dataset("ucinlp/drop")
drop_train_samples = ds['train']
drop_train_samples

Dataset({
    features: ['section_id', 'query_id', 'passage', 'question', 'answers_spans'],
    num_rows: 77400
})

In [None]:
def process_drop_example(ex):
	context = ex['passage']
	question = ex['question']
	question = f"Context: {context}" + "\n\n" + f"Question: {question}"
	# Use the first answer span; fall back when no span is given.
	answer = ex['answers_spans']['spans']
	if answer:
		answer = answer[0].capitalize()
	else:
		answer = 'No answer can be found in the context.'

	# NOTE: as with SQuAD, input_text is the raw prompt here;
	# it is not wrapped with create_chat_text.
	return {"input_text":question, "output_text":answer}

In [None]:
drop_train_ds_processed = (drop_train_samples.shuffle(42).select(range(1000)).map(process_drop_example)
                           .select_columns(['input_text', 'output_text']))

In [None]:
print(drop_train_ds_processed[4]['input_text'])
print(drop_train_ds_processed[4]['output_text'])

Context: There are 1,068,573 households in the municipality, giving an average household size of 3.3 people. Of those households, 78.4% are in formal structures (houses or apartment), while 20.5% are in informal structures (Shanty town). 94.0% of households use mains electricity for lighting. 87.3% of households have water supply to the dwelling, while 12.0% have piped water through a communal tap. 94.9% of households have regular refuse collection service. 91.4% of households have a flush toilet or chemical toilet, while 4.5% still use a bucket toilet. 82.1% of households have a refrigerator, 87.3% have a television and 70.1% have a radio. Only 34.0% have a landline telephone, but 91.3% have a cellphone. 37.9% have a computer, and 49.3% have access to the Internet (either through a computer or a cellphone).

Question: How many more percent of residents lived in formal structures than informal structures?
57.9


### Create Dataset v1

In [None]:
concat_train_ds = concatenate_datasets([deduplicated_dataset.select_columns(['input_text', 'output_text']), 
                                        arc_e_train_ds_processed,
                                        arc_c_train_ds_processed,
                                        boolq_train_ds_processed,
                                        commonsenseqa_train_ds_processed,
                                        hellaswag_train_ds_processed,
                                        wino_train_ds_processed,
                                        gsm8k_train_ds_processed,
                                        squad_train_ds_processed,
                                        drop_train_ds_processed]).shuffle(42)

In [None]:
concat_train_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v1")

Saving the dataset (0/1 shards):   0%|          | 0/93078 [00:00<?, ? examples/s]

In [None]:
concat_train_ds

Dataset({
    features: ['input_text', 'output_text'],
    num_rows: 93078
})

In [None]:
def tokenize(exs):
    return {"input_token_ids" : tokenizer(exs['input_text'])['input_ids'],
            "output_token_ids" : tokenizer(exs['output_text'])['input_ids']}
    
def token_lengths(ex):
    return {"token_length": len(ex['input_token_ids']) + len(ex['output_token_ids'])}

In [None]:
concat_train_ds = concat_train_ds.map(tokenize, batched=True)
concat_train_ds = concat_train_ds.map(token_lengths)

Map:   0%|          | 0/93078 [00:00<?, ? examples/s]

Map:   0%|          | 0/93078 [00:00<?, ? examples/s]

In [None]:
pd.Series(concat_train_ds['token_length']).value_counts(normalize=True, bins=[0,512,1024,1536,2048,4096]).sort_index().cumsum()


(-0.001, 512.0]     0.699199
(512.0, 1024.0]     0.925278
(1024.0, 1536.0]    0.995885
(1536.0, 2048.0]    0.999194
(2048.0, 4096.0]    0.999914
Name: proportion, dtype: float64
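
The cumulative proportions above answer the question "what fraction of examples survives a given length cutoff?". A dependency-free sketch of the same computation, using a hypothetical helper `cumulative_coverage` and made-up lengths:

```python
def cumulative_coverage(lengths, cutoffs):
    # Fraction of examples whose token length is <= each cutoff.
    n = len(lengths)
    return {c: sum(1 for x in lengths if x <= c) / n for c in cutoffs}

# Toy token lengths, made up for illustration.
lengths = [100, 300, 600, 900, 1100, 1500, 2000, 5000]
coverage = cumulative_coverage(lengths, [512, 1024, 1536, 2048, 4096])
print(coverage)
```

Two details to keep in mind when reading the table: the pandas bins are right-closed, so an example of exactly 1024 tokens is counted in `(512, 1024]`, while a filter on `token_length < 1024` drops it. The last cumulative value staying below 1 suggests that a few examples exceed 4096 tokens and fall outside every bin.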

In [None]:
concat_train_filtered_ds = concat_train_ds.filter(lambda ex: ex["token_length"] < 1024)

Filter:   0%|          | 0/93078 [00:00<?, ? examples/s]

In [None]:
concat_train_filtered_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v1_1024")

Saving the dataset (0/1 shards):   0%|          | 0/86090 [00:00<?, ? examples/s]

In [None]:
concat_train_filtered_ds = concat_train_ds.filter(lambda ex: ex["token_length"] < 1536)

Filter:   0%|          | 0/93078 [00:00<?, ? examples/s]

In [None]:
concat_train_filtered_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v1_1536")

Saving the dataset (0/2 shards):   0%|          | 0/92693 [00:00<?, ? examples/s]