Prepare instruction fine-tuning (IFT) datasets in `{"input_text": ..., "output_text": ...}` format. Each `input_text` is rendered with the model's chat template so the data can be used to fine-tune chat models.

In [1]:
import re
import pandas as pd
from transformers import AutoTokenizer
from datasets import concatenate_datasets, load_dataset



In [2]:
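def create_chat_text(tokenizer, user_prompt, system="You are a helpful AI assistant."):
    """Build a system + user conversation and render it with the tokenizer's chat template."""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt}
    ]
    # tokenize=False returns a string instead of token ids;
    # add_generation_prompt=True appends the assistant header so the reply can follow.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return messages, text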
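
To make the template's effect concrete, here is a minimal hand-written sketch of the string `create_chat_text` returns for one system and one user message. It is an approximation inferred from the printed examples in this notebook, not the tokenizer's actual template, and `render_llama3_prompt` is a hypothetical helper that does not exist in `transformers`.

```python
def render_llama3_prompt(user_prompt, system="You are a helpful AI assistant."):
    """Approximate the rendered Llama 3 chat prompt as a plain string (hypothetical helper)."""
    def turn(role, content):
        # Each turn is a role header, a blank line, the content, and an end-of-turn marker.
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    return (
        "<|begin_of_text|>"
        + turn("system", system)
        + turn("user", user_prompt)
        # The generation prompt: an open assistant header with no content yet.
        + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = render_llama3_prompt("What is 2 + 2?")
print(prompt)
```

The trailing assistant header is why `output_text` can be appended directly after `input_text` at training time.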

In [3]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

### oasst

In [4]:
def process_example(example):
    t = example["text"]

    # Locate every turn marker; only the first human/assistant exchange is kept.
    human_prefix, asst_prefix = "### Human: ", "### Assistant: "
    human_matches = list(re.finditer(human_prefix, t))
    asst_matches = list(re.finditer(asst_prefix, t))
    human_start_idx, _ = human_matches[0].span(0)
    asst_start_idx, _ = asst_matches[0].span(0)
    # The first reply ends where the second human turn starts, if there is one.
    if len(human_matches) > 1:
        asst_end_idx = human_matches[1].span(0)[0]
    else:
        asst_end_idx = len(t)

    human_prompt = t[human_start_idx+len(human_prefix):asst_start_idx]
    asst_prompt = t[asst_start_idx+len(asst_prefix):asst_end_idx]

    _, input_text = create_chat_text(tokenizer, human_prompt)
    output_text = asst_prompt
    return {"input_text": input_text, "output_text": output_text}
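
The slicing logic can be checked without the tokenizer or the dataset. The sketch below applies the same first-turn extraction to a made-up string; `split_first_turn` and the sample text are illustrative assumptions, not part of the dataset or of any library.

```python
import re

HUMAN_PREFIX, ASST_PREFIX = "### Human: ", "### Assistant: "


def split_first_turn(text):
    """Return (human_prompt, assistant_reply) for the first exchange only."""
    human_starts = [m.start() for m in re.finditer(re.escape(HUMAN_PREFIX), text)]
    asst_starts = [m.start() for m in re.finditer(re.escape(ASST_PREFIX), text)]
    # The first reply ends where the second human turn begins, if there is one.
    reply_end = human_starts[1] if len(human_starts) > 1 else len(text)
    human = text[human_starts[0] + len(HUMAN_PREFIX):asst_starts[0]]
    reply = text[asst_starts[0] + len(ASST_PREFIX):reply_end]
    return human, reply


# Made-up two-turn conversation in the "### Human: / ### Assistant: " layout.
sample = "### Human: Hi there### Assistant: Hello!### Human: Thanks### Assistant: Welcome"
human, reply = split_first_turn(sample)
print(repr(human), repr(reply))
```

Later turns are discarded, so multi-turn conversations contribute only their opening exchange.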

In [5]:
train_ds1 = load_dataset("timdettmers/openassistant-guanaco")['train']

Downloading readme: 100%|██████████| 395/395 [00:00<00:00, 1.30MB/s]
Repo card metadata block was not found. Setting CardData to empty.
Downloading data: 100%|██████████| 20.9M/20.9M [00:00<00:00, 22.3MB/s]
Downloading data: 100%|██████████| 1.11M/1.11M [00:00<00:00, 17.6MB/s]
Generating train split: 100%|██████████| 9846/9846 [00:00<00:00, 250452.53 examples/s]
Generating test split: 100%|██████████| 518/518 [00:00<00:00, 79965.02 examples/s]


In [6]:
train_ds1_processed = train_ds1.map(process_example)

Map: 100%|██████████| 9846/9846 [00:00<00:00, 9978.76 examples/s] 


In [7]:
len(train_ds1_processed)

9846

In [8]:
print(train_ds1_processed[0]['input_text'])
print(train_ds1_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


"Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.

Recent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies

### orca-math

In [9]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['question'])
    return {"input_text":input_text, "output_text":example['answer']}

In [10]:
train_ds2 = load_dataset("microsoft/orca-math-word-problems-200k")
train_ds2 = train_ds2['train'].shuffle(42).select(range(10000))

Downloading readme: 100%|██████████| 6.91k/6.91k [00:00<00:00, 6.57MB/s]
Downloading data: 100%|██████████| 84.2M/84.2M [00:00<00:00, 174MB/s] 
Generating train split: 100%|██████████| 200035/200035 [00:00<00:00, 243547.74 examples/s]


In [11]:
train_ds2_processed = train_ds2.map(process_example)

Map: 100%|██████████| 10000/10000 [00:01<00:00, 7254.98 examples/s]


In [12]:
train_ds2_processed[3]

{'question': 'Quentin, Skylar, and Colten have a total of 383 chickens. Quentin has 25 more than double the chickens that Skylar has. Skylar has 4 less than a certain multiple of the number of chickens that Colten has. Colten has 37 chickens. What is the multiple of the number of chickens Colten has that Skylar has 4 less than?',
 'answer': "Let's denote the number of chickens that Skylar has as S and the number of chickens that Quentin has as Q. We know that Colten has 37 chickens.\n\nAccording to the information given:\n\n1) Quentin has 25 more than double the chickens that Skylar has:\nQ = 2S + 25\n\n2) The total number of chickens is 383:\nQ + S + 37 = 383\n\nNow, let's substitute the expression for Q from the first equation into the second equation:\n\n(2S + 25) + S + 37 = 383\n3S + 62 = 383\n3S = 383 - 62\n3S = 321\nS = 321 / 3\nS = 107\n\nNow we know Skylar has 107 chickens.\n\nThe problem states that Skylar has 4 less than a certain multiple of the number of chickens that Colte

In [13]:
len(train_ds2_processed)

10000

In [14]:
print(train_ds2_processed[0]['input_text'])
print(train_ds2_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Sally had 13 peaches at her roadside fruit dish.  She went to the orchard and picked peaches to stock up. She picked 55 peaches. There are _____ peaches now.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Sally originally had 13 peaches. She picked 55 more peaches. To find out the total number of peaches she has now, we add the two amounts together:

13 (original peaches) + 55 (picked peaches) = 68 peaches

So, there are 68 peaches now.


### meta-math-qa

In [15]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['query'])
    return {"input_text":input_text, "output_text":example['response']}

In [16]:
train_ds3 = load_dataset("meta-math/MetaMathQA")
train_ds3 = train_ds3['train'].shuffle(42).select(range(10000))

Downloading readme: 100%|██████████| 4.45k/4.45k [00:00<00:00, 4.68MB/s]
Downloading data: 100%|██████████| 396M/396M [00:09<00:00, 43.4MB/s] 
Generating train split: 100%|██████████| 395000/395000 [00:07<00:00, 54893.41 examples/s]


In [17]:
train_ds3_processed = train_ds3.map(process_example)

Map: 100%|██████████| 10000/10000 [00:01<00:00, 6723.92 examples/s]


In [18]:
train_ds3_processed[3]

{'type': 'MATH_SV',
 'query': 'Compute the sum of the squares of the roots of the equation \\[x^{2018} +44x^{2015} + 3x^3 + 404 = X The answer is 0. What is the value of unknown variable X?',
 'original_question': 'Compute the sum of the squares of the roots of the equation \\[x^{2018} +44x^{2015} + 3x^3 + 404 = 0.\\]',
 'response': 'To solve this problem, we need to determine the value of x that will make the sum of the squares of the roots of the equation equal to 0.\nThe equation is given as:\n\\[x^{2018} + 44x^{2015} + 3x^3 + 404 = X\\]\nTo find the sum of the squares of the roots, we need to find the roots of the equation and then square them.\nSince the answer is 0, it means that the sum of the squares of the roots must be equal to 0.\nTo make the sum of squares 0, we need to find the values of x that will make each square equal to 0.\nIf a square is equal to 0, it means that the value inside the square must be 0.\nIn this case, we need to set each term inside the square to 0 and

In [19]:
len(train_ds3_processed)

10000

In [20]:
print(train_ds3_processed[0]['input_text'])
print(train_ds3_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

If Anna wants to create a smiley face shape using red and yellow tulips, she requires 8 red tulips for each eye and 18 red tulips for the smile. Additionally, she needs 9 times the number of tulips in the smile to create the yellow background of the face. What is the total number of tulips that Anna needs?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Anna needs 8 red tulips for each eye, so for both eyes she needs 8 * 2 = 16 red tulips.
She also needs 18 red tulips for the smile.
The total number of red tulips she needs is 16 + 18 = 34 red tulips.
For the yellow background, she needs 9 times the number of tulips in the smile, so she needs 9 * 18 = 162 yellow tulips.
The total number of tulips she needs is 34 + 162 = 196 tulips.
#### 196
The answer is: 196


### ultrafeedback

In [21]:
def process_example(example):
    chosen = example['chosen']
    _,input_text = create_chat_text(tokenizer, chosen[0]['content'])
    return {"input_text":input_text, "output_text":chosen[1]['content']}

In [22]:
train_ds4 = load_dataset('HuggingFaceH4/ultrafeedback_binarized')
train_ds4 = train_ds4['train_sft'].shuffle(42).select(range(10000))
train_ds4

Downloading readme: 100%|██████████| 6.77k/6.77k [00:00<00:00, 7.51MB/s]
Downloading data: 100%|██████████| 226M/226M [00:01<00:00, 159MB/s]  
Downloading data: 100%|██████████| 226M/226M [00:00<00:00, 226MB/s]  
Downloading data: 100%|██████████| 7.29M/7.29M [00:00<00:00, 66.0MB/s]
Downloading data: 100%|██████████| 3.72M/3.72M [00:00<00:00, 35.2MB/s]
Downloading data: 100%|██████████| 184M/184M [00:00<00:00, 221MB/s]  
Downloading data: 100%|██████████| 3.02M/3.02M [00:00<00:00, 26.8MB/s]
Generating train_prefs split: 100%|██████████| 61135/61135 [00:01<00:00, 51651.31 examples/s]
Generating train_sft split: 100%|██████████| 61135/61135 [00:01<00:00, 55942.33 examples/s]
Generating test_prefs split: 100%|██████████| 2000/2000 [00:00<00:00, 44999.40 examples/s]
Generating test_sft split: 100%|██████████| 1000/1000 [00:00<00:00, 43104.71 examples/s]
Generating train_gen split: 100%|██████████| 61135/61135 [00:00<00:00, 61933.34 examples/s]
Generating test_gen split: 100%|██████████| 10

Dataset({
    features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
    num_rows: 10000
})

In [23]:
train_ds4_processed = train_ds4.map(process_example)

Map: 100%|██████████| 10000/10000 [00:02<00:00, 4761.43 examples/s]


In [24]:
len(train_ds4_processed)

10000

In [25]:
print(train_ds4_processed[0]['input_text'])
print(train_ds4_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Do you know something about crystallography and structure factor?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Crystallography is the science of the arrangement of atoms in solids. It is a vast and interdisciplinary field that has applications in physics, chemistry, materials science, biology, and engineering.

The structure factor is a mathematical function that is used to describe the diffraction of waves by a crystal. It is a complex number that is related to the atomic positions in the crystal.

The structure factor can be used to calculate the intensity of the diffracted waves. This information can be used to determine the atomic positions in the crystal and to study the structure of materials.

Crystallography is a powerful tool for understanding the structure of materials. It has been used to determine the structures of many 

### ultrachat-200k

In [26]:
def process_example(example):
    messages = example['messages']
    _,input_text = create_chat_text(tokenizer, messages[0]['content'])
    return {"input_text":input_text, "output_text":messages[1]['content']}

In [27]:
train_ds5 = load_dataset('HuggingFaceH4/ultrachat_200k', name='default')
train_ds5 = train_ds5['train_sft'].shuffle(42).select(range(10000))
train_ds5

Downloading readme: 100%|██████████| 4.44k/4.44k [00:00<00:00, 9.05MB/s]
Downloading data: 100%|██████████| 244M/244M [00:01<00:00, 211MB/s]  
Downloading data: 100%|██████████| 244M/244M [00:01<00:00, 211MB/s]  
Downloading data: 100%|██████████| 244M/244M [00:01<00:00, 217MB/s]  
Downloading data: 100%|██████████| 81.2M/81.2M [00:00<00:00, 188MB/s] 
Downloading data: 100%|██████████| 244M/244M [00:01<00:00, 213MB/s]  
Downloading data: 100%|██████████| 243M/243M [00:01<00:00, 211MB/s]  
Downloading data: 100%|██████████| 243M/243M [00:01<00:00, 194MB/s]  
Downloading data: 100%|██████████| 80.4M/80.4M [00:00<00:00, 183MB/s] 
Generating train_sft split: 100%|██████████| 207865/207865 [00:06<00:00, 32004.72 examples/s]
Generating test_sft split: 100%|██████████| 23110/23110 [00:00<00:00, 42637.06 examples/s]
Generating train_gen split: 100%|██████████| 256032/256032 [00:06<00:00, 42278.84 examples/s]
Generating test_gen split: 100%|██████████| 28304/28304 [00:00<00:00, 56562.58 example

Dataset({
    features: ['prompt', 'prompt_id', 'messages'],
    num_rows: 10000
})

In [28]:
train_ds5_processed = train_ds5.map(process_example)

Map: 100%|██████████| 10000/10000 [00:02<00:00, 4647.41 examples/s]


In [29]:
len(train_ds5_processed)

10000

In [30]:
print(train_ds5_processed[0]['input_text'])
print(train_ds5_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

How does the location of the Sydney Conservatorium of Music impact the academic and professional opportunities available to music students, and how does the conservatorium support student engagement with the music industry in Australia?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


The location of the Sydney Conservatorium of Music, which is situated in the heart of Sydney's cultural precinct, impacts both the academic and professional opportunities available to music students. The conservatorium is located near several major performing arts venues and organizations, including the Sydney Opera House, the Australian Broadcasting Corporation, and the Sydney Symphony Orchestra, providing students with easy access to performances, rehearsals, and networking opportunities.

One of the primary ways the conservatorium supports student engag

### open hermes 2.5

In [31]:
def process_example(example):
    messages = example['conversations']
    _,input_text = create_chat_text(tokenizer, messages[0]['value'])
    return {"input_text":input_text, "output_text":messages[1]['value']}
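
Indexing `messages[0]` and `messages[1]` assumes every conversation opens with the user turn. If some records begin with a system message (an assumption about the data, not something verified in this notebook), a role-based lookup avoids treating the system prompt as the user prompt. The sketch below assumes ShareGPT-style `from` keys with `system`, `human`, and `gpt` values; `first_exchange` and the sample record are hypothetical.

```python
def first_exchange(conversations):
    """Return (user_text, assistant_text) chosen by role rather than by position."""
    user_text = None
    for message in conversations:
        if message["from"] == "human" and user_text is None:
            user_text = message["value"]
        elif message["from"] == "gpt" and user_text is not None:
            # First assistant reply that follows the first human turn.
            return user_text, message["value"]
    raise ValueError("no human/gpt exchange found")


# Made-up record with a leading system turn.
record = [
    {"from": "system", "value": "You are a math tutor."},
    {"from": "human", "value": "What is 3 * 4?"},
    {"from": "gpt", "value": "3 * 4 = 12."},
]
user_text, assistant_text = first_exchange(record)
print(user_text, "->", assistant_text)
```

With positional indexing, the same record would yield the system prompt as the input and the user question as the output.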

In [32]:
train_ds6 = load_dataset('teknium/OpenHermes-2.5')
train_ds6 = train_ds6['train'].shuffle(42).select(range(10000))
train_ds6

Downloading readme: 100%|██████████| 4.75k/4.75k [00:00<00:00, 11.8MB/s]
Downloading data: 100%|██████████| 1.94G/1.94G [00:07<00:00, 245MB/s] 
Generating train split: 100%|██████████| 1001551/1001551 [00:22<00:00, 45143.64 examples/s]


Dataset({
    features: ['custom_instruction', 'topic', 'model_name', 'model', 'skip_prompt_formatting', 'category', 'conversations', 'views', 'language', 'id', 'title', 'idx', 'hash', 'avatarUrl', 'system_prompt', 'source'],
    num_rows: 10000
})

In [33]:
train_ds6_processed = train_ds6.map(process_example)

Map: 100%|██████████| 10000/10000 [00:02<00:00, 3748.61 examples/s]


In [34]:
len(train_ds6_processed)

10000

In [35]:
print(train_ds6_processed[0]['input_text'])
print(train_ds6_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

A Sierpinski arrowhead curve is created by dividing an equilateral triangle into three smaller equilateral triangles and removing the central one, then doing the same for each remaining smaller triangle. If the length of the original triangle is 6cm, what is the perimeter of the Sierpinski arrowhead curve after the third iteration?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


After each iteration, the perimeter of the Sierpinski arrowhead curve increases by a factor of 2. Let's calculate the perimeter after the third iteration:

1st iteration:
Perimeter = 6 cm * 2 = 12 cm

2nd iteration:
Perimeter = 12 cm * 2 = 24 cm

3rd iteration:
Perimeter = 24 cm * 2 = 48 cm

So, the perimeter of the Sierpinski arrowhead curve after the third iteration is 48 cm.


### starcoder-2

In [36]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['prompt'])
    return {"input_text":input_text, "output_text":example['response']}

In [37]:
train_ds7 = load_dataset('bigcode/self-oss-instruct-sc2-exec-filter-50k')
train_ds7 = train_ds7['train'].shuffle(42).select(range(10000))
train_ds7

Downloading readme: 100%|██████████| 922/922 [00:00<00:00, 2.62MB/s]
Downloading data: 100%|██████████| 90.1M/90.1M [00:00<00:00, 126MB/s] 
Generating train split: 100%|██████████| 50661/50661 [00:00<00:00, 89854.26 examples/s]


Dataset({
    features: ['fingerprint', 'sha1', 'seed', 'response', 'concepts', 'prompt', 'instruction', 'id'],
    num_rows: 10000
})

In [38]:
train_ds7_processed = train_ds7.map(process_example)

Map: 100%|██████████| 10000/10000 [00:01<00:00, 5161.56 examples/s]


In [39]:
len(train_ds7_processed)

10000

In [40]:
print(train_ds7_processed[0]['input_text'])
print(train_ds7_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Provide the best response to a given instruction. Follow the following steps to craft your response:
1. reason about the given instruction
2. provide a high-quality solution
3. offer a concise explanation
4. write tests to verify the correctness your solution

## Example 1
### Instruction
Design a Python function that takes a sorted array and a target value, and return a valid index where target can be inserted to maintain the array's sorted order. Optimize the function to run in logarithmic time complexity.

For example, given `array = [1, 3, 5, 5, 6]` and `target = 5`, the function should return either 2 or 3 because 5 presents at both indices 2 and 3.

### Response
[Reasoning]
To solve this problem efficiently and ensure logarithmic time complexity, we can use a binary search algorithm. Compared with a standard binary search that looks for a

### The Tome

In [41]:
def process_example(example):
    messages = example['conversations']
    _,input_text = create_chat_text(tokenizer, messages[0]['value'])
    return {"input_text":input_text, "output_text":messages[1]['value']}

In [42]:
train_ds8 = load_dataset('arcee-ai/The-Tome')
train_ds8 = train_ds8['train'].shuffle(42).select(range(10000))
train_ds8

Downloading readme: 100%|██████████| 1.75k/1.75k [00:00<00:00, 5.69MB/s]
Downloading data: 100%|██████████| 4.58G/4.58G [00:23<00:00, 196MB/s] 
Generating train split: 100%|██████████| 1752473/1752473 [00:07<00:00, 235702.00 examples/s]


Dataset({
    features: ['conversations', 'dataset'],
    num_rows: 10000
})

In [43]:
train_ds8_processed = train_ds8.map(process_example)

Map: 100%|██████████| 10000/10000 [00:01<00:00, 5854.94 examples/s]


In [44]:
len(train_ds8_processed)

10000

In [45]:
print(train_ds8_processed[0]['input_text'])
print(train_ds8_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Google’s recent announcement to create 10,000 new jobs in Europe, 6,000 of which will be in the UK, has been welcomed by the Business Secretary. What is the main focus of Google's expansion in Europe and how does it impact the economies of the countries it operates in?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Google's recent announcement to create 10,000 new jobs in Europe, with 6,000 of those positions located in the UK, is a significant development that has been welcomed by the Business Secretary. This expansion is primarily focused on bolstering Google's presence in key areas such as engineering, sales, and marketing, as well as in new initiatives like artificial intelligence and cloud computing.

The impact of this expansion on the economies of the countries it operates in is multifaceted:

1. **Job Creation**: The creation 

### openhermes 200k unfiltered

In [46]:
def process_example(example):
    messages = example['conversations']
    _,input_text = create_chat_text(tokenizer, messages[0]['value'])
    return {"input_text":input_text, "output_text":messages[1]['value']}

In [47]:
train_ds9 = load_dataset("Crystalcareai/openhermes_200k_unfiltered")
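# Note: sort() is ascending by default, so this keeps the 10,000 rows with the
# lowest relevance_score; pass reverse=True to keep the highest instead.
train_ds9 = train_ds9['train'].sort("relevance_score").select(range(10000))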
train_ds9

Downloading readme: 100%|██████████| 31.0/31.0 [00:00<00:00, 127kB/s]
Downloading data: 100%|██████████| 335M/335M [00:01<00:00, 179MB/s]  
Generating train split: 100%|██████████| 184892/184892 [00:00<00:00, 288737.74 examples/s]


Dataset({
    features: ['conversations', 'relevance_score', 'id'],
    num_rows: 10000
})

In [48]:
train_ds9_processed = train_ds9.map(process_example)

Map: 100%|██████████| 10000/10000 [00:01<00:00, 5748.74 examples/s]


In [49]:
len(train_ds9_processed)

10000

In [50]:
print(train_ds9_processed[0]['input_text'])
print(train_ds9_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

If a company produces 5000 units of a product per day and sells them for $20 each, how much additional revenue would they generate if they increased their production by 10%?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


First, let's calculate the current daily revenue generated by selling 5000 units at $20 each:

Current daily revenue = Units sold * Price per unit
                                  = 5000 * $20
                                  = $100,000

Now, we need to find out how many additional units will be produced if the company increases its production by 10%. To do this, we multiply the current production (5000 units) by 10%:

Additional units = Current production * 10%
                            = 5000 * 0.1
                            = 500 units

So, the company will produce an additional 500 units per day.

Next, we ne

### Synth reasoning alpaca combined

In [51]:
def process_example(example):
    _,input_text = create_chat_text(tokenizer, example['instruction'])
    return {"input_text":input_text, "output_text":example['output']}

In [52]:
train_ds10 = load_dataset("Crystalcareai/synthetic_reasoning_natural_Alpaca_Combined")
train_ds10 = train_ds10['train'].shuffle(42).select(range(1000))
train_ds10

Downloading data: 100%|██████████| 4.20M/4.20M [00:00<00:00, 14.8MB/s]
Generating train split: 100%|██████████| 11000/11000 [00:00<00:00, 428610.46 examples/s]


Dataset({
    features: ['instruction', 'output'],
    num_rows: 1000
})

In [53]:
train_ds10_processed = train_ds10.map(process_example)

Map: 100%|██████████| 1000/1000 [00:00<00:00, 7487.98 examples/s]


In [54]:
len(train_ds10_processed)

1000

In [55]:
print(train_ds10_processed[0]['input_text'])
print(train_ds10_processed[0]['output_text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

If Bob is hot, then Bob is big.
If Bob is cold, then Bob is bad.
If Bob is slow and blue, then Bob is beautiful.
If person is round, then person is clean.
If Bob is strong, then Bob is happy.
Fact:
Bob is nice and cool.
The following can be determined about Bob:<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Bob is bad.


### Concatenate

In [60]:
concat_train_ds = concatenate_datasets([train_ds1_processed.select_columns(['input_text', 'output_text']), 
                                        train_ds2_processed.select_columns(['input_text', 'output_text']), 
                                        train_ds3_processed.select_columns(['input_text', 'output_text']),
                                        train_ds4_processed.select_columns(['input_text', 'output_text']),
                                        train_ds5_processed.select_columns(['input_text', 'output_text']), 
                                        train_ds6_processed.select_columns(['input_text', 'output_text']),
                                        train_ds7_processed.select_columns(['input_text', 'output_text']),
                                        train_ds8_processed.select_columns(['input_text', 'output_text']), 
                                        train_ds9_processed.select_columns(['input_text', 'output_text']),
                                        train_ds10_processed.select_columns(['input_text', 'output_text'])])

In [61]:
concat_train_ds = concat_train_ds.shuffle(42)

In [62]:
concat_train_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v0")

Saving the dataset (1/1 shards): 100%|██████████| 90846/90846 [00:00<00:00, 155261.29 examples/s]
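
As a quick sanity check, the 90,846 rows reported above should equal the sum of the per-dataset sizes printed earlier in the notebook. The dictionary keys below are just labels taken from the section headings.

```python
# Row counts as printed by len(...) for each processed subset above.
subset_sizes = {
    "oasst": 9846,
    "orca-math": 10000,
    "meta-math-qa": 10000,
    "ultrafeedback": 10000,
    "ultrachat-200k": 10000,
    "open hermes 2.5": 10000,
    "starcoder-2": 10000,
    "The Tome": 10000,
    "openhermes 200k unfiltered": 10000,
    "synth reasoning alpaca combined": 1000,
}

total_rows = sum(subset_sizes.values())
print(total_rows)  # 90846
```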


In [63]:
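def tokenize(exs):
    # Batched call: each field is a list of strings. The tokenizer adds special
    # tokens by default, so the lengths computed below may include an extra BOS token.
    return {"input_token_ids" : tokenizer(exs['input_text'])['input_ids'],
            "output_token_ids" : tokenizer(exs['output_text'])['input_ids']}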

In [64]:
concat_train_ds = concat_train_ds.map(tokenize, batched=True)

Map: 100%|██████████| 90846/90846 [00:24<00:00, 3652.50 examples/s]


In [65]:
def token_lengths(ex):
    return {"token_length": len(ex['input_token_ids']) + len(ex['output_token_ids'])}

In [66]:
concat_train_ds = concat_train_ds.map(token_lengths)

Map: 100%|██████████| 90846/90846 [00:24<00:00, 3772.17 examples/s]


In [67]:
pd.Series(concat_train_ds['token_length']).value_counts(normalize=True, bins=10).sort_index().cumsum()



(22.483999999999998, 478.5]    0.671279
(478.5, 930.0]                 0.898785
(930.0, 1381.5]                0.990996
(1381.5, 1833.0]               0.998437
(1833.0, 2284.5]               0.999362
(2284.5, 2736.0]               0.999692
(2736.0, 3187.5]               0.999769
(3187.5, 3639.0]               0.999868
(3639.0, 4090.5]               0.999912
(4090.5, 4542.0]               1.000000
Name: proportion, dtype: float64

In [68]:
concat_train_filtered_ds = concat_train_ds.filter(lambda ex: ex["token_length"] < 1024)

Filter: 100%|██████████| 90846/90846 [00:15<00:00, 6036.02 examples/s]


In [69]:
concat_train_filtered_ds

Dataset({
    features: ['input_text', 'output_text', 'input_token_ids', 'output_token_ids', 'token_length'],
    num_rows: 84289
})
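
The filter uses a strict `<`, so an example with exactly 1024 tokens is dropped. The sketch below shows that boundary on a made-up list of lengths, then works out how much of the mix survives using the row counts printed above.

```python
MAX_TOKENS = 1024

# Made-up lengths to illustrate the strict boundary.
toy_lengths = [100, 1023, 1024, 2000]
kept_lengths = [n for n in toy_lengths if n < MAX_TOKENS]
print(kept_lengths)  # [100, 1023]

# Row counts reported before and after filtering in this notebook.
total_rows = 90846
kept_rows = 84289
dropped_rows = total_rows - kept_rows
kept_fraction = kept_rows / total_rows
print(dropped_rows)             # 6557
print(f"{kept_fraction:.1%}")   # 92.8%
```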

In [70]:
concat_train_filtered_ds.save_to_disk("/workspace/data/llama_large_mix_dataset_v0_1024")

Saving the dataset (1/1 shards): 100%|██████████| 84289/84289 [00:00<00:00, 122879.25 examples/s]
