# ✍️ From A to Z

Our goal is to go from data extracted from the [ChemArXiV](https://chemrxiv.org/engage/chemrxiv/public-dashboard) dataset by Marta to a dataset of synthetic data that could be used to train models in the Chemistry domain.

In [None]:
!conda create -n fromatob -yq --file ./package-list.txt

## 📖 Dataset generation

First we start by taking the data and transforming it into a huggingface dataset. This is done by temporarily transforming it into a .json file before reading the json and transforming it into a [arrow](https://arrow.apache.org/) file format.

First we import the necessary libraries for this first step.

In [2]:
import os
import json
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET
from tqdm import tqdm

We then define the input and temp. output dir. ⚠️ Change this to your directory !

In [3]:
SOURCE_DIR = "datasets/chemrxiv_papers"
DEST_DIR = "datasets/temp"

We're going to take the opportunity of embedding the data into the different templates we want to use during the rephrasing part. We define them beneath, they were taken off the [cosmopedia](https://github.com/huggingface/cosmopedia/tree/main) github.

In [4]:
STYLES = {"wikihow":
"""Here is an extract from a webpage: "<INSERT_EXTRACT>".

Write a long and very detailed tutorial that could be part of WikiHow whose title is related to the extract above <ADD_TOPIC>. Include in depth explanations for each step and how it helps achieve the desired outcome, inluding key tips and guidelines. 
Ensure clarity and practicality, allowing readers to easily follow and apply the instructions. Do not use images.""",

"textbook_narrative":
"""Here is an extract from a webpage: "<INSERT_EXTRACT>".

Write an extensive and detailed course unit suitable for a textbook, related to the given extract <ADD_TOPIC>. Do not just list concepts, but develop each one in detail before moving to the next, as we prioritize depth of understanding and comprehensive exploration of the subject matter over breadth. Focus on:

- Rigor: Ensure in-depth coverage of the concepts.
- Engagement: Use a narrative style akin to Michael Lewis, making it captivating and thought-provoking.
- Relevance: Connect the topic with current trends, real-life examples, or recent studies. Do not use images.
Do not include a title or an introduction, simply write the content without headlines and introductory phrases. Do not use images.""",

"textbook_academic":
"""Here is an extract from a webpage: "<INSERT_EXTRACT>".

Write an extensive and detailed course unit suitable for a textbook targeted at college students, related to the given extract <ADD_TOPIC>. Do not just list concepts, but develop each one in detail before moving to the next, as we prioritize depth of understanding and comprehensive exploration of the subject matter over breadth. Focus on:

- Rigor: Ensure in-depth coverage of the concepts/sections.
- Engagement: Write with an academic, professional and engaging tone that captivates interest.
- Application: Incorporate specific, practical examples, such as proofs in calculus or critical dates and figures in history.
Do not include a title or an introduction, simply write the content without headlines and introductory phrases. Do not use images.""",

"blogpost":
"""Here is an extract from a webpage: "<INSERT_EXTRACT>".

Write an informative and insightful blog post that expands upon the extract above <ADD_TOPIC>. Your post should delve into the nuances of the topic, offering fresh perspectives and deeper analysis. Aim to:

- Inform: Provide valuable, well-researched information that educates the reader.
- Engage: Write in a conversational tone that connects with the audience, making complex ideas accessible.
- Illustrate: Use examples, anecdotes, or personal experiences to bring the topic to life.
Do not give a title and do not start with sentences like "Have you ever..." or "Hello dear readers..", simply write the content without these introductory phrases."""
}

Complex part, we iterate over the files and attempt to only keep relevant paragraphs. We multithread to speed up the process and define min. and max. text lenghts we with to extract. Based off [Rephrasing the Web](https://arxiv.org/abs/2401.16380) we keep it underneath 400 words as more makes the model lose context during sythesizing.

        " Each example has a maximum of 300 tokens, which was decided based on our empirical observation that asking an LLM to rephrase more than 300 tokens, often led to loss of information." - Rephrasing the Web


We also keep the title as it can be used for further context down the road.

In [9]:
MIN_TEXT_LENGTH = 200
MAX_TEXT_LENGTH = 400

def save_extracted_data(file_path, data):
    with open(file_path, 'a') as file:
        for sample in data:
            json.dump(sample, file)
            file.write('\n')  # Add newline character to separate JSON objects


def extract_text(filepath):
    source_paper_path = f"{os.getcwd()}/{SOURCE_DIR}/{filepath}"
    sample = {
        "filename": filepath,
        "title": "",  # Initialize title
        "texts": []   # Initialize list to store extracted text
    }
    tree = ET.parse(source_paper_path)
    try:
        root = tree.getroot()

        # Define the namespace map
        ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

        # Extract title
        title = ""
        teiHeader_elem = root.find('tei:teiHeader', namespaces=ns)
        if teiHeader_elem is not None:
            fileDesk_elem = teiHeader_elem.find('tei:fileDesc', namespaces=ns)
            if fileDesk_elem is not None:
                titleStmt_elem = fileDesk_elem.find('tei:titleStmt', namespaces=ns)
                if titleStmt_elem is not None:
                    title_elem = titleStmt_elem.find('tei:title', namespaces=ns)
                    if title_elem is not None:
                        title = title_elem.text
                    else:
                        print("No <title> element found.")
                else:
                    print("No <titleStmt> element found.")
            else:
                print("No <fileDesc> element found.")
        else:
            print("No <teiHeader> element found.")
        sample['title'] = title

        # Extract text
        text_list = []
        text_elem = root.find('tei:text', namespaces=ns)
        if text_elem is not None:
            body_elem = text_elem.find('tei:body', namespaces=ns)
            if body_elem is not None:
                texts = body_elem.findall('tei:div', namespaces=ns)
                for div in texts:
                    text = ' '.join(div.itertext())
                    for prompt in STYLES.values():
                        if len(text.split(" ")) > MIN_TEXT_LENGTH and len(text.split(" ")) < MAX_TEXT_LENGTH: 
                            text2 = prompt.replace("<ADD_TOPIC>", title).replace("<INSERT_EXTRACT>", text)
                            text_list.append(text2)
        sample['texts'] = text_list

    except Exception as e:
        print(e)
    return sample

def main():
    filenames = [k for k in os.listdir(os.getcwd() + "/" + SOURCE_DIR + "/") if k.endswith('.xml')]
    # print(os.getcwd() + "/" + DEST_DIR)
    print(len(filenames))
    batch_size = 100  # Adjust batch size as needed
    total_batches = len(filenames) // batch_size + 1
    with ThreadPoolExecutor(max_workers=2) as executor:
        for batch_num in range(total_batches):
            batch_filenames = filenames[batch_num * batch_size: (batch_num + 1) * batch_size]
            extracted_data = []
            for sample in tqdm(executor.map(extract_text, batch_filenames), total=len(batch_filenames), desc=f"Processing Batch {batch_num + 1}/{total_batches}"):
                extracted_data.append(sample)
            save_extracted_data(f'{DEST_DIR}/data.json', extracted_data)

main()


22739


Processing Batch 1/228:   0%|          | 0/100 [00:00<?, ?it/s]

Processing Batch 1/228:  34%|███▍      | 34/100 [00:00<00:00, 338.38it/s]

replace() argument 2 must be str, not None


Processing Batch 1/228: 100%|██████████| 100/100 [00:00<00:00, 340.05it/s]
Processing Batch 2/228: 100%|██████████| 100/100 [00:00<00:00, 274.95it/s]
Processing Batch 3/228: 100%|██████████| 100/100 [00:00<00:00, 293.13it/s]
Processing Batch 4/228: 100%|██████████| 100/100 [00:00<00:00, 246.54it/s]


replace() argument 2 must be str, not None


Processing Batch 5/228: 100%|██████████| 100/100 [00:00<00:00, 284.27it/s]
Processing Batch 6/228: 100%|██████████| 100/100 [00:00<00:00, 300.05it/s]
Processing Batch 7/228: 100%|██████████| 100/100 [00:00<00:00, 297.73it/s]
Processing Batch 8/228: 100%|██████████| 100/100 [00:00<00:00, 415.98it/s]
Processing Batch 9/228: 100%|██████████| 100/100 [00:00<00:00, 362.61it/s]
Processing Batch 10/228: 100%|██████████| 100/100 [00:00<00:00, 309.17it/s]


replace() argument 2 must be str, not None


Processing Batch 11/228: 100%|██████████| 100/100 [00:00<00:00, 295.74it/s]


replace() argument 2 must be str, not None


Processing Batch 12/228: 100%|██████████| 100/100 [00:00<00:00, 266.86it/s]


replace() argument 2 must be str, not None


Processing Batch 13/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None


Processing Batch 13/228:  24%|██▍       | 24/100 [00:00<00:00, 238.16it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 13/228:  60%|██████    | 60/100 [00:00<00:00, 238.54it/s]

replace() argument 2 must be str, not None


Processing Batch 13/228: 100%|██████████| 100/100 [00:00<00:00, 249.39it/s]
Processing Batch 14/228: 100%|██████████| 100/100 [00:00<00:00, 272.85it/s]
Processing Batch 15/228: 100%|██████████| 100/100 [00:00<00:00, 381.66it/s]
Processing Batch 16/228: 100%|██████████| 100/100 [00:00<00:00, 315.98it/s]
Processing Batch 17/228: 100%|██████████| 100/100 [00:00<00:00, 286.66it/s]
Processing Batch 18/228: 100%|██████████| 100/100 [00:00<00:00, 267.46it/s]
Processing Batch 19/228: 100%|██████████| 100/100 [00:00<00:00, 280.78it/s]
Processing Batch 20/228: 100%|██████████| 100/100 [00:00<00:00, 398.18it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 21/228: 100%|██████████| 100/100 [00:00<00:00, 422.41it/s]


replace() argument 2 must be str, not None


Processing Batch 22/228:  37%|███▋      | 37/100 [00:00<00:00, 331.94it/s]

replace() argument 2 must be str, not None


Processing Batch 22/228: 100%|██████████| 100/100 [00:00<00:00, 373.95it/s]
Processing Batch 23/228: 100%|██████████| 100/100 [00:00<00:00, 404.28it/s]
Processing Batch 24/228: 100%|██████████| 100/100 [00:00<00:00, 387.64it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 25/228: 100%|██████████| 100/100 [00:00<00:00, 413.43it/s]
Processing Batch 26/228: 100%|██████████| 100/100 [00:00<00:00, 396.71it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 27/228: 100%|██████████| 100/100 [00:00<00:00, 375.74it/s]
Processing Batch 28/228: 100%|██████████| 100/100 [00:00<00:00, 380.86it/s]


replace() argument 2 must be str, not None


Processing Batch 29/228: 100%|██████████| 100/100 [00:00<00:00, 421.13it/s]
Processing Batch 30/228: 100%|██████████| 100/100 [00:00<00:00, 359.06it/s]
Processing Batch 31/228: 100%|██████████| 100/100 [00:00<00:00, 350.77it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 32/228: 100%|██████████| 100/100 [00:00<00:00, 358.26it/s]


replace() argument 2 must be str, not None


Processing Batch 33/228: 100%|██████████| 100/100 [00:00<00:00, 374.04it/s]
Processing Batch 34/228: 100%|██████████| 100/100 [00:00<00:00, 373.35it/s]
Processing Batch 35/228: 100%|██████████| 100/100 [00:00<00:00, 307.48it/s]
Processing Batch 36/228: 100%|██████████| 100/100 [00:00<00:00, 408.72it/s]


replace() argument 2 must be str, not None


Processing Batch 37/228: 100%|██████████| 100/100 [00:00<00:00, 396.71it/s]
Processing Batch 38/228: 100%|██████████| 100/100 [00:00<00:00, 365.55it/s]
Processing Batch 39/228: 100%|██████████| 100/100 [00:00<00:00, 369.22it/s]
Processing Batch 40/228: 100%|██████████| 100/100 [00:00<00:00, 356.32it/s]
Processing Batch 41/228: 100%|██████████| 100/100 [00:00<00:00, 279.72it/s]
Processing Batch 42/228: 100%|██████████| 100/100 [00:00<00:00, 328.56it/s]
Processing Batch 43/228: 100%|██████████| 100/100 [00:00<00:00, 439.91it/s]


replace() argument 2 must be str, not None


Processing Batch 44/228: 100%|██████████| 100/100 [00:00<00:00, 315.26it/s]
Processing Batch 45/228: 100%|██████████| 100/100 [00:00<00:00, 447.53it/s]
Processing Batch 46/228: 100%|██████████| 100/100 [00:00<00:00, 390.21it/s]
Processing Batch 47/228: 100%|██████████| 100/100 [00:00<00:00, 299.46it/s]
Processing Batch 48/228: 100%|██████████| 100/100 [00:00<00:00, 333.25it/s]
Processing Batch 49/228: 100%|██████████| 100/100 [00:00<00:00, 402.59it/s]


replace() argument 2 must be str, not None


Processing Batch 50/228:  83%|████████▎ | 83/100 [00:00<00:00, 342.10it/s]

replace() argument 2 must be str, not None


Processing Batch 50/228: 100%|██████████| 100/100 [00:00<00:00, 378.26it/s]
Processing Batch 51/228:  38%|███▊      | 38/100 [00:00<00:00, 349.74it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 51/228: 100%|██████████| 100/100 [00:00<00:00, 374.64it/s]


replace() argument 2 must be str, not None


Processing Batch 52/228:  35%|███▌      | 35/100 [00:00<00:00, 349.58it/s]

replace() argument 2 must be str, not None


Processing Batch 52/228:  78%|███████▊  | 78/100 [00:00<00:00, 396.14it/s]

replace() argument 2 must be str, not None


Processing Batch 52/228: 100%|██████████| 100/100 [00:00<00:00, 364.48it/s]
Processing Batch 53/228:  42%|████▏     | 42/100 [00:00<00:00, 411.75it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 53/228: 100%|██████████| 100/100 [00:00<00:00, 332.07it/s]
Processing Batch 54/228: 100%|██████████| 100/100 [00:00<00:00, 352.97it/s]
Processing Batch 55/228: 100%|██████████| 100/100 [00:00<00:00, 412.33it/s]
Processing Batch 56/228:  49%|████▉     | 49/100 [00:00<00:00, 386.67it/s]

replace() argument 2 must be str, not None


Processing Batch 56/228: 100%|██████████| 100/100 [00:00<00:00, 428.55it/s]


replace() argument 2 must be str, not None


Processing Batch 57/228: 100%|██████████| 100/100 [00:00<00:00, 440.28it/s]


replace() argument 2 must be str, not None


Processing Batch 58/228: 100%|██████████| 100/100 [00:00<00:00, 361.12it/s]
Processing Batch 59/228: 100%|██████████| 100/100 [00:00<00:00, 323.38it/s]
Processing Batch 60/228: 100%|██████████| 100/100 [00:00<00:00, 400.48it/s]
Processing Batch 61/228: 100%|██████████| 100/100 [00:00<00:00, 299.93it/s]


replace() argument 2 must be str, not None


Processing Batch 62/228: 100%|██████████| 100/100 [00:00<00:00, 288.31it/s]
Processing Batch 63/228: 100%|██████████| 100/100 [00:00<00:00, 343.97it/s]


replace() argument 2 must be str, not None


Processing Batch 64/228: 100%|██████████| 100/100 [00:00<00:00, 317.24it/s]


replace() argument 2 must be str, not None


Processing Batch 65/228: 100%|██████████| 100/100 [00:00<00:00, 328.08it/s]
Processing Batch 66/228: 100%|██████████| 100/100 [00:00<00:00, 486.29it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 67/228:  39%|███▉      | 39/100 [00:00<00:00, 340.69it/s]

replace() argument 2 must be str, not None


Processing Batch 67/228: 100%|██████████| 100/100 [00:00<00:00, 355.99it/s]
Processing Batch 68/228: 100%|██████████| 100/100 [00:00<00:00, 375.02it/s]
Processing Batch 69/228: 100%|██████████| 100/100 [00:00<00:00, 416.27it/s]
Processing Batch 70/228: 100%|██████████| 100/100 [00:00<00:00, 361.23it/s]
Processing Batch 71/228:  39%|███▉      | 39/100 [00:00<00:00, 303.72it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 71/228: 100%|██████████| 100/100 [00:00<00:00, 396.42it/s]
Processing Batch 72/228: 100%|██████████| 100/100 [00:00<00:00, 353.33it/s]
Processing Batch 73/228:  88%|████████▊ | 88/100 [00:00<00:00, 332.45it/s]

replace() argument 2 must be str, not None


Processing Batch 73/228: 100%|██████████| 100/100 [00:00<00:00, 326.32it/s]
Processing Batch 74/228: 100%|██████████| 100/100 [00:00<00:00, 374.06it/s]
Processing Batch 75/228: 100%|██████████| 100/100 [00:00<00:00, 373.94it/s]
Processing Batch 76/228: 100%|██████████| 100/100 [00:00<00:00, 420.09it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 77/228: 100%|██████████| 100/100 [00:00<00:00, 473.26it/s]
Processing Batch 78/228: 100%|██████████| 100/100 [00:00<00:00, 309.53it/s]
Processing Batch 79/228: 100%|██████████| 100/100 [00:00<00:00, 420.76it/s]
Processing Batch 80/228: 100%|██████████| 100/100 [00:00<00:00, 420.95it/s]


replace() argument 2 must be str, not None


Processing Batch 81/228: 100%|██████████| 100/100 [00:00<00:00, 471.44it/s]


replace() argument 2 must be str, not None


Processing Batch 82/228: 100%|██████████| 100/100 [00:00<00:00, 321.22it/s]
Processing Batch 83/228: 100%|██████████| 100/100 [00:00<00:00, 397.73it/s]


replace() argument 2 must be str, not None


Processing Batch 84/228: 100%|██████████| 100/100 [00:00<00:00, 442.70it/s]


replace() argument 2 must be str, not None


Processing Batch 85/228: 100%|██████████| 100/100 [00:00<00:00, 471.70it/s]
Processing Batch 86/228:  46%|████▌     | 46/100 [00:00<00:00, 453.10it/s]

replace() argument 2 must be str, not None


Processing Batch 86/228: 100%|██████████| 100/100 [00:00<00:00, 392.29it/s]
Processing Batch 87/228: 100%|██████████| 100/100 [00:00<00:00, 411.21it/s]
Processing Batch 88/228: 100%|██████████| 100/100 [00:00<00:00, 311.82it/s]
Processing Batch 89/228: 100%|██████████| 100/100 [00:00<00:00, 420.03it/s]
Processing Batch 90/228:  73%|███████▎  | 73/100 [00:00<00:00, 371.97it/s]

replace() argument 2 must be str, not None


Processing Batch 90/228: 100%|██████████| 100/100 [00:00<00:00, 319.27it/s]
Processing Batch 91/228: 100%|██████████| 100/100 [00:00<00:00, 374.75it/s]


replace() argument 2 must be str, not None


Processing Batch 92/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None


Processing Batch 92/228:  78%|███████▊  | 78/100 [00:00<00:00, 384.54it/s]

replace() argument 2 must be str, not None


Processing Batch 92/228: 100%|██████████| 100/100 [00:00<00:00, 363.38it/s]
Processing Batch 93/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None


Processing Batch 93/228: 100%|██████████| 100/100 [00:00<00:00, 328.47it/s]
Processing Batch 94/228: 100%|██████████| 100/100 [00:00<00:00, 366.57it/s]
Processing Batch 95/228: 100%|██████████| 100/100 [00:00<00:00, 404.29it/s]


replace() argument 2 must be str, not None


Processing Batch 96/228: 100%|██████████| 100/100 [00:00<00:00, 356.39it/s]
Processing Batch 97/228: 100%|██████████| 100/100 [00:00<00:00, 394.65it/s]
Processing Batch 98/228: 100%|██████████| 100/100 [00:00<00:00, 308.84it/s]
Processing Batch 99/228: 100%|██████████| 100/100 [00:00<00:00, 362.28it/s]
Processing Batch 100/228: 100%|██████████| 100/100 [00:00<00:00, 379.33it/s]
Processing Batch 101/228: 100%|██████████| 100/100 [00:00<00:00, 391.72it/s]


replace() argument 2 must be str, not None


Processing Batch 102/228: 100%|██████████| 100/100 [00:00<00:00, 523.12it/s]
Processing Batch 103/228:  47%|████▋     | 47/100 [00:00<00:00, 456.41it/s]

replace() argument 2 must be str, not None


Processing Batch 103/228: 100%|██████████| 100/100 [00:00<00:00, 395.48it/s]
Processing Batch 104/228:  76%|███████▌  | 76/100 [00:00<00:00, 373.70it/s]

replace() argument 2 must be str, not None


Processing Batch 104/228: 100%|██████████| 100/100 [00:00<00:00, 356.09it/s]
Processing Batch 105/228: 100%|██████████| 100/100 [00:00<00:00, 292.79it/s]
Processing Batch 106/228: 100%|██████████| 100/100 [00:00<00:00, 386.86it/s]
Processing Batch 107/228: 100%|██████████| 100/100 [00:00<00:00, 358.80it/s]
Processing Batch 108/228: 100%|██████████| 100/100 [00:00<00:00, 312.54it/s]
Processing Batch 109/228: 100%|██████████| 100/100 [00:00<00:00, 331.23it/s]

replace() argument 2 must be str, not None



Processing Batch 110/228: 100%|██████████| 100/100 [00:00<00:00, 389.69it/s]


replace() argument 2 must be str, not None


Processing Batch 111/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None


Processing Batch 111/228: 100%|██████████| 100/100 [00:00<00:00, 344.52it/s]
Processing Batch 112/228: 100%|██████████| 100/100 [00:00<00:00, 380.84it/s]


replace() argument 2 must be str, not None

Processing Batch 113/228:  32%|███▏      | 32/100 [00:00<00:00, 301.05it/s]




Processing Batch 113/228: 100%|██████████| 100/100 [00:00<00:00, 324.48it/s]
Processing Batch 114/228: 100%|██████████| 100/100 [00:00<00:00, 332.16it/s]


replace() argument 2 must be str, not None


Processing Batch 115/228: 100%|██████████| 100/100 [00:00<00:00, 299.30it/s]
Processing Batch 116/228: 100%|██████████| 100/100 [00:00<00:00, 277.61it/s]
Processing Batch 117/228: 100%|██████████| 100/100 [00:00<00:00, 275.69it/s]
Processing Batch 118/228: 100%|██████████| 100/100 [00:00<00:00, 273.83it/s]
Processing Batch 119/228: 100%|██████████| 100/100 [00:00<00:00, 273.07it/s]
Processing Batch 120/228: 100%|██████████| 100/100 [00:00<00:00, 274.95it/s]


replace() argument 2 must be str, not None


Processing Batch 121/228: 100%|██████████| 100/100 [00:00<00:00, 277.51it/s]


replace() argument 2 must be str, not None


Processing Batch 122/228: 100%|██████████| 100/100 [00:00<00:00, 254.21it/s]
Processing Batch 123/228: 100%|██████████| 100/100 [00:00<00:00, 310.63it/s]


replace() argument 2 must be str, not None


Processing Batch 124/228: 100%|██████████| 100/100 [00:00<00:00, 249.61it/s]
Processing Batch 125/228: 100%|██████████| 100/100 [00:00<00:00, 236.18it/s]
Processing Batch 126/228: 100%|██████████| 100/100 [00:00<00:00, 299.90it/s]
Processing Batch 127/228: 100%|██████████| 100/100 [00:00<00:00, 474.60it/s]
Processing Batch 128/228: 100%|██████████| 100/100 [00:00<00:00, 356.20it/s]


replace() argument 2 must be str, not None


Processing Batch 129/228:  89%|████████▉ | 89/100 [00:00<00:00, 440.43it/s]

replace() argument 2 must be str, not None


Processing Batch 129/228: 100%|██████████| 100/100 [00:00<00:00, 345.33it/s]
Processing Batch 130/228: 100%|██████████| 100/100 [00:00<00:00, 311.35it/s]
Processing Batch 131/228: 100%|██████████| 100/100 [00:00<00:00, 378.82it/s]
Processing Batch 132/228: 100%|██████████| 100/100 [00:00<00:00, 363.80it/s]


replace() argument 2 must be str, not None


Processing Batch 133/228:  39%|███▉      | 39/100 [00:00<00:00, 377.59it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 133/228: 100%|██████████| 100/100 [00:00<00:00, 369.85it/s]
Processing Batch 134/228: 100%|██████████| 100/100 [00:00<00:00, 330.35it/s]
Processing Batch 135/228: 100%|██████████| 100/100 [00:00<00:00, 393.12it/s]


replace() argument 2 must be str, not None


Processing Batch 136/228:  53%|█████▎    | 53/100 [00:00<00:00, 527.64it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 136/228: 100%|██████████| 100/100 [00:00<00:00, 435.85it/s]
Processing Batch 137/228: 100%|██████████| 100/100 [00:00<00:00, 388.73it/s]
Processing Batch 138/228: 100%|██████████| 100/100 [00:00<00:00, 309.91it/s]


replace() argument 2 must be str, not None


Processing Batch 139/228: 100%|██████████| 100/100 [00:00<00:00, 410.60it/s]
Processing Batch 140/228: 100%|██████████| 100/100 [00:00<00:00, 420.96it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 141/228: 100%|██████████| 100/100 [00:00<00:00, 352.59it/s]
Processing Batch 142/228: 100%|██████████| 100/100 [00:00<00:00, 309.86it/s]
Processing Batch 143/228: 100%|██████████| 100/100 [00:00<00:00, 345.35it/s]
Processing Batch 144/228:  91%|█████████ | 91/100 [00:00<00:00, 381.96it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 144/228: 100%|██████████| 100/100 [00:00<00:00, 401.23it/s]
Processing Batch 145/228: 100%|██████████| 100/100 [00:00<00:00, 413.44it/s]
Processing Batch 146/228: 100%|██████████| 100/100 [00:00<00:00, 417.73it/s]


replace() argument 2 must be str, not None

Processing Batch 147/228:  46%|████▌     | 46/100 [00:00<00:00, 344.31it/s]


replace() argument 2 must be str, not None


Processing Batch 147/228: 100%|██████████| 100/100 [00:00<00:00, 364.41it/s]
Processing Batch 148/228: 100%|██████████| 100/100 [00:00<00:00, 302.65it/s]


replace() argument 2 must be str, not None


Processing Batch 149/228:  43%|████▎     | 43/100 [00:00<00:00, 415.96it/s]

replace() argument 2 must be str, not None


Processing Batch 149/228: 100%|██████████| 100/100 [00:00<00:00, 434.16it/s]
Processing Batch 150/228: 100%|██████████| 100/100 [00:00<00:00, 393.57it/s]
Processing Batch 151/228: 100%|██████████| 100/100 [00:00<00:00, 411.19it/s]
Processing Batch 152/228: 100%|██████████| 100/100 [00:00<00:00, 419.80it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 153/228: 100%|██████████| 100/100 [00:00<00:00, 385.55it/s]


replace() argument 2 must be str, not None


Processing Batch 154/228: 100%|██████████| 100/100 [00:00<00:00, 441.25it/s]


replace() argument 2 must be str, not None


Processing Batch 155/228: 100%|██████████| 100/100 [00:00<00:00, 402.68it/s]
Processing Batch 156/228: 100%|██████████| 100/100 [00:00<00:00, 375.45it/s]
Processing Batch 157/228: 100%|██████████| 100/100 [00:00<00:00, 485.89it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 158/228: 100%|██████████| 100/100 [00:00<00:00, 371.19it/s]
Processing Batch 159/228: 100%|██████████| 100/100 [00:00<00:00, 386.78it/s]

replace() argument 2 must be str, not None



Processing Batch 160/228: 100%|██████████| 100/100 [00:00<00:00, 422.02it/s]


replace() argument 2 must be str, not None


Processing Batch 161/228: 100%|██████████| 100/100 [00:00<00:00, 417.21it/s]
Processing Batch 162/228: 100%|██████████| 100/100 [00:00<00:00, 319.69it/s]
Processing Batch 163/228: 100%|██████████| 100/100 [00:00<00:00, 407.38it/s]
Processing Batch 164/228: 100%|██████████| 100/100 [00:00<00:00, 379.49it/s]
Processing Batch 165/228: 100%|██████████| 100/100 [00:00<00:00, 389.74it/s]


replace() argument 2 must be str, not None


Processing Batch 166/228: 100%|██████████| 100/100 [00:00<00:00, 427.00it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 167/228: 100%|██████████| 100/100 [00:00<00:00, 422.09it/s]
Processing Batch 168/228: 100%|██████████| 100/100 [00:00<00:00, 415.52it/s]
Processing Batch 169/228:  65%|██████▌   | 65/100 [00:00<00:00, 331.14it/s]

replace() argument 2 must be str, not None


Processing Batch 169/228: 100%|██████████| 100/100 [00:00<00:00, 319.24it/s]
Processing Batch 170/228: 100%|██████████| 100/100 [00:00<00:00, 303.70it/s]
Processing Batch 171/228: 100%|██████████| 100/100 [00:00<00:00, 415.27it/s]
Processing Batch 172/228:  42%|████▏     | 42/100 [00:00<00:00, 412.28it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 172/228: 100%|██████████| 100/100 [00:00<00:00, 311.60it/s]


replace() argument 2 must be str, not None


Processing Batch 173/228: 100%|██████████| 100/100 [00:00<00:00, 299.58it/s]


replace() argument 2 must be str, not None


Processing Batch 174/228:  35%|███▌      | 35/100 [00:00<00:00, 349.04it/s]

replace() argument 2 must be str, not None


Processing Batch 174/228: 100%|██████████| 100/100 [00:00<00:00, 335.95it/s]
Processing Batch 175/228: 100%|██████████| 100/100 [00:00<00:00, 321.96it/s]
Processing Batch 176/228: 100%|██████████| 100/100 [00:00<00:00, 398.29it/s]


replace() argument 2 must be str, not None


Processing Batch 177/228: 100%|██████████| 100/100 [00:00<00:00, 433.32it/s]

replace() argument 2 must be str, not None



Processing Batch 178/228: 100%|██████████| 100/100 [00:00<00:00, 423.58it/s]
Processing Batch 179/228: 100%|██████████| 100/100 [00:00<00:00, 418.98it/s]

replace() argument 2 must be str, not None



Processing Batch 180/228: 100%|██████████| 100/100 [00:00<00:00, 397.55it/s]


replace() argument 2 must be str, not None


Processing Batch 181/228: 100%|██████████| 100/100 [00:00<00:00, 318.40it/s]
Processing Batch 182/228: 100%|██████████| 100/100 [00:00<00:00, 403.27it/s]
Processing Batch 183/228: 100%|██████████| 100/100 [00:00<00:00, 415.92it/s]
Processing Batch 184/228: 100%|██████████| 100/100 [00:00<00:00, 359.96it/s]
Processing Batch 185/228: 100%|██████████| 100/100 [00:00<00:00, 417.65it/s]


replace() argument 2 must be str, not None


Processing Batch 186/228: 100%|██████████| 100/100 [00:00<00:00, 330.38it/s]
Processing Batch 187/228: 100%|██████████| 100/100 [00:00<00:00, 371.09it/s]

replace() argument 2 must be str, not None



Processing Batch 188/228: 100%|██████████| 100/100 [00:00<00:00, 395.41it/s]
Processing Batch 189/228: 100%|██████████| 100/100 [00:00<00:00, 394.18it/s]


replace() argument 2 must be str, not None


Processing Batch 190/228: 100%|██████████| 100/100 [00:00<00:00, 309.17it/s]
Processing Batch 191/228: 100%|██████████| 100/100 [00:00<00:00, 352.44it/s]
Processing Batch 192/228: 100%|██████████| 100/100 [00:00<00:00, 386.99it/s]
Processing Batch 193/228:  48%|████▊     | 48/100 [00:00<00:00, 474.36it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 193/228: 100%|██████████| 100/100 [00:00<00:00, 443.13it/s]
Processing Batch 194/228: 100%|██████████| 100/100 [00:00<00:00, 362.29it/s]
Processing Batch 195/228: 100%|██████████| 100/100 [00:00<00:00, 406.63it/s]
Processing Batch 196/228:  55%|█████▌    | 55/100 [00:00<00:00, 406.56it/s]

replace() argument 2 must be str, not None


Processing Batch 196/228: 100%|██████████| 100/100 [00:00<00:00, 406.42it/s]


replace() argument 2 must be str, not None


Processing Batch 197/228:  50%|█████     | 50/100 [00:00<00:00, 484.96it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 197/228: 100%|██████████| 100/100 [00:00<00:00, 401.73it/s]
Processing Batch 198/228: 100%|██████████| 100/100 [00:00<00:00, 364.36it/s]


replace() argument 2 must be str, not None


Processing Batch 199/228: 100%|██████████| 100/100 [00:00<00:00, 385.60it/s]


replace() argument 2 must be str, not None


Processing Batch 200/228: 100%|██████████| 100/100 [00:00<00:00, 380.60it/s]


replace() argument 2 must be str, not None


Processing Batch 201/228:  50%|█████     | 50/100 [00:00<00:00, 485.15it/s]

replace() argument 2 must be str, not None


Processing Batch 201/228: 100%|██████████| 100/100 [00:00<00:00, 477.51it/s]
Processing Batch 202/228: 100%|██████████| 100/100 [00:00<00:00, 366.21it/s]
Processing Batch 203/228: 100%|██████████| 100/100 [00:00<00:00, 319.18it/s]
Processing Batch 204/228: 100%|██████████| 100/100 [00:00<00:00, 328.63it/s]
Processing Batch 205/228: 100%|██████████| 100/100 [00:00<00:00, 424.92it/s]


replace() argument 2 must be str, not None


Processing Batch 206/228:  44%|████▍     | 44/100 [00:00<00:00, 351.85it/s]

replace() argument 2 must be str, not None


Processing Batch 206/228: 100%|██████████| 100/100 [00:00<00:00, 379.54it/s]
Processing Batch 207/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None


Processing Batch 207/228: 100%|██████████| 100/100 [00:00<00:00, 319.17it/s]
Processing Batch 208/228: 100%|██████████| 100/100 [00:00<00:00, 302.57it/s]
Processing Batch 209/228: 100%|██████████| 100/100 [00:00<00:00, 369.42it/s]


replace() argument 2 must be str, not None


Processing Batch 210/228: 100%|██████████| 100/100 [00:00<00:00, 366.14it/s]


replace() argument 2 must be str, not None


Processing Batch 211/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None


Processing Batch 211/228:  47%|████▋     | 47/100 [00:00<00:00, 464.13it/s]

replace() argument 2 must be str, not None


Processing Batch 211/228: 100%|██████████| 100/100 [00:00<00:00, 461.70it/s]
Processing Batch 212/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None


Processing Batch 212/228: 100%|██████████| 100/100 [00:00<00:00, 400.69it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 213/228: 100%|██████████| 100/100 [00:00<00:00, 415.63it/s]


replace() argument 2 must be str, not None


Processing Batch 214/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None


Processing Batch 214/228: 100%|██████████| 100/100 [00:00<00:00, 360.05it/s]
Processing Batch 215/228: 100%|██████████| 100/100 [00:00<00:00, 392.60it/s]


replace() argument 2 must be str, not None


Processing Batch 216/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None


Processing Batch 216/228:  75%|███████▌  | 75/100 [00:00<00:00, 357.68it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 216/228: 100%|██████████| 100/100 [00:00<00:00, 350.30it/s]
Processing Batch 217/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 217/228: 100%|██████████| 100/100 [00:00<00:00, 345.93it/s]
Processing Batch 218/228: 100%|██████████| 100/100 [00:00<00:00, 343.30it/s]


replace() argument 2 must be str, not None


Processing Batch 219/228:   0%|          | 0/100 [00:00<?, ?it/s]

replace() argument 2 must be str, not None


Processing Batch 219/228:  84%|████████▍ | 84/100 [00:00<00:00, 373.79it/s]

replace() argument 2 must be str, not None


Processing Batch 219/228: 100%|██████████| 100/100 [00:00<00:00, 339.79it/s]
Processing Batch 220/228: 100%|██████████| 100/100 [00:00<00:00, 372.85it/s]
Processing Batch 221/228: 100%|██████████| 100/100 [00:00<00:00, 364.29it/s]
Processing Batch 222/228:  66%|██████▌   | 66/100 [00:00<00:00, 310.60it/s]

replace() argument 2 must be str, not None


Processing Batch 222/228: 100%|██████████| 100/100 [00:00<00:00, 314.24it/s]
Processing Batch 223/228: 100%|██████████| 100/100 [00:00<00:00, 343.46it/s]
Processing Batch 224/228: 100%|██████████| 100/100 [00:00<00:00, 335.84it/s]
Processing Batch 225/228:  75%|███████▌  | 75/100 [00:00<00:00, 329.38it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None
replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 225/228: 100%|██████████| 100/100 [00:00<00:00, 376.47it/s]
Processing Batch 226/228: 100%|██████████| 100/100 [00:00<00:00, 338.53it/s]


replace() argument 2 must be str, not None


Processing Batch 227/228: 100%|██████████| 100/100 [00:00<00:00, 482.27it/s]


replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 228/228:   0%|          | 0/39 [00:00<?, ?it/s]

replace() argument 2 must be str, not None
replace() argument 2 must be str, not None


Processing Batch 228/228: 100%|██████████| 39/39 [00:00<00:00, 419.68it/s]


We now have a json file with the collected data and transform it into the arrow format for ease of use.

In [8]:
import jsonlines
from datasets import Dataset
from tqdm import tqdm

dataset_dict = {
        "title": [],
        "filename": [],
        "text": []
    }

with jsonlines.open(f'{DEST_DIR}/data.json') as reader:
    for paper in tqdm(reader):
        for text in paper["texts"]:
            dataset_dict["title"].append(paper["title"])
            dataset_dict["filename"].append(paper["filename"])
            dataset_dict["text"].append(text)

    dataset = Dataset.from_dict(dataset_dict)
    dataset.save_to_disk("datasets/huggingface_dataset_large")

47188it [00:00, 78143.86it/s] 


TypeError: 'Dataset' object does not support item assignment

## ✨ Synthetic data

Now that we have a dataset in the arrows format we can start synthesizing some data. To be as efficient as possible and allow for iterative testing, we use a modified version of llm_swarm that allows for the creation of side-by-side pods running [TGI](https://huggingface.co/docs/text-generation-inference/index) with whatever model that is available.

⚠️ Replace YOURTOKENHERE by your huggingface token (cat ~/.cache/huggingface/token)

In [None]:
import asyncio
import json
import sys
sys.path.append("./llm_swarm")
from llm_swarm import LLMSwarm, LLMSwarmConfig
from huggingface_hub import AsyncInferenceClient
from transformers import AutoTokenizer
import nest_asyncio
from tqdm.asyncio import tqdm_asyncio
import time

# Define your LLMSwarmConfig
isc = LLMSwarmConfig(
    instances=1, #Number of instances
    inference_engine="tgi", #The engine to use. (Could use vLLM)
    job_scheduler="runai", #The scheduler to use (would otherwise be slurm)
    gpus=1, #Number of GPUs to use per instance
    model="meta-llama/Meta-Llama-3-8B-Instruct", #Model for inference
    template_path="llm_swarm/templates/tgi.template.yml", #Template used for the run
    load_balancer_template_path="llm_swarm/templates/nginx.template.conf", #Load distributor
    huggingface_token="YOURTOKENHERE", #Huggingface token
    model_max_total=3000,
    model_max_input=1200,
    per_instance_max_parallel_requests=300,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(isc.model)

# Load dataset
ds = Dataset.load_from_disk("datasets/huggingface_dataset_large")
print(ds.features)
ds = ds.select(range(0, 100))

# Define your processing function
async def process_text(task, client, semaphore, tokenizer):
    async with semaphore:
        prompt = rf"<s>[INST] {task['text']} [\INST]"
        completion = await client.text_generation(
            prompt=prompt,
            max_new_tokens=1000,
            stop_sequences=["User:", "###", "<|endoftext|>"],
            repetition_penalty=1.3,
        )
        tokenized_completion = tokenizer.encode(completion)
        token_length = len(tokenized_completion)
        return task["title"], task["filename"], task["text"], completion, token_length

# Save results function
def save_results(results):
    with open("results.json", "a") as file:
        json.dump(results, file)
        file.write("\n")

# Main processing function
async def main():
    start_time = time.time()

    with LLMSwarm(isc) as llm_swarm:
        semaphore = asyncio.Semaphore(llm_swarm.suggested_max_parallel_requests)
        client = AsyncInferenceClient(model=llm_swarm.endpoint)

        tasks = [
            process_text(task, client, semaphore, tokenizer) for task in ds
        ]
        results = await tqdm_asyncio.gather(*tasks)

        end_time = time.time()
        total_duration = end_time - start_time
        total_tokens = sum(result[4] for result in results)
        overall_tokens_per_second = total_tokens / total_duration if total_duration > 0 else 0

        # Prepare processed data
        processed_data = {
            "title": [result[0] for result in results],
            "filename": [result[1] for result in results],
            "original_text": [result[2] for result in results],
            "processed_text": [result[3] for result in results],
            "token_length": [result[4] for result in results],
        }

        processed_ds = Dataset.from_dict(processed_data)
        processed_ds.save_to_disk("synthetic_data")

        # processed_ds.push_to_hub("TugdualKerjan/SynthChem")

        print(f"Overall Tokens per Second: {overall_tokens_per_second}")
        save_results(results)

# Run the main function
nest_asyncio.apply()
asyncio.run(main())
