# ✍️ From A to Z

Our goal is to go from data extracted from the [ChemArXiV](https://chemrxiv.org/engage/chemrxiv/public-dashboard) dataset by Marta to a dataset of synthetic data that could be used to train models in the Chemistry domain.

In [None]:
!conda create -n fromatoz --file ./package-list.txt

## 📖 Dataset generation

First we start by taking the data and transforming it into a huggingface dataset. This is done by temporarily transforming it into a .json file before reading the json and transforming it into a [arrow](https://arrow.apache.org/) file format.

First we import the necessary libraries for this first step.

In [1]:
import os
import json
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET
from tqdm import tqdm

We then define the input and temp. output dir. ⚠️ Change this to your directory !

In [2]:
SOURCE_DIR = "datasets/chemrxiv_papers"
DEST_DIR = "datasets/temp"

We're going to take the opportunity of embedding the data into the different templates we want to use during the rephrasing part. We define them beneath, they were taken off the [cosmopedia](https://github.com/huggingface/cosmopedia/tree/main) github.

In [3]:
STYLES = {"wikihow":
"""Here is an extract from a webpage: "<INSERT_EXTRACT>".

Write a long and very detailed tutorial that could be part of WikiHow whose title is related to the extract above <ADD_TOPIC>. Include in depth explanations for each step and how it helps achieve the desired outcome, inluding key tips and guidelines. 
Ensure clarity and practicality, allowing readers to easily follow and apply the instructions. Do not use images.""",

"textbook_narrative":
"""Here is an extract from a webpage: "<INSERT_EXTRACT>".

Write an extensive and detailed course unit suitable for a textbook, related to the given extract <ADD_TOPIC>. Do not just list concepts, but develop each one in detail before moving to the next, as we prioritize depth of understanding and comprehensive exploration of the subject matter over breadth. Focus on:

- Rigor: Ensure in-depth coverage of the concepts.
- Engagement: Use a narrative style akin to Michael Lewis, making it captivating and thought-provoking.
- Relevance: Connect the topic with current trends, real-life examples, or recent studies. Do not use images.
Do not include a title or an introduction, simply write the content without headlines and introductory phrases. Do not use images.""",

"textbook_academic":
"""Here is an extract from a webpage: "<INSERT_EXTRACT>".

Write an extensive and detailed course unit suitable for a textbook targeted at college students, related to the given extract <ADD_TOPIC>. Do not just list concepts, but develop each one in detail before moving to the next, as we prioritize depth of understanding and comprehensive exploration of the subject matter over breadth. Focus on:

- Rigor: Ensure in-depth coverage of the concepts/sections.
- Engagement: Write with an academic, professional and engaging tone that captivates interest.
- Application: Incorporate specific, practical examples, such as proofs in calculus or critical dates and figures in history.
Do not include a title or an introduction, simply write the content without headlines and introductory phrases. Do not use images.""",

"blogpost":
"""Here is an extract from a webpage: "<INSERT_EXTRACT>".

Write an informative and insightful blog post that expands upon the extract above <ADD_TOPIC>. Your post should delve into the nuances of the topic, offering fresh perspectives and deeper analysis. Aim to:

- Inform: Provide valuable, well-researched information that educates the reader.
- Engage: Write in a conversational tone that connects with the audience, making complex ideas accessible.
- Illustrate: Use examples, anecdotes, or personal experiences to bring the topic to life.
Do not give a title and do not start with sentences like "Have you ever..." or "Hello dear readers..", simply write the content without these introductory phrases."""
}

Complex part, we iterate over the files and attempt to only keep relevant paragraphs. We multithread to speed up the process and define min. and max. text lenghts we with to extract. Based off [Rephrasing the Web](https://arxiv.org/abs/2401.16380) we keep it underneath 400 words as more makes the model lose context during sythesizing.

We also keep the title as it can be used for further context down the road.

In [11]:
MIN_TEXT_LENGTH = 200
MAX_TEXT_LENGTH = 400

def save_extracted_data(file_path, data):
    with open(file_path, 'a') as file:
        for sample in data:
            json.dump(sample, file)
            file.write('\n')  # Add newline character to separate JSON objects


def extract_text(filepath):
    source_paper_path = f"{os.getcwd()}/{SOURCE_DIR}/{filepath}"
    sample = {
        "filename": filepath,
        "title": "",  # Initialize title
        "texts": []   # Initialize list to store extracted text
    }
    tree = ET.parse(source_paper_path)
    try:
        root = tree.getroot()

        # Define the namespace map
        ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

        # Extract title
        title = ""
        teiHeader_elem = root.find('tei:teiHeader', namespaces=ns)
        if teiHeader_elem is not None:
            fileDesk_elem = teiHeader_elem.find('tei:fileDesc', namespaces=ns)
            if fileDesk_elem is not None:
                titleStmt_elem = fileDesk_elem.find('tei:titleStmt', namespaces=ns)
                if titleStmt_elem is not None:
                    title_elem = titleStmt_elem.find('tei:title', namespaces=ns)
                    if title_elem is not None:
                        title = title_elem.text
                    else:
                        print("No <title> element found.")
                else:
                    print("No <titleStmt> element found.")
            else:
                print("No <fileDesc> element found.")
        else:
            print("No <teiHeader> element found.")
        sample['title'] = title

        # Extract text
        text_list = []
        text_elem = root.find('tei:text', namespaces=ns)
        if text_elem is not None:
            body_elem = text_elem.find('tei:body', namespaces=ns)
            if body_elem is not None:
                texts = body_elem.findall('tei:div', namespaces=ns)
                for div in texts:
                    text = ' '.join(div.itertext())
                    for prompt in STYLES.values():
                        if len(text.split(" ")) > MIN_TEXT_LENGTH and len(text.split(" ")) < MAX_TEXT_LENGTH: 
                            text2 = prompt.replace("<ADD_TOPIC>", title).replace("<INSERT_EXTRACT>", text)
                            text_list.append(text2)
        sample['texts'] = text_list

    except Exception as e:
        print(e)
    return sample

def main():
    filenames = [k for k in os.listdir(os.getcwd() + "/" + SOURCE_DIR + "/") if k.endswith('.xml')]
    # print(os.getcwd() + "/" + DEST_DIR)
    print(len(filenames))
    batch_size = 100  # Adjust batch size as needed
    total_batches = len(filenames) // batch_size + 1
    with ThreadPoolExecutor(max_workers=2) as executor:
        for batch_num in range(total_batches):
            batch_filenames = filenames[batch_num * batch_size: (batch_num + 1) * batch_size]
            extracted_data = []
            for sample in tqdm(executor.map(extract_text, batch_filenames), total=len(batch_filenames), desc=f"Processing Batch {batch_num + 1}/{total_batches}"):
                extracted_data.append(sample)
            save_extracted_data(f'{DEST_DIR}/data.json', extracted_data)

main()


50


Processing Batch 1/1: 100%|██████████| 50/50 [00:00<00:00, 343.95it/s]

replace() argument 2 must be str, not None





We now have a json file with the collected data and transform it into the arrow format for ease of use.

In [12]:
import jsonlines
from datasets import Dataset
from tqdm import tqdm

dataset_dict = {
        "title": [],
        "filename": [],
        "text": []
    }

with jsonlines.open(f'{DEST_DIR}/data.json') as reader:
    for paper in tqdm(reader):
        for text in paper["texts"]:
            dataset_dict["title"].append(paper["title"])
            dataset_dict["filename"].append(paper["filename"])
            dataset_dict["text"].append(text)

    dataset = Dataset.from_dict(dataset_dict)
    dataset.save_to_disk("datasets/huggingface_dataset_large")

24449it [00:00, 377545.11it/s]
Saving the dataset (1/1 shards): 100%|██████████| 440/440 [00:00<00:00, 206177.38 examples/s]


## ✨ Synthetic data

Now that we have a dataset in the arrows format we can start synthesizing some data. To be as efficient as possible and allow for iterative testing, we use a modified version of llm_swarm that allows for the creation of side-by-side pods running [TGI](https://huggingface.co/docs/text-generation-inference/index) with whatever model that is available.

⚠️ Replace YOURTOKENHERE by your huggingface token (cat ~/.cache/huggingface/token)

In [15]:
import asyncio
import json
import sys
sys.path.append("./llm_swarm")
from llm_swarm import LLMSwarm, LLMSwarmConfig
from huggingface_hub import AsyncInferenceClient
from transformers import AutoTokenizer
import nest_asyncio
from tqdm.asyncio import tqdm_asyncio
import time

# Define your LLMSwarmConfig
isc = LLMSwarmConfig(
    instances=1, #Number of instances
    inference_engine="tgi", #The engine to use. (Could use vLLM)
    job_scheduler="runai", #The scheduler to use (would otherwise be slurm)
    gpus=1, #Number of GPUs to use per instance
    model="meta-llama/Meta-Llama-3-8B-Instruct", #Model for inference
    template_path="llm_swarm/templates/tgi.template.yml", #Template used for the run
    load_balancer_template_path="llm_swarm/templates/nginx.template.conf", #Load distributor
    huggingface_token="YOURTOKENHERE", #Huggingface token
    model_max_total=3000,
    model_max_input=1200,
    per_instance_max_parallel_requests=300,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(isc.model)

# Load dataset
ds = Dataset.load_from_disk("datasets/huggingface_dataset_large")
print(ds.features)
ds = ds.select(range(0, 100))

# Define your processing function
async def process_text(task, client, semaphore, tokenizer):
    async with semaphore:
        prompt = rf"<s>[INST] {task['text']} [\INST]"
        completion = await client.text_generation(
            prompt=prompt,
            max_new_tokens=1000,
            stop_sequences=["User:", "###", "<|endoftext|>"],
            repetition_penalty=1.3,
        )
        tokenized_completion = tokenizer.encode(completion)
        token_length = len(tokenized_completion)
        return task["title"], task["filename"], task["text"], completion, token_length

# Save results function
def save_results(results):
    with open("results.json", "a") as file:
        json.dump(results, file)
        file.write("\n")

# Main processing function
async def main():
    start_time = time.time()

    with LLMSwarm(isc) as llm_swarm:
        semaphore = asyncio.Semaphore(llm_swarm.suggested_max_parallel_requests)
        client = AsyncInferenceClient(model=llm_swarm.endpoint)

        tasks = [
            process_text(task, client, semaphore, tokenizer) for task in ds
        ]
        results = await tqdm_asyncio.gather(*tasks)

        end_time = time.time()
        total_duration = end_time - start_time
        total_tokens = sum(result[4] for result in results)
        overall_tokens_per_second = total_tokens / total_duration if total_duration > 0 else 0

        # Prepare processed data
        processed_data = {
            "title": [result[0] for result in results],
            "filename": [result[1] for result in results],
            "original_text": [result[2] for result in results],
            "processed_text": [result[3] for result in results],
            "token_length": [result[4] for result in results],
        }

        processed_ds = Dataset.from_dict(processed_data)
        processed_ds.save_to_disk("synthetic_data")

        # processed_ds.push_to_hub("TugdualKerjan/SynthChem")

        print(f"Overall Tokens per Second: {overall_tokens_per_second}")
        save_results(results)

# Run the main function
nest_asyncio.apply()
asyncio.run(main())


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{'title': Value(dtype='string', id=None), 'filename': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None)}
👎 Aborted! Waiting for runai-1715862733-0 to be created                        


Task exception was never retrieved
future: <Task finished name='Task-1' coro=<main() done, defined at /tmp/ipykernel_3598626/2030642281.py:56> exception=KeyboardInterrupt()>
Traceback (most recent call last):
  File "/home/kerjan/miniconda3/envs/fresh/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_3598626/2030642281.py", line 92, in <module>
    asyncio.run(main())
  File "/home/kerjan/miniconda3/envs/fresh/lib/python3.10/site-packages/nest_asyncio.py", line 30, in run
    return loop.run_until_complete(task)
  File "/home/kerjan/miniconda3/envs/fresh/lib/python3.10/site-packages/nest_asyncio.py", line 92, in run_until_complete
    self._run_once()
  File "/home/kerjan/miniconda3/envs/fresh/lib/python3.10/site-packages/nest_asyncio.py", line 133, in _run_once
    handle._run()
  File "/home/kerjan/miniconda3/envs/fresh/lib/python3.10/asyncio/events.py", line 80, in _run


KeyboardInterrupt: 