In [None]:
#| hide
from fastdata.core import *

# How To ~~Train~~ Synthesize Your ~~Dragon~~ Data

> The fastest way to create high quality synthetic data

## Introduction

Synthetic data has recently become a popular topic in the field of large language models (LLMs), with modern LLMs such as Meta's Llama 3 models being trained on a large portion of synthetic data. This blog post introduces synthetic data generation and showcases the important bits to consider when generating it.

## Synthetic Data Overview

When we refer to synthetic data in this blog post, we are referring to data that is generated by an LLM. Synthetic data has a number of benefits over other types of data such as web scraped data. The first and foremost is the controllability of the data. We can specify the type of data we want to generate, targeting specific tasks or languages, levels of difficulty, and even specific topics. We can also generate large amounts of data in a shorter amount of time than with an annotation service such as [scale.ai](https://scale.ai), since LLM generation is easy to parallelize. However, there are also some downsides to synthetic data. The first is that it is not always clear how to define the quality of the data. The second is that it can be difficult to generate diverse data that covers the long tail distribution of real world knowledge. The third is that it can be difficult to ensure that the data faithfully represents the real world, i.e., does not contain hallucinations. Additionally, there are some concerns that training on synthetic data can lead to model collapse. For example, the awesome paper [The Curse of Recursion: Training on Generated Data Makes Models Forget](https://arxiv.org/abs/2305.17493) showed that repeatedly training a model on data generated by itself leads to the model forgetting the original training data and devolving into nonsense. This blog post will discuss and demonstrate many lessons learned in the field of synthetic data generation.

<!-- along with introducing the `fastdata` library to make it easy to generate high quality synthetic data for LLMs. -->

## Important Bits

Two of the most important properties of data are quality and diversity. Sadly, these two can be in conflict with each other. For example, a random string generator will give you a ton of diversity, but the quality will be low. On the other hand, Encyclopedia Britannica contains high quality articles on various topics, but they are all written in a similar style, lack depth on many topics such as mathematics, and contain no content at all on others such as fan fiction of popular TV shows, movies, and novels.



## How to Nail the Important Bits

Let's start with diversity since it is the simplest, especially due to how LLMs work. LLMs at their core are a probability distribution over possible next words given the previous words in a sequence, where words that often follow the preceding words are more probable. We can easily sample from and manipulate this distribution to cover as much of the language space as we want. A recent popular method to do this is to bias the distribution using topics or personas that lead the LLM to generate words in the direction of that topic or in the style of the given persona. This has the nice benefit of not degrading the quality of the generated text, which can happen with other methods such as increasing the sampling temperature, as discussed in [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751) by Holtzman et al. An example of the persona approach in action is the recently released paper [Scaling Synthetic Data Creation with 1,000,000,000 Personas](https://arxiv.org/abs/2406.20094v1) by Chan et al. Sadly, this only solves one aspect of diversity, that of breadth. There is, however, also a depth aspect to diversity, which I think is best expressed in the paper [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244) by Xu et al. In this paper, the authors evolve seed instructions in terms of breadth (the types of topics used for an instruction) and depth (the complexity of the instruction).
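
To make the temperature point concrete, here is a small sketch on a made-up next-word distribution (the words and scores are invented for illustration, not taken from any model). Dividing the scores by a temperature before the softmax sharpens the distribution when the temperature is below 1 and flattens it when it is above 1:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, rescaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Invented scores for four candidate next words.
toy_logits = {"dog": 4.0, "cat": 3.0, "otter": 1.0, "xylophone": -2.0}

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(list(toy_logits.values()), t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

At higher temperatures the unlikely words receive noticeably more probability mass, which adds diversity but is also where quality tends to degrade.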

Great, diversity, check! How about quality? Unfortunately, quality is a lot more fickle than diversity in this context. For example, high quality can refer to how well a particular text is written, to the accuracy of the information presented in the text (e.g., a mathematical proof), to the readability of the text, etc. An analogy that might help encapsulate the issue is grading student exams. Usually, and especially for free form responses, the student's answer is graded against a rubric whose criteria together determine the quality of the answer. One method to try to improve quality is to use personas, where we specify a persona that is an expert in whatever data we seek to generate (e.g., `You are an expert senior level Python developer with deep knowledge of numpy and pandas`). However, this approach does not get around the hallucinations that LLMs are plagued with. Therefore, the approach I use most often is fixing in post (aka filtering). The idea is to initially generate a large diverse set of data and then find the bits and pieces that align with whatever notion of quality you have for your data (i.e., that get a high score on your rubric). For certain types of data such as code, we can use heuristics and simple filters such as whether the code compiles. However, for more abstract ideas of quality, we need to get more creative. One approach I quite like and have done a lot of work on comes from [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Gunasekar et al. In this work, the authors show how you can use a strong large language model to classify code files as high or low quality, where quality is defined as the educational value of the file. Filtering a large collection of code data for only high quality educational code resulted in fairly sizable bumps in downstream performance. This LLM filtering technique has been utilized in a number of other works such as [How to Train Data-Efficient LLMs](https://arxiv.org/abs/2402.09668) and [The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale](https://arxiv.org/abs/2406.17557) to similar effect.
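
As a concrete instance of the "does the code compile" heuristic, here is a minimal sketch for Python snippets using the standard library's `ast` module. The helper name `is_valid_python` and the sample snippets are mine, for illustration only. Note that this only checks syntax; it says nothing about whether the code runs correctly or is any good.

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python, False otherwise."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError covers inputs
        # such as strings containing null bytes on some Python versions.
        return False
    return True

# Illustrative candidates: two valid snippets and one with a syntax error.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def broken(:\n    pass\n",
    "print('hello')",
]

kept = [c for c in candidates if is_valid_python(c)]
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

A cheap filter like this is a reasonable first pass before spending LLM calls on the more abstract notions of quality.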

## Let's Play with some Data

You don't need to take my word for this. Let's use this fun little library called [`instructor`](https://github.com/jxnl/instructor) to showcase some of these ideas. `instructor` is a library for forcing LLMs to generate data in a specific format such as a given JSON schema. For example, let's say we want to generate a dataset of English and German phrases. We can define a [`pydantic`](https://docs.pydantic.dev/latest/) model to represent the data and then use `instructor` to generate the data. Below is a code snippet that shows how to do this:

In [None]:
from pydantic import BaseModel, Field

class EnglishToGerman(BaseModel):
    english: str = Field(description="An english phrase")
    german: str = Field(description="An equivalent german phrase that is a translation of the english phrase")

`instructor` supports many models and model providers. For this example, we'll use Anthropic:

::: {.callout-note}
Make sure you have an API key for the model you want to use and the proper environment variables set. For example, if you are using Anthropic, you need to set the `ANTHROPIC_API_KEY` environment variable.
:::

In [None]:
import anthropic
import instructor

client = instructor.from_anthropic(anthropic.Anthropic())
for _ in range(5):
    translation: EnglishToGerman = client.chat.completions.create(
        model="claude-3-haiku-20240307", # let's use the small, but mighty haiku model
        max_tokens=512,
        max_retries=0,
        messages=[
            {
                "role": "user",
                "content": "Create an english and german translation pair",
            }
        ],
        response_model=EnglishToGerman,
    )
    print(translation)

english='I would like to create an English and German translation pair.' german='<UNKNOWN>'
english='Hello, how are you today?' german='Hallo, wie geht es Ihnen heute?'
english='How are you doing today?' german='Wie geht es Ihnen heute?'
english='Hello, how are you today?' german='Hallo, wie geht es Ihnen heute?'
english='Hello, how are you today?' german='Hallo, wie geht es Ihnen heute?'


Eindrucksvoll! Looking at the output, we can see that the model is able to generate the data in the correct format. However, these pairs are quite simple and lack depth. Let's see if we can improve the quality of the generations by adding some examples. To do this, we will do some prompt engineering to give the model some examples that showcase the type of quality we want.

In [None]:
examples = [
    EnglishToGerman(
        english="Hello, my name is Nathan. I am a research scientist at an AI startup.",
        german="Hallo mein Name ist Nathan. Ich bin wissenschaftlicher Mitarbeiter bei einem KI-Startup."),
    EnglishToGerman(
        english="How much wood could a woodchuck chuck if a woodchuck could chuck wood?",
        german="Wie viel Holz könnte ein Waldmurmeltier einspannen, wenn ein Waldmurmeltier Holz einspannen könnte?"),
    EnglishToGerman(
        english="Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See.",
        german="Thomas Cranmer (2. Juli 1489 - 21. März 1556) war ein Anführer der englischen Reformation und Erzbischof von Canterbury während der Herrschaft von Heinrich VIII., Eduard VI. und für kurze Zeit auch Maria I. Er half bei der Ausarbeitung der Klage für die Aufhebung von Heinrichs Heirat mit Katharina von Aragon, die eine der Ursachen für die Trennung der englischen Kirche von der Union mit dem Heiligen Stuhl war."
    ),
]

prompt = """\
Create an english and german translation pair that is similar to the examples.

Here are some examples:
- {examples}
"""
prompt = prompt.format(examples="\n- ".join([f"{e.english} -> {e.german}" for e in examples]))
print(prompt)

for _ in range(5):
    translation: EnglishToGerman = client.chat.completions.create(
        model="claude-3-haiku-20240307", # let's use the small, but mighty haiku model
        max_tokens=512,
        max_retries=0,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        response_model=EnglishToGerman,
    )
    print(translation)

Create an english and german translation pair that is similar to the examples.

Here are some examples:
- Hello, my name is Nathan. I am a research scientist at an AI startup. -> Hallo mein Name ist Nathan. Ich bin wissenschaftlicher Mitarbeiter bei einem KI-Startup.
- How much wood could a woodchuck chuck if a woodchuck could chuck wood? -> Wie viel Holz könnte ein Waldmurmeltier einspannen, wenn ein Waldmurmeltier Holz einspannen könnte?
- Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See. -> Thomas Cranmer (2. Juli 1489 - 21. März 1556) war ein Anführer der englischen Reformation und Erzbischof von Canterbury während der Herrschaft von Heinrich VIII., Eduard VI. und 

english='The quick brown fox jumps over the lazy dog.' german='Der schnelle braune Fuchs springt über den faulen Hund.'
english='The quick brown fox jumps over the lazy dog.' german='Der schnelle braune Fuchs springt über den faulen Hund.'
english='The quick brown fox jumps over the lazy dog.' german='Der schnelle braune Fuchs springt über den faulen Hund.'
english='The new exhibit at the museum explores the history of ancient civilizations.' german='Die neue Ausstellung im Museum erforscht die Geschichte alter Zivilisationen.'
english='The quick brown fox jumps over the lazy dog.' german='Der schnelle braune Fuchs springt über den faulen Hund.'


Interesting! We are getting some better results, but also a lot of duplicates. One thing that I discovered while prompting these models is that where you place your examples in the prompt can have a big impact on the quality of the generations. For example, if you place them at the end, as the prompt above does, the model will often repeat the examples in the generations. Let's try placing them at the beginning.
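
Before changing the prompt, it can help to put a number on the duplication. The following is a minimal sketch (the `unique_ratio` helper is a name chosen for illustration, not part of `instructor`) that counts exact duplicates among the five English sentences printed above:

```python
from collections import Counter

def unique_ratio(texts):
    """Fraction of texts that are distinct after trimming whitespace and lowercasing."""
    normalized = [t.strip().lower() for t in texts]
    return len(set(normalized)) / len(normalized) if normalized else 0.0

# The five English sentences from the output above.
generations = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",
    "The new exhibit at the museum explores the history of ancient civilizations.",
    "The quick brown fox jumps over the lazy dog.",
]

counts = Counter(g.strip().lower() for g in generations)
print(unique_ratio(generations))  # 0.4
print(counts.most_common(1))      # the fox sentence, seen 4 times
```

Exact matching is the crudest possible check; near-duplicates would need something fuzzier, but even this is enough to compare prompts against each other.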

In [None]:
prompt = """\
Here are some examples:
- {examples}

Create an english and german translation pair that is similar to the examples.
"""
prompt = prompt.format(examples="\n- ".join([f"{e.english} -> {e.german}" for e in examples]))
print(prompt)

for _ in range(5):
    translation: EnglishToGerman = client.chat.completions.create(
        model="claude-3-haiku-20240307", # let's use the small, but mighty haiku model
        max_tokens=512,
        max_retries=0,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        response_model=EnglishToGerman,
    )
    print(translation)

Here are some examples:
- Hello, my name is Nathan. I am a research scientist at an AI startup. -> Hallo mein Name ist Nathan. Ich bin wissenschaftlicher Mitarbeiter bei einem KI-Startup.
- How much wood could a woodchuck chuck if a woodchuck could chuck wood? -> Wie viel Holz könnte ein Waldmurmeltier einspannen, wenn ein Waldmurmeltier Holz einspannen könnte?
- Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See. -> Thomas Cranmer (2. Juli 1489 - 21. März 1556) war ein Anführer der englischen Reformation und Erzbischof von Canterbury während der Herrschaft von Heinrich VIII., Eduard VI. und für kurze Zeit auch Maria I. Er half bei der Ausarbeitung der Klage für die Aufh

Woah, what a difference! Though we are still seeing that the model really likes foxes jumping over dogs. Let's see what happens if we focus on improving diversity instead of quality. To accomplish this, we will use a list of topics to guide the generations.

In [None]:
topics = ["otters", "penguins", "sloths", "cats", "dogs"]
for topic in topics:
    prompt = """\
Create an english and german translation pair about the following topic:
{topic}
"""
    prompt = prompt.format(topic=topic)
    translation: EnglishToGerman = client.chat.completions.create(
        model="claude-3-haiku-20240307", # let's use the small, but mighty haiku model
        max_tokens=512,
        max_retries=0,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        response_model=EnglishToGerman,
    )
    print(translation)

english='Otters are adorable aquatic mammals that live near rivers, lakes, and coastlines.' german='Otter sind entzückende Wassersäugetiere, die in der Nähe von Flüssen, Seen und Küstengebieten leben.'
english='Penguins are flightless birds that live in cold regions near the South Pole.' german='Pinguine sind flugunfähige Vögel, die in kalten Regionen in der Nähe des Südpols leben.'
english='Sloths are slow-moving tree-dwelling mammals found in Central and South America.' german='Faultiere sind langsam bewegende, baumlebende Säugetiere, die in Mittel- und Südamerika vorkommen.'
english='Cats are adorable furry companions that bring joy to many households.' german='Katzen sind entzückende pelzige Begleiter, die vielen Haushalten Freude bringen.'
english='Dogs are loyal and loving companions.' german='Hunde sind treue und liebevolle Begleiter.'


Okay nice, we are getting some diversity based on our list of topics. Also, since we are using a relatively powerful model, the quality is pretty good. However, if we were to use a smaller model, we would likely see a drop in quality. Let's try combining our examples and topics tricks.

In [None]:
for topic in topics:
    prompt = """\
Here are some examples:
- {examples}

Create an english and german translation pair that is similar to the examples and is about the following topic:
{topic}
    """
    prompt = prompt.format(examples="\n- ".join([f"{e.english} -> {e.german}" for e in examples]), topic=topic)
    translation: EnglishToGerman = client.chat.completions.create(
        model="claude-3-haiku-20240307", # let's use the small, but mighty haiku model
        max_tokens=512,
        max_retries=0,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        response_model=EnglishToGerman,
    )
    print(translation)

english='Otters are semiaquatic mammals that belong to the weasel family. They have webbed feet and dense fur that helps them stay warm in the water. Otters are playful creatures and are known for their love of sliding down muddy banks into rivers and streams.' german='Otter sind halbaquatische Säugetiere, die zur Familie der Marder gehören. Sie haben Schwimmhäute zwischen den Zehen und ein dichtes Fell, das ihnen hilft, im Wasser warm zu bleiben. Otter sind verspielt und bekannt dafür, dass sie gerne an schlammigen Ufern in Flüsse und Bäche rutschen.'
english='Penguins are flightless birds that live in the Southern Hemisphere. They have black and white plumage and distinctive beaks.' german='Pinguine sind flugunfähige Vögel, die auf der Südhalbkugel leben. Sie haben ein schwarz-weißes Gefieder und auffällige Schnäbel.'
english='Sloths are slow-moving animals that live in the treetops of tropical forests in Central and South America. They have long limbs, sharp claws, and move at a lei

Not too shabby, and I'd say better than the previous examples. However, it definitely needs work and has some interesting quirks: the more exotic animals get phrases that read like descriptions you might find in a Wikipedia article, whereas the domesticated animals (cat and dog) get phrases that sound like owners discussing them. Similar to what was discussed above, the order in which the examples and topics appear in the prompt can significantly impact the quality and diversity of the generations. Let's try swapping them.
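
Since the same ingredients keep getting rearranged, ordering experiments like this one can be made less error-prone with a small helper. This is only a sketch: `build_prompt` and its `examples_first` parameter are hypothetical names, and it takes plain `(english, german)` tuples rather than the `pydantic` objects used in the cells.

```python
def build_prompt(examples, topic, examples_first=True):
    """Assemble a prompt from (english, german) example pairs and a topic.

    `examples_first` controls whether the examples come before or after
    the instruction.
    """
    example_block = "Here are some examples:\n" + "\n".join(
        f"- {english} -> {german}" for english, german in examples
    )
    instruction = (
        "Create an english and german translation pair that is similar to "
        f"the examples and is about the following topic:\n{topic}"
    )
    if examples_first:
        parts = [example_block, instruction]
    else:
        parts = [instruction, example_block]
    return "\n\n".join(parts) + "\n"

# One pair borrowed from an earlier output, just to show the two layouts.
pairs = [("Dogs are loyal and loving companions.", "Hunde sind treue und liebevolle Begleiter.")]
print(build_prompt(pairs, "otters", examples_first=True))
print(build_prompt(pairs, "otters", examples_first=False))
```

The cell that follows writes the swapped prompt out by hand, which is equivalent to calling the helper with `examples_first=False`, up to trailing whitespace.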

In [None]:
for topic in topics:
    prompt = """\
Create an english and german translation pair that is similar to the examples and is about the following topic:
{topic}

Here are some examples:
- {examples}
    """
    prompt = prompt.format(examples="\n- ".join([f"{e.english} -> {e.german}" for e in examples]), topic=topic)
    translation: EnglishToGerman = client.chat.completions.create(
        model="claude-3-haiku-20240307", # let's use the small, but mighty haiku model
        max_tokens=512,
        max_retries=0,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        response_model=EnglishToGerman,
    )
    print(translation)

english='Otters are playful and intelligent aquatic mammals. They have webbed feet, sleek fur, and can swim rapidly.' german='Otter sind verspielt und intelligente Säugetiere, die im Wasser leben. Sie haben geschwänzte Füße, ein glattes Fell und können schnell schwimmen.'
english='Penguins are fascinating birds that live in the coldest places on Earth.' german='Pinguine sind faszinierende Vögel, die in den kältesten Orten der Erde leben.'
english='Sloths are slow-moving mammals that live in the trees of Central and South America.' german='Faultiere sind langsam bewegende Säugetiere, die in den Bäumen Mittel- und Südamerikas leben.'
english='Cats are wonderful pets. They are very independent and playful animals.' german='Katzen sind wunderbare Haustiere. Sie sind sehr unabhängig und verspielt.'
english='My dog is very friendly. He loves to play fetch with me.' german='Mein Hund ist sehr freundlich. Er liebt es, Fangen mit mir zu spielen.'


As you can see, the generations are already way shorter than the previous ones, most likely because the examples are shorter as well. Another way we can improve quality is to simply use a bigger model. Let's see what we get when using Claude 3.5 Sonnet.

In [None]:
translations = []
for topic in topics:
    prompt = """\
Here are some examples:
- {examples}

Create an english and german translation pair that is similar to the examples and is about the following topic:
{topic}
    """
    prompt = prompt.format(examples="\n- ".join([f"{e.english} -> {e.german}" for e in examples]), topic=topic)
    translation: EnglishToGerman = client.chat.completions.create(
        model="claude-3-5-sonnet-20240620", # let's use Anthropic's best model to date
        max_tokens=512,
        max_retries=0,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        response_model=EnglishToGerman,
    )
    print(translation)
    translations.append(translation)

english='Otters are semi-aquatic mammals known for their playful behavior and their ability to use tools. They are found in rivers, lakes, and coastal areas around the world. These adorable creatures have thick, water-repellent fur that keeps them warm in cold waters.' german='Otter sind semi-aquatische Säugetiere, die für ihr verspieltes Verhalten und ihre Fähigkeit, Werkzeuge zu benutzen, bekannt sind. Sie kommen in Flüssen, Seen und Küstengebieten auf der ganzen Welt vor. Diese niedlichen Tiere haben ein dickes, wasserabweisendes Fell, das sie in kaltem Wasser warm hält.'
english='Penguins are flightless seabirds that are highly adapted for life in the water. They are found almost exclusively in the Southern Hemisphere, particularly in Antarctica. Despite their inability to fly, penguins are excellent swimmers and can dive to great depths in search of food.' german='Pinguine sind flugunfähige Seevögel, die hervorragend an das Leben im Wasser angepasst sind. Sie kommen fast ausschlie

Haiku already produces some pretty good generations, so these aren't too much of an improvement, but they are definitely longer and more detailed. Additionally, the cat and dog generations are no longer written in a different style from the others.

Now, we could keep tweaking the prompt and examples to get the best results. However, hallucinations and other general quality issues will continue to be a problem. Therefore, let us see how we can implement a common post-processing step that cleans up the generations by using another LLM to evaluate them. To do this, we will follow the same additive 5-point scoring system that was found to be the most effective in the wonderful paper [The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale](https://arxiv.org/abs/2406.17557). Additionally, we will have the model generate a written critique of the translation before scoring it, which should also help with accurately evaluating the generations. Luckily, this is quite easy to do with `instructor`: since responses are generated autoregressively, we can order the fields of our `pydantic` model so that the model writes the critique first and then uses it when producing the score.

In [None]:
from pydantic import BaseModel, Field
from typing import Literal

class TranslationCritique(BaseModel):
    critique: str = Field(description="A critique of the translation.")
    score: Literal[0, 1, 2, 3, 4, 5] = Field(description="A score of the translation from 0 to 5.")

prompt = """\
Below is an extract of a translation. Evaluate its quality as a senior translator would, considering its suitability for professional use. Use the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the translation conveys the basic meaning of the source text, even if it includes some minor errors or awkward phrasing.
- Add another point if the translation is generally accurate but lacks refinement in style or fails to capture some nuances of the original. It might use inconsistent terminology or have occasional lapses in register.
- Award a third point if the translation is appropriate for professional use and accurately conveys key concepts of the source text. It demonstrates good understanding of both languages, though it may not be flawless or could include some slight inconsistencies. It resembles the work of a competent translator but may have room for improvement in fluency or precision.
- Grant a fourth point if the translation is highly accurate and reads naturally in the target language, exhibiting a consistent and appropriate style. It could be similar to the work of an experienced translator, offering faithful rendering of content and tone, with minimal errors, and effectively handling complex concepts or cultural references. The result is coherent, well-expressed, and valuable for its intended purpose.
- Bestow a fifth point if the translation is outstanding, demonstrating mastery of both source and target languages. It captures subtle nuances, maintains the author's voice and intent, and reads as if it were originally written in the target language. The translator has made excellent choices in dealing with challenging elements like wordplay, idiomatic expressions, or culture-specific content.

The translation extract: {translation}

After examining the translation:

- Briefly justify your total score, up to 100 words.
- Conclude with the score of the translation.
"""

In [None]:
for translation in translations:
    critique: TranslationCritique = client.chat.completions.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        max_retries=0,
        messages=[
            {
                "role": "user",
                "content": prompt.format(translation=str(translation)),
            }
        ],
        response_model=TranslationCritique,
    )
    print(translation)
    print("Critique:", critique.critique)
    print("Score:", critique.score)

english='Otters are semi-aquatic mammals known for their playful behavior and their ability to use tools. They are found in rivers, lakes, and coastal areas around the world. These adorable creatures have thick, water-repellent fur that keeps them warm in cold waters.' german='Otter sind semi-aquatische Säugetiere, die für ihr verspieltes Verhalten und ihre Fähigkeit, Werkzeuge zu benutzen, bekannt sind. Sie kommen in Flüssen, Seen und Küstengebieten auf der ganzen Welt vor. Diese niedlichen Tiere haben ein dickes, wasserabweisendes Fell, das sie in kaltem Wasser warm hält.'
Critique: The German translation is excellent, earning all 5 points:
1. It accurately conveys the basic meaning.
2. It captures nuances and maintains consistency.
3. It's suitable for professional use with accurate key concepts.
4. It reads naturally with appropriate style and terminology.
5. It demonstrates mastery, capturing subtle nuances and maintaining the original tone.

The translation precisely conveys all 

As you can see, the model is able to generate a critique and score for each translation. Also, it seems quite fond of its own translations. Let's see what happens when we give it a terrible translation.

In [None]:
bad_translation = EnglishToGerman(
    english="The city council meeting on climate change initiatives was contentious, with passionate arguments from both sides. Ultimately, the proposal for increased funding for renewable energy projects was approved by a narrow margin.",
    german="Die Stadt Rat Treffen auf Klima Änderung Initiativen war streitsüchtig, mit passioniert Argumente von beide Seiten. Ultimativ, die Proposal für gesteigert Geld für erneuerbar Energie Projekten war approved bei ein eng Margin."
)

critique: TranslationCritique = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    max_retries=0,
    messages=[
        {
            "role": "user",
            "content": prompt.format(translation=str(bad_translation)),
        }
    ],
    response_model=TranslationCritique,
)
print(bad_translation)
print("Critique:", critique.critique)
print("Score:", critique.score)

english='The city council meeting on climate change initiatives was contentious, with passionate arguments from both sides. Ultimately, the proposal for increased funding for renewable energy projects was approved by a narrow margin.' german='Die Stadt Rat Treffen auf Klima Änderung Initiativen war streitsüchtig, mit passioniert Argumente von beide Seiten. Ultimativ, die Proposal für gesteigert Geld für erneuerbar Energie Projekten war approved bei ein eng Margin.'
Critique: The translation conveys the basic meaning of the source text, but it has significant issues. While it communicates the general idea of a contentious city council meeting about climate change initiatives, the German translation is riddled with errors and awkward phrasing. It uses many English words directly (e.g., "approved," "ultimativ") instead of their German equivalents. The sentence structure is unnatural, following English syntax too closely. Terminology is inconsistent and often incorrect (e.g., "Stadt Rat" i
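
With a critique and score attached to every pair, the "fixing in post" step reduces to a threshold filter. Here is a minimal sketch, assuming a cutoff of 4 (chosen for illustration; the right threshold depends on your data) and using a plain dataclass, `ScoredPair`, as a hypothetical stand-in for the `pydantic` models above. The scores below are made-up placeholders, not model outputs.

```python
from dataclasses import dataclass

@dataclass
class ScoredPair:
    english: str
    german: str
    score: int  # 0-5, as in the additive rubric

def filter_by_score(pairs, min_score=4):
    """Keep only pairs whose critique score meets the threshold."""
    return [p for p in pairs if p.score >= min_score]

# Placeholder pairs and scores, for illustration only.
scored = [
    ScoredPair("Dogs are loyal and loving companions.", "Hunde sind treue und liebevolle Begleiter.", 5),
    ScoredPair("The proposal was approved.", "Die Proposal war approved.", 1),
]

kept = filter_by_score(scored)
print(f"kept {len(kept)} of {len(scored)} pairs")
```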

Wunderbar! Now, we have a relatively sophisticated way of generating tons of high quality data. Let's put our newfound knowledge to the test. For this, let us train a coding language model (something close to my heart) on a bunch of synthetic programs. To inject diversity, we will be using a list of personas for the model to generate the programs for. Specifically, we will be using the PersonaHub dataset from the paper [Scaling Synthetic Data Creation with 1,000,000,000 Personas](https://arxiv.org/abs/2406.20094v1), which contains a subset of roughly 200k personas. Below are some of the bits that we will be using to generate the programs. The full pipeline also applies multiprocessing to speed up generation, which makes the code a bit unwieldy, so to see the full code, please see the repository [tiny_programs](https://github.com/AnswerDotAI/tiny_programs).
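
The parallel part can be sketched independently of any particular model. The following is a rough illustration rather than the code from the repository: `generate_program` is a hypothetical stub standing in for the real LLM call, the personas are made-up placeholders rather than entries from PersonaHub, and a thread pool from the standard library is used here because API calls are I/O-bound (the actual pipeline may do this differently).

```python
from concurrent.futures import ThreadPoolExecutor

def generate_program(persona: str) -> dict:
    """Hypothetical stand-in for an LLM call that returns a program for a persona."""
    return {
        "persona": persona,
        "requirements": f"A small tool for {persona}",
        "code": "print('hello')",
    }

def generate_all(personas, max_workers=4):
    """Run one generation per persona concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_program, personas))

# Made-up personas for illustration.
personas = ["a high school chemistry teacher", "a beekeeper", "a marathon runner"]
programs = generate_all(personas)
print(len(programs), [p["persona"] for p in programs])
```

`pool.map` returns results in the same order as the inputs, so each program stays aligned with the persona it was generated for.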

In [None]:
class TinyProgram(BaseModel):
    requirements: str = Field(description="A description of the requirements for the program to help the persona.")
    code: str = Field(description="The code that satisfies the requirements. Ensure it is well written and documented.")

TinyProgram(
    requirements="A Python-based data aggregation and analysis tool that scrapes key Salvadoran news websites and government portals for the latest political updates, election results, and policy changes. The program would use standard libraries like requests for web scraping, re for text parsing, and pandas for data manipulation. It would store the collected information in a structured format, perform basic sentiment analysis on news articles, and generate a daily summary report highlighting significant political events, trending topics, and shifts in public opinion. The tool could also track mentions of key political figures and parties, providing a quick overview of their media presence and associated sentiments.",
    code="""\
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from textblob import TextBlob
from collections import Counter
import datetime

def scrape_news(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    articles = soup.find_all('article', class_='article-item')
    
    news_data = []
    for article in articles:
        title = article.find('h2', class_='article-title').text.strip()
        summary = article.find('p', class_='article-summary').text.strip()
        news_data.append({'title': title, 'summary': summary})
    
    return news_data

def analyze_sentiment(text):
    return TextBlob(text).sentiment.polarity

def generate_report(data):
    df = pd.DataFrame(data)
    df['sentiment'] = df['summary'].apply(analyze_sentiment)
    
    # Calculate average sentiment
    avg_sentiment = df['sentiment'].mean()
    
    # Find most mentioned words
    all_words = ' '.join(df['title'] + ' ' + df['summary']).lower().split()
    word_freq = Counter(word for word in all_words if len(word) > 3)
    top_words = word_freq.most_common(5)
    
    # Generate report
    report = f"Daily Political Analysis Report for El Salvador - {datetime.date.today()}\n\n"
    report += f"Number of articles analyzed: {len(df)}\n"
    report += f"Average sentiment: {'Positive' if avg_sentiment > 0 else 'Negative'} ({avg_sentiment:.2f})\n\n"
    report += "Top mentioned words:\n"
    for word, count in top_words:
        report += f"- {word}: {count} times\n"
    
    report += "\nMost positive article:\n"
    pos_article = df.loc[df['sentiment'].idxmax()]
    report += f"Title: {pos_article['title']}\nSentiment: {pos_article['sentiment']:.2f}\n\n"
    
    report += "Most negative article:\n"
    neg_article = df.loc[df['sentiment'].idxmin()]
    report += f"Title: {neg_article['title']}\nSentiment: {neg_article['sentiment']:.2f}\n"
    
    return report

def main():
    url = "https://www.elsalvador.com/noticias/nacional/"  # Example Salvadoran news website
    news_data = scrape_news(url)
    report = generate_report(news_data)
    print(report)
    
    # Optionally, save the report to a file
    with open(f"el_salvador_political_report_{datetime.date.today()}.txt", "w") as f:
        f.write(report)

if __name__ == "__main__":
    main()
```
"""
)

prompt_template = """\
Here are some examples:
{examples}

Create requirements and the python program that satisfies them for the following persona: {persona}
"""

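As a rough sketch of how the template above gets filled in and fanned out, the snippet below formats one prompt per persona and runs a generation function over the prompts with a thread pool. The names `build_prompts`, `generate_all`, and `fake_generate` are hypothetical and used only for illustration. `fake_generate` stands in for a real LLM call, and this is not the multiprocessing code from the tiny_programs repository.

```python
from concurrent.futures import ThreadPoolExecutor

# Same shape as the generation template above.
PROMPT_TEMPLATE = """\
Here are some examples:
{examples}

Create requirements and the python program that satisfies them for the following persona: {persona}
"""

def build_prompts(personas, examples):
    """Fill the template once per persona, sharing the same few-shot examples."""
    examples_text = "\n\n".join(examples)
    return [PROMPT_TEMPLATE.format(examples=examples_text, persona=p) for p in personas]

def fake_generate(prompt):
    """Stand-in for an LLM call (assumption: a real call would return a TinyProgram)."""
    persona = prompt.rsplit("persona: ", 1)[1].strip()
    return {"requirements": f"A small program for {persona}", "code": "print('hello')"}

def generate_all(prompts, generate_fn, max_workers=4):
    """Run generate_fn over all prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_fn, prompts))

personas = ["a political analyst in El Salvador", "a high school chemistry teacher"]
prompts = build_prompts(personas, examples=["<example TinyProgram goes here>"])
programs = generate_all(prompts, fake_generate)
```

Threads are used in this sketch because calls to a hosted LLM mostly wait on the network. Swapping `fake_generate` for a real client call is the only change the loop itself would need.
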
To evaluate the quality of the code, we will be using the following prompt:

In [None]:
prompt_template = """\
Below is a code snippet. Evaluate its educational value for teaching programming to beginners in this language, using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the code is syntactically correct and runs without errors, providing a basic example of working code in the language.
- Add another point if the code demonstrates fundamental programming concepts (e.g., variables, control structures, functions) in a straightforward manner, even if it's not optimized or doesn't follow all best practices.
- Award a third point if the code is well-commented, explaining key concepts and the purpose of different code sections. It should be readable and illustrate good naming conventions, making it easier for beginners to understand.
- Grant a fourth point if the code showcases language-specific features or common programming patterns in an accessible way. It should provide clear examples of how to apply these concepts practically.
- Bestow a fifth point if the code is an exemplary teaching tool, striking an excellent balance between simplicity and real-world applicability. It should inspire further learning, possibly including deliberate mistakes or opportunities for improvement that a teacher could use as discussion points.

The code snippet:
```python
{code}
```

After examining the code:

- Briefly justify your total score, up to 100 words, focusing on its effectiveness as a teaching tool for beginners.
- Conclude with the score.
"""

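Since the prompt asks the model to conclude with the score, the free-text critique still has to be turned into a number. Here is a minimal sketch of one way to do that, assuming the score is the last standalone digit from 1 to 5 in the response. `extract_score` is a hypothetical helper, not part of fastdata, and responses it cannot parse come back as `None` so they can be dropped.

```python
import re

def extract_score(critique):
    """Return the last standalone 1-5 integer in the critique, or None if there is none."""
    # Drop denominators such as "/5" or "out of 5" so they are not mistaken for the score.
    cleaned = re.sub(r"(/\s*5|out of\s*5)\b", "", critique, flags=re.IGNORECASE)
    # Match digits 1-5 that are not part of a longer number such as 100 or 4.5.
    matches = re.findall(r"(?<![\d.])[1-5](?![\d.]*\d)", cleaned)
    return int(matches[-1]) if matches else None

examples = [
    "The code is clean and well commented.\n\nScore: 4",
    "Solid teaching example. Total score: 5/5",
    "I would give it 3 out of 5.",
    "The response was cut off before a score was given",
]
scores = [extract_score(text) for text in examples]
```

Taking the last match relies on the model actually ending with the score. A stricter setup would ask for structured output instead of parsing free text.
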
I went ahead and ran the above for 1,000 programs, which resulted in 992 programs that were properly generated and scored. Here is the distribution of the scores:

| Score | Count |
|-------|-------|
| 1     | 25    |
| 2     | 117   |
| 3     | 96    |
| 4     | 256   |
| 5     | 498   |

I also got the quality scores for 10,000 random programs from GitHub, and they are distributed as follows:

| Score | Count |
|-------|-------|
| 1     | 2239  |
| 2     | 5230  |
| 3     | 1545  |
| 4     | 618   |
| 5     | 236   |
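
To make the two distributions easier to compare, the short snippet below computes the mean score and the share of programs scoring 4 or 5 from the counts in the tables above. The helper name `summarize` is just for illustration. Note that the GitHub counts as listed sum to 9,868, so the shares are taken over the programs that appear in the table.

```python
# Score counts copied from the two tables above.
synthetic_counts = {1: 25, 2: 117, 3: 96, 4: 256, 5: 498}
github_counts = {1: 2239, 2: 5230, 3: 1545, 4: 618, 5: 236}

def summarize(counts):
    """Return (total programs, mean score, share of programs scoring 4 or 5)."""
    total = sum(counts.values())
    mean = sum(score * n for score, n in counts.items()) / total
    high_share = (counts[4] + counts[5]) / total
    return total, mean, high_share

for name, counts in [("synthetic", synthetic_counts), ("GitHub", github_counts)]:
    total, mean, high_share = summarize(counts)
    print(f"{name}: n={total}, mean score={mean:.2f}, share scoring 4-5={high_share:.1%}")
```
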

Now, let's train a model on these programs. Specifically, I want to compare a model trained on these synthetic programs with a model trained on the same amount of data from the wild, aka GitHub. We will use the SmolLM-360M model from Hugging Face as the baseline and test the following configurations, each trained for 5 epochs on roughly 1,000 programs:
0. Baseline model
1. Train on the 992 synthetic programs
2. Train on 992 random GitHub programs
3. Train on a mixture of 496 synthetic programs that scored 4 or 5 and 496 random GitHub programs
4. Train on the 754 synthetic programs that scored 4 or 5
5. Train on 754 GitHub programs that scored 4 or 5, matching the size of the filtered synthetic set
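
The filtered and mixed setups above could be assembled roughly like this. The record layout (dicts with `score`, `source`, and `code` keys) and the helper names `high_quality` and `build_mixture` are assumptions for illustration, and the toy records stand in for the real scored programs.

```python
import random

def high_quality(programs, min_score=4):
    """Keep only programs whose quality score is at least min_score."""
    return [p for p in programs if p["score"] >= min_score]

def build_mixture(synthetic, github, n_each, seed=0):
    """Mix n_each high-scoring synthetic programs with n_each random GitHub programs."""
    rng = random.Random(seed)  # fixed seed so the sampled subset is reproducible
    return rng.sample(high_quality(synthetic), n_each) + rng.sample(github, n_each)

# Toy stand-ins for the scored datasets (scores cycle through 1..5).
synthetic = [{"source": "synthetic", "score": 1 + i % 5, "code": f"# synthetic {i}"} for i in range(50)]
github = [{"source": "github", "score": 1 + i % 5, "code": f"# github {i}"} for i in range(50)]

filtered_synthetic = high_quality(synthetic)            # analogous to Setup 4
mixture = build_mixture(synthetic, github, n_each=10)   # analogous to Setup 3
```
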

To evaluate the performance of these models, we will be using the standard HumanEval benchmark, which is a collection of 164 programming questions designed to test whether a coding LLM can generate functionally correct code. Here are the results!

| Setup   | pass@1 | pass@1 (raw)        |
|---------|--------|---------------------|
| Setup 0 | 11.6%  | 0.11585365853658537 |
| Setup 1 | 9.1%   | 0.09146341463414634 |
| Setup 2 | 11.0%  | 0.10975609756097561 |
| Setup 3 | 9.8%   | 0.0975609756097561  |
| Setup 4 | 12.2%  | 0.12195121951219512 |
| Setup 5 | 8.5%   | 0.08536585365853659 |
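
To get a feel for how large these gaps are, the snippet below converts each raw pass@1 value into a number of solved problems out of 164. This assumes each problem was attempted once, which the raw values suggest since every one of them is a multiple of 1/164.

```python
NUM_PROBLEMS = 164  # size of HumanEval, as stated above

# Raw pass@1 values from the results table.
pass_at_1 = {
    "Setup 0": 0.11585365853658537,
    "Setup 1": 0.09146341463414634,
    "Setup 2": 0.10975609756097561,
    "Setup 3": 0.0975609756097561,
    "Setup 4": 0.12195121951219512,
    "Setup 5": 0.08536585365853659,
}

# Convert each fraction back into a count of solved problems.
solved = {setup: round(score * NUM_PROBLEMS) for setup, score in pass_at_1.items()}
baseline = solved["Setup 0"]

for setup, count in solved.items():
    print(f"{setup}: {count}/{NUM_PROBLEMS} solved ({count - baseline:+d} vs. baseline)")
```

Under that assumption, the best setup and the baseline are a single problem apart, so the differences are worth reading with some caution.
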

### Key findings from the experiment:

We find some interesting results from these experiments! Once both datasets are quality filtered, training on synthetic data is clearly better than training on GitHub programs (Setup 4 at 12.2% vs. Setup 5 at 8.5%). Without filtering, the picture flips: the random GitHub programs in Setup 2 (11.0%) do better than the unfiltered synthetic programs in Setup 1 (9.1%). Also of note is that only the high quality synthetic data in Setup 4 improves over the baseline, and only by a small margin. All other setups degrade performance, especially Setup 5, which trains only on high quality GitHub programs. That is a bit surprising, as much research has gone into showing that high quality data is better for training. More investigation is needed to see why this is the case, but one possibility is that the scoring system does not work as well on these GitHub programs as on the synthetic programs. Another is a lack of diversity in the filtered GitHub programs.

Some homework I'd like you to do is to try to replicate the experiment on your own with your own task and experiment with scaling up the size of the dataset to see how it impacts the performance of the model trained on it. As always, please share your findings with the community and feel free to reach out for help!

## Takeaways

Alright, so what are the key takeaways I want you to leave with? The first is that both quality and diversity are very important aspects of synthetic data and can make or break models trained on it. The second is that, imo, quality is by far harder to get right than diversity due to its multi-dimensional nature, especially for free-form content. And lastly, I'd like you to take away that synthetic data is a great tool to reach for when you don't have a lot of data for your task. It's cheap and fast to create and, when done correctly, can boost performance on your task.

You can find all the code for this post in our minimal synthetic data repo [fastdata](https://github.com/AnswerDotAI/fastdata).

## Some ramblings and interesting resources in this area

The first paper that I remember reading on this topic was [Evolution through Large Models](https://arxiv.org/abs/2206.08896) by Lehman et al. The problem they faced was generating a walker robot in the Sodarace domain that can move across a given terrain. These walker robots are defined in Python using a framework that the model the authors used had not seen. To teach the model how to generate these walker robot Python programs, they took a synthetic data approach. Since LLMs need a ton of data to learn, they first use a coding LLM to mutate/augment existing programs in a genetic programming style. They then finetune the LLM on these programs and use it to synthesize even more programs. They show that this process can be repeated to improve the model's ability to generate walker robots.

```python
from walk_creator import walker_creator
...
def make_walker():
    wc = walker_creator()
    # the main body is a square
    sides = make_square(wc, 0, 0, 10, 10)
    center = wc.add_joint(5, 5)
    ...
```

Another interesting paper from some of my colleagues is [Quality-Diversity through AI Feedback](https://arxiv.org/abs/2310.13032) by Bradley et al. This was where I really started to understand the connection among quality-diversity (QD), artificial life, and synthetic data for LLMs. Until this paper, I didn't really know about the work on QD algorithms, which I recently learned was pioneered by [Joel Lehman and Kenneth O. Stanley](https://quality-diversity.github.io/) (of course, I should have known smh). Compared to standard genetic programming approaches, which treat the problem purely as optimization and try to find the single most fit solution, QD attempts to find a diverse set of solutions, each with a high level of quality/fitness. Okay, back to the paper: it is particularly interesting because it shows that you can apply QD to tasks where fitness is not easily measured, such as creative writing, by leveraging feedback from LLMs.
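
To make the difference between plain optimization and QD concrete, here is a toy sketch in the style of MAP-Elites, a well-known QD algorithm. It is not the method from the paper: the `fitness` and `descriptor` functions are made-up numeric stand-ins for the evaluation that LLM feedback provides there, and all names are illustrative.

```python
import random

NUM_BINS = 10  # number of niches the behavior space is split into

def fitness(solution):
    """Toy quality measure in [0, 1]: best when y matches x."""
    x, y = solution
    return 1.0 - abs(y - x)

def descriptor(solution):
    """Toy diversity measure: which niche (bin of x) the solution falls into."""
    x, _ = solution
    return min(int(x * NUM_BINS), NUM_BINS - 1)

def mutate(solution, rng, step=0.1):
    """Perturb each coordinate slightly, clamped to [0, 1]."""
    return tuple(min(1.0, max(0.0, v + rng.uniform(-step, step))) for v in solution)

def map_elites(iterations=2000, seed=0):
    rng = random.Random(seed)
    archive = {}  # niche index -> best solution found so far in that niche
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Usually mutate an existing elite...
            candidate = mutate(rng.choice(list(archive.values())), rng)
        else:
            # ...and occasionally sample a fresh random solution.
            candidate = (rng.random(), rng.random())
        niche = descriptor(candidate)
        # Keep the candidate only if its niche is empty or it beats the current elite.
        if niche not in archive or fitness(candidate) > fitness(archive[niche]):
            archive[niche] = candidate
    return archive

archive = map_elites()
```

A pure optimizer would return one best `(x, y)` pair. Here the result is an archive with one high-fitness solution per niche, which is the "diverse set of high quality solutions" idea in miniature.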

Other recommended resources:
* [The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text](https://arxiv.org/abs/2311.09807)
* [How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse](https://arxiv.org/abs/2406.17557)
* [The Curse of Recursion: Training on Generated Data Makes Models Forget](https://arxiv.org/abs/2305.17493)
* [Best Practices and Lessons Learned on Synthetic Data for Language Models](https://arxiv.org/abs/2404.07503)
* [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
* [🍷 FineWeb: decanting the web for the finest text data at scale](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)
* [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644)
* [Quality-Diversity through AI Feedback](https://arxiv.org/abs/2310.13032)
* [Nemotron-4 340B Technical Report](https://arxiv.org/abs/2406.11704)
* [Scaling Synthetic Data Creation with 1,000,000,000 Personas](https://arxiv.org/abs/2406.20094v1)
* [Quality-Diversity optimisation algorithms](https://quality-diversity.github.io/)
* [ICML 2019 Tutorial: Recent Advances in Population-Based Search for Deep Neural Networks](https://youtu.be/g6HiuEnbwJE?si=kdnEOFsrvwAyqei9)
* [How to Train Data-Efficient LLMs](https://arxiv.org/abs/2402.09668)