In [None]:
#| hide
from fastdata.core import *

# How To ~~Train~~ Synthesize Your ~~Dragon~~ Data

> The fastest way to create high quality synthetic data

## Introduction

Synthetic data has become a prominent topic in the field of large language models (LLMs), with modern models like Meta's Llama 3 being trained on substantial portions of synthetic data. This blog post introduces synthetic data generation and highlights key considerations when creating synthetic data for LLMs.

We'll explore the process of synthetic data generation, its importance in AI development, and practical techniques for producing high-quality synthetic datasets. By the end, you'll have a solid understanding of how to approach synthetic data creation for large language models.

## Synthetic Data Overview

Synthetic data in this context refers to data generated by large language models (LLMs). This approach offers several advantages over traditional methods like web scraping:

1. Controllability: We can specify the data type, target specific tasks, languages, difficulty levels, and topics.
2. Efficiency: Because LLM inference parallelizes well, large amounts of data can be generated far more quickly than through annotation services like [scale.ai](https://scale.ai).

However, synthetic data also presents challenges:

1. Quality assessment: Defining and measuring data quality can be unclear.
2. Diversity: Generating data that covers the long-tail distribution of real-world knowledge is difficult.
3. Faithfulness: Ensuring the data accurately represents the real world without hallucinations is challenging.
4. Model collapse: Training on synthetic data may lead to issues like those described in [The Curse of Recursion: Training on Generated Data Makes Models Forget](https://arxiv.org/abs/2305.17493), where models trained on self-generated data can forget original training data and produce nonsensical output.

This blog post will discuss and demonstrate key lessons learned in synthetic data generation for LLMs.

## Key Components of Synthetic Data

Quality and diversity are crucial yet often conflicting elements in synthetic data. A random string generator offers high diversity but low quality, while Encyclopedia Britannica provides high-quality content with limited stylistic range and depth in certain areas. The challenge lies in maintaining consistent quality across diverse topics, ensuring depth in specialized fields, and covering non-traditional content. Balancing these factors is essential for generating effective synthetic data for LLMs.

## Achieving Diversity and Quality in Synthetic Data

LLMs fundamentally represent a probability distribution over the next word given the previous words in a sequence. We can sample from this distribution to generate text, and we can bias it to cover a wider region of language space. Using topics or personas to bias the distribution is particularly effective, as demonstrated in [Scaling Synthetic Data Creation with 1,000,000,000 Personas](https://arxiv.org/abs/2406.20094v1). This approach maintains quality better than raising the sampling temperature, which can boost diversity but often at the cost of quality, as discussed in [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751).
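
To make the temperature trade-off concrete, here is a minimal sketch of how temperature reshapes a next-word distribution. The scores are made up for illustration and `softmax_with_temperature` is a hypothetical helper, not part of any library used in this post.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities, dividing by temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-word scores: one strong candidate and several weak ones.
logits = [4.0, 2.0, 1.0, 0.5, 0.1]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: top prob={probs[0]:.2f}, entropy={entropy(probs):.2f}")
```

As the temperature rises, probability mass moves from the best candidate to the weak ones. That is where the extra diversity comes from, and also why unlikely (often low-quality) continuations get sampled more often.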

Diversity extends beyond breadth to depth. [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244) explores this by evolving seed instructions in both topic range and complexity.

Quality in synthetic data is multifaceted, encompassing writing style, information accuracy, and readability. Expert personas such as `You are an expert senior level Python developer with deep knowledge of numpy and pandas` can help but don't eliminate hallucinations. A common strategy is post-generation filtering: creating a large, diverse dataset, then selecting high-quality segments based on specific criteria. For code, simple filters like compilation checks work well. However, abstract quality measures require more sophisticated techniques.
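
For Python, a compilation-style check can be as small as the sketch below. It only tests whether a sample parses and never executes it. The `candidates` list is made up for illustration, and `is_valid_python` is a hypothetical helper name.

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python (syntax only, nothing is run)."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False

# Hypothetical generated samples; the second one is missing a colon.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def broken(a, b)\n    return a + b\n",
    "print('hello')\n",
]

kept = [src for src in candidates if is_valid_python(src)]
print(f"kept {len(kept)} of {len(candidates)} samples")  # kept 2 of 3 samples
```

A filter like this catches malformed generations cheaply, but it says nothing about whether the code is correct or useful, which is where the more abstract quality measures come in.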

[Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) demonstrates using a strong LLM to classify code files for educational quality, significantly improving downstream performance. This LLM filtering approach has shown success in works like [How to Train Data-Efficient LLMs](https://arxiv.org/abs/2402.09668) and [The FineWeb Datasets](https://arxiv.org/abs/2406.17557).

Balancing diversity and quality remains a key challenge in synthetic data generation for LLMs.

## Let's Play with some Data

Don't just take my word for it. Let's explore these concepts using `claudette`, a fun little library we at Answer.AI built to showcase these ideas. [`claudette`](https://github.com/AnswerDotAI/claudette) is a minimal wrapper around Anthropic's Claude API, designed to enhance its usability. One of its key features is tool calling with Claude, which we'll be ~~abusing~~ using to generate synthetic data!

Here's how it works: we define a class with the attributes we want to be generated, then pass this along with a prompt asking Claude to fill in the blanks. This approach forces the generation to adhere to the schema defined in our Python class. Let's see this in action below.

In [None]:
from fastcore.utils import *

class Translation():
    """Translation from an English phrase to a German phrase"""
    def __init__(self, english: str, german: str): store_attr()
    
    __repr__ = basic_repr(["english", "german"])
    def __str__(self): return "The translation has been created."

Translation("Hello, how are you today?", "Hallo, wie geht es Ihnen heute?")

__main__.Translation(english='Hello, how are you today?', german='Hallo, wie geht es Ihnen heute?')
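
To get a feel for what a tool-calling wrapper can read off such a class, here is a rough sketch using only the standard library. It is not how `claudette` builds its schema; `PlainTranslation` is a plain-Python stand-in for the class above and `describe_class` is a hypothetical helper.

```python
import inspect

class PlainTranslation:
    """Translation from an English phrase to a German phrase"""
    def __init__(self, english: str, german: str):
        self.english = english
        self.german = german

def describe_class(cls):
    """Collect the docstring and the annotated __init__ parameters of a class."""
    params = inspect.signature(cls.__init__).parameters
    fields = {
        name: getattr(p.annotation, "__name__", str(p.annotation))
        for name, p in params.items()
        if name != "self"
    }
    return {"name": cls.__name__, "description": inspect.getdoc(cls), "fields": fields}

print(describe_class(PlainTranslation))
```

The docstring and the typed parameters are enough to describe the shape of the data we want back, which is why a small, well-documented class works as a generation schema.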

::: {.callout-note}
Make sure you have an API key for the model you want to use and the proper environment variables set. For example, if you are using Anthropic, you need to set the `ANTHROPIC_API_KEY` environment variable.
:::

In [None]:
from claudette import *

model = models[-1] # haiku 3
sp = "You will help generate synthetic data of English and German phrases."
def wrapper(pr):
    chat = Chat(
        model,
        sp=sp,
        tools=[Translation],
        tool_choice='Translation',
    )
    chat(pr, temp=1)
    return chat.last_tool_result.content

for _ in range(5):
    translation: Translation = wrapper('Create an english and german translation pair.')
    print(translation)

__main__.Translation(english='The sun is shining brightly today.', german='Die Sonne scheint heute hell.')
__main__.Translation(english='Hello, how are you?', german='Hallo, wie geht es Ihnen?')
__main__.Translation(english='The cat is playing with the ball.', german='Die Katze spielt mit dem Ball.')
__main__.Translation(english='The cat chased the mouse.', german='Die Katze jagte die Maus.')
__main__.Translation(english='How are you today?', german='Wie geht es dir heute?')


Eindrucksvoll! The output demonstrates the model's ability to generate data in the correct format. However, these translation pairs are quite simple and lack depth. To enhance the quality of the generations, let's employ some prompt engineering by providing examples that showcase the level of quality we're aiming for.

In [None]:
examples = [
    Translation(
        english="Hello, my name is Nathan. I am a research scientist at an AI startup.",
        german="Hallo mein Name ist Nathan. Ich bin wissenschaftlicher Mitarbeiter bei einem KI-Startup."),
    Translation(
        english="How much wood could a woodchuck chuck if a woodchuck could chuck wood?",
        german="Wie viel Holz könnte ein Waldmurmeltier einspannen, wenn ein Waldmurmeltier Holz einspannen könnte?"),
    Translation(
        english="Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See.",
        german="Thomas Cranmer (2. Juli 1489 - 21. März 1556) war ein Anführer der englischen Reformation und Erzbischof von Canterbury während der Herrschaft von Heinrich VIII., Eduard VI. und für kurze Zeit auch Maria I. Er half bei der Ausarbeitung der Klage für die Aufhebung von Heinrichs Heirat mit Katharina von Aragon, die eine der Ursachen für die Trennung der englischen Kirche von der Union mit dem Heiligen Stuhl war."
    ),
]

prompt_template = """\
Create an english and german translation pair that is similar to the examples.

Here are some examples:
- {examples}
"""
prompt = prompt_template.format(examples="\n- ".join([f"{e.english} -> {e.german}" for e in examples]))
print(prompt)

for _ in range(5):
    translation: Translation = wrapper(prompt)
    print(translation)

Create an english and german translation pair that is similar to the examples.

Here are some examples:
- Hello, my name is Nathan. I am a research scientist at an AI startup. -> Hallo mein Name ist Nathan. Ich bin wissenschaftlicher Mitarbeiter bei einem KI-Startup.
- How much wood could a woodchuck chuck if a woodchuck could chuck wood? -> Wie viel Holz könnte ein Waldmurmeltier einspannen, wenn ein Waldmurmeltier Holz einspannen könnte?
- Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See. -> Thomas Cranmer (2. Juli 1489 - 21. März 1556) war ein Anführer der englischen Reformation und Erzbischof von Canterbury während der Herrschaft von Heinrich VIII., Eduard VI. und 

Interesting! We're seeing some improvement in the results, but there's a noticeable pattern: most are introductions stating how the person works at a tech company. This repetition highlights an important aspect of prompting these models: the placement of examples within the prompt can significantly impact the quality and diversity of generations.

When examples are placed at the end of the prompt, as we did above, the model often tends to repeat or closely mimic these examples in its generations. To potentially improve diversity, let's try placing the examples at the beginning of our prompt instead.

In [None]:
prompt_template = """\
Here are some examples:
- {examples}

Create an english and german translation pair that is similar to the examples.
"""
prompt = prompt_template.format(examples="\n- ".join([f"{e.english} -> {e.german}" for e in examples]))
print(prompt)

for _ in range(5):
    translation: Translation = wrapper(prompt)
    print(translation)

Here are some examples:
- Hello, my name is Nathan. I am a research scientist at an AI startup. -> Hallo mein Name ist Nathan. Ich bin wissenschaftlicher Mitarbeiter bei einem KI-Startup.
- How much wood could a woodchuck chuck if a woodchuck could chuck wood? -> Wie viel Holz könnte ein Waldmurmeltier einspannen, wenn ein Waldmurmeltier Holz einspannen könnte?
- Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See. -> Thomas Cranmer (2. Juli 1489 - 21. März 1556) war ein Anführer der englischen Reformation und Erzbischof von Canterbury während der Herrschaft von Heinrich VIII., Eduard VI. und für kurze Zeit auch Maria I. Er half bei der Ausarbeitung der Klage für die Aufh

Woah, what a difference! The results show more variety in content and structure compared to our previous attempt. However, we're still seeing some lack of diversity in the topics covered. Let's shift our focus from quality to diversity and see what happens. To accomplish this, we'll use a list of topics to guide the generations, which should help broaden the range of subjects in our translations.

In [None]:
topics = ["otters", "penguins", "sloths", "cats", "dogs"]
for topic in topics:
    prompt_template = """\
    Create an english and german translation pair about the following topic:
    {topic}
    """
    prompt = prompt_template.format(topic=topic)
    translation: Translation = wrapper(prompt)
    print(translation)

__main__.Translation(english='Otters are small semiaquatic mammals that thrive in freshwater and coastal environments. They are known for their playful behavior, webbed feet, and ability to float on their backs while eating or grooming.', german='Ottern sind kleine, semiaquatische Säugetiere, die in Süßwasser- und Küstengebieten gedeihen. Sie sind für ihr verspieltes Verhalten, ihre Schwimmhäute und ihre Fähigkeit bekannt, auf dem Rücken schwimmend zu essen oder zu putzen.')
__main__.Translation(english='Penguins are flightless birds found in the Southern Hemisphere. They have distinctive black and white plumage and short, flipper-like wings.', german='Pinguine sind flugunfähige Vögel, die in der Südhalbkugel vorkommen. Sie haben ein charakteristisches schwarzweißes Gefieder und kurze, flügelartige Flossen.')
__main__.Translation(english='Sloths are slow-moving tree-dwelling mammals found in Central and South America.', german='Faultiere sind langsam bewegende baumlaufende Säugetiere, 

Okay, nice! We're seeing increased diversity based on our list of topics. The quality remains pretty good, thanks to the powerful model we're using. However, with a smaller model, we might see a drop in quality. To potentially improve both diversity and quality simultaneously, let's try combining our examples and topics techniques.

In [None]:
for topic in topics:
    prompt_template = """\
Here are some examples:
- {examples}

Create an english and german translation pair that is similar to the examples and is about the following topic:
{topic}
    """
    prompt = prompt_template.format(examples="\n- ".join([f"{e.english} -> {e.german}" for e in examples]), topic=topic)
    translation: Translation = wrapper(prompt)
    print(translation)

__main__.Translation(english="Otters are playful and intelligent aquatic mammals. They are known for their webbed feet, sleek fur, and ability to swim gracefully in the water. Otters often live in family groups and are known to hold hands while sleeping so they don't drift apart.", german='Otter sind verspielt und intelligente Wassertiere. Sie sind bekannt für ihre Schwimmhäute, ihr glattes Fell und ihre Fähigkeit, sich im Wasser elegant zu bewegen. Otter leben oft in Familiengruppen und halten sich beim Schlafen an den Händen fest, damit sie nicht auseinandertreiben.')
__main__.Translation(english='Penguins are flightless seabirds found in the Southern Hemisphere. They have a distinctive black and white plumage and webbed feet, which makes them excellent swimmers. Penguins mainly eat fish, krill, and other small marine creatures. Some of the most well-known penguin species include the Emperor Penguin, Adelie Penguin, and Gentoo Penguin.', german='Pinguine sind flugunfähige Seevögel, d

Not too shabby, and I'd say better than the previous examples. We're seeing more detailed and varied content across the different animal topics. However, there are some interesting quirks to note:

1. The more exotic animals (otters, penguins, sloths) have descriptions reminiscent of Wikipedia articles.
2. The domesticated animals (cats and dogs) have phrases from the perspective of owners discussing their pets.

This difference in style highlights how the model interprets and applies the given examples and topics. As we discussed earlier, the order in which these examples and topics appear in the prompt can significantly impact the quality and diversity of the generations. Let's try swapping their positions to see how it affects the output.

In [None]:
for topic in topics:
    prompt_template = """\
Create an english and german translation pair that is similar to the examples and is about the following topic:
{topic}

Here are some examples:
- {examples}
    """
    prompt = prompt_template.format(examples="\n- ".join([f"{e.english} -> {e.german}" for e in examples]), topic=topic)
    translation: Translation = wrapper(prompt)
    print(translation)

__main__.Translation(english='Otters are fascinating creatures that live in rivers and lakes. They have webbed feet and a sleek, streamlined body that allows them to swim gracefully underwater. Otters are known for their playful behavior, often seen sliding down muddy banks or holding hands while floating on their backs. They are important predators that help maintain the balance of aquatic ecosystems. Otters are also very social animals, living in family groups and using a range of vocalizations to communicate. Unfortunately, many otter populations are threatened by habitat loss and pollution. Conservation efforts are underway to protect these beloved marine mammals.', german='Ottern sind faszinierende Lebewesen, die in Flüssen und Seen leben. Sie haben Schwimmhäute und einen schlanken, stromlinienförmigen Körper, der es ihnen ermöglicht, graziös unter Wasser zu schwimmen. Ottern sind für ihr verspieltes Verhalten bekannt, oft sieht man sie an schlammigen Ufern hinunterrutschen oder s

Swapping the order of topic and examples in our prompt yielded interesting results:

1. The otter description became much more detailed, while others remained brief.
2. Descriptions for cats and dogs now focus on general characteristics rather than personal anecdotes.
3. There's significant variation in length and detail across translations.
4. Emotional language is more prevalent in descriptions of familiar animals.

These changes highlight the sensitivity of large language models to prompt structure. The order of elements in a prompt can significantly influence the content, style, and length of generated text, underscoring the importance of careful prompt engineering for consistent, high-quality synthetic data. Let's see what we get when using a stronger model, i.e., Claude 3.5 Sonnet.

In [None]:
translations = []
for topic in topics:
    prompt_template = """\
Here are some examples:
- {examples}

Create an english and german translation pair that is similar to the examples and is about the following topic:
{topic}
    """
    prompt = prompt_template.format(examples="\n- ".join([f"{e.english} -> {e.german}" for e in examples]), topic=topic)
    translation: Translation = wrapper(prompt)
    print(translation)
    translations.append(translation)

__main__.Translation(english='Otters are semiaquatic mammals that belong to the family Mustelidae. They have a streamlined body and webbed feet, which make them excellent swimmers. Otters are found on every continent except Australia and Antarctica, and they thrive in both freshwater and saltwater environments. They are known for their playful behavior, their dexterity with their paws, and their ability to use tools. Otters play an important role in the ecosystem, as they help to control the populations of fish, crustaceans, and other aquatic species.', german='Otter sind semiaquatische Säugetiere, die zur Familie der Marder gehören. Sie haben einen stromlinienförmigen Körper und Schwimmhäute an den Füßen, die sie zu hervorragenden Schwimmern machen. Otter kommen auf jedem Kontinent außer Australien und der Antarktis vor und gedeihen sowohl in Süß- als auch in Salzwasserumgebungen. Sie sind bekannt für ihr verspieltes Verhalten, ihre Geschicklichkeit mit den Pfoten und ihre Fähigkeit, 

The Haiku model already produces good quality generations, but Claude 3.5 Sonnet shows some improvements in detail and length. However, we can still encounter issues with hallucinations and overall quality.

To address these concerns, let's implement a post-processing step using another LLM to evaluate and clean up the generations. We'll follow the additive 5-point scoring system that proved most effective in the paper [The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale](https://arxiv.org/abs/2406.17557).

Our approach will involve:

1. Having the model generate a written critique of the translation.
2. Using this critique to score the generation.

This method should help us more accurately evaluate the quality of our synthetic data. Fortunately, because LLMs generate their responses autoregressively, this is straightforward: if the critique field comes before the score field in our Python class, the model writes the critique first, and the score is then conditioned on it.

Let's implement this evaluation system and see how it improves our synthetic data quality.

In [None]:
class TranslationCritique():
    """
    A critique of the translation.
    """
    def __init__(
        self,
        critique: str, # A critique of the translation.
        score: int # A score of the translation from 1 to 5. 
    ): store_attr()
        
    __repr__ = basic_repr(['critique', 'score'])

sp = "You will help critique synthetic data of English and German phrases."
def wrapper(pr):
    chat = Chat(
        model,
        sp=sp,
        tools=[TranslationCritique],
        tool_choice='TranslationCritique',
    )
    chat(pr, temp=1)
    return chat.last_tool_result.content

prompt_template = """\
Below is an extract of a translation. Evaluate its quality as a senior translator would, considering its suitability for professional use. Use the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the translation conveys the basic meaning of the source text, even if it includes some minor errors or awkward phrasing.
- Add another point if the translation is generally accurate but lacks refinement in style or fails to capture some nuances of the original. It might use inconsistent terminology or have occasional lapses in register.
- Award a third point if the translation is appropriate for professional use and accurately conveys key concepts of the source text. It demonstrates good understanding of both languages, though it may not be flawless or could include some slight inconsistencies. It resembles the work of a competent translator but may have room for improvement in fluency or precision.
- Grant a fourth point if the translation is highly accurate and reads naturally in the target language, exhibiting a consistent and appropriate style. It could be similar to the work of an experienced translator, offering faithful rendering of content and tone, with minimal errors, and effectively handling complex concepts or cultural references. The result is coherent, well-expressed, and valuable for its intended purpose.
- Bestow a fifth point if the translation is outstanding, demonstrating mastery of both source and target languages. It captures subtle nuances, maintains the author's voice and intent, and reads as if it were originally written in the target language. The translator has made excellent choices in dealing with challenging elements like wordplay, idiomatic expressions, or culture-specific content.

The translation extract: {translation}

After examining the translation:

- Briefly justify your total score, up to 100 words.
- Conclude with the score of the translation.
"""

In [None]:
for translation in translations:
    prompt = prompt_template.format(translation=translation)
    critique: TranslationCritique = wrapper(prompt)
    print(translation)
    print("Critique:", critique.critique)
    print("Score:", critique.score)

__main__.Translation(english='Otters are semiaquatic mammals that belong to the family Mustelidae. They have a streamlined body and webbed feet, which make them excellent swimmers. Otters are found on every continent except Australia and Antarctica, and they thrive in both freshwater and saltwater environments. They are known for their playful behavior, their dexterity with their paws, and their ability to use tools. Otters play an important role in the ecosystem, as they help to control the populations of fish, crustaceans, and other aquatic species.', german='Otter sind semiaquatische Säugetiere, die zur Familie der Marder gehören. Sie haben einen stromlinienförmigen Körper und Schwimmhäute an den Füßen, die sie zu hervorragenden Schwimmern machen. Otter kommen auf jedem Kontinent außer Australien und der Antarktis vor und gedeihen sowohl in Süß- als auch in Salzwasserumgebungen. Sie sind bekannt für ihr verspieltes Verhalten, ihre Geschicklichkeit mit den Pfoten und ihre Fähigkeit, 

As we can see, the model successfully generates a critique and score for each translation. Interestingly, the model seems impressed with the translations, consistently scoring them at 4 out of 5.

Let's see what happens when we give it some really bad translations.

In [None]:
bad_translation = Translation(
    english="The city council meeting on climate change initiatives was contentious, with passionate arguments from both sides. Ultimately, the proposal for increased funding for renewable energy projects was approved by a narrow margin.",
    german="Die Stadt Rat Treffen auf Klima Änderung Initiativen war streitsüchtig, mit passioniert Argumente von beide Seiten. Ultimativ, die Proposal für gesteigert Geld für erneuerbar Energie Projekten war approved bei ein eng Margin."
)

prompt = prompt_template.format(translation=bad_translation)
critique: TranslationCritique = wrapper(prompt)
print(bad_translation)
print("Critique:", critique.critique)
print("Score:", critique.score)

__main__.Translation(english='The city council meeting on climate change initiatives was contentious, with passionate arguments from both sides. Ultimately, the proposal for increased funding for renewable energy projects was approved by a narrow margin.', german='Die Stadt Rat Treffen auf Klima Änderung Initiativen war streitsüchtig, mit passioniert Argumente von beide Seiten. Ultimativ, die Proposal für gesteigert Geld für erneuerbar Energie Projekten war approved bei ein eng Margin.')
Critique: The translation conveys the basic meaning of the source text, but there are several issues that impact its quality and suitability for professional use:

- The phrasing is often awkward and unnatural, with word-for-word translations that do not flow well in German. Examples include "Stadt Rat Treffen", "Klima Änderung Initiativen", and "gesteigert Geld".
- The terminology is inconsistent, using different renderings for "proposal" ("Proposal") and "approved" ("approved").
- There are some outr

Wunderbar! We've now developed a sophisticated method for generating high-quality data in large quantities. Let's put our newfound knowledge to the test by training a coding language model, a subject close to my heart.

To inject diversity into our synthetic programs, we'll use a list of personas for the model to generate from. Specifically, we'll utilize the PersonaHub dataset from the paper [Scaling Synthetic Data Creation with 1,000,000,000 Personas](https://arxiv.org/abs/2406.20094v1), whose released subset contains approximately 200,000 personas.

Below are the key components we'll use to generate the programs. The full pipeline also uses multiprocessing to speed up generation, which makes the code a bit more complex, so it is omitted here. For the complete code, please refer to the [tiny_programs repository](https://github.com/AnswerDotAI/tiny_programs).

Let's dive into the essential bits of our program generation process:

In [None]:
class TinyProgram():
    """
    A program that satisfies the requirements.
    """
    def __init__(
        self,
        requirements: str, # A description of the requirements for the program to help the persona.
        code: str # The code that satisfies the requirements. Ensure it is well written and documented.
    ): store_attr()
    
    __repr__ = basic_repr(['requirements', 'code'])

example = TinyProgram(
    requirements="A Python-based data aggregation and analysis tool that scrapes key Salvadoran news websites and government portals for the latest political updates, election results, and policy changes. The program would use standard libraries like requests for web scraping, re for text parsing, and pandas for data manipulation. It would store the collected information in a structured format, perform basic sentiment analysis on news articles, and generate a daily summary report highlighting significant political events, trending topics, and shifts in public opinion. The tool could also track mentions of key political figures and parties, providing a quick overview of their media presence and associated sentiments.",
    code="""\
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from textblob import TextBlob
from collections import Counter
import datetime

def scrape_news(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    articles = soup.find_all('article', class_='article-item')
    
    news_data = []
    for article in articles:
        title = article.find('h2', class_='article-title').text.strip()
        summary = article.find('p', class_='article-summary').text.strip()
        news_data.append({'title': title, 'summary': summary})
    
    return news_data

def analyze_sentiment(text):
    return TextBlob(text).sentiment.polarity

def generate_report(data):
    df = pd.DataFrame(data)
    df['sentiment'] = df['summary'].apply(analyze_sentiment)
    
    # Calculate average sentiment
    avg_sentiment = df['sentiment'].mean()
    
    # Find most mentioned words
    all_words = ' '.join(df['title'] + ' ' + df['summary']).lower().split()
    word_freq = Counter(word for word in all_words if len(word) > 3)
    top_words = word_freq.most_common(5)
    
    # Generate report
    report = f"Daily Political Analysis Report for El Salvador - {datetime.date.today()}\n\n"
    report += f"Number of articles analyzed: {len(df)}\n"
    report += f"Average sentiment: {'Positive' if avg_sentiment > 0 else 'Negative'} ({avg_sentiment:.2f})\n\n"
    report += "Top mentioned words:\n"
    for word, count in top_words:
        report += f"- {word}: {count} times\n"
    
    report += "\nMost positive article:\n"
    pos_article = df.loc[df['sentiment'].idxmax()]
    report += f"Title: {pos_article['title']}\nSentiment: {pos_article['sentiment']:.2f}\n\n"
    
    report += "Most negative article:\n"
    neg_article = df.loc[df['sentiment'].idxmin()]
    report += f"Title: {neg_article['title']}\nSentiment: {neg_article['sentiment']:.2f}\n"
    
    return report

def main():
    url = "https://www.elsalvador.com/noticias/nacional/"  # Example Salvadoran news website
    news_data = scrape_news(url)
    report = generate_report(news_data)
    print(report)
    
    # Optionally, save the report to a file
    with open(f"el_salvador_political_report_{datetime.date.today()}.txt", "w") as f:
        f.write(report)

if __name__ == "__main__":
    main()
```
"""
)

prompt_template = """\
Here are some examples:
{examples}

Create requirements and the python program that satisfies them for the following persona: {persona}
"""
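
As a rough illustration of how this template might be filled in, the sketch below builds one prompt per persona. The personas and the tiny example are made up, and `format_example` is a hypothetical helper whose layout is an assumption rather than the one used in the repository.

```python
# Same wording as the prompt_template above, repeated so the sketch runs on its own.
persona_prompt_template = """\
Here are some examples:
{examples}

Create requirements and the python program that satisfies them for the following persona: {persona}
"""

def format_example(requirements: str, code: str) -> str:
    """Render one example as text (assumed layout, for illustration only)."""
    return f"Requirements: {requirements}\nCode:\n{code}"

# Made-up personas; the real ones would come from PersonaHub.
personas = [
    "A high school chemistry teacher who builds quizzes for students",
    "A marathon runner tracking weekly training mileage",
]

# A tiny stand-in example; the post uses the much longer TinyProgram example above.
examples_text = format_example(
    "A script that greets the user by name.",
    "name = input('Name: ')\nprint(f'Hello, {name}!')",
)

prompts = [
    persona_prompt_template.format(examples=examples_text, persona=p) for p in personas
]
print(prompts[0])
```

Each prompt is independent of the others, so a list like this is straightforward to hand to a pool of workers.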

To evaluate the quality of the code, we will be using the following prompt:

In [None]:
critique_template = """\
Below is a code snippet. Evaluate its educational value for teaching programming to beginners in this language, using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the code is syntactically correct and runs without errors, providing a basic example of working code in the language.
- Add another point if the code demonstrates fundamental programming concepts (e.g., variables, control structures, functions) in a straightforward manner, even if it's not optimized or doesn't follow all best practices.
- Award a third point if the code is well-commented, explaining key concepts and the purpose of different code sections. It should be readable and illustrate good naming conventions, making it easier for beginners to understand.
- Grant a fourth point if the code showcases language-specific features or common programming patterns in an accessible way. It should provide clear examples of how to apply these concepts practically.
- Bestow a fifth point if the code is an exemplary teaching tool, striking an excellent balance between simplicity and real-world applicability. It should inspire further learning, possibly including deliberate mistakes or opportunities for improvement that a teacher could use as discussion points.

The code snippet:
```python
{code}
```

After examining the code:

- Briefly justify your total score, up to 100 words, focusing on its effectiveness as a teaching tool for beginners.
- Conclude with the score.
"""

I generated 1,000 programs with this method; 992 of them were generated and scored successfully. Here's the distribution of scores:

| Score | Count |
|-------|-------|
| 1     | 25    |
| 2     | 117   |
| 3     | 96    |
| 4     | 256   |
| 5     | 498   |

For comparison, here's the distribution of quality scores for 10,000 random programs from GitHub:

| Score | Count |
|-------|-------|
| 1     | 2239  |
| 2     | 5230  |
| 3     | 1545  |
| 4     | 618   |
| 5     | 236   |
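To make the two distributions easier to compare, the sketch below computes the total count, the mean score, and the share of programs scored 4 or 5 directly from the tables above. Note that the GitHub counts in the table sum to 9,868 rather than 10,000; the sketch simply uses the tabulated counts.

```python
# Score counts copied from the two tables above.
synthetic = {1: 25, 2: 117, 3: 96, 4: 256, 5: 498}
github = {1: 2239, 2: 5230, 3: 1545, 4: 618, 5: 236}

def summarize(counts):
    """Return (total programs, mean score, share of programs scored 4 or 5)."""
    total = sum(counts.values())
    mean = sum(score * n for score, n in counts.items()) / total
    high_share = (counts[4] + counts[5]) / total
    return total, mean, high_share

for name, counts in [("synthetic", synthetic), ("github", github)]:
    total, mean, high_share = summarize(counts)
    print(f"{name}: n={total}, mean score={mean:.2f}, scored 4-5: {high_share:.1%}")
```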

Now, let's compare the performance of models trained on these programs. We'll use the SmolLM-360M model from Hugging Face as the baseline and test the following configurations (Setups 1 to 5 are each trained for 5 epochs on 754 to 992 programs):

0. Baseline model
1. Train on 992 synthetic programs
2. Train on 992 random GitHub programs
3. Train on a mixture of 496 scored 4 and 5 synthetic programs and 496 random GitHub programs
4. Train on 754 synthetic programs scored 4 and 5
5. Train on 754 GitHub programs scored 4 and 5
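The training subsets above can be assembled with a simple filter-and-sample step. The following is a rough sketch on toy data, not the code used for the experiment: `select_programs`, the record layout, and the fixed seed are all assumptions made for illustration.

```python
import random

def select_programs(programs, n, min_score=None, seed=0):
    """Sample n programs, optionally keeping only those with score >= min_score."""
    pool = [p for p in programs if min_score is None or p["score"] >= min_score]
    if n > len(pool):
        raise ValueError(f"requested {n} programs but only {len(pool)} qualify")
    return random.Random(seed).sample(pool, n)

# Toy stand-ins for the scored synthetic and GitHub programs (hypothetical data).
synthetic = [{"source": "synthetic", "score": s} for s in [5, 4, 3, 5, 2, 4, 5, 1]]
github = [{"source": "github", "score": s} for s in [2, 1, 2, 3, 4, 2, 1, 5]]

# Setup 3 analogue: half high-scoring synthetic, half random GitHub.
setup_3 = select_programs(synthetic, 2, min_score=4) + select_programs(github, 2)
# Setup 4 and 5 analogues: only programs scored 4 or 5.
setup_4 = select_programs(synthetic, 3, min_score=4)
setup_5 = select_programs(github, 2, min_score=4)

print(len(setup_3), len(setup_4), len(setup_5))
```

Sampling with a seeded `random.Random` instance keeps the subsets reproducible across runs without touching the global random state.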

We'll evaluate these models using the HumanEval benchmark, a collection of 164 programming problems that tests a coding LLM's ability to generate functionally correct code. Here are the results:

| Setup   | pass@1 |
|---------|--------|
| Setup 0 | 11.6%  |
| Setup 1 | 9.1%   |
| Setup 2 | 11.0%  |
| Setup 3 | 9.8%   |
| Setup 4 | 12.2%  |
| Setup 5 | 8.5%   |
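To put these percentages in perspective, the sketch below converts each pass@1 value into an approximate number of solved problems out of 164. It assumes pass@1 here is simply the fraction of problems solved with one completion per problem, which the post does not state explicitly.

```python
N_PROBLEMS = 164  # size of HumanEval

# pass@1 percentages copied from the table above.
pass_at_1 = {
    "Setup 0": 11.6,
    "Setup 1": 9.1,
    "Setup 2": 11.0,
    "Setup 3": 9.8,
    "Setup 4": 12.2,
    "Setup 5": 8.5,
}

# Approximate number of problems solved, assuming one sample per problem.
solved = {name: round(pct / 100 * N_PROBLEMS) for name, pct in pass_at_1.items()}
baseline = solved["Setup 0"]

for name, count in solved.items():
    print(f"{name}: ~{count} solved ({count - baseline:+d} vs baseline)")
```

Under that assumption, the gaps between setups correspond to a handful of problems each, which is worth keeping in mind when reading the findings.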

### Key findings from the experiment:

1. Synthetic data outperforms GitHub data only after quality filtering: high-quality synthetic programs beat high-quality GitHub programs (Setup 4 vs 5, 12.2% vs 8.5%), while unfiltered synthetic programs trail random GitHub programs (Setup 1 vs 2, 9.1% vs 11.0%).
2. Only high-quality synthetic data (Setup 4) improves on the baseline, and only slightly (12.2% vs 11.6%).
3. All other setups degrade performance, with Setup 5 (high-quality GitHub programs) showing the largest drop.
4. The unexpected poor performance of high-quality GitHub programs (Setup 5) warrants further investigation. Possible explanations include:
   - The scoring system may not be as effective for GitHub programs as for synthetic ones.
   - There might be a lack of diversity in the GitHub programs.

For further exploration, I encourage you to:
1. Replicate this experiment with your own task.
2. Experiment with larger datasets to see how dataset size affects model performance.
3. Share your findings with the community and reach out if you need assistance!

## Key Takeaways

After exploring synthetic data generation and its applications, here are the crucial points to remember:

1. Quality and diversity are both critical in synthetic data:
   These factors can significantly impact the performance of models trained on this data. Balancing both is essential for creating effective synthetic datasets.

2. Achieving quality is more challenging than achieving diversity:
   Quality is multidimensional, especially for free-form content, making it harder to consistently achieve high standards across all aspects of the generated data.

3. Synthetic data is a valuable tool for data-scarce scenarios:
   When you lack sufficient data for your task, synthetic data offers a cost-effective and rapid solution. When properly generated, it can notably enhance performance on your specific task.

Remember, synthetic data is not a one-size-fits-all solution, but a powerful tool in your AI development toolkit. Its effectiveness depends on careful implementation and consideration of your specific use case.

For those interested in exploring further, you can find all the code used in this post in our minimal synthetic data repository: [fastdata](https://github.com/AnswerDotAI/fastdata).

## Further Exploration and Resources

The field of synthetic data generation for AI is rich with fascinating research. Here are some key papers and resources that provide deeper insights:

### Pioneering Work

[Evolution through Large Models](https://arxiv.org/abs/2206.08896) by Lehman et al. tackles generating walker robots in the Sodarace domain using Python programs unseen by the model. They employ a synthetic data approach, using a coding LLM to mutate existing programs in a genetic programming style, then fine-tuning the LLM on these programs to synthesize more. This iterative process improves the model's ability to generate walker robots.

### Quality-Diversity in AI

[Quality-Diversity through AI Feedback](https://arxiv.org/abs/2310.13032) by Bradley et al. explores the connection between quality-diversity, artificial life, and synthetic data for LLMs. It introduces Quality-Diversity (QD) algorithms, pioneered by [Joel Lehman and Kenneth O. Stanley](https://quality-diversity.github.io/), which aim to find diverse, high-quality solutions rather than a single optimal one. The paper demonstrates how QD can be applied to tasks like creative writing using LLM feedback.

### Additional Resources

For those interested in diving deeper, here are some recommended readings:

- [The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text](https://arxiv.org/abs/2311.09807)
- [How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse](https://arxiv.org/abs/2406.17557)
- [The Curse of Recursion: Training on Generated Data Makes Models Forget](https://arxiv.org/abs/2305.17493)
- [Best Practices and Lessons Learned on Synthetic Data for Language Models](https://arxiv.org/abs/2404.07503)
- [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
- [🍷 FineWeb: decanting the web for the finest text data at scale](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)
- [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644)
- [Nemotron-4 340B Technical Report](https://arxiv.org/abs/2406.11704)
- [Scaling Synthetic Data Creation with 1,000,000,000 Personas](https://arxiv.org/abs/2406.20094v1)
- [Quality-Diversity optimisation algorithms](https://quality-diversity.github.io/)
- [ICML 2019 Tutorial: Recent Advances in Population-Based Search for Deep Neural Networks](https://youtu.be/g6HiuEnbwJE?si=kdnEOFsrvwAyqei9)
- [How to Train Data-Efficient LLMs](https://arxiv.org/abs/2402.09668)

These resources offer a comprehensive overview of the current state and future directions in synthetic data generation and its applications in AI.