<a href="https://colab.research.google.com/github/DesmondChoy/llm_tutorials/blob/main/Synthetic_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import

Tutorial from [Answer AI](https://www.answer.ai/posts/2024-10-15-how-to-synthesize-data.html)

In [27]:
!pip install claudette fastcore python-fastdata -qq

  Preparing metadata (setup.py) ... done
  Building wheel for ratelimit (setup.py) ... done


In [28]:
import os
from claudette import *
from fastcore.utils import *
from fastdata.core import *

from google.colab import userdata
from IPython.display import Markdown


os.environ['ANTHROPIC_API_KEY'] = userdata.get('ANTHROPIC_API_KEY')


# Using Haiku

In [3]:
model = models[-1] # Haiku 3 when this was written; the order of `models` can change between claudette versions
sp = "You will help generate synthetic data of English and Chinese phrases."
cli = Client(model)

In [4]:
class Translation():
    "Translation from an English phrase to a Chinese phrase"
    def __init__(self, english: str, chinese: str): store_attr()
    def __repr__(self): return f"{self.english} ➡ *{self.chinese}*"

# Translation("Hello, how are you today?", "您好，您今天好吗?")

In [5]:
def synthesize(pr): return cli.structured(pr, sp=sp, temp=1, tools=Translation)[0]

prompt = 'Create an English and Chinese translation pair.'
translations = [synthesize(prompt) for _ in range(10)]

In [6]:
clps_fmt = '- {s}\n\n<details>\n<summary> Click to show the rest </summary>\n{ls}\n</details>'
def to_md(ss, collapsible=False):
    ls = '\n'.join(f'- {s}' for s in ss)
    return clps_fmt.format(s=str(ss[0]), ls=ls.replace(f'- {ss[0]}', '')) if collapsible else ls
def show(ss, collapsible=False): return Markdown(to_md(ss, collapsible=collapsible))
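
A note on `to_md`: in collapsible mode it removes the first item from the hidden list with `str.replace`, which strips every occurrence of that text. If a later item is identical to the first one (quite possible with synthetic data), it disappears from the hidden list too. Below is a minimal sketch of a variant that slices the list instead. The names `CLPS_FMT` and `to_md_sliced` are hypothetical and not part of the original tutorial.

```python
# Same layout as `clps_fmt`, with an explicit blank line after the summary
# so the hidden list still renders as Markdown.
CLPS_FMT = '- {s}\n\n<details>\n<summary> Click to show the rest </summary>\n\n{ls}\n</details>'

def to_md_sliced(ss, collapsible=False):
    "Like `to_md`, but builds the hidden part from `ss[1:]` rather than string replacement."
    if not collapsible:
        return '\n'.join(f'- {s}' for s in ss)
    rest = '\n'.join(f'- {s}' for s in ss[1:])  # everything except the first item
    return CLPS_FMT.format(s=str(ss[0]), ls=rest)

# A duplicate of the first item survives in the collapsed section.
print(to_md_sliced(["a", "a", "b"], collapsible=True))
```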

In [7]:
show(translations)

- Good morning ➡ *早上好*
- Hello, how are you today? ➡ *你今天好吗？*
- Hello, how are you today? ➡ *你好,你今天好吗?*
- How are you today? ➡ *你今天好吗?*
- The sunny day was beautiful. ➡ *阳光明媚的日子真美好。*
- The sun is shining brightly today. ➡ *今天太阳正在灿烂地照耀。*
- The cat is sleeping on the bed. ➡ *猫咪正在床上睡觉。*
- The sun is shining brightly today. ➡ *今天阳光明媚。*
- The sky is blue. ➡ *天空是蓝色的。*
- The sun is shining brightly today. ➡ *今天阳光灿烂。*
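
Several English phrases repeat in the sample above. A rough way to put a number on that is to count exact duplicates. This is only a sketch: `duplicate_report` is a hypothetical helper, the phrases are copied by hand from the output above, and exact matching after lowercasing is a crude proxy for diversity.

```python
from collections import Counter

def duplicate_report(phrases):
    "Return (unique_ratio, repeated), where `repeated` maps each repeated phrase to its count."
    counts = Counter(p.strip().lower() for p in phrases)
    repeated = {p: n for p, n in counts.items() if n > 1}
    unique_ratio = len(counts) / len(phrases) if phrases else 0.0
    return unique_ratio, repeated

# English side of the ten pairs shown above.
english = [
    "Good morning",
    "Hello, how are you today?",
    "Hello, how are you today?",
    "How are you today?",
    "The sunny day was beautiful.",
    "The sun is shining brightly today.",
    "The cat is sleeping on the bed.",
    "The sun is shining brightly today.",
    "The sky is blue.",
    "The sun is shining brightly today.",
]

ratio, repeated = duplicate_report(english)
print(ratio)     # 0.7
print(repeated)  # {'hello, how are you today?': 2, 'the sun is shining brightly today.': 3}
```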

In [8]:
show(translations, collapsible=True)

- Good morning ➡ *早上好*

<details>
<summary> Click to show the rest </summary>

- Hello, how are you today? ➡ *你今天好吗？*
- Hello, how are you today? ➡ *你好,你今天好吗?*
- How are you today? ➡ *你今天好吗?*
- The sunny day was beautiful. ➡ *阳光明媚的日子真美好。*
- The sun is shining brightly today. ➡ *今天太阳正在灿烂地照耀。*
- The cat is sleeping on the bed. ➡ *猫咪正在床上睡觉。*
- The sun is shining brightly today. ➡ *今天阳光明媚。*
- The sky is blue. ➡ *天空是蓝色的。*
- The sun is shining brightly today. ➡ *今天阳光灿烂。*
</details>

In [9]:
examples = [
    Translation(
        english="Hello, my name is Nathan. I am a research scientist at an AI startup.",
        chinese="你好，我叫Nathan。我是一家人工智能初创公司的研究科学家。."),
    Translation(
        english="How much wood could a woodchuck chuck if a woodchuck could chuck wood?",
        chinese="如果土拨鼠能丢木头的话，一只土拨鼠能丢多少木头？"),
    Translation(
        english="Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See.",
        chinese="托马斯·克兰默（1489年7月2日至1556年3月21日）是英国宗教改革的领袖，在亨利八世、爱德华六世以及短暂的玛丽一世统治期间担任坎特伯雷大主教。他帮助建立了亨利与阿拉贡的凯瑟琳解除婚姻关系的理据，这是英国教会脱离罗马教廷的原因之一。"
    ),
]

In [10]:
examples_md = to_md(examples)
Markdown(examples_md)

- Hello, my name is Nathan. I am a research scientist at an AI startup. ➡ *你好，我叫Nathan。我是一家人工智能初创公司的研究科学家。.*
- How much wood could a woodchuck chuck if a woodchuck could chuck wood? ➡ *如果土拨鼠能丢木头的话，一只土拨鼠能丢多少木头？*
- Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See. ➡ *托马斯·克兰默（1489年7月2日至1556年3月21日）是英国宗教改革的领袖，在亨利八世、爱德华六世以及短暂的玛丽一世统治期间担任坎特伯雷大主教。他帮助建立了亨利与阿拉贡的凯瑟琳解除婚姻关系的理据，这是英国教会脱离罗马教廷的原因之一。*

In [11]:
prompt_template = """\
Create an English and Chinese translation pair that is similar to the examples.

<examples>
{examples}
</examples>"""
prompt = prompt_template.format(examples=examples_md)
print(prompt)

Create an English and Chinese translation pair that is similar to the examples.

<examples>
- Hello, my name is Nathan. I am a research scientist at an AI startup. ➡ *你好，我叫Nathan。我是一家人工智能初创公司的研究科学家。.*
- How much wood could a woodchuck chuck if a woodchuck could chuck wood? ➡ *如果土拨鼠能丢木头的话，一只土拨鼠能丢多少木头？*
- Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See. ➡ *托马斯·克兰默（1489年7月2日至1556年3月21日）是英国宗教改革的领袖，在亨利八世、爱德华六世以及短暂的玛丽一世统治期间担任坎特伯雷大主教。他帮助建立了亨利与阿拉贡的凯瑟琳解除婚姻关系的理据，这是英国教会脱离罗马教廷的原因之一。*
</examples>


In [12]:
show([synthesize(prompt) for _ in range(5)])


- The sun rises in the east and sets in the west. ➡ *太阳从东边升起,落在西边。*
- Let's go to the park and have a picnic under the big oak tree. ➡ *让我们去公园,在那棵大橡树下野餐吧。*
- This is a sunny day. The weather is perfect for a picnic outside. ➡ *这是个阳光明媚的日子。天气非常适合在户外野餐。*
- The dog jumped over the fence and ran down the street. ➡ *狗跳过栅栏,沿着街道跑下去了.*
- The new employee at the AI startup company is a research scientist named Nathan. ➡ *新入职的人工智能创业公司员工是一名名叫Nathan的研究科学家。*

Try putting the examples first in the prompt:

In [13]:
prompt_template = """\
<examples>
{examples}
</examples>

Create an English and Chinese translation pair that is similar to the examples."""
prompt = prompt_template.format(examples=examples_md)
print(prompt)

<examples>
- Hello, my name is Nathan. I am a research scientist at an AI startup. ➡ *你好，我叫Nathan。我是一家人工智能初创公司的研究科学家。.*
- How much wood could a woodchuck chuck if a woodchuck could chuck wood? ➡ *如果土拨鼠能丢木头的话，一只土拨鼠能丢多少木头？*
- Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See. ➡ *托马斯·克兰默（1489年7月2日至1556年3月21日）是英国宗教改革的领袖，在亨利八世、爱德华六世以及短暂的玛丽一世统治期间担任坎特伯雷大主教。他帮助建立了亨利与阿拉贡的凯瑟琳解除婚姻关系的理据，这是英国教会脱离罗马教廷的原因之一。*
</examples>

Create an English and Chinese translation pair that is similar to the examples.


In [14]:
show([synthesize(prompt) for _ in range(5)])


- The rapid development of AI has transformed many industries, from healthcare to transportation. But this technological progress also raises important ethical questions that we must grapple with as a society. ➡ *人工智能的快速发展已经改变了从医疗到交通等许多行业。但是这种技术进步也提出了重要的伦理问题,我们必须作为一个社会一起解决。*
- The weather is beautiful today. I am going to the park for a picnic with my family. ➡ *今天天气很好。我要和家人一起去公园野餐。*
- The weather today is quite pleasant. It's sunny with a light breeze. ➡ *今天天气非常舒适。阳光明媚,微风习习。*
- I am an engineer working at a small tech company. We are developing new AI algorithms to improve recommendation systems. ➡ *我是一名在一家小型科技公司工作的工程师。我们正在开发新的人工智能算法以提高推荐系统的性能。*
- The weather is quite pleasant today. I'm looking forward to going for a walk in the park later. ➡ *今天天气非常好。我后来期待在公园里散步。*

The output is better: the pairs are now longer, multi-sentence passages rather than single short sentences.

Add more diversity by including different topics:

In [15]:
topics = ["otters", "penguins", "sloths", "cats", "dogs"]
prompt_template = """\
Create an English and Chinese translation pair about the following topic:
<topic>{topic}</topic>"""
print(prompt_template.format(topic=topics[0]))


Create an English and Chinese translation pair about the following topic:
<topic>otters</topic>


In [16]:
show([synthesize(prompt_template.format(topic=topic)) for topic in topics])


- Otters are small semiaquatic mammals that live in freshwater environments. They have webbed feet and dense fur to help them swim and stay warm in the water. Otters are playful, social animals and are known for holding hands while they sleep to stay together. ➡ *水獭是生活在淡水环境中的小型半水生哺乳动物。它们有蹼状的脚和厚实的毛皮,可以帮助它们在水中游泳并保持温暖。水獭是好动且社交的动物,它们睡觉时会牵着手以保持在一起。*
- Penguins are flightless seabirds that live in the southern hemisphere. ➡ *企鹅是生活在南半球的不会飞的海鸟。*
- Sloths are slow-moving arboreal mammals that live in the tropical rainforests of Central and South America. ➡ *树懒是生活在中美洲和南美洲热带雨林中的缓慢移动的树栖哺乳动物。*
- The curious cat watched the bird fly by. ➡ *好奇的猫咪观察着飞过的鸟儿。*
- Dogs are loyal and friendly animals that make great companions. ➡ *狗狗是忠诚友善的动物,是很好的伙伴。*

The outputs are now diverse, but the last two are lacking in quality.

Now try adding the examples at the start of the prompt:

In [17]:
prompt_template = """\
<examples>
{examples}
</examples>

Create an English and Chinese translation pair that is similar to the examples and is about the following topic:
<topic>{topic}</topic>"""

translations = [synthesize(prompt_template.format(examples=examples_md, topic=topic))
                for topic in topics]
show(translations, collapsible=True)

- Otters are small, semiaquatic mammals that belong to the Lutrinae subfamily. They have a thick, waterproof fur coat and webbed feet, which make them excellent swimmers. Otters can be found in various habitats, including rivers, lakes, and coastal areas, where they feed on fish, crustaceans, and other aquatic prey. They are known for their playful and social behavior, often seen frolicking in the water and holding paws while sleeping to stay together. ➡ *水獭是一种小型半水生哺乳动物,属于水獭亚科。它们拥有厚实而防水的毛皮和蹼状的脚,这使它们成为出色的游泳者。水獭可以在河流、湖泊和沿海地区等各种环境中找到,它们以鱼类、甲壳类和其他水生猎物为食。它们以其嬉戏和社交的行为而闻名,常常在水中嬉戏,睡觉时互相拉着爪子保持在一起。*

<details>
<summary> Click to show the rest </summary>

- Penguins are flightless seabirds that live in the southern hemisphere. They have black and white feathers and are known for their unique and charming behavior. Penguins are excellent swimmers and divers, and they use their wings as flippers to propel themselves through the water. ➡ *企鹅是生活在南半球的不会飞的海鸟。它们有黑白相间的羽毛,以独特而可爱的行为著称。企鹅是出色的游泳和潜水者,它们使用翅膀作为鳍来推动自己在水中移动。*
- Sloths are slow-moving mammals that live in the tropical forests of Central and South America. They have long limbs, sharp claws, and spend most of their time hanging upside down from tree branches, sleeping or feeding on leaves and fruit. ➡ *树懒是生活在中美洲和南美洲热带森林中的缓慢移动的哺乳动物。它们有着又长又尖的爪子,大部分时间都是倒挂在树枝上睡觉或进食树叶和水果。*
- Cats are beloved household pets that provide companionship and joy to many people around the world. They are known for their independent and playful nature, often amusing their owners with their quirky behaviors. With their soft fur, captivating eyes, and agile movements, cats have captured the hearts of countless individuals. ➡ *猫是深受人们喜爱的家庭宠物,为许多人带来了陪伴和欢乐。它们以独立和多动的天性而著称,常常用它们古怪的行为逗乐主人。凭借柔软的皮毛、迷人的眼睛和灵敏的动作,猫咪俘获了无数人的心。*
- Dogs are loyal and affectionate companions. They come in many breeds, sizes, and personalities. Some are energetic and playful, while others are more calm and relaxed. No matter the breed, dogs make wonderful pets that bring joy and happiness to their owners. ➡ *狗是忠诚而亲密的伙伴。它们有着各种品种、大小和性格。有些活泼好动,有些则更加安静冷静。不论品种如何,狗都是美好的宠物,给主人带来欢乐和快乐。*
</details>

# Using Sonnet 3.5

In [18]:
sp = "You will help generate synthetic data of English and Chinese phrases."
cli = Client('claude-3-5-sonnet-latest')  # sonnet 3.5b

In [19]:
translations = [synthesize(prompt_template.format(examples=examples_md, topic=topic))
                for topic in topics]
show(translations, collapsible=True)

- Sea otters are fascinating marine mammals known for their playful behavior and tool use. They often float on their backs while using rocks to crack open shellfish, and they hold hands while sleeping to avoid drifting apart in the water. ➡ *海獭是迷人的海洋哺乳动物，以其爱玩耍的行为和使用工具而闻名。它们经常仰卧漂浮时用石头砸开贝类，而且在睡觉时会牵着手以避免在水中漂散。*

<details>
<summary> Click to show the rest </summary>

- Emperor penguins are the largest of all penguin species, standing up to 1.2 meters tall and can survive the harsh Antarctic winter where temperatures drop below -60°C. They huddle together in large groups to stay warm and take turns moving from the outside to the inside of the group. ➡ *帝企鹅是所有企鹅物种中体型最大的，身高可达1.2米，能够在气温低于零下60度的严酷南极冬季中存活。它们会聚集在一起形成大群以保持温暖，并轮流从群体外围移动到内部。*
- Did you know that sloths are one of the slowest mammals on Earth? They move so slowly that algae can grow on their fur, which helps them blend in with their rainforest environment and provides a unique ecosystem for various insects. ➡ *你知道树懒是地球上行动最慢的哺乳动物之一吗？它们动作如此缓慢，以至于藻类能够在它们的毛皮上生长，这帮助它们与热带雨林环境融为一体，并为各种昆虫提供了独特的生态系统。*
- My orange tabby cat, Whiskers, loves to chase butterflies in our garden during sunny afternoons. She's incredibly graceful when pouncing, but rarely manages to catch any of them. ➡ *我那只名叫Whiskers的橘色虎斑猫喜欢在阳光明媚的下午在我们的花园里追逐蝴蝶。她扑击时非常优雅，但很少能抓到蝴蝶。*
- My golden retriever, Max, loves playing fetch in the park every morning. He gets so excited when he sees other dogs and always tries to make new furry friends. ➡ *我的金毛寻回犬Max很喜欢每天早上在公园玩接球游戏。每当看到其他狗狗时，他都会特别兴奋，总是想交新的毛茸茸的朋友。*
</details>

Create a class that critiques translations:

In [20]:
class TranslationCritique(BasicRepr):
  "A critique of the translation."
  def __init__(self,
               critique: str, # A brief 1-line critique of the translation
               score: int # A score of the translation from 1 to 5
               ): store_attr()
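
The `int` type hint and the comment describe the intended 1 to 5 range, but nothing in Python enforces it on the returned object. Below is a small sketch of a range check you could run on each critique before using its score. `valid_score` is a hypothetical helper, and rejecting out-of-range values (rather than clamping them) is an assumption.

```python
def valid_score(score, lo=1, hi=5):
    "True if `score` is an int within [lo, hi]; bools and non-ints are rejected."
    return isinstance(score, int) and not isinstance(score, bool) and lo <= score <= hi

# A few cases, including values a model could plausibly return by mistake.
for candidate in (5, 0, 6, "5", True):
    print(repr(candidate), valid_score(candidate))
```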

In [21]:
sp = "You will help critique synthetic data of English and Chinese phrases."
# Pass the critique system prompt explicitly; otherwise `sp` above is never used.
def synthesize(pr): return cli.structured(pr, sp=sp, temp=1, tools=TranslationCritique)[0]

In [22]:
eval_prompt_template = """\
Below is an extract of a translation. Evaluate its quality as a senior translator would, considering its suitability for professional use. Use the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the translation conveys the basic meaning of the source text, even if it includes some minor errors or awkward phrasing.
- Add another point if the translation is generally accurate but lacks refinement in style or fails to capture some nuances of the original. It might use inconsistent terminology or have occasional lapses in register.
- Award a third point if the translation is appropriate for professional use and accurately conveys key concepts of the source text. It demonstrates good understanding of both languages, though it may not be flawless or could include some slight inconsistencies. It resembles the work of a competent translator but may have room for improvement in fluency or precision.
- Grant a fourth point if the translation is highly accurate and reads naturally in the target language, exhibiting a consistent and appropriate style. It could be similar to the work of an experienced translator, offering faithful rendering of content and tone, with minimal errors, and effectively handling complex concepts or cultural references. The result is coherent, well-expressed, and valuable for its intended purpose.
- Bestow a fifth point if the translation is outstanding, demonstrating mastery of both source and target languages. It captures subtle nuances, maintains the author's voice and intent, and reads as if it were originally written in the target language. The translator has made excellent choices in dealing with challenging elements like wordplay, idiomatic expressions, or culture-specific content.

<translation>
{translation}
</translation>

After examining the translation:

- Briefly justify your total score in a single line.
- Conclude with the score of the translation."""

In [23]:
def show_critique(t, critique):
    return f"""{t}
\t- **Critique**: {critique.critique}
\t- **Score**: {critique.score}"""

def get_critique(t):
    critique = synthesize(eval_prompt_template.format(translation=t))
    return show_critique(t, critique)

In [24]:
show([get_critique(t) for t in translations], collapsible=True)

- Sea otters are fascinating marine mammals known for their playful behavior and tool use. They often float on their backs while using rocks to crack open shellfish, and they hold hands while sleeping to avoid drifting apart in the water. ➡ *海獭是迷人的海洋哺乳动物，以其爱玩耍的行为和使用工具而闻名。它们经常仰卧漂浮时用石头砸开贝类，而且在睡觉时会牵着手以避免在水中漂散。*
	- **Critique**: Excellent translation demonstrating natural flow, accurate terminology, perfect handling of behavioral descriptions, and culturally appropriate phrasing while maintaining scientific precision
	- **Score**: 5

<details>
<summary> Click to show the rest </summary>

- Emperor penguins are the largest of all penguin species, standing up to 1.2 meters tall and can survive the harsh Antarctic winter where temperatures drop below -60°C. They huddle together in large groups to stay warm and take turns moving from the outside to the inside of the group. ➡ *帝企鹅是所有企鹅物种中体型最大的，身高可达1.2米，能够在气温低于零下60度的严酷南极冬季中存活。它们会聚集在一起形成大群以保持温暖，并轮流从群体外围移动到内部。*
	- **Critique**: Excellent translation that accurately conveys scientific information, maintains precise measurements, captures the behavioral description naturally, and demonstrates perfect register consistency - reads fluently in Chinese while preserving all source content.
	- **Score**: 5
- Did you know that sloths are one of the slowest mammals on Earth? They move so slowly that algae can grow on their fur, which helps them blend in with their rainforest environment and provides a unique ecosystem for various insects. ➡ *你知道树懒是地球上行动最慢的哺乳动物之一吗？它们动作如此缓慢，以至于藻类能够在它们的毛皮上生长，这帮助它们与热带雨林环境融为一体，并为各种昆虫提供了独特的生态系统。*
	- **Critique**: Excellent translation capturing scientific accuracy, maintaining natural flow in Chinese, preserving all nuances including biological terms and ecosystem concepts, with appropriate formal-educational tone
	- **Score**: 5
- My orange tabby cat, Whiskers, loves to chase butterflies in our garden during sunny afternoons. She's incredibly graceful when pouncing, but rarely manages to catch any of them. ➡ *我那只名叫Whiskers的橘色虎斑猫喜欢在阳光明媚的下午在我们的花园里追逐蝴蝶。她扑击时非常优雅，但很少能抓到蝴蝶。*
	- **Critique**: Excellent natural flow in Chinese with accurate meaning transfer, appropriate register, and elegant phrasing, though keeping 'Whiskers' untranslated is debatable for Chinese audience
	- **Score**: 4
- My golden retriever, Max, loves playing fetch in the park every morning. He gets so excited when he sees other dogs and always tries to make new furry friends. ➡ *我的金毛寻回犬Max很喜欢每天早上在公园玩接球游戏。每当看到其他狗狗时，他都会特别兴奋，总是想交新的毛茸茸的朋友。*
	- **Critique**: Translation excellently captures both meaning and natural expression, with perfect handling of dog terminology, playful tone, and emotional nuances; maintains colloquial warmth through appropriate Chinese pet-specific language (狗狗, 毛茸茸)
	- **Score**: 5
</details>

In [26]:
bad_translation = Translation(
    english="Despite their fearsome reputation, most piranha species are actually omnivorous or even primarily vegetarian. The red-bellied piranha (Pygocentrus nattereri), which is the most well-known species, primarily scavenges for dead and dying animals rather than hunting healthy prey.",
    chinese="虽然鱼很可怕的名声，大多数食人鱼种类实际是吃素的和肉。红色肚子食人鱼（Pygocentrus nattereri）这个最出名种类，主要是吃死掉动物不是追捕健康的动物。"
)

show([get_critique(bad_translation)])

- Despite their fearsome reputation, most piranha species are actually omnivorous or even primarily vegetarian. The red-bellied piranha (Pygocentrus nattereri), which is the most well-known species, primarily scavenges for dead and dying animals rather than hunting healthy prey. ➡ *虽然鱼很可怕的名声，大多数食人鱼种类实际是吃素的和肉。红色肚子食人鱼（Pygocentrus nattereri）这个最出名种类，主要是吃死掉动物不是追捕健康的动物。*
	- **Critique**: Basic meaning conveyed but contains grammatical errors, awkward phrasing ('鱼很可怕' instead of '它们可怕的'), non-idiomatic expressions, and oversimplified terminology ('吃素的和肉' for omnivorous)
	- **Score**: 2
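
One way to use these scores is as a filter that keeps only the pairs at or above a threshold. The sketch below uses a stand-in `Scored` class and short labels in place of the full translations. The scores are the ones shown above, while `Scored`, `keep_high_quality`, and the threshold of 4 are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Scored:
    "Stand-in for a (translation, critique) pair: a short label plus the critique score."
    label: str
    score: int

def keep_high_quality(items, min_score=4):
    "Keep only the items whose score is at least `min_score`."
    return [it for it in items if it.score >= min_score]

# Scores taken from the critiques above.
scored = [
    Scored("sea otters", 5),
    Scored("emperor penguins", 5),
    Scored("sloths", 5),
    Scored("cat (Whiskers)", 4),
    Scored("dog (Max)", 5),
    Scored("piranha (bad translation)", 2),
]

kept = keep_high_quality(scored)
print([s.label for s in kept])
```

With this threshold the deliberately bad piranha translation is dropped and the five Sonnet translations are kept.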

Key points to remember:

- Quality and diversity both matter in synthetic data. Each can significantly affect the performance of models trained on it, so an effective dataset has to balance the two.
- Quality is harder to achieve than diversity. It is multidimensional, especially for free-form content, which makes it hard to meet a high standard on every aspect of the generated data.
- Synthetic data is a valuable tool when data is scarce. It is a quick, cost-effective option when you lack enough data for your task, and when generated well it can significantly improve performance on that task.

# Using fastdata

## Generate Translations

In [41]:
class Translation():
    "Translation from an English phrase to a Chinese phrase"
    def __init__(self, english: str, chinese: str): store_attr()
    def __repr__(self): return f"{self.english} ➡ *{self.chinese}*"

examples = [
    Translation(
        english="Hello, my name is Nathan. I am a research scientist at an AI startup.",
        chinese="你好，我叫Nathan。我是一家人工智能初创公司的研究科学家。."),
    Translation(
        english="How much wood could a woodchuck chuck if a woodchuck could chuck wood?",
        chinese="如果土拨鼠能丢木头的话，一只土拨鼠能丢多少木头？"),
    Translation(
        english="Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See.",
        chinese="托马斯·克兰默（1489年7月2日至1556年3月21日）是英国宗教改革的领袖，在亨利八世、爱德华六世以及短暂的玛丽一世统治期间担任坎特伯雷大主教。他帮助建立了亨利与阿拉贡的凯瑟琳解除婚姻关系的理据，这是英国教会脱离罗马教廷的原因之一。"
    ),
]

topics = ["otters", "penguins", "sloths", "cats", "dogs"]

sp = "You will help generate synthetic data of English and Chinese phrases."

prompt_template = """\
<examples>
{examples}
</examples>

Create an English and Chinese translation pair that is similar to the examples and is about the following topic:
<topic>{topic}</topic>"""


In [42]:
fast_data = FastData(model="claude-3-5-sonnet-latest")
translations = fast_data.generate(
    prompt_template=prompt_template,
    inputs=[{"examples": examples, "topic": topic} for topic in topics],
    schema=Translation,
    sp=sp)

100%|██████████| 5/5 [00:03<00:00,  1.33it/s]


In [45]:
translations

[My cat Felix loves to chase mice in the garden, but he's actually quite gentle and spends most of his time sleeping on the windowsill in the warm sunshine. ➡ *我的猫Felix很喜欢在花园里追老鼠，但它其实很温顺，大部分时间都在温暖的阳光下的窗台上睡觉。*,
 Dogs are known as man's best friend, and they have been loyal companions to humans for over 15,000 years. They can be trained to perform various tasks, from herding sheep to assisting people with disabilities. ➡ *狗被称为人类最好的朋友，它们已经陪伴人类超过15,000年了。它们可以被训练执行各种任务，从放牧羊群到帮助残障人士。*,
 Sea otters are remarkable marine mammals known for using tools like rocks to crack open shellfish. They often float on their backs while eating and have been observed holding hands while sleeping to avoid drifting apart. ➡ *海獭是非凡的海洋哺乳动物，以使用石头等工具来破开贝类而闻名。它们经常仰卧着进食，并且有人观察到它们睡觉时会手拉手以避免彼此漂散。*,
 Penguins are flightless aquatic birds that live primarily in the Southern Hemisphere, with most species found in Antarctica. Despite not being able to fly, they are excellent swimmers and can dive deep underwater to catch 

## Generate Scores

In [None]:
class TranslationCritique(BasicRepr):
  "A critique of the translation."
  def __init__(self,
               critique: str, # A brief 1-line critique of the translation
               score: int # A score of the translation from 1 to 5
               ): store_attr()

eval_sp = "You will help critique synthetic data of English and Chinese phrases."
eval_prompt_template = """\
Below is an extract of a translation. Evaluate its quality as a senior translator would, considering its suitability for professional use. Use the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the translation conveys the basic meaning of the source text, even if it includes some minor errors or awkward phrasing.
- Add another point if the translation is generally accurate but lacks refinement in style or fails to capture some nuances of the original. It might use inconsistent terminology or have occasional lapses in register.
- Award a third point if the translation is appropriate for professional use and accurately conveys key concepts of the source text. It demonstrates good understanding of both languages, though it may not be flawless or could include some slight inconsistencies. It resembles the work of a competent translator but may have room for improvement in fluency or precision.
- Grant a fourth point if the translation is highly accurate and reads naturally in the target language, exhibiting a consistent and appropriate style. It could be similar to the work of an experienced translator, offering faithful rendering of content and tone, with minimal errors, and effectively handling complex concepts or cultural references. The result is coherent, well-expressed, and valuable for its intended purpose.
- Bestow a fifth point if the translation is outstanding, demonstrating mastery of both source and target languages. It captures subtle nuances, maintains the author's voice and intent, and reads as if it were originally written in the target language. The translator has made excellent choices in dealing with challenging elements like wordplay, idiomatic expressions, or culture-specific content.

<translation>
{translation}
</translation>

After examining the translation:

- Briefly justify your total score in a single line.
- Conclude with the score of the translation."""


In [None]:
critiques = fast_data.generate(
    prompt_template=eval_prompt_template,
    inputs=[{"translation": f"{translation}"} for translation in translations],
    schema=TranslationCritique,
    sp=eval_sp)