# fastdata

> Easiest and fastest way to 1B synthetic tokens

Minimalist library that wraps around `claudette` to make generating synthetic data easy.

## Developer Guide

If you are new to using `nbdev`, here are some useful pointers to get you started.

### Install fastdata in Development mode

```sh
# make sure fastdata package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to fastdata
$ nbdev_prepare
```

## Usage

### Installation

Install latest from the GitHub [repository][repo]:

```sh
$ pip install git+https://github.com/AnswerDotAI/fastdata.git
```

or from [conda][conda]:

```sh
$ conda install -c AnswerDotAI fastdata
```

or from [pypi][pypi]:

```sh
$ pip install fastdata
```


[repo]: https://github.com/AnswerDotAI/fastdata
[docs]: https://AnswerDotAI.github.io/fastdata/
[pypi]: https://pypi.org/project/fastdata/
[conda]: https://anaconda.org/AnswerDotAI/fastdata

### Documentation

Documentation is hosted on this GitHub [repository][repo]'s [pages][docs]. Package-manager-specific guidelines are available on [conda][conda] and [pypi][pypi].


## How to use

First, define the structure of the data you want to generate. `claudette`, the library fastdata uses to generate data, requires a schema for that data. In this example the schema is a plain Python class whose docstring and typed `__init__` parameters describe the fields.

```python
from fastcore.utils import *

class Translation():
    "Translation from an English phrase to a Spanish phrase"
    def __init__(self, english: str, spanish: str): store_attr()
    def __repr__(self): return f"{self.english} ➡ *{self.spanish}*"

Translation("Hello, how are you today?", "Hola, ¿cómo estás hoy?")
```

```
Hello, how are you today? ➡ *Hola, ¿cómo estás hoy?*
```
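
The schema is carried by the class itself: the docstring describes the record, and the typed `__init__` parameters name the fields. As a rough illustration of what can be read from such a class, here is a sketch using only the standard library. The `describe_schema` helper is hypothetical and is not part of fastdata or `claudette`, and the class assigns its attributes by hand instead of calling `store_attr` so that the snippet runs without `fastcore`.

```python
import inspect

class Translation:
    "Translation from an English phrase to a Spanish phrase"
    def __init__(self, english: str, spanish: str):
        # Plain assignments stand in for fastcore's store_attr()
        self.english = english
        self.spanish = spanish

def describe_schema(cls):
    "Hypothetical helper: collect the docstring and typed fields of `cls`."
    params = inspect.signature(cls.__init__).parameters
    # Map each parameter (except self) to the name of its annotated type
    fields = {name: p.annotation.__name__
              for name, p in params.items() if name != "self"}
    return {"name": cls.__name__, "description": inspect.getdoc(cls), "fields": fields}

schema = describe_schema(Translation)
print(schema)
```

A clear docstring and descriptive parameter names matter here, since they are the only description of the record that the class provides.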

Next, define the prompt used to generate the data, along with the inputs to pass to it.

```python
prompt_template = """\
Generate English and Spanish translations on the following topic:
<topic>{topic}</topic>
"""

inputs = [{"topic": "Otters are cute"}, {"topic": "I love programming"}]
```
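
Each entry in `inputs` supplies values for the placeholders in `prompt_template`. Assuming the placeholders are filled the way Python's `str.format` would fill them (an assumption about fastdata's internals, not something this README confirms), the resulting prompts can be previewed like this:

```python
prompt_template = """\
Generate English and Spanish translations on the following topic:
<topic>{topic}</topic>
"""

inputs = [{"topic": "Otters are cute"}, {"topic": "I love programming"}]

# Assumption: one prompt per input dict, with dict keys matching the placeholders
prompts = [prompt_template.format(**inp) for inp in inputs]

for p in prompts:
    print(p)
```

Under this model, an input dict that lacks a key used in the template raises `KeyError`, so it is worth checking that every dict covers every placeholder before starting a large run.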

Finally, we can generate some data with fastdata.

::: {.callout-note}
fastdata currently supports only Anthropic models. Make sure you have an Anthropic API key, and either set the proper environment variables or pass the key directly to the `FastData` class: `FastData(api_key="sk-ant-api03-...")`.
:::
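
The two ways of supplying the key can be combined in a small lookup. The sketch below is illustrative only: `resolve_api_key` is a hypothetical helper, and it assumes the environment variable is named `ANTHROPIC_API_KEY`, which is the name the Anthropic Python client reads but which this README does not state.

```python
import os

def resolve_api_key(explicit=None, env=os.environ):
    "Hypothetical helper: prefer an explicitly passed key, else the environment."
    # Assumption: ANTHROPIC_API_KEY is the variable the Anthropic client reads
    key = explicit or env.get("ANTHROPIC_API_KEY")
    if not key:
        raise RuntimeError("No Anthropic API key found")
    return key

# Demo with a fake environment mapping; the value is a placeholder, not a real key
print(resolve_api_key(env={"ANTHROPIC_API_KEY": "placeholder-key"}))
```

The returned value could then be passed as `FastData(api_key=...)`.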

```python
from fastdata.core import FastData

fast_data = FastData(model="claude-3-haiku-20240307")
translations = fast_data.generate(
    prompt_template=prompt_template,
    inputs=inputs,
    schema=Translation,
)
```


```python
from IPython.display import Markdown

Markdown("\n".join(f'- {t}' for t in translations))
```

- I love programming ➡ *Me encanta la programación*
- Otters are cute ➡ *Las nutrias son lindas*
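
Generated records usually need to be written somewhere for later use. As a minimal sketch, the snippet below serializes results to JSON Lines with the standard library. The `to_jsonl` helper is hypothetical, the `SimpleNamespace` objects stand in for the real `Translation` results so the snippet runs without an API key, and skipping `None` entries is a defensive assumption rather than documented fastdata behavior.

```python
import json
from types import SimpleNamespace

# Stand-ins for generated results; real ones would come from fast_data.generate(...)
translations = [
    SimpleNamespace(english="I love programming", spanish="Me encanta la programación"),
    SimpleNamespace(english="Otters are cute", spanish="Las nutrias son lindas"),
]

def to_jsonl(items):
    "Hypothetical helper: serialize each item as one JSON object per line."
    rows = (
        {"english": t.english, "spanish": t.spanish}
        for t in items
        if t is not None  # Assumption: skip entries that produced no result
    )
    # ensure_ascii=False keeps accented characters readable in the output
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)

jsonl = to_jsonl(translations)
print(jsonl)
```

Naming the fields explicitly, rather than dumping every instance attribute, keeps the output limited to the schema's own fields.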

To see how best to generate data with fastdata, read our [blog post](https://www.answer.ai/blog/introducing-fastdata) and browse the [examples](https://github.com/AnswerDotAI/fastdata/tree/main/examples) directory.