In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Author : [Madhup Sukoon](https://github.com/vagrantism)

Reviewer : [Erwin Huizenga](https://github.com/erwinh85)

### Context
This notebook demonstrates how to use Gemini with [Snowfakery](https://snowfakery.readthedocs.io/en/docs/index.html) to generate synthetic data for a given schema at scale. Although the framework can be used for any synthetic data generation use case and schema, the current example shows a simple one: generating long-form content (blog posts) and short-form content (comments), using a given Wikipedia page as the seed data.

### Setup
The framework lists its dependencies in `pyproject.toml`. Install them with `pip` from the directory that contains that file:

In [None]:
! pip install .

### Recipe
To generate synthetic data, you first define its schema. This is done by writing a `recipe` in YAML, as demonstrated below. More details on writing recipes can be found [here](https://snowfakery.readthedocs.io/en/latest/#central-concepts).

This particular recipe first creates a random number (between 100 and 500) of `users` rows, each with fields such as `first_name`, `last_name`, and `email`. Next, it uses the custom Wikipedia plugin to read a Wikipedia page and parse its contents. The result is stored as a `seeds` object with the fields `title`, `url`, and `section_count`. Then, for each section of the seed Wikipedia page, the recipe creates a `blog_ideas` object. To convert each section's content into a long-form blog post (`blog_posts`), it calls the Gemini model with a simple prompt defined in `synthetic_data_generation/prompts/blog_generator.jinja`. For every blog post generated, it also creates a `blog_post_comments` object, which uses Gemini to write a comment with a random row from `users` as the author.

In [None]:
recipe = """
- plugin: synthetic_data_generation.plugins.Wikipedia
- plugin: synthetic_data_generation.plugins.Gemini
- option: wiki_title

- object : users
  count : ${{random_number(min=100, max=500)}}
  fields :
    first_name : ${{fake.FirstName}}
    last_name : ${{fake.LastName}}
    age:
      random_number:
        min: 18
        max: 95
    email : ${{fake.Email}}
    phone : ${{fake.PhoneNumber}}
    interests : ${{fake.Bs}}
    postal_code : ${{fake.Postalcode}}
    organization : ${{fake.Company}}
    profession : ${{fake.Job}}

- object : seeds
  count : 1
  fields :
    __seed :
      - Wikipedia.get_page :
        title : ${{wiki_title}}
    title : ${{__seed.title}}
    url : ${{__seed.url}}
    section_count : ${{__seed.sections | length}}

  friends:
    - object : blog_ideas
      count : ${{seeds.section_count}}
      fields :
        seed_id : ${{seeds.id}}
        section : ${{(seeds.__seed.sections.keys() | list)[child_index]}}
        body : ${{seeds.__seed.sections[section]}}

      friends:
        - object : blog_posts
          fields :
            blog_idea_id : ${{blog_ideas.id}}
            title : ${{seeds.title}} - ${{blog_ideas.section}}
            body :
              - Gemini.generate:
                prompt_name : blog_generator.jinja
                idea_title : ${{title}}
                idea_body : ${{blog_ideas.body}}
            author : Gemini
          friends:
            - object : blog_post_comments
              fields :
                blog_post_id : ${{blog_posts.id}}
                author_id :
                  random_reference : users
                author_email : ${{author_id.email}}
                comment :
                  - Gemini.generate:
                    prompt_name : comment_generator.jinja
                    first_name : ${{author_id.first_name}}
                    last_name : ${{author_id.last_name}}
                    age : ${{author_id.age}}
                    interests : ${{author_id.interests}}
                    organization : ${{author_id.organization}}
                    profession : ${{author_id.profession}}
                    blog_title : ${{blog_posts.title}}
                    blog_body : ${{blog_posts.body | truncate(1000)}}
"""
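The `section` and `body` fields of `blog_ideas` rely on a positional lookup: the section names are turned into a list and indexed with `child_index`. The sketch below mimics that lookup in plain Python to show how each row pairs up with one section. It is an illustration only, not part of the framework: the `sections` dictionary and the `plan_blog_ideas` helper are hypothetical, and it assumes that `child_index` counts from 0 and that the Wikipedia plugin exposes sections as a mapping from section name to text.

```python
def plan_blog_ideas(sections: dict[str, str]) -> list[dict[str, str]]:
    """Pair each row index with one section, as the `blog_ideas` fields do."""
    # Mirrors `seeds.__seed.sections.keys() | list` in the recipe.
    names = list(sections.keys())
    rows = []
    # Mirrors `count : ${{seeds.section_count}}`: one row per section.
    for child_index in range(len(names)):
        section = names[child_index]
        rows.append({"section": section, "body": sections[section]})
    return rows


# Hypothetical stand-in for the parsed Wikipedia page.
sections = {
    "History": "text of the History section",
    "Syntax": "text of the Syntax section",
}
for row in plan_blog_ideas(sections):
    print(row["section"], "->", row["body"])
```

Because the nested `blog_posts` and `blog_post_comments` objects declare no `count`, the recipe should produce one post and one comment per section.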

### Running the recipe

The following cell runs the recipe defined above against the `Python_(programming_language)` Wikipedia page and writes the generated data as CSV files to the `outputs` folder.

In [None]:
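from io import StringIO

import vertexai
from snowfakery import generate_data

# Initialize the Vertex AI SDK; replace the placeholder with your own project ID.
vertexai.init(project="<YOUR-GCP-PROJECT>", location="us-central1")

# Run the recipe and write the generated rows as CSV files under `outputs/`.
generate_data(
    StringIO(recipe),
    output_format="csv",
    output_folder="outputs",
    user_options={"wiki_title": "Python_(programming_language)"},
)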
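
Once the run finishes, a quick sanity check is to count the rows in each generated file. The sketch below uses only the standard library. The `summarize_csv_folder` helper is hypothetical (it is not part of Snowfakery or this framework), and it assumes that every `.csv` file in the folder starts with a header row.

```python
import csv
from pathlib import Path


def summarize_csv_folder(folder: str) -> dict[str, int]:
    """Return the number of data rows (excluding the header) per CSV file."""
    counts = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # newline="" lets the csv module handle line breaks inside quoted
        # fields, which matters for multi-paragraph blog bodies.
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row, if present
            counts[path.name] = sum(1 for _ in reader)
    return counts


# Only report if the output folder from the previous cell exists.
if Path("outputs").is_dir():
    for name, rows in summarize_csv_folder("outputs").items():
        print(f"{name}: {rows} rows")
```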