# Generating synthetic data

This notebook walks through training a probabilistic, generative RNN model<br>
on a rental scooter location dataset, and then generating a synthetic<br>
dataset with greater privacy guarantees.

For both training and generating data, we can use the ``config.py`` module and<br>
create a ``LocalConfig`` instance that contains all the attributes that we need<br>
for both activities.

In [2]:
# Google Colab support
# Note: Click "Runtime->Change Runtime Type" and set Hardware Accelerator to "GPU"
# Note: Use `!pip install gretel-synthetics[tf]` to install tensorflow if necessary
#
!pip install gretel-synthetics --upgrade

Collecting gretel-synthetics
  Using cached gretel_synthetics-0.22.19-py3-none-any.whl.metadata (7.2 kB)
Collecting category-encoders==2.4.0 (from gretel-synthetics)
  Using cached category_encoders-2.4.0-py2.py3-none-any.whl.metadata (7.3 kB)
Collecting numpy<1.24,>=1.18.0 (from gretel-synthetics)
  Using cached numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting pandas<2,>=1.1.0 (from gretel-synthetics)
  Using cached pandas-1.5.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting protobuf<=4.24.0,>=4 (from gretel-synthetics)
  Using cached protobuf-4.24.0-cp37-abi3-manylinux2014_x86_64.whl.metadata (540 bytes)
INFO: pip is lo

In [None]:
from pathlib import Path

from gretel_synthetics.config import LocalConfig

# Create a config that we can use for both training and generating data
# The default values for ``max_lines`` and ``epochs`` are optimized for training on a GPU.

config = LocalConfig(
    max_line_len=2048,   # the max line length for input training data
    vocab_size=20000,    # tokenizer model vocabulary size
    field_delimiter=",", # specify if the training text is structured, else ``None``
    overwrite=True,      # overwrite previously trained model checkpoints
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    input_data_path="https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/uber_scooter_rides_1day.csv" # filepath or S3
)
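
Before training, it can help to confirm that the input lines fit within ``max_line_len`` and split into a consistent number of fields on ``field_delimiter``. The following is a minimal sketch using only the standard library. ``check_training_lines`` is a hypothetical helper, not part of gretel-synthetics, and the sample rows are made up for illustration.

```python
from collections import Counter


def check_training_lines(lines, max_line_len=2048, field_delimiter=","):
    """Hypothetical pre-flight check, not part of gretel-synthetics.

    Returns (too_long, field_counts): the indices of lines longer than
    max_line_len, and a Counter of how many fields each line splits into.
    """
    too_long = []
    field_counts = Counter()
    for idx, line in enumerate(lines):
        line = line.rstrip("\n")
        if len(line) > max_line_len:
            too_long.append(idx)
        field_counts[len(line.split(field_delimiter))] += 1
    return too_long, field_counts


# Made-up sample rows, for illustration only
sample = ["1,a,2.0,3.0\n", "2,b,4.0\n", "x" * 3000 + "\n"]
too_long, field_counts = check_training_lines(sample)
print(too_long)
print(dict(field_counts))
```

If more than one field count shows up, the training text may not be as uniformly structured as the ``field_delimiter`` setting assumes.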


In [None]:
# Train a model
# The training function only requires our config as a single arg
from gretel_synthetics.train import train_rnn

train_rnn(config)

In [None]:
# Let's generate some text!
#
# The ``generate_text`` function is a generator that yields lines of
# predicted text based on the ``gen_lines`` setting in your config.
#
# There is no limit on line length: with proper training, your model
# should learn where newlines generally occur. However, if you want to
# specify a maximum character length for each line, you may set the
# ``gen_chars`` attribute in your config object.
from gretel_synthetics.generate import generate_text

# Optionally, when generating text, you can provide a callable that takes the
# generated line as a single arg. If this function raises an exception, the
# line will fail validation and will not be returned. The exception message
# will be provided as an ``explain`` field in the resulting dict that gets
# created by ``generate_text``.
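def validate_record(line):
    # Expect six fields separated by ", "
    rec = line.split(", ")
    if len(rec) == 6:
        # Each conversion raises ValueError if the field is not numeric
        float(rec[5])
        float(rec[4])
        float(rec[3])
        float(rec[2])
        int(rec[0])
    else:
        raise Exception('record not 6 parts')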

for line in generate_text(config, line_validator=validate_record, num_lines=1000):
    print(line)
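
The validation behaviour described above can be illustrated without the library. This is a sketch of the idea only, not how ``generate_text`` is implemented: ``run_validator`` is a hypothetical wrapper that calls a validator, treats any exception as a failed validation, and keeps the exception message as an ``explain`` value. The sample lines are made up.

```python
def validate_record(line):
    """Same checks as the validator above: six fields, numeric where expected."""
    rec = line.split(", ")
    if len(rec) != 6:
        raise Exception('record not 6 parts')
    for idx in (5, 4, 3, 2):
        float(rec[idx])
    int(rec[0])


def run_validator(line, validator):
    """Hypothetical wrapper: returns a dict with ``valid`` and ``explain`` keys."""
    try:
        validator(line)
    except Exception as err:
        return {"text": line, "valid": False, "explain": str(err)}
    return {"text": line, "valid": True, "explain": None}


# Made-up sample lines, for illustration only
samples = [
    "12, abc, 1.5, 2.5, 3.5, 4.5",    # passes every check
    "12, abc, 1.5, 2.5",              # wrong number of fields
    "noon, abc, 1.5, 2.5, 3.5, 4.5",  # first field is not an int
]
results = [run_validator(s, validate_record) for s in samples]
for r in results:
    print(r["valid"], r["explain"])
```

Testing a validator this way on a few hand-written lines can catch a wrong delimiter or field index before a long generation run.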