[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_101_Blueprint.ipynb)

<br>

<center><a href=https://gretel.ai/><img src="https://gretel-public-website.s3.us-west-2.amazonaws.com/assets/brand/gretel_brand_wordmark.svg" alt="Gretel" width="350"/></a></center>

<br>

## Welcome to the Gretel 101 Blueprint!

In this Blueprint, we will use Gretel to train a deep generative model and use it to generate high-quality synthetic (tabular) data. We will accomplish this by submitting training and generation jobs to the [Gretel Cloud](https://gretel.ai/faqs/gretel-cloud) via [Gretel's Python SDK](https://docs.gretel.ai/guides/environment-setup/cli-and-sdk).

Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.

## Create your Gretel account

To get started, you will need to [sign up for a free Gretel account](https://console.gretel.ai/).

<br>

#### Ready? Let's go 🚀

## 💾 Install `gretel-client` and its dependencies

In [1]:
%%capture
!pip install gretel-client

## 🛜 Configure your Gretel session

- The `Gretel` object provides a high-level interface for streamlining interactions with Gretel's APIs.

- Each `Gretel` instance is bound to a single [Gretel project](https://docs.gretel.ai/guides/gretel-fundamentals/projects).

- Running the cell below will prompt you for your Gretel API key, which you can retrieve from the [Gretel Console](https://console.gretel.ai/users/me/key).

- With `validate=True`, your login credentials will be validated immediately at instantiation.
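For non-interactive runs, the key can come from the environment instead of a prompt. The following is a minimal sketch, not part of the original Blueprint: `resolve_api_key` and `connect` are hypothetical helpers, and using `GRETEL_API_KEY` as the variable name is an assumption. The `Gretel(api_key=..., validate=True)` call mirrors the cell below, with a literal key in place of `"prompt"`.

```python
import os


def resolve_api_key(env_var: str = "GRETEL_API_KEY") -> str:
    """Return the key stored in `env_var`, or "prompt" to fall back to interactive entry."""
    key = os.environ.get(env_var, "").strip()
    return key if key else "prompt"


def connect():
    """Create a Gretel session using the resolved key (hypothetical helper)."""
    # Imported lazily so resolve_api_key can be used without the SDK installed.
    from gretel_client import Gretel

    return Gretel(api_key=resolve_api_key(), validate=True)
```

Calling `connect()` behaves like the cell below when the variable is unset, because the helper then returns `"prompt"`.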

In [2]:
from gretel_client import Gretel

gretel = Gretel(api_key="prompt", validate=True)

Gretel API Key: ··········
Using endpoint https://api.gretel.cloud
Logged in as achakrabarty8@gatech.edu ✅


In [10]:
import pandas as pd

# Load the source data from the Excel workbook
dataset = "Merged_data.xlsx"
df = pd.read_excel(dataset)

# Convert the DataFrame to a CSV file for training
df.to_csv('input_data.csv', index=False)

In [13]:
# Point `dataset` at the CSV copy and reload it
dataset = "input_data.csv"
df = pd.read_csv(dataset)

In [14]:
# explore the data using pandas
df.head()

Unnamed: 0,response id,concerns,concerns category,anything else,anything else category
0,551,That all of my knowledge from calc BC escapes ...,AC,The sample exams and quizzes during linear alg...,AC
1,416,My only concern about this course is that I wi...,AC,,NC
2,422,My only concern is if I'll be able to study/pr...,AC,,NC
3,408,My only concern is that so far the video lesso...,AC,,NC
4,356,One thing that was concerning for me last seme...,AC,,NC


## 🏋️‍♂️ Train a generative model

- The [tabular-actgan](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/tabular-actgan.yml) base config tells Gretel which model to train and how to configure it.

- You can replace `tabular-actgan` with the path to a custom config file, or you can select any of the tabular configs [listed here](https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics).

- The training data is passed in using the `data_source` argument, which accepts either a file path or a pandas `DataFrame`.

- **Tip:** Click the printed Console URL to monitor your job's progress in the Gretel Console.
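Because `data_source` also accepts a `DataFrame`, the data can be tidied in pandas before submission. The sketch below is illustrative only: `prepare_training_frame` is a hypothetical helper, and dropping `Unnamed:` columns and fully empty rows is an assumed cleanup step, not something the Blueprint requires.

```python
import pandas as pd


def prepare_training_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Drop leftover index columns and rows that are entirely empty."""
    # Columns named "Unnamed: N" usually come from an index saved to file.
    keep = [c for c in df.columns if not str(c).startswith("Unnamed:")]
    cleaned = df[keep].dropna(how="all")
    return cleaned.reset_index(drop=True)


# Tiny made-up frame standing in for the real training data.
example = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "response id": [551, 416, None],
        "concerns": ["text a", "text b", None],
    }
)
train_df = prepare_training_frame(example)

# The cleaned frame could then be passed directly, for example:
# trained = gretel.submit_train("tabular-actgan", data_source=train_df)
```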

In [15]:
trained = gretel.submit_train("tabular-actgan", data_source=dataset)

Submitting ACTGAN training job...
Model Docs: https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-actgan
Console URL: https://console.gretel.ai/proj_2iExWK9T7b44J2Mq2Byx2QoNtKJ/models/6676ece0f5296ddce896546a/activity
Model ID: 6676ece0f5296ddce896546a
Analyzing input data and checking for auto-params... 
Found 3 auto-params that were set based on input data. epochs 600, batch_size 600, force_conditioning False
Starting ACTGAN model training... num_epochs 600
Training data loaded. record_count 1279, field_count 5, upsample_count 0
Training: [██████████████████████████████████████████████████] 600/600 epochs.
ACTGAN model training complete. 
Sampling records for data preview... num_records 5000
Preparing privacy filters 
Loaded 0 privacy filters 
Starting privacy filtering 
Privacy filtering complete. 
Sampled 5000 records. 
Creating synthetic quality report (SQS)... 
Finished creating SQS 
Uploading artifacts to Gretel Cloud... 
Upload to Gretel Cloud is completed. 


## 🧐 Evaluate the synthetic data quality

- Gretel automatically creates a [synthetic data quality report](https://docs.gretel.ai/reference/evaluate/synthetic-data-quality-report) for each model you train.

- The training results object returned by `submit_train` has a `report` attribute, a `GretelReport` object, for viewing the quality report.


In [16]:
# view the quality scores
print(trained.report)

GretelReport(
    synthetic_data_quality_score: 70
    field_correlation_stability: 60
    principal_component_stability: 52
    field_distribution_stability: 100
    privacy_protection_level: 0
)
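One way to act on these numbers is to flag any metric that falls below a chosen threshold. This is a minimal sketch: the scores are copied by hand from the output above, `scores_below` is a hypothetical helper, and the threshold of 60 is an arbitrary illustration rather than a Gretel recommendation. `privacy_protection_level` is left out on the assumption that it is not measured on the same scale as the stability scores.

```python
def scores_below(scores: dict, minimum: int) -> dict:
    """Return the metrics whose value is strictly below `minimum`."""
    return {name: value for name, value in scores.items() if value < minimum}


# Values copied manually from the printed GretelReport above.
scores = {
    "synthetic_data_quality_score": 70,
    "field_correlation_stability": 60,
    "principal_component_stability": 52,
    "field_distribution_stability": 100,
}

low = scores_below(scores, minimum=60)
print(low)  # {'principal_component_stability': 52}
```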



In [17]:
# display the full report within this notebook
trained.report.display_in_notebook()

How to interpret your SQS,Excellent,Good,Moderate,Poor,Very Poor
Suitable for machine learning or statistical analysis,,,,,
Suitable for balancing or augmenting machine learning data sources,,,,,
Suitable for pre-production testing environments,,,,,
Suitable for demo environments or mock data,,,,,
Improve your model using our tips and advice,,,,,
Significant tuning required to improve model,,,,,

Data Sharing Use Case,Excellent,Very Good,Good,Normal,Poor
"Internally, within the same team",,,,,
"Internally, across different teams",,,,,
"Externally, with trusted partners",,,,,
"Externally, public availability",,,,,

Unnamed: 0,Training Data,Synthetic Data
Row Count,1279,1279
Column Count,5,5
Training Lines Duplicated,--,87


Field,Unique,Missing,Ave. Length,Type,Distribution Stability
concerns,488,591,89.93,Text,
response id,1035,0,3.24,Other,
anything else,193,964,59.34,Text,
concerns category,5,0,2.0,Categorical,Excellent
anything else category,4,0,2.0,Categorical,Excellent
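The "Training Lines Duplicated" row reports synthetic rows that repeat training rows. A rough local check along the same lines can be sketched in pandas. This is an assumption-laden illustration: `count_exact_copies` is a hypothetical helper, and it is not claimed to match how Gretel computes its figure.

```python
import pandas as pd


def count_exact_copies(train: pd.DataFrame, synth: pd.DataFrame) -> int:
    """Count synthetic rows whose values all match some training row."""
    cols = [c for c in train.columns if c in synth.columns]
    # Compare string forms so that missing values compare equal to each other.
    train_keys = set(map(tuple, train[cols].astype(str).to_numpy()))
    synth_keys = map(tuple, synth[cols].astype(str).to_numpy())
    return sum(key in train_keys for key in synth_keys)


# Small made-up frames for illustration.
train_example = pd.DataFrame({"text": ["x", "y", None], "category": ["AC", "NC", "NC"]})
synth_example = pd.DataFrame(
    {"text": ["x", "z", None, "x"], "category": ["AC", "NC", "NC", "NC"]}
)
copies = count_exact_copies(train_example, synth_example)
```

Because values are compared as strings, the same number stored as `551` in one frame and `551.0` in the other would not count as a match.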


In [18]:
# inspect the synthetic data used to create the report
df_synth_report = trained.fetch_report_synthetic_data()
df_synth_report.head()

Unnamed: 0,response id,concerns,concerns category,anything else,anything else category
0,328,I am most concerned about not understanding th...,AC,,NC
1,587,,NC,,NC
2,AC36,,NC,,NC
3,AC06,My concerns are just having it be self-guided ...,AC,I'm very interested in physics and this class ...,NC
4,110,"The difficulty. ""Multivariable calculus"" sound...",AC,,NC
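Some synthetic `response id` values in the preview above, such as `AC36`, are not numeric, unlike the IDs in the training preview. If downstream code expects numeric IDs, rows can be separated with a simple pattern check. This is a sketch under that assumption; `split_by_numeric_id` is a hypothetical helper.

```python
import pandas as pd


def split_by_numeric_id(df: pd.DataFrame, id_column: str = "response id"):
    """Return (rows with all-digit IDs, rows with any other ID)."""
    ids = df[id_column].astype(str).str.strip()
    is_numeric = ids.str.fullmatch(r"\d+")
    return df[is_numeric], df[~is_numeric]


# IDs copied from the preview above.
sample = pd.DataFrame({"response id": ["328", "587", "AC36", "AC06", "110"]})
valid, invalid = split_by_numeric_id(sample)
```

Applied to the real data, the call would be `split_by_numeric_id(df_synth_report)`. IDs stored as floats (for example `328.0`) would land in the second frame, so the pattern may need adjusting.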


## 🤖 Generate synthetic data

- The `model_id` argument can be the ID of any trained model within the current project, and `num_records` sets how many synthetic records to generate.


In [22]:
generated = gretel.submit_generate(trained.model_id, num_records=6000)

Submitting ACTGAN generate job...
Model Docs: https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-actgan
Console URL: https://console.gretel.ai/proj_2iExWK9T7b44J2Mq2Byx2QoNtKJ/models/6676ece0f5296ddce896546a/data
Loading model to worker 
Loading ACTGAN model... 
Sampling 6000 records... 
Preparing privacy filters 
Loaded 0 privacy filters 
Starting privacy filtering 
Privacy filtering complete. 
Uploading artifacts to Gretel Cloud... 
Upload to Gretel Cloud is completed. 


In [23]:
# inspect the generated synthetic data
generated.synthetic_data.head()

Unnamed: 0,response id,concerns,concerns category,anything else,anything else category
0,706,,NC,There is nothing so far.,NC
1,375,I am most concerned with keeping up with the c...,AC,,NC
2,310,No fundamental practice after watching videos,OT,No nothing much. Matrices are my worst nighPCa...,NC
3,506,NA so far,NC,,NC
4,177,I am concerned about the lack of guidance from...,AC,,NC
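A quick sanity check on the generated data is to compare category proportions against the training data. The sketch below uses small made-up frames; `compare_category_shares` is a hypothetical helper, not part of the Gretel SDK.

```python
import pandas as pd


def compare_category_shares(real: pd.DataFrame, synth: pd.DataFrame, column: str) -> pd.DataFrame:
    """Tabulate each category's share in both frames and the absolute difference."""
    real_share = real[column].value_counts(normalize=True)
    synth_share = synth[column].value_counts(normalize=True)
    table = pd.concat([real_share, synth_share], axis=1, keys=["real", "synthetic"])
    # A category missing from one frame has a share of zero there.
    table = table.fillna(0.0)
    table["abs_diff"] = (table["real"] - table["synthetic"]).abs()
    return table.sort_values("abs_diff", ascending=False)


# Made-up frames for illustration.
real_example = pd.DataFrame({"concerns category": ["AC", "AC", "NC", "OT"]})
synth_example = pd.DataFrame({"concerns category": ["AC", "NC", "NC", "NC"]})
shares = compare_category_shares(real_example, synth_example, "concerns category")
```

With the notebook's own objects, the call would be `compare_category_shares(df, generated.synthetic_data, "concerns category")`.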


In [24]:
# Save synthetic data to CSV file
synthetic_data_file = 'synthetic_data.csv'
generated.synthetic_data.to_csv(synthetic_data_file, index=False)

# Confirm the file is saved
print(f"Synthetic data saved to {synthetic_data_file}")

Synthetic data saved to synthetic_data.csv
