# Reference: https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator-fine-tuning-intro-tutorial.ipynb#scrollTo=sULbdBh4k71u

# Install the required dependencies

In [1]:
%%capture
!pip install gretel-client

# Initialise the project parameters

In [2]:
from gretel_client import Gretel

gretel = Gretel(
    project_name="generate-heart-disease-data",
    api_key="prompt",
    endpoint="https://api.gretel.cloud",
    validate=True,
)

Found cached Gretel credentials
Using endpoint https://api.gretel.cloud
Logged in as ec23839@qmul.ac.uk ✅
Project URL: https://console.gretel.ai/proj_2kSa6XEWsFjin7D8RiTL6nqaAXD


# Specify the source dataset and fine-tune the model

In [3]:
data_source = "heart_gen_ai_917_source.csv"
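
Before submitting the training job, it can help to confirm that the source file loads cleanly and to note its row, duplicate, and missing-value counts. The following is a minimal sketch, assuming pandas is installed. The helper name `summarise_source` is hypothetical, and the small inline CSV is stand-in data rather than the contents of `heart_gen_ai_917_source.csv`.

```python
import io

import pandas as pd


def summarise_source(df: pd.DataFrame) -> dict:
    """Return basic facts worth checking before training (hypothetical helper)."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
    }


# Stand-in data; in the notebook this would be pd.read_csv(data_source).
sample_csv = io.StringIO(
    "age,sex,cholesterol,target\n"
    "54,1,226,1\n"
    "60,1,237,1\n"
    "54,1,226,1\n"  # exact duplicate of the first row
    "55,0,,0\n"     # missing cholesterol value
)
summary = summarise_source(pd.read_csv(sample_csv))
print(summary)
```

In the notebook itself, the same helper could be called on `pd.read_csv(data_source)`.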


In [4]:
trained = gretel.submit_train("navigator-ft", data_source=data_source)

Submitting NAVIGATOR FINE TUNING training job...
Model Docs: https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-navigator-fine-tuning
Console URL: https://console.gretel.ai/proj_2kSa6XEWsFjin7D8RiTL6nqaAXD/models/66b72bc34be1fd776ab4e380/activity
Model ID: 66b72bc34be1fd776ab4e380
<< 🧭 Navigator FT >> Preparing for training 
<< 🧭 Navigator FT >> Tokenizing records 
<< 🧭 Navigator FT >> Number of unique train records: 871 
<< 🧭 Navigator FT >> Assembling examples from 2870.3% of the input records 
<< 🧭 Navigator FT >> Training Example Statistics: 

╒════════╤═════════════════════╤══════════════════════╤═══════════════════════╕
│        │   Tokens per record │   Tokens per example │   Records per example │
╞════════╪═════════════════════╪══════════════════════╪═══════════════════════╡
│ min    │                 102 │                 1773 │                    16 │
├────────┼─────────────────────┼──────────────────────┼───────────────────────┤
│ max    │                 

In [5]:
# view the quality scores
trained.report

GretelReport(
    synthetic_data_quality_score: 93
    field_correlation_stability: 91
    principal_component_stability: 96
    field_distribution_stability: 94
    privacy_protection_level: 0
    membership_inference_attack_score: 85.65
    attribute_inference_attack_score: 38.1
    data_privacy_score: 61
)
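
The utility scores in this report sit in the 90s, while some of the privacy scores are noticeably lower. As a rough sketch, the printed values can be collected into a dictionary and filtered against a review threshold. The threshold of 80 and the helper `scores_below` are hypothetical choices for illustration, not Gretel's own grading, and `privacy_protection_level` is left out because its scale is not clear from the output.

```python
# Scores copied from the GretelReport output above.
report_scores = {
    "synthetic_data_quality_score": 93,
    "field_correlation_stability": 91,
    "principal_component_stability": 96,
    "field_distribution_stability": 94,
    "membership_inference_attack_score": 85.65,
    "attribute_inference_attack_score": 38.1,
    "data_privacy_score": 61,
}

REVIEW_THRESHOLD = 80  # hypothetical cut-off, not a Gretel-defined value


def scores_below(scores: dict, threshold: float) -> list:
    """Return (name, value) pairs under the threshold, lowest first."""
    flagged = [(name, value) for name, value in scores.items() if value < threshold]
    return sorted(flagged, key=lambda item: item[1])


needs_review = scores_below(report_scores, REVIEW_THRESHOLD)
for name, value in needs_review:
    print(f"{name}: {value}")
```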

In [6]:
# display the full report within this notebook
trained.report.display_in_notebook()

How to interpret your SQS,Excellent,Good,Moderate,Poor,Very Poor
Suitable for machine learning or statistical analysis,,,,,
Suitable for balancing or augmenting machine learning data sources,,,,,
Suitable for pre-production testing environments,,,,,
Suitable for demo environments or mock data,,,,,
Improve your model using our tips and advice,,,,,
Significant tuning required to improve model,,,,,

Data Sharing Use Case,Excellent,Very Good,Good,Normal,Poor
"Internally, within the same team",,,,,
"Internally, across different teams",,,,,
"Externally, with trusted partners",,,,,
"Externally, public availability",,,,,

Metric,Training Data,Synthetic Data
Row Count,871,871
Column Count,12,12
Training Lines Duplicated,--,0

Default Privacy Protections,Advanced Protections

Field,Unique,Missing,Ave. Length,Type,Distribution Stability
oldpeak,47,0,3.03,Numeric,Excellent
cholesterol,184,0,5.0,Numeric,Excellent
max_heart_rate,116,0,4.92,Numeric,Excellent
resting_bp,57,0,4.99,Numeric,Excellent
age,50,0,2.0,Numeric,Excellent
exercise_angina,2,0,1.0,Binary,Excellent
chest_pain_type,4,0,1.0,Categorical,Excellent
resting_ecg,3,0,1.0,Categorical,Excellent
sex,2,0,1.0,Binary,Excellent
ST_slope,3,0,1.0,Categorical,Excellent


In [7]:
# inspect the synthetic data used to create the report
df_synth_report = trained.fetch_report_synthetic_data()
df_synth_report.head()

,age,sex,chest_pain_type,resting_bp,cholesterol,fasting_blood_sugar,resting_ecg,max_heart_rate,exercise_angina,oldpeak,ST_slope,target
0,69,1,4,170.0,237.0,1,1,158.0,0,-1.0,2,1
1,54,1,4,130.0,226.0,0,0,121.0,1,3.75,3,1
2,60,1,4,140.0,237.0,1,0,115.0,1,1.0,2,1
3,57,1,3,144.0,195.0,0,2,160.0,1,3.0,2,1
4,55,0,4,110.0,250.0,0,0,160.0,1,1.0,2,1


In [8]:
df_synth_report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  5000 non-null   int64  
 1   sex                  5000 non-null   int64  
 2   chest_pain_type      5000 non-null   int64  
 3   resting_bp           5000 non-null   float64
 4   cholesterol          5000 non-null   float64
 5   fasting_blood_sugar  5000 non-null   int64  
 6   resting_ecg          5000 non-null   int64  
 7   max_heart_rate       5000 non-null   float64
 8   exercise_angina      5000 non-null   int64  
 9   oldpeak              5000 non-null   float64
 10  ST_slope             5000 non-null   int64  
 11  target               5000 non-null   int64  
dtypes: float64(4), int64(8)
memory usage: 468.9 KB
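
The `info()` output lists `resting_bp`, `cholesterol`, and `max_heart_rate` as `float64`, although the values shown by `head()` are whole numbers. If integer columns are preferred downstream, one option is to convert float columns that contain only whole values. This is a sketch under that assumption: the helper `downcast_whole_floats` is hypothetical, and the sample frame reuses three rows from the `head()` output rather than the full synthetic dataset.

```python
import pandas as pd


def downcast_whole_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Convert float columns holding only whole numbers to nullable Int64."""
    out = df.copy()
    for col in out.select_dtypes(include="float").columns:
        values = out[col].dropna()
        # A column qualifies only if every non-missing value has no fractional part.
        if (values % 1 == 0).all():
            out[col] = out[col].astype("Int64")
    return out


# First three rows of the head() output, restricted to three float columns.
sample = pd.DataFrame(
    {
        "resting_bp": [170.0, 130.0, 140.0],
        "cholesterol": [237.0, 226.0, 237.0],
        "oldpeak": [-1.0, 3.75, 1.0],
    }
)
converted = downcast_whole_floats(sample)
print(converted.dtypes)
```

Here `oldpeak` stays `float64` because it contains a fractional value (3.75), while the other two columns become `Int64`.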


In [9]:
df_synth_report.shape

(5000, 12)
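
Beyond checking the shape, a quick sanity check is to confirm that the binary and categorical columns in the synthetic data do not contain more distinct values than the report's "Unique" column lists. The sketch below assumes those counts describe the valid categories. The helper `columns_exceeding_cardinality` is hypothetical, and the sample frame reuses the five rows from the `head()` output.

```python
import pandas as pd

# Unique counts taken from the report's field table for the non-numeric fields.
EXPECTED_MAX_UNIQUE = {
    "sex": 2,
    "exercise_angina": 2,
    "chest_pain_type": 4,
    "resting_ecg": 3,
    "ST_slope": 3,
}


def columns_exceeding_cardinality(df: pd.DataFrame, expected: dict) -> list:
    """List columns whose number of distinct values exceeds the expected maximum."""
    return [
        col
        for col, limit in expected.items()
        if col in df.columns and df[col].nunique() > limit
    ]


# The five rows shown by df_synth_report.head(), categorical columns only.
sample = pd.DataFrame(
    {
        "sex": [1, 1, 1, 1, 0],
        "chest_pain_type": [4, 4, 4, 3, 4],
        "resting_ecg": [1, 0, 0, 2, 0],
        "exercise_angina": [0, 1, 1, 1, 1],
        "ST_slope": [2, 3, 2, 2, 2],
    }
)
violations = columns_exceeding_cardinality(sample, EXPECTED_MAX_UNIQUE)
print(violations)
```

In the notebook, passing `df_synth_report` instead of `sample` would run the same check over all 5000 rows.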

# Save the results as a CSV

In [None]:
# Save the df_synth_report DataFrame as a CSV for analysis and modelling
df_synth_report.to_csv('df_synth_report.csv', index=False)
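
The `index=False` argument matters when the file is read back later: without it, pandas writes the row index as an extra unnamed column, which `read_csv` then loads as `Unnamed: 0`. A small sketch of the difference, using a temporary directory and a two-row stand-in frame rather than the real `df_synth_report`:

```python
import os
import tempfile

import pandas as pd

# Stand-in frame; the notebook would use df_synth_report instead.
sample = pd.DataFrame({"age": [69, 54], "oldpeak": [-1.0, 3.75]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    sample.to_csv(with_index)                  # default: index is written
    sample.to_csv(without_index, index=False)  # index is omitted

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```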