# What Is Synthetic Data? 

[Reference: Accelerating AI with Synthetic Data](https://learning.oreilly.com/library/view/accelerating-ai-with/9781492045991/) - [Khaled El Emam](https://learning.oreilly.com/search/?query=author%3A%22Khaled%20El%20Emam%22&sort=relevance&highlight=true)

At a conceptual level, synthetic data is not real data but is data that has been generated from real data and that has the same statistical properties as the real data. This means that an analyst who works with a synthetic dataset should get analysis results that are similar to those they would get with real data. The degree to which a synthetic dataset is an accurate proxy for real data is a measure of utility. Furthermore, we refer to the process of generating synthetic data as synthesis.

Data in this context can mean different things. For example, data can be structured data (i.e., rows and columns), as one would see in a relational database. Data can also be unstructured text, such as doctors’ notes, transcripts of conversations among people or with digital assistants, or online interactions by email or chat. Furthermore, images, videos, audio, and virtual environments are also types of data that can be synthesized. We have seen examples of fake images in the machine learning literature; for instance, realistic faces of people who do not exist in the real world can be created, and you can view the results online.


To create a synthetic dataset, follow these steps (source: [ChatGPT](https://chat.openai.com/chat)):

1. Define the problem and determine the type of data needed: Determine what kind of data is required for your problem and the type of distribution it should follow.
2. Select the appropriate statistical distribution: Choose a statistical distribution that best fits the data you want to generate. For example, if you want to generate data for a normally distributed variable, use the Gaussian distribution.
3. Set the parameters of the distribution: Determine the parameter values for the distribution you have selected, for example the mean and standard deviation of a Gaussian.
4. Generate the data: Use a random number generator or a library in your preferred programming language to generate data samples from the distribution you have selected.
5. Validate the synthetic data: Verify that the generated data is similar to the real-world data. This can be done by comparing various statistical measures, such as mean, standard deviation, and distribution shape.
6. Save and use the synthetic data: Store the synthetic data in a file or database for future use.

Note: It is important to understand the underlying distribution of the real-world data to generate accurate synthetic data. In some cases, you may need to use multiple distributions to generate synthetic data that mimics real-world data.
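
As a minimal sketch of steps 2 through 5 for a single Gaussian variable, the following uses only Python's standard library. The `real_heights_cm` values are made up for illustration, and the variable names, seed, and sample size are assumptions rather than part of any particular workflow.

```python
import random
import statistics

# Hypothetical "real" measurements (made-up values for illustration only)
real_heights_cm = [162.1, 175.4, 168.9, 181.2, 170.3, 159.8, 177.6, 166.5, 172.0, 169.4]

# Step 3: estimate the Gaussian parameters from the real sample
mu = statistics.mean(real_heights_cm)
sigma = statistics.stdev(real_heights_cm)

# Step 4: draw synthetic samples from a Gaussian with those parameters
rng = random.Random(42)  # seeded so the run is repeatable
synthetic_heights_cm = [rng.gauss(mu, sigma) for _ in range(5000)]

# Step 5: validate by comparing summary statistics of real and synthetic data
synthetic_mu = statistics.mean(synthetic_heights_cm)
synthetic_sigma = statistics.stdev(synthetic_heights_cm)
print(f"real:      mean={mu:.2f}, std={sigma:.2f}")
print(f"synthetic: mean={synthetic_mu:.2f}, std={synthetic_sigma:.2f}")
```

With a real sample this small, the estimated parameters are themselves uncertain, so a close match here only shows that the generator reproduces the fitted distribution, not that the Gaussian was the right choice.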

Synthetic data is divided into two types, based on whether it is generated from actual datasets or not.

1. The first type is synthesized from real datasets. The analyst starts with one or more real datasets and builds a model that captures the distributions and structure of that data.
2. The second type is not generated from real data. It is created from existing models or from the analyst's background knowledge.

|Type of synthetic data|Utility|
|:---|:---|
|Generated from real (nonpublic) datasets | Can be quite high |
|Generated from real public data | Can be high, although limitations exist because public data tends to be de-identified or aggregated |
| Generated from an existing model of a process, which can also be represented in a simulation engine | Will depend on the fidelity of the existing generating model|
|Based on analyst knowledge | Will depend on how well the analyst knows the domain and the complexity of the phenomenon|
|Generated from generic assumptions not specific to the phenomenon|Will likely be low|
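
To illustrate the second type, here is a small sketch that generates data from an assumed process model rather than from a real dataset. It simulates customer arrival times as a Poisson process; the rate of 4 arrivals per hour stands in for analyst background knowledge and is purely hypothetical, as are the variable names.

```python
import random

rng = random.Random(7)  # seeded so the run is repeatable

ARRIVALS_PER_HOUR = 4.0  # assumed from background knowledge, not fitted to data
SIM_HOURS = 1000.0       # length of the simulated period

# In a Poisson process, gaps between arrivals are exponentially distributed
arrival_times = []
t = rng.expovariate(ARRIVALS_PER_HOUR)
while t < SIM_HOURS:
    arrival_times.append(t)
    t += rng.expovariate(ARRIVALS_PER_HOUR)

observed_rate = len(arrival_times) / SIM_HOURS
print(f"simulated {len(arrival_times)} arrivals, about {observed_rate:.2f} per hour")
```

How useful such data is depends on how well the assumed rate and the Poisson model match the real process, which is the point the table makes about model fidelity and analyst knowledge.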

# Faker

[Faker Docs](https://faker.readthedocs.io/en/master/)

Synthetic data generation is used across several academic fields, including computer science, statistics, and artificial intelligence.

In computer science, it can be used to train machine learning models or to test software systems. In statistics, it can produce datasets with chosen statistical properties or data that follows a particular model. In artificial intelligence, it can provide more diverse and representative datasets for training machine learning models.

In each case, the goal is data that can be used to test or train algorithms or systems.

## Generate synthetic data with Python Faker

## Example 1: fake.name(), fake.address(), fake.email(), fake.phone_number()

In [3]:
#%pip install faker
from faker import Faker
import pandas as pd # for data manipulation

In [4]:
# Instantiate Faker() instance
fake = Faker()

# Create a dataset including names, addresses, emails, and phone numbers
data = []
for _ in range(10):
    data.append([fake.name(), fake.address(), fake.email(), fake.phone_number()])

# Convert to DataFrame
df = pd.DataFrame(data, columns=['Name', 'Address', 'Email', 'Phone'])
df.head()

| |Name|Address|Email|Phone|
|:---|:---|:---|:---|:---|
|0|Mark Mitchell|584 Rebecca Curve\nPort Jessica, WV 76188|cbrown@example.com|+1-709-223-0764x617|
|1|Julia Parker|84809 Megan Tunnel\nWest Angela, NY 23621|newtondustin@example.net|717-181-2111|
|2|Matthew Gonzalez|74744 Mariah Extension Apt. 996\nLake Tonyside, PA 01028|collinschelsea@example.net|+1-996-732-7982x2493|
|3|Cindy Wallace|804 Dennis Ports\nWest Mark, RI 41849|erin65@example.net|746.878.8865x2503|
|4|Patricia Sweeney|16843 Turner Mills\nNew Derrick, AK 10627|leonardjohnson@example.net|5409020195|


In [18]:
fake.providers

[<faker.providers.user_agent.Provider at 0x172e429d0>,
 <faker.providers.ssn.en_US.Provider at 0x172e42940>,
 <faker.providers.python.Provider at 0x172e42a00>,
 <faker.providers.profile.Provider at 0x172e428e0>,
 <faker.providers.phone_number.en_US.Provider at 0x172e427c0>,
 <faker.providers.person.en_US.Provider at 0x172e42640>,
 <faker.providers.misc.en_US.Provider at 0x172e42550>,
 <faker.providers.lorem.en_US.Provider at 0x172e423a0>,
 <faker.providers.job.en_US.Provider at 0x172e421f0>,
 <faker.providers.isbn.Provider at 0x172e421c0>,
 <faker.providers.internet.en_US.Provider at 0x172e42160>,
 <faker.providers.geo.en_US.Provider at 0x1720e2610>,
 <faker.providers.file.Provider at 0x1720e2940>,
 <faker.providers.emoji.Provider at 0x1720e2580>,
 <faker.providers.date_time.en_US.Provider at 0x172e1d910>,
 <faker.providers.currency.en_US.Provider at 0x172e1dd90>,
 <faker.providers.credit_card.en_US.Provider at 0x172e1dbb0>,
 <faker.providers.company.en_US.Provider at 0x172e1db20>,
 <f

The Faker package groups its data-generating methods into classes called providers. In Example 1, I used methods from the standard person, address, internet, and phone number providers to generate names, physical mailing addresses, email addresses, and phone numbers.

## Example 2: fake.profile()
Let's use another provider: profile, and see what data we can generate.

In [5]:
# Create a list of fake profiles
profiles = []

for _ in range(10):
    profiles.append(fake.profile())

# Save as a DataFrame
df2 = pd.DataFrame(profiles, columns = profiles[0].keys())
df2.head()

| |job|company|ssn|residence|current_location|blood_group|website|username|name|sex|address|mail|birthdate|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|Horticulturist, commercial|Lynch Group|814-97-0579|30920 Gardner Coves Apt. 546\nBryanbury, VT 17671|(54.184749, 161.444703)|B+|[https://white-taylor.com/, https://kelley-reynolds.biz/]|christopherpeterson|Pam Conner|F|4934 Christopher Terrace Suite 004\nWest Amanda, RI 92088|mbarrera@gmail.com|1981-10-15|
|1|Oncologist|Cross-Cross|498-71-6922|579 Bob Island Suite 419\nBradleyburgh, HI 25576|(69.3295965, -29.756487)|A+|[https://king.com/, https://silva-bruce.org/]|kimberlyanderson|Robert Wallace|M|Unit 3754 Box 2163\nDPO AE 44468|jennifer06@yahoo.com|2016-11-08|
|2|Heritage manager|Farley LLC|266-67-8839|873 Harris Well Suite 661\nRobbinshaven, SD 09887|(-26.379812, -127.134039)|A+|[http://www.mckenzie-gallagher.org/, http://smith-nielsen.com/, https://gregory-franklin.com/, http://williams.biz/]|daniel02|Cameron Horne|M|2055 Caitlin Track\nSouth Jamie, NY 14047|uroberts@yahoo.com|2001-10-10|
|3|Engineer, aeronautical|Dennis-Holder|005-80-4211|1569 Jill Mountain Suite 451\nSouth Joshuastad, ME 14246|(7.229123, -32.277896)|O-|[https://cruz.org/]|shane19|Linda Armstrong|F|530 Lucas Road Suite 009\nWest Rachelview, NC 82802|barnettandrea@hotmail.com|1969-01-21|
|4|Public librarian|Vang, Williamson and Alvarado|683-98-0658|6639 Russell Crossing\nHoodburgh, MO 87971|(51.9849805, 88.212373)|O+|[http://www.cross.com/]|jnorris|Jacob Mclaughlin|M|328 Smith Land\nKelleyton, AR 84140|linda43@yahoo.com|1978-10-18|


As you can see from the output, there's a lot of information. Let's take a look at an individual profile:

In [4]:
fake.profile()

{'job': 'Farm manager',
 'company': 'Hill Ltd',
 'ssn': '039-03-3022',
 'residence': '04481 Karen Springs Suite 061\nGeorgeburgh, PR 61159',
 'current_location': (Decimal('-85.390027'), Decimal('-29.155379')),
 'blood_group': 'O-',
 'website': ['https://www.callahan-logan.com/',
  'https://www.johnson-west.com/'],
 'username': 'brandon61',
 'name': 'Shannon Harrington',
 'sex': 'F',
 'address': '24924 Michael Circle\nPhiliptown, NC 91944',
 'mail': 'keithnicholas@gmail.com',
 'birthdate': datetime.date(1918, 12, 14)}

## Example 3: customize fake.profile(fields = [])
If you only want some of these columns in your fake profiles, pass the names of the attributes you're interested in as a list to the `fields` argument.

In [10]:
# Create fake profiles using specific columns
profiles2 = []

for _ in range(10):
    profiles2.append(fake.profile(fields = ["name", "sex", "ssn", "blood_group", "birthdate"]))

df3 = pd.DataFrame(profiles2, columns = profiles2[0].keys())
df3.head()

| |ssn|blood_group|name|sex|birthdate|
|:---|:---|:---|:---|:---|:---|
|0|466-17-6145|B+|Sophia Davis|F|2013-09-19|
|1|333-23-5147|A-|Theresa Morrison|F|1980-02-25|
|2|761-08-9917|O-|George Adams|M|2009-08-12|
|3|749-87-6525|A-|Eric Brown|M|1932-09-29|
|4|239-45-3588|O-|Patrick Mendez|M|1973-03-05|


## DynamicProvider: customizable provider

In [13]:
from faker.providers import DynamicProvider

In [14]:
df_museums = pd.read_csv('../../Data/museums.csv')

In [16]:
# Get a list of unique museum names from the existing dataset (missing values dropped)
museum_list = df_museums["Museum Name"].dropna().unique().tolist()

# Create museum_provider
museum_provider = DynamicProvider(
     provider_name = "museum_provider",
     elements = museum_list,
)

# Instantiate new Faker() instance
fake_more = Faker()

# Add new provider
fake_more.add_provider(museum_provider)

# Use new provider
fake_more.museum_provider()

'BREMEN HISTORICAL SOCIETY'

In [17]:
fake_more.museum_provider()

'QUEPONCO RAILWAY STATION'

In [22]:
fake_more.providers

[<faker.providers.DynamicProvider at 0x17667bf10>,
 <faker.providers.user_agent.Provider at 0x175fe90d0>,
 <faker.providers.ssn.en_US.Provider at 0x175fe9790>,
 <faker.providers.python.Provider at 0x175fe9a00>,
 <faker.providers.profile.Provider at 0x175fe90a0>,
 <faker.providers.phone_number.en_US.Provider at 0x175fe9100>,
 <faker.providers.person.en_US.Provider at 0x175fe9d90>,
 <faker.providers.misc.en_US.Provider at 0x175fe96a0>,
 <faker.providers.lorem.en_US.Provider at 0x175fe9fa0>,
 <faker.providers.job.en_US.Provider at 0x175fe9ac0>,
 <faker.providers.isbn.Provider at 0x175fe9be0>,
 <faker.providers.internet.en_US.Provider at 0x175fe9ee0>,
 <faker.providers.geo.en_US.Provider at 0x175cd69d0>,
 <faker.providers.file.Provider at 0x175cd6520>,
 <faker.providers.emoji.Provider at 0x175cd6340>,
 <faker.providers.date_time.en_US.Provider at 0x175cd65b0>,
 <faker.providers.currency.en_US.Provider at 0x175cd66a0>,
 <faker.providers.credit_card.en_US.Provider at 0x176687850>,
 <faker.pr

In this toy example, I took an existing [dataset on museums](https://www.kaggle.com/datasets/imls/museum-directory?resource=download), extracted just the names, and, in two lines of code, created a new provider that returns a random museum name from the data I supplied. The same approach can be applied to any other existing dataset that you have.

## Python Faker providers: standard vs. community
To see which providers are available, use the following line of code. Note that it reads the `providers` attribute of a Faker() instance, here called `fake`. The methods of all these standard providers can be called as we did above, without any additional import statements.

In [9]:
# Get full list of built-in providers
fake.providers

[<faker.providers.user_agent.Provider at 0x174593eb0>,
 <faker.providers.ssn.en_US.Provider at 0x174593e20>,
 <faker.providers.python.Provider at 0x174593ee0>,
 <faker.providers.profile.Provider at 0x174593dc0>,
 <faker.providers.phone_number.en_US.Provider at 0x174593ca0>,
 <faker.providers.person.en_US.Provider at 0x174593b20>,
 <faker.providers.misc.en_US.Provider at 0x174593a30>,
 <faker.providers.lorem.en_US.Provider at 0x174593880>,
 <faker.providers.job.en_US.Provider at 0x1745936d0>,
 <faker.providers.isbn.Provider at 0x1745936a0>,
 <faker.providers.internet.en_US.Provider at 0x174593640>,
 <faker.providers.geo.en_US.Provider at 0x1745932b0>,
 <faker.providers.file.Provider at 0x1745934f0>,
 <faker.providers.emoji.Provider at 0x174593430>,
 <faker.providers.date_time.en_US.Provider at 0x174593040>,
 <faker.providers.currency.en_US.Provider at 0x1742aee20>,
 <faker.providers.credit_card.en_US.Provider at 0x1742ae610>,
 <faker.providers.company.en_US.Provider at 0x1745708b0>,
 <f

Beyond the standard providers, there are also community-developed providers, such as:

- faker_airtravel: airport and flight information
- faker_music: music genres, subgenres, and instrument information
- faker_vehicle: year, make, model, and other vehicle information

Community providers must be installed and imported separately:

In [10]:
#%pip install faker_airtravel

In [20]:
from faker import Faker
from faker_airtravel import AirTravelProvider
fake.add_provider(AirTravelProvider)

See Python Faker's [GitHub repository](https://github.com/joke2k/faker) and [documentation](https://faker.readthedocs.io/en/master/) for more.

In [21]:
#%pip install faker_vehicle
#%pip install faker_music

In [22]:
import faker_music
from faker_music import MusicProvider
fake.add_provider(MusicProvider)

In [23]:
import faker_vehicle
from faker_vehicle import VehicleProvider
fake.add_provider(VehicleProvider)

In [24]:
fake.providers

[<faker_vehicle.VehicleProvider at 0x172f2f0d0>,
 <faker_music.music.MusicProvider at 0x173bc4700>,
 <faker_airtravel.airports.AirTravelProvider at 0x173bd2e20>,
 <faker.providers.user_agent.Provider at 0x172e429d0>,
 <faker.providers.ssn.en_US.Provider at 0x172e42940>,
 <faker.providers.python.Provider at 0x172e42a00>,
 <faker.providers.profile.Provider at 0x172e428e0>,
 <faker.providers.phone_number.en_US.Provider at 0x172e427c0>,
 <faker.providers.person.en_US.Provider at 0x172e42640>,
 <faker.providers.misc.en_US.Provider at 0x172e42550>,
 <faker.providers.lorem.en_US.Provider at 0x172e423a0>,
 <faker.providers.job.en_US.Provider at 0x172e421f0>,
 <faker.providers.isbn.Provider at 0x172e421c0>,
 <faker.providers.internet.en_US.Provider at 0x172e42160>,
 <faker.providers.geo.en_US.Provider at 0x1720e2610>,
 <faker.providers.file.Provider at 0x1720e2940>,
 <faker.providers.emoji.Provider at 0x1720e2580>,
 <faker.providers.date_time.en_US.Provider at 0x172e1d910>,
 <faker.providers.cu

In [25]:
fake.airline()

'Tropic Air'

In [30]:
#todo: create a fake dataset with a specific distribution starting with Gaussian
