# What Is Synthetic Data? 

[Reference: Accelerating AI with Synthetic Data](https://learning.oreilly.com/library/view/accelerating-ai-with/9781492045991/) - [Khaled El Emam](https://learning.oreilly.com/search/?query=author%3A%22Khaled%20El%20Emam%22&sort=relevance&highlight=true)

At a conceptual level, synthetic data is not real data but is data that has been generated from real data and that has the same statistical properties as the real data. This means that an analyst who works with a synthetic dataset should get analysis results that are similar to those they would get with real data. The degree to which a synthetic dataset is an accurate proxy for real data is a measure of utility. Furthermore, we refer to the process of generating synthetic data as synthesis.

Data in this context can mean different things. For example, data can be structured data (i.e., rows and columns), as one would see in a relational database. Data can also be unstructured text, such as doctors’ notes, transcripts of conversations among people or with digital assistants, or online interactions by email or chat. Furthermore, images, videos, audio, and virtual environments are also types of data that can be synthesized. We have seen examples of fake images in the machine learning literature; for instance, realistic faces of people who do not exist in the real world can be created, and you can view the results online.


To create a simple synthetic dataset by sampling from a statistical distribution, follow these steps (source: [ChatGPT](https://chat.openai.com/chat)):

1. Define the problem and determine the type of data needed: Determine what kind of data is required for your problem and the type of distribution it should follow.
2. Select the appropriate statistical distribution: Choose a statistical distribution that best fits the data you want to generate. For example, if you want to generate data for a normally distributed variable, use the Gaussian distribution.
3. Set the parameters of the distribution: Determine the parameters of the distribution you have selected, such as the mean and standard deviation for a Gaussian.
4. Generate the data: Use a random number generator or a library in your preferred programming language to generate data samples from the distribution you have selected.
5. Validate the synthetic data: Verify that the generated data is similar to the real-world data. This can be done by comparing various statistical measures, such as mean, standard deviation, and distribution shape.
6. Save and use the synthetic data: Store the synthetic data in a file or database for future use.

Note: It is important to understand the underlying distribution of the real-world data to generate accurate synthetic data. In some cases, you may need to use multiple distributions to generate synthetic data that mimics real-world data.
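The steps above can be sketched with Python's standard library alone. This is a minimal illustration rather than a prescribed method: the "real" measurements, the variable names, and the choice of a Gaussian are all assumptions made for the example.

```python
import random
import statistics

# Hypothetical "real" sample (assumed for illustration): heights in cm
real_heights = [162.0, 171.5, 168.2, 175.9, 180.3, 165.4, 172.8, 169.1, 177.6, 170.2]

# Steps 2-3: assume a Gaussian and estimate its parameters from the real sample
mu = statistics.mean(real_heights)
sigma = statistics.stdev(real_heights)

# Step 4: draw synthetic samples with a seeded generator for reproducibility
rng = random.Random(42)
synthetic_heights = [rng.gauss(mu, sigma) for _ in range(10_000)]

# Step 5: validate by comparing summary statistics of real and synthetic data
syn_mu = statistics.mean(synthetic_heights)
syn_sigma = statistics.stdev(synthetic_heights)
print(f"real:      mean={mu:.2f}, std={sigma:.2f}")
print(f"synthetic: mean={syn_mu:.2f}, std={syn_sigma:.2f}")
```

Comparing only the mean and standard deviation is a weak check. For real work you would also compare the shape of the distribution, for example with histograms or quantiles.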

Synthetic data is divided into two types, based on whether it is generated from actual datasets or not.

1. The first type is synthesized from real datasets. The analyst will have some real datasets and then build a model to capture the distributions and structure of that real data.
2. The second type of synthetic data is not generated from real data. It is created by using existing models or by using background knowledge of the analyst.

|Type of synthetic data|Utility|
|:---|:---|
|Generated from real (nonpublic) datasets | Can be quite high |
|Generated from real public data | Can be high, although limitations exist because public data tends to be de-identified or aggregated |
| Generated from an existing model of a process, which can also be represented in a simulation engine | Will depend on the fidelity of the existing generating model|
|Based on analyst knowledge | Will depend on how well the analyst knows the domain and the complexity of the phenomenon|
|Generated from generic assumptions not specific to the phenomenon|Will likely be low|
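As a rough illustration of the second type (data generated without any real dataset), the sketch below simulates customer arrivals purely from an analyst's assumption. The arrival rate, the exponential model, and all names are hypothetical choices for this example. As the table notes, the utility of the result depends entirely on how well that assumption matches reality.

```python
import random
import statistics

# Analyst assumption (hypothetical): about 12 customer arrivals per hour
ARRIVALS_PER_HOUR = 12.0

rng = random.Random(7)

# Simulate gaps between arrivals (in hours) from an exponential distribution,
# a common model for a memoryless arrival process
gaps = [rng.expovariate(ARRIVALS_PER_HOUR) for _ in range(5_000)]

# Turn the gaps into cumulative arrival timestamps (hours since opening)
arrival_times = []
clock = 0.0
for gap in gaps:
    clock += gap
    arrival_times.append(clock)

mean_gap_minutes = statistics.mean(gaps) * 60
print(f"mean gap between arrivals: {mean_gap_minutes:.2f} minutes")
```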

# Faker

[Faker Docs](https://faker.readthedocs.io/en/master/)

Synthetic data generation is used across several academic fields, including computer science, statistics, and artificial intelligence.

In computer science, it can be used to train machine learning models or to test software systems. In statistics, it can be used to generate datasets with specific statistical properties or data that adheres to a particular model. In artificial intelligence, it can be used to build more diverse and representative datasets for training machine learning models.

Faker is a Python library that generates fake but realistic-looking values, such as names, addresses, and emails, which can be used to test or train algorithms or systems.

## Generate synthetic data with Python Faker

## Example 1: fake.name(), fake.address(), fake.email(), fake.phone_number()

In [1]:
#%pip install faker
from faker import Faker
import pandas as pd # for data manipulation

In [2]:
# Instantiate Faker() instance
fake = Faker()

# Create a dataset including names, addresses, emails, and phone numbers
data = []
for _ in range(10):
    data.append([fake.name(), fake.address(), fake.email(), fake.phone_number()])

# Convert to DataFrame
df = pd.DataFrame(data, columns=['Name', 'Address', 'Email', 'Phone'])
df.head()

| |Name|Address|Email|Phone|
|:---|:---|:---|:---|:---|
|0|Natalie Duran|PSC 6735, Box 9068\nAPO AA 02356|howellchristina@example.com|004-842-2994x8563|
|1|Justin Padilla|697 Pope Village Apt. 133\nPort Tina, MD 48474|gwilliams@example.net|368.636.2946|
|2|Victoria Williams|29458 Ronald Curve\nGreenbury, WV 93668|vrodriguez@example.org|249.446.4119|
|3|Timothy Vance|USNS Ramsey\nFPO AP 08357|daleroach@example.net|(372)484-5395x95656|
|4|David Black|118 Sean Island Suite 222\nNew Patriciaport, W...|ihendricks@example.org|319-347-3306|


Faker provides a number of callable functions, grouped into providers, that generate random data for you. In the code above, I used functions from the standard person, address, internet, and phone_number providers to generate names, physical mailing addresses, email addresses, and phone numbers.

## Example 2: fake.profile()
Let's use another provider: profile, and see what data we can generate.

In [3]:
# Create a list of fake profiles
profiles = []

for _ in range(10):
    profiles.append(fake.profile())

# Save as a DataFrame
df2 = pd.DataFrame(profiles, columns = profiles[0].keys())
df2.head()

| |job|company|ssn|residence|current_location|blood_group|website|username|name|sex|address|mail|birthdate|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|Water quality scientist|Finley, Martin and Hamilton|184-38-1528|33433 Newman Port Suite 055\nSouth Andreabury,...|(-2.381543, -88.404296)|A+|[http://jackson.net/, https://www.wilson-moral...|jweber|Paul Bowman|M|70814 Hancock Hill Suite 503\nLake Brian, HI 2...|emilyberger@gmail.com|1917-06-23|
|1|Psychiatrist|Stewart-Wilson|813-01-0199|88251 Rogers Harbors\nEast Billy, VT 21547|(-11.2108775, -167.749415)|AB-|[http://www.scott.com/, https://roth.com/]|sharplori|Ronald Burns|M|99999 Rose Mount\nWest Anthony, AS 40365|charles51@yahoo.com|2007-02-22|
|2|Pharmacist, hospital|Rodriguez-Chandler|773-27-9043|015 Berg Extensions Suite 494\nNorth Christoph...|(21.2930435, 7.645400)|O-|[http://www.price.com/, http://www.nguyen-shaw...|plin|Robert Torres|M|7012 Dennis Overpass\nGutierrezview, SC 27473|joshua83@gmail.com|1917-11-06|
|3|Legal secretary|Anderson, Jones and Day|749-54-9049|11572 Peter Plains Apt. 476\nPatriciaview, MO ...|(81.8625765, -141.863404)|A+|[http://www.patel.com/, https://austin-roberts...|hillsandra|Christopher Parker|M|99297 Martin Coves Apt. 307\nMichaelmouth, WI ...|owright@yahoo.com|1935-11-27|
|4|Psychiatrist|Howell PLC|071-17-6919|198 Theresa Orchard Apt. 797\nNorth Davidfort,...|(10.929648, -76.927497)|B-|[http://www.russell.com/, https://www.smith-ca...|melissaparrish|Philip Murillo|M|1454 Jeffrey Harbors Apt. 291\nSimpsonhaven, A...|donald11@gmail.com|1977-12-20|


As you can see from the output, there's a lot of information. Let's take a look at an individual profile:

In [4]:
fake.profile()

{'job': 'Farm manager',
 'company': 'Hill Ltd',
 'ssn': '039-03-3022',
 'residence': '04481 Karen Springs Suite 061\nGeorgeburgh, PR 61159',
 'current_location': (Decimal('-85.390027'), Decimal('-29.155379')),
 'blood_group': 'O-',
 'website': ['https://www.callahan-logan.com/',
  'https://www.johnson-west.com/'],
 'username': 'brandon61',
 'name': 'Shannon Harrington',
 'sex': 'F',
 'address': '24924 Michael Circle\nPhiliptown, NC 91944',
 'mail': 'keithnicholas@gmail.com',
 'birthdate': datetime.date(1918, 12, 14)}

## Example 3: customize fake.profile(fields = [])
Depending on the columns you actually want for your fake profiles, you can list whichever attributes you're interested in using the `fields` argument.

In [5]:
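# Create fake profiles using specific columns
# Note: "occupation" is not a profile key (the occupation is stored under "job"),
# so profile() silently drops it, as the output below shows
profiles2 = []

for _ in range(10):
    profiles2.append(fake.profile(fields = ["name", "sex", "occupation", "blood_group", "birthdate"]))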

df3 = pd.DataFrame(profiles2, columns = profiles2[0].keys())
df3.head()

| |blood_group|name|sex|birthdate|
|:---|:---|:---|:---|:---|
|0|O+|Andrew Brown|M|2016-10-30|
|1|B+|William Sims|M|2002-12-14|
|2|AB+|Philip Moore|M|1947-03-03|
|3|AB-|Gwendolyn Miller|F|1929-01-24|
|4|AB+|Hannah Jackson|F|1977-12-27|


## DynamicProvider: customizable provider

In [6]:
from faker.providers import DynamicProvider

In [7]:
df_museums = pd.read_csv('../../Data/museums.csv')

In [8]:
# Get a list of unique museum names from the existing dataset
# (a list keeps a stable order, unlike a set, and dropna() skips missing names)
museum_list = df_museums["Museum Name"].dropna().unique().tolist()

# Create museum_provider
museum_provider = DynamicProvider(
    provider_name="museum_provider",
    elements=museum_list,
)

# Instantiate new Faker() instance
fake_more = Faker()

# Add new provider
fake_more.add_provider(museum_provider)

# Use new provider
fake_more.museum_provider()

'MUSEUM OF LIFESTYLE & FASHION HISTORY'

In [18]:
fake_more.providers

[<faker.providers.DynamicProvider at 0x174676640>,
 <faker.providers.user_agent.Provider at 0x174687b50>,
 <faker.providers.ssn.en_US.Provider at 0x174687a60>,
 <faker.providers.python.Provider at 0x174687ac0>,
 <faker.providers.profile.Provider at 0x174687a00>,
 <faker.providers.phone_number.en_US.Provider at 0x1746877f0>,
 <faker.providers.person.en_US.Provider at 0x1746872e0>,
 <faker.providers.misc.en_US.Provider at 0x1746871c0>,
 <faker.providers.lorem.en_US.Provider at 0x174687880>,
 <faker.providers.job.en_US.Provider at 0x174687280>,
 <faker.providers.isbn.Provider at 0x174687370>,
 <faker.providers.internet.en_US.Provider at 0x174687340>,
 <faker.providers.geo.en_US.Provider at 0x174687490>,
 <faker.providers.file.Provider at 0x1746873a0>,
 <faker.providers.emoji.Provider at 0x174687520>,
 <faker.providers.date_time.en_US.Provider at 0x1746874f0>,
 <faker.providers.currency.en_US.Provider at 0x1746874c0>,
 <faker.providers.credit_card.en_US.Provider at 0x174649b20>,
 <faker.pr

In this dummy example, I took an existing [dataset on museums](https://www.kaggle.com/datasets/imls/museum-directory?resource=download), extracted just the names, and in a couple of lines of code created a new provider that returns a randomly chosen museum name from the data I provided. The same approach can be applied to any other existing dataset that you have.

## Python Faker providers: standard vs. community
To see which providers are available, inspect the `providers` attribute of a `Faker()` instance (here, `fake`). The functions of every built-in provider can be called directly on the instance, as we did above, without any additional import statements.

In [9]:
# Get full list of built-in providers
fake.providers

[<faker.providers.user_agent.Provider at 0x174593eb0>,
 <faker.providers.ssn.en_US.Provider at 0x174593e20>,
 <faker.providers.python.Provider at 0x174593ee0>,
 <faker.providers.profile.Provider at 0x174593dc0>,
 <faker.providers.phone_number.en_US.Provider at 0x174593ca0>,
 <faker.providers.person.en_US.Provider at 0x174593b20>,
 <faker.providers.misc.en_US.Provider at 0x174593a30>,
 <faker.providers.lorem.en_US.Provider at 0x174593880>,
 <faker.providers.job.en_US.Provider at 0x1745936d0>,
 <faker.providers.isbn.Provider at 0x1745936a0>,
 <faker.providers.internet.en_US.Provider at 0x174593640>,
 <faker.providers.geo.en_US.Provider at 0x1745932b0>,
 <faker.providers.file.Provider at 0x1745934f0>,
 <faker.providers.emoji.Provider at 0x174593430>,
 <faker.providers.date_time.en_US.Provider at 0x174593040>,
 <faker.providers.currency.en_US.Provider at 0x1742aee20>,
 <faker.providers.credit_card.en_US.Provider at 0x1742ae610>,
 <faker.providers.company.en_US.Provider at 0x1745708b0>,
 <f

Beyond the basic providers, there are also community-developed providers, such as:

- faker_airtravel: airport and flight information
- faker_music: music genres, subgenres, and instrument information
- faker_vehicle: year, make, model, and other vehicle information

But you will have to install and import community providers separately:

In [10]:
#%pip install faker_airtravel

In [11]:
from faker import Faker
from faker_airtravel import AirTravelProvider
fake.add_provider(AirTravelProvider)

See the Python Faker [GitHub repository](https://github.com/joke2k/faker) and [documentation](https://faker.readthedocs.io/en/master/) for more.

In [12]:
#%pip install faker_vehicle
#%pip install faker_music

In [13]:
import faker_music
from faker_music import MusicProvider
fake.add_provider(MusicProvider)

In [14]:
import faker_vehicle
from faker_vehicle import VehicleProvider
fake.add_provider(VehicleProvider)

In [19]:
fake.providers

[<faker_vehicle.VehicleProvider at 0x1753d7130>,
 <faker_music.music.MusicProvider at 0x1742aea90>,
 <faker_airtravel.airports.AirTravelProvider at 0x1742aef70>,
 <faker.providers.user_agent.Provider at 0x174593eb0>,
 <faker.providers.ssn.en_US.Provider at 0x174593e20>,
 <faker.providers.python.Provider at 0x174593ee0>,
 <faker.providers.profile.Provider at 0x174593dc0>,
 <faker.providers.phone_number.en_US.Provider at 0x174593ca0>,
 <faker.providers.person.en_US.Provider at 0x174593b20>,
 <faker.providers.misc.en_US.Provider at 0x174593a30>,
 <faker.providers.lorem.en_US.Provider at 0x174593880>,
 <faker.providers.job.en_US.Provider at 0x1745936d0>,
 <faker.providers.isbn.Provider at 0x1745936a0>,
 <faker.providers.internet.en_US.Provider at 0x174593640>,
 <faker.providers.geo.en_US.Provider at 0x1745932b0>,
 <faker.providers.file.Provider at 0x1745934f0>,
 <faker.providers.emoji.Provider at 0x174593430>,
 <faker.providers.date_time.en_US.Provider at 0x174593040>,
 <faker.providers.cu

In [17]:
fake.

'USNS Stewart\nFPO AP 24678'