# What Is Synthetic Data? 

[Reference: Accelerating AI with Synthetic Data](https://learning.oreilly.com/library/view/accelerating-ai-with/9781492045991/) - [Khaled El Emam](https://learning.oreilly.com/search/?query=author%3A%22Khaled%20El%20Emam%22&sort=relevance&highlight=true)

At a conceptual level, synthetic data is not real data but is data that has been generated from real data and that has the same statistical properties as the real data. This means that an analyst who works with a synthetic dataset should get analysis results that are similar to those they would get with real data. The degree to which a synthetic dataset is an accurate proxy for real data is a measure of utility. Furthermore, we refer to the process of generating synthetic data as synthesis.

Data in this context can mean different things. For example, data can be structured data (i.e., rows and columns), as one would see in a relational database. Data can also be unstructured text, such as doctors’ notes, transcripts of conversations among people or with digital assistants, or online interactions by email or chat. Furthermore, images, videos, audio, and virtual environments are also types of data that can be synthesized. We have seen examples of fake images in the machine learning literature; for instance, realistic faces of people who do not exist in the real world can be created, and you can view the results online.


To create a synthetic dataset, follow these steps (source: [ChatGPT](https://chat.openai.com/chat)):

1. Define the problem and determine the type of data needed: Determine what kind of data is required for your problem and the type of distribution it should follow.
2. Select the appropriate statistical distribution: Choose a statistical distribution that best fits the data you want to generate. For example, if you want to generate data for a normally distributed variable, use the Gaussian distribution.
3. Set the parameters of the distribution: Determine the parameters of the distribution you have selected, such as the mean and standard deviation for a Gaussian distribution.
4. Generate the data: Use a random number generator or a library in your preferred programming language to generate data samples from the distribution you have selected.
5. Validate the synthetic data: Verify that the generated data is similar to the real-world data. This can be done by comparing various statistical measures, such as mean, standard deviation, and distribution shape.
6. Save and use the synthetic data: Store the synthetic data in a file or database for future use.

Note: It is important to understand the underlying distribution of the real-world data to generate accurate synthetic data. In some cases, you may need to use multiple distributions to generate synthetic data that mimics real-world data.
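The six steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed method: the variable being modeled (adult height in centimeters) and the parameter values `MEAN_CM` and `STD_CM` are assumptions made up for the example, and the data is written to an in-memory buffer rather than to a real file.

```python
import csv
import io
import random
import statistics

# Steps 1-3: assume (hypothetically) that the variable is adult height in cm
# and that it is roughly Gaussian with these made-up parameters.
MEAN_CM = 170.0
STD_CM = 10.0

# Step 4: draw samples from a seeded generator so the run is repeatable.
rng = random.Random(42)
heights = [rng.gauss(MEAN_CM, STD_CM) for _ in range(10_000)]

# Step 5: validate by comparing sample statistics with the target parameters.
sample_mean = statistics.mean(heights)
sample_std = statistics.stdev(heights)
print(f"sample mean = {sample_mean:.2f}, sample std = {sample_std:.2f}")

# Step 6: store the data as CSV. An in-memory buffer stands in for a file here;
# in practice you would open a file path instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["height_cm"])
writer.writerows([f"{h:.1f}"] for h in heights)
```

With real data available, step 5 would compare against statistics computed from that data rather than against the parameters chosen in step 3.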

Synthetic data is divided into two types, based on whether it is generated from actual datasets or not.

1. The first type is synthesized from real datasets. The analyst will have some real datasets and then build a model to capture the distributions and structure of that real data.
2. The second type of synthetic data is not generated from real data. It is created by using existing models or by using background knowledge of the analyst.

|Type of synthetic data|Utility|
|:---|:---|
|Generated from real (nonpublic) datasets | Can be quite high |
|Generated from real public data | Can be high, although limitations exist because public data tends to be de-identified or aggregated |
| Generated from an existing model of a process, which can also be represented in a simulation engine | Will depend on the fidelity of the existing generating model|
|Based on analyst knowledge | Will depend on how well the analyst knows the domain and the complexity of the phenomenon|
|Generated from generic assumptions not specific to the phenomenon|Will likely be low|
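The first type of synthetic data (synthesized from a real dataset) can be sketched as follows: fit a simple model to the real data, then sample new records from the fitted model. This is only an illustration under strong assumptions. The "real" rows are invented for the example, and a bivariate Gaussian is a deliberately simple stand-in for the richer models used in practice. Comparing the correlation in the real and synthetic data gives a rough sense of utility.

```python
import math
import random

# Hypothetical "real" dataset: (age in years, systolic blood pressure) pairs.
real_rows = [(25, 118), (32, 121), (41, 127), (48, 130),
             (55, 138), (63, 142), (70, 150)]


def fit_bivariate_gaussian(rows):
    """Estimate means, variances, and covariance from two-column data."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    return mean_x, mean_y, var_x, var_y, cov_xy


def sample_rows(params, n, rng):
    """Draw n synthetic rows using the Cholesky factor of the 2x2 covariance."""
    mean_x, mean_y, var_x, var_y, cov_xy = params
    l11 = math.sqrt(var_x)
    l21 = cov_xy / l11
    l22 = math.sqrt(max(var_y - l21 ** 2, 0.0))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mean_x + l11 * z1, mean_y + l21 * z1 + l22 * z2))
    return rows


def correlation(rows):
    """Pearson correlation between the two columns."""
    _, _, var_x, var_y, cov_xy = fit_bivariate_gaussian(rows)
    return cov_xy / math.sqrt(var_x * var_y)


params = fit_bivariate_gaussian(real_rows)
synthetic_rows = sample_rows(params, 5_000, random.Random(0))

print(f"real correlation:      {correlation(real_rows):.3f}")
print(f"synthetic correlation: {correlation(synthetic_rows):.3f}")
```

The synthetic rows contain none of the original records, yet an analysis of the age and blood pressure relationship should give similar results on both datasets.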

# Faker

Synthetic data generation is used in several academic fields, including computer science, statistics, and artificial intelligence.

In computer science, it can be used to train machine learning models or to test software systems. In statistics, it can be used to generate datasets that have certain statistical properties or that adhere to a particular model. In artificial intelligence, it can be used to build more diverse and representative datasets for training machine learning models.

## Generate synthetic data with Python Faker

## Example 1: fake.name(), fake.address(), fake.email(), fake.phone_number()

```python
#%pip install faker
from faker import Faker
import pandas as pd # for data manipulation
```

```python
# Instantiate Faker() instance
fake = Faker()

# Create a dataset including names, addresses, emails, and phone numbers
data = []
for _ in range(10):
    data.append([fake.name(), fake.address(), fake.email(), fake.phone_number()])

# Convert to DataFrame
df = pd.DataFrame(data, columns=['Name', 'Address', 'Email', 'Phone'])
df.head()
```

| |Name|Address|Email|Phone|
|:---|:---|:---|:---|:---|
|0|Amy Powell|4133 Daniels Inlet\nNorth Jeffrey, ND 44157|mary63@example.com|638.181.7475x36470|
|1|Christopher Davis|0342 Boyd Ramp\nNorth Donaldview, IA 88660|laceygonzalez@example.com|850.318.3804x3815|
|2|Maria Wilkerson|20855 Campbell Shoals\nGregorychester, WV 93427|matthewjohnson@example.org|(779)443-9176x70949|
|3|Jasmine Fitzgerald|5153 Mitchell Loaf Apt. 952\nLopezmouth, MN 12219|salinascurtis@example.com|+1-586-406-2337x062|
|4|Daniel Bailey|164 Jill Ports Suite 879\nPort Alanbury, AS 52202|jenniferfloyd@example.org|(553)636-2399|


Faker groups its data-generating methods into classes called providers, and each method returns a new random value every time it is called. In the code above, I used methods from the standard person, address, internet, and phone number providers to generate names, physical mailing addresses, email addresses, and phone numbers.

## Example 2: fake.profile()
Let's use another provider: profile, and see what data we can generate.

```python
# Create a list of fake profiles
profiles = []

for _ in range(10):
    profiles.append(fake.profile())

# Save as a DataFrame
df2 = pd.DataFrame(profiles, columns = profiles[0].keys())
df2.head()
```

| |job|company|ssn|residence|current_location|blood_group|website|username|name|sex|address|mail|birthdate|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|Pathologist|Morgan-Rodriguez|337-16-6505|27146 Thomas Lakes\nAngelashire, CO 11428|(-65.249321, -67.614365)|O-|[https://gordon-cohen.com/, https://baker.com/, https://matthews.com/]|stanleybenson|Emily Roberts|F|PSC 7554, Box 0302\nAPO AE 52908|anelson@gmail.com|2002-03-05|
|1|Educational psychologist|Young PLC|215-93-0814|89138 Kyle Groves Apt. 147\nRichardview, CT 86335|(-72.3012235, -51.064258)|B+|[https://garrison.com/, https://www.welch-nixon.com/, http://nelson-miller.com/]|richardsonanthony|Laura Pittman|F|8753 Julia Parkway\nWest Lisa, MA 31554|amy42@hotmail.com|1934-09-23|
|2|Broadcast engineer|Gibson and Sons|508-95-2290|278 Christian Mountain\nMichaelhaven, VA 28216|(79.158172, -173.776936)|A+|[http://www.bell.info/, http://www.ball-fischer.com/]|heatherjones|Mary Green MD|F|260 Thomas Roads Apt. 359\nGonzalezburgh, TN 14234|wagnerkelly@hotmail.com|2011-09-25|
|3|Teacher, adult education|Allen-Garcia|288-78-5024|PSC 9267, Box 9063\nAPO AE 23781|(-2.665059, -153.035046)|B-|[https://www.miller.com/, http://meza.com/, https://www.gould-wilson.net/]|higginsmadeline|Mitchell Davis|M|9174 Shawn Port\nAnthonytown, TX 49372|melvinserrano@hotmail.com|1961-05-05|
|4|Broadcast engineer|Brown Group|682-02-4814|0282 Cathy Grove\nWest Deborahborough, OH 05259|(-75.969342, 17.736184)|A+|[https://schroeder-brown.com/]|jaime32|Sylvia Peterson|F|291 Jordan Camp Suite 982\nStevenborough, WI 17703|robert46@yahoo.com|2010-04-04|


As you can see from the output, there's a lot of information. Let's take a look at an individual profile:

```python
fake.profile()
```

```
{'job': 'Financial controller',
 'company': 'Tate LLC',
 'ssn': '419-69-0984',
 'residence': '1634 Brown Brooks Apt. 160\nAndersonbury, RI 44456',
 'current_location': (Decimal('-50.698553'), Decimal('146.445818')),
 'blood_group': 'A+',
 'website': ['http://lee.com/'],
 'username': 'ashleyweber',
 'name': 'Joshua Taylor',
 'sex': 'M',
 'address': '997 Brittany Plaza Suite 980\nJacquelinefort, NC 36486',
 'mail': 'blin@gmail.com',
 'birthdate': datetime.date(1920, 1, 10)}
```

## Example 3: customize fake.profile(fields = [])
Depending on the columns you actually want for your fake profiles, you can list whichever attributes you're interested in using the `fields` argument.

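```python
# Create fake profiles using specific columns
# Note: "occupation" is not a key of the full profile (the job title is stored
# under "job"), so it is ignored and does not appear in the output.
profiles2 = []

for _ in range(10):
    profiles2.append(fake.profile(fields = ["name", "sex", "occupation", "blood_group", "birthdate"]))

df3 = pd.DataFrame(profiles2, columns = profiles2[0].keys())
df3.head()
```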

| |blood_group|name|sex|birthdate|
|:---|:---|:---|:---|:---|
|0|A-|Gloria Schwartz|F|1960-03-02|
|1|O+|Natasha Watson|F|1974-09-10|
|2|A-|Amber Peterson|F|1951-02-20|
|3|O+|George Wood|M|1948-07-01|
|4|AB+|Margaret Galvan|F|1995-12-21|


## DynamicProvider: customizable provider

```python
from faker.providers import DynamicProvider
```

```python
df_museums = pd.read_csv('../../Data/museums.csv')
```

```python
# Get unique list of museum names from existing dataset
# (a list rather than a set, matching the examples in the Faker documentation)
museum_list = df_museums["Museum Name"].unique().tolist()

# Create museum_provider
museum_provider = DynamicProvider(
     provider_name = "museum_provider",
     elements = museum_list,
)

# Instantiate new Faker() instance
fake_more = Faker()

# Add new provider
fake_more.add_provider(museum_provider)

# Use new provider
fake_more.museum_provider()
```

```
'DANFORTH MUSEUM OF ART'
```

In this dummy example, I took an existing [dataset on museums](https://www.kaggle.com/datasets/imls/museum-directory?resource=download), extracted just the names, and in 2 lines of code, created a new provider that will randomly generate a museum name based on the data I've provided it. This could be applied to any other existing dataset that you have.

## Python Faker providers: standard vs. community
To see which providers are loaded, inspect the `providers` attribute of a `Faker()` instance (here, `fake`). The methods of every provider in this list can be called directly on the instance, as we did above, without any additional import statements.

```python
# Get full list of built-in providers
fake.providers
```

```
[<faker.providers.user_agent.Provider at 0x16d8ea8e0>,
 <faker.providers.ssn.en_US.Provider at 0x16d8ea850>,
 <faker.providers.python.Provider at 0x16d8ea910>,
 <faker.providers.profile.Provider at 0x16d8ea7f0>,
 <faker.providers.phone_number.en_US.Provider at 0x16d8ea6d0>,
 <faker.providers.person.en_US.Provider at 0x16d8ea550>,
 <faker.providers.misc.en_US.Provider at 0x16d8ea460>,
 <faker.providers.lorem.en_US.Provider at 0x16d8ea2b0>,
 <faker.providers.job.en_US.Provider at 0x16d8ea100>,
 <faker.providers.isbn.Provider at 0x16d8ea0a0>,
 <faker.providers.internet.en_US.Provider at 0x16d8d5f70>,
 <faker.providers.geo.en_US.Provider at 0x16d8d5ca0>,
 <faker.providers.file.Provider at 0x16d8d5ee0>,
 <faker.providers.emoji.Provider at 0x16d8d5a60>,
 <faker.providers.date_time.en_US.Provider at 0x16d8d5be0>,
 <faker.providers.currency.en_US.Provider at 0x16d8d5d60>,
 <faker.providers.credit_card.en_US.Provider at 0x16d8d58b0>,
 <faker.providers.company.en_US.Provider at 0x16d8d5a90>,
 <f
```

Beyond the basic providers, there are also community-developed providers, such as:

- faker_airtravel: airport and flight information
- faker_music: music genres, subgenres, and instrument information
- faker_vehicle: year, make, model, and other vehicle information

But you will have to install and import community providers separately:

```python
#%pip install faker_airtravel
```

```python
from faker import Faker
from faker_airtravel import AirTravelProvider
fake.add_provider(AirTravelProvider)
```

Check out Python Faker's full [GitHub](https://github.com/joke2k/faker) and [documentation](https://faker.readthedocs.io/en/master/) for more.

```python
#%pip install faker_vehicle
#%pip install faker_music
```

```python
from faker_music import MusicProvider
fake.add_provider(MusicProvider)
```

```python
from faker_vehicle import VehicleProvider
fake.add_provider(VehicleProvider)
```

```python
fake.providers
```

```
[<faker_vehicle.VehicleProvider at 0x16f386f70>,
 <faker_music.music.MusicProvider at 0x16d5aea60>,
 <faker_airtravel.airports.AirTravelProvider at 0x16d8c22b0>,
 <faker.providers.user_agent.Provider at 0x16d8ea8e0>,
 <faker.providers.ssn.en_US.Provider at 0x16d8ea850>,
 <faker.providers.python.Provider at 0x16d8ea910>,
 <faker.providers.profile.Provider at 0x16d8ea7f0>,
 <faker.providers.phone_number.en_US.Provider at 0x16d8ea6d0>,
 <faker.providers.person.en_US.Provider at 0x16d8ea550>,
 <faker.providers.misc.en_US.Provider at 0x16d8ea460>,
 <faker.providers.lorem.en_US.Provider at 0x16d8ea2b0>,
 <faker.providers.job.en_US.Provider at 0x16d8ea100>,
 <faker.providers.isbn.Provider at 0x16d8ea0a0>,
 <faker.providers.internet.en_US.Provider at 0x16d8d5f70>,
 <faker.providers.geo.en_US.Provider at 0x16d8d5ca0>,
 <faker.providers.file.Provider at 0x16d8d5ee0>,
 <faker.providers.emoji.Provider at 0x16d8d5a60>,
 <faker.providers.date_time.en_US.Provider at 0x16d8d5be0>,
 <faker.providers.cu
```

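```python
# The method name is missing from the original cell. cryptocurrency() is assumed
# here because it returns a (code, name) tuple like the output shown below.
fake.cryptocurrency()
```

```
('NMC', 'Namecoin')
```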