# What Is Synthetic Data? 

[Reference: Accelerating AI with Synthetic Data](https://learning.oreilly.com/library/view/accelerating-ai-with/9781492045991/) - [Khaled El Emam](https://learning.oreilly.com/search/?query=author%3A%22Khaled%20El%20Emam%22&sort=relevance&highlight=true)

At a conceptual level, synthetic data is not real data; it is data that has been generated from real data and that has the same statistical properties as the real data. An analyst who works with a synthetic dataset should therefore get analysis results similar to those they would get with the real data. The degree to which a synthetic dataset is an accurate proxy for real data is called its utility, and the process of generating synthetic data is called synthesis.

Data in this context can mean different things. For example, data can be structured data (i.e., rows and columns), as one would see in a relational database. Data can also be unstructured text, such as doctors’ notes, transcripts of conversations among people or with digital assistants, or online interactions by email or chat. Furthermore, images, videos, audio, and virtual environments are also types of data that can be synthesized. We have seen examples of fake images in the machine learning literature; for instance, realistic faces of people who do not exist in the real world can be created, and you can view the results online.


To create a synthetic dataset, follow these steps (source: [ChatGPT](https://chat.openai.com/chat)):

1. Define the problem and determine the type of data needed: Determine what kind of data is required for your problem and the type of distribution it should follow.
2. Select the appropriate statistical distribution: Choose the statistical distribution that best fits the data you want to generate. For example, a variable whose values cluster symmetrically around a central value is often modeled with a Gaussian (normal) distribution.
3. Set the parameters of the distribution: Determine the parameters of the distribution you have selected, such as the mean and standard deviation for a Gaussian distribution.
4. Generate the data: Use a random number generator or a library in your preferred programming language to generate data samples from the distribution you have selected.
5. Validate the synthetic data: Verify that the generated data is similar to the real-world data. This can be done by comparing various statistical measures, such as mean, standard deviation, and distribution shape.
6. Save and use the synthetic data: Store the synthetic data in a file or database for future use.
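
The steps above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch under stated assumptions, not a definitive implementation: the "real" data is itself a simulated stand-in, the helper names (`generate_gaussian`, `is_similar`, `save_csv`), the tolerance of 10% of the real standard deviation, and the output path are all hypothetical choices made for illustration.

```python
import csv
import os
import random
import statistics
import tempfile


def generate_gaussian(mean, std, n, seed=0):
    """Step 4: draw n samples from a Gaussian with the given parameters."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


def is_similar(real, synthetic, tol=0.1):
    """Step 5: compare mean and standard deviation.

    The gap allowed for each statistic is tol times the real standard
    deviation (an assumed threshold, not a standard one).
    """
    scale = statistics.stdev(real)
    mean_gap = abs(statistics.mean(real) - statistics.mean(synthetic))
    std_gap = abs(scale - statistics.stdev(synthetic))
    return mean_gap <= tol * scale and std_gap <= tol * scale


def save_csv(values, path):
    """Step 6: store the samples in a one-column CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        writer.writerows([v] for v in values)


# Stand-in for real-world data (assumption: no real dataset is available here).
real = generate_gaussian(170.0, 10.0, 5000, seed=42)

# Steps 2 and 3: assume a Gaussian and estimate its parameters from the data.
mean, std = statistics.mean(real), statistics.stdev(real)

synthetic = generate_gaussian(mean, std, 5000, seed=1)
similar = is_similar(real, synthetic)

path = os.path.join(tempfile.gettempdir(), "synthetic.csv")
save_csv(synthetic, path)
```

Comparing only the mean and standard deviation is a weak check; it says nothing about the shape of the distribution, so a histogram or a distribution test would be a natural next step.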

Note: It is important to understand the underlying distribution of the real-world data to generate accurate synthetic data. In some cases, you may need to use multiple distributions to generate synthetic data that mimics real-world data.
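
One common way to combine multiple distributions is a mixture: each sample is drawn from one of several component distributions, chosen at random according to a weight. The sketch below assumes two Gaussian components with made-up weights and parameters; the function name `sample_mixture` and all numeric values are illustrative only.

```python
import random


def sample_mixture(components, n, seed=0):
    """Draw n samples from a mixture of Gaussians.

    components is a list of (weight, mean, std) tuples; the weights
    decide how often each component is picked.
    """
    rng = random.Random(seed)
    weights = [w for w, _, _ in components]
    samples = []
    for _ in range(n):
        # Pick one component in proportion to its weight, then sample from it.
        _, mean, std = rng.choices(components, weights=weights, k=1)[0]
        samples.append(rng.gauss(mean, std))
    return samples


# Hypothetical two-group population: 70% centered at 20, 30% centered at 60.
components = [(0.7, 20.0, 3.0), (0.3, 60.0, 5.0)]
samples = sample_mixture(components, 10_000, seed=7)

# Fraction of samples that fall in the lower group (the groups barely overlap).
share_low = sum(x < 40 for x in samples) / len(samples)
```

Because the two components are well separated, `share_low` should land close to the first weight, which gives a quick sanity check on the generator.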

Synthetic data can be divided into two broad types, based on whether or not it is generated from real datasets.

1. The first type is synthesized from real datasets. The analyst has some real datasets and builds a model that captures the distributions and structure of that real data; synthetic data is then generated from the model.
2. The second type is not generated from real data. It is created by using existing models or the analyst's background knowledge.
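
As a rough illustration of the first type, the sketch below fits a very simple model to a two-column dataset and then samples new rows from it. The model (a Gaussian for the first column plus a linear relationship with Gaussian noise for the second) is an assumption chosen for brevity, not the method of the referenced book; the "real" rows are a simulated stand-in, and the names `fit`, `synthesize`, and `correlation` are hypothetical.

```python
import random
import statistics


def fit(rows):
    """Capture the distribution of x and the linear structure of y given x."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx = statistics.stdev(xs)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    slope = cov / sx ** 2
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in rows]
    return {
        "x_mean": mx,
        "x_std": sx,
        "slope": slope,
        "intercept": intercept,
        "resid_std": statistics.stdev(residuals),
    }


def synthesize(model, n, seed=0):
    """Sample n synthetic rows from the fitted model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(model["x_mean"], model["x_std"])
        noise = rng.gauss(0.0, model["resid_std"])
        rows.append((x, model["intercept"] + model["slope"] * x + noise))
    return rows


def correlation(rows):
    """Pearson correlation between the two columns."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


# Stand-in for a real dataset with two related columns (simulated here).
rng = random.Random(42)
real_rows = []
for _ in range(2000):
    x = rng.gauss(40.0, 10.0)
    real_rows.append((x, 5.0 + 2.0 * x + rng.gauss(0.0, 8.0)))

model = fit(real_rows)
synthetic_rows = synthesize(model, 2000, seed=1)

# A simple utility check: does an analysis give similar results on both?
corr_real = correlation(real_rows)
corr_synthetic = correlation(synthetic_rows)
```

No synthetic row is a copy of a real row, yet the correlation computed on the synthetic rows should be close to the one computed on the real rows, which is the sense in which the model preserves structure.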

| Type of synthetic data | Utility |
|:---|:---|
| Generated from real (nonpublic) datasets | Can be quite high |
| Generated from real public data | Can be high, although limitations exist because public data tends to be de-identified or aggregated |
| Generated from an existing model of a process, which can also be represented in a simulation engine | Will depend on the fidelity of the existing generating model |
| Based on analyst knowledge | Will depend on how well the analyst knows the domain and the complexity of the phenomenon |
| Generated from generic assumptions not specific to the phenomenon | Will likely be low |