# Synthetic Data

In this notebook we will generate some [synthetic data](https://en.wikipedia.org/wiki/Synthetic_data) related to basketball player statistics. We'll use [NumPy](https://numpy.org) to generate the values and [pandas](https://pandas.pydata.org) to hold them in a table. The data describe 100 hypothetical players and rest on the following assumptions:

* heights are normally distributed with a mean of 180 cm and a standard deviation of 10 cm
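* wingspans are, on average, 5 cm longer than heights, and that difference has a standard deviation of 5 cm
* reaction times are normally distributed with a mean of 180 ms and a standard deviation of 10 ms, then clipped to the range 180 ms to 400 ms
* free throw probabilities decrease linearly with reaction time and are clipped to the range 0.3 to 0.9; each player attempts 5 free throws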
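
To make the last assumption concrete, here is a minimal sketch of the rule that the code cell below uses to turn a reaction time into a free throw probability. The helper name `ft_probability_from_reaction` and the sample reaction times are illustrative only and do not appear in the notebook itself.

```python
import numpy as np

def ft_probability_from_reaction(reaction_time_ms):
    """Hypothetical helper: the notebook's linear rule, clipped to [0.3, 0.9]."""
    times = np.asarray(reaction_time_ms)
    return np.clip(1 - (times - 180) / 200, 0.3, 0.9)

# Illustrative reaction times in milliseconds (not generated data)
sample_times = np.array([180, 200, 250, 320, 400])
sample_probs = ft_probability_from_reaction(sample_times)
print(sample_probs)
```

Under this rule, any reaction time at or below 200 ms hits the 0.9 ceiling and any reaction time at or above 320 ms hits the 0.3 floor.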

In [None]:
import numpy as np
import pandas as pd

number_of_players = 100

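# Heights: normal with mean 180 cm and standard deviation 10 cm, rounded to whole cm
height = np.round(np.random.normal(loc=180, scale=10, size=number_of_players)).astype(int)
# Wingspans: height plus a normal offset (mean 5 cm, standard deviation 5 cm)
wingspan = np.round(height + np.random.normal(loc=5, scale=5, size=number_of_players)).astype(int)
# Reaction times: normal with mean 180 ms and standard deviation 10 ms, clipped to 180 to 400 ms
reaction_time = np.round(np.clip(np.random.normal(loc=180, scale=10, size=number_of_players), a_min=180, a_max=400)).astype(int)
# Free throw probability falls linearly with reaction time, clipped to 0.3 to 0.9
ft_probability = np.clip(1-(reaction_time-180)/200, a_min=0.3, a_max=0.9)
# Free throws made out of 5 attempts, one binomial draw per player
free_throws = np.random.binomial(n=5, p=ft_probability)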

df = pd.DataFrame({
    "Height (cm)": height,
    "Wingspan (cm)": wingspan,
    "Reaction Time (ms)": reaction_time,
    "Free Throws (out of 5)": free_throws,
})
df
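
One way to sanity-check generated data is to compare sample statistics with the assumptions that produced them. The sketch below is self-contained: it regenerates heights and wingspans with NumPy's `default_rng` and an arbitrary seed (an assumption made only so the result is repeatable; the cell above is unseeded), then summarizes them. The names `check` and `summary` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # assumed seed, used only for repeatability
n = 100

height = np.round(rng.normal(loc=180, scale=10, size=n)).astype(int)
wingspan = np.round(height + rng.normal(loc=5, scale=5, size=n)).astype(int)

check = pd.DataFrame({"Height (cm)": height, "Wingspan (cm)": wingspan})
check["Wingspan - Height (cm)"] = check["Wingspan (cm)"] - check["Height (cm)"]

# Sample means and standard deviations should land near the assumed values
# (180 and 10 for height, 5 and 5 for the difference), but not exactly on them.
summary = check.agg(["mean", "std"]).round(1)
print(summary)
```

With only 100 players, the sample statistics will drift a little from the assumed parameters; increasing `n` should pull them closer.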

Now that we have a DataFrame of player data, we can visualize the relationship between height and wingspan with a scatter plot and an ordinary least squares (OLS) trendline.

In [None]:
import plotly.express as px
px.scatter(df, x="Height (cm)", y="Wingspan (cm)", title="Wingspan vs Height", trendline="ols")
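
The `trendline="ols"` option fits an ordinary least squares line; Plotly Express relies on the statsmodels package for this, so it needs to be installed. As a rough cross-check, a comparable line can be fitted directly with `np.polyfit`. The sketch below regenerates heights and wingspans with an assumed seed so that it runs on its own, which means its numbers will not match the unseeded cell above exactly.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumed seed, used only for repeatability
height = np.round(rng.normal(loc=180, scale=10, size=100)).astype(int)
wingspan = np.round(height + rng.normal(loc=5, scale=5, size=100)).astype(int)

# Degree-1 polynomial fit: wingspan = slope * height + intercept
slope, intercept = np.polyfit(height, wingspan, deg=1)

# Pearson correlation between the two columns
r = np.corrcoef(height, wingspan)[0, 1]

print(f"wingspan = {slope:.2f} * height + {intercept:.1f}, r = {r:.2f}")
```

Because wingspan was built as height plus noise, the slope should come out close to 1. The intercept is extrapolated far from the observed heights, so it can land well away from 5.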

## Questions

1. What do we mean by "synthetic data"? What are some purposes of synthetic data?
2. Even though we generated random data, why do the values still look realistic?
3. We made some assumptions when generating the data. Are they valid?