![Banner](https://github.com/Data-Dunkers/lessons/blob/main/images/top-banner.jpg?raw=true)

# Lesson: Synthetic Data

In data science, we don't always use real-world data right away. Sometimes we generate **synthetic data**: artificial data that follows certain mathematical rules. This allows us to test our tools and models in a controlled environment.

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
print('Libraries imported')

## 1. Generating Synthetic Data

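We'll create 50 artificial basketball players. We'll assume that each shot a player takes is worth about 0.8 points on average, so for every 10 more shots a player takes, they score about 8 more points (like an 80% success rate if every made shot counted for one point). We'll also add some random "noise" to make the data look realistic.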

In [None]:
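np.random.seed(42)  # fix the seed so everyone gets the same "random" players
shots = np.random.normal(50, 15, 50)  # 50 players: mean of 50 shots, standard deviation of 15
points = shots * 0.8 + np.random.normal(0, 5, 50)  # 0.8 points per shot, plus noise (standard deviation of 5)

df = pd.DataFrame({'Shots': shots, 'Points': points})
df.head()  # show the first five players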
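
If you want to try different settings, the same recipe can be wrapped in a function. This is only a sketch: `make_players` and its parameter names are made up for illustration, and it uses NumPy's newer `default_rng` generator, so its numbers will not match the `np.random.seed(42)` cell above.

```python
import numpy as np
import pandas as pd

def make_players(n_players=50, points_per_shot=0.8, noise_sd=5, seed=42):
    """Hypothetical helper: build a synthetic Shots/Points table."""
    rng = np.random.default_rng(seed)           # seeded generator for repeatable results
    shots = rng.normal(50, 15, n_players)       # shots taken by each player
    noise = rng.normal(0, noise_sd, n_players)  # random variation around the rule
    points = shots * points_per_shot + noise    # the rule plus the noise
    return pd.DataFrame({'Shots': shots, 'Points': points})

players = make_players()
same_players = make_players()
print(players.equals(same_players))  # True: same seed, same table
```

Changing `seed` gives a different set of players that still follows the same rule.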

## 2. Visualizing the Synthetic Relationship

Even though this data is fake, we can analyze it exactly like real NBA data.

In [None]:
# trendline="ols" draws a least-squares line of best fit (requires the statsmodels package)
fig = px.scatter(df, x="Shots", y="Points",
                 title="Synthetic Relationship: Shots Taken vs Points Scored",
                 trendline="ols")
fig.show()
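
To see the numbers behind the trendline, we can fit a least-squares line ourselves. The sketch below rebuilds the same data and uses `np.polyfit` instead of Plotly, so it is a stand-in for the `ols` trendline rather than the exact code Plotly runs. The fitted slope should land near the 0.8 we built in, but not exactly on it, because of the noise.

```python
import numpy as np

# Rebuild the same synthetic players as above
np.random.seed(42)
shots = np.random.normal(50, 15, 50)
points = shots * 0.8 + np.random.normal(0, 5, 50)

# Fit a straight line (degree 1) by least squares
slope, intercept = np.polyfit(shots, points, 1)
print(f"Fitted line: Points = {slope:.2f} * Shots + {intercept:.2f}")

# Residuals measure how far each player sits above or below the line
residuals = points - (slope * shots + intercept)
print(f"Largest miss: {np.abs(residuals).max():.1f} points")
```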

## Reflection Questions

1. Why don't all the points fall exactly on the trendline, even though we made the data ourselves?
2. How does the "noise" (randomness) we added make this data more useful for testing tools and models than points that fall on a perfectly straight line?
3. When might a data scientist prefer synthetic data over real data?
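
One way to explore question 2 is to change the amount of noise and watch what happens to the relationship. The sketch below is an assumption about how you might experiment: the noise levels 0, 5, and 20 are arbitrary choices, and the correlation coefficient is used as a simple measure of how tightly the points follow a line.

```python
import numpy as np

rng = np.random.default_rng(0)
shots = rng.normal(50, 15, 50)  # one set of players, reused for every noise level

results = {}
for noise_sd in [0, 5, 20]:
    # Same rule each time, with a different amount of randomness added
    points = shots * 0.8 + rng.normal(0, noise_sd, 50)
    results[noise_sd] = np.corrcoef(shots, points)[0, 1]
    print(f"noise sd = {noise_sd:>2}: correlation = {results[noise_sd]:.2f}")
```

With no noise the correlation is 1 (a perfect line), and it typically drops as the noise grows.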

---

### Online Access
You can run this notebook online using the following links:

*   [**Google Colab**](https://colab.research.google.com/github/Data-Dunkers/student/blob/main/activities/synthetic-data.ipynb)
*   [**Callysto Hub**](https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FData-Dunkers%2Fstudent&branch=main&subPath=activities/synthetic-data.ipynb&depth=1)