![Banner](https://github.com/Data-Dunkers/lessons/blob/main/images/top-banner.jpg?raw=true)

# Standard Deviation

We are going to use some NBA player statistics from the 2024-2025 season to talk about the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation). Standard deviation is a measure of the amount of variation in a set of values. A low standard deviation means the values tend to be close to the mean, while a high standard deviation means they are more spread out.
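
As a quick illustration of that idea, here is a minimal sketch using two made-up lists of points scored (not taken from the NBA dataset). Both lists have the same mean, but one is much more spread out, so its standard deviation is larger. The names `consistent` and `streaky` are just for this example.

```python
import statistics

# Made-up points scored in five games by two hypothetical players
consistent = [9, 10, 10, 11, 10]   # values stay close to the mean
streaky = [2, 18, 5, 15, 10]       # values are spread far from the mean

# Both lists have the same mean
print("means:", statistics.mean(consistent), statistics.mean(streaky))

# The sample standard deviation is much larger for the spread-out list
print("consistent std:", statistics.stdev(consistent))
print("streaky std:", statistics.stdev(streaky))
```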

Let's start by importing and displaying the data.

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Data-Dunkers/data/refs/heads/main/NBA/player/nba_player_stats_2024-2025.csv')
df

The first thing we can try is the `.describe()` method, which will give us a quick overview of the data. You'll see that the standard deviation for each column is in the third row, labelled `std`.

In [None]:
df.describe()

We can also apply the standard deviation method to a particular column, such as "Games Played" (`GP`).

In [None]:
df['GP'].std()
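
One detail worth knowing: by default, the pandas `.std()` method calculates the *sample* standard deviation, dividing by `n - 1` (`ddof=1`). Passing `ddof=0` gives the *population* standard deviation, dividing by `n`. The sketch below uses a small made-up series, named `sample` only for this example, to show the difference.

```python
import pandas as pd

# Made-up values (not from the NBA dataset); their mean is 5
sample = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = sample.std()            # pandas default: ddof=1, divides by n - 1
population_std = sample.std(ddof=0)  # divides by n instead

print("sample std:", sample_std)
print("population std:", population_std)
```

For a dataset with hundreds of players the two values are nearly identical, so the default is fine for this activity.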

To display the standard deviations for all of the columns in the dataset, we need to specify that it should only calculate for numeric columns.

In [None]:
df.std(numeric_only=True)

To visualize these standard deviations, we can use a bar graph.

In [None]:
import plotly.express as px
px.bar(df.std(numeric_only=True), title="Standard Deviations")

It looks like "games played" (`GP`) has a high standard deviation, while "steals" (`STL`) has a low standard deviation. Let's create histograms to compare those two columns, and add vertical lines to show the [mean](https://en.wikipedia.org/wiki/Mean) and one standard deviation in each direction.

In [None]:
GP_mean = df['GP'].mean()
GP_std = df['GP'].std()
STL_mean = df['STL'].mean()
STL_std = df['STL'].std()

px.histogram(df['GP'], title='Games Played').add_vline(x=GP_mean-GP_std).add_vline(x=GP_mean).add_vline(x=GP_mean+GP_std).show()
px.histogram(df['STL'], title='Steals').add_vline(x=STL_mean-STL_std).add_vline(x=STL_mean).add_vline(x=STL_mean+STL_std).show()

We can also calculate values such as two standard deviations from the mean.

In [None]:
STL_mean + 2*STL_std
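
Values like this can be used as cut-offs. As a rough sketch, assuming a small made-up series rather than the real `STL` column, we can check which values fall within two standard deviations of the mean. The names `values`, `lower`, `upper`, and `within` are only for this example.

```python
import pandas as pd

# Made-up values with one unusually large entry (not from the NBA dataset)
values = pd.Series([0, 1, 1, 2, 2, 2, 3, 3, 4, 12])

mean = values.mean()
std = values.std()

# Cut-offs two standard deviations below and above the mean
lower = mean - 2 * std
upper = mean + 2 * std

# True for each value that lies between the two cut-offs (inclusive)
within = values.between(lower, upper)

print("cut-offs:", lower, upper)
print("values within two standard deviations:", within.sum(), "of", len(values))
print("values outside:", values[~within].tolist())
```

The same approach should work on a column of the dataset by replacing `values` with, for example, `df['STL']`.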

## Questions

1. Why do you think "games played" had the highest standard deviation?
2. How does standard deviation help explain the appearance of the two histograms?
3. If we removed all of the "steals" values that were greater than two standard deviations from the mean, how do you think that would affect the standard deviation?

---

### Online Access
You can run this notebook online using the following links:

*   [**Google Colab**](https://colab.research.google.com/github/Data-Dunkers/student/blob/main/activities/standard-deviation.ipynb)
*   [**Callysto Hub**](https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FData-Dunkers%2Fstudent&branch=main&subPath=activities/standard-deviation.ipynb&depth=1)