# What is "Typical"? An Introduction to Center and Spread

<div class="alert alert-block alert-info">
Welcome back! This activity is part of an introduction to computational notebooks, designed specifically for K-12 educators.

In this notebook, we will explore how to measure the **center** and **spread** of a dataset. We'll also explore how visualizations can give insight into what is "typical" in a dataset.
</div>

Measures of center and spread help us answer a fundamental question: how can we summarize a whole set of data with just a few numbers?

## Key Ideas in Describing Center and Spread

When working with data, a collection of measurements is called a **distribution**. Notebooks are a great environment for exploring the "shape" of that data, both by visualizing it and by adding values to a dataset to see how they affect the whole.

**Our Learning Goals:**
* **Managing Lists and Dataframes:** Learn about using a `list` or `dataframe` to store and manipulate data. 
* **Exploring Center and Spread:** Understand how ideas like `mean`, `median`, range, outliers, and standard deviation can be used to reason about typicality. Explore the "shape" of data using `histograms` and `boxplots`.
* **Content:** Apply these statistical concepts to explore different patterns you may find in classroom test scores and housing prices.

Let's start with a simple, common scenario you may often think about as a teacher. Imagine you have a set of student test scores from a quiz graded out of 100 points.

### Part I: Loading the Data

We will start with a simple list of hypothetical test scores below. We are storing these numbers as a `list` by putting them in brackets. This will make it easy for you to change values in the list and test what happens. 

<a id="set-scores"></a>

In [None]:
# Define our list of test scores.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]

scores
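
The learning goals above also mention dataframes. As a minimal sketch, assuming the `pandas` library is installed in your notebook environment (it is not imported anywhere else in this activity), the same scores could be stored in a one-column dataframe. The names `scores_df` and `score` are illustrative choices, not names the rest of the notebook relies on.

```python
import pandas as pd  # assumption: pandas is available in your environment

# The same hypothetical test scores used in the cell above.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]

# Store the list as a one-column dataframe; "score" is an illustrative column name.
scores_df = pd.DataFrame({"score": scores})

# shape reports (number of rows, number of columns).
print("Shape:", scores_df.shape)

# Sorting a column can make it easier to scan the values by eye.
print("Sorted scores:", scores_df["score"].sort_values().tolist())
```

A list is enough for this activity. A dataframe becomes more useful once each student has several pieces of information attached, such as a name and a score.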

<div class="alert alert-success">

**Considering the data above…**

* What would you describe as a "typical" score on this test, given the list above?

* How would you change the list (if at all) to create different teaching scenarios? For example, how would these scores look for a test that was especially difficult? For a diagnostic pre-test where you expect students to have very different levels of preparation?

</div>

We'll use this list of scores to practice coding and creating visualizations that describe center and spread.

As always, we'll model good documentation: explaining the **why** (in text) and the **how** (in `# code comments`).  

### Part II: Describing the Data

Let's explore some different ways we can describe this list of scores. We'll start with two different ways to describe what is a _typical_ score. The mean is calculated by adding all scores together, then dividing by the number of scores. The median indicates the "middle" value when all values in the list are sorted.
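
To make these two definitions concrete, here is a minimal sketch that computes both values step by step using only built-in Python. The variable names (`mean_by_hand`, `median_by_hand`, and so on) are illustrative. Note one detail the definition above leaves implicit: when a list has an even number of values, there is no single middle value, so the median is the average of the two middle values.

```python
# The same hypothetical test scores defined earlier in the notebook.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]

# Mean: add all the scores together, then divide by how many there are.
mean_by_hand = sum(scores) / len(scores)

# Median: sort the scores, then find the middle of the sorted list.
sorted_scores = sorted(scores)
n = len(sorted_scores)
middle = n // 2
if n % 2 == 1:
    # Odd number of scores: the single middle value is the median.
    median_by_hand = sorted_scores[middle]
else:
    # Even number of scores: average the two middle values.
    median_by_hand = (sorted_scores[middle - 1] + sorted_scores[middle]) / 2

print("Sorted scores:", sorted_scores)
print("Mean by hand:", mean_by_hand)
print("Median by hand:", median_by_hand)
```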

In [None]:
import numpy as np # a library for performing calculations

# The lines below calculate the mean and median of the scores list.
mean_score = np.mean(scores) 
median_score = np.median(scores) 

# The lines below each output a label, followed by the corresponding value.
print("Mean:", mean_score)  
print("Median:", median_score) 

<div class="alert alert-success">

**Considering the results above…**

* Would you say the mean or the median is more "typical" of this class' performance on the test? Why?

* How would you explain the difference between these two values? What do they tell you about more general patterns in the data?

</div>

Now, let's find some ways to describe the _spread_ of scores. You can think of this as a way to describe how different the scores might be from each other while still being typical of scores in general.

In [None]:
# Calculate the min, max, and standard deviation of the scores.
min_score = np.min(scores)
max_score = np.max(scores)
std_dev_score = np.std(scores)  # population standard deviation (NumPy's default, ddof=0)

# Output the labels and corresponding values.
print("Min:", min_score)  
print("Max:", max_score) 
print("Standard Deviation:", f"{std_dev_score:.2f}") # round to 2 decimal places
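
The standard deviation can feel like a black box, so here is a minimal sketch of the arithmetic behind it, using only built-in Python. With its default settings, `np.std` divides by the number of values (the "population" version). Dividing by one less than the number of values gives the "sample" version, which some other tools, such as pandas, use by default. The variable names below are illustrative.

```python
import math

# The same hypothetical test scores defined earlier in the notebook.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]

mean_score = sum(scores) / len(scores)

# Step 1: how far is each score from the mean?
deviations = [score - mean_score for score in scores]

# Step 2: square each distance so negatives and positives do not cancel out.
squared_deviations = [d ** 2 for d in deviations]

# Step 3: average the squared distances, then take the square root.
population_std = math.sqrt(sum(squared_deviations) / len(scores))

# Dividing by (n - 1) instead of n gives the "sample" standard deviation.
sample_std = math.sqrt(sum(squared_deviations) / (len(scores) - 1))

print(f"Population standard deviation: {population_std:.2f}")
print(f"Sample standard deviation: {sample_std:.2f}")
```

The two versions are close for a class-sized list, but they are not identical, which is worth knowing when comparing results across tools.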

<div class="alert alert-success">

**Considering the results above…**

* What would be an example of an "atypical" or outlier score for this list, in light of the values above?

* How would you explain the standard deviation? What is its relationship to the maximum and minimum values? What does it tell you about what's "typical" in this group of scores?

</div>

### Part III: Visualizing the Data

Values like the mean, median, and standard deviation can give you quick summaries about a collection of data, but it can still be hard to get a sense of the overall "shape" of a set of values. For this, visualization works well. Let's take a look at the distribution of the test scores you described above in a few different ways.

In [None]:
import seaborn as sns # seaborn is for making beautiful plots

# Create a histogram of scores.
sns.histplot(scores, bins=5)
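
A histogram groups values into equal-width "bins" and draws one bar per bin. As a rough sketch of what `bins=5` means, the code below uses NumPy's `np.histogram` to count the scores in five equal-width bins. This assumes seaborn chooses its bin edges the same way for this data, so treat it as an illustration of the idea rather than a description of seaborn's internals.

```python
import numpy as np

# The same hypothetical test scores defined earlier in the notebook.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]

# Split the range from the lowest to the highest score into 5 equal-width bins
# and count how many scores land in each one.
counts, bin_edges = np.histogram(scores, bins=5)

# Print each bin's range next to the number of scores that fall inside it.
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:.0f} to {right:.0f}: {count} score(s)")
```

Try changing `bins=5` to a smaller or larger number, both here and in the plot, to see how the choice of bins changes the apparent shape of the data.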

You can add lines to show where the `mean_score` and `median_score` you calculated above fall in the distribution of scores. Add the code below to the end of the histogram cell above, then re-run that cell.

```
# this library gives you more control over plotting functions
import matplotlib.pyplot as plt

# Add a dashed line for the mean and another for the median.
plt.axvline(mean_score, color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(median_score, color='green', linestyle='dashed', linewidth=1, label='Median')
plt.legend()  # show the labels for both lines
```

You may also be familiar with using boxplots to visualize distributions. Below is code to create a boxplot of the same list of scores.

In [None]:
sns.boxplot(scores)
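
The box in a boxplot is drawn from the quartiles of the data, and a common convention marks any point more than 1.5 times the box's width beyond the box as an outlier. The sketch below computes those values directly. It assumes NumPy's default way of calculating percentiles and the 1.5 multiplier, so the numbers may differ slightly from what other tools report.

```python
import numpy as np

# The same hypothetical test scores defined earlier in the notebook.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]

# The box spans the first quartile (25th percentile) to the third quartile (75th).
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1  # interquartile range: the width of the box

# A common convention flags points more than 1.5 * IQR beyond the box as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, "to", upper_fence)
print("Outliers:", outliers)
```

Try adding a very low score, such as 40, to the list and re-running the code to see whether it falls outside the fences.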

<div class="alert alert-success">

**Looking at the histogram and boxplot above...**

* What do you pay attention to when thinking about what a "typical" test score is in this list of scores?

* What is easier, and harder, to understand or describe when you are using the plots versus when you are using values like the mean, median, and standard deviation?

</div>

### Part IV: Exploring Different Distributions

Now, let's see how interactivity can help you explore these ideas. 

Take a moment to consider how these values would change if you were working with different patterns of scores. What would happen to the **mean**, **median**, and **standard deviation** of the scores in the following scenarios?

* A test was particularly difficult for students, with only a few doing well.

* You gave students a pre-quiz about a topic some knew a lot about, and others knew nothing about.

* A large subset of students are brazenly cheating on an otherwise very difficult test.

Now, choose at least one of these scenarios to test. You will enter a new list of hypothetical scores reflecting the patterns you expect, and use the rest of the code to test your predictions about what would happen. To go back to the top of this section and change the list of scores, [click here](#set-scores).
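
If you would rather compare several scenarios side by side than edit one list repeatedly, one possible approach is sketched below. The helper function `describe` and every score list except the original are hypothetical, invented only for illustration. Replace them with the patterns you predicted.

```python
import numpy as np

def describe(scores):
    """Return the mean, median, and population standard deviation of a list."""
    return {
        "mean": float(np.mean(scores)),
        "median": float(np.median(scores)),
        "std": float(np.std(scores)),
    }

# Hypothetical scores, invented for illustration only.
scenarios = {
    "original": [85, 92, 78, 88, 95, 81, 75, 89, 90, 85],
    "difficult test": [35, 42, 38, 45, 40, 50, 36, 44, 92, 95],
    "mixed pre-quiz": [15, 20, 25, 18, 22, 85, 90, 88, 92, 95],
}

# Print one summary line per scenario so the values are easy to compare.
for name, scenario_scores in scenarios.items():
    summary = describe(scenario_scores)
    print(f"{name}: mean={summary['mean']:.1f}, "
          f"median={summary['median']:.1f}, std={summary['std']:.1f}")
```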


<div class="alert alert-success">

**As you explore, consider...**

* How do the mean, median, and standard deviation help you understand the differences between different kinds of distributions?  

* What features of a dataset seem important to understand before describing a "typical" part of the data?

* What information might still be "hidden" by these statistics, and how can plots help you learn more?

</div>

**Summary:** You now know how to describe a collection of measurements using `mean`, `median`, `standard deviation`, and different types of plots. You should also have an idea about what these methods can and cannot tell you about the true "shape" of a dataset.

Now, we'll apply these skills to a real dataset. 

# STOP HERE

# Keeping it Real (Estate): Exploring Typical Housing Prices in Atypical Markets

Now, we are going to use the ideas above to explore home sales prices.

The code below focuses on the San Francisco Bay Area and uses a short list of prices typed directly into the cell, rather than data pulled dynamically from an Application Programming Interface (API). You can edit the list to focus on any area in the United States that you find interesting.

In [None]:
import matplotlib.pyplot as plt  # plotting library used for the histogram below

# Define our list of housing prices (in millions).
prices = [1.2, 1.5, 1.3, 1.6, 1.4, 1.8, 1.25, 1.45, 1.7, 20.0]

# --- Calculate Measures of Center ---
mean_price = np.mean(prices)
median_price = np.median(prices)

print(f"Mean Price (in millions): ${mean_price:.2f}M")
print(f"Median Price (in millions): ${median_price:.2f}M")

# --- Visualize the Distribution ---
plt.hist(prices, bins=8, edgecolor='black')

plt.axvline(mean_price, color='red', linestyle='dashed', linewidth=2, label=f'Mean: ${mean_price:.2f}M')
plt.axvline(median_price, color='green', linestyle='dashed', linewidth=2, label=f'Median: ${median_price:.2f}M')

plt.title('Distribution of Bay Area Home Prices')
plt.xlabel('Price (in millions of $)')
plt.ylabel('Number of Homes')
plt.legend()
plt.show()

## PLANNED USE & DISPOSITIONS: Interpreting the Results

Look at the plot above. The red line (mean) is pulled far to the right by the single $20M home, while the green line (median) remains with the main cluster of houses.

**Conclusion:** In a skewed dataset with outliers, the **median** is often a more robust and representative measure of the "center" or "typical" value than the mean. This is why you almost always hear about the *median* home price in the news, not the mean.
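
To see this robustness in numbers, the sketch below recomputes both measures with and without the $20M home. It reuses the price list from the cell above; the name `prices_without_outlier` is illustrative.

```python
import numpy as np

# The same prices (in millions of dollars) used in the cell above.
prices = [1.2, 1.5, 1.3, 1.6, 1.4, 1.8, 1.25, 1.45, 1.7, 20.0]

# Drop the single $20M home to see how much it was influencing each measure.
prices_without_outlier = [p for p in prices if p < 20.0]

for label, values in [("With $20M home", prices),
                      ("Without $20M home", prices_without_outlier)]:
    print(f"{label}: mean = ${np.mean(values):.2f}M, "
          f"median = ${np.median(values):.2f}M")
```

Removing one home cuts the mean by more than half, while the median barely moves.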

### In the Classroom and Beyond
* **Lesson Idea:** Have students collect data on a topic with likely outliers, like the number of minutes they spend on their phone each day, or the price of items on a restaurant menu. They can use a notebook to find the mean and median and debate which is a better measure of "typical."
* **Tinkering is Thinking:** Now it's your turn. Go back to the code cell for the housing prices. What happens if you change the $20M home to $5M? What if you add another house that costs $30M? Change the numbers and re-run the cell to see how the mean and median respond. Experimenting is the best way to build intuition!

## CONTENT EXTENSIONS: Where Else Do We See Skewed Data?

The skills you just practiced—calculating mean vs. median and visualizing distributions with histograms—are critical in any field that deals with real-world data, because real-world data is rarely perfectly symmetrical.

* **Economics:** Individual income in a country is famously skewed. Most people earn a modest income, while a very few earn extremely high incomes, pulling the mean income far above the median.
* **Biology:** The number of offspring produced by individuals in a species is often skewed. Most individuals may have 1-2 offspring, while a few "super-reproducers" might have many, many more.
* **Internet & Social Media:** The number of "likes" or "shares" on a post is highly skewed. The vast majority of posts get very few interactions, while a tiny fraction go viral and get millions.

In all these cases, understanding the difference between mean and median, and being able to see the shape of the data with a histogram, is essential for drawing accurate conclusions.
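
If you want skewed data to experiment with, one option is to simulate it. The sketch below draws values from a log-normal distribution, which is used here only as a convenient stand-in for right-skewed data such as incomes. The parameters are arbitrary and do not represent any real population.

```python
import numpy as np

# Assumption: a log-normal distribution is only a stand-in for right-skewed
# data; the parameters below are arbitrary, not real-world figures.
rng = np.random.default_rng(seed=0)
simulated_values = rng.lognormal(mean=10.5, sigma=1.0, size=10_000)

simulated_mean = np.mean(simulated_values)
simulated_median = np.median(simulated_values)

print(f"Mean:   {simulated_mean:,.0f}")
print(f"Median: {simulated_median:,.0f}")
print("Mean is larger than median:", simulated_mean > simulated_median)
```

Passing `simulated_values` to a histogram function should show the long right tail that pulls the mean above the median.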

## Connections to K-12 Standards

The concepts in this notebook align with several key educational standards, providing a clear path for classroom integration.

### Common Core State Standards for Mathematics
* **6.SP.A.2 & 3:** Understand that a set of data collected to answer a statistical question has a distribution which can be described by its center, spread, and overall shape. Recognize that a measure of center for a numerical data set summarizes all of its values with a single number.
* **HSS.ID.A.1, 2, & 3:** Represent data with plots on the real number line (histograms). Use statistics appropriate to the shape of the data distribution to compare center (median, mean) of different data sets. Interpret differences in shape, center, and spread in the context of the data sets, accounting for possible effects of extreme data points (outliers).

### Next Generation Science Standards (NGSS)
* **Science and Engineering Practice 4: Analyzing and Interpreting Data:** This notebook provides a direct application of this practice, as students must analyze data using computational tools to describe a dataset and understand its variability.
* **Crosscutting Concept 3: Scale, Proportion, and Quantity:** Understanding measures of center and spread is fundamental to describing a system quantitatively. This is especially true when discussing the impact of outliers, which highlights the importance of scale.