# What is "Typical"? An Introduction to Center and Spread

<div class="alert alert-block alert-info">
Welcome back! This activity is part of an introduction to computational notebooks, designed specifically for K-12 educators.

In this notebook, we will explore how to measure the **center** and **spread** of a dataset. We'll also explore how visualizations can give insight into what is "typical" in a dataset.
</div>

Any collection of measurements forms a **distribution**. Measures of center and spread help us answer a fundamental question: How can we summarize a whole set of data with just a few numbers?

## Key Ideas in Describing Center and Spread

Notebooks are a good environment for exploring the "shape" of a distribution. You can visualize the data, and even add values to a dataset to see how they affect the whole.

**Our Learning Goals:**
* **Managing Lists and Dataframes:** Learn about using a `list` or `dataframe` to store and manipulate data. Learn to `filter` data so you can focus only on records that are relevant to your question.
* **Exploring Center and Spread:** Understand how ideas like `mean`, `median`, range, outliers, and standard deviation can be used to reason about typicality. Explore the "shape" of data using `histograms` and `boxplots`.
* **Content:** Apply these statistical concepts to explore different patterns you may find in classroom test scores and housing prices.

Let's start with a simple, common scenario you may often think about as a teacher. Imagine you have a set of student test scores from a quiz graded out of 100 points.

## Part I: Loading the Data

We will start with a simple list of hypothetical test scores below. We are storing these numbers as a `list` by putting them in brackets. This will make it easy for you to change values in the list and test what happens. 

<a id="set-scores"></a>

In [None]:
# Define our list of test scores.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]

scores

<div class="alert alert-success">

**Considering the data above…**

* What would you describe as a "typical" score on this test, given the list above?

* How would you change the list (if at all) to create different teaching scenarios? For example, how would these scores look for a test that was especially difficult? For a diagnostic pre-test where you expect students to have very different levels of preparation?

We'll use this list of scores to practice coding and creating visualizations that describe center and spread.

As always, we'll model good documentation: explaining the **why** (in text) and the **how** (in `# code comments`).  

## Part II: Describing the Data

Let's explore some different ways we can describe this list of scores. We'll start with two different ways to describe what is a _typical_ score. The mean is calculated by adding all scores together, then dividing by the number of scores. The median indicates the "middle" value when all values in the list are sorted.
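
As a rough sketch of the arithmetic behind these two definitions, the block below works out the mean and median by hand using only built-in Python. It copies the `scores` list from Part I so that it can run on its own. The variable names `mean_by_hand`, `median_by_hand`, and `ordered` are illustrative and are not used elsewhere in this notebook.

```python
# Same scores as the list defined in Part I.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]

# Mean: add all scores together, then divide by the number of scores.
mean_by_hand = sum(scores) / len(scores)

# Median: sort the scores, then take the middle value.
# With an even number of scores, average the two middle values.
ordered = sorted(scores)
n = len(ordered)
if n % 2 == 1:
    median_by_hand = ordered[n // 2]
else:
    median_by_hand = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print("Mean:", mean_by_hand)      # 85.8
print("Median:", median_by_hand)  # 86.5
```

Because this list has ten scores, there is no single middle value, so the median is the average of the fifth and sixth sorted scores (85 and 88).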

In [None]:
import numpy as np # a library for performing calculations

# The lines below calculate the mean and median of the scores list.
mean_score = np.mean(scores) 
median_score = np.median(scores) 

# The lines below each output a label, followed by the corresponding value.
print("Mean:", mean_score)  
print("Median:", median_score) 

<div class="alert alert-success">

**Considering the results above…**

* Would you say the mean or the median is more "typical" of this class' performance on the test? Why?

* How would you explain the difference between these two values? What do they tell you about more general patterns in the data?

Now, let's find some ways to describe the _spread_ of scores. You can think of this as a way to describe how different the scores might be from each other while still being typical of scores in general.
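
One way to make standard deviation less mysterious is to compute it step by step. The sketch below does this with built-in Python for the same list of scores. It divides by the number of scores (the "population" form), which is what `np.std` does by default. The variable names are illustrative only.

```python
import math

# Same scores as the list defined in Part I.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]

# Range: the distance between the highest and lowest scores.
score_range = max(scores) - min(scores)

# Standard deviation, step by step:
# 1. Find how far each score is from the mean.
mean_score = sum(scores) / len(scores)
deviations = [s - mean_score for s in scores]
# 2. Square each distance so negatives do not cancel positives.
squared = [d ** 2 for d in deviations]
# 3. Average the squared distances (this is the variance).
variance = sum(squared) / len(scores)
# 4. Take the square root to return to "points" as the unit.
std_by_hand = math.sqrt(variance)

print("Range:", score_range)                         # 20
print("Standard deviation:", round(std_by_hand, 2))  # 5.98
```

Some tools divide by one less than the number of scores in step 3 (the "sample" form), which gives a slightly larger value, so it is worth checking which form a given tool reports.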

In [None]:
# Calculate the min, max, and standard deviation of the scores.
min_score = np.min(scores)
max_score = np.max(scores) 
std_dev_score = np.std(scores) 

# Round the standard deviation to two decimal places.
std_dev_score = round(std_dev_score, 2)

# Output the labels and corresponding values.
print("Min:", min_score)  
print("Max:", max_score) 
print("Standard Deviation:", std_dev_score) 

<div class="alert alert-success">

**Considering the results above…**

* What would be an example of an "atypical" or outlier score for this list, in light of the values above?

* How would you explain the standard deviation? What is its relationship to the maximum and minimum values? What does it tell you about what's "typical" in this group of scores?

## Part III: Visualizing the Data

Values like the mean, median, and standard deviation can give you quick summaries about a collection of data, but it can still be hard to get a sense of the overall "shape" of a set of values. For this, visualization works well. Let's take a look at the distribution of the test scores you described above in a few different ways.

In [None]:
import seaborn as sns # seaborn is for making beautiful plots

# Create a histogram of scores.
sns.histplot(scores, bins=5)

You can add lines to indicate where the `mean_score` and `median_score` that you calculated above lie in the distribution of scores. Copy the code example below into the same cell as the histogram so the line is drawn on that plot, then adapt it to add a second line for the median.

``` 
# this library gives you more control over plotting functions
import matplotlib.pyplot as plt 

# Add a line for the mean.
plt.axvline(mean_score, color='red', linestyle='dashed', linewidth=1, label='Mean')
```
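
For reference, here is one possible version with both lines and a legend. This is a sketch under a few assumptions: it uses matplotlib's own `hist` in place of `sns.histplot` so that it depends on a single plotting library, and the colors, line styles, and axis labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Same scores as the list defined in Part I.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]
mean_score = np.mean(scores)
median_score = np.median(scores)

# Draw the histogram and the reference lines in the same cell,
# so that the lines land on the same plot.
fig, ax = plt.subplots()
ax.hist(scores, bins=5, edgecolor='black')
ax.axvline(mean_score, color='red', linestyle='dashed', linewidth=1, label='Mean')
ax.axvline(median_score, color='blue', linestyle='dotted', linewidth=1, label='Median')

# Label the axes and show which line is which.
ax.set_xlabel('Score')
ax.set_ylabel('Number of students')
ax.legend()
```

Using a different line style for each statistic keeps the two lines distinguishable even when the mean and median are close together.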

You may also be familiar with using boxplots to visualize distributions. Below is code to create a boxplot of the same list of scores.

In [None]:
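# Create a boxplot of scores. Passing the list as x= draws the box
# horizontally, along the same axis as the histogram.
sns.boxplot(x=scores)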
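
To connect the boxplot to numbers, the sketch below computes the values a boxplot is typically built from: the quartiles, the interquartile range (IQR), and the cutoffs beyond which points are drawn as outliers. The 1.5 × IQR cutoff is a common plotting default and is assumed here; the variable names are illustrative.

```python
import numpy as np

# Same scores as the list defined in Part I.
scores = [85, 92, 78, 88, 95, 81, 75, 89, 90, 85]

# The box spans the first quartile (Q1) to the third quartile (Q3);
# the line inside the box is the median.
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the width of the box

# Assumed convention: points more than 1.5 * IQR beyond the box
# are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("Outliers:", outliers)  # [] for this list
```

Under this convention, none of the ten scores counts as an outlier, which is why the boxplot for this list shows no separate points.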

<div class="alert alert-success">

**Looking at the histogram and boxplot above...**

* What do you pay attention to when thinking about what a "typical" test score is in this list of scores?

* What is easier, and harder, to understand or describe when you are using the plots versus when you are using values like the mean, median, and standard deviation?

## Part IV: Exploring Different Distributions

Now, let's see how interactivity can help you explore these ideas. 

Take a moment to consider how these values would change if you were working with different patterns of scores. What would happen to the **mean**, **median**, and **standard deviation** of the scores in the following scenarios?

* A test was particularly difficult for students, with only a few doing well.

* You gave students a pre-quiz about a topic some knew a lot about, and others knew nothing about.

* A large subset of students are brazenly cheating on an otherwise very difficult test.

Now, choose at least one of these scenarios to test. You will enter a new list of hypothetical scores reflecting the patterns you expect, and use the rest of the code to test your predictions about what would happen. To go back to the top of this section and change the list of scores, [click here](#set-scores).
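
As one example of what such a test might look like, the sketch below uses made-up scores for the pre-quiz scenario, where about half the class knows the topic well and half does not. The scores are hypothetical. The block uses Python's built-in `statistics` module rather than NumPy so that it runs on its own; `pstdev` is the population standard deviation, the same form `np.std` reports by default.

```python
from statistics import mean, median, pstdev

# Hypothetical pre-quiz scores: five students know the topic, five do not.
prequiz_scores = [95, 92, 98, 90, 94, 20, 25, 15, 30, 22]

prequiz_mean = mean(prequiz_scores)
prequiz_median = median(prequiz_scores)
prequiz_std = pstdev(prequiz_scores)

print("Mean:", prequiz_mean)
print("Median:", prequiz_median)
print("Standard Deviation:", round(prequiz_std, 2))

# How many students scored within 25 points of the mean?
near_mean = [s for s in prequiz_scores if abs(s - prequiz_mean) <= 25]
print("Scores within 25 points of the mean:", near_mean)
```

In this made-up class, the mean and median both land near 60 even though no student scored anywhere close to 60. This is one way a summary number can hide the shape of the data.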


<div class="alert alert-success">

**As you explore, consider...**

* How do the mean, median, and standard deviation help you understand the differences between different kinds of distributions?  

* What features of a dataset seem important to understand before describing a "typical" part of the data?

* What information might still be "hidden" by these statistics, and how can plots help you learn more?

**Summary:** You now know how to describe a collection of measurements using `mean`, `median`, `standard deviation`, and different types of plots. You should also have an idea about what these methods can and cannot tell you about the true "shape" of a dataset.

Now, we'll apply these skills to a real dataset. 

# Keeping it Real (Estate): Exploring Typical Housing Prices in Atypical Markets

We are going to start by exploring housing prices in South Lake Tahoe, CA. This city has a mixture of residential and vacation homes, which makes it difficult to describe what a "typical" home price is in this area.

<div style="display: flex; justify-content: space-around;">
<br/>

<img src="https://github.com/CalCoRE/show-your-work/blob/typicality/images/lakeside.jpg?raw=true" height="150"> 
<br/>

<img src="https://github.com/CalCoRE/show-your-work/blob/typicality/images/roadside.jpg?raw=true" height="150"> 
<br/>

<img src="https://github.com/CalCoRE/show-your-work/blob/typicality/images/snowy.jpg?raw=true" height="150"> 
<br/>

</div>
<br/>

You will then have an opportunity to compare our findings with other cities that you choose.

## Part I: Loading and Filtering the Data

As always, we'll start by loading and previewing the dataset that we'll be working with. Our data comes from 2021-2022 home sales listings across the United States that were scraped from the realtor.com website.

Before running the next cell, think about what you might expect to see in our national home sales dataset.

In [None]:
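import pandas as pd # a library for working with tables of data (dataframes)

# Load the national home sales dataset from a CSV file.
all_housing = pd.read_csv('https://github.com/CalCoRE/show-your-work/blob/typicality/data/sold-realtor-data.csv?raw=true')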

all_housing

For the first part of this activity, we will focus on homes sold in South Lake Tahoe, CA. You will filter the dataset to only include homes sold in this city. 

Below, we create a new dataset called `my_housing` that only includes data from the full dataset `all_housing` that matches our specified state and city. We use both because there are often cities with the same name in different states.

You will have a chance to come back later and try this with other subsets of the data.

<a id="set-housing-data"></a>

In [None]:
# Filter the all_housing dataset by state to create a new dataset called my_housing.
my_housing = all_housing[all_housing['state'] == 'California']

# Further filter my_housing by city.
my_housing = my_housing[my_housing['city'] == 'South Lake Tahoe']

my_housing
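
The same two-step filter can also be written as a single step. The sketch below shows this on a tiny made-up table, and also shows why filtering by city alone can be misleading. The rows and the name `toy_housing` are hypothetical; only the `state`, `city`, and `price` column names come from the code above.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
toy_housing = pd.DataFrame({
    'state': ['Illinois', 'Missouri', 'Illinois', 'Missouri'],
    'city': ['Springfield', 'Springfield', 'Chicago', 'St. Louis'],
    'price': [150000, 210000, 320000, 180000],
})

# Filtering by city alone mixes together two different Springfields.
city_only = toy_housing[toy_housing['city'] == 'Springfield']

# Combining both conditions with & keeps only rows matching state AND city.
# Each condition needs its own parentheses.
both = toy_housing[(toy_housing['state'] == 'Illinois') &
                   (toy_housing['city'] == 'Springfield')]

print(len(city_only))  # 2 rows
print(len(both))       # 1 row
```

Either style works. The two-step version used in this notebook is easier to read one line at a time, while the one-step version avoids reassigning `my_housing`.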

<div class="alert alert-success">

**Examine the data table previewed above…**


* What do you notice and wonder about this dataset?


* What do you predict the mean home price in South Lake Tahoe, CA to be? Why?

## Part II: Describing Center and Spread

In the first activity in this notebook, we used a list of values to represent test scores. Here, the information is stored as a dataframe so that each record can have multiple values. We can still treat a single column of a dataframe much like a list by selecting it by name, for example `my_housing["price"]`.
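
Here is a small sketch of what "treating a column like a list" looks like in practice. The table is made up for illustration, and `toy_housing` and `prices` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only.
toy_housing = pd.DataFrame({
    'city': ['A', 'A', 'B'],
    'price': [300000, 500000, 400000],
})

# Selecting one column gives a pandas Series: a labeled sequence of values.
prices = toy_housing['price']

# NumPy functions accept the column just as they accepted the list of scores.
print(np.mean(prices))  # 400000.0
print(len(prices))      # 3

# If a plain Python list is needed, convert explicitly.
print(prices.tolist())  # [300000, 500000, 400000]
```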

Let's use the same tools we used above to explore our dataset of home sale prices from South Lake Tahoe, CA. We'll start simple by looking at the minimum and maximum sale prices.

In [None]:
# Find the lowest and highest sale prices in the filtered dataset.
min_price = my_housing["price"].min()
max_price = my_housing["price"].max()

# The lines below each output a label, followed by the corresponding value.
print("Min:", min_price)  
print("Max:", max_price) 

<div class="alert alert-success">

**Considering the values above…**


* Do these values seem reasonable, given what you know about South Lake Tahoe? Why or why not?


* Would you say that knowing the maximum and minimum home prices can help you describe what a "typical" home price is in this area? Why or why not?


* Do you expect the mean or median home price to be higher? Why?

Now, let's take a look at the mean, median, and standard deviation of home prices in South Lake Tahoe, CA.

In [None]:
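# Calculate the mean, median, and standard deviation of the sale prices.
mean_price = np.mean(my_housing['price'])
median_price = np.median(my_housing['price'])
stdev_price = np.std(my_housing['price'])

# Round the mean and standard deviation to two decimal places.
mean_price = round(mean_price, 2)
stdev_price = round(stdev_price, 2)

# Output the labels and corresponding values.
print("Mean Price:", mean_price)
print("Median Price:", median_price)
print("Standard Deviation of Prices:", stdev_price)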
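
pandas can also report several of these summaries at once. The sketch below uses `describe()` on a made-up column of prices; the numbers are hypothetical. One caveat: the `std` row from `describe()` uses the "sample" form of standard deviation, so it can differ slightly from the default result of `np.std`.

```python
import pandas as pd

# Hypothetical prices for illustration only.
prices = pd.Series([250000, 300000, 350000, 400000, 2000000])

# describe() reports count, mean, std, min, quartiles, and max together.
summary = prices.describe()
print(summary)

# The '50%' row is the median.
print(summary['50%'] == prices.median())  # True
```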


<div class="alert alert-success">

**Considering the results above…**


* Are the mean, median, and standard deviation what you expected? Why or why not?



* Can you explain why the mean price for homes in South Lake Tahoe, CA is higher than the median price?



* Would you say a home price of $750,000 is "typical" for South Lake Tahoe? What about $1.5M? Why or why not?



* How would you explain the standard deviation shown above to a student?

## Part III: Visualizing the Data

So far, we have seen some of the dramatic effects that having just a few extreme outliers can have on a dataset. But, it might still be difficult to get a full sense of what's a "typical" price in this area. Let's use visualization to dig a little deeper.

In [None]:
import matplotlib.pyplot as plt

# Create a histogram of home prices.
plt.hist(my_housing['price'], bins=20, edgecolor='black')

# Add vertical lines for the mean and median prices.
plt.axvline(mean_price, color='red', label='Mean Price')
plt.axvline(median_price, color='yellow', label='Median Price')

# Format the x-axis to show prices in millions.
plt.ticklabel_format(style='sci', scilimits=(6, 6), axis='x')

# Add titles and labels.
plt.title('Distribution of Home Prices')
plt.xlabel('Price (in millions of $)')
plt.ylabel('Number of Homes')
plt.legend()
plt.show()

<div class="alert alert-success">

**Looking at the visualization above…**


* Has your understanding of what a "typical" home price in South Lake Tahoe changed after seeing the histogram? How, or why not?



* Give some examples of what you would consider to be "typical," "borderline," and "atypical" home prices in this area, based on the visualization above.

## Part IV: Explore Other Collections of Housing Prices

We chose to focus on South Lake Tahoe, CA because it has an especially *skewed* distribution of home prices, with a few very expensive homes pulling the mean up well above the median. While this is an especially dramatic example, home prices are typically skewed in this way. This is why you will often see the median home price, rather than the mean, reported in real estate listings or in the news.
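
To see this pull in miniature, the sketch below compares the mean and median of a small set of made-up sale prices, with and without the single most expensive home. The prices are hypothetical and are not taken from the dataset. The block uses Python's built-in `statistics` module so that it runs on its own.

```python
from statistics import mean, median

# Hypothetical sale prices: most homes are similar, one is far more expensive.
prices = [400000, 450000, 500000, 550000, 600000, 5000000]

mean_all = mean(prices)
median_all = median(prices)

# Drop the single most expensive home and compare again.
without_top = sorted(prices)[:-1]
mean_without_top = mean(without_top)
median_without_top = median(without_top)

print("Mean:", mean_all, "-> without top home:", mean_without_top)
print("Median:", median_all, "-> without top home:", median_without_top)
```

Removing one home cuts the mean from 1.25 million to 500,000, while the median only moves from 525,000 to 500,000. The median is much less sensitive to a single extreme value.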

Think about other ways that you can explore the distribution of home prices. You might be interested in other cities (`city`), entire states (`state`), or other aspects of housing, such as the distribution of prices based on the number of `bedrooms` or the `house_size`.

Consider some combination of factors that would capture:
* a subset of homes for which the mean and median prices are closer together,
* a subset of homes for which the median price is higher than the mean,
* a subset of homes with a wide range of what you would call "typical" prices.

Now, choose at least one of these scenarios to explore. You will filter the dataset to reflect the patterns you expect, and use the rest of the code to test your predictions about what would happen. [Click here](#set-housing-data) to go back to the top of this section and get started.
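
If you want to compare many subsets at once instead of filtering one at a time, grouping is one option. The sketch below groups a made-up table by city and computes the mean and median price for each group. The rows, city names, and the `gap` column are hypothetical; only the `city` and `price` column names come from this notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy_housing = pd.DataFrame({
    'city': ['Lakeview', 'Lakeview', 'Lakeview', 'Midtown', 'Midtown', 'Midtown'],
    'price': [300000, 350000, 1750000, 400000, 420000, 440000],
})

# Compute the mean and median price for each city in one step.
by_city = toy_housing.groupby('city')['price'].agg(['mean', 'median'])

# A large gap between mean and median suggests a skewed distribution.
by_city['gap'] = by_city['mean'] - by_city['median']
print(by_city)
```

Sorting the result by `gap` would be one way to look for subsets where the mean and median are close together or far apart.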

## In the Classroom

The skills you just practiced—calculating mean vs. median and visualizing distributions with histograms—are critical in any field that deals with real-world data, because real-world data is rarely perfectly symmetrical.

**Where Else Do We See Skewed Data?**

* **Economics:** Individual income in a country is famously skewed. Most people earn a modest income, while a very few earn extremely high incomes, pulling the mean income far above the median.
* **Biology:** The number of offspring produced by individuals in a species is often skewed. Most individuals may have 1-2 offspring, while a few "super-reproducers" might have many, many more.
* **Internet & Social Media:** The number of "likes" or "shares" on a post is highly skewed. The vast majority of posts get very few interactions, while a tiny fraction go viral and get millions.

In all these cases, understanding the difference between mean and median, and being able to see the shape of the data with a histogram, is essential for drawing accurate conclusions.

## Connections to K-12 Standards

The concepts in this notebook align with several key educational standards, providing a clear path for classroom integration.

### Common Core State Standards for Mathematics
* **6.SP.A.2 & 3:** Understand that a set of data collected to answer a statistical question has a distribution which can be described by its center, spread, and overall shape. Recognize that a measure of center for a numerical data set summarizes all of its values with a single number.
* **HSS.ID.A.1, 2, & 3:** Represent data with plots on the real number line (histograms). Use statistics appropriate to the shape of the data distribution to compare center (median, mean) of different data sets. Interpret differences in shape, center, and spread in the context of the data sets, accounting for possible effects of extreme data points (outliers).

### Next Generation Science Standards (NGSS)
* **Science and Engineering Practice 4: Analyzing and Interpreting Data:** This notebook provides a direct application of this practice, as students must analyze data using computational tools to describe a dataset and understand its variability.
* **Crosscutting Concept 3: Scale, Proportion, and Quantity:** Understanding measures of center and spread is fundamental to describing a system quantitatively. This is especially true when discussing the impact of outliers, which highlights the importance of scale.

# Credits

This notebook was developed as part of "Show Your Work" (SyW), a research and development project at UC Berkeley to introduce computational notebooks to K-12 educators.

The SyW team includes, in alphabetical order: Pavritha Arun Anand, Sun Young Ban, Chul Huang, JungMin Shin, Michelle Wilkerson, and Xiaoyue Zhang.

This specific notebook was written by Sun Young Ban and includes contributions from Michelle Wilkerson and JungMin Shin.

Creation of this notebook was done with the assistance of Google Gemini Pro 2.5.

The real estate dataset used here is a subset of the USA Real Estate Dataset shared by Ahmed Shahriar Sakib on kaggle.com (https://doi.org/10.34740/kaggle/ds/3202774).

The South Lake Tahoe Roadside photo is available in the public domain on [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:50_through_South_Lake_Tahoe_by_Mark_Miller.jpg) by Amadscientist/Mark James Miller.

The Lakeside House in South Lake Tahoe photo is available under CC-BY 3.0 via [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:South_Lake_Tahoe,_CA_-_panoramio.jpg) (originally Panoramio) by donifirebg.

Snowy House "Cricket" photo is available under CC-BY-NC-SA 2.0 by jillmotts/Jill Siegrist on [flickr](https://www.flickr.com/photos/amayu/401688703/in/photolist-iW3Y1s-2qRRg2r-2qRvTXC-2n7QgNz-2n7G3oM-2mwhieU-eKq9zK-6QPQR-44Kp99-w7EPhU-DMmv1B-xeMnvT-2mnx2Zd-9eAq7-9eAox-9eApk-61DxKn-43vFZ3-2pSCAs9-GvSrjF-9eH1c8-43vHbq-2hhjGns-43rzhF-9Ji5cq-43vE1f-GvSn8e-2pSJmcp-J3gJds-J3gHud-9Hy1t3-BuL4i-2mWESLF).