# Week 1 · Day 3: Statistics for Data Analytics

Today we’ll build statistical intuition for data analysis.

We’ll focus on:
- Measures of **central tendency** (mean, median, mode)
- Measures of **spread** (min/max, range, variance, standard deviation)
- A couple of charts to help us *see* the distribution


## Learning Objectives

By the end of this notebook, you will be able to:
- Explain what a **DataFrame** is
- Calculate mean, median, and mode
- Calculate min, max, range, variance, and standard deviation
- Interpret what these numbers say about a dataset
- Use a histogram and box plot to visualize a distribution

## A Light Intro to Pandas (and DataFrames)

**Pandas** is a Python library used for working with data in a table format.

A **DataFrame** is like a spreadsheet:
- **Rows** are observations (e.g., each student)
- **Columns** are variables (e.g., quiz score)

Why we use DataFrames:
- They make it easier to calculate statistics
- They help us filter, sort, and summarize data
- They work well with charts and analysis tools
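As a rough illustration of the filter, sort, and summarize operations listed above, here is a minimal sketch. The three-row table and the cutoff of 80 are made up for this example and are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical three-student table, used only to illustrate the idea.
demo = pd.DataFrame({
    "student_id": [1, 2, 3],
    "quiz_score": [72, 85, 90],
})

# Filter: keep only the rows where the score is at least 80.
high_scores = demo[demo["quiz_score"] >= 80]

# Sort: order the rows from the highest score to the lowest.
ranked = demo.sort_values(by="quiz_score", ascending=False)

# Summarize: reduce the whole column to a single number.
average = demo["quiz_score"].mean()

print(high_scores)
print(ranked)
print("Average:", average)
```

Each operation returns a new object and leaves `demo` unchanged, which makes it safe to experiment.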

## Import Libraries

- **pandas**: tables (DataFrames)
- **numpy**: math tools
- **matplotlib / seaborn**: charts

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

## Build a Small Dataset

We’ll use a small dataset representing quiz scores for 15 students.

**Why this matters:** Real analytics often starts with a table of data. This is a safe, small example we can practice on.

In [None]:
scores = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    "quiz_score": [72, 85, 90, 68, 88, 94, 77, 83, 91, 70, 76, 89, 84, 92, 79]
})

scores

### ✅ Expected output

You should see a table with:
- 15 rows
- 2 columns: `student_id` and `quiz_score`

The exact formatting may vary, but the values should match what we created.

## Quick DataFrame Basics

Let’s practice a few DataFrame commands:
- `.head()` shows the first few rows
- `.shape` tells you (rows, columns)
- `.columns` lists the column names

**Why this matters:** Before you compute statistics, you should always confirm you understand what data you’re working with.
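Two more checks that can be useful at this stage are the column data types and the number of missing values. The sketch below rebuilds the same quiz table so it can run on its own; the variable name `missing_per_column` is just an illustrative choice.

```python
import pandas as pd

scores = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    "quiz_score": [72, 85, 90, 68, 88, 94, 77, 83, 91, 70, 76, 89, 84, 92, 79],
})

# Data type of each column (both columns hold whole numbers here).
print(scores.dtypes)

# Count of missing values in each column (0 means nothing is missing).
missing_per_column = scores.isna().sum()
print(missing_per_column)
```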

In [None]:
print("First 5 rows:")
display(scores.head())

print("\nShape (rows, columns):", scores.shape)
print("Columns:", list(scores.columns))

### ✅ Expected output

- A small preview of the first 5 rows
- `Shape (rows, columns): (15, 2)`
- `Columns: ['student_id', 'quiz_score']`

## Measures of Central Tendency

Central tendency describes a "typical" value.

- **Mean**: average
- **Median**: middle value (after sorting)
- **Mode**: most common value

**Why this matters:** People often ask, “What’s the typical score?” Mean/median/mode help you answer that question.

In [None]:
mean_score = scores["quiz_score"].mean()
median_score = scores["quiz_score"].median()

# Mode can return more than one value.
modes = scores["quiz_score"].mode()

print("Mean:", mean_score)
print("Median:", median_score)
print("Mode(s):", list(modes))

### ✅ Expected output

You should see three lines printed:
- Mean: about **82.53**
- Median: **84.0**
- Mode(s): all 15 scores, sorted from lowest to highest, because every score in this dataset appears exactly once

**Note:** When every score is unique, every score ties for “most common,” so pandas returns all of them and the mode tells you nothing useful. The mode becomes meaningful only when some values repeat.
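To see how `.mode()` handles ties, here is a small sketch using three made-up Series (none of them is the quiz dataset).

```python
import pandas as pd

# Hypothetical examples chosen to show three different tie situations.
all_unique = pd.Series([72, 85, 90])
one_repeat = pd.Series([72, 85, 85, 90])
two_way_tie = pd.Series([72, 72, 85, 90, 90])

print(list(all_unique.mode()))   # every value appears once, so all of them tie
print(list(one_repeat.mode()))   # 85 appears twice, so it is the only mode
print(list(two_way_tie.mode()))  # 72 and 90 both appear twice, so both are modes
```

Because the result can hold several values, `.mode()` returns a Series rather than a single number.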

## Measures of Spread

Spread tells us how *varied* the scores are.

- **Min** and **Max**: lowest and highest values
- **Range**: max − min
- **Variance**: roughly the average squared distance from the mean (in squared units)
- **Standard deviation**: spread in the *same units* as the data

**Why this matters:** Two classes can have the same average score but very different consistency. Spread helps you describe that difference.
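To make variance and standard deviation less of a black box, the sketch below computes them step by step and compares the result with pandas. It assumes the usual definitions: sample variance divides the sum of squared deviations by n − 1, and population variance divides by n.

```python
import numpy as np
import pandas as pd

quiz = pd.Series([72, 85, 90, 68, 88, 94, 77, 83, 91, 70, 76, 89, 84, 92, 79])

n = len(quiz)

# Step 1: how far is each score from the mean, squared?
squared_deviations = (quiz - quiz.mean()) ** 2

# Step 2: average the squared deviations.
sample_variance = squared_deviations.sum() / (n - 1)   # what quiz.var() computes by default
population_variance = squared_deviations.sum() / n     # same as quiz.var(ddof=0)

# Step 3: take the square root to get back to the original units (points).
sample_std = np.sqrt(sample_variance)

print("Sample variance:", round(sample_variance, 2))
print("Population variance:", round(population_variance, 2))
print("Sample standard deviation:", round(sample_std, 2))
```

The sample version is slightly larger than the population version because it divides by a smaller number.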

In [None]:
min_score = scores["quiz_score"].min()
max_score = scores["quiz_score"].max()
range_score = max_score - min_score

# pandas uses sample variance/std by default (ddof=1, i.e. it divides by n - 1)
variance_score = scores["quiz_score"].var()
std_score = scores["quiz_score"].std()

print("Min:", min_score)
print("Max:", max_score)
print("Range:", range_score)
print("Variance (sample):", variance_score)
print("Standard Deviation (sample):", std_score)

### ✅ Expected output

- Min should be **68**
- Max should be **94**
- Range should be **26** (because 94 − 68 = 26)
- Variance should be about **70.98** and standard deviation about **8.43** (both are sample statistics, so they divide by n − 1)

If your min/max/range don’t match, double-check the dataset values.

## Visualizing the Distribution

Charts help us *see* the shape of the data.

**Why this matters:** Numbers alone can hide patterns. A quick chart can reveal clusters or unusual values.

### Histogram

A histogram shows how many scores fall into each score range.
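The counting behind a histogram can be sketched with `np.histogram`. This example assumes six equal-width bins running from the lowest score to the highest, which is how NumPy treats an integer `bins` argument; a chart drawn with different bin settings would give different counts.

```python
import numpy as np

quiz_scores = [72, 85, 90, 68, 88, 94, 77, 83, 91, 70, 76, 89, 84, 92, 79]

# counts[i] is the number of scores that fall between edges[i] and edges[i + 1].
counts, edges = np.histogram(quiz_scores, bins=6)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count} students")
```

Each printed line corresponds to one bar in the chart: the range is the bar's position and the count is its height.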

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(scores["quiz_score"], bins=6, kde=True)
plt.title("Distribution of Quiz Scores")
plt.xlabel("Quiz Score")
plt.ylabel("Number of Students")
plt.show()

### ✅ Expected output

You should see a histogram showing quiz scores from about **68 to 94**.

Look for:
- Where most of the bars cluster (most common score range)
- Whether the distribution looks balanced or skewed

### Box Plot

A box plot shows:
- the median
- the middle 50% of values (the “box”)
- and possible outliers

**Why this matters:** A box plot is a quick summary of spread and is common in analytics reporting.
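The numbers a box plot draws can be computed directly. The sketch below assumes the common convention that a point counts as a possible outlier when it lies more than 1.5 × IQR beyond the box; the variable names are illustrative.

```python
import pandas as pd

quiz = pd.Series([72, 85, 90, 68, 88, 94, 77, 83, 91, 70, 76, 89, 84, 92, 79])

# The box spans the first quartile (Q1) to the third quartile (Q3).
q1 = quiz.quantile(0.25)
median = quiz.quantile(0.50)
q3 = quiz.quantile(0.75)

# Interquartile range: the width of the box (the middle 50% of scores).
iqr = q3 - q1

# Assumed outlier rule: anything beyond 1.5 * IQR from the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = quiz[(quiz < lower_fence) | (quiz > upper_fence)]

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("IQR:", iqr)
print("Possible outliers:", list(outliers))
```

For these quiz scores no value falls outside the fences, so the plot should show no separate outlier points.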

In [None]:
plt.figure(figsize=(6, 4))
sns.boxplot(x=scores["quiz_score"])
plt.title("Box Plot of Quiz Scores")
plt.xlabel("Quiz Score")
plt.show()

### ✅ Expected output

You should see one horizontal box plot.

Look for:
- where the median line sits
- how wide the box is (middle 50% spread)
- whether any points appear far away (possible outliers)

## Guided Practice Questions

Answer these questions by writing code in the cells below.

1) What are the **mean** and **median**? Are they close or far apart?

2) What are the **min**, **max**, and **range**?

3) Calculate **variance** and **standard deviation**.
   - Which one is easier to interpret in plain language? Why?

4) Find the **three lowest** scores and the **three highest** scores.

**Why this matters:** These questions are exactly the kind of quick summaries an analyst produces to explain a dataset.

In [None]:
# Guided Practice 1: mean and median

mean_score = scores["quiz_score"].mean()
median_score = scores["quiz_score"].median()

print("Mean:", mean_score)
print("Median:", median_score)

### ✅ Expected output

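Two printed numbers: a mean of about **82.53** and a median of **84.0**.

When the two are this close, that often suggests the distribution is fairly balanced (not strongly skewed). The mean sits slightly below the median here, which hints at a mild pull from the lowest scores.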
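To see why a gap between mean and median signals skew, the sketch below adds one hypothetical very low score (5) to the quiz data and compares the two statistics before and after. The extra score is invented for this illustration only.

```python
import pandas as pd

quiz = pd.Series([72, 85, 90, 68, 88, 94, 77, 83, 91, 70, 76, 89, 84, 92, 79])

# Hypothetical: one additional student who scored 5.
with_outlier = pd.concat([quiz, pd.Series([5])], ignore_index=True)

print("Original      mean:", round(quiz.mean(), 2), " median:", quiz.median())
print("With outlier  mean:", round(with_outlier.mean(), 2), " median:", with_outlier.median())
```

The single extreme value drags the mean down by several points while the median barely moves, which is why the median is often preferred for skewed data.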

In [None]:
# Guided Practice 2: min, max, range

min_score = scores["quiz_score"].min()
max_score = scores["quiz_score"].max()
range_score = max_score - min_score

print("Min:", min_score)
print("Max:", max_score)
print("Range:", range_score)

In [None]:
# Guided Practice 3: variance and standard deviation

variance_score = scores["quiz_score"].var()
std_score = scores["quiz_score"].std()

print("Variance (sample):", variance_score)
print("Standard Deviation (sample):", std_score)

### ✅ Expected output

Two positive numbers:
- Variance (about 70.98) is the larger number here because it is measured in squared points
- Standard deviation (about 8.43) is in the same unit as the scores (points), which is often easier to interpret

In [None]:
# Guided Practice 4: three lowest and three highest scores

sorted_scores = scores.sort_values(by="quiz_score")

lowest_three = sorted_scores.head(3)
highest_three = sorted_scores.tail(3)

print("Three lowest scores:")
display(lowest_three)

print("\nThree highest scores:")
display(highest_three)

### ✅ Expected output

You should see:
- The three lowest scores: **68**, **70**, and **72** (student IDs 4, 10, and 1)
- The three highest scores: **91**, **92**, and **94** (student IDs 9, 14, and 6)

Each row keeps its original student ID, so you can trace every score back to the dataset.
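An alternative to sorting the whole table is to ask pandas for the extremes directly with `nsmallest` and `nlargest`. This is a sketch of that approach on the same quiz table, not a replacement for the sorting solution.

```python
import pandas as pd

scores = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    "quiz_score": [72, 85, 90, 68, 88, 94, 77, 83, 91, 70, 76, 89, 84, 92, 79],
})

# Rows with the three smallest scores, ordered from lowest upward.
lowest_three = scores.nsmallest(3, "quiz_score")

# Rows with the three largest scores, ordered from highest downward.
highest_three = scores.nlargest(3, "quiz_score")

print(lowest_three)
print(highest_three)
```

Note that `nlargest` lists the highest score first, whereas `sort_values(...).tail(3)` lists it last.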

## Hands-On: Write a Short Stats Summary

Imagine you’re reporting to someone who wants a quick update about how the class did.

### Task
1) Compute (or reuse) these values:
- mean
- median
- min
- max
- range
- standard deviation

2) Write **3–5 bullet points** interpreting what you found.

**Why this matters:** Analysts don’t just calculate numbers; they explain what the numbers *mean* in plain language.
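One compact way to gather these statistics is `.agg()`, which applies several summary functions in one call. The sketch below is one possible approach; the name `summary` is an illustrative choice, and the range is added by hand because pandas has no built-in range function.

```python
import pandas as pd

scores = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    "quiz_score": [72, 85, 90, 68, 88, 94, 77, 83, 91, 70, 76, 89, 84, 92, 79],
})

# Apply five summary functions to the column at once; the result is a Series
# labeled by function name.
summary = scores["quiz_score"].agg(["mean", "median", "min", "max", "std"])

# Range is not a built-in statistic, so compute it from max and min.
summary["range"] = summary["max"] - summary["min"]

print(summary.round(2))
```

Having all six values in one labeled object makes it easier to refer to them while writing the interpretation.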

In [None]:
# Hands-On: Compute a small summary

mean_score = scores["quiz_score"].mean()
median_score = scores["quiz_score"].median()
min_score = scores["quiz_score"].min()
max_score = scores["quiz_score"].max()
range_score = max_score - min_score
std_score = scores["quiz_score"].std()

print("Mean:", mean_score)
print("Median:", median_score)
print("Min:", min_score)
print("Max:", max_score)
print("Range:", range_score)
print("Standard Deviation:", std_score)

### ✅ Expected output

You should see the six statistics printed.

Use them to write your interpretation below (next cell).

### Write your interpretation (3–5 bullets)

- 
- 
- 
- 
- 

## Key Takeaways

- Measures of **central tendency** describe a typical value (mean/median/mode).
- Measures of **spread** describe variation (range/variance/std dev).
- Visuals like histograms and box plots help you see the distribution.
- In analytics, your job is to compute the numbers **and** communicate what they mean.