## Intro to Statistics: Part 2 - Measures of central tendency and variation

### 📊 Measures of central tendency and variability

Welcome! In this Jupyter notebook, we'll explore descriptive statistics in Python. As you've learnt in lecture, researchers use descriptive statistics to summarize the data in a sample. In this notebook, we'll explore some of the most important and common descriptive statistical measures:

1. **Measures of Central Tendency** (Mean & Median)
2. **Measures of Dispersion** (Range, Inter-quartile Range & Standard Deviation)

<br>
We'll work with a dataset containing the heights of 1,000 adults.

<br><br>
Let's begin, as always, by importing the libraries that contain the functions we'll need in order to calculate statistics!

In [None]:
# 📦 Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print('Imported Libraries!')

# Enable plots in the notebook
%matplotlib inline

#### 📂 Load the Height Dataset

Next, let's load our dataset, which is called `height_data.csv`. Just as we did last time, let's begin by inspecting the size of our dataset. You should see that our dataset, which we assign to the variable `height_df` (df is short for DataFrame), contains a single column of 1,000 height values.

In [None]:
# Load the height dataset
height_df = pd.read_csv("csv files/height_data.csv")

# Display how many rows and columns are in the dataset
print(f"Dataset shape: {height_df.shape}")

In [None]:
# Display the first 20 rows of our dataset
height_df.head(20)

#### 📈 Plot Histogram of Heights

Now let's visualize the distribution of heights. To do so, let's plot a histogram and tell Python that we want it to contain 50 evenly spaced bins.

In [None]:
# Plot histogram
plt.figure(figsize=(10, 5))
plt.hist(height_df['Height'], bins=50, edgecolor='black')

# Let's make our plot look pretty and informative:
plt.title("Distribution of Height (50 Bins)")
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
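
To see what "evenly spaced bins" means in numbers, here is a minimal sketch using `np.histogram`, which is the function `plt.hist` relies on to do its binning. The data are simulated (`sample_heights` is a hypothetical stand-in, not the values in `height_data.csv`), so the counts will differ from the plot above.

```python
import numpy as np

# Hypothetical stand-in for the height column: 1,000 simulated heights in cm
rng = np.random.default_rng(seed=0)
sample_heights = rng.normal(loc=170, scale=10, size=1000)

# np.histogram returns the count in each bin and the edges of the bins
counts, edges = np.histogram(sample_heights, bins=50)

# The width of each bin is the gap between neighbouring edges
bin_widths = np.diff(edges)

print(f"Number of bins: {len(counts)}")
print(f"Number of bin edges: {len(edges)}")
print(f"All bins equally wide: {np.allclose(bin_widths, bin_widths[0])}")
print(f"Counts add up to: {counts.sum()}")
```

Fifty bins need fifty-one edges, and every value lands in exactly one bin, so the counts add up to the number of rows.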

### 📌 Section 1: Measures of Central Tendency

We will now calculate the **mean** and **median** of this distribution of height values using built-in functions that are part of the libraries we imported earlier.
<br><br>
After calculating the mean and median, let's overlay these values on top of the histogram so we can get a sense of how the mean and median relate to the overall distribution of heights.

In [None]:
# Calculate the mean and median

mean_height = np.mean(height_df['Height'])
median_height = np.median(height_df['Height'])

print(f"Mean = {mean_height}")
print(f"Median = {median_height}")
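
If you are curious what `np.mean` and `np.median` do under the hood, the sketch below computes both by hand. It uses a small made-up list (`sample` is hypothetical, not taken from `height_data.csv`) so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical sample of five heights in cm (not from height_data.csv)
sample = [172.0, 158.5, 181.0, 165.5, 169.0]

# Mean: add up all the values and divide by how many there are
manual_mean = sum(sample) / len(sample)

# Median: the middle value of the sorted data
# (with an even number of values, average the two middle ones)
ordered = sorted(sample)
n = len(ordered)
middle = n // 2
if n % 2 == 1:
    manual_median = ordered[middle]
else:
    manual_median = (ordered[middle - 1] + ordered[middle]) / 2

print(f"Manual mean = {manual_mean:.2f}, np.mean = {np.mean(sample):.2f}")
print(f"Manual median = {manual_median:.2f}, np.median = {np.median(sample):.2f}")
```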


In [None]:
# Plot histogram with the mean (red) and median (green) as dashed lines
plt.figure(figsize=(10, 5))
plt.hist(height_df['Height'], bins=100, edgecolor='black', alpha=0.7)
plt.axvline(mean_height, color='red', linestyle='dashed', linewidth=2, label='Mean')
plt.axvline(median_height, color='green', linestyle='dashed', linewidth=2, label='Median')
plt.legend()

# Let's make our plot look pretty and informative:
plt.title("Height Distribution with Mean and Median")
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()


**Questions❓**
- Notice that the mean and median are very close in value to one another. Why do you think that is?
- What are some examples of distribution shapes for which the mean and median would differ?

### 📌 Section 2: Measures of Dispersion

In addition to a summary statistic that tells us the "central tendency" of our distribution, it is useful to calculate statistics that tell us how spread out our data is. These are known as measures of "dispersion". 
<br><br>

Let's return to our height dataset and calculate three different measures of dispersion:

- **Full Range**
- **Inter-quartile range**
- **Standard Deviation** 

**1. Full Range**
<br><br>
The full range of the data is simply the difference between the largest and smallest scores in our sample. The cell below calculates the range.

In [None]:
# Calculate range by subtracting the min score from the max score
range_height = height_df['Height'].max() - height_df['Height'].min()
print(f"Range = {range_height:.2f} cm")

<br>**2. Inter-quartile range**
<br><br>
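A more robust and widely used statistic is the inter-quartile range, which is the difference between the 75th and 25th percentiles. Because it ignores the top and bottom quarters of the data, it is far less sensitive to extreme values than the full range.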

In [None]:
# Calculate the first and third quartiles
q1, q3 = np.quantile(height_df['Height'], [0.25, 0.75])
print(f"25th percentile = {q1:.2f} cm")
print(f"75th percentile = {q3:.2f} cm")

# Compute the Interquartile Range (IQR)
iqr = q3 - q1 
print(f"inter-quartile range = {iqr:.2f} cm")
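
To get a feel for how the full range and the inter-quartile range react to an extreme value, here is a minimal sketch on made-up numbers. The arrays `heights` and `with_outlier` and the helper functions `full_range` and `iqr` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical heights in cm (not from height_data.csv)
heights = np.array([158.0, 162.0, 165.0, 168.0, 170.0, 172.0, 175.0, 178.0, 182.0])

# The same data plus one mistyped entry (1700 entered instead of 170.0)
with_outlier = np.append(heights, 1700.0)

def full_range(values):
    """Difference between the largest and smallest value."""
    return values.max() - values.min()

def iqr(values):
    """Difference between the 75th and 25th percentiles."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    return q3 - q1

for name, data in [("Clean data", heights), ("With outlier", with_outlier)]:
    print(f"{name}: range = {full_range(data):.2f} cm, IQR = {iqr(data):.2f} cm")
```

A single bad entry changes the full range enormously, while the inter-quartile range barely moves.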

<br>**3. Standard Deviation**
<br><br>
Finally, let's look at the standard deviation, which is the most common measure of spread/variability. Recall that the standard deviation tells us how far, on average, the scores fall from the mean: it is the square root of the average squared difference between each score and the mean. Mathematically, for $N$ scores $x_i$ with mean $\mu$, we would write this as the following: 
<br><br>

$$
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}
$$

<br>
It would be tedious to write this calculation out by hand every time, but thankfully Python libraries have functions that allow us to calculate the standard deviation easily.

In [None]:
# Calculate standard deviation
std_height = np.std(height_df['Height'])
print(f"Std = {std_height}")
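
The formula above can also be followed step by step. The sketch below does so on a small made-up array (`sample` is hypothetical, not taken from `height_data.csv`) and compares the result with `np.std`.

```python
import numpy as np

# Hypothetical sample of five heights in cm (not from height_data.csv)
sample = np.array([160.0, 165.0, 170.0, 175.0, 180.0])

# Step 1: the mean (mu in the formula)
mu = sample.sum() / len(sample)

# Step 2: squared difference between each score and the mean
squared_deviations = (sample - mu) ** 2

# Step 3: average the squared differences (divide by N), then take the square root
variance = squared_deviations.sum() / len(sample)
manual_std = np.sqrt(variance)

print(f"Manual std       = {manual_std:.4f}")
print(f"np.std           = {np.std(sample):.4f}")
print(f"np.std, ddof = 1 = {np.std(sample, ddof=1):.4f}")
```

By default `np.std` divides by $N$, which matches the formula above. Passing `ddof=1` divides by $N - 1$ instead, giving the sample standard deviation. Note that the pandas method `height_df['Height'].std()` uses `ddof=1` by default, so it returns a slightly larger number than `np.std`.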

## ✅ Summary

In this notebook, you: 
- Plotted histograms to visualize data distributions
- Calculated measures of central tendency (mean & median)
- Calculated measures of variability (full range, inter-quartile range and standard deviation) 

<br>
These concepts form the foundation of statistical analysis and will help you interpret real-world data more effectively.
<br><br>
🎉 Great work!