# Python Lecture on Basic Statistics

Welcome to this lecture on fundamental statistical concepts using Python. We will cover four key measures:

* **Mean**: The average value of a dataset.
* **Median**: The middle value of a sorted dataset.
* **Mode**: The most frequently occurring value in a dataset.
* **Standard Deviation**: A measure of the amount of variation or dispersion of a set of values.

Let's dive in!

---

## 1. Mean (Average)

The **mean** is the arithmetic average of a set of numbers. It is calculated by summing all the values in a dataset and then dividing by the number of values.

Formula, where $x_i$ are the individual values and $n$ is the number of values:

$$ \mu = \frac{\sum_{i=1}^{n} x_i}{n} $$

Here is a Python function to calculate the mean:

In [17]:
def calculate_mean(data):
    """
    Calculates the mean (average) of a list of numbers.
    Args:
        data (list): A list of numerical data.
    Returns:
        float: The mean of the data, or 0 if the list is empty.
    """
    # Handle the case of an empty list to avoid division by zero
    if not data:
        return 0
    return sum(data) / len(data)

### Example 1: Simple integers

In [18]:
dataset1 = [10, 20, 30, 40, 50]
mean1 = calculate_mean(dataset1)
print(f"Dataset: {dataset1}")
print(f"The mean is: {mean1}")

Dataset: [10, 20, 30, 40, 50]
The mean is: 30.0


### Example 2: Numbers with a decimal

In [19]:
dataset2 = [2.5, 3.5, 4.0, 6.0, 7.5]
mean2 = calculate_mean(dataset2)
print(f"Dataset: {dataset2}")
print(f"The mean is: {mean2}")

Dataset: [2.5, 3.5, 4.0, 6.0, 7.5]
The mean is: 4.7


### Example 3: Negative numbers

In [20]:
dataset3 = [-5, 0, 5, 10, 15, 20]
mean3 = calculate_mean(dataset3)
print(f"Dataset: {dataset3}")
print(f"The mean is: {mean3}")

Dataset: [-5, 0, 5, 10, 15, 20]
The mean is: 7.5
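
As a cross-check, the same three datasets can be passed to the standard library's `statistics.mean`. This is only a sketch for comparison, not part of the lecture's own function. It also shows one design difference: `calculate_mean` returns 0 for an empty list, whereas `statistics.mean` raises `StatisticsError`.

```python
import statistics

datasets = [
    [10, 20, 30, 40, 50],
    [2.5, 3.5, 4.0, 6.0, 7.5],
    [-5, 0, 5, 10, 15, 20],
]

for data in datasets:
    manual = sum(data) / len(data)     # same arithmetic as calculate_mean
    library = statistics.mean(data)    # standard-library equivalent
    print(f"{data}: manual={manual}, statistics.mean={library}")

# statistics.mean refuses empty input instead of returning 0.
try:
    statistics.mean([])
    empty_raises = False
except statistics.StatisticsError:
    empty_raises = True
print(f"statistics.mean([]) raises StatisticsError: {empty_raises}")
```

Returning 0 for empty input keeps the examples simple, but 0 is also a legitimate mean, so raising an error (as the library does) can be the safer choice in real code.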


---

## 2. Median

The **median** is the middle value of a dataset when it is sorted in ascending order. It is a useful measure of central tendency because it is far less sensitive to extremely large or small values (outliers) than the mean.

* If the dataset has an **odd** number of values, the median is the single middle value.
* If the dataset has an **even** number of values, the median is the average of the two middle values.

Here is a Python function to calculate the median:

In [21]:
def calculate_median(data):
    """
    Calculates the median of a list of numbers.
    Args:
        data (list): A list of numerical data.
    Returns:
        float: The median of the data, or 0 if the list is empty.
    """
    # Handle the case of an empty list
    if not data:
        return 0
    
    sorted_data = sorted(data)
    n = len(sorted_data)
    
    # Odd number of values
    if n % 2 == 1:
        # The middle index is the length divided by 2 (integer division)
        return sorted_data[n // 2]
    # Even number of values
    else:
        # Get the two middle values and average them
        mid1 = sorted_data[n // 2 - 1]
        mid2 = sorted_data[n // 2]
        return (mid1 + mid2) / 2

### Example 1: Odd number of values

In [22]:
dataset4 = [1, 7, 3, 9, 5]
median1 = calculate_median(dataset4)
print(f"Dataset: {dataset4}")
print(f"The median is: {median1}")

Dataset: [1, 7, 3, 9, 5]
The median is: 5


### Example 2: Even number of values

In [23]:
dataset5 = [10, 20, 30, 40]
median2 = calculate_median(dataset5)
print(f"Dataset: {dataset5}")
print(f"The median is: {median2}")

Dataset: [10, 20, 30, 40]
The median is: 25.0


### Example 3: Outliers and unsorted data

In [24]:
dataset6 = [1, 2, 100, 3, 4, 5]
median3 = calculate_median(dataset6)
print(f"Dataset: {dataset6}")
print(f"The median is: {median3}")
# Note how the median (3.5) is not heavily skewed by the outlier (100).

Dataset: [1, 2, 100, 3, 4, 5]
The median is: 3.5
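
To make the outlier effect concrete, the sketch below compares the mean and median of the dataset above with and without the value 100. It uses the standard library's `statistics.mean` and `statistics.median` so that it runs on its own. The list `without_outlier` is introduced here only for illustration.

```python
import statistics

with_outlier = [1, 2, 100, 3, 4, 5]
without_outlier = [1, 2, 3, 4, 5]   # illustrative: same data minus the outlier

for label, data in [("without outlier", without_outlier),
                    ("with outlier", with_outlier)]:
    mean_value = statistics.mean(data)
    median_value = statistics.median(data)
    print(f"{label}: mean={mean_value:.2f}, median={median_value}")
```

Adding the single value 100 moves the mean from 3 to roughly 19.17, while the median only moves from 3 to 3.5.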


---

## 3. Mode

The **mode** is the value that appears most frequently in a dataset. A dataset can have one mode, multiple modes (bimodal, multimodal), or no mode at all if every value appears only once.

We will use the `Counter` class from the `collections` module to count how often each item occurs.

Here is a Python function to find the mode:

In [25]:
from collections import Counter

def calculate_mode(data):
    """
    Calculates the mode(s) of a list of numbers.
    Args:
        data (list): A list of data.
    Returns:
        list: The mode(s), or an empty list if the data is empty or no value repeats.
    """
    if not data:
        return []
    
    counts = Counter(data)
    max_count = max(counts.values())
    
    # Return all items that have the max count
    modes = [key for key, value in counts.items() if value == max_count]
    
    # Handle the case where every element appears only once
    if len(modes) == len(data):
        return [] # No mode
    else:
        return modes

### Example 1: Single mode

In [26]:
dataset7 = [1, 2, 2, 3, 4, 4, 4, 5, 5]
mode1 = calculate_mode(dataset7)
print(f"Dataset: {dataset7}")
print(f"The mode is: {mode1}")

Dataset: [1, 2, 2, 3, 4, 4, 4, 5, 5]
The mode is: [4]


### Example 2: Bimodal (multiple modes)

In [27]:
dataset8 = [1, 2, 2, 3, 4, 4, 5]
mode2 = calculate_mode(dataset8)
print(f"Dataset: {dataset8}")
print(f"The modes are: {mode2}")

Dataset: [1, 2, 2, 3, 4, 4, 5]
The modes are: [2, 4]


### Example 3: No mode

In [28]:
dataset9 = [10, 20, 30, 40, 50]
mode3 = calculate_mode(dataset9)
print(f"Dataset: {dataset9}")
print(f"The mode is: {mode3}")

Dataset: [10, 20, 30, 40, 50]
The mode is: []
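
For comparison, the standard library offers `statistics.multimode` (available from Python 3.8). The sketch below runs it on the three datasets above. Note the different convention: when every value occurs exactly once, `multimode` returns all of the values, whereas `calculate_mode` returns an empty list.

```python
import statistics

single = [1, 2, 2, 3, 4, 4, 4, 5, 5]
bimodal = [1, 2, 2, 3, 4, 4, 5]
all_unique = [10, 20, 30, 40, 50]

for data in (single, bimodal, all_unique):
    # multimode returns every value tied for the highest count,
    # in the order each value first appears in the data.
    print(f"{data} -> multimode: {statistics.multimode(data)}")
```

Which convention is preferable depends on whether "no repeated value" should count as "no mode" or as "every value is a mode".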


---

## 4. Standard Deviation

The **standard deviation** is a measure of how spread out the numbers in a dataset are from the mean. A low standard deviation means values are close to the mean, while a high standard deviation indicates values are spread out over a wider range.

To calculate standard deviation, we follow these steps:
1.  Calculate the mean of the data.
2.  For each number, subtract the mean and square the result (this is the squared deviation).
3.  Calculate the mean of these squared deviations (this is the population variance, which divides by $n$; the sample variance divides by $n - 1$ instead).
4.  Take the square root of the variance to get the standard deviation.

Formula (population standard deviation):

$$ \sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}} $$
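
As a worked instance of the formula, take the dataset 2, 4, 4, 4, 5, 5, 7, 9. It is not one of the lecture's examples and is chosen here only because the arithmetic comes out cleanly.

$$ \mu = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = \frac{40}{8} = 5 $$

$$ \sum_{i=1}^{8}(x_i - \mu)^2 = 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32 $$

$$ \sigma = \sqrt{\frac{32}{8}} = \sqrt{4} = 2 $$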

Here is a Python function to calculate the standard deviation. We will reuse the `calculate_mean` function defined earlier.

In [29]:
import math

def calculate_standard_deviation(data):
    """
    Calculates the standard deviation of a list of numbers.
    Args:
        data (list): A list of numerical data.
    Returns:
        float: The population standard deviation of the data, or 0 if the list is empty.
    """
    if not data:
        return 0
    
    # Step 1: Calculate the mean
    mean_value = calculate_mean(data)
    
    # Step 2: Calculate the squared deviations
    squared_deviations = [(x - mean_value) ** 2 for x in data]
    
    # Step 3: Calculate the variance (mean of squared deviations)
    variance = calculate_mean(squared_deviations)
    
    # Step 4: Take the square root of the variance
    return math.sqrt(variance)

### Example 1: Low standard deviation

In [30]:
dataset10 = [10, 11, 10, 12, 9, 10]
std_dev1 = calculate_standard_deviation(dataset10)
print(f"Dataset: {dataset10}")
print(f"The standard deviation is: {std_dev1:.4f}") # Low spread

Dataset: [10, 11, 10, 12, 9, 10]
The standard deviation is: 0.9428


### Example 2: High standard deviation

In [31]:
dataset11 = [1, 5, 10, 15, 20]
std_dev2 = calculate_standard_deviation(dataset11)
print(f"Dataset: {dataset11}")
print(f"The standard deviation is: {std_dev2:.4f}") # High spread

Dataset: [1, 5, 10, 15, 20]
The standard deviation is: 6.7941


### Example 3: Zero standard deviation

In [32]:
dataset12 = [5, 5, 5, 5, 5]
std_dev3 = calculate_standard_deviation(dataset12)
print(f"Dataset: {dataset12}")
print(f"The standard deviation is: {std_dev3:.4f}") # No spread

Dataset: [5, 5, 5, 5, 5]
The standard deviation is: 0.0000
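
The function above divides by $n$, so it computes the population standard deviation. The sketch below compares the standard library's `statistics.pstdev` (which divides by $n$) with `statistics.stdev` (which divides by $n - 1$) on the dataset from Example 1, to show how the two conventions differ.

```python
import statistics

data = [10, 11, 10, 12, 9, 10]

population_sd = statistics.pstdev(data)  # divides by n, like calculate_standard_deviation
sample_sd = statistics.stdev(data)       # divides by n - 1 (sample estimate)

print(f"Population standard deviation: {population_sd:.4f}")
print(f"Sample standard deviation:     {sample_sd:.4f}")
```

The sample version is always at least as large as the population version. It is the usual choice when the data is a sample drawn from a larger population rather than the whole population itself.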
