
# Day 4: Measures of Variation - Graphing Box Plots Using Matplotlib

In today’s session, we will build on our understanding of **measures of central tendency** and move into exploring **measures of variation**. We’ll cover how to quantify **data spread** using **range**, **variance**, and **standard deviation**.

We will also introduce **Chebyshev’s Theorem** as a tool for detecting **outliers** and learn how to visualize **data distribution** using **box and whisker plots**. Finally, we’ll connect our statistical insights to summary statistics provided by **`.describe()`** and discuss their **real-world applications**.

## 1. Quick Review from Previous Class
- **Recap of key topics**: Wrangling data and exploring measures of central tendency.
- **Using `pandas.astype()`**: Converting data types to ensure consistency in analysis.
- **Sorting and Grouping Data**: Techniques for organizing and summarizing datasets.
- **Using the `.describe()` Function**: Generating summary statistics to understand data distribution.
- **Review of Measures of Central Tendency**: Understanding **median**, **mode**, and **trimmed mean** as ways to summarize data, and calculating them with Python.

## 2. What Are Measures of Variation?
- **Definition**: Quantify how spread out or dispersed data values are.
- **Importance**: Understanding data spread is crucial for risk assessment, data comparison, and identifying outliers.
- **Common measures**: Range, Variance, Standard Deviation, and Interquartile Range (IQR).
  - **Range**: Maximum - Minimum.
  - **Variance**: Average of squared deviations from the mean.
  - **Standard Deviation**: Square root of the variance.
  - **Interquartile Range (IQR)**: Difference between the 75th percentile (Q3) and the 25th percentile (Q1), measuring the spread of the middle 50% of the data.  

### 2.1 Range
- **Concept**: The simplest measure of variation, calculated as the **difference between the maximum and minimum** values.
- **Strengths**:
  - Very easy to compute.
- **Weaknesses**:
  - Very sensitive to outliers (a single extreme value can distort it).
- **Interpretation**: A quick snapshot of how wide the data is spread.
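
To make the sensitivity to outliers concrete, here is a minimal sketch. The values in `toy_games` are hypothetical and are not taken from the NFL dataset; the variable names are chosen only for this illustration.

```python
import pandas as pd

# Hypothetical suspension lengths (in games); not real data
toy_games = pd.Series([1, 2, 2, 4, 4, 6])

# Same values plus one extreme suspension of 16 games
toy_games_with_outlier = pd.concat([toy_games, pd.Series([16])], ignore_index=True)

# Range = maximum - minimum
range_without = toy_games.max() - toy_games.min()
range_with = toy_games_with_outlier.max() - toy_games_with_outlier.min()

print(f"Range without the extreme value: {range_without}")
print(f"Range with the extreme value:    {range_with}")
```

A single added value triples the range here, even though the other six values did not change.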

**NOTE:** We will continue using the same NFL suspensions dataset as on Day 3. The following cell loads the DataFrame and cleans it.

In [None]:
import pandas as pd

nfl_suspension_df = pd.read_csv("https://raw.githubusercontent.com/liger1apwm/MAT-301_Applied_Stats_Data_Analysis/refs/heads/main/data/nfl-suspensions-data.csv")

# Select relevant columns
nfl_suspension_df = nfl_suspension_df[['name', 'team', 'games', 'category', 'year']]

# Remove rows where 'games' is 'Indef.'
nfl_suspension_df = nfl_suspension_df[nfl_suspension_df['games'] != "Indef."]


# Convert remaining columns to the correct types
nfl_suspension_df = nfl_suspension_df.astype({
    'name': 'string',
    'team': 'string',
    'games': 'int16',
    'category': 'string',
    'year':'int16'
})

# Reset index
nfl_suspension_df = nfl_suspension_df.reset_index(drop=True)

nfl_suspension_df

**Now let's calculate the range for the column `games`:**

In [None]:
games_range = nfl_suspension_df['games'].max() - nfl_suspension_df['games'].min()

print(f"Range of 'games' column: {games_range}")

### 2.2 Variance
- **Concept**: The average of squared deviations from the mean.
- **Formula** (sample variance):
  $$
  s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}
  $$
  (For a population, we use $\mu$ as the mean and $N$ in the denominator instead.)
- **Significance**: Tells us how data points spread out from the mean. A higher variance means more spread.
- **Use cases**: Comparing dispersion in different datasets.
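
The formula can be followed step by step on a small example. This is a sketch on hypothetical values (`toy_values` is not from the NFL dataset) that computes the variance by hand and compares it with what pandas returns.

```python
import pandas as pd

# Hypothetical data, chosen so the arithmetic is easy to follow
toy_values = pd.Series([2, 4, 4, 6, 9])

n_obs = len(toy_values)
mean_value = toy_values.mean()

# Squared deviation of each value from the mean
squared_deviations = (toy_values - mean_value) ** 2
sum_of_squares = squared_deviations.sum()

# Sample variance divides by n - 1; population variance divides by n
manual_sample_var = sum_of_squares / (n_obs - 1)
manual_population_var = sum_of_squares / n_obs

print("Manual sample variance:    ", manual_sample_var)
print("pandas .var():             ", toy_values.var())
print("Manual population variance:", manual_population_var)
print("pandas .var(ddof=0):       ", toy_values.var(ddof=0))
```

The manual results should agree with pandas, which confirms that `.var()` uses the $n-1$ denominator by default.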

In [None]:
variance_sample = nfl_suspension_df['games'].var()  # sample variance
variance_population = nfl_suspension_df['games'].var(ddof=0)  # population variance

print("Sample Variance (ddof=1):", variance_sample)
print("Population Variance (ddof=0):", variance_population)

**Note**: `ddof` stands for Delta Degrees of Freedom. pandas divides the sum of squared deviations by `n - ddof`, so the default `ddof=1` gives the sample variance and `ddof=0` gives the population variance.

### 2.3 Standard Deviation
- **Concept**: The square root of the variance. It brings the measure of spread back to the original units of the data.
- **Importance**: Easier to interpret in real-world terms compared to the variance.
- **Real-world applications**:  
  - **Finance**: Measures market volatility—higher values indicate greater risk.  
  - **Quality Control**: Detects inconsistencies in manufacturing processes.  
  - **Education**: Evaluates score dispersion in exams to assess performance variability.  
  - **Healthcare**: Analyzes patient response variability in clinical studies.  
  - **Sports & Weather**: Tracks performance consistency and climate fluctuations over time.  

Similar to variance, standard deviation quantifies **how much data deviates from the mean**, making it essential for understanding **variability and consistency** in different fields.
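
As a small illustration of why standard deviation is easier to interpret, the sketch below compares two hypothetical sets of exam scores (invented for this example) that share the same mean but differ in spread.

```python
import math
import pandas as pd

# Two hypothetical classes with the same average score
scores_consistent = pd.Series([68, 70, 70, 72, 70])
scores_spread_out = pd.Series([50, 60, 70, 80, 90])

mean_consistent = scores_consistent.mean()
mean_spread_out = scores_spread_out.mean()

# Sample standard deviation (ddof=1), expressed in score points
std_consistent = scores_consistent.std()
std_spread_out = scores_spread_out.std()

print(f"Means: {mean_consistent} vs {mean_spread_out}")
print(f"Standard deviations: {std_consistent:.2f} vs {std_spread_out:.2f} points")

# Standard deviation is the square root of the variance
print(math.isclose(std_spread_out, math.sqrt(scores_spread_out.var())))
```

The means alone cannot tell the two classes apart; the standard deviation, in the same units as the scores, shows that the second class is far less consistent.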

In [None]:
# Standard Deviation in pandas
std_sample = nfl_suspension_df['games'].std()  # sample std dev (ddof=1)
std_population = nfl_suspension_df['games'].std(ddof=0)  # population std dev

print("Sample Standard Deviation (ddof=1):", std_sample)
print("Population Standard Deviation (ddof=0):", std_population)

### 2.4 Interquartile Range (IQR)
- **Concept**: The **Interquartile Range (IQR)** measures the spread of the **middle 50%** of a dataset. It is the difference between the **third quartile (Q3)** and the **first quartile (Q1)**.  

- **Formula**:  
  $$ \text{IQR} = Q3 - Q1 $$  

  where:  
  - $Q1$ (25th percentile) is the median of the lower half of the data.  
  - $Q3$ (75th percentile) is the median of the upper half of the data.  

- **Significance**:  
  - Less sensitive to outliers than range and standard deviation.  
  - Useful for detecting **outliers**: potential outliers are values below the lower bound or above the upper bound, where the bounds are calculated with these formulas:  
    $$ \text{Lower Bound} = Q1 - 1.5 \times \text{IQR} $$  
    $$ \text{Upper Bound} = Q3 + 1.5 \times \text{IQR} $$  

- **Use cases**:  
  - Identifying extreme values in datasets.  
  - Summarizing dispersion without being affected by outliers.  
  - Comparing data distributions in different groups.  

In [None]:
# Calculate Q1, Q3, and IQR
Q1 = nfl_suspension_df['games'].quantile(0.25)
Q3 = nfl_suspension_df['games'].quantile(0.75)
IQR = Q3 - Q1

print(f"Q1 (25th percentile): {Q1}")
print(f"Q3 (75th percentile): {Q3}")
print(f"Interquartile Range (IQR): {IQR}")
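
The lower and upper bounds described above can be computed the same way. The following is a sketch on hypothetical values (`toy_suspensions` is not the NFL data), using names that do not clash with the variables in the cell above.

```python
import pandas as pd

# Hypothetical suspension lengths with one extreme value
toy_suspensions = pd.Series([1, 2, 2, 3, 4, 4, 4, 6, 8, 16])

toy_q1 = toy_suspensions.quantile(0.25)
toy_q3 = toy_suspensions.quantile(0.75)
toy_iqr = toy_q3 - toy_q1

# 1.5 x IQR rule for potential outliers
iqr_lower_bound = toy_q1 - 1.5 * toy_iqr
iqr_upper_bound = toy_q3 + 1.5 * toy_iqr

# Keep only the values that fall outside the bounds
iqr_outliers = toy_suspensions[
    (toy_suspensions < iqr_lower_bound) | (toy_suspensions > iqr_upper_bound)
]

print(f"Q1 = {toy_q1}, Q3 = {toy_q3}, IQR = {toy_iqr}")
print(f"Bounds: [{iqr_lower_bound}, {iqr_upper_bound}]")
print("Potential outliers:", iqr_outliers.tolist())
```

The same filter can be applied to `nfl_suspension_df['games']` by swapping in that column for the toy Series.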

Let’s display all the `games` values in sorted order to check them against the quartiles computed above.

In [None]:
print(*sorted(nfl_suspension_df['games']))

## 3. Chebyshev’s Theorem and Outlier Identification

- **Chebyshev’s Theorem**: States that for any dataset (without assuming normality) and any $k > 1$, at least  
  $$ 1 - \frac{1}{k^2} $$  
  of the data values lie within $k$ standard deviations of the mean.

- **Example**: For $k = 2$, at least  
  $$ 1 - \frac{1}{4} = 0.75 \text{ (75\%)} $$  
  of values are within 2 standard deviations of the mean.

- **Role**: Useful for **non-normal distributions** to identify potential outliers. Potential outliers are values below the lower bound or above the upper bound, where the bounds are calculated with these formulas:  
    $$ \text{Lower Bound} = \bar{x} - k \cdot s $$  
    $$ \text{Upper Bound} = \bar{x} + k \cdot s $$  

- **Advantages**: More general than the **Empirical Rule**, which requires normality.

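- **Limitations**: The bound is conservative. It guarantees only a minimum proportion, so the interval it implies is often **wider** than the Empirical Rule's, potentially **overestimating** spread for roughly normal data. For example, at k = 2 Chebyshev guarantees at least 75%, while the Empirical Rule gives about 95% for normal data.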

In [None]:
# Define k (number of standard deviations)
k = 3

# Calculate mean and standard deviation
mean_games = nfl_suspension_df['games'].mean()
std_dev_games = nfl_suspension_df['games'].std()

# Compute the bounds using Chebyshev's theorem
lower_bound = mean_games - k * std_dev_games
upper_bound = mean_games + k * std_dev_games

# Identify potential outliers (values outside the bounds)
outliers = nfl_suspension_df[(nfl_suspension_df['games'] < lower_bound) | (nfl_suspension_df['games'] > upper_bound)]

# Display results
print(f"Mean: {mean_games:.2f}")
print(f"Standard Deviation: {std_dev_games:.2f}")
print(f"Minimum percentage of data within {k} std dev (Chebyshev): {(1 - 1 / k**2) * 100:.2f}%")
print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
print("\nPotential Outliers:")
display(outliers)
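
To see that Chebyshev's guarantee is a minimum rather than an estimate, one can compare it with the share of values actually observed inside the bounds. The sketch below does this on simulated, right-skewed data; the exponential distribution, seed, and sample size are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed data (an assumption for illustration, not the NFL data)
rng = np.random.default_rng(seed=0)
skewed_sample = pd.Series(rng.exponential(scale=3.0, size=1000))

sample_mean = skewed_sample.mean()
sample_std = skewed_sample.std()

chebyshev_check = []
for num_std in (1.5, 2, 3):
    # Share of values within num_std standard deviations of the mean
    within = (skewed_sample - sample_mean).abs() <= num_std * sample_std
    observed_share = within.mean()
    # Minimum share guaranteed by Chebyshev's theorem
    guaranteed_share = 1 - 1 / num_std**2
    chebyshev_check.append((num_std, guaranteed_share, observed_share))
    print(f"k = {num_std}: guaranteed >= {guaranteed_share:.1%}, observed = {observed_share:.1%}")
```

The observed share should never fall below the guaranteed share, and it is typically well above it, which is why Chebyshev-based bounds tend to flag few points as outliers.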

## 4. Box and Whisker Plots Using Python with Outliers
- **Purpose of Box Plots**:
  - Quickly visualize data spread, median, and potential outliers.
- **What is matplotlib?**
  - A popular Python library for creating static, animated, and interactive visualizations.
- **How to create a box plot**:
  - Use `plt.boxplot()` from the `matplotlib.pyplot` module. This library is usually imported with the alias `plt`:<br> `import matplotlib.pyplot as plt`
- **Components**:
  - **Minimum** (excluding outliers)
  - **Q1** (25th percentile)
  - **Median** (50th percentile)
  - **Q3** (75th percentile)
  - **Maximum** (excluding outliers)
  - **Outliers** (plotted as individual points)
- **Real-world examples**: Comparing multiple groups, summarizing how each group’s data is distributed.

**Remember**: Google Colab comes with several packages, such as pandas, numpy, and matplotlib, pre-installed. However, if you’re working in other IDEs, you may need to install these packages first.

Let's first draw a box-and-whisker plot without marking outliers, so the whiskers extend all the way to the minimum and maximum values.

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt  # For plotting
import numpy as np  # For numerical operations

# Create a figure with a specified size (width=10 inches, height=5 inches)
plt.figure(figsize=(10, 5))

# Create a horizontal box plot of the "games" column from nfl_suspension_df
plt.boxplot(nfl_suspension_df['games'],
            vert=False,          # Makes the box plot horizontal
            patch_artist=True,   # Fills the box with color
            whis=[0, 100])       # Whiskers span the 0th to 100th percentile, so no points are drawn as outliers

# Set the title of the plot
plt.title("Box Plot of Games Suspended (Without Outliers)")

# Label the x-axis to indicate the values represent suspended games
plt.xlabel("Games Suspended")

# Determine the minimum and maximum values of the "games" column
min_games = int(nfl_suspension_df['games'].min())  # Convert min value to integer
max_games = int(nfl_suspension_df['games'].max())  # Convert max value to integer

# Set custom x-axis ticks at an interval of 2 (from min_games to max_games)
plt.xticks(np.arange(min_games, max_games + 1, 2))

# Display the final box plot
plt.show()

With outliers (Matplotlib's default, where the whiskers stop at the last data point within 1.5 × IQR of the box):

In [None]:
# change the size of the figure
plt.figure(figsize=(10, 5))
plt.boxplot(nfl_suspension_df['games'],
            vert=False,          # Makes the box plot horizontal instead of vertical
            patch_artist=True)   # Fills the box with color for better visualization

plt.title("Box Plot of Games Suspended (With Outliers)")
plt.xlabel("Games Suspended")

min_games = int(nfl_suspension_df['games'].min())
max_games = int(nfl_suspension_df['games'].max())
plt.xticks(np.arange(min_games, max_games + 1, 2))

plt.show()
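
The positions of the box, whiskers, and outlier points can also be worked out numerically. The sketch below follows the default 1.5 × IQR whisker rule on hypothetical values (not the NFL data): each whisker ends at the most extreme data point that is still inside the bounds, and anything beyond is drawn as an individual point.

```python
import numpy as np

# Hypothetical values with one extreme point
box_data = np.array([1, 2, 2, 3, 4, 4, 4, 6, 8, 16])

# Box edges and median line
box_q1, box_median, box_q3 = np.percentile(box_data, [25, 50, 75])
box_iqr = box_q3 - box_q1

# Bounds from the 1.5 x IQR rule
lower_fence = box_q1 - 1.5 * box_iqr
upper_fence = box_q3 + 1.5 * box_iqr

# Whiskers end at the most extreme data points inside the bounds
inside = box_data[(box_data >= lower_fence) & (box_data <= upper_fence)]
whisker_low = inside.min()
whisker_high = inside.max()

# Points outside the bounds are plotted individually as outliers
fliers = box_data[(box_data < lower_fence) | (box_data > upper_fence)]

print(f"Box: Q1 = {box_q1}, median = {box_median}, Q3 = {box_q3}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print("Outlier points:", fliers.tolist())
```

Note that the whiskers sit on actual data values, not on the computed bounds themselves.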

## 5. Relation Between Box Plot and `.describe()`
- `df.describe()` gives us:
  1. **count**
  2. **mean**
  3. **std** (standard deviation)
  4. **min**
  5. **25%** (Q1)
  6. **50%** (median)
  7. **75%** (Q3)
  8. **max**
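- These statistics map directly onto the box plot: **25%**, **50%**, and **75%** are the box edges and the median line, while **min** and **max** are the whisker ends when no points are flagged as outliers (otherwise they are the most extreme outlier points).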

In [None]:
df_description = nfl_suspension_df['games'].describe()
df_description

Notice that the box plot simply reflects the values provided by the `.describe()` function.
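
Because `.describe()` returns a labeled Series, the spread measures from earlier sections can be derived straight from its output. This is a small sketch on hypothetical values; `toy_series` and the other names are invented for the example.

```python
import pandas as pd

# Hypothetical values, not the NFL data
toy_series = pd.Series([1, 2, 2, 3, 4, 4, 4, 6, 8, 16])

toy_description = toy_series.describe()

# Pull individual statistics out by their labels
described_range = toy_description["max"] - toy_description["min"]
described_iqr = toy_description["75%"] - toy_description["25%"]

print(toy_description)
print(f"Range from describe(): {described_range}")
print(f"IQR from describe():   {described_iqr}")
```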

Let's try using `.describe()` on the whole DataFrame:

In [None]:
nfl_suspension_df.describe(include="all")

## 6. Box and Whisker Plots Based on Categories Within a Column  

Drawing one box plot per category within a column allows for **comparisons between different groups**.


### **Why Group Box Plots by Category?**  
Grouping box plots by **categories within a column** helps in:
- **Comparing distributions** across different groups.  
- **Identifying variability** within each category.  
- **Detecting outliers** that may indicate anomalies in specific groups.  
- **Understanding trends** and how different factors influence numerical values.

###  **Example Use Case: Suspensions in the NFL**  
If we plot **suspension length (`games`) grouped by team (`team`)**, we can analyze:  
- Which **teams** tend to have **longer suspensions**.  
- How much **variation** exists in suspension lengths within each team.  
- Whether certain teams have **outliers**, indicating extreme cases.  



In [None]:

# DataFrame.boxplot creates its own figure, so no separate plt.figure() call is needed
ax = nfl_suspension_df.boxplot(column='games', by='team', figsize=(12, 8))

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')  # Rotates labels and aligns them to the right


plt.title("Box Plot of Games Suspended by Team")
plt.suptitle("")  # Removes the default 'Boxplot grouped by team' title
plt.xlabel("Team")
plt.ylabel("Games Suspended")
plt.show()

###  **Interpreting the Plot**  
- A **tall box** (large IQR) suggests **high variation** in suspension lengths within that team.  
- A **high median line** means players on that team typically receive **longer suspensions**.  
- **Outliers** beyond the whiskers indicate **exceptionally long or short suspensions** compared to the norm.  

By grouping box plots by team, we can analyze the distribution of `games` suspended across different teams, providing insights into how disciplinary actions vary. This visualization helps identify trends, such as which teams have players with longer or more frequent suspensions, potential outliers, and overall consistency in enforcement.
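
The numbers behind a grouped box plot can be tabulated with `groupby`. The sketch below uses a small hypothetical DataFrame (the team labels `AAA` and `BBB` and the values are invented) and computes the quartiles and IQR for each group.

```python
import pandas as pd

# Hypothetical data with two made-up team labels
toy_df = pd.DataFrame({
    "team": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "games": [1, 2, 4, 4, 4, 16],
})

# Quartiles of 'games' within each team, one row per team
team_quartiles = (
    toy_df.groupby("team")["games"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "Q1", 0.5: "median", 0.75: "Q3"})
)

# Box height for each team
team_quartiles["IQR"] = team_quartiles["Q3"] - team_quartiles["Q1"]

print(team_quartiles)
```

Replacing `toy_df` with `nfl_suspension_df` should give the per-team values that the grouped box plot draws.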

## 7. Summary and Questions
- **Recap key points**:
  1. **Range**: Simple but sensitive to outliers.
  2. **Variance** and **Standard Deviation**: Crucial for understanding spread.
  3. **Interquartile Range (IQR)**: Measures the middle 50% of data, reducing the impact of outliers.
  4. **Chebyshev’s Theorem**: A general rule for data spread within $k$ standard deviations.
  5. **Box Plots**: Powerful visual for identifying outliers and understanding distribution.
- **Identify outliers**: Both Chebyshev’s theorem (theoretical bound) and box plots (practical visualization) help.
- **Combine numerical summaries and visuals**: Use `.describe()` and **box plots** together.

**Questions?** Feel free to ask for more examples or clarifications on any of these concepts.