# **<center>Statistics and Probability**

**<center>Exploring the Unknown**

<center><img src="images/pier.jpeg" width=600 height=900 />

You return to a familiar pier in your hometown, the last remnant of a time and place made indistinguishable from dream by the flow of time.    
It stands as you remember it.   
But no, not quite.  
The pylons, holding up the pier, are more exposed than they once were.   
**The distance between the boardwalk and the water is too wide.**   
**"What happened?"**   
You walk toward the jutting structure,   
To face the monument of your youth, standing defiant, in an ocean of time.

# **Objectives**

- Measures of central tendency
    - Mode
    - Mean
    - Median
    - Standard Deviation
- Percentiles
- Distributions
    - Normal Distribution

# **Measures of Central Tendency**

What does it mean to be average?  
What does it mean to be normal?

[Average Americans believe they are above average in intelligence](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6029792/)

Descriptive statistics help us summarize and understand the main features of a dataset. We'll cover common measures like mean, median, mode, variance, and standard deviation, and demonstrate how to compute them with practical examples.

In [1]:
import statistics as st

# **Mode**

The mode is a measure of centrality that represents the most frequently occurring value in a dataset. It is particularly useful for categorical and discrete data, where the data points are not necessarily numerical. The mode can provide insights into the most common category or value within the dataset.

To calculate the mode in Python, you can use the `statistics` module, which provides a built-in function called `mode()`. This function takes a list of data as input and returns a single mode; if several values tie for most frequent, it returns the first one encountered (Python 3.8+).

In [2]:
# Example data
data = [10, 20, 30, 20, 40, 30, 50, 40, 40]

# Calculate the mode
mode = st.mode(data)
print(f"Mode: {mode}")

Mode: 40
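
To make the idea of "most frequently occurring value" concrete, here is a minimal sketch that counts occurrences with `collections.Counter`. It illustrates the concept only and is not necessarily how `statistics.mode()` works internally; the names `counts`, `mode_value`, and `mode_count` are chosen here for illustration.

```python
from collections import Counter

data = [10, 20, 30, 20, 40, 30, 50, 40, 40]

# Count how often each value occurs
counts = Counter(data)

# most_common(1) returns a one-element list holding the (value, count) pair
# with the highest count
mode_value, mode_count = counts.most_common(1)[0]

print(f"Counts: {dict(counts)}")
print(f"Mode: {mode_value} (appears {mode_count} times)")
```

The value 40 appears three times, more often than any other value, which agrees with the result of `st.mode(data)`.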


**Suitable for Categorical Data:** Mode doesn't require numeric inputs

In [3]:
# Example categorical data
data = ['Red', 'Green', 'Blue', 'Red', 'Red']

# Calculate the mode
mode = st.mode(data)
print(f"Mode: {mode}")

Mode: Red


**Suitable for Ordinal Data:** Mode doesn't require numeric inputs

In [4]:
data = ['Low', 'Medium', 'High', 'Medium', 'Low']

# Calculate the mode
mode = st.mode(data)
print(f"Mode: {mode}")

Mode: Low


#### Multimode:

In [5]:
data = ['Low', 'Medium', 'High', 'Medium', 'Low']

# Calculate the mode
mode = st.multimode(data)
print(f"Multimode: {mode}")

Multimode: ['Low', 'Medium']


# **Weaknesses of the mode**

**1. Multimodal Distributions:** If a dataset has multiple modes (bimodal or multimodal data), `st.mode()` returns only the first one encountered, so a single mode might not accurately represent a central value. In the dataset below, 20 and 40 both occur three times, and `st.multimode()` is needed to report both.

In [6]:
# Example data with multiple modes
data = [10, 20, 20, 20, 30, 40, 40, 40, 50, 50]

# Calculate the mode
mode_value = st.multimode(data)
print(f"Modes: {mode_value}")

Modes: [20, 40]
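
For comparison, a short sketch running both functions on the same tied dataset. It assumes Python 3.8 or later, where `st.mode()` returns the first mode encountered instead of raising an error; the variable names are illustrative.

```python
import statistics as st

data = [10, 20, 20, 20, 30, 40, 40, 40, 50, 50]

first_mode = st.mode(data)      # only the first mode encountered
all_modes = st.multimode(data)  # every value tied for the highest count

print(f"mode(): {first_mode}")
print(f"multimode(): {all_modes}")
```

Here `mode()` reports 20 alone, hiding the fact that 40 is just as frequent.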


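**2. Skewed Distributions:** In skewed data, the most frequent value can sit far from the bulk of the observations. In the example below, the mode (50) is also the largest value in the dataset.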

<center><img src="images/skew1.png" width=600 height=900 />

In [7]:
# Example data with a skewed distribution
data = [10, 15, 20, 25, 30, 35, 40, 50, 50, 50]

# Calculate the mode
mode_value = st.mode(data)
print(f"Mode: {mode_value}")

Mode: 50


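**3. Lack of Central Tendency:** When every value occurs equally often, no value stands out as most frequent. `st.mode()` still returns the first value it encounters, which says nothing about the center of the data.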

In [8]:
# Example data with equal frequencies
data = [1, 2, 3, 4, 5, 6]

# Calculate the mode
mode_value = st.mode(data)
print(f"Mode: {mode_value}")

Mode: 1


**4. Sensitivity to Small Changes:** Adding or removing a single observation can change the mode entirely.

In [9]:
# Example data with slightly different values
data = [1, 1, 2, 2, 3, 4, 5, 5, 5]  # What happens if we add a 1?

# Calculate the mode
mode_value = st.mode(data)
print(f"Mode: {mode_value}")

Mode: 5
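
A small sketch answering the question in the comment above: append a single 1 and recompute the mode. It assumes Python 3.8 or later, where ties are resolved in favor of the first value encountered; `data_plus_one` is a name introduced here.

```python
import statistics as st

data = [1, 1, 2, 2, 3, 4, 5, 5, 5]
data_plus_one = data + [1]  # one extra observation

mode_before = st.mode(data)
mode_after = st.mode(data_plus_one)

print(f"Mode before: {mode_before}")
print(f"Mode after adding a 1: {mode_after}")
```

With the extra observation, 1 and 5 each occur three times. Because 1 is encountered first, the reported mode jumps from 5 to 1.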


While the mode is a valuable measure for certain types of data, these weaknesses highlight the importance of considering the nature of the dataset and the specific characteristics of the data points when choosing an appropriate measure of centrality.   
Depending on the data's **distribution** and **type**, the mode may not always provide the most accurate representation of central tendency.

# **Mean**

**Mean** is the sum of all data points divided by the total number of data points.

[The idea of the average (as a mean)](https://timharford.com/2019/08/the-strange-power-of-the-idea-of-average/#:~:text=We%20do%20not%20know%20who,the%20direction%20of%20magnetic%20north.)

In [10]:
# Example data
data = [12, 15, 18, 20, 22]
number_of_data_points = len(data)

# Calculate the mean
mean = round(st.mean(data),2)
print(f"Mean: {mean}")

Mean: 17.4
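
As a worked check of the definition on the data above, the same calculation written as a formula (the symbols $\bar{x}$, $n$, and $x_i$ are notation introduced here):

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{12 + 15 + 18 + 20 + 22}{5} = \frac{87}{5} = 17.4
$$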


# **Weaknesses of the Mean**

**1. Sensitive to Outliers:** The mean is highly sensitive to extreme values, also known as outliers. Even a single outlier can significantly affect the mean, pulling it towards higher or lower values. As a result, the mean may not accurately represent the typical value of the majority of the data.

In [11]:
# Example data with an outlier
data = [10, 15, 18, 20, 100]

# Calculate the mean
mean = round(st.mean(data),2)
print(f"Mean: {mean}")

Mean: 32.6
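
To see how far one outlier moves the mean, here is a small sketch comparing the same data with and without the value 100. Dropping the outlier by hand is for illustration only and is not a recommendation for how to treat outliers; the variable names are hypothetical.

```python
import statistics as st

data = [10, 15, 18, 20, 100]
without_outlier = [x for x in data if x != 100]  # drop the extreme value by hand

mean_with = st.mean(data)
mean_without = st.mean(without_outlier)

print(f"Mean with outlier: {round(mean_with, 2)}")
print(f"Mean without outlier: {round(mean_without, 2)}")
```

A single value roughly doubles the mean, from 15.75 to 32.6, even though four of the five data points are 20 or less.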


**2. Skewed Distributions:** In datasets with skewed distributions (asymmetrically distributed), the mean can be misleading. It may not align with the bulk of the data points, giving a false impression of the central tendency. In such cases, the median might be a more appropriate measure of centrality.

In [12]:
# Example data with a skewed distribution
data = [10, 15, 20, 25, 30, 50, 100]

# Calculate the mean
mean = round(st.mean(data),2)
print(f"Mean: {mean}")

Mean: 35.71


The data in this example is positively skewed with a long tail on the right. The mean (35.7) is pulled towards the larger values, even though the bulk of the data is clustered towards the left.
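
A brief sketch placing the median next to the mean for the same skewed data, to show the gap between the two. The variable names are illustrative.

```python
import statistics as st

data = [10, 15, 20, 25, 30, 50, 100]

mean_value = round(st.mean(data), 2)
median_value = st.median(data)

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
```

The median (25) stays with the bulk of the data, while the mean (35.71) is larger than five of the seven values.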

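**3. Unreliable for Small Samples:** Like every measure of centrality, a mean computed from only a few data points can differ substantially from the center of the population it was drawn from.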

In [13]:
# Example data with a small sample size
data = [5, 10, 15]

# Calculate the mean
mean = round(st.mean(data),2)
print(f"Mean: {mean}")

Mean: 10


**Not Suitable for Categorical Data:** Mean requires numeric inputs

In [14]:
# Example categorical data
data = ['Red', 'Green', 'Blue', 'Red', 'Red']

# Calculate the mean
# mean = st.mean(data)
# print(f"Mean: {mean}")

**Not Suitable for Ordinal Data:** Mean requires numeric inputs

In [15]:
# Example ordinal data
data = ['Low', 'Medium', 'High', 'Medium', 'Low']

# Calculate the mean
# mean = st.mean(data)
# print(f"Mean: {mean}")

# **Median**

The **median** is the middle value of a dataset when it is arranged in ascending or descending order. It is a useful measure, especially when dealing with skewed data or datasets with outliers. The median is less affected by outliers and skewed data than the mean and is usually the preferred measure of central tendency when the distribution is not symmetrical.  
To calculate the median, you sort the data and find the value at the center. If the dataset has an odd number of data points, the median is the middle value. If the dataset has an even number of data points, the median is the average of the two middle values.

In [16]:
# Example data
data = [10, 15, 20, 25, 30, 50, 50, 50, 50]

# Calculate the median
median = st.median(data)
print(f"Median: {median}")

Median: 30
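
The sort-and-pick-the-middle procedure described above can be sketched by hand as follows. `manual_median` is a hypothetical teaching helper, not a replacement for `st.median()`, and `even_data` is an extra dataset made up here to exercise the even-length case.

```python
import statistics as st


def manual_median(values):
    """Return the median by sorting and taking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [10, 15, 20, 25, 30, 50, 50, 50, 50]
even_data = [10, 15, 20, 25, 30, 50]

print(f"Odd count: {manual_median(odd_data)}")
print(f"Even count: {manual_median(even_data)}")
```

For the six-value dataset, the two middle values are 20 and 25, so the median is 22.5.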


# **Weaknesses of the median**

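**1. May Not Capture Multiple Modes:** The median reports a single middle value and says nothing about which values occur most often. In the example below, the median is 30 even though 40 is the most frequent value.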

In [17]:
# Example data with multiple modes
data = [10, 20, 30, 20, 40, 30, 50, 40, 40]

# Calculate the median
median = st.median(data)
print(f"Median: {median}")

Median: 30


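**2. Skewed Data:** The median ignores how far the values in the tails lie from the center. In the example below, the four 50s make up almost half of the data, yet the median stays at 30.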

In [18]:
# Example data with a skewed distribution
data = [10, 15, 20, 25, 30, 50, 50, 50, 50]

# Calculate the median
median = st.median(data)
print(f"Median: {median}")

Median: 30


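**Not Suitable for Categorical Data:** The median requires values with a meaningful order, which nominal categories lack. `st.median()` may still return a value for strings, but only because it sorts them alphabetically.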

In [19]:
# Example categorical data
data = ['Red', 'Green', 'Blue', 'Red', 'Red']

# Calculate the median
# median = st.median(data)
# print(f"Median: {median}")

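**Ordinal Data Needs Encoding:** The median is defined for ordinal data, but `st.median()` sorts strings alphabetically rather than by rank, so ordinal labels must first be mapped to numbers that reflect their order.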

In [20]:
# Example ordinal data
data = ['Low', 'Medium', 'High', 'Medium', 'Low']

# Calculate the median
# median = st.median(data)
# print(f"Median: {median}")
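
One possible workaround for ordinal labels, sketched under the assumption that the ranking is Low < Medium < High: map each label to an integer rank, take the median of the ranks, and map the result back. The names `order` and `rank` are introduced here. `st.median_low()` is used so that the result is always an actual data point.

```python
import statistics as st

data = ['Low', 'Medium', 'High', 'Medium', 'Low']

# Assumed ranking of the labels, from lowest to highest
order = ['Low', 'Medium', 'High']
rank = {label: position for position, label in enumerate(order)}

# Convert labels to ranks, take the (low) median, and convert back
ranks = [rank[label] for label in data]
median_rank = st.median_low(ranks)
median_label = order[median_rank]

print(f"Median: {median_label}")
```

The sorted ranks are `[0, 0, 1, 1, 2]`, so the middle rank is 1, which corresponds to `Medium`.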

# **Standard Deviation for Beginners**

**Standard deviation** is a statistical measure that quantifies the amount of variability or dispersion in a dataset. It tells us how much the data points deviate from the mean (average) of the dataset. A low standard deviation indicates that the data points are close to the mean, while a high standard deviation suggests that the data points are spread out over a wider range.

<center><img src="images/stdev_normal.png" width=600 height=900 />

In [21]:
# Example data: exam scores
scores = [80, 85, 90, 95, 100]

**1. Calculation of Mean (Average):** The mean is calculated as the sum of all scores divided by the total number of scores. In this example, the mean is `90`.

In [22]:
mean = sum(scores) / len(scores)
print(f"Mean: {mean}")

Mean: 90.0


**2. Deviation from the Mean:** The deviation from the mean is calculated for each score by subtracting the mean from each data point. For our example, the deviations are `[-10, -5, 0, 5, 10]`.

In [23]:
# Calculate deviations from the mean
# scores = [80, 85, 90, 95, 100]

deviations = [score - mean for score in scores]
print("Deviations from the Mean:", deviations)

Deviations from the Mean: [-10.0, -5.0, 0.0, 5.0, 10.0]


**3. Squaring the Deviations:** To avoid negative values, which would otherwise cancel out the positive ones, we square the deviations. In our example, the squared deviations are `[100, 25, 0, 25, 100]`.

In [24]:
# Square the deviations
# deviations = [-10.0, -5.0, 0.0, 5.0, 10.0]

squared_deviations = [deviation ** 2 for deviation in deviations]
print("Squared Deviations:", squared_deviations)

Squared Deviations: [100.0, 25.0, 0.0, 25.0, 100.0]


**4. Calculation of Variance:** The variance is the average of the squared deviations. Dividing by the number of scores, as done here, gives the population variance. In this case, the variance is `50`.

In [25]:
# Calculate the variance
variance = sum(squared_deviations) / len(squared_deviations)
print(f"Variance: {variance}")

Variance: 50.0


**5. Calculation of Standard Deviation:** The standard deviation is the `square root` of the variance. In our example, the standard deviation is approximately `7.07`.

In [26]:
import math  # sqrt is not part of the statistics module's documented API

standard_deviation = round(math.sqrt(variance), 2)
print(f"Standard Deviation: {standard_deviation}")

Standard Deviation: 7.07


In [27]:
print(round(st.pstdev(scores),2))

7.07
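
The steps above divide by the number of scores, which matches the population functions `st.pvariance()` and `st.pstdev()`. When the data is a sample drawn from a larger population, the usual convention is to divide by `n - 1` instead, which is what `st.variance()` and `st.stdev()` do. A short sketch comparing the two on the same scores, with illustrative variable names:

```python
import statistics as st

scores = [80, 85, 90, 95, 100]

# Population versions: divide the sum of squared deviations by n
population_variance = st.pvariance(scores)
population_sd = st.pstdev(scores)

# Sample versions: divide the sum of squared deviations by n - 1
sample_variance = st.variance(scores)
sample_sd = st.stdev(scores)

print(f"Population variance: {population_variance}, SD: {round(population_sd, 2)}")
print(f"Sample variance: {sample_variance}, SD: {round(sample_sd, 2)}")
```

For these five scores the sum of squared deviations is 250, so the population variance is 250 / 5 = 50 and the sample variance is 250 / 4 = 62.5.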


Standard deviation measures the spread or variability of data points in a dataset.  
A higher standard deviation indicates a wider spread of data, while a lower standard deviation suggests that the data points are clustered closer to the mean.   

# **Percentiles, Quartiles and Quantiles**

**Percentile**

In [28]:
import numpy as np

Percentiles divide an ordered dataset into 100 equal parts, each containing 1% of the data points. They are used to understand the distribution of data and to locate values relative to the entire dataset. **The n'th percentile (where n is a value between 0 and 100) is the value below which n% of the data falls.**

In [29]:
# Example dataset, GRE Scores
data = [400, 218, 100, 200, 92, 250, 270, 350, 150, 300, 200, 250, 151, 150, 100]

# Calculate the percentiles using numpy
p25 = np.percentile(data, 25)
p50 = np.percentile(data, 50)
p75 = np.percentile(data, 75)

# Output the results
print(f"75th Percentile: {p75}")

75th Percentile: 260.0
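
To show where the value 260.0 comes from, here is a hand-written sketch of linear interpolation between neighboring sorted values, which is the default method of `np.percentile`. `percentile_linear` is a hypothetical helper written for illustration and assumes non-empty data. It is not NumPy's actual implementation.

```python
def percentile_linear(values, q):
    """Return the q-th percentile (0-100) by linear interpolation."""
    ordered = sorted(values)
    # Fractional index of the percentile within the sorted data
    position = (len(ordered) - 1) * q / 100
    lower = int(position)                     # index at or below the position
    upper = min(lower + 1, len(ordered) - 1)  # next index, clamped at the end
    fraction = position - lower               # how far between the two neighbors
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


data = [400, 218, 100, 200, 92, 250, 270, 350, 150, 300, 200, 250, 151, 150, 100]

p25 = percentile_linear(data, 25)
p50 = percentile_linear(data, 50)
p75 = percentile_linear(data, 75)

print(f"25th Percentile: {p25}")
print(f"50th Percentile: {p50}")
print(f"75th Percentile: {p75}")
```

With 15 values, the 75th percentile sits at index 10.5 of the sorted data, halfway between 250 and 270, which gives 260.0.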


**Quartile**

Quartiles are statistical measures that divide a dataset into four equal parts, each containing an equal number of data points. They are useful in understanding the distribution and spread of data. The three quartiles are as follows:

**First Quartile (Q1):** It separates the lowest 25% of the data from the rest.  
**Second Quartile (Q2) or the Median:** It divides the data into two halves, with 50% of the data points below and 50% above.  
**Third Quartile (Q3):** It separates the highest 25% of the data from the rest.

In [30]:
# Example dataset
data = [12, 17, 18, 20, 22, 25, 27, 30, 35, 40]

# Calculate the quartiles using numpy
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50) # median
q3 = np.percentile(data, 75)

# Output the results
print(f"Q1 (First Quartile): {q1}")
print(f"Q2 (Second Quartile): {q2}" )
print(f"Q3 (Third Quartile): {q3}")

Q1 (First Quartile): 18.5
Q2 (Second Quartile): 23.5
Q3 (Third Quartile): 29.25


The `numpy.percentile()` function can calculate any percentile, not just quartiles. So, by changing the percentile value (e.g., using `np.percentile(data, 90)`), you can calculate other quantiles as well.

<center><img src="images/boxplot.png" width=600 height=900 />

**Quantile**

Quantiles are cut points that divide an ordered dataset into equal-sized parts. Quartiles (4 parts) and percentiles (100 parts) are special cases; the example below uses five parts (quintiles).

In [31]:
# Example dataset
data = [12, 17, 18, 20, 22, 25, 27, 30, 35, 40]

# Calculate the quintile cut points (20th, 40th, 60th, 80th percentiles) using numpy
q1 = np.percentile(data, 20)
q2 = np.percentile(data, 40)
q3 = round(np.percentile(data, 60), 2)
q4 = round(np.percentile(data, 80), 2)

# Output the results
print(f"Q1 (First Quantile): {q1}")
print(f"Q2 (Second Quantile): {q2}")
print(f"Q3 (Third Quantile): {q3}")
print(f"Q4 (Fourth Quantile): {q4}")

Q1 (First Quantile): 17.8
Q2 (Second Quantile): 21.2
Q3 (Third Quantile): 25.8
Q4 (Fourth Quantile): 31.0
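
As an alternative sketch, the standard library's `st.quantiles()` (Python 3.8+) returns all cut points for a chosen number of parts in one call. The `method="inclusive"` setting is used here because, for this dataset, it yields the same cut points as the `np.percentile` calls above. The default `method="exclusive"` would give different values.

```python
import statistics as st

data = [12, 17, 18, 20, 22, 25, 27, 30, 35, 40]

# n is the number of equal parts; the result has n - 1 cut points
quartiles = st.quantiles(data, n=4, method="inclusive")
quintiles = st.quantiles(data, n=5, method="inclusive")

print(f"Quartiles: {quartiles}")
print(f"Quintiles: {quintiles}")
```

Changing `n` is all that is needed to move between quartiles, quintiles, deciles, or percentiles.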
