# Measures of Variability

Measures of dispersion (also called measures of variability) quantify the spread of a distribution. They indicate whether the values are clustered tightly around a central point or scattered across a wide range. The following are the most commonly used measures of dispersion:

**Range:** The range is the difference between the highest and lowest values in a dataset. Because it depends only on these two extremes, a single outlier can change it drastically.

**Interquartile Range (IQR):** The IQR is the difference between the third quartile ($Q_3$) and the first quartile ($Q_1$). It covers the middle 50% of the dataset, so it is far less affected by extreme values than the range. This makes the IQR particularly useful for skewed distributions and for data with outliers. The IQR is calculated as:  
$$IQR = Q_3 - Q_1$$
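
To make the contrast between the range and the IQR concrete, here is a minimal sketch on a small made-up sample (the values are hypothetical and are not the dataset used later in this section). It relies on `Series.quantile`, which interpolates linearly between data points by default; other quartile conventions can give slightly different numbers.

```python
import pandas as pd

# Hypothetical sample with one extreme value (100)
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

value_range = sample.max() - sample.min()  # depends only on the two extremes
q1 = sample.quantile(0.25)                 # first quartile
q3 = sample.quantile(0.75)                 # third quartile
iqr = q3 - q1                              # spread of the middle 50%

print("range:", value_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

The single value of 100 pushes the range to 99, while the IQR stays at 4 because the quartiles only look at the middle of the sorted data.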

**Variance:** Variance is the average squared deviation of the values from the mean. It indicates whether the mean is a representative measure of central tendency: a small variance means the values lie close to the mean, so the mean summarizes the dataset well. For a complete population, the variance is:

$$\sigma^2 = \frac{\sum (x-\mu)^2}{N}$$
    
Where $\mu$ is the mean, and $N$ is the number of values in the dataset.

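**Sample Variance:** When the data are a sample drawn from a larger population, the sum of squared deviations is divided by $n-1$ instead of $n$. This correction compensates for using the sample mean in place of the unknown population mean. The sample variance is given by: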

$$S^2 = \frac{\sum (x-\overline x)^2}{n-1}$$

Where $\overline x$ is the sample mean, and $n$ is the number of values in the sample.
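
The two formulas differ only in the divisor. The sketch below evaluates both by hand on a small hypothetical array and checks them against NumPy's `np.var`, whose `ddof` argument sets the value subtracted from the number of observations in the divisor.

```python
import numpy as np

# Hypothetical data, chosen so the arithmetic is easy to follow
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.mean()                       # 5.0
squared_dev = (values - mean) ** 2         # squared deviations from the mean
n = len(values)

pop_var = squared_dev.sum() / n            # divide by N      -> 4.0
sample_var = squared_dev.sum() / (n - 1)   # divide by n - 1  -> about 4.571

# np.var reproduces both: ddof=0 for the population, ddof=1 for a sample
assert np.isclose(pop_var, np.var(values, ddof=0))
assert np.isclose(sample_var, np.var(values, ddof=1))

print(pop_var, sample_var)
```

Note that the sample variance is always the larger of the two, and the gap shrinks as $n$ grows.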

**Standard deviation:** The standard deviation is the square root of the variance. Because the variance is built from squared differences, it is expressed in squared units; taking the square root brings the measure back to the units of the original data. For example, in a dataset measuring rainfall in centimeters, the variance would be in $cm^2$, which is hard to interpret. The standard deviation, expressed in $cm$, indicates how far a typical rainfall value lies from the mean.
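
As a small illustration, the sketch below treats a hypothetical series as rainfall readings in centimeters and takes the square root of its variance. One practical detail worth knowing: pandas' `Series.std()` and `Series.var()` default to the sample versions (`ddof=1`), whereas NumPy's `np.std` and `np.var` default to the population versions (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Hypothetical rainfall readings in centimeters
rainfall_cm = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

pop_var = rainfall_cm.var(ddof=0)   # population variance, in cm^2
pop_std = np.sqrt(pop_var)          # population standard deviation, in cm

sample_std = rainfall_cm.std()      # pandas default: sample std (ddof=1)
numpy_std = np.std(rainfall_cm.to_numpy())  # NumPy default: population std

print(pop_var, pop_std, sample_std, numpy_std)
```

Here the population variance is 4 $cm^2$ and the population standard deviation is 2 $cm$, so readings typically sit about 2 cm from the mean of 5 cm.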

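**Skewness:** Skewness measures the degree of asymmetry of a distribution. Strictly speaking, it describes the shape of the distribution rather than its spread. A symmetric distribution has a skewness of zero.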

<center><img src="./data/mean-median-mode.png"/></center>

**Positive Skewness:** A positively skewed distribution has a longer upper (right) tail: most values cluster at the lower end, while a few large values stretch the distribution to the right. It is termed "skewed right", and its mean is typically greater than its median.

**Negative Skewness:** Conversely, a negatively skewed distribution has a longer lower (left) tail, with a few small values stretching the distribution to the left. Such a distribution is referred to as "skewed left", and its mean is typically less than its median.
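
The sign convention can be checked with a minimal sketch. It uses one common definition of skewness, the third central moment divided by the second central moment raised to the power 3/2; the helper name `moment_skewness` and both samples are made up for illustration.

```python
import numpy as np

def moment_skewness(x):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)   # second central moment (population variance)
    m3 = np.mean(dev ** 3)   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical sample with a long right tail, and its mirror image
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]
left_tailed = [-v for v in right_tailed]

right_skew = moment_skewness(right_tailed)
left_skew = moment_skewness(left_tailed)

print(right_skew, left_skew)
```

The right-tailed sample gives a positive value and its mirror image gives the same magnitude with a negative sign, which matches the "skewed right" and "skewed left" labels above.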

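**Kurtosis:** Kurtosis measures how heavy the tails of a distribution are relative to a normal distribution. It is often described in terms of peakedness or flatness, but it chiefly reflects how much of the variability comes from extreme values in the tails.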

<center><img src="./data/kurtosis.png"/></center>

In [1]:
import pandas as pd
data = pd.Series([19,23,19,18,25,16,17,19,15,23,21,23,21,11,6])

In [2]:
data.describe()

count    15.000000
mean     18.400000
std       4.997142
min       6.000000
25%      16.500000
50%      19.000000
75%      22.000000
max      25.000000
dtype: float64

In [3]:
data.mode()

0    19
1    23
dtype: int64

The values 19 and 23 are the most frequently occurring values: each appears three times, so the dataset has two modes.

In [4]:
data.median()

19.0

In [5]:
range_data = data.max() - data.min()
range_data

19

In [6]:
data.std()  # sample standard deviation (pandas default: ddof=1)

4.99714204034952

In [7]:
data.var()  # sample variance (pandas default: ddof=1)

24.97142857142857

In [8]:
from scipy.stats import skew, kurtosis

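# scipy defaults: no small-sample bias correction (bias=True);
# kurtosis() returns excess kurtosis, i.e. 0 for a normal distribution (fisher=True)
skew(data), kurtosis(data)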

(-1.038344732097918, 0.6995494033062934)
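
The negative skewness agrees with the mean (18.4) lying below the median (19): the low values 6 and 11 stretch the left tail. Different libraries use different conventions for these statistics, so the same data can give different numbers. The sketch below, which assumes SciPy and pandas behave as in their documented defaults, compares a few variants on the same dataset.

```python
import pandas as pd
from scipy.stats import skew, kurtosis

data = pd.Series([19, 23, 19, 18, 25, 16, 17, 19, 15, 23, 21, 23, 21, 11, 6])
values = data.to_numpy()

biased_skew = skew(values)                    # scipy default (bias=True)
adjusted_skew = skew(values, bias=False)      # with small-sample correction
pandas_skew = data.skew()                     # pandas applies the correction

excess_kurt = kurtosis(values)                # scipy default: excess kurtosis
pearson_kurt = kurtosis(values, fisher=False) # normal distribution = 3

print(biased_skew, adjusted_skew, pandas_skew)
print(excess_kurt, pearson_kurt)
```

With only 15 observations the corrected skewness is noticeably larger in magnitude than the uncorrected one, so it is worth stating which convention a reported value uses.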

**Points to note:**  
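1. The mean is affected by outliers (extreme values). When a dataset contains outliers, the median is usually a more robust measure of central tendency.
2. The standard deviation and variance are computed from deviations around the mean. Thus, if there are outliers, the standard deviation and variance may not be representative measures either; the IQR is a more robust alternative.
3. The mode is generally used for discrete or categorical data. In continuous data, exact values rarely repeat, so the mode is often not meaningful unless the values are first grouped into bins. A dataset can also have more than one mode, as the example above shows.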
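
The first two points can be checked with a short sketch. The sample below is hypothetical; one extreme value (200) is appended to it, and the mean, median, standard deviation, and IQR are compared before and after.

```python
import pandas as pd

def summarize(series):
    """Return mean, median, sample std, and IQR for a pandas Series."""
    iqr = series.quantile(0.75) - series.quantile(0.25)
    return {
        "mean": series.mean(),
        "median": series.median(),
        "std": series.std(),
        "iqr": iqr,
    }

# Hypothetical sample, then the same sample with one extreme value added
clean = pd.Series([10, 12, 13, 14, 15, 16, 18])
with_outlier = pd.concat([clean, pd.Series([200])], ignore_index=True)

before = summarize(clean)
after = summarize(with_outlier)

for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

The mean jumps from 14 to 37.25 and the standard deviation grows by more than an order of magnitude, while the median only moves from 14 to 14.5 and the IQR from 3 to 3.75.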