### What are the measures of variability?

The measures of variability are statistical measures that describe the degree to which a set of data points differ from one another. The most common measures of variability are:

- Range: The range is the difference between the highest and lowest values in a dataset.

Range = max(x) - min(x) where x is the dataset.

- Variance: The variance is the average of the squared differences from the mean. It measures how spread out the data is around the mean.

Population: $$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Sample: $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

- Standard deviation: The standard deviation is the square root of the variance. It provides a measure of how much the data deviates from the mean.

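Population: $$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

Sample: $$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$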

- Interquartile range: The interquartile range (IQR) is the difference between the third and first quartiles of the dataset. It represents the range of the middle 50% of the data.

IQR = Q3 - Q1

- Mean absolute deviation: The mean absolute deviation (MAD) is the average of the absolute differences from the mean. It measures how much the data deviates from the mean on average.

$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - \mu \rvert$$

where n is the number of data points, x_i is the i-th data point, μ is the mean of the dataset, and Σ represents the sum of the values. The absolute value | | makes every difference from the mean non-negative, so that positive and negative deviations do not cancel each other out.

Note that the formulas for the range and interquartile range remain the same for both population and sample data.

### Why is n-1 not used in the sample mean absolute deviation (MAD)?

Dividing by (n-1) is a correction derived specifically for squared deviations. For the variance, the expected sum of squared deviations from the sample mean equals (n-1)σ², so dividing by (n-1) gives an unbiased estimate of the population variance whatever the shape of the distribution.

No such universal constant exists for absolute deviations. The average absolute deviation from the sample mean is also biased as an estimate of its population counterpart, but the size of that bias depends on the shape of the distribution. For normally distributed data, for example, it is too small by a factor of sqrt((n-1)/n), so dividing by (n-1) would overcorrect. The MAD is therefore usually reported as a descriptive statistic, and the sample MAD formula keeps (1/n) instead of (1/(n-1)) as the denominator.

It's worth noting that the choice between (1/n) and (1/(n-1)) matters most for small samples. The two results differ by a factor of n/(n-1), which is about 11% for n = 10 but only about 0.1% for n = 1000, so for large samples the difference is negligible.
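The size of that gap can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `mad_with_denominator` is hypothetical, the small dataset is the one used in the code cells at the end of this notebook, and the large dataset is synthetic.

```python
import numpy as np

def mad_with_denominator(values, denominator):
    """Sum of absolute deviations from the mean, divided by a chosen denominator."""
    values = np.asarray(values, dtype=float)
    return np.sum(np.abs(values - values.mean())) / denominator

small = np.array([10, 12, 15, 17, 20, 23, 25, 27, 30, 33])
rng = np.random.default_rng(0)
large = rng.normal(loc=20, scale=8, size=10_000)  # synthetic data, for illustration only

ratios = {}
for name, sample in [("n = 10", small), ("n = 10000", large)]:
    n = len(sample)
    mad_n = mad_with_denominator(sample, n)                # usual MAD, divides by n
    mad_n_minus_1 = mad_with_denominator(sample, n - 1)    # same sum, divides by n-1
    ratios[name] = mad_n_minus_1 / mad_n                   # always equals n / (n-1)
    print(name, "| 1/n:", mad_n, "| 1/(n-1):", mad_n_minus_1, "| ratio:", ratios[name])
```

The ratio depends only on n, not on the data, which is why the choice of denominator stops mattering as the sample grows.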

### Why is n-1 used for calculating the sample variance and standard deviation?

The reason why (n-1) is used instead of n in the formula for the sample variance and standard deviation is to account for the fact that the sample mean is being used to estimate the population mean.

When we calculate the variance or standard deviation of a sample, we measure deviations from the sample mean because the population mean is unknown. The sample mean is computed from the same data, and it is the value that minimizes the sum of squared deviations for that sample. The squared deviations around the sample mean are therefore, on average, smaller than they would be around the true population mean, and dividing their sum by n underestimates the population variance.

Using (n-1) instead of n in the denominator of the sample variance and standard deviation formulas is known as Bessel's correction. It removes this bias from the sample variance, making it an unbiased estimate of the population variance. The sample standard deviation, being the square root of that estimate, is still slightly biased, but less so than it would be with n in the denominator.

In other words, the use of (n-1) instead of n makes the sample variance and standard deviation more representative of the population values by accounting for the one degree of freedom that is used up when the mean is estimated from the same sample.
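A small simulation can make the bias visible. This is a sketch under stated assumptions: the population is taken to be normal with variance 4, the sample size is 5, and the seed and number of repetitions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
true_variance = 4.0      # assumed population variance (sigma = 2)
n = 5                    # small sample size, where the bias is most visible
num_samples = 100_000    # number of repeated samples

# Each row is one sample of size n drawn from the assumed population
samples = rng.normal(loc=0.0, scale=np.sqrt(true_variance), size=(num_samples, n))

# Average the two variance estimators over all repeated samples
mean_var_n = np.var(samples, axis=1, ddof=0).mean()          # divides by n
mean_var_n_minus_1 = np.var(samples, axis=1, ddof=1).mean()  # divides by n-1

print("True variance:", true_variance)
print("Average estimate dividing by n:", mean_var_n)            # near (n-1)/n * 4 = 3.2
print("Average estimate dividing by n-1:", mean_var_n_minus_1)  # near 4.0
```

Dividing by n lands near (n-1)/n times the true variance on average, while dividing by (n-1) lands near the true variance itself.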

### What are degrees of freedom in statistics?

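In statistics, degrees of freedom refers to the number of independent pieces of information in a sample that remain available to estimate a parameter or test a hypothesis, after accounting for any quantities already estimated from the same data. Equivalently, it is the number of values that are free to vary once those constraints are imposed.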

Degrees of freedom are important because they affect the precision of our estimates and the accuracy of our statistical tests. Generally, the more degrees of freedom we have, the more precise our estimates and the more accurate our tests will be.

The concept of degrees of freedom is used in many statistical tests, including t-tests, ANOVA, chi-squared tests, and regression analysis. In each of these tests, the degrees of freedom are calculated differently based on the specific test being performed.

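For example, in a one-sample t-test, the degrees of freedom are equal to the sample size minus one (n-1), because one parameter (the mean) is estimated from the data. In a regression analysis with one independent variable, the degrees of freedom are equal to the sample size minus two (n-2), because both the intercept and the slope are estimated.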

Understanding degrees of freedom is important for interpreting statistical results and making appropriate conclusions from data analyses.

Here are some examples that illustrate the concept of degrees of freedom:

- T-test:

Suppose we want to test whether the mean height of a population is equal to a certain value. We collect a random sample of 10 individuals and measure their heights. To perform a t-test, we need to calculate the sample mean and sample standard deviation, and we also need to know the degrees of freedom.

The degrees of freedom in a one-sample t-test are equal to the sample size minus one, so in this case, the degrees of freedom would be 10-1 = 9. These degrees of freedom determine which t-distribution the test statistic is compared against, whether through a t-distribution table or software, to obtain the p-value and determine the significance of the test.
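The calculation described above can be sketched as follows. The heights and the hypothesized mean of 170 cm are made-up values for illustration, and only NumPy is assumed.

```python
import numpy as np

# Hypothetical heights (cm) for a sample of 10 individuals
heights_cm = np.array([168.0, 172.5, 165.0, 180.0, 175.5,
                       170.0, 169.5, 177.0, 173.0, 171.5])
hypothesized_mean = 170.0  # assumed value being tested

n = heights_cm.size
sample_mean = heights_cm.mean()
sample_std = heights_cm.std(ddof=1)        # sample standard deviation (n-1 denominator)
standard_error = sample_std / np.sqrt(n)   # standard error of the mean

t_statistic = (sample_mean - hypothesized_mean) / standard_error
degrees_of_freedom = n - 1                 # one degree of freedom is used by the mean

print("t statistic:", t_statistic)
print("Degrees of freedom:", degrees_of_freedom)
```

The p-value would then be read from a t-distribution with 9 degrees of freedom. If SciPy is available, `scipy.stats.ttest_1samp` performs the whole test in one call.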

- Chi-squared test:

Suppose we want to test whether two categorical variables are independent or not. We collect data on a sample of 100 individuals, recording their gender and whether or not they smoke. We want to test whether smoking status is independent of gender.

To perform a chi-squared test, we need to calculate the expected values for each cell in a contingency table based on the assumption that smoking status and gender are independent. To calculate the degrees of freedom for the test, we use the formula (r-1)(c-1), where r is the number of rows and c is the number of columns in the contingency table.

In this case, the contingency table has 2 rows and 2 columns, so the degrees of freedom would be (2-1)(2-1) = 1. These degrees of freedom are used to look up the appropriate chi-squared value in a chi-squared distribution table, which is used to calculate the p-value and determine the significance of the test.
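As a sketch of the steps above, the expected counts, the test statistic, and the degrees of freedom can be computed directly with NumPy. The counts in the table are hypothetical, and the statistic shown is the plain Pearson chi-squared statistic without a continuity correction.

```python
import numpy as np

# Hypothetical 2x2 table for 100 people: rows = gender, columns = smoker / non-smoker
observed = np.array([[20, 30],
                     [15, 35]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
total = observed.sum()

# Expected count in each cell if the two variables were independent
expected = row_totals @ col_totals / total

chi2_statistic = ((observed - expected) ** 2 / expected).sum()

rows, cols = observed.shape
degrees_of_freedom = (rows - 1) * (cols - 1)

print("Expected counts:\n", expected)
print("Chi-squared statistic:", chi2_statistic)
print("Degrees of freedom:", degrees_of_freedom)
```

Once the row and column totals are fixed, only one cell of a 2x2 table can vary freely, which is what the single degree of freedom reflects.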

- Linear regression:

Suppose we want to model the relationship between two variables, x and y, using linear regression. We collect a sample of 20 data points and fit a regression line to the data. To perform hypothesis tests on the regression coefficients, we need to know the degrees of freedom.

In a simple linear regression with one independent variable, the degrees of freedom are equal to the sample size minus two, so in this case, the degrees of freedom would be 20-2 = 18. These degrees of freedom are used to look up the appropriate t-value in a t-distribution table, which is used to calculate the p-value and determine the significance of the regression coefficients.
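The role of the n-2 degrees of freedom can be sketched with NumPy alone. The data here are synthetic (a line with intercept 2 and slope 0.5 plus random noise), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # synthetic data

# Fit y = intercept + slope * x by least squares (polyfit returns highest degree first)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

num_estimated_params = 2                  # intercept and slope
residual_df = n - num_estimated_params    # 20 - 2 = 18

# The residual variance divides by the degrees of freedom, not by n
residual_variance = np.sum(residuals ** 2) / residual_df
slope_std_error = np.sqrt(residual_variance / np.sum((x - x.mean()) ** 2))
t_slope = slope / slope_std_error         # compared against a t-distribution with 18 df

print("Residual degrees of freedom:", residual_df)
print("t statistic for the slope:", t_slope)
```

The fitted line forces the residuals to sum to zero and to be uncorrelated with x. Those two constraints are why two degrees of freedom are lost.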

### What is the number of independent observations in a sample?

The number of independent observations in a sample refers to the number of data points that are not influenced by each other, and therefore can be considered as separate pieces of information.

In general, the number of independent observations in a sample depends on the study design and the nature of the data being collected. Here are some examples of different types of data and the number of independent observations they might contain:

- Simple random sample: Suppose we want to estimate the mean weight of all college students in a particular state. We randomly select 50 college students from a list of all college students in the state and weigh them. In this case, each of the 50 students represents an independent observation, and the sample contains 50 independent observations.

- Repeated measurements: Suppose we want to measure the blood pressure of a patient at three different times of the day. We take three measurements of the patient's blood pressure, once in the morning, once in the afternoon, and once in the evening. In this case, the three measurements come from the same patient and are likely to be correlated with each other, so they should not be treated as three independent observations. For conclusions about patients in general, the sample contains one independent unit: the patient.

- Longitudinal study: Suppose we want to study the effect of a new medication on blood pressure over time. We recruit 100 patients with high blood pressure and measure their blood pressure before and after taking the medication for 12 weeks. In this case, each patient represents an independent observation, but the repeated measurements over time are not independent. Therefore, the sample contains 100 independent observations.

- Cluster sampling: Suppose we want to estimate the average income of households in a particular neighborhood. We randomly select 10 blocks in the neighborhood and interview all households in each block. In this case, the households within each block are not independent observations, but the blocks themselves are independent. Therefore, the sample contains 10 independent observations.

It's important to keep in mind that the number of independent observations in a sample can affect the precision of statistical estimates and the accuracy of statistical tests. In general, larger sample sizes and more independent observations lead to more precise estimates and more accurate tests.
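The longitudinal example can be sketched as follows. The blood pressure values are synthetic, and the assumed 8-unit average reduction is an arbitrary choice for illustration. The sketch reduces each patient's two measurements to a single before/after difference, so that the unit of analysis matches the independent unit.

```python
import numpy as np

rng = np.random.default_rng(7)
num_patients = 100

# Synthetic data: each patient's "after" value is tied to their own "before" value
before = rng.normal(loc=150, scale=12, size=num_patients)
after = before - 8 + rng.normal(scale=5, size=num_patients)

# Stacking gives 200 numbers, but not 200 independent observations
all_measurements = np.concatenate([before, after])

# One difference per patient: these are the independent observations
differences = after - before
num_independent = differences.size
degrees_of_freedom = num_independent - 1   # for a paired t-test on the differences

# Measurements on the same patient are strongly correlated
correlation = np.corrcoef(before, after)[0, 1]

print("Total measurements:", all_measurements.size)
print("Independent observations:", num_independent)
print("Before/after correlation:", correlation)
```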

### What is Coefficient of Variation?

The coefficient of variation (CV) is a measure of relative variability that is used to compare the dispersion of two or more datasets that have different scales or units of measurement. The CV is expressed as a percentage and is calculated as the ratio of the standard deviation to the mean of the dataset, multiplied by 100%.

Population: CV = (σ/μ) × 100%

Sample: CV = (s/x̄) × 100%

where σ is the population standard deviation, μ is the population mean, s is the sample standard deviation, and x̄ is the sample mean.

The CV is useful when comparing the variability of datasets with different units or scales, as it allows us to compare the relative variability of the datasets without being influenced by their absolute values. For example, if we have two datasets, one that measures weight in kilograms and another that measures income in dollars, we cannot compare their standard deviations directly because they have different units of measurement. However, we can use the CV to compare the relative variability of the two datasets, regardless of their units.
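A short sketch of that comparison, assuming NumPy. The weights and incomes are made-up values, and `coefficient_of_variation` is a hypothetical helper that uses the sample formula.

```python
import numpy as np

def coefficient_of_variation(values):
    """Sample CV in percent: sample standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / values.mean() * 100

# Hypothetical datasets in different units
weights_kg = np.array([58, 62, 70, 75, 81, 66, 73])
incomes_usd = np.array([32000, 45000, 51000, 28000, 120000, 39000, 47000])

cv_weights = coefficient_of_variation(weights_kg)
cv_incomes = coefficient_of_variation(incomes_usd)

# Changing the unit (kg -> g) rescales the mean and standard deviation equally
cv_weights_grams = coefficient_of_variation(weights_kg * 1000)

print("CV of weights (kg):", cv_weights)
print("CV of weights (g):", cv_weights_grams)
print("CV of incomes (USD):", cv_incomes)
```

The standard deviations themselves (kilograms versus dollars) cannot be ranked against each other, but the two CV percentages can. Converting kilograms to grams leaves the CV unchanged.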

A higher CV indicates that the dataset has a higher relative variability or dispersion, while a lower CV indicates that the dataset has a lower relative variability or dispersion. The CV is often used in fields such as finance, economics, and biology to compare the variability of different stocks, economic indicators, or biological variables.

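It is important to note that the CV has some limitations. Firstly, it is only meaningful for data measured on a ratio scale, where zero means the absence of the quantity and all values are non-negative. If the mean is close to zero, the CV becomes very large and unstable, and if the data contain negative values, the mean can be zero or negative, which makes the ratio uninterpretable. Secondly, the CV is sensitive to extreme values or outliers in the dataset, especially when the sample size is small, because both the mean and the standard deviation are affected by them. In such cases, more robust measures of dispersion such as the interquartile range or the mean absolute deviation may be more appropriate.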
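The first limitation can be illustrated with temperatures, which are a common example of data without a true zero on the Celsius scale. This is a sketch with made-up readings, and `coefficient_of_variation` is a hypothetical helper.

```python
import numpy as np

def coefficient_of_variation(values):
    """Sample CV in percent: sample standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / values.mean() * 100

# Hypothetical temperature readings with a mean close to zero in Celsius
temps_celsius = np.array([-2.0, 1.0, 3.0, -1.0, 0.5, 2.5])
temps_kelvin = temps_celsius + 273.15   # same readings on a scale with a true zero

cv_celsius = coefficient_of_variation(temps_celsius)
cv_kelvin = coefficient_of_variation(temps_kelvin)

print("Standard deviation (same on both scales):", temps_celsius.std(ddof=1))
print("CV in Celsius:", cv_celsius)
print("CV in Kelvin:", cv_kelvin)
```

The spread of the readings is identical on both scales, yet the CV differs by orders of magnitude because the Celsius mean sits near an arbitrary zero.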

### Which is better: a higher CV or a lower CV?

In statistics, neither a higher nor a lower coefficient of variation (CV) is inherently better. The usefulness of the CV depends on the context and the purpose of the analysis.

Because the CV expresses variability relative to the mean, it can show which of several datasets, possibly measured on different scales or in different units, varies more in proportion to its typical value. Whether that greater or lesser relative variability is good or bad is a separate question.

In some contexts, a higher CV may be desirable. For example, in finance or investment, a higher CV may indicate a greater potential for risk or return, depending on the investor's risk appetite. On the other hand, in quality control or manufacturing, a lower CV may be desirable, as it indicates that the variability of the products or processes is relatively low and consistent.

Therefore, whether a higher or lower CV is considered better depends on the specific context and objective of the analysis. The CV should be interpreted in conjunction with other statistical measures and contextual information to gain a better understanding of the dataset's characteristics and variability.

In [1]:
import numpy as np

# Create a sample dataset
data = np.array([10, 12, 15, 17, 20, 23, 25, 27, 30, 33])

In [2]:
# Range (named data_range to avoid shadowing Python's built-in range)
data_range = np.max(data) - np.min(data)
print("Range:", data_range)

Range: 23


In [3]:
# Interquartile Range (IQR)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
print("IQR:", iqr)

IQR: 11.0


In [5]:
# Variance
variance = np.var(data, ddof=1) # ddof=1 for sample variance
print("Variance:", variance)

Variance: 59.5111111111111


In [6]:
# Standard Deviation
std_dev = np.std(data, ddof=1) # ddof=1 for sample standard deviation
print("Standard Deviation:", std_dev)

Standard Deviation: 7.714344503009384


In [10]:
# Coefficient of Variation (CV)
cv = std_dev / np.mean(data) * 100
print("Coefficient of Variation:", cv)

Coefficient of Variation: 36.3884174670254


In [11]:
# Mean Absolute Deviation (MAD)
mad = np.mean(np.abs(data - np.mean(data)))
print("Mean Absolute Deviation:", mad)

Mean Absolute Deviation: 6.4


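Note: We use ddof=1 in the variance and standard deviation calculations so that the denominator is n-1. This gives the unbiased estimate of the population variance from a sample, and the corresponding sample standard deviation. If your data cover the whole population, set ddof=0, which is also NumPy's default.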

### Summary

 - Range: The range is the simplest measure of dispersion and represents the difference between the highest and lowest values in a dataset. Because it depends only on the two most extreme values, it is highly sensitive to outliers and says nothing about how the values in between are distributed. For example, in a class of students' test scores, the range shows the gap between the highest and lowest scores and gives a rough indication of the spread of scores.

 - Interquartile Range (IQR): The IQR is a measure of dispersion that represents the difference between the 25th and 75th percentiles of a dataset. It is less sensitive to outliers than the range and can help identify the spread of the middle 50% of the data. For example, in a study of employee salaries, the IQR can help identify the range of salaries for the majority of employees.

 - Variance and Standard Deviation: The variance and standard deviation are measures of how spread out the data is from the mean. They are commonly used in statistics to quantify the dispersion of a dataset. For example, in financial analysis, the standard deviation can be used to measure the risk associated with different investments.

 - Coefficient of Variation (CV): The CV is a measure of relative variability that is used to compare the dispersion of two or more datasets that have different scales or units of measurement. It is commonly used in finance, economics, and biology to compare the variability of different stocks, economic indicators, or biological variables.

 - Mean Absolute Deviation (MAD): The MAD is the average absolute distance of the data from the mean. Compared with the standard deviation, it is less sensitive to outliers because the deviations are not squared. It can be used in finance, for example, to measure the variability of investment returns.

In general, these measures of dispersion help us to understand the variability of a dataset and can provide valuable insights into the spread of data. They are used in many fields, including finance, economics, biology, and social sciences, to quantify and compare the variability of different datasets.
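The differences in outlier sensitivity summarized above can be checked on the notebook's dataset. This is a minimal sketch: the appended value of 200 is an artificial outlier, and `dispersion_summary` is a hypothetical helper.

```python
import numpy as np

def dispersion_summary(values):
    """Return the range, IQR, sample standard deviation, and MAD of a dataset."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "range": values.max() - values.min(),
        "iqr": q3 - q1,
        "std": values.std(ddof=1),
        "mad": np.mean(np.abs(values - values.mean())),
    }

data = np.array([10, 12, 15, 17, 20, 23, 25, 27, 30, 33])
data_with_outlier = np.append(data, 200)   # one artificial extreme value

clean = dispersion_summary(data)
contaminated = dispersion_summary(data_with_outlier)

# How many times larger each measure becomes after adding the outlier
inflation = {name: contaminated[name] / clean[name] for name in clean}

for name in clean:
    print(name, "| clean:", clean[name],
          "| with outlier:", contaminated[name],
          "| inflation:", inflation[name])
```

On this data the IQR barely moves, the MAD grows less than the standard deviation, and the range grows the most.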