# Numerical Summary
Sometimes, data can be large and difficult to understand. A numerical summary condenses such data into a single value that is easy to interpret. There are two types of numerical summaries:
* Measures of Central Tendency
* Measures of Dispersion

This notebook explores Measures of Dispersion.

## 2. Measures of Dispersion
When looking at measures of central tendency, we explored three different ways of estimating the middle ground of the data: the mean, the median and the mode. However, measures of central tendency only tell half the story. The other half, measures of dispersion, helps us understand the full story behind the data. As the name suggests, measures of dispersion gauge how spread out the data is.

Take for example when we observed the mean and median salaries of Company A and Company B. While the companies had different salaries, they had the same mean and median. However, the salaries of Company B appeared to be more spread out than those of Company A. Let's explore this with the five measures of dispersion:
1. Variance
2. Standard Deviation
3. Mean Absolute Deviation
4. Range
5. Quartiles, Deciles and Percentiles.

### i. Variance
Let's return to our two companies. This is how we obtained the averages and found that, coincidentally, they had the same mean.

In [5]:
# The following data shows the annual salaries of 7 employees in two companies:
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]
company_b_salaries = [34900, 27500, 31600, 39700, 35300, 33800, 31700]

# Find the mean salary in both companies
def get_mean(data: list):
    mean = (sum(data))/len(data)
    return mean

company_a_mean_salary = get_mean(company_a_salaries)
company_b_mean_salary = get_mean(company_b_salaries)

print(f"The average salary of Company A employees is ${company_a_mean_salary} and that of Company B employees is ${company_b_mean_salary}")

The average salary of Company A employees is $33500.0 and that of Company B employees is $33500.0


When measuring spread/dispersion, a good place to start is to determine how far each observation (data point) lies from the mean, that is, `x - mean`.

In [6]:
company_a_salaries_differences = []

for i in company_a_salaries:
    company_a_salaries_difference = i-company_a_mean_salary
    company_a_salaries_differences.append(company_a_salaries_difference)

print(company_a_salaries_differences)

[1000.0, -2800.0, -600.0, 2500.0, 600.0, 300.0, -1000.0]


As you can see, some differences are negative while others are positive. A negative value shows that the salary was lower than the mean, while a positive value shows that it was higher than the mean. Adding up the values cancels them out, and the result will be 0.
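The cancellation can be checked directly. The snippet below is a minimal, self-contained sketch: it copies Company A's salaries from the earlier cell, and the names `mean_salary`, `deviations` and `total_deviation` are introduced here only for illustration.

```python
# Salaries copied from the earlier cell so this snippet runs on its own.
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]

mean_salary = sum(company_a_salaries) / len(company_a_salaries)

# Deviation of each salary from the mean (x - mean).
deviations = [salary - mean_salary for salary in company_a_salaries]

# The positive and negative deviations cancel each other out.
total_deviation = sum(deviations)
print(total_deviation)  # 0.0
```

With other datasets, floating-point rounding may leave a tiny non-zero remainder instead of an exact 0, so comparing with `math.isclose` is usually safer than comparing with `==`.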

However, this format is not very useful to us. We do not really care whether a value was above or below the mean; our main concern is "by how much?". The signs therefore carry no useful information here, and we have to get rid of them. There are two ways to do this: taking absolute values or squaring. Here we use squaring. When we square the values, the negative signs disappear.

In [7]:
company_a_salaries_squared_differences = []

for i in company_a_salaries_differences:
    company_a_salaries_squared_differences.append(i ** 2)

print(company_a_salaries_squared_differences)

[1000000.0, 7840000.0, 360000.0, 6250000.0, 360000.0, 90000.0, 1000000.0]


The next step is to take the mean of these squared differences, which gives the average squared difference. Remember, with measures of central tendency and measures of dispersion, we attempt to summarize large data into a single meaningful number. In this case, the mean of these values is a single number that shows the average squared deviation from the central point.

In [8]:
get_mean(company_a_salaries_squared_differences)

2414285.714285714

We get a mean squared deviation of ```2,414,285.71...```. This is what we call the variance.
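Written as a formula, the steps above amount to the following. The symbols are notation introduced here, not taken from the notebook: $N$ is the number of salaries, $x_i$ is the $i$-th salary and $\mu$ is their mean. Because the sum is divided by $N$, this is the population form of the variance.

$$
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
$$

For Company A, the squared differences add up to 16,900,000 and $N = 7$, so:

$$
\sigma^2 = \frac{16{,}900{,}000}{7} \approx 2{,}414{,}285.71
$$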

Now, let's define a function to get the variance of any list of numbers, so that we can get the variance of Company B.

In [9]:
def get_variance(data: list):
    # Population variance: the mean of the squared deviations from the mean
    mean = get_mean(data)

    # How far each value lies from the mean
    deviation = [i - mean for i in data]

    # Squaring removes the negative signs
    deviation_squared = [d ** 2 for d in deviation]

    variance = get_mean(deviation_squared)

    return variance

get_variance(company_b_salaries)

12368571.42857143

Company B has a variance of ```12,368,571.42...```. Evidently, Company B has a greater variance than Company A. This shows that the salaries in Company B are more dispersed (spread out) than those in Company A.
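As a cross-check, Python's standard `statistics` module provides `pvariance` (population variance, dividing by `n`, which matches the calculation in this notebook) and `variance` (sample variance, dividing by `n - 1`). The sketch below is not part of the original walkthrough; the `results` dictionary is introduced here only to hold the numbers, and the salaries are copied from the earlier cell so the snippet runs on its own.

```python
import statistics

# Salaries copied from the earlier cell so this snippet runs on its own.
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]
company_b_salaries = [34900, 27500, 31600, 39700, 35300, 33800, 31700]

results = {}
for name, salaries in (("Company A", company_a_salaries),
                       ("Company B", company_b_salaries)):
    results[name] = {
        # Divides the sum of squared deviations by n.
        "population": statistics.pvariance(salaries),
        # Divides the sum of squared deviations by n - 1.
        "sample": statistics.variance(salaries),
    }
    print(name, results[name])
```

The population figures should agree with the values computed by hand above. The sample figures are slightly larger because they divide by 6 rather than 7. Which one is appropriate depends on whether the seven salaries are treated as the whole company or as a sample of it.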

However, something about variance makes it harder to interpret. The salaries are measured in dollars, yet the variance is measured in squared dollars, because we squared the deviations to compute it. That is also why the variance runs into the millions even though the salaries are in the tens of thousands. It is for this reason that we have the *Standard Deviation.*

### ii. Standard Deviation
Standard Deviation converts the variance (the mean squared deviation) back into the original units of the data, which makes it easier to understand. It does this by taking the square root of the variance.
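In symbols, using notation introduced here for illustration ($N$ data points $x_i$ with mean $\mu$, and $\sigma^2$ for the population variance), the standard deviation is:

$$
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}
$$

Taking the square root undoes the squaring of the units, so if the data is in dollars, $\sigma$ is in dollars as well.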

In [15]:
import math

def get_stdeviation(data: list):
    variance = get_variance(data)
    stdeviation = math.sqrt(variance)

    return stdeviation

print(get_stdeviation(company_a_salaries))
print(get_stdeviation(company_b_salaries))

1553.7971921347116
3516.8979838163386


The standard deviation of the salaries in Company A is ```1553.79...``` while that of Company B is ```3516.89...```. This confirms that the salaries in Company B are more spread out than those in Company A. As a rough guide, salaries in Company A typically lie about 1,554 dollars from the mean, while those in Company B typically lie about 3,517 dollars from the mean.
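One way to make this interpretation concrete is to count how many salaries fall within one standard deviation of the mean. The sketch below uses `statistics.pstdev` from the standard library, which computes the population standard deviation like the `get_stdeviation` function above. The helper `count_within_one_sd` is a hypothetical name introduced for this illustration, and the salaries are copied from the earlier cell so the snippet runs on its own.

```python
import statistics

# Salaries copied from the earlier cell so this snippet runs on its own.
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]
company_b_salaries = [34900, 27500, 31600, 39700, 35300, 33800, 31700]

def count_within_one_sd(data: list):
    """Return (count, sd): how many values lie within one population
    standard deviation of the mean, and that standard deviation."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population standard deviation
    count = sum(1 for x in data if abs(x - mean) <= sd)
    return count, sd

count_a, sd_a = count_within_one_sd(company_a_salaries)
count_b, sd_b = count_within_one_sd(company_b_salaries)

print(f"Company A: {count_a} of {len(company_a_salaries)} salaries within {sd_a:.2f} of the mean")
print(f"Company B: {count_b} of {len(company_b_salaries)} salaries within {sd_b:.2f} of the mean")
```

In both companies, 5 of the 7 salaries fall inside the band, but the band for Company B is more than twice as wide, which is another way of seeing that its salaries are more spread out.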