# Numerical Summary
Data can be large and difficult to understand. A numerical summary condenses such data into a single value that is easy to interpret. There are two types of numerical summaries:
* Measures of Central Tendency
* Measures of Dispersion

This notebook explores Measures of Dispersion.

## 2. Measures of Dispersion
When looking at measures of central tendency, we explored three different ways of estimating the middle of the data: the mean, the median and the mode. However, measures of central tendency only tell half the story. The other half comes from measures of dispersion, which, as the name suggests, gauge how spread out the data is.

Take, for example, the mean and median salaries of Company A and Company B that we observed earlier. While the individual salaries differed, the two companies had the same mean and median. However, the salaries of Company B appeared to be more spread out than those of Company A. Let's explore this with five measures of dispersion:
1. Variance
2. Standard Deviation
3. Mean Absolute Deviation
4. Range
5. Interquartile Range (IQR)

### 1. Variance
Let's return to our two companies. This is how we obtained the means and found that, coincidentally, they were equal.

In [1]:
# The following data shows the annual salaries of 7 employees in two companies:
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]
company_b_salaries = [34900, 27500, 31600, 39700, 35300, 33800, 31700]

# Find the mean salary in both companies
def get_mean(data: list):
    mean = (sum(data))/len(data)
    return mean

company_a_mean_salary = get_mean(company_a_salaries)
company_b_mean_salary = get_mean(company_b_salaries)

print(f"The average salary of Company A employees is ${company_a_mean_salary} and that of Company B employees is ${company_b_mean_salary}")

The average salary of Company A employees is $33500.0 and that of Company B employees is $33500.0


When measuring spread (dispersion), a good place to start is to determine how far each observation (data point) lies from the mean, that is, (x - mean).

In [2]:
company_a_salaries_differences = [i - company_a_mean_salary for i in company_a_salaries]

print(company_a_salaries_differences)

[1000.0, -2800.0, -600.0, 2500.0, 600.0, 300.0, -1000.0]


As you can see, some differences are negative while others are positive. A negative value shows that the data point was lower than the mean, while a positive value shows that it was higher. Adding up the values cancels them out, so the result is 0.
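The cancellation can be checked directly. The snippet below is a small self-contained sketch that repeats the Company A figures and sums the deviations; the variable names are illustrative. With other data sets, a tiny floating-point residue may appear instead of an exact zero.

```python
# Company A salaries, repeated here so the snippet runs on its own
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]

mean_salary = sum(company_a_salaries) / len(company_a_salaries)
deviations = [salary - mean_salary for salary in company_a_salaries]

# Positive and negative deviations offset each other
total_deviation = sum(deviations)
print(total_deviation)
```

This is why the raw differences cannot simply be averaged to measure spread: their mean would be zero for any data set.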

However, the signs get in the way. We are not really interested in whether a value lies above or below the mean; our main concern is "by how much?". The signs therefore have to go. There are two ways to do this, one being squaring: when we square the values, the negative signs disappear.

In [3]:
company_a_salaries_squared_differences = [i ** 2 for i in company_a_salaries_differences]

print(company_a_salaries_squared_differences)

[1000000.0, 7840000.0, 360000.0, 6250000.0, 360000.0, 90000.0, 1000000.0]


The next step is to take the mean of these squared differences. Remember, with both measures of central tendency and measures of dispersion, we attempt to summarize large data into a single meaningful value. In this case, the mean of these values is a single number that shows the average squared deviation from the central point.

In [4]:
get_mean(company_a_salaries_squared_differences)

2414285.714285714

We get a mean squared deviation of ```2,414,285.71...```. This is what we call the variance.
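Because the calculation above divides by the number of data points, it corresponds to what is usually called the population variance. As a hedged cross-check, the sketch below compares a manual calculation with `statistics.pvariance` from the Python standard library. Note that `statistics.variance` divides by n - 1 instead and would return a different number.

```python
import statistics

# Company A salaries, repeated here so the snippet runs on its own
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]

# Manual calculation: mean of the squared deviations (divides by n)
mean_salary = sum(company_a_salaries) / len(company_a_salaries)
manual_variance = sum(
    (salary - mean_salary) ** 2 for salary in company_a_salaries
) / len(company_a_salaries)

# Standard library equivalent: population variance
library_variance = statistics.pvariance(company_a_salaries)

print(manual_variance)
print(library_variance)
```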

Now, let's define a function to get the variance of any list of numbers, so that we can get the variance of Company B.

In [5]:
def get_variance(data: list):
    """
    Function: get_variance() -> Gets the variance of a list of numbers
    Input: data -> a list of numbers
    Output: variance -> a float representing variance

    Variance is obtained by:
    1. getting the mean of the data
    2. getting the differences between the mean and each datapoint
    3. squaring these differences
    4. getting the mean of the squared differences
    """

    # getting the mean of the data
    mean = get_mean(data)

    # getting the squared differences
    deviation_squared = [(i - mean) ** 2 for i in data]

    # getting the mean of the squared differences
    variance = get_mean(deviation_squared)

    return variance

get_variance(company_b_salaries)

12368571.42857143

Company B has a variance of ```12,368,571.42...```. Evidently, Company B has a greater variance than Company A. This shows that the salaries in Company B are more dispersed (spread out) than those in Company A.

However, something about variance makes it harder to interpret. The salaries are in the tens of thousands of dollars, yet their variance is in the millions. This is because we squared the differences to get the variance, so it is expressed in squared dollars rather than dollars. This makes variance a large number that is harder to interpret and understand. It is for this reason that we have the **Standard Deviation.**
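The effect of squaring on the units can be seen by expressing the same salaries in thousands of dollars. The sketch below is illustrative only and uses `statistics.pvariance` from the standard library as a stand-in for the variance calculation described above; the variable names are assumptions.

```python
import statistics

# Company A salaries in dollars, repeated so the snippet runs on its own
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]

# The same salaries expressed in thousands of dollars
company_a_thousands = [salary / 1000 for salary in company_a_salaries]

variance_dollars = statistics.pvariance(company_a_salaries)
variance_thousands = statistics.pvariance(company_a_thousands)

# Changing the unit by a factor of 1000 changes the variance
# by a factor of 1000 ** 2, because the deviations are squared
ratio = variance_dollars / variance_thousands
print(ratio)
```

The ratio comes out at roughly one million rather than one thousand, which is the practical meaning of "squared units".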

### 2. Standard Deviation
Standard Deviation converts the variance (a squared deviation) back into the original units of the data, which makes it easier to understand. It does this by taking the square root of the variance. We import the ```math``` module for its square root function, ```math.sqrt()```.

In [6]:
import math

def get_stdeviation(data: list):
    """
    Function: get_stdeviation() -> Gets the standard deviation of a list of data
    Input: data -> a list of numbers
    Output: stdeviation -> a float representing the standard deviation.

    Standard deviation is derived from the square root of the variance of the data
    """

    # getting the variance of the data
    variance = get_variance(data)

    # getting the square root of the variance
    stdeviation = math.sqrt(variance)

    return stdeviation

print(get_stdeviation(company_a_salaries))
print(get_stdeviation(company_b_salaries))

1553.7971921347116
3516.8979838163386


The standard deviation of the salaries in Company A is ```1553.79...``` while that of Company B is ```3516.89...```. This confirms that the salaries in Company B are more spread out than those in Company A. Because the standard deviation is in the same units as the data, we can read it directly: salaries in Company A typically lie about $1,554 from the mean, while those in Company B typically lie about $3,517 from it.
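One hedged way to make this reading concrete is to count how many salaries fall within one standard deviation of the mean. The helper `count_within_one_sd` below is a hypothetical name introduced for illustration, and `statistics.pstdev` is used as the standard library counterpart of the square root of the variance computed above.

```python
import statistics

# Salaries repeated here so the snippet runs on its own
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]
company_b_salaries = [34900, 27500, 31600, 39700, 35300, 33800, 31700]

def count_within_one_sd(data):
    """Count values lying within one population standard deviation of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(1 for value in data if abs(value - mean) <= sd)

count_a = count_within_one_sd(company_a_salaries)
count_b = count_within_one_sd(company_b_salaries)

print(f"Company A: {count_a} of {len(company_a_salaries)} salaries within one SD")
print(f"Company B: {count_b} of {len(company_b_salaries)} salaries within one SD")
```

In both companies five of the seven salaries fall inside the band, but Company B's band is more than twice as wide, which is what the larger standard deviation expresses.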

### 3. Mean Absolute Deviation (MAD)

Just like Variance and Standard Deviation, the Mean Absolute Deviation measures how the data points in a data set are spread out using their differences from the mean. This makes the first two steps of getting the MAD the same as for Variance and Standard Deviation:
1. Get the mean of the data set
2. Get the difference of each data point from the mean (x - mean)

Here's where things become different. After getting the differences, unless every data point equals the mean, we end up with both positive and negative values. As mentioned earlier, these signs are not a concern to us. We are more concerned with the magnitudes, which show us *how far each data point is from the mean*.

While working with Variance and Standard Deviation, we got rid of the negative signs by squaring all the values. When it comes to MAD, as the name suggests, we get rid of them by taking the absolute value of each difference. We then find the average of the absolute differences, which gives us the Mean Absolute Deviation:

3. Make the differences absolute
4. Get the mean of the absolute differences

As a function, the Mean Absolute Deviation would look like this:

In [7]:
def get_mad(data: list):
    """
    Function: get_mad() -> a function that gets the Mean Absolute Deviation (MAD)
    Input: data -> a list of numbers
    Output: mad -> a float representing the Mean Absolute Deviation (MAD)

    The Mean Absolute Deviation is obtained by:
    1. Deriving the mean of the data
    2. Getting the difference between the mean and each datapoint
    3. Making these differences absolute
    4. Getting the mean of the absolute differences
    """

    # getting the mean of the data
    mean = get_mean(data)

    # getting the absolute difference between each datapoint and the mean
    absolute_deviation = [abs(i - mean) for i in data]

    # getting the mean of the absolute differences
    mad = get_mean(absolute_deviation)

    return mad

print(get_mad(company_a_salaries))
print(get_mad(company_b_salaries))

1257.142857142857
2771.4285714285716
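These values can be read next to the standard deviations from the previous section. The sketch below is self-contained and illustrative: `mean_absolute_deviation` is a hypothetical stand-in for the function above, and `statistics.pstdev` stands in for the standard deviation. The MAD never exceeds the population standard deviation, because squaring gives extra weight to the largest deviations.

```python
import statistics

# Salaries repeated here so the snippet runs on its own
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]
company_b_salaries = [34900, 27500, 31600, 39700, 35300, 33800, 31700]

def mean_absolute_deviation(data):
    """Mean of the absolute differences between each value and the mean."""
    mean = sum(data) / len(data)
    return sum(abs(value - mean) for value in data) / len(data)

results = {}
for name, salaries in (("A", company_a_salaries), ("B", company_b_salaries)):
    mad = mean_absolute_deviation(salaries)
    sd = statistics.pstdev(salaries)
    results[name] = (mad, sd)
    print(f"Company {name}: MAD = {mad:.2f}, standard deviation = {sd:.2f}")
```

Both measures rank Company B as the more dispersed company; they differ only in how strongly the extreme salaries count.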


### 4. Range
This has to be the easiest measure of dispersion. Range is simply the difference between the highest value and the lowest value in a data set. In our salaries example, we can see the difference between the salaries of the highest earner and the lowest earner in a given company. Let's do this by coming up with a function.

In [8]:
def get_range(data: list):
    """
    Function: get_range() -> gets the range of a list of numbers
    Input: data -> a list of numbers
    Output: the_range -> a number representing the range

    The range is obtained by getting the difference between the highest and the lowest number in a list of numbers.
    """

    the_range = max(data) - min(data)
    return the_range

print(get_range(company_a_salaries))
print(get_range(company_b_salaries))

5300
12200


The range of salaries in these two companies confirms our finding that the salaries in Company B are more spread out than those in Company A.

### 5. Interquartile Range (IQR)
Range measures dispersion by getting the difference between the highest and the lowest value. As you might have guessed, this measure of dispersion is strongly affected by outliers, since a single extreme value changes the maximum or the minimum. That's where the Interquartile Range (IQR) comes into play.
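To illustrate the sensitivity, the sketch below adds one invented salary to the Company A data. The 250000 figure is purely hypothetical and is not part of the original data set.

```python
# Company A salaries, repeated here so the snippet runs on its own
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]

# Hypothetical outlier: one very highly paid employee (invented for illustration)
salaries_with_outlier = company_a_salaries + [250000]

range_before = max(company_a_salaries) - min(company_a_salaries)
range_after = max(salaries_with_outlier) - min(salaries_with_outlier)

print(range_before)
print(range_after)
```

A single extra value moves the range from 5300 to 219300, even though the other seven salaries are unchanged.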

The IQR reduces the influence of outliers by ignoring the lowest quarter and the highest quarter of the values. It does this by arranging the data in ascending order and then dividing it into 4 groups. The points that divide the groups are called quartiles. There are, therefore, four groups separated by 3 quartiles.

The 1st quartile (Q1) is found at the 25th percentile, that is to say, it separates the first 25% of the data from the rest of the data. The 2nd quartile (Q2) separates the first 50% of the data from the rest of the data. It's the middle value of the data (commonly referred to as the median). Finally, we have the 3rd quartile (Q3), which separates the first 75% of the data from the last 25%. Diagrammatically, this is what the quartiles look like:

                |----------------+----------------+----------------+---------------|
            starting    group   Q1      group     Q2    group      Q3   group   ending
            point         1               2    (median)   3               4     point

The IQR, therefore, gets rid of the outliers by ignoring the data in group 1 and group 4 and then finding the range of what remains. The lowest value then becomes Q1 and the highest value becomes Q3.
> IQR = Q3 - Q1

Note that we do not need Q2 to get the IQR.

This seems simple at first, but the challenge comes when finding Q1 and Q3.

*To find Q1:*

Q1 is the (1/4 * (n + 1))th value of the sorted data. When this position is a whole number, e.g. 3, Q1 is the 3rd value. When it is not a whole number, e.g. 3.25, Q1 is the 3rd value + 0.25 * (4th value - 3rd value): we take 0.25 (or 1/4) of the distance between the 3rd and the 4th values and add it to the 3rd value. This is, quite literally, the 3.25th value. The process of finding a value at a fractional position such as 3.25 (the same process used for the median when its position is not a whole number) is called **linear interpolation**.

*To find Q3:*

Q3 is the (3/4 * (n + 1))th value of the sorted data. Again, when this position is a whole number, e.g. 6, Q3 is the 6th value. When it is not a whole number, e.g. 6.75, we get the 6.75th value by taking 0.75 (or 3/4) of the distance between the 6th value and the 7th value and adding it to the 6th value. Mathematically: 6th value + 0.75 * (7th value - 6th value). Once more, this process is known as linear interpolation.
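As a worked example of the two rules, the sketch below looks up the value at a fractional position in a small sorted list. The helper `value_at_position` is a hypothetical name introduced for illustration, and it assumes the position lies between 1 and n.

```python
def value_at_position(sorted_data, position):
    """Return the value at a 1-based, possibly fractional, position.

    Assumes 1 <= position <= len(sorted_data) and that the data is sorted.
    """
    lower_index = int(position) - 1       # zero-based index of the value just below
    fraction = position - int(position)   # how far to move towards the next value
    if fraction == 0:
        return sorted_data[lower_index]
    lower = sorted_data[lower_index]
    upper = sorted_data[lower_index + 1]
    return lower + fraction * (upper - lower)

data = sorted([7, 1, 3, 5, 9, 6, 8, 2])   # [1, 2, 3, 5, 6, 7, 8, 9]
n = len(data)

q1 = value_at_position(data, (n + 1) / 4)       # position 2.25
q3 = value_at_position(data, 3 * (n + 1) / 4)   # position 6.75

print(q1, q3, q3 - q1)
```

Here Q1 sits a quarter of the way from the 2nd value (2) to the 3rd value (3), giving 2.25, and Q3 sits three quarters of the way from the 6th value (7) to the 7th value (8), giving 7.75.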

With that out of the way, let's see how we can implement this in code.

In [7]:
# INTERQUARTILE RANGE
def iqr(data: list):
    """
    Function: iqr() -> gets the interquartile range for a list of numbers
    Input: data -> a list of numbers
    Output: iqr -> a number representing the interquartile range
    """

    # sort the data
    data = sorted(data)

    # find the q1th value [1/4 * (n+1)]
    q1th_value = (1/4)*(len(data) + 1)

    # let q1 be the q1th value if the q1th value is whole
    if q1th_value.is_integer() :
        q1 = data[int(q1th_value) - 1] # -1 because of zero based indexing
    
    # and carry out linear interpolation to find q1 if it is not a whole number
    else:
        q1 = data[int(q1th_value) - 1] + ((q1th_value - int(q1th_value))*((data[int(q1th_value)]) - (data[int(q1th_value) - 1])))

    # find the q3rd value [3/4 * (n+1)]
    q3rd_value = (3/4)*(len(data) + 1)

    # let q3 be the q3rd value if the q3rd value is whole
    if q3rd_value.is_integer():
        q3 = data[int(q3rd_value) - 1] # -1 because of zero-based indexing

    # if the q3rd value is not a whole number, carry out linear interpolation
    else:
        q3 = data[int(q3rd_value) - 1] + ((q3rd_value - int(q3rd_value))*((data[int(q3rd_value)]) - (data[int(q3rd_value) - 1])))

    # finally, to get IQR, we get the difference between q3 and q1
    iqr = q3-q1
    return iqr

# Testing a list
iqr([7, 1, 3, 5, 9, 6, 8, 2])

# Testing a list of floats
iqr([2.2, 2.3, 4.5, 7.1, 5.9, 8.8, 14.4])

6.500000000000001
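As a hedged cross-check, the standard library offers `statistics.quantiles` (Python 3.8 and later). Its `"exclusive"` method places the quartiles at the same (n + 1)-based positions described above, so it should agree with the hand-written function on these inputs. The wrapper name `iqr_from_quantiles` is an assumption made for this sketch.

```python
import statistics

def iqr_from_quantiles(data):
    """IQR computed from the standard library's quartile cut points."""
    # n=4 asks for quartiles; "exclusive" uses positions based on (n + 1)
    q1, _q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return q3 - q1

iqr_integers = iqr_from_quantiles([7, 1, 3, 5, 9, 6, 8, 2])
iqr_floats = iqr_from_quantiles([2.2, 2.3, 4.5, 7.1, 5.9, 8.8, 14.4])

print(iqr_integers)
print(iqr_floats)
```

Other quartile conventions exist (for example `method="inclusive"`), and they can give slightly different results on small data sets, so it is worth stating which convention is in use when reporting an IQR.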