# <span style="color:#54B1FF">Describing data:</span> &nbsp; <span style="color:#1B3EA9"><b>Dispersion</b></span>

<br>


The main measures of [dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion) are:

* Minimum
* Maximum
* Range (Maximum - Minimum)
* [Percentiles](https://en.wikipedia.org/wiki/Percentile) (especially the 25th and 75th percentiles, and the inter-quartile range between them)
* [Variance](https://en.wikipedia.org/wiki/Variance)
* [Standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) (SD)

They can be computed using NumPy as follows:

In [1]:
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats

In [2]:
filenameCSV = 'num_friends.csv'
n = np.loadtxt(filenameCSV, delimiter=',')  # 204 integer values


y0 = np.min( n )  # minimum
y1 = np.max( n )  # maximum
y2 = y1 - y0      # range
y3 = np.percentile( n, 25 ) # 25th percentile (i.e., lower quartile)
y4 = np.percentile( n, 75 ) # 75th percentile (i.e., upper quartile)
y5 = y4 - y3 # inter-quartile range
y6 = np.var( n ) # variance (population form: divides by N)
y7 = np.std( n ) # standard deviation (population form: divides by N)


print("Minimum              = ", y0)
print("Maximum              = ", y1)
print("Range                = ", y2)
print("25th percentile      = ", y3)
print("75th percentile      = ", y4)
print("Inter-quartile range = ", y5)
print("Variance             = ", y6)
print("Standard deviation   = ", y7)

Minimum              =  1.0
Maximum              =  100.0
Range                =  99.0
25th percentile      =  3.0
75th percentile      =  9.0
Inter-quartile range =  6.0
Variance             =  81.14379084967321
Standard deviation   =  9.007984838446012
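
By default, `np.var` and `np.std` use `ddof=0`, meaning they divide by the number of values N (the population form). If the data are treated as a sample from a larger population, the sample form (dividing by N - 1) is often preferred, and it can be requested with `ddof=1`. The sketch below illustrates the difference using a small, hypothetical array `x` rather than the `num_friends` data:

```python
import numpy as np

# Hypothetical values, used only for illustration (not the num_friends data)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

var_pop    = np.var(x)           # ddof=0 (default): divide by N
var_sample = np.var(x, ddof=1)   # ddof=1: divide by N - 1
sd_pop     = np.std(x)           # square root of var_pop
sd_sample  = np.std(x, ddof=1)   # square root of var_sample

print("Population variance = ", var_pop)
print("Sample variance     = ", var_sample)
print("Population SD       = ", sd_pop)
print("Sample SD           = ", sd_sample)
```

The sample values are always slightly larger than the population values, and the gap shrinks as N grows.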


Note that the variance is equal to the squared SD:

In [3]:
print("Variance     = ", y6)
print("Squared SD   = ", y7**2)

Variance     =  81.14379084967321
Squared SD   =  81.14379084967322
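
The two printed values differ in the last digit because of floating-point rounding, not because the quantities are different. When comparing floating-point results like these, a tolerance-based check such as `np.isclose` is usually safer than `==`. A minimal sketch, using a hypothetical array `x`:

```python
import numpy as np

# Hypothetical values, used only for illustration
x = np.array([1.0, 3.0, 4.0, 9.0, 100.0])

variance   = np.var(x)
squared_sd = np.std(x) ** 2

# Exact equality can fail because of rounding in the square root and square
print("Exactly equal:       ", variance == squared_sd)

# np.isclose compares within a small tolerance
print("Approximately equal: ", np.isclose(variance, squared_sd))
```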


<br> 

___

### Counting array elements using Boolean arrays

Another common way to describe data dispersion is to count the number of array elements that meet a specific criterion.

For example:

* How many values of "4" exist in the array?
* How many array values are greater than 10?
* What proportion of array values are greater than 10?

These questions can be answered using **Boolean arrays**.

* A "[Boolean](https://en.wikipedia.org/wiki/Boolean_data_type)" is a variable that has only two possible values: True or False (i.e., 1 or 0).
* A "Boolean array" is an array that contains only True and/or False values (i.e., 1 or 0).

Here is an example Boolean array:

In [4]:
b = np.array( [True, True, False, False, False] )
print(b)

[ True  True False False False]


How many True values are there?

Since `True` counts as 1 and `False` counts as 0 in arithmetic, we can use `sum` to count the number of `True` values.

In [5]:
n = sum(b)
print(n, ' values are True')

# or

n = np.sum(b)
print(n, ' values are True')

# or

n = b.sum()
print(n, ' values are True')


2  values are True
2  values are True
2  values are True
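
NumPy also provides `np.count_nonzero`, which counts the elements that are not zero (for a Boolean array, the `True` values). The sketch below shows it as one more alternative; the variable name `n_true` is chosen here only for illustration:

```python
import numpy as np

b = np.array([True, True, False, False, False])

# Count the True (non-zero) elements
n_true = np.count_nonzero(b)
print(n_true, 'values are True')
```

For large arrays, the NumPy versions (`np.sum`, `b.sum()`, `np.count_nonzero`) are generally much faster than Python's built-in `sum`, which loops over the elements one at a time.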


What is the proportion of values that are True?  Two are true, and there are five values, so the proportion should be 0.4.

This can be calculated as the average of all array values:

In [6]:
p = sum( b ) / len( b )
print('Proportion of True values: ', p)

# or

p = np.sum( b ) / b.size
print('Proportion of True values: ', p)

# or

p = np.mean( b )
print('Proportion of True values: ', p)

# or

p = b.mean()
print('Proportion of True values: ', p)

Proportion of True values:  0.4
Proportion of True values:  0.4
Proportion of True values:  0.4
Proportion of True values:  0.4
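
In practice, Boolean arrays are rarely typed in by hand. They are usually created by applying a comparison operator (`==`, `>`, `<`, etc.) to a numeric array, which tests every element at once. A minimal sketch, using a hypothetical array `x`:

```python
import numpy as np

# Hypothetical values, used only for illustration
x = np.array([1, 4, 12, 4, 30])

is_four  = (x == 4)    # True where the element equals 4
is_large = (x > 10)    # True where the element is greater than 10

print(is_four)
print(is_large)

print(is_four.sum(), 'values are equal to 4')
print(is_large.mean(), 'is the proportion of values greater than 10')
```

Here two of the five elements equal 4, and two of the five exceed 10, so the count is 2 and the proportion is 0.4.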


<br> 

Let's now answer the following questions for the `num_friends` dataset.

* How many values of "4" exist in the array?
* How many array values are greater than 10?
* What proportion of array values are greater than 10?

In [7]:
filenameCSV = 'num_friends.csv' 
n = np.loadtxt(filenameCSV, delimiter=',')  # 204 integer values


n0 = ( n==4 ).sum()
n1 = ( n>10 ).sum()
n2 = ( n>10 ).mean()

print(n0, 'values are equal to 4')
print(n1, 'values are greater than 10')
print(n2*100, '% of all values are greater than 10')

20 values are equal to 4
25 values are greater than 10
12.254901960784313 % of all values are greater than 10
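
The same approach extends to criteria with more than one condition. Boolean arrays can be combined element by element using `&` (and) and `|` (or); the parentheses around each comparison are required because these operators bind more tightly than comparisons in Python. The sketch below uses a hypothetical array `x`, not the `num_friends` data:

```python
import numpy as np

# Hypothetical values, used only for illustration
x = np.array([1, 4, 12, 4, 30, 7])

# Values from 4 to 10, inclusive (both conditions must hold)
in_range   = (x >= 4) & (x <= 10)
n_in_range = in_range.sum()

# Values below 4 or above 10 (either condition may hold)
outside   = (x < 4) | (x > 10)
n_outside = outside.sum()

print(n_in_range, 'values are between 4 and 10')
print(n_outside, 'values are below 4 or above 10')
```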
