# Definition of Statistical Measures – Central Tendency and Spread
### Dr. Tirthajyoti Sarkar, Fremont, CA 94536
---
This notebook discusses fundamental concepts of descriptive statistics: measures of central tendency (mean, median, mode) and measures of dispersion, or spread (variance and standard deviation). We show how to compute these descriptive statistics using NumPy functions.

### Central tendency
A measure of central tendency is a single value that describes a set of data by identifying the central position within that set. Such measures are also categorized as summary statistics:

* **Mean**: Mean is the sum of all values divided by the total number of values.

$$ \mu = \frac{\sum_{i=1}^{N}{n_i}}{N} $$

where $n_i$ is the $i$-th observation and $N$ is the total number of observations.

* **Median**: The median is the middle value. It is the value that splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal number of values above it and below it. With an even number of observations, the median is the average of the two middle values.
* **Mode**: The mode is the value that occurs the most frequently in your dataset. On a bar chart, the mode is the highest bar.

Generally, the mean is a better measure to use for symmetric data, and the median is a better measure for data with a skewed (left- or right-heavy) distribution, because the median is less sensitive to extreme values. For categorical data, you have to use the mode.
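As a minimal sketch of the three measures defined above, the following cell computes them for a small, made-up array (the values in `data` are hypothetical, chosen only for illustration). NumPy has no dedicated mode function, so the mode is obtained here by counting occurrences with `np.unique`; this is one possible approach, not the only one.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
data = np.array([2, 3, 3, 5, 7, 10])

mean = data.mean()          # sum of all values divided by their count
median = np.median(data)    # middle value of the sorted data

# Count how often each distinct value occurs, then pick the most frequent one.
# On ties, argmax returns the first (smallest) of the tied values.
values, counts = np.unique(data, return_counts=True)
mode = values[counts.argmax()]

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
```

Because the array has an even number of elements, the median is the average of the two middle values (3 and 5).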

### Spread
The spread of the data measures how much the values in the dataset are likely to differ from their mean. If all the values are close together, the spread is low; if some or all of the values differ by a large amount from the mean (and from each other), the spread is large.

* **Variance**: This is the most common measure of spread. Variance is the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.

$$V = \frac{\sum{(n_i-\mu)^2}}{N}$$

* **Standard Deviation**: Because variance is computed from squared deviations, its unit is the square of the unit of the original data. Standard deviation restores the original unit: it is the positive square root of the variance.

$$\sigma = \sqrt{\frac{\sum{(n_i-\mu)^2}}{N}}$$
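The two formulas above can be checked step by step. The sketch below uses a hypothetical eight-element array (an assumption for illustration) and compares the hand-computed results with NumPy's `var()` and `std()` methods.

```python
import numpy as np

# Hypothetical data, chosen so that the arithmetic is easy to follow
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = data.mean()                                  # mean of the values
deviations = data - mu                            # n_i - mu for every value
variance = (deviations ** 2).sum() / data.size    # average squared deviation
std_dev = np.sqrt(variance)                       # positive square root

print("Variance (by hand):", variance)
print("Variance (NumPy):  ", data.var())
print("Std. dev (by hand):", std_dev)
print("Std. dev (NumPy):  ", data.std())
```

Here the squared deviations sum to 32, so the variance is 32/8 = 4 and the standard deviation is 2.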

> **NOTE**: When we later build regression models, we will revisit these definitions in the context of statistical estimation. There, the sample variance will be given by a slightly different formula (the denominator changes from $N$ to $N-1$, and $\mu$ is the sample mean):

$$V = \frac{\sum{(n_i-\mu)^2}}{N-1}$$
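NumPy lets you choose the denominator through the `ddof` ("delta degrees of freedom") argument of `var()` and `std()`: the divisor is `N - ddof`, and the default is `ddof=0`. The sketch below, on a hypothetical array, shows how the result changes when the denominator changes.

```python
import numpy as np

# Hypothetical data, for illustration only
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

var_n = data.var()                 # default ddof=0: divide by N
var_n_minus_1 = data.var(ddof=1)   # ddof=1: divide by N - 1

print("Divide by N:    ", var_n)
print("Divide by N - 1:", var_n_minus_1)
```

The difference between the two results shrinks as the number of observations grows.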

## Let's measure statistical properties of an array of numbers

### Somewhat naive way to do it - we can simply write a 'for' loop, add the numbers, and divide by the length of the array

In [37]:
array = [3,4,4,7,5,6,5.5,8,5,6.5,9,7.5,6]

In [38]:
total = 0  # named `total` so that the built-in sum() is not shadowed
for num in array:
    total += num
mean = total/len(array)
print("Mean: ",mean)

Mean:  5.884615384615385


In [39]:
from time import time

In [44]:
t1 = time()
for _ in range(100000):
    total = 0  # named `total` so that the built-in sum() is not shadowed
    for num in array:
        total += num
    mean = total/len(array)
t2 = time()

print("Mean: {}\nAverage time taken for computing the mean using for loop: {} seconds ".format(mean,(t2-t1)/100000))

Mean: 5.884615384615385
Average time taken for computing the mean using for loop: 3.309731483459473e-06 seconds 


### Using NumPy with `ndarray.mean()` method

In [45]:
import numpy as np
np_array = np.array(array)
print("Mean: ",np_array.mean())

Mean:  5.884615384615385


In [46]:
t1 = time()
np_array = np.array(array)
for _ in range(100000):
    mean = np_array.mean()
t2 = time()

print("Mean: {}\nAverage time taken for computing the mean using NumPy: {} seconds ".format(mean,(t2-t1)/100000))

Mean: 5.884615384615385
Average time taken for computing the mean using NumPy: 1.1149728298187255e-05 seconds 


### So, for a small array, the `NumPy` method offers no performance boost; here it is even slower than the loop (about 11 µs vs. 3.3 µs per call). But what happens when the array is large?

In [47]:
from random import randint
# One million random integers between 1 and 100 (inclusive)
lst = [randint(1,100) for _ in range(1000000)]

In [48]:
len(lst)

1000000

In [49]:
t1 = time()
for _ in range(10):
    total = 0  # named `total` so that the built-in sum() is not shadowed
    for num in lst:
        total += num
    mean = total/len(lst)
t2 = time()

print("Mean: {}\nAverage time taken for computing the mean using for loop: {} seconds ".format(mean,(t2-t1)/10))

Mean: 50.465672
Average time taken for computing the mean using for loop: 0.18960354328155518 seconds 


In [50]:
t1 = time()
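# Note: this list-to-array conversion is inside the timed region,
# so its one-time cost is included in the reported average
np_lst = np.array(lst)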
for _ in range(10):
    mean = np_lst.mean()
t2 = time()

print("Mean: {}\nAverage time taken for computing the mean using NumPy: {} seconds ".format(mean,(t2-t1)/10))

Mean: 50.465672
Average time taken for computing the mean using NumPy: 0.014644837379455567 seconds 
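With the large list, NumPy comes out roughly an order of magnitude faster in the timings above (about 0.19 s vs. 0.015 s per run). For more careful measurements, the standard-library `timeit` module can be used instead of `time()`. The following is a minimal sketch under a few assumptions: the helper `loop_mean` is a hypothetical name introduced here, the list is shorter than the notebook's to keep the run quick, and the array is created once outside the timed code so that only the mean computation is measured. Actual timings depend on the machine.

```python
import timeit
from random import randint

import numpy as np


def loop_mean(values):
    """Mean computed with an explicit Python loop (hypothetical helper)."""
    total = 0
    for v in values:
        total += v
    return total / len(values)


# Shorter than the notebook's one million items, to keep this sketch quick
lst = [randint(1, 100) for _ in range(100_000)]
np_lst = np.array(lst)  # converted once, outside the timed code

runs = 10
loop_time = timeit.timeit(lambda: loop_mean(lst), number=runs) / runs
numpy_time = timeit.timeit(lambda: np_lst.mean(), number=runs) / runs

print("Loop : {:.6f} seconds per run".format(loop_time))
print("NumPy: {:.6f} seconds per run".format(numpy_time))
```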


### Variance and standard deviation
* `ndarray.var()`
* `ndarray.std()`

In [51]:
t1 = time()
# Note: np_lst already exists; repeating the conversion here adds its cost to the timing
np_lst = np.array(lst)
for _ in range(10):
    v = np_lst.var()
t2 = time()

print("Variance: {}\nAverage time taken for computing the variance using NumPy: {} seconds ".format(v,(t2-t1)/10))

Variance: 834.1976155884162
Average time taken for computing the variance using NumPy: 0.034854984283447264 seconds 


In [52]:
t1 = time()
# Note: np_lst already exists; repeating the conversion here adds its cost to the timing
np_lst = np.array(lst)
for _ in range(10):
    s = np_lst.std()
t2 = time()

print("Std. dev: {}\nAverage time taken for computing the standard deviation using NumPy: {} seconds ".format(s,(t2-t1)/10))

Std. dev: 28.88247938782985
Average time taken for computing the standard deviation using NumPy: 0.04588747024536133 seconds 


## What if there are `NaN` values in the array?
* `nanmean()`
* `nanmedian()`
* `nanstd()`
* `nanvar()`

In [61]:
array = 20*np.random.random(10)

In [62]:
print(array)

[18.12425815  2.38705063  7.96675581  4.69638935  8.70467508  7.83295758
  2.44450172  5.24213916  4.43874888 15.89713518]


In [63]:
array[2]=np.nan
array[6]=np.nan

In [64]:
print(array)

[18.12425815  2.38705063         nan  4.69638935  8.70467508  7.83295758
         nan  5.24213916  4.43874888 15.89713518]


In [66]:
print("Mean:",array.mean())
print("Var:",array.var())

Mean: nan
Var: nan


### Using special functions which ignore `NaN`. Notice that they are functions of the NumPy module (`np`), not methods of an individual array

In [69]:
print("Mean ignoring NaN:",np.nanmean(array))
print("Var ignoring NaN:",np.nanvar(array))
print("Std. dev ignoring NaN:",np.nanstd(array))
print("Median ignoring NaN:",np.nanmedian(array))

Mean ignoring NaN: 8.415419252372196
Var ignoring NaN: 28.339572737681575
Std. dev ignoring NaN: 5.323492531945694
Median ignoring NaN: 6.537548372374532
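To see what "ignoring `NaN`" means in practice, the sketch below removes the `NaN` entries by hand with a boolean mask and checks that the ordinary methods then agree with the `nan*` functions. The array `values` is hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical array with two missing entries
values = np.array([18.1, 2.4, np.nan, 4.7, 8.7, np.nan, 5.2])

mask = ~np.isnan(values)   # True where the entry is a real number
clean = values[mask]       # keep only the non-NaN entries

print("Mean after masking:", clean.mean())
print("np.nanmean:        ", np.nanmean(values))
print("Var after masking: ", clean.var())
print("np.nanvar:         ", np.nanvar(values))
```

The `nan*` functions are more convenient when the array must keep its original shape, since masking drops the missing positions.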


## Other descriptive statistics measures
* Min and max
* Range
* Quantile
* Percentile

<img src="images/percentile.PNG" width=500 height=400></img>

<img src="images/quantiles.png" width=400 height=300></img>

In [70]:
array = 20*np.random.random(10)
print(array)

[ 9.09854842  9.84177347  4.75767837 16.68784841  4.36347513 17.65766902
  6.58960485 12.95874598 16.94898066 11.32537736]


In [71]:
# Using np.amax()
print("Max of the array:",np.amax(array))
# Using array.max()
print("Max of the array:",array.max())

Max of the array: 17.657669024849667
Max of the array: 17.657669024849667


In [73]:
# Using np.amin()
print("Min of the array:",np.amin(array))
# Using array.min()
print("Min of the array:",array.min())

Min of the array: 4.36347513170992
Min of the array: 4.36347513170992


In [76]:
# Compute range by using max() and min() functions
print("Range of the array: ", array.max()-array.min())
# Compute range by using ptp() function
print("Range of the array: ", np.ptp(array))

Range of the array:  13.294193893139749
Range of the array:  13.294193893139749


In [77]:
# Percentile
print("20th percentile of the array: ", np.percentile(array,20))

20th percentile of the array:  6.223219550082087


In [81]:
# Quantile
print("0.5-th quantile of the array: ", np.quantile(array,0.5))

0.5-th quantile of the array:  10.583575418149183
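Percentiles and quantiles describe the same cut points on different scales: the p-th percentile corresponds to the (p/100)-th quantile, and the 0.5 quantile is the median. A short sketch on a hypothetical five-element array illustrates this; it relies on NumPy's default interpolation between data points.

```python
import numpy as np

# Hypothetical data, for illustration only
data = np.array([4.0, 9.0, 1.0, 7.0, 3.0])

q50 = np.quantile(data, 0.5)     # quantile on a 0-1 scale
p50 = np.percentile(data, 50)    # percentile on a 0-100 scale
med = np.median(data)            # the median is the same cut point

# Several quantiles can be requested at once
quartiles = np.quantile(data, [0.25, 0.5, 0.75])

print("0.5 quantile:   ", q50)
print("50th percentile:", p50)
print("Median:         ", med)
print("Quartiles:      ", quartiles)
```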
