# Measures of Central Tendency

## statistics module
https://www.w3schools.com/python/module_statistics.asp
https://docs.python.org/3/library/statistics.html

In [None]:
import statistics
print(dir(statistics))
help(statistics)

## Averages and measures of central location

In [None]:
import statistics as stat
import random
d = random.sample(range(1, 501), 200) # 200 distinct integers from [1, 500]; positive values keep the geometric and harmonic means defined

stat.mean(d) # Arithmetic mean (“average”) of data.
stat.fmean(d) # Fast, floating point arithmetic mean.
stat.geometric_mean(d) # Geometric mean of data.
stat.harmonic_mean(d) # Harmonic mean of data.
stat.median(d) # Median (middle value) of data.
stat.median_low(d) # Low median of data.
stat.median_high(d) # High median of data.
stat.median_grouped(d) # Median, or 50th percentile, of grouped data.
stat.mode(d) # Single mode (most common value) of discrete or nominal data.
stat.multimode([2, 3, 4, 2]) # List of modes (most common values) of discrete or nominal data.
stat.quantiles(d, n=4) # Divide data into n intervals with equal probability (n=4 gives quartiles).
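
The three median variants listed above only differ when the number of observations is even. A minimal sketch with made-up values (the list `even` is an illustrative assumption, not data from this notebook):

```python
import statistics as stat

even = [1, 3, 5, 7]                 # even number of observations: no single middle value

print(stat.median(even))            # 4.0, the mean of the two middle values 3 and 5
print(stat.median_low(even))        # 3, the lower of the two middle values
print(stat.median_high(even))       # 5, the higher of the two middle values

q = stat.quantiles(even, n=4)       # three cut points that split the data into quartiles
print(q)
```

`median_low` and `median_high` always return a value that occurs in the data, which can be preferable for discrete data.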

## Measures of spread
Measures that describe how much the population or sample tends to deviate from the typical or average values.

In [1]:
stat.variance(d)  # Sample variance of data.
stat.stdev(d)  # Sample standard deviation of data.
stat.pvariance(d)  # Population variance of data.
stat.pstdev(d)  # Population standard deviation of data.

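
The sample and population versions differ only in the divisor: the population variance divides the sum of squared deviations by n, the sample variance by n - 1. A small sketch with illustrative values (the list `data` is an assumption chosen so that the numbers come out round):

```python
import math
import statistics as stat

data = [2, 4, 4, 4, 5, 5, 7, 9]              # illustrative values
n = len(data)
mean = stat.mean(data)                       # 5
ss = sum((x - mean) ** 2 for x in data)      # sum of squared deviations: 32

pop_var = ss / n                             # population variance: divide by n
sample_var = ss / (n - 1)                    # sample variance: divide by n - 1

print(pop_var, stat.pvariance(data))         # both equal 4
print(sample_var, stat.variance(data))       # both equal 32/7
print(math.sqrt(pop_var), stat.pstdev(data)) # both equal 2
```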

## basic statistics in pandas

In [2]:
import requests
import pandas as pd

url = 'https://www.fao.org/fileadmin/templates/worldfood/Reports_and_docs/Food_price_indices_data_jul825.csv'
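r = requests.get(url)
r.raise_for_status()                 # stop early if the download failed
with open('temp.csv', 'wb') as f:    # close the file before pandas reads it
    f.write(r.content)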
food = pd.read_csv('temp.csv',  header=2, nrows=391)
food = food[['Date', 'Food Price Index', 'Meat', 'Dairy', 'Cereals', 'Oils', 'Sugar']][1:].dropna()
food['Date'] = pd.to_datetime(food['Date'])
food.set_index("Date", inplace=True)

In [3]:
food.describe(include='all').round(3)

| Statistic | Food Price Index | Meat | Dairy | Cereals | Oils | Sugar |
|---|---|---|---|---|---|---|
| count | 390.0 | 390.0 | 390.0 | 390.0 | 390.0 | 390.0 |
| mean | 84.947 | 83.705 | 83.508 | 86.157 | 88.301 | 80.795 |
| std | 25.624 | 16.699 | 32.534 | 30.952 | 37.885 | 31.183 |
| min | 50.5 | 51.1 | 36.8 | 48.6 | 35.83 | 31.8 |
| 25% | 63.725 | 70.625 | 55.05 | 60.025 | 62.262 | 57.95 |
| 50% | 78.45 | 82.1 | 76.2 | 84.2 | 80.415 | 74.9 |
| 75% | 99.0 | 97.075 | 109.15 | 101.75 | 105.752 | 99.0 |
| max | 159.7 | 124.7 | 156.5 | 173.5 | 251.8 | 183.2 |


In [4]:
# export table for excel
with pd.ExcelWriter("fao.xlsx", date_format="YYYY-MM") as writer: 
    food.to_excel(writer)

In [None]:
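food.count()    # number of non-missing observations per column
food.sum()      # column sums
food.mean()     # arithmetic mean per column
food.median()   # median per column
food.mode()     # most frequent value(s) per column; a DataFrame, because ties are possible
food.std()      # sample standard deviation per column
food.min()      # column minima
food.max()      # column maxima
food.abs()      # element-wise absolute values
food.prod()     # column products
food.cumsum()   # cumulative sums
food.cumprod()  # cumulative products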

# Measures in detail

<img src="https://upload.wikimedia.org/wikipedia/commons/3/33/Visualisation_mode_median_mean.svg" width=260 height=250 />

### Arithmetic mean
${\displaystyle A={\frac {1}{n}}\sum _{i=1}^{n}a_{i}={\frac {a_{1}+a_{2}+\cdots +a_{n}}{n}}}$
- Useful to characterize symmetric distributions without outliers.
- The arithmetic mean is greatly influenced by outliers and thus not **robust**.
- For skewed distributions the arithmetic mean deviates from the median and the mode.<br>

In [None]:
def my_amean (num: list) -> float:
    total = 0                  # running total; avoids shadowing the built-in sum()
    for number in num:
        total += number
    amean = total / len(num)
    return amean

my_amean([3,5,6,9])

5.75

### Median
$\tilde x
=\begin{cases}
  x_{m+1}                     & \text { for odd n = 2m+1}\\
  \frac{1}{2} (x_m + x_{m+1}) & \text { for even n = 2m}
\end{cases}
$
- Exact **middle value of a sorted dataset**.
- The median $\tilde x$ is the **value splitting the higher half from the lower half** of a data sample.<br>
- **Useful for skewed distributions or data with outliers.**
- The basic feature of the median in describing data compared to the mean $\overline x$ is that it is not skewed by a small proportion of extremely large or small values, and therefore provides a better representation of a "typical" value. 
- The median is a very **resistant / robust statistic**: it stays bounded as long as fewer than half of the values are replaced by arbitrarily extreme ones.


In [None]:
def my_median (num: list) -> float:
    '''value in the middle of the sorted list of numbers'''
    ls = sorted(num)
    lg = len(ls)
    m = lg // 2                        # index of the upper middle element
    if lg % 2 == 1:                    # odd length: a single middle element
        med = ls[m]
    else:                              # even length: mean of the two middle elements
        med = (ls[m-1] + ls[m]) / 2
    return med

my_median([10,17,20, 22, 25, 34, 37, 48, 59])

25
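
The robustness of the median described above can be checked by adding a single extreme value to a small sample. A minimal sketch with made-up numbers (both lists are illustrative assumptions):

```python
import statistics as stat

values = [10, 12, 13, 15, 16]
with_outlier = values + [500]                  # one extreme observation added

print(stat.mean(values))                       # 13.2
print(stat.median(values))                     # 13

print(round(stat.mean(with_outlier), 2))       # 94.33, pulled far away by the outlier
print(stat.median(with_outlier))               # 14.0, barely changed
```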

### Mode
- The mode is the value that appears most often in a set of data values.<br>
- The numerical value of the mode is the same as that of the mean and median in a normal distribution, and it may be very different in highly skewed distributions. <br>


In [74]:
import statistics as stat
stat.mode([2, 3, 4, 2])
stat.multimode([2, 3, 3, 4, 2]) # if there's more than one mode

[2, 3]

In [None]:
def my_mode(ls):
    ''' mode that allows for more than one mode value'''
    # count - counts occurrences of an element in the list
    # map - goes over every element in the iterable
    # filter - picks the items for which the filter criterion is true
    most = max(map(ls.count, ls))
    return sorted(set(filter(lambda x: ls.count(x) == most, ls)))

my_mode([2,3,5, 5, 5, 2, 3, 8])
my_mode([1, 2, 2, 3, 3])


[2, 3]

In [None]:
def my_modus(num: list) -> float:
    ''' mode that returns a single value: the first one to reach the highest frequency'''
    freq = {}                             # build a dict that maps each existing value to its frequency
    for x in num:
        h = num.count(x)          # count frequency of value x
        freq[x] = h                     # add value x and its frequency to the dict

    value_list = list(freq.values())    # make two lists of the same length out of
    key_list = list(freq.keys())          # the dictionary's keys and values

    maximum = 0                         # find max frequency
    for y in value_list:
        if  y > maximum:
            maximum = y

    position = value_list.index(maximum)    # take position of max frequency to extract corresponding key
    return key_list[position]


my_modus([1, 2, 3])
my_mode([1, 2, 2, 3, 3])


[2, 3]
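
The hand-written versions above call `list.count` once per element, which is quadratic in the length of the list. As an alternative sketch, `collections.Counter` tallies all values in a single pass (the function name `counter_modes` is a hypothetical helper, not part of the notebook):

```python
from collections import Counter

def counter_modes(values):
    '''return all values that share the highest frequency, in order of first appearance'''
    counts = Counter(values)                 # value -> frequency, built in one pass
    if not counts:
        return []                            # no data, no mode
    top = max(counts.values())               # highest frequency
    return [v for v, c in counts.items() if c == top]

print(counter_modes([2, 3, 5, 5, 5, 2, 3, 8]))   # [5]
print(counter_modes([1, 2, 2, 3, 3]))            # [2, 3]
```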

### Geometric Mean
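- Useful for averaging ratios and growth factors (growth of figures, population growth, interest rates).
- For positive data it is never greater than the arithmetic mean; the two are equal only when all values are equal.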
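
Why the geometric mean suits growth factors can be seen from a short sketch: applying the geometric mean of the yearly factors every year reproduces the final value, while the arithmetic mean overshoots. The factors and the starting value are made-up assumptions for illustration:

```python
import math
import statistics as stat

factors = [1.10, 0.80, 1.25]            # yearly growth factors: +10 %, -20 %, +25 %
start = 100.0
end = start * math.prod(factors)        # value after three years, about 110

g = stat.geometric_mean(factors)        # constant yearly factor with the same overall effect
a = stat.mean(factors)                  # arithmetic mean of the factors, about 1.05

print(end)
print(start * g ** len(factors))        # matches the final value up to rounding
print(start * a ** len(factors))        # about 115.76, too high
```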

The geometric mean is one of the three classical Pythagorean means, together with the arithmetic mean and the harmonic mean. For all positive data sets containing at least one pair of unequal values, the harmonic mean is always the least of the three means, while the arithmetic mean is always the greatest of the three and the geometric mean is always in between (see Inequality of arithmetic and geometric means.) 

$\left(\prod _{i=1}^{n}x_{i}\right)^{\frac {1}{n}}={\sqrt[{n}]{x_{1}x_{2}\cdots x_{n}}}$
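
The ordering of the three Pythagorean means can be checked numerically. A minimal sketch using the `statistics` module on illustrative data (both lists are assumptions):

```python
import math
import statistics as stat

data = [1, 4, 4]                        # positive values, not all equal

h = stat.harmonic_mean(data)            # 2
g = stat.geometric_mean(data)           # cube root of 16, about 2.52
a = stat.mean(data)                     # 3

print(h, g, a)
print(h < g < a)                        # True

# With identical values all three means coincide (up to floating point rounding).
same = [5, 5, 5]
print(math.isclose(stat.harmonic_mean(same), stat.geometric_mean(same)))
print(math.isclose(stat.geometric_mean(same), stat.mean(same)))
```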

In [1]:
from scipy.stats import gmean
gmean([1.0, 0.00001, 10000000000.])

46.415888336127786

In [75]:
from statistics import geometric_mean
geometric_mean([1.0, 0.00001, 10000000000.])

46.415888336127786

In [None]:
def my_geomean(ls: list) -> float:
    '''average growth factor of a series: the geometric mean of the
    growth factors between consecutive values'''
    growthfactors = []

    for index in range(1, len(ls)):          # start at 1 so every value has a predecessor
        g_fact = ls[index]/ls[index-1]       # growth factor from one value to the next
        growthfactors.append(g_fact)

    prod = 1                                 # product of the growth factors, computed by hand
    for fact in growthfactors:
        prod *= fact

    geo_mean = prod**(1/len(growthfactors))  # n-th root, where n is the number of growth factors
    return geo_mean


round(my_geomean([2, 3, 5, 5, 5, 2, 3, 8]), 3)


1.219

### Harmonic Mean
The harmonic mean is one of the Pythagorean means. <br>
It is sometimes appropriate for situations when the average rate is desired. 

The harmonic mean can be expressed as the reciprocal of the arithmetic mean of the reciprocals of the given set of observations. As a simple example, the harmonic mean of 1, 4, and 4 is 

$\left(\frac{1^{-1} + 4^{-1} + 4^{-1}}{3}\right)^{-1} = \frac{3}{\frac{1}{1} + \frac{1}{4} + \frac{1}{4}} = \frac{3}{1.5} = 2$

$H = \frac {n}{\frac{1}{x_1}+\frac{1}{x_2}+\frac{1}{x_3}+\cdots+\frac{1}{x_n}}$

In many situations involving rates and ratios, the harmonic mean provides the correct average.
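
A classic case is the average speed over legs of equal distance. The sketch below uses made-up speeds and a made-up leg length to show that the harmonic mean, not the arithmetic mean, matches total distance divided by total time:

```python
import math
import statistics as stat

speeds = [40, 60]                                  # km/h on two legs of equal distance
distance = 120                                     # km per leg (illustrative)

total_time = sum(distance / v for v in speeds)     # 3 h + 2 h = 5 h
true_average = len(speeds) * distance / total_time # 240 km / 5 h = 48 km/h

print(true_average)                                # 48.0
print(stat.harmonic_mean(speeds))                  # 48.0
print(stat.mean(speeds))                           # 50, overstates the average speed
```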

The weighted harmonic mean is the preferable method for averaging multiples, such as the price–earnings ratio. If these ratios are averaged using a weighted arithmetic mean, high data points are given greater weights than low data points. The weighted harmonic mean, on the other hand, correctly weights each data point.


If a set of weights $w_1, \ldots, w_n$ is associated with the dataset $x_1, \ldots, x_n$, the **weighted harmonic mean** is defined by 

  $H = \frac{\sum\limits_{i=1}^n w_i}{\sum\limits_{i=1}^n \frac{w_i}{x_i}}
    = \left( \frac{\sum\limits_{i=1}^n w_i x_i^{-1}}{\sum\limits_{i=1}^n w_i} \right)^{-1}$

The unweighted harmonic mean can be regarded as the special case where all of the weights are equal.
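
The weighted formula above translates almost directly into code. A minimal sketch (the function name `weighted_harmonic_mean` is a hypothetical helper, and positive values and weights are assumed):

```python
def weighted_harmonic_mean(values, weights):
    '''sum of the weights divided by the sum of weight/value ratios'''
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(weights) / sum(w / x for w, x in zip(weights, values))

print(weighted_harmonic_mean([40, 60], [5, 30]))      # 35 / (5/40 + 30/60) = 56.0
print(weighted_harmonic_mean([1, 4, 4], [1, 1, 1]))   # equal weights: plain harmonic mean, 2.0
```

With equal weights the result reduces to the unweighted harmonic mean, as stated above.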

In [4]:
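def my_harmonic_mean(ls: list) -> float:
    total = 0                 # sum of reciprocals; avoids shadowing the built-in sum()
    for x in ls:
        total += 1/x
    mean = total / len(ls)    # arithmetic mean of the reciprocals
    hmean = 1/mean            # reciprocal of that mean
    return hmean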

In [5]:
growth = (100, 103, 104, 134, 145)
print(my_harmonic_mean(growth))

114.46005782930303


In [6]:
import statistics as stats

stats.harmonic_mean([100, 103, 104, 134, 145]) # unweighted
stats.harmonic_mean([40, 60], weights=[5, 30]) # with weights (requires Python 3.10+): 35 / (5/40 + 30/60) = 56.0

114.46005782930303