# Means

Measuring data by taking a mean is pretty ubiquitous. However, a mean also omits a lot of information.

This is a primer on means and some other 'measures of centrality' in data.

#### Summarizing data using a single number

The goal is to capture information about the distribution of data.

## Arithmetic mean

The arithmetic mean is used very frequently to summarize numerical data, and is usually what is meant by the word "average". It is defined as the sum of the observations divided by the number of observations:

$$\mu = \frac{\sum_{i=1}^N X_i}{N}$$

In [1]:
import scipy.stats as stats
import numpy as np

In [3]:
x1 = [1,2,2,3,4,5,5,7]
x2 = x1 + [100]

In [6]:
print 'Mean of x1:', sum(x1),'/', len(x1), '=', np.mean(x1)
print 'Mean of x2:', sum(x2),'/', len(x2), '=', np.mean(x2)

Mean of x1: 29 / 8 = 3.625
Mean of x2: 129 / 9 = 14.3333333333


#### Weighted Arithmetic Mean

Useful for explicitly specifying the number of times each observation should be counted. Multiply each value by its weight, sum the results, and divide by the sum of the weights.
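As a minimal sketch of the weighted mean, the cell below computes it by hand and with `np.average`, which accepts a `weights` argument. The names `values` and `weights` are illustrative (not from the original notebook); they are chosen so that the weights reproduce the repeated entries of `x1`.

```python
import numpy as np

# Distinct values of x1 = [1, 2, 2, 3, 4, 5, 5, 7] and how often each occurs
# (illustrative names, chosen to match x1)
values = [1, 2, 3, 4, 5, 7]
weights = [1, 2, 1, 1, 2, 1]

# Weighted mean by hand: sum(w * x) / sum(w)
manual = sum(w * v for v, w in zip(values, weights)) / float(sum(weights))

# Same quantity using NumPy
weighted = np.average(values, weights=weights)
```

With these weights both `manual` and `weighted` equal 3.625, the plain arithmetic mean of `x1`.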

## Median

The median of a set of data is the number which appears in the middle of the list when it is sorted. When the list has an even number of entries, the median is the average of the two middle values.

In [7]:
np.median(x1),np.median(x2)

(3.5, 4.0)
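The definition can be sketched directly by sorting the list. The helper name `median_sorted` is hypothetical; it is only meant to show what `np.median` computes for these two lists.

```python
def median_sorted(values):
    """Median via sorting: the middle value, or the mean of the two middle values."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return float(s[mid])
    return (s[mid - 1] + s[mid]) / 2.0

x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]
m1 = median_sorted(x1)  # 3.5
m2 = median_sorted(x2)  # 4.0
```

Adding the outlier 100 moves the median only from 3.5 to 4.0, whereas the arithmetic mean jumps from 3.625 to about 14.33.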

## Mode

The mode is the most frequently occurring value in a data set. It is useful for data whose possible values are independent of one another. For example, in the outcomes of a weighted die, coming up 6 often does not mean it is likely to come up 5.



In [12]:
# Returns only one value
stats.mode(x1)[0][0]

2

In [13]:
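def mode(l):
    # Count how many times each value occurs
    counts = {}
    for e in l:
        if e in counts:
            counts[e] += 1
        else:
            counts[e] = 1

    # Collect every value that attains the highest count
    maxcount = 0
    modes = set()
    for (k, v) in counts.items():
        if v > maxcount:
            maxcount = v
            modes = {k}
        elif v == maxcount:
            modes.add(k)
    # Report a mode only if some value repeats (or there is a single observation)
    if maxcount > 1 or len(l) == 1:
        return list(modes)
    return 'No Mode'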


In [11]:
mode(x1)

[2, 5]

For data that can take on many different values, such as returns data, there may not be any values that appear more than once.

In this case we can bin values, like we do when constructing a histogram, and then find the mode of the data set where each value is replaced with the name of its bin.

That is, we find which bin the elements fall into most often.

In [14]:
# Get returns data for an asset and compute the mode of the data set
start = '2019-01-01'
end = '2020-01-01'
pricing = get_pricing('SPY', fields='price', start_date=start,end_date=end)
returns = pricing.pct_change()[1:]
mode(returns)

'No Mode'

Since all of the returns are distinct, we use a frequency distribution to get an alternative mode.

`np.histogram` returns the frequency distribution over the bins as well as the endpoints of the bins.

In [16]:
hist,bins = np.histogram(returns,20)
maxfreq = max(hist)

bin_result = [(bins[i],bins[i+1]) for i, j in enumerate(hist) if j == maxfreq]
print 'Mode of bins:', bin_result

Mode of bins: [(-0.0010477179735711162, 0.0021455730028153708)]
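The same binning idea can be tried without pricing data. This is a sketch under stated assumptions: `fake_returns` is synthetic, normally distributed data standing in for real returns, and `binned_modes` is a hypothetical helper name.

```python
import numpy as np

# Synthetic stand-in for returns data (an assumption; not real prices)
rng = np.random.default_rng(0)
fake_returns = rng.normal(loc=0.0, scale=0.01, size=250)

def binned_modes(data, bins=20):
    """Return the (left, right) edges of every bin with the highest count."""
    hist, edges = np.histogram(data, bins=bins)
    top = hist.max()
    return [(edges[i], edges[i + 1]) for i, count in enumerate(hist) if count == top]

modal_bins = binned_modes(fake_returns)

# Small hand-checkable case: four unit-width bins over [1, 5]
small = binned_modes([1, 1, 1, 2, 5], bins=4)  # [(1.0, 2.0)]
```

The function returns a list because several bins can tie for the highest count, just as `mode` above can return several values.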


## Geometric Mean

While the arithmetic mean averages using addition, the geometric mean uses multiplication.

$$G = \sqrt[N]{X_1 X_2 \cdots X_N}$$

where $X_i > 0$. We can also rewrite it as an arithmetic mean using logarithms:

$$\ln G = \frac{\sum_{i=1}^N \ln X_i}{N}$$

The geometric mean is always less than or equal to the arithmetic mean (when working with non-negative observations), with equality only when all of the observations are the same.


In [18]:
stats.gmean(x1),stats.gmean(x2)

(3.0941040249774403, 4.5525345876200713)
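As a rough check of the logarithm form, the sketch below computes the geometric mean of `x1` two ways: as the exponential of the mean of the logs, and as the N-th root of the product. The helper name `geometric_mean` is hypothetical and is not part of SciPy.

```python
import numpy as np

x1 = [1, 2, 2, 3, 4, 5, 5, 7]

def geometric_mean(values):
    """Geometric mean of positive values, computed as exp(mean(log(x)))."""
    arr = np.asarray(values, dtype=float)
    if np.any(arr <= 0):
        raise ValueError("geometric mean via logs needs strictly positive values")
    return float(np.exp(np.mean(np.log(arr))))

g_log = geometric_mean(x1)
g_root = float(np.prod(x1)) ** (1.0 / len(x1))  # N-th root of the product
a_mean = float(np.mean(x1))                     # arithmetic mean, for comparison
```

Both `g_log` and `g_root` agree with the `stats.gmean(x1)` value above (about 3.094), which is below the arithmetic mean of 3.625. The log form is usually preferred for long series because the raw product can overflow or underflow.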

What if we want to compute the geometric mean when we have negative observations? 

In the case of asset returns, where our values are always at least $-1$, we can add 1 to a return $R_t$ to get $1 + R_t$, which is the ratio of the price of the asset for two consecutive periods (as opposed to the percent change between the prices, $R_t$).

This quantity will always be non-negative, so we can compute the geometric mean return:

$$R_G = \sqrt[T]{(1 + R_1)(1 + R_2) \cdots (1 + R_T)} - 1$$

In [22]:
# Add 1 to every value in the returns array and then compute R_G
ratios = returns + np.ones(len(returns))
ratios[:10]

2019-01-03 00:00:00+00:00    0.975506
2019-01-04 00:00:00+00:00    1.034078
2019-01-07 00:00:00+00:00    1.007289
2019-01-08 00:00:00+00:00    1.009436
2019-01-09 00:00:00+00:00    1.004636
2019-01-10 00:00:00+00:00    1.004188
2019-01-11 00:00:00+00:00    0.999807
2019-01-14 00:00:00+00:00    0.994400
2019-01-15 00:00:00+00:00    1.011343
2019-01-16 00:00:00+00:00    1.002188
Freq: C, Name: Equity(8554 [SPY]), dtype: float64

In [23]:
R_G = stats.gmean(ratios)-1

In [24]:
print 'Geometric mean of returns: ', R_G

Geometric mean of returns:  0.0010781304075


The geometric mean is defined so that if the rate of return over the whole time period were constant and equal to $R_G$, the final price of the security would be the same as in the case of returns $R_1, \ldots, R_T$.

In [26]:
T = len(returns)
init_price = pricing[0]
final_price = pricing[T]
print 'Initial price:', init_price
print 'Final price: ', final_price
print 'Final price as computed with R_G:', init_price*(1+R_G)**T

Initial price: 245.61
Final price:  321.89
Final price as computed with R_G: 321.89
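The same property can be checked without `get_pricing`. The sketch below uses a short hypothetical price series (the numbers are made up for illustration) and also shows that compounding at the arithmetic mean return does not, in general, recover the final price.

```python
import numpy as np

# Hypothetical price series (illustrative numbers only)
prices = np.array([100.0, 110.0, 99.0, 108.9, 120.0])

# Simple returns R_t = P_t / P_{t-1} - 1
rets = prices[1:] / prices[:-1] - 1
T = len(rets)

# Geometric mean return: T-th root of the product of (1 + R_t), minus 1
r_g = np.prod(1 + rets) ** (1.0 / T) - 1

# Compounding at the constant rate r_g recovers the final price
implied_final = prices[0] * (1 + r_g) ** T

# Compounding at the arithmetic mean return overshoots it here
naive_final = prices[0] * (1 + rets.mean()) ** T
```

Here `implied_final` equals the last price, 120.0, up to floating-point error, while `naive_final` is larger because the arithmetic mean of the ratios exceeds their geometric mean whenever the returns are not all equal.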


Wow, that's cool: the final price computed with `R_G` matches the actual final price.

## Harmonic Mean

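The harmonic mean is less commonly used than the other types of means. It is defined as

$$H = \frac{N}{\sum_{i=1}^N \frac{1}{X_i}}$$

As with the geometric mean, we can rewrite the harmonic mean to look like an arithmetic mean. The reciprocal of the harmonic mean is the arithmetic mean of the reciprocals of the observations:

$$\frac{1}{H} = \frac{\sum_{i=1}^N \frac{1}{X_i}}{N}$$

The harmonic mean of positive numbers is always at most the geometric mean (which is at most the arithmetic mean), with equality only when all of the observations are the same.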

#### The harmonic mean can be used when the data can be naturally phrased in terms of ratios.

For instance, in the dollar-cost averaging strategy, a fixed amount is spent on shares of a stock at regular intervals.

The higher the price of the stock, the fewer shares an investor following the strategy buys.

The average (arithmetic mean) amount they pay per share is the harmonic mean of the prices.
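To make the dollar-cost averaging claim concrete, here is a small sketch with hypothetical prices and a hypothetical fixed budget per purchase (both are made-up values). It compares the average cost per share actually paid with the harmonic mean of the prices.

```python
import numpy as np

# Hypothetical share prices at each purchase date, and a fixed budget per purchase
prices = np.array([10.0, 20.0, 40.0])
budget = 100.0

# Fewer shares are bought when the price is higher
shares_bought = budget / prices
total_spent = budget * len(prices)

# Average cost per share over the whole strategy
avg_cost_per_share = total_spent / shares_bought.sum()

# Harmonic mean of the prices: N / sum(1 / p)
harmonic = len(prices) / np.sum(1.0 / prices)

# Arithmetic mean of the prices, for comparison
arithmetic = prices.mean()
```

The two quantities `avg_cost_per_share` and `harmonic` coincide (about 17.14 here), and both sit below the arithmetic mean of the prices (about 23.33), because more shares are bought at the cheaper prices.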

In [27]:
stats.hmean(x1),stats.hmean(x2)

(2.5590251332825593, 2.8697236562405108)
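As a quick sanity check of the ordering between the three means, the sketch below recomputes them for `x1` with plain NumPy (the variable names are illustrative), and also looks at a list whose entries are all equal.

```python
import numpy as np

x1 = np.array([1, 2, 2, 3, 4, 5, 5, 7], dtype=float)

arithmetic = x1.mean()
geometric = np.exp(np.log(x1).mean())   # exp of the mean of the logs
harmonic = 1.0 / np.mean(1.0 / x1)      # reciprocal of the mean of reciprocals

# When every observation is the same, all three means coincide
same = np.full(5, 3.0)
same_means = (same.mean(), np.exp(np.log(same).mean()), 1.0 / np.mean(1.0 / same))
```

For `x1` the values come out in the order harmonic < geometric < arithmetic (roughly 2.56, 3.09 and 3.625), matching the `stats.hmean` and `stats.gmean` outputs above.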